.. _awk2:

*********************************
AWK 2
*********************************

This class covers additional aspects of AWK. The examples presented here build on the material covered in the first AWK lab and cover more complex pattern matching and text processing. We will then use what we have learned to reformat a text file containing geophysical data with AWK.

=============================================
Changing the Field and Record Separators
=============================================

So far, we have only been using the default AWK behavior of separating records by the newline character (``\n``) and separating fields by any amount of whitespace. However, what if your data use different characters to delimit records and fields? For instance, what if you have a table of values separated by commas? Or what if you have a file with multiple lines that should be grouped together, with a blank line separating each group? AWK can process these files without any trouble if you tell it how to do so:

* **Field Separator:** The field separator is denoted by the ``FS`` variable. We can change the field separator by setting it in the ``BEGIN``
  block. For example, let's say we have a file that contains comma-separated values like this: ::

    0.1,20,1.4,7

  We can read these values into AWK and print them to the screen with the following code: ::

    % awk 'BEGIN { FS = "," } { print $1, $2, $3, $4 }' << END
    ? 0.1,20,1.4,7
    ? END

  Here we have done a few things. First, we set the field separator to be a comma (note that you need to put it in quotes when setting the 
  value). Then AWK reads in every record and divides it into fields using commas instead of spaces to separate the fields, printing out the 
  four fields that it finds.

  We have also done one new thing here: we have used standard input to enter our data, rather than reading from a file. This involves the
  ``<<`` operator (a "here document"), which we only touched on briefly when introducing the shell. The ``<<`` operator is followed by an end
  marker (here it is ``END``, but in practice this can be any set of characters). Once you type the line containing the AWK code, the shell
  lets you enter input, line by line, until you enter a line containing only the end marker. Here, we enter one line of data, followed by
  ``END``, and AWK processes that input as if it had come from a file. This technique is useful for passing short input to AWK, particularly
  in a shell script, where saving it as a separate file does not make much sense.

* **Record Separator:** The record separator is denoted by the ``RS`` variable. As with the field separator, this is most frequently changed in 
  the ``BEGIN`` portion of the program. For example, let's say we have a file of addresses, each separated by a blank line. We can have AWK
  read each address as a record, and then separate the record into name, address, and city/state/zip fields as follows: ::

    % awk 'BEGIN { FS = "\n"; RS = "" } { print $1, $2, $3 }' << END
    ? Eric Daub
    ? 3890 Central Ave
    ? Memphis, TN 38152
    ?
    ? Barack Obama
    ? 1600 Pennsylvania Ave
    ? Washington, DC 20500
    ? END

  This will print each address on a single line. However, this code assumes each address is exactly three lines, which is not necessarily true. 
  You can make this code more robust as follows: ::

    % awk 'BEGIN { FS = "\n"; RS = ""; ORS = "" } \
    ? { for (i = 1; i <= NF; i++) \
    ? { print  $i; print " " } \
    ? print "\n" }' << END
    ? Eric Daub
    ? CERI
    ? 3890 Central Ave
    ? Memphis, TN 38152
    ?
    ? Barack Obama
    ? 1600 Pennsylvania Ave
    ? Washington, DC 20500
    ? END

  Here we used a ``for`` loop to iterate over all fields in the current record. To prevent AWK from printing each output field on a new line, I
  changed the output record separator variable ``ORS`` (the string that AWK appends after every ``print`` statement) to be the empty string, and
  then explicitly printed spaces and newlines to format the output.

You do not necessarily need to specify the field and record separators in the ``BEGIN`` section of the program; in fact, in many cases you may need to change the separators as you proceed through the program depending on whether or not certain conditions are met. Also, the field separator does not have to be a single character; it can be a multi-character string or even a regular expression (see below). The record separator is more restricted: standard AWK only uses a single character for ``RS`` (the empty string used above is a special case that separates records at blank lines), although some implementations, such as gawk, also accept a regular expression.
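
As a minimal sketch of a regular-expression field separator, the script below splits each line on a comma or a semicolon followed by any number of spaces. It is written for a Bourne-style shell (``sh`` or ``bash``) saved as a script, rather than for the interactive prompt used above, and the function name ``split_mixed`` and the sample data are made up for illustration: ::

    # Hypothetical helper: split each input line on "," or ";" followed by
    # optional spaces, then print the first four fields.
    split_mixed() {
        awk 'BEGIN { FS = "[,;] *" } { print $1, $2, $3, $4 }'
    }

    # Made-up sample line with mixed separators; prints: 0.1 20 1.4 7
    printf '0.1, 20;1.4,7\n' | split_mixed

Because the value assigned to ``FS`` is longer than one character, AWK treats it as a regular expression instead of a literal string.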

=======================
Regular Expressions
=======================

In the Unix terminal, we often can use wildcards and other special characters to reduce the amount of typing and match particular files. We can also perform similar tricks in AWK, using what are known as **regular expressions**.

Regular expressions are used as search patterns in the conditional portions of an AWK program. Regular expressions are enclosed in forward slashes, and are case-sensitive. For example, the following program will print all lines in a file that contain the characters "the": ::

    % awk '/the/ { print $0 }' file.txt

If you simply place a pattern within forward slashes as the condition, AWK will execute the commands in braces for any record that contains that pattern. If you want to match a pattern only within a certain field, use the tilde operator ``~``: ::

    % awk '$1 ~ /the/ { print $0 }' file.txt

This will print all lines where the first field contains the characters "the," including "there" and "they," but not "The." To find all lines whose first field does not contain "the," use ``!~`` (i.e., does not match) in place of ``~``.
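
To see the difference between the two operators, here is a small sketch for a Bourne-style shell (``sh`` or ``bash``). The function names and the sample text are made up for illustration: ::

    # Hypothetical helpers: keep lines whose first field does / does not
    # contain "the".
    first_has_the() {
        awk '$1 ~ /the/ { print $0 }'
    }
    first_lacks_the() {
        awk '$1 !~ /the/ { print $0 }'
    }

    # Made-up sample text, one record per line.
    sample() {
        printf '%s\n' 'there is a match here' \
                      'The capital letter prevents a match' \
                      'over the hill'
    }

    sample | first_has_the     # prints only the first line
    sample | first_lacks_the   # prints the second and third lines

Note that "over the hill" is rejected by the first program even though the line contains "the," because the test only looks at the first field.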

There are sophisticated rules for constructing search patterns. Here is the syntax for some of the more common patterns:

* To require that the pattern appears at the beginning of the string, use ``^`` at the beginning of the pattern. For example, ``/^the/`` will
  only match lines that begin with "the." Lines containing "the" in the middle of the string will not be matched.

* To require that the pattern appears at the end of the string, use ``$`` at the end. For example, ``/the$/`` will only match lines whose final 
  three characters are "the."

* To match one of a number of characters, use square brackets. ``/[Tt]he/`` will match either "The" or "the" at any location in a string.

* To match any character *except* certain ones, place ``^`` immediately after the opening square bracket. Note that the caret has a different
  meaning inside square brackets than it does outside them. For example, ``/t[^h]e/`` will match "t" and "e" separated by any single character
  except "h": it will not match "the," but it will match "tee," "tre," "t1e," "tHe," and many more.

* Inside square brackets, you can match a range of characters with a dash. ``/[a-z]he/`` will match any lower case alphabetical character 
  followed by "he." It will not match any upper case or numeric characters followed by "he." You can specify numbers and upper case using 
  ``[0-9]`` and ``[A-Z]``, or any alphanumeric character with ``[0-9a-zA-Z]``.
  
* The vertical bar designates a logical or. ``/(^The)|(^Start)/`` matches any line that starts with "The" or "Start." Note how we needed to use 
  parentheses to group the expressions -- parentheses are used in general within regular expressions to group characters.

* A dot (period) represents any single arbitrary character. ``/.he/`` will match any character followed by "he."

* An asterisk represents an arbitrary number of occurrences (zero or more) of the previous character. ``/th*e/`` will match "the," "thhe," 
  "thhhhhhhhhhe," and "te." Note that this is different from the star wildcard in the terminal.
  
* A plus represents one or more occurrences of the previous character. ``/th+e/`` will match "the," "thhe," and "thhhhhhhhhhe," but not "te."

* A question mark represents zero or one occurrence of the previous character. ``/th?e/`` will match "the" and "te," but not "thhe."

* A number in curly braces matches the previous character a specified number of times. You can also specify a range using two numbers separated
  by a comma, or a minimum with a single number and a comma. ``/th{2}e/`` will only match "thhe," while ``/th{2,}e/`` will match "thhe" and any
  other combination with two or more "h" characters. ``/th{2,4}e/`` will match "thhe," "thhhe," and "thhhhe." Some older AWK implementations do
  not support this brace syntax.

* If you need to match any of the special characters outlined here, put a backslash before the character.
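
Several of the patterns above can be tried side by side with a short script. This is only a sketch for a Bourne-style shell (``sh`` or ``bash``); the helper names ``matches`` and ``words`` and the word list are made up for illustration. The pattern is passed to AWK as a string variable, and ``^`` and ``$`` are added so that the whole word must match: ::

    # Hypothetical helper: print each input line that matches the regular
    # expression passed as the first argument.
    matches() {
        awk -v re="$1" '$0 ~ re { print $0 }'
    }

    # Made-up test words, one per line.
    words() {
        printf '%s\n' te the thhe tHe The tee
    }

    words | matches '^th*e$'      # prints te, the, thhe
    words | matches '^th+e$'      # prints the, thhe
    words | matches '^th?e$'      # prints te, the
    words | matches '^t[^h]e$'    # prints tHe, tee
    words | matches '^[Tt]he$'    # prints the, The

Changing one symbol at a time and rerunning the script is a quick way to check your understanding of each rule.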

Regular expressions are a bit complicated, but are very useful for dealing with complex data files. Here are some exercises to practice with. You can either make up text files to test your programs with, or enter your own test text through standard input as in the examples above.

* Write a regular expression that matches postal codes. A U.S. ZIP code consists of 5 digits, optionally followed by a hyphen and 4 more digits.

* Write a regular expression that matches any number, including an optional decimal point followed by more digits.

* Write a regular expression that finds e-mail addresses. Look for valid characters (letters, numbers, dots, and underscores) followed by an 
  "@", then more valid characters with at least one dot.

=====================
Formatting Output
=====================

We have been using the ``print`` statement to print output. For more control over output formatting, we can use the ``printf`` statement. The syntax is ``printf <format>, <item1>, <item2>, ...`` where ``<format>`` specifies how to format the output, and the remaining arguments are the items to be printed. ``printf`` uses a syntax for formatting that is very similar to MATLAB's ``fprintf``, so you may recognize some of this (both are modeled on the ``printf`` function from C, which is also available as a command in the shell). An example use of ``printf`` is as follows: ::

    % awk '{ printf "%i\n", $1 }' << END
    ? 1
    ? 2
    ? 3
    ? END

This will print field 1 formatted as an integer, followed by a newline character. You can also use ``%f`` for a floating point number, ``%e`` for a number in exponential notation, and ``%s`` for a string. ``%d`` can be used interchangeably with ``%i`` as both produce integer results. You can also specify additional information on the number of digits to print: ::

    % awk '{ printf "%4.3f\n", $1 }' << END
    ? 1
    ? 2
    ? 3
    ? END

Here, the "4" tells AWK to print at least 4 characters in total, including the decimal point (it will pad with spaces on the left if the result is shorter than 4 characters), and the "3" tells it to include 3 digits after the decimal point. The meaning of the second formatting number changes depending on the format specifier: for integers, it specifies the minimum number of digits to print, while for strings it specifies the maximum number of characters to print. There are many additional details regarding the use of ``printf`` to format output, which you can find in the user manual. Spend some time formatting output differently. I find that the majority of my work with AWK only needs simple ``print`` statements, and ``printf`` is only needed occasionally (though it is absolutely necessary in those cases).
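
The width and precision rules can be compared in one line of output. The following is a sketch for a Bourne-style shell (``sh`` or ``bash``); the function name ``show_formats`` and the input values are made up, and the square brackets are only there to make the padding visible: ::

    # Hypothetical helper: format field 1 as fixed-point and exponential,
    # field 2 as an integer, and field 3 as a truncated string.
    show_formats() {
        awk '{ printf "[%8.3f] [%e] [%5.3d] [%.3s]\n", $1, $1, $2, $3 }'
    }

    # Prints: [   3.142] [3.141590e+00] [  007] [sei]
    printf '3.14159 7 seismology\n' | show_formats

Here ``%8.3f`` pads the five-character result "3.142" to a width of 8, ``%5.3d`` prints at least 3 digits in a field of width 5, and ``%.3s`` keeps only the first 3 characters of the string.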

=============
Exercise
=============

Here is an example of how you would use AWK to process a text file in your research. I have provided a Google Maps KML file that contains fault trace information for the Imperial Fault at the California/Mexico border, taken from the publicly available USGS fault data. KML files contain information on how to draw markers, line segments, and other things on a map. However, you may need to take that information and put it into a different format to plot in MATLAB or GMT (one problem on HW5 involves doing this in GMT for a different fault, so it is probably a good idea to save your work as a script for later use).

KML files use XML tags to describe how to plot information on a map. There are many different tags to designate different types of information. Here we would like to extract all of the line segments in the file, each of which represents coordinates used to draw fault traces on a map. Each path to be drawn on the map has the following format: ::

    <LineString  id="g39911"><altitudeMode>clampedToGround</altitudeMode><coordinates>-120.200876555996,40.2943269812509,0
    -120.200833555813,40.2943569808352,0
    </coordinates></LineString>

This represents a line segment, going from the first coordinate listed to the second coordinate listed. The coordinates are separated by newline characters, and each set of coordinates is in the form (longitude, latitude, height). While this example has only two coordinates, other lines may have many more segments. There are also additional markers in the file that we would not like to extract; these markers contain coordinates but do not use the ``<LineString>`` tag.

Use AWK to extract longitude and latitude coordinates that are used to draw the line segments representing each fault section in the provided KML file. You should print a line containing a ``>`` character between each set of segments that make up a LineString (this output format is how GMT reads :math:`{x}`-:math:`{y}` data segments, so we choose this format because we will eventually be plotting this data in GMT). Separate each longitude and latitude pair with a space.

*Hint:* It is easiest to do this in stages. First, write AWK code that uses a regular expression to identify any line that contains a set of coordinates. Set the field separator to be a comma, and print out the longitude and latitude that you extract from each set of coordinates. You should try to make this work for the general case, as you will be using this again in your homework. It is usually easiest to build up the regular expression gradually by adding one piece at a time and printing out all lines that match to be sure that it is working correctly.

However, not all of these coordinates are ones that you would like to keep. The ones that you want to keep satisfy one of the following conditions: the coordinates are the first thing on the line, or the tag ``<LineString`` (followed by some additional information) occurs somewhere else on the line. Modify your regular expression (you may need to add a second matching pattern) so that only these two kinds of lines are kept in the output.

After doing that, there will still be two problems. First, we have not put the ``>`` separator between segments. Second, there is still text that precedes the coordinates for the first entry in each segment. The ``>`` separator should be easy for you to fix. Removing the extra text requires additional string functions.

To remove the text, I found two string functions to be useful: the ``match(<string>, <regexp>)`` function, which returns the position in the string where the leftmost match of the given regular expression begins (or 0 if there is no match), and the ``substr(<string>, <index>)`` function (covered in the previous lab), which returns the substring of ``<string>`` beginning at the position given by ``<index>``. Use these to correct the output. When all is done correctly, the tail end of your output on the terminal should look something like this: ::

    -115.543829900054 32.9199728287905
    -115.543870909139 32.9197048434163
    -115.54375391094 32.9194648530375
    -115.543577910021 32.9192018633972
    -115.54355891634 32.9189568754859
    -115.543631926001 32.9187188887966
    -115.543614930613 32.9185388975864
    -115.543487931054 32.918292906931
    -115.543549942905 32.9179639250401
    -115.5438819708 32.917499955812
    -115.544415016053 32.916745004646
    -115.544718052504 32.9159310516597
    >
    -115.543157660434 32.9275224330804
    -115.543366676364 32.9273054482584
    -115.543620695267 32.9270404666198
    -115.543728702943 32.9269414746468
    >
    -115.537694060571 32.9399966852325
    -115.537746051277 32.940417665069
    -115.537799046526 32.9406706533491
    -115.537915039282 32.9411296322393
    
This is a tricky problem, so do not worry if you do not get it right on the first try. Be methodical and write your script incrementally.
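
If you want to get a feel for ``match`` and ``substr`` before applying them to the KML file, here is a small sketch on unrelated, made-up input. It is written for a Bourne-style shell (``sh`` or ``bash``), the function name ``from_first_digit`` is hypothetical, and it is not a solution to the exercise: ::

    # Hypothetical helper: drop everything before the first digit on each
    # line, and skip lines that contain no digit at all.
    from_first_digit() {
        awk '{ start = match($0, /[0-9]/)
               if (start > 0) print substr($0, start) }'
    }

    # Prints: 42 north
    printf 'station=42 north\nno digits on this line\n' | from_first_digit

The value returned by ``match`` is used directly as the starting index for ``substr``, and a return value of 0 signals that the line did not match.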