.. _awk1:

*********************************
AWK 1
*********************************

In Homework 1, we wrote a Python function to look at earthquake catalog data that was read from a file. If we want to reuse that function on multiple earthquake catalogs that we might download from the internet, we will need to have all of the catalogs in that same format. Unfortunately, many public data sources provide their output in different formats. They all contain the same basic information, but this information is usually organized differently each time, plus there may be extra information that we do not need in our Python function. Therefore, we need a simple way to reformat text files in order to make our code useful in a more general way. This can be done in Python, but in fact the easiest tool to accomplish this task is another programming language known as AWK.

We will spend two labs learning how to use AWK to process text files. The first lab will focus on basic use for reformatting simple text files (the task described above). Lab 17 will focus on more complicated text processing and pattern matching, which allows you to do more complex tasks to reformat data that is not in simple columns. I find AWK to be very useful when using the Generic Mapping Tools (which we will cover at the end of the semester), so having a basic understanding of how to use AWK will be needed for those classes.

================
What is AWK?
================

AWK derives its name from the initials of its original designers (Aho, Weinberger, and Kernighan). They initially developed AWK in 1977 at AT&T's Bell Labs (the same place where Unix originated). Since then, several further versions of AWK have been developed -- New AWK (``nawk``) and GNU AWK (``gawk``) are the other versions of AWK that you are likely to come across. On many operating systems, ``awk`` and ``nawk`` are one and the same, and ``gawk`` is an implementation of ``nawk``, so for the most part any AWK program will run equally well regardless of the specific implementation. On the Mac, I believe that the ``awk`` command is an implementation of New AWK, but there is no ``nawk`` command. ``gawk`` is also available on the Mac Lab computers; its path is ``/sw/bin/gawk`` if you wish to put it in an AWK script. I have written all of the following examples to use the ``awk`` command, but they should also work using ``gawk``.

AWK is a fairly simple yet powerful command-line tool for pattern matching and processing text files. There are several other programs that you can use to do similar things when processing text, including SED, Perl, and Python. SED (Stream Editor) is more basic than AWK: it does process text, but it does not have AWK's capability to save variables or do math. Perl and Python are full-fledged programming languages with many more bells and whistles, as we have seen when using Python in this class. Because of this, Perl and Python require a bit more overhead in writing a program. I feel that AWK gives you the most "bang for your buck" for text processing tasks, in that you can learn enough in a couple of hours to be quite competent, and that it is relatively easy to write simple programs. Most of the AWK work that I do involves short "one-liners" to reformat files, and I find that writing such programs is easier in AWK than in Perl or Python.

=========================
Running an AWK Program
=========================

AWK programs can be run in two ways. First, you can run AWK directly from the command line. Here is a simple example that prints every line in the input file: ::

	% awk '{ print $0 }' inputfile.txt

We will see shortly what the specific program commands mean, but there are three parts to this: the ``awk`` command, the AWK program (the part in single quotes, in this case ``'{ print $0 }'``), and finally the text file that you wish to process. Try entering this command for a text file of your own, or use the provided "newmadrid2015raw.txt" as the input file. (This is the original version of the file that I reformatted with AWK to create "newmadrid2015.txt," which we used in the MATLAB classes.) The output should just be the contents of the text file.

When calling AWK from the command line, we put the entire AWK program in single quotes so that the shell will interpret the entire program as a single string (the shell has a complex set of quoting rules, as we will see in the upcoming classes on shell scripts). Otherwise, the shell would split your program at every space and hand the pieces to AWK as separate arguments, and you would get an error.

Here, we give AWK a single input file, but you can specify as many input files as you like. AWK will process one file after the next without stopping when you supply multiple input files. Additionally, AWK can take standard input from another Unix command through a pipe. By default, AWK sends its output to standard output. If you want to save the output of AWK as a text file, use output redirection (``>`` and ``>>``).
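
As a rough illustration of these options, the following sketch builds two small throwaway files (``part1.txt`` and ``part2.txt`` are made-up names, not course files), processes both with a single AWK command, saves the result with ``>``, and then pipes the saved text back into AWK: ::

    % echo "one two" > part1.txt
    % echo "three four" > part2.txt
    % awk '{ print $0 }' part1.txt part2.txt > combined.txt
    % cat combined.txt | awk '{ print $0 }'

The last command should show the line from ``part1.txt`` followed by the line from ``part2.txt``.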

Alternatively, you can write an AWK program in a stand alone file (much like you would write a Python program as a stand alone file). To do this, create a text file using your favorite editor called ``awkscript.awk`` that contains the following lines: ::

	#!/usr/bin/awk -f
	
	# awk program prints out entire file
	
	{ print $0 }

The first line specifies the command that will be used to interpret the commands in the script, and the line beginning with ``#`` is a comment. The one line that is executed is the same command that we entered above. Here the ``-f`` option tells AWK that the commands to be executed are contained in a file. This option needs to be in place for AWK to correctly interpret the script.

Change the permissions for ``awkscript.awk`` so that you can execute the file (type ``chmod 755 awkscript.awk`` into the terminal). Then enter the following to run your script on the input file "inputfile.txt" as before: ::

	% ./awkscript.awk inputfile.txt

All of this is not strictly necessary; you could just put ``{ print $0 }`` in the file and then execute the script from the shell using ::

	% awk -f awkscript.awk inputfile.txt

though it is usually considered good programming practice to specify the command at the beginning of the script so that anyone reading your script (including you) is aware that it is an AWK script.

=========================
AWK Program Structure
=========================

AWK programs have the following basic structure: ::

	BEGIN { <begin commands> } <condition1> { <commands1> } <condition2> { <commands2> } ...
		... END { <end commands> }

* First, AWK will execute all the commands in ``<begin commands>``. This is often where you will define
  variables that you will use later in your program. The ``BEGIN`` block is optional.

* Then, after executing the begin commands, AWK reads through the input text files one "record" at a time. 
  By default, each record is a line in the file (it splits the file up using newline characters ``\n``),
  but you can change the behavior of AWK so that it divides a file into records using whatever character 
  you choose. We will see in the next lab how to change this behavior.
  
* If the current record matches the specified condition ``<condition1>``, then the program executes 
  ``<commands1>``. AWK then proceeds to do the same for all of the pairs of conditions and commands that 
  follow for the current record. Any conditions that are true will result in the corresponding commands 
  being executed. The conditions are optional -- if you do not provide a condition, then the corresponding 
  commands will be executed for every record in the file. Technically, you do not need these types of 
  lines, but nearly all AWK programs have at least one condition/execution block.

* Once AWK has executed all conditions and commands for the first record in the text file, it proceeds to 
  the second record, starting at the first condition and testing for all conditions and executing the 
  appropriate commands for all conditions that are true. It continues in this manner for all records in 
  all input files.

* Finally, after going through all records in all input files, AWK executes all of the commands in 
  ``<end commands>``. This is where you often do things relating to the total number of lines in the file. 
  As with ``BEGIN`` commands, this part is optional.

This may seem a bit abstract, but we will give a number of examples shortly to make this structure a bit more concrete.
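
As a small preview, here is a sketch that uses all three parts at once. The two input lines are made up for illustration and are fed to AWK through a pipe (``printf`` is just a convenient way to produce them). The condition ``$1 == "NM"`` tests the first word of each line; fields such as ``$1`` are explained in the next section. ::

    % printf 'NM first\nXX second\n' | awk 'BEGIN { print "start" } $1 == "NM" { print $0 } END { print "end" }'

AWK should print ``start`` before reading any input, then the one record whose first word is ``NM``, and finally ``end`` once the input is exhausted.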

==========
Fields
==========

When AWK reads a record from an input file, it automatically parses the record into a number of "fields" separated by whitespace. For example, if a line of input text is ::

	Eric Daub egdaub CERI

Then the first field would be "Eric," the second field would be "Daub," the third would be "egdaub," and the fourth field would be "CERI." You can refer to these fields using the variables ``$1`` for the first field, ``$2`` for the second field, etc. The entire line is represented by ``$0``. By default, AWK treats any run of consecutive spaces or tabs as a single separator between fields, but this can be changed (we will see how this is done later).
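
To see this for yourself, you can pipe that sample line into AWK with ``echo`` (a quick sketch that needs no input file): ::

    % echo "Eric Daub egdaub CERI" | awk '{ print $2, $1 }'
    % echo "Eric   Daub   egdaub   CERI" | awk '{ print $4 }'

The first command should print ``Daub Eric``. The second should still print ``CERI``, even though the words are now separated by several spaces.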

One thing to be careful about is the fact that the ``$1`` syntax has a different meaning in shell scripts (where it signifies command line inputs -- this is similar to how SAC macros accept inputs). When you invoke AWK in a shell script, you must be careful about how you use quotes. This is why we generally use single quotes when using AWK in the command line, because it enforces literal interpretation of ``$1``, rather than substituting a command line input in its place. We will cover the details underlying this next week when discussing shell scripts.

===========
Examples
===========

Here are a number of sample AWK programs to give you an idea of how AWK works. These examples use the input file "newmadrid2015raw.txt". Try them out to see how they work, and make changes to confirm your understanding. You cannot try too much code here!

First, here is the AWK program we showed earlier that prints every line in the file: ::

	% awk '{ print $0 }' newmadrid2015raw.txt

Now that we know a bit about AWK syntax, we can explain why the program does what it does. This program only contains a single condition and command, so it omits the optional ``BEGIN`` and ``END`` commands. Further, the program omits the optional condition, so that the sole provided command in the braces is executed for every line in the file. This command is a print statement that prints the entire record, so the program simply prints the file to standard output.

You can put multiple commands within the same set of braces if you separate them with a semicolon. The following prints the first two fields of each line in the file (in this case, the network code and the event date) on separate lines: ::

	% awk '{ print $1; print $2 }' newmadrid2015raw.txt

Alternatively, you could have put each print statement in a separate set of braces. Since neither set of commands has a condition associated with it, both are executed for all records: ::

	% awk '{ print $1 } { print $2 }' newmadrid2015raw.txt
	
If you want to print more than one field on the same line, simply put all fields in the same print statement: ::

	% awk '{ print $1, $2 }' newmadrid2015raw.txt

Here is an example that contains a condition; this program prints the entire record of each earthquake that was detected from the New Madrid Network. Note that you do not need to include any sort of ``if`` command in an AWK condition; this is implied by the program syntax. ::

	% awk '$1 == "NM" { print $0 }' newmadrid2015raw.txt

If you provide only a condition and no statements to be executed in curly braces, by default AWK prints out the entire record. Therefore, the following AWK program is equivalent to the previous one: ::

	% awk '$1 == "NM"' newmadrid2015raw.txt

AWK stores all fields as strings, but can perform numerical comparisons using them. If AWK can interpret the string as a number, then it will make a numerical comparison; if the string contains non-numeric characters, then a string comparison is made based on alphabetical order, with numerical characters coming before alphabetical characters. Thus, to print the full record of each event with a magnitude greater than or equal to 2.5, we can use ::

	% awk '$7 >= 2.5 { print $0 }' newmadrid2015raw.txt
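
The following sketch illustrates the comparison rule on three made-up input lines rather than on the catalog: ::

    % printf '10\n9\nabc\n' | awk '$1 > 9'

The line ``10`` is compared as a number and passes, while ``9`` does not. The line ``abc`` cannot be read as a number, so it is compared as a string and should also be printed, because letters sort after digits.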

You can also combine conditionals using ``&&`` (and) and ``||`` (or). The following prints the full record of all events with a latitude between :math:`{35^\circ}` and :math:`{35.5^\circ}`: ::

	% awk '$4 >= 35. && $4 <= 35.5 { print $0 }' newmadrid2015raw.txt

As you can see, this allows you to very quickly reformat a file with a short piece of code if all you are doing is selecting rows or columns from a text file. To make a file similar to the one that we used previously in MATLAB (doing all of the reformatting except for the date/time), we simply need to enter the following into the shell: ::

	% awk '{print $2"T"$3, $4, $5, $7}' newmadrid2015raw.txt > newmadrid2015.txt

Concatenating strings is very easy in AWK; the code above combines the date (field 2) and time (field 3) with a "T" when printing to file. More complex string operations are available, and are described below.
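
To try concatenation without touching the catalog, here is a minimal sketch on a single made-up line that contains only a date and a time (so they are fields 1 and 2 here): ::

    % echo "2015/01/01 00:10:00" | awk '{ print $1 "T" $2 }'

Placing strings next to each other, with or without spaces between them in the program, joins them, so this should print ``2015/01/01T00:10:00``.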

===========================
Formatting AWK Scripts
===========================

For writing short AWK scripts on the command line, you can usually just write the entire command out in the terminal on a single line. However, for longer scripts, both on the command line and within a file, programs formatted in such a way can be hard to read and understand. Therefore, it is a good idea to format your AWK scripts by putting statements on different lines in a logical way. This includes putting each ``<condition> { <statements> }`` block on its own line, and breaking up conditional and loop statements into separate lines to make them easier to read, much like you would do so when writing a MATLAB script.

Within an AWK script saved as a file, putting these statements on separate lines is simple. However, when writing an AWK program directly in the C shell, pressing return to create a new line will cause the program to start running. In the C shell, you can break up commands into separate lines by putting a backslash at the end of the line. This will cause the shell to wait until you have finished entering all of your code before it begins executing the command. For example: ::

	% awk '$1 == "NM" { print $4 } \
	? $2 == "2015/01/01" { print $7 }' newmadrid2015raw.txt

Note that here each condition/statement block is on its own line. This is important because if you try to break up one of these units, AWK will not understand that its pieces are linked. If you prefer to spread your AWK program over several lines and not have to worry about maintaining the integrity of grouped statements, you need to put a double backslash before the new line: ::

	% awk '$1 == "NM" \\
	? { print $4 } \
	? $2 == "2015/01/01" \\
	? { print $7 }' newmadrid2015raw.txt

Because the double backslash is robust and works in all cases, it is what I use whenever I need to break up lines in the C shell.

(Note: the above statements about line breaking and line break characters apply only to the C shell, the default shell in the Mac Lab. If you are using ``bash``, then you do not need to place backslashes before carriage returns in command line AWK programs, and only need a single backslash to break up statements like print and conditionals.)
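
For comparison, here is roughly how the same two-rule program might look as a stand-alone script, where no backslashes are needed. The file name ``nmevents.awk`` is only an example, and the comments assume that field 4 is the latitude and field 7 is the magnitude, as in the earlier examples: ::

    #!/usr/bin/awk -f

    # latitude of every event recorded by the New Madrid network
    $1 == "NM" { print $4 }

    # magnitude of every event on 1 January 2015
    $2 == "2015/01/01" { print $7 }

After ``chmod 755 nmevents.awk``, this can be run as ``./nmevents.awk newmadrid2015raw.txt``, or without changing permissions as ``awk -f nmevents.awk newmadrid2015raw.txt``.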

===================
Variables in AWK
===================

You can define your own variables in AWK. As mentioned above, the only real trick with AWK variables is that they are always stored as strings, though you can do math with them. There is no special syntax needed to assign values to variables, do math with variables, or access values of variables. Here are a number of examples of AWK programs that use variables.

The following calculates the mean magnitude of all events in the catalog and then prints it at the end. Note that we first initialize the variable in the BEGIN statement, add up the values found from each line, and then print the final value: ::

    % awk 'BEGIN {total = 0} { total = total + $7 } \\
    ? END { print "Average magnitude:", total/NR }' \
    ? newmadrid2015raw.txt

You do not need to put spaces between variables and operators. There is also a simplified syntax for a number of math operations: ``a++`` increments ``a`` by one, ``a--`` decrements ``a`` by one, ``a += 2`` sets ``a`` equal to its previous value plus 2, ``a -= 2`` sets ``a`` to its previous value minus 2.
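
A quick way to check these shortcuts is a program that consists only of a ``BEGIN`` block, which runs without reading any input file. This is just a sketch; the variable name ``a`` and the starting value are arbitrary: ::

    % awk 'BEGIN { a = 5; a++; a += 2; a -= 3; a--; print a }'

Stepping through by hand, ``a`` goes from 5 to 6, 8, 5, and finally 4, so the command should print ``4``.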

There are also a number of built-in variables in AWK. One of the most common is ``NR`` which is the number of records that AWK has processed since being initialized. This variable can also be used to get the total number of lines if accessed in the ``END`` commands, which we used above to calculate the average magnitude.

If you are dealing with more than one file, ``FNR`` will give you the number of records that have been read from the current file. Another useful variable is ``NF``, the number of fields in the current record. This way, you can print out the last entry of each record, even if the number of fields varies. ::

	% awk '{ print $NF }' newmadrid2015raw.txt
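
To watch these built-in variables change from record to record, here is a small sketch using two made-up lines with different numbers of fields: ::

    % printf 'a b c\nd e\n' | awk '{ print NR, NF, $NF }'

This should print ``1 3 c`` for the first record and ``2 2 e`` for the second: the record number, the number of fields, and the last field.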

=================
Arrays in AWK
=================

You can also define arrays in AWK. One crucial difference between AWK and all of the other languages that we have considered thus far is that an array in AWK is what is known as an associative array. Associative arrays differ from regular arrays in that the indices need not be a consecutive set of integers. In MATLAB, if you type ``a(10) = 0``, then MATLAB automatically creates an array with size 10 and initializes all of the additional entries to zero. After doing this, you can access ``a(1)`` without a problem. However, in AWK if you create an array using the syntax ``a[10] = 0``, then the array only contains a single entry, and NOT ten entries. The indices need not be integers; you can just as easily write ``a["ten"] = 0`` and AWK will add a single element that can be retrieved with that string.
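
One way to convince yourself of this is AWK's ``in`` operator, which tests whether an index exists in an array (it has not been introduced above, so treat this as a sketch; the variable names ``has10`` and ``has1`` are made up): ::

    % awk 'BEGIN { a[10] = 0; a["ten"] = 1; has10 = (10 in a); has1 = (1 in a); print has10, has1 }'

This should print ``1 0``: index 10 exists, but index 1 was never created.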

(For those interested in the technical details, this type of array is implemented using what is known as a "hash table," and its primary advantage over a standard array is that its elements can be accessed in a time that is typically independent of the size of the array. This is not necessarily the case for standard arrays; you can only achieve a constant look-up time for an array if all elements of the array have the same size and the size of the array is constant. However, since AWK stores everything as a string, array entries cannot be guaranteed to have the same length, so we cannot be assured that this will work in this case. The fast look-up times for a hash table are achieved by trading space for time and distributing the array storage over a large number of "buckets" that each contain a small number of elements, ideally either one or zero, so that looking up an array element involves simply finding the necessary bucket that hopefully only contains the desired item.)

Because array indices are not necessarily consecutive, you cannot iterate over an entire array simply by counting through the indices. Instead, to iterate over all indices of an array, we use the syntax ``for (<index> in <array>)``, followed by the statements to execute. ``<index>`` can be any variable name you choose. Here is an example that stores all of the earthquake longitudes in an array, and then iterates over the array to print each index and value: ::

	% awk '{ longs[NR] = $5 } END { for (l in longs) print l, longs[l] }' newmadrid2015raw.txt

``l`` will take all previously used index values that have been assigned to ``longs`` and then print out the value stored in the array for that index. One thing to note is that an associative array does not preserve the same order in which the indices were added -- you will iterate over all of the elements that have been added to your array using ``for (<index> in <array>)``, but not in the same order in which they were added. This is a quirk of how associative arrays work, related to the technical details of how they are implemented.
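
String indices make associative arrays handy for counting. The sketch below tallies how many times each value appears in the first field of three made-up lines; the names ``count`` and ``net`` are arbitrary: ::

    % printf 'NM\nXX\nNM\n' | awk '{ count[$1]++ } END { for (net in count) print net, count[net] }'

The output should contain the lines ``NM 2`` and ``XX 1``, though not necessarily in that order.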

=================================
Other Programming Structures
=================================

Within the various command blocks, you can execute conditionals, loops, and other flow control statements. The syntax for these is as follows:

* ``if`` statements have the syntax ``if (<condition>) <statement>`` with an optional ``else <statement>`` 
  if desired. Example: ::

	% awk '$2 == "2015/01/01" { if ($4 > 36) print $4, "north"; else print $4, "south" }' newmadrid2015raw.txt

  You can also use curly braces to indicate where the statements end for the ``if`` clause rather than a   
  semicolon.

  Because of the way AWK programs are executed where a conditional is implied in every group of commands, 
  ``if`` statements are much less common than in other programming languages. I find that they are mostly 
  useful if you need to use the ``else`` clause, or if you need to check some condition in the ``BEGIN`` 
  or ``END`` blocks in the program.

* ``while`` loops have the syntax ``while (<condition>) <statements>``. For example, to print only the 
  first three fields in each record, you can use a while loop: ::

	% awk '{ i = 1; while (i <= 3) { print $i; i++ } }' newmadrid2015raw.txt

  Note that you need to group all statements to be executed in the while loop in curly braces. Without 
  that, AWK assumes that only the ``print`` statement is in the while loop, and since ``i`` is never 
  incremented you have an infinite loop. Also note that you can use ``$i`` to refer to field ``i``; as 
  ``i`` varies through the loop, it prints the appropriate value each time through the loop.

* ``for`` loops have the syntax ``for (<initialization>; <condition>; <increment>) <statements>``. Example: ::

    % awk 'BEGIN { for ( i = 1; i < 4; i++) a[i] = 0 } \\
    ? $1 == "NM" { a[1]++ } $4 > 36 { a[2]++ } \\
    ? $6 > 6 { a[3]++ } END { print a[1], a[2], a[3] }' newmadrid2015raw.txt

  This program determines the number of events detected by the New Madrid network, the number of events 
  north of :math:`{36^\circ}` north latitude, and the number that are deeper than 6 km, and prints all 
  three values to the terminal.

* ``do-while`` loops are similar to ``while`` loops, with the key difference that the condition is tested  
  *after* each execution of the loop, guaranteeing that they execute at least once. This differs from 
  while loops, which test the condition *before* executing the statements and are not guaranteed to 
  execute. The syntax is ``do <statements> while (<condition>)``. In practice, ``do-while`` loops are 
  only occasionally needed, as a normal while loop usually suffices.

* ``break`` exits from the innermost ``for``, ``while``, or ``do-while`` loop and continues execution with
  the statements that follow for the current record.

* ``continue`` skips the remaining statements in the current pass through the innermost ``for``, 
  ``while``, or ``do-while`` loop and proceeds with the next pass through the loop.

* ``next`` stops processing the current record and proceeds to the next record in the file. Any subsequent 
  conditions are not tested for that particular record.

* ``nextfile`` stops processing the current file and proceeds to the next file. Any subsequent records in 
  the current file are not processed.

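* ``exit`` stops reading input and ends the program. If it is called outside of the ``END`` block, any 
  ``END`` commands are still executed before AWK quits.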
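
Two short sketches may help here; both use made-up input rather than the catalog. The first skips a header line with ``next`` and stops at the first record equal to ``c`` with ``exit``. The second shows that a ``do-while`` loop runs its body once even when the condition is false from the start: ::

    % printf 'header\na\nb\nc\nd\n' | awk 'NR == 1 { next } $1 == "c" { exit } { print $1 }'
    % awk 'BEGIN { i = 5; do { print i; i++ } while (i < 3) }'

The first command should print ``a`` and ``b`` only, and the second should print ``5`` exactly once.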

=====================
String Operations
=====================

When doing basic text formatting, we often encounter a situation where we need to break a field up further. Returning to our earthquake catalog example, dates are often presented as "month/day/year" or "year/month/day" in catalogs. However, you may want to separate those numbers out for another task. This is most easily done using various string operations in AWK. The ones that I use most commonly include:

* ``index(<input>, <target>)`` returns the position (counting from 1) of the first occurrence of the 
  target string in the input string, or 0 if the target does not occur. Useful if you are looking for a 
  specific string but are not certain where it occurs.

* ``split(<input>, <array>, <separator>)`` takes the input string and breaks it up into pieces, using the
  separator to divide the string. Useful for things like dates that have a specific character dividing 
  numeric values that you would like to extract. The resulting strings are stored in the array variable 
  passed to the function, which you can access using numeric indices starting with 1, and the function 
  returns the number of pieces. If you omit the separator, AWK uses the current field separator, which is 
  whitespace by default.

* ``substr(<input>, <start>, <length>)`` returns the substring of the given length, starting at the given 
  position. Useful if you are looking for unknown text but know the position of the string and its length. 
  You can omit the length if you want to take all remaining string characters.
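
As a sketch of how these three functions behave, the commands below take apart a made-up date string inside a ``BEGIN`` block, so no input file is needed. The names ``d``, ``n``, and ``parts`` are arbitrary: ::

    % awk 'BEGIN { d = "2015/07/04"; n = split(d, parts, "/"); print n, parts[1], parts[2], parts[3] }'
    % awk 'BEGIN { d = "2015/07/04"; print index(d, "/"), substr(d, 6, 2), substr(d, 9) }'

The first command should print ``3 2015 07 04``. The second should print ``5 07 04``: the first slash is the fifth character, the month occupies characters 6 and 7, and everything from character 9 onward is the day.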

For more information on string operations in AWK, look at the manual (it can be easily found with a web search).

===============
Exercises
===============

These exercises all use the file "newmadrid2015raw.txt."

* Write an AWK program that reprints each catalog entry preceded by the number of that entry (i.e. 1. <event data> <new line> 2. <event data>, etc.)

* Write an AWK program that prints out the date and time of all events occurring in the box bounded by
  :math:`{35^\circ}` and :math:`{36^\circ}` latitude and :math:`{-91^\circ}` and :math:`{-90^\circ}` 
  longitude.

* Write an AWK program that finds the average depth of all earthquakes detected by the New Madrid Network.

* Write an AWK program that determines the number of events that occurred in July.

* Reformat the catalog to work with your code from Homework 1. Do this 2 ways: (1) convert the date and
  time to a string with a format "2015-01-01T00:00:00.0000", and (2) convert the date to a decimal year in 
  AWK.