.. _intro:

*********************************
Introduction to the Course
*********************************

Why does this course exist? Doing research in Geophysics requires processing *lots* of data.
As an example, Hi-net is a seismic network operating in Japan. It consists of roughly 1000
stations, each recording 3 components at a sampling rate of 100 Hz. This works out to about
9 TB of data *every year*.
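
As a quick sanity check on that figure, here is the back-of-envelope arithmetic as a short
Python sketch (the one-byte-per-stored-sample figure is an assumption about compressed data;
raw samples take several bytes each):

.. code-block:: python

   stations = 1000                 # Hi-net stations
   components = 3                  # components recorded per station
   rate_hz = 100                   # samples per second per component
   seconds_per_year = 365 * 24 * 3600

   samples = stations * components * rate_hz * seconds_per_year
   print(f"{samples:.1e} samples/year")     # ~9.5e12

   # Assuming ~1 byte per stored (compressed) sample; at 4 bytes per
   # raw sample this would be closer to 40 TB.
   print(f"~{samples / 1e12:.0f} TB/year")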

We could not possibly process this amount of data by hand. Instead, we need to automate
this task using a computer. This course is meant to help you learn how to do this.

But we are not here simply to automate data processing tasks -- we are scientists, so
it is important to keep in mind the particular needs of scientists when writing computer
programs to analyze data.

The scientific method requires that we do our tasks

1. Accurately
2. Reproducibly
3. Efficiently

These are listed in order of importance, as there may be tradeoffs between these goals in a
given task. For instance, a particular earthquake happens only once, so we cannot repeat the
measurement itself; what we *can* make reproducible is the analysis of the recorded data.
Note that efficiency is last -- it is far more important that you write code that is correct,
easy to understand and maintain, and produces reproducible results than code that runs fast.

==================================
What is data?
==================================

At the most basic level, data is a bunch of numbers. However, the numbers by themselves are
not enough: we also need to know the units, how the data was collected, where it was
collected, and so on. This descriptive information is called *metadata*, which means data
that describes other data. How do we store data on a computer? There are three common
approaches:

1. **Binary** (i.e. sequences of 0s and 1s) -- A computer treats everything as a sequence of
   0s and 1s, so perhaps this is a good approach.

   **Pros:** This turns out to be very efficient, as we do not add anything beyond what the
   computer needs to understand the numbers.

   **Cons:** The data is not human readable, and there is no standard way to include metadata.
   Byte ordering is also not standardized (an artifact of the history of computer hardware
   development), so the same file may decode differently on different machines; the sketch
   after this list pins the byte order down explicitly.

2. **ASCII** (i.e. a text file) -- Data is represented as a series of text characters.

   **Pros:** Human readable, and allows data and metadata to be combined in the same file.

   **Cons:** The text representation of a number takes more space than its binary equivalent,
   so large datasets produce very large files (the sketch after this list makes the size
   difference concrete).

3. **Combination Format** -- A format combining both text and binary, for instance text
   metadata alongside binary numerical data. This category also includes proprietary formats
   like Microsoft Excel.

   **Pros:** Overcomes the limitations of the two pure approaches.

   **Cons:** A standard format must be defined, and there are often competing standards for a
   particular type of data. Reading such a format also requires dedicated software, since
   support may not be built into every package, and a proprietary format may be useless
   without the vendor's software.
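
To make the binary vs. text tradeoff concrete, here is a minimal Python sketch (the sample
values and the ``mm/s`` units line are made up for illustration) that stores the same three
numbers both ways:

.. code-block:: python

   import struct

   values = [0.1, 0.2, 0.3]  # made-up sample values

   # Binary: 8 bytes per double, compact but not human readable.
   # "<" pins the byte order to little-endian; without an explicit
   # choice, the same bytes may decode differently across machines.
   binary_blob = struct.pack("<3d", *values)
   print(len(binary_blob))            # 24 bytes

   # ASCII: human readable and easy to mix with metadata, but the
   # full-precision text of each double takes ~19 characters, not 8.
   text_blob = "# units: mm/s\n" + "\n".join(f"{v:.17g}" for v in values)
   print(len(text_blob.encode()))     # about three times the binary size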
   
=======================================
What does data analysis involve?
=======================================

1. **Reformatting data:** You may get data in one format, but software that you have requires another. For
   instance, you might get a text file that contains all the correct information, but needs to be put into
   a different text format for use with another program. Or you might need to convert from ASCII to binary
   or vice versa.

   This is something you should always automate! Human typing errors are very common;
   computers do not make mistakes or get tired. (The first sketch after this list shows a
   small reformatting script.)
   
2. **Processing data:** This is the actual number crunching part. You might use a program from someone else,
   or a program that you write yourself.
   
   This course uses *high-level* tools for analysis. High level means that there are many
   built-in tools and capabilities in the software. This is a good approach for several reasons:

   * Shorter code is easier to debug, so it is more likely to be correct.
   * Built-in tools are likely to have been tested more rigorously than something you write
     yourself, and they are often more efficient than a hand-rolled implementation. It is a
     good idea to avoid re-inventing the wheel! (The second sketch after this list shows the
     idea.)

   Eventually, you may need to write your own code, and for performance reasons you might need
   a lower-level programming language. Fortunately, programming skills are highly transferable
   across languages, so if you get good at using a high-level language, it will not be too
   difficult to pick up another: nearly all of the same concepts still apply.
   
3. **Representing processed data:** In publications, you are most likely going to publish
   plots and maps rather than the data itself. In this class we will explore computer tools
   for making graphs and maps, which are much more efficient than drawing the plots yourself
   in a drawing program. (The final sketch after this list shows a minimal plotting example.)
   
4. **Automating everything:** To do your analysis in an accurate and reproducible fashion, you
   should automate as much as possible. This reduces errors, increases efficiency (good
   programmers are famously "lazy": they make the computer do the repetitive work), and makes
   it easy to redo everything if you realize something needs to be done differently. The first
   sketch below automates a whole directory of files in one loop.
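
As a sketch of points 1 and 4 together, the following script (the ``raw``/``converted``
directory names and the whitespace-delimited column layout are hypothetical) converts every
text file in a directory to CSV in one loop:

.. code-block:: python

   import csv
   from pathlib import Path

   # Hypothetical layout: raw/*.txt hold whitespace-delimited rows,
   # and we want the same rows written as CSV into converted/.
   raw_dir = Path("raw")
   out_dir = Path("converted")
   out_dir.mkdir(exist_ok=True)

   for txt_file in sorted(raw_dir.glob("*.txt")):
       rows = [line.split()
               for line in txt_file.read_text().splitlines()
               if line.strip()]               # skip blank lines
       out_file = out_dir / (txt_file.stem + ".csv")
       with out_file.open("w", newline="") as f:
           csv.writer(f).writerows(rows)
       print(txt_file, "->", out_file)        # simple progress log

Because every file goes through exactly the same code path, rerunning the script after a
change reprocesses the whole dataset with no chance of a stray typo.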
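
For point 2, a high-level library such as NumPy (used here purely as an illustration; the
synthetic signal is made up) replaces hand-written loops with well-tested built-ins:

.. code-block:: python

   import numpy as np

   # Made-up "waveform": one minute of noise sampled at 100 Hz.
   rng = np.random.default_rng(0)
   signal = rng.standard_normal(60 * 100)

   # One tested, optimized call each -- no hand-written loops.
   print("mean:", signal.mean())
   print("rms: ", np.sqrt(np.mean(signal ** 2)))
   print("peak:", np.abs(signal).max())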
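
And for point 3, a plotting library such as Matplotlib (again, just one possible choice, with
made-up data) produces a figure ready for a manuscript in a few lines:

.. code-block:: python

   import numpy as np
   import matplotlib.pyplot as plt

   # Made-up data: a decaying oscillation sampled at 100 Hz.
   t = np.arange(0, 10, 0.01)
   y = np.exp(-t / 3) * np.sin(2 * np.pi * 1.5 * t)

   plt.plot(t, y)
   plt.xlabel("Time (s)")
   plt.ylabel("Amplitude")
   plt.title("Synthetic decaying oscillation")
   plt.savefig("waveform.png", dpi=150)   # drop straight into a paper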