Introduction to the Course
Why does this course exist? Doing research in geophysics requires processing large amounts of data. As an example, Hi-net is a seismic network operating in Japan. It consists of 1000 stations, each recording 3 components at a sampling rate of 100 Hz. This adds up to about 9 TB of data every year.
We could not possibly process this amount of data by hand. Instead, we need to automate this task using a computer. This course is meant to help you learn how to do this.
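The yearly volume quoted above can be checked with a short calculation. The sketch below is written in Python (an assumed choice of language for these examples), and the number of bytes per sample is an assumption, since the text does not state the storage format. With the station count, component count, and sampling rate given above, a total of about 9 TB per year corresponds to roughly one byte per sample on average; larger sample sizes scale the total proportionally.

```python
# Back-of-the-envelope data volume for a network like the one described above.
# The bytes-per-sample values tried below are assumptions, not documented
# properties of Hi-net.
STATIONS = 1000
COMPONENTS = 3
SAMPLE_RATE_HZ = 100
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def samples_per_year(stations, components, rate_hz):
    """Total number of samples recorded by the whole network in one year."""
    return stations * components * rate_hz * SECONDS_PER_YEAR


def terabytes_per_year(n_samples, bytes_per_sample):
    """Storage needed in terabytes, using 1 TB = 10**12 bytes."""
    return n_samples * bytes_per_sample / 1e12


n_samples = samples_per_year(STATIONS, COMPONENTS, SAMPLE_RATE_HZ)
for bytes_per_sample in (1, 2, 4):
    tb = terabytes_per_year(n_samples, bytes_per_sample)
    print(f"{bytes_per_sample} byte(s) per sample: {tb:.1f} TB per year")
```

Whatever the exact sample size, the total is in the terabytes, which is far beyond what anyone could inspect by hand.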
But we are not here simply to automate data processing tasks – we are scientists, so it is important to keep in mind the particular needs of scientists when writing computer programs to analyze data.
The scientific method says that we need to do tasks:
- Accurately
- Reproducibly
- Efficiently
These are listed in order of importance, as there may be tradeoffs between these goals in a given task. For instance, a particular earthquake happens only once, so we cannot expect to reproduce the observation itself in that case. Note that efficiency is last – it is far more important that you write code that is correct, easy to understand and maintain, and produces reproducible results than code that runs fast.
What is data?
At the basic level, data is a bunch of numbers. However, that by itself is not enough, as we need to know units, as well as how the data was collected, where it was collected, etc. This is usually called metadata, which means data that describes other data. How do we handle data on a computer?
Binary (i.e. sequences of 0s and 1s) – A computer treats everything as a sequence of 0/1, so perhaps this is a good approach.
Pros: This turns out to be very efficient, because nothing is stored beyond what the computer needs to represent the numbers.
Cons: The data is not human readable, and there is no way to include metadata. Byte ordering is not standardized across machines (an artifact of the history of computer hardware development).
ASCII (i.e. text file) – Data is represented as a series of text characters.
Pros: Human readable format, which also allows data and metadata to be combined in the same file.
Cons: Text representation of numbers is less efficient than straight binary format, so datasets can be very large if there is a great quantity of data.
Combination Format – A format combining both text and binary, for instance text for the metadata and binary for the numerical part of the data. This category also includes proprietary formats like Microsoft Excel.
Pros: Overcomes limitations of both file formats.
Cons: A standard format needs to be defined, but there are often competing versions for a particular type of data. Special software is also needed to read these formats, as they may not be built into all software packages. Proprietary formats may not be usable at all without the software that created them.
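To make the three options concrete, here is a minimal sketch in Python using only the standard library. The values, the `units` label, and the one-line header layout are all made up for illustration; the header layout in particular is a hypothetical format, not an established standard. It shows that the same numbers give different bytes under different byte orders, that reading with the wrong byte order silently gives wrong numbers, that text takes more space, and that a text header can carry the metadata a binary payload lacks.

```python
import struct

# Made-up values, for illustration only.
values = [1.5, -2.25, 1234.5678]

# Binary: pack each value as a 4-byte float. "<" means little-endian and
# ">" means big-endian, so the same numbers give different byte sequences.
little = struct.pack("<3f", *values)
big = struct.pack(">3f", *values)

# Reading with the wrong byte order raises no error; it just gives wrong numbers.
right = struct.unpack("<3f", little)
wrong = struct.unpack(">3f", little)

# Text: human readable, and metadata fits on a comment line.
text = "# units: m/s\n" + "\n".join(f"{v:.4f}" for v in values) + "\n"

# Combination (hypothetical layout): one line of text metadata, then binary.
header = b"units=m/s;dtype=float32;order=little;count=3\n"
combined = header + little


def read_combined(blob):
    """Split the text header from the binary payload and decode both."""
    head, payload = blob.split(b"\n", 1)
    meta = dict(item.split("=") for item in head.decode("ascii").split(";"))
    fmt = ("<" if meta["order"] == "little" else ">") + meta["count"] + "f"
    return meta, struct.unpack(fmt, payload)


print("binary size:", len(little), "bytes")
print("text size:  ", len(text.encode("ascii")), "bytes")
print("correct byte order:", right)
print("wrong byte order:  ", wrong)
print("combined:", read_combined(combined))
```

Note that the header has to record the byte order and the number type, because the binary payload by itself does not say how it should be interpreted.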
What does Data Analysis involve?
Reformatting data: You may get data in one format, but software that you have requires another. For instance, you might get a text file that contains all the correct information, but needs to be put into a different text format for use with another program. Or you might need to convert from ASCII to binary or vice versa.
This is something you should always automate! Human typing errors are very common. A computer does not get tired, and it carries out the same instructions in exactly the same way every time.
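As an illustration of automated reformatting, the following minimal sketch converts a whitespace-separated text table into CSV. It is written in Python with the standard library only; the station names, coordinates, column names, and the function `whitespace_to_csv` are hypothetical and exist only for this example.

```python
import csv
import io

# Hypothetical input: whitespace-separated columns with a comment header.
raw_text = """\
# station latitude longitude
STA1   35.10   139.20
STA2   36.45   138.05
"""


def whitespace_to_csv(text):
    """Convert whitespace-separated rows to CSV, skipping '#' comment lines."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["station", "latitude", "longitude"])
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        station, lat, lon = line.split()
        # Converting to float makes a malformed number fail loudly here
        # instead of passing unnoticed into the output file.
        writer.writerow([station, float(lat), float(lon)])
    return out.getvalue()


print(whitespace_to_csv(raw_text))
```

Once a conversion like this is written and checked, it can be applied to any number of files without the risk of a retyping error.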
Processing data: This is the actual number crunching part. You might use a program from someone else, or a program that you write yourself.
This course uses high level tools for analysis. High level means that there are many built-in tools and capabilities in the software. This is a good approach for many reasons:
- Shorter code is easier to debug, so it is more likely to be correct.
- Built-in tools are likely to have been more rigorously tested than something you write yourself. They are also likely to be more efficient than your own implementation. It is a good idea to avoid re-inventing the wheel!
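The difference between the two approaches can be seen in a small sketch. Python and its standard `statistics` module are used here as an assumed example of a high level tool, and the amplitude values are made up. Both versions compute a median, but the built-in one is a single call.

```python
import statistics

# Made-up amplitude readings, for illustration only.
amplitudes = [0.8, 1.2, 0.5, 2.1, 1.7]


def median_by_hand(values):
    """Hand-written median: more lines, so more places for a bug to hide."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    # For an even count, average the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


# Built-in version: one call to a tested standard-library function.
median_builtin = statistics.median(amplitudes)

print("by hand: ", median_by_hand(amplitudes))
print("built-in:", median_builtin)
```

The hand-written version has to get the even-length case right; with the built-in function, that detail has already been handled and tested by someone else.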
Eventually, you may need to write your own code, and for performance reasons you might need to use a lower level programming language. Fortunately, the skill of writing computer programs is highly transferable across programming languages, so if you get good at using a high level language, it will not be too difficult to translate your skills to another language, as nearly all of the same concepts will still apply.
Representing processed data: In publications, you are most likely going to publish plots and maps, rather than the data itself. We will explore computer tools for making graphs and maps in this class, which are much more efficient than drawing the plots yourself in a drawing program.
Automating everything: To do your analysis in an accurate and reproducible fashion, you should automate as much as possible. This reduces errors, increases efficiency (computer programmers are notoriously lazy), and makes things easy to do again in case you realize you need to do something differently.
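A minimal sketch of what this automation can look like is shown here, assuming Python and a set of made-up records. The station names, sample values, and the functions `peak_amplitude` and `process_all` are hypothetical. The point is that one function is applied to every record in a fixed order, so that re-running the script gives the same result.

```python
# Hypothetical records: station name -> list of samples (made-up values).
records = {
    "STA1": [0.1, -0.4, 0.9, -0.2],
    "STA2": [1.5, -2.0, 0.3, 0.7],
    "STA3": [-0.6, 0.2, 0.1, -0.1],
}


def peak_amplitude(samples):
    """Largest absolute value in a trace."""
    return max(abs(s) for s in samples)


def process_all(all_records):
    """Apply the same processing step to every record, in a fixed order."""
    return {name: peak_amplitude(all_records[name]) for name in sorted(all_records)}


results = process_all(records)
for name, peak in results.items():
    print(f"{name}: peak amplitude {peak}")
```

If the processing step needs to change, only `peak_amplitude` has to be edited, and the whole analysis can be redone by running the script again.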