Unix Basics 2

Hopefully you are becoming familiar with the terminal environment in Unix. Today we will examine more basics for working in Unix.

Wildcards

When entering file and directory names, Unix gives us a number of special ways to refer to filenames, known as “globbing,” using “wildcard” characters. Characters that have a special meaning in the terminal include *, \, ?, ^ and [ ] and can be used to designate various patterns. One example is the * character, which stands for any number of characters (including none) of any type. This is often used to find files with a certain extension, as *.txt will match any file with the extension .txt. You can also use multiple wildcards in the same expression; for instance, if you want to list all files whose names contain the word “data” and end in “.txt” you could use ls *data*.txt and the shell would show you all of the files matching that pattern.

Other useful wildcards include:

  • ?, which can be used to represent a single unknown character. For instance, ?at matches Bat, bat, Cat, cat, and many others, but not at alone.
  • [<characters>] can be used to represent any one of the characters included in the square brackets. As an example, [CB]at matches Bat and Cat, but not bat or cat. You can also specify a range of characters such as [A-Z] for any capital letter, [a-z] for any lowercase letter, and [0-9] for any numeric character, or [A-Za-z0-9] for any alphanumeric character.
  • ^ negates a set of characters when it appears first inside the square brackets. For example, [^<characters>] matches any single character except those included in the square brackets. [^C]at will match Bat, bat, cat, and many other strings, but not Cat.
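The wildcards above can be tried out safely in a scratch directory. The following is a minimal sketch; the directory name wildcard_demo and the file names are made up for illustration, and the comments describe what each pattern should match for this particular set of files.

```shell
# Work in a scratch directory (the name wildcard_demo is arbitrary).
mkdir -p wildcard_demo
cd wildcard_demo

# Create some empty files to match against.
touch Bat Cat hat at data1.txt mydata_2.txt notes.txt data.csv

ls *.txt           # every name ending in .txt
ls *data*.txt      # names containing "data" and ending in .txt
ls ?at             # one character followed by "at": Bat, Cat, hat (not at)
ls [CB]at          # Bat and Cat only
ls [^C]at          # Bat and hat, not Cat (tcsh and bash; strict POSIX sh writes [!C]at)
ls data[0-9].txt   # data1.txt
```

Because ls only lists files, experimenting with patterns this way cannot damage anything; it is a good habit to check a pattern with ls before using it with a command such as rm.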

What if your expression contains one of the special wildcard characters? As we saw with spaces in file names, you can “escape” the character by preceding it with a backslash \. You will also need to escape wildcards in some other situations, like when using a wildcard to denote a file name with find (see below). Thus if you want to match a sequence that includes a question mark, use, for example, hello\?.txt.
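To see the difference that the backslash makes, here is a small sketch using a file whose name really does contain a question mark. The directory escape_demo and the file names are hypothetical.

```shell
mkdir -p escape_demo
cd escape_demo

# The backslash keeps the shell from treating ? as a wildcard,
# so the first file is literally named hello?.txt.
touch hello\?.txt hello1.txt helloA.txt

ls hello?.txt     # unescaped ?: matches all three files
ls hello\?.txt    # escaped ?: matches only the file named hello?.txt
```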

There are more sophisticated pattern matching techniques called “regular expressions” which we will use more extensively when we talk about AWK. Getting used to using these wildcards in the terminal will make regular expressions a bit easier to understand, so be sure to spend some time experimenting with them.

Finding Files

To make queries of files in the filesystem, use the find command, which has a number of sophisticated options. The syntax for find is find <locations> <criteria>, where <locations> are the directories that you wish to search (including all subdirectories) and <criteria> are the specific options that you want to search for (these are specified by command options).

Command options for find that I use frequently include:

  • -name <pattern> finds files whose name matches the given pattern. You are allowed to use wildcards in name patterns, but you must put a backslash in front of any special wildcard characters (otherwise, the shell expands the wildcards before searching the files, while we want the wildcards to remain until find does its search).
  • -iname <pattern> same as -name, but is case insensitive
  • -type <type>, target is of a specific type; common values are f (regular file) and d (directory)
  • -atime <time>, target has a most recent access date that differs from the present date by exactly <time> days, rounded to the next 24 hour period. You can also specify units other than days for <time> using any of smhdw (seconds, minutes, hours, days, weeks, respectively). For files that are newer than the time specified, use -<time>, and for files that are older than the time specified, use +<time>
  • -mtime <time>, works like -atime but uses file modification date rather than access date

For example, to find files within my home directory whose name is exactly “test”, I would enter find ~ -name test -type f into the terminal. To find all directories in my Documents directory that were modified in the past 24 hours, I would use find ~/Documents -type d -mtime -1. Wildcards can be very useful for finding files of a certain type; to find all SAC files in my home directory I can use find ~ -type f -iname \*.sac. Note that using -iname ensures that both files ending in .SAC and .sac are found, and that I put a backslash in front of the wildcard character to prevent the shell from expanding it before find is invoked.
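The searches above can be rehearsed on a small directory tree before running them on a whole home directory. This is a sketch only; the directory find_demo and all file names in it are invented for the example.

```shell
# Build a small tree to search (all names here are made up).
mkdir -p find_demo/data/old
cd find_demo
touch test data/test data/station1.sac data/old/STATION2.SAC data/notes.txt

find . -name test -type f       # regular files named exactly "test"
find . -type f -name \*.sac     # case-sensitive: only station1.sac
find . -type f -iname \*.sac    # case-insensitive: both SAC files
find . -type d -mtime -1        # directories modified in the past 24 hours
```

Here the search location is . (the current directory) rather than ~, which keeps the search confined to the scratch tree.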

Exercise: Practice searching for different types of files. I use this command frequently, so it is a good idea to be comfortable with its use.

Finding Text Within Files

We can also search for text patterns within files. This is done using grep (short for “globally search for a regular expression and print”). The basic syntax is grep <pattern> <files> where you specify a single pattern and (potentially) multiple text files in which to look for that pattern. For example, if you have a list of fruits as a text file fruit.txt, then grep apple fruit.txt will print any lines in the file that contain the string “apple.”

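As the name suggests, the pattern is a regular expression and can be used to match rather complex patterns. Practice using grep with plain text strings as well as with special characters to build various patterns that appear in a text file. Be aware that regular expressions are not the same as shell wildcards: in a regular expression, . matches any single character and * means “zero or more of the preceding character.” It is safest to put single quotes around a pattern containing special characters so that the shell does not expand it first. You can use shell wildcards for the file names if you want to search a large number of files for the pattern. As previously mentioned, we will come back to regular expressions and will talk a bit more about grep.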
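A short session with grep might look like the following sketch. The directory grep_demo and the contents of fruit.txt are made up; the file is created here with printf, but a text editor works equally well. The -i option (ignore case) is a standard grep option that is not otherwise discussed in this lab.

```shell
mkdir -p grep_demo
cd grep_demo

# Create a five-line text file (one fruit per line).
printf 'apple\nbanana\npineapple\ncherry\nApple pie\n' > fruit.txt

grep apple fruit.txt       # lines containing "apple": apple, pineapple
grep -i apple fruit.txt    # ignoring case: also matches "Apple pie"
grep '^a' fruit.txt        # ^ anchors the match to the start of a line: apple
grep 'b.n' fruit.txt       # . matches any single character: banana
grep apple *.txt           # shell wildcard on the file names to search
```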

Exercise: Using a text editor, create a text file that contains several lines of text. Use grep to find patterns, in particular to practice using various wildcards to match patterns.

Permissions

Back in the last lab, we noted that ls -l gave us a rather cryptic list of characters at the beginning of each entry:

total 472
drwxr-xr-x   53 egdaub  staff    1802 Nov  4  2015 MATLAB
drwxr-xr-x    2 egdaub  staff      68 Aug 10  2015 awk
-rw-r--r--@   1 egdaub  staff  147712 Sep 25  2015 ceri7104_dataanalysis.docx
drwxr-xr-x    3 egdaub  staff     102 Jan  7  2016 compexam
drwxr-xr-x    9 egdaub  staff     306 Dec 17  2015 csh
-rw-r--r--    1 egdaub  staff      84 Aug 25  2015 data_syllabus.aux
-rw-r--r--    1 egdaub  staff    5231 Aug 25  2015 data_syllabus.log
-rw-r--r--@   1 egdaub  staff   52581 Aug 25  2015 data_syllabus.pdf
-rw-r--r--    1 egdaub  staff   15882 Aug 25  2015 data_syllabus.synctex.gz
-rw-r--r--    1 egdaub  staff    4942 Oct 19  2015 data_syllabus.tex
drwxr-xr-x   56 egdaub  staff    1904 Dec 17  2015 gmt
drwxr-xr-x   77 egdaub  staff    2618 Dec  7  2015 homework
drwxr-xr-x  148 egdaub  staff    5032 Nov 24  2015 lectures
drwxr-xr-x   12 egdaub  staff     408 Oct 19  2015 python
drwxr-xr-x   11 egdaub  staff     374 Nov 19  2015 sac
drwxr-xr-x   12 egdaub  staff     408 May  2 12:34 studentwork

These characters signify what are known as “permissions.” They tell us who is allowed to read (r), write (w), and execute (x) each file. Three sets of three letters follow the first character (which tells us whether or not the entry is a directory): the first set tells us what the owner is allowed to do (for the above list, the owner is egdaub), the second set tells us what the group is allowed to do (for the above list, the group is staff), and the third set tells us what any other user is allowed to do. Thus, the sequence -rw-r--r-- signifies that the owner can read and write the file, the group can read the file, and others can read the file. These are typically the default permissions for a newly created file. Note that for the directories, the listing is drwxr-xr-x. The leading d says that this is a directory, and the remaining characters say that everyone can read and execute, but only the owner can write. Execute is a bit of a misnomer here: you cannot really “execute” a directory, so for a directory it means that the given class of users can cd into that directory and search in that directory.

These permissions can be changed using the chmod command (Change Mode). The syntax is chmod <mode> <files> and there are several different ways to specify the mode. You can add, remove, or set specific permissions for specific groups using the letters u, g, o, and a (for user, group, others, and all respectively), the operators +, -, and = (add permission, remove permission, and set permission, respectively), and r, w, and x (read, write, and execute, respectively). Thus, to add execute privileges for the user, enter chmod u+x file.txt into the terminal. To remove read and write privileges for others, enter chmod o-rw file.txt into the terminal. To give everyone read, write, and execute access, enter chmod a=rwx file.txt into the terminal. You can also specify more than one file in the command, and the specified changes will be made to all the files given in the command.
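The three chmod commands above can be followed step by step with ls -l. In this sketch the directory chmod_demo is hypothetical, and the first chmod (which combines several settings separated by commas) is only there to put the file in a known starting state.

```shell
mkdir -p chmod_demo
cd chmod_demo
touch file.txt

chmod u=rw,g=r,o=r file.txt   # known starting state: -rw-r--r--
ls -l file.txt
chmod u+x file.txt            # add execute for the owner: -rwxr--r--
ls -l file.txt
chmod o-rw file.txt           # remove read and write for others: -rwxr-----
ls -l file.txt
chmod a=rwx file.txt          # everyone may read, write, execute: -rwxrwxrwx
ls -l file.txt
```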

A shorthand way to specify permissions is to use an octal number to specify the binary representation of read, write, and execute as follows:

Permissions   ---   --x   -w-   -wx   r--   r-x   rw-   rwx
Binary        000   001   010   011   100   101   110   111
Octal           0     1     2     3     4     5     6     7

To set owner permissions to rwx, group to r-x, and others to r--, you can enter chmod 754 file.txt (the first octal number is for the owner, the second is for the group, and the third is for others). This is shorter than specifying each one individually through multiple commands, so if you need to change many different permissions at once, this is a handy trick.
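As a worked example of the octal shorthand, each digit is the sum of 4 (read), 2 (write), and 1 (execute) for one class of users. The sketch below uses a hypothetical directory octal_demo and checks that the octal and symbolic forms give the same result.

```shell
mkdir -p octal_demo
cd octal_demo
touch file.txt

# owner rwx = 4+2+1 = 7, group r-x = 4+0+1 = 5, others r-- = 4+0+0 = 4
chmod 754 file.txt
ls -l file.txt                  # listing begins with -rwxr-xr--

# The same permissions written in symbolic form, for comparison.
chmod u=rwx,g=rx,o=r file.txt
ls -l file.txt                  # still -rwxr-xr--
```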

Aliases

You may have noticed that some of your Unix commands can get a bit long and difficult to type if you choose a number of options. If you have a long command that you use frequently, you may consider making an alias. An alias is a way to essentially define your own Unix command in terms of some other command. For instance, if you want to always remove files using rm -i, then you could make an alias to do this. Enter alias rm 'rm -i' and then a carriage return. This sets the command rm to execute rm -i automatically. Try this, and see that it will ask you to confirm when deleting a file. To bypass an alias for a single command, precede the command with a backslash (\rm). To remove an alias, use the unalias command (unalias rm to remove the example here). In tcsh the argument to unalias is a pattern, so unalias * removes all aliases (in bash, the equivalent is unalias -a).

Other Shells

One thing to note here: the syntax for an alias in the shell that you are currently using (the Tenex C shell tcsh) is different from that of some other Unix shells. Each shell provides a terminal environment with its own set of built-in commands and shortcuts. While they are all relatively similar, they are not exactly the same. Shells available in the Mac Lab include the Bourne-again shell bash (default on most GNU/Linux systems), the Korn shell ksh, the Tenex C shell tcsh, and the Z shell zsh. While the majority of the commands that we will use work in all shells, a few, like alias, have syntax that is specific to a certain type of shell.

To run a different shell, enter the appropriate command into your present terminal. It will start a new session within that shell (the current directory will remain the same). To exit, type exit and you will return to your first shell. You can set your default shell in Terminal to be whatever you like – go to Terminal > Preferences... and set the “Shells open with” command to whatever shell you want. You need to give the full path, and all of the shells are under /bin.

rc files

If you want to set up an alias for all of your terminal sessions, is there a way to avoid typing in the alias every time you start up a terminal? You can automatically set things like aliases using a startup file. For the default tcsh in the Mac lab, the startup file is ~/.tcshrc, a file in your home directory that is executed automatically every time you start up a tcsh session. Note that the file name begins with a period, which means that the file is not normally visible when you type ls in your home directory. To see files that begin with a period, use ls -a. If you still do not see the file when you run ls -a in your home directory, then it does not exist. You can create this file in a text editor like nano: enter nano .tcshrc when in your home directory, and then type in any commands you would like to be executed by default on startup.

Common things to put in a startup file include setting aliases, changing the terminal prompt appearance, and setting variables known as environment variables (more on that later in the semester). Try putting an alias in your startup file, restarting the terminal, and seeing that the alias is in effect. The other shells have similar startup files, such as ~/.bashrc, though some have more than one possible startup file and preferential execution of one versus the other, depending on whether the shell is the login shell or was launched from another shell. If you want to keep using tcsh, then do not worry about login versus non-login shells: regardless of how your shell is launched, it will always execute the ~/.tcshrc file when it starts.

Making Programs Interact

Standard Input/Output and Pipes

One thing you may be wondering now is, if the Unix philosophy is to make simple, robust programs that do one thing and do it well, how do we use multiple programs to do something complex? This requires that we introduce the concepts of “pipes,” “standard input,” and “standard output.”

As you may have noticed, all of the commands you have typed typically print out some result to the screen. This could be a list of files using ls or find, the contents of a text file using cat, or a list of lines matching a pattern using grep. This output is referred to as “standard output” and every Unix command that produces output prints it to the screen in roughly the same way.

Programs can also receive input via what is called “standard input,” which can come from the keyboard, from a file, or from the standard output of another program. This interaction where the standard output of one program becomes the standard input of another is referred to as “piping” and is the principal way that we can do more complicated Unix tasks using the basic commands.

As a very simple example, let’s say that we have a directory containing a large number of files and we want to be able to look at the long format list (ls -l) of the contents. However, we want to read this in the terminal with the less program, rather than having to manually scroll up and down. This can be accomplished using a “pipe”: enter ls -l | less into the terminal, and you should have the list contents open using less. The vertical line | is what is known as a “pipe,” and its meaning here is to tell the terminal to use the standard output of the command ls -l as the standard input to the less command. We can then read the output from ls -l in less like it is any old text file.

While this may seem really simple (and it is), we can do more complicated things. We can search within the output of some program using grep via a pipe. For example, to find out which files in a directory some user can read, write, and execute, enter ls -l | grep rwx and the terminal will print the lines that contain the string “rwx.” Using grep in a pipe to find patterns in the output of some other command is a very common use of pipes, and you are likely to come across many examples where this is useful.
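The ls -l | grep rwx pipeline can be tried on a directory where the answer is known in advance. In this sketch the directory pipe_demo and its two files are hypothetical, and chmod is used to give exactly one of them the rwx combination.

```shell
mkdir -p pipe_demo
cd pipe_demo
touch script.sh notes.txt

# Give only script.sh the rwx combination.
chmod u=rwx,g=rx,o=rx script.sh
chmod u=rw,g=r,o=r notes.txt

ls -l               # full listing: both files
ls -l | grep rwx    # filtered listing: only the line for script.sh
```

Note that grep here searches the text that ls -l prints, not the contents of the files themselves.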

Input and Output Redirection

There are other ways to deal with command output, as you can save results of terminal operations as files. Saving output to files (as well as using files as standard input) is done with the operators <, >, >>, and the command tee. < signifies that the file following the < operator is to be used as standard input to the command. Many of the commands that we have already seen accept a file name as an argument and so do not need < (for example, try entering cat testfile.txt and cat < testfile.txt; the output should be identical), but you will likely encounter cases where you need to specify the use of a file as input.

The operators >, >>, and the command tee can all be used to save command output to a file. > saves the command line output in a new file, overwriting the old file in the event that file exists. Try entering ls > temp.txt into the terminal; the result should create a file “temp.txt” containing the result of the ls command. >> will append the results of a command to the file. Check the contents of “temp.txt,” and then enter ls -l >> temp.txt. Use less to view “temp.txt” to verify that it contains the results of both the short and the long list command.
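The following sketch runs through the three redirection operators in one short session. The directory redirect_demo and the files a.txt and b.txt are made up; note that temp.txt shows up in its own listing, because the shell creates it before ls runs.

```shell
mkdir -p redirect_demo
cd redirect_demo
touch a.txt b.txt

ls > temp.txt        # create (or overwrite) temp.txt with the short listing
cat temp.txt         # a.txt, b.txt, and temp.txt itself
ls -l >> temp.txt    # append the long listing to the same file
cat < temp.txt       # file used as standard input; same as cat temp.txt
ls > temp.txt        # > again: the earlier contents are replaced
cat temp.txt         # back to just the short listing
```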

Sometimes if you are piping the result of one command to another, you have some intermediate output that you might want to save for another purpose. To save the standard output from a command to a file while simultaneously passing the same output along, the tee command can be used. Basically, tee <filename> saves its standard input to a file, but also sends its standard input to standard output. Try using tee by entering ls -l | tee temp.txt. The ls -l results should both print to the terminal and save to the file “temp.txt.”
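Because tee passes its input along, it can sit in the middle of a longer pipeline. The sketch below (with a hypothetical directory tee_demo and made-up file names) saves the complete listing to a file while grep filters what reaches the screen.

```shell
mkdir -p tee_demo
cd tee_demo
touch script.sh notes.txt data.csv

# Save the full long listing to listing.txt and, in the same pipeline,
# pass it on to grep, which prints only the line mentioning "notes".
ls -l | tee listing.txt | grep notes

cat listing.txt    # the complete, unfiltered listing
```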

xargs

One final trick for making programs interact is the following: what if you have some program that takes a number of individual file names as input, rather than standard input (i.e. a text stream)? For example, in the piping example above, we piped the output of ls into grep to search the output for a certain pattern. However, what if we didn’t want to search the output for that pattern, but rather the contents of those particular files for a pattern? We cannot do that using a pipe. Similarly, if we want to find a bunch of files, and then copy them to some other directory, we cannot use standard output in the cp command, since cp takes a file and copies it to another directory.

You can perform the above operations by making commands interact with the xargs command. xargs takes standard input, splits it into a list of arguments, and then executes the command that follows with those arguments added to it. By default, xargs passes as many arguments as will fit to each invocation, so the command is usually run once with the whole list rather than once per item. xargs also puts the arguments at the end of the command by default, so if you have a command where the inputs need to be placed at a different point in the command, you should specify where xargs should put the arguments using the -J option and some other character as a placeholder (see an example of how to do this below).

As an example, let’s say I want to search every shell script (files with a suffix “.sh”) in my home directory for the pattern “gmt” to figure out which of my shell scripts invoke GMT commands. This is one of those cases where I need to use xargs to turn standard output into a list of arguments. I can do this with find ~ -name \*.sh | xargs grep gmt as a terminal entry. The first command is find to find all shell scripts (files ending in “.sh”) in my home directory. The find command by itself sends a list of files to standard output. Since grep needs a list of arguments, rather than standard input, xargs is used to transform the stream into a list of arguments. What is actually happening is that xargs builds and runs the command grep gmt <files>, where <files> is the list of file names read from the output of find (if the list is very long, xargs splits it across several invocations). Note that in this case, because xargs puts the arguments at the end of the command by default, and grep takes its target files as its last arguments, we could use xargs without needing to specify any options.
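The find and xargs combination can be tested on a small tree first. In this sketch the directory xargs_demo, the script names, and their contents are all invented; notes.txt contains “gmt” but is not a shell script, so it should not be searched. The -l option of grep, which prints only the names of matching files, is a standard option not otherwise covered here.

```shell
mkdir -p xargs_demo/scripts
cd xargs_demo

# Two made-up shell scripts and one text file.
printf 'gmt psxy data.txt\n' > scripts/plot.sh
printf 'echo hello\n' > scripts/hello.sh
printf 'gmt pscoast\n' > notes.txt

find . -name \*.sh | xargs grep gmt      # matching lines, prefixed by file name
find . -name \*.sh | xargs grep -l gmt   # only the names of matching files
```

One caveat: xargs splits its input on whitespace by default, so this simple form can misbehave for file names that contain spaces.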

As another example, here is a situation where we cannot use xargs without any options. Let’s say that we want to copy every file that we modified in the past 24 hours to a single folder as a backup copy. Since cp takes filenames as arguments, rather than standard input, we need to use xargs to achieve this. Also note that the syntax for cp is cp <file> <target>, so the default usage of xargs will not work because we need to insert our filenames in the middle of the command. To successfully perform this operation, we use the -J option for xargs, where we designate a special character to serve as a placeholder, and then xargs replaces the placeholder with the arguments derived from its input. Assuming that the directory ~/backup already exists, we can copy all recently modified files to this directory by entering find ~ -type f -mtime -1 | xargs -J % cp % ~/backup into the terminal. The first part of the command is the find operation, which sends the files satisfying the criteria to standard output. This is then piped into xargs, which takes the list of files and inserts it into the appropriate place in the cp command (using % as the placeholder, which is specified by the -J option). Note that -J is specific to the version of xargs found on the Mac OS and other BSD systems; other versions of Unix may have a different way of doing this substitution, so check the manual page if you are on a different system.
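On systems whose xargs lacks -J (GNU/Linux, for example), the -I option plays a similar role and is also available on the Mac OS. The sketch below is a variation on the backup command, not the exact command from the text: with -I, xargs runs cp once for each line of input instead of once for the whole list. The directories copy_demo, src, and backup are hypothetical, and backup is kept outside the tree being searched so that find never sees the copies.

```shell
mkdir -p copy_demo/src/sub copy_demo/backup
cd copy_demo
touch src/a.txt src/sub/b.txt

# -I % runs "cp <file> backup" once for each line that find prints,
# substituting the file name wherever % appears.
find src -type f -mtime -1 | xargs -I % cp % backup

ls backup    # a.txt and b.txt
```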

Do not worry if you do not totally understand how to use pipes, redirection of output, xargs, and everything else here. It takes time to learn how to make Unix do what you want. Over the semester, as we work with Unix more and start using shell scripts, AWK, and GMT, you will gradually become more comfortable with this. However, nothing can take the place of practice: if you want to truly become comfortable with Unix, then you should use these tools on a daily basis in your research.

Summary

Here is a list of the commands that we introduced in this lab. They are among the most common, and you will likely find yourself using them very often:

  • Wildcards *, \, ?, ^ and [ ]
  • find
  • grep
  • chmod
  • alias
  • unalias
  • tcsh and many other shell types
  • > and >> (redirect to file, overwriting or appending)
  • < (redirect from file)
  • | (pipe to next command)
  • tee (send to file and standard output)
  • xargs