Organizing with Unix

Unix is the operating system of choice in data science. We will introduce you to the Unix way of thinking using an example: how to keep a data analysis project organized. We will learn some of the most commonly used commands along the way. However, we won’t go into the details here. We highly encourage you to learn more, especially when you find yourself using the mouse too much or performing a repetitive task often. In those cases, there is probably a more efficient way to do it in Unix. Here are some basic courses to get you started:

There are also many informative reference books.

When searching for Unix resources, keep in mind that other terms used to describe what we will learn here are Linux, the shell and the command line. Basically, what we are learning is a series of commands and a way of thinking that facilitates the organization of files without using the mouse.

To serve as motivation, we are going to start constructing a directory using Unix tools and RStudio.

The Terminal

The Terminal is our window into the Unix world. Instead of clicking, dragging and dropping to organize our files and folders, we will be typing Unix commands into the terminal. The way we do this is similar to how we type commands into the R console, but instead of generating plots and statistical summaries, we will be organizing files on our system.

We have already described how we can access a terminal using RStudio, namely by going to Tools then Terminal then New Terminal. But often we want access to the terminal without needing RStudio. We already described how to access the terminal on the Mac by opening the application in the Utilities folder:

You can also use the Spotlight feature on the Mac by pressing command-spacebar, then typing Terminal.

On Windows, assuming you’ve installed Git Bash, you can also access a terminal without RStudio by running the Git Bash program:

Once you have a terminal open, you can start typing commands. You should see a blinking cursor at the spot where what you type will show up. This position is called the command line. Once you type something and hit enter on Windows or return on the Mac, Unix will try to execute this command. If you want to try out an example, type this command into your command line:

echo "hello world"

The command echo is similar to cat in R. Executing this line should print out hello world, then return to the command line. Notice that you can’t use the mouse to move around in the terminal. You have to use the keyboard. To go back to a command you previously typed, you can use the up arrow.

Above we included a chunk of code showing Unix commands in the same way we have previously shown R commands. We will make sure to distinguish when the command is meant for R and when it is meant for Unix.

The Filesystem

We refer to all the files, folders, and programs on your computer as the filesystem. Keep in mind that folders and programs are also files, but this is a technicality we rarely think about and ignore in this book. We will focus on files and folders for now and discuss programs, or executables, in a later section.

Directories and subdirectories

The first concept you need to grasp to become a Unix user is how your file system is organized. You should think of it as a series of nested folders each containing files, folders, and executables.

Here is a visual representation of the structure we are describing:

In Unix, we refer to folders as directories. Directories that are inside other directories are often referred to as subdirectories. So, for example, in the figure above, the directory docs has two subdirectories: reports and resumes, and docs is a subdirectory of home.

The home directory

The home directory is where all your stuff is kept. In the figure above, the directory called home represents your home directory but that is rarely the name used. On your system, the name of your home directory is likely the same as your username on that system. Below is an example on Windows showing a home directory, in this case, named rafa:

Here is an example from a Mac:

Now, look back at the figure showing a filesystem. Suppose you are using a point-and-click system and you want to remove the file cv.tex. Imagine that on your screen you can see the home directory. To erase this file, you would double click on the home directory, then docs, then resumes, and then drag cv.tex to the trash. Here you are experiencing the hierarchical nature of the system: cv.tex is a file inside the resumes directory, which is a subdirectory inside the docs directory, which is a subdirectory of the home directory.

Now suppose you can’t see your home directory on your screen. You would somehow need to make it appear on your screen. One way to do this is to navigate from what is called the root directory all the way to your home directory. Any file system will have what is called a root directory which is the directory that contains all directories. The home directory shown in the figure above will usually be two or more levels from the root. On Windows, you will have a structure like this:

while on the Mac, it will be like this:

Note for Windows Users: The typical R installation will make your Documents directory your home directory in R. This will likely be different from your home directory in Git Bash. Generally, when we discuss home directories, we refer to the Unix home directory, which for Windows, in this book, is the Git Bash home directory.

Working directory

The concept of a current location is part of the point-and-click experience: at any given moment we are in a folder and see the content of that folder. As you search for a file, as we did above, you are experiencing the concept of a current location: once you double click on a directory, you change locations and are now in that folder, as opposed to the folder you were in before.

In Unix, we don’t have the same visual cues, but the concept of a current location is indispensable. We refer to this as the working directory. Each terminal window you have open has a working directory associated with it.

How do we know what our working directory is? The first command we learn is pwd, which stands for print working directory. This command returns the working directory.

Open a terminal and type:

pwd

We do not show the result of running this command because it will be quite different on your system compared to others. If you open a terminal and type pwd as your first command, you should see something like /Users/yourusername on a Mac or something like /c/Users/yourusername on Windows. The character string returned by calling pwd represents your working directory. When we first open a terminal, it will start in our home directory so in this case the working directory is the home directory.

Notice that the forward slashes / in the strings above separate directories. So, for example, the location /c/Users/rafa implies that our working directory is called rafa and it is a subdirectory of Users, which is a subdirectory of c, which is a subdirectory of the root directory. The root directory is therefore represented by just a forward slash: /.

Paths

We refer to the string returned by pwd as the full path of the working directory. The name comes from the fact that this string spells out the path you need to follow to get to the directory in question from the root directory. Every directory has a full path. Later, we will learn about relative paths, which tell us how to get to a directory from the working directory.

In Unix we use the shorthand ~ as a nickname for your home directory. So, for example, if docs is a directory in your home directory, its path can be written like this: ~/docs.
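As a quick check, the following sketch asks the shell to print what it substitutes for ~. The docs directory here is only illustrative: echo prints the expanded path whether or not such a directory exists on your system.

```shell
# Print the full path the shell substitutes for ~
echo ~

# The substitution also applies at the start of a longer path
# (docs is a hypothetical directory; echo does not check that it exists)
echo ~/docs
```

The first line of output should match what pwd shows when you first open a terminal.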

Most terminals will show the path to your working directory right on the command line. If you are using default settings and open a terminal on the Mac, you will see that right at the command line you have something like computername:~ username with ~ representing your working directory, which in this example is the home directory ~. The same is true for the Git Bash terminal where you will see something like username@computername MINGW64 ~, with the working directory at the end. When we change directories, we will see this change on both Macs and Windows.

Unix commands

We will now learn a series of Unix commands that will permit us to prepare a directory for a data science project. We also provide examples of commands that, if you type into your terminal, will return an error. This is because we are assuming the file system in the cartoon above. Your file system is different. In the next section, we will provide examples that you can type in.

ls: Listing directory content

In a point-and-click system, we know what is in a directory because we see it. In the terminal, we do not see the icons. Instead, we use the command ls to list the directory content.

To see the content of your home directory, open a terminal and type:

ls

We will see more examples soon.

mkdir and rmdir: make and remove a directory

When we are preparing for a data science project, we will need to create directories. In Unix, we can do this with the command mkdir, which stands for make directory.

Because you will soon be working on several projects, we highly recommend creating a directory called projects in your home directory.

You can try this particular example on your system. Open a terminal and type:

mkdir projects

If you do this correctly, nothing will happen: no news is good news. If the directory already exists, you will get an error message and the existing directory will remain untouched.
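To see both behaviors without touching your home directory, here is a minimal sketch that works in a throwaway directory. The mktemp command and the sandbox variable are not part of this chapter; they are assumptions used only to give us a safe place to experiment.

```shell
# Work in a throwaway directory (created by mktemp) instead of the home directory
sandbox=$(mktemp -d)
cd "$sandbox"

# First call: succeeds silently
mkdir projects

# Second call: mkdir reports an error because projects already exists
mkdir projects || echo "mkdir failed, but projects is still there"

# The directory is listed exactly once
ls
```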

To confirm that you created this directory, you can list the content of your working directory:

ls

You should see the directory we just created listed. Perhaps you can also see many other directories that come pre-installed on your computer.

For illustrative purposes, let’s make a few more directories. You can list more than one directory name like this:

mkdir docs teaching

You can check to see if the three directories were created:

ls

If you made a mistake and need to remove the directory, you can use the command rmdir to remove it.

mkdir junk
rmdir junk

This will remove the directory as long as it is empty. If it is not empty, you will get an error message and the directory will remain untouched. To remove directories that are not empty, we will learn about the command rm later.
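The empty-directory rule can be sketched as follows. As an assumption for safety, the example runs in a throwaway directory created with mktemp, and the name inner is hypothetical.

```shell
# Work in a throwaway directory rather than the home directory
sandbox=$(mktemp -d)
cd "$sandbox"

mkdir junk
mkdir junk/inner     # junk now contains a subdirectory, so it is not empty

# rmdir refuses to remove a non-empty directory and leaves it untouched
rmdir junk || echo "rmdir failed: junk is not empty"

# Emptying junk first makes the removal succeed
rmdir junk/inner
rmdir junk

# Nothing is left to list
ls
```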

cd: Navigating the filesystem by changing directories

Next we want to create directories inside directories that we have already created. We also want to avoid pointing and clicking our way through the filesystem. We therefore will explain how to do this in Unix, using the command line.

Suppose we open a terminal. Our working directory is our home directory. We want to change our working directory to projects. We do this using the cd command, which stands for change directory:

cd projects

To check that the working directory changed, we can use a command we previously learned:

pwd

Our working directory should now be ~/projects. Note that on your computer the home directory ~ will be spelled out to something like /c/Users/yourusername.

Important Pro Tip: In Unix you can auto-complete by hitting tab. This means that we can type cd d then hit tab. Unix will either auto-complete if docs is the only directory/file starting with d or show you the options. Try it out! Using Unix without auto-complete is unbearable.

When using cd, we can either type a full path, which will start with / or ~, or a relative path. In the example above, in which we typed cd projects, we used a relative path. If the path you type does not start with / or ~, Unix will assume you are typing a relative path, meaning that it will look for the directory in your current working directory. So something like this will give you an error:

cd Users

because there is no Users directory in your working directory.
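The difference between the two kinds of paths can be tried out safely with a sketch like the one below. It assumes a throwaway directory created with mktemp, stored in a variable we call sandbox, standing in for the home directory.

```shell
# A throwaway directory plays the role of the home directory
sandbox=$(mktemp -d)
mkdir "$sandbox/projects"

cd "$sandbox"
cd projects          # relative path: looked up inside the working directory
pwd

cd "$sandbox"        # full path: works no matter where we are
# Relative path that does not exist here: cd fails and we stay where we were
cd Users || echo "no Users directory in the working directory"
pwd
```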

Now suppose we want to move back to the directory in which projects is a subdirectory, referred to as the parent directory. We could use the full path of the parent directory, but Unix provides a shortcut for this: the parent directory of the working directory is represented with two dots (..). So to move back we simply type:

cd ..

You should now be back in your home directory which you can confirm using pwd.

Because we can use full paths with cd, the following command:

cd ~

will always take us back to the home directory, no matter where we are in the filesystem.

The working directory also has a nickname, a single dot (.). So if you type:

cd .

you will not move. Although this particular use of . is not useful, this nickname does come in handy sometimes. The reasons are not relevant for this section, but you should still be aware of this fact.

In summary, we have learned that when using cd we either stay put, move to a new directory using the desired directory’s name, or move back to the parent directory using two dots (..).

When typing directory names, we can concatenate directories with forward slashes. So if we want a command that takes us to the projects directory no matter where we are in the filesystem, we can type:

cd ~/projects

which is equivalent to writing the entire path out. For example, in Windows we would write something like

cd /c/Users/yourusername/projects

The last two commands are equivalent and in both cases we are typing the full path.

We can also concatenate directory names for relative paths. So, for instance, if we want to move back two levels, to the parent directory of the parent directory of the working directory, we can type:

cd ../..

A couple of final tips related to the cd command. First, you can go back to whatever directory you just left by typing:

cd -

This can be useful if you type a very long path and then realize you want to go back to where you were, and that too has a very long path.

Second, if you just type:

cd

you will be returned to your home directory.
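Here is a small sketch that strings several of these shortcuts together. It assumes a throwaway directory created with mktemp (the sandbox variable is our own name), so nothing in your home directory is touched.

```shell
# Build a small nested structure in a throwaway directory
sandbox=$(mktemp -d)
mkdir "$sandbox/projects"
mkdir "$sandbox/projects/project-1"

cd "$sandbox/projects/project-1"
cd ../..      # up two levels: now in the sandbox directory
pwd

cd -          # back to the directory we just left; cd - also prints it
cd .          # the single dot is the working directory, so we stay put
pwd
```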

Some examples

Let’s explore some examples of using cd. To help visualize, we will show the graphical representation of our file system vertically:

Suppose our working directory is ~/projects and we want to move to figs in project-1.

Here it is convenient to use relative paths:

cd project-1/figs

Now suppose our working directory is ~/projects and we want to move to reports in docs. How can we do this?

One way is to use relative paths:

cd ../docs/reports

Another is to use the full path:

cd ~/docs/reports

If trying this out on your system, remember auto-complete.

Let’s examine one more example. Suppose we are in ~/projects/project-1/figs and want to change to ~/projects/project-2. Again, there are two ways.

With relative paths:

cd ../../project-2

and with full paths:

cd ~/projects/project-2
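If you want to try these moves yourself, one option is to recreate the directories named in the examples inside a throwaway directory. In this sketch, fake_home is a hypothetical stand-in for ~, mktemp creates the throwaway directory, and the -p argument to mkdir (not covered in this section) creates any missing parent directories.

```shell
# fake_home plays the role of ~ in the examples above
fake_home=$(mktemp -d)

# -p creates missing parent directories along the way
mkdir -p "$fake_home/docs/reports"
mkdir -p "$fake_home/projects/project-1/figs"
mkdir -p "$fake_home/projects/project-2"

# From projects to figs in project-1
cd "$fake_home/projects"
cd project-1/figs
pwd

# From projects to reports in docs
cd "$fake_home/projects"
cd ../docs/reports
pwd

# From figs in project-1 to project-2
cd "$fake_home/projects/project-1/figs"
cd ../../project-2
pwd
```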

mv: moving files

In a point-and-click system, we move files from one directory to another by dragging and dropping. In Unix, we use the mv command.

Warning: mv will not ask “are you sure?” if your move results in overwriting a file.

Now that you know how to use full and relative paths, using mv is relatively straightforward. The general form is:

mv path_to_file path_to_destination_directory

So, for example, if we want to move the file cv.tex from resumes to reports, you could use the full paths like this:

mv ~/docs/resumes/cv.tex ~/docs/reports/

You can also use relative paths. So you could do this:

cd ~/docs/resumes
mv cv.tex ../reports/

or this:

cd ~/docs/reports/
mv ../resumes/cv.tex ./

Notice that, in the last one, we used the working directory shortcut . to give a relative path as the destination directory.

We can also use mv to change the name of a file. To do this, instead of the second argument being the destination directory, it also includes a filename. So, for example, to change the name from cv.tex to resume.tex, we simply type:

cd ~/docs/resumes
mv cv.tex resume.tex

We can also combine the move and a rename. For example:

cd ~/docs/resumes
mv cv.tex ../reports/resume.tex

And we can move entire directories. So to move the resumes directory into reports, we do as follows:

mv ~/docs/resumes ~/docs/reports/

It is important to add the last / to make it clear you do not want to rename the resumes directory to reports, but rather move it into reports.
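The move, rename, and directory-move cases can be tried together in a sketch like this one. It assumes a throwaway directory created with mktemp, uses mkdir -p to create nested directories in one step, and uses echo with > (not covered in this section) only to create a small stand-in file.

```shell
# Recreate docs/resumes and docs/reports in a throwaway directory
sandbox=$(mktemp -d)
mkdir -p "$sandbox/docs/resumes" "$sandbox/docs/reports"
echo "draft" > "$sandbox/docs/resumes/cv.tex"   # stand-in file

cd "$sandbox/docs"

# Move and rename in one step
mv resumes/cv.tex reports/resume.tex

# Move the (now empty) resumes directory into reports
mv resumes reports/

# reports now holds resume.tex and the resumes directory
ls reports
```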

cp: copying files

The command cp behaves similarly to mv except instead of moving, we copy the file, meaning that the original file stays untouched.

So in all the mv examples above, you can switch mv to cp and they will copy instead of move, with one exception: we can’t copy entire directories without learning about arguments.
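A minimal sketch of both points, again in a throwaway directory created with mktemp. The file content and the name backup are hypothetical, and echo with > is used only to create a stand-in file.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
echo "draft" > cv.tex      # stand-in file

# Copy under a new name: the original stays in place
cp cv.tex resume.tex
ls

# Without extra arguments, cp skips directories and reports an error
mkdir resumes
cp resumes backup || echo "cp did not copy the directory"
```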

rm: removing files

In point-and-click systems, we remove files by dragging and dropping them into the trash or using some special click on the mouse. In Unix, we use the rm command.

Warning: Unlike throwing files into the trash, rm is permanent so be careful.

The general way it works is like this:

rm filename

You can also list several files like this:

rm filename-1 filename-2 filename-3

You can use full or relative paths. To remove directories, you will have to learn about arguments which we do later.
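Because rm is permanent, it is worth practicing somewhere harmless first. This sketch assumes a throwaway directory created with mktemp. The directory name keep is hypothetical, and echo with > is used only to create stand-in files.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"

# Create three stand-in files and one directory
echo "a" > filename-1
echo "b" > filename-2
echo "c" > filename-3
mkdir keep

# Remove several files with one command
rm filename-1 filename-2 filename-3

# Without extra arguments, rm refuses to remove a directory
rm keep || echo "rm did not remove the directory"

# Only keep is left
ls
```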

less: looking at a file

Often you want to quickly look at the content of a file. If this file is a text file, the quickest way to do this is by using the command less. To look at the file cv.tex, you do this:

cd ~/docs/resumes
less cv.tex 

To exit the viewer, you type q. If the files are long, you can use the arrow keys to move up and down. There are many other keyboard commands you can use within less to, for example, search or jump pages. You will learn more about this in a later section.

If you are wondering why the command is called less, it is because the original was called more, as in “show me more of this file”. The second version was called less because of the saying “less is more”.

Preparing for a data science project

We are now ready to prepare a directory for a project. You should start by creating a directory where you will keep all your projects. We recommend a directory called projects in your home directory. To do this you would type:

cd ~
mkdir projects

Our project relates to gun violence murders so we will call the directory for our project murders. It will be a subdirectory in our projects directory. In the murders directory, we will create two subdirectories to hold the raw data and intermediate data. We will call these data and rdas respectively.

Open a terminal and make sure you are in the home directory:

cd ~

Now run the following commands to create the directory structure we want. At the end, we use ls and pwd to confirm we have generated the correct directories in the correct working directory:

cd projects
mkdir murders
cd murders
mkdir data rdas 
ls
pwd

Note that the full path of our murders directory is ~/projects/murders.

So if we open a new terminal and want to navigate into that directory we type:

cd projects/murders

In RStudio when you start a new project, you can pick Existing Directory instead of New Directory:

and write the full path of the murders directory:

Once you do this, you will see the rdas and data directories you created in the Files tab.

Keep in mind that when we are in this project, our default working directory will be ~/projects/murders. You can confirm this by typing getwd() into your R session. This is important because it will help us organize the code when we need to write file paths.

Pro tip: Always use relative paths in data science projects. These paths should be relative to the default working directory.

The problem with using full paths is that your code is unlikely to work on filesystems other than yours since the directory structures will be different. This includes using the home directory ~ as part of your path.
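To illustrate why relative paths travel well, the sketch below places two copies of a project in two unrelated locations and reaches the data file with the same relative path in both. The copy_a and copy_b names are hypothetical, mktemp creates the throwaway directories, and the file content is a placeholder rather than real data.

```shell
# Two copies of the same project, in two unrelated locations
copy_a=$(mktemp -d)
copy_b=$(mktemp -d)
mkdir "$copy_a/data" "$copy_b/data"
echo "placeholder" > "$copy_a/data/murders.csv"
echo "placeholder" > "$copy_b/data/murders.csv"

# The same relative path works once we are in either project directory
cd "$copy_a"
ls data/murders.csv

cd "$copy_b"
ls data/murders.csv
```

The full paths of the two files differ, so a script that hard-coded one of them would fail in the other copy.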

Let’s now write a script that downloads a file into the data directory. We will call this file download-data.R.

The content of this file will be:

url <- "https://raw.githubusercontent.com/rafalab/dslabs/master/inst/extdata/murders.csv"
dest_file <- "data/murders.csv"
download.file(url, destfile = dest_file)

Notice that we are using the relative path data/murders.csv.

Run this code in R and you will see that a file is added to the data directory.

Now we are ready to write a script to read this data and prepare a table that we can use for analysis. Call the file wrangle-data.R. The content of this file will be:

library(tidyverse)
murders <- read_csv("data/murders.csv")
murders <- murders %>% mutate(region = factor(region),
                              rate = total / population * 10^5)
save(murders, file = "rdas/murders.rda")

Again note we use relative paths exclusively.

In this file, we introduce a command we have not seen: save. The save command in R saves objects into what is called an rda file, where rda stands for R data. We recommend using the .rda suffix on files saving R objects. You will see that .RData is also used.

If you run the code above, the processed data object will be saved in a file in the rdas directory. Although not the case here, this approach is often practical because generating the data object we use for final analyses and plots can be a complex and time-consuming process. So we run this process once and save the file. But we still want to be able to generate the entire analysis from the raw data.

Now we are ready to write the analysis file. Let’s call it analysis.R. The content should be the following:

library(tidyverse)
load("rdas/murders.rda")

murders %>% mutate(abb = reorder(abb, rate)) %>%
  ggplot(aes(abb, rate)) +
  geom_bar(width = 0.5, stat = "identity", color = "black") +
  coord_flip()

If you run this analysis, you will see that it generates a plot.

Now, suppose we want to save the generated plot for use in a report or presentation. We can do this with the ggplot2 function ggsave. But where do we put the graph? We should be systematically organized so we will save plots to a directory called figs. Start by creating this directory in the terminal:

mkdir figs

and then you can add the line:

ggsave("figs/barplot.png")

to our script. If you run the script now, a png file will be saved into the figs directory. If we wanted to copy that file to some other directory where we are developing a presentation, we can avoid using the mouse by using the cp command in our terminal.
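As a sketch of that last step, the following commands copy the figure into a separate directory using only the terminal. The presentation directory and its location are hypothetical, the whole example runs in a throwaway directory created with mktemp, and the png file is a text placeholder rather than a real image.

```shell
# Recreate the relevant part of the project in a throwaway directory
sandbox=$(mktemp -d)
mkdir -p "$sandbox/projects/murders/figs" "$sandbox/presentation"
echo "placeholder" > "$sandbox/projects/murders/figs/barplot.png"

# From the project directory, copy the figure using relative paths
cd "$sandbox/projects/murders"
cp figs/barplot.png ../../presentation/

# The copy is in presentation and the original is still in figs
ls ../../presentation
ls figs
```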

You now have a self-contained analysis in one directory. One final recommendation is to create a README.txt file describing what each of these files does for the benefit of others reading your code, including your future self. This would not be a script but just some notes. One of the options provided when opening a new file in RStudio is a text file. You can save something like this into the text file:

We analyze US gun murder data collected by the FBI.

download-data.R - downloads csv file to data directory

wrangle-data.R - creates a derived dataset and saves as R object in rdas directory

analysis.R - generates a plot and saves it in the figs directory

If you now type pwd, you will see you are in ~/projects/murders with ~ representing the home directory.

You have now successfully used Unix and RStudio to create a directory for a project. When a new project comes your way, you can simply cd back to your projects directory, create a new directory and start a new project.

You can see a version of this project, organized with Unix directories, on GitHub. You can download a copy to your computer by using the git clone command on your terminal. This command will create a directory called murders in your working directory so be careful where you call it from.

git clone https://github.com/rairizarry/murders.git