In this tutorial, we will walk through the logistics of setting up a course that teaches how to analyze data. It is separated into two sections. First, we will discuss recommended strategies for setting up the course before the first day of class. Then, we will talk about the logistics once the class begins (e.g. lectures, homework assignments, grading, etc.). These recommendations are based on our experience of teaching three Introduction to Data Science courses:
Here is a brief summary of the background of students in the BST 260 course taught in 2016:
We have also written a Guide to Teaching Data Science based on these experiences.
Finally, this tutorial relies heavily on material from the dsbook by Rafael Irizarry.
Generally speaking, we do not recommend using point-and-click approaches for data analysis. Instead, we recommend scripting languages, such as R, since they are more flexible and greatly facilitate reproducibility. Similarly, we recommend against the use of point-and-click approaches to organizing files and document preparation. In this chapter, we demonstrate alternative approaches. Specifically, we recommend using freely available tools that, although they may at first seem cumbersome and non-intuitive, will eventually make you a much more efficient and productive data scientist.
Three general guiding principles that motivate what we learn here are
As you become more proficient at coding, you will find that
R is not a programming language like C or Java. It was not created by software engineers for software development. Instead, it was developed by statisticians as an interactive environment for data analysis. You can read the full history here. The interactivity is an indispensable feature in data science because, as you will soon learn, the ability to quickly explore data is a necessity for success in this field. However, as in other programming languages, you can save your work as scripts that can be easily executed at any moment. These scripts serve as a record of the analysis you performed, a key feature that facilitates reproducible work. If you are an expert programmer, you should not expect R to follow the conventions you are used to, since you will be disappointed. If you are patient, you will come to appreciate the unequaled power of R when it comes to data analysis and, specifically, data visualization.
Other attractive features of R are the following:
To learn more about R, follow these tutorials:
With R Markdown and knitR, we can write reports in a way that will greatly help with situations such as the ones described here. The main feature is that code and textual descriptions can be combined into the same document, and the figures and tables produced by the code are automatically added to the document.
We will put all this together using the powerful integrated development environment RStudio. RStudio will be our launching pad for data science projects. It not only provides an editor for us to create and edit our scripts but also many other useful tools. In this section, we go over some of the basics.
In terms of course logistics, we recommend spending one lecture at the beginning of the course introducing the programming language R and the tools R Markdown and RStudio.
A typical data analysis challenge may involve several parts, each involving several data files, including files containing the scripts we use to analyze data. Keeping all this organized can be challenging. One approach to overcome this challenge is to use the Unix shell as a tool for managing files and directories on your computer system. Using Unix will permit you to use the keyboard rather than the mouse when creating folders, moving from directory to directory, and renaming, deleting or moving files.
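As a minimal sketch of the kind of keyboard-driven file management described above, the commands below create folders, move between directories, and rename, copy, and delete files. The folder and file names (project, raw.csv, analysis.R, and so on) are hypothetical, and everything happens inside a temporary directory so that no existing files are affected.

```shell
# Work inside a throwaway directory so nothing else on the computer is touched.
workdir="$(mktemp -d)"
cd "$workdir"

# Create a project folder with subfolders for data, scripts, and reports.
mkdir -p project/data project/scripts project/reports

# Create two empty placeholder files (hypothetical names).
touch project/data/raw.csv project/scripts/analysis.R

# Move from directory to directory and print where we are.
cd project/scripts
pwd

# Rename a script, then go back up one level.
mv analysis.R clean-data.R
cd ..

# Copy the data into a new backup folder.
mkdir backup
cp data/raw.csv backup/raw-backup.csv

# Delete a file we no longer need and list what is left.
rm backup/raw-backup.csv
ls -R .
```

Students who are new to the shell may find it helpful to run these commands one at a time and check the result with ls after each step.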
The data analysis process is iterative and adaptive. As a result, we are constantly editing our scripts and reports. In this course, we introduce you to the version control system Git, which is a powerful tool for keeping track of these changes. We also introduce you to GitHub, a service that permits you to host and share your code, including building webpages for your code and courses.
Although we could recommend organizing courses with a standard teaching tool, such as Blackboard, we recommend exposing students to the notion of version control. We can achieve this by using one of the most popular systems, git, along with the web-based git repository hosting service, GitHub. GitHub is currently the most widely used resource for code developers including data scientists.
There are three main reasons to use Git and GitHub.
Sharing: Even if we do not take advantage of the advanced and powerful version control functionality, we can still use Git and GitHub to share our code. We have already shown how we can do this with RStudio.
Collaborating: Once you set up a central repo, you can have multiple people make changes to code and keep versions synced. GitHub provides a free service for centralized repos. GitHub also has a special utility, called a pull request, that can be used by anybody to suggest changes to your code. You can easily either accept or deny the request.
Version control: The version control capabilities of Git permit us to keep track of changes we make to our code. We can also revert to previous versions of files. Git also permits us to create branches in which we can test out ideas, then decide whether to merge the new branch with the original.
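The version control ideas above can be tried out locally with a few commands. The following is only a sketch under stated assumptions: the file name analysis.R, the branch name new-idea, and the demo identity are all placeholders, and the repository lives in a temporary directory.

```shell
# Placeholder identity so commits work even where Git is not yet configured.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

repo="$(mktemp -d)"
cd "$repo"
git init --quiet

# Keep track of changes: record two versions of a script.
echo 'x <- 1' > analysis.R
git add analysis.R
git commit --quiet -m "First version of analysis"
echo 'x <- 2' > analysis.R
git commit --quiet -am "Change x"

# Go back to the previous version of the file and record that as a new commit.
git checkout --quiet HEAD~1 -- analysis.R
git commit --quiet -m "Restore first version of analysis"

# Test out an idea on a branch, then merge it into the original branch.
main_branch="$(git symbolic-ref --short HEAD)"
git checkout --quiet -b new-idea
echo 'y <- x + 1' >> analysis.R
git commit --quiet -am "Try a new idea"
git checkout --quiet "$main_branch"
git merge --quiet new-idea

# Show the recorded history.
git log --oneline
```

If the idea on the branch had not worked out, the branch could simply be left unmerged and the original branch would be unaffected.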
In terms of course logistics, we recommend spending one lecture at the beginning of the course introducing the concept of version control, git and GitHub. In our course, to demonstrate the mechanics, we created a test repository (https://github.com/datasciencelabs/test_repo) and asked all the students to use git to obtain a copy of this repository during the lecture. We also introduced the concept of making changes to local repositories and pushing the changes to remote repositories. After this lecture, students were able to stay in sync with the course repository to access the course material at the beginning of each lecture.
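The mechanics covered in that lecture (obtaining a copy, committing a local change, pushing, and staying in sync) can be rehearsed without a network connection. In the sketch below, a local bare repository stands in for the repository hosted on GitHub; the directory names, file names, and demo identity are assumptions made for illustration.

```shell
# Placeholder identity so commits work even where Git is not yet configured.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

sandbox="$(mktemp -d)"
cd "$sandbox"

# 1. Instructor side: create some material, then publish it.
#    The bare repository remote.git stands in for the copy on GitHub.
git init --quiet instructor
cd instructor
echo "# Test repository" > README.md
git add README.md
git commit --quiet -m "Add README"
cd ..
git clone --quiet --bare instructor remote.git

# 2. Two students obtain a copy. In class this step would be
#    git clone https://github.com/datasciencelabs/test_repo
git clone --quiet remote.git student
git clone --quiet remote.git classmate

# 3. One student changes the local repository and pushes to the remote.
cd student
echo "Hello from a student" > hello.txt
git add hello.txt
git commit --quiet -m "Add hello.txt"
git push --quiet origin HEAD
cd ..

# 4. The other student stays in sync by pulling the latest changes.
cd classmate
git pull --quiet
ls
```

In a real course, students would typically only pull from the course repository and push to their own repositories, so step 3 is here purely to demonstrate the push command.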
Now that we know about git and GitHub, here we explain how to create a course website using GitHub Pages. To learn more about this topic, we highly recommend the following resources:
For this course, we registered the GitHub organization howtoteachdatascience. Hypothetically, if a course is called Statistics 110 at ABC University, you could try registering the GitHub organization ABC-Stat110, for example.
A GitHub Organization is similar to a GitHub User, except multiple people can be admins for the organization and it is not tied to any particular user. Members could include multiple instructors or teaching assistants.
Once you have created an organization, you will want to add GitHub repositories, just as you would in your own user account.
github.io
Next, create a special repository whose name begins with the GitHub organization name and ends with .github.io. In this case, we created howtoteachdatascience.github.io.
Once you create the repository, GitHub will ask that you add files to the repository, commit the changes, and push to GitHub.
When you push .html files to GitHub in this specially named repository, including in particular an index.html file, a webpage will be created. We can create the .html files by creating R Markdown files ending in .Rmd and knitting them to .html files. We can also create a special file called _site.yml. This file tells R Markdown how to organize your .html files into a website.
Finally, if we knit the files in our repository, we can get a preview of what the website will look like.
Once you are happy with the website, push the changes and you will see the course website!
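A rough sketch of these steps from the command line follows. The page title, the contents of index.Rmd, and the navigation bar entries are hypothetical, and the remote URL in the comments is our assumption of what the address would look like for the organization above. Knitting requires R with the rmarkdown package, so that step and the final push are shown as comments rather than executed.

```shell
# Placeholder identity so commits work even where Git is not yet configured.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

# Local copy of the special repository, created in a temporary location.
site="$(mktemp -d)/howtoteachdatascience.github.io"
mkdir -p "$site"
cd "$site"
git init --quiet

# Home page source. Knitting index.Rmd produces index.html.
cat > index.Rmd <<'EOF'
---
title: "How to Teach Data Science"
---

Welcome to the course website.
EOF

# Site configuration: site name, where to write the .html files,
# and a navigation bar with a single (hypothetical) entry.
cat > _site.yml <<'EOF'
name: "howtoteachdatascience"
output_dir: "."
navbar:
  title: "How to Teach Data Science"
  left:
    - text: "Home"
      href: index.html
output: html_document
EOF

# Knit every .Rmd file to .html for a local preview
# (needs R with the rmarkdown package installed):
#   Rscript -e 'rmarkdown::render_site()'

git add index.Rmd _site.yml
git commit --quiet -m "Add site skeleton"

# After knitting, add the generated .html files and publish
# (assumed remote URL; requires the repository to exist on GitHub):
#   git add .
#   git commit -m "Add rendered pages"
#   git remote add origin https://github.com/howtoteachdatascience/howtoteachdatascience.github.io.git
#   git push -u origin HEAD
```

Setting output_dir to the repository root is one possible choice here, made so that index.html sits at the top level of the repository.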
In this course, we created the JSM2018 GitHub repository to organize and store the course material.
There are several ways to set up a syllabus, but one way that allows for a bit more flexibility is to create a spreadsheet on Google Drive and embed the spreadsheet in an R Markdown file on the course website.
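One way to do the embedding, sketched below, is to place an HTML iframe in the R Markdown file. This assumes the spreadsheet has been shared through Google's "Publish to the web" option; the URL pattern and YOUR_PUBLISHED_SHEET_ID are placeholders that must be replaced with the embed link Google provides, and the file name syllabus.Rmd is hypothetical.

```shell
# Write a hypothetical syllabus page into a temporary directory.
page_dir="$(mktemp -d)"

cat > "$page_dir/syllabus.Rmd" <<'EOF'
---
title: "Syllabus"
---

<iframe src="https://docs.google.com/spreadsheets/d/e/YOUR_PUBLISHED_SHEET_ID/pubhtml?widget=true&amp;headers=false"
        width="100%" height="600"></iframe>
EOF

# Show the file that would be knitted along with the rest of the site.
cat "$page_dir/syllabus.Rmd"
```

Because the page only points at the spreadsheet, edits made in Google Drive should appear on the website without knitting the page again.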
Here is an example from our course in 2016: http://datasciencelabs.github.io/2016/pages/lectures.html
In this part of the tutorial, we explain what we found useful based on our experience of teaching three Introduction to Data Science courses:
Each lecture and homework assignment was created using literate programming. We prepared lectures using R Markdown and R Presentations and rendered the presentations using RStudio, which provides functionality to easily convert from these formats to PDF or HTML. More importantly, using RStudio also permitted us to run live data analysis during lecture. These documents were available on GitHub (https://github.com/datasciencelabs/2016) to allow students to follow along and run code on their own laptops during class. For each lecture, there were three to four TAs available in the classroom who were walking around to answer questions in person. In addition, we included a link to a Google Document at the top of the R Markdown in each lecture to allow students a venue to ask questions if they did not want to interrupt the lecture. Note that in the course in which we used Python, we used Jupyter Notebooks which provide similar functionality to Rmd and Rpres. Karl Broman has provided several useful tutorials in these formats and others.
We divided lectures into 10 to 30 minute modules and included 3-5 assessment problems in between. These questions consisted of multiple-choice or open-ended questions with most requiring a short data analysis. The solutions, in the form of code required to solve these assessments, were presented and discussed in class and added to the lectures only after the lecture was complete. We asked students to enter their answers in Google Forms that we created before lecture. Seeing these responses permitted us to adapt the pace of the lectures.
Using Google Forms as an active learning tool. (A) Three to five assessments were included in the R Markdown for each lecture, which consisted of either multiple-choice or open-ended questions. (B) Students were given a few minutes during the lecture to answer the questions. (C) Student responses were recorded and instructors could see the responses instantly. The live responses helped adapt the pace of the lectures.
Homework assignments were created in R Markdown and specific code chunks were created for the students to add their code as solutions.
Once the student was satisfied with their solutions, the homework submission was committed to the private GitHub repository as an R Markdown and HTML. The TAs were able to quickly and efficiently access and grade the homework submissions in the individual repositories.
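To illustrate how submissions in individual repositories can be collected for grading, here is a sketch that runs entirely on one machine. Local bare repositories stand in for the students' private GitHub repositories, and the student names, file names, and demo identity are hypothetical; this is not the exact workflow our TAs used.

```shell
# Placeholder identity so commits work even where Git is not yet configured.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

sandbox="$(mktemp -d)"
cd "$sandbox"

# Stand-ins for each student's private repository (hypothetical usernames).
students="student-a student-b"
for s in $students; do
  git init --quiet "work-$s"
  (
    cd "work-$s"
    # Student side: commit the R Markdown source and the knitted HTML.
    echo "Homework 1 answers" > homework1.Rmd
    echo "<html>Homework 1</html>" > homework1.html
    git add homework1.Rmd homework1.html
    git commit --quiet -m "Submit homework 1"
  )
  git clone --quiet --bare "work-$s" "remote-$s.git"
done

# TA side: fetch every submission into a grading folder and check
# that both the .Rmd source and the .html output are present.
mkdir grading
for s in $students; do
  git clone --quiet "remote-$s.git" "grading/$s"
  if [ -f "grading/$s/homework1.Rmd" ] && [ -f "grading/$s/homework1.html" ]; then
    echo "$s: submission complete"
  else
    echo "$s: missing files"
  fi
done
```

With real GitHub repositories, the clone step in the second loop would point at each student's repository URL instead of a local path.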
A complete description of the grading for the Biostatistics 260 course is available here.
Generally, the grading was based on the following:
We also had a Late Day Policy:
The students also completed a month-long final project on a topic of their choice, either on their own or in a group. This portion of the course most closely mimicked the data scientist's experience.
The deliverables for the project included
The project proposal described the motivation for the project, the project objectives, a description of the data, how to obtain the data, an overview of the computational methods proposed to analyze the data and a timeline for completing the project.
TAs were paired with 3-4 groups and met with them to discuss the proposed projects and provide guidance. The students used the concepts learned in and outside of the course to complete the projects. Once the projects were complete, the submitted deliverables were reviewed and the best projects were highlighted at the end of the course.
Some examples of final projects are listed at the bottom of this page: http://datasciencelabs.github.io/2016/pages/projects.html