1 Introduction

The datasets used up to now have been made available to you as R objects, specifically as data frames. The US murders data, the reported heights data, the Gapminder data, and the poll data are all examples. These datasets come included in the dslabs package and we loaded them using the data function. Furthermore, we have made the data available in what is referred to as tidy form, a concept we define later. The tidyverse packages and functions assume that the data is tidy and this assumption is a big part of the reason these packages work so well together.

However, data is very rarely available as part of a package in a real data science project. We did quite a bit of work “behind the scenes” to get the original raw data into the tidy tables you worked with. Much more typical is for the data to be in a file, a database, or extracted from a document, including web pages, tweets, or PDFs. In these cases, the first step is to import the data into R and, when using the tidyverse, tidy the data. This first step usually involves several, often complicated, steps to convert data from its raw form to the tidy form that greatly facilitates the rest of the analysis. We refer to this process as data wrangling.

Here we cover several common steps of the data wrangling process, including importing data into R from files, tidying data, string processing, HTML parsing, working with dates and times, and text mining. Rarely are all these wrangling steps necessary in a single analysis, but data scientists will likely face them all at some point. Some of the examples we use to demonstrate data wrangling techniques are based on the work we did to convert raw data into the tidy datasets provided by the dslabs package and used in the book as examples.

2 Tidy data

library(tidyverse)
library(dslabs)
ds_theme_set()

To help define tidy data, we refer to an example from the data visualization lecture in which we plotted fertility data across time for two countries: South Korea and Germany. To make the plot, we used this subset of the data:

data("gapminder")
tidy_data <- gapminder %>% 
  filter(country %in% c("South Korea", "Germany")) %>%
  select(country, year, fertility)
head(tidy_data)
##       country year fertility
## 1     Germany 1960      2.41
## 2 South Korea 1960      6.16
## 3     Germany 1961      2.44
## 4 South Korea 1961      5.99
## 5     Germany 1962      2.47
## 6 South Korea 1962      5.79

With the data in this format, we could quickly make the desired plot:

tidy_data %>% 
  ggplot(aes(year, fertility, color = country)) +
  geom_point()
## Warning: Removed 2 rows containing missing values (geom_point).

One reason this code works seamlessly is that the data is tidy: each point is represented in a row. This brings us to the definition of tidy data: each row represents one observation and the columns represent the different variables available for each of these observations.

If we go back to the original data provided by Gapminder, we see that it does not start out tidy. We include an example file with the data shown in this graph, mimicking the way it was originally saved in a spreadsheet:

path <- system.file("extdata", package="dslabs")
filename <- file.path(path,  "fertility-two-countries-example.csv")
wide_data <- read_csv(filename)
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   country = col_character()
## )
## See spec(...) for full column specifications.

The object wide_data includes the same information as the object tidy_data except it’s in a different format: a wide format. Here are the first nine columns:

select(wide_data, country, `1960`:`1967`)
## # A tibble: 2 x 9
##   country     `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967`
##   <chr>        <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Germany       2.41   2.44   2.47   2.49   2.49   2.48   2.44   2.37
## 2 South Korea   6.16   5.99   5.79   5.57   5.36   5.16   4.99   4.85

Two important differences between the wide and tidy formats are: 1) in the wide format, each row includes several observations and 2) one of the variables, year, is stored in the header.

The ggplot code we introduced earlier no longer works here. For one, there is no year variable. To use the tidyverse, we need to wrangle this data into tidy format.

3 Reshaping data

As we have seen, having data in tidy format is what makes the tidyverse flow. After the first step in the data analysis process, importing data, a common next step is to reshape the data into a form that facilitates the rest of the analysis. The tidyr package includes several functions that are useful for tidying data.

3.1 gather

One of the most used functions in the tidyr package is gather, which converts wide data into tidy data.

In the third argument of the gather function, you specify the columns that will be gathered. The default is to gather all columns, so in most cases we have to specify them. Here we want the columns 1960, 1961, and so on up to 2015. The first argument sets the name of the column/variable that will hold whatever is currently kept in the wide data column names. In our case, it makes sense to call it year, although we can name it anything. The second argument sets the name of the column/variable that will hold the values; we call it fertility since this is what is stored in the file. Note that nowhere in the file does it tell us this is fertility data. Instead, this information was kept in the file name.

The gathering code looks like this:

new_tidy_data <- wide_data %>%
  gather(year, fertility, `1960`:`2015`)

We can see that the data have been converted to tidy format with columns year and fertility:

head(new_tidy_data)
## # A tibble: 6 x 3
##   country     year  fertility
##   <chr>       <chr>     <dbl>
## 1 Germany     1960       2.41
## 2 South Korea 1960       6.16
## 3 Germany     1961       2.44
## 4 South Korea 1961       5.99
## 5 Germany     1962       2.47
## 6 South Korea 1962       5.79

However, each year resulted in two rows since we have two countries and this column was not gathered.
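
A quick check of the dimensions confirms this: two countries times the 56 years from 1960 through 2015.

nrow(new_tidy_data)   # 2 countries x 56 years = 112 rows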

A somewhat quicker way to write this code is to specify which column will not be gathered, rather than all the columns that will be gathered:

new_tidy_data <- wide_data %>%
  gather(year, fertility, -country)

The new_tidy_data object looks like the original tidy_data we used with just one minor difference. Can you spot it? Look at the data type of the year column:

class(tidy_data$year)
## [1] "integer"
class(new_tidy_data$year)
## [1] "character"

The gather function assumes that column names are characters. So we need a bit more wrangling before we are ready to make a plot. We need to convert the year column to numbers. The gather function has an argument for that, the convert argument:

new_tidy_data <- wide_data %>%
  gather(year, fertility, -country, convert = TRUE)
class(new_tidy_data$year)
## [1] "integer"

We could also have used mutate with as.numeric, as sketched below.
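
A sketch of that alternative; note that as.numeric produces a double, whereas convert = TRUE returned an integer:

wide_data %>%
  gather(year, fertility, -country) %>%
  mutate(year = as.numeric(year)) %>%
  head()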

Now that the data is tidy, we can use the same ggplot code as before:

new_tidy_data %>% ggplot(aes(year, fertility, color = country)) +
  geom_point()

3.2 spread

As we will see in later examples, it is sometimes useful for data wrangling purposes to convert tidy data into wide data. We often use this as an intermediate step in tidying up data. The spread function is basically the inverse of gather. The first argument is for the data, but since we are using the pipe, we don’t show it. The second argument tells spread which variable will be used as the column names. The third argument specifies which variable to use to fill out the cells:

new_wide_data <- new_tidy_data %>% spread(year, fertility)
select(new_wide_data, country, `1960`:`1967`)
## # A tibble: 2 x 9
##   country     `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967`
##   <chr>        <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Germany       2.41   2.44   2.47   2.49   2.49   2.48   2.44   2.37
## 2 South Korea   6.16   5.99   5.79   5.57   5.36   5.16   4.99   4.85

The following diagram can help remind you how these two functions work:
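
In addition to the diagram, a quick round-trip check in code makes the same point: gathering the spread table should recover the tidy one.

all.equal(
  new_wide_data %>% gather(year, fertility, -country, convert = TRUE),
  new_tidy_data
)
# expected to report TRUE, up to row ordering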

3.3 separate

The data wrangling shown above was simple compared to what is usually required. In our example spreadsheet files, we include an illustration that is slightly more complicated. It contains two variables: life expectancy and fertility. However, the way it is stored is not tidy and, as we will explain, not optimal.

path <- system.file("extdata", package = "dslabs")
filename <- file.path(path, "life-expectancy-and-fertility-two-countries-example.csv")

raw_dat <- read_csv(filename)
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   country = col_character()
## )
## See spec(...) for full column specifications.
select(raw_dat, 1:5)
## # A tibble: 2 x 5
##   country     `1960_fertility` `1960_life_expectancy` `1961_fertility`
##   <chr>                  <dbl>                  <dbl>            <dbl>
## 1 Germany                 2.41                   69.3             2.44
## 2 South Korea             6.16                   53.0             5.99
## # ... with 1 more variable: `1961_life_expectancy` <dbl>

First, note that the data is in wide format. Second, notice that this table includes values for two variables, fertility and life expectancy, with the column name encoding which column represents which variable. Storing information encoded in column names is not recommended, but it is quite common. We will put our wrangling skills to work to extract this information and store it in a tidy fashion.

We can start the data wrangling with the gather function, but we should no longer use the column name year for the new column since it also contains the variable type. We will call it key, the default, for now:

dat <- raw_dat %>% gather(key, value, -country)
head(dat)
## # A tibble: 6 x 3
##   country     key                  value
##   <chr>       <chr>                <dbl>
## 1 Germany     1960_fertility        2.41
## 2 South Korea 1960_fertility        6.16
## 3 Germany     1960_life_expectancy 69.3 
## 4 South Korea 1960_life_expectancy 53.0 
## 5 Germany     1961_fertility        2.44
## 6 South Korea 1961_fertility        5.99

The result is not exactly what we refer to as tidy since each observation is associated with two, not one, rows. We want to have the values from the two variables, fertility and life expectancy, in two separate columns. The first challenge to achieve this is to separate the key column into the year and the variable type. Notice that the entries in this column separate the year from the variable name with an underscore:

dat$key[1:5]
## [1] "1960_fertility"       "1960_fertility"       "1960_life_expectancy"
## [4] "1960_life_expectancy" "1961_fertility"

Encoding multiple variables in a column name is such a common problem that the tidyr package includes a function, separate, to split one column into two or more. Apart from the data, the separate function takes three arguments: the name of the column to be separated, the names to be used for the new columns, and the character that separates the variables. So, a first attempt at this is:

dat %>% separate(key, c("year", "variable_name"), "_")

Because "_" is the default separator assumed by separate, we do not have to include it in the code:

dat %>% separate(key, c("year", "variable_name"))
## Warning: Expected 2 pieces. Additional pieces discarded in 112 rows [3, 4,
## 7, 8, 11, 12, 15, 16, 19, 20, 23, 24, 27, 28, 31, 32, 35, 36, 39, 40, ...].
## # A tibble: 224 x 4
##    country     year  variable_name value
##    <chr>       <chr> <chr>         <dbl>
##  1 Germany     1960  fertility      2.41
##  2 South Korea 1960  fertility      6.16
##  3 Germany     1960  life          69.3 
##  4 South Korea 1960  life          53.0 
##  5 Germany     1961  fertility      2.44
##  6 South Korea 1961  fertility      5.99
##  7 Germany     1961  life          69.8 
##  8 South Korea 1961  life          53.8 
##  9 Germany     1962  fertility      2.47
## 10 South Korea 1962  fertility      5.79
## # ... with 214 more rows

The function does separate the values, but we run into a new problem. We receive a warning telling us that additional pieces were discarded in 112 rows, and we see that the life_expectancy variable is truncated to life. This is because the _ is used to separate life and expectancy, not just the year and the variable name. We could add a third column to catch this and let the separate function know which column to fill in with missing values, NA, when there is no third value. Here we tell it to fill the column on the right:

dat %>% separate(key, c("year", "first_variable_name", "second_variable_name"), 
                 fill = "right")
## # A tibble: 224 x 5
##    country     year  first_variable_name second_variable_name value
##    <chr>       <chr> <chr>               <chr>                <dbl>
##  1 Germany     1960  fertility           <NA>                  2.41
##  2 South Korea 1960  fertility           <NA>                  6.16
##  3 Germany     1960  life                expectancy           69.3 
##  4 South Korea 1960  life                expectancy           53.0 
##  5 Germany     1961  fertility           <NA>                  2.44
##  6 South Korea 1961  fertility           <NA>                  5.99
##  7 Germany     1961  life                expectancy           69.8 
##  8 South Korea 1961  life                expectancy           53.8 
##  9 Germany     1962  fertility           <NA>                  2.47
## 10 South Korea 1962  fertility           <NA>                  5.79
## # ... with 214 more rows

However, if we read the separate help file, we find that a better approach is to merge the last two variables when there is an extra separation:

dat %>% separate(key, c("year", "variable_name"), extra = "merge")
## # A tibble: 224 x 4
##    country     year  variable_name   value
##    <chr>       <chr> <chr>           <dbl>
##  1 Germany     1960  fertility        2.41
##  2 South Korea 1960  fertility        6.16
##  3 Germany     1960  life_expectancy 69.3 
##  4 South Korea 1960  life_expectancy 53.0 
##  5 Germany     1961  fertility        2.44
##  6 South Korea 1961  fertility        5.99
##  7 Germany     1961  life_expectancy 69.8 
##  8 South Korea 1961  life_expectancy 53.8 
##  9 Germany     1962  fertility        2.47
## 10 South Korea 1962  fertility        5.79
## # ... with 214 more rows

This achieves the separation we wanted. However, we are not done yet. We need to create a column for each variable. As we learned, the spread function can do this:

dat %>% 
  separate(key, c("year", "variable_name"), extra = "merge") %>%
  spread(variable_name, value) 
## # A tibble: 112 x 4
##    country year  fertility life_expectancy
##    <chr>   <chr>     <dbl>           <dbl>
##  1 Germany 1960       2.41            69.3
##  2 Germany 1961       2.44            69.8
##  3 Germany 1962       2.47            70.0
##  4 Germany 1963       2.49            70.1
##  5 Germany 1964       2.49            70.7
##  6 Germany 1965       2.48            70.6
##  7 Germany 1966       2.44            70.8
##  8 Germany 1967       2.37            71.0
##  9 Germany 1968       2.28            70.6
## 10 Germany 1969       2.17            70.5
## # ... with 102 more rows

The data is now in tidy format with one row for each observation and three variables: year, fertility, and life expectancy.
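
As an aside, newer versions of tidyr (1.0.0 and later) provide pivot_longer and pivot_wider, which supersede gather and spread. Assuming such a version is installed, a sketch of the same reshaping in a single step, using the special .value marker:

raw_dat %>%
  pivot_longer(-country,
               names_to = c("year", ".value"),
               names_pattern = "(\\d+)_(.*)") %>%
  mutate(year = as.integer(year))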

3.4 unite

It is sometimes useful to do the inverse of separate: unite two columns into one. To demonstrate how to use unite, we show code that, although not an optimal approach, serves as an illustration. Suppose that we did not know about extra and used this command to separate:

dat %>% 
  separate(key, c("year", "first_variable_name", "second_variable_name"), fill = "right")
## # A tibble: 224 x 5
##    country     year  first_variable_name second_variable_name value
##    <chr>       <chr> <chr>               <chr>                <dbl>
##  1 Germany     1960  fertility           <NA>                  2.41
##  2 South Korea 1960  fertility           <NA>                  6.16
##  3 Germany     1960  life                expectancy           69.3 
##  4 South Korea 1960  life                expectancy           53.0 
##  5 Germany     1961  fertility           <NA>                  2.44
##  6 South Korea 1961  fertility           <NA>                  5.99
##  7 Germany     1961  life                expectancy           69.8 
##  8 South Korea 1961  life                expectancy           53.8 
##  9 Germany     1962  fertility           <NA>                  2.47
## 10 South Korea 1962  fertility           <NA>                  5.79
## # ... with 214 more rows

We can achieve the same final result by uniting the second and third columns like this:

dat %>% 
  separate(key, c("year", "first_variable_name", "second_variable_name"), fill = "right") %>%
  unite(variable_name, first_variable_name, second_variable_name, sep="_")
## # A tibble: 224 x 4
##    country     year  variable_name   value
##    <chr>       <chr> <chr>           <dbl>
##  1 Germany     1960  fertility_NA     2.41
##  2 South Korea 1960  fertility_NA     6.16
##  3 Germany     1960  life_expectancy 69.3 
##  4 South Korea 1960  life_expectancy 53.0 
##  5 Germany     1961  fertility_NA     2.44
##  6 South Korea 1961  fertility_NA     5.99
##  7 Germany     1961  life_expectancy 69.8 
##  8 South Korea 1961  life_expectancy 53.8 
##  9 Germany     1962  fertility_NA     2.47
## 10 South Korea 1962  fertility_NA     5.79
## # ... with 214 more rows

Then spreading the columns:

dat %>% 
  separate(key, c("year", "first_variable_name", "second_variable_name"), fill = "right") %>%
  unite(variable_name, first_variable_name, second_variable_name, sep="_") %>%
  spread(variable_name, value) %>%
  rename(fertlity = fertility_NA)
## # A tibble: 112 x 4
##    country year  fertlity life_expectancy
##    <chr>   <chr>    <dbl>           <dbl>
##  1 Germany 1960      2.41            69.3
##  2 Germany 1961      2.44            69.8
##  3 Germany 1962      2.47            70.0
##  4 Germany 1963      2.49            70.1
##  5 Germany 1964      2.49            70.7
##  6 Germany 1965      2.48            70.6
##  7 Germany 1966      2.44            70.8
##  8 Germany 1967      2.37            71.0
##  9 Germany 1968      2.28            70.6
## 10 Germany 1969      2.17            70.5
## # ... with 102 more rows
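
Note that newer versions of tidyr allow unite to drop missing pieces via an na.rm argument, which would avoid the fertility_NA name and the final rename. Assuming a version with this argument is installed, a sketch:

dat %>%
  separate(key, c("year", "first_variable_name", "second_variable_name"), fill = "right") %>%
  unite(variable_name, first_variable_name, second_variable_name, sep = "_", na.rm = TRUE) %>%
  spread(variable_name, value)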

4 Combining tables

The information we need for a given analysis may not be all in one table. For example, when forecasting elections we used the function left_join to combine the information from two tables. Here we use a simpler example to illustrate the general challenge of combining tables.

Suppose we want to explore the relationship between population size for US states and electoral votes. We have the population size in this table:

data(murders)
head(murders)
##        state abb region population total
## 1    Alabama  AL  South    4779736   135
## 2     Alaska  AK   West     710231    19
## 3    Arizona  AZ   West    6392017   232
## 4   Arkansas  AR  South    2915918    93
## 5 California  CA   West   37253956  1257
## 6   Colorado  CO   West    5029196    65

and electoral votes in this one:

data(polls_us_election_2016)
head(results_us_election_2016)
##        state electoral_votes clinton trump others
## 1    Alabama               9    34.4  62.1    3.6
## 2     Alaska               3    36.6  51.3   12.2
## 3    Arizona              11    45.1  48.7    6.2
## 4   Arkansas               6    33.7  60.6    5.8
## 5 California              55    61.7  31.6    6.7
## 6   Colorado               9    48.2  43.3    8.6

Simply binding these two tables together column by column will not work since the order of the states is not the same:

identical(results_us_election_2016$state, murders$state)
## [1] FALSE

The join functions, described below, are designed to handle this challenge.

4.1 Joins

The join functions in the dplyr package make sure that the tables are combined so that matching rows are together. If you know SQL, you will see that the approach and syntax are very similar. The general idea is that one needs to identify one or more columns that will serve to match the two tables. Then a new table with the combined information is returned. Notice what happens if we join the two tables above by state using left_join:

tab <- left_join(murders, results_us_election_2016, by = "state")
head(tab)
##        state abb region population total electoral_votes clinton trump
## 1    Alabama  AL  South    4779736   135               9    34.4  62.1
## 2     Alaska  AK   West     710231    19               3    36.6  51.3
## 3    Arizona  AZ   West    6392017   232              11    45.1  48.7
## 4   Arkansas  AR  South    2915918    93               6    33.7  60.6
## 5 California  CA   West   37253956  1257              55    61.7  31.6
## 6   Colorado  CO   West    5029196    65               9    48.2  43.3
##   others
## 1    3.6
## 2   12.2
## 3    6.2
## 4    5.8
## 5    6.7
## 6    8.6

The data has been successfully joined and we can now, for example, make a plot to explore the relationship:

tab %>% ggplot(aes(population/10^6, electoral_votes, label = abb)) +
  geom_point() +
  geom_text_repel() + 
  scale_x_continuous(trans = "log2") +
  scale_y_continuous(trans = "log2") +
  geom_smooth(method = "lm", se = FALSE)

We see the relationship is close to linear with about 2 electoral votes for every million persons, but with smaller states getting higher ratios.

In practice, it is not always the case that each row in one table has a matching row in the other. For this reason, we have several versions of join. To illustrate this challenge, we will take subsets of the tables above. We create the tables tab_1 and tab_2 so that they have some states in common but not all:

tab_1 <- slice(murders, 1:6) %>% 
  select(state, population)
tab_1
##        state population
## 1    Alabama    4779736
## 2     Alaska     710231
## 3    Arizona    6392017
## 4   Arkansas    2915918
## 5 California   37253956
## 6   Colorado    5029196
tab_2 <- slice(results_us_election_2016, c(1:3, 5, 7:8)) %>% 
  select(state, electoral_votes)
tab_2
##         state electoral_votes
## 1     Alabama               9
## 2      Alaska               3
## 3     Arizona              11
## 4  California              55
## 5 Connecticut               7
## 6    Delaware               3

We will use these two tables as examples in the next sections.

4.1.1 Left join

Suppose we want a table like tab_1, but adding electoral votes to whatever states we have available. For this, we use left_join with tab_1 as the first argument.

left_join(tab_1, tab_2)
## Joining, by = "state"
##        state population electoral_votes
## 1    Alabama    4779736               9
## 2     Alaska     710231               3
## 3    Arizona    6392017              11
## 4   Arkansas    2915918              NA
## 5 California   37253956              55
## 6   Colorado    5029196              NA

Note that NAs are added to the two states not appearing in tab_2. Also notice that this function, as well as all the other joins, can receive the first argument through the pipe:

tab_1 %>% left_join(tab_2)
## Joining, by = "state"
##        state population electoral_votes
## 1    Alabama    4779736               9
## 2     Alaska     710231               3
## 3    Arizona    6392017              11
## 4   Arkansas    2915918              NA
## 5 California   37253956              55
## 6   Colorado    5029196              NA

4.1.2 Right join

If instead of a table like tab_1, we want one like tab_2, we can use right_join:

tab_1 %>% right_join(tab_2)
## Joining, by = "state"
##         state population electoral_votes
## 1     Alabama    4779736               9
## 2      Alaska     710231               3
## 3     Arizona    6392017              11
## 4  California   37253956              55
## 5 Connecticut         NA               7
## 6    Delaware         NA               3

Now the NAs are in the column coming from tab_1.

4.1.3 Inner join

If we want to keep only the rows that have information in both tables, we use inner_join. You can think of this as an intersection:

inner_join(tab_1, tab_2)
## Joining, by = "state"
##        state population electoral_votes
## 1    Alabama    4779736               9
## 2     Alaska     710231               3
## 3    Arizona    6392017              11
## 4 California   37253956              55

4.1.4 Full join

If we want to keep all the rows and fill the missing parts with NAs, we can use full_join. You can think of this as a union:

full_join(tab_1, tab_2)
## Joining, by = "state"
##         state population electoral_votes
## 1     Alabama    4779736               9
## 2      Alaska     710231               3
## 3     Arizona    6392017              11
## 4    Arkansas    2915918              NA
## 5  California   37253956              55
## 6    Colorado    5029196              NA
## 7 Connecticut         NA               7
## 8    Delaware         NA               3

4.1.5 Semi join

The semi_join function lets us keep the part of the first table for which we have information in the second. It does not add the columns of the second:

semi_join(tab_1, tab_2)
## Joining, by = "state"
##        state population
## 1    Alabama    4779736
## 2     Alaska     710231
## 3    Arizona    6392017
## 4 California   37253956

4.1.6 Anti join

The function anti_join is the opposite of semi_join. It keeps the elements of the first table for which there is no information in the second:

anti_join(tab_1, tab_2)
## Joining, by = "state"
##      state population
## 1 Arkansas    2915918
## 2 Colorado    5029196

The following diagram summarizes the above joins:
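
In addition to the diagram, a quick recap in code: the number of rows each join returns for tab_1 and tab_2 follows from the four states the two tables share.

c(left  = nrow(left_join(tab_1, tab_2, by = "state")),
  right = nrow(right_join(tab_1, tab_2, by = "state")),
  inner = nrow(inner_join(tab_1, tab_2, by = "state")),
  full  = nrow(full_join(tab_1, tab_2, by = "state")))
# left = 6, right = 6, inner = 4, full = 8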

4.2 Binding

Although we have yet to use it in this book, another common way in which datasets are combined is by binding them. Unlike the join functions, the binding functions do not try to match by a variable, but instead simply combine datasets. If the datasets don’t match in the appropriate dimension, an error results.

4.2.1 Binding columns

The dplyr function bind_cols binds two objects by making them columns in a tibble. For example, suppose we quickly want to make a data frame consisting of numbers:

bind_cols(a = 1:3, b = 4:6)
## # A tibble: 3 x 2
##       a     b
##   <int> <int>
## 1     1     4
## 2     2     5
## 3     3     6

This function requires that we assign names to the columns. Here we chose a and b.

Note that there is an R-base function cbind with the exact same functionality. An important difference is that cbind can create different types of objects, while bind_cols always produces a data frame.
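
A small sketch of that difference:

class(cbind(a = 1:3, b = 4:6))       # a matrix, not a data frame
class(bind_cols(a = 1:3, b = 4:6))   # "tbl_df" "tbl" "data.frame"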

bind_cols can also bind data frames. For example, here we break up the tab data frame into three parts and then bind them back together:

tab_1 <- tab[, 1:3]
tab_2 <- tab[, 4:6]
tab_3 <- tab[, 7:9]
new_tab <- bind_cols(tab_1, tab_2, tab_3)
head(new_tab)
##        state abb region population total electoral_votes clinton trump
## 1    Alabama  AL  South    4779736   135               9    34.4  62.1
## 2     Alaska  AK   West     710231    19               3    36.6  51.3
## 3    Arizona  AZ   West    6392017   232              11    45.1  48.7
## 4   Arkansas  AR  South    2915918    93               6    33.7  60.6
## 5 California  CA   West   37253956  1257              55    61.7  31.6
## 6   Colorado  CO   West    5029196    65               9    48.2  43.3
##   others
## 1    3.6
## 2   12.2
## 3    6.2
## 4    5.8
## 5    6.7
## 6    8.6

4.2.2 Binding by rows

The bind_rows function is similar to bind_cols, but binds rows instead of columns:

tab_1 <- tab[1:2,]
tab_2 <- tab[3:4,]
bind_rows(tab_1, tab_2)
##      state abb region population total electoral_votes clinton trump
## 1  Alabama  AL  South    4779736   135               9    34.4  62.1
## 2   Alaska  AK   West     710231    19               3    36.6  51.3
## 3  Arizona  AZ   West    6392017   232              11    45.1  48.7
## 4 Arkansas  AR  South    2915918    93               6    33.7  60.6
##   others
## 1    3.6
## 2   12.2
## 3    6.2
## 4    5.8

This is based on an R-base function rbind.
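
For these two pieces, the base function gives the same result (a sketch):

rbind(tab_1, tab_2)   # same four rows; rbind requires matching columns, while bind_rows fills missing columns with NAs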

4.3 Set operators

Another set of commands useful for combining datasets are the set operators. When applied to vectors, these behave as their names suggest. However, if the tidyverse, or more specifically dplyr, is loaded, these functions can be used on data frames, not just on vectors.

4.3.1 Intersect

You can take intersections of vectors of any type, such as numeric:

intersect(1:10, 6:15)
## [1]  6  7  8  9 10

or characters:

intersect(c("a","b","c"), c("b","c","d"))
## [1] "b" "c"

But with dplyr loaded, we can also do this for tables having the same column names:

tab_1 <- tab[1:5,]
tab_2 <- tab[3:7,]
intersect(tab_1, tab_2)
##        state abb region population total electoral_votes clinton trump
## 1    Arizona  AZ   West    6392017   232              11    45.1  48.7
## 2   Arkansas  AR  South    2915918    93               6    33.7  60.6
## 3 California  CA   West   37253956  1257              55    61.7  31.6
##   others
## 1    6.2
## 2    5.8
## 3    6.7

4.3.2 Union

Similarly union takes the union of vectors. For example:

union(1:10, 6:15)
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
union(c("a","b","c"), c("b","c","d"))
## [1] "a" "b" "c" "d"

But with dplyr loaded, we can also do this for tables having the same column names:

tab_1 <- tab[1:5,]
tab_2 <- tab[3:7,]
union(tab_1, tab_2)
##         state abb    region population total electoral_votes clinton trump
## 1 Connecticut  CT Northeast    3574097    97               7    54.6  40.9
## 2    Colorado  CO      West    5029196    65               9    48.2  43.3
## 3  California  CA      West   37253956  1257              55    61.7  31.6
## 4    Arkansas  AR     South    2915918    93               6    33.7  60.6
## 5     Arizona  AZ      West    6392017   232              11    45.1  48.7
## 6      Alaska  AK      West     710231    19               3    36.6  51.3
## 7     Alabama  AL     South    4779736   135               9    34.4  62.1
##   others
## 1    4.5
## 2    8.6
## 3    6.7
## 4    5.8
## 5    6.2
## 6   12.2
## 7    3.6

4.3.3 setdiff

The set difference between a first and second argument can be obtained with setdiff. Unlike intersect and union, this function is not symmetric:

setdiff(1:10, 6:15)
## [1] 1 2 3 4 5
setdiff(6:15, 1:10)
## [1] 11 12 13 14 15

As with the functions shown above, we can apply it to data frames:

tab_1 <- tab[1:5,]
tab_2 <- tab[3:7,]
setdiff(tab_1, tab_2)
##     state abb region population total electoral_votes clinton trump others
## 1 Alabama  AL  South    4779736   135               9    34.4  62.1    3.6
## 2  Alaska  AK   West     710231    19               3    36.6  51.3   12.2

4.3.4 setequal

Finally, the function setequal tells us if two sets are the same, regardless of order. So notice that:

setequal(1:5, 1:6)
## [1] FALSE

but:

setequal(1:5, 5:1)
## [1] TRUE

When applied to data frames that are not equal, regardless of row order, dplyr’s setequal provides a useful message letting us know how the sets differ:

setequal(tab_1, tab_2)
## FALSE: Rows in x but not y: 2, 1. Rows in y but not x: 5, 4.

5 Web Scraping

The data we need to answer a question is not always in a spreadsheet, ready for us to read. For example, the US murders dataset originally comes from this Wikipedia page: https://en.wikipedia.org/w/index.php?title=Gun_violence_in_the_United_States_by_state&oldid=843135608. You can see the data table when you visit the webpage:

Unfortunately, there is no link to a data file. To make the data frame we load with data(murders), or the csv file made available through dslabs, we had to do some web scraping.

Web scraping, or web harvesting, is the term we use to describe the process of extracting data from a website. We can do this because the information used by a browser to render webpages is received as a text file from a server. The text is code written in hypertext markup language (HTML). Every browser has a way to show the HTML source code for a page, each one different. On Chrome, you can use Ctrl+U on a PC and Command+Option+U on a Mac.

5.1 HTML

Because this code is accessible, we can download the HTML file, import it into R, and then write programs to extract the information we need from the page. However, once we look at HTML code, this might seem like a daunting task. But we will show you some convenient tools to facilitate the process. To get an idea of how it works, here are a few lines of code from the Wikipedia page that provides the US murders data:

<p>The 2015 U.S. population total was 320.9 million. The 2015 U.S. overall murder rate per 100,000 inhabitants was 4.89.</p>
<h2><span class="mw-headline" id="States">States</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Murder_in_the_United_States_by_state&amp;action=edit&amp;section=1" title="Edit section: States">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
<table class="wikitable sortable">
<tr>
<th>State</th>
<th><a href="/wiki/List_of_U.S._states_and_territories_by_population" title="List of U.S. states and territories by population">Population</a><br />
<small>(total inhabitants)</small><br />
<small>(2015)</small> <sup id="cite_ref-1" class="reference"><a href="#cite_note-1">[1]</a></sup></th>
<th>Murders and Nonnegligent
<p>Manslaughter<br />
<small>(total deaths)</small><br />
<small>(2015)</small> <sup id="cite_ref-2" class="reference"><a href="#cite_note-2">[2]</a></sup></p>
</th>
<th>Murder and Nonnegligent
<p>Manslaughter Rate<br />
<small>(per 100,000 inhabitants)</small><br />
<small>(2015)</small></p>
</th>
</tr>
<tr>
<td><a href="/wiki/Alabama" title="Alabama">Alabama</a></td>
<td>4,853,875</td>
<td>348</td>
<td>7.2</td>
</tr>
<tr>
<td><a href="/wiki/Alaska" title="Alaska">Alaska</a></td>
<td>737,709</td>
<td>59</td>
<td>8.0</td>
</tr>
<tr>

You can actually see the data! We can also see a pattern of how it is stored. If you know HTML, you can write programs that leverage knowledge of these patterns to extract what we want. We also take advantage of a language widely used to make webpages look “pretty” called Cascading Style Sheets (CSS). We say more about this in the CSS Selector Section.

Although we provide tools that make it possible to scrape data without knowing HTML, it is quite useful for data scientists to learn some HTML and CSS. Not only does this improve your scraping skills, but it might come in handy if you are creating a webpage to showcase your work. There are plenty of online courses and tutorials for learning these; two examples are Codecademy and W3Schools.

5.2 The rvest package

The tidyverse provides a web harvesting package called rvest. The first step using this package is to import the webpage into R. The package makes this quite simple:

library(rvest)
## Loading required package: xml2
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
## 
##     pluck
## The following object is masked from 'package:readr':
## 
##     guess_encoding
url <- "https://en.wikipedia.org/w/index.php?title=Gun_violence_in_the_United_States_by_state&oldid=843135608"
h <- read_html(url)

Note that the entire Murders in the US Wikipedia webpage is now contained in h. The class of this object is:

class(h)
## [1] "xml_document" "xml_node"

The rvest package is actually more general; it handles XML documents. XML is a general markup language (that’s what the ML stands for) that can be used to represent any kind of data. HTML is a specific type of XML developed for representing webpages. Here we focus on HTML documents.
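
Because rvest builds on the xml2 package, generic XML can be read in much the same way. A small sketch with a made-up XML string:

library(xml2)
doc <- read_xml("<students><student id='1'>Aya</student><student id='2'>Cheng</student></students>")
xml_text(xml_find_all(doc, "//student"))   # "Aya" "Cheng"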

Now, how do we extract the table from the object h? If we print h, we don’t really see much:

h
## {xml_document}
## <html class="client-nojs" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-sub ...

We know that the information is stored in an HTML table; you can see this in the line of the HTML code above that reads <table class="wikitable sortable">. The different parts of an HTML document, often defined with text in between < and >, are referred to as nodes. The rvest package includes functions to extract nodes of an HTML document: html_nodes extracts all nodes matching a given type and html_node extracts just the first one. To extract the tables from the HTML code we use:

tab <- h %>% html_nodes("table")

Now, instead of the entire webpage, we just have the html code for the tables in the page:

tab
## {xml_nodeset (4)}
## [1] <table id="revision-info-current" class="plainlinks fmbox fmbox-syst ...
## [2] <table class="wikitable">\n<caption>Legend\n</caption>\n<tbody><tr>\ ...
## [3] <table class="wikitable sortable"><tbody>\n<tr>\n<th data-sort-type= ...
## [4] <table class="nowraplinks hlist collapsible collapsed navbox-inner"  ...

The table we are interested in is the third one:

tab[[3]]
## {xml_node}
## <table class="wikitable sortable">
## [1] <tbody>\n<tr>\n<th data-sort-type="text">State\n</th>\n<th data-sort ...

We are not quite there yet because this is clearly not a tidy dataset, not even a data frame. In the code above, you can definitely see a pattern and writing code to extract just the data is very doable. In fact, rvest includes a function just for converting HTML tables into data frames:

tab <- tab[[3]] %>% html_table
class(tab)
## [1] "data.frame"

We are now much closer to having a usable data table:

tab <- tab %>% setNames(c("state", "population", "total", "murder_rate"))
head(tab)
##        state population total murder_rate    NA   NA  NA     NA     NA
## 1    Alabama  4,853,875   348        3[a]  3[a] 48.9 7.2 0.1[a] 0.1[a]
## 2     Alaska    737,709    59          57    39 61.7 8.0    7.7    5.3
## 3    Arizona  6,817,565   309         278   171 32.3 4.5    4.1    2.5
## 4   Arkansas  2,977,853   181         164   110 57.9 6.1    5.5    3.7
## 5 California 38,993,940 1,861       1,861 1,275 20.1 4.8    4.8    3.3
## 6   Colorado  5,448,819   176         176   115 34.3 3.2    3.2    2.1

We still have some wrangling to do. For example, we need to remove the commas and turn characters into numbers. Before continuing with this, we will learn a more general approach to extracting information from web sites.
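
A sketch of part of that cleanup, using readr’s parse_number to strip the commas; this assumes the population and total columns were scraped as character vectors:

tab %>%
  select(state, population, total, murder_rate) %>%
  mutate(population = parse_number(population),
         total = parse_number(total)) %>%
  head()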

5.3 CSS selectors

The default look of a webpage made with the most basic HTML is quite unattractive. The aesthetically pleasing pages we see today are made using CSS, which is used to add style to webpages. The fact that all pages for a company have the same style usually results from their sharing the same CSS file to define the style. The general way these CSS files work is by defining how each of the elements of a webpage will look. The title, headings, itemized lists, tables, and links, for example, each receive their own style, including font, color, size, and distance from the margin. CSS does this by leveraging patterns used to define these elements, referred to as selectors. An example of such a pattern, which we used above, is table, but there are many, many more.

If we want to grab data from a webpage and we happen to know a selector that is unique to the part of the page containing this data, we can use the html_nodes function. However, knowing which selector to use can be quite complicated. In fact, the complexity of webpages has been increasing as they become more sophisticated. For some of the more advanced ones, it seems almost impossible to find the nodes that define a particular piece of data. However, selector gadgets make this possible.

SelectorGadget is a piece of software that allows you to interactively determine what CSS selector you need to extract specific components from the webpage. If you plan on scraping data other than tables from HTML pages, we highly recommend you install it. A Chrome extension is available which permits you to turn on the gadget and then, as you click through the page, it highlights parts and shows you the selector you need to extract those parts. There are various demos of how to do this, including author Hadley Wickham’s vignette and these two tutorials based on the package vignette.

We add an example here that tries to extract the recipe name, total preparation time, and list of ingredients from this Guacamole recipe. Looking at the code for this page, the task may seem impossibly complex. For the Guacamole recipe page, we have already determined that we need the following selectors:

h <- read_html("http://www.foodnetwork.com/recipes/alton-brown/guacamole-recipe-1940609")
recipe <- h %>% html_node(".o-AssetTitle__a-HeadlineText") %>% html_text()
prep_time <- h %>% html_node(".o-RecipeInfo__a-Description--Total") %>% html_text()
ingredients <- h %>% html_nodes(".o-Ingredients__a-ListItemText") %>% html_text()

You can see how complex the selectors are. In any case, we are now ready to extract what we want and create a list:

guacamole <- list(recipe, prep_time, ingredients)
guacamole
## [[1]]
## [1] "Guacamole"
## 
## [[2]]
## [1] NA
## 
## [[3]]
## character(0)

Since recipe pages from this website follow this general layout, we can use this code to create a function that extracts this information:

get_recipe <- function(url){
  h <- read_html(url)
  recipe <- h %>% html_node(".o-AssetTitle__a-HeadlineText") %>% html_text()
  prep_time <- h %>% html_node(".o-RecipeInfo__a-Description--Total") %>% html_text()
  ingredients <- h %>% html_nodes(".o-Ingredients__a-ListItemText") %>% html_text()
  return(list(recipe = recipe, prep_time = prep_time, ingredients = ingredients))
}

Then we can use it to extract recipes from any of their webpages:

get_recipe("http://www.foodnetwork.com/recipes/food-network-kitchen/pancakes-recipe-1913844")
## $recipe
## [1] "Pancakes"
## 
## $prep_time
## [1] "22 min"
## 
## $ingredients
## [1] "1 1/2 cups all-purpose flour"                      
## [2] "3 tablespoons sugar"                               
## [3] "1 tablespoon baking powder"                        
## [4] "1/4 teaspoon salt"                                 
## [5] "1/8 teaspoon freshly ground nutmeg"                
## [6] "2 large eggs, at room temperature"                 
## [7] "1 1/4 cups milk, at room temperature"              
## [8] "1/2 teaspoon pure vanilla extract"                 
## [9] "3 tablespoons unsalted butter, plus more as needed"
get_recipe("https://www.foodnetwork.com/recipes/giada-de-laurentiis/pasta-primavera-recipe-1942131")
## $recipe
## [1] "Pasta Primavera"
## 
## $prep_time
## [1] NA
## 
## $ingredients
## character(0)

There are several other powerful tools provided by rvest. For example, the functions html_form, set_values, and submit_form permit you to query a webpage from R. This is a more advanced topic not covered here.

5.4 JSON

Sharing data on the internet has become more and more common. Unfortunately, providers use different formats, which makes it harder for data scientists to wrangle data into R. Yet there are some standards that are becoming more common. Currently, a format that is being widely adopted is the JavaScript Object Notation, or JSON. Because this format is very general, it is nothing like a spreadsheet. JSON files look more like the code you use to define a list. Here is an example of information stored in JSON format:

## [
##   {
##     "name": "Miguel",
##     "student_id": 1,
##     "exam_1": 85,
##     "exam_2": 86
##   },
##   {
##     "name": "Sofia",
##     "student_id": 2,
##     "exam_1": 94,
##     "exam_2": 93
##   },
##   {
##     "name": "Aya",
##     "student_id": 3,
##     "exam_1": 87,
##     "exam_2": 88
##   },
##   {
##     "name": "Cheng",
##     "student_id": 4,
##     "exam_1": 90,
##     "exam_2": 91
##   }
## ]
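
A minimal sketch of reading a JSON string with this structure into a data frame:

library(jsonlite)
json_txt <- '[{"name": "Miguel", "student_id": 1, "exam_1": 85, "exam_2": 86},
              {"name": "Sofia",  "student_id": 2, "exam_1": 94, "exam_2": 93}]'
fromJSON(json_txt)   # returns a data frame with one row per record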

The file above actually represents a data frame, and, as the sketch shows, we can read it with the function fromJSON from the jsonlite package. Note that JSON files are often made available via the internet. Several organizations provide a JSON API or a web service that you can connect to directly to obtain data. Here is an example:

library(jsonlite)
citi_bike <- fromJSON("http://citibikenyc.com/stations/json")

This downloads a list. The first element tells you when you downloaded the data:

citi_bike$executionTime
## [1] "2018-07-30 10:54:29 PM"

and the second is a data table:

head(citi_bike$stationBeanList)
##    id              stationName availableDocks totalDocks latitude
## 1 150        E 2 St & Avenue C              9         29 40.72087
## 2 307    Canal St & Rutgers St             12         30 40.71427
## 3 445       E 10 St & Avenue A             20         52 40.72741
## 4  72         W 52 St & 11 Ave             21         39 40.76727
## 5  79 Franklin St & W Broadway             32         33 40.71912
## 6  82   St James Pl & Pearl St             24         27 40.71117
##   longitude statusValue statusKey availableBikes               stAddress1
## 1 -73.98086  In Service         1             20        E 2 St & Avenue C
## 2 -73.98990  In Service         1             16    Canal St & Rutgers St
## 3 -73.98142  In Service         1             30       E 10 St & Avenue A
## 4 -73.99393  In Service         1             18         W 52 St & 11 Ave
## 5 -74.00667  In Service         1              1 Franklin St & W Broadway
## 6 -74.00017  In Service         1              1   St James Pl & Pearl St
##   stAddress2 city postalCode location altitude testStation
## 1                                                    FALSE
## 2                                                    FALSE
## 3                                                    FALSE
## 4                                                    FALSE
## 5                                                    FALSE
## 6                                                    FALSE
##    lastCommunicationTime landMark
## 1 2018-07-30 10:53:34 PM         
## 2 2018-07-30 10:51:31 PM         
## 3 2018-07-30 10:51:59 PM         
## 4 2018-07-30 10:52:00 PM         
## 5 2018-07-30 10:51:09 PM         
## 6 2018-07-30 10:52:14 PM

You can learn much more by examining tutorials and help files from the jsonlite package. This package is intended for relatively simple tasks such as converting data into tables. For more flexibility, we recommend rjson.

6 Text mining

With the exception of labels used to represent categorical data, we have focused on numerical data. But in many applications, data starts as text. Well-known examples are spam filtering, cyber-crime prevention, counter-terrorism, and sentiment analysis. In all these examples, the raw data is composed of free-form text. Our task is to extract insights from these data. In this section, we learn how to generate useful numerical summaries from text data to which we can apply some of the powerful data visualization and analysis techniques we have learned.

6.1 Case study: Trump tweets

During the 2016 US presidential election, then candidate Donald J. Trump used his twitter account as a way to communicate with potential voters. On August 6, 2016, Todd Vaziri tweeted about Trump that “Every non-hyperbolic tweet is from iPhone (his staff). Every hyperbolic tweet is from Android (from him).” Data scientist David Robinson conducted an analysis to determine if data supported this assertion. Here we go through David’s analysis to learn some of the basics of text mining. To learn more about text mining in R we recommend this book.

We will use the following libraries:

library(lubridate)
library(tidyr)
library(scales)

In general, we can extract data directly from Twitter using an R package that accesses the Twitter API. However, in this case, a group has already compiled data for us and made it available at http://www.trumptwitterarchive.com. We can get the data from their JSON API using a script like this:

url <- 'http://www.trumptwitterarchive.com/data/realdonaldtrump/%s.json'
trump_tweets <- map(2009:2017, ~sprintf(url, .x)) %>%
  map_df(jsonlite::fromJSON, simplifyDataFrame = TRUE) %>%
  filter(!is_retweet & !str_detect(text, '^"')) %>%
  mutate(created_at = parse_date_time(created_at, orders = "a b! d! H!:M!:S! z!* Y!", tz="EST")) 

For convenience, we include the result of the code above in the dslabs package:

library(dslabs)
data("trump_tweets")

This is the data frame with information about the tweets:

head(trump_tweets)
##               source     id_str
## 1 Twitter Web Client 6971079756
## 2 Twitter Web Client 6312794445
## 3 Twitter Web Client 6090839867
## 4 Twitter Web Client 5775731054
## 5 Twitter Web Client 5364614040
## 6 Twitter Web Client 5203117820
##                                                                                                                                         text
## 1       From Donald Trump: Wishing everyone a wonderful holiday & a happy, healthy, prosperous New Year. Let’s think like champions in 2010!
## 2 Trump International Tower in Chicago ranked 6th tallest building in world by Council on Tall Buildings & Urban Habitat http://bit.ly/sqvQq
## 3                                                                             Wishing you and yours a very Happy and Bountiful Thanksgiving!
## 4                       Donald Trump Partners with TV1 on New Reality Series Entitled, Omarosa's Ultimate Merger: http://tinyurl.com/yk5m3lc
## 5                         --Work has begun, ahead of schedule, to build the greatest golf course in history: Trump International – Scotland.
## 6              --From Donald Trump: "Ivanka and Jared’s wedding was spectacular, and they make a beautiful couple. I’m a very proud father."
##            created_at retweet_count in_reply_to_user_id_str favorite_count
## 1 2009-12-23 12:38:18            28                    <NA>             12
## 2 2009-12-03 14:39:09            33                    <NA>              6
## 3 2009-11-26 14:55:38            13                    <NA>             11
## 4 2009-11-16 16:06:10             5                    <NA>              3
## 5 2009-11-02 09:57:56             7                    <NA>              6
## 6 2009-10-27 10:31:48             4                    <NA>              5
##   is_retweet
## 1      FALSE
## 2      FALSE
## 3      FALSE
## 4      FALSE
## 5      FALSE
## 6      FALSE

with the following variables included:

names(trump_tweets)
## [1] "source"                  "id_str"                 
## [3] "text"                    "created_at"             
## [5] "retweet_count"           "in_reply_to_user_id_str"
## [7] "favorite_count"          "is_retweet"

The help file ?trump_tweets provides details on what each variable represents. The tweets are represented by the text variable:

trump_tweets %>% select(text) %>% head
##                                                                                                                                         text
## 1       From Donald Trump: Wishing everyone a wonderful holiday & a happy, healthy, prosperous New Year. Let’s think like champions in 2010!
## 2 Trump International Tower in Chicago ranked 6th tallest building in world by Council on Tall Buildings & Urban Habitat http://bit.ly/sqvQq
## 3                                                                             Wishing you and yours a very Happy and Bountiful Thanksgiving!
## 4                       Donald Trump Partners with TV1 on New Reality Series Entitled, Omarosa's Ultimate Merger: http://tinyurl.com/yk5m3lc
## 5                         --Work has begun, ahead of schedule, to build the greatest golf course in history: Trump International – Scotland.
## 6              --From Donald Trump: "Ivanka and Jared’s wedding was spectacular, and they make a beautiful couple. I’m a very proud father."

and the source variable tells us which device was used to compose and upload each tweet:

trump_tweets %>% count(source) %>% arrange(desc(n))
## # A tibble: 19 x 2
##    source                       n
##    <chr>                    <int>
##  1 Twitter Web Client       10718
##  2 Twitter for Android       4652
##  3 Twitter for iPhone        3962
##  4 TweetDeck                  468
##  5 TwitLonger Beta            288
##  6 Instagram                  133
##  7 Media Studio               114
##  8 Facebook                   104
##  9 Twitter Ads                 96
## 10 Twitter for BlackBerry      78
## 11 Mobile Web (M5)             54
## 12 Twitter for iPad            39
## 13 Twitlonger                  22
## 14 Twitter QandA               10
## 15 Vine - Make a Scene         10
## 16 Periscope                    7
## 17 Neatly For BlackBerry 10     4
## 18 Twitter for Websites         1
## 19 Twitter Mirror for iPad      1

To start, we will use extract to remove the Twitter for part of the source and count how many tweets came from each device:

trump_tweets %>% 
  extract(source, "source", "Twitter for (.*)") %>%
  count(source) 
## # A tibble: 6 x 2
##   source         n
##   <chr>      <int>
## 1 Android     4652
## 2 BlackBerry    78
## 3 iPad          39
## 4 iPhone      3962
## 5 Websites       1
## 6 <NA>       12029

We are interested in what happened during the campaign, so for this analysis we will focus on what was tweeted between the day Trump announced his campaign and election day. We define the following table containing just the tweets from that time period:

campaign_tweets <- trump_tweets %>% 
  extract(source, "source", "Twitter for (.*)") %>%
  filter(source %in% c("Android", "iPhone") &
           created_at >= ymd("2015-06-17") & 
           created_at < ymd("2016-11-08")) %>%
  filter(!is_retweet) %>%
  arrange(created_at)

We can now use data visualization to explore the possibility that two different groups were tweeting from these devices. For each tweet, we will extract the hour it was tweeted, East Coast time (EST), and then compute the proportion of tweets tweeted at each hour for each device:

ds_theme_set()
campaign_tweets %>%
  mutate(hour = hour(with_tz(created_at, "EST"))) %>%
  count(source, hour) %>%
  group_by(source) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup %>%
  ggplot(aes(hour, percent, color = source)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(labels = percent_format()) +
  labs(x = "Hour of day (EST)",
       y = "% of tweets",
       color = "")

We notice a big peak for the Android in the early hours of the morning, between 6 and 8 AM. There seems to be a clear difference in these patterns. We will therefore assume that two different entities are using these two devices.

We will now study how the tweets differ when we compare Android to iPhone. To do this, we introduce the tidytext package.

6.2 Text as data

The tidytext package helps us convert free form text into a tidy table. Having the data in this format greatly facilitates data visualization and the use of statistical techniques.

library(tidytext)

The main function needed to achieve this is unnest_tokens. A token refers to the unit that we consider to be a data point. The most common token is the word, but tokens can also be single characters, n-grams, sentences, lines, or a pattern defined by a regex. The function takes a vector of strings and extracts the tokens so that each one gets a row in the new table. Here is a simple example:

example <- data_frame(line = c(1, 2, 3, 4),
                      text = c("Roses are red,", "Violets are blue,", "Sugar is sweet,", "And so are you."))
example
## # A tibble: 4 x 2
##    line text             
##   <dbl> <chr>            
## 1     1 Roses are red,   
## 2     2 Violets are blue,
## 3     3 Sugar is sweet,  
## 4     4 And so are you.
example %>% unnest_tokens(word, text)
## # A tibble: 13 x 2
##     line word   
##    <dbl> <chr>  
##  1     1 roses  
##  2     1 are    
##  3     1 red    
##  4     2 violets
##  5     2 are    
##  6     2 blue   
##  7     3 sugar  
##  8     3 is     
##  9     3 sweet  
## 10     4 and    
## 11     4 so     
## 12     4 are    
## 13     4 you

Now let’s look at an example from the tweets. We will look at tweet number 3008 because it will later permit us to illustrate a couple of points:

i <- 3008
campaign_tweets$text[i]
## [1] "Great to be back in Iowa! #TBT with @JerryJrFalwell joining me in Davenport- this past winter. #MAGA https://t.co/A5IF0QHnic"
campaign_tweets[i,] %>% 
  unnest_tokens(word, text) %>%
  select(word)
##                   word
## 3008             great
## 3008.1              to
## 3008.2              be
## 3008.3            back
## 3008.4              in
## 3008.5            iowa
## 3008.6             tbt
## 3008.7            with
## 3008.8  jerryjrfalwell
## 3008.9         joining
## 3008.10             me
## 3008.11             in
## 3008.12      davenport
## 3008.13           this
## 3008.14           past
## 3008.15         winter
## 3008.16           maga
## 3008.17          https
## 3008.18           t.co
## 3008.19     a5if0qhnic

Note that the function tries to convert tokens into words. However, to do this, it strips characters that are important in the context of Twitter. Namely, the function removes all the # and @ symbols. A token in the context of Twitter is not the same as in the context of spoken or written English. For this reason, instead of using the default, words, we define a regex that preserves Twitter characters. The pattern may appear complex, but when used with unnest_tokens it defines the characters that separate tokens, so the resulting tokens are strings that start with @, # or neither, followed by any combination of letters or digits:

pattern <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
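
To see what this pattern does (a quick check, not part of the original analysis; the handle in the string below is made up), we can split a string on it with str_split, which is essentially what unnest_tokens does internally for regex tokens:

# the string is split only at the spaces, so "#Iowa" and "@someone" remain
# single tokens
str_split("I will be in #Iowa tomorrow with @someone", pattern)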

We can now use the unnest_tokens function with the regex option and appropriately extract the hashtags and mentions. We demonstrate with our example tweet:

campaign_tweets[i,] %>% 
  unnest_tokens(word, text, token = "regex", pattern = pattern) %>%
  select(word)
##               word
## 1            great
## 2               to
## 3               be
## 4             back
## 5               in
## 6             iowa
## 7             #tbt
## 8             with
## 9  @jerryjrfalwell
## 10         joining
## 11              me
## 12              in
## 13       davenport
## 14            this
## 15            past
## 16          winter
## 17           #maga
## 18           https
## 19               t
## 20              co
## 21      a5if0qhnic

Another minor adjustment we want to make is to remove the links to pictures, as well as the &amp; html entities that appear in some tweets:

campaign_tweets[i,] %>% 
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", ""))  %>%
  unnest_tokens(word, text, token = "regex", pattern = pattern) %>%
  select(word)
##               word
## 1            great
## 2               to
## 3               be
## 4             back
## 5               in
## 6             iowa
## 7             #tbt
## 8             with
## 9  @jerryjrfalwell
## 10         joining
## 11              me
## 12              in
## 13       davenport
## 14            this
## 15            past
## 16          winter
## 17           #maga

Now we are ready to extract the words for all our tweets:

tweet_words <- campaign_tweets %>% 
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", ""))  %>%
  unnest_tokens(word, text, token = "regex", pattern = pattern) 

And we can now answer questions such as “what are the most commonly used words?”:

tweet_words %>% 
  count(word) %>%
  arrange(desc(n))
## # A tibble: 6,211 x 2
##    word      n
##    <chr> <int>
##  1 the    2335
##  2 to     1413
##  3 and    1246
##  4 a      1210
##  5 in     1189
##  6 i      1161
##  7 you    1000
##  8 of      983
##  9 is      944
## 10 on      880
## # ... with 6,201 more rows

It is not surprising that these are the top words; they are common in any English text and therefore not informative. The tidytext package has a database of these commonly used words, referred to as stop words in text mining:

stop_words
## # A tibble: 1,149 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 a           SMART  
##  2 a's         SMART  
##  3 able        SMART  
##  4 about       SMART  
##  5 above       SMART  
##  6 according   SMART  
##  7 accordingly SMART  
##  8 across      SMART  
##  9 actually    SMART  
## 10 after       SMART  
## # ... with 1,139 more rows

If we filter out rows representing stop words with filter(!word %in% stop_words$word):

tweet_words <- campaign_tweets %>% 
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", ""))  %>%
  unnest_tokens(word, text, token = "regex", pattern = pattern) %>%
  filter(!word %in% stop_words$word ) 

we end up with a much more informative set of top 10 tweeted words:

tweet_words %>% 
  count(word) %>%
  top_n(10, n) %>%
  mutate(word = reorder(word, n)) %>%
  arrange(desc(n))
## # A tibble: 10 x 2
##    word                       n
##    <fct>                  <int>
##  1 #trump2016               415
##  2 hillary                  407
##  3 people                   303
##  4 #makeamericagreatagain   296
##  5 america                  254
##  6 clinton                  239
##  7 poll                     220
##  8 crooked                  206
##  9 trump                    199
## 10 cruz                     162

Some exploration of the resulting words (not shown here) reveals a couple of unwanted characteristics in our tokens. First, some of our tokens are just numbers (years, for example). We want to remove these, and we can find them using the regex ^\d+$. Second, some of our tokens come from a quote and start with '. We want to remove the ' when it is at the start of a word, which we do with str_replace. We add these two lines to the code above to generate our final table:

tweet_words <- campaign_tweets %>% 
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", ""))  %>%
  unnest_tokens(word, text, token = "regex", pattern = pattern) %>%
  filter(!word %in% stop_words$word &
           !str_detect(word, "^\\d+$")) %>%
  mutate(word = str_replace(word, "^'", ""))
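
As a quick sanity check (not part of the original analysis), here is how the two new patterns behave on a few made-up tokens:

str_detect(c("2016", "trump2016", "100,000"), "^\\d+$")
## [1]  TRUE FALSE FALSE
str_replace(c("'standing", "don't"), "^'", "")
## [1] "standing" "don't"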

Now that we have all our words in a table, along with the device used to compose the tweet each word came from, we can start exploring which words are more common when comparing Android to iPhone.

For each word, we want to know if it is more likely to come from an Android tweet or an iPhone tweet. We previously introduced the odds ratio as a summary statistic useful for quantifying these differences. For each device and a given word, call it y, we compute the odds, that is, the ratio between the proportion of words that are y and the proportion that are not y, and then take the ratio of those odds across the two devices. Because many words appear on only one device, we will have many proportions that are 0, so we use the 0.5 correction described in the Association Test section.

android_iphone_or <- tweet_words %>%
  count(word, source) %>%
  spread(source, n, fill = 0) %>%
  mutate(or = (Android + 0.5) / (sum(Android) - Android + 0.5) / 
           ( (iPhone + 0.5) / (sum(iPhone) - iPhone + 0.5)))
android_iphone_or %>% arrange(desc(or))
## # A tibble: 5,509 x 4
##    word      Android iPhone    or
##    <chr>       <dbl>  <dbl> <dbl>
##  1 mails          22      0  39.3
##  2 poor           13      0  23.6
##  3 poorly         12      0  21.8
##  4 @cbsnews       11      0  20.1
##  5 bosses         11      0  20.1
##  6 turnberry      11      0  20.1
##  7 angry          10      0  18.3
##  8 write          10      0  18.3
##  9 brexit          9      0  16.6
## 10 defend          9      0  16.6
## # ... with 5,499 more rows
android_iphone_or %>% arrange(or)
## # A tibble: 5,509 x 4
##    word                   Android iPhone      or
##    <chr>                    <dbl>  <dbl>   <dbl>
##  1 #makeamericagreatagain       0    296 0.00144
##  2 #americafirst                0     71 0.00607
##  3 #draintheswamp               0     63 0.00683
##  4 #trump2016                   3    412 0.00718
##  5 #votetrump                   0     56 0.00769
##  6 join                         1    157 0.00821
##  7 #imwithyou                   0     51 0.00843
##  8 #crookedhillary              0     30 0.0143 
##  9 #fitn                        0     30 0.0143 
## 10 #gopdebate                   0     30 0.0143 
## # ... with 5,499 more rows
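
To make the formula concrete, here is the same computation carried out for a single word from the table above (a sketch, not part of the original analysis; the name or_check and the choice of the word "mails" are just for illustration, and the totals are recomputed from tweet_words rather than hard-coded):

# total number of words per device, used in place of sum(Android) and sum(iPhone)
totals <- tweet_words %>% count(source) %>% spread(source, n)
android_iphone_or %>%
  filter(word == "mails") %>%
  mutate(or_check = ((Android + 0.5) / (totals$Android - Android + 0.5)) /
                    ((iPhone + 0.5) / (totals$iPhone - iPhone + 0.5)))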

Given that several of these words are low-frequency words overall, we can impose a filter based on the total frequency like this:

android_iphone_or %>% filter(Android+iPhone > 100) %>%
  arrange(desc(or))
## # A tibble: 30 x 4
##    word        Android iPhone    or
##    <chr>         <dbl>  <dbl> <dbl>
##  1 @cnn            104     18  4.95
##  2 bad             104     26  3.45
##  3 crooked         157     49  2.79
##  4 ted              85     28  2.62
##  5 interviewed      76     25  2.62
##  6 media            77     26  2.56
##  7 cruz            116     46  2.19
##  8 hillary         290    119  2.14
##  9 win              74     30  2.14
## 10 president        84     35  2.08
## # ... with 20 more rows
android_iphone_or %>% filter(Android+iPhone > 100) %>%
  arrange(or)
## # A tibble: 30 x 4
##    word                   Android iPhone      or
##    <chr>                    <dbl>  <dbl>   <dbl>
##  1 #makeamericagreatagain       0    296 0.00144
##  2 #trump2016                   3    412 0.00718
##  3 join                         1    157 0.00821
##  4 tomorrow                    25    101 0.218  
##  5 vote                        46     67 0.600  
##  6 america                    114    141 0.703  
##  7 tonight                     71     84 0.737  
##  8 iowa                        62     65 0.831  
##  9 poll                       117    103 0.990  
## 10 trump                      112     92 1.06   
## # ... with 20 more rows

We already see somewhat of a pattern in the types of words that are being tweeted more from one device versus the other. However, we are not interested in specific words but rather in the tone. Vaziri’s assertion is that the Android tweets are more hyperbolic. So how can we check this with data? Hyperbolic is a hard sentiment to extract from words, as it relies on interpreting phrases. However, words can be associated with more basic sentiments such as anger, fear, joy, and surprise. In the next section, we demonstrate basic sentiment analysis.

6.3 Sentiment analysis

In sentiment analysis, we assign a word to one or more “sentiments”. Although this approach will miss context-dependent sentiments, such as sarcasm, when it is performed on large numbers of words, the summaries can provide insights.

The first step in sentiment analysis is to assign a sentiment to each word. The tidytext package includes several maps or lexicons in the object sentiments:

table(sentiments$lexicon)
## 
##    AFINN     bing loughran      nrc 
##     2476     6788     4149    13901

The bing lexicon divides words into positive and negative sentiments. We can see this using the tidytext function get_sentiments:

get_sentiments("bing")
## # A tibble: 6,788 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faced     negative 
##  2 2-faces     negative 
##  3 a+          positive 
##  4 abnormal    negative 
##  5 abolish     negative 
##  6 abominable  negative 
##  7 abominably  negative 
##  8 abominate   negative 
##  9 abomination negative 
## 10 abort       negative 
## # ... with 6,778 more rows

The AFINN lexicon assigns a score between -5 and 5, with -5 the most negative and 5 the most positive.

get_sentiments("afinn")
## # A tibble: 2,476 x 2
##    word       score
##    <chr>      <int>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,466 more rows

The loughran and nrc lexicons provide several different sentiments:

get_sentiments("loughran") %>% count(sentiment)
## # A tibble: 6 x 2
##   sentiment        n
##   <chr>        <int>
## 1 constraining   184
## 2 litigious      903
## 3 negative      2355
## 4 positive       354
## 5 superfluous     56
## 6 uncertainty    297
get_sentiments("nrc") %>% count(sentiment)
## # A tibble: 10 x 2
##    sentiment        n
##    <chr>        <int>
##  1 anger         1247
##  2 anticipation   839
##  3 disgust       1058
##  4 fear          1476
##  5 joy            689
##  6 negative      3324
##  7 positive      2312
##  8 sadness       1191
##  9 surprise       534
## 10 trust         1231

To start learning about how these lexicons were developed, read the help file: ?sentiments.

For our analysis, we are interested in exploring the different sentiments of each tweet so we will use the nrc lexicon:

nrc <- sentiments %>%
  filter(lexicon == "nrc") %>%
  select(word, sentiment)
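
Note that in more recent versions of tidytext the sentiments object only contains the bing lexicon, so the code above may not run as shown. If that is the case, the same table can be obtained with get_sentiments, which may prompt you to download the lexicon through the textdata package:

nrc <- get_sentiments("nrc") %>%
  select(word, sentiment)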

We can combine the words and sentiments using inner_join, which will only keep words associated with a sentiment. Here are 10 random words extracted from the tweets:

tweet_words %>% inner_join(nrc, by = "word") %>% 
  select(source, word, sentiment) %>% 
  sample_n(10)
##        source     word    sentiment
## 4545  Android    phony        anger
## 6369  Android     vote          joy
## 9804  Android   virtue        trust
## 15542  iPhone tomorrow anticipation
## 3451  Android  finally        trust
## 15372  iPhone    honor        trust
## 16163  iPhone     vote     negative
## 11305  iPhone     bomb      sadness
## 10763  iPhone negative      sadness
## 1057  Android   spirit     positive

Now we are ready to perform a quantitative analysis comparing Android and iPhone by comparing the sentiments of the tweets posted from each device. Here we could perform a tweet-by-tweet analysis, assigning a sentiment to each tweet. However, this would be challenging since each tweet will have several sentiments attached to it, one for each word appearing in the lexicon. For illustrative purposes, we will perform a much simpler analysis: we will count and compare the frequencies of each sentiment appearing for each device.

sentiment_counts <- tweet_words %>%
  left_join(nrc, by = "word") %>%
  count(source, sentiment) %>%
  spread(source, n) %>%
  mutate(sentiment = replace_na(sentiment, replace = "none"))
sentiment_counts
## # A tibble: 11 x 3
##    sentiment    Android iPhone
##    <chr>          <int>  <int>
##  1 anger            965    527
##  2 anticipation     915    710
##  3 disgust          641    318
##  4 fear             802    486
##  5 joy              698    540
##  6 negative        1668    935
##  7 positive        1834   1497
##  8 sadness          907    514
##  9 surprise         530    365
## 10 trust           1253   1010
## 11 none           11523  10739

For each sentiment, we can compute the odds of it appearing on each device, that is, the proportion of words with that sentiment divided by the proportion of words without it, and then compute the odds ratio comparing the two devices.

sentiment_counts %>%
  mutate(Android = Android / (sum(Android) - Android) , 
         iPhone = iPhone / (sum(iPhone) - iPhone), 
         or = Android/iPhone) %>%
  arrange(desc(or))
## # A tibble: 11 x 4
##    sentiment    Android iPhone    or
##    <chr>          <dbl>  <dbl> <dbl>
##  1 disgust       0.0304 0.0184 1.66 
##  2 anger         0.0465 0.0308 1.51 
##  3 negative      0.0831 0.0560 1.49 
##  4 sadness       0.0435 0.0300 1.45 
##  5 fear          0.0383 0.0283 1.35 
##  6 surprise      0.0250 0.0211 1.18 
##  7 joy           0.0332 0.0316 1.05 
##  8 anticipation  0.0439 0.0419 1.05 
##  9 trust         0.0612 0.0607 1.01 
## 10 positive      0.0922 0.0927 0.994
## 11 none          1.13   1.56   0.725

So we do see some differences, and the order is interesting: the three largest sentiments are disgust, anger, and negative! But are these differences statistically significant? How do they compare to what we would see if we were just assigning sentiments at random?
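
One way to build intuition for the “at random” comparison (a sketch, not part of the original analysis) is to shuffle the device labels in tweet_words and recompute the sentiment counts; with the labels scrambled, the odds ratios should hover around 1:

set.seed(1)
tweet_words %>%
  mutate(source = sample(source)) %>%   # scramble the device labels
  left_join(nrc, by = "word") %>%
  count(source, sentiment) %>%
  spread(source, n) %>%
  mutate(sentiment = replace_na(sentiment, replace = "none"),
         or = (Android / (sum(Android) - Android)) /
              (iPhone / (sum(iPhone) - iPhone))) %>%
  arrange(desc(or))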

To answer these questions we can compute, for each sentiment, an odds ratio and a confidence interval. We will add the two values we need to form a two-by-two table and the odds ratio:

library(broom)
log_or <- sentiment_counts %>%
  mutate( log_or = log( (Android / (sum(Android) - Android)) / (iPhone / (sum(iPhone) - iPhone))),
          se = sqrt( 1/Android + 1/(sum(Android) - Android) + 1/iPhone + 1/(sum(iPhone) - iPhone)),
          conf.low = log_or - qnorm(0.975)*se,
          conf.high = log_or + qnorm(0.975)*se) %>%
  arrange(desc(log_or))
  
log_or
## # A tibble: 11 x 7
##    sentiment    Android iPhone   log_or     se conf.low conf.high
##    <chr>          <int>  <int>    <dbl>  <dbl>    <dbl>     <dbl>
##  1 disgust          641    318  0.504   0.0694   0.368     0.640 
##  2 anger            965    527  0.411   0.0551   0.303     0.519 
##  3 negative        1668    935  0.395   0.0422   0.313     0.478 
##  4 sadness          907    514  0.372   0.0562   0.262     0.482 
##  5 fear             802    486  0.302   0.0584   0.187     0.416 
##  6 surprise         530    365  0.168   0.0688   0.0332    0.303 
##  7 joy              698    540  0.0495  0.0582  -0.0647    0.164 
##  8 anticipation     915    710  0.0468  0.0511  -0.0533    0.147 
##  9 trust           1253   1010  0.00726 0.0436  -0.0781    0.0926
## 10 positive        1834   1497 -0.00624 0.0364  -0.0776    0.0651
## 11 none           11523  10739 -0.321   0.0206  -0.362    -0.281

A graphical visualization shows that some sentiments are clearly overrepresented in the Android tweets:

log_or %>%
  mutate(sentiment = reorder(sentiment, log_or)) %>%
  ggplot(aes(x = sentiment, ymin = conf.low, ymax = conf.high)) +
  geom_errorbar() +
  geom_point(aes(sentiment, log_or)) +
  ylab("Log odds ratio for association between Android and sentiment") +
  coord_flip() 

We see that the disgust, anger, negative, sadness, and fear sentiments are associated with the Android in a way that is hard to explain by chance alone. Words not associated with a sentiment were strongly associated with the iPhone source, which is in agreement with the original claim about hyperbolic tweets.

If we are interested in exploring which specific words are driving these differences, we can refer back to our android_iphone_or object:

android_iphone_or %>% inner_join(nrc) %>%
  filter(sentiment == "disgust" & Android + iPhone > 10) %>%
  arrange(desc(or))
## Joining, by = "word"
## # A tibble: 20 x 5
##    word      Android iPhone    or sentiment
##    <chr>       <dbl>  <dbl> <dbl> <chr>    
##  1 mess           15      2 5.41  disgust  
##  2 finally        12      2 4.36  disgust  
##  3 unfair         12      2 4.36  disgust  
##  4 bad           104     26 3.45  disgust  
##  5 lie            13      3 3.37  disgust  
##  6 terrible       31      8 3.24  disgust  
##  7 lying           9      3 2.37  disgust  
##  8 waste          12      5 1.98  disgust  
##  9 phony          21      9 1.97  disgust  
## 10 illegal        32     14 1.96  disgust  
## 11 nasty          14      6 1.95  disgust  
## 12 pathetic       11      5 1.82  disgust  
## 13 horrible       14      7 1.69  disgust  
## 14 disaster       21     11 1.63  disgust  
## 15 winning        14      9 1.33  disgust  
## 16 liar            6      5 1.03  disgust  
## 17 dishonest      37     32 1.01  disgust  
## 18 john           24     21 0.994 disgust  
## 19 dying           6      6 0.872 disgust  
## 20 terrorism       9      9 0.872 disgust

and we can make a graph:

android_iphone_or %>% inner_join(nrc, by = "word") %>%
  mutate(sentiment = factor(sentiment, levels = log_or$sentiment)) %>%
  mutate(log_or = log(or)) %>%
  filter(Android + iPhone > 10 & abs(log_or)>1) %>%
  mutate(word = reorder(word, log_or)) %>%
  ggplot(aes(word, log_or, fill = log_or < 0)) +
  facet_wrap(~sentiment, scales = "free_x", nrow = 2) + 
  geom_bar(stat="identity", show.legend = FALSE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) 

This is just a simple example of the many analyses one can perform with tidytext. To learn more, we again recommend the Text Mining with R book by Julia Silge and David Robinson (https://www.tidytextmining.com/).