The datasets used up to now have been made available to you as R objects, specifically as data frames. The US murders data, the reported heights data, the Gapminder data, and the poll data are all examples. These datasets come included in the dslabs package and we loaded them using the data function. Furthermore, we have made the data available in what is referred to as tidy form, a concept we define later. The tidyverse packages and functions assume that the data is tidy, and this assumption is a big part of the reason these packages work so well together.
However, very rarely in a data science project is data easily available as part of a package. We did quite a bit of work “behind the scenes” to get the original raw data into the tidy tables you worked with. Much more typical is for the data to be in a file, a database, or extracted from a document, including web pages, tweets, or PDFs. In these cases, the first step is to import the data into R and, when using the tidyverse, tidy the data. This first step usually involves several, often complicated, steps to convert data from its raw form to the tidy form that greatly facilitates the rest of the analysis. We refer to this process as data wrangling.
Here we cover several common steps of the data wrangling process, including importing data into R from files, tidying data, string processing, HTML parsing, working with dates and times, and text mining. Rarely are all these wrangling steps necessary in a single analysis, but data scientists will likely face them all at some point. Some of the examples we use to demonstrate data wrangling techniques are based on the work we did to convert raw data into the tidy datasets provided by the dslabs package and used in the book as examples.
library(tidyverse)
library(dslabs)
ds_theme_set()
To help define tidy data, we refer to an example from the data visualization lecture in which we plotted fertility data across time for two countries: South Korea and Germany. To make the plot, we used this subset of the data:
data("gapminder")
tidy_data <- gapminder %>%
  filter(country %in% c("South Korea", "Germany")) %>%
  select(country, year, fertility)
head(tidy_data)
## country year fertility
## 1 Germany 1960 2.41
## 2 South Korea 1960 6.16
## 3 Germany 1961 2.44
## 4 South Korea 1961 5.99
## 5 Germany 1962 2.47
## 6 South Korea 1962 5.79
With the data in this format, we could quickly make the desired plot:
tidy_data %>%
ggplot(aes(year, fertility, color = country)) +
geom_point()
## Warning: Removed 2 rows containing missing values (geom_point).
One reason this code works seamlessly is because the data is tidy: each point is represented in a row. This brings us to the definition of tidy data: each row represents one observation and the columns represent the different variables available for each of these observations.
If we go back to the original data provided by Gapminder, we see that it does not start out tidy. We include an example file with the data used in this graph, mimicking the way it was originally saved in a spreadsheet:
path <- system.file("extdata", package="dslabs")
filename <- file.path(path, "fertility-two-countries-example.csv")
wide_data <- read_csv(filename)
## Parsed with column specification:
## cols(
## .default = col_double(),
## country = col_character()
## )
## See spec(...) for full column specifications.
The object wide_data includes the same information as the object tidy_data, except it is in a different format: a wide format. Here are the first nine columns:
select(wide_data, country, `1960`:`1967`)
## # A tibble: 2 x 9
## country `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Germany 2.41 2.44 2.47 2.49 2.49 2.48 2.44 2.37
## 2 South Korea 6.16 5.99 5.79 5.57 5.36 5.16 4.99 4.85
Two important differences between the wide and tidy formats are: 1) in the wide format, each row includes several observations and 2) one of the variables, year, is stored in the header.
The ggplot code we introduced earlier no longer works here. For one, there is no year variable. To use the tidyverse, we need to wrangle this data into tidy format.
As we have seen, having data in tidy format is what makes the tidyverse flow. After the first step in the data analysis process, importing data, a common next step is to reshape the data into a form that facilitates the rest of the analysis. The tidyr package includes several functions that are useful for tidying data.
gather
One of the most used functions in the tidyr package is gather, which converts wide data into tidy data.
In the third argument of the gather function, you specify the columns that will be gathered. The default is to gather all columns so, in most cases, we have to specify the columns. Here we want columns 1960, 1961, up to 2015. The first argument sets the column/variable name that will hold the variable that is currently kept in the wide data column names. In our case, it makes sense to name it year, although we can name it anything. The second argument sets the new column/variable name that will hold the values. In this case, we call it fertility since this is what is stored in this file. Note that nowhere in this file does it tell us this is fertility data. Instead, this information was kept in the file name.
The gathering code looks like this:
new_tidy_data <- wide_data %>%
gather(year, fertility, `1960`:`2015`)
We can see that the data have been converted to tidy format with columns year and fertility:
head(new_tidy_data)
## # A tibble: 6 x 3
## country year fertility
## <chr> <chr> <dbl>
## 1 Germany 1960 2.41
## 2 South Korea 1960 6.16
## 3 Germany 1961 2.44
## 4 South Korea 1961 5.99
## 5 Germany 1962 2.47
## 6 South Korea 1962 5.79
However, each year resulted in two rows since we have two countries and this column was not gathered.
A somewhat quicker way to write this code is to specify which column will not be gathered, rather than all the columns that will be gathered:
new_tidy_data <- wide_data %>%
gather(year, fertility, -country)
The new_tidy_data object looks like the original tidy_data we used, with just one minor difference. Can you spot it? Look at the data type of the year column:
class(tidy_data$year)
## [1] "integer"
class(new_tidy_data$year)
## [1] "character"
The gather function assumes that column names are characters. So we need a bit more wrangling before we are ready to make a plot. We need to convert the year column to numbers. The gather function has an argument for that, the convert argument:
new_tidy_data <- wide_data %>%
gather(year, fertility, -country, convert = TRUE)
class(new_tidy_data$year)
## [1] "integer"
We could also have used mutate and as.numeric.
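Here is a minimal sketch of that alternative, reusing the wide_data object from above (note that as.numeric produces a double rather than an integer, but either works for plotting):
# sketch: convert the year column after gathering, using mutate and as.numeric
new_tidy_data <- wide_data %>%
  gather(year, fertility, -country) %>%
  mutate(year = as.numeric(year))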
Now that the data is tidy, we can use the same ggplot code as before:
new_tidy_data %>% ggplot(aes(year, fertility, color = country)) +
geom_point()
spread
As we will see in later examples, it is sometimes useful for data wrangling purposes to convert tidy data into wide data. We often use this as an intermediate step in tidying up data. The spread function is basically the inverse of gather. The first argument is for the data, but since we are using the pipe, we don’t show it. The second argument tells spread which variable will be used as the column names. The third argument specifies which variable to use to fill out the cells:
new_wide_data <- new_tidy_data %>% spread(year, fertility)
select(new_wide_data, country, `1960`:`1967`)
## # A tibble: 2 x 9
## country `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Germany 2.41 2.44 2.47 2.49 2.49 2.48 2.44 2.37
## 2 South Korea 6.16 5.99 5.79 5.57 5.36 5.16 4.99 4.85
The following diagram can help remind you how these two functions work:
separate
The data wrangling shown above was simple compared to what is usually required. In our example spreadsheet files, we include an illustration that is slightly more complicated. It contains two variables: life expectancy and fertility. However, the way it is stored is not tidy and, as we will explain, not optimal.
path <- system.file("extdata", package = "dslabs")
filename <- file.path(path, "life-expectancy-and-fertility-two-countries-example.csv")
raw_dat <- read_csv(filename)
## Parsed with column specification:
## cols(
## .default = col_double(),
## country = col_character()
## )
## See spec(...) for full column specifications.
select(raw_dat, 1:5)
## # A tibble: 2 x 5
## country `1960_fertility` `1960_life_expectancy` `1961_fertility`
## <chr> <dbl> <dbl> <dbl>
## 1 Germany 2.41 69.3 2.44
## 2 South Korea 6.16 53.0 5.99
## # ... with 1 more variable: `1961_life_expectancy` <dbl>
First, note that the data is in wide format. Second, notice that this table includes values for two variables, fertility and life expectancy, with the column name encoding which column represents which variable. This is not a recommended way to store information, encoded in column names, but it is quite common. We will put our wrangling skills to work to extract this information and store it in a tidy fashion.
We can start the data wrangling with the gather function, but we should no longer use the column name year for the new column since it also contains the variable type. We will call it key, the default, for now:
dat <- raw_dat %>% gather(key, value, -country)
head(dat)
## # A tibble: 6 x 3
## country key value
## <chr> <chr> <dbl>
## 1 Germany 1960_fertility 2.41
## 2 South Korea 1960_fertility 6.16
## 3 Germany 1960_life_expectancy 69.3
## 4 South Korea 1960_life_expectancy 53.0
## 5 Germany 1961_fertility 2.44
## 6 South Korea 1961_fertility 5.99
The result is not exactly what we refer to as tidy since each observation is associated with two, not one, rows. We want to have the values from the two variables, fertility and life expectancy, in two separate columns. The first challenge to achieve this is to separate the key column into the year and the variable type. Notice that the entries in this column separate the year from the variable name with an underscore:
dat$key[1:5]
## [1] "1960_fertility" "1960_fertility" "1960_life_expectancy"
## [4] "1960_life_expectancy" "1961_fertility"
Encoding multiple variables in a column name is such a common problem that the tidyr package includes a function to separate these columns into two or more. Apart from the data, the separate function takes three arguments: the name of the column to be separated, the names to be used for the new columns, and the character that separates the variables. So, a first attempt at this is:
dat %>% separate(key, c("year", "variable_name"), "_")
Because "_" is the default separator assumed by separate, we do not have to include it in the code:
dat %>% separate(key, c("year", "variable_name"))
## Warning: Expected 2 pieces. Additional pieces discarded in 112 rows [3, 4,
## 7, 8, 11, 12, 15, 16, 19, 20, 23, 24, 27, 28, 31, 32, 35, 36, 39, 40, ...].
## # A tibble: 224 x 4
## country year variable_name value
## <chr> <chr> <chr> <dbl>
## 1 Germany 1960 fertility 2.41
## 2 South Korea 1960 fertility 6.16
## 3 Germany 1960 life 69.3
## 4 South Korea 1960 life 53.0
## 5 Germany 1961 fertility 2.44
## 6 South Korea 1961 fertility 5.99
## 7 Germany 1961 life 69.8
## 8 South Korea 1961 life 53.8
## 9 Germany 1962 fertility 2.47
## 10 South Korea 1962 fertility 5.79
## # ... with 214 more rows
The function does separate the values, but we run into a new problem. We receive a warning telling us that additional pieces were discarded in 112 rows, and we see that the life_expectancy variable is truncated to life. This is because the _ is used to separate life and expectancy, not just the year and the variable name! We could add a third column to catch this and let the separate function know which column to fill in with missing values, NA, when there is no third value. Here we tell it to fill the column on the right:
dat %>% separate(key, c("year", "first_variable_name", "second_variable_name"),
fill = "right")
## # A tibble: 224 x 5
## country year first_variable_name second_variable_name value
## <chr> <chr> <chr> <chr> <dbl>
## 1 Germany 1960 fertility <NA> 2.41
## 2 South Korea 1960 fertility <NA> 6.16
## 3 Germany 1960 life expectancy 69.3
## 4 South Korea 1960 life expectancy 53.0
## 5 Germany 1961 fertility <NA> 2.44
## 6 South Korea 1961 fertility <NA> 5.99
## 7 Germany 1961 life expectancy 69.8
## 8 South Korea 1961 life expectancy 53.8
## 9 Germany 1962 fertility <NA> 2.47
## 10 South Korea 1962 fertility <NA> 5.79
## # ... with 214 more rows
However, if we read the separate help file, we find that a better approach is to merge the last two variables when there is an extra separation:
dat %>% separate(key, c("year", "variable_name"), extra = "merge")
## # A tibble: 224 x 4
## country year variable_name value
## <chr> <chr> <chr> <dbl>
## 1 Germany 1960 fertility 2.41
## 2 South Korea 1960 fertility 6.16
## 3 Germany 1960 life_expectancy 69.3
## 4 South Korea 1960 life_expectancy 53.0
## 5 Germany 1961 fertility 2.44
## 6 South Korea 1961 fertility 5.99
## 7 Germany 1961 life_expectancy 69.8
## 8 South Korea 1961 life_expectancy 53.8
## 9 Germany 1962 fertility 2.47
## 10 South Korea 1962 fertility 5.79
## # ... with 214 more rows
This achieves the separation we wanted. However, we are not done yet. We need to create a column for each variable. As we learned, the spread function can do this:
dat %>%
separate(key, c("year", "variable_name"), extra = "merge") %>%
spread(variable_name, value)
## # A tibble: 112 x 4
## country year fertility life_expectancy
## <chr> <chr> <dbl> <dbl>
## 1 Germany 1960 2.41 69.3
## 2 Germany 1961 2.44 69.8
## 3 Germany 1962 2.47 70.0
## 4 Germany 1963 2.49 70.1
## 5 Germany 1964 2.49 70.7
## 6 Germany 1965 2.48 70.6
## 7 Germany 1966 2.44 70.8
## 8 Germany 1967 2.37 71.0
## 9 Germany 1968 2.28 70.6
## 10 Germany 1969 2.17 70.5
## # ... with 102 more rows
The data is now in tidy format, with one row for each observation and three variables: year, fertility, and life expectancy.
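As a quick check that the tidied table behaves like the one we started the section with, here is a minimal sketch, reusing the dat object from above, that remakes the fertility plot directly from it:
# sketch: tidy the data and plot fertility over time, as we did earlier
dat %>%
  separate(key, c("year", "variable_name"), extra = "merge") %>%
  spread(variable_name, value) %>%
  mutate(year = as.integer(year)) %>%
  ggplot(aes(year, fertility, color = country)) +
  geom_point()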
unite
It is sometimes useful to do the inverse of separate, that is, to unite two columns into one. To demonstrate how to use unite, we show code that, although not the optimal approach, serves as an illustration. Suppose that we did not know about extra and used this command to separate:
dat %>%
separate(key, c("year", "first_variable_name", "second_variable_name"), fill = "right")
## # A tibble: 224 x 5
## country year first_variable_name second_variable_name value
## <chr> <chr> <chr> <chr> <dbl>
## 1 Germany 1960 fertility <NA> 2.41
## 2 South Korea 1960 fertility <NA> 6.16
## 3 Germany 1960 life expectancy 69.3
## 4 South Korea 1960 life expectancy 53.0
## 5 Germany 1961 fertility <NA> 2.44
## 6 South Korea 1961 fertility <NA> 5.99
## 7 Germany 1961 life expectancy 69.8
## 8 South Korea 1961 life expectancy 53.8
## 9 Germany 1962 fertility <NA> 2.47
## 10 South Korea 1962 fertility <NA> 5.79
## # ... with 214 more rows
We can achieve the same final result by uniting the second and third columns like this:
dat %>%
separate(key, c("year", "first_variable_name", "second_variable_name"), fill = "right") %>%
unite(variable_name, first_variable_name, second_variable_name, sep="_")
## # A tibble: 224 x 4
## country year variable_name value
## <chr> <chr> <chr> <dbl>
## 1 Germany 1960 fertility_NA 2.41
## 2 South Korea 1960 fertility_NA 6.16
## 3 Germany 1960 life_expectancy 69.3
## 4 South Korea 1960 life_expectancy 53.0
## 5 Germany 1961 fertility_NA 2.44
## 6 South Korea 1961 fertility_NA 5.99
## 7 Germany 1961 life_expectancy 69.8
## 8 South Korea 1961 life_expectancy 53.8
## 9 Germany 1962 fertility_NA 2.47
## 10 South Korea 1962 fertility_NA 5.79
## # ... with 214 more rows
Then spreading the columns:
dat %>%
separate(key, c("year", "first_variable_name", "second_variable_name"), fill = "right") %>%
unite(variable_name, first_variable_name, second_variable_name, sep="_") %>%
spread(variable_name, value) %>%
rename(fertility = fertility_NA)
## # A tibble: 112 x 4
## country year fertility life_expectancy
## <chr> <chr> <dbl> <dbl>
## 1 Germany 1960 2.41 69.3
## 2 Germany 1961 2.44 69.8
## 3 Germany 1962 2.47 70.0
## 4 Germany 1963 2.49 70.1
## 5 Germany 1964 2.49 70.7
## 6 Germany 1965 2.48 70.6
## 7 Germany 1966 2.44 70.8
## 8 Germany 1967 2.37 71.0
## 9 Germany 1968 2.28 70.6
## 10 Germany 1969 2.17 70.5
## # ... with 102 more rows
The information we need for a given analysis may not be in just one table. For example, when forecasting elections we used the function left_join to combine the information from two tables. Here we use a simpler example to illustrate the general challenge of combining tables.
Suppose we want to explore the relationship between population size for US states and electoral votes. We have the population size in this table:
data(murders)
head(murders)
## state abb region population total
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
and electoral votes in this one:
data(polls_us_election_2016)
head(results_us_election_2016)
## state electoral_votes clinton trump others
## 1 Alabama 9 34.4 62.1 3.6
## 2 Alaska 3 36.6 51.3 12.2
## 3 Arizona 11 45.1 48.7 6.2
## 4 Arkansas 6 33.7 60.6 5.8
## 5 California 55 61.7 31.6 6.7
## 6 Colorado 9 48.2 43.3 8.6
Just joining these two tables together will not work since the order of the states is not quite the same:
identical(results_us_election_2016$state, murders$state)
## [1] FALSE
The join functions, described below, are designed to handle this challenge.
The join functions in the dplyr
package make sure that the tables are combined so that matching rows are together. If you know SQL, you will see that the approach and syntax is very similar. The general idea is that one needs to identify one or more columns that will serve to match the two tables. Then a new table with the combined information is returned. Notice what happens if we join the two tables above by state using left_join
:
tab <- left_join(murders, results_us_election_2016, by = "state")
head(tab)
## state abb region population total electoral_votes clinton trump
## 1 Alabama AL South 4779736 135 9 34.4 62.1
## 2 Alaska AK West 710231 19 3 36.6 51.3
## 3 Arizona AZ West 6392017 232 11 45.1 48.7
## 4 Arkansas AR South 2915918 93 6 33.7 60.6
## 5 California CA West 37253956 1257 55 61.7 31.6
## 6 Colorado CO West 5029196 65 9 48.2 43.3
## others
## 1 3.6
## 2 12.2
## 3 6.2
## 4 5.8
## 5 6.7
## 6 8.6
The data has been successfully joined and we can now, for example, make a plot to explore the relationship:
tab %>% ggplot(aes(population/10^6, electoral_votes, label = abb)) +
geom_point() +
geom_text_repel() +
scale_x_continuous(trans = "log2") +
scale_y_continuous(trans = "log2") +
geom_smooth(method = "lm", se = FALSE)
We see the relationship is close to linear, with about 2 electoral votes for every million persons, but with smaller states getting higher ratios.
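A quick sanity check of this rule of thumb is to compute electoral votes per million inhabitants for each state and summarize; this is a sketch using the joined table tab from above (the column name ev_per_million is made up for illustration):
# sketch: electoral votes per million people for each state
tab %>%
  mutate(ev_per_million = electoral_votes / (population / 10^6)) %>%
  summarize(median = median(ev_per_million),
            min = min(ev_per_million),
            max = max(ev_per_million))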
In practice, it is not always the case that each row in one table has a matching row in the other. For this reason, we have several versions of join. To illustrate this challenge, we will take subsets of the tables above. We create the tables tab_1 and tab_2 so that they have some states in common but not all:
tab_1 <- slice(murders, 1:6) %>%
select(state, population)
tab_1
## state population
## 1 Alabama 4779736
## 2 Alaska 710231
## 3 Arizona 6392017
## 4 Arkansas 2915918
## 5 California 37253956
## 6 Colorado 5029196
tab_2 <- slice(results_us_election_2016, c(1:3, 5, 7:8)) %>%
select(state, electoral_votes)
tab_2
## state electoral_votes
## 1 Alabama 9
## 2 Alaska 3
## 3 Arizona 11
## 4 California 55
## 5 Connecticut 7
## 6 Delaware 3
We will use these two tables as examples in the next sections.
Suppose we want a table like tab_1, but adding electoral votes to whatever states we have available. For this, we use left_join with tab_1 as the first argument.
left_join(tab_1, tab_2)
## Joining, by = "state"
## state population electoral_votes
## 1 Alabama 4779736 9
## 2 Alaska 710231 3
## 3 Arizona 6392017 11
## 4 Arkansas 2915918 NA
## 5 California 37253956 55
## 6 Colorado 5029196 NA
Note that NAs are added to the two states not appearing in tab_2. Also notice that this function, as well as all the other joins, can receive the first argument through the pipe:
tab_1 %>% left_join(tab_2)
## Joining, by = "state"
## state population electoral_votes
## 1 Alabama 4779736 9
## 2 Alaska 710231 3
## 3 Arizona 6392017 11
## 4 Arkansas 2915918 NA
## 5 California 37253956 55
## 6 Colorado 5029196 NA
If instead of a table like tab_1, we want one like tab_2, we can use right_join:
tab_1 %>% right_join(tab_2)
## Joining, by = "state"
## state population electoral_votes
## 1 Alabama 4779736 9
## 2 Alaska 710231 3
## 3 Arizona 6392017 11
## 4 California 37253956 55
## 5 Connecticut NA 7
## 6 Delaware NA 3
Now the NAs are in the column coming from tab_1.
If we want to keep only the rows that have information in both tables, we use inner join. You can think of this as an intersection:
inner_join(tab_1, tab_2)
## Joining, by = "state"
## state population electoral_votes
## 1 Alabama 4779736 9
## 2 Alaska 710231 3
## 3 Arizona 6392017 11
## 4 California 37253956 55
If we want to keep all the rows and fill the missing parts with NAs, we can use a full join. You can think of this as a union:
full_join(tab_1, tab_2)
## Joining, by = "state"
## state population electoral_votes
## 1 Alabama 4779736 9
## 2 Alaska 710231 3
## 3 Arizona 6392017 11
## 4 Arkansas 2915918 NA
## 5 California 37253956 55
## 6 Colorado 5029196 NA
## 7 Connecticut NA 7
## 8 Delaware NA 3
The semi_join function lets us keep the part of the first table for which we have information in the second. It does not add the columns of the second:
semi_join(tab_1, tab_2)
## Joining, by = "state"
## state population
## 1 Alabama 4779736
## 2 Alaska 710231
## 3 Arizona 6392017
## 4 California 37253956
The function anti_join is the opposite of semi_join. It keeps the elements of the first table for which there is no information in the second:
anti_join(tab_1, tab_2)
## Joining, by = "state"
## state population
## 1 Arkansas 2915918
## 2 Colorado 5029196
The following diagram summarizes the above joins:
Although we have yet to use it in this book, another common way in which datasets are combined is by binding them. Unlike the join functions, the binding functions do not try to match by a variable, but instead simply combine datasets. If the datasets don’t match by the appropriate dimensions, one obtains an error.
The dplyr function bind_cols binds two objects by making them columns in a tibble. For example, suppose we quickly want to make a data frame consisting of numbers we can use:
bind_cols(a = 1:3, b = 4:6)
## # A tibble: 3 x 2
## a b
## <int> <int>
## 1 1 4
## 2 2 5
## 3 3 6
This function requires that we assign names to the columns. Here we chose a and b.
Note that there is an R-base function, cbind, with the exact same functionality. An important difference is that cbind can create different types of objects, while bind_cols always produces a data frame.
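A minimal sketch of that difference (the column names a and b are just for illustration): cbind applied to two numeric vectors returns a matrix, while bind_cols returns a tibble.
# sketch: compare the classes of the objects produced by cbind and bind_cols
class(cbind(a = 1:3, b = 4:6))
class(bind_cols(a = 1:3, b = 4:6))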
bind_cols can also bind two or more data frames. For example, here we break up the tab data frame and then bind the pieces back together:
tab_1 <- tab[, 1:3]
tab_2 <- tab[, 4:6]
tab_3 <- tab[, 7:9]
new_tab <- bind_cols(tab_1, tab_2, tab_3)
head(new_tab)
## state abb region population total electoral_votes clinton trump
## 1 Alabama AL South 4779736 135 9 34.4 62.1
## 2 Alaska AK West 710231 19 3 36.6 51.3
## 3 Arizona AZ West 6392017 232 11 45.1 48.7
## 4 Arkansas AR South 2915918 93 6 33.7 60.6
## 5 California CA West 37253956 1257 55 61.7 31.6
## 6 Colorado CO West 5029196 65 9 48.2 43.3
## others
## 1 3.6
## 2 12.2
## 3 6.2
## 4 5.8
## 5 6.7
## 6 8.6
The bind_rows function is similar to bind_cols, but binds rows instead of columns:
tab_1 <- tab[1:2,]
tab_2 <- tab[3:4,]
bind_rows(tab_1, tab_2)
## state abb region population total electoral_votes clinton trump
## 1 Alabama AL South 4779736 135 9 34.4 62.1
## 2 Alaska AK West 710231 19 3 36.6 51.3
## 3 Arizona AZ West 6392017 232 11 45.1 48.7
## 4 Arkansas AR South 2915918 93 6 33.7 60.6
## others
## 1 3.6
## 2 12.2
## 3 6.2
## 4 5.8
This is based on the R-base function rbind.
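As a sketch, the base equivalent applied to the two pieces defined above gives essentially the same result; one difference to keep in mind is that bind_rows fills in missing columns with NAs, while rbind fails if the column names do not match.
# sketch: base-R row binding of the two subsets defined above
rbind(tab_1, tab_2)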
Another set of commands useful for combining datasets are the set operators. When applied to vectors, these behave as their names suggest; examples are intersect and union. However, if the tidyverse, or, more specifically, dplyr, is loaded, these functions can be used on data frames as opposed to just on vectors.
You can take intersections of vectors of any type, such as numeric:
intersect(1:10, 6:15)
## [1] 6 7 8 9 10
or characters:
intersect(c("a","b","c"), c("b","c","d"))
## [1] "b" "c"
But with dplyr loaded, we can also do this for tables having the same column names:
tab_1 <- tab[1:5,]
tab_2 <- tab[3:7,]
intersect(tab_1, tab_2)
## state abb region population total electoral_votes clinton trump
## 1 Arizona AZ West 6392017 232 11 45.1 48.7
## 2 Arkansas AR South 2915918 93 6 33.7 60.6
## 3 California CA West 37253956 1257 55 61.7 31.6
## others
## 1 6.2
## 2 5.8
## 3 6.7
Similarly, union takes the union of vectors. For example:
union(1:10, 6:15)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
union(c("a","b","c"), c("b","c","d"))
## [1] "a" "b" "c" "d"
But with dplyr loaded, we can also do this for tables having the same column names:
tab_1 <- tab[1:5,]
tab_2 <- tab[3:7,]
union(tab_1, tab_2)
## state abb region population total electoral_votes clinton trump
## 1 Connecticut CT Northeast 3574097 97 7 54.6 40.9
## 2 Colorado CO West 5029196 65 9 48.2 43.3
## 3 California CA West 37253956 1257 55 61.7 31.6
## 4 Arkansas AR South 2915918 93 6 33.7 60.6
## 5 Arizona AZ West 6392017 232 11 45.1 48.7
## 6 Alaska AK West 710231 19 3 36.6 51.3
## 7 Alabama AL South 4779736 135 9 34.4 62.1
## others
## 1 4.5
## 2 8.6
## 3 6.7
## 4 5.8
## 5 6.2
## 6 12.2
## 7 3.6
setdiff
The set difference between a first and second argument can be obtained with setdiff. Unlike intersect and union, this function is not symmetric:
setdiff(1:10, 6:15)
## [1] 1 2 3 4 5
setdiff(6:15, 1:10)
## [1] 11 12 13 14 15
As with the functions shown above, we can apply it to data frames:
tab_1 <- tab[1:5,]
tab_2 <- tab[3:7,]
setdiff(tab_1, tab_2)
## state abb region population total electoral_votes clinton trump others
## 1 Alabama AL South 4779736 135 9 34.4 62.1 3.6
## 2 Alaska AK West 710231 19 3 36.6 51.3 12.2
setequal
Finally, the function setequal tells us if two sets are the same, regardless of order. So notice that:
setequal(1:5, 1:6)
## [1] FALSE
but:
setequal(1:5, 5:1)
## [1] TRUE
When applied to data frames that are not equal, regardless of order, it provides a useful message letting us know how the sets are different:
setequal(tab_1, tab_2)
## FALSE: Rows in x but not y: 2, 1. Rows in y but not x: 5, 4.
The data we need to answer a question is not always in a spreadsheet, ready for us to read. For example, the US murders dataset originally comes from this Wikipedia page: https://en.wikipedia.org/w/index.php?title=Gun_violence_in_the_United_States_by_state&oldid=843135608. You can see the data table when you visit the webpage:
Unfortunately, there is no link to a data file. To make the data frame we loaded using data(murders), or the csv file made available through dslabs, we had to do some web scraping.
Web scraping, or web harvesting, are the terms we use to describe the process of extracting data from a website. We can do this because the information used by a browser to render webpages is received as a text file from a server. The text is code written in hypertext markup language (HTML). Every browser has a way to show the HTML source code for a page, each one different. On Chrome, you can use Control-U on a PC and command+alt+U on a Mac.
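To convince yourself that a webpage really is just text, you can read a few lines of its source directly into R. Here is a minimal sketch using the base function readLines and the same Wikipedia URL we scrape later in this section (it assumes an internet connection):
url <- "https://en.wikipedia.org/w/index.php?title=Gun_violence_in_the_United_States_by_state&oldid=843135608"
# read only the first few lines of raw HTML
readLines(url, n = 10)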
Because this code is accessible, we can download the HTML file, import it into R, and then write programs to extract the information we need from the page. However, once we look at HTML code, this might seem like a daunting task. But we will show you some convenient tools to facilitate the process. To get an idea of how it works, here are a few lines of code from the Wikipedia page that provides the US murders data:
<p>The 2015 U.S. population total was 320.9 million. The 2015 U.S. overall murder rate per 100,000 inhabitants was 4.89.</p>
<h2><span class="mw-headline" id="States">States</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Murder_in_the_United_States_by_state&action=edit&section=1" title="Edit section: States">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
<table class="wikitable sortable">
<tr>
<th>State</th>
<th><a href="/wiki/List_of_U.S._states_and_territories_by_population" title="List of U.S. states and territories by population">Population</a><br />
<small>(total inhabitants)</small><br />
<small>(2015)</small> <sup id="cite_ref-1" class="reference"><a href="#cite_note-1">[1]</a></sup></th>
<th>Murders and Nonnegligent
<p>Manslaughter<br />
<small>(total deaths)</small><br />
<small>(2015)</small> <sup id="cite_ref-2" class="reference"><a href="#cite_note-2">[2]</a></sup></p>
</th>
<th>Murder and Nonnegligent
<p>Manslaughter Rate<br />
<small>(per 100,000 inhabitants)</small><br />
<small>(2015)</small></p>
</th>
</tr>
<tr>
<td><a href="/wiki/Alabama" title="Alabama">Alabama</a></td>
<td>4,853,875</td>
<td>348</td>
<td>7.2</td>
</tr>
<tr>
<td><a href="/wiki/Alaska" title="Alaska">Alaska</a></td>
<td>737,709</td>
<td>59</td>
<td>8.0</td>
</tr>
<tr>
You can actually see the data! We can also see a pattern of how it is stored. If you know HTML, you can write programs that leverage knowledge of these patterns to extract what we want. We also take advantage of a language widely used to make webpages look “pretty” called Cascading Style Sheets (CSS). We say more about this in the CSS Selector Section.
Although we provide tools that make it possible to scrape data without knowing HTML, it is quite useful for data scientists to learn some HTML and CSS. Not only does this improve your scraping skills, but it might come in handy if you are creating a webpage to showcase your work. There are plenty of online courses and tutorials for learning these; two examples are Codecademy and W3Schools.
The tidyverse provides a web harvesting package called rvest. The first step in using this package is to import the webpage into R. The package makes this quite simple:
library(rvest)
## Loading required package: xml2
##
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
##
## pluck
## The following object is masked from 'package:readr':
##
## guess_encoding
url <- "https://en.wikipedia.org/w/index.php?title=Gun_violence_in_the_United_States_by_state&oldid=843135608"
h <- read_html(url)
Note that the entire Murders in the US Wikipedia webpage is now contained in h. The class of this object is:
class(h)
## [1] "xml_document" "xml_node"
The rvest package is actually more general; it handles XML documents. XML is a general markup language (that’s what the ML stands for) that can be used to represent any kind of data. HTML is a specific type of XML developed for representing webpages. Here we focus on HTML documents.
Now, how do we extract the table from the object h? If we print h, we don’t really see much:
h
## {xml_document}
## <html class="client-nojs" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-sub ...
In this case, we know that the information is stored in an HTML table; you can see this in the line of the HTML code above that reads <table class="wikitable sortable">. The different parts of an HTML document, often defined with a message in between < and >, are referred to as nodes. The rvest package includes functions to extract nodes of an HTML document: html_nodes extracts all nodes of a given type and html_node extracts the first one. To extract the tables from the HTML code we use:
tab <- h %>% html_nodes("table")
Now, instead of the entire webpage, we just have the html code for the tables in the page:
tab
## {xml_nodeset (4)}
## [1] <table id="revision-info-current" class="plainlinks fmbox fmbox-syst ...
## [2] <table class="wikitable">\n<caption>Legend\n</caption>\n<tbody><tr>\ ...
## [3] <table class="wikitable sortable"><tbody>\n<tr>\n<th data-sort-type= ...
## [4] <table class="nowraplinks hlist collapsible collapsed navbox-inner" ...
The table we are interested in is the third one:
tab[[3]]
## {xml_node}
## <table class="wikitable sortable">
## [1] <tbody>\n<tr>\n<th data-sort-type="text">State\n</th>\n<th data-sort ...
We are not quite there yet because this is clearly not a tidy dataset, not even a data frame. In the code above, you can definitely see a pattern, and writing code to extract just the data is very doable. In fact, rvest includes a function just for converting HTML tables into data frames:
tab <- tab[[3]] %>% html_table
class(tab)
## [1] "data.frame"
We are now much closer to having a usable data table:
tab <- tab %>% setNames(c("state", "population", "total", "murder_rate"))
head(tab)
## state population total murder_rate NA NA NA NA NA
## 1 Alabama 4,853,875 348 3[a] 3[a] 48.9 7.2 0.1[a] 0.1[a]
## 2 Alaska 737,709 59 57 39 61.7 8.0 7.7 5.3
## 3 Arizona 6,817,565 309 278 171 32.3 4.5 4.1 2.5
## 4 Arkansas 2,977,853 181 164 110 57.9 6.1 5.5 3.7
## 5 California 38,993,940 1,861 1,861 1,275 20.1 4.8 4.8 3.3
## 6 Colorado 5,448,819 176 176 115 34.3 3.2 3.2 2.1
We still have some wrangling to do. For example, we need to remove the commas and turn the characters into numbers. Before continuing with this, we will learn a more general approach to extracting information from websites.
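As a preview, here is a minimal sketch of the kind of cleaning still needed (not the more general approach introduced next); readr::parse_number strips the commas before converting to numbers, and the column names are the ones we assigned above:
# sketch: remove commas and convert two of the columns to numbers
tab %>%
  mutate(population = parse_number(population),
         total = parse_number(total)) %>%
  head()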
The default look of a webpage made with the most basic HTML is quite unattractive. The aesthetically pleasing pages we see today are made using CSS, which is used to add style to webpages. The fact that all pages for a company have the same style is usually the result of them all using the same CSS file to define the style. The general way these CSS files work is by defining how each of the elements of a webpage will look. The title, headings, itemized lists, tables, and links, for example, each receive their own style, including font, color, size, and distance from the margin. CSS does this by leveraging patterns used to define these elements, referred to as selectors. An example of such a pattern, which we used above, is table, but there are many, many more.
If we want to grab data from a webpage and we happen to know a selector that is unique to the part of the page containing this data, we can use the html_nodes function. However, knowing which selector to use can be quite complicated. In fact, the complexity of webpages has been increasing as they become more sophisticated. For some of the more advanced ones, it seems almost impossible to find the nodes that define a particular piece of data. However, selector gadgets actually make this possible.
SelectorGadget is a piece of software that allows you to interactively determine what CSS selector you need to extract specific components from the webpage. If you plan on scraping data other than tables from HTML pages, we highly recommend you install it. A Chrome extension is available which permits you to turn on the gadget and then, as you click through the page, it highlights parts and shows you the selector you need to extract those parts. There are various demos of how to do this, including author Hadley Wickham’s vignette and these two tutorials based on the package vignette.
We add an example here that tries to extract the recipe name, total preparation time, and list of ingredients from this Guacamole recipe. Looking at the code for this page, the task might seem impossibly complex. For the Guacamole recipe page, we have already determined that we need the following selectors:
h <- read_html("http://www.foodnetwork.com/recipes/alton-brown/guacamole-recipe-1940609")
recipe <- h %>% html_node(".o-AssetTitle__a-HeadlineText") %>% html_text()
prep_time <- h %>% html_node(".o-RecipeInfo__a-Description--Total") %>% html_text()
ingredients <- h %>% html_nodes(".o-Ingredients__a-ListItemText") %>% html_text()
You can see how complex the selectors are. In any case, we are now ready to extract what we want and create a list:
guacamole <- list(recipe, prep_time, ingredients)
guacamole
## [[1]]
## [1] "Guacamole"
##
## [[2]]
## [1] NA
##
## [[3]]
## character(0)
Since recipe pages from this website follow this general layout, we can use this code to create a function that extracts this information:
get_recipe <- function(url){
  h <- read_html(url)
  recipe <- h %>% html_node(".o-AssetTitle__a-HeadlineText") %>% html_text()
  prep_time <- h %>% html_node(".o-RecipeInfo__a-Description--Total") %>% html_text()
  ingredients <- h %>% html_nodes(".o-Ingredients__a-ListItemText") %>% html_text()
  return(list(recipe = recipe, prep_time = prep_time, ingredients = ingredients))
}
Then we can use it to extract recipes from any of their webpages:
get_recipe("http://www.foodnetwork.com/recipes/food-network-kitchen/pancakes-recipe-1913844")
## $recipe
## [1] "Pancakes"
##
## $prep_time
## [1] "22 min"
##
## $ingredients
## [1] "1 1/2 cups all-purpose flour"
## [2] "3 tablespoons sugar"
## [3] "1 tablespoon baking powder"
## [4] "1/4 teaspoon salt"
## [5] "1/8 teaspoon freshly ground nutmeg"
## [6] "2 large eggs, at room temperature"
## [7] "1 1/4 cups milk, at room temperature"
## [8] "1/2 teaspoon pure vanilla extract"
## [9] "3 tablespoons unsalted butter, plus more as needed"
get_recipe("https://www.foodnetwork.com/recipes/giada-de-laurentiis/pasta-primavera-recipe-1942131")
## $recipe
## [1] "Pasta Primavera"
##
## $prep_time
## [1] NA
##
## $ingredients
## character(0)
There are several other powerful tools provided by rvest. For example, the functions html_form, set_values, and submit_form permit you to query a webpage from R. This is a more advanced topic not covered here.
Sharing data on the internet has become more and more common. Unfortunately, providers use different formats, which makes it harder for data scientists to wrangle data into R. Yet there are some standards that are also becoming more common. Currently, a format that is widely being adopted is the JavaScript Object Notation, or JSON. Because this format is very general, it is nothing like a spreadsheet. JSON files look more like the code you use to define a list. Here is an example of information stored in a JSON format:
## [
## {
## "name": "Miguel",
## "student_id": 1,
## "exam_1": 85,
## "exam_2": 86
## },
## {
## "name": "Sofia",
## "student_id": 2,
## "exam_1": 94,
## "exam_2": 93
## },
## {
## "name": "Aya",
## "student_id": 3,
## "exam_1": 87,
## "exam_2": 88
## },
## {
## "name": "Cheng",
## "student_id": 4,
## "exam_1": 90,
## "exam_2": 91
## }
## ]
The file above actually represents a data frame. To read it, we can use the function fromJSON from the jsonlite package. Note that JSON files are often made available via the internet. Several organizations provide a JSON API or a web service that you can connect to directly and obtain data. Here is an example:
library(jsonlite)
citi_bike <- fromJSON("http://citibikenyc.com/stations/json")
This downloads a list. The first element tells you when you downloaded the data:
citi_bike$executionTime
## [1] "2018-07-30 10:54:29 PM"
and the second element is a data table:
head(citi_bike$stationBeanList)
## id stationName availableDocks totalDocks latitude
## 1 150 E 2 St & Avenue C 9 29 40.72087
## 2 307 Canal St & Rutgers St 12 30 40.71427
## 3 445 E 10 St & Avenue A 20 52 40.72741
## 4 72 W 52 St & 11 Ave 21 39 40.76727
## 5 79 Franklin St & W Broadway 32 33 40.71912
## 6 82 St James Pl & Pearl St 24 27 40.71117
## longitude statusValue statusKey availableBikes stAddress1
## 1 -73.98086 In Service 1 20 E 2 St & Avenue C
## 2 -73.98990 In Service 1 16 Canal St & Rutgers St
## 3 -73.98142 In Service 1 30 E 10 St & Avenue A
## 4 -73.99393 In Service 1 18 W 52 St & 11 Ave
## 5 -74.00667 In Service 1 1 Franklin St & W Broadway
## 6 -74.00017 In Service 1 1 St James Pl & Pearl St
## stAddress2 city postalCode location altitude testStation
## 1 FALSE
## 2 FALSE
## 3 FALSE
## 4 FALSE
## 5 FALSE
## 6 FALSE
## lastCommunicationTime landMark
## 1 2018-07-30 10:53:34 PM
## 2 2018-07-30 10:51:31 PM
## 3 2018-07-30 10:51:59 PM
## 4 2018-07-30 10:52:00 PM
## 5 2018-07-30 10:51:09 PM
## 6 2018-07-30 10:52:14 PM
You can learn much more by examining tutorials and help files from the jsonlite package. This package is intended for relatively simple tasks such as converting data into tables. For more flexibility, we recommend rjson.
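As a small sketch of how general the format is, jsonlite can also go the other way and turn an R data frame into JSON text (the data frame here is made up for illustration):
library(jsonlite)
# sketch: convert a small data frame into pretty-printed JSON
df <- data.frame(name = c("Miguel", "Sofia"), exam_1 = c(85, 94))
toJSON(df, pretty = TRUE)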
With the exception of labels used to represent categorical data, we have focused on numerical data. But in many applications, data starts as text. Well-known examples are spam filtering, cyber-crime prevention, counter-terrorism, and sentiment analysis. In all these examples, the raw data is composed of free form text. Our task is to extract insights from these data. In this section, we learn how to generate useful numerical summaries from text data to which we can apply some of the powerful data visualization and analysis techniques we have learned.
During the 2016 US presidential election, then-candidate Donald J. Trump used his Twitter account as a way to communicate with potential voters. On August 6, 2016, Todd Vaziri tweeted about Trump that “Every non-hyperbolic tweet is from iPhone (his staff). Every hyperbolic tweet is from Android (from him).” Data scientist David Robinson conducted an analysis to determine if data supported this assertion. Here we go through David’s analysis to learn some of the basics of text mining. To learn more about text mining in R we recommend this book.
We will use the following libraries:
library(lubridate)
library(tidyr)
library(scales)
In general, we can extract data directly from Twitter using an R package that accesses the Twitter API. However, in this case, a group has already compiled data for us and made it available at http://www.trumptwitterarchive.com. We can get the data from their JSON API using a script like this:
url <- 'http://www.trumptwitterarchive.com/data/realdonaldtrump/%s.json'
trump_tweets <- map(2009:2017, ~sprintf(url, .x)) %>%
map_df(jsonlite::fromJSON, simplifyDataFrame = TRUE) %>%
filter(!is_retweet & !str_detect(text, '^"')) %>%
mutate(created_at = parse_date_time(created_at, orders = "a b! d! H!:M!:S! z!* Y!", tz="EST"))
For convenience, we include the result of the code above in the dslabs package:
library(dslabs)
data("trump_tweets")
This is the data frame with information about the tweets:
head(trump_tweets)
## source id_str
## 1 Twitter Web Client 6971079756
## 2 Twitter Web Client 6312794445
## 3 Twitter Web Client 6090839867
## 4 Twitter Web Client 5775731054
## 5 Twitter Web Client 5364614040
## 6 Twitter Web Client 5203117820
## text
## 1 From Donald Trump: Wishing everyone a wonderful holiday & a happy, healthy, prosperous New Year. Let’s think like champions in 2010!
## 2 Trump International Tower in Chicago ranked 6th tallest building in world by Council on Tall Buildings & Urban Habitat http://bit.ly/sqvQq
## 3 Wishing you and yours a very Happy and Bountiful Thanksgiving!
## 4 Donald Trump Partners with TV1 on New Reality Series Entitled, Omarosa's Ultimate Merger: http://tinyurl.com/yk5m3lc
## 5 --Work has begun, ahead of schedule, to build the greatest golf course in history: Trump International – Scotland.
## 6 --From Donald Trump: "Ivanka and Jared’s wedding was spectacular, and they make a beautiful couple. I’m a very proud father."
## created_at retweet_count in_reply_to_user_id_str favorite_count
## 1 2009-12-23 12:38:18 28 <NA> 12
## 2 2009-12-03 14:39:09 33 <NA> 6
## 3 2009-11-26 14:55:38 13 <NA> 11
## 4 2009-11-16 16:06:10 5 <NA> 3
## 5 2009-11-02 09:57:56 7 <NA> 6
## 6 2009-10-27 10:31:48 4 <NA> 5
## is_retweet
## 1 FALSE
## 2 FALSE
## 3 FALSE
## 4 FALSE
## 5 FALSE
## 6 FALSE
with the following variables included:
names(trump_tweets)
## [1] "source" "id_str"
## [3] "text" "created_at"
## [5] "retweet_count" "in_reply_to_user_id_str"
## [7] "favorite_count" "is_retweet"
The help file ?trump_tweets provides details on what each variable represents. The tweets are represented by the text variable:
trump_tweets %>% select(text) %>% head
## text
## 1 From Donald Trump: Wishing everyone a wonderful holiday & a happy, healthy, prosperous New Year. Let’s think like champions in 2010!
## 2 Trump International Tower in Chicago ranked 6th tallest building in world by Council on Tall Buildings & Urban Habitat http://bit.ly/sqvQq
## 3 Wishing you and yours a very Happy and Bountiful Thanksgiving!
## 4 Donald Trump Partners with TV1 on New Reality Series Entitled, Omarosa's Ultimate Merger: http://tinyurl.com/yk5m3lc
## 5 --Work has begun, ahead of schedule, to build the greatest golf course in history: Trump International – Scotland.
## 6 --From Donald Trump: "Ivanka and Jared’s wedding was spectacular, and they make a beautiful couple. I’m a very proud father."
and the source variable tells us which device was used to compose and upload each tweet:
trump_tweets %>% count(source) %>% arrange(desc(n))
## # A tibble: 19 x 2
## source n
## <chr> <int>
## 1 Twitter Web Client 10718
## 2 Twitter for Android 4652
## 3 Twitter for iPhone 3962
## 4 TweetDeck 468
## 5 TwitLonger Beta 288
## 6 Instagram 133
## 7 Media Studio 114
## 8 Facebook 104
## 9 Twitter Ads 96
## 10 Twitter for BlackBerry 78
## 11 Mobile Web (M5) 54
## 12 Twitter for iPad 39
## 13 Twitlonger 22
## 14 Twitter QandA 10
## 15 Vine - Make a Scene 10
## 16 Periscope 7
## 17 Neatly For BlackBerry 10 4
## 18 Twitter for Websites 1
## 19 Twitter Mirror for iPad 1
To start, we will use extract to remove the Twitter for part of the source:
trump_tweets %>%
extract(source, "source", "Twitter for (.*)") %>%
count(source)
## # A tibble: 6 x 2
## source n
## <chr> <int>
## 1 Android 4652
## 2 BlackBerry 78
## 3 iPad 39
## 4 iPhone 3962
## 5 Websites 1
## 6 <NA> 12029
We are interested in what happened during the campaign, so for this analysis we will focus on what was tweeted between the day Trump announced his campaign and election day. We define the following table containing just the tweets from that time period:
campaign_tweets <- trump_tweets %>%
extract(source, "source", "Twitter for (.*)") %>%
filter(source %in% c("Android", "iPhone") &
created_at >= ymd("2015-06-17") &
created_at < ymd("2016-11-08")) %>%
filter(!is_retweet) %>%
arrange(created_at)
We can now use data visualization to explore the possibility that two different groups were tweeting from these devices. For each tweet, we will extract the hour, East Coast time (EST), at which it was tweeted, and then compute the proportion of tweets tweeted at each hour for each device:
ds_theme_set()
campaign_tweets %>%
mutate(hour = hour(with_tz(created_at, "EST"))) %>%
count(source, hour) %>%
group_by(source) %>%
mutate(percent = n / sum(n)) %>%
ungroup %>%
ggplot(aes(hour, percent, color = source)) +
geom_line() +
geom_point() +
scale_y_continuous(labels = percent_format()) +
labs(x = "Hour of day (EST)",
y = "% of tweets",
color = "")
We notice a big peak for the Android in the early hours of the morning, between 6 and 8 AM. There seems to be a clear difference in these patterns. We will therefore assume that two different entities are using these two devices.
We will now study how the tweets differ when we compare Android to iPhone. To do this, we introduce the tidytext package.
The tidytext package helps us convert free form text into a tidy table. Having the data in this format greatly facilitates data visualization and the use of statistical techniques.
library(tidytext)
The main function needed to achieve this is unnest_tokens. A token refers to the unit that we are considering to be a data point. The most common tokens are words, but they can also be single characters, ngrams, sentences, lines, or a pattern defined by a regex. The function takes a vector of strings and extracts the tokens so that each one gets a row in the new table. Here is a simple example:
example <- data_frame(line = c(1, 2, 3, 4),
text = c("Roses are red,", "Violets are blue,", "Sugar is sweet,", "And so are you."))
example
## # A tibble: 4 x 2
## line text
## <dbl> <chr>
## 1 1 Roses are red,
## 2 2 Violets are blue,
## 3 3 Sugar is sweet,
## 4 4 And so are you.
example %>% unnest_tokens(word, text)
## # A tibble: 13 x 2
## line word
## <dbl> <chr>
## 1 1 roses
## 2 1 are
## 3 1 red
## 4 2 violets
## 5 2 are
## 6 2 blue
## 7 3 sugar
## 8 3 is
## 9 3 sweet
## 10 4 and
## 11 4 so
## 12 4 are
## 13 4 you
Now let’s look at an example from the tweets. We will look at tweet number 3008 because it will later permit us to illustrate a couple of points:
i <- 3008
campaign_tweets$text[i]
## [1] "Great to be back in Iowa! #TBT with @JerryJrFalwell joining me in Davenport- this past winter. #MAGA https://t.co/A5IF0QHnic"
campaign_tweets[i,] %>%
unnest_tokens(word, text) %>%
select(word)
## word
## 3008 great
## 3008.1 to
## 3008.2 be
## 3008.3 back
## 3008.4 in
## 3008.5 iowa
## 3008.6 tbt
## 3008.7 with
## 3008.8 jerryjrfalwell
## 3008.9 joining
## 3008.10 me
## 3008.11 in
## 3008.12 davenport
## 3008.13 this
## 3008.14 past
## 3008.15 winter
## 3008.16 maga
## 3008.17 https
## 3008.18 t.co
## 3008.19 a5if0qhnic
Note that the function tries to convert tokens into words. However, to do this, it strips characters that are important in the context of Twitter. Namely, the function removes all the # and @ symbols. A token in the context of Twitter is not the same as in the context of spoken or written English. For this reason, instead of using the default, words, we define a regex that captures Twitter characters. This may appear complex, but all we are defining is a pattern that starts with @, # or neither, and is followed by any combination of letters or digits:
pattern <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
We can now use the unnest_tokens function with the regex option and appropriately extract the hashtags and mentions. We demonstrate with our example tweet:
campaign_tweets[i,] %>%
unnest_tokens(word, text, token = "regex", pattern = pattern) %>%
select(word)
## word
## 1 great
## 2 to
## 3 be
## 4 back
## 5 in
## 6 iowa
## 7 #tbt
## 8 with
## 9 @jerryjrfalwell
## 10 joining
## 11 me
## 12 in
## 13 davenport
## 14 this
## 15 past
## 16 winter
## 17 #maga
## 18 https
## 19 t
## 20 co
## 21 a5if0qhnic
Another minor adjustment we want to make is to remove the links to pictures:
campaign_tweets[i,] %>%
mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&", "")) %>%
unnest_tokens(word, text, token = "regex", pattern = pattern) %>%
select(word)
## word
## 1 great
## 2 to
## 3 be
## 4 back
## 5 in
## 6 iowa
## 7 #tbt
## 8 with
## 9 @jerryjrfalwell
## 10 joining
## 11 me
## 12 in
## 13 davenport
## 14 this
## 15 past
## 16 winter
## 17 #maga
Now we are ready to extract the words for all our tweets.
tweet_words <- campaign_tweets %>%
mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&", "")) %>%
unnest_tokens(word, text, token = "regex", pattern = pattern)
And we can now answer questions such as “what are the most commonly used words?”:
tweet_words %>%
count(word) %>%
arrange(desc(n))
## # A tibble: 6,211 x 2
## word n
## <chr> <int>
## 1 the 2335
## 2 to 1413
## 3 and 1246
## 4 a 1210
## 5 in 1189
## 6 i 1161
## 7 you 1000
## 8 of 983
## 9 is 944
## 10 on 880
## # ... with 6,201 more rows
It is not surprising that these are the top words, but they are not informative. The tidytext package has a database of these commonly used words, referred to as stop words in text mining:
stop_words
## # A tibble: 1,149 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # ... with 1,139 more rows
If we filter out rows representing stop words with filter(!word %in% stop_words$word):
tweet_words <- campaign_tweets %>%
mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&", "")) %>%
unnest_tokens(word, text, token = "regex", pattern = pattern) %>%
filter(!word %in% stop_words$word )
we end up with a much more informative set of top 10 tweeted words:
tweet_words %>%
count(word) %>%
top_n(10, n) %>%
mutate(word = reorder(word, n)) %>%
arrange(desc(n))
## # A tibble: 10 x 2
## word n
## <fct> <int>
## 1 #trump2016 415
## 2 hillary 407
## 3 people 303
## 4 #makeamericagreatagain 296
## 5 america 254
## 6 clinton 239
## 7 poll 220
## 8 crooked 206
## 9 trump 199
## 10 cruz 162
Some exploration of the resulting words (not shown here) reveals a couple of unwanted characteristics in our tokens. First, some of our tokens are just numbers (years, for example). We want to remove these, and we can find them using the regex ^\d+$. Second, some of our tokens come from a quote and start with '. We want to remove the ' when it is at the start of a word, which we can do with str_replace. We add these two lines to the code above to generate our final table:
tweet_words <- campaign_tweets %>%
mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&", "")) %>%
unnest_tokens(word, text, token = "regex", pattern = pattern) %>%
filter(!word %in% stop_words$word &
!str_detect(word, "^\\d+$")) %>%
mutate(word = str_replace(word, "^'", ""))
Now that we have all our words in a table, along with information about what device was used to compose the tweet they came from, we can start exploring which words are more common when comparing Android to iPhone.
For each word, we want to know if it is more likely to come from an Android tweet or an iPhone tweet. We previously introduced the odds ratio as a summary statistic useful for quantifying these differences. For each device and a given word, call it y, we compute the odds: the proportion of words that are y divided by the proportion that are not y. We then take the ratio of the Android odds to the iPhone odds. Here we will have many proportions that are 0, so we use the 0.5 correction described in the Association Test section.
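Written out, the quantity computed below for each word is the following (this just restates the mutate call in formula form, with Android and iPhone denoting the counts of the word on each device and $N_A$, $N_I$ the total word counts per device):

$$\mathrm{OR} = \frac{(\mathrm{Android} + 0.5)\,/\,(N_A - \mathrm{Android} + 0.5)}{(\mathrm{iPhone} + 0.5)\,/\,(N_I - \mathrm{iPhone} + 0.5)}$$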
android_iphone_or <- tweet_words %>%
count(word, source) %>%
spread(source, n, fill = 0) %>%
mutate(or = (Android + 0.5) / (sum(Android) - Android + 0.5) /
( (iPhone + 0.5) / (sum(iPhone) - iPhone + 0.5)))
android_iphone_or %>% arrange(desc(or))
## # A tibble: 5,509 x 4
## word Android iPhone or
## <chr> <dbl> <dbl> <dbl>
## 1 mails 22 0 39.3
## 2 poor 13 0 23.6
## 3 poorly 12 0 21.8
## 4 @cbsnews 11 0 20.1
## 5 bosses 11 0 20.1
## 6 turnberry 11 0 20.1
## 7 angry 10 0 18.3
## 8 write 10 0 18.3
## 9 brexit 9 0 16.6
## 10 defend 9 0 16.6
## # ... with 5,499 more rows
android_iphone_or %>% arrange(or)
## # A tibble: 5,509 x 4
## word Android iPhone or
## <chr> <dbl> <dbl> <dbl>
## 1 #makeamericagreatagain 0 296 0.00144
## 2 #americafirst 0 71 0.00607
## 3 #draintheswamp 0 63 0.00683
## 4 #trump2016 3 412 0.00718
## 5 #votetrump 0 56 0.00769
## 6 join 1 157 0.00821
## 7 #imwithyou 0 51 0.00843
## 8 #crookedhillary 0 30 0.0143
## 9 #fitn 0 30 0.0143
## 10 #gopdebate 0 30 0.0143
## # ... with 5,499 more rows
Given that several of these words are overall low frequency words, we can impose a filter based on the total frequency like this:
android_iphone_or %>% filter(Android+iPhone > 100) %>%
arrange(desc(or))
## # A tibble: 30 x 4
## word Android iPhone or
## <chr> <dbl> <dbl> <dbl>
## 1 @cnn 104 18 4.95
## 2 bad 104 26 3.45
## 3 crooked 157 49 2.79
## 4 ted 85 28 2.62
## 5 interviewed 76 25 2.62
## 6 media 77 26 2.56
## 7 cruz 116 46 2.19
## 8 hillary 290 119 2.14
## 9 win 74 30 2.14
## 10 president 84 35 2.08
## # ... with 20 more rows
android_iphone_or %>% filter(Android+iPhone > 100) %>%
arrange(or)
## # A tibble: 30 x 4
## word Android iPhone or
## <chr> <dbl> <dbl> <dbl>
## 1 #makeamericagreatagain 0 296 0.00144
## 2 #trump2016 3 412 0.00718
## 3 join 1 157 0.00821
## 4 tomorrow 25 101 0.218
## 5 vote 46 67 0.600
## 6 america 114 141 0.703
## 7 tonight 71 84 0.737
## 8 iowa 62 65 0.831
## 9 poll 117 103 0.990
## 10 trump 112 92 1.06
## # ... with 20 more rows
We already see somewhat of a pattern in the types of words that are being tweeted more from one device versus the other. However, we are not interested in specific words but rather in the tone. Vaziri’s assertion is that the Android tweets are more hyperbolic. So how can we check this with data? Hyperbolic is a hard sentiment to extract from words, as it relies on interpreting phrases. However, words can be associated with more basic sentiments such as anger, fear, joy, and surprise. In the next section, we demonstrate basic sentiment analysis.
In sentiment analysis, we assign a word to one or more “sentiments”. Although this approach will miss context dependent sentiments, such as sarcasm, when performed on large numbers of words, summaries can provide insights.
The first step in sentiment analysis is to assign a sentiment to each word. The tidytext package includes several maps or lexicons in the object sentiments:
table(sentiments$lexicon)
##
## AFINN bing loughran nrc
## 2476 6788 4149 13901
The bing lexicon divides words into positive and negative sentiments. We can see this using the tidytext function get_sentiments:
get_sentiments("bing")
## # A tibble: 6,788 x 2
## word sentiment
## <chr> <chr>
## 1 2-faced negative
## 2 2-faces negative
## 3 a+ positive
## 4 abnormal negative
## 5 abolish negative
## 6 abominable negative
## 7 abominably negative
## 8 abominate negative
## 9 abomination negative
## 10 abort negative
## # ... with 6,778 more rows
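As a quick aside that is not part of the analysis below, the bing lexicon already allows a crude comparison: a tally of positive and negative words by device.
tweet_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep only words in the bing lexicon
  count(source, sentiment)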
The AFINN lexicon assigns a score between -5 and 5, with -5 the most negative and 5 the most positive.
get_sentiments("afinn")
## # A tibble: 2,476 x 2
## word score
## <chr> <int>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ... with 2,466 more rows
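Similarly, as another aside not used below, the AFINN scores allow a simple numeric summary: the average score of lexicon words used on each device. This assumes the score column shown above (in some later versions of the lexicon it is named value).
tweet_words %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%  # keep only words with an AFINN score
  group_by(source) %>%
  summarize(average_score = mean(score), n = n())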
The loughran and nrc lexicons provide several different sentiments:
get_sentiments("loughran") %>% count(sentiment)
## # A tibble: 6 x 2
## sentiment n
## <chr> <int>
## 1 constraining 184
## 2 litigious 903
## 3 negative 2355
## 4 positive 354
## 5 superfluous 56
## 6 uncertainty 297
get_sentiments("nrc") %>% count(sentiment)
## # A tibble: 10 x 2
## sentiment n
## <chr> <int>
## 1 anger 1247
## 2 anticipation 839
## 3 disgust 1058
## 4 fear 1476
## 5 joy 689
## 6 negative 3324
## 7 positive 2312
## 8 sadness 1191
## 9 surprise 534
## 10 trust 1231
To start learning about how these lexicons were developed, you can read this help file: ?sentiments.
For our analysis, we are interested in exploring the different sentiments of each tweet, so we will use the nrc lexicon:
nrc <- sentiments %>%
  filter(lexicon == "nrc") %>%
  select(word, sentiment)
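As an aside, in more recent versions of tidytext the sentiments object may only contain the bing lexicon. In that case an equivalent way to build this table is with get_sentiments, which may prompt you to install the textdata package the first time it is run:
nrc <- get_sentiments("nrc") %>%  # same two columns: word and sentiment
  select(word, sentiment)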
We can combine the words and sentiments using inner_join, which will only keep words associated with a sentiment. Here are 10 random words extracted from the tweets:
tweet_words %>% inner_join(nrc, by = "word") %>%
  select(source, word, sentiment) %>%
  sample_n(10)
## source word sentiment
## 4545 Android phony anger
## 6369 Android vote joy
## 9804 Android virtue trust
## 15542 iPhone tomorrow anticipation
## 3451 Android finally trust
## 15372 iPhone honor trust
## 16163 iPhone vote negative
## 11305 iPhone bomb sadness
## 10763 iPhone negative sadness
## 1057 Android spirit positive
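Note, for example, that vote appears twice in this sample, once with joy and once with negative: the nrc lexicon can assign several sentiments to the same word, so a single tweet word can contribute more than one row after the join. You can inspect all the sentiments attached to a given word directly:
nrc %>% filter(word == "vote")  # all nrc sentiments assigned to "vote"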
Now we are ready to perform a quantitative analysis comparing Android and iPhone by comparing the sentiments of the tweets posted from each device. Here we could perform a tweet-by-tweet analysis, assigning a sentiment to each tweet. However, this would be challenging since each tweet will have several sentiments attached to it, one for each word appearing in the lexicon. For illustrative purposes, we will perform a much simpler analysis: we will count and compare the frequencies of each sentiment appearing for each device.
sentiment_counts <- tweet_words %>%
  left_join(nrc, by = "word") %>%
  count(source, sentiment) %>%
  spread(source, n) %>%
  mutate(sentiment = replace_na(sentiment, replace = "none"))
sentiment_counts
## # A tibble: 11 x 3
## sentiment Android iPhone
## <chr> <int> <int>
## 1 anger 965 527
## 2 anticipation 915 710
## 3 disgust 641 318
## 4 fear 802 486
## 5 joy 698 540
## 6 negative 1668 935
## 7 positive 1834 1497
## 8 sadness 907 514
## 9 surprise 530 365
## 10 trust 1253 1010
## 11 none 11523 10739
For each sentiment, we can compute the odds of a word having that sentiment on each device: the proportion of words with the sentiment versus the proportion of words without it. We then compute the odds ratio comparing the two devices.
sentiment_counts %>%
  mutate(Android = Android / (sum(Android) - Android),
         iPhone = iPhone / (sum(iPhone) - iPhone),
         or = Android / iPhone) %>%
  arrange(desc(or))
## # A tibble: 11 x 4
## sentiment Android iPhone or
## <chr> <dbl> <dbl> <dbl>
## 1 disgust 0.0304 0.0184 1.66
## 2 anger 0.0465 0.0308 1.51
## 3 negative 0.0831 0.0560 1.49
## 4 sadness 0.0435 0.0300 1.45
## 5 fear 0.0383 0.0283 1.35
## 6 surprise 0.0250 0.0211 1.18
## 7 joy 0.0332 0.0316 1.05
## 8 anticipation 0.0439 0.0419 1.05
## 9 trust 0.0612 0.0607 1.01
## 10 positive 0.0922 0.0927 0.994
## 11 none 1.13 1.56 0.725
So we do see some differences, and the order is interesting: the three largest sentiments are disgust, anger, and negative! But are they statistically significant? How does this compare to what we would see if we were just assigning sentiments at random?
To answer this question, we can compute, for each sentiment, an odds ratio and a confidence interval. We will add the two values we need to form a two-by-two table and compute the odds ratio:
library(broom)
log_or <- sentiment_counts %>%
  mutate(log_or = log((Android / (sum(Android) - Android)) / (iPhone / (sum(iPhone) - iPhone))),
         se = sqrt(1/Android + 1/(sum(Android) - Android) + 1/iPhone + 1/(sum(iPhone) - iPhone)),
         conf.low = log_or - qnorm(0.975)*se,
         conf.high = log_or + qnorm(0.975)*se) %>%
  arrange(desc(log_or))
log_or
## # A tibble: 11 x 7
## sentiment Android iPhone log_or se conf.low conf.high
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 disgust 641 318 0.504 0.0694 0.368 0.640
## 2 anger 965 527 0.411 0.0551 0.303 0.519
## 3 negative 1668 935 0.395 0.0422 0.313 0.478
## 4 sadness 907 514 0.372 0.0562 0.262 0.482
## 5 fear 802 486 0.302 0.0584 0.187 0.416
## 6 surprise 530 365 0.168 0.0688 0.0332 0.303
## 7 joy 698 540 0.0495 0.0582 -0.0647 0.164
## 8 anticipation 915 710 0.0468 0.0511 -0.0533 0.147
## 9 trust 1253 1010 0.00726 0.0436 -0.0781 0.0926
## 10 positive 1834 1497 -0.00624 0.0364 -0.0776 0.0651
## 11 none 11523 10739 -0.321 0.0206 -0.362 -0.281
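As a sanity check, we can reproduce the first row by hand. For a two-by-two table with entries a, b, c, and d, the log odds ratio is log((a/b)/(c/d)) and its standard error is approximately sqrt(1/a + 1/b + 1/c + 1/d), which is exactly what the code above computes. Using the disgust counts from sentiment_counts:
disgust_android <- 641                                        # disgust words from Android
other_android <- sum(sentiment_counts$Android) - disgust_android
disgust_iphone <- 318                                         # disgust words from iPhone
other_iphone <- sum(sentiment_counts$iPhone) - disgust_iphone
log((disgust_android / other_android) / (disgust_iphone / other_iphone))       # about 0.504
sqrt(1/disgust_android + 1/other_android + 1/disgust_iphone + 1/other_iphone)  # about 0.0694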
A graphical visualization shows some sentiments that are clearly overrepresented:
log_or %>%
  mutate(sentiment = reorder(sentiment, log_or)) %>%
  ggplot(aes(x = sentiment, ymin = conf.low, ymax = conf.high)) +
  geom_errorbar() +
  geom_point(aes(sentiment, log_or)) +
  ylab("Log odds ratio for association between Android and sentiment") +
  coord_flip()
We see that the disgust, anger, negative, sadness, and fear sentiments are associated with the Android in a way that is hard to explain by chance alone. Words not associated with a sentiment were strongly associated with the iPhone source, which is in agreement with the original claim about hyperbolic tweets.
If we are interested in exploring which specific words are driving these differences, we can refer back to our android_iphone_or object:
android_iphone_or %>% inner_join(nrc) %>%
  filter(sentiment == "disgust" & Android + iPhone > 10) %>%
  arrange(desc(or))
## Joining, by = "word"
## # A tibble: 20 x 5
## word Android iPhone or sentiment
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 mess 15 2 5.41 disgust
## 2 finally 12 2 4.36 disgust
## 3 unfair 12 2 4.36 disgust
## 4 bad 104 26 3.45 disgust
## 5 lie 13 3 3.37 disgust
## 6 terrible 31 8 3.24 disgust
## 7 lying 9 3 2.37 disgust
## 8 waste 12 5 1.98 disgust
## 9 phony 21 9 1.97 disgust
## 10 illegal 32 14 1.96 disgust
## 11 nasty 14 6 1.95 disgust
## 12 pathetic 11 5 1.82 disgust
## 13 horrible 14 7 1.69 disgust
## 14 disaster 21 11 1.63 disgust
## 15 winning 14 9 1.33 disgust
## 16 liar 6 5 1.03 disgust
## 17 dishonest 37 32 1.01 disgust
## 18 john 24 21 0.994 disgust
## 19 dying 6 6 0.872 disgust
## 20 terrorism 9 9 0.872 disgust
and we can make a graph:
android_iphone_or %>% inner_join(nrc, by = "word") %>%
  mutate(sentiment = factor(sentiment, levels = log_or$sentiment)) %>%
  mutate(log_or = log(or)) %>%
  filter(Android + iPhone > 10 & abs(log_or) > 1) %>%
  mutate(word = reorder(word, log_or)) %>%
  ggplot(aes(word, log_or, fill = log_or < 0)) +
  facet_wrap(~sentiment, scales = "free_x", nrow = 2) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
This is just a simple example of the many analyses one can perform with tidytext. To learn more, we again recommend this book.