1 Introduction

Perhaps the most popular data science methodologies come from the field of Machine Learning. Machine learning success stories include the handwritten zip code readers implemented by the postal service, speech recognition such as Apple’s Siri, movie recommendation systems, spam and malware detectors, housing price predictors, and driverless cars. While the first artificial intelligence algorithms, such as those used by chess playing machines, implemented decision making based on programmable rules derived from theory or first principles, in Machine Learning decisions are based on algorithms built with data.

1.1 Notation

In Machine Learning, data comes in the form of

  1. the outcome we want to predict and
  2. the features that we will use to predict the outcome.

We want to build an algorithm that takes feature values as input and returns a prediction for the outcome when we don’t know the outcome. The machine learning approach is to train an algorithm using a dataset for which we do know the outcome, and then apply this algorithm in the future to make a prediction when we don’t know the outcome.

Here we will use \(Y\) to denote the outcome and \(X_1, \dots, X_p\) to denote features. Note that features are sometimes referred to as predictors or covariates. We consider all these to be synonyms.

Prediction problems can be divided into those with categorical and those with continuous outcomes. For categorical outcomes, \(Y\) can be any one of \(K\) classes. The number of classes can vary greatly across applications. For example, in the digit reader data, \(K=10\) with the classes being the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. In speech recognition, the outcomes are all possible words we are trying to detect. Spam detection has two outcomes: spam or not spam. In this book, we denote the \(K\) categories with indexes \(k=1,\dots,K\). However, for binary data we will use \(k=0,1\) for mathematical convenience, which we demonstrate later.

The general set-up is as follows. We have a series of features and an unknown outcome we want to predict:

outcome   feature_1   feature_2   feature_3   feature_4   feature_5
?         X_1         X_2         X_3         X_4         X_5

To build a model that provides a prediction for any set of values \(X_1=x_1, X_2=x_2, \dots, X_5=x_5\), we collect data for which we know the outcome:

outcome   feature_1   feature_2   feature_3   feature_4   feature_5
Y_1       X_1,1       X_1,2       X_1,3       X_1,4       X_1,5
Y_2       X_2,1       X_2,2       X_2,3       X_2,4       X_2,5
Y_3       X_3,1       X_3,2       X_3,3       X_3,4       X_3,5
Y_4       X_4,1       X_4,2       X_4,3       X_4,4       X_4,5
Y_5       X_5,1       X_5,2       X_5,3       X_5,4       X_5,5
Y_6       X_6,1       X_6,2       X_6,3       X_6,4       X_6,5
Y_7       X_7,1       X_7,2       X_7,3       X_7,4       X_7,5
Y_8       X_8,1       X_8,2       X_8,3       X_8,4       X_8,5
Y_9       X_9,1       X_9,2       X_9,3       X_9,4       X_9,5
Y_10      X_10,1      X_10,2      X_10,3      X_10,4      X_10,5

We use the notation \(\hat{Y}\) to denote the prediction. We use the term actual outcome to denote what we ended up observing. So we want the prediction \(\hat{Y}\) to match the actual outcome.

1.2 Categorical versus Continuous

The outcome \(Y\) can be categorical (which digit, what word, spam or not spam, pedestrian or empty road ahead) or continuous (movie ratings, housing prices, stock value, distance between driverless car and a pedestrian). The concepts and algorithms we learn here apply to both. However, there are some differences in how we approach each case so it is important to distinguish between the two.

When the outcome is categorical we refer to the machine learning task as classification. Our predictions will be categorical just like our outcomes and they will be either correct or incorrect. When the outcome is continuous we will refer to the task as prediction. In this case our predictions will not be either right or wrong. Instead we will make an error which is the difference between the prediction and the actual outcome. This terminology can become confusing since we call \(\hat{Y}\) our prediction even when it is a categorical outcome. However, throughout the lecture, the context will make the meaning clear.

Note that these terms vary among courses, textbooks, and other publications. Often prediction is used for both categorical and continuous outcomes, and regression is used for the continuous case. Here we avoid using regression to avoid confusion with our previous use of the term linear regression. In most cases it will be clear whether our outcomes are categorical or continuous, so we will avoid using these terms when possible.

The first part of this material deals with categorical outcomes and the second with continuous ones.

1.3 An example

Let’s consider an example. The first thing that happens to letters when they are received at the post office is that they are sorted by zip code:

Originally, humans had to sort these by hand. To do this, they had to read the zip code on each letter. Today, thanks to machine learning algorithms, a computer can read zip codes and then a robot sorts the letters. In this lecture we will learn how to build algorithms that can read a digit.

The first step in building an algorithm is to understand what the outcomes and features are. Below are three images of written digits. These have already been read by a human and assigned an outcome \(Y\). These are considered known and serve as the training set.

The images are converted into \(28 \times 28 = 784\) pixels and, for each pixel, we obtain a grey scale intensity between 0 (white) and 255 (black), which we consider continuous for now. We can see these values like this:
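
The original code and output are not reproduced here; a minimal sketch, assuming the read_mnist() function from the dslabs package (which downloads the full MNIST dataset), would be:

library(dslabs)
mnist <- read_mnist()
x_1 <- mnist$train$images[1, ]   # the first training image: 784 grey scale intensities
range(x_1)                       # values between 0 and 255
# rearrange the vector into a 28 x 28 grid to view it as an image
image(matrix(x_1, 28, 28)[, 28:1], col = grey(seq(0, 1, length.out = 256)))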

For each digitized image \(i\) we have a categorical outcome \(Y_i\), which can be one of 10 values: \(0,1,2,3,4,5,6,7,8,9\), and features \(X_{i,1}, \dots, X_{i,784}\). We use bold face \(\mathbf{X}_i = (X_{i,1}, \dots, X_{i,784})\) to distinguish the vector of predictors from the individual predictors. When referring to an arbitrary set of features, we drop the index \(i\) and use \(Y\) and \(\mathbf{X} = (X_{1}, \dots, X_{784})\). We use upper case variables because, in general, we think of the predictors as random variables. We use lower case, for example \(\mathbf{X} = \mathbf{x}\), to denote observed values. However, when we code we stick to lower case.

The machine learning task is to build an algorithm that returns a prediction for any of the possible values of the features. Here we will learn several approaches to building these algorithms. Although at this point it might seem impossible to achieve this, we will start with simple examples and build up our knowledge until we can attack more complex ones. To learn the basic concepts, we will start with very simple examples with just one predictor and then move to two predictors. Once we learn these, we will attack more real-world machine learning challenges.

2 The confusion matrix, prevalence, sensitivity and specificity

library(tidyverse)
library(dslabs)
ds_theme_set()

We start by introducing the caret package, which has several useful functions for building and assessing machine learning methods.

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift

Later we show how we use these functions.

Here we focus on describing ways in which machine learning algorithms are evaluated. For our first introduction to machine learning concepts we will start with a boring and simple example: predict sex using height. We explain machine learning step by step and this example will let us set down the first building block. Soon enough we will be attacking more interesting challenges.

For this first example we use the height data in dslabs:

library(dslabs)
data(heights)

We start by defining the outcome and predictors. In this example we have only one predictor.

y <- heights$sex
x <- heights$height

This is clearly a categorical outcome since \(Y\) can be Male or Female, and we only have one predictor: height. We know that we will not be able to predict \(Y\) very accurately based on \(X\) because male and female heights are not that different relative to within-group variability. But can we do better than guessing? To answer this question we need a quantitative definition of better.

2.1 Training and test sets

Ultimately, a machine learning algorithm is evaluated on how it performs in the real world, when others run the code on new data. However, when developing an algorithm we usually have a dataset for which we know the outcomes, as we do with the heights: we know the sex of every student. Therefore, to mimic the ultimate evaluation process, we typically split the data into two parts and act as if we don’t know the outcome for one of them. We stop pretending we don’t know the outcome to evaluate the algorithm, but only after we are done constructing it. We refer to the group for which we know the outcome, and which we use to develop the algorithm, as the training set, and the group for which we pretend we don’t know the outcome as the test set. A standard way of generating the training and test sets is by randomly splitting the data.

The caret package includes the function createDataPartition, which helps us generate indexes for randomly splitting the data into training and test sets. The argument times defines how many random samples of indexes to return, the argument p defines what proportion of the data the index represents, and the argument list determines whether we want the indexes returned as a list or not:

set.seed(2)
test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)

We can use this index to define the training and test sets:

train_set <- heights[-test_index, ]
test_set <- heights[test_index, ]

We will now develop an algorithm using only the training set. Once we are done developing the algorithm, we will freeze it and evaluate it using the test set. The simplest way to evaluate the algorithm when the outcomes are categorical is by simply reporting the proportion of cases that were correctly predicted in the test set. This metric is usually referred to as overall accuracy.

2.2 Overall accuracy

To demonstrate the use of overall accuracy we will build two competing algorithms and compare them.

Let’s start by developing the simplest possible machine learning algorithm: guessing the outcome.

y_hat <- sample(c("Male", "Female"), length(test_index), replace = TRUE) 

Note that we are completely ignoring the predictor and simply guessing the sex.

In machine learning applications, it is useful to use factors to represent the categorical outcomes. R functions developed for machine learning, such as those in the caret package, require or recommend that categorical outcomes be coded as factors:

y_hat <- sample(c("Male", "Female"), length(test_index), replace = TRUE) %>% factor(levels = levels(test_set$sex))

The overall accuracy is simply defined as the overall proportion that is predicted correctly:

mean(y_hat == test_set$sex)
## [1] 0.5238095

Not surprisingly, our accuracy is about 50%.

Can we do better? Exploratory data analysis suggests we can because, on average, males are slightly taller than females:

heights %>% group_by(sex) %>% summarize(mean(height), sd(height))
## # A tibble: 2 x 3
##   sex    `mean(height)` `sd(height)`
##   <fct>           <dbl>        <dbl>
## 1 Female           64.9         3.76
## 2 Male             69.3         3.61

But how do we make use of this insight? Let’s try another simple approach: predict Male if the height is within two standard deviations of the average male height:

y_hat <- ifelse(x > 62, "Male", "Female") %>% factor(levels = levels(test_set$sex))

The accuracy goes way up from 0.50 to about 0.80:

mean(y == y_hat)
## [1] 0.7933333
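
The cutoff of 62 is not arbitrary: it is roughly the average male height minus two standard deviations. A quick check using the summaries shown above:

heights %>%
  filter(sex == "Male") %>%
  summarize(cutoff = mean(height) - 2 * sd(height))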

But can we do even better? In the example above we used a cutoff of 62, but we can examine the accuracy obtained for other cutoffs and then pick the value that provides the best results. But remember, it is important that we pick the best value on the training set: the test set is only for evaluation. Although for this simplistic example it is not much of a problem, later we will learn that evaluating an algorithm on the training set can lead to overfitting, which often results in dangerously over-optimistic assessments.

We examine the accuracy of 10 different cutoffs and pick the one yielding the best result.

cutoff <- seq(61, 70)
accuracy <- map_dbl(cutoff, function(x){
  y_hat <- ifelse(train_set$height > x, "Male", "Female") %>% factor(levels = levels(test_set$sex))
  mean(y_hat == train_set$sex)
})

We can make a plot showing the accuracy obtained on the training set for different cutoffs.
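
The plotting code is not included above; a minimal sketch using the cutoff and accuracy vectors just computed would be:

data.frame(cutoff = cutoff, accuracy = accuracy) %>%
  ggplot(aes(cutoff, accuracy)) +
  geom_point() +
  geom_line()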

We see that the maximum value is:

max(accuracy)
## [1] 0.8361905

much higher than 0.5, and it is maximized with the cutoff:

best_cutoff <- cutoff[which.max(accuracy)]
best_cutoff
## [1] 64

We can now test this cutoff on our test set to make sure our accuracy is not overly optimistic:

y_hat <- ifelse(test_set$height > best_cutoff, "Male", "Female") %>% factor(levels = levels(test_set$sex))
y_hat <- factor(y_hat)
mean(y_hat == test_set$sex)
## [1] 0.8171429

We see that it is a bit lower than the accuracy observed on the training set, but it is still better than guessing. And by testing on a dataset that we did not train on, we know the result is not due to overfitting.

2.3 The Confusion Matrix

The prediction rule we developed in the previous section predicts Male if the student is taller than 64 inches. Given that the average female is about 65 inches, this prediction rule seems wrong. What happened? If a student is the height of the average female, shouldn’t we predict Female?

Generally speaking, overall accuracy can be a deceptive measure. To see this we will start by constructing what is referred to as the confusion matrix, which basically tabulates each combination of prediction and actual value. We can do this in R using the function table:

table(predicted = y_hat, actual = test_set$sex)
##          actual
## predicted Female Male
##    Female     50   27
##    Male       69  379

If we study this table closely it reveals a problem. If we compute the accuracy separately for each sex we get:

test_set %>% 
  mutate(y_hat = y_hat) %>%
  group_by(sex) %>% 
  summarize(accuracy = mean(y_hat == sex))
## # A tibble: 2 x 2
##   sex    accuracy
##   <fct>     <dbl>
## 1 Female    0.420
## 2 Male      0.933

There is an imbalance in the accuracy for males and females: too many females are predicted to be male. We are calling almost half of the females males! How can our overall accuracy be so high, then? This is because the prevalence of males in this dataset is high. These heights were collected from three data science courses, two of which had more males enrolled:

prev <- mean(y == "Male")
prev
## [1] 0.7733333

So when computing overall accuracy, the high percentage of mistakes made for females is outweighed by the gains in correct calls for men. This can actually be a big problem in machine learning. If your training data is biased in some way, you are likely to develop algorithms that are biased as well. The fact that we used a test set does not matter because it is also derived from the original, biased, dataset. This is one of the reasons we look at metrics other than overall accuracy when evaluating a machine learning algorithm.

There are several metrics that we can use to evaluate an algorithm in a way that prevalence does not cloud our assessment, and these can all be derived from the confusion matrix. A general improvement over using overall accuracy is to study sensitivity and specificity separately.

2.4 Sensitivity and Specificity

To define sensitivity and specificity we need a binary outcome. When the outcomes are categorical we can define these terms for a specific category. In the digits example we can ask for the specificity in the case of correctly predicting 2 as opposed to some other digit. Once we specify a category of interest then we can talk about positive outcomes, \(Y=1\), and negative outcomes, \(Y=0\).

In general, sensitivity is defined as the ability of an algorithm to predict a positive outcome when the actual outcome is positive: call \(\hat{Y}=1\) whenever \(Y=1\). Because an algorithm that calls everything positive (\(\hat{Y}=1\) no matter what) has perfect sensitivity, this metric on its own is not enough to judge an algorithm. For this reason we also examine specificity, which is generally defined as the ability of an algorithm to not predict a positive, \(\hat{Y}=0\), when the actual outcome is not a positive, \(Y=0\). We can summarize this in the following way:

  • High sensitivity: \(Y=1 \implies \hat{Y}=1\)
  • High specificity: \(Y=0 \implies \hat{Y} = 0\)

Another way to define specificity is by the proportion of positive calls that are actually positive:

  • High specificity: \(\hat{Y}=1 \implies Y=1\).

To provide precise definitions, we name the four entries of the confusion matrix:

                     Actually Positive      Actually Negative
Predicted positive   True positives (TP)    False positives (FP)
Predicted negative   False negatives (FN)   True negatives (TN)

Sensitivity is typically quantified by \(TP/(TP+FN)\), or the proportion of actual positives (the first column or \(TP+FN\)) that are called positives TP. This quantity is referred to as the true positive rate (TPR) or recall.

Specificity is typically quantified as \(TN/(TN+FP)\), or the proportion of negatives (the second column or \(FP+TN\)) that are called negatives (TN). This quantity is also called the true negative rate (TNR). There is another way of quantifying specificity, which is \(TP/(TP+FP)\), or the proportion of outcomes called positives (the first row or \(TP+FP\)) that are actually positives (TP). This quantity is referred to as precision and also as positive predictive value (PPV). Note that, unlike TPR and TNR, precision depends on prevalence, since higher prevalence implies you can get higher precision even when guessing.

The multiple names can be confusing, so we include a table to help us remember the terms. The table includes a column that shows the definition if we think of the proportions as probabilities.

A measure of   Name 1                            Name 2                                Definition           Probability representation
Sensitivity    True positive rate (TPR)          Recall                                \(TP / (TP + FN)\)   \(\mbox{Pr}(\hat{Y}=1 \mid Y=1)\)
Specificity    True negative rate (TNR)          1 minus false positive rate (1-FPR)   \(TN / (TN + FP)\)   \(\mbox{Pr}(\hat{Y}=0 \mid Y=0)\)
Specificity    Positive predictive value (PPV)   Precision                             \(TP / (TP + FP)\)   \(\mbox{Pr}(Y=1 \mid \hat{Y}=1)\)
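
Before relying on caret, we can compute these three quantities by hand from the confusion matrix, as a check. This sketch assumes the y_hat produced by the best-cutoff rule above is still in the workspace:

tab <- table(predicted = y_hat, actual = test_set$sex)
TP <- tab["Female", "Female"]  # predicted Female, actually Female
FP <- tab["Female", "Male"]    # predicted Female, actually Male
FN <- tab["Male", "Female"]    # predicted Male, actually Female
TN <- tab["Male", "Male"]      # predicted Male, actually Male
TP / (TP + FN)  # sensitivity (TPR, recall)
TN / (TN + FP)  # specificity (TNR)
TP / (TP + FP)  # precision (PPV)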

The caret function confusionMatrix computes all these metrics for us once we define which category is the positive one. The function expects factors as input, and the first level is considered the “positive” outcome, or \(Y=1\). In our example, Female is the first level because it comes before Male alphabetically:

confusionMatrix(data = y_hat, reference = test_set$sex)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Female Male
##     Female     50   27
##     Male       69  379
##                                           
##                Accuracy : 0.8171          
##                  95% CI : (0.7814, 0.8493)
##     No Information Rate : 0.7733          
##     P-Value [Acc > NIR] : 0.008354        
##                                           
##                   Kappa : 0.4041          
##  Mcnemar's Test P-Value : 2.857e-05       
##                                           
##             Sensitivity : 0.42017         
##             Specificity : 0.93350         
##          Pos Pred Value : 0.64935         
##          Neg Pred Value : 0.84598         
##              Prevalence : 0.22667         
##          Detection Rate : 0.09524         
##    Detection Prevalence : 0.14667         
##       Balanced Accuracy : 0.67683         
##                                           
##        'Positive' Class : Female          
## 

We can see that the high overall accuracy is possible despite relatively low sensitivity. As we hinted at above, the reason this happens is the low prevalence (0.23): the proportion of females is low. Because prevalence is low, failing to predict actual females as Female (low sensitivity) does not lower the overall accuracy as much as incorrectly predicting actual males as Female would. This is an example of why it is important to examine sensitivity and specificity and not just accuracy. Before applying this algorithm to general datasets, we need to ask ourselves if prevalence will be the same.

2.5 Balanced accuracy and \(F_1\) score

Although in general we recommend studying both specificity and sensitivity, very often it is useful to have a one-number summary, for example for optimization purposes. One metric that is preferred over overall accuracy is the average of specificity and sensitivity, referred to as balanced accuracy. Because specificity and sensitivity are rates, it is more appropriate to compute the harmonic average. In fact, the \(F_1\)-score, a widely used one-number summary, is the harmonic average of precision and recall:

\[ \frac{1}{\frac{1}{2}\left(\frac{1}{\mbox{recall}} + \frac{1}{\mbox{precision}}\right) } \]

Because it is easier to write, you often see this harmonic average rewritten as

\[ 2\frac{\mbox{precision} \cdot \mbox{recall}} {\mbox{precision} + \mbox{recall}} \]

when defining \(F_1\).

Note that, depending on the context, some types of errors are more costly than others. For example, in the case of plane safety, it is much more important to maximize sensitivity over specificity: failing to predict that a plane will malfunction before it crashes is a much more costly error than grounding a plane when in fact the plane is in perfect condition. In a capital murder criminal case, the opposite is true since a false positive can lead to killing an innocent person. The \(F_1\)-score can be adapted to weigh specificity and sensitivity differently. To do this we define \(\beta\) to represent how much more important sensitivity is compared to specificity and consider a weighted harmonic average:

\[ \frac{1}{\frac{\beta^2}{1+\beta^2}\frac{1}{\mbox{recall}} + \frac{1}{1+\beta^2}\frac{1}{\mbox{precision}} } \]

The F_meas function in the caret package computes this summary with beta defaulting to 1.
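
As a quick check of the formula above, we can compare F_meas to the harmonic average of precision and recall computed by hand. This sketch assumes the y_hat from the 64-inch cutoff rule is still defined:

rec <- sensitivity(data = y_hat, reference = test_set$sex)
prec <- precision(data = y_hat, reference = test_set$sex)
2 * prec * rec / (prec + rec)                   # harmonic average computed by hand
F_meas(data = y_hat, reference = test_set$sex)  # should return the same value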

Let’s rebuild our prediction algorithm, but this time maximizing the F-score instead of overall accuracy:

cutoff <- seq(61, 70)
F_1 <- map_dbl(cutoff, function(x){
  y_hat <- ifelse(train_set$height > x, "Male", "Female") %>% factor(levels = levels(test_set$sex))
  F_meas(data = y_hat, reference = factor(train_set$sex))
})

As before, we can plot these \(F_1\) measures versus the cutoffs:

We see that the maximum \(F_1\) value is:

max(F_1)
## [1] 0.6142322

when we use cutoff:

best_cutoff <- cutoff[which.max(F_1)]
best_cutoff
## [1] 66

A cutoff of 66 makes much more sense than 64. Furthermore, it balances the specificity and sensitivity of our confusion matrix:

y_hat <- ifelse(test_set$height > best_cutoff, "Male", "Female") %>% factor(levels = levels(test_set$sex))
confusionMatrix(data = y_hat, reference = test_set$sex)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Female Male
##     Female     81   67
##     Male       38  339
##                                           
##                Accuracy : 0.8             
##                  95% CI : (0.7632, 0.8334)
##     No Information Rate : 0.7733          
##     P-Value [Acc > NIR] : 0.078192        
##                                           
##                   Kappa : 0.4748          
##  Mcnemar's Test P-Value : 0.006285        
##                                           
##             Sensitivity : 0.6807          
##             Specificity : 0.8350          
##          Pos Pred Value : 0.5473          
##          Neg Pred Value : 0.8992          
##              Prevalence : 0.2267          
##          Detection Rate : 0.1543          
##    Detection Prevalence : 0.2819          
##       Balanced Accuracy : 0.7578          
##                                           
##        'Positive' Class : Female          
## 

We now see that we do much better than guessing, and that both sensitivity and specificity are relatively high. We have built our first machine learning algorithm: it takes height as a predictor and predicts Female if the height is 66 inches or shorter.

2.6 Prevalence matters in practice

A machine learning algorithm with very high sensitivity and specificity may not be useful in practice when prevalence is close to either 0 or 1. To see this, consider the case of a doctor who specializes in a rare disease and is interested in developing an algorithm for predicting who has the disease. The doctor shares data with you and you develop an algorithm with very high sensitivity. You explain that this means that if a patient has the disease, the algorithm is very likely to predict correctly. You also tell the doctor that you are concerned because, based on the dataset you analyzed, half the patients are predicted to have the disease: \(\mbox{Pr}(\hat{Y}=1) = 1/2\). The doctor is neither concerned nor impressed and explains that what is important is the precision of the test: \(\mbox{Pr}(Y=1 \mid \hat{Y}=1)\). Using Bayes’ theorem we can connect the two measures:

\[ \mbox{Pr}(Y=1 \mid \hat{Y}=1) = \mbox{Pr}(\hat{Y}=1 \mid Y=1) \frac{\mbox{Pr}(Y=1)}{\mbox{Pr}(\hat{Y}=1)}\]

The doctor knows that the prevalence of the disease is 5 in 1000 which implies that \(\mbox{Pr}(Y=1) \, / \,\mbox{Pr}(\hat{Y}=1) = 1/100\) and therefore the precision of your algorithm is less than 0.01. The doctor does not have much use for your algorithm.
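
To make the arithmetic concrete, here is a sketch with numbers taken from the story above; the sensitivity of 0.99 is a hypothetical value standing in for “very high”:

sens <- 0.99          # hypothetical: Pr(Y_hat = 1 | Y = 1)
prev <- 5 / 1000      # Pr(Y = 1): the prevalence of the disease
pr_pos <- 1 / 2       # Pr(Y_hat = 1): half the patients are predicted positive
sens * prev / pr_pos  # Pr(Y = 1 | Y_hat = 1): the precision, below 0.01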

2.7 ROC and precision-recall curves

When comparing the two methods (guessing versus using a height cutoff), we looked at accuracy and \(F_1\). The second method clearly outperformed the first. However, while for the second method we considered several cutoffs, for the first we only considered one approach: guessing with equal probability. Note that guessing Male with higher probability would give us higher accuracy, due to the bias in the sample:

p <- 0.9
y_hat <- sample(c("Male", "Female"), length(test_index), replace = TRUE, prob=c(p, 1-p)) %>% 
  factor(levels = levels(test_set$sex))
mean(y_hat == test_set$sex)
## [1] 0.7180952

But, as described above, this would come at the cost of lower sensitivity. The curves we describe in this section will help us see this.

Note that for each choice of these parameters (the probability of guessing Male or the height cutoff), we get a different sensitivity and specificity. For this reason, a very common approach to evaluating methods is to compare them graphically by plotting both.

A widely used plot that does this is the receiver operating characteristic (ROC) curve. If you are wondering where this name comes from, according to Wikipedia:

The ROC curve was first used during World War II for the analysis of radar signals before it was employed in signal detection theory.[35] Following the attack on Pearl Harbor in 1941, the United States army began new research to increase the prediction of correctly detected Japanese aircraft from their radar signals. For this purposes they measured the ability of radar receiver operators to make these important distinctions, which was called the Receiver Operating Characteristics.

The ROC curve plots sensitivity (TPR) versus 1 - specificity or the false positive rate (FPR). Here is an ROC curve for guessing sex but using different probabilities of guessing male:

probs <- seq(0, 1, length.out = 10)
guessing <- map_df(probs, function(p){
  y_hat <- sample(c("Male", "Female"), length(test_index), replace = TRUE, prob=c(p, 1-p)) %>% 
    factor(levels = c("Female", "Male"))
  list(method = "Guessing",
       FPR = 1 - specificity(y_hat, test_set$sex),
       TPR = sensitivity(y_hat, test_set$sex))
})
guessing %>% qplot(FPR, TPR, data =., xlab = "1 - Specificity", ylab = "Sensitivity")

The ROC curve for guessing always looks like a straight line. Note that a perfect algorithm would shoot straight up to 1 and stay there: perfect sensitivity for all values of specificity. So how does our second approach compare? We can construct an ROC curve for the height-based approach.

cutoffs <- c(50, seq(60, 75), 80)
height_cutoff <- map_df(cutoffs, function(x){
  y_hat <- ifelse(test_set$height > x, "Male", "Female") %>% 
    factor(levels = c("Female", "Male"))
   list(method = "Height cutoff",
        FPR = 1-specificity(y_hat, test_set$sex),
        TPR = sensitivity(y_hat, test_set$sex))
})

By plotting both curves together we are able to compare sensitivity for different values of specificity.

bind_rows(guessing, height_cutoff) %>%
  ggplot(aes(FPR, TPR, color = method)) +
  geom_line() +
  geom_point() +
  xlab("1 - Specificity") +
  ylab("Sensitivity")

We can see that we obtain higher sensitivity with this approach for all values of specificity, which implies it is in fact a better method.

Note that when making ROC curves it is often nice to add the cutoff used to the points:

map_df(cutoffs, function(x){
  y_hat <- ifelse(test_set$height > x, "Male", "Female") %>% 
    factor(levels = c("Female", "Male"))
   list(method = "Height cutoff",
        cutoff = x, 
        FPR = 1-specificity(y_hat, test_set$sex),
        TPR = sensitivity(y_hat, test_set$sex))
}) %>%
  ggplot(aes(FPR, TPR, label = cutoff)) +
  geom_line() +
  geom_point() +
  geom_text(nudge_y = 0.01)

ROC curves have one weakness: neither of the measures plotted depends on prevalence. In cases in which prevalence matters, we may instead make a precision-recall plot. The idea is similar, but we instead plot precision against recall:

guessing <- map_df(probs, function(p){
  y_hat <- sample(c("Male", "Female"), length(test_index), replace = TRUE, prob=c(p, 1-p)) %>% factor(levels = c("Female", "Male"))
  list(method = "Guess",
    recall = sensitivity(y_hat, test_set$sex),
    precision = precision(y_hat, test_set$sex))
})

height_cutoff <- map_df(cutoffs, function(x){
  y_hat <- ifelse(test_set$height > x, "Male", "Female") %>% factor(levels = c("Female", "Male"))
  list(method = "Height cutoff",
       recall = sensitivity(y_hat, test_set$sex),
    precision = precision(y_hat, test_set$sex))
})
bind_rows(guessing, height_cutoff) %>%
  ggplot(aes(recall, precision, color = method)) +
  geom_line() +
  geom_point()
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).

From this plot we immediately see that the precision of guessing is not high. This is because the prevalence is low.

If we change positives to mean Male instead of Female the ROC curve remains the same, but the precision recall plot changes:

guessing <- map_df(probs, function(p){
  y_hat <- sample(c("Male", "Female"), length(test_index), replace = TRUE, prob=c(p, 1-p)) %>% factor(levels = c("Male", "Female"))
  list(method = "Guess",
    recall = sensitivity(y_hat, relevel(test_set$sex, "Male", "Female")),
    precision = precision(y_hat, relevel(test_set$sex, "Male", "Female")))
})

height_cutoff <- map_df(cutoffs, function(x){
  y_hat <- ifelse(test_set$height > x, "Male", "Female") %>% factor(levels = c("Male", "Female"))
  list(method = "Height cutoff",
       recall = sensitivity(y_hat, relevel(test_set$sex, "Male", "Female")),
    precision = precision(y_hat, relevel(test_set$sex, "Male", "Female")))
})
bind_rows(guessing, height_cutoff) %>%
  ggplot(aes(recall, precision, color = method)) +
  geom_line() +
  geom_point()
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).

3 Conditional probabilities and expectations

In machine learning applications we rarely can predict outcomes perfectly. Note that spam detectors often miss emails that are clearly spam, Siri often misunderstands the words we are saying, and your bank often thinks your card was stolen when it was not. The most common reason for not being able to build perfect algorithms is that it is simply impossible. To see this, note that most datasets will include groups of observations with the exact same observed values for all predictors, and thus the same prediction, but different outcomes, making it impossible to get all the predictions right. We saw a simple example of this in the previous section: for any given height \(x\) you will have both males and females that are \(x\) inches tall. However, none of this means that we can’t build useful algorithms that are much better than guessing, and in some cases better than expert opinion. To achieve this in an optimal way, we make use of probabilistic representations of the problem. Observations with the same observed values for the predictors may not all have the same outcome, but we can assume that they all have the same probability of belonging to each class. We will write this idea out mathematically for the case of categorical data.

3.1 Conditional probabilities

We use the notation \((X_1 = x_1,\dots,X_p=x_p)\) to represent the fact that we have observed values \(x_1,\dots,x_p\) for covariates \(X_1, \dots, X_p\). This does not imply that the outcome \(Y\) will take a specific value, but rather that it implies a specific conditional probability. Specifically, we denote the conditional probabilities for each class \(k\):

\[ \mbox{Pr}(Y=k \mid X_1 = x_1,\dots,X_p=x_p), \, \mbox{for}\,k=1,\dots,K \]

To avoid writing out all the predictors, we use the following bold notation: \(\mathbf{X} \equiv (X_1,\dots,X_p)\) and \(\mathbf{x} \equiv (x_1,\dots,x_p)\). We will also use the following notation for the conditional probability of being in class \(k\):

\[ p_k(\mathbf{x}) = \mbox{Pr}(Y=k \mid \mathbf{X}=\mathbf{x}), \, \mbox{for}\, k=1,\dots,K \]

Note: We will be using the \(p(x)\) notation to represent conditional probabilities as functions of the predictors. Do not confuse it with the \(p\) that represents the number of predictors.

Knowing these probabilities can guide the construction of an algorithm that makes the best prediction: for any given \(\mathbf{x}\), we predict the class \(k\) with the largest probability among \(p_1(\mathbf{x}), p_2(\mathbf{x}), \dots, p_K(\mathbf{x})\). In mathematical notation we write it like this:

\[\hat{Y} = \max_k p_k(\mathbf{x})\]

In machine learning we refer to this as Bayes’ Rule. But note that this is a theoretical rule since in practice we don’t know \(p_k(\mathbf{x}), k=1,\dots,K\). In fact, estimating these conditional probabilities can be thought of as the main challenge of machine learning. The better our estimate \(\hat{p}_k(\mathbf{x})\), the better our predictor

\[\hat{Y} = \max_k \hat{p}_k(\mathbf{x})\]

How well we predict will depend on two things: 1) how close \(\max_k p_k(\mathbf{x})\) is to 1, and 2) how close our estimate \(\hat{p}_k(\mathbf{x})\) is to \(p_k(\mathbf{x})\). We can’t do anything about the first restriction, as it is determined by the nature of the problem, so our energy goes into finding ways to best estimate the conditional probabilities. The first restriction does imply that we have limits as to how well even the best possible algorithm can perform. You should get used to the idea that while in some challenges we will be able to achieve almost perfect accuracy, digit readers for example, in others our success is restricted by the randomness of the process, movie recommendations for example.

Before we continue, note that defining our prediction by maximizing the probability is not always optimal in practice and depends on the context. As discussed above, sensitivity and specificity may differ in importance in different contexts. But even in these cases, having a good estimate of \(p_k(\mathbf{x}), k=1,\dots,K\) will suffice for us to build optimal prediction models, since we can control specificity and sensitivity however we wish. For example, we can simply change the cutoff used to predict one outcome or the other. In the plane example given above, we may ground the plane anytime the probability of malfunction is higher than 1/1000, as opposed to the default 1/2 used when error types are equally undesired.

3.2 Conditional expectations

For binary data, you can think of the probability \(\mbox{Pr}(Y=1 \mid \mathbf{X}=\mathbf{x})\) as the proportion of 1s in the stratum of the population for which \(\mathbf{X}=\mathbf{x}\). Many of the algorithms we will learn can be applied to both categorical and continuous data due to the connection between conditional probabilities and conditional expectations.

Because the expectation is the average of values \(y_1,\dots,y_n\) in the population, in the case in which the \(y\)s are 0 or 1, the expectation is equivalent to the probability of randomly picking a one since the average is simply the proportion of ones:

\[ \mbox{E}(Y \mid \mathbf{X}=\mathbf{x})=\mbox{Pr}(Y=1 \mid \mathbf{X}=\mathbf{x}). \]
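
A toy example in R illustrates this: the average of a vector of 0s and 1s is just the proportion of 1s.

y_binary <- c(0, 1, 1, 0, 1)
mean(y_binary)        # 0.6
mean(y_binary == 1)   # the proportion of 1s: the same number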

We therefore often use the conditional expectation to denote both the conditional probability and the conditional expectation.

Just like with categorical outcomes, in most applications the same observed values of the predictors do not guarantee the same continuous outcome. Instead, we assume that the outcome follows the same conditional distribution, and we will now explain why we use the conditional expectation to define our predictions.

3.3 The Loss Function

Before we start describing approaches to optimizing the way we build algorithms, we first need to define what we mean when we say one approach is better than another. Specifically, we need to quantify what we mean by “better”.

With binary outcomes we have already described how sensitivity, specificity, accuracy, and \(F_1\) can be used as quantifications. However, these metrics are not useful for continuous outcomes. The general approach to defining “best” in machine learning is to define a loss function. The most commonly used one is the squared loss function: if \(\hat{y}\) is our prediction and \(y\) is the observed outcome, the squared loss function is simply:

\[ (\hat{y} - y)^2 \]

Because we often have a test set with many observations, say \(N\), we use the mean squared error (MSE):

\[ \mbox{MSE} = \frac{1}{N} \mbox{RSS} = \frac{1}{N}\sum_{i=1}^N (\hat{y}_i - y_i)^2 \]

In practice we often report the root mean squared error:

\[ \mbox{RMSE} = \sqrt{\mbox{MSE}} = \sqrt{\frac{1}{N}\sum_{i=1}^N (\hat{y}_i - y_i)^2} \]

because it is in the same units as the outcomes. But doing the math is often easier with the MSE, which is therefore more commonly used in textbooks, since these often describe theoretical properties of algorithms.
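
For later use, it is convenient to wrap this definition in a helper function; note that rmse below is a name we define ourselves, not a base R function:

rmse <- function(y, y_hat){
  sqrt(mean((y_hat - y)^2))
}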

Note that if the outcomes are binary, minimizing the MSE is equivalent to maximizing accuracy, since \((\hat{y} - y)^2\) is 0 if the prediction was correct and 1 otherwise, making the MSE the proportion of mistakes. In general, our goal is to build an algorithm that minimizes the loss so it is as close to 0 as possible.

Because our data is usually a random sample, we can think of the MSE as a random variable, and the observed MSE can be thought of as an estimate of the expected MSE, which in mathematical notation we write like this:

\[ \mbox{E}\left\{ \frac{1}{N}\sum_{i=1}^N (\hat{Y}_i - Y_i)^2 \right\} \]

Note that this is a theoretical concept because in practice we only have one dataset to work with. But in theory, we can get a very very large number, call it \(B\), of random samples, apply our algorithm to each, obtain an MSE for each random sample, and think of the expected MSE as

\[ \frac{1}{B} \sum_{b=1}^B \frac{1}{N}\sum_{i=1}^N \left(\hat{y}_i^b - y_i^b\right)^2 \]

with \(y_{i}^b\) denoting the \(i\)th observation in the \(b\)th random sample and \(\hat{y}_i^b\) the resulting prediction obtained from applying the exact same algorithm to the \(b\)th random sample. But again, in practice, we only observe one random sample, so the expected MSE is only theoretical. However, in a later section we describe an approach to estimating the MSE.

Before we continue, note that there are loss functions other than the squared loss. For example, the mean absolute error uses absolute values instead of squaring the errors:

\[ \mbox{E}\left\{ \frac{1}{N}\sum_{i=1}^N |\hat{Y}_i - Y_i|\right\} \]

But in this book we focus on minimizing squared loss since it is the most widely used.

3.4 Conditional expectation minimizes squared loss function

So why do we care about the conditional expectation in machine learning? This is because the expected value has an attractive mathematical property: it minimizes the MSE. Specifically, of all possible predictions \(\hat{Y}\),

\[ \hat{Y} = \mbox{E}(Y \mid \mathbf{X}=\mathbf{x}) \, \mbox{ minimizes } \, \mbox{E}\{ (\hat{Y} - Y)^2 \mid \mathbf{X}=\mathbf{x} \} \]
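
A quick simulation illustrates the unconditional version of this property: among all constant predictions, the sample mean yields the smallest average squared loss. This toy sketch is not part of the original analysis:

set.seed(1)
y_sim <- rnorm(1000, mean = 5, sd = 2)
candidates <- seq(3, 7, 0.01)
mse <- sapply(candidates, function(m) mean((m - y_sim)^2))
candidates[which.min(mse)]  # essentially the same as mean(y_sim)
mean(y_sim)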

Due to this property, a succinct description of the main task of Machine Learning is that we use data to estimate

\[ f(\mathbf{x}) \equiv \mbox{E}( Y \mid \mathbf{X}=\mathbf{x} ) \]

for any set of features \(\mathbf{x} = (x_1, \dots, x_p)\). This is of course easier said than done, since this function can take any shape and \(p\) can be very large. Consider a case in which we only have one predictor \(x\). The expectation \(\mbox{E}\{ Y \mid X=x \}\) can be any function of \(x\): a line, a parabola, a sine wave, a step function, anything. It gets even more complicated when we consider cases with large \(p\), in which case \(f(\mathbf{x})\) is a function of a multidimensional vector \(\mathbf{x}\). For example, in our digit reader example \(p = 784\)! The main way in which competing machine learning algorithms differ is in their approach to estimating this expectation.

4 Linear regression for prediction

Linear regression can be considered a machine learning algorithm. As we will see, it is too rigid to be useful in general, but for some challenges it works rather well. It also serves as a baseline approach: if you can’t beat it with a more complex approach, you probably want to stick to linear regression. To quickly make the connection between regression and machine learning, we will reformulate Galton’s study with heights: a continuous outcome.

library(HistData)

galton_heights <- GaltonFamilies %>%
  filter(childNum == 1 & gender == "male") %>%
  select(father, childHeight) %>%
  rename(son = childHeight)

Suppose you are tasked with building a machine learning algorithm that predicts the son’s height \(Y\) using the father’s height \(X\). Let’s generate testing and training sets:

library(caret)
y <- galton_heights$son
test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)

train_set <- galton_heights %>% slice(-test_index)
test_set <- galton_heights %>% slice(test_index)

In this case, if we were just ignoring the father’s height and guessing the son’s height we would guess the average height of sons.

avg <- mean(train_set$son)
avg
## [1] 70.46477

Our squared loss is

mean((avg - test_set$son)^2)
## [1] 6.452315

Can we do better? In regression we learned that if the pair \((X,Y)\) follow a bivariate normal distribution, the conditional expectation (what we want to estimate) is equivalent to the regression line:

\[ f(x) = \mbox{E}( Y \mid X= x ) = \beta_0 + \beta_1 x \]

We also introduced least squares as a method for estimating the intercept \(\beta_0\) and slope \(\beta_1\):

fit <- lm(son ~ father, data = train_set)
fit$coef
## (Intercept)      father 
##  35.6454352   0.5041631

This gives us an estimate of the conditional expectation:

\[ \hat{f}(x) = 35.6 + 0.5 x \]

We can see that this does indeed provide an improvement over our guessing approach.

y_hat <- fit$coef[1] + fit$coef[2]*test_set$father
mean((y_hat - test_set$son)^2)
## [1] 4.733392

4.1 The predict function

The predict function is very useful for machine learning applications. This function takes a fitted object from a function such as lm or glm and a data frame with the new predictors for which to predict. So in our current example we would use predict like this:

y_hat <- predict(fit, test_set)

We can get the same results as above using predict:

y_hat <- predict(fit, test_set)
mean((y_hat - test_set$son)^2)
## [1] 4.733392

Note that predict does not always return objects of the same type. To learn about the specifics, you need to look at the help file for the type of fitted object being used. Note the difference between:

?predict.lm

and

?predict.glm

4.2 Regression for a categorical outcome

The regression approach can also be applied to categorical data. To illustrate this we will apply it to our previous predicting sex example:

If we define the outcome \(Y\) as 1 for females and 0 for males and \(X\) as the height, in this case we are interested in the conditional probability:

\[ \mbox{Pr}( Y = 1 \mid X = x) \]

As an example, let’s provide a prediction for a student that is 66 inches tall. What is the conditional probability of being female if you are 66 inches tall? In our dataset we can estimate this by rounding to the nearest inch and computing:

train_set %>% 
  filter(round(height)==66) %>%
  summarize(mean(sex=="Female"))
##   mean(sex == "Female")
## 1             0.2424242

We will define \(Y=1\) for females and \(Y=0\) for males. To construct a prediction algorithm we want to estimate the proportion of the population that is female for any given height \(X=x\) which we write as

\[ \mbox{Pr}( Y = 1 | X=x) \]

Let’s see what this looks like for several values of \(x\) (we will remove values of \(x\) with few data points):

heights %>% 
  mutate(x = round(height)) %>%
  group_by(x) %>%
  filter(n() >= 10) %>%
  summarize(prop = mean(sex == "Female")) %>%
  ggplot(aes(x, prop)) +
  geom_point()

Since the results from the plot above look close to linear, and it is the only approach we currently know, we will try regression. We assume that:

\[p(x) = \mbox{Pr}( Y = 1 | X=x) = \beta_0 + \beta_1 x\]

Note: because \(p_0(x) = 1 - p_1(x)\), we will only estimate \(p_1(x)\) and drop the index.

If we convert the factors to 0s and 1s, we can estimate \(\beta_0\) and \(\beta_1\) with least squares.

lm_fit <- mutate(train_set, y = as.numeric(sex == "Female")) %>%
                lm(y ~ height, data = .)

Once we have estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) we can obtain an actual prediction. Our estimate of the conditional probability \(p(x)\) is:

\[ \hat{p}(x) = \hat{\beta}_0+ \hat{\beta}_1 x \]

To form a prediction we define a decision rule: predict female if \(\hat{p}(x) > 0.5\). We can compare our predictions to the outcomes using:

p_hat <- predict(lm_fit, test_set)
y_hat <- ifelse(p_hat > 0.5, "Female", "Male") %>% factor()
confusionMatrix(y_hat, test_set$sex)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Female Male
##     Female     20   15
##     Male       98  393
##                                           
##                Accuracy : 0.7852          
##                  95% CI : (0.7476, 0.8195)
##     No Information Rate : 0.7757          
##     P-Value [Acc > NIR] : 0.3218          
##                                           
##                   Kappa : 0.177           
##  Mcnemar's Test P-Value : 1.22e-14        
##                                           
##             Sensitivity : 0.16949         
##             Specificity : 0.96324         
##          Pos Pred Value : 0.57143         
##          Neg Pred Value : 0.80041         
##              Prevalence : 0.22433         
##          Detection Rate : 0.03802         
##    Detection Prevalence : 0.06654         
##       Balanced Accuracy : 0.56636         
##                                           
##        'Positive' Class : Female          
## 

5 Logistic Regression

Note that the function \(\beta_0 + \beta_1 x\) can take any value including negatives and values larger than 1. In fact the estimate \(\hat{p}(x)\) computed in the linear regression section does indeed become negative at around 76 inches.

heights %>% 
  mutate(x = round(height)) %>%
  group_by(x) %>%
  filter(n() >= 10) %>%
  summarize(prop = mean(sex == "Female")) %>%
  ggplot(aes(x, prop)) +
  geom_point() + 
  geom_abline(intercept = lm_fit$coef[1], slope = lm_fit$coef[2])

The range is:

range(p_hat)
## [1] -0.397868  1.123309

But we are estimating a probability: \(\mbox{Pr}( Y = 1 \mid X = x)\).

Logistic regression is an extension of linear regression that assures that the estimate of \(\mbox{Pr}( Y = 1 \mid X = x)\) is between 0 and 1. This approach makes use of the logistic transformation:

\[ g(p) = \log \frac{p}{1-p}\]

The logistic transformation converts probabilities to log odds. As discussed in the data visualization lecture, the odds tell us how much more likely something is to happen compared to not happening. So \(p=0.5\) means the odds are 1 to 1, thus the odds are 1. If \(p=0.75\), the odds are 3 to 1. A nice characteristic of this transformation is that it converts probabilities into values that are symmetric around 0. Here is a plot of \(g(p)\) versus \(p\):
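
The plot is not reproduced here; a minimal sketch using qlogis, the base R function that computes the log odds \(g(p) = \log\{p/(1-p)\}\), would be:

data.frame(p = seq(0.01, 0.99, length.out = 100)) %>%
  mutate(g = qlogis(p)) %>%
  ggplot(aes(p, g)) +
  geom_line()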

With logistic regression we model the conditional probability directly with:

\[ g\left\{ \mbox{Pr}(Y = 1 \mid X=x) \right\} = \beta_0 + \beta_1 x \]

With this model, we can no longer use least squares. Instead we compute the maximum likelihood estimate (MLE). You can learn more about this concept in a statistical theory text.

In R we can fit the logistic regression model with the function glm: generalized linear models. This function is more general than logistic regression so we need to specify the model we want through the family parameter:

glm_fit <- train_set %>% 
  mutate(y = as.numeric(sex == "Female")) %>%
  glm(y ~ height, data=., family = "binomial")

We can obtain prediction using the predict function:

p_hat_logit <- predict(glm_fit, newdata = test_set, type = "response")

Note that when using predict with a glm object, we have to specify type = "response" if we want the conditional probabilities, since the default is to return the values on the logit (logistic transformed) scale.

Note that this model fits the data slightly better than the line:

Because we have an estimate \(\hat{p}(x)\) we can obtain predictions

y_hat_logit <- ifelse(p_hat_logit > 0.5, "Female", "Male") %>% factor
confusionMatrix(y_hat_logit, test_set$sex)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Female Male
##     Female     31   19
##     Male       87  389
##                                          
##                Accuracy : 0.7985         
##                  95% CI : (0.7616, 0.832)
##     No Information Rate : 0.7757         
##     P-Value [Acc > NIR] : 0.1138         
##                                          
##                   Kappa : 0.2718         
##  Mcnemar's Test P-Value : 7.635e-11      
##                                          
##             Sensitivity : 0.26271        
##             Specificity : 0.95343        
##          Pos Pred Value : 0.62000        
##          Neg Pred Value : 0.81723        
##              Prevalence : 0.22433        
##          Detection Rate : 0.05894        
##    Detection Prevalence : 0.09506        
##       Balanced Accuracy : 0.60807        
##                                          
##        'Positive' Class : Female         
## 

The resulting predictions are similar. This is because the two estimates of \(p(x)\) are larger than 1/2 in about the same region of \(x\):

# the object tmp (the per-inch proportion of females, as in the earlier plot) was not defined above; recreate it here
tmp <- heights %>%
  mutate(x = round(height)) %>%
  group_by(x) %>%
  filter(n() >= 10) %>%
  summarize(prop = mean(sex == "Female"))
data.frame(x = seq(min(tmp$x), max(tmp$x))) %>%
  mutate(logistic = plogis(glm_fit$coef[1] + glm_fit$coef[2]*x),
         regression = lm_fit$coef[1] + lm_fit$coef[2]*x) %>%
  gather(method, p_x, -x) %>%
  ggplot(aes(x, p_x, color = method)) + 
  geom_line() +
  geom_hline(yintercept = 0.5, lty = 5)

Both linear and logistic regression provide an estimate for the conditional expectation:

\[ \mbox{E}(Y \mid X=x) \] which in the case of binary data is equivalent to the conditional probability:

\[ \mbox{Pr}(Y = 1 \mid X = x) \]

Once we move on to more complex examples, we will see that linear regression is limited and not flexible enough to be useful. The techniques we learn later are essentially approaches to estimating the conditional probability in a way that is more flexible.

6 Case study: Is it a 2 or a 7?

In the two simple examples above we only had one predictor. We do not actually consider these machine learning challenges, which are characterized by having many predictors. Let’s go back to the digits example, in which we had 784 predictors. For illustrative purposes, however, we will start with an example with two predictors and two outcomes.

We want to build an algorithm that can determine if a digit is a 2 or 7 from the predictors. We are not quite ready to build algorithms with 784 predictors so we will extract two simple predictors from the 784: the proportion of dark pixels that are in the upper left quadrant (\(X_1\)) and the lower right quadrant (\(X_2\)).

We then selected a random sample of 1,000 digits, 500 for the training set and 500 for the test set, and provide them here:

data("mnist_27")

We can explore this data by plotting the two predictors and use colors to denote the labels:

mnist_27$train %>% ggplot(aes(x_1, x_2, color = y)) +
  geom_point()

We can immediately see some patterns. For example, if \(X_1\) (the upper left quadrant) is large, then the digit is probably a 7. Also, for smaller values of \(X_1\), the 2s appear to be in the mid-range values of \(X_2\).

These are the images of the digits with the largest and smallest values for \(X_1\):

And here are the original images corresponding to the largest and smallest values of \(X_2\):

We can start getting a sense for why these predictors are useful but also why the problem will be somewhat challenging.

So let’s try building a machine learning algorithm. We haven’t really learned any algorithms yet, so let’s start with logistic regression. The model is simply:

\[ p(x_1, x_2) = \mbox{Pr}(Y=1 \mid X_1=x_1 , X_2 = x_2) = g^{-1}(\beta_0 + \beta_1 x_1 + \beta_2 x_2) \]

with \(g^{-1}\) the inverse of the logistic function: \(g^{-1}(x) = \exp(x)/\{1+\exp(x)\}\). We fit it like this:

fit <- glm(y ~ x_1 + x_2, data=mnist_27$train, family="binomial")

We can now build a decision rule based on the estimate of \(\hat{p}(x_1, x_2)\):

p_hat <- predict(fit, newdata = mnist_27$test, type="response")
y_hat <- factor(ifelse(p_hat > 0.5, 7, 2))
library(caret)
confusionMatrix(data = y_hat, reference = mnist_27$test$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  2  7
##          2 82 26
##          7 24 68
##                                          
##                Accuracy : 0.75           
##                  95% CI : (0.684, 0.8084)
##     No Information Rate : 0.53           
##     P-Value [Acc > NIR] : 1.266e-10      
##                                          
##                   Kappa : 0.4976         
##  Mcnemar's Test P-Value : 0.8875         
##                                          
##             Sensitivity : 0.7736         
##             Specificity : 0.7234         
##          Pos Pred Value : 0.7593         
##          Neg Pred Value : 0.7391         
##              Prevalence : 0.5300         
##          Detection Rate : 0.4100         
##    Detection Prevalence : 0.5400         
##       Balanced Accuracy : 0.7485         
##                                          
##        'Positive' Class : 2              
## 

We get an accuracy of 0.75! Not bad for our first try. But can we do better?

Because we constructed the mnist_27 example and had at our disposal 60,000 digits in the full MNIST dataset, we used this to build the true conditional distribution \(p(x_1, x_2)\). Note that this is something we don’t have access to in practice, but we include it in this example because it lets us compare \(\hat{p}(x_1, x_2)\) to the true \(p(x_1, x_2)\), which teaches us the limitations of different algorithms.

Let’s do that here. We can access and plot \(p(x_1,x_2)\) like this:

mnist_27$true_p %>% ggplot(aes(x_1, x_2, fill=p)) +
  geom_raster() 

We will choose better colors and draw a curve that separates pairs \((x_1,x_2)\) for which \(p(x_1,x_2) > 0.5\) from those for which \(p(x_1,x_2) < 0.5\):

mnist_27$true_p %>% ggplot(aes(x_1, x_2, z=p, fill=p)) +
  geom_raster() +
  scale_fill_gradientn(colors=c("#F8766D","white","#00BFC4")) +
  stat_contour(breaks=c(0.5),color="black")

Above you see a plot of the true \(p(x_1, x_2)\). To start understanding the limitations of logistic regression, first note that with logistic regression \(\hat{p}(x_1, x_2)\) has to be a plane, and as a result the boundary defined by the decision rule is given by

\[ \hat{p}(x_1, x_2) = 0.5 \]

which implies the boundary can’t be anything other than a straight line:

\[ g^{-1}(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2) = 0.5 \implies \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 = g(0.5) = 0 \implies x_2 = -\hat{\beta}_0/\hat{\beta}_2 - (\hat{\beta}_1/\hat{\beta}_2) \, x_1 \]

which implies our logistic regression approach has no chance of capturing the non-linear nature of the true \(p(x_1,x_2)\). Here is a visual representation of \(\hat{p}(x_1, x_2)\):

p_hat <- predict(fit, newdata = mnist_27$true_p, type = "response")
mnist_27$true_p %>% mutate(p_hat = p_hat) %>%
  ggplot(aes(x_1, x_2,  z=p_hat, fill=p_hat)) +
  geom_raster() +
  scale_fill_gradientn(colors=c("#F8766D","white","#00BFC4")) +
  stat_contour(breaks=c(0.5),color="black") 
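
The boundary line can also be computed directly from the fitted coefficients. This small sketch uses the glm fit object defined above:

beta <- coef(fit)                        # (Intercept), x_1 and x_2 coefficients
boundary_intercept <- -beta[1] / beta[3]
boundary_slope <- -beta[2] / beta[3]
c(boundary_intercept, boundary_slope)    # the line x_2 = intercept + slope * x_1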

We can see that the mistakes mainly come from low values of \(x_1\) that have either high or low values of \(x_2\). Logistic regression can’t capture this.

p_hat <- predict(fit, newdata = mnist_27$true_p, type="response")
mnist_27$true_p %>% mutate(p_hat = p_hat) %>%
  ggplot() +
  stat_contour(aes(x_1, x_2, z=p_hat), breaks=c(0.5), color="black") + 
  geom_point(mapping = aes(x_1, x_2, color=y), data = mnist_27$test) 

We need something more flexible: a method that permits estimates with shapes other than a plane. We are going to learn a few new algorithms based on different ideas and concepts. What they all have in common is that they permit more flexible approaches. We will start by describing nearest neighbor and kernel approaches. To introduce the concepts behind these approaches, we will again start with a simple one-dimensional example and describe the concept of smoothing.