Overview of Classification

Classification is a form of supervised learning where the response variable is categorical, as opposed to numeric for regression. Our goal is to find a rule, algorithm, or function which takes as input a feature vector, and outputs a category which is the true category as often as possible.

That is, the classifier C^(x)\hat{C}(x) returns the predicted category y^(X)\hat{y}(X).

y^(x)=C^(x)\hat{y}(x) = \hat{C}(x)

To build our first classifier, we will use the Default dataset from the ISLR package.


Our goal is to properly classify individuals as defaulters based on student status, credit card balance, and income. Be aware that the response default is a factor, as is the predictor student.


As we did with regression, we test-train split our data. In this case, using 50% for each.

default_idx   = sample(nrow(Default), 5000)
default_trn = Default[default_idx, ]
default_tst = Default[-default_idx, ]

Visualization for Classification

Often, some simple visualizations can suggest simple classification rules. To quickly create some useful visualizations, we use the featurePlot() function from the caret() package.


A density plot can often suggest a simple split based on a numeric predictor. Essentially this plot graphs a density estimate

f^Xi(xiY=k)\hat{f}_{X_i}(x_i \mid Y = k)

for each numeric predictor xix_i and each category kk of the response yy.

featurePlot(x = default_trn[, c("balance", "income")],
            y = default_trn$default,
            plot = "density",
            scales = list(x = list(relation = "free"),
                          y = list(relation = "free")),
            adjust = 1.5,
            pch = "|",
            layout = c(2, 1),
            auto.key = list(columns = 2))

Some notes about the arguments to this function:

  • x is a data frame containing only numeric predictors. It would be nonsensical to estimate a density for a categorical predictor.
  • y is the response variable. It needs to be a factor variable. If coded as 0 and 1, you will need to coerce to factor for plotting.
  • plot specifies the type of plot, here density.
  • scales defines the scale of the axes for each plot. By default, the axis of each plot would be the same, which often is not useful, so the arguments here, a different axis for each plot, will almost always be used.
  • adjust specifies the amount of smoothing used for the density estimate.
  • pch specifies the plot character used for the bottom of the plot.
  • layout places the individual plots into rows and columns. For some odd reason, it is given as (col, row).
  • auto.key defines the key at the top of the plot. The number of columns should be the number of categories.

It seems that the income variable by itself is not particularly useful. However, there seems to be a big difference in default status at a balance of about 1400. We will use this information shortly.

featurePlot(x = default_trn[, c("balance", "income")],
            y = default_trn$student,
            plot = "density",
            scales = list(x = list(relation = "free"),
                          y = list(relation = "free")),
            adjust = 1.5,
            pch = "|",
            layout = c(2, 1),
            auto.key = list(columns = 2))

Above, we create a similar plot, except with student as the response. We see that students often carry a slightly larger balance, and have far lower income. This will be useful to know when making more complicated classifiers.

featurePlot(x = default_trn[, c("student", "balance", "income")],
            y = default_trn$default,
            plot = "pairs",
            auto.key = list(columns = 2))

We can use plot = "pairs" to consider multiple variables at the same time. This plot reinforces using balance to create a classifier, and again shows that income seems not that useful.

featurePlot(x = default_trn[, c("balance", "income")],
            y = default_trn$default,
            plot = "ellipse",
            auto.key = list(columns = 2))

Similar to pairs is a plot of type ellipse, which requires the ellipse package. Here we only use numeric predictors, as essentially we are assuming multivariate normality. The ellipses mark points of equal density. This will be useful later when discussing LDA and QDA.

A Simple Classifier

A very simple classifier is a rule based on a boundary bb for a particular input variable xx.

C^(x)={1x>b0xb\hat{C}(x) = \begin{cases} 1 & x > b \\ 0 & x \leq b \end{cases}

Based on the first plot, we believe we can use balance to create a reasonable classifier. In particular,

C^(balance)={Yesbalance>1400Nobalance1400\hat{C}(\texttt{balance}) = \begin{cases} \text{Yes} & \texttt{balance} > 1400 \\ \text{No} & \texttt{balance} \leq 1400 \end{cases}

So we predict an individual is a defaulter if their balance is above 1400, and not a defaulter if the balance is 1400 or less.

simple_class = function(x, boundary, above = 1, below = 0) {
  ifelse(x > boundary, above, below)

We write a simple R function that compares a variable to a boundary, then use it to make predictions on the train and test sets with our chosen variable and boundary.

default_trn_pred = simple_class(x = default_trn$balance,
                                boundary = 1400, above = "Yes", below = "No")
default_tst_pred = simple_class(x = default_tst$balance,
                                boundary = 1400, above = "Yes", below = "No")
head(default_tst_pred, n = 10)

Metrics for Classification

In the classification setting, there are a large number of metrics to assess how well a classifier is performing.

One of the most obvious things to do is arrange predictions and true values in a cross table.

(trn_tab = table(predicted = default_trn_pred, actual = default_trn$default))
(tst_tab = table(predicted = default_tst_pred, actual = default_tst$default))

Often we give specific names to individual cells of these tables, and in the predictive setting, we would call this table a confusion matrix. Be aware, that the placement of Actual and Predicted values affects the names of the cells, and often the matrix may be presented transposed.

In statistics, we label the errors Type I and Type II, but these are hard to remember. False Positive and False Negative are more descriptive, so we choose to use these.

The confusionMatrix() function from the caret package can be used to obtain a wealth of additional information, which we see output below for the test data. Note that we specify which category is considered "positive."

trn_con_mat  = confusionMatrix(trn_tab, positive = "Yes")
(tst_con_mat = confusionMatrix(tst_tab, positive = "Yes"))

The most common, and most important metric is the classification error rate.

err(C^,Data)=1ni=1nI(yiC^(xi))\text{err}(\hat{C}, \text{Data}) = \frac{1}{n}\sum_{i = 1}^{n}I(y_i \neq \hat{C}(x_i))

Here, II is an indicator function, so we are essentially calculating the proportion of predicted classes that match the true class.

I(yiC^(x))={1yiC^(x)0yi=C^(x)I(y_i \neq \hat{C}(x)) = \begin{cases} 1 & y_i \neq \hat{C}(x) \\ 0 & y_i = \hat{C}(x) \\ \end{cases}

It is also common to discuss the accuracy, which is simply one minus the error.

Like regression, we often split the data, and then consider Train (Classification) Error and Test (Classification) Error will be used as a measure of how well a classifier will work on unseen future data.

errtrn(C^,Train Data)=1ntrnitrnI(yiC^(xi))\text{err}_{\texttt{trn}}(\hat{C}, \text{Train Data}) = \frac{1}{n_{\texttt{trn}}}\sum_{i \in \texttt{trn}}^{}I(y_i \neq \hat{C}(x_i))

errtst(C^,Test Data)=1ntstitstI(yiC^(xi))\text{err}_{\texttt{tst}}(\hat{C}, \text{Test Data}) = \frac{1}{n_{\texttt{tst}}}\sum_{i \in \texttt{tst}}^{}I(y_i \neq \hat{C}(x_i))

Accuracy values can be found by calling confusionMatrix(), or, if stored, can be accessed directly. Here, we use them to obtain error rates.

1 - trn_con_mat$overall["Accuracy"]
1 - tst_con_mat$overall["Accuracy"]

Sometimes guarding against making certain errors, FP or FN, are more important than simply finding the best accuracy. Thus, sometimes we will consider sensitivity and specificity.

Sens=True Positive Rate=TPP=TPTP + FN\text{Sens} = \text{True Positive Rate} = \frac{\text{TP}}{\text{P}} = \frac{\text{TP}}{\text{TP + FN}}


Spec=True Negative Rate=TNN=TNTN + FP\text{Spec} = \text{True Negative Rate} = \frac{\text{TN}}{\text{N}} = \frac{\text{TN}}{\text{TN + FP}}


Like accuracy, these can easily be found using confusionMatrix().

When considering how well a classifier is performing, often, it is understandable to assume that any accuracy in a binary classification problem above 0.50, is a reasonable classifier. This however is not the case. We need to consider the balance of the classes. To do so, we look at the prevalence of positive cases.

Prev=PTotal Obs=TP + FNTotal Obs\text{Prev} = \frac{\text{P}}{\text{Total Obs}}= \frac{\text{TP + FN}}{\text{Total Obs}}


Here, we see an extremely low prevalence, which suggests an even simpler classifier than our current based on balance.

C^(balance)={Nobalance>1400Nobalance1400\hat{C}(\texttt{balance}) = \begin{cases} \text{No} & \texttt{balance} > 1400 \\ \text{No} & \texttt{balance} \leq 1400 \end{cases}

This classifier simply classifies all observations as negative cases.

pred_all_no = simple_class(default_tst$balance,
                           boundary = 1400, above = "No", below = "No")
table(predicted = pred_all_no, actual = default_tst$default)

The confusionMatrix() function won't even accept this table as input, because it isn't a full matrix, only one row, so we calculate error rates directly. To do so, we write a function.

calc_class_err = function(actual, predicted) {
  mean(actual != predicted)
calc_class_err(actual = default_tst$default,
               predicted = pred_all_no)

Here we see that the error rate is exactly the prevelance of the minority class.

table(default_tst$default) / length(default_tst$default)

This classifier does better than the previous. But the point is, in reality, to create a good classifier, we should obtain a test error better than 0.033, which is obtained by simply manipulating the prevalences. Next chapter, we'll introduce much better classifiers which should have no problem accomplishing this task.