Logistic Regression

"The line between disorder and order lies in logistics."

  • Sun Tzu (speaking about a decidedly different type of logistics)

After these lectures you will be able to:

  • Understand how generalized linear models are a generalization of ordinary linear models.
  • Use logistic regression to model a binary response.
  • Apply concepts learned for ordinary linear models to logistic regression.
  • Use logistic regression to perform classification.

Simple motivating question: Can we use linear models to deal with categorical variables?

Yes!

The model that we previously covered, which we now call ordinary linear regression, is actually a specific case of the more general, generalized linear model. (Aren't statisticians great at naming things?)

Generalized Linear Models

So far, we've had response variables that, conditioned on the predictors, were modeled using a normal distribution with a mean that is some linear combination of the predictors. This linear combination is what made a linear model "linear."

Y \mid {\bf X} = {\bf x} \sim N(\beta_0 + \beta_1 x_1 + \ldots + \beta_{p - 1} x_{p - 1}, \ \sigma^2)

Now we'll allow for two modifications of this situation, which will let us use linear models in many more situations. Instead of using a normal distribution for the response conditioned on the predictors, we'll allow for other distributions. Also, instead of the conditional mean being a linear combination of the predictors, it can be some function of a linear combination of the predictors.

In general, a generalized linear model has three parts:

  • A distribution of the response conditioned on the predictors. (Technically this distribution needs to be from the exponential family of distributions.)
  • A linear combination of the p - 1 predictors, \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_{p - 1} x_{p - 1}, which we write as \eta({\bf x}). That is,

\eta({\bf x}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_{p - 1} x_{p - 1}

  • A link function, g(), that defines how \eta({\bf x}), the linear combination of the predictors, is related to the mean of the response conditioned on the predictors, \text{E}[Y \mid {\bf X} = {\bf x}].

\eta({\bf x}) = g\left(\text{E}[Y \mid {\bf X} = {\bf x}]\right).

The following table summarizes three examples of a generalized linear model:

|  | Linear Regression | Poisson Regression | Logistic Regression |
|---|---|---|---|
| Y \mid {\bf X} = {\bf x} | N(\mu({\bf x}), \sigma^2) | \text{Pois}(\lambda({\bf x})) | \text{Bern}(p({\bf x})) |
| Distribution Name | Normal | Poisson | Bernoulli (Binomial) |
| \text{E}[Y \mid {\bf X} = {\bf x}] | \mu({\bf x}) | \lambda({\bf x}) | p({\bf x}) |
| Support | Real: (-\infty, \infty) | Integer: 0, 1, 2, \ldots | Integer: 0, 1 |
| Usage | Numeric Data | Count (Integer) Data | Binary (Class) Data |
| Link Name | Identity | Log | Logit |
| Link Function | \eta({\bf x}) = \mu({\bf x}) | \eta({\bf x}) = \log(\lambda({\bf x})) | \eta({\bf x}) = \log\left(\frac{p({\bf x})}{1 - p({\bf x})}\right) |
| Mean Function | \mu({\bf x}) = \eta({\bf x}) | \lambda({\bf x}) = e^{\eta({\bf x})} | p({\bf x}) = \frac{e^{\eta({\bf x})}}{1 + e^{\eta({\bf x})}} = \frac{1}{1 + e^{-\eta({\bf x})}} |
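
The link and mean (inverse link) functions in this table are also available from R's family objects, which we will later pass to glm(). For example:

binomial()$linkfun(0.75)    # logit link: log(0.75 / 0.25)
binomial()$linkinv(log(3))  # inverse logit (mean function): recovers 0.75
poisson()$linkfun(10)       # log link: log(10)
gaussian()$linkfun(10)      # identity link: 10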

Like ordinary linear regression, we will seek to "fit" the model by estimating the \beta parameters. To do so, we will use the method of maximum likelihood.

Note that a Bernoulli distribution is a specific case of a binomial distribution where the n parameter of a binomial is 1. Binomial regression is also possible, but we'll focus on the much more popular Bernoulli case.

So, in general, GLMs relate the mean of the response to a linear combination of the predictors, \eta({\bf x}), through the use of a link function, g(). That is,

\eta({\bf x}) = g\left(\text{E}[Y \mid {\bf X} = {\bf x}]\right).

The mean is then

\text{E}[Y \mid {\bf X} = {\bf x}] = g^{-1}(\eta({\bf x})).

Binary Response

To illustrate the use of a GLM, we'll focus on the case of a binary response variable coded using 0 and 1. In practice, these 0s and 1s will code for two classes such as yes/no, cat/dog, sick/healthy, etc.

Y = \begin{cases} 1 & \text{yes} \\ 0 & \text{no} \end{cases}

First, we define some notation that we will use throughout.

p({\bf x}) = P[Y = 1 \mid {\bf X} = {\bf x}]

With a binary (Bernoulli) response, we'll mostly focus on the case when Y = 1, since with only two possibilities, it is trivial to obtain probabilities when Y = 0.

P[Y = 0 \mid {\bf X} = {\bf x}] + P[Y = 1 \mid {\bf X} = {\bf x}] = 1

P[Y = 0 \mid {\bf X} = {\bf x}] = 1 - p({\bf x})

We now define the logistic regression model.

\log\left(\frac{p({\bf x})}{1 - p({\bf x})}\right) = \beta_0 + \beta_1 x_1 + \ldots + \beta_{p - 1} x_{p - 1}

Immediately we notice some similarities to ordinary linear regression, in particular the right-hand side, which is our usual linear combination of the predictors. We have our usual p - 1 predictors for a total of p \beta parameters. (Note that many machine learning focused texts use p as the number of parameters. This is an arbitrary choice, but you should be aware of it.)

The left-hand side is called the log odds, which is the log of the odds. The odds are the probability of a positive event (Y = 1) divided by the probability of a negative event (Y = 0). So when the odds equal 1, the two events have equal probability. Odds greater than 1 favor a positive event. The opposite is true when the odds are less than 1.

\frac{p({\bf x})}{1 - p({\bf x})} = \frac{P[Y = 1 \mid {\bf X} = {\bf x}]}{P[Y = 0 \mid {\bf X} = {\bf x}]}

Essentially, the log odds are the logit transform applied to p({\bf x}).

\text{logit}(\xi) = \log\left(\frac{\xi}{1 - \xi}\right)

It will also be useful to define the inverse logit, otherwise known as the "logistic" or sigmoid function.

\text{logit}^{-1}(\xi) = \frac{e^\xi}{1 + e^{\xi}} = \frac{1}{1 + e^{-\xi}}

Note that for \xi \in (-\infty, \infty), this function outputs values between 0 and 1.
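
Both transformations are easy to compute in R; the built-in plogis() and qlogis() functions (the distribution and quantile functions of the standard logistic distribution) are the inverse logit and logit, respectively. A quick check:

logit     = function(xi) log(xi / (1 - xi))  # same as qlogis(xi)
inv_logit = function(xi) 1 / (1 + exp(-xi))  # same as plogis(xi)
inv_logit(c(-5, 0, 5))  # returns values strictly between 0 and 1
plogis(c(-5, 0, 5))     # matches the line above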

Students often ask: where is the error term? The answer is that it's something specific to the normal model. First, notice that the model with the error term,

Y = \beta_0 + \beta_1 x_1 + \ldots + \beta_q x_q + \epsilon, \ \ \epsilon \sim N(0, \sigma^2)
can instead be written as

Y \mid {\bf X} = {\bf x} \sim N(\beta_0 + \beta_1 x_1 + \ldots + \beta_q x_q, \ \sigma^2).

While our main focus is on estimating the mean, \beta_0 + \beta_1 x_1 + \ldots + \beta_q x_q, there is also another parameter, \sigma^2, which needs to be estimated. This is the result of the normal distribution having two parameters.

With logistic regression, which uses the Bernoulli distribution, we only need to estimate the Bernoulli distribution's single parameter p({\bf x}), which happens to be its mean.

\log\left(\frac{p({\bf x})}{1 - p({\bf x})}\right) = \beta_0 + \beta_1 x_1 + \ldots + \beta_q x_q

So even though we introduced ordinary linear regression first, in some ways, logistic regression is actually simpler.

Note that applying the inverse logit transformation allows us to obtain an expression for p({\bf x}).

p({\bf x}) = P[Y = 1 \mid {\bf X} = {\bf x}] = \frac{e^{\beta_0 + \beta_1 x_{1} + \cdots + \beta_{p-1} x_{(p-1)}}}{1 + e^{\beta_0 + \beta_1 x_{1} + \cdots + \beta_{p-1} x_{(p-1)}}}

Fitting Logistic Regression

With n observations, we write the model indexed with i to note that it is being applied to each observation.

\log\left(\frac{p({\bf x_i})}{1 - p({\bf x_i})}\right) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i(p-1)}

We can apply the inverse logit transformation to obtain P[Y_i = 1 \mid {\bf X_i} = {\bf x_i}] for each observation. Since these are probabilities, it's good that we used a function that returns values between 0 and 1.

p({\bf x_i}) = P[Y_i = 1 \mid {\bf X_i} = {\bf x_i}] = \frac{e^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i(p-1)}}}{1 + e^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i(p-1)}}}

1 - p({\bf x_i}) = P[Y_i = 0 \mid {\bf X_i} = {\bf x_i}] = \frac{1}{1 + e^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i(p-1)}}}

To "fit" this model, that is, estimate the \beta parameters, we will use maximum likelihood.

\boldsymbol{\beta} = [\beta_0, \beta_1, \beta_2, \beta_3, \ldots, \beta_{p - 1}]

We first write the likelihood given the observed data.

L(\boldsymbol{\beta}) = \prod_{i = 1}^{n} P[Y_i = y_i \mid {\bf X_i} = {\bf x_i}]

This is already technically a function of the \beta parameters, but we'll do some rearrangement to make this more explicit.[1]

L(\boldsymbol{\beta}) = \prod_{i = 1}^{n} p({\bf x_i})^{y_i} (1 - p({\bf x_i}))^{(1 - y_i)}

L(\boldsymbol{\beta}) = \prod_{i : y_i = 1} p({\bf x_i}) \prod_{j : y_j = 0} (1 - p({\bf x_j}))

L(\boldsymbol{\beta}) = \prod_{i : y_i = 1} \frac{e^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i(p-1)}}}{1 + e^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i(p-1)}}} \prod_{j : y_j = 0} \frac{1}{1 + e^{\beta_0 + \beta_1 x_{j1} + \cdots + \beta_{p-1} x_{j(p-1)}}}
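
To sketch what maximizing this likelihood involves, the code below writes the negative log-likelihood directly; here y (a 0/1 response vector) and X (a model matrix whose first column is all 1s) are hypothetical placeholder names, and, as noted in the footnote, R's glm() actually uses iteratively reweighted least squares rather than this direct approach.

neg_log_lik = function(beta, X, y) {
  eta = X %*% beta                         # eta(x_i) for each observation
  p   = 1 / (1 + exp(-eta))                # inverse logit gives p(x_i)
  -sum(y * log(p) + (1 - y) * log(1 - p))  # negative of the log of the likelihood above
}
# one (generic) way to obtain the MLE numerically:
# optim(par = rep(0, ncol(X)), fn = neg_log_lik, X = X, y = y, method = "BFGS")$par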

Fitting Issues

We should note that, if there exists some \beta^* such that

{\bf x_i}^{\top} \beta^* > 0 \implies y_i = 1

and

{\bf x_i}^{\top} \beta^* < 0 \implies y_i = 0

for all observations, then the MLE is not unique. Such data is said to be separable.[2]

This, and similar numeric issues related to estimated probabilities near 0 or 1, will return a warning in R:

message("Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred")
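
As a small made-up illustration, a data set that is perfectly separated by a single predictor triggers exactly this behavior:

# toy separable data (constructed only for illustration): y is 1 exactly when x > 3
sep_data = data.frame(x = c(1, 2, 3, 4, 5, 6), y = c(0, 0, 0, 1, 1, 1))
fit_sep  = glm(y ~ x, data = sep_data, family = binomial)  # produces the warning above
coef(fit_sep)  # the estimates are extreme and should not be trusted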

Simulation Examples

sim_logistic_data = function(sample_size = 25, beta_0 = -2, beta_1 = 3) {
  x = rnorm(n = sample_size)
  eta = beta_0 + beta_1 * x                        # linear combination of the predictor
  p = 1 / (1 + exp(-eta))                          # inverse logit gives P[Y = 1 | X = x]
  y = rbinom(n = sample_size, size = 1, prob = p)  # simulate the Bernoulli response
  data.frame(y, x)
}

You might think: why not simply use ordinary linear regression? Even with a binary response, our goal is still to model (some function of) \text{E}[Y \mid {\bf X} = {\bf x}].

With a binary response coded as 0 and 1, this implies:

\text{E}[Y \mid {\bf X} = {\bf x}] = P[Y = 1 \mid {\bf X} = {\bf x}], since

\begin{aligned} \text{E}[Y \mid {\bf X} = {\bf x}] &= 1 \cdot P[Y = 1 \mid {\bf X} = {\bf x}] + 0 \cdot P[Y = 0 \mid {\bf X} = {\bf x}] \\ &= P[Y = 1 \mid {\bf X} = {\bf x}] \end{aligned}

Then why can't we just use ordinary linear regression to estimate \text{E}[Y \mid {\bf X} = {\bf x}], and thus P[Y = 1 \mid {\bf X} = {\bf x}]?

To investigate, let's simulate data from the following model:

\log\left(\frac{p({\bf x})}{1 - p({\bf x})}\right) = -2 + 3 x

Another way to write this, which better matches the function we're using to simulate the data:

\begin{aligned} Y_i \mid {\bf X_i} = {\bf x_i} &\sim \text{Bern}(p_i) \\ p_i &= p({\bf x_i}) = \frac{1}{1 + e^{-\eta({\bf x_i})}} \\ \eta({\bf x_i}) &= -2 + 3 x_i \end{aligned}

set.seed(1)
example_data = sim_logistic_data()
head(example_data)

After simulating a dataset, we'll fit both ordinary linear regression and logistic regression. Notice that currently the response variable y is a numeric variable that only takes values 0 and 1. Later we'll see that we can also fit logistic regression when the response is a factor variable with only two levels. (Generally, having a factor response is preferred, but having a dummy response allows us to make the comparison to ordinary linear regression.)

# ordinary linear regression
fit_lm  = lm(y ~ x, data = example_data)
# logistic regression
fit_glm = glm(y ~ x, data = example_data, family = binomial)

Notice that the syntax is extremely similar. What's changed?

  • lm() has become glm()
  • We've added the family = binomial argument

In a lot of ways, lm() is just a more specific version of glm(). For example,

glm(y ~ x, data = example_data)

would actually fit the ordinary linear regression that we have seen in the past. By default, glm() uses the family = gaussian argument. That is, we're fitting a GLM with a normally distributed response and the identity function as the link.
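
As a quick check using the simulated example_data, the two calls produce the same estimated coefficients, since both fit the same model:

coef(lm(y ~ x, data = example_data))
coef(glm(y ~ x, data = example_data))  # matches the lm() estimates (up to numerical tolerance)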

The family argument to glm() actually specifies both the distribution and the link function. If not made explicit, the link function is chosen to be the canonical link function, which is essentially the most mathematically convenient link function. See ?glm and ?family for details. For example, the following code explicitly specifies the link function that was previously used by default.

# more detailed call to glm for logistic regression
fit_glm = glm(y ~ x, data = example_data, family = binomial(link = "logit"))

Making predictions with an object of type glm is slightly different than making predictions after fitting with lm(). In the case of logistic regression, with family = binomial, we have:

| type | Returned |
|---|---|
| "link" [default] | \hat{\eta}({\bf x}) = \log\left(\frac{\hat{p}({\bf x})}{1 - \hat{p}({\bf x})}\right) |
| "response" | \hat{p}({\bf x}) = \frac{e^{\hat{\eta}({\bf x})}}{1 + e^{\hat{\eta}({\bf x})}} = \frac{1}{1 + e^{-\hat{\eta}({\bf x})}} |

That is, type = "link" will get you the log odds, while type = "response" will return the estimated mean, in this case P[Y = 1 \mid {\bf X} = {\bf x}], for each observation.
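
For example, using the fitted model from above and an arbitrary value x = 0.5 (chosen only for illustration):

predict(fit_glm, newdata = data.frame(x = 0.5), type = "link")      # estimated log odds
predict(fit_glm, newdata = data.frame(x = 0.5), type = "response")  # estimated probability
# applying the inverse logit to the "link" value reproduces the "response" value
boot::inv.logit(predict(fit_glm, newdata = data.frame(x = 0.5), type = "link"))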

[Figure: Lecture12-1]

plot(y ~ x, data = example_data,
     pch = 20, ylab = "Estimated Probability",
     main = "Ordinary vs Logistic Regression")
grid()
abline(fit_lm, col = "darkorange")
curve(predict(fit_glm, data.frame(x), type = "response"),
      add = TRUE, col = "dodgerblue", lty = 2)
legend("topleft", c("Ordinary", "Logistic", "Data"), lty = c(1, 2, 0),
       pch = c(NA, NA, 20), lwd = 2, col = c("darkorange", "dodgerblue", "black"))

Since we only have a single predictor variable, we are able to show this situation graphically. First, note that the data is plotted using black dots. The response y only takes values 0 and 1.

The solid orange line is the fitted ordinary linear regression.

The dashed blue curve is the estimated logistic regression. It is helpful to realize that we are not plotting an estimate of Y for either. (Sometimes it might seem that way with ordinary linear regression, but that isn't what is happening.) For both, we are plotting \hat{\text{E}}[Y \mid {\bf X} = {\bf x}], the estimated mean, which for a binary response happens to be an estimate of P[Y = 1 \mid {\bf X} = {\bf x}].

We immediately see why ordinary linear regression is not a good idea. While it is estimating the mean, we see that it produces estimates that are less than 0! (And in other situations could produce estimates greater than 1!) If the mean is a probability, we don't want probabilities less than 0 or greater than 1.

Enter logistic regression. Since the output of the inverse logit function is restricted to be between 0 and 1, our estimates make much more sense as probabilities. Let's look at our estimated coefficients. (With a lot of rounding, for simplicity.)

round(coef(fit_glm), 1)

Our estimated model is then:

\log\left(\frac{\hat{p}({\bf x})}{1 - \hat{p}({\bf x})}\right) = -2.3 + 3.7 x

Because we're not directly estimating the mean, but instead a function of the mean, we need to be careful with our interpretation of \hat{\beta}_1 = 3.7. This means that, for a one unit increase in x, the log odds change (in this case increase) by 3.7. Also, since \hat{\beta}_1 is positive, as we increase x we also increase \hat{p}({\bf x}). To see how much, we have to consider the inverse logistic function.

For example, we have:

\hat{P}[Y = 1 \mid X = -0.5] = \frac{e^{-2.3 + 3.7 \cdot (-0.5)}}{1 + e^{-2.3 + 3.7 \cdot (-0.5)}} \approx 0.016

\hat{P}[Y = 1 \mid X = 0] = \frac{e^{-2.3 + 3.7 \cdot (0)}}{1 + e^{-2.3 + 3.7 \cdot (0)}} \approx 0.091

\hat{P}[Y = 1 \mid X = 1] = \frac{e^{-2.3 + 3.7 \cdot (1)}}{1 + e^{-2.3 + 3.7 \cdot (1)}} \approx 0.802
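
These hand calculations use the rounded coefficients; letting R compute the same probabilities (and the estimated odds ratio) from the unrounded fit gives slightly different values:

predict(fit_glm, newdata = data.frame(x = c(-0.5, 0, 1)), type = "response")
exp(coef(fit_glm)["x"])  # estimated odds ratio for a one unit increase in x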

Now that we know we should use logistic regression, and not ordinary linear regression, let's consider another example. This time, consider the model

\log\left(\frac{p({\bf x})}{1 - p({\bf x})}\right) = 1 + -4 x.

Again, we could re-write this to better match the function we're using to simulate the data:

\begin{aligned} Y_i \mid {\bf X_i} = {\bf x_i} &\sim \text{Bern}(p_i) \\ p_i &= p({\bf x_i}) = \frac{1}{1 + e^{-\eta({\bf x_i})}} \\ \eta({\bf x_i}) &= 1 + -4 x_i \end{aligned}

In this model, as x increases, the log odds decrease.

set.seed(1)
example_data = sim_logistic_data(sample_size = 50, beta_0 = 1, beta_1 = -4)

We again simulate some observations from this model, then fit logistic regression.

fit_glm = glm(y ~ x, data = example_data, family = binomial)
plot(y ~ x, data = example_data,
     pch = 20, ylab = "Estimated Probability",
     main = "Logistic Regression, Decreasing Probability")
grid()
curve(predict(fit_glm, data.frame(x), type = "response"),
      add = TRUE, col = "dodgerblue", lty = 2)
curve(boot::inv.logit(1 - 4 * x), add = TRUE, col = "darkorange", lty = 1)
legend("bottomleft", c("True Probability", "Estimated Probability", "Data"), lty = c(1, 2, 0),
       pch = c(NA, NA, 20), lwd = 2, col = c("darkorange", "dodgerblue", "black"))

[Figure: Lecture12-2]

Now let's look at an example where the estimated probability doesn't always simply increase or decrease. Much like ordinary linear regression, the linear combination of predictors can contain transformations of predictors (in this case a quadratic term) and interactions.

sim_quadratic_logistic_data = function(sample_size = 25) {
  x = rnorm(n = sample_size)
  eta = -1.5 + 0.5 * x + x ^ 2                     # linear combination with a quadratic term
  p = 1 / (1 + exp(-eta))                          # inverse logit gives P[Y = 1 | X = x]
  y = rbinom(n = sample_size, size = 1, prob = p)  # simulate the Bernoulli response
  data.frame(y, x)
}

\log\left(\frac{p({\bf x})}{1 - p({\bf x})}\right) = -1.5 + 0.5 x + x^2.

Again, we could re-write this to better match the function we're using to simulate the data:

\begin{aligned} Y_i \mid {\bf X_i} = {\bf x_i} &\sim \text{Bern}(p_i) \\ p_i &= p({\bf x_i}) = \frac{1}{1 + e^{-\eta({\bf x_i})}} \\ \eta({\bf x_i}) &= -1.5 + 0.5 x_i + x_i^2 \end{aligned}

set.seed(42)
example_data = sim_quadratic_logistic_data(sample_size = 50)
fit_glm = glm(y ~ x + I(x^2), data = example_data, family = binomial)

[Figure: Lecture12-3]

plot(y ~ x, data = example_data,
     pch = 20, ylab = "Estimated Probability",
     main = "Logistic Regression, Quadratic Relationship")
grid()
curve(predict(fit_glm, data.frame(x), type = "response"),
      add = TRUE, col = "dodgerblue", lty = 2)
curve(boot::inv.logit(-1.5 + 0.5 * x + x ^ 2),
      add = TRUE, col = "darkorange", lty = 1)
legend("bottomleft", c("True Probability", "Estimated Probability", "Data"), lty = c(1, 2, 0),
       pch = c(NA, NA, 20), lwd = 2, col = c("darkorange", "dodgerblue", "black"))

Working with Logistic Regression

While the logistic regression model isn't exactly the same as the ordinary linear regression model, because they both use a linear combination of the predictors

\eta({\bf x}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_{p - 1} x_{p - 1}

working with logistic regression is very similar. Many of the things we did with ordinary linear regression can be done in much the same way, for example (a short sketch follows the list below):

  • Testing for a single \beta parameter
  • Testing for a set of \beta parameters
  • Formula specification in R
  • Interpreting parameters and estimates
  • Confidence intervals for parameters
  • Confidence intervals for mean response
  • Variable selection
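
For instance, a brief sketch using the most recently fitted model (any of the logistic fits above would work the same way):

summary(fit_glm)  # Wald tests for individual beta parameters (next subsection)
confint(fit_glm)  # confidence intervals for the beta parameters, based on the profile likelihood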

Hypothesis Testing: Wald Test

In ordinary linear regression, we perform the test of

H_0: \beta_j = 0 \quad \text{vs} \quad H_1: \beta_j \neq 0

using a t-test.

For the logistic regression model,

\log\left(\frac{p({\bf x})}{1 - p({\bf x})}\right) = \beta_0 + \beta_1 x_1 + \ldots + \beta_{p - 1} x_{p - 1}

we can again perform a test of

H_0: \beta_j = 0 \quad \text{vs} \quad H_1: \beta_j \neq 0

however, the test statistic and its distribution are no longer t.

With a little algebra and mathematical work, we can see that the test statistic takes the same form

z = \frac{\hat{\beta}_j - \beta_j}{\text{SE}[\hat{\beta}_j]} \overset{\text{approx}}{\sim} N(0, 1)

but now we are performing a z-test, as the test statistic is approximated by a standard normal distribution, provided we have a large enough sample. (The t-test for ordinary linear regression, assuming the assumptions were correct, had an exact distribution for any sample size.)

We'll skip some of the exact details of the calculations, as R will obtain the standard error for us. The use of this test will be extremely similar to the t-test for ordinary linear regression. Essentially the only thing that changes is the distribution of the test statistic.
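
In R, these z statistics and their p-values appear in the coefficients table of summary(). A sketch of the calculation by hand, using any of the logistic fits above:

coef(summary(fit_glm))  # columns: Estimate, Std. Error, z value, Pr(>|z|)
est = coef(summary(fit_glm))[, "Estimate"]
se  = coef(summary(fit_glm))[, "Std. Error"]
z   = est / se          # Wald test statistic for H_0: beta_j = 0
2 * pnorm(-abs(z))      # two-sided p-values from the standard normal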

Likelihood-Ratio Test

Consider the following full model,

\log\left(\frac{p({\bf x_i})}{1 - p({\bf x_i})}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{(p-1)} x_{i(p-1)}

This model has p - 1 predictors, for a total of p \beta-parameters. We will denote the MLE of these \beta-parameters as \hat{\beta}_{\text{Full}}.

Now consider a null (or reduced) model,

\log\left(\frac{p({\bf x_i})}{1 - p({\bf x_i})}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{(q-1)} x_{i(q-1)}

where q < p. This model has q - 1 predictors, for a total of q \beta-parameters. We will denote the MLE of these \beta-parameters as \hat{\beta}_{\text{Null}}.

As with linear regression, the difference between these two models can be codified by the null hypothesis of a test.

H_0: \beta_q = \beta_{q+1} = \cdots = \beta_{p - 1} = 0.

This implies that the reduced model is nested inside the full model.

We then define a test statistic, D,

D = -2 \log \left( \frac{L(\boldsymbol{\hat{\beta}_{\text{Null}}})}{L(\boldsymbol{\hat{\beta}_{\text{Full}}})} \right) = 2 \log \left( \frac{L(\boldsymbol{\hat{\beta}_{\text{Full}}})}{L(\boldsymbol{\hat{\beta}_{\text{Null}}})} \right) = 2 \left( \ell(\hat{\beta}_{\text{Full}}) - \ell(\hat{\beta}_{\text{Null}}) \right)

where L denotes a likelihood and \ell denotes a log-likelihood. For a large enough sample, this test statistic has an approximate chi-square distribution,

D \overset{\text{approx}}{\sim} \chi^2_{k}

where k = p - q, the difference in number of parameters of the two models.

This test, which we will call the Likelihood-Ratio Test, will be the analogue to the ANOVA F-test for logistic regression. Interestingly, to perform the Likelihood-Ratio Test, we'll again use the anova() function in R!

The Likelihood-Ratio Test is actually a rather general test; here, however, we have presented a specific application to nested logistic regression models.
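
As a sketch, we can test whether the quadratic term is needed in the last simulated example by comparing nested fits; the "Deviance" column reported by anova() is the statistic D, compared here to a chi-square distribution with p - q = 1 degree of freedom:

fit_null = glm(y ~ x,          data = example_data, family = binomial)
fit_full = glm(y ~ x + I(x^2), data = example_data, family = binomial)
anova(fit_null, fit_full, test = "LRT")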


  1. Unfortunately, unlike ordinary linear regression, there is no analytical solution for this maximization problem. Instead, it will need to be solved numerically. Fortunately, R will take care of this for us using an iteratively reweighted least squares algorithm. (We'll leave the details for a machine learning or optimization course, which would likely also discuss alternative optimization strategies.) ↩︎

  2. When this happens, the model is still "fit," but there are consequences: namely, the estimated coefficients are highly suspect. This is an issue when trying to interpret the model. Such a model will often still be useful for creating a classifier, which will be discussed later. However, it is still subject to the usual evaluations for classifiers to determine how well it is performing. For details, see Modern Applied Statistics with S-PLUS, Chapter 7. ↩︎