Multiple Linear Regression

"Life is really simple, but we insist on making it complicated."

--- Confucius

This lecture describes the following:

  • Constructing and interpreting linear regression models with more than one predictor.
  • How regression models are derived using matrices.
  • How to create interval estimates and perform hypothesis tests for multiple regression parameters.
  • How to formulate and interpret interval estimates for the mean response under various conditions.
  • Comparing nested models using an ANOVA F-Test.

"Numbers is hardly real and they never have feelings // But you push too hard, even numbers got limits"

--- Mos Def

Put another way: More math ahead. Sorry.

Introduction by Example

In the last lecture we saw how to fit a model that assumed a linear relationship between a response variable and a single predictor variable. Specifically, we defined the simple linear regression model,

Yi=β0+β1xi+ϵiY_i = \beta_0 + \beta_1 x_i + \epsilon_i

where ϵiN(0,σ2)\epsilon_i \sim N(0, \sigma^2).

However, it is rarely the case that we want to explore merely a single predictor variable. It is also rarely the case that a response variable will only depend on a single variable.[1] We will extend the linear model to allow a response to depend on multiple predictors. The code below loads a useful dataset.

# read the data from the web (UCI Machine Learning Repository)
autompg = read.table(
  "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data",
  quote = "\"",
  comment.char = "",
  stringsAsFactors = FALSE)
# give the dataframe headers
colnames(autompg) = c("mpg", "cyl", "disp", "hp", "wt", "acc", "year", "origin", "name")
# remove missing data, which is stored as "?"
autompg = subset(autompg, autompg$hp != "?")
# remove the plymouth reliant, as it causes some issues
autompg = subset(autompg, autompg$name != "plymouth reliant")
# give the dataset row names, based on the engine, year and name
rownames(autompg) = paste(autompg$cyl, "cylinder", autompg$year, autompg$name)
# remove the variable for name, as well as origin
autompg = subset(autompg, select = c("mpg", "cyl", "disp", "hp", "wt", "acc", "year"))
# change horsepower from character to numeric
autompg$hp = as.numeric(autompg$hp)
# check final structure of data
str(autompg)
# create local csv for backup
write.csv(autompg, "data/autompg.csv")

We will once again discuss a dataset with information about cars. This dataset, which can be found at the UCI Machine Learning Repository, contains a response variable mpg, which stores the city fuel efficiency of cars, as well as several predictor variables for the attributes of the vehicles. We load the data, and perform some basic tidying before moving on to analysis.

For now we will focus on using two variables, wt and year, as predictor variables. That is, we would like to model the fuel efficiency (mpg) of a car as a function of its weight (wt) and model year (year). To do so, we will define the following linear model,

Yi=β0+β1xi1+β2xi2+ϵi,i=1,2,,nY_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i, \qquad i = 1, 2, \ldots, n

where ϵiN(0,σ2)\epsilon_i \sim N(0, \sigma^2). In this notation we will define:

  • xi1x_{i1} as the weight (wt) of the iith car.
  • xi2x_{i2} as the model year (year) of the iith car.


The code below will generate a picture to help us visualize what we would like to accomplish. The data points (xi1,xi2,yi)(x_{i1}, x_{i2}, y_i) now exist in 3-dimensional space, so instead of fitting a line to the data, we will fit a plane. (We'll soon move to higher dimensions, so this will be the last example that is easy to visualize and think about this way.)

The next slide provides code to visualize this process. It's... possibly helpful?

Sadly, as we'll see shortly, visualizations for high dimensions don't make a ton of sense.

Code below


# the 3D plotting functions below (scatter3D, gg.col) come from the plot3D package
library(plot3D)

x = autompg$wt
y = autompg$year
z = autompg$mpg

fit = lm(z ~ x + y)

grid.lines = 25
x.pred     = seq(min(x), max(x), length.out = grid.lines)
y.pred     = seq(min(y), max(y), length.out = grid.lines)
xy         = expand.grid(x = x.pred, y = y.pred)

z.pred = matrix(predict(fit, newdata = xy),
                nrow = grid.lines, ncol = grid.lines)

fitpoints = predict(fit)

scatter3D(x, y, z, pch = 19, cex = 2, col = gg.col(1000), lighting = TRUE,
          theta = 25, phi = 45, ticktype = "detailed",
          xlab = "wt", ylab = "year", zlab = "mpg", zlim = c(0, 40), clim = c(0, 40),
          surf = list(x = x.pred, y = y.pred, z = z.pred,
                      facets = NA, fit = fitpoints), main = "")

How do we find such a plane? Well, we would like a plane that is as close as possible to the data points. That is, we would like it to minimize the errors it is making. How will we define these errors? Squared distance of course! So, we would like to minimize

f(β0,β1,β2)=i=1n(yi(β0+β1xi1+β2xi2))2f(\beta_0, \beta_1, \beta_2) = \sum_{i = 1}^{n}(y_i - (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}))^2

with respect to β0\beta_0, β1\beta_1, and β2\beta_2. How do we do so? It is another straightforward multivariate calculus problem. All we have done is add an extra variable since we did this last time. So again, we take a derivative with respect to each of β0\beta_0, β1\beta_1, and β2\beta_2 and set them equal to zero, then solve the resulting system of equations. That is,

fβ0=0fβ1=0fβ2=0\begin{aligned} \frac{\partial f}{\partial \beta_0} &= 0 \\ \frac{\partial f}{\partial \beta_1} &= 0 \\ \frac{\partial f}{\partial \beta_2} &= 0 \end{aligned}

After doing so, we will once again obtain the normal equations.

nβ0+β1i=1nxi1+β2i=1nxi2=i=1nyiβ0i=1nxi1+β1i=1nxi12+β2i=1nxi1xi2=i=1nxi1yiβ0i=1nxi2+β1i=1nxi1xi2+β2i=1nxi22=i=1nxi2yi\begin{aligned} n \beta_0 + \beta_1 \sum_{i = 1}^{n} x_{i1} + \beta_2 \sum_{i = 1}^{n} x_{i2} &= \sum_{i = 1}^{n} y_i \\ \beta_0 \sum_{i = 1}^{n} x_{i1} + \beta_1 \sum_{i = 1}^{n} x_{i1}^2 + \beta_2 \sum_{i = 1}^{n} x_{i1}x_{i2} &= \sum_{i = 1}^{n} x_{i1}y_i \\ \beta_0 \sum_{i = 1}^{n} x_{i2} + \beta_1 \sum_{i = 1}^{n} x_{i1}x_{i2} + \beta_2 \sum_{i = 1}^{n} x_{i2}^2 &= \sum_{i = 1}^{n} x_{i2}y_i \end{aligned}

We now have three equations and three variables, which we could solve, or we could simply let R solve for us.

mpg_model = lm(mpg ~ wt + year, data = autompg)
coef(mpg_model)

y^=14.63764190.0066349x1+0.761402x2\hat{y} = − 14.6376419 − 0.0066349x_1 + 0.761402 x_2

Here we have once again fit our model using lm(), however we have introduced a new syntactical element. The formula mpg ~ wt + year now reads: "model the response variable mpg as a linear function of wt and year". That is, it will estimate an intercept, as well as slope coefficients for wt and year. We then extract these as we have done before using coef().
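A minimal sketch of this syntax using R's built-in mtcars data (a self-contained stand-in for autompg; the wt and hp predictors here are chosen purely for illustration):

```r
# fit a two-predictor model on built-in data: mpg as a linear function of wt and hp
example_model = lm(mpg ~ wt + hp, data = mtcars)
# coef() extracts the intercept and slope estimates as a named vector
coef(example_model)
```

The result is a vector of three estimates, named "(Intercept)", "wt", and "hp".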

In the multiple linear regression setting, some of the interpretations of the coefficients change slightly.

Here, β^0\hat{\beta}_0 is our estimate for β0\beta_0, the mean miles per gallon for a car that weighs 0 pounds and was built in 1900. We see our estimate here is negative, which is a physical impossibility. However, this isn't unexpected, as we shouldn't expect our model to be accurate for cars from 1900 which weigh 0 pounds. (Because they never existed!) This isn't much of a change from SLR. That is, β0\beta_0 is still simply the mean when all of the predictors are 0.

A Note on Significant Digits

Economics is, at best, a two-digit science... [S]uppose we wrote 1.2345 (SE = .523). A physicist wouldn't let you do that. They'd think you were a f*&#ing moron.

--- David E. Card, Class of 1950 Professor of Economics, UC Berkeley

Don't do what I did on the previous slide.

In social science, you are never measuring with much precision. Showing tons of digits after the decimal point is misleading.

Coefficients and Interpretation

Critically, the interpretation of the coefficients in front of our predictors is slightly different than before.

For example, \hat{\beta}_1 = -0.0066 is our estimate for β1\beta_1, the average change in miles per gallon for a one-pound increase in weight (x1x_{1}) for a car of a certain model year. That is, we hold all other variables constant. Here, this means β^1\hat{\beta}_1 should be interpreted for a fixed value of x2x_{2}. Note that this coefficient is actually the same for any given value of x2x_{2}.

Later, we will look at models that allow for a different change in mean response for different values of x2x_{2}. Also note that this estimate is negative, which we would expect since, in general, fuel efficiency decreases for larger vehicles. Note that in the multiple linear regression setting, this interpretation is dependent on a fixed value for x2x_{2}---that is, "for a car of a certain model year." It is possible that the indirect relationship between fuel efficiency and weight does not hold when an additional factor, say year, is included, and thus we could have the sign of our coefficient flipped.

Lastly, \hat{\beta}_2 = 0.76 is our estimate for β2\beta_2, the average change in miles per gallon for a one-year increase in model year (x2x_{2}) for a car of a certain weight, that is, for a fixed value of x1x_{1}. It is not surprising that the estimate is positive. We expect that as time passes and the years march on, technology improves, so that a car of a specific weight gets better mileage now than its predecessors did. And yet, the coefficient could have been negative, because we are also including weight as a variable, and not strictly as a fixed value.

Matrix Approach to Regression

In our above example we used two predictor variables, but it will only take a little more work to allow for an arbitrary number of predictor variables and derive their coefficient estimates. We can consider the model,

Yi=β0+β1xi1+β2xi2++βp1xi(p1)+ϵi,i=1,2,,nY_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{p-1} x_{i(p-1)} + \epsilon_i, \qquad i = 1, 2, \ldots, n

where ϵiN(0,σ2)\epsilon_i \sim N(0, \sigma^2). In this model, there are p1p - 1 predictor variables, x1,x2,,xp1x_1, x_2, \cdots, x_{p-1}. There are a total of pp β\beta-parameters and a single parameter σ2\sigma^2 for the variance of the errors. (It should be noted that almost as often, authors will use pp as the number of predictors, making the total number of β\beta parameters p+1p+1. This is always something you should be aware of when reading about multiple regression. There is not a standard that is used most often.)

If we were to stack together the nn linear equations that represent each YiY_i into a column vector, we get the following.

[Y1Y2Yn]=[1x11x12x1(p1)1x21x22x2(p1)1xn1xn2xn(p1)][β0β1β2βp1]+[ϵ1ϵ2ϵn]\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots\\ Y_n \\ \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1(p-1)} \\ 1 & x_{21} & x_{22} & \cdots & x_{2(p-1)} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{n(p-1)} \\ \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_{p-1} \\ \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots\\ \epsilon_n \\ \end{bmatrix}

Y=Xβ+ϵY = X \beta + \epsilon

Y=[Y1Y2Yn],X=[1x11x12x1(p1)1x21x22x2(p1)1xn1xn2xn(p1)],β=[β0β1β2βp1],ϵ=[ϵ1ϵ2ϵn]Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots\\ Y_n \end{bmatrix}, \quad X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1(p-1)} \\ 1 & x_{21} & x_{22} & \cdots & x_{2(p-1)} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{n(p-1)} \\ \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_{p-1} \\ \end{bmatrix}, \quad \epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots\\ \epsilon_n \end{bmatrix}

So now with data,

y=[y1y2yn]y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots\\ y_n \end{bmatrix}

Just as before, we can estimate β\beta by minimizing,

f(β0,β1,β2,,βp1)=i=1n(yi(β0+β1xi1+β2xi2++βp1xi(p1)))2,f(\beta_0, \beta_1, \beta_2, \cdots, \beta_{p-1}) = \sum_{i = 1}^{n}(y_i - (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{p-1} x_{i(p-1)}))^2,

which would require taking p derivatives. This results in the following normal equations.

[ni=1nxi1i=1nxi2i=1nxi(p1)i=1nxi1i=1nxi12i=1nxi1xi2i=1nxi1xi(p1)i=1nxi(p1)i=1nxi(p1)xi1i=1nxi(p1)xi2i=1nxi(p1)2][β0β1βp1]=[i=1nyii=1nxi1yii=1nxi(p1)yi]\begin{bmatrix} n & \sum_{i = 1}^{n} x_{i1} & \sum_{i = 1}^{n} x_{i2} & \cdots & \sum_{i = 1}^{n} x_{i(p-1)} \\ \sum_{i = 1}^{n} x_{i1} & \sum_{i = 1}^{n} x_{i1}^2 & \sum_{i = 1}^{n} x_{i1}x_{i2} & \cdots & \sum_{i = 1}^{n} x_{i1}x_{i(p-1)} \\ \vdots & \vdots & \vdots & & \vdots \\ \sum_{i = 1}^{n} x_{i(p-1)} & \sum_{i = 1}^{n} x_{i(p-1)}x_{i1} & \sum_{i = 1}^{n} x_{i(p-1)}x_{i2} & \cdots & \sum_{i = 1}^{n} x_{i(p-1)}^2 \\ \end{bmatrix}\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \\ \end{bmatrix}=\begin{bmatrix} \sum_{i = 1}^{n} y_i \\ \sum_{i = 1}^{n} x_{i1}y_i \\ \vdots \\ \sum_{i = 1}^{n} x_{i(p-1)}y_i \\ \end{bmatrix}

Regression Solution

The normal equations can be written much more succinctly in matrix notation,

XXβ=Xy.X^\top X \beta = X^\top y.

We can then solve this expression by multiplying both sides by the inverse of XXX^\top X, which exists, provided the columns of XX are linearly independent. Then as always, we denote our solution with a hat.

β^=(XX)1Xy\hat{\beta} = \left( X^\top X \right)^{-1}X^\top y

To verify that this is what R has done for us in the case of two predictors, we create an X matrix. Note that the first column is all 1s, and the remaining columns contain the data.

n = nrow(autompg)
p = length(coef(mpg_model))
X = cbind(rep(1, n), autompg$wt, autompg$year)
y = autompg$mpg

(beta_hat = solve(t(X) %*% X) %*% t(X) %*% y)

β^=[14.60.00660.76]\hat{\beta} = \begin{bmatrix} −14.6 \\ −0.0066 \\ 0.76 \\ \end{bmatrix}

In our new notation, the fitted values can be written

y^=Xβ^.\hat{y} = X \hat{\beta}.

y^=[y^1y^2y^n]\hat{y} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots\\ \hat{y}_n \end{bmatrix}

Then, we can create a vector for the residual values,

e=[e1e2en]=[y1y2yn][y^1y^2y^n].e = \begin{bmatrix} e_1 \\ e_2 \\ \vdots\\ e_n \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots\\ y_n \end{bmatrix} - \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots\\ \hat{y}_n \end{bmatrix}.

And lastly, we can update our estimate for σ2\sigma^2.

se2=i=1n(yiy^i)2np=eenps_e^2 = \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n - p} = \frac{e^\top e}{n-p}

In R, we could directly access ses_e for a fitted model, as we have seen before.
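A hedged sketch of that access, using the built-in mtcars data so the example is self-contained (the wt and hp predictors are assumptions of this example, not the lecture's model):

```r
fit = lm(mpg ~ wt + hp, data = mtcars)
# residual standard error s_e, as reported by summary()
s_e = summary(fit)$sigma
# equivalent manual computation: sqrt(SSE / (n - p))
n = nrow(mtcars)
p = length(coef(fit))
s_e_manual = sqrt(sum(resid(fit) ^ 2) / (n - p))
all.equal(s_e, s_e_manual)
```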


And we can now verify that our math above is indeed calculating the same quantities.

y_hat = X %*% solve(t(X) %*% X) %*% t(X) %*% y
e     = y - y_hat
sqrt(t(e) %*% e / (n - p))
sqrt(sum((y - y_hat) ^ 2) / (n - p))

Sampling Distribution

This subsection includes technical notes; feel free to skip ahead.

As we can see in the output below, the results of calling summary() are similar to simple linear regression, but there are some differences, most obviously a new row for the added predictor variable.


To understand these differences in detail, we will need to first obtain the sampling distribution of β^\hat{\beta}.

The derivation of the sampling distribution of β^\hat{\beta} involves the multivariate normal distribution. A full discussion of this distribution is outside the scope of this course.

Our goal now is to obtain the distribution of the β^\hat{\beta} vector,

β^=[β^0β^1β^2β^p1]\hat{\beta} = \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \hat{\beta}_2 \\ \vdots \\ \hat{\beta}_{p-1} \end{bmatrix}

Recall from last time that when discussing sampling distributions, we now consider β^\hat{\beta} to be a random vector, thus we use YY instead of the data vector yy.

β^=(XX)1XY\hat{\beta} = \left( X^\top X \right)^{-1}X^\top Y

Then it is a consequence of the multivariate normal distribution that,

β^N(β,σ2(XX)1).\hat{\beta} \sim N\left(\beta, \sigma^2 \left(X^\top X\right)^{-1} \right).

We then have

E[β^]=β\text{E}[\hat{\beta}] = \beta

and for any β^j\hat{\beta}_j we have

E[β^j]=βj.\text{E}[\hat{\beta}_j] = \beta_j.

We also have

Var[β^]=σ2(XX)1\text{Var}[\hat{\beta}] = \sigma^2 \left( X^\top X \right)^{-1}

and for any β^j\hat{\beta}_j we have

Var[β^j]=σ2Cjj\text{Var}[\hat{\beta}_j] = \sigma^2 C_{jj}

where

C=(XX)1C = \left(X^\top X\right)^{-1}

and the elements of CC are denoted

C=[C00C01C02C0(p1)C10C11C12C1(p1)C20C21C22C2(p1)C(p1)0C(p1)1C(p1)2C(p1)(p1)].C = \begin{bmatrix} C_{00} & C_{01} & C_{02} & \cdots & C_{0(p-1)} \\ C_{10} & C_{11} & C_{12} & \cdots & C_{1(p-1)} \\ C_{20} & C_{21} & C_{22} & \cdots & C_{2(p-1)} \\ \vdots & \vdots & \vdots & & \vdots \\ C_{(p-1)0} & C_{(p-1)1} & C_{(p-1)2} & \cdots & C_{(p-1)(p-1)} \\ \end{bmatrix}.

Essentially, up to the factor σ2\sigma^2, the diagonal elements CjjC_{jj} are the variances of the corresponding β^j\hat{\beta}_j, while the off-diagonal elements are covariances between the estimates.
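We can verify numerically that s_e^2 C is exactly the estimated variance-covariance matrix R reports. A self-contained sketch with the built-in mtcars data (the wt and hp predictors are illustrative assumptions):

```r
fit = lm(mpg ~ wt + hp, data = mtcars)
X   = model.matrix(fit)        # design matrix, with a leading column of 1s
C   = solve(t(X) %*% X)        # C = (X'X)^{-1}
s_e = summary(fit)$sigma
# estimated variance-covariance matrix of beta-hat: s_e^2 * C
est_var = s_e ^ 2 * C
# this matches R's built-in vcov()
all.equal(unname(est_var), unname(vcov(fit)))
```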

The standard error for the β^\hat{\beta} vector is given by

SE[β^]=se(XX)1\text{SE}[\hat{\beta}] = s_e \sqrt{\left( X^\top X \right)^{-1}}

and for a particular β^j\hat{\beta}_j

SE[β^j]=seCjj.\text{SE}[\hat{\beta}_j] = s_e \sqrt{C_{jj}}.

Lastly, each of the β^j\hat{\beta}_j follows a normal distribution,

β^jN(βj,σ2Cjj).\hat{\beta}_j \sim N\left(\beta_j, \sigma^2 C_{jj} \right).


After standardizing and replacing σ\sigma with its estimate ses_e, we instead obtain a tt distribution,

β^jβjseCjjtnp.\frac{\hat{\beta}_j - \beta_j}{s_e \sqrt{C_{jj}}} \sim t_{n-p}.

Now that we have the necessary distributional results, we can move on to perform tests and make interval estimates.

Single Parameter Tests

The first test we will see is a test for a single βj\beta_j.

H0:βj=0vsH1:βj0H_0: \beta_j = 0 \quad \text{vs} \quad H_1: \beta_j \neq 0

Again, the test statistic takes the form

TS=ESTHYPSE.\text{TS} = \frac{\text{EST} - \text{HYP}}{\text{SE}}.

In particular,

t=β^jβjSE[β^j]=β^j0seCjj,t = \frac{\hat{\beta}_j - \beta_j}{\text{SE}[\hat{\beta}_j]} = \frac{\hat{\beta}_j-0}{s_e\sqrt{C_{jj}}},

which, under the null hypothesis, follows a tt distribution with npn - p degrees of freedom.

Recall our model for mpg,

Yi=β0+β1xi1+β2xi2+ϵi,i=1,2,,nY_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i, \qquad i = 1, 2, \ldots, n

where ϵiN(0,σ2)\epsilon_i \sim N(0, \sigma^2).

  • xi1x_{i1} as the weight (wt) of the iith car.
  • xi2x_{i2} as the model year (year) of the iith car.

Then the test

H0:β1=0vsH1:β10H_0: \beta_1 = 0 \quad \text{vs} \quad H_1: \beta_1 \neq 0

can be found in the summary() output, in particular:


The estimate (Estimate), standard error (Std. Error), test statistic (t value), and p-value (Pr(>|t|)) for this test are displayed in the second row, labeled wt. Remember that the p-value given here is specifically for a two-sided test, where the hypothesized value is 0.

Also note in this case, by hypothesizing that β1=0\beta_1 = 0 the null and alternative essentially specify two different models:

  • H0H_0: Y=β0+β2x2+ϵY = \beta_0 + \beta_2 x_{2} + \epsilon
  • H1H_1: Y=β0+β1x1+β2x2+ϵY = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \epsilon

This is important. We are not simply testing whether or not there is a relationship between weight and fuel efficiency. We are testing if there is a relationship between weight and fuel efficiency, given that a term for year is in the model. (Note, we dropped some indexing here, for readability.)
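As a sketch of how these single-parameter t statistics arise from the formulas above, we can reproduce the summary() table by hand. The example uses the built-in mtcars data as a self-contained stand-in; the predictors are illustrative assumptions:

```r
fit = lm(mpg ~ wt + hp, data = mtcars)
X   = model.matrix(fit)
C   = solve(t(X) %*% X)
s_e = summary(fit)$sigma
n = nrow(mtcars)
p = length(coef(fit))
se_beta = s_e * sqrt(diag(C))    # SE[beta_hat_j] = s_e * sqrt(C_jj)
t_stats = coef(fit) / se_beta    # test statistics for H0: beta_j = 0
# two-sided p-values from a t distribution with n - p degrees of freedom
p_vals  = 2 * pt(abs(t_stats), df = n - p, lower.tail = FALSE)
cbind(t_stats, p_vals)
```

These columns match the "t value" and "Pr(>|t|)" columns of summary(fit).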

Confidence Intervals

Since β^j\hat{\beta}_j is our estimate for βj\beta_j and we have

E[β^j]=βj\text{E}[\hat{\beta}_j] = \beta_j

as well as the standard error,

SE[β^j]=seCjj\text{SE}[\hat{\beta}_j] = s_e\sqrt{C_{jj}}

and the sampling distribution of β^j\hat{\beta}_j is Normal, then we can easily construct confidence intervals for each of the β^j\hat{\beta}_j.

β^j±tα/2,npseCjj\hat{\beta}_j \pm t_{\alpha/2, n - p} \cdot s_e\sqrt{C_{jj}}

We can find these in R using the same method as before. Now there will simply be additional rows for the additional β\beta.

confint(mpg_model, level = 0.99)
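To see that confint() is doing exactly the calculation above, here is a hedged manual verification on the built-in mtcars data (the model is illustrative, not the lecture's):

```r
fit = lm(mpg ~ wt + hp, data = mtcars)
n = nrow(mtcars)
p = length(coef(fit))
se_beta = sqrt(diag(vcov(fit)))    # s_e * sqrt(C_jj) for each coefficient
crit    = qt(0.995, df = n - p)    # t_{alpha/2, n-p} for a 99% interval
manual  = cbind(coef(fit) - crit * se_beta,
                coef(fit) + crit * se_beta)
all.equal(unname(manual), unname(confint(fit, level = 0.99)))
```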

Confidence Intervals for Mean Response

As we saw in SLR, we can create confidence intervals for the mean response, that is, an interval estimate for E[YX=x]\text{E}[Y \mid X = x]. In SLR, the mean of YY was only dependent on a single value xx. Now, in multiple regression, E[YX=x]\text{E}[Y \mid X = x] is dependent on the value of each of the predictors, so we define the vector x0x_0 to be,

x0=[1x01x02x0(p1)].x_{0} = \begin{bmatrix} 1 \\ x_{01} \\ x_{02} \\ \vdots \\ x_{0(p-1)} \\ \end{bmatrix}.

Then our estimate of E[YX=x0]\text{E}[Y \mid X = x_0] for a set of values x0x_0 is given by

y^(x0)=x0β^=β^0+β^1x01+β^2x02++β^p1x0(p1).\begin{aligned} \hat{y}(x_0) &= x_{0}^\top\hat{\beta} \\ &= \hat{\beta}_0 + \hat{\beta}_1 x_{01} + \hat{\beta}_2 x_{02} + \cdots + \hat{\beta}_{p-1} x_{0(p-1)}. \end{aligned}

As with SLR, this is an unbiased estimate.

E[y^(x0)]=x0β=β0+β1x01+β2x02++βp1x0(p1)\begin{aligned} \text{E}[\hat{y}(x_0)] &= x_{0}^\top\beta \\ &= \beta_0 + \beta_1 x_{01} + \beta_2 x_{02} + \cdots + \beta_{p-1} x_{0(p-1)} \end{aligned}

To make an interval estimate, we will also need its standard error.

SE[y^(x0)]=sex0(XX)1x0\text{SE}[\hat{y}(x_0)] = s_e \sqrt{x_{0}^\top\left(X^\top X\right)^{-1}x_{0}}

Putting it all together, we obtain a confidence interval for the mean response.

y^(x0)±tα/2,npsex0(XX)1x0\hat{y}(x_0) \pm t_{\alpha/2, n - p} \cdot s_e \sqrt{x_{0}^\top\left(X^\top X\right)^{-1}x_{0}}

The math has changed a bit, but the process in R remains almost identical. Here, we create a data frame for two additional cars: one that weighs 3500 pounds and was produced in 1976, and a second that weighs 5000 pounds and was produced in 1981.

new_cars = data.frame(wt = c(3500, 5000), year = c(76, 81))

We can then use the predict() function with interval = "confidence" to obtain intervals for the mean fuel efficiency for both new cars. Again, it is important to make the data passed to newdata a data frame, so that R knows which values are for which variables.

predict(mpg_model, newdata = new_cars, interval = "confidence", level = 0.99)

R then reports the estimate y^(x0)\hat{y}(x_0) (fit) for each, as well as the lower (lwr) and upper (upr) bounds for the interval at a desired level (99%).

A word of caution here: one of these estimates is good while the other is suspect.

Note that both of the weights of the new cars are within the range of observed wt values, as are the years of each of the new cars.

plot(year ~ wt, data = autompg, pch = 20, col = "dodgerblue", cex = 1.5)
points(new_cars, col = "darkorange", cex = 3, pch = "X")

However, we have to consider weight and year together now. And based on the above plot, one of the new cars is within the "blob" of observed values, while the other, the car from 1981 weighing 5000 pounds, is noticeably outside of the observed values. This is a hidden extrapolation which you should be aware of when using multiple regression.

Shifting gears back to the new data pair that can be reasonably estimated, we do a quick verification of some of the mathematics in R.

x0 = c(1, 3500, 76)
x0 %*% beta_hat

Also note that, using a particular value for x0x_0, we can essentially extract certain β^j\hat{\beta}_j values.

x0 = c(0, 0, 1)
x0 %*% beta_hat

With this in mind, confidence intervals for the individual β^j\hat{\beta}_j are actually a special case of a confidence interval for mean response.
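The mean-response interval formula above can also be verified by hand. A self-contained sketch with the built-in mtcars data; the predictor values in x0 (wt = 3, i.e., 3000 lbs in mtcars units, and hp = 150) are hypothetical choices for illustration:

```r
fit = lm(mpg ~ wt + hp, data = mtcars)
X   = model.matrix(fit)
n = nrow(mtcars)
p = length(coef(fit))
s_e = summary(fit)$sigma
x0 = c(1, 3, 150)                  # hypothetical car: intercept term, wt = 3, hp = 150
est = sum(x0 * coef(fit))          # y-hat(x0) = x0' beta-hat
se_mean = s_e * sqrt(t(x0) %*% solve(t(X) %*% X) %*% x0)
crit = qt(0.995, df = n - p)       # 99% interval
c(est - crit * se_mean, est + crit * se_mean)
# matches predict() with interval = "confidence"
predict(fit, newdata = data.frame(wt = 3, hp = 150),
        interval = "confidence", level = 0.99)
```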

Prediction Intervals

(Feel free to skip this subsection.) Creating prediction intervals involves one slight change to the standard error, to account for the fact that we are now considering a new observation instead of a mean.

Here we use y^(x0)\hat{y}(x_0) to estimate Y0Y_0, a new observation of YY at the predictor vector x0x_0.

y^(x0)=x0β^=β^0+β^1x01+β^2x02++β^p1x0(p1)\begin{aligned} \hat{y}(x_0) &= x_{0}^\top\hat{\beta} \\ &= \hat{\beta}_0 + \hat{\beta}_1 x_{01} + \hat{\beta}_2 x_{02} + \cdots + \hat{\beta}_{p-1} x_{0(p-1)} \end{aligned}

E[y^(x0)]=x0β=β0+β1x01+β2x02++βp1x0(p1)\begin{aligned} \text{E}[\hat{y}(x_0)] &= x_{0}^\top\beta \\ &= \beta_0 + \beta_1 x_{01} + \beta_2 x_{02} + \cdots + \beta_{p-1} x_{0(p-1)} \end{aligned}

As we did with SLR, we need to account for the additional variability of an observation about its mean.

SE[y^(x0)+ϵ]=se1+x0(XX)1x0\text{SE}[\hat{y}(x_0) + \epsilon] = s_e \sqrt{1 + x_{0}^\top\left(X^\top X\right)^{-1}x_{0}}

Then we arrive at our updated prediction interval for MLR.

y^(x0)±tα/2,npse1+x0(XX)1x0\hat{y}(x_0) \pm t_{\alpha/2, n - p} \cdot s_e \sqrt{1 + x_{0}^\top\left(X^\top X\right)^{-1}x_{0}}

predict(mpg_model, newdata = new_cars, interval = "prediction", level = 0.99)
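The only difference from the mean-response interval is the extra 1 under the square root. A hedged manual check on the built-in mtcars data (the model and the x0 values are illustrative assumptions):

```r
fit = lm(mpg ~ wt + hp, data = mtcars)
X   = model.matrix(fit)
n = nrow(mtcars)
p = length(coef(fit))
s_e = summary(fit)$sigma
x0  = c(1, 3, 150)                 # hypothetical car: wt = 3, hp = 150
est = sum(x0 * coef(fit))
# note the extra 1 under the square root, relative to the mean-response SE
se_pred = s_e * sqrt(1 + t(x0) %*% solve(t(X) %*% X) %*% x0)
crit = qt(0.995, df = n - p)
c(est - crit * as.numeric(se_pred), est + crit * as.numeric(se_pred))
```

This reproduces predict(fit, ..., interval = "prediction", level = 0.99).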

Significance of Regression

The decomposition of variation that we had seen in SLR still holds for MLR.

i=1n(yiyˉ)2=i=1n(yiy^i)2+i=1n(y^iyˉ)2\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2

That is,

SST=SSE+SSReg.\text{SST} = \text{SSE} + \text{SSReg}.

This means that we can calculate R2R^2 in the same simple manner as in SLR. R continues to do this automatically.


The interpretation changes slightly as compared to SLR. As an example, we would say that 100R2100 \cdot R^2 percent of the observed variation in miles per gallon is explained by the linear relationship with the two predictor variables, weight and year.
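A self-contained sketch of the decomposition and of R^2 using the built-in mtcars data (the wt and hp predictors are illustrative assumptions):

```r
fit = lm(mpg ~ wt + hp, data = mtcars)
y   = mtcars$mpg
sst   = sum((y - mean(y)) ^ 2)          # total sum of squares
sse   = sum(resid(fit) ^ 2)             # error sum of squares
ssreg = sum((fitted(fit) - mean(y)) ^ 2)  # regression sum of squares
# the decomposition SST = SSE + SSReg holds for a model with an intercept
all.equal(sst, sse + ssreg)
# R^2 = SSReg / SST, matching what summary() reports
r2 = ssreg / sst
all.equal(r2, summary(fit)$r.squared)
```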

In multiple regression, a natural test is the following:

H0:β1=β2==βp1=0.H_0: \beta_1 = \beta_2 = \cdots = \beta_{p - 1} = 0.

Here, we see that the null hypothesis sets all of the βj\beta_j equal to 0, except the intercept, β0\beta_0. We could then say that the null model, or "model under the null hypothesis" is

Yi=β0+ϵi.Y_i = \beta_0 + \epsilon_i.

This is a model where the regression is insignificant. None of the predictors have a significant linear relationship with the response. Notationally, we will denote the fitted values of this model as y^0i\hat{y}_{0i}, which in this case happens to be:

y^0i=yˉ.\hat{y}_{0i} = \bar{y}.

The alternative hypothesis here is that at least one of the βj\beta_j from the null hypothesis is not 0.

H1:At least one of βj0,j=1,2,,(p1)H_1: \text{At least one of } \beta_j \neq 0, j = 1, 2, \cdots, (p-1)

We could then say that the full model, or "model under the alternative hypothesis" is

Yi=β0+β1xi1+β2xi2++β(p1)xi(p1)+ϵiY_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{(p-1)} x_{i(p-1)} + \epsilon_i

This is a model where the regression is significant. At least one of the predictors has a significant linear relationship with the response. There is some linear relationship between yy and the predictors, x1,x2,,xp1x_1, x_2, \ldots, x_{p - 1}.

We will denote the fitted values of this model as y^1i\hat{y}_{1i}.

To develop the FF test for the significance of the regression, we will arrange the variance decomposition into an ANOVA table.

| Source     | Sum of Squares                           | Degrees of Freedom | Mean Square            | F                         |
|------------|------------------------------------------|--------------------|------------------------|---------------------------|
| Regression | \sum_{i=1}^{n}(\hat{y}_{1i} - \bar{y})^2 | p - 1              | \text{SSReg} / (p - 1) | \text{MSReg} / \text{MSE} |
| Error      | \sum_{i=1}^{n}(y_i - \hat{y}_{1i})^2     | n - p              | \text{SSE} / (n - p)   |                           |
| Total      | \sum_{i=1}^{n}(y_i - \bar{y})^2          | n - 1              |                        |                           |

In summary, the FF statistic is

F=i=1n(y^1iyˉ)2/(p1)i=1n(yiy^1i)2/(np),F = \frac{\sum_{i=1}^{n}(\hat{y}_{1i} - \bar{y})^2 / (p - 1)}{\sum_{i=1}^{n}(y_i - \hat{y}_{1i})^2 / (n - p)},

and the p-value is calculated as

P(Fp1,np>F)P(F_{p-1, n-p} > F)

since we reject for large values of FF. A large value of the statistic corresponds to a large portion of the variance being explained by the regression. Here Fp1,npF_{p-1, n-p} represents a random variable which follows an FF distribution with p1p - 1 and npn - p degrees of freedom.
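The F statistic from the table can be computed by hand and compared against the value summary() reports. A self-contained sketch with the built-in mtcars data (the predictors are illustrative assumptions):

```r
fit = lm(mpg ~ wt + hp, data = mtcars)
n = nrow(mtcars)
p = length(coef(fit))
y = mtcars$mpg
ssreg = sum((fitted(fit) - mean(y)) ^ 2)
sse   = sum((y - fitted(fit)) ^ 2)
# F = MSReg / MSE, with p - 1 and n - p degrees of freedom
f_stat = (ssreg / (p - 1)) / (sse / (n - p))
# p-value: upper tail of the F distribution
p_val  = pf(f_stat, df1 = p - 1, df2 = n - p, lower.tail = FALSE)
c(f_stat, p_val)
```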

To perform this test in R, we first explicitly specify the two models in R and save the results in different variables. We then use anova() to compare the two models, giving anova() the null model first and the alternative (full) model second. (Specifying the full model first will result in the same p-value, but some nonsensical intermediate values.)

In this case,

  • H0H_0: Yi=β0+ϵiY_i = \beta_0 + \epsilon_i
  • H1H_1: Yi=β0+β1xi1+β2xi2+ϵiY_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i

That is, in the null model, we use neither of the predictors, whereas in the full (alternative) model, at least one of the predictors is useful.

null_mpg_model = lm(mpg ~ 1, data = autompg)
full_mpg_model = lm(mpg ~ wt + year, data = autompg)
anova(null_mpg_model, full_mpg_model)

First, notice that R does not display the results in the same manner as the table above. More important than the layout of the table are its contents. We see that the value of the FF statistic is very large, and the p-value is extremely low, so we reject the null hypothesis at any reasonable α\alpha and say that the regression is significant. At least one of wt or year has a useful linear relationship with mpg.


Notice that the value reported in the row for F-statistic is indeed the FF test statistic for the significance of regression test, and additionally it reports the two relevant degrees of freedom.

Also, note that none of the individual tt-tests are equivalent to the FF-test as they were in SLR. This equivalence only holds for SLR because the individual test for β1\beta_1 is the same as testing for all non-intercept parameters, since there is only one.

We can also verify the sums of squares and degrees of freedom directly in R. You should match these to the table from R and use this to match R's output to the written table above.

# SSReg
sum((fitted(full_mpg_model) - fitted(null_mpg_model)) ^ 2)
# SSE
sum(resid(full_mpg_model) ^ 2)
# SST
sum(resid(null_mpg_model) ^ 2)
# Degrees of Freedom: Regression
length(coef(full_mpg_model)) - length(coef(null_mpg_model))
# Degrees of Freedom: Error
length(resid(full_mpg_model)) - length(coef(full_mpg_model))
# Degrees of Freedom: Total
length(resid(null_mpg_model)) - length(coef(null_mpg_model))

Nested Models

The significance of regression test is actually a special case of testing what we will call nested models. This is a powerful tool that lets us compare two models, where one model is "nested" inside the other, meaning the smaller model uses a subset of the predictors from the larger model, and no others.

Consider the following full model,

Yi=β0+β1xi1+β2xi2++β(p1)xi(p1)+ϵiY_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{(p-1)} x_{i(p-1)} + \epsilon_i

This model has p1p - 1 predictors, for a total of pp β\beta-parameters. We will denote the fitted values of this model as y^1i\hat{y}_{1i}.

Let the null model be

Yi=β0+β1xi1+β2xi2++β(q1)xi(q1)+ϵiY_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{(q-1)} x_{i(q-1)} + \epsilon_i

where q<pq < p. This model has q1q - 1 predictors, for a total of qq β\beta-parameters. We will denote the fitted values of this model as y^0i\hat{y}_{0i}.

The difference between these two models can be codified by the null hypothesis of a test.

H0:βq=βq+1==βp1=0.H_0: \beta_q = \beta_{q+1} = \cdots = \beta_{p - 1} = 0.

Specifically, the β\beta-parameters from the full model that are not in the null model are zero. The resulting model, which is nested, is the null model.

We can then perform this test using an FF-test, which is the result of the following ANOVA table.

| Source | Sum of Squares                                | Degrees of Freedom | Mean Square          | F                       |
|--------|-----------------------------------------------|--------------------|----------------------|-------------------------|
| Diff   | \sum_{i=1}^{n}(\hat{y}_{1i} - \hat{y}_{0i})^2 | p - q              | \text{SSD} / (p - q) | \text{MSD} / \text{MSE} |
| Full   | \sum_{i=1}^{n}(y_i - \hat{y}_{1i})^2          | n - p              | \text{SSE} / (n - p) |                         |
| Null   | \sum_{i=1}^{n}(y_i - \hat{y}_{0i})^2          | n - q              |                      |                         |

F=i=1n(y^1iy^0i)2/(pq)i=1n(yiy^1i)2/(np).F = \frac{\sum_{i=1}^{n}(\hat{y}_{1i} - \hat{y}_{0i})^2 / (p - q)}{\sum_{i=1}^{n}(y_i - \hat{y}_{1i})^2 / (n - p)}.

Notice that the row for "Diff" contains the sum of the squared differences of the two models' fitted values. Its degrees of freedom is the difference in the number of β\beta-parameters estimated by the two models.
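The "Diff" statistic can be computed directly from the two fitted models and compared to what anova() reports. A self-contained sketch with the built-in mtcars data (the specific nested pair, mpg ~ wt inside mpg ~ wt + hp + disp, is an illustrative assumption):

```r
null_fit = lm(mpg ~ wt, data = mtcars)              # q = 2 beta-parameters
full_fit = lm(mpg ~ wt + hp + disp, data = mtcars)  # p = 4 beta-parameters
n = nrow(mtcars)
p = length(coef(full_fit))
q = length(coef(null_fit))
# "Diff" sum of squares: squared differences of the fitted values
ssd = sum((fitted(full_fit) - fitted(null_fit)) ^ 2)
sse = sum(resid(full_fit) ^ 2)
f_stat = (ssd / (p - q)) / (sse / (n - p))
f_stat
# anova() reports the same statistic
anova(null_fit, full_fit)
```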

For example, the autompg dataset has a number of additional variables that we have yet to use.


We'll continue to use mpg as the response, but now we will consider two different models.

  • Full: mpg ~ wt + year + cyl + disp + hp + acc
  • Null: mpg ~ wt + year

Note that these are nested models, as the null model contains a subset of the predictors from the full model, and no additional predictors. Both models have an intercept β0\beta_0 as well as a coefficient in front of each of the predictors. We could then write the null hypothesis for comparing these two models as,

$H_0: \beta_{\texttt{cyl}} = \beta_{\texttt{disp}} = \beta_{\texttt{hp}} = \beta_{\texttt{acc}} = 0$

The alternative is simply that at least one of the $\beta_{j}$ listed in the null hypothesis is not 0.

To perform this test in R, we first define both models, then pass them to the anova() function.

null_mpg_model = lm(mpg ~ wt + year, data = autompg)
#full_mpg_model = lm(mpg ~ wt + year + cyl + disp + hp + acc, data = autompg)
full_mpg_model = lm(mpg ~ ., data = autompg)
anova(null_mpg_model, full_mpg_model)

Here we have used the formula mpg ~ . to define the full model. This is the same as the commented-out line; it is a common shortcut in R which reads, "model mpg as the response with each of the remaining variables in the data frame as predictors."


Since we ignored the derivation of certain results, we will again use simulation to convince ourselves of some of the above results. In particular, we will simulate samples of size n = 100 from the model

$Y_i = 5 - 2 x_{i1} + 6 x_{i2} + \epsilon_i, \qquad i = 1, 2, \ldots, n$

where $\epsilon_i \sim N(0, \sigma^2 = 16)$. Here we have two predictors, so $p = 3$.

n = 100 # sample size
p = 3

beta_0 = 5
beta_1 = -2
beta_2 = 6
sigma  = 4

As is the norm with regression, the $x$ values are considered fixed and known quantities, so we will simulate those first, and they remain the same for the rest of the simulation study. Also note we create an x0 that is all 1s, which we need to build our X matrix. If you look at the matrix formulation of regression, this unit vector of all 1s is a "predictor" that puts the intercept into the model. We also calculate the C matrix for later use.

x0 = rep(1, n)
x1 = sample(seq(1, 10, length = n))
x2 = sample(seq(1, 10, length = n))
X = cbind(x0, x1, x2)
C = solve(t(X) %*% X)
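As a quick sanity check on the role of x0, the design matrix built with cbind() above is exactly what lm() constructs internally, which model.matrix() exposes. A standalone sketch (toy variable names, chosen here only for illustration):

```r
# standalone toy check: lm() builds the same design matrix internally
set.seed(1)
m    = 5
v1   = runif(m)
v2   = runif(m)
Xtoy = cbind(x0 = rep(1, m), v1, v2)  # column of 1s, then the predictors

fit_toy = lm(rnorm(m) ~ v1 + v2)
# model.matrix() returns the X that lm() actually used
all.equal(unname(Xtoy), unname(model.matrix(fit_toy)))
```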

We then simulate the response according to the model above. Lastly, we place the two predictors and response into a data frame. Note that we do not place x0 in the data frame, since R adds an intercept by default.

eps      = rnorm(n, mean = 0, sd = sigma)
y        = beta_0 + beta_1 * x1 + beta_2 * x2 + eps
sim_data = data.frame(x1, x2, y)

Plotting this data and fitting the regression produces the following plot.

# fit the regression using the simulated data frame
fit = lm(y ~ x1 + x2, data = sim_data)

grid.lines = 25
x1.pred = seq(min(x1), max(x1), length.out = grid.lines)
x2.pred = seq(min(x2), max(x2), length.out = grid.lines)
x1x2 = expand.grid(x1 = x1.pred, x2 = x2.pred)

y.pred = matrix(predict(fit, newdata = x1x2),
                 nrow = grid.lines, ncol = grid.lines)
# fitted points for droplines to surface
fitpoints = predict(fit)

library(plot3D)  # provides scatter3D() and gg.col()

# scatter plot with regression plane
scatter3D(x1, x2, y, pch = 20, cex = 2, col = gg.col(1000), lighting = TRUE,
          theta = 45, phi = 15, ticktype = "detailed", zlim = c(min(y.pred), max(y.pred)), clim = c(min(y.pred), max(y.pred)),
          xlab = "x1", ylab = "x2", zlab = "y",
          surf = list(x = x1.pred, y = x2.pred, z = y.pred,
                      facets = NA, fit = fitpoints), main = "")

We then calculate

$\hat{\beta} = \left( X^\top X \right)^{-1} X^\top y.$

(beta_hat = C %*% t(X) %*% y)

Notice that these values are the same as the coefficients found using lm() in R.

coef(lm(y ~ x1 + x2, data = sim_data))

Also, these values are close to what we would expect.

c(beta_0, beta_1, beta_2)
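As an aside, forming $(X^\top X)^{-1}$ explicitly, as we do with C, can be numerically unstable when predictors are highly correlated; solving the normal equations directly is a common alternative. A standalone sketch (toy names, using the same coefficients as the simulation above):

```r
# standalone sketch: solve the normal equations (X'X) beta = X'y
# without forming the inverse explicitly
set.seed(3)
m  = 50
z1 = runif(m, 1, 10)
z2 = runif(m, 1, 10)
Xz = cbind(1, z1, z2)
yz = drop(Xz %*% c(5, -2, 6)) + rnorm(m, sd = 4)

beta_direct = solve(crossprod(Xz), crossprod(Xz, yz))
beta_lm     = unname(coef(lm(yz ~ z1 + z2)))

# both approaches recover the same estimates
all.equal(as.vector(beta_direct), beta_lm)
```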

We then calculate the fitted values in order to calculate $s_e$, which we see is the same as the sigma returned by summary().

y_hat = X %*% beta_hat
(s_e = sqrt(sum((y - y_hat) ^ 2) / (n - p)))
summary(lm(y ~ x1 + x2, data = sim_data))$sigma

So far so good. Everything checks out. Now we will finally simulate from this model repeatedly in order to obtain an empirical distribution of $\hat{\beta}_2$.

We expect $\hat{\beta}_2$ to follow a normal distribution,

$\hat{\beta}_2 \sim N\left(\beta_2, \sigma^2 C_{22} \right).$

We now perform the simulation a large number of times. Each time, we update the y variable in the data frame, leaving the x variables the same. We then fit a model, and store $\hat{\beta}_2$.

num_sims = 10000
beta_hat_2 = rep(0, num_sims)
for(i in 1:num_sims) {
  eps           = rnorm(n, mean = 0, sd = sigma)
  sim_data$y    = beta_0 * x0 + beta_1 * x1 + beta_2 * x2 + eps
  fit           = lm(y ~ x1 + x2, data = sim_data)
  beta_hat_2[i] = coef(fit)[3]
}

We then see that the mean of the simulated values is close to the true value of $\beta_2$.

mean(beta_hat_2)
We also see that the variance of the simulated values is close to the true variance of $\hat{\beta}_2$,

$\text{Var}[\hat{\beta}_2] = \sigma^2 \cdot C_{22}.$

var(beta_hat_2)
sigma ^ 2 * C[2 + 1, 2 + 1]

The standard deviations found from the simulated data and the parent population are also very close.

sd(beta_hat_2)
sqrt(sigma ^ 2 * C[2 + 1, 2 + 1])

Lastly, we plot a histogram of the simulated values, and overlay the true distribution.

hist(beta_hat_2, prob = TRUE, breaks = 20,
     xlab = expression(hat(beta)[2]), main = "", border = "dodgerblue")
curve(dnorm(x, mean = beta_2, sd = sqrt(sigma ^ 2 * C[2 + 1, 2 + 1])),
      col = "darkorange", add = TRUE, lwd = 3)

This looks good! The histogram of the simulated values appears to be normal with mean 6 and a spread of about 0.15, as measured from the center to an inflection point (one standard deviation).

In-Class Exercise

One Per Group

  1. Using the "bank.csv" file (found here), run a regression with all variables to predict balance. You'll need to code the factor variables (or else use R with some ingenuity).
  2. Use the $F$-test to remove any variables that do not improve your model fit.
  3. Discuss the resulting final model.
  4. Put the (functioning) code in your group repo.
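For the factor-coding step, note that lm() dummy-codes factor predictors automatically. A standalone sketch with made-up data; the column names below are hypothetical stand-ins, not the actual columns of bank.csv:

```r
# toy data standing in for bank.csv; 'job' is a factor predictor
set.seed(7)
toy_bank = data.frame(
  balance = rnorm(20, mean = 1000, sd = 200),
  age     = sample(20:60, 20, replace = TRUE),
  job     = factor(sample(c("admin", "blue-collar", "student"), 20, replace = TRUE),
                   levels = c("admin", "blue-collar", "student"))
)

# lm() creates one dummy column per non-reference level of 'job'
fit_bank = lm(balance ~ age + job, data = toy_bank)
colnames(model.matrix(fit_bank))
```

With treatment contrasts (R's default), "admin" is the reference level, so the design matrix gains the columns jobblue-collar and jobstudent alongside the intercept and age.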

  1. Moreover, for those who are well-versed in EC420 or statistics, it is misleading to examine solely one variable. Omitted variable bias is the basic fact that a misspecified model (one that leaves out relevant predictors) will not, in general, produce unbiased estimates of the included coefficients. ↩︎