Model Building

"Statisticians, like artists, have the bad habit of falling in love with their models."

--- George Box

In the next two lectures we will close our discussion of linear regression. These lectures cover the following:

  • How to think about the model selection process.
  • Exploration versus Prediction.
  • Common issues when practically implementing linear regression.

Let's take a step back and consider the process of finding a model for data at a higher level. We are attempting to find a model for a response variable $y$ based on a number of predictors $x_1, x_2, x_3, \ldots, x_{p-1}$.

Essentially, we are trying to discover the functional relationship between $y$ and the predictors. In the previous lectures we were fitting models for a car's fuel efficiency (mpg) as a function of its attributes (wt, year, cyl, disp, hp, acc). We also consider $y$ to be a function of some noise, since we rarely, if ever, expect an exact functional relationship between the predictors and the response. In principle, nearly any function could occur:

$$y = f(x_1, x_2, x_3, \ldots, x_{p-1}) + \epsilon$$

We can think of this as

$$\text{response} = \text{signal} + \text{noise}.$$

We could consider all sorts of complicated functions for $f$. We will encounter several ways of doing this under the general umbrella of machine learning. But so far in this course we have focused on (multiple) linear regression. That is,

$$\begin{aligned} y &= f(x_1, x_2, x_3, \ldots, x_{p-1}) + \epsilon \\ &= \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \cdots + \beta_{p-1} x_{p-1} + \epsilon \end{aligned}$$

In the big picture of possible models that we could fit to this data, this is a rather restrictive model. What do we mean by a restrictive model?

Family, Form, and Fit

When modeling data, there are a number of choices that need to be made.

  • What family of models will be considered?
  • What form of the model will be used?
  • How will the model be fit?

Let's work backwards and discuss each of these.

Fit

Consider one of the simplest models we could fit to data, simple linear regression.

$$y = f(x_1, x_2, x_3, \ldots, x_{p-1}) + \epsilon = \beta_0 + \beta_1 x_{1} + \epsilon$$

So here, despite having multiple predictors available, we chose to use only one. How is this model fit? We will almost exclusively use the method of least squares, but recall that we have seen alternative methods of fitting this model:

$$\underset{\beta_0, \beta_1}{\mathrm{argmin}} \max |y_i - (\beta_0 + \beta_1 x_i)|$$

$$\underset{\beta_0, \beta_1}{\mathrm{argmin}} \sum_{i = 1}^{n} |y_i - (\beta_0 + \beta_1 x_i)|$$

$$\underset{\beta_0, \beta_1}{\mathrm{argmin}} \sum_{i = 1}^{n} (y_i - (\beta_0 + \beta_1 x_i))^2$$

Any of these methods (we will always use the last, least squares) will obtain estimates of the unknown parameters $\beta_0$ and $\beta_1$. Since those are the only unknowns of the specified model, we have then fit the model. The fitted model is then

$$\hat{y} = \hat{f}(x_1, x_2, x_3, \ldots, x_{p-1}) = \hat{\beta}_0 + \hat{\beta}_1 x_{1}$$

Note that we have dropped the term for the noise. We didn't make any effort to model the noise, only the signal. We construct estimates such that the noise averages out to zero and we politely ignore it.
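To make the three criteria concrete, here is a minimal sketch (on simulated data, purely for illustration) that fits the same simple linear regression form under each loss using R's general-purpose optimizer, optim(). The least squares answer should approximately match coef(lm(y ~ x)).

set.seed(1)
x = runif(50, 0, 10)
y = 2 + 0.8 * x + rnorm(50)

fit_by = function(loss) {
  # numerically minimize the chosen loss over (beta_0, beta_1)
  optim(c(0, 0), function(b) loss(y - (b[1] + b[2] * x)))$par
}

fit_by(function(e) max(abs(e)))  # minimax
fit_by(function(e) sum(abs(e)))  # least absolute deviations
fit_by(function(e) sum(e ^ 2))   # least squares, roughly coef(lm(y ~ x))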

Form

What are the different forms a model could take? Currently, for the linear models we have considered, the only method for altering the form of the model is to control the predictors used. For example, one form of the multiple linear regression model is simple linear regression.

$$y = f(x_1, x_2, x_3, \ldots, x_{p-1}) + \epsilon = \beta_0 + \beta_1 x_{1} + \epsilon$$

We could also consider a SLR model with a different predictor, thus altering the form of the model.

$$y = f(x_1, x_2, x_3, \ldots, x_{p-1}) + \epsilon = \beta_0 + \beta_2 x_{2} + \epsilon$$

Often, we'll use multiple predictors in our model. Very often, we will at least try a model with all possible predictors.

$$\begin{aligned} y &= f(x_1, x_2, x_3, \ldots, x_{p-1}) + \epsilon \\ &= \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \cdots + \beta_{p-1} x_{p-1} + \epsilon \end{aligned}$$

We could also use some, but not all, of the predictors.

$$\begin{aligned} y &= f(x_1, x_2, x_3, \ldots, x_{p-1}) + \epsilon \\ &= \beta_0 + \beta_1 x_{1} + \beta_3 x_{3} + \beta_5 x_{5} + \epsilon \end{aligned}$$

These forms are restrictive in two senses. First, they only allow for linear relationships between the response and the predictors. This seems like an obvious restriction of linear models, but in fact, we will soon see how to use linear models for non-linear relationships. (It will involve transforming variables.) Second, they assume that the way one variable affects the response is the same for any values of the other predictors. Soon we will see how to create models where the effect of $x_{1}$ can be different for different values of $x_{2}$. We will discuss the concept of interaction.
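As a quick illustration of changing the form while staying within the linear family, here is a sketch using R's built-in mtcars data as a stand-in for the fuel efficiency data discussed in lecture (the variable names differ slightly from the autompg data used later):

fit_one   = lm(mpg ~ wt, data = mtcars)              # one predictor
fit_other = lm(mpg ~ hp, data = mtcars)              # a different single predictor
fit_some  = lm(mpg ~ wt + hp + qsec, data = mtcars)  # some, but not all, predictors
fit_all   = lm(mpg ~ ., data = mtcars)               # all available predictors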

Family

A family of models is a broader grouping of many possible forms of a model. For example, above we saw several forms of models from the family of linear models. In social science, we often concern ourselves only with linear models, which model a response as a linear combination of predictors. There are certainly other families of models. Broadly, the family defines our hypothesis set $\mathcal{H}$, which we will discuss more shortly.

For examples of alternative families, there are several families of non-parametric regression. Smoothing is a broad family of models, as are trees.

In linear regression, we specified models with parameters $\beta_j$ and fit the model by finding the best values of these parameters. This is a parametric approach. A non-parametric approach skips the step of specifying a model with parameters, and such models are often described by the algorithm they employ. Non-parametric models are widely used in statistical learning.

Example Code (Simulation)

set.seed(42)
n = 30
beta_0 = 1
beta_1 = 0.5
sigma = 3
x = runif(n, 0, 10)
xplot = seq(0, 10, by = 0.1)

sim_slr = function(n, x, beta_0 = 10, beta_1 = 5, sigma = 1) {
  # simulate from the SLR model: y = beta_0 + beta_1 * x + epsilon
  epsilon = rnorm(n, mean = 0, sd = sigma)
  y       = beta_0 + beta_1 * x + epsilon
  data.frame(predictor = x, response = y)
}

sim_data = sim_slr(n = n, x = x, beta_0 = beta_0, beta_1 = beta_1, sigma = sigma)

sim_fit_1 = lm(response ~ predictor, data = sim_data)
sim_fit_2 = loess(response ~ predictor, data = sim_data)

par(mfrow = c(1, 2))

plot(response ~ predictor, data = sim_data,
     xlab = "Predictor", ylab = "Response",
     main = "Linear Regression",
     pch  = 20, cex  = 3, col = "darkgrey",
     xlim = c(0, 10), ylim = c(-6, 11))
abline(sim_fit_1, lwd = 3, col = "dodgerblue")

plot(response ~ predictor, data = sim_data,
     xlab = "Predictor", ylab = "Response",
     main = "Smoothing",
     pch  = 20, cex  = 3, col = "darkgrey",
     xlim = c(0, 10), ylim = c(-6, 11))
lines(xplot, predict(sim_fit_2, newdata = data.frame(predictor = xplot)),
      lwd = 3, lty = 2, col = "darkorange")

Here, linear regression (parametric) is used on the left, while smoothing (non-parametric) is used on the right. This simple linear regression finds the "best" slope and intercept. Smoothing produces the fitted $y$ value at a particular $x$ value by considering the $y$ values of the data in a neighborhood of the $x$ value considered. (Local smoothing.) Which is "better"?

And why the focus on linear models? Two big reasons:

  • Linear models are the go-to model. Linear models have been around for a long time, and are computationally easy. A linear model may not be the final model you use, but often, it should be the first model you try.
  • The ideas behind linear models can be easily transferred to other modeling techniques.

Assumed Model, Fitted Model

When searching for a model, we often need to make assumptions. These assumptions are codified in the family and form of the model. For example

$$y = \beta_0 + \beta_1 x_{1} + \beta_3 x_{3} + \beta_5 x_{5} + \epsilon$$

assumes that $y$ is a linear combination of $x_{1}$, $x_{3}$, and $x_{5}$, as well as some noise. This assumes that the effect of $x_{1}$ on $y$ is $\beta_1$, which is the same for all values of $x_{3}$ and $x_{5}$. That is, we are using the family of linear models with a particular form.

Suppose we then fit this model to some data and obtain the fitted model. For example, in R we would use

fit = lm(y ~ x1 + x3 + x5, data = some_data)

This is R's way of saying the family is linear and specifying the form from above: an additive model with the specified predictors as well as an intercept. We then obtain

$$\hat{y} = 1.5 + 0.9 x_{1} + 1.1 x_{3} + 2.3 x_{5}.$$

This is our best guess for the function ff in

$$y = f(x_1, x_2, x_3, \ldots, x_{p-1}) + \epsilon$$

for the assumed family and form. Fitting a model only gives us the best fit for the family and form that we specify. So the natural question is: how do we choose the correct family and form? We'll focus on form, since we are restricting ourselves to the family of linear models.
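To see the assumed-versus-fitted distinction concretely, here is a small simulation sketch; the data frame, sample size, and "true" coefficients (borrowed from the hypothetical fitted model above) are all made up for illustration.

set.seed(3)
n_sim = 200
x1 = rnorm(n_sim); x3 = rnorm(n_sim); x5 = rnorm(n_sim)
# assumed model: y is a linear combination of x1, x3, x5, plus normal noise
y = 1.5 + 0.9 * x1 + 1.1 * x3 + 2.3 * x5 + rnorm(n_sim)
toy_data = data.frame(y, x1, x3, x5)
coef(lm(y ~ x1 + x3 + x5, data = toy_data))  # the fitted model: estimates of the betas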

"Explanation" versus Prediction

What is the purpose of fitting a model to data? Usually it is to accomplish one of two goals. Social scientists often use models to explain the relationship between the response and the predictors. Models can also be used to predict the response based on the predictors.

For reasons we've previously discussed, I discourage pretending that you're explaining; it's very hard to show causality. Instead, I will use the term explore. Thus we'll consider exploring versus predicting. We'll discuss both goals below, and note how the processes of finding models for exploring and for predicting have some differences.

For the moment---since we are only considering linear models---searching for a good model is essentially searching for a good form of a model.

Exploration

If the goal of a model is to explore the relationship between the response and the predictors, we are looking for a model that is small and interpretable, but still fits the data well. When discussing linear models, the size of a model is essentially the number of $\beta$ parameters used.

Suppose we would like to find a model that explains fuel efficiency (mpg) based on a car's attributes (wt, year, cyl, disp, hp, acc). Perhaps we are a car manufacturer trying to engineer a fuel efficient vehicle. If this is the case, we are interested both in which predictor variables are useful for explaining the car's fuel efficiency and in how those variables affect fuel efficiency. By getting a better understanding of this relationship (at least locally), we can use this knowledge to our advantage when designing a car.

To explain a relationship, we are interested in keeping models as small as possible, since smaller models are easier to interpret. The fewer the predictors, the fewer considerations we need to make in our design process.[1]

To find small and interpretable models, we will eventually use selection procedures, which search among many possible forms of a model. For now we will do this in a more ad-hoc manner, using inference techniques we have already encountered. To use inference as we have seen it, we need one additional assumption beyond the family and form of the model.

$$y = \beta_0 + \beta_1 x_{1} + \beta_3 x_{3} + \beta_5 x_{5} + \epsilon$$

Our additional assumption is about the error term.

$$\epsilon \sim N(0, \sigma^2)$$

This assumption, that the errors are normally distributed with some common variance, is the key to all of the inference we have done so far. We will discuss this in great detail later.

So with our inference tools (ANOVA and the $t$-test) we have two potential strategies: start with a very small model (no predictors) and attempt to add predictors, or start with a big model (all predictors) and attempt to remove predictors.
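As a small sketch of the second strategy, we can inspect the $t$-tests of a large model and the partial $F$-tests for dropping each predictor. R's built-in mtcars data is used here as a stand-in, since the autompg data is not loaded until the next code block.

full = lm(mpg ~ wt + cyl + disp + hp + qsec, data = mtcars)
summary(full)$coefficients  # t-test for each predictor, given the others
drop1(full, test = "F")     # partial F-test for removing each predictor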

Correlation and Causation

# read the auto-mpg data from the UCI repository, then clean it up below
autompg = read.table(
  "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data",
  quote = "\"",
  comment.char = "",
  stringsAsFactors = FALSE)
colnames(autompg) = c("mpg", "cyl", "disp", "hp", "wt", "acc", "year", "origin", "name")
autompg = subset(autompg, autompg$hp != "?")
autompg = subset(autompg, autompg$name != "plymouth reliant")
rownames(autompg) = paste(autompg$cyl, "cylinder", autompg$year, autompg$name)
autompg = subset(autompg, select = c("mpg", "cyl", "disp", "hp", "wt", "acc", "year"))
autompg$hp = as.numeric(autompg$hp)

A word of caution when using a model to explore a relationship. There are two terms often used to describe a relationship between two variables: causation and correlation. Correlation is often also referred to as association.

Just because two variables are correlated does not necessarily mean that one causes the other. For example, consider modeling mpg as only a function of hp.

plot(mpg ~ hp, data = autompg, col = "dodgerblue", pch = 20, cex = 1.5)

Does an increase in horsepower cause a drop in fuel efficiency? Or, perhaps the causality is reversed and an increase in fuel efficiency causes a decrease in horsepower. Or, perhaps there is a third variable that explains both!

The issue here is that we have observational data. With observational data, we can only detect associations. To speak with confidence about causality, we would need to run experiments. Often, this decision is made for us, before we ever see data, so we can only modify our interpretation.

This is a concept that you will encounter often in your professional lives. (Especially for you economists---does demand cause supply or vice versa?) For some further reading, and some related fallacies, see: Wikipedia: Correlation does not imply causation.

Prediction

If the goal of a model is to predict the response, then the only consideration is how well the model fits the data.

Correlation and causation are not an issue here. If a predictor is correlated with the response, it is useful for prediction. For example, among elementary school aged children, shoe size certainly doesn't cause them to read at a higher level; however, we could very easily use shoe size to make a prediction about reading ability. The larger their shoe size, the better they read. There's a lurking variable here, though: age.

Also, since we are not performing inference, assumptions about the errors are not needed. The only thing we care about is how close the fitted model is to the data. Least squares is least squares. For a specified model, it will find the values of the parameters which will minimize the squared error loss. Your results might be largely uninterpretable and useless for inference, but for prediction none of that matters.
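As a minimal sketch of judging a model purely on prediction (we will treat this properly later), consider a train-test split and the resulting test RMSE, using the autompg data loaded above; the split size and seed are arbitrary.

set.seed(7)
idx      = sample(nrow(autompg), size = 300)
auto_trn = autompg[idx, ]
auto_tst = autompg[-idx, ]
fit_all  = lm(mpg ~ ., data = auto_trn)
pred     = predict(fit_all, newdata = auto_tst)
sqrt(mean((auto_tst$mpg - pred) ^ 2))  # test RMSE: smaller means better predictions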

We'll return to prediction in earnest in a few days.

Assessing Our Toolkit

The following is a reasonable set of questions to ask about a model:

  1. Is at least one of the predictors $X_1, X_2, \ldots, X_p$ useful in predicting the response?
  2. Do all the predictors help to explain $Y$, or is only a subset of the predictors useful?
  3. How well does the model fit the data?
  4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

Is $X$ useful in predicting $Y$?

A natural question is whether at least one of the predictors is useful in capturing the variation in $Y$.

As we previously explored, perhaps the most straightforward way to address this question is with the $F$ test.

$$F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)} \sim F_{p,\, n-p-1}$$

Written this way, we can loosely think of this test as the explained variation (numerator) normalized by the unexplained variation (denominator). But this isn't exactly right, so let's be careful with such interpretations.
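To connect the formula to output we have already seen, here is a quick check using the autompg data loaded above: compute the $F$ statistic by hand and compare it to the value lm() reports. The object names are arbitrary.

fit = lm(mpg ~ ., data = autompg)
rss = sum(resid(fit) ^ 2)                         # RSS of the full model
tss = sum((autompg$mpg - mean(autompg$mpg)) ^ 2)  # TSS
p   = length(coef(fit)) - 1                       # number of predictors
n   = nrow(autompg)
((tss - rss) / p) / (rss / (n - p - 1))           # F statistic by hand
summary(fit)$fstatistic["value"]                  # should agree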

Could we merely use a subset of predictors?

This is a complicated question and a very common one in social science.

  • The most direct approach is called all subsets or best subsets regression: we compute the least squares fit for all possible subsets and then choose between them based on some criterion that balances training error with model size.
  • However, we often can't examine all possible models, since there are $2^p$ of them; for example, when $p = 40$ there are over a trillion potential models we could explore.
  • Instead we need an automated approach that searches through a subset of them. We discuss two commonly used approaches next.

Forward Selection

  1. Begin with the null model — a model that contains an intercept but no predictors.

  2. Fit $p$ simple linear regressions and add to the null model the variable that results in the lowest RSS. You now have a simple linear regression.

  3. Add to that model the variable that results in the lowest RSS amongst all two-variable models. (So now your model is given by $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$.)

  4. Continue adding variables one at a time until some stopping rule is satisfied.

  • This part is pretty ad-hoc. For example, people often adopt the rule that they will stop when all remaining variables have a $p$-value above some threshold, but many other stopping rules are used in practice.

Backward Selection

  • Start with all variables in the model.
  • Remove the variable with the largest $p$-value, that is, the variable that is the least statistically significant.
  • The new $(p - 1)$-variable model is fit, and the variable with the largest $p$-value is removed.
  • Continue until a stopping rule is reached. As before, this is an ad-hoc rule. For instance, we may stop when all remaining variables have a significant $p$-value as defined by some significance threshold. (A code sketch of both searches follows below.)
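One automated version of these searches in R is the step() function, sketched below on the autompg data loaded earlier. Note that step() uses AIC rather than $p$-values as its stopping rule, but the add-one/remove-one mechanics are the same idea.

null_mod = lm(mpg ~ 1, data = autompg)  # intercept only
full_mod = lm(mpg ~ ., data = autompg)  # all predictors

# forward: start from the null model, add one variable at a time
step(null_mod, scope = mpg ~ cyl + disp + hp + wt + acc + year,
     direction = "forward", trace = 0)

# backward: start from the full model, remove one variable at a time
step(full_mod, direction = "backward", trace = 0)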

Later we will discuss more systematic criteria for choosing an "optimal" member in the subset of models produced by forward or backward stepwise selection. Moreover, we'll introduce formal mathematical models that automate this process.

How Well Does Our Model Fit?

The coefficient of determination, $R^2$, is defined as

$$\begin{aligned} R^2 &= \frac{\text{SSReg}}{\text{SST}} = \frac{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \\[2.5ex] &= \frac{\text{SST} - \text{SSE}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}} \\[2.5ex] &= 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = 1 - \frac{\sum_{i = 1}^{n}e_i^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \end{aligned}$$

The coefficient of determination is interpreted as the proportion of observed variation in yy that can be explained by the simple linear regression model.

fit   = lm(dist ~ speed, data = cars)              # built-in cars data (dist ~ speed)
SSReg = sum((fitted(fit) - mean(cars$dist)) ^ 2)   # explained sum of squares
SST   = sum((cars$dist - mean(cars$dist)) ^ 2)     # total sum of squares
SSReg / SST                                        # R2, matches summary(fit)$r.squared

For the cars example, we'd say that $R^2 \cdot 100$ percent of the observed variability in stopping distance is explained by the linear relationship with speed. The following code visually demonstrates the three "sums of squares" for a simulated dataset which has a somewhat high $R^2$ value. It also demonstrates defining functions (a notion that is hopefully familiar, but may be new to some).

generate_data = function(int = 1,
                         slope = 2,
                         sigma = 5,
                         n_obs = 15,
                         x_min = 0,
                         x_max = 10) {
  x = seq(x_min, x_max, length.out = n_obs)
  y = int + slope * x + rnorm(n_obs, 0, sigma)
  fit = lm(y ~ x)
  y_hat = fitted(fit)
  y_bar = rep(mean(y), n_obs)
  data.frame(x, y, y_hat, y_bar)
}

plot_total_dev = function(reg_data) {
  plot(reg_data$x, reg_data$y, main = "SST (Sum of Squares Total)",
       xlab = "x", ylab = "y", pch = 20, cex = 2, col = "grey")
  arrows(reg_data$x, reg_data$y_bar,
         reg_data$x, reg_data$y,
         col = 'grey', lwd = 1, lty = 3, length = 0.2, angle = 20)
  abline(h = mean(reg_data$y), lwd = 2,col = "grey")
  # abline(lm(y ~ x, data = reg_data), lwd = 2, col = "grey")
}

plot_total_dev_prop = function(reg_data) {
  plot(reg_data$x, reg_data$y, main = "SST (Sum of Squares Total)",
       xlab = "x", ylab = "y", pch = 20, cex = 2, col = "grey")
  arrows(reg_data$x, reg_data$y_bar,
         reg_data$x, reg_data$y_hat,
         col = 'darkorange', lwd = 1, length = 0.2, angle = 20)
  arrows(reg_data$x, reg_data$y_hat,
         reg_data$x, reg_data$y,
         col = 'dodgerblue', lwd = 1, lty = 2, length = 0.2, angle = 20)
  abline(h = mean(reg_data$y), lwd = 2,col = "grey")
  abline(lm(y ~ x, data = reg_data), lwd = 2, col = "grey")
}

Code Example (Cont.)

plot_unexp_dev = function(reg_data) {
  plot(reg_data$x, reg_data$y, main = "SSE (Sum of Squares Error)",
       xlab = "x", ylab = "y", pch = 20, cex = 2, col = "grey")
  arrows(reg_data$x, reg_data$y_hat,
         reg_data$x, reg_data$y,
         col = 'dodgerblue', lwd = 1, lty = 2, length = 0.2, angle = 20)
  abline(lm(y ~ x, data = reg_data), lwd = 2, col = "grey")
}

plot_exp_dev = function(reg_data) {
  plot(reg_data$x, reg_data$y, main = "SSReg (Sum of Squares Regression)",
  xlab = "x", ylab = "y", pch = 20, cex = 2, col = "grey")
  arrows(reg_data$x, reg_data$y_bar,
         reg_data$x, reg_data$y_hat,
         col = 'darkorange', lwd = 1, length = 0.2, angle = 20)
  abline(lm(y ~ x, data = reg_data), lwd = 2, col = "grey")
  abline(h = mean(reg_data$y), col = "grey")
}
set.seed(2)
plot_data = generate_data(sigma = 2)
par(mfrow = c(2, 2))
plot_exp_dev(plot_data)
plot_unexp_dev(plot_data)
plot_total_dev(plot_data)
plot_total_dev_prop(plot_data)
set.seed(1)
plot_data = generate_data(slope = -1.5, sigma = 10)

The next plots again visually demonstrate the three "sums of squares", this time with a different $R^2$. Notice in the final plot that the blue arrows now account for a larger proportion of the total arrows.

par(mfrow = c(2, 2))
plot_exp_dev(plot_data)
plot_unexp_dev(plot_data)
plot_total_dev(plot_data)
plot_total_dev_prop(plot_data)

Data Manipulation for Linear Regressions

Qualitative Predictors

  • Some predictors are not quantitative but are qualitative. That is, they take on a discrete set of values, which may be stored as strings or other non-numeric data types.
  • These are also called categorical predictors or factor variables.
  • You have explored these a bit already. See, for instance, marital, education, or job in the bank data from the previous lecture and Lab 1.
  • Such variables need to be handled very carefully. The next few slides will give you some guidance, but this problem is general and will crop up for many different types of models that we will explore in this course.

Consider the following exercise. Suppose we are endowed with data from a major national bank that is curious about credit card utilization amongst different generations. Specifically, we want to consider how credit card debt differs between Boomers and Millennials. They've (helpfully) generated a qualitative variable with these two words. That is, the data itself is a string which takes only two values.

We'll create a new variable:
$$\mathbb{D}_i = \begin{cases} 1 & \text{if the $i$th person is a Boomer} \\ 0 & \text{if the $i$th person is not a Boomer} \end{cases}$$

Now we can run the regression:
$$y_i = \beta_0 + \beta_1 \mathbb{D}_i + \epsilon$$

The resulting model is given by
$$y_i = \beta_0 + \beta_1 \mathbb{D}_i + \epsilon = \begin{cases} \beta_0 + \epsilon & \text{if the $i$th person is not a Boomer} \\ \beta_0 + \beta_1 + \epsilon & \text{if the $i$th person is a Boomer} \end{cases}$$

How do we interpret this output? Simple!

  • This simple model gives us the average credit card debt $y$ for each of the two groups. However, it's not quite as simple as just staring at the numbers.
    1. The average credit card debt of non-Boomers is given by $\beta_0$. (Try it: verify this is true by construction and using the regression equations.)
    2. The average credit card debt of Boomers is given by $\beta_0 + \beta_1$.

This example helps show why thinking about things carefully matters a great deal. For example, it illustrates that a natural statistical test here is of $H_0: \beta_1 = 0$, i.e., that the two groups have the same mean debt.
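A small simulation sketch (with made-up debt numbers) to verify the interpretation above: the intercept estimate is the non-Boomer group mean, and the intercept plus the slope estimate is the Boomer group mean.

set.seed(10)
generation = sample(c("Boomer", "Millennial"), size = 100, replace = TRUE)
debt       = ifelse(generation == "Boomer", 6000, 4000) + rnorm(100, sd = 500)
D   = as.numeric(generation == "Boomer")  # the dummy variable defined above
fit = lm(debt ~ D)
coef(fit)                       # estimates of beta_0 and beta_1
tapply(debt, generation, mean)  # group means match beta_0 and beta_0 + beta_1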

Qualitative Variables Cont.

If our original data only had Boomers and Millennials, we could use $\beta_0$ as the mean debt of Millennials (and $\beta_0 + \beta_1$ as the mean debt of Boomers). Of course, qualitative variables often take on more than two levels, and our data likely would have multiple levels. We can manually create additional "dummy" variables to account for each of these. For example, let's imagine that our credit card data has only births after 1950. Thus, for the variable we have been describing, a natural encoding would also include "Gen X" and "Gen Z". We can create two more dummy variables to account for this:
$$\mathbb{D}_{i2} = \begin{cases} 1 & \text{if the $i$th person is Gen X} \\ 0 & \text{if the $i$th person is not Gen X} \end{cases}$$
and
$$\mathbb{D}_{i3} = \begin{cases} 1 & \text{if the $i$th person is a Millennial} \\ 0 & \text{if the $i$th person is not a Millennial} \end{cases}$$

Our final regression would include all of these dummies:
$$y_i = \beta_0 + \beta_1 \mathbb{D}_{i1} + \beta_2 \mathbb{D}_{i2} + \beta_3 \mathbb{D}_{i3} + \epsilon = \begin{cases} \beta_0 + \epsilon & \text{if the $i$th person is Gen Z} \\ \beta_0 + \beta_1 + \epsilon & \text{if the $i$th person is a Boomer} \\ \beta_0 + \beta_2 + \epsilon & \text{if the $i$th person is Gen X} \\ \beta_0 + \beta_3 + \epsilon & \text{if the $i$th person is a Millennial} \end{cases}$$

This highlights two important facts:

  1. The number of dummies is always one fewer than the number of values that the factor variable can take. The intercept captures the mean of this baseline group; in this example, Gen Z is the baseline.
  2. The order in which you encode these variables can either aid or harm your interpretation; you should do this carefully. (In practice, R automates the encoding when the variable is stored as a factor; see the sketch below.)
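A brief sketch of how R automates this encoding for factor variables, with hypothetical generation labels and debt values: lm() picks a baseline level and builds the dummy columns itself.

set.seed(11)
gen  = factor(sample(c("GenZ", "Boomer", "GenX", "Millennial"), 200, replace = TRUE))
gen  = relevel(gen, ref = "GenZ")  # make Gen Z the baseline, as in the example above
debt = 4000 + 1500 * (gen == "Boomer") + 500 * (gen == "GenX") + rnorm(200, sd = 400)
head(model.matrix(~ gen))          # the dummy columns R creates (baseline dropped)
coef(lm(debt ~ gen))               # intercept is the Gen Z mean; other coefficients are offsets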

Common Problems in Linear Regression

When we fit a linear regression model to a particular data set, many problems can occur. Some violations of the underlying assumptions are benign; others are a bigger issue. (These are discussed in Section 3.3.3 of the text.) A few common issues are the following:

  1. Non-linearity of the response-predictor relationships.
  2. Correlation of error terms.
  3. Non-constant variance of error terms.
  4. Outliers.
  5. High-leverage points.
  6. Collinearity.

Non-Linearity

This is a deep issue and one that linear regression is limited in its ability to handle. As previously discussed, linear regression restricts the family that we utilize. We will describe a practical solution to this issue, but it introduces conceptual problems. To highlight the conceptual difficulties, let's remind ourselves that the underlying (unknown) relationship is given by:
$$Y = f(X) + \epsilon$$

Accordingly, it's unreasonable to expect this relationship to be exactly linear. However, Taylor's Theorem (and its associated series approximation) suggests that we might be able to locally represent $f$ by some linear function. (See here for the calculus.)

But we should be very wary of interpreting much from this exercise. If we find that the regression
$$y = \beta_0 + \beta_1 x^4 + \epsilon$$
yields a significant positive $\beta_1$, that does not mean that this is the true relationship; it merely means that this polynomial does a reasonable job of approximating the underlying truth (at least locally).
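A minimal simulated sketch of this point: the data below come from a relationship that is not a fourth-degree polynomial, yet regressing on a transformed predictor ($x^4$, chosen only to mirror the equation above) can still produce a highly significant coefficient.

set.seed(20)
x = runif(100, 1, 3)
y = exp(x) + rnorm(100)         # the true relationship is exponential, not polynomial
fit_poly = lm(y ~ I(x^4))       # transform the predictor inside a linear model
summary(fit_poly)$coefficients  # the x^4 term will typically look highly "significant"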


  1. Note that linear models of any size are rather interpretable to begin with. Later in your data analysis careers, you will see more complicated models that may fit data better, but are much harder, if not impossible, to interpret. These models aren't nearly as useful for explaining a relationship. This is another reason to always attempt a linear model: if it fits as well as more complicated methods, it will be the easiest to understand. ↩︎