Regression is a powerful tool in statistical analysis. However, it is important to be able to interpret regression output correctly in order to make sound decisions going forward.
To delve deeper into this, we will look at the LifeCycleSavings dataset, which ships with R:
data("LifeCycleSavings")
head(LifeCycleSavings)
sr pop15 pop75 dpi ddpi
Australia 11.43 29.35 2.87 2329.68 2.87
Austria 12.07 23.32 4.41 1507.99 3.93
Belgium 13.17 23.80 4.43 2108.47 3.82
Bolivia 5.75 41.89 1.67 189.13 0.22
Brazil 12.88 42.19 0.83 728.47 4.56
Canada 8.79 31.72 2.85 2982.88 2.43
This is a data frame with 50 observations on 5 variables:
sr (numeric): aggregate personal savings
pop15 (numeric): % of population under 15
pop75 (numeric): % of population over 75
dpi (numeric): real per-capita disposable income
ddpi (numeric): % growth rate of dpi
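A quick way to confirm this structure from the console is str(), which lists each variable's type alongside its first few values:

```r
data("LifeCycleSavings")  # ships with base R's datasets package
str(LifeCycleSavings)     # 'data.frame': 50 obs. of 5 variables: sr, pop15, pop75, dpi, ddpi
```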
Under the life-cycle savings hypothesis as developed by Franco Modigliani, the savings ratio (aggregate personal saving divided by disposable income) is explained by per-capita disposable income, the percentage rate of change in per-capita disposable income, and two demographic variables: the percentage of population less than 15 years old and the percentage of the population over 75 years old. The data are averaged over the decade 1960–1970 to remove the business cycle or other short-term fluctuations.
In order to test this hypothesis, we fit a multiple linear regression:
lm1<-lm(
data = LifeCycleSavings,
formula = sr~pop15+ddpi+pop75+dpi
)
summary(lm1)
Call:
lm(formula = sr ~ pop15 + ddpi + pop75 + dpi, data = LifeCycleSavings)
Residuals:
Min 1Q Median 3Q Max
-8.2422 -2.6857 -0.2488 2.4280 9.7509
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.5660865 7.3545161 3.884 0.000334 ***
pop15 -0.4611931 0.1446422 -3.189 0.002603 **
ddpi 0.4096949 0.1961971 2.088 0.042471 *
pop75 -1.6914977 1.0835989 -1.561 0.125530
dpi -0.0003369 0.0009311 -0.362 0.719173
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.803 on 45 degrees of freedom
Multiple R-squared: 0.3385, Adjusted R-squared: 0.2797
F-statistic: 5.756 on 4 and 45 DF, p-value: 0.0007904
This table is the output of a linear regression model in R, which assesses the relationship between a dependent variable (in our case, sr) and several independent variables (pop15, ddpi, pop75, dpi). Here’s a breakdown of the key components:
Call: This shows the actual command used to fit the model, indicating the formula and the dataset used. In our case, it’s lm(formula = sr ~ pop15 + ddpi + pop75 + dpi, data = LifeCycleSavings).
Residuals: These are the differences between the observed values and the values predicted by the model. The summary statistics (Min, 1Q, Median, 3Q, Max) give you an idea of their distribution. Ideally, residuals should be roughly symmetric around zero with no obvious pattern.
Coefficients:
Estimate: The estimated effect of each independent variable on the dependent variable. For example, pop15 has an estimate of -0.4611931, suggesting that a one-unit increase in pop15 is associated with a 0.4611931-unit decrease in sr, holding the other variables constant.
Std. Error: Standard error of the estimate, indicating the variability of the estimate.
t value: The t-statistic, calculated as the Estimate divided by the Standard Error. It tests the null hypothesis that the coefficient is equal to zero.
Pr(>|t|): P-value associated with the t-statistic. A low p-value (< 0.05) suggests that the effect of the variable is statistically significant at the 5% level.
Significance Codes: Symbols next to the coefficients indicate their level of statistical significance. They are handy for a quick glance at a table, but focus on the actual p-values where you can.
Residual Standard Error: This is the estimate of the standard deviation of the residuals, giving an idea of the typical size of the prediction errors.
Degrees of Freedom: In our model, there are 45 degrees of freedom: the number of observations (50) minus the number of parameters estimated (four coefficients plus the intercept).
Multiple R-squared (Or just R-Squared): Indicates the proportion of variance in the dependent variable that can be explained by the independent variables. In this model, it’s 0.3385.
Adjusted R-squared: Similar to R-squared but adjusted for the number of predictors in the model, making it more reliable for comparing models with different numbers of independent variables. As you add more variables, regardless of their significance, R-squared will usually go up; Adjusted R-squared penalizes the model for the degrees of freedom it uses. In our model, it is 0.2797.
F-statistic and its p-value: Tests the overall significance of the model by comparing it to a model with no predictors (only an intercept). The low p-value here (0.0007904) suggests that our model provides a better fit than the intercept-only model. The degrees of freedom are 4 (the number of independent variables) and 45 (number of observations [50] minus number of independent variables [4] minus 1).
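As a sanity check, each of these summary statistics can be reproduced by hand from the fitted model; the sketch below uses only the lm1 object fitted above:

```r
data("LifeCycleSavings")
lm1 <- lm(sr ~ pop15 + ddpi + pop75 + dpi, data = LifeCycleSavings)

est <- coef(summary(lm1))            # columns: Estimate, Std. Error, t value, Pr(>|t|)
t_by_hand <- est[, "Estimate"] / est[, "Std. Error"]  # t value = Estimate / Std. Error

rss <- sum(residuals(lm1)^2)         # residual sum of squares
tss <- sum((LifeCycleSavings$sr - mean(LifeCycleSavings$sr))^2)
r2  <- 1 - rss / tss                 # Multiple R-squared: 0.3385
rse <- sqrt(rss / df.residual(lm1))  # Residual standard error: 3.803 on 45 df
f   <- (r2 / 4) / ((1 - r2) / 45)    # F-statistic: 5.756 on 4 and 45 DF
```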
ANOVA is another type of analysis we can use to improve our initial model.
Performing an ANOVA on the previous model gives the following:
anova(lm1)
Analysis of Variance Table
Response: sr
Df Sum Sq Mean Sq F value Pr(>F)
pop15 1 204.12 204.118 14.1157 0.0004922 ***
ddpi 1 78.96 78.959 5.4604 0.0239634 *
pop75 1 47.95 47.946 3.3157 0.0752748 .
dpi 1 1.89 1.893 0.1309 0.7191732
Residuals 45 650.71 14.460
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From this output we can see the following:
Response: This indicates the dependent variable in the model, which is sr in our case.
ANOVA Output Table: Each row provides the analysis for one of the independent variables in the model. The columns are: Df (degrees of freedom used by the term), Sum Sq (the sequential sum of squares attributed to the term), Mean Sq (Sum Sq divided by Df), F value (Mean Sq divided by the residual Mean Sq), and Pr(>F) (the p-value for that F test).
Significance Codes: These are the same symbols used in the regression output table.
Reading the table row by row: pop15 (p = 0.0005) and ddpi (p = 0.024) are significant at the 5% level, pop75 (p = 0.075) is only marginally significant, and dpi (p = 0.719) contributes essentially nothing once the other variables are in the model.
In summary, this ANOVA table helps us understand which variables in the regression model are significant predictors of the dependent variable sr. It provides a statistical test for each independent variable, indicating whether each one has a significant relationship with the dependent variable.
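One caveat worth knowing: anova() on a single model uses sequential (Type I) sums of squares, so the contribution credited to each variable depends on the order of the terms in the formula. A small sketch, refitting the same model with the terms reversed:

```r
data("LifeCycleSavings")
# Same four predictors, two different orders in the formula.
m_fwd <- lm(sr ~ pop15 + ddpi + pop75 + dpi, data = LifeCycleSavings)
m_rev <- lm(sr ~ dpi + pop75 + ddpi + pop15, data = LifeCycleSavings)
anova(m_fwd)  # dpi entered last: credited only with variance the others leave over
anova(m_rev)  # dpi entered first: its row changes, even though the fit is identical
# The residual row (and the overall fit) is the same either way.
```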
Adjusting the regression by removing dpi, we get the following new model:
lm_R<-lm(
data = LifeCycleSavings,
formula = sr~pop15+pop75+ddpi
)
summary(lm_R)
Call:
lm(formula = sr ~ pop15 + pop75 + ddpi, data = LifeCycleSavings)
Residuals:
Min 1Q Median 3Q Max
-8.2539 -2.6159 -0.3913 2.3344 9.7070
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.1247 7.1838 3.915 0.000297 ***
pop15 -0.4518 0.1409 -3.206 0.002452 **
pop75 -1.8354 0.9984 -1.838 0.072473 .
ddpi 0.4278 0.1879 2.277 0.027478 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.767 on 46 degrees of freedom
Multiple R-squared: 0.3365, Adjusted R-squared: 0.2933
F-statistic: 7.778 on 3 and 46 DF, p-value: 0.0002646
There might still be some further room for improvement in our analysis. Let's also remove pop75, and then use an ANOVA to compare our original regression against this new, smaller one:
lm2<-lm(
data = LifeCycleSavings,
formula = sr~pop15+ddpi
)
anova(lm2,lm1)
Analysis of Variance Table
Model 1: sr ~ pop15 + ddpi
Model 2: sr ~ pop15 + ddpi + pop75 + dpi
Res.Df RSS Df Sum of Sq F Pr(>F)
1 47 700.55
2 45 650.71 2 49.839 1.7233 0.19
Here we have a slightly different ANOVA output. It first states the two models being compared, Model 1 and Model 2, and then gives a comparison table: Res.Df and RSS are the residual degrees of freedom and residual sum of squares of each model, Df and Sum of Sq are the differences between the two, and F and Pr(>F) test whether the extra predictors reduce the residual sum of squares by more than chance alone would suggest.
To conclude, while Model 2 (with four predictors) explains more variance in sr than Model 1 (with two predictors), the difference is not statistically significant, suggesting that the addition of pop75 and dpi does not significantly improve the model’s fit.
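The F statistic in this comparison can be reproduced by hand from the two residual sums of squares:

```r
data("LifeCycleSavings")
lm1 <- lm(sr ~ pop15 + ddpi + pop75 + dpi, data = LifeCycleSavings)
lm2 <- lm(sr ~ pop15 + ddpi, data = LifeCycleSavings)

rss_small <- sum(residuals(lm2)^2)  # 700.55 on 47 residual df
rss_big   <- sum(residuals(lm1)^2)  # 650.71 on 45 residual df

# F = (drop in RSS per extra parameter) / (residual mean square of the larger model)
f <- ((rss_small - rss_big) / 2) / (rss_big / 45)   # 1.7233, as in the anova() output
p <- pf(f, 2, 45, lower.tail = FALSE)               # 0.19
```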
If you want to compare similar regression models alongside each other, you can use the stargazer package.
Here is an example comparing the models above. It displays the coefficient and standard error for each variable in each model:
library(stargazer)
stargazer(lm1, lm_R, lm2,
type = "html",
title = "Three Regression Models Predicting Variation in Savings Rate",
column.labels = c("Original Model", "Intermediate Model", "Final Model"),
colnames = FALSE,
model.numbers = FALSE,
notes.align = "l",
dep.var.labels = "Savings Rate"
)
                      Dependent variable: Savings Rate
                      Original Model          Intermediate Model      Final Model
pop15                 -0.461***               -0.452***               -0.216***
                      (0.145)                 (0.141)                 (0.060)
ddpi                  0.410**                 0.428**                 0.443**
                      (0.196)                 (0.188)                 (0.192)
pop75                 -1.691                  -1.835*
                      (1.084)                 (0.998)
dpi                   -0.0003
                      (0.001)
Constant              28.566***               28.125***               15.600***
                      (7.355)                 (7.184)                 (2.334)
Observations          50                      50                      50
R2                    0.338                   0.337                   0.288
Adjusted R2           0.280                   0.293                   0.257
Residual Std. Error   3.803 (df = 45)         3.767 (df = 46)         3.861 (df = 47)
F Statistic           5.756*** (df = 4; 45)   7.778*** (df = 3; 46)   9.496*** (df = 2; 47)
Note: *p<0.1; **p<0.05; ***p<0.01
Another thing you may want to do is produce a visual representation of your regression. The first plot we will look at is an Actual vs Fitted plot. You can create it with default base R plots, but you can make it more presentable using the ggplot2 and ggthemes packages:
library(ggplot2)
library(ggthemes)
ggplot(
LifeCycleSavings,
aes(
x = sr,
y = lm2$fitted.values
)
) +
geom_point() +
labs(
title = "Actual vs Fitted Plot",
caption = "Created for illustrative purposes",
tag = "Figure 1",
x = "Actual Values",
y = "Fitted Values"
) +
geom_abline(slope = 1, intercept = 0) + #geom_abline is used to draw y = x
theme_classic() #A preset theme; this one comes from ggplot2 itself, with ggthemes adding many more
You can also produce a Fitted vs Residuals plot too:
ggplot(
LifeCycleSavings,
aes(
x = lm2$fitted.values,
y = lm2$residuals
)
) +
geom_point() +
labs(
title = "Fitted vs Residuals Plot",
caption = "Created for illustrative purposes",
tag = "Figure 2",
x = "Fitted Values",
y = "Residuals"
) +
geom_hline(yintercept = 0) +
theme_economist() #Another theme provided by ggthemes
You can also create a QQplot as a ggplot object using the lindia package:
library(lindia)
gg_qqplot(lm2) +
theme_excel() #Because it is a ggplot object I can use ggthemse too
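For completeness, base R produces a standard set of diagnostic plots directly from the model object, no extra packages required:

```r
data("LifeCycleSavings")
lm2 <- lm(sr ~ pop15 + ddpi, data = LifeCycleSavings)
par(mfrow = c(2, 2))  # arrange the four diagnostics in a 2x2 grid
plot(lm2)             # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
```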