Regression is a powerful tool in statistical analysis. However, it is important to be able to interpret regression output correctly in order to make sound decisions going forward.
To delve deeper into this, we will look at the LifeCycleSavings dataset, which ships with R:
data("LifeCycleSavings")
head(LifeCycleSavings)
sr pop15 pop75 dpi ddpi
Australia 11.43 29.35 2.87 2329.68 2.87
Austria 12.07 23.32 4.41 1507.99 3.93
Belgium 13.17 23.80 4.43 2108.47 3.82
Bolivia 5.75 41.89 1.67 189.13 0.22
Brazil 12.88 42.19 0.83 728.47 4.56
Canada 8.79 31.72 2.85 2982.88 2.43
This is a data frame with 50 observations on 5 variables:
sr (numeric): aggregate personal savings
pop15 (numeric): % of population under 15
pop75 (numeric): % of population over 75
dpi (numeric): real per-capita disposable income
ddpi (numeric): % growth rate of dpi
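A quick way to confirm this structure from the console is str(), which lists each variable's type alongside its first few values:

```r
data("LifeCycleSavings")  # ships with base R's datasets package
str(LifeCycleSavings)     # 'data.frame': 50 obs. of 5 variables: sr, pop15, pop75, dpi, ddpi
```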
Under the life-cycle savings hypothesis as developed by Franco Modigliani, the savings ratio (aggregate personal saving divided by disposable income) is explained by per-capita disposable income, the percentage rate of change in per-capita disposable income, and two demographic variables: the percentage of population less than 15 years old and the percentage of the population over 75 years old. The data are averaged over the decade 1960–1970 to remove the business cycle or other short-term fluctuations.
In order to test this hypothesis, we fit a multiple linear regression:
lm1<-lm(
data = LifeCycleSavings,
formula = sr~pop15+ddpi+pop75+dpi
)
summary(lm1)
Call:
lm(formula = sr ~ pop15 + ddpi + pop75 + dpi, data = LifeCycleSavings)
Residuals:
Min 1Q Median 3Q Max
-8.2422 -2.6857 -0.2488 2.4280 9.7509
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.5660865 7.3545161 3.884 0.000334 ***
pop15 -0.4611931 0.1446422 -3.189 0.002603 **
ddpi 0.4096949 0.1961971 2.088 0.042471 *
pop75 -1.6914977 1.0835989 -1.561 0.125530
dpi -0.0003369 0.0009311 -0.362 0.719173
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.803 on 45 degrees of freedom
Multiple R-squared: 0.3385, Adjusted R-squared: 0.2797
F-statistic: 5.756 on 4 and 45 DF, p-value: 0.0007904
This table is the output of a linear regression model in R, which assesses the relationship between a dependent variable (in our case, sr) and several independent variables (pop15, ddpi, pop75, dpi). Here’s a breakdown of the key components:
Call: This shows the actual command used to fit the model, indicating the formula and the dataset used. In our case, it’s lm(formula = sr ~ pop15 + ddpi + pop75 + dpi, data = LifeCycleSavings).
Residuals: These are the differences between the observed values and the values predicted by the model. The summary statistics (Min, 1Q, Median, 3Q, Max) give you an idea of their distribution. Ideally, residuals should be roughly symmetric around zero with no obvious pattern.
Coefficients:
Estimate: The estimated effect of each independent variable on the dependent variable. For example, pop15 has an estimate of -0.4611931, suggesting that a one-unit increase in pop15 is associated with a 0.4611931-unit decrease in sr, holding the other variables constant.
Std. Error: Standard error of the estimate, indicating the variability of the estimate.
t value: The t-statistic, calculated as the Estimate divided by the Standard Error. It tests the null hypothesis that the coefficient is equal to zero.
Pr(>|t|): P-value associated with the t-statistic. A low p-value (< 0.05) suggests that the effect of the variable is statistically significant at the 5% level.
Significance Codes: Symbols next to the coefficients indicate their level of statistical significance. They are handy for a quick glance at a table, but focus on the actual p-values where you can.
Residual Standard Error: This is the estimate of the standard deviation of the residuals, giving an idea of the typical size of the prediction errors.
Degrees of Freedom: In our model, there are 45 degrees of freedom: the number of observations (50) minus the number of parameters estimated (four coefficients plus the intercept).
Multiple R-squared (Or just R-Squared): Indicates the proportion of variance in the dependent variable that can be explained by the independent variables. In this model, it’s 0.3385.
Adjusted R-squared: Similar to R-squared but adjusted for the number of predictors in the model, making it more reliable for comparing models with different numbers of independent variables. As you add more variables, regardless of their significance, R-squared will usually go up; Adjusted R-squared penalizes the model for the degrees of freedom it uses. In our model, it is 0.2797.
F-statistic and its p-value: Tests the overall significance of the model by comparing it to a model with no predictors (only an intercept). The low p-value here (0.0007904) suggests that our model provides a better fit than the intercept-only model. The degrees of freedom are 4 (the number of independent variables) and 45 (number of observations [50] minus number of independent variables [4] minus 1).
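As a sanity check, each of these summary statistics can be reproduced by hand from the fitted model; the sketch below uses only the lm1 object fitted above:

```r
data("LifeCycleSavings")
lm1 <- lm(sr ~ pop15 + ddpi + pop75 + dpi, data = LifeCycleSavings)

est <- coef(summary(lm1))            # columns: Estimate, Std. Error, t value, Pr(>|t|)
t_by_hand <- est[, "Estimate"] / est[, "Std. Error"]  # t value = Estimate / Std. Error

rss <- sum(residuals(lm1)^2)         # residual sum of squares
tss <- sum((LifeCycleSavings$sr - mean(LifeCycleSavings$sr))^2)
r2  <- 1 - rss / tss                 # Multiple R-squared: 0.3385
rse <- sqrt(rss / df.residual(lm1))  # Residual standard error: 3.803 on 45 df
f   <- (r2 / 4) / ((1 - r2) / 45)    # F-statistic: 5.756 on 4 and 45 DF
```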
ANOVA is another type of analysis we can use to improve our initial model.
Performing an ANOVA on the previous model gives the following:
anova(lm1)
Analysis of Variance Table
Response: sr
Df Sum Sq Mean Sq F value Pr(>F)
pop15 1 204.12 204.118 14.1157 0.0004922 ***
ddpi 1 78.96 78.959 5.4604 0.0239634 *
pop75 1 47.95 47.946 3.3157 0.0752748 .
dpi 1 1.89 1.893 0.1309 0.7191732
Residuals 45 650.71 14.460
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From this output we can see the following:
Response: This indicates the dependent variable in the model, which is sr in our case.
ANOVA Output Table: Each row provides the analysis for one of the independent variables in the model. The columns are: Df (degrees of freedom used by the term), Sum Sq (the sequential sum of squares attributed to the term), Mean Sq (Sum Sq divided by Df), F value (Mean Sq divided by the residual Mean Sq), and Pr(>F) (the p-value for that F test).
Significance Codes: These are the same symbols used in the regression output table.
Reading the table row by row: pop15 (p = 0.0005) and ddpi (p = 0.024) are significant at the 5% level, pop75 (p = 0.075) is only marginally significant, and dpi (p = 0.719) contributes essentially nothing once the other variables are in the model.
In summary, this ANOVA table helps us understand which variables in the regression model are significant predictors of the dependent variable sr. It provides a statistical test for each independent variable, indicating whether each one has a significant relationship with the dependent variable.
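One caveat worth knowing: anova() on a single model uses sequential (Type I) sums of squares, so the contribution credited to each variable depends on the order of the terms in the formula. A small sketch, refitting the same model with the terms reversed:

```r
data("LifeCycleSavings")
# Same four predictors, two different orders in the formula.
m_fwd <- lm(sr ~ pop15 + ddpi + pop75 + dpi, data = LifeCycleSavings)
m_rev <- lm(sr ~ dpi + pop75 + ddpi + pop15, data = LifeCycleSavings)
anova(m_fwd)  # dpi entered last: credited only with variance the others leave over
anova(m_rev)  # dpi entered first: its row changes, even though the fit is identical
# The residual row (and the overall fit) is the same either way.
```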
Adjusting the regression by removing dpi, we get the following new model:
lm_R<-lm(
data = LifeCycleSavings,
formula = sr~pop15+pop75+ddpi
)
summary(lm_R)
Call:
lm(formula = sr ~ pop15 + pop75 + ddpi, data = LifeCycleSavings)
Residuals:
Min 1Q Median 3Q Max
-8.2539 -2.6159 -0.3913 2.3344 9.7070
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.1247 7.1838 3.915 0.000297 ***
pop15 -0.4518 0.1409 -3.206 0.002452 **
pop75 -1.8354 0.9984 -1.838 0.072473 .
ddpi 0.4278 0.1879 2.277 0.027478 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.767 on 46 degrees of freedom
Multiple R-squared: 0.3365, Adjusted R-squared: 0.2933
F-statistic: 7.778 on 3 and 46 DF, p-value: 0.0002646
There might still be some further room for improvement in our analysis. Let's also remove pop75, and then use an ANOVA to compare our original regression against this new, smaller one:
lm2<-lm(
data = LifeCycleSavings,
formula = sr~pop15+ddpi
)
anova(lm2,lm1)
Analysis of Variance Table
Model 1: sr ~ pop15 + ddpi
Model 2: sr ~ pop15 + ddpi + pop75 + dpi
Res.Df RSS Df Sum of Sq F Pr(>F)
1 47 700.55
2 45 650.71 2 49.839 1.7233 0.19
Here we have a slightly different ANOVA output. It first states the two models being compared, Model 1 and Model 2, and then gives a comparison table: Res.Df and RSS are the residual degrees of freedom and residual sum of squares of each model, Df and Sum of Sq are the differences between the two, and F and Pr(>F) test whether the extra predictors reduce the residual sum of squares by more than chance alone would suggest.
To conclude, while Model 2 (with four predictors) explains more variance in sr than Model 1 (with two predictors), the difference is not statistically significant, suggesting that the addition of pop75 and dpi does not significantly improve the model’s fit.
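The F statistic in this comparison can be reproduced by hand from the two residual sums of squares:

```r
data("LifeCycleSavings")
lm1 <- lm(sr ~ pop15 + ddpi + pop75 + dpi, data = LifeCycleSavings)
lm2 <- lm(sr ~ pop15 + ddpi, data = LifeCycleSavings)

rss_small <- sum(residuals(lm2)^2)  # 700.55 on 47 residual df
rss_big   <- sum(residuals(lm1)^2)  # 650.71 on 45 residual df

# F = (drop in RSS per extra parameter) / (residual mean square of the larger model)
f <- ((rss_small - rss_big) / 2) / (rss_big / 45)   # 1.7233, as in the anova() output
p <- pf(f, 2, 45, lower.tail = FALSE)               # 0.19
```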
If you want to compare similar regression models alongside each other, you can use the stargazer package.
Here is an example comparing the models above. It displays the coefficient and standard error for each variable in each model:
library(stargazer)
stargazer(lm1, lm_R, lm2,
type = "html",
title = "Three Regression Models Predicting Variation in Savings Rate",
column.labels = c("Original Model", "Intermediate Model", "Final Model"),
colnames = FALSE,
model.numbers = FALSE,
notes.align = "l",
dep.var.labels = "Savings Rate"
)
                      Dependent variable: Savings Rate
                      Original Model          Intermediate Model      Final Model
pop15                 -0.461***               -0.452***               -0.216***
                      (0.145)                 (0.141)                 (0.060)
ddpi                  0.410**                 0.428**                 0.443**
                      (0.196)                 (0.188)                 (0.192)
pop75                 -1.691                  -1.835*
                      (1.084)                 (0.998)
dpi                   -0.0003
                      (0.001)
Constant              28.566***               28.125***               15.600***
                      (7.355)                 (7.184)                 (2.334)
Observations          50                      50                      50
R2                    0.338                   0.337                   0.288
Adjusted R2           0.280                   0.293                   0.257
Residual Std. Error   3.803 (df = 45)         3.767 (df = 46)         3.861 (df = 47)
F Statistic           5.756*** (df = 4; 45)   7.778*** (df = 3; 46)   9.496*** (df = 2; 47)
Note: *p<0.1; **p<0.05; ***p<0.01
Another thing you may want to do is produce a visual representation of your regression. The first plot we will look at is an Actual vs Fitted plot. You can create it with default base R plots, but you can make it more presentable using the ggplot2 and ggthemes packages:
library(ggplot2)
library(ggthemes)
ggplot(
LifeCycleSavings,
aes(
x = sr,
y = lm2$fitted.values
)
) +
geom_point() +
labs(
title = "Actual vs Fitted Plot",
caption = "Created for illustrative purposes",
tag = "Figure 1",
x = "Actual Values",
y = "Fitted Values"
) +
geom_abline(slope = 1, intercept = 0) + #geom_abline is used to draw y = x
theme_classic() #A preset theme; this one comes from ggplot2 itself, with ggthemes adding many more
You can also produce a Fitted vs Residuals plot too:
ggplot(
LifeCycleSavings,
aes(
x = lm2$fitted.values,
y = lm2$residuals
)
) +
geom_point() +
labs(
title = "Fitted vs Residuals Plot",
caption = "Created for illustrative purposes",
tag = "Figure 2",
x = "Fitted Values",
y = "Residuals"
) +
geom_hline(yintercept = 0) +
theme_economist() #Another theme provided by ggthemes
You can also create a QQplot as a ggplot object using the lindia package:
library(lindia)
gg_qqplot(lm2) +
theme_excel() #Because it is a ggplot object I can use ggthemse too
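For completeness, base R produces a standard set of diagnostic plots directly from the model object, no extra packages required:

```r
data("LifeCycleSavings")
lm2 <- lm(sr ~ pop15 + ddpi, data = LifeCycleSavings)
par(mfrow = c(2, 2))  # arrange the four diagnostics in a 2x2 grid
plot(lm2)             # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
```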