Tutorial: multiple linear regression with categorical variables
This tutorial details a multiple regression analysis based on the "Carseats" dataset (information about car seat sales in 400 stores). It follows roughly the same steps applied to this dataset in the book "An Introduction to Statistical Learning" (https://www.statlearning.com/), pages 119, 120, and 124.
First, creating the dataset.
We create the dataset with the help of the Downloads, CSV.jl, and DataFrames.jl packages.
using Downloads, DataFrames, CSV
df = DataFrame(CSV.File(Downloads.download("https://raw.githubusercontent.com/JWarmenhoven/ISLR-python/master/Notebooks/Data/Carseats.csv")))
describe(df)
| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | DataType |
| 1 | Column1 | 200.5 | 1 | 200.5 | 400 | 0 | Int64 |
| 2 | Sales | 7.49633 | 0.0 | 7.49 | 16.27 | 0 | Float64 |
| 3 | CompPrice | 124.975 | 77 | 125.0 | 175 | 0 | Int64 |
| 4 | Income | 68.6575 | 21 | 69.0 | 120 | 0 | Int64 |
| 5 | Advertising | 6.635 | 0 | 5.0 | 29 | 0 | Int64 |
| 6 | Population | 264.84 | 10 | 272.0 | 509 | 0 | Int64 |
| 7 | Price | 115.795 | 24 | 117.0 | 191 | 0 | Int64 |
| 8 | ShelveLoc | | Bad | | Medium | 0 | String7 |
| 9 | Age | 53.3225 | 25 | 54.5 | 80 | 0 | Int64 |
| 10 | Education | 13.9 | 10 | 14.0 | 18 | 0 | Int64 |
| 11 | Urban | | No | | Yes | 0 | String3 |
| 12 | US | | No | | Yes | 0 | String3 |
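The describe output shows that ShelveLoc, Urban, and US are stored as strings. Before fitting, it can help to confirm which levels each of them takes; a quick check (our addition, not part of the original walkthrough):
# List the distinct levels of each string-typed predictor.
for col in [:ShelveLoc, :Urban, :US]
    println(col, " => ", sort(unique(df[!, col])))
end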
Second, basic analysis.
We fit a model with all the variables and a couple of interactions. (In StatsModels formula syntax, `Income & Advertising` adds only the interaction term, while `Income * Advertising` would also expand to the main effects, which are already listed explicitly here.)
using LinearRegressionKit, StatsModels
lm = regress(@formula(Sales ~ CompPrice + Income + Advertising + Population + Price +
ShelveLoc + Age + Education + Urban + US + Income & Advertising + Price & Age), df)
Model definition: Sales ~ 1 + CompPrice + Income + Advertising + Population + Price + ShelveLoc + Age + Education + Urban + US + Income & Advertising + Price & Age
Used observations: 400
Model statistics:
R²: 0.876117 Adjusted R²: 0.871945
MSE: 1.02132 RMSE: 1.0106
σ̂²: 1.02132
F Value: 209.988 with degrees of freedom 13 and 386, Pr > F (p-value): 6.13897e-166
Confidence interval: 95%
Coefficients statistics:
Terms ╲ Stats │ Coefs Std err t Pr(>|t|) code low ci high ci
─────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────
(Intercept) │ 6.57557 1.00875 6.51855 2.22362e-10 *** 4.59224 8.55889
CompPrice │ 0.0929371 0.00411831 22.5668 1.64077e-72 *** 0.08484 0.101034
Income │ 0.010894 0.00260444 4.18284 3.56653e-5 *** 0.0057733 0.0160146
Advertising │ 0.0702462 0.0226091 3.10698 0.0020299 ** 0.0257938 0.114699
Population │ 0.000159245 0.000367858 0.432899 0.66533 -0.00056401 0.000882501
Price │ -0.100806 0.00743989 -13.5494 1.73829e-34 *** -0.115434 -0.0861786
ShelveLoc: Good │ 4.84868 0.152838 31.7243 1.38476e-109 *** 4.54818 5.14918
ShelveLoc: Medium │ 1.95326 0.125768 15.5307 1.33639e-42 *** 1.70599 2.20054
Age │ -0.0579466 0.0159506 -3.63288 0.000318136 *** -0.0893075 -0.0265857
Education │ -0.0208525 0.0196131 -1.06319 0.288361 -0.0594145 0.0177095
Urban: Yes │ 0.14016 0.112402 1.24695 0.213171 -0.0808369 0.361156
US: Yes │ -0.157557 0.148923 -1.05797 0.290729 -0.45036 0.135245
Income & Advertising │ 0.000751039 0.000278409 2.69761 0.00729023 ** 0.000203651 0.00129843
Price & Age │ 0.00010676 0.000133337 0.800676 0.423812 -0.000155398 0.000368918
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
To make the results easier to interpret, we set the base level of the shelving location (ShelveLoc) to "Medium" so that the coefficients highlight what happens when it is "Bad" or "Good". Furthermore, to form an idea of how collinear the predictors are, we request the variance inflation factors (VIF).
lm = regress(@formula(Sales ~ CompPrice + Income + Advertising + Population + Price +
ShelveLoc + Age + Education + Urban + US + Income & Advertising + Price & Age), df,
req_stats=["default", "vif"],
contrasts= Dict(:ShelveLoc => DummyCoding(base="Medium"), :Urban => DummyCoding(base="No"), :US => DummyCoding(base="No") ))
Model definition: Sales ~ 1 + CompPrice + Income + Advertising + Population + Price + ShelveLoc + Age + Education + Urban + US + Income & Advertising + Price & Age
Used observations: 400
Model statistics:
R²: 0.876117 Adjusted R²: 0.871945
MSE: 1.02132 RMSE: 1.0106
σ̂²: 1.02132
F Value: 209.988 with degrees of freedom 13 and 386, Pr > F (p-value): 6.13897e-166
Confidence interval: 95%
Coefficients statistics:
Terms ╲ Stats │ Coefs Std err t Pr(>|t|) code low ci high ci VIF
─────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────
(Intercept) │ 8.52883 0.995089 8.57092 2.50145e-16 *** 6.57235 10.4853 0.0
CompPrice │ 0.0929371 0.00411831 22.5668 1.64077e-72 *** 0.08484 0.101034 1.55808
Income │ 0.010894 0.00260444 4.18284 3.56653e-5 *** 0.0057733 0.0160146 2.0755
Advertising │ 0.0702462 0.0226091 3.10698 0.0020299 ** 0.0257938 0.114699 8.83223
Population │ 0.000159245 0.000367858 0.432899 0.66533 -0.00056401 0.000882501 1.14823
Price │ -0.100806 0.00743989 -13.5494 1.73829e-34 *** -0.115434 -0.0861786 12.1223
ShelveLoc: Bad │ -1.95326 0.125768 -15.5307 1.33639e-42 *** -2.20054 -1.70599 1.12997
ShelveLoc: Good │ 2.89541 0.12989 22.2913 2.41275e-71 *** 2.64003 3.15079 1.10575
Age │ -0.0579466 0.0159506 -3.63288 0.000318136 *** -0.0893075 -0.0265857 26.0862
Education │ -0.0208525 0.0196131 -1.06319 0.288361 -0.0594145 0.0177095 1.03201
Urban: Yes │ 0.14016 0.112402 1.24695 0.213171 -0.0808369 0.361156 1.0291
US: Yes │ -0.157557 0.148923 -1.05797 0.290729 -0.45036 0.135245 1.9889
Income & Advertising │ 0.000751039 0.000278409 2.69761 0.00729023 ** 0.000203651 0.00129843 8.84326
Price & Age │ 0.00010676 0.000133337 0.800676 0.423812 -0.000155398 0.000368918 33.4118
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
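As expected, the VIFs are high for Price, Age, and the two interaction terms: interactions built from uncentered variables are naturally correlated with their main effects. To see concretely what DummyCoding(base="Medium") does, we can inspect the contrasts matrix that StatsModels builds for the three ShelveLoc levels (a small illustration, not part of the original analysis):
using StatsModels
# Rows correspond to the levels Bad, Good, Medium; columns to the generated
# indicator terms "ShelveLoc: Bad" and "ShelveLoc: Good".
cm = StatsModels.ContrastsMatrix(DummyCoding(base="Medium"), ["Bad", "Good", "Medium"])
cm.matrix
The base level "Medium" is encoded as a row of zeros, so each ShelveLoc coefficient reads as the change in Sales relative to a medium shelf location.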
Now let's assume we want our response to be Sales and the predictors to be Price, Urban, and US:
lm = regress(@formula(Sales ~ Price + Urban + US), df,
contrasts= Dict(:ShelveLoc => DummyCoding(base="Medium"), :Urban => DummyCoding(base="No"), :US => DummyCoding(base="No") ))
Model definition: Sales ~ 1 + Price + Urban + US
Used observations: 400
Model statistics:
R²: 0.239275 Adjusted R²: 0.233512
MSE: 6.11322 RMSE: 2.47249
σ̂²: 6.11322
F Value: 41.5188 with degrees of freedom 3 and 396, Pr > F (p-value): 2.3852e-23
Confidence interval: 95%
Coefficients statistics:
Terms ╲ Stats │ Coefs Std err t Pr(>|t|) code low ci high ci
──────────────┼──────────────────────────────────────────────────────────────────────────────────────────
(Intercept) │ 13.0435 0.651012 20.0357 3.6266e-62 *** 11.7636 14.3233
Price │ -0.0544588 0.00524186 -10.3892 1.60992e-22 *** -0.0647642 -0.0441535
Urban: Yes │ -0.0219162 0.27165 -0.0806778 0.935739 -0.555973 0.512141
US: Yes │ 1.20057 0.259042 4.63467 4.86024e-6 *** 0.691304 1.70984
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Indeed, we note that "Urban: Yes" has low significance (p ≈ 0.94). Hence we can refit the model without this predictor:
lm = regress(@formula(Sales ~ Price + US), df,
contrasts= Dict(:ShelveLoc => DummyCoding(base="Medium"), :Urban => DummyCoding(base="No"), :US => DummyCoding(base="No") ))
Model definition: Sales ~ 1 + Price + US
Used observations: 400
Model statistics:
R²: 0.239263 Adjusted R²: 0.23543
MSE: 6.09792 RMSE: 2.4694
σ̂²: 6.09792
F Value: 62.4311 with degrees of freedom 2 and 397, Pr > F (p-value): 2.66115e-24
Confidence interval: 95%
Coefficients statistics:
Terms ╲ Stats │ Coefs Std err t Pr(>|t|) code low ci high ci
──────────────┼──────────────────────────────────────────────────────────────────────────────────────────
(Intercept) │ 13.0308 0.630976 20.6518 7.00138e-65 *** 11.7903 14.2713
Price │ -0.0544776 0.00523013 -10.4161 1.27216e-22 *** -0.0647598 -0.0441954
US: Yes │ 1.19964 0.258461 4.64148 4.70719e-6 *** 0.69152 1.70777
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Dropping Urban leaves the fit essentially unchanged while the adjusted R² improves slightly (0.23543 versus 0.233512). To identify potential outliers and high-leverage observations, we plot Cook's distance and the leverage.
using VegaLite
lm, ps = regress(@formula(Sales ~ Price + US), df, "all",
req_stats=["default", "vif"],
contrasts= Dict(:ShelveLoc => DummyCoding(base="Medium"), :Urban => DummyCoding(base="No"), :US => DummyCoding(base="No") ))
p = [ps["leverage"]      # stack the two VegaLite plots vertically
     ps["cooksd"]]
Alternatively, we can use the predicted values and their statistics to build a new data frame containing only the entries of interest (here we show only the first three rows and first three columns).
results = predict_in_sample(lm, df, req_stats="all")
# Rule of thumb: flag observations whose Cook's distance exceeds 4/n.
threshold_cooksd = 4 / lm.observations
potential_outliers = results[results.cooksd .> threshold_cooksd, :]
potential_outliers[1:3, 1:3]
| Row | Price | US | Sales |
|---|---|---|---|
| | Int64 | String3 | Float64 |
| 1 | 82 | No | 14.9 |
| 2 | 97 | Yes | 2.99 |
| 3 | 89 | No | 13.55 |
# Rule of thumb: flag observations whose leverage exceeds 2p/n,
# where p is the number of fitted parameters.
threshold_leverage = 2 * lm.p / lm.observations
potential_highleverage = results[abs.(results.leverage) .> threshold_leverage, :]
potential_highleverage[1:3, 1:3]
| Row | Price | US | Sales |
|---|---|---|---|
| | Int64 | String3 | Float64 |
| 1 | 24 | No | 10.43 |
| 2 | 49 | No | 9.34 |
| 3 | 69 | No | 7.71 |
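As a final sketch (our addition, reusing the thresholds defined above), we can intersect the two rules of thumb to see which observations are flagged both as potential outliers and as high-leverage points:
# Observations exceeding both the Cook's distance and the leverage thresholds.
flagged_both = results[(results.cooksd .> threshold_cooksd) .& (results.leverage .> threshold_leverage), :]
nrow(flagged_both)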