Tutorial: multiple linear regression with categorical variables

This tutorial details a multiple regression analysis based on the "Carseats" dataset (information about car seat sales in 400 stores). It follows roughly the same steps performed on this dataset in the "An Introduction to Statistical Learning" book (https://www.statlearning.com/), on pages 119, 120, and 124.

First, we create the dataset.

We build the dataset with the help of the Downloads, CSV.jl, and DataFrames.jl packages.

using Downloads, DataFrames, CSV
df = DataFrame(CSV.File(Downloads.download("https://raw.githubusercontent.com/JWarmenhoven/ISLR-python/master/Notebooks/Data/Carseats.csv")))
describe(df)
12×7 DataFrame
 Row │ variable     mean     min  median  max     nmissing  eltype
     │ Symbol       Union…   Any  Union…  Any     Int64     DataType
─────┼───────────────────────────────────────────────────────────────
   1 │ Column1      200.5    1    200.5   400     0         Int64
   2 │ Sales        7.49633  0.0  7.49    16.27   0         Float64
   3 │ CompPrice    124.975  77   125.0   175     0         Int64
   4 │ Income       68.6575  21   69.0    120     0         Int64
   5 │ Advertising  6.635    0    5.0     29      0         Int64
   6 │ Population   264.84   10   272.0   509     0         Int64
   7 │ Price        115.795  24   117.0   191     0         Int64
   8 │ ShelveLoc             Bad          Medium  0         String7
   9 │ Age          53.3225  25   54.5    80      0         Int64
  10 │ Education    13.9     10   14.0    18      0         Int64
  11 │ Urban                 No           Yes     0         String3
  12 │ US                    No           Yes     0         String3
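
Before modelling, we can take a quick look at the levels of the three qualitative predictors; a minimal check (the order returned simply reflects the order of the rows):

unique(df.ShelveLoc)   # "Bad", "Good", "Medium"
unique(df.Urban)       # "Yes", "No"
unique(df.US)          # "Yes", "No"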

Second, we run a basic analysis.

We fit a model with all variables and a couple of interactions.

using LinearRegressionKit, StatsModels
lm = regress(@formula(Sales ~ CompPrice + Income + Advertising + Population + Price +
            ShelveLoc + Age + Education + Urban + US + Income & Advertising + Price & Age), df)
Model definition:	Sales ~ 1 + CompPrice + Income + Advertising + Population + Price + ShelveLoc + Age + Education + Urban + US + Income & Advertising + Price & Age
Used observations:	400
Model statistics:
  R²: 0.876117			Adjusted R²: 0.871945
  MSE: 1.02132			RMSE: 1.0106
  σ̂²: 1.02132
  F Value: 209.988 with degrees of freedom 13 and 386, Pr > F (p-value): 6.13897e-166
Confidence interval: 95%

Coefficients statistics:
       Terms ╲ Stats │        Coefs       Std err             t      Pr(>|t|)          code        low ci       high ci
─────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────
(Intercept)          │      6.57557       1.00875       6.51855   2.22362e-10          ***        4.59224       8.55889
CompPrice            │    0.0929371    0.00411831       22.5668   1.64077e-72          ***        0.08484      0.101034
Income               │     0.010894    0.00260444       4.18284    3.56653e-5          ***      0.0057733     0.0160146
Advertising          │    0.0702462     0.0226091       3.10698     0.0020299           **      0.0257938      0.114699
Population           │  0.000159245   0.000367858      0.432899       0.66533                 -0.00056401   0.000882501
Price                │    -0.100806    0.00743989      -13.5494   1.73829e-34          ***      -0.115434    -0.0861786
ShelveLoc: Good      │      4.84868      0.152838       31.7243  1.38476e-109          ***        4.54818       5.14918
ShelveLoc: Medium    │      1.95326      0.125768       15.5307   1.33639e-42          ***        1.70599       2.20054
Age                  │   -0.0579466     0.0159506      -3.63288   0.000318136          ***     -0.0893075    -0.0265857
Education            │   -0.0208525     0.0196131      -1.06319      0.288361                  -0.0594145     0.0177095
Urban: Yes           │      0.14016      0.112402       1.24695      0.213171                  -0.0808369      0.361156
US: Yes              │    -0.157557      0.148923      -1.05797      0.290729                    -0.45036      0.135245
Income & Advertising │  0.000751039   0.000278409       2.69761    0.00729023           **    0.000203651    0.00129843
Price & Age          │   0.00010676   0.000133337      0.800676      0.423812                -0.000155398   0.000368918

	Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
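
Note that in StatsModels formulas the & operator denotes the interaction term alone, while * expands to the main effects plus their interaction; the two formulas below are therefore equivalent:

@formula(Sales ~ Income + Advertising + Income & Advertising)
@formula(Sales ~ Income * Advertising)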

For better interpretability, we set the base level of the shelving location (ShelveLoc) to "Medium", so that the results highlight what happens when it is "Bad" or "Good". Furthermore, to gauge how collinear the predictors are, we request the variance inflation factor (VIF).

lm = regress(@formula(Sales ~ CompPrice + Income + Advertising + Population + Price +
            ShelveLoc + Age + Education + Urban + US + Income & Advertising + Price & Age), df,
            req_stats=["default", "vif"],
            contrasts= Dict(:ShelveLoc => DummyCoding(base="Medium"), :Urban => DummyCoding(base="No"), :US => DummyCoding(base="No") ))
Model definition:	Sales ~ 1 + CompPrice + Income + Advertising + Population + Price + ShelveLoc + Age + Education + Urban + US + Income & Advertising + Price & Age
Used observations:	400
Model statistics:
  R²: 0.876117			Adjusted R²: 0.871945
  MSE: 1.02132			RMSE: 1.0106
  σ̂²: 1.02132
  F Value: 209.988 with degrees of freedom 13 and 386, Pr > F (p-value): 6.13897e-166
Confidence interval: 95%

Coefficients statistics:
       Terms ╲ Stats │        Coefs       Std err             t      Pr(>|t|)          code        low ci       high ci           VIF
─────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────
(Intercept)          │      8.52883      0.995089       8.57092   2.50145e-16          ***        6.57235       10.4853           0.0
CompPrice            │    0.0929371    0.00411831       22.5668   1.64077e-72          ***        0.08484      0.101034       1.55808
Income               │     0.010894    0.00260444       4.18284    3.56653e-5          ***      0.0057733     0.0160146        2.0755
Advertising          │    0.0702462     0.0226091       3.10698     0.0020299           **      0.0257938      0.114699       8.83223
Population           │  0.000159245   0.000367858      0.432899       0.66533                 -0.00056401   0.000882501       1.14823
Price                │    -0.100806    0.00743989      -13.5494   1.73829e-34          ***      -0.115434    -0.0861786       12.1223
ShelveLoc: Bad       │     -1.95326      0.125768      -15.5307   1.33639e-42          ***       -2.20054      -1.70599       1.12997
ShelveLoc: Good      │      2.89541       0.12989       22.2913   2.41275e-71          ***        2.64003       3.15079       1.10575
Age                  │   -0.0579466     0.0159506      -3.63288   0.000318136          ***     -0.0893075    -0.0265857       26.0862
Education            │   -0.0208525     0.0196131      -1.06319      0.288361                  -0.0594145     0.0177095       1.03201
Urban: Yes           │      0.14016      0.112402       1.24695      0.213171                  -0.0808369      0.361156        1.0291
US: Yes              │    -0.157557      0.148923      -1.05797      0.290729                    -0.45036      0.135245        1.9889
Income & Advertising │  0.000751039   0.000278409       2.69761    0.00729023           **    0.000203651    0.00129843       8.84326
Price & Age          │   0.00010676   0.000133337      0.800676      0.423812                -0.000155398   0.000368918       33.4118

	Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
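
As a reminder, the VIF of a predictor is obtained by regressing that predictor on all the others and computing VIF_j = 1 / (1 − R²_j). For instance, the VIF of about 12.1 for Price corresponds to an auxiliary R² of roughly 1 − 1/12.1 ≈ 0.92. The large VIFs for Age, Price, and Price & Age largely reflect the fact that an interaction term is naturally correlated with its components.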

Now let's assume we want our response to be Sales and the predictors to be Price, Urban, and US:

lm = regress(@formula(Sales ~ Price + Urban + US), df,
            contrasts= Dict(:ShelveLoc => DummyCoding(base="Medium"), :Urban => DummyCoding(base="No"), :US => DummyCoding(base="No") ))
Model definition:	Sales ~ 1 + Price + Urban + US
Used observations:	400
Model statistics:
  R²: 0.239275			Adjusted R²: 0.233512
  MSE: 6.11322			RMSE: 2.47249
  σ̂²: 6.11322
  F Value: 41.5188 with degrees of freedom 3 and 396, Pr > F (p-value): 2.3852e-23
Confidence interval: 95%

Coefficients statistics:
Terms ╲ Stats │       Coefs      Std err            t     Pr(>|t|)         code       low ci      high ci
──────────────┼──────────────────────────────────────────────────────────────────────────────────────────
(Intercept)   │     13.0435     0.651012      20.0357   3.6266e-62         ***       11.7636      14.3233
Price         │  -0.0544588   0.00524186     -10.3892  1.60992e-22         ***    -0.0647642   -0.0441535
Urban: Yes    │  -0.0219162      0.27165   -0.0806778     0.935739                 -0.555973     0.512141
US: Yes       │     1.20057     0.259042      4.63467   4.86024e-6         ***      0.691304      1.70984

	Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
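
Reading the coefficients off the table, the fitted model is approximately

Sales ≈ 13.0435 − 0.0545 × Price + 1.2006 × 1(US = Yes) − 0.0219 × 1(Urban = Yes)

so that, at a given price, stores in the US sell on average about 1,200 more car seats than stores elsewhere (Sales is recorded in thousands of units).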

Indeed, we note that "Urban: Yes" has low significance. Hence we may decide to refit the model without this predictor:

lm = regress(@formula(Sales ~ Price + US), df,
            contrasts= Dict(:ShelveLoc => DummyCoding(base="Medium"), :Urban => DummyCoding(base="No"), :US => DummyCoding(base="No") ))
Model definition:	Sales ~ 1 + Price + US
Used observations:	400
Model statistics:
  R²: 0.239263			Adjusted R²: 0.23543
  MSE: 6.09792			RMSE: 2.4694
  σ̂²: 6.09792
  F Value: 62.4311 with degrees of freedom 2 and 397, Pr > F (p-value): 2.66115e-24
Confidence interval: 95%

Coefficients statistics:
Terms ╲ Stats │       Coefs      Std err            t     Pr(>|t|)         code       low ci      high ci
──────────────┼──────────────────────────────────────────────────────────────────────────────────────────
(Intercept)   │     13.0308     0.630976      20.6518  7.00138e-65         ***       11.7903      14.2713
Price         │  -0.0544776   0.00523013     -10.4161  1.27216e-22         ***    -0.0647598   -0.0441954
US: Yes       │     1.19964     0.258461      4.64148   4.70719e-6         ***       0.69152      1.70777

	Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
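
The two nested models can also be compared with a partial F-test. A minimal sketch, reconstructing the residual sums of squares from the printed σ̂² values (σ̂² = RSS / dof); the resulting F is tiny and is roughly the square of the t statistic of "Urban: Yes" above, confirming that the predictor adds nothing:

rss_full    = 6.11322 * 396   # Sales ~ Price + Urban + US (396 dof)
rss_reduced = 6.09792 * 397   # Sales ~ Price + US         (397 dof)
F = (rss_reduced - rss_full) / 1 / 6.11322   # ≈ 0.006 ≈ (-0.0807)^2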

To identify potential outliers and high-leverage observations, we plot the Cook's distance and the leverage for each observation.

using VegaLite
lm, ps = regress(@formula(Sales ~ Price + US), df, "all",
            req_stats=["default", "vif"],
            contrasts= Dict(:ShelveLoc => DummyCoding(base="Medium"), :Urban => DummyCoding(base="No"), :US => DummyCoding(base="No") ))
p = [ps["leverage"]
    ps["cooksd"]]

Alternatively, we can use the predicted values and their statistics to build a new data frame containing only the entries of interest (here we show only the first three).

results = predict_in_sample(lm, df, req_stats="all")

threshold_cooksd = 4 / lm.observations
potential_outliers = results[ results.cooksd .> threshold_cooksd , :]
potential_outliers[1:3, 1:3]
3×3 DataFrame
 Row │ Price  US       Sales
     │ Int64  String3  Float64
─────┼─────────────────────────
   1 │    82  No         14.9
   2 │    97  Yes         2.99
   3 │    89  No         13.55
threshold_leverage = 2 * lm.p / lm.observations
potential_highleverage = results[ abs.(results.leverage) .> threshold_leverage , : ]
potential_highleverage[1:3, 1:3]
3×3 DataFrame
 Row │ Price  US       Sales
     │ Int64  String3  Float64
─────┼─────────────────────────
   1 │    24  No         10.43
   2 │    49  No          9.34
   3 │    69  No          7.71
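
Finally, the two criteria can be crossed: entries flagged as both potential outliers and high-leverage points deserve the closest scrutiny. A minimal sketch reusing the thresholds defined above:

suspects = results[(results.cooksd .> threshold_cooksd) .&
                   (abs.(results.leverage) .> threshold_leverage), :]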