Tutorial: multiple linear regression with categorical variables
This tutorial details a multiple regression analysis based on the "Carseats" dataset (information about car seat sales in 400 stores). It follows roughly the same steps applied to this dataset in the book "An Introduction to Statistical Learning" (https://www.statlearning.com/), pages 119, 120, and 124.
First, creating the dataset.
We create the dataset with the help of the Downloads, CSV.jl, and DataFrames.jl packages.
using Downloads, DataFrames, CSV
df = DataFrame(CSV.File(Downloads.download("https://raw.githubusercontent.com/JWarmenhoven/ISLR-python/master/Notebooks/Data/Carseats.csv")))
describe(df)
| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | DataType |
| 1 | Column1 | 200.5 | 1 | 200.5 | 400 | 0 | Int64 |
| 2 | Sales | 7.49633 | 0.0 | 7.49 | 16.27 | 0 | Float64 |
| 3 | CompPrice | 124.975 | 77 | 125.0 | 175 | 0 | Int64 |
| 4 | Income | 68.6575 | 21 | 69.0 | 120 | 0 | Int64 |
| 5 | Advertising | 6.635 | 0 | 5.0 | 29 | 0 | Int64 |
| 6 | Population | 264.84 | 10 | 272.0 | 509 | 0 | Int64 |
| 7 | Price | 115.795 | 24 | 117.0 | 191 | 0 | Int64 |
| 8 | ShelveLoc | | Bad | | Medium | 0 | String7 |
| 9 | Age | 53.3225 | 25 | 54.5 | 80 | 0 | Int64 |
| 10 | Education | 13.9 | 10 | 14.0 | 18 | 0 | Int64 |
| 11 | Urban | | No | | Yes | 0 | String3 |
| 12 | US | | No | | Yes | 0 | String3 |
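The describe output shows that ShelveLoc, Urban, and US are stored as strings. Before fitting, it can help to confirm which levels each of them takes; a quick check (our addition, not part of the original walkthrough):
# List the distinct levels of each string-typed predictor.
for col in [:ShelveLoc, :Urban, :US]
    println(col, " => ", sort(unique(df[!, col])))
end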
Second, basic analysis.
We fit a model with all the variables and a couple of interactions. (In StatsModels formula syntax, `Income & Advertising` adds only the interaction term, while `Income * Advertising` would also expand to the main effects, which are already listed explicitly here.)
using LinearRegressionKit, StatsModels
lm = regress(@formula(Sales ~ CompPrice + Income + Advertising + Population + Price +
ShelveLoc + Age + Education + Urban + US + Income & Advertising + Price & Age), df)
Model definition: Sales ~ 1 + CompPrice + Income + Advertising + Population + Price + ShelveLoc + Age + Education + Urban + US + Income & Advertising + Price & Age
Used observations: 400
Model statistics:
R²: 0.876117 Adjusted R²: 0.871945
MSE: 1.02132 RMSE: 1.0106
σ̂²: 1.02132
F Value: 209.988 with degrees of freedom 13 and 386, Pr > F (p-value): 6.13897e-166
Confidence interval: 95%
Coefficients statistics:
Terms ╲ Stats │ Coefs Std err t Pr(>|t|) code low ci high ci
─────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────
(Intercept) │ 6.57557 1.00875 6.51855 2.22362e-10 *** 4.59224 8.55889
CompPrice │ 0.0929371 0.00411831 22.5668 1.64077e-72 *** 0.08484 0.101034
Income │ 0.010894 0.00260444 4.18284 3.56653e-5 *** 0.0057733 0.0160146
Advertising │ 0.0702462 0.0226091 3.10698 0.0020299 ** 0.0257938 0.114699
Population │ 0.000159245 0.000367858 0.432899 0.66533 -0.00056401 0.000882501
Price │ -0.100806 0.00743989 -13.5494 1.73829e-34 *** -0.115434 -0.0861786
ShelveLoc: Good │ 4.84868 0.152838 31.7243 1.38476e-109 *** 4.54818 5.14918
ShelveLoc: Medium │ 1.95326 0.125768 15.5307 1.33639e-42 *** 1.70599 2.20054
Age │ -0.0579466 0.0159506 -3.63288 0.000318136 *** -0.0893075 -0.0265857
Education │ -0.0208525 0.0196131 -1.06319 0.288361 -0.0594145 0.0177095
Urban: Yes │ 0.14016 0.112402 1.24695 0.213171 -0.0808369 0.361156
US: Yes │ -0.157557 0.148923 -1.05797 0.290729 -0.45036 0.135245
Income & Advertising │ 0.000751039 0.000278409 2.69761 0.00729023 ** 0.000203651 0.00129843
Price & Age │ 0.00010676 0.000133337 0.800676 0.423812 -0.000155398 0.000368918
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
To make the results easier to interpret, we set the base level of the shelving location (ShelveLoc) to "Medium" so that the coefficients highlight what happens when it is "Bad" or "Good". Furthermore, to form an idea of how collinear the predictors are, we request the variance inflation factors (VIF).
lm = regress(@formula(Sales ~ CompPrice + Income + Advertising + Population + Price +
ShelveLoc + Age + Education + Urban + US + Income & Advertising + Price & Age), df,
req_stats=["default", "vif"],
contrasts= Dict(:ShelveLoc => DummyCoding(base="Medium"), :Urban => DummyCoding(base="No"), :US => DummyCoding(base="No") ))
Model definition: Sales ~ 1 + CompPrice + Income + Advertising + Population + Price + ShelveLoc + Age + Education + Urban + US + Income & Advertising + Price & Age
Used observations: 400
Model statistics:
R²: 0.876117 Adjusted R²: 0.871945
MSE: 1.02132 RMSE: 1.0106
σ̂²: 1.02132
F Value: 209.988 with degrees of freedom 13 and 386, Pr > F (p-value): 6.13897e-166
Confidence interval: 95%
Coefficients statistics:
Terms ╲ Stats │ Coefs Std err t Pr(>|t|) code low ci high ci VIF
─────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────
(Intercept) │ 8.52883 0.995089 8.57092 2.50145e-16 *** 6.57235 10.4853 0.0
CompPrice │ 0.0929371 0.00411831 22.5668 1.64077e-72 *** 0.08484 0.101034 1.55808
Income │ 0.010894 0.00260444 4.18284 3.56653e-5 *** 0.0057733 0.0160146 2.0755
Advertising │ 0.0702462 0.0226091 3.10698 0.0020299 ** 0.0257938 0.114699 8.83223
Population │ 0.000159245 0.000367858 0.432899 0.66533 -0.00056401 0.000882501 1.14823
Price │ -0.100806 0.00743989 -13.5494 1.73829e-34 *** -0.115434 -0.0861786 12.1223
ShelveLoc: Bad │ -1.95326 0.125768 -15.5307 1.33639e-42 *** -2.20054 -1.70599 1.12997
ShelveLoc: Good │ 2.89541 0.12989 22.2913 2.41275e-71 *** 2.64003 3.15079 1.10575
Age │ -0.0579466 0.0159506 -3.63288 0.000318136 *** -0.0893075 -0.0265857 26.0862
Education │ -0.0208525 0.0196131 -1.06319 0.288361 -0.0594145 0.0177095 1.03201
Urban: Yes │ 0.14016 0.112402 1.24695 0.213171 -0.0808369 0.361156 1.0291
US: Yes │ -0.157557 0.148923 -1.05797 0.290729 -0.45036 0.135245 1.9889
Income & Advertising │ 0.000751039 0.000278409 2.69761 0.00729023 ** 0.000203651 0.00129843 8.84326
Price & Age │ 0.00010676 0.000133337 0.800676 0.423812 -0.000155398 0.000368918 33.4118
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
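As expected, the VIFs are high for Price, Age, and the two interaction terms: interactions built from uncentered variables are naturally correlated with their main effects. To see concretely what DummyCoding(base="Medium") does, we can inspect the contrasts matrix that StatsModels builds for the three ShelveLoc levels (a small illustration, not part of the original analysis):
using StatsModels
# Rows correspond to the levels Bad, Good, Medium; columns to the generated
# indicator terms "ShelveLoc: Bad" and "ShelveLoc: Good".
cm = StatsModels.ContrastsMatrix(DummyCoding(base="Medium"), ["Bad", "Good", "Medium"])
cm.matrix
The base level "Medium" is encoded as a row of zeros, so each ShelveLoc coefficient reads as the change in Sales relative to a medium shelf location.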
Now let's assume we want our response to be Sales and the predictors to be Price, Urban, and US:
lm = regress(@formula(Sales ~ Price + Urban + US), df,
contrasts= Dict(:ShelveLoc => DummyCoding(base="Medium"), :Urban => DummyCoding(base="No"), :US => DummyCoding(base="No") ))
Model definition: Sales ~ 1 + Price + Urban + US
Used observations: 400
Model statistics:
R²: 0.239275 Adjusted R²: 0.233512
MSE: 6.11322 RMSE: 2.47249
σ̂²: 6.11322
F Value: 41.5188 with degrees of freedom 3 and 396, Pr > F (p-value): 2.3852e-23
Confidence interval: 95%
Coefficients statistics:
Terms ╲ Stats │ Coefs Std err t Pr(>|t|) code low ci high ci
──────────────┼──────────────────────────────────────────────────────────────────────────────────────────
(Intercept) │ 13.0435 0.651012 20.0357 3.6266e-62 *** 11.7636 14.3233
Price │ -0.0544588 0.00524186 -10.3892 1.60992e-22 *** -0.0647642 -0.0441535
Urban: Yes │ -0.0219162 0.27165 -0.0806778 0.935739 -0.555973 0.512141
US: Yes │ 1.20057 0.259042 4.63467 4.86024e-6 *** 0.691304 1.70984
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Indeed, we note that "Urban: Yes" has low significance (p ≈ 0.94). Hence we can refit the model without this predictor:
lm = regress(@formula(Sales ~ Price + US), df,
contrasts= Dict(:ShelveLoc => DummyCoding(base="Medium"), :Urban => DummyCoding(base="No"), :US => DummyCoding(base="No") ))
Model definition: Sales ~ 1 + Price + US
Used observations: 400
Model statistics:
R²: 0.239263 Adjusted R²: 0.23543
MSE: 6.09792 RMSE: 2.4694
σ̂²: 6.09792
F Value: 62.4311 with degrees of freedom 2 and 397, Pr > F (p-value): 2.66115e-24
Confidence interval: 95%
Coefficients statistics:
Terms ╲ Stats │ Coefs Std err t Pr(>|t|) code low ci high ci
──────────────┼──────────────────────────────────────────────────────────────────────────────────────────
(Intercept) │ 13.0308 0.630976 20.6518 7.00138e-65 *** 11.7903 14.2713
Price │ -0.0544776 0.00523013 -10.4161 1.27216e-22 *** -0.0647598 -0.0441954
US: Yes │ 1.19964 0.258461 4.64148 4.70719e-6 *** 0.69152 1.70777
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Dropping Urban leaves the fit essentially unchanged while the adjusted R² improves slightly (0.23543 versus 0.233512). To identify potential outliers and high-leverage observations, we plot Cook's distance and the leverage.
using VegaLite
lm, ps = regress(@formula(Sales ~ Price + US), df, "all",
req_stats=["default", "vif"],
contrasts= Dict(:ShelveLoc => DummyCoding(base="Medium"), :Urban => DummyCoding(base="No"), :US => DummyCoding(base="No") ))
p = [ps["leverage"]      # stack the two VegaLite plots vertically
     ps["cooksd"]]
Alternatively, we can use the predicted values and their statistics to build a new data frame containing only the entries of interest (here we show only the first three rows and first three columns).
results = predict_in_sample(lm, df, req_stats="all")
# Rule of thumb: flag observations whose Cook's distance exceeds 4/n.
threshold_cooksd = 4 / lm.observations
potential_outliers = results[results.cooksd .> threshold_cooksd, :]
potential_outliers[1:3, 1:3]
| Row | Price | US | Sales |
|---|---|---|---|
| | Int64 | String3 | Float64 |
| 1 | 82 | No | 14.9 |
| 2 | 97 | Yes | 2.99 |
| 3 | 89 | No | 13.55 |
# Rule of thumb: flag observations whose leverage exceeds 2p/n,
# where p is the number of fitted parameters.
threshold_leverage = 2 * lm.p / lm.observations
potential_highleverage = results[abs.(results.leverage) .> threshold_leverage, :]
potential_highleverage[1:3, 1:3]
| Row | Price | US | Sales |
|---|---|---|---|
| | Int64 | String3 | Float64 |
| 1 | 24 | No | 10.43 |
| 2 | 49 | No | 9.34 |
| 3 | 69 | No | 7.71 |
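As a final sketch (our addition, reusing the thresholds defined above), we can intersect the two rules of thumb to see which observations are flagged both as potential outliers and as high-leverage points:
# Observations exceeding both the Cook's distance and the leverage thresholds.
flagged_both = results[(results.cooksd .> threshold_cooksd) .& (results.leverage .> threshold_leverage), :]
nrow(flagged_both)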