Tutorial: ridge regression

This tutorial gives a brief introduction to ridge regression. It uses the acetylene dataset from Marquardt, D. W., and Snee, R. D. (1975), “Ridge Regression in Practice,” The American Statistician 29:3–20, and follows the same outline as the SAS documentation.

First, create the dataset.

We create the dataset with the help of the DataFrames.jl package.

using DataFrames
x1 = [1300, 1300, 1300, 1300, 1300, 1300, 1200, 1200, 1200, 1200, 1200, 1200, 1100, 1100, 1100, 1100]
x2 = [7.5, 9.0, 11.0, 13.5, 17.0, 23.0, 5.3, 7.5, 11.0, 13.5, 17.0, 23.0, 5.3, 7.5, 11.0, 17.0]
x3 = [0.012, 0.012, 0.0115, 0.013, 0.0135, 0.012, 0.04, 0.038, 0.032, 0.026, 0.034, 0.041, 0.084, 0.098, 0.092, 0.086]
y = [49.0, 50.2, 50.5, 48.5, 47.5, 44.5, 28.0, 31.5, 34.5, 35.0, 38.0, 38.5, 15.0, 17.0, 20.5, 29.5]
df = DataFrame(x1= x1, x2= x2, x3= x3, y= y)
16×4 DataFrame
 Row │ x1     x2       x3       y
     │ Int64  Float64  Float64  Float64
─────┼──────────────────────────────────
   1 │  1300      7.5   0.012      49.0
   2 │  1300      9.0   0.012      50.2
   3 │  1300     11.0   0.0115     50.5
   4 │  1300     13.5   0.013      48.5
   5 │  1300     17.0   0.0135     47.5
   6 │  1300     23.0   0.012      44.5
   7 │  1200      5.3   0.04       28.0
   8 │  1200      7.5   0.038      31.5
   9 │  1200     11.0   0.032      34.5
  10 │  1200     13.5   0.026      35.0
  11 │  1200     17.0   0.034      38.0
  12 │  1200     23.0   0.041      38.5
  13 │  1100      5.3   0.084      15.0
  14 │  1100      7.5   0.098      17.0
  15 │  1100     11.0   0.092      20.5
  16 │  1100     17.0   0.086      29.5

Second, make a least squares regression

We make an ordinary least squares (OLS) regression for comparison.

using LinearRegressionKit, StatsModels
using VegaLite

f = @formula(y ~ x1 + x2 + x3 + x1 & x2 + x1^2)
lm, ps = regress(f, df, "all", req_stats=["default", "vif"])
lm
Model definition:	y ~ 1 + x1 + x2 + x3 + :(x1 ^ 2) + x1 & x2
Used observations:	16
Model statistics:
  R²: 0.993708			Adjusted R²: 0.990562
  MSE: 1.33625			RMSE: 1.15596
  σ̂²: 1.33625
  F Value: 315.86 with degrees of freedom 5 and 10, Pr > F (p-value): 1.14778e-10
Confidence interval: 95%

Coefficients statistics:
Terms ╲ Stats │       Coefs      Std err            t     Pr(>|t|)         code       low ci      high ci          VIF
──────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────
(Intercept)   │     390.538      211.523      1.84632    0.0946146           .      -80.7641      861.841          0.0
x1            │   -0.776761     0.324481     -2.39385    0.0377095           *      -1.49975   -0.0537716      7682.37
x2            │     10.1735     0.943009      10.7883   7.89586e-7         ***       8.07235      12.2747      320.022
x3            │    -121.626      69.0175     -1.76225     0.108507                  -275.407      32.1545      53.5246
x1 ^ 2        │ 0.000398306  0.000125283      3.17925   0.00983192          **   0.000119158  0.000677454      6643.32
x1 & x2       │  -0.0080548  0.000772088     -10.4325   1.07686e-6         ***   -0.00977512  -0.00633448      344.545

	Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

We observe that the VIFs for the coefficients are high, which indicates likely multicollinearity.
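
As a sanity check, a VIF can be recomputed by hand: the VIF of a term is 1 / (1 - R²), where R² comes from regressing that term on all the other terms of the model. The following sketch does this for x2 with plain linear algebra; the design matrix is built manually to mirror the formula above, so it is an illustration of the definition rather than LinearRegressionKit's internal code.

using Statistics

X = hcat(ones(length(y)), x1, x2, x3, x1 .^ 2, x1 .* x2)   # intercept plus the model terms
j = 3                                                      # column holding x2
others = X[:, setdiff(1:size(X, 2), [j])]
xj = X[:, j]
fitted = others * (others \ xj)                            # OLS fit of x2 on the other terms
r2 = 1 - sum(abs2, xj .- fitted) / sum(abs2, xj .- mean(xj))
1 / (1 - r2)                                               # ≈ 320, close to the VIF reported for x2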

Ridge regression

Ridge regression requires a parameter (k). While there are methods to identify k numerically, it is also possible to trace the coefficient and VIF values and let the analyst choose a k. Here we trace k between 0.0 and 0.1 in increments of 0.0005, and display only the results for the first five values of k.

rdf, ps = ridge(f, df, 0.0:0.0005:0.1, traceplots=true)
rdf[1:5 , :]
5×17 DataFrame
 Row │ k        MSE      RMSE     R2        ADJR2     (Intercept)  x1           x2       x3         x1 ^ 2       x1 & x2      vif (Intercept)  vif x1   vif x2   vif x3   vif x1 ^ 2  vif x1 & x2
     │ Float64  Float64  Float64  Float64   Float64   Float64      Float64      Float64  Float64    Float64      Float64      Float64          Float64  Float64  Float64  Float64     Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 0.0      1.33625  1.15596  0.993708  0.990562    390.538    -0.776761    10.1735  -121.626   0.000398306  -0.0080548               0.0  7682.37  320.022  53.5246     6643.32      344.545
   2 │ 0.0005   3.07744  1.75426  0.985509  0.978264    -90.3826   -0.00657257   7.63994    4.35999  8.80983e-5   -0.00598285              0.0  118.261  179.734  12.0955     104.157      193.496
   3 │ 0.001    4.67342  2.16181  0.977994  0.966991   -105.723     0.0353659    6.12837    0.960535 6.41509e-5   -0.00474541              0.0  35.5156  114.986  11.3358      32.4431     123.78
   4 │ 0.0015   6.08707  2.4672   0.971338  0.957006   -105.785     0.0477596    5.12211   -4.3978   5.41688e-5   -0.00392159              0.0  17.787    79.8763 10.9935      17.007       85.9757
   5 │ 0.002    7.27496  2.69721  0.965744  0.948616   -103.388     0.0529205    4.40401   -9.06516  4.844e-5     -0.00333371              0.0  11.1773   58.7312 10.7444      11.2186      63.2076
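
To make the role of k concrete: in the correlation-form ridge regression used by Marquardt and Snee, the standardized coefficients are b(k) = (R + kI)⁻¹ r, where R is the correlation matrix of the model terms and r the vector of their correlations with the response. The sketch below implements that textbook definition; LinearRegressionKit's internal scaling conventions may differ slightly, so the numbers need not match the table above exactly.

using LinearAlgebra, Statistics

Zraw = hcat(x1, x2, x3, x1 .^ 2, x1 .* x2)           # model terms without the intercept
R = cor(Zraw)                                        # correlation matrix of the terms
r = [cor(Zraw[:, j], y) for j in 1:size(Zraw, 2)]    # correlations of each term with y

k = 0.004
b_std = (R + k * I) \ r                              # standardized ridge coefficients at this k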

Here is the default trace plot for the coefficients:

ps["coefs traceplot"]
Note

It is a known limitation that the & in the variable names is replaced by the word "ampersand" in the plots; otherwise the & character would cause issues in the web or SVG display. This only affects the plots generated by the package. One can use the returned DataFrame directly to generate custom plots, as sketched below.
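
For example, a custom coefficient trace can be built directly from rdf with VegaLite (already loaded above). This is a minimal sketch that plots only the x2 and x3 traces and assumes the column names match the ones shown in the DataFrame above.

using DataFrames, VegaLite

long = stack(rdf, [:x2, :x3], :k; variable_name=:coefficient, value_name=:estimate)
long |> @vlplot(:line, x=:k, y=:estimate, color=:coefficient)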

And here is the default trace plot for the VIFs:

ps["vifs traceplot"]

As it is difficult to see the VIF traces, it is also possible to request a version of the plot with the y-axis on a log scale.

ps["vifs traceplot log"]

Once a k has been selected (in this case 0.004), a regular ridge regression with this value can be executed.

rlm = ridge(f, df, 0.004)
Ridge Regression
Constant k:	0.004
Model definition:	y ~ 1 + x1 + x2 + x3 + :(x1 ^ 2) + x1 & x2
Used observations:	16
Model statistics:
  R²: 0.951075			Adjusted R²: 0.926612
  MSE: 10.3903			RMSE: 3.2234

Coefficients statistics:
Terms ╲ Stats │       Coefs          VIF
──────────────┼─────────────────────────
(Intercept)   │     -93.797          0.0
x1            │   0.0578795      4.35597
x2            │     2.83932      23.9391
x3            │    -21.3375      9.99599
x1 ^ 2        │  3.82194e-5       5.1494
x1 & x2       │ -0.00205277      25.7445

From there, the regular predict_* functions can be used, although only the predicted values and, potentially, the residuals statistics will be calculated.

res = predict_in_sample(rlm, df)
16×6 DataFrame
 Row │ x1     x3       x2       y        predicted  residuals
     │ Int64  Float64  Float64  Float64  Float64    Float64
─────┼─────────────────────────────────────────────────────────
   1 │  1300   0.012       7.5     49.0    47.0616   1.93845
   2 │  1300   0.012       9.0     50.2    47.3176   2.88236
   3 │  1300   0.0115     11.0     50.5    47.6698   2.83024
   4 │  1300   0.013      13.5     48.5    48.0646   0.435441
   5 │  1300   0.0135     17.0     47.5    48.6514  -1.15142
   6 │  1300   0.012      23.0     44.5    49.7078  -5.20777
   7 │  1200   0.04        5.3     28.0    31.8337  -3.83367
   8 │  1200   0.038       7.5     31.5    32.7035  -1.20355
   9 │  1200   0.032      11.0     34.5    34.1476   0.352423
  10 │  1200   0.026      13.5     35.0    35.2156  -0.215604
  11 │  1200   0.034      17.0     38.0    36.3609   1.63909
  12 │  1200   0.041      23.0     38.5    38.4675   0.0324516
  13 │  1100   0.084       5.3     15.0    17.4044  -2.40437
  14 │  1100   0.098       7.5     17.0    18.3845  -1.38446
  15 │  1100   0.092      11.0     20.5    20.547   -0.0469559
  16 │  1100   0.086      17.0     29.5    24.1626   5.33735
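
As a quick check, the RMSE reported above for the ridge fit can be recomputed from these residuals. This sketch assumes the same n - p denominator as the OLS output, with p = 6 model terms including the intercept.

using DataFrames

sse = sum(abs2, res.residuals)
sqrt(sse / (nrow(df) - 6))    # ≈ 3.22, the RMSE reported for k = 0.004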