Tutorial: ridge regression

This tutorial gives a brief introduction to ridge regression. It makes use of the acetylene dataset from Marquardt, D. W., and Snee, R. D. (1975), “Ridge Regression in Practice,” The American Statistician 29:3–20, and follows the same outline as the SAS documentation.

First, create the dataset.

We create the dataset with the help of the DataFrames.jl package.

using DataFrames
x1 = [1300, 1300, 1300, 1300, 1300, 1300, 1200, 1200, 1200, 1200, 1200, 1200, 1100, 1100, 1100, 1100]
x2 = [7.5, 9.0, 11.0, 13.5, 17.0, 23.0, 5.3, 7.5, 11.0, 13.5, 17.0, 23.0, 5.3, 7.5, 11.0, 17.0]
x3 = [0.012, 0.012, 0.0115, 0.013, 0.0135, 0.012, 0.04, 0.038, 0.032, 0.026, 0.034, 0.041, 0.084, 0.098, 0.092, 0.086]
y = [49.0, 50.2, 50.5, 48.5, 47.5, 44.5, 28.0, 31.5, 34.5, 35.0, 38.0, 38.5, 15.0, 17.0, 20.5, 29.5]
df = DataFrame(x1=x1, x2=x2, x3=x3, y=y)
16×4 DataFrame
 Row │ x1     x2       x3       y
     │ Int64  Float64  Float64  Float64
─────┼──────────────────────────────────
   1 │  1300      7.5   0.012      49.0
   2 │  1300      9.0   0.012      50.2
   3 │  1300     11.0   0.0115     50.5
   4 │  1300     13.5   0.013      48.5
   5 │  1300     17.0   0.0135     47.5
   6 │  1300     23.0   0.012      44.5
   7 │  1200      5.3   0.04       28.0
   8 │  1200      7.5   0.038      31.5
   9 │  1200     11.0   0.032      34.5
  10 │  1200     13.5   0.026      35.0
  11 │  1200     17.0   0.034      38.0
  12 │  1200     23.0   0.041      38.5
  13 │  1100      5.3   0.084      15.0
  14 │  1100      7.5   0.098      17.0
  15 │  1100     11.0   0.092      20.5
  16 │  1100     17.0   0.086      29.5

Second, fit an ordinary least squares regression

We fit an ordinary least squares (OLS) regression for comparison.

using LinearRegressionKit, StatsModels
using VegaLite

f = @formula(y ~ x1 + x2 + x3 + x1 & x2 + x1^2)
lm, ps = regress(f, df, "all", req_stats=["default", "vif"])
lm
Model definition:	y ~ 1 + x1 + x2 + x3 + :(x1 ^ 2) + x1 & x2
Used observations:	16
Model statistics:
  R²: 0.993708			Adjusted R²: 0.990562
  MSE: 1.33625			RMSE: 1.15596
  σ̂²: 1.33625
  F Value: 315.86 with degrees of freedom 5 and 10, Pr > F (p-value): 1.14778e-10
Confidence interval: 95%

Coefficients statistics:
Terms ╲ Stats │       Coefs      Std err            t     Pr(>|t|)         code       low ci      high ci          VIF
──────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────
(Intercept)   │     390.538      211.523      1.84632    0.0946146           .      -80.7641      861.841          0.0
x1            │   -0.776761     0.324481     -2.39385    0.0377095           *      -1.49975   -0.0537716      7682.37
x2            │     10.1735     0.943009      10.7883   7.89586e-7         ***       8.07235      12.2747      320.022
x3            │    -121.626      69.0175     -1.76225     0.108507                  -275.407      32.1545      53.5246
x1 ^ 2        │ 0.000398306  0.000125283      3.17925   0.00983192          **   0.000119158  0.000677454      6643.32
x1 & x2       │  -0.0080548  0.000772088     -10.4325   1.07686e-6         ***   -0.00977512  -0.00633448      344.545

	Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

We observe that the VIFs for the coefficients are high, indicating likely multicollinearity.
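As a cross-check, the VIFs can be computed by hand: the VIF of the j-th predictor is the j-th diagonal element of the inverse of the predictors' correlation matrix. The following sketch reuses the x1, x2, x3, and y vectors defined above and builds the interaction and squared columns explicitly (the column order here is an assumption matching the model formula); it should reproduce the VIF column of the table above.

```julia
using Statistics, LinearAlgebra

# Predictor columns of the model y ~ x1 + x2 + x3 + x1 & x2 + x1^2,
# built by hand from the vectors defined earlier in this tutorial.
X = hcat(x1, x2, x3, x1 .^ 2, x1 .* x2)
R = cor(X)            # correlation matrix of the predictors
vifs = diag(inv(R))   # VIF_j = j-th diagonal element of R⁻¹
```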

Ridge regression

Ridge regression requires a shrinkage parameter (k). While there are methods to identify k numerically, it is also possible to trace the coefficients and the VIF values over a range of k and let the analyst choose. Here we trace k from 0.0 to 0.1 in increments of 0.0005, and display only the results for the first five values of k.

rdf, ps = ridge(f, df, 0.0:0.0005:0.1, traceplots=true)
rdf[1:5, :]
5×17 DataFrame
 Row │ k       MSE      RMSE     R2        ADJR2     (Intercept)  x1           x2       x3         x1 ^ 2       x1 & x2      vif (Intercept)  vif x1   vif x2   vif x3   vif x1 ^ 2  vif x1 & x2
     │ Float64 Float64  Float64  Float64   Float64   Float64      Float64      Float64  Float64    Float64      Float64      Float64          Float64  Float64  Float64  Float64     Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 0.0     1.33625  1.15596  0.993708  0.990562   390.538     -0.776761    10.1735  -121.626   0.000398306  -0.0080548   0.0              7682.37  320.022  53.5246  6643.32     344.545
   2 │ 0.0005  3.07744  1.75426  0.985509  0.978264   -90.3826    -0.00657257   7.63994    4.35999 8.80983e-5   -0.00598285  0.0               118.261 179.734  12.0955   104.157    193.496
   3 │ 0.001   4.67342  2.16181  0.977994  0.966991  -105.723      0.0353659    6.12837    0.960535 6.41509e-5  -0.00474541  0.0               35.5156 114.986  11.3358    32.4431   123.78
   4 │ 0.0015  6.08707  2.4672   0.971338  0.957006  -105.785      0.0477596    5.12211   -4.3978   5.41688e-5  -0.00392159  0.0               17.787   79.8763 10.9935    17.007     85.9757
   5 │ 0.002   7.27496  2.69721  0.965744  0.948616  -103.388      0.0529205    4.40401   -9.06516  4.844e-5    -0.00333371  0.0               11.1773  58.7312 10.7444    11.218     63.2076
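The shrinkage visible in the trace can be reproduced with the textbook closed-form ridge estimator on standardized data, β̂(k) = (ZᵀZ + kI)⁻¹Zᵀy. The sketch below reuses the data vectors from the start of the tutorial; it illustrates the idea only, and LinearRegressionKit's exact standardization and back-transformation conventions may differ.

```julia
using Statistics, LinearAlgebra

# Build the model's predictor columns and standardize them so that
# Z'Z equals the correlation matrix (a common ridge convention).
X = hcat(x1, x2, x3, x1 .^ 2, x1 .* x2)
Z = (X .- mean(X, dims=1)) ./ std(X, dims=1)
Z ./= sqrt(size(Z, 1) - 1)                # now Z'Z == cor(X)
yc = y .- mean(y)                         # centered response

# Closed-form ridge estimator; coefficients shrink toward 0 as k grows.
ridge_coefs(k) = (Z'Z + k * I) \ (Z' * yc)
```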

Here is the default trace plot for the coefficients:

ps["coefs traceplot"]
Note

In the plots, the & in the variable names is replaced by the word "ampersand", because the & character would otherwise cause issues in either the web display or the SVG display. This only affects the plots generated by the library; one can use the returned DataFrame directly to generate custom plots.
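For instance, a custom trace for a single coefficient can be built directly from rdf with the VegaLite.jl @vlplot macro. This is a sketch; the column names are the ones shown in the DataFrame above.

```julia
using VegaLite

# Trace the x2 coefficient against k, straight from the ridge() DataFrame.
rdf |> @vlplot(:line, x=:k, y=:x2, title="Trace of the x2 coefficient")
```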

And here is the default trace plot for the VIFs:

ps["vifs traceplot"]

As it is difficult to distinguish the VIF traces, it is also possible to request a version of the plot with a log-scaled y-axis.

ps["vifs traceplot log"]

Once a k has been selected (in this case 0.004), a regular ridge regression with this value can be executed.

rlm = ridge(f, df, 0.004)
Ridge Regression
Constant k:	0.004
Model definition:	y ~ 1 + x1 + x2 + x3 + :(x1 ^ 2) + x1 & x2
Used observations:	16
Model statistics:
  R²: 0.951075			Adjusted R²: 0.926612
  MSE: 10.3903			RMSE: 3.2234

Coefficients statistics:
Terms ╲ Stats │       Coefs          VIF
──────────────┼─────────────────────────
(Intercept)   │     -93.797          0.0
x1            │   0.0578795      4.35597
x2            │     2.83932      23.9391
x3            │    -21.3375      9.99599
x1 ^ 2        │  3.82194e-5       5.1494
x1 & x2       │ -0.00205277      25.7445
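The ridge VIFs shown here follow the usual definition (Marquardt, 1970): the diagonal of (R + kI)⁻¹ R (R + kI)⁻¹, where R is the predictors' correlation matrix. The sketch below, which reuses the data vectors from the start of the tutorial, illustrates that definition; whether it matches the package's internal convention exactly is an assumption.

```julia
using Statistics, LinearAlgebra

# Ridge VIFs: diag((R + kI)⁻¹ R (R + kI)⁻¹) with R = cor of the predictors.
X = hcat(x1, x2, x3, x1 .^ 2, x1 .* x2)
R = cor(X)
k = 0.004
W = inv(R + k * I)
ridge_vifs = diag(W * R * W)   # each entry is below the OLS VIF at k = 0
```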

From there, the regular predict_* functions can be used, although only the predicted values and, potentially, the residual statistics will be calculated.

res = predict_in_sample(rlm, df)
16×6 DataFrame
 Row │ y        x2       x1     x3       predicted  residuals
     │ Float64  Float64  Int64  Float64  Float64    Float64
─────┼─────────────────────────────────────────────────────────
   1 │    49.0      7.5   1300   0.012     47.0616    1.93845
   2 │    50.2      9.0   1300   0.012     47.3176    2.88236
   3 │    50.5     11.0   1300   0.0115    47.6698    2.83024
   4 │    48.5     13.5   1300   0.013     48.0646    0.435441
   5 │    47.5     17.0   1300   0.0135    48.6514   -1.15142
   6 │    44.5     23.0   1300   0.012     49.7078   -5.20777
   7 │    28.0      5.3   1200   0.04      31.8337   -3.83367
   8 │    31.5      7.5   1200   0.038     32.7035   -1.20355
   9 │    34.5     11.0   1200   0.032     34.1476    0.352423
  10 │    35.0     13.5   1200   0.026     35.2156   -0.215604
  11 │    38.0     17.0   1200   0.034     36.3609    1.63909
  12 │    38.5     23.0   1200   0.041     38.4675    0.0324516
  13 │    15.0      5.3   1100   0.084     17.4044   -2.40437
  14 │    17.0      7.5   1100   0.098     18.3845   -1.38446
  15 │    20.5     11.0   1100   0.092     20.547    -0.0469559
  16 │    29.5     17.0   1100   0.086     24.1626    5.33735
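The residuals column can be used to cross-check the ridge MSE reported above, assuming the usual n − p divisor (16 observations, 6 coefficients); the divisor convention is an assumption, though it agrees with the reported value here.

```julia
using DataFrames

# Recompute the ridge MSE from the in-sample residuals.
sse = sum(res.residuals .^ 2)
mse = sse / (nrow(res) - 6)   # ≈ 10.39, matching the model statistics above
```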