Tutorial: weighted regression
This tutorial gives a brief introduction to simple weighted regression using analytical weights. It uses the short dataset available in this SAS blog post.
First, create the dataset
We create the dataset with the help of the DataFrames.jl package.
using DataFrames
tw = [
2.3 7.4 0.058
3.0 7.6 0.073
2.9 8.2 0.114
4.8 9.0 0.144
1.3 10.4 0.151
3.6 11.7 0.119
2.3 11.7 0.119
4.6 11.8 0.114
3.0 12.4 0.073
5.4 12.9 0.035
6.4 14.0 0
]
df = DataFrame(tw, [:y,:x,:w])
| Row | y       | x       | w       |
|-----|---------|---------|---------|
|     | Float64 | Float64 | Float64 |
| 1   | 2.3     | 7.4     | 0.058   |
| 2   | 3.0     | 7.6     | 0.073   |
| 3   | 2.9     | 8.2     | 0.114   |
| 4   | 4.8     | 9.0     | 0.144   |
| 5   | 1.3     | 10.4    | 0.151   |
| 6   | 3.6     | 11.7    | 0.119   |
| 7   | 2.3     | 11.7    | 0.119   |
| 8   | 4.6     | 11.8    | 0.114   |
| 9   | 3.0     | 12.4    | 0.073   |
| 10  | 5.4     | 12.9    | 0.035   |
| 11  | 6.4     | 14.0    | 0.0     |
Second, run a basic analysis
We start by fitting a simple (unweighted) linear regression.
using LinearRegressionKit, StatsModels
using VegaLite
f = @formula(y ~ x)
lms, pss = regress(f, df, "fit")
lms
Model definition: y ~ 1 + x
Used observations: 11
Model statistics:
R²: 0.284772 Adjusted R²: 0.205303
MSE: 1.85959 RMSE: 1.36367
σ̂²: 1.85959
F Value: 3.58341 with degrees of freedom 1 and 9, Pr > F (p-value): 0.0909011
Confidence interval: 95%
Coefficients statistics:
Terms ╲ Stats │ Coefs Std err t Pr(>|t|) code low ci high ci
──────────────┼───────────────────────────────────────────────────────────────────────────────────
(Intercept) │ -0.228269 2.06371 -0.110611 0.914352 -4.89672 4.44018
x │ 0.359615 0.189972 1.89299 0.0909011 . -0.0701318 0.789363
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
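As a quick cross-check of these numbers, the closed-form solution of simple least squares (slope = Sxy / Sxx, intercept = ȳ − slope·x̄) can be computed directly. Here is a minimal sketch in plain Python (illustrative only; the tutorial itself is in Julia), using the same 11 observations:

```python
# Closed-form simple OLS for one predictor, on the tutorial's 11 observations.
y = [2.3, 3.0, 2.9, 4.8, 1.3, 3.6, 2.3, 4.6, 3.0, 5.4, 6.4]
x = [7.4, 7.6, 8.2, 9.0, 10.4, 11.7, 11.7, 11.8, 12.4, 12.9, 14.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
# Centered cross-product and sum of squares.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

slope = s_xy / s_xx                 # ~ 0.359615, as in the table above
intercept = y_bar - slope * x_bar   # ~ -0.228269
print(intercept, slope)
```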
And then the weighted regression version:
lmw, psw = regress(f, df, "fit", weights="w")
lmw
Weighted regression
Model definition: y ~ 1 + x
Used observations: 10
Model statistics:
R²: 0.0149549 Adjusted R²: -0.108176
MSE: 0.182858 RMSE: 0.427619
σ̂²: 0.182858
F Value: 0.121456 with degrees of freedom 1 and 8, Pr > F (p-value): 0.736455
Confidence interval: 95%
Coefficients statistics:
Terms ╲ Stats │ Coefs Std err t Pr(>|t|) code low ci high ci
──────────────┼────────────────────────────────────────────────────────────────────────────
(Intercept) │ 2.32824 2.55186 0.912367 0.388242 -3.55637 8.21285
x │ 0.0853571 0.244924 0.348505 0.736455 -0.479438 0.650152
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The output of the model indicates that this is a weighted regression. We also note that the number of observations used is 10 instead of the 11 used for the simple regression. This is because the last observation has a weight of 0, and since the package only uses strictly positive weights, it is excluded when fitting the model.
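For intuition, weighted least squares simply multiplies every term in the least-squares sums by its weight, so an observation with weight 0 contributes nothing to the fit — the same effect as dropping it. A minimal plain-Python sketch (illustrative only; the tutorial itself is in Julia) reproduces the weighted coefficients above:

```python
# Weighted least squares for one predictor: each sum is weighted by w_i,
# so the last observation (w = 0) drops out of the fit automatically.
y = [2.3, 3.0, 2.9, 4.8, 1.3, 3.6, 2.3, 4.6, 3.0, 5.4, 6.4]
x = [7.4, 7.6, 8.2, 9.0, 10.4, 11.7, 11.7, 11.8, 12.4, 12.9, 14.0]
w = [0.058, 0.073, 0.114, 0.144, 0.151, 0.119, 0.119, 0.114, 0.073, 0.035, 0.0]

sw = sum(w)
# Weighted means.
x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
# Weighted centered cross-product and sum of squares.
s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))

slope = s_xy / s_xx                 # ~ 0.0853571, as in the table above
intercept = y_bar - slope * x_bar   # ~ 2.32824
print(intercept, slope)
```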
For comparison, we fit the simple regression with only the first 10 observations.
df = first(df, 10)
lms, pss = regress(f, df, "fit")
lms
Model definition: y ~ 1 + x
Used observations: 10
Model statistics:
R²: 0.116182 Adjusted R²: 0.00570458
MSE: 1.63241 RMSE: 1.27766
σ̂²: 1.63241
F Value: 1.05164 with degrees of freedom 1 and 8, Pr > F (p-value): 0.335136
Confidence interval: 95%
Coefficients statistics:
Terms ╲ Stats │ Coefs Std err t Pr(>|t|) code low ci high ci
──────────────┼─────────────────────────────────────────────────────────────────────
(Intercept) │ 1.16103 2.14371 0.541599 0.602847 -3.78238 6.10445
x │ 0.209405 0.204199 1.02549 0.335136 -0.26148 0.68029
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We can now see that the weighted regression indeed produces different coefficients.
We can then contrast the fit plots from the two regressions.
[pss["fit"] psw["fit"]]
We note that the regression line is indeed "flatter" in the weighted case. We also note that the prediction interval is presented differently (using error bars) and has a different shape, reflecting the influence of the weights.