Tutorial: weighted regression
This tutorial gives a brief introduction to simple weighted regression using analytical weights. It makes use of the short dataset available in this SAS blog post.
First, create the dataset
We create the dataset with the help of the DataFrames.jl package.
using DataFrames
tw = [
2.3 7.4 0.058
3.0 7.6 0.073
2.9 8.2 0.114
4.8 9.0 0.144
1.3 10.4 0.151
3.6 11.7 0.119
2.3 11.7 0.119
4.6 11.8 0.114
3.0 12.4 0.073
5.4 12.9 0.035
6.4 14.0 0
]
df = DataFrame(tw, [:y, :x, :w])
11 rows × 3 columns

|    | y       | x       | w       |
|----|---------|---------|---------|
|    | Float64 | Float64 | Float64 |
| 1  | 2.3     | 7.4     | 0.058   |
| 2  | 3.0     | 7.6     | 0.073   |
| 3  | 2.9     | 8.2     | 0.114   |
| 4  | 4.8     | 9.0     | 0.144   |
| 5  | 1.3     | 10.4    | 0.151   |
| 6  | 3.6     | 11.7    | 0.119   |
| 7  | 2.3     | 11.7    | 0.119   |
| 8  | 4.6     | 11.8    | 0.114   |
| 9  | 3.0     | 12.4    | 0.073   |
| 10 | 5.4     | 12.9    | 0.035   |
| 11 | 6.4     | 14.0    | 0.0     |
Second, run a basic analysis
We fit a simple linear regression.
using LinearRegression, StatsModels
using VegaLite
f = @formula(y ~ x)
lms, pss = regress(f, df, "fit")
lms
Model definition: y ~ 1 + x
Used observations: 11
Model statistics:
R²: 0.284772 Adjusted R²: 0.205303
MSE: 1.85959 RMSE: 1.36367
σ̂²: 1.85959
Confidence interval: 95%
Coefficients statistics:
Terms ╲ Stats │ Coefs Std err t Pr(>|t|) low ci high ci
──────────────┼───────────────────────────────────────────────────────────────────────
(Intercept) │ -0.228269 2.06371 -0.110611 0.914352 -4.89672 4.44018
x │ 0.359615 0.189972 1.89299 0.0909011 -0.0701318 0.789363
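As a sanity check, these coefficients can be reproduced by hand from the ordinary least-squares normal equations, β = (XᵀX)⁻¹ Xᵀy. The sketch below is illustrative only; it is not necessarily how LinearRegression.jl computes the fit internally.

```julia
using LinearAlgebra

# the same data as in the tutorial's DataFrame
y = [2.3, 3.0, 2.9, 4.8, 1.3, 3.6, 2.3, 4.6, 3.0, 5.4, 6.4]
x = [7.4, 7.6, 8.2, 9.0, 10.4, 11.7, 11.7, 11.8, 12.4, 12.9, 14.0]

X = [ones(length(x)) x]    # design matrix with an intercept column
β = (X' * X) \ (X' * y)    # solve the normal equations
# β ≈ [-0.228269, 0.359615], matching the regress output above
```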
And then the weighted regression version:
lmw, psw = regress(f, df, "fit", weights="w")
lmw
Model definition: y ~ 1 + x
Used observations: 10
Weighted regression
Model statistics:
R²: 0.0149549 Adjusted R²: -0.108176
MSE: 0.182858 RMSE: 0.427619
σ̂²: 0.182858
Confidence interval: 95%
Coefficients statistics:
Terms ╲ Stats │ Coefs Std err t Pr(>|t|) low ci high ci
──────────────┼─────────────────────────────────────────────────────────────────
(Intercept) │ 2.32824 2.55186 0.912367 0.388242 -3.55637 8.21285
x │ 0.0853571 0.244924 0.348505 0.736455 -0.479438 0.650152
The output of the model indicates that this is a weighted regression. We also note that the number of observations is 10 instead of 11 for the simple regression. This is because the last observation has a weight of 0, and since the package only uses positive weights, it is not used to fit the regression model.
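The weighted coefficients can be checked the same way with the weighted normal equations, β = (XᵀWX)⁻¹ XᵀWy, where W is the diagonal matrix of weights. A zero-weight row contributes nothing, so dropping it first mirrors the package's use of positive weights only. Again, this is just an illustrative check, not the package's internal method.

```julia
using LinearAlgebra

y = [2.3, 3.0, 2.9, 4.8, 1.3, 3.6, 2.3, 4.6, 3.0, 5.4, 6.4]
x = [7.4, 7.6, 8.2, 9.0, 10.4, 11.7, 11.7, 11.8, 12.4, 12.9, 14.0]
w = [0.058, 0.073, 0.114, 0.144, 0.151, 0.119, 0.119, 0.114, 0.073, 0.035, 0.0]

keep = w .> 0                          # drop the zero-weight observation
X = [ones(sum(keep)) x[keep]]
W = Diagonal(w[keep])
β = (X' * W * X) \ (X' * W * y[keep])  # weighted normal equations
# β ≈ [2.32824, 0.0853571], matching the weighted regress output
```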
For comparison, we fit the simple regression with only the first 10 observations.
df = first(df, 10)
lms, pss = regress(f, df, "fit")
lms
Model definition: y ~ 1 + x
Used observations: 10
Model statistics:
R²: 0.116182 Adjusted R²: 0.00570458
MSE: 1.63241 RMSE: 1.27766
σ̂²: 1.63241
Confidence interval: 95%
Coefficients statistics:
Terms ╲ Stats │ Coefs Std err t Pr(>|t|) low ci high ci
──────────────┼───────────────────────────────────────────────────────────
(Intercept) │ 1.16103 2.14371 0.541599 0.602847 -3.78238 6.10445
x │ 0.209405 0.204199 1.02549 0.335136 -0.26148 0.68029
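Repeating the manual normal-equations check on this 10-observation subset confirms that plain OLS still gives a noticeably steeper slope (≈ 0.209) than the weighted fit (≈ 0.085), so the difference comes from the weights themselves, not from dropping the last observation. As before, this is only an illustrative cross-check.

```julia
using LinearAlgebra

# first 10 observations only (the zero-weight row excluded)
y = [2.3, 3.0, 2.9, 4.8, 1.3, 3.6, 2.3, 4.6, 3.0, 5.4]
x = [7.4, 7.6, 8.2, 9.0, 10.4, 11.7, 11.7, 11.8, 12.4, 12.9]

X = [ones(length(x)) x]
β = (X' * X) \ (X' * y)    # unweighted normal equations
# β ≈ [1.16103, 0.209405], matching the regress output above
```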
We can now see that the coefficients of the weighted regression are indeed different.
We can then contrast the fit plot from both regressions.
[pss["fit"] psw["fit"]]
We note that the regression line is indeed "flatter" in the weighted regression case. We also note that the prediction interval is presented differently (using error bars), and it shows a different shape, reflecting the importance of the weights.