Tutorial: weighted regression
This tutorial gives a brief introduction to simple weighted regression using analytical weights. It makes use of the short dataset available in this SAS blog post.
First, create the dataset
We create the dataset with the help of the DataFrames.jl package.
using DataFrames
tw = [
2.3 7.4 0.058
3.0 7.6 0.073
2.9 8.2 0.114
4.8 9.0 0.144
1.3 10.4 0.151
3.6 11.7 0.119
2.3 11.7 0.119
4.6 11.8 0.114
3.0 12.4 0.073
5.4 12.9 0.035
6.4 14.0 0
]
df = DataFrame(tw, [:y, :x, :w])
11 rows × 3 columns

|    | y       | x       | w       |
|----|---------|---------|---------|
|    | Float64 | Float64 | Float64 |
| 1  | 2.3     | 7.4     | 0.058   |
| 2  | 3.0     | 7.6     | 0.073   |
| 3  | 2.9     | 8.2     | 0.114   |
| 4  | 4.8     | 9.0     | 0.144   |
| 5  | 1.3     | 10.4    | 0.151   |
| 6  | 3.6     | 11.7    | 0.119   |
| 7  | 2.3     | 11.7    | 0.119   |
| 8  | 4.6     | 11.8    | 0.114   |
| 9  | 3.0     | 12.4    | 0.073   |
| 10 | 5.4     | 12.9    | 0.035   |
| 11 | 6.4     | 14.0    | 0.0     |
Second, run a basic analysis
We fit a simple linear regression.
using LinearRegression, StatsModels
using VegaLite
f = @formula(y ~ x)
lms, pss = regress(f, df, "fit")
lms
Model definition: y ~ 1 + x
Used observations: 11
Model statistics:
R²: 0.284772 Adjusted R²: 0.205303
MSE: 1.85959 RMSE: 1.36367
σ̂²: 1.85959
Confidence interval: 95%
Coefficients statistics:
Terms ╲ Stats │ Coefs Std err t Pr(>|t|) low ci high ci
──────────────┼───────────────────────────────────────────────────────────────────────
(Intercept) │ -0.228269 2.06371 -0.110611 0.914352 -4.89672 4.44018
x │ 0.359615 0.189972 1.89299 0.0909011 -0.0701318 0.789363
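As a sanity check, these coefficients can be reproduced by hand from the ordinary least-squares normal equations, β = (XᵀX)⁻¹ Xᵀy. The sketch below is illustrative only; it is not necessarily how LinearRegression.jl computes the fit internally.

```julia
using LinearAlgebra

# the same data as in the tutorial's DataFrame
y = [2.3, 3.0, 2.9, 4.8, 1.3, 3.6, 2.3, 4.6, 3.0, 5.4, 6.4]
x = [7.4, 7.6, 8.2, 9.0, 10.4, 11.7, 11.7, 11.8, 12.4, 12.9, 14.0]

X = [ones(length(x)) x]    # design matrix with an intercept column
β = (X' * X) \ (X' * y)    # solve the normal equations
# β ≈ [-0.228269, 0.359615], matching the regress output above
```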
And then the weighted regression version:
lmw, psw = regress(f, df, "fit", weights="w")
lmw
Model definition: y ~ 1 + x
Used observations: 10
Weighted regression
Model statistics:
R²: 0.0149549 Adjusted R²: -0.108176
MSE: 0.182858 RMSE: 0.427619
σ̂²: 0.182858
Confidence interval: 95%
Coefficients statistics:
Terms ╲ Stats │ Coefs Std err t Pr(>|t|) low ci high ci
──────────────┼─────────────────────────────────────────────────────────────────
(Intercept) │ 2.32824 2.55186 0.912367 0.388242 -3.55637 8.21285
x │ 0.0853571 0.244924 0.348505 0.736455 -0.479438 0.650152
The output of the model indicates that this is a weighted regression. We also note that the number of observations is 10 instead of 11 for the simple regression. This is because the last observation has a weight of 0, and since the package only uses positive weights, it is not used to fit the regression model.
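The weighted coefficients can be checked the same way with the weighted normal equations, β = (XᵀWX)⁻¹ XᵀWy, where W is the diagonal matrix of weights. A zero-weight row contributes nothing, so dropping it first mirrors the package's use of positive weights only. Again, this is just an illustrative check, not the package's internal method.

```julia
using LinearAlgebra

y = [2.3, 3.0, 2.9, 4.8, 1.3, 3.6, 2.3, 4.6, 3.0, 5.4, 6.4]
x = [7.4, 7.6, 8.2, 9.0, 10.4, 11.7, 11.7, 11.8, 12.4, 12.9, 14.0]
w = [0.058, 0.073, 0.114, 0.144, 0.151, 0.119, 0.119, 0.114, 0.073, 0.035, 0.0]

keep = w .> 0                          # drop the zero-weight observation
X = [ones(sum(keep)) x[keep]]
W = Diagonal(w[keep])
β = (X' * W * X) \ (X' * W * y[keep])  # weighted normal equations
# β ≈ [2.32824, 0.0853571], matching the weighted regress output
```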
For comparison, we fit the simple regression with only the first 10 observations.
df = first(df, 10)
lms, pss = regress(f, df, "fit")
lms
Model definition: y ~ 1 + x
Used observations: 10
Model statistics:
R²: 0.116182 Adjusted R²: 0.00570458
MSE: 1.63241 RMSE: 1.27766
σ̂²: 1.63241
Confidence interval: 95%
Coefficients statistics:
Terms ╲ Stats │ Coefs Std err t Pr(>|t|) low ci high ci
──────────────┼───────────────────────────────────────────────────────────
(Intercept) │ 1.16103 2.14371 0.541599 0.602847 -3.78238 6.10445
x │ 0.209405 0.204199 1.02549 0.335136 -0.26148 0.68029
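Repeating the manual normal-equations check on this 10-observation subset confirms that plain OLS still gives a noticeably steeper slope (≈ 0.209) than the weighted fit (≈ 0.085), so the difference comes from the weights themselves, not from dropping the last observation. As before, this is only an illustrative cross-check.

```julia
using LinearAlgebra

# first 10 observations only (the zero-weight row excluded)
y = [2.3, 3.0, 2.9, 4.8, 1.3, 3.6, 2.3, 4.6, 3.0, 5.4]
x = [7.4, 7.6, 8.2, 9.0, 10.4, 11.7, 11.7, 11.8, 12.4, 12.9]

X = [ones(length(x)) x]
β = (X' * X) \ (X' * y)    # unweighted normal equations
# β ≈ [1.16103, 0.209405], matching the regress output above
```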
We can now see that the coefficients of the weighted regression are indeed different.
We can then contrast the fit plot from both regressions.
[pss["fit"] psw["fit"]]
We note that the regression line is indeed "flatter" in the weighted regression case. We also note that the prediction interval is presented differently (using error bars), and it shows a different shape, reflecting the importance of the weights.