Tutorial: weighted regression
This tutorial gives a brief introduction to simple weighted regression using analytical weights. It uses the short dataset available in this SAS blog post.
First, create the dataset
We create the dataset with the help of the DataFrames.jl package.
using DataFrames
tw = [
2.3 7.4 0.058
3.0 7.6 0.073
2.9 8.2 0.114
4.8 9.0 0.144
1.3 10.4 0.151
3.6 11.7 0.119
2.3 11.7 0.119
4.6 11.8 0.114
3.0 12.4 0.073
5.4 12.9 0.035
6.4 14.0 0
]
df = DataFrame(tw, [:y,:x,:w])
| Row | y       | x       | w       |
|-----|---------|---------|---------|
|     | Float64 | Float64 | Float64 |
| 1   | 2.3     | 7.4     | 0.058   |
| 2   | 3.0     | 7.6     | 0.073   |
| 3   | 2.9     | 8.2     | 0.114   |
| 4   | 4.8     | 9.0     | 0.144   |
| 5   | 1.3     | 10.4    | 0.151   |
| 6   | 3.6     | 11.7    | 0.119   |
| 7   | 2.3     | 11.7    | 0.119   |
| 8   | 4.6     | 11.8    | 0.114   |
| 9   | 3.0     | 12.4    | 0.073   |
| 10  | 5.4     | 12.9    | 0.035   |
| 11  | 6.4     | 14.0    | 0.0     |
Second, run a basic analysis
We start by fitting a simple (unweighted) linear regression.
using LinearRegressionKit, StatsModels
using VegaLite
f = @formula(y ~ x)
lms, pss = regress(f, df, "fit")
lms
Model definition: y ~ 1 + x
Used observations: 11
Model statistics:
R²: 0.284772 Adjusted R²: 0.205303
MSE: 1.85959 RMSE: 1.36367
σ̂²: 1.85959
F Value: 3.58341 with degrees of freedom 1 and 9, Pr > F (p-value): 0.0909011
Confidence interval: 95%
Coefficients statistics:
Terms ╲ Stats │ Coefs Std err t Pr(>|t|) code low ci high ci
──────────────┼───────────────────────────────────────────────────────────────────────────────────
(Intercept) │ -0.228269 2.06371 -0.110611 0.914352 -4.89672 4.44018
x │ 0.359615 0.189972 1.89299 0.0909011 . -0.0701318 0.789363
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
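As a quick cross-check of these numbers, the closed-form solution of simple least squares (slope = Sxy / Sxx, intercept = ȳ − slope·x̄) can be computed directly. Here is a minimal sketch in plain Python (illustrative only; the tutorial itself is in Julia), using the same 11 observations:

```python
# Closed-form simple OLS for one predictor, on the tutorial's 11 observations.
y = [2.3, 3.0, 2.9, 4.8, 1.3, 3.6, 2.3, 4.6, 3.0, 5.4, 6.4]
x = [7.4, 7.6, 8.2, 9.0, 10.4, 11.7, 11.7, 11.8, 12.4, 12.9, 14.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
# Centered cross-product and sum of squares.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

slope = s_xy / s_xx                 # ~ 0.359615, as in the table above
intercept = y_bar - slope * x_bar   # ~ -0.228269
print(intercept, slope)
```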
And then the weighted regression version:
lmw, psw = regress(f, df, "fit", weights="w")
lmw
Weighted regression
Model definition: y ~ 1 + x
Used observations: 10
Model statistics:
R²: 0.0149549 Adjusted R²: -0.108176
MSE: 0.182858 RMSE: 0.427619
σ̂²: 0.182858
F Value: 0.121456 with degrees of freedom 1 and 8, Pr > F (p-value): 0.736455
Confidence interval: 95%
Coefficients statistics:
Terms ╲ Stats │ Coefs Std err t Pr(>|t|) code low ci high ci
──────────────┼────────────────────────────────────────────────────────────────────────────
(Intercept) │ 2.32824 2.55186 0.912367 0.388242 -3.55637 8.21285
x │ 0.0853571 0.244924 0.348505 0.736455 -0.479438 0.650152
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The output of the model indicates that this is a weighted regression. We also note that the number of observations used is 10 instead of the 11 used for the simple regression. This is because the last observation has a weight of 0, and since the package only uses strictly positive weights, it is excluded when fitting the model.
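For intuition, weighted least squares simply multiplies every term in the least-squares sums by its weight, so an observation with weight 0 contributes nothing to the fit — the same effect as dropping it. A minimal plain-Python sketch (illustrative only; the tutorial itself is in Julia) reproduces the weighted coefficients above:

```python
# Weighted least squares for one predictor: each sum is weighted by w_i,
# so the last observation (w = 0) drops out of the fit automatically.
y = [2.3, 3.0, 2.9, 4.8, 1.3, 3.6, 2.3, 4.6, 3.0, 5.4, 6.4]
x = [7.4, 7.6, 8.2, 9.0, 10.4, 11.7, 11.7, 11.8, 12.4, 12.9, 14.0]
w = [0.058, 0.073, 0.114, 0.144, 0.151, 0.119, 0.119, 0.114, 0.073, 0.035, 0.0]

sw = sum(w)
# Weighted means.
x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
# Weighted centered cross-product and sum of squares.
s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))

slope = s_xy / s_xx                 # ~ 0.0853571, as in the table above
intercept = y_bar - slope * x_bar   # ~ 2.32824
print(intercept, slope)
```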
For comparison, we fit the simple regression with only the first 10 observations.
df = first(df, 10)
lms, pss = regress(f, df, "fit")
lms
Model definition: y ~ 1 + x
Used observations: 10
Model statistics:
R²: 0.116182 Adjusted R²: 0.00570458
MSE: 1.63241 RMSE: 1.27766
σ̂²: 1.63241
F Value: 1.05164 with degrees of freedom 1 and 8, Pr > F (p-value): 0.335136
Confidence interval: 95%
Coefficients statistics:
Terms ╲ Stats │ Coefs Std err t Pr(>|t|) code low ci high ci
──────────────┼─────────────────────────────────────────────────────────────────────
(Intercept) │ 1.16103 2.14371 0.541599 0.602847 -3.78238 6.10445
x │ 0.209405 0.204199 1.02549 0.335136 -0.26148 0.68029
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We can now see that the weighted regression indeed produces different coefficients.
We can then contrast the fit plots from the two regressions.
[pss["fit"] psw["fit"]]
We note that the regression line is indeed "flatter" in the weighted case. We also note that the prediction interval is presented differently (using error bars) and has a different shape, reflecting the influence of the weights.