Tutorial: Linear Regression Basics
This tutorial details a simple regression analysis based on the "Formaldehyde" dataset.
First, the dataset is created
This is done with the DataFrames.jl package.
```julia
using DataFrames
df = DataFrame(Carb=[0.1, 0.3, 0.5, 0.6, 0.7, 0.9],
               OptDen=[0.086, 0.269, 0.446, 0.538, 0.626, 0.782])
```
| Row | Carb | OptDen |
|-----|------|--------|
| 1 | 0.1 | 0.086 |
| 2 | 0.3 | 0.269 |
| 3 | 0.5 | 0.446 |
| 4 | 0.6 | 0.538 |
| 5 | 0.7 | 0.626 |
| 6 | 0.9 | 0.782 |
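Before modeling, it can be useful to glance at summary statistics. A minimal sketch using the standard `describe` function from DataFrames.jl (not part of the original tutorial):

```julia
# Quick overview of each column: min, mean, max, element type, etc.
describe(df)
```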
Second, the model is defined
We want OptDen as the dependent variable (the response) and Carb as the independent variable (the predictor). Our model has an intercept, but there is no need to state it explicitly: the package implicitly adds the intercept to the model. We define the model as OptDen ~ Carb; the variable names need to be column names of the DataFrame, which is passed as the second argument to the regress function. The returned lm object then presents the essential information about the regression.
```julia
using LinearRegressionKit
using StatsModels        # required to use the @formula macro
lm = regress(@formula(OptDen ~ Carb), df)
```
```
Model definition: OptDen ~ 1 + Carb
Used observations: 6
Model statistics:
  R²: 0.999047            Adjusted R²: 0.998808
  MSE: 7.48e-05           RMSE: 0.0086487
  σ̂²: 7.48e-05
  F Value: 4191.84 with degrees of freedom 1 and 4, Pr > F (p-value): 3.40919e-07
Confidence interval: 95%

Coefficients statistics:
Terms ╲ Stats │      Coefs     Std err          t   Pr(>|t|)   code      low ci    high ci
──────────────┼───────────────────────────────────────────────────────────────────────────
(Intercept)   │ 0.00508571  0.00783368   0.649211   0.551595         -0.0166641  0.0268355
Carb          │   0.876286   0.0135345    64.7444 3.40919e-7    ***    0.838708   0.913864

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```
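For illustration, the coefficients above define the fitted line OptDen ≈ 0.00509 + 0.87629 · Carb. Here is a hand-rolled sketch of that line (ours, purely for checking; in practice, use the predict functions shown below):

```julia
# Fitted line built by hand from the printed coefficients
fitted_optden(carb) = 0.00508571 + 0.876286 * carb
fitted_optden(0.5)   # ≈ 0.443229
```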
Third, an illustration of the model is created
Here we will only look at the fit plot. To obtain it, we only need to add a third argument to the regress function: the name of the requested plot ("fit"). When at least one plot is requested, the regress function returns a pair of objects: the regression results (as before) and a Dict giving access to the requested plot(s).
```julia
using VegaLite           # the package used for plotting
lm, ps = regress(@formula(OptDen ~ Carb), df, "fit")
ps["fit"]
```
The response is plotted on the y-axis, and the predictor is plotted on the x-axis. The dark orange line represents the regression equation. The dark grey band represents the confidence interval given the α (which defaults to 0.05 and gives a 95% confidence interval). The light grey band represents the individual prediction interval. Finally, the blue circles represent the actual observations from the dataset.
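If you want to keep the figure, the plot can be written to disk with VegaLite's save function; a minimal sketch (the file name "fit.svg" is just an example):

```julia
# Save the fit plot to disk; the format is inferred from the extension
save("fit.svg", ps["fit"])
```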
Fourth, generate the predictions from the model
Here we get the predicted values from the model using the same DataFrame.
```julia
results = predict_in_sample(lm, df)
```
| Row | Carb | OptDen | predicted |
|-----|------|--------|-----------|
| 1 | 0.1 | 0.086 | 0.0927143 |
| 2 | 0.3 | 0.269 | 0.267971 |
| 3 | 0.5 | 0.446 | 0.443229 |
| 4 | 0.6 | 0.538 | 0.530857 |
| 5 | 0.7 | 0.626 | 0.618486 |
| 6 | 0.9 | 0.782 | 0.793743 |
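As a quick sanity check, the residuals can be recovered from this output with plain DataFrames column arithmetic:

```julia
# Residual = observed − fitted; for row 1: 0.086 - 0.0927143 ≈ -0.0067143
resid = results.OptDen .- results.predicted
```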
Fifth, generate the other statistics about the model
To get all the statistics, one can pass the "all" keyword to the req_stats argument.
```julia
results = predict_in_sample(lm, df, req_stats="all")
```
| Row | Carb | OptDen | student | rstudent | stdi | lcli | uclp | stdp | ucli | press | lclp | predicted | residuals | leverage | stdr | cooksd |
|-----|------|--------|---------|----------|------|------|------|------|------|-------|------|-----------|-----------|----------|------|--------|
| 1 | 0.1 | 0.086 | -1.21516 | -1.32495 | 0.0109119 | 0.062418 | 0.111187 | 0.00665352 | 0.123011 | -0.01645 | 0.0742411 | 0.0927143 | -0.00671429 | 0.591837 | 0.00552545 | 1.07054 |
| 2 | 0.3 | 0.269 | 0.140317 | 0.121818 | 0.00979112 | 0.240787 | 0.280715 | 0.00458978 | 0.295156 | 0.00143182 | 0.255228 | 0.267971 | 0.00102857 | 0.281633 | 0.00733034 | 0.00385947 |
| 3 | 0.5 | 0.446 | 0.351173 | 0.308924 | 0.00934439 | 0.417284 | 0.453052 | 0.00353802 | 0.469173 | 0.00332843 | 0.433405 | 0.443229 | 0.00277143 | 0.167347 | 0.00789192 | 0.0123927 |
| 4 | 0.6 | 0.538 | 0.914091 | 0.890024 | 0.0094095 | 0.504732 | 0.541148 | 0.00370659 | 0.556982 | 0.00875 | 0.520566 | 0.530857 | 0.00714286 | 0.183673 | 0.00781417 | 0.0940007 |
| 5 | 0.7 | 0.626 | 1.00256 | 1.00342 | 0.00966559 | 0.59165 | 0.630468 | 0.00431552 | 0.645322 | 0.0100054 | 0.606504 | 0.618486 | 0.00751429 | 0.24898 | 0.00749509 | 0.166611 |
| 6 | 0.9 | 0.782 | -1.97323 | -10.4789 | 0.0106857 | 0.764075 | 0.811167 | 0.00627571 | 0.823411 | -0.0248017 | 0.776319 | 0.793743 | -0.0117429 | 0.526531 | 0.00595109 | 2.16499 |
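Presumably, individual statistics can also be requested by name rather than all at once; the sketch below is hypothetical, and the exact set of accepted identifiers should be checked against the package documentation:

```julia
# Hypothetical: request only a subset of the statistics by name
# (verify the accepted names in LinearRegressionKit's documentation)
results = predict_in_sample(lm, df, req_stats=["leverage", "cooksd"])
```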
Sixth, generate predictions for new data
We first create a new DataFrame using the same column names as in the model. In our case, there is only one column: "Carb".
```julia
ndf = DataFrame(Carb=[0.11, 0.22, 0.55, 0.77])
predictions = predict_out_of_sample(lm, ndf)
```
| Row | Carb | predicted |
|-----|------|-----------|
| 1 | 0.11 | 0.101477 |
| 2 | 0.22 | 0.197869 |
| 3 | 0.55 | 0.487043 |
| 4 | 0.77 | 0.679826 |
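As a final sanity check, the first prediction matches the hand-rolled fitted line from earlier:

```julia
fitted_optden(0.11)  # ≈ 0.101477, matching the first predicted value above
```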