Tutorial: Linear Regression Basics

This tutorial details a simple regression analysis based on the "Formaldehyde" dataset.

First, create the dataset

This is done using the DataFrames.jl package.

using DataFrames
df = DataFrame(Carb=[0.1,0.3,0.5,0.6,0.7,0.9], OptDen=[0.086,0.269,0.446,0.538,0.626,0.782])
6×2 DataFrame
 Row │ Carb     OptDen
     │ Float64  Float64
─────┼──────────────────
   1 │     0.1    0.086
   2 │     0.3    0.269
   3 │     0.5    0.446
   4 │     0.6    0.538
   5 │     0.7    0.626
   6 │     0.9    0.782
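
As an aside, "Formaldehyde" is one of R's built-in datasets, so the same data can also be loaded through the RDatasets.jl package (an assumption: RDatasets must be installed, and the column names it provides may differ in capitalisation from the ones used here; the rest of the tutorial does not depend on it).

using RDatasets                                     # assumption: RDatasets.jl is installed
formaldehyde = dataset("datasets", "Formaldehyde")  # R's built-in Formaldehyde data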

Second, define the model

We want OptDen as the dependent variable (the response) and Carb as the independent variable (the predictor). Our model has an intercept, but we do not need to state it explicitly: the package adds it to the model implicitly. We define the model as OptDen ~ Carb; the variable names must be column names of the DataFrame, which is passed as the second argument to the regress function. The returned lm object then presents the essential information about the regression.

using LinearRegressionKit
using StatsModels # this is required to use the @formula macro

lm = regress(@formula(OptDen ~ Carb), df)
Model definition:	OptDen ~ 1 + Carb
Used observations:	6
Model statistics:
  R²: 0.999047			Adjusted R²: 0.998808
  MSE: 7.48e-05			RMSE: 0.0086487
  σ̂²: 7.48e-05
  F Value: 4191.84 with degrees of freedom 1 and 4, Pr > F (p-value): 3.40919e-07
Confidence interval: 95%

Coefficients statistics:
Terms ╲ Stats │      Coefs     Std err           t    Pr(>|t|)        code      low ci     high ci
──────────────┼───────────────────────────────────────────────────────────────────────────────────
(Intercept)   │ 0.00508571  0.00783368    0.649211    0.551595              -0.0166641   0.0268355
Carb          │   0.876286   0.0135345     64.7444  3.40919e-7        ***     0.838708    0.913864

	Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
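
As a quick sanity check (not part of the package's output), the two coefficients can be reproduced by hand with the closed-form formulas for simple linear regression, using only the Statistics standard library:

using Statistics
carb   = [0.1, 0.3, 0.5, 0.6, 0.7, 0.9]
optden = [0.086, 0.269, 0.446, 0.538, 0.626, 0.782]
slope     = cov(carb, optden) / var(carb)      # ≈ 0.876286, the Carb coefficient above
intercept = mean(optden) - slope * mean(carb)  # ≈ 0.00508571, the (Intercept) above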

Third, illustrate the model with a fit plot

Here we will only look at the fit plot. To obtain it, we only need to add a third argument to the regress function: the name of the requested plot ("fit"). When at least one plot is requested, the regress function returns a pair of objects: the information about the regression (as before) and a Dict giving access to the requested plot(s).

using VegaLite # this is the package used for plotting

lm, ps = regress(@formula(OptDen ~ Carb), df, "fit")
ps["fit"]

The response is plotted on the y-axis, and the predictor is plotted on the x-axis. The dark orange line represents the regression equation. The dark grey band represents the confidence interval given the α (which defaults to 0.05 and gives a 95% confidence interval). The light grey band represents the individual prediction interval. Finally, the blue circles represent the actual observations from the dataset.
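
When working outside a notebook, the plot object can also be written to a file with the save function that VegaLite provides (a minimal sketch; the file name "fit.svg" is just an example, and other formats such as PNG or PDF can be used as well):

save("fit.svg", ps["fit"])  # write the fit plot to an SVG file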

Fourth, generate the predictions from the model

Here we get the predicted values from the model using the same DataFrame.

results = predict_in_sample(lm, df)
6×3 DataFrame
 Row │ OptDen   Carb     predicted
     │ Float64  Float64  Float64
─────┼─────────────────────────────
   1 │   0.086      0.1  0.0927143
   2 │   0.269      0.3   0.267971
   3 │   0.446      0.5   0.443229
   4 │   0.538      0.6   0.530857
   5 │   0.626      0.7   0.618486
   6 │   0.782      0.9   0.793743
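
From this DataFrame, the residuals and the RMSE reported in the second step can be recovered by hand (a small illustrative check; with one predictor and an intercept, two degrees of freedom are used up):

residuals = results.OptDen .- results.predicted
rmse = sqrt(sum(abs2, residuals) / (nrow(results) - 2))  # ≈ 0.0086487, the RMSE shown above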

Fifth, generate the other statistics about the model

In order to get all the statistics, one can pass "all" to the req_stats keyword argument.

results = predict_in_sample(lm, df, req_stats="all")
6×16 DataFrame
 Row │ OptDen   Carb     student   rstudent  stdi        lcli      uclp      stdp        ucli      press       lclp       predicted  residuals    leverage  stdr        cooksd
     │ Float64  Float64  Float64   Float64   Float64     Float64   Float64   Float64     Float64   Float64     Float64    Float64    Float64      Float64   Float64     Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │   0.086      0.1  -1.21516  -1.32495   0.0109119  0.062418  0.111187  0.00665352  0.123011    -0.01645  0.0742411  0.0927143  -0.00671429  0.591837  0.00552545     1.07054
   2 │   0.269      0.3  0.140317  0.121818  0.00979112  0.240787  0.280715  0.00458978  0.295156  0.00143182   0.255228   0.267971   0.00102857  0.281633  0.00733034  0.00385947
   3 │   0.446      0.5  0.351173  0.308924  0.00934439  0.417284  0.453052  0.00353802  0.469173  0.00332843   0.433405   0.443229   0.00277143  0.167347  0.00789192   0.0123927
   4 │   0.538      0.6  0.914091  0.890024   0.0094095  0.504732  0.541148  0.00370659  0.556982     0.00875   0.520566   0.530857   0.00714286  0.183673  0.00781417   0.0940007
   5 │   0.626      0.7   1.00256   1.00342  0.00966559   0.59165  0.630468  0.00431552  0.645322   0.0100054   0.606504   0.618486   0.00751429   0.24898  0.00749509    0.166611
   6 │   0.782      0.9  -1.97323  -10.4789   0.0106857  0.764075  0.811167  0.00627571  0.823411  -0.0248017   0.776319   0.793743   -0.0117429  0.526531  0.00595109     2.16499
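
To see where a column such as leverage comes from, it can be recomputed as the diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ, where X is the design matrix made of an intercept column and the Carb column (an illustrative check only, independent of LinearRegressionKit):

using LinearAlgebra
X = hcat(ones(nrow(df)), df.Carb)  # design matrix: intercept and predictor
H = X * inv(X' * X) * X'           # hat (projection) matrix
diag(H)                            # ≈ 0.591837, 0.281633, …, 0.526531: the leverage column above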

Sixth, generate predictions for new data

We first create a new DataFrame with the same column names as the predictors used in the model. In our case, there is only one column: "Carb".

ndf = DataFrame(Carb= [0.11, 0.22, 0.55, 0.77])
predictions = predict_out_of_sample(lm, ndf)
4×2 DataFrame
 Row │ Carb     predicted
     │ Float64  Float64
─────┼────────────────────
   1 │    0.11   0.101477
   2 │    0.22   0.197869
   3 │    0.55   0.487043
   4 │    0.77   0.679826
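
Since the model is a straight line, these values are simply the fitted equation evaluated at the new Carb values. Using the coefficients from the table in the second step (rounded, so the last digits may differ slightly):

intercept, slope = 0.00508571, 0.876286  # taken from the coefficients table above
intercept .+ slope .* ndf.Carb           # ≈ 0.101477, 0.197869, 0.487043, 0.679826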