Tutorial: Linear Regression Basics

This tutorial details a simple regression analysis based on the "Formaldehyde" dataset.

First, create the dataset

This is done using the DataFrames.jl package.

using DataFrames
df = DataFrame(Carb=[0.1,0.3,0.5,0.6,0.7,0.9], OptDen=[0.086,0.269,0.446,0.538,0.626,0.782])
6×2 DataFrame
 Row │ Carb     OptDen
     │ Float64  Float64
─────┼──────────────────
   1 │     0.1    0.086
   2 │     0.3    0.269
   3 │     0.5    0.446
   4 │     0.6    0.538
   5 │     0.7    0.626
   6 │     0.9    0.782
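
As an aside, "Formaldehyde" is one of R's built-in datasets, so the same data can also be loaded through the RDatasets.jl package (an assumption: RDatasets must be installed, and the column names it provides may differ in capitalisation from the ones used here; the rest of the tutorial does not depend on it).

using RDatasets                                     # assumption: RDatasets.jl is installed
formaldehyde = dataset("datasets", "Formaldehyde")  # R's built-in Formaldehyde data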

Second, define the model

We want OptDen as the dependent variable (the response) and Carb as the independent variable (the predictor). Our model has an intercept, but we do not need to state it explicitly: the package adds it to the model implicitly. We define the model as OptDen ~ Carb; the variable names must be column names of the DataFrame, which is passed as the second argument to the regress function. The returned lm object then presents the essential information about the regression.

using LinearRegressionKit
using StatsModels # this is required to use the @formula macro

lm = regress(@formula(OptDen ~ Carb), df)
Model definition:	OptDen ~ 1 + Carb
Used observations:	6
Model statistics:
  R²: 0.999047			Adjusted R²: 0.998808
  MSE: 7.48e-05			RMSE: 0.0086487
  σ̂²: 7.48e-05
  F Value: 4191.84 with degrees of freedom 1 and 4, Pr > F (p-value): 3.40919e-07
Confidence interval: 95%

Coefficients statistics:
Terms ╲ Stats │      Coefs     Std err           t    Pr(>|t|)        code      low ci     high ci
──────────────┼───────────────────────────────────────────────────────────────────────────────────
(Intercept)   │ 0.00508571  0.00783368    0.649211    0.551595              -0.0166641   0.0268355
Carb          │   0.876286   0.0135345     64.7444  3.40919e-7        ***     0.838708    0.913864

	Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
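
As a quick sanity check (not part of the package's output), the two coefficients can be reproduced by hand with the closed-form formulas for simple linear regression, using only the Statistics standard library:

using Statistics
carb   = [0.1, 0.3, 0.5, 0.6, 0.7, 0.9]
optden = [0.086, 0.269, 0.446, 0.538, 0.626, 0.782]
slope     = cov(carb, optden) / var(carb)      # ≈ 0.876286, the Carb coefficient above
intercept = mean(optden) - slope * mean(carb)  # ≈ 0.00508571, the (Intercept) above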

Third, illustrate the model with a fit plot

Here we will only look at the fit plot. To obtain it, we only need to add a third argument to the regress function: the name of the requested plot ("fit"). When at least one plot is requested, the regress function returns a pair of objects: the information about the regression (as before) and a Dict giving access to the requested plot(s).

using VegaLite # this is the package used for plotting

lm, ps = regress(@formula(OptDen ~ Carb), df, "fit")
ps["fit"]

The response is plotted on the y-axis, and the predictor is plotted on the x-axis. The dark orange line represents the regression equation. The dark grey band represents the confidence interval given the α (which defaults to 0.05 and gives a 95% confidence interval). The light grey band represents the individual prediction interval. Finally, the blue circles represent the actual observations from the dataset.
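
When working outside a notebook, the plot object can also be written to a file with the save function that VegaLite provides (a minimal sketch; the file name "fit.svg" is just an example, and other formats such as PNG or PDF can be used as well):

save("fit.svg", ps["fit"])  # write the fit plot to an SVG file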

Fourth, generate the predictions from the model

Here we get the predicted values from the model using the same DataFrame.

results = predict_in_sample(lm, df)
6×3 DataFrame
 Row │ OptDen   Carb     predicted
     │ Float64  Float64  Float64
─────┼─────────────────────────────
   1 │   0.086      0.1  0.0927143
   2 │   0.269      0.3   0.267971
   3 │   0.446      0.5   0.443229
   4 │   0.538      0.6   0.530857
   5 │   0.626      0.7   0.618486
   6 │   0.782      0.9   0.793743
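
From this DataFrame, the residuals and the RMSE reported in the second step can be recovered by hand (a small illustrative check; with one predictor and an intercept, two degrees of freedom are used up):

residuals = results.OptDen .- results.predicted
rmse = sqrt(sum(abs2, residuals) / (nrow(results) - 2))  # ≈ 0.0086487, the RMSE shown above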

Fifth, generate the other statistics about the model

In order to get all the statistics, one can pass "all" to the req_stats keyword argument.

results = predict_in_sample(lm, df, req_stats="all")
6×16 DataFrame
 Row │ OptDen   Carb     student   rstudent  stdi        lcli      uclp      stdp        ucli      press       lclp       predicted  residuals    leverage  stdr        cooksd
     │ Float64  Float64  Float64   Float64   Float64     Float64   Float64   Float64     Float64   Float64     Float64    Float64    Float64      Float64   Float64     Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │   0.086      0.1  -1.21516  -1.32495   0.0109119  0.062418  0.111187  0.00665352  0.123011    -0.01645  0.0742411  0.0927143  -0.00671429  0.591837  0.00552545     1.07054
   2 │   0.269      0.3  0.140317  0.121818  0.00979112  0.240787  0.280715  0.00458978  0.295156  0.00143182   0.255228   0.267971   0.00102857  0.281633  0.00733034  0.00385947
   3 │   0.446      0.5  0.351173  0.308924  0.00934439  0.417284  0.453052  0.00353802  0.469173  0.00332843   0.433405   0.443229   0.00277143  0.167347  0.00789192   0.0123927
   4 │   0.538      0.6  0.914091  0.890024   0.0094095  0.504732  0.541148  0.00370659  0.556982     0.00875   0.520566   0.530857   0.00714286  0.183673  0.00781417   0.0940007
   5 │   0.626      0.7   1.00256   1.00342  0.00966559   0.59165  0.630468  0.00431552  0.645322   0.0100054   0.606504   0.618486   0.00751429   0.24898  0.00749509    0.166611
   6 │   0.782      0.9  -1.97323  -10.4789   0.0106857  0.764075  0.811167  0.00627571  0.823411  -0.0248017   0.776319   0.793743   -0.0117429  0.526531  0.00595109     2.16499
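
To see where a column such as leverage comes from, it can be recomputed as the diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ, where X is the design matrix made of an intercept column and the Carb column (an illustrative check only, independent of LinearRegressionKit):

using LinearAlgebra
X = hcat(ones(nrow(df)), df.Carb)  # design matrix: intercept and predictor
H = X * inv(X' * X) * X'           # hat (projection) matrix
diag(H)                            # ≈ 0.591837, 0.281633, …, 0.526531: the leverage column above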

Sixth, generate predictions for new data

We first create a new DataFrame with the same column names as the predictors used in the model. In our case, there is only one column: "Carb".

ndf = DataFrame(Carb= [0.11, 0.22, 0.55, 0.77])
predictions = predict_out_of_sample(lm, ndf)
4×2 DataFrame
 Row │ Carb     predicted
     │ Float64  Float64
─────┼────────────────────
   1 │    0.11   0.101477
   2 │    0.22   0.197869
   3 │    0.55   0.487043
   4 │    0.77   0.679826
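
Since the model is a straight line, these values are simply the fitted equation evaluated at the new Carb values. Using the coefficients from the table in the second step (rounded, so the last digits may differ slightly):

intercept, slope = 0.00508571, 0.876286  # taken from the coefficients table above
intercept .+ slope .* ndf.Carb           # ≈ 0.101477, 0.197869, 0.487043, 0.679826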