LinearRegressionKit.jl Documentation
LinearRegressionKit.jl implements linear regression using the least-squares algorithm (relying on the sweep operator). This package is in the alpha stage. Hence it is likely that some bugs exist. Furthermore, the API might change in future versions.
The usage aims to be straightforward: a call to regress builds a linear regression model, and a call to predict_in_sample predicts data using the built model. When predicting on data not present during the regression, use the predict_out_of_sample function, as it does not require a response value (consequently, statistics that need a response, such as the residuals, are not available).
The regress call computes some statistics about the fitted model in addition to the coefficients; which statistics are computed depends on the value of the req_stats argument. The prediction functions compute predicted values together with some statistics; as for the regress calls, which statistics are computed depends on the value of the req_stats argument.
When positive analytical weights are provided, a weighted regression is performed.
Statistics related to the regression (the fitting)
Fitting the model generates some statistics, dependent on the req_stats argument of the regress function.
- $n$, $p$, "coefs" and "sse" are always computed.
- "mse", "sst", "rmse", "aic", "sigma", "t_statistic", "vif", "r2", "adjr2", "stderror", "t_values", "p_values", "ci", "press", and "cond" are computed upon request.
- Some diagnostics can be requested as well. Here is the full list as Symbols: [:diag_normality, :diag_ks, :diag_ad, :diag_jb, :diag_heteroskedasticity, :diag_white, :diag_bp]. "diag_normality" is a shortcut for [:diag_ks, :diag_ad, :diag_jb], and :diag_heteroskedasticity is a shortcut for [:diag_white, :diag_bp].
- "default" includes the mandatory statistics and some of the optional ones, here as Symbols: [:coefs, :sse, :mse, :sst, :rmse, :sigma, :t_statistic, :r2, :adjr2, :stderror, :t_values, :p_values, :ci, :f_stats].
- "all" includes all available statistics.
- "none" includes only the mandatory statistics.
The meaning for these statistics is given below.
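For instance, a minimal sketch (with made-up data and a hypothetical formula) of a regress call requesting specific statistics could look like this:

```julia
using DataFrames, LinearRegressionKit, StatsModels

# Made-up data for illustration only.
df = DataFrame(x = [1.0, 2.0, 3.0, 4.0, 5.0],
               y = [2.1, 3.9, 6.2, 7.8, 10.1])

# Request the default statistics plus PRESS and the condition number.
lr = regress(@formula(y ~ x), df, req_stats = ["default", "press", "cond"])
```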
Number of observations and variables
The number of observations $n$ used to fit the model.
The number of independent variables $p$ used in the model.
Total Sum of Squares
The Total Sum of Squares (SST) is calculated but not presented to the user. For a model with an intercept, the SST is computed as:
\[\mathrm{SST}=\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^2\]
And when there is no intercept:
\[\mathrm{SST}=\sum_{i=1}^{n} y_{i}^2\]
Error Sum of Squares
The Error Sum of Squares (SSE), also known as the Residual Sum of Squares (RSS), is computed with the sweep operator (Goodnight, J. (1979). "A Tutorial on the SWEEP Operator." The American Statistician).
Mean Squared Error
The Mean Squared Error (MSE) is calculated as
\[\mathrm{MSE} = \displaystyle{\frac{{\mathrm{SSE}}}{{n - p}}}\]
The Root Mean Squared Error (RMSE) is calculated as
\[\mathrm{RMSE} = \sqrt{\mathrm{MSE}}\]
The MSE is used as the estimate σ̂² of the error variance unless at least one robust covariance estimator is requested.
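As an illustration, a sketch of the two formulas above with made-up values for SSE, $n$ and $p$:

```julia
# Made-up values for sse, n and p.
sse, n, p = 12.5, 30, 3
mse  = sse / (n - p)   # Mean Squared Error
rmse = sqrt(mse)       # Root Mean Squared Error
```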
R² and Adjusted R²
The R² (R2 or R-squared, see https://en.wikipedia.org/wiki/Coefficient_of_determination) is calculated with the following formula:
\[\mathrm{R}^2 = 1 - \displaystyle{\frac{{\mathrm{SSE}}}{{\mathrm{SST}}}}\]
The Adjusted R² (ADJR2) is computed with the following formulas. For a model with an intercept:
\[\mathrm{ADJR}^2 = 1 - \displaystyle \frac{(n-1)(1-\mathrm{R}^2)}{n-p}\]
And when there is no intercept:
\[\mathrm{ADJR}^2 = 1 - \displaystyle \frac{(n)(1-\mathrm{R}^2)}{n-p}\]
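A sketch of the R² and Adjusted R² formulas above, again with made-up values:

```julia
# Made-up values for sse, sst, n and p.
sse, sst, n, p = 12.5, 100.0, 30, 3
r2 = 1 - sse / sst
adjr2_with_intercept = 1 - (n - 1) * (1 - r2) / (n - p)
adjr2_no_intercept   = 1 - n * (1 - r2) / (n - p)
```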
Akaike information criterion
The Akaike information criterion is calculated with the linear-regression-specific formula:
\[\mathrm{AIC} = \displaystyle n \ln \left( \frac{\mathrm{SSE}}{n} \right) + 2p\]
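A one-line sketch of this formula with made-up values:

```julia
# Made-up values for sse, n and p.
sse, n, p = 12.5, 30, 3
aic = n * log(sse / n) + 2p
```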
t_statistic and confidence interval
The t_statistic is computed using the inverse cumulative t distribution (with quantile()) with $n - p$ degrees of freedom at $1 - \frac{α}{2}$.
The standard errors of the coefficients are calculated by multiplying the sigma (estimated by the MSE) with the pseudo-inverse matrix (resulting from the sweep operator), from which the square roots of the diagonal elements are extracted.
The t-values are calculated as the coefficients divided by their standard errors.
The upper bound of the confidence interval for each coefficient is calculated as the coefficient plus its standard error times the t_statistic.
The lower bound of the confidence interval for each coefficient is calculated as the coefficient minus its standard error times the t_statistic.
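A sketch of these computations using the Distributions package, with made-up values for $n$, $p$, the coefficient, and its standard error:

```julia
using Distributions

n, p, α = 30, 3, 0.05
# Critical value: inverse cumulative t distribution with n - p degrees
# of freedom evaluated at 1 - α/2.
t_statistic = quantile(TDist(n - p), 1 - α / 2)

coef, se = 1.98, 0.12          # made-up coefficient and standard error
ci_lower = coef - se * t_statistic
ci_upper = coef + se * t_statistic
```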
p-values
The p-values are computed using the F distribution and the degrees of freedom of each coefficient.
F Value (F Statistic)
The F value (F statistic) is computed using the F distribution with the degrees of freedom of the explained and unexplained variance.
Variance inflation factor
Variance inflation factor (VIF) is calculated by taking the diagonal elements of the inverse of the correlation matrix formed by the independent variables.
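A sketch of this definition with a made-up matrix of independent variables:

```julia
using LinearAlgebra, Statistics

# Made-up matrix of independent variables (one column per variable).
X = [1.0 2.1; 2.0 3.9; 3.0 6.2; 4.0 7.8; 5.0 10.1]
# VIF: diagonal of the inverse of the correlation matrix.
vif = diag(inv(cor(X)))
```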
PRESS predicted residual error sum of squares
The predicted residual error sum of squares is calculated by taking the sum of the squared PRESS residuals (see the statistics related to predictions below) of each observation.
Condition number (cond)
The condition number of the design matrix is computed using the function cond from LinearAlgebra. It gives an indication of how severely ill-conditioned the problem is. See the Wikipedia page and the Julia documentation for more information.
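A sketch with a made-up design matrix:

```julia
using LinearAlgebra

# Condition number of a made-up design matrix (first column: intercept).
X = [1.0 1.0; 1.0 2.0; 1.0 3.0; 1.0 4.0]
κ = cond(X)   # large values indicate an ill-conditioned problem
```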
Robust covariance estimators
Robust covariance estimators can be requested through the cov argument of the regress function. The options are (as Symbols):
- :white : heteroscedasticity
- :hc0 : heteroscedasticity
- :hc1 : heteroscedasticity
- :hc2 : heteroscedasticity
- :hc3 : heteroscedasticity
- :nw : HAC (heteroskedasticity and autocorrelation consistent estimator)
Heteroscedasticity estimators
The user can select estimators from the above list. If the user selects :white as an estimator, then HC3 will be selected for a small sample size (n ≤ 250), otherwise HC0 will be selected (see "Using Heteroscedasticity Consistent Standard Errors in the Linear Regression Model", J. Scott Long and Laurie H. Ervin (1998–2000)). If another estimator is requested, it is provided. A list of estimators can be requested, as in for instance cov=[:hc2, :hc3]. Comprehensive descriptions of the estimators and their applications should be found in a textbook; here only a brief description of the implementation is provided.
HC0
Let InvMat be the pseudo-inverse resulting from the sweep operator, and let $xe$ be the matrix of the independent variables multiplied (row-wise) by the residuals. Then HC0 is calculated as:
\[\textup{HC0} = \sqrt{diag(\textup{InvMat } \textup{xe}' \textup{xe} \textup{ InvMat})}\]
HC1
Let $n$ be the number of observations and $p$ the number of variables. Then HC1 is calculated as:
\[\textup{HC1} = \sqrt{diag(\textup{InvMat } \textup{xe}' \textup{xe} \textup{ InvMat } \frac{n}{n-p})}\]
HC2
The leverage or hat matrix is calculated as:
\[\textup{H} = \textup{X} (\textup{X'X})^{-1}\textup{X'}\]
$xe$ is scaled by $\frac{1}{1 - H}$, and HC2 is then calculated as:
\[\textup{HC2} = \sqrt{diag(\textup{InvMat } \textup{xe}' \textup{xe} \textup{ InvMat } )}\]
HC3
$xe$ is scaled by $\frac{1}{{\left( 1 - H \right)^2}}$, and HC3 is then calculated as:
\[\textup{HC3} = \sqrt{diag(\textup{InvMat } \textup{xe}' \textup{xe} \textup{ InvMat } )}\]
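A sketch of the four estimators, assuming a made-up design matrix X and residual vector e; invmat stands in for the pseudo-inverse produced by the sweep operator, and the residuals are scaled so that the squared residuals end up scaled by $\frac{1}{1-h_i}$ (HC2) and $\frac{1}{(1-h_i)^2}$ (HC3), matching the descriptions above:

```julia
using LinearAlgebra

# Made-up design matrix (intercept plus one variable) and residuals.
X = [1.0 1.0; 1.0 2.0; 1.0 3.0; 1.0 4.0; 1.0 5.0]
e = [0.10, -0.20, 0.05, 0.15, -0.10]
n, p = size(X)

invmat = pinv(X' * X)            # stand-in for the sweep pseudo-inverse
h = diag(X * invmat * X')        # leverage (diagonal of the hat matrix)

# Sandwich form: sqrt of the diagonal of InvMat * xe'xe * InvMat.
hc(xe) = sqrt.(diag(invmat * (xe' * xe) * invmat))

hc0 = hc(X .* e)
hc1 = hc(X .* e) .* sqrt(n / (n - p))
hc2 = hc(X .* (e ./ sqrt.(1 .- h)))   # squared residuals scaled by 1/(1-h)
hc3 = hc(X .* (e ./ (1 .- h)))        # squared residuals scaled by 1/(1-h)²
```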
Heteroskedasticity and autocorrelation consistent estimator (HAC)
The Newey-West estimator calculation is not documented yet. See the reference implementation and the current implementation for details.
Statistics related to the prediction
Predicting values using independent variables and a model will generate predicted values and some additional statistics, dependent on the value of the req_stats argument of the predict* functions. Here is the list of available statistics: [:predicted, :residuals, :leverage, :stdp, :stdi, :stdr, :student, :rstudent, :lcli, :ucli, :lclp, :uclp, :press, :cooksd].
Predicted
The predicted value is the sum of the independent variable(s) multiplied by the coefficients, plus the intercept (if the model has one). The predicted value is also known as the Y-hat.
Residuals
The residuals are here defined as the known response values minus the predicted values.
Leverage
The leverage for the i-th observation $x_i$, when it is not a weighted regression, is calculated as:
\[\mathrm{h_i} = \mathrm{x_i' (X' X)^{-1} x_i}\]
And when it is a weighted regression, with a vector of weights $W$ whose i-th element is $w_i$, the i-th leverage is calculated as:
\[\mathrm{h_i} = \mathrm{w_i \cdot x_i' (X' W X)^{-1} x_i}\]
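A sketch of the unweighted leverage formula with a made-up design matrix:

```julia
using LinearAlgebra

# Leverage of each observation: diagonal of X (X'X)⁻¹ X' (made-up data).
X = [1.0 1.0; 1.0 2.0; 1.0 3.0; 1.0 4.0]
h = diag(X * inv(X' * X) * X')
```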
STDP
STDP is the standard error of the mean predicted value, and is calculated as
\[\textup{STDP} = \sqrt{\hat{\sigma}^2 h_i }\]
and for a weighted regression as:
\[\textup{STDP} = \sqrt{\hat{\sigma}^2 h_i / w_i}\]
STDI
STDI is the standard error of the individual predicted value, and is calculated as
\[\textup{STDI} = \sqrt{\hat{\sigma}^2 (1 + h_i)}\]
and for a weighted regression as:
\[\textup{STDI} = \sqrt{\hat{\sigma}^2 (1 + h_i) / w_i}\]
STDR
STDR is the standard error of the residual, and is calculated as
\[\textup{STDR} = \sqrt{\hat{\sigma}^2 (1 - h_i) }\]
and for a weighted regression as:
\[\textup{STDR} = \sqrt{\hat{\sigma}^2 (1 - h_i) / w_i}\]
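A sketch of the three (unweighted) formulas with made-up values:

```julia
# Made-up error-variance estimate (σ̂²) and leverage.
sigma2, h = 0.8, 0.25
stdp = sqrt(sigma2 * h)          # standard error of the mean predicted value
stdi = sqrt(sigma2 * (1 + h))    # standard error of the individual predicted value
stdr = sqrt(sigma2 * (1 - h))    # standard error of the residual
```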
Student
Student represents the standardized residuals, calculated as the residuals divided by the standard error of the residuals (STDR).
RStudent
RStudent is the studentized residual, calculated as
\[\textup{RSTUDENT} = \textup{student} \cdot \sqrt{ \frac{n - p - 1}{n - p - \textup{student}^2}} \]
LCLI
LCLI is the lower bound of the prediction interval and is calculated as:
\[\textup{LCLI} = \mathrm{predicted} - ( \mathrm{t\_statistic} \cdot \mathrm{STDI} )\]
UCLI
UCLI is the upper bound of the prediction interval and is calculated as:
\[\textup{UCLI} = \mathrm{predicted} + ( \mathrm{t\_statistic} \cdot \mathrm{STDI} )\]
LCLP
LCLP is the lower bound of the predicted mean confidence interval and is calculated as:
\[\textup{LCLP} = \mathrm{predicted} - ( \mathrm{t\_statistic} \cdot \mathrm{STDP} )\]
UCLP
UCLP is the upper bound of the predicted mean confidence interval and is calculated as:
\[\textup{UCLP} = \mathrm{predicted} + ( \mathrm{t\_statistic} \cdot \mathrm{STDP} )\]
COOKSD
COOKSD is the Cook's Distance for each predicted value, and is calculated as
\[\textup{COOKSD} = \frac{1}{p} \cdot \frac{\textup{STDP}^2}{\textup{STDR}^2} \cdot \textup{student}^2\]
PRESS
PRESS is the predicted (leave-one-out) residual of each observation, used to form the PRESS statistic of the fitting section, and is calculated as
\[\textup{PRESS} = \frac{\textup{residuals}}{1 - \textup{leverage}}\]
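A sketch showing the per-observation PRESS residuals and how they relate to the PRESS statistic of the fitting section (made-up values):

```julia
# Made-up residuals and leverages.
residuals = [0.10, -0.20, 0.05, 0.15]
leverage  = [0.40, 0.20, 0.30, 0.10]
# Per-observation PRESS residuals ...
press_res = residuals ./ (1 .- leverage)
# ... squared and summed to give the PRESS statistic of the fitting section.
press = sum(press_res .^ 2)
```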
Type 1 SS
The Type 1 Sums of Squares are calculated as a by-product of the sweep operator.
Type 2 SS
The Type 2 Sums of Squares are calculated using the pseudo-inverse matrix. The Type 2 SS of the i-th independent variable is the square of the variable's coefficient divided by the i-th diagonal element of the pseudo-inverse matrix.
Pcorr 1 and 2
pcorr1 and pcorr2 are the squared partial correlation coefficients, calculated as:
\[\textup{pcorr1} = \frac{\textup{Type 1 SS}}{\textup{Type 1 SS}+ \textup{SSE}}\]
\[\textup{pcorr2} = \frac{\textup{Type 2 SS}}{\textup{Type 2 SS}+ \textup{SSE}}\]
When there is an intercept in the model, pcorr1 and pcorr2 are considered missing for the intercept.
Scorr 1 and 2
scorr1 and scorr2 are the squared semi-partial correlation coefficients, calculated as:
\[\textup{scorr1} = \frac{\textup{Type 1 SS}}{\textup{SST}}\]
\[\textup{scorr2} = \frac{\textup{Type 2 SS}}{\textup{SST}}\]
When there is an intercept in the model, scorr1 and scorr2 are considered missing for the intercept.
Ridge Regression and Weighted Ridge Regression
Ridge regression and weighted ridge regression are available through the ridge functions; please note that only the following statistics are available: MSE, RMSE, R2, ADJR2, and VIF. The ridge constant k can be specified as a scalar or as a range (AbstractRange). The coefficient calculation is inspired by the details given in a SAS blog post.
Let $X$ be the matrix of the independent variables after centering [and scaling] the data, let $Y$ be the vector corresponding to the [centered] dependent variable, let $D$ be the diagonal matrix whose diagonal elements are those of $X'X$, and let $Z = X D^{-1/2}$ be the correspondingly scaled design matrix. The ridge regression estimate corresponding to the ridge constant $k$ can be computed as $D^{-1/2} (Z'Z + kI)^{-1} Z'Y$.
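A sketch of this computation with made-up data, assuming $Z = X D^{-1/2}$ as described above:

```julia
using LinearAlgebra, Statistics

# Made-up data: independent variables and response.
X0 = [1.0 2.0; 2.0 3.5; 3.0 5.2; 4.0 7.9; 5.0 9.1]
y0 = [2.0, 3.1, 4.9, 6.2, 7.8]
k  = 0.1                               # ridge constant

X = X0 .- mean(X0, dims = 1)           # centered independent variables
Y = y0 .- mean(y0)                     # centered dependent variable
D = Diagonal(diag(X' * X))
Z = X * inv(sqrt(D))                   # scaled design matrix
coefs = inv(sqrt(D)) * ((Z' * Z + k * I) \ (Z' * Y))
```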
General remarks
All options and parameters can be passed as a Vector{String} or a Vector{Symbol}, or alternatively, when only one option is needed, as a single String or Symbol. For instance "all", :all, ["R2", "VIF"], or [:r2, :vif].
Functions
LinearRegressionKit.regress — Method
function regress(f::StatsModels.FormulaTerm, df::AbstractDataFrame, req_plots; α::Float64=0.05, req_stats=["default"], weights::Union{Nothing,String}=nothing, remove_missing=false, cov=[:none], contrasts=nothing, plot_args=Dict("plot_width" => 400, "loess_bw" => 0.6, "residuals_with_density" => false))
Estimate the coefficients of the regression, given a dataset and a formula, and provide the requested plot(s).
Returns a dictionary of the generated plots indexed by the description of the plots.
It is possible to indicate the width of the plots and the bandwidth of the Loess smoother.
LinearRegressionKit.regress — Method
function regress(f::StatsModels.FormulaTerm, df::DataFrames.AbstractDataFrame; α::Float64=0.05, req_stats=["default"], weights::Union{Nothing,String}=nothing, remove_missing=false, cov=[:none], contrasts=nothing)
Estimate the coefficients of the regression, given a dataset and a formula.
The formula details are provided in the StatsModels package, and the behaviour aims to be similar to what the Julia GLM package provides.
The data shall be provided as a DataFrame without missing data.
If remove_missing is set to true, a copy of the dataframe will be made and the rows with missing data will be removed.
Some robust covariance estimator(s) can be requested through the `cov` argument.
The default contrast is dummy coding; other contrasts can be requested through the `contrasts` argument.
For a weighted regression, the name of the column containing the analytical weights shall be identified by the `weights` argument.
LinearRegressionKit.predict_out_of_sample — Method
function predict_out_of_sample(lr::linRegRes, df::AbstractDataFrame; α=0.05, req_stats=["none"], dropmissingvalues=true)
Similar to `predict_in_sample` although it does not expect a response variable nor produce statistics requiring a response variable.
LinearRegressionKit.predict_in_sample — Method
function predict_in_sample(lr::linRegRes, df::AbstractDataFrame; α=0.05, req_stats=["none"], dropmissingvalues=true)
Using the estimated coefficients from the regression, make predictions and calculate related statistics.
LinearRegressionKit.kfold — Function
function kfold(f, df, k, r = 1, shuffle=true; kwargs...)
Provide a simple `k` fold(s) cross-validation, repeated `r` time(s).
`kwargs` arguments are passed to the `regress` function call.
This feature overlaps in part with the PRESS statistic.
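A usage sketch with made-up data, following the signature above (5 folds, repeated twice, with keyword arguments forwarded to the underlying regress calls):

```julia
using DataFrames, LinearRegressionKit, StatsModels

# Made-up data for illustration only.
df = DataFrame(x = rand(100), y = rand(100))
res = kfold(@formula(y ~ x), df, 5, 2; req_stats = ["default"])
```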
LinearRegressionKit.ridge — Method
function ridge(f::StatsModels.FormulaTerm, df::DataFrames.AbstractDataFrame, k::Float64 ; weights::Union{Nothing,String}=nothing, remove_missing=false, contrasts=nothing)
Ridge regression; expects a scalar k parameter (also known as the ridge constant).
When weights are provided, results in a weighted ridge regression.
LinearRegressionKit.ridge — Method
function ridge(f::StatsModels.FormulaTerm, df::DataFrames.AbstractDataFrame, ks::AbstractRange ; weights::Union{Nothing,String}=nothing, remove_missing=false, contrasts=nothing, traceplots=false)
Ridge regression; expects a range of k parameters (also known as ridge constants).
When weights are provided, results in a weighted ridge regression.
When traceplots are requested, also returns a dictionary of trace plots.
LinearRegressionKit.predict_in_sample — Method
function predict_in_sample(rr::ridgeRegRes, df::AbstractDataFrame; dropmissingvalues=true)
Using the estimated coefficients from the regression, make predictions and calculate related statistics.
LinearRegressionKit.predict_out_of_sample — Method
function predict_out_of_sample(rr::ridgeRegRes, df::AbstractDataFrame; dropmissingvalues=true)
Similar to `predict_in_sample` although it does not expect a response variable nor produce statistics requiring a response variable.
Index
LinearRegressionKit.kfold
LinearRegressionKit.predict_in_sample
LinearRegressionKit.predict_in_sample
LinearRegressionKit.predict_out_of_sample
LinearRegressionKit.predict_out_of_sample
LinearRegressionKit.regress
LinearRegressionKit.regress
LinearRegressionKit.ridge
LinearRegressionKit.ridge