A Python implementation of multivariable linear regression with a focus on the statistical analysis of the coefficients.
The project provides a Python implementation of ordinary least squares (OLS) linear regression. Its purpose is purely pedagogical, and its interface closely mirrors scikit-learn's `LinearRegression` class. The implementation emphasizes the statistical analysis of the regression coefficients: in particular, the code computes the standard errors, confidence intervals and $p$-values of the coefficients.
- `linear_regression/`: the linear regression modules.
  - Contains the multivariable linear regression module. See also Module architecture below.
- `notebooks/`: notebooks demonstrating the modules.
  - `LinearRegression.ipynb`: notebook discussing the basics of linear regression.
Description of the linear_regression module architecture.
- `linear_regression/__init__.py`
  - Initialises the module.
  - Imports the `LinearRegressor` class, for multivariable linear regression.
- `linear_regression/multivariant_linear_regression.py`: defines the `LinearRegressor` class with the methods `fit`, `predict`, `score`, `get_params` and `regression_report`.
  - The `LinearRegressor` class:
    - performs a multivariable linear regression by minimizing the residual sum of squares, using the SVD method to solve the linear system;
    - predicts the values $\hat{y}$ of the dependent variable $y$ from the predictor values $X$, using the coefficient estimates;
    - produces a statistical analysis of the coefficient estimates;
    - has the following methods:
      - `fit` fits the linear model,
      - `predict` predicts using the linear model,
      - `score` returns the coefficient of determination $R^2$ and the $F$-statistic,
      - `get_params` gets the estimates of the regression coefficients,
      - `regression_report` returns a report on the statistical analysis of the estimated coefficients, with the confidence interval at the 95% confidence level, the standard error and the $p$-value against the null hypothesis.
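The SVD-based least-squares step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code of `LinearRegressor`: the function names `fit_ols_svd` and `predict_ols`, the rank tolerance, and the synthetic data (including `true_beta`) are all assumptions made for the example.

```python
import numpy as np


def fit_ols_svd(X, y):
    """Return [intercept, coef_1, ..., coef_p] minimizing the residual sum of squares.

    Hypothetical helper for illustration; not part of the repository's API.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that the first coefficient is the intercept.
    A = np.column_stack([np.ones(X.shape[0]), X])
    # Thin SVD: A = U @ diag(s) @ Vt.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Drop directions with negligible singular values (rank-deficient designs).
    tol = np.finfo(float).eps * max(A.shape) * s.max()
    keep = s > tol
    # Least-squares solution: beta = V @ diag(1/s) @ U.T @ y.
    return Vt[keep].T @ ((U[:, keep].T @ y) / s[keep])


def predict_ols(beta, X):
    """Predict y_hat = intercept + X @ coefficients."""
    X = np.asarray(X, dtype=float)
    return beta[0] + X @ beta[1:]


# Synthetic data with made-up coefficients (assumption, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_beta = np.array([1.0, 2.0, -3.0, 0.5])
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.1, size=200)

beta_hat = fit_ols_svd(X, y)
y_hat = predict_ols(beta_hat, X)
```

Solving through the SVD avoids forming $X^T X$ explicitly, which tends to be more stable numerically when the predictors are nearly collinear.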
To test and illustrate the `LinearRegressor` class, we generate artificial data from a linear model with an intercept and three predictors.
Using the method `score`, the class returns a statistical analysis of the model by computing the residual standard error, the coefficient of determination $R^2$, the $F$-statistic and its $p$-value.
| quantity | value |
|---|---|
| residual std. error | 1.0029 |
| $R^2$ | 0.8469 |
| $F$-statistic | 177.0768 |
| $p$-value | 0.0 |
Table 1: Scores for the linear regression problem.
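The quantities in Table 1 follow from the residual and total sums of squares. The sketch below shows one common way to compute them with NumPy and SciPy; the function name `regression_scores`, the returned dictionary keys and the synthetic data are assumptions for illustration and need not match what `score` does internally.

```python
import numpy as np
from scipy import stats


def regression_scores(X, y):
    """Residual std. error, R^2, F-statistic and its p-value for an OLS fit.

    Hypothetical helper for illustration; not part of the repository's API.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])            # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)    # SVD-based least squares
    residuals = y - A @ beta
    rss = float(residuals @ residuals)              # residual sum of squares
    tss = float(np.sum((y - y.mean()) ** 2))        # total sum of squares
    dof = n - p - 1                                 # residual degrees of freedom
    f_stat = ((tss - rss) / p) / (rss / dof)
    return {
        "residual_std_error": np.sqrt(rss / dof),
        "r2": 1.0 - rss / tss,
        "f_statistic": f_stat,
        # Upper tail of the F(p, dof) distribution.
        "p_value": stats.f.sf(f_stat, p, dof),
    }


# Synthetic data with made-up coefficients and unit-variance noise (assumption).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=100)

scores = regression_scores(X, y)
```

The $F$-statistic tests the null hypothesis that all slope coefficients are zero, so a very small $p$-value, as in Table 1, indicates that at least one predictor is related to the response.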
The method `regression_report` returns a report on the statistical analysis of each estimated coefficient. It gives the confidence interval at the 95% confidence level, the standard error and the $p$-value against the null hypothesis.
| | coefficient | confidence interval @ 95.0% | std. error | $p$-value |
|---|---|---|---|---|
| intercept | 2.3718 | [1.7778, 2.9657] | 0.2992 | < 0.0001 |
| coef_1 | 2.8989 | [2.1736, 3.6241] | 0.3654 | < 0.0001 |
| coef_2 | 7.5535 | [6.8090, 8.2980] | 0.3750 | < 0.0001 |
| coef_3 | 1.6406 | [0.9870, 2.2941] | 0.3292 | < 0.0001 |
Table 2: Regression report for linear regression problem.
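A report like Table 2 can be produced from the estimated coefficient covariance matrix $\hat{\sigma}^2 (A^T A)^{-1}$, where $A$ is the design matrix with an intercept column. The following is a minimal sketch using standard $t$-based inference; the function name `coefficient_report`, its `level` parameter, the dictionary keys and the synthetic data are assumptions for the example, not the signature of `regression_report`.

```python
import numpy as np
from scipy import stats


def coefficient_report(X, y, level=0.95):
    """Estimates, standard errors, confidence intervals and two-sided p-values.

    Hypothetical helper for illustration; not part of the repository's API.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])            # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta
    dof = n - p - 1
    sigma2 = float(residuals @ residuals) / dof     # unbiased noise variance estimate
    cov = sigma2 * np.linalg.pinv(A.T @ A)          # covariance of the estimates
    se = np.sqrt(np.diag(cov))
    t_crit = stats.t.ppf(0.5 + level / 2.0, dof)    # two-sided critical value
    t_stat = beta / se                              # test of H0: coefficient = 0
    return {
        "estimate": beta,
        "std_error": se,
        "lower": beta - t_crit * se,
        "upper": beta + t_crit * se,
        "p_value": 2.0 * stats.t.sf(np.abs(t_stat), dof),
    }


# Synthetic data with made-up coefficients and unit-variance noise (assumption).
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = 1.0 + X @ np.array([2.0, 0.0, 0.5]) + rng.normal(size=100)

report = coefficient_report(X, y)
```

As a rough consistency check on Table 2, the intercept interval has half-width $(2.9657 - 1.7778)/2 \approx 0.594 \approx 1.985 \times 0.2992$, which is the size one would expect from a $t$ quantile with on the order of a hundred residual degrees of freedom.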
