
Add multilinear regression#488

Open
insel-maz wants to merge 4 commits into markrogoyski:master from krumedia:multilinear-regression

Conversation

@insel-maz

This PR introduces MathPHP\Statistics\Regression\Multilinear for performing multilinear regression. The change is intended to be non-breaking.

  • The Regression constructor now additionally accepts a list of features (a list of x-vectors).
  • MathPHP\LinearAlgebra\MatrixFactory::vandermonde() now utilizes the new MonomialExponentGenerator.

If you would like me to make any further changes, please let me know.

Thank you very much for your comprehensive library!

@coveralls

coveralls commented Jan 14, 2026

Coverage Status

coverage: 99.457% (-0.2%) from 99.66%
when pulling 1a4d424 on krumedia:multilinear-regression
into c064e80 on markrogoyski:master.

@markrogoyski
Owner

Hi @insel-maz,

Thank you for your interest in MathPHP! And thank you for your pull request.

Please allow me some time to review the PR and understand the new functionality.

Thanks again,
Mark

@insel-maz
Author

Hi @markrogoyski,

I've just pushed a new commit to update some code to match the new shape of $points that I introduced earlier.
Do you think this new structure is suitable, or would you prefer a different approach?
Also, I'm wondering why the new unit tests aren't achieving higher code coverage. Any ideas?

Thank you

@markrogoyski
Owner

Hi @insel-maz,

Apologies for the delay in reviewing this. It is on my to-do list, and I will try to get back to you soon with feedback.

Thanks again for the contribution.
Mark

@markrogoyski
Owner

Sorry for the delay in reviewing. Thanks again for this contribution.

The core coefficient computation (design matrix + OLS) seems correct and well-structured.

I think there may be one issue with regression statistics: $this->p in LeastSquares::leastSquares() is set to $order (which defaults to 1), but for multilinear regression with k features, the number of predictors is k, not 1. This means degrees of freedom, mean squares, F-statistic, and Cook's D are all wrong when k > 1. The coefficients, R², residuals, and predictions are unaffected.

Here's a minimal test that demonstrates the problem:

  /**
   * Demonstrates that regression statistics are wrong for k > 1.
   * Oracle: statsmodels OLS, scikit-learn LinearRegression
   */
  public function testDegreesOfFreedomAndStatistics(): void
  {
      // y ≈ 2x₁ + 3x₂ + 5 with noise, k=2 features, n=10 points
      $points = [
          [[1.0, 2.0], 13.3], [[3.0, 1.0], 13.8],
          [[2.0, 4.0], 21.5], [[5.0, 2.0], 20.9],
          [[4.0, 3.0], 22.4], [[1.0, 5.0], 21.7],
          [[6.0, 1.0], 20.2], [[3.0, 3.0], 20.1],
          [[2.0, 6.0], 26.6], [[4.0, 4.0], 25.3],
      ];
      $regression = new Multilinear($points);

      // Degrees of freedom: should be n - k - 1 = 10 - 2 - 1 = 7
      // Currently returns 8 (n - 1 - 1, using p=1 instead of p=k)
      $this->assertEquals(7, $regression->degreesOfFreedom());

      // F-statistic: should be 705.22 (statsmodels OLS)
      // Currently returns ~1612 (wrong MSR and MSE)
      $this->assertEqualsWithDelta(705.2243, $regression->fStatistic(), 0.001);
  }

The root cause is on line 112 of LeastSquares.php:

$this->p = $order;  // always 1 for multilinear, but should be k

$order serves double duty. It's both the polynomial degree (used by vandermonde()) and the number of predictors (used by degrees of freedom / mean squares). For polynomial regression these are the same, but for multilinear regression they differ.

A possible fix: derive p from the actual design matrix size after construction, instead of from $order:

$X = $this->createDesignMatrix($xs);
$this->p = $X->getN() - $this->fit_constant;  // actual number of predictors
$this->ν = $this->n - $this->p - $this->fit_constant;

I think this would work for both polynomial and multilinear cases since the design matrix always has the correct number of columns.
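As an independent sanity check of the expected values (a sketch in Python with numpy, outside MathPHP), the degrees of freedom and F-statistic for the test data above can be recomputed directly from the design matrix:

```python
# Cross-check: recompute the multilinear OLS fit and the k > 1
# regression statistics with numpy, using the same 10 points as the test.
import numpy as np

# y ≈ 2*x1 + 3*x2 + 5 with noise; n = 10 points, k = 2 predictors
X_raw = np.array([[1, 2], [3, 1], [2, 4], [5, 2], [4, 3],
                  [1, 5], [6, 1], [3, 3], [2, 6], [4, 4]], dtype=float)
y = np.array([13.3, 13.8, 21.5, 20.9, 22.4, 21.7, 20.2, 20.1, 26.6, 25.3])

n, k = X_raw.shape
X = np.hstack([np.ones((n, 1)), X_raw])  # design matrix with intercept column

beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS coefficients [b0, b1, b2]

residuals = y - X @ beta
sse = residuals @ residuals              # sum of squared errors
sst = np.sum((y - y.mean()) ** 2)        # total sum of squares
ssr = sst - sse                          # regression sum of squares

dof = n - k - 1                          # 10 - 2 - 1 = 7, not n - 1 - 1 = 8
f_stat = (ssr / k) / (sse / dof)         # MSR / MSE

print(dof)     # 7
print(f_stat)  # ≈ 705.22, matching the statsmodels oracle
```

The recovered coefficients also land close to the generating model (b1 ≈ 2, b2 ≈ 3), which confirms the coefficient path is fine and only the statistics that depend on p are affected.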

I also noticed that standardErrors(), tValues(), and tProbability() return hardcoded ['m', 'b'] keys which only makes sense for simple linear regression. They'd need to be generalized for k > 1 features.
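For what a generalization could look like, here is a sketch (again in Python/numpy, with hypothetical variable names, not MathPHP's API): one standard error and t-value per design-matrix column, rather than fixed 'm' and 'b' keys:

```python
# Sketch: per-coefficient standard errors and t-values for k predictors,
# computed as sqrt(MSE * diag((X'X)^-1)).
import numpy as np

X_raw = np.array([[1, 2], [3, 1], [2, 4], [5, 2], [4, 3],
                  [1, 5], [6, 1], [3, 3], [2, 6], [4, 4]], dtype=float)
y = np.array([13.3, 13.8, 21.5, 20.9, 22.4, 21.7, 20.2, 20.1, 26.6, 25.3])

n, k = X_raw.shape
X = np.hstack([np.ones((n, 1)), X_raw])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta
mse = (residuals @ residuals) / (n - k - 1)  # SSE / (n - k - 1)

# Coefficient covariance matrix; its diagonal gives the squared SEs.
cov = mse * np.linalg.inv(X.T @ X)
std_errors = np.sqrt(np.diag(cov))           # length k + 1: [b0, b1, ..., bk]
t_values = beta / std_errors

print(len(std_errors))  # 3: intercept + 2 predictors, not just 'm' and 'b'
```

In PHP terms this would mean returning an indexed array of length k + 1 (or keyed by coefficient index) instead of the fixed `['m', 'b']` pair.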

I can provide additional unit tests to help verify what the expected results should be if that would be helpful.

