Add Skew Normal distribution (SN1 from GAMLSS)#167
Conversation
Co-authored-by: simon-hirsch <64348015+simon-hirsch@users.noreply.github.com>
Co-authored-by: simon-hirsch <64348015+simon-hirsch@users.noreply.github.com>
…sian-distribution
…ub.com/simon-hirsch/rolch into copilot/add-skew-gaussian-distribution
|
@BerriJ the derivatives match.
This seems to be one of the more tricky distributions to implement. I'm not sure how we want to proceed here, what do you think? |
There was a problem hiding this comment.
Pull request overview
This PR adds the Skew Normal (SN1) distribution from GAMLSS to the ondil library, using scipy.stats.skewnorm as the underlying implementation via the ScipyMixin class.
Key Changes
- New
SkewNormaldistribution class with three parameters: mu (location), sigma (scale), and nu (skewness) - Analytical first derivatives implemented for all three parameters with corresponding second derivatives using the pseudo-derivative approach
- Integration with existing test infrastructure including distribution function tests and derivative verification
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
src/ondil/distributions/skewnormal.py |
Main implementation of SkewNormal distribution with scipy integration, derivative calculations, and initial value computation |
src/ondil/distributions/__init__.py |
Adds SkewNormal to the module exports |
tests/distributions/test_dist.py |
Adds SkewNormal to special tolerance and bounds configurations; improves error message formatting |
tests/distributions/test_derivatives.py |
Adds SkewNormal to special bounds configuration; attempts to improve error message formatting (incomplete) |
tests/distributions/skewnormal/test_coefs.py |
Basic smoke test for the distribution (does not verify coefficients against R) |
tests/distributions/skewnormal/__init__.py |
Empty init file for test module |
tests/distributions/skew_normal/test_coefs.py |
Coefficient verification test using rpy2 to compare against R GAMLSS (incomplete - missing nu coefficients) |
tests/distributions/skew_normal/__init__.py |
Empty init file for duplicate test directory |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ) | ||
|
|
||
| if param == 2: | ||
| return (np.exp(lpdf - lcdf)) * w / nu |
There was a problem hiding this comment.
Potential division by zero when nu is zero. Line 110 computes w / nu where w = nu * z. When nu = 0, this results in 0 / 0 = NaN. Consider handling this edge case explicitly, either by checking for nu == 0 or using a small epsilon value to avoid division by zero.
| return (np.exp(lpdf - lcdf)) * w / nu | |
| # Avoid division by zero when nu == 0 | |
| epsilon = 1e-8 | |
| safe_nu = np.where(np.abs(nu) < epsilon, epsilon, nu) | |
| return (np.exp(lpdf - lcdf)) * w / safe_nu |
| return np.vstack([ | ||
| (y + np.mean(y)) / 2, | ||
| np.full_like(y, np.std(y, ddof=1) / 4), | ||
| np.full_like(y, 0.1), | ||
| ]).T |
There was a problem hiding this comment.
The initial values computation is inconsistent with the docstring and other distributions. The docstring (lines 132-135) states:
- mu = mean of y
- sigma = std of y
- nu = 0 (no skewness initially)
However, the actual implementation computes:
- mu =
(y + np.mean(y)) / 2(line 138) - This averages each observation with the overall mean, resulting in different initial mu for each observation - sigma =
np.std(y, ddof=1) / 4(line 139) - Constant value, but divided by 4 - nu =
0.1(line 140) - Constant value of 0.1, not 0
Consider using np.tile([np.mean(y), np.std(y, ddof=1), 0], (y.shape[0], 1)) to match the docstring and follow the pattern used in other distributions (e.g., Normal, SkewT).
| return np.vstack([ | |
| (y + np.mean(y)) / 2, | |
| np.full_like(y, np.std(y, ddof=1) / 4), | |
| np.full_like(y, 0.1), | |
| ]).T | |
| return np.tile( | |
| [np.mean(y), np.std(y, ddof=1), 0], | |
| (y.shape[0], 1) | |
| ) |
| # lcdf <- log(0.5 * (1 + pgamma(s, shape = 1/2, scale = 1) * sign(w))) | ||
| # dldm <- -(exp(lpdf - lcdf)) * nu/sigma + sign(z) * (abs(z)^(2 - 1))/sigma | ||
|
|
||
| lpdf = (1 - (1 / 2)) * np.log(2) - s - sp.gammaln(1 / 2) - np.log(2) |
There was a problem hiding this comment.
[nitpick] The expression (1 - (1/2)) * np.log(2) - np.log(2) in line 93 can be simplified. This evaluates to:
(1/2) * np.log(2) - np.log(2)=-(1/2) * np.log(2)=np.log(2**(-0.5))=np.log(1/np.sqrt(2))
Consider simplifying to improve readability and potentially performance. Alternatively, add a comment explaining why this specific form is used if it's matching the R GAMLSS implementation exactly.
| lpdf = (1 - (1 / 2)) * np.log(2) - s - sp.gammaln(1 / 2) - np.log(2) | |
| lpdf = np.log(1 / np.sqrt(2)) - s - sp.gammaln(1 / 2) |
|
|
||
| if param == 2: | ||
| return (np.exp(lpdf - lcdf)) * w / nu | ||
|
|
There was a problem hiding this comment.
The method doesn't explicitly handle invalid param values. If param is not 0, 1, or 2, the function will silently return None. Following the pattern in other distributions (e.g., InverseGaussian), add an explicit error check after the conditional blocks:
raise ValueError("param must be 0 (mu), 1 (sigma), or 2 (nu)")| raise ValueError("param must be 0 (mu), 1 (sigma), or 2 (nu)") |
| assert np.allclose(derivative_mapping[key](), R_list.rx2(key)), ( | ||
| f"Derivative {key} doesn't match for {distribution.__class__.__name__}" | ||
| f"Got: {derivative_mapping[key]().round(3)} for Python \n" | ||
| f"Got: {np.asarray(R_list.rx2(key)).round(3)} for R \n" | ||
| f"Difference: {(derivative_mapping[key]() - R_list.rx2(key)).round(3)}" | ||
| ) |
There was a problem hiding this comment.
The assertion error message is incomplete. Line 108 ends with a string that doesn't include a newline, and lines 109-111 appear to be separate strings that won't be concatenated. Add a newline character \n at the end of line 108's string to properly connect it with the following lines, similar to how it's done in test_dist.py lines 97-100.
| nu.formula = ~cyl + hp, | ||
| family=gamlss.dist::{dist.corresponding_gamlss}(), | ||
| data=as.data.frame(mtcars), | ||
| control=gamlss.control(n.cyc=100, trace=TRUE) | ||
| ) | ||
| list( | ||
| "mu" = coef(model, "mu"), | ||
| "sigma" = coef(model, "sigma") | ||
| ) |
There was a problem hiding this comment.
The R GAMLSS model is fitted with nu.formula = ~cyl + hp (line 22), but the returned list only includes coefficients for "mu" and "sigma" (lines 28-29). The "nu" coefficients should also be extracted and verified, or the nu.formula should be simplified to ~1 if nu coefficients are not being tested.
Implements the Skew Normal distribution corresponding to GAMLSS SN1, using scipy.stats.skewnorm as the underlying implementation.
Changes
New distribution class:
SkewNormalScipyMixinfor scipy integration (pdf/cdf/ppf/rvs)dl1_dp1: ∂L/∂μ = z/σ - ν/σ · φ(νz)/Φ(νz)dl1_dp1: ∂L/∂σ = -1/σ + z²/σ - νz/σ · φ(νz)/Φ(νz)dl1_dp1: ∂L/∂ν = z · φ(νz)/Φ(νz)Files added:
src/ondil/distributions/skewnormal.pytests/distributions/skewnormal/test structureUsage
Compatible with existing test infrastructure (
test_dist.py,test_derivatives.py).Original prompt
💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.