In some applications, we may require that the LD matrix (SNP-by-SNP correlation matrix) be positive definite (PD) or positive semi-definite (PSD). Very often, this is not the case, which may result in instabilities for some downstream models. Therefore, it would be great to augment our LD functionalities to check for PSD and to also find the nearest PSD matrix to the sample LD matrix.
A good starting point may be the code snippet in this StackOverflow thread. I modified this function to find the nearest PSD while retaining the sparsity structure in the original matrix:
from numpy import linalg as la
def nearestPD(A):
"""Find the nearest positive-definite matrix to input
A Python/Numpy port of John D'Errico's `nearestSPD` MATLAB code [1], which
credits [2].
Credit: https://stackoverflow.com/a/43244194
Modified by Shadi Zabad to retain the sparsity structure in the original matrix.
[1] https://www.mathworks.com/matlabcentral/fileexchange/42885-nearestspd
[2] N.J. Higham, "Computing a nearest symmetric positive semidefinite
matrix" (1988): https://doi.org/10.1016/0024-3795(88)90223-6
"""
np.fill_diagonal(A, 1.)
B = (A + A.T) / 2
_, s, V = la.svd(B)
H = np.dot(V.T, np.dot(np.diag(s), V))
A2 = (B + H) / 2
A3 = (A2 + A2.T) / 2
A3[A == 0.] = 0.
if isPD(A3):
return A3
spacing = np.spacing(la.norm(A))
# The above is different from [1]. It appears that MATLAB's `chol` Cholesky
# decomposition will accept matrixes with exactly 0-eigenvalue, whereas
# Numpy's will not. So where [1] uses `eps(mineig)` (where `eps` is Matlab
# for `np.spacing`), we use the above definition. CAVEAT: our `spacing`
# will be much larger than [1]'s `eps(mineig)`, since `mineig` is usually on
# the order of 1e-16, and `eps(1e-16)` is on the order of 1e-34, whereas
# `spacing` will, for Gaussian random matrixes of small dimension, be on
# othe order of 1e-16. In practice, both ways converge, as the unit test
# below suggests.
I = np.eye(A.shape[0])
k = 1
while not isPD(A3):
mineig = np.min(np.real(la.eigvals(A3)))
A3 += I * (-mineig * k**2 + spacing)
A3[A == 0.] = 0.
k += 1
return A3
def isPD(B):
"""Returns true when input is positive-definite, via Cholesky"""
try:
_ = la.cholesky(B)
return True
except la.LinAlgError:
return False
The main bottleneck here is that we need to perform SVD on the LD matrix, which may be computationally expensive, even with sparse arrays. One way to get around this difficulty is to deploy this within LD blocks, which may be more manageable.
To experiment with this, we can try to tackle the following tasks:
In some applications, we may require that the LD matrix (SNP-by-SNP correlation matrix) be positive definite (PD) or positive semi-definite (PSD). Very often, this is not the case, which may result in instabilities for some downstream models. Therefore, it would be great to augment our LD functionalities to check for PSD and to also find the nearest PSD matrix to the sample LD matrix.
A good starting point may be the code snippet in this StackOverflow thread. I modified this function to find the nearest PSD while retaining the sparsity structure in the original matrix:
The main bottleneck here is that we need to perform SVD on the LD matrix, which may be computationally expensive, even with sparse arrays. One way to get around this difficulty is to deploy this within LD blocks, which may be more manageable.
To experiment with this, we can try to tackle the following tasks:
LDMatrixcorresponding to the LD blocks defined by the block estimator.viprs).