Skip to content

Commit 86dadd7

Browse files
authored
Fix numerical stability issues in SVD implementation (#73)
* Fix numerical stability issues in SVD implementation - Add EPSILON constant for safe numerical comparisons - Guard Vector#normalize against zero/near-zero magnitude - Guard SVD angle calculation against division by zero - Guard SVD singular value sqrt against negative values - Guard content_node weighted_total against division by zero - Add nil checks in LSI build_index for edge cases - Add tests for zero vectors, near-singular matrices, edge cases Fixes #63 * Fix RuboCop offenses in SVD implementation * Address code review feedback - Move EPSILON constant to Vector::EPSILON to avoid polluting global namespace - Use Vector::EPSILON in content_node.rb for consistency - Remove unrelated .claude/settings.local.json from PR - Add .claude/ to .gitignore * Address additional code review feedback - Change normalize guard from `< EPSILON` to `<= 0.0` - only return zero vector for actual zero magnitude, not near-zero values - Use `val.abs` with EPSILON threshold for singular values to handle small negative values from floating point errors more robustly - Update test to reflect correct behavior: near-zero vectors normalize to unit vectors, only true zero vectors return zero * Fix RuboCop: add empty line before assertions * Address third round of code review feedback - Rename test to test_normalize_near_zero_vector_normalizes_correctly - Simplify singular value sqrt to [val, 0.0].max for uniform handling - Preserve sign relationship in content_node divisor fallback * Remove redundant comments * Use Vector::EPSILON in test for maintainability * Flatten nested conditional in content_node divisor logic
1 parent ba21e6f commit 86dadd7

File tree

6 files changed

+129
-17
lines changed

6 files changed

+129
-17
lines changed

.gitignore

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -13,3 +13,6 @@ pkg/
1313

1414
# OS files
1515
.DS_Store
16+
17+
# Claude Code local settings
18+
.claude/

lib/classifier/extensions/vector.rb

Lines changed: 20 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,9 @@ def sum_with_identity(identity = 0.0, &)
1414
end
1515
end
1616

17-
module VectorExtensions
17+
class Vector
18+
EPSILON = 1e-10
19+
1820
def magnitude
1921
sum_of_squares = 0.to_r
2022
size.times do |i|
@@ -24,19 +26,17 @@ def magnitude
2426
end
2527

2628
def normalize
29+
magnitude_value = magnitude
30+
return Vector[*Array.new(size, 0.0)] if magnitude_value <= 0.0
31+
2732
normalized_values = []
28-
magnitude_value = magnitude.to_r
2933
size.times do |i|
3034
normalized_values << (self[i] / magnitude_value)
3135
end
3236
Vector[*normalized_values]
3337
end
3438
end
3539

36-
class Vector
37-
include VectorExtensions
38-
end
39-
4040
class Matrix
4141
def self.diag(diagonal_elements)
4242
Matrix.diagonal(*diagonal_elements)
@@ -62,10 +62,15 @@ def SV_decomp(max_sweeps = 20)
6262
(1..(q_rotation_matrix.row_size - 1)).each do |col|
6363
next if row == col
6464

65-
angle = Math.atan((2.to_r * q_rotation_matrix[row,
66-
col]) / (q_rotation_matrix[row,
67-
row] - q_rotation_matrix[col,
68-
col])) / 2.0
65+
numerator = 2.0 * q_rotation_matrix[row, col]
66+
denominator = q_rotation_matrix[row, row] - q_rotation_matrix[col, col]
67+
68+
angle = if denominator.abs < Vector::EPSILON
69+
numerator >= 0 ? Math::PI / 4.0 : -Math::PI / 4.0
70+
else
71+
Math.atan(numerator / denominator) / 2.0
72+
end
73+
6974
cosine = Math.cos(angle)
7075
sine = Math.sin(angle)
7176
rotation_matrix = Matrix.identity(q_rotation_matrix.row_size)
@@ -89,11 +94,12 @@ def SV_decomp(max_sweeps = 20)
8994
break if (sum_of_differences <= 0.001 && iteration_count > 1) || iteration_count >= max_sweeps
9095
end
9196

92-
singular_values = []
93-
q_rotation_matrix.row_size.times do |r|
94-
singular_values << Math.sqrt(q_rotation_matrix[r, r].to_f)
97+
singular_values = q_rotation_matrix.row_size.times.map do |r|
98+
Math.sqrt([q_rotation_matrix[r, r].to_f, 0.0].max)
9599
end
96-
u_matrix = (row_size >= column_size ? self : trans) * v_matrix * Matrix.diagonal(*singular_values).inverse
100+
101+
safe_singular_values = singular_values.map { |v| [v, Vector::EPSILON].max }
102+
u_matrix = (row_size >= column_size ? self : trans) * v_matrix * Matrix.diagonal(*safe_singular_values).inverse
97103
[u_matrix, v_matrix, singular_values]
98104
end
99105

lib/classifier/lsi.rb

Lines changed: 7 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -143,8 +143,13 @@ def build_index(cutoff = 0.75)
143143
ntdm = build_reduced_matrix(tdm, cutoff)
144144

145145
ntdm.row_size.times do |col|
146-
doc_list[col].lsi_vector = ntdm.column(col) if doc_list[col]
147-
doc_list[col].lsi_norm = ntdm.column(col).normalize if doc_list[col]
146+
next unless doc_list[col]
147+
148+
column = ntdm.column(col)
149+
next unless column
150+
151+
doc_list[col].lsi_vector = column
152+
doc_list[col].lsi_norm = column.normalize
148153
end
149154
end
150155

lib/classifier/lsi/content_node.rb

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -60,7 +60,10 @@ def raw_vector_with(word_list)
6060
val = term_over_total * Math.log(term_over_total)
6161
weighted_total += val unless val.nan?
6262
end
63-
vec = vec.collect { |val| Math.log(val + 1) / -weighted_total }
63+
64+
sign = weighted_total.negative? ? 1.0 : -1.0
65+
divisor = sign * [weighted_total.abs, Vector::EPSILON].max
66+
vec = vec.collect { |val| Math.log(val + 1) / divisor }
6467
end
6568

6669
if Classifier::LSI.gsl_available

test/extensions/word_hash_test.rb

Lines changed: 51 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -115,6 +115,30 @@ def test_normalize_preserves_direction
115115
assert_in_delta 1.0, normalized[0], 0.001
116116
assert_in_delta 0.0, normalized[1], 0.001
117117
end
118+
119+
def test_normalize_zero_vector_returns_zero_vector
120+
vec = Vector[0, 0, 0]
121+
normalized = vec.normalize
122+
123+
assert_in_delta 0.0, normalized[0], 0.001
124+
assert_in_delta 0.0, normalized[1], 0.001
125+
assert_in_delta 0.0, normalized[2], 0.001
126+
end
127+
128+
def test_normalize_near_zero_vector_normalizes_correctly
129+
# Near-zero vectors should still normalize to unit vectors
130+
# Only actual zero vectors return zero
131+
small = Vector::EPSILON * 10
132+
vec = Vector[small, small, small]
133+
normalized = vec.normalize
134+
135+
# Should normalize to [1/sqrt(3), 1/sqrt(3), 1/sqrt(3)]
136+
expected = 1.0 / Math.sqrt(3)
137+
138+
assert_in_delta expected, normalized[0], 0.001
139+
assert_in_delta expected, normalized[1], 0.001
140+
assert_in_delta expected, normalized[2], 0.001
141+
end
118142
end
119143

120144
class MatrixExtensionsTest < Minitest::Test
@@ -137,4 +161,31 @@ def test_matrix_element_assignment
137161

138162
assert_equal 99, matrix[0, 1]
139163
end
164+
165+
def test_svd_basic
166+
matrix = Matrix[[1, 0], [0, 1], [0, 0]]
167+
_u, _v, s = matrix.SV_decomp
168+
169+
assert_equal 2, s.size
170+
assert(s.all? { |val| val >= 0 })
171+
end
172+
173+
def test_svd_with_zero_rows
174+
# Matrix with linearly dependent rows that could cause zero singular values
175+
matrix = Matrix[[1, 1], [1, 1], [0, 0]]
176+
_u, _v, s = matrix.SV_decomp
177+
178+
# Should not raise an error
179+
assert_equal 2, s.size
180+
end
181+
182+
def test_svd_handles_near_singular_matrix
183+
# Near-singular matrix that previously caused Math::DomainError
184+
matrix = Matrix[[1e-10, 0], [0, 1e-10]]
185+
186+
# Should not raise an error
187+
_u, _v, s = matrix.SV_decomp
188+
189+
assert_equal 2, s.size
190+
end
140191
end

test/lsi/lsi_test.rb

Lines changed: 44 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -263,4 +263,48 @@ def test_categories_for_nonexistent_item
263263

264264
assert_empty result, 'Should return empty array for nonexistent item'
265265
end
266+
267+
# Numerical stability tests
268+
269+
def test_identical_documents
270+
lsi = Classifier::LSI.new auto_rebuild: false
271+
272+
# Identical documents could cause zero singular values
273+
lsi.add_item 'Exactly the same text', 'A'
274+
lsi.add_item 'Exactly the same text', 'A'
275+
lsi.add_item 'Different content here', 'B'
276+
277+
# Should handle gracefully without crashing
278+
lsi.build_index
279+
280+
refute_predicate lsi, :needs_rebuild?
281+
end
282+
283+
def test_single_word_documents
284+
lsi = Classifier::LSI.new auto_rebuild: false
285+
286+
# Very short documents could cause edge cases
287+
lsi.add_item 'dog', 'Animal'
288+
lsi.add_item 'cat', 'Animal'
289+
lsi.add_item 'car', 'Vehicle'
290+
291+
# Should handle gracefully
292+
lsi.build_index
293+
294+
refute_predicate lsi, :needs_rebuild?
295+
end
296+
297+
def test_empty_word_hash_handling
298+
lsi = Classifier::LSI.new auto_rebuild: false
299+
300+
# Documents with only stop words result in empty word hashes
301+
lsi.add_item 'the a an', 'StopWords'
302+
lsi.add_item 'Dogs are great', 'Animal'
303+
lsi.add_item 'Cats are nice', 'Animal'
304+
305+
# Should handle gracefully
306+
lsi.build_index
307+
308+
refute_predicate lsi, :needs_rebuild?
309+
end
266310
end

0 commit comments

Comments
 (0)