Commit 2bdfc9d

feat(lsi): add hash-style API for adding items (#101)
* test: add property-based tests for probabilistic invariants

  Add comprehensive property-based testing using the Rantly gem to verify that probabilistic invariants hold across random inputs. This addresses the gap in test coverage for edge cases and mathematical properties.

  Tests added for Bayes classifier:
  - Classification determinism (same input = same output)
  - Training order independence (commutativity)
  - Train/untrain inverse property
  - Word and category counts never go negative
  - Log probabilities are always finite
  - Multiple training equivalence

  Tests added for LSI:
  - Classification, search, and find_related determinism
  - Graceful handling of uncategorized items
  - Consistency after index rebuild

  Tests added for category operations:
  - Add/remove category consistency
  - Training data isolation between categories

  Closes #70

* style: fix RuboCop offenses in property tests

* refactor: remove redundant comments from property tests

* ci: drop Ruby 3.2 from test matrix

* feat(lsi): add hash-style API for adding items

  The existing add_item API had confusing positional arguments - it wasn't clear whether the first argument was the key or the content. The new add() method uses hash syntax to make the relationship explicit:

      lsi.add("Ruby programming" => doc1)
      lsi.add("Java development" => [doc2, :programming])

  This also enables batch operations naturally:

      lsi.add(
        "Ruby programming" => doc1,
        "Java development" => doc2
      )

  The add_item method is preserved but marked as deprecated for backward compatibility.

  Closes #100

* style: fix RuboCop symbol array offenses

* fix(lsi): correct add API to use category => items syntax

  The hash key should be the category, with values being items (or arrays of items) that belong to that category. This mirrors the Bayes API:

      lsi.add("Dog" => "Dogs are loyal pets")
      lsi.add("Dog" => ["Puppies are cute", "Canines are friendly"])
      lsi.add(
        "Dog" => ["Dogs are loyal", "Puppies are cute"],
        "Cat" => ["Cats are independent", "Kittens are playful"]
      )

* refactor(lsi): simplify add method with Array()
1 parent 7d70749 commit 2bdfc9d

3 files changed: 146 additions and 7 deletions

README.md

Lines changed: 15 additions & 7 deletions
@@ -115,19 +115,27 @@ require 'classifier'
 
 lsi = Classifier::LSI.new
 
-# Add documents with categories
-lsi.add_item "Dogs are loyal pets that love to play fetch", :pets
-lsi.add_item "Cats are independent and love to nap", :pets
-lsi.add_item "Ruby is a dynamic programming language", :programming
-lsi.add_item "Python is great for data science", :programming
+# Add documents with hash-style syntax (category => item(s))
+lsi.add("Pets" => "Dogs are loyal pets that love to play fetch")
+lsi.add("Pets" => "Cats are independent and love to nap")
+lsi.add("Programming" => "Ruby is a dynamic programming language")
+
+# Add multiple items with the same category
+lsi.add("Programming" => ["Python is great for data science", "JavaScript runs in browsers"])
+
+# Batch operations with multiple categories
+lsi.add(
+  "Pets" => ["Hamsters are small furry pets", "Birds can be great companions"],
+  "Programming" => "Go is fast and concurrent"
+)
 
 # Classify new text
 lsi.classify "My puppy loves to run around"
-# => :pets
+# => "Pets"
 
 # Get classification with confidence score
 lsi.classify_with_confidence "Learning to code in Ruby"
-# => [:programming, 0.89]
+# => ["Programming", 0.89]
 ```
 
 ### Search and Discovery
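
The README change replaces four `add_item` calls with grouped `add` calls and switches the expected results from symbols (`:pets`) to strings (`"Pets"`). A minimal sketch of that regrouping for existing call sites follows. `group_for_add` is a hypothetical helper written for illustration, not part of the classifier gem, and it stringifies categories only to mirror the `category.to_s` call in the new `add` method.

```ruby
# Hypothetical helper (not part of the classifier gem): groups legacy
# [text, category] pairs into a category => [texts] hash.
def group_for_add(pairs)
  pairs.each_with_object({}) do |(text, category), grouped|
    # Stringify the category, as add does via category.to_s.
    (grouped[category.to_s] ||= []) << text
  end
end

legacy_pairs = [
  ['Dogs are loyal pets that love to play fetch', :pets],
  ['Cats are independent and love to nap', :pets],
  ['Ruby is a dynamic programming language', :programming]
]

grouped = group_for_add(legacy_pairs)
grouped.each { |category, texts| puts "#{category}: #{texts.size} item(s)" }
# With the gem loaded, the hash could then be passed as: lsi.add(**grouped)
```

Code that compares `classify` results against symbols would need updating when migrating, since categories added through `add` are stored as strings.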

lib/classifier/lsi.rb

Lines changed: 27 additions & 0 deletions
@@ -122,12 +122,39 @@ def singular_value_spectrum
       end
     end
 
+    # Adds items to the index using hash-style syntax.
+    # The hash keys are categories, and values are items (or arrays of items).
+    #
+    # For example:
+    #   lsi = Classifier::LSI.new
+    #   lsi.add("Dog" => "Dogs are loyal pets")
+    #   lsi.add("Cat" => "Cats are independent")
+    #   lsi.add(Bird: "Birds can fly")  # Symbol keys work too
+    #
+    # Multiple items with the same category:
+    #   lsi.add("Dog" => ["Dogs are loyal", "Puppies are cute"])
+    #
+    # Batch operations with multiple categories:
+    #   lsi.add(
+    #     "Dog" => ["Dogs are loyal", "Puppies are cute"],
+    #     "Cat" => ["Cats are independent", "Kittens are playful"]
+    #   )
+    #
+    # @rbs (**untyped items) -> void
+    def add(**items)
+      items.each do |category, value|
+        Array(value).each { |doc| add_item(doc, category.to_s) }
+      end
+    end
+
     # Adds an item to the index. item is assumed to be a string, but
     # any item may be indexed so long as it responds to #to_s or if
     # you provide an optional block explaining how the indexer can
     # fetch fresh string data. This optional block is passed the item,
     # so the item may only be a reference to a URL or file name.
     #
+    # @deprecated Use {#add} instead for clearer hash-style syntax.
+    #
     # For example:
     #   lsi = Classifier::LSI.new
     #   lsi.add_item "This is just plain text"
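
The new `add` method relies on `Array(value)` to treat a single item and an array of items the same way. The sketch below shows how that normalization behaves on a few inputs. `SketchIndex` is a stand-in class invented for this example so it runs without the gem: its `add` body is copied from the diff, but its `add_item` is a simplified assumption that only records categories.

```ruby
# Stand-in for Classifier::LSI (hypothetical, for illustration only).
class SketchIndex
  attr_reader :items

  def initialize
    @items = {}
  end

  # Simplified assumption: the real add_item also builds index data.
  def add_item(item, *categories)
    @items[item] = categories
  end

  # Body copied from the diff above.
  def add(**items)
    items.each do |category, value|
      Array(value).each { |doc| add_item(doc, category.to_s) }
    end
  end
end

index = SketchIndex.new
index.add('Dog' => 'Dogs are loyal')            # single item
index.add(Cat: ['Cats nap', 'Kittens play'])    # array, symbol key
index.add('Empty' => nil)                       # Array(nil) is [], so nothing is added

p index.items.size # => 3
p Array('one')             # => ["one"]
p Array(nil)               # => []
p Array({ title: 'doc' })  # => [[:title, "doc"]]
```

The last line is worth noting: `Array()` converts a Hash into key/value pairs, so an item that is itself a Hash would be split apart rather than added as one document. Wrapping such an item in an explicit array avoids this.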

test/lsi/lsi_test.rb

Lines changed: 104 additions & 0 deletions
@@ -11,6 +11,110 @@ def setup
     @str5 = 'This text involves birds. Birds.'
   end
 
+  # Hash-style add API tests (Issue #100)
+
+  def test_add_with_hash_syntax
+    lsi = Classifier::LSI.new
+    lsi.add('Dog' => 'Dogs are loyal pets')
+    lsi.add('Cat' => 'Cats are independent')
+
+    assert_equal 2, lsi.items.size
+    assert_includes lsi.items, 'Dogs are loyal pets'
+    assert_includes lsi.items, 'Cats are independent'
+  end
+
+  def test_add_with_symbol_keys
+    lsi = Classifier::LSI.new
+    lsi.add(Dog: 'Dogs are loyal', Cat: 'Cats are independent')
+
+    assert_equal 2, lsi.items.size
+    assert_equal ['Dog'], lsi.categories_for('Dogs are loyal')
+    assert_equal ['Cat'], lsi.categories_for('Cats are independent')
+  end
+
+  def test_add_multiple_items_same_category
+    lsi = Classifier::LSI.new
+    lsi.add('Dog' => ['Dogs are loyal', 'Puppies are cute', 'Canines are friendly'])
+
+    assert_equal 3, lsi.items.size
+    assert_equal ['Dog'], lsi.categories_for('Dogs are loyal')
+    assert_equal ['Dog'], lsi.categories_for('Puppies are cute')
+    assert_equal ['Dog'], lsi.categories_for('Canines are friendly')
+  end
+
+  def test_add_batch_operations
+    lsi = Classifier::LSI.new
+    lsi.add(
+      'Dog' => ['Dogs are loyal', 'Puppies are cute'],
+      'Cat' => ['Cats are independent', 'Kittens are playful']
+    )
+
+    assert_equal 4, lsi.items.size
+    assert_equal ['Dog'], lsi.categories_for('Dogs are loyal')
+    assert_equal ['Cat'], lsi.categories_for('Cats are independent')
+  end
+
+  def test_add_classification_works
+    lsi = Classifier::LSI.new
+    lsi.add(
+      'Dog' => @str2,
+      'Cat' => [@str3, @str4],
+      'Bird' => @str5
+    )
+
+    assert_equal 'Dog', lsi.classify(@str1)
+    assert_equal 'Cat', lsi.classify(@str3)
+    assert_equal 'Bird', lsi.classify(@str5)
+  end
+
+  def test_add_find_related_works
+    lsi = Classifier::LSI.new
+    lsi.add(
+      'Dog' => [@str1, @str2],
+      'Cat' => [@str3, @str4],
+      'Bird' => @str5
+    )
+
+    # The closest match to str1 should be str2 (both about dogs)
+    related = lsi.find_related(@str1, 3)
+
+    assert_equal @str2, related.first, 'Most related to dog text should be other dog text'
+  end
+
+  def test_add_equivalence_to_add_item
+    # Using add
+    lsi1 = Classifier::LSI.new
+    lsi1.add(
+      'Programming' => ['Ruby programming language', 'Java enterprise development'],
+      'Entertainment' => 'Cat pictures are cute'
+    )
+
+    # Using add_item (legacy)
+    lsi2 = Classifier::LSI.new
+    lsi2.add_item 'Ruby programming language', 'Programming'
+    lsi2.add_item 'Java enterprise development', 'Programming'
+    lsi2.add_item 'Cat pictures are cute', 'Entertainment'
+
+    # Both should classify the same
+    test_text = 'Python programming'
+
+    assert_equal lsi1.classify(test_text), lsi2.classify(test_text)
+  end
+
+  def test_add_triggers_auto_rebuild
+    lsi = Classifier::LSI.new auto_rebuild: true
+    lsi.add('Dog' => ['Dogs are great', 'More about dogs'])
+
+    refute_predicate lsi, :needs_rebuild?, 'Auto-rebuild should keep index current'
+  end
+
+  def test_add_respects_auto_rebuild_false
+    lsi = Classifier::LSI.new auto_rebuild: false
+    lsi.add('Dog' => ['Dogs are great', 'More about dogs'])
+
+    assert_predicate lsi, :needs_rebuild?, 'Index should need rebuild when auto_rebuild is false'
+  end
+
   def test_basic_indexing
     lsi = Classifier::LSI.new
     [@str1, @str2, @str3, @str4, @str5].each { |x| lsi << x }
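
Every test above calls `add` with a braceless hash literal, which Ruby passes as keyword arguments. Because `add` is declared with only `**items`, a hash built at run time probably needs an explicit double splat. The sketch below illustrates this under the assumption of Ruby 3.0 or later. `sketch_add` is a hypothetical stand-in with the same signature as `add`, not the gem's method.

```ruby
# Hypothetical stand-in with the same signature as LSI#add.
def sketch_add(**items)
  items.keys
end

batch = {
  'Dog' => ['Dogs are loyal', 'Puppies are cute'],
  'Cat' => 'Cats are independent'
}

# A double splat passes the hash as keywords; string keys are accepted
# because the method takes arbitrary keywords.
p sketch_add(**batch) # => ["Dog", "Cat"]

# On Ruby 3.0+, passing the hash positionally raises ArgumentError,
# since the method accepts no positional parameters.
begin
  sketch_add(batch)
rescue ArgumentError => e
  puts "positional hash rejected: #{e.message}"
end
```

A test that builds its batch in a variable and calls `lsi.add(**batch)` would cover this call style alongside the literal form.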
