diff --git a/.rubocop.yml b/.rubocop.yml
index dbcc4fae..dfadfcd4 100644
--- a/.rubocop.yml
+++ b/.rubocop.yml
@@ -5,6 +5,10 @@ plugins:
Layout/LeadingCommentSpace:
AllowRBSInlineAnnotation: true
+# @rbs type annotations can be long
+Layout/LineLength:
+ Max: 140
+
AllCops:
TargetRubyVersion: 3.1
NewCops: enable
diff --git a/README.md b/README.md
index 19fc6d41..bbca62ab 100644
--- a/README.md
+++ b/README.md
@@ -4,531 +4,138 @@
[](https://github.com/cardmagic/classifier/actions/workflows/ruby.yml)
[](https://opensource.org/licenses/LGPL-2.1)
-A Ruby library for text classification using Bayesian, Logistic Regression, LSI (Latent Semantic Indexing), k-Nearest Neighbors (kNN), and TF-IDF algorithms.
+Text classification in Ruby. Five algorithms, native performance, streaming support.
-**[Documentation](https://rubyclassifier.com/docs)** · **[Tutorials](https://rubyclassifier.com/docs/tutorials)** · **[Guides](https://rubyclassifier.com/docs/guides)**
-
-> **Note:** This is the original `classifier` gem, maintained for 20 years since 2005. After a quieter period, active development resumed in 2025 with major new features. If you're choosing between this and a fork, this is the canonical, most actively-developed version.
+**[Documentation](https://rubyclassifier.com/docs)** · **[Tutorials](https://rubyclassifier.com/docs/tutorials)** · **[API Reference](https://rubyclassifier.com/docs/api)**
## Why This Library?
-This gem has features no fork provides:
-
-| | This Gem | Forks |
+| | This Gem | Other Forks |
|:--|:--|:--|
| **Algorithms** | 5 classifiers | 2 only |
-| **LSI Performance** | Native C (5-50x faster) | Pure Ruby, or requires GSL/Numo + system libs |
-| **Persistence** | Pluggable (file, Redis, S3, SQL) | Marshal only |
-| **Thread Safety** | ✅ | ❌ |
-| **Type Annotations** | RBS throughout | ❌ |
-| **Laplace Smoothing** | Numerically stable | ❌ Unstable |
-| **Probability Calibration** | ✅ | ❌ |
-| **Feature Weights** | ✅ Interpretable | ❌ |
-
-### Recent Developments (Late 2025)
-
-- Added Logistic Regression classifier with SGD and L2 regularization
-- Added k-Nearest Neighbors classifier with distance-weighted voting
-- Added TF-IDF vectorizer with n-gram support
-- Built zero-dependency native C extension (replaces GSL or Numo requirement)
-- Added pluggable storage backends for persistence
-- Made all classifiers thread-safe
-- Fixed Laplace smoothing for numerical stability
-- Added RBS type signatures throughout
-- Modernized for new Ruby coding standards
-
-## Table of Contents
-
-- [Installation](#installation)
-- [Algorithms](#bayesian-classifier)
- - [Bayesian Classifier](#bayesian-classifier)
- - [Logistic Regression](#logistic-regression)
- - [LSI (Latent Semantic Indexing)](#lsi-latent-semantic-indexing)
- - [k-Nearest Neighbors (kNN)](#k-nearest-neighbors-knn)
- - [TF-IDF Vectorizer](#tf-idf-vectorizer)
-- [Persistence](#persistence)
-- [Performance](#performance)
-- [Development](#development)
-- [Contributing](#contributing)
-- [License](#license)
+| **Incremental LSI** | Brand's algorithm (no rebuild) | Full SVD rebuild on every add |
+| **LSI Performance** | Native C extension (5-50x faster) | Pure Ruby or requires GSL |
+| **Streaming** | Train on multi-GB datasets | Must load all data in memory |
+| **Persistence** | Pluggable (file, Redis, S3) | Marshal only |
## Installation
-Add to your Gemfile:
-
```ruby
gem 'classifier'
```
-Then run:
-
-```bash
-bundle install
-```
-
-Or install directly:
-
-```bash
-gem install classifier
-```
-
-### Native C Extension
+## Quick Start
-The gem includes a zero-dependency native C extension for fast LSI operations (5-50x faster than pure Ruby). It compiles automatically during installation.
-
-To verify the native extension is active:
+### Bayesian
```ruby
-require 'classifier'
-puts Classifier::LSI.backend # => :native
-```
-
-To force pure Ruby mode (for debugging):
-
-```bash
-NATIVE_VECTOR=true ruby your_script.rb
-```
-
-## Bayesian Classifier
-
-Fast, accurate classification with modest memory requirements. Ideal for spam filtering, sentiment analysis, and content categorization.
-
-### Quick Start
-
-```ruby
-require 'classifier'
-
classifier = Classifier::Bayes.new(:spam, :ham)
-
-# Train with keyword arguments
-classifier.train(spam: "Buy cheap v1agra now! Limited offer!")
-classifier.train(ham: "Meeting scheduled for tomorrow at 10am")
-
-# Train multiple items at once
-classifier.train(
- spam: ["You've won a million dollars!", "Free money!!!"],
- ham: ["Please review the document", "Lunch tomorrow?"]
-)
-
-# Classify new text
-classifier.classify "Congratulations! You've won a prize!"
-# => "Spam"
-```
-
-### Learn More
-
-- [Bayes Basics Guide](https://rubyclassifier.com/docs/guides/bayes/basics) - In-depth documentation
-- [Build a Spam Filter Tutorial](https://rubyclassifier.com/docs/tutorials/spam-filter) - Step-by-step guide
-- [Paul Graham: A Plan for Spam](http://www.paulgraham.com/spam.html)
-
-## Logistic Regression
-
-Linear classifier using Stochastic Gradient Descent (SGD). Often more accurate than Naive Bayes while remaining fast and interpretable. Provides well-calibrated probability estimates and feature weights for understanding which words drive predictions.
-
-### Key Features
-
-- **More Accurate**: No independence assumption means better accuracy on many text problems
-- **Calibrated Probabilities**: Unlike Bayes, probabilities reflect true confidence
-- **Interpretable**: Feature weights show which words matter for each class
-- **Fast**: Linear time prediction, efficient SGD training
-- **L2 Regularization**: Prevents overfitting on small datasets
-
-### Quick Start
-
-```ruby
-require 'classifier'
-
-classifier = Classifier::LogisticRegression.new(:spam, :ham)
-
-# Train with keyword arguments (same API as Bayes)
-classifier.train(spam: ["Buy now! Free money!", "Click here for prizes!"])
-classifier.train(ham: ["Meeting tomorrow", "Please review the document"])
-
-# Classify new text
-classifier.classify "Claim your free prize!"
-# => "Spam"
-
-# Get calibrated probabilities
-classifier.probabilities "Claim your free prize!"
-# => {"Spam" => 0.92, "Ham" => 0.08}
-
-# See which words matter most
-classifier.weights(:spam, limit: 5)
-# => {:free => 2.3, :prize => 1.9, :claim => 1.5, ...}
+classifier.train(spam: "Buy cheap viagra now!", ham: "Meeting at 3pm tomorrow")
+classifier.classify "You've won a prize!" # => "Spam"
```
+[Bayesian Guide →](https://rubyclassifier.com/docs/guides/bayes/basics)
-### Hyperparameters
+### Logistic Regression
```ruby
-classifier = Classifier::LogisticRegression.new(
- :positive, :negative,
- learning_rate: 0.1, # SGD step size (default: 0.1)
- regularization: 0.01, # L2 penalty strength (default: 0.01)
- max_iterations: 100, # Training iterations (default: 100)
- tolerance: 1e-4 # Convergence threshold (default: 1e-4)
-)
+classifier = Classifier::LogisticRegression.new(:positive, :negative)
+classifier.train(positive: "Great product!", negative: "Terrible experience")
+classifier.classify "Loved it!" # => "Positive"
```
+[Logistic Regression Guide →](https://rubyclassifier.com/docs/guides/logistic-regression/basics)
-### When to Use Logistic Regression vs Bayes
-
-| Aspect | Logistic Regression | Naive Bayes |
-|--------|---------------------|-------------|
-| **Accuracy** | Often higher | Good baseline |
-| **Probabilities** | Well-calibrated | Tend to be extreme |
-| **Training** | Needs to fit model | Instant (just counting) |
-| **Interpretability** | Feature weights | Word frequencies |
-| **Small datasets** | Use higher regularization | Works well |
-
-Use Logistic Regression when you need accurate probabilities or want to understand feature importance. Use Bayes when you need instant updates or have very small training sets.
-
-## LSI (Latent Semantic Indexing)
-
-Semantic analysis using Singular Value Decomposition (SVD). More flexible than Bayesian classifiers, providing search, clustering, and classification based on meaning rather than just keywords.
-
-### Quick Start
+### LSI (Latent Semantic Indexing)
```ruby
-require 'classifier'
-
lsi = Classifier::LSI.new
-
-# Add documents (category: item(s))
-lsi.add(pets: "Dogs are loyal pets that love to play fetch")
-lsi.add(pets: "Cats are independent and love to nap")
-lsi.add(programming: "Ruby is a dynamic programming language")
-
-# Add multiple items with the same category
-lsi.add(programming: ["Python is great for data science", "JavaScript runs in browsers"])
-
-# Batch operations with multiple categories
-lsi.add(
- pets: ["Hamsters are small furry pets", "Birds can be great companions"],
- programming: "Go is fast and concurrent"
-)
-
-# Classify new text
-lsi.classify "My puppy loves to run around"
-# => "Pets"
-
-# Get classification with confidence score
-lsi.classify_with_confidence "Learning to code in Ruby"
-# => ["Programming", 0.89]
+lsi.add(pets: "Dogs are loyal", tech: "Ruby is elegant")
+lsi.classify "My puppy is playful" # => "pets"
```
+[LSI Guide →](https://rubyclassifier.com/docs/guides/lsi/basics)
-### Search and Discovery
-
-```ruby
-# Find similar documents
-lsi.find_related "Dogs are great companions", 2
-# => ["Dogs are loyal pets that love to play fetch", "Cats are independent..."]
-
-# Search by keyword
-lsi.search "programming", 3
-# => ["Ruby is a dynamic programming language", "Python is great for..."]
-```
-
-### Text Summarization
-
-LSI can extract key sentences from text:
-
-```ruby
-text = "First sentence about dogs. Second about cats. Third about birds."
-text.summary(2) # Extract 2 most relevant sentences
-```
-
-For better sentence boundary detection (handles abbreviations like "Dr.", decimals, etc.), install the optional `pragmatic_segmenter` gem:
-
-```ruby
-gem 'pragmatic_segmenter'
-```
-
-### Learn More
-
-- [LSI Basics Guide](https://rubyclassifier.com/docs/guides/lsi/basics) - In-depth documentation
-- [Wikipedia: Latent Semantic Analysis](http://en.wikipedia.org/wiki/Latent_semantic_analysis)
-
-## k-Nearest Neighbors (kNN)
-
-Instance-based classification that stores examples and classifies by finding the most similar ones. No training phase required—just add examples and classify.
-
-### Key Features
-
-- **No Training Required**: Uses instance-based learning—store examples and classify by similarity
-- **Interpretable Results**: Returns neighbors that contributed to the decision
-- **Incremental Updates**: Easy to add or remove examples without retraining
-- **Distance-Weighted Voting**: Optional weighting by similarity score
-- **Built on LSI**: Leverages LSI's semantic similarity for better matching
-
-### Quick Start
+### k-Nearest Neighbors
```ruby
-require 'classifier'
-
knn = Classifier::KNN.new(k: 3)
-
-# Add labeled examples
-knn.add(spam: ["Buy now! Limited offer!", "You've won a million dollars!"])
-knn.add(ham: ["Meeting at 3pm tomorrow", "Please review the document"])
-
-# Classify new text
-knn.classify "Congratulations! Claim your prize!"
-# => "spam"
-```
-
-### Detailed Classification
-
-Get neighbor information for interpretable results:
-
-```ruby
-result = knn.classify_with_neighbors "Free money offer"
-
-result[:category] # => "spam"
-result[:confidence] # => 0.85
-result[:neighbors] # => [{item: "Buy now!...", category: "spam", similarity: 0.92}, ...]
-result[:votes] # => {"spam" => 2.0, "ham" => 1.0}
+knn.train(spam: "Free money!", ham: "Quarterly report attached") # or knn.add()
+knn.classify "Claim your prize" # => "spam"
```
+[k-Nearest Neighbors Guide →](https://rubyclassifier.com/docs/guides/knn/basics)
-### Distance-Weighted Voting
-
-Weight votes by similarity score for more accurate classification:
+### TF-IDF
```ruby
-knn = Classifier::KNN.new(k: 5, weighted: true)
-
-knn.add(
- positive: ["Great product!", "Loved it!", "Excellent service"],
- negative: ["Terrible experience", "Would not recommend"]
-)
-
-# Closer neighbors have more influence on the result
-knn.classify "This was amazing!"
-# => "positive"
-```
-
-### Updating the Classifier
-
-```ruby
-# Add more examples anytime
-knn.add(neutral: "It was okay, nothing special")
-
-# Remove examples
-knn.remove_item "Buy now! Limited offer!"
-
-# Change k value
-knn.k = 7
-
-# List all categories
-knn.categories
-# => ["spam", "ham", "neutral"]
-```
-
-### When to Use kNN vs Bayes vs LSI
-
-| Classifier | Best For |
-|------------|----------|
-| **Bayes** | Fast classification, any training size (stores only word counts) |
-| **LSI** | Semantic similarity, document clustering, search |
-| **kNN** | <1000 examples, interpretable results, incremental updates |
-
-**Why the size difference?** Bayes stores aggregate statistics—adding 10,000 documents just increments counters. kNN stores every example and compares against all of them during classification, so performance degrades with size.
-
-## TF-IDF Vectorizer
-
-Transform text documents into TF-IDF (Term Frequency-Inverse Document Frequency) weighted feature vectors. TF-IDF downweights common words and upweights discriminative terms—the foundation for most classic text classification approaches.
-
-### Quick Start
-
-```ruby
-require 'classifier'
-
tfidf = Classifier::TFIDF.new
-tfidf.fit(["Dogs are great pets", "Cats are independent", "Birds can fly"])
-
-# Transform text to TF-IDF vector (L2 normalized)
-vector = tfidf.transform("Dogs are loyal")
-# => {:dog=>0.7071..., :loyal=>0.7071...}
-
-# Fit and transform in one step
-vectors = tfidf.fit_transform(documents)
-```
-
-### Options
-
-```ruby
-tfidf = Classifier::TFIDF.new(
- min_df: 2, # Minimum document frequency (Integer or Float 0.0-1.0)
- max_df: 0.95, # Maximum document frequency (filters very common terms)
- ngram_range: [1, 2], # Extract unigrams and bigrams
- sublinear_tf: true # Use 1 + log(tf) instead of raw term frequency
-)
+tfidf.fit(["Dogs are pets", "Cats are independent"])
+tfidf.transform("Dogs are loyal") # => {:dog => 0.707, :loyal => 0.707}
```
+[TF-IDF Guide →](https://rubyclassifier.com/docs/guides/tfidf/basics)
-### Vocabulary Inspection
-
-```ruby
-tfidf.fit(documents)
+## Key Features
-tfidf.vocabulary # => {:dog=>0, :cat=>1, :bird=>2, ...}
-tfidf.idf # => {:dog=>1.405, :cat=>1.405, ...}
-tfidf.feature_names # => [:dog, :cat, :bird, ...]
-tfidf.num_documents # => 3
-tfidf.fitted? # => true
-```
-
-### N-gram Support
-
-```ruby
-# Extract bigrams only
-tfidf = Classifier::TFIDF.new(ngram_range: [2, 2])
-tfidf.fit(["quick brown fox", "lazy brown dog"])
-tfidf.vocabulary.keys
-# => [:quick_brown, :brown_fox, :lazi_brown, :brown_dog]
-
-# Unigrams through trigrams
-tfidf = Classifier::TFIDF.new(ngram_range: [1, 3])
-```
+### Incremental LSI
-### Serialization
+Add documents without rebuilding the entire index—400x faster for streaming data:
```ruby
-# Save to JSON
-json = tfidf.to_json
-File.write("tfidf.json", json)
-
-# Load from JSON
-loaded = Classifier::TFIDF.from_json(File.read("tfidf.json"))
+lsi = Classifier::LSI.new(incremental: true)
+lsi.add(tech: ["Ruby is elegant", "Python is popular"])
+lsi.build_index
-# Or use Marshal
-data = Marshal.dump(tfidf)
-loaded = Marshal.load(data)
+# These use Brand's algorithm—no full rebuild
+lsi.add(tech: "Go is fast")
+lsi.add(tech: "Rust is safe")
```
-## Persistence
-
-Save and load classifiers with pluggable storage backends. Works with Bayes, LogisticRegression, LSI, and kNN classifiers.
+[Learn more →](https://rubyclassifier.com/docs/guides/lsi/incremental)
-### File Storage
+### Persistence
```ruby
-require 'classifier'
-
-classifier = Classifier::Bayes.new(:spam, :ham)
-classifier.train(spam: "Buy now! Limited offer!")
-classifier.train(ham: "Meeting tomorrow at 3pm")
-
-# Configure storage and save
-classifier.storage = Classifier::Storage::File.new(path: "spam_filter.json")
+classifier.storage = Classifier::Storage::File.new(path: "model.json")
classifier.save
-# Load later
loaded = Classifier::Bayes.load(storage: classifier.storage)
-loaded.classify "Claim your prize now!"
-# => "Spam"
```
-### Custom Storage Backends
+[Learn more →](https://rubyclassifier.com/docs/guides/persistence)
-Create backends for Redis, PostgreSQL, S3, or any storage system:
+### Streaming Training
```ruby
-class RedisStorage < Classifier::Storage::Base
- def initialize(redis:, key:)
- super()
- @redis, @key = redis, key
- end
-
- def write(data) = @redis.set(@key, data)
- def read = @redis.get(@key)
- def delete = @redis.del(@key)
- def exists? = @redis.exists?(@key)
-end
-
-# Use it
-classifier.storage = RedisStorage.new(redis: Redis.new, key: "classifier:spam")
-classifier.save
+classifier.train_from_stream(:spam, File.open("spam_corpus.txt"))
```
-### Learn More
-
-- [Persistence Guide](https://rubyclassifier.com/docs/guides/persistence/basics) - Full documentation with examples
+[Learn more →](https://rubyclassifier.com/docs/guides/streaming)
## Performance
-### Native C Extension vs Pure Ruby
-
-The native C extension provides dramatic speedups for LSI operations, especially `build_index` (SVD computation):
+Native C extension provides 5-50x speedup for LSI operations:
-| Documents | build_index | Overall |
-|-----------|-------------|---------|
-| 5 | 7x faster | 2.6x |
-| 10 | 25x faster | 4.6x |
-| 15 | 112x faster | 14.5x |
-| 20 | 385x faster | 48.7x |
-
-
-Detailed benchmark (20 documents)
-
-```
-Operation Pure Ruby Native C Speedup
-----------------------------------------------------------
-build_index 0.5540 0.0014 384.5x
-classify 0.0190 0.0060 3.2x
-search 0.0145 0.0037 3.9x
-find_related 0.0098 0.0011 8.6x
-----------------------------------------------------------
-TOTAL 0.5973 0.0123 48.7x
-```
-
-
-### Running Benchmarks
+| Documents | Speedup |
+|-----------|---------|
+| 10 | 25x |
+| 20 | 50x |
```bash
-rake benchmark # Run with current configuration
-rake benchmark:compare # Compare native C vs pure Ruby
+rake benchmark:compare # Run your own comparison
```
## Development
-### Setup
-
```bash
-git clone https://github.com/cardmagic/classifier.git
-cd classifier
bundle install
-rake compile # Compile native C extension
+rake compile # Build native extension
+rake test # Run tests
```
-### Running Tests
-
-```bash
-rake test # Run all tests (compiles first)
-ruby -Ilib test/bayes/bayesian_test.rb # Run specific test file
-
-# Test with pure Ruby (no native extension)
-NATIVE_VECTOR=true rake test
-```
-
-### Console
-
-```bash
-rake console
-```
-
-## Contributing
-
-1. Fork the repository
-2. Create your feature branch (`git checkout -b feature/amazing-feature`)
-3. Commit your changes (`git commit -am 'Add amazing feature'`)
-4. Push to the branch (`git push origin feature/amazing-feature`)
-5. Open a Pull Request
-
## Authors
-- **Lucas Carlson** - *Original author* - lucas@rufy.com
-- **David Fayram II** - *LSI implementation* - dfayram@gmail.com
+- **Lucas Carlson** - lucas@rufy.com
+- **David Fayram II** - dfayram@gmail.com
- **Cameron McBride** - cameron.mcbride@gmail.com
- **Ivan Acosta-Rubio** - ivan@softwarecriollo.com
## License
-This library is released under the [GNU Lesser General Public License (LGPL) 2.1](LICENSE).
+[LGPL 2.1](LICENSE)
diff --git a/ext/classifier/classifier_ext.c b/ext/classifier/classifier_ext.c
index f6096933..45ebad40 100644
--- a/ext/classifier/classifier_ext.c
+++ b/ext/classifier/classifier_ext.c
@@ -22,4 +22,5 @@ void Init_classifier_ext(void)
Init_vector();
Init_matrix();
Init_svd();
+ Init_incremental_svd();
}
diff --git a/ext/classifier/incremental_svd.c b/ext/classifier/incremental_svd.c
new file mode 100644
index 00000000..ab390d98
--- /dev/null
+++ b/ext/classifier/incremental_svd.c
@@ -0,0 +1,393 @@
+/*
+ * incremental_svd.c
+ * Native C implementation of Brand's incremental SVD operations
+ *
+ * Provides fast matrix operations for:
+ * - Matrix column extension
+ * - Vertical stacking (vstack)
+ * - Vector subtraction
+ * - Batch document projection
+ */
+
+#include "linalg.h"
+
+/*
+ * Extend a matrix with a new column
+ * Returns a new matrix [M | col] with one additional column
+ */
+CMatrix *cmatrix_extend_column(CMatrix *m, CVector *col)
+{
+ if (m->rows != col->size) {
+ rb_raise(rb_eArgError,
+ "Matrix rows (%ld) must match vector size (%ld)",
+ (long)m->rows, (long)col->size);
+ }
+
+ CMatrix *result = cmatrix_alloc(m->rows, m->cols + 1);
+
+ for (size_t i = 0; i < m->rows; i++) {
+ memcpy(&MAT_AT(result, i, 0), &MAT_AT(m, i, 0), m->cols * sizeof(double));
+ MAT_AT(result, i, m->cols) = col->data[i];
+ }
+
+ return result;
+}
+
+/*
+ * Vertically stack two matrices
+ * Returns a new matrix [top; bottom]
+ */
+CMatrix *cmatrix_vstack(CMatrix *top, CMatrix *bottom)
+{
+ if (top->cols != bottom->cols) {
+ rb_raise(rb_eArgError,
+ "Matrices must have same column count: %ld vs %ld",
+ (long)top->cols, (long)bottom->cols);
+ }
+
+ size_t new_rows = top->rows + bottom->rows;
+ CMatrix *result = cmatrix_alloc(new_rows, top->cols);
+
+ memcpy(result->data, top->data, top->rows * top->cols * sizeof(double));
+ memcpy(result->data + top->rows * top->cols,
+ bottom->data,
+ bottom->rows * bottom->cols * sizeof(double));
+
+ return result;
+}
+
+/*
+ * Vector subtraction: a - b
+ */
+CVector *cvector_subtract(CVector *a, CVector *b)
+{
+ if (a->size != b->size) {
+ rb_raise(rb_eArgError,
+ "Vector sizes must match: %ld vs %ld",
+ (long)a->size, (long)b->size);
+ }
+
+ CVector *result = cvector_alloc(a->size);
+ for (size_t i = 0; i < a->size; i++) {
+ result->data[i] = a->data[i] - b->data[i];
+ }
+ return result;
+}
+
+/*
+ * Batch project multiple vectors onto U matrix
+ * Computes lsi_vector = U^T * raw_vector for each vector
+ * This is the most performance-critical operation for incremental updates
+ */
+void cbatch_project(CMatrix *u, CVector **raw_vectors, size_t num_vectors,
+ CVector **lsi_vectors_out)
+{
+ size_t m = u->rows; /* vocabulary size */
+ size_t k = u->cols; /* rank */
+
+ for (size_t v = 0; v < num_vectors; v++) {
+ CVector *raw = raw_vectors[v];
+ if (raw->size != m) {
+ rb_raise(rb_eArgError,
+ "Vector %ld size (%ld) must match matrix rows (%ld)",
+ (long)v, (long)raw->size, (long)m);
+ }
+
+ CVector *lsi = cvector_alloc(k);
+
+ /* Compute U^T * raw (project onto k-dimensional space) */
+ for (size_t j = 0; j < k; j++) {
+ double sum = 0.0;
+ for (size_t i = 0; i < m; i++) {
+ sum += MAT_AT(u, i, j) * raw->data[i];
+ }
+ lsi->data[j] = sum;
+ }
+
+ lsi_vectors_out[v] = lsi;
+ }
+}
+
+/*
+ * Build the K matrix for Brand's algorithm when rank grows
+ * K = | diag(s) m_vec |
+ * | 0 p_norm |
+ */
+static CMatrix *build_k_matrix_with_growth(CVector *s, CVector *m_vec, double p_norm)
+{
+ size_t k = s->size;
+ CMatrix *result = cmatrix_alloc(k + 1, k + 1);
+
+ /* First k rows: diagonal s values and m_vec in last column */
+ for (size_t i = 0; i < k; i++) {
+ MAT_AT(result, i, i) = s->data[i];
+ MAT_AT(result, i, k) = m_vec->data[i];
+ }
+
+ /* Last row: zeros except p_norm in last position */
+ MAT_AT(result, k, k) = p_norm;
+
+ return result;
+}
+
+/*
+ * Perform one incremental SVD update using Brand's algorithm
+ *
+ * @param u Current U matrix (m x k)
+ * @param s Current singular values (k values)
+ * @param c New document vector (m x 1)
+ * @param max_rank Maximum rank to maintain
+ * @param epsilon Threshold for detecting new directions
+ * @param u_out Output: updated U matrix
+ * @param s_out Output: updated singular values
+ */
+static void incremental_update(CMatrix *u, CVector *s, CVector *c, int max_rank,
+ double epsilon, CMatrix **u_out, CVector **s_out)
+{
+ size_t m = u->rows;
+ size_t k = u->cols;
+
+ /* Step 1: Project c onto column space of U */
+ /* m_vec = U^T * c */
+ CVector *m_vec = cvector_alloc(k);
+ for (size_t j = 0; j < k; j++) {
+ double sum = 0.0;
+ for (size_t i = 0; i < m; i++) {
+ sum += MAT_AT(u, i, j) * c->data[i];
+ }
+ m_vec->data[j] = sum;
+ }
+
+ /* Step 2: Compute residual p = c - U * m_vec */
+ CVector *u_times_m = cmatrix_multiply_vector(u, m_vec);
+ CVector *p = cvector_subtract(c, u_times_m);
+ double p_norm = cvector_magnitude(p);
+
+ cvector_free(u_times_m);
+
+ if (p_norm > epsilon) {
+ /* New direction found - rank may increase */
+
+ /* Step 3: Normalize residual */
+ CVector *p_hat = cvector_alloc(m);
+ double inv_p_norm = 1.0 / p_norm;
+ for (size_t i = 0; i < m; i++) {
+ p_hat->data[i] = p->data[i] * inv_p_norm;
+ }
+
+ /* Step 4: Build K matrix */
+ CMatrix *k_mat = build_k_matrix_with_growth(s, m_vec, p_norm);
+
+ /* Step 5: SVD of K matrix */
+ CMatrix *u_prime, *v_prime;
+ CVector *s_prime;
+ jacobi_svd(k_mat, &u_prime, &v_prime, &s_prime);
+ cmatrix_free(k_mat);
+ cmatrix_free(v_prime);
+
+ /* Step 6: Update U = [U | p_hat] * U' */
+ CMatrix *u_extended = cmatrix_extend_column(u, p_hat);
+ CMatrix *u_new = cmatrix_multiply(u_extended, u_prime);
+ cmatrix_free(u_extended);
+ cmatrix_free(u_prime);
+ cvector_free(p_hat);
+
+ /* Truncate if needed */
+ if (s_prime->size > (size_t)max_rank) {
+ /* Create truncated U (keep first max_rank columns) */
+ CMatrix *u_trunc = cmatrix_alloc(u_new->rows, (size_t)max_rank);
+ for (size_t i = 0; i < u_new->rows; i++) {
+ memcpy(&MAT_AT(u_trunc, i, 0), &MAT_AT(u_new, i, 0),
+ (size_t)max_rank * sizeof(double));
+ }
+ cmatrix_free(u_new);
+ u_new = u_trunc;
+
+ /* Truncate singular values */
+ CVector *s_trunc = cvector_alloc((size_t)max_rank);
+ memcpy(s_trunc->data, s_prime->data, (size_t)max_rank * sizeof(double));
+ cvector_free(s_prime);
+ s_prime = s_trunc;
+ }
+
+ *u_out = u_new;
+ *s_out = s_prime;
+ } else {
+ /* Vector in span - use simpler update */
+ /* For now, just return unchanged (projection handles this) */
+ *u_out = cmatrix_alloc(u->rows, u->cols);
+ memcpy((*u_out)->data, u->data, u->rows * u->cols * sizeof(double));
+ *s_out = cvector_alloc(s->size);
+ memcpy((*s_out)->data, s->data, s->size * sizeof(double));
+ }
+
+ cvector_free(p);
+ cvector_free(m_vec);
+}
+
+/* ========== Ruby Wrappers ========== */
+
+/*
+ * Matrix.extend_column(matrix, vector)
+ * Returns [matrix | vector]
+ */
+static VALUE rb_cmatrix_extend_column(VALUE klass, VALUE rb_matrix, VALUE rb_vector)
+{
+ CMatrix *m;
+ CVector *v;
+
+ GET_CMATRIX(rb_matrix, m);
+ GET_CVECTOR(rb_vector, v);
+
+ CMatrix *result = cmatrix_extend_column(m, v);
+ return TypedData_Wrap_Struct(klass, &cmatrix_type, result);
+
+ (void)klass;
+}
+
+/*
+ * Matrix.vstack(top, bottom)
+ * Vertically stack two matrices
+ */
+static VALUE rb_cmatrix_vstack(VALUE klass, VALUE rb_top, VALUE rb_bottom)
+{
+ CMatrix *top, *bottom;
+
+ GET_CMATRIX(rb_top, top);
+ GET_CMATRIX(rb_bottom, bottom);
+
+ CMatrix *result = cmatrix_vstack(top, bottom);
+ return TypedData_Wrap_Struct(klass, &cmatrix_type, result);
+
+ (void)klass;
+}
+
+/*
+ * Matrix.zeros(rows, cols)
+ * Create a zero matrix
+ */
+static VALUE rb_cmatrix_zeros(VALUE klass, VALUE rb_rows, VALUE rb_cols)
+{
+ size_t rows = NUM2SIZET(rb_rows);
+ size_t cols = NUM2SIZET(rb_cols);
+
+ CMatrix *result = cmatrix_alloc(rows, cols);
+ return TypedData_Wrap_Struct(klass, &cmatrix_type, result);
+
+ (void)klass;
+}
+
+/*
+ * Vector#-(other)
+ * Vector subtraction
+ */
+static VALUE rb_cvector_subtract(VALUE self, VALUE other)
+{
+ CVector *a, *b;
+
+ GET_CVECTOR(self, a);
+
+ if (rb_obj_is_kind_of(other, cClassifierVector)) {
+ GET_CVECTOR(other, b);
+ CVector *result = cvector_subtract(a, b);
+ return TypedData_Wrap_Struct(cClassifierVector, &cvector_type, result);
+ }
+
+ rb_raise(rb_eTypeError, "Cannot subtract %s from Vector",
+ rb_obj_classname(other));
+ return Qnil;
+}
+
+/*
+ * Matrix#batch_project(vectors_array)
+ * Project multiple vectors onto this matrix (as U)
+ * Returns array of projected vectors
+ *
+ * This is the high-performance batch operation for re-projecting documents
+ */
+static VALUE rb_cmatrix_batch_project(VALUE self, VALUE rb_vectors)
+{
+ CMatrix *u;
+ GET_CMATRIX(self, u);
+
+ Check_Type(rb_vectors, T_ARRAY);
+ long num_vectors = RARRAY_LEN(rb_vectors);
+
+ if (num_vectors == 0) {
+ return rb_ary_new();
+ }
+
+ CVector **raw_vectors = ALLOC_N(CVector *, num_vectors);
+ for (long i = 0; i < num_vectors; i++) {
+ VALUE rb_vec = rb_ary_entry(rb_vectors, i);
+ if (!rb_obj_is_kind_of(rb_vec, cClassifierVector)) {
+ xfree(raw_vectors);
+ rb_raise(rb_eTypeError, "Expected array of Vectors");
+ }
+ GET_CVECTOR(rb_vec, raw_vectors[i]);
+ }
+
+ CVector **lsi_vectors = ALLOC_N(CVector *, num_vectors);
+ cbatch_project(u, raw_vectors, (size_t)num_vectors, lsi_vectors);
+
+ VALUE result = rb_ary_new_capa(num_vectors);
+ for (long i = 0; i < num_vectors; i++) {
+ VALUE rb_lsi = TypedData_Wrap_Struct(cClassifierVector, &cvector_type,
+ lsi_vectors[i]);
+ rb_ary_push(result, rb_lsi);
+ }
+
+ xfree(raw_vectors);
+ xfree(lsi_vectors);
+
+ return result;
+}
+
+/*
+ * Matrix#incremental_svd_update(singular_values, new_vector, max_rank, epsilon)
+ * Perform one Brand's incremental SVD update
+ * Returns [new_u, new_singular_values]
+ */
+static VALUE rb_cmatrix_incremental_update(VALUE self, VALUE rb_s, VALUE rb_c,
+ VALUE rb_max_rank, VALUE rb_epsilon)
+{
+ CMatrix *u;
+ CVector *s, *c;
+
+ GET_CMATRIX(self, u);
+ GET_CVECTOR(rb_s, s);
+ GET_CVECTOR(rb_c, c);
+
+ int max_rank = NUM2INT(rb_max_rank);
+ double epsilon = NUM2DBL(rb_epsilon);
+
+ CMatrix *u_new;
+ CVector *s_new;
+
+ incremental_update(u, s, c, max_rank, epsilon, &u_new, &s_new);
+
+ VALUE rb_u_new = TypedData_Wrap_Struct(cClassifierMatrix, &cmatrix_type, u_new);
+ VALUE rb_s_new = TypedData_Wrap_Struct(cClassifierVector, &cvector_type, s_new);
+
+ return rb_ary_new_from_args(2, rb_u_new, rb_s_new);
+}
+
+void Init_incremental_svd(void)
+{
+ /* Matrix class methods for incremental SVD */
+ rb_define_singleton_method(cClassifierMatrix, "extend_column",
+ rb_cmatrix_extend_column, 2);
+ rb_define_singleton_method(cClassifierMatrix, "vstack",
+ rb_cmatrix_vstack, 2);
+ rb_define_singleton_method(cClassifierMatrix, "zeros",
+ rb_cmatrix_zeros, 2);
+
+ /* Instance methods */
+ rb_define_method(cClassifierMatrix, "batch_project",
+ rb_cmatrix_batch_project, 1);
+ rb_define_method(cClassifierMatrix, "incremental_svd_update",
+ rb_cmatrix_incremental_update, 4);
+
+ /* Vector subtraction */
+ rb_define_method(cClassifierVector, "-", rb_cvector_subtract, 1);
+}
diff --git a/ext/classifier/linalg.h b/ext/classifier/linalg.h
index 7220ee4f..71bbc30a 100644
--- a/ext/classifier/linalg.h
+++ b/ext/classifier/linalg.h
@@ -50,6 +50,14 @@ CMatrix *cmatrix_diagonal(CVector *v);
void Init_svd(void);
void jacobi_svd(CMatrix *a, CMatrix **u, CMatrix **v, CVector **s);
+/* Incremental SVD functions */
+void Init_incremental_svd(void);
+CMatrix *cmatrix_extend_column(CMatrix *m, CVector *col);
+CMatrix *cmatrix_vstack(CMatrix *top, CMatrix *bottom);
+CVector *cvector_subtract(CVector *a, CVector *b);
+void cbatch_project(CMatrix *u, CVector **raw_vectors, size_t num_vectors,
+ CVector **lsi_vectors_out);
+
/* TypedData definitions */
extern const rb_data_type_t cvector_type;
extern const rb_data_type_t cmatrix_type;
diff --git a/lib/classifier.rb b/lib/classifier.rb
index 306016e0..77cbd931 100644
--- a/lib/classifier.rb
+++ b/lib/classifier.rb
@@ -27,6 +27,7 @@
require 'rubygems'
require 'classifier/errors'
require 'classifier/storage'
+require 'classifier/streaming'
require 'classifier/extensions/string'
require 'classifier/extensions/vector'
require 'classifier/bayes'
diff --git a/lib/classifier/bayes.rb b/lib/classifier/bayes.rb
index 8b82283c..750aeeaf 100644
--- a/lib/classifier/bayes.rb
+++ b/lib/classifier/bayes.rb
@@ -8,8 +8,9 @@
require 'mutex_m'
module Classifier
- class Bayes
+ class Bayes # rubocop:disable Metrics/ClassLength
include Mutex_m
+ include Streaming
# @rbs @categories: Hash[Symbol, Hash[Symbol, Integer]]
# @rbs @total_words: Integer
@@ -310,8 +311,115 @@ def remove_category(category)
end
end
+ # Trains the classifier from an IO stream.
+ # Each line in the stream is treated as a separate document.
+ # This is memory-efficient for large corpora.
+ #
+ # @example Train from a file
+ # classifier.train_from_stream(:spam, File.open('spam_corpus.txt'))
+ #
+ # @example With progress tracking
+ # classifier.train_from_stream(:spam, io, batch_size: 500) do |progress|
+ # puts "#{progress.completed} documents processed"
+ # end
+ #
+ # @rbs (String | Symbol, IO, ?batch_size: Integer) { (Streaming::Progress) -> void } -> void
+ def train_from_stream(category, io, batch_size: Streaming::DEFAULT_BATCH_SIZE)
+ category = category.prepare_category_name
+ raise StandardError, "No such category: #{category}" unless @categories.key?(category)
+
+ reader = Streaming::LineReader.new(io, batch_size: batch_size)
+ total = reader.estimate_line_count
+ progress = Streaming::Progress.new(total: total)
+
+ reader.each_batch do |batch|
+ train_batch_internal(category, batch)
+ progress.completed += batch.size
+ progress.current_batch += 1
+ yield progress if block_given?
+ end
+ end
+
+ # Trains the classifier with an array of documents in batches.
+ # Reduces lock contention by processing multiple documents per synchronize call.
+ #
+ # @example Positional style
+ # classifier.train_batch(:spam, documents, batch_size: 100)
+ #
+ # @example Keyword style
+ # classifier.train_batch(spam: documents, ham: other_docs, batch_size: 100)
+ #
+ # @example With progress tracking
+ # classifier.train_batch(:spam, documents, batch_size: 100) do |progress|
+ # puts "#{progress.percent}% complete"
+ # end
+ #
+ # @rbs (?(String | Symbol)?, ?Array[String]?, ?batch_size: Integer, **Array[String]) { (Streaming::Progress) -> void } -> void
+ def train_batch(category = nil, documents = nil, batch_size: Streaming::DEFAULT_BATCH_SIZE, **categories, &block)
+ if category && documents
+ train_batch_for_category(category, documents, batch_size: batch_size, &block)
+ else
+ categories.each do |cat, docs|
+ train_batch_for_category(cat, Array(docs), batch_size: batch_size, &block)
+ end
+ end
+ end
+
+ # Loads a classifier from a checkpoint.
+ #
+ # @rbs (storage: Storage::Base, checkpoint_id: String) -> Bayes
+ def self.load_checkpoint(storage:, checkpoint_id:)
+ raise ArgumentError, 'Storage must be File storage for checkpoints' unless storage.is_a?(Storage::File)
+
+ dir = File.dirname(storage.path)
+ base = File.basename(storage.path, '.*')
+ ext = File.extname(storage.path)
+ checkpoint_path = File.join(dir, "#{base}_checkpoint_#{checkpoint_id}#{ext}")
+
+ checkpoint_storage = Storage::File.new(path: checkpoint_path)
+ instance = load(storage: checkpoint_storage)
+ instance.storage = storage
+ instance
+ end
+
private
+ # Trains a batch of documents for a single category.
+ # @rbs (String | Symbol, Array[String], ?batch_size: Integer) { (Streaming::Progress) -> void } -> void
+ def train_batch_for_category(category, documents, batch_size: Streaming::DEFAULT_BATCH_SIZE)
+ category = category.prepare_category_name
+ raise StandardError, "No such category: #{category}" unless @categories.key?(category)
+
+ progress = Streaming::Progress.new(total: documents.size)
+
+ documents.each_slice(batch_size) do |batch|
+ train_batch_internal(category, batch)
+ progress.completed += batch.size
+ progress.current_batch += 1
+ yield progress if block_given?
+ end
+ end
+
+ # Internal method to train a batch of documents.
+ # Uses a single synchronize block for the entire batch.
+ # @rbs (Symbol, Array[String]) -> void
+ def train_batch_internal(category, batch)
+ synchronize do
+ invalidate_caches
+ @dirty = true
+ batch.each do |text|
+ word_hash = text.word_hash
+ @category_counts[category] += 1
+ word_hash.each do |word, count|
+ @categories[category][word] ||= 0
+ @categories[category][word] += count
+ @total_words += count
+ @category_word_count[category] += count
+ end
+ end
+ end
+ end
+
# Core training logic for a single category and text.
# @rbs (String | Symbol, String) -> void
def train_single(category, text)
diff --git a/lib/classifier/knn.rb b/lib/classifier/knn.rb
index d473abcb..37ae8174 100644
--- a/lib/classifier/knn.rb
+++ b/lib/classifier/knn.rb
@@ -19,6 +19,7 @@ module Classifier
#
class KNN
include Mutex_m
+ include Streaming
# @rbs @k: Integer
# @rbs @weighted: bool
@@ -42,17 +43,25 @@ def initialize(k: 5, weighted: false) # rubocop:disable Naming/MethodParameterNa
end
# Adds labeled examples. Keys are categories, values are items or arrays.
+ # Also aliased as `train` for API consistency with Bayes and LogisticRegression.
+ #
+ # knn.add(spam: "Buy now!", ham: "Meeting tomorrow")
+ # knn.train(spam: "Buy now!", ham: "Meeting tomorrow") # equivalent
+ #
# @rbs (**untyped items) -> void
def add(**items)
synchronize { @dirty = true }
@lsi.add(**items)
end
+ alias train add
+
# Classifies text using k nearest neighbors with majority voting.
- # @rbs (String) -> (String | Symbol)?
+ # Returns the category as a String for API consistency with Bayes and LogisticRegression.
+ # @rbs (String) -> String?
def classify(text)
result = classify_with_neighbors(text)
- result[:category]
+ result[:category]&.to_s
end
# Classifies and returns {category:, neighbors:, votes:, confidence:}.
@@ -91,10 +100,11 @@ def items
@lsi.items
end
- # @rbs () -> Array[String | Symbol]
+ # Returns all unique categories as strings.
+ # @rbs () -> Array[String]
def categories
synchronize do
- @lsi.items.flat_map { |item| @lsi.categories_for(item) }.uniq
+ @lsi.items.flat_map { |item| @lsi.categories_for(item) }.uniq.map(&:to_s)
end
end
@@ -104,6 +114,23 @@ def k=(value)
@k = value
end
+ # Provides dynamic training methods for categories.
+ # For example:
+ # knn.train_spam "Buy now!"
+ # knn.train_ham "Meeting tomorrow"
+ def method_missing(name, *args)
+ category_match = name.to_s.match(/\Atrain_(\w+)\z/)
+ return super unless category_match
+
+ category = category_match[1].to_sym
+ args.each { |text| add(category => text) }
+ end
+
+ # @rbs (Symbol, ?bool) -> bool
+ def respond_to_missing?(name, include_private = false)
+ !!(name.to_s =~ /\Atrain_(\w+)\z/) || super
+ end
+
# @rbs (?untyped) -> untyped
def as_json(_options = nil)
{
@@ -213,6 +240,59 @@ def marshal_load(data)
@storage = nil
end
+ # Loads a classifier from a checkpoint.
+ #
+ # @rbs (storage: Storage::Base, checkpoint_id: String) -> KNN
+ def self.load_checkpoint(storage:, checkpoint_id:)
+ raise ArgumentError, 'Storage must be File storage for checkpoints' unless storage.is_a?(Storage::File)
+
+ dir = File.dirname(storage.path)
+ base = File.basename(storage.path, '.*')
+ ext = File.extname(storage.path)
+ checkpoint_path = File.join(dir, "#{base}_checkpoint_#{checkpoint_id}#{ext}")
+
+ checkpoint_storage = Storage::File.new(path: checkpoint_path)
+ instance = load(storage: checkpoint_storage)
+ instance.storage = storage
+ instance
+ end
+
+ # Trains the classifier from an IO stream.
+ # Each line in the stream is treated as a separate document.
+ #
+ # @example Train from a file
+ # knn.train_from_stream(:spam, File.open('spam_corpus.txt'))
+ #
+ # @example With progress tracking
+ # knn.train_from_stream(:spam, io, batch_size: 500) do |progress|
+ # puts "#{progress.completed} documents processed"
+ # end
+ #
+ # @rbs (String | Symbol, IO, ?batch_size: Integer) { (Streaming::Progress) -> void } -> void
+ def train_from_stream(category, io, batch_size: Streaming::DEFAULT_BATCH_SIZE, &block)
+ @lsi.train_from_stream(category, io, batch_size: batch_size, &block)
+ synchronize { @dirty = true }
+ end
+
+ # Adds items in batches.
+ #
+ # @example Positional style
+ # knn.train_batch(:spam, documents, batch_size: 100)
+ #
+ # @example Keyword style
+ # knn.train_batch(spam: documents, ham: other_docs)
+ #
+ # @rbs (?(String | Symbol)?, ?Array[String]?, ?batch_size: Integer, **Array[String]) { (Streaming::Progress) -> void } -> void
+ def train_batch(category = nil, documents = nil, batch_size: Streaming::DEFAULT_BATCH_SIZE, **categories, &block)
+ # @type var categories: Hash[Symbol, Array[String]]
+ @lsi.train_batch(category, documents, batch_size: batch_size, **categories, &block) # steep:ignore
+ synchronize { @dirty = true }
+ end
+
+ # @rbs!
+ # alias add_batch train_batch
+ alias add_batch train_batch
+
private
# @rbs (String) -> Array[Hash[Symbol, untyped]]
diff --git a/lib/classifier/logistic_regression.rb b/lib/classifier/logistic_regression.rb
index a413c062..04e3f037 100644
--- a/lib/classifier/logistic_regression.rb
+++ b/lib/classifier/logistic_regression.rb
@@ -20,6 +20,7 @@ module Classifier
#
class LogisticRegression # rubocop:disable Metrics/ClassLength
include Mutex_m
+ include Streaming
# @rbs @categories: Array[Symbol]
# @rbs @weights: Hash[Symbol, Hash[Symbol, Float]]
@@ -329,8 +330,112 @@ def marshal_load(data)
@storage = nil
end
+ # Loads a classifier from a checkpoint.
+ #
+ # @rbs (storage: Storage::Base, checkpoint_id: String) -> LogisticRegression
+ def self.load_checkpoint(storage:, checkpoint_id:)
+ raise ArgumentError, 'Storage must be File storage for checkpoints' unless storage.is_a?(Storage::File)
+
+ dir = File.dirname(storage.path)
+ base = File.basename(storage.path, '.*')
+ ext = File.extname(storage.path)
+ checkpoint_path = File.join(dir, "#{base}_checkpoint_#{checkpoint_id}#{ext}")
+
+ checkpoint_storage = Storage::File.new(path: checkpoint_path)
+ instance = load(storage: checkpoint_storage)
+ instance.storage = storage
+ instance
+ end
+
+ # Trains the classifier from an IO stream.
+ # Each line in the stream is treated as a separate document.
+ # Note: The model is NOT automatically fitted after streaming.
+ # Call #fit to train the model after adding all data.
+ #
+ # @example Train from a file
+ # classifier.train_from_stream(:spam, File.open('spam_corpus.txt'))
+ # classifier.fit # Required to train the model
+ #
+ # @example With progress tracking
+ # classifier.train_from_stream(:spam, io, batch_size: 500) do |progress|
+ # puts "#{progress.completed} documents processed"
+ # end
+ # classifier.fit
+ #
+ # @rbs (String | Symbol, IO, ?batch_size: Integer) { (Streaming::Progress) -> void } -> void
+ def train_from_stream(category, io, batch_size: Streaming::DEFAULT_BATCH_SIZE)
+ category = category.to_s.prepare_category_name
+ raise StandardError, "No such category: #{category}" unless @categories.include?(category)
+
+ reader = Streaming::LineReader.new(io, batch_size: batch_size)
+ total = reader.estimate_line_count
+ progress = Streaming::Progress.new(total: total)
+
+ reader.each_batch do |batch|
+ synchronize do
+ batch.each do |text|
+ features = text.word_hash
+ features.each_key { |word| @vocabulary[word] = true }
+ @training_data << { category: category, features: features }
+ end
+ @fitted = false
+ @dirty = true
+ end
+ progress.completed += batch.size
+ progress.current_batch += 1
+ yield progress if block_given?
+ end
+ end
+
+ # Trains the classifier with an array of documents in batches.
+ # Note: The model is NOT automatically fitted after batch training.
+ # Call #fit to train the model after adding all data.
+ #
+ # @example Positional style
+ # classifier.train_batch(:spam, documents, batch_size: 100)
+ # classifier.fit
+ #
+ # @example Keyword style
+ # classifier.train_batch(spam: documents, ham: other_docs)
+ # classifier.fit
+ #
+ # @rbs (?(String | Symbol)?, ?Array[String]?, ?batch_size: Integer, **Array[String]) { (Streaming::Progress) -> void } -> void
+ def train_batch(category = nil, documents = nil, batch_size: Streaming::DEFAULT_BATCH_SIZE, **categories, &block)
+ if category && documents
+ train_batch_for_category(category, documents, batch_size: batch_size, &block)
+ else
+ categories.each do |cat, docs|
+ train_batch_for_category(cat, Array(docs), batch_size: batch_size, &block)
+ end
+ end
+ end
+
private
+ # Trains a batch of documents for a single category.
+ # @rbs (String | Symbol, Array[String], ?batch_size: Integer) { (Streaming::Progress) -> void } -> void
+ def train_batch_for_category(category, documents, batch_size: Streaming::DEFAULT_BATCH_SIZE)
+ category = category.to_s.prepare_category_name
+ raise StandardError, "No such category: #{category}" unless @categories.include?(category)
+
+ progress = Streaming::Progress.new(total: documents.size)
+
+ documents.each_slice(batch_size) do |batch|
+ synchronize do
+ batch.each do |text|
+ features = text.word_hash
+ features.each_key { |word| @vocabulary[word] = true }
+ @training_data << { category: category, features: features }
+ end
+ @fitted = false
+ @dirty = true
+ end
+ progress.completed += batch.size
+ progress.current_batch += 1
+ yield progress if block_given?
+ end
+ end
+
# Core training logic for a single category and text.
# @rbs (String | Symbol, String) -> void
def train_single(category, text)
diff --git a/lib/classifier/lsi.rb b/lib/classifier/lsi.rb
index 416a5cfe..6e1a6620 100644
--- a/lib/classifier/lsi.rb
+++ b/lib/classifier/lsi.rb
@@ -58,6 +58,7 @@ def matrix_class
require 'classifier/lsi/word_list'
require 'classifier/lsi/content_node'
require 'classifier/lsi/summary'
+require 'classifier/lsi/incremental_svd'
module Classifier
# This class implements a Latent Semantic Indexer, which can search, classify and cluster
@@ -65,6 +66,7 @@ module Classifier
# please consult Wikipedia[http://en.wikipedia.org/wiki/Latent_Semantic_Indexing].
class LSI
include Mutex_m
+ include Streaming
# @rbs @auto_rebuild: bool
# @rbs @word_list: WordList
@@ -74,14 +76,24 @@ class LSI
# @rbs @singular_values: Array[Float]?
# @rbs @dirty: bool
# @rbs @storage: Storage::Base?
+ # @rbs @incremental_mode: bool
+ # @rbs @u_matrix: Matrix?
+ # @rbs @max_rank: Integer
+ # @rbs @initial_vocab_size: Integer?
attr_reader :word_list, :singular_values
attr_accessor :auto_rebuild, :storage
+ # Default maximum rank for incremental SVD
+ DEFAULT_MAX_RANK = 100
+
# Create a fresh index.
# If you want to call #build_index manually, use
# Classifier::LSI.new auto_rebuild: false
#
+ # For incremental SVD mode (adds documents without full rebuild):
+ # Classifier::LSI.new incremental: true, max_rank: 100
+ #
# @rbs (?Hash[Symbol, untyped]) -> void
def initialize(options = {})
super()
@@ -92,6 +104,12 @@ def initialize(options = {})
@built_at_version = -1
@dirty = false
@storage = nil
+
+ # Incremental SVD settings
+ @incremental_mode = options[:incremental] == true
+ @max_rank = options[:max_rank] || DEFAULT_MAX_RANK
+ @u_matrix = nil
+ @initial_vocab_size = nil
end
# Returns true if the index needs to be rebuilt. The index needs
@@ -122,6 +140,40 @@ def singular_value_spectrum
end
end
+ # Returns true if incremental mode is enabled and active.
+ # Incremental mode becomes active after the first build_index call.
+ #
+ # @rbs () -> bool
+ def incremental_enabled?
+ @incremental_mode && !@u_matrix.nil?
+ end
+
+ # Returns the current rank of the incremental SVD (number of singular values kept).
+ # Returns nil if incremental mode is not active.
+ #
+ # @rbs () -> Integer?
+ def current_rank
+ @singular_values&.count(&:positive?)
+ end
+
+ # Disables incremental mode. Subsequent adds will trigger full rebuilds.
+ #
+ # @rbs () -> void
+ def disable_incremental_mode!
+ @incremental_mode = false
+ @u_matrix = nil
+ @initial_vocab_size = nil
+ end
+
+ # Enables incremental mode with optional max_rank setting.
+ # The next build_index call will store the U matrix for incremental updates.
+ #
+ # @rbs (?max_rank: Integer) -> void
+ def enable_incremental_mode!(max_rank: DEFAULT_MAX_RANK)
+ @incremental_mode = true
+ @max_rank = max_rank
+ end
+
# Adds items to the index using hash-style syntax.
# The hash keys are categories, and values are items (or arrays of items).
#
@@ -165,11 +217,18 @@ def add(**items)
# @rbs (String, *String | Symbol) ?{ (String) -> String } -> void
def add_item(item, *categories, &block)
clean_word_hash = block ? block.call(item).clean_word_hash : item.to_s.clean_word_hash
+ node = nil
+
synchronize do
- @items[item] = ContentNode.new(clean_word_hash, *categories)
+ node = ContentNode.new(clean_word_hash, *categories)
+ @items[item] = node
@version += 1
@dirty = true
end
+
+ # Use incremental update if enabled and we have a U matrix
+ return perform_incremental_update(node, clean_word_hash) if @incremental_mode && @u_matrix
+
build_index if @auto_rebuild
end
@@ -230,12 +289,12 @@ def items
# A value of 1 for cutoff means that no semantic analysis will take place,
# turning the LSI class into a simple vector search engine.
#
- # @rbs (?Float) -> void
- def build_index(cutoff = 0.75)
+ # @rbs (?Float, ?force: bool) -> void
+ def build_index(cutoff = 0.75, force: false)
validate_cutoff!(cutoff)
synchronize do
- return unless needs_rebuild_unlocked?
+ return unless force || needs_rebuild_unlocked?
make_word_list
@@ -246,14 +305,20 @@ def build_index(cutoff = 0.75)
# Convert vectors to arrays for matrix construction
tda_arrays = tda.map { |v| v.respond_to?(:to_a) ? v.to_a : v }
tdm = self.class.matrix_class.alloc(*tda_arrays).trans
- ntdm = build_reduced_matrix(tdm, cutoff)
+ ntdm, u_mat = build_reduced_matrix_with_u(tdm, cutoff)
assign_native_ext_lsi_vectors(ntdm, doc_list)
else
tdm = Matrix.rows(tda).trans
- ntdm = build_reduced_matrix(tdm, cutoff)
+ ntdm, u_mat = build_reduced_matrix_with_u(tdm, cutoff)
assign_ruby_lsi_vectors(ntdm, doc_list)
end
+ # Store U matrix for incremental mode
+ if @incremental_mode
+ @u_matrix = u_mat
+ @initial_vocab_size = @word_list.size
+ end
+
@built_at_version = @version
end
end
@@ -559,6 +624,100 @@ def self.load_from_file(path)
from_json(File.read(path))
end
+ # Loads an LSI index from a checkpoint.
+ #
+ # @rbs (storage: Storage::Base, checkpoint_id: String) -> LSI
+ def self.load_checkpoint(storage:, checkpoint_id:)
+ raise ArgumentError, 'Storage must be File storage for checkpoints' unless storage.is_a?(Storage::File)
+
+ dir = File.dirname(storage.path)
+ base = File.basename(storage.path, '.*')
+ ext = File.extname(storage.path)
+ checkpoint_path = File.join(dir, "#{base}_checkpoint_#{checkpoint_id}#{ext}")
+
+ checkpoint_storage = Storage::File.new(path: checkpoint_path)
+ instance = load(storage: checkpoint_storage)
+ instance.storage = storage
+ instance
+ end
+
+ # Trains the LSI index from an IO stream.
+ # Each line in the stream is treated as a separate document.
+ # Documents are added without rebuilding, then the index is rebuilt at the end.
+ #
+ # @example Train from a file
+ # lsi.train_from_stream(:category, File.open('corpus.txt'))
+ #
+ # @example With progress tracking
+ # lsi.train_from_stream(:category, io, batch_size: 500) do |progress|
+ # puts "#{progress.completed} documents processed"
+ # end
+ #
+ # @rbs (String | Symbol, IO, ?batch_size: Integer) { (Streaming::Progress) -> void } -> void
+ def train_from_stream(category, io, batch_size: Streaming::DEFAULT_BATCH_SIZE)
+ original_auto_rebuild = @auto_rebuild
+ @auto_rebuild = false
+
+ begin
+ reader = Streaming::LineReader.new(io, batch_size: batch_size)
+ total = reader.estimate_line_count
+ progress = Streaming::Progress.new(total: total)
+
+ reader.each_batch do |batch|
+ batch.each { |text| add_item(text, category) }
+ progress.completed += batch.size
+ progress.current_batch += 1
+ yield progress if block_given?
+ end
+ ensure
+ @auto_rebuild = original_auto_rebuild
+ build_index if original_auto_rebuild
+ end
+ end
+
+ # Adds items to the index in batches from an array.
+ # Documents are added without rebuilding, then the index is rebuilt at the end.
+ #
+ # @example Batch add with progress
+ # lsi.add_batch(Dog: documents, batch_size: 100) do |progress|
+ # puts "#{progress.percent}% complete"
+ # end
+ #
+ # @rbs (?batch_size: Integer, **Array[String]) { (Streaming::Progress) -> void } -> void
+ def add_batch(batch_size: Streaming::DEFAULT_BATCH_SIZE, **items)
+ original_auto_rebuild = @auto_rebuild
+ @auto_rebuild = false
+
+ begin
+ total_docs = items.values.sum { |v| Array(v).size }
+ progress = Streaming::Progress.new(total: total_docs)
+
+ items.each do |category, documents|
+ Array(documents).each_slice(batch_size) do |batch|
+ batch.each { |doc| add_item(doc, category.to_s) }
+ progress.completed += batch.size
+ progress.current_batch += 1
+ yield progress if block_given?
+ end
+ end
+ ensure
+ @auto_rebuild = original_auto_rebuild
+ build_index if original_auto_rebuild
+ end
+ end
+
+ # Alias train_batch to add_batch for API consistency with other classifiers.
+ # Note: LSI uses categories differently (items have categories, not the training call).
+ #
+ # @rbs (?(String | Symbol)?, ?Array[String]?, ?batch_size: Integer, **Array[String]) { (Streaming::Progress) -> void } -> void
+ def train_batch(category = nil, documents = nil, batch_size: Streaming::DEFAULT_BATCH_SIZE, **categories, &block)
+ if category && documents
+ add_batch(batch_size: batch_size, **{ category.to_sym => documents }, &block)
+ else
+ add_batch(batch_size: batch_size, **categories, &block)
+ end
+ end
+
private
# Restores LSI state from a JSON string (used by reload)
@@ -679,7 +838,7 @@ def vote_unlocked(doc, cutoff = 0.30, &)
votes
end
- # Unlocked version of node_for_content for internal use
+ # Unlocked version of node_for_content for internal use.
# @rbs (String) ?{ (String) -> String } -> ContentNode
def node_for_content_unlocked(item, &block)
return @items[item] if @items[item]
@@ -687,30 +846,67 @@ def node_for_content_unlocked(item, &block)
clean_word_hash = block ? block.call(item).clean_word_hash : item.to_s.clean_word_hash
cn = ContentNode.new(clean_word_hash, &block)
cn.raw_vector_with(@word_list) unless needs_rebuild_unlocked?
+ assign_lsi_vector_incremental(cn) if incremental_enabled?
cn
end
# @rbs (untyped, ?Float) -> untyped
def build_reduced_matrix(matrix, cutoff = 0.75)
- # TODO: Check that M>=N on these dimensions! Transpose helps assure this
- u, v, s = matrix.SV_decomp
+ result, _u = build_reduced_matrix_with_u(matrix, cutoff)
+ result
+ end
- @singular_values = s.sort.reverse
+ # Builds reduced matrix and returns both the result and the U matrix.
+ # U matrix is needed for incremental SVD updates.
+ # @rbs (untyped, ?Float) -> [untyped, Matrix]
+ def build_reduced_matrix_with_u(matrix, cutoff = 0.75)
+ u, v, s = matrix.SV_decomp
+ all_singular_values = s.sort.reverse
s_cutoff_index = [(s.size * cutoff).round - 1, 0].max
- s_cutoff = @singular_values[s_cutoff_index]
+ s_cutoff = all_singular_values[s_cutoff_index]
+
+ kept_indices = []
+ kept_singular_values = []
s.size.times do |ord|
- s[ord] = 0.0 if s[ord] < s_cutoff
+ if s[ord] >= s_cutoff
+ kept_indices << ord
+ kept_singular_values << s[ord]
+ else
+ s[ord] = 0.0
+ end
end
- # Reconstruct the term document matrix, only with reduced rank
- result = u * self.class.matrix_class.diag(s) * v.trans
- # SVD may return transposed dimensions when row_size < column_size
- # Ensure result matches input dimensions
+ @singular_values = kept_singular_values.sort.reverse
+ result = u * self.class.matrix_class.diag(s) * v.trans
result = result.trans if result.row_size != matrix.row_size
+ u_reduced = extract_reduced_u(u, kept_indices, s)
- result
+ [result, u_reduced]
+ end
+
+ # Extracts columns from U corresponding to kept singular values.
+ # Columns are sorted by descending singular value to match @singular_values order.
+ # rubocop:disable Naming/MethodParameterName
+ # @rbs (untyped, Array[Integer], Array[Float]) -> Matrix
+ def extract_reduced_u(u, kept_indices, singular_values)
+ return Matrix.empty(u.row_size, 0) if kept_indices.empty?
+
+ sorted_indices = kept_indices.sort_by { |i| -singular_values[i] }
+
+ if u.respond_to?(:to_ruby_matrix)
+ u = u.to_ruby_matrix
+ elsif !u.is_a?(::Matrix)
+ rows = u.row_size.times.map do |i|
+ sorted_indices.map { |j| u[i, j] }
+ end
+ return Matrix.rows(rows)
+ end
+
+ cols = sorted_indices.map { |i| u.column(i).to_a }
+ Matrix.columns(cols)
end
+ # rubocop:enable Naming/MethodParameterName
# @rbs () -> void
def make_word_list
@@ -719,5 +915,129 @@ def make_word_list
node.word_hash.each_key { |key| @word_list.add_word key }
end
end
+
+ # Performs incremental SVD update for a new document.
+ # @rbs (ContentNode, Hash[Symbol, Integer]) -> void
+ def perform_incremental_update(node, word_hash)
+ needs_full_rebuild = false
+ old_rank = nil
+
+ synchronize do
+ if vocabulary_growth_exceeds_threshold?(word_hash)
+ disable_incremental_mode!
+ needs_full_rebuild = true
+ next
+ end
+
+ old_rank = @u_matrix.column_size
+ extend_vocabulary_for_incremental(word_hash)
+ raw_vec = node.raw_vector_with(@word_list)
+ raw_vector = Vector[*raw_vec.to_a]
+
+ @u_matrix, @singular_values = IncrementalSVD.update(
+ @u_matrix, @singular_values, raw_vector, max_rank: @max_rank
+ )
+
+ new_rank = @u_matrix.column_size
+ if new_rank > old_rank
+ reproject_all_documents
+ else
+ assign_lsi_vector_incremental(node)
+ end
+
+ @built_at_version = @version
+ end
+
+ build_index if needs_full_rebuild
+ end
+
+ # Checks if vocabulary growth would exceed threshold (20%)
+ # @rbs (Hash[Symbol, Integer]) -> bool
+ def vocabulary_growth_exceeds_threshold?(word_hash)
+ return false unless @initial_vocab_size&.positive?
+
+ new_words = word_hash.keys.count { |w| @word_list[w].nil? }
+ growth_ratio = new_words.to_f / @initial_vocab_size
+ growth_ratio > 0.2
+ end
+
+ # Extends vocabulary and U matrix for new words.
+ # @rbs (Hash[Symbol, Integer]) -> void
+ def extend_vocabulary_for_incremental(word_hash)
+ new_words = word_hash.keys.select { |w| @word_list[w].nil? }
+ return if new_words.empty?
+
+ new_words.each { |word| @word_list.add_word(word) }
+ extend_u_matrix(new_words.size)
+ end
+
+ # Extends U matrix with zero rows for new vocabulary terms.
+ # @rbs (Integer) -> void
+ def extend_u_matrix(num_new_rows)
+ return if num_new_rows.zero? || @u_matrix.nil?
+
+ if self.class.native_available? && @u_matrix.is_a?(self.class.matrix_class)
+ new_rows = self.class.matrix_class.zeros(num_new_rows, @u_matrix.column_size)
+ @u_matrix = self.class.matrix_class.vstack(@u_matrix, new_rows)
+ else
+ new_rows = Matrix.zero(num_new_rows, @u_matrix.column_size)
+ @u_matrix = Matrix.vstack(@u_matrix, new_rows)
+ end
+ end
+
+ # Re-projects all documents onto the current U matrix
+ # Called when rank grows to ensure consistent LSI vector sizes
+ # Uses native batch_project for performance when available
+ # @rbs () -> void
+ def reproject_all_documents
+ return unless @u_matrix
+ return reproject_all_documents_native if self.class.native_available? && @u_matrix.respond_to?(:batch_project)
+
+ reproject_all_documents_ruby
+ end
+
+ # Native batch re-projection using C extension.
+ # @rbs () -> void
+ def reproject_all_documents_native
+ nodes = @items.values
+ raw_vectors = nodes.map do |node|
+ raw = node.raw_vector_with(@word_list)
+ raw.is_a?(self.class.vector_class) ? raw : self.class.vector_class.alloc(raw.to_a)
+ end
+
+ lsi_vectors = @u_matrix.batch_project(raw_vectors)
+
+ nodes.each_with_index do |node, i|
+ lsi_vec = lsi_vectors[i].row
+ node.lsi_vector = lsi_vec
+ node.lsi_norm = lsi_vec.normalize
+ end
+ end
+
+ # Pure Ruby re-projection (fallback)
+ # @rbs () -> void
+ def reproject_all_documents_ruby
+ @items.each_value do |node|
+ assign_lsi_vector_incremental(node)
+ end
+ end
+
+ # Assigns LSI vector to a node using projection: lsi_vec = U^T * raw_vec.
+ # @rbs (ContentNode) -> void
+ def assign_lsi_vector_incremental(node)
+ return unless @u_matrix
+
+ raw_vec = node.raw_vector_with(@word_list)
+ raw_vector = Vector[*raw_vec.to_a]
+ lsi_arr = (@u_matrix.transpose * raw_vector).to_a
+
+ lsi_vec = if self.class.native_available?
+ self.class.vector_class.alloc(lsi_arr).row
+ else
+ Vector[*lsi_arr]
+ end
+ node.lsi_vector = lsi_vec
+ node.lsi_norm = lsi_vec.normalize
+ end
end
end
diff --git a/lib/classifier/lsi/incremental_svd.rb b/lib/classifier/lsi/incremental_svd.rb
new file mode 100644
index 00000000..beabfa37
--- /dev/null
+++ b/lib/classifier/lsi/incremental_svd.rb
@@ -0,0 +1,166 @@
+# rbs_inline: enabled
+
+# rubocop:disable Naming/MethodParameterName, Metrics/ParameterLists
+
+require 'matrix'
+
+module Classifier
+ class LSI
+ # Brand's Incremental SVD Algorithm for LSI
+ #
+ # Implements the algorithm from Brand (2006) "Fast low-rank modifications
+ # of the thin singular value decomposition" for adding documents to LSI
+ # without full SVD recomputation.
+ #
+ # Given existing thin SVD: A ≈ U * S * V^T (with k components)
+ # When adding a new column c:
+ #
+ # 1. Project: m = U^T * c (project onto existing column space)
+ # 2. Residual: p = c - U * m (component orthogonal to U)
+ # 3. Orthonormalize: If ||p|| > ε: p̂ = p / ||p||
+ # 4. Form K matrix:
+ # - If ||p|| > ε: K = [diag(s), m; 0, ||p||] (rank grows by 1)
+ # - If ||p|| ≈ 0: K = diag(s) + m * e_last^T (no new direction)
+ # 5. Small SVD: Compute SVD of K (only (k+1) × (k+1) matrix!)
+ # 6. Update:
+ # - U_new = [U, p̂] * U'
+ # - S_new = S'
+ #
+ module IncrementalSVD
+ EPSILON = 1e-10
+
+ class << self
+ # Updates SVD with a new document vector using Brand's algorithm.
+ #
+ # @param u [Matrix] current left singular vectors (m × k)
+ # @param s [Array] current singular values (k values)
+ # @param c [Vector] new document vector (m × 1)
+ # @param max_rank [Integer] maximum rank to maintain
+ # @param epsilon [Float] threshold for zero detection
+ # @return [Array>] updated [u, s]
+ #
+ # @rbs (Matrix, Array[Float], Vector, max_rank: Integer, ?epsilon: Float) -> [Matrix, Array[Float]]
+ def update(u, s, c, max_rank:, epsilon: EPSILON)
+ m_vec = project(u, c)
+ u_times_m = u * m_vec
+ p_vec = c - (u_times_m.is_a?(Vector) ? u_times_m : Vector[*u_times_m.to_a.flatten])
+ p_norm = magnitude(p_vec)
+
+ if p_norm > epsilon
+ update_with_new_direction(u, s, m_vec, p_vec, p_norm, max_rank, epsilon)
+ else
+ update_in_span(u, s, m_vec, max_rank, epsilon)
+ end
+ end
+
+ # Projects a document vector onto the semantic space defined by U.
+ # Returns the LSI representation: lsi_vec = U^T * raw_vec
+ #
+ # @param u [Matrix] left singular vectors (m × k)
+ # @param raw_vec [Vector] document vector in term space (m × 1)
+ # @return [Vector] document in semantic space (k × 1)
+ #
+ # @rbs (Matrix, Vector) -> Vector
+ def project(u, raw_vec)
+ u.transpose * raw_vec
+ end
+
+ private
+
+ # Update when new document has a component orthogonal to existing U.
+ # @rbs (Matrix, Array[Float], Vector, Vector, Float, Integer, Float) -> [Matrix, Array[Float]]
+ def update_with_new_direction(u, s, m_vec, p_vec, p_norm, max_rank, epsilon)
+ p_hat = p_vec * (1.0 / p_norm)
+ k_matrix = build_k_matrix_with_growth(s, m_vec, p_norm)
+ u_prime, s_prime = small_svd(k_matrix, epsilon)
+ u_extended = extend_matrix_with_column(u, p_hat)
+ u_new = u_extended * u_prime
+
+ u_new, s_prime = truncate(u_new, s_prime, max_rank) if s_prime.size > max_rank
+
+ [u_new, s_prime]
+ end
+
+ # Update when new document is entirely within span of existing U.
+ # @rbs (Matrix, Array[Float], Vector, Integer, Float) -> [Matrix, Array[Float]]
+ def update_in_span(u, s, m_vec, max_rank, epsilon)
+ k_matrix = build_k_matrix_in_span(s, m_vec)
+ u_prime, s_prime = small_svd(k_matrix, epsilon)
+ u_new = u * u_prime
+
+ u_new, s_prime = truncate(u_new, s_prime, max_rank) if s_prime.size > max_rank
+
+ [u_new, s_prime]
+ end
+
+ # Builds the K matrix when rank grows by 1.
+ # @rbs (Array[Float], untyped, Float) -> untyped
+ def build_k_matrix_with_growth(s, m_vec, p_norm)
+ k = s.size
+ rows = k.times.map do |i|
+ row = Array.new(k + 1, 0.0) #: Array[Float]
+ row[i] = s[i].to_f
+ row[k] = m_vec[i].to_f
+ row
+ end
+ rows << Array.new(k + 1, 0.0).tap { |r| r[k] = p_norm }
+ Matrix.rows(rows)
+ end
+
+ # Builds the K matrix when vector is in span (no rank growth).
+ # @rbs (Array[Float], Vector) -> Matrix
+ def build_k_matrix_in_span(s, _m_vec)
+ k = s.size
+ rows = k.times.map do |i|
+ row = Array.new(k, 0.0)
+ row[i] = s[i]
+ row
+ end
+ Matrix.rows(rows)
+ end
+
+ # Computes SVD of small matrix and extracts singular values.
+ # @rbs (Matrix, Float) -> [Matrix, Array[Float]]
+ def small_svd(matrix, epsilon)
+ u, _v, s_array = matrix.SV_decomp
+
+ s_sorted = s_array.select { |sv| sv.abs > epsilon }.sort.reverse
+ indices = s_array.each_with_index
+ .select { |sv, _| sv.abs > epsilon }
+ .sort_by { |sv, _| -sv }
+ .map { |_, i| i }
+
+ u_cols = indices.map { |i| u.column(i).to_a }
+ u_reordered = u_cols.empty? ? Matrix.empty(matrix.row_size, 0) : Matrix.columns(u_cols)
+
+ [u_reordered, s_sorted]
+ end
+
+ # Extends matrix with a new column
+ # @rbs (Matrix, Vector) -> Matrix
+ def extend_matrix_with_column(matrix, col_vec)
+ rows = matrix.row_size.times.map do |i|
+ matrix.row(i).to_a + [col_vec[i]]
+ end
+ Matrix.rows(rows)
+ end
+
+ # Truncates to max_rank
+ # @rbs (untyped, Array[Float], Integer) -> [untyped, Array[Float]]
+ def truncate(u, s, max_rank)
+ s_truncated = s[0...max_rank] || [] #: Array[Float]
+ cols = (0...max_rank).map { |i| u.column(i).to_a }
+ u_truncated = Matrix.columns(cols)
+ [u_truncated, s_truncated]
+ end
+
+ # Computes magnitude of a vector
+ # @rbs (untyped) -> Float
+ def magnitude(vec)
+ Math.sqrt(vec.to_a.sum { |x| x.to_f * x.to_f })
+ end
+ end
+ end
+ end
+end
+# rubocop:enable Naming/MethodParameterName, Metrics/ParameterLists
diff --git a/lib/classifier/streaming.rb b/lib/classifier/streaming.rb
new file mode 100644
index 00000000..3c228b41
--- /dev/null
+++ b/lib/classifier/streaming.rb
@@ -0,0 +1,122 @@
+# rbs_inline: enabled
+
+require_relative 'streaming/progress'
+require_relative 'streaming/line_reader'
+
+module Classifier
+ # Streaming module provides memory-efficient training capabilities for classifiers.
+ # Include this module in a classifier to add streaming and batch training methods.
+ #
+ # @example Including in a classifier
+ # class MyClassifier
+ # include Classifier::Streaming
+ # end
+ #
+ # @example Streaming training
+ # classifier.train_from_stream(:category, File.open('corpus.txt'))
+ #
+ # @example Batch training with progress
+ # classifier.train_batch(:category, documents, batch_size: 100) do |progress|
+ # puts "#{progress.percent}% complete"
+ # end
+ module Streaming
+ # Default batch size for streaming operations
+ DEFAULT_BATCH_SIZE = 100
+
+ # Trains the classifier from an IO stream.
+ # Each line in the stream is treated as a separate document.
+ #
+ # @rbs (Symbol | String, IO, ?batch_size: Integer) { (Progress) -> void } -> void
+ def train_from_stream(category, io, batch_size: DEFAULT_BATCH_SIZE, &block)
+ raise NotImplementedError, "#{self.class} must implement train_from_stream"
+ end
+
+ # Trains the classifier with an array of documents in batches.
+ # Supports both positional and keyword argument styles.
+ #
+ # @example Positional style
+ # classifier.train_batch(:spam, documents, batch_size: 100)
+ #
+ # @example Keyword style
+ # classifier.train_batch(spam: documents, ham: other_docs, batch_size: 100)
+ #
+ # @rbs (?(Symbol | String)?, ?Array[String]?, ?batch_size: Integer, **Array[String]) { (Progress) -> void } -> void
+ def train_batch(category = nil, documents = nil, batch_size: DEFAULT_BATCH_SIZE, **categories, &block)
+ raise NotImplementedError, "#{self.class} must implement train_batch"
+ end
+
+ # Saves a checkpoint of the current training state.
+ # Requires a storage backend to be configured.
+ #
+ # @rbs (String) -> void
+ def save_checkpoint(checkpoint_id)
+ raise ArgumentError, 'No storage configured' unless respond_to?(:storage) && storage
+
+ original_storage = storage
+
+ begin
+ self.storage = checkpoint_storage_for(checkpoint_id)
+ save
+ ensure
+ self.storage = original_storage
+ end
+ end
+
+ # Lists available checkpoints.
+ # Requires a storage backend to be configured.
+ #
+ # @rbs () -> Array[String]
+ def list_checkpoints
+ raise ArgumentError, 'No storage configured' unless respond_to?(:storage) && storage
+
+ case storage
+ when Storage::File
+ file_storage = storage #: Storage::File
+ dir = File.dirname(file_storage.path)
+ base = File.basename(file_storage.path, '.*')
+ ext = File.extname(file_storage.path)
+
+ pattern = File.join(dir, "#{base}_checkpoint_*#{ext}")
+ Dir.glob(pattern).map do |path|
+ File.basename(path, ext).sub(/^#{Regexp.escape(base)}_checkpoint_/, '')
+ end.sort
+ else
+ []
+ end
+ end
+
+ # Deletes a checkpoint.
+ #
+ # @rbs (String) -> void
+ def delete_checkpoint(checkpoint_id)
+ raise ArgumentError, 'No storage configured' unless respond_to?(:storage) && storage
+
+ checkpoint_storage = checkpoint_storage_for(checkpoint_id)
+ checkpoint_storage.delete if checkpoint_storage.exists?
+ end
+
+ private
+
+ # @rbs (String) -> String
+ def checkpoint_path_for(checkpoint_id)
+ raise ArgumentError, 'Storage must be File storage for checkpoints' unless storage.is_a?(Storage::File)
+
+ file_storage = storage #: Storage::File
+ dir = File.dirname(file_storage.path)
+ base = File.basename(file_storage.path, '.*')
+ ext = File.extname(file_storage.path)
+
+ File.join(dir, "#{base}_checkpoint_#{checkpoint_id}#{ext}")
+ end
+
+ # @rbs (String) -> Storage::Base
+ def checkpoint_storage_for(checkpoint_id)
+ case storage
+ when Storage::File
+ Storage::File.new(path: checkpoint_path_for(checkpoint_id))
+ else
+ raise ArgumentError, "Checkpoints not supported for #{storage.class}"
+ end
+ end
+ end
+end
diff --git a/lib/classifier/streaming/line_reader.rb b/lib/classifier/streaming/line_reader.rb
new file mode 100644
index 00000000..e0b92f31
--- /dev/null
+++ b/lib/classifier/streaming/line_reader.rb
@@ -0,0 +1,99 @@
+# rbs_inline: enabled
+
+module Classifier
+ module Streaming
+ # Memory-efficient line reader for large files and IO streams.
+ # Reads lines one at a time and can yield in configurable batches.
+ #
+ # @example Reading line by line
+ # reader = LineReader.new(File.open('large_corpus.txt'))
+ # reader.each { |line| process(line) }
+ #
+ # @example Reading in batches
+ # reader = LineReader.new(io, batch_size: 100)
+ # reader.each_batch { |batch| process_batch(batch) }
+ class LineReader
+ include Enumerable #[String]
+
+ # @rbs @io: IO
+ # @rbs @batch_size: Integer
+
+ attr_reader :batch_size
+
+ # Creates a new LineReader.
+ #
+ # @rbs (IO, ?batch_size: Integer) -> void
+ def initialize(io, batch_size: 100)
+ @io = io
+ @batch_size = batch_size
+ end
+
+ # Iterates over each line in the IO stream.
+ # Lines are chomped (trailing newlines removed).
+ #
+ # @rbs () { (String) -> void } -> void
+ # @rbs () -> Enumerator[String, void]
+ def each
+ return enum_for(:each) unless block_given?
+
+ @io.each_line do |line|
+ yield line.chomp
+ end
+ end
+
+ # Iterates over batches of lines.
+ # Each batch is an array of chomped lines.
+ #
+ # @rbs () { (Array[String]) -> void } -> void
+ # @rbs () -> Enumerator[Array[String], void]
+ def each_batch
+ return enum_for(:each_batch) unless block_given?
+
+ batch = [] #: Array[String]
+ each do |line|
+ batch << line
+ if batch.size >= @batch_size
+ yield batch
+ batch = []
+ end
+ end
+ yield batch unless batch.empty?
+ end
+
+ # Estimates the total number of lines in the IO stream.
+ # This is a rough estimate based on file size and average line length.
+ # Returns nil for non-seekable streams.
+ #
+ # @rbs (?sample_size: Integer) -> Integer?
+ def estimate_line_count(sample_size: 100)
+ return nil unless @io.respond_to?(:size) && @io.respond_to?(:rewind)
+
+ begin
+ original_pos = @io.pos
+ @io.rewind
+
+ sample_bytes = 0
+ sample_lines = 0
+
+ sample_size.times do
+ line = @io.gets
+ break unless line
+
+ sample_bytes += line.bytesize
+ sample_lines += 1
+ end
+
+ @io.seek(original_pos)
+
+ return nil if sample_lines.zero?
+
+ avg_line_size = sample_bytes.to_f / sample_lines
+ io_size = @io.__send__(:size) #: Integer
+ (io_size / avg_line_size).round
+ rescue IOError, Errno::ESPIPE
+ nil
+ end
+ end
+ end
+ end
+end
diff --git a/lib/classifier/streaming/progress.rb b/lib/classifier/streaming/progress.rb
new file mode 100644
index 00000000..d747674a
--- /dev/null
+++ b/lib/classifier/streaming/progress.rb
@@ -0,0 +1,96 @@
+# rbs_inline: enabled
+
+module Classifier
+ module Streaming
+ # Progress tracking object yielded to blocks during batch/stream operations.
+ # Provides information about training progress including completion percentage,
+ # elapsed time, processing rate, and estimated time remaining.
+ #
+ # @example Basic usage with train_batch
+ # classifier.train_batch(:spam, documents, batch_size: 100) do |progress|
+ # puts "#{progress.completed}/#{progress.total} (#{progress.percent}%)"
+ # puts "Rate: #{progress.rate.round(1)} docs/sec"
+ # puts "ETA: #{progress.eta&.round}s" if progress.eta
+ # end
+ class Progress
+ # @rbs @completed: Integer
+ # @rbs @total: Integer?
+ # @rbs @start_time: Time
+ # @rbs @current_batch: Integer
+
+ attr_reader :start_time, :total
+ attr_accessor :completed, :current_batch
+
+ # @rbs (?total: Integer?, ?completed: Integer) -> void
+ def initialize(total: nil, completed: 0)
+ @completed = completed
+ @total = total
+ @start_time = Time.now
+ @current_batch = 0
+ end
+
+ # Returns the completion percentage (0-100).
+ # Returns nil if total is unknown.
+ #
+ # @rbs () -> Float?
+ def percent
+ return nil unless @total&.positive?
+
+ (@completed.to_f / @total * 100).round(2)
+ end
+
+ # Returns the elapsed time in seconds since the operation started.
+ #
+ # @rbs () -> Float
+ def elapsed
+ Time.now - @start_time
+ end
+
+ # Returns the processing rate in items per second.
+ # Returns 0 if no time has elapsed.
+ #
+ # @rbs () -> Float
+ def rate
+ e = elapsed
+ return 0.0 if e.zero?
+
+ @completed / e
+ end
+
+ # Returns the estimated time remaining in seconds.
+ # Returns nil if total is unknown or rate is zero.
+ #
+ # @rbs () -> Float?
+ def eta
+ return nil unless @total
+ return nil if rate.zero?
+ return 0.0 if @completed >= @total
+
+ (@total - @completed) / rate
+ end
+
+ # Returns true if the operation is complete.
+ #
+ # @rbs () -> bool
+ def complete?
+ return false unless @total
+
+ @completed >= @total
+ end
+
+ # Returns a hash representation of the progress state.
+ #
+ # @rbs () -> Hash[Symbol, untyped]
+ def to_h
+ {
+ completed: @completed,
+ total: @total,
+ percent: percent,
+ elapsed: elapsed.round(2),
+ rate: rate.round(2),
+ eta: eta&.round(2)
+ }
+ end
+ end
+ end
+end
diff --git a/lib/classifier/tfidf.rb b/lib/classifier/tfidf.rb
index 02520ee7..56c52b15 100644
--- a/lib/classifier/tfidf.rb
+++ b/lib/classifier/tfidf.rb
@@ -16,6 +16,8 @@ module Classifier
# tfidf.transform("Dogs are loyal") # => {:dog=>0.7071..., :loyal=>0.7071...}
#
class TFIDF
+ include Streaming
+
# @rbs @min_df: Integer | Float
# @rbs @max_df: Integer | Float
# @rbs @ngram_range: Array[Integer]
@@ -246,6 +248,70 @@ def marshal_load(data)
@storage = nil
end
+ # Loads a vectorizer from a checkpoint.
+ #
+ # @rbs (storage: Storage::Base, checkpoint_id: String) -> TFIDF
+ def self.load_checkpoint(storage:, checkpoint_id:)
+ raise ArgumentError, 'Storage must be File storage for checkpoints' unless storage.is_a?(Storage::File)
+
+ dir = File.dirname(storage.path)
+ base = File.basename(storage.path, '.*')
+ ext = File.extname(storage.path)
+ checkpoint_path = File.join(dir, "#{base}_checkpoint_#{checkpoint_id}#{ext}")
+
+ checkpoint_storage = Storage::File.new(path: checkpoint_path)
+ instance = load(storage: checkpoint_storage)
+ instance.storage = storage
+ instance
+ end
+
+ # Fits the vectorizer from an IO stream.
+ # Collects all documents from the stream, then fits the model.
+ # Note: All documents must be collected in memory for IDF calculation.
+ #
+ # @example Fit from a file
+ # tfidf.fit_from_stream(File.open('corpus.txt'))
+ #
+ # @example With progress tracking
+ # tfidf.fit_from_stream(io, batch_size: 500) do |progress|
+ # puts "#{progress.completed} documents loaded"
+ # end
+ #
+ # @rbs (IO, ?batch_size: Integer) { (Streaming::Progress) -> void } -> self
+ def fit_from_stream(io, batch_size: Streaming::DEFAULT_BATCH_SIZE)
+ reader = Streaming::LineReader.new(io, batch_size: batch_size)
+ total = reader.estimate_line_count
+ progress = Streaming::Progress.new(total: total)
+
+ documents = [] #: Array[String]
+
+ reader.each_batch do |batch|
+ documents.concat(batch)
+ progress.completed += batch.size
+ progress.current_batch += 1
+ yield progress if block_given?
+ end
+
+ fit(documents) unless documents.empty?
+ self
+ end
+
+ # TFIDF doesn't support train_from_stream (use fit_from_stream instead).
+ # This method raises NotImplementedError with guidance.
+ #
+ # @rbs (*untyped, **untyped) -> void
+ def train_from_stream(*) # steep:ignore
+ raise NotImplementedError, 'TFIDF uses fit_from_stream instead of train_from_stream'
+ end
+
+ # TFIDF doesn't support train_batch (use fit instead).
+ # This method raises NotImplementedError with guidance.
+ #
+ # @rbs (*untyped, **untyped) -> void
+ def train_batch(*) # steep:ignore
+ raise NotImplementedError, 'TFIDF uses fit instead of train_batch'
+ end
+
private
# Restores vectorizer state from JSON string.
@@ -329,9 +395,7 @@ def validate_df!(value, name)
# @rbs (Array[Integer]) -> void
def validate_ngram_range!(range)
raise ArgumentError, 'ngram_range must be an array of two integers' unless range.is_a?(Array) && range.size == 2
- unless range.all?(Integer) && range.all?(&:positive?)
- raise ArgumentError, 'ngram_range values must be positive integers'
- end
+ raise ArgumentError, 'ngram_range values must be positive integers' unless range.all?(Integer) && range.all?(&:positive?)
raise ArgumentError, 'ngram_range[0] must be <= ngram_range[1]' if range[0] > range[1]
end
diff --git a/sig/vendor/matrix.rbs b/sig/vendor/matrix.rbs
index 910018e8..0ec2dab2 100644
--- a/sig/vendor/matrix.rbs
+++ b/sig/vendor/matrix.rbs
@@ -1,26 +1,37 @@
# Type stubs for matrix gem
-class Vector[T]
+# Using untyped elements since our usage is primarily with Floats/Numerics
+class Vector
EPSILON: Float
- def self.[]: [T] (*T) -> Vector[T]
+ def self.[]: (*untyped) -> Vector
def size: () -> Integer
- def []: (Integer) -> T
+ def []: (Integer) -> untyped
def magnitude: () -> Float
- def normalize: () -> Vector[T]
- def each: () { (T) -> void } -> void
- def collect: [U] () { (T) -> U } -> Vector[U]
- def to_a: () -> Array[T]
+ def normalize: () -> Vector
+ def each: () { (untyped) -> void } -> void
+ def collect: () { (untyped) -> untyped } -> Vector
+ def to_a: () -> Array[untyped]
def *: (untyped) -> untyped
+ def -: (Vector) -> Vector
+ def is_a?: (untyped) -> bool
end
-class Matrix[T]
- def self.rows: [T] (Array[Array[T]]) -> Matrix[T]
- def self.[]: [T] (*Array[T]) -> Matrix[T]
- def self.diag: (untyped) -> Matrix[untyped]
- def trans: () -> Matrix[T]
+class Matrix
+ def self.rows: (Array[Array[untyped]]) -> Matrix
+ def self.[]: (*Array[untyped]) -> Matrix
+ def self.diag: (untyped) -> Matrix
+ def self.columns: (Array[Array[untyped]]) -> Matrix
+ def self.empty: (Integer, Integer) -> Matrix
+ def self.zero: (Integer, Integer) -> Matrix
+ def self.vstack: (Matrix, Matrix) -> Matrix
+ def trans: () -> Matrix
+ def transpose: () -> Matrix
def *: (untyped) -> untyped
def row_size: () -> Integer
def column_size: () -> Integer
- def column: (Integer) -> Vector[T]
- def SV_decomp: () -> [Matrix[T], Matrix[T], untyped]
+ def row: (Integer) -> Vector
+ def column: (Integer) -> Vector
+ def SV_decomp: () -> [Matrix, Matrix, untyped]
+ def is_a?: (untyped) -> bool
+ def respond_to?: (Symbol) -> bool
end
diff --git a/sig/vendor/streaming.rbs b/sig/vendor/streaming.rbs
new file mode 100644
index 00000000..ee42c16e
--- /dev/null
+++ b/sig/vendor/streaming.rbs
@@ -0,0 +1,14 @@
+# Type stubs for Streaming module
+# Defines the interface that including classes must implement
+
+module Classifier
+ # Interface for classes that include Streaming
+ interface _StreamingHost
+ def storage: () -> Storage::Base?
+ def storage=: (Storage::Base?) -> void
+ def save: () -> void
+ end
+
+ module Streaming : _StreamingHost
+ end
+end
diff --git a/test/bayes/streaming_test.rb b/test/bayes/streaming_test.rb
new file mode 100644
index 00000000..fd2f311a
--- /dev/null
+++ b/test/bayes/streaming_test.rb
@@ -0,0 +1,337 @@
+require_relative '../test_helper'
+require 'stringio'
+require 'tempfile'
+
+class BayesStreamingTest < Minitest::Test
+ def setup
+ @classifier = Classifier::Bayes.new('Spam', 'Ham')
+ end
+
+ # train_from_stream tests
+
+ def test_train_from_stream_basic
+ io = StringIO.new("buy now cheap\nfree money\nlimited offer\n")
+ @classifier.train_from_stream(:spam, io)
+
+ # Should have trained 3 documents
+ assert_equal 'Spam', @classifier.classify('buy cheap free')
+ end
+
+ def test_train_from_stream_empty_io
+ io = StringIO.new('')
+ @classifier.train_from_stream(:spam, io)
+
+ # No documents trained, classifier should still work
+ assert_includes @classifier.categories, 'Spam'
+ end
+
+ def test_train_from_stream_single_line
+ io = StringIO.new("this is spam content\n")
+ @classifier.train_from_stream(:spam, io)
+
+ # Train some ham to make classification meaningful
+ @classifier.train(:ham, 'this is normal email')
+
+ result = @classifier.classify('spam content')
+
+ assert_equal 'Spam', result
+ end
+
+ def test_train_from_stream_with_batch_size
+ lines = (1..50).map { |i| "document number #{i} with some content" }
+ io = StringIO.new(lines.join("\n"))
+
+ batches_processed = 0
+ @classifier.train_from_stream(:spam, io, batch_size: 10) do |progress|
+ batches_processed = progress.current_batch
+ end
+
+ assert_equal 5, batches_processed
+ end
+
+ def test_train_from_stream_progress_tracking
+ lines = (1..25).map { |i| "line #{i}" }
+ io = StringIO.new(lines.join("\n"))
+
+ completed_values = []
+ @classifier.train_from_stream(:spam, io, batch_size: 10) do |progress|
+ completed_values << progress.completed
+ end
+
+ assert_equal [10, 20, 25], completed_values
+ end
+
+ def test_train_from_stream_with_progress_block
+ io = StringIO.new("doc1\ndoc2\ndoc3\n")
+ progress_received = nil
+
+ @classifier.train_from_stream(:spam, io, batch_size: 2) do |progress|
+ progress_received = progress
+ end
+
+ assert_instance_of Classifier::Streaming::Progress, progress_received
+ assert_equal 3, progress_received.completed
+ end
+
+ def test_train_from_stream_invalid_category
+ io = StringIO.new("some content\n")
+
+ assert_raises(StandardError) do
+ @classifier.train_from_stream(:nonexistent, io)
+ end
+ end
+
+ def test_train_from_stream_marks_dirty
+ io = StringIO.new("content\n")
+
+ refute_predicate @classifier, :dirty?
+
+ @classifier.train_from_stream(:spam, io)
+
+ assert_predicate @classifier, :dirty?
+ end
+
+ def test_train_from_stream_with_file
+ Tempfile.create(['corpus', '.txt']) do |file|
+ file.puts 'spam message one'
+ file.puts 'spam message two'
+ file.puts 'spam message three'
+ file.flush
+ file.rewind
+
+ @classifier.train_from_stream(:spam, file)
+ end
+
+ @classifier.train(:ham, 'normal message here')
+
+ assert_equal 'Spam', @classifier.classify('spam message')
+ end
+
+ # train_batch tests
+
+ def test_train_batch_positional_style
+ documents = ['buy now', 'free money', 'limited offer']
+ @classifier.train_batch(:spam, documents)
+
+ @classifier.train(:ham, 'hello friend')
+
+ assert_equal 'Spam', @classifier.classify('buy free limited')
+ end
+
+ def test_train_batch_keyword_style
+ spam_docs = ['buy now', 'free money']
+ ham_docs = ['hello friend', 'meeting tomorrow']
+
+ @classifier.train_batch(spam: spam_docs, ham: ham_docs)
+
+ assert_equal 'Spam', @classifier.classify('buy free')
+ assert_equal 'Ham', @classifier.classify('hello meeting')
+ end
+
+ def test_train_batch_with_batch_size
+ documents = (1..100).map { |i| "document #{i}" }
+
+ batches = 0
+ @classifier.train_batch(:spam, documents, batch_size: 25) do |_progress|
+ batches += 1
+ end
+
+ assert_equal 4, batches
+ end
+
+ def test_train_batch_progress_tracking
+ documents = (1..30).map { |i| "doc #{i}" }
+
+ completed_values = []
+ @classifier.train_batch(:spam, documents, batch_size: 10) do |progress|
+ completed_values << progress.completed
+ end
+
+ assert_equal [10, 20, 30], completed_values
+ end
+
+ def test_train_batch_progress_percent
+ documents = (1..100).map { |i| "doc #{i}" }
+
+ percent_values = []
+ @classifier.train_batch(:spam, documents, batch_size: 25) do |progress|
+ percent_values << progress.percent
+ end
+
+ assert_equal [25.0, 50.0, 75.0, 100.0], percent_values
+ end
+
+ def test_train_batch_empty_array
+ @classifier.train_batch(:spam, [])
+ # Should not raise, classifier should be unchanged
+ assert_includes @classifier.categories, 'Spam'
+ end
+
+ def test_train_batch_invalid_category
+ assert_raises(StandardError) do
+ @classifier.train_batch(:nonexistent, ['doc'])
+ end
+ end
+
+ def test_train_batch_marks_dirty
+ refute_predicate @classifier, :dirty?
+ @classifier.train_batch(:spam, ['content'])
+
+ assert_predicate @classifier, :dirty?
+ end
+
+ def test_train_batch_multiple_categories
+ @classifier.train_batch(
+ spam: ['buy now', 'free offer'],
+ ham: %w[hello meeting]
+ )
+
+ assert_equal 'Spam', @classifier.classify('buy free')
+ assert_equal 'Ham', @classifier.classify('hello meeting')
+ end
+
+ # Equivalence tests
+
+ def test_train_batch_equivalent_to_train
+ classifier1 = Classifier::Bayes.new('Spam', 'Ham')
+ classifier2 = Classifier::Bayes.new('Spam', 'Ham')
+
+ documents = ['buy now cheap', 'free money fast', 'limited time offer']
+
+ # Train with regular train
+ documents.each { |doc| classifier1.train(:spam, doc) }
+
+ # Train with train_batch
+ classifier2.train_batch(:spam, documents)
+
+ # Both should classify the same
+ test_doc = 'buy cheap free limited'
+
+ assert_equal classifier1.classify(test_doc), classifier2.classify(test_doc)
+
+ # Classifications should be identical
+ assert_equal classifier1.classifications(test_doc), classifier2.classifications(test_doc)
+ end
+
+ def test_train_from_stream_equivalent_to_train
+ classifier1 = Classifier::Bayes.new('Spam', 'Ham')
+ classifier2 = Classifier::Bayes.new('Spam', 'Ham')
+
+ documents = ['buy now cheap', 'free money fast', 'limited time offer']
+
+ # Train with regular train
+ documents.each { |doc| classifier1.train(:spam, doc) }
+
+ # Train with train_from_stream
+ io = StringIO.new(documents.join("\n"))
+ classifier2.train_from_stream(:spam, io)
+
+ # Both should classify the same
+ test_doc = 'buy cheap free limited'
+
+ assert_equal classifier1.classify(test_doc), classifier2.classify(test_doc)
+ end
+
+ # Checkpoint tests
+
+ def test_save_checkpoint
+ Dir.mktmpdir do |dir|
+ path = File.join(dir, 'classifier.json')
+ @classifier.storage = Classifier::Storage::File.new(path: path)
+
+ @classifier.train(:spam, 'buy now cheap')
+ @classifier.save_checkpoint('50pct')
+
+ checkpoint_path = File.join(dir, 'classifier_checkpoint_50pct.json')
+
+ assert_path_exists checkpoint_path
+ end
+ end
+
+ def test_load_checkpoint
+ Dir.mktmpdir do |dir|
+ path = File.join(dir, 'classifier.json')
+ storage = Classifier::Storage::File.new(path: path)
+ @classifier.storage = storage
+
+ @classifier.train(:spam, 'buy now cheap')
+ @classifier.save_checkpoint('halfway')
+
+ # Load from checkpoint
+ loaded = Classifier::Bayes.load_checkpoint(storage: storage, checkpoint_id: 'halfway')
+
+ # Should have the same training
+ assert_equal @classifier.classify('buy cheap'), loaded.classify('buy cheap')
+ end
+ end
+
+ def test_list_checkpoints
+ Dir.mktmpdir do |dir|
+ path = File.join(dir, 'classifier.json')
+ @classifier.storage = Classifier::Storage::File.new(path: path)
+
+ @classifier.train(:spam, 'content')
+ @classifier.save_checkpoint('10pct')
+ @classifier.save_checkpoint('50pct')
+ @classifier.save_checkpoint('90pct')
+
+ checkpoints = @classifier.list_checkpoints
+
+ assert_equal %w[10pct 50pct 90pct], checkpoints.sort
+ end
+ end
+
+ def test_delete_checkpoint
+ Dir.mktmpdir do |dir|
+ path = File.join(dir, 'classifier.json')
+ @classifier.storage = Classifier::Storage::File.new(path: path)
+
+ @classifier.train(:spam, 'content')
+ @classifier.save_checkpoint('test')
+
+ checkpoint_path = File.join(dir, 'classifier_checkpoint_test.json')
+
+ assert_path_exists checkpoint_path
+
+ @classifier.delete_checkpoint('test')
+
+ refute_path_exists checkpoint_path
+ end
+ end
+
+ def test_save_checkpoint_requires_storage
+ assert_raises(ArgumentError) do
+ @classifier.save_checkpoint('test')
+ end
+ end
+
+ def test_checkpoint_workflow
+ Dir.mktmpdir do |dir|
+ path = File.join(dir, 'classifier.json')
+ storage = Classifier::Storage::File.new(path: path)
+
+ # First training session
+ classifier1 = Classifier::Bayes.new('Spam', 'Ham')
+ classifier1.storage = storage
+
+ classifier1.train(:spam, 'buy now')
+ classifier1.train(:spam, 'free money')
+ classifier1.save_checkpoint('phase1')
+
+ # Simulate restart - load from checkpoint
+ classifier2 = Classifier::Bayes.load_checkpoint(storage: storage, checkpoint_id: 'phase1')
+
+ # Continue training
+ classifier2.train(:ham, 'hello friend')
+ classifier2.train(:ham, 'meeting tomorrow')
+ classifier2.save_checkpoint('phase2')
+
+ # Load final checkpoint
+ final = Classifier::Bayes.load_checkpoint(storage: storage, checkpoint_id: 'phase2')
+
+ # Should have all training
+ assert_equal 'Spam', final.classify('buy free')
+ assert_equal 'Ham', final.classify('hello meeting')
+ end
+ end
+end
diff --git a/test/knn/knn_test.rb b/test/knn/knn_test.rb
index 9eed7dbb..f9e213bd 100644
--- a/test/knn/knn_test.rb
+++ b/test/knn/knn_test.rb
@@ -529,4 +529,56 @@ def test_items_returns_copy
assert_equal 1, knn.items.size
end
+
+ # API consistency tests (with Bayes and LogisticRegression)
+
+ def test_train_alias_for_add
+ knn = Classifier::KNN.new(k: 3)
+ knn.train(Dog: [@str1, @str2], Cat: [@str3, @str4])
+
+ assert_equal 4, knn.items.size
+ assert_equal 'Dog', knn.classify('This is about dogs')
+ assert_equal 'Cat', knn.classify('This is about cats')
+ end
+
+ def test_dynamic_train_methods
+ knn = Classifier::KNN.new(k: 3)
+ knn.train_dog @str1, @str2
+ knn.train_cat @str3, @str4
+
+ assert_equal 4, knn.items.size
+ # Dynamic methods create lowercase category names from method name
+ assert_equal 'dog', knn.classify('This is about dogs')
+ assert_equal 'cat', knn.classify('This is about cats')
+ end
+
+ def test_respond_to_train_methods
+ knn = Classifier::KNN.new
+
+ assert_respond_to knn, :train
+ assert_respond_to knn, :train_spam
+ assert_respond_to knn, :train_any_category
+ end
+
+ def test_classify_returns_string_not_symbol
+ knn = Classifier::KNN.new(k: 3)
+ knn.add(dog: [@str1, @str2], cat: [@str3, @str4]) # symbol keys
+
+ result = knn.classify('This is about dogs')
+
+ assert_instance_of String, result
+ assert_equal 'dog', result
+ end
+
+ def test_categories_returns_array_of_strings
+ knn = Classifier::KNN.new
+ knn.add(dog: 'Dogs are great', cat: 'Cats are independent')
+
+ cats = knn.categories
+
+ assert_instance_of Array, cats
+ cats.each { |cat| assert_instance_of String, cat }
+ assert_includes cats, 'dog'
+ assert_includes cats, 'cat'
+ end
end
diff --git a/test/lsi/incremental_svd_test.rb b/test/lsi/incremental_svd_test.rb
new file mode 100644
index 00000000..9ddfbcc8
--- /dev/null
+++ b/test/lsi/incremental_svd_test.rb
@@ -0,0 +1,304 @@
+require_relative '../test_helper'
+
+class IncrementalSVDModuleTest < Minitest::Test
+ def test_update_with_new_direction
+ # Create a simple U matrix (3 terms × 2 components)
+ u = Matrix[
+ [0.5, 0.5],
+ [0.5, -0.5],
+ [0.707, 0.0]
+ ]
+ s = [2.0, 1.0]
+
+ # New document vector (3 terms)
+ c = Vector[0.1, 0.2, 0.9]
+
+ u_new, s_new = Classifier::LSI::IncrementalSVD.update(u, s, c, max_rank: 10)
+
+ # Should have updated matrices
+ assert_instance_of Matrix, u_new
+ assert_instance_of Array, s_new
+
+ # Rank may have increased by 1 (if new direction found)
+ assert_operator s_new.size, :>=, s.size
+ assert_operator s_new.size, :<=, s.size + 1
+ end
+
+ def test_update_preserves_orthogonality
+ # Start with orthonormal U
+ u = Matrix[
+ [1.0, 0.0],
+ [0.0, 1.0],
+ [0.0, 0.0]
+ ]
+ s = [2.0, 1.5]
+
+ c = Vector[0.5, 0.5, 0.707]
+
+ u_new, _s_new = Classifier::LSI::IncrementalSVD.update(u, s, c, max_rank: 10)
+
+ # Check columns are approximately orthonormal
+ u_new.column_size.times do |i|
+ col_i = u_new.column(i)
+ # Column should have unit length (approximately)
+ norm = Math.sqrt(col_i.to_a.sum { |x| x * x })
+
+ assert_in_delta 1.0, norm, 0.1, "Column #{i} should have unit length"
+ end
+ end
+
+ def test_update_with_vector_in_span
+ # U spans the first two dimensions
+ u = Matrix[
+ [1.0, 0.0],
+ [0.0, 1.0],
+ [0.0, 0.0]
+ ]
+ s = [2.0, 1.0]
+
+ # Vector entirely in span of U (no component in third dimension)
+ c = Vector[0.6, 0.8, 0.0]
+
+ _u_new, s_new = Classifier::LSI::IncrementalSVD.update(u, s, c, max_rank: 10)
+
+ # Rank should not increase when vector is in span
+ assert_equal s.size, s_new.size
+ end
+
+ def test_update_respects_max_rank
+ u = Matrix[
+ [1.0, 0.0],
+ [0.0, 1.0],
+ [0.0, 0.0]
+ ]
+ s = [2.0, 1.0]
+
+ c = Vector[0.1, 0.1, 0.99]
+
+ # With max_rank = 2, should not exceed 2 components
+ u_new, s_new = Classifier::LSI::IncrementalSVD.update(u, s, c, max_rank: 2)
+
+ assert_equal 2, s_new.size
+ assert_equal 2, u_new.column_size
+ end
+
+ def test_singular_values_sorted_descending
+ u = Matrix[
+ [0.707, 0.707],
+ [0.707, -0.707],
+ [0.0, 0.0]
+ ]
+ s = [3.0, 1.0]
+
+ c = Vector[0.5, 0.5, 0.707]
+
+ _u_new, s_new = Classifier::LSI::IncrementalSVD.update(u, s, c, max_rank: 10)
+
+ # Singular values should be sorted in descending order
+ (0...(s_new.size - 1)).each do |i|
+ assert_operator s_new[i], :>=, s_new[i + 1], 'Singular values should be descending'
+ end
+ end
+end
+
+class LSIIncrementalModeTest < Minitest::Test
+ def setup
+ @dog_docs = [
+ 'dogs are loyal pets that bark',
+ 'puppies are playful young dogs',
+ 'dogs love to play fetch'
+ ]
+ @cat_docs = [
+ 'cats are independent pets',
+ 'kittens are curious creatures',
+ 'cats meow and purr softly'
+ ]
+ end
+
+ def test_incremental_mode_initialization
+ lsi = Classifier::LSI.new(incremental: true)
+
+ assert lsi.instance_variable_get(:@incremental_mode)
+ refute_predicate lsi, :incremental_enabled? # Not active until first build
+ end
+
+ def test_incremental_mode_with_max_rank
+ lsi = Classifier::LSI.new(incremental: true, max_rank: 50)
+
+ assert_equal 50, lsi.instance_variable_get(:@max_rank)
+ end
+
+ def test_incremental_enabled_after_build
+ lsi = Classifier::LSI.new(incremental: true, auto_rebuild: false)
+
+ @dog_docs.each { |doc| lsi.add_item(doc, :dog) }
+ @cat_docs.each { |doc| lsi.add_item(doc, :cat) }
+
+ refute_predicate lsi, :incremental_enabled?
+
+ lsi.build_index
+
+ assert_predicate lsi, :incremental_enabled?
+ assert_instance_of Matrix, lsi.instance_variable_get(:@u_matrix)
+ end
+
+ def test_incremental_add_uses_incremental_update
+ lsi = Classifier::LSI.new(incremental: true, auto_rebuild: false)
+
+ # Add initial documents
+ @dog_docs.each { |doc| lsi.add_item(doc, :dog) }
+ @cat_docs.each { |doc| lsi.add_item(doc, :cat) }
+ lsi.build_index
+
+ initial_version = lsi.instance_variable_get(:@version)
+
+ # Add new document - should use incremental update
+ lsi.add_item('my dog loves to run and play', :dog)
+
+ # Version should have incremented
+ assert_equal initial_version + 1, lsi.instance_variable_get(:@version)
+
+ # Should still be in incremental mode
+ assert_predicate lsi, :incremental_enabled?
+ end
+
+ def test_incremental_classification_works
+ lsi = Classifier::LSI.new(incremental: true, auto_rebuild: false)
+
+ @dog_docs.each { |doc| lsi.add_item(doc, :dog) }
+ @cat_docs.each { |doc| lsi.add_item(doc, :cat) }
+ lsi.build_index
+
+ # Add more documents incrementally
+ lsi.add_item('dogs are wonderful companions', :dog)
+ lsi.add_item('cats sleep a lot during the day', :cat)
+
+ # Classification should work
+ result = lsi.classify('loyal pet that barks').to_s
+
+ assert_equal 'dog', result
+
+ result = lsi.classify('independent creature that meows').to_s
+
+ assert_equal 'cat', result
+ end
+
+ def test_current_rank
+ lsi = Classifier::LSI.new(incremental: true, auto_rebuild: false)
+
+ @dog_docs.each { |doc| lsi.add_item(doc, :dog) }
+ @cat_docs.each { |doc| lsi.add_item(doc, :cat) }
+ lsi.build_index
+
+ rank = lsi.current_rank
+
+ assert_instance_of Integer, rank
+ assert_predicate rank, :positive?
+ end
+
+ def test_disable_incremental_mode
+ lsi = Classifier::LSI.new(incremental: true, auto_rebuild: false)
+
+ @dog_docs.each { |doc| lsi.add_item(doc, :dog) }
+ lsi.build_index
+
+ assert_predicate lsi, :incremental_enabled?
+
+ lsi.disable_incremental_mode!
+
+ refute_predicate lsi, :incremental_enabled?
+ assert_nil lsi.instance_variable_get(:@u_matrix)
+ end
+
+ def test_enable_incremental_mode
+ lsi = Classifier::LSI.new(auto_rebuild: false)
+
+ @dog_docs.each { |doc| lsi.add_item(doc, :dog) }
+ lsi.build_index
+
+ refute_predicate lsi, :incremental_enabled?
+
+ lsi.enable_incremental_mode!(max_rank: 75)
+ lsi.build_index(force: true)
+
+ assert_predicate lsi, :incremental_enabled?
+ assert_equal 75, lsi.instance_variable_get(:@max_rank)
+ end
+
+ def test_force_rebuild
+ lsi = Classifier::LSI.new(incremental: true, auto_rebuild: false)
+
+ @dog_docs.each { |doc| lsi.add_item(doc, :dog) }
+ lsi.build_index
+
+ # Force rebuild should work
+ lsi.build_index(force: true)
+
+ assert_predicate lsi, :incremental_enabled?
+ end
+
+ def test_vocabulary_growth_triggers_full_rebuild
+ lsi = Classifier::LSI.new(incremental: true, auto_rebuild: false)
+
+ # Start with a small vocabulary
+ lsi.add_item('dog', :animal)
+ lsi.add_item('cat', :animal)
+ lsi.build_index
+
+ # Store initial vocab size to verify growth detection works
+ _initial_vocab_size = lsi.instance_variable_get(:@initial_vocab_size)
+
+ # Add document with many new words (> 20% of initial vocabulary)
+ # This should trigger a full rebuild and disable incremental mode
+ many_new_words = (1..100).map { |i| "newword#{i}" }.join(' ')
+ lsi.add_item(many_new_words, :animal)
+
+ # After vocabulary growth > 20%, incremental mode should be disabled
+ # and a full rebuild should have occurred
+ refute_predicate lsi, :incremental_enabled?
+ end
+
+ def test_incremental_produces_reasonable_results
+ # Build with full SVD
+ lsi_full = Classifier::LSI.new(auto_rebuild: false)
+ @dog_docs.each { |doc| lsi_full.add_item(doc, :dog) }
+ @cat_docs.each { |doc| lsi_full.add_item(doc, :cat) }
+ lsi_full.add_item('my dog is a great friend', :dog)
+ lsi_full.build_index
+
+ # Build with incremental mode
+ lsi_inc = Classifier::LSI.new(incremental: true, auto_rebuild: false)
+ @dog_docs.each { |doc| lsi_inc.add_item(doc, :dog) }
+ @cat_docs.each { |doc| lsi_inc.add_item(doc, :cat) }
+ lsi_inc.build_index
+ lsi_inc.add_item('my dog is a great friend', :dog)
+
+ # Both should classify test documents reasonably
+ # Note: Results may differ due to approximation, but should be reasonable
+ test_doc = 'loyal pet barking'
+
+ _full_result = lsi_full.classify(test_doc).to_s
+ inc_result = lsi_inc.classify(test_doc).to_s
+
+ # At minimum, incremental should produce a valid classification
+ assert_includes %w[dog cat], inc_result
+ end
+
+ def test_incremental_with_streaming_api
+ lsi = Classifier::LSI.new(incremental: true, auto_rebuild: false)
+
+ # Add initial batch
+ @dog_docs.each { |doc| lsi.add_item(doc, :dog) }
+ @cat_docs.each { |doc| lsi.add_item(doc, :cat) }
+ lsi.build_index
+
+ # Use streaming API to add more
+ io = StringIO.new("dogs bark loudly\ndogs wag their tails\n")
+ lsi.train_from_stream(:dog, io)
+
+ # Should still work
+ result = lsi.classify('barking dog wags tail').to_s
+
+ assert_equal 'dog', result
+ end
+end
diff --git a/test/lsi/streaming_test.rb b/test/lsi/streaming_test.rb
new file mode 100644
index 00000000..94bfaafb
--- /dev/null
+++ b/test/lsi/streaming_test.rb
@@ -0,0 +1,343 @@
+require_relative '../test_helper'
+require 'stringio'
+require 'tempfile'
+
+class LSIStreamingTest < Minitest::Test
+ def setup
+ @lsi = Classifier::LSI.new
+ end
+
+ # train_from_stream tests
+
+ def test_train_from_stream_basic
+ dog_docs = "dogs are loyal pets\npuppies are playful\ndogs bark at strangers\n"
+ cat_docs = "cats are independent\nkittens are curious\ncats meow softly\n"
+
+ @lsi.train_from_stream(:dog, StringIO.new(dog_docs))
+ @lsi.train_from_stream(:cat, StringIO.new(cat_docs))
+
+ # Should be able to classify
+ # Note: classify returns the category as originally stored (symbol in this case)
+ result = @lsi.classify('loyal pet that barks')
+
+ assert_equal 'dog', result.to_s
+ end
+
+ def test_train_from_stream_empty_io
+ @lsi.train_from_stream(:category, StringIO.new(''))
+
+ # No items added
+ assert_empty @lsi.items
+ end
+
+ def test_train_from_stream_single_line
+ @lsi.train_from_stream(:dog, StringIO.new("dogs are loyal pets\n"))
+ @lsi.train_from_stream(:cat, StringIO.new("cats are independent\n"))
+
+ # Should have 2 items
+ assert_equal 2, @lsi.items.size
+ end
+
+ def test_train_from_stream_with_batch_size
+ lines = (1..50).map { |i| "document number #{i} about dogs" }
+ io = StringIO.new(lines.join("\n"))
+
+ batches_processed = 0
+ @lsi.train_from_stream(:dog, io, batch_size: 10) do |progress|
+ batches_processed = progress.current_batch
+ end
+
+ assert_equal 5, batches_processed
+ end
+
+ def test_train_from_stream_progress_tracking
+ lines = (1..25).map { |i| "document #{i}" }
+ io = StringIO.new(lines.join("\n"))
+
+ completed_values = []
+ @lsi.train_from_stream(:category, io, batch_size: 10) do |progress|
+ completed_values << progress.completed
+ end
+
+ assert_equal [10, 20, 25], completed_values
+ end
+
+ def test_train_from_stream_marks_dirty
+ refute_predicate @lsi, :dirty?
+ @lsi.train_from_stream(:category, StringIO.new("content\n"))
+
+ assert_predicate @lsi, :dirty?
+ end
+
+ def test_train_from_stream_rebuilds_index_when_auto_rebuild
+ @lsi = Classifier::LSI.new(auto_rebuild: true)
+
+ dog_docs = "dogs are loyal\ndogs bark\n"
+ cat_docs = "cats are independent\ncats meow\n"
+
+ @lsi.train_from_stream(:dog, StringIO.new(dog_docs))
+ @lsi.train_from_stream(:cat, StringIO.new(cat_docs))
+
+ # Index should be built
+ refute_predicate @lsi, :needs_rebuild?
+ end
+
+ def test_train_from_stream_skips_rebuild_when_auto_rebuild_false
+ @lsi = Classifier::LSI.new(auto_rebuild: false)
+
+ @lsi.train_from_stream(:category, StringIO.new("document one\ndocument two\n"))
+
+ # Index should need rebuild
+ assert_predicate @lsi, :needs_rebuild?
+ end
+
+ def test_train_from_stream_with_file
+ Tempfile.create(['corpus', '.txt']) do |file|
+ file.puts 'dogs are loyal pets'
+ file.puts 'puppies are playful animals'
+ file.puts 'dogs bark and play'
+ file.flush
+ file.rewind
+
+ @lsi.train_from_stream(:dog, file)
+ end
+
+ @lsi.add_item('cats are independent', :cat)
+ @lsi.add_item('kittens are curious', :cat)
+
+ result = @lsi.classify('loyal pet')
+
+ assert_equal 'dog', result.to_s
+ end
+
+ # add_batch tests
+
+ def test_add_batch_basic
+ dog_docs = ['dogs are loyal', 'puppies play', 'dogs bark']
+ cat_docs = ['cats are independent', 'kittens curious', 'cats meow']
+
+ @lsi.add_batch(dog: dog_docs, cat: cat_docs)
+
+ assert_equal 6, @lsi.items.size
+ assert_equal 'dog', @lsi.classify('loyal dog barks').to_s
+ end
+
+ def test_add_batch_with_progress
+ docs = (1..30).map { |i| "document #{i}" }
+
+ completed_values = []
+ @lsi.add_batch(batch_size: 10, category: docs) do |progress|
+ completed_values << progress.completed
+ end
+
+ assert_equal [10, 20, 30], completed_values
+ end
+
+ def test_add_batch_empty
+ @lsi.add_batch(category: [])
+
+ assert_empty @lsi.items
+ end
+
+ def test_add_batch_marks_dirty
+ refute_predicate @lsi, :dirty?
+ @lsi.add_batch(category: ['doc'])
+
+ assert_predicate @lsi, :dirty?
+ end
+
+ def test_add_batch_rebuilds_when_auto_rebuild
+ @lsi = Classifier::LSI.new(auto_rebuild: true)
+
+ @lsi.add_batch(
+ dog: ['dogs bark', 'puppies play'],
+ cat: ['cats meow', 'kittens purr']
+ )
+
+ refute_predicate @lsi, :needs_rebuild?
+ end
+
+ # train_batch tests (alias for add_batch)
+
+ def test_train_batch_positional_style
+ docs = ['dogs are loyal', 'puppies play']
+ @lsi.train_batch(:dog, docs)
+
+ @lsi.add_item('cats are independent', :cat)
+
+ assert_equal 3, @lsi.items.size
+ end
+
+ def test_train_batch_keyword_style
+ @lsi.train_batch(
+ dog: ['dogs are loyal', 'puppies play'],
+ cat: ['cats are independent', 'kittens curious']
+ )
+
+ assert_equal 4, @lsi.items.size
+ assert_equal 'dog', @lsi.classify('loyal dog').to_s
+ end
+
+ def test_train_batch_with_progress
+ docs = (1..20).map { |i| "doc #{i}" }
+
+ batches = 0
+ @lsi.train_batch(:category, docs, batch_size: 5) do |_progress|
+ batches += 1
+ end
+
+ assert_equal 4, batches
+ end
+
+ # Equivalence tests
+
+ def test_train_from_stream_equivalent_to_add_item
+ lsi1 = Classifier::LSI.new
+ lsi2 = Classifier::LSI.new
+
+ documents = ['dogs are loyal pets', 'puppies are playful', 'dogs bark at strangers']
+
+ # Add with add_item
+ documents.each { |doc| lsi1.add_item(doc, :dog) }
+
+ # Add with train_from_stream
+ io = StringIO.new(documents.join("\n"))
+ lsi2.train_from_stream(:dog, io)
+
+ # Both should have same items
+ assert_equal lsi1.items.size, lsi2.items.size
+
+ # Add some cat documents to both for classification
+ lsi1.add_item('cats are independent', :cat)
+ lsi2.add_item('cats are independent', :cat)
+
+ # Both should classify the same
+ test_doc = 'loyal playful pet'
+
+ assert_equal lsi1.classify(test_doc).to_s, lsi2.classify(test_doc).to_s
+ end
+
+ def test_add_batch_equivalent_to_add
+ lsi1 = Classifier::LSI.new
+ lsi2 = Classifier::LSI.new
+
+ dog_docs = ['dogs bark', 'puppies play']
+ cat_docs = ['cats meow', 'kittens purr']
+
+ # Add with hash-style add
+ lsi1.add(dog: dog_docs, cat: cat_docs)
+
+ # Add with add_batch
+ lsi2.add_batch(dog: dog_docs, cat: cat_docs)
+
+ # Both should have same items
+ assert_equal lsi1.items.size, lsi2.items.size
+
+ # Both should classify the same
+ test_doc = 'barking dog'
+
+ assert_equal lsi1.classify(test_doc).to_s, lsi2.classify(test_doc).to_s
+ end
+
+ # Checkpoint tests
+
+ def test_save_checkpoint
+ Dir.mktmpdir do |dir|
+ path = File.join(dir, 'lsi.json')
+ @lsi.storage = Classifier::Storage::File.new(path: path)
+
+ @lsi.add_item('dogs are loyal', :dog)
+ @lsi.add_item('cats are independent', :cat)
+ @lsi.save_checkpoint('50pct')
+
+ checkpoint_path = File.join(dir, 'lsi_checkpoint_50pct.json')
+
+ assert_path_exists checkpoint_path
+ end
+ end
+
+ def test_load_checkpoint
+ Dir.mktmpdir do |dir|
+ path = File.join(dir, 'lsi.json')
+ storage = Classifier::Storage::File.new(path: path)
+ @lsi.storage = storage
+
+ @lsi.add_item('dogs are loyal', :dog)
+ @lsi.add_item('cats are independent', :cat)
+ @lsi.save_checkpoint('halfway')
+
+ # Load from checkpoint
+ loaded = Classifier::LSI.load_checkpoint(storage: storage, checkpoint_id: 'halfway')
+
+ # Should have the same items and same classification
+ assert_equal @lsi.items.size, loaded.items.size
+ # Compare as strings since loaded classifier may have string categories
+ assert_equal @lsi.classify('loyal dog').to_s, loaded.classify('loyal dog').to_s
+ end
+ end
+
+ def test_list_checkpoints
+ Dir.mktmpdir do |dir|
+ path = File.join(dir, 'lsi.json')
+ @lsi.storage = Classifier::Storage::File.new(path: path)
+
+ @lsi.add_item('content', :category)
+ @lsi.save_checkpoint('10pct')
+ @lsi.save_checkpoint('50pct')
+ @lsi.save_checkpoint('90pct')
+
+ checkpoints = @lsi.list_checkpoints
+
+ assert_equal %w[10pct 50pct 90pct], checkpoints.sort
+ end
+ end
+
+ def test_delete_checkpoint
+ Dir.mktmpdir do |dir|
+ path = File.join(dir, 'lsi.json')
+ @lsi.storage = Classifier::Storage::File.new(path: path)
+
+ @lsi.add_item('content', :category)
+ @lsi.save_checkpoint('test')
+
+ checkpoint_path = File.join(dir, 'lsi_checkpoint_test.json')
+
+ assert_path_exists checkpoint_path
+
+ @lsi.delete_checkpoint('test')
+
+ refute_path_exists checkpoint_path
+ end
+ end
+
+ def test_checkpoint_workflow
+ Dir.mktmpdir do |dir|
+ path = File.join(dir, 'lsi.json')
+ storage = Classifier::Storage::File.new(path: path)
+
+ # First training session
+ lsi1 = Classifier::LSI.new
+ lsi1.storage = storage
+
+ lsi1.add_item('dogs are loyal', :dog)
+ lsi1.add_item('puppies are playful', :dog)
+ lsi1.save_checkpoint('phase1')
+
+ # Simulate restart - load from checkpoint
+ lsi2 = Classifier::LSI.load_checkpoint(storage: storage, checkpoint_id: 'phase1')
+
+ # Continue training
+ lsi2.add_item('cats are independent', :cat)
+ lsi2.add_item('kittens are curious', :cat)
+ lsi2.save_checkpoint('phase2')
+
+ # Load final checkpoint
+ final = Classifier::LSI.load_checkpoint(storage: storage, checkpoint_id: 'phase2')
+
+ # Should have all training
+ assert_equal 4, final.items.size
+ assert_equal 'dog', final.classify('loyal playful').to_s
+ assert_equal 'cat', final.classify('independent curious').to_s
+ end
+ end
+end
diff --git a/test/storage/storage_test.rb b/test/storage/storage_test.rb
index ddd90b39..b72c246e 100644
--- a/test/storage/storage_test.rb
+++ b/test/storage/storage_test.rb
@@ -1,17 +1,21 @@
require_relative '../test_helper'
class StorageAPIConsistencyTest < Minitest::Test
- CLASSIFIERS = [
- Classifier::Bayes,
- Classifier::LSI,
- Classifier::KNN,
- Classifier::LogisticRegression,
- Classifier::TFIDF
- ].freeze
+ # Dynamically discover all classifier/vectorizer classes
+ CLASSIFIERS = Classifier.constants.filter_map do |const|
+ klass = Classifier.const_get(const)
+ next unless klass.is_a?(Class)
+
+ klass if klass.method_defined?(:classify) || klass.method_defined?(:transform)
+ end.freeze
INSTANCE_METHODS = %i[save reload reload! dirty? storage storage=].freeze
CLASS_METHODS = %i[load].freeze
+ def test_classifiers_discovered
+ assert_operator CLASSIFIERS.size, :>=, 5, "Expected at least 5 classifiers, found: #{CLASSIFIERS.map(&:name)}"
+ end
+
CLASSIFIERS.each do |klass|
class_name = klass.name.split('::').last.downcase
diff --git a/test/streaming/streaming_test.rb b/test/streaming/streaming_test.rb
new file mode 100644
index 00000000..5e4c5c9b
--- /dev/null
+++ b/test/streaming/streaming_test.rb
@@ -0,0 +1,374 @@
+require_relative '../test_helper'
+require 'stringio'
+
+class ProgressTest < Minitest::Test
+ def test_initialization_with_defaults
+ progress = Classifier::Streaming::Progress.new
+
+ assert_equal 0, progress.completed
+ assert_nil progress.total
+ assert_instance_of Time, progress.start_time
+ end
+
+ def test_initialization_with_total
+ progress = Classifier::Streaming::Progress.new(total: 100)
+
+ assert_equal 0, progress.completed
+ assert_equal 100, progress.total
+ end
+
+ def test_initialization_with_completed
+ progress = Classifier::Streaming::Progress.new(total: 100, completed: 50)
+
+ assert_equal 50, progress.completed
+ assert_equal 100, progress.total
+ end
+
+ def test_percent_with_known_total
+ progress = Classifier::Streaming::Progress.new(total: 100, completed: 25)
+
+ assert_in_delta(25.0, progress.percent)
+ end
+
+ def test_percent_with_fractional_value
+ progress = Classifier::Streaming::Progress.new(total: 3, completed: 1)
+
+ assert_in_delta(33.33, progress.percent)
+ end
+
+ def test_percent_with_unknown_total
+ progress = Classifier::Streaming::Progress.new
+ progress.completed = 50
+
+ assert_nil progress.percent
+ end
+
+ def test_percent_with_zero_total
+ progress = Classifier::Streaming::Progress.new(total: 0)
+
+ assert_nil progress.percent
+ end
+
+ def test_elapsed_time
+ progress = Classifier::Streaming::Progress.new
+ sleep 0.01
+
+ assert_operator progress.elapsed, :>=, 0.01
+ assert_operator progress.elapsed, :<, 1
+ end
+
+ def test_rate_calculation
+ progress = Classifier::Streaming::Progress.new(completed: 0)
+ sleep 0.01 # Ensure some time passes
+ progress.completed = 100
+ # Rate should be roughly 100 / elapsed (approximately 10000/s)
+ rate = progress.rate
+
+ assert_predicate rate, :positive?
+ end
+
+ def test_rate_with_zero_elapsed
+ # This is tricky to test since time always passes,
+ # but rate should handle edge cases gracefully
+ progress = Classifier::Streaming::Progress.new
+ rate = progress.rate
+ # Rate is 0 when completed is 0
+ assert_in_delta(0.0, rate)
+ end
+
+ def test_eta_with_known_total
+ progress = Classifier::Streaming::Progress.new(total: 100, completed: 50)
+ sleep 0.01 # Ensure some time passes
+ eta = progress.eta
+
+ assert_instance_of Float, eta
+ assert_operator eta, :>=, 0
+ end
+
+ def test_eta_with_unknown_total
+ progress = Classifier::Streaming::Progress.new(completed: 50)
+
+ assert_nil progress.eta
+ end
+
+ def test_eta_when_complete
+ progress = Classifier::Streaming::Progress.new(total: 100, completed: 100)
+ sleep 0.01
+
+ assert_in_delta(0.0, progress.eta)
+ end
+
+ def test_eta_with_zero_rate
+ progress = Classifier::Streaming::Progress.new(total: 100, completed: 0)
+
+ assert_nil progress.eta
+ end
+
+ def test_complete_when_finished
+ progress = Classifier::Streaming::Progress.new(total: 100, completed: 100)
+
+ assert_predicate progress, :complete?
+ end
+
+ def test_complete_when_not_finished
+ progress = Classifier::Streaming::Progress.new(total: 100, completed: 50)
+
+ refute_predicate progress, :complete?
+ end
+
+ def test_complete_with_unknown_total
+ progress = Classifier::Streaming::Progress.new(completed: 100)
+
+ refute_predicate progress, :complete?
+ end
+
+ def test_to_h
+ progress = Classifier::Streaming::Progress.new(total: 100, completed: 50)
+ hash = progress.to_h
+
+ assert_equal 50, hash[:completed]
+ assert_equal 100, hash[:total]
+ assert_in_delta(50.0, hash[:percent])
+ assert_instance_of Float, hash[:elapsed]
+ assert_instance_of Float, hash[:rate]
+ end
+
+ def test_completed_is_mutable
+ progress = Classifier::Streaming::Progress.new(total: 100)
+
+ assert_equal 0, progress.completed
+
+ progress.completed = 25
+
+ assert_equal 25, progress.completed
+
+ progress.completed += 25
+
+ assert_equal 50, progress.completed
+ end
+
+ def test_current_batch_tracking
+ progress = Classifier::Streaming::Progress.new(total: 100)
+
+ assert_equal 0, progress.current_batch
+
+ progress.current_batch = 5
+
+ assert_equal 5, progress.current_batch
+ end
+end
+
+class LineReaderTest < Minitest::Test
+ def test_each_line
+ io = StringIO.new("line1\nline2\nline3\n")
+ reader = Classifier::Streaming::LineReader.new(io)
+
+ lines = reader.each.to_a
+
+ assert_equal %w[line1 line2 line3], lines
+ end
+
+ def test_each_line_removes_trailing_newlines
+ io = StringIO.new("line1\nline2\r\nline3")
+ reader = Classifier::Streaming::LineReader.new(io)
+
+ lines = reader.each.to_a
+ # chomp removes both \n and \r\n
+ assert_equal %w[line1 line2 line3], lines
+ end
+
+ def test_each_with_block
+ io = StringIO.new("a\nb\nc\n")
+ reader = Classifier::Streaming::LineReader.new(io)
+
+ collected = reader.map(&:upcase)
+
+ assert_equal %w[A B C], collected
+ end
+
+ def test_enumerable_methods
+ io = StringIO.new("1\n2\n3\n")
+ reader = Classifier::Streaming::LineReader.new(io)
+
+ # LineReader includes Enumerable
+ assert_equal %w[1 2 3], reader.first(3)
+ end
+
+ def test_each_batch_default_size
+ lines = (1..250).map { |i| "line#{i}" }.join("\n")
+ io = StringIO.new(lines)
+ reader = Classifier::Streaming::LineReader.new(io)
+
+ batches = reader.each_batch.to_a
+
+ assert_equal 3, batches.size
+ assert_equal 100, batches[0].size
+ assert_equal 100, batches[1].size
+ assert_equal 50, batches[2].size
+ end
+
+ def test_each_batch_custom_size
+ lines = (1..25).map { |i| "line#{i}" }.join("\n")
+ io = StringIO.new(lines)
+ reader = Classifier::Streaming::LineReader.new(io, batch_size: 10)
+
+ batches = reader.each_batch.to_a
+
+ assert_equal 3, batches.size
+ assert_equal 10, batches[0].size
+ assert_equal 10, batches[1].size
+ assert_equal 5, batches[2].size
+ end
+
+ def test_each_batch_with_block
+ io = StringIO.new("a\nb\nc\nd\ne\n")
+ reader = Classifier::Streaming::LineReader.new(io, batch_size: 2)
+
+ collected = []
+ reader.each_batch { |batch| collected << batch }
+
+ assert_equal [%w[a b], %w[c d], ['e']], collected
+ end
+
+ def test_each_batch_exact_multiple
+ lines = (1..10).map { |i| "line#{i}" }.join("\n")
+ io = StringIO.new(lines)
+ reader = Classifier::Streaming::LineReader.new(io, batch_size: 5)
+
+ batches = reader.each_batch.to_a
+
+ assert_equal 2, batches.size
+ assert_equal 5, batches[0].size
+ assert_equal 5, batches[1].size
+ end
+
+ def test_each_batch_empty_io
+ io = StringIO.new('')
+ reader = Classifier::Streaming::LineReader.new(io)
+
+ batches = reader.each_batch.to_a
+
+ assert_empty batches
+ end
+
+ def test_batch_size_reader
+ reader = Classifier::Streaming::LineReader.new(StringIO.new(''), batch_size: 50)
+
+ assert_equal 50, reader.batch_size
+ end
+
+ def test_estimate_line_count_with_seekable_io
+ # Create a file-like StringIO
+ lines = "short\nmedium line\nlonger line here\n"
+ io = StringIO.new(lines)
+
+ reader = Classifier::Streaming::LineReader.new(io)
+ estimate = reader.estimate_line_count(sample_size: 3)
+
+ # Should be close to 3 lines
+ assert_instance_of Integer, estimate
+ assert_predicate estimate, :positive?
+ end
+
+ def test_estimate_line_count_preserves_position
+ lines = "a\nb\nc\nd\ne\n"
+ io = StringIO.new(lines)
+ io.gets # Read one line, position is now after "a\n"
+ original_pos = io.pos
+
+ reader = Classifier::Streaming::LineReader.new(io)
+ reader.estimate_line_count
+
+ assert_equal original_pos, io.pos
+ end
+end
+
+class StreamingModuleTest < Minitest::Test
+ def test_default_batch_size
+ assert_equal 100, Classifier::Streaming::DEFAULT_BATCH_SIZE
+ end
+
+ class DummyClassifier
+ include Classifier::Streaming
+
+ attr_accessor :storage
+ end
+
+ def test_train_from_stream_raises_not_implemented
+ classifier = DummyClassifier.new
+ io = StringIO.new("test\n")
+
+ assert_raises(NotImplementedError) do
+ classifier.train_from_stream(:category, io)
+ end
+ end
+
+ def test_train_batch_raises_not_implemented
+ classifier = DummyClassifier.new
+
+ assert_raises(NotImplementedError) do
+ classifier.train_batch(:category, %w[doc1 doc2])
+ end
+ end
+
+ def test_save_checkpoint_requires_storage
+ classifier = DummyClassifier.new
+
+ assert_raises(ArgumentError) do
+ classifier.save_checkpoint('test')
+ end
+ end
+
+ def test_list_checkpoints_requires_storage
+ classifier = DummyClassifier.new
+
+ assert_raises(ArgumentError) do
+ classifier.list_checkpoints
+ end
+ end
+
+ def test_delete_checkpoint_requires_storage
+ classifier = DummyClassifier.new
+
+ assert_raises(ArgumentError) do
+ classifier.delete_checkpoint('test')
+ end
+ end
+end
+
+class StreamingEnforcementTest < Minitest::Test
+ # Dynamically discover all classifier/vectorizer classes by looking for
+ # classes that define `classify` (classifiers) or `transform` (vectorizers)
+ CLASSIFIERS = Classifier.constants.filter_map do |const|
+ klass = Classifier.const_get(const)
+ next unless klass.is_a?(Class)
+
+ klass if klass.method_defined?(:classify) || klass.method_defined?(:transform)
+ end.freeze
+
+ STREAMING_METHODS = %i[
+ train_from_stream
+ train_batch
+ save_checkpoint
+ list_checkpoints
+ delete_checkpoint
+ ].freeze
+
+ def test_classifiers_discovered
+ assert_operator CLASSIFIERS.size, :>=, 5, "Expected at least 5 classifiers, found: #{CLASSIFIERS.map(&:name)}"
+ end
+
+ CLASSIFIERS.each do |klass|
+ define_method("test_#{klass.name.split('::').last.downcase}_includes_streaming") do
+ assert_includes klass, Classifier::Streaming,
+ "#{klass} must include Classifier::Streaming"
+ end
+
+ STREAMING_METHODS.each do |method|
+ define_method("test_#{klass.name.split('::').last.downcase}_responds_to_#{method}") do
+ assert klass.method_defined?(method) || klass.private_method_defined?(method),
+ "#{klass} must respond to #{method}"
+ end
+ end
+ end
+end