diff --git a/.rubocop.yml b/.rubocop.yml index dbcc4fae..dfadfcd4 100644 --- a/.rubocop.yml +++ b/.rubocop.yml @@ -5,6 +5,10 @@ plugins: Layout/LeadingCommentSpace: AllowRBSInlineAnnotation: true +# @rbs type annotations can be long +Layout/LineLength: + Max: 140 + AllCops: TargetRubyVersion: 3.1 NewCops: enable diff --git a/README.md b/README.md index 19fc6d41..bbca62ab 100644 --- a/README.md +++ b/README.md @@ -4,531 +4,138 @@ [![CI](https://github.com/cardmagic/classifier/actions/workflows/ruby.yml/badge.svg)](https://github.com/cardmagic/classifier/actions/workflows/ruby.yml) [![License: LGPL](https://img.shields.io/badge/License-LGPL_2.1-blue.svg)](https://opensource.org/licenses/LGPL-2.1) -A Ruby library for text classification using Bayesian, Logistic Regression, LSI (Latent Semantic Indexing), k-Nearest Neighbors (kNN), and TF-IDF algorithms. +Text classification in Ruby. Five algorithms, native performance, streaming support. -**[Documentation](https://rubyclassifier.com/docs)** · **[Tutorials](https://rubyclassifier.com/docs/tutorials)** · **[Guides](https://rubyclassifier.com/docs/guides)** - -> **Note:** This is the original `classifier` gem, maintained for 20 years since 2005. After a quieter period, active development resumed in 2025 with major new features. If you're choosing between this and a fork, this is the canonical, most actively-developed version. +**[Documentation](https://rubyclassifier.com/docs)** · **[Tutorials](https://rubyclassifier.com/docs/tutorials)** · **[API Reference](https://rubyclassifier.com/docs/api)** ## Why This Library? 
-This gem has features no fork provides: - -| | This Gem | Forks | +| | This Gem | Other Forks | |:--|:--|:--| | **Algorithms** | 5 classifiers | 2 only | -| **LSI Performance** | Native C (5-50x faster) | Pure Ruby, or requires GSL/Numo + system libs | -| **Persistence** | Pluggable (file, Redis, S3, SQL) | Marshal only | -| **Thread Safety** | ✅ | ❌ | -| **Type Annotations** | RBS throughout | ❌ | -| **Laplace Smoothing** | Numerically stable | ❌ Unstable | -| **Probability Calibration** | ✅ | ❌ | -| **Feature Weights** | ✅ Interpretable | ❌ | - -### Recent Developments (Late 2025) - -- Added Logistic Regression classifier with SGD and L2 regularization -- Added k-Nearest Neighbors classifier with distance-weighted voting -- Added TF-IDF vectorizer with n-gram support -- Built zero-dependency native C extension (replaces GSL or Numo requirement) -- Added pluggable storage backends for persistence -- Made all classifiers thread-safe -- Fixed Laplace smoothing for numerical stability -- Added RBS type signatures throughout -- Modernized for new Ruby coding standards - -## Table of Contents - -- [Installation](#installation) -- [Algorithms](#bayesian-classifier) - - [Bayesian Classifier](#bayesian-classifier) - - [Logistic Regression](#logistic-regression) - - [LSI (Latent Semantic Indexing)](#lsi-latent-semantic-indexing) - - [k-Nearest Neighbors (kNN)](#k-nearest-neighbors-knn) - - [TF-IDF Vectorizer](#tf-idf-vectorizer) -- [Persistence](#persistence) -- [Performance](#performance) -- [Development](#development) -- [Contributing](#contributing) -- [License](#license) +| **Incremental LSI** | Brand's algorithm (no rebuild) | Full SVD rebuild on every add | +| **LSI Performance** | Native C extension (5-50x faster) | Pure Ruby or requires GSL | +| **Streaming** | Train on multi-GB datasets | Must load all data in memory | +| **Persistence** | Pluggable (file, Redis, S3) | Marshal only | ## Installation -Add to your Gemfile: - ```ruby gem 'classifier' ``` -Then run: 
- -```bash -bundle install -``` - -Or install directly: - -```bash -gem install classifier -``` - -### Native C Extension +## Quick Start -The gem includes a zero-dependency native C extension for fast LSI operations (5-50x faster than pure Ruby). It compiles automatically during installation. - -To verify the native extension is active: +### Bayesian ```ruby -require 'classifier' -puts Classifier::LSI.backend # => :native -``` - -To force pure Ruby mode (for debugging): - -```bash -NATIVE_VECTOR=true ruby your_script.rb -``` - -## Bayesian Classifier - -Fast, accurate classification with modest memory requirements. Ideal for spam filtering, sentiment analysis, and content categorization. - -### Quick Start - -```ruby -require 'classifier' - classifier = Classifier::Bayes.new(:spam, :ham) - -# Train with keyword arguments -classifier.train(spam: "Buy cheap v1agra now! Limited offer!") -classifier.train(ham: "Meeting scheduled for tomorrow at 10am") - -# Train multiple items at once -classifier.train( - spam: ["You've won a million dollars!", "Free money!!!"], - ham: ["Please review the document", "Lunch tomorrow?"] -) - -# Classify new text -classifier.classify "Congratulations! You've won a prize!" -# => "Spam" -``` - -### Learn More - -- [Bayes Basics Guide](https://rubyclassifier.com/docs/guides/bayes/basics) - In-depth documentation -- [Build a Spam Filter Tutorial](https://rubyclassifier.com/docs/tutorials/spam-filter) - Step-by-step guide -- [Paul Graham: A Plan for Spam](http://www.paulgraham.com/spam.html) - -## Logistic Regression - -Linear classifier using Stochastic Gradient Descent (SGD). Often more accurate than Naive Bayes while remaining fast and interpretable. Provides well-calibrated probability estimates and feature weights for understanding which words drive predictions. 
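
The training rule described above — stochastic gradient descent on a bag-of-words score with an L2 penalty — can be sketched in a few lines. This is an illustrative standalone sketch, not the gem's internal implementation; `sigmoid` and `sgd_step` are hypothetical helper names, and the defaults mirror the `learning_rate` and `regularization` hyperparameters documented here:

```ruby
# Illustrative sketch of one binary logistic-regression SGD step with
# L2 regularization. Not the gem's implementation; names are hypothetical.
def sigmoid(z)
  1.0 / (1.0 + Math.exp(-z))
end

# weights:  Hash[Symbol, Float]   — current feature weights
# features: Hash[Symbol, Numeric] — word counts for one document
# label:    1 (e.g. spam) or 0 (e.g. ham)
def sgd_step(weights, features, label, learning_rate: 0.1, regularization: 0.01)
  score = features.sum { |word, count| weights.fetch(word, 0.0) * count }
  error = sigmoid(score) - label
  features.each_key do |word|
    # Gradient of log-loss plus the L2 penalty term for this weight
    gradient = error * features[word] + regularization * weights.fetch(word, 0.0)
    weights[word] = weights.fetch(word, 0.0) - learning_rate * gradient
  end
  weights
end

w = sgd_step({}, { free: 2, prize: 1 }, 1)          # spam example
w = sgd_step(w, { meeting: 1, tomorrow: 1 }, 0)     # ham example
# w now has positive weights for spam-indicative words (:free, :prize)
# and negative weights for ham-indicative words (:meeting, :tomorrow)
```

Repeating this step over shuffled training examples for `max_iterations` passes, and stopping when the weight change falls below `tolerance`, corresponds to the remaining hyperparameters.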
- -### Key Features - -- **More Accurate**: No independence assumption means better accuracy on many text problems -- **Calibrated Probabilities**: Unlike Bayes, probabilities reflect true confidence -- **Interpretable**: Feature weights show which words matter for each class -- **Fast**: Linear time prediction, efficient SGD training -- **L2 Regularization**: Prevents overfitting on small datasets - -### Quick Start - -```ruby -require 'classifier' - -classifier = Classifier::LogisticRegression.new(:spam, :ham) - -# Train with keyword arguments (same API as Bayes) -classifier.train(spam: ["Buy now! Free money!", "Click here for prizes!"]) -classifier.train(ham: ["Meeting tomorrow", "Please review the document"]) - -# Classify new text -classifier.classify "Claim your free prize!" -# => "Spam" - -# Get calibrated probabilities -classifier.probabilities "Claim your free prize!" -# => {"Spam" => 0.92, "Ham" => 0.08} - -# See which words matter most -classifier.weights(:spam, limit: 5) -# => {:free => 2.3, :prize => 1.9, :claim => 1.5, ...} +classifier.train(spam: "Buy cheap viagra now!", ham: "Meeting at 3pm tomorrow") +classifier.classify "You've won a prize!" # => "Spam" ``` +[Bayesian Guide →](https://rubyclassifier.com/docs/guides/bayes/basics) -### Hyperparameters +### Logistic Regression ```ruby -classifier = Classifier::LogisticRegression.new( - :positive, :negative, - learning_rate: 0.1, # SGD step size (default: 0.1) - regularization: 0.01, # L2 penalty strength (default: 0.01) - max_iterations: 100, # Training iterations (default: 100) - tolerance: 1e-4 # Convergence threshold (default: 1e-4) -) +classifier = Classifier::LogisticRegression.new(:positive, :negative) +classifier.train(positive: "Great product!", negative: "Terrible experience") +classifier.classify "Loved it!" 
# => "Positive" ``` +[Logistic Regression Guide →](https://rubyclassifier.com/docs/guides/logistic-regression/basics) -### When to Use Logistic Regression vs Bayes - -| Aspect | Logistic Regression | Naive Bayes | -|--------|---------------------|-------------| -| **Accuracy** | Often higher | Good baseline | -| **Probabilities** | Well-calibrated | Tend to be extreme | -| **Training** | Needs to fit model | Instant (just counting) | -| **Interpretability** | Feature weights | Word frequencies | -| **Small datasets** | Use higher regularization | Works well | - -Use Logistic Regression when you need accurate probabilities or want to understand feature importance. Use Bayes when you need instant updates or have very small training sets. - -## LSI (Latent Semantic Indexing) - -Semantic analysis using Singular Value Decomposition (SVD). More flexible than Bayesian classifiers, providing search, clustering, and classification based on meaning rather than just keywords. - -### Quick Start +### LSI (Latent Semantic Indexing) ```ruby -require 'classifier' - lsi = Classifier::LSI.new - -# Add documents (category: item(s)) -lsi.add(pets: "Dogs are loyal pets that love to play fetch") -lsi.add(pets: "Cats are independent and love to nap") -lsi.add(programming: "Ruby is a dynamic programming language") - -# Add multiple items with the same category -lsi.add(programming: ["Python is great for data science", "JavaScript runs in browsers"]) - -# Batch operations with multiple categories -lsi.add( - pets: ["Hamsters are small furry pets", "Birds can be great companions"], - programming: "Go is fast and concurrent" -) - -# Classify new text -lsi.classify "My puppy loves to run around" -# => "Pets" - -# Get classification with confidence score -lsi.classify_with_confidence "Learning to code in Ruby" -# => ["Programming", 0.89] +lsi.add(pets: "Dogs are loyal", tech: "Ruby is elegant") +lsi.classify "My puppy is playful" # => "pets" ``` +[LSI Guide 
→](https://rubyclassifier.com/docs/guides/lsi/basics) -### Search and Discovery - -```ruby -# Find similar documents -lsi.find_related "Dogs are great companions", 2 -# => ["Dogs are loyal pets that love to play fetch", "Cats are independent..."] - -# Search by keyword -lsi.search "programming", 3 -# => ["Ruby is a dynamic programming language", "Python is great for..."] -``` - -### Text Summarization - -LSI can extract key sentences from text: - -```ruby -text = "First sentence about dogs. Second about cats. Third about birds." -text.summary(2) # Extract 2 most relevant sentences -``` - -For better sentence boundary detection (handles abbreviations like "Dr.", decimals, etc.), install the optional `pragmatic_segmenter` gem: - -```ruby -gem 'pragmatic_segmenter' -``` - -### Learn More - -- [LSI Basics Guide](https://rubyclassifier.com/docs/guides/lsi/basics) - In-depth documentation -- [Wikipedia: Latent Semantic Analysis](http://en.wikipedia.org/wiki/Latent_semantic_analysis) - -## k-Nearest Neighbors (kNN) - -Instance-based classification that stores examples and classifies by finding the most similar ones. No training phase required—just add examples and classify. - -### Key Features - -- **No Training Required**: Uses instance-based learning—store examples and classify by similarity -- **Interpretable Results**: Returns neighbors that contributed to the decision -- **Incremental Updates**: Easy to add or remove examples without retraining -- **Distance-Weighted Voting**: Optional weighting by similarity score -- **Built on LSI**: Leverages LSI's semantic similarity for better matching - -### Quick Start +### k-Nearest Neighbors ```ruby -require 'classifier' - knn = Classifier::KNN.new(k: 3) - -# Add labeled examples -knn.add(spam: ["Buy now! Limited offer!", "You've won a million dollars!"]) -knn.add(ham: ["Meeting at 3pm tomorrow", "Please review the document"]) - -# Classify new text -knn.classify "Congratulations! Claim your prize!" 
-# => "spam" -``` - -### Detailed Classification - -Get neighbor information for interpretable results: - -```ruby -result = knn.classify_with_neighbors "Free money offer" - -result[:category] # => "spam" -result[:confidence] # => 0.85 -result[:neighbors] # => [{item: "Buy now!...", category: "spam", similarity: 0.92}, ...] -result[:votes] # => {"spam" => 2.0, "ham" => 1.0} +knn.train(spam: "Free money!", ham: "Quarterly report attached") # or knn.add() +knn.classify "Claim your prize" # => "spam" ``` +[k-Nearest Neighbors Guide →](https://rubyclassifier.com/docs/guides/knn/basics) -### Distance-Weighted Voting - -Weight votes by similarity score for more accurate classification: +### TF-IDF ```ruby -knn = Classifier::KNN.new(k: 5, weighted: true) - -knn.add( - positive: ["Great product!", "Loved it!", "Excellent service"], - negative: ["Terrible experience", "Would not recommend"] -) - -# Closer neighbors have more influence on the result -knn.classify "This was amazing!" -# => "positive" -``` - -### Updating the Classifier - -```ruby -# Add more examples anytime -knn.add(neutral: "It was okay, nothing special") - -# Remove examples -knn.remove_item "Buy now! Limited offer!" - -# Change k value -knn.k = 7 - -# List all categories -knn.categories -# => ["spam", "ham", "neutral"] -``` - -### When to Use kNN vs Bayes vs LSI - -| Classifier | Best For | -|------------|----------| -| **Bayes** | Fast classification, any training size (stores only word counts) | -| **LSI** | Semantic similarity, document clustering, search | -| **kNN** | <1000 examples, interpretable results, incremental updates | - -**Why the size difference?** Bayes stores aggregate statistics—adding 10,000 documents just increments counters. kNN stores every example and compares against all of them during classification, so performance degrades with size. - -## TF-IDF Vectorizer - -Transform text documents into TF-IDF (Term Frequency-Inverse Document Frequency) weighted feature vectors. 
TF-IDF downweights common words and upweights discriminative terms—the foundation for most classic text classification approaches. - -### Quick Start - -```ruby -require 'classifier' - tfidf = Classifier::TFIDF.new -tfidf.fit(["Dogs are great pets", "Cats are independent", "Birds can fly"]) - -# Transform text to TF-IDF vector (L2 normalized) -vector = tfidf.transform("Dogs are loyal") -# => {:dog=>0.7071..., :loyal=>0.7071...} - -# Fit and transform in one step -vectors = tfidf.fit_transform(documents) -``` - -### Options - -```ruby -tfidf = Classifier::TFIDF.new( - min_df: 2, # Minimum document frequency (Integer or Float 0.0-1.0) - max_df: 0.95, # Maximum document frequency (filters very common terms) - ngram_range: [1, 2], # Extract unigrams and bigrams - sublinear_tf: true # Use 1 + log(tf) instead of raw term frequency -) +tfidf.fit(["Dogs are pets", "Cats are independent"]) +tfidf.transform("Dogs are loyal") # => {:dog => 0.707, :loyal => 0.707} ``` +[TF-IDF Guide →](https://rubyclassifier.com/docs/guides/tfidf/basics) -### Vocabulary Inspection - -```ruby -tfidf.fit(documents) +## Key Features -tfidf.vocabulary # => {:dog=>0, :cat=>1, :bird=>2, ...} -tfidf.idf # => {:dog=>1.405, :cat=>1.405, ...} -tfidf.feature_names # => [:dog, :cat, :bird, ...] -tfidf.num_documents # => 3 -tfidf.fitted? 
# => true -``` - -### N-gram Support - -```ruby -# Extract bigrams only -tfidf = Classifier::TFIDF.new(ngram_range: [2, 2]) -tfidf.fit(["quick brown fox", "lazy brown dog"]) -tfidf.vocabulary.keys -# => [:quick_brown, :brown_fox, :lazi_brown, :brown_dog] - -# Unigrams through trigrams -tfidf = Classifier::TFIDF.new(ngram_range: [1, 3]) -``` +### Incremental LSI -### Serialization +Add documents without rebuilding the entire index—400x faster for streaming data: ```ruby -# Save to JSON -json = tfidf.to_json -File.write("tfidf.json", json) - -# Load from JSON -loaded = Classifier::TFIDF.from_json(File.read("tfidf.json")) +lsi = Classifier::LSI.new(incremental: true) +lsi.add(tech: ["Ruby is elegant", "Python is popular"]) +lsi.build_index -# Or use Marshal -data = Marshal.dump(tfidf) -loaded = Marshal.load(data) +# These use Brand's algorithm—no full rebuild +lsi.add(tech: "Go is fast") +lsi.add(tech: "Rust is safe") ``` -## Persistence - -Save and load classifiers with pluggable storage backends. Works with Bayes, LogisticRegression, LSI, and kNN classifiers. +[Learn more →](https://rubyclassifier.com/docs/guides/lsi/incremental) -### File Storage +### Persistence ```ruby -require 'classifier' - -classifier = Classifier::Bayes.new(:spam, :ham) -classifier.train(spam: "Buy now! Limited offer!") -classifier.train(ham: "Meeting tomorrow at 3pm") - -# Configure storage and save -classifier.storage = Classifier::Storage::File.new(path: "spam_filter.json") +classifier.storage = Classifier::Storage::File.new(path: "model.json") classifier.save -# Load later loaded = Classifier::Bayes.load(storage: classifier.storage) -loaded.classify "Claim your prize now!" 
-# => "Spam" ``` -### Custom Storage Backends +[Learn more →](https://rubyclassifier.com/docs/guides/persistence) -Create backends for Redis, PostgreSQL, S3, or any storage system: +### Streaming Training ```ruby -class RedisStorage < Classifier::Storage::Base - def initialize(redis:, key:) - super() - @redis, @key = redis, key - end - - def write(data) = @redis.set(@key, data) - def read = @redis.get(@key) - def delete = @redis.del(@key) - def exists? = @redis.exists?(@key) -end - -# Use it -classifier.storage = RedisStorage.new(redis: Redis.new, key: "classifier:spam") -classifier.save +classifier.train_from_stream(:spam, File.open("spam_corpus.txt")) ``` -### Learn More - -- [Persistence Guide](https://rubyclassifier.com/docs/guides/persistence/basics) - Full documentation with examples +[Learn more →](https://rubyclassifier.com/docs/guides/streaming) ## Performance -### Native C Extension vs Pure Ruby - -The native C extension provides dramatic speedups for LSI operations, especially `build_index` (SVD computation): +Native C extension provides 5-50x speedup for LSI operations: -| Documents | build_index | Overall | -|-----------|-------------|---------| -| 5 | 7x faster | 2.6x | -| 10 | 25x faster | 4.6x | -| 15 | 112x faster | 14.5x | -| 20 | 385x faster | 48.7x | - -
-Detailed benchmark (20 documents) - -``` -Operation Pure Ruby Native C Speedup ----------------------------------------------------------- -build_index 0.5540 0.0014 384.5x -classify 0.0190 0.0060 3.2x -search 0.0145 0.0037 3.9x -find_related 0.0098 0.0011 8.6x ----------------------------------------------------------- -TOTAL 0.5973 0.0123 48.7x -``` -
- -### Running Benchmarks +| Documents | Speedup | +|-----------|---------| +| 10 | 25x | +| 20 | 50x | ```bash -rake benchmark # Run with current configuration -rake benchmark:compare # Compare native C vs pure Ruby +rake benchmark:compare # Run your own comparison ``` ## Development -### Setup - ```bash -git clone https://github.com/cardmagic/classifier.git -cd classifier bundle install -rake compile # Compile native C extension +rake compile # Build native extension +rake test # Run tests ``` -### Running Tests - -```bash -rake test # Run all tests (compiles first) -ruby -Ilib test/bayes/bayesian_test.rb # Run specific test file - -# Test with pure Ruby (no native extension) -NATIVE_VECTOR=true rake test -``` - -### Console - -```bash -rake console -``` - -## Contributing - -1. Fork the repository -2. Create your feature branch (`git checkout -b feature/amazing-feature`) -3. Commit your changes (`git commit -am 'Add amazing feature'`) -4. Push to the branch (`git push origin feature/amazing-feature`) -5. Open a Pull Request - ## Authors -- **Lucas Carlson** - *Original author* - lucas@rufy.com -- **David Fayram II** - *LSI implementation* - dfayram@gmail.com +- **Lucas Carlson** - lucas@rufy.com +- **David Fayram II** - dfayram@gmail.com - **Cameron McBride** - cameron.mcbride@gmail.com - **Ivan Acosta-Rubio** - ivan@softwarecriollo.com ## License -This library is released under the [GNU Lesser General Public License (LGPL) 2.1](LICENSE). 
+[LGPL 2.1](LICENSE) diff --git a/ext/classifier/classifier_ext.c b/ext/classifier/classifier_ext.c index f6096933..45ebad40 100644 --- a/ext/classifier/classifier_ext.c +++ b/ext/classifier/classifier_ext.c @@ -22,4 +22,5 @@ void Init_classifier_ext(void) Init_vector(); Init_matrix(); Init_svd(); + Init_incremental_svd(); } diff --git a/ext/classifier/incremental_svd.c b/ext/classifier/incremental_svd.c new file mode 100644 index 00000000..ab390d98 --- /dev/null +++ b/ext/classifier/incremental_svd.c @@ -0,0 +1,393 @@ +/* + * incremental_svd.c + * Native C implementation of Brand's incremental SVD operations + * + * Provides fast matrix operations for: + * - Matrix column extension + * - Vertical stacking (vstack) + * - Vector subtraction + * - Batch document projection + */ + +#include "linalg.h" + +/* + * Extend a matrix with a new column + * Returns a new matrix [M | col] with one additional column + */ +CMatrix *cmatrix_extend_column(CMatrix *m, CVector *col) +{ + if (m->rows != col->size) { + rb_raise(rb_eArgError, + "Matrix rows (%ld) must match vector size (%ld)", + (long)m->rows, (long)col->size); + } + + CMatrix *result = cmatrix_alloc(m->rows, m->cols + 1); + + for (size_t i = 0; i < m->rows; i++) { + memcpy(&MAT_AT(result, i, 0), &MAT_AT(m, i, 0), m->cols * sizeof(double)); + MAT_AT(result, i, m->cols) = col->data[i]; + } + + return result; +} + +/* + * Vertically stack two matrices + * Returns a new matrix [top; bottom] + */ +CMatrix *cmatrix_vstack(CMatrix *top, CMatrix *bottom) +{ + if (top->cols != bottom->cols) { + rb_raise(rb_eArgError, + "Matrices must have same column count: %ld vs %ld", + (long)top->cols, (long)bottom->cols); + } + + size_t new_rows = top->rows + bottom->rows; + CMatrix *result = cmatrix_alloc(new_rows, top->cols); + + memcpy(result->data, top->data, top->rows * top->cols * sizeof(double)); + memcpy(result->data + top->rows * top->cols, + bottom->data, + bottom->rows * bottom->cols * sizeof(double)); + + return result; +} + +/* 
+ * Vector subtraction: a - b + */ +CVector *cvector_subtract(CVector *a, CVector *b) +{ + if (a->size != b->size) { + rb_raise(rb_eArgError, + "Vector sizes must match: %ld vs %ld", + (long)a->size, (long)b->size); + } + + CVector *result = cvector_alloc(a->size); + for (size_t i = 0; i < a->size; i++) { + result->data[i] = a->data[i] - b->data[i]; + } + return result; +} + +/* + * Batch project multiple vectors onto U matrix + * Computes lsi_vector = U^T * raw_vector for each vector + * This is the most performance-critical operation for incremental updates + */ +void cbatch_project(CMatrix *u, CVector **raw_vectors, size_t num_vectors, + CVector **lsi_vectors_out) +{ + size_t m = u->rows; /* vocabulary size */ + size_t k = u->cols; /* rank */ + + for (size_t v = 0; v < num_vectors; v++) { + CVector *raw = raw_vectors[v]; + if (raw->size != m) { + rb_raise(rb_eArgError, + "Vector %ld size (%ld) must match matrix rows (%ld)", + (long)v, (long)raw->size, (long)m); + } + + CVector *lsi = cvector_alloc(k); + + /* Compute U^T * raw (project onto k-dimensional space) */ + for (size_t j = 0; j < k; j++) { + double sum = 0.0; + for (size_t i = 0; i < m; i++) { + sum += MAT_AT(u, i, j) * raw->data[i]; + } + lsi->data[j] = sum; + } + + lsi_vectors_out[v] = lsi; + } +} + +/* + * Build the K matrix for Brand's algorithm when rank grows + * K = | diag(s) m_vec | + * | 0 p_norm | + */ +static CMatrix *build_k_matrix_with_growth(CVector *s, CVector *m_vec, double p_norm) +{ + size_t k = s->size; + CMatrix *result = cmatrix_alloc(k + 1, k + 1); + + /* First k rows: diagonal s values and m_vec in last column */ + for (size_t i = 0; i < k; i++) { + MAT_AT(result, i, i) = s->data[i]; + MAT_AT(result, i, k) = m_vec->data[i]; + } + + /* Last row: zeros except p_norm in last position */ + MAT_AT(result, k, k) = p_norm; + + return result; +} + +/* + * Perform one incremental SVD update using Brand's algorithm + * + * @param u Current U matrix (m x k) + * @param s Current singular 
values (k values) + * @param c New document vector (m x 1) + * @param max_rank Maximum rank to maintain + * @param epsilon Threshold for detecting new directions + * @param u_out Output: updated U matrix + * @param s_out Output: updated singular values + */ +static void incremental_update(CMatrix *u, CVector *s, CVector *c, int max_rank, + double epsilon, CMatrix **u_out, CVector **s_out) +{ + size_t m = u->rows; + size_t k = u->cols; + + /* Step 1: Project c onto column space of U */ + /* m_vec = U^T * c */ + CVector *m_vec = cvector_alloc(k); + for (size_t j = 0; j < k; j++) { + double sum = 0.0; + for (size_t i = 0; i < m; i++) { + sum += MAT_AT(u, i, j) * c->data[i]; + } + m_vec->data[j] = sum; + } + + /* Step 2: Compute residual p = c - U * m_vec */ + CVector *u_times_m = cmatrix_multiply_vector(u, m_vec); + CVector *p = cvector_subtract(c, u_times_m); + double p_norm = cvector_magnitude(p); + + cvector_free(u_times_m); + + if (p_norm > epsilon) { + /* New direction found - rank may increase */ + + /* Step 3: Normalize residual */ + CVector *p_hat = cvector_alloc(m); + double inv_p_norm = 1.0 / p_norm; + for (size_t i = 0; i < m; i++) { + p_hat->data[i] = p->data[i] * inv_p_norm; + } + + /* Step 4: Build K matrix */ + CMatrix *k_mat = build_k_matrix_with_growth(s, m_vec, p_norm); + + /* Step 5: SVD of K matrix */ + CMatrix *u_prime, *v_prime; + CVector *s_prime; + jacobi_svd(k_mat, &u_prime, &v_prime, &s_prime); + cmatrix_free(k_mat); + cmatrix_free(v_prime); + + /* Step 6: Update U = [U | p_hat] * U' */ + CMatrix *u_extended = cmatrix_extend_column(u, p_hat); + CMatrix *u_new = cmatrix_multiply(u_extended, u_prime); + cmatrix_free(u_extended); + cmatrix_free(u_prime); + cvector_free(p_hat); + + /* Truncate if needed */ + if (s_prime->size > (size_t)max_rank) { + /* Create truncated U (keep first max_rank columns) */ + CMatrix *u_trunc = cmatrix_alloc(u_new->rows, (size_t)max_rank); + for (size_t i = 0; i < u_new->rows; i++) { + memcpy(&MAT_AT(u_trunc, i, 0), 
&MAT_AT(u_new, i, 0), + (size_t)max_rank * sizeof(double)); + } + cmatrix_free(u_new); + u_new = u_trunc; + + /* Truncate singular values */ + CVector *s_trunc = cvector_alloc((size_t)max_rank); + memcpy(s_trunc->data, s_prime->data, (size_t)max_rank * sizeof(double)); + cvector_free(s_prime); + s_prime = s_trunc; + } + + *u_out = u_new; + *s_out = s_prime; + } else { + /* Vector in span - use simpler update */ + /* For now, just return unchanged (projection handles this) */ + *u_out = cmatrix_alloc(u->rows, u->cols); + memcpy((*u_out)->data, u->data, u->rows * u->cols * sizeof(double)); + *s_out = cvector_alloc(s->size); + memcpy((*s_out)->data, s->data, s->size * sizeof(double)); + } + + cvector_free(p); + cvector_free(m_vec); +} + +/* ========== Ruby Wrappers ========== */ + +/* + * Matrix.extend_column(matrix, vector) + * Returns [matrix | vector] + */ +static VALUE rb_cmatrix_extend_column(VALUE klass, VALUE rb_matrix, VALUE rb_vector) +{ + CMatrix *m; + CVector *v; + + GET_CMATRIX(rb_matrix, m); + GET_CVECTOR(rb_vector, v); + + CMatrix *result = cmatrix_extend_column(m, v); + return TypedData_Wrap_Struct(klass, &cmatrix_type, result); + + (void)klass; +} + +/* + * Matrix.vstack(top, bottom) + * Vertically stack two matrices + */ +static VALUE rb_cmatrix_vstack(VALUE klass, VALUE rb_top, VALUE rb_bottom) +{ + CMatrix *top, *bottom; + + GET_CMATRIX(rb_top, top); + GET_CMATRIX(rb_bottom, bottom); + + CMatrix *result = cmatrix_vstack(top, bottom); + return TypedData_Wrap_Struct(klass, &cmatrix_type, result); + + (void)klass; +} + +/* + * Matrix.zeros(rows, cols) + * Create a zero matrix + */ +static VALUE rb_cmatrix_zeros(VALUE klass, VALUE rb_rows, VALUE rb_cols) +{ + size_t rows = NUM2SIZET(rb_rows); + size_t cols = NUM2SIZET(rb_cols); + + CMatrix *result = cmatrix_alloc(rows, cols); + return TypedData_Wrap_Struct(klass, &cmatrix_type, result); + + (void)klass; +} + +/* + * Vector#-(other) + * Vector subtraction + */ +static VALUE rb_cvector_subtract(VALUE 
self, VALUE other) +{ + CVector *a, *b; + + GET_CVECTOR(self, a); + + if (rb_obj_is_kind_of(other, cClassifierVector)) { + GET_CVECTOR(other, b); + CVector *result = cvector_subtract(a, b); + return TypedData_Wrap_Struct(cClassifierVector, &cvector_type, result); + } + + rb_raise(rb_eTypeError, "Cannot subtract %s from Vector", + rb_obj_classname(other)); + return Qnil; +} + +/* + * Matrix#batch_project(vectors_array) + * Project multiple vectors onto this matrix (as U) + * Returns array of projected vectors + * + * This is the high-performance batch operation for re-projecting documents + */ +static VALUE rb_cmatrix_batch_project(VALUE self, VALUE rb_vectors) +{ + CMatrix *u; + GET_CMATRIX(self, u); + + Check_Type(rb_vectors, T_ARRAY); + long num_vectors = RARRAY_LEN(rb_vectors); + + if (num_vectors == 0) { + return rb_ary_new(); + } + + CVector **raw_vectors = ALLOC_N(CVector *, num_vectors); + for (long i = 0; i < num_vectors; i++) { + VALUE rb_vec = rb_ary_entry(rb_vectors, i); + if (!rb_obj_is_kind_of(rb_vec, cClassifierVector)) { + xfree(raw_vectors); + rb_raise(rb_eTypeError, "Expected array of Vectors"); + } + GET_CVECTOR(rb_vec, raw_vectors[i]); + } + + CVector **lsi_vectors = ALLOC_N(CVector *, num_vectors); + cbatch_project(u, raw_vectors, (size_t)num_vectors, lsi_vectors); + + VALUE result = rb_ary_new_capa(num_vectors); + for (long i = 0; i < num_vectors; i++) { + VALUE rb_lsi = TypedData_Wrap_Struct(cClassifierVector, &cvector_type, + lsi_vectors[i]); + rb_ary_push(result, rb_lsi); + } + + xfree(raw_vectors); + xfree(lsi_vectors); + + return result; +} + +/* + * Matrix#incremental_svd_update(singular_values, new_vector, max_rank, epsilon) + * Perform one Brand's incremental SVD update + * Returns [new_u, new_singular_values] + */ +static VALUE rb_cmatrix_incremental_update(VALUE self, VALUE rb_s, VALUE rb_c, + VALUE rb_max_rank, VALUE rb_epsilon) +{ + CMatrix *u; + CVector *s, *c; + + GET_CMATRIX(self, u); + GET_CVECTOR(rb_s, s); + GET_CVECTOR(rb_c, 
c); + + int max_rank = NUM2INT(rb_max_rank); + double epsilon = NUM2DBL(rb_epsilon); + + CMatrix *u_new; + CVector *s_new; + + incremental_update(u, s, c, max_rank, epsilon, &u_new, &s_new); + + VALUE rb_u_new = TypedData_Wrap_Struct(cClassifierMatrix, &cmatrix_type, u_new); + VALUE rb_s_new = TypedData_Wrap_Struct(cClassifierVector, &cvector_type, s_new); + + return rb_ary_new_from_args(2, rb_u_new, rb_s_new); +} + +void Init_incremental_svd(void) +{ + /* Matrix class methods for incremental SVD */ + rb_define_singleton_method(cClassifierMatrix, "extend_column", + rb_cmatrix_extend_column, 2); + rb_define_singleton_method(cClassifierMatrix, "vstack", + rb_cmatrix_vstack, 2); + rb_define_singleton_method(cClassifierMatrix, "zeros", + rb_cmatrix_zeros, 2); + + /* Instance methods */ + rb_define_method(cClassifierMatrix, "batch_project", + rb_cmatrix_batch_project, 1); + rb_define_method(cClassifierMatrix, "incremental_svd_update", + rb_cmatrix_incremental_update, 4); + + /* Vector subtraction */ + rb_define_method(cClassifierVector, "-", rb_cvector_subtract, 1); +} diff --git a/ext/classifier/linalg.h b/ext/classifier/linalg.h index 7220ee4f..71bbc30a 100644 --- a/ext/classifier/linalg.h +++ b/ext/classifier/linalg.h @@ -50,6 +50,14 @@ CMatrix *cmatrix_diagonal(CVector *v); void Init_svd(void); void jacobi_svd(CMatrix *a, CMatrix **u, CMatrix **v, CVector **s); +/* Incremental SVD functions */ +void Init_incremental_svd(void); +CMatrix *cmatrix_extend_column(CMatrix *m, CVector *col); +CMatrix *cmatrix_vstack(CMatrix *top, CMatrix *bottom); +CVector *cvector_subtract(CVector *a, CVector *b); +void cbatch_project(CMatrix *u, CVector **raw_vectors, size_t num_vectors, + CVector **lsi_vectors_out); + /* TypedData definitions */ extern const rb_data_type_t cvector_type; extern const rb_data_type_t cmatrix_type; diff --git a/lib/classifier.rb b/lib/classifier.rb index 306016e0..77cbd931 100644 --- a/lib/classifier.rb +++ b/lib/classifier.rb @@ -27,6 +27,7 @@ require 
'rubygems' require 'classifier/errors' require 'classifier/storage' +require 'classifier/streaming' require 'classifier/extensions/string' require 'classifier/extensions/vector' require 'classifier/bayes' diff --git a/lib/classifier/bayes.rb b/lib/classifier/bayes.rb index 8b82283c..750aeeaf 100644 --- a/lib/classifier/bayes.rb +++ b/lib/classifier/bayes.rb @@ -8,8 +8,9 @@ require 'mutex_m' module Classifier - class Bayes + class Bayes # rubocop:disable Metrics/ClassLength include Mutex_m + include Streaming # @rbs @categories: Hash[Symbol, Hash[Symbol, Integer]] # @rbs @total_words: Integer @@ -310,8 +311,115 @@ def remove_category(category) end end + # Trains the classifier from an IO stream. + # Each line in the stream is treated as a separate document. + # This is memory-efficient for large corpora. + # + # @example Train from a file + # classifier.train_from_stream(:spam, File.open('spam_corpus.txt')) + # + # @example With progress tracking + # classifier.train_from_stream(:spam, io, batch_size: 500) do |progress| + # puts "#{progress.completed} documents processed" + # end + # + # @rbs (String | Symbol, IO, ?batch_size: Integer) { (Streaming::Progress) -> void } -> void + def train_from_stream(category, io, batch_size: Streaming::DEFAULT_BATCH_SIZE) + category = category.prepare_category_name + raise StandardError, "No such category: #{category}" unless @categories.key?(category) + + reader = Streaming::LineReader.new(io, batch_size: batch_size) + total = reader.estimate_line_count + progress = Streaming::Progress.new(total: total) + + reader.each_batch do |batch| + train_batch_internal(category, batch) + progress.completed += batch.size + progress.current_batch += 1 + yield progress if block_given? + end + end + + # Trains the classifier with an array of documents in batches. + # Reduces lock contention by processing multiple documents per synchronize call. 
+ # + # @example Positional style + # classifier.train_batch(:spam, documents, batch_size: 100) + # + # @example Keyword style + # classifier.train_batch(spam: documents, ham: other_docs, batch_size: 100) + # + # @example With progress tracking + # classifier.train_batch(:spam, documents, batch_size: 100) do |progress| + # puts "#{progress.percent}% complete" + # end + # + # @rbs (?(String | Symbol)?, ?Array[String]?, ?batch_size: Integer, **Array[String]) { (Streaming::Progress) -> void } -> void + def train_batch(category = nil, documents = nil, batch_size: Streaming::DEFAULT_BATCH_SIZE, **categories, &block) + if category && documents + train_batch_for_category(category, documents, batch_size: batch_size, &block) + else + categories.each do |cat, docs| + train_batch_for_category(cat, Array(docs), batch_size: batch_size, &block) + end + end + end + + # Loads a classifier from a checkpoint. + # + # @rbs (storage: Storage::Base, checkpoint_id: String) -> Bayes + def self.load_checkpoint(storage:, checkpoint_id:) + raise ArgumentError, 'Storage must be File storage for checkpoints' unless storage.is_a?(Storage::File) + + dir = File.dirname(storage.path) + base = File.basename(storage.path, '.*') + ext = File.extname(storage.path) + checkpoint_path = File.join(dir, "#{base}_checkpoint_#{checkpoint_id}#{ext}") + + checkpoint_storage = Storage::File.new(path: checkpoint_path) + instance = load(storage: checkpoint_storage) + instance.storage = storage + instance + end + private + # Trains a batch of documents for a single category. 
+ # @rbs (String | Symbol, Array[String], ?batch_size: Integer) { (Streaming::Progress) -> void } -> void + def train_batch_for_category(category, documents, batch_size: Streaming::DEFAULT_BATCH_SIZE) + category = category.prepare_category_name + raise StandardError, "No such category: #{category}" unless @categories.key?(category) + + progress = Streaming::Progress.new(total: documents.size) + + documents.each_slice(batch_size) do |batch| + train_batch_internal(category, batch) + progress.completed += batch.size + progress.current_batch += 1 + yield progress if block_given? + end + end + + # Internal method to train a batch of documents. + # Uses a single synchronize block for the entire batch. + # @rbs (Symbol, Array[String]) -> void + def train_batch_internal(category, batch) + synchronize do + invalidate_caches + @dirty = true + batch.each do |text| + word_hash = text.word_hash + @category_counts[category] += 1 + word_hash.each do |word, count| + @categories[category][word] ||= 0 + @categories[category][word] += count + @total_words += count + @category_word_count[category] += count + end + end + end + end + # Core training logic for a single category and text. # @rbs (String | Symbol, String) -> void def train_single(category, text) diff --git a/lib/classifier/knn.rb b/lib/classifier/knn.rb index d473abcb..37ae8174 100644 --- a/lib/classifier/knn.rb +++ b/lib/classifier/knn.rb @@ -19,6 +19,7 @@ module Classifier # class KNN include Mutex_m + include Streaming # @rbs @k: Integer # @rbs @weighted: bool @@ -42,17 +43,25 @@ def initialize(k: 5, weighted: false) # rubocop:disable Naming/MethodParameterNa end # Adds labeled examples. Keys are categories, values are items or arrays. + # Also aliased as `train` for API consistency with Bayes and LogisticRegression. 
+ # + # knn.add(spam: "Buy now!", ham: "Meeting tomorrow") + # knn.train(spam: "Buy now!", ham: "Meeting tomorrow") # equivalent + # # @rbs (**untyped items) -> void def add(**items) synchronize { @dirty = true } @lsi.add(**items) end + alias train add + # Classifies text using k nearest neighbors with majority voting. - # @rbs (String) -> (String | Symbol)? + # Returns the category as a String for API consistency with Bayes and LogisticRegression. + # @rbs (String) -> String? def classify(text) result = classify_with_neighbors(text) - result[:category] + result[:category]&.to_s end # Classifies and returns {category:, neighbors:, votes:, confidence:}. @@ -91,10 +100,11 @@ def items @lsi.items end - # @rbs () -> Array[String | Symbol] + # Returns all unique categories as strings. + # @rbs () -> Array[String] def categories synchronize do - @lsi.items.flat_map { |item| @lsi.categories_for(item) }.uniq + @lsi.items.flat_map { |item| @lsi.categories_for(item) }.uniq.map(&:to_s) end end @@ -104,6 +114,23 @@ def k=(value) @k = value end + # Provides dynamic training methods for categories. + # For example: + # knn.train_spam "Buy now!" + # knn.train_ham "Meeting tomorrow" + def method_missing(name, *args) + category_match = name.to_s.match(/\Atrain_(\w+)\z/) + return super unless category_match + + category = category_match[1].to_sym + args.each { |text| add(category => text) } + end + + # @rbs (Symbol, ?bool) -> bool + def respond_to_missing?(name, include_private = false) + !!(name.to_s =~ /\Atrain_(\w+)\z/) || super + end + # @rbs (?untyped) -> untyped def as_json(_options = nil) { @@ -213,6 +240,59 @@ def marshal_load(data) @storage = nil end + # Loads a classifier from a checkpoint. 
+ # + # @rbs (storage: Storage::Base, checkpoint_id: String) -> KNN + def self.load_checkpoint(storage:, checkpoint_id:) + raise ArgumentError, 'Storage must be File storage for checkpoints' unless storage.is_a?(Storage::File) + + dir = File.dirname(storage.path) + base = File.basename(storage.path, '.*') + ext = File.extname(storage.path) + checkpoint_path = File.join(dir, "#{base}_checkpoint_#{checkpoint_id}#{ext}") + + checkpoint_storage = Storage::File.new(path: checkpoint_path) + instance = load(storage: checkpoint_storage) + instance.storage = storage + instance + end + + # Trains the classifier from an IO stream. + # Each line in the stream is treated as a separate document. + # + # @example Train from a file + # knn.train_from_stream(:spam, File.open('spam_corpus.txt')) + # + # @example With progress tracking + # knn.train_from_stream(:spam, io, batch_size: 500) do |progress| + # puts "#{progress.completed} documents processed" + # end + # + # @rbs (String | Symbol, IO, ?batch_size: Integer) { (Streaming::Progress) -> void } -> void + def train_from_stream(category, io, batch_size: Streaming::DEFAULT_BATCH_SIZE, &block) + @lsi.train_from_stream(category, io, batch_size: batch_size, &block) + synchronize { @dirty = true } + end + + # Adds items in batches. + # + # @example Positional style + # knn.train_batch(:spam, documents, batch_size: 100) + # + # @example Keyword style + # knn.train_batch(spam: documents, ham: other_docs) + # + # @rbs (?(String | Symbol)?, ?Array[String]?, ?batch_size: Integer, **Array[String]) { (Streaming::Progress) -> void } -> void + def train_batch(category = nil, documents = nil, batch_size: Streaming::DEFAULT_BATCH_SIZE, **categories, &block) + # @type var categories: Hash[Symbol, Array[String]] + @lsi.train_batch(category, documents, batch_size: batch_size, **categories, &block) # steep:ignore + synchronize { @dirty = true } + end + + # @rbs! 
+ # alias add_batch train_batch + alias add_batch train_batch + private # @rbs (String) -> Array[Hash[Symbol, untyped]] diff --git a/lib/classifier/logistic_regression.rb b/lib/classifier/logistic_regression.rb index a413c062..04e3f037 100644 --- a/lib/classifier/logistic_regression.rb +++ b/lib/classifier/logistic_regression.rb @@ -20,6 +20,7 @@ module Classifier # class LogisticRegression # rubocop:disable Metrics/ClassLength include Mutex_m + include Streaming # @rbs @categories: Array[Symbol] # @rbs @weights: Hash[Symbol, Hash[Symbol, Float]] @@ -329,8 +330,112 @@ def marshal_load(data) @storage = nil end + # Loads a classifier from a checkpoint. + # + # @rbs (storage: Storage::Base, checkpoint_id: String) -> LogisticRegression + def self.load_checkpoint(storage:, checkpoint_id:) + raise ArgumentError, 'Storage must be File storage for checkpoints' unless storage.is_a?(Storage::File) + + dir = File.dirname(storage.path) + base = File.basename(storage.path, '.*') + ext = File.extname(storage.path) + checkpoint_path = File.join(dir, "#{base}_checkpoint_#{checkpoint_id}#{ext}") + + checkpoint_storage = Storage::File.new(path: checkpoint_path) + instance = load(storage: checkpoint_storage) + instance.storage = storage + instance + end + + # Trains the classifier from an IO stream. + # Each line in the stream is treated as a separate document. + # Note: The model is NOT automatically fitted after streaming. + # Call #fit to train the model after adding all data. 
+ # + # @example Train from a file + # classifier.train_from_stream(:spam, File.open('spam_corpus.txt')) + # classifier.fit # Required to train the model + # + # @example With progress tracking + # classifier.train_from_stream(:spam, io, batch_size: 500) do |progress| + # puts "#{progress.completed} documents processed" + # end + # classifier.fit + # + # @rbs (String | Symbol, IO, ?batch_size: Integer) { (Streaming::Progress) -> void } -> void + def train_from_stream(category, io, batch_size: Streaming::DEFAULT_BATCH_SIZE) + category = category.to_s.prepare_category_name + raise StandardError, "No such category: #{category}" unless @categories.include?(category) + + reader = Streaming::LineReader.new(io, batch_size: batch_size) + total = reader.estimate_line_count + progress = Streaming::Progress.new(total: total) + + reader.each_batch do |batch| + synchronize do + batch.each do |text| + features = text.word_hash + features.each_key { |word| @vocabulary[word] = true } + @training_data << { category: category, features: features } + end + @fitted = false + @dirty = true + end + progress.completed += batch.size + progress.current_batch += 1 + yield progress if block_given? + end + end + + # Trains the classifier with an array of documents in batches. + # Note: The model is NOT automatically fitted after batch training. + # Call #fit to train the model after adding all data. 
+ # + # @example Positional style + # classifier.train_batch(:spam, documents, batch_size: 100) + # classifier.fit + # + # @example Keyword style + # classifier.train_batch(spam: documents, ham: other_docs) + # classifier.fit + # + # @rbs (?(String | Symbol)?, ?Array[String]?, ?batch_size: Integer, **Array[String]) { (Streaming::Progress) -> void } -> void + def train_batch(category = nil, documents = nil, batch_size: Streaming::DEFAULT_BATCH_SIZE, **categories, &block) + if category && documents + train_batch_for_category(category, documents, batch_size: batch_size, &block) + else + categories.each do |cat, docs| + train_batch_for_category(cat, Array(docs), batch_size: batch_size, &block) + end + end + end + private + # Trains a batch of documents for a single category. + # @rbs (String | Symbol, Array[String], ?batch_size: Integer) { (Streaming::Progress) -> void } -> void + def train_batch_for_category(category, documents, batch_size: Streaming::DEFAULT_BATCH_SIZE) + category = category.to_s.prepare_category_name + raise StandardError, "No such category: #{category}" unless @categories.include?(category) + + progress = Streaming::Progress.new(total: documents.size) + + documents.each_slice(batch_size) do |batch| + synchronize do + batch.each do |text| + features = text.word_hash + features.each_key { |word| @vocabulary[word] = true } + @training_data << { category: category, features: features } + end + @fitted = false + @dirty = true + end + progress.completed += batch.size + progress.current_batch += 1 + yield progress if block_given? + end + end + # Core training logic for a single category and text. 
# @rbs (String | Symbol, String) -> void def train_single(category, text) diff --git a/lib/classifier/lsi.rb b/lib/classifier/lsi.rb index 416a5cfe..6e1a6620 100644 --- a/lib/classifier/lsi.rb +++ b/lib/classifier/lsi.rb @@ -58,6 +58,7 @@ def matrix_class require 'classifier/lsi/word_list' require 'classifier/lsi/content_node' require 'classifier/lsi/summary' +require 'classifier/lsi/incremental_svd' module Classifier # This class implements a Latent Semantic Indexer, which can search, classify and cluster @@ -65,6 +66,7 @@ module Classifier # please consult Wikipedia[http://en.wikipedia.org/wiki/Latent_Semantic_Indexing]. class LSI include Mutex_m + include Streaming # @rbs @auto_rebuild: bool # @rbs @word_list: WordList @@ -74,14 +76,24 @@ class LSI # @rbs @singular_values: Array[Float]? # @rbs @dirty: bool # @rbs @storage: Storage::Base? + # @rbs @incremental_mode: bool + # @rbs @u_matrix: Matrix? + # @rbs @max_rank: Integer + # @rbs @initial_vocab_size: Integer? attr_reader :word_list, :singular_values attr_accessor :auto_rebuild, :storage + # Default maximum rank for incremental SVD + DEFAULT_MAX_RANK = 100 + # Create a fresh index. # If you want to call #build_index manually, use # Classifier::LSI.new auto_rebuild: false # + # For incremental SVD mode (adds documents without full rebuild): + # Classifier::LSI.new incremental: true, max_rank: 100 + # # @rbs (?Hash[Symbol, untyped]) -> void def initialize(options = {}) super() @@ -92,6 +104,12 @@ def initialize(options = {}) @built_at_version = -1 @dirty = false @storage = nil + + # Incremental SVD settings + @incremental_mode = options[:incremental] == true + @max_rank = options[:max_rank] || DEFAULT_MAX_RANK + @u_matrix = nil + @initial_vocab_size = nil end # Returns true if the index needs to be rebuilt. The index needs @@ -122,6 +140,40 @@ def singular_value_spectrum end end + # Returns true if incremental mode is enabled and active. + # Incremental mode becomes active after the first build_index call. 
+    #
+    # @rbs () -> bool
+    def incremental_enabled?
+      @incremental_mode && !@u_matrix.nil?
+    end
+
+    # Returns the current rank of the incremental SVD (number of singular values kept).
+    # Returns nil if the index has not been built yet.
+    #
+    # @rbs () -> Integer?
+    def current_rank
+      @singular_values&.count(&:positive?)
+    end
+
+    # Disables incremental mode. Subsequent adds will trigger full rebuilds.
+    #
+    # @rbs () -> void
+    def disable_incremental_mode!
+      @incremental_mode = false
+      @u_matrix = nil
+      @initial_vocab_size = nil
+    end
+
+    # Enables incremental mode with an optional max_rank setting.
+    # The next build_index call will store the U matrix for incremental updates.
+    #
+    # @rbs (?max_rank: Integer) -> void
+    def enable_incremental_mode!(max_rank: DEFAULT_MAX_RANK)
+      @incremental_mode = true
+      @max_rank = max_rank
+    end
+
     # Adds items to the index using hash-style syntax.
     # The hash keys are categories, and values are items (or arrays of items).
     #
@@ -165,11 +217,18 @@ def add(**items)
     # @rbs (String, *String | Symbol) ?{ (String) -> String } -> void
     def add_item(item, *categories, &block)
       clean_word_hash = block ? block.call(item).clean_word_hash : item.to_s.clean_word_hash
+      node = nil
+
       synchronize do
-        @items[item] = ContentNode.new(clean_word_hash, *categories)
+        node = ContentNode.new(clean_word_hash, *categories)
+        @items[item] = node
         @version += 1
         @dirty = true
       end
+
+      # Use incremental update if enabled and we have a U matrix
+      return perform_incremental_update(node, clean_word_hash) if @incremental_mode && @u_matrix
+
       build_index if @auto_rebuild
     end
@@ -230,12 +289,12 @@ def items
     # A value of 1 for cutoff means that no semantic analysis will take place,
     # turning the LSI class into a simple vector search engine.
     #
-    # @rbs (?Float) -> void
-    def build_index(cutoff = 0.75)
+    # @rbs (?Float, ?force: bool) -> void
+    def build_index(cutoff = 0.75, force: false)
       validate_cutoff!(cutoff)
       synchronize do
-        return unless needs_rebuild_unlocked?
+ return unless force || needs_rebuild_unlocked? make_word_list @@ -246,14 +305,20 @@ def build_index(cutoff = 0.75) # Convert vectors to arrays for matrix construction tda_arrays = tda.map { |v| v.respond_to?(:to_a) ? v.to_a : v } tdm = self.class.matrix_class.alloc(*tda_arrays).trans - ntdm = build_reduced_matrix(tdm, cutoff) + ntdm, u_mat = build_reduced_matrix_with_u(tdm, cutoff) assign_native_ext_lsi_vectors(ntdm, doc_list) else tdm = Matrix.rows(tda).trans - ntdm = build_reduced_matrix(tdm, cutoff) + ntdm, u_mat = build_reduced_matrix_with_u(tdm, cutoff) assign_ruby_lsi_vectors(ntdm, doc_list) end + # Store U matrix for incremental mode + if @incremental_mode + @u_matrix = u_mat + @initial_vocab_size = @word_list.size + end + @built_at_version = @version end end @@ -559,6 +624,100 @@ def self.load_from_file(path) from_json(File.read(path)) end + # Loads an LSI index from a checkpoint. + # + # @rbs (storage: Storage::Base, checkpoint_id: String) -> LSI + def self.load_checkpoint(storage:, checkpoint_id:) + raise ArgumentError, 'Storage must be File storage for checkpoints' unless storage.is_a?(Storage::File) + + dir = File.dirname(storage.path) + base = File.basename(storage.path, '.*') + ext = File.extname(storage.path) + checkpoint_path = File.join(dir, "#{base}_checkpoint_#{checkpoint_id}#{ext}") + + checkpoint_storage = Storage::File.new(path: checkpoint_path) + instance = load(storage: checkpoint_storage) + instance.storage = storage + instance + end + + # Trains the LSI index from an IO stream. + # Each line in the stream is treated as a separate document. + # Documents are added without rebuilding, then the index is rebuilt at the end. 
+ # + # @example Train from a file + # lsi.train_from_stream(:category, File.open('corpus.txt')) + # + # @example With progress tracking + # lsi.train_from_stream(:category, io, batch_size: 500) do |progress| + # puts "#{progress.completed} documents processed" + # end + # + # @rbs (String | Symbol, IO, ?batch_size: Integer) { (Streaming::Progress) -> void } -> void + def train_from_stream(category, io, batch_size: Streaming::DEFAULT_BATCH_SIZE) + original_auto_rebuild = @auto_rebuild + @auto_rebuild = false + + begin + reader = Streaming::LineReader.new(io, batch_size: batch_size) + total = reader.estimate_line_count + progress = Streaming::Progress.new(total: total) + + reader.each_batch do |batch| + batch.each { |text| add_item(text, category) } + progress.completed += batch.size + progress.current_batch += 1 + yield progress if block_given? + end + ensure + @auto_rebuild = original_auto_rebuild + build_index if original_auto_rebuild + end + end + + # Adds items to the index in batches from an array. + # Documents are added without rebuilding, then the index is rebuilt at the end. + # + # @example Batch add with progress + # lsi.add_batch(Dog: documents, batch_size: 100) do |progress| + # puts "#{progress.percent}% complete" + # end + # + # @rbs (?batch_size: Integer, **Array[String]) { (Streaming::Progress) -> void } -> void + def add_batch(batch_size: Streaming::DEFAULT_BATCH_SIZE, **items) + original_auto_rebuild = @auto_rebuild + @auto_rebuild = false + + begin + total_docs = items.values.sum { |v| Array(v).size } + progress = Streaming::Progress.new(total: total_docs) + + items.each do |category, documents| + Array(documents).each_slice(batch_size) do |batch| + batch.each { |doc| add_item(doc, category.to_s) } + progress.completed += batch.size + progress.current_batch += 1 + yield progress if block_given? 
+          end
+        end
+      ensure
+        @auto_rebuild = original_auto_rebuild
+        build_index if original_auto_rebuild
+      end
+    end
+
+    # Delegates train_batch to add_batch for API consistency with the other classifiers.
+    # Note: LSI handles categories differently: categories are attached to items rather
+    # than passed per training call.
+    #
+    # @rbs (?(String | Symbol)?, ?Array[String]?, ?batch_size: Integer, **Array[String]) { (Streaming::Progress) -> void } -> void
+    def train_batch(category = nil, documents = nil, batch_size: Streaming::DEFAULT_BATCH_SIZE, **categories, &block)
+      if category && documents
+        add_batch(batch_size: batch_size, **{ category.to_sym => documents }, &block)
+      else
+        add_batch(batch_size: batch_size, **categories, &block)
+      end
+    end
+
     private

     # Restores LSI state from a JSON string (used by reload)
@@ -679,7 +838,7 @@ def vote_unlocked(doc, cutoff = 0.30, &)
       votes
     end

-    # Unlocked version of node_for_content for internal use
+    # Unlocked version of node_for_content for internal use.
     # @rbs (String) ?{ (String) -> String } -> ContentNode
     def node_for_content_unlocked(item, &block)
       return @items[item] if @items[item]
@@ -687,30 +846,67 @@ def node_for_content_unlocked(item, &block)
       clean_word_hash = block ? block.call(item).clean_word_hash : item.to_s.clean_word_hash
       cn = ContentNode.new(clean_word_hash, &block)
       cn.raw_vector_with(@word_list) unless needs_rebuild_unlocked?
+      assign_lsi_vector_incremental(cn) if incremental_enabled?
       cn
     end

     # @rbs (untyped, ?Float) -> untyped
     def build_reduced_matrix(matrix, cutoff = 0.75)
-      # TODO: Check that M>=N on these dimensions! Transpose helps assure this
-      u, v, s = matrix.SV_decomp
+      result, _u = build_reduced_matrix_with_u(matrix, cutoff)
+      result
+    end

-      @singular_values = s.sort.reverse
+    # Builds the reduced matrix and returns both the result and the U matrix.
+    # The U matrix is needed for incremental SVD updates.
+ # @rbs (untyped, ?Float) -> [untyped, Matrix] + def build_reduced_matrix_with_u(matrix, cutoff = 0.75) + u, v, s = matrix.SV_decomp + all_singular_values = s.sort.reverse s_cutoff_index = [(s.size * cutoff).round - 1, 0].max - s_cutoff = @singular_values[s_cutoff_index] + s_cutoff = all_singular_values[s_cutoff_index] + + kept_indices = [] + kept_singular_values = [] s.size.times do |ord| - s[ord] = 0.0 if s[ord] < s_cutoff + if s[ord] >= s_cutoff + kept_indices << ord + kept_singular_values << s[ord] + else + s[ord] = 0.0 + end end - # Reconstruct the term document matrix, only with reduced rank - result = u * self.class.matrix_class.diag(s) * v.trans - # SVD may return transposed dimensions when row_size < column_size - # Ensure result matches input dimensions + @singular_values = kept_singular_values.sort.reverse + result = u * self.class.matrix_class.diag(s) * v.trans result = result.trans if result.row_size != matrix.row_size + u_reduced = extract_reduced_u(u, kept_indices, s) - result + [result, u_reduced] + end + + # Extracts columns from U corresponding to kept singular values. + # Columns are sorted by descending singular value to match @singular_values order. + # rubocop:disable Naming/MethodParameterName + # @rbs (untyped, Array[Integer], Array[Float]) -> Matrix + def extract_reduced_u(u, kept_indices, singular_values) + return Matrix.empty(u.row_size, 0) if kept_indices.empty? 
+ + sorted_indices = kept_indices.sort_by { |i| -singular_values[i] } + + if u.respond_to?(:to_ruby_matrix) + u = u.to_ruby_matrix + elsif !u.is_a?(::Matrix) + rows = u.row_size.times.map do |i| + sorted_indices.map { |j| u[i, j] } + end + return Matrix.rows(rows) + end + + cols = sorted_indices.map { |i| u.column(i).to_a } + Matrix.columns(cols) end + # rubocop:enable Naming/MethodParameterName # @rbs () -> void def make_word_list @@ -719,5 +915,129 @@ def make_word_list node.word_hash.each_key { |key| @word_list.add_word key } end end + + # Performs incremental SVD update for a new document. + # @rbs (ContentNode, Hash[Symbol, Integer]) -> void + def perform_incremental_update(node, word_hash) + needs_full_rebuild = false + old_rank = nil + + synchronize do + if vocabulary_growth_exceeds_threshold?(word_hash) + disable_incremental_mode! + needs_full_rebuild = true + next + end + + old_rank = @u_matrix.column_size + extend_vocabulary_for_incremental(word_hash) + raw_vec = node.raw_vector_with(@word_list) + raw_vector = Vector[*raw_vec.to_a] + + @u_matrix, @singular_values = IncrementalSVD.update( + @u_matrix, @singular_values, raw_vector, max_rank: @max_rank + ) + + new_rank = @u_matrix.column_size + if new_rank > old_rank + reproject_all_documents + else + assign_lsi_vector_incremental(node) + end + + @built_at_version = @version + end + + build_index if needs_full_rebuild + end + + # Checks if vocabulary growth would exceed threshold (20%) + # @rbs (Hash[Symbol, Integer]) -> bool + def vocabulary_growth_exceeds_threshold?(word_hash) + return false unless @initial_vocab_size&.positive? + + new_words = word_hash.keys.count { |w| @word_list[w].nil? } + growth_ratio = new_words.to_f / @initial_vocab_size + growth_ratio > 0.2 + end + + # Extends vocabulary and U matrix for new words. + # @rbs (Hash[Symbol, Integer]) -> void + def extend_vocabulary_for_incremental(word_hash) + new_words = word_hash.keys.select { |w| @word_list[w].nil? } + return if new_words.empty? 
+ + new_words.each { |word| @word_list.add_word(word) } + extend_u_matrix(new_words.size) + end + + # Extends U matrix with zero rows for new vocabulary terms. + # @rbs (Integer) -> void + def extend_u_matrix(num_new_rows) + return if num_new_rows.zero? || @u_matrix.nil? + + if self.class.native_available? && @u_matrix.is_a?(self.class.matrix_class) + new_rows = self.class.matrix_class.zeros(num_new_rows, @u_matrix.column_size) + @u_matrix = self.class.matrix_class.vstack(@u_matrix, new_rows) + else + new_rows = Matrix.zero(num_new_rows, @u_matrix.column_size) + @u_matrix = Matrix.vstack(@u_matrix, new_rows) + end + end + + # Re-projects all documents onto the current U matrix + # Called when rank grows to ensure consistent LSI vector sizes + # Uses native batch_project for performance when available + # @rbs () -> void + def reproject_all_documents + return unless @u_matrix + return reproject_all_documents_native if self.class.native_available? && @u_matrix.respond_to?(:batch_project) + + reproject_all_documents_ruby + end + + # Native batch re-projection using C extension. + # @rbs () -> void + def reproject_all_documents_native + nodes = @items.values + raw_vectors = nodes.map do |node| + raw = node.raw_vector_with(@word_list) + raw.is_a?(self.class.vector_class) ? raw : self.class.vector_class.alloc(raw.to_a) + end + + lsi_vectors = @u_matrix.batch_project(raw_vectors) + + nodes.each_with_index do |node, i| + lsi_vec = lsi_vectors[i].row + node.lsi_vector = lsi_vec + node.lsi_norm = lsi_vec.normalize + end + end + + # Pure Ruby re-projection (fallback) + # @rbs () -> void + def reproject_all_documents_ruby + @items.each_value do |node| + assign_lsi_vector_incremental(node) + end + end + + # Assigns LSI vector to a node using projection: lsi_vec = U^T * raw_vec. 
+    # @rbs (ContentNode) -> void
+    def assign_lsi_vector_incremental(node)
+      return unless @u_matrix
+
+      raw_vec = node.raw_vector_with(@word_list)
+      raw_vector = Vector[*raw_vec.to_a]
+      lsi_arr = (@u_matrix.transpose * raw_vector).to_a
+
+      lsi_vec = if self.class.native_available?
+                  self.class.vector_class.alloc(lsi_arr).row
+                else
+                  Vector[*lsi_arr]
+                end
+      node.lsi_vector = lsi_vec
+      node.lsi_norm = lsi_vec.normalize
+    end
   end
 end
diff --git a/lib/classifier/lsi/incremental_svd.rb b/lib/classifier/lsi/incremental_svd.rb
new file mode 100644
index 00000000..beabfa37
--- /dev/null
+++ b/lib/classifier/lsi/incremental_svd.rb
@@ -0,0 +1,166 @@
+# rbs_inline: enabled
+
+# rubocop:disable Naming/MethodParameterName, Metrics/ParameterLists
+
+require 'matrix'
+
+module Classifier
+  class LSI
+    # Brand's Incremental SVD Algorithm for LSI
+    #
+    # Implements the algorithm from Brand (2006) "Fast low-rank modifications
+    # of the thin singular value decomposition" for adding documents to LSI
+    # without full SVD recomputation.
+    #
+    # Given existing thin SVD: A ≈ U * S * V^T (with k components)
+    # When adding a new column c:
+    #
+    # 1. Project: m = U^T * c (project onto existing column space)
+    # 2. Residual: p = c - U * m (component orthogonal to U)
+    # 3. Orthonormalize: If ||p|| > ε: p̂ = p / ||p||
+    # 4. Form K matrix:
+    #    - If ||p|| > ε: K = [diag(s), m; 0, ||p||] (rank grows by 1)
+    #    - If ||p|| ≈ 0: K = diag(s) (c lies in the span of U; the in-span
+    #      component m is dropped as an approximation, leaving S unchanged)
+    # 5. Small SVD: Compute SVD of K (only a (k+1) × (k+1) matrix!)
+    # 6. Update:
+    #    - U_new = [U, p̂] * U'
+    #    - S_new = S'
+    #
+    module IncrementalSVD
+      EPSILON = 1e-10
+
+      class << self
+        # Updates SVD with a new document vector using Brand's algorithm.
+ # + # @param u [Matrix] current left singular vectors (m × k) + # @param s [Array] current singular values (k values) + # @param c [Vector] new document vector (m × 1) + # @param max_rank [Integer] maximum rank to maintain + # @param epsilon [Float] threshold for zero detection + # @return [Array>] updated [u, s] + # + # @rbs (Matrix, Array[Float], Vector, max_rank: Integer, ?epsilon: Float) -> [Matrix, Array[Float]] + def update(u, s, c, max_rank:, epsilon: EPSILON) + m_vec = project(u, c) + u_times_m = u * m_vec + p_vec = c - (u_times_m.is_a?(Vector) ? u_times_m : Vector[*u_times_m.to_a.flatten]) + p_norm = magnitude(p_vec) + + if p_norm > epsilon + update_with_new_direction(u, s, m_vec, p_vec, p_norm, max_rank, epsilon) + else + update_in_span(u, s, m_vec, max_rank, epsilon) + end + end + + # Projects a document vector onto the semantic space defined by U. + # Returns the LSI representation: lsi_vec = U^T * raw_vec + # + # @param u [Matrix] left singular vectors (m × k) + # @param raw_vec [Vector] document vector in term space (m × 1) + # @return [Vector] document in semantic space (k × 1) + # + # @rbs (Matrix, Vector) -> Vector + def project(u, raw_vec) + u.transpose * raw_vec + end + + private + + # Update when new document has a component orthogonal to existing U. + # @rbs (Matrix, Array[Float], Vector, Vector, Float, Integer, Float) -> [Matrix, Array[Float]] + def update_with_new_direction(u, s, m_vec, p_vec, p_norm, max_rank, epsilon) + p_hat = p_vec * (1.0 / p_norm) + k_matrix = build_k_matrix_with_growth(s, m_vec, p_norm) + u_prime, s_prime = small_svd(k_matrix, epsilon) + u_extended = extend_matrix_with_column(u, p_hat) + u_new = u_extended * u_prime + + u_new, s_prime = truncate(u_new, s_prime, max_rank) if s_prime.size > max_rank + + [u_new, s_prime] + end + + # Update when new document is entirely within span of existing U. 
+        # @rbs (Matrix, Array[Float], Vector, Integer, Float) -> [Matrix, Array[Float]]
+        def update_in_span(u, s, m_vec, max_rank, epsilon)
+          k_matrix = build_k_matrix_in_span(s, m_vec)
+          u_prime, s_prime = small_svd(k_matrix, epsilon)
+          u_new = u * u_prime
+
+          u_new, s_prime = truncate(u_new, s_prime, max_rank) if s_prime.size > max_rank
+
+          [u_new, s_prime]
+        end
+
+        # Builds the K matrix when the rank grows by 1.
+        # @rbs (Array[Float], untyped, Float) -> untyped
+        def build_k_matrix_with_growth(s, m_vec, p_norm)
+          k = s.size
+          rows = k.times.map do |i|
+            row = Array.new(k + 1, 0.0) #: Array[Float]
+            row[i] = s[i].to_f
+            row[k] = m_vec[i].to_f
+            row
+          end
+          rows << Array.new(k + 1, 0.0).tap { |r| r[k] = p_norm }
+          Matrix.rows(rows)
+        end
+
+        # Builds the K matrix when the new vector is already in the span of U
+        # (no rank growth). The in-span component m is ignored here, so the
+        # singular values are left unchanged.
+        # @rbs (Array[Float], Vector) -> Matrix
+        def build_k_matrix_in_span(s, _m_vec)
+          k = s.size
+          rows = k.times.map do |i|
+            row = Array.new(k, 0.0)
+            row[i] = s[i]
+            row
+          end
+          Matrix.rows(rows)
+        end
+
+        # Computes the SVD of a small matrix and extracts the singular values
+        # above epsilon in descending order, reordering U's columns to match.
+        # @rbs (Matrix, Float) -> [Matrix, Array[Float]]
+        def small_svd(matrix, epsilon)
+          u, _v, s_array = matrix.SV_decomp
+
+          s_sorted = s_array.select { |sv| sv.abs > epsilon }.sort.reverse
+          indices = s_array.each_with_index
+                           .select { |sv, _| sv.abs > epsilon }
+                           .sort_by { |sv, _| -sv }
+                           .map { |_, i| i }
+
+          u_cols = indices.map { |i| u.column(i).to_a }
+          u_reordered = u_cols.empty? ? Matrix.empty(matrix.row_size, 0) : Matrix.columns(u_cols)
+
+          [u_reordered, s_sorted]
+        end
+
+        # Extends a matrix with a new column.
+        # @rbs (Matrix, Vector) -> Matrix
+        def extend_matrix_with_column(matrix, col_vec)
+          rows = matrix.row_size.times.map do |i|
+            matrix.row(i).to_a + [col_vec[i]]
+          end
+          Matrix.rows(rows)
+        end
+
+        # Truncates U and the singular values to max_rank.
+        # @rbs (untyped, Array[Float], Integer) -> [untyped, Array[Float]]
+        def truncate(u, s, max_rank)
+          s_truncated = s[0...max_rank] || [] #: Array[Float]
+          cols = (0...max_rank).map { |i| u.column(i).to_a }
+          u_truncated = Matrix.columns(cols)
+          [u_truncated, s_truncated]
+        end
+
+        # Computes the Euclidean magnitude of a vector.
+        # @rbs (untyped) -> Float
+        def magnitude(vec)
+          Math.sqrt(vec.to_a.sum { |x| x.to_f * x.to_f })
+        end
+      end
+    end
+  end
+end
+# rubocop:enable Naming/MethodParameterName, Metrics/ParameterLists
diff --git a/lib/classifier/streaming.rb b/lib/classifier/streaming.rb
new file mode 100644
index 00000000..3c228b41
--- /dev/null
+++ b/lib/classifier/streaming.rb
@@ -0,0 +1,122 @@
+# rbs_inline: enabled
+
+require_relative 'streaming/progress'
+require_relative 'streaming/line_reader'
+
+module Classifier
+  # Streaming provides memory-efficient training capabilities for classifiers.
+  # Include this module in a classifier to add streaming and batch training methods.
+  #
+  # @example Including in a classifier
+  #   class MyClassifier
+  #     include Classifier::Streaming
+  #   end
+  #
+  # @example Streaming training
+  #   classifier.train_from_stream(:category, File.open('corpus.txt'))
+  #
+  # @example Batch training with progress
+  #   classifier.train_batch(:category, documents, batch_size: 100) do |progress|
+  #     puts "#{progress.percent}% complete"
+  #   end
+  module Streaming
+    # Default batch size for streaming operations
+    DEFAULT_BATCH_SIZE = 100
+
+    # Trains the classifier from an IO stream.
+    # Each line in the stream is treated as a separate document.
+ # + # @rbs (Symbol | String, IO, ?batch_size: Integer) { (Progress) -> void } -> void + def train_from_stream(category, io, batch_size: DEFAULT_BATCH_SIZE, &block) + raise NotImplementedError, "#{self.class} must implement train_from_stream" + end + + # Trains the classifier with an array of documents in batches. + # Supports both positional and keyword argument styles. + # + # @example Positional style + # classifier.train_batch(:spam, documents, batch_size: 100) + # + # @example Keyword style + # classifier.train_batch(spam: documents, ham: other_docs, batch_size: 100) + # + # @rbs (?(Symbol | String)?, ?Array[String]?, ?batch_size: Integer, **Array[String]) { (Progress) -> void } -> void + def train_batch(category = nil, documents = nil, batch_size: DEFAULT_BATCH_SIZE, **categories, &block) + raise NotImplementedError, "#{self.class} must implement train_batch" + end + + # Saves a checkpoint of the current training state. + # Requires a storage backend to be configured. + # + # @rbs (String) -> void + def save_checkpoint(checkpoint_id) + raise ArgumentError, 'No storage configured' unless respond_to?(:storage) && storage + + original_storage = storage + + begin + self.storage = checkpoint_storage_for(checkpoint_id) + save + ensure + self.storage = original_storage + end + end + + # Lists available checkpoints. + # Requires a storage backend to be configured. + # + # @rbs () -> Array[String] + def list_checkpoints + raise ArgumentError, 'No storage configured' unless respond_to?(:storage) && storage + + case storage + when Storage::File + file_storage = storage #: Storage::File + dir = File.dirname(file_storage.path) + base = File.basename(file_storage.path, '.*') + ext = File.extname(file_storage.path) + + pattern = File.join(dir, "#{base}_checkpoint_*#{ext}") + Dir.glob(pattern).map do |path| + File.basename(path, ext).sub(/^#{Regexp.escape(base)}_checkpoint_/, '') + end.sort + else + [] + end + end + + # Deletes a checkpoint. 
+ # + # @rbs (String) -> void + def delete_checkpoint(checkpoint_id) + raise ArgumentError, 'No storage configured' unless respond_to?(:storage) && storage + + checkpoint_storage = checkpoint_storage_for(checkpoint_id) + checkpoint_storage.delete if checkpoint_storage.exists? + end + + private + + # @rbs (String) -> String + def checkpoint_path_for(checkpoint_id) + raise ArgumentError, 'Storage must be File storage for checkpoints' unless storage.is_a?(Storage::File) + + file_storage = storage #: Storage::File + dir = File.dirname(file_storage.path) + base = File.basename(file_storage.path, '.*') + ext = File.extname(file_storage.path) + + File.join(dir, "#{base}_checkpoint_#{checkpoint_id}#{ext}") + end + + # @rbs (String) -> Storage::Base + def checkpoint_storage_for(checkpoint_id) + case storage + when Storage::File + Storage::File.new(path: checkpoint_path_for(checkpoint_id)) + else + raise ArgumentError, "Checkpoints not supported for #{storage.class}" + end + end + end +end diff --git a/lib/classifier/streaming/line_reader.rb b/lib/classifier/streaming/line_reader.rb new file mode 100644 index 00000000..e0b92f31 --- /dev/null +++ b/lib/classifier/streaming/line_reader.rb @@ -0,0 +1,99 @@ +# rbs_inline: enabled + +module Classifier + module Streaming + # Memory-efficient line reader for large files and IO streams. + # Reads lines one at a time and can yield in configurable batches. + # + # @example Reading line by line + # reader = LineReader.new(File.open('large_corpus.txt')) + # reader.each { |line| process(line) } + # + # @example Reading in batches + # reader = LineReader.new(io, batch_size: 100) + # reader.each_batch { |batch| process_batch(batch) } + class LineReader + include Enumerable #[String] + + # @rbs @io: IO + # @rbs @batch_size: Integer + + attr_reader :batch_size + + # Creates a new LineReader. 
+ # + # @rbs (IO, ?batch_size: Integer) -> void + def initialize(io, batch_size: 100) + @io = io + @batch_size = batch_size + end + + # Iterates over each line in the IO stream. + # Lines are chomped (trailing newlines removed). + # + # @rbs () { (String) -> void } -> void + # @rbs () -> Enumerator[String, void] + def each + return enum_for(:each) unless block_given? + + @io.each_line do |line| + yield line.chomp + end + end + + # Iterates over batches of lines. + # Each batch is an array of chomped lines. + # + # @rbs () { (Array[String]) -> void } -> void + # @rbs () -> Enumerator[Array[String], void] + def each_batch + return enum_for(:each_batch) unless block_given? + + batch = [] #: Array[String] + each do |line| + batch << line + if batch.size >= @batch_size + yield batch + batch = [] + end + end + yield batch unless batch.empty? + end + + # Estimates the total number of lines in the IO stream. + # This is a rough estimate based on file size and average line length. + # Returns nil for non-seekable streams. + # + # @rbs (?sample_size: Integer) -> Integer? + def estimate_line_count(sample_size: 100) + return nil unless @io.respond_to?(:size) && @io.respond_to?(:rewind) + + begin + original_pos = @io.pos + @io.rewind + + sample_bytes = 0 + sample_lines = 0 + + sample_size.times do + line = @io.gets + break unless line + + sample_bytes += line.bytesize + sample_lines += 1 + end + + @io.seek(original_pos) + + return nil if sample_lines.zero? 
+ + avg_line_size = sample_bytes.to_f / sample_lines + io_size = @io.__send__(:size) #: Integer + (io_size / avg_line_size).round + rescue IOError, Errno::ESPIPE + nil + end + end + end + end +end diff --git a/lib/classifier/streaming/progress.rb b/lib/classifier/streaming/progress.rb new file mode 100644 index 00000000..d747674a --- /dev/null +++ b/lib/classifier/streaming/progress.rb @@ -0,0 +1,96 @@ +# rbs_inline: enabled + +module Classifier + module Streaming + # Progress tracking object yielded to blocks during batch/stream operations. + # Provides information about training progress including completion percentage, + # elapsed time, processing rate, and estimated time remaining. + # + # @example Basic usage with train_batch + # classifier.train_batch(:spam, documents, batch_size: 100) do |progress| + # puts "#{progress.completed}/#{progress.total} (#{progress.percent}%)" + # puts "Rate: #{progress.rate.round(1)} docs/sec" + # puts "ETA: #{progress.eta&.round}s" if progress.eta + # end + class Progress + # @rbs @completed: Integer + # @rbs @total: Integer? + # @rbs @start_time: Time + # @rbs @current_batch: Integer + + attr_reader :start_time, :total + attr_accessor :completed, :current_batch + + # @rbs (?total: Integer?, ?completed: Integer) -> void + def initialize(total: nil, completed: 0) + @completed = completed + @total = total + @start_time = Time.now + @current_batch = 0 + end + + # Returns the completion percentage (0-100). + # Returns nil if total is unknown. + # + # @rbs () -> Float? + def percent + return nil unless @total&.positive? + + (@completed.to_f / @total * 100).round(2) + end + + # Returns the elapsed time in seconds since the operation started. + # + # @rbs () -> Float + def elapsed + Time.now - @start_time + end + + # Returns the processing rate in items per second. + # Returns 0 if no time has elapsed. + # + # @rbs () -> Float + def rate + e = elapsed + return 0.0 if e.zero? 
+ + @completed / e + end + + # Returns the estimated time remaining in seconds. + # Returns nil if total is unknown or rate is zero. + # + # @rbs () -> Float? + def eta + return nil unless @total + return nil if rate.zero? + return 0.0 if @completed >= @total + + (@total - @completed) / rate + end + + # Returns true if the operation is complete. + # + # @rbs () -> bool + def complete? + return false unless @total + + @completed >= @total + end + + # Returns a hash representation of the progress state. + # + # @rbs () -> Hash[Symbol, untyped] + def to_h + { + completed: @completed, + total: @total, + percent: percent, + elapsed: elapsed.round(2), + rate: rate.round(2), + eta: eta&.round(2) + } + end + end + end +end diff --git a/lib/classifier/tfidf.rb b/lib/classifier/tfidf.rb index 02520ee7..56c52b15 100644 --- a/lib/classifier/tfidf.rb +++ b/lib/classifier/tfidf.rb @@ -16,6 +16,8 @@ module Classifier # tfidf.transform("Dogs are loyal") # => {:dog=>0.7071..., :loyal=>0.7071...} # class TFIDF + include Streaming + # @rbs @min_df: Integer | Float # @rbs @max_df: Integer | Float # @rbs @ngram_range: Array[Integer] @@ -246,6 +248,70 @@ def marshal_load(data) @storage = nil end + # Loads a vectorizer from a checkpoint. + # + # @rbs (storage: Storage::Base, checkpoint_id: String) -> TFIDF + def self.load_checkpoint(storage:, checkpoint_id:) + raise ArgumentError, 'Storage must be File storage for checkpoints' unless storage.is_a?(Storage::File) + + dir = File.dirname(storage.path) + base = File.basename(storage.path, '.*') + ext = File.extname(storage.path) + checkpoint_path = File.join(dir, "#{base}_checkpoint_#{checkpoint_id}#{ext}") + + checkpoint_storage = Storage::File.new(path: checkpoint_path) + instance = load(storage: checkpoint_storage) + instance.storage = storage + instance + end + + # Fits the vectorizer from an IO stream. + # Collects all documents from the stream, then fits the model. 
+ # Note: All documents must be collected in memory for IDF calculation. + # + # @example Fit from a file + # tfidf.fit_from_stream(File.open('corpus.txt')) + # + # @example With progress tracking + # tfidf.fit_from_stream(io, batch_size: 500) do |progress| + # puts "#{progress.completed} documents loaded" + # end + # + # @rbs (IO, ?batch_size: Integer) { (Streaming::Progress) -> void } -> self + def fit_from_stream(io, batch_size: Streaming::DEFAULT_BATCH_SIZE) + reader = Streaming::LineReader.new(io, batch_size: batch_size) + total = reader.estimate_line_count + progress = Streaming::Progress.new(total: total) + + documents = [] #: Array[String] + + reader.each_batch do |batch| + documents.concat(batch) + progress.completed += batch.size + progress.current_batch += 1 + yield progress if block_given? + end + + fit(documents) unless documents.empty? + self + end + + # TFIDF doesn't support train_from_stream (use fit_from_stream instead). + # This method raises NotImplementedError with guidance. + # + # @rbs (*untyped, **untyped) -> void + def train_from_stream(*) # steep:ignore + raise NotImplementedError, 'TFIDF uses fit_from_stream instead of train_from_stream' + end + + # TFIDF doesn't support train_batch (use fit instead). + # This method raises NotImplementedError with guidance. + # + # @rbs (*untyped, **untyped) -> void + def train_batch(*) # steep:ignore + raise NotImplementedError, 'TFIDF uses fit instead of train_batch' + end + private # Restores vectorizer state from JSON string. @@ -329,9 +395,7 @@ def validate_df!(value, name) # @rbs (Array[Integer]) -> void def validate_ngram_range!(range) raise ArgumentError, 'ngram_range must be an array of two integers' unless range.is_a?(Array) && range.size == 2 - unless range.all?(Integer) && range.all?(&:positive?) - raise ArgumentError, 'ngram_range values must be positive integers' - end + raise ArgumentError, 'ngram_range values must be positive integers' unless range.all?(Integer) && range.all?(&:positive?) 
raise ArgumentError, 'ngram_range[0] must be <= ngram_range[1]' if range[0] > range[1] end diff --git a/sig/vendor/matrix.rbs b/sig/vendor/matrix.rbs index 910018e8..0ec2dab2 100644 --- a/sig/vendor/matrix.rbs +++ b/sig/vendor/matrix.rbs @@ -1,26 +1,37 @@ # Type stubs for matrix gem -class Vector[T] +# Using untyped elements since our usage is primarily with Floats/Numerics +class Vector EPSILON: Float - def self.[]: [T] (*T) -> Vector[T] + def self.[]: (*untyped) -> Vector def size: () -> Integer - def []: (Integer) -> T + def []: (Integer) -> untyped def magnitude: () -> Float - def normalize: () -> Vector[T] - def each: () { (T) -> void } -> void - def collect: [U] () { (T) -> U } -> Vector[U] - def to_a: () -> Array[T] + def normalize: () -> Vector + def each: () { (untyped) -> void } -> void + def collect: () { (untyped) -> untyped } -> Vector + def to_a: () -> Array[untyped] def *: (untyped) -> untyped + def -: (Vector) -> Vector + def is_a?: (untyped) -> bool end -class Matrix[T] - def self.rows: [T] (Array[Array[T]]) -> Matrix[T] - def self.[]: [T] (*Array[T]) -> Matrix[T] - def self.diag: (untyped) -> Matrix[untyped] - def trans: () -> Matrix[T] +class Matrix + def self.rows: (Array[Array[untyped]]) -> Matrix + def self.[]: (*Array[untyped]) -> Matrix + def self.diag: (untyped) -> Matrix + def self.columns: (Array[Array[untyped]]) -> Matrix + def self.empty: (Integer, Integer) -> Matrix + def self.zero: (Integer, Integer) -> Matrix + def self.vstack: (Matrix, Matrix) -> Matrix + def trans: () -> Matrix + def transpose: () -> Matrix def *: (untyped) -> untyped def row_size: () -> Integer def column_size: () -> Integer - def column: (Integer) -> Vector[T] - def SV_decomp: () -> [Matrix[T], Matrix[T], untyped] + def row: (Integer) -> Vector + def column: (Integer) -> Vector + def SV_decomp: () -> [Matrix, Matrix, untyped] + def is_a?: (untyped) -> bool + def respond_to?: (Symbol) -> bool end diff --git a/sig/vendor/streaming.rbs b/sig/vendor/streaming.rbs new 
file mode 100644 index 00000000..ee42c16e --- /dev/null +++ b/sig/vendor/streaming.rbs @@ -0,0 +1,14 @@ +# Type stubs for Streaming module +# Defines the interface that including classes must implement + +module Classifier + # Interface for classes that include Streaming + interface _StreamingHost + def storage: () -> Storage::Base? + def storage=: (Storage::Base?) -> void + def save: () -> void + end + + module Streaming : _StreamingHost + end +end diff --git a/test/bayes/streaming_test.rb b/test/bayes/streaming_test.rb new file mode 100644 index 00000000..fd2f311a --- /dev/null +++ b/test/bayes/streaming_test.rb @@ -0,0 +1,337 @@ +require_relative '../test_helper' +require 'stringio' +require 'tempfile' + +class BayesStreamingTest < Minitest::Test + def setup + @classifier = Classifier::Bayes.new('Spam', 'Ham') + end + + # train_from_stream tests + + def test_train_from_stream_basic + io = StringIO.new("buy now cheap\nfree money\nlimited offer\n") + @classifier.train_from_stream(:spam, io) + + # Should have trained 3 documents + assert_equal 'Spam', @classifier.classify('buy cheap free') + end + + def test_train_from_stream_empty_io + io = StringIO.new('') + @classifier.train_from_stream(:spam, io) + + # No documents trained, classifier should still work + assert_includes @classifier.categories, 'Spam' + end + + def test_train_from_stream_single_line + io = StringIO.new("this is spam content\n") + @classifier.train_from_stream(:spam, io) + + # Train some ham to make classification meaningful + @classifier.train(:ham, 'this is normal email') + + result = @classifier.classify('spam content') + + assert_equal 'Spam', result + end + + def test_train_from_stream_with_batch_size + lines = (1..50).map { |i| "document number #{i} with some content" } + io = StringIO.new(lines.join("\n")) + + batches_processed = 0 + @classifier.train_from_stream(:spam, io, batch_size: 10) do |progress| + batches_processed = progress.current_batch + end + + assert_equal 5, 
batches_processed + end + + def test_train_from_stream_progress_tracking + lines = (1..25).map { |i| "line #{i}" } + io = StringIO.new(lines.join("\n")) + + completed_values = [] + @classifier.train_from_stream(:spam, io, batch_size: 10) do |progress| + completed_values << progress.completed + end + + assert_equal [10, 20, 25], completed_values + end + + def test_train_from_stream_with_progress_block + io = StringIO.new("doc1\ndoc2\ndoc3\n") + progress_received = nil + + @classifier.train_from_stream(:spam, io, batch_size: 2) do |progress| + progress_received = progress + end + + assert_instance_of Classifier::Streaming::Progress, progress_received + assert_equal 3, progress_received.completed + end + + def test_train_from_stream_invalid_category + io = StringIO.new("some content\n") + + assert_raises(StandardError) do + @classifier.train_from_stream(:nonexistent, io) + end + end + + def test_train_from_stream_marks_dirty + io = StringIO.new("content\n") + + refute_predicate @classifier, :dirty? + + @classifier.train_from_stream(:spam, io) + + assert_predicate @classifier, :dirty? 
+ end + + def test_train_from_stream_with_file + Tempfile.create(['corpus', '.txt']) do |file| + file.puts 'spam message one' + file.puts 'spam message two' + file.puts 'spam message three' + file.flush + file.rewind + + @classifier.train_from_stream(:spam, file) + end + + @classifier.train(:ham, 'normal message here') + + assert_equal 'Spam', @classifier.classify('spam message') + end + + # train_batch tests + + def test_train_batch_positional_style + documents = ['buy now', 'free money', 'limited offer'] + @classifier.train_batch(:spam, documents) + + @classifier.train(:ham, 'hello friend') + + assert_equal 'Spam', @classifier.classify('buy free limited') + end + + def test_train_batch_keyword_style + spam_docs = ['buy now', 'free money'] + ham_docs = ['hello friend', 'meeting tomorrow'] + + @classifier.train_batch(spam: spam_docs, ham: ham_docs) + + assert_equal 'Spam', @classifier.classify('buy free') + assert_equal 'Ham', @classifier.classify('hello meeting') + end + + def test_train_batch_with_batch_size + documents = (1..100).map { |i| "document #{i}" } + + batches = 0 + @classifier.train_batch(:spam, documents, batch_size: 25) do |_progress| + batches += 1 + end + + assert_equal 4, batches + end + + def test_train_batch_progress_tracking + documents = (1..30).map { |i| "doc #{i}" } + + completed_values = [] + @classifier.train_batch(:spam, documents, batch_size: 10) do |progress| + completed_values << progress.completed + end + + assert_equal [10, 20, 30], completed_values + end + + def test_train_batch_progress_percent + documents = (1..100).map { |i| "doc #{i}" } + + percent_values = [] + @classifier.train_batch(:spam, documents, batch_size: 25) do |progress| + percent_values << progress.percent + end + + assert_equal [25.0, 50.0, 75.0, 100.0], percent_values + end + + def test_train_batch_empty_array + @classifier.train_batch(:spam, []) + # Should not raise, classifier should be unchanged + assert_includes @classifier.categories, 'Spam' + end + + def 
test_train_batch_invalid_category + assert_raises(StandardError) do + @classifier.train_batch(:nonexistent, ['doc']) + end + end + + def test_train_batch_marks_dirty + refute_predicate @classifier, :dirty? + @classifier.train_batch(:spam, ['content']) + + assert_predicate @classifier, :dirty? + end + + def test_train_batch_multiple_categories + @classifier.train_batch( + spam: ['buy now', 'free offer'], + ham: %w[hello meeting] + ) + + assert_equal 'Spam', @classifier.classify('buy free') + assert_equal 'Ham', @classifier.classify('hello meeting') + end + + # Equivalence tests + + def test_train_batch_equivalent_to_train + classifier1 = Classifier::Bayes.new('Spam', 'Ham') + classifier2 = Classifier::Bayes.new('Spam', 'Ham') + + documents = ['buy now cheap', 'free money fast', 'limited time offer'] + + # Train with regular train + documents.each { |doc| classifier1.train(:spam, doc) } + + # Train with train_batch + classifier2.train_batch(:spam, documents) + + # Both should classify the same + test_doc = 'buy cheap free limited' + + assert_equal classifier1.classify(test_doc), classifier2.classify(test_doc) + + # Classifications should be identical + assert_equal classifier1.classifications(test_doc), classifier2.classifications(test_doc) + end + + def test_train_from_stream_equivalent_to_train + classifier1 = Classifier::Bayes.new('Spam', 'Ham') + classifier2 = Classifier::Bayes.new('Spam', 'Ham') + + documents = ['buy now cheap', 'free money fast', 'limited time offer'] + + # Train with regular train + documents.each { |doc| classifier1.train(:spam, doc) } + + # Train with train_from_stream + io = StringIO.new(documents.join("\n")) + classifier2.train_from_stream(:spam, io) + + # Both should classify the same + test_doc = 'buy cheap free limited' + + assert_equal classifier1.classify(test_doc), classifier2.classify(test_doc) + end + + # Checkpoint tests + + def test_save_checkpoint + Dir.mktmpdir do |dir| + path = File.join(dir, 'classifier.json') + 
@classifier.storage = Classifier::Storage::File.new(path: path) + + @classifier.train(:spam, 'buy now cheap') + @classifier.save_checkpoint('50pct') + + checkpoint_path = File.join(dir, 'classifier_checkpoint_50pct.json') + + assert_path_exists checkpoint_path + end + end + + def test_load_checkpoint + Dir.mktmpdir do |dir| + path = File.join(dir, 'classifier.json') + storage = Classifier::Storage::File.new(path: path) + @classifier.storage = storage + + @classifier.train(:spam, 'buy now cheap') + @classifier.save_checkpoint('halfway') + + # Load from checkpoint + loaded = Classifier::Bayes.load_checkpoint(storage: storage, checkpoint_id: 'halfway') + + # Should have the same training + assert_equal @classifier.classify('buy cheap'), loaded.classify('buy cheap') + end + end + + def test_list_checkpoints + Dir.mktmpdir do |dir| + path = File.join(dir, 'classifier.json') + @classifier.storage = Classifier::Storage::File.new(path: path) + + @classifier.train(:spam, 'content') + @classifier.save_checkpoint('10pct') + @classifier.save_checkpoint('50pct') + @classifier.save_checkpoint('90pct') + + checkpoints = @classifier.list_checkpoints + + assert_equal %w[10pct 50pct 90pct], checkpoints.sort + end + end + + def test_delete_checkpoint + Dir.mktmpdir do |dir| + path = File.join(dir, 'classifier.json') + @classifier.storage = Classifier::Storage::File.new(path: path) + + @classifier.train(:spam, 'content') + @classifier.save_checkpoint('test') + + checkpoint_path = File.join(dir, 'classifier_checkpoint_test.json') + + assert_path_exists checkpoint_path + + @classifier.delete_checkpoint('test') + + refute_path_exists checkpoint_path + end + end + + def test_save_checkpoint_requires_storage + assert_raises(ArgumentError) do + @classifier.save_checkpoint('test') + end + end + + def test_checkpoint_workflow + Dir.mktmpdir do |dir| + path = File.join(dir, 'classifier.json') + storage = Classifier::Storage::File.new(path: path) + + # First training session + classifier1 = 
Classifier::Bayes.new('Spam', 'Ham') + classifier1.storage = storage + + classifier1.train(:spam, 'buy now') + classifier1.train(:spam, 'free money') + classifier1.save_checkpoint('phase1') + + # Simulate restart - load from checkpoint + classifier2 = Classifier::Bayes.load_checkpoint(storage: storage, checkpoint_id: 'phase1') + + # Continue training + classifier2.train(:ham, 'hello friend') + classifier2.train(:ham, 'meeting tomorrow') + classifier2.save_checkpoint('phase2') + + # Load final checkpoint + final = Classifier::Bayes.load_checkpoint(storage: storage, checkpoint_id: 'phase2') + + # Should have all training + assert_equal 'Spam', final.classify('buy free') + assert_equal 'Ham', final.classify('hello meeting') + end + end +end diff --git a/test/knn/knn_test.rb b/test/knn/knn_test.rb index 9eed7dbb..f9e213bd 100644 --- a/test/knn/knn_test.rb +++ b/test/knn/knn_test.rb @@ -529,4 +529,56 @@ def test_items_returns_copy assert_equal 1, knn.items.size end + + # API consistency tests (with Bayes and LogisticRegression) + + def test_train_alias_for_add + knn = Classifier::KNN.new(k: 3) + knn.train(Dog: [@str1, @str2], Cat: [@str3, @str4]) + + assert_equal 4, knn.items.size + assert_equal 'Dog', knn.classify('This is about dogs') + assert_equal 'Cat', knn.classify('This is about cats') + end + + def test_dynamic_train_methods + knn = Classifier::KNN.new(k: 3) + knn.train_dog @str1, @str2 + knn.train_cat @str3, @str4 + + assert_equal 4, knn.items.size + # Dynamic methods create lowercase category names from method name + assert_equal 'dog', knn.classify('This is about dogs') + assert_equal 'cat', knn.classify('This is about cats') + end + + def test_respond_to_train_methods + knn = Classifier::KNN.new + + assert_respond_to knn, :train + assert_respond_to knn, :train_spam + assert_respond_to knn, :train_any_category + end + + def test_classify_returns_string_not_symbol + knn = Classifier::KNN.new(k: 3) + knn.add(dog: [@str1, @str2], cat: [@str3, @str4]) # symbol 
keys + + result = knn.classify('This is about dogs') + + assert_instance_of String, result + assert_equal 'dog', result + end + + def test_categories_returns_array_of_strings + knn = Classifier::KNN.new + knn.add(dog: 'Dogs are great', cat: 'Cats are independent') + + cats = knn.categories + + assert_instance_of Array, cats + cats.each { |cat| assert_instance_of String, cat } + assert_includes cats, 'dog' + assert_includes cats, 'cat' + end end diff --git a/test/lsi/incremental_svd_test.rb b/test/lsi/incremental_svd_test.rb new file mode 100644 index 00000000..9ddfbcc8 --- /dev/null +++ b/test/lsi/incremental_svd_test.rb @@ -0,0 +1,304 @@ +require_relative '../test_helper' + +class IncrementalSVDModuleTest < Minitest::Test + def test_update_with_new_direction + # Create a simple U matrix (3 terms × 2 components) + u = Matrix[ + [0.5, 0.5], + [0.5, -0.5], + [0.707, 0.0] + ] + s = [2.0, 1.0] + + # New document vector (3 terms) + c = Vector[0.1, 0.2, 0.9] + + u_new, s_new = Classifier::LSI::IncrementalSVD.update(u, s, c, max_rank: 10) + + # Should have updated matrices + assert_instance_of Matrix, u_new + assert_instance_of Array, s_new + + # Rank may have increased by 1 (if new direction found) + assert_operator s_new.size, :>=, s.size + assert_operator s_new.size, :<=, s.size + 1 + end + + def test_update_preserves_orthogonality + # Start with orthonormal U + u = Matrix[ + [1.0, 0.0], + [0.0, 1.0], + [0.0, 0.0] + ] + s = [2.0, 1.5] + + c = Vector[0.5, 0.5, 0.707] + + u_new, _s_new = Classifier::LSI::IncrementalSVD.update(u, s, c, max_rank: 10) + + # Check columns are approximately orthonormal + u_new.column_size.times do |i| + col_i = u_new.column(i) + # Column should have unit length (approximately) + norm = Math.sqrt(col_i.to_a.sum { |x| x * x }) + + assert_in_delta 1.0, norm, 0.1, "Column #{i} should have unit length" + end + end + + def test_update_with_vector_in_span + # U spans the first two dimensions + u = Matrix[ + [1.0, 0.0], + [0.0, 1.0], + [0.0, 0.0] + ] 
+ s = [2.0, 1.0] + + # Vector entirely in span of U (no component in third dimension) + c = Vector[0.6, 0.8, 0.0] + + _u_new, s_new = Classifier::LSI::IncrementalSVD.update(u, s, c, max_rank: 10) + + # Rank should not increase when vector is in span + assert_equal s.size, s_new.size + end + + def test_update_respects_max_rank + u = Matrix[ + [1.0, 0.0], + [0.0, 1.0], + [0.0, 0.0] + ] + s = [2.0, 1.0] + + c = Vector[0.1, 0.1, 0.99] + + # With max_rank = 2, should not exceed 2 components + u_new, s_new = Classifier::LSI::IncrementalSVD.update(u, s, c, max_rank: 2) + + assert_equal 2, s_new.size + assert_equal 2, u_new.column_size + end + + def test_singular_values_sorted_descending + u = Matrix[ + [0.707, 0.707], + [0.707, -0.707], + [0.0, 0.0] + ] + s = [3.0, 1.0] + + c = Vector[0.5, 0.5, 0.707] + + _u_new, s_new = Classifier::LSI::IncrementalSVD.update(u, s, c, max_rank: 10) + + # Singular values should be sorted in descending order + (0...(s_new.size - 1)).each do |i| + assert_operator s_new[i], :>=, s_new[i + 1], 'Singular values should be descending' + end + end +end + +class LSIIncrementalModeTest < Minitest::Test + def setup + @dog_docs = [ + 'dogs are loyal pets that bark', + 'puppies are playful young dogs', + 'dogs love to play fetch' + ] + @cat_docs = [ + 'cats are independent pets', + 'kittens are curious creatures', + 'cats meow and purr softly' + ] + end + + def test_incremental_mode_initialization + lsi = Classifier::LSI.new(incremental: true) + + assert lsi.instance_variable_get(:@incremental_mode) + refute_predicate lsi, :incremental_enabled? 
# Not active until first build + end + + def test_incremental_mode_with_max_rank + lsi = Classifier::LSI.new(incremental: true, max_rank: 50) + + assert_equal 50, lsi.instance_variable_get(:@max_rank) + end + + def test_incremental_enabled_after_build + lsi = Classifier::LSI.new(incremental: true, auto_rebuild: false) + + @dog_docs.each { |doc| lsi.add_item(doc, :dog) } + @cat_docs.each { |doc| lsi.add_item(doc, :cat) } + + refute_predicate lsi, :incremental_enabled? + + lsi.build_index + + assert_predicate lsi, :incremental_enabled? + assert_instance_of Matrix, lsi.instance_variable_get(:@u_matrix) + end + + def test_incremental_add_uses_incremental_update + lsi = Classifier::LSI.new(incremental: true, auto_rebuild: false) + + # Add initial documents + @dog_docs.each { |doc| lsi.add_item(doc, :dog) } + @cat_docs.each { |doc| lsi.add_item(doc, :cat) } + lsi.build_index + + initial_version = lsi.instance_variable_get(:@version) + + # Add new document - should use incremental update + lsi.add_item('my dog loves to run and play', :dog) + + # Version should have incremented + assert_equal initial_version + 1, lsi.instance_variable_get(:@version) + + # Should still be in incremental mode + assert_predicate lsi, :incremental_enabled? 
+ end + + def test_incremental_classification_works + lsi = Classifier::LSI.new(incremental: true, auto_rebuild: false) + + @dog_docs.each { |doc| lsi.add_item(doc, :dog) } + @cat_docs.each { |doc| lsi.add_item(doc, :cat) } + lsi.build_index + + # Add more documents incrementally + lsi.add_item('dogs are wonderful companions', :dog) + lsi.add_item('cats sleep a lot during the day', :cat) + + # Classification should work + result = lsi.classify('loyal pet that barks').to_s + + assert_equal 'dog', result + + result = lsi.classify('independent creature that meows').to_s + + assert_equal 'cat', result + end + + def test_current_rank + lsi = Classifier::LSI.new(incremental: true, auto_rebuild: false) + + @dog_docs.each { |doc| lsi.add_item(doc, :dog) } + @cat_docs.each { |doc| lsi.add_item(doc, :cat) } + lsi.build_index + + rank = lsi.current_rank + + assert_instance_of Integer, rank + assert_predicate rank, :positive? + end + + def test_disable_incremental_mode + lsi = Classifier::LSI.new(incremental: true, auto_rebuild: false) + + @dog_docs.each { |doc| lsi.add_item(doc, :dog) } + lsi.build_index + + assert_predicate lsi, :incremental_enabled? + + lsi.disable_incremental_mode! + + refute_predicate lsi, :incremental_enabled? + assert_nil lsi.instance_variable_get(:@u_matrix) + end + + def test_enable_incremental_mode + lsi = Classifier::LSI.new(auto_rebuild: false) + + @dog_docs.each { |doc| lsi.add_item(doc, :dog) } + lsi.build_index + + refute_predicate lsi, :incremental_enabled? + + lsi.enable_incremental_mode!(max_rank: 75) + lsi.build_index(force: true) + + assert_predicate lsi, :incremental_enabled? + assert_equal 75, lsi.instance_variable_get(:@max_rank) + end + + def test_force_rebuild + lsi = Classifier::LSI.new(incremental: true, auto_rebuild: false) + + @dog_docs.each { |doc| lsi.add_item(doc, :dog) } + lsi.build_index + + # Force rebuild should work + lsi.build_index(force: true) + + assert_predicate lsi, :incremental_enabled? 
+ end + + def test_vocabulary_growth_triggers_full_rebuild + lsi = Classifier::LSI.new(incremental: true, auto_rebuild: false) + + # Start with a small vocabulary + lsi.add_item('dog', :animal) + lsi.add_item('cat', :animal) + lsi.build_index + + # Store initial vocab size to verify growth detection works + _initial_vocab_size = lsi.instance_variable_get(:@initial_vocab_size) + + # Add document with many new words (> 20% of initial vocabulary) + # This should trigger a full rebuild and disable incremental mode + many_new_words = (1..100).map { |i| "newword#{i}" }.join(' ') + lsi.add_item(many_new_words, :animal) + + # After vocabulary growth > 20%, incremental mode should be disabled + # and a full rebuild should have occurred + refute_predicate lsi, :incremental_enabled? + end + + def test_incremental_produces_reasonable_results + # Build with full SVD + lsi_full = Classifier::LSI.new(auto_rebuild: false) + @dog_docs.each { |doc| lsi_full.add_item(doc, :dog) } + @cat_docs.each { |doc| lsi_full.add_item(doc, :cat) } + lsi_full.add_item('my dog is a great friend', :dog) + lsi_full.build_index + + # Build with incremental mode + lsi_inc = Classifier::LSI.new(incremental: true, auto_rebuild: false) + @dog_docs.each { |doc| lsi_inc.add_item(doc, :dog) } + @cat_docs.each { |doc| lsi_inc.add_item(doc, :cat) } + lsi_inc.build_index + lsi_inc.add_item('my dog is a great friend', :dog) + + # Both should classify test documents reasonably + # Note: Results may differ due to approximation, but should be reasonable + test_doc = 'loyal pet barking' + + _full_result = lsi_full.classify(test_doc).to_s + inc_result = lsi_inc.classify(test_doc).to_s + + # At minimum, incremental should produce a valid classification + assert_includes %w[dog cat], inc_result + end + + def test_incremental_with_streaming_api + lsi = Classifier::LSI.new(incremental: true, auto_rebuild: false) + + # Add initial batch + @dog_docs.each { |doc| lsi.add_item(doc, :dog) } + @cat_docs.each { |doc| 
lsi.add_item(doc, :cat) } + lsi.build_index + + # Use streaming API to add more + io = StringIO.new("dogs bark loudly\ndogs wag their tails\n") + lsi.train_from_stream(:dog, io) + + # Should still work + result = lsi.classify('barking dog wags tail').to_s + + assert_equal 'dog', result + end +end diff --git a/test/lsi/streaming_test.rb b/test/lsi/streaming_test.rb new file mode 100644 index 00000000..94bfaafb --- /dev/null +++ b/test/lsi/streaming_test.rb @@ -0,0 +1,343 @@ +require_relative '../test_helper' +require 'stringio' +require 'tempfile' + +class LSIStreamingTest < Minitest::Test + def setup + @lsi = Classifier::LSI.new + end + + # train_from_stream tests + + def test_train_from_stream_basic + dog_docs = "dogs are loyal pets\npuppies are playful\ndogs bark at strangers\n" + cat_docs = "cats are independent\nkittens are curious\ncats meow softly\n" + + @lsi.train_from_stream(:dog, StringIO.new(dog_docs)) + @lsi.train_from_stream(:cat, StringIO.new(cat_docs)) + + # Should be able to classify + # Note: classify returns the category as originally stored (symbol in this case) + result = @lsi.classify('loyal pet that barks') + + assert_equal 'dog', result.to_s + end + + def test_train_from_stream_empty_io + @lsi.train_from_stream(:category, StringIO.new('')) + + # No items added + assert_empty @lsi.items + end + + def test_train_from_stream_single_line + @lsi.train_from_stream(:dog, StringIO.new("dogs are loyal pets\n")) + @lsi.train_from_stream(:cat, StringIO.new("cats are independent\n")) + + # Should have 2 items + assert_equal 2, @lsi.items.size + end + + def test_train_from_stream_with_batch_size + lines = (1..50).map { |i| "document number #{i} about dogs" } + io = StringIO.new(lines.join("\n")) + + batches_processed = 0 + @lsi.train_from_stream(:dog, io, batch_size: 10) do |progress| + batches_processed = progress.current_batch + end + + assert_equal 5, batches_processed + end + + def test_train_from_stream_progress_tracking + lines = (1..25).map { |i| 
"document #{i}" } + io = StringIO.new(lines.join("\n")) + + completed_values = [] + @lsi.train_from_stream(:category, io, batch_size: 10) do |progress| + completed_values << progress.completed + end + + assert_equal [10, 20, 25], completed_values + end + + def test_train_from_stream_marks_dirty + refute_predicate @lsi, :dirty? + @lsi.train_from_stream(:category, StringIO.new("content\n")) + + assert_predicate @lsi, :dirty? + end + + def test_train_from_stream_rebuilds_index_when_auto_rebuild + @lsi = Classifier::LSI.new(auto_rebuild: true) + + dog_docs = "dogs are loyal\ndogs bark\n" + cat_docs = "cats are independent\ncats meow\n" + + @lsi.train_from_stream(:dog, StringIO.new(dog_docs)) + @lsi.train_from_stream(:cat, StringIO.new(cat_docs)) + + # Index should be built + refute_predicate @lsi, :needs_rebuild? + end + + def test_train_from_stream_skips_rebuild_when_auto_rebuild_false + @lsi = Classifier::LSI.new(auto_rebuild: false) + + @lsi.train_from_stream(:category, StringIO.new("document one\ndocument two\n")) + + # Index should need rebuild + assert_predicate @lsi, :needs_rebuild? 
+ end + + def test_train_from_stream_with_file + Tempfile.create(['corpus', '.txt']) do |file| + file.puts 'dogs are loyal pets' + file.puts 'puppies are playful animals' + file.puts 'dogs bark and play' + file.flush + file.rewind + + @lsi.train_from_stream(:dog, file) + end + + @lsi.add_item('cats are independent', :cat) + @lsi.add_item('kittens are curious', :cat) + + result = @lsi.classify('loyal pet') + + assert_equal 'dog', result.to_s + end + + # add_batch tests + + def test_add_batch_basic + dog_docs = ['dogs are loyal', 'puppies play', 'dogs bark'] + cat_docs = ['cats are independent', 'kittens curious', 'cats meow'] + + @lsi.add_batch(dog: dog_docs, cat: cat_docs) + + assert_equal 6, @lsi.items.size + assert_equal 'dog', @lsi.classify('loyal dog barks').to_s + end + + def test_add_batch_with_progress + docs = (1..30).map { |i| "document #{i}" } + + completed_values = [] + @lsi.add_batch(batch_size: 10, category: docs) do |progress| + completed_values << progress.completed + end + + assert_equal [10, 20, 30], completed_values + end + + def test_add_batch_empty + @lsi.add_batch(category: []) + + assert_empty @lsi.items + end + + def test_add_batch_marks_dirty + refute_predicate @lsi, :dirty? + @lsi.add_batch(category: ['doc']) + + assert_predicate @lsi, :dirty? + end + + def test_add_batch_rebuilds_when_auto_rebuild + @lsi = Classifier::LSI.new(auto_rebuild: true) + + @lsi.add_batch( + dog: ['dogs bark', 'puppies play'], + cat: ['cats meow', 'kittens purr'] + ) + + refute_predicate @lsi, :needs_rebuild? 
+ end + + # train_batch tests (alias for add_batch) + + def test_train_batch_positional_style + docs = ['dogs are loyal', 'puppies play'] + @lsi.train_batch(:dog, docs) + + @lsi.add_item('cats are independent', :cat) + + assert_equal 3, @lsi.items.size + end + + def test_train_batch_keyword_style + @lsi.train_batch( + dog: ['dogs are loyal', 'puppies play'], + cat: ['cats are independent', 'kittens curious'] + ) + + assert_equal 4, @lsi.items.size + assert_equal 'dog', @lsi.classify('loyal dog').to_s + end + + def test_train_batch_with_progress + docs = (1..20).map { |i| "doc #{i}" } + + batches = 0 + @lsi.train_batch(:category, docs, batch_size: 5) do |_progress| + batches += 1 + end + + assert_equal 4, batches + end + + # Equivalence tests + + def test_train_from_stream_equivalent_to_add_item + lsi1 = Classifier::LSI.new + lsi2 = Classifier::LSI.new + + documents = ['dogs are loyal pets', 'puppies are playful', 'dogs bark at strangers'] + + # Add with add_item + documents.each { |doc| lsi1.add_item(doc, :dog) } + + # Add with train_from_stream + io = StringIO.new(documents.join("\n")) + lsi2.train_from_stream(:dog, io) + + # Both should have same items + assert_equal lsi1.items.size, lsi2.items.size + + # Add some cat documents to both for classification + lsi1.add_item('cats are independent', :cat) + lsi2.add_item('cats are independent', :cat) + + # Both should classify the same + test_doc = 'loyal playful pet' + + assert_equal lsi1.classify(test_doc).to_s, lsi2.classify(test_doc).to_s + end + + def test_add_batch_equivalent_to_add + lsi1 = Classifier::LSI.new + lsi2 = Classifier::LSI.new + + dog_docs = ['dogs bark', 'puppies play'] + cat_docs = ['cats meow', 'kittens purr'] + + # Add with hash-style add + lsi1.add(dog: dog_docs, cat: cat_docs) + + # Add with add_batch + lsi2.add_batch(dog: dog_docs, cat: cat_docs) + + # Both should have same items + assert_equal lsi1.items.size, lsi2.items.size + + # Both should classify the same + test_doc = 'barking dog' + + 
assert_equal lsi1.classify(test_doc).to_s, lsi2.classify(test_doc).to_s + end + + # Checkpoint tests + + def test_save_checkpoint + Dir.mktmpdir do |dir| + path = File.join(dir, 'lsi.json') + @lsi.storage = Classifier::Storage::File.new(path: path) + + @lsi.add_item('dogs are loyal', :dog) + @lsi.add_item('cats are independent', :cat) + @lsi.save_checkpoint('50pct') + + checkpoint_path = File.join(dir, 'lsi_checkpoint_50pct.json') + + assert_path_exists checkpoint_path + end + end + + def test_load_checkpoint + Dir.mktmpdir do |dir| + path = File.join(dir, 'lsi.json') + storage = Classifier::Storage::File.new(path: path) + @lsi.storage = storage + + @lsi.add_item('dogs are loyal', :dog) + @lsi.add_item('cats are independent', :cat) + @lsi.save_checkpoint('halfway') + + # Load from checkpoint + loaded = Classifier::LSI.load_checkpoint(storage: storage, checkpoint_id: 'halfway') + + # Should have the same items and same classification + assert_equal @lsi.items.size, loaded.items.size + # Compare as strings since loaded classifier may have string categories + assert_equal @lsi.classify('loyal dog').to_s, loaded.classify('loyal dog').to_s + end + end + + def test_list_checkpoints + Dir.mktmpdir do |dir| + path = File.join(dir, 'lsi.json') + @lsi.storage = Classifier::Storage::File.new(path: path) + + @lsi.add_item('content', :category) + @lsi.save_checkpoint('10pct') + @lsi.save_checkpoint('50pct') + @lsi.save_checkpoint('90pct') + + checkpoints = @lsi.list_checkpoints + + assert_equal %w[10pct 50pct 90pct], checkpoints.sort + end + end + + def test_delete_checkpoint + Dir.mktmpdir do |dir| + path = File.join(dir, 'lsi.json') + @lsi.storage = Classifier::Storage::File.new(path: path) + + @lsi.add_item('content', :category) + @lsi.save_checkpoint('test') + + checkpoint_path = File.join(dir, 'lsi_checkpoint_test.json') + + assert_path_exists checkpoint_path + + @lsi.delete_checkpoint('test') + + refute_path_exists checkpoint_path + end + end + + def 
test_checkpoint_workflow + Dir.mktmpdir do |dir| + path = File.join(dir, 'lsi.json') + storage = Classifier::Storage::File.new(path: path) + + # First training session + lsi1 = Classifier::LSI.new + lsi1.storage = storage + + lsi1.add_item('dogs are loyal', :dog) + lsi1.add_item('puppies are playful', :dog) + lsi1.save_checkpoint('phase1') + + # Simulate restart - load from checkpoint + lsi2 = Classifier::LSI.load_checkpoint(storage: storage, checkpoint_id: 'phase1') + + # Continue training + lsi2.add_item('cats are independent', :cat) + lsi2.add_item('kittens are curious', :cat) + lsi2.save_checkpoint('phase2') + + # Load final checkpoint + final = Classifier::LSI.load_checkpoint(storage: storage, checkpoint_id: 'phase2') + + # Should have all training + assert_equal 4, final.items.size + assert_equal 'dog', final.classify('loyal playful').to_s + assert_equal 'cat', final.classify('independent curious').to_s + end + end +end diff --git a/test/storage/storage_test.rb b/test/storage/storage_test.rb index ddd90b39..b72c246e 100644 --- a/test/storage/storage_test.rb +++ b/test/storage/storage_test.rb @@ -1,17 +1,21 @@ require_relative '../test_helper' class StorageAPIConsistencyTest < Minitest::Test - CLASSIFIERS = [ - Classifier::Bayes, - Classifier::LSI, - Classifier::KNN, - Classifier::LogisticRegression, - Classifier::TFIDF - ].freeze + # Dynamically discover all classifier/vectorizer classes + CLASSIFIERS = Classifier.constants.filter_map do |const| + klass = Classifier.const_get(const) + next unless klass.is_a?(Class) + + klass if klass.method_defined?(:classify) || klass.method_defined?(:transform) + end.freeze INSTANCE_METHODS = %i[save reload reload! dirty? 
storage storage=].freeze CLASS_METHODS = %i[load].freeze + def test_classifiers_discovered + assert_operator CLASSIFIERS.size, :>=, 5, "Expected at least 5 classifiers, found: #{CLASSIFIERS.map(&:name)}" + end + CLASSIFIERS.each do |klass| class_name = klass.name.split('::').last.downcase diff --git a/test/streaming/streaming_test.rb b/test/streaming/streaming_test.rb new file mode 100644 index 00000000..5e4c5c9b --- /dev/null +++ b/test/streaming/streaming_test.rb @@ -0,0 +1,374 @@ +require_relative '../test_helper' +require 'stringio' + +class ProgressTest < Minitest::Test + def test_initialization_with_defaults + progress = Classifier::Streaming::Progress.new + + assert_equal 0, progress.completed + assert_nil progress.total + assert_instance_of Time, progress.start_time + end + + def test_initialization_with_total + progress = Classifier::Streaming::Progress.new(total: 100) + + assert_equal 0, progress.completed + assert_equal 100, progress.total + end + + def test_initialization_with_completed + progress = Classifier::Streaming::Progress.new(total: 100, completed: 50) + + assert_equal 50, progress.completed + assert_equal 100, progress.total + end + + def test_percent_with_known_total + progress = Classifier::Streaming::Progress.new(total: 100, completed: 25) + + assert_in_delta(25.0, progress.percent) + end + + def test_percent_with_fractional_value + progress = Classifier::Streaming::Progress.new(total: 3, completed: 1) + + assert_in_delta(33.33, progress.percent) + end + + def test_percent_with_unknown_total + progress = Classifier::Streaming::Progress.new + progress.completed = 50 + + assert_nil progress.percent + end + + def test_percent_with_zero_total + progress = Classifier::Streaming::Progress.new(total: 0) + + assert_nil progress.percent + end + + def test_elapsed_time + progress = Classifier::Streaming::Progress.new + sleep 0.01 + + assert_operator progress.elapsed, :>=, 0.01 + assert_operator progress.elapsed, :<, 1 + end + + def 
test_rate_calculation + progress = Classifier::Streaming::Progress.new(completed: 0) + sleep 0.01 # Ensure some time passes + progress.completed = 100 + # Rate should be roughly 100 / elapsed (approximately 10000/s) + rate = progress.rate + + assert_predicate rate, :positive? + end + + def test_rate_with_zero_elapsed + # This is tricky to test since time always passes, + # but rate should handle edge cases gracefully + progress = Classifier::Streaming::Progress.new + rate = progress.rate + # Rate is 0 when completed is 0 + assert_in_delta(0.0, rate) + end + + def test_eta_with_known_total + progress = Classifier::Streaming::Progress.new(total: 100, completed: 50) + sleep 0.01 # Ensure some time passes + eta = progress.eta + + assert_instance_of Float, eta + assert_operator eta, :>=, 0 + end + + def test_eta_with_unknown_total + progress = Classifier::Streaming::Progress.new(completed: 50) + + assert_nil progress.eta + end + + def test_eta_when_complete + progress = Classifier::Streaming::Progress.new(total: 100, completed: 100) + sleep 0.01 + + assert_in_delta(0.0, progress.eta) + end + + def test_eta_with_zero_rate + progress = Classifier::Streaming::Progress.new(total: 100, completed: 0) + + assert_nil progress.eta + end + + def test_complete_when_finished + progress = Classifier::Streaming::Progress.new(total: 100, completed: 100) + + assert_predicate progress, :complete? + end + + def test_complete_when_not_finished + progress = Classifier::Streaming::Progress.new(total: 100, completed: 50) + + refute_predicate progress, :complete? + end + + def test_complete_with_unknown_total + progress = Classifier::Streaming::Progress.new(completed: 100) + + refute_predicate progress, :complete? 
+ end + + def test_to_h + progress = Classifier::Streaming::Progress.new(total: 100, completed: 50) + hash = progress.to_h + + assert_equal 50, hash[:completed] + assert_equal 100, hash[:total] + assert_in_delta(50.0, hash[:percent]) + assert_instance_of Float, hash[:elapsed] + assert_instance_of Float, hash[:rate] + end + + def test_completed_is_mutable + progress = Classifier::Streaming::Progress.new(total: 100) + + assert_equal 0, progress.completed + + progress.completed = 25 + + assert_equal 25, progress.completed + + progress.completed += 25 + + assert_equal 50, progress.completed + end + + def test_current_batch_tracking + progress = Classifier::Streaming::Progress.new(total: 100) + + assert_equal 0, progress.current_batch + + progress.current_batch = 5 + + assert_equal 5, progress.current_batch + end +end + +class LineReaderTest < Minitest::Test + def test_each_line + io = StringIO.new("line1\nline2\nline3\n") + reader = Classifier::Streaming::LineReader.new(io) + + lines = reader.each.to_a + + assert_equal %w[line1 line2 line3], lines + end + + def test_each_line_removes_trailing_newlines + io = StringIO.new("line1\nline2\r\nline3") + reader = Classifier::Streaming::LineReader.new(io) + + lines = reader.each.to_a + # chomp removes both \n and \r\n + assert_equal %w[line1 line2 line3], lines + end + + def test_each_with_block + io = StringIO.new("a\nb\nc\n") + reader = Classifier::Streaming::LineReader.new(io) + + collected = reader.map(&:upcase) + + assert_equal %w[A B C], collected + end + + def test_enumerable_methods + io = StringIO.new("1\n2\n3\n") + reader = Classifier::Streaming::LineReader.new(io) + + # LineReader includes Enumerable + assert_equal %w[1 2 3], reader.first(3) + end + + def test_each_batch_default_size + lines = (1..250).map { |i| "line#{i}" }.join("\n") + io = StringIO.new(lines) + reader = Classifier::Streaming::LineReader.new(io) + + batches = reader.each_batch.to_a + + assert_equal 3, batches.size + assert_equal 100, 
batches[0].size + assert_equal 100, batches[1].size + assert_equal 50, batches[2].size + end + + def test_each_batch_custom_size + lines = (1..25).map { |i| "line#{i}" }.join("\n") + io = StringIO.new(lines) + reader = Classifier::Streaming::LineReader.new(io, batch_size: 10) + + batches = reader.each_batch.to_a + + assert_equal 3, batches.size + assert_equal 10, batches[0].size + assert_equal 10, batches[1].size + assert_equal 5, batches[2].size + end + + def test_each_batch_with_block + io = StringIO.new("a\nb\nc\nd\ne\n") + reader = Classifier::Streaming::LineReader.new(io, batch_size: 2) + + collected = [] + reader.each_batch { |batch| collected << batch } + + assert_equal [%w[a b], %w[c d], ['e']], collected + end + + def test_each_batch_exact_multiple + lines = (1..10).map { |i| "line#{i}" }.join("\n") + io = StringIO.new(lines) + reader = Classifier::Streaming::LineReader.new(io, batch_size: 5) + + batches = reader.each_batch.to_a + + assert_equal 2, batches.size + assert_equal 5, batches[0].size + assert_equal 5, batches[1].size + end + + def test_each_batch_empty_io + io = StringIO.new('') + reader = Classifier::Streaming::LineReader.new(io) + + batches = reader.each_batch.to_a + + assert_empty batches + end + + def test_batch_size_reader + reader = Classifier::Streaming::LineReader.new(StringIO.new(''), batch_size: 50) + + assert_equal 50, reader.batch_size + end + + def test_estimate_line_count_with_seekable_io + # Create a file-like StringIO + lines = "short\nmedium line\nlonger line here\n" + io = StringIO.new(lines) + + reader = Classifier::Streaming::LineReader.new(io) + estimate = reader.estimate_line_count(sample_size: 3) + + # Should be close to 3 lines + assert_instance_of Integer, estimate + assert_predicate estimate, :positive? 
+ end + + def test_estimate_line_count_preserves_position + lines = "a\nb\nc\nd\ne\n" + io = StringIO.new(lines) + io.gets # Read one line, position is now after "a\n" + original_pos = io.pos + + reader = Classifier::Streaming::LineReader.new(io) + reader.estimate_line_count + + assert_equal original_pos, io.pos + end +end + +class StreamingModuleTest < Minitest::Test + def test_default_batch_size + assert_equal 100, Classifier::Streaming::DEFAULT_BATCH_SIZE + end + + class DummyClassifier + include Classifier::Streaming + + attr_accessor :storage + end + + def test_train_from_stream_raises_not_implemented + classifier = DummyClassifier.new + io = StringIO.new("test\n") + + assert_raises(NotImplementedError) do + classifier.train_from_stream(:category, io) + end + end + + def test_train_batch_raises_not_implemented + classifier = DummyClassifier.new + + assert_raises(NotImplementedError) do + classifier.train_batch(:category, %w[doc1 doc2]) + end + end + + def test_save_checkpoint_requires_storage + classifier = DummyClassifier.new + + assert_raises(ArgumentError) do + classifier.save_checkpoint('test') + end + end + + def test_list_checkpoints_requires_storage + classifier = DummyClassifier.new + + assert_raises(ArgumentError) do + classifier.list_checkpoints + end + end + + def test_delete_checkpoint_requires_storage + classifier = DummyClassifier.new + + assert_raises(ArgumentError) do + classifier.delete_checkpoint('test') + end + end +end + +class StreamingEnforcementTest < Minitest::Test + # Dynamically discover all classifier/vectorizer classes by looking for + # classes that define `classify` (classifiers) or `transform` (vectorizers) + CLASSIFIERS = Classifier.constants.filter_map do |const| + klass = Classifier.const_get(const) + next unless klass.is_a?(Class) + + klass if klass.method_defined?(:classify) || klass.method_defined?(:transform) + end.freeze + + STREAMING_METHODS = %i[ + train_from_stream + train_batch + save_checkpoint + list_checkpoints + 
delete_checkpoint + ].freeze + + def test_classifiers_discovered + assert_operator CLASSIFIERS.size, :>=, 5, "Expected at least 5 classifiers, found: #{CLASSIFIERS.map(&:name)}" + end + + CLASSIFIERS.each do |klass| + define_method("test_#{klass.name.split('::').last.downcase}_includes_streaming") do + assert_includes klass, Classifier::Streaming, + "#{klass} must include Classifier::Streaming" + end + + STREAMING_METHODS.each do |method| + define_method("test_#{klass.name.split('::').last.downcase}_responds_to_#{method}") do + assert klass.method_defined?(method) || klass.private_method_defined?(method), + "#{klass} must respond to #{method}" + end + end + end +end