
feat: add pluggable persistence backends with storage API#93

Merged
cardmagic merged 5 commits into master from
feature/pluggable-persistence
Dec 28, 2025

Conversation

@cardmagic
Owner

Summary

  • Add a pluggable storage API for classifier persistence with file and memory backends
  • Introduce dirty? tracking to detect unsaved changes
  • Add reload / reload! methods to refresh classifiers from storage
  • Rename save/load to save_to_file/load_from_file (legacy API preserved)
  • Add C extension code coverage with lcov in CI

Storage API

Classifiers can now be configured with a storage backend:

# File storage
storage = Classifier::Storage::File.new(path: 'classifier.json')
classifier = Classifier::Bayes.new(:spam, :ham)
classifier.storage = storage

classifier.train_spam('bad content')
classifier.save  # writes to configured storage

# Load from storage
loaded = Classifier::Bayes.load(storage: storage)
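
The file backend's behavior can be sketched standalone. This is a minimal illustration of the write/read/delete/exists? protocol that Storage::File implements, with the Storage::Base superclass omitted so the snippet runs on its own; the class name and temp path here are hypothetical:

```ruby
require 'tmpdir'

# Minimal stand-in for the PR's Storage::File (superclass omitted for brevity)
class FileStorage
  def initialize(path:)
    @path = path
  end

  def write(data) = ::File.write(@path, data)
  def read = exists? ? ::File.read(@path) : nil

  def delete
    ::File.delete(@path) if exists?
  end

  def exists? = ::File.exist?(@path)
end

path = ::File.join(Dir.tmpdir, 'classifier_demo.json')
storage = FileStorage.new(path: path)
storage.write('{"categories":{}}')
puts storage.read      # {"categories":{}}
storage.delete
puts storage.exists?   # false
```

Note that read returns nil rather than raising when no data has been written, which lets callers distinguish "never saved" from an empty payload.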

Custom Backends

Implement Classifier::Storage::Base to create custom storage (Redis, S3, PostgreSQL, etc.):

class RedisStorage < Classifier::Storage::Base
  def initialize(redis:, key:)
    @redis, @key = redis, key
  end

  def write(data) = @redis.set(@key, data)
  def read = @redis.get(@key)
  def delete = @redis.del(@key)
  def exists? = @redis.exists?(@key)
end
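
A backend like this can be exercised without a real Redis server. The sketch below substitutes a Hash-backed FakeRedis (hypothetical, for illustration only) and drops the Storage::Base superclass so it runs standalone:

```ruby
# Hash-backed stand-in for a Redis client (illustration only)
class FakeRedis
  def initialize
    @store = {}
  end

  def set(key, value)
    @store[key] = value
  end

  def get(key) = @store[key]
  def del(key) = @store.delete(key)
  def exists?(key) = @store.key?(key)
end

# Same shape as the custom backend above, superclass omitted
class RedisStorage
  def initialize(redis:, key:)
    @redis, @key = redis, key
  end

  def write(data) = @redis.set(@key, data)
  def read = @redis.get(@key)
  def delete = @redis.del(@key)
  def exists? = @redis.exists?(@key)
end

storage = RedisStorage.new(redis: FakeRedis.new, key: 'classifier')
storage.write('{"categories":{}}')
puts storage.exists?   # true
puts storage.read      # {"categories":{}}
```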

Dirty Tracking

classifier.dirty?  # => false
classifier.train_spam('content')
classifier.dirty?  # => true
classifier.save
classifier.dirty?  # => false

Reload

classifier.reload   # raises UnsavedChangesError if dirty
classifier.reload!  # force reload, discards changes
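
The interplay of dirty tracking, save, and the two reload variants can be sketched in a few lines. This is a toy model of the pattern described above, not the gem's implementation; MiniClassifier and its Hash-backed storage are hypothetical names:

```ruby
class UnsavedChangesError < StandardError; end

class MiniClassifier
  def initialize(storage)
    @storage = storage   # any Hash-like object works as a memory backend
    @data = []
    @dirty = false
  end

  def dirty? = @dirty

  def train(item)
    @data << item
    @dirty = true
  end

  def save
    @storage[:data] = @data.dup
    @dirty = false
  end

  # Refuses to clobber unsaved work, mirroring the PR's reload semantics
  def reload
    raise UnsavedChangesError, 'unsaved changes; use reload! to discard' if dirty?
    reload!
  end

  def reload!
    @data = (@storage[:data] || []).dup
    @dirty = false
  end
end

c = MiniClassifier.new({})
c.train('spam')
puts c.dirty?   # true
c.save
puts c.dirty?   # false
c.train('more')
begin
  c.reload      # raises UnsavedChangesError
rescue UnsavedChangesError
  c.reload!     # force reload discards 'more'
end
puts c.dirty?   # false
```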

Test plan

  • Storage backends (file, memory) work correctly
  • Dirty tracking updates on train/untrain/add_item/remove_item
  • Reload raises on unsaved changes, reload! discards them
  • Legacy save_to_file/load_from_file API still works
  • All existing tests pass
  • C coverage reports generate in CI

Add gcov/lcov-based coverage for the native C extension. Coverage is
compiled with -O0 --coverage flags and generates HTML reports showing
line and function coverage for vector.c, matrix.c, svd.c, and
classifier_ext.c.

New rake tasks:
- coverage:clean - remove gcov data files
- coverage:compile - build with coverage instrumentation
- coverage:report - generate HTML report via lcov
- coverage:c - full cycle (clean, compile, test, report)

CI uploads coverage reports as artifacts on each run.

Implement ActiveRecord-style storage abstraction allowing classifiers
to persist to any backend. Users can configure storage at setup time
and use simple save/reload methods without path arguments.

Key additions:
- Storage::Base abstract class defining write/read/delete/exists? protocol
- Storage::Memory for testing and ephemeral use
- Storage::File for file-based persistence
- Dirty tracking with dirty? method on Bayes and LSI
- reload raises UnsavedChangesError, reload! forces discard
- load(storage:) class method sets storage on returned instance

Legacy API preserved:
- save_to_file(path) replaces old save(path)
- load_from_file(path) replaces old load(path)

Closes #90
@cardmagic cardmagic self-assigned this Dec 28, 2025
@greptile-apps
Contributor

greptile-apps bot commented Dec 28, 2025

Greptile Summary

  • Adds a pluggable storage API for classifier persistence with file and memory backends, allowing custom storage implementations (Redis, S3, databases) through the Classifier::Storage::Base interface
  • Introduces dirty tracking functionality to detect unsaved changes and prevent data loss during reload operations using UnsavedChangesError
  • Refactors the save/load API to support storage backends while preserving backward compatibility through renamed save_to_file/load_from_file methods

Important Files Changed

Filename: Overview
lib/classifier/lsi.rb: Contains critical race condition bug in remove_item method where build_index is called outside synchronize block
lib/classifier/bayes.rb: Integrates storage API with dirty tracking for Bayes classifier; well-implemented storage abstraction
lib/classifier/storage/base.rb: Contains redundant comments and type annotations that restate method signatures, violating de-AI principles
Rakefile: Complex C coverage implementation with extensive error handling and path manipulation; highly intricate

Confidence score: 3/5

  • This PR contains a critical race condition in LSI's remove_item method that could cause index corruption in concurrent environments
  • Score lowered due to race condition bug, excessive "AI slop" comments in storage/base.rb, and complex coverage implementation
  • Pay close attention to lib/classifier/lsi.rb lines 173-183 where build_index is called outside the synchronize block

De-AI Code Simplifications:

Remove from lib/classifier/storage/base.rb:

# Before (lines 25-46)
  # Save classifier data
  # @rbs (String) -> void
  def write(data)
    raise NotImplementedError, "#{self.class}#write must be implemented"
  end

  # Load classifier data
  # @rbs () -> String?
  def read
    raise NotImplementedError, "#{self.class}#read must be implemented"
  end

  # Delete classifier data
  # @rbs () -> void
  def delete
    raise NotImplementedError, "#{self.class}#delete must be implemented"
  end

  # Check if data exists
  # @rbs () -> bool
  def exists?
    raise NotImplementedError, "#{self.class}#exists? must be implemented"
  end

# After - remove redundant comments, keep @rbs for type checking
  # @rbs (String) -> void
  def write(data)
    raise NotImplementedError, "#{self.class}#write must be implemented"
  end

  # @rbs () -> String?
  def read
    raise NotImplementedError, "#{self.class}#read must be implemented"
  end

  # @rbs () -> void
  def delete
    raise NotImplementedError, "#{self.class}#delete must be implemented"
  end

  # @rbs () -> bool
  def exists?
    raise NotImplementedError, "#{self.class}#exists? must be implemented"
  end

Remove from lib/classifier/storage/file.rb:

# Before (lines 31-48)
  # Save data to file
  # @rbs (String) -> Integer
  def write(data)
    ::File.write(@path, data)
  end

  # Load data from file
  # @rbs () -> String?
  def read
    exists? ? ::File.read(@path) : nil
  end

  # Delete file
  # @rbs () -> void
  def delete
    ::File.delete(@path) if exists?
  end

  # Check if file exists
  # @rbs () -> bool
  def exists?
    ::File.exist?(@path)
  end

# After - remove obvious comments
  # @rbs (String) -> Integer
  def write(data)
    ::File.write(@path, data)
  end

  # @rbs () -> String?
  def read
    exists? ? ::File.read(@path) : nil
  end

  # @rbs () -> void
  def delete
    ::File.delete(@path) if exists?
  end

  # @rbs () -> bool
  def exists?
    ::File.exist?(@path)
  end

Critical Race Condition Fix needed in lib/classifier/lsi.rb:

# Before (lines 173-183) - BROKEN
def remove_item(item)
  synchronize do
    return false unless @items.include?(item)
    @items.delete(item)
    @dirty = true
  end
  build_index if auto_rebuild
  true
end

# After - FIX: move build_index inside sync block
def remove_item(item)
  synchronize do
    return false unless @items.include?(item)
    @items.delete(item)
    @dirty = true
    build_index if auto_rebuild
  end
  true
end
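
Why the fix matters can be shown with a toy index that rebuilds from its item set under a Monitor (standing in for the class's synchronize). Because every mutation and its rebuild share one critical section, the rebuilt index always reflects a consistent snapshot. This is an illustrative sketch, not the gem's LSI class:

```ruby
require 'monitor'

class ToyIndex
  include MonitorMixin

  def initialize
    super()
    @items = {}
    @index = []
  end

  def add_item(item)
    synchronize do
      @items[item] = true
      build_index
    end
  end

  def remove_item(item)
    synchronize do
      return false unless @items.key?(item)
      @items.delete(item)
      build_index   # inside the lock: rebuild sees a consistent @items
    end
    true
  end

  def index_size = synchronize { @index.size }

  private

  def build_index
    @index = @items.keys
  end
end

idx = ToyIndex.new
threads = 100.times.map do |i|
  Thread.new { idx.add_item(i); idx.remove_item(i) if i.even? }
end
threads.each(&:join)
puts idx.index_size   # 50 (the odd keys; evens were removed)
```

Had build_index been called after the synchronize block, a concurrent add could interleave between the delete and the rebuild, leaving the index out of step with the item set.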

Sequence Diagram

sequenceDiagram
    participant User
    participant Storage as "Storage::File"
    participant Classifier as "Classifier::Bayes"
    participant FileSystem as "File System"

    User->>Storage: "new(path: 'model.json')"
    Storage-->>User: "storage instance"
    
    User->>Classifier: "new('Spam', 'Ham')"
    Classifier-->>User: "classifier instance"
    
    User->>Classifier: "storage = storage"
    
    User->>Classifier: "train_spam('buy now')"
    Classifier->>Classifier: "set @dirty = true"
    
    User->>Classifier: "train_ham('hello friend')"
    Classifier->>Classifier: "set @dirty = true"
    
    User->>Classifier: "save()"
    Classifier->>Classifier: "to_json()"
    Classifier->>Storage: "write(json_data)"
    Storage->>FileSystem: "File.write(path, data)"
    FileSystem-->>Storage: "bytes_written"
    Storage-->>Classifier: "bytes_written"
    Classifier->>Classifier: "set @dirty = false"
    Classifier-->>User: "void"
    
    User->>Classifier: "load(storage: storage)"
    Classifier->>Storage: "read()"
    Storage->>FileSystem: "File.read(path)"
    FileSystem-->>Storage: "json_data"
    Storage-->>Classifier: "json_data"
    Classifier->>Classifier: "from_json(json_data)"
    Classifier->>Classifier: "storage = storage"
    Classifier-->>User: "loaded_classifier"

Contributor

@greptile-apps greptile-apps bot left a comment


15 files reviewed, 30 comments


@cardmagic
Owner Author

The @rbs comments are type annotations used by rbs-inline to generate RBS type signatures for static type checking. They are not documentation comments and should not be removed.

This project uses # rbs_inline: enabled at the top of files to enable automatic RBS generation from these inline annotations. See sig/generated/ for the generated type signatures.

@cardmagic
Copy link
Owner Author

@greptile review

Contributor

@greptile-apps greptile-apps bot left a comment


Additional Comments (1)

  1. lib/classifier/lsi.rb, lines 173-181

    logic: remove_item doesn't rebuild the index when @auto_rebuild is true, but add_item does (line 146). This inconsistency means removing items won't update the index automatically, causing stale results.

15 files reviewed, 1 comment


@cardmagic
Copy link
Owner Author

@greptile review

Contributor

@greptile-apps greptile-apps bot left a comment


15 files reviewed, 1 comment


Comment on lines 173 to 183
def remove_item(item)
  removed = synchronize do
    next false unless @items.key?(item)

    @items.delete(item)
    @version += 1
    @dirty = true
    true
  end
  build_index if removed && @auto_rebuild
end
Contributor


logic: Critical race condition: build_index call occurs outside synchronize block. If another thread modifies @items between lines 181 and 182, the index rebuild may use inconsistent state.


@cardmagic
Owner Author

All review comments have been addressed:

  1. remove_item auto_rebuild inconsistency - Fixed in commit a342569. remove_item now calls build_index when @auto_rebuild is true, matching add_item's behavior.

  2. @rbs type annotations - These are intentional type annotations used by rbs-inline for generating RBS type signatures. They should not be removed.

  3. Rakefile lint issues - Fixed in commit a342569 with rubocop:disable comments for the coverage namespace blocks.

All CI checks are passing.

@cardmagic cardmagic merged commit 0ec3a49 into master Dec 28, 2025
7 checks passed
@cardmagic cardmagic deleted the feature/pluggable-persistence branch December 28, 2025 18:49