Commit acbc053

Add zero-dependency native C extension for LSI acceleration (#89)
* feat: add zero-dependency native C extension for LSI acceleration

  Replace the rb-gsl dependency with a self-contained C extension that implements Vector, Matrix, and Jacobi SVD operations. This eliminates the need for users to install external libraries while providing significant performance improvements.

  The native extension provides 5-50x speedup over pure Ruby, with the SVD-heavy build_index operation showing up to 384x improvement on larger document sets. The implementation ports the existing Ruby Jacobi SVD algorithm to C, ensuring consistent results.

  Key changes:
  - Add ext/classifier/ with ~850 lines of C code
  - Implement Classifier::Linalg::Vector and Matrix classes
  - Port Jacobi SVD from Ruby to C
  - Auto-detect backend: native extension > pure Ruby fallback
  - Remove GSL-related code and dependencies
  - Update benchmarks to compare native C vs pure Ruby

  Closes #87

* chore: add native extension build artifacts to gitignore

* style: fix RuboCop offenses in tests and config files

  Apply RuboCop autocorrections and add necessary inline disables:
  - Use %i symbol array syntax in Rakefile
  - Add empty lines before assertion methods per Minitest style
  - Convert float assert_equal to assert_in_delta for precision
  - Disable Style/GlobalVars for $CFLAGS (required for mkmf)
  - Disable Style/MapIntoArray in test intentionally testing each

* refactor: address PR review feedback

  - Use SVD_CONVERGENCE_THRESHOLD constant instead of magic number
  - Clarify comment about undef_method vs override behavior
1 parent 2fb18f5 commit acbc053
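
The commit message mentions a Jacobi SVD that was ported from Ruby to C. As a rough illustration of the idea only, here is a minimal sketch of a one-sided (Hestenes) Jacobi iteration that returns singular values. The method name `jacobi_singular_values` and its parameters are hypothetical, and the gem's actual implementation may use a different Jacobi variant and also returns the U and V factors.

```ruby
# Hypothetical helper for illustration; not part of the classifier gem.
# One-sided (Hestenes) Jacobi: rotate pairs of columns until all columns are
# mutually orthogonal. The column norms are then the singular values.
def jacobi_singular_values(rows, tolerance: 1e-12, max_sweeps: 50)
  a = rows.map { |row| row.map(&:to_f) } # work on a float copy
  n = a.first.size

  max_sweeps.times do
    rotated = false
    (0...n - 1).each do |p|
      (p + 1...n).each do |q|
        alpha = a.sum { |row| row[p] * row[p] } # squared norm of column p
        beta  = a.sum { |row| row[q] * row[q] } # squared norm of column q
        gamma = a.sum { |row| row[p] * row[q] } # inner product of p and q
        # Skip pairs that are already orthogonal to within the tolerance.
        next if gamma.abs <= tolerance * Math.sqrt(alpha * beta)

        rotated = true
        zeta = (beta - alpha) / (2.0 * gamma)
        t = (zeta >= 0 ? 1.0 : -1.0) / (zeta.abs + Math.sqrt(1.0 + zeta * zeta))
        c = 1.0 / Math.sqrt(1.0 + t * t)
        s = c * t
        # Apply the plane rotation to columns p and q.
        a.each do |row|
          x = row[p]
          y = row[q]
          row[p] = c * x - s * y
          row[q] = s * x + c * y
        end
      end
    end
    break unless rotated # converged: a full sweep needed no rotation
  end

  (0...n).map { |j| Math.sqrt(a.sum { |row| row[j] * row[j] }) }.sort.reverse
end

p jacobi_singular_values([[3, 0], [0, 4]]) # => [4.0, 3.0]
```

The convergence test compares each column inner product against a relative tolerance, which plays a role similar to the `SVD_CONVERGENCE_THRESHOLD` constant named in the commit message, although the value used here is an assumption.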

19 files changed: +1401 -174 lines changed

.gitignore

Lines changed: 9 additions & 0 deletions
@@ -5,6 +5,15 @@ coverage/
 *.gem
 pkg/
 
+# Native extension build artifacts
+*.bundle
+*.so
+*.o
+tmp/
+lib/classifier/classifier_ext.*
+Makefile
+mkmf.log
+
 # IDE/editor files
 .idea/
 *.swp

CLAUDE.md

Lines changed: 19 additions & 9 deletions
@@ -11,16 +11,23 @@ Ruby gem providing text classification via two algorithms:
 ## Common Commands
 
 ```bash
-# Run all tests
+# Compile native C extension
+rake compile
+
+# Run all tests (compiles first)
 rake test
 
 # Run a single test file
 ruby -Ilib test/bayes/bayesian_test.rb
 ruby -Ilib test/lsi/lsi_test.rb
 
-# Run tests with native Ruby vector (without GSL)
+# Run tests with pure Ruby (no native extension)
 NATIVE_VECTOR=true rake test
 
+# Run benchmarks
+rake benchmark
+rake benchmark:compare
+
 # Interactive console
 rake console
 
@@ -39,7 +46,7 @@ rake doc
 
 **LSI Classifier** (`lib/classifier/lsi.rb`)
 - Uses Singular Value Decomposition (SVD) for semantic analysis
-- Optional GSL gem for 10x faster matrix operations; falls back to pure Ruby SVD
+- Native C extension for 5-50x faster matrix operations; falls back to pure Ruby
 - Key operations: `add_item`, `classify`, `find_related`, `search`
 - `auto_rebuild` option controls automatic index rebuilding after changes
 
@@ -49,15 +56,18 @@ rake doc
 - Uses `fast-stemmer` gem for Porter stemming
 
 **Vector Extensions** (`lib/classifier/extensions/vector.rb`)
-- Pure Ruby SVD implementation (`Matrix#SV_decomp`)
+- Pure Ruby SVD implementation (`Matrix#SV_decomp`) - used as fallback
 - Vector normalization and magnitude calculations
 
-### GSL Integration
+### Native C Extension (`ext/classifier/`)
+
+LSI uses a native C extension for fast linear algebra operations:
+- `Classifier::Linalg::Vector` - Vector operations (alloc, normalize, dot product)
+- `Classifier::Linalg::Matrix` - Matrix operations (alloc, transpose, multiply)
+- Jacobi SVD implementation for singular value decomposition
 
-LSI checks for the `gsl` gem at load time. When available:
-- Uses `GSL::Matrix` and `GSL::Vector` for faster operations
-- Serialization handled via `vector_serialize.rb`
-- Test without GSL: `NATIVE_VECTOR=true rake test`
+Check current backend: `Classifier::LSI.backend` returns `:native` or `:ruby`
+Force pure Ruby: `NATIVE_VECTOR=true rake test`
 
 ### Content Nodes (`lib/classifier/lsi/content_node.rb`)
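
The CLAUDE.md changes above describe a backend that is `:native` when the compiled extension loads and `:ruby` otherwise, with `NATIVE_VECTOR=true` forcing the fallback. A minimal sketch of how such a selection could work is shown below. The module name `BackendSketch`, the `detect` method, the warning text, and the require path are all assumptions for illustration. The path is only inferred from the `lib/classifier/classifier_ext.*` entry in `.gitignore`, and the gem's real loading code is not shown in this excerpt.

```ruby
# Hypothetical sketch of backend selection; names are assumptions, not the
# gem's actual code.
module BackendSketch
  # Assumed require path of the compiled extension.
  EXTENSION_PATH = 'classifier/classifier_ext'

  # Returns :native if the extension loads, :ruby otherwise.
  # `env` accepts ENV or any Hash-like object, which makes it easy to test.
  def self.detect(env = ENV)
    # NATIVE_VECTOR=true forces the pure Ruby implementation.
    return :ruby if env['NATIVE_VECTOR'] == 'true'

    require EXTENSION_PATH
    :native
  rescue LoadError
    # SUPPRESS_LSI_WARNING=true silences the fallback notice.
    unless env['SUPPRESS_LSI_WARNING'] == 'true'
      warn 'Native extension not found; falling back to pure Ruby.'
    end
    :ruby
  end
end

puts BackendSketch.detect('NATIVE_VECTOR' => 'true')
```

Passing the environment as an argument keeps the sketch free of global state, so both the forced and the fallback paths can be exercised without changing the process environment.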

Gemfile.lock

Lines changed: 4 additions & 1 deletion
@@ -59,6 +59,8 @@ GEM
     racc (1.8.1)
     rainbow (3.1.1)
     rake (13.3.1)
+    rake-compiler (1.3.1)
+      rake
     rb-fsevent (0.11.2)
     rb-inotify (0.11.1)
       ffi (~> 1.0)
@@ -137,6 +139,7 @@ DEPENDENCIES
   minitest
   mutex_m
   ostruct
+  rake-compiler
   rbs-inline
   rdoc
   rubocop
@@ -145,4 +148,4 @@ DEPENDENCIES
   steep
 
 BUNDLED WITH
-   2.4.17
+   4.0.3

README.md

Lines changed: 27 additions & 45 deletions
@@ -36,47 +36,27 @@ Or install directly:
 gem install classifier
 ```
 
-### Optional: GSL for Faster LSI
+### Native C Extension
 
-For significantly faster LSI operations, install the [GNU Scientific Library](https://www.gnu.org/software/gsl/).
+The gem includes a native C extension for fast LSI operations. It compiles automatically during gem installation. No external dependencies are required.
 
-<details>
-<summary><strong>Ruby 3+</strong></summary>
-
-The released `gsl` gem doesn't support Ruby 3+. Install from source:
+To verify the native extension is active:
 
-```bash
-# Install GSL library
-brew install gsl # macOS
-apt-get install libgsl-dev # Ubuntu/Debian
-
-# Build and install the gem
-git clone https://github.com/cardmagic/rb-gsl.git
-cd rb-gsl
-git checkout fix/ruby-3.4-compatibility
-gem build gsl.gemspec
-gem install gsl-*.gem
+```ruby
+require 'classifier'
+puts Classifier::LSI.backend # => :native
 ```
-</details>
 
-<details>
-<summary><strong>Ruby 2.x</strong></summary>
+To force pure Ruby mode (for debugging):
 
 ```bash
-# macOS
-brew install gsl
-gem install gsl
-
-# Ubuntu/Debian
-apt-get install libgsl-dev
-gem install gsl
+NATIVE_VECTOR=true ruby your_script.rb
 ```
-</details>
 
-When GSL is installed, Classifier automatically uses it. To suppress the GSL notice:
+To suppress the warning when native extension isn't available:
 
 ```bash
-SUPPRESS_GSL_WARNING=true ruby your_script.rb
+SUPPRESS_LSI_WARNING=true ruby your_script.rb
 ```
 
 ### Compatibility
@@ -181,36 +161,37 @@ lsi.search "programming", 3
 
 ## Performance
 
-### GSL vs Native Ruby
+### Native C Extension vs Pure Ruby
 
-GSL provides dramatic speedups for LSI operations, especially `build_index` (SVD computation):
+The native C extension provides dramatic speedups for LSI operations, especially `build_index` (SVD computation):
 
 | Documents | build_index | Overall |
 |-----------|-------------|---------|
-| 5 | 4x faster | 2.5x |
-| 10 | 24x faster | 5.5x |
-| 15 | 116x faster | 17x |
+| 5 | 7x faster | 2.6x |
+| 10 | 25x faster | 4.6x |
+| 15 | 112x faster | 14.5x |
+| 20 | 385x faster | 48.7x |
 
 <details>
-<summary>Detailed benchmark (15 documents)</summary>
+<summary>Detailed benchmark (20 documents)</summary>
 
 ```
-Operation             Native          GSL    Speedup
+Operation          Pure Ruby     Native C    Speedup
 ----------------------------------------------------------
-build_index           0.1412       0.0012     116.2x
-classify              0.0142       0.0049       2.9x
-search                0.0102       0.0026       3.9x
-find_related          0.0069       0.0016       4.2x
+build_index           0.5540       0.0014     384.5x
+classify              0.0190       0.0060       3.2x
+search                0.0145       0.0037       3.9x
+find_related          0.0098       0.0011       8.6x
 ----------------------------------------------------------
-TOTAL                 0.1725       0.0104      16.6x
+TOTAL                 0.5973       0.0123      48.7x
 ```
 </details>
 
 ### Running Benchmarks
 
 ```bash
 rake benchmark # Run with current configuration
-rake benchmark:compare # Compare GSL vs native Ruby
+rake benchmark:compare # Compare native C vs pure Ruby
 ```
 
 ## Development
@@ -221,15 +202,16 @@ rake benchmark:compare # Compare GSL vs native Ruby
 git clone https://github.com/cardmagic/classifier.git
 cd classifier
 bundle install
+rake compile # Compile native C extension
 ```
 
 ### Running Tests
 
 ```bash
-rake test # Run all tests
+rake test # Run all tests (compiles first)
 ruby -Ilib test/bayes/bayesian_test.rb # Run specific test file
 
-# Test without GSL (pure Ruby)
+# Test with pure Ruby (no native extension)
 NATIVE_VECTOR=true rake test
 ```
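
The speedup column in the 20-document benchmark is the pure Ruby time divided by the native C time. The sketch below recomputes those ratios from the four-decimal timings printed in the README table. The constant and method names are hypothetical, and the table does not state its time unit. Ratios recomputed from rounded timings will not match the printed column exactly (for example, `build_index` comes out near 396x rather than 384.5x), presumably because the benchmark divided unrounded measurements.

```ruby
# Timings copied from the 20-document benchmark table: [pure Ruby, native C].
# The unit is not stated in the source.
TIMINGS = {
  'build_index'  => [0.5540, 0.0014],
  'classify'     => [0.0190, 0.0060],
  'search'       => [0.0145, 0.0037],
  'find_related' => [0.0098, 0.0011]
}.freeze

# Speedup is simply the ratio of the two timings.
def speedup(ruby_time, native_time)
  ruby_time / native_time
end

TIMINGS.each do |operation, (ruby_time, native_time)|
  puts format('%-12s %6.1fx', operation, speedup(ruby_time, native_time))
end

# The TOTAL row is the ratio of the column sums, not an average of the ratios.
total_ruby   = TIMINGS.values.sum { |ruby_time, _| ruby_time }
total_native = TIMINGS.values.sum { |_, native_time| native_time }
puts format('%-12s %6.1fx', 'TOTAL', speedup(total_ruby, total_native))
```

Because `build_index` dominates the pure Ruby total, the overall figure tracks the SVD speedup far more closely than the per-query operations.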

Rakefile

Lines changed: 13 additions & 1 deletion
@@ -2,8 +2,20 @@ require 'rake'
 require 'rake/testtask'
 require 'rdoc/task'
 
+# Try to load rake-compiler for native extension support
+begin
+  require 'rake/extensiontask'
+  Rake::ExtensionTask.new('classifier_ext') do |ext|
+    ext.lib_dir = 'lib/classifier'
+    ext.ext_dir = 'ext/classifier'
+  end
+  HAVE_EXTENSION = true
+rescue LoadError
+  HAVE_EXTENSION = false
+end
+
 desc 'Default Task'
-task default: [:test]
+task default: HAVE_EXTENSION ? %i[compile test] : [:test]
 
 # Run the unit tests
 desc 'Run all unit tests'
