docs(ruvllm): add TurboQuant KV-cache compression to crate README
- Add TurboQuant to key features table (6-8x memory reduction)
- Add v2.5 section with TurboQuant, embedding store, H2O/PyramidKV eviction
- Add full TurboQuant usage section with code examples and compression table
- Update crate version references from 2.0 to 2.1 and bump the What's New section from v2.3 to v2.5
Co-Authored-By: claude-flow <ruv@ruv.net>
crates/ruvllm/README.md (64 additions, 7 deletions)
@@ -32,6 +32,7 @@ RuvLLM loads GGUF models and runs them on your hardware with full acceleration -
 |---------|-------------|----------------|
 |**SONA three-tier learning**| Adapts to your queries at three speeds: instant (<1 ms), background (~100 ms), deep (minutes) | Responses improve automatically without manual retraining |
 |**Metal + CUDA + ANE**| Hardware-accelerated inference across Apple Silicon, NVIDIA GPUs, and Apple Neural Engine | Get the most out of whatever hardware you have |
+|**TurboQuant KV-Cache**| 2-4 bit asymmetric per-channel quantization with H2O/PyramidKV eviction | 6-8x memory reduction, <0.5% quality loss |
 |**Flash Attention 2**| Memory-efficient attention with O(N) complexity and online softmax | Longer contexts with less memory |
 |**GGUF memory mapping**| Memory-mapped model loading with quantization (Q4K, Q8, FP16) | Load large models fast, use 4-8x less RAM |
 |**Speculative decoding**| Draft model generates candidates, target model verifies in parallel | 2-3x faster text generation |
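For context on the TurboQuant row added above, 2-4 bit asymmetric per-channel quantization can be sketched as below. This is an illustrative sketch only: `quantize_channel`, `dequantize`, and the (codes, scale, min) layout are assumptions for exposition, not ruvllm's actual internals.

```rust
// Sketch of k-bit asymmetric per-channel quantization. Each channel gets its
// own scale and zero point (here the channel minimum), so outlier channels do
// not inflate the quantization error of the others.
fn quantize_channel(values: &[f32], bits: u32) -> (Vec<u8>, f32, f32) {
    let levels = ((1u32 << bits) - 1) as f32; // 15 levels for 4-bit
    let min = values.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = values.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let scale = if max > min { (max - min) / levels } else { 1.0 };
    let codes: Vec<u8> = values
        .iter()
        .map(|&v| (((v - min) / scale).round() as u8).min(levels as u8))
        .collect();
    (codes, scale, min)
}

fn dequantize(codes: &[u8], scale: f32, min: f32) -> Vec<f32> {
    codes.iter().map(|&c| c as f32 * scale + min).collect()
}

fn main() {
    let channel = vec![-1.0_f32, -0.25, 0.0, 0.5, 1.0];
    let (codes, scale, min) = quantize_channel(&channel, 4);
    let restored = dequantize(&codes, scale, min);
    for (a, b) in channel.iter().zip(&restored) {
        // Round-trip error is bounded by half a quantization step.
        assert!((a - b).abs() <= scale);
    }
    println!("codes: {codes:?}, scale: {scale:.4}, min: {min}");
}
```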
@@ -70,13 +71,13 @@ Add to your `Cargo.toml`:
 ```toml
 [dependencies]
 # Recommended for Apple Silicon Mac
-ruvllm = { version = "2.0", features = ["inference-metal", "coreml", "parallel"] }
+ruvllm = { version = "2.1", features = ["inference-metal", "coreml", "parallel"] }
 
 # For NVIDIA GPUs
-ruvllm = { version = "2.0", features = ["inference-cuda", "parallel"] }
+ruvllm = { version = "2.1", features = ["inference-cuda", "parallel"] }
 
 # Minimal (CPU only)
-ruvllm = { version = "2.0" }
+ruvllm = { version = "2.1" }
 ```
 
 Or install the npm package:
@@ -85,7 +86,16 @@ Or install the npm package:
 npm install @ruvector/ruvllm
 ```
 
-## What's New in v2.3
+## What's New in v2.5
+
+| Feature | Description | Benefit |
+|---------|-------------|---------|
+|**TurboQuant**| 2-4 bit asymmetric per-channel KV-cache quantization | 6-8x memory reduction, <0.5% perplexity loss |
+|**TurboQuant Embedding Store**| Quantized vector storage with asymmetric inner product search | 10-30x memory savings for embeddings |
+|**H2O / PyramidKV Eviction**| Intelligent cache eviction based on attention scores | Keep most important tokens in long-context |
+|**Optimized Inner Product**| Compute distances directly on quantized data | 2-4x faster search, skip decompression |
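The "compute distances directly on quantized data" row above can be sketched as an asymmetric inner product: the query stays in f32 while the stored vector remains as codes with a per-vector scale and minimum, so no dequantization pass is needed. Names (`asymmetric_dot`) and layout are illustrative assumptions, not ruvllm's actual API.

```rust
// Asymmetric inner product: since each dequantized element is
// scale * code + min, dot(q, dequant(c)) expands algebraically to
// scale * dot(q, codes) + min * sum(q) -- computable on the raw codes.
fn asymmetric_dot(query: &[f32], codes: &[u8], scale: f32, min: f32) -> f32 {
    let code_dot: f32 = query.iter().zip(codes).map(|(&q, &c)| q * c as f32).sum();
    let q_sum: f32 = query.iter().sum();
    scale * code_dot + min * q_sum
}

fn main() {
    let query = [1.0_f32, 2.0, 3.0];
    let codes = [0u8, 5, 10];
    let (scale, min) = (0.1_f32, -0.5_f32);
    let fast = asymmetric_dot(&query, &codes, scale, min);
    // Reference path: dequantize first, then take the dot product.
    let slow: f32 = query
        .iter()
        .zip(&codes)
        .map(|(&q, &c)| q * (c as f32 * scale + min))
        .sum();
    assert!((fast - slow).abs() < 1e-5);
    println!("inner product = {fast}");
}
```

Both paths give the same result; the quantized path just skips materializing the f32 vector.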
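The H2O / PyramidKV eviction row describes keeping the tokens with the highest accumulated attention scores ("heavy hitters"). A minimal sketch of that selection step, assuming a hypothetical `h2o_keep_indices` helper and scores already accumulated per cached token:

```rust
// H2O-style eviction sketch: retain the k cached tokens whose accumulated
// attention scores are largest, then restore their temporal order so the
// surviving cache entries stay position-sorted. Illustrative only.
fn h2o_keep_indices(accumulated_scores: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..accumulated_scores.len()).collect();
    // Sort candidate positions by score, highest first.
    idx.sort_by(|&a, &b| {
        accumulated_scores[b]
            .partial_cmp(&accumulated_scores[a])
            .unwrap()
    });
    idx.truncate(k); // evict everything past the top k
    idx.sort_unstable(); // back to temporal order
    idx
}

fn main() {
    let scores = vec![0.9, 0.1, 0.7, 0.05, 0.6];
    let keep = h2o_keep_indices(&scores, 3);
    assert_eq!(keep, vec![0, 2, 4]);
    println!("kept token positions: {keep:?}");
}
```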