Description
Inspired by the work in https://blog.serenedb.com/search-optimization-2, we'd like to explore block-based scoring in Tantivy.
To execute this, a few different APIs will need to change:
Components
DocSet & Postings
The goal would be to decouple document ID iteration from term frequency decompression, allowing both to happen in batches.
The existing `DocSet::fill_buffer(&mut [DocId; 64])` method would transition from being a specialized fast-path to the primary driver of query execution.
`Postings` would gain a similar method to allow iterators to fetch frequencies for a block of documents only when needed:
```rust
trait Postings: DocSet {
    // ... existing methods ...

    // NEW: Decompresses term frequencies for the current block of matched docs
    // into a dense array, matching the length of the previously filled DocId buffer.
    fn fill_freq_buffer(&mut self, output: &mut [u32; 64]) -> usize;
}
```

FieldNormReader
To prevent the vectorized scoring loop from being bottlenecked by scalar memory accesses (and in some cases, IO), fieldnorm reads must also be batched. We retain the existing u8 quantization and underlying storage format to reduce churn, but in future we could apply techniques like #1438 in batch-friendly ways.
```rust
// OLD (Scalar)
pub fn fieldnorm_id(&self, doc_id: DocId) -> u8;

// NEW (Batched): Performs a contiguous gather of the compressed u8 norms
// for a block of document IDs, preparing them for the BM25 loop.
pub fn fill_norm_id_buffer(&self, doc_ids: &[DocId], out_norm_ids: &mut [u8; 64]);
```

Scorer & Bm25Weight
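As a rough illustration of the batched gather, the sketch below uses a simplified stand-in for `FieldNormReader` (a dense in-memory `Vec<u8>` indexed by `DocId`) rather than Tantivy's actual column-backed reader; the loop shape is the point, not the storage.

```rust
type DocId = u32;

// Simplified stand-in: one quantized u8 norm per document, held in memory.
struct FieldNormReader {
    norms: Vec<u8>,
}

impl FieldNormReader {
    // Gathers the compressed u8 norms for a block of (up to 64) document IDs
    // into a dense output buffer, one slot per input doc id.
    fn fill_norm_id_buffer(&self, doc_ids: &[DocId], out_norm_ids: &mut [u8; 64]) {
        for (slot, &doc) in out_norm_ids.iter_mut().zip(doc_ids) {
            *slot = self.norms[doc as usize];
        }
    }
}
```

Because the output is a fixed-size dense array, the subsequent BM25 loop can index it without bounds checks or pointer chasing.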
The virtual score() call per document would be replaced by a single virtual call per block. The underlying BM25 math remains exactly the same, but would be restructured to encourage the compiler to auto-vectorize the loop.
```rust
trait Scorer: DocSet {
    // OLD
    fn score(&mut self) -> Score;

    // NEW: Processes up to 64 documents at once.
    fn score_block(
        &mut self,
        doc_ids: &[DocId],
        out_scores: &mut [Score],
    ) -> usize;
}
```

In `Bm25Weight`, we would still rely on the precomputed cache array to map `u8` fieldnorms to floating-point length penalties, but would apply them in a branchless loop.
```rust
// NEW: Inner loop for BM25.
// More friendly to auto-vectorization.
fn apply_bm25_block(
    &self,
    freqs: &[u32],
    norm_ids: &[u8],
    out_scores: &mut [Score],
) {
    for i in 0..freqs.len() {
        let tf = freqs[i] as Score;
        let norm = self.cache[norm_ids[i] as usize];
        out_scores[i] = self.weight * (tf / (tf + norm));
    }
}
```

Query Execution
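A self-contained version of this loop can be sketched as follows. The cache construction is illustrative only: it treats the norm id directly as the field length, whereas Tantivy's real cache first decodes the u8 fieldnorm code back to an approximate length.

```rust
type Score = f32;

// Illustrative block-BM25 scorer: a term weight plus a 256-entry cache
// mapping each u8 norm id to a precomputed length penalty.
struct Bm25Block {
    weight: Score,
    cache: [Score; 256],
}

impl Bm25Block {
    fn new(weight: Score, k1: Score, b: Score, avg_len: Score) -> Self {
        let mut cache = [0.0; 256];
        for (id, slot) in cache.iter_mut().enumerate() {
            // Assumption for the sketch: the norm id *is* the field length.
            let len = id as Score;
            *slot = k1 * (1.0 - b + b * len / avg_len);
        }
        Bm25Block { weight, cache }
    }

    // Branchless inner loop: a table lookup and a few float ops per doc,
    // with no per-document virtual calls.
    fn apply_bm25_block(&self, freqs: &[u32], norm_ids: &[u8], out_scores: &mut [Score]) {
        for i in 0..freqs.len() {
            let tf = freqs[i] as Score;
            let norm = self.cache[norm_ids[i] as usize];
            out_scores[i] = self.weight * (tf / (tf + norm));
        }
    }
}
```

Keeping the loop body free of branches and virtual dispatch is what gives the compiler a realistic chance to auto-vectorize it.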
The structural iterators (`Intersection`, `Disjunction`, etc.) would be updated to use these new buffer-filling methods.
- `Intersection` / AND (Late Scoring):
  1. Skip-ahead `DocSet` intersection fills the `[DocId; 64]` buffer.
  2. Backtracks to the participating `Postings` and calls `fill_freq_buffer` only for confirmed matches.
  3. Calls `fill_norm_id_buffer`.
  4. Passes all arrays to `apply_bm25_block`.
- `BufferedUnionScorer` / OR (Greedy Scoring):
  - Because every document in a disjunction is a valid match, the leaf `TermScorer` eagerness is preserved. It calls `fill_buffer`, immediately calls `fill_freq_buffer` and `fill_norm_id_buffer`, and executes `score_block`.
  - The `Union` iterator simply merges these pre-scored blocks.
- `block_wand` / Top-K:
  - Essential Terms (Score >= Threshold): Processed eagerly using the Greedy OR strategy.
  - Non-Essential Terms: Handled via lazy `seek()`. Frequencies and norms are only bulk-fetched and scored if the document's essential score surpasses the Top-K pruning threshold.
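The Late Scoring phases for AND can be sketched as a single driver function, with closures standing in for the per-term `fill_freq_buffer`, `fill_norm_id_buffer`, and `apply_bm25_block` calls. All names here are illustrative, not Tantivy's actual signatures.

```rust
type DocId = u32;
type Score = f32;

// Phases 2-4 of Late Scoring for one term, given a block of confirmed
// matches (phase 1, produced by the skip-ahead DocSet intersection).
// Assumes matched_docs.len() <= 64.
fn score_intersection_block(
    matched_docs: &[DocId],
    freqs_for: impl Fn(&[DocId], &mut [u32]),  // Postings::fill_freq_buffer analogue
    norms_for: impl Fn(&[DocId], &mut [u8]),   // FieldNormReader::fill_norm_id_buffer analogue
    bm25: impl Fn(&[u32], &[u8], &mut [Score]), // Bm25Weight::apply_bm25_block analogue
    out_scores: &mut [Score],
) {
    let n = matched_docs.len();
    let mut freqs = [0u32; 64];
    let mut norms = [0u8; 64];
    // Only confirmed matches are decompressed and gathered.
    freqs_for(matched_docs, &mut freqs[..n]);
    norms_for(matched_docs, &mut norms[..n]);
    bm25(&freqs[..n], &norms[..n], &mut out_scores[..n]);
}
```

The key property is that frequency decompression and norm gathering happen once per block of confirmed matches, never for documents the intersection has already rejected.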
Steps
This should likely be implemented across at least three PRs:
1. Core APIs and BM25 Vectorization
Introduce the necessary buffer-filling methods and vectorizable loops, without altering how existing queries execute yet.
- `Postings`: Add `fill_freq_buffer(&mut self, output: &mut [u32; 64]) -> usize`. Provide a naive default implementation that loops over `.term_freq()` so that existing mock postings/tests don't break immediately. Then, provide the optimized bulk-decompression implementation in `SegmentPostings`.
- `FieldNormReader`: Add `fill_norm_id_buffer(&self, doc_ids: &[DocId], out_norm_ids: &mut [u8; 64])` for batched gathers of quantized lengths.
- `Scorer`: Add `score_block(&mut self, doc_ids: &[DocId], out_scores: &mut [Score]) -> usize`. Provide a default fallback that loops over the scalar `score()` method to keep other implementations compiling.
- `Bm25Weight`: Implement `apply_bm25_block`, a branchless loop mapping `u8` norms through the cache array and computing the BM25 score array.
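The default-fallback pattern for `score_block` could look roughly like this. The trait is simplified for illustration: the real `Scorer` is also a `DocSet` and scores its current document, whereas `score_one` here takes an explicit `DocId`.

```rust
type DocId = u32;
type Score = f32;

trait BlockScorer {
    fn score_one(&mut self, doc: DocId) -> Score;

    // Default fallback: loop over the scalar path so existing
    // implementations keep compiling while leaf scorers override
    // this with a true block implementation.
    fn score_block(&mut self, doc_ids: &[DocId], out_scores: &mut [Score]) -> usize {
        let n = doc_ids.len().min(out_scores.len());
        for i in 0..n {
            out_scores[i] = self.score_one(doc_ids[i]);
        }
        n
    }
}

// A trivial scorer that relies entirely on the default block method.
struct ConstScorer(Score);

impl BlockScorer for ConstScorer {
    fn score_one(&mut self, _doc: DocId) -> Score {
        self.0
    }
}
```

This keeps the migration incremental: every scorer gets a working `score_block` for free, and only the hot leaves need a native implementation.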
2. Leaf Scorers and Greedy OR (TermScorer & BufferedUnionScorer)
Hook the new block-APIs up to the simplest query execution paths to realize initial performance gains.
- `TermScorer`: Implement `score_block` natively. It will read a block of `DocId`s, immediately call `self.postings.fill_freq_buffer(...)` and `self.fieldnorm_reader.fill_norm_id_buffer(...)`, and then pass them all to `self.similarity_weight.apply_bm25_block(...)`.
- `BufferedUnionScorer` (Disjunction / OR): Because OR queries are "greedy" (every matching document contributes to the final score), `TermScorer`'s eagerness works. Update the union logic to merge these pre-scored blocks of up to 64 documents rather than scoring one-by-one.
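Merging pre-scored blocks can be sketched as a two-way merge of sorted `(DocId, Score)` runs in which equal doc ids have their scores summed, as an OR scorer would. This is a simplification of what `BufferedUnionScorer` actually does (it buffers over a doc-id horizon and handles many children), but it shows the block-level merge.

```rust
// Merge two sorted, pre-scored blocks from a disjunction.
// Documents matched by both children get their scores summed.
fn merge_scored_blocks(a: &[(u32, f32)], b: &[(u32, f32)]) -> Vec<(u32, f32)> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::with_capacity(a.len() + b.len());
    while i < a.len() && j < b.len() {
        match a[i].0.cmp(&b[j].0) {
            std::cmp::Ordering::Less => {
                out.push(a[i]);
                i += 1;
            }
            std::cmp::Ordering::Greater => {
                out.push(b[j]);
                j += 1;
            }
            std::cmp::Ordering::Equal => {
                out.push((a[i].0, a[i].1 + b[j].1));
                i += 1;
                j += 1;
            }
        }
    }
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}
```

Because each child already scored its block with the vectorized path, the union itself does no per-document BM25 work.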
3. Late Scoring (Intersection & Top-K / block_wand)
Finally, implement the multi-phase skipping logic for AND and Top-K queries.

- `Intersection` (AND): Transition this iterator to the 4-phase "Late Scoring" architecture. Use `DocSet::fill_buffer` iteratively to find a block of documents that match all required terms. Once a block of confirmed matches is identified, step back to the participating `Postings` and `FieldNormReader`s to fetch only the required frequencies and norms. Finally, score the confirmed block.
- `block_wand` (Top-K): Refactor the dynamic pruning logic. Essential terms can be evaluated greedily (using the fast path from (2)), while non-essential terms use lazy seeks. The frequencies/norms for non-essential terms are only block-fetched if the document's essential score clears the Top-K threshold.
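The pruning step for non-essential terms can be sketched as a block-level filter, with all names hypothetical: only documents whose essential score, plus an upper bound on what the non-essential terms could still add, can reach the threshold are kept for the expensive frequency/norm fetch.

```rust
// Keep only the candidates that could still enter the Top-K.
// `max_remaining` is an upper bound on the total score the
// non-essential terms could contribute (from their max-score metadata).
fn surviving_candidates(
    doc_ids: &[u32],
    essential_scores: &[f32],
    max_remaining: f32,
    threshold: f32,
) -> Vec<u32> {
    doc_ids
        .iter()
        .zip(essential_scores)
        .filter(|(_, &s)| s + max_remaining >= threshold)
        .map(|(&d, _)| d)
        .collect()
}
```

Only the surviving doc ids would then flow through `fill_freq_buffer` / `fill_norm_id_buffer` for the non-essential terms, preserving block_wand's skipping behaviour.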
Intersection(AND): Transition this iterator to the 4-phase "Late Scoring" architecture. UseDocSet::fill_bufferiteratively to find a block of documents that match all required terms. Once a block of confirmed matches is identified, step back to the participatingPostingsandFieldNormReadersto fetch only the required frequencies and norms. Finally, score the confirmed block.block_wand(Top-K): Refactor the dynamic pruning logic. Essential terms can be evaluated greedily (using the fast path from(2)), while non-essential terms use lazy seeks. The frequencies/norms for non-essential terms are only block-fetched if the document's essential score clears the Top-K threshold.