
Block-based scoring #2859

@stuhood

Description


Inspired by the work in https://blog.serenedb.com/search-optimization-2, we'd like to explore block-based scoring in Tantivy.

To execute this, a few different APIs will need to change:

Components

DocSet & Postings

The goal would be to decouple document ID iteration from term frequency decompression, allowing both to happen in batches.

The existing DocSet::fill_buffer(&mut [DocId; 64]) method would transition from being a specialized fast path to the primary driver of query execution.

Postings would gain a similar method to allow iterators to fetch frequencies for a block of documents only when needed:

trait Postings: DocSet {
    // ... existing methods ...

    // NEW: Decompresses term frequencies for the current block of matched docs 
    // into a dense array, matching the length of the previously filled DocId buffer.
    fn fill_freq_buffer(&mut self, output: &mut [u32; 64]) -> usize;
}
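To illustrate the intended contract, here is a minimal sketch of a driver loop over the two batched methods. Everything below is a mock: the struct, the cursor handling, and the exact advance semantics are assumptions for illustration, not Tantivy's actual implementation.

```rust
const BLOCK_LEN: usize = 64;

struct MockPostings {
    docs: Vec<u32>,
    freqs: Vec<u32>,
    cursor: usize,
}

impl MockPostings {
    // Fills up to 64 matching DocIds; returns how many were written.
    fn fill_buffer(&mut self, out: &mut [u32; BLOCK_LEN]) -> usize {
        let n = (self.docs.len() - self.cursor).min(BLOCK_LEN);
        out[..n].copy_from_slice(&self.docs[self.cursor..self.cursor + n]);
        n
    }

    // "Decompresses" frequencies for the block just filled, then advances.
    fn fill_freq_buffer(&mut self, out: &mut [u32; BLOCK_LEN]) -> usize {
        let n = (self.docs.len() - self.cursor).min(BLOCK_LEN);
        out[..n].copy_from_slice(&self.freqs[self.cursor..self.cursor + n]);
        self.cursor += n;
        n
    }
}

// Drives iteration in blocks, touching frequencies only after the docs
// of each block are known.
fn count_matches(postings: &mut MockPostings) -> usize {
    let mut docs = [0u32; BLOCK_LEN];
    let mut freqs = [0u32; BLOCK_LEN];
    let mut total = 0;
    loop {
        let n = postings.fill_buffer(&mut docs);
        if n == 0 {
            break;
        }
        postings.fill_freq_buffer(&mut freqs);
        total += n;
    }
    total
}

fn main() {
    let mut p = MockPostings { docs: (0..100).collect(), freqs: vec![1; 100], cursor: 0 };
    assert_eq!(count_matches(&mut p), 100);
}
```

The key property is the decoupling: a caller that only needs DocIds (e.g. a pure filter) never pays for frequency decompression.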

FieldNormReader

To prevent the vectorized scoring loop from being bottlenecked by scalar memory accesses (and in some cases, IO), fieldnorm reads must also be batched. We retain the existing u8 quantization and underlying storage format to reduce churn, but in future we could apply techniques like #1438 in batch-friendly ways.

// OLD (Scalar)
pub fn fieldnorm_id(&self, doc_id: DocId) -> u8;

// NEW (Batched): Performs a contiguous gather of the compressed u8 norms 
// for a block of document IDs, preparing them for the BM25 loop.
pub fn fill_norm_id_buffer(&self, doc_ids: &[DocId], out_norm_ids: &mut [u8; 64]);
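A minimal sketch of the batched gather, where `norms` stands in for the segment's u8-quantized fieldnorm column (the struct and backing storage are assumptions for illustration):

```rust
struct MockFieldNormReader {
    norms: Vec<u8>, // one quantized norm per DocId
}

impl MockFieldNormReader {
    // Gathers the compressed u8 norms for a block of DocIds.
    fn fill_norm_id_buffer(&self, doc_ids: &[u32], out_norm_ids: &mut [u8; 64]) {
        for (i, &doc) in doc_ids.iter().enumerate() {
            out_norm_ids[i] = self.norms[doc as usize];
        }
    }
}

fn main() {
    let reader = MockFieldNormReader { norms: vec![3, 7, 9, 11] };
    let mut out = [0u8; 64];
    reader.fill_norm_id_buffer(&[1, 3], &mut out);
    assert_eq!(&out[..2], &[7, 11]);
}
```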

Scorer & Bm25Weight

The virtual score() call per document would be replaced by a single virtual call per block. The underlying BM25 math remains exactly the same, but would be restructured to encourage the compiler to auto-vectorize the loop.

trait Scorer: DocSet {
    // OLD
    fn score(&mut self) -> Score;

    // NEW: Processes up to 64 documents at once.
    fn score_block(
        &mut self, 
        doc_ids: &[DocId], 
        out_scores: &mut [Score]
    ) -> usize;
}

In Bm25Weight, we would still rely on the precomputed cache array to map u8 fieldnorms to floating-point length penalties, but would apply them in a branchless loop.

// NEW: Inner loop for BM25.
// More friendly to auto-vectorization.
fn apply_bm25_block(
    &self, 
    freqs: &[u32], 
    norm_ids: &[u8], 
    out_scores: &mut [Score]
) {
    for i in 0..freqs.len() {
        let tf = freqs[i] as Score;
        let norm = self.cache[norm_ids[i] as usize];
        out_scores[i] = self.weight * (tf / (tf + norm));
    }
}
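The cache array itself is precomputed once per weight. As a sketch of that precomputation: the k1/b constants are the usual BM25 defaults, and the identity norm decode is a simplification (the real decode maps a u8 norm id back to a document length through a quantization table).

```rust
const K1: f32 = 1.2;
const B: f32 = 0.75;

// Builds the 256-entry length-penalty cache indexed by quantized norm id.
fn build_cache(average_fieldnorm: f32) -> [f32; 256] {
    let mut cache = [0f32; 256];
    for (norm_id, slot) in cache.iter_mut().enumerate() {
        let fieldnorm = norm_id as f32; // simplification: identity decode
        *slot = K1 * (1.0 - B + B * fieldnorm / average_fieldnorm);
    }
    cache
}

fn main() {
    let cache = build_cache(10.0);
    // At exactly the average length, the penalty reduces to k1.
    assert!((cache[10] - K1).abs() < 1e-6);
}
```

Because the lookup is a plain array index with no branches, the inner loop in apply_bm25_block stays amenable to auto-vectorization.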

Query Execution

The structural iterators (Intersection, Disjunction, etc.) would be updated to use these new buffer-filling methods.

  • Intersection / AND (Late Scoring):
    1. The skip-ahead DocSet intersection fills the [DocId; 64] buffer.
    2. The scorer backtracks to the participating Postings and calls fill_freq_buffer only for confirmed matches.
    3. It calls fill_norm_id_buffer for the same block.
    4. It passes all of the resulting arrays to apply_bm25_block.
  • BufferedUnionScorer / OR (Greedy Scoring):
    • Because every document in a disjunction is a valid match, the leaf TermScorer can score eagerly: it calls fill_buffer, immediately calls fill_freq_buffer and fill_norm_id_buffer, and executes score_block.
    • The Union iterator simply merges these pre-scored blocks.
  • block_wand / Top-K:
    • Essential Terms (Score >= Threshold): Processed eagerly using the Greedy OR strategy.
    • Non-Essential Terms: Handled via lazy seek(). Frequencies and norms are only bulk-fetched and scored if the document's essential score surpasses the Top-K pruning threshold.
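The late-scoring phases for intersection can be sketched end-to-end for a two-term AND query. All inputs here are plain slices standing in for postings, fieldnorms, and the precomputed BM25 cache; the function shape is an assumption for illustration.

```rust
fn late_score_block(
    a_docs: &[u32], a_freqs: &[u32],
    b_docs: &[u32], b_freqs: &[u32],
    norms: &[u8],
    cache: &[f32; 256],
    weight: f32,
) -> Vec<(u32, f32)> {
    // Phase 1: skip-ahead intersection fills the DocId block, remembering
    // each term's position for the later backtrack.
    let mut block = Vec::new();
    let (mut i, mut j) = (0usize, 0usize);
    while i < a_docs.len() && j < b_docs.len() {
        if a_docs[i] < b_docs[j] {
            i += 1;
        } else if a_docs[i] > b_docs[j] {
            j += 1;
        } else {
            block.push((a_docs[i], i, j));
            i += 1;
            j += 1;
        }
    }
    // Phases 2-4: gather frequencies and norms for confirmed matches only,
    // then apply the BM25 loop over the dense arrays.
    block
        .iter()
        .map(|&(doc, ai, bi)| {
            let norm = cache[norms[doc as usize] as usize];
            let bm25 = |tf: f32| weight * (tf / (tf + norm));
            (doc, bm25(a_freqs[ai] as f32) + bm25(b_freqs[bi] as f32))
        })
        .collect()
}

fn main() {
    let norms = vec![0u8; 8];
    let cache = [1.0f32; 256];
    let scored =
        late_score_block(&[1, 3, 5], &[1, 1, 1], &[3, 5, 7], &[1, 1, 1], &norms, &cache, 1.0);
    // Intersection is {3, 5}; with tf = 1 and norm = 1.0, each term scores 0.5.
    assert_eq!(scored.len(), 2);
    assert!((scored[0].1 - 1.0).abs() < 1e-6);
}
```

The point of the phase split is that frequency decompression and norm gathers are only paid for documents that survived the intersection.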

Steps

This should likely be implemented across at least three PRs:

1. Core APIs and BM25 Vectorization

Introduce the necessary buffer-filling methods and vectorizable loops, without yet altering how existing queries execute.

  • Postings: Add fill_freq_buffer(&mut self, output: &mut [u32; 64]) -> usize. Provide a naive default implementation that loops over .term_freq() so that existing mock postings/tests don't break immediately. Then, provide the optimized bulk-decompression implementation in SegmentPostings.
  • FieldNormReader: Add fill_norm_id_buffer(&self, doc_ids: &[DocId], out_norm_ids: &mut [u8; 64]) for batched gathers of quantized lengths.
  • Scorer: Add score_block(&mut self, doc_ids: &[DocId], out_scores: &mut [Score]) -> usize. Provide a default fallback that loops over the scalar score() method to keep other implementations compiling.
  • Bm25Weight: Implement apply_bm25_block. This is a branchless loop mapping u8 norms through the cache array and computing the BM25 score array.
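The default-fallback idea from the Scorer bullet can be sketched as follows. The trait shape is an assumption, and a real default would also need to advance the underlying DocSet to each requested doc; the point is only that existing implementations keep compiling unchanged.

```rust
trait Scorer {
    fn score(&mut self) -> f32;

    // Default: a scalar loop over score(), overridden with a vectorized
    // path where it pays off. (A real default would also seek to each doc.)
    fn score_block(&mut self, doc_ids: &[u32], out_scores: &mut [f32]) -> usize {
        for i in 0..doc_ids.len() {
            out_scores[i] = self.score();
        }
        doc_ids.len()
    }
}

struct ConstScorer(f32);

impl Scorer for ConstScorer {
    fn score(&mut self) -> f32 {
        self.0
    }
    // score_block is inherited from the default.
}

fn main() {
    let mut s = ConstScorer(2.0);
    let mut out = [0f32; 64];
    let n = s.score_block(&[4, 8, 15], &mut out);
    assert_eq!(n, 3);
    assert_eq!(&out[..3], &[2.0, 2.0, 2.0]);
}
```

The same pattern applies to the Postings::fill_freq_buffer default that loops over .term_freq().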

2. Leaf Scorers and Greedy OR (TermScorer & BufferedUnionScorer)

Hook the new block-APIs up to the simplest query execution paths to realize initial performance gains.

  • TermScorer: Implement score_block natively. It will read a block of DocIds, immediately call self.postings.fill_freq_buffer(...), self.fieldnorm_reader.fill_norm_id_buffer(...), and then pass them all to self.similarity_weight.apply_bm25_block(...).
  • BufferedUnionScorer (Disjunction/OR): Because OR queries are "greedy" (every matching document contributes to the final score), TermScorer's eagerness works. Update the union logic to merge these pre-scored blocks of up to 64 documents rather than scoring one-by-one.
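The union's merge step might look like the sketch below: two pre-scored blocks are combined, summing scores when both terms matched the same DocId. The function name and the (DocId, Score) tuple representation are assumptions for illustration.

```rust
fn merge_scored(a: &[(u32, f32)], b: &[(u32, f32)]) -> Vec<(u32, f32)> {
    let (mut i, mut j) = (0usize, 0usize);
    let mut out = Vec::with_capacity(a.len() + b.len());
    while i < a.len() && j < b.len() {
        if a[i].0 < b[j].0 {
            out.push(a[i]);
            i += 1;
        } else if a[i].0 > b[j].0 {
            out.push(b[j]);
            j += 1;
        } else {
            // Same doc matched by both terms: scores accumulate.
            out.push((a[i].0, a[i].1 + b[j].1));
            i += 1;
            j += 1;
        }
    }
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}

fn main() {
    let merged = merge_scored(&[(1, 0.5), (3, 0.2)], &[(2, 0.1), (3, 0.3)]);
    assert_eq!(merged.len(), 3);
    assert_eq!(merged[2].0, 3);
    assert!((merged[2].1 - 0.5).abs() < 1e-6);
}
```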

3. Late Scoring (Intersection & Top-K / block_wand)

Finally, implement the multi-phase skipping logic for AND and Top-K queries.

  • Intersection (AND): Transition this iterator to the 4-phase "Late Scoring" architecture. Use DocSet::fill_buffer iteratively to find a block of documents that match all required terms. Once a block of confirmed matches is identified, step back to the participating Postings and FieldNormReaders to fetch only the required frequencies and norms. Finally, score the confirmed block.
  • block_wand (Top-K): Refactor the dynamic pruning logic. Essential terms can be evaluated greedily (using the fast path from (2)), while non-essential terms use lazy seeks. The frequencies/norms for non-essential terms are only block-fetched if the document's essential score clears the Top-K threshold.
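The pruning gate for non-essential terms can be sketched as below: frequencies and norms are bulk-fetched only for docs whose essential score, plus the non-essential terms' maximum possible contribution (the standard WAND upper bound), can still clear the current Top-K threshold. The function and its inputs are assumptions for illustration.

```rust
// Selects the docs from a pre-scored essential block that are worth the
// extra frequency/norm fetch for non-essential terms.
fn docs_to_fully_score(
    essential: &[(u32, f32)],     // (doc, essential score)
    non_essential_max_score: f32, // upper bound on the remaining contribution
    threshold: f32,               // current Top-K pruning threshold
) -> Vec<u32> {
    essential
        .iter()
        .filter(|&&(_, score)| score + non_essential_max_score > threshold)
        .map(|&(doc, _)| doc)
        .collect()
}

fn main() {
    let essential = [(10, 0.4), (11, 1.5), (12, 0.9)];
    // With a max non-essential contribution of 0.5 and threshold 1.2,
    // only docs 11 and 12 are worth the extra fetch.
    assert_eq!(docs_to_fully_score(&essential, 0.5, 1.2), vec![11, 12]);
}
```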
