Description
Inspired by the work in https://blog.serenedb.com/search-optimization-2, we'd like to explore block-based scoring in Tantivy.
To execute this, a few different APIs will need to change:
Components
DocSet & Postings
The goal would be to decouple document ID iteration from term frequency decompression, allowing both to happen in batches.
The existing `DocSet::fill_buffer(&mut [DocId; 64])` method would transition from being a specialized fast-path to the primary driver of query execution.
`Postings` would gain a similar method to allow iterators to fetch frequencies for a block of documents only when needed:
```rust
trait Postings: DocSet {
    // ... existing methods ...

    // NEW: Decompresses term frequencies for the current block of matched docs
    // into a dense array, matching the length of the previously filled DocId buffer.
    fn fill_freq_buffer(&mut self, output: &mut [u32; 64]) -> usize;
}
```

FieldNormReader
To prevent the vectorized scoring loop from being bottlenecked by scalar memory accesses (and in some cases, IO), fieldnorm reads must also be batched. We retain the existing u8 quantization and underlying storage format to reduce churn, but in future we could apply techniques like #1438 in batch-friendly ways.
```rust
// OLD (Scalar)
pub fn fieldnorm_id(&self, doc_id: DocId) -> u8;

// NEW (Batched): Performs a contiguous gather of the compressed u8 norms
// for a block of document IDs, preparing them for the BM25 loop.
pub fn fill_norm_id_buffer(&self, doc_ids: &[DocId], out_norm_ids: &mut [u8; 64]);
```

Scorer & Bm25Weight
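As a rough illustration of the batched gather, the sketch below uses a simplified stand-in for `FieldNormReader` (a dense in-memory `Vec<u8>` indexed by `DocId`) rather than Tantivy's actual column-backed reader; the loop shape is the point, not the storage.

```rust
type DocId = u32;

// Simplified stand-in: one quantized u8 norm per document, held in memory.
struct FieldNormReader {
    norms: Vec<u8>,
}

impl FieldNormReader {
    // Gathers the compressed u8 norms for a block of (up to 64) document IDs
    // into a dense output buffer, one slot per input doc id.
    fn fill_norm_id_buffer(&self, doc_ids: &[DocId], out_norm_ids: &mut [u8; 64]) {
        for (slot, &doc) in out_norm_ids.iter_mut().zip(doc_ids) {
            *slot = self.norms[doc as usize];
        }
    }
}
```

Because the output is a fixed-size dense array, the subsequent BM25 loop can index it without bounds checks or pointer chasing.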
The virtual score() call per document would be replaced by a single virtual call per block. The underlying BM25 math remains exactly the same, but would be restructured to encourage the compiler to auto-vectorize the loop.
```rust
trait Scorer: DocSet {
    // OLD
    fn score(&mut self) -> Score;

    // NEW: Processes up to 64 documents at once.
    fn score_block(
        &mut self,
        doc_ids: &[DocId],
        out_scores: &mut [Score],
    ) -> usize;
}
```

In `Bm25Weight`, we would still rely on the precomputed cache array to map `u8` fieldnorms to floating-point length penalties, but would apply them in a branchless loop.
```rust
// NEW: Inner loop for BM25.
// More friendly to auto-vectorization.
fn apply_bm25_block(
    &self,
    freqs: &[u32],
    norm_ids: &[u8],
    out_scores: &mut [Score],
) {
    for i in 0..freqs.len() {
        let tf = freqs[i] as Score;
        let norm = self.cache[norm_ids[i] as usize];
        out_scores[i] = self.weight * (tf / (tf + norm));
    }
}
```

Query Execution
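A self-contained version of this loop can be sketched as follows. The cache construction is illustrative only: it treats the norm id directly as the field length, whereas Tantivy's real cache first decodes the u8 fieldnorm code back to an approximate length.

```rust
type Score = f32;

// Illustrative block-BM25 scorer: a term weight plus a 256-entry cache
// mapping each u8 norm id to a precomputed length penalty.
struct Bm25Block {
    weight: Score,
    cache: [Score; 256],
}

impl Bm25Block {
    fn new(weight: Score, k1: Score, b: Score, avg_len: Score) -> Self {
        let mut cache = [0.0; 256];
        for (id, slot) in cache.iter_mut().enumerate() {
            // Assumption for the sketch: the norm id *is* the field length.
            let len = id as Score;
            *slot = k1 * (1.0 - b + b * len / avg_len);
        }
        Bm25Block { weight, cache }
    }

    // Branchless inner loop: a table lookup and a few float ops per doc,
    // with no per-document virtual calls.
    fn apply_bm25_block(&self, freqs: &[u32], norm_ids: &[u8], out_scores: &mut [Score]) {
        for i in 0..freqs.len() {
            let tf = freqs[i] as Score;
            let norm = self.cache[norm_ids[i] as usize];
            out_scores[i] = self.weight * (tf / (tf + norm));
        }
    }
}
```

Keeping the loop body free of branches and virtual dispatch is what gives the compiler a realistic chance to auto-vectorize it.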
The structural iterators (`Intersection`, `Disjunction`, etc.) would be updated to use these new buffer-filling methods.
- `Intersection` / AND (Late Scoring):
  1. Skip-ahead `DocSet` intersection fills the `[DocId; 64]` buffer.
  2. Backtracks to the participating `Postings` and calls `fill_freq_buffer` only for confirmed matches.
  3. Calls `fill_norm_id_buffer`.
  4. Passes all arrays to `apply_bm25_block`.
- `BufferedUnionScorer` / OR (Greedy Scoring):
  - Because every document in a disjunction is a valid match, the leaf `TermScorer` eagerness is preserved. It calls `fill_buffer`, immediately calls `fill_freq_buffer` and `fill_norm_id_buffer`, and executes `score_block`.
  - The `Union` iterator simply merges these pre-scored blocks.
- `block_wand` / Top-K:
  - Essential Terms (Score >= Threshold): Processed eagerly using the Greedy OR strategy.
  - Non-Essential Terms: Handled via lazy `seek()`. Frequencies and norms are only bulk-fetched and scored if the document's essential score surpasses the Top-K pruning threshold.
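The Late Scoring phases for AND can be sketched as a single driver function, with closures standing in for the per-term `fill_freq_buffer`, `fill_norm_id_buffer`, and `apply_bm25_block` calls. All names here are illustrative, not Tantivy's actual signatures.

```rust
type DocId = u32;
type Score = f32;

// Phases 2-4 of Late Scoring for one term, given a block of confirmed
// matches (phase 1, produced by the skip-ahead DocSet intersection).
// Assumes matched_docs.len() <= 64.
fn score_intersection_block(
    matched_docs: &[DocId],
    freqs_for: impl Fn(&[DocId], &mut [u32]),  // Postings::fill_freq_buffer analogue
    norms_for: impl Fn(&[DocId], &mut [u8]),   // FieldNormReader::fill_norm_id_buffer analogue
    bm25: impl Fn(&[u32], &[u8], &mut [Score]), // Bm25Weight::apply_bm25_block analogue
    out_scores: &mut [Score],
) {
    let n = matched_docs.len();
    let mut freqs = [0u32; 64];
    let mut norms = [0u8; 64];
    // Only confirmed matches are decompressed and gathered.
    freqs_for(matched_docs, &mut freqs[..n]);
    norms_for(matched_docs, &mut norms[..n]);
    bm25(&freqs[..n], &norms[..n], &mut out_scores[..n]);
}
```

The key property is that frequency decompression and norm gathering happen once per block of confirmed matches, never for documents the intersection has already rejected.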
Steps
This should likely be implemented across at least three PRs:
1. Core APIs and BM25 Vectorization
Introduce the necessary buffer-filling methods and vectorizable loops, without altering how existing queries execute yet.
- `Postings`: Add `fill_freq_buffer(&mut self, output: &mut [u32; 64]) -> usize`. Provide a naive default implementation that loops over `.term_freq()` so that existing mock postings/tests don't break immediately. Then, provide the optimized bulk-decompression implementation in `SegmentPostings`.
- `FieldNormReader`: Add `fill_norm_id_buffer(&self, doc_ids: &[DocId], out_norm_ids: &mut [u8; 64])` for batched gathers of quantized lengths.
- `Scorer`: Add `score_block(&mut self, doc_ids: &[DocId], out_scores: &mut [Score]) -> usize`. Provide a default fallback that loops over the scalar `score()` method to keep other implementations compiling.
- `Bm25Weight`: Implement `apply_bm25_block`, a branchless loop mapping `u8` norms through the cache array and computing the BM25 score array.
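The default-fallback pattern for `score_block` could look roughly like this. The trait is simplified for illustration: the real `Scorer` is also a `DocSet` and scores its current document, whereas `score_one` here takes an explicit `DocId`.

```rust
type DocId = u32;
type Score = f32;

trait BlockScorer {
    fn score_one(&mut self, doc: DocId) -> Score;

    // Default fallback: loop over the scalar path so existing
    // implementations keep compiling while leaf scorers override
    // this with a true block implementation.
    fn score_block(&mut self, doc_ids: &[DocId], out_scores: &mut [Score]) -> usize {
        let n = doc_ids.len().min(out_scores.len());
        for i in 0..n {
            out_scores[i] = self.score_one(doc_ids[i]);
        }
        n
    }
}

// A trivial scorer that relies entirely on the default block method.
struct ConstScorer(Score);

impl BlockScorer for ConstScorer {
    fn score_one(&mut self, _doc: DocId) -> Score {
        self.0
    }
}
```

This keeps the migration incremental: every scorer gets a working `score_block` for free, and only the hot leaves need a native implementation.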
2. Leaf Scorers and Greedy OR (TermScorer & BufferedUnionScorer)
Hook the new block-APIs up to the simplest query execution paths to realize initial performance gains.
- `TermScorer`: Implement `score_block` natively. It will read a block of `DocId`s, immediately call `self.postings.fill_freq_buffer(...)` and `self.fieldnorm_reader.fill_norm_id_buffer(...)`, and then pass them all to `self.similarity_weight.apply_bm25_block(...)`.
- `BufferedUnionScorer` (Disjunction / OR): Because OR queries are "greedy" (every matching document contributes to the final score), `TermScorer`'s eagerness works. Update the union logic to merge these pre-scored blocks of up to 64 documents rather than scoring one-by-one.
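Merging pre-scored blocks can be sketched as a two-way merge of sorted `(DocId, Score)` runs in which equal doc ids have their scores summed, as an OR scorer would. This is a simplification of what `BufferedUnionScorer` actually does (it buffers over a doc-id horizon and handles many children), but it shows the block-level merge.

```rust
// Merge two sorted, pre-scored blocks from a disjunction.
// Documents matched by both children get their scores summed.
fn merge_scored_blocks(a: &[(u32, f32)], b: &[(u32, f32)]) -> Vec<(u32, f32)> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::with_capacity(a.len() + b.len());
    while i < a.len() && j < b.len() {
        match a[i].0.cmp(&b[j].0) {
            std::cmp::Ordering::Less => {
                out.push(a[i]);
                i += 1;
            }
            std::cmp::Ordering::Greater => {
                out.push(b[j]);
                j += 1;
            }
            std::cmp::Ordering::Equal => {
                out.push((a[i].0, a[i].1 + b[j].1));
                i += 1;
                j += 1;
            }
        }
    }
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}
```

Because each child already scored its block with the vectorized path, the union itself does no per-document BM25 work.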
3. Late Scoring (Intersection & Top-K / block_wand)
Finally, implement the multi-phase skipping logic for AND and Top-K queries.

- `Intersection` (AND): Transition this iterator to the 4-phase "Late Scoring" architecture. Use `DocSet::fill_buffer` iteratively to find a block of documents that match all required terms. Once a block of confirmed matches is identified, step back to the participating `Postings` and `FieldNormReader`s to fetch only the required frequencies and norms. Finally, score the confirmed block.
- `block_wand` (Top-K): Refactor the dynamic pruning logic. Essential terms can be evaluated greedily (using the fast path from (2)), while non-essential terms use lazy seeks. The frequencies/norms for non-essential terms are only block-fetched if the document's essential score clears the Top-K threshold.
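The pruning step for non-essential terms can be sketched as a block-level filter, with all names hypothetical: only documents whose essential score, plus an upper bound on what the non-essential terms could still add, can reach the threshold are kept for the expensive frequency/norm fetch.

```rust
// Keep only the candidates that could still enter the Top-K.
// `max_remaining` is an upper bound on the total score the
// non-essential terms could contribute (from their max-score metadata).
fn surviving_candidates(
    doc_ids: &[u32],
    essential_scores: &[f32],
    max_remaining: f32,
    threshold: f32,
) -> Vec<u32> {
    doc_ids
        .iter()
        .zip(essential_scores)
        .filter(|(_, &s)| s + max_remaining >= threshold)
        .map(|(&d, _)| d)
        .collect()
}
```

Only the surviving doc ids would then flow through `fill_freq_buffer` / `fill_norm_id_buffer` for the non-essential terms, preserving block_wand's skipping behaviour.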
Intersection(AND): Transition this iterator to the 4-phase "Late Scoring" architecture. UseDocSet::fill_bufferiteratively to find a block of documents that match all required terms. Once a block of confirmed matches is identified, step back to the participatingPostingsandFieldNormReadersto fetch only the required frequencies and norms. Finally, score the confirmed block.block_wand(Top-K): Refactor the dynamic pruning logic. Essential terms can be evaluated greedily (using the fast path from(2)), while non-essential terms use lazy seeks. The frequencies/norms for non-essential terms are only block-fetched if the document's essential score clears the Top-K threshold.