Wire flat RaBitQ search through PQ4CodeScanner dispatch#4870

Open
algoriddle wants to merge 7 commits into facebookresearch:main from algoriddle:export-D95392153

@algoriddle

Summary:
Parameterize RaBitQHeapHandler on SIMDLevel and wire it through the
PQ4CodeScanner dispatch boundary. Flat RaBitQ search now uses native SIMD
kernels in DD mode.

Handler changes:

  • RaBitQHeapHandler<C, W> -> RaBitQHeapHandler<C, W, SL> (defaulted)
  • Move member function definitions from .cpp to header (required for
    per-SIMD TU instantiation with non-default SL)
  • Change context from reference to pointer (allows post-construction set)
  • Add 'using SIMDResultHandler::handle' for DD compatibility
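
A minimal sketch of the handler changes listed above, not the actual RaBitQHeapHandler. The names SketchHandler, QueryContext, and DummyComparator are hypothetical, and the SIMDLevel enum is a simplified stand-in. It shows a defaulted trailing SL parameter, so two-argument uses keep compiling, and a context held by pointer, so it can be set after construction:

```cpp
// Simplified stand-in for the library's SIMDLevel enum (assumption).
enum class SIMDLevel { NONE, AVX2, AVX512, ARM_NEON };

// Hypothetical per-query context; the real one carries RaBitQ-specific data.
struct QueryContext {
    float scale = 1.0f;
};

// Hypothetical comparator tag standing in for the C parameter.
struct DummyComparator {};

// SL is defaulted, so SketchHandler<C, W> still names a valid type.
template <class C, bool W, SIMDLevel SL = SIMDLevel::NONE>
struct SketchHandler {
    // A pointer member may start out null and be assigned later;
    // a reference member would have to be bound in the constructor.
    const QueryContext* context = nullptr;

    void set_context(const QueryContext* ctx) {
        context = ctx;
    }

    // Uses the context only if one has been set.
    float rescale(float raw) const {
        return context ? raw * context->scale : raw;
    }
};
```

With a reference member, a factory that builds the handler before the query context exists would not be possible; the pointer trades that restriction for a null check.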

Scanner wiring:

  • Create rabitq_dispatching.h with RaBitQScannerMixIn + factory
  • Add context parameter to make_knn_scanner() virtual interface
  • IndexRaBitQFastScan::make_knn_scanner() returns scanner via factory
  • Include rabitq_dispatching.h in all per-SIMD TUs + NONE base TU

Static build impact: ZERO.

Differential Revision: D95392153

Summary:
Templatize all simd wrapper types (simd16uint16, simd32uint8, simd8float32,
etc.) on SIMDLevel. This is the foundation for PQ4 fast scan Dynamic Dispatch.

Primary templates are declared in simdlib.h. Each platform header provides
explicit specializations:
- simdlib_avx2.h: simd16uint16<AVX2>, simd32uint8<AVX2>, etc.
- simdlib_avx512.h: simd32uint16<AVX512>, simd64uint8<AVX512>, etc.
- simdlib_neon.h: simd16uint16<ARM_NEON>, etc.
- simdlib_emulated.h: simd16uint16<NONE>, etc. (always included)
- simdlib_ppc64.h: simd16uint16<NONE>, etc. (PPC-optimized scalar)
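
The pattern can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of simdlib.h: vec16u16 and add are hypothetical names, the enum is a reduced stand-in, and the "AVX2" specialization uses a plain aligned array where real code would wrap an intrinsic register type.

```cpp
#include <cstdint>

// Reduced stand-in for the library's SIMDLevel enum (assumption).
enum class SIMDLevel { NONE, AVX2 };

// Primary template: declared but not defined, so using a level that has
// no specialization is a compile-time error.
template <SIMDLevel SL>
struct vec16u16;

// Scalar emulation, available on every platform.
template <>
struct vec16u16<SIMDLevel::NONE> {
    std::uint16_t lanes[16];
};

// Stand-in for an intrinsic-backed specialization (real code would hold
// a 256-bit register type here instead of an aligned array).
template <>
struct vec16u16<SIMDLevel::AVX2> {
    alignas(32) std::uint16_t lanes[16];
};

// Generic code names the level once and gets the matching wrapper type.
template <SIMDLevel SL>
vec16u16<SL> add(const vec16u16<SL>& a, const vec16u16<SL>& b) {
    vec16u16<SL> r{};
    for (int i = 0; i < 16; i++) {
        r.lanes[i] = static_cast<std::uint16_t>(a.lanes[i] + b.lanes[i]);
    }
    return r;
}
```

Because each level is a distinct type, mixing a NONE vector with an AVX2 vector in one call to add fails template deduction instead of silently compiling.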

SINGLE_SIMD_LEVEL (inline constexpr in simd_levels.h) resolves to NONE in DD
mode and to the compiled-in level in static mode. SINGLE_SIMD_LEVEL_256 maps
through simd256_level_selector for 256-bit types (AVX512->AVX2, SVE->NEON).
Code without explicit SL context uses these. This is migration scaffolding —
subsequent diffs will replace SINGLE_SIMD_LEVEL usages with proper SL dispatch.
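
A rough sketch of how such constants could be derived, for illustration only. The macro SKETCH_ENABLE_DD, the names level256_selector, SKETCH_SINGLE_LEVEL, and SKETCH_SINGLE_LEVEL_256, and the enumerator ARM_SVE are assumptions made here and do not come from simd_levels.h:

```cpp
// Stand-in enum (assumption); enumerator names are illustrative.
enum class SIMDLevel { NONE, AVX2, AVX512, ARM_NEON, ARM_SVE };

// Maps a level to the level used for 256-bit wrapper types.
// By default a level maps to itself.
template <SIMDLevel SL>
struct level256_selector {
    static constexpr SIMDLevel value = SL;
};

// Wider levels fall back to their 256-bit (or 128-bit NEON) counterpart.
template <>
struct level256_selector<SIMDLevel::AVX512> {
    static constexpr SIMDLevel value = SIMDLevel::AVX2;
};

template <>
struct level256_selector<SIMDLevel::ARM_SVE> {
    static constexpr SIMDLevel value = SIMDLevel::ARM_NEON;
};

// SKETCH_ENABLE_DD is a hypothetical macro standing in for the DD switch.
#if defined(SKETCH_ENABLE_DD)
inline constexpr SIMDLevel SKETCH_SINGLE_LEVEL = SIMDLevel::NONE;
#elif defined(__AVX512F__)
inline constexpr SIMDLevel SKETCH_SINGLE_LEVEL = SIMDLevel::AVX512;
#elif defined(__AVX2__)
inline constexpr SIMDLevel SKETCH_SINGLE_LEVEL = SIMDLevel::AVX2;
#elif defined(__ARM_NEON)
inline constexpr SIMDLevel SKETCH_SINGLE_LEVEL = SIMDLevel::ARM_NEON;
#else
inline constexpr SIMDLevel SKETCH_SINGLE_LEVEL = SIMDLevel::NONE;
#endif

// Level to use for 256-bit types when no explicit SL is in scope.
inline constexpr SIMDLevel SKETCH_SINGLE_LEVEL_256 =
        level256_selector<SKETCH_SINGLE_LEVEL>::value;
```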

simd_result_handlers.h is no longer %include'd by SWIG (the templatized types
are unparseable by SWIG). make_knn_handler methods are %ignore'd. The Python
API does not use these internal SIMD handler types.

Pre-existing bug fixes bundled with this refactor:
- simdlib_avx512.h: simd512bit::bin() stack buffer overflow (char[257] -> char[513])
- simdlib_avx2.h: simd256bit constructor used aligned _mm256_load_si256 instead
  of unaligned _mm256_loadu_si256
- All platform headers: simd16uint16/simd32uint8 operator+=/operator-= returned
  by value instead of by reference
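
Two of these fixes are easy to illustrate in isolation. The sketch below uses hypothetical scalar types (lanes4, bin_string) and is not the simdlib code: it shows a compound assignment that returns *this by reference, and a bit-string buffer sized as one character per bit plus a terminating NUL (so 512 bits need 513 chars).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical 4-lane scalar vector.
struct lanes4 {
    std::uint16_t v[4];

    // Returns a reference to the updated object, so a chained
    // (a += b) += c modifies a itself rather than a temporary copy.
    lanes4& operator+=(const lanes4& o) {
        for (int i = 0; i < 4; i++) {
            v[i] = static_cast<std::uint16_t>(v[i] + o.v[i]);
        }
        return *this;
    }
};

// Renders NBYTES * 8 bits as '0'/'1' characters, least significant bit
// of byte 0 first.
template <std::size_t NBYTES>
std::string bin_string(const std::uint8_t (&bytes)[NBYTES]) {
    // One char per bit plus the terminating NUL.
    char buf[NBYTES * 8 + 1];
    for (std::size_t i = 0; i < NBYTES * 8; i++) {
        buf[i] = ((bytes[i / 8] >> (i % 8)) & 1) ? '1' : '0';
    }
    buf[NBYTES * 8] = '\0';
    return std::string(buf);
}
```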

Static builds: zero performance change. Template specializations produce
the same layout, ABI, and codegen as the old plain structs.

Differential Revision: D95392150
Summary:
Templatize the result handler hierarchy and scaler types on SIMDLevel SL,
defaulted to SINGLE_SIMD_LEVEL_256. This allows per-SIMD TUs to instantiate
handlers and scalers with explicit SIMD levels (e.g., AVX2) for native
dispatch.

Result handlers: ResultHandlerCompare, SingleResultHandler, HeapHandler,
ReservoirHandler, RangeHandler, PartialRangeHandler — all gain SL parameter.

Scalers: DummyScaler is templatized on SL. Its 512-bit methods use SL directly,
which removes the #ifdef __AVX512F__ guard; this is safe because template member
bodies are only instantiated when called. NormTableScaler stays non-template (public API).
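
The language rule relied on here is that member functions of a class template are instantiated only when used. A small sketch with hypothetical names (wide_vec, SketchScaler), assuming a reduced SIMDLevel enum:

```cpp
// Reduced stand-in enum (assumption).
enum class SIMDLevel { NONE, AVX512 };

// Hypothetical 512-bit wrapper: only the AVX512 specialization is defined,
// so wide_vec<SIMDLevel::NONE> is an incomplete type.
template <SIMDLevel SL>
struct wide_vec;

template <>
struct wide_vec<SIMDLevel::AVX512> {
    int lanes = 32;
};

template <SIMDLevel SL>
struct SketchScaler {
    // Valid for every level.
    int scale256(int x) const {
        return x;
    }

    // The body needs wide_vec<SL> to be complete. It is only instantiated
    // if some code calls it, so SketchScaler<NONE> remains usable as long
    // as nobody calls lanes512() on it.
    int lanes512() const {
        wide_vec<SL> w;
        return w.lanes;
    }
};
```

Calling lanes512() on SketchScaler<SIMDLevel::NONE> would be a compile error at the call site, which is the behavior a preprocessor guard would otherwise have to provide.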

FixedStorageHandler: add SL parameter, remove SIMDResultHandler base class
(never used polymorphically), remove final/virtual.

Pure refactor. All existing callers use defaults and compile unchanged.

Differential Revision: D95392149
Summary:
Move kernel templates from .cpp anonymous namespaces into includable headers,
parameterized on SIMDLevel SL. No behavior change — existing .cpp files include
the headers and instantiate with defaults.

New headers:
- kernels_simd256.h: multi-BB kernel (from search_1.cpp) + single-BB QBS
  256-bit kernel (from search_qbs.cpp non-AVX512 path)
- kernels_simd512.h: AVX512 nq1/nqx kernels + dispatcher (from search_qbs.cpp)
- decompose_qbs.h: unified kernel_accumulate_block<NQ, SL> that replaces
  #ifndef __AVX512F__ with if constexpr on SL, plus QBS decomposition logic

Template param order: <int NQ, SIMDLevel SL, class ResultHandler, class Scaler>
to enable ergonomic SL propagation via kernel_accumulate_block<Q1, SL>(...).
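
As an illustration of both points (the compile-time branch on SL and the parameter order), here is a simplified sketch. sketch_accumulate_block, SumHandler, IdentityScaler, and the block widths are hypothetical and chosen only to make the example self-contained:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Reduced stand-in enum (assumption).
enum class SIMDLevel { NONE, AVX2, AVX512 };

// Hypothetical handler and scaler.
struct SumHandler {
    long total = 0;
    void handle(int v) {
        total += v;
    }
};

struct IdentityScaler {
    int scale(int v) const {
        return v;
    }
};

// NQ and SL come first so callers spell only those two explicitly;
// ResultHandler and Scaler are deduced from the function arguments.
template <int NQ, SIMDLevel SL, class ResultHandler, class Scaler>
std::size_t sketch_accumulate_block(
        const std::vector<int>& codes,
        ResultHandler& res,
        const Scaler& scaler) {
    // Compile-time branch on the level instead of a preprocessor guard.
    std::size_t width;
    if constexpr (SL == SIMDLevel::AVX512) {
        width = 64; // illustrative block width for the wide path
    } else {
        width = 32; // illustrative block width for the 256-bit path
    }

    std::size_t nblocks = 0;
    for (std::size_t start = 0; start < codes.size(); start += width) {
        std::size_t end = std::min(start + width, codes.size());
        for (std::size_t i = start; i < end; i++) {
            for (int q = 0; q < NQ; q++) {
                res.handle(scaler.scale(codes[i]));
            }
        }
        nblocks++;
    }
    return nblocks;
}
```

A call then reads sketch_accumulate_block<2, SIMDLevel::AVX2>(codes, handler, scaler), with the last two template arguments deduced.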

~900 lines moved (code motion), ~100 lines changed. Pure refactor.

Differential Revision: D95392155
Summary:
The core DD wiring for PQ4 fast scan. Introduces PQ4CodeScanner — a virtual
base with plain-type interface that bundles handler+kernel behind the SIMD
dispatch boundary. In DD mode, handler and kernel share the same SIMDLevel
(AVX2/AVX512/NEON), selected at runtime via pq4_make_knn_scanner().
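
The idea of a plain-type virtual interface that hides a level-specific handler and kernel can be sketched as below. This is not the PQ4CodeScanner definition; SketchScanner, SketchScannerMixIn, SketchSumHandler, and sketch_kernel are hypothetical names, and the handler is assumed to expose its level as a static member.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

// Reduced stand-in enum (assumption).
enum class SIMDLevel { NONE, AVX2 };

// Plain-type interface: no SIMD wrapper types appear in any signature,
// so callers compiled without SIMD flags can use it safely.
struct SketchScanner {
    virtual ~SketchScanner() = default;
    virtual void accumulate(const std::uint8_t* codes, std::size_t n) = 0;
    virtual long result() const = 0;
};

// Hypothetical level-typed handler.
template <SIMDLevel SL>
struct SketchSumHandler {
    static constexpr SIMDLevel level = SL;
    long total = 0;
    void handle(long v) {
        total += v;
    }
};

// Hypothetical level-typed kernel (scalar body for illustration).
template <SIMDLevel SL, class Handler>
void sketch_kernel(const std::uint8_t* codes, std::size_t n, Handler& h) {
    for (std::size_t i = 0; i < n; i++) {
        h.handle(codes[i]);
    }
}

// Bundles one handler with the kernel of the same level behind the
// virtual interface.
template <class Handler>
struct SketchScannerMixIn final : SketchScanner {
    Handler handler;

    void accumulate(const std::uint8_t* codes, std::size_t n) override {
        sketch_kernel<Handler::level>(codes, n, handler);
    }

    long result() const override {
        return handler.total;
    }
};
```

Since the kernel level is taken from the handler type, a handler and kernel of different levels cannot be paired by accident.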

New files:
- NormTableScalerSL<SL>: private SL-typed mirror of NormTableScaler
- dispatching.h: ScannerMixIn<Handler> + pq4_make_knn_scanner_impl<SL>
  factory, using THE_LEVEL_TO_DISPATCH pattern (matches SQ modules)
- impl-avx2.cpp, impl-avx512.cpp, impl-neon.cpp: per-SIMD TUs

DD-required changes to handler hierarchy:
- SIMDResultHandler::handle() made non-pure (default throws) — when SL != NONE,
  derived handle(simd16uint16<AVX2>) doesn't match base handle(simd16uint16<NONE>)
- Remove 'final' from SL-templatized handle() methods (not overrides when SL != NONE)
- Add 'using SIMDResultHandler::handle' to suppress -Woverloaded-virtual
- NormTableScaler 512-bit methods guarded with !defined(FAISS_ENABLE_DD)
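
The overload/override interaction behind these changes can be reproduced with a few lines. The sketch uses hypothetical types (vec, BaseHandler, LevelHandler) in place of the real handler hierarchy:

```cpp
#include <stdexcept>

// Reduced stand-in enum (assumption).
enum class SIMDLevel { NONE, AVX2 };

// Hypothetical level-typed vector wrapper.
template <SIMDLevel SL>
struct vec {
    int payload;
};

struct BaseHandler {
    virtual ~BaseHandler() = default;

    // Non-pure with a throwing default, so derived classes whose handle()
    // takes a different vector type are still concrete.
    virtual void handle(vec<SIMDLevel::NONE>) {
        throw std::logic_error("handle() not implemented for this level");
    }
};

template <SIMDLevel SL>
struct LevelHandler : BaseHandler {
    int seen = 0;

    // Keeps the base overload visible next to the one declared below;
    // without it the derived handle() would hide the base one.
    using BaseHandler::handle;

    // For SL == NONE this overrides the base virtual. For any other level
    // the parameter type differs, so it is a separate overload and cannot
    // be marked 'override' or 'final'.
    void handle(vec<SL> v) {
        seen += v.payload;
    }
};
```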

Uses DISPATCH_SIMDLevel for runtime dispatch. NONE specialization compiled
in the base TU (pq4_fast_scan.cpp). Old pq4_accumulate_loop paths unchanged.
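
Runtime dispatch of this kind boils down to a switch from a runtime level to a template instantiation. The sketch below does not reproduce DISPATCH_SIMDLevel; make_scanner, make_scanner_impl, and the scanner types are hypothetical, and all instantiations live in one file here, whereas a DD build would compile each level in its own TU with matching compiler flags.

```cpp
#include <memory>
#include <stdexcept>

// Reduced stand-in enum (assumption).
enum class SIMDLevel { NONE, AVX2, AVX512 };

struct SketchScanner {
    virtual ~SketchScanner() = default;
    virtual SIMDLevel level() const = 0;
};

template <SIMDLevel SL>
struct SketchScannerImpl final : SketchScanner {
    SIMDLevel level() const override {
        return SL;
    }
};

// Per-level factory; in a DD build each instantiation would be compiled
// in the translation unit for that level.
template <SIMDLevel SL>
std::unique_ptr<SketchScanner> make_scanner_impl() {
    return std::make_unique<SketchScannerImpl<SL>>();
}

// Maps the level detected at runtime to the matching instantiation.
inline std::unique_ptr<SketchScanner> make_scanner(SIMDLevel runtime_level) {
    switch (runtime_level) {
        case SIMDLevel::AVX512:
            return make_scanner_impl<SIMDLevel::AVX512>();
        case SIMDLevel::AVX2:
            return make_scanner_impl<SIMDLevel::AVX2>();
        case SIMDLevel::NONE:
            return make_scanner_impl<SIMDLevel::NONE>();
    }
    throw std::invalid_argument("unknown SIMD level");
}
```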

Differential Revision: D95392151
Summary:
The "moment of truth" diff: flat PQ4 search now runs through the per-SIMD
PQ4CodeScanner in DD mode, executing AVX2/AVX512/NEON kernels natively
instead of scalar emulation.

Changes:
- Add make_knn_scanner() virtual to IndexFastScan — returns PQ4CodeScanner
  from pq4_make_knn_scanner() factory. Used by search_implem_12 (QBS path)
  and search_implem_14 (multi-BB path).
- When scanner is available, search uses scanner->accumulate_loop_qbs()
  instead of the free pq4_accumulate_loop_qbs() + dynamic_cast chain.
- Fallback: if make_knn_scanner() returns nullptr, the old make_knn_handler()
  path is used (for RaBitQ and future custom handlers).
- IndexRaBitQFastScan::make_knn_scanner() returns nullptr (RaBitQ uses
  custom handlers; scanner support pending).
- SWIG: ignore PQ4CodeScanner, make_knn_scanner, pq4_make_knn_scanner
  (internal dispatch machinery, not Python API).
- IndexFastScan.h includes pq4_fast_scan.h for complete PQ4CodeScanner type
  (required by unique_ptr in SWIG-generated destructors).
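
The scanner-first, handler-fallback control flow can be sketched as follows. The classes are hypothetical and return strings only to make the chosen path observable; they do not mirror the IndexFastScan signatures.

```cpp
#include <memory>
#include <string>

// Hypothetical scanner that reports which path ran.
struct SketchScanner {
    std::string run() const {
        return "scanner";
    }
};

struct SketchIndex {
    virtual ~SketchIndex() = default;

    // Returning nullptr means "this index type has no scanner".
    virtual std::unique_ptr<SketchScanner> make_scanner() const {
        return std::make_unique<SketchScanner>();
    }

    // Stand-in for the older handler-based path.
    std::string legacy_search() const {
        return "handler";
    }

    std::string search() const {
        // Prefer the scanner; fall back when the factory declines.
        if (auto scanner = make_scanner()) {
            return scanner->run();
        }
        return legacy_search();
    }
};

// Index type with custom handlers and no scanner support.
struct SketchCustomIndex : SketchIndex {
    std::unique_ptr<SketchScanner> make_scanner() const override {
        return nullptr;
    }
};
```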

IVF path unchanged (planned for follow-up diff).

Differential Revision: D95392156
Summary:
Extends DD SIMD dispatch to the IVF fast scan path. All three IVF search
implementations (implem 10, 12, 14) now use PQ4CodeScanner when available.

Changes:
- Add make_knn_scanner() virtual to IndexIVFFastScan (with_id_map=true).
- search_implem_10 and search_implem_12: accept optional PQ4CodeScanner*
  parameter. When non-null, use scanner->accumulate_loop[_qbs]() instead
  of the free pq4_accumulate_loop[_qbs]() functions. Handler fields
  (id_map, q_map, ntotal, dbias, set_list_context) are still configured
  per inverted list via scanner->handler().
- search_implem_14: create per-thread scanner in the OMP parallel region.
- search_dispatch_implem: try make_knn_scanner() first; fallback to
  make_knn_handler() for RaBitQ.
- IndexIVFPQFastScan::scan_codes(): switch from make_knn_handler() +
  pq4_accumulate_loop() to make_knn_scanner() + scanner->accumulate_loop().
- IndexIVFRaBitQFastScan::make_knn_scanner() returns nullptr.
- SWIG: ignore search_implem_10/12 (PQ4CodeScanner* parameter).

HeapHandler normalizers optimization: not supported through the scanner factory
(thresholds default to neutral values). Results remain correct; the only effect
is a minor pruning difference in incremental/resumable search scenarios.

Differential Revision: D95392152
@meta-cla meta-cla bot added the CLA Signed label Mar 5, 2026
@meta-codesync

meta-codesync bot commented Mar 5, 2026

@algoriddle has exported this pull request. If you are a Meta employee, you can view the originating Diff in D95392153.
