Wire flat RaBitQ search through PQ4CodeScanner dispatch by algoriddle · Pull Request #4870 · facebookresearch/faiss

algoriddle · 2026-03-05T16:12:17Z

Summary:
Parameterize RaBitQHeapHandler on SIMDLevel and wire it through the
PQ4CodeScanner dispatch boundary. Flat RaBitQ search now uses native SIMD
kernels in DD mode.

Handler changes:

- RaBitQHeapHandler<C, W> -> RaBitQHeapHandler<C, W, SL> (defaulted)
- Move member function definitions from .cpp to header (required for
  per-SIMD TU instantiation with non-default SL)
- Change context from reference to pointer (allows post-construction set)
- Add 'using SIMDResultHandler::handle' for DD compatibility
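
The handler changes above can be illustrated with a minimal sketch. All names here (`ToyHeapHandler`, `SearchContext`, `kDefaultLevel`) are hypothetical stand-ins, not the real faiss declarations; the template parameters `C` and `W` are treated as opaque. The sketch only shows the shape of a defaulted SIMD-level parameter and a context held by pointer so that it can be set after construction.

```cpp
#include <cassert>

// Hypothetical stand-in for the SIMD level enumeration (assumption).
enum class SIMDLevel { NONE, AVX2, AVX512 };

// Assumed default used when a caller does not name a level explicitly.
inline constexpr SIMDLevel kDefaultLevel = SIMDLevel::NONE;

// Hypothetical per-query context consumed by the handler.
struct SearchContext {
    float query_norm = 0.0f;
};

// C and W are kept opaque; SL is the new, defaulted parameter.
template <class C, bool W, SIMDLevel SL = kDefaultLevel>
struct ToyHeapHandler {
    static constexpr SIMDLevel level = SL;

    // Held by pointer rather than by reference, so it can be null at
    // construction time and assigned later.
    const SearchContext* context = nullptr;

    void set_context(const SearchContext* ctx) { context = ctx; }

    float bias() const {
        assert(context != nullptr); // must be set before use
        return context->query_norm;
    }
};
```

Existing two-argument uses such as `ToyHeapHandler<int, false>` keep compiling unchanged, while a per-SIMD translation unit could name the level explicitly, for example `ToyHeapHandler<int, false, SIMDLevel::AVX2>`.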

Scanner wiring:

- Create rabitq_dispatching.h with RaBitQScannerMixIn + factory
- Add context parameter to make_knn_scanner() virtual interface
- IndexRaBitQFastScan::make_knn_scanner() returns scanner via factory
- Include rabitq_dispatching.h in all per-SIMD TUs + NONE base TU
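
As a rough illustration of the mix-in plus factory arrangement, the sketch below wraps a level-typed handler behind a plain virtual interface and picks the instantiation at runtime. Every identifier (`ScannerBase`, `ScannerMixInSketch`, `HandlerSketch`, `make_scanner`) is invented for this example, and the `switch` is an assumed simplification of whatever dispatch macro the real code uses.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the SIMD level enumeration (assumption).
enum class SIMDLevel { NONE, AVX2 };

// Plain-type virtual interface: callers never see SIMD-typed arguments.
struct ScannerBase {
    virtual ~ScannerBase() = default;
    virtual SIMDLevel level() const = 0;
};

// Hypothetical handler carrying its SIMD level as a template parameter.
template <SIMDLevel SL>
struct HandlerSketch {
    static constexpr SIMDLevel level = SL;
};

// Mix-in: bundles a concrete handler behind the virtual interface.
template <class Handler>
struct ScannerMixInSketch : ScannerBase {
    Handler handler;
    SIMDLevel level() const override { return Handler::level; }
};

// Per-level factory; in a real build each instantiation would live in
// its own translation unit compiled with matching flags.
template <SIMDLevel SL>
std::unique_ptr<ScannerBase> make_scanner_impl() {
    return std::make_unique<ScannerMixInSketch<HandlerSketch<SL>>>();
}

// Runtime dispatch on the detected level (simplified to a switch).
inline std::unique_ptr<ScannerBase> make_scanner(SIMDLevel detected) {
    switch (detected) {
        case SIMDLevel::AVX2:
            return make_scanner_impl<SIMDLevel::AVX2>();
        default:
            return make_scanner_impl<SIMDLevel::NONE>();
    }
}
```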

Static build impact: ZERO.

Differential Revision: D95392153

Summary: Templatize all simd wrapper types (simd16uint16, simd32uint8, simd8float32, etc.) on SIMDLevel. This is the foundation for PQ4 fast scan Dynamic Dispatch.

Primary templates are declared in simdlib.h. Each platform header provides explicit specializations:

- simdlib_avx2.h: simd16uint16<AVX2>, simd32uint8<AVX2>, etc.
- simdlib_avx512.h: simd32uint16<AVX512>, simd64uint8<AVX512>, etc.
- simdlib_neon.h: simd16uint16<ARM_NEON>, etc.
- simdlib_emulated.h: simd16uint16<NONE>, etc. (always included)
- simdlib_ppc64.h: simd16uint16<NONE>, etc. (PPC-optimized scalar)

SINGLE_SIMD_LEVEL (inline constexpr in simd_levels.h) resolves to NONE in DD mode and to the compiled-in level in static mode. SINGLE_SIMD_LEVEL_256 maps through simd256_level_selector for 256-bit types (AVX512->AVX2, SVE->NEON). Code without explicit SL context uses these. This is migration scaffolding; subsequent diffs will replace SINGLE_SIMD_LEVEL usages with proper SL dispatch.

simd_result_handlers.h is no longer %include'd by SWIG (the templatized types are unparseable by SWIG). make_knn_handler methods are %ignore'd. The Python API does not use these internal SIMD handler types.

Pre-existing bug fixes bundled with this refactor:

- simdlib_avx512.h: simd512bit::bin() stack buffer overflow (char[257] -> char[513])
- simdlib_avx2.h: simd256bit constructor used aligned _mm256_load_si256 instead of unaligned _mm256_loadu_si256
- All platform headers: simd16uint16/simd32uint8 operator+=/operator-= returned by value instead of by reference

Static builds: zero performance change. Template specializations produce identical layout, ABI, and codegen as the old plain structs.

Differential Revision: D95392150
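
To make the "primary template plus explicit specialization" layout and the operator+= fix concrete, here is a small sketch using only a scalar (emulated) specialization. The type name `simd16uint16_sketch` and the enum are illustrative assumptions, not the actual faiss types, and no intrinsics are used so the block builds on any platform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the SIMD level enumeration (assumption).
enum class SIMDLevel { NONE, AVX2 };

// Primary template: declared only. Each platform header would supply
// an explicit specialization for the level it implements.
template <SIMDLevel SL>
struct simd16uint16_sketch;

// Scalar emulation, standing in for the always-available NONE level.
template <>
struct simd16uint16_sketch<SIMDLevel::NONE> {
    uint16_t lanes[16] = {};

    explicit simd16uint16_sketch(uint16_t v) {
        for (auto& x : lanes) {
            x = v;
        }
    }

    // Returning *this by reference means a chained update such as
    // (a += b) += b modifies a itself, not a temporary copy.
    simd16uint16_sketch& operator+=(const simd16uint16_sketch& other) {
        for (size_t i = 0; i < 16; i++) {
            lanes[i] = static_cast<uint16_t>(lanes[i] + other.lanes[i]);
        }
        return *this;
    }
};
```

With a by-value return, the second `+=` in a chained expression would act on a copy and the original would only be incremented once, which is the kind of silent discrepancy the by-reference fix removes.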

Summary: Templatize the result handler hierarchy and scaler types on SIMDLevel SL, defaulted to SINGLE_SIMD_LEVEL_256. This allows per-SIMD TUs to instantiate handlers and scalers with explicit SIMD levels (e.g., AVX2) for native dispatch.

Result handlers: ResultHandlerCompare, SingleResultHandler, HeapHandler, ReservoirHandler, RangeHandler, PartialRangeHandler all gain SL parameter.

Scalers: DummyScaler templatized on SL. 512-bit methods use SL directly (removing #ifdef __AVX512F__ guard; safe because template bodies only instantiated when called). NormTableScaler stays non-template (public API).

FixedStorageHandler: add SL parameter, remove SIMDResultHandler base class (never used polymorphically), remove final/virtual.

Pure refactor. All existing callers use defaults and compile unchanged.

Differential Revision: D95392149

Summary: Move kernel templates from .cpp anonymous namespaces into includable headers, parameterized on SIMDLevel SL. No behavior change: existing .cpp files include the headers and instantiate with defaults.

New headers:

- kernels_simd256.h: multi-BB kernel (from search_1.cpp) + single-BB QBS 256-bit kernel (from search_qbs.cpp non-AVX512 path)
- kernels_simd512.h: AVX512 nq1/nqx kernels + dispatcher (from search_qbs.cpp)
- decompose_qbs.h: unified kernel_accumulate_block<NQ, SL> that replaces #ifndef __AVX512F__ with if constexpr on SL, plus QBS decomposition logic

Template param order: <int NQ, SIMDLevel SL, class ResultHandler, class Scaler> to enable ergonomic SL propagation via kernel_accumulate_block<Q1, SL>(...).

~900 lines moved (code motion), ~100 lines changed. Pure refactor.

Differential Revision: D95392155
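
The following sketch shows, under simplified assumptions, how `if constexpr` on a level parameter can take the place of a preprocessor guard, and why placing `NQ` and `SL` first lets the handler and scaler types be deduced. The function `accumulate_block_sketch`, the `CountingHandler` and `UnitScaler` types, and the byte-counting body are all invented for illustration; the real kernels do actual SIMD work.

```cpp
#include <cassert>

// Hypothetical stand-in for the SIMD level enumeration (assumption).
enum class SIMDLevel { NONE, AVX2, AVX512 };

// Toy handler and scaler, only here so the trailing template
// parameters have something to deduce from.
struct CountingHandler {
    int bytes = 0;
};
struct UnitScaler {
    int factor = 1;
};

// NQ and SL come first so callers can write <NQ, SL> and let the
// compiler deduce ResultHandler and Scaler from the arguments.
template <int NQ, SIMDLevel SL, class ResultHandler, class Scaler>
int accumulate_block_sketch(ResultHandler& res, const Scaler& scaler) {
    int width = 0;
    if constexpr (SL == SIMDLevel::AVX512) {
        width = 64; // bytes in a 512-bit register
    } else {
        width = 32; // bytes in a 256-bit register
    }
    res.bytes += NQ * width * scaler.factor;
    return width;
}
```

Because the branch is chosen per instantiation, the 512-bit path and the 256-bit path can coexist in one header without relying on which compiler flags the including file was built with.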

Summary: The core DD wiring for PQ4 fast scan. Introduces PQ4CodeScanner, a virtual base with plain-type interface that bundles handler+kernel behind the SIMD dispatch boundary. In DD mode, handler and kernel share the same SIMDLevel (AVX2/AVX512/NEON), selected at runtime via pq4_make_knn_scanner().

New files:

- NormTableScalerSL<SL>: private SL-typed mirror of NormTableScaler
- dispatching.h: ScannerMixIn<Handler> + pq4_make_knn_scanner_impl<SL> factory, using THE_LEVEL_TO_DISPATCH pattern (matches SQ modules)
- impl-avx2.cpp, impl-avx512.cpp, impl-neon.cpp: per-SIMD TUs

DD-required changes to handler hierarchy:

- SIMDResultHandler::handle() made non-pure (default throws): when SL != NONE, derived handle(simd16uint16<AVX2>) doesn't match base handle(simd16uint16<NONE>)
- Remove 'final' from SL-templatized handle() methods (not overrides when SL != NONE)
- Add 'using SIMDResultHandler::handle' to suppress -Woverloaded-virtual
- NormTableScaler 512-bit methods guarded with !defined(FAISS_ENABLE_DD)

Uses DISPATCH_SIMDLevel for runtime dispatch. NONE specialization compiled in the base TU (pq4_fast_scan.cpp). Old pq4_accumulate_loop paths unchanged.

Differential Revision: D95392151
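
The handler-hierarchy changes hinge on a C++ overload detail: a derived `handle` taking a differently typed argument is a new overload, not an override, and hides the base one unless a using-declaration re-exposes it. A minimal sketch, with hypothetical `Vec`, `BaseHandler`, and `DerivedHandler` types standing in for the real classes:

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for the SIMD level enumeration (assumption).
enum class SIMDLevel { NONE, AVX2 };

// Toy level-tagged vector; distinct levels are distinct types.
template <SIMDLevel SL>
struct Vec {
    int v;
};

struct BaseHandler {
    virtual ~BaseHandler() = default;
    // Non-pure with a throwing default, so a derived class that never
    // overrides the NONE-typed overload is still instantiable.
    virtual void handle(Vec<SIMDLevel::NONE>) {
        throw std::logic_error("handle(NONE) not implemented");
    }
};

template <SIMDLevel SL>
struct DerivedHandler : BaseHandler {
    // Keeps the base overload visible; without this, the declaration
    // below would hide it and trigger -Woverloaded-virtual.
    using BaseHandler::handle;

    int sum = 0;

    // Overrides the base virtual only when SL == NONE; for any other
    // level it is a separate overload, so it cannot be marked final
    // or override.
    void handle(Vec<SL> x) { sum += x.v; }
};
```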

Summary: The "moment of truth" diff: flat PQ4 search now runs through the per-SIMD PQ4CodeScanner in DD mode, executing AVX2/AVX512/NEON kernels natively instead of scalar emulation.

Changes:

- Add make_knn_scanner() virtual to IndexFastScan; returns PQ4CodeScanner from pq4_make_knn_scanner() factory. Used by search_implem_12 (QBS path) and search_implem_14 (multi-BB path).
- When scanner is available, search uses scanner->accumulate_loop_qbs() instead of the free pq4_accumulate_loop_qbs() + dynamic_cast chain.
- Fallback: if make_knn_scanner() returns nullptr, the old make_knn_handler() path is used (for RaBitQ and future custom handlers).
- IndexRaBitQFastScan::make_knn_scanner() returns nullptr (RaBitQ uses custom handlers; scanner support pending).
- SWIG: ignore PQ4CodeScanner, make_knn_scanner, pq4_make_knn_scanner (internal dispatch machinery, not Python API).
- IndexFastScan.h includes pq4_fast_scan.h for complete PQ4CodeScanner type (required by unique_ptr in SWIG-generated destructors).

IVF path unchanged (planned for follow-up diff).

Differential Revision: D95392156
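
The scanner-first, handler-fallback control flow described in this commit can be sketched as follows. The classes (`Scanner`, `NativeScanner`, `ToyIndex`, `ToyCustomIndex`) and the string return values are purely illustrative assumptions; they only model the decision of which path runs, not any real search.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical scanner interface hiding the SIMD-specific kernel.
struct Scanner {
    virtual ~Scanner() = default;
    virtual std::string run() = 0;
};

struct NativeScanner : Scanner {
    std::string run() override { return "scanner"; }
};

struct ToyIndex {
    virtual ~ToyIndex() = default;

    // Default: a scanner is available.
    virtual std::unique_ptr<Scanner> make_scanner() const {
        return std::make_unique<NativeScanner>();
    }

    std::string search() const {
        // Prefer the scanner when the index provides one.
        if (auto scanner = make_scanner()) {
            return scanner->run();
        }
        // Otherwise fall back to the older handler-based path.
        return "legacy handler";
    }
};

// An index with custom handlers opts out by returning nullptr.
struct ToyCustomIndex : ToyIndex {
    std::unique_ptr<Scanner> make_scanner() const override {
        return nullptr;
    }
};
```

Returning `nullptr` as the opt-out signal keeps the base class search code identical for both kinds of index, which is what lets subclasses migrate to the scanner path one at a time.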

Summary: Extends DD SIMD dispatch to the IVF fast scan path. All three IVF search implementations (implem 10, 12, 14) now use PQ4CodeScanner when available.

Changes:

- Add make_knn_scanner() virtual to IndexIVFFastScan (with_id_map=true).
- search_implem_10 and search_implem_12: accept optional PQ4CodeScanner* parameter. When non-null, use scanner->accumulate_loop[_qbs]() instead of the free pq4_accumulate_loop[_qbs]() functions. Handler fields (id_map, q_map, ntotal, dbias, set_list_context) are still configured per inverted list via scanner->handler().
- search_implem_14: create per-thread scanner in the OMP parallel region.
- search_dispatch_implem: try make_knn_scanner() first; fallback to make_knn_handler() for RaBitQ.
- IndexIVFPQFastScan::scan_codes(): switch from make_knn_handler() + pq4_accumulate_loop() to make_knn_scanner() + scanner->accumulate_loop().
- IndexIVFRaBitQFastScan::make_knn_scanner() returns nullptr.
- SWIG: ignore search_implem_10/12 (PQ4CodeScanner* parameter).

HeapHandler normalizers optimization: not supported through scanner factory (thresholds default to neutral). Correct results; minor pruning difference only affects incremental/resumable search scenarios.

Differential Revision: D95392152


meta-codesync · 2026-03-05T16:12:58Z

@algoriddle has exported this pull request. If you are a Meta employee, you can view the originating Diff in D95392153.

algoriddle added 7 commits March 5, 2026 08:08

meta-cla bot added the CLA Signed label Mar 5, 2026

meta-codesync bot added fb-exported meta-exported labels Mar 5, 2026

algoriddle wants to merge 7 commits into facebookresearch:main from algoriddle:export-D95392153
