Wire flat RaBitQ search through PQ4CodeScanner dispatch #4870
Open
algoriddle wants to merge 7 commits into facebookresearch:main
Summary:
Templatize all simd wrapper types (simd16uint16, simd32uint8, simd8float32, etc.) on SIMDLevel. This is the foundation for PQ4 fast scan Dynamic Dispatch.

Primary templates are declared in simdlib.h. Each platform header provides explicit specializations:
- simdlib_avx2.h: simd16uint16<AVX2>, simd32uint8<AVX2>, etc.
- simdlib_avx512.h: simd32uint16<AVX512>, simd64uint8<AVX512>, etc.
- simdlib_neon.h: simd16uint16<ARM_NEON>, etc.
- simdlib_emulated.h: simd16uint16<NONE>, etc. (always included)
- simdlib_ppc64.h: simd16uint16<NONE>, etc. (PPC-optimized scalar)

SINGLE_SIMD_LEVEL (inline constexpr in simd_levels.h) resolves to NONE in DD mode and to the compiled-in level in static mode. SINGLE_SIMD_LEVEL_256 maps through simd256_level_selector for 256-bit types (AVX512->AVX2, SVE->NEON). Code without explicit SL context uses these. This is migration scaffolding: subsequent diffs will replace SINGLE_SIMD_LEVEL usages with proper SL dispatch.

simd_result_handlers.h is no longer %include'd by SWIG (the templatized types are unparseable by SWIG). make_knn_handler methods are %ignore'd. The Python API does not use these internal SIMD handler types.

Pre-existing bug fixes bundled with this refactor:
- simdlib_avx512.h: simd512bit::bin() stack buffer overflow (char[257] -> char[513])
- simdlib_avx2.h: simd256bit constructor used aligned _mm256_load_si256 instead of unaligned _mm256_loadu_si256
- All platform headers: simd16uint16/simd32uint8 operator+=/operator-= returned by value instead of by reference

Static builds: zero performance change. Template specializations produce identical layout, ABI, and codegen as the old plain structs.

Differential Revision: D95392150
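The specialization pattern and the operator+= fix described above can be sketched as follows. This is a minimal illustration, not the faiss code: `simd4uint16` is a hypothetical 4-lane type, the two-value `SIMDLevel` enum is a stand-in, and only the emulated (NONE) specialization is shown.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the real SIMDLevel enum.
enum class SIMDLevel { NONE, AVX2 };

// Primary template: declared but not defined, so using an unsupported
// level fails at compile time instead of silently falling back.
template <SIMDLevel SL>
struct simd4uint16; // hypothetical 4-lane type, for illustration only

// Explicit specialization for the emulated (scalar) level.
template <>
struct simd4uint16<SIMDLevel::NONE> {
    uint16_t v[4];

    explicit simd4uint16(uint16_t x) : v{x, x, x, x} {}

    // Returns *this by reference, so chained updates act on the same object.
    simd4uint16& operator+=(const simd4uint16& other) {
        for (int i = 0; i < 4; i++) {
            v[i] = static_cast<uint16_t>(v[i] + other.v[i]);
        }
        return *this;
    }
};

// Chained compound assignment: both additions must land in `a`.
inline uint16_t chained_add_sketch() {
    simd4uint16<SIMDLevel::NONE> a(1), b(2);
    (a += b) += b;
    assert(a.v[0] == 5); // would be 3 if operator+= returned a copy
    return a.v[0];
}
```

With a by-value return, the second `+=` would modify a temporary and `a` would only be updated once, which is the kind of behavior the bundled fix removes.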
Summary:
Templatize the result handler hierarchy and scaler types on SIMDLevel SL, defaulted to SINGLE_SIMD_LEVEL_256. This allows per-SIMD TUs to instantiate handlers and scalers with explicit SIMD levels (e.g., AVX2) for native dispatch.

Result handlers: ResultHandlerCompare, SingleResultHandler, HeapHandler, ReservoirHandler, RangeHandler, PartialRangeHandler all gain an SL parameter.

Scalers: DummyScaler templatized on SL. 512-bit methods use SL directly, removing the #ifdef __AVX512F__ guard; this is safe because template bodies are only instantiated when called. NormTableScaler stays non-template (public API).

FixedStorageHandler: add SL parameter, remove SIMDResultHandler base class (never used polymorphically), remove final/virtual.

Pure refactor. All existing callers use defaults and compile unchanged.

Differential Revision: D95392149
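Why a defaulted SL parameter leaves existing callers untouched can be shown with a small sketch. The names `HeapHandlerSketch` and `DEFAULT_LEVEL_SKETCH` are hypothetical, and the default value NONE is an assumption standing in for whatever SINGLE_SIMD_LEVEL_256 resolves to in a given build.

```cpp
// Hypothetical stand-in for the real SIMDLevel enum.
enum class SIMDLevel { NONE, AVX2 };

// Stand-in for SINGLE_SIMD_LEVEL_256; the value chosen here is an assumption.
inline constexpr SIMDLevel DEFAULT_LEVEL_SKETCH = SIMDLevel::NONE;

// The new trailing parameter has a default, so old spellings stay valid.
template <class C, SIMDLevel SL = DEFAULT_LEVEL_SKETCH>
struct HeapHandlerSketch {
    static constexpr SIMDLevel level = SL;
};

// Existing caller, written before the SL parameter existed.
using OldStyle = HeapHandlerSketch<int>;
// Per-SIMD translation unit, naming the level explicitly.
using Avx2Style = HeapHandlerSketch<int, SIMDLevel::AVX2>;

static_assert(OldStyle::level == SIMDLevel::NONE, "default level applies");
static_assert(Avx2Style::level == SIMDLevel::AVX2, "explicit level applies");
```

The two aliases name distinct types, which is what lets a per-SIMD TU instantiate its own handler without affecting code that relies on the default.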
Summary:
Move kernel templates from .cpp anonymous namespaces into includable headers, parameterized on SIMDLevel SL. No behavior change: existing .cpp files include the headers and instantiate with defaults.

New headers:
- kernels_simd256.h: multi-BB kernel (from search_1.cpp) + single-BB QBS 256-bit kernel (from search_qbs.cpp non-AVX512 path)
- kernels_simd512.h: AVX512 nq1/nqx kernels + dispatcher (from search_qbs.cpp)
- decompose_qbs.h: unified kernel_accumulate_block<NQ, SL> that replaces #ifndef __AVX512F__ with if constexpr on SL, plus QBS decomposition logic

Template param order: <int NQ, SIMDLevel SL, class ResultHandler, class Scaler> to enable ergonomic SL propagation via kernel_accumulate_block<Q1, SL>(...).

~900 lines moved (code motion), ~100 lines changed. Pure refactor.

Differential Revision: D95392155
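A rough sketch of the two ideas in this commit, branch selection with `if constexpr` on SL and the parameter order that lets trailing types be deduced, might look like this. `accumulate_block_sketch` and `CountingHandler` are hypothetical names, and the 256/512 bit counts are placeholders for the real kernel bodies.

```cpp
#include <cassert>

// Hypothetical stand-in for the real SIMDLevel enum.
enum class SIMDLevel { NONE, AVX2, AVX512 };

// Toy result handler that just records how many bits were "processed".
struct CountingHandler {
    int bits_seen = 0;
};

// NQ and SL come first so callers can name them explicitly and let the
// handler type be deduced from the argument.
template <int NQ, SIMDLevel SL, class ResultHandler>
void accumulate_block_sketch(ResultHandler& res) {
    // Compile-time branch on SL instead of a preprocessor guard:
    // only the selected branch is instantiated for a given SL.
    if constexpr (SL == SIMDLevel::AVX512) {
        res.bits_seen += NQ * 512; // placeholder for the 512-bit kernel
    } else {
        res.bits_seen += NQ * 256; // placeholder for the 256-bit kernel
    }
}

inline int usage_sketch() {
    CountingHandler h;
    accumulate_block_sketch<2, SIMDLevel::AVX512>(h); // adds 2 * 512
    accumulate_block_sketch<1, SIMDLevel::AVX2>(h);   // adds 1 * 256
    assert(h.bits_seen == 1280);
    return h.bits_seen;
}
```

Unlike a preprocessor guard, both branches live in one template and the choice is made per instantiation, so a single translation unit can hold several levels side by side.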
Summary:
The core DD wiring for PQ4 fast scan. Introduces PQ4CodeScanner, a virtual base with plain-type interface that bundles handler+kernel behind the SIMD dispatch boundary. In DD mode, handler and kernel share the same SIMDLevel (AVX2/AVX512/NEON), selected at runtime via pq4_make_knn_scanner().

New files:
- NormTableScalerSL<SL>: private SL-typed mirror of NormTableScaler
- dispatching.h: ScannerMixIn<Handler> + pq4_make_knn_scanner_impl<SL> factory, using THE_LEVEL_TO_DISPATCH pattern (matches SQ modules)
- impl-avx2.cpp, impl-avx512.cpp, impl-neon.cpp: per-SIMD TUs

DD-required changes to handler hierarchy:
- SIMDResultHandler::handle() made non-pure (default throws): when SL != NONE, derived handle(simd16uint16<AVX2>) doesn't match base handle(simd16uint16<NONE>)
- Remove 'final' from SL-templatized handle() methods (not overrides when SL != NONE)
- Add 'using SIMDResultHandler::handle' to suppress -Woverloaded-virtual
- NormTableScaler 512-bit methods guarded with !defined(FAISS_ENABLE_DD)

Uses DISPATCH_SIMDLevel for runtime dispatch. NONE specialization compiled in the base TU (pq4_fast_scan.cpp). Old pq4_accumulate_loop paths unchanged.

Differential Revision: D95392151
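The shape of the dispatch boundary (a plain-type virtual base, a mix-in templated on the handler, and a factory that picks the level at runtime) can be sketched as below. All names ending in `Sketch` are hypothetical, the `switch` stands in for the DISPATCH_SIMDLevel macro, and everything lives in one file here, whereas the commit places each level in its own TU.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for the real SIMDLevel enum.
enum class SIMDLevel { NONE, AVX2 };

// Plain-type interface: no SIMD-typed arguments cross this boundary.
struct ScannerSketch {
    virtual std::string level_name() const = 0;
    virtual ~ScannerSketch() = default;
};

// SL-typed handlers; in a DD build each would live in its own per-SIMD TU.
template <SIMDLevel SL>
struct HandlerSketch;
template <>
struct HandlerSketch<SIMDLevel::NONE> {
    static const char* name() { return "NONE"; }
};
template <>
struct HandlerSketch<SIMDLevel::AVX2> {
    static const char* name() { return "AVX2"; }
};

// Bundles a concrete handler behind the virtual interface.
template <class Handler>
struct ScannerMixInSketch : ScannerSketch {
    Handler handler;
    std::string level_name() const override { return Handler::name(); }
};

template <SIMDLevel SL>
std::unique_ptr<ScannerSketch> make_scanner_impl_sketch() {
    return std::make_unique<ScannerMixInSketch<HandlerSketch<SL>>>();
}

// Runtime selection; a hand-written stand-in for the dispatch macro.
inline std::unique_ptr<ScannerSketch> make_scanner_sketch(SIMDLevel level) {
    switch (level) {
        case SIMDLevel::AVX2:
            return make_scanner_impl_sketch<SIMDLevel::AVX2>();
        default:
            return make_scanner_impl_sketch<SIMDLevel::NONE>();
    }
}

inline void usage_sketch() {
    auto scanner = make_scanner_sketch(SIMDLevel::AVX2);
    assert(scanner->level_name() == "AVX2");
}
```

Callers only ever hold a pointer to the base type, so code compiled without any SIMD flags can drive a kernel that was compiled with them.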
Summary:
The "moment of truth" diff: flat PQ4 search now runs through the per-SIMD PQ4CodeScanner in DD mode, executing AVX2/AVX512/NEON kernels natively instead of scalar emulation.

Changes:
- Add make_knn_scanner() virtual to IndexFastScan; it returns a PQ4CodeScanner from the pq4_make_knn_scanner() factory. Used by search_implem_12 (QBS path) and search_implem_14 (multi-BB path).
- When scanner is available, search uses scanner->accumulate_loop_qbs() instead of the free pq4_accumulate_loop_qbs() + dynamic_cast chain.
- Fallback: if make_knn_scanner() returns nullptr, the old make_knn_handler() path is used (for RaBitQ and future custom handlers).
- IndexRaBitQFastScan::make_knn_scanner() returns nullptr (RaBitQ uses custom handlers; scanner support pending).
- SWIG: ignore PQ4CodeScanner, make_knn_scanner, pq4_make_knn_scanner (internal dispatch machinery, not Python API).
- IndexFastScan.h includes pq4_fast_scan.h for complete PQ4CodeScanner type (required by unique_ptr in SWIG-generated destructors).

IVF path unchanged (planned for follow-up diff).

Differential Revision: D95392156
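The scanner-first, handler-fallback control flow listed above can be sketched like this. `IndexSketch`, `CustomHandlerIndexSketch`, and the returned strings are hypothetical placeholders; the real methods take search parameters and return distances and labels.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for PQ4CodeScanner.
struct ScannerSketch {
    virtual std::string accumulate() { return "scanner kernel"; }
    virtual ~ScannerSketch() = default;
};

struct IndexSketch {
    // By default a scanner is available.
    virtual std::unique_ptr<ScannerSketch> make_knn_scanner() const {
        return std::make_unique<ScannerSketch>();
    }
    virtual ~IndexSketch() = default;

    std::string search() const {
        // Prefer the scanner; a null result means "not supported here".
        if (auto scanner = make_knn_scanner()) {
            return scanner->accumulate();
        }
        // Fallback: stands in for the older make_knn_handler() path.
        return "legacy handler path";
    }
};

// An index type with custom handlers opts out by returning nullptr.
struct CustomHandlerIndexSketch : IndexSketch {
    std::unique_ptr<ScannerSketch> make_knn_scanner() const override {
        return nullptr;
    }
};

inline void usage_sketch() {
    assert(IndexSketch().search() == "scanner kernel");
    assert(CustomHandlerIndexSketch().search() == "legacy handler path");
}
```

Returning nullptr as the opt-out signal keeps the base search code unaware of which index types have been migrated.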
Summary:
Extends DD SIMD dispatch to the IVF fast scan path. All three IVF search implementations (implem 10, 12, 14) now use PQ4CodeScanner when available.

Changes:
- Add make_knn_scanner() virtual to IndexIVFFastScan (with_id_map=true).
- search_implem_10 and search_implem_12: accept optional PQ4CodeScanner* parameter. When non-null, use scanner->accumulate_loop[_qbs]() instead of the free pq4_accumulate_loop[_qbs]() functions. Handler fields (id_map, q_map, ntotal, dbias, set_list_context) are still configured per inverted list via scanner->handler().
- search_implem_14: create per-thread scanner in the OMP parallel region.
- search_dispatch_implem: try make_knn_scanner() first; fallback to make_knn_handler() for RaBitQ.
- IndexIVFPQFastScan::scan_codes(): switch from make_knn_handler() + pq4_accumulate_loop() to make_knn_scanner() + scanner->accumulate_loop().
- IndexIVFRaBitQFastScan::make_knn_scanner() returns nullptr.
- SWIG: ignore search_implem_10/12 (PQ4CodeScanner* parameter).

HeapHandler normalizers optimization: not supported through scanner factory (thresholds default to neutral). Correct results; minor pruning difference only affects incremental/resumable search scenarios.

Differential Revision: D95392152
Summary:
Parameterize RaBitQHeapHandler on SIMDLevel and wire it through the PQ4CodeScanner dispatch boundary. Flat RaBitQ search now uses native SIMD kernels in DD mode.

Handler changes:
- RaBitQHeapHandler<C, W> -> RaBitQHeapHandler<C, W, SL> (defaulted)
- Move member function definitions from .cpp to header (required for per-SIMD TU instantiation with non-default SL)
- Change context from reference to pointer (allows post-construction set)
- Add 'using SIMDResultHandler::handle' for DD compatibility

Scanner wiring:
- Create rabitq_dispatching.h with RaBitQScannerMixIn + factory
- Add context parameter to make_knn_scanner() virtual interface
- IndexRaBitQFastScan::make_knn_scanner() returns scanner via factory
- Include rabitq_dispatching.h in all per-SIMD TUs + NONE base TU

Static build impact: ZERO.

Differential Revision: D95392153
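The reason for the `using SIMDResultHandler::handle` line can be illustrated with a reduced sketch. `BaseHandlerSketch`, `HeapHandlerSketch`, and `vec16` are hypothetical; the assumption is that the base class declares handle() for the NONE-level vector type with a throwing default, as the earlier commit in this stack describes.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for the real SIMDLevel enum.
enum class SIMDLevel { NONE, AVX2 };

// Toy vector type; distinct SL values give distinct, unrelated types.
template <SIMDLevel SL>
struct vec16 {
    int tag;
};

struct BaseHandlerSketch {
    // Non-pure default: throws when no derived class overrides this overload.
    virtual void handle(vec16<SIMDLevel::NONE>) {
        throw std::runtime_error("handle() not implemented for this level");
    }
    virtual ~BaseHandlerSketch() = default;
};

template <SIMDLevel SL>
struct HeapHandlerSketch : BaseHandlerSketch {
    int calls = 0;

    // Keeps the base overload visible next to the SL-typed one.
    using BaseHandlerSketch::handle;

    // Overrides the base virtual only when SL == NONE; for any other SL
    // this is a separate overload with a different parameter type.
    void handle(vec16<SL>) { calls++; }
};

inline void usage_sketch() {
    HeapHandlerSketch<SIMDLevel::AVX2> h;
    h.handle(vec16<SIMDLevel::AVX2>{0}); // SL-typed overload
    assert(h.calls == 1);

    HeapHandlerSketch<SIMDLevel::NONE> n;
    BaseHandlerSketch& base = n;
    base.handle(vec16<SIMDLevel::NONE>{0}); // virtual call reaches the override
    assert(n.calls == 1);
}
```

Without the using-declaration, the derived `handle` would hide the base overload for every SL other than NONE, which is the situation -Woverloaded-virtual warns about.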
Contributor
@algoriddle has exported this pull request. If you are a Meta employee, you can view the originating Diff in D95392153.