Releases: ashvardanian/NumKong

v7.4.5: Faster RMSD

06 Apr 21:04

  • Improve: Vectorize F32 SME MaxSim finalizer (0daacf3)
  • Improve: Remove centering from RMSD kernels (1a83ab4)
  • Fix: Emulated vs native test durations (4266451)

v7.4.4: CI & MSVC Hardening

06 Apr 12:17

  • Fix: ARMv7 Rust cross-compilation with CC for versioned GCC (a5e67e6)
  • Make: check_source_runs-probing like march=native on MSVC (7a152f3)
  • Fix: Drop _MM_FROUND_NO_EXC from _mm256_cvtps_ph calls (8649b0c)
  • Fix: Guard against old MSVC preprocessor (25d3304)
  • Make: Enforce newer preprocessor in MSVC (be966af)
  • Make: Cleaner CIBW artifact names & env forwarding (a6cf642)
  • Make: Forward cross-compilation flags for macOS wheels (6ed3b8c)
  • Make: Split ppc64le, s390x, i686 CIBW runs (c01795c)

Release v7.4.3

05 Apr 16:34

Release: v7.4.3 [skip ci]

Patch

  • Fix: Require AArch64 for NEON kernels (2ba1b34)
  • Docs: Table order & formatting (8673a56)
  • Make: Avoid --all-features in Rust cross-compilation CI (8be8bff)
  • Improve: Arm32 compatibility (6404172)
  • Make: cancel-in-progress CI to shift compute resources (dfc8fa0)
  • Improve: Harden Swift SDK for 6.1+ toolkit (965cd52)
  • Make: Strip .unsafeFlags & list platforms for SPM consumption (b061b78)
  • Make: Expose CNumKongDispatch target to Swift users (6aa00a8)

Release v7.4.2

05 Apr 09:07

Release: v7.4.2 [skip ci]

Patch

  • Docs: Shrink tables in the main README (6d2ea34)
  • Make: Inline PowerShell cross-compilation logic in CI (974c30c)
  • Make: Define _ARM64_ for Arm JS builds in MSVC (f303042)
  • Make: Skip same-named artifacts on CI reruns (7c098e5)

Release v7.4.1

05 Apr 00:12

Release: v7.4.1 [skip ci]

Patch

  • Make: Set repository.url for NPM (385480d)
  • Make: Pull MSVC ARM64 Cross-Compiler (e20c93e)
  • Fix: Swap f16x8 for u16x8 in cast_neon (154ec5d)

v7.4: Fast Tensor Contractions

04 Apr 23:26

  • Faster tensor contractions
  • Faster GEMM "packers" with SIMD
  • New SVE+SDOT kernels for i8
  • MSVC build stability on Arm

Minor

  • Add: WASM elementwise ops & spatial mini-float kernels (81b8c44)
  • Add: WASM type-casting kernels (e09df31)
  • Add: SVE+SDOT ops for 8-bit integers (913fc6b)

Patch

  • Fix: Misplaced NEON loads/stores in Sierra (05e3045)
  • Fix: Avoid unconditional np symbols (9dffb68)
  • Make: Resolve probe locations for NPM consumers (c602f45)
  • Docs: Refined "What's Inside" (28f35cd)
  • Docs: Mini-float kernel selection strategy (04e6598)
  • Improve: Accelerate PyTests, reduce Decimal use (2417248)
  • Make: Move .pyi for PyLance (688ec2d)
  • Fix: Inconsistent SME function qualifiers (5b4148a)
  • Improve: Smaller test inputs under QEMU (ee36bf2)
  • Improve: Vectorize GEMM "packers" (86127a4)
  • Make: Longer timeouts for QEMU in CI (a9cc732)
  • Fix: vec_t store helper args order (eecbcac)
  • Fix: Negative stride tensor reductions (3ea81be)
  • Improve: Recursive stride collapsing and axis-lane fast paths for N-D reductions (cf8eaf6)
  • Improve: Faster reductions in strided tensors (61651ed)
  • Improve: Wider NEON curved, mesh, & probability F16 kernels (1c17678)
  • Fix: Harden mini-float type-casting (1911b89)
  • Make: Ship win32-arm64 NPM builds (578b7ad)
  • Make: Auto-bump JS platform-specific versions (5617f75)
  • Fix: vcombine instead of initializer lists for NEON arrays in MSVC (906c178)
  • Fix: Avoid flaky vld1_f16 for MSVC (7a987d2)

v7.3: Hardened Arm Kernels, Upgraded CI, Citations, & Docs

02 Apr 22:48

This release hardens Arm kernels across NEON, SVE, and SME. The most widespread fix replaces _x (don't-care) predicated intrinsics with _m (merge-with-zero) variants: inactive lanes left undefined by _x could carry stale data into reductions, producing wrong results for non-power-of-two dimensions on real SVE hardware. Partial-tail padding in BMOPA is fixed for sub-32-bit types, and strided reductions in NEON are hardened against off-by-one errors in non-contiguous layouts.
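To make the failure mode concrete, here is a minimal portable C sketch that models predicated lanes in scalar code. It is not NumKong's kernel code and uses no real SVE intrinsics: `LANES`, `vec_t`, `fma_dont_care`, `fma_merge`, and `dot_model` are hypothetical names, and the "unspecified" inactive lanes of an _x operation are modeled by copying in an explicit `stale` register.

```c
#include <stddef.h>

enum { LANES = 8 }; /* hypothetical vector width in 32-bit lanes */

typedef struct { float lane[LANES]; } vec_t;

/* Model of a "_x" multiply-add: active lanes are computed, inactive lanes are
 * unspecified. "Unspecified" is made concrete here by copying stale contents. */
static vec_t fma_dont_care(vec_t acc, const float *a, const float *b,
                           size_t active, vec_t stale) {
    vec_t out;
    for (size_t i = 0; i < LANES; ++i)
        out.lane[i] = i < active ? acc.lane[i] + a[i] * b[i] : stale.lane[i];
    return out;
}

/* Model of a "_m" multiply-add: inactive lanes keep the accumulator's value,
 * which stays zero when the accumulator was zero-initialized. */
static vec_t fma_merge(vec_t acc, const float *a, const float *b, size_t active) {
    vec_t out = acc;
    for (size_t i = 0; i < active; ++i)
        out.lane[i] = acc.lane[i] + a[i] * b[i];
    return out;
}

/* Horizontal reduction over ALL lanes, active or not. */
static float reduce_all(vec_t v) {
    float sum = 0.0f;
    for (size_t i = 0; i < LANES; ++i) sum += v.lane[i];
    return sum;
}

/* Dot product with a partial tail; `merge` selects the "_m" model. */
static float dot_model(const float *a, const float *b, size_t n,
                       int merge, vec_t stale) {
    vec_t acc = {{0}};
    for (size_t i = 0; i < n; i += LANES) {
        size_t active = n - i < LANES ? n - i : LANES;
        acc = merge ? fma_merge(acc, a + i, b + i, active)
                    : fma_dont_care(acc, a + i, b + i, active, stale);
    }
    return reduce_all(acc);
}
```

In this model, a 5-element dot product on an 8-lane vector leaves three inactive lanes in the tail. The merging variant keeps them at zero, while the don't-care variant lets whatever sits in `stale` leak into the final sum. Inputs whose length is a multiple of `LANES` never see the difference.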

Thanks to the @ClickHouse team for help hardening tail loads and @albumentations-team for strided reductions!

On the performance side, NEON gets faster in-vector finalizers, vcvt_high for cheaper F16/BF16 widening, and new SDOT fallbacks for i4 and e3m2 that previously required SME, bringing sub-byte arithmetic to the much larger NEON install base. Streaming SVE picks up Giesen's trick for E4M3 → F16 and faster mini-float norms. SME GEMMs use fewer branches in the inner loop.
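For readers unfamiliar with Giesen's trick, the idea is to shift the narrow float's exponent and mantissa bits into the wider format's fields and then fix the exponent bias with a single multiplication, which also normalizes subnormals for free. The sketch below is an illustration only, not the Streaming SVE kernel from this release: it widens to F32 instead of F16 so that it stays portable C, it assumes the OCP "E4M3FN" encoding (bias 7, no infinities, 0x7F/0xFF as NaN), and `e4m3_to_f32` is a hypothetical name.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Scalar sketch of the shift-and-multiply widening, assuming E4M3FN input.
 * Assumes F32 subnormals are not flushed to zero (no fast-math FTZ/DAZ). */
static float e4m3_to_f32(uint8_t x) {
    uint32_t magnitude = (uint32_t)(x & 0x7Fu);  /* 4 exponent + 3 mantissa bits */
    uint32_t sign = (uint32_t)(x & 0x80u) << 24; /* move sign to bit 31 */
    uint32_t bits;
    float value;

    /* E4M3FN has no infinities; the all-ones magnitude encodes NaN. */
    if (magnitude == 0x7Fu) return NAN;

    /* Align the 3 mantissa bits with the top of F32's 23-bit mantissa,
     * which places the 4 exponent bits at the bottom of F32's exponent. */
    bits = magnitude << 20;
    memcpy(&value, &bits, sizeof value);

    /* Rebias: the exponent is now read with bias 127 instead of 7, so scale
     * by 2^(127 - 7). The same multiply renormalizes E4M3 subnormals. */
    value *= 0x1p120f;

    /* Reattach the sign bit. */
    memcpy(&bits, &value, sizeof bits);
    bits |= sign;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

The same structure carries over to an F16 target with a shift of 7 and a scale of 2^8, since F16's bias is 15. In a vectorized kernel the NaN check would typically become a compare-and-select rather than a branch.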

Also, NumKong now ships a CITATION.cff. Hit "Cite this repository" on GitHub to grab it in case you are writing a paper on a related topic 🤗

Minor

  • Add: NEON & SDOT fallbacks for i4 & e3m2 (0c6afa5)

Patch

  • Docs: M5 perf stats for Wasmtime v43 (43c2881)
  • Fix: Alternative MSVC-friendly cast (4744b9b)
  • Make: Disable LTCG due to MSVC issues (3d37684)
  • Make: Try PREBUILDS_ONLY=0 in CI (64c5f95)
  • Improve: Lower NEONHALF → NEON requirements (37f99ec)
  • Fix: Wire nk_cast_neon benchmarks (3793af2)
  • Docs: Apple M5 native stats for secondary workloads (d7c81c4)
  • Improve: Faster in-vector 4-way finalizers in NEON (968dcd1)
  • Improve: Drop nk_f16x4_to_f32x4_neon (84bb20a)
  • Improve: vcvt_high for faster unpacking (a5f4a19)
  • Docs: Refresh GEMM/SYRK measurements Apple M4 → M5 (3e010de)
  • Fix: Harden strided reductions in NEON & AVX2 (61ac67b)
  • Fix: Double-counted tail in Skylake f64 RMSD, Kabsch, and Umeyama (5391344)
  • Improve: Share decimal.Context.traps rules (3c28ae9)
  • Fix: Padding partial tail 32-bit words for BMOPA (2598487)
  • Fix: Missing scale type definitions of mini-floats (91862da)
  • Fix: Scalar buffer cast internal overwrites & aliasing (7b0e129)
  • Fix: Top-bottom variable names (a014134)
  • Improve: Giesen's E4M3 → F16 in Streaming SVE (25322b5)
  • Improve: Fewer branches in SME GEMMs (858263c)
  • Fix: Up-round dimensions count in sub-byte C++ tests (87a72d0)
  • Make: Focus on M4 CPUs for SME probing (5ff63eb)
  • Improve: PyTesting across more shapes (4bc3e44)
  • Improve: Cleaner type-casting & promotion rules (23c2474)
  • Make: Hide formatting commits for v7-7.2 (f6ce2da)
  • Make: Native addon resolution for Deno & Bun (0d502d5)
  • Docs: Citations (6220137)
  • Improve: Faster mini-float norms in Streaming SVE (088de57)
  • Make: Integrate PyRight (0fe56c0)
  • Fix: F16 norms in SSVE skipped odd entries (bf3bfee)
  • Fix: Harden SVE MaxSim upcasting logic (803eb33)
  • Fix: Disable FPCR.AH bit (7b2b850)
  • Make: Node 24 for trusted publishing (9f1a4ef)
  • Fix: _m to zero-out predicated SVE/SME ops (16c157b)
  • Fix: _m to zero-out predicated SVE lanes in spatial/ (ac27cde)
  • Make: Replace stale prebuildify (74c5454)

Release v7.2.4

28 Mar 23:38

Release: v7.2.4 [skip ci]

Patch

  • Make: 2h timeout budget for JS & Py builds (2e8f081)

Release v7.2.3

28 Mar 23:22

Release: v7.2.3 [skip ci]

Patch

  • Fix: Harden implicit narrowing casts (319fae2)
  • Fix: Negating unsigned integers in MSVC (9be61e3)
  • Make: Retry flaky CI jobs (b622d63)
  • Make: Remove conflicting NEON probes (c0f3573)

Release v7.2.2

28 Mar 15:12

Release: v7.2.2 [skip ci]

Patch

  • Make: Trusted publishing for NPM (9578271)
  • Improve: VNNI spatial kernels for E2M3, E3M2, & E4M3 (02d5325)
  • Fix: NK_TARGET_NEON auto-detect in MSVC (4ad2124)