Releases: ashvardanian/NumKong
v7.4.5: Faster RMSD
v7.4.4: CI & MSVC Hardening
- Fix: ARMv7 Rust cross-compilation with CC for versioned GCC (a5e67e6)
- Make: `check_source_runs`-probing like `march=native` on MSVC (7a152f3)
- Fix: Drop `_MM_FROUND_NO_EXC` from `_mm256_cvtps_ph` calls (8649b0c)
- Fix: Guard against old MSVC preprocessor (25d3304)
- Make: Enforce newer preprocessor in MSVC (be966af)
- Make: Cleaner CIBW artifact names & env forwarding (a6cf642)
- Make: Forward cross-compilation flags for macOS wheels (6ed3b8c)
- Make: Split ppc64le, s390x, i686 CIBW runs (c01795c)
Release v7.4.3
Patch
- Fix: Require AArch64 for NEON kernels (2ba1b34)
- Docs: Table order & formatting (8673a56)
- Make: Avoid `--all-features` in Rust cross-compilation CI (8be8bff)
- Improve: Arm32 compatibility (6404172)
- Make: `cancel-in-progress` CI to shift compute resources (dfc8fa0)
- Improve: Harden Swift SDK for 6.1+ toolkit (965cd52)
- Make: Strip `.unsafeFlags` & list platforms for SPM consumption (b061b78)
- Make: Expose `CNumKongDispatch` target to Swift users (6aa00a8)
Release v7.4.2
Release v7.4.1
v7.4: Fast Tensor Contractions
- Faster tensor contractions
- Faster GEMM "packers" with SIMD
- New SVE+SDOT kernels for `i8`
- MSVC build stability on Arm
Minor
- Add: WASM elementwise ops & spatial mini-float kernels (81b8c44)
- Add: WASM type-casting kernels (e09df31)
- Add: SVE+SDOT ops for 8-bit integers (913fc6b)
Patch
- Fix: Misplaced NEON loads/stores in Sierra (05e3045)
- Fix: Avoid unconditional `np` symbols (9dffb68)
- Make: Resolve probe locations for NPM consumers (c602f45)
- Docs: Refined "What's Inside" (28f35cd)
- Docs: Mini-float kernel selection strategy (04e6598)
- Improve: Accelerate PyTests, reduce `Decimal` use (2417248)
- Make: Move `.pyi` for PyLance (688ec2d)
- Fix: Inconsistent SME function qualifiers (5b4148a)
- Improve: Smaller test inputs under QEMU (ee36bf2)
- Improve: Vectorize GEMM "packers" (86127a4)
- Make: Longer timeouts for QEMU in CI (a9cc732)
- Fix: `vec_t` store helper args order (eecbcac)
- Fix: Negative stride tensor reductions (3ea81be)
- Improve: Recursive stride collapsing and axis-lane fast paths for N-D reductions (cf8eaf6)
- Improve: Faster reductions in strided tensors (61651ed)
- Improve: Wider NEON curved, mesh, & probability F16 kernels (1c17678)
- Fix: Harden mini-float type-casting (1911b89)
- Make: Ship `win32-arm64` NPM builds (578b7ad)
- Make: Auto-bump JS platform-specific versions (5617f75)
- Fix: `vcombine` instead of initializer lists for NEON arrays in MSVC (906c178)
- Fix: Avoid flaky `vld1_f16` for MSVC (7a987d2)
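
The stride-collapsing entry (cf8eaf6) is described only by its title. As a rough illustration of the general idea, and not of NumKong's actual implementation, the Python sketch below merges adjacent axes whenever the outer stride equals the inner extent times the inner stride, so an N-D reduction can walk fewer, longer axes. The function name `collapse_strides` and the use of element-unit strides are assumptions made for this example.

```python
def collapse_strides(shape, strides):
    """Merge adjacent axes that can be walked with a single stride.

    `shape` and `strides` are equal-length sequences; strides are counted
    in elements (an assumption of this sketch) and may be negative.
    Returns a list of (extent, stride) pairs, outermost axis first.
    """
    merged = []
    for extent, stride in zip(shape, strides):
        if extent == 1:
            continue  # a length-1 axis never advances the pointer
        if merged and merged[-1][1] == extent * stride:
            # Stepping the outer axis once equals walking the whole inner axis,
            # so both axes fold into one longer axis with the inner stride.
            outer_extent, _ = merged[-1]
            merged[-1] = (outer_extent * extent, stride)
        else:
            merged.append((extent, stride))
    return merged


# A contiguous 2x3x4 tensor folds into a single axis of 24 elements.
print(collapse_strides((2, 3, 4), (12, 4, 1)))
# A padded outer axis stays separate; only the two inner axes fold.
print(collapse_strides((2, 3, 4), (24, 4, 1)))
```

Negative strides need no special case in this formulation: a reversed contiguous view such as shape `(2, 3)` with strides `(-3, -1)` still satisfies the merge condition and folds into one axis of stride `-1`.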
v7.3: Hardened Arm Kernels, Upgraded CI, Citations, & Docs
This release hardens Arm kernels across NEON, SVE, and SME. The most widespread fix replaces `_x` (don't-care) predicated intrinsics with `_m` (merge-with-zero) variants: inactive lanes left undefined by `_x` could carry stale data into reductions, producing wrong results for non-power-of-two dimensions on real SVE hardware. Partial-tail padding in BMOPA is fixed for sub-32-bit types, and strided reductions in NEON are hardened against off-by-one in non-contiguous layouts.
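
To make the stale-lane failure mode concrete, here is a minimal pure-Python model of it. It does not use real SVE intrinsics: `LANES`, `load_tail`, and `sum_squares` are hypothetical names, and "stale data" is modelled as whatever the previous iteration left in the register. The point is only that a horizontal reduction over all lanes is correct when the inactive lanes of a partial tail are zeroed, and wrong when they are not.

```python
LANES = 4  # hypothetical vector width, in elements


def load_tail(chunk, background):
    """Model a predicated load: active lanes take data from `chunk`,
    inactive lanes keep whatever `background` already holds."""
    return [chunk[i] if i < len(chunk) else background[i] for i in range(LANES)]


def sum_squares(values, zero_inactive):
    acc = [0.0] * LANES
    register = [0.0] * LANES  # models a vector register reused across iterations
    for start in range(0, len(values), LANES):
        chunk = values[start:start + LANES]
        # Either force inactive lanes to zero, or leave the previous contents in place.
        background = [0.0] * LANES if zero_inactive else register
        register = load_tail(chunk, background)
        acc = [a + x * x for a, x in zip(acc, register)]
    return sum(acc)  # horizontal reduction over ALL lanes, active or not


data = [1.0, 2.0, 3.0, 4.0, 5.0]  # length is not a multiple of LANES
print(sum_squares(data, zero_inactive=True))   # 55.0, the correct sum of squares
print(sum_squares(data, zero_inactive=False))  # 84.0, polluted by stale lanes
```

Inputs whose length is a multiple of `LANES` give the same answer either way in this model, which is consistent with the note above that the bug surfaced for non-power-of-two dimensions.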
Thanks to the @ClickHouse team for help hardening tail loads and @albumentations-team for strided reductions!
On the performance side, NEON gets faster in-vector finalizers, `vcvt_high` for cheaper F16/BF16 widening, and new SDOT fallbacks for `i4` and `e3m2` that previously required SME, bringing sub-byte arithmetic to the much larger NEON install base. Streaming SVE picks up Giesen's trick for E4M3 → F16 and faster mini-float norms. SME GEMMs use fewer branches in the inner loop.
Also, NumKong now ships a `CITATION.cff`: hit "Cite this repository" on GitHub to grab it in case you are writing a paper on a related topic.
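
For readers unfamiliar with sub-byte arithmetic, the sketch below shows in plain Python what a dot product over packed signed 4-bit integers computes. The nibble order (low nibble first) and the helper names `unpack_i4` and `dot_i4` are assumptions for illustration; they are not taken from NumKong, whose kernels would widen the nibbles to 8-bit lanes and feed them to SDOT rather than loop in scalar code.

```python
def unpack_i4(packed: bytes) -> list[int]:
    """Split each byte into two signed 4-bit values in [-8, 7].

    Assumes the low nibble comes first; this layout is a choice made
    for the example, not a documented NumKong convention.
    """
    out = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            # Two's-complement sign extension from 4 bits.
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out


def dot_i4(a: bytes, b: bytes) -> int:
    """Reference dot product over two equally long packed i4 buffers."""
    return sum(x * y for x, y in zip(unpack_i4(a), unpack_i4(b)))


a = bytes([0x21, 0xF8])  # unpacks to [1, 2, -8, -1]
b = bytes([0x43, 0x1F])  # unpacks to [3, 4, -1, 1]
print(dot_i4(a, b))      # 1*3 + 2*4 + (-8)*(-1) + (-1)*1 = 18
```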
Minor
- Add: NEON & SDOT fallbacks for `i4` & `e3m2` (0c6afa5)
Patch
- Docs: M5 perf stats for Wasmtime v43 (43c2881)
- Fix: Alternative MSVC-friendly cast (4744b9b)
- Make: Disable LTCG due to MSVC issues (3d37684)
- Make: Try `PREBUILDS_ONLY=0` in CI (64c5f95)
- Improve: Lower NEONHALF → NEON requirements (37f99ec)
- Fix: Wire `nk_cast_neon` benchmarks (3793af2)
- Docs: Apple M5 native stats for secondary workloads (d7c81c4)
- Improve: Faster in-vector 4-way finalizers in NEON (968dcd1)
- Improve: Drop `nk_f16x4_to_f32x4_neon` (84bb20a)
- Improve: `vcvt_high` for faster unpacking (a5f4a19)
- Docs: Refresh GEMM/SYRK measurements Apple M4 → M5 (3e010de)
- Fix: Harden strided reductions in NEON & AVX2 (61ac67b)
- Fix: Double-counted tail in Skylake `f64` RMSD, Kabsch, and Umeyama (5391344)
- Improve: Share `decimal.Context.traps` rules (3c28ae9)
- Fix: Padding partial tail 32-bit words for `BMOPA` (2598487)
- Fix: Missing scale type definitions of mini-floats (91862da)
- Fix: Scalar buffer cast internal overwrites & aliasing (7b0e129)
- Fix: Top-bottom variable names (a014134)
- Improve: Giesen's E4M3 → F16 in Streaming SVE (25322b5)
- Improve: Fewer branches in SME GEMMs (858263c)
- Fix: Up-round dimensions count in sub-byte C++ tests (87a72d0)
- Make: Focus on M4 CPUs for SME probing (5ff63eb)
- Improve: PyTesting across more shapes (4bc3e44)
- Improve: Cleaner type-casting & promotion rules (23c2474)
- Make: Hide formatting commits for v7-7.2 (f6ce2da)
- Make: Native addon resolution for Deno & Bun (0d502d5)
- Docs: Citations (6220137)
- Improve: Faster mini-float norms in Streaming SVE (088de57)
- Make: Integrate PyRight (0fe56c0)
- Fix: F16 norms in SSVE skipped odd entries (bf3bfee)
- Fix: Harden SVE MaxSim upcasting logic (803eb33)
- Fix: Disable `FPCR.AH` bit (7b2b850)
- Make: Node 24 for trusted publishing (9f1a4ef)
- Fix: `_m` to zero-out predicated SVE/SME ops (16c157b)
- Fix: `_m` to zero-out predicated SVE lanes in `spatial/` (ac27cde)
- Make: Replace stale `prebuildify` (74c5454)