`--kt-method FP8` crashes on sm_86 (RTX 3090) even with `--kt-num-gpu-experts 0`
Environment
- GPU: RTX 3090 (sm_86)
- CPU: AMD EPYC 7663 (AVX2)
- KTransformers: v0.5.3
- sgl-kernel: 0.3.21
- Model: Qwen/Qwen3.5-122B-A10B-FP8
Steps to reproduce
```shell
python -m sglang.launch_server \
  --model /path/to/Qwen3.5-122B-A10B-FP8 \
  --kt-weight-path /path/to/Qwen3.5-122B-A10B-FP8 \
  --kt-method FP8 \
  --kt-cpuinfer 56 \
  --kt-threadpool-count 1 \
  --kt-num-gpu-experts 0 \
  --attention-backend triton \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --disable-shared-experts-fusion \
  --disable-custom-all-reduce
```
What happens
AVX2 FP8 MoE layers initialize successfully on CPU:
```
TP MOE layer 47, pool: 0x6512e0e0, expert num: 256, num_experts_per_tok: 8
Created AVX2_FP8_MOE_TP 0 at numa 0
```
Model loads, KV cache allocates, then during CUDA graph capture:
```
triton.compiler.errors.CompilationError: at 1:0:
def fused_moe_kernel(
^
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
```
Setting `--moe-runner-backend cutlass` instead hits:
```
AssertionError: cutlass_fp8 MoE requires CUDA 12.0+ with SM90 or CUDA 12.4+ with SM89
```
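Both failures are consistent with architecture gating: Triton's `fp8e4nv` is only available from sm_89 (Ada) onward, and the cutlass path asserts SM90 (or SM89 with CUDA 12.4+). A minimal sketch of that gating, assuming the rules stated by the two error messages above (these helper names are mine, not KTransformers or Triton code):

```python
def supported_triton_fp8(cc: tuple[int, int]) -> set[str]:
    """Which Triton fp8 dtypes a GPU of compute capability `cc` can use.

    Illustrative only: mirrors the reported error messages,
    not actual Triton source.
    """
    dtypes = {"fp8e4b15", "fp8e5"}  # available on sm_80/sm_86 per the error
    if cc >= (8, 9):                # Ada (sm_89) and newer
        dtypes.add("fp8e4nv")       # the dtype fused_moe_kernel needs
    return dtypes


def cutlass_fp8_ok(cc: tuple[int, int], cuda: tuple[int, int]) -> bool:
    """Mirror of the cutlass assertion: SM90 with CUDA 12.0+,
    or SM89 with CUDA 12.4+."""
    if cc >= (9, 0) and cuda >= (12, 0):
        return True
    if cc == (8, 9) and cuda >= (12, 4):
        return True
    return False
```

On this setup, `supported_triton_fp8((8, 6))` contains no `fp8e4nv` and `cutlass_fp8_ok((8, 6), ...)` is `False` for any CUDA version, matching both errors.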
Notes
- `--kt-num-gpu-experts` is set to 0
- The AVX2 FP8 CPU kernels load and initialize without error
- The crash occurs in the GPU-side MoE kernel compilation path
- `--kt-method BF16` does not crash
- RTX 3090 (sm_86) supports `fp8e4b15` and `fp8e5` per the Triton error message, but not `fp8e4nv`
- The AVX2 tutorial lists RTX 3090 as supported hardware and includes an FP8 example
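Until the GPU-side MoE compilation is gated on architecture, a possible user-side workaround is to pick the method from the device's compute capability. A hypothetical sketch (`pick_kt_method` is my name, not a KTransformers API); in practice the tuple would come from `torch.cuda.get_device_capability()`:

```python
def pick_kt_method(cc: tuple[int, int]) -> str:
    """Choose a --kt-method the GPU-side MoE kernel can compile.

    Assumption: fp8e4nv (required by the Triton fused-MoE kernel)
    is only available on sm_89 and newer, so older GPUs fall back
    to BF16, which does not crash on this setup.
    """
    return "FP8" if cc >= (8, 9) else "BF16"
```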