FP8 crashes on sm_86 (RTX 3090) even with --kt-num-gpu-experts 0 #1930

@ormandj

Description

Launching with --kt-method FP8 on an sm_86 GPU (RTX 3090) crashes during CUDA graph capture, even when all experts are placed on the CPU via --kt-num-gpu-experts 0.

Environment

  • GPU: RTX 3090 (sm_86)
  • CPU: AMD EPYC 7663 (AVX2)
  • KTransformers: v0.5.3
  • sgl-kernel: 0.3.21
  • Model: Qwen/Qwen3.5-122B-A10B-FP8

Steps to reproduce

python -m sglang.launch_server \
  --model /path/to/Qwen3.5-122B-A10B-FP8 \
  --kt-weight-path /path/to/Qwen3.5-122B-A10B-FP8 \
  --kt-method FP8 \
  --kt-cpuinfer 56 \
  --kt-threadpool-count 1 \
  --kt-num-gpu-experts 0 \
  --attention-backend triton \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --disable-shared-experts-fusion \
  --disable-custom-all-reduce

What happens

AVX2 FP8 MoE layers initialize successfully on CPU:

TP MOE layer 47, pool: 0x6512e0e0, expert num: 256, num_experts_per_tok: 8
Created AVX2_FP8_MOE_TP 0 at numa 0

The model loads and the KV cache is allocated, but CUDA graph capture then fails:

triton.compiler.errors.CompilationError: at 1:0:
def fused_moe_kernel(
^
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
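For context, Triton gates FP8 dtypes by compute capability: fp8e4nv (e4m3) needs native hardware support, which first appears on sm_89 (Ada). A rough sketch of that gating, with hypothetical names (this is not Triton's actual API, just an illustration of the behavior behind the error):

```python
# Hypothetical sketch of Triton's per-architecture FP8 gating; the
# function name and return shape are illustrative, not Triton's API.
def supported_fp8_dtypes(cc_major: int, cc_minor: int) -> set:
    """FP8 dtypes Triton can compile for a given CUDA compute capability."""
    cc = cc_major * 10 + cc_minor
    if cc >= 89:
        # sm_89 (Ada) and newer have native e4m3 support, so fp8e4nv works
        return {"fp8e4nv", "fp8e4b15", "fp8e5"}
    # Older parts such as sm_86 (RTX 3090) only get the emulated formats,
    # matching the error message above
    return {"fp8e4b15", "fp8e5"}

# RTX 3090 is compute capability 8.6, so fp8e4nv is rejected
assert "fp8e4nv" not in supported_fp8_dtypes(8, 6)
```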

Switching to --moe-runner-backend cutlass instead fails with:

AssertionError: cutlass_fp8 MoE requires CUDA 12.0+ with SM90 or CUDA 12.4+ with SM89
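The cutlass path encodes its own capability gate, so sm_86 is ruled out there as well. A minimal restatement of the condition from the assertion text (hypothetical helper, not the library's actual function):

```python
# Hypothetical restatement of the cutlass_fp8 gate quoted above:
# CUDA 12.0+ with SM90, or CUDA 12.4+ with SM89.
def cutlass_fp8_supported(sm, cuda_version):
    """sm: compute capability as an int (e.g. 86); cuda_version: (major, minor)."""
    if sm >= 90:
        return cuda_version >= (12, 0)
    if sm == 89:
        return cuda_version >= (12, 4)
    return False  # sm_86 and below are never eligible, regardless of CUDA version

# RTX 3090 (sm_86) fails this check no matter how new the CUDA toolkit is
assert not cutlass_fp8_supported(86, (12, 9))
```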

Notes

  • --kt-num-gpu-experts is set to 0, so no experts should be placed on the GPU
  • The AVX2 FP8 CPU kernels load and initialize without error
  • The crash occurs in the GPU-side MoE kernel compilation path
  • --kt-method BF16 does not crash
  • RTX 3090 (sm_86) supports fp8e4b15 and fp8e5 per the Triton error message, but not fp8e4nv
  • The AVX2 tutorial lists RTX 3090 as supported hardware and includes an FP8 example
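Given the note above that --kt-method BF16 does not crash, a possible stopgap on sm_86 is to relaunch with only the method flag changed (assuming the same checkpoint loads under the BF16 path, which the note suggests but which I have not verified beyond that):

```shell
python -m sglang.launch_server \
  --model /path/to/Qwen3.5-122B-A10B-FP8 \
  --kt-weight-path /path/to/Qwen3.5-122B-A10B-FP8 \
  --kt-method BF16 \
  --kt-cpuinfer 56 \
  --kt-threadpool-count 1 \
  --kt-num-gpu-experts 0 \
  --attention-backend triton \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --disable-shared-experts-fusion \
  --disable-custom-all-reduce
```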
