[ROCm][Windows] NotImplementedError for FP8 (e4m3fn) operators on RDNA4 (Navi 48) GPUs #2932

### 🐛 Describe the bug

Official documentation for RDNA4 (Navi 48) and for PyTorch on Windows states that FP8 (E5M2, E4M3) is supported. However, basic eager-mode operators such as `torch.mm` and `torch.mul` raise `NotImplementedError` for `torch.float8_e4m3fn` tensors. This suggests that the dispatch logic or the required ROCm/hipBLAS kernels are missing, or not correctly linked, in the Windows ROCm build.

```python
>>> import torch
>>> a = torch.ones((512,512), dtype=torch.float8_e4m3fn).cuda()
>>> b = torch.ones((512,512), dtype=torch.float8_e4m3fn).cuda()
>>> a
tensor([[1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.],
        ...,
        [1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.]], device='cuda:0',
       dtype=torch.float8_e4m3fn)
>>> b
tensor([[1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.],
        ...,
        [1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.]], device='cuda:0',
       dtype=torch.float8_e4m3fn)
>>> c = torch.mm(a,b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NotImplementedError: "addmm_cuda" not implemented for 'Float8_e4m3fn'
>>> d = torch.mul(a,b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NotImplementedError: "mul_cuda" not implemented for 'Float8_e4m3fn'
```


### Versions

```text
PyTorch version: 2.9.1+rocmsdk20260116
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 7.2.26024-f6f897bd3d

OS: Microsoft Windows 11 Pro (10.0.26200 64-bit)
GCC version: Could not collect
Clang version: Could not collect
CMake version: version 4.2.1
Libc version: N/A

Python version: 3.12.12 | packaged by Anaconda, Inc. | (main, Oct 21 2025, 20:05:38) [MSC v.1929 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-11-10.0.26200-SP0
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to:
GPU models and configuration: AMD Radeon AI PRO R9700 (gfx1201)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
Is XPU available: False
HIP runtime version: 7.2.26024
MIOpen runtime version: 3.5.1
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==2.4.1
[pip3] torch==2.9.1+rocmsdk20260116
[pip3] torchaudio==2.9.1+rocmsdk20260116
[pip3] torchvision==0.24.1+rocmsdk20260116
[conda] numpy                     2.4.1                    pypi_0    pypi
[conda] torch                     2.9.1+rocmsdk20260116          pypi_0    pypi
[conda] torchaudio                2.9.1+rocmsdk20260116          pypi_0    pypi
[conda] torchvision               0.24.1+rocmsdk20260116          pypi_0    pypi
```
