[QDP] Add zero-copy amplitude encoding from float32 GPU tensors #999

Merged
guan404ming merged 6 commits into apache:main from viiccwen:feature/encode-from-gpu-ptr-f32 on Feb 4, 2026

@viiccwen (Contributor) commented Jan 31, 2026

Purpose of PR

This PR adds encode_from_gpu_ptr_f32 and encode_from_gpu_ptr_f32_with_stream to QdpEngine, enabling zero-copy amplitude encoding from float32 GPU pointers. It relies on the existing GpuStateVector Float32 support and the launch_amplitude_encode_f32 / launch_l2_norm_f32 kernels.

Summary

  • New APIs (Linux/CUDA):
    • QdpEngine::encode_from_gpu_ptr_f32(input_d, input_len, num_qubits) — amplitude encoding from GPU f32 pointer using the default stream.
    • QdpEngine::encode_from_gpu_ptr_f32_with_stream(..., stream) — same with an explicit CUDA stream (null = default).
  • GPU pointer validation: All encode_from_gpu_ptr and encode_batch_from_gpu_ptr paths now validate the input pointer via cudaPointerGetAttributes (non-null, device/managed memory, same device as engine). Early checks for empty input / sample size are added where missing.
  • Amplitude encoder: calculate_inv_norm_gpu_f32 is refactored to call a new calculate_inv_norm_gpu_f32_with_stream; the stream-aware variant is used for f32 GPU encoding and is synchronized before host copy.
  • Python binding: get_torch_cuda_stream_ptr returns null when PyTorch reports stream pointer 0 (default stream) instead of raising, so default-stream usage is supported.
  • Tests: New Rust tests for encode_from_gpu_ptr_f32 and encode_from_gpu_ptr_f32_with_stream (default and non-default stream, f32/f64 engine, empty input, input length > state size). Existing GPU pointer tests updated to use valid GPU pointers so they still hit the intended error paths after the new validation. Basis tests use Float64 engine. Test helper create_test_data_f32 added in tests/common.
  • Python tests: Redundant in-function import torch removed; test_encode_cuda_tensor_unsupported_encoding parametrized with iqp / iqp-z and error message match updated to the current supported list.
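
To make the expected behaviour of the new f32 path concrete, here is a minimal CPU reference sketch of amplitude encoding. It is not the PR's GPU implementation: the function name `amplitude_encode_f32_reference` is hypothetical, it returns real amplitudes only, and the zero-padding of inputs shorter than the state is an assumption inferred from the "input length > state size" error case. It mirrors the steps named above (L2 norm, inverse norm, scaling) and the empty-input and oversized-input checks.

```rust
/// CPU reference for amplitude encoding of f32 input (illustrative only).
///
/// Returns a vector of length 2^num_qubits holding the input scaled by
/// 1 / ||input||_2. Inputs shorter than the state are zero-padded, which
/// is an assumption of this sketch rather than a documented guarantee.
pub fn amplitude_encode_f32_reference(
    input: &[f32],
    num_qubits: u32,
) -> Result<Vec<f32>, String> {
    // Early check: an empty input has no norm to normalize by.
    if input.is_empty() {
        return Err("input is empty".to_string());
    }
    // Guard the shift below against overflow on this platform.
    if num_qubits >= usize::BITS {
        return Err(format!("num_qubits {} is too large", num_qubits));
    }
    let state_len = 1usize << num_qubits;
    // Early check: the input must fit into the state vector.
    if input.len() > state_len {
        return Err(format!(
            "input length {} exceeds state size {}",
            input.len(),
            state_len
        ));
    }

    // L2 norm, accumulated in f64 to limit rounding error.
    let norm_sq: f64 = input.iter().map(|&x| (x as f64) * (x as f64)).sum();
    if norm_sq == 0.0 || !norm_sq.is_finite() {
        return Err("input norm is zero or non-finite".to_string());
    }
    let inv_norm = (1.0 / norm_sq.sqrt()) as f32;

    // Scale every input element; the tail of the state stays zero.
    let mut state = vec![0.0f32; state_len];
    for (dst, &src) in state.iter_mut().zip(input) {
        *dst = src * inv_norm;
    }
    Ok(state)
}
```

For example, encoding `[3.0, 4.0]` on 2 qubits gives a 4-element state whose first two entries are approximately 0.6 and 0.8, with the remaining entries zero. A reference like this can be useful for checking GPU results in tests within a tolerance.
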
## BEFORE: f64 path (old API — user had to convert f32→f64 to use this)
## AFTER: f32 path (new API — zero-copy from PyTorch f32 CUDA tensors)

Encode-from-GPU-pointer benchmark: 16 qubits, 200 iterations, state_len=65536
  BEFORE (encode_from_gpu_ptr f64): 0.0838 ms/encode, 11933.2 encodes/s
  AFTER  (encode_from_gpu_ptr_f32): 0.0632 ms/encode, 15820.1 encodes/s
  Speedup (f32 vs f64 path): 1.33x
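
The derived figures above can be checked from the two latencies alone. The helper names below are hypothetical and only restate the arithmetic: throughput is 1000 divided by the per-encode latency in milliseconds, and speedup is the ratio of the two latencies.

```rust
/// Throughput in encodes per second for a per-encode latency in milliseconds.
pub fn encodes_per_second(ms_per_encode: f64) -> f64 {
    1000.0 / ms_per_encode
}

/// Speedup of the "after" path relative to the "before" path,
/// given both latencies in the same unit.
pub fn speedup(before_ms: f64, after_ms: f64) -> f64 {
    before_ms / after_ms
}
```

With the reported latencies, `speedup(0.0838, 0.0632)` is about 1.326, which rounds to the stated 1.33x. The throughput values recomputed from the rounded latencies differ slightly from the printed ones (by under 3 encodes/s), presumably because the benchmark computed them from unrounded timings.
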

Related Issues or PRs

Closes #996; also a follow-up to #995.

Changes Made

  • Bug fix
  • New feature
  • Refactoring
  • Documentation
  • Test
  • CI/CD pipeline
  • Other

Breaking Changes

  • Yes
  • No

Checklist

  • Added or updated unit tests for all changes
  • Added or updated documentation for all changes
  • Successfully built and ran all unit tests or manual tests locally
  • PR title follows "MAHOUT-XXX: Brief Description" format (if related to an issue)
  • Code follows ASF guidelines

@rich7420 added this to the Qumat 0.6.0 milestone on Jan 31, 2026
@viiccwen force-pushed the feature/encode-from-gpu-ptr-f32 branch from b6dec66 to eaad56a on January 31, 2026 16:17
@guan404ming (Member)

The design looks nice to me. Could you attach before-and-after benchmarks from your local machine to show that the improvement works? Thanks!

@viiccwen (Contributor, Author) commented Feb 1, 2026

@guan404ming got it!

## BEFORE: f64 path (old API — user had to convert f32→f64 to use this)
## AFTER: f32 path (new API — zero-copy from PyTorch f32 CUDA tensors)

Encode-from-GPU-pointer benchmark: 16 qubits, 200 iterations, state_len=65536
  BEFORE (encode_from_gpu_ptr f64): 0.0838 ms/encode, 11933.2 encodes/s
  AFTER  (encode_from_gpu_ptr_f32): 0.0632 ms/encode, 15820.1 encodes/s
  Speedup (f32 vs f64 path): 1.33x

@viiccwen force-pushed the feature/encode-from-gpu-ptr-f32 branch from 9bad00b to eaad56a on February 1, 2026 15:54
@guan404ming (Member)

Looks nice, please help resolve the conflicts. Thanks!

@viiccwen force-pushed the feature/encode-from-gpu-ptr-f32 branch from eaad56a to 746635e on February 2, 2026 03:27
@viiccwen (Contributor, Author) commented Feb 2, 2026

cc @guan404ming @rich7420
updated PR description

@guan404ming (Member)

The change looks nice.
Sorry about that, but there are conflicts again.

@viiccwen force-pushed the feature/encode-from-gpu-ptr-f32 branch from 746635e to 07fcb2f on February 2, 2026 16:12
@rich7420 (Contributor) commented Feb 3, 2026

@viiccwen thanks for the update!
I think we could add a test, e.g. test_encode_from_gpu_ptr_f32_null_pointer or something similar, that calls encode_from_gpu_ptr_f32(std::ptr::null(), len, qubits) and asserts Err(MahoutError::InvalidInput(msg)) with msg.contains("null").
Otherwise it looks good overall.
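
A rough sketch of what such a null-pointer test could look like follows. Because the real engine needs a CUDA device, this sketch does not call `QdpEngine`; `MahoutError` here is a stand-in with only the variant mentioned above, and `MemoryKind`, `PointerInfo`, and `validate_gpu_ptr_f32` are hypothetical names that model the validation order described in the PR summary (non-null, device or managed memory, same device as the engine). In the actual code the pointer attributes would come from `cudaPointerGetAttributes` rather than being passed in.

```rust
/// Stand-in for the project's error type; only the variant named in the
/// review comment is modelled here.
#[derive(Debug, PartialEq)]
pub enum MahoutError {
    InvalidInput(String),
}

/// Hypothetical summary of the memory type a pointer query would report.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum MemoryKind {
    Host,
    Device,
    Managed,
}

/// Hypothetical result of a pointer-attribute query.
#[derive(Debug, Clone, Copy)]
pub struct PointerInfo {
    pub kind: MemoryKind,
    pub device: i32,
}

/// Models the validation order: null check first, then memory type,
/// then device match. The pointer is never dereferenced.
pub fn validate_gpu_ptr_f32(
    ptr: *const f32,
    info: PointerInfo,
    engine_device: i32,
) -> Result<(), MahoutError> {
    if ptr.is_null() {
        return Err(MahoutError::InvalidInput(
            "input GPU pointer is null".to_string(),
        ));
    }
    if !matches!(info.kind, MemoryKind::Device | MemoryKind::Managed) {
        return Err(MahoutError::InvalidInput(format!(
            "input pointer refers to {:?} memory, expected device or managed memory",
            info.kind
        )));
    }
    if info.device != engine_device {
        return Err(MahoutError::InvalidInput(format!(
            "input pointer is on device {}, engine is on device {}",
            info.device, engine_device
        )));
    }
    Ok(())
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_validate_gpu_ptr_f32_null_pointer() {
        let info = PointerInfo { kind: MemoryKind::Device, device: 0 };
        match validate_gpu_ptr_f32(std::ptr::null(), info, 0) {
            Err(MahoutError::InvalidInput(msg)) => assert!(msg.contains("null")),
            other => panic!("expected InvalidInput mentioning null, got {:?}", other),
        }
    }
}
```

Checking for null before querying attributes keeps the error message specific, which is what the `msg.contains("null")` assertion relies on.
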

@rich7420 (Contributor) left a comment

LGTM

@guan404ming merged commit 42da30d into apache:main on Feb 4, 2026
6 checks passed
@viiccwen (Contributor, Author) commented Feb 4, 2026

Thx all!

3 participants