Refactoring all test scripts to pytest and fixing CUDA graphs compatibility #13

Merged: akshaysubr merged 4 commits into main from testing-refactor on Mar 6, 2026
Conversation

@akshaysubr (Collaborator)

This PR includes three major improvements to cuHPX:

  1. Test Suite Refactoring - Modernized all tests to use pytest with proper fixtures and dtype parameterization
  2. CUDA Graph Compatibility - Added support for capturing SHTCUDA/iSHTCUDA operations in CUDA graphs
  3. Bug Fix - Fixed segmentation fault caused by dangling CUDA stream references

- Convert all test scripts to pytest format with fixtures and assertions
- Add conftest.py with shared fixtures (device, nside, dtype, tolerances)
- Parameterize all SHT tests with both float32 and float64 dtypes
- Add tolerance helpers: get_impl_tol(), get_roundtrip_tol(), get_bluestein_tol()
- Add pytest configuration to pyproject.toml
- Move profiling scripts to benchmarks/ and examples/ directories
- Remove obsolete test_sht_two_stream_overlap_profiling.py
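The fixture and tolerance-helper setup described above could look roughly like the sketch below. This is a hypothetical illustration, not the actual conftest.py: dtypes are keyed by string so the sketch stays dependency-free (the real code presumably parameterizes over torch dtypes), the float32 tolerances are taken from the values quoted later in this review thread, and the float64 values and nside choices are illustrative placeholders.

```python
# Hypothetical sketch of conftest.py fixtures and tolerance helpers.
# String dtype keys keep the sketch torch-free; float64 values and
# nside choices are illustrative, not the project's actual settings.
import pytest

# (rtol, atol) per dtype and transform direction
_IMPL_TOL = {
    "float32": {"sht": (1e-4, 1e-5), "isht": (1e-3, 1e-2)},
    "float64": {"sht": (1e-10, 1e-12), "isht": (1e-9, 1e-11)},  # illustrative
}


def get_impl_tol(dtype, direction="sht"):
    """Look up (rtol, atol) for comparing two SHT implementations."""
    return _IMPL_TOL[dtype][direction]


@pytest.fixture(params=["float32", "float64"])
def dtype(request):
    """Run every test once per floating-point precision."""
    return request.param


@pytest.fixture(params=[32, 64])
def nside(request):
    """HEALPix resolution parameter (illustrative values)."""
    return request.param
```

A test would then take `dtype` and `nside` as arguments and call `get_impl_tol(dtype)` to pick its comparison thresholds, so tightening or loosening a tolerance is a one-line change in one file.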

Signed-off-by: Akshay Subramaniam <6964110+akshaysubr@users.noreply.github.com>
@akshaysubr requested review from Copilot and xiaopoc on January 22, 2026 at 01:40
Copilot AI left a comment
Pull request overview

This PR modernizes the cuHPX test suite and adds critical CUDA graph support for production workloads.

Changes:

  • Refactored all test scripts from interactive input-based scripts to pytest-based unit tests with proper fixtures and parameterization
  • Added CUDA graph compatibility by fixing stream handling and pre-allocating buffers in SHTCUDA/iSHTCUDA operations
  • Fixed segmentation fault caused by dangling CUDA stream references in HealpixFFT/HealpixIFFT classes
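The graph-capture support summarized above typically follows PyTorch's standard recipe: warm the operation up on a side stream so allocations happen before capture, then record the call into a `torch.cuda.CUDAGraph`. The sketch below shows that recipe under stated assumptions; `sht_op` and `static_input` are placeholders, not actual cuHPX names, and `torch` is imported lazily so the sketch loads on machines without a GPU.

```python
def capture_sht(sht_op, static_input, warmup=3):
    """Capture an SHT call into a CUDA graph (sketch of the usual PyTorch recipe).

    For capture to succeed, `sht_op` must avoid fresh allocations and
    dangling stream references during the recorded call -- the kind of
    fix this PR applies to the SHTCUDA/iSHTCUDA operations.
    """
    import torch  # imported lazily so the sketch loads without CUDA installed

    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        for _ in range(warmup):          # warm up: allocations happen pre-capture
            static_out = sht_op(static_input)
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):        # record one call into the graph
        static_out = sht_op(static_input)
    # To rerun: copy new data into static_input, then graph.replay();
    # results appear in static_out.
    return graph, static_out
```

The key property is that replaying the graph reuses the exact memory recorded at capture time, which is why the PR pre-allocates buffers and pre-computes weights instead of allocating inside the operation.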

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 7 comments.

Summary per file:

  • tests/test_sht_two_stream_overlap_profiling.py – Removed old profiling test (replaced by benchmark script)
  • tests/test_sht_cuda_stream.py – Converted to pytest with CUDA stream validation tests
  • tests/test_sht_cuda_batch.py – Converted to pytest with batched operation tests
  • tests/test_sht_cuda.py – Converted to pytest with CUDA vs PyTorch comparison tests
  • tests/test_sht_bluestein.py – Converted to pytest with Bluestein algorithm tests
  • tests/test_regridding.py – Converted to pytest with grid regridding tests
  • tests/test_harmonic_transform.py – Converted to pytest with SHT/iSHT roundtrip tests
  • tests/test_grad.py – Converted to pytest with gradient flow tests
  • tests/test_differentiability.py – Converted to pytest with gradient-based optimization tests
  • tests/test_data_remapping.py – Converted to pytest with ring/nest remapping tests
  • tests/test_cuda_graphs.py – New file with comprehensive CUDA graph capture tests
  • tests/test_batch_remap.py – Converted to pytest with batched remapping tests
  • tests/conftest.py – New pytest configuration with shared fixtures and tolerance helpers
  • src/harmonic_transform/hpx_fft.h – Fixed stream handling and added pre-allocated buffers for CUDA graphs
  • src/harmonic_transform/hpx_fft.cpp – Updated to use pre-allocated buffers and in-place operations
  • pyproject.toml – Added pytest configuration and test dependencies
  • examples/wmap_optimization.py – New example demonstrating differentiable SHT on WMAP data
  • cuhpx/hpx_sht.py – Fixed stream handling and pre-computed weights for CUDA graph compatibility
  • benchmarks/stream_overlap_profiling.py – New benchmark for profiling stream overlap
  • benchmarks/sht_profiling.py – New benchmark for profiling SHT operations
Comments suppressed due to low confidence (1)

src/harmonic_transform/hpx_fft.h:1

  • The condition `n_ < n` only triggers reconfiguration when the new batch size is larger. If n becomes smaller (e.g., from 8 to 4), no reconfiguration occurs, and the pre-allocated buffers sized for n_ = 8 are reused. While the code uses slicing to handle this (line 77), the comment on line 133 is misleading; it should clarify that buffers are sized for the maximum batch size seen, not just 'when batch grows larger'.
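The grow-only allocation pattern the review comment describes can be illustrated with a minimal Python sketch (plain lists stand in for device allocations; the class and member names are illustrative, not the actual hpx_fft.h members):

```python
class ScratchBuffers:
    """Grow-only scratch storage: reallocate only when the batch grows.

    Mirrors the `n_ < n` check discussed above: capacity records the
    maximum batch size ever seen, and smaller batches reuse a slice
    of the existing, larger buffer.
    """

    def __init__(self):
        self.capacity = 0   # max batch size seen so far (the `n_` role)
        self.buf = None

    def ensure(self, n):
        if self.capacity < n:        # reconfigure only on growth
            self.buf = [0.0] * n     # stand-in for a device allocation
            self.capacity = n
        return self.buf[:n]          # slicing covers n < capacity
```

After `ensure(8)` followed by `ensure(4)`, the capacity stays 8, which is exactly the review's point: buffers end up sized for the maximum batch seen, not for the current one.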


@xiaopoc (Collaborator) commented Jan 23, 2026

I ran the pytest suite on my end and observed 3 test failures. The failures occur in precision checks and appear to be caused by numerical tolerance thresholds that are set too tightly. Maybe we can try the following fixes?

  1. tests/conftest.py (line 86) – Increase atol for float32 implementation comparisons
   # Current
   torch.float32: {"sht": (1e-4, 1e-5), "isht": (1e-3, 1e-2)},
   # Suggested
   torch.float32: {"sht": (1e-4, 2e-05), "isht": (1e-3, 0.1)},
  2. tests/test_sht_cuda_batch.py (line 66) – Increase atol for batch vs single comparison
   # Current
   atol = 1e-6 if dtype == torch.float32 else 1e-10
   # Suggested
   atol = 1e-4 if dtype == torch.float32 else 1e-10

@akshaysubr (Collaborator, Author)

@xiaopoc Thanks for the review and independent testing. What GPU, CUDA version, etc. were you testing with?

@xiaopoc (Collaborator) left a comment

All tests passed. The PR looks good to me, so I approved it.

@akshaysubr merged commit 1e37484 into main on Mar 6, 2026
1 check passed
3 participants