v1.11.0 (#1146), announced by shi-eric in Announcements
Warp v1.11.0
Warp v1.11 introduces group-aware spatial queries for multi-world workloads, provides new options for managing JIT compilation overhead, and expands differentiation capabilities with wp.grad(). This release also includes expanded tile operations, the unpack operator in kernels, C++ integration examples, and a major API cleanup clarifying public versus internal interfaces.

New features
Group-aware spatial queries
Warp v1.11 introduces group-aware construction and queries for the wp.Bvh and wp.Mesh data structures, enabling efficient spatial queries across multiple independent environments. This feature allows you to build a single acceleration structure containing geometry from multiple worlds or scenes, then query each world independently without traversing primitives from other worlds.

When constructing a BVH or Mesh, assign each primitive to a group using the groups parameter. Warp builds isolated sub-trees for each group within a unified structure.

For production use, launch multiple threads in parallel, each querying its assigned world from arrays of world IDs and ray parameters. See Newton's raytrace implementation for a real-world example of parallel multi-world raycasting.
Key features

- groups array during construction to organize primitives into isolated sub-trees
- root parameter to limit traversal to a specific group
- wp.bvh_get_group_root() and wp.mesh_get_group_root() retrieve sub-tree roots for each group

Thanks to @StafaH for implementing this feature.
Geometry query enhancements
Warp v1.11 adds several new query functions and improvements for spatial queries:

- wp.mesh_query_ray_anyhit(): Fast any-hit query that returns immediately upon finding any intersection, useful for shadow-ray calculations in rendering
- wp.mesh_query_ray_count_intersections(): Counts all ray-triangle intersections along a ray path
- wp.mesh_query_point_sign_parity(): Point-in-mesh query using perturbed ray casting with majority voting for improved robustness in challenging cases
- max_dist parameter: wp.bvh_query_next() now accepts a maximum distance to filter intersections, useful for early ray termination
- Tile-based variants of the query builtins (wp.bvh_query_aabb_tiled(), wp.bvh_query_ray_tiled(), wp.mesh_query_aabb_tiled(), etc.)

Evaluate the gradients of functions
wp.grad() directly evaluates the gradient of a Warp function at specific input values, computing gradients inline during the forward pass. This is useful for computing forces from energy functions, or when implementing custom adjoints that need to call auto-generated gradients of subfunctions, avoiding the need to manually code the entire adjoint chain. This contrasts with wp.Tape(), which records an entire computation graph for reverse-mode automatic differentiation across multiple kernel launches. This feature was implemented in response to community feedback (#125).

wp.tile_map() supports n-ary maps (up to n=8)

User-defined functions that accept up to 8 arguments may now be used as tile mapping functions. An equivalent number of tiles must be passed to wp.tile_map().

Generate tiles of random numbers
wp.tile_randf() and wp.tile_randi() have been introduced to generate tiles of random floats and ints, respectively. These functions accept optional lower and upper bound arguments to control the range of generated values, and can be used, for example, to generate 4x4 tensors of random floats using 2x2 tiles.

Alpha and Beta scalings in wp.tile_matmul()

Optional alpha and beta scaling arguments have been added to the wp.tile_matmul() builtins:

- Accumulating form: previously out = A * B + out, now out = alpha * A * B + beta * out
- Non-accumulating form: previously out = A * B, now out = alpha * A * B

In-place variants of Cholesky decomposition and linear solvers
wp.tile_cholesky_inplace(), wp.tile_cholesky_solve_inplace(), wp.tile_lower_solve_inplace(), and wp.tile_upper_solve_inplace() give the same results as their non-in-place counterparts, but overwrite input memory rather than allocating additional output memory, thereby halving shared memory usage. This is particularly beneficial in memory-constrained kernels where shared memory is limited.

Performance improvements
JIT-compile time improvements
Warp v1.11 brings three changes that aim to reduce the time to compile and load modules:
Precompiled headers
The CUDA C++ files that are generated from the Python modules all include the same set of header files. Warp now leverages NVRTC precompiled headers to cache the result of parsing these headers and reuse it for subsequent modules.
The first module that gets compiled incurs a 50 ms overhead to create the precompiled header, but every subsequent module in the same Python session saves 50-500 ms of compile time, with larger modules seeing the greatest benefit. The precompiled header is stored in a temporary directory and cached for the lifetime of the Python process. Each new Python process must recreate the precompiled header, as PCH files cannot be shared across processes due to internal memory layout requirements.
This feature is enabled by default, but can be disabled by setting wp.config.use_precompiled_headers = False.

Note for source builds: Precompiled headers require building Warp against CUDA Toolkit 12.8 or newer. Users installing from PyPI automatically have this feature because the Warp libraries on PyPI are now built against CUDA Toolkit 12.9.1.
For more details, see the NVRTC PCH documentation.
Optimization level control
By default, the CUDA Runtime Compiler performs a high level of optimization on GPU kernels, favoring runtime performance at the cost of longer compilation times. Warp v1.11 introduces the wp.config.optimization_level setting to control this tradeoff. The setting applies to GPU kernel compilation and accepts values from 0 (fastest compilation, least optimization) to 3 (maximum runtime optimization). When set to None (the default), Warp uses level 3.
The setting can be configured globally via wp.config.optimization_level or per-module using wp.set_module_options({"optimization_level": 2}).

This setting is available when Warp is built against CUDA Toolkit 12.9 or newer. PyPI wheels include this support.
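Both configuration forms from the notes above can be sketched as follows (the level value 2 is an arbitrary illustration):

```python
import warp as wp

# Globally favor faster JIT compilation over maximum kernel performance.
# Accepted values are 0-3; None, the default, selects level 3.
wp.config.optimization_level = 2

wp.init()

# Or override the level for the current module only:
wp.set_module_options({"optimization_level": 2})
```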
Parallel compilation
Modules can now be compiled and loaded in parallel across multiple threads for both CPU and GPU. To benefit from parallel compilation, set wp.config.load_module_max_workers to a positive integer (the default is 0, which disables parallelization) and explicitly load modules using wp.load_module() or wp.force_load(). You can also pass a max_workers argument directly to these functions to override the config setting. When modules are lazily compiled on demand at wp.launch(), they are compiled one at a time and do not benefit from parallelization.

Parallel compilation can significantly reduce startup time when working with many modules. The most time-consuming step being parallelized is the two-stage compilation process: translating Python code to CUDA/C++ source code, then JIT-compiling it to binary libraries using NVRTC for the GPU or LLVM for the CPU.
wp.load_module() requires recursive=True to enable parallel compilation; this loads the specified module along with all its submodules. For example, loading Newton's inverse kinematics module and its submodules with parallel compilation (wp.config.load_module_max_workers = 4) takes about 3 seconds for CUDA (versus 4.5 seconds serially) and 3.7 seconds for CPU (versus 10.5 seconds serially).

wp.force_load() provides lower-level control by accepting an explicit list of modules to compile, unlike wp.load_module(), which operates on a module hierarchy. Warning: without explicit modules and device arguments, wp.force_load() compiles all imported modules for all available devices, which can take much longer than not using it at all. For example, import newton followed by wp.force_load(device="cuda:0") will compile over 100 modules.

The following example shows selective compilation using a manually built module list. This requires more setup but provides fine-grained control:
With the Newton-based module list used for this measurement, parallel compilation (wp.config.load_module_max_workers = 4) takes about 9-10 seconds, compared to about 24 seconds with serial loading (wp.config.load_module_max_workers = 0), a roughly 2.5x speedup in this case.

Parallel compilation is most effective when compilation time is distributed evenly across modules. Gains will be limited if a single module dominates the total compilation time.
Advanced optimization: For applications with many kernels in a single large module file, consider splitting them into separate submodules across multiple files. This enables parallel compilation of the submodules, trading some code organization complexity for faster compilation times.
Language enhancements
Unpack operator support in kernels
Warp now supports Python's unpack operator (*) inside kernel function calls, enabling you to expand vectors, matrices, quaternions, and 1D array slices into individual arguments. This feature brings familiar Python idioms into Warp kernels and simplifies common patterns like constructing larger vectors from smaller ones, or copying array values into a vector.

The unpack operator works on composite types to expand their components:
Important: When unpacking arrays, slice bounds must be compile-time constants and non-negative. The upper bound is required since the array length is not known at compile time.
C++ integration examples
Warp v1.11 introduces C++ integration examples demonstrating how to deploy Warp-compiled kernels in standalone C++ applications without runtime Python dependencies. Two approaches are demonstrated:
- Runtime loading (00_cubin_launch): Load Warp-generated CUBIN files at runtime using the CUDA Driver API
- Source inclusion (01_source_include): Statically include Warp-generated CUDA source (including forward and adjoint kernels) in C++ projects

These examples are available at warp/examples/cpp/ with full build support for Make and CMake on Linux and Windows.

Community feedback: We're gathering input on Warp's AOT and C++ interoperability roadmap through a survey on GitHub Discussions. If you work with native workflows, deployment in minimal-Python environments, or interoperability with other CUDA/C++ libraries, your feedback will help shape future development in these areas.
Public API clarification
Warp v1.11 refines the boundary between public and internal APIs alongside a major documentation reorganization. Symbols and namespaces intended for internal use now emit deprecation warnings when accessed and will be removed in v1.13 (nominally May 2026). The complete public API is now clearly documented in the restructured API Reference and Language Reference sections.
What this means for your code:
- Deprecation warnings identify the affected symbols; for additional detail, set wp.config.verbose_warnings = True.
- Accessing internal submodules of the warp package (e.g., wp.context) is deprecated. If your code relies on internal APIs like wp.context.runtime, you can access them via wp._src.context.runtime, but be aware these are not part of the public API and may change or be removed without notice.
- Use wp.DeviceLike instead of the deprecated wp.Devicelike (note the capitalization).

If you depend on functionality that's no longer accessible and believe it should be part of the public API, please open a feature request on GitHub. Note: We're aware that some functionality (such as graph coloring and color balancing) currently lacks a public API and requires accessing internal modules. We're tracking these gaps in issues like #1145.
Platform support
Python version requirements
Warp 1.11.0 drops support for Python 3.8, which reached end-of-life in October 2024. Python 3.9 is now the minimum supported version.
CUDA Toolkit updates
PyPI wheels are now built with CUDA Toolkit 12.9.1 (up from 12.8.0 in previous releases). This enables new optimizations and features, including the wp.config.optimization_level setting for controlling kernel compilation.

For users building Warp from source, CUDA Toolkit 12.9.1 or newer is recommended for full GPU support.
Acknowledgments
We also thank the following contributors from outside the core Warp development team:
- Group-aware construction and queries for wp.Bvh and wp.Mesh, a new built-in function for wp.mesh_query_ray_anyhit(), support for the max_dist argument to wp.bvh_query_next(), and improvements to the BVH SAH constructor to use centroids for better build quality and traversal performance (@StafaH).
- The addition of alpha/beta scaling parameters to wp.tile_matmul(), reducing shared memory usage and enabling operation fusion in tile kernels.

Full Changelog
For a curated list of all changes in this release, please see the v1.11.0 section in CHANGELOG.md.
This discussion was created from the release v1.11.0.