v1.11.0 (#1146), announced by shi-eric in Announcements
Warp v1.11.0
Warp v1.11 introduces group-aware spatial queries for multi-world workloads, provides new options for managing JIT compilation overhead, and expands differentiation capabilities with wp.grad(). This release also includes expanded tile operations, the unpack operator in kernels, C++ integration examples, and a major API cleanup clarifying public versus internal interfaces.

New features
Group-aware spatial queries
Warp v1.11 introduces group-aware construction and queries for the wp.Bvh and wp.Mesh data structures, enabling efficient spatial queries across multiple independent environments. This feature allows you to build a single acceleration structure containing geometry from multiple worlds or scenes, then query each world independently without traversing primitives from other worlds.

When constructing a BVH or Mesh, assign each primitive to a group using the groups parameter. Warp builds isolated sub-trees for each group within a unified structure.

For production use, launch multiple threads in parallel, each querying its assigned world from arrays of world IDs and ray parameters. See Newton's raytrace implementation for a real-world example of parallel multi-world raycasting.
Key features

- groups array during construction to organize primitives into isolated sub-trees
- root parameter to limit traversal to a specific group
- wp.bvh_get_group_root() and wp.mesh_get_group_root() retrieve sub-tree roots for each group

Thanks to @StafaH for implementing this feature.
Geometry query enhancements
Warp v1.11 adds several new query functions and improvements for spatial queries:

- wp.mesh_query_ray_anyhit(): Fast any-hit query that returns immediately upon finding any intersection, useful for shadow-ray calculations in rendering
- wp.mesh_query_ray_count_intersections(): Counts all ray-triangle intersections along a ray path
- wp.mesh_query_point_sign_parity(): Point-in-mesh query using perturbed ray casting with majority voting for improved robustness in challenging cases
- max_dist parameter: wp.bvh_query_next() now accepts a maximum distance to filter intersections, useful for early ray termination
- Tile-based variants of the query builtins (wp.bvh_query_aabb_tiled(), wp.bvh_query_ray_tiled(), wp.mesh_query_aabb_tiled(), etc.)

Evaluate the gradients of functions
wp.grad() directly evaluates the gradient of a Warp function at specific input values, computing gradients inline during the forward pass. This is useful for computing forces from energy functions, or when implementing custom adjoints that need to call auto-generated gradients of subfunctions, avoiding the need to manually code the entire adjoint chain. This contrasts with wp.Tape(), which records an entire computation graph for reverse-mode automatic differentiation across multiple kernel launches. This feature was implemented in response to community feedback (#125).

wp.tile_map() supports n-ary maps (up to n=8)

User-defined functions that accept up to 8 arguments may now be used as tile mapping functions. An equivalent number of tiles must be passed to wp.tile_map().

Generate tiles of random numbers
wp.tile_randf() and wp.tile_randi() have been introduced to generate tiles of random floats and ints, respectively. These functions accept optional lower and upper bound arguments to control the range of generated values, and can be used, for example, to generate 4x4 tensors of random floats using 2x2 tiles.

Alpha and Beta scalings in wp.tile_matmul()

Optional alpha and beta scaling arguments have been added to the wp.tile_matmul() builtins:

- Accumulating form: previously out = A * B + out, now out = alpha * A * B + beta * out
- Non-accumulating form: previously out = A * B, now out = alpha * A * B

In-place variants of Cholesky decomposition and linear solvers
wp.tile_cholesky_inplace(), wp.tile_cholesky_solve_inplace(), wp.tile_lower_solve_inplace(), and wp.tile_upper_solve_inplace() give the same results as their non-in-place counterparts, but overwrite input memory rather than allocating additional output memory, thereby halving shared memory usage. This is particularly beneficial in memory-constrained kernels where shared memory is limited.

Performance improvements
JIT-compile time improvements
Warp v1.11 brings three changes that aim to reduce the time to compile and load modules:
Precompiled headers
The CUDA C++ files that are generated from the Python modules all include the same set of header files. Warp now leverages NVRTC precompiled headers to cache the result of parsing these headers and reuse it for subsequent modules.
The first module that gets compiled incurs a 50 ms overhead to create the precompiled header, but every subsequent module in the same Python session saves 50-500 ms of compile time, with larger modules seeing the greatest benefit. The precompiled header is stored in a temporary directory and cached for the lifetime of the Python process. Each new Python process must recreate the precompiled header, as PCH files cannot be shared across processes due to internal memory layout requirements.
This feature is enabled by default, but can be disabled by setting wp.config.use_precompiled_headers = False.

Note for source builds: Precompiled headers require building Warp against CUDA Toolkit 12.8 or newer. Users installing from PyPI automatically have this feature because the Warp libraries on PyPI are now built against CUDA Toolkit 12.9.1.
For more details, see the NVRTC PCH documentation.
Optimization level control
By default, the CUDA Runtime Compiler performs a high level of optimization on GPU kernels, favoring runtime performance at the cost of longer compilation times. Warp v1.11 introduces the wp.config.optimization_level setting to control this tradeoff. The setting applies to GPU kernel compilation and accepts values from 0 (fastest compilation, least optimization) to 3 (maximum runtime optimization). When set to None (the default), Warp uses level 3.
The setting can be configured globally via wp.config.optimization_level or per-module using wp.set_module_options({"optimization_level": 2}).

This setting is available when Warp is built against CUDA Toolkit 12.9 or newer. PyPI wheels include this support.
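Both configuration forms from the notes above can be sketched as follows (the level value 2 is an arbitrary illustration):

```python
import warp as wp

# Globally favor faster JIT compilation over maximum kernel performance.
# Accepted values are 0-3; None, the default, selects level 3.
wp.config.optimization_level = 2

wp.init()

# Or override the level for the current module only:
wp.set_module_options({"optimization_level": 2})
```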
Parallel compilation
Modules can now be compiled and loaded in parallel across multiple threads for both CPU and GPU. To benefit from parallel compilation, set wp.config.load_module_max_workers to a positive integer (the default is 0, which disables parallelization) and explicitly load modules using wp.load_module() or wp.force_load(). You can also pass a max_workers argument directly to these functions to override the config setting. When modules are lazily compiled on demand at wp.launch(), they are compiled one at a time and do not benefit from parallelization.

Parallel compilation can significantly reduce startup time when working with many modules. The most time-consuming step being parallelized is the two-stage compilation process: translating Python code to CUDA/C++ source code, then JIT-compiling it to binary libraries using NVRTC for the GPU or LLVM for the CPU.
wp.load_module() requires recursive=True to enable parallel compilation; this loads the specified module along with all its submodules. For example, loading Newton's inverse kinematics module and its submodules with parallel compilation (wp.config.load_module_max_workers = 4) takes about 3 seconds for CUDA (versus 4.5 seconds serially) and 3.7 seconds for CPU (versus 10.5 seconds serially).

wp.force_load() provides lower-level control by accepting an explicit list of modules to compile, unlike wp.load_module(), which operates on a module hierarchy. Warning: without explicit modules and device arguments, wp.force_load() compiles all imported modules for all available devices, which can take much longer than not using it at all. For example, import newton followed by wp.force_load(device="cuda:0") will compile over 100 modules.

The following example shows selective compilation using a manually built module list. This requires more setup but provides fine-grained control:
With the Newton-based module list used for this measurement, parallel compilation (wp.config.load_module_max_workers = 4) takes about 9-10 seconds, compared to about 24 seconds with serial loading (wp.config.load_module_max_workers = 0), a roughly 2.5x speedup in this case.

Parallel compilation is most effective when compilation time is distributed evenly across modules. Gains will be limited if a single module dominates the total compilation time.
Advanced optimization: For applications with many kernels in a single large module file, consider splitting them into separate submodules across multiple files. This enables parallel compilation of the submodules, trading some code organization complexity for faster compilation times.
Language enhancements
Unpack operator support in kernels
Warp now supports Python's unpack operator (*) inside kernel function calls, enabling you to expand vectors, matrices, quaternions, and 1D array slices into individual arguments. This feature brings familiar Python idioms into Warp kernels and simplifies common patterns like constructing larger vectors from smaller ones, or copying array values into a vector.

The unpack operator works on composite types to expand their components:
Important: When unpacking arrays, slice bounds must be compile-time constants and non-negative. The upper bound is required since the array length is not known at compile time.
C++ integration examples
Warp v1.11 introduces C++ integration examples demonstrating how to deploy Warp-compiled kernels in standalone C++ applications without runtime Python dependencies. Two approaches are demonstrated:
- Runtime loading (00_cubin_launch): Load Warp-generated CUBIN files at runtime using the CUDA Driver API
- Source inclusion (01_source_include): Statically include Warp-generated CUDA source (including forward and adjoint kernels) in C++ projects

These examples are available at warp/examples/cpp/ with full build support for Make and CMake on Linux and Windows.

Community feedback: We're gathering input on Warp's AOT and C++ interoperability roadmap through a survey on GitHub Discussions. If you work with native workflows, deployment in minimal-Python environments, or interoperability with other CUDA/C++ libraries, your feedback will help shape future development in these areas.
Public API clarification
Warp v1.11 refines the boundary between public and internal APIs alongside a major documentation reorganization. Symbols and namespaces intended for internal use now emit deprecation warnings when accessed and will be removed in v1.13 (nominally May 2026). The complete public API is now clearly documented in the restructured API Reference and Language Reference sections.
What this means for your code:
- Deprecation warnings identify the affected symbols; for additional detail, set wp.config.verbose_warnings = True.
- Accessing internal submodules of the warp package (e.g., wp.context) is deprecated. If your code relies on internal APIs like wp.context.runtime, you can access them via wp._src.context.runtime, but be aware these are not part of the public API and may change or be removed without notice.
- Use wp.DeviceLike instead of the deprecated wp.Devicelike (note the capitalization).

If you depend on functionality that's no longer accessible and believe it should be part of the public API, please open a feature request on GitHub. Note: We're aware that some functionality (such as graph coloring and color balancing) currently lacks a public API and requires accessing internal modules. We're tracking these gaps in issues like #1145.
Platform support
Python version requirements
Warp 1.11.0 drops support for Python 3.8, which reached end-of-life in October 2024. Python 3.9 is now the minimum supported version.
CUDA Toolkit updates
PyPI wheels are now built with CUDA Toolkit 12.9.1 (up from 12.8.0 in previous releases). This enables new optimizations and features, including the wp.config.optimization_level setting for controlling kernel compilation.

For users building Warp from source, CUDA Toolkit 12.9.1 or newer is recommended for full GPU support.
Acknowledgments
We also thank the following contributors from outside the core Warp development team:
- Group-aware construction and queries for wp.Bvh and wp.Mesh, a new built-in function for wp.mesh_query_ray_anyhit(), support for the max_dist argument to wp.bvh_query_next(), and improvements to the BVH SAH constructor to use centroids for better build quality and traversal performance (@StafaH).
- The addition of alpha/beta scaling parameters to wp.tile_matmul(), reducing shared memory usage and enabling operation fusion in tile kernels.

Full Changelog
For a curated list of all changes in this release, please see the v1.11.0 section in CHANGELOG.md.
This discussion was created from the release v1.11.0.