TensorRT-RTX EP: Enabling CUDA Graph seems ineffective when running a session from a precompiled engine #27329

@matpoudret

Describe the issue

I noticed that CUDA Graph launch does not occur when running a TensorRT-RTX EP session from a precompiled (AOT build) TensorRT-RTX engine. This issue may be related to #26929.

In the following snippet from NvExecutionProvider::CreateNodeComputeInfoFromGraph, the CUDA Graph strategy is explicitly set to kWHOLE_GRAPH_CAPTURE. This override is needed because, according to the TensorRT-RTX API documentation, the default strategy is kDISABLED.

    trt_runtime_config = std::unique_ptr<nvinfer1::IRuntimeConfig>(trt_engine->createRuntimeConfig());
    if (trt_runtime_config && cuda_graph_enable_) {
      trt_runtime_config->setDynamicShapesKernelSpecializationStrategy(nvinfer1::DynamicShapesKernelSpecializationStrategy::kEAGER);
    #if TRT_MAJOR_RTX > 1 || (TRT_MAJOR_RTX == 1 && TRT_MINOR_RTX >= 3)
      auto cuda_strategy_flag = trt_runtime_config->setCudaGraphStrategy(nvinfer1::CudaGraphStrategy::kWHOLE_GRAPH_CAPTURE);
      LOGS_DEFAULT(INFO) << "[NvTensorRTRTX EP] CUDA graph strategy with RTX Graph capture enabled : " << cuda_strategy_flag;
    #else
      LOGS_DEFAULT(WARNING) << "[NvTensorRTRTX EP] CUDA graph is enabled but RTX Graph capture is not available. "
                            << "The current TRT RTX version does not support RTX Graph. "
                            << "Please upgrade to TRT RTX >= 1.3 to use RTX Graph capture feature for optimal CUDA graph performance.";
    #endif
    }

It seems that this CUDA Graph strategy override does not apply when the session is run from a precompiled engine.
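
As a sanity check on the preprocessor guard in the snippet above, the following sketch mirrors its condition in Python and applies it to the library version reported in this issue (TensorRT-RTX-1.3.0.35). The function name and the version-string parsing are hypothetical helpers and not part of ONNX Runtime; the real check happens at compile time through TRT_MAJOR_RTX and TRT_MINOR_RTX, so this only holds if the build used headers matching that version.

```python
def rtx_graph_capture_guard(version: str) -> bool:
    """Mirror of: TRT_MAJOR_RTX > 1 || (TRT_MAJOR_RTX == 1 && TRT_MINOR_RTX >= 3)."""
    # Hypothetical parsing: accepts "TensorRT-RTX-1.3.0.35" as well as "1.3.0.35".
    numeric = version.rsplit("-", 1)[-1]
    major, minor = (int(part) for part in numeric.split(".")[:2])
    return major > 1 or (major == 1 and minor >= 3)


if __name__ == "__main__":
    print(rtx_graph_capture_guard("TensorRT-RTX-1.3.0.35"))
```

If the guard passes for the installed version, the #else branch (and its warning) should not be the reason the strategy is missing, which points toward the precompiled-engine code path itself.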

To reproduce

  1. Unzip repro.tar.gz:
tar -xzf repro.tar.gz
cd repro/
  2. (Expected behavior) When running repro.py without a precompiled TensorRT-RTX engine (the script default), as many calls to cudaGraphLaunch occur as the value passed to --num-runs:
python repro.py --num-runs 5

These calls are visible when profiling with NVIDIA Nsight Systems:

[Image: NVIDIA Nsight Systems profile of the run without a precompiled engine]

  3. (Issue) When running repro.py with the --use-precompiled-engine option, CUDA kernels are launched directly rather than through cudaGraphLaunch. This may add CPU overhead, because each kernel is launched individually:
python repro.py --use-precompiled-engine --num-runs 5

Here is the corresponding NVIDIA Nsight Systems profile:

[Image: NVIDIA Nsight Systems profile of the run with --use-precompiled-engine]
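
To compare the two runs without reading the timelines by eye, one option is to export the CUDA API summary of each Nsight Systems report to CSV and count the cudaGraphLaunch calls. The sketch below is a minimal example under stated assumptions: it expects a CSV with `Name` and `Num Calls` columns (believed to match the `cuda_api_sum` report of recent `nsys stats` releases; adjust the column names if your version differs). The function name and the sample rows are illustrative only and are not measurements from this issue.

```python
import csv
import io


def count_api_calls(csv_text: str, api_name: str = "cudaGraphLaunch") -> int:
    """Sum the 'Num Calls' column over rows whose 'Name' starts with api_name."""
    total = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumed column names. nsys may append a version suffix to the API
        # name (for example "cudaGraphLaunch_v10000"), hence startswith.
        if (row.get("Name") or "").startswith(api_name):
            total += int(row["Num Calls"])
    return total


if __name__ == "__main__":
    # Made-up sample in the assumed layout, not data from this issue.
    sample = "Num Calls,Name\n5,cudaGraphLaunch_v10000\n12,cudaMemcpyAsync\n"
    print(count_api_calls(sample))
```

Based on the behavior described above, the run without a precompiled engine would be expected to report a count equal to --num-runs, while the --use-precompiled-engine run would be expected to report 0.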

Urgency

No response

Platform

Linux

OS Version

Rocky Linux 8.10

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.24.1

ONNX Runtime API

C++

Architecture

X64

Execution Provider

Other / Unknown

Execution Provider Library Version

TensorRT-RTX-1.3.0.35

Labels

ep:CUDA (issues related to the CUDA execution provider), ep:TensorRT (issues related to TensorRT execution provider)
