Description
Describe the issue
I noticed that CUDA Graph launch does not occur when running a TensorRT-RTX EP session from a precompiled (AOT build) TensorRT-RTX engine. This issue may be related to #26929.
In the following snippet from NvExecutionProvider::CreateNodeComputeInfoFromGraph, the CUDA Graph strategy is explicitly set to kWHOLE_GRAPH_CAPTURE. This override is necessary because, according to the TensorRT-RTX API documentation, the default strategy is kDISABLED.
onnxruntime/onnxruntime/core/providers/nv_tensorrt_rtx/nv_execution_provider.cc
Lines 2951 to 2962 in a3749f1
    trt_runtime_config = std::unique_ptr<nvinfer1::IRuntimeConfig>(trt_engine->createRuntimeConfig());
    if (trt_runtime_config && cuda_graph_enable_) {
      trt_runtime_config->setDynamicShapesKernelSpecializationStrategy(nvinfer1::DynamicShapesKernelSpecializationStrategy::kEAGER);
    #if TRT_MAJOR_RTX > 1 || (TRT_MAJOR_RTX == 1 && TRT_MINOR_RTX >= 3)
      auto cuda_strategy_flag = trt_runtime_config->setCudaGraphStrategy(nvinfer1::CudaGraphStrategy::kWHOLE_GRAPH_CAPTURE);
      LOGS_DEFAULT(INFO) << "[NvTensorRTRTX EP] CUDA graph strategy with RTX Graph capture enabled : " << cuda_strategy_flag;
    #else
      LOGS_DEFAULT(WARNING) << "[NvTensorRTRTX EP] CUDA graph is enabled but RTX Graph capture is not available. "
                            << "The current TRT RTX version does not support RTX Graph. "
                            << "Please upgrade to TRT RTX >= 1.3 to use RTX Graph capture feature for optimal CUDA graph performance.";
    #endif
    }
It seems that this CUDA Graph strategy override does not apply when the session is run from a precompiled engine.
To reproduce
- Unzip repro.tar.gz:

    tar -xzf repro.tar.gz
    cd repro/

- (Expected behavior) When running repro.py without relying on a precompiled TensorRT-RTX engine (script default), --num-runs calls to cudaGraphLaunch occur:

    python repro.py --num-runs 5

  You may profile these calls by using NVIDIA Nsight Systems:

- (Issue) When running repro.py with the --use-precompiled-engine option, CUDA kernels are launched directly without relying on cudaGraphLaunch. This behavior may lead to CPU overhead associated with launching CUDA kernels sequentially:

    python repro.py --use-precompiled-engine --num-runs 5

  Here is the corresponding NVIDIA Nsight Systems profile:
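To compare the two runs without reading the timeline by eye, the cudaGraphLaunch call count could be checked from a CUDA API summary exported as CSV. The sketch below is not part of the repro archive. It assumes the summary was produced with something like `nsys stats --report cuda_api_sum --format csv` and that the CSV has `Num Calls` and `Name` columns; both the report name and the column names may differ between Nsight Systems versions. The helper `count_api_calls` and the sample rows are hypothetical and purely illustrative, not measured data.

```python
import csv
import io


def count_api_calls(csv_text, api_name="cudaGraphLaunch"):
    """Sum 'Num Calls' over rows whose 'Name' starts with api_name.

    Prefix matching is used because profilers may append a version suffix
    to API names (for example 'cudaGraphLaunch_v10000').
    Assumption: the CSV has 'Num Calls' and 'Name' columns.
    """
    total = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = (row.get("Name") or "").strip()
        if name.startswith(api_name):
            total += int(row["Num Calls"])
    return total


# Made-up rows shaped like a CUDA API summary (illustrative only).
SAMPLE = """Time (%),Total Time (ns),Num Calls,Name
60.0,600000,5,cudaGraphLaunch_v10000
40.0,400000,12,cudaLaunchKernel
"""

if __name__ == "__main__":
    # With --num-runs 5 and CUDA Graph active, 5 launches would be expected.
    print(count_api_calls(SAMPLE))
```

With `--num-runs 5`, a count of 5 would match the expected behavior, while a count of 0 alongside many direct kernel launches would match the behavior reported with `--use-precompiled-engine`.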
Urgency
No response
Platform
Linux
OS Version
Rocky Linux 8.10
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
1.24.1
ONNX Runtime API
C++
Architecture
X64
Execution Provider
Other / Unknown
Execution Provider Library Version
TensorRT-RTX-1.3.0.35
