Unifying TorchTensorRTModule and PythonTorchTensorRTModule
TL;DR
Unify TorchTensorRTModule (C++ runtime) and PythonTorchTensorRTModule (Python runtime) into a single runtime module with shared logic. The unified module will dispatch to either a Python or C++ execute_engine implementation based on the build and a runtime flag. This reduces maintenance burden, eliminates behavioral inconsistencies, preserves backward compatibility, and enables serialization and re-export for the Python runtime.
Goal(s)
Currently, we have two standalone runtime modules: TorchTensorRTModule for the C++ runtime and PythonTorchTensorRTModule for the Python runtime. These two modules are implemented differently and do not share logic, which increases the maintenance burden and causes inconsistencies in the Torch-TensorRT runtime.
To address this inconsistency, we will unify TorchTensorRTModule and PythonTorchTensorRTModule into a single runtime module that calls an execute_engine op implemented in either Python or C++, depending on the build the user chooses. After this change, the Python runtime will also be serializable.
Proposed APIs / UX
If a user selects a Python-only build, only the Python implementation of the execute_engine op is available. In a full build, users can choose between the Python and C++ implementations of execute_engine via a context manager.
Example Workflow
The user-facing API stays the same for backward compatibility, as shown in the sketch below.
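A minimal sketch of the expected workflow, assuming the current torch_tensorrt.compile API. The context manager name set_python_runtime is hypothetical; this RFC leaves the exact name and location open:

```python
import torch
import torch_tensorrt

model = torch.nn.Linear(8, 4).eval().cuda()
inputs = [torch.randn(1, 8).cuda()]

# Compilation is unchanged; it now returns the unified runtime module
# instead of TorchTensorRTModule / PythonTorchTensorRTModule.
trt_module = torch_tensorrt.compile(model, ir="dynamo", inputs=inputs)

# Default dispatch: C++ execute_engine in a full build,
# Python execute_engine in a Python-only build.
out = trt_module(*inputs)

# Hypothetical context manager to switch the global runtime state.
with torch_tensorrt.runtime.set_python_runtime():
    out = trt_module(*inputs)  # dispatches to the Python execute_engine op
```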
Design & Implementation
- Implement a Python Engine class (a wrapper around the Python TensorRT engine), similar to the C++ TRTEngine.h (see the first sketch after this list)
  - Make sure the serialization format is the same across both runtimes
- Implement an execute_engine op as the counterpart of the C++ execute_engine op, supporting the features that PythonTorchTensorRTModule already supports:
  - output allocator
  - CUDA graphs
  - unowned tensors
  - profiling
- Implement the execute_engine op registration logic, depending on the user's selection and the Torch-TensorRT build (see the second sketch after this list):
  - The controlling flag will no longer be a compilation/runtime setting; it is a system setting.
  - Add a global state to the Torch-TensorRT system that controls which runtime op is used.
  - Default to C++ in a full build and to Python in a Python-only build.
  - Use a context manager to let users control the global state.
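As a sketch of the first item, the Python engine wrapper might look like the following. The class name PyTRTEngine and its fields are illustrative, and the assumption that __getstate__/__setstate__ carry the serialized engine in the same layout as the C++ TRTEngine is something this RFC would have to pin down:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)


class PyTRTEngine:  # hypothetical name; mirrors the spirit of C++ TRTEngine.h
    """Wrapper around a deserialized TensorRT engine for the Python runtime."""

    def __init__(self, serialized_engine: bytes, input_names, output_names, name: str = ""):
        self.name = name
        self.input_names = input_names
        self.output_names = output_names
        runtime = trt.Runtime(TRT_LOGGER)
        self.engine = runtime.deserialize_cuda_engine(serialized_engine)
        self.context = self.engine.create_execution_context()

    def __getstate__(self):
        # Serialize the engine bytes in place of the live objects; the exact
        # payload layout should match the C++ TRTEngine format (TBD).
        state = self.__dict__.copy()
        state["engine"] = bytes(self.engine.serialize())
        del state["context"]
        return state

    def __setstate__(self, state):
        serialized = state.pop("engine")
        self.__dict__.update(state)
        runtime = trt.Runtime(TRT_LOGGER)
        self.engine = runtime.deserialize_cuda_engine(serialized)
        self.context = self.engine.create_execution_context()
```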
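For the registration item, a minimal sketch of the system-level flag plus context manager, assuming a single execute_engine entry point that dispatches on global state. Every name here (_use_python_runtime, set_python_runtime, the two placeholder ops) is hypothetical:

```python
import contextlib

# Global system state (not a per-module compile setting). In a Python-only
# build, _CPP_RUNTIME_AVAILABLE would be False and Python becomes the default.
_CPP_RUNTIME_AVAILABLE = True
_use_python_runtime = not _CPP_RUNTIME_AVAILABLE


def _py_execute_engine(inputs, engine):
    """Placeholder for the Python execute_engine op (output allocator,
    CUDA graphs, unowned tensors, and profiling would live here)."""
    raise NotImplementedError


def _cpp_execute_engine(inputs, engine):
    """Placeholder for the C++ execute_engine binding."""
    raise NotImplementedError


@contextlib.contextmanager
def set_python_runtime(enabled: bool = True):
    """Temporarily flip the system-wide runtime selection."""
    global _use_python_runtime
    if not enabled and not _CPP_RUNTIME_AVAILABLE:
        raise RuntimeError("C++ runtime is unavailable in a Python-only build")
    previous, _use_python_runtime = _use_python_runtime, enabled
    try:
        yield
    finally:
        _use_python_runtime = previous


def execute_engine(inputs, engine):
    """Single entry point; dispatches based on the global state."""
    if _use_python_runtime:
        return _py_execute_engine(inputs, engine)
    return _cpp_execute_engine(inputs, engine)
```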
Implementation Phases
MVP (<TARGET RELEASE VERSION>)
- Merge both runtimes: delete PythonTorchTensorRTModule and reimplement the execute_engine op
- Build the registration logic
- Enable serialization and re-export (see the sketch below)
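A hedged example of what the serialization and re-export goal should enable, using the existing torch_tensorrt.save/load entry points (the exact return type of load may differ by format). The RFC's goal is that this round-trip works regardless of which runtime executes the engine, including the Python runtime, which is not serializable today:

```python
import torch
import torch_tensorrt

model = torch.nn.Linear(8, 4).eval().cuda()
inputs = [torch.randn(1, 8).cuda()]

trt_module = torch_tensorrt.compile(model, ir="dynamo", inputs=inputs)

# Save as an exported program and load it back; with the unified module this
# should work even when the Python execute_engine implementation is selected.
torch_tensorrt.save(trt_module, "trt_model.ep", inputs=inputs)
exported = torch_tensorrt.load("trt_model.ep")  # expected: an ExportedProgram
reloaded = exported.module()
out = reloaded(*inputs)
```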
Extension Phase 1 (<TARGET RELEASE VERSION>)
- Benchmark and ensure both runtimes deliver the same performance