✨[Feature] Python Runtime Rework to Unify Both Runtimes #4047

@cehongwang

Description

Unifying TorchTensorRTModule and PythonTorchTensorRTModule

TL;DR

Unify TorchTensorRTModule (C++ runtime) and PythonTorchTensorRTModule (Python runtime) into a single runtime module with shared logic. The unified module will dispatch to either a Python or C++ execute_engine implementation based on the build and a runtime flag. This reduces maintenance burden, eliminates behavioral inconsistencies, preserves backward compatibility, and enables serialization and re-export for the Python runtime.

Goal(s)

Currently, we have two standalone runtime modules: TorchTensorRTModule for the C++ runtime and PythonTorchTensorRTModule for the Python runtime. These two modules are implemented differently and do not share the same logic, which increases the maintenance burden and causes inconsistencies in the Torch-TensorRT runtime.

To address this inconsistency, we will unify TorchTensorRTModule and PythonTorchTensorRTModule into a single runtime module that calls an execute_engine op implemented in either Python or C++, depending on the build the user chooses. After this change, the Python runtime also becomes serializable.

Proposed APIs / UX

If a user selects a Python-only build, the execute_engine op is implemented only in Python. In a full build, users can control whether the Python or the C++ implementation of execute_engine is used via a context manager, as illustrated in the example workflow below.

Example Workflow

The user-facing API stays the same for backward compatibility, as sketched below.
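
A hedged sketch of what the unchanged workflow could look like under a full build. `torch_tensorrt.runtime.set_runtime` is an illustrative name for the proposed context manager, not a finalized API; the `torch_tensorrt.compile` and `torch_tensorrt.save` entry points shown are existing ones.

```python
import torch
import torch_tensorrt

# Placeholder model; any TorchScript/Dynamo-compilable module works here.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3)).eval().cuda()
inputs = [torch.randn(1, 3, 224, 224).cuda()]

# Compilation is unchanged; the unified module decides at call time
# which execute_engine implementation to dispatch to.
trt_module = torch_tensorrt.compile(model, ir="dynamo", inputs=inputs)

out = trt_module(*inputs)  # full build: C++ execute_engine by default

# Hypothetical context manager (name illustrative) forcing the Python
# implementation of execute_engine inside this block only.
with torch_tensorrt.runtime.set_runtime("python"):
    out = trt_module(*inputs)

# With the unified module, the Python runtime is serializable as well.
torch_tensorrt.save(trt_module, "trt_model.ep", inputs=inputs)
```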

Design & Implementation

  1. Implement a Python Engine class (a wrapper around the Python TensorRT engine) similar to the C++ TRTEngine.h (see the sketch after this list).

    • Make sure the serialization format is the same across both runtimes.
  2. Implement an execute_engine op that is the counterpart of the C++ execute_engine op, supporting the features PythonTorchTensorRTModule already supports:

    • output allocator
    • CUDA graphs
    • unowned tensors
    • profiling
  3. Implement the execute_engine op registration logic, depending on the user's selection and the Torch-TensorRT build (see the sketch after this list).

    • The controlling flag will no longer be a compilation/runtime setting; it becomes a system-wide setting.
    • Add a global Torch-TensorRT state that controls which runtime op is used.
    • Default to C++ in a full build and to Python in a Python-only build.
    • Use a context manager to let users control the global state.
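
A minimal sketch of how items 1 and 3 could fit together, under assumed names: `PyTRTEngine`, `set_runtime`, the `torch_tensorrt._C` probe, and the `_execute_engine_*` stubs are all illustrative stand-ins, not the actual implementation.

```python
# Illustrative sketch only -- all names here are hypothetical.
import contextlib
import threading


def _cpp_runtime_available() -> bool:
    # In a full build the C++ runtime extension is importable; in a
    # Python-only build the import fails and we fall back to Python.
    try:
        import torch_tensorrt._C  # noqa: F401  (assumed extension name)
        return True
    except ImportError:
        return False


class _RuntimeState(threading.local):
    """Global (thread-local) state selecting the execute_engine implementation.

    Defaults to C++ when a full build is present, otherwise Python.
    """

    def __init__(self) -> None:
        self.use_cpp = _cpp_runtime_available()


_STATE = _RuntimeState()


@contextlib.contextmanager
def set_runtime(backend: str):
    """Context manager letting users override the system-wide runtime choice."""
    if backend == "cpp" and not _cpp_runtime_available():
        raise RuntimeError("C++ runtime requested but this is a Python-only build")
    previous = _STATE.use_cpp
    _STATE.use_cpp = backend == "cpp"
    try:
        yield
    finally:
        _STATE.use_cpp = previous


def _execute_engine_python(engine, inputs):
    # Placeholder for the pure-Python op from item 2 (output allocator,
    # CUDA graphs, unowned tensors, and profiling would live here).
    raise NotImplementedError


def _execute_engine_cpp(engine, inputs):
    # Placeholder for the bound C++ op (e.g. torch.ops.tensorrt.execute_engine).
    raise NotImplementedError


def execute_engine(engine, inputs):
    """Dispatch to the C++ or Python execute_engine op based on global state."""
    if _STATE.use_cpp:
        return _execute_engine_cpp(engine, inputs)
    return _execute_engine_python(engine, inputs)


class PyTRTEngine:
    """Sketch of the Python Engine class from item 1, mirroring the C++
    TRTEngine: it keeps the serialized engine bytes so that pickling and
    re-export see the same format as the C++ runtime."""

    def __init__(self, serialized_engine: bytes, name: str = ""):
        self.name = name
        self.serialized_engine = serialized_engine

    def __getstate__(self):
        # Serialize only bytes + metadata; live TensorRT objects are
        # rebuilt lazily after deserialization.
        return {"name": self.name, "engine": self.serialized_engine}

    def __setstate__(self, state):
        self.name = state["name"]
        self.serialized_engine = state["engine"]
```

Keeping the flag in a thread-local global rather than a per-module compile setting matches the "system setting" framing in item 3 while keeping the context manager safe to nest.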

Implementation Phases

MVP (<TARGET RELEASE VERSION>)

    • Merge both runtimes: delete PythonTorchTensorRTModule and reimplement the execute_engine op
    • Build the registration logic
    • Enable serialization and re-export

Extension Phase 1 (<TARGET RELEASE VERSION>)

Benchmark both runtimes and make sure they have the same performance

Extension Phase 2 (<TARGET RELEASE VERSION>)
