Summary
We are running usls::models::Sam3Image on an RTX 4090 with ONNXRuntime TensorRT EP enabled. TensorRT is clearly being used (engine serialization + DryRun ... on TensorRT:0), but end-to-end latency for a single 600×500 JPEG image is still ~1.1s after warm-up. This is only ~20% faster than CUDA EP on the same machine (~1.4s). We would like to confirm whether this performance is expected for SAM3 and what the recommended optimization path is.
Environment
OS: Ubuntu 24.04
GPU: NVIDIA GeForce RTX 4090 (24GB)
Driver: 580.126.09
CUDA toolkit: 13.0 (nvcc --version = 13.0.88)
TensorRT: 10.14.1.48 (10.14.1.48-1+cuda13.0)
ONNXRuntime: via ort-download-binaries (pyke dfbin cache)
Runner: our own thin Rust HTTP wrapper usls-server
What is usls-server (context)
This is our own small Rust axum service that:
accepts multipart form (image + JSON params)
decodes image bytes
calls usls::Runtime::forward(...)
returns COCO-style RLE masks grouped by class name
We will attach usls-server/src/main.rs and usls-server/Cargo.toml to this issue (for reference only).
Reproduction
- Start service (TensorRT + CUDA image processor):
export ORT_DYLIB_PATH=/root/ai-monitor/usls-server/target/release
export LD_LIBRARY_PATH=$ORT_DYLIB_PATH:$LD_LIBRARY_PATH
USLS_DEVICE=tensorrt:0 \
USLS_PROCESSOR_DEVICE=cuda:0 \
USLS_POOL_SIZE=1 \
USLS_TEXT_CACHE_SIZE=4096 \
USLS_DTYPE=fp32 ./target/release/usls-server
On first start we see logs like:
Initial model serialization with TensorRT may require a wait...
DryRun ... on TensorRT:0(NVIDIA)
- Measure latency (single image, 600×500 JPEG):
curl -s -o /dev/null -w "usls_http=%{time_total}\n" \
  -X POST http://127.0.0.1:8080/v1/sam3-image \
  -F "image=@/root/001105.jpg" \
  -F 'params={"prompts":["person","head","helmet"],"confs":0.5,"return":"rle","independent":false}'
Observed (warm): usls_http ≈ 1.126s
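The single-shot number above includes HTTP/multipart overhead, so we also averaged over several warm requests. A sketch of the helper we use (`mean` and `bench` are our own hypothetical names; it relies only on curl's documented `%{time_total}` write-out variable and assumes the server from the reproduction is running):

```shell
# mean: average latencies (seconds, one per line) read from stdin.
mean() {
  awk '{ sum += $1 } END { printf "mean=%.3fs over %d runs\n", sum / NR, NR }'
}

# bench N <curl args...>: issue N identical requests and report the
# mean warm latency, using the same -w trick as the single-shot curl.
bench() {
  local n="$1"; shift
  for _ in $(seq "$n"); do
    curl -s -o /dev/null -w '%{time_total}\n' "$@"
  done | mean
}

# Example (uncomment with the server running):
# bench 10 -X POST http://127.0.0.1:8080/v1/sam3-image \
#   -F "image=@/root/001105.jpg" \
#   -F 'params={"prompts":["person","head","helmet"],"confs":0.5,"return":"rle","independent":false}'
```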
Notes:
independent=false (batch prompts in one forward call)
USLS_PROCESSOR_DEVICE=cpu:0 was slightly slower than cuda:0
Same machine with CUDA EP was ~1.4s (so TensorRT is only ~20% faster)
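One knob we have not tried yet: persisting the serialized engine across restarts via the engine-cache environment variables documented for ONNX Runtime's TensorRT EP. A sketch (the cache path is hypothetical, this only amortizes the first-start serialization wait rather than warm latency, and whether these variables take effect through usls's ort bindings is an assumption on our part):

```shell
# Engine-cache knobs from ONNX Runtime's TensorRT EP docs; untested with usls.
export ORT_TENSORRT_ENGINE_CACHE_ENABLE=1              # reuse serialized engines
export ORT_TENSORRT_CACHE_PATH=/root/trt-engine-cache  # hypothetical cache dir
```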
Expected / Question
Is ~1.1s per 600×500 image expected for Sam3Image even with TensorRT EP on an RTX 4090? If not, what are the recommended steps/knobs to optimize latency?