SAM3 (Sam3Image) with TensorRT EP on RTX 4090 still ~1.1s/image (600×500) via our thin usls-server wrapper — expected larger speedup #210

@likebean

Description

Summary
We are running usls::models::Sam3Image on an RTX 4090 with the ONNX Runtime TensorRT EP enabled. TensorRT is clearly in use (we see engine serialization and DryRun ... on TensorRT:0 in the logs), but end-to-end latency for a single 600×500 JPEG is still ~1.1s after warm-up. This is only ~20% faster than the CUDA EP on the same machine (~1.4s). We would like to confirm whether this performance is expected for SAM3, and what the recommended optimization path is.

Environment
OS: Ubuntu 24.04
GPU: NVIDIA GeForce RTX 4090 (24GB)
Driver: 580.126.09
CUDA toolkit: 13.0 (nvcc --version = 13.0.88)
TensorRT: 10.14.1.48 (10.14.1.48-1+cuda13.0)
ONNXRuntime: via ort-download-binaries (pyke dfbin cache)
Runner: our own thin Rust HTTP wrapper usls-server

What is usls-server (context)
This is our own small Rust axum service that:
- accepts a multipart form (image + JSON params)
- decodes the image bytes
- calls usls::Runtime::forward(...)
- returns COCO-style RLE masks grouped by class name
We will attach usls-server/src/main.rs and usls-server/Cargo.toml to this issue (for reference only).
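As background for the response format: COCO-style uncompressed RLE stores run lengths in column-major (Fortran) order, starting with a run of zeros (which may be length 0). A minimal sketch of that encoding, purely for illustration (this is not the actual usls-server code, and the function name is ours):

```rust
/// Encode a binary mask (row-major, `height` x `width`, values 0/1) into
/// COCO-style uncompressed RLE counts: run lengths taken in column-major
/// order, with the first count covering zeros (possibly a run of length 0).
fn mask_to_rle_counts(mask: &[u8], height: usize, width: usize) -> Vec<usize> {
    let mut counts = Vec::new();
    let mut prev = 0u8; // RLE always starts by counting zeros
    let mut run = 0usize;
    for x in 0..width {
        for y in 0..height {
            let v = mask[y * width + x]; // column-major traversal
            if v == prev {
                run += 1;
            } else {
                counts.push(run);
                prev = v;
                run = 1;
            }
        }
    }
    counts.push(run);
    counts
}

fn main() {
    // 2x2 mask [[0,1],[1,1]] scans column-major as 0,1,1,1 -> counts [1, 3]
    let counts = mask_to_rle_counts(&[0u8, 1, 1, 1], 2, 2);
    println!("{:?}", counts); // [1, 3]
}
```

The leading zero-run convention is what lets decoders reconstruct the mask without a separate "starts with" flag.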

Reproduction

  1. Start service (TensorRT + CUDA image processor):
    export ORT_DYLIB_PATH=/root/ai-monitor/usls-server/target/release
    export LD_LIBRARY_PATH=$ORT_DYLIB_PATH:$LD_LIBRARY_PATH
    USLS_DEVICE=tensorrt:0 \
    USLS_PROCESSOR_DEVICE=cuda:0 \
    USLS_POOL_SIZE=1 \
    USLS_TEXT_CACHE_SIZE=4096 \
    USLS_DTYPE=fp32 ./target/release/usls-server
    On first start we see logs like:
    Initial model serialization with TensorRT may require a wait...
    DryRun ... on TensorRT:0(NVIDIA)
  2. Measure latency (single image, 600×500 JPEG):
    curl -s -o /dev/null -w "usls_http=%{time_total}\n" \
      -X POST http://127.0.0.1:8080/v1/sam3-image \
      -F "image=@/root/001105.jpg" \
      -F 'params={"prompts":["person","head","helmet"],"confs":0.5,"return":"rle","independent":false}'
    Observed (warm): usls_http ≈ 1.126s
    Notes:
    - independent=false (all prompts batched into one forward call)
    - USLS_PROCESSOR_DEVICE=cpu:0 was slightly slower than cuda:0
    - The same machine with the CUDA EP was ~1.4s, so TensorRT is only ~20% faster
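For reference, the ~20% figure follows directly from the two warm latencies above (a trivial check, with the latencies hard-coded from our measurements):

```rust
/// Relative improvement of `fast` over `slow`, in percent.
fn improvement_pct(slow: f64, fast: f64) -> f64 {
    (slow - fast) / slow * 100.0
}

fn main() {
    // Warm end-to-end latencies reported above (seconds).
    let cuda_ep = 1.4;
    let trt_ep = 1.126;
    println!("{:.1}%", improvement_pct(cuda_ep, trt_ep)); // 19.6%
}
```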

Expected / Question
Is ~1.1s per 600×500 image expected for Sam3Image even with TensorRT EP on an RTX 4090? If not, what are the recommended steps/knobs to optimize latency?

main.rs.txt
Cargo.toml.txt
