SAM3 (Sam3Image) with TensorRT EP on RTX 4090 still ~1.1s/image (600×500) via our thin usls-server wrapper — expected larger speedup #210

@likebean

Description

Summary
We are running usls::models::Sam3Image on an RTX 4090 with the ONNX Runtime TensorRT EP enabled. TensorRT is clearly in use (we see engine serialization and DryRun ... on TensorRT:0 in the logs), but end-to-end latency for a single 600×500 JPEG is still ~1.1s after warm-up. This is only ~20% faster than the CUDA EP on the same machine (~1.4s). We would like to confirm whether this performance is expected for SAM3, and what the recommended optimization path is.

Environment
OS: Ubuntu 24.04
GPU: NVIDIA GeForce RTX 4090 (24GB)
Driver: 580.126.09
CUDA toolkit: 13.0 (nvcc --version = 13.0.88)
TensorRT: 10.14.1.48 (10.14.1.48-1+cuda13.0)
ONNXRuntime: via ort-download-binaries (pyke dfbin cache)
Runner: our own thin Rust HTTP wrapper usls-server

What is usls-server (context)
This is our own small Rust axum service that:
- accepts a multipart form (image + JSON params)
- decodes the image bytes
- calls usls::Runtime::forward(...)
- returns COCO-style RLE masks grouped by class name
We will attach usls-server/src/main.rs and usls-server/Cargo.toml to this issue (for reference only).
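As background for the response format: COCO-style uncompressed RLE stores run lengths in column-major (Fortran) order, starting with a run of zeros (which may be length 0). A minimal sketch of that encoding, purely for illustration (this is not the actual usls-server code, and the function name is ours):

```rust
/// Encode a binary mask (row-major, `height` x `width`, values 0/1) into
/// COCO-style uncompressed RLE counts: run lengths taken in column-major
/// order, with the first count covering zeros (possibly a run of length 0).
fn mask_to_rle_counts(mask: &[u8], height: usize, width: usize) -> Vec<usize> {
    let mut counts = Vec::new();
    let mut prev = 0u8; // RLE always starts by counting zeros
    let mut run = 0usize;
    for x in 0..width {
        for y in 0..height {
            let v = mask[y * width + x]; // column-major traversal
            if v == prev {
                run += 1;
            } else {
                counts.push(run);
                prev = v;
                run = 1;
            }
        }
    }
    counts.push(run);
    counts
}

fn main() {
    // 2x2 mask [[0,1],[1,1]] scans column-major as 0,1,1,1 -> counts [1, 3]
    let counts = mask_to_rle_counts(&[0u8, 1, 1, 1], 2, 2);
    println!("{:?}", counts); // [1, 3]
}
```

The leading zero-run convention is what lets decoders reconstruct the mask without a separate "starts with" flag.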

Reproduction

  1. Start service (TensorRT + CUDA image processor):
    export ORT_DYLIB_PATH=/root/ai-monitor/usls-server/target/release
    export LD_LIBRARY_PATH=$ORT_DYLIB_PATH:$LD_LIBRARY_PATH
    USLS_DEVICE=tensorrt:0 \
    USLS_PROCESSOR_DEVICE=cuda:0 \
    USLS_POOL_SIZE=1 \
    USLS_TEXT_CACHE_SIZE=4096 \
    USLS_DTYPE=fp32 ./target/release/usls-server
    On first start we see logs like:
    Initial model serialization with TensorRT may require a wait...
    DryRun ... on TensorRT:0(NVIDIA)
  2. Measure latency (single image, 600×500 JPEG):
    curl -s -o /dev/null -w "usls_http=%{time_total}\n" \
      -X POST http://127.0.0.1:8080/v1/sam3-image \
      -F "image=@/root/001105.jpg" \
      -F 'params={"prompts":["person","head","helmet"],"confs":0.5,"return":"rle","independent":false}'
    Observed (warm): usls_http ≈ 1.126s
    Notes:
    - independent=false (all prompts batched into one forward call)
    - USLS_PROCESSOR_DEVICE=cpu:0 was slightly slower than cuda:0
    - The same machine with the CUDA EP was ~1.4s, so TensorRT is only ~20% faster
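For reference, the ~20% figure follows directly from the two warm latencies above (a trivial check, with the latencies hard-coded from our measurements):

```rust
/// Relative improvement of `fast` over `slow`, in percent.
fn improvement_pct(slow: f64, fast: f64) -> f64 {
    (slow - fast) / slow * 100.0
}

fn main() {
    // Warm end-to-end latencies reported above (seconds).
    let cuda_ep = 1.4;
    let trt_ep = 1.126;
    println!("{:.1}%", improvement_pct(cuda_ep, trt_ep)); // 19.6%
}
```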

Expected / Question
Is ~1.1s per 600×500 image expected for Sam3Image even with TensorRT EP on an RTX 4090? If not, what are the recommended steps/knobs to optimize latency?

main.rs.txt
Cargo.toml.txt
