
[TRTLLM-11267][feat] Add audio support for nemotron#12191

Open
2ez4bz wants to merge 1 commit into NVIDIA:main from 2ez4bz:dev-nano-audio

Conversation

@2ez4bz
Collaborator

@2ez4bz 2ez4bz commented Mar 13, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added audio modality support to the multimodal model, enabling processing of audio inputs alongside image and video modalities.
    • Integrated audio preprocessing, feature extraction, and encoding capabilities for end-to-end inference.
  • Tests

    • Added comprehensive unit test coverage for audio processing, placeholder handling, and multimodal dispatch functionality.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@2ez4bz
Collaborator Author

2ez4bz commented Mar 13, 2026

/bot run

    },
    placeholder_placement=MultimodalPlaceholderPlacement.BEFORE_TEXT,
-   placeholders_separator="",
+   placeholders_separator="\n",
Collaborator Author

This is to match the vLLM behavior, which hardcodes newline separators in between multimodal placeholders regardless of model.
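To illustrate the separator change, here is a minimal sketch of BEFORE_TEXT placeholder placement; `build_prompt` is a hypothetical stand-in, not the actual TRT-LLM API:

```python
# Hypothetical sketch: with BEFORE_TEXT placement, all multimodal
# placeholders are joined with the separator and prepended to the prompt.
# vLLM hardcodes "\n" between placeholders regardless of model.
def build_prompt(placeholders: list[str], text: str, separator: str = "\n") -> str:
    return separator.join(placeholders) + separator + text

prompt = build_prompt(["<image>", "<audio>"], "Describe what you see and hear.")
assert prompt == "<image>\n<audio>\nDescribe what you see and hear."
```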

Collaborator

with such change, are the final prompts of trtllm the same as vllm?

Collaborator Author

Pretty much, yes. The prompts now match exactly, and the outputs are basically the same, with the caveat that TRTLLM does not seem stable run-to-run (meaning, if I run the same prompt twice, even with temperature 0.0, I'll get different output):

# Run 1.
Prompt: '<think> Help me plan a trip to Zimbabwe '

Output: 2024". They want help planning a trip to Zimbabwe in 2024. We need to provide travel advice: best time to visit, visa requirements, health, safety, itinerary suggestions, places to see (Victoria Falls, Hwange National Park, Great Zimbabwe, Kariba, etc.), travel logistics (flights, transport, accommodation), budgeting, cultural tips, safety, etc. Also mention COVID restrictions (if any). Provide a sample itinerary, maybe 10-14 days. Provide resources, links (but not actual URLs? We can provide generic). Ensure compliance with policy: no

# Run 2.
Prompt: '<think> Help me plan a trip to Zimbabwe '

Output: 2024". They want help planning a trip to Zimbabwe in 2024. We need to provide travel advice: best time to visit, visa requirements, health, safety, itinerary suggestions (e.g., Victoria Falls, Hwange National Park, Great Zimbabwe ruins, Kariba Lake, etc.), transportation, accommodation, budget, cultural tips, etc. Also mention COVID restrictions (if any). Provide a sample itinerary, maybe 10 days. Provide resources, links (but not actual URLs? We can provide generic). Ensure compliance with policy: no disallowed content. It's fine. Provide thorough answer


kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.8, enable_block_reuse=False)

@pytest.mark.skip(reason="Nano V2 VLM ckpt is not released yet.")
Collaborator Author

This was never enabled, and we have TestNemotron_Nano_12B_V2_VL below in the same file.

@@ -384,9 +386,7 @@ def encode_base64_image(
"""

HF_CHAT_TEMPLATE_EXCEPTIONS = ["llava_llama", "mistral_large_3"]
PLACEHOLDER_EXCEPTIONS = [
    "llava_next", "NemotronH_Nano_VL_V2", "mistral_large_3"
Collaborator Author

I removed this and the code below since, per my understanding, it is no longer necessary: the chat template for both v2 and newer models can handle list-of-dicts style content.
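For context, the list-of-dicts content style that the newer chat templates accept looks roughly like this; the OpenAI-style keys shown are an assumption, and the exact schema depends on the model's template:

```python
# Illustrative "list-of-dicts" chat message content; the exact keys depend
# on the model's chat template, so treat this as an assumption.
message = {
    "role": "user",
    "content": [
        {"type": "audio", "audio": "file:///tmp/clip.wav"},
        {"type": "text", "text": "Transcribe this recording."},
    ],
}
modalities = [part["type"] for part in message["content"]]
```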

@tensorrt-cicd
Collaborator

PR_Github #38842 [ run ] triggered by Bot. Commit: 167cefb Link to invocation

@coderabbitai
Contributor

coderabbitai bot commented Mar 13, 2026

📝 Walkthrough

Walkthrough

This pull request adds audio modality support to the NemotronH Nano VL V2 multimodal model. It introduces Parakeet-based audio feature extraction and projection components, integrates audio preprocessing and encoding into the model's forward flow, and updates input handling to accommodate audio alongside existing vision modalities. Updates include audio placeholder management, resampling logic, and context token configuration.

Changes

  • Audio Extraction & Projection (tensorrt_llm/_torch/models/modeling_parakeet.py)
    Introduces ParakeetExtractor for audio clipping and token calculation, _ExtractorConfig for configuration, a ParakeetProjection MLP layer, and a ProjectedParakeet wrapper that handles encoder initialization, weight loading, and bf16 conversion for model inference.
  • Nemotron Model Audio Integration (tensorrt_llm/_torch/models/modeling_nemotron_nano.py)
    Adds an AUDIO_PLACEHOLDER constant, a sound_encoder field, and audio-specific methods (_process_audio, _expand_audio_placeholders, _resample_audios, _encode_audio, _encode_multimodal). Extends the forward flow and constructor to initialize and load audio encoder weights. Updates placeholder metadata and mm_token_ids composition for audio context tokens.
  • Vision Model Compatibility (tensorrt_llm/_torch/models/modeling_radio.py)
    Relaxes load_weights error handling to permit keys starting with model.patch_generator.video_embedder alongside the existing model.blocks. prefix allowance; includes a TODO note for future conv3d support.
  • Input Processing Utilities (tensorrt_llm/inputs/utils.py)
    Enables file:// scheme support in async_load_audio for local file paths. Removes NemotronH_Nano_VL_V2 from the PLACEHOLDER_EXCEPTIONS list and simplifies the specialized handling logic accordingly.
  • Audio Component Unit Tests (tests/unittest/_torch/modeling/test_modeling_parakeet.py, tests/unittest/_torch/modeling/test_modeling_nemotron_nano_v2_vl.py, tests/unittest/_torch/modeling/test_nemotron_nano_preprocessing.py)
    Adds comprehensive tests for ParakeetExtractor clipping logic, token counting, audio splitting, and the ProjectedParakeet forward contract. Includes NemotronH_Nano_VL_V2 multimodal dispatch tests and NanoV2VLInputProcessor audio preprocessing tests with resampling and placeholder expansion validation.
  • Test Infrastructure Updates (tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py, tests/integration/test_lists/test-db/l0_a10.yml)
    Removes the TestNano_V2_VLM accuracy test class. Adds test_modeling_parakeet.py to the l0_a10.yml PyTorch test list.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant NanoV2Processor as NanoV2VLInputProcessor
    participant Extractor as ParakeetExtractor
    participant Model as NemotronH_Nano_VL_V2
    participant AudioEncoder as ProjectedParakeet
    participant VisionEncoder as NanoV2VLVisionEncoder

    Client->>NanoV2Processor: forward(audio_list, image_list, text)
    
    alt Has Audio
        NanoV2Processor->>NanoV2Processor: _expand_audio_placeholders()
        NanoV2Processor->>NanoV2Processor: _resample_audios()
        NanoV2Processor->>Extractor: __call__(resampled_audios)
        Extractor-->>NanoV2Processor: {input_features, attention_mask, audio_num_clips}
    end
    
    alt Has Images
        NanoV2Processor->>VisionEncoder: process(images)
        VisionEncoder-->>NanoV2Processor: image_embeddings
    end
    
    NanoV2Processor->>Model: forward(input_ids, embeddings, audio_features)
    
    alt Has Audio in MultimodalParams
        Model->>AudioEncoder: forward(audio_features, attention_mask)
        AudioEncoder-->>Model: projected_audio_embeddings
    end
    
    alt Has Images in MultimodalParams
        Model->>VisionEncoder: encode(images)
        VisionEncoder-->>Model: vision_embeddings
    end
    
    Model->>Model: _encode_multimodal(dispatch by modality)
    Model-->>Client: output_embeddings

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 25.00%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check ⚠️ Warning: The PR description is incomplete. While the title indicates audio support for Nemotron, the Description and Test Coverage sections are empty, and the checklist is marked complete but lacks supporting details. Resolution: fill in the Description section explaining what audio support was added, why it was needed, and how it works, and complete the Test Coverage section listing the relevant test files that validate the changes.

✅ Passed checks (1 passed)

  • Title check ✅ Passed: The title clearly and specifically describes the main change (adding audio support to the Nemotron model), which aligns with the extensive audio-related changes throughout the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/models/modeling_nemotron_nano.py (1)

1211-1228: ⚠️ Potential issue | 🟠 Major

Fail fast if audio is enabled without sound_context_token_id.

If sound_config is present but sound_context_token_id is missing, preprocessing still injects <so_embedding> tokens while forward() never includes that ID in mm_token_ids. The request then runs with raw placeholder tokens instead of fused audio embeddings.

💡 Suggested fix
         self.sound_encoder: ProjectedParakeet | None = None
         sound_config = getattr(config, "sound_config", None)
         if sound_config is not None:
             self.sound_encoder = ProjectedParakeet(
                 sound_config,
                 llm_hidden_size=config.llm_config.hidden_size,
                 dtype=getattr(config, "torch_dtype", torch.bfloat16),
             ).eval()

         llm_model_config = copy.deepcopy(model_config)
         llm_model_config.pretrained_config = llm_model_config.pretrained_config.llm_config
         self.llm = AutoModelForCausalLM.from_config(llm_model_config)

         self.vocab_size = llm_model_config.pretrained_config.vocab_size
         self.model_dtype = getattr(config, "torch_dtype", torch.bfloat16)
         self.img_context_token_id = config.img_context_token_id
         self.video_context_token_id = config.video_context_token_id
         self.sound_context_token_id = getattr(config, "sound_context_token_id", None)
+        if self.sound_encoder is not None and self.sound_context_token_id is None:
+            raise ValueError(
+                "sound_context_token_id must be set when sound_config is present."
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_nemotron_nano.py` around lines 1211 -
1228, If sound_config is provided but sound_context_token_id is None, the model
will inject placeholder <so_embedding> tokens without mapping them to the
corresponding token id, causing raw placeholders instead of fused audio
embeddings; in the constructor where sound_config, sound_encoder, and
sound_context_token_id are handled, add a fast-fail (raise a clear exception) or
set a sensible default when sound_config is not None and sound_context_token_id
is missing, and ensure the forward() logic that builds mm_token_ids expects
sound_context_token_id to be present (or skips audio processing if absent);
update initialization around sound_config / ProjectedParakeet and the attributes
sound_context_token_id to enforce this invariant so forward() never runs with
audio enabled but no sound_context_token_id.
🧹 Nitpick comments (2)
tests/unittest/_torch/modeling/test_modeling_parakeet.py (1)

92-107: Add a multi-channel audio case.

Every new extractor test uses 1-D arrays, but the real loader commonly yields (samples, channels) for stereo files. A small stereo-input case here would catch the current contract gap in _split_audio_into_clips() and keep the expected behavior explicit.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/_torch/modeling/test_modeling_parakeet.py` around lines 92 -
107, Add a small stereo (multi-channel) test: create a 2-D numpy array shaped
(samples, channels) (e.g., (16000,2)) and pass it into
ext._split_audio_into_clips and ext(audios, sampling_rate=16000,
return_tensors="pt") to verify the extractor handles multi-channel input; assert
that _split_audio_into_clips returns clips whose summed first-dimension length
per channel equals or exceeds the original samples, and assert the call still
returns "input_features", "attention_mask", and "audio_num_clips" with the
expected shapes/values (use ext._split_audio_into_clips and ext(...) as the
referenced symbols).
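A sketch of what such a stereo case could look like; `split_into_clips` is a stand-in for the extractor's internals (ParakeetExtractor itself is not imported here), under the assumption that stereo input is downmixed by averaging channels before clipping:

```python
import numpy as np

def split_into_clips(audio: np.ndarray, clip_len: int = 4000) -> list:
    # soundfile.read commonly yields (samples, channels) for stereo files;
    # downmix to mono by averaging channels before clipping (assumption).
    if audio.ndim == 2:
        audio = audio.mean(axis=-1)
    # Pad up to a whole number of clips, then slice.
    padded = np.pad(audio, (0, (-len(audio)) % clip_len))
    return [padded[i:i + clip_len] for i in range(0, len(padded), clip_len)]

stereo = np.zeros((16000, 2), dtype=np.float32)
clips = split_into_clips(stereo)
assert len(clips) == 4 and all(c.shape == (4000,) for c in clips)
```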
tests/unittest/_torch/modeling/test_modeling_nemotron_nano_v2_vl.py (1)

215-262: Add a video dispatch case here.

This class only exercises audio and image routing. _encode_multimodal() also has to preserve the second tuple element from vision_encoder() for EVS/video requests, so a small video case that asserts the returned token-count metadata would close the coverage gap.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/_torch/modeling/test_modeling_nemotron_nano_v2_vl.py` around
lines 215 - 262, Add a new test method in TestEncodeMultimodalDispatch that
exercises a "video" modality path: create model via _make_mock_model(), set
fake_video_embeds and fake_token_counts, set model.vision_encoder.return_value =
([fake_video_embeds], [fake_token_counts]), build mm_param.multimodal_data with
"modality_type": "video" and a suitable "video" payload, call
NemotronH_Nano_VL_V2._encode_multimodal(model, [mm_param]), assert
model.vision_encoder was called with [mm_param], assert the returned embeds
match fake_video_embeds and assert the second returned element (result_nones)
preserves the token-count metadata equals [fake_token_counts].

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 54c257d3-9472-45e6-86c9-b0f2f58a15a9

📥 Commits

Reviewing files that changed from the base of the PR and between 2727092 and 167cefb.

📒 Files selected for processing (9)
  • tensorrt_llm/_torch/models/modeling_nemotron_nano.py
  • tensorrt_llm/_torch/models/modeling_parakeet.py
  • tensorrt_llm/_torch/models/modeling_radio.py
  • tensorrt_llm/inputs/utils.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py
  • tests/integration/test_lists/test-db/l0_a10.yml
  • tests/unittest/_torch/modeling/test_modeling_nemotron_nano_v2_vl.py
  • tests/unittest/_torch/modeling/test_modeling_parakeet.py
  • tests/unittest/_torch/modeling/test_nemotron_nano_preprocessing.py
💤 Files with no reviewable changes (1)
  • tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py

Comment on lines +1349 to +1363
def _encode_multimodal(
    self, multimodal_params: List[MultimodalParams]
) -> Tuple[List[torch.Tensor], List[None]]:
    """Dispatch multimodal encoding to the appropriate encoder."""
    mm_embeddings = []
    for param in multimodal_params:
        modality_type = param.multimodal_data["modality_type"]
        if modality_type in ("image", "video"):
            embs, _ = self.vision_encoder([param])
            mm_embeddings.append(embs[0])
        elif modality_type == "audio":
            mm_embeddings.append(self._encode_audio(param))
        else:
            raise ValueError(f"Unknown modality: {modality_type}")
    return mm_embeddings, [None] * len(multimodal_params)
Contributor

⚠️ Potential issue | 🟠 Major

Preserve video EVS metadata through _encode_multimodal().

self.vision_encoder([param]) returns both embeddings and num_tokens_in_videos, but this wrapper drops the second value and always returns [None]. That breaks the existing EVS/video path because forward() later passes those counts into merge_evs_mm_embeds().

💡 Suggested fix
     def _encode_multimodal(
         self, multimodal_params: List[MultimodalParams]
-    ) -> Tuple[List[torch.Tensor], List[None]]:
+    ) -> Tuple[List[torch.Tensor], List[List[int] | None]]:
         """Dispatch multimodal encoding to the appropriate encoder."""
         mm_embeddings = []
+        num_tokens_in_videos = []
         for param in multimodal_params:
             modality_type = param.multimodal_data["modality_type"]
             if modality_type in ("image", "video"):
-                embs, _ = self.vision_encoder([param])
+                embs, token_counts = self.vision_encoder([param])
                 mm_embeddings.append(embs[0])
+                num_tokens_in_videos.append(token_counts[0])
             elif modality_type == "audio":
                 mm_embeddings.append(self._encode_audio(param))
+                num_tokens_in_videos.append(None)
             else:
                 raise ValueError(f"Unknown modality: {modality_type}")
-        return mm_embeddings, [None] * len(multimodal_params)
+        return mm_embeddings, num_tokens_in_videos
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_nemotron_nano.py` around lines 1349 -
1363, The wrapper _encode_multimodal currently discards the second return value
from self.vision_encoder (which returns embeddings and num_tokens_in_videos) and
always returns a list of None, breaking EVS/video handling; fix it by collecting
the second return per multimodal param when modality_type is "image" or "video"
(i.e., capture embs, num_tokens = self.vision_encoder([param]) and append
num_tokens[0] to a parallel list), ensure audio paths still append the correct
placeholder if needed, and return (mm_embeddings, mm_num_tokens) so forward()
can pass those counts into merge_evs_mm_embeds() correctly.

Comment on lines +1 to +2
from typing import Dict, NamedTuple, Optional

Contributor

⚠️ Potential issue | 🟠 Major

Add the standard NVIDIA Apache 2.0 header.

This new module lands without the repository's required license header.

As per coding guidelines, "All TensorRT-LLM source files should contain an NVIDIA copyright header with the year of the latest meaningful modification. The header should be an Apache 2.0 license block as specified".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_parakeet.py` around lines 1 - 2, This
file is missing the required NVIDIA Apache 2.0 license header; add the standard
NVIDIA Apache 2.0 copyright/license block at the very top of the module (above
the existing imports such as the typing imports and any top-level symbols in
modeling_parakeet.py), updating the year to the latest meaningful modification
year so the header precedes the first code line (e.g., before "from typing
import Dict, NamedTuple, Optional").

Comment on lines +45 to +58
def _split_audio_into_clips(self, audio: np.ndarray) -> list[np.ndarray]:
    assert audio.ndim == 1
    audio_len = int(audio.shape[0])
    clip_sizes = self._clip_sizes(audio_len)
    target_len = sum(clip_sizes)
    if audio_len < target_len:
        audio = np.pad(audio, (0, target_len - audio_len))

    clips = list[np.ndarray]()
    offset = 0
    for clip_size in clip_sizes:
        clips.append(audio[offset : offset + clip_size])
        offset += clip_size
    return clips
Contributor

⚠️ Potential issue | 🟠 Major

Handle multi-channel audio before clipping.

soundfile.read() commonly returns shape [num_samples, channels], but this helper only accepts 1-D arrays via a bare assert. A normal stereo file will therefore die with AssertionError here instead of being downmixed or rejected cleanly.

💡 Suggested fix
     def _split_audio_into_clips(self, audio: np.ndarray) -> list[np.ndarray]:
-        assert audio.ndim == 1
+        if audio.ndim == 2:
+            audio = audio.mean(axis=-1)
+        elif audio.ndim != 1:
+            raise ValueError(f"Expected mono audio, got shape {audio.shape}")
         audio_len = int(audio.shape[0])
         clip_sizes = self._clip_sizes(audio_len)
         target_len = sum(clip_sizes)
         if audio_len < target_len:
             audio = np.pad(audio, (0, target_len - audio_len))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_parakeet.py` around lines 45 - 58, The
helper _split_audio_into_clips currently asserts audio is 1-D and will crash on
multi-channel input; update _split_audio_into_clips to handle multi-channel
arrays from soundfile.read by checking audio.ndim: if ndim == 2 downmix to mono
(e.g., average across channels) before computing audio_len and calling
_clip_sizes, if ndim > 2 or channels == 0 raise a clear ValueError indicating
unsupported shape; keep the rest of the logic (clip_sizes, padding, slicing)
unchanged and reference _split_audio_into_clips and _clip_sizes when making the
change.

Comment on lines +347 to +348
    elif parsed_url.scheme == "file":
        audio = parsed_url.path
Contributor

⚠️ Potential issue | 🟠 Major

Mirror file:// handling in load_audio().

This only fixes the async path. The common sync loader still hands the raw file://... URI to soundfile.read(), so local audio URLs remain broken in default_multimodal_input_loader() and any other synchronous caller. It would be safer to normalize file:// once and reuse it from both loaders, ideally with URI unescaping as well.

💡 Suggested fix
-from urllib.parse import urlparse
+from urllib.parse import unquote, urlparse
...
 def load_audio(
     audio: str,
     format: str = "pt",
     device: str = "cuda",
 ) -> Tuple[np.ndarray, int]:
     parsed_url = urlparse(audio)
     if parsed_url.scheme in ["http", "https"]:
         audio = requests.get(audio, stream=True, timeout=10)
         audio = BytesIO(audio.content)
+    elif parsed_url.scheme == "file":
+        audio = Path(unquote(parsed_url.path))

     audio = soundfile.read(audio)
     return audio
...
 async def async_load_audio(
     audio: str,
     format: str = "pt",
     device: str = "cuda",
 ) -> Tuple[np.ndarray, int]:
     parsed_url = urlparse(audio)
     if parsed_url.scheme in ["http", "https"]:
         async with aiohttp.ClientSession() as session:
             async with session.get(audio) as response:
                 audio = BytesIO(await response.content.read())
     elif parsed_url.scheme == "file":
-        audio = parsed_url.path
+        audio = Path(unquote(parsed_url.path))

     audio = soundfile.read(audio)
     return audio
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/inputs/utils.py` around lines 347 - 348, Normalize and unescape
file:// URIs in a single helper used by both loaders: extract the file path
handling currently in load_audio() (parsed_url and parsed_url.path) into a new
utility (e.g., normalize_local_uri(uri) or reuse existing parsed_url logic) that
strips the "file://" scheme and applies urllib.parse.unquote on the path, then
call that helper from load_audio() and from default_multimodal_input_loader()
before calling soundfile.read(); ensure soundfile.read() always receives a plain
filesystem path, not a raw file:// URI.
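A sketch of the shared helper this suggests, using only the stdlib; the name `normalize_local_uri` is hypothetical:

```python
from pathlib import Path
from urllib.parse import unquote, urlparse

def normalize_local_uri(uri: str):
    # Strip a file:// scheme and unescape percent-encoding so downstream
    # readers like soundfile.read() get a plain filesystem path; leave
    # http(s) URLs and bare paths untouched.
    parsed = urlparse(uri)
    if parsed.scheme == "file":
        return Path(unquote(parsed.path))
    return uri

assert normalize_local_uri("file:///tmp/my%20clip.wav") == Path("/tmp/my clip.wav")
assert normalize_local_uri("https://example.com/a.wav") == "https://example.com/a.wav"
```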

@tensorrt-cicd
Collaborator

PR_Github #38842 [ run ] completed with state FAILURE. Commit: 167cefb
/LLM/main/L0_MergeRequest_PR pipeline #30152 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

placeholder_map={
    "image": "<image>",
    "video": "<video>",
    "audio": AUDIO_PLACEHOLDER,
Collaborator

@Wanli-Jiang Wanli-Jiang Mar 13, 2026

can you also put and

Collaborator Author

It looks like you may have hit enter too soon?

    input_features=input_features,
    attention_mask=attention_mask,
)
hidden_states = outputs.last_hidden_state.to(torch.bfloat16)
Collaborator

would we fetch dtype from config and convert it to the model dtype?

Collaborator Author

Good catch. I just got rid of this entirely, since it was already in bfloat16 with the checkpoint I was using, and we're setting the dtype in the __init__.

Collaborator

should we take care of audio token here?

Collaborator Author

Good catch. I added them, although note that find_mm_token_lengths does not support audio yet.
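In rough terms the mm_token_ids composition then looks like the following; the token id values are made up for illustration, and the attribute names follow the snippets quoted in this review:

```python
# Illustrative only: the actual img/video/sound context token ids come
# from the model config.
img_context_token_id = 131072
video_context_token_id = 131073
sound_context_token_id = 131074  # None would leave audio placeholders unfused

mm_token_ids = [img_context_token_id, video_context_token_id]
if sound_context_token_id is not None:
    mm_token_ids.append(sound_context_token_id)
```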

self._sound_context_token = getattr(config, "sound_context_token", AUDIO_PLACEHOLDER)
# At time of writing (03/09/2026), these were not included in the config.
self._sound_start = "<so_start>"
self._sound_end = "<so_end>"
Collaborator

  • L1. Hardcoded audio start/end tokens may drift from config
    • File: tensorrt_llm/_torch/models/modeling_nemotron_nano.py:599-601
    • Source: comprehensive-local-review
    • Details: The audio start/end tokens are hardcoded:
      # At time of writing (03/09/2026), these were not included in the config.
      self._sound_start = "<so_start>"
      self._sound_end = "<so_end>"
      The comment explains why, but there's no getattr(config, ...) fallback pattern to pick up config values if/when they're added. Currently safe, but could silently use wrong tokens when the config is updated.
    • Fix: Use the same getattr pattern as the sound_context_token:
      self._sound_start = getattr(config, "sound_start_token", "<so_start>")
      self._sound_end = getattr(config, "sound_end_token", "<so_end>")

Collaborator Author

Opting to leave as is, since we don't even know whether those will be the attributes in the config anyway.

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
@2ez4bz 2ez4bz requested a review from a team as a code owner March 13, 2026 20:58
@2ez4bz
Collaborator Author

2ez4bz commented Mar 13, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #38903 [ run ] triggered by Bot. Commit: 654e231 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38903 [ run ] completed with state FAILURE. Commit: 654e231
/LLM/main/L0_MergeRequest_PR pipeline #30211 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation
