
[TRTLLM-11267][feat] Add audio support for nemotron#12191

Open
2ez4bz wants to merge 1 commit into NVIDIA:main from 2ez4bz:dev-nano-audio

Conversation

@2ez4bz
Collaborator

@2ez4bz 2ez4bz commented Mar 13, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added audio modality support to the multimodal model, enabling processing of audio inputs alongside image and video modalities.
    • Integrated audio preprocessing, feature extraction, and encoding capabilities for end-to-end inference.
  • Tests

    • Added comprehensive unit test coverage for audio processing, placeholder handling, and multimodal dispatch functionality.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@2ez4bz
Collaborator Author

2ez4bz commented Mar 13, 2026

/bot run

    },
    placeholder_placement=MultimodalPlaceholderPlacement.BEFORE_TEXT,
-   placeholders_separator="",
+   placeholders_separator="\n",
Collaborator Author

This is to match the vLLM behavior, which hardcodes newline separators in between multimodal placeholders regardless of model.
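To illustrate the separator change, here is a minimal sketch of BEFORE_TEXT placeholder placement; `build_prompt` is a hypothetical stand-in, not the actual TRT-LLM API:

```python
# Hypothetical sketch: with BEFORE_TEXT placement, all multimodal
# placeholders are joined with the separator and prepended to the prompt.
# vLLM hardcodes "\n" between placeholders regardless of model.
def build_prompt(placeholders: list[str], text: str, separator: str = "\n") -> str:
    return separator.join(placeholders) + separator + text

prompt = build_prompt(["<image>", "<audio>"], "Describe what you see and hear.")
assert prompt == "<image>\n<audio>\nDescribe what you see and hear."
```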

Collaborator

with such change, are the final prompts of trtllm the same as vllm?

Collaborator Author

Pretty much, yes. The prompts now match exactly, and the outputs are basically the same, with the caveat that TRTLLM does not seem stable run-to-run (meaning, if I run the same prompt twice, even with temperature 0.0, I'll get different output):

# Run 1.
Prompt: '<think> Help me plan a trip to Zimbabwe '

Output: 2024". They want help planning a trip to Zimbabwe in 2024. We need to provide travel advice: best time to visit, visa requirements, health, safety, itinerary suggestions, places to see (Victoria Falls, Hwange National Park, Great Zimbabwe, Kariba, etc.), travel logistics (flights, transport, accommodation), budgeting, cultural tips, safety, etc. Also mention COVID restrictions (if any). Provide a sample itinerary, maybe 10-14 days. Provide resources, links (but not actual URLs? We can provide generic). Ensure compliance with policy: no

# Run 2.
Prompt: '<think> Help me plan a trip to Zimbabwe '

Output: 2024". They want help planning a trip to Zimbabwe in 2024. We need to provide travel advice: best time to visit, visa requirements, health, safety, itinerary suggestions (e.g., Victoria Falls, Hwange National Park, Great Zimbabwe ruins, Kariba Lake, etc.), transportation, accommodation, budget, cultural tips, etc. Also mention COVID restrictions (if any). Provide a sample itinerary, maybe 10 days. Provide resources, links (but not actual URLs? We can provide generic). Ensure compliance with policy: no disallowed content. It's fine. Provide thorough answer


kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.8, enable_block_reuse=False)

@pytest.mark.skip(reason="Nano V2 VLM ckpt is not released yet.")
Collaborator Author

This was never enabled, and we have TestNemotron_Nano_12B_V2_VL below in the same file.

@@ -384,9 +386,7 @@ def encode_base64_image(
"""

HF_CHAT_TEMPLATE_EXCEPTIONS = ["llava_llama", "mistral_large_3"]
PLACEHOLDER_EXCEPTIONS = [
    "llava_next", "NemotronH_Nano_VL_V2", "mistral_large_3"
Collaborator Author

I removed this and the code below since, per my understanding, it is no longer necessary: the chat template for both v2 and newer models can handle list-of-dicts style content.
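For context, the list-of-dicts content style that the newer chat templates accept looks roughly like this; the OpenAI-style keys shown are an assumption, and the exact schema depends on the model's template:

```python
# Illustrative "list-of-dicts" chat message content; the exact keys depend
# on the model's chat template, so treat this as an assumption.
message = {
    "role": "user",
    "content": [
        {"type": "audio", "audio": "file:///tmp/clip.wav"},
        {"type": "text", "text": "Transcribe this recording."},
    ],
}
modalities = [part["type"] for part in message["content"]]
```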

@tensorrt-cicd
Collaborator

PR_Github #38842 [ run ] triggered by Bot. Commit: 167cefb Link to invocation

@coderabbitai
Contributor

coderabbitai bot commented Mar 13, 2026

📝 Walkthrough

Walkthrough

This pull request adds audio modality support to the NemotronH Nano VL V2 multimodal model. It introduces Parakeet-based audio feature extraction and projection components, integrates audio preprocessing and encoding into the model's forward flow, and updates input handling to accommodate audio alongside existing vision modalities. Updates include audio placeholder management, resampling logic, and context token configuration.

Changes

  • Audio Extraction & Projection (tensorrt_llm/_torch/models/modeling_parakeet.py)
    Introduces ParakeetExtractor for audio clipping and token calculation, _ExtractorConfig for configuration, a ParakeetProjection MLP layer, and a ProjectedParakeet wrapper that handles encoder initialization, weight loading, and bf16 conversion for model inference.
  • Nemotron Model Audio Integration (tensorrt_llm/_torch/models/modeling_nemotron_nano.py)
    Adds an AUDIO_PLACEHOLDER constant, a sound_encoder field, and audio-specific methods (_process_audio, _expand_audio_placeholders, _resample_audios, _encode_audio, _encode_multimodal). Extends the forward flow and constructor to initialize and load audio encoder weights. Updates placeholder metadata and mm_token_ids composition for audio context tokens.
  • Vision Model Compatibility (tensorrt_llm/_torch/models/modeling_radio.py)
    Relaxes load_weights error handling to permit keys starting with model.patch_generator.video_embedder alongside the existing model.blocks. prefix allowance; includes a TODO note for future conv3d support.
  • Input Processing Utilities (tensorrt_llm/inputs/utils.py)
    Enables file:// scheme support in async_load_audio for local file paths. Removes NemotronH_Nano_VL_V2 from the PLACEHOLDER_EXCEPTIONS list and simplifies the specialized handling logic accordingly.
  • Audio Component Unit Tests (tests/unittest/_torch/modeling/test_modeling_parakeet.py, tests/unittest/_torch/modeling/test_modeling_nemotron_nano_v2_vl.py, tests/unittest/_torch/modeling/test_nemotron_nano_preprocessing.py)
    Adds comprehensive tests for ParakeetExtractor clipping logic, token counting, audio splitting, and the ProjectedParakeet forward contract. Includes NemotronH_Nano_VL_V2 multimodal dispatch tests and NanoV2VLInputProcessor audio preprocessing tests with resampling and placeholder expansion validation.
  • Test Infrastructure Updates (tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py, tests/integration/test_lists/test-db/l0_a10.yml)
    Removes the TestNano_V2_VLM accuracy test class. Adds test_modeling_parakeet.py to the l0_a10.yml PyTorch test list.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant NanoV2Processor as NanoV2VLInputProcessor
    participant Extractor as ParakeetExtractor
    participant Model as NemotronH_Nano_VL_V2
    participant AudioEncoder as ProjectedParakeet
    participant VisionEncoder as NanoV2VLVisionEncoder

    Client->>NanoV2Processor: forward(audio_list, image_list, text)
    
    alt Has Audio
        NanoV2Processor->>NanoV2Processor: _expand_audio_placeholders()
        NanoV2Processor->>NanoV2Processor: _resample_audios()
        NanoV2Processor->>Extractor: __call__(resampled_audios)
        Extractor-->>NanoV2Processor: {input_features, attention_mask, audio_num_clips}
    end
    
    alt Has Images
        NanoV2Processor->>VisionEncoder: process(images)
        VisionEncoder-->>NanoV2Processor: image_embeddings
    end
    
    NanoV2Processor->>Model: forward(input_ids, embeddings, audio_features)
    
    alt Has Audio in MultimodalParams
        Model->>AudioEncoder: forward(audio_features, attention_mask)
        AudioEncoder-->>Model: projected_audio_embeddings
    end
    
    alt Has Images in MultimodalParams
        Model->>VisionEncoder: encode(images)
        VisionEncoder-->>Model: vision_embeddings
    end
    
    Model->>Model: _encode_multimodal(dispatch by modality)
    Model-->>Client: output_embeddings

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 25.00%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check ⚠️ Warning: The PR description is incomplete. While the title indicates audio support for Nemotron, the Description and Test Coverage sections are empty, and the checklist is marked complete but lacks supporting details. Resolution: fill in the Description section explaining what audio support was added, why it was needed, and how it works, and complete the Test Coverage section listing the relevant test files that validate the changes.

✅ Passed checks (1 passed)

  • Title check ✅ Passed: The title clearly and specifically describes the main change (adding audio support to the Nemotron model), which aligns with the extensive audio-related changes throughout the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/models/modeling_nemotron_nano.py (1)

1211-1228: ⚠️ Potential issue | 🟠 Major

Fail fast if audio is enabled without sound_context_token_id.

If sound_config is present but sound_context_token_id is missing, preprocessing still injects <so_embedding> tokens while forward() never includes that ID in mm_token_ids. The request then runs with raw placeholder tokens instead of fused audio embeddings.

💡 Suggested fix
         self.sound_encoder: ProjectedParakeet | None = None
         sound_config = getattr(config, "sound_config", None)
         if sound_config is not None:
             self.sound_encoder = ProjectedParakeet(
                 sound_config,
                 llm_hidden_size=config.llm_config.hidden_size,
                 dtype=getattr(config, "torch_dtype", torch.bfloat16),
             ).eval()

         llm_model_config = copy.deepcopy(model_config)
         llm_model_config.pretrained_config = llm_model_config.pretrained_config.llm_config
         self.llm = AutoModelForCausalLM.from_config(llm_model_config)

         self.vocab_size = llm_model_config.pretrained_config.vocab_size
         self.model_dtype = getattr(config, "torch_dtype", torch.bfloat16)
         self.img_context_token_id = config.img_context_token_id
         self.video_context_token_id = config.video_context_token_id
         self.sound_context_token_id = getattr(config, "sound_context_token_id", None)
+        if self.sound_encoder is not None and self.sound_context_token_id is None:
+            raise ValueError(
+                "sound_context_token_id must be set when sound_config is present."
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_nemotron_nano.py` around lines 1211 -
1228, If sound_config is provided but sound_context_token_id is None, the model
will inject placeholder <so_embedding> tokens without mapping them to the
corresponding token id, causing raw placeholders instead of fused audio
embeddings; in the constructor where sound_config, sound_encoder, and
sound_context_token_id are handled, add a fast-fail (raise a clear exception) or
set a sensible default when sound_config is not None and sound_context_token_id
is missing, and ensure the forward() logic that builds mm_token_ids expects
sound_context_token_id to be present (or skips audio processing if absent);
update initialization around sound_config / ProjectedParakeet and the attributes
sound_context_token_id to enforce this invariant so forward() never runs with
audio enabled but no sound_context_token_id.
🧹 Nitpick comments (2)
tests/unittest/_torch/modeling/test_modeling_parakeet.py (1)

92-107: Add a multi-channel audio case.

Every new extractor test uses 1-D arrays, but the real loader commonly yields (samples, channels) for stereo files. A small stereo-input case here would catch the current contract gap in _split_audio_into_clips() and keep the expected behavior explicit.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/_torch/modeling/test_modeling_parakeet.py` around lines 92 -
107, Add a small stereo (multi-channel) test: create a 2-D numpy array shaped
(samples, channels) (e.g., (16000,2)) and pass it into
ext._split_audio_into_clips and ext(audios, sampling_rate=16000,
return_tensors="pt") to verify the extractor handles multi-channel input; assert
that _split_audio_into_clips returns clips whose summed first-dimension length
per channel equals or exceeds the original samples, and assert the call still
returns "input_features", "attention_mask", and "audio_num_clips" with the
expected shapes/values (use ext._split_audio_into_clips and ext(...) as the
referenced symbols).
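A sketch of what such a stereo case could look like; `split_into_clips` is a stand-in for the extractor's internals (ParakeetExtractor itself is not imported here), under the assumption that stereo input is downmixed by averaging channels before clipping:

```python
import numpy as np

def split_into_clips(audio: np.ndarray, clip_len: int = 4000) -> list:
    # soundfile.read commonly yields (samples, channels) for stereo files;
    # downmix to mono by averaging channels before clipping (assumption).
    if audio.ndim == 2:
        audio = audio.mean(axis=-1)
    # Pad up to a whole number of clips, then slice.
    padded = np.pad(audio, (0, (-len(audio)) % clip_len))
    return [padded[i:i + clip_len] for i in range(0, len(padded), clip_len)]

stereo = np.zeros((16000, 2), dtype=np.float32)
clips = split_into_clips(stereo)
assert len(clips) == 4 and all(c.shape == (4000,) for c in clips)
```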
tests/unittest/_torch/modeling/test_modeling_nemotron_nano_v2_vl.py (1)

215-262: Add a video dispatch case here.

This class only exercises audio and image routing. _encode_multimodal() also has to preserve the second tuple element from vision_encoder() for EVS/video requests, so a small video case that asserts the returned token-count metadata would close the coverage gap.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/_torch/modeling/test_modeling_nemotron_nano_v2_vl.py` around
lines 215 - 262, Add a new test method in TestEncodeMultimodalDispatch that
exercises a "video" modality path: create model via _make_mock_model(), set
fake_video_embeds and fake_token_counts, set model.vision_encoder.return_value =
([fake_video_embeds], [fake_token_counts]), build mm_param.multimodal_data with
"modality_type": "video" and a suitable "video" payload, call
NemotronH_Nano_VL_V2._encode_multimodal(model, [mm_param]), assert
model.vision_encoder was called with [mm_param], assert the returned embeds
match fake_video_embeds and assert the second returned element (result_nones)
preserves the token-count metadata equals [fake_token_counts].

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 54c257d3-9472-45e6-86c9-b0f2f58a15a9

📥 Commits

Reviewing files that changed from the base of the PR and between 2727092 and 167cefb.

📒 Files selected for processing (9)
  • tensorrt_llm/_torch/models/modeling_nemotron_nano.py
  • tensorrt_llm/_torch/models/modeling_parakeet.py
  • tensorrt_llm/_torch/models/modeling_radio.py
  • tensorrt_llm/inputs/utils.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py
  • tests/integration/test_lists/test-db/l0_a10.yml
  • tests/unittest/_torch/modeling/test_modeling_nemotron_nano_v2_vl.py
  • tests/unittest/_torch/modeling/test_modeling_parakeet.py
  • tests/unittest/_torch/modeling/test_nemotron_nano_preprocessing.py
💤 Files with no reviewable changes (1)
  • tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py

Comment on lines +1349 to +1363
def _encode_multimodal(
    self, multimodal_params: List[MultimodalParams]
) -> Tuple[List[torch.Tensor], List[None]]:
    """Dispatch multimodal encoding to the appropriate encoder."""
    mm_embeddings = []
    for param in multimodal_params:
        modality_type = param.multimodal_data["modality_type"]
        if modality_type in ("image", "video"):
            embs, _ = self.vision_encoder([param])
            mm_embeddings.append(embs[0])
        elif modality_type == "audio":
            mm_embeddings.append(self._encode_audio(param))
        else:
            raise ValueError(f"Unknown modality: {modality_type}")
    return mm_embeddings, [None] * len(multimodal_params)
Contributor

⚠️ Potential issue | 🟠 Major

Preserve video EVS metadata through _encode_multimodal().

self.vision_encoder([param]) returns both embeddings and num_tokens_in_videos, but this wrapper drops the second value and always returns [None]. That breaks the existing EVS/video path because forward() later passes those counts into merge_evs_mm_embeds().

💡 Suggested fix
     def _encode_multimodal(
         self, multimodal_params: List[MultimodalParams]
-    ) -> Tuple[List[torch.Tensor], List[None]]:
+    ) -> Tuple[List[torch.Tensor], List[List[int] | None]]:
         """Dispatch multimodal encoding to the appropriate encoder."""
         mm_embeddings = []
+        num_tokens_in_videos = []
         for param in multimodal_params:
             modality_type = param.multimodal_data["modality_type"]
             if modality_type in ("image", "video"):
-                embs, _ = self.vision_encoder([param])
+                embs, token_counts = self.vision_encoder([param])
                 mm_embeddings.append(embs[0])
+                num_tokens_in_videos.append(token_counts[0])
             elif modality_type == "audio":
                 mm_embeddings.append(self._encode_audio(param))
+                num_tokens_in_videos.append(None)
             else:
                 raise ValueError(f"Unknown modality: {modality_type}")
-        return mm_embeddings, [None] * len(multimodal_params)
+        return mm_embeddings, num_tokens_in_videos
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_nemotron_nano.py` around lines 1349 -
1363, The wrapper _encode_multimodal currently discards the second return value
from self.vision_encoder (which returns embeddings and num_tokens_in_videos) and
always returns a list of None, breaking EVS/video handling; fix it by collecting
the second return per multimodal param when modality_type is "image" or "video"
(i.e., capture embs, num_tokens = self.vision_encoder([param]) and append
num_tokens[0] to a parallel list), ensure audio paths still append the correct
placeholder if needed, and return (mm_embeddings, mm_num_tokens) so forward()
can pass those counts into merge_evs_mm_embeds() correctly.

Comment on lines +1 to +2
from typing import Dict, NamedTuple, Optional

Contributor

⚠️ Potential issue | 🟠 Major

Add the standard NVIDIA Apache 2.0 header.

This new module lands without the repository's required license header.

As per coding guidelines, "All TensorRT-LLM source files should contain an NVIDIA copyright header with the year of the latest meaningful modification. The header should be an Apache 2.0 license block as specified".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_parakeet.py` around lines 1 - 2, This
file is missing the required NVIDIA Apache 2.0 license header; add the standard
NVIDIA Apache 2.0 copyright/license block at the very top of the module (above
the existing imports such as the typing imports and any top-level symbols in
modeling_parakeet.py), updating the year to the latest meaningful modification
year so the header precedes the first code line (e.g., before "from typing
import Dict, NamedTuple, Optional").

Comment on lines +45 to +58
def _split_audio_into_clips(self, audio: np.ndarray) -> list[np.ndarray]:
    assert audio.ndim == 1
    audio_len = int(audio.shape[0])
    clip_sizes = self._clip_sizes(audio_len)
    target_len = sum(clip_sizes)
    if audio_len < target_len:
        audio = np.pad(audio, (0, target_len - audio_len))

    clips = list[np.ndarray]()
    offset = 0
    for clip_size in clip_sizes:
        clips.append(audio[offset : offset + clip_size])
        offset += clip_size
    return clips
Contributor

⚠️ Potential issue | 🟠 Major

Handle multi-channel audio before clipping.

soundfile.read() commonly returns shape [num_samples, channels], but this helper only accepts 1-D arrays via a bare assert. A normal stereo file will therefore die with AssertionError here instead of being downmixed or rejected cleanly.

💡 Suggested fix
     def _split_audio_into_clips(self, audio: np.ndarray) -> list[np.ndarray]:
-        assert audio.ndim == 1
+        if audio.ndim == 2:
+            audio = audio.mean(axis=-1)
+        elif audio.ndim != 1:
+            raise ValueError(f"Expected mono audio, got shape {audio.shape}")
         audio_len = int(audio.shape[0])
         clip_sizes = self._clip_sizes(audio_len)
         target_len = sum(clip_sizes)
         if audio_len < target_len:
             audio = np.pad(audio, (0, target_len - audio_len))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_parakeet.py` around lines 45 - 58, The
helper _split_audio_into_clips currently asserts audio is 1-D and will crash on
multi-channel input; update _split_audio_into_clips to handle multi-channel
arrays from soundfile.read by checking audio.ndim: if ndim == 2 downmix to mono
(e.g., average across channels) before computing audio_len and calling
_clip_sizes, if ndim > 2 or channels == 0 raise a clear ValueError indicating
unsupported shape; keep the rest of the logic (clip_sizes, padding, slicing)
unchanged and reference _split_audio_into_clips and _clip_sizes when making the
change.

Comment on lines +347 to +348
    elif parsed_url.scheme == "file":
        audio = parsed_url.path
Contributor

⚠️ Potential issue | 🟠 Major

Mirror file:// handling in load_audio().

This only fixes the async path. The common sync loader still hands the raw file://... URI to soundfile.read(), so local audio URLs remain broken in default_multimodal_input_loader() and any other synchronous caller. It would be safer to normalize file:// once and reuse it from both loaders, ideally with URI unescaping as well.

💡 Suggested fix
-from urllib.parse import urlparse
+from urllib.parse import unquote, urlparse
...
 def load_audio(
     audio: str,
     format: str = "pt",
     device: str = "cuda",
 ) -> Tuple[np.ndarray, int]:
     parsed_url = urlparse(audio)
     if parsed_url.scheme in ["http", "https"]:
         audio = requests.get(audio, stream=True, timeout=10)
         audio = BytesIO(audio.content)
+    elif parsed_url.scheme == "file":
+        audio = Path(unquote(parsed_url.path))

     audio = soundfile.read(audio)
     return audio
...
 async def async_load_audio(
     audio: str,
     format: str = "pt",
     device: str = "cuda",
 ) -> Tuple[np.ndarray, int]:
     parsed_url = urlparse(audio)
     if parsed_url.scheme in ["http", "https"]:
         async with aiohttp.ClientSession() as session:
             async with session.get(audio) as response:
                 audio = BytesIO(await response.content.read())
     elif parsed_url.scheme == "file":
-        audio = parsed_url.path
+        audio = Path(unquote(parsed_url.path))

     audio = soundfile.read(audio)
     return audio
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/inputs/utils.py` around lines 347 - 348, Normalize and unescape
file:// URIs in a single helper used by both loaders: extract the file path
handling currently in load_audio() (parsed_url and parsed_url.path) into a new
utility (e.g., normalize_local_uri(uri) or reuse existing parsed_url logic) that
strips the "file://" scheme and applies urllib.parse.unquote on the path, then
call that helper from load_audio() and from default_multimodal_input_loader()
before calling soundfile.read(); ensure soundfile.read() always receives a plain
filesystem path, not a raw file:// URI.
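A sketch of the shared helper this suggests, using only the stdlib; the name `normalize_local_uri` is hypothetical:

```python
from pathlib import Path
from urllib.parse import unquote, urlparse

def normalize_local_uri(uri: str):
    # Strip a file:// scheme and unescape percent-encoding so downstream
    # readers like soundfile.read() get a plain filesystem path; leave
    # http(s) URLs and bare paths untouched.
    parsed = urlparse(uri)
    if parsed.scheme == "file":
        return Path(unquote(parsed.path))
    return uri

assert normalize_local_uri("file:///tmp/my%20clip.wav") == Path("/tmp/my clip.wav")
assert normalize_local_uri("https://example.com/a.wav") == "https://example.com/a.wav"
```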

@tensorrt-cicd
Collaborator

PR_Github #38842 [ run ] completed with state FAILURE. Commit: 167cefb
/LLM/main/L0_MergeRequest_PR pipeline #30152 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

placeholder_map={
    "image": "<image>",
    "video": "<video>",
    "audio": AUDIO_PLACEHOLDER,
Collaborator

@Wanli-Jiang Wanli-Jiang Mar 13, 2026

can you also put and

Collaborator Author

It looks like you may have hit enter too soon?

    input_features=input_features,
    attention_mask=attention_mask,
)
hidden_states = outputs.last_hidden_state.to(torch.bfloat16)
Collaborator

would we fetch dtype from config and convert it to the model dtype?

Collaborator Author

Good catch. I just got rid of this entirely, since it was already in bfloat16 with the checkpoint I was using, and we're setting the dtype in the __init__.

Collaborator

should we take care of audio token here?

Collaborator Author

Good catch. I added them, although note that find_mm_token_lengths does not support audio yet.
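In rough terms the mm_token_ids composition then looks like the following; the token id values are made up for illustration, and the attribute names follow the snippets quoted in this review:

```python
# Illustrative only: the actual img/video/sound context token ids come
# from the model config.
img_context_token_id = 131072
video_context_token_id = 131073
sound_context_token_id = 131074  # None would leave audio placeholders unfused

mm_token_ids = [img_context_token_id, video_context_token_id]
if sound_context_token_id is not None:
    mm_token_ids.append(sound_context_token_id)
```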

self._sound_context_token = getattr(config, "sound_context_token", AUDIO_PLACEHOLDER)
# At time of writing (03/09/2026), these were not included in the config.
self._sound_start = "<so_start>"
self._sound_end = "<so_end>"
Collaborator

  • L1. Hardcoded audio start/end tokens may drift from config
    • File: tensorrt_llm/_torch/models/modeling_nemotron_nano.py:599-601
    • Source: comprehensive-local-review
    • Details: The audio start/end tokens are hardcoded:
      # At time of writing (03/09/2026), these were not included in the config.
      self._sound_start = "<so_start>"
      self._sound_end = "<so_end>"
      The comment explains why, but there's no getattr(config, ...) fallback pattern to pick up config values if/when they're added. Currently safe, but could silently use wrong tokens when the config is updated.
    • Fix: Use the same getattr pattern as the sound_context_token:
      self._sound_start = getattr(config, "sound_start_token", "<so_start>")
      self._sound_end = getattr(config, "sound_end_token", "<so_end>")

Collaborator Author

Opting to leave as is, since we don't even know whether those will be the attributes in the config anyway.

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
@2ez4bz 2ez4bz requested a review from a team as a code owner March 13, 2026 20:58
@2ez4bz
Collaborator Author

2ez4bz commented Mar 13, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #38903 [ run ] triggered by Bot. Commit: 654e231 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38903 [ run ] completed with state FAILURE. Commit: 654e231
/LLM/main/L0_MergeRequest_PR pipeline #30211 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation
