[TRTLLM-11267][feat] Add audio support for nemotron#12191
2ez4bz wants to merge 1 commit into NVIDIA:main from
Conversation
/bot run
```diff
     },
     placeholder_placement=MultimodalPlaceholderPlacement.BEFORE_TEXT,
-    placeholders_separator="",
+    placeholders_separator="\n",
```
This is to match the vLLM behavior, which hardcodes newline separators in between multimodal placeholders regardless of model.
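As a rough illustration (the joining logic below is a simplified sketch, not the actual TRT-LLM or vLLM code), the separator changes the rendered prompt like this:

```python
# Simplified sketch of placeholder joining; names and layout are illustrative.
placeholders = ["<image>", "<image>"]
text = "Describe both pictures."

# placeholders_separator="" (old behavior): placeholders run together.
prompt_empty_sep = "".join(placeholders) + "\n" + text

# placeholders_separator="\n" (matches vLLM's hardcoded newline).
prompt_newline_sep = "\n".join(placeholders) + "\n" + text

print(repr(prompt_empty_sep))    # '<image><image>\nDescribe both pictures.'
print(repr(prompt_newline_sep))  # '<image>\n<image>\nDescribe both pictures.'
```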
With this change, are the final prompts from TRT-LLM the same as vLLM's?
Pretty much, yes. The prompts now match exactly, and the outputs are essentially the same, with the caveat that TRT-LLM does not seem stable run-to-run (i.e., if I run the same prompt twice, even with temperature 0.0, I get different output):
```
# Run 1.
Prompt: '<think> Help me plan a trip to Zimbabwe '
Output: 2024". They want help planning a trip to Zimbabwe in 2024. We need to provide travel advice: best time to visit, visa requirements, health, safety, itinerary suggestions, places to see (Victoria Falls, Hwange National Park, Great Zimbabwe, Kariba, etc.), travel logistics (flights, transport, accommodation), budgeting, cultural tips, safety, etc. Also mention COVID restrictions (if any). Provide a sample itinerary, maybe 10-14 days. Provide resources, links (but not actual URLs? We can provide generic). Ensure compliance with policy: no

# Run 2.
Prompt: '<think> Help me plan a trip to Zimbabwe '
Output: 2024". They want help planning a trip to Zimbabwe in 2024. We need to provide travel advice: best time to visit, visa requirements, health, safety, itinerary suggestions (e.g., Victoria Falls, Hwange National Park, Great Zimbabwe ruins, Kariba Lake, etc.), transportation, accommodation, budget, cultural tips, etc. Also mention COVID restrictions (if any). Provide a sample itinerary, maybe 10 days. Provide resources, links (but not actual URLs? We can provide generic). Ensure compliance with policy: no disallowed content. It's fine. Provide thorough answer
```
```python
kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.8, enable_block_reuse=False)
```

```python
@pytest.mark.skip(reason="Nano V2 VLM ckpt is not released yet.")
```
This was never enabled, and we have TestNemotron_Nano_12B_V2_VL below in the same file.
```diff
@@ -384,9 +386,7 @@ def encode_base64_image(
     """

 HF_CHAT_TEMPLATE_EXCEPTIONS = ["llava_llama", "mistral_large_3"]
 PLACEHOLDER_EXCEPTIONS = [
     "llava_next", "NemotronH_Nano_VL_V2", "mistral_large_3"
```
I removed this and the code below because, per my understanding, it is no longer necessary: the chat template for both v2 and newer models can handle list-of-dicts style content.
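For reference, here is a minimal sketch of the two content styles (the field names follow the common HF chat-template convention; treat the exact shape as an assumption, not this repo's schema):

```python
# Plain-string content: placeholders are baked into the text.
message_plain = {"role": "user", "content": "<image>\nWhat is shown here?"}

# List-of-dicts content: each part carries its own type, which newer chat
# templates can render into the right placeholder tokens themselves.
message_structured = {
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What is shown here?"},
    ],
}

print(type(message_plain["content"]).__name__)       # str
print(type(message_structured["content"]).__name__)  # list
```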
PR_Github #38842 [ run ] triggered by Bot. Commit:
📝 Walkthrough

This pull request adds audio modality support to the NemotronH Nano VL V2 multimodal model. It introduces Parakeet-based audio feature extraction and projection components, integrates audio preprocessing and encoding into the model's forward flow, and updates input handling to accommodate audio alongside the existing vision modalities. Updates include audio placeholder management, resampling logic, and context token configuration.
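The resampling logic mentioned in the walkthrough can be pictured with a minimal linear-interpolation sketch (`resample_linear` is illustrative, not the PR's `_resample_audios()` implementation; production code would typically use a polyphase filter such as `scipy.signal.resample_poly`):

```python
import numpy as np

def resample_linear(audio: np.ndarray, src_rate: int, dst_rate: int) -> np.ndarray:
    # Map each destination sample time back onto the source timeline and
    # linearly interpolate; illustrative only.
    n_dst = int(round(len(audio) * dst_rate / src_rate))
    src_t = np.arange(len(audio)) / src_rate
    dst_t = np.arange(n_dst) / dst_rate
    return np.interp(dst_t, src_t, audio)

one_second_48k = np.zeros(48000, dtype=np.float32)  # 1 second at 48 kHz
one_second_16k = resample_linear(one_second_48k, 48000, 16000)
print(len(one_second_16k))  # 16000
```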
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant NanoV2Processor as NanoV2VLInputProcessor
    participant Extractor as ParakeetExtractor
    participant Model as NemotronH_Nano_VL_V2
    participant AudioEncoder as ProjectedParakeet
    participant VisionEncoder as NanoV2VLVisionEncoder
    Client->>NanoV2Processor: forward(audio_list, image_list, text)
    alt Has Audio
        NanoV2Processor->>NanoV2Processor: _expand_audio_placeholders()
        NanoV2Processor->>NanoV2Processor: _resample_audios()
        NanoV2Processor->>Extractor: __call__(resampled_audios)
        Extractor-->>NanoV2Processor: {input_features, attention_mask, audio_num_clips}
    end
    alt Has Images
        NanoV2Processor->>VisionEncoder: process(images)
        VisionEncoder-->>NanoV2Processor: image_embeddings
    end
    NanoV2Processor->>Model: forward(input_ids, embeddings, audio_features)
    alt Has Audio in MultimodalParams
        Model->>AudioEncoder: forward(audio_features, attention_mask)
        AudioEncoder-->>Model: projected_audio_embeddings
    end
    alt Has Images in MultimodalParams
        Model->>VisionEncoder: encode(images)
        VisionEncoder-->>Model: vision_embeddings
    end
    Model->>Model: _encode_multimodal(dispatch by modality)
    Model-->>Client: output_embeddings
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (2 warnings)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/models/modeling_nemotron_nano.py (1)
1211-1228: ⚠️ Potential issue | 🟠 Major

Fail fast if audio is enabled without `sound_context_token_id`.

If `sound_config` is present but `sound_context_token_id` is missing, preprocessing still injects `<so_embedding>` tokens while `forward()` never includes that ID in `mm_token_ids`. The request then runs with raw placeholder tokens instead of fused audio embeddings.

💡 Suggested fix

```diff
 self.sound_encoder: ProjectedParakeet | None = None
 sound_config = getattr(config, "sound_config", None)
 if sound_config is not None:
     self.sound_encoder = ProjectedParakeet(
         sound_config,
         llm_hidden_size=config.llm_config.hidden_size,
         dtype=getattr(config, "torch_dtype", torch.bfloat16),
     ).eval()

 llm_model_config = copy.deepcopy(model_config)
 llm_model_config.pretrained_config = llm_model_config.pretrained_config.llm_config
 self.llm = AutoModelForCausalLM.from_config(llm_model_config)
 self.vocab_size = llm_model_config.pretrained_config.vocab_size
 self.model_dtype = getattr(config, "torch_dtype", torch.bfloat16)
 self.img_context_token_id = config.img_context_token_id
 self.video_context_token_id = config.video_context_token_id
 self.sound_context_token_id = getattr(config, "sound_context_token_id", None)
+if self.sound_encoder is not None and self.sound_context_token_id is None:
+    raise ValueError(
+        "sound_context_token_id must be set when sound_config is present.")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/models/modeling_nemotron_nano.py` around lines 1211 - 1228, If sound_config is provided but sound_context_token_id is None, the model will inject placeholder <so_embedding> tokens without mapping them to the corresponding token id, causing raw placeholders instead of fused audio embeddings; in the constructor where sound_config, sound_encoder, and sound_context_token_id are handled, add a fast-fail (raise a clear exception) or set a sensible default when sound_config is not None and sound_context_token_id is missing, and ensure the forward() logic that builds mm_token_ids expects sound_context_token_id to be present (or skips audio processing if absent); update initialization around sound_config / ProjectedParakeet and the attributes sound_context_token_id to enforce this invariant so forward() never runs with audio enabled but no sound_context_token_id.
🧹 Nitpick comments (2)

tests/unittest/_torch/modeling/test_modeling_parakeet.py (1)

92-107: Add a multi-channel audio case.

Every new extractor test uses 1-D arrays, but the real loader commonly yields `(samples, channels)` for stereo files. A small stereo-input case here would catch the current contract gap in `_split_audio_into_clips()` and keep the expected behavior explicit.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/modeling/test_modeling_parakeet.py` around lines 92 - 107, Add a small stereo (multi-channel) test: create a 2-D numpy array shaped (samples, channels) (e.g., (16000,2)) and pass it into ext._split_audio_into_clips and ext(audios, sampling_rate=16000, return_tensors="pt") to verify the extractor handles multi-channel input; assert that _split_audio_into_clips returns clips whose summed first-dimension length per channel equals or exceeds the original samples, and assert the call still returns "input_features", "attention_mask", and "audio_num_clips" with the expected shapes/values (use ext._split_audio_into_clips and ext(...) as the referenced symbols).

tests/unittest/_torch/modeling/test_modeling_nemotron_nano_v2_vl.py (1)
215-262: Add a video dispatch case here.

This class only exercises audio and image routing. `_encode_multimodal()` also has to preserve the second tuple element from `vision_encoder()` for EVS/video requests, so a small video case that asserts the returned token-count metadata would close the coverage gap.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/modeling/test_modeling_nemotron_nano_v2_vl.py` around lines 215 - 262, Add a new test method in TestEncodeMultimodalDispatch that exercises a "video" modality path: create model via _make_mock_model(), set fake_video_embeds and fake_token_counts, set model.vision_encoder.return_value = ([fake_video_embeds], [fake_token_counts]), build mm_param.multimodal_data with "modality_type": "video" and a suitable "video" payload, call NemotronH_Nano_VL_V2._encode_multimodal(model, [mm_param]), assert model.vision_encoder was called with [mm_param], assert the returned embeds match fake_video_embeds and assert the second returned element (result_nones) preserves the token-count metadata equals [fake_token_counts].
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 54c257d3-9472-45e6-86c9-b0f2f58a15a9
📒 Files selected for processing (9)
- tensorrt_llm/_torch/models/modeling_nemotron_nano.py
- tensorrt_llm/_torch/models/modeling_parakeet.py
- tensorrt_llm/_torch/models/modeling_radio.py
- tensorrt_llm/inputs/utils.py
- tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py
- tests/integration/test_lists/test-db/l0_a10.yml
- tests/unittest/_torch/modeling/test_modeling_nemotron_nano_v2_vl.py
- tests/unittest/_torch/modeling/test_modeling_parakeet.py
- tests/unittest/_torch/modeling/test_nemotron_nano_preprocessing.py
💤 Files with no reviewable changes (1)
- tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py
```python
def _encode_multimodal(
    self, multimodal_params: List[MultimodalParams]
) -> Tuple[List[torch.Tensor], List[None]]:
    """Dispatch multimodal encoding to the appropriate encoder."""
    mm_embeddings = []
    for param in multimodal_params:
        modality_type = param.multimodal_data["modality_type"]
        if modality_type in ("image", "video"):
            embs, _ = self.vision_encoder([param])
            mm_embeddings.append(embs[0])
        elif modality_type == "audio":
            mm_embeddings.append(self._encode_audio(param))
        else:
            raise ValueError(f"Unknown modality: {modality_type}")
    return mm_embeddings, [None] * len(multimodal_params)
```
Preserve video EVS metadata through _encode_multimodal().
self.vision_encoder([param]) returns both embeddings and num_tokens_in_videos, but this wrapper drops the second value and always returns [None]. That breaks the existing EVS/video path because forward() later passes those counts into merge_evs_mm_embeds().
💡 Suggested fix
```diff
 def _encode_multimodal(
     self, multimodal_params: List[MultimodalParams]
-) -> Tuple[List[torch.Tensor], List[None]]:
+) -> Tuple[List[torch.Tensor], List[List[int] | None]]:
     """Dispatch multimodal encoding to the appropriate encoder."""
     mm_embeddings = []
+    num_tokens_in_videos = []
     for param in multimodal_params:
         modality_type = param.multimodal_data["modality_type"]
         if modality_type in ("image", "video"):
-            embs, _ = self.vision_encoder([param])
+            embs, token_counts = self.vision_encoder([param])
             mm_embeddings.append(embs[0])
+            num_tokens_in_videos.append(token_counts[0])
         elif modality_type == "audio":
             mm_embeddings.append(self._encode_audio(param))
+            num_tokens_in_videos.append(None)
         else:
             raise ValueError(f"Unknown modality: {modality_type}")
-    return mm_embeddings, [None] * len(multimodal_params)
+    return mm_embeddings, num_tokens_in_videos
```

🤖 Prompt for AI Agents
+ return mm_embeddings, num_tokens_in_videos🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/models/modeling_nemotron_nano.py` around lines 1349 -
1363, The wrapper _encode_multimodal currently discards the second return value
from self.vision_encoder (which returns embeddings and num_tokens_in_videos) and
always returns a list of None, breaking EVS/video handling; fix it by collecting
the second return per multimodal param when modality_type is "image" or "video"
(i.e., capture embs, num_tokens = self.vision_encoder([param]) and append
num_tokens[0] to a parallel list), ensure audio paths still append the correct
placeholder if needed, and return (mm_embeddings, mm_num_tokens) so forward()
can pass those counts into merge_evs_mm_embeds() correctly.
```python
from typing import Dict, NamedTuple, Optional
```
Add the standard NVIDIA Apache 2.0 header.
This new module lands without the repository's required license header.
As per coding guidelines, "All TensorRT-LLM source files should contain an NVIDIA copyright header with the year of the latest meaningful modification. The header should be an Apache 2.0 license block as specified".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/models/modeling_parakeet.py` around lines 1 - 2, This
file is missing the required NVIDIA Apache 2.0 license header; add the standard
NVIDIA Apache 2.0 copyright/license block at the very top of the module (above
the existing imports such as the typing imports and any top-level symbols in
modeling_parakeet.py), updating the year to the latest meaningful modification
year so the header precedes the first code line (e.g., before "from typing
import Dict, NamedTuple, Optional").
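For reference, TRT-LLM files typically carry a header along these lines (copy the exact text from an existing file in the repo rather than this sketch; the year and SPDX form here are assumptions):

```python
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```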
```python
def _split_audio_into_clips(self, audio: np.ndarray) -> list[np.ndarray]:
    assert audio.ndim == 1
    audio_len = int(audio.shape[0])
    clip_sizes = self._clip_sizes(audio_len)
    target_len = sum(clip_sizes)
    if audio_len < target_len:
        audio = np.pad(audio, (0, target_len - audio_len))

    clips = list[np.ndarray]()
    offset = 0
    for clip_size in clip_sizes:
        clips.append(audio[offset : offset + clip_size])
        offset += clip_size
    return clips
```
Handle multi-channel audio before clipping.
soundfile.read() commonly returns shape [num_samples, channels], but this helper only accepts 1-D arrays via a bare assert. A normal stereo file will therefore die with AssertionError here instead of being downmixed or rejected cleanly.
💡 Suggested fix

```diff
 def _split_audio_into_clips(self, audio: np.ndarray) -> list[np.ndarray]:
-    assert audio.ndim == 1
+    if audio.ndim == 2:
+        audio = audio.mean(axis=-1)
+    elif audio.ndim != 1:
+        raise ValueError(f"Expected mono audio, got shape {audio.shape}")
     audio_len = int(audio.shape[0])
     clip_sizes = self._clip_sizes(audio_len)
     target_len = sum(clip_sizes)
     if audio_len < target_len:
         audio = np.pad(audio, (0, target_len - audio_len))
```
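A quick, self-contained check of the mean-downmix behavior suggested above (plain NumPy, independent of the extractor):

```python
import numpy as np

# Two-channel input shaped (num_samples, channels), as soundfile.read()
# returns for stereo files.
stereo = np.stack([np.ones(16000), -np.ones(16000)], axis=-1)
mono = stereo.mean(axis=-1)

print(stereo.shape)    # (16000, 2)
print(mono.shape)      # (16000,)
print(float(mono[0]))  # 0.0  (the two opposite channels cancel out)
```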
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/models/modeling_parakeet.py` around lines 45 - 58, The
helper _split_audio_into_clips currently asserts audio is 1-D and will crash on
multi-channel input; update _split_audio_into_clips to handle multi-channel
arrays from soundfile.read by checking audio.ndim: if ndim == 2 downmix to mono
(e.g., average across channels) before computing audio_len and calling
_clip_sizes, if ndim > 2 or channels == 0 raise a clear ValueError indicating
unsupported shape; keep the rest of the logic (clip_sizes, padding, slicing)
unchanged and reference _split_audio_into_clips and _clip_sizes when making the
change.
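The pad-then-slice pattern in `_split_audio_into_clips()` can also be sketched generically with a fixed clip size (the model's `_clip_sizes()` schedule is not reproduced here; `split_into_clips` is illustrative only):

```python
import numpy as np

def split_into_clips(audio: np.ndarray, clip_size: int) -> list:
    # Pad the tail up to a whole number of clips, then slice.
    n_clips = -(-len(audio) // clip_size)  # ceiling division
    target_len = n_clips * clip_size
    if len(audio) < target_len:
        audio = np.pad(audio, (0, target_len - len(audio)))
    return [audio[i * clip_size:(i + 1) * clip_size] for i in range(n_clips)]

clips = split_into_clips(np.arange(10, dtype=np.float32), clip_size=4)
print([len(c) for c in clips])  # [4, 4, 4]
```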
```python
elif parsed_url.scheme == "file":
    audio = parsed_url.path
```
Mirror file:// handling in load_audio().
This only fixes the async path. The common sync loader still hands the raw file://... URI to soundfile.read(), so local audio URLs remain broken in default_multimodal_input_loader() and any other synchronous caller. It would be safer to normalize file:// once and reuse it from both loaders, ideally with URI unescaping as well.
💡 Suggested fix

```diff
-from urllib.parse import urlparse
+from urllib.parse import unquote, urlparse
 ...
 def load_audio(
     audio: str,
     format: str = "pt",
     device: str = "cuda",
 ) -> Tuple[np.ndarray, int]:
     parsed_url = urlparse(audio)
     if parsed_url.scheme in ["http", "https"]:
         audio = requests.get(audio, stream=True, timeout=10)
         audio = BytesIO(audio.content)
+    elif parsed_url.scheme == "file":
+        audio = Path(unquote(parsed_url.path))
     audio = soundfile.read(audio)
     return audio
 ...
 async def async_load_audio(
     audio: str,
     format: str = "pt",
     device: str = "cuda",
 ) -> Tuple[np.ndarray, int]:
     parsed_url = urlparse(audio)
     if parsed_url.scheme in ["http", "https"]:
         async with aiohttp.ClientSession() as session:
             async with session.get(audio) as response:
                 audio = BytesIO(await response.content.read())
     elif parsed_url.scheme == "file":
-        audio = parsed_url.path
+        audio = Path(unquote(parsed_url.path))
     audio = soundfile.read(audio)
     return audio
```
return audio🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/inputs/utils.py` around lines 347 - 348, Normalize and unescape
file:// URIs in a single helper used by both loaders: extract the file path
handling currently in load_audio() (parsed_url and parsed_url.path) into a new
utility (e.g., normalize_local_uri(uri) or reuse existing parsed_url logic) that
strips the "file://" scheme and applies urllib.parse.unquote on the path, then
call that helper from load_audio() and from default_multimodal_input_loader()
before calling soundfile.read(); ensure soundfile.read() always receives a plain
filesystem path, not a raw file:// URI.
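A minimal sketch of the shared helper suggested here (`normalize_local_uri` is a hypothetical name, not an existing TRT-LLM function):

```python
from urllib.parse import unquote, urlparse

def normalize_local_uri(uri: str) -> str:
    # Hypothetical helper: strip a file:// scheme and percent-unescape the
    # path so loaders always receive a plain filesystem path.
    parsed = urlparse(uri)
    if parsed.scheme == "file":
        return unquote(parsed.path)
    return uri

print(normalize_local_uri("file:///tmp/my%20audio.wav"))  # /tmp/my audio.wav
print(normalize_local_uri("/tmp/plain.wav"))              # /tmp/plain.wav
```

Both `load_audio()` and `default_multimodal_input_loader()` could then call this once before handing the result to `soundfile.read()`.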
PR_Github #38842 [ run ] completed with state
```python
placeholder_map={
    "image": "<image>",
    "video": "<video>",
    "audio": AUDIO_PLACEHOLDER,
```
It looks like you may have hit enter too soon?
```python
    input_features=input_features,
    attention_mask=attention_mask,
)
hidden_states = outputs.last_hidden_state.to(torch.bfloat16)
```
Should we fetch the dtype from the config and convert to the model dtype?
Good catch. I just got rid of this entirely, since it was already in bfloat16 with the checkpoint I was using, and we're setting the dtype in `__init__`.
Should we take care of the audio token here?
Good catch. I added them, although note that `find_mm_token_lengths` does not support audio yet.
```python
self._sound_context_token = getattr(config, "sound_context_token", AUDIO_PLACEHOLDER)
# At time of writing (03/09/2026), these were not included in the config.
self._sound_start = "<so_start>"
self._sound_end = "<so_end>"
```
- L1. Hardcoded audio start/end tokens may drift from config
- File: tensorrt_llm/_torch/models/modeling_nemotron_nano.py:599-601
- Source: comprehensive-local-review
- Details: The audio start/end tokens are hardcoded:

  ```python
  # At time of writing (03/09/2026), these were not included in the config.
  self._sound_start = "<so_start>"
  self._sound_end = "<so_end>"
  ```

  The comment explains why, but there's no `getattr(config, ...)` fallback pattern to pick up config values if/when they're added. Currently safe, but could silently use the wrong tokens when the config is updated.
- Fix: Use the same `getattr` pattern as the sound_context_token:

  ```python
  self._sound_start = getattr(config, "sound_start_token", "<so_start>")
  self._sound_end = getattr(config, "sound_end_token", "<so_end>")
  ```
Opting to leave as is, since we don't even know whether those will be the attributes in the config anyway.
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
/bot run --disable-fail-fast
PR_Github #38903 [ run ] triggered by Bot. Commit:

PR_Github #38903 [ run ] completed with state
Summary by CodeRabbit
Release Notes
New Features
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.