[TRTLLM-11492][fix] Replace blocking fill loop with non-blocking can_forward gate in benchmark disagg mode #12208

Draft
chienchunhung wants to merge 1 commit into NVIDIA:main from chienchunhung:test_for_disagg_hang

@chienchunhung commented Mar 13, 2026

@coderabbitai summary

Description

Fix a deadlock in the benchmark disagg fill logic and add test coverage.

The benchmark disagg fill loop in _prepare_and_schedule_batch blocked the GEN executor until all requests were fetched, starving KV transfer processing and causing deadlocks when the CTX server had limited KV cache. Replace the blocking loop with a non-blocking can_forward gate in both _executor_loop and _executor_loop_overlap. Each main-loop iteration now fetches available requests, services KV transfers, and checks readiness via _is_benchmark_disagg_fill_complete, allowing incremental progress.

  • Remove blocking fill loop from _prepare_and_schedule_batch
  • Add _is_benchmark_disagg_fill_complete helper (ADP and non-ADP)
  • Add can_forward gate to _executor_loop (previously only in overlap)
  • Extract is_benchmark_disagg attribute for consistent condition checks
  • Add unit tests covering threshold logic, gating, and incremental fill
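
The gating pattern described above can be sketched with a toy model. This is not the TensorRT-LLM code: `ToyGenExecutor`, its fields, and the "all expected requests fetched" readiness rule are hypothetical stand-ins chosen for illustration, and the real `_is_benchmark_disagg_fill_complete` threshold logic (including the ADP case) may differ. The point is only that each iteration fetches what is available, services KV transfers, and then decides whether to run forward, instead of looping until the fill is complete.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyGenExecutor:
    """Toy stand-in for a GEN executor; not the TensorRT-LLM class."""

    expected_requests: int  # hypothetical fill threshold
    inbox: deque = field(default_factory=deque)  # arrived, not yet fetched
    fetched: list = field(default_factory=list)
    kv_transfers_serviced: int = 0
    forward_steps: int = 0

    def fetch_available(self) -> None:
        # Non-blocking: drain whatever has arrived so far, then return.
        while self.inbox:
            self.fetched.append(self.inbox.popleft())

    def service_kv_transfers(self) -> None:
        # Placeholder for KV transfer progress; runs on every iteration,
        # whether or not the fill is complete.
        self.kv_transfers_serviced += 1

    def is_fill_complete(self) -> bool:
        # Assumed readiness rule: all expected requests have been fetched.
        return len(self.fetched) >= self.expected_requests

    def step(self) -> bool:
        """One main-loop iteration; returns whether the forward pass ran."""
        self.fetch_available()
        self.service_kv_transfers()
        can_forward = self.is_fill_complete()
        if can_forward:
            self.forward_steps += 1
        return can_forward


def run_demo():
    ex = ToyGenExecutor(expected_requests=3)
    arrivals = [["r0"], [], ["r1"], ["r2"]]  # requests arriving per iteration
    gate = []
    for batch in arrivals:
        ex.inbox.extend(batch)
        gate.append(ex.step())
    return ex, gate
```

In this toy run the gate stays closed for the first three iterations while KV transfers are still serviced each time, and opens on the fourth once all three requests have been fetched. A blocking fill loop would instead have spun inside the fetch step without ever reaching the KV transfer step.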

This PR was co-developed with the Cursor agent.

Test Coverage

Added a new unit test file "tests/unittest/_torch/executor/test_benchmark_disagg.py".

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
@chienchunhung (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #38922 [ run ] triggered by Bot. Commit: 94c4723 Link to invocation
