Conversation
trl/generation/vllm_generation.py
Outdated
```python
    # Handle async rollout_func
    output = asyncio.run(rollout_func(rollout_prompts))
else:
    # [Legacy] Handle sync rollout_func
```
should we consider this as legacy?
For OpenEnv, the new default is to use async, so the sync path will only be useful for some old envs, but I think having "Legacy" there may be misleading from the TRL side. I've removed it.
Changes extended to notebooks.
albertvillanova left a comment
Thanks for addressing this important issue!
Some comments/questions below.
```diff
     vllm_mode=args.vllm_mode,
     vllm_server_base_url=args.vllm_server_url if args.vllm_mode == "server" else None,
-    vllm_gpu_memory_utilization=0.4,
+    vllm_gpu_memory_utilization=0.5,
```
Is this change intended? Why?
trl/generation/vllm_generation.py
Outdated
```python
# Support both sync and async rollout functions:
if inspect.iscoroutinefunction(rollout_func):
    # Handle async rollout_func
    output = asyncio.run(rollout_func(rollout_prompts))
```
This may raise an error if run in an environment with an active event loop (notebooks, IPython,...).
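To illustrate the failure mode, here is a minimal stdlib-only sketch (`rollout_func` here is a stand-in, not the PR's actual function):

```python
import asyncio

async def rollout_func(prompts):
    return prompts

async def main():
    # With an event loop already running in this thread (as in a notebook
    # cell), a nested asyncio.run() raises immediately:
    try:
        asyncio.run(rollout_func(["hi"]))
        return None
    except RuntimeError as err:
        return str(err)

error = asyncio.run(main())
print(error)  # asyncio.run() cannot be called from a running event loop
```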
Great catch! I've handled it by adding nest_asyncio. Tested and working!
Thanks again for addressing this and your proposed solution.
However, I am still concerned about this approach:
- Proposing to use `nest_asyncio` in our example notebooks, but keeping the call to `asyncio.run` in the codebase
Some of my concerns (just brainstorming; please, feel free to give your opinions/arguments on this):
- in my opinion, the lib codebase itself should support handling a running loop, so async rollouts are reliable in common environments
- in my opinion, the use of `nest_asyncio` is a "reasonable" notebook workaround, but not a robust base-code fix
  - it patches the running event loop to allow re-entrant calls like `asyncio.run` inside an active loop
  - this can avoid the immediate RuntimeError and may appear to work in notebooks
  - but it's still blocking the loop thread and can lead to subtle hangs if the coroutine depends on other tasks that need the loop to make progress (custom rollout_func on same loop, tool-calling or streaming generation, IPython kernel, UI callbacks, background progress, etc.)
- on the other hand, we should be aware that it is not straightforward to support handling a running loop:
- our use case: awaiting async code in a synchronous function running inside an active event loop on the same thread
- problem: blocking on async results in the same thread causes deadlock
- possible solution: run async code in a new background thread when a loop is running; otherwise use asyncio.run directly
- along these lines, I see that the OpenEnv codebase implemented a `run_async_safely` function: https://github.com/meta-pytorch/OpenEnv/pull/343/changes#diff-897378bca76b3c096bfbfa9da0e18034a0ead03bd8a27f976e0fbc990a4360c2R13
  - I think that approach goes in the right direction
  - but I'm not 100% sure about their specific implementation:
    - I think it still blocks the current thread while waiting for `future.result()`
    - I could open an issue on their repo to discuss this
As I said above, just brainstorming. Feel free to disagree. Also, we could address these subtleties in a subsequent PR if you prefer.
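For concreteness, the thread-based fallback described above could look roughly like this (a sketch only; the name mirrors OpenEnv's `run_async_safely`, but this is not their implementation):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def run_async_safely(coro):
    """Run a coroutine to completion from synchronous code.

    If no event loop is running in this thread, asyncio.run is safe to use
    directly. If a loop is already running (e.g. a notebook), run the
    coroutine on a fresh loop in a background thread. Note: this still
    blocks the calling thread until the result is ready.
    """
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No active loop in this thread: the plain path works.
        return asyncio.run(coro)
    # Active loop: delegate to a background thread with its own loop.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()
```

With something like this, the trainer could call `run_async_safely(rollout_func(rollout_prompts))` identically in scripts and notebooks.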
Thanks a lot for all the details! I've tried to implement a common solution for all the different cases (script/notebook). Both are now handled on the TRL side without introducing changes in the notebook. Since Colab already ships with nest_asyncio installed, we don't need to add the dependency. What do you think?
```python
]
output = rollout_func(rollout_prompts)
# Support both sync and async rollout functions:
if inspect.iscoroutinefunction(rollout_func):
```
Whereas this check is enough for native coroutine functions (defined with `async def`), it will fail for other async callables and awaitables: partials, callables with `__call__`, decorated functions, functions that return coroutines...
Do you think we should support all those edge cases, or is it enough for the moment to check just for coroutine functions?
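A quick illustration of one such edge case (hypothetical names, just to show the check's behavior):

```python
import inspect

class AsyncRollout:
    """An async callable: awaitable when called, but not itself a coroutine function."""
    async def __call__(self, prompts):
        return prompts

rollout = AsyncRollout()

# The instance itself is not recognized...
print(inspect.iscoroutinefunction(rollout))           # False
# ...even though its __call__ method is a coroutine function:
print(inspect.iscoroutinefunction(rollout.__call__))  # True
```

Supporting this case would mean also checking `rollout_func.__call__` as a fallback when the direct check fails.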
mmm I think it's ok to just add the current support
Maybe a naive question: is async rollout_func intended to be supported in both server and colocate modes?
thanks for this one! It was missing 😄 added
Updated @albertvillanova. Btw, maybe we can now wait for a new OpenEnv release before merging this one, since we're now pinning the version both in TRL and in OpenEnv (remote HF Spaces included) and the current release doesn't include the async client.
What does this PR do?
Minimal changes to support the changes in OpenEnv for making the EnvClient async by default (meta-pytorch/OpenEnv#343).
Who can review?
@burtenshaw @kashif @albertvillanova @qgallouedec