This repository integrates the Sotopia Werewolf environment with the Verl Reinforcement Learning framework to train a Qwen agent using PPO.
We use Verl's Async Agent Loop architecture. This differs from standard RLHF (which typically does single-turn Q&A) by moving the entire game interaction inside the inference worker.
- Orchestrator (`train_werewolf.py`):
  - Starts the Ray cluster.
  - Loads the PPO config (`werewolf_ppo.yaml`).
  - Distributes work to Ray workers.
- Make Work (`data/werewolf_train.parquet`):
  - Verl is data-driven: it asks "What prompts should I train on?".
  - Since Werewolf is an environment we just "start", we feed Verl a Dummy Dataset (created by `create_dummy_data.py`).
  - Each row in the `.parquet` file triggers one game episode.
- Rollout Worker (vLLM + AgentLoop):
  - Verl sends a "prompt" (e.g., "Game 1") to the vLLM worker.
  - Instead of just generating text, the worker triggers our Custom Agent Loop (`werewolf_agent_loop.py`).
- The Game Loop (`WerewolfAgentLoop`):
  - Initializes: Calls `SotopiaWerewolfWrapper` to create a fresh Werewolf game.
  - Interacts:
    - Action: Uses vLLM `generate()` to get the Trainee's move.
    - Environment: Calls `wrapper.step()` to process that move and simulate the other agents (using GPT-4 or other policies defined in Sotopia).
    - Loop: Repeats until the game ends.
  - Returns: A full trajectory of (Prompt, Response, Reward) tokens back to the PPO trainer (sketched below).
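In pseudocode, the three phases of the Game Loop look roughly like this. This is a simplified sketch: `setup_game()` and `step()` come from this repo's wrapper, while `generate` and the return shape are illustrative stand-ins for Verl's actual agent-loop API.

```python
async def run_episode(wrapper, generate):
    """One Werewolf episode as the agent loop drives it (simplified sketch)."""
    obs = wrapper.setup_game()                  # Initialize: a fresh game
    prompts, responses = [], []
    episode_reward, done = 0.0, False
    while not done:                             # Interact
        prompts.append(obs)                     # what the Trainee saw this turn
        move = await generate(obs)              # Action: vLLM produces the move
        obs, reward, done = wrapper.step(move)  # Environment: other agents finish the round
        responses.append(move)
        episode_reward += reward
    # Returns: the PPO trainer receives the (Prompt, Response, Reward) trajectory
    return prompts, responses, episode_reward
```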
`examples/train_werewolf.py`: The entry point. It registers our custom werewolf agent-loop type so Verl knows it exists, then launches the standard Verl PPO main function.
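In outline, that is only a few lines. This is a sketch, not the literal file: the PPO entry point (`run_ppo` here) and the registration-by-import side effect are assumptions that depend on your Verl version.

```python
# examples/train_werewolf.py -- outline only; check your Verl version for the
# exact PPO entry point (run_ppo here is an assumption).
import hydra

# Importing the module runs its registration side effect, so the name
# "werewolf" is known to the AgentLoopManager before training starts.
import werewolf_agent_loop  # noqa: F401


@hydra.main(config_path="../config", config_name="werewolf_ppo", version_base=None)
def main(config):
    from verl.trainer.main_ppo import run_ppo  # assumed entry point
    run_ppo(config)


if __name__ == "__main__":
    main()
```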
`config/werewolf_ppo.yaml`: The main configuration file (Hydra format). Key settings:
- `rollout.name: vllm`: We use vLLM for fast inference.
- `rollout.mode: async`: CRITICAL. Tells Verl we are using the Agent Loop system.
- `rollout.agent.default_agent_loop: werewolf`: Points to our custom class.
A small config file that maps the name "werewolf" to the Python class `WerewolfAgentLoop`. It is loaded dynamically by the AgentLoopManager.
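Conceptually, the manager turns that mapping into a class object with a dynamic import, roughly like this (an illustrative sketch of the mechanism, not Verl's actual loader code):

```python
import importlib


def resolve_agent_loop(target: str):
    """Resolve a dotted path such as 'werewolf_agent_loop.WerewolfAgentLoop'
    into the class it names, the way Hydra-style _target_ fields are handled."""
    module_path, class_name = target.rsplit(".", 1)
    return getattr(importlib.import_module(module_path), class_name)
```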
`werewolf_agent_loop.py`: The Core Logic. The `WerewolfAgentLoop` class inherits from `AgentLoopBase`.
- It manages the conversation history (User/Assistant turns).
- It handles the masking (we only train on the Agent's output, not the Environment's observations), as sketched below.
- It calculates the final reward.
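The masking is the subtle part: environment text must appear in the context but contribute no gradient. A minimal sketch of the bookkeeping, assuming a Hugging Face tokenizer and the convention that mask 0 = environment tokens and mask 1 = agent tokens; the helper itself is illustrative, not Verl code:

```python
def append_turn(tokenizer, ids, mask, text, from_agent):
    """Append one turn's tokens; only agent tokens (mask 1) receive PPO loss."""
    new_ids = tokenizer.encode(text, add_special_tokens=False)
    ids.extend(new_ids)
    mask.extend([1 if from_agent else 0] * len(new_ids))


# Usage: observations get mask 0, the Trainee's replies get mask 1, so the
# PPO loss only covers tokens the policy actually produced.
# append_turn(tok, ids, mask, observation_text, from_agent=False)
# append_turn(tok, ids, mask, agent_reply_text, from_agent=True)
```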
The Bridge: `SotopiaWerewolfWrapper`. Sotopia is a complex multi-agent system; this wrapper makes it look like a simple environment:
- `setup_game()`: Creates a scenario.
- `step()`: Handles the Trainee's action and automatically runs all other agents (Werewolves, Villagers) to finish the round.
- `_parse_action()`: Ensures the LLM's text output ("I vote for X") becomes a valid game action.
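A skeleton of that interface. The method names come from this README; the signatures, return shapes, and the parsing regex are illustrative assumptions:

```python
import re


class SotopiaWerewolfWrapper:
    """Makes a multi-agent Sotopia Werewolf game look like a single-agent env."""

    def setup_game(self) -> str:
        """Create a scenario and return the Trainee's first observation."""
        ...

    def step(self, raw_text: str):
        """Apply the Trainee's action, run every other agent to finish the
        round, and return (observation, reward, done)."""
        action = self._parse_action(raw_text)
        ...  # hand `action` to Sotopia and advance the round

    def _parse_action(self, raw_text: str) -> str:
        """Map free-form LLM text ('I vote for X') onto a legal game action."""
        match = re.search(r"vote for (\w+)", raw_text, re.IGNORECASE)
        return f"vote:{match.group(1)}" if match else "no_op"
```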
`examples/create_dummy_data.py`: Generates `data/werewolf_train.parquet`.
- Why? Verl requires an input dataset to define the "epoch".
- We generate 100 "dummy" rows, which effectively means "run 100 parallel game episodes per epoch".
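A sketch of what that script can look like. The column schema below (`data_source`, a chat-style `prompt`, `agent_name`) is an assumption about what your Verl dataset class expects; adjust it to your setup.

```python
# examples/create_dummy_data.py -- sketch; column names are assumptions about
# the schema your Verl dataset class expects.
import os

import pandas as pd

NUM_EPISODES = 100  # one parquet row = one game episode per epoch

rows = [
    {
        "data_source": "werewolf",
        # Chat-format prompt; the content is a placeholder, since the real
        # observations come from the environment, not the dataset.
        "prompt": [{"role": "user", "content": f"Game {i}"}],
        "agent_name": "werewolf",  # routes the row to our agent loop
    }
    for i in range(NUM_EPISODES)
]

os.makedirs("data", exist_ok=True)
pd.DataFrame(rows).to_parquet("data/werewolf_train.parquet")
```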
- Generate Data:

  ```bash
  python examples/create_dummy_data.py
  ```

- Start Training:

  ```bash
  python examples/train_werewolf.py --config-path ../config --config-name werewolf_ppo
  ```