Description
Dear maintainers,
First of all, thank you very much for open-sourcing this project and sharing the codebase; it has been extremely helpful for our research on task-oriented planning with LLMs!
While reviewing the code, we noticed that the current repository only includes logic for:
Generating synthetic conversations
Rewriting conversations (basic/advanced summarization)
Generating task plans
Pairwise preference evaluation of plans (LLM-based annotation)
Sensitivity analysis of plan graphs
However, we could not find any code related to Direct Preference Optimization (DPO): no DPO loss definition, no model fine-tuning on preference data, no DPO training loop, and no DPO-related dependencies such as trl or torch. This appears inconsistent with the associated paper, which, as we understand it, describes DPO-based optimization.
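For reference, the loss we expected to find somewhere in the repository is the standard per-pair DPO objective (Rafailov et al., 2023). A minimal sketch of that objective, written here only to make concrete what we searched for (the function name and argument names are ours, not the repo's):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair:
    -log sigmoid(beta * ((logp_c - ref_logp_c) - (logp_r - ref_logp_r))).
    Inputs are sequence log-probabilities under the policy and the
    frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically this is -log(sigmoid(margin)).
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When policy and reference agree (zero margin) the loss is log 2; it decreases as the policy assigns relatively more probability to the chosen plan.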
We are particularly interested in how the preference data (from evaluate_plan.py) is used to optimize the model via DPO, and would greatly appreciate it if you could:
1. Confirm whether the paper's DPO-related content corresponds to this codebase;
2. Open-source the DPO implementation code (e.g., DPO training pipeline, model fine-tuning logic, data processing for DPO, etc.);
3. If the DPO code cannot be open-sourced, provide brief documentation on how to integrate DPO with the existing preference annotation data in this project.
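To make point 3 concrete: our current guess is that the pairwise annotations would need to be reshaped into the (prompt, chosen, rejected) triples expected by DPO trainers such as trl's `DPOTrainer`. A minimal sketch of that conversion, assuming a hypothetical record format (the field names `task`, `plan_a`, `plan_b`, `preferred` are our invention, not the actual output schema of `evaluate_plan.py`):

```python
# Hypothetical: reshape pairwise plan-preference annotations into DPO triples.
# The record fields below are assumptions about evaluate_plan.py's output,
# not its documented schema.

def to_dpo_triples(records: list[dict]) -> list[dict]:
    """Each record is assumed to hold a task prompt, two candidate plans,
    and the annotator's preferred side ('a' or 'b')."""
    triples = []
    for r in records:
        if r["preferred"] == "a":
            chosen, rejected = r["plan_a"], r["plan_b"]
        else:
            chosen, rejected = r["plan_b"], r["plan_a"]
        triples.append(
            {"prompt": r["task"], "chosen": chosen, "rejected": rejected}
        )
    return triples
```

Even a short note confirming (or correcting) this mapping would let us wire the existing annotation output into an off-the-shelf DPO training loop.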
This will help us reproduce the full pipeline described in the paper and advance related research. Thank you again for your time and efforts!
Best regards,