Hi, thanks for the paper and the open-source code.
In the personal-agent sections, the policy model is clearly specified as Qwen3-4B. However, the exact backbone used for the PRM / judge in the personal-agent experiments is still unclear.
From the paper:
- Section 5.1 specifies the personal-agent policy model as Qwen3-4B.
- Appendix C provides the personal-agent PRM judge prompt and the OPD hindsight hint prompt.
- The paper describes `teacher_log_probs` in OPD in a way that seems closer to re-scoring with the policy model under the hint-enhanced context.
From the current codebase:
- `hint`, `eval_score`, and `teacher_log_probs` in the OPD / combined implementation seem to be queried through the PRM service.
Could you clarify the personal-agent setup used in the paper?
- What exact model was used as the PRM / judge for the personal-agent experiments?
- In the personal-agent experiments, how were `teacher_log_probs` obtained:
- from the PRM service,
- or from the policy model under the hint-enhanced context?
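To make the two readings concrete, here is a minimal Python sketch of what I understand the alternatives to be. All names (`prm_score_fn`, `policy_log_prob_fn`, etc.) are hypothetical placeholders for illustration only, not functions from the actual codebase:

```python
# Hypothetical sketch contrasting the two possible sources of teacher_log_probs.
# These names do not come from the released code; they only illustrate the question.

def teacher_log_probs_from_prm(prm_score_fn, response_tokens):
    """Interpretation A: the PRM service itself returns the per-token
    teacher log-probs used in the OPD loss."""
    return [prm_score_fn(tok) for tok in response_tokens]

def teacher_log_probs_from_policy(policy_log_prob_fn, context, hint, response_tokens):
    """Interpretation B: the policy model re-scores the same response
    tokens under the hint-enhanced context (original context + hint)."""
    hinted_context = context + hint
    return [policy_log_prob_fn(hinted_context, tok) for tok in response_tokens]
```

If interpretation B is what the paper intends, the PRM would only supply `hint` and `eval_score`, while the log-probs would come from a second forward pass of the policy model itself.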