Hi, thanks for the paper and the open-source code.
In the personal-agent sections, the policy model is clearly specified as Qwen3-4B. However, the exact backbone used for the PRM / judge in the personal-agent experiments is still unclear.
From the paper:
- Section 5.1 specifies the personal-agent policy model as Qwen3-4B.
- Appendix C provides the personal-agent PRM judge prompt and the OPD hindsight hint prompt.
- The paper describes `teacher_log_probs` in OPD in a way that seems closer to re-scoring with the policy model under the hint-enhanced context.
From the current codebase:
- `hint`, `eval_score`, and `teacher_log_probs` in the OPD / combined implementation seem to be queried through the PRM service.
Could you clarify the personal-agent setup used in the paper?
- What exact model was used as the PRM / judge for the personal-agent experiments?
- In the personal-agent experiments, how were `teacher_log_probs` obtained:
- from the PRM service,
- or from the policy model under the hint-enhanced context?
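To make the two readings concrete, here is a minimal Python sketch of what I understand the alternatives to be. All names (`prm_score_fn`, `policy_log_prob_fn`, etc.) are hypothetical placeholders for illustration only, not functions from the actual codebase:

```python
# Hypothetical sketch contrasting the two possible sources of teacher_log_probs.
# These names do not come from the released code; they only illustrate the question.

def teacher_log_probs_from_prm(prm_score_fn, response_tokens):
    """Interpretation A: the PRM service itself returns the per-token
    teacher log-probs used in the OPD loss."""
    return [prm_score_fn(tok) for tok in response_tokens]

def teacher_log_probs_from_policy(policy_log_prob_fn, context, hint, response_tokens):
    """Interpretation B: the policy model re-scores the same response
    tokens under the hint-enhanced context (original context + hint)."""
    hinted_context = context + hint
    return [policy_log_prob_fn(hinted_context, tok) for tok in response_tokens]
```

If interpretation B is what the paper intends, the PRM would only supply `hint` and `eval_score`, while the log-probs would come from a second forward pass of the policy model itself.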