What PRM model was used for the personal-agent experiments, and how were teacher_log_probs obtained in OPD? #83

@astarforbae

Hi, thanks for the paper and the open-source code.

While reading the personal-agent sections, the policy model seems clearly specified as Qwen3-4B. However, the exact backbone used for the PRM / judge in the personal-agent experiments is still unclear.

From the paper:

  • Section 5.1 specifies the personal-agent policy model as Qwen3-4B.
  • Appendix C provides the personal-agent PRM judge prompt and the OPD hindsight hint prompt.
  • The paper's description of teacher_log_probs in OPD reads more like re-scoring with the policy model itself under the hint-enhanced context than like querying a separate model.

From the current codebase:

  • hint, eval_score, and teacher_log_probs in the OPD / combined implementation seem to be queried through the PRM service.

Could you clarify the personal-agent setup used in the paper?

  1. What exact model was used as the PRM / judge for the personal-agent experiments?
  2. In the personal-agent experiments, how were teacher_log_probs obtained:
    • from the PRM service,
    • or from the policy model under the hint-enhanced context?
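
To make question 2 concrete, here is a minimal sketch of the distinction I have in mind. It is not taken from the repository: `Scorer`, `score_response`, `toy_policy`, and the `<hint>` token are all hypothetical names, and the toy scorer only stands in for a real language model. The point is that the same response tokens are re-scored under two contexts, and the open question is which model plays the role of the scorer for the teacher.

```python
import math
from typing import Callable, Dict, List

# Hypothetical interface: a scorer maps a token prefix to next-token logits.
Scorer = Callable[[List[str]], Dict[str, float]]


def log_softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable log-softmax over a small vocabulary."""
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return {tok: v - log_z for tok, v in logits.items()}


def score_response(scorer: Scorer, context: List[str], response: List[str]) -> List[float]:
    """Per-token log-probs of `response` given `context` (teacher forcing)."""
    prefix = list(context)
    log_probs = []
    for tok in response:
        log_probs.append(log_softmax(scorer(prefix))[tok])
        prefix.append(tok)
    return log_probs


def toy_policy(prefix: List[str]) -> Dict[str, float]:
    """Toy stand-in for the policy model (assumption, not the real model).

    It prefers token "B" once the hint token is present in the context.
    """
    if "<hint>" in prefix:
        return {"A": 0.0, "B": 2.0}
    return {"A": 1.0, "B": 1.0}


prompt = ["q"]
response = ["B", "B"]

# Student: policy model scoring its own response under the original context.
student_log_probs = score_response(toy_policy, prompt, response)

# Second option in question 2: the same policy weights, hint-enhanced context.
teacher_log_probs = score_response(toy_policy, prompt + ["<hint>"], response)

# The first option would instead pass a different scorer here,
# i.e. whatever model sits behind the PRM service.
```

If the second option is what the paper used, the teacher and student differ only in context; if the first is, they may also differ in model weights, which is why the PRM backbone matters for reproducing the results.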
