Michele Brienza1, Francesco Argenziano1, Vincenzo Suriani2, Domenico D. Bloisi3, Daniele Nardi1

1 Department of Computer, Control and Management Engineering, Sapienza University of Rome, Rome, Italy
2 School of Engineering, University of Basilicata, Potenza, Italy
3 International University of Rome UNINT, Rome, Italy
This project uses the G-PlanET dataset, which must be downloaded from Hugging Face:
Dataset: yuchenlin/G-PlanET
You can download the dataset using the Hugging Face `datasets` library:

```python
from datasets import load_dataset

dataset = load_dataset("yuchenlin/G-PlanET")
```

Or using the Hugging Face CLI:

```bash
huggingface-cli download yuchenlin/G-PlanET
```

The trails (ID and image) used in this project are available on the project website:

Website: https://lab-rococo-sapienza.github.io/map-vlm/
```bash
pip install -r requirements.txt
```

To evaluate the generated plans, you need to install the PG2S metric library:

```bash
pip install pg2s
```

Or install from source:

```bash
git clone https://github.com/Lab-RoCoCo-Sapienza/pg2s
cd pg2s
pip install .
```

The main script `test.py` processes JSONL entries and generates planning outputs using both table-based and vision-based agents.
```bash
python test.py
```

Arguments:

- `--jsonl`: Path to the input JSONL file (default: `example.jsonl`)
- `--limit`: Maximum number of records to process (default: `1`)
- `--id`: Process only the record with the matching ID (optional)
- `--image`: Path to the image file for vision planning (default: `4.jpg`)
- `--model-table`: OpenAI model for table planning (default: `gpt-4o`)
- `--model-vision`: OpenAI model for vision planning (default: `gpt-4o`)
- `--output-dir`: Directory to save the generated plans (default: `output_plans`)
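The record selection implied by `--jsonl`, `--limit`, and `--id` can be sketched as follows. This is a minimal illustration, not the actual `test.py` code, and the `id` field name in each JSONL entry is an assumption:

```python
import json

def select_records(jsonl_path, limit=1, record_id=None):
    """Read a JSONL file and return up to `limit` records, or only the
    record whose (assumed) 'id' field matches `record_id` when given."""
    selected = []
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if record_id is not None:
                # --id mode: return only the matching record
                if record.get("id") == record_id:
                    return [record]
                continue
            # --limit mode: stop once enough records are collected
            selected.append(record)
            if len(selected) >= limit:
                break
    return selected
```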
Process a single record:

```bash
python test.py --jsonl example.jsonl --limit 1
```

Process multiple records with a custom output directory:

```bash
python test.py --jsonl example.jsonl --limit 5 --output-dir ./results
```

Process a specific record by ID:

```bash
python test.py --jsonl example.jsonl --id "record_123"
```

For each processed record, the script creates a subdirectory `record_{id}` containing:
- `input_table.txt` - Input table in markdown format
- `single_agent_table.txt` - Plan generated by the single agent with table
- `multi_agent_table_env.txt` - Environment summary from the multi-agent with table
- `multi_agent_table_plan.txt` - Plan generated by the multi-agent with table
- `single_agent_vision.txt` - Plan generated by the single agent with vision
- `multi_agent_vision.txt` - Plan generated by the multi-agent with vision
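To post-process the results, the per-record output files can be gathered back into a dictionary. This is a sketch of one possible approach, assuming the `record_{id}` layout described above:

```python
from pathlib import Path

def collect_plans(output_dir="output_plans"):
    """Map each record ID to {file_stem: file_contents} for the .txt
    files saved in its record_{id} subdirectory."""
    results = {}
    for record_dir in sorted(Path(output_dir).glob("record_*")):
        record_id = record_dir.name[len("record_"):]
        results[record_id] = {
            f.stem: f.read_text() for f in record_dir.glob("*.txt")
        }
    return results
```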
This project uses the PG2S metric to evaluate the quality of generated plans.
```python
from pg2s.metric import pg2s_score

plans = {
    "task-1": {
        'truth': [
            'Turn around and walk to the sink.',
            'Take the left glass out of the sink.',
            'Turn around and walk to the microwave.',
            'Heat the glass in the microwave.',
            'Turn around and face the counter.',
            'Place the glass in the left top cabinet.'
        ],
        'predict': [
            'Walk to the sink.',
            'Pick up the glass from the sink.',
            'Go to the microwave.',
            'Heat the glass.',
            'Walk to the counter.',
            'Put the glass in the cabinet.'
        ]
    },
}

# Calculate the similarity score with a custom alpha value
# alpha controls the balance between goal-wise and sentence-wise similarity
score = pg2s_score(plans, alpha=0.7)
print(f"PG2S Score: {score}")
```

Parameters:

- `plans`: Dictionary containing tasks with ground truth and predicted action sequences
- `alpha`: Hyperparameter (default: 0.5) that balances:
  - Goal-wise similarity
  - Sentence-wise similarity
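As an illustration of how `alpha` weights the two components, the balancing can be sketched as a convex combination. This is not the PG2S implementation itself, only the weighting scheme it describes:

```python
def weighted_score(goal_similarity, sentence_similarity, alpha=0.5):
    """Blend the two similarity terms: alpha weights goal-wise
    similarity, (1 - alpha) weights sentence-wise similarity."""
    return alpha * goal_similarity + (1 - alpha) * sentence_similarity

# alpha = 1.0 -> only goal-wise similarity contributes
# alpha = 0.0 -> only sentence-wise similarity contributes
```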
For more information, see the PG2S repository.