
Autoresearch/mar27 #8

Draft

sandropapais wants to merge 143 commits into main from autoresearch/mar27

Conversation

@sandropapais
Contributor

No description provided.

plan_loss_up: L2=0.5902 (Δ-0.0009 vs baseline), obj_box_col=0.084% (Δ+0.004%)
Marginal L2 gain; plan loss weight increase not a strong lever.
exp002: motion loss 0.2→0.5 badly hurt both L2+col (discard)
exp004 staged: num_decoder 6→8 for richer instance features
queue_length=6: L2=0.5757 (best, Δ-0.0154 vs baseline)
obj_box_col=0.100% (slightly worse than baseline)
Stage exp005: queue6 + plan_loss_up combination
… revised

exp004 decoder=8: obj_box_col=0.074% (best), L2=0.5955 (slightly worse)
exp005 revised: queue=6 + decoder=8 combo (drop queue6+plan_loss_up)
exp005 queue6+decoder8: L2=0.5934, col=0.126% (negative synergy)
Needs >10 epochs to benefit from combined capacity.

Best results:
- L2: 0.5757 (exp003 queue_length=6, -2.6% vs baseline)
- col: 0.074% (exp004 num_decoder=8, -7.5% vs baseline)
Move research_log.md, results.tsv, docs/research_review.md → autoresearch/
…57), num_decoder=8 best col (0.074%)

- autoresearch/ folder: research_log.md, results.tsv, research_review.md
- scripts/dgx_run.sh: source ~/.bashrc fix for WANDB_API_KEY
- 5 new experiment configs in projects/configs/auto_mar25_*
queue_length=6 as proper named config (from autoresearch mar25 exp003,
best L2=0.5757)
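A minimal sketch of what promoting the ad-hoc override to a named config might look like, assuming an mmdet-style config with `_base_` inheritance; the base path and dict nesting are assumptions, not the repo's actual layout:

```python
# Hypothetical named config for the queue_length=6 variant. The base-config
# path and the model dict structure are assumptions about the repo's
# mmdet-style config layout, not taken from its actual files.
_base_ = ["./sparsedrive_r50_stage2_4gpu_bs24.py"]

queue_length = 6  # mar25 exp003: best L2 = 0.5757 (base config uses 4)

model = dict(
    head=dict(
        instance_queue=dict(queue_length=queue_length),
    ),
)
```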
…mar25 constraints

- baseline run (exp000) always submitted first, doesn't count against max-experiments
- sbatch now uses --export=ALL,WANDB_API_KEY=... to avoid WandB auth failure
- hard constraints: no motion loss increase, no rotation augmentation
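The baseline-first budgeting rule can be sketched as follows; `build_queue` and its arguments are illustrative names, not the actual submission script:

```python
# Sketch of the submission rule above: exp000 (the baseline) is always
# queued first and is excluded from the --max-experiments budget.
# The real script presumably wraps sbatch calls; names are illustrative.
def build_queue(experiments, max_experiments):
    baseline = [e for e in experiments if e == "exp000"]
    others = [e for e in experiments if e != "exp000"]
    return baseline + others[:max_experiments]
```

So with `max_experiments=2`, exp000 plus the first two other experiments are submitted.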
Run base config (nomap_queue6) as reproducible reference for mar26 session.
exp001: extended training 10→15 epochs (B2 bottleneck)
exp002: plan_loss_reg 1→2, plan_loss_cls 0.5→1 on queue=6 base
exp003: num_det 50→100 (more agents to planner)
exp004: confidence_decay 0.6→0.8 (slower forgetting)
exp005: combo of epochs15 + plan_loss_up
Baseline (nomap_queue6): L2=0.5927, obj_box_col=0.104%, NDS=0.5236
epochs 10→15: L2=0.5738 (-0.019), obj_box_col=0.099% (-0.005pp) — new best both metrics
plan_loss_up: obj_box_col=0.094% (new best), L2=0.5904 (behind exp001's 0.5738)
num_det 50→100: L2=0.5676 (new best), obj_box_col=0.091% (new best)
exp005 updated: epochs15 + plan_loss_up + num_det=100 (all confirmed wins)
confidence_decay 0.6→0.8: FAF +18, obj_box_col 0.120% (worse than baseline 0.104%)
More false alarms from stale instances confuse the planner.
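The mechanism implied by the confidence_decay knob can be sketched as a per-frame multiplicative update; function and variable names here are assumptions, not the repo's code:

```python
# Illustrative per-frame confidence update: cached instance scores are
# multiplied by the decay factor each frame, so raising it from 0.6 to 0.8
# keeps stale instances alive longer. Names are assumptions.
def decay_confidences(confidences, confidence_decay):
    return [c * confidence_decay for c in confidences]
```

After one frame, a 1.0-confidence cached instance drops to 0.6 at the old setting but only to 0.8 at the new one, consistent with the observed rise in false alarms.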
exp005 (epochs15+plan_up+det100): L2=0.6109, col=0.107% — negative synergy discard

Session best: exp003 num_det=100 — L2=0.5676, col=0.091% (both new bests)

research_review: add mar26 findings section, update confirmed wins,
negative evidence, best recipe, next experiments, open questions
Base: sparsedrive_r50_stage2_4gpu_bs24.py (WITH map, queue=4)
exp001: num_det 50→100
exp002: queue_length 4→6
exp003: nomap + plan_loss_up combo
exp004: epochs 10→15
exp005: epochs15 + num_det=100
Baseline (sparsedrive_r50_stage2_4gpu_bs24, with map): L2=0.6274, obj_box_col=0.107%, NDS=0.5233, mAP_normal=0.5508
num_det=100: L2=0.6159 (-1.8%), col=0.091% (-15%), IDS 990→577 (-42%)
queue_length=6 hurts with map head active: col worsens 0.091%→0.147%, IDS stays at 995.
Hard constraint: do NOT increase queue_length when map head is active.
Skip cross_gnn attention layers when map_output is None (with_map=False).
The bs24 base config includes cross_gnn in operation_order; overriding
with_map=False disables map output but the operation_order still references
map_instance_feature_selected, causing UnboundLocalError at training start.
… config
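A minimal sketch of the guard, assuming a decoder loop that dispatches on operation_order; the loop body and names are illustrative, not the actual SparseDrive implementation. Without the early `continue`, the cross_gnn branch would reference `map_instance_feature_selected` before it was ever assigned, reproducing the UnboundLocalError:

```python
# Illustrative decoder loop with the cross_gnn guard described above.
def run_ops(operation_order, instance_feature, map_output=None):
    for op in operation_order:
        if op == "cross_gnn":
            if map_output is None:  # with_map=False: nothing to attend to
                continue
            map_instance_feature_selected = map_output["features"]
            instance_feature = instance_feature + map_instance_feature_selected
        # ... other ops elided ...
    return instance_feature
```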

Skipping cross_gnn when map_output is None leaves map head params without
gradients in DDP, causing reduction error. find_unused_parameters=True fixes.
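In mmdet-style training this is usually a one-line config change that the runner forwards to `torch.nn.parallel.DistributedDataParallel`; the top-level placement is an assumption about this repo's setup:

```python
# Forwarded by the training runner to DistributedDataParallel. Needed
# because the skipped cross_gnn op leaves the map head's parameters
# without gradients, which the DDP reducer otherwise treats as an error.
find_unused_parameters = True
```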
nomap via with_map=False override on bs24 base is DDP-incompatible.
Map head module stays registered with params; neither find_unused_parameters
nor cross_gnn guard resolves PyTorch 1.13 mark-ready-twice error.
Config updated to plan_loss_up alone (no nomap) for resubmission.
plan_loss_up alone: L2=0.6223 (beats baseline), col=0.110% (slightly worse than baseline).
num_det=100 remains the strongest single lever. Moving to epochs=15.
epochs=15 hurts on bs24 with-map: L2=0.635, col=0.143% (worse than baseline).
Map head gradient competition amplifies with more training.
Hard constraint added. exp005 changed to num_det=100 + plan_loss_up combo.
exp005 (num_det=100 + plan_loss_up): L2=0.650, col=0.125% — negative synergy,
worst in session. Best result remains exp001 (num_det=100): L2=0.616, col=0.091%.

Key findings:
- bs24 with-map is brittle: only num_det=100 reliably improves it
- queue=6, epochs=15, plan_loss_up, and combinations all degrade performance
- with_map=False override is DDP-incompatible (PyTorch 1.13)
- num_det=100 confirmed universal win across 2 configs

Updated research_review: new 3.13 section, updated confirmed wins/negative
evidence tables, updated best recipe and next experiments.