@@ -0,0 +1,36 @@
# No-looping ablation on the SP8192 SOTA stack (5 shards, 1×H100 screening) — Non-record submission

This folder captures the **“no looping”** ablation used during grant screening: run the current SP8192 SOTA training stack with **layer looping disabled**.

- **Track**: non-record (screening / grant experiments)
- **Base trainer**: `records/track_10min_16mb/2026-04-09_SP8192_3LayerRecur_ParResid_QK525_LegalTTT/train_gpt.py`
- **Ablation**: set `NUM_LOOPS=0` (disables depth recurrence / looping)
- **Budget**: `MAX_WALLCLOCK_SECONDS=600`
- **Train shards**: 5

## Results (3-seed)

Metric notes:
- **Pre-quantization post-EMA** isolates model quality before export.
- **`quantized_sliding_window`** is the post-quant sliding-window BPB reported by this trainer.

| Seed | Steps @ cap | Pre-quant post-EMA `val_bpb` | `quantized_sliding_window val_bpb` | Total submission size (quantized+brotli, bytes) |
|------|-------------|------------------------------:|-----------------------------------:|-----------------------------------------:|
| 0 | 658 | 1.327667 | 1.317445 | 16,033,831 |
| 42 | 724 | 1.291048 | 1.280317 | 16,034,416 |
| 1337 | 724 | 1.289564 | 1.278652 | 16,034,548 |
| **Mean** | | **1.302760** | **1.292138** | **16,034,265** |
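The `val_loss` (nats per token) and `val_bpb` columns in the per-seed logs are consistent with a fixed average bytes-per-token ratio on the validation set. A minimal sanity check, assuming the common `bpb = val_loss / (ln 2 × bytes_per_token)` conversion (the trainer's exact formula is not stated in this README):

```python
import math

def implied_bytes_per_token(val_loss_nats: float, val_bpb: float) -> float:
    # Assumes bpb = val_loss / (ln(2) * bytes_per_token); back out the ratio.
    return val_loss_nats / (val_bpb * math.log(2))

# Seed-0 quantized_sliding_window figures from the log in this submission.
ratio = implied_bytes_per_token(3.40309696, 1.31744488)
```

All three seeds imply roughly 3.73 bytes per token, which makes this a quick way to catch a mis-scaled BPB number.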

## How to run

```bash
cd records/track_non_record_16mb/2026-04-21_NoLooping_SOTAStack_5Shards_1xH100
SEED=1337 RUN_ID=no_looping_1337 MAX_WALLCLOCK_SECONDS=600 NUM_LOOPS=0 \
torchrun --standalone --nproc_per_node=1 train_gpt.py
```

## Notes

- This submission uses a **thin launcher** that sets `NUM_LOOPS=0` and then executes the base record trainer.
- Training/eval dependencies should match the base record (FlashAttention 3, etc.).

@@ -0,0 +1,8 @@
# Match the base record's runtime deps.
# (torch + flash-attn-3 are assumed to be installed separately)
brotli
huggingface-hub
numpy
sentencepiece
tqdm

@@ -0,0 +1,11 @@
{
"track": "non_record_16mb",
"date": "2026-04-21",
"name": "No-looping ablation on SP8192 SOTA stack (5 shards, 1×H100 screening)",
"author": "Gautam Naik",
"github_id": "gautamnaik",
"val_bpb": 1.2921,
"val_loss": 3.3511,
"bytes_total": 16034265
}
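The summary fields above can be cross-checked against the three per-seed logs. A small sketch (values copied from the logs) confirming that `val_bpb` is the rounded 3-seed mean of `quantized_sliding_window val_bpb` and `bytes_total` the mean submission size:

```python
# Per-seed quantized_sliding_window val_bpb and quantized+brotli sizes
# (seeds 0, 42, 1337), copied from the logs in this folder.
sw_bpb = [1.31744488, 1.28031748, 1.27865192]
sizes = [16_033_831, 16_034_416, 16_034_548]

val_bpb = round(sum(sw_bpb) / len(sw_bpb), 4)
bytes_total = sum(sizes) // len(sizes)
```

(`val_loss` is not checked here, since its aggregation is not stated in this submission.)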

@@ -0,0 +1,24 @@
import os
import runpy
from pathlib import Path


def main() -> None:
    # Disable looping (depth recurrence) for the base SOTA stack.
    os.environ.setdefault("NUM_LOOPS", "0")

    # Resolve the base record trainer relative to the repo root.
    repo_root = Path(__file__).resolve().parents[3]
    base_trainer = (
        repo_root
        / "records"
        / "track_10min_16mb"
        / "2026-04-09_SP8192_3LayerRecur_ParResid_QK525_LegalTTT"
        / "train_gpt.py"
    )

    # Execute the base trainer as if it were invoked directly.
    runpy.run_path(str(base_trainer), run_name="__main__")


if __name__ == "__main__":
    main()
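Because the launcher uses `os.environ.setdefault`, a `NUM_LOOPS` value already exported on the command line takes precedence over the launcher's `"0"` default. A minimal sketch of that precedence:

```python
import os

# No value exported: setdefault installs the launcher's "0".
os.environ.pop("NUM_LOOPS", None)
os.environ.setdefault("NUM_LOOPS", "0")
default_val = os.environ["NUM_LOOPS"]

# Value already exported (e.g. `NUM_LOOPS=0` on the torchrun line): it is kept.
os.environ["NUM_LOOPS"] = "0"
os.environ.setdefault("NUM_LOOPS", "1")
user_val = os.environ["NUM_LOOPS"]
```

This is why the explicit `NUM_LOOPS=0` in the run command is redundant with the launcher but harmless.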

@@ -0,0 +1,14 @@
train_shards: 5
val_tokens: 40540160
model_params:35941464
gptq:reserving 12s, effective=588000ms
0/20000 val_loss: 9.0072 val_bpb: 3.4870
500/20000 train_loss: 3.2386 train_time: 7.7m tok/s: 854791
658/20000 val_loss: 3.1480 val_bpb: 1.2187
stopping_early: wallclock_cap train_time: 588681ms step: 658/20000
ema:applying EMA weights
pre-quantization post-ema val_loss:3.42950098 val_bpb:1.32766670 eval_time:15858ms
Total submission size quantized+brotli: 16033831 bytes
quantized val_loss:3.44292430 val_bpb:1.33286329 eval_time:17376ms
quantized_sliding_window val_loss:3.40309696 val_bpb:1.31744488 eval_time:554037ms

@@ -0,0 +1,14 @@
train_shards: 5
val_tokens: 40540160
model_params:35941464
gptq:reserving 12s, effective=588000ms
0/20000 val_loss: 9.0047 val_bpb: 3.4860
500/20000 train_loss: 3.2483 train_time: 6.8m tok/s: 967384
724/20000 val_loss: 3.1174 val_bpb: 1.2068
stopping_early: wallclock_cap train_time: 588064ms step: 724/20000
ema:applying EMA weights
pre-quantization post-ema val_loss:3.33107887 val_bpb:1.28956444 eval_time:15774ms
Total submission size quantized+brotli: 16034548 bytes
quantized val_loss:3.34474578 val_bpb:1.29485532 eval_time:17382ms
quantized_sliding_window val_loss:3.30289072 val_bpb:1.27865192 eval_time:542194ms

@@ -0,0 +1,14 @@
train_shards: 5
val_tokens: 40540160
model_params:35941464
gptq:reserving 12s, effective=588000ms
0/20000 val_loss: 9.0090 val_bpb: 3.4877
500/20000 train_loss: 3.2470 train_time: 6.8m tok/s: 966857
724/20000 val_loss: 3.1179 val_bpb: 1.2070
stopping_early: wallclock_cap train_time: 588403ms step: 724/20000
ema:applying EMA weights
pre-quantization post-ema val_loss:3.33491082 val_bpb:1.29104791 eval_time:15899ms
Total submission size quantized+brotli: 16034416 bytes
quantized val_loss:3.34793848 val_bpb:1.29609132 eval_time:17726ms
quantized_sliding_window val_loss:3.30719303 val_bpb:1.28031748 eval_time:544877ms