Mandeep Sidhu committed on
Commit ·
2f2776e
1 Parent(s): 3550904
Add five-seed TinyStories streaming validation
- docs/plan.md +40 -36
- docs/streaming_multiseed_validation_report.md +57 -41
- runs/streaming_tinystories_multiseed_validation_l12/combined_5seed_summary/condition_summary.csv +7 -0
- runs/streaming_tinystories_multiseed_validation_l12/combined_5seed_summary/paired_final_deltas.csv +31 -0
- runs/streaming_tinystories_multiseed_validation_l12/combined_5seed_summary/stage_summary.csv +25 -0
- runs/streaming_tinystories_multiseed_validation_l12/locked_stream/20260530-141335/RESULT_SUMMARY.md +63 -0
- runs/streaming_tinystories_multiseed_validation_l12/locked_stream/20260530-141335/config.json +155 -0
- runs/streaming_tinystories_multiseed_validation_l12/locked_stream/20260530-141335/metrics.jsonl +48 -0
- runs/streaming_tinystories_multiseed_validation_l12/locked_stream/20260530-141335/summary.csv +25 -0
- runs/streaming_tinystories_multiseed_validation_l12/locked_stream/20260530-141335/summary.json +530 -0
- runs/streaming_tinystories_multiseed_validation_l12/locked_stream/20260530-141335/trace.jsonl +96 -0
- scripts/summarize_streaming_multiseed.py +44 -11
docs/plan.md
CHANGED
|
@@ -283,8 +283,9 @@ Use this order for every regime.
|
|
| 283 |
|---|---|---|
|
| 284 |
| original/local saved regime | offline backtest complete | retrospective support for interaction pressure law; do not rerun unless necessary |
|
| 285 |
| TinyStories static/coefficient regime | active | main coefficient evidence |
|
| 286 |
-
| TinyStories streaming regime |
|
| 287 |
-
|
|
|
|
|
| 288 |
|
| 289 |
## Current Formula Status
|
| 290 |
|
|
@@ -331,24 +332,31 @@ structure transfers, while coefficients may be regime-specific.
|
|
| 331 |
| TinyStories static optima | interaction form fits static dropout optima better than base ABC |
|
| 332 |
| TinyStories held-out prefix | supports pressure dependence on unique tokens |
|
| 333 |
| TinyStories held-out model | supports pressure dependence on model size |
|
| 334 |
-
| TinyStories streaming,
|
| 335 |
| cross-regime raw coefficient transfer | weaker than within-regime fit; supports regime-specific coefficients rather than universal numeric coefficients |
|
| 336 |
|
| 337 |
-
Latest TinyStories
|
| 338 |
|
| 339 |
| Condition | Mean final 4M validation loss | Std |
|
| 340 |
|---|---:|---:|
|
| 341 |
-
| `interaction` decay | 2.
|
| 342 |
-
| `smooth_low` decay | 2.
|
| 343 |
-
| `baseabc` decay | 2.
|
| 344 |
-
| static `0.08` | 2.
|
| 345 |
-
| static `0.12` | 2.
|
| 346 |
-
| static `0.18` | 2.
|
| 347 |
-
|
| 348 |
-
|
| 349 |
-
|
| 350 |
-
|
| 351 |
-
|
| 352 |
|
| 353 |
```text
|
| 354 |
Formula-derived dropout schedules track the moving useful dropout region and
|
|
@@ -361,8 +369,8 @@ The stronger claim:
|
|
| 361 |
Formula-derived dropout decay beats the best static dropout.
|
| 362 |
```
|
| 363 |
|
| 364 |
-
is supported at `n=
|
| 365 |
-
|
| 366 |
|
| 367 |
## Completed Static Backtest Gate
|
| 368 |
|
|
@@ -386,26 +394,21 @@ streaming multi-seed reports for each regime.
|
|
| 386 |
|
| 387 |
## Immediate Next Action
|
| 388 |
|
| 389 |
-
|
| 390 |
-
|
| 391 |
-
|
| 392 |
-
confirmation is needed.
|
| 393 |
|
| 394 |
## Next Training After Current Gate
|
| 395 |
|
| 396 |
-
|
| 397 |
-
|
|
|
|
|
|
|
| 398 |
|
| 399 |
```text
|
| 400 |
-
|
| 401 |
-
|
| 402 |
-
|
| 403 |
-
interaction decay
|
| 404 |
-
baseabc decay
|
| 405 |
-
smooth_low decay
|
| 406 |
-
static_dropout_0.08
|
| 407 |
-
static_dropout_0.12
|
| 408 |
-
static_dropout_0.18
|
| 409 |
```
|
| 410 |
|
| 411 |
Evaluate with paired seed comparisons:
|
|
@@ -418,13 +421,14 @@ decay minus best-static delta per seed
|
|
| 418 |
rank consistency across seeds
|
| 419 |
```
|
| 420 |
|
| 421 |
-
If decay wins across paired seeds, promote the
|
| 422 |
-
|
| 423 |
-
|
|
|
|
| 424 |
|
| 425 |
Latest streaming report:
|
| 426 |
|
| 427 |
```text
|
| 428 |
docs/streaming_multiseed_validation_report.md
|
| 429 |
-
runs/streaming_tinystories_multiseed_validation_l12/
|
| 430 |
```
|
|
|
|
| 283 |
|---|---|---|
|
| 284 |
| original/local saved regime | offline backtest complete | retrospective support for interaction pressure law; do not rerun unless necessary |
|
| 285 |
| TinyStories static/coefficient regime | active | main coefficient evidence |
|
| 286 |
+
| TinyStories streaming regime | 5-seed validation complete | current main streaming evidence; interaction decay beats best static in 5/5 paired final-loss comparisons |
|
| 287 |
+
| original/local streaming regime | pending report | summarize saved streaming runs before launching any additional training |
|
| 288 |
+
| next new streaming regime | pending | start only after TinyStories and original/local streaming reports are reconciled |
|
| 289 |
|
| 290 |
## Current Formula Status
|
| 291 |
|
|
|
|
| 332 |
| TinyStories static optima | interaction form fits static dropout optima better than base ABC |
|
| 333 |
| TinyStories held-out prefix | supports pressure dependence on unique tokens |
|
| 334 |
| TinyStories held-out model | supports pressure dependence on model size |
|
| 335 |
+
| TinyStories streaming, 5 seeds | interaction has best mean final loss; interaction beats best static in 5/5 paired final-loss comparisons |
|
| 336 |
| cross-regime raw coefficient transfer | weaker than within-regime fit; supports regime-specific coefficients rather than universal numeric coefficients |
|
| 337 |
|
| 338 |
+
Latest TinyStories 5-seed streaming final-loss table:
|
| 339 |
|
| 340 |
| Condition | Mean final 4M validation loss | Std |
|
| 341 |
|---|---:|---:|
|
| 342 |
+
| `interaction` decay | 2.5311 | 0.0213 |
|
| 343 |
+
| `smooth_low` decay | 2.5321 | 0.0203 |
|
| 344 |
+
| `baseabc` decay | 2.5357 | 0.0175 |
|
| 345 |
+
| static `0.08` | 2.5444 | 0.0211 |
|
| 346 |
+
| static `0.12` | 2.5477 | 0.0178 |
|
| 347 |
+
| static `0.18` | 2.5644 | 0.0182 |
|
| 348 |
+
|
| 349 |
+
Paired final-loss result:
|
| 350 |
+
|
| 351 |
+
| Decay schedule | Paired wins vs best static |
|
| 352 |
+
|---|---:|
|
| 353 |
+
| `interaction` | 5/5 |
|
| 354 |
+
| `baseabc` | 5/5 |
|
| 355 |
+
| `smooth_low` | 4/5, with the one miss only `+0.0003` |
|
| 356 |
+
|
| 357 |
+
The immediate risk is no longer TinyStories seed count. The main remaining risk
|
| 358 |
+
is external validity: the current strongest streaming result is one corpus and
|
| 359 |
+
one narrowed model/optimizer regime. The current defensible claim is:
|
| 360 |
|
| 361 |
```text
|
| 362 |
Formula-derived dropout schedules track the moving useful dropout region and
|
|
|
|
| 369 |
Formula-derived dropout decay beats the best static dropout.
|
| 370 |
```
|
| 371 |
|
| 372 |
+
is supported at `n=5` for this TinyStories setup, with interaction decay
|
| 373 |
+
beating the per-seed best static baseline in all five seeds.
|
| 374 |
|
| 375 |
## Completed Static Backtest Gate
|
| 376 |
|
|
|
|
| 394 |
|
| 395 |
## Immediate Next Action
|
| 396 |
|
| 397 |
+
Build the original/local streaming report from saved runs. Do not launch a
|
| 398 |
+
broad new regime sweep until the previous/local report is reconciled against
|
| 399 |
+
the TinyStories five-seed result.
|
|
|
|
| 400 |
|
| 401 |
## Next Training After Current Gate
|
| 402 |
|
| 403 |
+
No MPS training should launch before the previous/local streaming report is
|
| 404 |
+
generated from existing saved runs. If that report lacks enough coverage for a
|
| 405 |
+
clean claim, the next MPS run should be narrowly scoped to only the missing
|
| 406 |
+
conditions or seeds:
|
| 407 |
|
| 408 |
```text
|
| 409 |
+
preferred first step: no training, saved-run report only
|
| 410 |
+
possible follow-up: fill missing previous/local streaming cells
|
| 411 |
+
avoid: broad new regime sweep before the report audit
|
| 412 |
```
|
| 413 |
|
| 414 |
Evaluate with paired seed comparisons:
|
|
|
|
| 421 |
rank consistency across seeds
|
| 422 |
```
|
| 423 |
|
| 424 |
+
If previous/local decay wins across paired seeds, promote the cross-regime
|
| 425 |
+
streaming claim. If it ties, claim competitive automatic scheduling rather than
|
| 426 |
+
superiority outside TinyStories. If it loses, fit a streaming-specific
|
| 427 |
+
correction offline before launching any broader experiment.
|
| 428 |
|
| 429 |
Latest streaming report:
|
| 430 |
|
| 431 |
```text
|
| 432 |
docs/streaming_multiseed_validation_report.md
|
| 433 |
+
runs/streaming_tinystories_multiseed_validation_l12/combined_5seed_summary/
|
| 434 |
```
|
docs/streaming_multiseed_validation_report.md
CHANGED
|
@@ -2,25 +2,26 @@
|
|
| 2 |
|
| 3 |
Date: 2026-05-30
|
| 4 |
|
| 5 |
-
This report combines
|
| 6 |
-
|
| 7 |
-
|
| 8 |
|
| 9 |
## Sources
|
| 10 |
|
| 11 |
- `runs/streaming_tinystories_interaction_schedule_l12/locked_stream/20260530-053831/metrics.jsonl`
|
| 12 |
- `runs/streaming_tinystories_multiseed_validation_l12/locked_stream/20260530-111523/metrics.jsonl`
|
|
|
|
| 13 |
|
| 14 |
## Condition Ranking By Final Loss
|
| 15 |
|
| 16 |
| Condition | Kind | N | Mean trajectory val | Std trajectory val | Mean final val | Std final val | Mean final gap | Dropout path |
|
| 17 |
|---|---|---:|---:|---:|---:|---:|---:|---|
|
| 18 |
-
| `interaction` | `anchor_decay` |
|
| 19 |
-
| `smooth_low` | `decay` |
|
| 20 |
-
| `baseabc` | `anchor_decay` |
|
| 21 |
-
| `static_dropout_0.08` | `static` |
|
| 22 |
-
| `static_dropout_0.12` | `static` |
|
| 23 |
-
| `static_dropout_0.18` | `static` |
|
| 24 |
|
| 25 |
## Paired Final-Loss Deltas
|
| 26 |
|
|
@@ -47,45 +48,60 @@ baseline for that seed.
|
|
| 47 |
| 3 | `static_dropout_0.08` | 2.5478 | `static_dropout_0.08` | 2.5478 | +0.0000 |
|
| 48 |
| 3 | `static_dropout_0.12` | 2.5510 | `static_dropout_0.08` | 2.5478 | +0.0033 |
|
| 49 |
| 3 | `static_dropout_0.18` | 2.5667 | `static_dropout_0.08` | 2.5478 | +0.0189 |
|
| 50 |
|
| 51 |
## Stage Trajectory
|
| 52 |
|
| 53 |
| Stage | Prefix tokens | Condition | Dropout | N | Mean val | Std val | Mean train | Mean gap |
|
| 54 |
|---:|---:|---|---:|---:|---:|---:|---:|---:|
|
| 55 |
-
| 0 | 500,000 | `static_dropout_0.12` | 0.120 |
|
| 56 |
-
| 0 | 500,000 | `
|
| 57 |
-
| 0 | 500,000 | `
|
| 58 |
-
| 0 | 500,000 | `interaction` | 0.184 |
|
| 59 |
-
| 0 | 500,000 | `static_dropout_0.
|
| 60 |
-
| 0 | 500,000 | `baseabc` | 0.251 |
|
| 61 |
-
| 1 | 1,000,000 | `
|
| 62 |
-
| 1 | 1,000,000 | `
|
| 63 |
-
| 1 | 1,000,000 | `
|
| 64 |
-
| 1 | 1,000,000 | `static_dropout_0.18` | 0.180 |
|
| 65 |
-
| 1 | 1,000,000 | `baseabc` | 0.186 |
|
| 66 |
-
| 1 | 1,000,000 | `static_dropout_0.08` | 0.080 |
|
| 67 |
-
| 2 | 2,000,000 | `interaction` | 0.084 |
|
| 68 |
-
| 2 | 2,000,000 | `smooth_low` | 0.067 |
|
| 69 |
-
| 2 | 2,000,000 | `
|
| 70 |
-
| 2 | 2,000,000 | `
|
| 71 |
-
| 2 | 2,000,000 | `static_dropout_0.
|
| 72 |
-
| 2 | 2,000,000 | `static_dropout_0.
|
| 73 |
-
| 3 | 4,000,000 | `interaction` | 0.045 |
|
| 74 |
-
| 3 | 4,000,000 | `smooth_low` | 0.045 |
|
| 75 |
-
| 3 | 4,000,000 | `baseabc` | 0.020 |
|
| 76 |
-
| 3 | 4,000,000 | `static_dropout_0.08` | 0.080 |
|
| 77 |
-
| 3 | 4,000,000 | `static_dropout_0.12` | 0.120 |
|
| 78 |
-
| 3 | 4,000,000 | `static_dropout_0.18` | 0.180 |
|
| 79 |
|
| 80 |
## Interpretation
|
| 81 |
|
| 82 |
-
- `interaction` has the best
|
| 83 |
-
-
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
|
|
|
|
|
|
|
|
|
|
| 87 |
- Static `0.12` can win early stages, but holding it fixed loses at the
|
| 88 |
final 4M stage.
|
| 89 |
-
- This
|
| 90 |
-
|
| 91 |
-
|
|
|
|
| 2 |
|
| 3 |
Date: 2026-05-30
|
| 4 |
|
| 5 |
+
This report combines 5 random seeds (1, 2, 3, 4, 5) from saved streaming runs.
|
| 6 |
+
No additional training is performed by this script; it reads saved
|
| 7 |
+
`metrics.jsonl` files.
|
| 8 |
|
| 9 |
## Sources
|
| 10 |
|
| 11 |
- `runs/streaming_tinystories_interaction_schedule_l12/locked_stream/20260530-053831/metrics.jsonl`
|
| 12 |
- `runs/streaming_tinystories_multiseed_validation_l12/locked_stream/20260530-111523/metrics.jsonl`
|
| 13 |
+
- `runs/streaming_tinystories_multiseed_validation_l12/locked_stream/20260530-141335/metrics.jsonl`
|
| 14 |
|
| 15 |
## Condition Ranking By Final Loss
|
| 16 |
|
| 17 |
| Condition | Kind | N | Mean trajectory val | Std trajectory val | Mean final val | Std final val | Mean final gap | Dropout path |
|
| 18 |
|---|---|---:|---:|---:|---:|---:|---:|---|
|
| 19 |
+
| `interaction` | `anchor_decay` | 5 | 2.8309 | 0.0068 | 2.5311 | 0.0213 | 0.2626 | `0.18 -> 0.14 -> 0.08 -> 0.04` |
|
| 20 |
+
| `smooth_low` | `decay` | 5 | 2.8307 | 0.0069 | 2.5321 | 0.0203 | 0.2607 | `0.16 -> 0.11 -> 0.07 -> 0.05` |
|
| 21 |
+
| `baseabc` | `anchor_decay` | 5 | 2.8474 | 0.0028 | 2.5357 | 0.0175 | 0.2655 | `0.25 -> 0.19 -> 0.10 -> 0.02` |
|
| 22 |
+
| `static_dropout_0.08` | `static` | 5 | 2.8434 | 0.0072 | 2.5444 | 0.0211 | 0.2593 | `0.08 -> 0.08 -> 0.08 -> 0.08` |
|
| 23 |
+
| `static_dropout_0.12` | `static` | 5 | 2.8357 | 0.0061 | 2.5477 | 0.0178 | 0.2269 | `0.12 -> 0.12 -> 0.12 -> 0.12` |
|
| 24 |
+
| `static_dropout_0.18` | `static` | 5 | 2.8461 | 0.0047 | 2.5644 | 0.0182 | 0.2035 | `0.18 -> 0.18 -> 0.18 -> 0.18` |
|
| 25 |
|
| 26 |
## Paired Final-Loss Deltas
|
| 27 |
|
|
|
|
| 48 |
| 3 | `static_dropout_0.08` | 2.5478 | `static_dropout_0.08` | 2.5478 | +0.0000 |
|
| 49 |
| 3 | `static_dropout_0.12` | 2.5510 | `static_dropout_0.08` | 2.5478 | +0.0033 |
|
| 50 |
| 3 | `static_dropout_0.18` | 2.5667 | `static_dropout_0.08` | 2.5478 | +0.0189 |
|
| 51 |
+
| 4 | `interaction` | 2.4932 | `static_dropout_0.08` | 2.5098 | -0.0166 |
|
| 52 |
+
| 4 | `baseabc` | 2.5049 | `static_dropout_0.08` | 2.5098 | -0.0049 |
|
| 53 |
+
| 4 | `smooth_low` | 2.4959 | `static_dropout_0.08` | 2.5098 | -0.0139 |
|
| 54 |
+
| 4 | `static_dropout_0.08` | 2.5098 | `static_dropout_0.08` | 2.5098 | +0.0000 |
|
| 55 |
+
| 4 | `static_dropout_0.12` | 2.5166 | `static_dropout_0.08` | 2.5098 | +0.0068 |
|
| 56 |
+
| 4 | `static_dropout_0.18` | 2.5343 | `static_dropout_0.08` | 2.5098 | +0.0244 |
|
| 57 |
+
| 5 | `interaction` | 2.5447 | `static_dropout_0.08` | 2.5588 | -0.0141 |
|
| 58 |
+
| 5 | `baseabc` | 2.5481 | `static_dropout_0.08` | 2.5588 | -0.0107 |
|
| 59 |
+
| 5 | `smooth_low` | 2.5428 | `static_dropout_0.08` | 2.5588 | -0.0159 |
|
| 60 |
+
| 5 | `static_dropout_0.08` | 2.5588 | `static_dropout_0.08` | 2.5588 | +0.0000 |
|
| 61 |
+
| 5 | `static_dropout_0.12` | 2.5595 | `static_dropout_0.08` | 2.5588 | +0.0008 |
|
| 62 |
+
| 5 | `static_dropout_0.18` | 2.5806 | `static_dropout_0.08` | 2.5588 | +0.0218 |
|
| 63 |
|
| 64 |
## Stage Trajectory
|
| 65 |
|
| 66 |
| Stage | Prefix tokens | Condition | Dropout | N | Mean val | Std val | Mean train | Mean gap |
|
| 67 |
|---:|---:|---|---:|---:|---:|---:|---:|---:|
|
| 68 |
+
| 0 | 500,000 | `static_dropout_0.12` | 0.120 | 5 | 3.2226 | 0.0143 | 2.6968 | 0.5257 |
|
| 69 |
+
| 0 | 500,000 | `smooth_low` | 0.162 | 5 | 3.2287 | 0.0122 | 2.7909 | 0.4377 |
|
| 70 |
+
| 0 | 500,000 | `static_dropout_0.08` | 0.080 | 5 | 3.2304 | 0.0102 | 2.6173 | 0.6131 |
|
| 71 |
+
| 0 | 500,000 | `interaction` | 0.184 | 5 | 3.2326 | 0.0123 | 2.8108 | 0.4218 |
|
| 72 |
+
| 0 | 500,000 | `static_dropout_0.18` | 0.180 | 5 | 3.2349 | 0.0151 | 2.8056 | 0.4293 |
|
| 73 |
+
| 0 | 500,000 | `baseabc` | 0.251 | 5 | 3.2728 | 0.0102 | 2.9139 | 0.3588 |
|
| 74 |
+
| 1 | 1,000,000 | `interaction` | 0.141 | 5 | 2.8908 | 0.0027 | 2.4842 | 0.4065 |
|
| 75 |
+
| 1 | 1,000,000 | `smooth_low` | 0.115 | 5 | 2.8912 | 0.0018 | 2.4678 | 0.4234 |
|
| 76 |
+
| 1 | 1,000,000 | `static_dropout_0.12` | 0.120 | 5 | 2.8930 | 0.0121 | 2.4335 | 0.4595 |
|
| 77 |
+
| 1 | 1,000,000 | `static_dropout_0.18` | 0.180 | 5 | 2.8990 | 0.0106 | 2.5397 | 0.3593 |
|
| 78 |
+
| 1 | 1,000,000 | `baseabc` | 0.186 | 5 | 2.9041 | 0.0037 | 2.5659 | 0.3382 |
|
| 79 |
+
| 1 | 1,000,000 | `static_dropout_0.08` | 0.080 | 5 | 2.9132 | 0.0068 | 2.3531 | 0.5601 |
|
| 80 |
+
| 2 | 2,000,000 | `interaction` | 0.084 | 5 | 2.6690 | 0.0207 | 2.3392 | 0.3298 |
|
| 81 |
+
| 2 | 2,000,000 | `smooth_low` | 0.067 | 5 | 2.6708 | 0.0218 | 2.3360 | 0.3347 |
|
| 82 |
+
| 2 | 2,000,000 | `baseabc` | 0.105 | 5 | 2.6770 | 0.0186 | 2.3938 | 0.2833 |
|
| 83 |
+
| 2 | 2,000,000 | `static_dropout_0.12` | 0.120 | 5 | 2.6795 | 0.0163 | 2.3697 | 0.3098 |
|
| 84 |
+
| 2 | 2,000,000 | `static_dropout_0.08` | 0.080 | 5 | 2.6856 | 0.0161 | 2.3109 | 0.3747 |
|
| 85 |
+
| 2 | 2,000,000 | `static_dropout_0.18` | 0.180 | 5 | 2.6860 | 0.0159 | 2.4347 | 0.2513 |
|
| 86 |
+
| 3 | 4,000,000 | `interaction` | 0.045 | 5 | 2.5311 | 0.0213 | 2.2685 | 0.2626 |
|
| 87 |
+
| 3 | 4,000,000 | `smooth_low` | 0.045 | 5 | 2.5321 | 0.0203 | 2.2713 | 0.2607 |
|
| 88 |
+
| 3 | 4,000,000 | `baseabc` | 0.020 | 5 | 2.5357 | 0.0175 | 2.2702 | 0.2655 |
|
| 89 |
+
| 3 | 4,000,000 | `static_dropout_0.08` | 0.080 | 5 | 2.5444 | 0.0211 | 2.2851 | 0.2593 |
|
| 90 |
+
| 3 | 4,000,000 | `static_dropout_0.12` | 0.120 | 5 | 2.5477 | 0.0178 | 2.3208 | 0.2269 |
|
| 91 |
+
| 3 | 4,000,000 | `static_dropout_0.18` | 0.180 | 5 | 2.5644 | 0.0182 | 2.3609 | 0.2035 |
|
| 92 |
|
| 93 |
## Interpretation
|
| 94 |
|
| 95 |
+
- `interaction` has the best 5-seed mean final validation loss: 2.5311 +/- 0.0213.
|
| 96 |
+
- The best static baseline by mean final loss is `static_dropout_0.08` at 2.5444 +/- 0.0211.
|
| 97 |
+
- `smooth_low` is very close to `interaction`, suggesting the exact anchor
|
| 98 |
+
values may not be uniquely required as long as the schedule follows the
|
| 99 |
+
same pressure range.
|
| 100 |
+
- `interaction` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0005.
|
| 101 |
+
- `smooth_low` beats the per-seed best static baseline in 4/5 seeds; worst paired delta is +0.0003.
|
| 102 |
+
- `baseabc` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0022.
|
| 103 |
- Static `0.12` can win early stages, but holding it fixed loses at the
|
| 104 |
final 4M stage.
|
| 105 |
+
- This is now the TinyStories paper-grade validation gate for this narrowed
|
| 106 |
+
setup: five seeds, paired seed comparisons, and static baselines selected
|
| 107 |
+
from the same stream protocol.
|
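The paired win counts in the interpretation above can be re-derived from the per-seed final losses. The following is a minimal sketch, not the logic of `scripts/summarize_streaming_multiseed.py` (whose source is not shown here): the `FINAL_VAL` table and the `paired_wins` helper are hypothetical names, and the losses are rounded to four decimals from `paired_final_deltas.csv` in this commit.

```python
# Final 4M validation loss per seed and condition, rounded to 4 decimals
# from paired_final_deltas.csv. Names here are illustrative assumptions.
FINAL_VAL = {
    1: {"interaction": 2.5414, "baseabc": 2.5397, "smooth_low": 2.5423,
        "static_dropout_0.08": 2.5419, "static_dropout_0.12": 2.5526,
        "static_dropout_0.18": 2.5636},
    2: {"interaction": 2.5377, "baseabc": 2.5432, "smooth_low": 2.5386,
        "static_dropout_0.08": 2.5636, "static_dropout_0.12": 2.5588,
        "static_dropout_0.18": 2.5768},
    3: {"interaction": 2.5385, "baseabc": 2.5425, "smooth_low": 2.5407,
        "static_dropout_0.08": 2.5478, "static_dropout_0.12": 2.5510,
        "static_dropout_0.18": 2.5667},
    4: {"interaction": 2.4932, "baseabc": 2.5049, "smooth_low": 2.4959,
        "static_dropout_0.08": 2.5098, "static_dropout_0.12": 2.5166,
        "static_dropout_0.18": 2.5343},
    5: {"interaction": 2.5447, "baseabc": 2.5481, "smooth_low": 2.5428,
        "static_dropout_0.08": 2.5588, "static_dropout_0.12": 2.5595,
        "static_dropout_0.18": 2.5806},
}


def paired_wins(final_val):
    """Count, per decay schedule, the seeds where it beats that seed's best static."""
    wins = {}
    for losses in final_val.values():
        statics = {k: v for k, v in losses.items() if k.startswith("static_")}
        best_static = min(statics.values())  # best static baseline for this seed
        for condition, loss in losses.items():
            if condition in statics:
                continue
            # A win means strictly lower final validation loss than the best static.
            wins[condition] = wins.get(condition, 0) + int(loss < best_static)
    return wins


WINS = paired_wins(FINAL_VAL)
print(WINS)
```

Selecting the best static baseline separately for each seed is the stricter comparison: in seed 2 the best static is `static_dropout_0.12`, not the `static_dropout_0.08` that has the best mean.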
runs/streaming_tinystories_multiseed_validation_l12/combined_5seed_summary/condition_summary.csv
ADDED
|
@@ -0,0 +1,7 @@
| 1 |
+
condition,kind,n,mean_trajectory_val,std_trajectory_val,mean_final_val,std_final_val,mean_final_gap,std_final_gap,dropout_path
|
| 2 |
+
interaction,anchor_decay,5,2.830870787985623,0.006818420404928253,2.5311126589775084,0.021347338722582713,0.2625980645418167,0.024789124454166093,0.18 -> 0.14 -> 0.08 -> 0.04
|
| 3 |
+
smooth_low,decay,5,2.830661616846919,0.006865263070272619,2.532056810706854,0.020285070194807492,0.26073483750224113,0.024773434805768067,0.16 -> 0.11 -> 0.07 -> 0.05
|
| 4 |
+
baseabc,anchor_decay,5,2.847391692176461,0.0027872612012979406,2.5356785252690317,0.017481971768559337,0.2655265092849731,0.022078743750300757,0.25 -> 0.19 -> 0.10 -> 0.02
|
| 5 |
+
static_dropout_0.08,static,5,2.843405600450933,0.007235109596464878,2.5443769969046115,0.021138839821398296,0.2592696316540241,0.023068417622449672,0.08 -> 0.08 -> 0.08 -> 0.08
|
| 6 |
+
static_dropout_0.12,static,5,2.835688897036016,0.006067411735608185,2.54771768823266,0.017777870988576795,0.2268920622766018,0.024198331849328455,0.12 -> 0.12 -> 0.12 -> 0.12
|
| 7 |
+
static_dropout_0.18,static,5,2.84606368560344,0.00466160864781479,2.564381641894579,0.01822875837957205,0.20352096483111382,0.019991069212145,0.18 -> 0.18 -> 0.18 -> 0.18
|
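The summary columns can be cross-checked against the per-seed values. The sketch below assumes the `std_*` columns are sample standard deviations (divisor `n - 1`). The summarizer's source is not shown here, so that is an inference from the numbers. The per-seed losses are the `interaction` rows of `paired_final_deltas.csv`, and the variable names are illustrative.

```python
import statistics

# Per-seed final validation loss for `interaction` (seeds 1..5),
# copied from paired_final_deltas.csv in this commit.
INTERACTION_FINAL_VAL = [
    2.541424009948969,
    2.5376960188150406,
    2.5385167226195335,
    2.493242312222719,
    2.5446842312812805,
]

mean_final = statistics.mean(INTERACTION_FINAL_VAL)
# statistics.stdev is the sample standard deviation (n - 1 in the denominator).
std_final = statistics.stdev(INTERACTION_FINAL_VAL)
# statistics.pstdev (population, divisor n) is included only for contrast.
pstd_final = statistics.pstdev(INTERACTION_FINAL_VAL)

print(f"mean={mean_final:.4f} sample_std={std_final:.4f} population_std={pstd_final:.4f}")
```

With five seeds the sample and population estimates differ noticeably, so any downstream comparison of spreads should use the same convention.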
runs/streaming_tinystories_multiseed_validation_l12/combined_5seed_summary/paired_final_deltas.csv
ADDED
|
@@ -0,0 +1,31 @@
| 1 |
+
seed,condition,final_val,best_static_condition,best_static_final_val,delta_vs_best_static
|
| 2 |
+
1,interaction,2.541424009948969,static_dropout_0.08,2.54193027690053,-0.0005062669515609741
|
| 3 |
+
1,baseabc,2.5396873727440834,static_dropout_0.08,2.54193027690053,-0.002242904156446457
|
| 4 |
+
1,smooth_low,2.5422630608081818,static_dropout_0.08,2.54193027690053,0.00033278390765190125
|
| 5 |
+
1,static_dropout_0.08,2.54193027690053,static_dropout_0.08,2.54193027690053,0.0
|
| 6 |
+
1,static_dropout_0.12,2.552573699504137,static_dropout_0.08,2.54193027690053,0.010643422603607178
|
| 7 |
+
1,static_dropout_0.18,2.563616167753935,static_dropout_0.08,2.54193027690053,0.021685890853405
|
| 8 |
+
2,interaction,2.5376960188150406,static_dropout_0.12,2.558806713670492,-0.021110694855451584
|
| 9 |
+
2,baseabc,2.5431769341230392,static_dropout_0.12,2.558806713670492,-0.015629779547452927
|
| 10 |
+
2,smooth_low,2.5386215560138226,static_dropout_0.12,2.558806713670492,-0.020185157656669617
|
| 11 |
+
2,static_dropout_0.08,2.563583254814148,static_dropout_0.12,2.558806713670492,0.004776541143655777
|
| 12 |
+
2,static_dropout_0.12,2.558806713670492,static_dropout_0.12,2.558806713670492,0.0
|
| 13 |
+
2,static_dropout_0.18,2.5767628997564316,static_dropout_0.12,2.558806713670492,0.017956186085939407
|
| 14 |
+
3,interaction,2.5385167226195335,static_dropout_0.08,2.547760896384716,-0.009244173765182495
|
| 15 |
+
3,baseabc,2.5425427742302418,static_dropout_0.08,2.547760896384716,-0.005218122154474258
|
| 16 |
+
3,smooth_low,2.5406627915799618,static_dropout_0.08,2.547760896384716,-0.007098104804754257
|
| 17 |
+
3,static_dropout_0.08,2.547760896384716,static_dropout_0.08,2.547760896384716,0.0
|
| 18 |
+
3,static_dropout_0.12,2.5510490722954273,static_dropout_0.08,2.547760896384716,0.0032881759107112885
|
| 19 |
+
3,static_dropout_0.18,2.566690094769001,static_dropout_0.08,2.547760896384716,0.018929198384284973
|
| 20 |
+
4,interaction,2.493242312222719,static_dropout_0.08,2.5098287016153336,-0.016586389392614365
|
| 21 |
+
4,baseabc,2.5048790462315083,static_dropout_0.08,2.5098287016153336,-0.004949655383825302
|
| 22 |
+
4,smooth_low,2.495888389647007,static_dropout_0.08,2.5098287016153336,-0.013940311968326569
|
| 23 |
+
4,static_dropout_0.08,2.5098287016153336,static_dropout_0.08,2.5098287016153336,0.0
|
| 24 |
+
4,static_dropout_0.12,2.516622833907604,static_dropout_0.08,2.5098287016153336,0.00679413229227066
|
| 25 |
+
4,static_dropout_0.18,2.5342570766806602,static_dropout_0.08,2.5098287016153336,0.02442837506532669
|
| 26 |
+
5,interaction,2.5446842312812805,static_dropout_0.08,2.5587818548083305,-0.014097623527050018
|
| 27 |
+
5,baseabc,2.548106499016285,static_dropout_0.08,2.5587818548083305,-0.010675355792045593
|
| 28 |
+
5,smooth_low,2.5428482554852962,static_dropout_0.08,2.5587818548083305,-0.015933599323034286
|
| 29 |
+
5,static_dropout_0.08,2.5587818548083305,static_dropout_0.08,2.5587818548083305,0.0
|
| 30 |
+
5,static_dropout_0.12,2.5595361217856407,static_dropout_0.08,2.5587818548083305,0.0007542669773101807
|
| 31 |
+
5,static_dropout_0.18,2.580581970512867,static_dropout_0.08,2.5587818548083305,0.021800115704536438
|
runs/streaming_tinystories_multiseed_validation_l12/combined_5seed_summary/stage_summary.csv
ADDED
|
@@ -0,0 +1,25 @@
| 1 |
+
condition,stage,token_limit,dropout,n,mean_val,std_val,mean_train,std_train,mean_gap,std_gap
|
| 2 |
+
interaction,0,500000,0.184,5,3.232573334872723,0.012322249630223228,2.8108130276203154,0.01843696543963324,0.4217603072524071,0.02642271429216449
|
| 3 |
+
interaction,1,1000000,0.141,5,2.8907706700265408,0.00267614743609897,2.4842222586274145,0.010088488701185646,0.40654841139912606,0.010858718174861278
|
| 4 |
+
interaction,2,2000000,0.084,5,2.6690264880657195,0.020694473868817478,2.3392063215374947,0.013823027371981881,0.32982016652822493,0.01498506611092748
|
| 5 |
+
interaction,3,4000000,0.045,5,2.5311126589775084,0.021347338722582713,2.268514594435692,0.02281533749518331,0.2625980645418167,0.024789124454166093
|
| 6 |
+
baseabc,0,500000,0.251,5,3.2727632120251657,0.010179445899511374,2.913914993405342,0.014484851568583227,0.35884821861982347,0.023140322107556496
|
| 7 |
+
baseabc,1,1000000,0.186,5,2.904086685180664,0.0037183966108595805,2.5659306168556215,0.007757824816452149,0.33815606832504275,0.010478362783456022
|
| 8 |
+
baseabc,2,2000000,0.10500000000000001,5,2.677038346230984,0.018558286675126146,2.3937876164913177,0.01400362738363624,0.283250729739666,0.023973231722312253
|
| 9 |
+
baseabc,3,4000000,0.02,5,2.5356785252690317,0.017481971768559337,2.2701520159840585,0.022386508741524157,0.2655265092849731,0.022078743750300757
|
| 10 |
+
smooth_low,0,500000,0.16230079361664454,5,3.2286536514759065,0.012238699798719154,2.790942022204399,0.019096707542752278,0.43771162927150725,0.026763959097502593
|
| 11 |
+
smooth_low,1,1000000,0.11452606249945704,5,2.8911825358867644,0.0017684362872669639,2.467776434123516,0.010888431197392221,0.42340610176324844,0.011628309520454386
|
| 12 |
+
smooth_low,2,2000000,0.06673830013226953,5,2.6707534693181514,0.02176383141227104,2.336036388576031,0.01467458437647137,0.33471708074212075,0.01662120958844446
|
| 13 |
+
smooth_low,3,4000000,0.045000006515082035,5,2.532056810706854,0.020285070194807492,2.271321973204613,0.02045259624936235,0.26073483750224113,0.024773434805768067
|
| 14 |
+
static_dropout_0.08,0,500000,0.08,5,3.2304254487156867,0.010215096373828215,2.617284271121025,0.03148810449622768,0.6131411775946617,0.03563123820440458
|
| 15 |
+
static_dropout_0.08,1,1000000,0.08,5,2.9132092565298082,0.00683551263871941,2.3530609726905825,0.0062032351188941945,0.5601482838392258,0.0011661277729932713
|
| 16 |
+
static_dropout_0.08,2,2000000,0.08,5,2.6856106996536253,0.016060032597589886,2.310917650163174,0.009996090547415524,0.3746930494904518,0.01953395199808365
|
| 17 |
+
static_dropout_0.08,3,4000000,0.08,5,2.5443769969046115,0.021138839821398296,2.2851073652505876,0.025411287336708797,0.2592696316540241,0.023068417622449672
|
| 18 |
+
static_dropout_0.12,0,500000,0.12,5,3.2225750528275965,0.014316453554089874,2.6968257799744606,0.010457351991404112,0.5257492728531361,0.013561266528339002
|
| 19 |
+
static_dropout_0.12,1,1000000,0.12,5,2.893001724779606,0.012108793042410648,2.433453027904034,0.01094591291488611,0.4595486968755722,0.012902761746442865
|
| 20 |
+
static_dropout_0.12,2,2000000,0.12,5,2.6794611223042013,0.016337678176899628,2.369656093418598,0.01150856650994152,0.30980502888560296,0.022584218257239858
|
| 21 |
+
static_dropout_0.12,3,4000000,0.12,5,2.54771768823266,0.017777870988576795,2.3208256259560587,0.023844817077326764,0.2268920622766018,0.024198331849328455
|
| 22 |
+
static_dropout_0.18,0,500000,0.18,5,3.234873204678297,0.015085252989781608,2.8055711716413496,0.016932298003007645,0.42930203303694725,0.029596925527428923
|
| 23 |
+
static_dropout_0.18,1,1000000,0.18,5,2.899004338681698,0.010615437421382663,2.5396994188427926,0.0152169117514434,0.35930491983890533,0.01280367489475416
|
| 24 |
+
static_dropout_0.18,2,2000000,0.18,5,2.6859955571591856,0.015853498199607095,2.43468574732542,0.010811605842721716,0.25130980983376505,0.02032095250136152
|
| 25 |
+
static_dropout_0.18,3,4000000,0.18,5,2.564381641894579,0.01822875837957205,2.360860677063465,0.025011274732778335,0.20352096483111382,0.019991069212145
|
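The `mean_gap` column appears to be the validation loss minus the training loss. A small sketch that checks this on three stage-3 rows copied from `stage_summary.csv`; the `STAGE3` table name and the choice of rows are illustrative assumptions, not part of the commit.

```python
# (mean_val, mean_train, reported mean_gap) at stage 3 (4,000,000 prefix tokens),
# copied from stage_summary.csv.
STAGE3 = {
    "interaction": (2.5311126589775084, 2.268514594435692, 0.2625980645418167),
    "static_dropout_0.08": (2.5443769969046115, 2.2851073652505876, 0.2592696316540241),
    "static_dropout_0.18": (2.564381641894579, 2.360860677063465, 0.20352096483111382),
}

# Largest disagreement between (val - train) and the reported gap.
MAX_ABS_ERROR = max(
    abs((val - train) - gap) for val, train, gap in STAGE3.values()
)

# Condition with the smallest gap versus the one with the lowest validation loss.
SMALLEST_GAP = min(STAGE3, key=lambda name: STAGE3[name][2])
LOWEST_VAL = min(STAGE3, key=lambda name: STAGE3[name][0])

print(MAX_ABS_ERROR, SMALLEST_GAP, LOWEST_VAL)
```

In these rows the smallest train/validation gap belongs to `static_dropout_0.18`, which also has the highest final validation loss, so the gap alone is not a ranking criterion here.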
runs/streaming_tinystories_multiseed_validation_l12/locked_stream/20260530-141335/RESULT_SUMMARY.md
ADDED
|
@@ -0,0 +1,63 @@
| 1 |
+
# Locked Streaming Dropout Summary
|
| 2 |
+
|
| 3 |
+
Run directory: `runs/streaming_tinystories_multiseed_validation_l12/locked_stream/20260530-141335`
|
| 4 |
+
|
| 5 |
+
Model: `L12_H8_D320` causal Transformer, 17,367,040 parameters, 12 layers, 8 heads, 320 embedding dim.
|
| 6 |
+
Training per stage: 2,000 steps. Sampled tokens are cumulative in each stage row. Seeds present: 4, 5.
|
| 7 |
+
|
| 8 |
+
## Condition Ranking
|
| 9 |
+
|
| 10 |
+
| Condition | Kind | Final dropout | Mean trajectory val loss | Final val loss | Final gap | Dropout path |
|
| 11 |
+
|---|---|---:|---:|---:|---:|---|
|
| 12 |
+
| `smooth_low` | decay | 0.05 | 2.8245 | 2.5194 | 0.2339 | 0.16 -> 0.11 -> 0.07 -> 0.05 |
|
| 13 |
+
| `interaction` | anchor_decay | 0.04 | 2.8252 | 2.5190 | 0.2358 | 0.18 -> 0.14 -> 0.08 -> 0.04 |
|
| 14 |
+
| `static_dropout_0.12` | static | 0.12 | 2.8339 | 2.5381 | 0.2007 | 0.12 -> 0.12 -> 0.12 -> 0.12 |
|
| 15 |
+
| `static_dropout_0.08` | static | 0.08 | 2.8371 | 2.5343 | 0.2344 | 0.08 -> 0.08 -> 0.08 -> 0.08 |
|
| 16 |
+
| `baseabc` | anchor_decay | 0.02 | 2.8449 | 2.5265 | 0.2415 | 0.25 -> 0.19 -> 0.10 -> 0.02 |
|
| 17 |
+
| `static_dropout_0.18` | static | 0.18 | 2.8468 | 2.5574 | 0.1826 | 0.18 -> 0.18 -> 0.18 -> 0.18 |
|
| 18 |
+
|
| 19 |
+
## Stage Trajectory
|
| 20 |
+
|
| 21 |
+
### Stage 0: 500,000 Prefix Tokens
|
| 22 |
+
|
| 23 |
+
| Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
|
| 24 |
+
|---|---:|---:|---:|---:|---:|
|
| 25 |
+
| `static_dropout_0.12` | 0.12 | 3.2275 | 2.6889 | 0.5386 | 2 |
|
| 26 |
+
| `static_dropout_0.08` | 0.08 | 3.2304 | 2.5885 | 0.6419 | 2 |
|
| 27 |
+
| `smooth_low` | 0.16 | 3.2315 | 2.7729 | 0.4587 | 2 |
|
| 28 |
+
| `interaction` | 0.18 | 3.2364 | 2.7933 | 0.4431 | 2 |
|
| 29 |
+
| `static_dropout_0.18` | 0.18 | 3.2472 | 2.7925 | 0.4548 | 2 |
|
| 30 |
+
| `baseabc` | 0.25 | 3.2830 | 2.9004 | 0.3826 | 2 |
|
| 31 |
+
|
| 32 |
+
### Stage 1: 1,000,000 Prefix Tokens
|
| 33 |
+
|
| 34 |
+
| Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
|
| 35 |
+
|---|---:|---:|---:|---:|---:|
|
| 36 |
+
| `interaction` | 0.14 | 2.8919 | 2.4742 | 0.4177 | 2 |
|
| 37 |
+
| `smooth_low` | 0.11 | 2.8925 | 2.4570 | 0.4355 | 2 |
|
| 38 |
+
| `static_dropout_0.12` | 0.12 | 2.9000 | 2.4333 | 0.4667 | 2 |
|
| 39 |
+
| `baseabc` | 0.19 | 2.9071 | 2.5635 | 0.3436 | 2 |
|
| 40 |
+
| `static_dropout_0.18` | 0.18 | 2.9090 | 2.5408 | 0.3682 | 2 |
|
| 41 |
+
| `static_dropout_0.08` | 0.08 | 2.9121 | 2.3515 | 0.5605 | 2 |
|
| 42 |
+
|
| 43 |
+
### Stage 2: 2,000,000 Prefix Tokens
|
| 44 |
+
|
| 45 |
+
| Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
|
| 46 |
+
|---|---:|---:|---:|---:|---:|
|
| 47 |
+
| `interaction` | 0.08 | 2.6534 | 2.3349 | 0.3185 | 2 |
|
| 48 |
+
| `smooth_low` | 0.07 | 2.6547 | 2.3320 | 0.3228 | 2 |
|
| 49 |
+
| `baseabc` | 0.10 | 2.6630 | 2.4016 | 0.2614 | 2 |
|
| 50 |
+
| `static_dropout_0.12` | 0.12 | 2.6701 | 2.3782 | 0.2919 | 2 |
|
| 51 |
+
| `static_dropout_0.08` | 0.08 | 2.6718 | 2.3169 | 0.3549 | 2 |
|
| 52 |
+
| `static_dropout_0.18` | 0.18 | 2.6736 | 2.4414 | 0.2322 | 2 |
|
| 53 |
+
|
| 54 |
+
### Stage 3: 4,000,000 Prefix Tokens
|
| 55 |
+
|
| 56 |
+
| Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
|
| 57 |
+
|---|---:|---:|---:|---:|---:|
|
| 58 |
+
| `interaction` | 0.04 | 2.5190 | 2.2831 | 0.2358 | 2 |
|
| 59 |
+
| `smooth_low` | 0.05 | 2.5194 | 2.2855 | 0.2339 | 2 |
|
| 60 |
+
| `baseabc` | 0.02 | 2.5265 | 2.2850 | 0.2415 | 2 |
|
| 61 |
+
| `static_dropout_0.08` | 0.08 | 2.5343 | 2.2999 | 0.2344 | 2 |
|
| 62 |
+
| `static_dropout_0.12` | 0.12 | 2.5381 | 2.3373 | 0.2007 | 2 |
|
| 63 |
+
| `static_dropout_0.18` | 0.18 | 2.5574 | 2.3748 | 0.1826 | 2 |
|
runs/streaming_tinystories_multiseed_validation_l12/locked_stream/20260530-141335/config.json
ADDED
|
@@ -0,0 +1,155 @@
{
  "args": {
    "mode": "locked_stream",
    "corpus": null,
    "corpus_glob": null,
    "text_column": "text",
    "use_cached_data": true,
    "output_dir": "runs/streaming_tinystories_multiseed_validation_l12",
    "resume_from": null,
    "cache_dir": ".cache/dropout_decay_tinystories",
    "models": [
      "L12_H8_D320=12x8x320"
    ],
    "seeds": [
      4,
      5
    ],
    "token_limits": [
      5000000
    ],
    "stream_token_caps": [
      500000,
      1000000,
      2000000,
      4000000
    ],
    "val_tokens": 500000,
    "allow_short_corpus": false,
    "force_retokenize": false,
    "vocab_size": 4096,
    "tokenizer_train_chars": 10000000,
    "block_size": 128,
    "batch_size": 16,
    "steps": 2000,
    "stage_steps": 2000,
    "dropout_rates": [
      0.08,
      0.12,
      0.18
    ],
    "decays": [
      {
        "name": "smooth_low",
        "kind": "decay",
        "initial": 0.184,
        "final": 0.045,
        "schedule": "smoothstep",
        "decay_tokens": null,
        "anchors": []
      }
    ],
    "anchor_decays": [
      {
        "name": "interaction",
        "kind": "anchor_decay",
        "initial": 0.184,
        "final": 0.045,
        "schedule": "log_prefix_anchor",
        "decay_tokens": null,
        "anchors": [
          [
            500000,
            0.184
          ],
          [
            1000000,
            0.141
          ],
          [
            2000000,
            0.084
          ],
          [
            4000000,
            0.045
          ]
        ]
      },
      {
        "name": "baseabc",
        "kind": "anchor_decay",
        "initial": 0.251,
        "final": 0.02,
        "schedule": "log_prefix_anchor",
        "decay_tokens": null,
        "anchors": [
          [
            500000,
            0.251
          ],
          [
            1000000,
            0.186
          ],
          [
            2000000,
            0.105
          ],
          [
            4000000,
            0.02
          ]
        ]
      }
    ],
    "decay_tokens": null,
    "eval_batches": 64,
    "train_eval_batches": 32,
    "trace_eval_batches": 8,
    "eval_every": 0,
    "log_every": 1000,
    "lr": 0.0003,
    "weight_decay": 0.1,
    "grad_clip": 1.0,
    "plateau_delta": 0.01,
    "target_min_dropout": 0.1,
    "min_nonzero_margin": 0.01,
    "min_high_dropout_margin": 0.03,
    "screen_early_stop": false,
    "screen_prune_patience": 3,
    "screen_prune_min_delta": 0.01
  },
  "mode": "locked_stream",
  "seeds": [
    4,
    5
  ],
  "models": [
    {
      "model_name": "L12_H8_D320",
      "n_layer": 12,
      "n_head": 8,
      "n_embd": 320
    }
  ],
  "device": "mps",
  "torch": "2.12.0",
  "python": "3.11.15 (main, Mar 3 2026, 00:52:57) [Clang 21.0.0 (clang-2100.0.123.102)]",
  "mps_available": true,
  "attribution": "Derived from Andrej Karpathy's nanochat project (https://github.com/karpathy/nanochat), MIT License, Copyright (c) 2025 Andrej Karpathy.",
  "tokenizer_path": ".cache/dropout_decay_tinystories/tokenizer-v4096.json",
  "encoded_path": ".cache/dropout_decay_tinystories/tokens-v4096-uint16.npy",
  "train_tokens": 4500048,
  "val_tokens": 500000,
  "effective_token_limits": [
    4500048
  ],
  "effective_stream_token_caps": [
    500000,
    1000000,
    2000000,
    4000000
  ],
  "resume_from": null
}
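
The `anchors` lists in this config pair a prefix-token cap with a dropout rate, and the anchor values line up with the `dropout_active_final` values logged per stage in `metrics.jsonl`. How the trainer behaves between anchors is not shown in this diff. The sketch below assumes that a `log_prefix_anchor` schedule interpolates linearly in the logarithm of the prefix size and clamps outside the anchor range. The function name `anchor_dropout` and that interpolation rule are illustrative assumptions, not the project's confirmed implementation.

```python
import math


def anchor_dropout(prefix_tokens, anchors):
    """Return a dropout rate for a given prefix size.

    ASSUMPTION: linear interpolation in log(prefix_tokens) between
    neighbouring anchors, clamped to the first/last anchor outside the
    covered range. Only the anchor points themselves come from config.json.
    """
    anchors = sorted(anchors)
    if prefix_tokens <= anchors[0][0]:
        return anchors[0][1]
    if prefix_tokens >= anchors[-1][0]:
        return anchors[-1][1]
    for (t0, d0), (t1, d1) in zip(anchors, anchors[1:]):
        if t0 <= prefix_tokens <= t1:
            frac = (math.log(prefix_tokens) - math.log(t0)) / (
                math.log(t1) - math.log(t0)
            )
            # Convex combination so frac == 0 or 1 reproduces the anchor exactly.
            return d0 * (1.0 - frac) + d1 * frac


# Anchor points copied from the `interaction` entry in config.json.
INTERACTION = [(500000, 0.184), (1000000, 0.141), (2000000, 0.084), (4000000, 0.045)]

for cap in (500000, 1000000, 2000000, 4000000):
    print(cap, anchor_dropout(cap, INTERACTION))
```

At the four stream caps the function returns the anchor values themselves (0.184, 0.141, 0.084, 0.045), which is the only part of the schedule this run exercises, since each stage trains at a fixed cap.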
|
runs/streaming_tinystories_multiseed_validation_l12/locked_stream/20260530-141335/metrics.jsonl
ADDED
|
@@ -0,0 +1,48 @@
| 1 |
+
{"condition": "interaction", "condition_kind": "anchor_decay", "dropout_active_final": 0.184, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 180.73087000846863, "eval_loss": 3.2204562090337276, "generalization_gap": 0.422469187527895, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 0, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 2.7979870215058327, "train_loss_last": 2.8813064098358154, "val_eval_loss": 3.2204562090337276}
|
| 2 |
+
{"condition": "interaction", "condition_kind": "anchor_decay", "dropout_active_final": 0.141, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 196.36664009094238, "eval_loss": 2.8947813101112843, "generalization_gap": 0.41770828887820244, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 1, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_eval_loss": 2.477073021233082, "train_loss_last": 2.6102116107940674, "val_eval_loss": 2.8947813101112843}
|
| 3 |
+
{"condition": "interaction", "condition_kind": "anchor_decay", "dropout_active_final": 0.084, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 206.1359338760376, "eval_loss": 2.671290386468172, "generalization_gap": 0.3228793926537037, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 2, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_eval_loss": 2.3484109938144684, "train_loss_last": 2.4124107360839844, "val_eval_loss": 2.671290386468172}
|
| 4 |
+
{"condition": "interaction", "condition_kind": "anchor_decay", "dropout_active_final": 0.045, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 202.7109558582306, "eval_loss": 2.493242312222719, "generalization_gap": 0.2353867031633854, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 3, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_eval_loss": 2.257855609059334, "train_loss_last": 2.1464571952819824, "val_eval_loss": 2.493242312222719}
|
| 5 |
+
{"condition": "interaction", "condition_kind": "anchor_decay", "dropout_active_final": 0.184, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 201.26832008361816, "eval_loss": 3.2523921839892864, "generalization_gap": 0.46378010138869286, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 0, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 2.7886120826005936, "train_loss_last": 2.8814051151275635, "val_eval_loss": 3.2523921839892864}
|
| 6 |
+
{"condition": "interaction", "condition_kind": "anchor_decay", "dropout_active_final": 0.141, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 195.89590692520142, "eval_loss": 2.888928048312664, "generalization_gap": 0.41768673807382584, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 1, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_eval_loss": 2.471241310238838, "train_loss_last": 2.5935797691345215, "val_eval_loss": 2.888928048312664}
|
| 7 |
+
{"condition": "interaction", "condition_kind": "anchor_decay", "dropout_active_final": 0.084, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 194.00955510139465, "eval_loss": 2.6355181634426117, "generalization_gap": 0.31410761922597885, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 2, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_eval_loss": 2.321410544216633, "train_loss_last": 2.3252055644989014, "val_eval_loss": 2.6355181634426117}
|
| 8 |
+
{"condition": "interaction", "condition_kind": "anchor_decay", "dropout_active_final": 0.045, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 193.66337299346924, "eval_loss": 2.5446842312812805, "generalization_gap": 0.23629452288150787, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 3, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_eval_loss": 2.3083897083997726, "train_loss_last": 2.3208839893341064, "val_eval_loss": 2.5446842312812805}
|
| 9 |
+
{"condition": "baseabc", "condition_kind": "anchor_decay", "dropout_active_final": 0.251, "dropout_final": 0.02, "dropout_initial": 0.251, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 194.9356129169464, "eval_loss": 3.279827632009983, "generalization_gap": 0.3836944177746773, "model_config": {"block_size": 128, "dropout": 0.251, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 0, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 2.896133214235306, "train_loss_last": 2.9102234840393066, "val_eval_loss": 3.279827632009983}
|
| 10 |
+
{"condition": "baseabc", "condition_kind": "anchor_decay", "dropout_active_final": 0.186, "dropout_final": 0.02, "dropout_initial": 0.251, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 194.77453088760376, "eval_loss": 2.9080570228397846, "generalization_gap": 0.35393861308693886, "model_config": {"block_size": 128, "dropout": 0.251, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 1, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_eval_loss": 2.5541184097528458, "train_loss_last": 2.793113946914673, "val_eval_loss": 2.9080570228397846}
|
| 11 |
+
{"condition": "baseabc", "condition_kind": "anchor_decay", "dropout_active_final": 0.105, "dropout_final": 0.02, "dropout_initial": 0.251, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 193.48544478416443, "eval_loss": 2.6800981052219868, "generalization_gap": 0.27399464324116707, "model_config": {"block_size": 128, "dropout": 0.251, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 2, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_eval_loss": 2.4061034619808197, "train_loss_last": 2.46707820892334, "val_eval_loss": 2.6800981052219868}
|
| 12 |
+
{"condition": "baseabc", "condition_kind": "anchor_decay", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.251, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 193.9006688594818, "eval_loss": 2.5048790462315083, "generalization_gap": 0.24500996246933937, "model_config": {"block_size": 128, "dropout": 0.251, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 3, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_eval_loss": 2.259869083762169, "train_loss_last": 2.2745652198791504, "val_eval_loss": 2.5048790462315083}
|
| 13 |
+
{"condition": "baseabc", "condition_kind": "anchor_decay", "dropout_active_final": 0.251, "dropout_final": 0.02, "dropout_initial": 0.251, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 194.33616995811462, "eval_loss": 3.2862001582980156, "generalization_gap": 0.38156063109636307, "model_config": {"block_size": 128, "dropout": 0.251, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 0, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 2.9046395272016525, "train_loss_last": 2.9626998901367188, "val_eval_loss": 3.2862001582980156}
|
| 14 |
+
{"condition": "baseabc", "condition_kind": "anchor_decay", "dropout_active_final": 0.186, "dropout_final": 0.02, "dropout_initial": 0.251, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 194.12936902046204, "eval_loss": 2.9062237925827503, "generalization_gap": 0.33334336057305336, "model_config": {"block_size": 128, "dropout": 0.251, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 1, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_eval_loss": 2.572880432009697, "train_loss_last": 2.6301145553588867, "val_eval_loss": 2.9062237925827503}
|
| 15 |
+
{"condition": "baseabc", "condition_kind": "anchor_decay", "dropout_active_final": 0.105, "dropout_final": 0.02, "dropout_initial": 0.251, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 194.08817791938782, "eval_loss": 2.645930740982294, "generalization_gap": 0.24876827374100685, "model_config": {"block_size": 128, "dropout": 0.251, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 2, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_eval_loss": 2.3971624672412872, "train_loss_last": 2.5148327350616455, "val_eval_loss": 2.645930740982294}
|
| 16 |
+
{"condition": "baseabc", "condition_kind": "anchor_decay", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.251, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 193.8023762702942, "eval_loss": 2.548106499016285, "generalization_gap": 0.2380063161253929, "model_config": {"block_size": 128, "dropout": 0.251, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 3, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_eval_loss": 2.310100182890892, "train_loss_last": 2.243898868560791, "val_eval_loss": 2.548106499016285}
|
| 17 |
+
{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 194.01875686645508, "eval_loss": 3.2171844728291035, "generalization_gap": 0.6240800507366657, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 0, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 2.5931044220924377, "train_loss_last": 2.6104087829589844, "val_eval_loss": 3.2171844728291035}
|
| 18 |
+
{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 194.42746496200562, "eval_loss": 2.920697819441557, "generalization_gap": 0.5618984661996365, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 1, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_eval_loss": 2.3587993532419205, "train_loss_last": 2.512050151824951, "val_eval_loss": 2.920697819441557}
|
| 19 |
+
{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 193.96866512298584, "eval_loss": 2.684825126081705, "generalization_gap": 0.35957426205277443, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 2, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_eval_loss": 2.3252508640289307, "train_loss_last": 2.418401002883911, "val_eval_loss": 2.684825126081705}
|
| 20 |
+
{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 193.8896188735962, "eval_loss": 2.5098287016153336, "generalization_gap": 0.23803159594535828, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 3, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_eval_loss": 2.2717971056699753, "train_loss_last": 2.3379616737365723, "val_eval_loss": 2.5098287016153336}
|
| 21 |
+
{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 194.1848509311676, "eval_loss": 3.243676133453846, "generalization_gap": 0.6597021743655205, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 0, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 2.5839739590883255, "train_loss_last": 2.5418384075164795, "val_eval_loss": 3.243676133453846}
|
| 22 |
+
{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 194.46870708465576, "eval_loss": 2.9034036584198475, "generalization_gap": 0.5591761879622936, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 1, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_eval_loss": 2.344227470457554, "train_loss_last": 2.497213363647461, "val_eval_loss": 2.9034036584198475}
|
| 23 |
+
{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 194.38420414924622, "eval_loss": 2.6587292850017548, "generalization_gap": 0.3501850962638855, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 2, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_eval_loss": 2.3085441887378693, "train_loss_last": 2.4595203399658203, "val_eval_loss": 2.6587292850017548}
|
| 24 |
+
{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 194.19127774238586, "eval_loss": 2.5587818548083305, "generalization_gap": 0.230830118060112, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 3, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_eval_loss": 2.3279517367482185, "train_loss_last": 2.345273494720459, "val_eval_loss": 2.5587818548083305}
|
| 25 |
+
{"condition": "static_dropout_0.12", "condition_kind": "static", "dropout_active_final": 0.12, "dropout_final": 0.12, "dropout_initial": 0.12, "dropout_schedule": "constant", "elapsed_sec": 194.00335597991943, "eval_loss": 3.2164776138961315, "generalization_gap": 0.5361459217965603, "model_config": {"block_size": 128, "dropout": 0.12, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 0, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 2.6803316920995712, "train_loss_last": 2.7694272994995117, "val_eval_loss": 3.2164776138961315}
|
| 26 |
+
{"condition": "static_dropout_0.12", "condition_kind": "static", "dropout_active_final": 0.12, "dropout_final": 0.12, "dropout_initial": 0.12, "dropout_schedule": "constant", "elapsed_sec": 194.01300406455994, "eval_loss": 2.9069128818809986, "generalization_gap": 0.46784964576363564, "model_config": {"block_size": 128, "dropout": 0.12, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 1, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_eval_loss": 2.439063236117363, "train_loss_last": 2.5384514331817627, "val_eval_loss": 2.9069128818809986}
|
| 27 |
+
{"condition": "static_dropout_0.12", "condition_kind": "static", "dropout_active_final": 0.12, "dropout_final": 0.12, "dropout_initial": 0.12, "dropout_schedule": "constant", "elapsed_sec": 194.0231637954712, "eval_loss": 2.6897822842001915, "generalization_gap": 0.31032323837280273, "model_config": {"block_size": 128, "dropout": 0.12, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 2, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_eval_loss": 2.3794590458273888, "train_loss_last": 2.393846035003662, "val_eval_loss": 2.6897822842001915}
|
| 28 |
+
{"condition": "static_dropout_0.12", "condition_kind": "static", "dropout_active_final": 0.12, "dropout_final": 0.12, "dropout_initial": 0.12, "dropout_schedule": "constant", "elapsed_sec": 194.00913906097412, "eval_loss": 2.516622833907604, "generalization_gap": 0.2051488682627678, "model_config": {"block_size": 128, "dropout": 0.12, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 3, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_eval_loss": 2.3114739656448364, "train_loss_last": 2.516744613647461, "val_eval_loss": 2.516622833907604}
|
| 29 |
+
{"condition": "static_dropout_0.12", "condition_kind": "static", "dropout_active_final": 0.12, "dropout_final": 0.12, "dropout_initial": 0.12, "dropout_schedule": "constant", "elapsed_sec": 193.97342491149902, "eval_loss": 3.238522443920374, "generalization_gap": 0.5409641526639462, "model_config": {"block_size": 128, "dropout": 0.12, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 0, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 2.6975582912564278, "train_loss_last": 2.941650629043579, "val_eval_loss": 3.238522443920374}
|
| 30 |
+
{"condition": "static_dropout_0.12", "condition_kind": "static", "dropout_active_final": 0.12, "dropout_final": 0.12, "dropout_initial": 0.12, "dropout_schedule": "constant", "elapsed_sec": 193.8040759563446, "eval_loss": 2.8931140787899494, "generalization_gap": 0.4656083397567272, "model_config": {"block_size": 128, "dropout": 0.12, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 1, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_eval_loss": 2.427505739033222, "train_loss_last": 2.539917469024658, "val_eval_loss": 2.8931140787899494}
|
| 31 |
+
{"condition": "static_dropout_0.12", "condition_kind": "static", "dropout_active_final": 0.12, "dropout_final": 0.12, "dropout_initial": 0.12, "dropout_schedule": "constant", "elapsed_sec": 193.5177140235901, "eval_loss": 2.6504716500639915, "generalization_gap": 0.27343276143074036, "model_config": {"block_size": 128, "dropout": 0.12, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 2, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_eval_loss": 2.377038888633251, "train_loss_last": 2.3899667263031006, "val_eval_loss": 2.6504716500639915}
|
| 32 |
+
{"condition": "static_dropout_0.12", "condition_kind": "static", "dropout_active_final": 0.12, "dropout_final": 0.12, "dropout_initial": 0.12, "dropout_schedule": "constant", "elapsed_sec": 193.79171109199524, "eval_loss": 2.5595361217856407, "generalization_gap": 0.19634868949651718, "model_config": {"block_size": 128, "dropout": 0.12, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 3, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_eval_loss": 2.3631874322891235, "train_loss_last": 2.3815793991088867, "val_eval_loss": 2.5595361217856407}
|
| 33 |
+
{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 194.09354996681213, "eval_loss": 3.234104972332716, "generalization_gap": 0.44031351432204247, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 0, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 2.7937914580106735, "train_loss_last": 2.9370627403259277, "val_eval_loss": 3.234104972332716}
|
| 34 |
+
{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 194.1311731338501, "eval_loss": 2.9113698303699493, "generalization_gap": 0.3634902611374855, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 1, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_eval_loss": 2.547879569232464, "train_loss_last": 2.7211272716522217, "val_eval_loss": 2.9113698303699493}
|
| 35 |
+
{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 194.0225739479065, "eval_loss": 2.6890049539506435, "generalization_gap": 0.24288412556052208, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 2, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_eval_loss": 2.4461208283901215, "train_loss_last": 2.6030375957489014, "val_eval_loss": 2.6890049539506435}
|
| 36 |
+
{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 193.94474577903748, "eval_loss": 2.5342570766806602, "generalization_gap": 0.18950188159942627, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 3, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_eval_loss": 2.344755195081234, "train_loss_last": 2.4995384216308594, "val_eval_loss": 2.5342570766806602}
|
| 37 |
+
{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 194.22309184074402, "eval_loss": 3.260387759655714, "generalization_gap": 0.4692406989634037, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 0, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 2.7911470606923103, "train_loss_last": 2.9028453826904297, "val_eval_loss": 3.260387759655714}
|
| 38 |
+
{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 194.1104338169098, "eval_loss": 2.9066071063280106, "generalization_gap": 0.37287352979183197, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 1, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_eval_loss": 2.5337335765361786, "train_loss_last": 2.83718204498291, "val_eval_loss": 2.9066071063280106}
|
| 39 |
+
{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 194.03589725494385, "eval_loss": 2.658251740038395, "generalization_gap": 0.2214929312467575, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 2, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_eval_loss": 2.4367588087916374, "train_loss_last": 2.6042351722717285, "val_eval_loss": 2.658251740038395}
|
| 40 |
+
{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 193.68132185935974, "eval_loss": 2.580581970512867, "generalization_gap": 0.1757577732205391, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 3, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_eval_loss": 2.404824197292328, "train_loss_last": 2.4918441772460938, "val_eval_loss": 2.580581970512867}
|
| 41 |
+
{"condition": "smooth_low", "condition_kind": "decay", "dropout_active_final": 0.16230079361664454, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "smoothstep", "elapsed_sec": 201.204332113266, "eval_loss": 3.2153887301683426, "generalization_gap": 0.43687721341848373, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 0, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 2.778511516749859, "train_loss_last": 2.8499650955200195, "val_eval_loss": 3.2153887301683426}
|
| 42 |
+
{"condition": "smooth_low", "condition_kind": "decay", "dropout_active_final": 0.11452606249945704, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "smoothstep", "elapsed_sec": 204.5595259666443, "eval_loss": 2.893922034651041, "generalization_gap": 0.4342481978237629, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 1, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_eval_loss": 2.459673836827278, "train_loss_last": 2.584320068359375, "val_eval_loss": 2.893922034651041}
|
| 43 |
+
{"condition": "smooth_low", "condition_kind": "decay", "dropout_active_final": 0.06673830013226953, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "smoothstep", "elapsed_sec": 201.91624069213867, "eval_loss": 2.6745377629995346, "generalization_gap": 0.32962220162153244, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 2, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_eval_loss": 2.344915561378002, "train_loss_last": 2.395663022994995, "val_eval_loss": 2.6745377629995346}
|
| 44 |
+
{"condition": "smooth_low", "condition_kind": "decay", "dropout_active_final": 0.045000006515082035, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "smoothstep", "elapsed_sec": 195.82384395599365, "eval_loss": 2.495888389647007, "generalization_gap": 0.23202652484178543, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 3, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_eval_loss": 2.2638618648052216, "train_loss_last": 2.146873712539673, "val_eval_loss": 2.495888389647007}
|
| 45 |
+
{"condition": "smooth_low", "condition_kind": "decay", "dropout_active_final": 0.16230079361664454, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "smoothstep", "elapsed_sec": 195.43609189987183, "eval_loss": 3.2477101795375347, "generalization_gap": 0.48043813183903694, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 0, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 2.7672720476984978, "train_loss_last": 2.8498988151550293, "val_eval_loss": 3.2477101795375347}
|
| 46 |
+
{"condition": "smooth_low", "condition_kind": "decay", "dropout_active_final": 0.11452606249945704, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "smoothstep", "elapsed_sec": 196.09808206558228, "eval_loss": 2.89107983186841, "generalization_gap": 0.4366713650524616, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 1, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_eval_loss": 2.4544084668159485, "train_loss_last": 2.5650553703308105, "val_eval_loss": 2.89107983186841}
|
| 47 |
+
{"condition": "smooth_low", "condition_kind": "decay", "dropout_active_final": 0.06673830013226953, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "smoothstep", "elapsed_sec": 196.0459849834442, "eval_loss": 2.6349585987627506, "generalization_gap": 0.31589408591389656, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 2, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_eval_loss": 2.319064512848854, "train_loss_last": 2.319474220275879, "val_eval_loss": 2.6349585987627506}
|
| 48 |
+
{"condition": "smooth_low", "condition_kind": "decay", "dropout_active_final": 0.045000006515082035, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "smoothstep", "elapsed_sec": 196.2467851638794, "eval_loss": 2.5428482554852962, "generalization_gap": 0.23579202219843864, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 3, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_eval_loss": 2.3070562332868576, "train_loss_last": 2.3239519596099854, "val_eval_loss": 2.5428482554852962}
|
runs/streaming_tinystories_multiseed_validation_l12/locked_stream/20260530-141335/summary.csv
ADDED
|
@@ -0,0 +1,25 @@
|
| 1 |
+
run_mode,condition,condition_kind,stage,token_limit,model_name,n_layer,n_head,n_embd,parameters,dropout_initial,dropout_final,dropout_schedule,n,mean_train_eval_loss,std_train_eval_loss,mean_val_eval_loss,std_val_eval_loss,mean_generalization_gap,std_generalization_gap
|
| 2 |
+
locked_stream,baseabc,anchor_decay,0,500000,L12_H8_D320,12,8,320,17367040,0.251,0.02,log_prefix_anchor,2,2.900386370718479,0.006014871581398837,3.2830138951539993,0.004506056551557341,0.38262752443552017,0.0015088150298414955
|
| 3 |
+
locked_stream,interaction,anchor_decay,0,500000,L12_H8_D320,12,8,320,17367040,0.184,0.045,log_prefix_anchor,2,2.793299552053213,0.006629082873104159,3.236424196511507,0.022582144454879362,0.4431246444582939,0.02921122732798352
|
| 4 |
+
locked_stream,smooth_low,decay,0,500000,L12_H8_D320,12,8,320,17367040,0.184,0.045,smoothstep,2,2.7728917822241783,0.007947504783153755,3.2315494548529387,0.022854716026733408,0.45865767262876034,0.030802220809887162
|
| 5 |
+
locked_stream,static_dropout_0.08,static,0,500000,L12_H8_D320,12,8,320,17367040,0.08,0.08,constant,2,2.5885391905903816,0.0064562123055806634,3.2304303031414747,0.01873243287264808,0.6418911125510931,0.02518864517822874
|
| 6 |
+
locked_stream,static_dropout_0.12,static,0,500000,L12_H8_D320,12,8,320,17367040,0.12,0.12,constant,2,2.6889449916779995,0.012181045080595719,3.2275000289082527,0.015588048800246605,0.5385550372302532,0.0034070037196508854
|
| 7 |
+
locked_stream,static_dropout_0.18,static,0,500000,L12_H8_D320,12,8,320,17367040,0.18,0.18,constant,2,2.792469259351492,0.001869871275966133,3.247246365994215,0.018584737144575744,0.4547771066427231,0.020454608420541878
|
| 8 |
+
locked_stream,baseabc,anchor_decay,1,1000000,L12_H8_D320,12,8,320,17367040,0.251,0.02,log_prefix_anchor,2,2.5634994208812714,0.013266753166592413,2.9071404077112675,0.0012962895462253123,0.3436409868299961,0.014563042712817725
|
| 9 |
+
locked_stream,interaction,anchor_decay,1,1000000,L12_H8_D320,12,8,320,17367040,0.184,0.045,log_prefix_anchor,2,2.47415716573596,0.004123642389949808,2.891854679211974,0.004138881109864529,0.41769751347601414,1.5238719914720123e-05
|
| 10 |
+
locked_stream,smooth_low,decay,1,1000000,L12_H8_D320,12,8,320,17367040,0.184,0.045,smoothstep,2,2.4570411518216133,0.003723178840467485,2.8925009332597256,0.0020097408611055986,0.43545978143811226,0.001713437979361886
|
| 11 |
+
locked_stream,static_dropout_0.08,static,1,1000000,L12_H8_D320,12,8,320,17367040,0.08,0.08,constant,2,2.351513411849737,0.010303877131481138,2.912050738930702,0.012228818533382818,0.560537327080965,0.00192494140190168
|
| 12 |
+
locked_stream,static_dropout_0.12,static,1,1000000,L12_H8_D320,12,8,320,17367040,0.12,0.12,constant,2,2.4332844875752926,0.008172384561739694,2.900013480335474,0.009757227237938778,0.4667289927601814,0.0015848426761990845
|
| 13 |
+
locked_stream,static_dropout_0.18,static,1,1000000,L12_H8_D320,12,8,320,17367040,0.18,0.18,constant,2,2.540806572884321,0.010002727362158672,2.90898846834898,0.0033677544669751154,0.36818189546465874,0.006634972895183557
|
| 14 |
+
locked_stream,baseabc,anchor_decay,2,2000000,L12_H8_D320,12,8,320,17367040,0.251,0.02,log_prefix_anchor,2,2.4016329646110535,0.006322238010876659,2.6630144231021404,0.024159974949157448,0.26138145849108696,0.017837736938280786
|
| 15 |
+
locked_stream,interaction,anchor_decay,2,2000000,L12_H8_D320,12,8,320,17367040,0.184,0.045,log_prefix_anchor,2,2.3349107690155506,0.0190922010057151,2.653404274955392,0.0252947814794913,0.31849350593984127,0.006202580473776199
|
| 16 |
+
locked_stream,smooth_low,decay,2,2000000,L12_H8_D320,12,8,320,17367040,0.184,0.045,smoothstep,2,2.331990037113428,0.018279451715743147,2.6547481808811426,0.027986695425526037,0.3227581437677145,0.00970724370978289
|
| 17 |
+
locked_stream,static_dropout_0.08,static,2,2000000,L12_H8_D320,12,8,320,17367040,0.08,0.08,constant,2,2.3168975263834,0.011813403389391254,2.67177720554173,0.01845254618839936,0.35487967915832996,0.006639142799008103
|
| 18 |
+
locked_stream,static_dropout_0.12,static,2,2000000,L12_H8_D320,12,8,320,17367040,0.12,0.12,constant,2,2.37824896723032,0.0017113095635120858,2.6701269671320915,0.027796815970450365,0.29187799990177155,0.02608550640693828
|
| 19 |
+
locked_stream,static_dropout_0.18,static,2,2000000,L12_H8_D320,12,8,320,17367040,0.18,0.18,constant,2,2.4414398185908794,0.0066199475436894235,2.6736283469945192,0.021745806100631468,0.2321885284036398,0.015125858556942045
|
| 20 |
+
locked_stream,baseabc,anchor_decay,3,4000000,L12_H8_D320,12,8,320,17367040,0.251,0.02,log_prefix_anchor,2,2.2849846333265305,0.03551875082037381,2.5264927726238966,0.030566424997536902,0.24150813929736614,0.004952325822836911
|
| 21 |
+
locked_stream,interaction,anchor_decay,3,4000000,L12_H8_D320,12,8,320,17367040,0.184,0.045,log_prefix_anchor,2,2.2831226587295532,0.035733004324778946,2.518963271752,0.03637492980355821,0.23584061302244663,0.0006419254787792673
|
| 22 |
+
locked_stream,smooth_low,decay,3,4000000,L12_H8_D320,12,8,320,17367040,0.184,0.045,smoothstep,2,2.2854590490460396,0.030543030862435327,2.5193683225661516,0.033205639577864834,0.23390927352011204,0.0026626087154295068
|
| 23 |
+
locked_stream,static_dropout_0.08,static,3,4000000,L12_H8_D320,12,8,320,17367040,0.08,0.08,constant,2,2.299874421209097,0.039707320430454655,2.534305278211832,0.03461510658323205,0.23443085700273514,0.0050922138472226
|
| 24 |
+
locked_stream,static_dropout_0.12,static,3,4000000,L12_H8_D320,12,8,320,17367040,0.12,0.12,constant,2,2.33733069896698,0.03656694294283975,2.5380794778466225,0.030344276861570076,0.2007487788796425,0.006222666081269672
|
| 25 |
+
locked_stream,static_dropout_0.18,static,3,4000000,L12_H8_D320,12,8,320,17367040,0.18,0.18,constant,2,2.374789696186781,0.042475198802574214,2.5574195235967636,0.03275664656650025,0.18262982740998268,0.00971855223607397
|
runs/streaming_tinystories_multiseed_validation_l12/locked_stream/20260530-141335/summary.json
ADDED
|
@@ -0,0 +1,530 @@
|
| 1 |
+
[
|
| 2 |
+
{
|
| 3 |
+
"run_mode": "locked_stream",
|
| 4 |
+
"condition": "baseabc",
|
| 5 |
+
"condition_kind": "anchor_decay",
|
| 6 |
+
"stage": 0,
|
| 7 |
+
"token_limit": 500000,
|
| 8 |
+
"model_name": "L12_H8_D320",
|
| 9 |
+
"n_layer": 12,
|
| 10 |
+
"n_head": 8,
|
| 11 |
+
"n_embd": 320,
|
| 12 |
+
"parameters": 17367040,
|
| 13 |
+
"dropout_initial": 0.251,
|
| 14 |
+
"dropout_final": 0.02,
|
| 15 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 16 |
+
"n": 2,
|
| 17 |
+
"mean_train_eval_loss": 2.900386370718479,
|
| 18 |
+
"std_train_eval_loss": 0.006014871581398837,
|
| 19 |
+
"mean_val_eval_loss": 3.2830138951539993,
|
| 20 |
+
"std_val_eval_loss": 0.004506056551557341,
|
| 21 |
+
"mean_generalization_gap": 0.38262752443552017,
|
| 22 |
+
"std_generalization_gap": 0.0015088150298414955
|
| 23 |
+
},
|
| 24 |
+
{
|
| 25 |
+
"run_mode": "locked_stream",
|
| 26 |
+
"condition": "interaction",
|
| 27 |
+
"condition_kind": "anchor_decay",
|
| 28 |
+
"stage": 0,
|
| 29 |
+
"token_limit": 500000,
|
| 30 |
+
"model_name": "L12_H8_D320",
|
| 31 |
+
"n_layer": 12,
|
| 32 |
+
"n_head": 8,
|
| 33 |
+
"n_embd": 320,
|
| 34 |
+
"parameters": 17367040,
|
| 35 |
+
"dropout_initial": 0.184,
|
| 36 |
+
"dropout_final": 0.045,
|
| 37 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 38 |
+
"n": 2,
|
| 39 |
+
"mean_train_eval_loss": 2.793299552053213,
|
| 40 |
+
"std_train_eval_loss": 0.006629082873104159,
|
| 41 |
+
"mean_val_eval_loss": 3.236424196511507,
|
| 42 |
+
"std_val_eval_loss": 0.022582144454879362,
|
| 43 |
+
"mean_generalization_gap": 0.4431246444582939,
|
| 44 |
+
"std_generalization_gap": 0.02921122732798352
|
| 45 |
+
},
|
| 46 |
+
{
|
| 47 |
+
"run_mode": "locked_stream",
|
| 48 |
+
"condition": "smooth_low",
|
| 49 |
+
"condition_kind": "decay",
|
| 50 |
+
"stage": 0,
|
| 51 |
+
"token_limit": 500000,
|
| 52 |
+
"model_name": "L12_H8_D320",
|
| 53 |
+
"n_layer": 12,
|
| 54 |
+
"n_head": 8,
|
| 55 |
+
"n_embd": 320,
|
| 56 |
+
"parameters": 17367040,
|
| 57 |
+
"dropout_initial": 0.184,
|
| 58 |
+
"dropout_final": 0.045,
|
| 59 |
+
"dropout_schedule": "smoothstep",
|
| 60 |
+
"n": 2,
|
| 61 |
+
"mean_train_eval_loss": 2.7728917822241783,
|
| 62 |
+
"std_train_eval_loss": 0.007947504783153755,
|
| 63 |
+
"mean_val_eval_loss": 3.2315494548529387,
|
| 64 |
+
"std_val_eval_loss": 0.022854716026733408,
|
| 65 |
+
"mean_generalization_gap": 0.45865767262876034,
|
| 66 |
+
"std_generalization_gap": 0.030802220809887162
|
| 67 |
+
},
|
| 68 |
+
{
|
| 69 |
+
"run_mode": "locked_stream",
|
| 70 |
+
"condition": "static_dropout_0.08",
|
| 71 |
+
"condition_kind": "static",
|
| 72 |
+
"stage": 0,
|
| 73 |
+
"token_limit": 500000,
|
| 74 |
+
"model_name": "L12_H8_D320",
|
| 75 |
+
"n_layer": 12,
|
| 76 |
+
"n_head": 8,
|
| 77 |
+
"n_embd": 320,
|
| 78 |
+
"parameters": 17367040,
|
| 79 |
+
"dropout_initial": 0.08,
|
| 80 |
+
"dropout_final": 0.08,
|
| 81 |
+
"dropout_schedule": "constant",
|
| 82 |
+
"n": 2,
|
| 83 |
+
"mean_train_eval_loss": 2.5885391905903816,
|
| 84 |
+
"std_train_eval_loss": 0.0064562123055806634,
|
| 85 |
+
"mean_val_eval_loss": 3.2304303031414747,
|
| 86 |
+
"std_val_eval_loss": 0.01873243287264808,
|
| 87 |
+
"mean_generalization_gap": 0.6418911125510931,
|
| 88 |
+
"std_generalization_gap": 0.02518864517822874
|
| 89 |
+
},
|
| 90 |
+
{
|
| 91 |
+
"run_mode": "locked_stream",
|
| 92 |
+
"condition": "static_dropout_0.12",
|
| 93 |
+
"condition_kind": "static",
|
| 94 |
+
"stage": 0,
|
| 95 |
+
"token_limit": 500000,
|
| 96 |
+
"model_name": "L12_H8_D320",
|
| 97 |
+
"n_layer": 12,
|
| 98 |
+
"n_head": 8,
|
| 99 |
+
"n_embd": 320,
|
| 100 |
+
"parameters": 17367040,
|
| 101 |
+
"dropout_initial": 0.12,
|
| 102 |
+
"dropout_final": 0.12,
|
| 103 |
+
"dropout_schedule": "constant",
|
| 104 |
+
"n": 2,
|
| 105 |
+
"mean_train_eval_loss": 2.6889449916779995,
|
| 106 |
+
"std_train_eval_loss": 0.012181045080595719,
|
| 107 |
+
"mean_val_eval_loss": 3.2275000289082527,
|
| 108 |
+
"std_val_eval_loss": 0.015588048800246605,
|
| 109 |
+
"mean_generalization_gap": 0.5385550372302532,
|
| 110 |
+
"std_generalization_gap": 0.0034070037196508854
|
| 111 |
+
},
|
| 112 |
+
{
|
| 113 |
+
"run_mode": "locked_stream",
|
| 114 |
+
"condition": "static_dropout_0.18",
|
| 115 |
+
"condition_kind": "static",
|
| 116 |
+
"stage": 0,
|
| 117 |
+
"token_limit": 500000,
|
| 118 |
+
"model_name": "L12_H8_D320",
|
| 119 |
+
"n_layer": 12,
|
| 120 |
+
"n_head": 8,
|
| 121 |
+
"n_embd": 320,
|
| 122 |
+
"parameters": 17367040,
|
| 123 |
+
"dropout_initial": 0.18,
|
| 124 |
+
"dropout_final": 0.18,
|
| 125 |
+
"dropout_schedule": "constant",
|
| 126 |
+
"n": 2,
|
| 127 |
+
"mean_train_eval_loss": 2.792469259351492,
|
| 128 |
+
"std_train_eval_loss": 0.001869871275966133,
|
| 129 |
+
"mean_val_eval_loss": 3.247246365994215,
|
| 130 |
+
"std_val_eval_loss": 0.018584737144575744,
|
| 131 |
+
"mean_generalization_gap": 0.4547771066427231,
|
| 132 |
+
"std_generalization_gap": 0.020454608420541878
|
| 133 |
+
},
|
| 134 |
+
{
|
| 135 |
+
"run_mode": "locked_stream",
|
| 136 |
+
"condition": "baseabc",
|
| 137 |
+
"condition_kind": "anchor_decay",
|
| 138 |
+
"stage": 1,
|
| 139 |
+
"token_limit": 1000000,
|
| 140 |
+
"model_name": "L12_H8_D320",
|
| 141 |
+
"n_layer": 12,
|
| 142 |
+
"n_head": 8,
|
| 143 |
+
"n_embd": 320,
|
| 144 |
+
"parameters": 17367040,
|
| 145 |
+
"dropout_initial": 0.251,
|
| 146 |
+
"dropout_final": 0.02,
|
| 147 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 148 |
+
"n": 2,
|
| 149 |
+
"mean_train_eval_loss": 2.5634994208812714,
|
| 150 |
+
"std_train_eval_loss": 0.013266753166592413,
|
| 151 |
+
"mean_val_eval_loss": 2.9071404077112675,
|
| 152 |
+
"std_val_eval_loss": 0.0012962895462253123,
|
| 153 |
+
"mean_generalization_gap": 0.3436409868299961,
|
| 154 |
+
"std_generalization_gap": 0.014563042712817725
|
| 155 |
+
},
|
| 156 |
+
{
|
| 157 |
+
"run_mode": "locked_stream",
|
| 158 |
+
"condition": "interaction",
|
| 159 |
+
"condition_kind": "anchor_decay",
|
| 160 |
+
"stage": 1,
|
| 161 |
+
"token_limit": 1000000,
|
| 162 |
+
"model_name": "L12_H8_D320",
|
| 163 |
+
"n_layer": 12,
|
| 164 |
+
"n_head": 8,
|
| 165 |
+
"n_embd": 320,
|
| 166 |
+
"parameters": 17367040,
|
| 167 |
+
"dropout_initial": 0.184,
|
| 168 |
+
"dropout_final": 0.045,
|
| 169 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 170 |
+
"n": 2,
|
| 171 |
+
"mean_train_eval_loss": 2.47415716573596,
|
| 172 |
+
"std_train_eval_loss": 0.004123642389949808,
|
| 173 |
+
"mean_val_eval_loss": 2.891854679211974,
|
| 174 |
+
"std_val_eval_loss": 0.004138881109864529,
|
| 175 |
+
"mean_generalization_gap": 0.41769751347601414,
|
| 176 |
+
"std_generalization_gap": 1.5238719914720123e-05
|
| 177 |
+
},
|
| 178 |
+
{
|
| 179 |
+
"run_mode": "locked_stream",
|
| 180 |
+
"condition": "smooth_low",
|
| 181 |
+
"condition_kind": "decay",
|
| 182 |
+
"stage": 1,
|
| 183 |
+
"token_limit": 1000000,
|
| 184 |
+
"model_name": "L12_H8_D320",
|
| 185 |
+
"n_layer": 12,
|
| 186 |
+
"n_head": 8,
|
| 187 |
+
"n_embd": 320,
|
| 188 |
+
"parameters": 17367040,
|
| 189 |
+
"dropout_initial": 0.184,
|
| 190 |
+
"dropout_final": 0.045,
|
| 191 |
+
"dropout_schedule": "smoothstep",
|
| 192 |
+
"n": 2,
|
| 193 |
+
"mean_train_eval_loss": 2.4570411518216133,
|
| 194 |
+
"std_train_eval_loss": 0.003723178840467485,
|
| 195 |
+
"mean_val_eval_loss": 2.8925009332597256,
|
| 196 |
+
"std_val_eval_loss": 0.0020097408611055986,
|
| 197 |
+
"mean_generalization_gap": 0.43545978143811226,
|
| 198 |
+
"std_generalization_gap": 0.001713437979361886
|
| 199 |
+
},
|
| 200 |
+
{
|
| 201 |
+
"run_mode": "locked_stream",
|
| 202 |
+
"condition": "static_dropout_0.08",
|
| 203 |
+
"condition_kind": "static",
|
| 204 |
+
"stage": 1,
|
| 205 |
+
"token_limit": 1000000,
|
| 206 |
+
"model_name": "L12_H8_D320",
|
| 207 |
+
"n_layer": 12,
|
| 208 |
+
"n_head": 8,
|
| 209 |
+
"n_embd": 320,
|
| 210 |
+
"parameters": 17367040,
|
| 211 |
+
"dropout_initial": 0.08,
|
| 212 |
+
"dropout_final": 0.08,
|
| 213 |
+
"dropout_schedule": "constant",
|
| 214 |
+
"n": 2,
|
| 215 |
+
"mean_train_eval_loss": 2.351513411849737,
|
| 216 |
+
"std_train_eval_loss": 0.010303877131481138,
|
| 217 |
+
"mean_val_eval_loss": 2.912050738930702,
|
| 218 |
+
"std_val_eval_loss": 0.012228818533382818,
|
| 219 |
+
"mean_generalization_gap": 0.560537327080965,
|
| 220 |
+
"std_generalization_gap": 0.00192494140190168
|
| 221 |
+
},
|
| 222 |
+
{
|
| 223 |
+
"run_mode": "locked_stream",
|
| 224 |
+
"condition": "static_dropout_0.12",
|
| 225 |
+
"condition_kind": "static",
|
| 226 |
+
"stage": 1,
|
| 227 |
+
"token_limit": 1000000,
|
| 228 |
+
"model_name": "L12_H8_D320",
|
| 229 |
+
"n_layer": 12,
|
| 230 |
+
"n_head": 8,
|
| 231 |
+
"n_embd": 320,
|
| 232 |
+
"parameters": 17367040,
|
| 233 |
+
"dropout_initial": 0.12,
|
| 234 |
+
"dropout_final": 0.12,
|
| 235 |
+
"dropout_schedule": "constant",
|
| 236 |
+
"n": 2,
|
| 237 |
+
"mean_train_eval_loss": 2.4332844875752926,
|
| 238 |
+
"std_train_eval_loss": 0.008172384561739694,
|
| 239 |
+
"mean_val_eval_loss": 2.900013480335474,
|
| 240 |
+
"std_val_eval_loss": 0.009757227237938778,
|
| 241 |
+
"mean_generalization_gap": 0.4667289927601814,
|
| 242 |
+
"std_generalization_gap": 0.0015848426761990845
|
| 243 |
+
},
|
| 244 |
+
{
|
| 245 |
+
"run_mode": "locked_stream",
|
| 246 |
+
"condition": "static_dropout_0.18",
|
| 247 |
+
"condition_kind": "static",
|
| 248 |
+
"stage": 1,
|
| 249 |
+
"token_limit": 1000000,
|
| 250 |
+
"model_name": "L12_H8_D320",
|
| 251 |
+
"n_layer": 12,
|
| 252 |
+
"n_head": 8,
|
| 253 |
+
"n_embd": 320,
|
| 254 |
+
"parameters": 17367040,
|
| 255 |
+
"dropout_initial": 0.18,
|
| 256 |
+
"dropout_final": 0.18,
|
| 257 |
+
"dropout_schedule": "constant",
|
| 258 |
+
"n": 2,
|
| 259 |
+
"mean_train_eval_loss": 2.540806572884321,
|
| 260 |
+
"std_train_eval_loss": 0.010002727362158672,
|
| 261 |
+
"mean_val_eval_loss": 2.90898846834898,
|
| 262 |
+
"std_val_eval_loss": 0.0033677544669751154,
|
| 263 |
+
"mean_generalization_gap": 0.36818189546465874,
|
| 264 |
+
"std_generalization_gap": 0.006634972895183557
|
| 265 |
+
},
|
| 266 |
+
{
|
| 267 |
+
"run_mode": "locked_stream",
|
| 268 |
+
"condition": "baseabc",
|
| 269 |
+
"condition_kind": "anchor_decay",
|
| 270 |
+
"stage": 2,
|
| 271 |
+
"token_limit": 2000000,
|
| 272 |
+
"model_name": "L12_H8_D320",
|
| 273 |
+
"n_layer": 12,
|
| 274 |
+
"n_head": 8,
|
| 275 |
+
"n_embd": 320,
|
| 276 |
+
"parameters": 17367040,
|
| 277 |
+
"dropout_initial": 0.251,
|
| 278 |
+
"dropout_final": 0.02,
|
| 279 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 280 |
+
"n": 2,
|
| 281 |
+
"mean_train_eval_loss": 2.4016329646110535,
|
| 282 |
+
"std_train_eval_loss": 0.006322238010876659,
|
| 283 |
+
"mean_val_eval_loss": 2.6630144231021404,
|
| 284 |
+
"std_val_eval_loss": 0.024159974949157448,
|
| 285 |
+
"mean_generalization_gap": 0.26138145849108696,
|
| 286 |
+
"std_generalization_gap": 0.017837736938280786
|
| 287 |
+
},
|
| 288 |
+
{
|
| 289 |
+
"run_mode": "locked_stream",
|
| 290 |
+
"condition": "interaction",
|
| 291 |
+
"condition_kind": "anchor_decay",
|
| 292 |
+
"stage": 2,
|
| 293 |
+
"token_limit": 2000000,
|
| 294 |
+
"model_name": "L12_H8_D320",
|
| 295 |
+
"n_layer": 12,
|
| 296 |
+
"n_head": 8,
|
| 297 |
+
"n_embd": 320,
|
| 298 |
+
"parameters": 17367040,
|
| 299 |
+
"dropout_initial": 0.184,
|
| 300 |
+
"dropout_final": 0.045,
|
| 301 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 302 |
+
"n": 2,
|
| 303 |
+
"mean_train_eval_loss": 2.3349107690155506,
|
| 304 |
+
"std_train_eval_loss": 0.0190922010057151,
|
| 305 |
+
"mean_val_eval_loss": 2.653404274955392,
|
| 306 |
+
"std_val_eval_loss": 0.0252947814794913,
|
| 307 |
+
"mean_generalization_gap": 0.31849350593984127,
|
| 308 |
+
"std_generalization_gap": 0.006202580473776199
|
| 309 |
+
},
|
| 310 |
+
{
|
| 311 |
+
"run_mode": "locked_stream",
|
| 312 |
+
"condition": "smooth_low",
|
| 313 |
+
"condition_kind": "decay",
|
| 314 |
+
"stage": 2,
|
| 315 |
+
"token_limit": 2000000,
|
| 316 |
+
"model_name": "L12_H8_D320",
|
| 317 |
+
"n_layer": 12,
|
| 318 |
+
"n_head": 8,
|
| 319 |
+
"n_embd": 320,
|
| 320 |
+
"parameters": 17367040,
|
| 321 |
+
"dropout_initial": 0.184,
|
| 322 |
+
"dropout_final": 0.045,
|
| 323 |
+
"dropout_schedule": "smoothstep",
|
| 324 |
+
"n": 2,
|
| 325 |
+
"mean_train_eval_loss": 2.331990037113428,
|
| 326 |
+
"std_train_eval_loss": 0.018279451715743147,
|
| 327 |
+
"mean_val_eval_loss": 2.6547481808811426,
|
| 328 |
+
"std_val_eval_loss": 0.027986695425526037,
|
| 329 |
+
"mean_generalization_gap": 0.3227581437677145,
|
| 330 |
+
"std_generalization_gap": 0.00970724370978289
|
| 331 |
+
},
|
| 332 |
+
{
|
| 333 |
+
"run_mode": "locked_stream",
|
| 334 |
+
"condition": "static_dropout_0.08",
|
| 335 |
+
"condition_kind": "static",
|
| 336 |
+
"stage": 2,
|
| 337 |
+
"token_limit": 2000000,
|
| 338 |
+
"model_name": "L12_H8_D320",
|
| 339 |
+
"n_layer": 12,
|
| 340 |
+
"n_head": 8,
|
| 341 |
+
"n_embd": 320,
|
| 342 |
+
"parameters": 17367040,
|
| 343 |
+
"dropout_initial": 0.08,
|
| 344 |
+
"dropout_final": 0.08,
|
| 345 |
+
"dropout_schedule": "constant",
|
| 346 |
+
"n": 2,
|
| 347 |
+
"mean_train_eval_loss": 2.3168975263834,
|
| 348 |
+
"std_train_eval_loss": 0.011813403389391254,
|
| 349 |
+
"mean_val_eval_loss": 2.67177720554173,
|
| 350 |
+
"std_val_eval_loss": 0.01845254618839936,
|
| 351 |
+
"mean_generalization_gap": 0.35487967915832996,
|
| 352 |
+
"std_generalization_gap": 0.006639142799008103
|
| 353 |
+
},
|
| 354 |
+
{
|
| 355 |
+
"run_mode": "locked_stream",
|
| 356 |
+
"condition": "static_dropout_0.12",
|
| 357 |
+
"condition_kind": "static",
|
| 358 |
+
"stage": 2,
|
| 359 |
+
"token_limit": 2000000,
|
| 360 |
+
"model_name": "L12_H8_D320",
|
| 361 |
+
"n_layer": 12,
|
| 362 |
+
"n_head": 8,
|
| 363 |
+
"n_embd": 320,
|
| 364 |
+
"parameters": 17367040,
|
| 365 |
+
"dropout_initial": 0.12,
|
| 366 |
+
"dropout_final": 0.12,
|
| 367 |
+
"dropout_schedule": "constant",
|
| 368 |
+
"n": 2,
|
| 369 |
+
"mean_train_eval_loss": 2.37824896723032,
|
| 370 |
+
"std_train_eval_loss": 0.0017113095635120858,
|
| 371 |
+
"mean_val_eval_loss": 2.6701269671320915,
|
| 372 |
+
"std_val_eval_loss": 0.027796815970450365,
|
| 373 |
+
"mean_generalization_gap": 0.29187799990177155,
|
| 374 |
+
"std_generalization_gap": 0.02608550640693828
|
| 375 |
+
},
|
| 376 |
+
{
|
| 377 |
+
"run_mode": "locked_stream",
|
| 378 |
+
"condition": "static_dropout_0.18",
|
| 379 |
+
"condition_kind": "static",
|
| 380 |
+
"stage": 2,
|
| 381 |
+
"token_limit": 2000000,
|
| 382 |
+
"model_name": "L12_H8_D320",
|
| 383 |
+
"n_layer": 12,
|
| 384 |
+
"n_head": 8,
|
| 385 |
+
"n_embd": 320,
|
| 386 |
+
"parameters": 17367040,
|
| 387 |
+
"dropout_initial": 0.18,
|
| 388 |
+
"dropout_final": 0.18,
|
| 389 |
+
"dropout_schedule": "constant",
|
| 390 |
+
"n": 2,
|
| 391 |
+
"mean_train_eval_loss": 2.4414398185908794,
|
| 392 |
+
"std_train_eval_loss": 0.0066199475436894235,
|
| 393 |
+
"mean_val_eval_loss": 2.6736283469945192,
|
| 394 |
+
"std_val_eval_loss": 0.021745806100631468,
|
| 395 |
+
"mean_generalization_gap": 0.2321885284036398,
|
| 396 |
+
"std_generalization_gap": 0.015125858556942045
|
| 397 |
+
},
|
| 398 |
+
{
|
| 399 |
+
"run_mode": "locked_stream",
|
| 400 |
+
"condition": "baseabc",
|
| 401 |
+
"condition_kind": "anchor_decay",
|
| 402 |
+
"stage": 3,
|
| 403 |
+
"token_limit": 4000000,
|
| 404 |
+
"model_name": "L12_H8_D320",
|
| 405 |
+
"n_layer": 12,
|
| 406 |
+
"n_head": 8,
|
| 407 |
+
"n_embd": 320,
|
| 408 |
+
"parameters": 17367040,
|
| 409 |
+
"dropout_initial": 0.251,
|
| 410 |
+
"dropout_final": 0.02,
|
| 411 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 412 |
+
"n": 2,
|
| 413 |
+
"mean_train_eval_loss": 2.2849846333265305,
|
| 414 |
+
"std_train_eval_loss": 0.03551875082037381,
|
| 415 |
+
"mean_val_eval_loss": 2.5264927726238966,
|
| 416 |
+
"std_val_eval_loss": 0.030566424997536902,
|
| 417 |
+
"mean_generalization_gap": 0.24150813929736614,
|
| 418 |
+
"std_generalization_gap": 0.004952325822836911
|
| 419 |
+
},
|
| 420 |
+
{
|
| 421 |
+
"run_mode": "locked_stream",
|
| 422 |
+
"condition": "interaction",
|
| 423 |
+
"condition_kind": "anchor_decay",
|
| 424 |
+
"stage": 3,
|
| 425 |
+
"token_limit": 4000000,
|
| 426 |
+
"model_name": "L12_H8_D320",
|
| 427 |
+
"n_layer": 12,
|
| 428 |
+
"n_head": 8,
|
| 429 |
+
"n_embd": 320,
|
| 430 |
+
"parameters": 17367040,
|
| 431 |
+
"dropout_initial": 0.184,
|
| 432 |
+
"dropout_final": 0.045,
|
| 433 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 434 |
+
"n": 2,
|
| 435 |
+
"mean_train_eval_loss": 2.2831226587295532,
|
| 436 |
+
"std_train_eval_loss": 0.035733004324778946,
|
| 437 |
+
"mean_val_eval_loss": 2.518963271752,
|
| 438 |
+
"std_val_eval_loss": 0.03637492980355821,
|
| 439 |
+
"mean_generalization_gap": 0.23584061302244663,
|
| 440 |
+
"std_generalization_gap": 0.0006419254787792673
|
| 441 |
+
},
|
| 442 |
+
{
|
| 443 |
+
"run_mode": "locked_stream",
|
| 444 |
+
"condition": "smooth_low",
|
| 445 |
+
"condition_kind": "decay",
|
| 446 |
+
"stage": 3,
|
| 447 |
+
"token_limit": 4000000,
|
| 448 |
+
"model_name": "L12_H8_D320",
|
| 449 |
+
"n_layer": 12,
|
| 450 |
+
"n_head": 8,
|
| 451 |
+
"n_embd": 320,
|
| 452 |
+
"parameters": 17367040,
|
| 453 |
+
"dropout_initial": 0.184,
|
| 454 |
+
"dropout_final": 0.045,
|
| 455 |
+
"dropout_schedule": "smoothstep",
|
| 456 |
+
"n": 2,
|
| 457 |
+
"mean_train_eval_loss": 2.2854590490460396,
|
| 458 |
+
"std_train_eval_loss": 0.030543030862435327,
|
| 459 |
+
"mean_val_eval_loss": 2.5193683225661516,
|
| 460 |
+
"std_val_eval_loss": 0.033205639577864834,
|
| 461 |
+
"mean_generalization_gap": 0.23390927352011204,
|
| 462 |
+
"std_generalization_gap": 0.0026626087154295068
|
| 463 |
+
},
|
| 464 |
+
{
|
| 465 |
+
"run_mode": "locked_stream",
|
| 466 |
+
"condition": "static_dropout_0.08",
|
| 467 |
+
"condition_kind": "static",
|
| 468 |
+
"stage": 3,
|
| 469 |
+
"token_limit": 4000000,
|
| 470 |
+
"model_name": "L12_H8_D320",
|
| 471 |
+
"n_layer": 12,
|
| 472 |
+
"n_head": 8,
|
| 473 |
+
"n_embd": 320,
|
| 474 |
+
"parameters": 17367040,
|
| 475 |
+
"dropout_initial": 0.08,
|
| 476 |
+
"dropout_final": 0.08,
|
| 477 |
+
"dropout_schedule": "constant",
|
| 478 |
+
"n": 2,
|
| 479 |
+
"mean_train_eval_loss": 2.299874421209097,
|
| 480 |
+
"std_train_eval_loss": 0.039707320430454655,
|
| 481 |
+
"mean_val_eval_loss": 2.534305278211832,
|
| 482 |
+
"std_val_eval_loss": 0.03461510658323205,
|
| 483 |
+
"mean_generalization_gap": 0.23443085700273514,
|
| 484 |
+
"std_generalization_gap": 0.0050922138472226
|
| 485 |
+
},
|
| 486 |
+
{
|
| 487 |
+
"run_mode": "locked_stream",
|
| 488 |
+
"condition": "static_dropout_0.12",
|
| 489 |
+
"condition_kind": "static",
|
| 490 |
+
"stage": 3,
|
| 491 |
+
"token_limit": 4000000,
|
| 492 |
+
"model_name": "L12_H8_D320",
|
| 493 |
+
"n_layer": 12,
|
| 494 |
+
"n_head": 8,
|
| 495 |
+
"n_embd": 320,
|
| 496 |
+
"parameters": 17367040,
|
| 497 |
+
"dropout_initial": 0.12,
|
| 498 |
+
"dropout_final": 0.12,
|
| 499 |
+
"dropout_schedule": "constant",
|
| 500 |
+
"n": 2,
|
| 501 |
+
"mean_train_eval_loss": 2.33733069896698,
|
| 502 |
+
"std_train_eval_loss": 0.03656694294283975,
|
| 503 |
+
"mean_val_eval_loss": 2.5380794778466225,
|
| 504 |
+
"std_val_eval_loss": 0.030344276861570076,
|
| 505 |
+
"mean_generalization_gap": 0.2007487788796425,
|
| 506 |
+
"std_generalization_gap": 0.006222666081269672
|
| 507 |
+
},
|
| 508 |
+
{
|
| 509 |
+
"run_mode": "locked_stream",
|
| 510 |
+
"condition": "static_dropout_0.18",
|
| 511 |
+
"condition_kind": "static",
|
| 512 |
+
"stage": 3,
|
| 513 |
+
"token_limit": 4000000,
|
| 514 |
+
"model_name": "L12_H8_D320",
|
| 515 |
+
"n_layer": 12,
|
| 516 |
+
"n_head": 8,
|
| 517 |
+
"n_embd": 320,
|
| 518 |
+
"parameters": 17367040,
|
| 519 |
+
"dropout_initial": 0.18,
|
| 520 |
+
"dropout_final": 0.18,
|
| 521 |
+
"dropout_schedule": "constant",
|
| 522 |
+
"n": 2,
|
| 523 |
+
"mean_train_eval_loss": 2.374789696186781,
|
| 524 |
+
"std_train_eval_loss": 0.042475198802574214,
|
| 525 |
+
"mean_val_eval_loss": 2.5574195235967636,
|
| 526 |
+
"std_val_eval_loss": 0.03275664656650025,
|
| 527 |
+
"mean_generalization_gap": 0.18262982740998268,
|
| 528 |
+
"std_generalization_gap": 0.00971855223607397
|
| 529 |
+
}
|
| 530 |
+
]
|
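Reviewer note: in the `summary.json` rows above, `mean_generalization_gap` appears to equal `mean_val_eval_loss - mean_train_eval_loss`. The sketch below is a minimal consistency check of that reading against the stage-3 rows, with values copied from the diff. The helper name `gap_mismatches` and the tolerance are assumptions for illustration and are not part of the repository's scripts.

```python
import math

# Stage-3 rows copied from summary.json above:
# condition -> (mean_train_eval_loss, mean_val_eval_loss, mean_generalization_gap)
rows = {
    "baseabc": (2.2849846333265305, 2.5264927726238966, 0.24150813929736614),
    "interaction": (2.2831226587295532, 2.518963271752, 0.23584061302244663),
    "smooth_low": (2.2854590490460396, 2.5193683225661516, 0.23390927352011204),
    "static_dropout_0.08": (2.299874421209097, 2.534305278211832, 0.23443085700273514),
    "static_dropout_0.12": (2.33733069896698, 2.5380794778466225, 0.2007487788796425),
    "static_dropout_0.18": (2.374789696186781, 2.5574195235967636, 0.18262982740998268),
}


def gap_mismatches(rows, tol=1e-9):
    """Return the conditions whose reported gap differs from val - train (hypothetical helper)."""
    return [
        name
        for name, (train, val, gap) in rows.items()
        if not math.isclose(val - train, gap, abs_tol=tol)
    ]


# Condition with the lowest mean validation loss at stage 3.
best_by_val = min(rows, key=lambda name: rows[name][1])

print(gap_mismatches(rows))
print(best_by_val)
```

Ranking by the mean alone should be read with care: each row has `n` = 2, and the stage-3 differences between the decay conditions are smaller than the reported `std_val_eval_loss`.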
runs/streaming_tinystories_multiseed_validation_l12/locked_stream/20260530-141335/trace.jsonl
ADDED
|
@@ -0,0 +1,96 @@
|
| 1 |
+
{"condition": "interaction", "dropout": 0.184, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 0, "step": 1000, "steps": 2000, "token_limit": 500000, "tokens_seen": 2048000, "train_batch_loss": 3.3873605728149414}
|
| 2 |
+
{"condition": "interaction", "dropout": 0.184, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 0, "step": 2000, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 2.8813064098358154}
|
| 3 |
+
{"condition": "interaction", "dropout": 0.141, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 1, "step": 1000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 2.771984100341797}
|
| 4 |
+
{"condition": "interaction", "dropout": 0.141, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 1, "step": 2000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_batch_loss": 2.6102116107940674}
|
| 5 |
+
{"condition": "interaction", "dropout": 0.084, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 2, "step": 1000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 10240000, "train_batch_loss": 2.602119207382202}
|
| 6 |
+
{"condition": "interaction", "dropout": 0.084, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 2, "step": 2000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_batch_loss": 2.4124107360839844}
|
| 7 |
+
{"condition": "interaction", "dropout": 0.045, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 3, "step": 1000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 14336000, "train_batch_loss": 2.5032730102539062}
|
| 8 |
+
{"condition": "interaction", "dropout": 0.045, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 3, "step": 2000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_batch_loss": 2.1464571952819824}
|
| 9 |
+
{"condition": "interaction", "dropout": 0.184, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 0, "step": 1000, "steps": 2000, "token_limit": 500000, "tokens_seen": 2048000, "train_batch_loss": 3.293152093887329}
|
| 10 |
+
{"condition": "interaction", "dropout": 0.184, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 0, "step": 2000, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 2.8814051151275635}
|
| 11 |
+
{"condition": "interaction", "dropout": 0.141, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 1, "step": 1000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 2.7541747093200684}
|
| 12 |
+
{"condition": "interaction", "dropout": 0.141, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 1, "step": 2000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_batch_loss": 2.5935797691345215}
|
| 13 |
+
{"condition": "interaction", "dropout": 0.084, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 2, "step": 1000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 10240000, "train_batch_loss": 2.7131600379943848}
|
| 14 |
+
{"condition": "interaction", "dropout": 0.084, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 2, "step": 2000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_batch_loss": 2.3252055644989014}
|
| 15 |
+
{"condition": "interaction", "dropout": 0.045, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 3, "step": 1000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 14336000, "train_batch_loss": 2.4249722957611084}
|
| 16 |
+
{"condition": "interaction", "dropout": 0.045, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 3, "step": 2000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_batch_loss": 2.3208839893341064}
|
| 17 |
+
{"condition": "baseabc", "dropout": 0.251, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 0, "step": 1000, "steps": 2000, "token_limit": 500000, "tokens_seen": 2048000, "train_batch_loss": 3.5486912727355957}
|
| 18 |
+
{"condition": "baseabc", "dropout": 0.251, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 0, "step": 2000, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 2.9102234840393066}
|
| 19 |
+
{"condition": "baseabc", "dropout": 0.186, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 1, "step": 1000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 2.9149229526519775}
|
| 20 |
+
{"condition": "baseabc", "dropout": 0.186, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 1, "step": 2000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_batch_loss": 2.793113946914673}
|
| 21 |
+
{"condition": "baseabc", "dropout": 0.105, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 2, "step": 1000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 10240000, "train_batch_loss": 2.658621072769165}
|
| 22 |
+
{"condition": "baseabc", "dropout": 0.105, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 2, "step": 2000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_batch_loss": 2.46707820892334}
|
| 23 |
+
{"condition": "baseabc", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 3, "step": 1000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 14336000, "train_batch_loss": 2.2726128101348877}
|
| 24 |
+
{"condition": "baseabc", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 3, "step": 2000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_batch_loss": 2.2745652198791504}
|
| 25 |
+
{"condition": "baseabc", "dropout": 0.251, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 0, "step": 1000, "steps": 2000, "token_limit": 500000, "tokens_seen": 2048000, "train_batch_loss": 3.539700508117676}
|
| 26 |
+
{"condition": "baseabc", "dropout": 0.251, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 0, "step": 2000, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 2.9626998901367188}
|
| 27 |
+
{"condition": "baseabc", "dropout": 0.186, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 1, "step": 1000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 2.8122684955596924}
|
| 28 |
+
{"condition": "baseabc", "dropout": 0.186, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 1, "step": 2000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_batch_loss": 2.6301145553588867}
|
| 29 |
+
{"condition": "baseabc", "dropout": 0.105, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 2, "step": 1000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 10240000, "train_batch_loss": 2.6069304943084717}
|
| 30 |
+
{"condition": "baseabc", "dropout": 0.105, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 2, "step": 2000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_batch_loss": 2.5148327350616455}
|
| 31 |
+
{"condition": "baseabc", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 3, "step": 1000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 14336000, "train_batch_loss": 2.388144016265869}
|
| 32 |
+
{"condition": "baseabc", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 3, "step": 2000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_batch_loss": 2.243898868560791}
|
| 33 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 0, "step": 1000, "steps": 2000, "token_limit": 500000, "tokens_seen": 2048000, "train_batch_loss": 3.2069337368011475}
|
| 34 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 0, "step": 2000, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 2.6104087829589844}
|
| 35 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 1, "step": 1000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 2.686776876449585}
|
| 36 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 1, "step": 2000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_batch_loss": 2.512050151824951}
|
| 37 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 2, "step": 1000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 10240000, "train_batch_loss": 2.2968358993530273}
|
| 38 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 2, "step": 2000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_batch_loss": 2.418401002883911}
|
| 39 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 3, "step": 1000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 14336000, "train_batch_loss": 2.4829370975494385}
|
| 40 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 3, "step": 2000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_batch_loss": 2.3379616737365723}
|
| 41 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 0, "step": 1000, "steps": 2000, "token_limit": 500000, "tokens_seen": 2048000, "train_batch_loss": 3.4543862342834473}
|
| 42 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 0, "step": 2000, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 2.5418384075164795}
|
| 43 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 1, "step": 1000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 2.6846256256103516}
|
| 44 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 1, "step": 2000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_batch_loss": 2.497213363647461}
|
| 45 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 2, "step": 1000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 10240000, "train_batch_loss": 2.4699549674987793}
|
| 46 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 2, "step": 2000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_batch_loss": 2.4595203399658203}
|
| 47 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 3, "step": 1000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 14336000, "train_batch_loss": 2.4750654697418213}
|
| 48 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 3, "step": 2000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_batch_loss": 2.345273494720459}
|
| 49 |
+
{"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 0, "step": 1000, "steps": 2000, "token_limit": 500000, "tokens_seen": 2048000, "train_batch_loss": 3.1532158851623535}
|
| 50 |
+
{"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 0, "step": 2000, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 2.7694272994995117}
|
| 51 |
+
{"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 1, "step": 1000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 2.7790844440460205}
|
| 52 |
+
{"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 1, "step": 2000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_batch_loss": 2.5384514331817627}
|
| 53 |
+
{"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 2, "step": 1000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 10240000, "train_batch_loss": 2.5747876167297363}
|
| 54 |
+
{"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 2, "step": 2000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_batch_loss": 2.393846035003662}
|
| 55 |
+
{"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 3, "step": 1000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 14336000, "train_batch_loss": 2.5318603515625}
|
| 56 |
+
{"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 3, "step": 2000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_batch_loss": 2.516744613647461}
|
| 57 |
+
{"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 0, "step": 1000, "steps": 2000, "token_limit": 500000, "tokens_seen": 2048000, "train_batch_loss": 3.2838807106018066}
|
| 58 |
+
{"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 0, "step": 2000, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 2.941650629043579}
|
| 59 |
+
{"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 1, "step": 1000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 2.8269617557525635}
|
| 60 |
+
{"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 1, "step": 2000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_batch_loss": 2.539917469024658}
|
| 61 |
+
{"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 2, "step": 1000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 10240000, "train_batch_loss": 2.6944527626037598}
|
| 62 |
+
{"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 2, "step": 2000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_batch_loss": 2.3899667263031006}
|
| 63 |
+
{"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 3, "step": 1000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 14336000, "train_batch_loss": 2.3812689781188965}
|
| 64 |
+
{"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 3, "step": 2000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_batch_loss": 2.3815793991088867}
|
| 65 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 0, "step": 1000, "steps": 2000, "token_limit": 500000, "tokens_seen": 2048000, "train_batch_loss": 3.5123531818389893}
|
| 66 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 0, "step": 2000, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 2.9370627403259277}
|
| 67 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 1, "step": 1000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 2.8317360877990723}
|
| 68 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 1, "step": 2000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_batch_loss": 2.7211272716522217}
|
| 69 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 2, "step": 1000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 10240000, "train_batch_loss": 2.5763471126556396}
|
| 70 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 2, "step": 2000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_batch_loss": 2.6030375957489014}
|
| 71 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 3, "step": 1000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 14336000, "train_batch_loss": 2.490279197692871}
|
| 72 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 3, "step": 2000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_batch_loss": 2.4995384216308594}
|
| 73 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 0, "step": 1000, "steps": 2000, "token_limit": 500000, "tokens_seen": 2048000, "train_batch_loss": 3.4424760341644287}
|
| 74 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 0, "step": 2000, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 2.9028453826904297}
|
| 75 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 1, "step": 1000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 2.7941365242004395}
|
| 76 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 1, "step": 2000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_batch_loss": 2.83718204498291}
|
| 77 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 2, "step": 1000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 10240000, "train_batch_loss": 2.747528314590454}
|
| 78 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 2, "step": 2000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_batch_loss": 2.6042351722717285}
|
| 79 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 3, "step": 1000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 14336000, "train_batch_loss": 2.5449204444885254}
|
| 80 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 3, "step": 2000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_batch_loss": 2.4918441772460938}
|
| 81 |
+
{"condition": "smooth_low", "dropout": 0.17803874120648827, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 0, "step": 1000, "steps": 2000, "token_limit": 500000, "tokens_seen": 2048000, "train_batch_loss": 3.380012273788452}
|
| 82 |
+
{"condition": "smooth_low", "dropout": 0.16230079361664454, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 0, "step": 2000, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 2.8499650955200195}
|
| 83 |
+
{"condition": "smooth_low", "dropout": 0.1400439632143008, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 1, "step": 1000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 2.7695858478546143}
|
| 84 |
+
{"condition": "smooth_low", "dropout": 0.11452606249945704, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 1, "step": 2000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_batch_loss": 2.584320068359375}
|
| 85 |
+
{"condition": "smooth_low", "dropout": 0.0890049039721133, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 2, "step": 1000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 10240000, "train_batch_loss": 2.604490041732788}
|
| 86 |
+
{"condition": "smooth_low", "dropout": 0.06673830013226953, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 2, "step": 2000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_batch_loss": 2.395663022994995}
|
| 87 |
+
{"condition": "smooth_low", "dropout": 0.05098406347992578, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 3, "step": 1000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 14336000, "train_batch_loss": 2.5218143463134766}
|
| 88 |
+
{"condition": "smooth_low", "dropout": 0.045000006515082035, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 3, "step": 2000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_batch_loss": 2.146873712539673}
|
| 89 |
+
{"condition": "smooth_low", "dropout": 0.17803874120648827, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 0, "step": 1000, "steps": 2000, "token_limit": 500000, "tokens_seen": 2048000, "train_batch_loss": 3.2876977920532227}
|
| 90 |
+
{"condition": "smooth_low", "dropout": 0.16230079361664454, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 0, "step": 2000, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 2.8498988151550293}
|
| 91 |
+
{"condition": "smooth_low", "dropout": 0.1400439632143008, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 1, "step": 1000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 2.7525501251220703}
|
| 92 |
+
{"condition": "smooth_low", "dropout": 0.11452606249945704, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 1, "step": 2000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_batch_loss": 2.5650553703308105}
|
| 93 |
+
{"condition": "smooth_low", "dropout": 0.0890049039721133, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 2, "step": 1000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 10240000, "train_batch_loss": 2.7159547805786133}
|
| 94 |
+
{"condition": "smooth_low", "dropout": 0.06673830013226953, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 2, "step": 2000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_batch_loss": 2.319474220275879}
|
| 95 |
+
{"condition": "smooth_low", "dropout": 0.05098406347992578, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 3, "step": 1000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 14336000, "train_batch_loss": 2.4393768310546875}
|
| 96 |
+
{"condition": "smooth_low", "dropout": 0.045000006515082035, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 3, "step": 2000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_batch_loss": 2.3239519596099854}
|
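The trace records above follow a regular pattern that can be checked offline. The sketch below is a minimal example, assuming abridged copies of two of the seed 5 `smooth_low` records shown above; the helper names `parse_trace` and `inferred_tokens_per_step` are hypothetical and are not part of the repository. It parses the JSONL text and infers the tokens consumed per optimizer step from `stage`, `step`, `steps`, and `tokens_seen`. The per-step token count is inferred from the records, not read from `config.json`.

```python
import json

# Two records copied (with some fields dropped for brevity) from the trace above.
SAMPLE = """\
{"condition": "smooth_low", "dropout": 0.17803874120648827, "event": "train_step", "seed": 5, "stage": 0, "step": 1000, "steps": 2000, "tokens_seen": 2048000, "train_batch_loss": 3.2876977920532227}
{"condition": "smooth_low", "dropout": 0.045000006515082035, "event": "train_step", "seed": 5, "stage": 3, "step": 2000, "steps": 2000, "tokens_seen": 16384000, "train_batch_loss": 2.3239519596099854}
"""


def parse_trace(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def inferred_tokens_per_step(record):
    """Assumes `step` restarts at each stage and every stage runs `steps` steps."""
    global_step = record["stage"] * record["steps"] + record["step"]
    return record["tokens_seen"] / global_step


records = parse_trace(SAMPLE)
rates = [inferred_tokens_per_step(r) for r in records]
print(rates)
```

Both sample records give the same rate, which is consistent with a fixed number of tokens per step across stages while the logged dropout falls from stage 0 to stage 3.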
scripts/summarize_streaming_multiseed.py
CHANGED
|
@@ -192,14 +192,41 @@ def write_report(
|
|
| 192 |
paired_rows: list[dict],
|
| 193 |
metrics_paths: list[Path],
|
| 194 |
) -> None:
|
| 195 |
lines = [
|
| 196 |
"# TinyStories Multi-Seed Streaming Validation",
|
| 197 |
"",
|
| 198 |
"Date: 2026-05-30",
|
| 199 |
"",
|
| 200 |
-
"This report combines
|
| 201 |
-
"
|
| 202 |
-
"
|
| 203 |
"",
|
| 204 |
"## Sources",
|
| 205 |
"",
|
|
@@ -264,16 +291,22 @@ def write_report(
|
|
| 264 |
"",
|
| 265 |
"## Interpretation",
|
| 266 |
"",
|
| 267 |
-
"- `
|
| 268 |
-
"
|
| 269 |
-
"
|
| 270 |
-
"-
|
| 271 |
-
"
|
| 272 |
"- Static `0.12` can win early stages, but holding it fixed loses at the",
|
| 273 |
" final 4M stage.",
|
| 274 |
-
"- This
|
| 275 |
-
"
|
| 276 |
-
"
|
| 277 |
]
|
| 278 |
)
|
| 279 |
path.write_text("\n".join(lines) + "\n", encoding="utf-8")
|
| 192 |
paired_rows: list[dict],
|
| 193 |
metrics_paths: list[Path],
|
| 194 |
) -> None:
|
| 195 |
+
seed_ids = sorted({int(row["seed"]) for row in paired_rows})
|
| 196 |
+
seed_count = len(seed_ids)
|
| 197 |
+
best_row = condition_rows[0]
|
| 198 |
+
static_rows = [row for row in condition_rows if row["condition"].startswith("static_")]
|
| 199 |
+
best_static_row = min(static_rows, key=lambda row: row["mean_final_val"])
|
| 200 |
+
|
| 201 |
+
paired_win_lines = []
|
| 202 |
+
for row in condition_rows:
|
| 203 |
+
condition = row["condition"]
|
| 204 |
+
if condition.startswith("static_"):
|
| 205 |
+
continue
|
| 206 |
+
condition_deltas = [
|
| 207 |
+
item["delta_vs_best_static"]
|
| 208 |
+
for item in paired_rows
|
| 209 |
+
if item["condition"] == condition
|
| 210 |
+
]
|
| 211 |
+
wins = sum(delta < 0 for delta in condition_deltas)
|
| 212 |
+
ties = sum(delta == 0 for delta in condition_deltas)
|
| 213 |
+
worst_delta = max(condition_deltas)
|
| 214 |
+
paired_win_lines.append(
|
| 215 |
+
f"- `{condition}` beats the per-seed best static baseline in "
|
| 216 |
+
f"{wins}/{seed_count} seeds"
|
| 217 |
+
+ (f" with {ties} exact ties" if ties else "")
|
| 218 |
+
+ f"; worst paired delta is {worst_delta:+.4f}."
|
| 219 |
+
)
|
| 220 |
+
|
| 221 |
lines = [
|
| 222 |
"# TinyStories Multi-Seed Streaming Validation",
|
| 223 |
"",
|
| 224 |
"Date: 2026-05-30",
|
| 225 |
"",
|
| 226 |
+
f"This report combines {seed_count} random seeds "
|
| 227 |
+
f"({', '.join(str(seed) for seed in seed_ids)}) from saved streaming runs.",
|
| 228 |
+
"No additional training is performed by this script; it reads saved",
|
| 229 |
+
"`metrics.jsonl` files.",
|
| 230 |
"",
|
| 231 |
"## Sources",
|
| 232 |
"",
|
|
|
|
| 291 |
"",
|
| 292 |
"## Interpretation",
|
| 293 |
"",
|
| 294 |
+
f"- `{best_row['condition']}` has the best {seed_count}-seed mean final "
|
| 295 |
+
f"validation loss: {fmt(best_row['mean_final_val'])} +/- "
|
| 296 |
+
f"{fmt(best_row['std_final_val'])}.",
|
| 297 |
+
f"- The best static baseline by mean final loss is "
|
| 298 |
+
f"`{best_static_row['condition']}` at "
|
| 299 |
+
f"{fmt(best_static_row['mean_final_val'])} +/- "
|
| 300 |
+
f"{fmt(best_static_row['std_final_val'])}.",
|
| 301 |
+
"- `smooth_low` is very close to `interaction`, suggesting the exact anchor",
|
| 302 |
+
" values may not be uniquely required as long as the schedule follows the",
|
| 303 |
+
" same pressure range.",
|
| 304 |
+
*paired_win_lines,
|
| 305 |
"- Static `0.12` can win early stages, but holding it fixed loses at the",
|
| 306 |
" final 4M stage.",
|
| 307 |
+
"- This is now the TinyStories paper-grade validation gate for this narrowed",
|
| 308 |
+
" setup: five seeds, paired seed comparisons, and static baselines selected",
|
| 309 |
+
" from the same stream protocol.",
|
| 310 |
]
|
| 311 |
)
|
| 312 |
path.write_text("\n".join(lines) + "\n", encoding="utf-8")
|
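The paired win/tie tally added to `write_report` can be illustrated in isolation. This is a minimal sketch, not the script's actual code: `summarize_paired` is a hypothetical helper, and the rows in `example_rows` are invented illustrative values, not results from the saved runs. It assumes, as the diff does, that a negative `delta_vs_best_static` means the condition beat the per-seed best static baseline.

```python
def summarize_paired(paired_rows, condition):
    """Count wins and ties against the per-seed best static baseline."""
    deltas = [
        row["delta_vs_best_static"]
        for row in paired_rows
        if row["condition"] == condition
    ]
    if not deltas:
        # max() on an empty list raises ValueError, so return early instead.
        return None
    return {
        "seeds": len(deltas),
        "wins": sum(delta < 0 for delta in deltas),
        "ties": sum(delta == 0 for delta in deltas),
        "worst_delta": max(deltas),
    }


# Hypothetical rows for illustration only.
example_rows = [
    {"seed": 1, "condition": "smooth_low", "delta_vs_best_static": -0.02},
    {"seed": 2, "condition": "smooth_low", "delta_vs_best_static": -0.01},
    {"seed": 3, "condition": "smooth_low", "delta_vs_best_static": 0.0},
    {"seed": 4, "condition": "interaction", "delta_vs_best_static": -0.03},
]

summary = summarize_paired(example_rows, "smooth_low")
print(summary)
```

Two choices here differ from the diff and are assumptions of this sketch: the denominator is the number of rows for the given condition, whereas the script uses `seed_count` taken across all paired rows, and the early return guards against a condition with no paired rows.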