Mandeep Sidhu committed on
Commit ·
bf705c0
1
Parent(s): 1c065aa
Add clean previous local five-seed validation
- docs/plan.md +52 -25
- docs/previous_regime_streaming_report.md +94 -113
- runs/previous_local_streaming_report/l16_updated_formula_clean_5seed/condition_summary.csv +9 -0
- runs/previous_local_streaming_report/l16_updated_formula_clean_5seed/paired_final_deltas.csv +41 -0
- runs/previous_local_streaming_report/l16_updated_formula_clean_5seed/stage_summary.csv +41 -0
- runs/previous_local_updated_formula_clean_l16/locked_stream/20260530-174525/RESULT_SUMMARY.md +86 -0
- runs/previous_local_updated_formula_clean_l16/locked_stream/20260530-174525/config.json +222 -0
- runs/previous_local_updated_formula_clean_l16/locked_stream/20260530-174525/metrics.jsonl +0 -0
- runs/previous_local_updated_formula_clean_l16/locked_stream/20260530-174525/summary.csv +41 -0
- runs/previous_local_updated_formula_clean_l16/locked_stream/20260530-174525/summary.json +882 -0
- runs/previous_local_updated_formula_clean_l16/locked_stream/20260530-174525/trace.jsonl +0 -0
docs/plan.md
CHANGED
|
@@ -284,7 +284,7 @@ Use this order for every regime.
|
|
| 284 |
| original/local saved regime | offline backtest complete | retrospective support for interaction pressure law; do not rerun unless necessary |
|
| 285 |
| TinyStories static/coefficient regime | active | main coefficient evidence |
|
| 286 |
| TinyStories streaming regime | 5-seed validation complete | current main streaming evidence; interaction decay beats best static in 5/5 paired final-loss comparisons |
|
| 287 |
-
| original/local streaming regime |
|
| 288 |
| next new streaming regime | pending | start only after TinyStories and original/local streaming reports are reconciled |
|
| 289 |
|
| 290 |
## Current Formula Status
|
|
@@ -333,7 +333,7 @@ structure transfers, while coefficients may be regime-specific.
|
|
| 333 |
| TinyStories held-out prefix | supports pressure dependence on unique tokens |
|
| 334 |
| TinyStories held-out model | supports pressure dependence on model size |
|
| 335 |
| TinyStories streaming, 5 seeds | interaction has best mean final loss; interaction beats best static in 5/5 paired final-loss comparisons |
|
| 336 |
-
| previous/local streaming,
|
| 337 |
| cross-regime raw coefficient transfer | weaker than within-regime fit; supports regime-specific coefficients rather than universal numeric coefficients |
|
| 338 |
|
| 339 |
Latest TinyStories 5-seed streaming final-loss table:
|
|
@@ -355,9 +355,9 @@ Paired final-loss result:
|
|
| 355 |
| `baseabc` | 5/5 |
|
| 356 |
| `smooth_low` | 4/5, with the one miss only `+0.0003` |
|
| 357 |
|
| 358 |
-
The immediate risk is no longer
|
| 359 |
-
|
| 360 |
-
|
| 361 |
|
| 362 |
```text
|
| 363 |
Formula-derived dropout schedules track the moving useful dropout region and
|
|
@@ -370,8 +370,36 @@ The stronger claim:
|
|
| 370 |
Formula-derived dropout decay beats the best static dropout.
|
| 371 |
```
|
| 372 |
|
| 373 |
-
is supported at `n=5`
|
| 374 |
-
beating the per-seed best static baseline in all
|
| 375 |
|
| 376 |
## Completed Static Backtest Gate
|
| 377 |
|
|
@@ -395,25 +423,23 @@ streaming multi-seed reports for each regime.
|
|
| 395 |
|
| 396 |
## Immediate Next Action
|
| 397 |
|
| 398 |
-
Reconcile the TinyStories five-seed report and previous/local
|
| 399 |
-
|
| 400 |
-
|
| 401 |
-
|
|
|
|
| 402 |
|
| 403 |
## Next Training After Current Gate
|
| 404 |
|
| 405 |
-
No MPS training should launch until the two completed
|
| 406 |
-
read together.
|
| 407 |
-
|
| 408 |
-
conditions. If external validity is the limiting issue, use a third held-out
|
| 409 |
-
regime instead:
|
| 410 |
|
| 411 |
```text
|
| 412 |
completed: TinyStories 5-seed streaming report
|
| 413 |
-
completed: previous/local
|
| 414 |
-
|
| 415 |
-
|
| 416 |
-
avoid: broad new sweep before choosing A vs B
|
| 417 |
```
|
| 418 |
|
| 419 |
Evaluate with paired seed comparisons:
|
|
@@ -426,10 +452,11 @@ decay minus best-static delta per seed
|
|
| 426 |
rank consistency across seeds
|
| 427 |
```
|
| 428 |
|
| 429 |
-
|
| 430 |
-
streaming claim
|
| 431 |
-
|
| 432 |
-
|
|
|
|
| 433 |
|
| 434 |
Latest streaming report:
|
| 435 |
|
|
@@ -437,5 +464,5 @@ Latest streaming report:
|
|
| 437 |
docs/streaming_multiseed_validation_report.md
|
| 438 |
docs/previous_regime_streaming_report.md
|
| 439 |
runs/streaming_tinystories_multiseed_validation_l12/combined_5seed_summary/
|
| 440 |
-
runs/previous_local_streaming_report/
|
| 441 |
```
|
|
|
|
| 284 |
| original/local saved regime | offline backtest complete | retrospective support for interaction pressure law; do not rerun unless necessary |
|
| 285 |
| TinyStories static/coefficient regime | active | main coefficient evidence |
|
| 286 |
| TinyStories streaming regime | 5-seed validation complete | current main streaming evidence; interaction decay beats best static in 5/5 paired final-loss comparisons |
|
| 287 |
+
| original/local streaming regime | 5-seed clean validation complete | previous/local interaction decay beats best static in 5/5 paired final-loss comparisons |
|
| 288 |
| next new streaming regime | pending | start only after TinyStories and original/local streaming reports are reconciled |
|
| 289 |
|
| 290 |
## Current Formula Status
|
|
|
|
| 333 |
| TinyStories held-out prefix | supports pressure dependence on unique tokens |
|
| 334 |
| TinyStories held-out model | supports pressure dependence on model size |
|
| 335 |
| TinyStories streaming, 5 seeds | interaction has best mean final loss; interaction beats best static in 5/5 paired final-loss comparisons |
|
| 336 |
+
| previous/local streaming, 5 seeds | interaction decay has best mean final loss; top decay schedules beat best static in 5/5 paired comparisons |
|
| 337 |
| cross-regime raw coefficient transfer | weaker than within-regime fit; supports regime-specific coefficients rather than universal numeric coefficients |
|
| 338 |
|
| 339 |
Latest TinyStories 5-seed streaming final-loss table:
|
|
|
|
| 355 |
| `baseabc` | 5/5 |
|
| 356 |
| `smooth_low` | 4/5, with the one miss only `+0.0003` |
|
| 357 |
|
| 358 |
+
The immediate risk is no longer seed count for TinyStories or previous/local.
|
| 359 |
+
The main remaining risk is external validity beyond two tested regimes. The
|
| 360 |
+
current defensible claim is:
|
| 361 |
|
| 362 |
```text
|
| 363 |
Formula-derived dropout schedules track the moving useful dropout region and
|
|
|
|
| 370 |
Formula-derived dropout decay beats the best static dropout.
|
| 371 |
```
|
| 372 |
|
| 373 |
+
is supported at `n=5` in both the TinyStories and previous/local streaming
|
| 374 |
+
setups, with interaction decay beating the per-seed best static baseline in all
|
| 375 |
+
five seeds in both regimes.
|
| 376 |
+
|
| 377 |
+
Latest previous/local 5-seed streaming final-loss table:
|
| 378 |
+
|
| 379 |
+
| Condition | Mean final 4M validation loss | Std |
|
| 380 |
+
|---|---:|---:|
|
| 381 |
+
| `prevlocal_interaction` decay | 4.3981 | 0.0095 |
|
| 382 |
+
| `hold_30_then_decay` | 4.4052 | 0.0112 |
|
| 383 |
+
| `mild_30_to_08` | 4.4073 | 0.0085 |
|
| 384 |
+
| `fitted_l16_static_law` | 4.4124 | 0.0084 |
|
| 385 |
+
| static `0.14` | 4.4455 | 0.0120 |
|
| 386 |
+
| static `0.30` | 4.4668 | 0.0141 |
|
| 387 |
+
| static `0.02` | 4.5358 | 0.0091 |
|
| 388 |
+
| static `0.00` | 4.5943 | 0.0216 |
|
| 389 |
+
|
| 390 |
+
Paired final-loss result:
|
| 391 |
+
|
| 392 |
+
| Decay schedule | Paired wins vs best static |
|
| 393 |
+
|---|---:|
|
| 394 |
+
| `prevlocal_interaction` | 5/5 |
|
| 395 |
+
| `hold_30_then_decay` | 5/5 |
|
| 396 |
+
| `mild_30_to_08` | 5/5 |
|
| 397 |
+
| `fitted_l16_static_law` | 5/5 |
|
| 398 |
+
|
| 399 |
+
The best static baseline in the clean previous/local run is static dropout
|
| 400 |
+
`0.14`. The interaction schedule improves mean final validation loss by about
|
| 401 |
+
`0.0473` and wins every paired seed comparison. This promotes previous/local
|
| 402 |
+
from exploratory support to a second multi-seed streaming validation regime.
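
The `0.0473` improvement and the 5/5 paired result can be re-derived from the per-seed rows of the paired-delta table. The following is a minimal sketch, assuming the four-decimal values printed in the previous/local report; the variable names are illustrative and not part of the repository.

```python
# Per-seed final validation losses (seeds 1-5), rounded as printed in the report.
interaction = [4.4023, 4.4020, 4.4029, 4.3811, 4.4024]  # prevlocal_interaction
best_static = [4.4418, 4.4602, 4.4356, 4.4337, 4.4560]  # static_dropout_0.14

# Paired delta: decay minus best static, per seed. Negative means decay wins.
deltas = [d - s for d, s in zip(interaction, best_static)]
wins = sum(delta < 0 for delta in deltas)
mean_delta = sum(deltas) / len(deltas)

print(f"paired wins: {wins}/{len(deltas)}")
print(f"mean paired delta: {mean_delta:+.4f}")
```

Because the inputs are rounded, the mean paired delta can differ from the full-precision CSV value in the last digit.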
|
| 403 |
|
| 404 |
## Completed Static Backtest Gate
|
| 405 |
|
|
|
|
| 423 |
|
| 424 |
## Immediate Next Action
|
| 425 |
|
| 426 |
+
Reconcile the TinyStories five-seed report and previous/local five-seed report
|
| 427 |
+
into the paper outline. The seed-count gap is now closed. The next empirical
|
| 428 |
+
weakness is external validity, so the preferred next experiment is a third
|
| 429 |
+
held-out regime with minimal coefficient calibration, followed by narrowed
|
| 430 |
+
multi-seed streaming validation.
|
| 431 |
|
| 432 |
## Next Training After Current Gate
|
| 433 |
|
| 434 |
+
No MPS training should launch until the two completed five-seed streaming
|
| 435 |
+
reports are read together. Since previous/local seed count is no longer the
|
| 436 |
+
limiting issue, use a third held-out regime for the next validation step:
|
|
|
|
|
|
|
| 437 |
|
| 438 |
```text
|
| 439 |
completed: TinyStories 5-seed streaming report
|
| 440 |
+
completed: previous/local 5-seed clean streaming report
|
| 441 |
+
next: third held-out regime with minimal calibration
|
| 442 |
+
avoid: broad new sweep before cross-regime report reconciliation
|
|
|
|
| 443 |
```
|
| 444 |
|
| 445 |
Evaluate with paired seed comparisons:
|
|
|
|
| 452 |
rank consistency across seeds
|
| 453 |
```
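
One way to check rank consistency across seeds is to rank the conditions within each seed and compare the per-seed ranks. This is a sketch under assumptions: the helper `per_seed_ranks` is hypothetical, only three conditions are included to keep it short, and the losses are the rounded values from the previous/local report.

```python
# Final validation loss per seed (seeds 1-5), rounded as printed in the report.
final_val = {
    "prevlocal_interaction": [4.4023, 4.4020, 4.4029, 4.3811, 4.4024],
    "hold_30_then_decay": [4.3939, 4.4068, 4.4174, 4.3936, 4.4145],
    "static_dropout_0.14": [4.4418, 4.4602, 4.4356, 4.4337, 4.4560],
}


def per_seed_ranks(table):
    """Return {condition: [rank per seed]}, where rank 1 is the lowest loss."""
    n_seeds = len(next(iter(table.values())))
    ranks = {name: [] for name in table}
    for seed_idx in range(n_seeds):
        ordered = sorted(table, key=lambda name: table[name][seed_idx])
        for rank, name in enumerate(ordered, start=1):
            ranks[name].append(rank)
    return ranks


ranks = per_seed_ranks(final_val)
mean_rank = {name: sum(r) / len(r) for name, r in ranks.items()}
for name, value in sorted(mean_rank.items(), key=lambda item: item[1]):
    print(f"{name}: ranks={ranks[name]} mean_rank={value:.1f}")
```

A condition whose rank is stable across seeds is stronger evidence than one that wins only on the mean. Tied losses would need an explicit tie rule, which this sketch does not implement.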
|
| 454 |
|
| 455 |
+
Because previous/local decay wins across paired seeds, promote the cross-regime
|
| 456 |
+
streaming claim to "supported in two regimes." Do not yet claim universal
|
| 457 |
+
numeric coefficients. The next claim to test is whether the pressure-law
|
| 458 |
+
structure and regime-specific fitting procedure reproduce the win in a third
|
| 459 |
+
held-out regime.
|
| 460 |
|
| 461 |
Latest streaming report:
|
| 462 |
|
|
|
|
| 464 |
docs/streaming_multiseed_validation_report.md
|
| 465 |
docs/previous_regime_streaming_report.md
|
| 466 |
runs/streaming_tinystories_multiseed_validation_l12/combined_5seed_summary/
|
| 467 |
+
runs/previous_local_streaming_report/l16_updated_formula_clean_5seed/
|
| 468 |
```
|
docs/previous_regime_streaming_report.md
CHANGED
|
@@ -2,27 +2,28 @@
|
|
| 2 |
|
| 3 |
Date: 2026-05-30
|
| 4 |
|
| 5 |
-
This report combines
|
| 6 |
No additional training is performed by this script; it reads saved
|
| 7 |
`metrics.jsonl` files.
|
| 8 |
|
| 9 |
-
Regime: original/local saved streaming setup with L16_H8_D384, 31,457,280 parameters, five prefixes from 250k to 4M tokens, and 1,000 optimizer steps per stage. This
|
| 10 |
|
| 11 |
## Sources
|
| 12 |
|
| 13 |
-
- `runs/
|
| 14 |
|
| 15 |
## Condition Ranking By Final Loss
|
| 16 |
|
| 17 |
| Condition | Kind | N | Mean trajectory val | Std trajectory val | Mean final val | Std final val | Mean final gap | Dropout path |
|
| 18 |
|---|---|---:|---:|---:|---:|---:|---:|---|
|
| 19 |
-
| `
|
| 20 |
-
| `
|
| 21 |
-
| `
|
| 22 |
-
| `
|
| 23 |
-
| `static_dropout_0.
|
| 24 |
-
| `static_dropout_0.
|
| 25 |
-
| `static_dropout_0` | `static` |
|
|
|
|
| 26 |
|
| 27 |
## Paired Final-Loss Deltas
|
| 28 |
|
|
@@ -31,122 +32,102 @@ baseline for that seed.
|
|
| 31 |
|
| 32 |
| Seed | Condition | Final val | Best static | Best static final val | Delta vs best static |
|
| 33 |
|---:|---|---:|---|---:|---:|
|
|
|
|
| 34 |
| 1 | `hold_30_then_decay` | 4.3939 | `static_dropout_0.14` | 4.4418 | -0.0479 |
|
| 35 |
| 1 | `mild_30_to_08` | 4.3995 | `static_dropout_0.14` | 4.4418 | -0.0423 |
|
| 36 |
| 1 | `fitted_l16_static_law` | 4.4207 | `static_dropout_0.14` | 4.4418 | -0.0211 |
|
| 37 |
| 1 | `static_dropout_0.14` | 4.4418 | `static_dropout_0.14` | 4.4418 | +0.0000 |
|
| 38 |
| 1 | `static_dropout_0.3` | 4.4602 | `static_dropout_0.14` | 4.4418 | +0.0184 |
|
| 39 |
-
| 1 | `static_dropout_0.02` | 4.5402 | `static_dropout_0.14` | 4.4418 | +0.
|
| 40 |
-
| 1 | `static_dropout_0` | 4.
|
| 41 |
-
| 2 | `
|
| 42 |
-
| 2 | `
|
| 43 |
-
| 2 | `
|
| 44 |
-
| 2 | `
|
| 45 |
-
| 2 | `static_dropout_0.
|
| 46 |
-
| 2 | `static_dropout_0.
|
| 47 |
-
| 2 | `static_dropout_0` | 4.
|
| 48 |
-
|
|
| 49 |
-
| 3 | `
|
| 50 |
-
| 3 | `
|
| 51 |
-
| 3 | `
|
| 52 |
-
| 3 | `
|
| 53 |
-
| 3 | `static_dropout_0.
|
| 54 |
-
| 3 | `static_dropout_0` | 4.
|
| 55 |
|
| 56 |
## Stage Trajectory
|
| 57 |
|
| 58 |
| Stage | Prefix tokens | Condition | Dropout | N | Mean val | Std val | Mean train | Mean gap |
|
| 59 |
|---:|---:|---|---:|---:|---:|---:|---:|---:|
|
| 60 |
-
| 0 | 250,000 | `mild_30_to_08` | 0.300 |
|
| 61 |
-
| 0 | 250,000 | `
|
| 62 |
-
| 0 | 250,000 | `
|
| 63 |
-
| 0 | 250,000 | `static_dropout_0.14` | 0.140 |
|
| 64 |
-
| 0 | 250,000 | `
|
| 65 |
-
| 0 | 250,000 | `
|
| 66 |
-
| 0 | 250,000 | `
|
| 67 |
-
|
|
| 68 |
-
| 1 | 500,000 | `
|
| 69 |
-
| 1 | 500,000 | `
|
| 70 |
-
| 1 | 500,000 | `
|
| 71 |
-
| 1 | 500,000 | `
|
| 72 |
-
| 1 | 500,000 | `static_dropout_0.
|
| 73 |
-
| 1 | 500,000 | `
|
| 74 |
-
|
|
| 75 |
-
|
|
| 76 |
-
| 2 | 1,000,000 | `
|
| 77 |
-
| 2 | 1,000,000 | `
|
| 78 |
-
| 2 | 1,000,000 | `
|
| 79 |
-
| 2 | 1,000,000 | `static_dropout_0.
|
| 80 |
-
| 2 | 1,000,000 | `
|
| 81 |
-
|
|
| 82 |
-
|
|
| 83 |
-
|
|
| 84 |
-
| 3 | 2,000,000 | `
|
| 85 |
-
| 3 | 2,000,000 | `
|
| 86 |
-
| 3 | 2,000,000 | `
|
| 87 |
-
| 3 | 2,000,000 | `
|
| 88 |
-
|
|
| 89 |
-
|
|
| 90 |
-
|
|
| 91 |
-
|
|
| 92 |
-
| 4 | 4,000,000 | `
|
| 93 |
-
| 4 | 4,000,000 | `
|
| 94 |
-
| 4 | 4,000,000 | `
|
| 95 |
|
| 96 |
## Interpretation
|
| 97 |
|
| 98 |
-
- `
|
| 99 |
-
- The second-best final condition is `
|
| 100 |
-
- The best static baseline by mean final loss is `static_dropout_0.14` at 4.
|
| 101 |
-
- `
|
| 102 |
-
- `
|
| 103 |
-
- `
|
| 104 |
-
-
|
|
|
|
| 105 |
- This is a saved-run streaming validation artifact. Treat it as strong
|
| 106 |
evidence only when the tested conditions, seeds, static baselines, and
|
| 107 |
stream protocol match the claim being made.
|
| 108 |
-
|
| 109 |
-
## Supporting Exploratory Runs
|
| 110 |
-
|
| 111 |
-
The primary proof table above is the three-seed confirmation run:
|
| 112 |
-
|
| 113 |
-
```text
|
| 114 |
-
runs/stream_multiseed_confirm/locked_stream/20260526-203116/
|
| 115 |
-
```
|
| 116 |
-
|
| 117 |
-
Earlier single-seed runs are useful for interpreting how the schedule was
|
| 118 |
-
selected, but they are not counted as multi-seed proof:
|
| 119 |
-
|
| 120 |
-
| Supporting run | Role | Main reading |
|
| 121 |
-
|---|---|---|
|
| 122 |
-
| `runs/stream_schedule_search/locked_stream/20260526-171537/` | schedule search | decay schedules starting near `0.30` and ending near `0.02` to `0.08` beat static `0.14` and `0.30` at the final 4M prefix |
|
| 123 |
-
| `runs/stream_schedule_refinement/locked_stream/20260526-184506/` | endpoint and curvature refinement | several `hold_30` variants ended tightly around `4.394`, while `hold_24_then_decay` was weaker at `4.4214`, suggesting the initial dropout should not be reduced too aggressively in this regime |
|
| 124 |
-
| `runs/formula_l16_exact_multiseed/locked_stream/20260527-123806/` | coefficient-derived schedule check | `pressure_formula_l16_floor02` reached `4.4059 +/- 0.0042` over three seeds versus static `0.14` at `4.4459 +/- 0.0128` |
|
| 125 |
-
|
| 126 |
-
## Research Reading
|
| 127 |
-
|
| 128 |
-
This previous/local regime supports the same qualitative claim as the
|
| 129 |
-
TinyStories five-seed validation: a static dropout that is reasonable at one
|
| 130 |
-
stream scale is not necessarily optimal as the data prefix grows. In this
|
| 131 |
-
regime, the useful path keeps dropout high early (`0.30`) and then lowers it
|
| 132 |
-
as unique tokens and sampled tokens increase.
|
| 133 |
-
|
| 134 |
-
The strongest previous/local evidence is:
|
| 135 |
-
|
| 136 |
-
| Claim | Evidence |
|
| 137 |
-
|---|---|
|
| 138 |
-
| decay beats best static final loss | `hold_30_then_decay` beats the per-seed best static baseline in `3/3` seeds |
|
| 139 |
-
| endpoint is not uniquely fixed | `mild_30_to_08` is nearly tied with `hold_30_then_decay` |
|
| 140 |
-
| too-low early dropout is harmful | static `0.02` and `0.00` are much worse throughout the stream |
|
| 141 |
-
| too-high static dropout underuses later data | static `0.30` wins no final paired comparison despite being strong early |
|
| 142 |
-
| coefficient-derived schedules are viable | `fitted_l16_static_law` and `pressure_formula_l16_floor02` both beat static `0.14` in the saved three-seed comparisons |
|
| 143 |
-
|
| 144 |
-
Limitations:
|
| 145 |
-
|
| 146 |
-
1. This report is `n=3`, not `n=5`.
|
| 147 |
-
2. The schedules were refined inside this local regime, so this is not a
|
| 148 |
-
clean held-out-regime proof of universal coefficients.
|
| 149 |
-
3. The report still supports the cross-regime mechanism because the direction
|
| 150 |
-
of the effect matches TinyStories: high enough initial regularization
|
| 151 |
-
prevents early overfit, and lowering dropout later improves final validation
|
| 152 |
-
loss versus holding one static value fixed.
|
|
|
|
| 2 |
|
| 3 |
Date: 2026-05-30
|
| 4 |
|
| 5 |
+
This report combines 5 random seeds (1, 2, 3, 4, 5) from saved streaming runs.
|
| 6 |
No additional training is performed by this script; it reads saved
|
| 7 |
`metrics.jsonl` files.
|
| 8 |
|
| 9 |
+
Regime: original/local saved streaming setup with L16_H8_D384, 31,457,280 parameters, five prefixes from 250k to 4M tokens, and 1,000 optimizer steps per stage. This is a clean five-seed run including the updated previous/local interaction formula schedule, empirical decay schedules, and static baselines.
|
| 10 |
|
| 11 |
## Sources
|
| 12 |
|
| 13 |
+
- `runs/previous_local_updated_formula_clean_l16/locked_stream/20260530-174525/metrics.jsonl`
|
| 14 |
|
| 15 |
## Condition Ranking By Final Loss
|
| 16 |
|
| 17 |
| Condition | Kind | N | Mean trajectory val | Std trajectory val | Mean final val | Std final val | Mean final gap | Dropout path |
|
| 18 |
|---|---|---:|---:|---:|---:|---:|---:|---|
|
| 19 |
+
| `prevlocal_interaction` | `anchor_decay` | 5 | 4.8609 | 0.0046 | 4.3981 | 0.0095 | 0.3177 | `0.39 -> 0.32 -> 0.23 -> 0.14 -> 0.07` |
|
| 20 |
+
| `hold_30_then_decay` | `anchor_decay` | 5 | 4.8512 | 0.0017 | 4.4052 | 0.0112 | 0.3565 | `0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02` |
|
| 21 |
+
| `mild_30_to_08` | `anchor_decay` | 5 | 4.8509 | 0.0015 | 4.4073 | 0.0085 | 0.3337 | `0.30 -> 0.24 -> 0.18 -> 0.12 -> 0.08` |
|
| 22 |
+
| `fitted_l16_static_law` | `anchor_decay` | 5 | 4.9521 | 0.0039 | 4.4124 | 0.0084 | 0.3137 | `0.60 -> 0.40 -> 0.30 -> 0.14 -> 0.02` |
|
| 23 |
+
| `static_dropout_0.14` | `static` | 5 | 4.9051 | 0.0088 | 4.4455 | 0.0120 | 0.3289 | `0.14 -> 0.14 -> 0.14 -> 0.14 -> 0.14` |
|
| 24 |
+
| `static_dropout_0.3` | `static` | 5 | 4.8767 | 0.0019 | 4.4668 | 0.0141 | 0.2349 | `0.30 -> 0.30 -> 0.30 -> 0.30 -> 0.30` |
|
| 25 |
+
| `static_dropout_0.02` | `static` | 5 | 5.1571 | 0.0097 | 4.5358 | 0.0091 | 0.4829 | `0.02 -> 0.02 -> 0.02 -> 0.02 -> 0.02` |
|
| 26 |
+
| `static_dropout_0` | `static` | 5 | 5.2511 | 0.0160 | 4.5943 | 0.0216 | 0.5529 | `0.00 -> 0.00 -> 0.00 -> 0.00 -> 0.00` |
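
The `Dropout path` column lists one dropout value per stream stage. A small sketch of how such a path could be turned into a per-prefix schedule follows. The function names are hypothetical, and the prefix sizes are taken from the stage trajectory table in this report. The printed paths are rounded (for example, the stage table lists 0.385 where the path shows 0.39), so a parsed path is only an approximation of what the training run used.

```python
# Stage prefixes (tokens) from the report's stage trajectory table.
PREFIX_TOKENS = [250_000, 500_000, 1_000_000, 2_000_000, 4_000_000]


def parse_dropout_path(path: str) -> list[float]:
    """Parse a path such as '0.30 -> 0.30 -> 0.20' into a list of floats."""
    return [float(part) for part in path.split("->")]


def schedule_for(path: str) -> dict[int, float]:
    """Map each prefix size to the dropout used at that stage."""
    values = parse_dropout_path(path)
    if len(values) != len(PREFIX_TOKENS):
        raise ValueError("dropout path must have one value per stage")
    return dict(zip(PREFIX_TOKENS, values))


hold_30 = schedule_for("0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02")
for prefix, dropout in hold_30.items():
    print(f"{prefix:>9,} tokens -> dropout {dropout:.2f}")
```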
|
| 27 |
|
| 28 |
## Paired Final-Loss Deltas
|
| 29 |
|
|
|
|
| 32 |
|
| 33 |
| Seed | Condition | Final val | Best static | Best static final val | Delta vs best static |
|
| 34 |
|---:|---|---:|---|---:|---:|
|
| 35 |
+
| 1 | `prevlocal_interaction` | 4.4023 | `static_dropout_0.14` | 4.4418 | -0.0394 |
|
| 36 |
| 1 | `hold_30_then_decay` | 4.3939 | `static_dropout_0.14` | 4.4418 | -0.0479 |
|
| 37 |
| 1 | `mild_30_to_08` | 4.3995 | `static_dropout_0.14` | 4.4418 | -0.0423 |
|
| 38 |
| 1 | `fitted_l16_static_law` | 4.4207 | `static_dropout_0.14` | 4.4418 | -0.0211 |
|
| 39 |
| 1 | `static_dropout_0.14` | 4.4418 | `static_dropout_0.14` | 4.4418 | +0.0000 |
|
| 40 |
| 1 | `static_dropout_0.3` | 4.4602 | `static_dropout_0.14` | 4.4418 | +0.0184 |
|
| 41 |
+
| 1 | `static_dropout_0.02` | 4.5402 | `static_dropout_0.14` | 4.4418 | +0.0984 |
|
| 42 |
+
| 1 | `static_dropout_0` | 4.5704 | `static_dropout_0.14` | 4.4418 | +0.1286 |
|
| 43 |
+
| 2 | `prevlocal_interaction` | 4.4020 | `static_dropout_0.14` | 4.4602 | -0.0583 |
|
| 44 |
+
| 2 | `hold_30_then_decay` | 4.4068 | `static_dropout_0.14` | 4.4602 | -0.0534 |
|
| 45 |
+
| 2 | `mild_30_to_08` | 4.4080 | `static_dropout_0.14` | 4.4602 | -0.0522 |
|
| 46 |
+
| 2 | `fitted_l16_static_law` | 4.4136 | `static_dropout_0.14` | 4.4602 | -0.0466 |
|
| 47 |
+
| 2 | `static_dropout_0.14` | 4.4602 | `static_dropout_0.14` | 4.4602 | +0.0000 |
|
| 48 |
+
| 2 | `static_dropout_0.3` | 4.4719 | `static_dropout_0.14` | 4.4602 | +0.0117 |
|
| 49 |
+
| 2 | `static_dropout_0.02` | 4.5466 | `static_dropout_0.14` | 4.4602 | +0.0864 |
|
| 50 |
+
| 2 | `static_dropout_0` | 4.6094 | `static_dropout_0.14` | 4.4602 | +0.1492 |
|
| 51 |
+
| 3 | `prevlocal_interaction` | 4.4029 | `static_dropout_0.14` | 4.4356 | -0.0328 |
|
| 52 |
+
| 3 | `hold_30_then_decay` | 4.4174 | `static_dropout_0.14` | 4.4356 | -0.0183 |
|
| 53 |
+
| 3 | `mild_30_to_08` | 4.4151 | `static_dropout_0.14` | 4.4356 | -0.0206 |
|
| 54 |
+
| 3 | `fitted_l16_static_law` | 4.4134 | `static_dropout_0.14` | 4.4356 | -0.0223 |
|
| 55 |
+
| 3 | `static_dropout_0.14` | 4.4356 | `static_dropout_0.14` | 4.4356 | +0.0000 |
|
| 56 |
+
| 3 | `static_dropout_0.3` | 4.4758 | `static_dropout_0.14` | 4.4356 | +0.0401 |
|
| 57 |
+
| 3 | `static_dropout_0.02` | 4.5345 | `static_dropout_0.14` | 4.4356 | +0.0988 |
|
| 58 |
+
| 3 | `static_dropout_0` | 4.5928 | `static_dropout_0.14` | 4.4356 | +0.1571 |
|
| 59 |
+
| 4 | `prevlocal_interaction` | 4.3811 | `static_dropout_0.14` | 4.4337 | -0.0526 |
|
| 60 |
+
| 4 | `hold_30_then_decay` | 4.3936 | `static_dropout_0.14` | 4.4337 | -0.0400 |
|
| 61 |
+
| 4 | `mild_30_to_08` | 4.3978 | `static_dropout_0.14` | 4.4337 | -0.0359 |
|
| 62 |
+
| 4 | `fitted_l16_static_law` | 4.3983 | `static_dropout_0.14` | 4.4337 | -0.0354 |
|
| 63 |
+
| 4 | `static_dropout_0.14` | 4.4337 | `static_dropout_0.14` | 4.4337 | +0.0000 |
|
| 64 |
+
| 4 | `static_dropout_0.3` | 4.4455 | `static_dropout_0.14` | 4.4337 | +0.0118 |
|
| 65 |
+
| 4 | `static_dropout_0.02` | 4.5220 | `static_dropout_0.14` | 4.4337 | +0.0883 |
|
| 66 |
+
| 4 | `static_dropout_0` | 4.5768 | `static_dropout_0.14` | 4.4337 | +0.1432 |
|
| 67 |
+
| 5 | `prevlocal_interaction` | 4.4024 | `static_dropout_0.14` | 4.4560 | -0.0536 |
|
| 68 |
+
| 5 | `hold_30_then_decay` | 4.4145 | `static_dropout_0.14` | 4.4560 | -0.0415 |
|
| 69 |
+
| 5 | `mild_30_to_08` | 4.4161 | `static_dropout_0.14` | 4.4560 | -0.0399 |
|
| 70 |
+
| 5 | `fitted_l16_static_law` | 4.4161 | `static_dropout_0.14` | 4.4560 | -0.0399 |
|
| 71 |
+
| 5 | `static_dropout_0.14` | 4.4560 | `static_dropout_0.14` | 4.4560 | +0.0000 |
|
| 72 |
+
| 5 | `static_dropout_0.3` | 4.4805 | `static_dropout_0.14` | 4.4560 | +0.0245 |
|
| 73 |
+
| 5 | `static_dropout_0.02` | 4.5355 | `static_dropout_0.14` | 4.4560 | +0.0796 |
|
| 74 |
+
| 5 | `static_dropout_0` | 4.6219 | `static_dropout_0.14` | 4.4560 | +0.1660 |
|
| 75 |
|
| 76 |
## Stage Trajectory
|
| 77 |
|
| 78 |
| Stage | Prefix tokens | Condition | Dropout | N | Mean val | Std val | Mean train | Mean gap |
|
| 79 |
|---:|---:|---|---:|---:|---:|---:|---:|---:|
|
| 80 |
+
| 0 | 250,000 | `mild_30_to_08` | 0.300 | 5 | 5.4483 | 0.0138 | 4.4429 | 1.0054 |
|
| 81 |
+
| 0 | 250,000 | `hold_30_then_decay` | 0.300 | 5 | 5.4483 | 0.0138 | 4.4429 | 1.0054 |
|
| 82 |
+
| 0 | 250,000 | `static_dropout_0.3` | 0.300 | 5 | 5.4483 | 0.0138 | 4.4429 | 1.0054 |
|
| 83 |
+
| 0 | 250,000 | `static_dropout_0.14` | 0.140 | 5 | 5.4773 | 0.0224 | 4.0298 | 1.4475 |
|
| 84 |
+
| 0 | 250,000 | `prevlocal_interaction` | 0.385 | 5 | 5.4947 | 0.0109 | 4.6016 | 0.8930 |
|
| 85 |
+
| 0 | 250,000 | `static_dropout_0.02` | 0.020 | 5 | 5.7426 | 0.0242 | 3.5371 | 2.2055 |
|
| 86 |
+
| 0 | 250,000 | `fitted_l16_static_law` | 0.600 | 5 | 5.7842 | 0.0096 | 5.1640 | 0.6202 |
|
| 87 |
+
| 0 | 250,000 | `static_dropout_0` | 0.000 | 5 | 5.8330 | 0.0198 | 3.4443 | 2.3887 |
|
| 88 |
+
| 1 | 500,000 | `mild_30_to_08` | 0.240 | 5 | 5.0582 | 0.0159 | 4.0349 | 1.0233 |
|
| 89 |
+
| 1 | 500,000 | `static_dropout_0.3` | 0.300 | 5 | 5.0667 | 0.0173 | 4.1383 | 0.9284 |
|
| 90 |
+
| 1 | 500,000 | `hold_30_then_decay` | 0.300 | 5 | 5.0667 | 0.0173 | 4.1383 | 0.9284 |
|
| 91 |
+
| 1 | 500,000 | `prevlocal_interaction` | 0.319 | 5 | 5.0715 | 0.0118 | 4.2065 | 0.8650 |
|
| 92 |
+
| 1 | 500,000 | `static_dropout_0.14` | 0.140 | 5 | 5.1492 | 0.0070 | 3.7143 | 1.4349 |
|
| 93 |
+
| 1 | 500,000 | `fitted_l16_static_law` | 0.400 | 5 | 5.1507 | 0.0102 | 4.4632 | 0.6875 |
|
| 94 |
+
| 1 | 500,000 | `static_dropout_0.02` | 0.020 | 5 | 5.5754 | 0.0248 | 3.1246 | 2.4508 |
|
| 95 |
+
| 1 | 500,000 | `static_dropout_0` | 0.000 | 5 | 5.7175 | 0.0502 | 2.9583 | 2.7592 |
|
| 96 |
+
| 2 | 1,000,000 | `hold_30_then_decay` | 0.200 | 5 | 4.7757 | 0.0144 | 4.0378 | 0.7379 |
|
| 97 |
+
| 2 | 1,000,000 | `mild_30_to_08` | 0.180 | 5 | 4.7774 | 0.0138 | 3.9886 | 0.7888 |
|
| 98 |
+
| 2 | 1,000,000 | `prevlocal_interaction` | 0.227 | 5 | 4.7811 | 0.0084 | 4.0826 | 0.6984 |
|
| 99 |
+
| 2 | 1,000,000 | `static_dropout_0.3` | 0.300 | 5 | 4.7983 | 0.0144 | 4.1501 | 0.6481 |
|
| 100 |
+
| 2 | 1,000,000 | `fitted_l16_static_law` | 0.300 | 5 | 4.8326 | 0.0102 | 4.2632 | 0.5694 |
|
| 101 |
+
| 2 | 1,000,000 | `static_dropout_0.14` | 0.140 | 5 | 4.8490 | 0.0202 | 3.8712 | 0.9779 |
|
| 102 |
+
| 2 | 1,000,000 | `static_dropout_0.02` | 0.020 | 5 | 5.1470 | 0.0222 | 3.4615 | 1.6854 |
|
| 103 |
+
| 2 | 1,000,000 | `static_dropout_0` | 0.000 | 5 | 5.2637 | 0.0274 | 3.3260 | 1.9377 |
|
| 104 |
+
| 3 | 2,000,000 | `prevlocal_interaction` | 0.139 | 5 | 4.5590 | 0.0142 | 4.0802 | 0.4788 |
|
| 105 |
+
| 3 | 2,000,000 | `hold_30_then_decay` | 0.100 | 5 | 4.5599 | 0.0161 | 4.0445 | 0.5154 |
|
| 106 |
+
| 3 | 2,000,000 | `mild_30_to_08` | 0.120 | 5 | 4.5631 | 0.0155 | 4.0441 | 0.5190 |
|
| 107 |
+
| 3 | 2,000,000 | `fitted_l16_static_law` | 0.140 | 5 | 4.5806 | 0.0153 | 4.1471 | 0.4334 |
|
| 108 |
+
| 3 | 2,000,000 | `static_dropout_0.3` | 0.300 | 5 | 4.6035 | 0.0141 | 4.2150 | 0.3885 |
|
| 109 |
+
| 3 | 2,000,000 | `static_dropout_0.14` | 0.140 | 5 | 4.6048 | 0.0136 | 4.0399 | 0.5648 |
|
| 110 |
+
| 3 | 2,000,000 | `static_dropout_0.02` | 0.020 | 5 | 4.7847 | 0.0196 | 3.8405 | 0.9442 |
|
| 111 |
+
| 3 | 2,000,000 | `static_dropout_0` | 0.000 | 5 | 4.8472 | 0.0171 | 3.7786 | 1.0687 |
|
| 112 |
+
| 4 | 4,000,000 | `prevlocal_interaction` | 0.066 | 5 | 4.3981 | 0.0095 | 4.0805 | 0.3177 |
|
| 113 |
+
| 4 | 4,000,000 | `hold_30_then_decay` | 0.020 | 5 | 4.4052 | 0.0112 | 4.0488 | 0.3565 |
|
| 114 |
+
| 4 | 4,000,000 | `mild_30_to_08` | 0.080 | 5 | 4.4073 | 0.0085 | 4.0736 | 0.3337 |
|
| 115 |
+
| 4 | 4,000,000 | `fitted_l16_static_law` | 0.020 | 5 | 4.4124 | 0.0084 | 4.0987 | 0.3137 |
|
| 116 |
+
| 4 | 4,000,000 | `static_dropout_0.14` | 0.140 | 5 | 4.4455 | 0.0120 | 4.1165 | 0.3289 |
|
| 117 |
+
| 4 | 4,000,000 | `static_dropout_0.3` | 0.300 | 5 | 4.4668 | 0.0141 | 4.2319 | 0.2349 |
|
| 118 |
+
| 4 | 4,000,000 | `static_dropout_0.02` | 0.020 | 5 | 4.5358 | 0.0091 | 4.0529 | 0.4829 |
|
| 119 |
+
| 4 | 4,000,000 | `static_dropout_0` | 0.000 | 5 | 4.5943 | 0.0216 | 4.0414 | 0.5529 |
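
The `Mean gap` column appears to be mean validation loss minus mean train loss; the rows above agree with that reading up to rounding. The sketch below assumes that definition and uses the rounded final-stage values to show that the smallest gap and the lowest validation loss belong to different conditions.

```python
# (mean val, mean train) at the final 4M-token stage, as printed in the report.
final_stage = {
    "prevlocal_interaction": (4.3981, 4.0805),
    "hold_30_then_decay": (4.4052, 4.0488),
    "static_dropout_0.14": (4.4455, 4.1165),
    "static_dropout_0.3": (4.4668, 4.2319),
}

# Assumed definition: gap = validation loss - train loss.
gaps = {name: val - train for name, (val, train) in final_stage.items()}
smallest_gap = min(gaps, key=gaps.get)
lowest_val = min(final_stage, key=lambda name: final_stage[name][0])

print(f"smallest gap: {smallest_gap} ({gaps[smallest_gap]:.4f})")
print(f"lowest validation loss: {lowest_val} ({final_stage[lowest_val][0]:.4f})")
```

In these rows, static `0.30` has the smallest gap while `prevlocal_interaction` has the lowest validation loss, so the gap alone is not a ranking criterion.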
|
| 120 |
|
| 121 |
## Interpretation
|
| 122 |
|
| 123 |
+
- `prevlocal_interaction` has the best 5-seed mean final validation loss: 4.3981 +/- 0.0095.
|
| 124 |
+
- The second-best final condition is `hold_30_then_decay` at 4.4052 +/- 0.0112.
|
| 125 |
+
- The best static baseline by mean final loss is `static_dropout_0.14` at 4.4455 +/- 0.0120.
|
| 126 |
+
- `prevlocal_interaction` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0328.
|
| 127 |
+
- `hold_30_then_decay` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0183.
|
| 128 |
+
- `mild_30_to_08` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0206.
|
| 129 |
+
- `fitted_l16_static_law` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0211.
|
| 130 |
+
- The best first-stage mean validation loss at prefix 250,000 is 5.4483, shared by `mild_30_to_08`, `hold_30_then_decay`, and `static_dropout_0.3`, which all use dropout 0.300 at that stage; compare this with the final ranking before claiming a schedule is uniformly better.
|
| 131 |
- This is a saved-run streaming validation artifact. Treat it as strong
|
| 132 |
evidence only when the tested conditions, seeds, static baselines, and
|
| 133 |
stream protocol match the claim being made.
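
The "worst paired delta" figures in the bullets above are the least negative per-seed delta for each decay schedule. A minimal sketch, assuming the rounded deltas from the paired table (the `summary` structure is illustrative):

```python
# Delta vs best static per seed (seeds 1-5), rounded as printed in the report.
paired_deltas = {
    "prevlocal_interaction": [-0.0394, -0.0583, -0.0328, -0.0526, -0.0536],
    "hold_30_then_decay": [-0.0479, -0.0534, -0.0183, -0.0400, -0.0415],
    "mild_30_to_08": [-0.0423, -0.0522, -0.0206, -0.0359, -0.0399],
    "fitted_l16_static_law": [-0.0211, -0.0466, -0.0223, -0.0354, -0.0399],
}

# A win is a negative delta; the worst case is the delta closest to a loss.
summary = {
    name: {"wins": sum(d < 0 for d in deltas), "worst": max(deltas)}
    for name, deltas in paired_deltas.items()
}

for name, stats in summary.items():
    print(f"{name}: {stats['wins']}/5 wins, worst delta {stats['worst']:+.4f}")
```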
|
runs/previous_local_streaming_report/l16_updated_formula_clean_5seed/condition_summary.csv
ADDED
|
@@ -0,0 +1,9 @@
|
|
| 1 |
+
condition,kind,n,mean_trajectory_val,std_trajectory_val,mean_final_val,std_final_val,mean_final_gap,std_final_gap,dropout_path
|
| 2 |
+
prevlocal_interaction,anchor_decay,5,4.860862210392952,0.0046364279557658235,4.3981304407119755,0.009545784836147743,0.3176518455147743,0.007999498965173152,0.39 -> 0.32 -> 0.23 -> 0.14 -> 0.07
|
| 3 |
+
hold_30_then_decay,anchor_decay,5,4.851180048286915,0.0016753687570399134,4.405232906341553,0.011151070705538514,0.3564802721142769,0.01297330703929578,0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02
|
| 4 |
+
mild_30_to_08,anchor_decay,5,4.850860581099987,0.0014618995680224028,4.40728645324707,0.008502541215009067,0.3337064355611801,0.010359634321755684,0.30 -> 0.24 -> 0.18 -> 0.12 -> 0.08
|
| 5 |
+
fitted_l16_static_law,anchor_decay,5,4.952093484103679,0.0038574646544463683,4.412404176592827,0.00843791675235308,0.3137470245361328,0.007204760471400837,0.60 -> 0.40 -> 0.30 -> 0.14 -> 0.02
|
| 6 |
+
static_dropout_0.14,static,5,4.905146500468254,0.00876134360549518,4.44545366615057,0.012017216742245517,0.32894645929336547,0.01603071874172604,0.14 -> 0.14 -> 0.14 -> 0.14 -> 0.14
|
| 7 |
+
static_dropout_0.3,static,5,4.8767191568017,0.0019103599368448555,4.46677490323782,0.014064932048228269,0.23490906208753587,0.008922414622347311,0.30 -> 0.30 -> 0.30 -> 0.30 -> 0.30
|
| 8 |
+
static_dropout_0.02,static,5,5.157098578512668,0.009693091424804937,4.535757505893708,0.00908401354385357,0.48288719058036805,0.020126181497736668,0.02 -> 0.02 -> 0.02 -> 0.02 -> 0.02
|
| 9 |
+
static_dropout_0,static,5,5.251133863329888,0.016029529764030867,4.594272664189338,0.021638340853154137,0.5528693303465844,0.029132548047629703,0.00 -> 0.00 -> 0.00 -> 0.00 -> 0.00
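
A file with this layout can be ranked with the standard library alone. The sketch below is an assumption-labeled example, not the script that produced the report: it embeds two abbreviated rows (fewer columns, rounded values) instead of reading the real file, and the name `SAMPLE` is hypothetical.

```python
import csv
import io

# Two abbreviated rows in the style of condition_summary.csv (subset of columns).
SAMPLE = """condition,kind,n,mean_final_val,std_final_val,dropout_path
static_dropout_0.14,static,5,4.4455,0.0120,0.14 -> 0.14 -> 0.14 -> 0.14 -> 0.14
prevlocal_interaction,anchor_decay,5,4.3981,0.0095,0.39 -> 0.32 -> 0.23 -> 0.14 -> 0.07
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Lower mean final validation loss ranks first.
ranked = sorted(rows, key=lambda row: float(row["mean_final_val"]))
best = ranked[0]
best_static = next(row for row in ranked if row["kind"] == "static")

print(f"best overall: {best['condition']} ({best['mean_final_val']})")
print(f"best static:  {best_static['condition']} ({best_static['mean_final_val']})")
```

To run it against the real artifact, replace `io.StringIO(SAMPLE)` with an opened file handle for `condition_summary.csv`.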
|
runs/previous_local_streaming_report/l16_updated_formula_clean_5seed/paired_final_deltas.csv
ADDED
|
@@ -0,0 +1,41 @@
|
|
| 1 |
+
seed,condition,final_val,best_static_condition,best_static_final_val,delta_vs_best_static
|
| 2 |
+
1,prevlocal_interaction,4.402347795665264,static_dropout_0.14,4.4417688846588135,-0.03942108899354935
|
| 3 |
+
1,hold_30_then_decay,4.393906190991402,static_dropout_0.14,4.4417688846588135,-0.047862693667411804
|
| 4 |
+
1,mild_30_to_08,4.399485997855663,static_dropout_0.14,4.4417688846588135,-0.04228288680315018
|
| 5 |
+
1,fitted_l16_static_law,4.420692957937717,static_dropout_0.14,4.4417688846588135,-0.02107592672109604
|
| 6 |
+
1,static_dropout_0.14,4.4417688846588135,static_dropout_0.14,4.4417688846588135,0.0
|
| 7 |
+
1,static_dropout_0.3,4.460195399820805,static_dropout_0.14,4.4417688846588135,0.01842651516199112
|
| 8 |
+
1,static_dropout_0.02,4.5401855930686,static_dropout_0.14,4.4417688846588135,0.09841670840978622
|
| 9 |
+
1,static_dropout_0,4.570374272763729,static_dropout_0.14,4.4417688846588135,0.12860538810491562
|
| 10 |
+
2,prevlocal_interaction,4.401971310377121,static_dropout_0.14,4.460222490131855,-0.05825117975473404
|
| 11 |
+
2,hold_30_then_decay,4.406779877841473,static_dropout_0.14,4.460222490131855,-0.053442612290382385
|
| 12 |
+
2,mild_30_to_08,4.4080275148153305,static_dropout_0.14,4.460222490131855,-0.052194975316524506
|
| 13 |
+
2,fitted_l16_static_law,4.41358345746994,static_dropout_0.14,4.460222490131855,-0.046639032661914825
|
| 14 |
+
2,static_dropout_0.14,4.460222490131855,static_dropout_0.14,4.460222490131855,0.0
|
| 15 |
+
2,static_dropout_0.3,4.4719239845871925,static_dropout_0.14,4.460222490131855,0.011701494455337524
|
| 16 |
+
2,static_dropout_0.02,4.546629846096039,static_dropout_0.14,4.460222490131855,0.08640735596418381
|
| 17 |
+
2,static_dropout_0,4.609437867999077,static_dropout_0.14,4.460222490131855,0.14921537786722183
|
| 18 |
+
3,prevlocal_interaction,4.402896843850613,static_dropout_0.14,4.43564984947443,-0.032753005623817444
|
| 19 |
+
3,hold_30_then_decay,4.417374566197395,static_dropout_0.14,4.43564984947443,-0.01827528327703476
|
| 20 |
+
3,mild_30_to_08,4.415062002837658,static_dropout_0.14,4.43564984947443,-0.020587846636772156
|
| 21 |
+
3,fitted_l16_static_law,4.413399815559387,static_dropout_0.14,4.43564984947443,-0.022250033915042877
|
| 22 |
+
3,static_dropout_0.14,4.43564984947443,static_dropout_0.14,4.43564984947443,0.0
|
| 23 |
+
3,static_dropout_0.3,4.475773207843304,static_dropout_0.14,4.43564984947443,0.040123358368873596
|
| 24 |
+
3,static_dropout_0.02,4.534482300281525,static_dropout_0.14,4.43564984947443,0.09883245080709457
|
| 25 |
+
3,static_dropout_0,4.592755533754826,static_dropout_0.14,4.43564984947443,0.1571056842803955
|
| 26 |
+
4,prevlocal_interaction,4.381064593791962,static_dropout_0.14,4.433655060827732,-0.052590467035770416
|
| 27 |
+
4,hold_30_then_decay,4.3936478942632675,static_dropout_0.14,4.433655060827732,-0.04000716656446457
|
| 28 |
+
4,mild_30_to_08,4.397788874804974,static_dropout_0.14,4.433655060827732,-0.035866186022758484
|
| 29 |
+
4,fitted_l16_static_law,4.398257076740265,static_dropout_0.14,4.433655060827732,-0.035397984087467194
|
| 30 |
+
4,static_dropout_0.14,4.433655060827732,static_dropout_0.14,4.433655060827732,0.0
|
| 31 |
+
4,static_dropout_0.3,4.445499815046787,static_dropout_0.14,4.433655060827732,0.011844754219055176
|
| 32 |
+
4,static_dropout_0.02,4.52195218205452,static_dropout_0.14,4.433655060827732,0.08829712122678757
|
| 33 |
+
4,static_dropout_0,4.576848782598972,static_dropout_0.14,4.433655060827732,0.14319372177124023
|
| 34 |
+
5,prevlocal_interaction,4.402371659874916,static_dropout_0.14,4.455972045660019,-0.053600385785102844
|
| 35 |
+
5,hold_30_then_decay,4.4144560024142265,static_dropout_0.14,4.455972045660019,-0.04151604324579239
|
| 36 |
+
5,mild_30_to_08,4.416067875921726,static_dropout_0.14,4.455972045660019,-0.039904169738292694
|
| 37 |
+
5,fitted_l16_static_law,4.4160875752568245,static_dropout_0.14,4.455972045660019,-0.03988447040319443
|
| 38 |
+
5,static_dropout_0.14,4.455972045660019,static_dropout_0.14,4.455972045660019,0.0
|
| 39 |
+
5,static_dropout_0.3,4.48048210889101,static_dropout_0.14,4.455972045660019,0.024510063230991364
|
| 40 |
+
5,static_dropout_0.02,4.5355376079678535,static_dropout_0.14,4.455972045660019,0.07956556230783463
|
| 41 |
+
5,static_dropout_0,4.62194686383009,static_dropout_0.14,4.455972045660019,0.16597481817007065
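Reviewer note: the `delta_vs_best_static` column above appears to be each condition's final validation loss minus the best static baseline's final validation loss for the same seed, so a negative value means the schedule beat the baseline. The sketch below reproduces that bookkeeping for the seed-1 rows. The function name `paired_deltas` and the rule "best static = lowest final loss among conditions whose name starts with `static_`, chosen per seed" are assumptions for illustration, not taken from the repository's code.

```python
def paired_deltas(rows):
    """rows: iterable of (seed, condition, final_val) tuples.

    Returns {(seed, condition): (best_static_condition, delta)}.
    Assumption: the baseline is picked per seed as the static condition
    with the lowest final validation loss.
    """
    by_seed = {}
    for seed, condition, final_val in rows:
        by_seed.setdefault(seed, {})[condition] = final_val

    out = {}
    for seed, finals in by_seed.items():
        statics = {c: v for c, v in finals.items() if c.startswith("static_")}
        best_name = min(statics, key=statics.get)
        best_val = statics[best_name]
        for condition, final_val in finals.items():
            # Negative delta: the condition ended below the best static run.
            out[(seed, condition)] = (best_name, final_val - best_val)
    return out


# Seed-1 final validation losses copied from the CSV above.
seed1_rows = [
    (1, "prevlocal_interaction", 4.402347795665264),
    (1, "hold_30_then_decay", 4.393906190991402),
    (1, "mild_30_to_08", 4.399485997855663),
    (1, "fitted_l16_static_law", 4.420692957937717),
    (1, "static_dropout_0.14", 4.4417688846588135),
    (1, "static_dropout_0.3", 4.460195399820805),
    (1, "static_dropout_0.02", 4.5401855930686),
    (1, "static_dropout_0", 4.570374272763729),
]
deltas = paired_deltas(seed1_rows)
for (seed, condition), (baseline, delta) in deltas.items():
    print(seed, condition, baseline, f"{delta:+.6f}")
```

Recomputed deltas may differ from the stored column in the last few digits, which would be consistent with the stored values having been derived from lower-precision losses; compare with a small tolerance rather than exact equality.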
|
runs/previous_local_streaming_report/l16_updated_formula_clean_5seed/stage_summary.csv
ADDED
|
@@ -0,0 +1,41 @@
|
| 1 |
+
condition,stage,token_limit,dropout,n,mean_val,std_val,mean_train,std_train,mean_gap,std_gap
|
| 2 |
+
prevlocal_interaction,0,250000,0.385,5,5.4946602284908295,0.01093302726132647,4.6016244173049925,0.026939774812057612,0.8930358111858367,0.016010040404479016
|
| 3 |
+
prevlocal_interaction,1,500000,0.319,5,5.071460470557213,0.01179360076463939,4.20646400153637,0.03245807641301778,0.8649964690208435,0.028479060454710485
|
| 4 |
+
prevlocal_interaction,2,1000000,0.227,5,4.781069016456604,0.008428247627355752,4.08262689858675,0.023420093535722695,0.698442117869854,0.028148054160246055
|
| 5 |
+
prevlocal_interaction,3,2000000,0.139,5,4.558990895748138,0.014177173687439483,4.080180557072163,0.022080406390158354,0.4788103386759758,0.021926835907345392
|
| 6 |
+
prevlocal_interaction,4,4000000,0.066,5,4.3981304407119755,0.009545784836147743,4.080478595197201,0.009311692964638253,0.3176518455147743,0.007999498965173152
|
| 7 |
+
hold_30_then_decay,0,250000,0.3,5,5.4483301296830176,0.013828501308583057,4.442901518940926,0.027340510763309508,1.0054286107420922,0.02219730946529904
|
| 8 |
+
hold_30_then_decay,1,500000,0.3,5,5.066737350821495,0.017273545737457947,4.1383186161518095,0.03875357135925004,0.9284187346696854,0.04002925354224623
|
| 9 |
+
hold_30_then_decay,2,1000000,0.2,5,4.775730343163014,0.014352387307903692,4.037793649733066,0.02368477035230831,0.7379366934299469,0.01882967372675974
|
| 10 |
+
hold_30_then_decay,3,2000000,0.1,5,4.559869511425495,0.016051317749301037,4.044496415555477,0.019708741233353012,0.515373095870018,0.020379012527272283
|
| 11 |
+
hold_30_then_decay,4,4000000,0.02,5,4.405232906341553,0.011151070705538514,4.0487526342272755,0.007824452256379268,0.3564802721142769,0.01297330703929578
|
| 12 |
+
mild_30_to_08,0,250000,0.3,5,5.448330116271973,0.013828516502701142,4.442901518940926,0.027340586099289153,1.005428597331047,0.022197359989343385
|
| 13 |
+
mild_30_to_08,1,500000,0.24,5,5.058184179663658,0.015882720199114145,4.034893324971199,0.04033083916799125,1.023290854692459,0.04098800520602419
|
| 14 |
+
mild_30_to_08,2,1000000,0.18,5,4.777442049980164,0.013845858727658497,3.9886452093720437,0.02349137402419598,0.7887968406081199,0.018652082916074838
|
| 15 |
+
mild_30_to_08,3,2000000,0.12,5,4.563060106337071,0.015509498762185112,4.044088624417782,0.020441976517745996,0.5189714819192887,0.022529376522631556
|
| 16 |
+
mild_30_to_08,4,4000000,0.08,5,4.40728645324707,0.008502541215009067,4.07358001768589,0.0063536190340169095,0.3337064355611801,0.010359634321755684
|
| 17 |
+
fitted_l16_static_law,0,250000,0.6,5,5.7842145070433615,0.009632183754286684,5.164006796479225,0.02748612153330559,0.6202077105641365,0.018181362630120823
|
| 18 |
+
fitted_l16_static_law,1,500000,0.4,5,5.150681225955486,0.010164023432481408,4.463223123550415,0.029267257511679485,0.6874581024050712,0.024496105147219012
|
| 19 |
+
fitted_l16_static_law,2,1000000,0.3,5,4.832601730525494,0.010169544124120607,4.263189716637134,0.02333674196202296,0.569412013888359,0.023004537548591726
|
| 20 |
+
fitted_l16_static_law,3,2000000,0.14,5,4.58056578040123,0.01532149630405117,4.14712455868721,0.01706029159496315,0.4334412217140198,0.019914111395845077
|
| 21 |
+
fitted_l16_static_law,4,4000000,0.02,5,4.412404176592827,0.00843791675235308,4.098657152056694,0.01111204513074185,0.3137470245361328,0.007204760471400837
|
| 22 |
+
static_dropout_0.14,0,250000,0.14,5,5.477323499321938,0.02236835486589015,4.029827673733235,0.018556819977249093,1.4474958255887032,0.03092474074054602
|
| 23 |
+
static_dropout_0.14,1,500000,0.14,5,5.149166536331177,0.007010026540791338,3.714307613670826,0.03238913748160129,1.4348589226603508,0.031243440426199517
|
| 24 |
+
static_dropout_0.14,2,1000000,0.14,5,4.849037018418312,0.020208736415348236,3.8711691960692405,0.02974306105040781,0.9778678223490715,0.023799818088071894
|
| 25 |
+
static_dropout_0.14,3,2000000,0.14,5,4.6047517821192745,0.013619996903704912,4.039909638464451,0.025550506633378975,0.5648421436548233,0.015970945478988943
|
| 26 |
+
static_dropout_0.14,4,4000000,0.14,5,4.44545366615057,0.012017216742245517,4.116507206857205,0.014037194709348206,0.32894645929336547,0.01603071874172604
|
| 27 |
+
static_dropout_0.3,0,250000,0.3,5,5.448330155014991,0.0138285316736341,4.442901518940926,0.027340553421349313,1.005428636074066,0.022197311058747782
|
| 28 |
+
static_dropout_0.3,1,500000,0.3,5,5.066737298667431,0.017273470277214743,4.138318654894829,0.03875368971811196,0.9284186437726021,0.04002925284584238
|
| 29 |
+
static_dropout_0.3,2,1000000,0.3,5,4.79825523942709,0.01441949497608529,4.150126910209655,0.023298256740585745,0.6481283292174339,0.017421801083541605
|
| 30 |
+
static_dropout_0.3,3,2000000,0.3,5,4.603498187661171,0.014129740963263297,4.2150133237242695,0.015000678307181381,0.38848486393690107,0.01687487069399014
|
| 31 |
+
static_dropout_0.3,4,4000000,0.3,5,4.46677490323782,0.014064932048228269,4.231865841150284,0.010414934638152858,0.23490906208753587,0.008922414622347311
|
| 32 |
+
static_dropout_0.02,0,250000,0.02,5,5.742638063430786,0.024161263410536992,3.537110958993435,0.008037117123073168,2.2055271044373512,0.030737843551395496
|
| 33 |
+
static_dropout_0.02,1,500000,0.02,5,5.575391733646393,0.024791398740622035,3.124619247019291,0.031814549489392455,2.450772486627102,0.030503049251572257
|
| 34 |
+
static_dropout_0.02,2,1000000,0.02,5,5.14697041362524,0.022233878343551068,3.4615398421883583,0.03992270092195685,1.685430571436882,0.04092951267469098
|
| 35 |
+
static_dropout_0.02,3,2000000,0.02,5,4.784735175967216,0.019582585992709827,3.840523959696293,0.03097454466954304,0.9442112162709236,0.02147121638277758
|
| 36 |
+
static_dropout_0.02,4,4000000,0.02,5,4.535757505893708,0.00908401354385357,4.052870315313339,0.02163703576438587,0.48288719058036805,0.020126181497736668
|
| 37 |
+
static_dropout_0,0,250000,0.0,5,5.8329681470990185,0.019809207037273006,3.4442542552948,0.022358399496724347,2.388713891804218,0.038334133145443657
|
| 38 |
+
static_dropout_0,1,500000,0.0,5,5.717529235780239,0.05024223752386389,2.958326259255409,0.044781060309162554,2.75920297652483,0.07102439530887096
|
| 39 |
+
static_dropout_0,2,1000000,0.0,5,5.26366505920887,0.027353946222948587,3.3260142356157303,0.03607156293344983,1.9376508235931396,0.03553067411055354
|
| 40 |
+
static_dropout_0,3,2000000,0.0,5,4.847234210371971,0.0170992476167825,3.778580814599991,0.03536285448761605,1.0686533957719804,0.025604091638377884
|
| 41 |
+
static_dropout_0,4,4000000,0.0,5,4.594272664189338,0.021638340853154137,4.041403333842754,0.017193152802814336,0.5528693303465844,0.029132548047629703
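Reviewer note: in the rows above, `mean_gap` matches `mean_val - mean_train` to within floating-point noise, so the gap column reads as validation loss minus training-eval loss. A small consistency check under that assumption follows; the helper name `generalization_gap` is hypothetical, and the three-value tuples are copied from two rows of the CSV.

```python
def generalization_gap(mean_val, mean_train):
    """Gap between validation and training-eval loss (assumed definition)."""
    return mean_val - mean_train


# (mean_val, mean_train, reported mean_gap) copied from stage_summary.csv.
stage_rows = {
    ("prevlocal_interaction", 0): (5.4946602284908295, 4.6016244173049925, 0.8930358111858367),
    ("static_dropout_0", 4): (4.594272664189338, 4.041403333842754, 0.5528693303465844),
}

max_abs_error = 0.0
for key, (mean_val, mean_train, reported_gap) in stage_rows.items():
    error = abs(generalization_gap(mean_val, mean_train) - reported_gap)
    max_abs_error = max(max_abs_error, error)
    print(key, f"recomputed-vs-reported error = {error:.2e}")
```

Because the mean is linear, averaging per-seed gaps and subtracting the averaged losses give the same number, so this check cannot tell which of the two the report script actually does.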
|
runs/previous_local_updated_formula_clean_l16/locked_stream/20260530-174525/RESULT_SUMMARY.md
ADDED
|
@@ -0,0 +1,86 @@
|
| 1 |
+
# Locked Streaming Dropout Summary
|
| 2 |
+
|
| 3 |
+
Run directory: `runs/previous_local_updated_formula_clean_l16/locked_stream/20260530-174525`
|
| 4 |
+
|
| 5 |
+
Model: `L16_H8_D384` causal Transformer, 31,457,280 parameters, 16 layers, 8 heads, 384 embedding dim.
|
| 6 |
+
Training per stage: 1,000 steps. Sampled tokens are cumulative in each stage row. Seeds present: 1, 2, 3, 4, 5.
|
| 7 |
+
|
| 8 |
+
## Condition Ranking
|
| 9 |
+
|
| 10 |
+
| Condition | Kind | Final dropout | Mean trajectory val loss | Final val loss | Final gap | Dropout path |
|
| 11 |
+
|---|---|---:|---:|---:|---:|---|
|
| 12 |
+
| `mild_30_to_08` | anchor_decay | 0.08 | 4.8509 | 4.4073 | 0.3337 | 0.30 -> 0.24 -> 0.18 -> 0.12 -> 0.08 |
|
| 13 |
+
| `hold_30_then_decay` | anchor_decay | 0.02 | 4.8512 | 4.4052 | 0.3565 | 0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02 |
|
| 14 |
+
| `prevlocal_interaction` | anchor_decay | 0.07 | 4.8609 | 4.3981 | 0.3177 | 0.39 -> 0.32 -> 0.23 -> 0.14 -> 0.07 |
|
| 15 |
+
| `static_dropout_0.3` | static | 0.30 | 4.8767 | 4.4668 | 0.2349 | 0.30 -> 0.30 -> 0.30 -> 0.30 -> 0.30 |
|
| 16 |
+
| `static_dropout_0.14` | static | 0.14 | 4.9051 | 4.4455 | 0.3289 | 0.14 -> 0.14 -> 0.14 -> 0.14 -> 0.14 |
|
| 17 |
+
| `fitted_l16_static_law` | anchor_decay | 0.02 | 4.9521 | 4.4124 | 0.3137 | 0.60 -> 0.40 -> 0.30 -> 0.14 -> 0.02 |
|
| 18 |
+
| `static_dropout_0.02` | static | 0.02 | 5.1571 | 4.5358 | 0.4829 | 0.02 -> 0.02 -> 0.02 -> 0.02 -> 0.02 |
|
| 19 |
+
| `static_dropout_0` | static | 0.00 | 5.2511 | 4.5943 | 0.5529 | 0.00 -> 0.00 -> 0.00 -> 0.00 -> 0.00 |
|
| 20 |
+
|
| 21 |
+
## Stage Trajectory
|
| 22 |
+
|
| 23 |
+
### Stage 0: 250,000 Prefix Tokens
|
| 24 |
+
|
| 25 |
+
| Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
|
| 26 |
+
|---|---:|---:|---:|---:|---:|
|
| 27 |
+
| `mild_30_to_08` | 0.30 | 5.4483 | 4.4429 | 1.0054 | 5 |
|
| 28 |
+
| `hold_30_then_decay` | 0.30 | 5.4483 | 4.4429 | 1.0054 | 5 |
|
| 29 |
+
| `static_dropout_0.3` | 0.30 | 5.4483 | 4.4429 | 1.0054 | 5 |
|
| 30 |
+
| `static_dropout_0.14` | 0.14 | 5.4773 | 4.0298 | 1.4475 | 5 |
|
| 31 |
+
| `prevlocal_interaction` | 0.39 | 5.4947 | 4.6016 | 0.8930 | 5 |
|
| 32 |
+
| `static_dropout_0.02` | 0.02 | 5.7426 | 3.5371 | 2.2055 | 5 |
|
| 33 |
+
| `fitted_l16_static_law` | 0.60 | 5.7842 | 5.1640 | 0.6202 | 5 |
|
| 34 |
+
| `static_dropout_0` | 0.00 | 5.8330 | 3.4443 | 2.3887 | 5 |
|
| 35 |
+
|
| 36 |
+
### Stage 1: 500,000 Prefix Tokens
|
| 37 |
+
|
| 38 |
+
| Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
|
| 39 |
+
|---|---:|---:|---:|---:|---:|
|
| 40 |
+
| `mild_30_to_08` | 0.24 | 5.0582 | 4.0349 | 1.0233 | 5 |
|
| 41 |
+
| `static_dropout_0.3` | 0.30 | 5.0667 | 4.1383 | 0.9284 | 5 |
|
| 42 |
+
| `hold_30_then_decay` | 0.30 | 5.0667 | 4.1383 | 0.9284 | 5 |
|
| 43 |
+
| `prevlocal_interaction` | 0.32 | 5.0715 | 4.2065 | 0.8650 | 5 |
|
| 44 |
+
| `static_dropout_0.14` | 0.14 | 5.1492 | 3.7143 | 1.4349 | 5 |
|
| 45 |
+
| `fitted_l16_static_law` | 0.40 | 5.1507 | 4.4632 | 0.6875 | 5 |
|
| 46 |
+
| `static_dropout_0.02` | 0.02 | 5.5754 | 3.1246 | 2.4508 | 5 |
|
| 47 |
+
| `static_dropout_0` | 0.00 | 5.7175 | 2.9583 | 2.7592 | 5 |
|
| 48 |
+
|
| 49 |
+
### Stage 2: 1,000,000 Prefix Tokens
|
| 50 |
+
|
| 51 |
+
| Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
|
| 52 |
+
|---|---:|---:|---:|---:|---:|
|
| 53 |
+
| `hold_30_then_decay` | 0.20 | 4.7757 | 4.0378 | 0.7379 | 5 |
|
| 54 |
+
| `mild_30_to_08` | 0.18 | 4.7774 | 3.9886 | 0.7888 | 5 |
|
| 55 |
+
| `prevlocal_interaction` | 0.23 | 4.7811 | 4.0826 | 0.6984 | 5 |
|
| 56 |
+
| `static_dropout_0.3` | 0.30 | 4.7983 | 4.1501 | 0.6481 | 5 |
|
| 57 |
+
| `fitted_l16_static_law` | 0.30 | 4.8326 | 4.2632 | 0.5694 | 5 |
|
| 58 |
+
| `static_dropout_0.14` | 0.14 | 4.8490 | 3.8712 | 0.9779 | 5 |
|
| 59 |
+
| `static_dropout_0.02` | 0.02 | 5.1470 | 3.4615 | 1.6854 | 5 |
|
| 60 |
+
| `static_dropout_0` | 0.00 | 5.2637 | 3.3260 | 1.9377 | 5 |
|
| 61 |
+
|
| 62 |
+
### Stage 3: 2,000,000 Prefix Tokens
|
| 63 |
+
|
| 64 |
+
| Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
|
| 65 |
+
|---|---:|---:|---:|---:|---:|
|
| 66 |
+
| `prevlocal_interaction` | 0.14 | 4.5590 | 4.0802 | 0.4788 | 5 |
|
| 67 |
+
| `hold_30_then_decay` | 0.10 | 4.5599 | 4.0445 | 0.5154 | 5 |
|
| 68 |
+
| `mild_30_to_08` | 0.12 | 4.5631 | 4.0441 | 0.5190 | 5 |
|
| 69 |
+
| `fitted_l16_static_law` | 0.14 | 4.5806 | 4.1471 | 0.4334 | 5 |
|
| 70 |
+
| `static_dropout_0.3` | 0.30 | 4.6035 | 4.2150 | 0.3885 | 5 |
|
| 71 |
+
| `static_dropout_0.14` | 0.14 | 4.6048 | 4.0399 | 0.5648 | 5 |
|
| 72 |
+
| `static_dropout_0.02` | 0.02 | 4.7847 | 3.8405 | 0.9442 | 5 |
|
| 73 |
+
| `static_dropout_0` | 0.00 | 4.8472 | 3.7786 | 1.0687 | 5 |
|
| 74 |
+
|
| 75 |
+
### Stage 4: 4,000,000 Prefix Tokens
|
| 76 |
+
|
| 77 |
+
| Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
|
| 78 |
+
|---|---:|---:|---:|---:|---:|
|
| 79 |
+
| `prevlocal_interaction` | 0.07 | 4.3981 | 4.0805 | 0.3177 | 5 |
|
| 80 |
+
| `hold_30_then_decay` | 0.02 | 4.4052 | 4.0488 | 0.3565 | 5 |
|
| 81 |
+
| `mild_30_to_08` | 0.08 | 4.4073 | 4.0736 | 0.3337 | 5 |
|
| 82 |
+
| `fitted_l16_static_law` | 0.02 | 4.4124 | 4.0987 | 0.3137 | 5 |
|
| 83 |
+
| `static_dropout_0.14` | 0.14 | 4.4455 | 4.1165 | 0.3289 | 5 |
|
| 84 |
+
| `static_dropout_0.3` | 0.30 | 4.4668 | 4.2319 | 0.2349 | 5 |
|
| 85 |
+
| `static_dropout_0.02` | 0.02 | 4.5358 | 4.0529 | 0.4829 | 5 |
|
| 86 |
+
| `static_dropout_0` | 0.00 | 4.5943 | 4.0414 | 0.5529 | 5 |
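Reviewer note: the "Mean trajectory val loss" column in the Condition Ranking table is consistent with a plain average of the five per-stage mean validation losses listed in the Stage Trajectory tables. The sketch below checks that reading for three conditions using the rounded table values; the helper name `trajectory_mean` is hypothetical, and the unweighted average is an inference from the numbers, not something the summary states.

```python
def trajectory_mean(stage_val_losses):
    """Unweighted mean of per-stage validation losses (assumed definition)."""
    return sum(stage_val_losses) / len(stage_val_losses)


# Stage 0..4 mean val losses copied from the Stage Trajectory tables.
stage_val = {
    "mild_30_to_08": [5.4483, 5.0582, 4.7774, 4.5631, 4.4073],
    "prevlocal_interaction": [5.4947, 5.0715, 4.7811, 4.5590, 4.3981],
    "static_dropout_0": [5.8330, 5.7175, 5.2637, 4.8472, 4.5943],
}

# Values reported in the Condition Ranking table.
reported = {
    "mild_30_to_08": 4.8509,
    "prevlocal_interaction": 4.8609,
    "static_dropout_0": 5.2511,
}

trajectory = {name: trajectory_mean(vals) for name, vals in stage_val.items()}
for name, value in trajectory.items():
    print(f"{name}: recomputed {value:.4f}, reported {reported[name]:.4f}")
```

Under this reading the ranking rewards low loss across the whole stream, which is why `mild_30_to_08` ranks first on trajectory mean while `prevlocal_interaction` has the lowest final validation loss.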
|
runs/previous_local_updated_formula_clean_l16/locked_stream/20260530-174525/config.json
ADDED
|
@@ -0,0 +1,222 @@
|
| 1 |
+
{
|
| 2 |
+
"args": {
|
| 3 |
+
"mode": "locked_stream",
|
| 4 |
+
"corpus": null,
|
| 5 |
+
"corpus_glob": null,
|
| 6 |
+
"text_column": "text",
|
| 7 |
+
"use_cached_data": true,
|
| 8 |
+
"output_dir": "runs/previous_local_updated_formula_clean_l16",
|
| 9 |
+
"resume_from": null,
|
| 10 |
+
"cache_dir": ".cache/dropout_decay",
|
| 11 |
+
"models": [
|
| 12 |
+
"L16_H8_D384=16x8x384"
|
| 13 |
+
],
|
| 14 |
+
"seeds": [
|
| 15 |
+
1,
|
| 16 |
+
2,
|
| 17 |
+
3,
|
| 18 |
+
4,
|
| 19 |
+
5
|
| 20 |
+
],
|
| 21 |
+
"token_limits": [
|
| 22 |
+
5000000
|
| 23 |
+
],
|
| 24 |
+
"stream_token_caps": [
|
| 25 |
+
250000,
|
| 26 |
+
500000,
|
| 27 |
+
1000000,
|
| 28 |
+
2000000,
|
| 29 |
+
4000000
|
| 30 |
+
],
|
| 31 |
+
"val_tokens": 500000,
|
| 32 |
+
"allow_short_corpus": false,
|
| 33 |
+
"force_retokenize": false,
|
| 34 |
+
"vocab_size": 4096,
|
| 35 |
+
"tokenizer_train_chars": 10000000,
|
| 36 |
+
"block_size": 128,
|
| 37 |
+
"batch_size": 16,
|
| 38 |
+
"steps": 2000,
|
| 39 |
+
"stage_steps": 1000,
|
| 40 |
+
"dropout_rates": [
|
| 41 |
+
0.0,
|
| 42 |
+
0.02,
|
| 43 |
+
0.14,
|
| 44 |
+
0.3
|
| 45 |
+
],
|
| 46 |
+
"decays": [],
|
| 47 |
+
"anchor_decays": [
|
| 48 |
+
{
|
| 49 |
+
"name": "prevlocal_interaction",
|
| 50 |
+
"kind": "anchor_decay",
|
| 51 |
+
"initial": 0.385,
|
| 52 |
+
"final": 0.066,
|
| 53 |
+
"schedule": "log_prefix_anchor",
|
| 54 |
+
"decay_tokens": null,
|
| 55 |
+
"anchors": [
|
| 56 |
+
[
|
| 57 |
+
250000,
|
| 58 |
+
0.385
|
| 59 |
+
],
|
| 60 |
+
[
|
| 61 |
+
500000,
|
| 62 |
+
0.319
|
| 63 |
+
],
|
| 64 |
+
[
|
| 65 |
+
1000000,
|
| 66 |
+
0.227
|
| 67 |
+
],
|
| 68 |
+
[
|
| 69 |
+
2000000,
|
| 70 |
+
0.139
|
| 71 |
+
],
|
| 72 |
+
[
|
| 73 |
+
4000000,
|
| 74 |
+
0.066
|
| 75 |
+
]
|
| 76 |
+
]
|
| 77 |
+
},
|
| 78 |
+
{
|
| 79 |
+
"name": "hold_30_then_decay",
|
| 80 |
+
"kind": "anchor_decay",
|
| 81 |
+
"initial": 0.3,
|
| 82 |
+
"final": 0.02,
|
| 83 |
+
"schedule": "log_prefix_anchor",
|
| 84 |
+
"decay_tokens": null,
|
| 85 |
+
"anchors": [
|
| 86 |
+
[
|
| 87 |
+
250000,
|
| 88 |
+
0.3
|
| 89 |
+
],
|
| 90 |
+
[
|
| 91 |
+
500000,
|
| 92 |
+
0.3
|
| 93 |
+
],
|
| 94 |
+
[
|
| 95 |
+
1000000,
|
| 96 |
+
0.2
|
| 97 |
+
],
|
| 98 |
+
[
|
| 99 |
+
2000000,
|
| 100 |
+
0.1
|
| 101 |
+
],
|
| 102 |
+
[
|
| 103 |
+
4000000,
|
| 104 |
+
0.02
|
| 105 |
+
]
|
| 106 |
+
]
|
| 107 |
+
},
|
| 108 |
+
{
|
| 109 |
+
"name": "mild_30_to_08",
|
| 110 |
+
"kind": "anchor_decay",
|
| 111 |
+
"initial": 0.3,
|
| 112 |
+
"final": 0.08,
|
| 113 |
+
"schedule": "log_prefix_anchor",
|
| 114 |
+
"decay_tokens": null,
|
| 115 |
+
"anchors": [
|
| 116 |
+
[
|
| 117 |
+
250000,
|
| 118 |
+
0.3
|
| 119 |
+
],
|
| 120 |
+
[
|
| 121 |
+
500000,
|
| 122 |
+
0.24
|
| 123 |
+
],
|
| 124 |
+
[
|
| 125 |
+
1000000,
|
| 126 |
+
0.18
|
| 127 |
+
],
|
| 128 |
+
[
|
| 129 |
+
2000000,
|
| 130 |
+
0.12
|
| 131 |
+
],
|
| 132 |
+
[
|
| 133 |
+
4000000,
|
| 134 |
+
0.08
|
| 135 |
+
]
|
| 136 |
+
]
|
| 137 |
+
},
|
| 138 |
+
{
|
| 139 |
+
"name": "fitted_l16_static_law",
|
| 140 |
+
"kind": "anchor_decay",
|
| 141 |
+
"initial": 0.6,
|
| 142 |
+
"final": 0.02,
|
| 143 |
+
"schedule": "log_prefix_anchor",
|
| 144 |
+
"decay_tokens": null,
|
| 145 |
+
"anchors": [
|
| 146 |
+
[
|
| 147 |
+
250000,
|
| 148 |
+
0.6
|
| 149 |
+
],
|
| 150 |
+
[
|
| 151 |
+
500000,
|
| 152 |
+
0.4
|
| 153 |
+
],
|
| 154 |
+
[
|
| 155 |
+
1000000,
|
| 156 |
+
0.3
|
| 157 |
+
],
|
| 158 |
+
[
|
| 159 |
+
2000000,
|
| 160 |
+
0.14
|
| 161 |
+
],
|
| 162 |
+
[
|
| 163 |
+
4000000,
|
| 164 |
+
0.02
|
| 165 |
+
]
|
| 166 |
+
]
|
| 167 |
+
}
|
| 168 |
+
],
|
| 169 |
+
"decay_tokens": null,
|
| 170 |
+
"eval_batches": 64,
|
| 171 |
+
"train_eval_batches": 32,
|
| 172 |
+
"trace_eval_batches": 8,
|
| 173 |
+
"eval_every": 0,
|
| 174 |
+
"log_every": 250,
|
| 175 |
+
"lr": 0.0003,
|
| 176 |
+
"weight_decay": 0.1,
|
| 177 |
+
"grad_clip": 1.0,
|
| 178 |
+
"plateau_delta": 0.01,
|
| 179 |
+
"target_min_dropout": 0.1,
|
| 180 |
+
"min_nonzero_margin": 0.01,
|
| 181 |
+
"min_high_dropout_margin": 0.03,
|
| 182 |
+
"screen_early_stop": false,
|
| 183 |
+
"screen_prune_patience": 3,
|
| 184 |
+
"screen_prune_min_delta": 0.01
|
| 185 |
+
},
|
| 186 |
+
"mode": "locked_stream",
|
| 187 |
+
"seeds": [
|
| 188 |
+
1,
|
| 189 |
+
2,
|
| 190 |
+
3,
|
| 191 |
+
4,
|
| 192 |
+
5
|
| 193 |
+
],
|
| 194 |
+
"models": [
|
| 195 |
+
{
|
| 196 |
+
"model_name": "L16_H8_D384",
|
| 197 |
+
"n_layer": 16,
|
| 198 |
+
"n_head": 8,
|
| 199 |
+
"n_embd": 384
|
| 200 |
+
}
|
| 201 |
+
],
|
| 202 |
+
"device": "mps",
|
| 203 |
+
"torch": "2.12.0",
|
| 204 |
+
"python": "3.11.15 (main, Mar 3 2026, 00:52:57) [Clang 21.0.0 (clang-2100.0.123.102)]",
|
| 205 |
+
"mps_available": true,
|
| 206 |
+
"attribution": "Derived from Andrej Karpathy's nanochat project (https://github.com/karpathy/nanochat), MIT License, Copyright (c) 2025 Andrej Karpathy.",
|
| 207 |
+
"tokenizer_path": ".cache/dropout_decay/tokenizer-v4096.json",
|
| 208 |
+
"encoded_path": ".cache/dropout_decay/tokens-v4096-uint16.npy",
|
| 209 |
+
"train_tokens": 5000970,
|
| 210 |
+
"val_tokens": 500000,
|
| 211 |
+
"effective_token_limits": [
|
| 212 |
+
5000000
|
| 213 |
+
],
|
| 214 |
+
"effective_stream_token_caps": [
|
| 215 |
+
250000,
|
| 216 |
+
500000,
|
| 217 |
+
1000000,
|
| 218 |
+
2000000,
|
| 219 |
+
4000000
|
| 220 |
+
],
|
| 221 |
+
"resume_from": null
|
| 222 |
+
}
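Reviewer note: the config names the schedule `log_prefix_anchor` and gives `(token_cap, dropout)` anchor pairs, but this diff does not show how dropout is derived between anchors. In this run each stage sits exactly on an anchor, so only the anchor values are observable. The sketch below is one plausible reading, stated as an assumption: piecewise-linear interpolation in `log(tokens)`, clamped outside the anchor range. The function name `dropout_at` is hypothetical and is not the repository's implementation.

```python
import math


def dropout_at(tokens, anchors):
    """Interpolate dropout linearly in log(tokens) between anchor points.

    Assumption: values are clamped to the first/last anchor outside the range.
    anchors: list of (token_count, dropout) pairs.
    """
    anchors = sorted(anchors)
    if tokens <= anchors[0][0]:
        return anchors[0][1]
    if tokens >= anchors[-1][0]:
        return anchors[-1][1]
    for (t0, d0), (t1, d1) in zip(anchors, anchors[1:]):
        if t0 <= tokens <= t1:
            frac = (math.log(tokens) - math.log(t0)) / (math.log(t1) - math.log(t0))
            return d0 + frac * (d1 - d0)
    raise ValueError("tokens not covered by anchors")  # unreachable for sorted anchors


# Anchors for `prevlocal_interaction`, copied from config.json.
prevlocal_anchors = [
    (250000, 0.385),
    (500000, 0.319),
    (1000000, 0.227),
    (2000000, 0.139),
    (4000000, 0.066),
]

for cap, _ in prevlocal_anchors:
    print(cap, round(dropout_at(cap, prevlocal_anchors), 3))
```

At the five stage caps this returns the anchor values themselves, which matches the per-stage dropout column in `stage_summary.csv`; anything between caps is illustrative only.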
|
runs/previous_local_updated_formula_clean_l16/locked_stream/20260530-174525/metrics.jsonl
ADDED
|
The diff for this file is too large to render.
See raw diff
|
runs/previous_local_updated_formula_clean_l16/locked_stream/20260530-174525/summary.csv
ADDED
|
@@ -0,0 +1,41 @@
|
| 1 |
+
run_mode,condition,condition_kind,stage,token_limit,model_name,n_layer,n_head,n_embd,parameters,dropout_initial,dropout_final,dropout_schedule,n,mean_train_eval_loss,std_train_eval_loss,mean_val_eval_loss,std_val_eval_loss,mean_generalization_gap,std_generalization_gap
|
| 2 |
+
locked_stream,fitted_l16_static_law,anchor_decay,0,250000,L16_H8_D384,16,8,384,31457280,0.6,0.02,log_prefix_anchor,5,5.164006796479225,0.02748612153330559,5.7842145070433615,0.009632183754286684,0.6202077105641365,0.018181362630120823
|
| 3 |
+
locked_stream,hold_30_then_decay,anchor_decay,0,250000,L16_H8_D384,16,8,384,31457280,0.3,0.02,log_prefix_anchor,5,4.442901518940926,0.027340510763309508,5.4483301296830176,0.013828501308583057,1.0054286107420922,0.02219730946529904
|
| 4 |
+
locked_stream,mild_30_to_08,anchor_decay,0,250000,L16_H8_D384,16,8,384,31457280,0.3,0.08,log_prefix_anchor,5,4.442901518940926,0.027340586099289153,5.448330116271973,0.013828516502701142,1.005428597331047,0.022197359989343385
|
| 5 |
+
locked_stream,prevlocal_interaction,anchor_decay,0,250000,L16_H8_D384,16,8,384,31457280,0.385,0.066,log_prefix_anchor,5,4.6016244173049925,0.026939774812057612,5.4946602284908295,0.01093302726132647,0.8930358111858367,0.016010040404479016
|
| 6 |
+
locked_stream,static_dropout_0,static,0,250000,L16_H8_D384,16,8,384,31457280,0.0,0.0,constant,5,3.4442542552948,0.022358399496724347,5.8329681470990185,0.019809207037273006,2.388713891804218,0.038334133145443657
|
| 7 |
+
locked_stream,static_dropout_0.02,static,0,250000,L16_H8_D384,16,8,384,31457280,0.02,0.02,constant,5,3.537110958993435,0.008037117123073168,5.742638063430786,0.024161263410536992,2.2055271044373512,0.030737843551395496
|
| 8 |
+
locked_stream,static_dropout_0.14,static,0,250000,L16_H8_D384,16,8,384,31457280,0.14,0.14,constant,5,4.029827673733235,0.018556819977249093,5.477323499321938,0.02236835486589015,1.4474958255887032,0.03092474074054602
|
| 9 |
+
locked_stream,static_dropout_0.3,static,0,250000,L16_H8_D384,16,8,384,31457280,0.3,0.3,constant,5,4.442901518940926,0.027340553421349313,5.448330155014991,0.0138285316736341,1.005428636074066,0.022197311058747782
|
| 10 |
+
locked_stream,fitted_l16_static_law,anchor_decay,1,500000,L16_H8_D384,16,8,384,31457280,0.6,0.02,log_prefix_anchor,5,4.463223123550415,0.029267257511679485,5.150681225955486,0.010164023432481408,0.6874581024050712,0.024496105147219012
|
| 11 |
+
locked_stream,hold_30_then_decay,anchor_decay,1,500000,L16_H8_D384,16,8,384,31457280,0.3,0.02,log_prefix_anchor,5,4.1383186161518095,0.03875357135925004,5.066737350821495,0.017273545737457947,0.9284187346696854,0.04002925354224623
|
| 12 |
+
locked_stream,mild_30_to_08,anchor_decay,1,500000,L16_H8_D384,16,8,384,31457280,0.3,0.08,log_prefix_anchor,5,4.034893324971199,0.04033083916799125,5.058184179663658,0.015882720199114145,1.023290854692459,0.04098800520602419
|
| 13 |
+
locked_stream,prevlocal_interaction,anchor_decay,1,500000,L16_H8_D384,16,8,384,31457280,0.385,0.066,log_prefix_anchor,5,4.20646400153637,0.03245807641301778,5.071460470557213,0.01179360076463939,0.8649964690208435,0.028479060454710485
|
| 14 |
+
locked_stream,static_dropout_0,static,1,500000,L16_H8_D384,16,8,384,31457280,0.0,0.0,constant,5,2.958326259255409,0.044781060309162554,5.717529235780239,0.05024223752386389,2.75920297652483,0.07102439530887096
|
| 15 |
+
locked_stream,static_dropout_0.02,static,1,500000,L16_H8_D384,16,8,384,31457280,0.02,0.02,constant,5,3.124619247019291,0.031814549489392455,5.575391733646393,0.024791398740622035,2.450772486627102,0.030503049251572257
|
| 16 |
+
locked_stream,static_dropout_0.14,static,1,500000,L16_H8_D384,16,8,384,31457280,0.14,0.14,constant,5,3.714307613670826,0.03238913748160129,5.149166536331177,0.007010026540791338,1.4348589226603508,0.031243440426199517
|
| 17 |
+
locked_stream,static_dropout_0.3,static,1,500000,L16_H8_D384,16,8,384,31457280,0.3,0.3,constant,5,4.138318654894829,0.03875368971811196,5.066737298667431,0.017273470277214743,0.9284186437726021,0.04002925284584238
|
| 18 |
+
locked_stream,fitted_l16_static_law,anchor_decay,2,1000000,L16_H8_D384,16,8,384,31457280,0.6,0.02,log_prefix_anchor,5,4.263189716637134,0.02333674196202296,4.832601730525494,0.010169544124120607,0.569412013888359,0.023004537548591726
|
| 19 |
+
locked_stream,hold_30_then_decay,anchor_decay,2,1000000,L16_H8_D384,16,8,384,31457280,0.3,0.02,log_prefix_anchor,5,4.037793649733066,0.02368477035230831,4.775730343163014,0.014352387307903692,0.7379366934299469,0.01882967372675974
|
| 20 |
+
locked_stream,mild_30_to_08,anchor_decay,2,1000000,L16_H8_D384,16,8,384,31457280,0.3,0.08,log_prefix_anchor,5,3.9886452093720437,0.02349137402419598,4.777442049980164,0.013845858727658497,0.7887968406081199,0.018652082916074838
|
| 21 |
+
locked_stream,prevlocal_interaction,anchor_decay,2,1000000,L16_H8_D384,16,8,384,31457280,0.385,0.066,log_prefix_anchor,5,4.08262689858675,0.023420093535722695,4.781069016456604,0.008428247627355752,0.698442117869854,0.028148054160246055
|
| 22 |
+
locked_stream,static_dropout_0,static,2,1000000,L16_H8_D384,16,8,384,31457280,0.0,0.0,constant,5,3.3260142356157303,0.03607156293344983,5.26366505920887,0.027353946222948587,1.9376508235931396,0.03553067411055354
|
| 23 |
+
locked_stream,static_dropout_0.02,static,2,1000000,L16_H8_D384,16,8,384,31457280,0.02,0.02,constant,5,3.4615398421883583,0.03992270092195685,5.14697041362524,0.022233878343551068,1.685430571436882,0.04092951267469098
|
| 24 |
+
locked_stream,static_dropout_0.14,static,2,1000000,L16_H8_D384,16,8,384,31457280,0.14,0.14,constant,5,3.8711691960692405,0.02974306105040781,4.849037018418312,0.020208736415348236,0.9778678223490715,0.023799818088071894
|
| 25 |
+
locked_stream,static_dropout_0.3,static,2,1000000,L16_H8_D384,16,8,384,31457280,0.3,0.3,constant,5,4.150126910209655,0.023298256740585745,4.79825523942709,0.01441949497608529,0.6481283292174339,0.017421801083541605
|
| 26 |
+
locked_stream,fitted_l16_static_law,anchor_decay,3,2000000,L16_H8_D384,16,8,384,31457280,0.6,0.02,log_prefix_anchor,5,4.14712455868721,0.01706029159496315,4.58056578040123,0.01532149630405117,0.4334412217140198,0.019914111395845077
|
| 27 |
+
locked_stream,hold_30_then_decay,anchor_decay,3,2000000,L16_H8_D384,16,8,384,31457280,0.3,0.02,log_prefix_anchor,5,4.044496415555477,0.019708741233353012,4.559869511425495,0.016051317749301037,0.515373095870018,0.020379012527272283
|
| 28 |
+
locked_stream,mild_30_to_08,anchor_decay,3,2000000,L16_H8_D384,16,8,384,31457280,0.3,0.08,log_prefix_anchor,5,4.044088624417782,0.020441976517745996,4.563060106337071,0.015509498762185112,0.5189714819192887,0.022529376522631556
|
| 29 |
+
locked_stream,prevlocal_interaction,anchor_decay,3,2000000,L16_H8_D384,16,8,384,31457280,0.385,0.066,log_prefix_anchor,5,4.080180557072163,0.022080406390158354,4.558990895748138,0.014177173687439483,0.4788103386759758,0.021926835907345392
|
| 30 |
+
locked_stream,static_dropout_0,static,3,2000000,L16_H8_D384,16,8,384,31457280,0.0,0.0,constant,5,3.778580814599991,0.03536285448761605,4.847234210371971,0.0170992476167825,1.0686533957719804,0.025604091638377884
|
| 31 |
+
locked_stream,static_dropout_0.02,static,3,2000000,L16_H8_D384,16,8,384,31457280,0.02,0.02,constant,5,3.840523959696293,0.03097454466954304,4.784735175967216,0.019582585992709827,0.9442112162709236,0.02147121638277758
|
| 32 |
+
locked_stream,static_dropout_0.14,static,3,2000000,L16_H8_D384,16,8,384,31457280,0.14,0.14,constant,5,4.039909638464451,0.025550506633378975,4.6047517821192745,0.013619996903704912,0.5648421436548233,0.015970945478988943
|
| 33 |
+
locked_stream,static_dropout_0.3,static,3,2000000,L16_H8_D384,16,8,384,31457280,0.3,0.3,constant,5,4.2150133237242695,0.015000678307181381,4.603498187661171,0.014129740963263297,0.38848486393690107,0.01687487069399014
|
| 34 |
+
locked_stream,fitted_l16_static_law,anchor_decay,4,4000000,L16_H8_D384,16,8,384,31457280,0.6,0.02,log_prefix_anchor,5,4.098657152056694,0.01111204513074185,4.412404176592827,0.00843791675235308,0.3137470245361328,0.007204760471400837
|
| 35 |
+
locked_stream,hold_30_then_decay,anchor_decay,4,4000000,L16_H8_D384,16,8,384,31457280,0.3,0.02,log_prefix_anchor,5,4.0487526342272755,0.007824452256379268,4.405232906341553,0.011151070705538514,0.3564802721142769,0.01297330703929578
|
| 36 |
+
locked_stream,mild_30_to_08,anchor_decay,4,4000000,L16_H8_D384,16,8,384,31457280,0.3,0.08,log_prefix_anchor,5,4.07358001768589,0.0063536190340169095,4.40728645324707,0.008502541215009067,0.3337064355611801,0.010359634321755684
|
| 37 |
+
locked_stream,prevlocal_interaction,anchor_decay,4,4000000,L16_H8_D384,16,8,384,31457280,0.385,0.066,log_prefix_anchor,5,4.080478595197201,0.009311692964638253,4.3981304407119755,0.009545784836147743,0.3176518455147743,0.007999498965173152
|
| 38 |
+
locked_stream,static_dropout_0,static,4,4000000,L16_H8_D384,16,8,384,31457280,0.0,0.0,constant,5,4.041403333842754,0.017193152802814336,4.594272664189338,0.021638340853154137,0.5528693303465844,0.029132548047629703
|
| 39 |
+
locked_stream,static_dropout_0.02,static,4,4000000,L16_H8_D384,16,8,384,31457280,0.02,0.02,constant,5,4.052870315313339,0.02163703576438587,4.535757505893708,0.00908401354385357,0.48288719058036805,0.020126181497736668
|
| 40 |
+
locked_stream,static_dropout_0.14,static,4,4000000,L16_H8_D384,16,8,384,31457280,0.14,0.14,constant,5,4.116507206857205,0.014037194709348206,4.44545366615057,0.012017216742245517,0.32894645929336547,0.01603071874172604
|
| 41 |
+
locked_stream,static_dropout_0.3,static,4,4000000,L16_H8_D384,16,8,384,31457280,0.3,0.3,constant,5,4.231865841150284,0.010414934638152858,4.46677490323782,0.014064932048228269,0.23490906208753587,0.008922414622347311
|
runs/previous_local_updated_formula_clean_l16/locked_stream/20260530-174525/summary.json
ADDED
|
@@ -0,0 +1,882 @@
|
| 1 |
+
[
|
| 2 |
+
{
|
| 3 |
+
"run_mode": "locked_stream",
|
| 4 |
+
"condition": "fitted_l16_static_law",
|
| 5 |
+
"condition_kind": "anchor_decay",
|
| 6 |
+
"stage": 0,
|
| 7 |
+
"token_limit": 250000,
|
| 8 |
+
"model_name": "L16_H8_D384",
|
| 9 |
+
"n_layer": 16,
|
| 10 |
+
"n_head": 8,
|
| 11 |
+
"n_embd": 384,
|
| 12 |
+
"parameters": 31457280,
|
| 13 |
+
"dropout_initial": 0.6,
|
| 14 |
+
"dropout_final": 0.02,
|
| 15 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 16 |
+
"n": 5,
|
| 17 |
+
"mean_train_eval_loss": 5.164006796479225,
|
| 18 |
+
"std_train_eval_loss": 0.02748612153330559,
|
| 19 |
+
"mean_val_eval_loss": 5.7842145070433615,
|
| 20 |
+
"std_val_eval_loss": 0.009632183754286684,
|
| 21 |
+
"mean_generalization_gap": 0.6202077105641365,
|
| 22 |
+
"std_generalization_gap": 0.018181362630120823
|
| 23 |
+
},
|
| 24 |
+
{
|
| 25 |
+
"run_mode": "locked_stream",
|
| 26 |
+
"condition": "hold_30_then_decay",
|
| 27 |
+
"condition_kind": "anchor_decay",
|
| 28 |
+
"stage": 0,
|
| 29 |
+
"token_limit": 250000,
|
| 30 |
+
"model_name": "L16_H8_D384",
|
| 31 |
+
"n_layer": 16,
|
| 32 |
+
"n_head": 8,
|
| 33 |
+
"n_embd": 384,
|
| 34 |
+
"parameters": 31457280,
|
| 35 |
+
"dropout_initial": 0.3,
|
| 36 |
+
"dropout_final": 0.02,
|
| 37 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 38 |
+
"n": 5,
|
| 39 |
+
"mean_train_eval_loss": 4.442901518940926,
|
| 40 |
+
"std_train_eval_loss": 0.027340510763309508,
|
| 41 |
+
"mean_val_eval_loss": 5.4483301296830176,
|
| 42 |
+
"std_val_eval_loss": 0.013828501308583057,
|
| 43 |
+
"mean_generalization_gap": 1.0054286107420922,
|
| 44 |
+
"std_generalization_gap": 0.02219730946529904
|
| 45 |
+
},
|
| 46 |
+
{
|
| 47 |
+
"run_mode": "locked_stream",
|
| 48 |
+
"condition": "mild_30_to_08",
|
| 49 |
+
"condition_kind": "anchor_decay",
|
| 50 |
+
"stage": 0,
|
| 51 |
+
"token_limit": 250000,
|
| 52 |
+
"model_name": "L16_H8_D384",
|
| 53 |
+
"n_layer": 16,
|
| 54 |
+
"n_head": 8,
|
| 55 |
+
"n_embd": 384,
|
| 56 |
+
"parameters": 31457280,
|
| 57 |
+
"dropout_initial": 0.3,
|
| 58 |
+
"dropout_final": 0.08,
|
| 59 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 60 |
+
"n": 5,
|
| 61 |
+
"mean_train_eval_loss": 4.442901518940926,
|
| 62 |
+
"std_train_eval_loss": 0.027340586099289153,
|
| 63 |
+
"mean_val_eval_loss": 5.448330116271973,
|
| 64 |
+
"std_val_eval_loss": 0.013828516502701142,
|
| 65 |
+
"mean_generalization_gap": 1.005428597331047,
|
| 66 |
+
"std_generalization_gap": 0.022197359989343385
|
| 67 |
+
},
|
| 68 |
+
{
|
| 69 |
+
"run_mode": "locked_stream",
|
| 70 |
+
"condition": "prevlocal_interaction",
|
| 71 |
+
"condition_kind": "anchor_decay",
|
| 72 |
+
"stage": 0,
|
| 73 |
+
"token_limit": 250000,
|
| 74 |
+
"model_name": "L16_H8_D384",
|
| 75 |
+
"n_layer": 16,
|
| 76 |
+
"n_head": 8,
|
| 77 |
+
"n_embd": 384,
|
| 78 |
+
"parameters": 31457280,
|
| 79 |
+
"dropout_initial": 0.385,
|
| 80 |
+
"dropout_final": 0.066,
|
| 81 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 82 |
+
"n": 5,
|
| 83 |
+
"mean_train_eval_loss": 4.6016244173049925,
|
| 84 |
+
"std_train_eval_loss": 0.026939774812057612,
|
| 85 |
+
"mean_val_eval_loss": 5.4946602284908295,
|
| 86 |
+
"std_val_eval_loss": 0.01093302726132647,
|
| 87 |
+
"mean_generalization_gap": 0.8930358111858367,
|
| 88 |
+
"std_generalization_gap": 0.016010040404479016
|
| 89 |
+
},
|
| 90 |
+
{
|
| 91 |
+
"run_mode": "locked_stream",
|
| 92 |
+
"condition": "static_dropout_0",
|
| 93 |
+
"condition_kind": "static",
|
| 94 |
+
"stage": 0,
|
| 95 |
+
"token_limit": 250000,
|
| 96 |
+
"model_name": "L16_H8_D384",
|
| 97 |
+
"n_layer": 16,
|
| 98 |
+
"n_head": 8,
|
| 99 |
+
"n_embd": 384,
|
| 100 |
+
"parameters": 31457280,
|
| 101 |
+
"dropout_initial": 0.0,
|
| 102 |
+
"dropout_final": 0.0,
|
| 103 |
+
"dropout_schedule": "constant",
|
| 104 |
+
"n": 5,
|
| 105 |
+
"mean_train_eval_loss": 3.4442542552948,
|
| 106 |
+
"std_train_eval_loss": 0.022358399496724347,
|
| 107 |
+
"mean_val_eval_loss": 5.8329681470990185,
|
| 108 |
+
"std_val_eval_loss": 0.019809207037273006,
|
| 109 |
+
"mean_generalization_gap": 2.388713891804218,
|
| 110 |
+
"std_generalization_gap": 0.038334133145443657
|
| 111 |
+
},
|
| 112 |
+
{
|
| 113 |
+
"run_mode": "locked_stream",
|
| 114 |
+
"condition": "static_dropout_0.02",
|
| 115 |
+
"condition_kind": "static",
|
| 116 |
+
"stage": 0,
|
| 117 |
+
"token_limit": 250000,
|
| 118 |
+
"model_name": "L16_H8_D384",
|
| 119 |
+
"n_layer": 16,
|
| 120 |
+
"n_head": 8,
|
| 121 |
+
"n_embd": 384,
|
| 122 |
+
"parameters": 31457280,
|
| 123 |
+
"dropout_initial": 0.02,
|
| 124 |
+
"dropout_final": 0.02,
|
| 125 |
+
"dropout_schedule": "constant",
|
| 126 |
+
"n": 5,
|
| 127 |
+
"mean_train_eval_loss": 3.537110958993435,
|
| 128 |
+
"std_train_eval_loss": 0.008037117123073168,
|
| 129 |
+
"mean_val_eval_loss": 5.742638063430786,
|
| 130 |
+
"std_val_eval_loss": 0.024161263410536992,
|
| 131 |
+
"mean_generalization_gap": 2.2055271044373512,
|
| 132 |
+
"std_generalization_gap": 0.030737843551395496
|
| 133 |
+
},
|
| 134 |
+
{
|
| 135 |
+
"run_mode": "locked_stream",
|
| 136 |
+
"condition": "static_dropout_0.14",
|
| 137 |
+
"condition_kind": "static",
|
| 138 |
+
"stage": 0,
|
| 139 |
+
"token_limit": 250000,
|
| 140 |
+
"model_name": "L16_H8_D384",
|
| 141 |
+
"n_layer": 16,
|
| 142 |
+
"n_head": 8,
|
| 143 |
+
"n_embd": 384,
|
| 144 |
+
"parameters": 31457280,
|
| 145 |
+
"dropout_initial": 0.14,
|
| 146 |
+
"dropout_final": 0.14,
|
| 147 |
+
"dropout_schedule": "constant",
|
| 148 |
+
"n": 5,
|
| 149 |
+
"mean_train_eval_loss": 4.029827673733235,
|
| 150 |
+
"std_train_eval_loss": 0.018556819977249093,
|
| 151 |
+
"mean_val_eval_loss": 5.477323499321938,
|
| 152 |
+
"std_val_eval_loss": 0.02236835486589015,
|
| 153 |
+
"mean_generalization_gap": 1.4474958255887032,
|
| 154 |
+
"std_generalization_gap": 0.03092474074054602
|
| 155 |
+
},
|
| 156 |
+
{
|
| 157 |
+
"run_mode": "locked_stream",
|
| 158 |
+
"condition": "static_dropout_0.3",
|
| 159 |
+
"condition_kind": "static",
|
| 160 |
+
"stage": 0,
|
| 161 |
+
"token_limit": 250000,
|
| 162 |
+
"model_name": "L16_H8_D384",
|
| 163 |
+
"n_layer": 16,
|
| 164 |
+
"n_head": 8,
|
| 165 |
+
"n_embd": 384,
|
| 166 |
+
"parameters": 31457280,
|
| 167 |
+
"dropout_initial": 0.3,
|
| 168 |
+
"dropout_final": 0.3,
|
| 169 |
+
"dropout_schedule": "constant",
|
| 170 |
+
"n": 5,
|
| 171 |
+
"mean_train_eval_loss": 4.442901518940926,
|
| 172 |
+
"std_train_eval_loss": 0.027340553421349313,
|
| 173 |
+
"mean_val_eval_loss": 5.448330155014991,
|
| 174 |
+
"std_val_eval_loss": 0.0138285316736341,
|
| 175 |
+
"mean_generalization_gap": 1.005428636074066,
|
| 176 |
+
"std_generalization_gap": 0.022197311058747782
|
| 177 |
+
},
|
| 178 |
+
{
|
| 179 |
+
"run_mode": "locked_stream",
|
| 180 |
+
"condition": "fitted_l16_static_law",
|
| 181 |
+
"condition_kind": "anchor_decay",
|
| 182 |
+
"stage": 1,
|
| 183 |
+
"token_limit": 500000,
|
| 184 |
+
"model_name": "L16_H8_D384",
|
| 185 |
+
"n_layer": 16,
|
| 186 |
+
"n_head": 8,
|
| 187 |
+
"n_embd": 384,
|
| 188 |
+
"parameters": 31457280,
|
| 189 |
+
"dropout_initial": 0.6,
|
| 190 |
+
"dropout_final": 0.02,
|
| 191 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 192 |
+
"n": 5,
|
| 193 |
+
"mean_train_eval_loss": 4.463223123550415,
|
| 194 |
+
"std_train_eval_loss": 0.029267257511679485,
|
| 195 |
+
"mean_val_eval_loss": 5.150681225955486,
|
| 196 |
+
"std_val_eval_loss": 0.010164023432481408,
|
| 197 |
+
"mean_generalization_gap": 0.6874581024050712,
|
| 198 |
+
"std_generalization_gap": 0.024496105147219012
|
| 199 |
+
},
|
| 200 |
+
{
|
| 201 |
+
"run_mode": "locked_stream",
|
| 202 |
+
"condition": "hold_30_then_decay",
|
| 203 |
+
"condition_kind": "anchor_decay",
|
| 204 |
+
"stage": 1,
|
| 205 |
+
"token_limit": 500000,
|
| 206 |
+
"model_name": "L16_H8_D384",
|
| 207 |
+
"n_layer": 16,
|
| 208 |
+
"n_head": 8,
|
| 209 |
+
"n_embd": 384,
|
| 210 |
+
"parameters": 31457280,
|
| 211 |
+
"dropout_initial": 0.3,
|
| 212 |
+
"dropout_final": 0.02,
|
| 213 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 214 |
+
"n": 5,
|
| 215 |
+
"mean_train_eval_loss": 4.1383186161518095,
|
| 216 |
+
"std_train_eval_loss": 0.03875357135925004,
|
| 217 |
+
"mean_val_eval_loss": 5.066737350821495,
|
| 218 |
+
"std_val_eval_loss": 0.017273545737457947,
|
| 219 |
+
"mean_generalization_gap": 0.9284187346696854,
|
| 220 |
+
"std_generalization_gap": 0.04002925354224623
|
| 221 |
+
},
|
| 222 |
+
{
|
| 223 |
+
"run_mode": "locked_stream",
|
| 224 |
+
"condition": "mild_30_to_08",
|
| 225 |
+
"condition_kind": "anchor_decay",
|
| 226 |
+
"stage": 1,
|
| 227 |
+
"token_limit": 500000,
|
| 228 |
+
"model_name": "L16_H8_D384",
|
| 229 |
+
"n_layer": 16,
|
| 230 |
+
"n_head": 8,
|
| 231 |
+
"n_embd": 384,
|
| 232 |
+
"parameters": 31457280,
|
| 233 |
+
"dropout_initial": 0.3,
|
| 234 |
+
"dropout_final": 0.08,
|
| 235 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 236 |
+
"n": 5,
|
| 237 |
+
"mean_train_eval_loss": 4.034893324971199,
|
| 238 |
+
"std_train_eval_loss": 0.04033083916799125,
|
| 239 |
+
"mean_val_eval_loss": 5.058184179663658,
|
| 240 |
+
"std_val_eval_loss": 0.015882720199114145,
|
| 241 |
+
"mean_generalization_gap": 1.023290854692459,
|
| 242 |
+
"std_generalization_gap": 0.04098800520602419
|
| 243 |
+
},
|
| 244 |
+
{
|
| 245 |
+
"run_mode": "locked_stream",
|
| 246 |
+
"condition": "prevlocal_interaction",
|
| 247 |
+
"condition_kind": "anchor_decay",
|
| 248 |
+
"stage": 1,
|
| 249 |
+
"token_limit": 500000,
|
| 250 |
+
"model_name": "L16_H8_D384",
|
| 251 |
+
"n_layer": 16,
|
| 252 |
+
"n_head": 8,
|
| 253 |
+
"n_embd": 384,
|
| 254 |
+
"parameters": 31457280,
|
| 255 |
+
"dropout_initial": 0.385,
|
| 256 |
+
"dropout_final": 0.066,
|
| 257 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 258 |
+
"n": 5,
|
| 259 |
+
"mean_train_eval_loss": 4.20646400153637,
|
| 260 |
+
"std_train_eval_loss": 0.03245807641301778,
|
| 261 |
+
"mean_val_eval_loss": 5.071460470557213,
|
| 262 |
+
"std_val_eval_loss": 0.01179360076463939,
|
| 263 |
+
"mean_generalization_gap": 0.8649964690208435,
|
| 264 |
+
"std_generalization_gap": 0.028479060454710485
|
| 265 |
+
},
|
| 266 |
+
{
|
| 267 |
+
"run_mode": "locked_stream",
|
| 268 |
+
"condition": "static_dropout_0",
|
| 269 |
+
"condition_kind": "static",
|
| 270 |
+
"stage": 1,
|
| 271 |
+
"token_limit": 500000,
|
| 272 |
+
"model_name": "L16_H8_D384",
|
| 273 |
+
"n_layer": 16,
|
| 274 |
+
"n_head": 8,
|
| 275 |
+
"n_embd": 384,
|
| 276 |
+
"parameters": 31457280,
|
| 277 |
+
"dropout_initial": 0.0,
|
| 278 |
+
"dropout_final": 0.0,
|
| 279 |
+
"dropout_schedule": "constant",
|
| 280 |
+
"n": 5,
|
| 281 |
+
"mean_train_eval_loss": 2.958326259255409,
|
| 282 |
+
"std_train_eval_loss": 0.044781060309162554,
|
| 283 |
+
"mean_val_eval_loss": 5.717529235780239,
|
| 284 |
+
"std_val_eval_loss": 0.05024223752386389,
|
| 285 |
+
"mean_generalization_gap": 2.75920297652483,
|
| 286 |
+
"std_generalization_gap": 0.07102439530887096
|
| 287 |
+
},
|
| 288 |
+
{
|
| 289 |
+
"run_mode": "locked_stream",
|
| 290 |
+
"condition": "static_dropout_0.02",
|
| 291 |
+
"condition_kind": "static",
|
| 292 |
+
"stage": 1,
|
| 293 |
+
"token_limit": 500000,
|
| 294 |
+
"model_name": "L16_H8_D384",
|
| 295 |
+
"n_layer": 16,
|
| 296 |
+
"n_head": 8,
|
| 297 |
+
"n_embd": 384,
|
| 298 |
+
"parameters": 31457280,
|
| 299 |
+
"dropout_initial": 0.02,
|
| 300 |
+
"dropout_final": 0.02,
|
| 301 |
+
"dropout_schedule": "constant",
|
| 302 |
+
"n": 5,
|
| 303 |
+
"mean_train_eval_loss": 3.124619247019291,
|
| 304 |
+
"std_train_eval_loss": 0.031814549489392455,
|
| 305 |
+
"mean_val_eval_loss": 5.575391733646393,
|
| 306 |
+
"std_val_eval_loss": 0.024791398740622035,
|
| 307 |
+
"mean_generalization_gap": 2.450772486627102,
|
| 308 |
+
"std_generalization_gap": 0.030503049251572257
|
| 309 |
+
},
|
| 310 |
+
{
|
| 311 |
+
"run_mode": "locked_stream",
|
| 312 |
+
"condition": "static_dropout_0.14",
|
| 313 |
+
"condition_kind": "static",
|
| 314 |
+
"stage": 1,
|
| 315 |
+
"token_limit": 500000,
|
| 316 |
+
"model_name": "L16_H8_D384",
|
| 317 |
+
"n_layer": 16,
|
| 318 |
+
"n_head": 8,
|
| 319 |
+
"n_embd": 384,
|
| 320 |
+
"parameters": 31457280,
|
| 321 |
+
"dropout_initial": 0.14,
|
| 322 |
+
"dropout_final": 0.14,
|
| 323 |
+
"dropout_schedule": "constant",
|
| 324 |
+
"n": 5,
|
| 325 |
+
"mean_train_eval_loss": 3.714307613670826,
|
| 326 |
+
"std_train_eval_loss": 0.03238913748160129,
|
| 327 |
+
"mean_val_eval_loss": 5.149166536331177,
|
| 328 |
+
"std_val_eval_loss": 0.007010026540791338,
|
| 329 |
+
"mean_generalization_gap": 1.4348589226603508,
|
| 330 |
+
"std_generalization_gap": 0.031243440426199517
|
| 331 |
+
},
|
| 332 |
+
{
|
| 333 |
+
"run_mode": "locked_stream",
|
| 334 |
+
"condition": "static_dropout_0.3",
|
| 335 |
+
"condition_kind": "static",
|
| 336 |
+
"stage": 1,
|
| 337 |
+
"token_limit": 500000,
|
| 338 |
+
"model_name": "L16_H8_D384",
|
| 339 |
+
"n_layer": 16,
|
| 340 |
+
"n_head": 8,
|
| 341 |
+
"n_embd": 384,
|
| 342 |
+
"parameters": 31457280,
|
| 343 |
+
"dropout_initial": 0.3,
|
| 344 |
+
"dropout_final": 0.3,
|
| 345 |
+
"dropout_schedule": "constant",
|
| 346 |
+
"n": 5,
|
| 347 |
+
"mean_train_eval_loss": 4.138318654894829,
|
| 348 |
+
"std_train_eval_loss": 0.03875368971811196,
|
| 349 |
+
"mean_val_eval_loss": 5.066737298667431,
|
| 350 |
+
"std_val_eval_loss": 0.017273470277214743,
|
| 351 |
+
"mean_generalization_gap": 0.9284186437726021,
|
| 352 |
+
"std_generalization_gap": 0.04002925284584238
|
| 353 |
+
},
|
| 354 |
+
{
|
| 355 |
+
"run_mode": "locked_stream",
|
| 356 |
+
"condition": "fitted_l16_static_law",
|
| 357 |
+
"condition_kind": "anchor_decay",
|
| 358 |
+
"stage": 2,
|
| 359 |
+
"token_limit": 1000000,
|
| 360 |
+
"model_name": "L16_H8_D384",
|
| 361 |
+
"n_layer": 16,
|
| 362 |
+
"n_head": 8,
|
| 363 |
+
"n_embd": 384,
|
| 364 |
+
"parameters": 31457280,
|
| 365 |
+
"dropout_initial": 0.6,
|
| 366 |
+
"dropout_final": 0.02,
|
| 367 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 368 |
+
"n": 5,
|
| 369 |
+
"mean_train_eval_loss": 4.263189716637134,
|
| 370 |
+
"std_train_eval_loss": 0.02333674196202296,
|
| 371 |
+
"mean_val_eval_loss": 4.832601730525494,
|
| 372 |
+
"std_val_eval_loss": 0.010169544124120607,
|
| 373 |
+
"mean_generalization_gap": 0.569412013888359,
|
| 374 |
+
"std_generalization_gap": 0.023004537548591726
|
| 375 |
+
},
|
| 376 |
+
{
|
| 377 |
+
"run_mode": "locked_stream",
|
| 378 |
+
"condition": "hold_30_then_decay",
|
| 379 |
+
"condition_kind": "anchor_decay",
|
| 380 |
+
"stage": 2,
|
| 381 |
+
"token_limit": 1000000,
|
| 382 |
+
"model_name": "L16_H8_D384",
|
| 383 |
+
"n_layer": 16,
|
| 384 |
+
"n_head": 8,
|
| 385 |
+
"n_embd": 384,
|
| 386 |
+
"parameters": 31457280,
|
| 387 |
+
"dropout_initial": 0.3,
|
| 388 |
+
"dropout_final": 0.02,
|
| 389 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 390 |
+
"n": 5,
|
| 391 |
+
"mean_train_eval_loss": 4.037793649733066,
|
| 392 |
+
"std_train_eval_loss": 0.02368477035230831,
|
| 393 |
+
"mean_val_eval_loss": 4.775730343163014,
|
| 394 |
+
"std_val_eval_loss": 0.014352387307903692,
|
| 395 |
+
"mean_generalization_gap": 0.7379366934299469,
|
| 396 |
+
"std_generalization_gap": 0.01882967372675974
|
| 397 |
+
},
|
| 398 |
+
{
|
| 399 |
+
"run_mode": "locked_stream",
|
| 400 |
+
"condition": "mild_30_to_08",
|
| 401 |
+
"condition_kind": "anchor_decay",
|
| 402 |
+
"stage": 2,
|
| 403 |
+
"token_limit": 1000000,
|
| 404 |
+
"model_name": "L16_H8_D384",
|
| 405 |
+
"n_layer": 16,
|
| 406 |
+
"n_head": 8,
|
| 407 |
+
"n_embd": 384,
|
| 408 |
+
"parameters": 31457280,
|
| 409 |
+
"dropout_initial": 0.3,
|
| 410 |
+
"dropout_final": 0.08,
|
| 411 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 412 |
+
"n": 5,
|
| 413 |
+
"mean_train_eval_loss": 3.9886452093720437,
|
| 414 |
+
"std_train_eval_loss": 0.02349137402419598,
|
| 415 |
+
"mean_val_eval_loss": 4.777442049980164,
|
| 416 |
+
"std_val_eval_loss": 0.013845858727658497,
|
| 417 |
+
"mean_generalization_gap": 0.7887968406081199,
|
| 418 |
+
"std_generalization_gap": 0.018652082916074838
|
| 419 |
+
},
|
| 420 |
+
{
|
| 421 |
+
"run_mode": "locked_stream",
|
| 422 |
+
"condition": "prevlocal_interaction",
|
| 423 |
+
"condition_kind": "anchor_decay",
|
| 424 |
+
"stage": 2,
|
| 425 |
+
"token_limit": 1000000,
|
| 426 |
+
"model_name": "L16_H8_D384",
|
| 427 |
+
"n_layer": 16,
|
| 428 |
+
"n_head": 8,
|
| 429 |
+
"n_embd": 384,
|
| 430 |
+
"parameters": 31457280,
|
| 431 |
+
"dropout_initial": 0.385,
|
| 432 |
+
"dropout_final": 0.066,
|
| 433 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 434 |
+
"n": 5,
|
| 435 |
+
"mean_train_eval_loss": 4.08262689858675,
|
| 436 |
+
"std_train_eval_loss": 0.023420093535722695,
|
| 437 |
+
"mean_val_eval_loss": 4.781069016456604,
|
| 438 |
+
"std_val_eval_loss": 0.008428247627355752,
|
| 439 |
+
"mean_generalization_gap": 0.698442117869854,
|
| 440 |
+
"std_generalization_gap": 0.028148054160246055
|
| 441 |
+
},
|
| 442 |
+
{
|
| 443 |
+
"run_mode": "locked_stream",
|
| 444 |
+
"condition": "static_dropout_0",
|
| 445 |
+
"condition_kind": "static",
|
| 446 |
+
"stage": 2,
|
| 447 |
+
"token_limit": 1000000,
|
| 448 |
+
"model_name": "L16_H8_D384",
|
| 449 |
+
"n_layer": 16,
|
| 450 |
+
"n_head": 8,
|
| 451 |
+
"n_embd": 384,
|
| 452 |
+
"parameters": 31457280,
|
| 453 |
+
"dropout_initial": 0.0,
|
| 454 |
+
"dropout_final": 0.0,
|
| 455 |
+
"dropout_schedule": "constant",
|
| 456 |
+
"n": 5,
|
| 457 |
+
"mean_train_eval_loss": 3.3260142356157303,
|
| 458 |
+
"std_train_eval_loss": 0.03607156293344983,
|
| 459 |
+
"mean_val_eval_loss": 5.26366505920887,
|
| 460 |
+
"std_val_eval_loss": 0.027353946222948587,
|
| 461 |
+
"mean_generalization_gap": 1.9376508235931396,
|
| 462 |
+
"std_generalization_gap": 0.03553067411055354
|
| 463 |
+
},
|
| 464 |
+
{
|
| 465 |
+
"run_mode": "locked_stream",
|
| 466 |
+
"condition": "static_dropout_0.02",
|
| 467 |
+
"condition_kind": "static",
|
| 468 |
+
"stage": 2,
|
| 469 |
+
"token_limit": 1000000,
|
| 470 |
+
"model_name": "L16_H8_D384",
|
| 471 |
+
"n_layer": 16,
|
| 472 |
+
"n_head": 8,
|
| 473 |
+
"n_embd": 384,
|
| 474 |
+
"parameters": 31457280,
|
| 475 |
+
"dropout_initial": 0.02,
|
| 476 |
+
"dropout_final": 0.02,
|
| 477 |
+
"dropout_schedule": "constant",
|
| 478 |
+
"n": 5,
|
| 479 |
+
"mean_train_eval_loss": 3.4615398421883583,
|
| 480 |
+
"std_train_eval_loss": 0.03992270092195685,
|
| 481 |
+
"mean_val_eval_loss": 5.14697041362524,
|
| 482 |
+
"std_val_eval_loss": 0.022233878343551068,
|
| 483 |
+
"mean_generalization_gap": 1.685430571436882,
|
| 484 |
+
"std_generalization_gap": 0.04092951267469098
|
| 485 |
+
},
|
| 486 |
+
{
|
| 487 |
+
"run_mode": "locked_stream",
|
| 488 |
+
"condition": "static_dropout_0.14",
|
| 489 |
+
"condition_kind": "static",
|
| 490 |
+
"stage": 2,
|
| 491 |
+
"token_limit": 1000000,
|
| 492 |
+
"model_name": "L16_H8_D384",
|
| 493 |
+
"n_layer": 16,
|
| 494 |
+
"n_head": 8,
|
| 495 |
+
"n_embd": 384,
|
| 496 |
+
"parameters": 31457280,
|
| 497 |
+
"dropout_initial": 0.14,
|
| 498 |
+
"dropout_final": 0.14,
|
| 499 |
+
"dropout_schedule": "constant",
|
| 500 |
+
"n": 5,
|
| 501 |
+
"mean_train_eval_loss": 3.8711691960692405,
|
| 502 |
+
"std_train_eval_loss": 0.02974306105040781,
|
| 503 |
+
"mean_val_eval_loss": 4.849037018418312,
|
| 504 |
+
"std_val_eval_loss": 0.020208736415348236,
|
| 505 |
+
"mean_generalization_gap": 0.9778678223490715,
|
| 506 |
+
"std_generalization_gap": 0.023799818088071894
|
| 507 |
+
},
|
| 508 |
+
{
|
| 509 |
+
"run_mode": "locked_stream",
|
| 510 |
+
"condition": "static_dropout_0.3",
|
| 511 |
+
"condition_kind": "static",
|
| 512 |
+
"stage": 2,
|
| 513 |
+
"token_limit": 1000000,
|
| 514 |
+
"model_name": "L16_H8_D384",
|
| 515 |
+
"n_layer": 16,
|
| 516 |
+
"n_head": 8,
|
| 517 |
+
"n_embd": 384,
|
| 518 |
+
"parameters": 31457280,
|
| 519 |
+
"dropout_initial": 0.3,
|
| 520 |
+
"dropout_final": 0.3,
|
| 521 |
+
"dropout_schedule": "constant",
|
| 522 |
+
"n": 5,
|
| 523 |
+
"mean_train_eval_loss": 4.150126910209655,
|
| 524 |
+
"std_train_eval_loss": 0.023298256740585745,
|
| 525 |
+
"mean_val_eval_loss": 4.79825523942709,
|
| 526 |
+
"std_val_eval_loss": 0.01441949497608529,
|
| 527 |
+
"mean_generalization_gap": 0.6481283292174339,
|
| 528 |
+
"std_generalization_gap": 0.017421801083541605
|
| 529 |
+
},
|
| 530 |
+
{
|
| 531 |
+
"run_mode": "locked_stream",
|
| 532 |
+
"condition": "fitted_l16_static_law",
|
| 533 |
+
"condition_kind": "anchor_decay",
|
| 534 |
+
"stage": 3,
|
| 535 |
+
"token_limit": 2000000,
|
| 536 |
+
"model_name": "L16_H8_D384",
|
| 537 |
+
"n_layer": 16,
|
| 538 |
+
"n_head": 8,
|
| 539 |
+
"n_embd": 384,
|
| 540 |
+
"parameters": 31457280,
|
| 541 |
+
"dropout_initial": 0.6,
|
| 542 |
+
"dropout_final": 0.02,
|
| 543 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 544 |
+
"n": 5,
|
| 545 |
+
"mean_train_eval_loss": 4.14712455868721,
|
| 546 |
+
"std_train_eval_loss": 0.01706029159496315,
|
| 547 |
+
"mean_val_eval_loss": 4.58056578040123,
|
| 548 |
+
"std_val_eval_loss": 0.01532149630405117,
|
| 549 |
+
"mean_generalization_gap": 0.4334412217140198,
|
| 550 |
+
"std_generalization_gap": 0.019914111395845077
|
| 551 |
+
},
|
| 552 |
+
{
|
| 553 |
+
"run_mode": "locked_stream",
|
| 554 |
+
"condition": "hold_30_then_decay",
|
| 555 |
+
"condition_kind": "anchor_decay",
|
| 556 |
+
"stage": 3,
|
| 557 |
+
"token_limit": 2000000,
|
| 558 |
+
"model_name": "L16_H8_D384",
|
| 559 |
+
"n_layer": 16,
|
| 560 |
+
"n_head": 8,
|
| 561 |
+
"n_embd": 384,
|
| 562 |
+
"parameters": 31457280,
|
| 563 |
+
"dropout_initial": 0.3,
|
| 564 |
+
"dropout_final": 0.02,
|
| 565 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 566 |
+
"n": 5,
|
| 567 |
+
"mean_train_eval_loss": 4.044496415555477,
|
| 568 |
+
"std_train_eval_loss": 0.019708741233353012,
|
| 569 |
+
"mean_val_eval_loss": 4.559869511425495,
|
| 570 |
+
"std_val_eval_loss": 0.016051317749301037,
|
| 571 |
+
"mean_generalization_gap": 0.515373095870018,
|
| 572 |
+
"std_generalization_gap": 0.020379012527272283
|
| 573 |
+
},
|
| 574 |
+
{
|
| 575 |
+
"run_mode": "locked_stream",
|
| 576 |
+
"condition": "mild_30_to_08",
|
| 577 |
+
"condition_kind": "anchor_decay",
|
| 578 |
+
"stage": 3,
|
| 579 |
+
"token_limit": 2000000,
|
| 580 |
+
"model_name": "L16_H8_D384",
|
| 581 |
+
"n_layer": 16,
|
| 582 |
+
"n_head": 8,
|
| 583 |
+
"n_embd": 384,
|
| 584 |
+
"parameters": 31457280,
|
| 585 |
+
"dropout_initial": 0.3,
|
| 586 |
+
"dropout_final": 0.08,
|
| 587 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 588 |
+
"n": 5,
|
| 589 |
+
"mean_train_eval_loss": 4.044088624417782,
|
| 590 |
+
"std_train_eval_loss": 0.020441976517745996,
|
| 591 |
+
"mean_val_eval_loss": 4.563060106337071,
|
| 592 |
+
"std_val_eval_loss": 0.015509498762185112,
|
| 593 |
+
"mean_generalization_gap": 0.5189714819192887,
|
| 594 |
+
"std_generalization_gap": 0.022529376522631556
|
| 595 |
+
},
|
| 596 |
+
{
|
| 597 |
+
"run_mode": "locked_stream",
|
| 598 |
+
"condition": "prevlocal_interaction",
|
| 599 |
+
"condition_kind": "anchor_decay",
|
| 600 |
+
"stage": 3,
|
| 601 |
+
"token_limit": 2000000,
|
| 602 |
+
"model_name": "L16_H8_D384",
|
| 603 |
+
"n_layer": 16,
|
| 604 |
+
"n_head": 8,
|
| 605 |
+
"n_embd": 384,
|
| 606 |
+
"parameters": 31457280,
|
| 607 |
+
"dropout_initial": 0.385,
|
| 608 |
+
"dropout_final": 0.066,
|
| 609 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 610 |
+
"n": 5,
|
| 611 |
+
"mean_train_eval_loss": 4.080180557072163,
|
| 612 |
+
"std_train_eval_loss": 0.022080406390158354,
|
| 613 |
+
"mean_val_eval_loss": 4.558990895748138,
|
| 614 |
+
"std_val_eval_loss": 0.014177173687439483,
|
| 615 |
+
"mean_generalization_gap": 0.4788103386759758,
|
| 616 |
+
"std_generalization_gap": 0.021926835907345392
|
| 617 |
+
},
|
| 618 |
+
{
|
| 619 |
+
"run_mode": "locked_stream",
|
| 620 |
+
"condition": "static_dropout_0",
|
| 621 |
+
"condition_kind": "static",
|
| 622 |
+
"stage": 3,
|
| 623 |
+
"token_limit": 2000000,
|
| 624 |
+
"model_name": "L16_H8_D384",
|
| 625 |
+
"n_layer": 16,
|
| 626 |
+
"n_head": 8,
|
| 627 |
+
"n_embd": 384,
|
| 628 |
+
"parameters": 31457280,
|
| 629 |
+
"dropout_initial": 0.0,
|
| 630 |
+
"dropout_final": 0.0,
|
| 631 |
+
"dropout_schedule": "constant",
|
| 632 |
+
"n": 5,
|
| 633 |
+
"mean_train_eval_loss": 3.778580814599991,
|
| 634 |
+
"std_train_eval_loss": 0.03536285448761605,
|
| 635 |
+
"mean_val_eval_loss": 4.847234210371971,
|
| 636 |
+
"std_val_eval_loss": 0.0170992476167825,
|
| 637 |
+
"mean_generalization_gap": 1.0686533957719804,
|
| 638 |
+
"std_generalization_gap": 0.025604091638377884
|
| 639 |
+
},
|
| 640 |
+
{
|
| 641 |
+
"run_mode": "locked_stream",
|
| 642 |
+
"condition": "static_dropout_0.02",
|
| 643 |
+
"condition_kind": "static",
|
| 644 |
+
"stage": 3,
|
| 645 |
+
"token_limit": 2000000,
|
| 646 |
+
"model_name": "L16_H8_D384",
|
| 647 |
+
"n_layer": 16,
|
| 648 |
+
"n_head": 8,
|
| 649 |
+
"n_embd": 384,
|
| 650 |
+
"parameters": 31457280,
|
| 651 |
+
"dropout_initial": 0.02,
|
| 652 |
+
"dropout_final": 0.02,
|
| 653 |
+
"dropout_schedule": "constant",
|
| 654 |
+
"n": 5,
|
| 655 |
+
"mean_train_eval_loss": 3.840523959696293,
|
| 656 |
+
"std_train_eval_loss": 0.03097454466954304,
|
| 657 |
+
"mean_val_eval_loss": 4.784735175967216,
|
| 658 |
+
"std_val_eval_loss": 0.019582585992709827,
|
| 659 |
+
"mean_generalization_gap": 0.9442112162709236,
|
| 660 |
+
"std_generalization_gap": 0.02147121638277758
|
| 661 |
+
},
|
| 662 |
+
{
|
| 663 |
+
"run_mode": "locked_stream",
|
| 664 |
+
"condition": "static_dropout_0.14",
|
| 665 |
+
"condition_kind": "static",
|
| 666 |
+
"stage": 3,
|
| 667 |
+
"token_limit": 2000000,
|
| 668 |
+
"model_name": "L16_H8_D384",
|
| 669 |
+
"n_layer": 16,
|
| 670 |
+
"n_head": 8,
|
| 671 |
+
"n_embd": 384,
|
| 672 |
+
"parameters": 31457280,
|
| 673 |
+
"dropout_initial": 0.14,
|
| 674 |
+
"dropout_final": 0.14,
|
| 675 |
+
"dropout_schedule": "constant",
|
| 676 |
+
"n": 5,
|
| 677 |
+
"mean_train_eval_loss": 4.039909638464451,
|
| 678 |
+
"std_train_eval_loss": 0.025550506633378975,
|
| 679 |
+
"mean_val_eval_loss": 4.6047517821192745,
|
| 680 |
+
"std_val_eval_loss": 0.013619996903704912,
|
| 681 |
+
"mean_generalization_gap": 0.5648421436548233,
|
| 682 |
+
"std_generalization_gap": 0.015970945478988943
|
| 683 |
+
},
|
| 684 |
+
{
|
| 685 |
+
"run_mode": "locked_stream",
|
| 686 |
+
"condition": "static_dropout_0.3",
|
| 687 |
+
"condition_kind": "static",
|
| 688 |
+
"stage": 3,
|
| 689 |
+
"token_limit": 2000000,
|
| 690 |
+
"model_name": "L16_H8_D384",
|
| 691 |
+
"n_layer": 16,
|
| 692 |
+
"n_head": 8,
|
| 693 |
+
"n_embd": 384,
|
| 694 |
+
"parameters": 31457280,
|
| 695 |
+
"dropout_initial": 0.3,
|
| 696 |
+
"dropout_final": 0.3,
|
| 697 |
+
"dropout_schedule": "constant",
|
| 698 |
+
"n": 5,
|
| 699 |
+
"mean_train_eval_loss": 4.2150133237242695,
|
| 700 |
+
"std_train_eval_loss": 0.015000678307181381,
|
| 701 |
+
"mean_val_eval_loss": 4.603498187661171,
|
| 702 |
+
"std_val_eval_loss": 0.014129740963263297,
|
| 703 |
+
"mean_generalization_gap": 0.38848486393690107,
|
| 704 |
+
"std_generalization_gap": 0.01687487069399014
|
| 705 |
+
},
|
| 706 |
+
{
|
| 707 |
+
"run_mode": "locked_stream",
|
| 708 |
+
"condition": "fitted_l16_static_law",
|
| 709 |
+
"condition_kind": "anchor_decay",
|
| 710 |
+
"stage": 4,
|
| 711 |
+
"token_limit": 4000000,
|
| 712 |
+
"model_name": "L16_H8_D384",
|
| 713 |
+
"n_layer": 16,
|
| 714 |
+
"n_head": 8,
|
| 715 |
+
"n_embd": 384,
|
| 716 |
+
"parameters": 31457280,
|
| 717 |
+
"dropout_initial": 0.6,
|
| 718 |
+
"dropout_final": 0.02,
|
| 719 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 720 |
+
"n": 5,
|
| 721 |
+
"mean_train_eval_loss": 4.098657152056694,
|
| 722 |
+
"std_train_eval_loss": 0.01111204513074185,
|
| 723 |
+
"mean_val_eval_loss": 4.412404176592827,
|
| 724 |
+
"std_val_eval_loss": 0.00843791675235308,
|
| 725 |
+
"mean_generalization_gap": 0.3137470245361328,
|
| 726 |
+
"std_generalization_gap": 0.007204760471400837
|
| 727 |
+
},
|
| 728 |
+
{
|
| 729 |
+
"run_mode": "locked_stream",
|
| 730 |
+
"condition": "hold_30_then_decay",
|
| 731 |
+
"condition_kind": "anchor_decay",
|
| 732 |
+
"stage": 4,
|
| 733 |
+
"token_limit": 4000000,
|
| 734 |
+
"model_name": "L16_H8_D384",
|
| 735 |
+
"n_layer": 16,
|
| 736 |
+
"n_head": 8,
|
| 737 |
+
"n_embd": 384,
|
| 738 |
+
"parameters": 31457280,
|
| 739 |
+
"dropout_initial": 0.3,
|
| 740 |
+
"dropout_final": 0.02,
|
| 741 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 742 |
+
"n": 5,
|
| 743 |
+
"mean_train_eval_loss": 4.0487526342272755,
|
| 744 |
+
"std_train_eval_loss": 0.007824452256379268,
|
| 745 |
+
"mean_val_eval_loss": 4.405232906341553,
|
| 746 |
+
"std_val_eval_loss": 0.011151070705538514,
|
| 747 |
+
"mean_generalization_gap": 0.3564802721142769,
|
| 748 |
+
"std_generalization_gap": 0.01297330703929578
|
| 749 |
+
},
|
| 750 |
+
{
|
| 751 |
+
"run_mode": "locked_stream",
|
| 752 |
+
"condition": "mild_30_to_08",
|
| 753 |
+
"condition_kind": "anchor_decay",
|
| 754 |
+
"stage": 4,
|
| 755 |
+
"token_limit": 4000000,
|
| 756 |
+
"model_name": "L16_H8_D384",
|
| 757 |
+
"n_layer": 16,
|
| 758 |
+
"n_head": 8,
|
| 759 |
+
"n_embd": 384,
|
| 760 |
+
"parameters": 31457280,
|
| 761 |
+
"dropout_initial": 0.3,
|
| 762 |
+
"dropout_final": 0.08,
|
| 763 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 764 |
+
"n": 5,
|
| 765 |
+
"mean_train_eval_loss": 4.07358001768589,
|
| 766 |
+
"std_train_eval_loss": 0.0063536190340169095,
|
| 767 |
+
"mean_val_eval_loss": 4.40728645324707,
|
| 768 |
+
"std_val_eval_loss": 0.008502541215009067,
|
| 769 |
+
"mean_generalization_gap": 0.3337064355611801,
|
| 770 |
+
"std_generalization_gap": 0.010359634321755684
|
| 771 |
+
},
|
| 772 |
+
{
|
| 773 |
+
"run_mode": "locked_stream",
|
| 774 |
+
"condition": "prevlocal_interaction",
|
| 775 |
+
"condition_kind": "anchor_decay",
|
| 776 |
+
"stage": 4,
|
| 777 |
+
"token_limit": 4000000,
|
| 778 |
+
"model_name": "L16_H8_D384",
|
| 779 |
+
"n_layer": 16,
|
| 780 |
+
"n_head": 8,
|
| 781 |
+
"n_embd": 384,
|
| 782 |
+
"parameters": 31457280,
|
| 783 |
+
"dropout_initial": 0.385,
|
| 784 |
+
"dropout_final": 0.066,
|
| 785 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 786 |
+
"n": 5,
|
| 787 |
+
"mean_train_eval_loss": 4.080478595197201,
|
| 788 |
+
"std_train_eval_loss": 0.009311692964638253,
|
| 789 |
+
"mean_val_eval_loss": 4.3981304407119755,
|
| 790 |
+
"std_val_eval_loss": 0.009545784836147743,
|
| 791 |
+
"mean_generalization_gap": 0.3176518455147743,
|
| 792 |
+
"std_generalization_gap": 0.007999498965173152
|
| 793 |
+
},
|
| 794 |
+
{
|
| 795 |
+
"run_mode": "locked_stream",
|
| 796 |
+
"condition": "static_dropout_0",
|
| 797 |
+
"condition_kind": "static",
|
| 798 |
+
"stage": 4,
|
| 799 |
+
"token_limit": 4000000,
|
| 800 |
+
"model_name": "L16_H8_D384",
|
| 801 |
+
"n_layer": 16,
|
| 802 |
+
"n_head": 8,
|
| 803 |
+
"n_embd": 384,
|
| 804 |
+
"parameters": 31457280,
|
| 805 |
+
"dropout_initial": 0.0,
|
| 806 |
+
"dropout_final": 0.0,
|
| 807 |
+
"dropout_schedule": "constant",
|
| 808 |
+
"n": 5,
|
| 809 |
+
"mean_train_eval_loss": 4.041403333842754,
|
| 810 |
+
"std_train_eval_loss": 0.017193152802814336,
|
| 811 |
+
"mean_val_eval_loss": 4.594272664189338,
|
| 812 |
+
"std_val_eval_loss": 0.021638340853154137,
|
| 813 |
+
"mean_generalization_gap": 0.5528693303465844,
|
| 814 |
+
"std_generalization_gap": 0.029132548047629703
|
| 815 |
+
},
|
| 816 |
+
{
|
| 817 |
+
"run_mode": "locked_stream",
|
| 818 |
+
"condition": "static_dropout_0.02",
|
| 819 |
+
"condition_kind": "static",
|
| 820 |
+
"stage": 4,
|
| 821 |
+
"token_limit": 4000000,
|
| 822 |
+
"model_name": "L16_H8_D384",
|
| 823 |
+
"n_layer": 16,
|
| 824 |
+
"n_head": 8,
|
| 825 |
+
"n_embd": 384,
|
| 826 |
+
"parameters": 31457280,
|
| 827 |
+
"dropout_initial": 0.02,
|
| 828 |
+
"dropout_final": 0.02,
|
| 829 |
+
"dropout_schedule": "constant",
|
| 830 |
+
"n": 5,
|
| 831 |
+
"mean_train_eval_loss": 4.052870315313339,
|
| 832 |
+
"std_train_eval_loss": 0.02163703576438587,
|
| 833 |
+
"mean_val_eval_loss": 4.535757505893708,
|
| 834 |
+
"std_val_eval_loss": 0.00908401354385357,
|
| 835 |
+
"mean_generalization_gap": 0.48288719058036805,
|
| 836 |
+
"std_generalization_gap": 0.020126181497736668
|
| 837 |
+
},
|
| 838 |
+
{
|
| 839 |
+
"run_mode": "locked_stream",
|
| 840 |
+
"condition": "static_dropout_0.14",
|
| 841 |
+
"condition_kind": "static",
|
| 842 |
+
"stage": 4,
|
| 843 |
+
"token_limit": 4000000,
|
| 844 |
+
"model_name": "L16_H8_D384",
|
| 845 |
+
"n_layer": 16,
|
| 846 |
+
"n_head": 8,
|
| 847 |
+
"n_embd": 384,
|
| 848 |
+
"parameters": 31457280,
|
| 849 |
+
"dropout_initial": 0.14,
|
| 850 |
+
"dropout_final": 0.14,
|
| 851 |
+
"dropout_schedule": "constant",
|
| 852 |
+
"n": 5,
|
| 853 |
+
"mean_train_eval_loss": 4.116507206857205,
|
| 854 |
+
"std_train_eval_loss": 0.014037194709348206,
|
| 855 |
+
"mean_val_eval_loss": 4.44545366615057,
|
| 856 |
+
"std_val_eval_loss": 0.012017216742245517,
|
| 857 |
+
"mean_generalization_gap": 0.32894645929336547,
|
| 858 |
+
"std_generalization_gap": 0.01603071874172604
|
| 859 |
+
},
|
| 860 |
+
{
|
| 861 |
+
"run_mode": "locked_stream",
|
| 862 |
+
"condition": "static_dropout_0.3",
|
| 863 |
+
"condition_kind": "static",
|
| 864 |
+
"stage": 4,
|
| 865 |
+
"token_limit": 4000000,
|
| 866 |
+
"model_name": "L16_H8_D384",
|
| 867 |
+
"n_layer": 16,
|
| 868 |
+
"n_head": 8,
|
| 869 |
+
"n_embd": 384,
|
| 870 |
+
"parameters": 31457280,
|
| 871 |
+
"dropout_initial": 0.3,
|
| 872 |
+
"dropout_final": 0.3,
|
| 873 |
+
"dropout_schedule": "constant",
|
| 874 |
+
"n": 5,
|
| 875 |
+
"mean_train_eval_loss": 4.231865841150284,
|
| 876 |
+
"std_train_eval_loss": 0.010414934638152858,
|
| 877 |
+
"mean_val_eval_loss": 4.46677490323782,
|
| 878 |
+
"std_val_eval_loss": 0.014064932048228269,
|
| 879 |
+
"mean_generalization_gap": 0.23490906208753587,
|
| 880 |
+
"std_generalization_gap": 0.008922414622347311
|
| 881 |
+
}
|
| 882 |
+
]
|
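The stage-4 records added above can be compared directly. The following is a minimal sketch, not part of the commit: it copies the stage-4 `condition`, `condition_kind`, and `mean_val_eval_loss` values from this `summary.json` diff into a literal list, ranks the conditions, and reports the gap between the best `anchor_decay` schedule and the best `static` one. The helper names `rank_by_val_loss` and `best_of_kind` are hypothetical. Note that these are means over `n = 5` seeds, so the difference is a difference of means, not the paired per-seed delta reported in `paired_final_deltas.csv`.

```python
# Stage-4 (token_limit = 4,000,000) means copied from the summary.json diff above.
STAGE4 = [
    {"condition": "fitted_l16_static_law", "condition_kind": "anchor_decay",
     "stage": 4, "mean_val_eval_loss": 4.412404176592827},
    {"condition": "hold_30_then_decay", "condition_kind": "anchor_decay",
     "stage": 4, "mean_val_eval_loss": 4.405232906341553},
    {"condition": "mild_30_to_08", "condition_kind": "anchor_decay",
     "stage": 4, "mean_val_eval_loss": 4.40728645324707},
    {"condition": "prevlocal_interaction", "condition_kind": "anchor_decay",
     "stage": 4, "mean_val_eval_loss": 4.3981304407119755},
    {"condition": "static_dropout_0", "condition_kind": "static",
     "stage": 4, "mean_val_eval_loss": 4.594272664189338},
    {"condition": "static_dropout_0.02", "condition_kind": "static",
     "stage": 4, "mean_val_eval_loss": 4.535757505893708},
    {"condition": "static_dropout_0.14", "condition_kind": "static",
     "stage": 4, "mean_val_eval_loss": 4.44545366615057},
    {"condition": "static_dropout_0.3", "condition_kind": "static",
     "stage": 4, "mean_val_eval_loss": 4.46677490323782},
]


def rank_by_val_loss(records, stage):
    """Return the records for one stage, lowest mean validation loss first."""
    rows = [r for r in records if r["stage"] == stage]
    return sorted(rows, key=lambda r: r["mean_val_eval_loss"])


def best_of_kind(records, stage, kind):
    """Return the lowest-loss record of a given condition_kind at one stage."""
    ranked = rank_by_val_loss(records, stage)
    return next(r for r in ranked if r["condition_kind"] == kind)


best_decay = best_of_kind(STAGE4, 4, "anchor_decay")
best_static = best_of_kind(STAGE4, 4, "static")
# Negative delta means the decay schedule has the lower mean validation loss.
delta = best_decay["mean_val_eval_loss"] - best_static["mean_val_eval_loss"]

for row in rank_by_val_loss(STAGE4, 4):
    print(f'{row["condition"]:<24} {row["mean_val_eval_loss"]:.4f}')
print(f'{best_decay["condition"]} vs {best_static["condition"]}: {delta:+.4f}')
```

To run the same comparison on the full file, the literal list could be replaced by `json.load` on the committed `summary.json`, assuming the file is a top-level array of records shaped like the ones shown in this diff.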
runs/previous_local_updated_formula_clean_l16/locked_stream/20260530-174525/trace.jsonl
ADDED
|
The diff for this file is too large to render.