Mandeep Sidhu

	# TinyStories Multi-Seed Streaming Validation

	Date: 2026-05-30

	This report aggregates results for 5 random seeds (1, 2, 3, 4, 5) from saved streaming runs.
	The script that generates it performs no additional training; it only reads the saved
	`metrics.jsonl` files.
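
Reading the saved runs can be sketched as below. This is a minimal illustration, not the report script itself: the helper names `load_metrics` and `load_all` are hypothetical, and the only assumption about the files is that each non-empty line is one JSON object, as the `.jsonl` extension suggests. The field names inside each record are not documented here, so the sketch does not rely on any.

```python
import json
from pathlib import Path


def load_metrics(path):
    """Read one metrics.jsonl file into a list of dicts (one per non-empty line)."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                records.append(json.loads(line))
    return records


def load_all(paths):
    """Concatenate the records of several saved runs, in the order given."""
    merged = []
    for path in paths:
        merged.extend(load_metrics(path))
    return merged
```

Calling `load_all` on the three paths listed under Sources would return every saved record without touching a model or optimizer.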

	Regime: TinyStories BPE streaming validation with L12_H8_D320, 17,367,040 parameters, four stream prefixes (500k, 1M, 2M, and 4M tokens), and 2,000 optimizer steps per stage.
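
As a sanity check on the regime line, the arithmetic below reproduces the stated parameter count and the prefix sizes. Everything about the architecture here is an assumption, not something the report states: `L12_H8_D320` is read as 12 layers, 8 heads, and model width 320; each layer is counted as 12·D² weights (attention Q, K, V, O plus an MLP with 4x hidden width); biases, norm parameters, and position embeddings are ignored; and a tied embedding with an 8,192-token vocabulary is assumed because it makes the total come out exactly. Other decompositions could give the same number.

```python
def estimate_params(n_layers, d_model, vocab_size):
    """Rough weight count under the assumptions stated above (hypothetical helper)."""
    per_layer = 12 * d_model * d_model   # 4*D^2 attention + 8*D^2 MLP, no biases
    embedding = vocab_size * d_model     # tied input/output embedding
    return n_layers * per_layer + embedding


# Assumed reading of L12_H8_D320; the head count does not change the total.
total = estimate_params(n_layers=12, d_model=320, vocab_size=8192)
print(f"{total:,}")  # 17,367,040

# The four prefixes double from 500k to 4M tokens.
prefixes = [500_000 * 2**stage for stage in range(4)]
print(prefixes)  # [500000, 1000000, 2000000, 4000000]
```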

	## Sources

	- `runs/streaming_tinystories_interaction_schedule_l12/locked_stream/20260530-053831/metrics.jsonl`
	- `runs/streaming_tinystories_multiseed_validation_l12/locked_stream/20260530-111523/metrics.jsonl`
	- `runs/streaming_tinystories_multiseed_validation_l12/locked_stream/20260530-141335/metrics.jsonl`

	## Condition Ranking By Final Loss

	| Condition | Kind | N | Mean trajectory val | Std trajectory val | Mean final val | Std final val | Mean final gap | Dropout path |
	|---|---|---:|---:|---:|---:|---:|---:|---|
	| `interaction` | `anchor_decay` | 5 | 2.8309 | 0.0068 | 2.5311 | 0.0213 | 0.2626 | `0.18 -> 0.14 -> 0.08 -> 0.04` |
	| `smooth_low` | `decay` | 5 | 2.8307 | 0.0069 | 2.5321 | 0.0203 | 0.2607 | `0.16 -> 0.11 -> 0.07 -> 0.05` |
	| `baseabc` | `anchor_decay` | 5 | 2.8474 | 0.0028 | 2.5357 | 0.0175 | 0.2655 | `0.25 -> 0.19 -> 0.10 -> 0.02` |
	| `static_dropout_0.08` | `static` | 5 | 2.8434 | 0.0072 | 2.5444 | 0.0211 | 0.2593 | `0.08 -> 0.08 -> 0.08 -> 0.08` |
	| `static_dropout_0.12` | `static` | 5 | 2.8357 | 0.0061 | 2.5477 | 0.0178 | 0.2269 | `0.12 -> 0.12 -> 0.12 -> 0.12` |
	| `static_dropout_0.18` | `static` | 5 | 2.8461 | 0.0047 | 2.5644 | 0.0182 | 0.2035 | `0.18 -> 0.18 -> 0.18 -> 0.18` |
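
The "Mean trajectory val" column is not defined in the report, but it appears to be the average of a condition's four per-stage mean validation losses from the Stage Trajectory table. The sketch below checks that reading for three conditions; the function name `trajectory_mean` is hypothetical, and the inputs are the rounded table values, so agreement is only expected to four decimals. The "Std trajectory val" column cannot be reproduced this way because it needs per-seed values.

```python
# Per-stage mean validation loss (stages 0..3), copied from the Stage Trajectory table.
stage_mean_val = {
    "interaction": [3.2326, 2.8908, 2.6690, 2.5311],
    "smooth_low": [3.2287, 2.8912, 2.6708, 2.5321],
    "static_dropout_0.08": [3.2304, 2.9132, 2.6856, 2.5444],
}


def trajectory_mean(values):
    """Unweighted mean over stages (assumed definition of 'Mean trajectory val')."""
    return sum(values) / len(values)


for name, values in stage_mean_val.items():
    print(f"{name}: {trajectory_mean(values):.4f}")
```

Because both averages are unweighted, averaging over stages and then seeds gives the same number as averaging over seeds and then stages, which is why the per-stage means are enough for this check.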

	## Paired Final-Loss Deltas

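	`delta_vs_best_static` is a condition's final validation loss minus that of the
	best static baseline for the same seed, so a negative value means the condition
	beat that baseline. Some deltas differ from the difference of the displayed
	values in the last digit, which suggests they were computed before rounding.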

	| Seed | Condition | Final val | Best static | Best static final val | Delta vs best static |
	|---:|---|---:|---|---:|---:|
	| 1 | `interaction` | 2.5414 | `static_dropout_0.08` | 2.5419 | -0.0005 |
	| 1 | `baseabc` | 2.5397 | `static_dropout_0.08` | 2.5419 | -0.0022 |
	| 1 | `smooth_low` | 2.5423 | `static_dropout_0.08` | 2.5419 | +0.0003 |
	| 1 | `static_dropout_0.08` | 2.5419 | `static_dropout_0.08` | 2.5419 | +0.0000 |
	| 1 | `static_dropout_0.12` | 2.5526 | `static_dropout_0.08` | 2.5419 | +0.0106 |
	| 1 | `static_dropout_0.18` | 2.5636 | `static_dropout_0.08` | 2.5419 | +0.0217 |
	| 2 | `interaction` | 2.5377 | `static_dropout_0.12` | 2.5588 | -0.0211 |
	| 2 | `baseabc` | 2.5432 | `static_dropout_0.12` | 2.5588 | -0.0156 |
	| 2 | `smooth_low` | 2.5386 | `static_dropout_0.12` | 2.5588 | -0.0202 |
	| 2 | `static_dropout_0.08` | 2.5636 | `static_dropout_0.12` | 2.5588 | +0.0048 |
	| 2 | `static_dropout_0.12` | 2.5588 | `static_dropout_0.12` | 2.5588 | +0.0000 |
	| 2 | `static_dropout_0.18` | 2.5768 | `static_dropout_0.12` | 2.5588 | +0.0180 |
	| 3 | `interaction` | 2.5385 | `static_dropout_0.08` | 2.5478 | -0.0092 |
	| 3 | `baseabc` | 2.5425 | `static_dropout_0.08` | 2.5478 | -0.0052 |
	| 3 | `smooth_low` | 2.5407 | `static_dropout_0.08` | 2.5478 | -0.0071 |
	| 3 | `static_dropout_0.08` | 2.5478 | `static_dropout_0.08` | 2.5478 | +0.0000 |
	| 3 | `static_dropout_0.12` | 2.5510 | `static_dropout_0.08` | 2.5478 | +0.0033 |
	| 3 | `static_dropout_0.18` | 2.5667 | `static_dropout_0.08` | 2.5478 | +0.0189 |
	| 4 | `interaction` | 2.4932 | `static_dropout_0.08` | 2.5098 | -0.0166 |
	| 4 | `baseabc` | 2.5049 | `static_dropout_0.08` | 2.5098 | -0.0049 |
	| 4 | `smooth_low` | 2.4959 | `static_dropout_0.08` | 2.5098 | -0.0139 |
	| 4 | `static_dropout_0.08` | 2.5098 | `static_dropout_0.08` | 2.5098 | +0.0000 |
	| 4 | `static_dropout_0.12` | 2.5166 | `static_dropout_0.08` | 2.5098 | +0.0068 |
	| 4 | `static_dropout_0.18` | 2.5343 | `static_dropout_0.08` | 2.5098 | +0.0244 |
	| 5 | `interaction` | 2.5447 | `static_dropout_0.08` | 2.5588 | -0.0141 |
	| 5 | `baseabc` | 2.5481 | `static_dropout_0.08` | 2.5588 | -0.0107 |
	| 5 | `smooth_low` | 2.5428 | `static_dropout_0.08` | 2.5588 | -0.0159 |
	| 5 | `static_dropout_0.08` | 2.5588 | `static_dropout_0.08` | 2.5588 | +0.0000 |
	| 5 | `static_dropout_0.12` | 2.5595 | `static_dropout_0.08` | 2.5588 | +0.0008 |
	| 5 | `static_dropout_0.18` | 2.5806 | `static_dropout_0.08` | 2.5588 | +0.0218 |
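
The paired comparison in the table above can be sketched as follows: for each seed, pick the static condition with the lowest final validation loss, then subtract its loss from every condition's loss for that seed. The function name `paired_deltas` and the rule of identifying static baselines by the `static_` name prefix are assumptions made for this illustration. The inputs are the rounded seed 1 and seed 2 values from the table, so results can differ from the listed deltas by about 0.0001.

```python
# Final validation loss per seed and condition (seeds 1 and 2 from the table).
final_val = {
    1: {"interaction": 2.5414, "baseabc": 2.5397, "smooth_low": 2.5423,
        "static_dropout_0.08": 2.5419, "static_dropout_0.12": 2.5526,
        "static_dropout_0.18": 2.5636},
    2: {"interaction": 2.5377, "baseabc": 2.5432, "smooth_low": 2.5386,
        "static_dropout_0.08": 2.5636, "static_dropout_0.12": 2.5588,
        "static_dropout_0.18": 2.5768},
}


def paired_deltas(per_seed):
    """Return {seed: (best_static_name, {condition: delta_vs_best_static})}."""
    result = {}
    for seed, losses in per_seed.items():
        statics = {k: v for k, v in losses.items() if k.startswith("static_")}
        best_name = min(statics, key=statics.get)  # lowest loss among statics
        best_val = statics[best_name]
        result[seed] = (best_name, {k: v - best_val for k, v in losses.items()})
    return result


for seed, (best_name, deltas) in paired_deltas(final_val).items():
    print(seed, best_name, f"{deltas['interaction']:+.4f}")
```

Selecting the best static baseline separately for each seed makes the comparison harder for the schedules than comparing against a single fixed baseline, since the reference is chosen after seeing each seed's results.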

	## Stage Trajectory

	| Stage | Prefix tokens | Condition | Dropout | N | Mean val | Std val | Mean train | Mean gap |
	|---:|---:|---|---:|---:|---:|---:|---:|---:|
	| 0 | 500,000 | `static_dropout_0.12` | 0.120 | 5 | 3.2226 | 0.0143 | 2.6968 | 0.5257 |
	| 0 | 500,000 | `smooth_low` | 0.162 | 5 | 3.2287 | 0.0122 | 2.7909 | 0.4377 |
	| 0 | 500,000 | `static_dropout_0.08` | 0.080 | 5 | 3.2304 | 0.0102 | 2.6173 | 0.6131 |
	| 0 | 500,000 | `interaction` | 0.184 | 5 | 3.2326 | 0.0123 | 2.8108 | 0.4218 |
	| 0 | 500,000 | `static_dropout_0.18` | 0.180 | 5 | 3.2349 | 0.0151 | 2.8056 | 0.4293 |
	| 0 | 500,000 | `baseabc` | 0.251 | 5 | 3.2728 | 0.0102 | 2.9139 | 0.3588 |
	| 1 | 1,000,000 | `interaction` | 0.141 | 5 | 2.8908 | 0.0027 | 2.4842 | 0.4065 |
	| 1 | 1,000,000 | `smooth_low` | 0.115 | 5 | 2.8912 | 0.0018 | 2.4678 | 0.4234 |
	| 1 | 1,000,000 | `static_dropout_0.12` | 0.120 | 5 | 2.8930 | 0.0121 | 2.4335 | 0.4595 |
	| 1 | 1,000,000 | `static_dropout_0.18` | 0.180 | 5 | 2.8990 | 0.0106 | 2.5397 | 0.3593 |
	| 1 | 1,000,000 | `baseabc` | 0.186 | 5 | 2.9041 | 0.0037 | 2.5659 | 0.3382 |
	| 1 | 1,000,000 | `static_dropout_0.08` | 0.080 | 5 | 2.9132 | 0.0068 | 2.3531 | 0.5601 |
	| 2 | 2,000,000 | `interaction` | 0.084 | 5 | 2.6690 | 0.0207 | 2.3392 | 0.3298 |
	| 2 | 2,000,000 | `smooth_low` | 0.067 | 5 | 2.6708 | 0.0218 | 2.3360 | 0.3347 |
	| 2 | 2,000,000 | `baseabc` | 0.105 | 5 | 2.6770 | 0.0186 | 2.3938 | 0.2833 |
	| 2 | 2,000,000 | `static_dropout_0.12` | 0.120 | 5 | 2.6795 | 0.0163 | 2.3697 | 0.3098 |
	| 2 | 2,000,000 | `static_dropout_0.08` | 0.080 | 5 | 2.6856 | 0.0161 | 2.3109 | 0.3747 |
	| 2 | 2,000,000 | `static_dropout_0.18` | 0.180 | 5 | 2.6860 | 0.0159 | 2.4347 | 0.2513 |
	| 3 | 4,000,000 | `interaction` | 0.045 | 5 | 2.5311 | 0.0213 | 2.2685 | 0.2626 |
	| 3 | 4,000,000 | `smooth_low` | 0.045 | 5 | 2.5321 | 0.0203 | 2.2713 | 0.2607 |
	| 3 | 4,000,000 | `baseabc` | 0.020 | 5 | 2.5357 | 0.0175 | 2.2702 | 0.2655 |
	| 3 | 4,000,000 | `static_dropout_0.08` | 0.080 | 5 | 2.5444 | 0.0211 | 2.2851 | 0.2593 |
	| 3 | 4,000,000 | `static_dropout_0.12` | 0.120 | 5 | 2.5477 | 0.0178 | 2.3208 | 0.2269 |
	| 3 | 4,000,000 | `static_dropout_0.18` | 0.180 | 5 | 2.5644 | 0.0182 | 2.3609 | 0.2035 |
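
"Mean gap" is not defined in the report, but the rows are consistent with it being mean validation loss minus mean training loss. The sketch below checks that reading on four rows copied from the table; the helper name `generalization_gap` is hypothetical, and a small tolerance is used because the table values are rounded to four decimals.

```python
# (stage, condition, mean_val, mean_train, reported_gap), copied from the table.
rows = [
    (0, "static_dropout_0.12", 3.2226, 2.6968, 0.5257),
    (0, "static_dropout_0.08", 3.2304, 2.6173, 0.6131),
    (3, "interaction", 2.5311, 2.2685, 0.2626),
    (3, "static_dropout_0.18", 2.5644, 2.3609, 0.2035),
]


def generalization_gap(mean_val, mean_train):
    """Validation loss minus training loss (assumed definition of 'Mean gap')."""
    return mean_val - mean_train


for stage, name, val, train, reported in rows:
    gap = generalization_gap(val, train)
    # Allow a rounding slack of 2e-4 between recomputed and reported gaps.
    status = "ok" if abs(gap - reported) < 2e-4 else "mismatch"
    print(stage, name, f"{gap:.4f}", status)
```

Under this reading, higher dropout lowers the gap mainly by raising training loss, which is why a small gap alone (as for `static_dropout_0.18` at stage 3) does not imply a better validation loss.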

	## Interpretation

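	- `interaction` has the best 5-seed mean final validation loss: 2.5311 +/- 0.0213 (mean +/- standard deviation across seeds).
	- The second-best final condition is `smooth_low` at 2.5321 +/- 0.0203, only 0.0010 behind `interaction` and well within one seed-level standard deviation.
	- The best static baseline by mean final loss is `static_dropout_0.08` at 2.5444 +/- 0.0211.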
	- `interaction` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0005.
	- `smooth_low` beats the per-seed best static baseline in 4/5 seeds; worst paired delta is +0.0003.
	- `baseabc` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0022.
	- At the first stage (500,000-token prefix), the best condition is `static_dropout_0.12` with mean validation loss 3.2226, ahead of every scheduled condition; the final ranking therefore does not show that any schedule is uniformly better across stages.
	- This is a saved-run streaming validation artifact. Treat it as strong
	evidence only when the tested conditions, seeds, static baselines, and
	stream protocol match the claim being made.
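
The win counts and worst paired deltas quoted in the bullets can be recomputed from the Paired Final-Loss Deltas table. The sketch below does that for the three scheduled conditions; the function name `summarize` is hypothetical, and a "win" is taken to mean a strictly negative delta, matching how the table describes beating the best static baseline.

```python
# Delta vs best static for seeds 1..5, copied from the paired-delta table.
deltas = {
    "interaction": [-0.0005, -0.0211, -0.0092, -0.0166, -0.0141],
    "smooth_low": [0.0003, -0.0202, -0.0071, -0.0139, -0.0159],
    "baseabc": [-0.0022, -0.0156, -0.0052, -0.0049, -0.0107],
}


def summarize(values):
    """Return (wins, seeds, worst_delta); the worst delta is the largest one."""
    wins = sum(1 for d in values if d < 0)
    return wins, len(values), max(values)


for name, values in deltas.items():
    wins, seeds, worst = summarize(values)
    print(f"{name}: {wins}/{seeds} wins, worst delta {worst:+.4f}")
# interaction: 5/5 wins, worst delta -0.0005
# smooth_low: 4/5 wins, worst delta +0.0003
# baseabc: 5/5 wins, worst delta -0.0022
```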
