Commit History

Training status: step 256000
4e8123c
verified

tekkmaven committed on

Checkpoint weights: step 256000 (1667MB)
4402a06
verified

tekkmaven committed on

Training status: step 240000
2704215
verified

tekkmaven committed on

Checkpoint weights: step 240000 (1667MB)
d69c665
verified

tekkmaven committed on

Training status: step 224000
c9c1faf
verified

tekkmaven committed on

Checkpoint weights: step 224000 (1667MB)
e2fa33d
verified

tekkmaven committed on

Training status: step 208000
01efe26
verified

tekkmaven committed on

Checkpoint weights: step 208000 (1666MB)
e2bef11
verified

tekkmaven committed on

Training status: step 192000
b087da5
verified

tekkmaven committed on

Checkpoint weights: step 192000 (1665MB)
190759f
verified

tekkmaven committed on

Training status: step 176000
ad408e0
verified

tekkmaven committed on

Checkpoint weights: step 176000 (1666MB)
270c549
verified

tekkmaven committed on

Training status: step 160000
7fc6928
verified

tekkmaven committed on

Checkpoint weights: step 160000 (1666MB)
3054872
verified

tekkmaven committed on

Training status: step 144000
dabb0f9
verified

tekkmaven committed on

Checkpoint weights: step 144000 (1666MB)
3f7e2e0
verified

tekkmaven committed on

Training status: step 128000
07469ea
verified

tekkmaven committed on

Checkpoint weights: step 128000 (1666MB)
8d3a460
verified

tekkmaven committed on

Training status: step 112000
107d7e2
verified

tekkmaven committed on

Checkpoint weights: step 112000 (1664MB)
42f2552
verified

tekkmaven committed on

Training status: step 96000
aebd4fb
verified

tekkmaven committed on

Checkpoint weights: step 96000 (1666MB)
7871a7c
verified

tekkmaven committed on

Training status: step 80000
0e4bcd7
verified

tekkmaven committed on

Checkpoint weights: step 80000 (1666MB)
6863f70
verified

tekkmaven committed on

Training status: step 64000
114bae5
verified

tekkmaven committed on

Checkpoint weights: step 64000 (1666MB)
79ca53f
verified

tekkmaven committed on

Training status: step 48000
4bf1029
verified

tekkmaven committed on

Checkpoint weights: step 48000 (1666MB)
9063b83
verified

tekkmaven committed on

Remove stale training_status.json from diverged run
df69066
verified

tekkmaven committed on

Always pass --resume so train.py checks Hub when local is empty
abb62fc
verified

tekkmaven committed on

Auto-resume from Hub when local checkpoints are cleared by Kaggle

Now --resume checks: local disk first → Hub download fallback → fresh start
No more lost progress across Kaggle sessions.
b2c7b9b
verified

tekkmaven committed on
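
The resume order described in the commit above (local disk, then Hub download, then fresh start) can be sketched as follows. This is a minimal illustration, not the code in train.py: the `step_<N>.pt` file naming, the function names, and the injected `download_from_hub` callable are all assumptions made for the example.

```python
from pathlib import Path
from typing import Callable, Optional, Tuple


def latest_local_checkpoint(ckpt_dir: Path) -> Optional[Path]:
    """Return the highest-step checkpoint in ckpt_dir, or None if none exist."""
    if not ckpt_dir.is_dir():
        return None
    candidates = []
    # Hypothetical naming scheme: step_<N>.pt
    for path in ckpt_dir.glob("step_*.pt"):
        try:
            step = int(path.stem.split("_")[1])
        except (IndexError, ValueError):
            continue  # ignore files that do not match the scheme
        candidates.append((step, path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


def resolve_resume(
    ckpt_dir: Path,
    download_from_hub: Callable[[Path], Optional[Path]],
) -> Tuple[str, Optional[Path]]:
    """Decide where to resume from: 'local', 'hub', or 'fresh'."""
    local = latest_local_checkpoint(ckpt_dir)
    if local is not None:
        return "local", local
    # Local disk is empty (for example after a new Kaggle session):
    # ask the caller-supplied downloader to fetch the latest checkpoint.
    remote = download_from_hub(ckpt_dir)
    if remote is not None:
        return "hub", remote
    # Nothing found anywhere: start training from scratch.
    return "fresh", None
```

Passing the downloader in as a callable keeps the decision logic testable without network access; in a real script it would wrap whatever Hub client the project uses.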

Training status: step 32000
523eb1c
verified

tekkmaven committed on

Checkpoint weights: step 32000 (1668MB)
dcd216f
verified

tekkmaven committed on

Training status: step 16000
b7e64d7
verified

tekkmaven committed on

Checkpoint weights: step 16000 (1671MB)
7c08bcf
verified

tekkmaven committed on

Fix divergence: peak_lr 1e-3 → 3e-4, max_grad_norm 1.0 → 0.5 (batch=8 too small for high LR)
9492039
verified

tekkmaven committed on
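
To illustrate what lowering max_grad_norm from 1.0 to 0.5 does, here is a small sketch of global-norm gradient clipping on plain Python floats. The commit does not say how clipping is implemented; this assumes the common global-norm scheme (the one PyTorch's `clip_grad_norm_` uses), and the function name is hypothetical.

```python
import math
from typing import List, Tuple


def clip_by_global_norm(grads: List[float], max_norm: float) -> Tuple[List[float], float]:
    """Scale grads so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The scale never exceeds 1.0, so small gradients pass through unchanged.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm


# A gradient with norm 5.0 is rescaled to norm ~1.0 or ~0.5 depending on the cap.
for cap in (1.0, 0.5):
    clipped, before = clip_by_global_norm([3.0, 4.0], cap)
    after = math.sqrt(sum(g * g for g in clipped))
    print(f"max_grad_norm={cap}: norm {before:.2f} -> {after:.2f}")
```

Halving the cap halves the largest update direction the optimizer can receive in one step, which compounds with the lower peak learning rate.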

Delete stale checkpoint from diverged run
554b735
verified

tekkmaven committed on

Training status: step 16000
7d6db81
verified

tekkmaven committed on

Checkpoint weights: step 16000 (1688MB)
52f418f
verified

tekkmaven committed on

Fix data.py: handle edge cases, add fallback for failed datasets, fix orca-agentinstruct split names
a92ed74
verified

tekkmaven committed on

CRITICAL FIX: Wire up real data pipeline (was training on random tokens!)

Changes:
- Replace random.randint batch with TAPDataPipeline streaming real data
- Add tokenizer initialization with TAP special tokens
- Add pad_fraction diagnostic to detect data issues
- Keep upd_rms diagnostic
- Stage-aware curriculum switching
- Proper pad token masking in loss (was already in build_model_mtp)

This fixes the root cause of loss stuck at ln(vocab)=10.85
bd50131
verified

tekkmaven committed on
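
The ln(vocab) figure in the commit above is the cross-entropy of a model that predicts every token with equal probability, which is the best any model can do on uniformly random tokens. The sketch below shows that arithmetic and one plausible form of a pad-fraction check. The function names and the pad-fraction definition are assumptions for illustration, not the project's code, and the commit does not state the tokenizer's actual vocabulary size.

```python
import math
from typing import List


def uniform_cross_entropy(vocab_size: int) -> float:
    """Cross-entropy (in nats) of a uniform prediction over vocab_size tokens."""
    return math.log(vocab_size)


def implied_vocab_size(loss: float) -> float:
    """Vocabulary size at which a uniform prediction yields the given loss."""
    return math.exp(loss)


def pad_fraction(batch: List[List[int]], pad_id: int) -> float:
    """Share of positions in the batch that hold the pad token (assumed definition)."""
    total = sum(len(row) for row in batch)
    pads = sum(tok == pad_id for row in batch for tok in row)
    return pads / total if total else 0.0


# A loss stuck at 10.85 corresponds to a uniform guess over roughly 51,500 tokens.
print(round(implied_vocab_size(10.85)))

# A batch that is mostly padding is a hint that the data pipeline is misbehaving.
print(pad_fraction([[1, 2, 0, 0], [3, 0, 0, 0]], pad_id=0))
```

A loss that sits at this baseline and never moves suggests the model cannot extract any signal from its inputs, which matches the commit's diagnosis of random-token batches.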

Remove stale training_status.json from old run
3f51ef2
verified

tekkmaven committed on

Delete stale checkpoints from stuck training (loss=10.95, bad init + low LR)
59033e2
verified

tekkmaven committed on

Fix config.py default peak_lr to 1e-3
32dd395
verified

tekkmaven committed on

Fix stuck training: peak_lr 5e-4 → 1e-3, increase weight decay for stability
242a666
verified

tekkmaven committed on

Fix stuck loss: increase LR to 1e-3, fix initialization scales, add diagnostics
4f792d0
verified

tekkmaven committed on

Training status: step 176000
f02ec75
verified

tekkmaven committed on

Checkpoint weights: step 176000 (1670MB)
e696556
verified

tekkmaven committed on

Training status: step 160000
25117be
verified

tekkmaven committed on

Checkpoint weights: step 160000 (1678MB)
a3de037
verified

tekkmaven committed on