vikramp committed on
Commit
ec8cfd3
·
verified ·
1 Parent(s): 68cb4b0

Upload folder using huggingface_hub

logs/output_run_20260201_listera_outlaid.log ADDED
@@ -0,0 +1,88 @@
+ 2026-02-01 17:09:25,627 - root - INFO - Run: run_20260201_listera_outlaid
+ 2026-02-01 17:09:25,627 - root - INFO - Log directory: /root/tiny_moe/training_runs/Tiny_MoE/logs
+ 2026-02-01 17:09:25,627 - root - INFO - Output dir: /root/tiny_moe/training_runs
+ 2026-02-01 17:09:27,910 - jax._src.xla_bridge - INFO - Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
+ 2026-02-01 17:09:31,543 - root - INFO - Flax version: 0.11.1
+ 2026-02-01 17:09:31,544 - root - INFO - Optax version: 0.2.6
+ 2026-02-01 17:09:31,544 - root - INFO - Platform: gpu
+ 2026-02-01 17:09:31,544 - root - INFO - Num Devices: 8
+ 2026-02-01 17:09:31,544 - root - INFO - Devices: [CudaDevice(id=0), CudaDevice(id=1), CudaDevice(id=2), CudaDevice(id=3), CudaDevice(id=4), CudaDevice(id=5), CudaDevice(id=6), CudaDevice(id=7)]
+ 2026-02-01 17:09:32,327 - root - INFO - Model config:
+ Config(name='Tiny_MoE',
+ dtype=<class 'jax.numpy.bfloat16'>,
+ vocab_size=50304,
+ block_size=2048,
+ n_layer=30,
+ n_embed=672,
+ n_glu_hidden=2048,
+ n_head=12,
+ n_kv_head=4,
+ n_experts=8,
+ init_stddev=0.02,
+ expert_load_factor=1.25,
+ aux_loss_coeff=0.01,
+ moe_bias=True,
+ mlp_bias=False,
+ attention_bias=False,
+ load_balance_loss_coeff=0.01,
+ z_loss_coeff=0.0005,
+ expert_top_k=2,
+ ln_epsilon=1e-05,
+ rope_theta=0.0001,
+ expert_partition_spec=PartitionSpec('devices',),
+ sdpa_implementation='cudnn',
+ value_residual_init=0.5,
+ logit_softcap=30.0)
+ 2026-02-01 17:10:54,390 - root - INFO - Parameter Count: 1,062,182,190
+ 2026-02-01 17:10:54,390 - root - INFO - Sharded / MoE Parameter Count: 992,210,160
+ 2026-02-01 17:10:54,390 - root - INFO - Replicated Parameter Count: 69,972,030
+ 2026-02-01 17:10:55,591 - root - INFO - Weight decay param count: 1,062,140,928
+ 2026-02-01 17:10:55,591 - root - INFO - Training config:
+ TrainerConfig(num_tokens=100000000000,
+ num_tokens_per_batch=262144,
+ mB=128,
+ T=2048,
+ max_steps=381469,
+ max_lr=0.002,
+ min_lr=0.0002,
+ max_grad_norm=1.0,
+ weight_decay=0.1,
+ adam_b1=0.9,
+ adam_b2=0.95,
+ warmup_steps=3814,
+ print_interval=100,
+ val=True,
+ val_interval=5000,
+ val_batches=50,
+ checkpoint_model=False,
+ checkpoint_optimizer=False,
+ checkpoint_interval=10000)
+ 2026-02-01 17:10:55,591 - root - INFO - Effective batch size per device: 16
+ 2026-02-01 17:10:58,934 - root - INFO - ModdedNanoGPTDataLoader: 1030 shards (train)
+ 2026-02-01 17:10:59,029 - root - INFO - HuggingfaceDataLoader initialized:
+ ------------------------
+ label: train
+ shards: 1,030
+ shard size: 100,000,000
+ batch size: 128
+ block size: 2048
+ device rank: 1
+ start shard: 0
+ start pos: 0
+ ------------------------
+ 2026-02-01 17:10:59,029 - root - INFO - ModdedNanoGPTDataLoader: 1 shards (val)
+ 2026-02-01 17:10:59,125 - root - INFO - Starting from step: 0
+ 2026-02-01 17:12:05,383 - root - INFO - 0 | lr: 0.0000 | loss: 13.1395 | logits loss: 12.7500 | load balance loss: 30.1163 | z loss: 146.0000 | avg iter time: 0.00ms | avg tok/sec: 0.00 | tokens processed: 262,144
+ 2026-02-01 17:14:35,281 - root - INFO - 100 | lr: 0.0001 | loss: 7.9276 | logits loss: 7.6250 | load balance loss: 30.3898 | z loss: 21.6250 | avg iter time: 1491.52ms | avg tok/sec: 175,755.74 | tokens processed: 26,476,544
+ 2026-02-01 17:16:06,430 - root - INFO - 200 | lr: 0.0001 | loss: 6.6204 | logits loss: 6.3125 | load balance loss: 30.2806 | z loss: 16.3750 | avg iter time: 903.96ms | avg tok/sec: 289,995.07 | tokens processed: 52,690,944
+ 2026-02-01 17:17:37,687 - root - INFO - 300 | lr: 0.0002 | loss: 6.0076 | logits loss: 5.6875 | load balance loss: 30.3037 | z loss: 10.0000 | avg iter time: 905.10ms | avg tok/sec: 289,631.04 | tokens processed: 78,905,344
+ 2026-02-01 17:19:09,230 - root - INFO - 400 | lr: 0.0002 | loss: 5.6975 | logits loss: 5.4062 | load balance loss: 30.1520 | z loss: 6.1875 | avg iter time: 907.95ms | avg tok/sec: 288,722.13 | tokens processed: 105,119,744
+ 2026-02-01 17:20:40,730 - root - INFO - 500 | lr: 0.0003 | loss: 5.3403 | logits loss: 5.0312 | load balance loss: 30.1389 | z loss: 5.5625 | avg iter time: 907.54ms | avg tok/sec: 288,852.61 | tokens processed: 131,334,144
+ 2026-02-01 17:22:11,887 - root - INFO - 600 | lr: 0.0003 | loss: 5.0923 | logits loss: 4.7812 | load balance loss: 30.1347 | z loss: 4.2188 | avg iter time: 906.53ms | avg tok/sec: 289,172.34 | tokens processed: 157,548,544
+ 2026-02-01 17:23:43,043 - root - INFO - 700 | lr: 0.0004 | loss: 4.9635 | logits loss: 4.6562 | load balance loss: 30.2859 | z loss: 4.0938 | avg iter time: 904.04ms | avg tok/sec: 289,968.65 | tokens processed: 183,762,944
+ 2026-02-01 17:25:14,578 - root - INFO - 800 | lr: 0.0004 | loss: 4.8897 | logits loss: 4.5938 | load balance loss: 30.1570 | z loss: 3.6406 | avg iter time: 907.87ms | avg tok/sec: 288,745.97 | tokens processed: 209,977,344
+ 2026-02-01 17:26:46,080 - root - INFO - 900 | lr: 0.0005 | loss: 4.6463 | logits loss: 4.3438 | load balance loss: 30.2383 | z loss: 3.3125 | avg iter time: 907.55ms | avg tok/sec: 288,849.48 | tokens processed: 236,191,744
+ 2026-02-01 17:28:17,201 - root - INFO - 1000 | lr: 0.0005 | loss: 4.5858 | logits loss: 4.2812 | load balance loss: 30.1439 | z loss: 2.4844 | avg iter time: 903.78ms | avg tok/sec: 290,051.38 | tokens processed: 262,406,144
+ 2026-02-01 17:29:48,401 - root - INFO - 1100 | lr: 0.0006 | loss: 4.4841 | logits loss: 4.1875 | load balance loss: 30.1222 | z loss: 2.1094 | avg iter time: 904.55ms | avg tok/sec: 289,806.70 | tokens processed: 288,620,544
+ 2026-02-01 17:31:19,820 - root - INFO - 1200 | lr: 0.0006 | loss: 4.2636 | logits loss: 3.9688 | load balance loss: 30.1213 | z loss: 2.1250 | avg iter time: 906.72ms | avg tok/sec: 289,111.48 | tokens processed: 314,834,944
+ 2026-02-01 17:32:50,893 - root - INFO - 1300 | lr: 0.0007 | loss: 4.2775 | logits loss: 3.9688 | load balance loss: 30.1323 | z loss: 1.7266 | avg iter time: 903.26ms | avg tok/sec: 290,219.89 | tokens processed: 341,049,344
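Several values in the log above can be reproduced from the logged `TrainerConfig` with simple arithmetic. The sketch below is not taken from the training code. The helper names are hypothetical, and the linear warmup formula `max_lr * (step + 1) / warmup_steps` is an assumption that happens to match the learning rates recorded for the first 1,300 steps. The log does not show how the schedule behaves after warmup.

```python
# Values copied from the logged TrainerConfig and device list.
NUM_TOKENS = 100_000_000_000
TOKENS_PER_BATCH = 262_144
MICRO_BATCH = 128      # mB
SEQ_LEN = 2048         # T
NUM_DEVICES = 8
MAX_LR = 0.002
WARMUP_STEPS = 3814


def max_steps() -> int:
    """Number of whole batches that fit in the token budget."""
    return NUM_TOKENS // TOKENS_PER_BATCH


def per_device_batch() -> int:
    """Sequences per device when the batch is split evenly across devices."""
    return MICRO_BATCH // NUM_DEVICES


def warmup_lr(step: int) -> float:
    """Assumed linear warmup; only meaningful while step < WARMUP_STEPS."""
    return MAX_LR * (step + 1) / WARMUP_STEPS


def tokens_processed(step: int) -> int:
    """Tokens consumed once the given zero-based step has finished."""
    return (step + 1) * TOKENS_PER_BATCH


if __name__ == "__main__":
    print("max_steps:", max_steps())
    print("per-device batch:", per_device_batch())
    print("lr at step 1300:", warmup_lr(1300))
    print("tokens after step 1300:", tokens_processed(1300))
```

Under these assumptions, one batch of 128 sequences of 2,048 tokens accounts for exactly the 262,144 tokens per batch in the config, and each of the 8 devices sees 16 sequences, which agrees with the "Effective batch size per device" line.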
logs/run_20260201_listera_outlaid_train.csv ADDED
@@ -0,0 +1,15 @@
+ step,lr,loss,load_balance_loss,z_loss,time,tokens_processed,tokens_per_sec
+ 0,5.2438384e-07,13.139485359191895,30.116331100463867,146.0,0,262144,0
+ 100,5.296277e-05,7.927635669708252,30.389793395996094,21.625,1491.524543762207,26476544,175755.74005558804
+ 200,0.00010540115,6.620430946350098,30.280624389648438,16.375,903.9601945877075,52690944,289995.0701032392
+ 300,0.00015783953,6.007584571838379,30.303728103637695,10.0,905.0963592529297,78905344,289631.0401871186
+ 400,0.00021027793,5.6975274085998535,30.15201187133789,6.1875,907.9456448554993,105119744,288722.12944170303
+ 500,0.00026271632,5.3403472900390625,30.13890266418457,5.5625,907.5354933738708,131334144,288852.6144861272
+ 600,0.0003151547,5.092349052429199,30.13466453552246,4.21875,906.5320682525635,157548544,289172.34059387475
+ 700,0.00036759308,4.963479995727539,30.28586196899414,4.09375,904.0425515174866,183762944,289968.6519843302
+ 800,0.00042003146,4.889721870422363,30.156982421875,3.640625,907.8706812858582,209977344,288745.9694465666
+ 900,0.00047246984,4.646252632141113,30.238346099853516,3.3125,907.5453472137451,236191744,288849.4782159462
+ 1000,0.00052490825,4.585753440856934,30.143917083740234,2.484375,903.7847113609314,262406144,290051.37695376587
+ 1100,0.0005773466,4.4841203689575195,30.122188568115234,2.109375,904.5477652549744,288620544,289806.6968592938
+ 1200,0.000629785,4.2636213302612305,30.121326446533203,2.125,906.7229151725769,314834944,289111.47563763295
+ 1300,0.0006822234,4.277533054351807,30.13232421875,1.7265625,903.2599401473999,341049344,290219.88947857206
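The train CSV has no `logits_loss` column, but the rows appear consistent with the total loss being the logits loss plus the two auxiliary terms scaled by `load_balance_loss_coeff=0.01` and `z_loss_coeff=0.0005` from the model config. That decomposition is an inference from the numbers, not something the log states, and the function names below are hypothetical. The sketch parses three rows copied verbatim from the CSV, recovers the implied logits loss, and recomputes throughput from the `time` column, which is assumed to be milliseconds per iteration.

```python
import csv
import io

# First, second and last data rows of the train CSV, copied verbatim.
TRAIN_CSV = """step,lr,loss,load_balance_loss,z_loss,time,tokens_processed,tokens_per_sec
0,5.2438384e-07,13.139485359191895,30.116331100463867,146.0,0,262144,0
100,5.296277e-05,7.927635669708252,30.389793395996094,21.625,1491.524543762207,26476544,175755.74005558804
1300,0.0006822234,4.277533054351807,30.13232421875,1.7265625,903.2599401473999,341049344,290219.88947857206
"""

LOAD_BALANCE_COEFF = 0.01   # load_balance_loss_coeff in the model config
Z_LOSS_COEFF = 0.0005       # z_loss_coeff in the model config
TOKENS_PER_BATCH = 262_144  # num_tokens_per_batch in the trainer config


def load_rows(text: str) -> list[dict[str, float]]:
    """Parse CSV text into one dict of floats per row."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: float(value) for key, value in row.items()} for row in reader]


def implied_logits_loss(row: dict[str, float]) -> float:
    """Assumed decomposition: loss = logits + c_lb * lb + c_z * z."""
    return (row["loss"]
            - LOAD_BALANCE_COEFF * row["load_balance_loss"]
            - Z_LOSS_COEFF * row["z_loss"])


def tokens_per_sec(row: dict[str, float]) -> float:
    """Throughput from iteration time in ms; step 0 logs a time of 0."""
    if row["time"] == 0:
        return 0.0
    return TOKENS_PER_BATCH / (row["time"] / 1000.0)


if __name__ == "__main__":
    for row in load_rows(TRAIN_CSV):
        print(int(row["step"]), implied_logits_loss(row), tokens_per_sec(row))
```

The implied values land close to the `logits loss` figures printed in the log. The small differences are what one would expect if the logits loss is logged in bfloat16 while the total is stored at higher precision, though the log does not confirm this.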
logs/run_20260201_listera_outlaid_val.csv ADDED
@@ -0,0 +1 @@
+ step,loss,logits_loss