vikramp committed on
Commit
ad073c9
·
verified ·
1 Parent(s): ca61cb3

Upload folder using huggingface_hub

logs/output_run_20260202_outskip_reboant.log ADDED
@@ -0,0 +1,82 @@
+ 2026-02-02 10:33:27,808 - root - INFO - Run: run_20260202_outskip_reboant
+ 2026-02-02 10:33:27,808 - root - INFO - Log directory: /root/tiny_moe/training_runs/Tiny_MoE/logs
+ 2026-02-02 10:33:27,809 - root - INFO - Output dir: /root/tiny_moe/training_runs
+ 2026-02-02 10:33:30,313 - jax._src.xla_bridge - INFO - Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
+ 2026-02-02 10:33:33,681 - root - INFO - Flax version: 0.11.1
+ 2026-02-02 10:33:33,681 - root - INFO - Optax version: 0.2.6
+ 2026-02-02 10:33:33,682 - root - INFO - Platform: gpu
+ 2026-02-02 10:33:33,682 - root - INFO - Num Devices: 8
+ 2026-02-02 10:33:33,682 - root - INFO - Devices: [CudaDevice(id=0), CudaDevice(id=1), CudaDevice(id=2), CudaDevice(id=3), CudaDevice(id=4), CudaDevice(id=5), CudaDevice(id=6), CudaDevice(id=7)]
+ 2026-02-02 10:33:34,484 - root - INFO - Model config:
+ Config(name='Tiny_MoE',
+ dtype=<class 'jax.numpy.bfloat16'>,
+ vocab_size=50304,
+ block_size=4096,
+ n_layer=30,
+ n_embed=672,
+ n_glu_hidden=2048,
+ n_head=12,
+ n_kv_head=4,
+ n_experts=8,
+ init_stddev=0.02,
+ expert_load_factor=1.25,
+ aux_loss_coeff=0.01,
+ moe_bias=True,
+ mlp_bias=False,
+ attention_bias=False,
+ load_balance_loss_coeff=0.01,
+ z_loss_coeff=0.0005,
+ expert_top_k=2,
+ ln_epsilon=1e-05,
+ rope_theta=0.0001,
+ expert_partition_spec=PartitionSpec('devices',),
+ sdpa_implementation='flash_attn_jax',
+ window_size=(512, 0),
+ value_residual_init=0.5,
+ logit_softcap=30.0)
+ 2026-02-02 10:35:15,485 - root - INFO - Parameter Count: 1,062,185,550
+ 2026-02-02 10:35:15,485 - root - INFO - Sharded / MoE Parameter Count: 992,210,160
+ 2026-02-02 10:35:15,485 - root - INFO - Replicated Parameter Count: 69,975,390
+ 2026-02-02 10:35:16,940 - root - INFO - Weight decay param count: 1,062,140,928
+ 2026-02-02 10:35:16,941 - root - INFO - Training config:
+ TrainerConfig(num_tokens=100000000000,
+ num_tokens_per_batch=262144,
+ mB=64,
+ T=4096,
+ max_steps=381469,
+ max_lr=0.008,
+ min_lr=0.0008,
+ max_grad_norm=1.0,
+ weight_decay=0.1,
+ adam_b1=0.9,
+ adam_b2=0.95,
+ warmup_steps=3814,
+ print_interval=100,
+ val=True,
+ val_interval=5000,
+ val_batches=50,
+ checkpoint_model=False,
+ checkpoint_optimizer=False,
+ checkpoint_interval=10000)
+ 2026-02-02 10:35:16,941 - root - INFO - Effective batch size per device: 8
+ 2026-02-02 10:35:21,859 - root - INFO - HuggingfaceDataLoader: 1030 shards (train)
+ 2026-02-02 10:35:21,944 - root - INFO - HuggingfaceDataLoader initialized:
+ ------------------------
+ label: train
+ shards: 1,030
+ shard size: 100,000,000
+ batch size: 64
+ block size: 4096
+ device rank: 1
+ start shard: 0
+ start pos: 0
+ ------------------------
+ 2026-02-02 10:35:21,944 - root - INFO - HuggingfaceDataLoader: 1 shards (val)
+ 2026-02-02 10:35:22,030 - root - INFO - Starting from step: 0
+ 2026-02-02 10:39:58,092 - root - INFO - 100 | lr: 0.0002 | loss: 6.9877 | logits loss: 6.6875 | load balance loss: 30.3411 | z loss: 17.1250 | avg iter time: 0.00ms | avg tok/sec: 0.00 | tokens processed: 26,214,400 | elapsed: 0h 4m 36s | ETA: calculating...
+ 2026-02-02 10:41:16,197 - root - INFO - 200 | lr: 0.0004 | loss: 5.9530 | logits loss: 5.6562 | load balance loss: 30.4010 | z loss: 12.1875 | avg iter time: 780.98ms | avg tok/sec: 335,658.21 | tokens processed: 52,428,800 | elapsed: 0h 5m 54s | ETA: 82h 42m
+ 2026-02-02 10:42:34,244 - root - INFO - 300 | lr: 0.0006 | loss: 5.4104 | logits loss: 5.0938 | load balance loss: 30.2613 | z loss: 8.2500 | avg iter time: 780.40ms | avg tok/sec: 335,907.96 | tokens processed: 78,643,200 | elapsed: 0h 7m 12s | ETA: 82h 37m
+ 2026-02-02 10:43:52,355 - root - INFO - 400 | lr: 0.0008 | loss: 5.2326 | logits loss: 4.9375 | load balance loss: 30.2812 | z loss: 4.6875 | avg iter time: 780.99ms | avg tok/sec: 335,657.85 | tokens processed: 104,857,600 | elapsed: 0h 8m 30s | ETA: 82h 40m
+ 2026-02-02 10:45:10,457 - root - INFO - 500 | lr: 0.0011 | loss: 4.7870 | logits loss: 4.4688 | load balance loss: 30.2062 | z loss: 3.9844 | avg iter time: 780.95ms | avg tok/sec: 335,672.76 | tokens processed: 131,072,000 | elapsed: 0h 9m 48s | ETA: 82h 38m
+ 2026-02-02 10:45:29,627 - root - WARNING - Received KeyboardInterrupt. Exiting...
+ 2026-02-02 10:45:29,910 - root - INFO - Training completed.
logs/output_run_20260202_unplace_stormer.log ADDED
@@ -0,0 +1,61 @@
+ 2026-02-02 10:30:36,879 - root - INFO - Run: run_20260202_unplace_stormer
+ 2026-02-02 10:30:36,880 - root - INFO - Log directory: /root/tiny_moe/training_runs/Tiny_MoE/logs
+ 2026-02-02 10:30:36,880 - root - INFO - Output dir: /root/tiny_moe/training_runs
+ 2026-02-02 10:30:39,364 - jax._src.xla_bridge - INFO - Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
+ 2026-02-02 10:30:42,683 - root - INFO - Flax version: 0.11.1
+ 2026-02-02 10:30:42,683 - root - INFO - Optax version: 0.2.6
+ 2026-02-02 10:30:42,683 - root - INFO - Platform: gpu
+ 2026-02-02 10:30:42,683 - root - INFO - Num Devices: 8
+ 2026-02-02 10:30:42,683 - root - INFO - Devices: [CudaDevice(id=0), CudaDevice(id=1), CudaDevice(id=2), CudaDevice(id=3), CudaDevice(id=4), CudaDevice(id=5), CudaDevice(id=6), CudaDevice(id=7)]
+ 2026-02-02 10:30:43,545 - root - INFO - Model config:
+ Config(name='Tiny_MoE',
+ dtype=<class 'jax.numpy.bfloat16'>,
+ vocab_size=50304,
+ block_size=4096,
+ n_layer=30,
+ n_embed=672,
+ n_glu_hidden=2048,
+ n_head=12,
+ n_kv_head=4,
+ n_experts=8,
+ init_stddev=0.02,
+ expert_load_factor=1.25,
+ aux_loss_coeff=0.01,
+ moe_bias=True,
+ mlp_bias=False,
+ attention_bias=False,
+ load_balance_loss_coeff=0.01,
+ z_loss_coeff=0.0005,
+ expert_top_k=2,
+ ln_epsilon=1e-05,
+ rope_theta=0.0001,
+ expert_partition_spec=PartitionSpec('devices',),
+ sdpa_implementation='flash_attn_jax',
+ window_size=(512, 0),
+ value_residual_init=0.5,
+ logit_softcap=30.0)
+ 2026-02-02 10:32:31,109 - root - INFO - Parameter Count: 1,062,185,550
+ 2026-02-02 10:32:31,109 - root - INFO - Sharded / MoE Parameter Count: 992,210,160
+ 2026-02-02 10:32:31,109 - root - INFO - Replicated Parameter Count: 69,975,390
+ 2026-02-02 10:32:32,634 - root - INFO - Weight decay param count: 1,062,140,928
+ 2026-02-02 10:32:32,634 - root - INFO - Training config:
+ TrainerConfig(num_tokens=100000000000,
+ num_tokens_per_batch=262144,
+ mB=128,
+ T=4096,
+ max_steps=381469,
+ max_lr=0.008,
+ min_lr=0.0008,
+ max_grad_norm=1.0,
+ weight_decay=0.1,
+ adam_b1=0.9,
+ adam_b2=0.95,
+ warmup_steps=3814,
+ print_interval=100,
+ val=True,
+ val_interval=5000,
+ val_batches=50,
+ checkpoint_model=False,
+ checkpoint_optimizer=False,
+ checkpoint_interval=10000)
+ 2026-02-02 10:32:32,635 - root - INFO - Effective batch size per device: 16
logs/run_20260202_outskip_reboant_train.csv ADDED
@@ -0,0 +1,6 @@
+ step,lr,loss,load_balance_loss,z_loss,time,tokens_processed,tokens_per_sec,elapsed_seconds
+ 100,0.00021185107,6.987718105316162,30.34114646911621,17.125,0,26214400,0,276.06064891815186
+ 200,0.0004216046,5.953006744384766,30.400999069213867,12.1875,780.9849190711975,52428800,335658.2100352977,354.1655626296997
+ 300,0.0006313581,5.410363674163818,30.261280059814453,8.25,780.404257774353,78643200,335907.957175442,432.21277475357056
+ 400,0.0008411117,5.232638835906982,30.28117561340332,4.6875,780.9857678413391,104857600,335657.845244186,510.3233594894409
+ 500,0.0010508653,4.787019729614258,30.20621109008789,3.984375,780.9510564804077,131072000,335672.76441295986,588.4260060787201
logs/run_20260202_outskip_reboant_train.png ADDED
logs/run_20260202_outskip_reboant_val.csv ADDED
@@ -0,0 +1 @@
+ step,loss,logits_loss