vikramp committed on
Commit 69a94b7 · verified · 1 Parent(s): b1fba3d

Upload folder using huggingface_hub
logs/output_run_20260202_ablach_achagua.log ADDED
@@ -0,0 +1,86 @@
+ 2026-02-02 06:07:45,749 - root - INFO - Run: run_20260202_ablach_achagua
+ 2026-02-02 06:07:45,750 - root - INFO - Log directory: /root/tiny_moe/training_runs/Tiny_MoE/logs
+ 2026-02-02 06:07:45,750 - root - INFO - Output dir: /root/tiny_moe/training_runs
+ 2026-02-02 06:07:47,584 - jax._src.xla_bridge - INFO - Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
+ 2026-02-02 06:07:51,290 - root - INFO - Flax version: 0.11.1
+ 2026-02-02 06:07:51,290 - root - INFO - Optax version: 0.2.6
+ 2026-02-02 06:07:51,291 - root - INFO - Platform: gpu
+ 2026-02-02 06:07:51,291 - root - INFO - Num Devices: 8
+ 2026-02-02 06:07:51,291 - root - INFO - Devices: [CudaDevice(id=0), CudaDevice(id=1), CudaDevice(id=2), CudaDevice(id=3), CudaDevice(id=4), CudaDevice(id=5), CudaDevice(id=6), CudaDevice(id=7)]
+ 2026-02-02 06:07:53,529 - root - INFO - Model config:
+ Config(name='Tiny_MoE',
+ dtype=<class 'jax.numpy.bfloat16'>,
+ vocab_size=50304,
+ block_size=2048,
+ n_layer=30,
+ n_embed=672,
+ n_glu_hidden=2048,
+ n_head=12,
+ n_kv_head=4,
+ n_experts=8,
+ init_stddev=0.02,
+ expert_load_factor=1.25,
+ aux_loss_coeff=0.01,
+ moe_bias=True,
+ mlp_bias=False,
+ attention_bias=False,
+ load_balance_loss_coeff=0.01,
+ z_loss_coeff=0.0005,
+ expert_top_k=2,
+ ln_epsilon=1e-05,
+ rope_theta=0.0001,
+ expert_partition_spec=PartitionSpec('devices',),
+ sdpa_implementation='cudnn',
+ value_residual_init=0.5,
+ logit_softcap=30.0)
+ 2026-02-02 06:09:40,641 - root - INFO - Parameter Count: 1,062,185,550
+ 2026-02-02 06:09:40,641 - root - INFO - Sharded / MoE Parameter Count: 992,210,160
+ 2026-02-02 06:09:40,641 - root - INFO - Replicated Parameter Count: 69,975,390
+ 2026-02-02 06:09:42,197 - root - INFO - Weight decay param count: 1,062,140,928
+ 2026-02-02 06:09:42,198 - root - INFO - Training config:
+ TrainerConfig(num_tokens=100000000000,
+ num_tokens_per_batch=262144,
+ mB=128,
+ T=2048,
+ max_steps=381469,
+ max_lr=0.007,
+ min_lr=0.0007000000000000001,
+ max_grad_norm=1.0,
+ weight_decay=0.1,
+ adam_b1=0.9,
+ adam_b2=0.95,
+ warmup_steps=3814,
+ print_interval=100,
+ val=True,
+ val_interval=5000,
+ val_batches=50,
+ checkpoint_model=False,
+ checkpoint_optimizer=False,
+ checkpoint_interval=10000)
+ 2026-02-02 06:09:42,198 - root - INFO - Effective batch size per device: 16
+ 2026-02-02 06:09:46,771 - root - INFO - HuggingfaceDataLoader: 1030 shards (train)
+ 2026-02-02 06:09:46,918 - root - INFO - HuggingfaceDataLoader initialized:
+ ------------------------
+ label: train
+ shards: 1,030
+ shard size: 100,000,000
+ batch size: 128
+ block size: 2048
+ device rank: 1
+ start shard: 0
+ start pos: 0
+ ------------------------
+ 2026-02-02 06:09:46,918 - root - INFO - HuggingfaceDataLoader: 1 shards (val)
+ 2026-02-02 06:09:47,065 - root - INFO - Starting from step: 0
+ 2026-02-02 06:14:48,129 - root - INFO - 100 | lr: 0.0002 | loss: 7.0857 | logits loss: 6.7812 | load balance loss: 30.4082 | z loss: 23.7500 | avg iter time: 0.00ms | avg tok/sec: 0.00 | tokens processed: 26,214,400 | ETA: calculating...
+ 2026-02-02 06:16:14,788 - root - INFO - 200 | lr: 0.0004 | loss: 6.0730 | logits loss: 5.7500 | load balance loss: 30.2875 | z loss: 13.3125 | avg iter time: 866.51ms | avg tok/sec: 302,529.28 | tokens processed: 52,428,800 | ETA: 91h 46m
+ 2026-02-02 06:17:41,403 - root - INFO - 300 | lr: 0.0006 | loss: 5.4924 | logits loss: 5.1875 | load balance loss: 30.2579 | z loss: 8.3750 | avg iter time: 866.05ms | avg tok/sec: 302,687.48 | tokens processed: 78,643,200 | ETA: 91h 41m
+ 2026-02-02 06:18:47,899 - root - INFO - Downloading fineweb_train_000002.bin from kjj0/fineweb100B-gpt2...
+ 2026-02-02 06:18:48,716 - httpx - INFO - HTTP Request: HEAD https://huggingface.co/datasets/kjj0/fineweb100B-gpt2/resolve/main/fineweb_train_000002.bin "HTTP/1.1 302 Found"
+ 2026-02-02 06:18:48,777 - httpx - INFO - HTTP Request: GET https://huggingface.co/api/datasets/kjj0/fineweb100B-gpt2/xet-read-token/50d1422b27e1a928440c26a8829f3f827f44ac56 "HTTP/1.1 200 OK"
+ 2026-02-02 06:19:08,078 - root - INFO - 400 | lr: 0.0007 | loss: 5.2925 | logits loss: 5.0000 | load balance loss: 30.2058 | z loss: 3.9688 | avg iter time: 866.68ms | avg tok/sec: 302,470.53 | tokens processed: 104,857,600 | ETA: 91h 44m
+ 2026-02-02 06:20:35,236 - root - INFO - 500 | lr: 0.0009 | loss: 4.8856 | logits loss: 4.5938 | load balance loss: 30.2052 | z loss: 3.8438 | avg iter time: 871.46ms | avg tok/sec: 300,808.45 | tokens processed: 131,072,000 | ETA: 92h 13m
+ 2026-02-02 06:22:02,151 - root - INFO - 600 | lr: 0.0011 | loss: 4.7636 | logits loss: 4.4688 | load balance loss: 30.1569 | z loss: 2.7812 | avg iter time: 869.04ms | avg tok/sec: 301,647.68 | tokens processed: 157,286,400 | ETA: 91h 56m
+ 2026-02-02 06:23:28,677 - root - INFO - 700 | lr: 0.0013 | loss: 4.6242 | logits loss: 4.3125 | load balance loss: 30.1497 | z loss: 2.3750 | avg iter time: 865.17ms | avg tok/sec: 302,997.50 | tokens processed: 183,500,800 | ETA: 91h 30m
+ 2026-02-02 06:23:42,291 - root - WARNING - Received KeyboardInterrupt. Exiting...
+ 2026-02-02 06:23:42,574 - root - INFO - Training completed.
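As a sanity check on the numbers in the log above, the derived quantities follow directly from the `TrainerConfig` values: `max_steps` is `num_tokens // num_tokens_per_batch`, and the reported throughput is tokens-per-batch divided by the average iteration time. A minimal sketch (all values copied from the log; nothing here is new data):

```python
# Sanity-checking the run arithmetic from the log above.
num_tokens = 100_000_000_000            # TrainerConfig.num_tokens
tokens_per_batch = 262_144              # mB * T = 128 * 2048

max_steps = num_tokens // tokens_per_batch
print(max_steps)                        # matches max_steps=381469 in the log

iter_time_s = 866.5078687667847 / 1000  # avg iter time at step 200, in seconds
tok_per_sec = tokens_per_batch / iter_time_s
print(round(tok_per_sec, 2))            # matches "avg tok/sec: 302,529.28" at step 200
```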
logs/output_run_20260202_lamp_girtline.log ADDED
@@ -0,0 +1,81 @@
+ 2026-02-02 06:01:40,251 - root - INFO - Run: run_20260202_lamp_girtline
+ 2026-02-02 06:01:40,251 - root - INFO - Log directory: /root/tiny_moe/training_runs/Tiny_MoE/logs
+ 2026-02-02 06:01:40,251 - root - INFO - Output dir: /root/tiny_moe/training_runs
+ 2026-02-02 06:01:42,087 - jax._src.xla_bridge - INFO - Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
+ 2026-02-02 06:01:45,794 - root - INFO - Flax version: 0.11.1
+ 2026-02-02 06:01:45,794 - root - INFO - Optax version: 0.2.6
+ 2026-02-02 06:01:45,794 - root - INFO - Platform: gpu
+ 2026-02-02 06:01:45,794 - root - INFO - Num Devices: 8
+ 2026-02-02 06:01:45,794 - root - INFO - Devices: [CudaDevice(id=0), CudaDevice(id=1), CudaDevice(id=2), CudaDevice(id=3), CudaDevice(id=4), CudaDevice(id=5), CudaDevice(id=6), CudaDevice(id=7)]
+ 2026-02-02 06:01:47,300 - root - INFO - Model config:
+ Config(name='Tiny_MoE',
+ dtype=<class 'jax.numpy.bfloat16'>,
+ vocab_size=50304,
+ block_size=2048,
+ n_layer=30,
+ n_embed=672,
+ n_glu_hidden=2048,
+ n_head=12,
+ n_kv_head=4,
+ n_experts=8,
+ init_stddev=0.02,
+ expert_load_factor=1.25,
+ aux_loss_coeff=0.01,
+ moe_bias=True,
+ mlp_bias=False,
+ attention_bias=False,
+ load_balance_loss_coeff=0.01,
+ z_loss_coeff=0.0005,
+ expert_top_k=2,
+ ln_epsilon=1e-05,
+ rope_theta=0.0001,
+ expert_partition_spec=PartitionSpec('devices',),
+ sdpa_implementation='cudnn',
+ value_residual_init=0.5,
+ logit_softcap=30.0)
+ 2026-02-02 06:03:35,568 - root - INFO - Parameter Count: 1,062,182,190
+ 2026-02-02 06:03:35,568 - root - INFO - Sharded / MoE Parameter Count: 992,210,160
+ 2026-02-02 06:03:35,568 - root - INFO - Replicated Parameter Count: 69,972,030
+ 2026-02-02 06:03:36,936 - root - INFO - Weight decay param count: 1,062,140,928
+ 2026-02-02 06:03:36,936 - root - INFO - Training config:
+ TrainerConfig(num_tokens=100000000000,
+ num_tokens_per_batch=262144,
+ mB=128,
+ T=2048,
+ max_steps=381469,
+ max_lr=0.007,
+ min_lr=0.0007000000000000001,
+ max_grad_norm=1.0,
+ weight_decay=0.1,
+ adam_b1=0.9,
+ adam_b2=0.95,
+ warmup_steps=3814,
+ print_interval=100,
+ val=True,
+ val_interval=5000,
+ val_batches=50,
+ checkpoint_model=False,
+ checkpoint_optimizer=False,
+ checkpoint_interval=10000)
+ 2026-02-02 06:03:36,936 - root - INFO - Effective batch size per device: 16
+ 2026-02-02 06:03:42,079 - root - INFO - HuggingfaceDataLoader: 1030 shards (train)
+ 2026-02-02 06:03:42,079 - root - INFO - Downloading fineweb_train_000001.bin from kjj0/fineweb100B-gpt2...
+ 2026-02-02 06:03:42,672 - httpx - INFO - HTTP Request: HEAD https://huggingface.co/datasets/kjj0/fineweb100B-gpt2/resolve/main/fineweb_train_000001.bin "HTTP/1.1 302 Found"
+ 2026-02-02 06:03:42,760 - httpx - INFO - HTTP Request: GET https://huggingface.co/api/datasets/kjj0/fineweb100B-gpt2/xet-read-token/50d1422b27e1a928440c26a8829f3f827f44ac56 "HTTP/1.1 200 OK"
+ 2026-02-02 06:03:44,236 - root - INFO - HuggingfaceDataLoader initialized:
+ ------------------------
+ label: train
+ shards: 1,030
+ shard size: 100,000,000
+ batch size: 128
+ block size: 2048
+ device rank: 1
+ start shard: 0
+ start pos: 0
+ ------------------------
+ 2026-02-02 06:03:44,236 - root - INFO - HuggingfaceDataLoader: 1 shards (val)
+ 2026-02-02 06:03:44,236 - root - INFO - Downloading fineweb_val_000000.bin from kjj0/fineweb100B-gpt2...
+ 2026-02-02 06:03:44,301 - httpx - INFO - HTTP Request: HEAD https://huggingface.co/datasets/kjj0/fineweb100B-gpt2/resolve/main/fineweb_val_000000.bin "HTTP/1.1 302 Found"
+ 2026-02-02 06:03:45,538 - root - INFO - Starting from step: 0
+ 2026-02-02 06:06:47,338 - root - WARNING - Received KeyboardInterrupt. Exiting...
+ 2026-02-02 06:06:47,695 - root - INFO - Training completed.
logs/run_20260202_ablach_achagua_train.csv ADDED
@@ -0,0 +1,8 @@
+ step,lr,loss,load_balance_loss,z_loss,time,tokens_processed,tokens_per_sec
+ 100,0.00018536969,7.085675239562988,30.4082088470459,23.75,0,26214400,0
+ 200,0.00036890403,6.073034763336182,30.28752326965332,13.3125,866.5078687667847,52428800,302529.2780930931
+ 300,0.0005524384,5.492429733276367,30.257863998413086,8.375,866.0549712181091,78643200,302687.48371860693
+ 400,0.0007359727,5.292534828186035,30.205780029296875,3.96875,866.6761755943298,104857600,302470.5274957313
+ 500,0.00091950706,4.88559103012085,30.20521354675293,3.84375,871.4648866653442,131072000,300808.4479491682
+ 600,0.0011030415,4.763609886169434,30.156856536865234,2.78125,869.0403437614441,157286400,301647.67594720534
+ 700,0.0012865758,4.6241607666015625,30.14971923828125,2.375,865.1688623428345,183500800,302997.49726328225
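The `*_train.csv` logs above are plain CSV with a header row, so they can be read with Python's standard `csv` module. A minimal sketch (the two rows below are copied from the diff; in practice you would pass the file path to `open()` instead of using an inline string):

```python
# Minimal sketch: reading a *_train.csv training log and summarizing loss.
# Header and rows copied from logs/run_20260202_ablach_achagua_train.csv.
import csv
import io

CSV_TEXT = """step,lr,loss,load_balance_loss,z_loss,time,tokens_processed,tokens_per_sec
100,0.00018536969,7.085675239562988,30.4082088470459,23.75,0,26214400,0
700,0.0012865758,4.6241607666015625,30.14971923828125,2.375,865.1688623428345,183500800,302997.49726328225
"""

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
losses = [float(r["loss"]) for r in rows]
print(f"loss {losses[0]:.4f} -> {losses[-1]:.4f} over steps {rows[0]['step']}..{rows[-1]['step']}")
```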
logs/run_20260202_ablach_achagua_train.png ADDED
logs/run_20260202_ablach_achagua_val.csv ADDED
@@ -0,0 +1 @@
+ step,loss,logits_loss
logs/run_20260202_lamp_girtline_train.csv ADDED
@@ -0,0 +1 @@
+ step,lr,loss,load_balance_loss,z_loss,time,tokens_processed,tokens_per_sec
logs/run_20260202_lamp_girtline_train.png ADDED
logs/run_20260202_lamp_girtline_val.csv ADDED
@@ -0,0 +1 @@
+ step,loss,logits_loss