| 2026-02-01 15:19:29,881 - root - INFO - Run: run_20260201_slapping_arboret |
| 2026-02-01 15:19:29,881 - root - INFO - Log directory: /root/tiny_moe/training_runs/Tiny_MoE/logs |
| 2026-02-01 15:19:29,881 - root - INFO - Output dir: /root/tiny_moe/training_runs |
| 2026-02-01 15:19:32,184 - jax._src.xla_bridge - INFO - Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory |
| 2026-02-01 15:19:35,781 - root - INFO - Flax version: 0.11.1 |
| 2026-02-01 15:19:35,781 - root - INFO - Optax version: 0.2.6 |
| 2026-02-01 15:19:35,782 - root - INFO - Platform: gpu |
| 2026-02-01 15:19:35,782 - root - INFO - Num Devices: 8 |
| 2026-02-01 15:19:35,782 - root - INFO - Devices: [CudaDevice(id=0), CudaDevice(id=1), CudaDevice(id=2), CudaDevice(id=3), CudaDevice(id=4), CudaDevice(id=5), CudaDevice(id=6), CudaDevice(id=7)] |
| 2026-02-01 15:19:36,629 - root - INFO - Model config: |
| Config(name='Tiny_MoE', |
| dtype=<class 'jax.numpy.bfloat16'>, |
| vocab_size=50304, |
| block_size=2048, |
| n_layer=30, |
| n_embed=672, |
| n_glu_hidden=2048, |
| n_head=12, |
| n_kv_head=4, |
| n_experts=8, |
| init_stddev=0.02, |
| expert_load_factor=1.25, |
| aux_loss_coeff=0.01, |
| moe_bias=True, |
| mlp_bias=False, |
| attention_bias=False, |
| load_balance_loss_coeff=0.01, |
| z_loss_coeff=0.0005, |
| expert_top_k=2, |
| ln_epsilon=1e-05, |
| rope_theta=0.0001, |
| expert_partition_spec=PartitionSpec('devices',), |
| sdpa_implementation='cudnn', |
| value_residual_init=0.5, |
| unet_skip_in_layers=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14), |
| unet_skip_out_layers=(15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29), |
| skip_gate_input_dim=12, |
| skip_lambda_init=-1.5) |
| 2026-02-01 15:21:04,803 - root - INFO - Parameter Count: 1,062,182,203 |
| 2026-02-01 15:21:04,803 - root - INFO - Sharded / MoE Parameter Count: 992,210,160 |
| 2026-02-01 15:21:04,803 - root - INFO - Replicated Parameter Count: 69,972,043 |
| 2026-02-01 15:21:05,842 - root - INFO - Weight decay param count: 1,062,140,940 |
| 2026-02-01 15:21:05,843 - root - INFO - Training config: |
| TrainerConfig(num_tokens=100000000000, |
| num_tokens_per_batch=262144, |
| mB=128, |
| T=2048, |
| max_steps=381469, |
| max_lr=0.001, |
| min_lr=0.0001, |
| max_grad_norm=1.0, |
| weight_decay=0.1, |
| adam_b1=0.9, |
| adam_b2=0.95, |
| warmup_steps=3814, |
| print_interval=100, |
| val=True, |
| val_interval=5000, |
| val_batches=50, |
| checkpoint_model=False, |
| checkpoint_optimizer=False, |
| checkpoint_interval=10000) |
| 2026-02-01 15:21:05,843 - root - INFO - Effective batch size per device: 16 |
| 2026-02-01 15:21:09,048 - root - INFO - ModdedNanoGPTDataLoader: 1030 shards (train) |
| 2026-02-01 15:21:09,145 - root - INFO - HuggingfaceDataLoader initialized: |
| ------------------------ |
| label: train |
| shards: 1,030 |
| shard size: 100,000,000 |
| batch size: 128 |
| block size: 2048 |
| device rank: 1 |
| start shard: 0 |
| start pos: 0 |
| ------------------------ |
| 2026-02-01 15:21:09,145 - root - INFO - ModdedNanoGPTDataLoader: 1 shards (val) |
| 2026-02-01 15:21:09,242 - root - INFO - Starting from step: 0 |
| 2026-02-01 15:22:23,524 - root - INFO - 0 | lr: 0.0000 | loss: 13.8428 | logits loss: 13.4375 | load balance loss: 30.1714 | z loss: 145.0000 | avg iter time: 0.00ms | avg tok/sec: 0.00 | tokens processed: 262,144 |
| 2026-02-01 15:25:00,793 - root - INFO - 100 | lr: 0.0000 | loss: 8.5943 | logits loss: 8.2500 | load balance loss: 30.2268 | z loss: 31.5000 | avg iter time: 1565.30ms | avg tok/sec: 167,472.34 | tokens processed: 26,476,544 |
| 2026-02-01 15:26:32,354 - root - INFO - 200 | lr: 0.0001 | loss: 7.2038 | logits loss: 6.9062 | load balance loss: 30.2843 | z loss: 9.4375 | avg iter time: 908.10ms | avg tok/sec: 288,674.05 | tokens processed: 52,690,944 |
| 2026-02-01 15:28:03,056 - root - INFO - 300 | lr: 0.0001 | loss: 6.3778 | logits loss: 6.0625 | load balance loss: 30.1838 | z loss: 8.5625 | avg iter time: 899.53ms | avg tok/sec: 291,424.69 | tokens processed: 78,905,344 |
| 2026-02-01 15:29:16,156 - root - INFO - Downloading fineweb_train_000002.bin from kjj0/fineweb100B-gpt2... |
| 2026-02-01 15:29:16,642 - httpx - INFO - HTTP Request: HEAD https://huggingface.co/datasets/kjj0/fineweb100B-gpt2/resolve/main/fineweb_train_000002.bin "HTTP/1.1 302 Found" |
| 2026-02-01 15:29:16,849 - httpx - INFO - HTTP Request: GET https://huggingface.co/api/datasets/kjj0/fineweb100B-gpt2/xet-read-token/50d1422b27e1a928440c26a8829f3f827f44ac56 "HTTP/1.1 200 OK" |
| 2026-02-01 15:29:41,322 - root - INFO - 400 | lr: 0.0001 | loss: 6.0378 | logits loss: 5.7188 | load balance loss: 30.1847 | z loss: 9.0625 | avg iter time: 975.28ms | avg tok/sec: 268,789.27 | tokens processed: 105,119,744 |
| 2026-02-01 15:31:12,279 - root - INFO - 500 | lr: 0.0001 | loss: 5.6352 | logits loss: 5.3438 | load balance loss: 30.1091 | z loss: 6.1562 | avg iter time: 902.19ms | avg tok/sec: 290,564.78 | tokens processed: 131,334,144 |
| 2026-02-01 15:32:31,151 - root - WARNING - Received KeyboardInterrupt. Exiting... |
| 2026-02-01 15:32:31,385 - root - INFO - Training completed. |