## Base model training

timestamp: 2025-12-27 20:59:08

- run: pdlm_depth4_bs8_pr1_ratio40_causal
- device_type:
- depth: 4
- max_seq_len: 1024
- block_size: 8
- prefix_pure_tokens: 1
- is_causal: True
- num_iterations: -1
- target_flops: -1.0000
- target_param_data_ratio: 40
- device_batch_size: 64
- total_batch_size: 524,288
- embedding_lr: 0.2000
- unembedding_lr: 0.0040
- weight_decay: 0.0000
- matrix_lr: 0.0200
- grad_clip: 1.0000
- warmup_ratio: 0.0000
- warmdown_ratio: 0.2000 (schedule sketched after this list)
- final_lr_frac: 0.0000
- resume_from_step: -1
- eval_every: 99,999
- eval_tokens: 10,485,760
- core_metric_every: -1
- core_metric_max_per_task: 500
- sample_every: 2000
- save_every: -1
- model_tag:
- Number of parameters: 5,453,056
- Number of FLOPs per token: 3.774874e+07
- Calculated number of iterations: 416 (derivation sketched after this list)
- Number of training tokens: 218,103,808
- Tokens : Params ratio: 39.9966
- DDP world size: 1
- Minimum validation bpb: 1.1379
- Final validation bpb: 1.1379
- CORE metric estimate: None
- MFU %: 2.64% (throughput check sketched after this list)
- Total training flops: 8.233143e+15
- Total training time: 5.14m
- Peak memory usage: 6905.37MiB
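
The iteration count is not set directly (`num_iterations: -1`); it follows from the parameter count, the Chinchilla-style tokens:params target of 40, and the total batch size. A minimal arithmetic sketch that reproduces the reported numbers; the round-to-nearest step count is inferred from those numbers, not taken from the training code:

```python
# Sketch (not the run's own code): deriving the iteration count when
# num_iterations == -1 and target_param_data_ratio == 40.
num_params = 5_453_056
target_ratio = 40
total_batch_size = 524_288               # tokens per optimizer step
flops_per_token = 3.774874e7

target_tokens = num_params * target_ratio                 # 218,122,240
num_iterations = round(target_tokens / total_batch_size)  # 416
train_tokens = num_iterations * total_batch_size          # 218,103,808
ratio = train_tokens / num_params                         # ~39.9966, as reported
total_flops = flops_per_token * train_tokens              # ~8.233143e+15, as reported

print(num_iterations, train_tokens, f"{ratio:.4f}", f"{total_flops:.6e}")
```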
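The three scheduler knobs (`warmup_ratio: 0.0`, `warmdown_ratio: 0.2`, `final_lr_frac: 0.0`) suggest a constant learning rate followed by a linear "warmdown" to zero over the last 20% of steps. A hedged sketch; the exact schedule shape used by the run is an assumption, and only the three knob values come from the report:

```python
# Assumed schedule: optional linear warmup, constant middle phase, then a
# linear warmdown from 1.0 to final_lr_frac over the last warmdown_ratio
# fraction of steps. Returns a multiplier on each param group's base LR.
def lr_multiplier(step: int, num_iterations: int = 416,
                  warmup_ratio: float = 0.0,
                  warmdown_ratio: float = 0.2,
                  final_lr_frac: float = 0.0) -> float:
    warmup_steps = int(warmup_ratio * num_iterations)
    warmdown_steps = int(warmdown_ratio * num_iterations)
    if warmup_steps and step < warmup_steps:
        return (step + 1) / warmup_steps      # linear warmup (unused here: ratio is 0)
    if step >= num_iterations - warmdown_steps:
        remaining = num_iterations - step     # steps left, counts down to 1
        frac = remaining / warmdown_steps
        return final_lr_frac + (1.0 - final_lr_frac) * frac
    return 1.0                                # constant middle phase

# At each optimizer step, multiply the base LRs from the config above
# (0.2 embedding, 0.004 unembedding, 0.02 matrix) by lr_multiplier(step).
```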
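The MFU figure can be sanity-checked against the total FLOPs and wall-clock time. MFU is conventionally achieved FLOP/s divided by the device's peak FLOP/s; since `device_type` is blank in this report, the sketch backs the implied peak out of the reported 2.64% rather than assuming a particular accelerator:

```python
# Sanity-check of the reported MFU %, using only numbers from the report.
total_flops = 8.233143e15
wall_seconds = 5.14 * 60                 # "5.14m" total training time

achieved = total_flops / wall_seconds    # ~2.67e13 FLOP/s (~26.7 TFLOP/s)
mfu = 0.0264                             # 2.64% from the report
implied_peak = achieved / mfu            # ~1.0e15 FLOP/s; the device itself
                                         # is not identified in the report

print(f"{achieved:.3e} FLOP/s achieved, implied peak {implied_peak:.3e} FLOP/s")
```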