## Base model training

- timestamp: 2025-12-27 20:59:08
- run: pdlm_depth4_bs8_pr1_ratio40_causal
- device_type:
- depth: 4
- max_seq_len: 1024
- block_size: 8
- prefix_pure_tokens: 1
- is_causal: True
- num_iterations: -1
- target_flops: -1.0000
- target_param_data_ratio: 40
- device_batch_size: 64
- total_batch_size: 524,288
- embedding_lr: 0.2000
- unembedding_lr: 0.0040
- weight_decay: 0.0000
- matrix_lr: 0.0200
- grad_clip: 1.0000
- warmup_ratio: 0.0000
- warmdown_ratio: 0.2000
- final_lr_frac: 0.0000
- resume_from_step: -1
- eval_every: 99,999
- eval_tokens: 10,485,760
- core_metric_every: -1
- core_metric_max_per_task: 500
- sample_every: 2000
- save_every: -1
- model_tag:
- Number of parameters: 5,453,056
- Number of FLOPs per token: 3.774874e+07
- Calculated number of iterations: 416
- Number of training tokens: 218,103,808
- Tokens : Params ratio: 39.9966
- DDP world size: 1
- Minimum validation bpb: 1.1379
- Final validation bpb: 1.1379
- CORE metric estimate: None
- MFU %: 2.64%
- Total training flops: 8.233143e+15
- Total training time: 5.14m
- Peak memory usage: 6905.37MiB
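The derived quantities in the log (iterations, training tokens, tokens-to-params ratio) follow arithmetically from the config: with `num_iterations: -1` and `target_flops: -1`, the run is sized by `target_param_data_ratio`. A minimal sketch of that calculation, assuming this is how the training script resolves it (variable names here are mine, not from the actual code):

```python
# Config values copied from the log above.
num_params = 5_453_056           # "Number of parameters"
total_batch_size = 524_288       # tokens consumed per optimizer step
target_param_data_ratio = 40     # desired tokens per parameter

# Target token budget, rounded down to a whole number of steps.
target_tokens = num_params * target_param_data_ratio
num_iterations = target_tokens // total_batch_size   # 416
train_tokens = num_iterations * total_batch_size     # 218,103,808

# Rounding to whole steps is why the achieved ratio (39.9966)
# falls slightly short of the target of 40.
actual_ratio = train_tokens / num_params
print(num_iterations, train_tokens, round(actual_ratio, 4))
```

The same rounding explains the reported "Tokens : Params ratio: 39.9966" rather than an exact 40.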