Consolidate all training logging through MetricsLogger b103659 thomas-schweich commited on 3 days ago
Log patience counter, best val loss/step in val records a050f72 thomas-schweich commited on 3 days ago
Per-model early stopping: freeze converged variants individually 190085d thomas-schweich commited on 3 days ago
Push metrics to HF at eval intervals, add dashboard HF sync 86ec60c thomas-schweich commited on 3 days ago
Remove .item() CUDA sync from hot path, batch size 512, run slugs fc9d7f7 thomas-schweich commited on 3 days ago
Add post-training evals, /dev/shm checkpoints, async HF push, and _orig_mod fix 87b2fa6 thomas-schweich commited on 3 days ago
Safetensors migration, checkpoint integrity, and multi-model training. (#1) 230508d unverified thomas-schweich commited on 3 days ago