rob-x-ai 
posted an update 7 days ago
Genesis 1B is now public. 🔥

I’m training a 1.003B parameter model from scratch on 2× RTX 4090s and opened a public playground for early checkpoints.

The real bottleneck wasn’t training.
It was checkpointing:

FSDP full-state gather over PCIe = NCCL timeout hell

Switching to DCP sharded checkpoints changed the trajectory of the run.
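To give a rough sense of scale, here is a back-of-envelope sketch of how much state a full gather would have to assemble on one rank, compared with what each rank writes when the checkpoint stays sharded. The parameter count and GPU count come from the post. The fp32 weights, the Adam-style optimizer with two moments per parameter, and the even split across ranks are assumptions for illustration, not details of the actual run.

```python
# Back-of-envelope estimate only; byte counts below are assumptions.
PARAMS = 1_003_000_000   # 1.003B parameters (from the post)
WORLD_SIZE = 2           # 2x RTX 4090 (from the post)
BYTES_PER_PARAM = 4      # assumption: fp32 weights and optimizer state
OPTIMIZER_COPIES = 2     # assumption: Adam-style optimizer, two moments per parameter


def state_bytes(params: int) -> int:
    """Total bytes for weights plus optimizer state under the assumptions above."""
    return params * BYTES_PER_PARAM * (1 + OPTIMIZER_COPIES)


def shard_bytes(params: int, world_size: int) -> int:
    """Bytes each rank holds (and would write locally) with an even shard split."""
    return state_bytes(params) // world_size


def gather_bytes_received(params: int, world_size: int) -> int:
    """Bytes one rank must receive from its peers to assemble the full state."""
    return state_bytes(params) - shard_bytes(params, world_size)


if __name__ == "__main__":
    gb = 1e9
    print(f"full state:        {state_bytes(PARAMS) / gb:.2f} GB")
    print(f"per-rank shard:    {shard_bytes(PARAMS, WORLD_SIZE) / gb:.2f} GB")
    print(f"received on gather: {gather_bytes_received(PARAMS, WORLD_SIZE) / gb:.2f} GB")
```

Under these assumptions, a full-state gather moves several gigabytes between GPUs inside collective calls that are subject to the NCCL timeout, whereas a sharded save lets each rank write only its own shard.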

- Playground: rob-x-ai/genesis-1b-playground
- Write-up: https://kroonen.ai/blog/distributed-checkpoint-failures-rtx4090/