Dense SwiGLU Baseline (384.5M) — Training Checkpoint (Step 15,259)

Resumable training checkpoint with full optimizer state, saved at the end of training on 8B tokens.

Note: This model was trained with a Chinchilla-like token budget (8B tokens for 384.5M parameters, ~21 tokens/param). The model may benefit from continued training beyond this point.
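As a quick sanity check on the Chinchilla-style budget, the tokens-per-parameter ratio follows directly from the numbers above:

```python
# Chinchilla-style budget check: tokens per parameter
tokens = 8e9      # 8B training tokens
params = 384.5e6  # 384.5M parameters

ratio = tokens / params
print(f"{ratio:.1f} tokens/param")  # -> 20.8 tokens/param
```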

Contents

  • checkpoint.pt - Model weights + training state
  • model.safetensors - Model weights (safetensors format)
  • optimizer_rank0.pt - AdamW optimizer state (GPU 0)
  • optimizer_rank1.pt - AdamW optimizer state (GPU 1)
  • training_state.json - Step counter, LR, etc.

Model Config

  • Parameters: 384.5M (all active per token)
  • Hidden: 1024, Layers: 20, Heads: 16, KV Heads: 4
  • MLP: Dense SwiGLU, Intermediate: 4358
  • Training: 8B tokens (15,259 steps), AdamW lr=2.1e-4, cosine 5% warmup
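A minimal sketch of the dense SwiGLU feed-forward block described by this config (hidden 1024, intermediate 4358). The class and projection names are illustrative, not this repo's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUMLP(nn.Module):
    """Dense SwiGLU feed-forward: down(silu(gate(x)) * up(x))."""

    def __init__(self, hidden: int = 1024, intermediate: int = 4358):
        super().__init__()
        self.gate_proj = nn.Linear(hidden, intermediate, bias=False)
        self.up_proj = nn.Linear(hidden, intermediate, bias=False)
        self.down_proj = nn.Linear(intermediate, hidden, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

mlp = SwiGLUMLP()
out = mlp(torch.randn(2, 8, 1024))
print(out.shape)  # torch.Size([2, 8, 1024])
```

With no biases, each such block holds 3 × 1024 × 4358 ≈ 13.4M parameters; twenty layers of these account for a large share of the 384.5M total.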

Resume Training

import torch

# Assumes `model` and `optimizer` have already been constructed with the
# architecture and hyperparameters listed under Model Config.
checkpoint = torch.load("checkpoint.pt", map_location="cpu")
model.load_state_dict(checkpoint["model"])

# Load the optimizer shard for this GPU rank (0 or 1);
# requires torch.distributed to be initialized
rank = torch.distributed.get_rank()
optimizer_state = torch.load(f"optimizer_rank{rank}.pt", map_location="cpu")
optimizer.load_state_dict(optimizer_state)

# Resume from step 15,259
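The step counter and learning rate live in training_state.json. A small helper for reading them back, assuming the file uses "step" and "lr" as key names (adjust to match the actual schema):

```python
import json

def load_training_state(path: str = "training_state.json") -> tuple[int, float]:
    """Read resume bookkeeping saved alongside the weights.

    The key names ("step", "lr") are assumptions about the schema of
    training_state.json; adjust them to match the actual file.
    """
    with open(path) as f:
        state = json.load(f)
    return state["step"], state["lr"]

# start_step, last_lr = load_training_state()
```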

Pretrained Weights (inference)

For inference use the safetensors checkpoint in ../final/ instead.

License

CC-BY-NC-4.0

Complexity-ML -- 2026
