experimental_gqa_1_5b — iter 16,000

Megatron-LM checkpoint, trained from scratch on FineWeb sample-10BT/100BT_part1 (text) + codeparrot-clean (code), tokenized with cl100k_base.

This branch: iter 16,000 (~14.68B tokens trained). Other revisions are published on separate branches.

Architecture

Layers: 32, hidden 2048, FFN 4096, GQA 16Q/4KV, head_dim 128
Vocab 100,352 (cl100k_base, padded for TP=4)
RoPE base 1e7, partial 0.25; SwiGLU; RMSNorm + 1p
Attention output gate; QK-LayerNorm with WD; untied embeddings
Pretrain: bf16
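The dimensions above can be cross-checked with a little arithmetic. The sketch below only uses numbers from the list; the variable names are illustrative, and reading "partial 0.25" as "a quarter of each head's dimensions receive rotary embeddings" is an assumption, not something this card states.

```python
# Values copied from the architecture list above.
hidden = 2048
num_q_heads = 16
num_kv_heads = 4
head_dim = 128
rotary_fraction = 0.25   # assumed to mean: fraction of head_dim that is rotated
padded_vocab = 100_352
tp = 4

q_proj_width = num_q_heads * head_dim         # total width of the query projection
kv_proj_width = num_kv_heads * head_dim       # width of each of the K and V projections
queries_per_kv = num_q_heads // num_kv_heads  # query heads sharing one KV head
rotary_dims = int(head_dim * rotary_fraction) # per-head dims with RoPE (assumption)
vocab_per_tp_rank = padded_vocab // tp        # embedding rows per tensor-parallel rank

print(q_proj_width, kv_proj_width, queries_per_kv, rotary_dims, vocab_per_tp_rank)
# prints: 2048 512 4 32 25088
```

The query projection width equals the hidden size, and the K/V projections are a quarter of that, which is where GQA saves KV-cache memory relative to full multi-head attention.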

Loading

This is a sharded Megatron-LM checkpoint. To use it, initialize Git LFS (git lfs install) and clone this repository, then point Megatron's checkpoint load path at the cloned dir; the tokenizer is in tokenizer/. Architecture flags must match; see the recipe in the upstream training repo.

Training schedule

Iters 0–12,000: GBS=128, LR 3e-4 cosine, warmup 1000, over an 80,000-step schedule
Iters 12,000–16,000: GBS=512, LR cosine warm restart over a 30,000-step schedule
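As a sanity check, the two phases above reproduce the ~14.68B token figure if the sequence length is 4096. That sequence length is an assumption: it is not stated in this card and is inferred only because it makes the numbers agree. The cosine function is likewise a generic sketch with linear warmup and an assumed minimum LR of 0, not necessarily the exact scheduler used in training.

```python
import math

SEQ_LEN = 4096  # assumption: not stated in this card

# (start_iter, end_iter, global_batch_size) for each phase of the schedule.
phases = [(0, 12_000, 128), (12_000, 16_000, 512)]

sequences = sum((end - start) * gbs for start, end, gbs in phases)
tokens = sequences * SEQ_LEN
print(f"{tokens / 1e9:.2f}B tokens")  # prints: 14.68B tokens


def cosine_lr(step, peak=3e-4, warmup=1000, total=80_000, min_lr=0.0):
    """Generic cosine decay with linear warmup (min_lr=0.0 is an assumption)."""
    if step < warmup:
        return peak * step / warmup
    progress = min(1.0, (step - warmup) / (total - warmup))
    return min_lr + 0.5 * (peak - min_lr) * (1 + math.cos(math.pi * progress))


# LR at the point where the first phase hands over to the warm restart.
print(f"{cosine_lr(12_000):.3e}")
```

Because the first phase stops at iter 12,000 of an 80,000-step schedule, the learning rate is still close to its peak when the warm restart begins.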

Tokenizer (chat-extended)

See tokenizer/ for the Cl100kChatTokenizer with 16 reserved chat/think/tool tokens (IDs 100277-100292).
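The reserved ID range and the padded vocabulary size in the architecture section are consistent with each other under a simple padding rule. The sketch below assumes the vocabulary is rounded up to a multiple of 128 × TP; that rule and the `pad_vocab` helper are illustrative assumptions, not taken from this card.

```python
RESERVED_FIRST, RESERVED_LAST = 100_277, 100_292  # from the line above

reserved_ids = range(RESERVED_FIRST, RESERVED_LAST + 1)
used_vocab = RESERVED_LAST + 1  # IDs 0..100292 are in use


def pad_vocab(n, divisible_by=128, tp=4):
    """Round n up to a multiple of divisible_by * tp (assumed padding rule)."""
    multiple = divisible_by * tp
    return -(-n // multiple) * multiple  # ceiling division, then scale back up


padded = pad_vocab(used_vocab)
print(len(reserved_ids), used_vocab, padded, padded - used_vocab)
# prints: 16 100293 100352 59
```

Under this assumption, 59 embedding rows at the end of the table are padding and never correspond to a real token.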
