experimental_gqa_1_5b - iter 16,000

Megatron-LM checkpoint, trained from scratch on FineWeb sample-10BT/100BT_part1 (text) + codeparrot-clean (code), tokenized with cl100k_base.

This branch: iter 16,000 (~14.68B tokens trained). Other revisions: branches , , , . The branch tracks iter 16,000.

Architecture

  • Layers: 32, hidden 2048, FFN 4096, GQA 16Q/4KV, head_dim 128
  • Vocab 100,352 (cl100k_base, padded for TP=4)
  • RoPE base 1e7, partial 0.25; SwiGLU; RMSNorm + 1p
  • Attention output gate; QK-LayerNorm with WD; untied embeddings
  • Pretrain: bf16
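
As a sanity check on the "1_5b" in the model name, the dimensions listed above can be turned into a rough weight count. This is only a sketch under stated assumptions: no bias terms, and the norm weights and the attention output gate are left out because the card does not give their shapes. The function and constant names are illustrative and do not come from the training repo.

```python
# Rough weight count from the dimensions listed in the card.
# Assumptions (not stated in the card): no biases; norm weights and the
# attention output gate are ignored because their shapes are not given.
LAYERS, HIDDEN, FFN = 32, 2048, 4096
Q_HEADS, KV_HEADS, HEAD_DIM = 16, 4, 128
VOCAB = 100_352
ROPE_PARTIAL = 0.25


def estimate_params() -> int:
    """Return an approximate parameter count (projection and embedding weights only)."""
    q_proj = HIDDEN * Q_HEADS * HEAD_DIM        # 16 query heads
    kv_proj = 2 * HIDDEN * KV_HEADS * HEAD_DIM  # 4 shared key heads + 4 value heads (GQA)
    o_proj = Q_HEADS * HEAD_DIM * HIDDEN        # attention output projection
    mlp = 3 * HIDDEN * FFN                      # SwiGLU: gate, up, and down projections
    per_layer = q_proj + kv_proj + o_proj + mlp
    embeddings = 2 * VOCAB * HIDDEN             # untied input and output embeddings
    return LAYERS * per_layer + embeddings


# Number of head dimensions that receive rotary embeddings with partial RoPE.
ROTARY_DIMS = int(HEAD_DIM * ROPE_PARTIAL)

if __name__ == "__main__":
    print(f"approx. parameters: {estimate_params() / 1e9:.2f}B")  # 1.55B
    print(f"rotary dims per head: {ROTARY_DIMS}")                 # 32
```

Under these assumptions the total comes to about 1.55B, of which roughly 0.41B sits in the two embedding matrices.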

Loading

This is a sharded Megatron-LM checkpoint. To use it, initialize Git LFS (git lfs install) and clone this repository. Then point Megatron's at the cloned dir; tokenizer is at . Architecture flags must match; see the recipe in the upstream training repo.

Training schedule

  • Iters 0–12,000: GBS=128, LR 3e-4 cosine, warmup 1000, over an 80,000-step schedule
  • Iters 12,000–16,000: GBS=512, LR cosine warm-restart over 30,000-step schedule
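
The two phases above can be checked against the ~14.68B token figure quoted for this branch. The sketch below assumes a sequence length of 4096 tokens, which the card does not state; it is simply the value that makes the arithmetic agree with the quoted total. The names are illustrative.

```python
# Token count implied by the two-phase schedule.
# Assumption: sequence length 4096 (not stated in the card).
SEQ_LEN = 4096

# (start_iter, end_iter, global_batch_size) for each phase.
PHASES = [
    (0, 12_000, 128),
    (12_000, 16_000, 512),
]


def tokens_trained(phases, seq_len: int) -> int:
    """Sum iterations * global batch size * sequence length over all phases."""
    return sum((end - start) * gbs * seq_len for start, end, gbs in phases)


if __name__ == "__main__":
    total = tokens_trained(PHASES, SEQ_LEN)
    print(f"{total:,} tokens (~{total / 1e9:.2f}B)")  # 14,680,064,000 tokens (~14.68B)
```

Note that the 4,000 iterations at GBS=512 contribute more samples (2,048,000) than the first 12,000 iterations at GBS=128 (1,536,000).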

Tokenizer (chat-extended)

See tokenizer/ for the Cl100kChatTokenizer with 16 reserved chat/think/tool tokens (IDs 100277-100292).
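
The reserved ID range and the padded vocabulary size of 100,352 are consistent with each other, as the following sketch illustrates. It assumes the vocabulary is rounded up to a multiple of 128 × TP, which matches the default of Megatron-LM's --make-vocab-size-divisible-by option; whether this run used that default is an assumption. The helper name pad_vocab is hypothetical.

```python
import math

# IDs 100277-100292 are the 16 reserved chat/think/tool tokens from the card.
RESERVED_IDS = range(100_277, 100_293)
# The base cl100k_base vocabulary ends just before the first reserved ID.
BASE_VOCAB = RESERVED_IDS.start
EXTENDED_VOCAB = RESERVED_IDS.stop  # base vocabulary + reserved tokens


def pad_vocab(size: int, divisible_by: int = 128, tp: int = 4) -> int:
    """Round size up to a multiple of divisible_by * tp.

    Assumption: divisible_by=128 mirrors Megatron-LM's default; the card
    only states that the vocabulary is padded for TP=4.
    """
    multiple = divisible_by * tp
    return math.ceil(size / multiple) * multiple


if __name__ == "__main__":
    print(len(RESERVED_IDS))          # 16
    print(pad_vocab(EXTENDED_VOCAB))  # 100352
```

The padding leaves 59 unused rows (IDs 100293 through 100351) in the embedding matrices.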
