10R Dense 124M (SFT v8)
A 124M-parameter dense GPT language model — the dense baseline for the nanoMoE v10 project — pretrained from scratch and instruction-tuned in the ChatML format.
Custom architecture. This is not a 🤗 Transformers model and will not load with
AutoModel.from_pretrained. Use the included model_dense.py (see Usage).
Model details
| Property | Value |
|---|---|
| Architecture | Dense decoder-only Transformer (nanoGPT-style) |
| Parameters | 124.3M total (85.0M non-embedding) |
| Layers / heads / width | 12 / 12 / 768 (head_dim 64) |
| Context length | 1024 tokens |
| Vocabulary | 50,264 (GPT-2 r50k_base BPE + 7 ChatML special tokens) |
| Optimizer | Muon (2-D hidden weights) + AdamW (embeddings, norms, biases) |
| Precision | bfloat16 |
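
As a rough cross-check of the parameter counts listed here, the sketch below estimates them from the stated layer count, width, vocabulary size, and context length. It assumes a standard GPT-2 layout that this card does not spell out: a 4x MLP expansion, learned positional embeddings, tied input/output embeddings, and no bias or norm parameters counted. The function name `estimate_params` is illustrative and not part of model_dense.py.

```python
def estimate_params(n_layer=12, d_model=768, vocab_size=50264, block_size=1024):
    """Estimate weight-matrix parameter counts for a GPT-2-style dense model.

    Assumptions (not confirmed by the model card): 4x MLP expansion, learned
    positional embeddings, tied input/output embeddings, biases and norm
    parameters ignored.
    """
    attn = 4 * d_model * d_model   # Q, K, V and output projections
    mlp = 8 * d_model * d_model    # up-projection (d -> 4d) and down-projection (4d -> d)
    non_embedding = n_layer * (attn + mlp)
    embedding = vocab_size * d_model + block_size * d_model  # token + position tables
    return {
        "non_embedding": non_embedding,
        "embedding": embedding,
        "total": non_embedding + embedding,
    }


counts = estimate_params()
for name, value in counts.items():
    print(f"{name}: {value / 1e6:.1f}M")
```

Under these assumptions the estimate comes to about 84.9M non-embedding and 124.3M total parameters, close to the figures above; the small remainder would be norm and bias parameters, if the architecture has them.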
Training
Pretraining (from random init). ~5.1B tokens of the v10 corpus: long-form (Cosmopedia), code (StarCoder, Python), chat (OpenHermes-2.5, Infinity-Instruct, Magpie), web (FineWeb sample), and synthetic reasoning data. Cosine LR schedule with a low-LR cooldown anneal. Base validation loss ≈ 1.62.
Supervised fine-tuning (SFT v8). ~2 epochs over a ~12M-token ChatML instruction set
(Magpie-Pro + OpenHermes tier-1 + Infinity chat, plus smaller code, arithmetic,
anti-hallucination, and constraint-detection subsets). Learning rates: Muon 4e-5,
AdamW 1.3e-5; epoch-first schedule capped at 2 epochs to limit overfitting.
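
The Muon/AdamW split described in this card (2-D hidden weights versus embeddings, norms, and biases) can be sketched as a simple partition over parameter names and shapes. This is a minimal illustration, not the project's training code: the helper `split_for_optimizers`, the name fragments used to detect embeddings, and the nanoGPT-style parameter names in the example are all assumptions.

```python
def split_for_optimizers(named_shapes, embedding_keys=("wte", "wpe", "lm_head")):
    """Partition parameters into a Muon group and an AdamW group.

    named_shapes: iterable of (name, shape) pairs, e.g. built from
        ((n, tuple(p.shape)) for n, p in model.named_parameters()).
    embedding_keys: hypothetical name fragments that mark embedding tables.
    """
    muon, adamw = [], []
    for name, shape in named_shapes:
        is_embedding = any(key in name for key in embedding_keys)
        if len(shape) == 2 and not is_embedding:
            muon.append(name)    # 2-D hidden weight matrices
        else:
            adamw.append(name)   # embeddings, norms, biases
    return muon, adamw


# SFT learning rates quoted in this card.
SFT_LR = {"muon": 4e-5, "adamw": 1.3e-5}

# Hypothetical nanoGPT-style parameter names, for illustration only.
example = [
    ("transformer.wte.weight", (50264, 768)),
    ("transformer.h.0.attn.c_attn.weight", (2304, 768)),
    ("transformer.h.0.ln_1.weight", (768,)),
    ("transformer.h.0.mlp.c_fc.bias", (3072,)),
]
muon_names, adamw_names = split_for_optimizers(example)
```

Each group would then be handed to its own optimizer with the matching entry from `SFT_LR`.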
Intended use and limitations
A small (GPT-2-small–class) base + SFT model for research, experimentation, and lightweight assistant tasks. It follows simple instructions, answers basic questions, completes code, and tends to say "I don't know" rather than fabricate.
It is not production-ready. Expect factual errors, weak multi-step arithmetic, and no knowledge of recent events. Do not use it for high-stakes decisions.
Prompt format (ChatML)
<|im_start|>user
{your message}<|im_end|>
<|im_start|>assistant
Special tokens (IDs 50257–50263): <|im_start|>, <|im_end|>, <|tool_call|>,
<|tool_call_end|>, <|tool_result|>, <|tool_result_end|>, <|tool_error|>.
Generation should stop at <|im_end|> (id 50258).
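
The prompt template and stop rule above can be sketched as two small helpers. The names `build_chatml_prompt` and `trim_at_stop` are illustrative and not part of this repo. Joining several turns with a newline after each <|im_end|> is an assumption extrapolated from the single-turn template; the card does not say whether the model was trained on multi-turn or system messages.

```python
IM_START, IM_END = "<|im_start|>", "<|im_end|>"
IM_END_ID = 50258  # id of <|im_end|> as stated in this card


def build_chatml_prompt(messages):
    """Render [{'role': ..., 'content': ...}, ...] and open an assistant turn."""
    turns = [f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages]
    return "".join(turns) + f"{IM_START}assistant\n"


def trim_at_stop(token_ids, stop_id=IM_END_ID):
    """Return the ids before the first stop_id (all ids if it never appears)."""
    if stop_id in token_ids:
        return list(token_ids[:token_ids.index(stop_id)])
    return list(token_ids)


prompt = build_chatml_prompt(
    [{"role": "user", "content": "Write a haiku about the ocean."}]
)
```

`trim_at_stop` is meant for the newly generated ids only (the slice after the prompt), for cases where the sampling loop keeps producing tokens past the stop token.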
Usage
import torch, tiktoken
from model_dense import GPT, GPTConfig # from this repo
# 1) model
ck = torch.load("ckpt_dense_9830_sft8.pt", map_location="cpu", weights_only=False)
model = GPT(GPTConfig(**ck["model_args"]))
model.load_state_dict(ck["model"])
model.eval()
# 2) tokenizer — GPT-2 r50k_base + 7 ChatML specials (no tokenizer.json; built in code)
base = tiktoken.get_encoding("r50k_base")
specials = ["<|im_start|>", "<|im_end|>", "<|tool_call|>", "<|tool_call_end|>",
            "<|tool_result|>", "<|tool_result_end|>", "<|tool_error|>"]
enc = tiktoken.Encoding(
    name="v9_chatml_enc",
    pat_str=base._pat_str,
    mergeable_ranks=base._mergeable_ranks,
    special_tokens={**base._special_tokens,
                    **{t: 50257 + i for i, t in enumerate(specials)}},
)
# 3) generate (generate() matches the nanoGPT API; adjust if your model_dense.py differs)
prompt = ("<|im_start|>user\nWrite a haiku about the ocean.<|im_end|>\n"
          "<|im_start|>assistant\n")
idx = torch.tensor([enc.encode(prompt, allowed_special="all")])
out = model.generate(idx, max_new_tokens=128, temperature=0.7, top_k=50)[0].tolist()
text = enc.decode(out)
print(text.split("<|im_start|>assistant\n")[-1].split("<|im_end|>")[0].strip())
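The tokenizer is constructed in code, not loaded from a tokenizer.json.
Special-token IDs are assigned by list position starting at 50257, so the 7 special tokens must appear in exactly the order shown above; any other order produces IDs that do not match the checkpoint's vocabulary.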
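
To make the ordering requirement concrete, the sketch below reproduces the ID assignment with a plain dictionary (no tiktoken needed) and checks it against the IDs stated in this card. The helper name `special_token_ids` is illustrative.

```python
SPECIALS = ["<|im_start|>", "<|im_end|>", "<|tool_call|>", "<|tool_call_end|>",
            "<|tool_result|>", "<|tool_result_end|>", "<|tool_error|>"]
FIRST_SPECIAL_ID = 50257  # first id after the 50,257-token GPT-2 vocabulary


def special_token_ids(specials=SPECIALS, start=FIRST_SPECIAL_ID):
    """Assign consecutive ids in list order, mirroring the Usage snippet."""
    return {token: start + i for i, token in enumerate(specials)}


ids = special_token_ids()
assert ids["<|im_end|>"] == 50258        # the stop token id given in this card
assert max(ids.values()) == 50263        # last id of the 50257-50263 range

# Swapping the first two entries silently changes the stop-token id.
swapped = special_token_ids([SPECIALS[1], SPECIALS[0]] + SPECIALS[2:])
assert swapped["<|im_end|>"] == 50257
```

A check like this can sit next to the tokenizer construction so that an accidental reordering fails loudly instead of producing garbled generations.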
Files
| File | Purpose |
|---|---|
| ckpt_dense_9830_sft8.pt | Model weights + config (model and model_args) |
| model_dense.py | Architecture definition (required to load the model) |
If the checkpoint is the full training file, it may also contain optimizer_muon /
optimizer_adamw (for resuming training). Inference needs only model and model_args.
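
If a full training checkpoint needs to be slimmed down for inference, one option is to keep only the two required entries. This is a minimal sketch: the helper name `strip_for_inference` and the file names in the comments are placeholders, and the demo dictionary stands in for a real checkpoint.

```python
def strip_for_inference(checkpoint, keep=("model", "model_args")):
    """Return a new dict with only the entries needed for inference."""
    missing = [k for k in keep if k not in checkpoint]
    if missing:
        raise KeyError(f"checkpoint is missing required keys: {missing}")
    return {k: checkpoint[k] for k in keep}


# Typical use with PyTorch (file names here are placeholders):
#   ck = torch.load("full_training_ckpt.pt", map_location="cpu", weights_only=False)
#   torch.save(strip_for_inference(ck), "inference_only.pt")
demo = {"model": {"w": 1}, "model_args": {"n_layer": 12},
        "optimizer_muon": {}, "optimizer_adamw": {}}
slim = strip_for_inference(demo)
```

Dropping the optimizer state should noticeably reduce file size, since optimizers typically store extra per-parameter buffers.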
License
The apache-2.0 tag above is a placeholder. This model is trained on a mix of
datasets under varying licenses (OpenHermes-2.5, Magpie-Pro, Infinity-Instruct,
Cosmopedia, StarCoder, FineWeb); set the license to whatever your data-source
obligations require before redistributing.