10R Dense 124M (SFT v8)

A 124M-parameter dense GPT language model — the dense baseline for the nanoMoE v10 project — pretrained from scratch and instruction-tuned in the ChatML format.

Custom architecture. This is not a 🤗 Transformers model and will not load with AutoModel.from_pretrained. Use the included model_dense.py (see Usage).

Model details

Property	Value
Architecture	Dense decoder-only Transformer (nanoGPT-style)
Parameters	124.3M total (85.0M non-embedding)
Layers / heads / width	12 / 12 / 768 (head_dim 64)
Context length	1024 tokens
Vocabulary	50,264 (GPT-2 `r50k_base` BPE + 7 ChatML special tokens)
Optimizer	Muon (2-D hidden weights) + AdamW (embeddings, norms, biases)
Precision	bfloat16
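
The parameter figures in the table can be roughly reproduced from the layer count, width, vocabulary size, and context length. The sketch below is a back-of-envelope estimate, not a readout of `model_dense.py`: it assumes a GPT-2-style block (fused QKV plus an output projection, a 4x-wide MLP), learned positional embeddings, and an output head tied to the token embedding, and it ignores biases and norm parameters. The function name `estimate_params` is made up for illustration.

```python
def estimate_params(n_layer=12, d_model=768, vocab_size=50264, block_size=1024):
    """Rough weight-matrix count for a GPT-2-style dense decoder.

    Assumptions (not confirmed by model_dense.py): fused QKV plus an output
    projection, a 4x-wide MLP, learned positional embeddings, and an output
    head tied to the token embedding. Biases and norm parameters are ignored.
    """
    attn = 4 * d_model * d_model          # QKV (3*d*d) + output projection (d*d)
    mlp = 2 * d_model * (4 * d_model)     # up- and down-projection
    non_embedding = n_layer * (attn + mlp)
    token_emb = vocab_size * d_model      # counted once because of weight tying
    pos_emb = block_size * d_model
    return {
        "non_embedding": non_embedding,
        "embedding": token_emb + pos_emb,
        "total": non_embedding + token_emb + pos_emb,
    }

counts = estimate_params()
for name, value in counts.items():
    print(f"{name}: {value / 1e6:.1f}M")
```

Under these assumptions the estimate comes to about 84.9M non-embedding and about 124.3M total parameters. The small gap to the 85.0M in the table is plausibly the ignored norm and bias parameters.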

Training

Pretraining (from random init). ~5.1B tokens of the v10 corpus: long-form (Cosmopedia), code (StarCoder, Python), chat (OpenHermes-2.5, Infinity-Instruct, Magpie), web (FineWeb sample), and synthetic reasoning data. Cosine LR schedule with a low-LR cooldown anneal. Base validation loss ≈ 1.62.
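
One plausible reading of "cosine LR schedule with a low-LR cooldown anneal" is a cosine decay from a peak rate to a floor, followed by a short final phase that anneals from the floor toward zero. The sketch below illustrates that shape only. The function `lr_at`, the optional warmup, and every argument value are assumptions; the card does not give the actual step counts or pretraining learning rates.

```python
import math

def lr_at(step, total_steps, peak_lr, min_lr, warmup_steps=0, cooldown_steps=0):
    """Cosine decay from peak_lr to min_lr, then a linear anneal from min_lr to 0.

    Every argument is a placeholder; the model card does not give the real values.
    """
    if step < warmup_steps:                        # optional linear warmup
        return peak_lr * (step + 1) / warmup_steps
    cosine_end = total_steps - cooldown_steps
    if step < cosine_end:                          # cosine phase
        progress = (step - warmup_steps) / max(1, cosine_end - warmup_steps)
        return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
    # cooldown phase: anneal linearly from min_lr toward zero
    remaining = max(0, total_steps - step)
    return min_lr * remaining / max(1, cooldown_steps)

# Illustrative values only
schedule = [lr_at(s, total_steps=1000, peak_lr=1.0, min_lr=0.1,
                  warmup_steps=100, cooldown_steps=200) for s in range(1001)]
```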

Supervised fine-tuning (SFT v8). ~2 epochs over a ~12M-token ChatML instruction set (Magpie-Pro + OpenHermes tier-1 + Infinity chat, plus smaller code, arithmetic, anti-hallucination, and constraint-detection subsets). Learning rates: Muon 4e-5, AdamW 1.3e-5; epoch-first schedule capped at 2 epochs to limit overfitting.
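
The two learning rates apply to two disjoint parameter groups: Muon takes the 2-D hidden weight matrices, and AdamW takes embeddings, norms, and biases. A minimal sketch of that routing is shown below. It works on `(name, shape)` pairs so that it runs without PyTorch. The helper `split_for_optimizers` and the name substrings `wte`, `wpe`, and `lm_head` are nanoGPT-style assumptions and may not match the names in `model_dense.py`.

```python
def split_for_optimizers(named_shapes, embedding_keys=("wte", "wpe", "lm_head")):
    """Route parameter names to Muon or AdamW by shape and name.

    named_shapes: iterable of (name, shape) pairs, e.g.
        ((n, tuple(p.shape)) for n, p in model.named_parameters())
    embedding_keys: substrings that mark embedding-like matrices; these are
        nanoGPT-style guesses and may not match model_dense.py.
    """
    muon, adamw = [], []
    for name, shape in named_shapes:
        is_embedding = any(key in name for key in embedding_keys)
        if len(shape) == 2 and not is_embedding:
            muon.append(name)      # 2-D hidden weight matrices
        else:
            adamw.append(name)     # embeddings, norms, biases
    return muon, adamw

# SFT v8 learning rates from the model card
SFT_LR = {"muon": 4e-5, "adamw": 1.3e-5}

# Hypothetical parameter names, for illustration only
example = [
    ("transformer.wte.weight", (50264, 768)),
    ("transformer.h.0.attn.c_attn.weight", (2304, 768)),
    ("transformer.h.0.attn.c_attn.bias", (2304,)),
    ("transformer.h.0.ln_1.weight", (768,)),
]
muon_names, adamw_names = split_for_optimizers(example)
```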

Intended use and limitations

A small (GPT-2-small–class) base + SFT model for research, experimentation, and lightweight assistant tasks. It follows simple instructions, answers basic questions, completes code, and tends to say "I don't know" rather than fabricate.

It is not production-ready. Expect factual errors, weak multi-step arithmetic, and no knowledge of recent events. Do not use it for high-stakes decisions.

Prompt format (ChatML)

<|im_start|>user
{your message}<|im_end|>
<|im_start|>assistant
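
The template can be filled in programmatically. The helper below is a small sketch; `build_chatml_prompt` is a hypothetical name, not part of this repo. For a single user message it produces the same string as the prompt in Usage. Passing several turns or other roles follows the general ChatML convention, and this card does not confirm that the model was trained on them.

```python
def build_chatml_prompt(messages):
    """Render a list of {"role", "content"} dicts as a ChatML prompt.

    The prompt ends with an open assistant turn so the model writes the reply.
    """
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml_prompt(
    [{"role": "user", "content": "Write a haiku about the ocean."}]
)
```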

Special tokens (IDs 50257–50263): <|im_start|>, <|im_end|>, <|tool_call|>, <|tool_call_end|>, <|tool_result|>, <|tool_result_end|>, <|tool_error|>. Generation should stop at <|im_end|> (id 50258).
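
If the `generate()` in `model_dense.py` has no stop-token argument (nanoGPT's does not), stopping at `<|im_end|>` has to be handled by the caller. The loop below sketches one way to do that. `generate_until_stop` and the `next_token` callable are hypothetical. `next_token` stands in for whatever wraps the model's forward pass and sampling.

```python
IM_END_ID = 50258  # <|im_end|>, per the special-token list above

def generate_until_stop(next_token, prompt_ids, max_new_tokens=128, stop_id=IM_END_ID):
    """Sample one token at a time and stop once stop_id is produced.

    next_token: hypothetical callable mapping the current list of token IDs to
        the next token ID (for example, a thin wrapper around the model's
        forward pass plus sampling).
    Returns only the newly generated IDs, without the stop token.
    """
    ids = list(prompt_ids)
    new_ids = []
    for _ in range(max_new_tokens):
        token = next_token(ids)
        if token == stop_id:
            break
        ids.append(token)
        new_ids.append(token)
    return new_ids
```

Stopping inside the loop also saves the compute that would otherwise go into tokens generated after the end of the reply.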

Usage

import torch, tiktoken
from model_dense import GPT, GPTConfig          # from this repo

# 1) model
ck = torch.load("ckpt_dense_9830_sft8.pt", map_location="cpu", weights_only=False)
model = GPT(GPTConfig(**ck["model_args"]))
model.load_state_dict(ck["model"])
model.eval()

# 2) tokenizer — GPT-2 r50k_base + 7 ChatML specials (no tokenizer.json; built in code)
base = tiktoken.get_encoding("r50k_base")
specials = ["<|im_start|>", "<|im_end|>", "<|tool_call|>", "<|tool_call_end|>",
            "<|tool_result|>", "<|tool_result_end|>", "<|tool_error|>"]
enc = tiktoken.Encoding(
    name="v9_chatml_enc",
    pat_str=base._pat_str,
    mergeable_ranks=base._mergeable_ranks,
    special_tokens={**base._special_tokens,
                    **{t: 50257 + i for i, t in enumerate(specials)}},
)

# 3) generate (generate() matches the nanoGPT API; adjust if your model_dense.py differs)
prompt = ("<|im_start|>user\nWrite a haiku about the ocean.<|im_end|>\n"
          "<|im_start|>assistant\n")
idx = torch.tensor([enc.encode(prompt, allowed_special="all")])   # shape (1, T)
with torch.no_grad():
    out = model.generate(idx, max_new_tokens=128, temperature=0.7, top_k=50)[0].tolist()
# nanoGPT's generate() has no stop token and may run past <|im_end|>, so decode
# only the new tokens and cut the reply at the first <|im_end|>.
reply = enc.decode(out[idx.shape[1]:])
print(reply.split("<|im_end|>")[0].strip())

The tokenizer is constructed in code, not loaded from a tokenizer.json. It must include the 7 special tokens above in exactly this order, or the IDs will be wrong.
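
A quick check of the ID assignment can catch an ordering mistake before any text is generated. The sketch below rebuilds the mapping in plain Python and asserts the values stated in this card. The helper `special_token_ids` is illustrative and not part of the repo.

```python
SPECIALS = ["<|im_start|>", "<|im_end|>", "<|tool_call|>", "<|tool_call_end|>",
            "<|tool_result|>", "<|tool_result_end|>", "<|tool_error|>"]
FIRST_SPECIAL_ID = 50257   # first ID after the GPT-2 vocabulary (0..50256)

def special_token_ids(specials=SPECIALS, start=FIRST_SPECIAL_ID):
    """Assign consecutive IDs to the special tokens in list order."""
    return {token: start + i for i, token in enumerate(specials)}

ids = special_token_ids()
assert ids["<|im_start|>"] == 50257
assert ids["<|im_end|>"] == 50258       # the stop token
assert max(ids.values()) == 50263       # so the vocabulary size is 50,264
```

With the `enc` object from Usage, `enc.encode("<|im_end|>", allowed_special="all")` should likewise return `[50258]`.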

Files

File	Purpose
`ckpt_dense_9830_sft8.pt`	Model weights + config (`model` and `model_args`)
`model_dense.py`	Architecture definition (required to load the model)

If the checkpoint is the full training file, it may also contain optimizer_muon / optimizer_adamw (for resuming training). Inference needs only model and model_args.
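
To produce a smaller inference-only file, the optimizer state can be dropped before re-saving. The sketch below operates on a plain dict and uses a stand-in checkpoint so that it runs without PyTorch. `strip_for_inference` is a hypothetical helper, and a real checkpoint would come from `torch.load` as in Usage.

```python
def strip_for_inference(ckpt, keep=("model", "model_args")):
    """Return a shallow copy of a checkpoint dict with only inference keys."""
    missing = [k for k in keep if k not in ckpt]
    if missing:
        raise KeyError(f"checkpoint is missing required keys: {missing}")
    return {k: ckpt[k] for k in keep}

# Stand-in dict; a real checkpoint would come from torch.load(...)
full = {"model": {"w": 1}, "model_args": {"n_layer": 12},
        "optimizer_muon": {}, "optimizer_adamw": {}}
slim = strip_for_inference(full)
```

The resulting dict can then be written back out with `torch.save` under a file name of your choosing.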

License

The apache-2.0 tag above is a placeholder. This model is trained on a mix of datasets under varying licenses (OpenHermes-2.5, Magpie-Pro, Infinity-Instruct, Cosmopedia, StarCoder, FineWeb); set the license to whatever your data-source obligations require before redistributing.

