10R Dense 124M (SFT v8)

A 124M-parameter dense GPT language model — the dense baseline for the nanoMoE v10 project — pretrained from scratch and instruction-tuned in the ChatML format.

Custom architecture. This is not a 🤗 Transformers model and will not load with AutoModel.from_pretrained. Use the included model_dense.py (see Usage).

Model details

Architecture: Dense decoder-only Transformer (nanoGPT-style)
Parameters: 124.3M total (85.0M non-embedding)
Layers / heads / width: 12 / 12 / 768 (head_dim 64)
Context length: 1024 tokens
Vocabulary: 50,264 (GPT-2 r50k_base BPE + 7 ChatML special tokens)
Optimizer: Muon (2-D hidden weights) + AdamW (embeddings, norms, biases)
Precision: bfloat16
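
The parameter figures above can be sanity-checked with a rough count. This is a minimal sketch under assumptions that model_dense.py does not confirm here: bias-free linear layers, a 4x MLP expansion, learned position embeddings, weight-only norms, and an output head tied to the token embedding. The function name count_params is illustrative.

```python
# Rough parameter count for a GPT-2-style dense decoder.
# Assumptions (not confirmed by model_dense.py): bias-free linear layers,
# 4x MLP expansion, learned position embeddings, weight-only norms, and an
# output head that shares weights with the token embedding.
def count_params(n_layer=12, d_model=768, vocab=50264, ctx=1024):
    attn = 4 * d_model * d_model           # fused QKV (3 d^2) + output projection (d^2)
    mlp = 2 * d_model * (4 * d_model)      # up-projection and down-projection
    norms = (2 * n_layer + 1) * d_model    # two norms per block + one final norm
    non_embedding = n_layer * (attn + mlp) + norms
    embedding = vocab * d_model + ctx * d_model   # token table + position table
    return non_embedding, non_embedding + embedding

non_emb, total = count_params()
print(f"non-embedding: {non_emb / 1e6:.1f}M, total: {total / 1e6:.1f}M")
```

Under these assumptions the count rounds to 85.0M non-embedding and 124.3M total. If the real model uses biases or untied embeddings, the exact numbers will differ slightly.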

Training

Pretraining (from random init). ~5.1B tokens of the v10 corpus: long-form (Cosmopedia), code (StarCoder, Python), chat (OpenHermes-2.5, Infinity-Instruct, Magpie), web (FineWeb sample), and synthetic reasoning data. Cosine LR schedule with a low-LR cooldown anneal. Base validation loss ≈ 1.62.
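
One possible reading of "cosine schedule with a low-LR cooldown anneal" is a cosine decay to a floor followed by a short linear anneal to an even lower rate. The sketch below illustrates that shape only. Every number in it (peak rate, floor, final rate, step counts) is a hypothetical placeholder, not a v10 setting, and lr_at is an illustrative name.

```python
import math

# All defaults are hypothetical placeholders, not the values used for v10.
def lr_at(step, max_lr=1e-3, min_lr=1e-4, final_lr=1e-5,
          decay_end=9000, total=10000):
    if step < decay_end:
        # Cosine decay from max_lr down to min_lr.
        t = step / decay_end
        return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))
    # Cooldown: linear anneal from min_lr down to final_lr.
    t = min(1.0, (step - decay_end) / (total - decay_end))
    return min_lr + (final_lr - min_lr) * t

for s in (0, 4500, 9000, 9500, 10000):
    print(s, f"{lr_at(s):.2e}")
```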

Supervised fine-tuning (SFT v8). ~2 epochs over a ~12M-token ChatML instruction set (Magpie-Pro + OpenHermes tier-1 + Infinity chat, plus smaller code, arithmetic, anti-hallucination, and constraint-detection subsets). Learning rates: Muon 4e-5, AdamW 1.3e-5; epoch-first schedule capped at 2 epochs to limit overfitting.
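
The Muon/AdamW split described in Model details can be sketched as a simple partition over named parameters. The name fragments wte, wpe, and lm_head follow nanoGPT naming and are an assumption about model_dense.py, as is the helper name split_params. The demo uses stand-in objects so it runs without a checkpoint.

```python
from types import SimpleNamespace

# Name fragments treated as embeddings / output head. These follow nanoGPT
# naming and are an assumption about model_dense.py.
EMBEDDING_KEYS = ("wte", "wpe", "lm_head")

def split_params(named_params):
    """Return (muon_params, adamw_params) from (name, param) pairs."""
    muon, adamw = [], []
    for name, p in named_params:
        is_embedding = any(k in name for k in EMBEDDING_KEYS)
        if p.ndim == 2 and not is_embedding:
            muon.append((name, p))     # 2-D hidden weight matrices
        else:
            adamw.append((name, p))    # embeddings, norms, biases
    return muon, adamw

# Stand-in parameters (only .ndim is needed) so the sketch runs on its own.
demo = [
    ("transformer.wte.weight", SimpleNamespace(ndim=2)),
    ("transformer.h.0.attn.c_attn.weight", SimpleNamespace(ndim=2)),
    ("transformer.h.0.ln_1.weight", SimpleNamespace(ndim=1)),
    ("transformer.h.0.mlp.c_fc.bias", SimpleNamespace(ndim=1)),
]
muon, adamw = split_params(demo)
print("Muon :", [n for n, _ in muon])
print("AdamW:", [n for n, _ in adamw])
```

With a real model, passing model.named_parameters() should work the same way. The two groups would then go to a Muon implementation of your choice (learning rate 4e-5 for SFT) and to torch.optim.AdamW (learning rate 1.3e-5).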

Intended use and limitations

A small (GPT-2-small–class) base + SFT model for research, experimentation, and lightweight assistant tasks. It follows simple instructions, answers basic questions, completes code, and tends to say "I don't know" rather than fabricate.

It is not production-ready. Expect factual errors, weak multi-step arithmetic, and no knowledge of recent events. Do not use it for high-stakes decisions.

Prompt format (ChatML)

<|im_start|>user
{your message}<|im_end|>
<|im_start|>assistant

Special tokens (IDs 50257–50263): <|im_start|>, <|im_end|>, <|tool_call|>, <|tool_call_end|>, <|tool_result|>, <|tool_result_end|>, <|tool_error|>. Generation should stop at <|im_end|> (id 50258).
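
A small helper can assemble the prompt string from a list of messages. This is a sketch: the card only documents a single user turn, so multi-turn concatenation and any role other than user and assistant are assumptions based on the usual ChatML layout. build_prompt is an illustrative name.

```python
def build_prompt(messages, add_generation_prompt=True):
    """Render [{"role": ..., "content": ...}, ...] as a ChatML string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_prompt([{"role": "user", "content": "Write a haiku about the ocean."}])
print(prompt)
```

For a single user message this produces the same string as the template above, ending with the open assistant turn.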

Usage

import torch, tiktoken
from model_dense import GPT, GPTConfig          # from this repo

# 1) model
# weights_only=False unpickles arbitrary Python objects, so only load checkpoints you trust
ck = torch.load("ckpt_dense_9830_sft8.pt", map_location="cpu", weights_only=False)
model = GPT(GPTConfig(**ck["model_args"]))
model.load_state_dict(ck["model"])
model.eval()

# 2) tokenizer — GPT-2 r50k_base + 7 ChatML specials (no tokenizer.json; built in code)
base = tiktoken.get_encoding("r50k_base")
specials = ["<|im_start|>", "<|im_end|>", "<|tool_call|>", "<|tool_call_end|>",
            "<|tool_result|>", "<|tool_result_end|>", "<|tool_error|>"]
enc = tiktoken.Encoding(
    name="v9_chatml_enc",
    pat_str=base._pat_str,
    mergeable_ranks=base._mergeable_ranks,
    special_tokens={**base._special_tokens,
                    **{t: 50257 + i for i, t in enumerate(specials)}},
)

# 3) generate (generate() matches the nanoGPT API; adjust if your model_dense.py differs)
prompt = ("<|im_start|>user\nWrite a haiku about the ocean.<|im_end|>\n"
          "<|im_start|>assistant\n")
idx = torch.tensor([enc.encode(prompt, allowed_special="all")])
out = model.generate(idx, max_new_tokens=128, temperature=0.7, top_k=50)[0].tolist()
text = enc.decode(out)
print(text.split("<|im_start|>assistant\n")[-1].split("<|im_end|>")[0].strip())
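
The snippet above always generates max_new_tokens tokens and trims the text afterwards. To stop as soon as <|im_end|> (id 50258) appears, a token-by-token loop can be used instead. The sketch below is framework-agnostic: next_token is a hypothetical callable that takes the current ids and returns one sampled id. How you back it with the model depends on model_dense.py.

```python
from typing import Callable, List

IM_END_ID = 50258     # <|im_end|>, per the special-token list
BLOCK_SIZE = 1024     # context length of this model

def generate_until(next_token: Callable[[List[int]], int],
                   prompt_ids: List[int],
                   max_new_tokens: int = 128,
                   stop_id: int = IM_END_ID) -> List[int]:
    """Return only the newly generated ids, excluding the stop token."""
    ids = list(prompt_ids)
    new_ids: List[int] = []
    for _ in range(max_new_tokens):
        tok = next_token(ids[-BLOCK_SIZE:])   # crop to the context window
        if tok == stop_id:
            break
        ids.append(tok)
        new_ids.append(tok)
    return new_ids

# Scripted stand-in for the model so the sketch runs on its own.
script = iter([11, 22, 33, IM_END_ID, 44])
print(generate_until(lambda ctx: next(script), prompt_ids=[1, 2, 3]))
```

With the real model, next_token would run a forward pass on the cropped ids and sample from the logits at the last position, for example with the same temperature and top_k settings as above.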

The tokenizer is constructed in code, not loaded from a tokenizer.json. It must include the 7 special tokens above in exactly this order, or the IDs will be wrong.
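
Because a wrong order silently shifts the IDs, it can help to check the special-token table before generating. This is a minimal sketch. check_special_ids is an illustrative helper, and the demo runs on plain dicts so it does not need tiktoken.

```python
SPECIALS = ["<|im_start|>", "<|im_end|>", "<|tool_call|>", "<|tool_call_end|>",
            "<|tool_result|>", "<|tool_result_end|>", "<|tool_error|>"]
EXPECTED = {tok: 50257 + i for i, tok in enumerate(SPECIALS)}

def check_special_ids(special_tokens):
    """Return (token, expected_id, actual_id) for every mismatch."""
    return [(tok, want, special_tokens.get(tok))
            for tok, want in EXPECTED.items()
            if special_tokens.get(tok) != want]

# A table with two tokens swapped, to show what a failure looks like.
swapped = dict(EXPECTED)
swapped["<|im_start|>"], swapped["<|im_end|>"] = 50258, 50257

print(check_special_ids(EXPECTED))   # empty list: all IDs match
print(check_special_ids(swapped))    # lists the two swapped tokens
```

With the encoding built in Usage, the same check can be run as check_special_ids(enc._special_tokens), which should return an empty list.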

Files

File                      Purpose
ckpt_dense_9830_sft8.pt   Model weights + config (model and model_args)
model_dense.py            Architecture definition (required to load the model)

If the checkpoint is the full training file, it may also contain optimizer_muon / optimizer_adamw (for resuming training). Inference needs only model and model_args.
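
To produce a smaller inference-only file, the optimizer state can be dropped before re-saving. The sketch below works on the loaded checkpoint dict. strip_for_inference is an illustrative name, and the demo uses a stand-in dict with the key names mentioned above.

```python
INFERENCE_KEYS = ("model", "model_args")

def strip_for_inference(ck):
    """Keep only the entries needed to rebuild and run the model."""
    missing = [k for k in INFERENCE_KEYS if k not in ck]
    if missing:
        raise KeyError(f"checkpoint is missing {missing}")
    return {k: ck[k] for k in INFERENCE_KEYS}

# Stand-in for a full training checkpoint.
full = {"model": {"w": 0}, "model_args": {"n_layer": 12},
        "optimizer_muon": {}, "optimizer_adamw": {}}
slim = strip_for_inference(full)
print(sorted(slim))
```

After loading the real checkpoint with torch.load, the result of strip_for_inference(ck) can be written back out with torch.save under a file name of your choice.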

License

The apache-2.0 tag above is a placeholder. This model is trained on a mix of datasets under varying licenses (OpenHermes-2.5, Magpie-Pro, Infinity-Instruct, Cosmopedia, StarCoder, FineWeb); set the license to whatever your data-source obligations require before redistributing.
