# slm-125m-instruct
A 125M decoder-only language model (instruction-tuned via chat SFT + code SFT). Part of the SLM model family, built entirely from scratch, from raw web data through to a production-ready aligned model.
This is the instruct variant: the base model supervised fine-tuned on chat and code instruction datasets.
It follows instructions reliably and can generate Python code.
Use tohio/slm-125m-chat for the DPO-aligned version preferred for open-ended conversation.
Use tohio/slm-125m for the raw base model.
## Model Family
| Variant | Hub | Description |
|---|---|---|
| Base | tohio/slm-125m | Pretrained only |
| Instruct | tohio/slm-125m-instruct | Chat + code SFT |
| Chat | tohio/slm-125m-chat | SFT + DPO aligned |
## Architecture
| Component | Choice | Rationale |
|---|---|---|
| Positional encoding | RoPE | Better length generalisation, relative position awareness |
| Normalization | RMSNorm | Faster than LayerNorm, modern standard |
| Activation | SwiGLU | Better gradient flow, used by LLaMA and Mistral |
| Attention | GQA | Reduces KV cache memory at inference |
| Bias | None | Simpler, modern standard |
| Embeddings | Tied | Reduces parameters, effective at small scale |
| Vocab size | 32,000 | Custom BPE tokenizer trained on the pretraining corpus |
| Parameters | 125.3M (125,264,640) | Total parameter count |
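The SwiGLU choice is the main departure from a vanilla GELU MLP. Below is a minimal sketch of a LLaMA-style, bias-free SwiGLU feed-forward block; the class name and dimensions are illustrative only, not the actual SLM implementation.

```python
import torch
from torch import nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Illustrative LLaMA-style SwiGLU MLP: gated activation, no bias terms."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w_up = nn.Linear(d_model, d_ff, bias=False)    # up projection
        self.w_down = nn.Linear(d_ff, d_model, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU(x) = W_down( SiLU(W_gate x) * (W_up x) )
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```

Compared with a two-projection GELU MLP, the gate adds a third projection, which is the "better gradient flow" trade-off noted in the table above.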
## Training
Pretraining corpus: 5B tokens blended across the following sources:
| Source | Target Share | Link |
|---|---|---|
| common_crawl | 10.0% | Common Crawl |
| fineweb | 47.5% | FineWeb |
| wikipedia | 10.0% | Wikipedia (EN) |
| pg19 | 2.5% | PG-19 (Project Gutenberg) |
| pes2o | 5.0% | peS2o (academic papers) |
| open_web_math | 10.0% | OpenWebMath |
| stackexchange | 5.0% | StackExchange |
| code | 10.0% | Code (multi-source) |
Realized mix may differ from target: supply-bound sources (pes2o, jupyter at this scale) route their deficit to FineWeb, as illustrated by the sketch below.
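A rough, hypothetical sketch of that routing follows; the function and field names are assumptions for illustration, not the actual data pipeline.

```python
def realize_mix(targets: dict[str, float], available: dict[str, int], total: int) -> dict[str, int]:
    """Hypothetical sketch: cap each source at its available token count and
    route any shortfall to FineWeb. Not the actual SLM data pipeline."""
    realized, shortfall = {}, 0
    for source, share in targets.items():
        want = int(share * total)                 # tokens requested by the target share
        got = min(want, available.get(source, want))  # capped by supply
        realized[source] = got
        shortfall += want - got
    realized["fineweb"] = realized.get("fineweb", 0) + shortfall  # deficit goes to FineWeb
    return realized
```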
### Fine-tuning
| Stage | Dataset | Size |
|---|---|---|
| Chat SFT | OpenHermes-2.5 | ~1M examples |
| Code SFT | Magicoder-OSS-Instruct-75K | ~75K examples |
Hardware: NVIDIA H200 (pretraining on 1× H200, fine-tuning on 1× H200)
## Evaluation
Evaluated using lm-evaluation-harness.
| Benchmark | Few-shot | Metric | Score |
|---|---|---|---|
| HellaSwag | 10-shot | acc_norm | 0.3257 |
| ARC-Easy | 25-shot | acc_norm | 0.4739 |
| ARC-Challenge | 25-shot | acc_norm | 0.2585 |
| MMLU | 5-shot | acc | 0.2531 |
| TruthfulQA | 0-shot | acc | 0.4187 |
| HumanEval | 0-shot | pass@1 | 0.0000 |
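Scores of this kind can typically be reproduced with the harness CLI. The invocation below is a sketch, not the exact command used here; batch size and task selection are assumptions, and flag names follow recent lm-evaluation-harness releases.

```bash
lm_eval --model hf \
  --model_args pretrained=tohio/slm-125m-instruct,trust_remote_code=True \
  --tasks hellaswag \
  --num_fewshot 10 \
  --batch_size 8
```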
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True pulls in the custom SLM architecture shipped with the checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "tohio/slm-125m-instruct",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "tohio/slm-125m-instruct",
    trust_remote_code=True,
)

# Build a conversation and render it with the model's chat template.
messages = [
    {"role": "system", "content": "Answer clearly and concisely."},
    {"role": "user", "content": "Explain what a transformer is."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,
    return_dict=True,
)

# Stop generation at either the EOS token or the end-of-turn marker.
endofturn_id = tokenizer.convert_tokens_to_ids("<|endofturn|>")
output = model.generate(
    **inputs,
    max_new_tokens=120,
    do_sample=False,
    repetition_penalty=1.1,
    pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
    eos_token_id=[tokenizer.eos_token_id, endofturn_id],
)

# Decode only the newly generated tokens (skip the prompt).
input_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(output[0][input_len:], skip_special_tokens=True))
```
`trust_remote_code=True` loads the custom SLM architecture bundled alongside the model weights; no local install of the tohio/slm codebase is required.
## Limitations
- Scale: At 125M parameters this model is significantly smaller than frontier models. It will underperform on complex reasoning, long-context tasks, and domains not well-represented in the pretraining data.
- Hallucination: Like all language models, this model can generate plausible-sounding but factually incorrect content. Outputs should not be used as a source of truth without independent verification.
- Safety: This instruct variant has only undergone supervised fine-tuning; it has no preference-based (DPO) alignment and has not been red-teamed or adversarially evaluated, so safe outputs are not guaranteed in all contexts.
- Languages: Training data is predominantly English. Performance on other languages will be significantly degraded.
- Code: Code generation is primarily Python-oriented, reflecting the code sub-mix distribution used in pretraining and SFT.