slm-125m-instruct

A 125M-parameter decoder-only language model, instruction-tuned via chat SFT and code SFT. Part of the SLM model family: built entirely from scratch, from raw web data through to a production-ready aligned model.

This is the instruct variant: the base model supervised fine-tuned on chat and code instruction datasets. It follows instructions reliably and can generate Python code. Use `tohio/slm-125m-chat` for the DPO-aligned version preferred for open-ended conversation. Use `tohio/slm-125m` for the raw base model.

Model Family

| Variant | Hub | Description |
|---|---|---|
| Base | tohio/slm-125m | Pretrained only |
| Instruct | tohio/slm-125m-instruct | Chat + code SFT |
| Chat | tohio/slm-125m-chat | SFT + DPO aligned |

Architecture

| Component | Choice | Rationale |
|---|---|---|
| Positional encoding | RoPE | Better length generalisation, relative position awareness |
| Normalization | RMSNorm | Faster than LayerNorm, modern standard |
| Activation | SwiGLU | Better gradient flow, used by LLaMA and Mistral |
| Attention | GQA | Reduces KV cache memory at inference |
| Bias | None | Simpler, modern standard |
| Embeddings | Tied | Reduces parameters, effective at small scale |
| Vocab size | 32,000 | Custom BPE tokenizer trained on the pretraining corpus |
| Parameters | 125.3M (125,264,640) | |
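
As a point of reference, the sketch below shows how these choices typically fit together in code. It is illustrative only: the class and field names, hidden size, and head counts are assumptions, not the actual tohio/slm implementation.

```python
# Illustrative sketch of the architectural choices above; names and sizes are
# assumptions, not the actual SLM source code.
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class SLMConfig:
    vocab_size: int = 32_000       # custom BPE tokenizer
    hidden_size: int = 768         # assumed; not stated in this card
    num_attention_heads: int = 12  # assumed
    num_kv_heads: int = 4          # GQA: fewer KV heads than query heads
    tie_word_embeddings: bool = True
    use_bias: bool = False         # no bias terms anywhere


class RMSNorm(nn.Module):
    """RMSNorm: rescale by the RMS of the activations; no mean subtraction, no bias."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)


class SwiGLU(nn.Module):
    """SwiGLU feed-forward: silu(x W_gate) * (x W_up), projected back down, no bias."""

    def __init__(self, dim: int, hidden_dim: int, bias: bool = False):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=bias)
        self.up = nn.Linear(dim, hidden_dim, bias=bias)
        self.down = nn.Linear(hidden_dim, dim, bias=bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(nn.functional.silu(self.gate(x)) * self.up(x))
```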

Training

Pretraining corpus: 5B tokens blended across the following sources:

| Source | Target Share | Link |
|---|---|---|
| common_crawl | 10.0% | Common Crawl |
| fineweb | 47.5% | FineWeb |
| wikipedia | 10.0% | Wikipedia (EN) |
| pg19 | 2.5% | PG-19 (Project Gutenberg) |
| pes2o | 5.0% | peS2o (academic papers) |
| open_web_math | 10.0% | OpenWebMath |
| stackexchange | 5.0% | StackExchange |
| code | 10.0% | Code (multi-source) |

Realized mix may differ from target: supply-bound sources (pes2o and jupyter at this scale) route their deficit to FineWeb.
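
A minimal sketch of that rerouting rule, assuming made-up token supplies and a hypothetical `realized_mix` helper; the actual data pipeline lives in the tohio/slm repo.

```python
# Illustrative only: supply numbers are invented to show the deficit-routing rule.
TOTAL_TOKENS = 5_000_000_000

target_share = {
    "common_crawl": 0.100, "fineweb": 0.475, "wikipedia": 0.100,
    "pg19": 0.025, "pes2o": 0.050, "open_web_math": 0.100,
    "stackexchange": 0.050, "code": 0.100,
}

# Hypothetical per-source supply; pes2o is supply-bound at this scale.
available = {src: TOTAL_TOKENS for src in target_share}
available["pes2o"] = 150_000_000


def realized_mix(targets, supply, total, fallback="fineweb"):
    """Cap each source at its available supply and route the deficit to the fallback."""
    tokens = {src: min(share * total, supply[src]) for src, share in targets.items()}
    tokens[fallback] += total - sum(tokens.values())
    return tokens


for src, n in realized_mix(target_share, available, TOTAL_TOKENS).items():
    print(f"{src:15s} {n / TOTAL_TOKENS:6.1%}")
```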

Fine-tuning

| Stage | Dataset | Size |
|---|---|---|
| Chat SFT | OpenHermes-2.5 | ~1M examples |
| Code SFT | Magicoder-OSS-Instruct-75K | ~75K examples |

Hardware: NVIDIA H200 (pretraining on 1× H200, fine-tuning on 1× H200)
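
For context, a hedged sketch of how a single chat-SFT example might be rendered into training text with the tokenizer's chat template. The ShareGPT-style `conversations`/`from`/`value` fields are an assumption about the OpenHermes-2.5 layout, and the actual SFT code may handle loss masking differently.

```python
# Sketch: format one chat example with the chat template. Field names are assumed.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "tohio/slm-125m-instruct", trust_remote_code=True
)

example = {
    "conversations": [
        {"from": "human", "value": "Write a one-line Python hello world."},
        {"from": "gpt", "value": 'print("hello, world")'},
    ]
}

role_map = {"system": "system", "human": "user", "gpt": "assistant"}
messages = [
    {"role": role_map[turn["from"]], "content": turn["value"]}
    for turn in example["conversations"]
]

# Render to a single training string; SFT would typically compute loss only on
# the assistant tokens.
print(tokenizer.apply_chat_template(messages, tokenize=False))
```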

Evaluation

Evaluated using lm-evaluation-harness.

| Benchmark | Few-shot | Metric | Score |
|---|---|---|---|
| HellaSwag | 10-shot | acc_norm | 0.3257 |
| ARC-Easy | 25-shot | acc_norm | 0.4739 |
| ARC-Challenge | 25-shot | acc_norm | 0.2585 |
| MMLU | 5-shot | acc | 0.2531 |
| TruthfulQA | 0-shot | acc | 0.4187 |
| HumanEval | 0-shot | pass@1 | 0.0000 |
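
The scores above should be roughly reproducible with the harness's Python API; the sketch below assumes a recent lm-eval release that exposes `simple_evaluate`, and shows only the HellaSwag row.

```python
# Re-run the HellaSwag row with lm-evaluation-harness (assumes a recent version).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=tohio/slm-125m-instruct,trust_remote_code=True",
    tasks=["hellaswag"],
    num_fewshot=10,
    batch_size=16,
)
print(results["results"]["hellaswag"])  # includes acc_norm among other metrics
```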

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "tohio/slm-125m-instruct",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "tohio/slm-125m-instruct",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "Answer clearly and concisely."},
    {"role": "user", "content": "Explain what a transformer is."},
]

# Render the conversation with the model's chat template and append the
# assistant prompt so generation starts a new turn.
inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,
    return_dict=True,
)

# The model emits <|endofturn|> at the end of each assistant turn.
endofturn_id = tokenizer.convert_tokens_to_ids("<|endofturn|>")

output = model.generate(
    **inputs,
    max_new_tokens=120,
    do_sample=False,
    repetition_penalty=1.1,
    pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
    eos_token_id=[tokenizer.eos_token_id, endofturn_id],
)

# Decode only the newly generated tokens, skipping the prompt.
input_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(output[0][input_len:], skip_special_tokens=True))
```

`trust_remote_code=True` loads the custom SLM architecture bundled alongside the model weights; no local install of the tohio/slm codebase is required.
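
Since the instruct variant also went through code SFT, the same objects can be prompted for Python. An illustrative follow-up, reusing `model`, `tokenizer`, and `endofturn_id` from the snippet above; the generation settings are only suggestions.

```python
# Ask the already-loaded model for a small piece of Python code.
code_messages = [
    {"role": "user", "content": "Write a Python function that reverses a string."},
]
code_inputs = tokenizer.apply_chat_template(
    code_messages,
    return_tensors="pt",
    add_generation_prompt=True,
    return_dict=True,
)
code_output = model.generate(
    **code_inputs,
    max_new_tokens=160,
    do_sample=False,
    repetition_penalty=1.1,
    pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
    eos_token_id=[tokenizer.eos_token_id, endofturn_id],
)
prompt_len = code_inputs["input_ids"].shape[1]
print(tokenizer.decode(code_output[0][prompt_len:], skip_special_tokens=True))
```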

Limitations

  • Scale: At 125M parameters this model is significantly smaller than frontier models. It will underperform on complex reasoning, long-context tasks, and domains not well-represented in the pretraining data.
  • Hallucination: Like all language models, this model can generate plausible-sounding but factually incorrect content. Outputs should not be used as a source of truth without independent verification.
  • Safety: This instruct variant has not undergone DPO or other preference-based alignment (that is the chat variant); supervised fine-tuning alone does not guarantee safe outputs in all contexts, and the model has not undergone red-teaming or adversarial safety evaluation.
  • Languages: Training data is predominantly English. Performance on other languages will be significantly degraded.
  • Code: Code generation is primarily Python-oriented, reflecting the code sub-mix distribution used in pretraining and SFT.

Related

  • slm: full training pipeline (data curation through serving)
  • ai-infra: production Kubernetes serving via vLLM