# slm-125m

A 125M-parameter decoder-only language model (base pretrained model). Part of the SLM model family: built entirely from scratch, from raw web data through to a production-ready aligned model.

This is the base variant, pretrained on 5.5B tokens with no fine-tuning. It is suitable for research and as a starting point for further fine-tuning. Use `tohio/slm-125m-instruct` for instruction following or `tohio/slm-125m-chat` for aligned conversation.

## Model Family

| Variant | Hub | Description |
|---|---|---|
| Base | `tohio/slm-125m` | Pretrained only |
| Instruct | `tohio/slm-125m-instruct` | Chat + response-control + code SFT |
| Chat | `tohio/slm-125m-chat` | SFT + DPO aligned |

## Architecture

| Component | Choice | Rationale |
|---|---|---|
| Positional encoding | RoPE | Better length generalisation, relative position awareness |
| Normalization | RMSNorm | Faster than LayerNorm, modern standard |
| Activation | SwiGLU | Better gradient flow, used by LLaMA and Mistral |
| Attention | GQA | Reduces KV-cache memory at inference |
| Bias | None | Simpler, modern standard |
| Embeddings | Tied | Reduces parameters, effective at small scale |
| Vocab size | 32,000 | Custom BPE tokenizer trained on the pretraining corpus |
| Parameters | 125.3M | 125,264,640 parameters |
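The normalization and activation rows above can be made concrete with a short PyTorch sketch. This is illustrative only, not the actual SLM implementation; the class names and dimensions are arbitrary:

```python
# Minimal sketches of two of the architecture choices: RMSNorm and a
# SwiGLU feed-forward block (both bias-free, matching the "Bias: None" row).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    # Scales activations by their root-mean-square; unlike LayerNorm there is
    # no mean subtraction and no bias term, so it is cheaper to compute.
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLU(nn.Module):
    # Gated feed-forward: silu(x @ W_gate) * (x @ W_up), projected back down.
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

x = torch.randn(2, 8, 64)               # (batch, sequence, model dim)
y = SwiGLU(64, 256)(RMSNorm(64)(x))
print(y.shape)                          # torch.Size([2, 8, 64])
```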

## Training

Pretraining corpus: 5.5B tokens blended across the following sources:

| Source | Target share | Link |
|---|---|---|
| common_crawl | 10.0% | Common Crawl |
| fineweb | 46.0% | FineWeb |
| wikipedia | 10.0% | Wikipedia (EN) |
| pg19 | 2.5% | PG-19 (Project Gutenberg) |
| pes2o | 5.0% | peS2o (academic papers) |
| open_web_math | 10.0% | OpenWebMath |
| stackexchange | 5.0% | StackExchange |
| synthetic_arithmetic | 1.5% | Synthetic arithmetic (generated locally by `curator/sources/synthetic_arithmetic.py`) |
| code | 10.0% | Code (multi-source) |

Realized mix may differ from target: supply-bound sources (pes2o, jupyter at this scale) route their deficit to FineWeb.
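The deficit-routing rule can be sketched in a few lines. This is a hypothetical illustration, not the actual curator code; the target counts below follow from the table shares times 5.5B tokens, while the availability numbers are invented for the example:

```python
# Route the shortfall of supply-bound sources into FineWeb, keeping the
# total token budget fixed.
def realize_mix(targets, available):
    """targets / available: dicts of source -> token counts (same units)."""
    realized = {}
    deficit = 0.0
    for src, want in targets.items():
        got = min(want, available.get(src, float("inf")))
        realized[src] = got
        deficit += want - got
    # FineWeb absorbs whatever the supply-bound sources could not deliver.
    realized["fineweb"] = realized.get("fineweb", 0.0) + deficit
    return realized

targets = {"fineweb": 2.53e9, "pes2o": 0.275e9, "wikipedia": 0.55e9}
available = {"fineweb": 4e9, "pes2o": 0.2e9, "wikipedia": 1e9}  # invented
mix = realize_mix(targets, available)
print(mix["pes2o"], mix["fineweb"])  # 200000000.0 2605000000.0
```

The total stays at the target budget; only the per-source split shifts.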

Hardware: NVIDIA H200 (pretraining on 1× H200, fine-tuning on 1× H200)

## Evaluation

Evaluated using lm-evaluation-harness.

| Benchmark | Few-shot | Metric | Score |
|---|---|---|---|
| HellaSwag | 10-shot | acc_norm | 0.3333 |
| ARC-Easy | 25-shot | acc_norm | 0.4465 |
| ARC-Challenge | 25-shot | acc_norm | 0.2372 |
| MMLU | 5-shot | acc | 0.2614 |
| TruthfulQA | 0-shot | acc | 0.4414 |
| HumanEval | 0-shot | pass@1 | 0.0000 |
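A run like the first row can be reproduced with the harness CLI. A sketch, assuming the `lm_eval` entry point from a recent lm-evaluation-harness install (task names and flags follow its conventions; batch size is arbitrary):

```shell
# 10-shot HellaSwag on the base model; requires `pip install lm-eval`.
lm_eval \
  --model hf \
  --model_args pretrained=tohio/slm-125m,trust_remote_code=True,dtype=bfloat16 \
  --tasks hellaswag \
  --num_fewshot 10 \
  --batch_size 8
```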

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "tohio/slm-125m",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "tohio/slm-125m",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "Answer clearly and concisely."},
    {"role": "user", "content": "Explain what a transformer is."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,
    return_dict=True,
)

# <|endofturn|> is the model family's custom turn-delimiter token.
endofturn_id = tokenizer.convert_tokens_to_ids("<|endofturn|>")

output = model.generate(
    **inputs,
    max_new_tokens=120,
    do_sample=False,
    repetition_penalty=1.1,
    pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
    eos_token_id=[tokenizer.eos_token_id, endofturn_id],
)

# Decode only the newly generated tokens, not the prompt.
input_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(output[0][input_len:], skip_special_tokens=True))
```

`trust_remote_code=True` loads the custom SLM architecture bundled alongside the model weights; no local install of the `tohio/slm` codebase is required.

## Limitations

- Scale: At 125M parameters this model is far smaller than frontier models. It will underperform on complex reasoning, long-context tasks, and domains under-represented in the pretraining data.
- Hallucination: Like all language models, it can generate plausible-sounding but factually incorrect content. Do not treat outputs as a source of truth without independent verification.
- Safety: This base model has had no alignment or safety training (alignment arrives only in the chat variant via SFT + DPO), and it has not undergone red-teaming or adversarial safety evaluation.
- Languages: Training data is predominantly English; performance on other languages will be significantly degraded.
- Code: Code generation is primarily Python-oriented, reflecting the code sub-mix distribution used in pretraining.

## Related

- slm: full training pipeline (data curation through serving)
- ai-infra: production Kubernetes serving via vLLM