---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- keylm
- small-language-model
- instruct
- gqa
- rope
- swiglu
- qk-norm
- custom_code
datasets:
- HuggingFaceFW/fineweb-edu-score-2
- wikimedia/wikipedia
- HuggingFaceGECLM/REDDIT_comments
- marin-community/stackexchange-markdown
- allenai/WildChat-1M
- HuggingFaceH4/ultrachat_200k
- lmsys/lmsys-chat-1m
- OpenAssistant/oasst2
- HuggingFaceTB/cosmopedia-100k
- HuggingFaceTB/smol-smoltalk
- HuggingFaceTB/smoltalk2
base_model: Eclipse-Senpai/KeyLM-75M
base_model_relation: finetune
---
# KeyLM-75M-Instruct
KeyLM-75M-Instruct is a 75M-parameter instruction-tuned language model trained from scratch on approximately 18 billion tokens. That budget is a small fraction of what comparable small models use: roughly 3% of the ~600B tokens behind SmolLM-135M and under 1% of the ~2T behind SmolLM2-135M. Despite this, it is competitive on instruction following, outperforming SmolLM-135M-Instruct on IFEval with about half the parameters.
## Table of Contents
1. [Model Summary](#model-summary)
2. [How to Use](#how-to-use)
3. [Evaluation](#evaluation)
4. [Training](#training)
5. [Limitations](#limitations)
6. [License](#license)
7. [Citation](#citation)
## Model Summary
KeyLM is a compact decoder-only transformer built on the standard small-model recipe used by Llama and Qwen3: grouped-query attention, rotary position embeddings (RoPE), SwiGLU feed-forward layers, and, as in Qwen3, per-head QK-RMSNorm. It is designed for lightweight, low-latency English chat and instruction following.
| Field | Value |
|---|---|
| Parameters | 75,251,200 |
| Layers | 24 |
| Hidden size | 512 |
| Attention heads | 8 (2 KV heads, GQA) |
| Context length | 2048 |
| Vocabulary | 12,020 (ByteLevel BPE) |
| Precision | bfloat16 |
| Training tokens | ~18B |
GGUF builds for `llama.cpp`, LM Studio, and Ollama are available at [KeyLM-75M-Instruct-GGUF](https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct-GGUF).
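The figures in the summary table can be sanity-checked with a little arithmetic. The sketch below is not taken from the model's code: the feed-forward width (`FFN_DIM = 1280`), untied input/output embeddings, bias-free linear layers, and the head dimension of 64 are all assumptions, chosen because together they reproduce the listed parameter count. The actual configuration may differ. The second function estimates the bfloat16 KV-cache size at full context, to illustrate what 2 KV heads save over 8.

```python
# Back-of-the-envelope check. Only values marked "from the table" come from
# the model card; everything else is an assumption made for illustration.
HIDDEN = 512                   # from the table
LAYERS = 24                    # from the table
N_HEADS = 8                    # from the table
N_KV_HEADS = 2                 # from the table
VOCAB = 12_020                 # from the table
HEAD_DIM = HIDDEN // N_HEADS   # assumption: 64
FFN_DIM = 1280                 # assumption: SwiGLU intermediate size
TIED_EMBEDDINGS = False        # assumption: separate input/output embeddings


def count_parameters() -> int:
    """Count weights for a bias-free GQA + SwiGLU + QK-norm decoder."""
    kv_dim = N_KV_HEADS * HEAD_DIM
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim  # Q, O plus K, V
    qk_norm = 2 * HEAD_DIM        # one RMSNorm weight vector each for Q and K
    block_norms = 2 * HIDDEN      # pre-attention and pre-FFN RMSNorm
    ffn = 3 * HIDDEN * FFN_DIM    # gate, up, and down projections
    per_layer = attention + qk_norm + block_norms + ffn
    embeddings = VOCAB * HIDDEN * (1 if TIED_EMBEDDINGS else 2)
    final_norm = HIDDEN
    return embeddings + LAYERS * per_layer + final_norm


def kv_cache_bytes(context: int, kv_heads: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for `context` tokens."""
    return 2 * LAYERS * kv_heads * HEAD_DIM * bytes_per_value * context


print(f"{count_parameters():,} parameters")
print(kv_cache_bytes(2048, N_KV_HEADS) / 2**20, "MiB KV cache with 2 KV heads")
print(kv_cache_bytes(2048, N_HEADS) / 2**20, "MiB KV cache with 8 KV heads")
```

Under these assumptions the count matches the table, and grouped-query attention cuts the full-context KV cache to a quarter of what standard multi-head attention would need.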
## How to Use
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Eclipse-Senpai/KeyLM-75M-Instruct"

# trust_remote_code is needed because the architecture ships as custom code.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
)

# Format the conversation with the model's chat template.
messages = [{"role": "user", "content": "What is the capital of France?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

outputs = model.generate(
    inputs, max_new_tokens=128, do_sample=True,
    temperature=0.7, top_p=0.9, repetition_penalty=1.1,
)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```
## Evaluation
### Instruction following (IFEval)
This is where KeyLM is competitive. All rows are evaluated with `lm_eval` (`ifeval`, 541 prompts, greedy decoding).
| Model | Params | Train tokens | inst (strict) | prompt (strict) | 4-metric avg |
|---|---|---|---|---|---|
| **KeyLM-75M-Instruct** | **75M** | **~18B** | **22.42** | **12.75** | **17.85** |
| SmolLM-135M-Instruct | 135M | ~600B | 21.58 | 9.98 | 17.15 |
| SmolLM2-135M-Instruct | 135M | ~2T | 32.37 | 18.85 | 26.98 |
KeyLM edges out the original SmolLM-135M-Instruct on all three reported metrics with roughly 56% of the parameters and about 3% of the training tokens. SmolLM2-135M-Instruct, a far more heavily trained model, remains well ahead.
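To make the size and data comparison concrete, the short sketch below recomputes the ratios from the table. The numbers are copied from the rows above (token counts are the approximate figures quoted there); the helper name `relative_to` is purely illustrative.

```python
# (parameters in millions, training tokens in billions, IFEval 4-metric avg),
# copied from the IFEval table. Token counts are approximate.
ROWS = {
    "KeyLM-75M-Instruct": (75, 18, 17.85),
    "SmolLM-135M-Instruct": (135, 600, 17.15),
    "SmolLM2-135M-Instruct": (135, 2000, 26.98),
}


def relative_to(name: str, reference: str = "KeyLM-75M-Instruct") -> dict:
    """Express the reference model's size, data, and score relative to `name`."""
    ref_params, ref_tokens, ref_avg = ROWS[reference]
    params, tokens, avg = ROWS[name]
    return {
        "param_ratio": round(ref_params / params, 3),
        "token_ratio": round(ref_tokens / tokens, 3),
        "avg_delta": round(ref_avg - avg, 2),
    }


for other in ("SmolLM-135M-Instruct", "SmolLM2-135M-Instruct"):
    print(other, relative_to(other))
```

A `token_ratio` of 0.03 means KeyLM saw about 3% of the tokens the other model did; a positive `avg_delta` means KeyLM scores higher.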
### Base vs Instruct
The table below compares the base and instruction-tuned checkpoints across all benchmarks. Commonsense and knowledge tasks are zero-shot via `lm_eval` (accuracy; ARC and HellaSwag length-normalized); IFEval is the 4-metric average.
| Benchmark | KeyLM-75M (base) | KeyLM-75M-Instruct | Random |
|---|---|---|---|
| IFEval (4-metric avg) | — | **17.85** | — |
| MMLU | 23.0 | **24.0** | 25.0 |
| ARC (avg) | 29.9 | **30.8** | 25.0 |
| HellaSwag | 29.7 | **31.0** | 25.0 |
| PIQA | 60.0 | **61.3** | 50.0 |
| WinoGrande | **48.4** | 48.3 | 50.0 |
| OpenBookQA | 25.0 | 25.0 | 25.0 |
Instruction tuning leaves knowledge and reasoning roughly unchanged (no benchmark moves by more than 1.3 points); its real effect is the instruction-following ability that IFEval captures. Both versions sit modestly above random on ARC, HellaSwag, and PIQA, and at or slightly below chance on MMLU, WinoGrande, and OpenBookQA.
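The gaps discussed above can be read straight off the table. As a small illustration, the sketch below computes, for each benchmark, how much instruction tuning changed the score and how far the instruct model sits from the random baseline. The scores are copied from the table; the `margins` helper is a hypothetical name.

```python
# (base, instruct, random) accuracies copied from the table above.
SCORES = {
    "MMLU": (23.0, 24.0, 25.0),
    "ARC (avg)": (29.9, 30.8, 25.0),
    "HellaSwag": (29.7, 31.0, 25.0),
    "PIQA": (60.0, 61.3, 50.0),
    "WinoGrande": (48.4, 48.3, 50.0),
    "OpenBookQA": (25.0, 25.0, 25.0),
}


def margins(scores: dict) -> dict:
    """Return (instruct minus base, instruct minus random) per benchmark."""
    return {
        name: (round(instruct - base, 1), round(instruct - chance, 1))
        for name, (base, instruct, chance) in scores.items()
    }


for name, (gain, over_chance) in margins(SCORES).items():
    print(f"{name:<12} tuning gain {gain:+.1f}   vs. random {over_chance:+.1f}")
```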
## Training
### Pretraining
KeyLM was pretrained from random initialization on approximately 18B tokens, drawn from a weighted mixture of public datasets and streamed through a deterministic curriculum.
| Category | Share | Sources |
|---|---|---|
| Formal / quality | ~30% | FineWeb-Edu, Wikipedia |
| Casual / social | ~30% | Reddit comments, StackExchange |
| Conversational | ~25% | WildChat, UltraChat, LMSYS-Chat, OASST2 |
| Structured knowledge | ~5% | Cosmopedia |
| Typo augmentation | ~10% | Synthetic (contrastive) |
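For a rough sense of scale, the shares above can be turned into per-category token budgets. The sketch below does that, and also shows one possible way to draw categories in a reproducible order. The seeded sampler is only an assumption about what a "deterministic" mixture could look like; it is not the actual curriculum used for KeyLM, and the function names are hypothetical.

```python
import random

TOTAL_TOKENS = 18_000_000_000  # approximate figure from the card

# Approximate shares from the mixture table, in percent.
SHARE_PCT = {
    "Formal / quality": 30,
    "Casual / social": 30,
    "Conversational": 25,
    "Structured knowledge": 5,
    "Typo augmentation": 10,
}


def token_budget(total: int, share_pct: dict) -> dict:
    """Split a total token count across categories by percentage share."""
    if sum(share_pct.values()) != 100:
        raise ValueError("shares must sum to 100%")
    return {name: total * pct // 100 for name, pct in share_pct.items()}


def sample_schedule(share_pct: dict, steps: int, seed: int = 0) -> list:
    """Hypothetical seeded sampler: the same seed yields the same category order."""
    rng = random.Random(seed)
    names = list(share_pct)
    return rng.choices(names, weights=[share_pct[n] for n in names], k=steps)


for name, tokens in token_budget(TOTAL_TOKENS, SHARE_PCT).items():
    print(f"{name:<22} ~{tokens / 1e9:.1f}B tokens")
print(sample_schedule(SHARE_PCT, steps=5))
```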
### Post-training
Instruction tuning used `smol-smoltalk`, `ultrachat_200k`, and several `smoltalk2` splits (magpie, persona instruction-following, science, OpenHermes, system chats, summarization), with assistant-only loss masking, plus a set of custom synthetic instruction-following examples.
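Assistant-only loss masking means that only tokens in assistant turns contribute to the training loss, while system and user tokens are still seen as context. A minimal sketch of the idea follows, assuming each token has already been tagged with the role of the turn it belongs to. The helper and the toy token ids are hypothetical and not taken from the KeyLM training code; `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_non_assistant(token_ids: list, roles: list) -> list:
    """Build labels in which only assistant tokens contribute to the loss."""
    if len(token_ids) != len(roles):
        raise ValueError("token_ids and roles must have the same length")
    return [
        tok if role == "assistant" else IGNORE_INDEX
        for tok, role in zip(token_ids, roles)
    ]


# Toy example: three user tokens followed by two assistant tokens.
token_ids = [11, 12, 13, 21, 22]
roles = ["user", "user", "user", "assistant", "assistant"]
print(mask_non_assistant(token_ids, roles))  # [-100, -100, -100, 21, 22]
```

The inputs to the model stay unchanged; only the labels are masked, so the model learns to produce assistant replies without being trained to imitate user messages.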
## Limitations
- Minimal world knowledge. Not suitable for factual question answering, reasoning, math, or code.
- English only.
- No dedicated safety alignment was performed. Apply your own filtering before any user-facing use.
## License
Apache 2.0. The weights are trained from scratch and free to use, modify, and redistribute.
## Citation
```bibtex
@misc{keylm75m2026,
  title        = {KeyLM-75M: a from-scratch small language model},
  author       = {Eclipse-Senpai},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct}}
}
```