Instructions for using Eclipse-Senpai/KeyLM-75M-Instruct with libraries and local apps.

Libraries

How to use Eclipse-Senpai/KeyLM-75M-Instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Eclipse-Senpai/KeyLM-75M-Instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Eclipse-Senpai/KeyLM-75M-Instruct", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use Eclipse-Senpai/KeyLM-75M-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Eclipse-Senpai/KeyLM-75M-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Eclipse-Senpai/KeyLM-75M-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
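
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are hypothetical, and it assumes the server follows the OpenAI-compatible response shape, with the reply under `choices[0].message.content`.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, prompt, timeout=60):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumed OpenAI-compatible shape: reply text under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server from the steps above running, `chat("http://localhost:8000", "Eclipse-Senpai/KeyLM-75M-Instruct", "What is the capital of France?")` should return the generated reply as a string.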

Use Docker

docker model run hf.co/Eclipse-Senpai/KeyLM-75M-Instruct

SGLang

How to use Eclipse-Senpai/KeyLM-75M-Instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Eclipse-Senpai/KeyLM-75M-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Eclipse-Senpai/KeyLM-75M-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Eclipse-Senpai/KeyLM-75M-Instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Eclipse-Senpai/KeyLM-75M-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use Eclipse-Senpai/KeyLM-75M-Instruct with Docker Model Runner:
```
docker model run hf.co/Eclipse-Senpai/KeyLM-75M-Instruct
```

	---
	license: apache-2.0
	language:
	- en
	library_name: transformers
	pipeline_tag: text-generation
	tags:
	- keylm
	- small-language-model
	- instruct
	- gqa
	- rope
	- swiglu
	- qk-norm
	- custom_code
	datasets:
	- HuggingFaceFW/fineweb-edu-score-2
	- wikimedia/wikipedia
	- HuggingFaceGECLM/REDDIT_comments
	- marin-community/stackexchange-markdown
	- allenai/WildChat-1M
	- HuggingFaceH4/ultrachat_200k
	- lmsys/lmsys-chat-1m
	- OpenAssistant/oasst2
	- HuggingFaceTB/cosmopedia-100k
	- HuggingFaceTB/smol-smoltalk
	- HuggingFaceTB/smoltalk2
	base_model: Eclipse-Senpai/KeyLM-75M
	base_model_relation: finetune
	---

	# KeyLM-75M-Instruct

	KeyLM-75M-Instruct is a 75M-parameter instruction-tuned language model trained from scratch on approximately 18 billion tokens. That budget is a small fraction of what comparable small models use: SmolLM-135M was trained on roughly 600B tokens and SmolLM2-135M on roughly 2T. Even so, it is competitive on instruction following, outperforming SmolLM-135M-Instruct on IFEval with about half the parameters.

	## Table of Contents

	1. [Model Summary](#model-summary)
	2. [How to Use](#how-to-use)
	3. [Evaluation](#evaluation)
	4. [Training](#training)
	5. [Limitations](#limitations)
	6. [License](#license)
	7. [Citation](#citation)

	## Model Summary

	KeyLM is a compact decoder-only transformer built on the standard small-model recipe used by Llama and Qwen3: grouped-query attention, rotary position embeddings (RoPE), SwiGLU feed-forward layers, and per-head QK-RMSNorm. It is designed for lightweight, low-latency English chat and instruction following.
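
One practical effect of grouped-query attention is a smaller key/value cache at inference time. The sketch below estimates the cache size for a single full-length sequence from the figures in the table that follows. It assumes a head dimension of hidden size divided by query heads (64) and 2 bytes per value (bfloat16); neither is stated explicitly in this card, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one full-length sequence."""
    # The factor 2 covers the separate key and value tensors in every layer.
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_value


HIDDEN, Q_HEADS, KV_HEADS, LAYERS, CONTEXT = 512, 8, 2, 24, 2048
HEAD_DIM = HIDDEN // Q_HEADS  # assumption: 64, not stated in the card

gqa_bytes = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT)
mha_bytes = kv_cache_bytes(LAYERS, Q_HEADS, HEAD_DIM, CONTEXT)  # if every head had its own K/V

print(f"GQA cache: {gqa_bytes / 2**20:.0f} MiB")  # 24 MiB
print(f"MHA cache: {mha_bytes / 2**20:.0f} MiB")  # 96 MiB
```

Under these assumptions, sharing each KV head across four query heads cuts the cache to a quarter of what full multi-head attention would need.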

	| Field | Value |
	|---|---|
	| Parameters | 75,251,200 |
	| Layers | 24 |
	| Hidden size | 512 |
	| Attention heads | 8 (2 KV heads, GQA) |
	| Context length | 2048 |
	| Vocabulary | 12,020 (ByteLevel BPE) |
	| Precision | bfloat16 |
	| Training tokens | ~18B |
	GGUF builds for `llama.cpp`, LM Studio, and Ollama are available at [KeyLM-75M-Instruct-GGUF](https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct-GGUF).

	## How to Use

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_id = "Eclipse-Senpai/KeyLM-75M-Instruct"
	# trust_remote_code is needed because the repository ships custom model code.
	tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained(
	    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
	)

	messages = [{"role": "user", "content": "What is the capital of France?"}]
	# return_dict=True returns input_ids together with the attention mask.
	inputs = tokenizer.apply_chat_template(
	    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
	)
	outputs = model.generate(
	    **inputs, max_new_tokens=128, do_sample=True,
	    temperature=0.7, top_p=0.9, repetition_penalty=1.1,
	)
	# Decode only the newly generated tokens, skipping the prompt.
	prompt_len = inputs["input_ids"].shape[1]
	print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))
	```

	## Evaluation

	### Instruction following (IFEval)

	This is where KeyLM is competitive. All rows are evaluated with `lm_eval` (`ifeval`, 541 prompts, greedy decoding).

	| Model | Params | Train tokens | inst (strict) | prompt (strict) | 4-metric avg |
	|---|---|---|---|---|---|
	| KeyLM-75M-Instruct | 75M | ~18B | 22.42 | 12.75 | 17.85 |
	| SmolLM-135M-Instruct | 135M | ~600B | 21.58 | 9.98 | 17.15 |
	| SmolLM2-135M-Instruct | 135M | ~2T | 32.37 | 18.85 | 26.98 |

	KeyLM narrowly beats the original SmolLM-135M-Instruct (17.85 vs 17.15 on the 4-metric average) at roughly half the size and a fraction of the training data. SmolLM2-135M-Instruct, a far more heavily trained model, remains well ahead at 26.98.
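
To put "half the size" and "a fraction of the training data" into numbers, the short calculation below uses the figures from the table. The token counts are the approximate values quoted there, and the 135M parameter count is the nominal size of the SmolLM models, so the ratios are rough.

```python
KEYLM_PARAMS = 75_251_200
SMOLLM_PARAMS = 135_000_000  # nominal size, approximate

# Approximate training-token counts quoted in the table above.
TRAIN_TOKENS = {
    "KeyLM-75M-Instruct": 18e9,
    "SmolLM-135M-Instruct": 600e9,
    "SmolLM2-135M-Instruct": 2e12,
}

param_ratio = KEYLM_PARAMS / SMOLLM_PARAMS
keylm_tokens = TRAIN_TOKENS["KeyLM-75M-Instruct"]
token_ratios = {
    name: keylm_tokens / tokens
    for name, tokens in TRAIN_TOKENS.items()
    if name != "KeyLM-75M-Instruct"
}

print(f"parameters: {param_ratio:.0%} of SmolLM-135M")
for name, ratio in token_ratios.items():
    print(f"training tokens: {ratio:.1%} of {name}")
```

By these figures KeyLM has a little over half the parameters of the 135M models and saw about 3% of SmolLM's training tokens and under 1% of SmolLM2's.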

	### Base vs Instruct

	The table below compares the base and instruction-tuned checkpoints across all benchmarks. Commonsense and knowledge tasks are zero-shot via `lm_eval` (accuracy; ARC and HellaSwag length-normalized); IFEval is the 4-metric average.

	| Benchmark | KeyLM-75M (base) | KeyLM-75M-Instruct | Random |
	|---|---|---|---|
	| IFEval (4-metric avg) | — | 17.85 | — |
	| MMLU | 23.0 | 24.0 | 25.0 |
	| ARC (avg) | 29.9 | 30.8 | 25.0 |
	| HellaSwag | 29.7 | 31.0 | 25.0 |
	| PIQA | 60.0 | 61.3 | 50.0 |
	| WinoGrande | 48.4 | 48.3 | 50.0 |
	| OpenBookQA | 25.0 | 25.0 | 25.0 |

	Instruction tuning leaves knowledge and reasoning roughly unchanged (every score moves by 1.3 points or less); its real effect is the instruction-following ability IFEval captures. Both versions sit modestly above random on ARC, HellaSwag, and PIQA, and at or slightly below chance on MMLU, WinoGrande, and OpenBookQA.

	## Training

	### Pretraining

	KeyLM was pretrained from random initialization on approximately 18B tokens, drawn from a weighted mixture of public datasets and streamed through a deterministic curriculum.

	| Category | Share | Sources |
	|---|---|---|
	| Formal / quality | ~30% | FineWeb-Edu, Wikipedia |
	| Casual / social | ~30% | Reddit comments, StackExchange |
	| Conversational | ~25% | WildChat, UltraChat, LMSYS-Chat, OASST2 |
	| Structured knowledge | ~5% | Cosmopedia |
	| Typo augmentation | ~10% | Synthetic (contrastive) |
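
The shares above translate into approximate token budgets per category. The sketch below does that arithmetic and adds a seeded weighted sampler as a stand-in for a deterministic curriculum. The sampler is purely illustrative: the card does not describe the actual curriculum, and `token_budgets` and `sample_schedule` are hypothetical names.

```python
import random

TOTAL_TOKENS = 18_000_000_000  # approximate, per the card

# Approximate mixture shares in percent, taken from the table above.
MIXTURE_PERCENT = {
    "formal": 30,
    "casual": 30,
    "conversational": 25,
    "structured": 5,
    "typo_augmentation": 10,
}


def token_budgets(total, shares):
    """Split a token budget according to integer percentage shares."""
    if sum(shares.values()) != 100:
        raise ValueError("shares must sum to 100")
    return {name: total * pct // 100 for name, pct in shares.items()}


def sample_schedule(shares, steps, seed=0):
    """Illustrative only: pick a category per step, reproducibly for a given seed."""
    rng = random.Random(seed)
    names = list(shares)
    return rng.choices(names, weights=[shares[n] for n in names], k=steps)


budgets = token_budgets(TOTAL_TOKENS, MIXTURE_PERCENT)
for name, tokens in budgets.items():
    print(f"{name:>18}: {tokens / 1e9:.1f}B tokens")
```

Fixing the seed makes the category order reproducible across runs, which is one simple way to make a streamed mixture deterministic.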

	### Post-training

	Instruction tuning used `smol-smoltalk`, `ultrachat_200k`, and several `smoltalk2` splits (magpie, persona instruction-following, science, OpenHermes, system chats, summarization), with assistant-only loss masking, plus a set of custom synthetic instruction-following examples.
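
Assistant-only loss masking means that only tokens in assistant turns contribute to the training loss. A minimal sketch of the idea, assuming each token has already been tagged with the role of the turn it belongs to: the token ids and the `mask_non_assistant` helper are hypothetical, and this is not the author's training code. The value -100 is the label index that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_non_assistant(token_ids, roles):
    """Return labels in which only assistant tokens keep their token id."""
    if len(token_ids) != len(roles):
        raise ValueError("token_ids and roles must have the same length")
    return [
        tok if role == "assistant" else IGNORE_INDEX
        for tok, role in zip(token_ids, roles)
    ]


# Hypothetical token ids for one user turn followed by one assistant turn.
token_ids = [11, 12, 13, 21, 22, 23]
roles = ["user", "user", "user", "assistant", "assistant", "assistant"]

labels = mask_non_assistant(token_ids, roles)
print(labels)  # [-100, -100, -100, 21, 22, 23]
```

The model still sees the user tokens as context; they are only excluded from the loss, so gradient updates come from predicting the assistant's replies.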

	## Limitations

	- Minimal world knowledge. Not suitable for factual question answering, reasoning, math, or code.
	- English only.
	- No dedicated safety alignment was performed. Apply your own filtering before any user-facing use.

	## License

	Apache 2.0. The weights are trained from scratch and free to use, modify, and redistribute.

	## Citation

	```bibtex
	@misc{keylm75m2026,
	  title        = {KeyLM-75M: a from-scratch small language model},
	  author       = {Eclipse-Senpai},
	  year         = {2026},
	  howpublished = {\url{https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct}}
	}
	```