---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- keylm
- small-language-model
- instruct
- gqa
- rope
- swiglu
- qk-norm
- custom_code
datasets:
- HuggingFaceFW/fineweb-edu-score-2
- wikimedia/wikipedia
- HuggingFaceGECLM/REDDIT_comments
- marin-community/stackexchange-markdown
- allenai/WildChat-1M
- HuggingFaceH4/ultrachat_200k
- lmsys/lmsys-chat-1m
- OpenAssistant/oasst2
- HuggingFaceTB/cosmopedia-100k
- HuggingFaceTB/smol-smoltalk
- HuggingFaceTB/smoltalk2
base_model: Eclipse-Senpai/KeyLM-75M
base_model_relation: finetune
---
# KeyLM-75M-Instruct

KeyLM-75M-Instruct is a 75M parameter instruction-tuned language model trained from scratch on approximately 18 billion tokens. That training budget is a small fraction of what comparable small models use (SmolLM-135M was trained on roughly 600B tokens, SmolLM2-135M on roughly 2T). Despite this, it is competitive on instruction following, outperforming SmolLM-135M-Instruct on IFEval while using about half the parameters and a fraction of the data.

## Table of Contents

1. [Model Summary](#model-summary)
2. [How to Use](#how-to-use)
3. [Evaluation](#evaluation)
4. [Training](#training)
5. [Limitations](#limitations)
6. [License](#license)
7. [Citation](#citation)
## Model Summary

KeyLM is a compact decoder-only transformer built on the standard small-model recipe used by Llama and Qwen3: grouped-query attention, rotary position embeddings (RoPE), SwiGLU feed-forward layers, and per-head QK-RMSNorm. It is designed for lightweight, low-latency English chat and instruction following.

| Field | Value |
|---|---|
| Parameters | 75,251,200 |
| Layers | 24 |
| Hidden size | 512 |
| Attention heads | 8 (2 KV heads, GQA) |
| Context length | 2048 |
| Vocabulary | 12,020 (ByteLevel BPE) |
| Precision | bfloat16 |
| Training tokens | ~18B |

GGUF builds for `llama.cpp`, LM Studio, and Ollama are available at [KeyLM-75M-Instruct-GGUF](https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct-GGUF).
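
The figures in the table above can be cross-checked with a little arithmetic. The sketch below is not read from the model's configuration: the head dimension (64), the SwiGLU intermediate size (1280), untied input and output embeddings, and bias-free linear layers are all assumptions, chosen because together they reproduce the stated 75,251,200 total. The real `config.json` may differ.

```python
# Values taken from the Model Summary table.
VOCAB, HIDDEN, LAYERS = 12_020, 512, 24
N_HEADS, N_KV_HEADS, CONTEXT = 8, 2, 2048

# ASSUMPTIONS (not stated in the model card).
HEAD_DIM = HIDDEN // N_HEADS   # 64
FFN_DIM = 1280                 # hypothetical SwiGLU intermediate size
TIED_EMBEDDINGS = False        # separate input embedding and output head
BYTES_PER_VALUE = 2            # bfloat16


def count_parameters() -> int:
    """Parameter count of a bias-free decoder under the assumptions above."""
    embed = VOCAB * HIDDEN * (1 if TIED_EMBEDDINGS else 2)
    q_proj = HIDDEN * N_HEADS * HEAD_DIM
    kv_proj = 2 * HIDDEN * N_KV_HEADS * HEAD_DIM  # K and V use the smaller GQA width
    o_proj = N_HEADS * HEAD_DIM * HIDDEN
    swiglu = 3 * HIDDEN * FFN_DIM                 # gate, up, and down projections
    norms = 2 * HIDDEN + 2 * HEAD_DIM             # two RMSNorms plus Q and K norm weights
    per_layer = q_proj + kv_proj + o_proj + swiglu + norms
    return embed + LAYERS * per_layer + HIDDEN    # trailing term: final RMSNorm


def kv_cache_bytes(n_kv_heads: int, tokens: int = CONTEXT) -> int:
    """Bytes needed to cache K and V for every layer over `tokens` positions."""
    return 2 * LAYERS * n_kv_heads * HEAD_DIM * BYTES_PER_VALUE * tokens


total = count_parameters()
gqa_mib = kv_cache_bytes(N_KV_HEADS) / 2**20
mha_mib = kv_cache_bytes(N_HEADS) / 2**20
print(f"parameters: {total:,}")
print(f"KV cache at {CONTEXT} tokens: {gqa_mib:.0f} MiB (2 KV heads) vs {mha_mib:.0f} MiB (8 KV heads)")
```

Under these assumptions, sharing two KV heads across eight query heads cuts the full-context cache to a quarter of what one KV head per query head would need.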
## How to Use

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Eclipse-Senpai/KeyLM-75M-Instruct"

# trust_remote_code=True is needed because the architecture ships as custom code.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
)

messages = [{"role": "user", "content": "What is the capital of France?"}]

# return_dict=True makes the return type explicit; take the token ids from it.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
)
input_ids = inputs["input_ids"]

outputs = model.generate(
    input_ids, max_new_tokens=128, do_sample=True,
    temperature=0.7, top_p=0.9, repetition_penalty=1.1,
)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True))
```
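
For multi-turn use, the whole conversation plus the reply has to fit in the 2048-token context. One possible way to handle that is to drop the oldest turns first. The helper below is only a sketch: `trim_history`, `count_tokens`, and `word_count` are hypothetical names, not part of the model's or Transformers' API, and the word-based counter is a stand-in for real tokenization.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

CONTEXT_LENGTH = 2048  # from the model card
MAX_NEW_TOKENS = 128   # matches the generation example above


def trim_history(
    messages: List[Message],
    count_tokens: Callable[[List[Message]], int],
    budget: int = CONTEXT_LENGTH - MAX_NEW_TOKENS,
) -> List[Message]:
    """Drop the oldest messages until the prompt fits and starts with a user turn."""
    kept = list(messages)
    # The latest message is always kept, even if it alone exceeds the budget.
    while len(kept) > 1 and (count_tokens(kept) > budget or kept[0]["role"] != "user"):
        kept.pop(0)
    return kept


def word_count(messages: List[Message]) -> int:
    """Stand-in counter: one 'token' per whitespace-separated word."""
    return sum(len(m["content"].split()) for m in messages)


history = [
    {"role": "user", "content": "hello " * 1500},
    {"role": "assistant", "content": "hi " * 600},
    {"role": "user", "content": "What is the capital of France?"},
]
trimmed = trim_history(history, word_count)
print([m["role"] for m in trimmed])
```

With the real tokenizer, `count_tokens` could instead return `len(tokenizer.apply_chat_template(msgs, add_generation_prompt=True, return_dict=True)["input_ids"])`, so that the count includes the chat template's own tokens.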
## Evaluation

### Instruction following (IFEval)

This is where KeyLM is competitive. All rows are evaluated with `lm_eval` (`ifeval`, 541 prompts, greedy decoding).

| Model | Params | Train tokens | Inst-level strict (%) | Prompt-level strict (%) | 4-metric avg (%) |
|---|---|---|---|---|---|
| **KeyLM-75M-Instruct** | **75M** | **~18B** | **22.42** | **12.75** | **17.85** |
| SmolLM-135M-Instruct | 135M | ~600B | 21.58 | 9.98 | 17.15 |
| SmolLM2-135M-Instruct | 135M | ~2T | 32.37 | 18.85 | 26.98 |

KeyLM beats the original SmolLM-135M-Instruct at roughly half the size and a fraction of the training data. SmolLM2-135M-Instruct, a far more heavily trained model, remains ahead.
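
A run along these lines could be set up through the `lm_eval` Python API. This is a sketch under assumptions, not the exact command behind the table: the batch size is arbitrary, the metric key names are those reported by recent lm-evaluation-harness releases and may differ between versions, and the "4-metric avg" is assumed to be the plain mean of the strict and loose accuracies at instruction and prompt level. The card does not say whether the chat template was applied; `simple_evaluate` has an `apply_chat_template` argument for that.

```python
from typing import Dict

MODEL_ID = "Eclipse-Senpai/KeyLM-75M-Instruct"

# Assumed metric keys (recent lm-evaluation-harness naming; may vary by version).
IFEVAL_KEYS = (
    "inst_level_strict_acc,none",
    "prompt_level_strict_acc,none",
    "inst_level_loose_acc,none",
    "prompt_level_loose_acc,none",
)


def four_metric_avg(metrics: Dict[str, float]) -> float:
    """Mean of the four IFEval accuracies, converted from fractions to percent."""
    return 100 * sum(metrics[key] for key in IFEVAL_KEYS) / len(IFEVAL_KEYS)


def run_ifeval() -> Dict[str, float]:
    """Evaluate the model on IFEval and return the task's metric dictionary."""
    # Imported lazily so four_metric_avg stays usable without lm_eval installed.
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={MODEL_ID},trust_remote_code=True,dtype=bfloat16",
        tasks=["ifeval"],
        batch_size=8,  # arbitrary choice, not taken from the card
    )
    return results["results"]["ifeval"]
```

Calling `four_metric_avg(run_ifeval())` downloads the model and runs all 541 prompts, so it is left out of the block itself.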
### Base vs Instruct

The table below compares the base and instruction-tuned checkpoints on every benchmark. Commonsense and knowledge tasks are zero-shot via `lm_eval` (accuracy in percent; ARC and HellaSwag are length-normalized); IFEval is the 4-metric average.

| Benchmark | KeyLM-75M (base) | KeyLM-75M-Instruct | Random |
|---|---|---|---|
| IFEval (4-metric avg) | - | **17.85** | - |
| MMLU | 23.0 | **24.0** | 25.0 |
| ARC (avg) | 29.9 | **30.8** | 25.0 |
| HellaSwag | 29.7 | **31.0** | 25.0 |
| PIQA | 60.0 | **61.3** | 50.0 |
| WinoGrande | **48.4** | 48.3 | 50.0 |
| OpenBookQA | 25.0 | 25.0 | 25.0 |

Instruction tuning leaves knowledge and reasoning roughly unchanged (no score moves by more than 1.3 points); its real effect is the instruction-following ability that IFEval captures. Both versions sit modestly above random on ARC, HellaSwag, and PIQA, and at or slightly below chance on MMLU, WinoGrande, and OpenBookQA.
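
The comparison against chance can be made explicit by subtracting the random baseline from each score. The short sketch below only rearranges numbers already in the table; the function name `margins` is illustrative.

```python
# (base, instruct, random) accuracies in percent, copied from the table above.
SCORES = {
    "MMLU": (23.0, 24.0, 25.0),
    "ARC (avg)": (29.9, 30.8, 25.0),
    "HellaSwag": (29.7, 31.0, 25.0),
    "PIQA": (60.0, 61.3, 50.0),
    "WinoGrande": (48.4, 48.3, 50.0),
    "OpenBookQA": (25.0, 25.0, 25.0),
}


def margins(scores):
    """Return {benchmark: (instruct - random, instruct - base)} in accuracy points."""
    return {
        name: (round(inst - rand, 1), round(inst - base, 1))
        for name, (base, inst, rand) in scores.items()
    }


for name, (over_chance, tuning_gain) in margins(SCORES).items():
    print(f"{name:<11} {over_chance:+5.1f} vs random   {tuning_gain:+5.1f} vs base")
```

PIQA shows the largest margin over chance, while MMLU and WinoGrande come out slightly negative and OpenBookQA is exactly at chance.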
## Training

### Pretraining

KeyLM was pretrained from random initialization on approximately 18B tokens, drawn from a weighted mixture of public datasets and streamed through a deterministic curriculum.

| Category | Share | Sources |
|---|---|---|
| Formal / quality | ~30% | FineWeb-Edu, Wikipedia |
| Casual / social | ~30% | Reddit comments, StackExchange |
| Conversational | ~25% | WildChat, UltraChat, LMSYS-Chat, OASST2 |
| Structured knowledge | ~5% | Cosmopedia |
| Typo augmentation | ~10% | Synthetic (contrastive) |
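
To make the mixture concrete, the shares above can be turned into approximate token budgets and a reproducible sampling order. The card does not describe how the curriculum is implemented, so this is only the simplest seeded weighted sampler, with hypothetical function and category names. A real curriculum would likely also change the weights over the course of training.

```python
import random

TOTAL_TOKENS = 18_000_000_000  # "~18B" from the card

# Approximate shares from the table above.
MIXTURE = {
    "formal_quality": 0.30,
    "casual_social": 0.30,
    "conversational": 0.25,
    "structured_knowledge": 0.05,
    "typo_augmentation": 0.10,
}


def token_budgets(total: int = TOTAL_TOKENS) -> dict:
    """Approximate number of tokens per category implied by the shares."""
    return {name: round(total * share) for name, share in MIXTURE.items()}


def sample_schedule(n_batches: int, seed: int = 0) -> list:
    """Pick one source category per batch; a fixed seed makes the order reproducible."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = list(MIXTURE.values())
    return rng.choices(names, weights=weights, k=n_batches)


budgets = token_budgets()
schedule = sample_schedule(10_000)
for name, tokens in budgets.items():
    print(f"{name:<21} {tokens / 1e9:4.1f}B tokens")
```

Because the generator is seeded, two runs with the same seed visit the categories in the same order, which is one simple way to get a deterministic data stream.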
### Post-training

Instruction tuning used `smol-smoltalk`, `ultrachat_200k`, and several `smoltalk2` splits (magpie, persona instruction-following, science, OpenHermes, system chats, summarization), with assistant-only loss masking, plus a set of custom synthetic instruction-following examples.
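
Assistant-only loss masking means that only tokens inside assistant turns contribute to the training loss. A minimal sketch of the idea follows, assuming the common PyTorch convention of marking ignored positions with `-100`. The function name, segment format, and token ids are illustrative and not taken from the KeyLM training code.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(segments: List[Tuple[str, List[int]]]) -> Tuple[List[int], List[int]]:
    """Concatenate (role, token_ids) segments and mask labels outside assistant turns."""
    input_ids: List[int] = []
    labels: List[int] = []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)                        # contributes to the loss
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))  # skipped by the loss
    return input_ids, labels


# Toy token ids, purely illustrative.
segments = [
    ("user", [11, 12, 13]),
    ("assistant", [21, 22]),
    ("user", [14]),
    ("assistant", [23, 24, 25]),
]
input_ids, labels = build_labels(segments)
print(labels)
```

The model still sees the full conversation in `input_ids`; the mask only stops user and system text from being used as prediction targets.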
## Limitations

- Minimal world knowledge. Not suitable for factual question answering, reasoning, math, or code.
- English only.
- No dedicated safety alignment was performed. Apply your own filtering before any user-facing use.
## License

Apache 2.0. The weights are trained from scratch and free to use, modify, and redistribute.
## Citation

```bibtex
@misc{keylm75m2026,
  title = {KeyLM-75M: a from-scratch small language model},
  author = {Eclipse-Senpai},
  year = {2026},
  howpublished = {\url{https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct}}
}
```