main commit
- README.md +182 -0
- config.json +30 -0
- configuration_keylm.py +13 -0
- generation_config.json +9 -0
- model.safetensors +3 -0
- modeling_keylm.py +25 -0
- special_tokens_map.json +5 -0
- tokenizer.json +0 -0
- tokenizer_config.json +13 -0
README.md
CHANGED
@@ -1,3 +1,185 @@
---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- keylm
- small-language-model
- instruct
- gqa
- rope
- swiglu
- qk-norm
- custom_code
model-index:
- name: KeyLM-75M-Instruct
  results:
  - task:
      type: text-generation
      name: Instruction Following
    dataset:
      name: IFEval
      type: google/IFEval
    metrics:
    - type: acc
      name: IFEval (4-metric average)
      value: 17.85
    - type: acc
      name: IFEval instruction-level (strict)
      value: 22.42
    - type: acc
      name: IFEval prompt-level (strict)
      value: 12.75
  - task:
      type: text-generation
      name: Multiple Choice
    dataset:
      name: MMLU
      type: cais/mmlu
    metrics:
    - type: acc
      name: MMLU (0-shot)
      value: 23.0
  - task:
      type: text-generation
      name: Multiple Choice
    dataset:
      name: ARC-Challenge
      type: allenai/ai2_arc
    metrics:
    - type: acc_norm
      name: ARC-Challenge (0-shot)
      value: 25.5
  - task:
      type: text-generation
      name: Multiple Choice
    dataset:
      name: ARC-Easy
      type: allenai/ai2_arc
    metrics:
    - type: acc_norm
      name: ARC-Easy (0-shot)
      value: 26.6
  - task:
      type: text-generation
      name: Multiple Choice
    dataset:
      name: HellaSwag
      type: Rowan/hellaswag
    metrics:
    - type: acc_norm
      name: HellaSwag (0-shot)
      value: 26.7
  - task:
      type: text-generation
      name: Multiple Choice
    dataset:
      name: PIQA
      type: ybisk/piqa
    metrics:
    - type: acc
      name: PIQA (0-shot)
      value: 53.1
  - task:
      type: text-generation
      name: Multiple Choice
    dataset:
      name: WinoGrande
      type: allenai/winogrande
    metrics:
    - type: acc
      name: WinoGrande (0-shot)
      value: 48.9
  - task:
      type: text-generation
      name: Multiple Choice
    dataset:
      name: OpenBookQA
      type: allenai/openbookqa
    metrics:
    - type: acc_norm
      name: OpenBookQA (0-shot)
      value: 18.4
---

# KeyLM-75M-Instruct

KeyLM-75M-Instruct is a 75M-parameter instruction-tuned language model trained from scratch on approximately 18 billion tokens. That training budget is a small fraction of what comparable small models use (SmolLM-135M was trained on roughly 600B tokens, SmolLM2-135M on roughly 2T). Despite this, it is competitive on instruction following: it narrowly outperforms SmolLM-135M-Instruct on IFEval (17.85 vs. 17.15) while using a little over half the parameters and a fraction of the training tokens, though it remains well behind SmolLM2-135M-Instruct (26.98).

## Results

IFEval, evaluated with `lm_eval` (541 prompts, greedy decoding).

| Model | Params | Train tokens | IFEval (4-metric avg) |
|---|---|---|---|
| **KeyLM-75M-Instruct** | **75M** | **~18B** | **17.85** |
| SmolLM-135M-Instruct | 135M | ~600B | 17.15 |
| SmolLM2-135M-Instruct | 135M | ~2T | 26.98 |

Full benchmark results (MMLU, ARC, HellaSwag, PIQA, WinoGrande, OpenBookQA) appear in the evaluation panel above. On those multiple-choice knowledge and reasoning tasks the model scores near random chance, which is expected at this parameter and token budget. Its usable behavior comes from instruction tuning rather than parametric knowledge.
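
To make "near random chance" concrete, the sketch below compares the scores listed in this card's metadata with a uniform-guessing baseline for each task. The number of answer options per task is an assumption based on how these benchmarks are usually set up (four for MMLU, ARC, HellaSwag, and OpenBookQA; two for PIQA and WinoGrande), not something taken from the evaluation logs. A few ARC items have three or five options, so 25% is only approximate there.

```python
# Scores from this card's model-index metadata (percent).
SCORES = {
    "MMLU": 23.0,
    "ARC-Challenge": 25.5,
    "ARC-Easy": 26.6,
    "HellaSwag": 26.7,
    "PIQA": 53.1,
    "WinoGrande": 48.9,
    "OpenBookQA": 18.4,
}

# Assumed number of answer options per task (not stated in the card).
NUM_CHOICES = {
    "MMLU": 4,
    "ARC-Challenge": 4,
    "ARC-Easy": 4,
    "HellaSwag": 4,
    "PIQA": 2,
    "WinoGrande": 2,
    "OpenBookQA": 4,
}


def margin_over_chance(task: str) -> float:
    """Score minus the uniform-guessing baseline, in percentage points."""
    chance = 100.0 / NUM_CHOICES[task]
    return round(SCORES[task] - chance, 1)


MARGINS = {task: margin_over_chance(task) for task in SCORES}

if __name__ == "__main__":
    for task, margin in MARGINS.items():
        chance = 100.0 / NUM_CHOICES[task]
        print(f"{task:<14} score {SCORES[task]:5.1f}  chance {chance:4.0f}  margin {margin:+.1f}")
```

Under these assumed baselines every task lands within a few points of guessing, and some fall slightly below it, which is consistent with the statement above.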

## Usage

KeyLM ships its own modeling code, so load it with `trust_remote_code=True` (requires `transformers>=4.51`).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Eclipse-Senpai/KeyLM-75M-Instruct"

# trust_remote_code=True lets transformers import the repo's
# configuration_keylm.py and modeling_keylm.py.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.float16
)

# Render the conversation with the tokenizer's chat template. Asking for a
# dict returns the attention mask alongside the input IDs, which matters
# here because the pad token shares its ID with the EOS token.
messages = [{"role": "user", "content": "What is the capital of France?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
)

# Sampling settings match the repo's generation_config.json.
outputs = model.generate(
    **inputs, max_new_tokens=128, do_sample=True,
    temperature=0.7, top_p=0.9, repetition_penalty=1.1,
)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))
```

GGUF builds for `llama.cpp`, LM Studio, and Ollama are available at [KeyLM-75M-Instruct-GGUF](https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct-GGUF).

## Model details

| Field | Value |
|---|---|
| Parameters | 75,251,200 |
| Architecture | Grouped-query attention, RoPE, SwiGLU, QK-RMSNorm |
| Hidden size | 512 |
| Layers | 24 |
| Attention heads | 8 (2 KV heads) |
| Context length | 2048 |
| Vocabulary | 12,020 (ByteLevel BPE) |
| Precision | float16 |
| Chat format | `User:` / `Assistant:`, assistant turns end with `</s>` |

The architecture follows the standard small decoder recipe used by Llama and Qwen3. Weights are trained from random initialization. Instruction tuning uses `smol-smoltalk`, `ultrachat_200k`, and several `smoltalk2` splits with assistant-only loss masking, followed by a personality tuning pass.
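
One practical effect of grouped-query attention is a smaller key/value cache at inference time. The back-of-the-envelope sketch below uses the figures from the table above (24 layers, 2 KV heads, float16) with a head dimension of 64 (512 / 8, also the `head_dim` in `config.json`). It assumes a plain, unquantized cache holding one key and one value vector per KV head, per layer, per token. `kv_cache_bytes` is an illustrative helper, not part of the repository.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for a single sequence."""
    # Factor 2 covers the separate key and value tensors.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len


# KeyLM-75M at its full 2048-token context, float16 (2 bytes per value).
gqa = kv_cache_bytes(num_layers=24, num_kv_heads=2, head_dim=64, seq_len=2048)

# Hypothetical variant in which all 8 attention heads keep their own K/V.
mha = kv_cache_bytes(num_layers=24, num_kv_heads=8, head_dim=64, seq_len=2048)

print(f"GQA cache: {gqa / 2**20:.0f} MiB, full multi-head cache: {mha / 2**20:.0f} MiB")
```

With these assumptions the cache for a full context is 24 MiB, a quarter of the 96 MiB that 8 independent KV heads would need.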

## Limitations

- Minimal world knowledge. Not suitable for factual question answering, reasoning, math, or code.
- English only.
- No dedicated safety alignment was performed.

## License

Apache 2.0. Weights are trained from scratch and free to use, modify, and redistribute.

## Citation

```bibtex
@misc{keylm75m2026,
  title        = {KeyLM-75M: a from-scratch small language model},
  author       = {Eclipse-Senpai},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct}}
}
```

config.json
ADDED
@@ -0,0 +1,30 @@
{
  "architectures": [
    "KeyLM75M"
  ],
  "model_type": "keylm75m",
  "auto_map": {
    "AutoConfig": "configuration_keylm.KeyLM75MConfig",
    "AutoModelForCausalLM": "modeling_keylm.KeyLM75M"
  },
  "vocab_size": 12020,
  "hidden_size": 512,
  "intermediate_size": 1280,
  "num_hidden_layers": 24,
  "num_attention_heads": 8,
  "num_key_value_heads": 2,
  "head_dim": 64,
  "max_position_embeddings": 2048,
  "rope_theta": 10000.0,
  "rms_norm_eps": 1e-06,
  "hidden_act": "silu",
  "attention_bias": false,
  "attention_dropout": 0.0,
  "use_sliding_window": false,
  "tie_word_embeddings": false,
  "initializer_range": 0.02,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 2,
  "torch_dtype": "float16"
}
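
The hyperparameters above are enough to reproduce the parameter count quoted in the README (75,251,200). The sketch below is a sanity check, not code from the repository. It assumes the Qwen3-style layout that the repo's docstrings describe: bias-free linear layers (`attention_bias` is false), a SwiGLU MLP with gate, up, and down projections, two RMSNorm weight vectors per block plus per-head query/key norms of size `head_dim`, a final RMSNorm, and an untied output head. `count_parameters` is a hypothetical helper name.

```python
# Values copied from config.json.
CONFIG = {
    "vocab_size": 12020,
    "hidden_size": 512,
    "intermediate_size": 1280,
    "num_hidden_layers": 24,
    "num_attention_heads": 8,
    "num_key_value_heads": 2,
    "head_dim": 64,
    "tie_word_embeddings": False,
}


def count_parameters(cfg: dict) -> int:
    """Parameter count for an assumed Qwen3-style decoder without biases."""
    h = cfg["hidden_size"]
    q_dim = cfg["num_attention_heads"] * cfg["head_dim"]
    kv_dim = cfg["num_key_value_heads"] * cfg["head_dim"]

    attention = h * q_dim + 2 * h * kv_dim + q_dim * h  # q, k, v, o projections
    mlp = 3 * h * cfg["intermediate_size"]               # gate, up, down projections
    norms = 2 * h + 2 * cfg["head_dim"]                  # block norms + q/k norms
    per_layer = attention + mlp + norms

    embeddings = cfg["vocab_size"] * h
    if not cfg["tie_word_embeddings"]:
        embeddings *= 2                                    # separate lm_head matrix

    final_norm = h
    return cfg["num_hidden_layers"] * per_layer + embeddings + final_norm


TOTAL = count_parameters(CONFIG)
print(f"{TOTAL:,}")  # 75,251,200
```

Roughly 12.3M of the total sits in the two embedding matrices, which is why the small 12,020-token vocabulary matters at this scale.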
configuration_keylm.py
ADDED
@@ -0,0 +1,13 @@
"""KeyLM model configuration.

KeyLM-75M is a from-scratch small language model. Its decoder block is a
Qwen3-style layout (grouped-query attention, RoPE, SwiGLU, and per-head
QK-RMSNorm), so the configuration inherits Qwen3Config and only overrides the
``model_type`` so the model carries its own identity on the Hub.
"""

from transformers.models.qwen3.configuration_qwen3 import Qwen3Config


class KeyLM75MConfig(Qwen3Config):
    model_type = "keylm75m"
generation_config.json
ADDED
@@ -0,0 +1,9 @@
{
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 2,
  "do_sample": true,
  "temperature": 0.7,
  "top_p": 0.9,
  "repetition_penalty": 1.1
}
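
As a rough illustration of what these defaults do to a next-token distribution, the sketch below applies a repetition penalty, temperature scaling, and nucleus (top-p) filtering to a toy list of logits in plain Python. It follows the usual textbook definitions of these steps and is not the code path `transformers` runs; `nucleus_distribution` and `sample_next_token` are hypothetical names.

```python
import math
import random


def nucleus_distribution(logits, previous_tokens, temperature=0.7,
                         top_p=0.9, repetition_penalty=1.1):
    """Return {token_id: probability} after penalty, temperature and top-p."""
    scores = list(logits)

    # Repetition penalty: push down tokens that were already generated.
    for tok in set(previous_tokens):
        if scores[tok] > 0:
            scores[tok] /= repetition_penalty
        else:
            scores[tok] *= repetition_penalty

    # Temperature below 1 sharpens the distribution.
    scores = [s / temperature for s in scores]

    # Numerically stable softmax.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of top tokens whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, cumulative = [], 0.0
    for tok in order:
        kept.append(tok)
        cumulative += probs[tok]
        if cumulative >= top_p:
            break

    # Renormalise over the surviving tokens.
    mass = sum(probs[t] for t in kept)
    return {t: probs[t] / mass for t in kept}


def sample_next_token(logits, previous_tokens, rng=random, **settings):
    """Draw one token ID from the filtered distribution."""
    dist = nucleus_distribution(logits, previous_tokens, **settings)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


toy_logits = [4.0, 3.0, 0.0, -2.0]
dist = nucleus_distribution(toy_logits, previous_tokens=[0])
token = sample_next_token(toy_logits, previous_tokens=[0], rng=random.Random(0))
print(dist, token)
```

In this toy case the two low-probability tokens are filtered out, so sampling can only return token 0 or token 1.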
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:62a2f1202bf8c44f7839d0f402c81977d8516936a6a7aa70bc8cebd210791b4b
size 150531664
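
The file committed here is a Git LFS pointer, not the weights themselves. The sketch below parses the three pointer fields and checks the recorded size against the README's parameter count. At 2 bytes per float16 value, 75,251,200 parameters account for almost all of the file. The assumption that the remaining bytes are the safetensors header (tensor names, shapes, and offsets) is mine and has not been verified against the actual file. `parse_lfs_pointer` is a hypothetical helper.

```python
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:62a2f1202bf8c44f7839d0f402c81977d8516936a6a7aa70bc8cebd210791b4b
size 150531664
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer file into its key/value fields."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER)

# 75,251,200 parameters (from the README) stored as float16.
weight_bytes = 75_251_200 * 2
overhead = pointer["size"] - weight_bytes

print(f"weights: {weight_bytes:,} bytes, remaining: {overhead:,} bytes")
```

The remainder comes to a few tens of kilobytes, so the file size is consistent with float16 weights plus a small amount of metadata.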
modeling_keylm.py
ADDED
@@ -0,0 +1,25 @@
"""KeyLM model implementation.

KeyLM-75M uses a Qwen3-style decoder (GQA + RoPE + SwiGLU + per-head
QK-RMSNorm). Rather than vendor a full copy of the transformer, the classes
below specialise the upstream Qwen3 implementation and bind it to KeyLM75MConfig
so the model loads under its own name via `trust_remote_code=True`.
"""

try:
    from transformers.models.qwen3.modeling_qwen3 import Qwen3ForCausalLM, Qwen3Model
except ImportError as exc:  # pragma: no cover - guidance for old transformers
    raise ImportError(
        "KeyLM requires a transformers version that ships the Qwen3 model "
        "(transformers>=4.51). Please upgrade transformers."
    ) from exc

from .configuration_keylm import KeyLM75MConfig


class KeyLM75MModel(Qwen3Model):
    config_class = KeyLM75MConfig


class KeyLM75M(Qwen3ForCausalLM):
    config_class = KeyLM75MConfig
special_tokens_map.json
ADDED
@@ -0,0 +1,5 @@
{
  "bos_token": "<s>",
  "eos_token": "</s>",
  "unk_token": "[UNK]"
}
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
@@ -0,0 +1,13 @@
{
  "bos_token": "<s>",
  "eos_token": "</s>",
  "lowercase": false,
  "model_max_length": 2048,
  "tokenizer_class": "PreTrainedTokenizerFast",
  "unk_token": "[UNK]",
  "vocab_size": 12020,
  "chat_template": "{% for message in messages %}{% if message['role'] == 'user' %}{% if loop.index0 > 0 %}\n{% endif %}User: {{ message['content'] }}\n{% elif message['role'] == 'assistant' %}Assistant: {{ message['content'] }}</s>{% endif %}{% endfor %}{% if add_generation_prompt %}Assistant: {% endif %}",
  "add_bos_token": false,
  "add_eos_token": false,
  "clean_up_tokenization_spaces": false
}