Eclipse-Senpai committed 2 days ago

Commit a7de577 (verified) · 1 parent: 97f644f

main commit

Files changed (9)

README.md +182 -0
config.json +30 -0
configuration_keylm.py +13 -0
generation_config.json +9 -0
model.safetensors +3 -0
modeling_keylm.py +25 -0
special_tokens_map.json +5 -0
tokenizer.json +0 -0
tokenizer_config.json +13 -0

README.md CHANGED

@@ -1,3 +1,185 @@
 ---
 license: apache-2.0
+language:
+- en
+library_name: transformers
+pipeline_tag: text-generation
+tags:
+- keylm
+- small-language-model
+- instruct
+- gqa
+- rope
+- swiglu
+- qk-norm
+- custom_code
+model-index:
+- name: KeyLM-75M-Instruct
+  results:
+  - task:
+      type: text-generation
+      name: Instruction Following
+    dataset:
+      name: IFEval
+      type: google/IFEval
+    metrics:
+    - type: acc
+      name: IFEval (4-metric average)
+      value: 17.85
+    - type: acc
+      name: IFEval instruction-level (strict)
+      value: 22.42
+    - type: acc
+      name: IFEval prompt-level (strict)
+      value: 12.75
+  - task:
+      type: text-generation
+      name: Multiple Choice
+    dataset:
+      name: MMLU
+      type: cais/mmlu
+    metrics:
+    - type: acc
+      name: MMLU (0-shot)
+      value: 23.0
+  - task:
+      type: text-generation
+      name: Multiple Choice
+    dataset:
+      name: ARC-Challenge
+      type: allenai/ai2_arc
+    metrics:
+    - type: acc_norm
+      name: ARC-Challenge (0-shot)
+      value: 25.5
+  - task:
+      type: text-generation
+      name: Multiple Choice
+    dataset:
+      name: ARC-Easy
+      type: allenai/ai2_arc
+    metrics:
+    - type: acc_norm
+      name: ARC-Easy (0-shot)
+      value: 26.6
+  - task:
+      type: text-generation
+      name: Multiple Choice
+    dataset:
+      name: HellaSwag
+      type: Rowan/hellaswag
+    metrics:
+    - type: acc_norm
+      name: HellaSwag (0-shot)
+      value: 26.7
+  - task:
+      type: text-generation
+      name: Multiple Choice
+    dataset:
+      name: PIQA
+      type: ybisk/piqa
+    metrics:
+    - type: acc
+      name: PIQA (0-shot)
+      value: 53.1
+  - task:
+      type: text-generation
+      name: Multiple Choice
+    dataset:
+      name: WinoGrande
+      type: allenai/winogrande
+    metrics:
+    - type: acc
+      name: WinoGrande (0-shot)
+      value: 48.9
+  - task:
+      type: text-generation
+      name: Multiple Choice
+    dataset:
+      name: OpenBookQA
+      type: allenai/openbookqa
+    metrics:
+    - type: acc_norm
+      name: OpenBookQA (0-shot)
+      value: 18.4
 ---
+# KeyLM-75M-Instruct
+KeyLM-75M-Instruct is a 75M-parameter instruction-tuned language model trained from scratch on approximately 18 billion tokens. That budget is a small fraction of what comparable small models use (SmolLM-135M was trained on roughly 600B tokens, SmolLM2-135M on roughly 2T). Even so, it narrowly outperforms SmolLM-135M-Instruct on IFEval (17.85 vs. 17.15) with about half the parameters and roughly 3% of the training tokens; SmolLM2-135M-Instruct remains well ahead at 26.98.
+## Results
+IFEval, evaluated with `lm_eval` (541 prompts, greedy decoding).
+| Model | Params | Train tokens | IFEval (4-metric avg) |
+|---|---|---|---|
+| **KeyLM-75M-Instruct** | **75M** | **~18B** | **17.85** |
+| SmolLM-135M-Instruct | 135M | ~600B | 17.15 |
+| SmolLM2-135M-Instruct | 135M | ~2T | 26.98 |
+Full benchmark results (MMLU, ARC, HellaSwag, PIQA, WinoGrande, OpenBookQA) appear in the evaluation panel above. On those multiple-choice knowledge and reasoning tasks the model scores near random chance, which is expected at this parameter and token budget. Its usable behavior comes from instruction tuning rather than parametric knowledge.
+## Usage
+KeyLM ships its own modeling code, so load it with `trust_remote_code=True` (requires `transformers>=4.51`).
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model_id = "Eclipse-Senpai/KeyLM-75M-Instruct"
+tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id, trust_remote_code=True, torch_dtype=torch.float16
+)
+messages = [{"role": "user", "content": "What is the capital of France?"}]
+inputs = tokenizer.apply_chat_template(
+    messages, add_generation_prompt=True, return_tensors="pt"
+)
+outputs = model.generate(
+    inputs, max_new_tokens=128, do_sample=True,
+    temperature=0.7, top_p=0.9, repetition_penalty=1.1,
+)
+print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
+```
+GGUF builds for `llama.cpp`, LM Studio, and Ollama are available at [KeyLM-75M-Instruct-GGUF](https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct-GGUF).
+## Model details
+| Field | Value |
+|---|---|
+| Parameters | 75,251,200 |
+| Architecture | Grouped-query attention, RoPE, SwiGLU, QK-RMSNorm |
+| Hidden size | 512 |
+| Layers | 24 |
+| Attention heads | 8 (2 KV heads) |
+| Context length | 2048 |
+| Vocabulary | 12,020 (ByteLevel BPE) |
+| Precision | float16 |
+| Chat format | `User:` / `Assistant:`, assistant turns end with `</s>` |
+The architecture follows the standard small decoder recipe used by Llama and Qwen3. Weights are trained from random initialization. Instruction tuning uses `smol-smoltalk`, `ultrachat_200k`, and several `smoltalk2` splits with assistant-only loss masking, followed by a personality tuning pass.
+## Limitations
+- Minimal world knowledge. Not suitable for factual question answering, reasoning, math, or code.
+- English only.
+- No dedicated safety alignment was performed.
+## License
+Apache 2.0. Weights are trained from scratch and free to use, modify, and redistribute.
+## Citation
+```bibtex
+@misc{keylm75m2026,
+  title  = {KeyLM-75M: a from-scratch small language model},
+  author = {Eclipse-Senpai},
+  year   = {2026},
+  howpublished = {\url{https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct}}
+}
+```
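
The README's remark that the multiple-choice scores sit near random chance can be checked against the values in its own front matter. The sketch below is only a rough check: the number of answer options per task is an assumption that is not stated in the README (ARC in particular has a few questions with three or five options), and `chance_gaps` is a hypothetical helper name.

```python
# Scores copied from the model-index front matter of the README diff.
scores = {
    "MMLU": 23.0,
    "ARC-Challenge": 25.5,
    "ARC-Easy": 26.6,
    "HellaSwag": 26.7,
    "PIQA": 53.1,
    "WinoGrande": 48.9,
    "OpenBookQA": 18.4,
}

# ASSUMPTION: typical number of answer options per task (not given in the README).
num_choices = {
    "MMLU": 4,
    "ARC-Challenge": 4,
    "ARC-Easy": 4,
    "HellaSwag": 4,
    "PIQA": 2,
    "WinoGrande": 2,
    "OpenBookQA": 4,
}


def chance_gaps(scores, num_choices):
    """Return score minus uniform-guessing accuracy, in percentage points."""
    return {
        task: round(score - 100.0 / num_choices[task], 1)
        for task, score in scores.items()
    }


gaps = chance_gaps(scores, num_choices)
for task, gap in gaps.items():
    print(f"{task:14s} {gap:+.1f}")
```

Under these assumptions every task lands within about seven points of uniform guessing, which is consistent with the README's description.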

config.json ADDED

	@@ -0,0 +1,30 @@

+{
+  "architectures": [
+    "KeyLM75M"
+  ],
+  "model_type": "keylm75m",
+  "auto_map": {
+    "AutoConfig": "configuration_keylm.KeyLM75MConfig",
+    "AutoModelForCausalLM": "modeling_keylm.KeyLM75M"
+  },
+  "vocab_size": 12020,
+  "hidden_size": 512,
+  "intermediate_size": 1280,
+  "num_hidden_layers": 24,
+  "num_attention_heads": 8,
+  "num_key_value_heads": 2,
+  "head_dim": 64,
+  "max_position_embeddings": 2048,
+  "rope_theta": 10000.0,
+  "rms_norm_eps": 1e-06,
+  "hidden_act": "silu",
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "use_sliding_window": false,
+  "tie_word_embeddings": false,
+  "initializer_range": 0.02,
+  "bos_token_id": 1,
+  "eos_token_id": 2,
+  "pad_token_id": 2,
+  "torch_dtype": "float16"
+}
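
The values in this config are enough to reproduce the parameter count quoted in the README (75,251,200). The sketch below assumes the usual Qwen3-style layout that the modeling file inherits: bias-free linear projections, weight-only RMSNorm, a per-head QK-norm over `head_dim`, and a gated three-matrix MLP. `count_parameters` is a hypothetical helper, not part of the repository.

```python
# Subset of config.json needed for the count.
config = {
    "vocab_size": 12020,
    "hidden_size": 512,
    "intermediate_size": 1280,
    "num_hidden_layers": 24,
    "num_attention_heads": 8,
    "num_key_value_heads": 2,
    "head_dim": 64,
    "tie_word_embeddings": False,
}


def count_parameters(cfg):
    """Count weights for an assumed Qwen3-style decoder without biases."""
    h = cfg["hidden_size"]
    d = cfg["head_dim"]
    q_dim = cfg["num_attention_heads"] * d      # width of the query projection
    kv_dim = cfg["num_key_value_heads"] * d     # width of each of K and V (GQA)

    attention = h * q_dim + 2 * h * kv_dim + q_dim * h  # q, k, v, o projections
    qk_norm = 2 * d                                      # q_norm and k_norm weights
    mlp = 3 * h * cfg["intermediate_size"]               # gate, up, down projections
    block_norms = 2 * h                                  # two RMSNorms per block
    per_layer = attention + qk_norm + mlp + block_norms

    embedding = cfg["vocab_size"] * h
    lm_head = 0 if cfg["tie_word_embeddings"] else embedding
    final_norm = h
    return embedding + lm_head + cfg["num_hidden_layers"] * per_layer + final_norm


total = count_parameters(config)
print(f"{total:,}")
```

With untied embeddings, the input embedding and output head together account for about 12.3M of the total, so roughly one sixth of the model sits in the vocabulary matrices.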

configuration_keylm.py ADDED

	@@ -0,0 +1,13 @@

+"""KeyLM model configuration.
+KeyLM-75M is a from-scratch small language model. Its decoder block uses a
+Qwen3-style layout (grouped-query attention, RoPE, SwiGLU, and per-head
+QK-RMSNorm). The configuration therefore inherits Qwen3Config and overrides
+only ``model_type``, which gives the model its own identity on the Hub.
+"""
+from transformers.models.qwen3.configuration_qwen3 import Qwen3Config
+class KeyLM75MConfig(Qwen3Config):
+    model_type = "keylm75m"

generation_config.json ADDED

	@@ -0,0 +1,9 @@

+{
+  "bos_token_id": 1,
+  "eos_token_id": 2,
+  "pad_token_id": 2,
+  "do_sample": true,
+  "temperature": 0.7,
+  "top_p": 0.9,
+  "repetition_penalty": 1.1
+}

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:62a2f1202bf8c44f7839d0f402c81977d8516936a6a7aa70bc8cebd210791b4b
+size 150531664
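
The file size in this LFS pointer lines up with the float16 precision and the parameter count stated in the README. A quick arithmetic check follows; reading the leftover bytes as the safetensors header and metadata is an assumption, since the pointer does not say what they are.

```python
PARAMS = 75_251_200        # parameter count from the README
FILE_BYTES = 150_531_664   # size from the Git LFS pointer
BYTES_PER_FP16 = 2         # float16 stores each weight in two bytes

tensor_bytes = PARAMS * BYTES_PER_FP16
# ASSUMPTION: whatever remains is the safetensors header and metadata.
non_tensor_bytes = FILE_BYTES - tensor_bytes

print(tensor_bytes)
print(non_tensor_bytes)
```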

modeling_keylm.py ADDED

	@@ -0,0 +1,25 @@

+"""KeyLM model implementation.
+KeyLM-75M uses a Qwen3-style decoder (GQA + RoPE + SwiGLU + per-head
+QK-RMSNorm). Rather than vendor a full copy of the transformer, the classes
+below specialise the upstream Qwen3 implementation and bind it to KeyLM75MConfig
+so the model loads under its own name via `trust_remote_code=True`.
+"""
+try:
+    from transformers.models.qwen3.modeling_qwen3 import Qwen3ForCausalLM, Qwen3Model
+except ImportError as exc:  # pragma: no cover - guidance for old transformers
+    raise ImportError(
+        "KeyLM requires a transformers version that ships the Qwen3 model "
+        "(transformers>=4.51). Please upgrade transformers."
+    ) from exc
+from .configuration_keylm import KeyLM75MConfig
+class KeyLM75MModel(Qwen3Model):
+    config_class = KeyLM75MConfig
+class KeyLM75M(Qwen3ForCausalLM):
+    config_class = KeyLM75MConfig

special_tokens_map.json ADDED

	@@ -0,0 +1,5 @@

+{
+  "bos_token": "<s>",
+  "eos_token": "</s>",
+  "unk_token": "[UNK]"
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,13 @@

+{
+  "bos_token": "<s>",
+  "eos_token": "</s>",
+  "lowercase": false,
+  "model_max_length": 2048,
+  "tokenizer_class": "PreTrainedTokenizerFast",
+  "unk_token": "[UNK]",
+  "vocab_size": 12020,
+  "chat_template": "{% for message in messages %}{% if message['role'] == 'user' %}{% if loop.index0 > 0 %}\n{% endif %}User: {{ message['content'] }}\n{% elif message['role'] == 'assistant' %}Assistant: {{ message['content'] }}</s>{% endif %}{% endfor %}{% if add_generation_prompt %}Assistant: {% endif %}",
+  "add_bos_token": false,
+  "add_eos_token": false,
+  "clean_up_tokenization_spaces": false
+}
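
The `chat_template` above is compact, so a plain-Python mirror may make the prompt format easier to see. This is a sketch that reads the template literally; `render_chat` is a hypothetical helper, not something the tokenizer exposes. The newline between turns is an assumption: Jinja whitespace settings such as `trim_blocks` can drop it when the template is rendered by a library.

```python
def render_chat(messages, add_generation_prompt=True):
    """Mirror the chat_template literally: User:/Assistant: turns, </s> after replies."""
    parts = []
    for index, message in enumerate(messages):
        if message["role"] == "user":
            if index > 0:
                # ASSUMPTION: separator newline before later user turns is kept.
                parts.append("\n")
            parts.append(f"User: {message['content']}\n")
        elif message["role"] == "assistant":
            parts.append(f"Assistant: {message['content']}</s>")
        # Any other role (for example "system") has no branch in the template
        # and is therefore skipped.
    if add_generation_prompt:
        parts.append("Assistant: ")
    return "".join(parts)


prompt = render_chat([{"role": "user", "content": "What is the capital of France?"}])
print(repr(prompt))
```

Because the template has no branch for a `system` role, instructions meant as a system prompt would need to be folded into the first user message.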