Tags: Text Generation, Transformers, Safetensors, English, keylm75m, keylm, small-language-model, base, pretrained, gqa, rope, swiglu, qk-norm, custom_code
Instructions for using Eclipse-Senpai/KeyLM-75M with libraries and local apps:
How to use Eclipse-Senpai/KeyLM-75M with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Eclipse-Senpai/KeyLM-75M", trust_remote_code=True)

# Load model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Eclipse-Senpai/KeyLM-75M", trust_remote_code=True, dtype="auto")
```
How to use Eclipse-Senpai/KeyLM-75M with vLLM:

Install from pip and serve the model:

```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Eclipse-Senpai/KeyLM-75M"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Eclipse-Senpai/KeyLM-75M",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```

Use Docker:

```shell
docker model run hf.co/Eclipse-Senpai/KeyLM-75M
```
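The curl request above can also be sent from Python. This is a minimal standard-library sketch, assuming the vLLM server started above is listening on `localhost:8000` (the SGLang server in the next section exposes the same OpenAI-compatible `/v1/completions` route on port 30000). `build_completion_request` and `complete` are illustrative helper names, not part of vLLM or SGLang; the generated text is read from `choices[0]["text"]`, the OpenAI completions response layout.

```python
import json
import urllib.request

MODEL_ID = "Eclipse-Senpai/KeyLM-75M"


def build_completion_request(prompt: str, max_tokens: int = 512,
                             temperature: float = 0.5) -> dict:
    """Build the same JSON body that the curl example sends (hypothetical helper)."""
    return {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def complete(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST to an OpenAI-compatible /v1/completions endpoint and return the text.

    Assumes a server is already running at base_url (port 8000 for the vLLM
    command above; use http://localhost:30000 for the SGLang command).
    """
    body = json.dumps(build_completion_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        payload = json.load(resp)
    # OpenAI-style completion responses carry the generated text in choices[0]["text"].
    return payload["choices"][0]["text"]
```

With a server running, `complete("Once upon a time,")` returns the continuation. Since KeyLM-75M is a base model, prompts work best as text to be continued, as in the curl example.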
How to use Eclipse-Senpai/KeyLM-75M with SGLang:

Install from pip and serve the model:

```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Eclipse-Senpai/KeyLM-75M" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Eclipse-Senpai/KeyLM-75M",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```

Use Docker images:

```shell
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Eclipse-Senpai/KeyLM-75M" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Eclipse-Senpai/KeyLM-75M",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```
How to use Eclipse-Senpai/KeyLM-75M with Docker Model Runner:
```shell
docker model run hf.co/Eclipse-Senpai/KeyLM-75M
```
Update README.md
README.md (changed)

````diff
@@ -30,7 +30,7 @@ datasets:
 
 KeyLM-75M is a 75M parameter base language model trained from scratch on approximately 18 billion tokens. That training budget is a small fraction of what comparable small models use (SmolLM-135M was trained on roughly 600B tokens, SmolLM2-135M on roughly 2T).
 
-This is the **base** model: a text-completion model, not instruction-tuned.
+This is the **base** model: a text-completion model, not instruction-tuned. For chat and instruction following, use [KeyLM-75M-Instruct](https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct).
 
 ## Table of Contents
 
@@ -44,7 +44,7 @@ This is the **base** model: a text-completion model, not instruction-tuned. It i
 
 ## Model Summary
 
-KeyLM is a compact decoder-only transformer built on the standard small-model recipe used by Llama and Qwen3: grouped-query attention, rotary position embeddings (RoPE), SwiGLU feed-forward layers, and per-head QK-RMSNorm.
+KeyLM is a compact decoder-only transformer built on the standard small-model recipe used by Llama and Qwen3: grouped-query attention, rotary position embeddings (RoPE), SwiGLU feed-forward layers, and per-head QK-RMSNorm.
 
 | Field | Value |
 |---|---|
@@ -59,8 +59,6 @@ KeyLM is a compact decoder-only transformer built on the standard small-model re
 
 ## How to Use
 
-This is a base model: it continues text and has no chat template. Load it with `trust_remote_code=True` (requires `transformers>=4.51`).
-
 ```python
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -79,8 +77,6 @@ outputs = model.generate(
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
 
-For fine-tuning, the bfloat16 weights load directly into the usual `transformers` training stack; the model also fine-tunes with assistant-only loss masking under a plain `User:` / `Assistant:` format, which is how the Instruct version was produced.
-
 ## Evaluation
 
 On standard multiple-choice benchmarks KeyLM performs at or near random chance. This is expected at 75M parameters and 18B tokens: the model holds little parametric knowledge. Scores are zero-shot via `lm_eval` (accuracy; ARC and HellaSwag use length-normalized accuracy).
````
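The README's training-budget comparison can be made concrete as tokens seen per parameter. This quick sketch uses only the figures the README states (75M parameters and about 18B tokens for KeyLM; about 600B and 2T tokens for the 135M-parameter SmolLM and SmolLM2). All inputs are rounded, so the ratios are approximate.

```python
# Figures as stated in the README: (parameters, training tokens), both approximate.
MODELS = {
    "KeyLM-75M": (75e6, 18e9),
    "SmolLM-135M": (135e6, 600e9),
    "SmolLM2-135M": (135e6, 2e12),
}


def tokens_per_parameter(params: float, tokens: float) -> float:
    """How many training tokens the model saw per parameter."""
    return tokens / params


def keylm_share(tokens: float) -> float:
    """KeyLM's ~18B-token budget as a fraction of another model's budget."""
    return MODELS["KeyLM-75M"][1] / tokens


for name, (params, tokens) in MODELS.items():
    print(
        f"{name:<13} {tokens / 1e9:>6.0f}B tokens "
        f"{tokens_per_parameter(params, tokens):>8.0f} tokens/param "
        f"KeyLM budget = {keylm_share(tokens):.1%} of this"
    )
```

This works out to about 240 tokens per parameter for KeyLM-75M, versus roughly 4,400 for SmolLM-135M and 14,800 for SmolLM2-135M. KeyLM's token budget is about 3% of SmolLM's and 0.9% of SmolLM2's.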
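The Model Summary line names grouped-query attention, RoPE, SwiGLU and per-head QK-RMSNorm. The pure-Python sketch below shows how these pieces usually fit together in Llama/Qwen3-style blocks, operating on single vectors. It is not KeyLM's code. The head counts, the RoPE base of 10000, the interleaved pairing of dimensions, the omission of learned norm gains, and all function names are assumptions for illustration.

```python
import math
from typing import List

Vec = List[float]


def rms_norm(x: Vec, eps: float = 1e-6) -> Vec:
    """RMSNorm without a learned gain: x / sqrt(mean(x**2) + eps)."""
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * scale for v in x]


def apply_rope(x: Vec, pos: int, base: float = 10000.0) -> Vec:
    """Rotate each pair (x[i], x[i+1]), i even, by angle pos * base**(-i/d).

    Interleaved-pair layout (assumed); Hugging Face's Llama code pairs
    dimension j with j + d/2 instead. The two differ only in which
    dimensions are paired, not in the rotation frequencies.
    """
    d = len(x)
    assert d % 2 == 0, "head dimension must be even"
    out = list(x)
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out


def qk_score(q: Vec, k: Vec, q_pos: int, k_pos: int) -> float:
    """One head's attention logit: per-head RMSNorm (QK-norm), then RoPE, then scaled dot product."""
    q_rot = apply_rope(rms_norm(q), q_pos)
    k_rot = apply_rope(rms_norm(k), k_pos)
    return sum(a * b for a, b in zip(q_rot, k_rot)) / math.sqrt(len(q))


def kv_head_for(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """GQA: each group of n_q_heads // n_kv_heads consecutive query heads shares one KV head."""
    assert n_q_heads % n_kv_heads == 0
    return q_head // (n_q_heads // n_kv_heads)


def silu(v: float) -> float:
    """SiLU (swish): v * sigmoid(v), written to avoid overflow for large |v|."""
    if v >= 0:
        return v / (1.0 + math.exp(-v))
    e = math.exp(v)
    return v * e / (1.0 + e)


def matvec(w: List[Vec], x: Vec) -> Vec:
    """Multiply a matrix given as a list of rows by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def swiglu(x: Vec, w_gate: List[Vec], w_up: List[Vec], w_down: List[Vec]) -> Vec:
    """SwiGLU feed-forward: W_down @ (silu(W_gate @ x) * (W_up @ x))."""
    hidden = [silu(g) * u for g, u in zip(matvec(w_gate, x), matvec(w_up, x))]
    return matvec(w_down, hidden)
```

Because of the RoPE rotation, `qk_score` depends only on the offset `k_pos - q_pos`, not on absolute positions. Because both vectors are RMS-normalized first, each has length about sqrt(d), so the logit stays within about ±sqrt(d) before any learned gains. Keeping attention logits bounded is the usual stability argument for QK-norm.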
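The fine-tuning note removed in this commit mentions assistant-only loss masking under a plain `User:` / `Assistant:` format. This is a minimal sketch of that idea, with a toy whitespace tokenizer standing in for the real KeyLM tokenizer. Every label outside the assistant's reply is set to -100, the ignore index that PyTorch's cross-entropy loss skips by default. The template layout, `ToyTokenizer`, and `build_example` are hypothetical; the commit does not show the actual Instruct recipe.

```python
from typing import Dict, List, Tuple

IGNORE_INDEX = -100  # label value skipped by torch.nn.CrossEntropyLoss by default


class ToyTokenizer:
    """Hypothetical stand-in tokenizer: one id per whitespace-separated word."""

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {}

    def encode(self, text: str) -> List[int]:
        # Assign a new id the first time a word is seen.
        return [self.vocab.setdefault(tok, len(self.vocab)) for tok in text.split()]


def build_example(tok: ToyTokenizer,
                  turns: List[Tuple[str, str]]) -> Tuple[List[int], List[int]]:
    """Return (input_ids, labels) in which only assistant reply tokens are supervised.

    Labels stay aligned with input_ids; transformers causal-LM models shift
    them by one position internally when computing the loss.
    """
    input_ids: List[int] = []
    labels: List[int] = []
    for role, text in turns:
        marker_ids = tok.encode(f"{role}:")  # the "User:" / "Assistant:" marker
        body_ids = tok.encode(text)
        input_ids += marker_ids + body_ids
        labels += [IGNORE_INDEX] * len(marker_ids)  # never train on role markers
        if role == "Assistant":
            labels += body_ids  # supervise the assistant's reply
        else:
            labels += [IGNORE_INDEX] * len(body_ids)  # mask the user's text
    return input_ids, labels
```

A real setup would also append an end-of-turn token after each reply and supervise it, so the model learns where to stop.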
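The Evaluation paragraph says ARC and HellaSwag are scored with length-normalized accuracy. In `lm_eval`'s zero-shot multiple-choice setup, each answer choice is scored by the model's log-likelihood of that choice as a continuation of the question. Plain accuracy picks the highest total log-likelihood. The normalized variant (`acc_norm`) first divides each score by the choice's length in characters, so a longer answer is not penalized simply for containing more tokens. The sketch below illustrates both rules and is not the harness's code. The log-likelihoods are made up and stand in for model outputs.

```python
from typing import List


def argmax(xs: List[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(xs)), key=lambda i: xs[i])


def acc(logliks: List[float], gold: int) -> float:
    """Plain accuracy: pick the choice with the highest total log-likelihood."""
    return 1.0 if argmax(logliks) == gold else 0.0


def acc_norm(logliks: List[float], choices: List[str], gold: int) -> float:
    """Length-normalized accuracy: divide each log-likelihood by the choice's character length."""
    normed = [ll / len(choice) for ll, choice in zip(logliks, choices)]
    return 1.0 if argmax(normed) == gold else 0.0


# Made-up scores for one 4-way question (not real KeyLM outputs).
choices = [" a dog", " the large brown dog ran away", " cat", " a bird"]
logliks = [-6.0, -12.0, -5.0, -7.0]
gold = 1

print("acc      =", acc(logliks, gold))                # shortest choice wins on raw score
print("acc_norm =", acc_norm(logliks, choices, gold))  # per-character score favors the gold answer
print("chance   =", 1 / len(choices))
```

The benchmark score is the mean of these per-question values. On four-way questions, a model that guesses uniformly lands near 25%. That is the random-chance level the README compares KeyLM against.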