# Kompress-Small: Distilled Token Compressor (70M params, 13ms ONNX)

A 6-layer distilled version of kompress-base: same quality, 4x faster, half the size.

Compresses text in LLM context windows so agents can do more with less. Distilled from the 22-layer teacher using knowledge distillation: the student learns the teacher's probability distributions, not just hard labels.
## Results
| Model | Quality | Latency | Size | Params |
|---|---|---|---|---|
| kompress-small (ONNX) | 7.4/10 | 13-29ms | 279MB | 70M |
| kompress-base | 7.9/10 | 49-84ms | 600MB | 150M |
| LLMLingua-2 | 5.9-6.2/10 | 113-117ms | 710MB | 179M |
## Detailed Benchmarks
**Unstructured text** (assistant explanations, plans, error analysis, status summaries):
| Metric | kompress-small | kompress-base | LLMLingua-2 |
|---|---|---|---|
| Quality (Claude-judged) | 7.4/10 | 7.9/10 | 5.9/10 |
| Latency (median) | 29ms | 84ms | 117ms |
| Compression | 58% kept | 58% kept | 53% kept |
**Real Claude Code sessions** (text segments):
| Metric | kompress-small | kompress-base | LLMLingua-2 |
|---|---|---|---|
| Quality (Claude-judged) | 7.4/10 | 7.3/10 | 6.2/10 |
| Latency (median) | 13ms | 83ms | 114ms |
## Architecture
Distilled from kompress-base by keeping 6 evenly spaced layers of the original 22:

```
Teacher (kompress-base):  layers [0, 1, 2, ..., 21]   → 150M params, 600MB
Student (kompress-small): layers [0, 4, 8, 12, 17, 21] → 70M params, 279MB
```
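The layer-selection step above can be sketched as follows. This is a minimal illustration, not the actual checkpoint surgery: `nn.Identity` stands in for real transformer blocks, and `prune_layers` is a hypothetical helper name.

```python
import torch.nn as nn

# Indices of the 6 evenly spaced teacher layers kept in the student
KEEP = [0, 4, 8, 12, 17, 21]

def prune_layers(teacher_layers: nn.ModuleList) -> nn.ModuleList:
    # Copy only the kept layers, preserving their original order
    return nn.ModuleList(teacher_layers[i] for i in KEEP)

# Toy stand-in for the 22-layer teacher encoder
teacher_layers = nn.ModuleList(nn.Identity() for _ in range(22))
student_layers = prune_layers(teacher_layers)  # 6 layers
```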
Both the token classification head and the span-importance CNN are copied from the teacher, then the full model is fine-tuned with knowledge distillation:
```
loss = 0.5 * CrossEntropy(student, labels) + 0.5 * KL_div(student, teacher) * T²
```
The soft KL divergence loss transfers richer information than hard labels alone — teaching the student how confident the teacher is about each token, not just the binary decision.
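The combined loss can be sketched in PyTorch as follows. The random tensors are placeholders for real student/teacher logits; the α=0.5 and T=4.0 values match the Training Details section.

```python
import torch
import torch.nn.functional as F

T, alpha = 4.0, 0.5

# Placeholder logits: (batch, tokens, keep/drop classes)
student_logits = torch.randn(8, 128, 2)
teacher_logits = torch.randn(8, 128, 2)
labels = torch.randint(0, 2, (8, 128))

# Hard-label term: standard cross-entropy against extractive labels
ce = F.cross_entropy(student_logits.view(-1, 2), labels.view(-1))

# Soft term: KL divergence between temperature-softened distributions;
# T² rescales gradients back to the magnitude of the label loss
kl = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * (T ** 2)

loss = alpha * ce + (1 - alpha) * kl
```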
## Available Formats
| Format | Size | Latency | Use case |
|---|---|---|---|
| `model.safetensors` | 279MB | ~77ms (MPS) | PyTorch inference |
| `model_int8.pt` | 189MB | ~50ms (CPU) | PyTorch INT8 |
| `model.onnx` + `.data` | 275MB | 13-29ms (CPU) | Production deployment |
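An INT8 checkpoint like `model_int8.pt` could plausibly be produced with PyTorch dynamic quantization. The exact export recipe isn't documented here, so this is a sketch with `TinyEncoder` as a toy stand-in for the real model:

```python
import torch

class TinyEncoder(torch.nn.Module):
    """Toy stand-in for the real model (Linear layers are what get quantized)."""
    def __init__(self):
        super().__init__()
        self.ff = torch.nn.Linear(64, 64)
        self.head = torch.nn.Linear(64, 2)  # binary keep/drop head

    def forward(self, x):
        return self.head(torch.relu(self.ff(x)))

model = TinyEncoder().eval()

# Replace Linear layers with dynamically quantized INT8 versions
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Weights are now stored as INT8, shrinking the checkpoint on disk
torch.save(quantized.state_dict(), "model_int8.pt")
```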
## Usage

### ONNX (Fastest)
```python
import onnxruntime as ort
from transformers import AutoTokenizer

sess = ort.InferenceSession("model.onnx")
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

text = "Your text here..."
words = text.split()
enc = tokenizer(
    words,
    is_split_into_words=True,
    truncation=True,
    max_length=8192,
    padding=True,
    return_tensors="np",
)

logits = sess.run(None, {
    "input_ids": enc["input_ids"],
    "attention_mask": enc["attention_mask"],
})[0][0]

# Argmax: keep tokens where class 1 (keep) beats class 0 (drop)
keep = logits[:, 1] > logits[:, 0]

# Map token-level decisions back to words, keeping each word at most once
word_ids = enc.word_ids(batch_index=0)
kept_words = []
seen = set()
for idx, wid in enumerate(word_ids):
    if wid is not None and wid not in seen and keep[idx]:
        kept_words.append(words[wid])
        seen.add(wid)

compressed = " ".join(kept_words)
```
### With Headroom
```bash
pip install headroom-ai
```

kompress-small is available as a lightweight alternative in Headroom.
## Training Details
- Teacher: kompress-base (22 layers, 150M params)
- Distillation: 5 epochs, batch 32, lr 3e-5, temperature 4.0, α=0.5
- Data: 263K examples (same extractive labels from 8 diverse datasets)
- Hardware: NVIDIA H100
- Training time: ~4 hours
## Limitations
- English only
- Best on natural language text; use specialized compressors for JSON/code/logs
- Sequences >512 tokens may degrade (inherited from ModernBERT layer pruning)
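One possible workaround for the long-sequence limitation is to compress overlapping word-level chunks independently and rejoin the results. This is a hypothetical sketch, not part of the library: `chunk_words` and `compress_long` are names introduced here, and `compress` stands in for any of the inference paths above.

```python
def chunk_words(words, size=400, overlap=50):
    """Yield overlapping word windows, each below the ~512-token sweet spot."""
    step = size - overlap
    for start in range(0, max(len(words) - overlap, 1), step):
        yield words[start:start + size]

def compress_long(text, compress):
    """Compress each chunk independently (compress: list[str] -> str) and rejoin."""
    pieces = [compress(chunk) for chunk in chunk_words(text.split())]
    return " ".join(pieces)
```

The overlap gives each chunk some shared context at its boundaries, at the cost of possibly duplicating a few boundary words across chunks.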
## License
Apache 2.0
## Evaluation results

- Quality Score (Claude-judged): 7.4/10 (self-reported)
- Latency, ONNX (median): 13ms (self-reported)