Kompress-Small: Distilled Token Compressor (70M params, 13ms ONNX)

6-layer distilled version of kompress-base. Near-identical quality, roughly 4x faster, under half the size.

Compresses text in LLM context windows so agents can do more with less. Distilled from the 22-layer teacher: the student learns the teacher's full probability distributions, not just hard labels.

Results

| Model | Quality | Latency | Size | Params |
|---|---|---|---|---|
| kompress-small (ONNX) | 7.4/10 | 13-29ms | 279MB | 70M |
| kompress-base | 7.9/10 | 49-84ms | 600MB | 150M |
| LLMLingua-2 | 5.9-6.2/10 | 113-117ms | 710MB | 179M |

Detailed Benchmarks

Unstructured text (assistant explanations, plans, error analysis, status summaries):

| Metric | kompress-small | kompress-base | LLMLingua-2 |
|---|---|---|---|
| Quality (Claude-judged) | 7.4/10 | 7.9/10 | 5.9/10 |
| Latency (median) | 29ms | 84ms | 117ms |
| Compression | 58% kept | 58% kept | 53% kept |

Real Claude Code sessions (text segments):

| Metric | kompress-small | kompress-base | LLMLingua-2 |
|---|---|---|---|
| Quality (Claude-judged) | 7.4/10 | 7.3/10 | 6.2/10 |
| Latency (median) | 13ms | 83ms | 114ms |

Architecture

Distilled from kompress-base by keeping 6 evenly-spaced layers from the original 22:

```
Teacher (kompress-base):  layers [0,1,2,...,21]   → 150M params, 600MB
Student (kompress-small): layers [0,4,8,12,17,21] → 70M params, 279MB
```
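The layer-selection step can be sketched in PyTorch. Toy `nn.Linear` layers stand in for the real ModernBERT transformer blocks here; the selected indices are the ones listed above:

```python
import torch.nn as nn

def prune_layers(teacher_layers, keep):
    """Build the student's encoder stack by copying a subset of teacher layers."""
    return nn.ModuleList(teacher_layers[i] for i in keep)

# Toy stand-ins for the 22 ModernBERT encoder blocks
teacher = nn.ModuleList(nn.Linear(8, 8) for _ in range(22))
student = prune_layers(teacher, keep=[0, 4, 8, 12, 17, 21])
print(len(student))  # 6 layers, initialized from the teacher's weights
```

Because the student's layers start as copies of the teacher's, fine-tuning only has to repair the gap left by the removed layers rather than learn from scratch.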

Both the token-classification head and the span-importance CNN are copied from the teacher, and the full model is then fine-tuned with knowledge distillation:

```
loss = 0.5 * CrossEntropy(student, labels) + 0.5 * KL_div(student, teacher) * T²
```

The soft KL divergence loss transfers richer information than hard labels alone — teaching the student how confident the teacher is about each token, not just the binary decision.
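A minimal PyTorch sketch of this loss, using the α=0.5 and T=4.0 values from the training details below (the toy logits stand in for real model outputs):

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Alpha-weighted mix of hard-label CE and temperature-scaled KL to the teacher."""
    ce = F.cross_entropy(student_logits.view(-1, 2), labels.view(-1))
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T  # T^2 keeps soft-target gradients on the same scale as the CE term
    return alpha * ce + (1 - alpha) * kl

# Toy batch: 1 sequence, 3 tokens, 2 classes (drop / keep)
s = torch.randn(1, 3, 2, requires_grad=True)
t = torch.randn(1, 3, 2)
y = torch.tensor([[1, 0, 1]])
loss = distill_loss(s, t, y)
```

Dividing both logits by the temperature before the softmax smooths the distributions, which is what exposes the teacher's relative confidence between the two classes.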

Available Formats

| Format | Size | Latency | Use case |
|---|---|---|---|
| model.safetensors | 279MB | ~77ms (MPS) | PyTorch inference |
| model_int8.pt | 189MB | ~50ms (CPU) | PyTorch INT8 |
| model.onnx + .data | 275MB | 13-29ms (CPU) | Production deployment |
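The card doesn't state how model_int8.pt was produced; one plausible recipe is PyTorch dynamic quantization, sketched here on a toy head (the `nn.Sequential` model is a stand-in, not the real architecture):

```python
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

# Toy stand-in for the model; dynamic quantization converts nn.Linear weights to INT8
model = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 2))
qmodel = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 768)
with torch.no_grad():
    out = qmodel(x)  # same interface, smaller weights, CPU-only
print(out.shape)
```

Dynamic quantization needs no calibration data, which fits the roughly one-third size reduction shown in the table.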

Usage

ONNX (Fastest)

```python
import onnxruntime as ort
from transformers import AutoTokenizer

sess = ort.InferenceSession("model.onnx")
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

text = "Your text here..."
words = text.split()
enc = tokenizer(words, is_split_into_words=True, truncation=True,
                max_length=8192, padding=True, return_tensors="np")

logits = sess.run(None, {
    "input_ids": enc["input_ids"],
    "attention_mask": enc["attention_mask"]
})[0][0]

# Argmax: keep tokens where class 1 (keep) beats class 0 (drop)
keep = logits[:, 1] > logits[:, 0]
word_ids = enc.word_ids(batch_index=0)

# Map token-level decisions back to whole words, preserving order
kept_words = []
seen = set()
for idx, wid in enumerate(word_ids):
    if wid is not None and wid not in seen and keep[idx]:
        kept_words.append(words[wid])
        seen.add(wid)

compressed = " ".join(kept_words)
```
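As a quick sanity check against the benchmark numbers (~58% of words kept on unstructured text), the kept ratio can be measured directly. `kept_ratio` is a hypothetical helper for illustration, not part of any package:

```python
def kept_ratio(original_words, kept_words):
    """Fraction of words retained after compression.
    The benchmarks above report ~58% kept on unstructured text."""
    return len(kept_words) / max(len(original_words), 1)

# e.g. 58 of 100 words kept
print(kept_ratio(["w"] * 100, ["w"] * 58))  # 0.58
```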

With Headroom

```shell
pip install headroom-ai
```

kompress-small is available as a lightweight alternative in Headroom.

Training Details

  • Teacher: kompress-base (22 layers, 150M params)
  • Distillation: 5 epochs, batch 32, lr 3e-5, temperature 4.0, α=0.5
  • Data: 263K examples (the same extractive labels as the teacher, drawn from 8 diverse datasets)
  • Hardware: NVIDIA H100
  • Training time: ~4 hours

Limitations

  • English only
  • Best on natural language text; use specialized compressors for JSON/code/logs
  • Quality may degrade on sequences longer than ~512 tokens (a side effect of pruning ModernBERT layers)
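For inputs past that range, one workaround is to compress in overlapping word windows and stitch the results. A minimal sketch; the window sizes below are illustrative assumptions, not tuned values:

```python
def chunk_words(words, max_words=400, overlap=50):
    """Split a long word list into overlapping windows so each compression
    call stays comfortably under the ~512-token reliable range."""
    if len(words) <= max_words:
        return [words]
    step = max_words - overlap
    return [words[i:i + max_words] for i in range(0, len(words) - overlap, step)]

chunks = chunk_words(["w"] * 1000)
print([len(c) for c in chunks])  # [400, 400, 300]
```

The overlap gives each window some shared context at its boundaries; deduplicating the overlapping region when rejoining is left to the caller.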

License

Apache 2.0
