Kompress-Small: Distilled Token Compressor (70M params, 13ms ONNX)

6-layer distilled version of kompress-base. Near-identical quality, roughly 4x faster, under half the size.

Compresses text in LLM context windows so agents can do more with less. Distilled from the 22-layer teacher: the student learns the teacher's full probability distributions, not just hard labels.

Results

| Model | Quality | Latency | Size | Params |
|---|---|---|---|---|
| kompress-small (ONNX) | 7.4/10 | 13-29ms | 279MB | 70M |
| kompress-base | 7.9/10 | 49-84ms | 600MB | 150M |
| LLMLingua-2 | 5.9-6.2/10 | 113-117ms | 710MB | 179M |

Detailed Benchmarks

Unstructured text (assistant explanations, plans, error analysis, status summaries):

| Metric | kompress-small | kompress-base | LLMLingua-2 |
|---|---|---|---|
| Quality (Claude-judged) | 7.4/10 | 7.9/10 | 5.9/10 |
| Latency (median) | 29ms | 84ms | 117ms |
| Compression | 58% kept | 58% kept | 53% kept |

Real Claude Code sessions (text segments):

| Metric | kompress-small | kompress-base | LLMLingua-2 |
|---|---|---|---|
| Quality (Claude-judged) | 7.4/10 | 7.3/10 | 6.2/10 |
| Latency (median) | 13ms | 83ms | 114ms |

Architecture

Distilled from kompress-base by keeping 6 evenly-spaced layers from the original 22:

```
Teacher (kompress-base):  layers [0,1,2,...,21]   → 150M params, 600MB
Student (kompress-small): layers [0,4,8,12,17,21] → 70M params, 279MB
```
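The layer-selection step can be sketched in PyTorch. Toy `nn.Linear` layers stand in for the real ModernBERT transformer blocks here; the selected indices are the ones listed above:

```python
import torch.nn as nn

def prune_layers(teacher_layers, keep):
    """Build the student's encoder stack by copying a subset of teacher layers."""
    return nn.ModuleList(teacher_layers[i] for i in keep)

# Toy stand-ins for the 22 ModernBERT encoder blocks
teacher = nn.ModuleList(nn.Linear(8, 8) for _ in range(22))
student = prune_layers(teacher, keep=[0, 4, 8, 12, 17, 21])
print(len(student))  # 6 layers, initialized from the teacher's weights
```

Because the student's layers start as copies of the teacher's, fine-tuning only has to repair the gap left by the removed layers rather than learn from scratch.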

Both the token-classification head and the span-importance CNN are copied from the teacher, and the full model is then fine-tuned with knowledge distillation:

```
loss = 0.5 * CrossEntropy(student, labels) + 0.5 * KL_div(student, teacher) * T²
```

The soft KL divergence loss transfers richer information than hard labels alone — teaching the student how confident the teacher is about each token, not just the binary decision.
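A minimal PyTorch sketch of this loss, using the α=0.5 and T=4.0 values from the training details below (the toy logits stand in for real model outputs):

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Alpha-weighted mix of hard-label CE and temperature-scaled KL to the teacher."""
    ce = F.cross_entropy(student_logits.view(-1, 2), labels.view(-1))
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T  # T^2 keeps soft-target gradients on the same scale as the CE term
    return alpha * ce + (1 - alpha) * kl

# Toy batch: 1 sequence, 3 tokens, 2 classes (drop / keep)
s = torch.randn(1, 3, 2, requires_grad=True)
t = torch.randn(1, 3, 2)
y = torch.tensor([[1, 0, 1]])
loss = distill_loss(s, t, y)
```

Dividing both logits by the temperature before the softmax smooths the distributions, which is what exposes the teacher's relative confidence between the two classes.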

Available Formats

| Format | Size | Latency | Use case |
|---|---|---|---|
| model.safetensors | 279MB | ~77ms (MPS) | PyTorch inference |
| model_int8.pt | 189MB | ~50ms (CPU) | PyTorch INT8 |
| model.onnx + .data | 275MB | 13-29ms (CPU) | Production deployment |
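The card doesn't state how model_int8.pt was produced; one plausible recipe is PyTorch dynamic quantization, sketched here on a toy head (the `nn.Sequential` model is a stand-in, not the real architecture):

```python
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

# Toy stand-in for the model; dynamic quantization converts nn.Linear weights to INT8
model = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 2))
qmodel = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 768)
with torch.no_grad():
    out = qmodel(x)  # same interface, smaller weights, CPU-only
print(out.shape)
```

Dynamic quantization needs no calibration data, which fits the roughly one-third size reduction shown in the table.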

Usage

ONNX (Fastest)

```python
import onnxruntime as ort
from transformers import AutoTokenizer

sess = ort.InferenceSession("model.onnx")
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

text = "Your text here..."
words = text.split()
enc = tokenizer(words, is_split_into_words=True, truncation=True,
                max_length=8192, padding=True, return_tensors="np")

logits = sess.run(None, {
    "input_ids": enc["input_ids"],
    "attention_mask": enc["attention_mask"]
})[0][0]

# Argmax: keep tokens where class 1 (keep) beats class 0 (drop)
keep = logits[:, 1] > logits[:, 0]
word_ids = enc.word_ids(batch_index=0)

# Map token-level decisions back to whole words, preserving order
kept_words = []
seen = set()
for idx, wid in enumerate(word_ids):
    if wid is not None and wid not in seen and keep[idx]:
        kept_words.append(words[wid])
        seen.add(wid)

compressed = " ".join(kept_words)
```
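As a quick sanity check against the benchmark numbers (~58% of words kept on unstructured text), the kept ratio can be measured directly. `kept_ratio` is a hypothetical helper for illustration, not part of any package:

```python
def kept_ratio(original_words, kept_words):
    """Fraction of words retained after compression.
    The benchmarks above report ~58% kept on unstructured text."""
    return len(kept_words) / max(len(original_words), 1)

# e.g. 58 of 100 words kept
print(kept_ratio(["w"] * 100, ["w"] * 58))  # 0.58
```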

With Headroom

```shell
pip install headroom-ai
```

kompress-small is available as a lightweight alternative in Headroom.

Training Details

  • Teacher: kompress-base (22 layers, 150M params)
  • Distillation: 5 epochs, batch 32, lr 3e-5, temperature 4.0, α=0.5
  • Data: 263K examples (the same extractive labels as the teacher, drawn from 8 diverse datasets)
  • Hardware: NVIDIA H100
  • Training time: ~4 hours

Limitations

  • English only
  • Best on natural language text; use specialized compressors for JSON/code/logs
  • Quality may degrade on sequences longer than ~512 tokens (a side effect of pruning ModernBERT layers)
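For inputs past that range, one workaround is to compress in overlapping word windows and stitch the results. A minimal sketch; the window sizes below are illustrative assumptions, not tuned values:

```python
def chunk_words(words, max_words=400, overlap=50):
    """Split a long word list into overlapping windows so each compression
    call stays comfortably under the ~512-token reliable range."""
    if len(words) <= max_words:
        return [words]
    step = max_words - overlap
    return [words[i:i + max_words] for i in range(0, len(words) - overlap, step)]

chunks = chunk_words(["w"] * 1000)
print([len(c) for c in chunks])  # [400, 400, 300]
```

The overlap gives each window some shared context at its boundaries; deduplicating the overlapping region when rejoining is left to the caller.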

License

Apache 2.0
