---
license: apache-2.0
language:
- en
tags:
- token-compression
- context-optimization
- llm
- agentic
- modernbert
datasets:
- lmsys/lmsys-chat-1m
- cnn_dailymail
- EdinburghNLP/xsum
- ccdv/govreport-summarization
- ccdv/arxiv-summarization
- huuuyeah/meetingbank
- knkarthick/samsum
metrics:
- f1
- accuracy
pipeline_tag: token-classification
model-index:
- name: kompress-base
  results:
  - task:
      type: token-classification
      name: Token Compression
    metrics:
    - name: Quality Score (Claude-judged)
      type: custom
      value: 7.9
    - name: LLMLingua-2 Quality Score
      type: custom
      value: 5.9
    - name: Latency (median, Apple Silicon MPS)
      type: latency
      value: 84ms
    - name: LLMLingua-2 Latency
      type: latency
      value: 117ms
---

# Kompress: ModernBERT Token Compressor for LLM Context Windows

**Kompress compresses text in LLM context windows so agents can do more with less.** It's a drop-in replacement for LLMLingua-2 that's higher quality and 2.3x faster.

## Results

| Model | Quality | Latency | Size | Params |
|-------|---------|---------|------|--------|
| **kompress-base** | **7.9/10** | **84ms** (MPS) | 600MB | 150M |
| [kompress-small](https://huggingface.co/chopratejas/kompress-small) | 7.4/10 | **13-29ms** (ONNX) | 279MB | 70M |
| LLMLingua-2 | 5.9/10 | 117ms | 710MB | 179M |

### Quality on Real Agent Data (Claude-judged)

| Eval Set | kompress-base | kompress-small | LLMLingua-2 |
|----------|--------------|----------------|-------------|
| Unstructured NL text | **7.9/10** | 7.4/10 | 5.9/10 |
| Claude Code sessions | **7.3/10** | **7.4/10** | 6.2/10 |

Quality scores are judged by Claude Sonnet 4.6: "Can an LLM fully understand and act on the compressed version?" (1-10 scale).
## How It Works

Kompress is a **dual-head ModernBERT** model trained to classify each token as keep or discard:

- **Token head**: Binary classifier (keep/discard per token via argmax)
- **Span head**: 1D CNN that identifies important regions and boosts borderline tokens in critical spans

The model decides how much to compress based on content density — no fixed compression ratio.

### Example

```
ORIGINAL (98 words):
After investigating the memory leak, I traced it to the event listener registration in the WebSocket handler. Every time a client connects, we register a new listener on the global event bus, but when the client disconnects, the cleanup function only removes the WebSocket connection from the pool — it doesn't unregister the event listener. Over time, these orphaned listeners accumulate and each one holds a reference to the connection's closure, which in turn holds the entire request context. The fix is straightforward: store the listener reference at connection time and explicitly remove it in the disconnect handler.

COMPRESSED (59 words, 60% kept):
investigating memory leak, traced event listener registration WebSocket handler. Every time client connects, register new listener global event bus, client disconnects, cleanup function only removes WebSocket connection pool — doesn't unregister event listener. Over time, orphaned listeners accumulate each one holds reference connection's closure, holds entire request context. fix straightforward: store listener reference connection time explicitly remove disconnect handler.
```

An LLM can fully understand and act on the compressed version.
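The interplay between the two heads can be sketched in a few lines. This is an illustrative reconstruction, not the model's actual decoding code: the function name `select_tokens` and the `margin` and `span_threshold` parameters are assumptions, chosen to show how a high span score can rescue a token the token head narrowly discarded.

```python
def select_tokens(token_logits, span_scores, span_threshold=0.5, margin=1.0):
    """Combine the two heads per token.

    token_logits: list of (discard_logit, keep_logit) pairs from the token head
    span_scores:  list of span-head sigmoid outputs in [0, 1]
    Returns a keep/discard decision per token.
    """
    decisions = []
    for (discard_logit, keep_logit), span in zip(token_logits, span_scores):
        hard_keep = keep_logit > discard_logit              # argmax of the token head
        # "Borderline": discarded, but by less than `margin` logits
        borderline = (not hard_keep) and (keep_logit - discard_logit > -margin)
        # The span head boosts borderline tokens inside important spans
        decisions.append(hard_keep or (borderline and span > span_threshold))
    return decisions

# Toy example: 4 tokens
logits = [(2.0, -1.0),   # clear discard
          (-1.0, 2.0),   # clear keep
          (0.4, 0.0),    # borderline discard
          (3.0, -2.0)]   # clear discard
spans = [0.1, 0.9, 0.8, 0.2]
print(select_tokens(logits, spans))  # -> [False, True, True, False]
```

Note how the third token is kept only because its span score clears the threshold; without the span head it would be dropped.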
## Usage

```python
from kompress.inference.pytorch_runner import KompressRunner

# Auto-downloads from HuggingFace on first use
runner = KompressRunner()

result = runner.compress("Your long text here...")
print(result.compressed)         # Compressed text
print(result.compression_ratio)  # e.g., 0.62
print(result.tokens_saved)       # Number of tokens saved
```

### With Headroom (LLM Proxy)

```bash
pip install headroom-ai
```

Kompress is built into [Headroom](https://github.com/headroom-ai/headroom) as the default text compressor. It auto-downloads and runs on every API request that passes through the proxy.

## Training

### Architecture

- **Base**: `answerdotai/ModernBERT-base` (149M params, 8192 token context)
- **Token head**: Linear(768, 2) — binary keep/discard classifier
- **Span head**: Conv1d(768→256, k=5) → GELU → Conv1d(256→1, k=3) → Sigmoid
- **Total**: 150M params

### Data

215K extractive compression labels from 8 diverse datasets, labeled by Claude Sonnet 4.6:

| Dataset | Count | Type |
|---------|-------|------|
| LMSYS-Chat-1M | 57K | LLM conversations |
| CNN/DailyMail | 50K | News articles |
| WikiHow | 50K | How-to guides |
| MeetingBank | 50K | Meeting transcripts |
| XSum | 47K | News articles |
| GovReport | 25K | Government reports |
| ArXiv | 25K | Academic papers |
| SAMSum | 14K | Dialogues |

### Labeling Approach

Key insight: the labels must be **strictly extractive** — a subset of the original words, in the original order. Previous versions failed because the labeling LLM rephrased text, causing alignment failures (5-12% keep ratio instead of the intended 40-60%).

The fix: prompt Claude to "select words like highlighting with a marker" rather than "compress this text." This ensures every word in the compressed output exists in the original, and the greedy alignment recovers 95%+ of the intended labels.
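The greedy alignment step can be sketched as follows. This is a simplified word-level illustration, not the repository's code: the name `greedy_align` is hypothetical, and the real pipeline presumably aligns at the subword-token level with normalization. It shows why strictly extractive labels matter — a rephrased word breaks the in-order match and the label is silently lost.

```python
def greedy_align(original_words, compressed_words):
    """Greedily match each compressed word to its next in-order occurrence
    in the original, producing a 0/1 keep label per original word.
    Compressed words with no in-order match are skipped, which is exactly
    how rephrased output degrades label recovery."""
    labels = [0] * len(original_words)
    i = 0  # cursor into the original; only moves forward (in-order constraint)
    for word in compressed_words:
        while i < len(original_words) and original_words[i].lower() != word.lower():
            i += 1
        if i < len(original_words):
            labels[i] = 1
            i += 1
    return labels

orig = "The fix is straightforward: store the listener reference".split()
comp = "fix straightforward: store listener reference".split()
print(greedy_align(orig, comp))  # -> [0, 1, 0, 1, 1, 0, 1, 1]
```

With extractive output every compressed word matches, so label recovery is exact; a single paraphrased word ("simple" for "straightforward:") would leave that position unlabeled and could desynchronize everything after it.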
### Training Details

- 3 epochs, batch size 32, learning rate 2e-5
- BF16 mixed precision on NVIDIA H100
- HuggingFace Trainer with warmup + cosine schedule
- ~3 hours training time

## Model Family

| | kompress-base | [kompress-small](https://huggingface.co/chopratejas/kompress-small) | LLMLingua-2 |
|---|---|---|---|
| Architecture | ModernBERT 22-layer | ModernBERT 6-layer (distilled) | mBERT (2018) |
| Params | 150M | 70M | 179M |
| Size | 600MB | 279MB (ONNX: 275MB) | 710MB |
| Max context | 8,192 tokens | 8,192 tokens | 512 tokens |
| Quality | **7.9/10** | 7.4/10 | 5.9/10 |
| Latency | 84ms (MPS) | **13-29ms (ONNX)** | 117ms |
| Training data | 215K from 8 datasets | Distilled from base | 41K from MeetingBank |
| Labeling model | Claude Sonnet 4.6 | — | GPT-4 |
| Compression | Content-adaptive | Content-adaptive | Fixed ratio |

## Limitations

- English only (ModernBERT is English-focused)
- Best on natural language text; structured data (JSON, code, logs) should use specialized compressors
- Compression ratio varies by content (60-80% kept for dense text, 40-60% for verbose text)

## License

Apache 2.0