---
license: apache-2.0
language:
- en
tags:
- token-compression
- context-optimization
- llm
- agentic
- modernbert
datasets:
- lmsys/lmsys-chat-1m
- cnn_dailymail
- EdinburghNLP/xsum
- ccdv/govreport-summarization
- ccdv/arxiv-summarization
- huuuyeah/meetingbank
- knkarthick/samsum
metrics:
- f1
- accuracy
pipeline_tag: token-classification
model-index:
- name: kompress-base
  results:
  - task:
      type: token-classification
      name: Token Compression
    metrics:
    - name: Quality Score (Claude-judged)
      type: custom
      value: 7.9
    - name: LLMLingua-2 Quality Score
      type: custom
      value: 5.9
    - name: Latency (median, Apple Silicon MPS)
      type: latency
      value: 84ms
    - name: LLMLingua-2 Latency
      type: latency
      value: 117ms
---

# Kompress: ModernBERT Token Compressor for LLM Context Windows

**Kompress compresses text in LLM context windows so agents can do more with less.** It's a drop-in replacement for LLMLingua-2 that's higher quality and 2.3x faster.

## Results

| Model | Quality | Latency | Size | Params |
|-------|---------|---------|------|--------|
| **kompress-base** | **7.9/10** | **84ms** (MPS) | 600MB | 150M |
| [kompress-small](https://huggingface.co/chopratejas/kompress-small) | 7.4/10 | **13-29ms** (ONNX) | 279MB | 70M |
| LLMLingua-2 | 5.9/10 | 117ms | 710MB | 179M |

### Quality on Real Agent Data (Claude-judged)

| Eval Set | kompress-base | kompress-small | LLMLingua-2 |
|----------|--------------|----------------|-------------|
| Unstructured NL text | **7.9/10** | 7.4/10 | 5.9/10 |
| Claude Code sessions | **7.3/10** | **7.4/10** | 6.2/10 |

Quality scores are judged by Claude Sonnet 4.6: "Can an LLM fully understand and act on the compressed version?" (1-10 scale).
## How It Works

Kompress is a **dual-head ModernBERT** model trained to classify each token as keep or discard:

- **Token head**: Binary classifier (keep/discard per token via argmax)
- **Span head**: 1D CNN that identifies important regions and boosts borderline tokens in critical spans

The model decides how much to compress based on content density — no fixed compression ratio.

### Example

```
ORIGINAL (98 words):
After investigating the memory leak, I traced it to the event listener registration in the WebSocket handler. Every time a client connects, we register a new listener on the global event bus, but when the client disconnects, the cleanup function only removes the WebSocket connection from the pool — it doesn't unregister the event listener. Over time, these orphaned listeners accumulate and each one holds a reference to the connection's closure, which in turn holds the entire request context. The fix is straightforward: store the listener reference at connection time and explicitly remove it in the disconnect handler.

COMPRESSED (59 words, 60% kept):
investigating memory leak, traced event listener registration WebSocket handler. Every time client connects, register new listener global event bus, client disconnects, cleanup function only removes WebSocket connection pool — doesn't unregister event listener. Over time, orphaned listeners accumulate each one holds reference connection's closure, holds entire request context. fix straightforward: store listener reference connection time explicitly remove disconnect handler.
```

An LLM can fully understand and act on the compressed version.
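The interplay between the two heads can be sketched in a few lines. This is an illustrative reconstruction, not the model's actual decoding code: the function name `select_tokens` and the `margin` and `span_threshold` parameters are assumptions, chosen to show how a high span score can rescue a token the token head narrowly discarded.

```python
def select_tokens(token_logits, span_scores, span_threshold=0.5, margin=1.0):
    """Combine the two heads per token.

    token_logits: list of (discard_logit, keep_logit) pairs from the token head
    span_scores:  list of span-head sigmoid outputs in [0, 1]
    Returns a keep/discard decision per token.
    """
    decisions = []
    for (discard_logit, keep_logit), span in zip(token_logits, span_scores):
        hard_keep = keep_logit > discard_logit              # argmax of the token head
        # "Borderline": discarded, but by less than `margin` logits
        borderline = (not hard_keep) and (keep_logit - discard_logit > -margin)
        # The span head boosts borderline tokens inside important spans
        decisions.append(hard_keep or (borderline and span > span_threshold))
    return decisions

# Toy example: 4 tokens
logits = [(2.0, -1.0),   # clear discard
          (-1.0, 2.0),   # clear keep
          (0.4, 0.0),    # borderline discard
          (3.0, -2.0)]   # clear discard
spans = [0.1, 0.9, 0.8, 0.2]
print(select_tokens(logits, spans))  # -> [False, True, True, False]
```

Note how the third token is kept only because its span score clears the threshold; without the span head it would be dropped.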
## Usage

```python
from kompress.inference.pytorch_runner import KompressRunner

# Auto-downloads from HuggingFace on first use
runner = KompressRunner()

result = runner.compress("Your long text here...")
print(result.compressed)         # Compressed text
print(result.compression_ratio)  # e.g., 0.62
print(result.tokens_saved)       # Number of tokens saved
```

### With Headroom (LLM Proxy)

```bash
pip install headroom-ai
```

Kompress is built into [Headroom](https://github.com/headroom-ai/headroom) as the default text compressor. It auto-downloads and runs on every API request that passes through the proxy.

## Training

### Architecture

- **Base**: `answerdotai/ModernBERT-base` (149M params, 8192 token context)
- **Token head**: Linear(768, 2) — binary keep/discard classifier
- **Span head**: Conv1d(768→256, k=5) → GELU → Conv1d(256→1, k=3) → Sigmoid
- **Total**: 150M params

### Data

215K extractive compression labels from 8 diverse datasets, labeled by Claude Sonnet 4.6:

| Dataset | Count | Type |
|---------|-------|------|
| LMSYS-Chat-1M | 57K | LLM conversations |
| CNN/DailyMail | 50K | News articles |
| WikiHow | 50K | How-to guides |
| MeetingBank | 50K | Meeting transcripts |
| XSum | 47K | News articles |
| GovReport | 25K | Government reports |
| ArXiv | 25K | Academic papers |
| SAMSum | 14K | Dialogues |

### Labeling Approach

Key insight: the labels must be **strictly extractive** — a subset of the original words, in the original order. Previous versions failed because the labeling LLM rephrased text, causing alignment failures (5-12% keep ratio instead of the intended 40-60%).

The fix: prompt Claude to "select words like highlighting with a marker" rather than "compress this text." This ensures every word in the compressed output exists in the original, and the greedy alignment recovers 95%+ of the intended labels.
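The greedy alignment step can be sketched as follows. This is a simplified word-level illustration, not the repository's code: the name `greedy_align` is hypothetical, and the real pipeline presumably aligns at the subword-token level with normalization. It shows why strictly extractive labels matter — a rephrased word breaks the in-order match and the label is silently lost.

```python
def greedy_align(original_words, compressed_words):
    """Greedily match each compressed word to its next in-order occurrence
    in the original, producing a 0/1 keep label per original word.
    Compressed words with no in-order match are skipped, which is exactly
    how rephrased output degrades label recovery."""
    labels = [0] * len(original_words)
    i = 0  # cursor into the original; only moves forward (in-order constraint)
    for word in compressed_words:
        while i < len(original_words) and original_words[i].lower() != word.lower():
            i += 1
        if i < len(original_words):
            labels[i] = 1
            i += 1
    return labels

orig = "The fix is straightforward: store the listener reference".split()
comp = "fix straightforward: store listener reference".split()
print(greedy_align(orig, comp))  # -> [0, 1, 0, 1, 1, 0, 1, 1]
```

With extractive output every compressed word matches, so label recovery is exact; a single paraphrased word ("simple" for "straightforward:") would leave that position unlabeled and could desynchronize everything after it.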
### Training Details

- 3 epochs, batch size 32, learning rate 2e-5
- BF16 mixed precision on NVIDIA H100
- HuggingFace Trainer with warmup + cosine schedule
- ~3 hours training time

## Model Family

| | kompress-base | [kompress-small](https://huggingface.co/chopratejas/kompress-small) | LLMLingua-2 |
|---|---|---|---|
| Architecture | ModernBERT 22-layer | ModernBERT 6-layer (distilled) | mBERT (2018) |
| Params | 150M | 70M | 179M |
| Size | 600MB | 279MB (ONNX: 275MB) | 710MB |
| Max context | 8,192 tokens | 8,192 tokens | 512 tokens |
| Quality | **7.9/10** | 7.4/10 | 5.9/10 |
| Latency | 84ms (MPS) | **13-29ms (ONNX)** | 117ms |
| Training data | 215K from 8 datasets | Distilled from base | 41K from MeetingBank |
| Labeling model | Claude Sonnet 4.6 | — | GPT-4 |
| Compression | Content-adaptive | Content-adaptive | Fixed ratio |

## Limitations

- English only (ModernBERT is English-focused)
- Best on natural language text; structured data (JSON, code, logs) should use specialized compressors
- Compression ratio varies by content (60-80% kept for dense text, 40-60% for verbose text)

## License

Apache 2.0