Paper: [LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression](https://arxiv.org/abs/2403.12968)
Kompress is a ModernBERT-based token compressor trained on 330K examples of structured tool outputs — JSON API responses, git diffs, error logs, source code, CLI output, database results, and agentic conversation traces. It is a drop-in replacement for LLMLingua-2.
| Metric | Kompress | LLMLingua-2 |
|---|---|---|
| Entity Preservation | 82.1% | 36.0% |
| Compression Ratio | 48.1% | 206.0% (expands!) |
| Model Size | 600 MB | 1,400 MB |
| Context Window | 8,192 tokens | 512 tokens |
| Parameters | 149M | 355M |
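Entity preservation here means the fraction of exact entities (paths, hyphenated identifiers, numbers) from the input that survive compression verbatim. A minimal sketch of how such a metric can be computed (the regex and function are illustrative, not Kompress's actual evaluation code):

```python
import re

def entity_preservation(original: str, compressed: str) -> float:
    """Fraction of entity-like strings (paths, IDs, numbers) kept verbatim."""
    # Illustrative pattern: file paths, hyphenated IDs/UUIDs, and numbers.
    pattern = r"/[\w./-]+|\b\w+(?:-\w+)+\b|\b\d+(?:\.\d+)?\b"
    entities = set(re.findall(pattern, original))
    if not entities:
        return 1.0  # nothing to preserve
    kept = sum(1 for e in entities if e in compressed)
    return kept / len(entities)

original = "Task /Users/foo/tasks/abc-123 failed with code 404"
compressed = "/Users/foo/tasks/abc-123 failed"
print(round(entity_preservation(original, compressed), 2))  # 0.5 (404 was dropped)
```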
| Dataset | Kompress | LLMLingua-2 | Note |
|---|---|---|---|
| MeetingBank | 46.3% | 57.4% | LLMLingua-2's training domain |
| GSM8K | 97.8% | 98.9% | Both excellent; LLMLingua-2 retains 88% of tokens vs. Kompress's 50% |
| Metric | Kompress | LLMLingua-2 |
|---|---|---|
| Entity Preservation | 91.1% | 13.5% |
| Compression Ratio | 49.9% | 85.8% |
LLMLingua-2 was trained on meeting transcripts. When applied to structured tool outputs, it mangles paths and identifiers:

- `/Users/foo/.claude/tasks/abc-123` → `abc - 123 abc 123`
- `4e149fea-6eb8-4feb` → `4e149fea - 6eb8 - 4feb`

Kompress fixes these with:
| Source | Examples | Type |
|---|---|---|
| Toucan-1.5M (MCP tool outputs) | ~80K | Real MCP server tool responses |
| SWE-agent trajectories | ~60K | Bash output, file reads, git diffs |
| ToolBench | ~50K | REST API JSON responses |
| Glaive Function Calling | ~40K | Function call/response pairs |
| CodeSearchNet | ~40K | Source code (Python, JS, Java, Go, Ruby, PHP) |
| JetBrains diff-xyz | ~10K | Git unified diffs |
| SQL create-context | ~10K | Database schemas + queries |
| Claude Code sessions | ~15K | Real agentic coding traces (API-labeled) |
| SWE-bench trajectories | ~15K | Open-source coding agent traces |
| Glaive + SWE (API-labeled) | ~10K | Function calling + coding (API-labeled) |
Labeling: Heuristic rules for structured data (JSON→keep keys, diffs→keep +/- lines, logs→keep errors) + Claude Sonnet distillation for natural language segments.
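The heuristic side of that labeling can be sketched as simple per-format rules. This is a simplified illustration under the rules named above (keep diff +/- lines, keep error lines); the function names and marker lists are made up, not the actual pipeline:

```python
def label_diff_lines(diff: str) -> list[tuple[str, bool]]:
    """Heuristic labels for a unified diff: keep changed/header lines, drop context."""
    labels = []
    for line in diff.splitlines():
        keep = line.startswith(("+", "-", "@@", "diff "))
        labels.append((line, keep))
    return labels

def label_log_lines(log: str) -> list[tuple[str, bool]]:
    """Heuristic labels for logs: keep error/warning lines, drop the rest."""
    markers = ("error", "fail", "exception", "traceback", "warn")
    return [(line, any(m in line.lower() for m in markers))
            for line in log.splitlines()]

diff = "@@ -1,2 +1,2 @@\n context\n-old line\n+new line"
print([keep for _, keep in label_diff_lines(diff)])  # [True, False, True, True]
```

Natural-language segments, which have no such structure, fall through to the Claude Sonnet distillation labels instead.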
- Input tokens → ModernBERT-base encoder (149M params, 8K context)
- Head 1: token-level keep/discard (Linear → Softmax)
- Head 2: span importance (Conv1d → GELU → Conv1d → Sigmoid)
- Final score = token_prob × (0.5 + 0.5 × span_score)
The span head (~200K extra params) learns contiguous importance regions, preventing entity splitting and maintaining coherence.
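The combined scoring and ratio-based selection can be sketched in plain Python (illustrative token values and helper names; not the package's actual inference code):

```python
def final_scores(token_probs, span_scores):
    """Combine the two heads: token keep-probability modulated by span importance."""
    return [p * (0.5 + 0.5 * s) for p, s in zip(token_probs, span_scores)]

def select_tokens(tokens, scores, target_ratio=0.5):
    """Keep the highest-scoring tokens until the target ratio is reached."""
    k = max(1, round(len(tokens) * target_ratio))
    keep_idx = sorted(sorted(range(len(tokens)), key=lambda i: -scores[i])[:k])
    return [tokens[i] for i in keep_idx]  # preserve original token order

tokens = ["{", '"id"', ":", "1", ",", '"name"', ":", '"Alice"', "}"]
probs  = [0.9, 0.95, 0.6, 0.9, 0.3, 0.95, 0.6, 0.9, 0.9]   # head 1 (hypothetical)
spans  = [0.8, 0.9, 0.9, 0.9, 0.2, 0.9, 0.9, 0.9, 0.8]     # head 2 (hypothetical)
scores = final_scores(probs, spans)
print(" ".join(select_tokens(tokens, scores, target_ratio=0.5)))
```

Because the span head assigns similar scores to neighboring tokens, keys and their values tend to be kept or dropped together rather than split apart.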
```bash
pip install kompress
```

```python
from kompress.inference.pytorch_runner import KompressRunner

runner = KompressRunner(checkpoint_path="chopratejas/kompress-base")
result = runner.compress(
    '{"users": [{"id": 1, "name": "Alice", "email": "alice@example.com"}, '
    '{"id": 2, "name": "Bob", "email": "bob@example.com"}, '
    '{"id": 3, "name": "Charlie", "email": "charlie@example.com"}]}',
    target_ratio=0.5,
)
print(result.compressed)
# Keeps: keys, structure, unique values; discards repetitive patterns
```
```python
from kompress.integration.headroom_bridge import patch_content_router
from headroom.transforms import ContentRouter

router = ContentRouter()
patch_content_router(router)  # Swaps LLMLingua-2 → Kompress
```
License: Apache 2.0
```bibtex
@software{kompress2025,
  title  = {Kompress: Token Compression for Structured Tool Outputs and Agentic Contexts},
  author = {Chopra, Tejas},
  year   = {2025},
  url    = {https://huggingface.co/chopratejas/kompress-base}
}
```
Base model: answerdotai/ModernBERT-base