---
license: apache-2.0
language:
- en
library_name: transformers
tags:
- token-compression
- prompt-compression
- context-compression
- agentic
- modernbert
- llmlingua
- headroom
- tool-outputs
- structured-data
pipeline_tag: token-classification
base_model: answerdotai/ModernBERT-base
datasets:
- SWE-bench/SWE-smith-trajectories
- glaiveai/glaive-function-calling-v2
- nebius/SWE-agent-trajectories
- Agent-Ark/Toucan-1.5M
- tuandunghcmut/toolbench-v1
- JetBrains-Research/diff-xyz
- code_search_net
- b-mc2/sql-create-context
model-index:
- name: kompress-base
  results:
  - task:
      type: token-classification
      name: Token Compression
    metrics:
    - type: f1
      value: 0.9956
      name: F1
    - type: accuracy
      value: 0.9926
      name: Accuracy
---
# Kompress: Token Compression for Structured Tool Outputs & Agentic Contexts

**Kompress** is a ModernBERT-based token compressor trained on **330K examples** of structured tool outputs — JSON API responses, git diffs, error logs, source code, CLI output, database results, and agentic conversation traces. It is a drop-in replacement for [LLMLingua-2](https://arxiv.org/abs/2403.12968).
## Key Results

### On Agentic / Structured Data (our target domain)

| Metric | Kompress | LLMLingua-2 |
|--------|----------|-------------|
| Entity Preservation | **82.1%** | 36.0% |
| Compression Ratio | **48.1%** | 206.0% (expands!) |
| Model Size | **600 MB** | 1,400 MB |
| Context Window | **8,192 tokens** | 512 tokens |
| Parameters | **149M** | 355M |
### On LLMLingua-2's Benchmarks

| Dataset | Kompress | LLMLingua-2 | Note |
|---------|----------|-------------|------|
| MeetingBank | 46.3% | **57.4%** | LLMLingua-2's training domain |
| GSM8K | 97.8% | **98.9%** | Both excellent; LLMLingua-2 keeps 88% of tokens vs. Kompress's 50% |
### Cross-Agent Generalization (Cursor IDE — never seen in training)

| Metric | Kompress | LLMLingua-2 |
|--------|----------|-------------|
| Entity Preservation | **91.1%** | 13.5% |
| Compression Ratio | **49.9%** | 85.8% |
## Why Kompress?

LLMLingua-2 was trained on meeting transcripts. When applied to structured tool outputs, it:

- **Destroys file paths**: `/Users/foo/.claude/tasks/abc-123` → `abc - 123 abc 123`
- **Expands instead of compressing**: 206% average ratio on agentic data
- **Fragments UUIDs**: `4e149fea-6eb8-4feb` → `4e149fea - 6eb8 - 4feb`
- **Has no cross-chunk awareness**: its 512-token limit forces independent chunks

Kompress addresses these with:

1. **Training on structured data** — 330K examples of real tool outputs: JSON, diffs, logs, code, CLI output, SQL
2. **A dual-head architecture** — a token-classification head plus a span-importance CNN that prevents entity splitting
3. **A ModernBERT backbone** — 8K context window, code-pretrained, RoPE attention
## Training Data (330K examples)

| Source | Examples | Type |
|--------|----------|------|
| Toucan-1.5M (MCP tool outputs) | ~80K | Real MCP server tool responses |
| SWE-agent trajectories | ~60K | Bash output, file reads, git diffs |
| ToolBench | ~50K | REST API JSON responses |
| Glaive Function Calling | ~40K | Function call/response pairs |
| CodeSearchNet | ~40K | Source code (Python, JS, Java, Go, Ruby, PHP) |
| JetBrains diff-xyz | ~10K | Git unified diffs |
| SQL create-context | ~10K | Database schemas + queries |
| Claude Code sessions | ~15K | Real agentic coding traces (API-labeled) |
| SWE-bench trajectories | ~15K | Open-source coding agent traces |
| Glaive + SWE (API-labeled) | ~10K | Function calling + coding (API-labeled) |

Labeling: heuristic rules for structured data (JSON → keep keys, diffs → keep +/- lines, logs → keep error lines), plus Claude Sonnet distillation for natural-language segments.
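The JSON heuristic can be sketched roughly as follows. This is an illustrative toy, not the project's actual labeling pipeline: the function name, the keep/discard encoding, and the length cutoff for values are all assumptions made for the example.

```python
import json

def label_json_tokens(raw: str) -> list[tuple[str, int]]:
    """Toy heuristic labeler: mark JSON keys as keep (1) and mark
    long string values as discard (0). Illustrative only."""
    labels = []
    obj = json.loads(raw)

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                labels.append((key, 1))  # keys are always kept
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)
        else:
            # Scalars: keep short values, discard long boilerplate
            # (the 40-char cutoff is an arbitrary choice for this sketch).
            text = str(node)
            labels.append((text, 1 if len(text) <= 40 else 0))

    walk(obj)
    return labels

labels = label_json_tokens('{"id": 1, "name": "Alice"}')
# Both keys and both short values are labeled keep (1).
```

A real labeler would operate on tokenizer tokens rather than JSON nodes, but the shape of the rule (structure is preserved, repetitive payload is not) is the same.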
## Architecture

```
Input tokens → ModernBERT-base encoder (149M params, 8K context) →
  Head 1: Token-level keep/discard (Linear → Softmax)
  Head 2: Span importance (Conv1d → GELU → Conv1d → Sigmoid)
Final score = token_prob × (0.5 + 0.5 × span_score)
```

The span head (~200K extra parameters) learns contiguous importance regions, preventing entity splitting and maintaining coherence.
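The scoring formula above can be checked numerically. This sketch uses made-up probabilities (in the real model both arrays come from the two heads); it shows that the span score can only scale a token's probability between 0.5× and 1×, so a weak span halves a token's score rather than zeroing it.

```python
import numpy as np

# Hypothetical per-token keep probabilities from Head 1 (softmax output)
token_prob = np.array([0.9, 0.2, 0.8, 0.1])
# Hypothetical span-importance scores from Head 2 (sigmoid output)
span_score = np.array([1.0, 1.0, 0.0, 0.0])

# Final score = token_prob × (0.5 + 0.5 × span_score)
final = token_prob * (0.5 + 0.5 * span_score)
# → [0.9, 0.2, 0.4, 0.05]: tokens inside an important span keep their
#   full probability; tokens outside one are damped by half.
```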
## Quick Start

```bash
pip install kompress
```

```python
from kompress.inference.pytorch_runner import KompressRunner

runner = KompressRunner(checkpoint_path="chopratejas/kompress-base")
result = runner.compress(
    '{"users": [{"id": 1, "name": "Alice", "email": "alice@example.com"}, '
    '{"id": 2, "name": "Bob", "email": "bob@example.com"}, '
    '{"id": 3, "name": "Charlie", "email": "charlie@example.com"}]}',
    target_ratio=0.5,
)
print(result.compressed)
# Keeps keys, structure, and unique values; discards repetitive patterns
```
## Use with Headroom

```python
from kompress.integration.headroom_bridge import patch_content_router
from headroom.transforms import ContentRouter

router = ContentRouter()
patch_content_router(router)  # Swaps LLMLingua-2 → Kompress
```
## Training Details

- **Base model**: [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) (149M params)
- **Training**: 3 epochs, batch size 64, lr 2e-5, AdamW + torch.compile on an NVIDIA H100
- **Loss**: CrossEntropy (token head) + 0.3 × BCE (span head)
- **Final metrics**: F1 = 0.9956, Precision = 0.9959, Recall = 0.9953, train loss = 0.068
- **Training time**: 2h 39m on one H100 (330K examples, 3 epochs)
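The combined objective can be sketched in plain NumPy. The 0.3 span-head weight comes from the training setup above; the toy logits, targets, and label encoding are illustrative assumptions, and the real training loop uses the framework's loss modules rather than these hand-rolled functions.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean cross-entropy over per-token keep/discard logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def bce(probs, targets):
    """Mean binary cross-entropy over span-importance scores."""
    eps = 1e-12
    return -(targets * np.log(probs + eps)
             + (1 - targets) * np.log(1 - probs + eps)).mean()

# Toy batch of 4 tokens (labels and values are made up for the sketch)
token_logits = np.array([[2.0, -1.0], [0.1, 0.3], [1.5, -0.5], [-2.0, 2.0]])
token_targets = np.array([0, 1, 0, 1])
span_probs = np.array([0.9, 0.4, 0.8, 0.1])
span_targets = np.array([1.0, 0.0, 1.0, 0.0])

# Loss = CE(token head) + 0.3 × BCE(span head)
loss = cross_entropy(token_logits, token_targets) + 0.3 * bce(span_probs, span_targets)
```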
## License

Apache 2.0

## Citation

```bibtex
@software{kompress2025,
  title={Kompress: Token Compression for Structured Tool Outputs and Agentic Contexts},
  author={Chopra, Tejas},
  year={2025},
  url={https://huggingface.co/chopratejas/kompress-base},
}
```

## Links

- [GitHub](https://github.com/chopratejas/kompress) — source code, training pipeline, eval scripts
- [Headroom](https://github.com/chopratejas/headroom) — context compression framework
- [LLMLingua-2](https://arxiv.org/abs/2403.12968) — the model Kompress replaces
- [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base) — base encoder