chopratejas committed
Commit b7a65fe · verified · 1 parent: 3c5f566

v3: trained on 330K structured tool outputs (H100) — JSON, diffs, logs, code, SQL, agentic traces

Files changed (3)
  1. README.md +64 -46
  2. model.safetensors +1 -1
  3. training_args.bin +1 -1
README.md CHANGED
@@ -11,11 +11,19 @@ tags:
  - modernbert
  - llmlingua
  - headroom
  pipeline_tag: token-classification
  base_model: answerdotai/ModernBERT-base
  datasets:
  - SWE-bench/SWE-smith-trajectories
  - glaiveai/glaive-function-calling-v2
  model-index:
  - name: kompress-base
    results:
@@ -24,19 +32,21 @@ model-index:
  name: Token Compression
  metrics:
  - type: f1
- value: 0.997
  name: F1
  - type: accuracy
- value: 0.994
  name: Accuracy
  ---

- # Kompress: Token Compression for Agentic Contexts

- **Kompress** is a ModernBERT-based token compressor trained specifically for agentic LLM contexts. It is a drop-in replacement for [LLMLingua-2](https://arxiv.org/abs/2403.12968) that achieves **2.3x better entity preservation** while being **2.3x smaller** and supporting **16x longer context windows**.

  ## Key Results

  | Metric | Kompress | LLMLingua-2 |
  |--------|----------|-------------|
  | Entity Preservation | **82.1%** | 36.0% |
@@ -44,23 +54,53 @@ model-index:
  | Model Size | **600 MB** | 1,400 MB |
  | Context Window | **8,192** | 512 |
  | Parameters | **149M** | 355M |
- | Trained on Agentic Data | Yes | No (meeting transcripts) |

  ## Why Kompress?

- LLMLingua-2 was trained on meeting transcripts (MeetingBank). When applied to agentic contexts (tool outputs, code, file paths, error traces), it:

- - **Destroys file paths**: `/Users/foo/.claude/tasks/abc-123` becomes `abc - 123 abc 123 123`
- - **Splits entity names**: Keeps "John" but drops "Smith"
  - **Expands instead of compressing**: 206% average ratio on agentic data
- - **Has no cross-chunk awareness**: 512-token chunks, no global context

- Kompress fixes all of these with:

- 1. **Agentic training data** — trained on real Claude Code sessions, SWE-bench trajectories, and function-calling traces
  2. **Dual-head architecture** — token classification + span importance CNN prevents entity splitting
  3. **ModernBERT backbone** — 8K context window, code-pretrained, RoPE attention

  ## Architecture

  ```
@@ -71,76 +111,53 @@ Input tokens → ModernBERT-base encoder (149M params, 8K context) →
  Final score = token_prob × (0.5 + 0.5 × span_score)
  ```

- The span head (~200K extra params) learns contiguous importance regions, preventing the "split entity" and "incoherent fragments" problems of pure token-level classifiers.

  ## Quick Start

  ```python
- # Install
  pip install kompress

- # Compress text
  from kompress.inference.pytorch_runner import KompressRunner

  runner = KompressRunner(checkpoint_path="chopratejas/kompress-base")
  result = runner.compress(
-     "The function parse_config in /Users/dev/app/config.py returned None "
-     "because the YAML file was malformed at line 42. Error: yaml.scanner."
-     "ScannerError: mapping values are not allowed here.",
      target_ratio=0.5,
  )
  print(result.compressed)
- # Keeps: parse_config, /Users/dev/app/config.py, None, YAML, line 42, ScannerError
  ```

  ## Use with Headroom

- Kompress is designed as a drop-in replacement for LLMLingua-2 in the [Headroom](https://github.com/chopratejas/headroom) compression pipeline:
-
  ```python
- from kompress.integration.transform import KompressCompressor, KompressConfig
  from kompress.integration.headroom_bridge import patch_content_router
-
- # Option 1: Use directly
- compressor = KompressCompressor(KompressConfig(
-     checkpoint_path="chopratejas/kompress-base"
- ))
- result = compressor.compress(long_tool_output)
-
- # Option 2: Patch existing Headroom pipeline
  from headroom.transforms import ContentRouter

  router = ContentRouter()
  patch_content_router(router)  # Swaps LLMLingua → Kompress
  ```

- ## Training Data
-
- Trained on 15,051 labeled examples from three diverse sources:
-
- | Source | Segments | Type |
- |--------|----------|------|
- | Claude Code sessions | ~10,000 | Real agentic coding traces |
- | Glaive Function Calling | ~3,000 | General tool-use across domains |
- | SWE-bench Trajectories | ~2,000 | Open-source coding agent traces |
-
- Labels generated via Claude Sonnet distillation with task-conditioned, entity-aware prompts.
-
  ## Training Details

  - **Base model**: [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) (149M params)
- - **Training**: 5 epochs, batch=32, lr=2e-5, AdamW, on NVIDIA A100
  - **Loss**: CrossEntropy (token head) + 0.3 × BCE (span head)
- - **Metrics**: F1=0.997, Precision=0.994, Recall=1.0

  ## License

- Apache 2.0 — use it however you want.

  ## Citation

  ```bibtex
  @software{kompress2025,
- title={Kompress: Token Compression for Agentic Contexts},
  author={Tejas Chopra},
  year={2025},
  url={https://huggingface.co/chopratejas/kompress-base},
@@ -149,6 +166,7 @@ Apache 2.0 — use it however you want.

  ## Links

  - [Headroom](https://github.com/chopratejas/headroom) — Context compression framework
- - [LLMLingua-2 paper](https://arxiv.org/abs/2403.12968) — The model Kompress replaces
  - [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base) — Base encoder
 
  - modernbert
  - llmlingua
  - headroom
+ - tool-outputs
+ - structured-data
  pipeline_tag: token-classification
  base_model: answerdotai/ModernBERT-base
  datasets:
  - SWE-bench/SWE-smith-trajectories
  - glaiveai/glaive-function-calling-v2
+ - nebius/SWE-agent-trajectories
+ - Agent-Ark/Toucan-1.5M
+ - tuandunghcmut/toolbench-v1
+ - JetBrains-Research/diff-xyz
+ - code_search_net
+ - b-mc2/sql-create-context
  model-index:
  - name: kompress-base
    results:

  name: Token Compression
  metrics:
  - type: f1
+ value: 0.9956
  name: F1
  - type: accuracy
+ value: 0.9926
  name: Accuracy
  ---

+ # Kompress: Token Compression for Structured Tool Outputs & Agentic Contexts

+ **Kompress** is a ModernBERT-based token compressor trained on **330K examples** of structured tool outputs — JSON API responses, git diffs, error logs, source code, CLI output, database results, and agentic conversation traces. It is a drop-in replacement for [LLMLingua-2](https://arxiv.org/abs/2403.12968).

  ## Key Results

+ ### On Agentic / Structured Data (our target domain)
+
  | Metric | Kompress | LLMLingua-2 |
  |--------|----------|-------------|
  | Entity Preservation | **82.1%** | 36.0% |

  | Model Size | **600 MB** | 1,400 MB |
  | Context Window | **8,192** | 512 |
  | Parameters | **149M** | 355M |
+
+ ### On LLMLingua-2's Benchmarks
+
+ | Dataset | Kompress | LLMLingua-2 | Note |
+ |---------|----------|-------------|------|
+ | MeetingBank | 46.3% | **57.4%** | LLMLingua-2's training domain |
+ | GSM8K | 97.8% | **98.9%** | Both excellent; LLMLingua-2 keeps 88% of tokens vs. Kompress's 50% |
+
+ ### Cross-Agent Generalization (Cursor IDE — never seen in training)
+
+ | Metric | Kompress | LLMLingua-2 |
+ |--------|----------|-------------|
+ | Entity Preservation | **91.1%** | 13.5% |
+ | Compression Ratio | **49.9%** | 85.8% |
 
  ## Why Kompress?

+ LLMLingua-2 was trained on meeting transcripts. When applied to structured tool outputs, it:

+ - **Destroys file paths**: `/Users/foo/.claude/tasks/abc-123` becomes `abc - 123 abc 123`
  - **Expands instead of compressing**: 206% average ratio on agentic data
+ - **Fragments UUIDs**: `4e149fea-6eb8-4feb` becomes `4e149fea - 6eb8 - 4feb`
+ - **Has no cross-chunk awareness**: 512-token limit

+ Kompress fixes these with:

+ 1. **Trained on structured data** — 330K examples of real tool outputs: JSON, diffs, logs, code, CLI output, SQL
  2. **Dual-head architecture** — token classification + span importance CNN prevents entity splitting
  3. **ModernBERT backbone** — 8K context window, code-pretrained, RoPE attention

+ ## Training Data (330K examples)
+
+ | Source | Examples | Type |
+ |--------|----------|------|
+ | Toucan-1.5M (MCP tool outputs) | ~80K | Real MCP server tool responses |
+ | SWE-agent trajectories | ~60K | Bash output, file reads, git diffs |
+ | ToolBench | ~50K | REST API JSON responses |
+ | Glaive Function Calling | ~40K | Function call/response pairs |
+ | CodeSearchNet | ~40K | Source code (Python, JS, Java, Go, Ruby, PHP) |
+ | JetBrains diff-xyz | ~10K | Git unified diffs |
+ | SQL create-context | ~10K | Database schemas + queries |
+ | Claude Code sessions | ~15K | Real agentic coding traces (API-labeled) |
+ | SWE-bench trajectories | ~15K | Open-source coding agent traces |
+ | Glaive + SWE (API-labeled) | ~10K | Function calling + coding (API-labeled) |
+
+ Labeling: heuristic rules for structured data (JSON → keep keys, diffs → keep +/− lines, logs → keep error lines), plus Claude Sonnet distillation for natural-language segments.
+
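The rule-based side of that labeling scheme can be sketched in a few lines. This is a toy illustration under assumed predicates (key-bearing JSON lines, changed diff lines, error-matching log lines), not the actual labeler used to build the 330K-example set:

```python
import re

def heuristic_keep_lines(text: str, kind: str) -> list[str]:
    """Toy version of rule-based labeling: JSON -> keep key-bearing lines,
    diffs -> keep changed lines, logs -> keep error lines. Illustrative only;
    the real rules behind the dataset are more involved."""
    lines = text.splitlines()
    if kind == "json":
        # keep lines that carry an object key
        return [l for l in lines if re.search(r'"\w+"\s*:', l)]
    if kind == "diff":
        # keep +/- lines, skipping the file headers
        return [l for l in lines
                if l.startswith(("+", "-")) and not l.startswith(("+++", "---"))]
    if kind == "log":
        # keep anything that looks like an error
        return [l for l in lines if re.search(r"error|exception|traceback", l, re.I)]
    return lines

diff = "--- a/x.py\n+++ b/x.py\n context\n+added\n-removed"
print(heuristic_keep_lines(diff, "diff"))  # ['+added', '-removed']
```

Lines selected this way become the positive (keep) token labels; everything else is labeled drop.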

  ## Architecture

  ```

  Final score = token_prob × (0.5 + 0.5 × span_score)
  ```

+ The span head (~200K extra params) learns contiguous importance regions, preventing entity splitting and maintaining coherence.
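The scoring rule above is simple enough to write out directly; a minimal NumPy sketch (array shapes are assumptions, not the model's actual tensor layout):

```python
import numpy as np

def final_scores(token_probs: np.ndarray, span_scores: np.ndarray) -> np.ndarray:
    """Final score = token_prob * (0.5 + 0.5 * span_score).

    The 0.5 floor means the span head modulates the token head but never
    vetoes it: a token in an unimportant span keeps half its probability.
    """
    return token_probs * (0.5 + 0.5 * span_scores)

token_probs = np.array([0.9, 0.9, 0.2])   # per-token keep probabilities
span_scores = np.array([1.0, 0.0, 1.0])   # span importance in [0, 1]
print(final_scores(token_probs, span_scores))
```

A span score of 1 leaves the token probability unchanged; a span score of 0 halves it, so a high-confidence token inside an important span always outranks a stray high-scoring token outside one.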
 
  ## Quick Start

  ```python
  # pip install kompress

  from kompress.inference.pytorch_runner import KompressRunner

  runner = KompressRunner(checkpoint_path="chopratejas/kompress-base")
  result = runner.compress(
+     '{"users": [{"id": 1, "name": "Alice", "email": "alice@example.com"}, '
+     '{"id": 2, "name": "Bob", "email": "bob@example.com"}, '
+     '{"id": 3, "name": "Charlie", "email": "charlie@example.com"}]}',
      target_ratio=0.5,
  )
  print(result.compressed)
+ # Keeps keys, structure, and unique values; discards repetitive patterns
  ```
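KompressRunner's internals are not shown here, but the usual way a token classifier honors a `target_ratio` is to keep the top-scoring fraction of tokens in their original order. A sketch of that selection step (hypothetical helper, not the library's API):

```python
def select_tokens(tokens: list[str], scores: list[float],
                  target_ratio: float = 0.5) -> list[str]:
    """Keep the top `target_ratio` fraction of tokens by score,
    preserving original order. Sketch only, not KompressRunner."""
    k = max(1, round(len(tokens) * target_ratio))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    keep = set(top)
    return [t for i, t in enumerate(tokens) if i in keep]

tokens = ["the", "ScannerError", "was", "raised", "at", "line", "42"]
scores = [0.10, 0.99, 0.20, 0.60, 0.15, 0.90, 0.95]
print(select_tokens(tokens, scores, 0.5))  # ['ScannerError', 'raised', 'line', '42']
```

Selecting by rank rather than by a fixed threshold is what makes the output length track the requested ratio regardless of the score distribution.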

  ## Use with Headroom

  ```python
  from kompress.integration.headroom_bridge import patch_content_router
  from headroom.transforms import ContentRouter
+
  router = ContentRouter()
  patch_content_router(router)  # Swaps LLMLingua → Kompress
  ```

  ## Training Details

  - **Base model**: [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) (149M params)
+ - **Training**: 3 epochs, batch=64, lr=2e-5, AdamW + torch.compile, on an NVIDIA H100
  - **Loss**: CrossEntropy (token head) + 0.3 × BCE (span head)
+ - **Final metrics**: F1=0.9956, Precision=0.9959, Recall=0.9953, train loss=0.068
+ - **Training time**: 2h 39m on an H100 (330K examples, 3 epochs)
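The loss line above combines the two heads; a NumPy sketch of that objective, under assumed head shapes (token logits over {drop, keep} per token, one span logit per token):

```python
import numpy as np

def kompress_loss(token_logits, token_labels, span_logits, span_labels,
                  span_weight=0.3):
    """CrossEntropy (token head) + 0.3 * BCE (span head), as listed above.
    Shapes assumed: token_logits (T, 2), token_labels (T,),
    span_logits (T,), span_labels (T,)."""
    # softmax cross-entropy over {drop, keep}
    z = token_logits - token_logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(token_labels)), token_labels].mean()
    # binary cross-entropy with logits for span importance
    p = 1.0 / (1.0 + np.exp(-span_logits))
    bce = -(span_labels * np.log(p) + (1 - span_labels) * np.log(1 - p)).mean()
    return ce + span_weight * bce
```

The 0.3 weight keeps the span head as a regularizer on contiguity rather than letting it dominate the per-token objective.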
 
  ## License

+ Apache 2.0

  ## Citation

  ```bibtex
  @software{kompress2025,
+   title={Kompress: Token Compression for Structured Tool Outputs and Agentic Contexts},
    author={Tejas Chopra},
    year={2025},
    url={https://huggingface.co/chopratejas/kompress-base},
  }
  ```

  ## Links

+ - [GitHub](https://github.com/chopratejas/kompress) — Source code, training pipeline, eval scripts
  - [Headroom](https://github.com/chopratejas/headroom) — Context compression framework
+ - [LLMLingua-2](https://arxiv.org/abs/2403.12968) — The model Kompress replaces
  - [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base) — Base encoder
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:1b95ef3ac2d846544939888b143f34f30fa7daf9623a2f1ed4c050f98ecc9c31
+ oid sha256:48f6af5958adc710a7758c4a6920aa7811f41fd063d299bd09ee445d5982c4d7
  size 600015548
training_args.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:a38655a76ccc51a01ef7d311276d42cfa6e09bbcd0b1bdbe6318161bbdb9b26f
+ oid sha256:8a6e6c606fb5a5649d8d2dcd06034c2fbab23960b8cf598bd93ff80309a26601
  size 5201