chopratejas committed
Commit b7a65fe · verified · 1 parent: 3c5f566

v3: trained on 330K structured tool outputs (H100) — JSON, diffs, logs, code, SQL, agentic traces

Files changed (3)
  1. README.md +64 -46
  2. model.safetensors +1 -1
  3. training_args.bin +1 -1
README.md CHANGED
@@ -11,11 +11,19 @@ tags:
  - modernbert
  - llmlingua
  - headroom
  pipeline_tag: token-classification
  base_model: answerdotai/ModernBERT-base
  datasets:
  - SWE-bench/SWE-smith-trajectories
  - glaiveai/glaive-function-calling-v2
  model-index:
  - name: kompress-base
    results:
@@ -24,19 +32,21 @@ model-index:
  name: Token Compression
  metrics:
  - type: f1
- value: 0.997
  name: F1
  - type: accuracy
- value: 0.994
  name: Accuracy
  ---

- # Kompress: Token Compression for Agentic Contexts

- **Kompress** is a ModernBERT-based token compressor trained specifically for agentic LLM contexts. It is a drop-in replacement for [LLMLingua-2](https://arxiv.org/abs/2403.12968) that achieves **2.3x better entity preservation** while being **2.3x smaller** and supporting **16x longer context windows**.

  ## Key Results

  | Metric | Kompress | LLMLingua-2 |
  |--------|----------|-------------|
  | Entity Preservation | **82.1%** | 36.0% |
@@ -44,23 +54,53 @@ model-index:
  | Model Size | **600 MB** | 1,400 MB |
  | Context Window | **8,192** | 512 |
  | Parameters | **149M** | 355M |
- | Trained on Agentic Data | Yes | No (meeting transcripts) |

  ## Why Kompress?

- LLMLingua-2 was trained on meeting transcripts (MeetingBank). When applied to agentic contexts (tool outputs, code, file paths, error traces), it:

- - **Destroys file paths**: `/Users/foo/.claude/tasks/abc-123` becomes `abc - 123 abc 123 123`
- - **Splits entity names**: Keeps "John" but drops "Smith"
  - **Expands instead of compressing**: 206% average ratio on agentic data
- - **Has no cross-chunk awareness**: 512-token chunks, no global context

- Kompress fixes all of these with:

- 1. **Agentic training data** — trained on real Claude Code sessions, SWE-bench trajectories, and function-calling traces
  2. **Dual-head architecture** — token classification + span importance CNN prevents entity splitting
  3. **ModernBERT backbone** — 8K context window, code-pretrained, RoPE attention

  ## Architecture

  ```
@@ -71,76 +111,53 @@ Input tokens → ModernBERT-base encoder (149M params, 8K context) →
  Final score = token_prob × (0.5 + 0.5 × span_score)
  ```

- The span head (~200K extra params) learns contiguous importance regions, preventing the "split entity" and "incoherent fragments" problems of pure token-level classifiers.

  ## Quick Start

  ```python
- # Install
  pip install kompress

- # Compress text
  from kompress.inference.pytorch_runner import KompressRunner

  runner = KompressRunner(checkpoint_path="chopratejas/kompress-base")
  result = runner.compress(
-     "The function parse_config in /Users/dev/app/config.py returned None "
-     "because the YAML file was malformed at line 42. Error: yaml.scanner."
-     "ScannerError: mapping values are not allowed here.",
      target_ratio=0.5,
  )
  print(result.compressed)
- # Keeps: parse_config, /Users/dev/app/config.py, None, YAML, line 42, ScannerError
  ```

  ## Use with Headroom

- Kompress is designed as a drop-in replacement for LLMLingua-2 in the [Headroom](https://github.com/chopratejas/headroom) compression pipeline:
-
  ```python
- from kompress.integration.transform import KompressCompressor, KompressConfig
  from kompress.integration.headroom_bridge import patch_content_router
-
- # Option 1: Use directly
- compressor = KompressCompressor(KompressConfig(
-     checkpoint_path="chopratejas/kompress-base"
- ))
- result = compressor.compress(long_tool_output)
-
- # Option 2: Patch existing Headroom pipeline
  from headroom.transforms import ContentRouter

  router = ContentRouter()
  patch_content_router(router)  # Swaps LLMLingua → Kompress
  ```

- ## Training Data
-
- Trained on 15,051 labeled examples from three diverse sources:
-
- | Source | Segments | Type |
- |--------|----------|------|
- | Claude Code sessions | ~10,000 | Real agentic coding traces |
- | Glaive Function Calling | ~3,000 | General tool-use across domains |
- | SWE-bench Trajectories | ~2,000 | Open-source coding agent traces |
-
- Labels generated via Claude Sonnet distillation with task-conditioned, entity-aware prompts.
-
  ## Training Details

  - **Base model**: [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) (149M params)
- - **Training**: 5 epochs, batch=32, lr=2e-5, AdamW, on NVIDIA A100
  - **Loss**: CrossEntropy (token head) + 0.3 × BCE (span head)
- - **Metrics**: F1=0.997, Precision=0.994, Recall=1.0

  ## License

- Apache 2.0 — use it however you want.

  ## Citation

  ```bibtex
  @software{kompress2025,
- title={Kompress: Token Compression for Agentic Contexts},
  author={Tejas Chopra},
  year={2025},
  url={https://huggingface.co/chopratejas/kompress-base},
@@ -149,6 +166,7 @@ Apache 2.0 — use it however you want.

  ## Links

  - [Headroom](https://github.com/chopratejas/headroom) — Context compression framework
- - [LLMLingua-2 paper](https://arxiv.org/abs/2403.12968) — The model Kompress replaces
  - [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base) — Base encoder
 
  - modernbert
  - llmlingua
  - headroom
+ - tool-outputs
+ - structured-data
  pipeline_tag: token-classification
  base_model: answerdotai/ModernBERT-base
  datasets:
  - SWE-bench/SWE-smith-trajectories
  - glaiveai/glaive-function-calling-v2
+ - nebius/SWE-agent-trajectories
+ - Agent-Ark/Toucan-1.5M
+ - tuandunghcmut/toolbench-v1
+ - JetBrains-Research/diff-xyz
+ - code_search_net
+ - b-mc2/sql-create-context
  model-index:
  - name: kompress-base
    results:

  name: Token Compression
  metrics:
  - type: f1
+ value: 0.9956
  name: F1
  - type: accuracy
+ value: 0.9926
  name: Accuracy
  ---

+ # Kompress: Token Compression for Structured Tool Outputs & Agentic Contexts

+ **Kompress** is a ModernBERT-based token compressor trained on **330K examples** of structured tool outputs — JSON API responses, git diffs, error logs, source code, CLI output, database results, and agentic conversation traces. It is a drop-in replacement for [LLMLingua-2](https://arxiv.org/abs/2403.12968).

  ## Key Results

+ ### On Agentic / Structured Data (our target domain)
+
  | Metric | Kompress | LLMLingua-2 |
  |--------|----------|-------------|
  | Entity Preservation | **82.1%** | 36.0% |

  | Model Size | **600 MB** | 1,400 MB |
  | Context Window | **8,192** | 512 |
  | Parameters | **149M** | 355M |
+
+ ### On LLMLingua-2's Benchmarks
+
+ | Dataset | Kompress | LLMLingua-2 | Note |
+ |---------|----------|-------------|------|
+ | MeetingBank | 46.3% | **57.4%** | LLMLingua-2's training domain |
+ | GSM8K | 97.8% | **98.9%** | Both excellent; LLMLingua-2 keeps 88% of tokens vs. Kompress's 50% |
+
+ ### Cross-Agent Generalization (Cursor IDE — never seen in training)
+
+ | Metric | Kompress | LLMLingua-2 |
+ |--------|----------|-------------|
+ | Entity Preservation | **91.1%** | 13.5% |
+ | Compression Ratio | **49.9%** | 85.8% |
 
  ## Why Kompress?

+ LLMLingua-2 was trained on meeting transcripts. When applied to structured tool outputs, it:

+ - **Destroys file paths**: `/Users/foo/.claude/tasks/abc-123` becomes `abc - 123 abc 123`
  - **Expands instead of compressing**: 206% average ratio on agentic data
+ - **Fragments UUIDs**: `4e149fea-6eb8-4feb` becomes `4e149fea - 6eb8 - 4feb`
+ - **Has no cross-chunk awareness**: 512-token limit

+ Kompress fixes these with:

+ 1. **Trained on structured data** — 330K examples of real tool outputs: JSON, diffs, logs, code, CLI output, SQL
  2. **Dual-head architecture** — token classification + span importance CNN prevents entity splitting
  3. **ModernBERT backbone** — 8K context window, code-pretrained, RoPE attention

+ ## Training Data (330K examples)
+
+ | Source | Examples | Type |
+ |--------|----------|------|
+ | Toucan-1.5M (MCP tool outputs) | ~80K | Real MCP server tool responses |
+ | SWE-agent trajectories | ~60K | Bash output, file reads, git diffs |
+ | ToolBench | ~50K | REST API JSON responses |
+ | Glaive Function Calling | ~40K | Function call/response pairs |
+ | CodeSearchNet | ~40K | Source code (Python, JS, Java, Go, Ruby, PHP) |
+ | JetBrains diff-xyz | ~10K | Git unified diffs |
+ | SQL create-context | ~10K | Database schemas + queries |
+ | Claude Code sessions | ~15K | Real agentic coding traces (API-labeled) |
+ | SWE-bench trajectories | ~15K | Open-source coding agent traces |
+ | Glaive + SWE (API-labeled) | ~10K | Function calling + coding (API-labeled) |
+
+ Labeling: heuristic rules for structured data (JSON → keep keys, diffs → keep +/− lines, logs → keep error lines), plus Claude Sonnet distillation for natural-language segments.
+
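The rule-based side of that labeling scheme can be sketched in a few lines. This is a toy illustration under assumed predicates (key-bearing JSON lines, changed diff lines, error-matching log lines), not the actual labeler used to build the 330K-example set:

```python
import re

def heuristic_keep_lines(text: str, kind: str) -> list[str]:
    """Toy version of rule-based labeling: JSON -> keep key-bearing lines,
    diffs -> keep changed lines, logs -> keep error lines. Illustrative only;
    the real rules behind the dataset are more involved."""
    lines = text.splitlines()
    if kind == "json":
        # keep lines that carry an object key
        return [l for l in lines if re.search(r'"\w+"\s*:', l)]
    if kind == "diff":
        # keep +/- lines, skipping the file headers
        return [l for l in lines
                if l.startswith(("+", "-")) and not l.startswith(("+++", "---"))]
    if kind == "log":
        # keep anything that looks like an error
        return [l for l in lines if re.search(r"error|exception|traceback", l, re.I)]
    return lines

diff = "--- a/x.py\n+++ b/x.py\n context\n+added\n-removed"
print(heuristic_keep_lines(diff, "diff"))  # ['+added', '-removed']
```

Lines selected this way become the positive (keep) token labels; everything else is labeled drop.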

  ## Architecture

  ```

  Final score = token_prob × (0.5 + 0.5 × span_score)
  ```

+ The span head (~200K extra params) learns contiguous importance regions, preventing entity splitting and maintaining coherence.
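The scoring rule above is simple enough to write out directly; a minimal NumPy sketch (array shapes are assumptions, not the model's actual tensor layout):

```python
import numpy as np

def final_scores(token_probs: np.ndarray, span_scores: np.ndarray) -> np.ndarray:
    """Final score = token_prob * (0.5 + 0.5 * span_score).

    The 0.5 floor means the span head modulates the token head but never
    vetoes it: a token in an unimportant span keeps half its probability.
    """
    return token_probs * (0.5 + 0.5 * span_scores)

token_probs = np.array([0.9, 0.9, 0.2])   # per-token keep probabilities
span_scores = np.array([1.0, 0.0, 1.0])   # span importance in [0, 1]
print(final_scores(token_probs, span_scores))
```

A span score of 1 leaves the token probability unchanged; a span score of 0 halves it, so a high-confidence token inside an important span always outranks a stray high-scoring token outside one.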
 
  ## Quick Start

  ```python
  # pip install kompress

  from kompress.inference.pytorch_runner import KompressRunner

  runner = KompressRunner(checkpoint_path="chopratejas/kompress-base")
  result = runner.compress(
+     '{"users": [{"id": 1, "name": "Alice", "email": "alice@example.com"}, '
+     '{"id": 2, "name": "Bob", "email": "bob@example.com"}, '
+     '{"id": 3, "name": "Charlie", "email": "charlie@example.com"}]}',
      target_ratio=0.5,
  )
  print(result.compressed)
+ # Keeps keys, structure, and unique values; discards repetitive patterns
  ```
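KompressRunner's internals are not shown here, but the usual way a token classifier honors a `target_ratio` is to keep the top-scoring fraction of tokens in their original order. A sketch of that selection step (hypothetical helper, not the library's API):

```python
def select_tokens(tokens: list[str], scores: list[float],
                  target_ratio: float = 0.5) -> list[str]:
    """Keep the top `target_ratio` fraction of tokens by score,
    preserving original order. Sketch only, not KompressRunner."""
    k = max(1, round(len(tokens) * target_ratio))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    keep = set(top)
    return [t for i, t in enumerate(tokens) if i in keep]

tokens = ["the", "ScannerError", "was", "raised", "at", "line", "42"]
scores = [0.10, 0.99, 0.20, 0.60, 0.15, 0.90, 0.95]
print(select_tokens(tokens, scores, 0.5))  # ['ScannerError', 'raised', 'line', '42']
```

Selecting by rank rather than by a fixed threshold is what makes the output length track the requested ratio regardless of the score distribution.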

  ## Use with Headroom

  ```python
  from kompress.integration.headroom_bridge import patch_content_router
  from headroom.transforms import ContentRouter
+
  router = ContentRouter()
  patch_content_router(router)  # Swaps LLMLingua → Kompress
  ```

  ## Training Details

  - **Base model**: [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) (149M params)
+ - **Training**: 3 epochs, batch=64, lr=2e-5, AdamW + torch.compile, on an NVIDIA H100
  - **Loss**: CrossEntropy (token head) + 0.3 × BCE (span head)
+ - **Final metrics**: F1=0.9956, Precision=0.9959, Recall=0.9953, train loss=0.068
+ - **Training time**: 2h 39m on an H100 (330K examples, 3 epochs)
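The loss line above combines the two heads; a NumPy sketch of that objective, under assumed head shapes (token logits over {drop, keep} per token, one span logit per token):

```python
import numpy as np

def kompress_loss(token_logits, token_labels, span_logits, span_labels,
                  span_weight=0.3):
    """CrossEntropy (token head) + 0.3 * BCE (span head), as listed above.
    Shapes assumed: token_logits (T, 2), token_labels (T,),
    span_logits (T,), span_labels (T,)."""
    # softmax cross-entropy over {drop, keep}
    z = token_logits - token_logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(token_labels)), token_labels].mean()
    # binary cross-entropy with logits for span importance
    p = 1.0 / (1.0 + np.exp(-span_logits))
    bce = -(span_labels * np.log(p) + (1 - span_labels) * np.log(1 - p)).mean()
    return ce + span_weight * bce
```

The 0.3 weight keeps the span head as a regularizer on contiguity rather than letting it dominate the per-token objective.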
 
  ## License

+ Apache 2.0

  ## Citation

  ```bibtex
  @software{kompress2025,
+   title={Kompress: Token Compression for Structured Tool Outputs and Agentic Contexts},
    author={Tejas Chopra},
    year={2025},
    url={https://huggingface.co/chopratejas/kompress-base},
  }
  ```

  ## Links

+ - [GitHub](https://github.com/chopratejas/kompress) — Source code, training pipeline, eval scripts
  - [Headroom](https://github.com/chopratejas/headroom) — Context compression framework
+ - [LLMLingua-2](https://arxiv.org/abs/2403.12968) — The model Kompress replaces
  - [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base) — Base encoder
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:1b95ef3ac2d846544939888b143f34f30fa7daf9623a2f1ed4c050f98ecc9c31
+ oid sha256:48f6af5958adc710a7758c4a6920aa7811f41fd063d299bd09ee445d5982c4d7
  size 600015548
training_args.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:a38655a76ccc51a01ef7d311276d42cfa6e09bbcd0b1bdbe6318161bbdb9b26f
+ oid sha256:8a6e6c606fb5a5649d8d2dcd06034c2fbab23960b8cf598bd93ff80309a26601
  size 5201