---
base_model: Qwen/Qwen3.5-2B
datasets:
- KRLabsOrg/tool-output-extraction-swebench
language:
- en
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- code
- tool-output
- pruning
- coding-agents
- extraction
thumbnail: https://raw.githubusercontent.com/KRLabsOrg/squeez/main/assets/squeez_mascot.png
---

<p align="center">
  <img src="https://raw.githubusercontent.com/KRLabsOrg/squeez/main/assets/squeez_mascot.png" alt="Squeez mascot" width="180">
</p>

# Squeez-2B

**Squeez-2B** is a 2B-parameter model fine-tuned from Qwen 3.5 2B for task-conditioned tool-output pruning in coding agents. Given a focused query and one raw tool observation, it extracts the smallest verbatim evidence block the agent should inspect next. On the held-out test set it removes **92%** of input tokens while retaining **0.86 recall**.

```
Tool output (500 lines) → Squeez → Relevant lines (30 lines) → Agent context
```

- Outperforms zero-shot Qwen 3.5 35B A3B by **+11 recall points**
- Returns verbatim lines only (no rewriting or summarization)
- Works as a CLI pipe, a Python library, or behind a vLLM server
- Trained on **27 tool types** from real SWE-bench workflows and synthetic multi-ecosystem outputs

**Resources:** [Paper](https://arxiv.org/abs/2604.04979) | [Dataset](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench) | [Code & CLI](https://github.com/KRLabsOrg/squeez) | [Blog post](https://huggingface.co/blog/KRLabsOrg/squeez)

## Results

Evaluated on 618 manually curated held-out examples spanning 27 tool types.

| Model | Prec. | Recall | F1 | Compression |
|-------|-------|--------|-----|-------------|
| **Squeez-2B** | **0.80** | **0.86** | **0.80** | 0.92 |
| Qwen 3.5 35B A3B (zero-shot) | 0.74 | 0.75 | 0.73 | 0.92 |
| Kimi K2 (zero-shot) | 0.61 | 0.53 | 0.68 | 0.94 |
| Qwen 3.5 2B (untrained) | 0.42 | 0.53 | 0.55 | 0.82 |

The fine-tuned 2B model is also the most precise system in the comparison, indicating it has learned a tool-specific extraction policy rather than relying on generic instruction following.
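
To make the table columns concrete, here is a minimal sketch of how line-level precision, recall, F1, and compression could be computed for one example. The exact metric definitions used in the paper are not stated in this card, so the choices below (exact line matching, ignoring blank lines, scoring a correct empty prediction as 1.0, and measuring compression in lines rather than tokens) are assumptions, and `line_metrics` is a hypothetical helper name.

```python
from collections import Counter


def line_metrics(predicted: str, gold: str, tool_output: str) -> dict:
    """Score one prediction against gold evidence at the line level (illustrative only)."""
    pred = [line for line in predicted.splitlines() if line.strip()]
    ref = [line for line in gold.splitlines() if line.strip()]
    total = [line for line in tool_output.splitlines() if line.strip()]

    # Multiset intersection: a line counts once per matching occurrence.
    overlap = sum((Counter(pred) & Counter(ref)).values())

    if not pred and not ref:
        # Assumption: returning nothing when nothing is relevant is a perfect answer.
        precision = recall = f1 = 1.0
    else:
        precision = overlap / len(pred) if pred else 0.0
        recall = overlap / len(ref) if ref else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0

    # Assumption: compression is the fraction of input lines removed.
    compression = 1 - len(pred) / len(total) if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "compression": compression}
```

Under these assumptions, a prediction that keeps the one gold line plus one extra line out of a four-line output gets precision 0.5, recall 1.0, and compression 0.5.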

### Qualitative patterns

| Pattern | Example | Squeez-2B | Baseline failure |
|---------|---------|-----------|-----------------|
| Precise selection | `git_log`, 21 lines — find one commit | Selects the single correct entry | Qwen 35B picks a plausible but wrong commit |
| Failure-block extraction | Service log, 176 lines — two similar TLS errors | Returns the correct 5-line block | Qwen 35B picks the wrong TLS error (different timestamp) |
| Correct empty prediction | `docker_logs`, 316 lines — no matching evidence | Returns empty output | Qwen 35B generates "No relevant lines found..." |
| Adjacent over-selection | Build output, 110 lines — Dockerfile error | Finds the right error + nearby noise | Qwen 35B misses the Dockerfile error entirely |

On the 59 negative examples in the test set, Squeez-2B correctly returns empty output 80% of the time. Qwen 35B returns empty only 7% of the time.

## Quick Start

### CLI (recommended)

```bash
pip install squeez

# With vLLM server
vllm serve KRLabsOrg/squeez-2b --dtype bfloat16 --max-model-len 16384
export SQUEEZ_SERVER_URL=http://localhost:8000/v1

pytest -q 2>&1 | squeez "find the failure block"
git log --oneline -50 | squeez "find the commit that changed CSRF handling"
cat src/auth/middleware.py | squeez "find the referer validation logic"
```

### Python API

```python
from squeez.inference.extractor import ToolOutputExtractor

# vLLM server
extractor = ToolOutputExtractor(base_url="http://localhost:8000/v1")

# Or local
extractor = ToolOutputExtractor(model_path="KRLabsOrg/squeez-2b")

filtered = extractor.extract(
    task="Find the failing test block",
    tool_output=raw_output,
)
```
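
If you would rather not depend on the `squeez` package, the vLLM server can also be queried through its OpenAI-compatible chat completions endpoint. The sketch below is not how `ToolOutputExtractor` is implemented; it is a standard-library illustration that assumes the server is reachable at `http://localhost:8000/v1`, that the model is served under the name `KRLabsOrg/squeez-2b`, and that the system prompt from the transformers example in this card is the right one to send. `build_chat_request` and `extract_via_server` are hypothetical helper names.

```python
import json
import urllib.request

# System prompt copied from the transformers example in this card.
SYSTEM_PROMPT = (
    "You prune verbose tool output for a coding agent. "
    "Given a focused extraction query and one tool output, return only the "
    "smallest verbatim evidence block(s) the agent should read next. "
    "Return the kept text inside <relevant_lines> tags. "
    "Do not rewrite, summarize, or invent lines."
)


def build_chat_request(task: str, tool_output: str, model: str = "KRLabsOrg/squeez-2b") -> dict:
    """Build an OpenAI-style chat completions payload (assumed request shape)."""
    user = f"<query>\n{task}\n</query>\n<tool_output>\n{tool_output}\n</tool_output>"
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user},
        ],
        "temperature": 0.1,
        "max_tokens": 512,
    }


def extract_via_server(task: str, tool_output: str, base_url: str = "http://localhost:8000/v1") -> str:
    """POST the payload to the server and return the raw model response text."""
    payload = json.dumps(build_chat_request(task, tool_output)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

The returned string still contains the `<relevant_lines>` wrapper, so callers need to strip the tags before handing the text to an agent.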

### With transformers directly

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "KRLabsOrg/squeez-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": (
        "You prune verbose tool output for a coding agent. "
        "Given a focused extraction query and one tool output, return only the "
        "smallest verbatim evidence block(s) the agent should read next. "
        "Return the kept text inside <relevant_lines> tags. "
        "Do not rewrite, summarize, or invent lines."
    )},
    {"role": "user", "content": (
        "<query>\nFind the failing authentication test\n</query>\n"
        "<tool_output>\n"
        "PASSED tests/test_login.py::test_valid_credentials\n"
        "FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401\n"
        "PASSED tests/test_login.py::test_logout\n"
        "</tool_output>"
    )},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.1, do_sample=True)

response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
# <relevant_lines>
# FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401
# </relevant_lines>
```

## Input/Output Format

**Input** — Chat messages with system prompt:
- System: extraction instructions (see above)
- User: `<query>{task}</query>\n<tool_output>{raw_output}</tool_output>`

**Output** — Verbatim lines in XML tags:
```
<relevant_lines>
{only the lines that matter, copied verbatim}
</relevant_lines>
```
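
A minimal sketch of working with this format from Python: one helper builds the user message, one pulls the kept lines out of the response, and one checks that every returned line really is verbatim. The helper names are hypothetical, and the newline placement inside the tags follows the transformers example in this card, which is an assumption about what the model expects.

```python
import re

_RELEVANT = re.compile(r"<relevant_lines>\n?(.*?)\n?</relevant_lines>", re.DOTALL)


def build_user_message(task: str, raw_output: str) -> str:
    """Wrap the query and tool output in the tags described above."""
    body = raw_output.rstrip("\n")
    return f"<query>\n{task}\n</query>\n<tool_output>\n{body}\n</tool_output>"


def parse_relevant_lines(response: str) -> list[str]:
    """Return the lines inside <relevant_lines>, or [] if the block is empty or missing."""
    match = _RELEVANT.search(response)
    if match is None:
        return []
    return [line for line in match.group(1).splitlines() if line.strip()]


def is_verbatim(kept: list[str], raw_output: str) -> bool:
    """True if every kept line appears unchanged somewhere in the original output."""
    original = set(raw_output.splitlines())
    return all(line in original for line in kept)
```

A verbatim check like this is a cheap guard before passing pruned output to an agent: if it fails, falling back to the raw tool output avoids acting on rewritten or invented lines.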

## Supported Tool Types (27)

**SWE-bench derived (14):** `read_file` | `grep` | `git_log` | `git_blame` | `git_diff` | `test_output` | `python` | `curl` | `pip_install` | `ls` | `lint_output` | `build_output` | `type_check` | `coverage`

**Synthetic multi-ecosystem (13):** `npm_build` | `tsc` | `npm_install` | `docker_logs` | `docker_build` | `make_cmake` | `kubectl` | `cargo_build` | `go_build` | `mvn_gradle` | `terraform` | `mypy_pyright` | `eslint`

## Training Details

| | |
|---|---|
| **Base model** | Qwen/Qwen3.5-2B |
| **Method** | LoRA (r=16, alpha=32) via Unsloth |
| **Training data** | 10,508 examples (SWE-bench + synthetic) |
| **Epochs** | 3 |
| **Max sequence length** | 20,000 tokens |
| **Learning rate** | 2e-4 |
| **Batch size** | 8 (32 effective with 4x gradient accumulation) |
| **Hardware** | Single NVIDIA A100 80GB |
| **Dataset** | [KRLabsOrg/tool-output-extraction-swebench](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench) |

## Usage with Coding Agents

Add to your `CLAUDE.md` or agent system prompt:

```
When you invoke a shell command, pipe it through `squeez` and describe what you need.
Examples:
- bun test 2>&1 | squeez "did the tests pass?"
- git log --oneline -50 | squeez "find the commit that broke CSRF"
- cat src/auth/middleware.py | squeez "find the referer validation logic"
```
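
For agents that invoke tools from Python rather than a shell, the same pipe can be sketched with `subprocess`. This assumes the `squeez` CLI is on `PATH` and, as in the examples above, reads the tool output from stdin and takes the query as its single argument. `run_and_squeeze` is a hypothetical helper, not part of the `squeez` package.

```python
import subprocess


def run_and_squeeze(command: list[str], query: str, squeez_cmd: tuple = ("squeez",)) -> str:
    """Run a command, merge stderr into stdout (like 2>&1), and pipe the result through squeez."""
    raw = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # same effect as `2>&1` in the shell examples
        text=True,
    )
    # The command's exit code is deliberately not checked: failing tests or builds
    # are exactly the outputs worth pruning.
    pruned = subprocess.run(
        [*squeez_cmd, query],
        input=raw.stdout,
        capture_output=True,
        text=True,
        check=True,  # a failure of squeez itself should surface as an error
    )
    return pruned.stdout
```

For example, `run_and_squeeze(["pytest", "-q"], "find the failure block")` mirrors the first CLI example in the Quick Start.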

## Limitations

- Best on software engineering tool output; not designed for general-purpose summarization
- Synthetic data generated by `openai/gpt-oss-120b` — may not fully reflect real-world distributions for all ecosystems
- Evaluates single tool observations, not full agent trajectories
- Max input: 20,000 tokens (training length); can be extended at serving time

## License

Apache 2.0

## Citation

```bibtex
@misc{kovács2026squeeztaskconditionedtooloutputpruning,
      title={Squeez: Task-Conditioned Tool-Output Pruning for Coding Agents}, 
      author={Ádám Kovács},
      year={2026},
      eprint={2604.04979},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2604.04979}, 
}
```