---
base_model: Qwen/Qwen3.5-2B
datasets:
- KRLabsOrg/tool-output-extraction-swebench
language:
- en
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- code
- tool-output
- pruning
- coding-agents
- extraction
thumbnail: https://raw.githubusercontent.com/KRLabsOrg/squeez/main/assets/squeez_mascot.png
---

<p align="center">
  <img src="https://raw.githubusercontent.com/KRLabsOrg/squeez/main/assets/squeez_mascot.png" alt="Squeez mascot" width="180">
</p>

# Squeez-2B

**Squeez-2B** is a 2B-parameter model fine-tuned from Qwen 3.5 2B for task-conditioned tool-output pruning in coding agents. Given a focused query and one raw tool observation, it extracts the smallest verbatim evidence block the agent should inspect next, removing **92%** of input tokens while retaining **0.86 recall**.

```
Tool output (500 lines) → Squeez → Relevant lines (30 lines) → Agent context
```

- Outperforms zero-shot Qwen 3.5 35B A3B by **11 recall points** (0.86 vs. 0.75)
- Returns verbatim lines only (no rewriting or summarization)
- Works as a CLI pipe, a Python library, or behind a vLLM server
- Trained on **27 tool types** from real SWE-bench workflows and synthetic multi-ecosystem outputs

**Resources:** [Paper](https://arxiv.org/abs/2604.04979) | [Dataset](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench) | [Code & CLI](https://github.com/KRLabsOrg/squeez) | [Blog post](https://huggingface.co/blog/KRLabsOrg/squeez)

## Results

Evaluated on 618 manually curated held-out examples spanning 27 tool types.

| Model | Prec. | Recall | F1 | Compression |
|-------|-------|--------|-----|-------------|
| **Squeez-2B** | **0.80** | **0.86** | **0.80** | 0.92 |
| Qwen 3.5 35B A3B (zero-shot) | 0.74 | 0.75 | 0.73 | 0.92 |
| Kimi K2 (zero-shot) | 0.61 | 0.53 | 0.68 | 0.94 |
| Qwen 3.5 2B (untrained) | 0.42 | 0.53 | 0.55 | 0.82 |

The fine-tuned 2B model is also the most precise system in the comparison, indicating it has learned a tool-specific extraction policy rather than relying on generic instruction following.
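
The card does not spell out how the columns above are computed. The following is only a minimal sketch, assuming line-level set metrics and compression defined as the fraction of input tokens removed. The function names and the convention for empty predictions are illustrative and are not taken from the paper or the `squeez` code.

```python
def line_metrics(predicted: list[str], gold: list[str]) -> dict[str, float]:
    """Line-level precision, recall and F1 over sets of kept lines."""
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set and not gold_set:
        # Assumption: an empty prediction for a negative example counts as perfect.
        return {"precision": 1.0, "recall": 1.0, "f1": 1.0}
    overlap = len(pred_set & gold_set)
    precision = overlap / len(pred_set) if pred_set else 0.0
    recall = overlap / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def compression(kept_tokens: int, input_tokens: int) -> float:
    """Fraction of input tokens removed by pruning."""
    return 1.0 - kept_tokens / input_tokens if input_tokens else 0.0


# Hypothetical example: one correct line kept plus one line of noise.
scores = line_metrics(["FAILED a", "PASSED b"], ["FAILED a"])
print(scores["precision"], scores["recall"])
print(round(compression(kept_tokens=80, input_tokens=1000), 2))
```

Under this definition, a compression of 0.92 means that 8% of the input tokens reach the agent.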

### Qualitative patterns

| Pattern | Example | Squeez-2B | Baseline failure |
|---------|---------|-----------|-----------------|
| Precise selection | `git_log`, 21 lines — find one commit | Selects the single correct entry | Qwen 35B picks a plausible but wrong commit |
| Failure-block extraction | Service log, 176 lines — two similar TLS errors | Returns the correct 5-line block | Qwen 35B picks the wrong TLS error (different timestamp) |
| Correct empty prediction | `docker_logs`, 316 lines — no matching evidence | Returns empty output | Qwen 35B generates "No relevant lines found..." |
| Adjacent over-selection | Build output, 110 lines — Dockerfile error | Finds the right error + nearby noise | Qwen 35B misses the Dockerfile error entirely |

On the 59 negative examples in the test set, Squeez-2B correctly returns empty output 80% of the time. Qwen 35B returns empty only 7% of the time.

## Quick Start

### CLI (recommended)

```bash
pip install squeez

# With vLLM server
vllm serve KRLabsOrg/squeez-2b --dtype bfloat16 --max-model-len 16384
export SQUEEZ_SERVER_URL=http://localhost:8000/v1

pytest -q 2>&1 | squeez "find the failure block"
git log --oneline -50 | squeez "find the commit that changed CSRF handling"
cat src/auth/middleware.py | squeez "find the referer validation logic"
```

### Python API

```python
from squeez.inference.extractor import ToolOutputExtractor

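# Option 1: send requests to a running vLLM server
extractor = ToolOutputExtractor(base_url="http://localhost:8000/v1")

# Option 2: load the model locally instead (use one option, not both)
# extractor = ToolOutputExtractor(model_path="KRLabsOrg/squeez-2b")

# Raw tool output to prune (sample data for illustration)
raw_output = (
    "PASSED tests/test_login.py::test_valid_credentials\n"
    "FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401\n"
    "PASSED tests/test_login.py::test_logout\n"
)

filtered = extractor.extract(
    task="Find the failing test block",
    tool_output=raw_output,
)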
```

### With transformers directly

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "KRLabsOrg/squeez-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": (
        "You prune verbose tool output for a coding agent. "
        "Given a focused extraction query and one tool output, return only the "
        "smallest verbatim evidence block(s) the agent should read next. "
        "Return the kept text inside <relevant_lines> tags. "
        "Do not rewrite, summarize, or invent lines."
    )},
    {"role": "user", "content": (
        "<query>\nFind the failing authentication test\n</query>\n"
        "<tool_output>\n"
        "PASSED tests/test_login.py::test_valid_credentials\n"
        "FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401\n"
        "PASSED tests/test_login.py::test_logout\n"
        "</tool_output>"
    )},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.1, do_sample=True)

response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
# <relevant_lines>
# FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401
# </relevant_lines>
```
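
The same chat messages can also be sent to a vLLM server without the `squeez` package, since vLLM exposes an OpenAI-compatible `/v1/chat/completions` endpoint. The sketch below uses only the standard library. The helper names `build_request` and `extract_via_server` are hypothetical, and the sampling settings simply mirror the transformers example above rather than any documented server defaults.

```python
import json
import urllib.request

# System prompt copied from the transformers example in this card.
SYSTEM_PROMPT = (
    "You prune verbose tool output for a coding agent. "
    "Given a focused extraction query and one tool output, return only the "
    "smallest verbatim evidence block(s) the agent should read next. "
    "Return the kept text inside <relevant_lines> tags. "
    "Do not rewrite, summarize, or invent lines."
)


def build_request(task: str, tool_output: str, model: str = "KRLabsOrg/squeez-2b") -> dict:
    """Build an OpenAI-style chat payload using the card's message layout."""
    user = f"<query>\n{task}\n</query>\n<tool_output>\n{tool_output}\n</tool_output>"
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user},
        ],
        "temperature": 0.1,  # assumption: same value as the local example
        "max_tokens": 512,
    }


def extract_via_server(task: str, tool_output: str,
                       base_url: str = "http://localhost:8000/v1") -> str:
    """POST the payload to a running server and return the raw model text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_request(task, tool_output)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `extract_via_server(...)` requires the server from the CLI section to be running. `build_request(...)` can be inspected on its own to check the prompt layout.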

## Input/Output Format

**Input** — Chat messages with system prompt:
- System: extraction instructions (see above)
- User: `<query>{task}</query>\n<tool_output>{raw_output}</tool_output>`

**Output** — Verbatim lines in XML tags:
```
<relevant_lines>
{only the lines that matter, copied verbatim}
</relevant_lines>
```
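
A caller usually needs the kept lines as a list and may want to confirm that they really are verbatim. The following is a minimal sketch of both steps. The function names are hypothetical, and treating a response without tags as "nothing relevant" is an assumption rather than documented behavior.

```python
import re

# Matches the body between the tags, tolerating a newline after the opening
# tag and before the closing tag.
_BLOCK = re.compile(r"<relevant_lines>\n?(.*?)\n?</relevant_lines>", re.DOTALL)


def parse_relevant_lines(response: str) -> list[str]:
    """Return the kept lines, or an empty list if no tagged block is found."""
    match = _BLOCK.search(response)
    if match is None:
        return []  # assumption: missing tags are treated as an empty prediction
    return [line for line in match.group(1).splitlines() if line.strip()]


def non_verbatim_lines(kept: list[str], tool_output: str) -> list[str]:
    """Return kept lines that do not appear verbatim in the original output."""
    source = set(tool_output.splitlines())
    return [line for line in kept if line not in source]


tool_output = "PASSED tests/test_a.py::test_ok\nFAILED tests/test_a.py::test_bad"
response = "<relevant_lines>\nFAILED tests/test_a.py::test_bad\n</relevant_lines>"
kept = parse_relevant_lines(response)
print(kept)
print(non_verbatim_lines(kept, tool_output))
```

An agent could fall back to the unfiltered tool output whenever `non_verbatim_lines` returns anything, although that policy is a design choice and not part of the model.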

## Supported Tool Types (27)

**SWE-bench derived (14):** `read_file` | `grep` | `git_log` | `git_blame` | `git_diff` | `test_output` | `python` | `curl` | `pip_install` | `ls` | `lint_output` | `build_output` | `type_check` | `coverage`

**Synthetic multi-ecosystem (13):** `npm_build` | `tsc` | `npm_install` | `docker_logs` | `docker_build` | `make_cmake` | `kubectl` | `cargo_build` | `go_build` | `mvn_gradle` | `terraform` | `mypy_pyright` | `eslint`

## Training Details

| | |
|---|---|
| **Base model** | Qwen/Qwen3.5-2B |
| **Method** | LoRA (r=16, alpha=32) via Unsloth |
| **Training data** | 10,508 examples (SWE-bench + synthetic) |
| **Epochs** | 3 |
| **Max sequence length** | 20,000 tokens |
| **Learning rate** | 2e-4 |
| **Batch size** | 8 (32 effective with 4x gradient accumulation) |
| **Hardware** | Single NVIDIA A100 80GB |
| **Dataset** | [KRLabsOrg/tool-output-extraction-swebench](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench) |
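
As a rough cross-check of the table, the effective batch size and an approximate optimizer step count can be derived from the listed values. This is back-of-the-envelope arithmetic only. It assumes a single device and that the final partial batch of each epoch still counts as a step, neither of which is stated in the card beyond the hardware row.

```python
import math


def training_steps(examples: int, batch_size: int, grad_accum: int, epochs: int) -> tuple[int, int]:
    """Return (effective batch size, total optimizer steps)."""
    effective_batch = batch_size * grad_accum
    # Assumption: the last, smaller batch of each epoch is not dropped.
    steps_per_epoch = math.ceil(examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


effective, total = training_steps(examples=10_508, batch_size=8, grad_accum=4, epochs=3)
print(effective, total)
```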

## Usage with Coding Agents

Add to your `CLAUDE.md` or agent system prompt:

```
When you invoke a shell command, pipe it through `squeez` and describe what you need.
Examples:
- bun test 2>&1 | squeez "did the tests pass?"
- git log --oneline -50 | squeez "find the commit that broke CSRF"
- cat src/auth/middleware.py | squeez "find the referer validation logic"
```
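
For agents that launch commands from Python rather than from a shell prompt, the same pipe can be assembled programmatically. The helpers below are a hypothetical sketch: `squeez_command` and `run_squeezed` are not part of the `squeez` package, and they assume that the `squeez` CLI is on `PATH` and that `SQUEEZ_SERVER_URL` is set as in the Quick Start.

```python
import shlex
import subprocess


def squeez_command(command: str, query: str) -> str:
    """Build the shell pipeline `<command> 2>&1 | squeez '<query>'`."""
    # shlex.quote keeps the query a single shell argument even if it contains
    # spaces or quotes. `command` is passed through unchanged, so it must
    # come from a trusted source.
    return f"{command} 2>&1 | squeez {shlex.quote(query)}"


def run_squeezed(command: str, query: str) -> str:
    """Run the pipeline and return whatever squeez prints to stdout."""
    result = subprocess.run(
        squeez_command(command, query),
        shell=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


print(squeez_command("git log --oneline -50", "find the commit that broke CSRF"))
```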

## Limitations

- Best on software engineering tool output; not designed for general-purpose summarization
- Synthetic data was generated by `openai/gpt-oss-120b`, so it may not fully reflect real-world distributions for all ecosystems
- Evaluates single tool observations, not full agent trajectories
- Max input: 20,000 tokens (training length); can be extended at serving time
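
For tool output that exceeds the input limit, one possible workaround is to split it on line boundaries and prune each chunk separately. This is a sketch of that idea only. It is not something the `squeez` CLI or the paper is documented to do, and the token counter is a caller-supplied assumption (for example, a wrapper around the model's tokenizer).

```python
from typing import Callable, Iterator


def chunk_lines(tool_output: str, max_tokens: int,
                count_tokens: Callable[[str], int]) -> Iterator[str]:
    """Yield consecutive groups of lines whose token counts fit the budget."""
    chunk: list[str] = []
    used = 0
    for line in tool_output.splitlines():
        cost = count_tokens(line)
        if chunk and used + cost > max_tokens:
            yield "\n".join(chunk)
            chunk, used = [], 0
        # A single line larger than the budget still becomes its own chunk.
        chunk.append(line)
        used += cost
    if chunk:
        yield "\n".join(chunk)


# Crude stand-in for a tokenizer: count whitespace-separated words.
word_count = lambda text: len(text.split())
print(list(chunk_lines("a b\nc d\ne", max_tokens=4, count_tokens=word_count)))
```

Splitting can separate lines that belong to the same evidence block, so results near chunk boundaries deserve extra scrutiny.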

## License

Apache 2.0

## Citation

```bibtex
@misc{kovács2026squeeztaskconditionedtooloutputpruning,
      title={Squeez: Task-Conditioned Tool-Output Pruning for Coding Agents}, 
      author={Ádám Kovács},
      year={2026},
      eprint={2604.04979},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2604.04979}, 
}
```