	---
	base_model: Qwen/Qwen3.5-2B
	datasets:
	- KRLabsOrg/tool-output-extraction-swebench
	language:
	- en
	license: apache-2.0
	pipeline_tag: text-generation
	library_name: transformers
	tags:
	- code
	- tool-output
	- pruning
	- coding-agents
	- extraction
	thumbnail: https://raw.githubusercontent.com/KRLabsOrg/squeez/main/assets/squeez_mascot.png
	---

	<p align="center">
	<img src="https://raw.githubusercontent.com/KRLabsOrg/squeez/main/assets/squeez_mascot.png" alt="Squeez mascot" width="180">
	</p>

	# Squeez-2B

	Squeez-2B is a 2B-parameter model fine-tuned from Qwen 3.5 2B for task-conditioned tool-output pruning in coding agents. Given a focused query and one raw tool observation, it extracts the smallest verbatim evidence block the agent should inspect next. On the held-out test set it removes 92% of input tokens while retaining 0.86 recall.

	```
	Tool output (500 lines) → Squeez → Relevant lines (30 lines) → Agent context
	```

	- Outperforms zero-shot Qwen 3.5 35B A3B by +11 recall points
	- Returns verbatim lines only (no rewriting or summarization)
	- Works as CLI pipe, Python library, or vLLM server
	- Trained on 27 tool types from real SWE-bench workflows and synthetic multi-ecosystem outputs

	Resources: [Paper](https://arxiv.org/abs/2604.04979) | [Dataset](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench) | [Code & CLI](https://github.com/KRLabsOrg/squeez) | [Blog post](https://huggingface.co/blog/KRLabsOrg/squeez)

	## Results

	Evaluated on 618 manually curated held-out examples spanning 27 tool types.

	| Model | Precision | Recall | F1 | Compression |
	|-------|-----------|--------|----|-------------|
	| Squeez-2B | 0.80 | 0.86 | 0.80 | 0.92 |
	| Qwen 3.5 35B A3B (zero-shot) | 0.74 | 0.75 | 0.73 | 0.92 |
	| Kimi K2 (zero-shot) | 0.61 | 0.53 | 0.68 | 0.94 |
	| Qwen 3.5 2B (untrained) | 0.42 | 0.53 | 0.55 | 0.82 |

	The fine-tuned 2B model is also the most precise system in the comparison, indicating it has learned a tool-specific extraction policy rather than relying on generic instruction following.
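
The table does not spell out how the scores are computed. As a rough illustration only, the sketch below assumes line-level matching on stripped lines and uses whitespace-separated words as a stand-in for model tokens; the function name `score_extraction` and the handling of empty predictions are assumptions, and the paper's exact definitions and averaging may differ.

```python
from collections import Counter


def score_extraction(predicted: str, gold: str, tool_output: str) -> dict:
    """Score one example with line-level overlap (an assumed metric definition)."""
    pred_lines = [ln.strip() for ln in predicted.splitlines() if ln.strip()]
    gold_lines = [ln.strip() for ln in gold.splitlines() if ln.strip()]

    # Multiset intersection, so a repeated line only counts as often as it appears in both.
    overlap = sum((Counter(pred_lines) & Counter(gold_lines)).values())

    if not pred_lines and not gold_lines:
        # Assumption: an empty prediction on a negative example counts as fully correct.
        precision = recall = f1 = 1.0
    else:
        precision = overlap / len(pred_lines) if pred_lines else 0.0
        recall = overlap / len(gold_lines) if gold_lines else 0.0
        f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0

    # Whitespace-separated words are a crude stand-in for tokenizer tokens.
    in_tokens = len(tool_output.split())
    out_tokens = len(predicted.split())
    compression = 1 - out_tokens / in_tokens if in_tokens else 0.0

    return {"precision": precision, "recall": recall, "f1": f1, "compression": compression}


example = score_extraction(
    predicted="c d\ne f",
    gold="c d",
    tool_output="a b\nc d\ne f\ng h",
)
```

Here one of the two predicted lines is correct and the single gold line is found, so precision is lower than recall, and half of the input words are removed.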

	### Qualitative patterns

	| Pattern | Example | Squeez-2B | Baseline failure |
	|---------|---------|-----------|------------------|
	| Precise selection | `git_log`, 21 lines: find one commit | Selects the single correct entry | Qwen 35B picks a plausible but wrong commit |
	| Failure-block extraction | Service log, 176 lines: two similar TLS errors | Returns the correct 5-line block | Qwen 35B picks the wrong TLS error (different timestamp) |
	| Correct empty prediction | `docker_logs`, 316 lines: no matching evidence | Returns empty output | Qwen 35B generates "No relevant lines found..." |
	| Adjacent over-selection | Build output, 110 lines: Dockerfile error | Finds the right error + nearby noise | Qwen 35B misses the Dockerfile error entirely |

	On the 59 negative examples in the test set, Squeez-2B correctly returns empty output 80% of the time. Qwen 35B returns empty only 7% of the time.
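
How "empty output" is judged is not stated here. One plausible reading, sketched below, treats a response as empty only when nothing is left after removing the `<relevant_lines>` tags, so a sentence such as "No relevant lines found..." counts as non-empty. The helper names are hypothetical.

```python
def is_empty_prediction(response: str) -> bool:
    """True when the response carries no kept lines, with or without the tags."""
    text = response.strip()
    open_tag, close_tag = "<relevant_lines>", "</relevant_lines>"
    if text.startswith(open_tag) and text.endswith(close_tag):
        text = text[len(open_tag):-len(close_tag)].strip()
    return text == ""


def empty_rate(responses: list[str]) -> float:
    """Fraction of responses that are empty under the rule above."""
    return sum(is_empty_prediction(r) for r in responses) / len(responses)


negatives = [
    "",
    "<relevant_lines>\n</relevant_lines>",
    "No relevant lines found in the output.",
    "<relevant_lines>\nERROR tls handshake failed\n</relevant_lines>",
]
rate = empty_rate(negatives)
```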

	## Quick Start

	### CLI (recommended)

	```bash
	pip install squeez

	# With vLLM server
	vllm serve KRLabsOrg/squeez-2b --dtype bfloat16 --max-model-len 16384
	export SQUEEZ_SERVER_URL=http://localhost:8000/v1

	pytest -q 2>&1 | squeez "find the failure block"
	git log --oneline -50 | squeez "find the commit that changed CSRF handling"
	cat src/auth/middleware.py | squeez "find the referer validation logic"
	```
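
If you prefer to call the vLLM server directly instead of going through the CLI, a request to its OpenAI-compatible `/v1/chat/completions` endpoint could look roughly like this. The system prompt, message layout, and sampling values are copied from the transformers example later in this README. The helper names `build_payload` and `extract_via_server` are hypothetical, and this is not necessarily what the `squeez` CLI sends.

```python
import json
import urllib.request

SYSTEM_PROMPT = (
    "You prune verbose tool output for a coding agent. "
    "Given a focused extraction query and one tool output, return only the "
    "smallest verbatim evidence block(s) the agent should read next. "
    "Return the kept text inside <relevant_lines> tags. "
    "Do not rewrite, summarize, or invent lines."
)


def build_payload(query: str, tool_output: str, model: str = "KRLabsOrg/squeez-2b") -> dict:
    """Build the chat-completions request body for one query and one tool output."""
    body = tool_output.rstrip("\n")  # avoid a blank line before the closing tag
    user = f"<query>\n{query}\n</query>\n<tool_output>\n{body}\n</tool_output>"
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user},
        ],
        "max_tokens": 512,
        "temperature": 0.1,
    }


def extract_via_server(query: str, tool_output: str,
                       base_url: str = "http://localhost:8000/v1") -> str:
    """POST the payload to a running server and return the raw model response."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(query, tool_output)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With the server from the block above running, `extract_via_server("find the failure block", raw_output)` would return the model's `<relevant_lines>` response as a string.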

	### Python API

	```python
	from squeez.inference.extractor import ToolOutputExtractor

	# vLLM server
	extractor = ToolOutputExtractor(base_url="http://localhost:8000/v1")

	# Or local
	extractor = ToolOutputExtractor(model_path="KRLabsOrg/squeez-2b")

	filtered = extractor.extract(
	    task="Find the failing test block",
	    tool_output=raw_output,  # raw text captured from the tool call
	)
	```
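
The snippet above leaves `raw_output` undefined. One way to obtain it is to capture a command's combined stdout and stderr, mirroring the `2>&1` used in the CLI examples. The helper name `capture_tool_output` is hypothetical, and the command below is a stand-in so the sketch runs anywhere.

```python
import subprocess
import sys


def capture_tool_output(command: list[str], timeout: float = 120.0) -> str:
    """Run a command and return stdout and stderr merged into one string."""
    completed = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # same effect as `2>&1` in the shell examples
        text=True,
        timeout=timeout,
        check=False,  # failing commands are often the interesting ones
    )
    return completed.stdout


# Stand-in command; replace with e.g. ["pytest", "-q"] in a real project.
raw_output = capture_tool_output(
    [sys.executable, "-c",
     "import sys; print('PASSED a'); print('FAILED b', file=sys.stderr)"]
)
```

The resulting string can then be passed as `tool_output=raw_output` to `extractor.extract(...)`.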

	### With transformers directly

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	model_name = "KRLabsOrg/squeez-2b"
	tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained(
	    model_name,
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	    trust_remote_code=True,
	)

	messages = [
	    {"role": "system", "content": (
	        "You prune verbose tool output for a coding agent. "
	        "Given a focused extraction query and one tool output, return only the "
	        "smallest verbatim evidence block(s) the agent should read next. "
	        "Return the kept text inside <relevant_lines> tags. "
	        "Do not rewrite, summarize, or invent lines."
	    )},
	    {"role": "user", "content": (
	        "<query>\nFind the failing authentication test\n</query>\n"
	        "<tool_output>\n"
	        "PASSED tests/test_login.py::test_valid_credentials\n"
	        "FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401\n"
	        "PASSED tests/test_login.py::test_logout\n"
	        "</tool_output>"
	    )},
	]

	prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

	with torch.no_grad():
	    outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.1, do_sample=True)

	response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
	print(response)
	# <relevant_lines>
	# FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401
	# </relevant_lines>
	```

	## Input/Output Format

	Input — Chat messages with system prompt:
	- System: extraction instructions (see above)
	- User: `<query>{task}</query>\n<tool_output>{raw_output}</tool_output>`

	Output — Verbatim lines in XML tags:
	```
	<relevant_lines>
	{only the lines that matter, copied verbatim}
	</relevant_lines>
	```
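
On the consuming side, the response can be unpacked with a small parser. The sketch below also checks that every kept line really occurs in the original tool output, which is a cheap guard for the verbatim-only contract. Both function names are hypothetical and not part of the `squeez` package.

```python
import re

_BLOCK = re.compile(r"<relevant_lines>\n?(.*?)\n?</relevant_lines>", re.DOTALL)


def parse_relevant_lines(response: str) -> list[str]:
    """Return the kept lines, or [] when the block is missing or empty."""
    match = _BLOCK.search(response)
    if match is None:
        return []
    return [line for line in match.group(1).splitlines() if line.strip()]


def non_verbatim_lines(kept: list[str], tool_output: str) -> list[str]:
    """Kept lines that do not appear as a full line in the original tool output."""
    source = {line.rstrip() for line in tool_output.splitlines()}
    return [line for line in kept if line.rstrip() not in source]


tool_output = (
    "PASSED tests/test_login.py::test_valid_credentials\n"
    "FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401\n"
    "PASSED tests/test_login.py::test_logout\n"
)
response = (
    "<relevant_lines>\n"
    "FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401\n"
    "</relevant_lines>"
)
kept = parse_relevant_lines(response)
suspicious = non_verbatim_lines(kept, tool_output)
```

A non-empty `suspicious` list would suggest the model rewrote or invented a line, in which case an agent might fall back to the unpruned output.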

	## Supported Tool Types (27)

	SWE-bench derived (14): `read_file` | `grep` | `git_log` | `git_blame` | `git_diff` | `test_output` | `python` | `curl` | `pip_install` | `ls` | `lint_output` | `build_output` | `type_check` | `coverage`

	Synthetic multi-ecosystem (13): `npm_build` | `tsc` | `npm_install` | `docker_logs` | `docker_build` | `make_cmake` | `kubectl` | `cargo_build` | `go_build` | `mvn_gradle` | `terraform` | `mypy_pyright` | `eslint`

	## Training Details

	| Setting | Value |
	|---|---|
	| Base model | Qwen/Qwen3.5-2B |
	| Method | LoRA (r=16, alpha=32) via Unsloth |
	| Training data | 10,508 examples (SWE-bench + synthetic) |
	| Epochs | 3 |
	| Max sequence length | 20,000 tokens |
	| Learning rate | 2e-4 |
	| Batch size | 8 (32 effective with 4x gradient accumulation) |
	| Hardware | Single NVIDIA A100 80GB |
	| Dataset | [KRLabsOrg/tool-output-extraction-swebench](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench) |
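
For a sense of scale, the numbers in the table imply roughly the following optimizer-step count. This is an estimate under two assumptions not stated above: all 10,508 examples are used for training, and the last partial batch of each epoch is kept. Trainers differ in how they round, so the real count may be off by a few steps.

```python
import math

examples = 10_508        # training examples, from the table
batch_size = 8           # from the table (single GPU)
grad_accum_steps = 4     # from the table
epochs = 3               # from the table

effective_batch = batch_size * grad_accum_steps
steps_per_epoch = math.ceil(examples / effective_batch)  # assumes the partial batch is kept
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)  # 32 329 987
```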

	## Usage with Coding Agents

	Add to your `CLAUDE.md` or agent system prompt:

	```
	When you invoke a shell command, pipe it through `squeez` and describe what you need.
	Examples:
	- bun test 2>&1 | squeez "did the tests pass?"
	- git log --oneline -50 | squeez "find the commit that broke CSRF"
	- cat src/auth/middleware.py | squeez "find the referer validation logic"
	```

	## Limitations

	- Best on software engineering tool output; not designed for general-purpose summarization
	- Synthetic data generated by `openai/gpt-oss-120b` — may not fully reflect real-world distributions for all ecosystems
	- Evaluates single tool observations, not full agent trajectories
	- Max input: 20,000 tokens (training length); can be extended at serving time
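
For outputs that exceed the input limit, one possible workaround (not something this README prescribes) is to split the tool output at line boundaries and prune each chunk separately. The sketch below assumes a crude four-characters-per-token estimate. In practice you would pass a `count_tokens` function backed by the model tokenizer. Both helper names are hypothetical, and chunking loses evidence that spans chunk boundaries.

```python
from typing import Callable, Iterator


def approx_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: roughly four characters per token."""
    return max(1, len(text) // 4)


def split_by_budget(
    tool_output: str,
    max_tokens: int = 16_000,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> Iterator[str]:
    """Yield line-aligned chunks whose estimated token count stays within max_tokens.

    A single line larger than the budget is yielded on its own, still oversized.
    """
    chunk: list[str] = []
    used = 0
    for line in tool_output.splitlines(keepends=True):
        cost = count_tokens(line)
        if chunk and used + cost > max_tokens:
            yield "".join(chunk)
            chunk, used = [], 0
        chunk.append(line)
        used += cost
    if chunk:
        yield "".join(chunk)


text = "".join(f"line {i}\n" for i in range(10))
chunks = list(split_by_budget(text, max_tokens=3))
```

The default budget of 16,000 leaves headroom below the 20,000-token training length for the system prompt and query.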

	## License

	Apache 2.0

	## Citation

	```bibtex
	@misc{kovács2026squeeztaskconditionedtooloutputpruning,
	title={Squeez: Task-Conditioned Tool-Output Pruning for Coding Agents},
	author={Ádám Kovács},
	year={2026},
	eprint={2604.04979},
	archivePrefix={arXiv},
	primaryClass={cs.SE},
	url={https://arxiv.org/abs/2604.04979},
	}
	```