---
base_model: Qwen/Qwen3.5-2B
datasets:
- KRLabsOrg/tool-output-extraction-swebench
language:
- en
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- code
- tool-output
- pruning
- coding-agents
- extraction
thumbnail: https://raw.githubusercontent.com/KRLabsOrg/squeez/main/assets/squeez_mascot.png
---
<p align="center">
  <img src="https://raw.githubusercontent.com/KRLabsOrg/squeez/main/assets/squeez_mascot.png" alt="Squeez mascot" width="180">
</p>

# Squeez-2B

**Squeez-2B** is a 2B-parameter model fine-tuned from Qwen 3.5 2B for task-conditioned tool-output pruning in coding agents. Given a focused query and one raw tool observation, it extracts the smallest verbatim evidence block the agent should inspect next, removing **92%** of input tokens while retaining **0.86 recall**.

```
Tool output (500 lines) → Squeez → Relevant lines (30 lines) → Agent context
```

- Outperforms zero-shot Qwen 3.5 35B A3B by **+11 recall points**
- Returns verbatim lines only (no rewriting or summarization)
- Works as a CLI pipe, a Python library, or behind a vLLM server
- Trained on **27 tool types** from real SWE-bench workflows and synthetic multi-ecosystem outputs
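
Because the model is meant to return verbatim lines only, a caller can cheaply check that property before passing the result to an agent. The following is a minimal sketch, assuming the kept text is plain text with one kept line per row; `non_verbatim_lines` is a hypothetical helper, not part of the `squeez` package.

```python
def non_verbatim_lines(tool_output: str, kept: str) -> list[str]:
    """Return kept lines that do not appear verbatim in the tool output."""
    source = set(tool_output.splitlines())
    # Blank lines are ignored; every other kept line must match a source line exactly.
    return [line for line in kept.splitlines() if line.strip() and line not in source]


tool_output = "PASSED test_a\nFAILED test_b - AssertionError\nPASSED test_c"
print(non_verbatim_lines(tool_output, "FAILED test_b - AssertionError"))  # []
print(non_verbatim_lines(tool_output, "test_b failed"))  # ['test_b failed']
```

An empty result means every kept line was copied exactly; anything else was rewritten or invented and can be dropped or flagged.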
**Resources:** [Paper](https://arxiv.org/abs/2604.04979) | [Dataset](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench) | [Code & CLI](https://github.com/KRLabsOrg/squeez) | [Blog post](https://huggingface.co/blog/KRLabsOrg/squeez)

## Results

Evaluated on 618 manually curated held-out examples spanning 27 tool types. Compression is the fraction of input tokens removed.

| Model | Precision | Recall | F1 | Compression |
|-------|-----------|--------|-----|-------------|
| **Squeez-2B** | **0.80** | **0.86** | **0.80** | 0.92 |
| Qwen 3.5 35B A3B (zero-shot) | 0.74 | 0.75 | 0.73 | 0.92 |
| Kimi K2 (zero-shot) | 0.61 | 0.53 | 0.68 | 0.94 |
| Qwen 3.5 2B (untrained) | 0.42 | 0.53 | 0.55 | 0.82 |

The fine-tuned 2B model has the highest recall (0.86) and is also the most precise system in the comparison (0.80), which suggests it has learned a tool-specific extraction policy rather than relying on generic instruction following.
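
To make the columns above concrete, here is a sketch of how line-level precision, recall, F1, and compression could be computed for a single example. It assumes exact-match comparison on sets of lines, counts compression in lines rather than tokens, and scores a correct empty prediction as 1.0; the evaluation in the paper may define these differently, and `line_metrics` is a hypothetical name.

```python
def line_metrics(gold: list[str], predicted: list[str], tool_output: list[str]) -> dict[str, float]:
    """Set-based line-level precision, recall, F1, and compression for one example."""
    gold_set, pred_set = set(gold), set(predicted)
    if not gold_set and not pred_set:
        # Assumption: a correct empty prediction on a negative example scores 1.0.
        precision = recall = f1 = 1.0
    else:
        overlap = len(gold_set & pred_set)
        precision = overlap / len(pred_set) if pred_set else 0.0
        recall = overlap / len(gold_set) if gold_set else 0.0
        f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    # Assumption: compression is counted in lines here; the card reports tokens.
    compression = 1 - len(predicted) / len(tool_output) if tool_output else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "compression": compression}


tool_output = [f"line {i}" for i in range(10)]
scores = line_metrics(
    gold=["line 3", "line 4"],
    predicted=["line 3", "line 4", "line 5"],
    tool_output=tool_output,
)
print({name: round(value, 2) for name, value in scores.items()})
# {'precision': 0.67, 'recall': 1.0, 'f1': 0.8, 'compression': 0.7}
```

In this toy case the prediction keeps both gold lines plus one extra, so recall is perfect while precision drops.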
### Qualitative patterns

| Pattern | Example | Squeez-2B | Baseline failure |
|---------|---------|-----------|-----------------|
| Precise selection | `git_log`, 21 lines: find one commit | Selects the single correct entry | Qwen 35B picks a plausible but wrong commit |
| Failure-block extraction | Service log, 176 lines: two similar TLS errors | Returns the correct 5-line block | Qwen 35B picks the wrong TLS error (different timestamp) |
| Correct empty prediction | `docker_logs`, 316 lines: no matching evidence | Returns empty output | Qwen 35B generates "No relevant lines found..." |
| Adjacent over-selection | Build output, 110 lines: Dockerfile error | Finds the right error plus nearby noise | Qwen 35B misses the Dockerfile error entirely |

On the 59 negative examples in the test set, Squeez-2B correctly returns empty output 80% of the time, whereas Qwen 35B returns empty output only 7% of the time.
## Quick Start

### CLI (recommended)

```bash
pip install squeez

# Start a vLLM server (runs in the foreground, so use a separate terminal)
vllm serve KRLabsOrg/squeez-2b --dtype bfloat16 --max-model-len 16384

# Point the CLI at the server
export SQUEEZ_SERVER_URL=http://localhost:8000/v1

# Pipe any tool output through squeez together with a focused query
pytest -q 2>&1 | squeez "find the failure block"
git log --oneline -50 | squeez "find the commit that changed CSRF handling"
cat src/auth/middleware.py | squeez "find the referer validation logic"
```
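### Python API

```python
from squeez.inference.extractor import ToolOutputExtractor

# Option 1: talk to a running vLLM server
extractor = ToolOutputExtractor(base_url="http://localhost:8000/v1")

# Option 2: load the model locally instead
# extractor = ToolOutputExtractor(model_path="KRLabsOrg/squeez-2b")

# Example tool output; replace with the real output of your command
raw_output = (
    "PASSED tests/test_login.py::test_valid_credentials\n"
    "FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401\n"
)

filtered = extractor.extract(
    task="Find the failing test block",
    tool_output=raw_output,
)
```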
### With transformers directly

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "KRLabsOrg/squeez-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": (
        "You prune verbose tool output for a coding agent. "
        "Given a focused extraction query and one tool output, return only the "
        "smallest verbatim evidence block(s) the agent should read next. "
        "Return the kept text inside <relevant_lines> tags. "
        "Do not rewrite, summarize, or invent lines."
    )},
    {"role": "user", "content": (
        "<query>\nFind the failing authentication test\n</query>\n"
        "<tool_output>\n"
        "PASSED tests/test_login.py::test_valid_credentials\n"
        "FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401\n"
        "PASSED tests/test_login.py::test_logout\n"
        "</tool_output>"
    )},
]

# Render the chat template to text, then tokenize it
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.1, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
# <relevant_lines>
# FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401
# </relevant_lines>
```
## Input/Output Format

**Input:** chat messages with a system prompt:

- System: extraction instructions (see above)
- User: `<query>{task}</query>\n<tool_output>{raw_output}</tool_output>`

**Output:** verbatim lines in XML tags:

```
<relevant_lines>
{only the lines that matter, copied verbatim}
</relevant_lines>
```
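
The format above can be wrapped in two small helpers, one to build the user message and one to pull the kept lines back out of a response. This is a sketch rather than the `squeez` package's own code: `build_user_message` and `parse_relevant_lines` are hypothetical names, and the newline placement inside the tags follows the transformers example in this card.

```python
import re

# Non-greedy match so only the first <relevant_lines> block is captured.
_RELEVANT = re.compile(r"<relevant_lines>(.*?)</relevant_lines>", re.DOTALL)


def build_user_message(task: str, raw_output: str) -> str:
    """Wrap the query and the raw tool output in the tags the model expects."""
    return f"<query>\n{task}\n</query>\n<tool_output>\n{raw_output}\n</tool_output>"


def parse_relevant_lines(response: str) -> list[str]:
    """Return the kept lines, or an empty list if the block is missing or empty."""
    match = _RELEVANT.search(response)
    if match is None:
        return []
    body = match.group(1).strip("\n")
    return body.splitlines() if body.strip() else []


response = (
    "<relevant_lines>\n"
    "FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401\n"
    "</relevant_lines>"
)
print(parse_relevant_lines(response))
# ['FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401']
```

Treating a missing or empty block as an empty list lets callers handle "no matching evidence" without special-casing the response text.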
## Supported Tool Types (27)

**SWE-bench derived (14):** `read_file`, `grep`, `git_log`, `git_blame`, `git_diff`, `test_output`, `python`, `curl`, `pip_install`, `ls`, `lint_output`, `build_output`, `type_check`, `coverage`

**Synthetic multi-ecosystem (13):** `npm_build`, `tsc`, `npm_install`, `docker_logs`, `docker_build`, `make_cmake`, `kubectl`, `cargo_build`, `go_build`, `mvn_gradle`, `terraform`, `mypy_pyright`, `eslint`
## Training Details

| Setting | Value |
|---|---|
| **Base model** | Qwen/Qwen3.5-2B |
| **Method** | LoRA (r=16, alpha=32) via Unsloth |
| **Training data** | 10,508 examples (SWE-bench + synthetic) |
| **Epochs** | 3 |
| **Max sequence length** | 20,000 tokens |
| **Learning rate** | 2e-4 |
| **Batch size** | 8 (32 effective with 4x gradient accumulation) |
| **Hardware** | Single NVIDIA A100 80GB |
| **Dataset** | [KRLabsOrg/tool-output-extraction-swebench](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench) |
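
The numbers in the table imply a few derived quantities. The arithmetic below is a rough sketch: it assumes all 10,508 examples are used for training without sequence packing and that a final partial batch still counts as a step, so the step counts are estimates rather than values reported by the authors. The LoRA scaling factor uses the standard `alpha / r` definition.

```python
import math

examples = 10_508
per_device_batch = 8
grad_accum = 4
epochs = 3
lora_r, lora_alpha = 16, 32

effective_batch = per_device_batch * grad_accum          # 8 * 4
steps_per_epoch = math.ceil(examples / effective_batch)  # estimate, see assumptions above
total_steps = steps_per_epoch * epochs
lora_scaling = lora_alpha / lora_r                        # standard LoRA scaling

print(effective_batch, steps_per_epoch, total_steps, lora_scaling)
# 32 329 987 2.0
```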
## Usage with Coding Agents

Add to your `CLAUDE.md` or agent system prompt:

```
When you invoke a shell command, pipe it through `squeez` and describe what you need.
Examples:
- bun test 2>&1 | squeez "did the tests pass?"
- git log --oneline -50 | squeez "find the commit that broke CSRF"
- cat src/auth/middleware.py | squeez "find the referer validation logic"
```
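
For agents that call tools from Python rather than a shell, the same pipe can be reproduced with `subprocess`. This is a minimal sketch: it assumes the `squeez` CLI is on `PATH`, reads tool output on stdin, and takes the query as its first argument, as in the shell examples above. `run_and_prune` and its `pruner` parameter are hypothetical names introduced for illustration.

```python
import subprocess
from collections.abc import Sequence


def run_and_prune(command: Sequence[str], query: str, pruner: Sequence[str] = ("squeez",)) -> str:
    """Run `command`, merge stderr into stdout, and pipe the text through `pruner query`."""
    raw = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # same effect as 2>&1 in the shell examples
        text=True,
        check=False,  # failing commands (e.g. red test runs) are exactly what we want to inspect
    ).stdout
    pruned = subprocess.run([*pruner, query], input=raw, capture_output=True, text=True, check=True)
    return pruned.stdout


# Example call (requires the squeez CLI and a reachable model):
# print(run_and_prune(["pytest", "-q"], "find the failure block"))
```

Keeping `pruner` as a parameter makes the wrapper easy to exercise without a model, for example by substituting a command that echoes stdin.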
## Limitations

- Works best on software engineering tool output; it is not designed for general-purpose summarization
- Synthetic data was generated by `openai/gpt-oss-120b` and may not fully reflect real-world distributions for all ecosystems
- Evaluated on single tool observations, not full agent trajectories
- Max input: 20,000 tokens (training length); can be extended at serving time

## License

Apache 2.0
## Citation

```bibtex
@misc{kovács2026squeeztaskconditionedtooloutputpruning,
  title={Squeez: Task-Conditioned Tool-Output Pruning for Coding Agents},
  author={Ádám Kovács},
  year={2026},
  eprint={2604.04979},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2604.04979},
}
```