Qwen2.5-7B-Instruct — Squished for Apple Silicon

This is Qwen2.5-7B-Instruct (7B parameters) compressed with Squish — a local inference engine for Apple Silicon.

Weights are INT4-quantized using Squish's ARM NEON-accelerated pipeline and load in under a second on M-series hardware.

Quick start

brew tap konjoai/squish
brew install squish
squish pull qwen2.5:7b
squish run qwen2.5:7b

Model details

Property	Value
Parameters	7B
Family	Qwen2.5
Developer	Alibaba Cloud
Raw size	14.4 GB
Squished size	9.6 GB
Context window	131,072 tokens
Minimum RAM	16 GB unified memory
Quantization	INT4 (Squish pipeline)
Format	MLX-compatible safetensors

Use case

Qwen2.5-7B-Instruct combines strong instruction following with a long context window (131,072 tokens), which makes it well suited to document analysis and coding tasks.

Requirements

macOS 13.0 or later
Apple Silicon (M1, M2, M3, M4, M5)
16 GB unified memory minimum

Intel Macs, Linux, and Windows are not supported.
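The requirements above can be checked from a script before pulling the model. The following is a minimal sketch, not part of Squish: the function names check_requirements and probe_host are hypothetical, and reading memory through the macOS sysctl key hw.memsize is an assumption about how you would want to probe the host. Note that a Python interpreter running under Rosetta reports x86_64 even on Apple Silicon.

```python
import platform
import subprocess

MIN_MACOS = (13, 0)               # "macOS 13.0 or later"
MIN_MEMORY_BYTES = 16 * 1024**3   # "16 GB unified memory minimum"


def parse_version(text):
    """Turn '14.5' or '13.0.1' into a tuple of ints, padded to at least (major, minor)."""
    parts = [int(part) for part in text.split(".") if part.isdigit()]
    while len(parts) < 2:
        parts.append(0)
    return tuple(parts)


def check_requirements(system, machine, macos_version, memory_bytes):
    """Return a list of human-readable problems; an empty list means the host qualifies."""
    problems = []
    if system != "Darwin":
        problems.append(f"unsupported OS: {system}")
    if machine != "arm64":
        problems.append(f"unsupported CPU architecture: {machine}")
    if parse_version(macos_version) < MIN_MACOS:
        problems.append(f"macOS {macos_version or 'unknown'} is older than 13.0")
    if memory_bytes < MIN_MEMORY_BYTES:
        problems.append(f"only {memory_bytes / 1024**3:.1f} GB of memory, need 16 GB")
    return problems


def probe_host():
    """Collect the four inputs from the current machine (hypothetical helper)."""
    try:
        out = subprocess.run(
            ["sysctl", "-n", "hw.memsize"],
            capture_output=True, text=True, check=True,
        ).stdout
        memory = int(out.strip())
    except (OSError, subprocess.CalledProcessError, ValueError):
        memory = 0  # unknown; reported as insufficient
    return platform.system(), platform.machine(), platform.mac_ver()[0], memory
```

Calling check_requirements(*probe_host()) on a supported Mac should return an empty list; anything else lists the reasons the host does not qualify.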

How to use with Squish

# Pull and run
squish pull qwen2.5:7b
squish run qwen2.5:7b

# OpenAI-compatible API on port 11435
curl http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen2.5:7b","messages":[{"role":"user","content":"Hello"}]}'

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11435/v1", api_key="squish")
response = client.chat.completions.create(
    model="qwen2.5:7b",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
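If the openai package is not installed, the same endpoint can be reached with the Python standard library. This is a sketch under the assumption that the server returns the usual OpenAI-style response body (choices[0].message.content, as in the client example above); the helper names build_chat_request, extract_reply, and chat are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11435/v1"  # port taken from the curl example


def build_chat_request(prompt, model="qwen2.5:7b", base_url=BASE_URL):
    """Build a POST request equivalent to the curl call."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of an OpenAI-style chat completion body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


def chat(prompt, timeout=120):
    """Send the prompt to the local server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        return extract_reply(resp.read())
```

With the server running, print(chat("Hello")) should behave like the client example. Keeping request construction separate from the network call makes the payload easy to inspect without a running server.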

Load with mlx_lm directly

from mlx_lm import load, generate

model, tokenizer = load("squishai/Qwen2.5-7B-Instruct-bf16-squished")
response = generate(model, tokenizer, prompt="Hello", max_tokens=100)
print(response)
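Because this is an instruct model, raw prompts such as "Hello" usually work better when wrapped in the model's chat template first. The sketch below assumes the tokenizer returned by load exposes apply_chat_template in the Hugging Face style; build_chat_prompt and run are hypothetical helper names, and mlx_lm is imported lazily so the prompt helper can be exercised on its own.

```python
def build_chat_prompt(tokenizer, user_text, system_text=None):
    """Format a single-turn conversation with the tokenizer's chat template."""
    messages = []
    if system_text:
        messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": user_text})
    # tokenize=False returns the formatted string; add_generation_prompt=True
    # appends the marker that tells the model to start the assistant turn.
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )


def run(user_text, max_tokens=100):
    """Load the model and generate a reply (requires mlx_lm on Apple Silicon)."""
    from mlx_lm import load, generate

    model, tokenizer = load("squishai/Qwen2.5-7B-Instruct-bf16-squished")
    prompt = build_chat_prompt(tokenizer, user_text)
    return generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)
```

Calling print(run("Hello")) mirrors the direct example, with the prompt formatted as a chat turn rather than passed as bare text.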

Compression details

This model was compressed using Squish's three-tier pipeline:

INT4 quantization via squish_quant_rs Rust extension with ARM NEON acceleration
Compressed weight loader — weights decompress directly into Metal-mapped memory at load time
KV cache quantization — attention cache stored at reduced precision during generation

Source weights: mlx-community/Qwen2.5-7B-Instruct-bf16
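To give a feel for what 4-bit weight quantization involves, here is a generic illustration in plain Python. It is not the squish_quant_rs implementation, which is a Rust extension with NEON acceleration; the per-group scale and offset scheme and all function names are assumptions chosen for clarity.

```python
def quantize_group(values):
    """Map floats to 4-bit codes 0..15 using one scale and offset per group."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for constant groups
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def pack_nibbles(codes):
    """Pack two 4-bit codes per byte, low nibble first."""
    if len(codes) % 2:
        codes = codes + [0]  # pad odd-length groups with a zero code
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def unpack_nibbles(packed, count):
    """Reverse pack_nibbles, dropping any padding code."""
    codes = []
    for byte in packed:
        codes.extend((byte & 0x0F, byte >> 4))
    return codes[:count]


def dequantize_group(codes, scale, lo):
    """Recover approximate floats from the 4-bit codes."""
    return [lo + c * scale for c in codes]


weights = [-1.0, -0.5, 0.0, 0.25, 0.5, 1.0, 0.75, -0.25]
codes, scale, lo = quantize_group(weights)
packed = pack_nibbles(codes)            # 8 weights fit in 4 bytes
restored = dequantize_group(unpack_nibbles(packed, len(weights)), scale, lo)
```

Each restored value differs from the original by at most half a quantization step (scale / 2). Real pipelines typically work on groups of 32 to 128 weights and store the scale and offset alongside the packed codes.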

License

The original model weights remain subject to the license of the source model (Alibaba Cloud). The Squish compression pipeline and tooling are MIT licensed. See the Squish license for details.

Pre-compressed by Konjo AI · squish.run
