Qwen2.5-7B-Instruct — Squished for Apple Silicon

This is Qwen2.5-7B-Instruct (7B parameters) compressed with Squish — a local inference engine for Apple Silicon.

Weights are INT4-quantized using Squish's ARM NEON-accelerated pipeline and load in under a second on M-series hardware.

Quick start

brew tap konjoai/squish
brew install squish
squish pull qwen2.5:7b
squish run qwen2.5:7b

Model details

Property         Value
Parameters       7B
Family           Qwen2.5
Developer        Alibaba Cloud
Raw size         14.4 GB
Squished size    9.6 GB
Context window   131,072 tokens
Minimum RAM      16 GB unified memory
Quantization     INT4 (Squish pipeline)
Format           MLX-compatible safetensors
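
As a quick check on the two sizes listed above, the arithmetic below works out the compression ratio and the space saved. The figures come straight from the table; the variable names are only illustrative.

```python
# Sizes from the model details table, in GB.
raw_gb = 14.4
squished_gb = 9.6

ratio = raw_gb / squished_gb          # how many times smaller the squished weights are
saved_gb = raw_gb - squished_gb       # disk space saved
saved_pct = 100 * saved_gb / raw_gb   # the same saving as a percentage

print(f"compression ratio: {ratio:.2f}x")
print(f"space saved: {saved_gb:.1f} GB ({saved_pct:.1f}%)")
```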

Use case

Strong instruction following over a long context window. Well suited to document analysis and coding tasks.

Requirements

  • macOS 13.0 or later
  • Apple Silicon (M1, M2, M3, M4, M5)
  • 16 GB unified memory minimum

Intel Macs, Linux, and Windows are not supported.

How to use with Squish

# Pull and run
squish pull qwen2.5:7b
squish run qwen2.5:7b

# OpenAI-compatible API on port 11435
curl http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen2.5:7b","messages":[{"role":"user","content":"Hello"}]}'

# The same request from Python, using the openai package
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11435/v1", api_key="squish")
response = client.chat.completions.create(
    model="qwen2.5:7b",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
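
If you would rather not install the openai package, the same call can be sketched with the Python standard library alone. The endpoint, port, and model tag are the ones given above; the helper names `build_chat_request` and `ask` are hypothetical, and the response parsing assumes the server returns the usual OpenAI-style shape.

```python
import json
import urllib.request

# Endpoint and model tag as given in this README; adjust if your setup differs.
BASE_URL = "http://localhost:11435/v1"
MODEL = "qwen2.5:7b"


def build_chat_request(prompt, model=MODEL, base_url=BASE_URL):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, timeout=60):
    """Send the request and return the assistant's reply text.

    Assumes a server is listening on port 11435 and answers with an
    OpenAI-style body (choices[0].message.content).
    """
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (needs the local server to be running):
# print(ask("Hello"))
```

Keeping request construction separate from sending makes it easy to inspect the exact JSON payload before anything goes over the wire.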

Load with mlx_lm directly

from mlx_lm import load, generate

model, tokenizer = load("squishai/Qwen2.5-7B-Instruct-bf16-squished")
response = generate(model, tokenizer, prompt="Hello", max_tokens=100)
print(response)
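
Instruct-tuned models generally respond better when the prompt is wrapped in the model's chat template rather than passed as raw text. The sketch below shows one way to do that with mlx_lm. It assumes the tokenizer bundled with this repository ships a chat template; `build_messages` and `chat` are hypothetical helper names, not part of mlx_lm or Squish.

```python
def build_messages(user_text, system_text=None):
    """Return an OpenAI-style message list for a single user turn."""
    messages = []
    if system_text:
        messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": user_text})
    return messages


def chat(user_text, system_text=None, max_tokens=100,
         repo="squishai/Qwen2.5-7B-Instruct-bf16-squished"):
    """Generate a reply with the chat template applied to the prompt."""
    # Imported lazily so build_messages works without mlx_lm installed.
    from mlx_lm import load, generate

    model, tokenizer = load(repo)
    prompt = tokenizer.apply_chat_template(
        build_messages(user_text, system_text),
        tokenize=False,              # return a string rather than token ids
        add_generation_prompt=True,  # end the prompt at the assistant turn
    )
    return generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)


# Example (downloads the weights on first use, Apple Silicon only):
# print(chat("Summarize this README in two sentences."))
```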

Compression details

This model was compressed using Squish's three-tier pipeline:

  • INT4 quantization via squish_quant_rs Rust extension with ARM NEON acceleration
  • Compressed weight loader — weights decompress directly into Metal-mapped memory at load time
  • KV cache quantization — attention cache stored at reduced precision during generation
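
To give a feel for what 4-bit quantization involves, here is a minimal pure-Python sketch of generic min/max (affine) group quantization with two codes packed per byte. It is an illustration only: the internals of squish_quant_rs are not described here, so the grouping, rounding, and packing order below are assumptions, and every function name is hypothetical.

```python
def quantize_group(values):
    """Quantize floats to 4-bit codes (0..15) with a per-group scale and offset."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for a constant group
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def pack_nibbles(codes):
    """Pack two 4-bit codes into each byte (low nibble first)."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def unpack_nibbles(packed, count):
    """Inverse of pack_nibbles; returns the first `count` codes."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes[:count]


def dequantize_group(codes, scale, lo):
    """Map 4-bit codes back to approximate float values."""
    return [lo + c * scale for c in codes]


# Made-up example weights, one quantization group.
weights = [-0.42, 0.10, 0.33, -0.07, 0.25, -0.31, 0.02, 0.40]
codes, scale, lo = quantize_group(weights)
packed = pack_nibbles(codes)
restored = dequantize_group(unpack_nibbles(packed, len(weights)), scale, lo)

max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{len(weights)} weights -> {len(packed)} bytes, max error {max_error:.4f}")
```

Each group stores one scale and one offset alongside its packed codes, so the reconstruction error per weight is bounded by half the scale. Smaller groups lower the error at the cost of more metadata.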

Source weights: mlx-community/Qwen2.5-7B-Instruct-bf16

License

The original model weights are subject to the license of the source model (Alibaba Cloud). The compression and tooling are MIT licensed. See Squish license for details.


Pre-compressed by Konjo AI · squish.run
