Qwen3-8B — Squished for Apple Silicon

This is Qwen3-8B (8B parameters) compressed with Squish — a local inference engine for Apple Silicon.

Weights are INT4-quantized using Squish's ARM NEON-accelerated pipeline and load in under a second on M-series hardware.

Quick start

brew tap konjoai/squish
brew install squish
squish pull qwen3:8b
squish run qwen3:8b

Model details

Property	Value
Parameters	8B
Family	Qwen3
Developer	Alibaba Cloud
Raw size	16.4 GB
Squished size	11.0 GB
Context window	131,072 tokens
Minimum RAM	16 GB unified memory
Quantization	INT4 (Squish pipeline)
Format	MLX-compatible safetensors

Use case

High-quality reasoning and coding with a 128k-token context window (131,072 tokens). Best on M2/M3 Macs with 16 GB of unified memory or more.
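As a rough illustration of how context length drives memory use, the sketch below estimates KV cache size from the token count. The layer count, KV head count, and head dimension are assumptions based on the published Qwen3-8B configuration and should be checked against the model's config.json. The helper name `kv_cache_bytes` is hypothetical and not part of Squish, and the estimate ignores any per-group scale overhead a quantized cache would add.

```python
# Assumed Qwen3-8B attention shape; verify against the model's config.json.
LAYERS = 36
KV_HEADS = 8
HEAD_DIM = 128
CONTEXT_TOKENS = 131_072  # context window from the model details table


def kv_cache_bytes(tokens: int, bytes_per_value: float) -> float:
    """Estimate KV cache size: keys and values for every layer and token."""
    # Factor of 2 because both keys and values are cached.
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value
    return per_token * tokens


# Compare a 16-bit cache with two reduced-precision widths.
for label, width in [("16-bit", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    gib = kv_cache_bytes(CONTEXT_TOKENS, width) / 2**30
    print(f"{label} cache at full context: {gib:.1f} GiB")
```

Halving the bytes per cached value halves the estimate, which is the saving that storing the attention cache at reduced precision is aiming for.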

Requirements

macOS 13.0 or later
Apple Silicon (M1, M2, M3, M4, M5)
16 GB unified memory minimum

Intel Macs, Linux, and Windows are not supported.

How to use with Squish

# Pull and run
squish pull qwen3:8b
squish run qwen3:8b

# OpenAI-compatible API on port 11435
curl http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3:8b","messages":[{"role":"user","content":"Hello"}]}'

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11435/v1", api_key="squish")
response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
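If the `openai` package is not installed, the same endpoint can be reached with the Python standard library. This is a minimal sketch that mirrors the curl call above. It assumes the server returns the usual OpenAI chat-completion JSON shape, as the client example implies. The helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11435/v1"  # port taken from the curl example


def build_chat_request(prompt: str, model: str = "qwen3:8b") -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout: float = 60.0) -> str:
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    # Assumes an OpenAI-style response: choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With the Squish server running, `print(chat("Hello"))` should behave like the client example above. Building the request separately from sending it makes the payload easy to inspect without a live server.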

Load with mlx_lm directly

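from mlx_lm import load, generate

# Fetches the repo from the Hugging Face Hub on first use
model, tokenizer = load("squishai/Qwen3-8B-bf16-squished")

# Qwen3 is a chat model, so wrap the message in its chat template
messages = [{"role": "user", "content": "Hello"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=prompt, max_tokens=100)
print(response)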

Compression details

This model was compressed using Squish's three-tier pipeline:

INT4 quantization via squish_quant_rs Rust extension with ARM NEON acceleration
Compressed weight loader — weights decompress directly into Metal-mapped memory at load time
KV cache quantization — attention cache stored at reduced precision during generation

Source weights: mlx-community/Qwen3-8B-bf16
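To make the INT4 step more concrete, here is a generic group-wise 4-bit quantization sketch in pure Python. It is not Squish's implementation (that lives in the squish_quant_rs Rust extension) and the group size, function names, and packing layout are all assumptions chosen for readability. Each group of weights is mapped to 4-bit codes with its own scale and offset, and two codes are packed per byte.

```python
from typing import List, Tuple

LEVELS = 15  # 4 bits give integer codes 0..15


def quantize_group(values: List[float]) -> Tuple[bytes, float, float]:
    """Quantize one group of floats to packed 4-bit codes plus (scale, offset)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / LEVELS or 1.0  # fall back to 1.0 for constant groups
    codes = [min(LEVELS, max(0, round((v - lo) / scale))) for v in values]
    if len(codes) % 2:
        codes.append(0)  # pad so codes pair up into whole bytes
    # Low nibble holds the first code of each pair, high nibble the second.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return packed, scale, lo


def dequantize_group(packed: bytes, scale: float, offset: float, count: int) -> List[float]:
    """Unpack 4-bit codes and map them back to approximate float values."""
    codes: List[int] = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return [c * scale + offset for c in codes[:count]]


# Hypothetical weights, one small group for illustration.
weights = [-0.30, 0.12, 0.05, 0.45]
packed, scale, offset = quantize_group(weights)
restored = dequantize_group(packed, scale, offset, len(weights))
print(len(packed), "bytes for", len(weights), "weights")
print("max error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

The round-trip error for each value is bounded by half the group's scale, so smaller groups give tighter reconstruction at the cost of storing more scale and offset pairs.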

License

The original model weights are subject to the license of the source model (Alibaba Cloud). The compression and tooling are MIT licensed. See Squish license for details.

Pre-compressed by Konjo AI · squish.run

