title: README
emoji: 🔥
colorFrom: blue
colorTo: red
sdk: static
pinned: false
license: mit
Squish

Squeeze the Most Out of Your Models

Pre-compressed models for Apple Silicon. Load in under a second. Run fully local.



What is this?

This organization hosts models pre-compressed by Squish — a local inference engine for Apple Silicon that gets models off disk and into memory in under a second.

Every model here was compressed with Squish's INT4 quantization pipeline and is ready to download with squish pull and start with squish run. No setup, no Python environment, no cloud.

brew tap konjoai/squish
brew trust konjoai/squish
brew install squish
squish pull qwen3:8b
squish run qwen3:8b

Why pre-compressed?

Compression takes time. A Qwen3-8B model compresses in roughly 8 minutes on an M3. You shouldn't have to wait for that on first run. Models in this org are pre-compressed and validated — pull once, load instantly every time after.

| Format | What it means |
|---|---|
| *-bf16-squished | INT4-compressed, ready for squish run |

Available models

| Model | Squish ID | Raw size | Squished size | Context |
|---|---|---|---|---|
| Qwen3-8B | qwen3:8b | 16.4 GB | 4.4 GB | 128k |
| Qwen3-4B | qwen3:4b | 8.2 GB | 2.2 GB | 32k |
| Qwen3-0.6B | qwen3:0.6b | 1.3 GB | 0.9 GB | 32k |
| Qwen2.5-7B-Instruct | qwen2.5:7b | 14.4 GB | 3.9 GB | 128k |
| Qwen2.5-1.5B-Instruct | qwen2.5:1.5b | 3.1 GB | 0.9 GB | 32k |
| Llama-3.2-3B-Instruct | llama3.2:3b | 6.4 GB | 1.7 GB | 128k |
| Llama-3.2-1B-Instruct | llama3.2:1b | 2.5 GB | 0.7 GB | 128k |
| Gemma-3-4B-Instruct | gemma3:4b | 9.8 GB | 2.6 GB | 128k |
| Gemma-3-1B-Instruct | gemma3:1b | 2.0 GB | 0.5 GB | 32k |

More models are added as the catalog grows. Run squish catalog for the full list.
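
To make the size columns easier to compare, here is a minimal Python sketch that computes the compression ratio for each model and lists which squished models fit under a given memory budget. The sizes are copied from the table above; the function names are hypothetical and not part of Squish. The budget check covers weights only, so leave headroom for the KV cache and the rest of the system.

```python
# Sizes in GB, copied from the "Available models" table: (raw, squished).
SIZES_GB = {
    "qwen3:8b": (16.4, 4.4),
    "qwen3:4b": (8.2, 2.2),
    "qwen3:0.6b": (1.3, 0.9),
    "qwen2.5:7b": (14.4, 3.9),
    "qwen2.5:1.5b": (3.1, 0.9),
    "llama3.2:3b": (6.4, 1.7),
    "llama3.2:1b": (2.5, 0.7),
    "gemma3:4b": (9.8, 2.6),
    "gemma3:1b": (2.0, 0.5),
}


def compression_ratio(squish_id: str) -> float:
    """Raw size divided by squished size for one model."""
    raw, squished = SIZES_GB[squish_id]
    return raw / squished


def models_that_fit(budget_gb: float) -> list[str]:
    """Squish IDs whose squished weights fit in budget_gb, largest first."""
    fitting = [(sq, mid) for mid, (_, sq) in SIZES_GB.items() if sq <= budget_gb]
    return [mid for _, mid in sorted(fitting, reverse=True)]


if __name__ == "__main__":
    for model_id in SIZES_GB:
        print(f"{model_id:14s} {compression_ratio(model_id):.2f}x")
```

Most entries land between 3.4x and 4x; the smallest models compress less because the table reports 0.9 GB for both Qwen3-0.6B and Qwen2.5-1.5B.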


Load time comparison (M3 16GB)

| Model | Squish (INT4) | Ollama | llama.cpp |
|---|---|---|---|
| Qwen3-8B | 0.43s | 4.2s | 6.1s |
| Llama-3.2-3B | 0.33s | 1.8s | 2.4s |

Measured cold-start on Apple M3 16GB. Results will vary by chip and storage.
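
As a worked reading of these numbers, the short Python sketch below turns the load times into speedup factors. The values are the ones reported above; the helper name is hypothetical.

```python
# Cold-start load times in seconds, as reported in the comparison above.
LOAD_SECONDS = {
    "Qwen3-8B": {"squish": 0.43, "ollama": 4.2, "llama.cpp": 6.1},
    "Llama-3.2-3B": {"squish": 0.33, "ollama": 1.8, "llama.cpp": 2.4},
}


def speedup(model: str, baseline: str) -> float:
    """How many times faster Squish loads than the given baseline."""
    row = LOAD_SECONDS[model]
    return row[baseline] / row["squish"]


if __name__ == "__main__":
    for model, row in LOAD_SECONDS.items():
        for baseline in ("ollama", "llama.cpp"):
            print(f"{model} vs {baseline}: {speedup(model, baseline):.1f}x")
```

On these figures Squish loads Qwen3-8B roughly 10x faster than Ollama and roughly 14x faster than llama.cpp; the gap is smaller for the 3B model.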


OpenAI-compatible API

Squish runs a local server on port 11435. Any OpenAI client works out of the box:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11435/v1", api_key="squish")
response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Explain attention mechanisms briefly."}]
)
print(response.choices[0].message.content)

Or point your existing tools at it by exporting these variables in your shell:

export OPENAI_BASE_URL=http://localhost:11435/v1
export OPENAI_API_KEY=squish
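
If you would rather not install the openai package, the same request can be sketched with the Python standard library alone. This assumes the server exposes the standard OpenAI /chat/completions path under the base URL above and returns the usual response shape. The helper names are hypothetical, and the bearer token mirrors the placeholder key used above; whether the server actually checks it is not stated here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11435/v1"  # port taken from the section above


def build_chat_request(prompt: str, model: str = "qwen3:8b") -> urllib.request.Request:
    """Build a POST request in the OpenAI chat-completions format."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",  # assumed: standard OpenAI path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer squish",  # placeholder key, as above
        },
        method="POST",
    )


def chat(prompt: str, model: str = "qwen3:8b", timeout: float = 60.0) -> str:
    """Send the request to a running local server and return the reply text."""
    request = build_chat_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, print(chat("Explain attention mechanisms briefly.")) should behave like the client example above.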

How models are compressed

Squish uses a three-tier compression pipeline:

  • INT4 quantization via a Rust extension (squish_quant_rs) with ARM NEON acceleration — 8–12 GB/s throughput on Apple Silicon
  • Compressed weight loader — weights stay compressed on disk and decompress directly into Metal-mapped memory at load time
  • KV cache quantization — attention cache stored at reduced precision during generation, not just weights

The result is a model that fits in memory on a base M3 with 16 GB and, in the cold-start measurements above, loads in under half a second (0.43s for Qwen3-8B, against 4.2s for Ollama).
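
To give a feel for what INT4 quantization does to a weight tensor, here is a small pure-Python sketch of symmetric, per-group 4-bit quantization with two values packed into each byte. It is an illustration only: the grouping, scaling, and packing layout are assumptions, and this is not the squish_quant_rs implementation, which is not documented here.

```python
def quantize_int4(values, group_size=4):
    """Symmetric per-group INT4 quantization.

    Returns (packed, scales): two 4-bit codes per byte, one float scale per group.
    """
    if group_size % 2 or len(values) % group_size:
        raise ValueError("group_size must be even and divide len(values)")
    scales, codes = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        # Map the largest magnitude in the group to 7; use 1.0 for all-zero groups.
        scale = max(abs(v) for v in group) / 7 or 1.0
        scales.append(scale)
        for v in group:
            q = max(-8, min(7, round(v / scale)))
            codes.append(q & 0x0F)  # store as a two's-complement nibble
    # Even-indexed codes go in the low nibble, odd-indexed in the high nibble.
    packed = bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))
    return packed, scales


def dequantize_int4(packed, scales, group_size=4):
    """Invert quantize_int4, returning approximate float values."""
    codes = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            codes.append(nibble - 16 if nibble >= 8 else nibble)
    return [q * scales[i // group_size] for i, q in enumerate(codes)]


if __name__ == "__main__":
    weights = [0.7, -0.35, 0.1, 0.0, 2.0, -2.0, 1.0, 0.5]
    packed, scales = quantize_int4(weights)
    print(len(packed), "bytes for", len(weights), "weights")
    print(dequantize_int4(packed, scales))
```

Four bits per weight is a quarter of the 16 bits used by bf16, before counting the per-group scales. That is in line with the ratios of a little under 4x seen in the model table.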


Using models directly with mlx_lm

from mlx_lm import load, generate

# Fetches the squished weights from the Hub on first use, then loads them
model, tokenizer = load("squishai/Qwen3-8B-bf16-squished")
response = generate(model, tokenizer, prompt="Hello", max_tokens=100)
print(response)

Requirements

  • macOS 13.0 or later
  • Apple Silicon (M1, M2, M3, M4, M5)
  • Sufficient unified memory for the model (see the Squished size column under Available models)

Intel Macs and Linux are not supported. Windows is not planned.
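
A script that depends on Squish can check these requirements up front. The sketch below uses only Python's platform module; the function name is hypothetical, and the check is an illustration rather than anything Squish ships. Note that an x86_64 Python build running under Rosetta reports x86_64 even on Apple Silicon, so it would be rejected here.

```python
import platform


def is_supported(system=None, machine=None, macos_version=None):
    """Return True on Apple Silicon running macOS 13 or later.

    Arguments default to the current machine; pass them explicitly for testing.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    if system != "Darwin" or machine != "arm64":
        return False  # Intel Macs, Linux, and Windows are not supported
    version = macos_version if macos_version is not None else platform.mac_ver()[0]
    major = int(version.split(".")[0]) if version else 0
    return major >= 13


if __name__ == "__main__":
    print("Supported platform:", is_supported())
```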


Squish it. Run it. Go.

Built by Konjo AI  ·  MIT License