title: README
emoji: 🔥
colorFrom: blue
colorTo: red
sdk: static
pinned: false
license: mit

Squeeze the Most Out of Your Models

Pre-compressed models for Apple Silicon. Load in under a second. Run fully local.

What is this?

This organization hosts models pre-compressed by Squish — a local inference engine for Apple Silicon that gets models off disk and into memory in under a second.

Every model here was compressed with Squish's INT4 quantization pipeline and is ready to download with `squish pull` and start with `squish run`. No setup, no Python environment, no cloud.

brew tap konjoai/squish
brew trust konjoai/squish
brew install squish
squish pull qwen3:8b
squish run qwen3:8b

Why pre-compressed?

Compression takes time. A Qwen3-8B model compresses in roughly 8 minutes on an M3. You shouldn't have to wait for that on first run. Models in this org are pre-compressed and validated — pull once, load instantly every time after.

| Format | What it means |
|---|---|
| `*-bf16-squished` | INT4-compressed, ready for `squish run` |

Available models

| Model | Squish ID | Raw size | Squished size | Context |
|---|---|---|---|---|
| Qwen3-8B | `qwen3:8b` | 16.4 GB | 4.4 GB | 128k |
| Qwen3-4B | `qwen3:4b` | 8.2 GB | 2.2 GB | 32k |
| Qwen3-0.6B | `qwen3:0.6b` | 1.3 GB | 0.9 GB | 32k |
| Qwen2.5-7B-Instruct | `qwen2.5:7b` | 14.4 GB | 3.9 GB | 128k |
| Qwen2.5-1.5B-Instruct | `qwen2.5:1.5b` | 3.1 GB | 0.9 GB | 32k |
| Llama-3.2-3B-Instruct | `llama3.2:3b` | 6.4 GB | 1.7 GB | 128k |
| Llama-3.2-1B-Instruct | `llama3.2:1b` | 2.5 GB | 0.7 GB | 128k |
| Gemma-3-4B-Instruct | `gemma3:4b` | 9.8 GB | 2.6 GB | 128k |
| Gemma-3-1B-Instruct | `gemma3:1b` | 2.0 GB | 0.5 GB | 32k |

More models are added as the catalog grows. Run `squish catalog` for the full list.
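
The size columns imply a compression ratio for each model. As a quick check, the snippet below computes it from the figures in the table; the dictionary simply restates those numbers, and the names `SIZES_GB` and `compression_ratio` are illustrative rather than anything Squish provides.

```python
# Raw and squished sizes in GB, copied from the table above.
SIZES_GB = {
    "qwen3:8b": (16.4, 4.4),
    "qwen3:4b": (8.2, 2.2),
    "qwen3:0.6b": (1.3, 0.9),
    "qwen2.5:7b": (14.4, 3.9),
    "qwen2.5:1.5b": (3.1, 0.9),
    "llama3.2:3b": (6.4, 1.7),
    "llama3.2:1b": (2.5, 0.7),
    "gemma3:4b": (9.8, 2.6),
    "gemma3:1b": (2.0, 0.5),
}


def compression_ratio(raw_gb: float, squished_gb: float) -> float:
    """Return how many times smaller the squished model is than the raw one."""
    return raw_gb / squished_gb


# Ratio per Squish ID, rounded to two decimals for display.
RATIOS = {
    name: round(compression_ratio(raw, squished), 2)
    for name, (raw, squished) in SIZES_GB.items()
}

for name, ratio in RATIOS.items():
    print(f"{name:14s} {ratio:.2f}x")
```

On these figures most models shrink by roughly 3.4x to 4x, in line with the 4x upper bound for going from 16-bit to 4-bit weights, while `qwen3:0.6b` comes out near 1.4x.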

Load time comparison (M3 16GB)

| Model | Squish (INT4) | Ollama | llama.cpp |
|---|---|---|---|
| Qwen3-8B | 0.43s | 4.2s | 6.1s |
| Llama-3.2-3B | 0.33s | 1.8s | 2.4s |

Measured cold-start on Apple M3 16GB. Results will vary by chip and storage.

OpenAI-compatible API

Squish runs a local server on port 11435. Any OpenAI client works out of the box:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11435/v1", api_key="squish")
response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Explain attention mechanisms briefly."}]
)
print(response.choices[0].message.content)

# Or point your existing tools at it
export OPENAI_BASE_URL=http://localhost:11435/v1
export OPENAI_API_KEY=squish
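
If you would rather not install the `openai` package, the same endpoint can presumably be reached with a plain HTTP POST. The sketch below uses only the Python standard library and assumes the server follows the standard OpenAI `/chat/completions` request and response shape. The helper names `build_chat_request` and `chat` are illustrative and not part of Squish.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11435/v1"  # port taken from the section above
API_KEY = "squish"  # placeholder key, same value as in the example above


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


def chat(model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the text of the first choice.

    Assumes the response body has the usual choices[0].message.content layout.
    """
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("qwen3:8b", "Explain attention mechanisms briefly."))` should print the model's reply. Nothing is sent until `chat` is called, so the request builder can be inspected offline.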

How models are compressed

Squish uses a three-tier compression pipeline:

1. INT4 quantization via a Rust extension (squish_quant_rs) with ARM NEON acceleration, at 8–12 GB/s throughput on Apple Silicon
2. Compressed weight loader: weights stay compressed on disk and decompress directly into Metal-mapped memory at load time
3. KV cache quantization: attention cache stored at reduced precision during generation, not just weights

The result is a model that fits in memory on a base M3 with 16 GB of unified memory and, in the cold-start measurements above, loads about ten times faster than Ollama for Qwen3-8B (0.43s versus 4.2s).
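
To make the idea of 4-bit quantization concrete, here is a minimal pure-Python sketch of one common scheme: symmetric quantization with a single scale per group, and two 4-bit codes packed into each byte. This is a generic illustration, not the squish_quant_rs implementation. The group size, rounding rule, and nibble layout are all assumptions, and every function name is hypothetical.

```python
from typing import List, Tuple


def quantize_group(values: List[float]) -> Tuple[float, List[int]]:
    """Map floats to signed 4-bit codes in [-8, 7] sharing one scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, codes


def pack_nibbles(codes: List[int]) -> bytes:
    """Store two 4-bit codes per byte, low nibble first (assumed layout)."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count
    out = bytearray()
    for low, high in zip(codes[0::2], codes[1::2]):
        out.append((low & 0x0F) | ((high & 0x0F) << 4))
    return bytes(out)


def unpack_nibbles(packed: bytes, count: int) -> List[int]:
    """Inverse of pack_nibbles: recover `count` signed 4-bit codes."""
    codes: List[int] = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            # Nibbles 8..15 encode the negative values -8..-1.
            codes.append(nibble - 16 if nibble >= 8 else nibble)
    return codes[:count]


def dequantize_group(scale: float, codes: List[int]) -> List[float]:
    """Approximate the original floats from the 4-bit codes."""
    return [code * scale for code in codes]


weights = [0.12, -0.70, 0.33, 0.04, -0.41, 0.68, 0.00, -0.09]
scale, codes = quantize_group(weights)
packed = pack_nibbles(codes)
restored = dequantize_group(scale, unpack_nibbles(packed, len(weights)))
print(len(packed), "bytes for", len(weights), "weights")
```

Each weight costs half a byte plus a share of the per-group scale, and the reconstruction error per weight is at most half the scale. The same trade-off applies in principle when the KV cache is stored at reduced precision.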

Using models directly with mlx_lm

from mlx_lm import load, generate

# Downloads the repo from the Hub on first use, then loads from the local cache.
model, tokenizer = load("squishai/Qwen3-8B-bf16-squished")
response = generate(model, tokenizer, prompt="Hello", max_tokens=100)
print(response)
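
Chat-tuned models generally respond better when the prompt is wrapped in their chat template rather than passed as a bare string. The sketch below shows one way to do that with the tokenizer returned by `load`, assuming the hosted repos ship a chat template as the upstream models do. The helpers `build_messages`, `format_prompt`, and `chat_once` are illustrative names, not part of mlx_lm or Squish.

```python
from typing import Any, Dict, List


def build_messages(user_text: str) -> List[Dict[str, str]]:
    """Wrap a single user turn in the role/content format chat templates expect."""
    return [{"role": "user", "content": user_text}]


def format_prompt(tokenizer: Any, user_text: str) -> str:
    """Render the chat template to a string that ends at the assistant's turn."""
    return tokenizer.apply_chat_template(
        build_messages(user_text),
        tokenize=False,
        add_generation_prompt=True,
    )


def chat_once(repo_id: str, user_text: str, max_tokens: int = 100) -> str:
    """Load a model with mlx_lm and generate one templated reply."""
    # Imported lazily so the helpers above can be used without mlx_lm installed.
    from mlx_lm import load, generate

    model, tokenizer = load(repo_id)
    prompt = format_prompt(tokenizer, user_text)
    return generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)
```

For example, `print(chat_once("squishai/Qwen3-8B-bf16-squished", "Hello"))` would run the same model as above with a templated prompt.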

Requirements

- macOS 13.0 or later
- Apple Silicon (M1, M2, M3, M4, M5)
- Sufficient unified memory for the model (see table above)

Intel Macs and Linux are not supported. Windows is not planned.
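
A script that wraps Squish may want to check the first two requirements up front. Here is a minimal sketch using Python's `platform` module. The function names are hypothetical, unified memory is not checked, and a Python interpreter running under Rosetta may report `x86_64` even on Apple Silicon.

```python
import platform
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a dotted version string, padding with zeros."""
    parts = [int(part) for part in version.split(".") if part.isdigit()]
    major, minor = (parts + [0, 0])[:2]
    return major, minor


def meets_requirements(system: str, machine: str, macos_version: str) -> bool:
    """True for macOS 13.0 or later on Apple Silicon, per the list above."""
    if system != "Darwin" or machine != "arm64":
        return False
    return parse_version(macos_version) >= (13, 0)


if __name__ == "__main__":
    supported = meets_requirements(
        platform.system(), platform.machine(), platform.mac_ver()[0]
    )
    print("supported" if supported else "not supported")
```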