Squish

Squeeze the Most Out of Your Models

Pre-compressed models for Apple Silicon. Load in under a second. Run fully local.



What is this?

This organization hosts models pre-compressed by Squish — a local inference engine for Apple Silicon that gets models off disk and into memory in under a second.

Every model here was compressed with Squish's INT4 quantization pipeline and is ready to download with squish pull and start with squish run. No setup, no Python environment, no cloud.

brew tap konjoai/squish
brew trust konjoai/squish
brew install squish
squish pull qwen3:8b
squish run qwen3:8b

Why pre-compressed?

Compression takes time. A Qwen3-8B model compresses in roughly 8 minutes on an M3. You shouldn't have to wait for that on first run. Models in this org are pre-compressed and validated — pull once, load instantly every time after.

| Format | What it means |
| --- | --- |
| *-bf16-squished | INT4-compressed, ready for squish run |

Available models

| Model | Squish ID | Raw size | Squished size | Context |
| --- | --- | --- | --- | --- |
| Qwen3-8B | qwen3:8b | 16.4 GB | 4.4 GB | 128k |
| Qwen3-4B | qwen3:4b | 8.2 GB | 2.2 GB | 32k |
| Qwen3-0.6B | qwen3:0.6b | 1.3 GB | 0.9 GB | 32k |
| Qwen2.5-7B-Instruct | qwen2.5:7b | 14.4 GB | 3.9 GB | 128k |
| Qwen2.5-1.5B-Instruct | qwen2.5:1.5b | 3.1 GB | 0.9 GB | 32k |
| Llama-3.2-3B-Instruct | llama3.2:3b | 6.4 GB | 1.7 GB | 128k |
| Llama-3.2-1B-Instruct | llama3.2:1b | 2.5 GB | 0.7 GB | 128k |
| Gemma-3-4B-Instruct | gemma3:4b | 9.8 GB | 2.6 GB | 128k |
| Gemma-3-1B-Instruct | gemma3:1b | 2.0 GB | 0.5 GB | 32k |

More models added as the catalog grows. Run squish catalog for the full list.
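To make the size columns easier to compare, here is a small sketch that computes the compression ratio for each entry. The numbers are copied from the table above; the names SIZES_GB and compression_ratio are illustrative only and are not part of Squish.

```python
# Raw and squished sizes in GB, copied from the table above.
SIZES_GB = {
    "qwen3:8b": (16.4, 4.4),
    "qwen3:4b": (8.2, 2.2),
    "qwen3:0.6b": (1.3, 0.9),
    "qwen2.5:7b": (14.4, 3.9),
    "qwen2.5:1.5b": (3.1, 0.9),
    "llama3.2:3b": (6.4, 1.7),
    "llama3.2:1b": (2.5, 0.7),
    "gemma3:4b": (9.8, 2.6),
    "gemma3:1b": (2.0, 0.5),
}


def compression_ratio(squish_id: str) -> float:
    """Return raw size divided by squished size for one catalog entry."""
    raw_gb, squished_gb = SIZES_GB[squish_id]
    return raw_gb / squished_gb


# Print one line per model, for example "qwen3:8b       3.73x".
for squish_id in SIZES_GB:
    print(f"{squish_id:<14} {compression_ratio(squish_id):.2f}x")
```

With the listed sizes, most entries come out between roughly 3.4x and 4x, and qwen3:0.6b is the outlier at about 1.4x.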


Load time comparison (M3 16GB)

| Model | Squish (INT4) | Ollama | llama.cpp |
| --- | --- | --- | --- |
| Qwen3-8B | 0.43s | 4.2s | 6.1s |
| Llama-3.2-3B | 0.33s | 1.8s | 2.4s |

Measured cold-start on Apple M3 16GB. Results will vary by chip and storage.
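The table can also be read as speedup factors. This sketch only divides the published numbers; it does not measure anything itself, and the helper name speedup is made up for illustration.

```python
# Cold-start load times in seconds, copied from the table above.
LOAD_TIMES_S = {
    "Qwen3-8B": {"squish": 0.43, "ollama": 4.2, "llama.cpp": 6.1},
    "Llama-3.2-3B": {"squish": 0.33, "ollama": 1.8, "llama.cpp": 2.4},
}


def speedup(model: str, baseline: str) -> float:
    """How many times faster Squish loads than the given baseline."""
    times = LOAD_TIMES_S[model]
    return times[baseline] / times["squish"]


for model, times in LOAD_TIMES_S.items():
    for baseline in ("ollama", "llama.cpp"):
        print(f"{model} vs {baseline}: {speedup(model, baseline):.1f}x")
```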


OpenAI-compatible API

Squish runs a local server on port 11435. Any OpenAI client works out of the box:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11435/v1", api_key="squish")
response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Explain attention mechanisms briefly."}]
)
print(response.choices[0].message.content)

# Or, from a shell, point your existing tools at it
export OPENAI_BASE_URL=http://localhost:11435/v1
export OPENAI_API_KEY=squish
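If you would rather not install the openai package, the same request can be sketched with the Python standard library. This assumes the server exposes the usual OpenAI-style /chat/completions route under the base URL given above and returns the standard response shape. The helper names build_chat_request and chat are hypothetical.

```python
import json
import urllib.request

# Base URL and placeholder key taken from the example above.
BASE_URL = "http://localhost:11435/v1"
API_KEY = "squish"


def build_chat_request(prompt: str, model: str = "qwen3:8b") -> urllib.request.Request:
    """Build a POST request in the OpenAI chat-completions format (assumed route)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


def chat(prompt: str, model: str = "qwen3:8b") -> str:
    """Send the request and return the first choice's text.

    Needs a running Squish server; raises urllib.error.URLError otherwise.
    """
    with urllib.request.urlopen(build_chat_request(prompt, model)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage, with the server running:
#   print(chat("Explain attention mechanisms briefly."))
```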

How models are compressed

Squish uses a three-tier compression pipeline:

  • INT4 quantization via a Rust extension (squish_quant_rs) with ARM NEON acceleration — 8–12 GB/s throughput on Apple Silicon
  • Compressed weight loader — weights stay compressed on disk and decompress directly into Metal-mapped memory at load time
  • KV cache quantization — attention cache stored at reduced precision during generation, not just weights

The result is a model that fits in memory on a base M3 16GB and loads faster than Ollama can parse its configuration.
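To give a feel for what INT4 quantization does to a weight tensor, here is a minimal pure-Python sketch of symmetric per-group quantization with two 4-bit codes packed into each byte. It is an illustration of the general technique only. The group size, the symmetric scheme, and the function names are assumptions, and it is not the squish_quant_rs implementation.

```python
def quantize_int4(weights, group_size=32):
    """Symmetric per-group INT4: return (packed bytes, per-group scales)."""
    if len(weights) % group_size or group_size % 2:
        raise ValueError("length must be a multiple of an even group_size")
    packed, scales = bytearray(), []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Map the largest magnitude in the group to 7; use 1.0 for all-zero groups.
        scale = max(abs(w) for w in group) / 7 or 1.0
        scales.append(scale)
        # Round to integers in [-7, 7], then shift by 8 to get unsigned nibbles.
        codes = [max(-7, min(7, round(w / scale))) + 8 for w in group]
        # Two 4-bit codes per byte: even index in the low nibble, odd in the high.
        for low, high in zip(codes[0::2], codes[1::2]):
            packed.append(low | (high << 4))
    return bytes(packed), scales


def dequantize_int4(packed, scales, group_size=32):
    """Invert quantize_int4, up to rounding error of half a scale step."""
    weights = []
    for i, byte in enumerate(packed):
        # Byte i holds weights 2*i and 2*i + 1, which share one group.
        scale = scales[(2 * i) // group_size]
        weights.append(((byte & 0x0F) - 8) * scale)
        weights.append(((byte >> 4) - 8) * scale)
    return weights
```

Packing two codes per byte stores 4 bits per weight, so a 16-bit tensor shrinks by a bit less than 4x once the per-group scales are counted.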


Using models directly with mlx_lm

from mlx_lm import load, generate

# Downloads the repo from the Hugging Face Hub on first use, then loads it.
model, tokenizer = load("squishai/Qwen3-8B-bf16-squished")
response = generate(model, tokenizer, prompt="Hello", max_tokens=100)
print(response)

Requirements

  • macOS 13.0 or later
  • Apple Silicon (M1, M2, M3, M4, M5)
  • Sufficient unified memory for the model (see table above)

Intel Macs and Linux are not supported. Windows is not planned.
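The requirements above can be checked from a script before installing. This is a rough sketch using only the standard library. The function names are illustrative, it does not check unified memory, and a Python interpreter running under Rosetta may report x86_64 even on Apple Silicon.

```python
import platform


def is_supported(system: str, machine: str, macos_version: str) -> bool:
    """Apply the requirements listed above: macOS 13.0+ on Apple Silicon."""
    if system != "Darwin" or machine != "arm64":
        return False
    # macos_version looks like "14.5" or "13.6.1"; empty on non-Mac hosts.
    major = int(macos_version.split(".")[0]) if macos_version else 0
    return major >= 13


def current_host_supported() -> bool:
    """Run the check against the machine this script is executing on."""
    return is_supported(
        platform.system(), platform.machine(), platform.mac_ver()[0]
    )
```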


Squish it. Run it. Go.

Built by Konjo AI · MIT License
