---
title: README
emoji: 🔥
colorFrom: blue
colorTo: red
sdk: static
pinned: false
license: mit
---
<div align="center">
<img src="https://raw.githubusercontent.com/konjoai/squish/main/assets/squish-logo-1.png" width="330" alt="Squish" />
<h2>Squeeze the Most Out of Your Models</h2>
<h3>Pre-compressed models for Apple Silicon. Load in under a second. Run fully local.</h3>
[GitHub](https://github.com/konjoai/squish) · [License](https://github.com/konjoai/squish/blob/main/LICENSE) · [Website](https://squish.run)
</div>
---
## What is this?
This organization hosts models pre-compressed by [Squish](https://github.com/konjoai/squish) — a local inference engine for Apple Silicon that gets models off disk and into memory in under a second.
Every model here was compressed with Squish's INT4 quantization pipeline and is ready to load directly with `squish pull`. No setup, no Python environment, no cloud.
```bash
brew tap konjoai/squish
brew trust konjoai/squish
brew install squish
squish pull qwen3:8b
squish run qwen3:8b
```
---
## Why pre-compressed?
Compression takes time. A Qwen3-8B model compresses in roughly 8 minutes on an M3. You shouldn't have to wait for that on first run. Models in this org are pre-compressed and validated — pull once, load instantly every time after.
| Format | What it means |
|--------|--------------|
| `*-bf16-squished` | INT4-compressed, ready for `squish run` |
---
## Available models
| Model | Squish ID | Raw size | Squished size | Context |
|-------|-----------|----------|---------------|---------|
| [Qwen3-8B](https://huggingface.co/squishai/Qwen3-8B-bf16-squished) | `qwen3:8b` | 16.4 GB | 4.4 GB | 128k |
| [Qwen3-4B](https://huggingface.co/squishai/Qwen3-4B-bf16-squished) | `qwen3:4b` | 8.2 GB | 2.2 GB | 32k |
| [Qwen3-0.6B](https://huggingface.co/squishai/Qwen3-0.6B-bf16-squished) | `qwen3:0.6b` | 1.3 GB | 0.9 GB | 32k |
| [Qwen2.5-7B-Instruct](https://huggingface.co/squishai/Qwen2.5-7B-Instruct-bf16-squished) | `qwen2.5:7b` | 14.4 GB | 3.9 GB | 128k |
| [Qwen2.5-1.5B-Instruct](https://huggingface.co/squishai/Qwen2.5-1.5B-Instruct-bf16-squished) | `qwen2.5:1.5b` | 3.1 GB | 0.9 GB | 32k |
| [Llama-3.2-3B-Instruct](https://huggingface.co/squishai/Llama-3.2-3B-Instruct-bf16-squished) | `llama3.2:3b` | 6.4 GB | 1.7 GB | 128k |
| [Llama-3.2-1B-Instruct](https://huggingface.co/squishai/Llama-3.2-1B-Instruct-bf16-squished) | `llama3.2:1b` | 2.5 GB | 0.7 GB | 128k |
| [Gemma-3-4B-Instruct](https://huggingface.co/squishai/gemma-3-4b-it-bf16-squished) | `gemma3:4b` | 9.8 GB | 2.6 GB | 128k |
| [Gemma-3-1B-Instruct](https://huggingface.co/squishai/gemma-3-1b-it-bf16-squished) | `gemma3:1b` | 2.0 GB | 0.5 GB | 32k |
More models will be added as the catalog grows. Run `squish catalog` for the full list.
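
The two size columns imply a compression ratio for each model, and a short sketch makes that arithmetic explicit. The figures are copied from the table above; `SIZES_GB` and `compression_ratio` are illustrative names and are not part of Squish. Storing 16-bit weights at 4 bits gives an ideal ratio of 4x, and most entries land a little under that.

```python
# (raw GB, squished GB) per Squish ID, copied from the table above.
SIZES_GB = {
    "qwen3:8b": (16.4, 4.4),
    "qwen3:4b": (8.2, 2.2),
    "qwen3:0.6b": (1.3, 0.9),
    "qwen2.5:7b": (14.4, 3.9),
    "qwen2.5:1.5b": (3.1, 0.9),
    "llama3.2:3b": (6.4, 1.7),
    "llama3.2:1b": (2.5, 0.7),
    "gemma3:4b": (9.8, 2.6),
    "gemma3:1b": (2.0, 0.5),
}


def compression_ratio(squish_id: str) -> float:
    """Return raw size divided by squished size for one model."""
    raw_gb, squished_gb = SIZES_GB[squish_id]
    return raw_gb / squished_gb


# Print one line per model, e.g. the ratio for qwen3:8b is 16.4 / 4.4.
for squish_id in SIZES_GB:
    print(f"{squish_id:<14} {compression_ratio(squish_id):.2f}x")
```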
---
## Load time comparison (M3 16GB)
| Model | Squish (INT4) | Ollama | llama.cpp |
|-------|--------------|--------|-----------|
| Qwen3-8B | **0.43s** | 4.2s | 6.1s |
| Llama-3.2-3B | **0.33s** | 1.8s | 2.4s |
*Measured cold-start on Apple M3 16GB. Results will vary by chip and storage.*
---
## OpenAI-compatible API
Squish runs a local server on port 11435. Any OpenAI client works out of the box:
```python
from openai import OpenAI

# Point the client at the local Squish server instead of the hosted API.
client = OpenAI(base_url="http://localhost:11435/v1", api_key="squish")

response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Explain attention mechanisms briefly."}],
)
print(response.choices[0].message.content)
```
```bash
# Or point your existing tools at it
export OPENAI_BASE_URL=http://localhost:11435/v1
export OPENAI_API_KEY=squish
```
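
Because the server is OpenAI-compatible, a client does not strictly need the `openai` package. The following is a minimal sketch using only the standard library. It assumes the server is running on port 11435 and serves the usual `/v1/chat/completions` route with the standard OpenAI response shape. `build_chat_request` and `chat` are hypothetical helper names, not part of Squish. With the server running, `chat("Explain attention mechanisms briefly.")` would return the reply text.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11435/v1"  # port taken from the section above


def build_chat_request(prompt: str, model: str = "qwen3:8b") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer squish",  # same placeholder key as above
        },
        method="POST",
    )


def chat(prompt: str, model: str = "qwen3:8b") -> str:
    """Send the request and return the reply text; needs a running server."""
    with urllib.request.urlopen(build_chat_request(prompt, model)) as resp:
        body = json.load(resp)
    # Assumption: the response follows the standard OpenAI schema.
    return body["choices"][0]["message"]["content"]
```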
---
## How models are compressed
Squish uses a three-tier compression pipeline:
- **INT4 quantization** via a Rust extension (`squish_quant_rs`) with ARM NEON acceleration — 8–12 GB/s throughput on Apple Silicon
- **Compressed weight loader** — weights stay compressed on disk and decompress directly into Metal-mapped memory at load time
- **KV cache quantization** — attention cache stored at reduced precision during generation, not just weights
The result is a model that fits in memory on a base M3 16GB and loads in well under a second, several times faster than Ollama or llama.cpp in the cold-start measurements above.
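
This page does not show Squish's quantization code, so the following is only a generic sketch of symmetric, group-wise INT4 quantization in pure Python. It is not the `squish_quant_rs` implementation. The group size, the rounding rule, and all function names are assumptions chosen for illustration. Each weight becomes a signed 4-bit code, two codes share one byte, and each group keeps one floating-point scale for reconstruction.

```python
from typing import List, Tuple

GROUP_SIZE = 4  # illustrative only; must be even so codes pack in pairs


def quantize_int4(weights: List[float]) -> Tuple[bytes, List[float]]:
    """Quantize floats to signed 4-bit codes with one scale per group."""
    if len(weights) % GROUP_SIZE:
        raise ValueError("length must be a multiple of GROUP_SIZE")
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), GROUP_SIZE):
        group = weights[start:start + GROUP_SIZE]
        # Map the largest magnitude in the group to +/-7; avoid a zero scale.
        scale = max(abs(w) for w in group) / 7 or 1.0
        scales.append(scale)
        for w in group:
            q = max(-8, min(7, round(w / scale)))
            codes.append(q & 0x0F)  # keep the two's-complement low nibble
    # Pack two 4-bit codes into each byte: first code low, second code high.
    packed = bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))
    return packed, scales


def dequantize_int4(packed: bytes, scales: List[float]) -> List[float]:
    """Unpack the nibbles, sign-extend them, and rescale per group."""
    values: List[int] = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return [q * scales[i // GROUP_SIZE] for i, q in enumerate(values)]


weights = [0.7, -0.35, 0.1, 0.0, 2.0, -2.0, 1.0, 0.5]
packed, scales = quantize_int4(weights)
restored = dequantize_int4(packed, scales)
print(len(packed), "bytes for", len(weights), "weights")
print([round(x, 3) for x in restored])
```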
---
## Using models directly with mlx_lm
```python
from mlx_lm import load, generate

# Download (or reuse the cached copy of) the model from the Hugging Face Hub.
model, tokenizer = load("squishai/Qwen3-8B-bf16-squished")
response = generate(model, tokenizer, prompt="Hello", max_tokens=100)
print(response)
```
---
## Requirements
- macOS 13.0 or later
- Apple Silicon (M1, M2, M3, M4, M5)
- Sufficient unified memory for the model (see the Squished size column under Available models)
> Intel Macs and Linux are not supported. Windows is not planned.
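
As a rough way to apply the memory requirement, the sketch below filters the catalog by squished size against a memory budget. The sizes are copied from the Available models table. The 4 GB default headroom for the OS, the KV cache, and other apps is an arbitrary assumption, not a Squish recommendation, and `models_that_fit` is a hypothetical helper.

```python
from typing import List

# Squished size in GB per Squish ID, copied from the Available models table.
SQUISHED_GB = {
    "qwen3:8b": 4.4,
    "qwen3:4b": 2.2,
    "qwen3:0.6b": 0.9,
    "qwen2.5:7b": 3.9,
    "qwen2.5:1.5b": 0.9,
    "llama3.2:3b": 1.7,
    "llama3.2:1b": 0.7,
    "gemma3:4b": 2.6,
    "gemma3:1b": 0.5,
}


def models_that_fit(memory_gb: float, headroom_gb: float = 4.0) -> List[str]:
    """Return Squish IDs whose weights fit in memory_gb minus headroom_gb."""
    budget_gb = memory_gb - headroom_gb  # headroom is an assumed safety margin
    return [name for name, size in SQUISHED_GB.items() if size <= budget_gb]


print(models_that_fit(8.0))   # candidates for an 8 GB machine
print(models_that_fit(16.0))  # candidates for a 16 GB machine
```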
---
## Links
- CLI and inference engine: [github.com/konjoai/squish](https://github.com/konjoai/squish)
- Install: `brew tap konjoai/squish && brew install squish`
- Issues and discussions: [GitHub Issues](https://github.com/konjoai/squish/issues)
---
*Squish it. Run it. Go.*
Built by [Konjo AI](https://github.com/konjoai) · MIT License