---
title: Konjo AI
emoji: 🗜
colorFrom: gray
colorTo: blue
sdk: static
pinned: false
---

# Konjo AI

Local AI infrastructure for Apple Silicon. We make models that already exist
run faster on the hardware you already own.

🌐 [squish.run](https://squish.run) · 💻 [github.com/konjoai](https://github.com/konjoai)

---

## squish: Local LLM inference for Apple Silicon

[squish](https://github.com/konjoai/squish) is an MLX-based local inference
server with a block-level paged KV cache and INT3 quantization support for the
Qwen3 family. On a 16 GB M3 MacBook against Ollama:

- **5.4× faster** end-to-end response on 4000-token prompts (12.78s vs 69.6s)
- **1.5× faster** end-to-end response on 75-token prompts (5.50s vs 8.09s)
- **33% less RAM** during inference (3.36 GB vs ~5 GB)
- **INT3 support** for Qwen3 with no measurable accuracy loss (Ollama doesn't ship INT3)

The honest tradeoff: Ollama still wins first-token latency on short prompts.
squish wins when you care about total response time on real workloads.
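
The headline ratios follow directly from the raw figures quoted above. A minimal sketch of the arithmetic, assuming the approximate "~5 GB" Ollama figure is taken as exactly 5.0 GB (the dictionary and function names are illustrative only):

```python
# Figures quoted in the list above (16 GB M3 MacBook, squish vs Ollama).
long_prompt = {"squish_s": 12.78, "ollama_s": 69.6}   # 4000-token prompts
short_prompt = {"squish_s": 5.50, "ollama_s": 8.09}   # 75-token prompts
ram_gb = {"squish": 3.36, "ollama": 5.0}              # "~5 GB" treated as 5.0 (assumption)


def speedup(run):
    """How many times faster squish finished than Ollama."""
    return run["ollama_s"] / run["squish_s"]


def ram_saving(ram):
    """Fraction of Ollama's RAM that squish did not need."""
    return 1 - ram["squish"] / ram["ollama"]


print(f"long prompts:  {speedup(long_prompt):.1f}x faster")   # 5.4x
print(f"short prompts: {speedup(short_prompt):.1f}x faster")  # 1.5x
print(f"RAM saving:    {ram_saving(ram_gb):.0%}")             # 33%
```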

**Install:**
```bash
brew tap konjoai/squish && brew install squish
# or
pip install squish-ai
```

**Use:**
```bash
squish pull konjoai/Qwen3-8B-squished
squish run Qwen3-8B-squished
```

[Full benchmarks](https://github.com/konjoai/squish/blob/main/docs/RESULTS.md) ·
[Repo](https://github.com/konjoai/squish) ·
[Issues](https://github.com/konjoai/squish/issues)

---

## Pre-Compressed Models

This org hosts models pre-compressed by squish. Pull once, load instantly every
time after.

| Model | Squish ID | Quantization | Disk size | Context |
|---|---|---|---|---|
| _Available after first publish batch_ | | | | |

The format is `mlx_lm`-compatible, so you can also load these models directly:

```python
from mlx_lm import load, generate

model, tokenizer = load("konjoai/Qwen2.5-7B-Instruct-squished")
response = generate(model, tokenizer, prompt="Hello", max_tokens=100)
print(response)
```

---

## How models are compressed

squish uses a three-tier pipeline:

1. **INT4/INT3 quantization** via a Rust extension (`squish_quant_rs`) with ARM
   NEON acceleration
2. **Block-level paged KV cache**: KV state is chunked into fixed-size blocks
   for prefix reuse across sessions
3. **Quantization safeguards**: squish hard-blocks INT3 on model families where
   it collapses (e.g. Gemma-3 loses ~15pp on common benchmarks); INT3 ships only
   for families that hold accuracy (Qwen3 specifically)
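
To make the prefix-reuse idea in step 2 concrete, here is a minimal sketch of one common way to key fixed-size blocks: hash each block together with its predecessor's hash, so a block matches only when the whole prefix matches. This is not squish's implementation. `BLOCK_SIZE`, `block_keys`, and `PrefixBlockCache` are hypothetical names, and the KV tensors are replaced by placeholder strings.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per block; hypothetical value, not squish's actual setting


def block_keys(token_ids, block_size=BLOCK_SIZE):
    """Return one chained hash per full block of the prompt.

    Each key covers the block's tokens and the previous block's key, so two
    prompts share a key only if they agree on every token up to that block.
    """
    keys, parent = [], b""
    full = len(token_ids) - len(token_ids) % block_size  # drop the partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        keys.append(digest)
        parent = digest
    return keys


class PrefixBlockCache:
    """Maps block keys to stored KV blocks (placeholders in this sketch)."""

    def __init__(self):
        self.blocks = {}

    def store(self, token_ids, kv_blocks):
        for key, kv in zip(block_keys(token_ids), kv_blocks):
            self.blocks.setdefault(key, kv)

    def match_prefix(self, token_ids):
        """Count leading tokens whose KV blocks are already cached."""
        hits = 0
        for key in block_keys(token_ids):
            if key not in self.blocks:
                break
            hits += 1
        return hits * BLOCK_SIZE


cache = PrefixBlockCache()
system_prompt = list(range(40))               # stand-in token ids
first = system_prompt + [900, 901, 902]       # 43 tokens -> 2 full blocks
second = system_prompt + [950, 951]           # same first 40 tokens

cache.store(first, kv_blocks=["kv0", "kv1"])
print(cache.match_prefix(second))             # 32: two shared blocks are reusable
```

Only full blocks are keyed, so the last 8 shared tokens of the 40-token prefix are recomputed in this sketch; a smaller block size trades more bookkeeping for finer-grained reuse.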

---

## Other projects

We also build [squash](https://github.com/konjoai/squash), a security and EU AI
Act compliance scanner for HuggingFace models. Independent codebase, related
mission.

---

## License

squish is BUSL-1.1. Compressed models inherit their base model's license: Qwen3
is Apache-2.0, Llama uses the Llama Community License, and so on. Check each
model's card for specifics.

---

## Requirements

- macOS 13.0 or later
- Apple Silicon (M1 / M2 / M3 / M4 / M5)
- Enough unified memory for the model (see the table under Pre-Compressed Models)

Intel Macs and Linux are not supported.
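
A quick way to check the OS and CPU requirements before installing is sketched below, using only Python's standard `platform` module. `is_supported` and `MIN_MACOS` are hypothetical names, not part of squish, and the sketch does not check unified memory. Note that a Python interpreter running under Rosetta reports `x86_64` even on Apple Silicon.

```python
import platform

MIN_MACOS = (13, 0)  # "macOS 13.0 or later" from the list above


def is_supported(system, machine, mac_release):
    """Check the OS and CPU requirements listed above (memory is not checked)."""
    if system != "Darwin" or machine != "arm64":
        return False  # Linux and Intel Macs are not supported
    parts = tuple(int(p) for p in mac_release.split(".")[:2] if p.isdigit())
    major_minor = (parts + (0, 0))[:2]  # pad "13" to (13, 0), "" to (0, 0)
    return major_minor >= MIN_MACOS


# Check the machine this script runs on.
print(is_supported(platform.system(), platform.machine(), platform.mac_ver()[0]))
```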