---
title: Konjo AI
emoji: π
colorFrom: gray
colorTo: blue
sdk: static
pinned: false
---
# Konjo AI
Local AI infrastructure for Apple Silicon. We make models that already exist
run faster on the hardware you already own.
[squish.run](https://squish.run) · [github.com/konjoai](https://github.com/konjoai)
---
## squish: Local LLM inference for Apple Silicon
[squish](https://github.com/konjoai/squish) is an MLX-based local inference
server with a block-level paged KV cache and INT3 quantization support for the
Qwen3 family. On a 16 GB M3 MacBook against Ollama:
- **5.4× faster** end-to-end response at 4000-token prompts (12.78s vs 69.6s)
- **1.5× faster** end-to-end on 75-token prompts (5.50s vs 8.09s)
- **33% less RAM** during inference (3.36 GB vs ~5 GB)
- **INT3 support** for Qwen3 with no measurable accuracy loss (Ollama doesn't ship INT3)
The honest tradeoff: Ollama still wins first-token latency on short prompts.
squish wins when you care about total response time on real workloads.
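The headline ratios follow directly from the timings quoted above. The snippet below is only a worked check of that arithmetic; the `timings` dictionary and the `speedup` helper are illustrative names, and the Ollama RAM figure is the approximate 5 GB from the list.

```python
# End-to-end timings quoted above, in seconds: (squish, Ollama).
timings = {
    "4000-token prompt": (12.78, 69.6),
    "75-token prompt": (5.50, 8.09),
}


def speedup(squish_s: float, ollama_s: float) -> float:
    """How many times sooner squish finishes than Ollama."""
    return ollama_s / squish_s


for name, (squish_s, ollama_s) in timings.items():
    print(f"{name}: {speedup(squish_s, ollama_s):.1f}x faster")

# RAM during inference: 3.36 GB vs roughly 5 GB.
ram_saving = 1 - 3.36 / 5.0
print(f"RAM saving: {ram_saving:.0%}")
```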
**Install:**
```bash
brew tap konjoai/squish && brew install squish
# or
pip install squish-ai
```
**Use:**
```bash
squish pull konjoai/Qwen3-8B-squished
squish run Qwen3-8B-squished
```
[Full benchmarks](https://github.com/konjoai/squish/blob/main/docs/RESULTS.md) ·
[Repo](https://github.com/konjoai/squish) ·
[Issues](https://github.com/konjoai/squish/issues)
---
## Pre-Compressed Models
This org hosts models pre-compressed by squish. Pull once, load instantly every
time after.
| Model | Squish ID | Quantization | Disk size | Context |
|---|---|---|---|---|
| _Available after first publish batch_ | | | | |
The format is `mlx_lm`-compatible, so you can also use these models directly:
```python
from mlx_lm import load, generate
model, tokenizer = load("konjoai/Qwen2.5-7B-Instruct-squished")
response = generate(model, tokenizer, prompt="Hello", max_tokens=100)
print(response)
```
---
## How models are compressed
squish uses a three-tier pipeline:
1. **INT4/INT3 quantization** via a Rust extension (`squish_quant_rs`) with ARM
NEON acceleration
2. **Block-level paged KV cache**: KV state is chunked into fixed-size blocks
for prefix reuse across sessions
3. **Quantization safeguards**: squish hard-blocks INT3 on model families where
it collapses (e.g. Gemma-3 loses ~15 percentage points on common benchmarks);
INT3 ships only for families that hold accuracy (Qwen3 specifically)
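The prefix reuse in step 2 can be pictured with a toy index. This is a minimal sketch under stated assumptions, not squish's implementation: `BLOCK_SIZE`, `block_keys`, and `PrefixIndex` are hypothetical names, the chained hashing scheme is one common way to key blocks, and a real cache would store KV tensors per block rather than a counter.

```python
import hashlib
from typing import Dict, List

BLOCK_SIZE = 4  # hypothetical; chosen small so the example is easy to follow


def block_keys(tokens: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Return one chained hash per full block of tokens.

    Each key depends on every token before it, so two prompts share a key
    only if they share the entire prefix up to the end of that block.
    """
    keys = []
    running = hashlib.sha256()
    full = len(tokens) - len(tokens) % block_size  # ignore the partial tail
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        running.update(",".join(map(str, block)).encode() + b";")
        keys.append(running.copy().hexdigest())
    return keys


class PrefixIndex:
    """Toy stand-in for a paged KV cache: maps block keys to block ids."""

    def __init__(self) -> None:
        self.blocks: Dict[str, int] = {}

    def admit(self, tokens: List[int]) -> int:
        """Register a prompt and return how many tokens were already cached."""
        reused = 0
        still_matching = True
        for key in block_keys(tokens):
            if still_matching and key in self.blocks:
                reused += BLOCK_SIZE
            else:
                still_matching = False
                self.blocks.setdefault(key, len(self.blocks))
        return reused


index = PrefixIndex()
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
print(index.admit(system_prompt + [9, 10, 11, 12]))   # first request: nothing cached
print(index.admit(system_prompt + [20, 21, 22, 23]))  # shares the first two blocks
```

In this sketch the second request reuses the two blocks that cover the shared system prompt, so only the final block would need fresh prefill work.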
---
## Other projects
We also build [squash](https://github.com/konjoai/squash), a security and EU AI
Act compliance scanner for HuggingFace models. Independent codebase, related
mission.
---
## License
squish is BUSL-1.1. Compressed models inherit their base model's license: Qwen3
is Apache-2.0, Llama is the Llama Community License, etc. Check each model's card
for specifics.
---
## Requirements
- macOS 13.0 or later
- Apple Silicon (M1 / M2 / M3 / M4 / M5)
- Enough unified memory for the model (see the Pre-Compressed Models table)
Intel Macs and Linux are not supported.
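For a rough sense of how much unified memory a model needs, the weights alone can be estimated from parameter count and bit width. This is a back-of-the-envelope sketch, not a figure from the squish benchmarks: `weight_gb` is a hypothetical helper, the 8B parameter count is a nominal value assumed from the model name, and the estimate ignores quantization scales, any layers kept at higher precision, the KV cache, and runtime overhead, so real usage will be higher.

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Lower bound on weight storage in decimal gigabytes (weights only)."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal 8B-parameter model at three bit widths (assumed, for illustration).
for bits in (16, 4, 3):
    print(f"8B params at {bits}-bit: about {weight_gb(8e9, bits):.1f} GB")
```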