| --- |
| title: Konjo AI |
| emoji: π |
| colorFrom: gray |
| colorTo: blue |
| sdk: static |
| pinned: false |
| --- |
| |
| # Konjo AI |
|
|
| Local AI infrastructure for Apple Silicon. We make models that already exist |
| run faster on the hardware you already own. |
|
|
| [squish.run](https://squish.run) · [github.com/konjoai](https://github.com/konjoai) |
|
|
| --- |
|
|
| ## squish: Local LLM inference for Apple Silicon |
|
|
| [squish](https://github.com/konjoai/squish) is an MLX-based local inference |
| server with a block-level paged KV cache and INT3 quantization support for the |
| Qwen3 family. On a 16 GB M3 MacBook against Ollama: |
|
|
| - **5.4× faster** end-to-end response at 4000-token prompts (12.78s vs 69.6s) |
| - **1.5× faster** end-to-end on 75-token prompts (5.50s vs 8.09s) |
| - **33% less RAM** during inference (3.36 GB vs ~5 GB) |
| - **INT3 support** for Qwen3 with no measurable accuracy loss (Ollama doesn't ship INT3) |
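The headline ratios follow directly from the raw timings quoted in the list. A minimal sketch of that arithmetic, using only the numbers stated above (the helper names `speedup` and `percent_saved` are illustrative, and the Ollama RAM figure is the approximate "~5 GB" from the list):

```python
# Figures quoted in the list above (16 GB M3 MacBook, squish vs Ollama).
TIMINGS_S = {
    "4000-token prompt": {"squish": 12.78, "ollama": 69.6},
    "75-token prompt": {"squish": 5.50, "ollama": 8.09},
}
RAM_GB = {"squish": 3.36, "ollama": 5.0}  # Ollama value is approximate (~5 GB)


def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


def percent_saved(baseline: float, candidate: float) -> float:
    """Relative reduction of candidate versus baseline, in percent."""
    return 100.0 * (baseline - candidate) / baseline


for name, t in TIMINGS_S.items():
    print(f"{name}: {speedup(t['ollama'], t['squish']):.1f}x faster")

print(f"RAM: {percent_saved(RAM_GB['ollama'], RAM_GB['squish']):.0f}% less")
```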
|
|
| The honest tradeoff: Ollama still wins first-token latency on short prompts. |
| squish wins when you care about total response time on real workloads. |
|
|
| **Install:** |
| ```bash |
| brew tap konjoai/squish && brew install squish |
| # or |
| pip install squish-ai |
| ``` |
|
|
| **Use:** |
| ```bash |
| squish pull konjoai/Qwen3-8B-squished |
| squish run Qwen3-8B-squished |
| ``` |
|
|
| [Full benchmarks](https://github.com/konjoai/squish/blob/main/docs/RESULTS.md) · |
| [Repo](https://github.com/konjoai/squish) · |
| [Issues](https://github.com/konjoai/squish/issues) |
|
|
| --- |
|
|
| ## Pre-Compressed Models |
|
|
| This org hosts models pre-compressed by squish. Pull once, load instantly every |
| time after. |
|
|
| | Model | Squish ID | Quantization | Disk size | Context | |
| |---|---|---|---|---| |
| | _Available after first publish batch_ | |
|
|
| The format is `mlx_lm`-compatible, so you can also use these models directly: |
|
|
| ```python |
| from mlx_lm import load, generate |
| |
| model, tokenizer = load("konjoai/Qwen2.5-7B-Instruct-squished") |
| response = generate(model, tokenizer, prompt="Hello", max_tokens=100) |
| print(response) |
| ``` |
|
|
| --- |
|
|
| ## How models are compressed |
|
|
| squish uses a three-tier pipeline: |
|
|
| 1. **INT4/INT3 quantization** via a Rust extension (`squish_quant_rs`) with ARM |
| NEON acceleration |
| 2. **Block-level paged KV cache**: KV state is chunked into fixed-size blocks |
| for prefix reuse across sessions |
| 3. **Quantization safeguards**: squish hard-blocks INT3 on model families where |
| it collapses (e.g. Gemma-3 loses about 15 percentage points on common benchmarks); INT3 ships only |
| for families that hold accuracy (Qwen3 specifically) |
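To make the prefix-reuse idea in step 2 concrete, here is a minimal sketch of block-level prefix matching. It is an illustration under stated assumptions, not squish's implementation: the block size, the chained SHA-256 keys, the `PagedPrefixCache` class, and the placeholder KV payload are all hypothetical.

```python
import hashlib
from typing import Dict, List


def block_keys(token_ids: List[int], block_size: int) -> List[str]:
    """Return one chained hash per full block of the prompt.

    Each key depends on the block's tokens and on the previous key, so two
    prompts share a key only if they agree on every token up to that block.
    """
    keys: List[str] = []
    parent = ""
    full_len = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full_len, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + ":" + ",".join(str(t) for t in block)
        parent = hashlib.sha256(payload.encode()).hexdigest()
        keys.append(parent)
    return keys


class PagedPrefixCache:
    """Toy cache: maps block keys to a placeholder for that block's KV state."""

    def __init__(self, block_size: int = 4) -> None:  # block size is hypothetical
        self.block_size = block_size
        self.blocks: Dict[str, str] = {}

    def store(self, token_ids: List[int]) -> None:
        # A real cache would store key/value tensors; a string stands in here.
        for index, key in enumerate(block_keys(token_ids, self.block_size)):
            self.blocks.setdefault(key, f"kv-block-{index}")

    def match_prefix(self, token_ids: List[int]) -> int:
        """Number of leading tokens whose blocks are already cached."""
        hits = 0
        for key in block_keys(token_ids, self.block_size):
            if key not in self.blocks:
                break
            hits += 1
        return hits * self.block_size


cache = PagedPrefixCache(block_size=4)
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]          # shared prefix, two full blocks
cache.store(system_prompt + [9, 10, 11])          # first session
reused = cache.match_prefix(system_prompt + [20, 21, 22, 23, 24])  # second session
print(f"tokens reusable from cache: {reused}")
```

Chaining each key to its parent means a block is reused only when the entire prefix before it also matches. Causal attention requires this, because a token's KV state depends on every token that precedes it.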
|
|
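The safeguard in step 3 amounts to an allowlist check before quantization. A minimal sketch, assuming a hypothetical `check_quantization` helper and exception type; only the two family names come from the text above, and squish's real policy table and API are not shown here.

```python
# Hypothetical policy: INT3 is allowed only for families known to hold accuracy.
# "qwen3" is the one family the text above names as safe.
INT3_ALLOWED_FAMILIES = {"qwen3"}


class UnsupportedQuantization(ValueError):
    """Raised when a bit width is blocked for a model family."""


def check_quantization(family: str, bits: int) -> None:
    """Raise if the requested bit width is hard-blocked for this family."""
    if bits == 3 and family.lower() not in INT3_ALLOWED_FAMILIES:
        raise UnsupportedQuantization(
            f"INT3 is blocked for '{family}'; use INT4 instead"
        )


check_quantization("Qwen3", 3)    # allowed: passes silently
check_quantization("Gemma-3", 4)  # INT4 is not restricted in this sketch
try:
    check_quantization("Gemma-3", 3)
except UnsupportedQuantization as err:
    print(err)
```

An allowlist, not a blocklist, matches the stated policy: a family gets INT3 only after it has been shown to hold accuracy, so unknown families are blocked by default.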
| --- |
|
|
| ## Other projects |
|
|
| We also build [squash](https://github.com/konjoai/squash), a security and EU AI |
| Act compliance scanner for HuggingFace models. Independent codebase, related |
| mission. |
|
|
| --- |
|
|
| ## License |
|
|
| squish is BUSL-1.1. Compressed models inherit their base model's license: Qwen3 |
| is Apache-2.0, Llama is the Llama Community License, etc. Check each model's card |
| for specifics. |
|
|
| --- |
|
|
| ## Requirements |
|
|
| - macOS 13.0 or later |
| - Apple Silicon (M1 / M2 / M3 / M4 / M5) |
| - Enough unified memory for the model (see the table under Pre-Compressed Models) |
|
|
| Intel Macs and Linux are not supported. |
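A quick way to check a machine against these requirements before installing is sketched below, using only Python's standard `platform` module. The `is_supported` helper is illustrative and not part of squish. One assumption to be aware of: a Python interpreter running under Rosetta reports `x86_64` even on Apple Silicon, so run the check with a native arm64 Python.

```python
import platform
from typing import Tuple

MIN_MACOS = (13, 0)  # "macOS 13.0 or later" from the list above


def parse_version(version: str) -> Tuple[int, int]:
    """Turn '14.5' or '13.0.1' into (major, minor); an empty string gives (0, 0)."""
    parts = [int(p) for p in version.split(".")[:2] if p.isdigit()]
    while len(parts) < 2:
        parts.append(0)
    return parts[0], parts[1]


def is_supported(system: str, machine: str, macos_version: str) -> bool:
    """True for Apple Silicon Macs on macOS 13.0 or later."""
    return (
        system == "Darwin"
        and machine == "arm64"
        and parse_version(macos_version) >= MIN_MACOS
    )


# platform.mac_ver()[0] is an empty string on non-macOS systems.
print(is_supported(platform.system(), platform.machine(), platform.mac_ver()[0]))
```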
|
|