Konjo AI
Local AI infrastructure for Apple Silicon. We make models that already exist run faster on the hardware you already own.
squish.run · github.com/konjoai
squish: Local LLM inference for Apple Silicon
squish is an MLX-based local inference server with a block-level paged KV cache and INT3 quantization support for the Qwen3 family. On a 16 GB M3 MacBook against Ollama:
- 5.4× faster end-to-end response at 4000-token prompts (12.78s vs 69.6s)
- 1.5× faster end-to-end on 75-token prompts (5.50s vs 8.09s)
- 33% less RAM during inference (3.36 GB vs ~5 GB)
- INT3 support for Qwen3 with no measurable accuracy loss (Ollama doesn't ship INT3)
The honest tradeoff: Ollama still wins first-token latency on short prompts. squish wins when you care about total response time on real workloads.
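The headline ratios follow directly from the raw numbers quoted in the list above. A quick check in Python, using only those figures (the helper names `speedup` and `reduction` are ours, chosen for illustration, and the Ollama RAM figure is approximate in the source):

```python
# Timings quoted above, in seconds (lower is better).
timings = {
    "4000-token prompt": {"squish": 12.78, "ollama": 69.6},
    "75-token prompt": {"squish": 5.50, "ollama": 8.09},
}


def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


def reduction(baseline: float, candidate: float) -> float:
    """Fractional reduction of the candidate relative to the baseline."""
    return 1.0 - candidate / baseline


for name, t in timings.items():
    print(f"{name}: {speedup(t['ollama'], t['squish']):.1f}x faster")
# 4000-token prompt: 5.4x faster
# 75-token prompt: 1.5x faster

# RAM during inference: 3.36 GB vs roughly 5 GB.
print(f"RAM: {reduction(5.0, 3.36):.0%} less")
# RAM: 33% less
```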
Install
brew tap konjoai/squish && brew install squish
# or
pip install squish-ai
Use
squish pull konjoai/Qwen3-8B-squished
squish run Qwen3-8B-squished
Full benchmarks · Repo · Issues
Pre-Compressed Models
This org hosts models pre-compressed by squish. Pull once, load instantly every time after.
| Model | Squish ID | Quantization | Disk size | Context |
|---|---|---|---|---|
| Available after first publish batch | | | | |
The format is mlx_lm-compatible, so you can also use these models directly:
from mlx_lm import load, generate
model, tokenizer = load("konjoai/Qwen2.5-7B-Instruct-squished")
response = generate(model, tokenizer, prompt="Hello", max_tokens=100)
print(response)
How models are compressed
squish uses a three-tier pipeline:
- INT4/INT3 quantization via a Rust extension (squish_quant_rs) with ARM NEON acceleration
- Block-level paged KV cache: KV state is chunked into fixed-size blocks for prefix reuse across sessions
- Quantization safeguards: squish hard-blocks INT3 on model families where it collapses (e.g. Gemma-3 loses ~15pp on common benchmarks); INT3 ships only for families that hold accuracy (Qwen3 specifically)
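To make the prefix-reuse idea concrete, here is a minimal sketch of how fixed-size blocks can be keyed so that a later prompt reuses cached state for a shared prefix. It is an illustration under our own assumptions, not squish's implementation: `BLOCK_SIZE`, `block_keys`, and `PrefixBlockCache` are hypothetical names, the block size is deliberately tiny, and a string stands in for the real KV tensors.

```python
import hashlib
from typing import Dict, List

BLOCK_SIZE = 4  # hypothetical; tiny so the example is easy to follow


def block_keys(tokens: List[int]) -> List[str]:
    """Return one chained hash per full block of tokens.

    Each key covers the whole prefix up to and including its block, so two
    prompts share a key only if they agree on every earlier token as well.
    """
    keys: List[str] = []
    prev = ""
    full = len(tokens) - len(tokens) % BLOCK_SIZE  # ignore the trailing partial block
    for start in range(0, full, BLOCK_SIZE):
        block = tokens[start:start + BLOCK_SIZE]
        payload = prev + "|" + ",".join(map(str, block))
        prev = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        keys.append(prev)
    return keys


class PrefixBlockCache:
    """Maps chained block keys to stored KV state (a placeholder string here)."""

    def __init__(self) -> None:
        self._blocks: Dict[str, str] = {}

    def store(self, tokens: List[int]) -> int:
        """Store every full block of `tokens`; return how many blocks were new."""
        added = 0
        for i, key in enumerate(block_keys(tokens)):
            if key not in self._blocks:
                self._blocks[key] = f"kv-state-for-block-{i}"  # stand-in for tensors
                added += 1
        return added

    def reusable_tokens(self, tokens: List[int]) -> int:
        """Number of leading tokens whose KV state is already cached."""
        hits = 0
        for key in block_keys(tokens):
            if key not in self._blocks:
                break  # the first miss ends the shared prefix
            hits += 1
        return hits * BLOCK_SIZE


cache = PrefixBlockCache()
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
cache.store(system_prompt + [9, 10])  # two full blocks stored, partial block skipped

print(cache.reusable_tokens(system_prompt + [42, 43, 44, 45]))  # 8
print(cache.reusable_tokens([9, 2, 3, 4, 5, 6, 7, 8]))  # 0
```

Chaining each key to the previous one is what makes a hit safe to reuse: a block matches only when the entire prefix before it matches too, which is why changing the first token in the second lookup invalidates every later block.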
Other projects
We also build squash, a security and EU AI Act compliance scanner for HuggingFace models. Independent codebase, related mission.
License
squish is BUSL-1.1. Compressed models inherit their base model's license: Qwen3 is Apache-2.0, Llama is the Llama Community License, etc. Check each model's card for specifics.
Requirements
- macOS 13.0 or later
- Apple Silicon (M1 / M2 / M3 / M4 / M5)
- Enough unified memory for the model (table above)
Intel Macs and Linux are not supported.
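The platform requirements above can be checked from a script before installing. The sketch below uses only Python's standard `platform` module; the function names are ours, and it does not check unified memory. One caveat we are fairly sure of but have not verified on every setup: a Python interpreter running under Rosetta reports `x86_64` even on Apple Silicon, so run the check with a native arm64 Python.

```python
import platform
from typing import Tuple

MIN_MACOS = (13, 0)  # from the requirements list above


def parse_version(version: str) -> Tuple[int, int]:
    """Turn '14.4.1' into (14, 4); missing or empty parts count as 0."""
    parts = [int(p) for p in version.split(".") if p.isdigit()]
    parts += [0, 0]
    return parts[0], parts[1]


def is_supported(system: str, machine: str, macos_version: str) -> bool:
    """True for macOS 13.0 or later on Apple Silicon."""
    return (
        system == "Darwin"
        and machine == "arm64"
        and parse_version(macos_version) >= MIN_MACOS
    )


def host_is_supported() -> bool:
    """Apply the check to the machine this script runs on."""
    return is_supported(
        platform.system(), platform.machine(), platform.mac_ver()[0]
    )


print("supported" if host_is_supported() else "not supported")
```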