Local AI infrastructure for Apple Silicon. We make models that already exist run faster on the hardware you already own.
squish.run · github.com/konjoai
squish is an MLX-based local inference server with a block-level paged KV cache and INT3 quantization support for the Qwen3 family. On a 16 GB M3 MacBook against Ollama:
The honest tradeoff: Ollama still wins first-token latency on short prompts. squish wins when you care about total response time on real workloads.
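For readers unfamiliar with the term, here is a minimal sketch of what a block-level paged KV cache does in general. It is a toy illustration, not squish's implementation: the class name `PagedKVCache`, the `BLOCK_SIZE` value, and the method names are all assumptions made for this example. The idea is that each sequence holds a table of fixed-size blocks, so memory is reserved one block at a time instead of one contiguous buffer per sequence.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per block; illustrative value, not squish's setting


@dataclass
class PagedKVCache:
    """Toy block allocator (hypothetical, for illustration only)."""
    num_blocks: int
    free: list = field(default_factory=list)     # unused physical block ids
    tables: dict = field(default_factory=dict)   # seq_id -> list of block ids
    lengths: dict = field(default_factory=dict)  # seq_id -> tokens stored

    def __post_init__(self):
        self.free = list(range(self.num_blocks))

    def append_token(self, seq_id):
        """Reserve a slot for one token and return (block_id, offset)."""
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % BLOCK_SIZE == 0:  # no block yet, or the last block is full
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return table[-1], n % BLOCK_SIZE

    def release(self, seq_id):
        """Return all of a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

A real server would store key and value tensors at each `(block_id, offset)` slot; the sketch tracks only the bookkeeping.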
brew tap konjoai/squish && brew install squish
# or
pip install squish-ai
squish pull konjoai/Qwen3-8B-squished
squish run Qwen3-8B-squished
Full benchmarks · Repo · Issues
This org hosts models pre-compressed by squish. Pull once, load instantly every time after.
| Model | Squish ID | Quantization | Disk size | Context |
|---|---|---|---|---|
| Available after first publish batch | | | | |
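As background for the Quantization column, the sketch below shows one generic way 3-bit (INT3) weight quantization can work: weights are split into small groups, and each group stores 3-bit codes plus a scale and an offset. This is an assumption-laden illustration, not squish's actual scheme; the function names, the group size, and the min/max scaling rule are all chosen for this example.

```python
def quantize_int3(weights, group_size=4):
    """Group-wise asymmetric 3-bit quantization (illustrative only)."""
    groups = []
    for i in range(0, len(weights), group_size):
        g = weights[i:i + group_size]
        lo, hi = min(g), max(g)
        # 3 bits give 8 levels (codes 0..7); fall back to 1.0 for a flat group
        scale = (hi - lo) / 7 or 1.0
        codes = [round((w - lo) / scale) for w in g]
        groups.append((codes, scale, lo))
    return groups


def dequantize_int3(groups):
    """Reconstruct approximate weights from codes, scales, and offsets."""
    return [lo + c * scale for codes, scale, lo in groups for c in codes]
```

With this rule, each reconstructed weight differs from the original by at most half of its group's scale, which is why smaller groups trade extra metadata for lower error.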
The model format is mlx_lm-compatible, so you can also use these models directly:
from mlx_lm import load, generate
model, tokenizer = load("konjoai/Qwen2.5-7B-Instruct-squished")
response = generate(model, tokenizer, prompt="Hello", max_tokens=100)
print(response)
squish uses a three-tier pipeline:
squish_quant_rs) with ARM NEON acceleration
We also build squash, a security and EU AI Act compliance scanner for HuggingFace models. Independent codebase, related mission.
squish is licensed under BUSL-1.1. Compressed models inherit their base model's license: Qwen3 is Apache-2.0, Llama is the Llama Community License, etc. Check each model's card for specifics.
Intel Macs and Linux are not supported.