---
title: Konjo AI
emoji: 🗜
colorFrom: gray
colorTo: blue
sdk: static
pinned: false
---

# Konjo AI

Local AI infrastructure for Apple Silicon. We make models that already exist run faster on the hardware you already own.

🌐 [squish.run](https://squish.run) · 💻 [github.com/konjoai](https://github.com/konjoai)

---

## squish: Local LLM inference for Apple Silicon

[squish](https://github.com/konjoai/squish) is an MLX-based local inference server with a block-level paged KV cache and INT3 quantization support for the Qwen3 family. On a 16 GB M3 MacBook against Ollama:

- **5.4× faster** end-to-end response at 4000-token prompts (12.78s vs 69.6s)
- **1.5× faster** end-to-end on 75-token prompts (5.50s vs 8.09s)
- **33% less RAM** during inference (3.36 GB vs ~5 GB)
- **INT3 support** for Qwen3 with no measurable accuracy loss (Ollama doesn't ship INT3)

The honest tradeoff: Ollama still wins first-token latency on short prompts. squish wins when you care about total response time on real workloads.

**Install:**

```bash
brew tap konjoai/squish && brew install squish
# or
pip install squish-ai
```

**Use:**

```bash
squish pull konjoai/Qwen3-8B-squished
squish run Qwen3-8B-squished
```

[Full benchmarks](https://github.com/konjoai/squish/blob/main/docs/RESULTS.md) · [Repo](https://github.com/konjoai/squish) · [Issues](https://github.com/konjoai/squish/issues)

---

## Pre-Compressed Models

This org hosts models pre-compressed by squish. Pull once, load instantly every time after.

| Model | Squish ID | Quantization | Disk size | Context |
|---|---|---|---|---|
| _Available after first publish batch_ | | | | |

The format is `mlx_lm`-compatible, so you can also use these models directly:

```python
from mlx_lm import load, generate

model, tokenizer = load("konjoai/Qwen2.5-7B-Instruct-squished")
response = generate(model, tokenizer, prompt="Hello", max_tokens=100)
print(response)
```

---

## How models are compressed

squish uses a three-tier pipeline:

1. **INT4/INT3 quantization** via a Rust extension (`squish_quant_rs`) with ARM NEON acceleration
2. **Block-level paged KV cache**: KV state is chunked into fixed-size blocks for prefix reuse across sessions
3. **Quantization safeguards**: squish hard-blocks INT3 on model families where it collapses (e.g. Gemma-3 loses ~15pp on common benchmarks); INT3 ships only for families that hold accuracy (Qwen3 specifically)

---

## Other projects

We also build [squash](https://github.com/konjoai/squash), a security and EU AI Act compliance scanner for HuggingFace models. Independent codebase, related mission.

---

## License

squish is BUSL-1.1. Compressed models inherit their base model's license: Qwen3 is Apache-2.0, Llama is the Llama Community License, etc. Check each model's card for specifics.

---

## Requirements

- macOS 13.0 or later
- Apple Silicon (M1 / M2 / M3 / M4 / M5)
- Enough unified memory for the model (table above)

Intel Macs and Linux are not supported.
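
## Sketch: block-level prefix reuse

The "How models are compressed" section says KV state is chunked into fixed-size blocks for prefix reuse across sessions. The toy below illustrates that general idea only; it is not squish's implementation. `BLOCK_SIZE`, `block_keys`, and `PrefixBlockCache` are hypothetical names, the block size is picked for readability, and strings stand in for real key/value tensors. Each full block gets a chained hash covering everything before it, so a later session can reuse blocks only while its prompt prefix matches.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; hypothetical value chosen for readability


def block_keys(tokens, block_size=BLOCK_SIZE):
    """Return one chained hash per full block of tokens.

    Each key covers its block and every block before it, so equal keys
    imply an identical prefix up to the end of that block.
    """
    keys = []
    running = hashlib.sha256()
    full = len(tokens) - len(tokens) % block_size  # ignore the partial tail
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        running.update(",".join(map(str, block)).encode() + b";")
        keys.append(running.copy().hexdigest())
    return keys


class PrefixBlockCache:
    """Maps chained block keys to stand-in KV payloads (illustration only)."""

    def __init__(self):
        self._blocks = {}

    def store(self, tokens):
        """Record every full block of a processed prompt."""
        for index, key in enumerate(block_keys(tokens)):
            # A real server would keep key/value tensors here, not a string.
            self._blocks.setdefault(key, f"kv-block-{index}")

    def reusable_tokens(self, tokens):
        """Count leading tokens whose blocks are already cached."""
        hits = 0
        for key in block_keys(tokens):
            if key not in self._blocks:
                break  # the prefix diverged; later blocks cannot be reused
            hits += 1
        return hits * BLOCK_SIZE


cache = PrefixBlockCache()
system_prompt = [101, 7, 7, 42, 13, 99, 5, 8]  # exactly two full blocks
cache.store(system_prompt + [1, 2, 3])         # first session
# Second session shares the system prompt, then diverges.
print(cache.reusable_tokens(system_prompt + [9, 9, 9, 9]))  # prints 8
```

Chaining the hash is one way to make a block's identity depend on its whole prefix rather than on its own tokens alone, which matters because the KV state for a token depends on everything that precedes it.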
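
## Sketch: an INT3 allowlist check

The quantization safeguard described under "How models are compressed" amounts to refusing INT3 unless a model family is known to hold accuracy. A minimal sketch of such a guard follows, assuming a simple allowlist keyed by family name. `INT3_ALLOWED_FAMILIES`, `QuantizationBlocked`, and `check_quantization` are hypothetical names, not squish's API; only the Qwen3 (allowed) and Gemma-3 (blocked) cases come from this README.

```python
# Hypothetical allowlist: the README names Qwen3 as the family that ships INT3.
INT3_ALLOWED_FAMILIES = {"qwen3"}
SUPPORTED_BITS = (3, 4)  # the README mentions INT4 and INT3 only


class QuantizationBlocked(ValueError):
    """Raised when a bit width is refused for a model family."""


def check_quantization(family: str, bits: int) -> int:
    """Return `bits` if the combination is allowed, otherwise raise."""
    name = family.strip().lower()
    if bits not in SUPPORTED_BITS:
        raise ValueError(f"unsupported bit width: {bits}")
    if bits == 3 and name not in INT3_ALLOWED_FAMILIES:
        # Hard block rather than warn, mirroring the behaviour described above.
        raise QuantizationBlocked(f"INT3 is blocked for family '{family}'")
    return bits


print(check_quantization("Qwen3", 3))    # allowed
print(check_quantization("Gemma-3", 4))  # INT4 is not restricted in this sketch
try:
    check_quantization("Gemma-3", 3)
except QuantizationBlocked as err:
    print(f"refused: {err}")
```

An allowlist fails closed: a family nobody has evaluated yet is refused INT3 by default, which matches the statement that INT3 ships only for families that hold accuracy.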