| <!DOCTYPE html> |
| <html lang="en"> |
| <head> |
| <meta charset="utf-8" /> |
| <meta name="viewport" content="width=device-width, initial-scale=1" /> |
| <title>Konjo AI</title> |
| <style> |
| :root { color-scheme: light dark; } |
| * { box-sizing: border-box; } |
| body { |
| margin: 0; |
| font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, |
| Helvetica, Arial, sans-serif; |
| line-height: 1.6; |
| color: inherit; |
| background: transparent; |
| } |
| .wrap { max-width: 820px; margin: 0 auto; padding: 8px 4px 32px; } |
| h1 { font-size: 1.9rem; margin: 0 0 .25rem; } |
| h2 { font-size: 1.3rem; margin: 2rem 0 .5rem; } |
| h3 { font-size: 1.05rem; margin: 1.25rem 0 .4rem; } |
| p { margin: .5rem 0; } |
| a { color: #2563eb; text-decoration: none; } |
| a:hover { text-decoration: underline; } |
| ul { margin: .5rem 0; padding-left: 1.3rem; } |
| li { margin: .25rem 0; } |
| hr { border: none; border-top: 1px solid rgba(128,128,128,.3); margin: 1.5rem 0; } |
| code { |
| font-family: ui-monospace, SFMono-Regular, Menlo, Consolas, monospace; |
| font-size: .85em; |
| background: rgba(128,128,128,.15); |
| padding: .12em .35em; |
| border-radius: 4px; |
| } |
| pre { |
| background: rgba(128,128,128,.12); |
| border: 1px solid rgba(128,128,128,.2); |
| border-radius: 8px; |
| padding: .85rem 1rem; |
| overflow-x: auto; |
| } |
| pre code { background: none; padding: 0; } |
| table { border-collapse: collapse; width: 100%; margin: .75rem 0; font-size: .92rem; } |
| th, td { border: 1px solid rgba(128,128,128,.3); padding: .45rem .6rem; text-align: left; } |
| th { background: rgba(128,128,128,.12); } |
| .tagline { color: #6b7280; margin-top: 0; } |
| .links { font-size: .95rem; } |
| </style> |
| </head> |
| <body> |
| <div class="wrap"> |
|
|
| <h1>Konjo AI</h1> |
| <p class="tagline">Local AI infrastructure for Apple Silicon. We make models |
| that already exist run faster on the hardware you already own.</p> |
| <p class="links"><a href="https://squish.run">squish.run</a> · |
| <a href="https://github.com/konjoai">github.com/konjoai</a></p> |
|
|
| <hr /> |
|
|
| <h2>squish: Local LLM inference for Apple Silicon</h2> |
| <p><a href="https://github.com/konjoai/squish">squish</a> is an MLX-based local |
| inference server with a block-level paged KV cache and INT3 quantization |
| support for the Qwen3 family. On a 16 GB M3 MacBook against Ollama:</p> |
| <ul> |
| <li><strong>5.4× faster</strong> end-to-end response at 4000-token prompts (12.78s vs 69.6s)</li> |
| <li><strong>1.5× faster</strong> end-to-end on 75-token prompts (5.50s vs 8.09s)</li> |
| <li><strong>33% less RAM</strong> during inference (3.36 GB vs ~5 GB)</li> |
| <li><strong>INT3 support</strong> for Qwen3 with no measurable accuracy loss (Ollama doesn't ship INT3)</li> |
| </ul> |
| <p>The honest tradeoff: Ollama still wins first-token latency on short prompts. |
| squish wins when you care about total response time on real workloads.</p> |
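<p>The headline ratios follow from the raw figures quoted above. As a quick
sanity check (a sketch only; the variable and function names are ours, and
Ollama's RAM figure is the approximate 5 GB stated in the list):</p>
<pre><code># Raw figures quoted in the list above (16 GB M3 MacBook, squish vs Ollama).
long_prompt = {"squish_s": 12.78, "ollama_s": 69.6}   # 4000-token prompts
short_prompt = {"squish_s": 5.50, "ollama_s": 8.09}   # 75-token prompts
ram_gb = {"squish": 3.36, "ollama": 5.0}              # Ollama value is approximate


def speedup(timing):
    """How many times faster squish finishes than Ollama."""
    return timing["ollama_s"] / timing["squish_s"]


def ram_saving_pct(ram):
    """Percentage of Ollama's RAM use that squish avoids."""
    return 100 * (ram["ollama"] - ram["squish"]) / ram["ollama"]


print(f"long prompts:  {speedup(long_prompt):.1f}x")
print(f"short prompts: {speedup(short_prompt):.1f}x")
print(f"RAM saving:    {ram_saving_pct(ram_gb):.0f}%")</code></pre>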
|
|
| <h3>Install</h3> |
| <pre><code>brew tap konjoai/squish && brew install squish |
| # or |
| pip install squish-ai</code></pre> |
|
|
| <h3>Use</h3> |
| <pre><code>squish pull konjoai/Qwen3-8B-squished |
| squish run Qwen3-8B-squished</code></pre> |
|
|
| <p class="links"> |
| <a href="https://github.com/konjoai/squish/blob/main/docs/RESULTS.md">Full benchmarks</a> · |
| <a href="https://github.com/konjoai/squish">Repo</a> · |
| <a href="https://github.com/konjoai/squish/issues">Issues</a> |
| </p> |
|
|
| <hr /> |
|
|
| <h2>Pre-Compressed Models</h2> |
| <p>This org hosts models pre-compressed by squish. Pull once, load instantly |
| every time after.</p> |
| <table> |
| <thead> |
| <tr><th>Model</th><th>Squish ID</th><th>Quantization</th><th>Disk size</th><th>Context</th></tr> |
| </thead> |
| <tbody> |
| <tr><td colspan="5"><em>Available after first publish batch</em></td></tr> |
| </tbody> |
| </table> |
| <p>The format is <code>mlx_lm</code>-compatible, so you can also load these models directly:</p> |
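| <pre><code>from mlx_lm import load, generate |
|
| # Fetch the model on first use, then load its weights and tokenizer. |
| model, tokenizer = load("konjoai/Qwen2.5-7B-Instruct-squished") |
| # Generate up to 100 new tokens from the prompt. |
| response = generate(model, tokenizer, prompt="Hello", max_tokens=100) |
| print(response)</code></pre> |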
|
|
| <hr /> |
|
|
| <h2>How models are compressed</h2> |
| <p>squish uses a three-tier pipeline:</p> |
| <ul> |
| <li><strong>INT4/INT3 quantization</strong> via a Rust extension |
| (<code>squish_quant_rs</code>) with ARM NEON acceleration</li> |
| <li><strong>Block-level paged KV cache</strong>: KV state is chunked into |
| fixed-size blocks for prefix reuse across sessions</li> |
| <li><strong>Quantization safeguards</strong>: squish hard-blocks INT3 on |
| model families where it collapses (e.g. Gemma-3 loses about 15 percentage |
| points on common benchmarks); INT3 ships only for families that hold |
| accuracy (Qwen3 specifically)</li> |
| </ul> |
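<p>To illustrate the prefix-reuse idea behind a block-level paged KV cache,
here is a minimal Python sketch. It is not squish's implementation: the block
size, the <code>block_keys</code> and <code>PrefixCache</code> names, and the
hash-chaining scheme are all assumptions made for illustration. Each full block
of token IDs gets a key derived from its own tokens plus the previous block's
key, so two prompts that share a prefix produce the same leading keys and could
reuse the cached blocks.</p>
<pre><code>import hashlib

BLOCK_SIZE = 4  # hypothetical value; chosen small so the example is readable


def block_keys(token_ids, block_size=BLOCK_SIZE):
    """Return one chained hash key per full block of token IDs.

    Each key covers the block's tokens plus the previous key, so a key
    identifies the entire prefix up to and including that block.
    """
    keys = []
    prev = b""
    full_blocks = len(token_ids) // block_size
    for i in range(full_blocks):
        block = token_ids[i * block_size:(i + 1) * block_size]
        payload = prev + ",".join(str(t) for t in block).encode()
        prev = hashlib.sha256(payload).digest()
        keys.append(prev.hex())
    return keys


class PrefixCache:
    """Toy block store: maps a block key to a placeholder for its KV tensors."""

    def __init__(self):
        self.blocks = {}

    def reusable_blocks(self, token_ids):
        """Count the leading blocks of this prompt that are already cached."""
        count = 0
        for key in block_keys(token_ids):
            if key not in self.blocks:
                break
            count += 1
        return count

    def insert(self, token_ids):
        """Record every full block of this prompt as cached."""
        for key in block_keys(token_ids):
            self.blocks.setdefault(key, "kv-block-placeholder")


cache = PrefixCache()
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]        # two full blocks
cache.insert(system_prompt + [9, 10, 11, 12])    # first session: three blocks
hits = cache.reusable_blocks(system_prompt + [20, 21, 22, 23])
print(hits)  # number of shared prefix blocks found in the cache</code></pre>
<p>Chaining the keys means a block only matches when everything before it
matches too, which is what makes a hit safe to reuse as a prefix. A block with
the same tokens at a different position, or after a different prefix, gets a
different key.</p>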
|
|
| <hr /> |
|
|
| <h2>Other projects</h2> |
| <p>We also build <a href="https://github.com/konjoai/squash">squash</a>, a |
| security and EU AI Act compliance scanner for HuggingFace models. Independent |
| codebase, related mission.</p> |
|
|
| <hr /> |
|
|
| <h2>License</h2> |
| <p>squish is BUSL-1.1. Compressed models inherit their base model's license: |
| Qwen3 is Apache-2.0, Llama is the Llama Community License, etc. Check each |
| model's card for specifics.</p> |
|
|
| <hr /> |
|
|
| <h2>Requirements</h2> |
| <ul> |
| <li>macOS 13.0 or later</li> |
| <li>Apple Silicon (M1 / M2 / M3 / M4 / M5)</li> |
| <li>Enough unified memory for the model (see the Pre-Compressed Models table above)</li> |
| </ul> |
| <p>Intel Macs and Linux are not supported.</p> |
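<p>If you want to check a machine against the first two requirements before
installing, something like the following Python sketch should work. It uses
only the standard-library <code>platform</code> module; the
<code>meets_requirements</code> and <code>parse_version</code> helpers are
hypothetical names of ours, not part of squish. Note that a Python interpreter
running under Rosetta reports <code>x86_64</code>, so it would be reported as
unsupported here.</p>
<pre><code>import platform


def parse_version(text):
    """Turn a version string such as '14.5' into a tuple of at least two ints."""
    parts = [int(p) for p in text.split(".") if p.isdigit()]
    parts += [0] * (2 - len(parts))  # pad '13' to (13, 0); no-op when long enough
    return tuple(parts)


def meets_requirements(system, machine, mac_version):
    """Check the listed requirements: macOS 13.0 or later on Apple Silicon."""
    if system != "Darwin" or machine != "arm64":
        return False
    return parse_version(mac_version) >= (13, 0)


ok = meets_requirements(
    platform.system(),      # "Darwin" on macOS
    platform.machine(),     # "arm64" on Apple Silicon
    platform.mac_ver()[0],  # e.g. "14.5"; empty string when not on macOS
)
print("supported" if ok else "not supported")</code></pre>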
|
|
| </div> |
| </body> |
| </html> |
|
|