<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Konjo AI</title>
<style>
:root { color-scheme: light dark; }
* { box-sizing: border-box; }
body {
margin: 0;
font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto,
Helvetica, Arial, sans-serif;
line-height: 1.6;
color: inherit;
background: transparent;
}
.wrap { max-width: 820px; margin: 0 auto; padding: 8px 4px 32px; }
h1 { font-size: 1.9rem; margin: 0 0 .25rem; }
h2 { font-size: 1.3rem; margin: 2rem 0 .5rem; }
h3 { font-size: 1.05rem; margin: 1.25rem 0 .4rem; }
p { margin: .5rem 0; }
a { color: #2563eb; text-decoration: none; }
a:hover { text-decoration: underline; }
ul { margin: .5rem 0; padding-left: 1.3rem; }
li { margin: .25rem 0; }
hr { border: none; border-top: 1px solid rgba(128,128,128,.3); margin: 1.5rem 0; }
code {
font-family: ui-monospace, SFMono-Regular, Menlo, Consolas, monospace;
font-size: .85em;
background: rgba(128,128,128,.15);
padding: .12em .35em;
border-radius: 4px;
}
pre {
background: rgba(128,128,128,.12);
border: 1px solid rgba(128,128,128,.2);
border-radius: 8px;
padding: .85rem 1rem;
overflow-x: auto;
}
pre code { background: none; padding: 0; }
table { border-collapse: collapse; width: 100%; margin: .75rem 0; font-size: .92rem; }
th, td { border: 1px solid rgba(128,128,128,.3); padding: .45rem .6rem; text-align: left; }
th { background: rgba(128,128,128,.12); }
.tagline { color: #6b7280; margin-top: 0; }
.links { font-size: .95rem; }
</style>
</head>
<body>
<div class="wrap">
<h1>Konjo AI</h1>
<p class="tagline">Local AI infrastructure for Apple Silicon. We make models
that already exist run faster on the hardware you already own.</p>
<p class="links"><a href="https://squish.run">squish.run</a> ·
<a href="https://github.com/konjoai">github.com/konjoai</a></p>
<hr />
<h2>squish: Local LLM inference for Apple Silicon</h2>
<p><a href="https://github.com/konjoai/squish">squish</a> is an MLX-based local
inference server with a block-level paged KV cache and INT3 quantization
support for the Qwen3 family. Benchmarked against Ollama on a 16 GB M3 MacBook:</p>
<ul>
<li><strong>5.4× faster</strong> end-to-end response on 4000-token prompts (12.78s vs 69.6s)</li>
<li><strong>1.5× faster</strong> end-to-end response on 75-token prompts (5.50s vs 8.09s)</li>
<li><strong>33% less RAM</strong> during inference (3.36 GB vs ~5 GB)</li>
<li><strong>INT3 support</strong> for Qwen3 with no measurable accuracy loss (Ollama doesn't ship INT3)</li>
</ul>
<p>The honest tradeoff: Ollama still wins on first-token latency for short prompts.
squish wins on total response time, and its lead grows with prompt length.</p>
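<p>The headline ratios follow directly from the raw figures quoted above. A
minimal sketch of the arithmetic; the helper names <code>speedup</code> and
<code>reduction_pct</code> are illustrative and not part of squish:</p>

```python
# Timings and memory figures quoted in the list above (squish, Ollama).
long_prompt_s = (12.78, 69.6)   # 4000-token prompt, end-to-end seconds
short_prompt_s = (5.50, 8.09)   # 75-token prompt, end-to-end seconds
ram_gb = (3.36, 5.0)            # RAM during inference; the Ollama figure is approximate


def speedup(squish, ollama):
    """How many times faster squish is: baseline time / squish time."""
    return ollama / squish


def reduction_pct(squish, ollama):
    """Percentage saved relative to the baseline."""
    return (1 - squish / ollama) * 100


print(f"{speedup(*long_prompt_s):.1f}x")    # long prompts
print(f"{speedup(*short_prompt_s):.1f}x")   # short prompts
print(f"{reduction_pct(*ram_gb):.0f}%")     # RAM saved
```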
<h3>Install</h3>
<pre><code>brew tap konjoai/squish && brew install squish
# or
pip install squish-ai</code></pre>
<h3>Use</h3>
<pre><code>squish pull konjoai/Qwen3-8B-squished
squish run Qwen3-8B-squished</code></pre>
<p class="links">
<a href="https://github.com/konjoai/squish/blob/main/docs/RESULTS.md">Full benchmarks</a> ·
<a href="https://github.com/konjoai/squish">Repo</a> ·
<a href="https://github.com/konjoai/squish/issues">Issues</a>
</p>
<hr />
<h2>Pre-Compressed Models</h2>
<p>This org hosts models pre-compressed by squish. Pull once, load instantly
every time after.</p>
<table>
<thead>
<tr><th>Model</th><th>Squish ID</th><th>Quantization</th><th>Disk size</th><th>Context</th></tr>
</thead>
<tbody>
<tr><td colspan="5"><em>Available after first publish batch</em></td></tr>
</tbody>
</table>
<p>The format is <code>mlx_lm</code>-compatible, so you can also load these models directly:</p>
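<pre><code>from mlx_lm import load, generate

# Fetches the weights on first use, then reuses the local copy.
model, tokenizer = load("konjoai/Qwen2.5-7B-Instruct-squished")
# max_tokens caps the length of the generated reply.
response = generate(model, tokenizer, prompt="Hello", max_tokens=100)
print(response)</code></pre>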
<hr />
<h2>How models are compressed</h2>
<p>squish uses a three-tier pipeline:</p>
<ul>
<li><strong>INT4/INT3 quantization</strong> via a Rust extension
(<code>squish_quant_rs</code>) with ARM NEON acceleration</li>
<li><strong>Block-level paged KV cache</strong>: KV state is chunked into
fixed-size blocks for prefix reuse across sessions</li>
<li><strong>Quantization safeguards</strong>: squish hard-blocks INT3 on
model families where it collapses (e.g. Gemma-3 loses about 15 percentage
points on common benchmarks); INT3 ships only for families that hold
accuracy (Qwen3 specifically)</li>
</ul>
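<p>To illustrate the idea behind block-level prefix reuse, here is a minimal
sketch. It is not squish's implementation: the block size, the chained
SHA-256 keys, and the <code>PrefixCache</code> class are all assumptions made
for the example, and plain strings stand in for real KV tensors.</p>

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical; kept tiny so the example is easy to follow


def block_keys(token_ids, block_size=BLOCK_SIZE):
    """Return one chained hash per full block of tokens.

    Each key covers its block and everything before it, so two prompts
    share a key only if they agree on the whole prefix up to that block.
    """
    keys, parent = [], b""
    full = len(token_ids) // block_size * block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        parent = hashlib.sha256(parent + repr(block).encode()).digest()
        keys.append(parent)
    return keys


class PrefixCache:
    """Maps block keys to stored KV blocks (placeholder strings here)."""

    def __init__(self):
        self.blocks = {}

    def insert(self, token_ids):
        for i, key in enumerate(block_keys(token_ids)):
            self.blocks.setdefault(key, f"kv-block-{i}")

    def reusable_tokens(self, token_ids):
        """Count leading tokens whose KV state is already cached."""
        hits = 0
        for key in block_keys(token_ids):
            if key not in self.blocks:
                break
            hits += 1
        return hits * BLOCK_SIZE


cache = PrefixCache()
system = [1, 2, 3, 4, 5, 6, 7, 8]        # shared system prompt, two blocks
cache.insert(system + [9, 10, 11, 12])   # first session fills the cache
print(cache.reusable_tokens(system + [20, 21, 22, 23]))  # second session: 8
```

<p>In this sketch the second session skips prefill for the eight shared
system-prompt tokens and only computes KV state for its own final block.</p>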
<hr />
<h2>Other projects</h2>
<p>We also build <a href="https://github.com/konjoai/squash">squash</a>, a
security and EU AI Act compliance scanner for HuggingFace models. Independent
codebase, related mission.</p>
<hr />
<h2>License</h2>
<p>squish is BUSL-1.1. Compressed models inherit their base model's license:
Qwen3 is Apache-2.0, Llama uses the Llama Community License, and so on. Check
each model's card for specifics.</p>
<hr />
<h2>Requirements</h2>
<ul>
<li>macOS 13.0 or later</li>
<li>Apple Silicon (M1 / M2 / M3 / M4 / M5)</li>
<li>Enough unified memory for the model (table above)</li>
</ul>
<p>Intel Macs and Linux are not supported.</p>
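<p>A quick way to check a machine against the first two requirements before
installing, using only Python's standard <code>platform</code> module. This
is an illustrative sketch, not a squish command; <code>meets_requirements</code>
is a hypothetical helper, and a Python interpreter running under Rosetta
reports <code>x86_64</code> even on Apple Silicon.</p>

```python
import platform


def parse_version(text):
    """Turn '13.4.1' into (13, 4, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def meets_requirements(system, machine, macos_version):
    """True for Apple Silicon Macs running macOS 13.0 or later."""
    return (
        system == "Darwin"
        and machine == "arm64"
        # Comparing the major version is enough for "13.0 or later".
        and parse_version(macos_version)[:1] >= (13,)
    )


if __name__ == "__main__":
    ok = meets_requirements(
        platform.system(), platform.machine(), platform.mac_ver()[0]
    )
    print("supported" if ok else "not supported")
```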
</div>
</body>
</html>