<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Konjo AI</title>
<style>
:root { color-scheme: light dark; }
* { box-sizing: border-box; }
body {
margin: 0;
font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto,
Helvetica, Arial, sans-serif;
line-height: 1.6;
color: inherit;
background: transparent;
}
.wrap { max-width: 820px; margin: 0 auto; padding: 8px 4px 32px; }
h1 { font-size: 1.9rem; margin: 0 0 .25rem; }
h2 { font-size: 1.3rem; margin: 2rem 0 .5rem; }
h3 { font-size: 1.05rem; margin: 1.25rem 0 .4rem; }
p { margin: .5rem 0; }
a { color: #2563eb; text-decoration: none; }
a:hover { text-decoration: underline; }
ul { margin: .5rem 0; padding-left: 1.3rem; }
li { margin: .25rem 0; }
hr { border: none; border-top: 1px solid rgba(128,128,128,.3); margin: 1.5rem 0; }
code {
font-family: ui-monospace, SFMono-Regular, Menlo, Consolas, monospace;
font-size: .85em;
background: rgba(128,128,128,.15);
padding: .12em .35em;
border-radius: 4px;
}
pre {
background: rgba(128,128,128,.12);
border: 1px solid rgba(128,128,128,.2);
border-radius: 8px;
padding: .85rem 1rem;
overflow-x: auto;
}
pre code { background: none; padding: 0; }
table { border-collapse: collapse; width: 100%; margin: .75rem 0; font-size: .92rem; }
th, td { border: 1px solid rgba(128,128,128,.3); padding: .45rem .6rem; text-align: left; }
th { background: rgba(128,128,128,.12); }
.tagline { color: #6b7280; margin-top: 0; }
.links { font-size: .95rem; }
</style>
</head>
<body>
<div class="wrap">
<h1>🗜 Konjo AI</h1>
<p class="tagline">Local AI infrastructure for Apple Silicon. We make models
that already exist run faster on the hardware you already own.</p>
<p class="links">🌐 <a href="https://squish.run">squish.run</a> ·
💻 <a href="https://github.com/konjoai">github.com/konjoai</a></p>
<hr />
<h2>squish: Local LLM inference for Apple Silicon</h2>
<p><a href="https://github.com/konjoai/squish">squish</a> is an MLX-based local
inference server with a block-level paged KV cache and INT3 quantization
support for the Qwen3 family. On a 16 GB M3 MacBook against Ollama:</p>
<ul>
<li><strong>5.4× faster</strong> end-to-end response at 4000-token prompts (12.78s vs 69.6s)</li>
<li><strong>1.5× faster</strong> end-to-end on 75-token prompts (5.50s vs 8.09s)</li>
<li><strong>33% less RAM</strong> during inference (3.36 GB vs ~5 GB)</li>
<li><strong>INT3 support</strong> for Qwen3 with no measurable accuracy loss (Ollama doesn't ship INT3)</li>
</ul>
<p>The honest tradeoff: Ollama still wins first-token latency on short prompts.
squish wins when you care about total response time on real workloads.</p>
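<p>The headline ratios in the list above follow directly from the raw timings and memory figures quoted alongside them. The short sketch below only redoes that arithmetic; the variable and function names are illustrative, and the Ollama RAM figure is taken as exactly 5 GB even though the source gives it as approximate.</p>

```python
# Raw figures quoted in the list above (16 GB M3 MacBook, squish vs Ollama).
long_prompt = {"squish_s": 12.78, "ollama_s": 69.6}   # 4000-token prompts
short_prompt = {"squish_s": 5.50, "ollama_s": 8.09}   # 75-token prompts
ram_gb = {"squish": 3.36, "ollama": 5.0}              # assumption: "~5 GB" treated as 5.0


def speedup(baseline_s, candidate_s):
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


def percent_saved(baseline, candidate):
    """Relative reduction of the candidate versus the baseline, in percent."""
    return (1 - candidate / baseline) * 100


long_speedup = speedup(long_prompt["ollama_s"], long_prompt["squish_s"])
short_speedup = speedup(short_prompt["ollama_s"], short_prompt["squish_s"])
ram_saving = percent_saved(ram_gb["ollama"], ram_gb["squish"])

print(f"4000-token prompts: {long_speedup:.1f}x faster")
print(f"75-token prompts:   {short_speedup:.1f}x faster")
print(f"RAM during inference: {ram_saving:.0f}% less")
```

<p>Rounded to one decimal place, the two timing ratios come out at 5.4 and 1.5, and the memory saving rounds to 33%, matching the figures in the list.</p>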
<h3>Install</h3>
<pre><code>brew tap konjoai/squish &amp;&amp; brew install squish
# or
pip install squish-ai</code></pre>
<h3>Use</h3>
<pre><code>squish pull konjoai/Qwen3-8B-squished
squish run Qwen3-8B-squished</code></pre>
<p class="links">
<a href="https://github.com/konjoai/squish/blob/main/docs/RESULTS.md">Full benchmarks</a> ·
<a href="https://github.com/konjoai/squish">Repo</a> ·
<a href="https://github.com/konjoai/squish/issues">Issues</a>
</p>
<hr />
<h2>Pre-Compressed Models</h2>
<p>This org hosts models pre-compressed by squish. Pull once, load instantly
every time after.</p>
<table>
<thead>
<tr><th>Model</th><th>Squish ID</th><th>Quantization</th><th>Disk size</th><th>Context</th></tr>
</thead>
<tbody>
<tr><td colspan="5"><em>Available after first publish batch</em></td></tr>
</tbody>
</table>
<p>The format is <code>mlx_lm</code>-compatible, so you can also load these models directly:</p>
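<pre><code>from mlx_lm import load, generate

# Fetches the repo from the Hub on first use, then loads it from the local cache.
model, tokenizer = load("konjoai/Qwen2.5-7B-Instruct-squished")

# Generate up to 100 new tokens for a plain-text prompt.
response = generate(model, tokenizer, prompt="Hello", max_tokens=100)
print(response)</code></pre>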
<hr />
<h2>How models are compressed</h2>
<p>squish uses a three-tier pipeline:</p>
<ul>
<li><strong>INT4/INT3 quantization</strong> via a Rust extension
(<code>squish_quant_rs</code>) with ARM NEON acceleration</li>
<li><strong>Block-level paged KV cache</strong>: KV state is chunked into
fixed-size blocks for prefix reuse across sessions</li>
<li><strong>Quantization safeguards</strong>: squish hard-blocks INT3 on
model families where accuracy collapses (e.g. Gemma-3 loses about 15
percentage points on common benchmarks); INT3 ships only for families that
hold accuracy (currently Qwen3)</li>
</ul>
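<p>To make the block-level idea concrete, here is a minimal sketch of prefix reuse over fixed-size blocks. It is an illustration under stated assumptions, not squish's implementation: <code>BLOCK_SIZE</code>, <code>block_keys</code>, and <code>PrefixBlockCache</code> are hypothetical names, the block size of 4 is chosen only to keep the demo small, and a string placeholder stands in for real KV tensors. Each block is keyed by a hash of every token up to and including that block, so two prompts share a cached block only when their whole prefix matches.</p>

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical tokens-per-block, kept tiny for the demo


def block_keys(token_ids, block_size=BLOCK_SIZE):
    """Return one key per full block; each key depends on all tokens up to that block's end."""
    keys = []
    running = hashlib.sha256()
    full = len(token_ids) - len(token_ids) % block_size  # ignore the trailing partial block
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        running.update((",".join(map(str, block)) + ";").encode())
        keys.append(running.hexdigest())
    return keys


class PrefixBlockCache:
    """Toy store mapping a prefix key to a placeholder for that block's KV state."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = {}

    def insert(self, token_ids):
        """Record every full block of a processed prompt."""
        for index, key in enumerate(block_keys(token_ids, self.block_size)):
            self.blocks.setdefault(key, f"kv-block-{index}")

    def reusable_tokens(self, token_ids):
        """Count the leading tokens whose KV state is already cached."""
        matched = 0
        for key in block_keys(token_ids, self.block_size):
            if key not in self.blocks:
                break
            matched += self.block_size
        return matched


cache = PrefixBlockCache()
system_prompt = [11, 12, 13, 14, 15, 16, 17, 18]  # 8 tokens, i.e. 2 full blocks
cache.insert(system_prompt + [21, 22, 23])         # first session
reused = cache.reusable_tokens(system_prompt + [31, 32, 33, 34, 35])  # second session
print(reused)
```

<p>In this toy run the second session shares the 8-token system prompt with the first, so both of its leading blocks are found in the cache and only the remaining tokens would need a fresh prefill. A prompt that differs in its first token matches nothing, because every later key depends on it.</p>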
<hr />
<h2>Other projects</h2>
<p>We also build <a href="https://github.com/konjoai/squash">squash</a>, a
security and EU AI Act compliance scanner for HuggingFace models. Independent
codebase, related mission.</p>
<hr />
<h2>License</h2>
<p>squish is licensed under BUSL-1.1. Compressed models inherit their base
model's license: Qwen3 is Apache-2.0, Llama uses the Llama Community License,
and so on. Check each model's card for specifics.</p>
<hr />
<h2>Requirements</h2>
<ul>
<li>macOS 13.0 or later</li>
<li>Apple Silicon (M1 / M2 / M3 / M4 / M5)</li>
<li>Enough unified memory for the model (see the Pre-Compressed Models table above for sizes)</li>
</ul>
<p>Intel Macs and Linux are not supported.</p>
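<p>If you want to check a machine against the list above before installing, a small script along these lines can do it. This is a sketch using only Python's standard <code>platform</code> module; the function names are hypothetical, it is not how squish itself validates the host, and it does not check unified memory.</p>

```python
import platform


def meets_requirements(system, machine, macos_version):
    """True when the host is Apple Silicon running macOS 13.0 or later."""
    if system != "Darwin" or machine != "arm64":
        return False  # Intel Macs and Linux are not supported
    major_text = macos_version.split(".")[0]
    major = int(major_text) if major_text.isdigit() else 0
    return major >= 13


def host_meets_requirements():
    """Apply the check to the current machine."""
    # Caveat: a Python build running under Rosetta reports x86_64 even on Apple Silicon.
    return meets_requirements(platform.system(), platform.machine(), platform.mac_ver()[0])


print("supported" if host_meets_requirements() else "not supported")
```

<p>Keeping the decision in a pure function that takes plain strings makes it easy to test on any machine, including ones that are not Macs.</p>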
</div>
</body>
</html>