	<!DOCTYPE html>
	<html lang="en">
	<head>
	<meta charset="utf-8" />
	<meta name="viewport" content="width=device-width, initial-scale=1" />
	<title>Konjo AI</title>
	<style>
	:root { color-scheme: light dark; }
	* { box-sizing: border-box; }
	body {
	margin: 0;
	font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto,
	Helvetica, Arial, sans-serif;
	line-height: 1.6;
	color: inherit;
	background: transparent;
	}
	.wrap { max-width: 820px; margin: 0 auto; padding: 8px 4px 32px; }
	h1 { font-size: 1.9rem; margin: 0 0 .25rem; }
	h2 { font-size: 1.3rem; margin: 2rem 0 .5rem; }
	h3 { font-size: 1.05rem; margin: 1.25rem 0 .4rem; }
	p { margin: .5rem 0; }
	a { color: #2563eb; text-decoration: none; }
	a:hover { text-decoration: underline; }
	ul { margin: .5rem 0; padding-left: 1.3rem; }
	li { margin: .25rem 0; }
	hr { border: none; border-top: 1px solid rgba(128,128,128,.3); margin: 1.5rem 0; }
	code {
	font-family: ui-monospace, SFMono-Regular, Menlo, Consolas, monospace;
	font-size: .85em;
	background: rgba(128,128,128,.15);
	padding: .12em .35em;
	border-radius: 4px;
	}
	pre {
	background: rgba(128,128,128,.12);
	border: 1px solid rgba(128,128,128,.2);
	border-radius: 8px;
	padding: .85rem 1rem;
	overflow-x: auto;
	}
	pre code { background: none; padding: 0; }
	table { border-collapse: collapse; width: 100%; margin: .75rem 0; font-size: .92rem; }
	th, td { border: 1px solid rgba(128,128,128,.3); padding: .45rem .6rem; text-align: left; }
	th { background: rgba(128,128,128,.12); }
	.tagline { color: #6b7280; margin-top: 0; }
	.links { font-size: .95rem; }
	</style>
	</head>
	<body>
	<div class="wrap">

	<h1>🗜 Konjo AI</h1>
	<p class="tagline">Local AI infrastructure for Apple Silicon. We make models
	that already exist run faster on the hardware you already own.</p>
	<p class="links">🌐 <a href="https://squish.run">squish.run</a> ·
	💻 <a href="https://github.com/konjoai">github.com/konjoai</a></p>

	<hr />

	<h2>squish — Local LLM inference for Apple Silicon</h2>
	<p><a href="https://github.com/konjoai/squish">squish</a> is an MLX-based local
	inference server with a block-level paged KV cache and INT3 quantization
	support for the Qwen3 family. On a 16 GB M3 MacBook against Ollama:</p>
	<ul>
	<li><strong>5.4× faster</strong> end-to-end on 4000-token prompts (12.78s vs 69.6s)</li>
	<li><strong>1.5× faster</strong> end-to-end on 75-token prompts (5.50s vs 8.09s)</li>
	<li><strong>33% less RAM</strong> during inference (3.36 GB vs ~5 GB)</li>
	<li><strong>INT3 support</strong> for Qwen3 with no measurable accuracy loss (Ollama doesn't ship INT3)</li>
	</ul>
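	<p>As a quick check, the headline ratios follow directly from the raw numbers
	in the list above. The helper names are illustrative only, and because the
	Ollama RAM figure is approximate (~5 GB), the RAM saving is approximate too:</p>
	<pre><code>def speedup(baseline_s, squish_s):
    """How many times faster squish finished than the baseline."""
    return baseline_s / squish_s


def reduction_pct(baseline, squish):
    """Percentage saved relative to the baseline."""
    return 100 * (baseline - squish) / baseline


print(f"4000-token prompt: {speedup(69.6, 12.78):.1f}x faster")
print(f"75-token prompt:   {speedup(8.09, 5.50):.1f}x faster")
print(f"RAM:               {reduction_pct(5.0, 3.36):.0f}% less")</code></pre>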
	<p>The honest tradeoff: Ollama still wins first-token latency on short prompts.
	squish wins on total response time, and its lead grows with prompt length
	(1.5× at 75 tokens, 5.4× at 4000).</p>

	<h3>Install</h3>
	<pre><code>brew tap konjoai/squish &amp;&amp; brew install squish
	# or
	pip install squish-ai</code></pre>

	<h3>Use</h3>
	<pre><code>squish pull konjoai/Qwen3-8B-squished
	squish run Qwen3-8B-squished</code></pre>

	<p class="links">
	<a href="https://github.com/konjoai/squish/blob/main/docs/RESULTS.md">Full benchmarks</a> ·
	<a href="https://github.com/konjoai/squish">Repo</a> ·
	<a href="https://github.com/konjoai/squish/issues">Issues</a>
	</p>

	<hr />

	<h2>Pre-Compressed Models</h2>
	<p>This org hosts models pre-compressed by squish. Pull once, load instantly
	every time after.</p>
	<table>
	<thead>
	<tr><th>Model</th><th>Squish ID</th><th>Quantization</th><th>Disk size</th><th>Context</th></tr>
	</thead>
	<tbody>
	<tr><td colspan="5"><em>Available after first publish batch</em></td></tr>
	</tbody>
	</table>
	<p>The format is <code>mlx_lm</code>-compatible — you can also use these models directly:</p>
	<pre><code>from mlx_lm import load, generate

	model, tokenizer = load("konjoai/Qwen2.5-7B-Instruct-squished")
	response = generate(model, tokenizer, prompt="Hello", max_tokens=100)
	print(response)</code></pre>

	<hr />

	<h2>How models are compressed</h2>
	<p>squish's pipeline has three parts:</p>
	<ul>
	<li><strong>INT4/INT3 quantization</strong> via a Rust extension
	(<code>squish_quant_rs</code>) with ARM NEON acceleration</li>
	<li><strong>Block-level paged KV cache</strong> — KV state is chunked into
	fixed-size blocks for prefix reuse across sessions</li>
	<li><strong>Quantization safeguards</strong> — squish hard-blocks INT3 on
	model families where accuracy collapses (e.g. Gemma-3 loses about 15
	percentage points on common benchmarks); INT3 ships only for families that
	hold accuracy (specifically Qwen3)</li>
	</ul>
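	<p>To make the block-level prefix reuse concrete, here is a minimal sketch in
	plain Python. It is not squish's actual implementation: the block size, the
	hashing scheme, and the names <code>block_keys</code> and
	<code>PrefixBlockCache</code> are all assumptions made for illustration, and
	the cached KV tensors are replaced by placeholders.</p>
	<pre><code>import hashlib

BLOCK_SIZE = 4  # tokens per block; hypothetical value chosen for illustration


def block_keys(token_ids, block_size=BLOCK_SIZE):
    """Return one key per full block; each key covers the whole prefix so far."""
    keys = []
    digest = hashlib.sha256()
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        digest.update(",".join(map(str, block)).encode() + b";")
        keys.append(digest.copy().hexdigest())
    return keys


class PrefixBlockCache:
    """Toy store mapping prefix keys to cached KV blocks (placeholders here)."""

    def __init__(self):
        self.blocks = {}

    def store(self, token_ids):
        """Record every full block of this sequence as cached."""
        for key in block_keys(token_ids):
            self.blocks.setdefault(key, "kv-block-placeholder")

    def reusable_tokens(self, token_ids):
        """Count leading tokens whose KV blocks are already cached."""
        hits = 0
        for key in block_keys(token_ids):
            if key not in self.blocks:
                break
            hits += 1
        return hits * BLOCK_SIZE


cache = PrefixBlockCache()
system_prompt = [11, 12, 13, 14, 15, 16, 17, 18]  # two full blocks
cache.store(system_prompt + [21, 22, 23])         # first request
# Second request shares the system prompt, so its first 8 tokens are reusable.
print(cache.reusable_tokens(system_prompt + [31, 32, 33, 34]))  # prints 8</code></pre>
	<p>Only full blocks are keyed, so a trailing partial block is always
	recomputed. Reuse across sessions would also require persisting the store,
	which this sketch leaves out.</p>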

	<hr />

	<h2>Other projects</h2>
	<p>We also build <a href="https://github.com/konjoai/squash">squash</a>, a
	security and EU AI Act compliance scanner for HuggingFace models. Independent
	codebase, related mission.</p>

	<hr />

	<h2>License</h2>
	<p>squish is BUSL-1.1. Compressed models inherit their base model's license —
	Qwen3 is Apache-2.0, Llama is the Llama Community License, etc. Check each
	model's card for specifics.</p>

	<hr />

	<h2>Requirements</h2>
	<ul>
	<li>macOS 13.0 or later</li>
	<li>Apple Silicon (M1 / M2 / M3 / M4 / M5)</li>
	<li>Enough unified memory for the model (see the Pre-Compressed Models table above)</li>
	</ul>
	<p>Intel Macs and Linux are not supported.</p>
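	<p>A pre-flight check along these lines can confirm the first two requirements
	before installing. It is a sketch that uses only Python's standard
	<code>platform</code> module; <code>is_supported</code> is a hypothetical
	helper, not part of squish, and it does not check available memory.</p>
	<pre><code>import platform


def is_supported(system, machine, mac_version):
    """Return True for Apple Silicon Macs on macOS 13.0 or later."""
    if system != "Darwin" or machine != "arm64":
        return False
    major = mac_version.split(".")[0]
    return major.isdigit() and int(major) >= 13


ok = is_supported(platform.system(), platform.machine(), platform.mac_ver()[0])
print("supported" if ok else "not supported")</code></pre>
	<p>A Python interpreter running under Rosetta reports <code>x86_64</code>, so
	this check would reject it even on Apple Silicon hardware.</p>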

	</div>
	</body>
	</html>