<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Konjo AI</title>
<style>
  :root { color-scheme: light dark; }
  * { box-sizing: border-box; }
  body {
    margin: 0;
    font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto,
                 Helvetica, Arial, sans-serif;
    line-height: 1.6;
    color: inherit;
    background: transparent;
  }
  .wrap { max-width: 820px; margin: 0 auto; padding: 8px 4px 32px; }
  h1 { font-size: 1.9rem; margin: 0 0 .25rem; }
  h2 { font-size: 1.3rem; margin: 2rem 0 .5rem; }
  h3 { font-size: 1.05rem; margin: 1.25rem 0 .4rem; }
  p { margin: .5rem 0; }
  a { color: #2563eb; text-decoration: none; }
  a:hover { text-decoration: underline; }
  ul { margin: .5rem 0; padding-left: 1.3rem; }
  li { margin: .25rem 0; }
  hr { border: none; border-top: 1px solid rgba(128,128,128,.3); margin: 1.5rem 0; }
  code {
    font-family: ui-monospace, SFMono-Regular, Menlo, Consolas, monospace;
    font-size: .85em;
    background: rgba(128,128,128,.15);
    padding: .12em .35em;
    border-radius: 4px;
  }
  pre {
    background: rgba(128,128,128,.12);
    border: 1px solid rgba(128,128,128,.2);
    border-radius: 8px;
    padding: .85rem 1rem;
    overflow-x: auto;
  }
  pre code { background: none; padding: 0; }
  table { border-collapse: collapse; width: 100%; margin: .75rem 0; font-size: .92rem; }
  th, td { border: 1px solid rgba(128,128,128,.3); padding: .45rem .6rem; text-align: left; }
  th { background: rgba(128,128,128,.12); }
  .tagline { color: #6b7280; margin-top: 0; }
  .links { font-size: .95rem; }
</style>
</head>
<body>
<div class="wrap">

  <h1>🗜 Konjo AI</h1>
  <p class="tagline">Local AI infrastructure for Apple Silicon. We make models
    that already exist run faster on the hardware you already own.</p>
  <p class="links">🌐 <a href="https://squish.run">squish.run</a> ·
    💻 <a href="https://github.com/konjoai">github.com/konjoai</a></p>

  <hr />

  <h2>squish: Local LLM inference for Apple Silicon</h2>
  <p><a href="https://github.com/konjoai/squish">squish</a> is an MLX-based local
    inference server with a block-level paged KV cache and INT3 quantization
    support for the Qwen3 family. On a 16 GB M3 MacBook against Ollama:</p>
  <ul>
    <li><strong>5.4× faster</strong> end-to-end response at 4000-token prompts (12.78s vs 69.6s)</li>
    <li><strong>1.5× faster</strong> end-to-end on 75-token prompts (5.50s vs 8.09s)</li>
    <li><strong>33% less RAM</strong> during inference (3.36 GB vs ~5 GB)</li>
    <li><strong>INT3 support</strong> for Qwen3 with no measurable accuracy loss (Ollama doesn't ship INT3)</li>
  </ul>
  <p>The honest tradeoff: Ollama still wins first-token latency on short prompts.
    squish wins on total response time, and its lead grows with prompt length
    (1.5× at 75 tokens, 5.4× at 4000 tokens).</p>

  <h3>Install</h3>
  <pre><code>brew tap konjoai/squish &amp;&amp; brew install squish
# or
pip install squish-ai</code></pre>

  <h3>Use</h3>
  <pre><code>squish pull konjoai/Qwen3-8B-squished
squish run Qwen3-8B-squished</code></pre>

  <p class="links">
    <a href="https://github.com/konjoai/squish/blob/main/docs/RESULTS.md">Full benchmarks</a> ·
    <a href="https://github.com/konjoai/squish">Repo</a> ·
    <a href="https://github.com/konjoai/squish/issues">Issues</a>
  </p>

  <hr />

  <h2>Pre-Compressed Models</h2>
  <p>This org hosts models pre-compressed by squish. Pull once, load instantly
    every time after.</p>
  <table>
    <thead>
      <tr><th>Model</th><th>Squish ID</th><th>Quantization</th><th>Disk size</th><th>Context</th></tr>
    </thead>
    <tbody>
      <tr><td colspan="5"><em>Available after first publish batch</em></td></tr>
    </tbody>
  </table>
  <p>The format is <code>mlx_lm</code>-compatible, so you can also load these models directly:</p>
  <pre><code>from mlx_lm import load, generate

model, tokenizer = load("konjoai/Qwen2.5-7B-Instruct-squished")
response = generate(model, tokenizer, prompt="Hello", max_tokens=100)
print(response)</code></pre>

  <hr />

  <h2>How models are compressed</h2>
  <p>squish uses a three-tier pipeline:</p>
  <ul>
    <li><strong>INT4/INT3 quantization</strong> via a Rust extension
      (<code>squish_quant_rs</code>) with ARM NEON acceleration</li>
    <li><strong>Block-level paged KV cache</strong>: KV state is chunked into
      fixed-size blocks for prefix reuse across sessions</li>
    <li><strong>Quantization safeguards</strong>: squish hard-blocks INT3 on
      model families where accuracy collapses (e.g. Gemma-3 loses ~15 percentage
      points on common benchmarks); INT3 ships only for families that hold
      accuracy (currently Qwen3)</li>
  </ul>
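  <p>To make the quantization and safeguard steps concrete, here is a minimal
    pure-Python sketch. It assumes symmetric absmax scaling over one small
    group of weights, which may differ from what <code>squish_quant_rs</code>
    actually does, and every name in it (<code>quantize_group</code>,
    <code>dequantize_group</code>, <code>check_allowed</code>,
    <code>INT3_ALLOWED_FAMILIES</code>) is hypothetical rather than part of
    squish's API.</p>
  <pre><code>def quantize_group(values, bits):
    """Symmetric absmax quantization of one group of floats to signed ints."""
    qmax = 2 ** (bits - 1) - 1            # 3 for INT3, 7 for INT4
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round to the nearest code and clamp into [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to approximate floats."""
    return [c * scale for c in codes]


# Hypothetical allow-list mirroring the safeguard described above.
INT3_ALLOWED_FAMILIES = {"qwen3"}


def check_allowed(family, bits):
    """Refuse INT3 for any family that is not on the allow-list."""
    if bits == 3 and family.lower() not in INT3_ALLOWED_FAMILIES:
        raise ValueError(f"INT3 is blocked for model family {family!r}")


weights = [0.7, -0.2, 0.1, -0.4]          # made-up values for the demo
check_allowed("qwen3", 3)                 # passes; "gemma3" would raise
codes3, scale3 = quantize_group(weights, bits=3)
codes4, scale4 = quantize_group(weights, bits=4)
err3 = max(abs(w - r) for w, r in zip(weights, dequantize_group(codes3, scale3)))
err4 = max(abs(w - r) for w, r in zip(weights, dequantize_group(codes4, scale4)))
print(codes3, codes4)
print(err3 > err4)</code></pre>
  <p>With only 7 levels per group, INT3 reconstructs these weights less
    precisely than INT4 does with 15, which is why a per-family check before
    shipping INT3 is a reasonable guard.</p>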

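  <p>The prefix-reuse idea behind a block-level KV cache can be sketched in a
    few lines. This is an illustration under stated assumptions, not squish's
    implementation: the block size, the chained SHA-256 keys, and the names
    <code>block_keys</code> and <code>BlockCache</code> are all hypothetical,
    and a string placeholder stands in for real KV tensors.</p>
  <pre><code>import hashlib

BLOCK_SIZE = 4  # tokens per block; hypothetical value chosen for the demo


def block_keys(token_ids, block_size=BLOCK_SIZE):
    """Return one chained hash per full block of the token sequence.

    Each key covers its block and everything before it, so two prompts
    share a key only when they share the whole prefix up to that block.
    """
    keys = []
    parent = b""
    full_blocks = len(token_ids) // block_size
    for i in range(full_blocks):
        block = token_ids[i * block_size:(i + 1) * block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        keys.append(parent.hex())
    return keys


class BlockCache:
    """Toy block store: maps a chained block key to a KV placeholder."""

    def __init__(self):
        self.blocks = {}

    def lookup_prefix(self, token_ids):
        """Count how many leading tokens already have cached blocks."""
        reused = 0
        for key in block_keys(token_ids):
            if key not in self.blocks:
                break
            reused += BLOCK_SIZE
        return reused

    def insert(self, token_ids):
        """Store a placeholder for every full block of the sequence."""
        for key in block_keys(token_ids):
            self.blocks.setdefault(key, "kv-state-placeholder")


cache = BlockCache()
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]          # two full blocks
cache.insert(system_prompt + [9, 10])             # first session
reused = cache.lookup_prefix(system_prompt + [11, 12, 13, 14])  # second session
print(reused)  # tokens whose KV state would not need recomputing</code></pre>
  <p>Because the second session shares the first two blocks, only the tokens
    after the shared prefix would need a fresh prefill in this toy model.</p>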
  <hr />

  <h2>Other projects</h2>
  <p>We also build <a href="https://github.com/konjoai/squash">squash</a>, a
    security and EU AI Act compliance scanner for HuggingFace models. Independent
    codebase, related mission.</p>

  <hr />

  <h2>License</h2>
  <p>squish is BUSL-1.1. Compressed models inherit their base model's license:
    Qwen3 is Apache-2.0, Llama is the Llama Community License, etc. Check each
    model's card for specifics.</p>

  <hr />

  <h2>Requirements</h2>
  <ul>
    <li>macOS 13.0 or later</li>
    <li>Apple Silicon (M1 / M2 / M3 / M4 / M5)</li>
    <li>Enough unified memory for the model (table above)</li>
  </ul>
  <p>Intel Macs and Linux are not supported.</p>
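  <p>For a rough sense of the memory requirement, the weights alone take about
    parameters × bits ÷ 8 bytes. The sketch below applies that arithmetic; the
    helper name <code>weight_footprint_gb</code> and the 8-billion parameter
    count are illustrative assumptions, and the result leaves out quantization
    scales, the KV cache, and runtime overhead, so treat it as a lower bound.</p>
  <pre><code>def weight_footprint_gb(num_params, bits_per_weight):
    """Approximate size of the quantized weights alone, in decimal GB."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumed 8e9 parameters; compare 16-bit, INT4, and INT3 storage.
for bits in (16, 4, 3):
    print(bits, weight_footprint_gb(8e9, bits))</code></pre>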

</div>
</body>
</html>