---
title: Konjo AI
emoji: 🗜
colorFrom: gray
colorTo: blue
sdk: static
pinned: false
---

# Konjo AI

Local AI infrastructure for Apple Silicon. We make models that already exist
run faster on the hardware you already own.

🌐 [squish.run](https://squish.run) · 💻 [github.com/konjoai](https://github.com/konjoai)

---

## squish — Local LLM inference for Apple Silicon

[squish](https://github.com/konjoai/squish) is an MLX-based local inference
server with a block-level paged KV cache and INT3 quantization support for the
Qwen3 family. On a 16 GB M3 MacBook against Ollama:

- **5.4× faster** end-to-end response at 4000-token prompts (12.78s vs 69.6s)
- **1.5× faster** end-to-end on 75-token prompts (5.50s vs 8.09s)
- **33% less RAM** during inference (3.36 GB vs ~5 GB)
- **INT3 support** for Qwen3 with no measurable accuracy loss (Ollama doesn't ship INT3)

The honest tradeoff: Ollama still wins first-token latency on short prompts.
squish wins when you care about total response time on real workloads.
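
The headline ratios follow from the timings quoted above. As a quick sanity check, the snippet below recomputes them from those figures alone; the only assumption is that Ollama's "~5 GB" is treated as exactly 5.0 GB.

```python
# Figures quoted in this README (16 GB M3 MacBook, squish vs Ollama).
LONG_PROMPT_S = {"squish": 12.78, "ollama": 69.6}  # 4000-token prompts, seconds
SHORT_PROMPT_S = {"squish": 5.50, "ollama": 8.09}  # 75-token prompts, seconds
RAM_GB = {"squish": 3.36, "ollama": 5.0}           # assumption: "~5 GB" taken as 5.0


def speedup(times):
    """How many times faster squish is, end to end."""
    return times["ollama"] / times["squish"]


def ram_saving_pct(ram):
    """Percentage of RAM saved relative to Ollama."""
    return 100 * (ram["ollama"] - ram["squish"]) / ram["ollama"]


print(f"long prompts:  {speedup(LONG_PROMPT_S):.1f}x faster")
print(f"short prompts: {speedup(SHORT_PROMPT_S):.1f}x faster")
print(f"RAM saved:     {ram_saving_pct(RAM_GB):.0f}%")
```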

**Install:**
```bash
brew tap konjoai/squish && brew install squish
# or
pip install squish-ai
```

**Use:**
```bash
squish pull konjoai/Qwen3-8B-squished
squish run Qwen3-8B-squished
```

[Full benchmarks](https://github.com/konjoai/squish/blob/main/docs/RESULTS.md) ·
[Repo](https://github.com/konjoai/squish) ·
[Issues](https://github.com/konjoai/squish/issues)

---

## Pre-Compressed Models

This org hosts models pre-compressed by squish. Pull once, load instantly every
time after.

| Model | Squish ID | Quantization | Disk size | Context |
|---|---|---|---|---|
| _Available after first publish batch_ | | | | |

The format is `mlx_lm`-compatible — you can also use these models directly:

```python
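from mlx_lm import load, generate

# Download (on first use) and load the pre-compressed weights and tokenizer.
model, tokenizer = load("konjoai/Qwen2.5-7B-Instruct-squished")

# Generate up to 100 new tokens from a plain-text prompt.
response = generate(model, tokenizer, prompt="Hello", max_tokens=100)
print(response)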
```

---

## How models are compressed

squish uses a three-tier pipeline:

1. **INT4/INT3 quantization** via a Rust extension (`squish_quant_rs`) with ARM
   NEON acceleration
2. **Block-level paged KV cache** — KV state is chunked into fixed-size blocks
   for prefix reuse across sessions
3. **Quantization safeguards** — squish hard-blocks INT3 on model families where
   it collapses (e.g. Gemma-3 loses ~15pp on common benchmarks); INT3 ships only
   for families that hold accuracy (Qwen3 specifically)
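
To illustrate the idea behind step 2, here is a minimal sketch of block-level prefix reuse. It is not squish's implementation: the `BlockCache` class, the `block_keys` helper, the block size of 4, and the chained SHA-256 keys are all assumptions made for the example, and the stored values are placeholder strings rather than real KV tensors.

```python
import hashlib


def block_keys(token_ids, block_size):
    """Return one key per full block of tokens.

    Each key hashes the block together with the previous block's hash, so a
    key matches only if the whole prefix up to that block is identical.
    """
    keys = []
    parent = b""
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + ",".join(map(str, block)).encode()).digest()
        keys.append(digest.hex())
        parent = digest
    return keys


class BlockCache:
    """Hypothetical cache mapping block keys to stored KV state."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self._blocks = {}  # key -> KV payload (placeholder strings here)

    def insert(self, token_ids):
        """Store every full block of a processed prompt."""
        for i, key in enumerate(block_keys(token_ids, self.block_size)):
            self._blocks.setdefault(key, f"kv-block-{i}")

    def match_prefix(self, token_ids):
        """Count the leading tokens whose KV blocks are already cached."""
        matched = 0
        for key in block_keys(token_ids, self.block_size):
            if key not in self._blocks:
                break
            matched += self.block_size
        return matched


cache = BlockCache(block_size=4)
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]

# First session: two full blocks are stored; the 2-token tail is not.
cache.insert(system_prompt + [9, 10])

# Second session shares the system prompt, so only the new tokens need prefill.
reused = cache.match_prefix(system_prompt + [11, 12, 13, 14])
print(f"reused {reused} of 12 tokens")
```

Chaining each key to its parent is one way to keep a block from being reused after a different prefix, which matters because KV state depends on every token before it.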

---

## Other projects

We also build [squash](https://github.com/konjoai/squash), a security and EU AI
Act compliance scanner for HuggingFace models. Independent codebase, related
mission.

---

## License

squish is licensed under BUSL-1.1. Compressed models inherit their base model's
license: Qwen3 is Apache-2.0, Llama models use the Llama Community License, and
so on. Check each model's card for specifics.

---

## Requirements

- macOS 13.0 or later
- Apple Silicon (M1 / M2 / M3 / M4 / M5)
- Enough unified memory for the model (see the table under Pre-Compressed Models for sizes)

Intel Macs and Linux are not supported.
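
If you want to check a machine before installing, the requirements above can be expressed with Python's standard `platform` module. This is a sketch rather than anything squish ships; `is_supported` and `parse_version` are hypothetical helper names. Note that a Python interpreter running under Rosetta reports `x86_64` even on Apple Silicon.

```python
import platform

MIN_MACOS = (13, 0)  # macOS 13.0 or later, per the list above


def parse_version(text):
    """Turn a version string such as '14.4.1' into a (major, minor) tuple."""
    nums = [int(p) for p in text.split(".")[:2] if p.isdigit()]
    nums += [0] * (2 - len(nums))  # pad '13' to (13, 0), '' to (0, 0)
    return tuple(nums)


def is_supported(system, machine, mac_version):
    """True only for macOS 13.0+ on an arm64 (Apple Silicon) interpreter."""
    return (
        system == "Darwin"
        and machine == "arm64"
        and parse_version(mac_version) >= MIN_MACOS
    )


ok = is_supported(platform.system(), platform.machine(), platform.mac_ver()[0])
print("supported" if ok else "not supported")
```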