Spaces:

konjoai
/

README

Running

App Files Files Community

README / README.md

wscholl

feat: konjoai org README — squish-focused

a9e62d5 verified 3 days ago

preview code

raw

history blame contribute delete

3 kB

	---
	title: Konjo AI
	emoji: 🗜
	colorFrom: gray
	colorTo: blue
	sdk: static
	pinned: false
	---

	# Konjo AI

	Local AI infrastructure for Apple Silicon. We make models that already exist
	run faster on the hardware you already own.

	🌐 [squish.run](https://squish.run) · 💻 [github.com/konjoai](https://github.com/konjoai)

	---

	## squish — Local LLM inference for Apple Silicon

	[squish](https://github.com/konjoai/squish) is an MLX-based local inference
	server with a block-level paged KV cache and INT3 quantization support for the
	Qwen3 family. On a 16 GB M3 MacBook against Ollama:

	- 5.4× faster end-to-end response at 4000-token prompts (12.78s vs 69.6s)
	- 1.5× faster end-to-end on 75-token prompts (5.50s vs 8.09s)
	- 33% less RAM during inference (3.36 GB vs ~5 GB)
	- INT3 support for Qwen3 with no measurable accuracy loss (Ollama doesn't ship INT3)

	The honest tradeoff: Ollama still wins first-token latency on short prompts.
	squish wins when you care about total response time on real workloads.

	Install:
	```bash
	brew tap konjoai/squish && brew install squish
	# or
	pip install squish-ai
	```

	Use:
	```bash
	squish pull konjoai/Qwen3-8B-squished
	squish run Qwen3-8B-squished
	```

	[Full benchmarks](https://github.com/konjoai/squish/blob/main/docs/RESULTS.md) ·
	[Repo](https://github.com/konjoai/squish) ·
	[Issues](https://github.com/konjoai/squish/issues)

	---

	## Pre-Compressed Models

	This org hosts models pre-compressed by squish. Pull once, load instantly every
	time after.

	\| Model \| Squish ID \| Quantization \| Disk size \| Context \|
	\|---\|---\|---\|---\|---\|
	\| _Available after first publish batch_ \|

	The format is `mlx_lm`-compatible — you can also use these models directly:

	```python
	from mlx_lm import load, generate

	model, tokenizer = load("konjoai/Qwen2.5-7B-Instruct-squished")
	response = generate(model, tokenizer, prompt="Hello", max_tokens=100)
	print(response)
	```

	---

	## How models are compressed

	squish uses a three-tier pipeline:

	1. INT4/INT3 quantization via a Rust extension (`squish_quant_rs`) with ARM
	NEON acceleration
	2. Block-level paged KV cache — KV state is chunked into fixed-size blocks
	for prefix reuse across sessions
	3. Quantization safeguards — squish hard-blocks INT3 on model families where
	it collapses (e.g. Gemma-3 loses ~15pp on common benchmarks); INT3 ships only
	for families that hold accuracy (Qwen3 specifically)

	---

	## Other projects

	We also build [squash](https://github.com/konjoai/squash), a security and EU AI
	Act compliance scanner for HuggingFace models. Independent codebase, related
	mission.

	---

	## License

	squish is BUSL-1.1. Compressed models inherit their base model's license — Qwen3
	is Apache-2.0, Llama is the Llama Community License, etc. Check each model's card
	for specifics.

	---

	## Requirements

	- macOS 13.0 or later
	- Apple Silicon (M1 / M2 / M3 / M4 / M5)
	- Enough unified memory for the model (table above)

	Intel Macs and Linux are not supported.