wscholl committed on
Commit
dfd0d71
·
verified ·
1 Parent(s): 414d18a

Update README.md

Files changed (1)
  1. README.md +30 -53
README.md CHANGED
@@ -1,33 +1,20 @@
1
- ---
2
- title: README
3
- emoji: 🔥
4
- colorFrom: blue
5
- colorTo: red
6
- sdk: static
7
- pinned: false
8
- license: mit
9
- ---
10
- <div align="center">
11
- <img src="https://raw.githubusercontent.com/wesleyscholl/squish/main/assets/squish-logo-1.png" width="330" alt="Squish" />
12
-
13
- <h3>Pre-compressed models for Apple Silicon. Load in under a second. Run fully local.</h3>
14
 
15
- [![GitHub](https://img.shields.io/badge/GitHub-wesleyscholl%2Fsquish-black?logo=github)](https://github.com/wesleyscholl/squish)
16
- [![License](https://img.shields.io/badge/license-MIT-green)](https://github.com/wesleyscholl/squish/blob/main/LICENSE)
17
- [![Platform](https://img.shields.io/badge/platform-Apple%20Silicon%20M1–M5-lightgrey?logo=apple)](https://github.com/wesleyscholl/squish)
18
-
19
- </div>
20
 
21
  ---
22
 
23
  ## What is this?
24
 
25
- This organization hosts models pre-compressed by [Squish](https://github.com/wesleyscholl/squish) — a local inference engine for Apple Silicon that gets models off disk and into memory in under a second.
26
 
27
  Every model here was compressed with Squish's INT4 quantization pipeline and is ready to load directly with `squish pull`. No setup, no Python environment, no cloud.
28
 
29
  ```bash
30
- brew install wesleyscholl/squish/squish
 
 
31
  squish pull qwen3:8b
32
  squish run qwen3:8b
33
  ```
@@ -39,39 +26,37 @@ squish run qwen3:8b
39
  Compression takes time. A Qwen3-8B model compresses in roughly 8 minutes on an M3. You shouldn't have to wait for that on first run. Models in this org are pre-compressed and validated — pull once, load instantly every time after.
40
 
41
  | Format | What it means |
42
- |---|---|
43
- | `*-squished` | INT4-compressed, ready for `squish run` |
44
- | `*-squished-int8` | INT8-compressed, higher quality, larger |
45
 
46
  ---
47
 
48
  ## Available models
49
 
50
  | Model | Squish ID | Raw size | Squished size | Context |
51
- |---|---|---|---|---|
52
  | Qwen3-8B | `qwen3:8b` | 16.4 GB | 4.4 GB | 128k |
53
  | Qwen3-4B | `qwen3:4b` | 8.2 GB | 2.2 GB | 32k |
54
- | Qwen3-1.7B | `qwen3:1.7b` | 3.5 GB | 1.0 GB | 32k |
55
  | Qwen2.5-7B-Instruct | `qwen2.5:7b` | 14.4 GB | 3.9 GB | 128k |
56
  | Qwen2.5-1.5B-Instruct | `qwen2.5:1.5b` | 3.1 GB | 0.9 GB | 32k |
57
  | Llama-3.2-3B-Instruct | `llama3.2:3b` | 6.4 GB | 1.7 GB | 128k |
 
58
  | Gemma-3-4B-Instruct | `gemma3:4b` | 9.8 GB | 2.6 GB | 128k |
59
- | DeepSeek-R1-Distill-7B | `deepseek-r1:7b` | 14.4 GB | 3.9 GB | 128k |
60
 
61
- More models added as the catalog grows. Check `squish catalog` for the full list.
62
 
63
  ---
64
 
65
  ## Load time comparison (M3 16GB)
66
 
67
  | Model | Squish (INT4) | Ollama | llama.cpp |
68
- |---|---|---|---|
69
- | Qwen3-8B | **0.43s** | 4.2s | 6.1s |
70
- | Llama-3.2-3B | **0.33s** | 1.8s | 2.4s |
71
-
72
- *Measured cold-start on Apple M3 16GB. Results will vary by chip and storage.*
73
 
74
- > ⚠️ Benchmark figures above are from internal testing. Full reproducible benchmark methodology coming soon — see [GitHub issue #benchmark](https://github.com/wesleyscholl/squish) for status.
75
 
76
  ---
77
 
@@ -83,7 +68,6 @@ Squish runs a local server on port 11435. Any OpenAI client works out of the box
83
  from openai import OpenAI
84
 
85
  client = OpenAI(base_url="http://localhost:11435/v1", api_key="squish")
86
-
87
  response = client.chat.completions.create(
88
      model="qwen3:8b",
89
      messages=[{"role": "user", "content": "Explain attention mechanisms briefly."}]
@@ -92,7 +76,7 @@ print(response.choices[0].message.content)
92
  ```
93
 
94
  ```bash
95
- # Or just point your existing tools at it
96
  export OPENAI_BASE_URL=http://localhost:11435/v1
97
  export OPENAI_API_KEY=squish
98
  ```
@@ -103,22 +87,20 @@ export OPENAI_API_KEY=squish
103
 
104
  Squish uses a three-tier compression pipeline:
105
 
106
- 1. **INT4 quantization** via a Rust extension (`squish_quant_rs`) with ARM NEON acceleration — 8–12 GB/s throughput on Apple Silicon
107
- 2. **Compressed weight loader** — weights stay compressed on disk and decompress directly into Metal-mapped memory at load time
108
- 3. **KV cache quantization** — attention cache stored at reduced precision during generation, not just weights
109
 
110
  The result is a model that fits in memory on a base M3 16GB and loads faster than Ollama can parse its configuration.
111
 
112
  ---
113
 
114
- ## Using models directly
115
-
116
- You can also load these models with `mlx_lm` if you want to use them outside of Squish:
117
 
118
  ```python
119
  from mlx_lm import load, generate
120
 
121
- model, tokenizer = load("squish-community/Qwen3-8B-squished")
122
  response = generate(model, tokenizer, prompt="Hello", max_tokens=100)
123
  ```
124
 
@@ -128,25 +110,20 @@ response = generate(model, tokenizer, prompt="Hello", max_tokens=100)
128
 
129
  - macOS 13.0 or later
130
  - Apple Silicon (M1, M2, M3, M4, M5)
131
- - Enough unified memory for the model (check the table above)
132
 
133
- Intel Macs and Linux are not supported. Windows is not planned.
134
 
135
  ---
136
 
137
  ## Links
138
 
139
- - **CLI and inference engine**: [github.com/wesleyscholl/squish](https://github.com/wesleyscholl/squish)
140
- - **Install**: `brew install wesleyscholl/squish/squish`
141
- - **Issues and discussions**: [GitHub Issues](https://github.com/wesleyscholl/squish/issues)
142
- - **Discord**: [discord.gg/squish](https://discord.gg/FqzqeJCuh)
143
 
144
  ---
145
 
146
- <div align="center">
147
-
148
  *Squish it. Run it. Go.*
149
 
150
- Built by [Konjo AI](https://github.com/wesleyscholl) · MIT License
151
-
152
- </div>
 
1
+ # Squish
2
+ Pre-compressed models for Apple Silicon. Load in under a second. Run fully local.
3
 
4
+ [GitHub](https://github.com/konjoai/squish) &nbsp;·&nbsp; [MIT License](https://github.com/konjoai/squish/blob/main/LICENSE) &nbsp;·&nbsp; [squish.run](https://squish.run)
5
 
6
  ---
7
 
8
  ## What is this?
9
 
10
+ This organization hosts models pre-compressed by **Squish** — a local inference engine for Apple Silicon that gets models off disk and into memory in under a second.
11
 
12
  Every model here was compressed with Squish's INT4 quantization pipeline and is ready to load directly with `squish pull`. No setup, no Python environment, no cloud.
13
 
14
  ```bash
15
+ brew tap konjoai/squish
16
+ brew trust konjoai/squish
17
+ brew install squish
18
  squish pull qwen3:8b
19
  squish run qwen3:8b
20
  ```
 
26
  Compression takes time. A Qwen3-8B model compresses in roughly 8 minutes on an M3. You shouldn't have to wait for that on first run. Models in this org are pre-compressed and validated — pull once, load instantly every time after.
27
 
28
  | Format | What it means |
29
+ |--------|--------------|
30
+ | `*-bf16-squished` | INT4-compressed, ready for `squish run` |
 
31
 
32
  ---
33
 
34
  ## Available models
35
 
36
  | Model | Squish ID | Raw size | Squished size | Context |
37
+ |-------|-----------|----------|---------------|---------|
38
  | Qwen3-8B | `qwen3:8b` | 16.4 GB | 4.4 GB | 128k |
39
  | Qwen3-4B | `qwen3:4b` | 8.2 GB | 2.2 GB | 32k |
40
+ | Qwen3-0.6B | `qwen3:0.6b` | 1.3 GB | 0.9 GB | 32k |
41
  | Qwen2.5-7B-Instruct | `qwen2.5:7b` | 14.4 GB | 3.9 GB | 128k |
42
  | Qwen2.5-1.5B-Instruct | `qwen2.5:1.5b` | 3.1 GB | 0.9 GB | 32k |
43
  | Llama-3.2-3B-Instruct | `llama3.2:3b` | 6.4 GB | 1.7 GB | 128k |
44
+ | Llama-3.2-1B-Instruct | `llama3.2:1b` | 2.5 GB | 0.7 GB | 128k |
45
  | Gemma-3-4B-Instruct | `gemma3:4b` | 9.8 GB | 2.6 GB | 128k |
46
+ | Gemma-3-1B-Instruct | `gemma3:1b` | 2.0 GB | 0.5 GB | 32k |
47
 
48
+ More models will be added as the catalog grows. Run `squish catalog` for the full list.
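
As a rough sanity check on the table above, the sketch below turns the raw and squished sizes into a compression ratio and an approximate bits-per-weight figure. It assumes the raw checkpoints are stored at 16 bits per weight, which the README does not state outright. The helper names are hypothetical and are not part of Squish.

```python
# (raw GB, squished GB) pairs copied from the table above.
SIZES_GB = {
    "qwen3:8b": (16.4, 4.4),
    "qwen2.5:7b": (14.4, 3.9),
    "llama3.2:3b": (6.4, 1.7),
    "gemma3:4b": (9.8, 2.6),
}

# Assumption: raw checkpoints use 16 bits per weight (not stated in the README).
RAW_BITS_PER_WEIGHT = 16


def compression_ratio(raw_gb, squished_gb):
    """How many times smaller the squished file is than the raw one."""
    return raw_gb / squished_gb


def effective_bits(raw_gb, squished_gb, raw_bits=RAW_BITS_PER_WEIGHT):
    """Approximate bits per weight after compression, including any overhead."""
    return raw_bits * squished_gb / raw_gb


for model_id, (raw, squished) in SIZES_GB.items():
    print(
        f"{model_id}: {compression_ratio(raw, squished):.2f}x smaller, "
        f"~{effective_bits(raw, squished):.2f} bits per weight"
    )
```

Under the 16-bit assumption, a figure slightly above 4 bits per weight would be consistent with INT4 codes plus some per-group metadata, though the README does not describe the on-disk layout.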
49
 
50
  ---
51
 
52
  ## Load time comparison (M3 16GB)
53
 
54
  | Model | Squish (INT4) | Ollama | llama.cpp |
55
+ |-------|--------------|--------|-----------|
56
+ | Qwen3-8B | 0.43s | 4.2s | 6.1s |
57
+ | Llama-3.2-3B | 0.33s | 1.8s | 2.4s |
 
 
58
 
59
+ Measured cold-start on Apple M3 16GB. Results will vary by chip and storage.
60
 
61
  ---
62
 
 
68
  from openai import OpenAI
69
 
70
  client = OpenAI(base_url="http://localhost:11435/v1", api_key="squish")
 
71
  response = client.chat.completions.create(
72
      model="qwen3:8b",
73
      messages=[{"role": "user", "content": "Explain attention mechanisms briefly."}]
 
76
  ```
77
 
78
  ```bash
79
+ # Or point your existing tools at it
80
  export OPENAI_BASE_URL=http://localhost:11435/v1
81
  export OPENAI_API_KEY=squish
82
  ```
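
For tools that cannot use the `openai` package, the same endpoint can in principle be reached with the standard library alone. The sketch below assumes the local server exposes the usual OpenAI `/chat/completions` route under the base URL shown in the README and returns the standard response shape. `build_chat_request` and `chat` are hypothetical helper names, not part of Squish.

```python
import json
import urllib.request

# Base URL and placeholder key come from the README; the route is the standard
# OpenAI chat-completions path, which Squish is assumed to serve.
BASE_URL = "http://localhost:11435/v1"
API_KEY = "squish"


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat-completions POST request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


def chat(model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the first choice's message text."""
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires the local server to be running):
# print(chat("qwen3:8b", "Explain attention mechanisms briefly."))
```

Keeping request construction separate from sending makes it possible to inspect the URL, headers, and payload without a running server.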
 
87
 
88
  Squish uses a three-tier compression pipeline:
89
 
90
+ - **INT4 quantization** via a Rust extension (`squish_quant_rs`) with ARM NEON acceleration — 8–12 GB/s throughput on Apple Silicon
91
+ - **Compressed weight loader** — weights stay compressed on disk and decompress directly into Metal-mapped memory at load time
92
+ - **KV cache quantization** — attention cache stored at reduced precision during generation, not just weights
93
 
94
  The result is a model that fits in memory on a base M3 16GB and loads faster than Ollama can parse its configuration.
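
To make the first stage of the pipeline concrete, here is a minimal pure-Python sketch of symmetric, group-wise INT4 quantization with two 4-bit codes packed per byte. It is an illustration only, not the `squish_quant_rs` implementation. The group size of 32, the symmetric scaling scheme, and the function names are all assumptions.

```python
from typing import List, Tuple


def quantize_int4(weights: List[float], group_size: int = 32) -> Tuple[bytes, List[float]]:
    """Symmetric per-group INT4 quantization; packs two 4-bit codes per byte."""
    if group_size % 2 != 0 or len(weights) % group_size != 0:
        raise ValueError("group_size must be even and divide len(weights)")
    packed = bytearray()
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Map the largest magnitude in the group to 7; fall back to 1.0
        # for an all-zero group to avoid dividing by zero.
        scale = max(abs(w) for w in group) / 7.0 or 1.0
        scales.append(scale)
        # Round to signed codes in [-8, 7], then add 8 to store unsigned nibbles.
        codes = [max(-8, min(7, round(w / scale))) + 8 for w in group]
        for lo, hi in zip(codes[0::2], codes[1::2]):
            packed.append(lo | (hi << 4))
    return bytes(packed), scales


def dequantize_int4(packed: bytes, scales: List[float], group_size: int = 32) -> List[float]:
    """Unpack nibbles and rescale them with the scale of their group."""
    out: List[float] = []
    for i, byte in enumerate(packed):
        scale = scales[(2 * i) // group_size]
        out.append(((byte & 0x0F) - 8) * scale)
        out.append(((byte >> 4) - 8) * scale)
    return out
```

Each weight costs 4 bits plus a share of its group's scale, and the reconstruction error per weight is bounded by half the group scale. A production implementation would vectorize this loop rather than process one value at a time.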
95
 
96
  ---
97
 
98
+ ## Using models directly with mlx_lm
 
 
99
 
100
  ```python
101
  from mlx_lm import load, generate
102
 
103
+ model, tokenizer = load("squishai/Qwen3-8B-bf16-squished")
104
  response = generate(model, tokenizer, prompt="Hello", max_tokens=100)
105
  ```
106
 
 
110
 
111
  - macOS 13.0 or later
112
  - Apple Silicon (M1, M2, M3, M4, M5)
113
+ - Sufficient unified memory for the model (see table above)
114
 
115
+ > Intel Macs and Linux are not supported. Windows is not planned.
116
 
117
  ---
118
 
119
  ## Links
120
 
121
+ - CLI and inference engine: [github.com/konjoai/squish](https://github.com/konjoai/squish)
122
+ - Install: `brew tap konjoai/squish && brew install squish`
123
+ - Issues and discussions: [GitHub Issues](https://github.com/konjoai/squish/issues)
 
124
 
125
  ---
126
 
 
 
127
  *Squish it. Run it. Go.*
128
 
129
+ Built by [Konjo AI](https://github.com/konjoai) &nbsp;·&nbsp; MIT License