azkabanned committed on
Commit
f3e13ef
·
verified ·
1 Parent(s): 0a766b7

Update model card: R7+R8 training, nervous system, personal AI OS

Files changed (1)
  1. README.md +118 -235
README.md CHANGED
@@ -1,241 +1,124 @@
  ---
- language: en
- license: apache-2.0
- library_name: mlx
- pipeline_tag: text-generation
- tags:
- - mlx
- - apple-silicon
- - tool-calling
- - orchestrator
- - local-ai
- - personal-ai-os
- base_model: Qwen/Qwen3-4B
- ---
-
- # Zora 4B
-
- The orchestrator brain for [Zora](https://github.com/Azkabanned/zora) — a private, local-first AI system that
- runs on Apple Silicon.
-
- ## What is this?
-
- Zora 4B is a fine-tuned [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) model, quantised to 4-bit for efficient
- inference on Apple Silicon via [MLX](https://github.com/ml-explore/mlx). It serves as Zora's primary reasoning
- brain — handling tool calling, task routing, reflection, and conversational interaction.
-
- This is **not** a general-purpose chat model. It is specifically trained for orchestrator behaviour: deciding
- which tools to call, how to route tasks across local and remote compute, and how to reason about multi-step
- goals.
-
- ## Key capabilities
-
- - **Tool calling** — 39+ tools with structured `<tool_call>` output format
- - **Task routing** — classifies prompts into direct response, queued goal, or delegated work
- - **Reflection** — self-assessment of priorities, active threads, and proactive suggestions
- - **Multi-turn reasoning** — maintains context across tool call chains (up to 8 rounds)
- - **Thinking mode** optional `<think>` blocks for chain-of-thought reasoning
-
- ## Hardware requirements
-
- | Config | RAM | Performance |
- |--------|-----|-------------|
- | Mac Mini M4 24GB | 24GB | ~45 tok/s (recommended orchestrator) |
- | MacBook Air M3 16GB | 16GB | ~35 tok/s |
- | Any Apple Silicon | 8GB+ | Will run, but may be slow |
-
- ## Usage
-
- ### With MLX (recommended)
-
- ```python
- from mlx_lm import load, generate
-
- model, tokenizer = load("project-zora/zora-4b")
- response = generate(model, tokenizer, prompt="What's running on my cluster?", max_tokens=512)
- ```
-
- ### With Zora orchestrator
-
- This model is downloaded automatically when you run `./install.sh` in the [Zora
- repository](https://github.com/Azkabanned/zora). The orchestrator loads it at startup and uses it for all local
- reasoning.
-
- ## Training
-
- - **Base model:** Qwen3-4B
- - **Fine-tuning:** SFT on 1,186 synthetic orchestrator examples (zero personal data)
- - **Quantisation:** 4-bit affine (group size 64) via MLX
- - **Training data:** Synthetic tool-calling conversations, task routing decisions, reflection outputs, and
- multi-step goal execution traces
- - **No personal data** was used in training — all examples are synthetic
-
- ## Architecture
-
- ```
- Qwen3ForCausalLM
- ├── 36 layers, 2560 hidden size
- ├── 32 attention heads, 8 KV heads (GQA)
- ├── 9728 intermediate size (SiLU)
- ├── RoPE (theta=1M, max 40960 positions)
- └── 4-bit affine quantisation (group=64)
- ```
-
- ## Limitations
-
- - Trained for Zora's orchestrator context may underperform on general chat benchmarks
- - English only
- - Best results with the Zora tool/system prompt format
- - Not suitable for tasks requiring >40K context (use a larger model via cloud offload)
-
- ## License
-
- Apache 2.0
-
- ## Links
-
- - - [Zora on GitHub](https://github.com/Azkabanned/zora)
- - - [Qwen3-4B (base model)](https://huggingface.co/Qwen/Qwen3-4B)
- this is our current one.
-
- ⏺ Here's the updated version — keeps your format and style, updates the facts:
-
- ---
- language: en
- license: apache-2.0
- library_name: mlx
- pipeline_tag: text-generation
- tags:
- - mlx
- - apple-silicon
- - tool-calling
- - orchestrator
- - local-ai
- - personal-ai-os
- base_model: Qwen/Qwen3-4B
- ---
-
- # Zora 4B
-
- The orchestrator brain for [Zora](https://github.com/Azkabanned/zora) — a private, local-first personal AI OS
- that runs on Apple Silicon.
-
- ## What is this?
-
- Zora 4B is a fine-tuned [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) model, quantised to 4-bit for efficient
- inference on Apple Silicon via [MLX](https://github.com/ml-explore/mlx). It serves as Zora's primary reasoning
- brain — handling tool calling, task routing, structured reflection, and conversational interaction.
-
- This is **not** a general-purpose chat model. It is specifically trained for orchestrator behaviour: deciding
- which tools to call, how to route tasks across local and remote compute, producing structured JSON for
- autonomous cognition, and managing multi-step goals.
-
- ## Key capabilities
-
- - **Tool calling** — 39+ tools with structured `<tool_call>` output format
- - **Task routing** — classifies prompts into direct response, queued goal, or delegated work to 70B worker nodes
- - **Structured reflection** — produces complete COG-X JSON schema for autonomous cognition loops (reasoning,
- priorities, goals, repairs, notifications)
- - **Task delegation** — routes complex build/code/refactor tasks to worker nodes with bigger models
- - **Multi-turn reasoning** — maintains context across tool call chains (up to 8 rounds)
- - **Thinking mode** — optional `<think>` blocks for chain-of-thought reasoning
-
- ## Hardware requirements
-
- | Config | RAM | Performance |
- |--------|-----|-------------|
- | Mac Mini M4 24GB | 24GB | ~90 tok/s with TurboQuant KV cache (recommended orchestrator) |
- | MacBook Pro M5 Max 128GB | 128GB | ~110 tok/s with speculative decoding |
- | MacBook Air M3 16GB | 16GB | ~35 tok/s |
- | Any Apple Silicon | 8GB+ | Will run, but may be slow |
-
- The entire stack — model, KV cache, and OS — runs in **7GB RAM** on a 24GB Mac Mini. TurboQuant PolarQuant
- compresses the KV cache to 1.5GB for 32K context.
-
- ## Usage
-
- ### With MLX (recommended)
-
- ```python
- from mlx_lm import load, generate
-
- model, tokenizer = load("project-zora/zora-4b")
- response = generate(model, tokenizer, prompt="What's running on my cluster?", max_tokens=512)
-
- As an Anthropic-compatible API
-
- # Zora exposes the same API format as Anthropic on port 4001
- export ANTHROPIC_BASE_URL=http://localhost:4001
- export ANTHROPIC_API_KEY=local
- claude # now running on your Metal GPU
-
- With Zora orchestrator
-
- This model is downloaded automatically when you run ./install.sh in the https://github.com/Azkabanned/zora. The
- orchestrator loads it at startup and uses it for all local reasoning.
-
- Training
-
- ┌───────┬───────────────────────────────────────────────────────────────┬────────────────┐
- │ Round │ Focus                                                         │ Examples       │
- ├───────┼───────────────────────────────────────────────────────────────┼────────────────┤
- │ R1-R3 │ Core tool calling, multi-step chains                          │ 600+           │
- ├───────┼───────────────────────────────────────────────────────────────┼────────────────┤
- │ R4-R5 │ Edge cases, delegation rules                                  │ 200+           │
- ├───────┼───────────────────────────────────────────────────────────────┼────────────────┤
- │ R6    │ All features (Team Zora, Enhanced Memory, Presence, 37 tools) │ 200+           │
- ├───────┼───────────────────────────────────────────────────────────────┼────────────────┤
- │ R7    │ Structured JSON reflection (COG-X schema)                     │ 37             │
- ├───────┼───────────────────────────────────────────────────────────────┼────────────────┤
- │ R8    │ Delegation routing (complex build tasks → worker)             │ 40             │
- ├───────┼───────────────────────────────────────────────────────────────┼────────────────┤
- │ Total │                                                               │ 1,107 examples │
- └───────┴───────────────────────────────────────────────────────────────┴────────────────┘
-
- - - Base model: Qwen3-4B
- - - Fine-tuning: LoRA SFT (16 layers, lr=1e-4, 2500 iterations)
- - - Final validation loss: 0.017
- - - Quantisation: 4-bit (4.5 bits per weight) via MLX
- - - Training hardware: MacBook Pro M5 Max 128GB via MLX LoRA
- - - Test result: 8/10 tool calling accuracy
- - - No personal data was used in training — all examples are synthetic
-
- Architecture
-
- Qwen3ForCausalLM
- ├── 36 layers, 2560 hidden size
- ├── 32 attention heads, 8 KV heads (GQA)
- ├── 9728 intermediate size (SiLU)
- ├── RoPE (theta=1M, max 40960 positions)
- ├── 4-bit quantisation (4.5 bits/weight)
- └── TurboQuant PolarQuant KV cache compatible
-
- What makes Zora different
-
- Zora is a personal AI OS — not a chatbot. The brain model is one part of a larger system:
-
- - - Real-time nervous system — events from every channel (WhatsApp, Telegram, email, Teams, calendar) flow through
- one universal event bus
- - - Autonomous operator — follow-through engine that owns work: triages, drafts replies, sends follow-ups, tracks
- commitments
- - - Self-improving — LoRA training pipeline runs on your hardware, improving the model from your usage patterns
- - - Privacy by architecture — all inference on-device, data never leaves your machine, secrets in macOS Keychain
-
- Limitations
-
- - - Trained for Zora's orchestrator context — may underperform on general chat benchmarks
- - - English only
- - - Best results with the Zora tool/system prompt format
- - - Structured JSON output may truncate on very large contexts (>10K chars) — the orchestrator has repair logic
- for this
- - - Not suitable for tasks requiring >40K context (use a larger model via worker delegation or cloud offload)
-
- License
-
- Apache 2.0
-
- Links
-
- - - https://github.com/Azkabanned/zora
- - - https://huggingface.co/Qwen/Qwen3-4B
- - - https://github.com/ml-explore/mlx
  ---
+ language: en
+ license: apache-2.0
+ library_name: mlx
+ pipeline_tag: text-generation
+ tags:
+ - mlx
+ - apple-silicon
+ - tool-calling
+ - orchestrator
+ - local-ai
+ - personal-ai-os
+ base_model: Qwen/Qwen3-4B
+ ---
+
+ # Zora 4B
+
+ The orchestrator brain for [Zora](https://github.com/Azkabanned/zora) — a private, local-first personal AI OS that runs on Apple Silicon.
+
+ ## What is this?
+
+ Zora 4B is a fine-tuned [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) model, quantised to 4-bit for efficient inference on Apple Silicon via [MLX](https://github.com/ml-explore/mlx). It serves as Zora's primary reasoning brain — handling tool calling, task routing, structured reflection, and conversational interaction.
+
+ This is **not** a general-purpose chat model. It is specifically trained for orchestrator behaviour: deciding which tools to call, how to route tasks across local and remote compute, producing structured JSON for autonomous cognition, and managing multi-step goals.
+
+ ## Key capabilities
+
+ - **Tool calling** — 39+ tools with structured `<tool_call>` output format
+ - **Task routing** — classifies prompts into direct response, queued goal, or delegated work to 70B worker nodes
+ - **Structured reflection** — produces the complete COG-X JSON schema for autonomous cognition loops
+ - **Task delegation** — routes complex build/code/refactor tasks to worker nodes with bigger models
+ - **Multi-turn reasoning** — maintains context across tool call chains (up to 8 rounds)
+ - **Thinking mode** — optional `<think>` blocks for chain-of-thought reasoning
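Tool calls use a tagged-JSON output format. As a minimal sketch of consuming it, assuming the model emits Qwen-style `<tool_call>` blocks with a JSON payload (the `cluster_status` tool and its arguments below are hypothetical, and Zora's own parser may differ):

```python
import json
import re

# Extract <tool_call> payloads from model output.
# The tag convention follows Qwen's tool-calling format; Zora's exact
# schema is not documented here, so treat this as illustrative only.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(text: str) -> list[dict]:
    """Return the parsed JSON object of every <tool_call> block in `text`."""
    return [json.loads(payload) for payload in TOOL_CALL_RE.findall(text)]

output = (
    "Let me check.\n"
    '<tool_call>\n{"name": "cluster_status", "arguments": {"node": "all"}}\n</tool_call>'
)
calls = parse_tool_calls(output)
print(calls[0]["name"])  # cluster_status
```

The orchestrator would dispatch each parsed call to the matching tool and feed the result back for the next round.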
+
+ ## Hardware requirements
+
+ | Config | RAM | Performance |
+ |--------|-----|-------------|
+ | Mac Mini M4 24GB | 24GB | ~90 tok/s with TurboQuant KV cache |
+ | MacBook Pro M5 Max 128GB | 128GB | ~110 tok/s with speculative decoding |
+ | MacBook Air M3 16GB | 16GB | ~35 tok/s |
+ | Any Apple Silicon | 8GB+ | Will run, but may be slow |
+
+ The entire stack — model, KV cache, and OS — runs in **7GB RAM** on a 24GB Mac Mini. TurboQuant PolarQuant compresses the KV cache to 1.5GB for 32K context.
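Rough arithmetic behind the RAM figure, as a sketch only: it assumes ~4.0B parameters, 4.5 bits/weight, and a KV cache shaped by the Architecture section below (36 layers, 8 KV heads) with head_dim 128, which is standard for Qwen3 but not stated in this card.

```python
# Back-of-envelope memory check for the ~7GB claim (assumptions in comments).
params = 4.0e9                                  # ~4B parameters (assumed)
weight_gib = params * 4.5 / 8 / 2**30           # 4.5 bits/weight -> bytes -> GiB

layers, kv_heads, head_dim, ctx = 36, 8, 128, 32768   # head_dim 128 assumed
fp16_kv_bytes = 2 * 2 * layers * kv_heads * head_dim * ctx  # K+V, 2 bytes each
fp16_kv_gib = fp16_kv_bytes / 2**30

print(f"weights: ~{weight_gib:.1f} GiB")        # ~2.1 GiB
print(f"fp16 KV @32K: ~{fp16_kv_gib:.1f} GiB")  # ~4.5 GiB before compression
```

Under these assumptions an uncompressed fp16 KV cache alone would need ~4.5 GiB at 32K context, so a roughly 3x KV compression plus ~2.1 GiB of weights is consistent with the stated 7GB envelope.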
+
+ ## Usage
+
+ ### With MLX
+
+ ```python
+ from mlx_lm import load, generate
+
+ model, tokenizer = load("project-zora/zora-4b")
+ response = generate(model, tokenizer, prompt="What's running on my cluster?", max_tokens=512)
+ ```
+
+ ### As an Anthropic-compatible API
+
+ ```bash
+ # Zora exposes the same API format as Anthropic on port 4001
+ export ANTHROPIC_BASE_URL=http://localhost:4001
+ export ANTHROPIC_API_KEY=local
+ claude  # now running on your Metal GPU
+ ```
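Any client that speaks the Anthropic Messages wire format should work the same way. A sketch of the request shape, assuming Zora follows the standard `/v1/messages` endpoint and body fields of Anthropic's public API (not verified against Zora, and the send is deliberately skipped so the snippet runs offline):

```python
import json
from urllib import request

# Hypothetical request against the local endpoint; body fields follow the
# standard Anthropic Messages API. Whether Zora supports every field is an
# assumption -- check the Zora docs.
body = {
    "model": "zora-4b",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "What's running on my cluster?"}],
}
req = request.Request(
    "http://localhost:4001/v1/messages",
    data=json.dumps(body).encode(),
    headers={"content-type": "application/json", "x-api-key": "local"},
    method="POST",
)
# request.urlopen(req) would send it when the Zora server is running.
print(req.get_method(), req.full_url)  # POST http://localhost:4001/v1/messages
```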
+
+ ### With Zora orchestrator
+
+ This model is downloaded automatically when you run `./install.sh` in the [Zora repository](https://github.com/Azkabanned/zora).
+
+ ## Training
+
+ | Round | Focus | Examples |
+ |-------|-------|----------|
+ | R1-R3 | Core tool calling, multi-step chains | 600+ |
+ | R4-R5 | Edge cases, delegation rules | 200+ |
+ | R6 | All features (Team Zora, Enhanced Memory, Presence) | 200+ |
+ | R7 | Structured JSON reflection (COG-X schema) | 37 |
+ | R8 | Delegation routing (complex build tasks) | 40 |
+ | **Total** | | **1,107 examples** |
+
+ - **Base model:** Qwen3-4B
+ - **Method:** LoRA SFT (16 layers, lr=1e-4, 2500 iterations)
+ - **Final val loss:** 0.017
+ - **Quantisation:** 4-bit (4.5 bits per weight) via MLX
+ - **Hardware:** MacBook Pro M5 Max 128GB
+ - **Test result:** 8/10 tool calling accuracy
+ - **No personal data** — all examples are synthetic
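The hyperparameters above map onto mlx-lm's LoRA trainer. This is a hedged sketch, not the actual Zora training command: the `mlx_lm.lora` entry point and these flags exist in recent mlx-lm releases but names have changed between versions, and `./orchestrator-data` is a placeholder for a directory containing `train.jsonl`/`valid.jsonl`.

```shell
# Hypothetical mlx-lm LoRA run mirroring the listed recipe
# (16 layers, lr 1e-4, 2500 iters). Verify flag names against
# your installed mlx-lm version before running.
mlx_lm.lora \
  --model Qwen/Qwen3-4B \
  --train \
  --data ./orchestrator-data \
  --num-layers 16 \
  --learning-rate 1e-4 \
  --iters 2500
```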
+
+ ## Architecture
+
+ ```
+ Qwen3ForCausalLM
+ ├── 36 layers, 2560 hidden size
+ ├── 32 attention heads, 8 KV heads (GQA)
+ ├── 9728 intermediate size (SiLU)
+ ├── RoPE (theta=1M, max 40960 positions)
+ ├── 4-bit quantisation (4.5 bits/weight)
+ └── TurboQuant PolarQuant KV cache compatible
+ ```
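The "4B" name can be sanity-checked from these dimensions. A sketch under stated assumptions: head_dim 128, tied input/output embeddings, and a 151,936-token vocabulary are standard for Qwen3 but are not stated in this card, and norm/bias parameters are ignored as negligible.

```python
# Approximate parameter count from the architecture block above.
# Assumptions (not in the card): head_dim 128, tied embeddings, vocab 151936.
hidden, layers, inter = 2560, 36, 9728
heads, kv_heads, head_dim = 32, 8, 128
vocab = 151_936

attn = hidden * heads * head_dim * 2 + hidden * kv_heads * head_dim * 2  # q,o + k,v projections
mlp = 3 * hidden * inter                                                 # gate, up, down
embed = vocab * hidden                                                   # tied, counted once
total = layers * (attn + mlp) + embed
print(f"~{total / 1e9:.2f}B parameters")  # ~4.02B
```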
+
+ ## What makes Zora different
+
+ Zora is a personal AI OS — not a chatbot. This brain model is one part of a larger system:
+
+ - **Real-time nervous system** — events from every channel flow through one universal event bus
+ - **Autonomous operator** — follow-through engine that owns work across all channels
+ - **Self-improving** — LoRA training pipeline runs on your hardware
+ - **Privacy by architecture** — all inference on-device, data never leaves your machine
+
+ ## Limitations
+
+ - Trained for Zora's orchestrator context — may underperform on general chat benchmarks
+ - English only
+ - Best results with the Zora tool/system prompt format
+ - Not suitable for tasks requiring >40K context
+
+ ## License
+
+ Apache 2.0
+
+ ## Links
+
+ - [Zora on GitHub](https://github.com/Azkabanned/zora)
+ - [Qwen3-4B (base model)](https://huggingface.co/Qwen/Qwen3-4B)
+ - [MLX](https://github.com/ml-explore/mlx)