azkabanned committed on
Commit
f3e13ef
·
verified ·
1 Parent(s): 0a766b7

Update model card: R7+R8 training, nervous system, personal AI OS

Files changed (1)
  1. README.md +118 -235
README.md CHANGED
@@ -1,241 +1,124 @@
  ---
- language: en
- license: apache-2.0
- library_name: mlx
- pipeline_tag: text-generation
- tags:
- - mlx
- - apple-silicon
- - tool-calling
- - orchestrator
- - local-ai
- - personal-ai-os
- base_model: Qwen/Qwen3-4B
- ---
-
- # Zora 4B
-
- The orchestrator brain for [Zora](https://github.com/Azkabanned/zora) — a private, local-first AI system that
- runs on Apple Silicon.
-
- ## What is this?
-
- Zora 4B is a fine-tuned [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) model, quantised to 4-bit for efficient
- inference on Apple Silicon via [MLX](https://github.com/ml-explore/mlx). It serves as Zora's primary reasoning
- brain — handling tool calling, task routing, reflection, and conversational interaction.
-
- This is **not** a general-purpose chat model. It is specifically trained for orchestrator behaviour: deciding
- which tools to call, how to route tasks across local and remote compute, and how to reason about multi-step
- goals.
-
- ## Key capabilities
-
- - **Tool calling** — 39+ tools with structured `<tool_call>` output format
- - **Task routing** — classifies prompts into direct response, queued goal, or delegated work
- - **Reflection** — self-assessment of priorities, active threads, and proactive suggestions
- - **Multi-turn reasoning** — maintains context across tool call chains (up to 8 rounds)
- - **Thinking mode** optional `<think>` blocks for chain-of-thought reasoning
-
- ## Hardware requirements
-
- | Config | RAM | Performance |
- |--------|-----|-------------|
- | Mac Mini M4 24GB | 24GB | ~45 tok/s (recommended orchestrator) |
- | MacBook Air M3 16GB | 16GB | ~35 tok/s |
- | Any Apple Silicon | 8GB+ | Will run, but may be slow |
-
- ## Usage
-
- ### With MLX (recommended)
-
- ```python
- from mlx_lm import load, generate
-
- model, tokenizer = load("project-zora/zora-4b")
- response = generate(model, tokenizer, prompt="What's running on my cluster?", max_tokens=512)
- ```
-
- ### With Zora orchestrator
-
- This model is downloaded automatically when you run `./install.sh` in the [Zora
- repository](https://github.com/Azkabanned/zora). The orchestrator loads it at startup and uses it for all local
- reasoning.
-
- ## Training
-
- - **Base model:** Qwen3-4B
- - **Fine-tuning:** SFT on 1,186 synthetic orchestrator examples (zero personal data)
- - **Quantisation:** 4-bit affine (group size 64) via MLX
- - **Training data:** Synthetic tool-calling conversations, task routing decisions, reflection outputs, and
- multi-step goal execution traces
- - **No personal data** was used in training — all examples are synthetic
-
- ## Architecture
-
- ```
- Qwen3ForCausalLM
- ├── 36 layers, 2560 hidden size
- ├── 32 attention heads, 8 KV heads (GQA)
- ├── 9728 intermediate size (SiLU)
- ├── RoPE (theta=1M, max 40960 positions)
- └── 4-bit affine quantisation (group=64)
- ```
-
- ## Limitations
-
- - Trained for Zora's orchestrator context may underperform on general chat benchmarks
- - English only
- - Best results with the Zora tool/system prompt format
- - Not suitable for tasks requiring >40K context (use a larger model via cloud offload)
-
- ## License
-
- Apache 2.0
-
- ## Links
-
- - - [Zora on GitHub](https://github.com/Azkabanned/zora)
- - - [Qwen3-4B (base model)](https://huggingface.co/Qwen/Qwen3-4B)
- this is our current one.
-
- ⏺ Here's the updated version — keeps your format and style, updates the facts:
-
- ---
- language: en
- license: apache-2.0
- library_name: mlx
- pipeline_tag: text-generation
- tags:
- - mlx
- - apple-silicon
- - tool-calling
- - orchestrator
- - local-ai
- - personal-ai-os
- base_model: Qwen/Qwen3-4B
- ---
-
- # Zora 4B
-
- The orchestrator brain for [Zora](https://github.com/Azkabanned/zora) — a private, local-first personal AI OS
- that runs on Apple Silicon.
-
- ## What is this?
-
- Zora 4B is a fine-tuned [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) model, quantised to 4-bit for efficient
- inference on Apple Silicon via [MLX](https://github.com/ml-explore/mlx). It serves as Zora's primary reasoning
- brain — handling tool calling, task routing, structured reflection, and conversational interaction.
-
- This is **not** a general-purpose chat model. It is specifically trained for orchestrator behaviour: deciding
- which tools to call, how to route tasks across local and remote compute, producing structured JSON for
- autonomous cognition, and managing multi-step goals.
-
- ## Key capabilities
-
- - **Tool calling** — 39+ tools with structured `<tool_call>` output format
- - **Task routing** — classifies prompts into direct response, queued goal, or delegated work to 70B worker nodes
- - **Structured reflection** — produces complete COG-X JSON schema for autonomous cognition loops (reasoning,
- priorities, goals, repairs, notifications)
- - **Task delegation** — routes complex build/code/refactor tasks to worker nodes with bigger models
- - **Multi-turn reasoning** — maintains context across tool call chains (up to 8 rounds)
- - **Thinking mode** — optional `<think>` blocks for chain-of-thought reasoning
-
- ## Hardware requirements
-
- | Config | RAM | Performance |
- |--------|-----|-------------|
- | Mac Mini M4 24GB | 24GB | ~90 tok/s with TurboQuant KV cache (recommended orchestrator) |
- | MacBook Pro M5 Max 128GB | 128GB | ~110 tok/s with speculative decoding |
- | MacBook Air M3 16GB | 16GB | ~35 tok/s |
- | Any Apple Silicon | 8GB+ | Will run, but may be slow |
-
- The entire stack — model, KV cache, and OS — runs in **7GB RAM** on a 24GB Mac Mini. TurboQuant PolarQuant
- compresses the KV cache to 1.5GB for 32K context.
-
- ## Usage
-
- ### With MLX (recommended)
-
- ```python
- from mlx_lm import load, generate
-
- model, tokenizer = load("project-zora/zora-4b")
- response = generate(model, tokenizer, prompt="What's running on my cluster?", max_tokens=512)
-
- As an Anthropic-compatible API
-
- # Zora exposes the same API format as Anthropic on port 4001
- export ANTHROPIC_BASE_URL=http://localhost:4001
- export ANTHROPIC_API_KEY=local
- claude # now running on your Metal GPU
-
- With Zora orchestrator
-
- This model is downloaded automatically when you run ./install.sh in the https://github.com/Azkabanned/zora. The
- orchestrator loads it at startup and uses it for all local reasoning.
-
- Training
-
- ┌───────┬───────────────────────────────────────────────────────────────┬────────────────┐
- │ Round │ Focus                                                         │ Examples       │
- ├───────┼───────────────────────────────────────────────────────────────┼────────────────┤
- │ R1-R3 │ Core tool calling, multi-step chains                          │ 600+           │
- ├───────┼───────────────────────────────────────────────────────────────┼────────────────┤
- │ R4-R5 │ Edge cases, delegation rules                                  │ 200+           │
- ├───────┼───────────────────────────────────────────────────────────────┼────────────────┤
- │ R6    │ All features (Team Zora, Enhanced Memory, Presence, 37 tools) │ 200+           │
- ├───────┼───────────────────────────────────────────────────────────────┼────────────────┤
- │ R7    │ Structured JSON reflection (COG-X schema)                     │ 37             │
- ├───────┼───────────────────────────────────────────────────────────────┼────────────────┤
- │ R8    │ Delegation routing (complex build tasks → worker)             │ 40             │
- ├───────┼───────────────────────────────────────────────────────────────┼────────────────┤
- │ Total │                                                               │ 1,107 examples │
- └───────┴───────────────────────────────────────────────────────────────┴────────────────┘
-
- - - Base model: Qwen3-4B
- - - Fine-tuning: LoRA SFT (16 layers, lr=1e-4, 2500 iterations)
- - - Final validation loss: 0.017
- - - Quantisation: 4-bit (4.5 bits per weight) via MLX
- - - Training hardware: MacBook Pro M5 Max 128GB via MLX LoRA
- - - Test result: 8/10 tool calling accuracy
- - - No personal data was used in training — all examples are synthetic
-
- Architecture
-
- Qwen3ForCausalLM
- ├── 36 layers, 2560 hidden size
- ├── 32 attention heads, 8 KV heads (GQA)
- ├── 9728 intermediate size (SiLU)
- ├── RoPE (theta=1M, max 40960 positions)
- ├── 4-bit quantisation (4.5 bits/weight)
- └── TurboQuant PolarQuant KV cache compatible
-
- What makes Zora different
-
- Zora is a personal AI OS — not a chatbot. The brain model is one part of a larger system:
-
- - - Real-time nervous system — events from every channel (WhatsApp, Telegram, email, Teams, calendar) flow through
- one universal event bus
- - - Autonomous operator — follow-through engine that owns work: triages, drafts replies, sends follow-ups, tracks
- commitments
- - - Self-improving — LoRA training pipeline runs on your hardware, improving the model from your usage patterns
- - - Privacy by architecture — all inference on-device, data never leaves your machine, secrets in macOS Keychain
-
- Limitations
-
- - - Trained for Zora's orchestrator context — may underperform on general chat benchmarks
- - - English only
- - - Best results with the Zora tool/system prompt format
- - - Structured JSON output may truncate on very large contexts (>10K chars) — the orchestrator has repair logic
- for this
- - - Not suitable for tasks requiring >40K context (use a larger model via worker delegation or cloud offload)
-
- License
-
- Apache 2.0
-
- Links
-
- - - https://github.com/Azkabanned/zora
- - - https://huggingface.co/Qwen/Qwen3-4B
- - - https://github.com/ml-explore/mlx
  ---
+ language: en
+ license: apache-2.0
+ library_name: mlx
+ pipeline_tag: text-generation
+ tags:
+ - mlx
+ - apple-silicon
+ - tool-calling
+ - orchestrator
+ - local-ai
+ - personal-ai-os
+ base_model: Qwen/Qwen3-4B
+ ---
+
+ # Zora 4B
+
+ The orchestrator brain for [Zora](https://github.com/Azkabanned/zora) — a private, local-first personal AI OS that runs on Apple Silicon.
+
+ ## What is this?
+
+ Zora 4B is a fine-tuned [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) model, quantised to 4-bit for efficient inference on Apple Silicon via [MLX](https://github.com/ml-explore/mlx). It serves as Zora's primary reasoning brain — handling tool calling, task routing, structured reflection, and conversational interaction.
+
+ This is **not** a general-purpose chat model. It is specifically trained for orchestrator behaviour: deciding which tools to call, how to route tasks across local and remote compute, producing structured JSON for autonomous cognition, and managing multi-step goals.
+
+ ## Key capabilities
+
+ - **Tool calling** — 39+ tools with structured `<tool_call>` output format
+ - **Task routing** — classifies prompts into direct response, queued goal, or delegated work to 70B worker nodes
+ - **Structured reflection** — produces the complete COG-X JSON schema for autonomous cognition loops
+ - **Task delegation** — routes complex build/code/refactor tasks to worker nodes with bigger models
+ - **Multi-turn reasoning** — maintains context across tool call chains (up to 8 rounds)
+ - **Thinking mode** — optional `<think>` blocks for chain-of-thought reasoning
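Tool calls use a tagged-JSON output format. As a minimal sketch of consuming it, assuming the model emits Qwen-style `<tool_call>` blocks with a JSON payload (the `cluster_status` tool and its arguments below are hypothetical, and Zora's own parser may differ):

```python
import json
import re

# Extract <tool_call> payloads from model output.
# The tag convention follows Qwen's tool-calling format; Zora's exact
# schema is not documented here, so treat this as illustrative only.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(text: str) -> list[dict]:
    """Return the parsed JSON object of every <tool_call> block in `text`."""
    return [json.loads(payload) for payload in TOOL_CALL_RE.findall(text)]

output = (
    "Let me check.\n"
    '<tool_call>\n{"name": "cluster_status", "arguments": {"node": "all"}}\n</tool_call>'
)
calls = parse_tool_calls(output)
print(calls[0]["name"])  # cluster_status
```

The orchestrator would dispatch each parsed call to the matching tool and feed the result back for the next round.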
+
+ ## Hardware requirements
+
+ | Config | RAM | Performance |
+ |--------|-----|-------------|
+ | Mac Mini M4 24GB | 24GB | ~90 tok/s with TurboQuant KV cache |
+ | MacBook Pro M5 Max 128GB | 128GB | ~110 tok/s with speculative decoding |
+ | MacBook Air M3 16GB | 16GB | ~35 tok/s |
+ | Any Apple Silicon | 8GB+ | Will run, but may be slow |
+
+ The entire stack — model, KV cache, and OS — runs in **7GB RAM** on a 24GB Mac Mini. TurboQuant PolarQuant compresses the KV cache to 1.5GB for 32K context.
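Rough arithmetic behind the RAM figure, as a sketch only: it assumes ~4.0B parameters, 4.5 bits/weight, and a KV cache shaped by the Architecture section below (36 layers, 8 KV heads) with head_dim 128, which is standard for Qwen3 but not stated in this card.

```python
# Back-of-envelope memory check for the ~7GB claim (assumptions in comments).
params = 4.0e9                                  # ~4B parameters (assumed)
weight_gib = params * 4.5 / 8 / 2**30           # 4.5 bits/weight -> bytes -> GiB

layers, kv_heads, head_dim, ctx = 36, 8, 128, 32768   # head_dim 128 assumed
fp16_kv_bytes = 2 * 2 * layers * kv_heads * head_dim * ctx  # K+V, 2 bytes each
fp16_kv_gib = fp16_kv_bytes / 2**30

print(f"weights: ~{weight_gib:.1f} GiB")        # ~2.1 GiB
print(f"fp16 KV @32K: ~{fp16_kv_gib:.1f} GiB")  # ~4.5 GiB before compression
```

Under these assumptions an uncompressed fp16 KV cache alone would need ~4.5 GiB at 32K context, so a roughly 3x KV compression plus ~2.1 GiB of weights is consistent with the stated 7GB envelope.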
+
+ ## Usage
+
+ ### With MLX
+
+ ```python
+ from mlx_lm import load, generate
+
+ model, tokenizer = load("project-zora/zora-4b")
+ response = generate(model, tokenizer, prompt="What's running on my cluster?", max_tokens=512)
+ ```
+
+ ### As an Anthropic-compatible API
+
+ ```bash
+ # Zora exposes the same API format as Anthropic on port 4001
+ export ANTHROPIC_BASE_URL=http://localhost:4001
+ export ANTHROPIC_API_KEY=local
+ claude  # now running on your Metal GPU
+ ```
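Any client that speaks the Anthropic Messages wire format should work the same way. A sketch of the request shape, assuming Zora follows the standard `/v1/messages` endpoint and body fields of Anthropic's public API (not verified against Zora, and the send is deliberately skipped so the snippet runs offline):

```python
import json
from urllib import request

# Hypothetical request against the local endpoint; body fields follow the
# standard Anthropic Messages API. Whether Zora supports every field is an
# assumption -- check the Zora docs.
body = {
    "model": "zora-4b",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "What's running on my cluster?"}],
}
req = request.Request(
    "http://localhost:4001/v1/messages",
    data=json.dumps(body).encode(),
    headers={"content-type": "application/json", "x-api-key": "local"},
    method="POST",
)
# request.urlopen(req) would send it when the Zora server is running.
print(req.get_method(), req.full_url)  # POST http://localhost:4001/v1/messages
```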
+
+ ### With Zora orchestrator
+
+ This model is downloaded automatically when you run `./install.sh` in the [Zora repository](https://github.com/Azkabanned/zora).
+
+ ## Training
+
+ | Round | Focus | Examples |
+ |-------|-------|----------|
+ | R1-R3 | Core tool calling, multi-step chains | 600+ |
+ | R4-R5 | Edge cases, delegation rules | 200+ |
+ | R6 | All features (Team Zora, Enhanced Memory, Presence) | 200+ |
+ | R7 | Structured JSON reflection (COG-X schema) | 37 |
+ | R8 | Delegation routing (complex build tasks) | 40 |
+ | **Total** | | **1,107 examples** |
+
+ - **Base model:** Qwen3-4B
+ - **Method:** LoRA SFT (16 layers, lr=1e-4, 2500 iterations)
+ - **Final val loss:** 0.017
+ - **Quantisation:** 4-bit (4.5 bits per weight) via MLX
+ - **Hardware:** MacBook Pro M5 Max 128GB
+ - **Test result:** 8/10 tool calling accuracy
+ - **No personal data** — all examples are synthetic
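The hyperparameters above map onto mlx-lm's LoRA trainer. This is a hedged sketch, not the actual Zora training command: the `mlx_lm.lora` entry point and these flags exist in recent mlx-lm releases but names have changed between versions, and `./orchestrator-data` is a placeholder for a directory containing `train.jsonl`/`valid.jsonl`.

```shell
# Hypothetical mlx-lm LoRA run mirroring the listed recipe
# (16 layers, lr 1e-4, 2500 iters). Verify flag names against
# your installed mlx-lm version before running.
mlx_lm.lora \
  --model Qwen/Qwen3-4B \
  --train \
  --data ./orchestrator-data \
  --num-layers 16 \
  --learning-rate 1e-4 \
  --iters 2500
```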
+
+ ## Architecture
+
+ ```
+ Qwen3ForCausalLM
+ ├── 36 layers, 2560 hidden size
+ ├── 32 attention heads, 8 KV heads (GQA)
+ ├── 9728 intermediate size (SiLU)
+ ├── RoPE (theta=1M, max 40960 positions)
+ ├── 4-bit quantisation (4.5 bits/weight)
+ └── TurboQuant PolarQuant KV cache compatible
+ ```
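The "4B" name can be sanity-checked from these dimensions. A sketch under stated assumptions: head_dim 128, tied input/output embeddings, and a 151,936-token vocabulary are standard for Qwen3 but are not stated in this card, and norm/bias parameters are ignored as negligible.

```python
# Approximate parameter count from the architecture block above.
# Assumptions (not in the card): head_dim 128, tied embeddings, vocab 151936.
hidden, layers, inter = 2560, 36, 9728
heads, kv_heads, head_dim = 32, 8, 128
vocab = 151_936

attn = hidden * heads * head_dim * 2 + hidden * kv_heads * head_dim * 2  # q,o + k,v projections
mlp = 3 * hidden * inter                                                 # gate, up, down
embed = vocab * hidden                                                   # tied, counted once
total = layers * (attn + mlp) + embed
print(f"~{total / 1e9:.2f}B parameters")  # ~4.02B
```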
+
+ ## What makes Zora different
+
+ Zora is a personal AI OS — not a chatbot. This brain model is one part of a larger system:
+
+ - **Real-time nervous system** — events from every channel flow through one universal event bus
+ - **Autonomous operator** — follow-through engine that owns work across all channels
+ - **Self-improving** — LoRA training pipeline runs on your hardware
+ - **Privacy by architecture** — all inference on-device, data never leaves your machine
+
+ ## Limitations
+
+ - Trained for Zora's orchestrator context — may underperform on general chat benchmarks
+ - English only
+ - Best results with the Zora tool/system prompt format
+ - Not suitable for tasks requiring >40K context
+
+ ## License
+
+ Apache 2.0
+
+ ## Links
+
+ - [Zora on GitHub](https://github.com/Azkabanned/zora)
+ - [Qwen3-4B (base model)](https://huggingface.co/Qwen/Qwen3-4B)
+ - [MLX](https://github.com/ml-explore/mlx)