Zora 4B
The orchestrator brain for Zora — a private, local-first personal AI OS that runs on Apple Silicon.
What is this?
Zora 4B is a fine-tuned Qwen3-4B model, quantised to 4-bit for efficient inference on Apple Silicon via MLX. It serves as Zora's primary reasoning brain — handling tool calling, task routing, structured reflection, and conversational interaction.
This is not a general-purpose chat model. It is specifically trained for orchestrator behaviour: deciding which tools to call, how to route tasks across local and remote compute, producing structured JSON for autonomous cognition, and managing multi-step goals.
Key capabilities
- Tool calling — 39+ tools with structured `<tool_call>` output format
- Task routing — classifies prompts into direct response, queued goal, or delegated work to 70B worker nodes
- Structured reflection — produces complete COG-X JSON schema for autonomous cognition loops
- Task delegation — routes complex build/code/refactor tasks to worker nodes with bigger models
- Multi-turn reasoning — maintains context across tool call chains (up to 8 rounds)
- Thinking mode — optional `<think>` blocks for chain-of-thought reasoning
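As a rough sketch of how an orchestrator host might consume these outputs: the snippet below splits a completion into its thinking, tool-call, and reply parts. It assumes Qwen-style delimiters, i.e. optional `<think>…</think>` blocks and `<tool_call>…</tool_call>` blocks each containing one JSON object; the exact wire format used by Zora may differ.

```python
import json
import re

def parse_model_output(text: str) -> dict:
    """Split a completion into thinking blocks, tool calls, and plain reply.

    Assumes Qwen-style delimiters: <think>...</think> for chain-of-thought
    and <tool_call>{...}</tool_call> holding one JSON object per call.
    """
    thinking = re.findall(r"<think>(.*?)</think>", text, re.DOTALL)
    tool_calls = [
        json.loads(m)
        for m in re.findall(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL)
    ]
    # Whatever remains after stripping the tagged blocks is the reply text.
    reply = re.sub(
        r"<think>.*?</think>|<tool_call>.*?</tool_call>", "", text, flags=re.DOTALL
    ).strip()
    return {"thinking": thinking, "tool_calls": tool_calls, "reply": reply}

# Hypothetical completion for illustration only.
sample = (
    "<think>The user wants cluster status.</think>"
    '<tool_call>{"name": "cluster_status", "arguments": {}}</tool_call>'
    "Checking your cluster now."
)
parsed = parse_model_output(sample)
```

In a multi-turn loop, the host would execute each entry in `parsed["tool_calls"]`, append the results to the conversation, and re-prompt the model for up to the 8 rounds noted above.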
Hardware requirements
| Config | RAM | Performance |
|---|---|---|
| Mac Mini M4 24GB | 24GB | ~90 tok/s with TurboQuant KV cache |
| MacBook Pro M5 Max 128GB | 128GB | ~110 tok/s with speculative decoding |
| MacBook Air M3 16GB | 16GB | ~35 tok/s |
| Any Apple Silicon | 8GB+ | Will run, but may be slow |
The entire stack — model, KV cache, and OS — runs in 7GB RAM on a 24GB Mac Mini.
Usage
With MLX
```python
from mlx_lm import load, generate

model, tokenizer = load("project-zora/zora-4b")
response = generate(model, tokenizer, prompt="What's running on my cluster?", max_tokens=512)
```
As an Anthropic-compatible API
```bash
export ANTHROPIC_BASE_URL=http://localhost:4001
export ANTHROPIC_API_KEY=local
claude  # now running on your Metal GPU
```
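Any client that speaks the Anthropic Messages API shape can also target the local endpoint directly. The sketch below builds such a request with only the standard library; the model name `"zora-4b"` and the assumption that the proxy exposes `POST /v1/messages` are illustrative, not confirmed by this card.

```python
import json
import urllib.request

BASE_URL = "http://localhost:4001"  # local endpoint from the env vars above

def build_messages_request(prompt: str, max_tokens: int = 512) -> urllib.request.Request:
    """Build an Anthropic-Messages-style request aimed at the local server.

    The model name "zora-4b" is an assumption; a local proxy may ignore it.
    """
    body = json.dumps({
        "model": "zora-4b",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{BASE_URL}/v1/messages",
        data=body,
        headers={
            "content-type": "application/json",
            "x-api-key": "local",           # matches ANTHROPIC_API_KEY above
            "anthropic-version": "2023-06-01",
        },
        method="POST",
    )

req = build_messages_request("What's running on my cluster?")
# To actually send it (server must be running):
#   response = json.load(urllib.request.urlopen(req))
```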
With Zora orchestrator
This model is downloaded automatically when you run ./install.sh in the Zora repository.
Training
| Round | Focus | Examples |
|---|---|---|
| R1-R3 | Core tool calling, multi-step chains | 600+ |
| R4-R5 | Edge cases, delegation rules | 200+ |
| R6 | All features (Team Zora, Enhanced Memory, Presence) | 200+ |
| R7 | Structured JSON reflection (COG-X schema) | 37 |
| R8 | Delegation routing (complex build tasks) | 40 |
| Total | | 1,107 |
- Base model: Qwen3-4B
- Method: LoRA SFT (16 layers, lr=1e-4, 2500 iterations)
- Final val loss: 0.017
- Quantisation: 4-bit (4.5 bits per weight) via MLX
- Hardware: MacBook Pro M5 Max 128GB
- Test result: 8/10 tool calling accuracy
- No personal data — all examples are synthetic
Architecture
```
Qwen3ForCausalLM
+-- 36 layers, 2560 hidden size
+-- 32 attention heads, 8 KV heads (GQA)
+-- 9728 intermediate size (SiLU)
+-- RoPE (theta=1M, max 40960 positions)
+-- 4-bit quantisation (4.5 bits/weight)
+-- TurboQuant PolarQuant KV cache compatible
```
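The numbers above support a back-of-envelope memory estimate. This is a sketch, not a measurement: `HEAD_DIM = 128` is an assumption (Qwen3 decouples head size from `hidden_size / num_heads`), and the KV cache is priced at fp16, i.e. before any TurboQuant compression.

```python
# Back-of-envelope memory estimate from the architecture figures above.
N_PARAMS = 4e9          # nominal "4B" parameter count
BITS_PER_WEIGHT = 4.5   # stated effective quantised width
N_LAYERS = 36
N_KV_HEADS = 8          # GQA
HEAD_DIM = 128          # assumption, see note above
KV_BYTES = 2            # fp16 cache entries (before compression)
MAX_POS = 40960

weights_gb = N_PARAMS * BITS_PER_WEIGHT / 8 / 1e9
# Per token, each layer caches one K and one V vector per KV head.
kv_per_token = N_LAYERS * 2 * N_KV_HEADS * HEAD_DIM * KV_BYTES
kv_full_gb = kv_per_token * MAX_POS / 1e9

print(f"weights ≈ {weights_gb:.2f} GB")
print(f"KV cache ≈ {kv_per_token / 1024:.0f} KiB/token, "
      f"≈ {kv_full_gb:.1f} GB at {MAX_POS} positions (fp16)")
```

Under these assumptions the weights come to about 2.25 GB, which is consistent with the whole stack fitting in the ~7GB envelope quoted earlier once the (compressed) KV cache and runtime overhead are added.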
What makes Zora different
Zora is a personal AI OS — not a chatbot. This brain model is one part of a larger system:
- Real-time nervous system — events from every channel flow through one universal event bus
- Autonomous operator — follow-through engine that owns work across all channels
- Self-improving — LoRA training pipeline runs on your hardware
- Privacy by architecture — all inference on-device, data never leaves your machine
Limitations
- Trained for Zora's orchestrator context — may underperform on general chat benchmarks
- English only
- Best results with the Zora tool/system prompt format
- Not suitable for tasks requiring >40K context
License
Apache 2.0