# OsirisTalon-v3-0.6B-MLX

The Talon is Osiris's ultra-fast tool-classification brain. It runs alongside the main Cortex (9B) model on Apple Silicon unified memory via MLX.
## Purpose
Pre-classifies user intent in <100ms, selecting the optimal tool and arguments before the main Cortex model processes the request. This eliminates an entire ReAct inference cycle, cutting total response time from ~60-134s to ~25s.
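The two-stage flow can be sketched as below. This is a minimal illustration, not the actual OsirisBrain interface: `classify_intent`, the tool-hint JSON shape, and the prompt wiring are all assumptions.

```python
import json

# Hypothetical sketch of the two-stage pipeline: the 0.6B Talon picks a
# tool first, so the 9B Cortex can answer in a single pass instead of
# spending a full ReAct turn deciding which tool to call.

def classify_intent(user_query: str) -> dict:
    """Stand-in for a Talon call; a real call would hit the MLX server."""
    # Assumed output shape -- the real model's format may differ.
    return {"tool": "disk_usage", "args": {}, "complexity": "low"}

def answer(user_query: str) -> str:
    hint = classify_intent(user_query)
    # The tool hint is injected into the Cortex prompt, skipping the
    # "think -> pick tool" ReAct cycle entirely.
    system = f"Use tool {hint['tool']} with args {json.dumps(hint['args'])}."
    return system  # a real implementation would call the 9B model here

print(answer("how much disk space do I have"))
```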
## Architecture
- Base Model: Qwen3-0.6B (600M parameters)
- Format: MLX 4-bit quantized (Apple Silicon native)
- Size: ~335MB
- Speed: 200+ tokens/sec on M2 Pro (MLX Metal)
- Purpose: Tool selection, intent classification, complexity rating
## Usage
```python
from mlx_lm import load, generate

model, tokenizer = load("osirisbrain/OsirisTalon-v3-0.6B-MLX")

# Example query (Spanish): "how much disk space do I have"
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "cuanto espacio tengo en disco"}],
    add_generation_prompt=True,
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=100)
```
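The raw generation can then be parsed into a tool call. The JSON schema below is an assumed example for illustration, not a documented output contract of this model:

```python
import json

# Assumed example of a Talon generation; the real output format may differ.
raw = '{"tool": "disk_usage", "args": {"path": "/"}, "complexity": "low"}'

try:
    call = json.loads(raw)
except json.JSONDecodeError:
    # On malformed output, fall back to the full ReAct loop in the Cortex.
    call = {"tool": None, "args": {}}

print(call["tool"], call["args"])
```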
## Integration
Runs as a dedicated MLX inference server on port 8086, coexisting with llama-server (Cortex 9B) on port 8085. Both share Apple Silicon unified memory without conflict.
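If the Talon is served with `mlx_lm.server` (which exposes an OpenAI-compatible `/v1/chat/completions` endpoint), a client can query it on port 8086 while the Cortex stays on 8085. A minimal client sketch, assuming that port and the default endpoint path:

```python
import json
import urllib.request

# Assumed port for the Talon server (e.g. started with
# `python -m mlx_lm.server --model osirisbrain/OsirisTalon-v3-0.6B-MLX --port 8086`).
TALON_URL = "http://localhost:8086/v1/chat/completions"

def build_request(query: str) -> urllib.request.Request:
    """Build an OpenAI-style chat-completion request for the Talon."""
    payload = {
        "model": "osirisbrain/OsirisTalon-v3-0.6B-MLX",
        "messages": [{"role": "user", "content": query}],
        "max_tokens": 100,
    }
    return urllib.request.Request(
        TALON_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_request("how much disk space do I have")
# urllib.request.urlopen(req) would send it once the server is running
print(req.full_url)
```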
## Credits
Rebranded from mlx-community/Qwen3-0.6B-4bit for the OsirisBrain sovereign AGI ecosystem. Original model: Qwen/Qwen3-0.6B by Alibaba.