---
license: apache-2.0
language:
- en
- es
- zh
tags:
- mlx
- tool-calling
- function-calling
- intent-classification
- osirisbrain
- apple-silicon
- qwen3
base_model: Qwen/Qwen3-0.6B
pipeline_tag: text-generation
library_name: mlx
---

# OsirisTalon-v3-0.6B-MLX

**The Talon** is Osiris's ultra-fast tool-classifier brain. It runs alongside the main Cortex (9B) model on Apple Silicon unified memory via MLX.

## Purpose

Pre-classifies user intent in **under 100 ms**, selecting the optimal tool and arguments _before_ the main Cortex model processes the request. This eliminates an entire ReAct inference cycle, cutting total response time from roughly 60-134 s to about 25 s.

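As a minimal sketch of how this pre-classification might be consumed by a router: the Talon's output is parsed, and low-complexity intents are dispatched straight to a tool, bypassing the Cortex. The JSON schema (`tool`, `args`, `complexity`) and the `route` helper below are illustrative assumptions, not the model's documented output format.

```python
import json

# Hypothetical Talon output; this schema is an assumption for
# illustration, not the model's documented format.
raw = '{"tool": "disk_usage", "args": {"path": "/"}, "complexity": "low"}'

def route(classification: str) -> str:
    """Return the tool to dispatch to, or "cortex" for complex requests."""
    c = json.loads(classification)
    # Low-complexity intents skip the Cortex's ReAct loop entirely.
    return c["tool"] if c.get("complexity") == "low" else "cortex"

print(route(raw))  # dispatches straight to the disk_usage tool
```
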
## Architecture

- **Base model:** Qwen3-0.6B (600M parameters)
- **Format:** MLX 4-bit quantized (Apple Silicon native)
- **Size:** ~335 MB
- **Speed:** 200+ tokens/sec on an M2 Pro (MLX Metal backend)
- **Purpose:** Tool selection, intent classification, complexity rating

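As a sanity check on the size listed above, 4-bit weights over 600M parameters work out to roughly 300 MB; quantization scales and metadata account for the remainder of the ~335 MB file:

```python
# Back-of-envelope size estimate for 4-bit quantized weights.
params = 0.6e9          # 600M parameters
bits_per_weight = 4     # 4-bit quantization
size_mb = params * bits_per_weight / 8 / 1e6
print(f"{size_mb:.0f} MB")  # 300 MB of raw weights; overhead brings the file to ~335 MB
```
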
## Usage

```python
from mlx_lm import load, generate

model, tokenizer = load("osirisbrain/OsirisTalon-v3-0.6B-MLX")

# Spanish prompt: "how much disk space do I have?"
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "cuanto espacio tengo en disco"}],
    add_generation_prompt=True,
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=100)
print(response)
```

## Integration

Runs as a dedicated MLX inference server on port 8086, coexisting with llama-server (the 9B Cortex) on port 8085. Both share Apple Silicon unified memory without conflict.

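A minimal launch sketch for this two-server setup, assuming `mlx_lm` is installed and using a hypothetical path for the Cortex GGUF; `mlx_lm.server` exposes an OpenAI-compatible HTTP API:

```shell
# Talon: MLX inference server on port 8086
python -m mlx_lm.server --model osirisbrain/OsirisTalon-v3-0.6B-MLX --port 8086 &

# Cortex: llama.cpp server on port 8085 (model path is hypothetical)
llama-server -m ./models/cortex-9b.gguf --port 8085 &
```
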
## Credits

Rebranded from [mlx-community/Qwen3-0.6B-4bit](https://huggingface.co/mlx-community/Qwen3-0.6B-4bit) for the OsirisBrain sovereign AGI ecosystem.
Original model: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) by Alibaba.