---
license: apache-2.0
language:
- en
- es
- zh
tags:
- mlx
- tool-calling
- function-calling
- intent-classification
- osirisbrain
- apple-silicon
- qwen3
base_model: Qwen/Qwen3-0.6B
pipeline_tag: text-generation
library_name: mlx
---

# OsirisTalon-v3-0.6B-MLX

**The Talon**: Osiris's ultra-fast tool classifier brain. It runs alongside the main Cortex (9B) on Apple Silicon unified memory via MLX.

## Purpose

Pre-classifies user intent in **<100ms**, selecting the optimal tool and arguments _before_ the main Cortex model processes the request. This eliminates an entire ReAct inference cycle, cutting total response time from ~60-134s to ~25s.

## Architecture

- **Base Model:** Qwen3-0.6B (600M parameters)
- **Format:** MLX 4-bit quantized (Apple Silicon native)
- **Size:** ~335MB
- **Speed:** ~200+ tokens/sec on M2 Pro (MLX Metal)
- **Purpose:** Tool selection, intent classification, complexity rating

## Usage

```python
from mlx_lm import load, generate

# Load the 4-bit MLX weights and tokenizer from the Hub.
model, tokenizer = load("osirisbrain/OsirisTalon-v3-0.6B-MLX")

# Format the user message with the chat template.
# The example prompt is Spanish for "how much disk space do I have".
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "cuanto espacio tengo en disco"}],
    add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=100)
print(response)
```

## Integration

Runs as a dedicated MLX inference server on port 8086, coexisting with llama-server (Cortex 9B) on port 8085. Both share Apple Silicon unified memory without conflict.

## Credits

Rebranded from [mlx-community/Qwen3-0.6B-4bit](https://huggingface.co/mlx-community/Qwen3-0.6B-4bit) for the OsirisBrain sovereign AGI ecosystem. Original model: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) by Alibaba.
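
## Example: consuming a Talon classification

The Purpose and Integration sections describe the Talon picking a tool and its arguments before the Cortex runs. A minimal sketch of the client side might look like the following. Several things here are assumptions, not facts from this model card: that the server on port 8086 exposes an OpenAI-style `/v1/chat/completions` route, that the model emits tool calls in the `<tool_call>{...}</tool_call>` JSON format used by the Qwen3 chat template, and the tool name `disk_usage`, which is purely hypothetical.

```python
import json
import re
import urllib.request
from typing import Optional

# Assumption: OpenAI-style route. The model card only states the port (8086).
TALON_URL = "http://localhost:8086/v1/chat/completions"

# Assumption: tool calls arrive wrapped in <tool_call>...</tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def build_request(user_text: str, max_tokens: int = 100) -> dict:
    """Build a chat payload for the (assumed) Talon endpoint."""
    return {
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # deterministic output suits classification
    }


def parse_tool_call(text: str) -> Optional[dict]:
    """Extract the first tool call from model output, or None if absent/invalid."""
    match = TOOL_CALL_RE.search(text)
    if match is None:
        return None
    try:
        call = json.loads(match.group(1).strip())
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict) or "name" not in call:
        return None
    return {"name": call["name"], "arguments": call.get("arguments", {})}


def classify(user_text: str, timeout: float = 2.0) -> Optional[dict]:
    """Send the text to the Talon server and return the parsed tool call.

    Assumes an OpenAI-style response body (choices[0].message.content).
    """
    body = json.dumps(build_request(user_text)).encode("utf-8")
    req = urllib.request.Request(
        TALON_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        payload = json.load(resp)
    return parse_tool_call(payload["choices"][0]["message"]["content"])
```

When `parse_tool_call` returns `None`, a caller would presumably fall back to letting the Cortex handle the request through its normal ReAct loop, so a missed or malformed classification costs time but not correctness.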