OsirisTalon-v3-0.6B-MLX

The Talon — Osiris's ultra-fast tool classifier brain. Runs alongside the main Cortex (9B) on Apple Silicon unified memory via MLX.

Purpose

Pre-classifies user intent in <100ms, selecting the optimal tool and arguments before the main Cortex model processes the request. This eliminates an entire ReAct inference cycle, cutting total response time from ~60-134s to ~25s.
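As a sketch of this hand-off (the JSON schema below is an assumption for illustration; the model card does not document Talon's exact output format), the Cortex can consume Talon's completion as a ready-made tool call, and fall back to a full ReAct cycle only when the classifier's output does not parse:

```python
import json

def parse_talon(raw: str):
    """Parse Talon's completion into (tool, args).

    Returns (None, {}) when the output is not the expected JSON,
    signalling that the 9B Cortex should run a full ReAct cycle instead.
    The schema here (tool/args keys) is hypothetical.
    """
    try:
        data = json.loads(raw)
        return data["tool"], data.get("args", {})
    except (json.JSONDecodeError, KeyError, TypeError):
        return None, {}

# Example: a hypothetical Talon completion for a disk-space question
talon_output = '{"tool": "disk_usage", "args": {"path": "/"}, "complexity": "low"}'
tool, args = parse_talon(talon_output)
print(tool, args)
```

With a valid completion the Cortex skips straight to executing `disk_usage` with the given arguments; malformed output degrades gracefully to the slower path rather than failing.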

Architecture

  • Base Model: Qwen3-0.6B (600M parameters)
  • Format: MLX 4-bit quantized (Apple Silicon native)
  • Size: ~335MB
  • Speed: 200+ tokens/sec on M2 Pro (MLX Metal backend)
  • Purpose: Tool selection, intent classification, complexity rating

Usage

from mlx_lm import load, generate

# Load the 4-bit MLX weights and tokenizer from the Hub
model, tokenizer = load("osirisbrain/OsirisTalon-v3-0.6B-MLX")

# Build a chat prompt; the example request is Spanish for
# "how much disk space do I have?"
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "cuanto espacio tengo en disco"}],
    add_generation_prompt=True,
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=100)

Integration

Runs as a dedicated MLX inference server on port 8086, coexisting with llama-server (Cortex 9B) on port 8085. Both share Apple Silicon unified memory without conflict.
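A minimal launch sketch for this two-server setup, assuming the standard `mlx_lm.server` and `llama-server` CLIs (the Cortex GGUF filename below is a placeholder, not a published artifact):

```shell
# Talon (0.6B classifier) on port 8086 via the MLX OpenAI-compatible server
python -m mlx_lm.server --model osirisbrain/OsirisTalon-v3-0.6B-MLX --port 8086

# Cortex (9B) on port 8085 via llama.cpp's llama-server
llama-server -m cortex-9b.gguf --port 8085
```

Both servers expose OpenAI-compatible HTTP endpoints, so a router can send the same chat payload to port 8086 for classification and to port 8085 for the main completion.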

Credits

Rebranded from mlx-community/Qwen3-0.6B-4bit for the OsirisBrain sovereign AGI ecosystem. Original model: Qwen/Qwen3-0.6B by Alibaba.
