Use with the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="dcostenco/prism-coder-32b",
	filename="qwen3-32b-v31-q4km.gguf",
)
llm.create_chat_completion(
	messages = [
		{"role": "user", "content": "Write a Python function that reverses a string."}
	]
)

prism-coder:32b – AAC Tool Router + Coder (32B)

Fine-tuned from Qwen3-32B for tool routing and advanced code assistance in the Prism AAC system.

BFCL accuracy: 99% on a 100-case routing benchmark. Quality escalation tier in the desktop cascade – catches the ~1-3% of cases where the 14B model is uncertain.

What it does

  • Perfect tool routing on all tested categories
  • Advanced code generation and architecture assistance
  • Complex multi-step session management
  • Final local quality gate before cloud Claude
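As an illustration of the tool-routing role, a router-style request can be expressed with OpenAI-style tool definitions, which llama-cpp-python's `create_chat_completion` accepts via its `tools` parameter. The AAC tool schema below is hypothetical, for illustration only; it is not part of this model release:

```python
# Hypothetical AAC tool schema (illustrative only; not shipped with the model).
# A router model is given tool definitions like these and must pick the right one.
tools = [
    {
        "type": "function",
        "function": {
            "name": "speak_text",
            "description": "Speak a phrase aloud through the AAC device.",
            "parameters": {
                "type": "object",
                "properties": {"text": {"type": "string"}},
                "required": ["text"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "generate_code",
            "description": "Generate or refactor code for the user.",
            "parameters": {
                "type": "object",
                "properties": {"task": {"type": "string"}},
                "required": ["task"],
            },
        },
    },
]

messages = [{"role": "user", "content": "Please say 'hello' out loud."}]

# With a loaded model, the routed call would look like:
# resp = llm.create_chat_completion(messages=messages, tools=tools, tool_choice="auto")
tool_names = [t["function"]["name"] for t in tools]
print(tool_names)  # -> ['speak_text', 'generate_code']
```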

Deployment

Available on Ollama Hub (recommended – avoids an 18GB direct download for Ollama users):

ollama run dcostenco/prism-coder:32b

Or pull manually with the GGUF from this repo when available.
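Once pulled, the model can also be called programmatically through Ollama's local REST API (default port 11434). A minimal sketch, assuming a running Ollama daemon; the request is only constructed here, not sent:

```python
import json

# Chat request for Ollama's local REST API (POST http://localhost:11434/api/chat).
# Only constructed here -- sending it requires Ollama running with the model pulled.
payload = {
    "model": "dcostenco/prism-coder:32b",
    "messages": [
        {"role": "user", "content": "Refactor this function to be iterative."}
    ],
    "stream": False,  # return one complete response instead of a token stream
}
body = json.dumps(payload)

# To actually send it:
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:11434/api/chat",
#     data=body.encode(),
#     headers={"Content-Type": "application/json"},
# )
# resp = json.loads(urllib.request.urlopen(req).read())
```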

Cascade position

Desktop cascade: 14B → 32B (escalation) → cloud Claude

When the 14B model returns a low-confidence result or fails outright, the 32B model is invoked automatically. Users with Ollama running get 32B as their local ceiling before the cloud tier.
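The escalation behavior can be sketched as follows. This is a minimal illustration only: the confidence threshold, function names, and failure handling are assumptions, not the actual Prism implementation:

```python
CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff; the real cascade's value is not published


def cascade_route(query, run_14b, run_32b, run_cloud):
    """Try the 14B model first; escalate to 32B on low confidence or
    failure, then to cloud Claude as the final tier."""
    try:
        answer, confidence = run_14b(query)
        if confidence >= CONFIDENCE_THRESHOLD:
            return answer, "14b"
    except Exception:
        pass  # treat a 14B failure like low confidence
    try:
        answer, confidence = run_32b(query)
        if confidence >= CONFIDENCE_THRESHOLD:
            return answer, "32b"
    except Exception:
        pass
    return run_cloud(query), "cloud"


# Stub backends to show the flow:
result, tier = cascade_route(
    "hard query",
    run_14b=lambda q: ("draft", 0.4),    # uncertain -> escalate
    run_32b=lambda q: ("better", 0.95),  # confident -> stop here
    run_cloud=lambda q: "cloud answer",
)
print(tier)  # -> 32b
```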

Training

  • Base: Qwen3-32B
  • Method: MLX LoRA fine-tuning (v28-codebase + routing)
  • Hardware: Apple Silicon (M-series, 64GB RAM)
  • Eval: BFCL routing 99% (11/11 on manual benchmark)

Note on GGUF

The full Q4_K_M GGUF is 18GB. It is distributed via Ollama Hub at dcostenco/prism-coder:32b to avoid large download overhead. Direct GGUF will be added here in a future release.
