Instructions to use dcostenco/prism-coder-32b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use dcostenco/prism-coder-32b with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="dcostenco/prism-coder-32b", filename="qwen3-30b-a3b-v1-iq4nl.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use dcostenco/prism-coder-32b with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf dcostenco/prism-coder-32b # Run inference directly in the terminal: llama-cli -hf dcostenco/prism-coder-32b
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf dcostenco/prism-coder-32b # Run inference directly in the terminal: llama-cli -hf dcostenco/prism-coder-32b
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf dcostenco/prism-coder-32b # Run inference directly in the terminal: ./llama-cli -hf dcostenco/prism-coder-32b
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf dcostenco/prism-coder-32b # Run inference directly in the terminal: ./build/bin/llama-cli -hf dcostenco/prism-coder-32b
Use Docker
docker model run hf.co/dcostenco/prism-coder-32b
- LM Studio
- Jan
- Ollama
How to use dcostenco/prism-coder-32b with Ollama:
ollama run hf.co/dcostenco/prism-coder-32b
- Unsloth Studio new
How to use dcostenco/prism-coder-32b with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for dcostenco/prism-coder-32b to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for dcostenco/prism-coder-32b to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for dcostenco/prism-coder-32b to start chatting
- Pi new
How to use dcostenco/prism-coder-32b with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf dcostenco/prism-coder-32b
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "dcostenco/prism-coder-32b" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use dcostenco/prism-coder-32b with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf dcostenco/prism-coder-32b
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default dcostenco/prism-coder-32b
Run Hermes
hermes
- Docker Model Runner
How to use dcostenco/prism-coder-32b with Docker Model Runner:
docker model run hf.co/dcostenco/prism-coder-32b
- Lemonade
How to use dcostenco/prism-coder-32b with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull dcostenco/prism-coder-32b
Run and chat with the model
lemonade run user.prism-coder-32b-{{QUANT_TAG}}List all available models
lemonade list
| language: en | |
| license: apache-2.0 | |
| base_model: Qwen/Qwen3-30B-A3B | |
| tags: | |
| - tool-calling | |
| - routing | |
| - aac | |
| - qwen3 | |
| - moe | |
| - gguf | |
| # prism-coder:32b β Tool Routing Model (Desktop Quality Tier) | |
| Fine-tuned Qwen3-30B-A3B (MoE) for 6-tool routing in the [Prism AAC](https://github.com/dcostenco/prism-aac) system. | |
| Quality escalation tier in the desktop cascade: **14B β 32B β cloud Claude**. | |
| > **v5 (May 2026)**: Switched base from dense Qwen3-32B to Qwen3-30B-A3B (MoE). | |
| > Same accuracy, 9 GB smaller, ~4Γ faster inference (only ~3B params active per token). | |
| ## BFCL Routing Benchmark β v7 (Current) | |
| **Mean: 100.0% PERFECT** (3-seed average, seeds 2027/2028/2029, 102 cases each) | |
| | Category | Count | Description | Accuracy | | |
| |----------|------:|-------------|:--------:| | |
| | aac | 12 | AAC phrase requests β plain text | 100% | | |
| | cmpct | 6 | Ledger compaction | 100% | | |
| | edge | 6 | Multi-step / compound requests | 100% | | |
| | hand | 8 | Agent handoff / relay | 100% | | |
| | info | 5 | General facts β plain text | 100% | | |
| | irrel | 10 | Irrelevant / live queries β plain text | 100% | | |
| | know | 7 | Knowledge base search | 100% | | |
| | load | 9 | Session context loading | 100% | | |
| | pred | 8 | Factual / knowledge queries β plain text | 100% | | |
| | save | 13 | Session ledger save | 100% | | |
| | smem | 12 | Session memory search | 100% | | |
| | tran | 6 | Translation requests β plain text | 100% | | |
| All 12 categories at 100%. No remaining failures. | |
| Eval: MLX inference + thinking, temperature=0, 3-seed mean. | |
| Gate: β₯90% = deploy. | |
| ## Full Cascade Benchmark (May 2026) | |
| Individual BFCL scores (MLX, 3 seeds): | |
| | Model | BFCL | Size | Tier | | |
| |-------|------|------|------| | |
| | prism-coder:8b v36 | **100.0% PERFECT** | 4.7 GB | Desktop / Mobile tier | | |
| | prism-coder:14b v36 | **100.0% PERFECT** | 8.4 GB | Desktop primary tier | | |
| | prism-coder:32b v7 | **100.0% PERFECT** | 16 GB | Desktop quality tier | | |
| Cascade eval: **14b β 32b β Claude Opus** (102 cases Γ 3 seeds) | |
| | Metric | Result | | |
| |--------|--------| | |
| | Cascade accuracy | **100.0%** (mean, 3 seeds) | | |
| | Opus-solo etalon | 98.3% | | |
| | Ξ vs Opus | **+1.7%** | | |
| | Traffic served by 14b | **99%** (101/102 cases avg) | | |
| | Traffic escalated to 32b | 1% (1/102 avg) β catches `save live state` β handoff edge case | | |
| | Traffic reaching Opus API | **0%** | | |
| Fine-tuned cascade outperforms Claude Opus on `edge` (+16.7%) and `know` (+14.3%). | |
| ## Version History | |
| | Version | Base | BFCL | Notes | | |
| |---------|------|------|-------| | |
| | v7 (current) | Qwen3-30B-A3B MoE | **100.0% PERFECT** | Fixed: "what do I know + search memory" compound β knowledge_search | | |
| | v6 | Qwen3-30B-A3B MoE | 99.0% | Fixed MoE merge (BF16 safetensors + correct MLXβHF key mapping) | | |
| | v5 | Qwen3-30B-A3B MoE | 97.1% | 18Γ density fix; 9GB smaller, 4Γ faster vs dense | | |
| | v4 | Qwen3-30B-A3B MoE | 92.2% | rank=32 experiment β regressed vs v3 | | |
| | v3 | Qwen3-30B-A3B MoE | 92.5% | 20Γ reps + LR=1e-5 β hit rank bottleneck | | |
| | v2 | Qwen3-30B-A3B MoE | 92.5% | v34 corpus + 1400 iters | | |
| | v33 (dense) | Qwen3-32B dense | 99.0% | Prior generation β larger/slower | | |
| ## Tools | |
| The model routes between exactly 6 tools: | |
| 1. `session_load_context` β load/fetch/resume project context | |
| 2. `session_save_ledger` β note/log/remember/record progress | |
| 3. `session_save_handoff` β handoff/relay to next agent/session | |
| 4. `session_compact_ledger` β compact/archive/shrink ledger | |
| 5. `session_search_memory` β recall past sessions/conversations | |
| 6. `knowledge_search` β search stored notes/knowledge base | |
| ## Files | |
| | File | Size | Use | | |
| |------|------|-----| | |
| | `qwen3-30b-a3b-v7-iq4nl.gguf` | 16 GB | **Current β recommended** | | |
| | `qwen3-30b-a3b-v6-iq4nl.gguf` | 17 GB | Previous (99.0%) | | |
| | `qwen3-30b-a3b-v5-iq4nl.gguf` | 17 GB | Previous (97.1%) | | |
| | `qwen3-32b-v33-q6k.gguf` | 25 GB | Dense predecessor (99.0%, legacy) | | |
| ## Usage (Ollama) | |
| ```bash | |
| ollama run dcostenco/prism-coder:32b | |
| ``` | |
| ## Training | |
| - **Base**: Qwen/Qwen3-30B-A3B (HF BF16, ~57 GB) | |
| - **Adapters**: v6 LoRA (rank=8, scale=10, 8 layers, LR=1e-5) | |
| - **Merge**: Direct safetensors merge on HF BF16 base; delta = (scale/rank) Γ B^T A^T for attn/gate; delta[i] = (scale/rank) Γ B[i] A[i] for MoE experts (128 experts stacked) | |
| - **Key fix**: v5 merge used wrong base (MLX 4-bit, can't apply float LoRA delta) and uppercase regex `lora_[AB]` vs actual lowercase `lora_a`/`lora_b` adapter keys | |
| - **Hardware**: Apple Silicon (M-series, 64 GB RAM) | |