---
language: en
license: apache-2.0
tags:
- tool-routing
- function-calling
- prism-coder
- qwen3.5
- synalux
- prompt-engineering
- gguf
base_model: Qwen/Qwen3.5-4B
pipeline_tag: text-generation
---

# prism-coder:4b - Prism Memory Tool Router

Prompt-engineered [Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) for MCP tool routing in the [Prism Coder](https://ollama.com/dcostenco/prism-coder) system. No fine-tuning: the system prompt IS the specialization.

## Downloads

| File | Quantization | Size | BFCL Accuracy | Use when |
|------|-------------|------|---------------|----------|
| `Qwen3.5-4B-Q3_K_M.gguf` | Q3_K_M | **2.3 GB** | **99.1%** (3 seeds) | **iPhone / mobile first gate** |
| *(stock via Ollama)* | Q4_K_M | 3.4 GB | **100%** (3 seeds) | Mac / 8 GB+ devices |
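
The "Use when" column reduces to a simple memory rule. The shell function below is a minimal sketch of that rule. It assumes that total device RAM in whole GB is the only deciding factor and takes its 8 GB cutoff from the table above. `pick_prism_tag` is a hypothetical helper and is not part of Ollama or Prism.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map total device RAM (whole GB) to an Ollama tag.
# ASSUMPTION: RAM is the only criterion; the 8 GB cutoff comes from the Downloads table.
pick_prism_tag() {
  local ram_gb="$1"
  if (( ram_gb >= 8 )); then
    echo "dcostenco/prism-coder:4b"   # Q4_K_M, 3.4 GB, 100% BFCL
  else
    echo "dcostenco/prism-coder:2b"   # Q3_K_M, 2.3 GB, 99.1% BFCL
  fi
}

pick_prism_tag 6    # prints dcostenco/prism-coder:2b
pick_prism_tag 16   # prints dcostenco/prism-coder:4b
```

On macOS, `sysctl -n hw.memsize` reports total RAM in bytes. Running `ollama pull "$(pick_prism_tag $(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 )))"` would therefore pull the tag that matches the machine.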

## Quick Start

```bash
# iPhone-optimized: Qwen3.5-4B at Q3_K_M (2.3 GB, 99.1% BFCL).
# The :2b tag is a Q3_K_M quantization of the 4B model, not a 2B-parameter model.
ollama pull dcostenco/prism-coder:2b

# Full quality: Qwen3.5-4B at Q4_K_M (3.4 GB, 100% BFCL)
ollama pull dcostenco/prism-coder:4b
```

## BFCL Benchmark

### Q3_K_M (prism-coder:2b): 99.1% across 3 seeds

**114/115 × 3 shuffled runs = 99.1%, 1 flaky case**

| Category | Count | Accuracy |
|----------|------:|:--------:|
| save | 17 | 100% |
| smem | 17 | 100% |
| aac | 12 | 100% |
| hand | 12 | 100% |
| irrel | 10 | 90% |
| load | 9 | 100% |
| pred | 8 | 100% |
| know | 7 | 100% |
| cmpct | 6 | 100% |
| edge | 6 | 100% |
| tran | 6 | 100% |
| info | 5 | 100% |

Single failure (`irrel` category): "Write a regex to match email addresses" was routed to `knowledge_search` instead of being answered as plain text with no tool call.
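
As a sanity check, you can recompute the overall score from the per-category table. Each category contributes `count × accuracy` correct cases. The sketch below does the sum with `awk` on a copy of the table above. `bfcl_overall` is a hypothetical helper and is not part of the BFCL harness.

```bash
#!/usr/bin/env bash
# Per-category results for Q3_K_M (prism-coder:2b): category, cases, accuracy (%).
BFCL_Q3_TABLE='save 17 100
smem 17 100
aac 12 100
hand 12 100
irrel 10 90
load 9 100
pred 8 100
know 7 100
cmpct 6 100
edge 6 100
tran 6 100
info 5 100'

# Hypothetical helper: total the cases and the correct cases,
# then print "correct/total = percentage".
bfcl_overall() {
  printf '%s\n' "$1" | awk '
    { total += $2; correct += $2 * $3 / 100 }
    END { printf "%d/%d = %.1f%%\n", correct, total, 100 * correct / total }
  '
}

bfcl_overall "$BFCL_Q3_TABLE"   # prints 114/115 = 99.1%
```

A single `all 115 100` row reproduces the Q4_K_M result of `115/115 = 100.0%`.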

### Q4_K_M (prism-coder:4b): 100% across 3 seeds

**115/115 × 3 shuffled runs = 100.0%, 0 flaky**

## Architecture

Qwen3.5-4B uses a hybrid attention architecture:
- **24 linear attention layers** (Gated DeltaNet): O(n) cost in sequence length
- **8 full attention layers** (standard softmax): precise retrieval

This hybrid design is likely why prompt-only routing works at 4B scale but not at smaller sizes. The 8 full-attention layers appear sufficient to hold the routing rules, while the DeltaNet layers contribute pattern matching.
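
A small sketch makes the layer split concrete. For illustration only, it assumes the 8 full-attention layers are interleaved 3:1 with the DeltaNet layers, so every fourth layer is full attention. The real order is defined by the upstream Qwen3.5-4B configuration. The sketch also shows why the linear layers help with long prompts. Only full-attention layers keep a KV cache that grows with context length, while DeltaNet layers carry a fixed-size state. With 8 of 32 layers, that growth is a quarter of what an all-softmax stack with the same per-layer cache would need. `hybrid_layout` and `kv_growth_share` are hypothetical helpers.

```bash
#!/usr/bin/env bash
# Sketch of a 32-layer hybrid stack: 24 Gated DeltaNet (L) + 8 full-attention (F) layers.
# ASSUMPTION: full attention every 4th layer (3:1 interleave); the real order
# comes from the upstream model config, not from this script.
NUM_LAYERS=32
FULL_EVERY=4

hybrid_layout() {
  local i out=""
  for (( i = 1; i <= NUM_LAYERS; i++ )); do
    if (( i % FULL_EVERY == 0 )); then out+="F"; else out+="L"; fi
  done
  printf '%s\n' "$out"
}

# Only F layers keep a KV cache that grows with context length;
# L layers keep a fixed-size recurrent state. Report the F share.
kv_growth_share() {
  local layout full
  layout="$(hybrid_layout)"
  full="${layout//L/}"   # drop every L, leaving only the F characters
  awk -v f="${#full}" -v n="$NUM_LAYERS" \
    'BEGIN { printf "%d/%d layers keep a growing KV cache (%.0f%%)\n", f, n, 100 * f / n }'
}

hybrid_layout     # prints LLLF repeated 8 times
kv_growth_share   # prints 8/32 layers keep a growing KV cache (25%)
```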

## Fleet Position

| Model | Ollama tag | Size | BFCL | Role |
|---|---|---|---|---|
| **Qwen3.5-4B Q3_K_M** | **`dcostenco/prism-coder:2b`** | **2.3 GB** | **99.1%** | **iPhone / mobile** |
| Qwen3.5-4B Q4_K_M | `dcostenco/prism-coder:4b` | 3.4 GB | 100% | Verifier / 8 GB+ |
| Qwen3.5-9B Q4_K_M | `dcostenco/prism-coder:9b` | 5.8 GB | 100% | Default router |
| prism-coder:32b | `dcostenco/prism-coder:32b` | 19 GB | 100% | Complex tasks |
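
Read as configuration, the fleet table is a role-to-tag lookup. The sketch below encodes it as a `case` statement. The role keys (`mobile`, `verifier`, `router`, `complex`) and the `prism_tag_for_role` function are hypothetical names. This card does not describe how the Prism MCP server actually chooses between tiers.

```bash
#!/usr/bin/env bash
# Hypothetical role keys mapped to the Ollama tags in the Fleet Position table.
prism_tag_for_role() {
  case "$1" in
    mobile)   echo "dcostenco/prism-coder:2b" ;;   # Qwen3.5-4B Q3_K_M, 2.3 GB
    verifier) echo "dcostenco/prism-coder:4b" ;;   # Qwen3.5-4B Q4_K_M, 3.4 GB
    router)   echo "dcostenco/prism-coder:9b" ;;   # Qwen3.5-9B Q4_K_M, 5.8 GB (default)
    complex)  echo "dcostenco/prism-coder:32b" ;;  # 19 GB, complex tasks
    *)        echo "unknown role: $1" >&2; return 1 ;;
  esac
}

# Fall back to the default router tier when no role argument is given.
prism_tag_for_role "${1:-router}"
```

For example, `ollama pull "$(prism_tag_for_role verifier)"` would fetch the Q4_K_M build.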

## Links

- [Ollama model page](https://ollama.com/dcostenco/prism-coder): pull and run
- [Prism MCP Server](https://github.com/dcostenco/prism-coder): the MCP server
- [Qwen3.5-4B base](https://huggingface.co/Qwen/Qwen3.5-4B): upstream model