dcostenco
/

prism-coder-2b

Text Generation

function-calling

prompt-engineering

Model card Files Files and versions

prism-coder-2b / README.md

dcostenco's picture

Upload README.md with huggingface_hub

0d24c05 verified 21 days ago

|

History Blame Contribute Delete

2.72 kB

	---
	language: en
	license: apache-2.0
	tags:
	- tool-routing
	- function-calling
	- prism-coder
	- qwen3.5
	- synalux
	- prompt-engineering
	- gguf
	base_model: Qwen/Qwen3.5-4B
	pipeline_tag: text-generation
	---

	# prism-coder:4b — Prism Memory Tool Router

	Prompt-engineered [Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) for MCP tool routing in the [Prism Coder](https://ollama.com/dcostenco/prism-coder) system. No fine-tuning — the system prompt IS the specialization.

	## Downloads

	\| File \| Quantization \| Size \| BFCL Accuracy \| Use when \|
	\|------\|-------------\|------\|---------------\|----------\|
	\| `Qwen3.5-4B-Q3_K_M.gguf` \| Q3_K_M \| 2.3 GB \| 99.1% × 3 seeds \| iPhone / mobile first gate \|
	\| (stock via Ollama) \| Q4_K_M \| 3.4 GB \| 100% × 3 seeds \| Mac / 8 GB+ devices \|

	## Quick Start

	```bash
	# iPhone-optimized (2.3 GB, 99.1%)
	ollama pull dcostenco/prism-coder:2b

	# Full quality (3.4 GB, 100%)
	ollama pull dcostenco/prism-coder:4b
	```

	## BFCL Benchmark

	### Q3_K_M (prism-coder:2b) — 99.1% × 3 seeds

	114/115 × 3 shuffled runs = 99.1%, 1 flaky case

	\| Category \| Count \| Accuracy \|
	\|----------\|------:\|:--------:\|
	\| save \| 17 \| 100% \|
	\| smem \| 17 \| 100% \|
	\| aac \| 12 \| 100% \|
	\| hand \| 12 \| 100% \|
	\| irrel \| 10 \| 90% \|
	\| load \| 9 \| 100% \|
	\| pred \| 8 \| 100% \|
	\| know \| 7 \| 100% \|
	\| cmpct \| 6 \| 100% \|
	\| edge \| 6 \| 100% \|
	\| tran \| 6 \| 100% \|
	\| info \| 5 \| 100% \|

	Single failure: "Write a regex to match email addresses" → knowledge_search instead of plain.

	### Q4_K_M (prism-coder:4b) — 100% × 3 seeds

	115/115 × 3 shuffled runs = 100.0%, 0 flaky

	## Architecture

	Qwen3.5-4B uses a hybrid attention architecture:
	- 24 linear attention layers (Gated DeltaNet) — O(n) inference
	- 8 full attention layers (standard softmax) — precise retrieval

	This hybrid design is why prompt-only routing works at 4B scale but not smaller. The 8 full-attention layers are sufficient to hold the routing rules when combined with the DeltaNet layers' pattern matching.

	## Fleet Position

	\| Model \| Ollama tag \| Size \| BFCL \| Role \|
	\|---\|---\|---\|---\|---\|
	\| Qwen3.5-4B Q3_K_M \| `dcostenco/prism-coder:2b` \| 2.3 GB \| 99.1% \| iPhone / mobile \|
	\| Qwen3.5-4B Q4_K_M \| `dcostenco/prism-coder:4b` \| 3.4 GB \| 100% \| Verifier / 8 GB+ \|
	\| Qwen3.5-9B Q4_K_M \| `dcostenco/prism-coder:9b` \| 5.8 GB \| 100% \| Default router \|
	\| prism-coder:32b \| `dcostenco/prism-coder:32b` \| 19 GB \| 100% \| Complex tasks \|

	## Links

	- [Ollama model page](https://ollama.com/dcostenco/prism-coder) — pull and run
	- [Prism MCP Server](https://github.com/dcostenco/prism-coder) — the MCP server
	- [Qwen3.5-4B base](https://huggingface.co/Qwen/Qwen3.5-4B) — upstream model