Add training/TRAINING_DECISIONS_4B_V43.md
training/TRAINING_DECISIONS_4B_V43.md
ADDED
@@ -0,0 +1,149 @@
# Prism Coder 4B v43: Training Decisions & Reuse Guide

> Apply these decisions to the 8B, 14B, and 32B training runs. The patterns are size-agnostic.

## Architecture: Tiered Model Deployment

| Model | Device RAM | Role | Verifier? |
|-------|-----------|------|-----------|
| 1.7B | ≥3GB free | Primary agent (low-mem) | Self (or none) |
| 4B | ≥8GB free | Primary agent (mid-tier) | 4B self-verifier |
| 8B | ≥16GB free | Primary agent (high-tier) | 4B or self |
| 14B/32B | ≥24GB free | Primary agent (pro) | 4B |

**Key decision**: 1.7B stays on low-memory devices (phones, older Macs). 4B and above all serve the same Prism Memory tool-calling purpose: larger models handle edge cases better but train on the same corpus shape.

**Verifier tier**: Configured via the `PRISM_VERIFIER_MODEL` env var (default: `prism-coder:1b7`). Set it to `prism-coder:4b` on devices with ≥8GB free for higher accuracy at ~3× the latency. The verifier call runs post-draft in `chat-verifier.ts::verifyOrRefuse()`.
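The tier table and the verifier setting can be read as a small lookup. The sketch below is only an illustration in Python: the real logic lives in `chat-verifier.ts`, and the names `pick_primary_tier` and `resolve_verifier_model` are hypothetical. It assumes the largest tier whose RAM floor is met is the one chosen.

```python
import os

# Thresholds copied from the tier table: (minimum free RAM in GB, model tier).
# Ordered largest first so the first match is the biggest tier that fits.
TIERS = [
    (24, "14B/32B"),
    (16, "8B"),
    (8, "4B"),
    (3, "1.7B"),
]

DEFAULT_VERIFIER = "prism-coder:1b7"  # default stated in this guide


def pick_primary_tier(free_ram_gb):
    """Return the largest tier whose RAM floor is met, or None below 3GB."""
    for min_gb, tier in TIERS:
        if free_ram_gb >= min_gb:
            return tier
    return None


def resolve_verifier_model(env=None):
    """Read PRISM_VERIFIER_MODEL, falling back to the documented default."""
    env = os.environ if env is None else env
    return env.get("PRISM_VERIFIER_MODEL") or DEFAULT_VERIFIER


print(pick_primary_tier(10))  # 4B
print(resolve_verifier_model({"PRISM_VERIFIER_MODEL": "prism-coder:4b"}))
```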

---

## Training Hyperparameters (Validated for 4B, Scale for Others)

| Param | 4B v43 value | 8B guidance | 14B/32B guidance |
|-------|-------------|-------------|-----------------|
| Base model | Qwen/Qwen3-4B | Qwen/Qwen3-8B | Qwen/Qwen3-14B / Qwen/Qwen3-32B |
| LoRA rank | 32 | 32 | 32 (or 16 for speed) |
| LoRA alpha | 64 (scale=2.0) | 64 | 64 |
| LoRA layers | 16 of 36 | 16 of 36 | 16 of 48 |
| Batch size | 2 | 2 | 1 (grad-checkpoint) |
| Grad checkpoint | yes | yes | yes |
| Seq length | 2048 | 2048 | 2048 |
| LR (initial) | 1e-4 | 1e-4 | 5e-5 |
| LR (surgical patch) | 3e-5 | 2e-5 | 1e-5 |
| Iters (full run) | 2000 | 2000 | 1500 |
| Iters (patch) | 250 | 200 | 150 |
| Val batches | 25 | 25 | 25 |
| Save every | 200 | 200 | 200 |

**Critical**: A 1e-4 LR on a surgical patch (a small delta corpus appended to a large existing corpus) causes catastrophic interference. Validated fix: 3e-5 for 4B. Scale down proportionally for larger models.
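To make the full-run versus patch distinction harder to get wrong, the table values can be kept in one lookup. This is a sketch, not the training script itself: the dictionary keys and the function names `lora_scale` and `run_settings` are made up for illustration, and only the learning rate and iteration counts are transcribed from the table.

```python
# Values transcribed from the hyperparameter table.
# The keys ("4B", "8B", "14B/32B") are labels chosen for this sketch.
HPARAMS = {
    "4B": {"lr_full": 1e-4, "lr_patch": 3e-5, "iters_full": 2000, "iters_patch": 250},
    "8B": {"lr_full": 1e-4, "lr_patch": 2e-5, "iters_full": 2000, "iters_patch": 200},
    "14B/32B": {"lr_full": 5e-5, "lr_patch": 1e-5, "iters_full": 1500, "iters_patch": 150},
}

LORA_RANK = 32
LORA_ALPHA = 64


def lora_scale(alpha=LORA_ALPHA, rank=LORA_RANK):
    """LoRA scale as used in this guide: alpha / rank (64 / 32 = 2.0)."""
    return alpha / rank


def run_settings(size, surgical_patch):
    """Pick the learning rate and iteration count for a full run or a patch."""
    row = HPARAMS[size]
    if surgical_patch:
        return {"learning_rate": row["lr_patch"], "iters": row["iters_patch"]}
    return {"learning_rate": row["lr_full"], "iters": row["iters_full"]}


print(lora_scale())              # 2.0
print(run_settings("4B", True))  # patch settings for the 4B model
```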

---

## Corpus Format (MUST match for all model sizes)

All rows must use the `"text"` key with pre-rendered ChatML strings:
```json
{"text": "<|im_start|>system\n...<|im_end|>\n<|im_start|>user\n...<|im_end|>\n<|im_start|>assistant\n...<|im_end|>"}
```

**Never use** the `"messages"` key format: mlx_lm auto-detects the format from the first row and crashes on mixed files.
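If source data arrives as role/content message lists, it can be pre-rendered into the `"text"` form before it ever reaches the corpus. The following is a minimal sketch under that assumption; `render_chatml`, `to_text_row`, and `assert_text_only` are hypothetical helper names, not functions from the corpus builders.

```python
import json


def render_chatml(messages):
    """Render role/content dicts into the ChatML layout shown in this guide."""
    parts = []
    for msg in messages:
        parts.append("<|im_start|>" + msg["role"] + "\n" + msg["content"] + "<|im_end|>")
    return "\n".join(parts)


def to_text_row(messages):
    """Serialize one conversation as a single JSONL line with only a "text" key."""
    return json.dumps({"text": render_chatml(messages)}, ensure_ascii=False)


def assert_text_only(jsonl_lines):
    """Fail fast if any row uses a key other than "text" (for example "messages")."""
    for n, line in enumerate(jsonl_lines, start=1):
        row = json.loads(line)
        if set(row) != {"text"}:
            raise ValueError(f"row {n} has keys {sorted(row)}; expected only 'text'")


example = [
    {"role": "system", "content": "You are a helpful agent."},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there."},
]
line = to_text_row(example)
assert_text_only([line])
print(line)
```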

Tool call format in training data:
```
<tool_call>
{"name": "tool_name", "arguments": {...}}
</tool_call>
```
**No pipes**: the tag is `<tool_call>`, not `<|tool_call|>`. The eval harness (`bfcl_eval.py`, `swe_bench_test.py`) must use the same format.
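A parser that accepts only the no-pipe tags keeps the harness and the training data aligned. The sketch below is one way to do that with a regular expression; `extract_tool_calls` is a hypothetical name, and the actual parsing in `bfcl_eval.py` may differ.

```python
import json
import re

# Matches only the no-pipe form; <|tool_call|> is deliberately not accepted.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text):
    """Return the JSON payload of every <tool_call> block found in text."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        calls.append(json.loads(match.group(1)))
    return calls


sample = '<tool_call>\n{"name": "tool_name", "arguments": {"key": "value"}}\n</tool_call>'
print(extract_tool_calls(sample))
```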

Think block format:
```
<|synalux_think|>reasoning here</|synalux_think|>
```
Use string literals for both the open and close tags. Do NOT use f-string escaping for the close tag.
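One way to follow the string-literal rule is to define both tags once as constants and join them by plain concatenation. This is a sketch; the constant and function names are illustrative and do not come from the corpus builders.

```python
# Both tags are plain string literals, defined once and reused everywhere.
THINK_OPEN = "<|synalux_think|>"
THINK_CLOSE = "</|synalux_think|>"


def wrap_think(reasoning):
    """Wrap reasoning text in a think block using plain concatenation."""
    return THINK_OPEN + reasoning + THINK_CLOSE


def strip_think(text):
    """Remove the first think block, returning the remaining text."""
    start = text.find(THINK_OPEN)
    end = text.find(THINK_CLOSE)
    if start == -1 or end == -1 or end < start:
        return text
    return text[:start] + text[end + len(THINK_CLOSE):]


print(wrap_think("reasoning here"))
```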

---

## Merge: mlx_lm.fuse is Broken for GGUF

`mlx_lm.fuse` silently loses LoRA weights during GGUF conversion. Use the `merge_4b_v43.py` pattern instead:

```python
# delta = scale * (A @ B), where scale = alpha / rank (pre-computed in adapter_config.json)
delta = scale * (A_matrix @ B_matrix)
merged_weight = base_weight + delta
```

Script: `merge_4b_v43.py`. Adapt it for 8B/14B/32B by changing the base model path.
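The merge arithmetic can be checked on toy matrices before trusting it on real weights. The sketch below uses plain Python lists so it runs without dependencies; a real merge script would use an array library. It assumes `A` has shape (out, rank) and `B` has shape (rank, in) so that `A @ B` matches the base weight; if a framework stores the adapter matrices transposed, the product must be transposed accordingly. `matmul` and `merge_lora` are hypothetical helpers, not the contents of `merge_4b_v43.py`.

```python
def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    inner = len(y)
    cols = len(y[0])
    return [
        [sum(x[i][k] * y[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(len(x))
    ]


def merge_lora(base_weight, lora_a, lora_b, alpha=64, rank=32):
    """Return base_weight + (alpha / rank) * (lora_a @ lora_b)."""
    scale = alpha / rank
    ab = matmul(lora_a, lora_b)
    if len(ab) != len(base_weight) or len(ab[0]) != len(base_weight[0]):
        raise ValueError("adapter product shape does not match base weight")
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base_weight, ab)
    ]


# Toy example: 2x2 base weight, rank-1 adapter, alpha=2 so scale = 2.0.
base = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0], [2.0]]   # shape (out=2, rank=1)
b = [[0.5, 0.25]]    # shape (rank=1, in=2)
print(merge_lora(base, a, b, alpha=2, rank=1))  # [[2.0, 0.5], [2.0, 2.0]]
```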

---

## Layer 3 (Inference-Time Remapping): Apply Before Training Patches

Before writing a corpus patch, check whether the failure is fixable by Layer 3 rules in `bfcl_eval.py::apply_layer3()`. Layer 3 fixes:
- Tool name remapping (semantic similarity false positives)
- Format normalization (pipe vs no-pipe)
- Context-based disambiguation (backfill_links vs synthesize_edges)
- Abstention for general programming/CS questions

Layer 3 is **zero-cost** (no training needed) and **regression-proof**. Use training patches only for failures that Layer 3 cannot fix.
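Two of the Layer 3 fixes listed above, format normalization and tool name remapping, are simple enough to sketch. This is not the body of `apply_layer3()`: the function names, the remap table, and the two piped close-tag spellings it accepts are all assumptions made for illustration.

```python
import re

# Assumed piped spellings: <|tool_call|>, </|tool_call|>, and <|/tool_call|>.
PIPED_TAG_RE = re.compile(r"<(/?)\|(/?)tool_call\|>")


def normalize_tool_tags(text):
    """Rewrite piped tool-call tags to the no-pipe training format."""
    def fix(match):
        is_close = bool(match.group(1) or match.group(2))
        return "</tool_call>" if is_close else "<tool_call>"
    return PIPED_TAG_RE.sub(fix, text)


def remap_tool_name(call, remap):
    """Return a copy of a parsed call with its name remapped when a rule exists."""
    fixed = dict(call)
    fixed["name"] = remap.get(call["name"], call["name"])
    return fixed


# Placeholder rule; real remap rules would come from observed eval failures.
rules = {"wrong_tool": "right_tool"}
print(normalize_tool_tags('<|tool_call|>{"name": "wrong_tool"}</|tool_call|>'))
print(remap_tool_name({"name": "wrong_tool", "arguments": {}}, rules))
```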

---

## Corpus Mix Ratios (v2, Verified)

| Category | Target % | Notes |
|----------|----------|-------|
| Tool-use (Prism Memory) | ~36% | All 29 tools, param extraction, multi-turn |
| AAC / clinical | ~40% | Critical: prevents mode collapse |
| Abstention | ~12% | CS/general questions → no tool call |
| Safety / refusal | ~12% | Edge cases, PII, etc. |

Minimum thresholds for the quality gate: tool_calls ≥ 5000, AAC rows ≥ 40%, safety/refusal ≥ 10%.
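The quality gate thresholds can be checked mechanically once each row carries a category label. The sketch below assumes one label per row, with hypothetical label strings `tool_use`, `aac`, `abstention`, and `safety`, and it treats each tool-use row as one tool call; the real corpus builder may count differently.

```python
from collections import Counter

MIN_TOOL_ROWS = 5000     # tool_calls >= 5000
MIN_AAC_SHARE = 0.40     # AAC rows >= 40%
MIN_SAFETY_SHARE = 0.10  # safety/refusal >= 10%


def check_quality_gate(categories):
    """categories holds one label per corpus row. Returns (passed, failures)."""
    total = len(categories)
    counts = Counter(categories)
    failures = []
    if counts["tool_use"] < MIN_TOOL_ROWS:
        failures.append("tool_use rows below 5000")
    if total == 0 or counts["aac"] / total < MIN_AAC_SHARE:
        failures.append("AAC share below 40%")
    if total == 0 or counts["safety"] / total < MIN_SAFETY_SHARE:
        failures.append("safety/refusal share below 10%")
    return (not failures, failures)


rows = ["tool_use"] * 5000 + ["aac"] * 6000 + ["abstention"] * 1500 + ["safety"] * 1500
print(check_quality_gate(rows))  # (True, [])
```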

---

## Patch Strategy (Surgical vs Full Retrain)

**Surgical patch** (preferred):
- Identify failing categories from swe_bench or BFCL
- Build targeted JSONL (30–100 rows per failure group)
- 3× oversample → append to existing `train.jsonl`
- Train at reduced LR (3e-5 for 4B, scale down for larger)
- Iters: 200–300 (enough to reinforce without overwriting)

**Full retrain** (when): catastrophic regression, base model upgrade, or corpus shape change requiring >20% new data.
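The oversample-and-append step of a surgical patch can be sketched in a few lines. The helper names are hypothetical, rows are treated as opaque values, and `patch_fraction` is only an illustration of how small the appended delta is relative to the corpus; it is not necessarily how the >20% retrain threshold is measured.

```python
def oversample_patch(patch_rows, factor=3):
    """Repeat every patch row `factor` times (3x oversampling by default)."""
    return [row for row in patch_rows for _ in range(factor)]


def append_patch(base_rows, patch_rows, factor=3):
    """Return the existing corpus followed by the oversampled patch rows."""
    return list(base_rows) + oversample_patch(patch_rows, factor)


def patch_fraction(base_count, patch_count, factor=3):
    """Share of the resulting corpus made up of oversampled patch rows."""
    added = patch_count * factor
    return added / (base_count + added)


base = [{"text": "base row"}] * 10
patch = [{"text": "patch row"}] * 2
combined = append_patch(base, patch)
print(len(combined))  # 16
```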

---

## BFCL Gate: 100% Required Before Push

The gate is enforced in `bfcl_eval.py`. Run it with 3 seeds before any Ollama Hub push:
```bash
python3 bfcl_eval.py --model prism-coder:4b-v43 --seeds 2027 2028 2029
```
All seeds must show 100%. Partial seed pass = do not push.

SWE bench (`swe_bench_test.py`) is the secondary blind eval. The target is 100% strict, but this is harder to reach, so ≥95% strict may be accepted if BFCL=100%.
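The push rule reduces to a small predicate over the per-seed results. The sketch below assumes pass rates are expressed as fractions in [0, 1]; `may_push` is a hypothetical name and is not part of `bfcl_eval.py`.

```python
def may_push(bfcl_by_seed, swe_strict=None, swe_floor=0.95):
    """Decide whether a model may be pushed.

    bfcl_by_seed maps seed -> BFCL pass rate in [0, 1]; every seed must be 1.0.
    swe_strict is the optional SWE strict pass rate; if given it must meet swe_floor.
    """
    if not bfcl_by_seed:
        return False
    if any(rate < 1.0 for rate in bfcl_by_seed.values()):
        return False
    if swe_strict is not None and swe_strict < swe_floor:
        return False
    return True


print(may_push({2027: 1.0, 2028: 1.0, 2029: 1.0}, swe_strict=0.96))  # True
print(may_push({2027: 1.0, 2028: 0.98, 2029: 1.0}))                  # False
```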

---

## Files in This Directory

| File | Purpose |
|------|---------|
| `build_4b_v43_corpus.py` | Full v43 corpus builder (28,454 base rows) |
| `build_4b_v43_patch.py` | Patch 1: initial BFCL failures |
| `build_4b_v43_patch2.py` | Patch 2: param extraction + format |
| `build_4b_v43_patch3.py` | Patch 3: regressed (LR too high), abandoned |
| `build_4b_v43_patch4.py` | Patch 4: task_route implicit + param extraction from casual phrasing |
| `build_4b_v43_swe_patch.py` | SWE bench targeted patch |
| `train_4b_v43_local.sh` | MLX LoRA training script (Apple Silicon) |
| `merge_4b_v43.py` | Safe merge: delta = scale × (A @ B) |
| `export_4b_v43_gguf.sh` | HF → GGUF F16 → Q4_K_M → Ollama register |
| `bfcl_eval.py` | 64-test BFCL suite with Layer 3 |
| `swe_bench_test.py` | 68-test blind SWE suite |
| `orchestrate_4b_to_100.sh` | Autonomous patch→train→eval loop |
| `analyze_swe_failures.py` | Parse swe_bench output → failure categories |

For 8B/14B/32B: copy the build_* and train_* scripts, then update the model name and hyperparams per the Training Hyperparameters table.