dcostenco committed 6 days ago

Commit

db4f2d8

verified ·

1 Parent(s): 81c234c

Add training/TRAINING_DECISIONS_4B_V43.md

Files changed (1)

training/TRAINING_DECISIONS_4B_V43.md +149 -0

training/TRAINING_DECISIONS_4B_V43.md ADDED

	@@ -0,0 +1,149 @@

+# Prism Coder 4B v43 — Training Decisions & Reuse Guide
+> Apply these decisions to 8B, 14B, 32B training runs. The patterns are size-agnostic.
+## Architecture: Tiered Model Deployment
+| Model | Device RAM | Role | Verifier? |
+|-------|-----------|------|-----------|
+| 1.7B | ≥3GB free | Primary agent (low-mem) | Self (or none) |
+| 4B | ≥8GB free | Primary agent (mid-tier) | 4B self-verifier |
+| 8B | ≥16GB free | Primary agent (high-tier) | 4B or self |
+| 14B/32B | ≥24GB free | Primary agent (pro) | 4B |
+**Key decision**: 1.7B stays on low-memory devices (phones, older Macs). 4B and above all serve the same Prism Memory tool-calling purpose: larger models handle edge cases better but train on the same corpus shape.
+**Verifier tier**: Configured via `PRISM_VERIFIER_MODEL` env var (default: `prism-coder:1b7`). Set to `prism-coder:4b` on devices with ≥8GB free for higher accuracy at ~3× latency cost. The verifier call runs post-draft in `chat-verifier.ts::verifyOrRefuse()`.
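
The tier table and the verifier setting above can be read as a simple lookup. The following is a minimal Python sketch of that selection logic, not the code in `chat-verifier.ts`; the function names `pick_primary_tier` and `verifier_model` are hypothetical, and only the RAM thresholds, the `PRISM_VERIFIER_MODEL` variable, and its default come from this document.

```python
import os
from typing import Mapping, Optional

# Thresholds copied from the tier table: (minimum free RAM in GB, model tier).
# Ordered largest first so the first match is the biggest tier that fits.
TIERS = [
    (24.0, "14B/32B"),
    (16.0, "8B"),
    (8.0, "4B"),
    (3.0, "1.7B"),
]


def pick_primary_tier(free_ram_gb: float) -> Optional[str]:
    """Return the largest tier whose free-RAM requirement is met, or None."""
    for min_gb, tier in TIERS:
        if free_ram_gb >= min_gb:
            return tier
    return None


def verifier_model(env: Optional[Mapping[str, str]] = None) -> str:
    """Read PRISM_VERIFIER_MODEL, falling back to the documented default."""
    env = os.environ if env is None else env
    return env.get("PRISM_VERIFIER_MODEL", "prism-coder:1b7")
```

Passing the environment as an argument keeps the lookup testable without touching the real process environment.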
+---
+## Training Hyperparameters (Validated for 4B, Scale for Others)
+| Param | 4B v43 value | 8B guidance | 14B/32B guidance |
+|-------|-------------|-------------|-----------------|
+| Base model | Qwen/Qwen3-4B | Qwen/Qwen3-8B | Qwen/Qwen3-14B / 32B |
+| LoRA rank | 32 | 32 | 32 (or 16 for speed) |
+| LoRA alpha | 64 (scale=2.0) | 64 | 64 |
+| LoRA layers | 16 of 36 | 16 of 36 | 16 of 40 (14B) / 16 of 64 (32B) |
+| Batch size | 2 | 2 | 1 (grad-checkpoint) |
+| Grad checkpoint | yes | yes | yes |
+| Seq length | 2048 | 2048 | 2048 |
+| LR (initial) | 1e-4 | 1e-4 | 5e-5 |
+| LR (surgical patch) | 3e-5 | 2e-5 | 1e-5 |
+| Iters (full run) | 2000 | 2000 | 1500 |
+| Iters (patch) | 250 | 200 | 150 |
+| Val batches | 25 | 25 | 25 |
+| Save every | 200 | 200 | 200 |
+**Critical**: Training a surgical patch (a small delta corpus appended to a large existing corpus) at 1e-4 LR causes catastrophic interference. Validated fix: 3e-5 for 4B. Use lower rates for larger models (2e-5 for 8B, 1e-5 for 14B/32B, per the table above).
+---
+## Corpus Format (MUST match for all model sizes)
+All rows must use the `"text"` key with pre-rendered ChatML strings:
+```json
+{"text": "<|im_start|>system\n...<|im_end|>\n<|im_start|>user\n...<|im_end|>\n<|im_start|>assistant\n...<|im_end|>"}
+```
+**Never use** `"messages"` key format — mlx_lm auto-detects format from the first row and crashes on mixed files.
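
One way to produce and check rows in this shape is sketched below. The helper names (`render_chatml`, `to_corpus_row`, `find_bad_rows`) are hypothetical and are not taken from the corpus builder scripts; the sketch assumes each turn is a `role`/`content` pair and that turns are joined with a single newline, as in the example row above.

```python
import json


def render_chatml(messages):
    """Render [{"role": ..., "content": ...}, ...] into one ChatML string."""
    parts = [
        "<|im_start|>" + m["role"] + "\n" + m["content"] + "<|im_end|>"
        for m in messages
    ]
    return "\n".join(parts)


def to_corpus_row(messages):
    """Serialize one conversation as a JSONL row that uses only the "text" key."""
    return json.dumps({"text": render_chatml(messages)}, ensure_ascii=False)


def find_bad_rows(jsonl_lines):
    """Return 1-based line numbers of rows whose keys are not exactly {"text"}."""
    bad = []
    for lineno, line in enumerate(jsonl_lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        row = json.loads(line)
        if set(row) != {"text"}:
            bad.append(lineno)
    return bad
```

Running `find_bad_rows` over a file before training catches a stray `"messages"` row without depending on which row the trainer inspects first.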
+Tool call format in training data:
+```
+<tool_call>
+{"name": "tool_name", "arguments": {...}}
+</tool_call>
+```
+**No pipes** — not `<|tool_call|>`. The eval harness (`bfcl_eval.py`, `swe_bench_test.py`) must use the same format.
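
A parser that accepts only the no-pipe form might look like the sketch below. It is an illustration under stated assumptions, not the parser used by `bfcl_eval.py`: the name `extract_tool_calls` is hypothetical, and the sketch assumes each block holds one JSON object with `name` and `arguments` keys.

```python
import json
import re

# Matches the no-pipe form only: <tool_call> ... </tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(completion):
    """Return the parsed JSON payload of every well-formed <tool_call> block."""
    calls = []
    for payload in TOOL_CALL_RE.findall(completion):
        try:
            call = json.loads(payload)
        except json.JSONDecodeError:
            continue  # malformed JSON: skip rather than guess
        if isinstance(call, dict) and "name" in call and "arguments" in call:
            calls.append(call)
    return calls
```

Because the pattern does not match `<|tool_call|>`, a piped completion yields an empty list, which makes a format mismatch between training data and eval harness visible immediately.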
+Think block format:
+```
+<|synalux_think|>reasoning here</|synalux_think|>
+```
+Use string literals for both open and close tags — do NOT use f-string escaping for the close tag.
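
A minimal sketch of the string-literal approach follows. The helper names `wrap_think` and `strip_think` are hypothetical; only the two tag strings come from this document.

```python
# Both tags are plain string literals, combined by concatenation (no f-strings).
THINK_OPEN = "<|synalux_think|>"
THINK_CLOSE = "</|synalux_think|>"


def wrap_think(reasoning, answer):
    """Prefix an assistant answer with a think block."""
    return THINK_OPEN + reasoning + THINK_CLOSE + answer


def strip_think(text):
    """Remove every complete think block; leave unterminated ones untouched."""
    out = []
    pos = 0
    while True:
        start = text.find(THINK_OPEN, pos)
        if start == -1:
            break
        end = text.find(THINK_CLOSE, start + len(THINK_OPEN))
        if end == -1:
            break  # no closing tag: keep the rest of the text as-is
        out.append(text[pos:start])
        pos = end + len(THINK_CLOSE)
    out.append(text[pos:])
    return "".join(out)
```

Keeping the tags as module-level constants means the corpus builder and the eval harness can import the same two strings instead of retyping them.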
+---
+## Merge: mlx_lm.fuse is Broken for GGUF
+`mlx_lm.fuse` silently loses LoRA weights during GGUF conversion. Use the `merge_4b_v43.py` pattern instead:
+```python
+# delta = scale * (A @ B)  where scale = alpha / rank (pre-computed in adapter_config.json)
+delta = scale * (A_matrix @ B_matrix)
+merged_weight = base_weight + delta
+```
+Script: `merge_4b_v43.py` — adapt for 8B/14B/32B by changing the base model path.
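
The merge formula can be checked on a toy example. The sketch below is not `merge_4b_v43.py`: it uses plain Python lists so it runs without dependencies, and it assumes `A` has shape (out, rank) and `B` has shape (rank, in) so that `A @ B` matches the base weight. Adapter files may store the factors in a different orientation, so shapes should be checked before merging real weights.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(base_weight, a_matrix, b_matrix, alpha, rank):
    """Return base_weight + (alpha / rank) * (a_matrix @ b_matrix)."""
    scale = alpha / rank
    delta = matmul(a_matrix, b_matrix)
    if len(delta) != len(base_weight) or len(delta[0]) != len(base_weight[0]):
        raise ValueError("A @ B must have the same shape as the base weight")
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base_weight, delta)
    ]


# Toy check with rank 1 and alpha 2, so scale = 2.0 (the same scale as alpha 64 / rank 32).
base = [[1.0, 0.0], [0.0, 1.0]]
merged = merge_lora(base, [[1.0], [2.0]], [[3.0, 4.0]], alpha=2, rank=1)
```

A merged weight that equals the base weight exactly is the symptom to look for when LoRA deltas have been silently dropped.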
+---
+## Layer 3 (Inference-Time Remapping) — Apply Before Training Patches
+Before writing a corpus patch, check if the failure is fixable by Layer 3 rules in `bfcl_eval.py::apply_layer3()`. Layer 3 fixes:
+- Tool name remapping (semantic similarity false positives)
+- Format normalization (pipe vs no-pipe)
+- Context-based disambiguation (backfill_links vs synthesize_edges)
+- Abstention for general programming/CS questions
+Layer 3 is **zero-cost** (no training needed) and **regression-proof**. Use training patches only for failures that Layer 3 cannot fix.
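
To illustrate what rule-based post-processing of this kind can look like, here is a small sketch. It is not the contents of `apply_layer3()`: the alias table entries, the `is_general_question` flag, and both function names are hypothetical. The piped closing tag is not given in this document, so only the piped open tag is normalized.

```python
# Hypothetical alias table: wrong-but-similar tool name -> intended tool name.
NAME_REMAP = {
    "example_wrong_tool": "example_right_tool",
}


def normalize_tags(text):
    """Rewrite the piped open tag to the no-pipe form used in training data."""
    return text.replace("<|tool_call|>", "<tool_call>")


def remap_call(call, is_general_question=False):
    """Apply post-hoc rules to one parsed tool call; None means abstain."""
    if is_general_question:
        return None  # abstention: general programming/CS questions get no tool call
    name = NAME_REMAP.get(call["name"], call["name"])
    return {"name": name, "arguments": call["arguments"]}
```

Rules like these only rewrite model output after generation, so they leave the trained weights untouched.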
+---
+## Corpus Mix Ratios (v2 — Verified)
+| Category | Target % | Notes |
+|----------|----------|-------|
+| Tool-use (Prism Memory) | ~36% | All 29 tools, param extraction, multi-turn |
+| AAC / clinical | ~40% | Critical — prevents mode collapse |
+| Abstention | ~12% | CS/general questions → no tool call |
+| Safety / refusal | ~12% | Edge cases, PII, etc. |
+Quality-gate minimums: tool_calls ≥ 5000, AAC rows ≥ 40%, safety/refusal ≥ 10%.
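
The gate can be expressed as a short check over per-row category labels. This is a sketch under assumptions: the label strings `"aac"` and `"safety"`, the function name `check_mix`, and passing the tool-call count as a separate argument are all hypothetical choices, since this document does not say how rows are labeled or how tool calls are counted.

```python
from collections import Counter


def check_mix(categories, tool_call_count):
    """categories: one label per corpus row. Returns (passed, shares)."""
    total = len(categories)
    if total == 0:
        return False, {}
    counts = Counter(categories)
    shares = {name: counts[name] / total for name in counts}
    passed = (
        tool_call_count >= 5000
        and shares.get("aac", 0.0) >= 0.40
        and shares.get("safety", 0.0) >= 0.10
    )
    return passed, shares
```

Returning the shares alongside the verdict makes it easy to see which category fell short when the gate fails.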
+---
+## Patch Strategy (Surgical vs Full Retrain)
+**Surgical patch** (preferred):
+- Identify failing categories from swe_bench or BFCL
+- Build targeted JSONL (30–100 rows per failure group)
+- 3× oversample → append to existing `train.jsonl`
+- Train at reduced LR (3e-5 for 4B, scale down for larger)
+- Iters: 200–300 (enough to reinforce without overwriting)
+**Full retrain** (use only for): catastrophic regression, a base model upgrade, or a corpus shape change requiring >20% new data.
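
The oversample-and-append step of a surgical patch can be sketched as follows. The function name `append_oversampled` is hypothetical and this is not one of the `build_4b_v43_patch*.py` scripts; the sketch assumes the existing `train.jsonl` ends with a newline and that patch rows already use the `"text"` key.

```python
import json


def append_oversampled(patch_rows, train_path, factor=3):
    """Append each patch row `factor` times to an existing JSONL corpus."""
    # Validate everything first so a bad row cannot leave a half-written file.
    for row in patch_rows:
        if set(row) != {"text"}:
            raise ValueError('patch rows must use only the "text" key')
    written = 0
    with open(train_path, "a", encoding="utf-8") as f:
        for _ in range(factor):
            for row in patch_rows:
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
                written += 1
    return written
```

With 30 to 100 rows per failure group and a factor of 3, each group contributes 90 to 300 appended rows.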
+---
+## BFCL Gate: 100% Required Before Push
+Gate enforced in `bfcl_eval.py`. Run with 3 seeds before any Ollama Hub push:
+```bash
+python3 bfcl_eval.py --model prism-coder:4b-v43 --seeds 2027 2028 2029
+```
+All seeds must show 100%. Partial seed pass = do not push.
+SWE bench (`swe_bench_test.py`) is the secondary blind eval. The target is 100% strict; because this suite is harder, ≥95% strict is acceptable when BFCL is 100%.
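
The push decision described above reduces to a small predicate. The sketch below encodes the relaxed form of the rule (every BFCL seed at 100%, SWE strict at 95% or better); the function name `may_push` and the input shapes are hypothetical and do not reflect the output format of `bfcl_eval.py` or `swe_bench_test.py`.

```python
def may_push(bfcl_results, swe_passed, swe_total):
    """bfcl_results: {seed: (passed, total)}. Returns True only if the gate holds."""
    if not bfcl_results or swe_total <= 0:
        return False  # no evidence, no push
    bfcl_ok = all(
        total > 0 and passed == total for passed, total in bfcl_results.values()
    )
    swe_ok = swe_passed * 100 >= 95 * swe_total  # integer form of ">= 95% strict"
    return bfcl_ok and swe_ok
```

Working with pass counts instead of floating-point percentages avoids rounding questions at the 95% boundary: on a 68-test suite, 65 passes clear the bar and 64 do not.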
+---
+## Files in This Directory
+| File | Purpose |
+|------|---------|
+| `build_4b_v43_corpus.py` | Full v43 corpus builder (28,454 base rows) |
+| `build_4b_v43_patch.py` | Patch 1: initial BFCL failures |
+| `build_4b_v43_patch2.py` | Patch 2: param extraction + format |
+| `build_4b_v43_patch3.py` | Patch 3: (regressed — LR too high, abandoned) |
+| `build_4b_v43_patch4.py` | Patch 4: task_route implicit + param extraction from casual phrasing |
+| `build_4b_v43_swe_patch.py` | SWE bench targeted patch |
+| `train_4b_v43_local.sh` | MLX LoRA training script (Apple Silicon) |
+| `merge_4b_v43.py` | Safe merge: delta = scale × (A @ B) |
+| `export_4b_v43_gguf.sh` | HF → GGUF F16 → Q4_K_M → Ollama register |
+| `bfcl_eval.py` | 64-test BFCL suite with Layer 3 |
+| `swe_bench_test.py` | 68-test blind SWE suite |
+| `orchestrate_4b_to_100.sh` | Autonomous patch→train→eval loop |
+| `analyze_swe_failures.py` | Parse swe_bench output → failure categories |
+For 8B/14B/32B: copy the build_* and train_* scripts, update model name and hyperparams per the table above.