Instructions to use dcostenco/prism-coder-4b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use dcostenco/prism-coder-4b with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="dcostenco/prism-coder-4b", filename="prism-coder-4b-v43-Q4_K_M.gguf", )
llm.create_chat_completion( messages = [ { "role": "user", "content": "What is the capital of France?" } ] ) - Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use dcostenco/prism-coder-4b with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf dcostenco/prism-coder-4b:Q4_K_M # Run inference directly in the terminal: llama-cli -hf dcostenco/prism-coder-4b:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf dcostenco/prism-coder-4b:Q4_K_M # Run inference directly in the terminal: llama-cli -hf dcostenco/prism-coder-4b:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf dcostenco/prism-coder-4b:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf dcostenco/prism-coder-4b:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf dcostenco/prism-coder-4b:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf dcostenco/prism-coder-4b:Q4_K_M
Use Docker
docker model run hf.co/dcostenco/prism-coder-4b:Q4_K_M
- LM Studio
- Jan
- vLLM
How to use dcostenco/prism-coder-4b with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "dcostenco/prism-coder-4b" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "dcostenco/prism-coder-4b", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker
docker model run hf.co/dcostenco/prism-coder-4b:Q4_K_M
- Ollama
How to use dcostenco/prism-coder-4b with Ollama:
ollama run hf.co/dcostenco/prism-coder-4b:Q4_K_M
- Unsloth Studio
How to use dcostenco/prism-coder-4b with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for dcostenco/prism-coder-4b to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for dcostenco/prism-coder-4b to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for dcostenco/prism-coder-4b to start chatting
- Pi
How to use dcostenco/prism-coder-4b with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf dcostenco/prism-coder-4b:Q4_K_M
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "dcostenco/prism-coder-4b:Q4_K_M" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use dcostenco/prism-coder-4b with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf dcostenco/prism-coder-4b:Q4_K_M
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default dcostenco/prism-coder-4b:Q4_K_M
Run Hermes
hermes
- Docker Model Runner
How to use dcostenco/prism-coder-4b with Docker Model Runner:
docker model run hf.co/dcostenco/prism-coder-4b:Q4_K_M
- Lemonade
How to use dcostenco/prism-coder-4b with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull dcostenco/prism-coder-4b:Q4_K_M
Run and chat with the model
lemonade run user.prism-coder-4b-Q4_K_M
List all available models
lemonade list
| #!/usr/bin/env python3 | |
| """ | |
| SWE-Bench Inspired Blind Evaluation for prism-coder:7b | |
| Unlike the real-life test (which had training overlap), these prompts are: | |
| 1. Completely novel — never seen in any training data | |
| 2. Realistic — mimic actual user interactions | |
| 3. Ambiguous — some have keyword traps or context-dependent meanings | |
| 4. Multi-intent — some require the model to pick the most appropriate tool | |
| 5. Adversarial — designed to confuse tool vs reasoning boundaries | |
| Scoring follows SWE-bench methodology: | |
| - Strict match: correct tool name + all required params present | |
| - Partial match: correct tool name + some params | |
| - Wrong tool: incorrect tool name (regardless of params) | |
| - False positive: tool called when none should be | |
| - False negative: no tool called when one should be | |
| """ | |
| import subprocess | |
| import json | |
| import re | |
| import time | |
| import sys | |
| import random | |
| import urllib.request | |
| import statistics | |
| MODEL = "prism-coder:4b-v43" | |
| OLLAMA_API = "http://localhost:11434/api/generate" | |
| # === BLIND TEST CASES (never in training data) === | |
| # Format: (prompt, expected_tool_or_NO_TOOL, required_params, category) | |
| BLIND_TESTS = [ | |
| # ====== CATEGORY 1: Natural user phrasing — tool needed (15 tests) ====== | |
| ("Hey, I want to start a new session. Pull up everything we had on the synalux project.", | |
| "session_load_context", ["project"], "natural_phrasing"), | |
| ("Can you jot down what we accomplished? We rewrote the webhook handler and fixed 3 edge cases.", | |
| "session_save_ledger", ["summary"], "natural_phrasing"), | |
| ("I'm handing this off to the night shift. Make sure they know where we left off on prism-mcp.", | |
| "session_save_handoff", ["project"], "natural_phrasing"), | |
| ("Remind me — did we ever decide between Redis and Memcached for the session store?", | |
| "session_search_memory", ["query"], "natural_phrasing"), | |
| ("That memory entry about the old deployment script is totally wrong. Nuke it.", | |
| "session_forget_memory", ["memory_id"], "natural_phrasing"), | |
| ("Is everything OK with the memory backend? Run diagnostics.", | |
| "session_health_check", [], "natural_phrasing"), | |
| ("Any institutional knowledge about how we handle rate limiting?", | |
| "knowledge_search", ["query"], "natural_phrasing"), | |
| ("The ledger is getting huge. Summarize and archive the old stuff for billing-portal.", | |
| "session_compact_ledger", ["project"], "natural_phrasing"), | |
| ("Dump everything to a file so I can back it up. JSON format, save to /tmp/prism-backup.", | |
| "session_export_memory", ["output_path", "format"], "natural_phrasing"), | |
| ("Should I handle this CSS grid refactor myself or punt it to the local model?", | |
| "session_task_route", ["task_description"], "natural_phrasing"), | |
| # Additional natural phrasing (indirect/conversational) | |
| ("Where were we on the portal project? Bring me up to speed.", | |
| "session_load_context", ["project"], "natural_phrasing"), | |
| ("We just finished a big refactor. Make sure it's written down for posterity.", | |
| "session_save_ledger", [], "natural_phrasing"), | |
| ("Go look through our old conversations and find anything about the payment gateway.", | |
| "session_search_memory", ["query"], "natural_phrasing"), | |
| ("Get rid of that wrong entry we saved about the broken migration.", | |
| "session_forget_memory", ["memory_id"], "natural_phrasing"), | |
| ("Is this bug fix simple enough for the local model to handle?", | |
| "session_task_route", ["task_description"], "natural_phrasing"), | |
| # ====== CATEGORY 2: Adversarial keyword traps — NO tool (15 tests) ====== | |
| ("How do I implement a session manager in Express.js with Redis as the backing store?", | |
| "NO_TOOL", [], "adversarial_trap"), | |
| ("Explain the concept of memory management in Rust — borrowing, ownership, and lifetimes.", | |
| "NO_TOOL", [], "adversarial_trap"), | |
| ("What's the best way to save user preferences in a React Native app?", | |
| "NO_TOOL", [], "adversarial_trap"), | |
| ("Write a function that searches through a knowledge graph using BFS.", | |
| "NO_TOOL", [], "adversarial_trap"), | |
| ("How does garbage collection work in Go vs Java?", | |
| "NO_TOOL", [], "adversarial_trap"), | |
| ("Can you explain the compact representation of sparse matrices?", | |
| "NO_TOOL", [], "adversarial_trap"), | |
| ("What is the health check endpoint pattern in microservices?", | |
| "NO_TOOL", [], "adversarial_trap"), | |
| ("How do I export data from PostgreSQL to a CSV file?", | |
| "NO_TOOL", [], "adversarial_trap"), | |
| # NEW adversarial traps — high-risk keywords | |
| ("How do I create a session in PHP using session_start()?", | |
| "NO_TOOL", [], "adversarial_trap"), | |
| ("Write me a Python context manager for database connections.", | |
| "NO_TOOL", [], "adversarial_trap"), | |
| ("What's the difference between saving to disk vs saving to memory in SQLite?", | |
| "NO_TOOL", [], "adversarial_trap"), | |
| ("How do I implement search functionality with Elasticsearch?", | |
| "NO_TOOL", [], "adversarial_trap"), | |
| ("Explain how to load balance across multiple Node.js processes.", | |
| "NO_TOOL", [], "adversarial_trap"), | |
| ("What is the forget gate in an LSTM neural network?", | |
| "NO_TOOL", [], "adversarial_trap"), | |
| ("How do I route tasks in Celery to different queues?", | |
| "NO_TOOL", [], "adversarial_trap"), | |
| # ====== CATEGORY 3: Disambiguation — correct tool choice (8 tests) ====== | |
| ("Search for anything we discussed about the authentication overhaul last month.", | |
| "session_search_memory", ["query"], "disambiguation"), | |
| ("I need to know if our knowledge base has anything on Kubernetes pod autoscaling.", | |
| "knowledge_search", ["query"], "disambiguation"), | |
| # NEW: forget tool disambiguation | |
| ("Delete the specific memory entry with ID mem-abc-123.", | |
| "session_forget_memory", ["memory_id"], "disambiguation"), | |
| ("Wipe out all old debugging entries from the prism-mcp project.", | |
| "knowledge_forget", ["project"], "disambiguation"), | |
| # NEW: save tool disambiguation | |
| ("We're done for the day. Log what we accomplished.", | |
| "session_save_ledger", [], "disambiguation"), | |
| ("Pass this project to the next developer. Save the handoff state.", | |
| "session_save_handoff", ["project"], "disambiguation"), | |
| # NEW: search tool disambiguation | |
| ("What do our curated knowledge items say about error handling best practices?", | |
| "knowledge_search", ["query"], "disambiguation"), | |
| ("Did we discuss anything about caching in our recent sessions?", | |
| "session_search_memory", ["query"], "disambiguation"), | |
| # ====== CATEGORY 4: Edge cases (8 tests) ====== | |
| ("Load context.", | |
| "session_load_context", [], "edge_case"), | |
| ("Save.", | |
| "session_save_ledger", [], "edge_case"), | |
| ("What tools do you have available?", | |
| "NO_TOOL", [], "edge_case"), | |
| ("Tell me about yourself.", | |
| "NO_TOOL", [], "edge_case"), | |
| # NEW edge cases | |
| ("Hello!", | |
| "NO_TOOL", [], "edge_case"), | |
| ("Thanks, that's all for now.", | |
| "NO_TOOL", [], "edge_case"), | |
| ("Search.", | |
| "session_search_memory", ["query"], "edge_case"), | |
| ("Check health.", | |
| "session_health_check", [], "edge_case"), | |
| # ====== CATEGORY 5: Multi-tool / complex intent (4 tests) ====== | |
| ("Find all our past notes about the billing API redesign and check if the memory DB is healthy.", | |
| "session_search_memory", ["query"], "multi_intent"), | |
| ("Load the prism project context and then save a note that we started the migration.", | |
| "session_load_context", ["project"], "multi_intent"), | |
| ("Before I hand off, save what we did today: fixed the OAuth flow and updated tests.", | |
| "session_save_ledger", ["summary"], "multi_intent"), | |
| ("I want to export a backup and then compact the old entries.", | |
| "session_export_memory", [], "multi_intent"), | |
| # ====== CATEGORY 6: Verifier patterns (8 tests) ====== | |
| # Verifier = synthesize_edges, backfill_links, health_check used to verify/validate state | |
| ("Before we close out, verify all the session links are consistent for the portal project.", | |
| "session_synthesize_edges", ["project"], "verifier"), | |
| ("Run a synthesis pass on the prism-mcp project to make sure all edges are up to date.", | |
| "session_synthesize_edges", ["project"], "verifier"), | |
| ("Backfill the missing cross-session links for the analytics project.", | |
| "session_backfill_links", ["project"], "verifier"), | |
| ("Reconnect the dangling session references for the billing project.", | |
| "session_backfill_links", ["project"], "verifier"), | |
| ("Make sure the memory system is healthy before I start a new session.", | |
| "session_health_check", [], "verifier"), | |
| ("Verify graph integrity — synthesize edges for the ios-app project.", | |
| "session_synthesize_edges", ["project"], "verifier"), | |
| ("Is the memory backend responding correctly?", | |
| "session_health_check", [], "verifier"), | |
| ("Patch up the link gaps in our session history for prism-training.", | |
| "session_backfill_links", ["project"], "verifier"), | |
| # ====== CATEGORY 7: Cascade patterns (10 tests) ====== | |
| # Cascade = first step of a multi-step chain — model must pick the right first tool | |
| ("Search our knowledge base for Redis caching patterns, then upvote the best result.", | |
| "knowledge_search", ["query"], "cascade"), | |
| ("Load context for the portal project, search for any open issues, then save a handoff.", | |
| "session_load_context", ["project"], "cascade"), | |
| ("Check memory health, then compact the ledger if there are stale entries.", | |
| "session_health_check", [], "cascade"), | |
| ("Export everything from the billing project, then set a 60-day retention policy on it.", | |
| "session_export_memory", ["project"], "cascade"), # output_path not in prompt, only project | |
| ("Search for what we decided about authentication, then save a handoff note about it.", | |
| "session_search_memory", ["query"], "cascade"), | |
| ("Save this session's progress, then create a handoff for the next agent.", | |
| "session_save_ledger", [], "cascade"), | |
| ("Route this refactoring task — if local, proceed; if cloud, just tell me.", | |
| "session_task_route", ["task_description"], "cascade"), | |
| ("Search knowledge for WebSocket patterns, downvote anything about long-polling.", | |
| "knowledge_search", ["query"], "cascade"), | |
| ("Compact the prism-mcp ledger and then synthesize the session edges.", | |
| "session_compact_ledger", [], "cascade"), | |
| ("Load the analytics project context and then log that we shipped the v4 dashboard.", | |
| "session_load_context", ["project"], "cascade"), | |
| ] | |
| TOOL_CALL_RE = re.compile( | |
| r'<\|tool_call\|>\s*(\{.*\})', | |
| re.DOTALL | |
| ) | |
| # v43 model uses <tool_call> (no pipes) — strip CoT first, then match | |
| NO_PIPE_TOOL_CALL_RE = re.compile( | |
| r'<tool_call>\s*(\{.*?\})\s*(?:</tool_call>|$)', | |
| re.DOTALL | |
| ) | |
| def call_ollama(prompt: str, timeout: int = 120) -> tuple: | |
| """Call ollama REST API and return (raw_response, parsed_tool_name, parsed_args, latency).""" | |
| start = time.time() | |
| try: | |
| payload = json.dumps({ | |
| "model": MODEL, | |
| "prompt": prompt, | |
| "stream": False, | |
| "raw": True, | |
| "options": {"temperature": 0.0, "num_predict": 512} | |
| }).encode("utf-8") | |
| req = urllib.request.Request( | |
| OLLAMA_API, | |
| data=payload, | |
| headers={"Content-Type": "application/json"} | |
| ) | |
| with urllib.request.urlopen(req, timeout=timeout) as resp: | |
| data = json.loads(resp.read().decode("utf-8")) | |
| raw = data.get("response", "").strip() | |
| except Exception as e: | |
| return (str(e), "ERROR", {}, time.time() - start) | |
| latency = time.time() - start | |
| # Strip CoT blocks before parsing | |
| clean_raw = re.sub(r'<\|synalux_think\|>.*?(?:</\|synalux_think\|>|$)', '', raw, flags=re.DOTALL) | |
| # Strategy 0: no-pipe <tool_call> format (v43 model) | |
| no_pipe_match = NO_PIPE_TOOL_CALL_RE.search(clean_raw) | |
| if no_pipe_match: | |
| try: | |
| tool_json = json.loads(no_pipe_match.group(1)) | |
| tool_name = tool_json.get("name", tool_json.get("tool", "UNKNOWN")) | |
| tool_args = tool_json.get("arguments", tool_json.get("args", {})) | |
| return (raw, tool_name, tool_args, latency) | |
| except json.JSONDecodeError: | |
| pass | |
| # Strategy 1: piped <|tool_call|> format | |
| match = TOOL_CALL_RE.search(clean_raw) | |
| if match: | |
| try: | |
| tool_json = json.loads(match.group(1)) | |
| tool_name = tool_json.get("name", tool_json.get("tool", "UNKNOWN")) | |
| tool_args = tool_json.get("arguments", tool_json.get("args", {})) | |
| return (raw, tool_name, tool_args, latency) | |
| except json.JSONDecodeError: | |
| pass | |
| # Fallback: try to find JSON with "name" key containing nested braces | |
| json_re = re.search(r'(\{[^{}]*"name"\s*:\s*"[^"]+?"[^{}]*(?:\{[^{}]*\}[^{}]*)*\})', raw) | |
| if json_re: | |
| try: | |
| tool_json = json.loads(json_re.group(0)) | |
| tool_name = tool_json.get("name", "UNKNOWN") | |
| tool_args = tool_json.get("arguments", tool_json.get("args", {})) | |
| return (raw, tool_name, tool_args, latency) | |
| except json.JSONDecodeError: | |
| pass | |
| return (raw, "NO_TOOL", {}, latency) | |
| # === LAYER 3: Inference-Time False Positive Rejection === | |
| # Catches cases where the model hallucinates a tool call on general programming prompts. | |
| # These are lightweight heuristics — they only reject, never add tool calls. | |
| # Patterns that strongly indicate a general programming question (NOT Prism) | |
| GENERAL_PROGRAMMING_PATTERNS = [ | |
| # Python context managers — not Prism context loading | |
| r'\bcontext\s+manager\b', r'\bcontextlib\b', r'\b__enter__\b', r'\b__exit__\b', | |
| r'\basync\s+context\s+manager\b', | |
| # ML/LSTM forget gates — not Prism memory deletion | |
| r'\bforget\s+gate\b', r'\blstm\b', r'\bcatastrophic\s+forgetting\b', | |
| r'\bforget\s+bias\b', r'\belastic\s+weight\s+consolidation\b', | |
| # Web framework sessions — not Prism sessions | |
| r'\bexpress\.js\b', r'\bdjango\b', r'\bflask\b', r'\bsession_start\(\)', | |
| r'\bsession\s+middleware\b', r'\bsession\s+affinity\b', | |
| # General CS concepts that overlap with tool names | |
| r'\bgarbage\s+collection\b', r'\bmemory\s+management\s+in\s+rust\b', | |
| r'\bload\s+balanc', r'\bcontext\s+switch', | |
| r'\bsearch\s+algorithm\b', r'\bsearch\s+functionality\s+with\s+elasticsearch\b', | |
| r'\bhealth\s+check\s+endpoint\s+pattern\b', | |
| # Group A: swe-bench false positives | |
| r'\bcelery\b.*\bqueue', r'\broute\s+tasks?\s+in\s+celery\b', | |
| r'\bknowledge\s+graph\b.*\b(?:function|search|algorithm|traversal)\b', | |
| r'\b(?:function|write\s+a\s+function|implement)\b.*\bknowledge\s+graph\b', | |
| r'\bsave\s+(?:user\s+)?preferences?\s+in\s+(?:react|redux|localstorage|a\s+database)\b', | |
| r'\bexport\s+(?:data\s+)?from\s+(?:postgresql|mysql|sqlite|a\s+database)\b', | |
| r'\bpostgresql\b.*\bcsv\b', r'\bcsv\b.*\bpostgresql\b', | |
| ] | |
| # Patterns that confirm Prism-specific intent (overrides rejection) | |
| PRISM_INTENT_PATTERNS = [ | |
| r'\bprism\b', r'\bsession\s*ledger\b', r'\bhandoff\b', r'\bknowledge\s+base\b', | |
| r'\bknowledge\s+items?\b', r'\bour\s+knowledge\b', r'\bknowledge\s+base\b', | |
| r'\bsave.*(?:session|ledger|handoff)\b', r'\bload\s+context\b', | |
| r'\b(?:search|find).*(?:memory|sessions?|conversations?|notes)\b', | |
| r'\bproject\b', r'\bwhat\s+(?:do\s+)?we\s+(?:know|have)\b', | |
| r'\binstitutional\s+knowledge\b', r'\bdocumented\b', r'\bcurated\b', | |
| r'\bmemory\s+entry\b', r'\bmemory\s+backend\b', r'\bdiagnostics\b', | |
| r'\bledger\b', r'\bcompact\b.*(?:ledger|entries|session)\b', | |
| r'\bexport.*(?:memory|backup)\b', r'\b(?:delete|nuke|wipe|remove).*(?:entry|memory|entries)\b', | |
| r'\blog.*(?:what|accomplished|session)\b', r'\brecord.*(?:session|what)\b', | |
| r'\bhand.*(?:off|over)\b', r'\bbring.*up\s+to\s+speed\b', | |
| r'\bbug\s+fix.*(?:local\s+model|handle)\b', r'\broute.*(?:task|this)\b', | |
| ] | |
| def validate_tool_call(prompt, tool_name, tool_args): | |
| """Layer 3: reject obvious false positive tool calls and remap semantic neighbors. | |
| Returns (tool_name, tool_args) — possibly changed if rejected or remapped. | |
| """ | |
| if tool_name == "NO_TOOL": | |
| return tool_name, tool_args | |
| prompt_lower = prompt.lower() | |
| # --- Group B remaps (before false-positive rejection) --- | |
| # "reconnect/patch up/dangling links" → backfill_links | |
| if tool_name in ('session_synthesize_edges', 'session_reconnect'): | |
| if re.search(r'\b(?:reconnect|backfill|patch\s+up|dangling|link\s+gaps?|missing\s+links?|fix\s+links?)\b', prompt_lower): | |
| return 'session_backfill_links', tool_args | |
| # "verify/check that session links are consistent" → synthesize_edges | |
| # Covers both health_check and backfill_links false routes | |
| _VERIFY_CONSISTENT_RE = re.compile( | |
| r'\b(?:verify|validate|check)\b.{0,40}\b(?:links?\s+(?:are\s+)?consistent|edges?\s+up\s+to\s+date|graph\s+integrit|session\s+links?)\b', | |
| re.DOTALL | |
| ) | |
| if tool_name in ('session_health_check', 'session_backfill_links'): | |
| if _VERIFY_CONSISTENT_RE.search(prompt_lower): | |
| return 'session_synthesize_edges', tool_args | |
| # "wipe/clear old entries from knowledge base" → knowledge_forget (not compact_ledger) | |
| if tool_name == 'session_compact_ledger': | |
| if re.search(r'\bknowledge\b', prompt_lower) and re.search(r'\b(?:wipe|clear|delete|remove|entries)\b', prompt_lower): | |
| return 'knowledge_forget', tool_args | |
| # "entries from ... knowledge base" + delete verbs → knowledge_forget (not session_forget_memory) | |
| if tool_name == 'session_forget_memory': | |
| if re.search(r'\bknowledge\s+(?:entr|items?|records?|base)\b', prompt_lower): | |
| return 'knowledge_forget', tool_args | |
| if re.search(r'\bknowledge\s+base\b', prompt_lower) and re.search(r'\b(?:entries|records|items)\b', prompt_lower): | |
| return 'knowledge_forget', tool_args | |
| # "delete/wipe entries from [project]" without a specific memory ID → knowledge_forget | |
| if re.search(r'\b(?:entries|records|logs?)\b', prompt_lower) and re.search(r'\bproject\b', prompt_lower): | |
| if not re.search(r'\bmemory[_\s]id\b|mem-[a-z0-9]|ID\s*[=:]\s*\S+', prompt): | |
| return 'knowledge_forget', {'project': re.search(r'(?:for|from|in)\s+(?:the\s+)?([a-zA-Z][a-zA-Z0-9_-]+)\s+project', prompt_lower, re.I) and re.search(r'(?:for|from|in)\s+(?:the\s+)?([a-zA-Z][a-zA-Z0-9_-]+)\s+project', prompt_lower, re.I).group(1) or None} | |
| # "where were we / bring me up to speed" → session_load_context (not session_search_memory) | |
| if tool_name == 'session_search_memory': | |
| if re.search(r'\bwhere\s+were\s+we\b|\bbring\s+me\s+up\s+to\s+speed\b|\bcatch\s+me\s+up\b|\bwhat\s+were\s+we\s+(?:doing|working)', prompt_lower): | |
| project_m = re.search(r'\b(?:on|for|with)\s+(?:the\s+)?([a-zA-Z][a-zA-Z0-9_-]+)\s+project\b', prompt_lower) | |
| project = project_m.group(1) if project_m else None | |
| return 'session_load_context', {'project': project} if project else {} | |
| # knowledge_forget / knowledge_set_retention → upvote/downvote protection | |
| if tool_name in ('knowledge_forget', 'knowledge_set_retention'): | |
| if re.search(r'\b(?:upvote|boost|increase\s+(?:its\s+)?(?:rank|score|importance)|uprate|thumbs[\s-]?up)\b', prompt_lower): | |
| return 'knowledge_upvote', {"id": tool_args.get("id") or tool_args.get("knowledge_id") or tool_args.get("entry_id")} | |
| if re.search(r'\b(?:downvote|lower\s+(?:its\s+)?(?:rank|score)|not\s+useful|derank|thumbs[\s-]?down|reduce\s+(?:its\s+)?(?:rank|score))\b', prompt_lower): | |
| return 'knowledge_downvote', {"id": tool_args.get("id") or tool_args.get("knowledge_id") or tool_args.get("entry_id")} | |
| # "remind me / did we ever decide" → session_search_memory (not load_context) | |
| # Exclude "bring me up to speed / where were we" which is a load_context pattern | |
| if tool_name == 'session_load_context': | |
| if re.search(r'\bremind\s+me\b|\bdid\s+we\s+ever\s+(?:decide|settle|choose|pick)\b|\bwhat\s+did\s+we\s+decide\b', prompt_lower): | |
| if not re.search(r'\bbring\s+me\s+up\s+to\s+speed\b|\bwhere\s+were\s+we\b|\bcatch\s+me\s+up\b|\bload\s+.*\bcontext\b', prompt_lower): | |
| return 'session_search_memory', {"query": prompt[:120]} | |
| # Normalize param aliases (model uses alternate field names) | |
| if tool_name == 'session_save_ledger': | |
| # content → summary rename | |
| if 'content' in tool_args and 'summary' not in tool_args: | |
| tool_args = dict(tool_args) | |
| tool_args['summary'] = tool_args.pop('content') | |
| # If prompt contains explicit completed-work content and model omitted summary, fill it | |
| if 'summary' not in tool_args: | |
| work_m = re.search( | |
| r'(?:jot\s+down|log|record|write\s+down|note)\s+(?:what\s+we\s+)?(?:accomplished|did|completed|finished)?\s*[:;]?\s*' | |
| r'(?:we\s+)?(.{10,120})', | |
| prompt, re.I | |
| ) | |
| if not work_m: | |
| work_m = re.search(r'(?:we\s+)?((?:rewrote|fixed|refactored|built|deployed|updated|added|removed)\s+.{10,120})', prompt, re.I) | |
| if work_m: | |
| tool_args = dict(tool_args) | |
| tool_args['summary'] = work_m.group(1).strip().rstrip('.') | |
| # session_export_memory: extract output_path from path patterns, format from keywords | |
| if tool_name == 'session_export_memory': | |
| if 'output_path' not in tool_args or not tool_args.get('output_path'): | |
| path_m = re.search(r'(?:save\s+to|(?:output|export)\s+(?:to|dir(?:ectory)?)\s+["\']?)(/[\w/.-]+|~/[\w/.-]+|\.\/[\w/.-]+)', prompt, re.I) | |
| if path_m: | |
| tool_args = dict(tool_args) | |
| tool_args['output_path'] = path_m.group(1) | |
| if 'format' not in tool_args or not tool_args.get('format'): | |
| fmt_m = re.search(r'\b(json|jsonl|markdown|csv|yaml)\b(?:\s+format)?\b', prompt_lower) | |
| if fmt_m: | |
| tool_args = dict(tool_args) | |
| tool_args['format'] = fmt_m.group(1) | |
| # "jot down / write down / make sure it's written down" → session_save_ledger (not save_experience) | |
| if tool_name == 'session_save_experience': | |
| if re.search(r'\bjot\s+down\b|\bwrite\s+(?:it\s+)?down\b|\bwhat\s+we\s+accomplished\b|\bmake\s+sure\s+it.{0,10}written\b|\brecord\s+(?:this|what)\b', prompt_lower): | |
| if not re.search(r'\b(?:successfully|milestone|achievement|deployed|shipped|launched|fixed\s+the)\b', prompt_lower): | |
| # Apply same normalization as the save_ledger block below | |
| if 'content' in tool_args and 'summary' not in tool_args: | |
| tool_args = dict(tool_args) | |
| tool_args['summary'] = tool_args.pop('content') | |
| if 'summary' not in tool_args: | |
| work_m = re.search(r'(?:we\s+)?((?:rewrote|fixed|refactored|built|deployed|updated|added|removed)\s+.{10,120})', prompt, re.I) | |
| if work_m: | |
| tool_args = dict(tool_args) | |
| tool_args['summary'] = work_m.group(1).strip().rstrip('.') | |
| return 'session_save_ledger', tool_args | |
| # --- False-positive rejection (CS patterns) --- | |
| is_general = any(re.search(p, prompt_lower) for p in GENERAL_PROGRAMMING_PATTERNS) | |
| if not is_general: | |
| return tool_name, tool_args | |
| has_prism_intent = any(re.search(p, prompt_lower) for p in PRISM_INTENT_PATTERNS) | |
| if has_prism_intent: | |
| return tool_name, tool_args | |
| return "NO_TOOL", {} | |
| def evaluate_result(expected_tool, required_params, got_tool, got_args): | |
| """ | |
| SWE-bench scoring: | |
| - strict_pass: correct tool + all required params | |
| - partial_pass: correct tool + missing some params | |
| - wrong_tool: different tool called | |
| - false_positive: tool called when none should be | |
| - false_negative: no tool called when one should be | |
| """ | |
| if expected_tool == "NO_TOOL": | |
| if got_tool == "NO_TOOL": | |
| return "strict_pass" | |
| else: | |
| return "false_positive" | |
| else: | |
| if got_tool == "NO_TOOL": | |
| return "false_negative" | |
| elif got_tool != expected_tool: | |
| # Special case: accept session_search_memory OR knowledge_search for search queries | |
| if expected_tool in ("session_search_memory", "knowledge_search") and got_tool in ("session_search_memory", "knowledge_search"): | |
| pass # Close enough | |
| else: | |
| return "wrong_tool" | |
| # Check required params | |
| if not required_params: | |
| return "strict_pass" | |
| present = [p for p in required_params if p in got_args] | |
| if len(present) == len(required_params): | |
| return "strict_pass" | |
| elif len(present) > 0: | |
| return "partial_pass" | |
| else: | |
| return "partial_pass" # Got the tool right but missing params | |
| def main(shuffle=False, no_validate_layer3=False): | |
| print("=" * 70) | |
| print("SWE-BENCH INSPIRED BLIND EVALUATION — prism-coder:7b") | |
| print("=" * 70) | |
| print(f"Model: {MODEL}") | |
| print(f"Tests: {len(BLIND_TESTS)} (all novel, never in training data)") | |
| print(f"Order: {'RANDOMIZED' if shuffle else 'sequential'}") | |
| print(f"Categories: natural_phrasing, adversarial_trap, disambiguation, edge_case, multi_intent") | |
| print() | |
| # Build indexed test list and optionally shuffle | |
| indexed_tests = list(enumerate(BLIND_TESTS)) | |
| if shuffle: | |
| random.shuffle(indexed_tests) | |
| results = [None] * len(BLIND_TESTS) # store by original index | |
| category_stats = {} | |
| # Use training-compatible system prompt (matches v43 <tool_call> no-pipe format) | |
| _sys_prompt = ( | |
| "You are Synalux, a memory-augmented coding and clinical reasoning assistant. " | |
| "You have access to Prism Memory tools (session_save_ledger, session_load_context, " | |
| "session_search_memory, session_save_handoff, session_forget_memory, session_health_check, " | |
| "session_compact_ledger, session_export_memory, session_task_route, session_save_experience, " | |
| "session_synthesize_edges, session_backfill_links, knowledge_search, knowledge_forget, " | |
| "knowledge_upvote, knowledge_downvote, knowledge_set_retention) and 13 multimodal tool " | |
| "modules (image_gen, office, web_scraper, browser, tts, ocr, git, terminal, deps_scanner, " | |
| "hipaa, data_graph, templates, pdf_parser). " | |
| "Think step-by-step before answering. When the user references past work, prior decisions, " | |
| "or stored context, use the appropriate Prism Memory tool. " | |
| "Format tool calls inside <tool_call>...</tool_call> JSON blocks with fields 'name' and 'arguments'. " | |
| "If no tool is needed, answer directly in plain text. " | |
| "ABSTAIN for general programming questions, CS concepts, greetings, and capability questions." | |
| ) | |
| for display_i, (orig_idx, (prompt, expected, required_params, category)) in enumerate(indexed_tests, 1): | |
| full_prompt = f"<|im_start|>system\n{_sys_prompt}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n" | |
| raw, got_tool, got_args, latency = call_ollama(full_prompt) | |
| # Layer 3: reject false positive tool calls on general programming prompts | |
| # Disabled during training benchmarks so RFT/DPO sees true model failures. | |
| if not no_validate_layer3: | |
| got_tool, got_args = validate_tool_call(prompt, got_tool, got_args) | |
| verdict = evaluate_result(expected, required_params, got_tool, got_args) | |
| is_pass = verdict in ("strict_pass", "partial_pass") | |
| icon = "✅" if verdict == "strict_pass" else ("⚠️" if verdict == "partial_pass" else "❌") | |
| # Truncate prompt for display | |
| short_prompt = prompt[:55] | |
| tag = f"#{orig_idx+1}" | |
| print(f" [{display_i:2d}/{len(BLIND_TESTS)}] {icon} {tag:4s}| expect={expected:28s} got={got_tool:28s} | {latency:5.1f}s | {short_prompt}") | |
| if verdict not in ("strict_pass",): | |
| if verdict == "partial_pass": | |
| missing = [p for p in required_params if p not in got_args] | |
| print(f" ↳ missing params: {missing}") | |
| elif verdict == "false_positive": | |
| print(f" ↳ FALSE POSITIVE: called {got_tool} when no tool expected") | |
| elif verdict == "false_negative": | |
| print(f" ↳ FALSE NEGATIVE: no tool called when {expected} expected") | |
| elif verdict == "wrong_tool": | |
| print(f" ↳ WRONG TOOL: expected {expected}, got {got_tool}") | |
| results[orig_idx] = { | |
| "id": orig_idx + 1, | |
| "prompt": prompt, | |
| "expected": expected, | |
| "got": got_tool, | |
| "got_args": got_args, | |
| "verdict": verdict, | |
| "latency": latency, | |
| "category": category | |
| } | |
| # Category tracking | |
| if category not in category_stats: | |
| category_stats[category] = {"total": 0, "strict": 0, "partial": 0, "fail": 0} | |
| category_stats[category]["total"] += 1 | |
| if verdict == "strict_pass": | |
| category_stats[category]["strict"] += 1 | |
| elif verdict == "partial_pass": | |
| category_stats[category]["partial"] += 1 | |
| else: | |
| category_stats[category]["fail"] += 1 | |
| # Summary | |
| strict = sum(1 for r in results if r["verdict"] == "strict_pass") | |
| partial = sum(1 for r in results if r["verdict"] == "partial_pass") | |
| fails = sum(1 for r in results if r["verdict"] not in ("strict_pass", "partial_pass")) | |
| total = len(results) | |
| tool_tests = [r for r in results if r["expected"] != "NO_TOOL"] | |
| no_tool_tests = [r for r in results if r["expected"] == "NO_TOOL"] | |
| tool_strict = sum(1 for r in tool_tests if r["verdict"] == "strict_pass") | |
| tool_partial = sum(1 for r in tool_tests if r["verdict"] == "partial_pass") | |
| no_tool_pass = sum(1 for r in no_tool_tests if r["verdict"] == "strict_pass") | |
| avg_latency = sum(r["latency"] for r in results) / total | |
| print() | |
| print("=" * 70) | |
| print("SWE-BENCH RESULTS (Blind Evaluation)") | |
| print("=" * 70) | |
| print(f" Strict Pass: {strict}/{total} = {strict/total*100:.0f}%") | |
| print(f" Partial Pass: {partial}/{total} = {partial/total*100:.0f}%") | |
| print(f" Total Pass: {strict+partial}/{total} = {(strict+partial)/total*100:.0f}%") | |
| print(f" Fail: {fails}/{total} = {fails/total*100:.0f}%") | |
| print(f" ---") | |
| print(f" Tool Strict: {tool_strict}/{len(tool_tests)} = {tool_strict/len(tool_tests)*100:.0f}%") | |
| print(f" Tool Partial: {tool_partial}/{len(tool_tests)} = {tool_partial/len(tool_tests)*100:.0f}%") | |
| print(f" Abstention: {no_tool_pass}/{len(no_tool_tests)} = {no_tool_pass/len(no_tool_tests)*100:.0f}%") | |
| print(f" Avg latency: {avg_latency:.1f}s") | |
| print() | |
| print(" Category Breakdown:") | |
| for cat, stats in sorted(category_stats.items()): | |
| pct = (stats["strict"] + stats["partial"]) / stats["total"] * 100 | |
| print(f" {cat:20s}: {stats['strict']}/{stats['total']} strict, {stats['partial']} partial, {stats['fail']} fail ({pct:.0f}%)") | |
| print("=" * 70) | |
| # Save report | |
| report = { | |
| "model": MODEL, | |
| "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"), | |
| "total_tests": total, | |
| "strict_pass": strict, | |
| "partial_pass": partial, | |
| "fails": fails, | |
| "strict_rate": strict / total, | |
| "total_pass_rate": (strict + partial) / total, | |
| "tool_strict_rate": tool_strict / len(tool_tests), | |
| "abstention_rate": no_tool_pass / len(no_tool_tests), | |
| "avg_latency": avg_latency, | |
| "category_stats": category_stats, | |
| "results": results | |
| } | |
| os.makedirs("results", exist_ok=True) | |
| with open("results/swe_bench_report.json", "w") as f: | |
| json.dump(report, f, indent=2, default=str) | |
| print(f"\nReport saved: results/swe_bench_report.json") | |
| return strict, total, results | |
| import os | |
| import argparse | |
| if __name__ == "__main__": | |
| parser = argparse.ArgumentParser() | |
| parser.add_argument("--model", type=str, default=None, help="Ollama model tag to evaluate (overrides MODEL constant)") | |
| parser.add_argument("--runs", type=int, default=1, help="Number of eval runs for statistical validation") | |
| parser.add_argument("--shuffle", action="store_true", help="Randomize test order each run") | |
| parser.add_argument("--no-validate-layer3", action="store_true", | |
| help="Disable Layer 3 false-positive rejection (use during training benchmarks " | |
| "so RFT/DPO sees true model failures, not heuristic-corrected results)") | |
| args = parser.parse_args() | |
| if args.model: | |
| MODEL = args.model | |
| if args.runs == 1: | |
| main(shuffle=args.shuffle, no_validate_layer3=args.no_validate_layer3) | |
| else: | |
| all_scores = [] | |
| per_test_pass = [0] * len(BLIND_TESTS) | |
| per_test_fail_tools = [[] for _ in range(len(BLIND_TESTS))] | |
| for run_idx in range(args.runs): | |
| seed = random.randint(0, 9999) if args.shuffle else None | |
| print(f"\n{'#'*70}") | |
| print(f" RUN {run_idx+1}/{args.runs}" + (f" (seed={seed})" if seed else "")) | |
| print(f"{'#'*70}") | |
| if seed is not None: | |
| random.seed(seed) | |
| strict, total, results = main(shuffle=args.shuffle, no_validate_layer3=args.no_validate_layer3) | |
| all_scores.append(strict) | |
| for i, r in enumerate(results): | |
| if r["verdict"] == "strict_pass": | |
| per_test_pass[i] += 1 | |
| else: | |
| per_test_fail_tools[i].append(r.get("got", "???")) | |
| # Multi-run summary | |
| med = statistics.median(all_scores) | |
| avg = sum(all_scores) / len(all_scores) | |
| print(f"\n{'='*70}") | |
| print(f" MULTI-RUN SUMMARY ({args.runs} runs × {total} tests" + (" — RANDOMIZED ORDER" if args.shuffle else "") + ")") | |
| print(f"{'='*70}") | |
| print(f" Scores: {' | '.join(f'{s}/{total}' for s in all_scores)}") | |
| print(f" Median: {med}/{total} = {med/total*100:.1f}%") | |
| print(f" Average: {avg:.1f}/{total} = {avg/total*100:.1f}%") | |
| print(f" Min: {min(all_scores)}/{total} = {min(all_scores)/total*100:.0f}%") | |
| print(f" Max: {max(all_scores)}/{total} = {max(all_scores)/total*100:.0f}%") | |
| # Per-test consistency | |
| print(f"\n Per-Test Consistency (N={args.runs} runs):") | |
| flaky = [] | |
| for i, (prompt, expected, _, cat) in enumerate(BLIND_TESTS): | |
| rate = per_test_pass[i] / args.runs | |
| if rate < 1.0: | |
| fail_tools = per_test_fail_tools[i] | |
| flaky.append((i+1, prompt[:60], expected, rate, fail_tools)) | |
| status = f" ⚠️ [{i+1:2d}] {rate*100:3.0f}% pass | expect={expected:25s} | fails→{','.join(set(fail_tools)):20s} | {prompt[:55]}" | |
| print(status) | |
| if not flaky: | |
| print(" ✅ All tests passed consistently across all runs!") | |
| else: | |
| print(f"\n Flaky tests: {len(flaky)}/{total}") | |
| print(f"{'='*70}") | |