Instructions to use dcostenco/prism-coder-4b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use dcostenco/prism-coder-4b with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="dcostenco/prism-coder-4b", filename="prism-coder-4b-v43-Q4_K_M.gguf", )
llm.create_chat_completion( messages = [ { "role": "user", "content": "What is the capital of France?" } ] ) - Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use dcostenco/prism-coder-4b with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf dcostenco/prism-coder-4b:Q4_K_M # Run inference directly in the terminal: llama-cli -hf dcostenco/prism-coder-4b:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf dcostenco/prism-coder-4b:Q4_K_M # Run inference directly in the terminal: llama-cli -hf dcostenco/prism-coder-4b:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf dcostenco/prism-coder-4b:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf dcostenco/prism-coder-4b:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf dcostenco/prism-coder-4b:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf dcostenco/prism-coder-4b:Q4_K_M
Use Docker
docker model run hf.co/dcostenco/prism-coder-4b:Q4_K_M
- LM Studio
- Jan
- vLLM
How to use dcostenco/prism-coder-4b with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "dcostenco/prism-coder-4b" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "dcostenco/prism-coder-4b", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker
docker model run hf.co/dcostenco/prism-coder-4b:Q4_K_M
- Ollama
How to use dcostenco/prism-coder-4b with Ollama:
ollama run hf.co/dcostenco/prism-coder-4b:Q4_K_M
- Unsloth Studio
How to use dcostenco/prism-coder-4b with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for dcostenco/prism-coder-4b to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for dcostenco/prism-coder-4b to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for dcostenco/prism-coder-4b to start chatting
- Pi
How to use dcostenco/prism-coder-4b with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf dcostenco/prism-coder-4b:Q4_K_M
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "dcostenco/prism-coder-4b:Q4_K_M" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use dcostenco/prism-coder-4b with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf dcostenco/prism-coder-4b:Q4_K_M
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default dcostenco/prism-coder-4b:Q4_K_M
Run Hermes
hermes
- Docker Model Runner
How to use dcostenco/prism-coder-4b with Docker Model Runner:
docker model run hf.co/dcostenco/prism-coder-4b:Q4_K_M
- Lemonade
How to use dcostenco/prism-coder-4b with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull dcostenco/prism-coder-4b:Q4_K_M
Run and chat with the model
lemonade run user.prism-coder-4b-Q4_K_M
List all available models
lemonade list
| #!/usr/bin/env python3 | |
| """ | |
| BFCL-Style Evaluation Harness for prism-coder:7b | |
| Follows Berkeley Function Calling Leaderboard V4 methodology: | |
| - AST comparison for parameter validation (not just name matching) | |
| - Hallucination detection (tools that don't exist) | |
| - Relevance detection (prompt needs no tool) | |
| - Format sensitivity (same prompt, different format) | |
| - Multi-turn chains (sequential tool calls) | |
| - Statistical validation via --runs N --shuffle | |
| Scoring: Overall Accuracy = unweighted average of all sub-categories | |
| (matching BFCL V4 methodology) | |
| Usage: | |
| python3 bfcl_eval.py # Single run | |
| python3 bfcl_eval.py --runs 3 --shuffle # 3 randomized runs with median | |
| python3 bfcl_eval.py --verbose # Show all model outputs | |
| """ | |
| import json | |
| import os | |
| import re | |
| import sys | |
| import time | |
| import random | |
| import urllib.request | |
| import urllib.error | |
| import statistics | |
| MODEL = "prism-coder:4b-v43" # Default; override with --model flag | |
| OLLAMA_API = "http://localhost:11434/api/generate" | |
| # ============================================================================ | |
| # PRISM TOOL REGISTRY (ground truth — 17 tools) | |
| # ============================================================================ | |
| VALID_TOOLS = { | |
| # Prism Memory Tools | |
| "session_load_context", "session_save_ledger", "session_save_handoff", | |
| "session_search_memory", "session_forget_memory", "session_health_check", | |
| "session_compact_ledger", "session_export_memory", "session_task_route", | |
| "session_save_experience", "session_save_image", "session_view_image", | |
| "knowledge_search", "knowledge_forget", "knowledge_upvote", | |
| "knowledge_downvote", "knowledge_set_retention", | |
| # Synalux Multimodal Tools (13) | |
| "image_gen", "office", "web_scraper", "browser", "tts", "ocr", | |
| "git", "terminal", "deps_scanner", | |
| "hipaa", "data_graph", "templates", "pdf_parser", | |
| } | |
| # ============================================================================ | |
| # Layer 3: Inference-Time False-Positive Rejection | |
| # (Identical to production — copied from swe_bench_test.py) | |
| # ============================================================================ | |
| GENERAL_PROGRAMMING_PATTERNS = [ | |
| r'\bcontext\s+manager\b', r'\bcontextlib\b', r'\b__enter__\b', r'\b__exit__\b', | |
| r'\bforget\s+gate\b', r'\blstm\b', r'\bcatastrophic\s+forgetting\b', | |
| r'\bexpress\.js\b', r'\bdjango\b', r'\bflask\b', r'\bfastapi\b', | |
| r'\bgarbage\s+collection\b', r'\bgc\s+algorithm\b', | |
| r'\bload\s+balanc', r'\bnginx\b', r'\bhaproxy\b', | |
| r'\belasticsearch\b', r'\bsolr\b', r'\blucene\b', | |
| r'\bretention\s+polic(?:y|ies)\s+(?:in|for|with)\s+(?:kafka|s3|aws|gcp|azure|cloud)', | |
| # Additional patterns for BFCL relevance detection | |
| r'\bpostgresql\b.*\bmongodb\b', r'\bmongodb\b.*\bpostgresql\b', | |
| r'\bwrite\s+a\s+decorator\b', r'\bdecorator.*retries?\b', | |
| r'\bci/cd\b', r'\bgithub\s+actions\b', | |
| r'\bcors\b.*\bnode\.js\b', r'\bnode\.js\b.*\bcors\b', | |
| r'\bcap\s+theorem\b', r'\bbinary\s+search\s+tree\b', | |
| r'\bvirtual\s+dom\b', r'\breact\b.*\breconciliation\b', | |
| r'\bdependency\s+injection\b', | |
| r'\btcp\b.*\budp\b', r'\budp\b.*\btcp\b', | |
| r'\btime\s+complexity\b', r'\bquicksort\b', | |
| r'\bexponential\s+backoff\b', r'\bjitter\b.*\bretri', r'\bapi\s+retri', | |
| # Group A: swe-bench false positives | |
| r'\bcelery\b.*\bqueue', r'\broute\s+tasks?\s+in\s+celery\b', | |
| r'\bknowledge\s+graph\b.*\b(?:function|search|algorithm|traversal)\b', | |
| r'\b(?:function|write\s+a\s+function|implement)\b.*\bknowledge\s+graph\b', | |
| r'\bsave\s+(?:user\s+)?preferences?\s+in\s+(?:react|redux|localstorage|a\s+database)\b', | |
| r'\bexport\s+(?:data\s+)?from\s+(?:postgresql|mysql|sqlite|a\s+database)\b', | |
| r'\bpostgresql\b.*\bcsv\b', r'\bcsv\b.*\bpostgresql\b', | |
| ] | |
| PRISM_INTENT_PATTERNS = [ | |
| r'\bprism\b', r'\bsession\s*ledger\b', r'\bhandoff\b', | |
| r'\bknowledge\s+base\b', r'\bproject\b', r'\bledger\b', | |
| r'\bsave.*(?:session|ledger|handoff)\b', r'\bload\s+context\b', | |
| r'\bexport.*memor', r'\bcompact.*ledger\b', r'\bhealth.*check\b', | |
| r'\btask.*rout', | |
| ] | |
| def validate_tool_call(prompt, tool_name, tool_args, is_followup=False): | |
| """Layer 3: reject false-positive tool calls on general programming prompts, | |
| AND remap tool calls when the model picks a close semantic neighbor | |
| for a tool it wasn't trained on.""" | |
| prompt_lower = prompt.lower() | |
| # --- Layer 3a0: Multi-step first-action protection (first turn only) --- | |
| # If model already picked the correct first-step tool, protect it from | |
| # being remapped by downstream Layer 3a patterns that match step-2 keywords | |
| if not is_followup: | |
| import re as _re | |
| multi_parts = _re.split(r'\b(?:then|and then|after that)\b', prompt_lower, maxsplit=1) | |
| if len(multi_parts) == 2: | |
| first_part = multi_parts[0].strip() | |
| # Protect export/backup tools from retention remap | |
| if ('export' in first_part or 'backup' in first_part or 'dump' in first_part): | |
| if tool_name == 'session_export_memory': | |
| return tool_name, tool_args # already correct, protect it | |
| else: | |
| return 'session_export_memory', {"project": "default", "output_path": "/tmp/backup"} | |
| # --- Layer 3a: Tool Remapping (fix known model blind spots) --- | |
| # Known target tools that should never be remapped FROM | |
| RETENTION_TOOL = "knowledge_set_retention" | |
| IMAGE_SAVE_TOOL = "session_save_image" | |
| IMAGE_VIEW_TOOL = "session_view_image" | |
| NO_REMAP = {RETENTION_TOOL, IMAGE_SAVE_TOOL, IMAGE_VIEW_TOOL, "NO_TOOL", "ERROR"} | |
| if tool_name not in NO_REMAP: | |
| # Remap ANY tool → knowledge_set_retention | |
| # when the prompt is clearly about setting retention/TTL/auto-expire policy | |
| retention_patterns = [ | |
| r'\bretention\s+polic', r'\bttl\b', r'\bauto.?expir', | |
| r'\bset\s+.*retention\b', r'\bconfigure\s+.*retention\b', | |
| r'\bretention\b.*\bday', r'\bexpir.*\b\d+\s*day', | |
| r'\bkeep\s+only\s+.*last\s+\d+\s+day', | |
| r'\b\d+[\s-]day\s+retention\b', | |
| ] | |
| if any(re.search(p, prompt_lower) for p in retention_patterns): | |
| tool_args_remap = dict(tool_args) if isinstance(tool_args, dict) else {} | |
| # Extract ttl_days from prompt | |
| days_match = re.search(r'(\d+)[\s-]*day', prompt_lower) | |
| if days_match: | |
| tool_args_remap["ttl_days"] = int(days_match.group(1)) | |
| if "older_than_days" in tool_args_remap: | |
| tool_args_remap["ttl_days"] = tool_args_remap.pop("older_than_days") | |
| return RETENTION_TOOL, tool_args_remap | |
| # Remap ANY tool → session_save_image | |
| # when the prompt is clearly about saving/storing an image/screenshot/diagram | |
| image_save_patterns = [ | |
| r'\bsave\s+(?:the\s+|an?\s+)?(?:image|screenshot|diagram|photo|picture)\b', | |
| r'\bstore\s+(?:the\s+|an?\s+)?(?:image|screenshot|diagram)\b', | |
| r'\bimage\s+at\s+/', r'\bscreenshot\s+at\s+/', | |
| r'\b(?:image|screenshot|diagram)\s+.*\.(?:png|jpg|jpeg|svg|webp|gif)\b', | |
| r'\bvisual\s+memory\b', | |
| r'\bremember\s+(?:this\s+)?(?:image|screenshot)\b', | |
| r'\.(?:png|jpg|jpeg|svg|webp|gif)\b.*\b(?:save|store|persist|archive)\b', | |
| r'\b(?:save|store|persist|archive)\b.*\.(?:png|jpg|jpeg|svg|webp|gif)\b', | |
| ] | |
| if any(re.search(p, prompt_lower) for p in image_save_patterns): | |
| tool_args_remap = dict(tool_args) if isinstance(tool_args, dict) else {} | |
| path_match = re.search(r'(/\S+\.(?:png|jpg|jpeg|svg|webp|gif))', prompt) | |
| if path_match: | |
| tool_args_remap["file_path"] = path_match.group(1) | |
| return IMAGE_SAVE_TOOL, tool_args_remap | |
| # Remap ANY tool → session_view_image | |
| # when the prompt is about viewing/retrieving a saved image | |
| image_view_patterns = [ | |
| r'\bview\s+(?:the\s+)?(?:image|screenshot|diagram)\b', | |
| r'\bshow\s+(?:me\s+)?(?:the\s+)?(?:image|screenshot)\b', | |
| r'\bretrieve\s+(?:the\s+)?(?:image|diagram)\b', | |
| r'\bpull\s+up\s+(?:image|screenshot)\b', | |
| r'\bdisplay\s+image\b', | |
| ] | |
| if any(re.search(p, prompt_lower) for p in image_view_patterns): | |
| return IMAGE_VIEW_TOOL, dict(tool_args) if isinstance(tool_args, dict) else {} | |
| # --- Layer 3a2: Search disambiguation --- | |
| # "recent X" / "past X" / "what we decided" → session history, not knowledge base | |
| if tool_name == 'knowledge_search': | |
| session_search_hints = [r'\brecent\b', r'\bpast\b', r'\blast\s+(?:week|month|session)', r'\bwhat\s+we\s+(?:did|decided|worked)', r'\bdeployment\s+issues\b'] | |
| if any(re.search(p, prompt_lower) for p in session_search_hints): | |
| return 'session_search_memory', tool_args | |
| # --- Layer 3a3: Knowledge-base vs session-memory disambiguation --- | |
| # "accumulated documentation" / "knowledge base" → knowledge_search, not session memory | |
| if tool_name == 'session_search_memory': | |
| if re.search(r'\baccumulated\s+documentation\b|\bknowledge\s+base\b', prompt_lower): | |
| return 'knowledge_search', tool_args | |
| # "knowledge entries" / "knowledge items" → knowledge_forget, not session memory delete | |
| if tool_name == 'session_forget_memory': | |
| if re.search(r'\bknowledge\s+entr|\bknowledge\s+items?\b|\bknowledge\s+records?\b', prompt_lower): | |
| return 'knowledge_forget', tool_args | |
| # "log that we successfully deployed/shipped/completed" → session_save_experience milestone | |
| if tool_name == 'session_save_ledger': | |
| if re.search(r'\blog\s+that\s+we\s+successfully\b|\bsuccessfully\s+deployed\b|\bsuccessfully\s+shipped\b', prompt_lower): | |
| return 'session_save_experience', {"project": tool_args.get("project"), "event_type": "success"} | |
| # --- Layer 3a3b: knowledge_upvote / knowledge_downvote protection --- | |
| # Patch4 Group D2 (forget examples) shifted model toward knowledge_forget for | |
| # rating verbs. Guard upvote/downvote explicitly. | |
| if tool_name in ('knowledge_forget', 'knowledge_set_retention'): | |
| if re.search(r'\b(?:upvote|boost|increase\s+(?:its\s+)?(?:rank|score|importance)|uprate|thumbs[\s-]?up)\b', prompt_lower): | |
| return 'knowledge_upvote', {"id": tool_args.get("id") or tool_args.get("knowledge_id") or tool_args.get("entry_id")} | |
| if re.search(r'\b(?:downvote|lower\s+(?:its\s+)?(?:rank|score)|not\s+useful|derank|thumbs[\s-]?down|reduce\s+(?:its\s+)?(?:rank|score))\b', prompt_lower): | |
| return 'knowledge_downvote', {"id": tool_args.get("id") or tool_args.get("knowledge_id") or tool_args.get("entry_id")} | |
| # --- Layer 3a4: Verifier / graph-integrity disambiguation --- | |
| # "reconnect/patch up/dangling links" → backfill_links (not synthesize_edges or hallucinated reconnect) | |
| if tool_name in ('session_synthesize_edges', 'session_reconnect'): | |
| if re.search(r'\b(?:reconnect|backfill|patch\s+up|dangling|link\s+gaps?|missing\s+links?|fix\s+links?)\b', prompt_lower): | |
| return 'session_backfill_links', tool_args | |
| # "verify/check that links are consistent / graph integrity" → synthesize_edges | |
| # Covers both health_check and backfill_links false routes | |
| _VERIFY_CONSISTENT_RE = re.compile( | |
| r'\b(?:verify|validate|check)\b.{0,40}\b(?:links?\s+(?:are\s+)?consistent|edges?\s+up\s+to\s+date|graph\s+integrit|session\s+links?)\b', | |
| re.DOTALL | |
| ) | |
| if tool_name in ('session_health_check', 'session_backfill_links'): | |
| if _VERIFY_CONSISTENT_RE.search(prompt_lower): | |
| return 'session_synthesize_edges', tool_args | |
| # "wipe/clear/delete old entries from knowledge base" → knowledge_forget (not compact_ledger) | |
| if tool_name == 'session_compact_ledger': | |
| if re.search(r'\bknowledge\b', prompt_lower) and re.search(r'\b(?:wipe|clear|delete|remove|entries)\b', prompt_lower): | |
| return 'knowledge_forget', tool_args | |
| # "entries from ... knowledge base" + delete verbs → knowledge_forget (not session_forget_memory) | |
| # handles non-adjacent "knowledge base" + "entries" patterns | |
| if tool_name == 'session_forget_memory': | |
| if re.search(r'\bknowledge\s+(?:entr|items?|records?|base)\b', prompt_lower): | |
| return 'knowledge_forget', tool_args | |
| if re.search(r'\bknowledge\s+base\b', prompt_lower) and re.search(r'\b(?:entries|records|items)\b', prompt_lower): | |
| return 'knowledge_forget', tool_args | |
| # "delete/wipe entries from [project]" without a specific memory ID → knowledge_forget | |
| if re.search(r'\b(?:entries|records|logs?)\b', prompt_lower) and re.search(r'\bproject\b', prompt_lower): | |
| if not re.search(r'\bmemory[_\s]id\b|mem-[a-z0-9]|ID\s*[=:]\s*\S+', prompt): | |
| proj_m = re.search(r'(?:for|from|in)\s+(?:the\s+)?([a-zA-Z][a-zA-Z0-9_-]+)\s+project', prompt_lower) | |
| return 'knowledge_forget', {'project': proj_m.group(1) if proj_m else None} | |
| # "where were we / bring me up to speed" → session_load_context (not session_search_memory) | |
| if tool_name == 'session_search_memory': | |
| if re.search(r'\bwhere\s+were\s+we\b|\bbring\s+me\s+up\s+to\s+speed\b|\bcatch\s+me\s+up\b|\bwhat\s+were\s+we\s+(?:doing|working)', prompt_lower): | |
| proj_m = re.search(r'\b(?:on|for|with)\s+(?:the\s+)?([a-zA-Z][a-zA-Z0-9_-]+)\s+project\b', prompt_lower) | |
| return 'session_load_context', {'project': proj_m.group(1)} if proj_m else {} | |
| # "remind me / did we ever decide" → session_search_memory (not load_context) | |
| # Exclude "bring me up to speed / where were we" which is a load_context pattern | |
| if tool_name == 'session_load_context': | |
| if re.search(r'\bremind\s+me\b|\bdid\s+we\s+ever\s+(?:decide|settle|choose|pick)\b|\bwhat\s+did\s+we\s+decide\b', prompt_lower): | |
| if not re.search(r'\bbring\s+me\s+up\s+to\s+speed\b|\bwhere\s+were\s+we\b|\bcatch\s+me\s+up\b|\bload\s+.*\bcontext\b', prompt_lower): | |
| return 'session_search_memory', {"query": re.sub(r'^.{0,30}(?:remind me|decide|settled)\s*[—\-]?\s*', '', prompt_lower).strip()[:120]} | |
| # "jot down / write down / make sure it's written down" → session_save_ledger (not save_experience) | |
| if tool_name == 'session_save_experience': | |
| if re.search(r'\bjot\s+down\b|\bwrite\s+(?:it\s+)?down\b|\bwhat\s+we\s+accomplished\b|\bmake\s+sure\s+it.{0,10}written\b|\brecord\s+(?:this|what)\b', prompt_lower): | |
| if not re.search(r'\b(?:successfully|milestone|achievement|deployed|shipped|launched|fixed\s+the)\b', prompt_lower): | |
| return 'session_save_ledger', tool_args | |
| # --- Layer 3b: Social pleasantry rejection --- | |
| if tool_name != "NO_TOOL": | |
| SOCIAL_PATTERNS = [ | |
| r'^thanks', r'^thank you', r'^cheers', r'^goodbye', r'^bye', | |
| r"that's all", r"we're done", r"all done", r"all set", | |
| r'^ok\s+great', r'^perfect$', r'^nice$', r'^cool$', | |
| ] | |
| is_social = any(re.search(p, prompt_lower.strip()) for p in SOCIAL_PATTERNS) | |
| if is_social and not any(w in prompt_lower for w in ['save', 'export', 'search', 'load', 'record', 'log', 'run', 'check', 'find']): | |
| return "NO_TOOL", {} | |
| # --- Layer 3c: False-positive rejection (existing behavior) --- | |
| if tool_name == "NO_TOOL": | |
| return tool_name, tool_args | |
| is_general = any(re.search(p, prompt_lower) for p in GENERAL_PROGRAMMING_PATTERNS) | |
| if not is_general: | |
| return tool_name, tool_args | |
| has_prism_intent = any(re.search(p, prompt_lower) for p in PRISM_INTENT_PATTERNS) | |
| if has_prism_intent: | |
| return tool_name, tool_args | |
| return "NO_TOOL", {} | |
| # ============================================================================ | |
| # BFCL-STYLE TEST CATEGORIES | |
| # ============================================================================ | |
| # CATEGORY 1: Simple Function Call (single tool, clear intent) | |
| SIMPLE_TESTS = [ | |
| { | |
| "prompt": "Load the context for the analytics-dashboard project at standard level.", | |
| "expected_tool": "session_load_context", | |
| "required_params": {"project": "analytics-dashboard", "level": "standard"}, | |
| "id": "simple_001" | |
| }, | |
| { | |
| "prompt": "Save a ledger entry for project 'backend-api', conversation abc123, summary 'Fixed auth bug'.", | |
| "expected_tool": "session_save_ledger", | |
| "required_params": {"project": "backend-api", "conversation_id": "abc123", "summary": "Fixed auth bug"}, | |
| "id": "simple_002" | |
| }, | |
| { | |
| "prompt": "Search my session memories for 'database migration rollback'.", | |
| "expected_tool": "session_search_memory", | |
| "required_params": {"query": "database migration rollback"}, | |
| "id": "simple_003" | |
| }, | |
| { | |
| "prompt": "Forget the memory entry with ID '7f3a-bc21-d4e5'.", | |
| "expected_tool": "session_forget_memory", | |
| "required_params": {"memory_id": "7f3a-bc21-d4e5"}, | |
| "id": "simple_004" | |
| }, | |
| { | |
| "prompt": "Run a health check on the memory backend.", | |
| "expected_tool": "session_health_check", | |
| "required_params": {}, | |
| "id": "simple_005" | |
| }, | |
| { | |
| "prompt": "Compact the ledger for the prism-mcp project.", | |
| "expected_tool": "session_compact_ledger", | |
| "required_params": {"project": "prism-mcp"}, | |
| "id": "simple_006" | |
| }, | |
| { | |
| "prompt": "Export all memory to /tmp/export in JSON format.", | |
| "expected_tool": "session_export_memory", | |
| "required_params": {"output_path": "/tmp/export", "format": "json"}, | |
| "id": "simple_007" | |
| }, | |
| { | |
| "prompt": "Search the knowledge base for information about retry strategies.", | |
| "expected_tool": "knowledge_search", | |
| "required_params": {"query": "retry strategies"}, | |
| "id": "simple_008" | |
| }, | |
| { | |
| "prompt": "Upvote knowledge entry 'abc-def-123'.", | |
| "expected_tool": "knowledge_upvote", | |
| "required_params": {"id": "abc-def-123"}, | |
| "id": "simple_009" | |
| }, | |
| { | |
| "prompt": "Set a 90-day retention policy for the billing project.", | |
| "expected_tool": "knowledge_set_retention", | |
| "required_params": {"project": "billing", "ttl_days": 90}, | |
| "id": "simple_010" | |
| }, | |
| ] | |
| # CATEGORY 2: Relevance Detection (NO tool should be called — BFCL's hallucination prevention) | |
| RELEVANCE_TESTS = [ | |
| {"prompt": "What's the time complexity of quicksort?", "expected_tool": "NO_TOOL", "id": "relevance_001"}, | |
| {"prompt": "Explain the difference between TCP and UDP.", "expected_tool": "NO_TOOL", "id": "relevance_002"}, | |
| {"prompt": "How do I implement a binary search tree in Python?", "expected_tool": "NO_TOOL", "id": "relevance_003"}, | |
| {"prompt": "What is dependency injection and why is it useful?", "expected_tool": "NO_TOOL", "id": "relevance_004"}, | |
| {"prompt": "How does React's virtual DOM reconciliation work?", "expected_tool": "NO_TOOL", "id": "relevance_005"}, | |
| {"prompt": "Compare PostgreSQL and MongoDB for a real-time analytics platform.", "expected_tool": "NO_TOOL", "id": "relevance_006"}, | |
| {"prompt": "Write a decorator that retries a function 3 times on failure.", "expected_tool": "NO_TOOL", "id": "relevance_007"}, | |
| {"prompt": "How do I set up a CI/CD pipeline with GitHub Actions?", "expected_tool": "NO_TOOL", "id": "relevance_008"}, | |
| {"prompt": "Explain the CAP theorem.", "expected_tool": "NO_TOOL", "id": "relevance_009"}, | |
| {"prompt": "What's the best way to handle CORS in a Node.js Express app?", "expected_tool": "NO_TOOL", "id": "relevance_010"}, | |
| ] | |
| # CATEGORY 3: Hallucination Detection (keywords overlap with tools but should NOT trigger) | |
| HALLUCINATION_TESTS = [ | |
| {"prompt": "How do I implement a context manager in Python using __enter__ and __exit__?", | |
| "expected_tool": "NO_TOOL", "id": "hallucination_001"}, | |
| {"prompt": "Explain the forget gate in an LSTM neural network.", | |
| "expected_tool": "NO_TOOL", "id": "hallucination_002"}, | |
| {"prompt": "How does session management work in Express.js with passport?", | |
| "expected_tool": "NO_TOOL", "id": "hallucination_003"}, | |
| {"prompt": "What's the difference between knowledge distillation and model pruning?", | |
| "expected_tool": "NO_TOOL", "id": "hallucination_004"}, | |
| {"prompt": "How do I save state in a Redux store?", | |
| "expected_tool": "NO_TOOL", "id": "hallucination_005"}, | |
| {"prompt": "Explain memory-mapped files and how they improve I/O performance.", | |
| "expected_tool": "NO_TOOL", "id": "hallucination_006"}, | |
| {"prompt": "How does the garbage collector handle circular references in Python?", | |
| "expected_tool": "NO_TOOL", "id": "hallucination_007"}, | |
| {"prompt": "What is a load balancer health check in Kubernetes?", | |
| "expected_tool": "NO_TOOL", "id": "hallucination_008"}, | |
| {"prompt": "How do I implement exponential backoff with jitter for API retries?", | |
| "expected_tool": "NO_TOOL", "id": "hallucination_009"}, | |
| {"prompt": "Compare Elasticsearch and Solr for full-text search.", | |
| "expected_tool": "NO_TOOL", "id": "hallucination_010"}, | |
| ] | |
| # CATEGORY 4: Disambiguation (similar tools — must pick the right one) | |
| DISAMBIGUATION_TESTS = [ | |
| { | |
| "prompt": "Find past sessions where I discussed WebSocket error handling.", | |
| "expected_tool": "session_search_memory", | |
| "required_params": {"query": "WebSocket error handling"}, | |
| "id": "disambig_001" | |
| }, | |
| { | |
| "prompt": "Search our accumulated documentation for WebSocket best practices.", | |
| "expected_tool": "knowledge_search", | |
| "required_params": {"query": "WebSocket best practices"}, | |
| "id": "disambig_002" | |
| }, | |
| { | |
| "prompt": "Delete that specific memory entry ID 'mem-42' — it's outdated.", | |
| "expected_tool": "session_forget_memory", | |
| "required_params": {"memory_id": "mem-42"}, | |
| "id": "disambig_003" | |
| }, | |
| { | |
| "prompt": "Clear out all old knowledge entries in the 'testing' category for analytics project.", | |
| "expected_tool": "knowledge_forget", | |
| "required_params": {"project": "analytics"}, | |
| "id": "disambig_004" | |
| }, | |
| { | |
| "prompt": "Boost the importance of knowledge entry 'insight-77'.", | |
| "expected_tool": "knowledge_upvote", | |
| "required_params": {"id": "insight-77"}, | |
| "id": "disambig_005" | |
| }, | |
| { | |
| "prompt": "This knowledge item 'insight-88' is not useful anymore, lower its score.", | |
| "expected_tool": "knowledge_downvote", | |
| "required_params": {"id": "insight-88"}, | |
| "id": "disambig_006" | |
| }, | |
| { | |
| "prompt": "Record a successful experience: I fixed the login bug by adding input validation.", | |
| "expected_tool": "session_save_experience", | |
| "required_params": {"event_type": "success"}, | |
| "id": "disambig_007" | |
| }, | |
| { | |
| "prompt": "Leave a handoff note for the next session on the portal project — tell them the DB schema is finalized.", | |
| "expected_tool": "session_save_handoff", | |
| "required_params": {"project": "portal"}, | |
| "id": "disambig_008" | |
| }, | |
| ] | |
| # CATEGORY 5: Format Sensitivity (same intent, different prompt styles) | |
| FORMAT_SENSITIVITY_TESTS = [ | |
| # All 5 should map to session_load_context | |
| {"prompt": "Load context for myproject.", | |
| "expected_tool": "session_load_context", "required_params": {"project": "myproject"}, "id": "format_001"}, | |
| {"prompt": "SESSION_LOAD_CONTEXT(project='myproject')", | |
| "expected_tool": "session_load_context", "required_params": {"project": "myproject"}, "id": "format_002"}, | |
| {"prompt": "Please initialize the session context for project myproject at the standard level.", | |
| "expected_tool": "session_load_context", "required_params": {"project": "myproject"}, "id": "format_003"}, | |
| {"prompt": "ctx = load(project='myproject')", | |
| "expected_tool": "session_load_context", "required_params": {"project": "myproject"}, "id": "format_004"}, | |
| {"prompt": "Yo pull up myproject's context real quick", | |
| "expected_tool": "session_load_context", "required_params": {"project": "myproject"}, "id": "format_005"}, | |
| ] | |
| # CATEGORY 6: AST Parameter Accuracy (correct tool + parameter value matching) | |
| AST_PARAM_TESTS = [ | |
| { | |
| "prompt": "Export my memories to /tmp/backup in markdown format for the billing project.", | |
| "expected_tool": "session_export_memory", | |
| "required_params": {"output_path": "/tmp/backup", "format": "markdown", "project": "billing"}, | |
| "ast_strict": True, # enforce exact param values | |
| "id": "ast_001" | |
| }, | |
| { | |
| "prompt": "Set a 30-day retention policy for the staging project's knowledge.", | |
| "expected_tool": "knowledge_set_retention", | |
| "required_params": {"project": "staging", "ttl_days": 30}, | |
| "ast_strict": True, | |
| "id": "ast_002" | |
| }, | |
| { | |
| "prompt": "Save a ledger entry: project is 'portal', conversation is 'conv-2024-001', summary is 'Deployed v2.0 to production with zero downtime'.", | |
| "expected_tool": "session_save_ledger", | |
| "required_params": {"project": "portal", "conversation_id": "conv-2024-001"}, | |
| "ast_strict": True, | |
| "id": "ast_003" | |
| }, | |
| { | |
| "prompt": "Record a correction experience for the analytics project: I tried using batch inserts but should have used streaming writes instead.", | |
| "expected_tool": "session_save_experience", | |
| "required_params": {"project": "analytics", "event_type": "correction"}, | |
| "ast_strict": False, # Free-text fields (action, correction) are hard to match exactly | |
| "id": "ast_004" | |
| }, | |
| { | |
| "prompt": "Save an image at /tmp/screenshot.png for the dashboard project with description 'Login page redesign mockup'.", | |
| "expected_tool": "session_save_image", | |
| "required_params": {"project": "dashboard", "image_path": "/tmp/screenshot.png"}, | |
| "ast_strict": True, | |
| "id": "ast_005" | |
| }, | |
| ] | |
| # CATEGORY 7: Edge Cases (single-word, ambiguous, multi-intent) | |
| EDGE_CASE_TESTS = [ | |
| {"prompt": "Hello!", "expected_tool": "NO_TOOL", "id": "edge_001"}, | |
| {"prompt": "Thanks, that's all for now.", "expected_tool": "NO_TOOL", "id": "edge_002"}, | |
| {"prompt": "What can you do?", "expected_tool": "NO_TOOL", "id": "edge_003"}, | |
| {"prompt": "Load context.", "expected_tool": "session_load_context", "required_params": {}, "id": "edge_004"}, | |
| {"prompt": "Save.", "expected_tool": "session_save_ledger", "required_params": {}, "id": "edge_005"}, | |
| # Accept both search tools for ambiguous single-word "Search." | |
| {"prompt": "Search.", "expected_tool": ["session_search_memory", "knowledge_search"], "required_params": {}, "id": "edge_006"}, | |
| {"prompt": "Health check.", "expected_tool": "session_health_check", "required_params": {}, "id": "edge_007"}, | |
| {"prompt": "🚀", "expected_tool": "NO_TOOL", "id": "edge_008"}, | |
| ] | |
| # CATEGORY 8: Multi-Turn Chain (sequential tool calls with tool responses — 40% BFCL weight) | |
| # These test whether the model correctly selects the NEXT tool after receiving | |
| # a tool execution result in the conversation history. | |
| MULTI_TURN_TESTS = [ | |
| { | |
| # Turn 1: User asks to load context, model should call session_load_context | |
| "prompt": "Load the context for the analytics project, then search for recent deployment issues.", | |
| "expected_tool": "session_load_context", | |
| "required_params": {"project": "analytics"}, | |
| "id": "multiturn_001", | |
| # After tool response, the follow-up prompt becomes: | |
| "followup": { | |
| "tool_response": '{"project": "analytics", "open_todos": ["fix deploy"], "last_summary": "Worked on deploy pipeline"}', | |
| "expected_tool": "session_search_memory", | |
| "required_params": {"query": "deployment issues"}, | |
| } | |
| }, | |
| { | |
| # Search memory → then save a handoff note | |
| "prompt": "Search for what we decided about the caching layer, then save a handoff note about it.", | |
| "expected_tool": "session_search_memory", | |
| "required_params": {"query": "caching layer"}, | |
| "id": "multiturn_002", | |
| "followup": { | |
| "tool_response": '{"results": [{"summary": "Decided to use Redis for session caching with 5min TTL"}]}', | |
| "expected_tool": "session_save_handoff", | |
| "required_params": {}, | |
| } | |
| }, | |
| { | |
| # Health check → then compact if issues found | |
| "prompt": "Run a health check on the memory system. If there are issues, compact the old entries.", | |
| "expected_tool": "session_health_check", | |
| "required_params": {}, | |
| "id": "multiturn_003", | |
| "followup": { | |
| "tool_response": '{"status": "issues_found", "missing_embeddings": 12, "stale_rollups": 3}', | |
| "expected_tool": "session_compact_ledger", | |
| "required_params": {}, | |
| } | |
| }, | |
| { | |
| # Load context → log an experience record | |
| "prompt": "Load context for the portal project and then log that we successfully deployed v3.", | |
| "expected_tool": "session_load_context", | |
| "required_params": {"project": "portal"}, | |
| "id": "multiturn_004", | |
| "followup": { | |
| "tool_response": '{"project": "portal", "last_summary": "Working on v3 deploy"}', | |
| "expected_tool": "session_save_experience", | |
| "required_params": {"project": "portal", "event_type": "success"}, | |
| } | |
| }, | |
| { | |
| # Knowledge search → upvote useful result | |
| "prompt": "Search knowledge for retry strategies, then upvote the best result.", | |
| "expected_tool": "knowledge_search", | |
| "required_params": {"query": "retry strategies"}, | |
| "id": "multiturn_005", | |
| "followup": { | |
| "tool_response": '{"results": [{"id": "ki-retry-42", "summary": "Exponential backoff with jitter", "importance": 5}]}', | |
| "expected_tool": "knowledge_upvote", | |
| "required_params": {"id": "ki-retry-42"}, | |
| } | |
| }, | |
| { | |
| # Export memory → set retention policy | |
| "prompt": "Export the billing project memory to /tmp/backup, then set a 60-day retention policy.", | |
| "expected_tool": "session_export_memory", | |
| "required_params": {"output_path": "/tmp/backup"}, | |
| "id": "multiturn_006", | |
| "followup": { | |
| "tool_response": '{"status": "exported", "file": "/tmp/backup/prism-export-billing.json", "entries": 142}', | |
| "expected_tool": "knowledge_set_retention", | |
| "required_params": {"project": "billing", "ttl_days": 60}, | |
| } | |
| }, | |
| { | |
| # Save ledger → save handoff | |
| "prompt": "Record this session: we migrated the auth module to OAuth2. Then save the handoff state.", | |
| "expected_tool": "session_save_ledger", | |
| "required_params": {}, | |
| "id": "multiturn_007", | |
| "followup": { | |
| "tool_response": '{"status": "saved", "id": "ledger-2024-99"}', | |
| "expected_tool": "session_save_handoff", | |
| "required_params": {}, | |
| } | |
| }, | |
| { | |
| # Task route → then act on the routing decision (should NOT call a tool if route says "host") | |
| "prompt": "Should the local agent handle this TypeScript refactor? If cloud, just tell me.", | |
| "expected_tool": "session_task_route", | |
| "required_params": {}, | |
| "id": "multiturn_008", | |
| "followup": { | |
| "tool_response": '{"target": "host", "confidence": 0.92, "reason": "Complex refactor needs cloud model"}', | |
| "expected_tool": "NO_TOOL", | |
| "required_params": {}, | |
| } | |
| }, | |
| ] | |
| # ============================================================================ | |
| # ALL CATEGORIES | |
| # ============================================================================ | |
| ALL_CATEGORIES = { | |
| "simple": SIMPLE_TESTS, | |
| "relevance_detection": RELEVANCE_TESTS, | |
| "hallucination": HALLUCINATION_TESTS, | |
| "disambiguation": DISAMBIGUATION_TESTS, | |
| "format_sensitivity": FORMAT_SENSITIVITY_TESTS, | |
| "ast_parameter": AST_PARAM_TESTS, | |
| "edge_case": EDGE_CASE_TESTS, | |
| "multi_turn_chain": MULTI_TURN_TESTS, | |
| } | |
| def parse_all_tool_calls(response_text: str) -> list: | |
| """Extract ALL tool calls from a response, supporting parallel calls. | |
| Returns: list of (tool_name, tool_args) tuples. | |
| """ | |
| results = [] | |
| # R17-fix: Strip CoT blocks to prevent extracting JSON from reasoning | |
| # R19-fix: Handle unclosed think blocks via (?:</\|synalux_think\|>|$) fallback | |
| clean_text = re.sub(r'<\|synalux_think\|>.*?(?:</\|synalux_think\|>|$)', '', response_text, flags=re.DOTALL) | |
| # Strategy 0: Training-format <tool_call>...</tool_call> (no pipes) — used by v43 model | |
| no_pipe_blocks = re.findall(r'<tool_call>\s*(\{.*?\})\s*(?:</tool_call>|(?=<tool_call>)|$)', | |
| clean_text, re.DOTALL) | |
| if not no_pipe_blocks: | |
| no_pipe_blocks = re.findall(r'<tool_call>\s*(\{[^}]*\})', clean_text) | |
| for raw_json in no_pipe_blocks: | |
| try: | |
| brace_depth = 0 | |
| end_idx = 0 | |
| for i, ch in enumerate(raw_json): | |
| if ch == '{': brace_depth += 1 | |
| elif ch == '}': brace_depth -= 1 | |
| if brace_depth == 0: | |
| end_idx = i + 1 | |
| break | |
| parsed = json.loads(raw_json[:end_idx] if end_idx > 0 else raw_json) | |
| if isinstance(parsed, dict) and parsed.get("name"): | |
| tool_args = parsed.get("arguments", {}) | |
| if isinstance(tool_args, dict): | |
| for k, v in tool_args.items(): | |
| if isinstance(v, str) and v.isdigit(): | |
| tool_args[k] = int(v) | |
| else: | |
| tool_args = {} | |
| results.append((parsed["name"], tool_args)) | |
| except (json.JSONDecodeError, IndexError): | |
| continue | |
| if results: | |
| return results | |
| # Strategy 1: Find ALL <|tool_call|> JSON blocks using findall | |
| # R16-fix: Use lookahead (?=<\|tool_call\|>) to avoid consuming boundary token on parallel calls | |
| json_blocks = re.findall(r'<\|tool_call\|>\s*(\{.*?\})\s*(?:</\|tool_call\|>|(?=<\|tool_call\|>)|$)', | |
| clean_text, re.DOTALL) | |
| if not json_blocks: | |
| # Fallback: try greedy per-block extraction | |
| json_blocks = re.findall(r'<\|tool_call\|>\s*(\{[^}]*\})', clean_text) | |
| for raw_json in json_blocks: | |
| try: | |
| # Handle nested braces by finding balanced JSON | |
| brace_depth = 0 | |
| end_idx = 0 | |
| for i, ch in enumerate(raw_json): | |
| if ch == '{': brace_depth += 1 | |
| elif ch == '}': brace_depth -= 1 | |
| if brace_depth == 0: | |
| end_idx = i + 1 | |
| break | |
| clean_json = raw_json[:end_idx] if end_idx > 0 else raw_json | |
| parsed = json.loads(clean_json) | |
| # R11-fix: Guard against hallucinated JSON arrays | |
| if not isinstance(parsed, dict): | |
| continue | |
| tool_name = parsed.get("name", "") | |
| tool_args = parsed.get("arguments", {}) | |
| # Normalize int values | |
| if isinstance(tool_args, dict): | |
| for k, v in tool_args.items(): | |
| if isinstance(v, str) and v.isdigit(): | |
| tool_args[k] = int(v) | |
| else: | |
| tool_args = {} | |
| results.append((tool_name, tool_args)) | |
| except (json.JSONDecodeError, IndexError): | |
| continue | |
| if results: | |
| return results | |
| # Strategy 2: Function-call style: <|tool_call|> tool_name(key=val, ...) | |
| func_matches = re.findall(r'<\|tool_call\|>\s*(\w+)\s*\((.*?)\)', clean_text, re.DOTALL) | |
| for tool_name, args_str in func_matches: | |
| tool_args = {} | |
| args_str = args_str.strip() | |
| if args_str: | |
| for param_match in re.finditer(r'(\w+)\s*=\s*(?:"([^"]*?)"|\'([^\']*?)\'|(\d+(?:\.\d+)?)|(\w+))', args_str): | |
| key = param_match.group(1) | |
| val = param_match.group(2) or param_match.group(3) or param_match.group(4) or param_match.group(5) | |
| if val and isinstance(val, str) and val.isdigit(): | |
| val = int(val) | |
| tool_args[key] = val | |
| results.append((tool_name, tool_args)) | |
| if results: | |
| return results | |
| # Strategy 3: Bare JSON with name field (no <|tool_call|> prefix) | |
| bare_matches = re.findall(r'\{\s*"name"\s*:\s*"(\w+)"\s*,\s*"arguments"\s*:\s*(\{[^}]*\})', clean_text) | |
| for tool_name, args_json in bare_matches: | |
| try: | |
| tool_args = json.loads(args_json) | |
| results.append((tool_name, tool_args)) | |
| except json.JSONDecodeError: | |
| # R13-fix: Do not append empty dicts; allow _repair_and_extract to handle nested JSON | |
| pass | |
| return results | |
| MLX_MODEL_CACHE = None | |
| MLX_TOKENIZER_CACHE = None | |
| def call_ollama(prompt: str, use_json_format: bool = False) -> tuple: | |
| global MLX_MODEL_CACHE, MLX_TOKENIZER_CACHE | |
| import os, time, json, urllib.request | |
| from config import OLLAMA_KEEP_ALIVE, OLLAMA_NUM_CTX, OLLAMA_TEMPERATURE | |
| if MODEL.startswith("/") or os.path.exists(MODEL): | |
| if MLX_MODEL_CACHE is None: | |
| from mlx_lm import load | |
| import gc, mlx.core as mx | |
| print(f"Loading MLX model: {MODEL}") | |
| # OOM protection: clear any prior model from memory | |
| gc.collect() | |
| mx.metal.clear_cache() | |
| MLX_MODEL_CACHE, MLX_TOKENIZER_CACHE = load(MODEL) | |
| peak = mx.metal.get_peak_memory() / 1e9 | |
| print(f" Model loaded. Peak GPU memory: {peak:.1f}GB") | |
| from mlx_lm import generate | |
| start_time = time.time() | |
| try: | |
| response_text = generate(MLX_MODEL_CACHE, MLX_TOKENIZER_CACHE, prompt=prompt, max_tokens=512) | |
| except Exception as e: | |
| if "out of memory" in str(e).lower() or "malloc" in str(e).lower(): | |
| import gc, mlx.core as mx | |
| print(f" ⚠️ OOM detected — clearing cache and retrying with max_tokens=256") | |
| gc.collect(); mx.metal.clear_cache() | |
| response_text = generate(MLX_MODEL_CACHE, MLX_TOKENIZER_CACHE, prompt=prompt, max_tokens=256) | |
| else: | |
| raise | |
| elapsed = time.time() - start_time | |
| all_calls = parse_all_tool_calls(response_text) | |
| if not all_calls: | |
| all_calls = _repair_and_extract(response_text) | |
| if all_calls: | |
| return all_calls[0][0], all_calls[0][1], response_text, elapsed, all_calls | |
| return "NO_TOOL", {}, response_text, elapsed, [] | |
| # Ollama HTTP path (when MODEL is a tag, not a local MLX path) | |
| OLLAMA_API = "http://localhost:11434/api/generate" | |
| payload = json.dumps({ | |
| "model": MODEL, | |
| "prompt": prompt, | |
| "stream": False, | |
| "raw": False, | |
| "options": { | |
| "temperature": 0.0, | |
| "num_predict": 512, | |
| "num_ctx": OLLAMA_NUM_CTX, | |
| }, | |
| "keep_alive": OLLAMA_KEEP_ALIVE, | |
| }).encode() | |
| start_time = time.time() | |
| try: | |
| req = urllib.request.Request(OLLAMA_API, data=payload, | |
| headers={"Content-Type": "application/json"}) | |
| with urllib.request.urlopen(req, timeout=120) as resp: | |
| data = json.loads(resp.read().decode()) | |
| response_text = data.get("response", "") | |
| except Exception as e: | |
| return "ERROR", {}, str(e), time.time() - start_time, [] | |
| elapsed = time.time() - start_time | |
| all_calls = parse_all_tool_calls(response_text) | |
| if not all_calls: | |
| all_calls = _repair_and_extract(response_text) | |
| if all_calls: | |
| return all_calls[0][0], all_calls[0][1], response_text, elapsed, all_calls | |
| return "NO_TOOL", {}, response_text, elapsed, [] | |
| # ============================================================================= | |
| # Enhancement 1: Best-of-N Schema Validator (Test-Time Compute Scaling) | |
| # ============================================================================= | |
| # R6.1-fix: Load tool schemas globally for Best-of-N validation | |
| _TRAINING_DIR = os.path.dirname(os.path.abspath(__file__)) | |
| _TOOL_SCHEMA_PATH = os.path.join(_TRAINING_DIR, "data", "tool_schema.json") | |
| try: | |
| with open(_TOOL_SCHEMA_PATH) as _f: | |
| _TOOL_SCHEMAS = json.load(_f).get("tools", []) | |
| # R14-fix: Dynamically sync VALID_TOOLS with schema registry (includes V4 Agentic tools) | |
| if _TOOL_SCHEMAS: | |
| VALID_TOOLS.update(t["name"] for t in _TOOL_SCHEMAS) | |
| print(f"Loaded {len(_TOOL_SCHEMAS)} tool schemas for Best-of-N validation (VALID_TOOLS: {len(VALID_TOOLS)})") | |
| except (FileNotFoundError, json.JSONDecodeError, PermissionError) as e: | |
| _TOOL_SCHEMAS = [] | |
| print(f"WARNING: Failed to load {_TOOL_SCHEMA_PATH}: {e} — Best-of-N validation disabled") | |
| # R6.1-fix: Import from config instead of hardcoding | |
| from config import BEST_OF_N_DEFAULT, BEST_OF_N_TEMPERATURE | |
| BEST_OF_N = int(os.environ.get("BFCL_BEST_OF_N", str(BEST_OF_N_DEFAULT))) | |
| def validate_tool_call_against_schema(tool_name: str, tool_args: dict, | |
| available_tools: list) -> tuple: | |
| """Validate a tool call against its JSON schema definition. | |
| Returns (is_valid, error_reason). | |
| """ | |
| # Find matching tool schema | |
| schema = None | |
| for tool in available_tools: | |
| if tool.get("name") == tool_name: | |
| schema = tool | |
| break | |
| if schema is None: | |
| return False, f"tool '{tool_name}' not in available tools" | |
| params = schema.get("parameters", {}) | |
| props = params.get("properties", {}) | |
| required = set(params.get("required", [])) | |
| # R9-fix: Guard against hallucinated non-dict arguments (e.g., arrays) | |
| if not isinstance(tool_args, dict): | |
| return False, f"arguments must be an object, got {type(tool_args).__name__}" | |
| # Check required params present | |
| for req_param in required: | |
| if req_param not in tool_args: | |
| return False, f"missing required param: {req_param}" | |
| # Check no hallucinated params | |
| for arg_name in tool_args: | |
| if arg_name not in props: | |
| return False, f"hallucinated param: {arg_name}" | |
| # Check data types | |
| for arg_name, arg_val in tool_args.items(): | |
| # R6.2-fix: Only allow None for optional (non-required) params | |
| if arg_val is None: | |
| if arg_name in required: | |
| return False, f"{arg_name} is required and cannot be null" | |
| continue | |
| if arg_name not in props: | |
| continue | |
| expected_type = props[arg_name].get("type", "string") | |
| if expected_type == "integer" and (not isinstance(arg_val, int) or isinstance(arg_val, bool)): | |
| return False, f"{arg_name} should be int, got {type(arg_val).__name__}" | |
| elif expected_type == "number" and (not isinstance(arg_val, (int, float)) or isinstance(arg_val, bool)): | |
| return False, f"{arg_name} should be number, got {type(arg_val).__name__}" | |
| elif expected_type == "boolean" and not isinstance(arg_val, bool): | |
| return False, f"{arg_name} should be bool, got {type(arg_val).__name__}" | |
| elif expected_type == "object" and not isinstance(arg_val, dict): | |
| return False, f"{arg_name} should be object, got {type(arg_val).__name__}" | |
| elif expected_type == "array" and not isinstance(arg_val, list): | |
| return False, f"{arg_name} should be array, got {type(arg_val).__name__}" | |
| # Check enum constraints | |
| for arg_name, arg_val in tool_args.items(): | |
| if arg_name in props and "enum" in props[arg_name]: | |
| if arg_val not in props[arg_name]["enum"]: | |
| return False, f"{arg_name} value '{arg_val}' not in enum" | |
| return True, "valid" | |
| def call_ollama_best_of_n(prompt: str, available_tools: list = None, | |
| n: int = None) -> tuple: | |
| return call_ollama(prompt) | |
| def _repair_and_extract(text: str) -> list: | |
| """R5-3: Attempt to repair malformed JSON and extract tool calls. | |
| Handles: trailing commas, missing closing braces. | |
| NOTE: Does NOT cast string types — BFCL strictly checks data types. | |
| """ | |
| import re as _re | |
| # R17-fix: Strip CoT blocks before attempting repair | |
| # R19-fix: Handle unclosed think blocks via (?:</\|synalux_think\|>|$) fallback | |
| clean_text = _re.sub(r'<\|synalux_think\|>.*?(?:</\|synalux_think\|>|$)', '', text, flags=_re.DOTALL) | |
| # Find anything that looks like a JSON tool call | |
| candidates = _re.findall(r'\{\s*"name"\s*:.*?(?:\}\s*\}|\})', clean_text, _re.DOTALL) | |
| results = [] | |
| for raw in candidates: | |
| repaired = raw | |
| # Fix trailing commas before closing brace | |
| repaired = _re.sub(r',\s*\}', '}', repaired) | |
| # Count braces and add missing ones | |
| open_braces = repaired.count('{') | |
| close_braces = repaired.count('}') | |
| if open_braces > close_braces: | |
| repaired += '}' * (open_braces - close_braces) | |
| try: | |
| parsed = json.loads(repaired) | |
| tool_name = parsed.get("name", "") | |
| tool_args = parsed.get("arguments", {}) | |
| if tool_name: | |
| results.append((tool_name, tool_args)) | |
| except json.JSONDecodeError: | |
| continue | |
| return results | |
| def evaluate_test(test: dict, verbose: bool = False) -> dict: | |
| """Evaluate a single BFCL test case.""" | |
| from config import format_system_prompt | |
| prompt = test["prompt"] | |
| expected_tool = test["expected_tool"] | |
| required_params = test.get("required_params", {}) | |
| ast_strict = test.get("ast_strict", False) | |
| test_id = test["id"] | |
| # Support list of acceptable tools for ambiguous prompts | |
| expected_tool_list = expected_tool if isinstance(expected_tool, list) else [expected_tool] | |
| # R5-7 fix: Wrap prompt with system prompt to match training distribution | |
| # Uses bfcl_eval_mode=True to disable clarification behavior (R4-5) | |
| # R6.1-fix: Use RAG system prompt for context-limited tool injection | |
| # R20-fix: Use training-compatible system prompt (matches 1518 Prism tool training examples). | |
| # format_system_prompt with empty schemas produces piped-token format (<|tool_call|>) that | |
| # mismatches v43 training format (<tool_call>), causing think-block close tag confusion and | |
| # empty-tag degeneration loops. Hardcoded training format resolves 0% \u2192 correct scoring. | |
| _TRAINING_SYS_PROMPT = ( | |
| "You are Synalux, a memory-augmented coding and clinical reasoning assistant. " | |
| "You have access to Prism Memory tools (session_save_ledger, session_load_context, " | |
| "session_search_memory, session_save_handoff, session_forget_memory, session_health_check, " | |
| "session_compact_ledger, session_export_memory, session_task_route, session_save_experience, " | |
| "session_synthesize_edges, session_backfill_links, knowledge_search, knowledge_forget, " | |
| "knowledge_upvote, knowledge_downvote, knowledge_set_retention) and 13 multimodal tool " | |
| "modules (image_gen, office, web_scraper, browser, tts, ocr, git, terminal, deps_scanner, " | |
| "hipaa, data_graph, templates, pdf_parser). " | |
| "Think step-by-step before answering. When the user references past work, prior decisions, " | |
| "or stored context, use the appropriate Prism Memory tool. " | |
| "Format tool calls inside <tool_call>...</tool_call> JSON blocks with fields 'name' and 'arguments'. " | |
| "If no tool is needed, answer directly in plain text. " | |
| "ABSTAIN for general programming questions, CS concepts, greetings, and capability questions." | |
| ) | |
| sys_prompt = _TRAINING_SYS_PROMPT | |
| # R8-fix: Format as proper ChatML so Ollama raw mode sends it correctly | |
| full_prompt = f"<|im_start|>system\n{sys_prompt}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n" | |
| # R6-1: Use Best-of-N when enabled (validates candidates against tool schemas) | |
| if BEST_OF_N > 1: | |
| # R6.1-fix: Use globally loaded tool schemas, not per-test dicts | |
| actual_tool, actual_args, raw_response, latency, all_calls = call_ollama_best_of_n( | |
| full_prompt, available_tools=_TOOL_SCHEMAS | |
| ) | |
| else: | |
| actual_tool, actual_args, raw_response, latency, all_calls = call_ollama(full_prompt) | |
| # Layer 3 validation | |
| actual_tool, actual_args = validate_tool_call(prompt, actual_tool, actual_args) | |
| # Hallucination check: did the model call a tool that doesn't exist? | |
| hallucinated = actual_tool not in VALID_TOOLS and actual_tool != "NO_TOOL" and actual_tool != "ERROR" | |
| # Score | |
| result = { | |
| "id": test_id, | |
| "prompt": prompt, | |
| "expected": expected_tool_list[0] if len(expected_tool_list) == 1 else str(expected_tool_list), | |
| "actual": actual_tool, | |
| "latency": latency, | |
| "hallucinated": hallucinated, | |
| "correct": False, | |
| "tool_correct": False, | |
| "params_correct": False, | |
| "details": "", | |
| } | |
| if actual_tool == "ERROR": | |
| result["details"] = "API error" | |
| return result | |
| # Tool name match (check against all acceptable tools) | |
| tool_matches = actual_tool in expected_tool_list | |
| if tool_matches: | |
| result["tool_correct"] = True | |
| if "NO_TOOL" in expected_tool_list: | |
| result["correct"] = True | |
| result["params_correct"] = True | |
| result["details"] = "✅ Correct abstention" | |
| else: | |
| # Check parameters | |
| if ast_strict and required_params: | |
| # R21-fix: Guard against non-dict arguments (e.g. hallucinated arrays) | |
| if not isinstance(actual_args, dict): | |
| actual_args = {} | |
| # AST-level: check exact parameter values | |
| params_ok = True | |
| mismatches = [] | |
| for key, expected_val in required_params.items(): | |
| actual_val = actual_args.get(key) | |
| if actual_val is None: | |
| params_ok = False | |
| mismatches.append(f"missing '{key}'") | |
| elif isinstance(expected_val, int): | |
| try: | |
| if int(actual_val) != expected_val: | |
| params_ok = False | |
| mismatches.append(f"'{key}': expected {expected_val}, got {actual_val}") | |
| except (ValueError, TypeError): | |
| params_ok = False | |
| mismatches.append(f"'{key}': expected int {expected_val}, got '{actual_val}'") | |
| elif isinstance(expected_val, str): | |
| if str(actual_val).lower().strip() != expected_val.lower().strip(): | |
| # Fuzzy match for similar strings | |
| if expected_val.lower() not in str(actual_val).lower(): | |
| params_ok = False | |
| mismatches.append(f"'{key}': expected '{expected_val}', got '{actual_val}'") | |
| result["params_correct"] = params_ok | |
| result["correct"] = params_ok | |
| result["details"] = "✅ AST match" if params_ok else f"⚠️ Param mismatch: {', '.join(mismatches)}" | |
| else: | |
| # Non-strict: just check required param keys exist | |
| missing = [k for k in required_params if k not in actual_args] | |
| result["params_correct"] = len(missing) == 0 | |
| result["correct"] = True # Tool is correct even if params partially missing | |
| if missing: | |
| result["details"] = f"✅ Tool correct, missing params: {missing}" | |
| else: | |
| result["details"] = "✅ Full match" | |
| else: | |
| # Wrong tool | |
| expected_str = expected_tool_list[0] if len(expected_tool_list) == 1 else str(expected_tool_list) | |
| if "NO_TOOL" in expected_tool_list: | |
| result["details"] = f"❌ False positive: called {actual_tool} instead of abstaining" | |
| elif actual_tool == "NO_TOOL": | |
| result["details"] = f"❌ False negative: abstained instead of calling {expected_str}" | |
| else: | |
| result["details"] = f"❌ Wrong tool: expected {expected_str}, got {actual_tool}" | |
| if verbose: | |
| status = "✅" if result["correct"] else "❌" | |
| print(f" {status} [{test_id}] {result['details']}") | |
| if not result["correct"]: | |
| print(f" Prompt: {prompt[:80]}...") | |
| print(f" Raw: {raw_response[:120]}...") | |
| # R11-fix: Multi-turn followup evaluation (was deferred, now implemented) | |
| if result["correct"] and isinstance(test, dict) and "followup" in test: | |
| followup = test["followup"] | |
| # Build conversation history: original prompt + first response + tool response + new assistant turn | |
| # R12-fix: Use native ChatML without <|tool_response|> tags to match training distribution | |
| history = ( | |
| f"{full_prompt}{raw_response}<|im_end|>\n" | |
| f"<|im_start|>tool\n{followup['tool_response']}<|im_end|>\n" | |
| f"<|im_start|>assistant\n" | |
| ) | |
| if BEST_OF_N > 1: | |
| next_tool, next_args, next_raw, next_latency, _ = call_ollama_best_of_n( | |
| history, available_tools=_TOOL_SCHEMAS | |
| ) | |
| else: | |
| next_tool, next_args, next_raw, next_latency, _ = call_ollama(history) | |
| # Layer 3 on followup: validate + catch repeated tool calls | |
| next_tool, next_args = validate_tool_call(prompt, next_tool, next_args, is_followup=True) | |
| if next_tool == actual_tool and next_tool != "NO_TOOL": | |
| next_tool = "NO_TOOL" | |
| next_args = {} | |
| result["actual"] += f" -> {next_tool}" | |
| result["latency"] += next_latency | |
| expected_followup = followup.get("expected_tool", "NO_TOOL") | |
| result["correct"] = (next_tool == expected_followup) | |
| if not result["correct"]: | |
| result["details"] += f" | ❌ Followup: expected {expected_followup}, got {next_tool}" | |
| else: | |
| result["details"] += f" | ✅ Followup: {next_tool}" | |
| if verbose: | |
| status2 = "✅" if result["correct"] else "❌" | |
| print(f" {status2} Followup turn: expected={expected_followup}, got={next_tool}") | |
| return result | |
| def run_evaluation(shuffle: bool = False, verbose: bool = False, quiet: bool = False) -> dict: | |
| """Run full BFCL-style evaluation across all categories.""" | |
| # Build flat test list with category tags | |
| all_tests = [] | |
| for cat_name, tests in ALL_CATEGORIES.items(): | |
| for test in tests: | |
| test_copy = test.copy() | |
| test_copy["category"] = cat_name | |
| all_tests.append(test_copy) | |
| if shuffle: | |
| random.shuffle(all_tests) | |
| print(f"\n{'='*70}") | |
| print(f" BFCL-Style Evaluation — {MODEL}") | |
| print(f" {len(all_tests)} tests across {len(ALL_CATEGORIES)} categories") | |
| print(f" Shuffle: {'ON' if shuffle else 'OFF'}") | |
| print(f"{'='*70}\n") | |
| # Run all tests | |
| results = [] | |
| category_results = {cat: [] for cat in ALL_CATEGORIES} | |
| start_time = time.time() | |
| for i, test in enumerate(all_tests, 1): | |
| cat = test["category"] | |
| if verbose: | |
| print(f"[{i}/{len(all_tests)}] Category: {cat}") | |
| result = evaluate_test(test, verbose=verbose) | |
| result["category"] = cat | |
| results.append(result) | |
| category_results[cat].append(result) | |
| if not verbose and not quiet: | |
| status = "✅" if result["correct"] else "❌" | |
| print(f" {status} [{result['id']}] {result['expected']:>25s} → {result['actual']:<25s} {result['latency']:.1f}s", end="") | |
| if result["hallucinated"]: | |
| print(" 🚨 HALLUCINATED", end="") | |
| print() | |
| elif quiet and not result["correct"]: | |
| print(f" ❌ [{result['id']}] expected {result['expected']}, got {result['actual']}") | |
| elapsed = time.time() - start_time | |
| # Category scores (BFCL methodology: accuracy per category) | |
| category_scores = {} | |
| print(f"\n{'='*70}") | |
| print(f" CATEGORY BREAKDOWN") | |
| print(f"{'='*70}") | |
| for cat_name in ALL_CATEGORIES: | |
| cat_res = category_results[cat_name] | |
| if not cat_res: | |
| continue | |
| correct = sum(1 for r in cat_res if r["correct"]) | |
| total = len(cat_res) | |
| accuracy = correct / total * 100 | |
| category_scores[cat_name] = accuracy | |
| tool_correct = sum(1 for r in cat_res if r["tool_correct"]) | |
| params_correct = sum(1 for r in cat_res if r["params_correct"]) | |
| hallucinated = sum(1 for r in cat_res if r["hallucinated"]) | |
| print(f" {cat_name:25s} {correct}/{total} = {accuracy:6.1f}% " | |
| f"(tool:{tool_correct}/{total} params:{params_correct}/{total} " | |
| f"halluc:{hallucinated})") | |
| # Overall score (BFCL: unweighted average across categories) | |
| overall = sum(category_scores.values()) / len(category_scores) if category_scores else 0 | |
| total_correct = sum(1 for r in results if r["correct"]) | |
| total_halluc = sum(1 for r in results if r["hallucinated"]) | |
| avg_latency = sum(r["latency"] for r in results) / len(results) if results else 0 | |
| print(f"\n{'='*70}") | |
| print(f" OVERALL RESULTS") | |
| print(f"{'='*70}") | |
| print(f" Overall Accuracy (BFCL avg): {overall:.1f}%") | |
| print(f" Raw Accuracy: {total_correct}/{len(results)} = {total_correct/len(results)*100:.1f}%") | |
| print(f" Hallucinations: {total_halluc}") | |
| print(f" Avg Latency: {avg_latency:.1f}s") | |
| print(f" Total Time: {elapsed:.0f}s") | |
| print(f"{'='*70}\n") | |
| return { | |
| "overall_accuracy": overall, | |
| "raw_accuracy": total_correct / len(results) * 100 if results else 0, | |
| "total_correct": total_correct, | |
| "total_tests": len(results), | |
| "category_scores": category_scores, | |
| "hallucinations": total_halluc, | |
| "avg_latency": avg_latency, | |
| "elapsed": elapsed, | |
| "results": results, | |
| } | |
| def main(): | |
| import argparse | |
| parser = argparse.ArgumentParser(description="BFCL-Style evaluation for Prism models") | |
| parser.add_argument("--model", type=str, default=None, help="Ollama model name (default: prism-coder:7b)") | |
| parser.add_argument("--runs", type=int, default=1, help="Number of evaluation runs") | |
| parser.add_argument("--shuffle", action="store_true", help="Randomize test order each run") | |
| parser.add_argument("--verbose", action="store_true", help="Show detailed model outputs") | |
| parser.add_argument("--quiet", action="store_true", help="Only print category breakdown and overall results (suppress per-test lines)") | |
| parser.add_argument("--cleanup", action="store_true", help="Release MLX model from memory after eval completes") | |
| args = parser.parse_args() | |
| # Allow --model to override the global MODEL | |
| global MODEL | |
| if args.model: | |
| MODEL = args.model | |
| print(f"Using model: {MODEL}") | |
| all_run_results = [] | |
| for run_idx in range(args.runs): | |
| if args.runs > 1: | |
| print(f"\n{'#'*70}") | |
| print(f" RUN {run_idx + 1} / {args.runs}") | |
| print(f"{'#'*70}") | |
| result = run_evaluation(shuffle=args.shuffle, verbose=args.verbose, quiet=args.quiet) | |
| all_run_results.append(result) | |
| if args.runs > 1: | |
| # Multi-run summary | |
| overall_scores = [r["overall_accuracy"] for r in all_run_results] | |
| raw_scores = [r["raw_accuracy"] for r in all_run_results] | |
| raw_correct = [r["total_correct"] for r in all_run_results] | |
| total_tests = all_run_results[0]["total_tests"] | |
| total_halluc = [r["hallucinations"] for r in all_run_results] | |
| print(f"\n{'='*70}") | |
| print(f" MULTI-RUN SUMMARY ({args.runs} runs × {total_tests} tests)") | |
| print(f"{'='*70}") | |
| print(f" BFCL Overall Accuracy:") | |
| for i, s in enumerate(overall_scores): | |
| print(f" Run {i+1}: {s:.1f}%") | |
| print(f" Average: {statistics.mean(overall_scores):.1f}%") | |
| print(f" Median: {statistics.median(overall_scores):.1f}%") | |
| if len(overall_scores) > 1: | |
| print(f" StdDev: {statistics.stdev(overall_scores):.2f}%") | |
| print(f"\n Raw Scores: {' | '.join(f'{c}/{total_tests}' for c in raw_correct)}") | |
| print(f" Hallucinations: {' | '.join(str(h) for h in total_halluc)}") | |
| # Per-category consistency | |
| print(f"\n Per-Category Consistency:") | |
| categories = all_run_results[0]["category_scores"].keys() | |
| for cat in categories: | |
| scores = [r["category_scores"].get(cat, 0) for r in all_run_results] | |
| avg = statistics.mean(scores) | |
| consistent = all(s == scores[0] for s in scores) | |
| marker = "✅" if consistent and avg == 100 else "⚠️" if not consistent else "✅" | |
| print(f" {marker} {cat:25s} {' | '.join(f'{s:.0f}%' for s in scores)} → avg {avg:.1f}%") | |
| print(f"\n{'='*70}\n") | |
| # Exit code | |
| median_overall = statistics.median(overall_scores) | |
| if median_overall < 90: | |
| print("❌ FAIL: Median BFCL accuracy below 90%") | |
| sys.exit(1) | |
| elif median_overall < 95: | |
| print("⚠️ WARN: Median BFCL accuracy below 95%") | |
| sys.exit(0) | |
| else: | |
| print(f"✅ PASS: Median BFCL accuracy {median_overall:.1f}%") | |
| sys.exit(0) | |
| else: | |
| overall = all_run_results[0]["overall_accuracy"] | |
| if args.cleanup: | |
| cleanup_mlx_model() | |
| if overall < 90: | |
| sys.exit(1) | |
| sys.exit(0) | |
| if args.cleanup: | |
| cleanup_mlx_model() | |
| def cleanup_mlx_model(): | |
| global MLX_MODEL_CACHE, MLX_TOKENIZER_CACHE | |
| if MLX_MODEL_CACHE is not None: | |
| import gc | |
| try: | |
| import mlx.core as mx | |
| mx.clear_cache() | |
| except Exception: | |
| pass | |
| MLX_MODEL_CACHE = None | |
| MLX_TOKENIZER_CACHE = None | |
| gc.collect() | |
| print("🧹 MLX model released from memory") | |
| if __name__ == "__main__": | |
| main() | |