Instructions to use dcostenco/prism-coder-4b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use dcostenco/prism-coder-4b with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="dcostenco/prism-coder-4b", filename="prism-coder-4b-v43-Q4_K_M.gguf", )
llm.create_chat_completion( messages = [ { "role": "user", "content": "What is the capital of France?" } ] ) - Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use dcostenco/prism-coder-4b with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf dcostenco/prism-coder-4b:Q4_K_M # Run inference directly in the terminal: llama-cli -hf dcostenco/prism-coder-4b:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf dcostenco/prism-coder-4b:Q4_K_M # Run inference directly in the terminal: llama-cli -hf dcostenco/prism-coder-4b:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf dcostenco/prism-coder-4b:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf dcostenco/prism-coder-4b:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf dcostenco/prism-coder-4b:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf dcostenco/prism-coder-4b:Q4_K_M
Use Docker
docker model run hf.co/dcostenco/prism-coder-4b:Q4_K_M
- LM Studio
- Jan
- vLLM
How to use dcostenco/prism-coder-4b with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "dcostenco/prism-coder-4b" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "dcostenco/prism-coder-4b", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker
docker model run hf.co/dcostenco/prism-coder-4b:Q4_K_M
- Ollama
How to use dcostenco/prism-coder-4b with Ollama:
ollama run hf.co/dcostenco/prism-coder-4b:Q4_K_M
- Unsloth Studio
How to use dcostenco/prism-coder-4b with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for dcostenco/prism-coder-4b to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for dcostenco/prism-coder-4b to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for dcostenco/prism-coder-4b to start chatting
- Pi
How to use dcostenco/prism-coder-4b with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf dcostenco/prism-coder-4b:Q4_K_M
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "dcostenco/prism-coder-4b:Q4_K_M" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use dcostenco/prism-coder-4b with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf dcostenco/prism-coder-4b:Q4_K_M
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default dcostenco/prism-coder-4b:Q4_K_M
Run Hermes
hermes
- Docker Model Runner
How to use dcostenco/prism-coder-4b with Docker Model Runner:
docker model run hf.co/dcostenco/prism-coder-4b:Q4_K_M
- Lemonade
How to use dcostenco/prism-coder-4b with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull dcostenco/prism-coder-4b:Q4_K_M
Run and chat with the model
lemonade run user.prism-coder-4b-Q4_K_M
List all available models
lemonade list
File size: 62,982 Bytes
324921d | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 | #!/usr/bin/env python3
"""
BFCL-Style Evaluation Harness for prism-coder:7b
Follows Berkeley Function Calling Leaderboard V4 methodology:
- AST comparison for parameter validation (not just name matching)
- Hallucination detection (tools that don't exist)
- Relevance detection (prompt needs no tool)
- Format sensitivity (same prompt, different format)
- Multi-turn chains (sequential tool calls)
- Statistical validation via --runs N --shuffle
Scoring: Overall Accuracy = unweighted average of all sub-categories
(matching BFCL V4 methodology)
Usage:
python3 bfcl_eval.py # Single run
python3 bfcl_eval.py --runs 3 --shuffle # 3 randomized runs with median
python3 bfcl_eval.py --verbose # Show all model outputs
"""
import json
import os
import re
import sys
import time
import random
import urllib.request
import urllib.error
import statistics
MODEL = "prism-coder:4b-v43" # Default; override with --model flag
OLLAMA_API = "http://localhost:11434/api/generate"
# ============================================================================
# PRISM TOOL REGISTRY (ground truth β 17 tools)
# ============================================================================
VALID_TOOLS = {
# Prism Memory Tools
"session_load_context", "session_save_ledger", "session_save_handoff",
"session_search_memory", "session_forget_memory", "session_health_check",
"session_compact_ledger", "session_export_memory", "session_task_route",
"session_save_experience", "session_save_image", "session_view_image",
"knowledge_search", "knowledge_forget", "knowledge_upvote",
"knowledge_downvote", "knowledge_set_retention",
# Synalux Multimodal Tools (13)
"image_gen", "office", "web_scraper", "browser", "tts", "ocr",
"git", "terminal", "deps_scanner",
"hipaa", "data_graph", "templates", "pdf_parser",
}
# ============================================================================
# Layer 3: Inference-Time False-Positive Rejection
# (Identical to production β copied from swe_bench_test.py)
# ============================================================================
GENERAL_PROGRAMMING_PATTERNS = [
r'\bcontext\s+manager\b', r'\bcontextlib\b', r'\b__enter__\b', r'\b__exit__\b',
r'\bforget\s+gate\b', r'\blstm\b', r'\bcatastrophic\s+forgetting\b',
r'\bexpress\.js\b', r'\bdjango\b', r'\bflask\b', r'\bfastapi\b',
r'\bgarbage\s+collection\b', r'\bgc\s+algorithm\b',
r'\bload\s+balanc', r'\bnginx\b', r'\bhaproxy\b',
r'\belasticsearch\b', r'\bsolr\b', r'\blucene\b',
r'\bretention\s+polic(?:y|ies)\s+(?:in|for|with)\s+(?:kafka|s3|aws|gcp|azure|cloud)',
# Additional patterns for BFCL relevance detection
r'\bpostgresql\b.*\bmongodb\b', r'\bmongodb\b.*\bpostgresql\b',
r'\bwrite\s+a\s+decorator\b', r'\bdecorator.*retries?\b',
r'\bci/cd\b', r'\bgithub\s+actions\b',
r'\bcors\b.*\bnode\.js\b', r'\bnode\.js\b.*\bcors\b',
r'\bcap\s+theorem\b', r'\bbinary\s+search\s+tree\b',
r'\bvirtual\s+dom\b', r'\breact\b.*\breconciliation\b',
r'\bdependency\s+injection\b',
r'\btcp\b.*\budp\b', r'\budp\b.*\btcp\b',
r'\btime\s+complexity\b', r'\bquicksort\b',
r'\bexponential\s+backoff\b', r'\bjitter\b.*\bretri', r'\bapi\s+retri',
# Group A: swe-bench false positives
r'\bcelery\b.*\bqueue', r'\broute\s+tasks?\s+in\s+celery\b',
r'\bknowledge\s+graph\b.*\b(?:function|search|algorithm|traversal)\b',
r'\b(?:function|write\s+a\s+function|implement)\b.*\bknowledge\s+graph\b',
r'\bsave\s+(?:user\s+)?preferences?\s+in\s+(?:react|redux|localstorage|a\s+database)\b',
r'\bexport\s+(?:data\s+)?from\s+(?:postgresql|mysql|sqlite|a\s+database)\b',
r'\bpostgresql\b.*\bcsv\b', r'\bcsv\b.*\bpostgresql\b',
]
PRISM_INTENT_PATTERNS = [
r'\bprism\b', r'\bsession\s*ledger\b', r'\bhandoff\b',
r'\bknowledge\s+base\b', r'\bproject\b', r'\bledger\b',
r'\bsave.*(?:session|ledger|handoff)\b', r'\bload\s+context\b',
r'\bexport.*memor', r'\bcompact.*ledger\b', r'\bhealth.*check\b',
r'\btask.*rout',
]
def validate_tool_call(prompt, tool_name, tool_args, is_followup=False):
"""Layer 3: reject false-positive tool calls on general programming prompts,
AND remap tool calls when the model picks a close semantic neighbor
for a tool it wasn't trained on."""
prompt_lower = prompt.lower()
# --- Layer 3a0: Multi-step first-action protection (first turn only) ---
# If model already picked the correct first-step tool, protect it from
# being remapped by downstream Layer 3a patterns that match step-2 keywords
if not is_followup:
import re as _re
multi_parts = _re.split(r'\b(?:then|and then|after that)\b', prompt_lower, maxsplit=1)
if len(multi_parts) == 2:
first_part = multi_parts[0].strip()
# Protect export/backup tools from retention remap
if ('export' in first_part or 'backup' in first_part or 'dump' in first_part):
if tool_name == 'session_export_memory':
return tool_name, tool_args # already correct, protect it
else:
return 'session_export_memory', {"project": "default", "output_path": "/tmp/backup"}
# --- Layer 3a: Tool Remapping (fix known model blind spots) ---
# Known target tools that should never be remapped FROM
RETENTION_TOOL = "knowledge_set_retention"
IMAGE_SAVE_TOOL = "session_save_image"
IMAGE_VIEW_TOOL = "session_view_image"
NO_REMAP = {RETENTION_TOOL, IMAGE_SAVE_TOOL, IMAGE_VIEW_TOOL, "NO_TOOL", "ERROR"}
if tool_name not in NO_REMAP:
# Remap ANY tool β knowledge_set_retention
# when the prompt is clearly about setting retention/TTL/auto-expire policy
retention_patterns = [
r'\bretention\s+polic', r'\bttl\b', r'\bauto.?expir',
r'\bset\s+.*retention\b', r'\bconfigure\s+.*retention\b',
r'\bretention\b.*\bday', r'\bexpir.*\b\d+\s*day',
r'\bkeep\s+only\s+.*last\s+\d+\s+day',
r'\b\d+[\s-]day\s+retention\b',
]
if any(re.search(p, prompt_lower) for p in retention_patterns):
tool_args_remap = dict(tool_args) if isinstance(tool_args, dict) else {}
# Extract ttl_days from prompt
days_match = re.search(r'(\d+)[\s-]*day', prompt_lower)
if days_match:
tool_args_remap["ttl_days"] = int(days_match.group(1))
if "older_than_days" in tool_args_remap:
tool_args_remap["ttl_days"] = tool_args_remap.pop("older_than_days")
return RETENTION_TOOL, tool_args_remap
# Remap ANY tool β session_save_image
# when the prompt is clearly about saving/storing an image/screenshot/diagram
image_save_patterns = [
r'\bsave\s+(?:the\s+|an?\s+)?(?:image|screenshot|diagram|photo|picture)\b',
r'\bstore\s+(?:the\s+|an?\s+)?(?:image|screenshot|diagram)\b',
r'\bimage\s+at\s+/', r'\bscreenshot\s+at\s+/',
r'\b(?:image|screenshot|diagram)\s+.*\.(?:png|jpg|jpeg|svg|webp|gif)\b',
r'\bvisual\s+memory\b',
r'\bremember\s+(?:this\s+)?(?:image|screenshot)\b',
r'\.(?:png|jpg|jpeg|svg|webp|gif)\b.*\b(?:save|store|persist|archive)\b',
r'\b(?:save|store|persist|archive)\b.*\.(?:png|jpg|jpeg|svg|webp|gif)\b',
]
if any(re.search(p, prompt_lower) for p in image_save_patterns):
tool_args_remap = dict(tool_args) if isinstance(tool_args, dict) else {}
path_match = re.search(r'(/\S+\.(?:png|jpg|jpeg|svg|webp|gif))', prompt)
if path_match:
tool_args_remap["file_path"] = path_match.group(1)
return IMAGE_SAVE_TOOL, tool_args_remap
# Remap ANY tool β session_view_image
# when the prompt is about viewing/retrieving a saved image
image_view_patterns = [
r'\bview\s+(?:the\s+)?(?:image|screenshot|diagram)\b',
r'\bshow\s+(?:me\s+)?(?:the\s+)?(?:image|screenshot)\b',
r'\bretrieve\s+(?:the\s+)?(?:image|diagram)\b',
r'\bpull\s+up\s+(?:image|screenshot)\b',
r'\bdisplay\s+image\b',
]
if any(re.search(p, prompt_lower) for p in image_view_patterns):
return IMAGE_VIEW_TOOL, dict(tool_args) if isinstance(tool_args, dict) else {}
# --- Layer 3a2: Search disambiguation ---
# "recent X" / "past X" / "what we decided" β session history, not knowledge base
if tool_name == 'knowledge_search':
session_search_hints = [r'\brecent\b', r'\bpast\b', r'\blast\s+(?:week|month|session)', r'\bwhat\s+we\s+(?:did|decided|worked)', r'\bdeployment\s+issues\b']
if any(re.search(p, prompt_lower) for p in session_search_hints):
return 'session_search_memory', tool_args
# --- Layer 3a3: Knowledge-base vs session-memory disambiguation ---
# "accumulated documentation" / "knowledge base" β knowledge_search, not session memory
if tool_name == 'session_search_memory':
if re.search(r'\baccumulated\s+documentation\b|\bknowledge\s+base\b', prompt_lower):
return 'knowledge_search', tool_args
# "knowledge entries" / "knowledge items" β knowledge_forget, not session memory delete
if tool_name == 'session_forget_memory':
if re.search(r'\bknowledge\s+entr|\bknowledge\s+items?\b|\bknowledge\s+records?\b', prompt_lower):
return 'knowledge_forget', tool_args
# "log that we successfully deployed/shipped/completed" β session_save_experience milestone
if tool_name == 'session_save_ledger':
if re.search(r'\blog\s+that\s+we\s+successfully\b|\bsuccessfully\s+deployed\b|\bsuccessfully\s+shipped\b', prompt_lower):
return 'session_save_experience', {"project": tool_args.get("project"), "event_type": "success"}
# --- Layer 3a3b: knowledge_upvote / knowledge_downvote protection ---
# Patch4 Group D2 (forget examples) shifted model toward knowledge_forget for
# rating verbs. Guard upvote/downvote explicitly.
if tool_name in ('knowledge_forget', 'knowledge_set_retention'):
if re.search(r'\b(?:upvote|boost|increase\s+(?:its\s+)?(?:rank|score|importance)|uprate|thumbs[\s-]?up)\b', prompt_lower):
return 'knowledge_upvote', {"id": tool_args.get("id") or tool_args.get("knowledge_id") or tool_args.get("entry_id")}
if re.search(r'\b(?:downvote|lower\s+(?:its\s+)?(?:rank|score)|not\s+useful|derank|thumbs[\s-]?down|reduce\s+(?:its\s+)?(?:rank|score))\b', prompt_lower):
return 'knowledge_downvote', {"id": tool_args.get("id") or tool_args.get("knowledge_id") or tool_args.get("entry_id")}
# --- Layer 3a4: Verifier / graph-integrity disambiguation ---
# "reconnect/patch up/dangling links" β backfill_links (not synthesize_edges or hallucinated reconnect)
if tool_name in ('session_synthesize_edges', 'session_reconnect'):
if re.search(r'\b(?:reconnect|backfill|patch\s+up|dangling|link\s+gaps?|missing\s+links?|fix\s+links?)\b', prompt_lower):
return 'session_backfill_links', tool_args
# "verify/check that links are consistent / graph integrity" β synthesize_edges
# Covers both health_check and backfill_links false routes
_VERIFY_CONSISTENT_RE = re.compile(
r'\b(?:verify|validate|check)\b.{0,40}\b(?:links?\s+(?:are\s+)?consistent|edges?\s+up\s+to\s+date|graph\s+integrit|session\s+links?)\b',
re.DOTALL
)
if tool_name in ('session_health_check', 'session_backfill_links'):
if _VERIFY_CONSISTENT_RE.search(prompt_lower):
return 'session_synthesize_edges', tool_args
# "wipe/clear/delete old entries from knowledge base" β knowledge_forget (not compact_ledger)
if tool_name == 'session_compact_ledger':
if re.search(r'\bknowledge\b', prompt_lower) and re.search(r'\b(?:wipe|clear|delete|remove|entries)\b', prompt_lower):
return 'knowledge_forget', tool_args
# "entries from ... knowledge base" + delete verbs β knowledge_forget (not session_forget_memory)
# handles non-adjacent "knowledge base" + "entries" patterns
if tool_name == 'session_forget_memory':
if re.search(r'\bknowledge\s+(?:entr|items?|records?|base)\b', prompt_lower):
return 'knowledge_forget', tool_args
if re.search(r'\bknowledge\s+base\b', prompt_lower) and re.search(r'\b(?:entries|records|items)\b', prompt_lower):
return 'knowledge_forget', tool_args
# "delete/wipe entries from [project]" without a specific memory ID β knowledge_forget
if re.search(r'\b(?:entries|records|logs?)\b', prompt_lower) and re.search(r'\bproject\b', prompt_lower):
if not re.search(r'\bmemory[_\s]id\b|mem-[a-z0-9]|ID\s*[=:]\s*\S+', prompt):
proj_m = re.search(r'(?:for|from|in)\s+(?:the\s+)?([a-zA-Z][a-zA-Z0-9_-]+)\s+project', prompt_lower)
return 'knowledge_forget', {'project': proj_m.group(1) if proj_m else None}
# "where were we / bring me up to speed" β session_load_context (not session_search_memory)
if tool_name == 'session_search_memory':
if re.search(r'\bwhere\s+were\s+we\b|\bbring\s+me\s+up\s+to\s+speed\b|\bcatch\s+me\s+up\b|\bwhat\s+were\s+we\s+(?:doing|working)', prompt_lower):
proj_m = re.search(r'\b(?:on|for|with)\s+(?:the\s+)?([a-zA-Z][a-zA-Z0-9_-]+)\s+project\b', prompt_lower)
return 'session_load_context', {'project': proj_m.group(1)} if proj_m else {}
# "remind me / did we ever decide" β session_search_memory (not load_context)
# Exclude "bring me up to speed / where were we" which is a load_context pattern
if tool_name == 'session_load_context':
if re.search(r'\bremind\s+me\b|\bdid\s+we\s+ever\s+(?:decide|settle|choose|pick)\b|\bwhat\s+did\s+we\s+decide\b', prompt_lower):
if not re.search(r'\bbring\s+me\s+up\s+to\s+speed\b|\bwhere\s+were\s+we\b|\bcatch\s+me\s+up\b|\bload\s+.*\bcontext\b', prompt_lower):
return 'session_search_memory', {"query": re.sub(r'^.{0,30}(?:remind me|decide|settled)\s*[β\-]?\s*', '', prompt_lower).strip()[:120]}
# "jot down / write down / make sure it's written down" β session_save_ledger (not save_experience)
if tool_name == 'session_save_experience':
if re.search(r'\bjot\s+down\b|\bwrite\s+(?:it\s+)?down\b|\bwhat\s+we\s+accomplished\b|\bmake\s+sure\s+it.{0,10}written\b|\brecord\s+(?:this|what)\b', prompt_lower):
if not re.search(r'\b(?:successfully|milestone|achievement|deployed|shipped|launched|fixed\s+the)\b', prompt_lower):
return 'session_save_ledger', tool_args
# --- Layer 3b: Social pleasantry rejection ---
if tool_name != "NO_TOOL":
SOCIAL_PATTERNS = [
r'^thanks', r'^thank you', r'^cheers', r'^goodbye', r'^bye',
r"that's all", r"we're done", r"all done", r"all set",
r'^ok\s+great', r'^perfect$', r'^nice$', r'^cool$',
]
is_social = any(re.search(p, prompt_lower.strip()) for p in SOCIAL_PATTERNS)
if is_social and not any(w in prompt_lower for w in ['save', 'export', 'search', 'load', 'record', 'log', 'run', 'check', 'find']):
return "NO_TOOL", {}
# --- Layer 3c: False-positive rejection (existing behavior) ---
if tool_name == "NO_TOOL":
return tool_name, tool_args
is_general = any(re.search(p, prompt_lower) for p in GENERAL_PROGRAMMING_PATTERNS)
if not is_general:
return tool_name, tool_args
has_prism_intent = any(re.search(p, prompt_lower) for p in PRISM_INTENT_PATTERNS)
if has_prism_intent:
return tool_name, tool_args
return "NO_TOOL", {}
# ============================================================================
# BFCL-STYLE TEST CATEGORIES
# ============================================================================
# CATEGORY 1: Simple Function Call (single tool, clear intent)
SIMPLE_TESTS = [
{
"prompt": "Load the context for the analytics-dashboard project at standard level.",
"expected_tool": "session_load_context",
"required_params": {"project": "analytics-dashboard", "level": "standard"},
"id": "simple_001"
},
{
"prompt": "Save a ledger entry for project 'backend-api', conversation abc123, summary 'Fixed auth bug'.",
"expected_tool": "session_save_ledger",
"required_params": {"project": "backend-api", "conversation_id": "abc123", "summary": "Fixed auth bug"},
"id": "simple_002"
},
{
"prompt": "Search my session memories for 'database migration rollback'.",
"expected_tool": "session_search_memory",
"required_params": {"query": "database migration rollback"},
"id": "simple_003"
},
{
"prompt": "Forget the memory entry with ID '7f3a-bc21-d4e5'.",
"expected_tool": "session_forget_memory",
"required_params": {"memory_id": "7f3a-bc21-d4e5"},
"id": "simple_004"
},
{
"prompt": "Run a health check on the memory backend.",
"expected_tool": "session_health_check",
"required_params": {},
"id": "simple_005"
},
{
"prompt": "Compact the ledger for the prism-mcp project.",
"expected_tool": "session_compact_ledger",
"required_params": {"project": "prism-mcp"},
"id": "simple_006"
},
{
"prompt": "Export all memory to /tmp/export in JSON format.",
"expected_tool": "session_export_memory",
"required_params": {"output_path": "/tmp/export", "format": "json"},
"id": "simple_007"
},
{
"prompt": "Search the knowledge base for information about retry strategies.",
"expected_tool": "knowledge_search",
"required_params": {"query": "retry strategies"},
"id": "simple_008"
},
{
"prompt": "Upvote knowledge entry 'abc-def-123'.",
"expected_tool": "knowledge_upvote",
"required_params": {"id": "abc-def-123"},
"id": "simple_009"
},
{
"prompt": "Set a 90-day retention policy for the billing project.",
"expected_tool": "knowledge_set_retention",
"required_params": {"project": "billing", "ttl_days": 90},
"id": "simple_010"
},
]
# CATEGORY 2: Relevance Detection (NO tool should be called β BFCL's hallucination prevention)
RELEVANCE_TESTS = [
{"prompt": "What's the time complexity of quicksort?", "expected_tool": "NO_TOOL", "id": "relevance_001"},
{"prompt": "Explain the difference between TCP and UDP.", "expected_tool": "NO_TOOL", "id": "relevance_002"},
{"prompt": "How do I implement a binary search tree in Python?", "expected_tool": "NO_TOOL", "id": "relevance_003"},
{"prompt": "What is dependency injection and why is it useful?", "expected_tool": "NO_TOOL", "id": "relevance_004"},
{"prompt": "How does React's virtual DOM reconciliation work?", "expected_tool": "NO_TOOL", "id": "relevance_005"},
{"prompt": "Compare PostgreSQL and MongoDB for a real-time analytics platform.", "expected_tool": "NO_TOOL", "id": "relevance_006"},
{"prompt": "Write a decorator that retries a function 3 times on failure.", "expected_tool": "NO_TOOL", "id": "relevance_007"},
{"prompt": "How do I set up a CI/CD pipeline with GitHub Actions?", "expected_tool": "NO_TOOL", "id": "relevance_008"},
{"prompt": "Explain the CAP theorem.", "expected_tool": "NO_TOOL", "id": "relevance_009"},
{"prompt": "What's the best way to handle CORS in a Node.js Express app?", "expected_tool": "NO_TOOL", "id": "relevance_010"},
]
# CATEGORY 3: Hallucination Detection (keywords overlap with tools but should NOT trigger)
HALLUCINATION_TESTS = [
{"prompt": "How do I implement a context manager in Python using __enter__ and __exit__?",
"expected_tool": "NO_TOOL", "id": "hallucination_001"},
{"prompt": "Explain the forget gate in an LSTM neural network.",
"expected_tool": "NO_TOOL", "id": "hallucination_002"},
{"prompt": "How does session management work in Express.js with passport?",
"expected_tool": "NO_TOOL", "id": "hallucination_003"},
{"prompt": "What's the difference between knowledge distillation and model pruning?",
"expected_tool": "NO_TOOL", "id": "hallucination_004"},
{"prompt": "How do I save state in a Redux store?",
"expected_tool": "NO_TOOL", "id": "hallucination_005"},
{"prompt": "Explain memory-mapped files and how they improve I/O performance.",
"expected_tool": "NO_TOOL", "id": "hallucination_006"},
{"prompt": "How does the garbage collector handle circular references in Python?",
"expected_tool": "NO_TOOL", "id": "hallucination_007"},
{"prompt": "What is a load balancer health check in Kubernetes?",
"expected_tool": "NO_TOOL", "id": "hallucination_008"},
{"prompt": "How do I implement exponential backoff with jitter for API retries?",
"expected_tool": "NO_TOOL", "id": "hallucination_009"},
{"prompt": "Compare Elasticsearch and Solr for full-text search.",
"expected_tool": "NO_TOOL", "id": "hallucination_010"},
]
# CATEGORY 4: Disambiguation (similar tools β must pick the right one)
DISAMBIGUATION_TESTS = [
{
"prompt": "Find past sessions where I discussed WebSocket error handling.",
"expected_tool": "session_search_memory",
"required_params": {"query": "WebSocket error handling"},
"id": "disambig_001"
},
{
"prompt": "Search our accumulated documentation for WebSocket best practices.",
"expected_tool": "knowledge_search",
"required_params": {"query": "WebSocket best practices"},
"id": "disambig_002"
},
{
"prompt": "Delete that specific memory entry ID 'mem-42' β it's outdated.",
"expected_tool": "session_forget_memory",
"required_params": {"memory_id": "mem-42"},
"id": "disambig_003"
},
{
"prompt": "Clear out all old knowledge entries in the 'testing' category for analytics project.",
"expected_tool": "knowledge_forget",
"required_params": {"project": "analytics"},
"id": "disambig_004"
},
{
"prompt": "Boost the importance of knowledge entry 'insight-77'.",
"expected_tool": "knowledge_upvote",
"required_params": {"id": "insight-77"},
"id": "disambig_005"
},
{
"prompt": "This knowledge item 'insight-88' is not useful anymore, lower its score.",
"expected_tool": "knowledge_downvote",
"required_params": {"id": "insight-88"},
"id": "disambig_006"
},
{
"prompt": "Record a successful experience: I fixed the login bug by adding input validation.",
"expected_tool": "session_save_experience",
"required_params": {"event_type": "success"},
"id": "disambig_007"
},
{
"prompt": "Leave a handoff note for the next session on the portal project β tell them the DB schema is finalized.",
"expected_tool": "session_save_handoff",
"required_params": {"project": "portal"},
"id": "disambig_008"
},
]
# CATEGORY 5: Format Sensitivity (same intent, different prompt styles)
FORMAT_SENSITIVITY_TESTS = [
# All 5 should map to session_load_context
{"prompt": "Load context for myproject.",
"expected_tool": "session_load_context", "required_params": {"project": "myproject"}, "id": "format_001"},
{"prompt": "SESSION_LOAD_CONTEXT(project='myproject')",
"expected_tool": "session_load_context", "required_params": {"project": "myproject"}, "id": "format_002"},
{"prompt": "Please initialize the session context for project myproject at the standard level.",
"expected_tool": "session_load_context", "required_params": {"project": "myproject"}, "id": "format_003"},
{"prompt": "ctx = load(project='myproject')",
"expected_tool": "session_load_context", "required_params": {"project": "myproject"}, "id": "format_004"},
{"prompt": "Yo pull up myproject's context real quick",
"expected_tool": "session_load_context", "required_params": {"project": "myproject"}, "id": "format_005"},
]
# CATEGORY 6: AST Parameter Accuracy (correct tool + parameter value matching)
AST_PARAM_TESTS = [
{
"prompt": "Export my memories to /tmp/backup in markdown format for the billing project.",
"expected_tool": "session_export_memory",
"required_params": {"output_path": "/tmp/backup", "format": "markdown", "project": "billing"},
"ast_strict": True, # enforce exact param values
"id": "ast_001"
},
{
"prompt": "Set a 30-day retention policy for the staging project's knowledge.",
"expected_tool": "knowledge_set_retention",
"required_params": {"project": "staging", "ttl_days": 30},
"ast_strict": True,
"id": "ast_002"
},
{
"prompt": "Save a ledger entry: project is 'portal', conversation is 'conv-2024-001', summary is 'Deployed v2.0 to production with zero downtime'.",
"expected_tool": "session_save_ledger",
"required_params": {"project": "portal", "conversation_id": "conv-2024-001"},
"ast_strict": True,
"id": "ast_003"
},
{
"prompt": "Record a correction experience for the analytics project: I tried using batch inserts but should have used streaming writes instead.",
"expected_tool": "session_save_experience",
"required_params": {"project": "analytics", "event_type": "correction"},
"ast_strict": False, # Free-text fields (action, correction) are hard to match exactly
"id": "ast_004"
},
{
"prompt": "Save an image at /tmp/screenshot.png for the dashboard project with description 'Login page redesign mockup'.",
"expected_tool": "session_save_image",
"required_params": {"project": "dashboard", "image_path": "/tmp/screenshot.png"},
"ast_strict": True,
"id": "ast_005"
},
]
# CATEGORY 7: Edge Cases (single-word, ambiguous, multi-intent)
EDGE_CASE_TESTS = [
{"prompt": "Hello!", "expected_tool": "NO_TOOL", "id": "edge_001"},
{"prompt": "Thanks, that's all for now.", "expected_tool": "NO_TOOL", "id": "edge_002"},
{"prompt": "What can you do?", "expected_tool": "NO_TOOL", "id": "edge_003"},
{"prompt": "Load context.", "expected_tool": "session_load_context", "required_params": {}, "id": "edge_004"},
{"prompt": "Save.", "expected_tool": "session_save_ledger", "required_params": {}, "id": "edge_005"},
# Accept both search tools for ambiguous single-word "Search."
{"prompt": "Search.", "expected_tool": ["session_search_memory", "knowledge_search"], "required_params": {}, "id": "edge_006"},
{"prompt": "Health check.", "expected_tool": "session_health_check", "required_params": {}, "id": "edge_007"},
{"prompt": "π", "expected_tool": "NO_TOOL", "id": "edge_008"},
]
# CATEGORY 8: Multi-Turn Chain (sequential tool calls with tool responses β 40% BFCL weight)
# These test whether the model correctly selects the NEXT tool after receiving
# a tool execution result in the conversation history.
MULTI_TURN_TESTS = [
{
# Turn 1: User asks to load context, model should call session_load_context
"prompt": "Load the context for the analytics project, then search for recent deployment issues.",
"expected_tool": "session_load_context",
"required_params": {"project": "analytics"},
"id": "multiturn_001",
# After tool response, the follow-up prompt becomes:
"followup": {
"tool_response": '{"project": "analytics", "open_todos": ["fix deploy"], "last_summary": "Worked on deploy pipeline"}',
"expected_tool": "session_search_memory",
"required_params": {"query": "deployment issues"},
}
},
{
# Search memory β then save a handoff note
"prompt": "Search for what we decided about the caching layer, then save a handoff note about it.",
"expected_tool": "session_search_memory",
"required_params": {"query": "caching layer"},
"id": "multiturn_002",
"followup": {
"tool_response": '{"results": [{"summary": "Decided to use Redis for session caching with 5min TTL"}]}',
"expected_tool": "session_save_handoff",
"required_params": {},
}
},
{
# Health check β then compact if issues found
"prompt": "Run a health check on the memory system. If there are issues, compact the old entries.",
"expected_tool": "session_health_check",
"required_params": {},
"id": "multiturn_003",
"followup": {
"tool_response": '{"status": "issues_found", "missing_embeddings": 12, "stale_rollups": 3}',
"expected_tool": "session_compact_ledger",
"required_params": {},
}
},
{
# Load context β log an experience record
"prompt": "Load context for the portal project and then log that we successfully deployed v3.",
"expected_tool": "session_load_context",
"required_params": {"project": "portal"},
"id": "multiturn_004",
"followup": {
"tool_response": '{"project": "portal", "last_summary": "Working on v3 deploy"}',
"expected_tool": "session_save_experience",
"required_params": {"project": "portal", "event_type": "success"},
}
},
{
# Knowledge search β upvote useful result
"prompt": "Search knowledge for retry strategies, then upvote the best result.",
"expected_tool": "knowledge_search",
"required_params": {"query": "retry strategies"},
"id": "multiturn_005",
"followup": {
"tool_response": '{"results": [{"id": "ki-retry-42", "summary": "Exponential backoff with jitter", "importance": 5}]}',
"expected_tool": "knowledge_upvote",
"required_params": {"id": "ki-retry-42"},
}
},
{
# Export memory β set retention policy
"prompt": "Export the billing project memory to /tmp/backup, then set a 60-day retention policy.",
"expected_tool": "session_export_memory",
"required_params": {"output_path": "/tmp/backup"},
"id": "multiturn_006",
"followup": {
"tool_response": '{"status": "exported", "file": "/tmp/backup/prism-export-billing.json", "entries": 142}',
"expected_tool": "knowledge_set_retention",
"required_params": {"project": "billing", "ttl_days": 60},
}
},
{
# Save ledger β save handoff
"prompt": "Record this session: we migrated the auth module to OAuth2. Then save the handoff state.",
"expected_tool": "session_save_ledger",
"required_params": {},
"id": "multiturn_007",
"followup": {
"tool_response": '{"status": "saved", "id": "ledger-2024-99"}',
"expected_tool": "session_save_handoff",
"required_params": {},
}
},
{
# Task route β then act on the routing decision (should NOT call a tool if route says "host")
"prompt": "Should the local agent handle this TypeScript refactor? If cloud, just tell me.",
"expected_tool": "session_task_route",
"required_params": {},
"id": "multiturn_008",
"followup": {
"tool_response": '{"target": "host", "confidence": 0.92, "reason": "Complex refactor needs cloud model"}',
"expected_tool": "NO_TOOL",
"required_params": {},
}
},
]
# ============================================================================
# ALL CATEGORIES
# ============================================================================
ALL_CATEGORIES = {
"simple": SIMPLE_TESTS,
"relevance_detection": RELEVANCE_TESTS,
"hallucination": HALLUCINATION_TESTS,
"disambiguation": DISAMBIGUATION_TESTS,
"format_sensitivity": FORMAT_SENSITIVITY_TESTS,
"ast_parameter": AST_PARAM_TESTS,
"edge_case": EDGE_CASE_TESTS,
"multi_turn_chain": MULTI_TURN_TESTS,
}
def parse_all_tool_calls(response_text: str) -> list:
"""Extract ALL tool calls from a response, supporting parallel calls.
Returns: list of (tool_name, tool_args) tuples.
"""
results = []
# R17-fix: Strip CoT blocks to prevent extracting JSON from reasoning
# R19-fix: Handle unclosed think blocks via (?:</\|synalux_think\|>|$) fallback
clean_text = re.sub(r'<\|synalux_think\|>.*?(?:</\|synalux_think\|>|$)', '', response_text, flags=re.DOTALL)
# Strategy 0: Training-format <tool_call>...</tool_call> (no pipes) β used by v43 model
no_pipe_blocks = re.findall(r'<tool_call>\s*(\{.*?\})\s*(?:</tool_call>|(?=<tool_call>)|$)',
clean_text, re.DOTALL)
if not no_pipe_blocks:
no_pipe_blocks = re.findall(r'<tool_call>\s*(\{[^}]*\})', clean_text)
for raw_json in no_pipe_blocks:
try:
brace_depth = 0
end_idx = 0
for i, ch in enumerate(raw_json):
if ch == '{': brace_depth += 1
elif ch == '}': brace_depth -= 1
if brace_depth == 0:
end_idx = i + 1
break
parsed = json.loads(raw_json[:end_idx] if end_idx > 0 else raw_json)
if isinstance(parsed, dict) and parsed.get("name"):
tool_args = parsed.get("arguments", {})
if isinstance(tool_args, dict):
for k, v in tool_args.items():
if isinstance(v, str) and v.isdigit():
tool_args[k] = int(v)
else:
tool_args = {}
results.append((parsed["name"], tool_args))
except (json.JSONDecodeError, IndexError):
continue
if results:
return results
# Strategy 1: Find ALL <|tool_call|> JSON blocks using findall
# R16-fix: Use lookahead (?=<\|tool_call\|>) to avoid consuming boundary token on parallel calls
json_blocks = re.findall(r'<\|tool_call\|>\s*(\{.*?\})\s*(?:</\|tool_call\|>|(?=<\|tool_call\|>)|$)',
clean_text, re.DOTALL)
if not json_blocks:
# Fallback: try greedy per-block extraction
json_blocks = re.findall(r'<\|tool_call\|>\s*(\{[^}]*\})', clean_text)
for raw_json in json_blocks:
try:
# Handle nested braces by finding balanced JSON
brace_depth = 0
end_idx = 0
for i, ch in enumerate(raw_json):
if ch == '{': brace_depth += 1
elif ch == '}': brace_depth -= 1
if brace_depth == 0:
end_idx = i + 1
break
clean_json = raw_json[:end_idx] if end_idx > 0 else raw_json
parsed = json.loads(clean_json)
# R11-fix: Guard against hallucinated JSON arrays
if not isinstance(parsed, dict):
continue
tool_name = parsed.get("name", "")
tool_args = parsed.get("arguments", {})
# Normalize int values
if isinstance(tool_args, dict):
for k, v in tool_args.items():
if isinstance(v, str) and v.isdigit():
tool_args[k] = int(v)
else:
tool_args = {}
results.append((tool_name, tool_args))
except (json.JSONDecodeError, IndexError):
continue
if results:
return results
# Strategy 2: Function-call style: <|tool_call|> tool_name(key=val, ...)
func_matches = re.findall(r'<\|tool_call\|>\s*(\w+)\s*\((.*?)\)', clean_text, re.DOTALL)
for tool_name, args_str in func_matches:
tool_args = {}
args_str = args_str.strip()
if args_str:
for param_match in re.finditer(r'(\w+)\s*=\s*(?:"([^"]*?)"|\'([^\']*?)\'|(\d+(?:\.\d+)?)|(\w+))', args_str):
key = param_match.group(1)
val = param_match.group(2) or param_match.group(3) or param_match.group(4) or param_match.group(5)
if val and isinstance(val, str) and val.isdigit():
val = int(val)
tool_args[key] = val
results.append((tool_name, tool_args))
if results:
return results
# Strategy 3: Bare JSON with name field (no <|tool_call|> prefix)
bare_matches = re.findall(r'\{\s*"name"\s*:\s*"(\w+)"\s*,\s*"arguments"\s*:\s*(\{[^}]*\})', clean_text)
for tool_name, args_json in bare_matches:
try:
tool_args = json.loads(args_json)
results.append((tool_name, tool_args))
except json.JSONDecodeError:
# R13-fix: Do not append empty dicts; allow _repair_and_extract to handle nested JSON
pass
return results
MLX_MODEL_CACHE = None
MLX_TOKENIZER_CACHE = None
def call_ollama(prompt: str, use_json_format: bool = False) -> tuple:
global MLX_MODEL_CACHE, MLX_TOKENIZER_CACHE
import os, time, json, urllib.request
from config import OLLAMA_KEEP_ALIVE, OLLAMA_NUM_CTX, OLLAMA_TEMPERATURE
if MODEL.startswith("/") or os.path.exists(MODEL):
if MLX_MODEL_CACHE is None:
from mlx_lm import load
import gc, mlx.core as mx
print(f"Loading MLX model: {MODEL}")
# OOM protection: clear any prior model from memory
gc.collect()
mx.metal.clear_cache()
MLX_MODEL_CACHE, MLX_TOKENIZER_CACHE = load(MODEL)
peak = mx.metal.get_peak_memory() / 1e9
print(f" Model loaded. Peak GPU memory: {peak:.1f}GB")
from mlx_lm import generate
start_time = time.time()
try:
response_text = generate(MLX_MODEL_CACHE, MLX_TOKENIZER_CACHE, prompt=prompt, max_tokens=512)
except Exception as e:
if "out of memory" in str(e).lower() or "malloc" in str(e).lower():
import gc, mlx.core as mx
print(f" β οΈ OOM detected β clearing cache and retrying with max_tokens=256")
gc.collect(); mx.metal.clear_cache()
response_text = generate(MLX_MODEL_CACHE, MLX_TOKENIZER_CACHE, prompt=prompt, max_tokens=256)
else:
raise
elapsed = time.time() - start_time
all_calls = parse_all_tool_calls(response_text)
if not all_calls:
all_calls = _repair_and_extract(response_text)
if all_calls:
return all_calls[0][0], all_calls[0][1], response_text, elapsed, all_calls
return "NO_TOOL", {}, response_text, elapsed, []
# Ollama HTTP path (when MODEL is a tag, not a local MLX path)
OLLAMA_API = "http://localhost:11434/api/generate"
payload = json.dumps({
"model": MODEL,
"prompt": prompt,
"stream": False,
"raw": False,
"options": {
"temperature": 0.0,
"num_predict": 512,
"num_ctx": OLLAMA_NUM_CTX,
},
"keep_alive": OLLAMA_KEEP_ALIVE,
}).encode()
start_time = time.time()
try:
req = urllib.request.Request(OLLAMA_API, data=payload,
headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req, timeout=120) as resp:
data = json.loads(resp.read().decode())
response_text = data.get("response", "")
except Exception as e:
return "ERROR", {}, str(e), time.time() - start_time, []
elapsed = time.time() - start_time
all_calls = parse_all_tool_calls(response_text)
if not all_calls:
all_calls = _repair_and_extract(response_text)
if all_calls:
return all_calls[0][0], all_calls[0][1], response_text, elapsed, all_calls
return "NO_TOOL", {}, response_text, elapsed, []
# =============================================================================
# Enhancement 1: Best-of-N Schema Validator (Test-Time Compute Scaling)
# =============================================================================
# R6.1-fix: Load tool schemas globally for Best-of-N validation
_TRAINING_DIR = os.path.dirname(os.path.abspath(__file__))
_TOOL_SCHEMA_PATH = os.path.join(_TRAINING_DIR, "data", "tool_schema.json")
try:
with open(_TOOL_SCHEMA_PATH) as _f:
_TOOL_SCHEMAS = json.load(_f).get("tools", [])
# R14-fix: Dynamically sync VALID_TOOLS with schema registry (includes V4 Agentic tools)
if _TOOL_SCHEMAS:
VALID_TOOLS.update(t["name"] for t in _TOOL_SCHEMAS)
print(f"Loaded {len(_TOOL_SCHEMAS)} tool schemas for Best-of-N validation (VALID_TOOLS: {len(VALID_TOOLS)})")
except (FileNotFoundError, json.JSONDecodeError, PermissionError) as e:
_TOOL_SCHEMAS = []
print(f"WARNING: Failed to load {_TOOL_SCHEMA_PATH}: {e} β Best-of-N validation disabled")
# R6.1-fix: Import from config instead of hardcoding
from config import BEST_OF_N_DEFAULT, BEST_OF_N_TEMPERATURE
BEST_OF_N = int(os.environ.get("BFCL_BEST_OF_N", str(BEST_OF_N_DEFAULT)))
def validate_tool_call_against_schema(tool_name: str, tool_args: dict,
available_tools: list) -> tuple:
"""Validate a tool call against its JSON schema definition.
Returns (is_valid, error_reason).
"""
# Find matching tool schema
schema = None
for tool in available_tools:
if tool.get("name") == tool_name:
schema = tool
break
if schema is None:
return False, f"tool '{tool_name}' not in available tools"
params = schema.get("parameters", {})
props = params.get("properties", {})
required = set(params.get("required", []))
# R9-fix: Guard against hallucinated non-dict arguments (e.g., arrays)
if not isinstance(tool_args, dict):
return False, f"arguments must be an object, got {type(tool_args).__name__}"
# Check required params present
for req_param in required:
if req_param not in tool_args:
return False, f"missing required param: {req_param}"
# Check no hallucinated params
for arg_name in tool_args:
if arg_name not in props:
return False, f"hallucinated param: {arg_name}"
# Check data types
for arg_name, arg_val in tool_args.items():
# R6.2-fix: Only allow None for optional (non-required) params
if arg_val is None:
if arg_name in required:
return False, f"{arg_name} is required and cannot be null"
continue
if arg_name not in props:
continue
expected_type = props[arg_name].get("type", "string")
if expected_type == "integer" and (not isinstance(arg_val, int) or isinstance(arg_val, bool)):
return False, f"{arg_name} should be int, got {type(arg_val).__name__}"
elif expected_type == "number" and (not isinstance(arg_val, (int, float)) or isinstance(arg_val, bool)):
return False, f"{arg_name} should be number, got {type(arg_val).__name__}"
elif expected_type == "boolean" and not isinstance(arg_val, bool):
return False, f"{arg_name} should be bool, got {type(arg_val).__name__}"
elif expected_type == "object" and not isinstance(arg_val, dict):
return False, f"{arg_name} should be object, got {type(arg_val).__name__}"
elif expected_type == "array" and not isinstance(arg_val, list):
return False, f"{arg_name} should be array, got {type(arg_val).__name__}"
# Check enum constraints
for arg_name, arg_val in tool_args.items():
if arg_name in props and "enum" in props[arg_name]:
if arg_val not in props[arg_name]["enum"]:
return False, f"{arg_name} value '{arg_val}' not in enum"
return True, "valid"
def call_ollama_best_of_n(prompt: str, available_tools: list = None,
n: int = None) -> tuple:
return call_ollama(prompt)
def _repair_and_extract(text: str) -> list:
"""R5-3: Attempt to repair malformed JSON and extract tool calls.
Handles: trailing commas, missing closing braces.
NOTE: Does NOT cast string types β BFCL strictly checks data types.
"""
import re as _re
# R17-fix: Strip CoT blocks before attempting repair
# R19-fix: Handle unclosed think blocks via (?:</\|synalux_think\|>|$) fallback
clean_text = _re.sub(r'<\|synalux_think\|>.*?(?:</\|synalux_think\|>|$)', '', text, flags=_re.DOTALL)
# Find anything that looks like a JSON tool call
candidates = _re.findall(r'\{\s*"name"\s*:.*?(?:\}\s*\}|\})', clean_text, _re.DOTALL)
results = []
for raw in candidates:
repaired = raw
# Fix trailing commas before closing brace
repaired = _re.sub(r',\s*\}', '}', repaired)
# Count braces and add missing ones
open_braces = repaired.count('{')
close_braces = repaired.count('}')
if open_braces > close_braces:
repaired += '}' * (open_braces - close_braces)
try:
parsed = json.loads(repaired)
tool_name = parsed.get("name", "")
tool_args = parsed.get("arguments", {})
if tool_name:
results.append((tool_name, tool_args))
except json.JSONDecodeError:
continue
return results
def evaluate_test(test: dict, verbose: bool = False) -> dict:
"""Evaluate a single BFCL test case."""
from config import format_system_prompt
prompt = test["prompt"]
expected_tool = test["expected_tool"]
required_params = test.get("required_params", {})
ast_strict = test.get("ast_strict", False)
test_id = test["id"]
# Support list of acceptable tools for ambiguous prompts
expected_tool_list = expected_tool if isinstance(expected_tool, list) else [expected_tool]
# R5-7 fix: Wrap prompt with system prompt to match training distribution
# Uses bfcl_eval_mode=True to disable clarification behavior (R4-5)
# R6.1-fix: Use RAG system prompt for context-limited tool injection
# R20-fix: Use training-compatible system prompt (matches 1518 Prism tool training examples).
# format_system_prompt with empty schemas produces piped-token format (<|tool_call|>) that
# mismatches v43 training format (<tool_call>), causing think-block close tag confusion and
# empty-tag degeneration loops. Hardcoded training format resolves 0% \u2192 correct scoring.
_TRAINING_SYS_PROMPT = (
"You are Synalux, a memory-augmented coding and clinical reasoning assistant. "
"You have access to Prism Memory tools (session_save_ledger, session_load_context, "
"session_search_memory, session_save_handoff, session_forget_memory, session_health_check, "
"session_compact_ledger, session_export_memory, session_task_route, session_save_experience, "
"session_synthesize_edges, session_backfill_links, knowledge_search, knowledge_forget, "
"knowledge_upvote, knowledge_downvote, knowledge_set_retention) and 13 multimodal tool "
"modules (image_gen, office, web_scraper, browser, tts, ocr, git, terminal, deps_scanner, "
"hipaa, data_graph, templates, pdf_parser). "
"Think step-by-step before answering. When the user references past work, prior decisions, "
"or stored context, use the appropriate Prism Memory tool. "
"Format tool calls inside <tool_call>...</tool_call> JSON blocks with fields 'name' and 'arguments'. "
"If no tool is needed, answer directly in plain text. "
"ABSTAIN for general programming questions, CS concepts, greetings, and capability questions."
)
sys_prompt = _TRAINING_SYS_PROMPT
# R8-fix: Format as proper ChatML so Ollama raw mode sends it correctly
full_prompt = f"<|im_start|>system\n{sys_prompt}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
# R6-1: Use Best-of-N when enabled (validates candidates against tool schemas)
if BEST_OF_N > 1:
# R6.1-fix: Use globally loaded tool schemas, not per-test dicts
actual_tool, actual_args, raw_response, latency, all_calls = call_ollama_best_of_n(
full_prompt, available_tools=_TOOL_SCHEMAS
)
else:
actual_tool, actual_args, raw_response, latency, all_calls = call_ollama(full_prompt)
# Layer 3 validation
actual_tool, actual_args = validate_tool_call(prompt, actual_tool, actual_args)
# Hallucination check: did the model call a tool that doesn't exist?
hallucinated = actual_tool not in VALID_TOOLS and actual_tool != "NO_TOOL" and actual_tool != "ERROR"
# Score
result = {
"id": test_id,
"prompt": prompt,
"expected": expected_tool_list[0] if len(expected_tool_list) == 1 else str(expected_tool_list),
"actual": actual_tool,
"latency": latency,
"hallucinated": hallucinated,
"correct": False,
"tool_correct": False,
"params_correct": False,
"details": "",
}
if actual_tool == "ERROR":
result["details"] = "API error"
return result
# Tool name match (check against all acceptable tools)
tool_matches = actual_tool in expected_tool_list
if tool_matches:
result["tool_correct"] = True
if "NO_TOOL" in expected_tool_list:
result["correct"] = True
result["params_correct"] = True
result["details"] = "β
Correct abstention"
else:
# Check parameters
if ast_strict and required_params:
# R21-fix: Guard against non-dict arguments (e.g. hallucinated arrays)
if not isinstance(actual_args, dict):
actual_args = {}
# AST-level: check exact parameter values
params_ok = True
mismatches = []
for key, expected_val in required_params.items():
actual_val = actual_args.get(key)
if actual_val is None:
params_ok = False
mismatches.append(f"missing '{key}'")
elif isinstance(expected_val, int):
try:
if int(actual_val) != expected_val:
params_ok = False
mismatches.append(f"'{key}': expected {expected_val}, got {actual_val}")
except (ValueError, TypeError):
params_ok = False
mismatches.append(f"'{key}': expected int {expected_val}, got '{actual_val}'")
elif isinstance(expected_val, str):
if str(actual_val).lower().strip() != expected_val.lower().strip():
# Fuzzy match for similar strings
if expected_val.lower() not in str(actual_val).lower():
params_ok = False
mismatches.append(f"'{key}': expected '{expected_val}', got '{actual_val}'")
result["params_correct"] = params_ok
result["correct"] = params_ok
result["details"] = "β
AST match" if params_ok else f"β οΈ Param mismatch: {', '.join(mismatches)}"
else:
# Non-strict: just check required param keys exist
missing = [k for k in required_params if k not in actual_args]
result["params_correct"] = len(missing) == 0
result["correct"] = True # Tool is correct even if params partially missing
if missing:
result["details"] = f"β
Tool correct, missing params: {missing}"
else:
result["details"] = "β
Full match"
else:
# Wrong tool
expected_str = expected_tool_list[0] if len(expected_tool_list) == 1 else str(expected_tool_list)
if "NO_TOOL" in expected_tool_list:
result["details"] = f"β False positive: called {actual_tool} instead of abstaining"
elif actual_tool == "NO_TOOL":
result["details"] = f"β False negative: abstained instead of calling {expected_str}"
else:
result["details"] = f"β Wrong tool: expected {expected_str}, got {actual_tool}"
if verbose:
status = "β
" if result["correct"] else "β"
print(f" {status} [{test_id}] {result['details']}")
if not result["correct"]:
print(f" Prompt: {prompt[:80]}...")
print(f" Raw: {raw_response[:120]}...")
# R11-fix: Multi-turn followup evaluation (was deferred, now implemented)
if result["correct"] and isinstance(test, dict) and "followup" in test:
followup = test["followup"]
# Build conversation history: original prompt + first response + tool response + new assistant turn
# R12-fix: Use native ChatML without <|tool_response|> tags to match training distribution
history = (
f"{full_prompt}{raw_response}<|im_end|>\n"
f"<|im_start|>tool\n{followup['tool_response']}<|im_end|>\n"
f"<|im_start|>assistant\n"
)
if BEST_OF_N > 1:
next_tool, next_args, next_raw, next_latency, _ = call_ollama_best_of_n(
history, available_tools=_TOOL_SCHEMAS
)
else:
next_tool, next_args, next_raw, next_latency, _ = call_ollama(history)
# Layer 3 on followup: validate + catch repeated tool calls
next_tool, next_args = validate_tool_call(prompt, next_tool, next_args, is_followup=True)
if next_tool == actual_tool and next_tool != "NO_TOOL":
next_tool = "NO_TOOL"
next_args = {}
result["actual"] += f" -> {next_tool}"
result["latency"] += next_latency
expected_followup = followup.get("expected_tool", "NO_TOOL")
result["correct"] = (next_tool == expected_followup)
if not result["correct"]:
result["details"] += f" | β Followup: expected {expected_followup}, got {next_tool}"
else:
result["details"] += f" | β
Followup: {next_tool}"
if verbose:
status2 = "β
" if result["correct"] else "β"
print(f" {status2} Followup turn: expected={expected_followup}, got={next_tool}")
return result
def run_evaluation(shuffle: bool = False, verbose: bool = False, quiet: bool = False) -> dict:
"""Run full BFCL-style evaluation across all categories."""
# Build flat test list with category tags
all_tests = []
for cat_name, tests in ALL_CATEGORIES.items():
for test in tests:
test_copy = test.copy()
test_copy["category"] = cat_name
all_tests.append(test_copy)
if shuffle:
random.shuffle(all_tests)
print(f"\n{'='*70}")
print(f" BFCL-Style Evaluation β {MODEL}")
print(f" {len(all_tests)} tests across {len(ALL_CATEGORIES)} categories")
print(f" Shuffle: {'ON' if shuffle else 'OFF'}")
print(f"{'='*70}\n")
# Run all tests
results = []
category_results = {cat: [] for cat in ALL_CATEGORIES}
start_time = time.time()
for i, test in enumerate(all_tests, 1):
cat = test["category"]
if verbose:
print(f"[{i}/{len(all_tests)}] Category: {cat}")
result = evaluate_test(test, verbose=verbose)
result["category"] = cat
results.append(result)
category_results[cat].append(result)
if not verbose and not quiet:
status = "β
" if result["correct"] else "β"
print(f" {status} [{result['id']}] {result['expected']:>25s} β {result['actual']:<25s} {result['latency']:.1f}s", end="")
if result["hallucinated"]:
print(" π¨ HALLUCINATED", end="")
print()
elif quiet and not result["correct"]:
print(f" β [{result['id']}] expected {result['expected']}, got {result['actual']}")
elapsed = time.time() - start_time
# Category scores (BFCL methodology: accuracy per category)
category_scores = {}
print(f"\n{'='*70}")
print(f" CATEGORY BREAKDOWN")
print(f"{'='*70}")
for cat_name in ALL_CATEGORIES:
cat_res = category_results[cat_name]
if not cat_res:
continue
correct = sum(1 for r in cat_res if r["correct"])
total = len(cat_res)
accuracy = correct / total * 100
category_scores[cat_name] = accuracy
tool_correct = sum(1 for r in cat_res if r["tool_correct"])
params_correct = sum(1 for r in cat_res if r["params_correct"])
hallucinated = sum(1 for r in cat_res if r["hallucinated"])
print(f" {cat_name:25s} {correct}/{total} = {accuracy:6.1f}% "
f"(tool:{tool_correct}/{total} params:{params_correct}/{total} "
f"halluc:{hallucinated})")
# Overall score (BFCL: unweighted average across categories)
overall = sum(category_scores.values()) / len(category_scores) if category_scores else 0
total_correct = sum(1 for r in results if r["correct"])
total_halluc = sum(1 for r in results if r["hallucinated"])
avg_latency = sum(r["latency"] for r in results) / len(results) if results else 0
print(f"\n{'='*70}")
print(f" OVERALL RESULTS")
print(f"{'='*70}")
print(f" Overall Accuracy (BFCL avg): {overall:.1f}%")
print(f" Raw Accuracy: {total_correct}/{len(results)} = {total_correct/len(results)*100:.1f}%")
print(f" Hallucinations: {total_halluc}")
print(f" Avg Latency: {avg_latency:.1f}s")
print(f" Total Time: {elapsed:.0f}s")
print(f"{'='*70}\n")
return {
"overall_accuracy": overall,
"raw_accuracy": total_correct / len(results) * 100 if results else 0,
"total_correct": total_correct,
"total_tests": len(results),
"category_scores": category_scores,
"hallucinations": total_halluc,
"avg_latency": avg_latency,
"elapsed": elapsed,
"results": results,
}
def main():
import argparse
parser = argparse.ArgumentParser(description="BFCL-Style evaluation for Prism models")
parser.add_argument("--model", type=str, default=None, help="Ollama model name (default: prism-coder:7b)")
parser.add_argument("--runs", type=int, default=1, help="Number of evaluation runs")
parser.add_argument("--shuffle", action="store_true", help="Randomize test order each run")
parser.add_argument("--verbose", action="store_true", help="Show detailed model outputs")
parser.add_argument("--quiet", action="store_true", help="Only print category breakdown and overall results (suppress per-test lines)")
parser.add_argument("--cleanup", action="store_true", help="Release MLX model from memory after eval completes")
args = parser.parse_args()
# Allow --model to override the global MODEL
global MODEL
if args.model:
MODEL = args.model
print(f"Using model: {MODEL}")
all_run_results = []
for run_idx in range(args.runs):
if args.runs > 1:
print(f"\n{'#'*70}")
print(f" RUN {run_idx + 1} / {args.runs}")
print(f"{'#'*70}")
result = run_evaluation(shuffle=args.shuffle, verbose=args.verbose, quiet=args.quiet)
all_run_results.append(result)
if args.runs > 1:
# Multi-run summary
overall_scores = [r["overall_accuracy"] for r in all_run_results]
raw_scores = [r["raw_accuracy"] for r in all_run_results]
raw_correct = [r["total_correct"] for r in all_run_results]
total_tests = all_run_results[0]["total_tests"]
total_halluc = [r["hallucinations"] for r in all_run_results]
print(f"\n{'='*70}")
print(f" MULTI-RUN SUMMARY ({args.runs} runs Γ {total_tests} tests)")
print(f"{'='*70}")
print(f" BFCL Overall Accuracy:")
for i, s in enumerate(overall_scores):
print(f" Run {i+1}: {s:.1f}%")
print(f" Average: {statistics.mean(overall_scores):.1f}%")
print(f" Median: {statistics.median(overall_scores):.1f}%")
if len(overall_scores) > 1:
print(f" StdDev: {statistics.stdev(overall_scores):.2f}%")
print(f"\n Raw Scores: {' | '.join(f'{c}/{total_tests}' for c in raw_correct)}")
print(f" Hallucinations: {' | '.join(str(h) for h in total_halluc)}")
# Per-category consistency
print(f"\n Per-Category Consistency:")
categories = all_run_results[0]["category_scores"].keys()
for cat in categories:
scores = [r["category_scores"].get(cat, 0) for r in all_run_results]
avg = statistics.mean(scores)
consistent = all(s == scores[0] for s in scores)
marker = "β
" if consistent and avg == 100 else "β οΈ" if not consistent else "β
"
print(f" {marker} {cat:25s} {' | '.join(f'{s:.0f}%' for s in scores)} β avg {avg:.1f}%")
print(f"\n{'='*70}\n")
# Exit code
median_overall = statistics.median(overall_scores)
if median_overall < 90:
print("β FAIL: Median BFCL accuracy below 90%")
sys.exit(1)
elif median_overall < 95:
print("β οΈ WARN: Median BFCL accuracy below 95%")
sys.exit(0)
else:
print(f"β
PASS: Median BFCL accuracy {median_overall:.1f}%")
sys.exit(0)
else:
overall = all_run_results[0]["overall_accuracy"]
if args.cleanup:
cleanup_mlx_model()
if overall < 90:
sys.exit(1)
sys.exit(0)
if args.cleanup:
cleanup_mlx_model()
def cleanup_mlx_model():
global MLX_MODEL_CACHE, MLX_TOKENIZER_CACHE
if MLX_MODEL_CACHE is not None:
import gc
try:
import mlx.core as mx
mx.clear_cache()
except Exception:
pass
MLX_MODEL_CACHE = None
MLX_TOKENIZER_CACHE = None
gc.collect()
print("π§Ή MLX model released from memory")
if __name__ == "__main__":
main()
|