#!/usr/bin/env python3
"""
eval_300.py — 300-Case Standard Evaluation for prism-coder:4b-v43
Replaces bfcl_eval.py (64 tests) and swe_bench_test.py (68 tests) with a single
~300-case blind eval. Designed to be run 3 times for statistical stability checks.
All test cases are NOVEL — never seen in any training data.
Categories:
natural_phrasing (50) — casual/indirect phrasing that maps to a tool
adversarial_trap (70) — CS/programming questions that must NOT call a tool
disambiguation (40) — similar tools exist; must pick the correct one
edge_case (25) — minimal / ambiguous prompts
multi_intent (20) — multi-step prompts; score on first action only
verifier (25) — synthesize_edges / backfill_links / health_check patterns
cascade (25) — explicit first-step-of-chain patterns
param_extraction (25) — params in the prompt text; test correct extraction
abstention (20) — greetings / capability questions; must return NO_TOOL
Scoring:
strict_pass = correct tool + all required_params present → 1.0 point
partial_pass = correct tool + at least 1 required_param but not all → 0.5 point
wrong_tool = wrong tool name → 0 points
false_pos = tool called when NO_TOOL expected → 0 points
false_neg = NO_TOOL when tool expected → 0 points
Usage:
python3 eval_300.py
python3 eval_300.py --runs 3 --shuffle
python3 eval_300.py --model prism-coder:4b-v43 --runs 3
python3 eval_300.py --no-validate-layer3
"""
import json
import os
import re
import sys
import time
import random
import statistics
import urllib.request
import argparse
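

# ---------------------------------------------------------------------------
# Illustrative sketch (not part of the original harness)
# ---------------------------------------------------------------------------
# The docstring says the eval is designed to be run 3 times for statistical
# stability checks. One way such a check might be summarized is shown below.
# The helper name _sketch_summarize_runs and the choice of mean, sample
# standard deviation, and max-min spread are assumptions for illustration;
# the harness may report stability differently.
import statistics  # repeated so the sketch runs on its own


def _sketch_summarize_runs(run_scores):
    """Summarize per-run scores (e.g. accuracy percentages) across repeated runs."""
    # statistics.stdev needs at least two data points, so a single run reports 0.0.
    stdev = statistics.stdev(run_scores) if len(run_scores) > 1 else 0.0
    return {
        "mean": statistics.mean(run_scores),
        "stdev": stdev,
        "spread": max(run_scores) - min(run_scores),
    }


# Example: _sketch_summarize_runs([90.0, 92.0, 94.0]) gives mean 92.0,
# stdev 2.0 and spread 4.0; a large spread would suggest unstable results.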
# ---------------------------------------------------------------------------
# Config
# ---------------------------------------------------------------------------
MODEL = "prism-coder:4b-v43"
OLLAMA_API = "http://localhost:11434/api/generate"
SYSTEM_PROMPT = (
    "You are Synalux, a memory-augmented coding and clinical reasoning assistant. "
    "You have access to Prism Memory tools (session_save_ledger, session_load_context, "
    "session_search_memory, session_save_handoff, session_forget_memory, session_health_check, "
    "session_compact_ledger, session_export_memory, session_task_route, session_save_experience, "
    "session_synthesize_edges, session_backfill_links, knowledge_search, knowledge_forget, "
    "knowledge_upvote, knowledge_downvote, knowledge_set_retention) and 13 multimodal tool "
    "modules (image_gen, office, web_scraper, browser, tts, ocr, git, terminal, deps_scanner, "
    "hipaa, data_graph, templates, pdf_parser). "
    "Think step-by-step before answering. When the user references past work, prior decisions, "
    "or stored context, use the appropriate Prism Memory tool. "
    "Format tool calls inside <tool_call>...</tool_call> JSON blocks with fields 'name' and 'arguments'. "
    "If no tool is needed, answer directly in plain text. "
    "ABSTAIN for general programming questions, CS concepts, greetings, and capability questions."
)
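

# ---------------------------------------------------------------------------
# Illustrative sketch (not part of the original harness)
# ---------------------------------------------------------------------------
# OLLAMA_API above points at Ollama's /api/generate endpoint, which accepts a
# JSON body with "model", "prompt", an optional "system" prompt, and
# "stream": false to get a single JSON reply whose text is in "response".
# The helper names, the temperature of 0.0, and the 120 s timeout are
# assumptions for illustration; the harness's real request code may differ.
import json  # repeated so the sketch runs on its own
import urllib.request


def _sketch_build_request(api_url, model, system_prompt, user_prompt):
    """Build (but do not send) a non-streaming /api/generate request."""
    payload = {
        "model": model,
        "system": system_prompt,
        "prompt": user_prompt,
        "stream": False,
        "options": {"temperature": 0.0},  # assumed: low randomness for repeatable runs
    }
    return urllib.request.Request(
        api_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def _sketch_query_model(request, timeout=120):
    """Send the request and return the generated text (needs a running Ollama server)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8")).get("response", "")


# Possible usage, assuming a local Ollama server is running:
#   req = _sketch_build_request(OLLAMA_API, MODEL, SYSTEM_PROMPT, "Where did we leave off?")
#   text = _sketch_query_model(req)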
VALID_TOOLS = {
    "session_load_context", "session_save_ledger", "session_save_handoff",
    "session_search_memory", "session_forget_memory", "session_health_check",
    "session_compact_ledger", "session_export_memory", "session_task_route",
    "session_save_experience", "session_synthesize_edges", "session_backfill_links",
    "knowledge_search", "knowledge_forget", "knowledge_upvote",
    "knowledge_downvote", "knowledge_set_retention",
}
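

# ---------------------------------------------------------------------------
# Illustrative sketch (not part of the original harness)
# ---------------------------------------------------------------------------
# SYSTEM_PROMPT asks for tool calls inside <tool_call>...</tool_call> JSON
# blocks with 'name' and 'arguments', and the docstring's Scoring section
# gives the point values. The two helpers below show how the first tool call
# might be extracted and scored under those rules. Assumptions, not taken
# from the source: the helper names; treating unparseable output as NO_TOOL;
# scoring a correct abstention as strict_pass (1.0); and scoring a correct
# tool with none of several required params as "missing_params" (0 points).
import json  # repeated so the sketch runs on its own
import re


def _sketch_parse_tool_call(text):
    """Return (tool_name, arguments) for the first tool call, else ("NO_TOOL", {})."""
    match = re.search(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL)
    if match is None:
        return "NO_TOOL", {}
    try:
        call = json.loads(match.group(1).strip())
    except json.JSONDecodeError:
        return "NO_TOOL", {}
    if not isinstance(call, dict) or not isinstance(call.get("name"), str):
        return "NO_TOOL", {}
    args = call.get("arguments")
    return call["name"], args if isinstance(args, dict) else {}


def _sketch_score_case(expected_tool, required_params, got_tool, got_args):
    """Return (label, points) for one test case, following the docstring's rules."""
    if expected_tool == "NO_TOOL":
        if got_tool == "NO_TOOL":
            return "strict_pass", 1.0  # assumed: correct abstention earns full credit
        return "false_pos", 0.0
    if got_tool == "NO_TOOL":
        return "false_neg", 0.0
    if got_tool != expected_tool:
        return "wrong_tool", 0.0
    present = [p for p in required_params if p in got_args]
    if len(present) == len(required_params):
        return "strict_pass", 1.0  # also covers cases with no required params
    if present:
        return "partial_pass", 0.5
    return "missing_params", 0.0  # assumed: the docstring does not cover this case


# Only the first <tool_call> block is parsed, which matches the docstring's
# "score on first action only" note for multi_intent prompts.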
# ---------------------------------------------------------------------------
# Test Cases (prompt, expected_tool_or_NO_TOOL, required_params, category)
# required_params: list of param keys that MUST appear in got_args
# ---------------------------------------------------------------------------
TESTS = [
# ===========================================================================
# CATEGORY 1: natural_phrasing (50 cases)
# Casual / indirect user phrasing that maps to a specific Prism tool.
# ===========================================================================
# --- session_load_context ---
("Alright, kick things off. Pull up whatever we had on the checkout-service project.",
"session_load_context", ["project"], "natural_phrasing"),
("I'm back from lunch. Get me re-oriented on the prism-aac project.",
"session_load_context", ["project"], "natural_phrasing"),
("Fresh session here. Reconstruct everything we built for the notifications project.",
"session_load_context", ["project"], "natural_phrasing"),
("Starting a new chat. Bring up the full context for the mobile-app project.",
"session_load_context", ["project"], "natural_phrasing"),
("Where did we leave off with the auth-service work?",
"session_load_context", [], "natural_phrasing"),
("Get me up to speed on the reporting-dashboard project.",
"session_load_context", ["project"], "natural_phrasing"),
("Resume from where we were on the data-pipeline project.",
"session_load_context", ["project"], "natural_phrasing"),
("Catch me up — what was the state of the subscription-api project?",
"session_load_context", ["project"], "natural_phrasing"),
# --- session_save_ledger ---
("We wrapped up for today. Make a note that we completed the database indexing overhaul.",
"session_save_ledger", [], "natural_phrasing"),
("Log what just happened: we refactored the payment module and all tests pass.",
"session_save_ledger", [], "natural_phrasing"),
("Record this session — we finalized the API contract for the mobile team.",
"session_save_ledger", [], "natural_phrasing"),
("Write down everything we did today before I close this tab.",
"session_save_ledger", [], "natural_phrasing"),
("Jot down our progress: three endpoints migrated, two more to go.",
"session_save_ledger", [], "natural_phrasing"),
("Before I head out, save a summary of what we accomplished this afternoon.",
"session_save_ledger", [], "natural_phrasing"),
# --- session_save_handoff ---
("I'm handing this over. Leave a note for whoever picks this up next on the billing-portal project.",
"session_save_handoff", ["project"], "natural_phrasing"),
("Pass the baton on the logistics-api project. Save the handoff so the next person knows where we are.",
"session_save_handoff", ["project"], "natural_phrasing"),
("Shift change. Store the current state for the embedded-firmware project so the next agent can continue.",
"session_save_handoff", ["project"], "natural_phrasing"),
("Create a handoff note for the trading-platform project — we got through feature flagging, still need A/B routing.",
"session_save_handoff", ["project"], "natural_phrasing"),
# --- session_search_memory ---
("Remind me — did we ever pick a caching strategy for the CDN layer?",
"session_search_memory", ["query"], "natural_phrasing"),
("Did we discuss anything about Kafka consumer lag in previous sessions?",
"session_search_memory", ["query"], "natural_phrasing"),
("Go back through our history and find anything about the CI pipeline refactor.",
"session_search_memory", ["query"], "natural_phrasing"),
("What did we decide about webhook retry logic in past conversations?",
"session_search_memory", ["query"], "natural_phrasing"),
("Dig up anything we recorded about the multi-tenant database design.",
"session_search_memory", ["query"], "natural_phrasing"),
("Pull up any notes we saved about the gRPC migration.",
"session_search_memory", ["query"], "natural_phrasing"),
# --- session_forget_memory ---
("That entry we saved about using SQLite in production is totally wrong. Remove it.",
"session_forget_memory", ["memory_id"], "natural_phrasing"),
("Delete the memory with ID mem-zx91-ff. It's stale.",
"session_forget_memory", ["memory_id"], "natural_phrasing"),
("Wipe the incorrect ledger note that said we shipped v2.1 — we didn't.",
"session_forget_memory", ["memory_id"], "natural_phrasing"),
# --- session_health_check ---
("Something feels off. Can you run diagnostics on the memory backend?",
"session_health_check", [], "natural_phrasing"),
("Before I trust these search results, verify the memory system is healthy.",
"session_health_check", [], "natural_phrasing"),
("Give the memory infrastructure a quick checkup.",
"session_health_check", [], "natural_phrasing"),
# --- session_compact_ledger ---
("The session history for the event-sourcing project is getting massive. Trim and archive the old entries.",
"session_compact_ledger", ["project"], "natural_phrasing"),
("Compress the ledger for the recommendation-engine project — too much noise in there.",
"session_compact_ledger", ["project"], "natural_phrasing"),
("Prune out the old session entries for the analytics-backend project.",
"session_compact_ledger", ["project"], "natural_phrasing"),
# --- session_export_memory ---
("Dump a full backup of my memory to /data/exports in JSON format.",
"session_export_memory", ["output_path", "format"], "natural_phrasing"),
("Export everything to /tmp/prism-dump so I can archive it.",
"session_export_memory", ["output_path"], "natural_phrasing"),
("I need an offline copy of all session data. Export to /backup/weekly.",
"session_export_memory", ["output_path"], "natural_phrasing"),
# --- session_task_route ---
("Should I tackle this Rust async runtime bug locally or send it to a bigger model?",
"session_task_route", ["task_description"], "natural_phrasing"),
("Is this image classification fine-tuning job something the local agent can handle?",
"session_task_route", ["task_description"], "natural_phrasing"),
("Route this task: refactor the monorepo build system to support incremental compilation.",
"session_task_route", ["task_description"], "natural_phrasing"),
# --- session_save_experience ---
("Log a milestone: we successfully zero-downtime-deployed the new search index.",
"session_save_experience", [], "natural_phrasing"),
("Record that we fixed the race condition in the WebSocket handler — took 4 hours but it's solid now.",
"session_save_experience", [], "natural_phrasing"),
# --- knowledge_search ---
("Any institutional knowledge on how we handle circuit breakers?",
"knowledge_search", ["query"], "natural_phrasing"),
("What does our knowledge base say about rate limiting strategies?",
"knowledge_search", ["query"], "natural_phrasing"),
("Look up anything curated about CQRS patterns.",
"knowledge_search", ["query"], "natural_phrasing"),
("Check our documented knowledge for anything on event-driven architecture.",
"knowledge_search", ["query"], "natural_phrasing"),
# --- knowledge_upvote / downvote ---
("That knowledge entry about using Redis for distributed locks was really helpful. Give it a thumbs up.",
"knowledge_upvote", [], "natural_phrasing"),
("Boost the ranking on our GraphQL federation notes — they're gold.",
"knowledge_upvote", [], "natural_phrasing"),
("That doc about using polling instead of webhooks is outdated and wrong. Lower its score.",
"knowledge_downvote", [], "natural_phrasing"),
("Downvote the entry about using bcrypt at cost 4 — it's dangerously insecure.",
"knowledge_downvote", [], "natural_phrasing"),
# --- knowledge_set_retention ---
("Set a 45-day retention policy on the alpha-testing project's knowledge.",
"knowledge_set_retention", ["project"], "natural_phrasing"),
# ===========================================================================
# CATEGORY 2: adversarial_trap (70 cases)
# CS / programming questions — must return NO_TOOL even when keywords match.
# ===========================================================================
# Python
("Write a Python function that implements a trie for fast prefix searches.",
"NO_TOOL", [], "adversarial_trap"),
("How do I use Python's contextlib.contextmanager decorator?",
"NO_TOOL", [], "adversarial_trap"),
("Explain Python's __slots__ and when to use it for memory optimization.",
"NO_TOOL", [], "adversarial_trap"),
("What is the difference between deepcopy and shallow copy in Python?",
"NO_TOOL", [], "adversarial_trap"),
("How does Python's asyncio event loop schedule coroutines?",
"NO_TOOL", [], "adversarial_trap"),
("Write a Python generator that yields prime numbers indefinitely.",
"NO_TOOL", [], "adversarial_trap"),
("How do I profile memory usage in a Python application?",
"NO_TOOL", [], "adversarial_trap"),
# JavaScript / TypeScript
("How do I debounce a function in JavaScript without lodash?",
"NO_TOOL", [], "adversarial_trap"),
("Explain the JavaScript event loop and microtask queue.",
"NO_TOOL", [], "adversarial_trap"),
("How does TypeScript's discriminated union type work?",
"NO_TOOL", [], "adversarial_trap"),
("Write a TypeScript generic function that deep-merges two objects.",
"NO_TOOL", [], "adversarial_trap"),
("What is the difference between a WeakMap and a Map in JavaScript?",
"NO_TOOL", [], "adversarial_trap"),
("How do I implement a promise-based queue in Node.js?",
"NO_TOOL", [], "adversarial_trap"),
# Go
("How does Go's goroutine scheduler work with M:N threading?",
"NO_TOOL", [], "adversarial_trap"),
("Explain Go's garbage collector and write barriers.",
"NO_TOOL", [], "adversarial_trap"),
("Write a concurrent rate limiter in Go using channels.",
"NO_TOOL", [], "adversarial_trap"),
("How do I implement context cancellation in a Go HTTP server?",
"NO_TOOL", [], "adversarial_trap"),
# Rust
("Explain Rust's borrow checker and why it prevents data races.",
"NO_TOOL", [], "adversarial_trap"),
("How do Arc and Mutex work together in Rust for thread-safe state sharing?",
"NO_TOOL", [], "adversarial_trap"),
("What is Rust's Pin and why is it needed for async futures?",
"NO_TOOL", [], "adversarial_trap"),
("Write a Rust trait that implements a retry strategy with exponential backoff.",
"NO_TOOL", [], "adversarial_trap"),
# SQL / NoSQL
("Write a SQL query that finds the second-highest salary in an employees table.",
"NO_TOOL", [], "adversarial_trap"),
("How do I use window functions in PostgreSQL to compute a running total?",
"NO_TOOL", [], "adversarial_trap"),
("What is a covering index and when should I use one in MySQL?",
"NO_TOOL", [], "adversarial_trap"),
("Explain eventual consistency in DynamoDB and how to work around it.",
"NO_TOOL", [], "adversarial_trap"),
("How do I export data from MongoDB to a JSON file using mongoexport?",
"NO_TOOL", [], "adversarial_trap"),
("What is a materialized view in PostgreSQL and how does it differ from a regular view?",
"NO_TOOL", [], "adversarial_trap"),
# Algorithms / Data Structures
("Explain Dijkstra's algorithm and its time complexity.",
"NO_TOOL", [], "adversarial_trap"),
("Write a depth-first search implementation for a graph adjacency list.",
"NO_TOOL", [], "adversarial_trap"),
("How does consistent hashing help with horizontal scaling?",
"NO_TOOL", [], "adversarial_trap"),
("Explain the difference between a B-tree and a B+ tree.",
"NO_TOOL", [], "adversarial_trap"),
("What is the time and space complexity of merge sort?",
"NO_TOOL", [], "adversarial_trap"),
("Implement a LRU cache in Python using OrderedDict.",
"NO_TOOL", [], "adversarial_trap"),
("How does a bloom filter work and what are its false positive trade-offs?",
"NO_TOOL", [], "adversarial_trap"),
# Frameworks / Config
("How do I configure Django's ORM to use read replicas?",
"NO_TOOL", [], "adversarial_trap"),
("Explain Flask's application context vs. request context.",
"NO_TOOL", [], "adversarial_trap"),
("How does FastAPI's dependency injection system work?",
"NO_TOOL", [], "adversarial_trap"),
("Write a middleware in Express.js that logs request durations.",
"NO_TOOL", [], "adversarial_trap"),
("How do I set up hot-module replacement in a Vite + React project?",
"NO_TOOL", [], "adversarial_trap"),
("What is the difference between server components and client components in Next.js 14?",
"NO_TOOL", [], "adversarial_trap"),
# DevOps / Infrastructure
("Write a Dockerfile for a Python FastAPI app with multi-stage builds.",
"NO_TOOL", [], "adversarial_trap"),
("How do I configure a Kubernetes HorizontalPodAutoscaler based on custom metrics?",
"NO_TOOL", [], "adversarial_trap"),
("What is the difference between rolling and blue-green deployments?",
"NO_TOOL", [], "adversarial_trap"),
("How do I set up Prometheus scraping for a Node.js service?",
"NO_TOOL", [], "adversarial_trap"),
("Explain how etcd achieves consensus using the Raft algorithm.",
"NO_TOOL", [], "adversarial_trap"),
("Write a GitHub Actions workflow that runs tests on every pull request.",
"NO_TOOL", [], "adversarial_trap"),
# Memory management (trap on 'memory' keyword)
("How does virtual memory paging work in Linux?",
"NO_TOOL", [], "adversarial_trap"),
("What is memory-mapped I/O and how does mmap work in C?",
"NO_TOOL", [], "adversarial_trap"),
("Explain stack vs. heap memory allocation and when each is appropriate.",
"NO_TOOL", [], "adversarial_trap"),
("How does the V8 engine's garbage collector use generational collection?",
"NO_TOOL", [], "adversarial_trap"),
# Session handling (trap on 'session' keyword)
("How does PHP's session_start() work under the hood?",
"NO_TOOL", [], "adversarial_trap"),
("Implement session fixation protection in a Flask application.",
"NO_TOOL", [], "adversarial_trap"),
("What is the difference between sticky sessions and session replication?",
"NO_TOOL", [], "adversarial_trap"),
("How do I store JWT tokens in a secure, httpOnly cookie in Express?",
"NO_TOOL", [], "adversarial_trap"),
# Search (trap on 'search' keyword)
("How do I implement fuzzy search with trigrams in PostgreSQL?",
"NO_TOOL", [], "adversarial_trap"),
("Explain TF-IDF and how it ranks documents in full-text search.",
"NO_TOOL", [], "adversarial_trap"),
("Write a binary search implementation in Rust.",
"NO_TOOL", [], "adversarial_trap"),
("Compare Elasticsearch and OpenSearch for log aggregation.",
"NO_TOOL", [], "adversarial_trap"),
# Graph theory (trap on 'graph' + 'edges' keywords)
("Explain the difference between Prim's and Kruskal's spanning tree algorithms.",
"NO_TOOL", [], "adversarial_trap"),
("How do topological sorts work on directed acyclic graphs?",
"NO_TOOL", [], "adversarial_trap"),
("Write a function to detect cycles in a directed graph using DFS.",
"NO_TOOL", [], "adversarial_trap"),
# Load balancing (trap on 'load' keyword)
("What are the differences between round-robin, least-connections, and IP-hash load balancing?",
"NO_TOOL", [], "adversarial_trap"),
("How does Nginx upstream load balancing handle health check failures?",
"NO_TOOL", [], "adversarial_trap"),
# Logging / monitoring
("How do I implement structured logging in a Go service with zerolog?",
"NO_TOOL", [], "adversarial_trap"),
("Explain the ELK stack and how logs flow from Beats to Kibana.",
"NO_TOOL", [], "adversarial_trap"),
("What is OpenTelemetry and how does distributed tracing work?",
"NO_TOOL", [], "adversarial_trap"),
# Misc CS concepts
("What is the difference between optimistic and pessimistic locking in databases?",
"NO_TOOL", [], "adversarial_trap"),
("Explain how CRDTs achieve conflict-free distributed state.",
"NO_TOOL", [], "adversarial_trap"),
("What is a saga pattern in distributed systems?",
"NO_TOOL", [], "adversarial_trap"),
("How does the forget gate in an LSTM neural network control memory?",
"NO_TOOL", [], "adversarial_trap"),
# ===========================================================================
# CATEGORY 3: disambiguation (40 cases)
# Similar tools — model must pick the correct one.
# ===========================================================================
# session_search_memory vs knowledge_search
("Find anything we discussed last month about the API versioning decision.",
"session_search_memory", ["query"], "disambiguation"),
("What do our curated knowledge items say about dependency injection patterns?",
"knowledge_search", ["query"], "disambiguation"),
("Search our accumulated documentation for information on database sharding.",
"knowledge_search", ["query"], "disambiguation"),
("Look through recent session notes for anything about the CDN cache invalidation bug.",
"session_search_memory", ["query"], "disambiguation"),
("Any past conversations where we discussed microservice mesh configurations?",
"session_search_memory", ["query"], "disambiguation"),
("Check the knowledge base for anything on event sourcing trade-offs.",
"knowledge_search", ["query"], "disambiguation"),
# session_forget_memory vs knowledge_forget
("Remove the specific session memory with ID mem-qq77-rr. It's incorrect.",
"session_forget_memory", ["memory_id"], "disambiguation"),
("Clear all the outdated knowledge entries in the staging project.",
"knowledge_forget", ["project"], "disambiguation"),
("Wipe out old debugging records from the search-service project's knowledge base.",
"knowledge_forget", ["project"], "disambiguation"),
("Delete the memory entry for ID mem-ab99-cd — we noted the wrong schema version.",
"session_forget_memory", ["memory_id"], "disambiguation"),
("Remove all knowledge items in the deprecated-feature category from the portal project.",
"knowledge_forget", ["project"], "disambiguation"),
# session_save_ledger vs session_save_experience vs session_save_handoff
("Log what we did today: migrated the billing module to the new event bus.",
"session_save_ledger", [], "disambiguation"),
("Record a milestone: we successfully launched the new onboarding flow in production.",
"session_save_experience", [], "disambiguation"),
("Hand off this session — save the state for the next agent on the gateway project.",
"session_save_handoff", ["project"], "disambiguation"),
("Write down that we rewrote the payment reconciliation logic today.",
"session_save_ledger", [], "disambiguation"),
("Mark a success: we fixed the notorious N+1 query on the orders endpoint.",
"session_save_experience", [], "disambiguation"),
("The contractor is taking over tonight. Save the handoff for the migration-tools project.",
"session_save_handoff", ["project"], "disambiguation"),
# knowledge_upvote vs knowledge_downvote
("That knowledge entry about immutable infrastructure is spot on. Upvote it.",
"knowledge_upvote", [], "disambiguation"),
("The doc recommending XML over JSON for internal APIs is terrible. Mark it down.",
"knowledge_downvote", [], "disambiguation"),
("Increase the importance score of the circuit-breaker patterns entry.",
"knowledge_upvote", [], "disambiguation"),
("Reduce the rank of that outdated note about using MD5 for hashing.",
"knowledge_downvote", [], "disambiguation"),
# session_compact_ledger vs session_export_memory
("The billing-service ledger is bloated. Compress and archive the old entries.",
"session_compact_ledger", ["project"], "disambiguation"),
("Export a full offline snapshot of my memory to /archive/snapshot in JSON.",
"session_export_memory", ["output_path", "format"], "disambiguation"),
("Trim down the session history for the firmware project — it's too long.",
"session_compact_ledger", ["project"], "disambiguation"),
("Save everything to disk — dump all session data to /tmp/export-all.",
"session_export_memory", ["output_path"], "disambiguation"),
# session_synthesize_edges vs session_backfill_links vs session_health_check
("Verify the session graph edges are all consistent for the trading-platform project.",
"session_synthesize_edges", ["project"], "disambiguation"),
("Reconnect the dangling session references for the ml-pipeline project.",
"session_backfill_links", ["project"], "disambiguation"),
("Run a full health diagnostic on the Prism memory backend.",
"session_health_check", [], "disambiguation"),
("Patch up missing cross-session links for the user-service project.",
"session_backfill_links", ["project"], "disambiguation"),
("Make sure all edges are synthesized and up to date for the invoicing project.",
"session_synthesize_edges", ["project"], "disambiguation"),
("Is the memory system responding normally? Do a quick health check.",
"session_health_check", [], "disambiguation"),
# session_load_context vs session_search_memory
("Bring me back into the context of the payments-gateway project.",
"session_load_context", ["project"], "disambiguation"),
("Look for any notes we made about the GraphQL schema decisions.",
"session_search_memory", ["query"], "disambiguation"),
("Restore the full session state for the devops-automation project.",
"session_load_context", ["project"], "disambiguation"),
("Search our history for any discussion about OAuth2 vs API keys.",
"session_search_memory", ["query"], "disambiguation"),
# session_task_route vs session_load_context
("Should the local model handle this React performance optimization or route it to the cloud?",
"session_task_route", ["task_description"], "disambiguation"),
("Initialize context for the infrastructure-as-code project — I'm starting fresh.",
"session_load_context", ["project"], "disambiguation"),
# knowledge_set_retention vs knowledge_forget
("Set the knowledge for the beta-program project to expire after 90 days.",
"knowledge_set_retention", ["project"], "disambiguation"),
("Delete all knowledge in the archived-2025 project — we don't need it anymore.",
"knowledge_forget", ["project"], "disambiguation"),
("Auto-expire the knowledge entries in the sandbox project after 14 days.",
"knowledge_set_retention", ["project"], "disambiguation"),
# ===========================================================================
# CATEGORY 4: edge_case (25 cases)
# Minimal, single-word, ambiguous, or unusual prompts.
# ===========================================================================
("Load context.", "session_load_context", [], "edge_case"),
("Save.", "session_save_ledger", [], "edge_case"),
("Search.", "session_search_memory", [], "edge_case"),
("Check health.", "session_health_check", [], "edge_case"),
("Export.", "session_export_memory", [], "edge_case"),
("Compact.", "session_compact_ledger", [], "edge_case"),
("Handoff.", "session_save_handoff", [], "edge_case"),
("Route this.", "session_task_route", [], "edge_case"),
("Synthesize edges.", "session_synthesize_edges", [], "edge_case"),
("Backfill links.", "session_backfill_links", [], "edge_case"),
("Forget it.", "session_forget_memory", [], "edge_case"),
("Knowledge search.", "knowledge_search", [], "edge_case"),
# Abstention edge cases
("Hello!", "NO_TOOL", [], "edge_case"),
("What can you do?", "NO_TOOL", [], "edge_case"),
("Tell me about yourself.", "NO_TOOL", [], "edge_case"),
("Thanks, we're done.", "NO_TOOL", [], "edge_case"),
("OK great.", "NO_TOOL", [], "edge_case"),
("Bye!", "NO_TOOL", [], "edge_case"),
# Ambiguous short prompts that still require the right tool
("Run diagnostics.", "session_health_check", [], "edge_case"),
("Save the handoff.", "session_save_handoff", [], "edge_case"),
("Log this session.", "session_save_ledger", [], "edge_case"),
("Search memory.", "session_search_memory", [], "edge_case"),
("Knowledge base lookup.", "knowledge_search", [], "edge_case"),
("Archive old entries.", "session_compact_ledger", [], "edge_case"),
("Save experience.", "session_save_experience", [], "edge_case"),
# ===========================================================================
# CATEGORY 5: multi_intent (20 cases)
# Multi-step prompts — score only the FIRST action.
# ===========================================================================
("Load the context for the pipeline project, then search for any past notes on streaming.",
"session_load_context", ["project"], "multi_intent"),
("Search our memory for anything about the OAuth migration, then save a handoff.",
"session_search_memory", ["query"], "multi_intent"),
("Check memory health, and if it's all good, compact the fraud-detection ledger.",
"session_health_check", [], "multi_intent"),
("Find notes about the ML model rollout, and then log that we finished the A/B test today.",
"session_search_memory", ["query"], "multi_intent"),
("Load the prism-mcp context, then check if there are any open issues about rate limiting.",
"session_load_context", ["project"], "multi_intent"),
("Export everything to /tmp/backup, then set a 60-day retention policy on it.",
"session_export_memory", ["output_path"], "multi_intent"),
("Save what we did today: shipped the new notification system. Then create a handoff note.",
"session_save_ledger", [], "multi_intent"),
("Search for what we decided about the queue architecture, then upvote the best result.",
"session_search_memory", ["query"], "multi_intent"),
("Run a health check on the memory system, then compact the ledger if there are issues.",
"session_health_check", [], "multi_intent"),
("Look up our knowledge on service mesh patterns, and then downvote the outdated ones.",
"knowledge_search", ["query"], "multi_intent"),
("Compact the session history for the payments project, then synthesize the session edges.",
"session_compact_ledger", ["project"], "multi_intent"),
("Load context for the billing-v2 project, and record our progress: we fixed the invoice date bug.",
"session_load_context", ["project"], "multi_intent"),
("Search our knowledge base for event-driven design patterns, then save a handoff with the findings.",
"knowledge_search", ["query"], "multi_intent"),
("Backfill the cross-session links for the ios-app project, then synthesize edges.",
"session_backfill_links", ["project"], "multi_intent"),
("Route this task: full rewrite of the logging subsystem. If cloud, just tell me.",
"session_task_route", ["task_description"], "multi_intent"),
("Export memory to /var/backup, and then purge the old knowledge entries from the legacy project.",
"session_export_memory", ["output_path"], "multi_intent"),
("Find what we discussed about caching strategies, then set a 30-day retention on that knowledge.",
"session_search_memory", ["query"], "multi_intent"),
("Record a success milestone: zero-downtime deploy of version 4.2. Then compact the ledger.",
"session_save_experience", [], "multi_intent"),
("Load the fraud-detection project context and then synthesize all session edges.",
"session_load_context", ["project"], "multi_intent"),
("Save what we accomplished: rewrote the ingestion pipeline. Then hand it off to the ops team.",
"session_save_ledger", [], "multi_intent"),
# ===========================================================================
# CATEGORY 6: verifier (25 cases)
# session_synthesize_edges / session_backfill_links / session_health_check patterns.
# ===========================================================================
# session_synthesize_edges
("Make sure all session graph edges are consistent for the auth-gateway project.",
"session_synthesize_edges", ["project"], "verifier"),
("Run a synthesis pass to validate all edges are up to date for the orchestration project.",
"session_synthesize_edges", ["project"], "verifier"),
("Verify graph integrity — synthesize edges for the content-delivery project.",
"session_synthesize_edges", ["project"], "verifier"),
("Before closing out, check that all session links are consistent for the scheduling project.",
"session_synthesize_edges", ["project"], "verifier"),
("Ensure all session relationships are properly synthesized for the warehouse-api project.",
"session_synthesize_edges", ["project"], "verifier"),
("Run edge synthesis on the real-time-alerts project to validate the session graph.",
"session_synthesize_edges", ["project"], "verifier"),
("Validate that all edges in the session graph are consistent for the pricing-engine project.",
"session_synthesize_edges", ["project"], "verifier"),
("Confirm session link consistency for the document-processing project.",
"session_synthesize_edges", ["project"], "verifier"),
# session_backfill_links
("There are broken cross-session links in the search-backend project. Backfill them.",
"session_backfill_links", ["project"], "verifier"),
("Reconnect all dangling references in the identity-service project history.",
"session_backfill_links", ["project"], "verifier"),
("Patch the missing links between sessions for the payments-v3 project.",
"session_backfill_links", ["project"], "verifier"),
("Fix the link gaps in our session history for the recommendation-service project.",
"session_backfill_links", ["project"], "verifier"),
("Backfill any missing cross-session connections for the notification-hub project.",
"session_backfill_links", ["project"], "verifier"),
("Reconnect broken session references in the compliance-tracker project.",
"session_backfill_links", ["project"], "verifier"),
("Repair missing session links for the api-gateway project.",
"session_backfill_links", ["project"], "verifier"),
# session_health_check
("Before I start a new sprint, confirm the memory system is operating correctly.",
"session_health_check", [], "verifier"),
("The search results seem incomplete. Check if the memory backend is healthy.",
"session_health_check", [], "verifier"),
("I'm seeing weird behavior in session recall. Run a diagnostic check.",
"session_health_check", [], "verifier"),
("Ping the memory system and confirm it's all healthy.",
"session_health_check", [], "verifier"),
("Is the Prism memory backend operating within normal parameters?",
"session_health_check", [], "verifier"),
("Double-check the memory infrastructure health before I rely on these results.",
"session_health_check", [], "verifier"),
("Verify the memory system is functioning before we start the long session.",
"session_health_check", [], "verifier"),
("Run a full health check and report back on the memory backend status.",
"session_health_check", [], "verifier"),
("Something is off with memory recall. Diagnose the backend.",
"session_health_check", [], "verifier"),
("Confirm the session memory system is healthy before I save this handoff.",
"session_health_check", [], "verifier"),
# ===========================================================================
# CATEGORY 7: cascade (25 cases)
# Explicit first-step-of-chain patterns — model must pick the right FIRST tool.
# ===========================================================================
("Search our knowledge for gRPC patterns, then upvote the most relevant entry.",
"knowledge_search", ["query"], "cascade"),
("Load the indexing-service context, then search for any past notes on shard rebalancing.",
"session_load_context", ["project"], "cascade"),
("Check memory health, then compact the alerts project ledger if there are stale entries.",
"session_health_check", [], "cascade"),
("Export all memory to /tmp/archive, then set a 180-day retention policy on the archive project.",
"session_export_memory", ["output_path"], "cascade"),
("Search for what we decided about the event schema design, then save a handoff about it.",
"session_search_memory", ["query"], "cascade"),
("Save today's session notes for the pipeline project, then create a handoff for the next agent.",
"session_save_ledger", [], "cascade"),
("Should the local model handle this concurrency refactor? If cloud, stop there.",
"session_task_route", ["task_description"], "cascade"),
("Search knowledge for CQRS trade-offs, downvote anything recommending a single store.",
"knowledge_search", ["query"], "cascade"),
("Compact the ledger for the embeddings project, then synthesize the session edges.",
"session_compact_ledger", ["project"], "cascade"),
("Load the feature-flags project context, then log that we shipped the A/B framework.",
"session_load_context", ["project"], "cascade"),
("Run a health check first, then based on results decide whether to compact or export.",
"session_health_check", [], "cascade"),
("Search memory for past decisions about SSE vs WebSockets, then record what we found.",
"session_search_memory", ["query"], "cascade"),
("Backfill the missing links for the analytics project, then synthesize the edges.",
"session_backfill_links", ["project"], "cascade"),
("Load context for the tenant-management project, then search for any open migration tickets.",
"session_load_context", ["project"], "cascade"),
("Find what we know about zero-copy networking, then save a handoff with that context.",
"session_search_memory", ["query"], "cascade"),
("Export to /backups/weekly, then compact the media-processing ledger.",
"session_export_memory", ["output_path"], "cascade"),
("Search our knowledge base for Kubernetes resource quotas, then set a 60-day retention.",
"knowledge_search", ["query"], "cascade"),
("Save the experience: we eliminated 80% of unnecessary re-renders. Then route the next task.",
"session_save_experience", [], "cascade"),
("Synthesize edges for the audit-log project, then backfill any missing links.",
"session_synthesize_edges", ["project"], "cascade"),
("Load the risk-assessment project context and then search memory for past risk audit notes.",
"session_load_context", ["project"], "cascade"),
("Find our notes on the transaction saga pattern, then upvote the best entry.",
"session_search_memory", ["query"], "cascade"),
("Compact the metrics project ledger, then export it to /tmp/metrics-backup.",
"session_compact_ledger", ["project"], "cascade"),
("Route this task: implement distributed tracing with OpenTelemetry across five services.",
"session_task_route", ["task_description"], "cascade"),
("Save what we accomplished: added RBAC support to the admin API. Then synthesize edges.",
"session_save_ledger", [], "cascade"),
("Search knowledge for eventual consistency patterns, then forget the entries about using global locks.",
"knowledge_search", ["query"], "cascade"),
# ===========================================================================
# CATEGORY 8: param_extraction (25 cases)
# Params ARE mentioned in the prompt — test that model extracts them correctly.
# ===========================================================================
("Load the full context for the fraud-detection project at a deep level.",
"session_load_context", ["project"], "param_extraction"),
("Compact the session ledger for the user-identity project.",
"session_compact_ledger", ["project"], "param_extraction"),
("Save a handoff note for the supplier-portal project.",
"session_save_handoff", ["project"], "param_extraction"),
("Delete the memory entry with ID mem-fg33-hh. It has the wrong branch name.",
"session_forget_memory", ["memory_id"], "param_extraction"),
("Export all memory data to /exports/2026-q2 in JSON format.",
"session_export_memory", ["output_path", "format"], "param_extraction"),
("Set the retention policy for the experiment-runner project to 45 days.",
"knowledge_set_retention", ["project"], "param_extraction"),
("Search session memory for 'distributed tracing setup'.",
"session_search_memory", ["query"], "param_extraction"),
("Search the knowledge base for 'idempotency keys in payment APIs'.",
"knowledge_search", ["query"], "param_extraction"),
("Backfill the cross-session links for the warehouse-inventory project.",
"session_backfill_links", ["project"], "param_extraction"),
("Synthesize session edges for the logistics-optimizer project.",
"session_synthesize_edges", ["project"], "param_extraction"),
("Forget the knowledge entry with ID ki-cc44-gg — that approach is deprecated.",
"knowledge_forget", [], "param_extraction"),
("Upvote the knowledge entry with ID ki-tt55-rr. Really solid documentation.",
"knowledge_upvote", [], "param_extraction"),
("Downvote knowledge entry ki-uu99-qq — it recommends a vulnerable library.",
"knowledge_downvote", [], "param_extraction"),
("Configure an 80-day retention policy for the beta-features project's knowledge.",
"knowledge_set_retention", ["project"], "param_extraction"),
("Load context for the platform-core project.",
"session_load_context", ["project"], "param_extraction"),
("Export the archive to /data/long-term-backup in markdown format.",
"session_export_memory", ["output_path", "format"], "param_extraction"),
("Search for 'zero-downtime database migrations' in our session history.",
"session_search_memory", ["query"], "param_extraction"),
("Search knowledge for 'CQRS vs event sourcing trade-offs'.",
"knowledge_search", ["query"], "param_extraction"),
("Compact the ledger for the monitoring-stack project.",
"session_compact_ledger", ["project"], "param_extraction"),
("Delete memory entry mem-pp12-ss — wrong model version was recorded.",
"session_forget_memory", ["memory_id"], "param_extraction"),
("Save a handoff for the checkout-v4 project.",
"session_save_handoff", ["project"], "param_extraction"),
("Route this task: rewrite the message broker integration to use NATS instead of RabbitMQ.",
"session_task_route", ["task_description"], "param_extraction"),
("Synthesize edges for the ingestion-pipeline project.",
"session_synthesize_edges", ["project"], "param_extraction"),
("Backfill the missing session links in the content-catalog project.",
"session_backfill_links", ["project"], "param_extraction"),
("Set 120-day retention on the compliance-logs project's knowledge.",
"knowledge_set_retention", ["project"], "param_extraction"),
# ===========================================================================
# CATEGORY 9: abstention (20 cases)
# Greetings, capability questions, general CS — must return NO_TOOL.
# ===========================================================================
("Hi there!", "NO_TOOL", [], "abstention"),
("Good morning!", "NO_TOOL", [], "abstention"),
("Hey, quick question — what's your name?", "NO_TOOL", [], "abstention"),
("What tools do you have available?", "NO_TOOL", [], "abstention"),
("What are your capabilities?", "NO_TOOL", [], "abstention"),
("Can you explain what Prism Memory tools do?", "NO_TOOL", [], "abstention"),
("What programming languages do you know?", "NO_TOOL", [], "abstention"),
("Thanks, that's all for now!", "NO_TOOL", [], "abstention"),
("Great work today, goodbye.", "NO_TOOL", [], "abstention"),
("You're really helpful, thanks!", "NO_TOOL", [], "abstention"),
("What is the capital of France?", "NO_TOOL", [], "abstention"),
("Tell me a joke.", "NO_TOOL", [], "abstention"),
("How do you work?", "NO_TOOL", [], "abstention"),
("Are you GPT-4?", "NO_TOOL", [], "abstention"),
("Can you write me a poem?", "NO_TOOL", [], "abstention"),
("What's the weather like today?", "NO_TOOL", [], "abstention"),
("Can you recommend a good book?", "NO_TOOL", [], "abstention"),
("What's 2+2?", "NO_TOOL", [], "abstention"),
("Do you have feelings?", "NO_TOOL", [], "abstention"),
("What is machine learning?", "NO_TOOL", [], "abstention"),
]
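# ---------------------------------------------------------------------------
# Illustrative sketch (not part of the original harness): one way a single
# TESTS entry could be scored. The tuple layout (prompt, expected_tool,
# required_params, category) is read off the entries above. The helper name
# `_example_score_case` and the rule "every listed param must appear as a key
# in the predicted arguments" are assumptions made for illustration only.
# ---------------------------------------------------------------------------
def _example_score_case(case, predicted_tool, predicted_args):
    """Return (tool_ok, params_ok) for one test tuple under the assumed rule."""
    _prompt, expected_tool, required_params, _category = case
    tool_ok = predicted_tool == expected_tool
    # Params only count when the right tool was chosen.
    params_ok = tool_ok and all(p in (predicted_args or {}) for p in required_params)
    return tool_ok, params_ok
# Usage, with an entry copied from the disambiguation category:
#   case = ("Check the knowledge base for anything on event sourcing trade-offs.",
#           "knowledge_search", ["query"], "disambiguation")
#   _example_score_case(case, "knowledge_search", {"query": "event sourcing"})
#   # returns (True, True)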
# ---------------------------------------------------------------------------
# Sanity check: verify exactly 300 cases and the expected count per category
# (prints a warning and returns False on any mismatch)
# ---------------------------------------------------------------------------
_TARGET_COUNTS = {
"natural_phrasing": 50,
"adversarial_trap": 70,
"disambiguation": 40,
"edge_case": 25,
"multi_intent": 20,
"verifier": 25,
"cascade": 25,
"param_extraction": 25,
"abstention": 20,
}
_TOTAL_TARGET = 300
def _verify_test_counts():
    from collections import Counter
    counts = Counter(t[3] for t in TESTS)
    errors = []
    for cat, expected in _TARGET_COUNTS.items():
        actual = counts.get(cat, 0)
        if actual != expected:
            errors.append(f" {cat}: expected {expected}, got {actual}")
    if len(TESTS) != _TOTAL_TARGET:
        errors.append(f" TOTAL: expected {_TOTAL_TARGET}, got {len(TESTS)}")
    if errors:
        print("WARNING: test count mismatches:")
        for e in errors:
            print(e)
    return len(errors) == 0
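# ---------------------------------------------------------------------------
# Illustrative sketch (an assumption, not the original harness): the check
# above only prints a warning and returns False. A stricter, self-contained
# variant that raises on any mismatch could look like this. The name
# `_example_assert_counts` and its parameters are hypothetical.
# ---------------------------------------------------------------------------
from collections import Counter
def _example_assert_counts(tests, target_counts, total_target):
    """Raise ValueError listing every category whose count is off."""
    counts = Counter(t[3] for t in tests)
    errors = [
        f"{cat}: expected {expected}, got {counts.get(cat, 0)}"
        for cat, expected in target_counts.items()
        if counts.get(cat, 0) != expected
    ]
    if len(tests) != total_target:
        errors.append(f"TOTAL: expected {total_target}, got {len(tests)}")
    if errors:
        raise ValueError("test count mismatches: " + "; ".join(errors))
# Usage: _example_assert_counts(TESTS, _TARGET_COUNTS, _TOTAL_TARGET) would
# stop a run early instead of scoring against a malformed suite.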
# ---------------------------------------------------------------------------
# Layer 3: Inference-Time False-Positive Rejection + Remapping
# (Copied and merged from swe_bench_test.py — all current rules preserved)
# ---------------------------------------------------------------------------
GENERAL_PROGRAMMING_PATTERNS = [
# Python context managers
r'\bcontext\s+manager\b', r'\bcontextlib\b', r'\b__enter__\b', r'\b__exit__\b',
r'\basync\s+context\s+manager\b',
# ML / LSTM forget gates
r'\bforget\s+gate\b', r'\blstm\b', r'\bcatastrophic\s+forgetting\b',
r'\bforget\s+bias\b', r'\belastic\s+weight\s+consolidation\b',
# Web framework sessions
r'\bexpress\.js\b', r'\bdjango\b', r'\bflask\b', r'\bfastapi\b',
r'\bsession_start\(\)', r'\bsession\s+middleware\b', r'\bsession\s+affinity\b',
# General CS
r'\bgarbage\s+collection\b', r'\bgc\s+algorithm\b',
r'\bmemory\s+management\s+in\s+rust\b',
r'\bload\s+balanc', r'\bnginx\b', r'\bhaproxy\b',
r'\bcontext\s+switch',
r'\bsearch\s+algorithm\b',
r'\bsearch\s+functionality\s+with\s+elasticsearch\b',
r'\bhealth\s+check\s+endpoint\s+pattern\b',
r'\belasticsearch\b', r'\bsolr\b', r'\blucene\b',
r'\bretention\s+polic(?:y|ies)\s+(?:in|for|with)\s+(?:kafka|s3|aws|gcp|azure|cloud)',
r'\bpostgresql\b.*\bmongodb\b', r'\bmongodb\b.*\bpostgresql\b',
r'\bwrite\s+a\s+decorator\b', r'\bdecorator.*retries?\b',
r'\bci/cd\b', r'\bgithub\s+actions\b',
r'\bcors\b.*\bnode\.js\b', r'\bnode\.js\b.*\bcors\b',
r'\bcap\s+theorem\b', r'\bbinary\s+search\s+tree\b',
r'\bvirtual\s+dom\b', r'\breact\b.*\breconciliation\b',
r'\bdependency\s+injection\b',
r'\btcp\b.*\budp\b', r'\budp\b.*\btcp\b',
r'\btime\s+complexity\b', r'\bquicksort\b',
r'\bexponential\s+backoff\b', r'\bjitter\b.*\bretri', r'\bapi\s+retri',
r'\bcelery\b.*\bqueue', r'\broute\s+tasks?\s+in\s+celery\b',
r'\bknowledge\s+graph\b.*\b(?:function|search|algorithm|traversal)\b',
r'\b(?:function|write\s+a\s+function|implement)\b.*\bknowledge\s+graph\b',
r'\bsave\s+(?:user\s+)?preferences?\s+in\s+(?:react|redux|localstorage|a\s+database)\b',
r'\bexport\s+(?:data\s+)?from\s+(?:postgresql|mysql|sqlite|a\s+database)\b',
r'\bpostgresql\b.*\bcsv\b', r'\bcsv\b.*\bpostgresql\b',
# Additional patterns from bfcl_eval.py
r'\bgoroutine\b', r'\bwrite\s+barrier\b', r'\brust\b.*\bborrow\b',
r'\barc\b.*\bmutex\b', r'\bpin\b.*\bfuture\b',
r'\bwindow\s+function\b', r'\bmongodb\b', r'\bmongoexport\b',
r'\bdijkstra\b', r'\bdepth.first\s+search\b', r'\bconsistent\s+hashing\b',
r'\bb.tree\b', r'\bbloom\s+filter\b', r'\blru\s+cache\b', r'\bordereddict\b',
r'\bhorizontalpodautoscal', r'\bprometheus\b', r'\betcd\b', r'\braft\b',
r'\bzerolog\b', r'\belk\s+stack\b', r'\bopentelemetry\b',
r'\bcrdts?\b', r'\bsaga\s+pattern\b',
r'\btrie\b', r'\bweakmap\b', r'\bpromise.based\s+queue\b',
r'\bcovering\s+index\b', r'\bmaterialized\s+view\b',
r'\btf-idf\b', r'\btrigram\b', r'\bfuzzy\s+search\b',
r'\btopological\s+sorts?\b', r'\bcycle\s+detection\b',
r'\bprim.s\b', r'\bkruskal.s\b', r'\bspanning\s+tree\b',
r'\bhot.module\s+replacement\b', r'\bvite\b',
r'\bserver\s+component\b', r'\bclient\s+component\b',
r'\bdocker(?:file)?\b', r'\bblue.green\s+deploy', r'\brolling\s+deploy',
r'\bsticky\s+session\b', r'\bsession\s+replication\b', r'\bsession\s+fixation\b',
r'\bjwt\b.*\bhttponly\b',
r'\bpaging\b.*\bmemory\b', r'\bmmap\b', r'\bstack\s+vs\s+heap\b',
r'\bv8\s+engine\b', r'\bgenerational\s+collection\b',
r'\boptimistic\s+lock', r'\bpessimistic\s+lock',
r'\beventual\s+consistency\b.*\bdynamo',
# General knowledge / weather / math
r"what'?s\s+the\s+weather\b", r'\bforecast\b.*\btoday\b',
r'\bwrite\s+a\s+sql\s+query\b', r'\bsecond.highest\s+salary\b',
r'\bsql\s+query\s+(?:that|to)\b',
]
PRISM_INTENT_PATTERNS = [
r'\bprism\b', r'\bsession\s*ledger\b', r'\bhandoff\b', r'\bknowledge\s+base\b',
r'\bknowledge\s+items?\b', r'\bour\s+knowledge\b',
r'\bsave.*(?:session|ledger|handoff)\b', r'\bload\s+context\b',
r'\b(?:search|find).*(?:memory|sessions?|conversations?|notes)\b',
r'\bproject\b', r'\bwhat\s+(?:do\s+)?we\s+(?:know|have)\b',
r'\binstitutional\s+knowledge\b', r'\bdocumented\b', r'\bcurated\b',
r'\bmemory\s+entry\b', r'\bmemory\s+backend\b', r'\bdiagnostics\b',
r'\bledger\b', r'\bcompact\b.*(?:ledger|entries|session)\b',
r'\bexport.*(?:memory|backup)\b', r'\b(?:delete|nuke|wipe|remove).*(?:entry|memory|entries)\b',
r'\blog.*(?:what|accomplished|session)\b', r'\brecord.*(?:session|what)\b',
r'\bhand.*(?:off|over)\b', r'\bbring.*up\s+to\s+speed\b',
r'\bbug\s+fix.*(?:local\s+model|handle)\b', r'\broute.*(?:task|this)\b',
r'\bbackfill\b', r'\bsynthesize\b', r'\bsession\s+graph\b',
r'\bsession\s+links?\b', r'\bedges?\s+(?:up\s+to\s+date|consistent)\b',
r'\bgraph\s+integrit', r'\bdangling\b', r'\breconnect.*(?:session|links?|references?)\b',
r'\bpatch.*(?:links?|gaps?)\b', r'\bmissing\s+links?\b',
r'\bsave\s+experience\b', r'\brecord\s+(?:a\s+)?milestone\b',
r'\brecord\s+(?:a\s+)?success\b', r'\bupvote\b', r'\bdownvote\b',
r'\bretention\s+polic(?:y|ies)\b', r'\bauto.expir', r'\bttl\b',
r'\bknowledge\s+entry\b', r'\bknowledge\s+record\b',
]
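# ---------------------------------------------------------------------------
# Illustrative sketch (an assumption about how the two lists above might be
# combined; the actual rejection logic is not shown in this part of the file):
# treat a prompt as a likely false positive when it matches a
# general-programming pattern and shows no Prism intent. The helper name and
# the tiny sample lists are hypothetical.
# ---------------------------------------------------------------------------
import re
_EXAMPLE_GENERAL = [r'\bforget\s+gate\b', r'\blstm\b', r'\bload\s+balanc']
_EXAMPLE_PRISM = [r'\bprism\b', r'\bledger\b', r'\bknowledge\s+base\b']
def _example_is_false_positive(prompt, general=_EXAMPLE_GENERAL, prism=_EXAMPLE_PRISM):
    """True when a general-programming pattern hits and no Prism pattern does."""
    text = prompt.lower()
    hits_general = any(re.search(p, text) for p in general)
    hits_prism = any(re.search(p, text) for p in prism)
    return hits_general and not hits_prism
# Under this rule a prompt that mentions both an LSTM and the ledger is kept,
# because Prism intent takes priority over the general-programming match.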
def validate_tool_call(prompt, tool_name, tool_args):
"""Layer 3: reject obvious false-positive tool calls and remap semantic neighbors.
Copied from swe_bench_test.py with additions from bfcl_eval.py.
Returns (tool_name, tool_args) — possibly changed if rejected or remapped.
"""
prompt_lower = prompt.lower()
# Special NO_TOOL override: "confirm session link/graph consistency" → synthesize_edges
if tool_name in ("NO_TOOL", "ERROR"):
if re.search(r'\b(?:confirm|verify|validate|check|ensure)\b', prompt_lower):
if re.search(r'\bsession\s+(?:link|edge|graph)\s+(?:consistency|consistent)\b', prompt_lower):
proj_m = re.search(r'\b(?:for|on)\s+(?:the\s+)?([a-zA-Z][a-zA-Z0-9_-]+)\s+project\b', prompt_lower)
return 'session_synthesize_edges', ({'project': proj_m.group(1)} if proj_m else {})
return tool_name, tool_args
# --- Group B remaps (before false-positive rejection) ---
# "reconnect/patch up/dangling links" → backfill_links
# But don't remap when "synthesize edges" is the explicit first action
if tool_name in ('session_synthesize_edges', 'session_reconnect'):
if re.search(r'\b(?:reconnect|backfill|patch\s+up|dangling|link\s+gaps?|missing\s+links?|fix\s+links?)\b', prompt_lower):
if not re.search(r'^synthesize\b', prompt_lower) and \
not re.search(r'\bsynthesiz\w+\s+edges?\s+for\b', prompt_lower):
return 'session_backfill_links', tool_args
# "verify/check/make sure session links/edges are consistent / graph integrity" → synthesize_edges
if tool_name in ('session_health_check', 'session_backfill_links'):
_has_verify_verb = re.search(
r'\b(?:verify|validate|check|make\s+sure|ensure|confirm)\b', prompt_lower
)
_has_consistent_edge = re.search(
r'\b(?:edges?|links?|graph)\b.*?\b(?:consistent|up\s+to\s+date|synthesized)\b'
r'|\bconsistent\b.*?\b(?:edges?|links?|graph)\b'
r'|\bsession\s+links?\b'
r'|\bgraph\s+integrit',
prompt_lower, re.DOTALL
)
if _has_verify_verb and _has_consistent_edge:
return 'session_synthesize_edges', tool_args
# "synthesize edges for X, then backfill" → synthesize_edges is the FIRST action
if tool_name == 'session_backfill_links':
if re.search(r'(?:^|\bfirst\b|\bstart\s+with)\s*synthesize\s+edges?\b', prompt_lower) or \
re.search(r'^synthesize\b', prompt_lower):
return 'session_synthesize_edges', tool_args
# "wipe/clear old entries from knowledge base" → knowledge_forget (not compact_ledger)
# BUT protect "session entries" / "session history" from this remap
if tool_name == 'session_compact_ledger':
if re.search(r'\bknowledge\b', prompt_lower) and re.search(r'\b(?:wipe|clear|delete|remove|entries)\b', prompt_lower):
if not re.search(r'\bsession\s+(?:entries|history|ledger)\b', prompt_lower):
return 'knowledge_forget', tool_args
# "prune/trim/archive old session entries" → session_compact_ledger (not forget_memory)
if tool_name in ('session_forget_memory', 'knowledge_forget'):
if re.search(r'\b(?:prune|trim|archive|compress)\b', prompt_lower) and re.search(r'\b(?:session|ledger)\s+(?:entries|history)?\b', prompt_lower):
proj_m = re.search(r'\b(?:for|on)\s+(?:the\s+)?([a-zA-Z][a-zA-Z0-9_-]+)\s+project\b', prompt_lower)
return 'session_compact_ledger', ({'project': proj_m.group(1)} if proj_m else tool_args)
    # "archive old entries" (without 'knowledge') → session_compact_ledger
    if tool_name == 'session_forget_memory':
        if re.search(r'\b(?:archive|prune|trim)\s+old\s+entries\b', prompt_lower):
            # mem-[a-z0-9]+ (not a single character) so multi-character IDs such as "mem-abc123" are recognized
            if not re.search(r'\bknowledge\b', prompt_lower) and not re.search(r'\bmemory[_\s]id\b|mem-[a-z0-9]+\b', prompt_lower):
                return 'session_compact_ledger', tool_args
    # "knowledge entries/items/records" + delete verbs → knowledge_forget (not session_forget_memory)
    if tool_name == 'session_forget_memory':
        if re.search(r'\bknowledge\s+(?:entr|items?|records?|base)\b', prompt_lower):
            return 'knowledge_forget', tool_args
        if re.search(r'\bknowledge\s+base\b', prompt_lower) and re.search(r'\b(?:entries|records|items)\b', prompt_lower):
            return 'knowledge_forget', tool_args
        # "delete/wipe entries from [project]" without a specific memory ID → knowledge_forget
        if re.search(r'\b(?:entries|records|logs?)\b', prompt_lower) and re.search(r'\bproject\b', prompt_lower):
            if not re.search(r'\bmemory[_\s]id\b|mem-[a-z0-9]|ID\s*[=:]\s*\S+', prompt):
                if not re.search(r'\b(?:session|ledger)\b', prompt_lower):
                    proj_m = re.search(r'(?:for|from|in)\s+(?:the\s+)?([a-zA-Z][a-zA-Z0-9_-]+)\s+project', prompt_lower, re.I)
                    return 'knowledge_forget', {'project': proj_m.group(1) if proj_m else None}
    # "where were we / bring me up to speed / catch me up" → session_load_context (not session_search_memory)
    if tool_name == 'session_search_memory':
        if re.search(r'\bwhere\s+were\s+we\b|\bbring\s+me\s+up\s+to\s+speed\b|\bcatch\s+me\s+up\b|\bwhat\s+were\s+we\s+(?:doing|working)', prompt_lower):
            project_m = re.search(
                r'\b(?:on|for|with|of)\s+(?:the\s+)?([a-zA-Z][a-zA-Z0-9_-]+)\s+project\b'
                r'|\b([a-zA-Z][a-zA-Z0-9_-]+)\s+project\b'
                r'|(?:state\s+of\s+(?:the\s+)?)([a-zA-Z][a-zA-Z0-9_-]+)(?:\s+project)?\b',
                prompt_lower
            )
            if project_m:
                project = next((g for g in project_m.groups() if g and g not in ('the', 'a', 'this', 'that', 'my', 'our')), None)
            else:
                project = None
            return 'session_load_context', ({'project': project} if project else {})
    # "accumulated documentation / knowledge base" → knowledge_search (not session_search_memory)
    if tool_name == 'session_search_memory':
        if re.search(r'\baccumulated\s+documentation\b|\bknowledge\s+base\b', prompt_lower):
            return 'knowledge_search', tool_args
    # "recent / past / last week / what we did" → session_search_memory (not knowledge_search)
    if tool_name == 'knowledge_search':
        session_hints = [
            r'\brecent\b', r'\bpast\b', r'\blast\s+(?:week|month|session)',
            r'\bwhat\s+we\s+(?:did|decided|worked)', r'\bdeployment\s+issues\b',
        ]
        if any(re.search(p, prompt_lower) for p in session_hints):
            return 'session_search_memory', tool_args
    # "remind me / did we ever decide" → session_search_memory (not load_context)
    if tool_name == 'session_load_context':
        if re.search(r'\bremind\s+me\b|\bdid\s+we\s+ever\s+(?:decide|settle|choose|pick)\b|\bwhat\s+did\s+we\s+decide\b', prompt_lower):
            if not re.search(r'\bbring\s+me\s+up\s+to\s+speed\b|\bwhere\s+were\s+we\b|\bcatch\s+me\s+up\b|\bload\s+.*\bcontext\b', prompt_lower):
                return 'session_search_memory', {"query": prompt[:120]}
    # "jot down / write down / make a note / log what just happened" → session_save_ledger
    _LEDGER_TRIGGERS = re.compile(
        r'\bjot\s+down\b|\bwrite\s+(?:it\s+)?down\b|\bwhat\s+we\s+accomplished\b'
        r'|\bmake\s+sure\s+it.{0,10}written\b|\brecord\s+(?:this\s+session|what)\b'
        r'|\bmake\s+(?:a\s+)?note\s+(?:that|of)\b|\blog\s+what\s+just\s+happened\b'
        r'|\bwrite\s+down\s+everything\b|\bbefore\s+I\s+(?:close|head\s+out)\b',
        re.IGNORECASE
    )
    # negative: milestone/achievement events that belong in save_experience
    _EXPERIENCE_NEGATIVE = re.compile(
        r'\b(?:successfully|milestone|achievement|deployed\s+the|shipped\s+the|launched\s+the'
        r'|we\s+(?:fixed|built|completed|created|resolved|deployed|shipped|launched)\s+the'
        r'|race\s+condition|solid\s+now|zero.downtime)\b'
    )
    # Unambiguous note-taking phrases bypass the milestone negative check
    _NOTE_TRIGGERS = re.compile(
        r'\bmake\s+(?:a\s+)?note\s+(?:that|of)\b|\bjot\s+down\b'
        r'|\bwrite\s+(?:it\s+)?down\b|\blog\s+what\s+just\s+happened\b',
        re.IGNORECASE
    )
    if tool_name in ('session_save_experience', 'session_task_route'):
        if _LEDGER_TRIGGERS.search(prompt):
            if _NOTE_TRIGGERS.search(prompt) or not _EXPERIENCE_NEGATIVE.search(prompt_lower):
                if 'content' in tool_args and 'summary' not in tool_args:
                    tool_args = dict(tool_args)
                    tool_args['summary'] = tool_args.pop('content')
                if 'summary' not in tool_args:
                    work_m = re.search(r'(?:we\s+)?((?:rewrote|fixed|refactored|built|deployed|updated|added|removed|finalized|completed|migrated)\s+.{10,120})', prompt, re.I)
                    if not work_m:
                        work_m = re.search(r'(?:make\s+a\s+note|log|note)\s+(?:that\s+)?(?:we\s+)?(completed|finished|did|wrote|refactored|migrated).{0,120}', prompt, re.I)
                    if work_m:
                        tool_args = dict(tool_args)
                        tool_args['summary'] = work_m.group(0).strip().rstrip('.')
                return 'session_save_ledger', tool_args
    # "record that we fixed/built/resolved [thing]" → session_save_experience (milestone)
    if tool_name == 'session_save_ledger':
        if re.search(r'\brecord\s+that\s+we\s+(?:fixed|built|completed|created|resolved|deployed|shipped|launched)\b', prompt_lower):
            return 'session_save_experience', {"project": tool_args.get("project"), "event_type": "milestone"}
    # content → summary normalization + inline extraction for session_save_ledger
    if tool_name == 'session_save_ledger':
        if 'content' in tool_args and 'summary' not in tool_args:
            tool_args = dict(tool_args)
            tool_args['summary'] = tool_args.pop('content')
        if 'summary' not in tool_args:
            work_m = re.search(r'(?:we\s+)?((?:rewrote|fixed|refactored|built|deployed|updated|added|removed|finalized|completed|migrated)\s+.{10,120})', prompt, re.I)
            if not work_m:
                work_m = re.search(r'(?:log|note|record)\s+(?:what\s+just\s+happened|this|that)\s*[:;]\s*(.{10,120})', prompt, re.I)
            if work_m:
                tool_args = dict(tool_args)
                tool_args['summary'] = (work_m.group(1) if work_m.lastindex else work_m.group(0)).strip().rstrip('.')
    # "log that we successfully deployed/shipped" → session_save_experience milestone (not save_ledger)
    if tool_name == 'session_save_ledger':
        if re.search(r'\blog\s+that\s+we\s+successfully\b|\bsuccessfully\s+deployed\b|\bsuccessfully\s+shipped\b|\bsuccessfully\s+launched\b', prompt_lower):
            return 'session_save_experience', {"project": tool_args.get("project"), "event_type": "success"}
    # "shift change / store current state for next agent" → session_save_handoff
    if tool_name == 'session_save_ledger':
        if re.search(r'\bshift\s+change\b|\bstore\s+(?:the\s+)?current\s+state\s+for\b|\bnext\s+(?:agent|person|developer)\s+can\s+continue\b|\bhand.*over\b|\bpick.*up\s+next\b', prompt_lower):
            return 'session_save_handoff', tool_args
    # Multi-intent: "Search/Find ... THEN upvote/downvote" → first action is search
    if tool_name in ('knowledge_upvote', 'knowledge_downvote'):
        if re.search(r'\bthen\s+(?:upvote|downvote|boost|rate\s+up|rate\s+down)\b', prompt_lower):
            if re.search(r'^(?:search|find|look\s+up)\b', prompt_lower):
                query_m = re.search(
                    r'^(?:search\s+(?:for\s+)?|find\s+(?:our\s+)?(?:notes?\s+on\s+)?|look\s+up\s+)(.+?)(?:,?\s*then\b)',
                    prompt, re.I
                )
                return 'session_search_memory', {"query": query_m.group(1).strip() if query_m else prompt[:120]}
    # invalid tool name → try retention or upvote/downvote
    if tool_name not in VALID_TOOLS:
        if re.search(r'\b(?:auto.?expir|ttl\b|\d+\s*days?\s+(?:retention|expir)|\bretention\s*polic)', prompt_lower):
            return 'knowledge_set_retention', tool_args
        # fall through to upvote/downvote patterns below
    # knowledge_forget / knowledge_set_retention → upvote/downvote protection
    _UPVOTE_SET = {'knowledge_forget', 'knowledge_set_retention', 'session_forget_memory',
                   'session_task_route', 'session_search_memory'}
    # Don't remap to upvote/downvote when primary intent is "search THEN upvote"
    _is_search_then_vote = (
        re.search(r'^(?:search|find|look\s+up)\b', prompt_lower) and
        re.search(r'\bthen\s+(?:upvote|downvote|boost|rate\s+up|rate\s+down)\b', prompt_lower)
    )
    if (tool_name in _UPVOTE_SET or tool_name not in VALID_TOOLS) and not _is_search_then_vote:
        _id_val = (tool_args.get("id") or tool_args.get("knowledge_id") or tool_args.get("entry_id")) if isinstance(tool_args, dict) else None
        if re.search(r'\b(?:upvote|boost|increase\s+(?:the\s+|its\s+)?(?:rank|score|importance)|uprate|thumbs[\s-]?up|mark\s+(?:it\s+)?(?:up|helpful|useful|great|good)|importance\s+score)\b', prompt_lower):
            return 'knowledge_upvote', {"id": _id_val}
        if re.search(r'\b(?:downvote|lower\s+(?:the\s+|its\s+)?(?:rank|score)|not\s+useful|derank|thumbs[\s-]?down|reduce\s+(?:the\s+|its\s+)?(?:rank|score)|mark\s+(?:it\s+)?(?:down|bad|wrong|outdated|terrible))\b', prompt_lower):
            return 'knowledge_downvote', {"id": _id_val}
    # session_load_context: extract project from prompt if missing
    if tool_name == 'session_load_context':
        if not (isinstance(tool_args, dict) and tool_args.get('project')):
            proj_m = re.search(
                r'\b(?:on|for|of|with|in)\s+(?:the\s+)?([a-zA-Z][a-zA-Z0-9_-]+)\s+project\b'
                r'|\b([a-zA-Z][a-zA-Z0-9_-]+)\s+project\b'
                r'|(?:state\s+of\s+(?:the\s+)?)([a-zA-Z][a-zA-Z0-9_-]+)(?:\s+project)?\b',
                prompt_lower
            )
            if proj_m:
                proj = next((g for g in proj_m.groups() if g), None)
                if proj and proj not in ('the', 'a', 'this', 'that', 'my', 'our'):
                    tool_args = dict(tool_args) if isinstance(tool_args, dict) else {}
                    tool_args['project'] = proj
    # session_compact_ledger: extract project if missing
    if tool_name == 'session_compact_ledger':
        if not (isinstance(tool_args, dict) and tool_args.get('project')):
            proj_m = re.search(
                r'\b(?:for|on|of)\s+(?:the\s+)?([a-zA-Z][a-zA-Z0-9_-]+)\s+(?:project\s+)?ledger\b'
                r'|\b([a-zA-Z][a-zA-Z0-9_-]+)\s+project\s+ledger\b'
                r'|\b(?:compact|trim|prune|compress|archive)\s+(?:the\s+)?([a-zA-Z][a-zA-Z0-9_-]+)\s+(?:project|ledger)\b',
                prompt_lower
            )
            if proj_m:
                proj = next((g for g in proj_m.groups() if g), None)
                if proj and proj not in ('the', 'a', 'this', 'that', 'my', 'our', 'old', 'stale'):
                    tool_args = dict(tool_args) if isinstance(tool_args, dict) else {}
                    tool_args['project'] = proj
    # "is this something the local model can handle? / route this task" → session_task_route
    if tool_name == 'session_search_memory':
        if re.search(r'\b(?:local\s+(?:model|agent)\s+(?:can\s+handle|should\s+handle)|route\s+this\s+task|should\s+(?:I|the\s+local\s+model)\s+(?:tackle|handle)|is\s+this\s+(?:something|simple\s+enough)\s+(?:for\s+the\s+)?local)\b', prompt_lower):
            return 'session_task_route', {"task_description": prompt}
    # session_task_route: extract task_description from prompt
    if tool_name == 'session_task_route':
        if 'task_description' not in tool_args or not tool_args.get('task_description'):
            tool_args = dict(tool_args)
            tool_args['task_description'] = prompt
    # session_export_memory: extract output_path from path patterns, format from keywords
    if tool_name == 'session_export_memory':
        if not isinstance(tool_args, dict):
            tool_args = {}
        tool_args = dict(tool_args)
        if 'output_path' not in tool_args or not tool_args.get('output_path'):
            path_m = re.search(
                r'(?:save\s+to|(?:output|export|dump)\s+(?:to\s+)?["\']?|to\s+["\']?)(/[\w/.-]+|~/[\w/.-]+)',
                prompt, re.I
            )
            if path_m:
                tool_args['output_path'] = path_m.group(1)
        if 'format' not in tool_args or not tool_args.get('format'):
            fmt_m = re.search(r'\b(json|jsonl|markdown|csv|yaml)\b(?:\s+format)?', prompt_lower)
            if fmt_m:
                tool_args['format'] = fmt_m.group(1)
    # session_compact_ledger: protect "session entries" from knowledge_forget remap
    # (already handled above but ensure compact stays for session-specific prompts)
    # "where did we leave off / what was the state" → session_load_context
    if tool_name == 'session_search_memory':
        if re.search(r'\bwhere\s+did\s+we\s+leave\s+off\b|\bwhat\s+was\s+the\s+state\s+of\b|\bget\s+me\s+(?:re-?oriented|up\s+to\s+speed)\b|\bpull\s+up\s+(?:whatever|the\s+(?:full\s+)?context)', prompt_lower):
            project_m = re.search(r'\b(?:on|for|with|of)\s+(?:the\s+)?([a-zA-Z][a-zA-Z0-9_-]+)\s+project\b', prompt_lower)
            project = project_m.group(1) if project_m else None
            return 'session_load_context', ({'project': project} if project else {})
    # --- Social pleasantry rejection ---
    SOCIAL_PATTERNS = [
        r'^thanks', r'^thank you', r'^cheers', r'^goodbye', r'^bye',
        r"that's all", r"we're done", r"all done", r"all set",
        r'^ok\s+great', r'^perfect$', r'^nice$', r'^cool$',
        r'^hi\b', r'^hey\b', r'^hello\b', r'^good\s+morning', r'^good\s+afternoon',
    ]
    is_social = any(re.search(p, prompt_lower.strip()) for p in SOCIAL_PATTERNS)
    if is_social and not any(w in prompt_lower for w in [
        'save', 'export', 'search', 'load', 'record', 'log', 'run', 'check', 'find',
        'compact', 'handoff', 'route', 'synthesize', 'backfill', 'forget', 'upvote', 'downvote',
    ]):
        return "NO_TOOL", {}
    # --- False-positive rejection (CS patterns) ---
    is_general = any(re.search(p, prompt_lower) for p in GENERAL_PROGRAMMING_PATTERNS)
    if not is_general:
        return tool_name, tool_args
    has_prism_intent = any(re.search(p, prompt_lower) for p in PRISM_INTENT_PATTERNS)
    if has_prism_intent:
        return tool_name, tool_args
    return "NO_TOOL", {}


# ---------------------------------------------------------------------------
# Ollama Call
# ---------------------------------------------------------------------------
TOOL_CALL_NOPIPE_RE = re.compile(
    r'<tool_call>\s*(\{.*?\})\s*(?:</tool_call>|$)',
    re.DOTALL
)
TOOL_CALL_PIPE_RE = re.compile(
    r'<\|tool_call\|>\s*(\{.*?\})',
    re.DOTALL
)
BARE_JSON_RE = re.compile(
    r'(\{[^{}]*"name"\s*:\s*"[^"]+?"[^{}]*(?:\{[^{}]*\}[^{}]*)*\})'
)


def call_ollama(prompt: str, timeout: int = 120) -> tuple:
    """Call Ollama REST API with a pre-formatted ChatML prompt.
    Returns (raw_response, tool_name, tool_args, latency_secs).
    On a request failure, tool_name is "ERROR" and raw_response holds the exception text.
    """
    start = time.time()
    try:
        payload = json.dumps({
            "model": MODEL,
            "prompt": prompt,
            "stream": False,
            "raw": True,
            "options": {"temperature": 0.0, "num_predict": 512},
        }).encode("utf-8")
        req = urllib.request.Request(
            OLLAMA_API,
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            data = json.loads(resp.read().decode("utf-8"))
        raw = data.get("response", "").strip()
    except Exception as exc:
        return (str(exc), "ERROR", {}, time.time() - start)
    latency = time.time() - start
    # Strip CoT blocks
    clean = re.sub(
        r'<\|synalux_think\|>.*?(?:</\|synalux_think\|>|$)',
        '', raw, flags=re.DOTALL
    )
    # Strategy 0: no-pipe <tool_call>…</tool_call> (v43 native format)
    m = TOOL_CALL_NOPIPE_RE.search(clean)
    if m:
        try:
            tj = json.loads(m.group(1))
            return (raw, tj.get("name", tj.get("tool", "UNKNOWN")),
                    tj.get("arguments", tj.get("args", {})), latency)
        except json.JSONDecodeError:
            pass
    # Strategy 1: piped <|tool_call|>
    m = TOOL_CALL_PIPE_RE.search(clean)
    if m:
        try:
            # Decode from the opening brace. The non-greedy capture stops at the first '}',
            # which truncates any call whose "arguments" value is a nested object.
            tj, _ = json.JSONDecoder().raw_decode(clean, m.start(1))
            return (raw, tj.get("name", tj.get("tool", "UNKNOWN")),
                    tj.get("arguments", tj.get("args", {})), latency)
        except json.JSONDecodeError:
            pass
    # Strategy 2: bare JSON with "name" key
    m = BARE_JSON_RE.search(clean)
    if m:
        try:
            tj = json.loads(m.group(0))
            return (raw, tj.get("name", "UNKNOWN"),
                    tj.get("arguments", tj.get("args", {})), latency)
        except json.JSONDecodeError:
            pass
    return (raw, "NO_TOOL", {}, latency)


# ---------------------------------------------------------------------------
# Scoring
# ---------------------------------------------------------------------------
def evaluate_result(expected_tool, required_params, got_tool, got_args):
    """
    Returns one of:
      strict_pass    : correct tool and all required_params present
                       (or NO_TOOL returned when NO_TOOL was expected)
      partial_pass   : correct tool, but one or more required_params missing
                       (including the case where none are present)
      wrong_tool     : a tool was called, but not the expected one
      false_positive : tool called when NO_TOOL expected
      false_negative : NO_TOOL returned when tool expected
    """
    if expected_tool == "NO_TOOL":
        return "false_positive" if got_tool != "NO_TOOL" else "strict_pass"
    if got_tool == "NO_TOOL":
        return "false_negative"
    # Accept either search tool for ambiguous prompts
    tools_match = (got_tool == expected_tool) or (
        expected_tool in ("session_search_memory", "knowledge_search") and
        got_tool in ("session_search_memory", "knowledge_search")
    )
    if not tools_match:
        return "wrong_tool"
    if not required_params:
        return "strict_pass"
    if not isinstance(got_args, dict):
        got_args = {}
    present = [p for p in required_params if p in got_args and got_args[p] not in (None, "", [])]
    if len(present) == len(required_params):
        return "strict_pass"
    # Right tool, but some (or all) required params are missing
    return "partial_pass"


def score(verdict):
    if verdict == "strict_pass":
        return 1.0
    if verdict == "partial_pass":
        return 0.5
    return 0.0


# ---------------------------------------------------------------------------
# Main Eval
# ---------------------------------------------------------------------------
def run_once(tests, shuffle=False, run_label=""):
    """Run one full pass over test suite. Returns (results_list, category_stats)."""
    indexed = list(enumerate(tests))
    if shuffle:
        random.shuffle(indexed)
    results = [None] * len(tests)
    category_stats = {}
    for display_i, (orig_idx, (prompt, expected, req_params, category)) in enumerate(indexed, 1):
        chatml = (
            f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
            f"<|im_start|>user\n{prompt}<|im_end|>\n"
            f"<|im_start|>assistant\n"
        )
        raw, got_tool, got_args, latency = call_ollama(chatml)
        # Skip remapping on request errors so a failed call is never rewritten into a pass.
        if got_tool != "ERROR":
            got_tool, got_args = validate_tool_call(prompt, got_tool, got_args)
        # Normalize so the reporting below cannot fail on malformed model output.
        got_tool = str(got_tool)
        if not isinstance(got_args, dict):
            got_args = {}
        verdict = evaluate_result(expected, req_params, got_tool, got_args)
        icon = "OK" if verdict == "strict_pass" else ("~~" if verdict == "partial_pass" else "XX")
        tag = f"#{orig_idx + 1:03d}"
        short = prompt[:52]
        run_info = f"[{run_label}] " if run_label else ""
        print(
            f" {run_info}[{display_i:3d}/{len(tests)}] {icon} {tag} "
            f"expect={expected:30s} got={got_tool:30s} {latency:5.1f}s | {short}"
        )
        if verdict != "strict_pass":
            if verdict == "partial_pass":
                missing = [p for p in req_params if got_args.get(p) in (None, "", [])]
                print(f" -> partial: missing params {missing}")
            elif verdict == "false_positive":
                print(f" -> FALSE POSITIVE: called {got_tool} (expected NO_TOOL)")
            elif verdict == "false_negative":
                print(f" -> FALSE NEGATIVE: no tool called (expected {expected})")
            elif verdict == "wrong_tool":
                print(f" -> WRONG TOOL: expected {expected}, got {got_tool}")
        results[orig_idx] = {
            "id": orig_idx + 1,
            "prompt": prompt,
            "expected": expected,
            "got": got_tool,
            "got_args": got_args,
            "verdict": verdict,
            "latency": latency,
            "category": category,
            "points": score(verdict),
        }
        if category not in category_stats:
            category_stats[category] = {"total": 0, "strict": 0, "partial": 0, "fail": 0, "points": 0.0}
        cat = category_stats[category]
        cat["total"] += 1
        cat["points"] += score(verdict)
        if verdict == "strict_pass":
            cat["strict"] += 1
        elif verdict == "partial_pass":
            cat["partial"] += 1
        else:
            cat["fail"] += 1
    return results, category_stats


def print_run_summary(results, category_stats, run_label=""):
    def _ratio(numerator, denominator):
        # Guard every percentage against an empty suite or a suite with no NO_TOOL cases.
        return numerator / denominator if denominator else 0.0

    strict = sum(1 for r in results if r["verdict"] == "strict_pass")
    partial = sum(1 for r in results if r["verdict"] == "partial_pass")
    fp = sum(1 for r in results if r["verdict"] == "false_positive")
    fn = sum(1 for r in results if r["verdict"] == "false_negative")
    wt = sum(1 for r in results if r["verdict"] == "wrong_tool")
    total = len(results)
    total_points = sum(r["points"] for r in results)
    no_tool_tests = [r for r in results if r["expected"] == "NO_TOOL"]
    no_tool_correct = sum(1 for r in no_tool_tests if r["verdict"] == "strict_pass")
    hallucinations = fp  # a hallucination is a tool call on a NO_TOOL prompt
    avg_lat = _ratio(sum(r["latency"] for r in results), total)
    lbl = f" (Run {run_label})" if run_label else ""
    print()
    print("=" * 80)
    print(f" EVAL-300 RESULTS{lbl}")
    print("=" * 80)
    print(f" Strict Pass: {strict}/{total} = {_ratio(strict, total) * 100:.1f}%")
    print(f" Partial Pass: {partial}/{total} = {_ratio(partial, total) * 100:.1f}%")
    print(f" Wrong Tool: {wt}/{total}")
    print(f" False Positives: {fp}/{total} (hallucinations)")
    print(f" False Negatives: {fn}/{total}")
    print(f" ---")
    print(f" strict_pct (strict/total): {_ratio(strict, total) * 100:.1f}%")
    print(f" weighted_pct (total_points/total): {_ratio(total_points, total) * 100:.1f}%")
    print(f" Abstention accuracy: {no_tool_correct}/{len(no_tool_tests)} = {_ratio(no_tool_correct, len(no_tool_tests)) * 100:.1f}%")
    print(f" Hallucinations: {hallucinations} (target = 0)")
    print(f" Avg latency: {avg_lat:.1f}s")
    print()
    print(f" {'Category':<22} {'Strict':>7} {'Partial':>8} {'Fail':>5} {'Pts/Tot':>10} {'Pct':>6}")
    print(f" {'-'*22} {'-'*7} {'-'*8} {'-'*5} {'-'*10} {'-'*6}")
    for cat, s in sorted(category_stats.items()):
        pts_pct = _ratio(s["points"], s["total"]) * 100
        print(f" {cat:<22} {s['strict']:>7} {s['partial']:>8} {s['fail']:>5} "
              f"{s['points']:>5.1f}/{s['total']:<4} {pts_pct:>5.1f}%")
    print("=" * 80)
    return {
        "strict": strict,
        "partial": partial,
        "wrong_tool": wt,
        "false_positive": fp,
        "false_negative": fn,
        "total": total,
        "total_points": total_points,
        "strict_pct": _ratio(strict, total),
        "weighted_pct": _ratio(total_points, total),
        "abstention_rate": _ratio(no_tool_correct, len(no_tool_tests)),
        "hallucinations": hallucinations,
        "avg_latency": avg_lat,
        "category_stats": category_stats,
    }


def main():
    parser = argparse.ArgumentParser(description="Eval-300: 300-case standard evaluation for prism-coder")
    parser.add_argument("--model", type=str, default=None,
                        help="Ollama model tag to evaluate (default: prism-coder:4b-v43)")
    parser.add_argument("--runs", type=int, default=1,
                        help="Number of eval runs (default: 1; use 3 for stability check)")
    parser.add_argument("--shuffle", action="store_true",
                        help="Randomize test order each run")
    parser.add_argument("--no-validate-layer3", action="store_true",
                        help="Disable Layer 3 false-positive rejection "
                             "(use during RFT/DPO so model sees true failures)")
    args = parser.parse_args()
    if args.runs < 1:
        parser.error("--runs must be at least 1")

    global MODEL, validate_tool_call
    if args.model:
        MODEL = args.model
    if args.no_validate_layer3:
        # Replace the validator with a pass-through so raw model output is scored.
        def validate_tool_call(prompt, tool_name, tool_args):  # noqa: F811
            return tool_name, tool_args
    _verify_test_counts()
    print("=" * 80)
    print(f" EVAL-300 — prism-coder standard evaluation")
    print(f" Model: {MODEL}")
    print(f" Tests: {len(TESTS)}")
    print(f" Runs: {args.runs}" + (" (RANDOMIZED ORDER each run)" if args.shuffle else ""))
    print(f" Layer3: {'DISABLED' if args.no_validate_layer3 else 'enabled'}")
    print("=" * 80)
    all_run_summaries = []
    all_run_results = []
    for run_idx in range(args.runs):
        run_label = str(run_idx + 1) if args.runs > 1 else ""
        seed = None
        if args.shuffle:
            # Seed the RNG with the value that is printed, so the reported seed reproduces this order.
            seed = random.randint(1000, 9999)
            random.seed(seed)
        if args.runs > 1:
            print(f"\n{'#' * 80}")
            print(f" RUN {run_idx + 1} / {args.runs}" +
                  (f" (seed={seed})" if args.shuffle else ""))
            print(f"{'#' * 80}")
        results, cat_stats = run_once(TESTS, shuffle=args.shuffle, run_label=run_label)
        summary = print_run_summary(results, cat_stats, run_label=run_label)
        all_run_summaries.append(summary)
        all_run_results.append(results)
    # ---------------------------------------------------------------------------
    # Multi-run aggregate
    # ---------------------------------------------------------------------------
    if args.runs > 1:
        strict_scores = [s["strict"] for s in all_run_summaries]
        weighted_pcts = [s["weighted_pct"] * 100 for s in all_run_summaries]
        total = all_run_summaries[0]["total"]
        halluc_counts = [s["hallucinations"] for s in all_run_summaries]
        # Per-test stability
        per_test_pass = [0] * len(TESTS)
        per_test_fail_tools = [[] for _ in range(len(TESTS))]
        for run_results in all_run_results:
            for r in run_results:
                idx = r["id"] - 1
                if r["verdict"] == "strict_pass":
                    per_test_pass[idx] += 1
                else:
                    per_test_fail_tools[idx].append(r.get("got", "???"))
        med_strict = statistics.median(strict_scores)
        avg_strict = statistics.mean(strict_scores)
        med_weighted = statistics.median(weighted_pcts)
        print(f"\n{'=' * 80}")
        print(f" MULTI-RUN SUMMARY ({args.runs} runs x {total} tests)")
        print(f"{'=' * 80}")
        print(f" Strict scores: {' | '.join(f'{s}/{total}' for s in strict_scores)}")
        print(f" Median strict: {med_strict}/{total} = {med_strict / total * 100:.1f}%")
        print(f" Average strict: {avg_strict:.1f}/{total} = {avg_strict / total * 100:.1f}%")
        print(f" Weighted pct: {' | '.join(f'{p:.1f}%' for p in weighted_pcts)} "
              f"(median {med_weighted:.1f}%)")
        print(f" Hallucinations: {' | '.join(str(h) for h in halluc_counts)} "
              f"(target = 0 each run)")
        print()
        print(f" Flaky tests (< 100% pass rate across {args.runs} runs):")
        flaky = []
        for i, (prompt, expected, _, cat) in enumerate(TESTS):
            rate = per_test_pass[i] / args.runs
            if rate < 1.0:
                fail_tools = per_test_fail_tools[i]
                flaky.append((i + 1, rate, expected, set(fail_tools), cat, prompt[:60]))
        if flaky:
            for fid, rate, exp, fails, fcat, fshort in sorted(flaky, key=lambda x: x[1]):
                # Sort the failing tool names so the report is stable between invocations.
                fail_list = ','.join(sorted(str(t) for t in fails))
                print(f" [{fid:03d}] {rate * 100:3.0f}% | cat={fcat:<18s} | expect={exp:<28s} | fails->{fail_list:<20s} | {fshort}")
        else:
            print(" All tests passed consistently across all runs!")
        print(f" Total flaky: {len(flaky)}/{total}")
        print(f"{'=' * 80}")
    # ---------------------------------------------------------------------------
    # Save JSON report
    # ---------------------------------------------------------------------------
    os.makedirs("results", exist_ok=True)
    report_path = "results/eval300_report.json"
    if args.runs > 1:
        final_summary = {
            "runs": args.runs,
            "strict_scores": strict_scores,
            "median_strict": statistics.median(strict_scores) / total,
            "avg_strict": statistics.mean(strict_scores) / total,
            "median_weighted_pct": statistics.median(weighted_pcts) / 100,
            "hallucinations_per_run": halluc_counts,
            "per_run_summaries": all_run_summaries,
        }
    else:
        final_summary = all_run_summaries[0]
    report = {
        "model": MODEL,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "total_tests": len(TESTS),
        "runs": args.runs,
        "shuffle": args.shuffle,
        "layer3_enabled": not args.no_validate_layer3,
        "summary": final_summary,
        "last_run_results": all_run_results[-1],
    }
    with open(report_path, "w") as f:
        json.dump(report, f, indent=2, default=str)
    print(f"\nReport saved: {report_path}")
    # Exit code: fail if last run strict < 90%
    last_strict_pct = all_run_summaries[-1]["strict_pct"] * 100
    if last_strict_pct < 90.0:
        print(f"FAIL: strict_pct {last_strict_pct:.1f}% is below 90% gate")
        sys.exit(1)
    else:
        print(f"PASS: strict_pct {last_strict_pct:.1f}%")
        sys.exit(0)


if __name__ == "__main__":
    main()