prism-coder-4b / training / eval_300.py

dcostenco

Add training/eval_300.py

fcac56b verified 6 days ago

86.6 kB

	#!/usr/bin/env python3
	"""
	eval_300.py — 300-Case Standard Evaluation for prism-coder:4b-v43

	Replaces bfcl_eval.py (64 tests) and swe_bench_test.py (68 tests) with a single
	~300-case blind eval. Designed to be run 3 times for statistical stability checks.

	All test cases are NOVEL — never seen in any training data.

	Categories:
	    natural_phrasing (50) — casual/indirect phrasing that maps to a tool
	    adversarial_trap (70) — CS/programming questions that must NOT call a tool
	    disambiguation   (40) — similar tools exist; must pick the correct one
	    edge_case        (25) — minimal / ambiguous prompts
	    multi_intent     (20) — multi-step prompts; score on first action only
	    verifier         (25) — synthesize_edges / backfill_links / health_check patterns
	    cascade          (25) — explicit first-step-of-chain patterns
	    param_extraction (25) — params in the prompt text; test correct extraction
	    abstention       (20) — greetings / capability questions; must return NO_TOOL

	Scoring:
	    strict_pass  = correct tool + all required_params present              → 1.0 point
	    partial_pass = correct tool + at least 1 required_param but not all    → 0.5 point
	    wrong_tool   = wrong tool name                                         → 0 points
	    false_pos    = tool called when NO_TOOL expected                       → 0 points
	    false_neg    = NO_TOOL when tool expected                              → 0 points

	Usage:
	    python3 eval_300.py
	    python3 eval_300.py --runs 3 --shuffle
	    python3 eval_300.py --model prism-coder:4b-v43 --runs 3
	    python3 eval_300.py --no-validate-layer3
	"""

	import json
	import os
	import re
	import sys
	import time
	import random
	import statistics
	import urllib.request
	import argparse
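
The scoring rules listed in the module docstring can be read as a small decision function. The sketch below is one possible reading, not the scorer from eval_300.py (which is not shown in this excerpt). The name `score_case`, the `missing_params` label, and the choice to give a correct abstention full credit are assumptions made for illustration.

```python
from typing import Mapping, Optional, Sequence, Tuple

NO_TOOL = "NO_TOOL"


def score_case(
    expected_tool: str,
    required_params: Sequence[str],
    got_tool: Optional[str],
    got_args: Optional[Mapping[str, object]],
) -> Tuple[str, float]:
    """Return (label, points) for one test case, following the docstring rules."""
    got_tool = got_tool or NO_TOOL
    got_args = got_args or {}

    if expected_tool == NO_TOOL:
        if got_tool == NO_TOOL:
            # Assumption: a correct abstention earns full credit.
            return ("strict_pass", 1.0)
        return ("false_pos", 0.0)

    if got_tool == NO_TOOL:
        return ("false_neg", 0.0)
    if got_tool != expected_tool:
        return ("wrong_tool", 0.0)

    # Only key presence is checked, matching the "MUST appear in got_args" note.
    present = [key for key in required_params if key in got_args]
    if len(present) == len(required_params):
        return ("strict_pass", 1.0)
    if present:
        return ("partial_pass", 0.5)
    # Assumption: correct tool with none of the required params scores 0.
    # The docstring does not name this outcome, so the label is hypothetical.
    return ("missing_params", 0.0)
```

For a case that requires `output_path` and `format`, a call that supplies only `output_path` would come back as `("partial_pass", 0.5)` under this reading.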

	# ---------------------------------------------------------------------------
	# Config
	# ---------------------------------------------------------------------------
	MODEL = "prism-coder:4b-v43"
	OLLAMA_API = "http://localhost:11434/api/generate"

	SYSTEM_PROMPT = (
	    "You are Synalux, a memory-augmented coding and clinical reasoning assistant. "
	    "You have access to Prism Memory tools (session_save_ledger, session_load_context, "
	    "session_search_memory, session_save_handoff, session_forget_memory, session_health_check, "
	    "session_compact_ledger, session_export_memory, session_task_route, session_save_experience, "
	    "session_synthesize_edges, session_backfill_links, knowledge_search, knowledge_forget, "
	    "knowledge_upvote, knowledge_downvote, knowledge_set_retention) and 13 multimodal tool "
	    "modules (image_gen, office, web_scraper, browser, tts, ocr, git, terminal, deps_scanner, "
	    "hipaa, data_graph, templates, pdf_parser). "
	    "Think step-by-step before answering. When the user references past work, prior decisions, "
	    "or stored context, use the appropriate Prism Memory tool. "
	    "Format tool calls inside <tool_call>...</tool_call> JSON blocks with fields 'name' and 'arguments'. "
	    "If no tool is needed, answer directly in plain text. "
	    "ABSTAIN for general programming questions, CS concepts, greetings, and capability questions."
	)
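
The system prompt asks the model to wrap tool calls in `<tool_call>...</tool_call>` JSON blocks with `name` and `arguments` fields, and to answer in plain text otherwise. A minimal sketch of how a response in that format could be turned into a `(tool, arguments)` pair follows. The function name `parse_first_tool_call` and the decision to skip malformed blocks are assumptions; the parser actually used later in eval_300.py is not shown here and may differ.

```python
import json
import re
from typing import Dict, Tuple

# Non-greedy match so that several tool-call blocks are found separately.
_TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_first_tool_call(response_text: str) -> Tuple[str, Dict[str, object]]:
    """Return (name, arguments) for the first well-formed tool call.

    Falls back to ("NO_TOOL", {}) when the response contains no usable block.
    """
    for match in _TOOL_CALL_RE.finditer(response_text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # assumption: malformed blocks are ignored, not penalized
        if not isinstance(payload, dict):
            continue
        name = payload.get("name")
        arguments = payload.get("arguments")
        if isinstance(name, str) and name:
            return name, arguments if isinstance(arguments, dict) else {}
    return "NO_TOOL", {}
```

Returning only the first block lines up with the multi_intent rule of scoring the first action only.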

	VALID_TOOLS = {
	    "session_load_context", "session_save_ledger", "session_save_handoff",
	    "session_search_memory", "session_forget_memory", "session_health_check",
	    "session_compact_ledger", "session_export_memory", "session_task_route",
	    "session_save_experience", "session_synthesize_edges", "session_backfill_links",
	    "knowledge_search", "knowledge_forget", "knowledge_upvote",
	    "knowledge_downvote", "knowledge_set_retention",
	}
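
The Config section points at a local Ollama server through `OLLAMA_API`. As a rough illustration of how a single prompt might be sent there with `urllib.request` (which the file already imports), the sketch below builds a non-streaming request for Ollama's `/api/generate` endpoint. The helper names `build_generate_request` and `query_model`, the timeout, and the `temperature: 0` option are assumptions, not settings taken from eval_300.py.

```python
import json
import urllib.request

# Values copied from the Config section; the system prompt is passed in by the caller.
MODEL = "prism-coder:4b-v43"
OLLAMA_API = "http://localhost:11434/api/generate"


def build_generate_request(prompt: str, system: str, model: str = MODEL) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {
        "model": model,
        "system": system,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a stream of chunks
        # Assumption: deterministic decoding is wanted for an eval run.
        "options": {"temperature": 0},
    }
    return urllib.request.Request(
        OLLAMA_API,
        data=json.dumps(payload).encode("utf-8"),  # supplying data makes this a POST
        headers={"Content-Type": "application/json"},
    )


def query_model(prompt: str, system: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text (needs a running Ollama server)."""
    request = build_generate_request(prompt, system)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"]
```

Keeping request construction separate from the network call makes the payload easy to inspect without a server running.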

	# ---------------------------------------------------------------------------
	# Test Cases (prompt, expected_tool_or_NO_TOOL, required_params, category)
	# required_params: list of param keys that MUST appear in got_args
	# ---------------------------------------------------------------------------
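
Each test case is a 4-tuple of prompt, expected tool (or `NO_TOOL`), required parameter keys, and category, and the module docstring states how many cases each category should hold. A small consistency check along the following lines could catch a miscounted or mislabeled case before a run. `check_tests` and `EXPECTED_COUNTS` are hypothetical names introduced for this sketch; the totals are copied from the docstring.

```python
from collections import Counter

# Per-category totals as stated in the module docstring (they sum to 300).
EXPECTED_COUNTS = {
    "natural_phrasing": 50, "adversarial_trap": 70, "disambiguation": 40,
    "edge_case": 25, "multi_intent": 20, "verifier": 25, "cascade": 25,
    "param_extraction": 25, "abstention": 20,
}


def check_tests(tests, expected_counts=EXPECTED_COUNTS):
    """Return a list of problems found; an empty list means the table is consistent."""
    problems = []
    counts = Counter()
    for index, case in enumerate(tests):
        if len(case) != 4:
            problems.append(f"case {index}: expected 4 fields, found {len(case)}")
            continue
        _prompt, expected_tool, required_params, category = case
        counts[category] += 1
        if category not in expected_counts:
            problems.append(f"case {index}: unknown category {category!r}")
        if expected_tool == "NO_TOOL" and required_params:
            problems.append(f"case {index}: NO_TOOL case lists required params")
    for category, want in expected_counts.items():
        if counts[category] != want:
            problems.append(f"{category}: expected {want}, found {counts[category]}")
    return problems
```

Calling `check_tests(TESTS)` once the full list is defined would report any category whose size drifts from the documented total.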

	TESTS = [

	# ===========================================================================
	# CATEGORY 1: natural_phrasing (50 cases)
	# Casual / indirect user phrasing that maps to a specific Prism tool.
	# ===========================================================================

	# --- session_load_context ---
	("Alright, kick things off. Pull up whatever we had on the checkout-service project.",
	"session_load_context", ["project"], "natural_phrasing"),

	("I'm back from lunch. Get me re-oriented on the prism-aac project.",
	"session_load_context", ["project"], "natural_phrasing"),

	("Fresh session here. Reconstruct everything we built for the notifications project.",
	"session_load_context", ["project"], "natural_phrasing"),

	("Starting a new chat. Bring up the full context for the mobile-app project.",
	"session_load_context", ["project"], "natural_phrasing"),

	("Where did we leave off with the auth-service work?",
	"session_load_context", [], "natural_phrasing"),

	("Get me up to speed on the reporting-dashboard project.",
	"session_load_context", ["project"], "natural_phrasing"),

	("Resume from where we were on the data-pipeline project.",
	"session_load_context", ["project"], "natural_phrasing"),

	("Catch me up — what was the state of the subscription-api project?",
	"session_load_context", ["project"], "natural_phrasing"),

	# --- session_save_ledger ---
	("We wrapped up for today. Make a note that we completed the database indexing overhaul.",
	"session_save_ledger", [], "natural_phrasing"),

	("Log what just happened: we refactored the payment module and all tests pass.",
	"session_save_ledger", [], "natural_phrasing"),

	("Record this session — we finalized the API contract for the mobile team.",
	"session_save_ledger", [], "natural_phrasing"),

	("Write down everything we did today before I close this tab.",
	"session_save_ledger", [], "natural_phrasing"),

	("Jot down our progress: three endpoints migrated, two more to go.",
	"session_save_ledger", [], "natural_phrasing"),

	("Before I head out, save a summary of what we accomplished this afternoon.",
	"session_save_ledger", [], "natural_phrasing"),

	# --- session_save_handoff ---
	("I'm handing this over. Leave a note for whoever picks this up next on the billing-portal project.",
	"session_save_handoff", ["project"], "natural_phrasing"),

	("Pass the baton on the logistics-api project. Save the handoff so the next person knows where we are.",
	"session_save_handoff", ["project"], "natural_phrasing"),

	("Shift change. Store the current state for the embedded-firmware project so the next agent can continue.",
	"session_save_handoff", ["project"], "natural_phrasing"),

	("Create a handoff note for the trading-platform project — we got through feature flagging, still need A/B routing.",
	"session_save_handoff", ["project"], "natural_phrasing"),

	# --- session_search_memory ---
	("Remind me — did we ever pick a caching strategy for the CDN layer?",
	"session_search_memory", ["query"], "natural_phrasing"),

	("Did we discuss anything about Kafka consumer lag in previous sessions?",
	"session_search_memory", ["query"], "natural_phrasing"),

	("Go back through our history and find anything about the CI pipeline refactor.",
	"session_search_memory", ["query"], "natural_phrasing"),

	("What did we decide about webhook retry logic in past conversations?",
	"session_search_memory", ["query"], "natural_phrasing"),

	("Dig up anything we recorded about the multi-tenant database design.",
	"session_search_memory", ["query"], "natural_phrasing"),

	("Pull up any notes we saved about the gRPC migration.",
	"session_search_memory", ["query"], "natural_phrasing"),

	# --- session_forget_memory ---
	("That entry we saved about using SQLite in production is totally wrong. Remove it.",
	"session_forget_memory", ["memory_id"], "natural_phrasing"),

	("Delete the memory with ID mem-zx91-ff. It's stale.",
	"session_forget_memory", ["memory_id"], "natural_phrasing"),

	("Wipe the incorrect ledger note that said we shipped v2.1 — we didn't.",
	"session_forget_memory", ["memory_id"], "natural_phrasing"),

	# --- session_health_check ---
	("Something feels off. Can you run diagnostics on the memory backend?",
	"session_health_check", [], "natural_phrasing"),

	("Before I trust these search results, verify the memory system is healthy.",
	"session_health_check", [], "natural_phrasing"),

	("Give the memory infrastructure a quick checkup.",
	"session_health_check", [], "natural_phrasing"),

	# --- session_compact_ledger ---
	("The session history for the event-sourcing project is getting massive. Trim and archive the old entries.",
	"session_compact_ledger", ["project"], "natural_phrasing"),

	("Compress the ledger for the recommendation-engine project — too much noise in there.",
	"session_compact_ledger", ["project"], "natural_phrasing"),

	("Prune out the old session entries for the analytics-backend project.",
	"session_compact_ledger", ["project"], "natural_phrasing"),

	# --- session_export_memory ---
	("Dump a full backup of my memory to /data/exports in JSON format.",
	"session_export_memory", ["output_path", "format"], "natural_phrasing"),

	("Export everything to /tmp/prism-dump so I can archive it.",
	"session_export_memory", ["output_path"], "natural_phrasing"),

	("I need an offline copy of all session data. Export to /backup/weekly.",
	"session_export_memory", ["output_path"], "natural_phrasing"),

	# --- session_task_route ---
	("Should I tackle this Rust async runtime bug locally or send it to a bigger model?",
	"session_task_route", ["task_description"], "natural_phrasing"),

	("Is this image classification fine-tuning job something the local agent can handle?",
	"session_task_route", ["task_description"], "natural_phrasing"),

	("Route this task: refactor the monorepo build system to support incremental compilation.",
	"session_task_route", ["task_description"], "natural_phrasing"),

	# --- session_save_experience ---
	("Log a milestone: we successfully zero-downtime-deployed the new search index.",
	"session_save_experience", [], "natural_phrasing"),

	("Record that we fixed the race condition in the WebSocket handler — took 4 hours but it's solid now.",
	"session_save_experience", [], "natural_phrasing"),

	# --- knowledge_search ---
	("Any institutional knowledge on how we handle circuit breakers?",
	"knowledge_search", ["query"], "natural_phrasing"),

	("What does our knowledge base say about rate limiting strategies?",
	"knowledge_search", ["query"], "natural_phrasing"),

	("Look up anything curated about CQRS patterns.",
	"knowledge_search", ["query"], "natural_phrasing"),

	("Check our documented knowledge for anything on event-driven architecture.",
	"knowledge_search", ["query"], "natural_phrasing"),

	# --- knowledge_upvote / downvote ---
	("That knowledge entry about using Redis for distributed locks was really helpful. Give it a thumbs up.",
	"knowledge_upvote", [], "natural_phrasing"),

	("Boost the ranking on our GraphQL federation notes — they're gold.",
	"knowledge_upvote", [], "natural_phrasing"),

	("That doc about using polling instead of webhooks is outdated and wrong. Lower its score.",
	"knowledge_downvote", [], "natural_phrasing"),

	("Downvote the entry about using bcrypt at cost 4 — it's dangerously insecure.",
	"knowledge_downvote", [], "natural_phrasing"),

	# --- knowledge_set_retention ---
	("Set a 45-day retention policy on the alpha-testing project's knowledge.",
	"knowledge_set_retention", ["project"], "natural_phrasing"),


	# ===========================================================================
	# CATEGORY 2: adversarial_trap (70 cases)
	# CS / programming questions — must return NO_TOOL even when keywords match.
	# ===========================================================================

	# Python
	("Write a Python function that implements a trie for fast prefix searches.",
	"NO_TOOL", [], "adversarial_trap"),

	("How do I use Python's contextlib.contextmanager decorator?",
	"NO_TOOL", [], "adversarial_trap"),

	("Explain Python's __slots__ and when to use it for memory optimization.",
	"NO_TOOL", [], "adversarial_trap"),

	("What is the difference between deepcopy and shallow copy in Python?",
	"NO_TOOL", [], "adversarial_trap"),

	("How does Python's asyncio event loop schedule coroutines?",
	"NO_TOOL", [], "adversarial_trap"),

	("Write a Python generator that yields prime numbers indefinitely.",
	"NO_TOOL", [], "adversarial_trap"),

	("How do I profile memory usage in a Python application?",
	"NO_TOOL", [], "adversarial_trap"),

	# JavaScript / TypeScript
	("How do I debounce a function in JavaScript without lodash?",
	"NO_TOOL", [], "adversarial_trap"),

	("Explain the JavaScript event loop and microtask queue.",
	"NO_TOOL", [], "adversarial_trap"),

	("How does TypeScript's discriminated union type work?",
	"NO_TOOL", [], "adversarial_trap"),

	("Write a TypeScript generic function that deep-merges two objects.",
	"NO_TOOL", [], "adversarial_trap"),

	("What is the difference between a WeakMap and a Map in JavaScript?",
	"NO_TOOL", [], "adversarial_trap"),

	("How do I implement a promise-based queue in Node.js?",
	"NO_TOOL", [], "adversarial_trap"),

	# Go
	("How does Go's goroutine scheduler work with M:N threading?",
	"NO_TOOL", [], "adversarial_trap"),

	("Explain Go's garbage collector and write barriers.",
	"NO_TOOL", [], "adversarial_trap"),

	("Write a concurrent rate limiter in Go using channels.",
	"NO_TOOL", [], "adversarial_trap"),

	("How do I implement context cancellation in a Go HTTP server?",
	"NO_TOOL", [], "adversarial_trap"),

	# Rust
	("Explain Rust's borrow checker and why it prevents data races.",
	"NO_TOOL", [], "adversarial_trap"),

	("How do Arc and Mutex work together in Rust for thread-safe state sharing?",
	"NO_TOOL", [], "adversarial_trap"),

	("What is Rust's Pin and why is it needed for async futures?",
	"NO_TOOL", [], "adversarial_trap"),

	("Write a Rust trait that implements a retry strategy with exponential backoff.",
	"NO_TOOL", [], "adversarial_trap"),

	# SQL / NoSQL
	("Write a SQL query that finds the second-highest salary in an employees table.",
	"NO_TOOL", [], "adversarial_trap"),

	("How do I use window functions in PostgreSQL to compute a running total?",
	"NO_TOOL", [], "adversarial_trap"),

	("What is a covering index and when should I use one in MySQL?",
	"NO_TOOL", [], "adversarial_trap"),

	("Explain eventual consistency in DynamoDB and how to work around it.",
	"NO_TOOL", [], "adversarial_trap"),

	("How do I export data from MongoDB to a JSON file using mongoexport?",
	"NO_TOOL", [], "adversarial_trap"),

	("What is a materialized view in PostgreSQL and how does it differ from a regular view?",
	"NO_TOOL", [], "adversarial_trap"),

	# Algorithms / Data Structures
	("Explain Dijkstra's algorithm and its time complexity.",
	"NO_TOOL", [], "adversarial_trap"),

	("Write a depth-first search implementation for a graph adjacency list.",
	"NO_TOOL", [], "adversarial_trap"),

	("How does consistent hashing help with horizontal scaling?",
	"NO_TOOL", [], "adversarial_trap"),

	("Explain the difference between a B-tree and a B+ tree.",
	"NO_TOOL", [], "adversarial_trap"),

	("What is the time and space complexity of merge sort?",
	"NO_TOOL", [], "adversarial_trap"),

	("Implement a LRU cache in Python using OrderedDict.",
	"NO_TOOL", [], "adversarial_trap"),

	("How does a bloom filter work and what are its false positive trade-offs?",
	"NO_TOOL", [], "adversarial_trap"),

	# Frameworks / Config
	("How do I configure Django's ORM to use read replicas?",
	"NO_TOOL", [], "adversarial_trap"),

	("Explain Flask's application context vs. request context.",
	"NO_TOOL", [], "adversarial_trap"),

	("How does FastAPI's dependency injection system work?",
	"NO_TOOL", [], "adversarial_trap"),

	("Write a middleware in Express.js that logs request durations.",
	"NO_TOOL", [], "adversarial_trap"),

	("How do I set up hot-module replacement in a Vite + React project?",
	"NO_TOOL", [], "adversarial_trap"),

	("What is the difference between server components and client components in Next.js 14?",
	"NO_TOOL", [], "adversarial_trap"),

	# DevOps / Infrastructure
	("Write a Dockerfile for a Python FastAPI app with multi-stage builds.",
	"NO_TOOL", [], "adversarial_trap"),

	("How do I configure a Kubernetes HorizontalPodAutoscaler based on custom metrics?",
	"NO_TOOL", [], "adversarial_trap"),

	("What is the difference between rolling and blue-green deployments?",
	"NO_TOOL", [], "adversarial_trap"),

	("How do I set up Prometheus scraping for a Node.js service?",
	"NO_TOOL", [], "adversarial_trap"),

	("Explain how etcd achieves consensus using the Raft algorithm.",
	"NO_TOOL", [], "adversarial_trap"),

	("Write a GitHub Actions workflow that runs tests on every pull request.",
	"NO_TOOL", [], "adversarial_trap"),

	# Memory management (trap on 'memory' keyword)
	("How does virtual memory paging work in Linux?",
	"NO_TOOL", [], "adversarial_trap"),

	("What is memory-mapped I/O and how does mmap work in C?",
	"NO_TOOL", [], "adversarial_trap"),

	("Explain stack vs. heap memory allocation and when each is appropriate.",
	"NO_TOOL", [], "adversarial_trap"),

	("How does the V8 engine's garbage collector use generational collection?",
	"NO_TOOL", [], "adversarial_trap"),

	# Session handling (trap on 'session' keyword)
	("How does PHP's session_start() work under the hood?",
	"NO_TOOL", [], "adversarial_trap"),

	("Implement session fixation protection in a Flask application.",
	"NO_TOOL", [], "adversarial_trap"),

	("What is the difference between sticky sessions and session replication?",
	"NO_TOOL", [], "adversarial_trap"),

	("How do I store JWT tokens in a secure, httpOnly cookie in Express?",
	"NO_TOOL", [], "adversarial_trap"),

	# Search (trap on 'search' keyword)
	("How do I implement fuzzy search with trigrams in PostgreSQL?",
	"NO_TOOL", [], "adversarial_trap"),

	("Explain TF-IDF and how it ranks documents in full-text search.",
	"NO_TOOL", [], "adversarial_trap"),

	("Write a binary search implementation in Rust.",
	"NO_TOOL", [], "adversarial_trap"),

	("Compare Elasticsearch and OpenSearch for log aggregation.",
	"NO_TOOL", [], "adversarial_trap"),

	# Graph theory (trap on 'graph' + 'edges' keywords)
	("Explain the difference between Prim's and Kruskal's spanning tree algorithms.",
	"NO_TOOL", [], "adversarial_trap"),

	("How do topological sorts work on directed acyclic graphs?",
	"NO_TOOL", [], "adversarial_trap"),

	("Write a function to detect cycles in a directed graph using DFS.",
	"NO_TOOL", [], "adversarial_trap"),

	# Load balancing (trap on 'load' keyword)
	("What are the differences between round-robin, least-connections, and IP-hash load balancing?",
	"NO_TOOL", [], "adversarial_trap"),

	("How does Nginx upstream load balancing handle health check failures?",
	"NO_TOOL", [], "adversarial_trap"),

	# Logging / monitoring
	("How do I implement structured logging in a Go service with zerolog?",
	"NO_TOOL", [], "adversarial_trap"),

	("Explain the ELK stack and how logs flow from Beats to Kibana.",
	"NO_TOOL", [], "adversarial_trap"),

	("What is OpenTelemetry and how does distributed tracing work?",
	"NO_TOOL", [], "adversarial_trap"),

	# Misc CS concepts
	("What is the difference between optimistic and pessimistic locking in databases?",
	"NO_TOOL", [], "adversarial_trap"),

	("Explain how CRDTs achieve conflict-free distributed state.",
	"NO_TOOL", [], "adversarial_trap"),

	("What is a saga pattern in distributed systems?",
	"NO_TOOL", [], "adversarial_trap"),

	("How does the forget gate in an LSTM neural network control memory?",
	"NO_TOOL", [], "adversarial_trap"),


	# ===========================================================================
	# CATEGORY 3: disambiguation (40 cases)
	# Similar tools — model must pick the correct one.
	# ===========================================================================

	# session_search_memory vs knowledge_search
	("Find anything we discussed last month about the API versioning decision.",
	"session_search_memory", ["query"], "disambiguation"),

	("What do our curated knowledge items say about dependency injection patterns?",
	"knowledge_search", ["query"], "disambiguation"),

	("Search our accumulated documentation for information on database sharding.",
	"knowledge_search", ["query"], "disambiguation"),

	("Look through recent session notes for anything about the CDN cache invalidation bug.",
	"session_search_memory", ["query"], "disambiguation"),

	("Any past conversations where we discussed microservice mesh configurations?",
	"session_search_memory", ["query"], "disambiguation"),

	("Check the knowledge base for anything on event sourcing trade-offs.",
	"knowledge_search", ["query"], "disambiguation"),

	# session_forget_memory vs knowledge_forget
	("Remove the specific session memory with ID mem-qq77-rr. It's incorrect.",
	"session_forget_memory", ["memory_id"], "disambiguation"),

	("Clear all the outdated knowledge entries in the staging project.",
	"knowledge_forget", ["project"], "disambiguation"),

	("Wipe out old debugging records from the search-service project's knowledge base.",
	"knowledge_forget", ["project"], "disambiguation"),

	("Delete the memory entry for ID mem-ab99-cd — we noted the wrong schema version.",
	"session_forget_memory", ["memory_id"], "disambiguation"),

	("Remove all knowledge items in the deprecated-feature category from the portal project.",
	"knowledge_forget", ["project"], "disambiguation"),

	# session_save_ledger vs session_save_experience vs session_save_handoff
	("Log what we did today: migrated the billing module to the new event bus.",
	"session_save_ledger", [], "disambiguation"),

	("Record a milestone: we successfully launched the new onboarding flow in production.",
	"session_save_experience", [], "disambiguation"),

	("Hand off this session — save the state for the next agent on the gateway project.",
	"session_save_handoff", ["project"], "disambiguation"),

	("Write down that we rewrote the payment reconciliation logic today.",
	"session_save_ledger", [], "disambiguation"),

	("Mark a success: we fixed the notorious N+1 query on the orders endpoint.",
	"session_save_experience", [], "disambiguation"),

	("The contractor is taking over tonight. Save the handoff for the migration-tools project.",
	"session_save_handoff", ["project"], "disambiguation"),

	# knowledge_upvote vs knowledge_downvote
	("That knowledge entry about immutable infrastructure is spot on. Upvote it.",
	"knowledge_upvote", [], "disambiguation"),

	("The doc recommending XML over JSON for internal APIs is terrible. Mark it down.",
	"knowledge_downvote", [], "disambiguation"),

	("Increase the importance score of the circuit-breaker patterns entry.",
	"knowledge_upvote", [], "disambiguation"),

	("Reduce the rank of that outdated note about using MD5 for hashing.",
	"knowledge_downvote", [], "disambiguation"),

	# session_compact_ledger vs session_export_memory
	("The billing-service ledger is bloated. Compress and archive the old entries.",
	"session_compact_ledger", ["project"], "disambiguation"),

	("Export a full offline snapshot of my memory to /archive/snapshot in JSON.",
	"session_export_memory", ["output_path", "format"], "disambiguation"),

	("Trim down the session history for the firmware project — it's too long.",
	"session_compact_ledger", ["project"], "disambiguation"),

	("Save everything to disk — dump all session data to /tmp/export-all.",
	"session_export_memory", ["output_path"], "disambiguation"),

	# session_synthesize_edges vs session_backfill_links vs session_health_check
	("Verify the session graph edges are all consistent for the trading-platform project.",
	"session_synthesize_edges", ["project"], "disambiguation"),

	("Reconnect the dangling session references for the ml-pipeline project.",
	"session_backfill_links", ["project"], "disambiguation"),

	("Run a full health diagnostic on the Prism memory backend.",
	"session_health_check", [], "disambiguation"),

	("Patch up missing cross-session links for the user-service project.",
	"session_backfill_links", ["project"], "disambiguation"),

	("Make sure all edges are synthesized and up to date for the invoicing project.",
	"session_synthesize_edges", ["project"], "disambiguation"),

	("Is the memory system responding normally? Do a quick health check.",
	"session_health_check", [], "disambiguation"),

	# session_load_context vs session_search_memory
	("Bring me back into the context of the payments-gateway project.",
	"session_load_context", ["project"], "disambiguation"),

	("Look for any notes we made about the GraphQL schema decisions.",
	"session_search_memory", ["query"], "disambiguation"),

	("Restore the full session state for the devops-automation project.",
	"session_load_context", ["project"], "disambiguation"),

	("Search our history for any discussion about OAuth2 vs API keys.",
	"session_search_memory", ["query"], "disambiguation"),

	# session_task_route vs session_load_context
	("Should the local model handle this React performance optimization or route it to the cloud?",
	"session_task_route", ["task_description"], "disambiguation"),

	("Initialize context for the infrastructure-as-code project — I'm starting fresh.",
	"session_load_context", ["project"], "disambiguation"),

	# knowledge_set_retention vs knowledge_forget
	("Set the knowledge for the beta-program project to expire after 90 days.",
	"knowledge_set_retention", ["project"], "disambiguation"),

	("Delete all knowledge in the archived-2025 project — we don't need it anymore.",
	"knowledge_forget", ["project"], "disambiguation"),

	("Auto-expire the knowledge entries in the sandbox project after 14 days.",
	"knowledge_set_retention", ["project"], "disambiguation"),


	# ===========================================================================
	# CATEGORY 4: edge_case (25 cases)
	# Minimal, single-word, ambiguous, or unusual prompts.
	# ===========================================================================

	("Load context.", "session_load_context", [], "edge_case"),
	("Save.", "session_save_ledger", [], "edge_case"),
	("Search.", "session_search_memory", [], "edge_case"),
	("Check health.", "session_health_check", [], "edge_case"),
	("Export.", "session_export_memory", [], "edge_case"),
	("Compact.", "session_compact_ledger", [], "edge_case"),
	("Handoff.", "session_save_handoff", [], "edge_case"),
	("Route this.", "session_task_route", [], "edge_case"),
	("Synthesize edges.", "session_synthesize_edges", [], "edge_case"),
	("Backfill links.", "session_backfill_links", [], "edge_case"),
	("Forget it.", "session_forget_memory", [], "edge_case"),
	("Knowledge search.", "knowledge_search", [], "edge_case"),

	# Abstention edge cases
	("Hello!", "NO_TOOL", [], "edge_case"),
	("What can you do?", "NO_TOOL", [], "edge_case"),
	("Tell me about yourself.", "NO_TOOL", [], "edge_case"),
	("Thanks, we're done.", "NO_TOOL", [], "edge_case"),
	("OK great.", "NO_TOOL", [], "edge_case"),
	("Bye!", "NO_TOOL", [], "edge_case"),

	# Ambiguous short prompts that still require the right tool
	("Run diagnostics.", "session_health_check", [], "edge_case"),
	("Save the handoff.", "session_save_handoff", [], "edge_case"),
	("Log this session.", "session_save_ledger", [], "edge_case"),
	("Search memory.", "session_search_memory", [], "edge_case"),
	("Knowledge base lookup.", "knowledge_search", [], "edge_case"),
	("Archive old entries.", "session_compact_ledger", [], "edge_case"),
	("Save experience.", "session_save_experience", [], "edge_case"),


	# ===========================================================================
	# CATEGORY 5: multi_intent (20 cases)
	# Multi-step prompts — score only the FIRST action.
	# ===========================================================================

	("Load the context for the pipeline project, then search for any past notes on streaming.",
	"session_load_context", ["project"], "multi_intent"),

	("Search our memory for anything about the OAuth migration, then save a handoff.",
	"session_search_memory", ["query"], "multi_intent"),

	("Check memory health, and if it's all good, compact the fraud-detection ledger.",
	"session_health_check", [], "multi_intent"),

	("Find notes about the ML model rollout, and then log that we finished the A/B test today.",
	"session_search_memory", ["query"], "multi_intent"),

	("Load the prism-mcp context, then check if there are any open issues about rate limiting.",
	"session_load_context", ["project"], "multi_intent"),

	("Export everything to /tmp/backup, then set a 60-day retention policy on it.",
	"session_export_memory", ["output_path"], "multi_intent"),

	("Save what we did today: shipped the new notification system. Then create a handoff note.",
	"session_save_ledger", [], "multi_intent"),

	("Search for what we decided about the queue architecture, then upvote the best result.",
	"session_search_memory", ["query"], "multi_intent"),

	("Run a health check on the memory system, then compact the ledger if there are issues.",
	"session_health_check", [], "multi_intent"),

	("Look up our knowledge on service mesh patterns, and then downvote the outdated ones.",
	"knowledge_search", ["query"], "multi_intent"),

	("Compact the session history for the payments project, then synthesize the session edges.",
	"session_compact_ledger", ["project"], "multi_intent"),

	("Load context for the billing-v2 project, and record our progress: we fixed the invoice date bug.",
	"session_load_context", ["project"], "multi_intent"),

	("Search our knowledge base for event-driven design patterns, then save a handoff with the findings.",
	"knowledge_search", ["query"], "multi_intent"),

	("Backfill the cross-session links for the ios-app project, then synthesize edges.",
	"session_backfill_links", ["project"], "multi_intent"),

	("Route this task: full rewrite of the logging subsystem. If cloud, just tell me.",
	"session_task_route", ["task_description"], "multi_intent"),

	("Export memory to /var/backup, and then purge the old knowledge entries from the legacy project.",
	"session_export_memory", ["output_path"], "multi_intent"),

	("Find what we discussed about caching strategies, then set a 30-day retention on that knowledge.",
	"session_search_memory", ["query"], "multi_intent"),

	("Record a success milestone: zero-downtime deploy of version 4.2. Then compact the ledger.",
	"session_save_experience", [], "multi_intent"),

	("Load the fraud-detection project context and then synthesize all session edges.",
	"session_load_context", ["project"], "multi_intent"),

	("Save what we accomplished: rewrote the ingestion pipeline. Then hand it off to the ops team.",
	"session_save_ledger", [], "multi_intent"),


	# ===========================================================================
	# CATEGORY 6: verifier (25 cases)
	# session_synthesize_edges / session_backfill_links / session_health_check patterns.
	# ===========================================================================

	# session_synthesize_edges
	("Make sure all session graph edges are consistent for the auth-gateway project.",
	"session_synthesize_edges", ["project"], "verifier"),

	("Run a synthesis pass to validate all edges are up to date for the orchestration project.",
	"session_synthesize_edges", ["project"], "verifier"),

	("Verify graph integrity — synthesize edges for the content-delivery project.",
	"session_synthesize_edges", ["project"], "verifier"),

	("Before closing out, check that all session links are consistent for the scheduling project.",
	"session_synthesize_edges", ["project"], "verifier"),

	("Ensure all session relationships are properly synthesized for the warehouse-api project.",
	"session_synthesize_edges", ["project"], "verifier"),

	("Run edge synthesis on the real-time-alerts project to validate the session graph.",
	"session_synthesize_edges", ["project"], "verifier"),

	("Validate that all edges in the session graph are consistent for the pricing-engine project.",
	"session_synthesize_edges", ["project"], "verifier"),

	("Confirm session link consistency for the document-processing project.",
	"session_synthesize_edges", ["project"], "verifier"),

	# session_backfill_links
	("There are broken cross-session links in the search-backend project. Backfill them.",
	"session_backfill_links", ["project"], "verifier"),

	("Reconnect all dangling references in the identity-service project history.",
	"session_backfill_links", ["project"], "verifier"),

	("Patch the missing links between sessions for the payments-v3 project.",
	"session_backfill_links", ["project"], "verifier"),

	("Fix the link gaps in our session history for the recommendation-service project.",
	"session_backfill_links", ["project"], "verifier"),

	("Backfill any missing cross-session connections for the notification-hub project.",
	"session_backfill_links", ["project"], "verifier"),

	("Reconnect broken session references in the compliance-tracker project.",
	"session_backfill_links", ["project"], "verifier"),

	("Repair missing session links for the api-gateway project.",
	"session_backfill_links", ["project"], "verifier"),

	# session_health_check
	("Before I start a new sprint, confirm the memory system is operating correctly.",
	"session_health_check", [], "verifier"),

	("The search results seem incomplete. Check if the memory backend is healthy.",
	"session_health_check", [], "verifier"),

	("I'm seeing weird behavior in session recall. Run a diagnostic check.",
	"session_health_check", [], "verifier"),

	("Ping the memory system and confirm it's all healthy.",
	"session_health_check", [], "verifier"),

	("Is the Prism memory backend operating within normal parameters?",
	"session_health_check", [], "verifier"),

	("Double-check the memory infrastructure health before I rely on these results.",
	"session_health_check", [], "verifier"),

	("Verify the memory system is functioning before we start the long session.",
	"session_health_check", [], "verifier"),

	("Run a full health check and report back on the memory backend status.",
	"session_health_check", [], "verifier"),

	("Something is off with memory recall. Diagnose the backend.",
	"session_health_check", [], "verifier"),

	("Confirm the session memory system is healthy before I save this handoff.",
	"session_health_check", [], "verifier"),


	# ===========================================================================
	# CATEGORY 7: cascade (25 cases)
	# Explicit first-step-of-chain patterns — model must pick the right FIRST tool.
	# ===========================================================================

	("Search our knowledge for gRPC patterns, then upvote the most relevant entry.",
	"knowledge_search", ["query"], "cascade"),

	("Load the indexing-service context, then search for any past notes on shard rebalancing.",
	"session_load_context", ["project"], "cascade"),

	("Check memory health, then compact the alerts project ledger if there are stale entries.",
	"session_health_check", [], "cascade"),

	("Export all memory to /tmp/archive, then set a 180-day retention policy on the archive project.",
	"session_export_memory", ["output_path"], "cascade"),

	("Search for what we decided about the event schema design, then save a handoff about it.",
	"session_search_memory", ["query"], "cascade"),

	("Save today's session notes for the pipeline project, then create a handoff for the next agent.",
	"session_save_ledger", [], "cascade"),

	("Should the local model handle this concurrency refactor? If cloud, stop there.",
	"session_task_route", ["task_description"], "cascade"),

	("Search knowledge for CQRS trade-offs, downvote anything recommending a single store.",
	"knowledge_search", ["query"], "cascade"),

	("Compact the ledger for the embeddings project, then synthesize the session edges.",
	"session_compact_ledger", ["project"], "cascade"),

	("Load the feature-flags project context, then log that we shipped the A/B framework.",
	"session_load_context", ["project"], "cascade"),

	("Run a health check first, then based on results decide whether to compact or export.",
	"session_health_check", [], "cascade"),

	("Search memory for past decisions about SSE vs WebSockets, then record what we found.",
	"session_search_memory", ["query"], "cascade"),

	("Backfill the missing links for the analytics project, then synthesize the edges.",
	"session_backfill_links", ["project"], "cascade"),

	("Load context for the tenant-management project, then search for any open migration tickets.",
	"session_load_context", ["project"], "cascade"),

	("Find what we know about zero-copy networking, then save a handoff with that context.",
	"session_search_memory", ["query"], "cascade"),

	("Export to /backups/weekly, then compact the media-processing ledger.",
	"session_export_memory", ["output_path"], "cascade"),

	("Search our knowledge base for Kubernetes resource quotas, then set a 60-day retention.",
	"knowledge_search", ["query"], "cascade"),

	("Save the experience: we eliminated 80% of unnecessary re-renders. Then route the next task.",
	"session_save_experience", [], "cascade"),

	("Synthesize edges for the audit-log project, then backfill any missing links.",
	"session_synthesize_edges", ["project"], "cascade"),

	("Load the risk-assessment project context and then search memory for past risk audit notes.",
	"session_load_context", ["project"], "cascade"),

	("Find our notes on the transaction saga pattern, then upvote the best entry.",
	"session_search_memory", ["query"], "cascade"),

	("Compact the metrics project ledger, then export it to /tmp/metrics-backup.",
	"session_compact_ledger", ["project"], "cascade"),

	("Route this task: implement distributed tracing with OpenTelemetry across five services.",
	"session_task_route", ["task_description"], "cascade"),

	("Save what we accomplished: added RBAC support to the admin API. Then synthesize edges.",
	"session_save_ledger", [], "cascade"),

	("Search knowledge for eventual consistency patterns, then forget the entries about using global locks.",
	"knowledge_search", ["query"], "cascade"),


	# ===========================================================================
	# CATEGORY 8: param_extraction (25 cases)
	# Params ARE mentioned in the prompt — test that model extracts them correctly.
	# ===========================================================================

	("Load the full context for the fraud-detection project at a deep level.",
	"session_load_context", ["project"], "param_extraction"),

	("Compact the session ledger for the user-identity project.",
	"session_compact_ledger", ["project"], "param_extraction"),

	("Save a handoff note for the supplier-portal project.",
	"session_save_handoff", ["project"], "param_extraction"),

	("Delete the memory entry with ID mem-fg33-hh. It has the wrong branch name.",
	"session_forget_memory", ["memory_id"], "param_extraction"),

	("Export all memory data to /exports/2026-q2 in JSON format.",
	"session_export_memory", ["output_path", "format"], "param_extraction"),

	("Set the retention policy for the experiment-runner project to 45 days.",
	"knowledge_set_retention", ["project"], "param_extraction"),

	("Search session memory for 'distributed tracing setup'.",
	"session_search_memory", ["query"], "param_extraction"),

	("Search the knowledge base for 'idempotency keys in payment APIs'.",
	"knowledge_search", ["query"], "param_extraction"),

	("Backfill the cross-session links for the warehouse-inventory project.",
	"session_backfill_links", ["project"], "param_extraction"),

	("Synthesize session edges for the logistics-optimizer project.",
	"session_synthesize_edges", ["project"], "param_extraction"),

	("Forget the knowledge entry with ID ki-cc44-gg — that approach is deprecated.",
	"knowledge_forget", [], "param_extraction"),

	("Upvote the knowledge entry with ID ki-tt55-rr. Really solid documentation.",
	"knowledge_upvote", [], "param_extraction"),

	("Downvote knowledge entry ki-uu99-qq — it recommends a vulnerable library.",
	"knowledge_downvote", [], "param_extraction"),

	("Configure an 80-day retention policy for the beta-features project's knowledge.",
	"knowledge_set_retention", ["project"], "param_extraction"),

	("Load context for the platform-core project.",
	"session_load_context", ["project"], "param_extraction"),

	("Export the archive to /data/long-term-backup in markdown format.",
	"session_export_memory", ["output_path", "format"], "param_extraction"),

	("Search for 'zero-downtime database migrations' in our session history.",
	"session_search_memory", ["query"], "param_extraction"),

	("Search knowledge for 'CQRS vs event sourcing trade-offs'.",
	"knowledge_search", ["query"], "param_extraction"),

	("Compact the ledger for the monitoring-stack project.",
	"session_compact_ledger", ["project"], "param_extraction"),

	("Delete memory entry mem-pp12-ss — wrong model version was recorded.",
	"session_forget_memory", ["memory_id"], "param_extraction"),

	("Save a handoff for the checkout-v4 project.",
	"session_save_handoff", ["project"], "param_extraction"),

	("Route this task: rewrite the message broker integration to use NATS instead of RabbitMQ.",
	"session_task_route", ["task_description"], "param_extraction"),

	("Synthesize edges for the ingestion-pipeline project.",
	"session_synthesize_edges", ["project"], "param_extraction"),

	("Backfill the missing session links in the content-catalog project.",
	"session_backfill_links", ["project"], "param_extraction"),

	("Set 120-day retention on the compliance-logs project's knowledge.",
	"knowledge_set_retention", ["project"], "param_extraction"),


	# ===========================================================================
	# CATEGORY 9: abstention (20 cases)
	# Greetings, capability questions, general CS — must return NO_TOOL.
	# ===========================================================================

	("Hi there!", "NO_TOOL", [], "abstention"),
	("Good morning!", "NO_TOOL", [], "abstention"),
	("Hey, quick question — what's your name?", "NO_TOOL", [], "abstention"),
	("What tools do you have available?", "NO_TOOL", [], "abstention"),
	("What are your capabilities?", "NO_TOOL", [], "abstention"),
	("Can you explain what Prism Memory tools do?", "NO_TOOL", [], "abstention"),
	("What programming languages do you know?", "NO_TOOL", [], "abstention"),
	("Thanks, that's all for now!", "NO_TOOL", [], "abstention"),
	("Great work today, goodbye.", "NO_TOOL", [], "abstention"),
	("You're really helpful, thanks!", "NO_TOOL", [], "abstention"),
	("What is the capital of France?", "NO_TOOL", [], "abstention"),
	("Tell me a joke.", "NO_TOOL", [], "abstention"),
	("How do you work?", "NO_TOOL", [], "abstention"),
	("Are you GPT-4?", "NO_TOOL", [], "abstention"),
	("Can you write me a poem?", "NO_TOOL", [], "abstention"),
	("What's the weather like today?", "NO_TOOL", [], "abstention"),
	("Can you recommend a good book?", "NO_TOOL", [], "abstention"),
	("What's 2+2?", "NO_TOOL", [], "abstention"),
	("Do you have feelings?", "NO_TOOL", [], "abstention"),
	("What is machine learning?", "NO_TOOL", [], "abstention"),

	]

	# ---------------------------------------------------------------------------
	# Sanity check: enforce exactly 300 cases and correct counts per category
	# ---------------------------------------------------------------------------
	_TARGET_COUNTS = {
		"natural_phrasing": 50,
		"adversarial_trap": 70,
		"disambiguation": 40,
		"edge_case": 25,
		"multi_intent": 20,
		"verifier": 25,
		"cascade": 25,
		"param_extraction": 25,
		"abstention": 20,
	}
	_TOTAL_TARGET = 300

	def _verify_test_counts():
		from collections import Counter
		counts = Counter(t[3] for t in TESTS)
		errors = []
		for cat, expected in _TARGET_COUNTS.items():
			actual = counts.get(cat, 0)
			if actual != expected:
				errors.append(f" {cat}: expected {expected}, got {actual}")
		if len(TESTS) != _TOTAL_TARGET:
			errors.append(f" TOTAL: expected {_TOTAL_TARGET}, got {len(TESTS)}")
		if errors:
			print("WARNING: test count mismatches:")
			for e in errors:
				print(e)
		return len(errors) == 0

	# ---------------------------------------------------------------------------
	# Layer 3: Inference-Time False-Positive Rejection + Remapping
	# (Copied and merged from swe_bench_test.py — all current rules preserved)
	# ---------------------------------------------------------------------------

	GENERAL_PROGRAMMING_PATTERNS = [
	# Python context managers
	r'\bcontext\s+manager\b', r'\bcontextlib\b', r'\b__enter__\b', r'\b__exit__\b',
	r'\basync\s+context\s+manager\b',
	# ML / LSTM forget gates
	r'\bforget\s+gate\b', r'\blstm\b', r'\bcatastrophic\s+forgetting\b',
	r'\bforget\s+bias\b', r'\belastic\s+weight\s+consolidation\b',
	# Web framework sessions
	r'\bexpress\.js\b', r'\bdjango\b', r'\bflask\b', r'\bfastapi\b',
	r'\bsession_start', r'\bsession\s+middleware\b', r'\bsession\s+affinity\b',
	# General CS
	r'\bgarbage\s+collection\b', r'\bgc\s+algorithm\b',
	r'\bmemory\s+management\s+in\s+rust\b',
	r'\bload\s+balanc', r'\bnginx\b', r'\bhaproxy\b',
	r'\bcontext\s+switch',
	r'\bsearch\s+algorithm\b',
	r'\bsearch\s+functionality\s+with\s+elasticsearch\b',
	r'\bhealth\s+check\s+endpoint\s+pattern\b',
	r'\belasticsearch\b', r'\bsolr\b', r'\blucene\b',
	r'\bretention\s+polic(?:y|ies)\s+(?:in|for|with)\s+(?:kafka|s3|aws|gcp|azure|cloud)',
	r'\bpostgresql\b.*\bmongodb\b', r'\bmongodb\b.*\bpostgresql\b',
	r'\bwrite\s+a\s+decorator\b', r'\bdecorator.*retries?\b',
	r'\bci/cd\b', r'\bgithub\s+actions\b',
	r'\bcors\b.*\bnode\.js\b', r'\bnode\.js\b.*\bcors\b',
	r'\bcap\s+theorem\b', r'\bbinary\s+search\s+tree\b',
	r'\bvirtual\s+dom\b', r'\breact\b.*\breconciliation\b',
	r'\bdependency\s+injection\b',
	r'\btcp\b.*\budp\b', r'\budp\b.*\btcp\b',
	r'\btime\s+complexity\b', r'\bquicksort\b',
	r'\bexponential\s+backoff\b', r'\bjitter\b.*\bretri', r'\bapi\s+retri',
	r'\bcelery\b.*\bqueue', r'\broute\s+tasks?\s+in\s+celery\b',
	r'\bknowledge\s+graph\b.*\b(?:function|search|algorithm|traversal)\b',
	r'\b(?:function|write\s+a\s+function|implement)\b.*\bknowledge\s+graph\b',
	r'\bsave\s+(?:user\s+)?preferences?\s+in\s+(?:react|redux|localstorage|a\s+database)\b',
	r'\bexport\s+(?:data\s+)?from\s+(?:postgresql|mysql|sqlite|a\s+database)\b',
	r'\bpostgresql\b.*\bcsv\b', r'\bcsv\b.*\bpostgresql\b',
	# Additional patterns from bfcl_eval.py
	r'\bgoroutine\b', r'\bwrite\s+barrier\b', r'\brust\b.*\bborrow\b',
	r'\barc\b.*\bmutex\b', r'\bpin\b.*\bfuture\b',
	r'\bwindow\s+function\b', r'\bmongodb\b', r'\bmongoexport\b',
	r'\bdijkstra\b', r'\bdepth.first\s+search\b', r'\bconsistent\s+hashing\b',
	r'\bb.tree\b', r'\bbloom\s+filter\b', r'\blru\s+cache\b', r'\bordereddict\b',
	r'\bhorizontalpodautoscal', r'\bprometheus\b', r'\betcd\b', r'\braft\b',
	r'\bzerolog\b', r'\belk\s+stack\b', r'\bopentelemetry\b',
	r'\bcrdt\b', r'\bsaga\s+pattern\b',
	r'\btrie\b', r'\bweakmap\b', r'\bpromise.based\s+queue\b',
	r'\bcovering\s+index\b', r'\bmaterialized\s+view\b',
	r'\btf-idf\b', r'\btrigram\b', r'\bfuzzy\s+search\b',
	r'\btopological\s+sort\b', r'\bcycle\s+detection\b',
	r'\bprim.s\b', r'\bkruskal.s\b', r'\bspanning\s+tree\b',
	r'\bhot.module\s+replacement\b', r'\bvite\b',
	r'\bserver\s+component\b', r'\bclient\s+component\b',
	r'\bdocker(?:file)?\b', r'\bblue.green\s+deploy', r'\brolling\s+deploy',
	r'\bsticky\s+session\b', r'\bsession\s+replication\b', r'\bsession\s+fixation\b',
	r'\bjwt\b.*\bhttponly\b',
	r'\bpaging\b.*\bmemory\b', r'\bmmap\b', r'\bstack\s+vs\s+heap\b',
	r'\bv8\s+engine\b', r'\bgenerational\s+collection\b',
	r'\boptimistic\s+lock', r'\bpessimistic\s+lock',
	r'\bcrdt\b', r'\beventual\s+consistency\b.*\bdynamo',
	# General knowledge / weather / math
	r"what'?s\s+the\s+weather\b", r'\bforecast\b.*\btoday\b',
	r'\bwrite\s+a\s+sql\s+query\b', r'\bsecond.highest\s+salary\b',
	r'\bsql\s+query\s+(?:that|to)\b',
	]

	PRISM_INTENT_PATTERNS = [
	r'\bprism\b', r'\bsession\s*ledger\b', r'\bhandoff\b', r'\bknowledge\s+base\b',
	r'\bknowledge\s+items?\b', r'\bour\s+knowledge\b',
	r'\bsave.*(?:session|ledger|handoff)\b', r'\bload\s+context\b',
	r'\b(?:search|find).*(?:memory|sessions?|conversations?|notes)\b',
	r'\bproject\b', r'\bwhat\s+(?:do\s+)?we\s+(?:know|have)\b',
	r'\binstitutional\s+knowledge\b', r'\bdocumented\b', r'\bcurated\b',
	r'\bmemory\s+entry\b', r'\bmemory\s+backend\b', r'\bdiagnostics\b',
	r'\bledger\b', r'\bcompact\b.*(?:ledger|entries|session)\b',
	r'\bexport.*(?:memory|backup)\b', r'\b(?:delete|nuke|wipe|remove).*(?:entry|memory|entries)\b',
	r'\blog.*(?:what|accomplished|session)\b', r'\brecord.*(?:session|what)\b',
	r'\bhand.(?:off|over)\b', r'\bbring.up\s+to\s+speed\b',
	r'\bbug\s+fix.*(?:local\s+model|handle)\b', r'\broute.*(?:task|this)\b',
	r'\bbackfill\b', r'\bsynthesize\b', r'\bsession\s+graph\b',
	r'\bsession\s+links?\b', r'\bedges?\s+(?:up\s+to\s+date|consistent)\b',
	r'\bgraph\s+integrit', r'\bdangling\b', r'\breconnect.*(?:session|links?|references?)\b',
	r'\bpatch.*(?:links?|gaps?)\b', r'\bmissing\s+links?\b',
	r'\bsave\s+experience\b', r'\brecord\s+(?:a\s+)?milestone\b',
	r'\brecord\s+(?:a\s+)?success\b', r'\bupvote\b', r'\bdownvote\b',
	r'\bretention\s+polic(?:y|ies)\b', r'\bauto.expir\b', r'\bttl\b',
	r'\bknowledge\s+entry\b', r'\bknowledge\s+record\b',
	]


	def validate_tool_call(prompt, tool_name, tool_args):
		"""Layer 3: reject obvious false-positive tool calls and remap semantic neighbors.
		Copied from swe_bench_test.py with additions from bfcl_eval.py.
		Returns (tool_name, tool_args), possibly changed if rejected or remapped.
		"""
		prompt_lower = prompt.lower()

		# Special NO_TOOL override: "confirm session link/graph consistency" → synthesize_edges
		if tool_name in ("NO_TOOL", "ERROR"):
			if re.search(r'\b(?:confirm|verify|validate|check|ensure)\b', prompt_lower):
				if re.search(r'\bsession\s+(?:link|edge|graph)\s+(?:consistency|consistent)\b', prompt_lower):
					proj_m = re.search(r'\b(?:for|on)\s+(?:the\s+)?([a-zA-Z][a-zA-Z0-9_-]+)\s+project\b', prompt_lower)
					return 'session_synthesize_edges', ({'project': proj_m.group(1)} if proj_m else {})
			return tool_name, tool_args

		# --- Group B remaps (before false-positive rejection) ---

		# "reconnect/patch up/dangling links" → backfill_links
		# But don't remap when "synthesize edges" is the explicit first action
		if tool_name in ('session_synthesize_edges', 'session_reconnect'):
			if re.search(r'\b(?:reconnect|backfill|patch\s+up|dangling|link\s+gaps?|missing\s+links?|fix\s+links?)\b', prompt_lower):
				if not re.search(r'^synthesize\b', prompt_lower) and \
						not re.search(r'\bsynthesiz\w+\s+edges?\s+for\b', prompt_lower):
					return 'session_backfill_links', tool_args

		# "verify/check/make sure session links/edges are consistent / graph integrity" → synthesize_edges
		if tool_name in ('session_health_check', 'session_backfill_links'):
			_has_verify_verb = re.search(
				r'\b(?:verify|validate|check|make\s+sure|ensure|confirm)\b', prompt_lower
			)
			_has_consistent_edge = re.search(
				r'\b(?:edges?|links?|graph)\b.*?\b(?:consistent|up\s+to\s+date|synthesized)\b'
				r'|\bconsistent\b.*?\b(?:edges?|links?|graph)\b'
				r'|\bsession\s+links?\b'
				r'|\bgraph\s+integrit',
				prompt_lower, re.DOTALL
			)
			if _has_verify_verb and _has_consistent_edge:
				return 'session_synthesize_edges', tool_args

		# "synthesize edges for X, then backfill" → synthesize_edges is the FIRST action
		if tool_name == 'session_backfill_links':
			if re.search(r'(?:^|\bfirst\b|\bstart\s+with)\s*synthesize\s+edges?\b', prompt_lower) or \
					re.search(r'^synthesize\b', prompt_lower):
				return 'session_synthesize_edges', tool_args

		# "wipe/clear old entries from knowledge base" → knowledge_forget (not compact_ledger)
		# BUT protect "session entries" / "session history" from this remap
		if tool_name == 'session_compact_ledger':
			if re.search(r'\bknowledge\b', prompt_lower) and re.search(r'\b(?:wipe|clear|delete|remove|entries)\b', prompt_lower):
				if not re.search(r'\bsession\s+(?:entries|history|ledger)\b', prompt_lower):
					return 'knowledge_forget', tool_args

		# "prune/trim/archive old session entries" → session_compact_ledger (not forget_memory)
		if tool_name in ('session_forget_memory', 'knowledge_forget'):
			if re.search(r'\b(?:prune|trim|archive|compress)\b', prompt_lower) and re.search(r'\b(?:session|ledger)\s+(?:entries|history)?\b', prompt_lower):
				proj_m = re.search(r'\b(?:for|on)\s+(?:the\s+)?([a-zA-Z][a-zA-Z0-9_-]+)\s+project\b', prompt_lower)
				return 'session_compact_ledger', ({'project': proj_m.group(1)} if proj_m else tool_args)

		# "archive old entries" (without 'knowledge') → session_compact_ledger
		if tool_name == 'session_forget_memory':
			if re.search(r'\b(?:archive|prune|trim)\s+old\s+entries\b', prompt_lower):
				if not re.search(r'\bknowledge\b', prompt_lower) and not re.search(r'\bmemory[_\s]id\b|mem-[a-z0-9]\b', prompt_lower):
					return 'session_compact_ledger', tool_args

		# "knowledge entries/items/records" + delete verbs → knowledge_forget (not session_forget_memory)
		if tool_name == 'session_forget_memory':
			if re.search(r'\bknowledge\s+(?:entr|items?|records?|base)\b', prompt_lower):
				return 'knowledge_forget', tool_args
			if re.search(r'\bknowledge\s+base\b', prompt_lower) and re.search(r'\b(?:entries|records|items)\b', prompt_lower):
				return 'knowledge_forget', tool_args
			# "delete/wipe entries from [project]" without a specific memory ID → knowledge_forget
			if re.search(r'\b(?:entries|records|logs?)\b', prompt_lower) and re.search(r'\bproject\b', prompt_lower):
				if not re.search(r'\bmemory[_\s]id\b|mem-[a-z0-9]|ID\s*[=:]\s*\S+', prompt):
					if not re.search(r'\b(?:session|ledger)\b', prompt_lower):
						proj_m = re.search(r'(?:for|from|in)\s+(?:the\s+)?([a-zA-Z][a-zA-Z0-9_-]+)\s+project', prompt_lower, re.I)
						return 'knowledge_forget', {'project': proj_m.group(1) if proj_m else None}

		# "where were we / bring me up to speed / catch me up" → session_load_context (not session_search_memory)
		if tool_name == 'session_search_memory':
			if re.search(r'\bwhere\s+were\s+we\b|\bbring\s+me\s+up\s+to\s+speed\b|\bcatch\s+me\s+up\b|\bwhat\s+were\s+we\s+(?:doing|working)', prompt_lower):
				project_m = re.search(
					r'\b(?:on|for|with|of)\s+(?:the\s+)?([a-zA-Z][a-zA-Z0-9_-]+)\s+project\b'
					r'|\b([a-zA-Z][a-zA-Z0-9_-]+)\s+project\b'
					r'|(?:state\s+of\s+(?:the\s+)?)([a-zA-Z][a-zA-Z0-9_-]+)(?:\s+project)?\b',
					prompt_lower
				)
				if project_m:
					project = next((g for g in project_m.groups() if g and g not in ('the', 'a', 'this', 'that', 'my', 'our')), None)
				else:
					project = None
				return 'session_load_context', ({'project': project} if project else {})

		# "accumulated documentation / knowledge base" → knowledge_search (not session_search_memory)
		if tool_name == 'session_search_memory':
			if re.search(r'\baccumulated\s+documentation\b|\bknowledge\s+base\b', prompt_lower):
				return 'knowledge_search', tool_args

		# "recent / past / last week / what we did" → session_search_memory (not knowledge_search)
		if tool_name == 'knowledge_search':
			session_hints = [
				r'\brecent\b', r'\bpast\b', r'\blast\s+(?:week|month|session)',
				r'\bwhat\s+we\s+(?:did|decided|worked)', r'\bdeployment\s+issues\b',
			]
			if any(re.search(p, prompt_lower) for p in session_hints):
				return 'session_search_memory', tool_args

		# "remind me / did we ever decide" → session_search_memory (not load_context)
		if tool_name == 'session_load_context':
			if re.search(r'\bremind\s+me\b|\bdid\s+we\s+ever\s+(?:decide|settle|choose|pick)\b|\bwhat\s+did\s+we\s+decide\b', prompt_lower):
				if not re.search(r'\bbring\s+me\s+up\s+to\s+speed\b|\bwhere\s+were\s+we\b|\bcatch\s+me\s+up\b|\bload\s+.*\bcontext\b', prompt_lower):
					return 'session_search_memory', {"query": prompt[:120]}

		# "jot down / write down / make a note / log what just happened" → session_save_ledger
		_LEDGER_TRIGGERS = re.compile(
			r'\bjot\s+down\b|\bwrite\s+(?:it\s+)?down\b|\bwhat\s+we\s+accomplished\b'
			r'|\bmake\s+sure\s+it.{0,10}written\b|\brecord\s+(?:this\s+session|what)\b'
			r'|\bmake\s+(?:a\s+)?note\s+(?:that|of)\b|\blog\s+what\s+just\s+happened\b'
			r'|\bwrite\s+down\s+everything\b|\bbefore\s+I\s+(?:close|head\s+out)\b',
			re.IGNORECASE
		)
		# negative: milestone/achievement events that belong in save_experience
		_EXPERIENCE_NEGATIVE = re.compile(
			r'\b(?:successfully|milestone|achievement|deployed\s+the|shipped\s+the|launched\s+the'
			r'|we\s+(?:fixed|built|completed|created|resolved|deployed|shipped|launched)\s+the'
			r'|race\s+condition|solid\s+now|zero.downtime)\b'
		)
		# Unambiguous note-taking phrases bypass the milestone negative check
		_NOTE_TRIGGERS = re.compile(
			r'\bmake\s+(?:a\s+)?note\s+(?:that|of)\b|\bjot\s+down\b'
			r'|\bwrite\s+(?:it\s+)?down\b|\blog\s+what\s+just\s+happened\b',
			re.IGNORECASE
		)
		if tool_name in ('session_save_experience', 'session_task_route'):
			if _LEDGER_TRIGGERS.search(prompt):
				if _NOTE_TRIGGERS.search(prompt) or not _EXPERIENCE_NEGATIVE.search(prompt_lower):
					if 'content' in tool_args and 'summary' not in tool_args:
						tool_args = dict(tool_args)
						tool_args['summary'] = tool_args.pop('content')
					if 'summary' not in tool_args:
						work_m = re.search(r'(?:we\s+)?((?:rewrote|fixed|refactored|built|deployed|updated|added|removed|finalized|completed|migrated)\s+.{10,120})', prompt, re.I)
						if not work_m:
							work_m = re.search(r'(?:make\s+a\s+note|log|note)\s+(?:that\s+)?(?:we\s+)?(completed|finished|did|wrote|refactored|migrated).{0,120}', prompt, re.I)
						if work_m:
							tool_args = dict(tool_args)
							tool_args['summary'] = work_m.group(0).strip().rstrip('.')
					return 'session_save_ledger', tool_args

		# "record that we fixed/built/resolved [thing]" → session_save_experience (milestone)
		if tool_name == 'session_save_ledger':
			if re.search(r'\brecord\s+that\s+we\s+(?:fixed|built|completed|created|resolved|deployed|shipped|launched)\b', prompt_lower):
				return 'session_save_experience', {"project": tool_args.get("project"), "event_type": "milestone"}

		# content → summary normalization + inline extraction for session_save_ledger
		if tool_name == 'session_save_ledger':
			if 'content' in tool_args and 'summary' not in tool_args:
				tool_args = dict(tool_args)
				tool_args['summary'] = tool_args.pop('content')
			if 'summary' not in tool_args:
				work_m = re.search(r'(?:we\s+)?((?:rewrote|fixed|refactored|built|deployed|updated|added|removed|finalized|completed|migrated)\s+.{10,120})', prompt, re.I)
				if not work_m:
					work_m = re.search(r'(?:log|note|record)\s+(?:what\s+just\s+happened|this|that)\s*[:;]\s*(.{10,120})', prompt, re.I)
				if work_m:
					tool_args = dict(tool_args)
					tool_args['summary'] = (work_m.group(1) if work_m.lastindex else work_m.group(0)).strip().rstrip('.')

		# "log that we successfully deployed/shipped" → session_save_experience milestone (not save_ledger)
		if tool_name == 'session_save_ledger':
			if re.search(r'\blog\s+that\s+we\s+successfully\b|\bsuccessfully\s+deployed\b|\bsuccessfully\s+shipped\b|\bsuccessfully\s+launched\b', prompt_lower):
				return 'session_save_experience', {"project": tool_args.get("project"), "event_type": "success"}

		# "shift change / store current state for next agent" → session_save_handoff
		if tool_name == 'session_save_ledger':
			if re.search(r'\bshift\s+change\b|\bstore\s+(?:the\s+)?current\s+state\s+for\b|\bnext\s+(?:agent|person|developer)\s+can\s+continue\b|\bhand.over\b|\bpick.up\s+next\b', prompt_lower):
				return 'session_save_handoff', tool_args

	    # Multi-intent: "Search/Find ... THEN upvote/downvote" → first action is search
	    if tool_name in ('knowledge_upvote', 'knowledge_downvote'):
	        if re.search(r'\bthen\s+(?:upvote|downvote|boost|rate\s+up|rate\s+down)\b', prompt_lower):
	            if re.search(r'^(?:search|find|look\s+up)\b', prompt_lower):
	                query_m = re.search(
	                    r'^(?:search\s+(?:for\s+)?|find\s+(?:our\s+)?(?:notes?\s+on\s+)?|look\s+up\s+)(.+?)(?:,?\s*then\b)',
	                    prompt, re.I
	                )
	                return 'session_search_memory', {"query": query_m.group(1).strip() if query_m else prompt[:120]}

	    # invalid tool name → try retention or upvote/downvote
	    if tool_name not in VALID_TOOLS:
	        if re.search(r'\b(?:auto.?expir|ttl\b|\d+\s*days?\s+(?:retention|expir)|\bretention\s*polic)', prompt_lower):
	            return 'knowledge_set_retention', tool_args
	        # fall through to upvote/downvote patterns below

	    # knowledge_forget / knowledge_set_retention → upvote/downvote protection
	    _UPVOTE_SET = {'knowledge_forget', 'knowledge_set_retention', 'session_forget_memory',
	                   'session_task_route', 'session_search_memory'}
	    # Don't remap to upvote/downvote when primary intent is "search THEN upvote"
	    _is_search_then_vote = (
	        re.search(r'^(?:search|find|look\s+up)\b', prompt_lower) and
	        re.search(r'\bthen\s+(?:upvote|downvote|boost|rate\s+up|rate\s+down)\b', prompt_lower)
	    )
	    if (tool_name in _UPVOTE_SET or tool_name not in VALID_TOOLS) and not _is_search_then_vote:
	        _id_val = (tool_args.get("id") or tool_args.get("knowledge_id") or tool_args.get("entry_id")) if isinstance(tool_args, dict) else None
	        if re.search(r'\b(?:upvote|boost|increase\s+(?:the\s+|its\s+)?(?:rank|score|importance)|uprate|thumbs[\s-]?up|mark\s+(?:it\s+)?(?:up|helpful|useful|great|good)|importance\s+score)\b', prompt_lower):
	            return 'knowledge_upvote', {"id": _id_val}
	        if re.search(r'\b(?:downvote|lower\s+(?:the\s+|its\s+)?(?:rank|score)|not\s+useful|derank|thumbs[\s-]?down|reduce\s+(?:the\s+|its\s+)?(?:rank|score)|mark\s+(?:it\s+)?(?:down|bad|wrong|outdated|terrible))\b', prompt_lower):
	            return 'knowledge_downvote', {"id": _id_val}

	    # session_load_context: extract project from prompt if missing
	    if tool_name == 'session_load_context':
	        if not (isinstance(tool_args, dict) and tool_args.get('project')):
	            proj_m = re.search(
	                r'\b(?:on|for|of|with|in)\s+(?:the\s+)?([a-zA-Z][a-zA-Z0-9_-]+)\s+project\b'
	                r'|\b([a-zA-Z][a-zA-Z0-9_-]+)\s+project\b'
	                r'|(?:state\s+of\s+(?:the\s+)?)([a-zA-Z][a-zA-Z0-9_-]+)(?:\s+project)?\b',
	                prompt_lower
	            )
	            if proj_m:
	                proj = next((g for g in proj_m.groups() if g), None)
	                if proj and proj not in ('the', 'a', 'this', 'that', 'my', 'our'):
	                    tool_args = dict(tool_args) if isinstance(tool_args, dict) else {}
	                    tool_args['project'] = proj

	    # session_compact_ledger: extract project if missing
	    if tool_name == 'session_compact_ledger':
	        if not (isinstance(tool_args, dict) and tool_args.get('project')):
	            proj_m = re.search(
	                r'\b(?:for|on|of)\s+(?:the\s+)?([a-zA-Z][a-zA-Z0-9_-]+)\s+(?:project\s+)?ledger\b'
	                r'|\b([a-zA-Z][a-zA-Z0-9_-]+)\s+project\s+ledger\b'
	                r'|\b(?:compact|trim|prune|compress|archive)\s+(?:the\s+)?([a-zA-Z][a-zA-Z0-9_-]+)\s+(?:project|ledger)\b',
	                prompt_lower
	            )
	            if proj_m:
	                proj = next((g for g in proj_m.groups() if g), None)
	                if proj and proj not in ('the', 'a', 'this', 'that', 'my', 'our', 'old', 'stale'):
	                    tool_args = dict(tool_args) if isinstance(tool_args, dict) else {}
	                    tool_args['project'] = proj

	    # "is this something the local model can handle? / route this task" → session_task_route
	    # (patterns are matched against the lowercased prompt, so "I" is written as "i")
	    if tool_name == 'session_search_memory':
	        if re.search(r'\b(?:local\s+(?:model|agent)\s+(?:can\s+handle|should\s+handle)|route\s+this\s+task|should\s+(?:i|the\s+local\s+model)\s+(?:tackle|handle)|is\s+this\s+(?:something|simple\s+enough)\s+(?:for\s+the\s+)?local)\b', prompt_lower):
	            return 'session_task_route', {"task_description": prompt}

	    # session_task_route: extract task_description from prompt
	    if tool_name == 'session_task_route':
	        if 'task_description' not in tool_args or not tool_args.get('task_description'):
	            tool_args = dict(tool_args)
	            tool_args['task_description'] = prompt

	    # session_export_memory: extract output_path from path patterns, format from keywords
	    if tool_name == 'session_export_memory':
	        if not isinstance(tool_args, dict):
	            tool_args = {}
	        tool_args = dict(tool_args)
	        if 'output_path' not in tool_args or not tool_args.get('output_path'):
	            path_m = re.search(
	                r'(?:save\s+to|(?:output|export|dump)\s+(?:to\s+)?["\']?|to\s+["\']?)(/[\w/.-]+|~/[\w/.-]+)',
	                prompt, re.I
	            )
	            if path_m:
	                tool_args['output_path'] = path_m.group(1)
	        if 'format' not in tool_args or not tool_args.get('format'):
	            fmt_m = re.search(r'\b(json|jsonl|markdown|csv|yaml)\b(?:\s+format)?', prompt_lower)
	            if fmt_m:
	                tool_args['format'] = fmt_m.group(1)

	    # session_compact_ledger: protect "session entries" from knowledge_forget remap
	    # (already handled above but ensure compact stays for session-specific prompts)

	    # "where did we leave off / what was the state" → session_load_context
	    if tool_name == 'session_search_memory':
	        if re.search(r'\bwhere\s+did\s+we\s+leave\s+off\b|\bwhat\s+was\s+the\s+state\s+of\b|\bget\s+me\s+(?:re-?oriented|up\s+to\s+speed)\b|\bpull\s+up\s+(?:whatever|the\s+(?:full\s+)?context)', prompt_lower):
	            project_m = re.search(r'\b(?:on|for|with|of)\s+(?:the\s+)?([a-zA-Z][a-zA-Z0-9_-]+)\s+project\b', prompt_lower)
	            project = project_m.group(1) if project_m else None
	            return 'session_load_context', ({'project': project} if project else {})

	    # --- Social pleasantry rejection ---
	    SOCIAL_PATTERNS = [
	        r'^thanks', r'^thank you', r'^cheers', r'^goodbye', r'^bye',
	        r"that's all", r"we're done", r"all done", r"all set",
	        r'^ok\s+great', r'^perfect$', r'^nice$', r'^cool$',
	        r'^hi\b', r'^hey\b', r'^hello\b', r'^good\s+morning', r'^good\s+afternoon',
	    ]
	    is_social = any(re.search(p, prompt_lower.strip()) for p in SOCIAL_PATTERNS)
	    if is_social and not any(w in prompt_lower for w in [
	        'save', 'export', 'search', 'load', 'record', 'log', 'run', 'check', 'find',
	        'compact', 'handoff', 'route', 'synthesize', 'backfill', 'forget', 'upvote', 'downvote',
	    ]):
	        return "NO_TOOL", {}

	    # --- False-positive rejection (CS patterns) ---
	    is_general = any(re.search(p, prompt_lower) for p in GENERAL_PROGRAMMING_PATTERNS)

	    if not is_general:
	        return tool_name, tool_args

	    has_prism_intent = any(re.search(p, prompt_lower) for p in PRISM_INTENT_PATTERNS)

	    if has_prism_intent:
	        return tool_name, tool_args

	    return "NO_TOOL", {}


	# ---------------------------------------------------------------------------
	# Ollama Call
	# ---------------------------------------------------------------------------

	TOOL_CALL_NOPIPE_RE = re.compile(
	    r'<tool_call>\s*(\{.*?\})\s*(?:</tool_call>|$)',
	    re.DOTALL
	)
	TOOL_CALL_PIPE_RE = re.compile(
	    r'<\|tool_call\|>\s*(\{.*?\})',
	    re.DOTALL
	)
	BARE_JSON_RE = re.compile(
	    r'(\{[^{}]*"name"\s*:\s*"[^"]+?"[^{}]*(?:\{[^{}]*\}[^{}]*)*\})'
	)


	def call_ollama(prompt: str, timeout: int = 120) -> tuple:
	    """Call Ollama REST API with a pre-formatted ChatML prompt.
	    Returns (raw_response, tool_name, tool_args, latency_secs).
	    """
	    start = time.time()
	    try:
	        payload = json.dumps({
	            "model": MODEL,
	            "prompt": prompt,
	            "stream": False,
	            "raw": True,
	            "options": {"temperature": 0.0, "num_predict": 512},
	        }).encode("utf-8")
	        req = urllib.request.Request(
	            OLLAMA_API,
	            data=payload,
	            headers={"Content-Type": "application/json"},
	        )
	        with urllib.request.urlopen(req, timeout=timeout) as resp:
	            data = json.loads(resp.read().decode("utf-8"))
	        raw = data.get("response", "").strip()
	    except Exception as exc:
	        return (str(exc), "ERROR", {}, time.time() - start)

	    latency = time.time() - start

	    # Strip CoT blocks
	    clean = re.sub(
	        r'<\|synalux_think\|>.*?(?:</\|synalux_think\|>|$)',
	        '', raw, flags=re.DOTALL
	    )

	    # Strategy 0: no-pipe <tool_call>…</tool_call> (v43 native format)
	    m = TOOL_CALL_NOPIPE_RE.search(clean)
	    if m:
	        try:
	            tj = json.loads(m.group(1))
	            return (raw, tj.get("name", tj.get("tool", "UNKNOWN")),
	                    tj.get("arguments", tj.get("args", {})), latency)
	        except json.JSONDecodeError:
	            pass

	    # Strategy 1: piped <|tool_call|>
	    m = TOOL_CALL_PIPE_RE.search(clean)
	    if m:
	        try:
	            tj = json.loads(m.group(1))
	            return (raw, tj.get("name", tj.get("tool", "UNKNOWN")),
	                    tj.get("arguments", tj.get("args", {})), latency)
	        except json.JSONDecodeError:
	            pass

	    # Strategy 2: bare JSON with "name" key
	    m = BARE_JSON_RE.search(clean)
	    if m:
	        try:
	            tj = json.loads(m.group(0))
	            return (raw, tj.get("name", "UNKNOWN"),
	                    tj.get("arguments", tj.get("args", {})), latency)
	        except json.JSONDecodeError:
	            pass

	    return (raw, "NO_TOOL", {}, latency)


	# ---------------------------------------------------------------------------
	# Scoring
	# ---------------------------------------------------------------------------

	def evaluate_result(expected_tool, required_params, got_tool, got_args):
	    """
	    Returns one of:
	      strict_pass: correct tool + all required_params present (or NO_TOOL correctly returned)
	      partial_pass: correct tool, but one or more required_params missing or empty
	      wrong_tool: a tool was called, but not the expected one
	      false_positive: tool called when NO_TOOL expected
	      false_negative: NO_TOOL returned when tool expected
	    """
	    if expected_tool == "NO_TOOL":
	        return "false_positive" if got_tool != "NO_TOOL" else "strict_pass"

	    if got_tool == "NO_TOOL":
	        return "false_negative"

	    # Accept either search tool for ambiguous prompts
	    tools_match = (got_tool == expected_tool) or (
	        expected_tool in ("session_search_memory", "knowledge_search") and
	        got_tool in ("session_search_memory", "knowledge_search")
	    )
	    if not tools_match:
	        return "wrong_tool"

	    if not required_params:
	        return "strict_pass"

	    if not isinstance(got_args, dict):
	        got_args = {}

	    present = [p for p in required_params if p in got_args and got_args[p] not in (None, "", [])]
	    if len(present) == len(required_params):
	        return "strict_pass"
	    # Right tool, but some (or all) required params are missing: half credit
	    return "partial_pass"


	def score(verdict):
	    if verdict == "strict_pass":
	        return 1.0
	    if verdict == "partial_pass":
	        return 0.5
	    return 0.0


	# ---------------------------------------------------------------------------
	# Main Eval
	# ---------------------------------------------------------------------------

	def run_once(tests, shuffle=False, run_label=""):
	    """Run one full pass over test suite. Returns (results_list, category_stats)."""
	    indexed = list(enumerate(tests))
	    if shuffle:
	        random.shuffle(indexed)

	    results = [None] * len(tests)
	    category_stats = {}

	    for display_i, (orig_idx, (prompt, expected, req_params, category)) in enumerate(indexed, 1):
	        chatml = (
	            f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
	            f"<|im_start|>user\n{prompt}<|im_end|>\n"
	            f"<|im_start|>assistant\n"
	        )
	        raw, got_tool, got_args, latency = call_ollama(chatml)
	        got_tool, got_args = validate_tool_call(prompt, got_tool, got_args)
	        verdict = evaluate_result(expected, req_params, got_tool, got_args)

	        icon = "OK" if verdict == "strict_pass" else ("~~" if verdict == "partial_pass" else "XX")
	        tag = f"#{orig_idx + 1:03d}"
	        short = prompt[:52]
	        run_info = f"[{run_label}] " if run_label else ""
	        print(
	            f" {run_info}[{display_i:3d}/{len(tests)}] {icon} {tag} "
	            f"expect={expected:30s} got={got_tool:30s} {latency:5.1f}s | {short}"
	        )
	        if verdict != "strict_pass":
	            if verdict == "partial_pass":
	                missing = [p for p in req_params if p not in got_args or got_args.get(p) in (None, "", [])]
	                print(f" -> partial: missing params {missing}")
	            elif verdict == "false_positive":
	                print(f" -> FALSE POSITIVE: called {got_tool} (expected NO_TOOL)")
	            elif verdict == "false_negative":
	                print(f" -> FALSE NEGATIVE: no tool called (expected {expected})")
	            elif verdict == "wrong_tool":
	                print(f" -> WRONG TOOL: expected {expected}, got {got_tool}")

	        results[orig_idx] = {
	            "id": orig_idx + 1,
	            "prompt": prompt,
	            "expected": expected,
	            "got": got_tool,
	            "got_args": got_args,
	            "verdict": verdict,
	            "latency": latency,
	            "category": category,
	            "points": score(verdict),
	        }

	        if category not in category_stats:
	            category_stats[category] = {"total": 0, "strict": 0, "partial": 0, "fail": 0, "points": 0.0}
	        cat = category_stats[category]
	        cat["total"] += 1
	        cat["points"] += score(verdict)
	        if verdict == "strict_pass":
	            cat["strict"] += 1
	        elif verdict == "partial_pass":
	            cat["partial"] += 1
	        else:
	            cat["fail"] += 1

	    return results, category_stats


	def print_run_summary(results, category_stats, run_label=""):
	    strict = sum(1 for r in results if r["verdict"] == "strict_pass")
	    partial = sum(1 for r in results if r["verdict"] == "partial_pass")
	    fp = sum(1 for r in results if r["verdict"] == "false_positive")
	    fn = sum(1 for r in results if r["verdict"] == "false_negative")
	    wt = sum(1 for r in results if r["verdict"] == "wrong_tool")
	    total = len(results)
	    total_points = sum(r["points"] for r in results)

	    tool_tests = [r for r in results if r["expected"] != "NO_TOOL"]
	    no_tool_tests = [r for r in results if r["expected"] == "NO_TOOL"]
	    no_tool_correct = sum(1 for r in no_tool_tests if r["verdict"] == "strict_pass")
	    # Guard against a suite with no NO_TOOL cases (avoids ZeroDivisionError below)
	    abstention_rate = no_tool_correct / len(no_tool_tests) if no_tool_tests else 0
	    hallucinations = sum(1 for r in results if r["verdict"] == "false_positive")
	    avg_lat = sum(r["latency"] for r in results) / total if total else 0

	    lbl = f" (Run {run_label})" if run_label else ""
	    print()
	    print("=" * 80)
	    print(f" EVAL-300 RESULTS{lbl}")
	    print("=" * 80)
	    print(f" Strict Pass: {strict}/{total} = {strict / total * 100:.1f}%")
	    print(f" Partial Pass: {partial}/{total} = {partial / total * 100:.1f}%")
	    print(f" Wrong Tool: {wt}/{total}")
	    print(f" False Positives: {fp}/{total} (hallucinations)")
	    print(f" False Negatives: {fn}/{total}")
	    print(f" ---")
	    print(f" strict_pct (strict/total): {strict / total * 100:.1f}%")
	    print(f" weighted_pct (total_points/total): {total_points / total * 100:.1f}%")
	    print(f" Abstention accuracy: {no_tool_correct}/{len(no_tool_tests)} = {abstention_rate * 100:.1f}%")
	    print(f" Hallucinations: {hallucinations} (target = 0)")
	    print(f" Avg latency: {avg_lat:.1f}s")
	    print()
	    print(f" {'Category':<22} {'Strict':>7} {'Partial':>8} {'Fail':>5} {'Pts/Tot':>10} {'Pct':>6}")
	    print(f" {'-'*22} {'-'*7} {'-'*8} {'-'*5} {'-'*10} {'-'*6}")
	    for cat, s in sorted(category_stats.items()):
	        pts_pct = s["points"] / s["total"] * 100 if s["total"] else 0
	        print(f" {cat:<22} {s['strict']:>7} {s['partial']:>8} {s['fail']:>5} "
	              f"{s['points']:>5.1f}/{s['total']:<4} {pts_pct:>5.1f}%")
	    print("=" * 80)
	    return {
	        "strict": strict,
	        "partial": partial,
	        "wrong_tool": wt,
	        "false_positive": fp,
	        "false_negative": fn,
	        "total": total,
	        "total_points": total_points,
	        "strict_pct": strict / total,
	        "weighted_pct": total_points / total,
	        "abstention_rate": abstention_rate,
	        "hallucinations": hallucinations,
	        "avg_latency": avg_lat,
	        "category_stats": category_stats,
	    }


	def main():
	    parser = argparse.ArgumentParser(description="Eval-300: 300-case standard evaluation for prism-coder")
	    parser.add_argument("--model", type=str, default=None,
	                        help="Ollama model tag to evaluate (default: prism-coder:4b-v43)")
	    parser.add_argument("--runs", type=int, default=1,
	                        help="Number of eval runs (default: 1; use 3 for stability check)")
	    parser.add_argument("--shuffle", action="store_true",
	                        help="Randomize test order each run")
	    parser.add_argument("--no-validate-layer3", action="store_true",
	                        help="Disable Layer 3 false-positive rejection "
	                             "(use during RFT/DPO so model sees true failures)")
	    args = parser.parse_args()

	    global MODEL, validate_tool_call
	    if args.model:
	        MODEL = args.model

	    if args.no_validate_layer3:
	        def validate_tool_call(prompt, tool_name, tool_args):  # noqa: F811
	            return tool_name, tool_args

	    _verify_test_counts()

	    print("=" * 80)
	    print(f" EVAL-300 — prism-coder standard evaluation")
	    print(f" Model: {MODEL}")
	    print(f" Tests: {len(TESTS)}")
	    print(f" Runs: {args.runs}" + (" (RANDOMIZED ORDER each run)" if args.shuffle else ""))
	    print(f" Layer3: {'DISABLED' if args.no_validate_layer3 else 'enabled'}")
	    print("=" * 80)

	    all_run_summaries = []
	    all_run_results = []

	    for run_idx in range(args.runs):
	        run_label = str(run_idx + 1) if args.runs > 1 else ""
	        # Seed the RNG with the value that gets printed, so a shuffled order can be reproduced
	        seed = random.randint(1000, 9999) if args.shuffle else None
	        if seed is not None:
	            random.seed(seed)
	        if args.runs > 1:
	            print(f"\n{'#' * 80}")
	            print(f" RUN {run_idx + 1} / {args.runs}" +
	                  (f" (seed={seed})" if seed is not None else ""))
	            print(f"{'#' * 80}")

	        results, cat_stats = run_once(TESTS, shuffle=args.shuffle, run_label=run_label)
	        summary = print_run_summary(results, cat_stats, run_label=run_label)
	        all_run_summaries.append(summary)
	        all_run_results.append(results)

	    # ---------------------------------------------------------------------------
	    # Multi-run aggregate
	    # ---------------------------------------------------------------------------
	    if args.runs > 1:
	        strict_scores = [s["strict"] for s in all_run_summaries]
	        weighted_pcts = [s["weighted_pct"] * 100 for s in all_run_summaries]
	        total = all_run_summaries[0]["total"]
	        halluc_counts = [s["hallucinations"] for s in all_run_summaries]

	        # Per-test stability
	        per_test_pass = [0] * len(TESTS)
	        per_test_fail_tools = [[] for _ in range(len(TESTS))]
	        for run_results in all_run_results:
	            for r in run_results:
	                idx = r["id"] - 1
	                if r["verdict"] == "strict_pass":
	                    per_test_pass[idx] += 1
	                else:
	                    per_test_fail_tools[idx].append(r.get("got", "???"))

	        med_strict = statistics.median(strict_scores)
	        avg_strict = statistics.mean(strict_scores)
	        med_weighted = statistics.median(weighted_pcts)

	        print(f"\n{'=' * 80}")
	        print(f" MULTI-RUN SUMMARY ({args.runs} runs x {total} tests)")
	        print(f"{'=' * 80}")
	        print(f" Strict scores: {' | '.join(f'{s}/{total}' for s in strict_scores)}")
	        print(f" Median strict: {med_strict}/{total} = {med_strict / total * 100:.1f}%")
	        print(f" Average strict: {avg_strict:.1f}/{total} = {avg_strict / total * 100:.1f}%")
	        print(f" Weighted pct: {' | '.join(f'{p:.1f}%' for p in weighted_pcts)} "
	              f"(median {med_weighted:.1f}%)")
	        print(f" Hallucinations: {' | '.join(str(h) for h in halluc_counts)} "
	              f"(target = 0 each run)")
	        print()
	        print(f" Flaky tests (< 100% pass rate across {args.runs} runs):")
	        flaky = []
	        for i, (prompt, expected, _, cat) in enumerate(TESTS):
	            rate = per_test_pass[i] / args.runs
	            if rate < 1.0:
	                fail_tools = per_test_fail_tools[i]
	                flaky.append((i + 1, rate, expected, set(fail_tools), cat, prompt[:60]))
	        if flaky:
	            for fid, rate, exp, fails, fcat, fshort in sorted(flaky, key=lambda x: x[1]):
	                print(f" [{fid:03d}] {rate * 100:3.0f}% | cat={fcat:<18s} | expect={exp:<28s} | fails->{','.join(fails):<20s} | {fshort}")
	        else:
	            print(" All tests passed consistently across all runs!")
	        print(f" Total flaky: {len(flaky)}/{total}")
	        print(f"{'=' * 80}")

	    # ---------------------------------------------------------------------------
	    # Save JSON report
	    # ---------------------------------------------------------------------------
	    os.makedirs("results", exist_ok=True)
	    report_path = "results/eval300_report.json"
	    if args.runs > 1:
	        # Aggregate view across runs; the per-run summaries are kept alongside it
	        final_summary = {
	            "runs": args.runs,
	            "strict_scores": strict_scores,
	            "median_strict": statistics.median(strict_scores) / total,
	            "avg_strict": statistics.mean(strict_scores) / total,
	            "median_weighted_pct": statistics.median(weighted_pcts) / 100,
	            "hallucinations_per_run": halluc_counts,
	            "per_run_summaries": all_run_summaries,
	        }
	    else:
	        final_summary = all_run_summaries[0]

	    report = {
	        "model": MODEL,
	        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
	        "total_tests": len(TESTS),
	        "runs": args.runs,
	        "shuffle": args.shuffle,
	        "layer3_enabled": not args.no_validate_layer3,
	        "summary": final_summary,
	        "last_run_results": all_run_results[-1],
	    }
	    with open(report_path, "w") as f:
	        json.dump(report, f, indent=2, default=str)
	    print(f"\nReport saved: {report_path}")

	    # Exit code: fail if last run strict < 90%
	    last_strict_pct = all_run_summaries[-1]["strict_pct"] * 100
	    if last_strict_pct < 90.0:
	        print(f"FAIL: strict_pct {last_strict_pct:.1f}% is below 90% gate")
	        sys.exit(1)
	    else:
	        print(f"PASS: strict_pct {last_strict_pct:.1f}%")
	        sys.exit(0)


	if __name__ == "__main__":
	    main()
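
The three parsing strategies used in `call_ollama` can be tried out without a running Ollama server. The sketch below is a minimal approximation: `parse_tool_call` is a hypothetical helper name that does not exist in the script, the sample strings are invented for illustration, and the fallback for a `"tool"` key is applied to all three strategies for brevity.

```python
import json
import re

# Same three patterns as the script: tagged, piped, then bare JSON.
TOOL_CALL_NOPIPE_RE = re.compile(
    r'<tool_call>\s*(\{.*?\})\s*(?:</tool_call>|$)', re.DOTALL
)
TOOL_CALL_PIPE_RE = re.compile(r'<\|tool_call\|>\s*(\{.*?\})', re.DOTALL)
BARE_JSON_RE = re.compile(
    r'(\{[^{}]*"name"\s*:\s*"[^"]+?"[^{}]*(?:\{[^{}]*\}[^{}]*)*\})'
)


def parse_tool_call(text):
    """Hypothetical helper: return (tool_name, tool_args), or ("NO_TOOL", {})."""
    for pattern in (TOOL_CALL_NOPIPE_RE, TOOL_CALL_PIPE_RE, BARE_JSON_RE):
        m = pattern.search(text)
        if not m:
            continue
        try:
            tj = json.loads(m.group(1))
        except json.JSONDecodeError:
            continue  # candidate was not valid JSON; try the next strategy
        name = tj.get("name", tj.get("tool", "UNKNOWN"))
        args = tj.get("arguments", tj.get("args", {}))
        return name, args
    return "NO_TOOL", {}


if __name__ == "__main__":
    samples = [
        '<tool_call>\n{"name": "session_load_context", "arguments": {"project": "prism"}}\n</tool_call>',
        '<|tool_call|>{"name": "knowledge_upvote", "arguments": {"id": "abc"}}',
        'Paris is the capital of France.',
    ]
    for sample in samples:
        print(parse_tool_call(sample))
```

Because the piped pattern is lazy and has no closing anchor, it stops at the first `}`. A piped call with nested `arguments` therefore fails JSON decoding under that strategy and is picked up by the bare-JSON pattern instead, which is why the order of the fallbacks matters.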