haroldshipibm committed on
Commit
d075a5b
·
verified ·
1 Parent(s): d370bed

Upload folder using huggingface_hub

Dockerfile ADDED
@@ -0,0 +1,35 @@
+ FROM python:3.12-slim
+
+ # Set up user for HuggingFace Spaces
+ RUN useradd -m -u 1000 user
+ USER user
+ ENV HOME=/home/user \
+     PATH=/home/user/.local/bin:$PATH
+
+ WORKDIR $HOME/app
+
+ # Install dependencies
+ COPY --chown=user requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy application files
+ COPY --chown=user . .
+
+ # Download data from HF Dataset on build (all task suites + fixture data)
+ RUN python -c "from huggingface_hub import hf_hub_download; \
+     import os; os.makedirs('data', exist_ok=True); \
+     r='ibm-research/bpo-benchmark'; d='dataset'; \
+     hf_hub_download(r, 'candidate_data.parquet', local_dir='data', repo_type=d); \
+     hf_hub_download(r, 'tasks.json', local_dir='data', repo_type=d); \
+     hf_hub_download(r, 'tasks_type_mismatch.json', local_dir='data', repo_type=d); \
+     hf_hub_download(r, 'tasks_http_errors.json', local_dir='data', repo_type=d); \
+     hf_hub_download(r, 'tasks_schema_violations.json', local_dir='data', repo_type=d); \
+     hf_hub_download(r, 'tasks_edge_cases.json', local_dir='data', repo_type=d); \
+     hf_hub_download(r, 'tasks_undocumented.json', local_dir='data', repo_type=d); \
+     hf_hub_download(r, 'large_response_fixture.json', local_dir='data', repo_type=d)"
+
+ # Expose port for Gradio
+ EXPOSE 7860
+
+ # Start the application
+ CMD ["python", "app.py"]
README.md CHANGED
@@ -1,12 +1,185 @@
  ---
- title: BPO Bench
- emoji: 🌍
- colorFrom: gray
- colorTo: blue
- sdk: gradio
- sdk_version: 6.8.0
- app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: BPO Benchmark Evaluation
+ emoji: "\U0001F4CA"
+ colorFrom: blue
+ colorTo: purple
+ sdk: docker
+ app_port: 7860
  pinned: false
+ license: apache-2.0
  ---
+ # BPO Benchmark Evaluation
+
+ Evaluate the **CUGA SDK** on BPO (Business Process Outsourcing) recruiting analytics tasks.
+
+ ## What This Benchmarks
+
+ This Space evaluates the [CUGA SDK](https://pypi.org/project/cuga/), a production AI agent framework, on its ability to use tool APIs to answer recruiting analytics questions.
+
+ ## Features
+
+ - **Real Agent Testing**: Uses the actual CUGA SDK from PyPI (`pip install cuga>=0.2.8`)
+ - **32 Tool APIs** for recruiting data analysis (13 core + 19 error-prone)
+ - **45 Evaluation Tasks** across 6 test suites (easy/medium/hard difficulty)
+ - **Multi-Metric Scoring**: String similarity, exact match, keyword match, API accuracy, and LLM Judge
+ - **OpenAI and Groq** provider support
+ - **Langfuse Observability**: Optional tracing for detailed evaluation analysis
+
+ ## API Endpoints
+
+ The benchmark exposes 32 BPO recruiting analytics endpoints plus a health check.
+
+ ### Core Endpoints (13)
+
+ #### Candidate Source APIs (7)
+ - SLA performance by source
+ - Total hires by source
+ - Candidate volume metrics
+ - Funnel conversion rates
+ - Metadata and timeframe
+ - Definitions and methodology
+ - Source recommendations
+
+ #### Skills APIs (6)
+ - Skill analysis and correlations
+ - Skill impact on fill rate
+ - Skill impact on SLA
+ - Skill relevance justification
+ - Success criteria
+ - Data sources used
+
+ ### Error-Prone Endpoints (19)
+
+ These endpoints intentionally exhibit problematic behaviors to test agent resilience.
+
+ #### Type Mismatch (3)
+ - `skills/skill-summary` — Returns a plain string instead of a JSON object
+ - `candidate-source/source-sla-score` — Returns a bare float instead of a structured response
+ - `candidate-source/inactive-sources` — Returns a boolean or a list depending on data state
+
+ #### HTTP Errors (4)
+ - `candidate-source/candidate-pipeline-status` — Intermittently returns 404
+ - `candidate-source/source-sla-check` — Returns 500 Internal Server Error
+ - `candidate-source/funnel-status` — Returns 503 Service Unavailable
+ - `candidate-source/bulk-source-data` — Returns 429 Too Many Requests
+
+ #### Schema Violations (4)
+ - `skills/model-registry` — No Pydantic output schema; returns an untyped dict
+ - `skills/skill-lookup` — Returns extra undeclared fields
+ - `candidate-source/source-metrics-lite` — Randomly omits required fields
+ - `candidate-source/volume-report` — Returns wrong field types (strings for numbers)
+
+ #### Edge Cases (5)
+ - `candidate-source/full-candidate-details` — Returns an oversized payload (~1 MB)
+ - `candidate-source/source-directory` — Contains Unicode and special characters
+ - `skills/skill-deep-analysis` — Deeply nested JSON (5+ levels)
+ - `candidate-source/sla-extended` — Includes unexpected extra fields
+ - `skills/analyze-skill-match` — Returns a schema that differs from its documentation
+
+ #### Undocumented Behaviors (3)
+ - `candidate-source/requisition-details` — Non-standard error format (error as string, not object)
+ - `candidate-source/list-all-sources` — Undocumented pagination in the response
+ - `candidate-source/batch-metrics` — Undocumented rate-limiting headers
+
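An agent consuming the error-prone suites above cannot assume every response body is a well-formed JSON object. A minimal defensive-parsing sketch (illustrative only, not part of the benchmark code) that normalizes the type-mismatch shapes listed above into a dict:

```python
import json
from typing import Any


def normalize_payload(raw: Any) -> dict:
    """Coerce a response body into a dict, tolerating the type-mismatch
    shapes above: plain string, bare float, and boolean-or-list."""
    if isinstance(raw, dict):
        return raw
    if isinstance(raw, str):
        # Some endpoints return JSON-encoded text; others return plain prose.
        try:
            decoded = json.loads(raw)
            return decoded if isinstance(decoded, dict) else {"value": decoded}
        except json.JSONDecodeError:
            return {"value": raw}
    if isinstance(raw, (bool, int, float)):
        return {"value": raw}
    if isinstance(raw, list):
        return {"items": raw}
    return {"value": str(raw)}
```

The wrapper keys (`value`, `items`) are arbitrary; the point is that downstream code only ever sees a dict.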
+ ## Usage
+
+ 1. Enter your **OpenAI** or **Groq** API key
+ 2. Select test suites to evaluate (checkboxes):
+    - **Core** (26 tasks) — Standard recruiting analytics questions
+    - **Type Mismatch** (3 tasks) — APIs returning unexpected data types
+    - **HTTP Errors** (4 tasks) — APIs returning HTTP error codes
+    - **Schema Violations** (4 tasks) — APIs with missing/wrong schema fields
+    - **Edge Cases** (5 tasks) — Large payloads, Unicode, deep nesting
+    - **Undocumented Behaviors** (3 tasks) — Non-standard error formats and pagination
+ 3. Optionally filter to specific task IDs within selected suites
+ 4. Click **Run Evaluation**
+ 5. View results with multi-metric scoring and per-task breakdowns
+
+ ## Evaluation Metrics
+
+ Each task is scored on multiple dimensions:
+
+ - **String Similarity**: Fuzzy match between agent response and expected output (0-100%)
+ - **Exact Match**: Binary check for precise answer correctness
+ - **Keyword Match Rate**: Percentage of expected keywords found in the response
+ - **API Accuracy**: Precision and recall of which API endpoints the agent invoked
+ - **LLM Judge** (optional): GPT-based semantic evaluation of response quality
+ - **Final Composite Score**: Weighted combination of all metrics
+
+ A task **passes** when the composite score exceeds the threshold.
+
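The pass/fail rule can be sketched in a few lines. The thresholds below mirror the defaults used elsewhere in this Space (0.85 when the answer matched exactly, 0.9 otherwise); treat the sketch as illustrative rather than the exact scoring code:

```python
from typing import Optional


def passes(similarity: float, exact_match: bool,
           llm_judge_score: Optional[float] = None,
           threshold_exact: float = 0.85,
           threshold_inexact: float = 0.9) -> bool:
    """A task passes when its similarity (or, when available, the LLM judge
    score) meets the applicable threshold; an exact match earns the lower one."""
    threshold = threshold_exact if exact_match else threshold_inexact
    if llm_judge_score is not None:
        # Either score can carry the task past the threshold.
        return llm_judge_score >= threshold or similarity >= threshold
    return similarity >= threshold
```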
+ ## Dataset
+
+ The benchmark uses synthetic BPO recruiting data:
+ - 64k candidate records
+ - 1,047 requisitions
+ - 7 sourcing channels
+
+ **Dataset:** [ibm-research/bpo-benchmark](https://huggingface.co/datasets/ibm-research/bpo-benchmark)
+
+ ## Local Development
+
+ ```bash
+ # Clone the space
+ git clone https://huggingface.co/spaces/ibm-research/bpo-benchmark-eval
+ cd bpo-benchmark-eval
+
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Download data (all task suites + fixture data)
+ python -c "
+ from huggingface_hub import hf_hub_download
+ import os
+ os.makedirs('data', exist_ok=True)
+ files = [
+     'candidate_data.parquet',
+     'tasks.json',
+     'tasks_type_mismatch.json',
+     'tasks_http_errors.json',
+     'tasks_schema_violations.json',
+     'tasks_edge_cases.json',
+     'tasks_undocumented.json',
+     'large_response_fixture.json',
+ ]
+ for f in files:
+     hf_hub_download('ibm-research/bpo-benchmark', f, local_dir='data', repo_type='dataset')
+ print(f'Downloaded {len(files)} files')
+ "
+
+ # Run
+ python app.py
+ ```
+
+ ## Architecture
+
+ ```
+ ┌─────────────────────────────────────────────────────────────┐
+ │                    Gradio UI (port 7860)                    │
+ │   - 6 test suite checkboxes, multi-metric result display    │
+ │   - LLM Judge toggle, Langfuse observability                │
+ └─────────────────────────────────────────────────────────────┘
+                               │
+                               ▼
+ ┌─────────────────────────────────────────────────────────────┐
+ │                       CUGA SDK Agent                        │
+ │   - Loads tools via OpenAPI spec from Registry              │
+ │   - Processes queries with LLM (OpenAI/Groq)                │
+ │   - Orchestrates tool calls                                 │
+ └─────────────────────────────────────────────────────────────┘
+                     ┌─────────┴──────────┐
+                     ▼                    ▼
+ ┌──────────────────────────┐  ┌──────────────────────────┐
+ │  FastAPI Server (:8000)  │  │  CUGA Registry (:8001)   │
+ │  - 32 BPO API endpoints  │  │  - Tool discovery        │
+ │  - OpenAPI spec          │  │  - OpenAPI aggregation   │
+ │  - Loads data from       │  └──────────────────────────┘
+ │    parquet               │
+ └──────────────────────────┘
+ ```
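The FastAPI and Registry boxes are launched from the Gradio process as daemon threads (one blocking `uvicorn.run` call per thread), so they exit together with the UI. Stripped to its essentials, the pattern looks roughly like this, with the server entry points stubbed out as placeholders:

```python
import threading


def start_background(name: str, target) -> threading.Thread:
    """Run a blocking server entry point (e.g. uvicorn.run) on a daemon
    thread so it terminates together with the main process."""
    t = threading.Thread(target=target, name=name, daemon=True)
    t.start()
    return t


# Placeholders standing in for uvicorn.run(bpo_app, port=8000)
# and uvicorn.run(registry_app, port=8001).
started = []
threads = [
    start_background("bpo-api", lambda: started.append("bpo-api")),
    start_background("registry", lambda: started.append("registry")),
]
for t in threads:
    t.join(timeout=2)
```

With real servers the threads never return; `daemon=True` is what lets the Space shut down cleanly without joining them.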
+
+ ## License
+
+ Apache 2.0
__pycache__/agent.cpython-312.pyc ADDED
Binary file (19.1 kB)

__pycache__/api_candidate_source.cpython-312.pyc ADDED
Binary file (13.7 kB)

__pycache__/api_candidate_source_error.cpython-312.pyc ADDED
Binary file (17.2 kB)

__pycache__/api_skills.cpython-312.pyc ADDED
Binary file (12.1 kB)

__pycache__/api_skills_error.cpython-312.pyc ADDED
Binary file (8.77 kB)

__pycache__/app.cpython-312.pyc ADDED
Binary file (18.8 kB)

__pycache__/data_loader.cpython-312.pyc ADDED
Binary file (7.51 kB)

__pycache__/models.cpython-312.pyc ADDED
Binary file (8.29 kB)

__pycache__/server.cpython-312.pyc ADDED
Binary file (16.5 kB)
agent.py ADDED
@@ -0,0 +1,784 @@
+ """CUGA SDK agent for BPO benchmark evaluation with Langfuse tracking."""
+
+ import asyncio
+ import logging
+ import os
+ import re
+ import threading
+ import time
+ from pathlib import Path
+ from typing import Any, Dict, List, Optional, Tuple
+
+ import uvicorn
+
+ logger = logging.getLogger(__name__)
+
+ # Global flags to track server status
+ _servers_started = False
+ _servers_lock = threading.Lock()
+
+
+ # ============================================================================
+ # Provider Configuration
+ # ============================================================================
+
+ PROVIDER_CONFIGS = {
+     "groq": {
+         "env_var": "GROQ_API_KEY",
+         "settings_file": "settings.groq.toml",
+         "default_model": "openai/gpt-oss-120b",
+         "models": [
+             "openai/gpt-oss-120b",
+             "llama-3.3-70b-versatile",
+             "llama-3.1-8b-instant",
+             "mixtral-8x7b-32768",
+         ],
+         "placeholder": "gsk_...",
+     },
+     "openai": {
+         "env_var": "OPENAI_API_KEY",
+         "settings_file": "settings.openai.toml",
+         "default_model": "gpt-4o-mini",
+         "models": [
+             "gpt-4o-mini",
+             "gpt-4.1",
+             "gpt-5",
+             "gpt-4o",
+         ],
+         "placeholder": "sk-...",
+     },
+ }
+
+
+ def get_provider_models(provider: str) -> List[str]:
+     """Get available models for a provider."""
+     config = PROVIDER_CONFIGS.get(provider.lower(), {})
+     return config.get("models", [])
+
+
+ def get_provider_placeholder(provider: str) -> str:
+     """Get API key placeholder for a provider."""
+     config = PROVIDER_CONFIGS.get(provider.lower(), {})
+     return config.get("placeholder", "...")
+
+
+ def get_default_model(provider: str) -> str:
+     """Get default model for a provider."""
+     config = PROVIDER_CONFIGS.get(provider.lower(), {})
+     return config.get("default_model", "")
+
+
+ # ============================================================================
+ # Server Management
+ # ============================================================================
+
+ def start_servers():
+     """Start BPO API and Registry servers if not already running."""
+     global _servers_started
+
+     with _servers_lock:
+         if _servers_started:
+             return
+         _servers_started = True
+
+     # Import here to avoid circular imports
+     from server import app as bpo_app
+     from cuga.backend.tools_env.registry.registry.api_registry_server import (
+         app as registry_app,
+     )
+
+     # Start BPO API server on port 8000
+     def run_bpo():
+         uvicorn.run(bpo_app, host="0.0.0.0", port=8000, log_level="warning")
+
+     bpo_thread = threading.Thread(target=run_bpo, daemon=True)
+     bpo_thread.start()
+     logger.info("BPO API server starting on port 8000")
+
+     # Start Registry server on port 8001
+     def run_registry():
+         uvicorn.run(registry_app, host="0.0.0.0", port=8001, log_level="warning")
+
+     registry_thread = threading.Thread(target=run_registry, daemon=True)
+     registry_thread.start()
+     logger.info("Registry server starting on port 8001")
+
+     # Wait for servers to be ready
+     time.sleep(4)
+     logger.info("Servers started")
+
+
+ # ============================================================================
+ # Environment Setup
+ # ============================================================================
+
+ def setup_environment(api_key: str, provider: str, model: Optional[str] = None):
+     """Set up environment variables for CUGA SDK."""
+     # Clear conflicting env vars
+     for key in ["OPENAI_BASE_URL", "OPENAI_API_KEY", "GROQ_API_KEY"]:
+         if key in os.environ:
+             del os.environ[key]
+
+     provider_lower = provider.lower()
+     config = PROVIDER_CONFIGS.get(provider_lower)
+
+     if not config:
+         raise ValueError(f"Unknown provider: {provider}. Supported: {list(PROVIDER_CONFIGS.keys())}")
+
+     # Set provider-specific config
+     os.environ[config["env_var"]] = api_key
+     os.environ["AGENT_SETTING_CONFIG"] = config["settings_file"]
+     os.environ["MODEL_NAME"] = model or config["default_model"]
+
+     # Set MCP servers file path
+     mcp_config = Path(__file__).parent / "mcp_servers" / "bpo.yaml"
+     os.environ["MCP_SERVERS_FILE"] = str(mcp_config.resolve())
+
+     # Disable policies for benchmark
+     os.environ["DYNACONF_POLICY__ENABLED"] = "false"
+
+     logger.info(f"Environment configured: provider={provider}, model={os.environ.get('MODEL_NAME')}")
+
+
+ # ============================================================================
+ # Langfuse Integration
+ # ============================================================================
+
+ class LangfuseTracker:
+     """Tracks evaluation runs and task scores in Langfuse."""
+
+     def __init__(self):
+         self.enabled = False
+         self.langfuse = None
+         self.trace_id = None
+         self.init_error = None
+         self._init_langfuse()
+
+     def _init_langfuse(self) -> None:
+         """Initialize Langfuse client if credentials are available."""
+         # Debug: show all LANGFUSE env vars
+         langfuse_vars = {k: ('set' if v else 'empty') for k, v in os.environ.items() if 'LANGFUSE' in k.upper()}
+         logger.info(f"Langfuse env vars found: {langfuse_vars}")
+
+         public_key = os.environ.get("LANGFUSE_PUBLIC_KEY")
+         secret_key = os.environ.get("LANGFUSE_SECRET_KEY")
+         # Support both LANGFUSE_HOST and LANGFUSE_BASE_URL
+         host = os.environ.get("LANGFUSE_HOST") or os.environ.get("LANGFUSE_BASE_URL") or "https://cloud.langfuse.com"
+
+         logger.info(f"Langfuse init: public_key={'set' if public_key else 'not set'}, secret_key={'set' if secret_key else 'not set'}, host={host}")
+
+         if not public_key or not secret_key:
+             self.init_error = "Langfuse credentials not found"
+             logger.info(self.init_error)
+             return
+
+         try:
+             from langfuse import Langfuse
+
+             self.langfuse = Langfuse(
+                 public_key=public_key,
+                 secret_key=secret_key,
+                 host=host,
+             )
+             # Test the connection by checking auth
+             self.langfuse.auth_check()
+             self.enabled = True
+             logger.info(f"Langfuse tracking initialized successfully (host={host})")
+         except ImportError as e:
+             self.init_error = f"langfuse package not installed: {e}"
+             logger.warning(self.init_error)
+         except Exception as e:
+             self.init_error = f"Failed to initialize Langfuse: {e}"
+             logger.warning(self.init_error)
+
+     def start_trace(self, name: str, metadata: Optional[Dict[str, Any]] = None) -> Optional[str]:
+         """Start a new trace for an evaluation run."""
+         if not self.enabled or not self.langfuse:
+             return None
+
+         try:
+             # v2-style Langfuse client API (client.trace)
+             trace = self.langfuse.trace(name=name, metadata=metadata or {})
+             self.trace_id = trace.id
+             return self.trace_id
+         except AttributeError:
+             # Fallback for Langfuse versions without `trace()`
+             try:
+                 self.trace_id = f"trace_{name}_{id(self)}"
+                 logger.info(f"Using fallback trace ID: {self.trace_id}")
+                 return self.trace_id
+             except Exception as e:
+                 logger.warning(f"Failed to create trace (fallback): {e}")
+                 return None
+         except Exception as e:
+             logger.warning(f"Failed to create trace: {e}")
+             return None
+
+     def score_task(self, task_id: str, scores: Dict[str, float]) -> None:
+         """Score a task within the current trace."""
+         if not self.enabled or not self.langfuse or not self.trace_id:
+             return
+
+         try:
+             for name, value in scores.items():
+                 self.langfuse.score(
+                     trace_id=self.trace_id,
+                     name=f"{task_id}_{name}",
+                     value=value,
+                 )
+         except Exception as e:
+             logger.warning(f"Failed to score task {task_id}: {e}")
+
+     def end_trace(self, summary: Optional[Dict[str, Any]] = None) -> None:
+         """End the current trace with summary metrics."""
+         if not self.enabled or not self.langfuse:
+             return
+
+         try:
+             if summary and self.trace_id:
+                 for name, value in summary.items():
+                     if isinstance(value, (int, float)) and not isinstance(value, bool):
+                         self.langfuse.score(
+                             trace_id=self.trace_id,
+                             name=f"summary_{name}",
+                             value=float(value),
+                         )
+             self.langfuse.flush()
+         except Exception as e:
+             logger.warning(f"Failed to end trace: {e}")
+         finally:
+             self.trace_id = None
+
+
+ def is_langfuse_configured() -> bool:
+     """Check if Langfuse environment variables are set."""
+     return bool(
+         os.environ.get("LANGFUSE_PUBLIC_KEY") and
+         os.environ.get("LANGFUSE_SECRET_KEY")
+     )
+
+
+ def get_langfuse_host() -> str:
+     """Get the configured Langfuse host."""
+     return os.environ.get("LANGFUSE_HOST") or os.environ.get("LANGFUSE_BASE_URL") or "https://cloud.langfuse.com"
+
+
+ # ============================================================================
+ # CUGA Agent
+ # ============================================================================
+
+ class CUGAAgent:
+     """CUGA SDK agent for BPO benchmark evaluation."""
+
+     def __init__(
+         self,
+         api_key: str,
+         provider: str = "groq",
+         model: Optional[str] = None,
+     ):
+         """Initialize the CUGA agent.
+
+         Args:
+             api_key: API key for the LLM provider
+             provider: "openai" or "groq"
+             model: Model name (optional, uses defaults)
+         """
+         self.api_key = api_key
+         self.provider = provider.lower()
+         self.model = model
+         self.agent = None
+         self.tool_provider = None
+
+         # Set up environment BEFORE importing cuga modules
+         setup_environment(api_key, self.provider, model)
+
+         # Start servers
+         start_servers()
+
+     async def setup(self):
+         """Initialize the CUGA agent with tools."""
+         from cuga.sdk import CugaAgent
+         from cuga.config import settings
+         from cuga.backend.cuga_graph.nodes.cuga_lite.combined_tool_provider import (
+             CombinedToolProvider,
+         )
+
+         logger.info("Setting up CUGA agent...")
+
+         # Enable ActivityTracker for tool call tracking
+         settings.update({"ADVANCED_FEATURES": {"TRACKER_ENABLED": True}}, merge=True)
+
+         # Initialize tool provider (will load from registry)
+         self.tool_provider = CombinedToolProvider()
+         await self.tool_provider.initialize()
+
+         all_tools = await self.tool_provider.get_all_tools()
+         logger.info(f"Loaded {len(all_tools)} tools from BPO API")
+
+         if len(all_tools) == 0:
+             raise RuntimeError("No tools loaded from registry. Check server status.")
+
+         # Create agent
+         self.agent = CugaAgent(tool_provider=self.tool_provider)
+         logger.info("CUGA agent initialized")
+
+     async def run(self, query: str, thread_id: Optional[str] = None) -> Tuple[str, List[Dict[str, Any]]]:
+         """Run the agent on a query.
+
+         Args:
+             query: The user's question
+             thread_id: Optional thread ID for conversation context
+
+         Returns:
+             Tuple of (response_text, tool_calls)
+         """
+         if self.agent is None:
+             await self.setup()
+
+         from langchain_core.messages import HumanMessage
+
+         # Get ActivityTracker singleton and reset for this task
+         try:
+             from cuga.backend.activity_tracker.tracker import ActivityTracker
+             tracker = ActivityTracker()
+             tracker.reset(intent=query, task_id=thread_id or "eval_task")
+         except ImportError:
+             tracker = None
+             logger.warning("ActivityTracker not available, tool call tracking disabled")
+
+         result = await self.agent.invoke(
+             [HumanMessage(content=query)],
+             thread_id=thread_id or "eval_task",
+             track_tool_calls=True,  # Required for ActivityTracker to capture tool calls
+         )
+
+         # Debug: log result object structure
+         result_attrs = [attr for attr in dir(result) if not attr.startswith('_')]
+         logger.info(f"Result object attributes: {result_attrs}")
+         if hasattr(result, '__dict__'):
+             logger.info(f"Result __dict__ keys: {list(result.__dict__.keys())}")
+
+         # Extract response
+         response = result.answer if hasattr(result, "answer") else str(result)
+
+         # Extract tool calls from ActivityTracker.steps (same approach as sdk_eval_helpers.py)
+         tool_calls = []
+         if tracker is not None:
+             import json
+             logger.info(f"ActivityTracker has {len(tracker.steps)} steps")
+
+             # Debug: log step names to understand structure (first 5 only)
+             step_names = [s.name for s in tracker.steps[:5]]
+             logger.info(f"First step names: {step_names}")
+
+             # Match "api_call" in step name (the standard CUGA SDK pattern)
+             for step in tracker.steps:
+                 if step.name and "api_call" in step.name:
+                     try:
+                         call_data = json.loads(step.data) if step.data else {}
+                         tool_name = call_data.get("function_name", "")
+                         if tool_name:
+                             tool_calls.append({
+                                 "name": tool_name,
+                                 "args": call_data.get("args", {}),
+                             })
+                     except (json.JSONDecodeError, TypeError) as e:
+                         logger.warning(f"Failed to parse tool call step data: {e}")
+                         continue
+
+             logger.info(f"Extracted {len(tool_calls)} tool calls from ActivityTracker")
+
+         # Fallback 1: try to extract from result.tool_calls attribute
+         if not tool_calls and hasattr(result, 'tool_calls') and result.tool_calls:
+             logger.info("Trying fallback: result.tool_calls")
+             for tc in result.tool_calls:
+                 if isinstance(tc, dict):
+                     tool_calls.append({"name": tc.get("name", ""), "args": tc.get("args", {})})
+                 elif hasattr(tc, 'name'):
+                     tool_calls.append({"name": tc.name, "args": getattr(tc, 'args', {})})
+             logger.info(f"Fallback extracted {len(tool_calls)} tool calls")
+
+         return response, tool_calls
+
+     def close(self):
+         """Clean up resources."""
+         pass  # Servers run as daemons, will stop with process
+
+
408
+ # ============================================================================
409
+ # Evaluation Metrics (copied from main repo for standalone use)
410
+ # ============================================================================
411
+
412
+ def normalize_text(text: str) -> str:
413
+ """Normalize text for keyword matching."""
414
+ import unicodedata
415
+
416
+ text = unicodedata.normalize("NFC", text)
417
+ # Replace special spaces
418
+ text = text.replace("\u202f", " ").replace("\u00a0", " ").replace("\u2009", " ")
419
+ # Replace dashes
420
+ text = text.replace("\u2013", "-").replace("\u2014", "-").replace("\u2212", "-")
421
+ text = text.lower()
422
+ # Remove markdown
423
+ text = re.sub(r"[`*_~]", "", text)
424
+ # Replace punctuation except | (for OR alternatives)
425
+ text = re.sub(r"[^\w\s%|]", " ", text)
426
+ # Collapse whitespace
427
+ text = re.sub(r"\s+", " ", text).strip()
428
+ return text
429
+
430
+
431
+ def check_keywords(response: str, expected_keywords: List[str]) -> Dict[str, Any]:
432
+ """Check if expected keywords are present in the response.
433
+
434
+ Supports:
435
+ - OR mechanism: keywords can use "|" to specify alternatives
436
+ - Regex keywords: prefix with "re:" to use regex pattern
437
+
438
+ Args:
439
+ response: Agent's response text
440
+ expected_keywords: List of keywords (can use "|" for OR, "re:" for regex)
441
+
442
+ Returns:
443
+ Dictionary with keyword check results
444
+ """
445
+ if not expected_keywords:
446
+ return {
447
+ "all_found": True,
448
+ "match_rate": 1.0,
449
+ "found_keywords": [],
450
+ "missing_keywords": [],
451
+ "total_keywords": 0,
452
+ "found_count": 0,
453
+ }
454
+
455
+ response_normalized = normalize_text(response)
456
+ found_keywords = []
457
+ missing_keywords = []
458
+
459
+ for keyword in expected_keywords:
460
+ # Regex keyword support
461
+ if keyword.strip().lower().startswith("re:"):
462
+ pattern = keyword.strip()[3:]
463
+ if re.search(pattern, response_normalized, flags=re.IGNORECASE):
464
+ found_keywords.append(keyword)
465
+ else:
466
+ missing_keywords.append(keyword)
467
+ continue
468
+
469
+ keyword_normalized = normalize_text(keyword)
470
+
471
+ # OR alternatives
472
+ if "|" in keyword_normalized:
473
+ alternatives = [alt.strip() for alt in keyword_normalized.split("|")]
474
+ matched = any(alt in response_normalized for alt in alternatives)
475
+ else:
476
+ matched = keyword_normalized.strip() in response_normalized
477
+
478
+ if matched:
479
+ found_keywords.append(keyword)
480
+ else:
481
+ missing_keywords.append(keyword)
482
+
483
+ total = len(expected_keywords)
484
+ found_count = len(found_keywords)
485
+
486
+ return {
487
+ "all_found": len(missing_keywords) == 0,
488
+ "match_rate": found_count / total if total else 1.0,
489
+ "found_keywords": found_keywords,
490
+ "missing_keywords": missing_keywords,
491
+ "total_keywords": total,
492
+ "found_count": found_count,
493
+ }
494
+
495
+
496
+ def compute_string_similarity(predicted: str, expected: str) -> float:
497
+ """Compute string similarity using RapidFuzz token set ratio."""
498
+ try:
499
+ from rapidfuzz import fuzz
500
+ return fuzz.token_set_ratio(predicted.lower(), expected.lower()) / 100.0
501
+ except ImportError:
502
+ from difflib import SequenceMatcher
503
+ return SequenceMatcher(None, predicted.lower(), expected.lower()).ratio()
504
+
505
+
506
+ def compute_exact_match(predicted: str, expected: str) -> bool:
507
+ """Check if predicted exactly matches expected (case-insensitive)."""
508
+ return predicted.strip().lower() == expected.strip().lower()
509
+
510
+
511
+ def compute_final_score(
512
+ exact_match: bool,
513
+ similarity: float,
514
+ llm_judge_score: Optional[float] = None,
515
+ llm_judge_requested: bool = False,
516
+    agent_output: str = "",
+    threshold_exact: float = 0.85,
+    threshold_inexact: float = 0.9,
+    apis_missing: Optional[List[str]] = None,
+    require_api_match: bool = False,
+) -> int:
+    """Compute final binary score for a task.
+
+    This matches the logic in bpo_benchmark/evaluation/metrics.py for consistency.
+
+    Args:
+        exact_match: Whether output exactly matched expected
+        similarity: String similarity score (0.0-1.0)
+        llm_judge_score: Optional LLM judge score (0.0-1.0)
+        llm_judge_requested: True if LLM judge was requested for this evaluation
+        agent_output: The agent's output string (to detect failures)
+        threshold_exact: Threshold when exact match is True
+        threshold_inexact: Threshold when exact match is False
+        apis_missing: List of expected APIs that were not called
+        require_api_match: If True, require apis_missing to be empty to pass
+
+    Returns:
+        1 if task passes, 0 otherwise
+    """
+    import math
+
+    # Check for task failure indicators
+    if not agent_output or (isinstance(agent_output, str) and agent_output.startswith("ERROR:")):
+        return 0
+
+    # Check for missing API calls when API metrics are required
+    if require_api_match and apis_missing:
+        return 0
+
+    # Handle missing/invalid similarity
+    if similarity is None or (isinstance(similarity, float) and math.isnan(similarity)):
+        return 0
+
+    # Determine the threshold based on exact match status
+    threshold = threshold_exact if exact_match else threshold_inexact
+
+    # If LLM judge was requested but failed/unavailable, return 0
+    if llm_judge_requested:
+        if llm_judge_score is None or (isinstance(llm_judge_score, float) and math.isnan(llm_judge_score)):
+            return 0
+        # Judge was requested and available: pass if EITHER score meets threshold
+        if llm_judge_score >= threshold or similarity >= threshold:
+            return 1
+        return 0
+    else:
+        # No judge requested: use similarity only
+        return 1 if similarity >= threshold else 0
+
+
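The pass/fail rule above can be exercised in isolation. `final_score` below is a hypothetical standalone re-statement of the same thresholds and either/or judge logic (not part of the app code):

```python
import math

def final_score(exact_match, similarity, llm_judge_score=None,
                llm_judge_requested=False, t_exact=0.85, t_inexact=0.9):
    # Pick the threshold from the exact-match flag, then require the
    # similarity (or the judge score, when a judge was requested) to clear it.
    if similarity is None or math.isnan(similarity):
        return 0
    threshold = t_exact if exact_match else t_inexact
    if llm_judge_requested:
        if llm_judge_score is None or math.isnan(llm_judge_score):
            return 0
        return 1 if (llm_judge_score >= threshold or similarity >= threshold) else 0
    return 1 if similarity >= threshold else 0
```

Note that a requested-but-failed judge (score `None`) fails the task outright, rather than falling back to similarity.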
+# ============================================================================
+# LLM Judge (for semantic similarity evaluation)
+# ============================================================================
+
+class LLMJudge:
+    """LLM-based semantic judge using Groq's API."""
+
+    def __init__(
+        self,
+        api_key: str,
+        model: str = "llama-3.3-70b-versatile",
+        timeout_s: int = 30,
+    ):
+        self.api_key = api_key
+        self.model = model
+        self.timeout_s = timeout_s
+        self.base_url = "https://api.groq.com"
+
+    @property
+    def name(self) -> str:
+        return f"groq:{self.model}"
+
+    async def judge(
+        self,
+        predicted: str,
+        expected: str,
+        utterance: str = "",
+    ) -> Dict[str, Any]:
+        """Judge similarity between predicted and expected outputs.
+
+        Returns:
+            Dict with score (0.0-1.0), rationale, and metadata
+        """
+        import json
+
+        try:
+            import requests
+        except ImportError:
+            return {"score": None, "rationale": "requests library not available", "metadata": {}}
+
+        # Truncate for cost/speed
+        utterance = str(utterance)[:500]
+        predicted = str(predicted)[:2000]
+        expected = str(expected)[:2000]
+
+        system = (
+            "You are an evaluation judge assessing semantic equivalence between a PREDICTED and EXPECTED answer.\n\n"
+            "Scoring Guidelines:\n"
+            "- Score 1.0: Semantically identical - same meaning, entities, and facts (minor wording differences OK)\n"
+            "- Score 0.8-0.9: Semantically equivalent - same core meaning with slight elaboration or different phrasing\n"
+            "- Score 0.5-0.7: Partially equivalent - same topic but missing key details or extra information\n"
+            "- Score 0.2-0.4: Somewhat related - addresses same question but with different focus or incomplete answer\n"
+            "- Score 0.0-0.1: Unrelated or contradictory - different facts, wrong information, or completely different meaning\n\n"
+            "CRITICAL:\n"
+            "- Focus on SEMANTIC MEANING, not word-for-word matching or formatting\n"
+            "- Both asking for same information (even differently phrased) should score high (0.8-1.0)\n"
+            "- Consider context from the UTTERANCE to understand what's being asked\n"
+            "- Be precise: don't score 0.0 unless answers are truly unrelated/contradictory\n\n"
+            "Return ONLY valid JSON: {\"score\": <number 0.0-1.0>, \"rationale\": \"<explanation>\"}\n"
+        )
+
+        user = (
+            f"UTTERANCE:\n{utterance}\n\n"
+            f"EXPECTED:\n{expected}\n\n"
+            f"PREDICTED:\n{predicted}\n"
+        )
+
+        payload = {
+            "model": self.model,
+            "temperature": 0,
+            "messages": [
+                {"role": "system", "content": system},
+                {"role": "user", "content": user},
+            ],
+        }
+
+        def _do_request() -> Dict[str, Any]:
+            url = f"{self.base_url}/openai/v1/chat/completions"
+            response = requests.post(
+                url,
+                headers={
+                    "Authorization": f"Bearer {self.api_key}",
+                    "Content-Type": "application/json",
+                },
+                json=payload,
+                timeout=self.timeout_s,
+            )
+            response.raise_for_status()
+            return response.json()
+
+        try:
+            data = await asyncio.to_thread(_do_request)
+        except Exception as e:
+            logger.warning(f"LLM judge request failed: {e}")
+            return {"score": None, "rationale": f"Request failed: {e}", "metadata": {}}
+
+        content = (
+            data.get("choices", [{}])[0]
+            .get("message", {})
+            .get("content", "")
+        )
+
+        # Parse JSON response
+        try:
+            parsed = json.loads(content)
+        except Exception:
+            start = content.find("{")
+            end = content.rfind("}")
+            if start == -1 or end == -1 or end <= start:
+                return {"score": None, "rationale": f"Invalid JSON response: {content[:200]}", "metadata": {}}
+            try:
+                parsed = json.loads(content[start:end + 1])
+            except Exception:
+                return {"score": None, "rationale": f"Failed to parse JSON: {content[:200]}", "metadata": {}}
+
+        score = parsed.get("score")
+        if score is not None:
+            score = float(score)
+            score = max(0.0, min(1.0, score))
+
+        rationale = str(parsed.get("rationale", ""))[:1000]
+
+        return {
+            "score": score,
+            "rationale": rationale,
+            "metadata": {"judge": "groq", "model": self.model},
+        }
+
+
+def get_llm_judge(api_key: str, provider: str = "groq") -> Optional[LLMJudge]:
+    """Get an LLM judge instance.
+
+    Args:
+        api_key: API key for the judge provider
+        provider: Currently only "groq" is supported
+
+    Returns:
+        LLMJudge instance or None if not supported
+    """
+    if provider.lower() == "groq":
+        return LLMJudge(api_key=api_key)
+    return None
+
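The two-stage JSON parse in `judge` — a strict parse, then a retry on the slice from the first `{` to the last `}` — can be factored into a small helper. `salvage_json` is a hypothetical name for that fallback, useful because LLMs often wrap JSON in prose or code fences:

```python
import json

def salvage_json(content: str):
    # Strict parse first; on failure, slice from the first '{' to the
    # last '}' and retry. Returns None if nothing parseable remains.
    try:
        return json.loads(content)
    except Exception:
        start, end = content.find("{"), content.rfind("}")
        if start == -1 or end <= start:
            return None
        try:
            return json.loads(content[start:end + 1])
        except Exception:
            return None
```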
713
+
714
+ # ============================================================================
715
+ # API Call Tracking
716
+ # ============================================================================
717
+
718
+ def compare_api_calls(
719
+ called_apis: List[str],
720
+ expected_apis: List[str],
721
+ ) -> Dict[str, Any]:
722
+ """Compare called APIs against expected APIs.
723
+
724
+ Args:
725
+ called_apis: List of API names that were called
726
+ expected_apis: List of expected API names
727
+
728
+ Returns:
729
+ Dict with missing, extra, correct count, and match info
730
+ """
731
+ # Normalize API names for comparison
732
+ # Registry tool names are verbose: bpo_candidate_source_sla_per_source_candidate_source_sla_per_source_requisition_id_get
733
+ # Expected names are short: candidate_source_sla_per_source
734
+ def normalize_api_name(name: str) -> str:
735
+ name = name.lower().strip()
736
+ # Remove app prefix
737
+ if name.startswith("bpo_"):
738
+ name = name[4:]
739
+ # Remove common suffixes (HTTP methods and parameter patterns)
740
+ for suffix in ["_get", "_post", "_put", "_delete"]:
741
+ if name.endswith(suffix):
742
+ name = name[:-len(suffix)]
743
+ for suffix in ["_requisition_id", "_skill_name"]:
744
+ if name.endswith(suffix):
745
+ name = name[:-len(suffix)]
746
+ return name.replace("-", "_").replace(" ", "_")
747
+
748
+ def api_matches(expected: str, actual: str) -> bool:
749
+ """Check if expected API name matches actual (allowing for verbose registry names)."""
750
+ exp_norm = normalize_api_name(expected)
751
+ act_norm = normalize_api_name(actual)
752
+ # Direct match
753
+ if exp_norm == act_norm:
754
+ return True
755
+ # Check if expected is contained in actual (for verbose registry names)
756
+ # e.g., "candidate_source_sla_per_source" in "candidate_source_sla_per_source_candidate_source_sla_per_source"
757
+ if exp_norm in act_norm:
758
+ return True
759
+ return False
760
+
761
+ logger.info(f"[API_TRACKING] Expected APIs: {expected_apis}")
762
+ logger.info(f"[API_TRACKING] Actual APIs: {called_apis}")
763
+
764
+ # Compute API metrics using flexible matching
765
+ missing = []
766
+ for exp_api in expected_apis:
767
+ if not any(api_matches(exp_api, act_api) for act_api in called_apis):
768
+ missing.append(exp_api)
769
+
770
+ extra = []
771
+ for act_api in called_apis:
772
+ if not any(api_matches(exp_api, act_api) for exp_api in expected_apis):
773
+ extra.append(act_api)
774
+
775
+ correct = len(expected_apis) - len(missing)
776
+
777
+ return {
778
+ "missing": missing,
779
+ "extra": extra,
780
+ "correct": correct,
781
+ "expected_count": len(expected_apis),
782
+ "called_count": len(called_apis),
783
+ "all_expected_called": len(missing) == 0,
784
+ }
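The effect of the normalization can be checked standalone. This sketch repeats `normalize_api_name`'s prefix/suffix stripping; the `bpo_` prefix and the suffix lists are specific to this registry's naming scheme:

```python
def normalize(name: str) -> str:
    # Strip the app prefix, then an HTTP-method suffix, then a
    # parameter-pattern suffix, then unify separators.
    name = name.lower().strip()
    if name.startswith("bpo_"):
        name = name[4:]
    for suffix in ("_get", "_post", "_put", "_delete"):
        if name.endswith(suffix):
            name = name[:-len(suffix)]
    for suffix in ("_requisition_id", "_skill_name"):
        if name.endswith(suffix):
            name = name[:-len(suffix)]
    return name.replace("-", "_").replace(" ", "_")
```

After normalization, a verbose registry name like `bpo_..._requisition_id_get` reduces far enough that the short expected name matches by equality or substring containment.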
api_candidate_source.py ADDED
@@ -0,0 +1,385 @@
+"""
+Candidate source APIs - compute metrics from actual data.
+
+AUTO-GENERATED by scripts/generate_hf.sh - DO NOT EDIT DIRECTLY
+Edit candidate_source.py in main repo and regenerate.
+"""
+
+from typing import Dict, List, Any, Optional, Union
+import pandas as pd
+from loguru import logger
+from data_loader import get_data_loader
+from models import (
+    RequisitionNotFoundResponse,
+    SLAPerSourceResponse,
+    TotalHiresBySourceResponse,
+    CandidateVolumeResponse,
+    FunnelConversionResponse,
+    MetadataResponse,
+    DefinitionsResponse,
+    SourceRecommendationResponse,
+)
+BPO_LOG_API_CALLS = False  # Disabled for deployment
+
+
+def _log_api_call(msg: str) -> None:
+    """Log API call if BPO_LOG_API_CALLS is enabled."""
+    if BPO_LOG_API_CALLS:
+        logger.info(msg)
+
+
+def _check_requisition_valid(requisition_id: str) -> Optional[RequisitionNotFoundResponse]:
+    """
+    Check if a requisition ID is valid. Returns None if valid,
+    or an error response model if invalid.
+    """
+    loader = get_data_loader()
+    if not loader.is_valid_requisition(requisition_id):
+        suggestions = loader.get_suggested_requisitions(requisition_id)
+        return RequisitionNotFoundResponse(
+            error="requisition_not_found",
+            message=f"No job can be found with the ID {requisition_id}.",
+            suggested_requisition_ids=suggestions,
+        )
+    return None
+
+
+def get_sla_per_source(requisition_id: str) -> Union[SLAPerSourceResponse, RequisitionNotFoundResponse]:
+    """
+    Retrieves the SLA percentage for each sourcing channel.
+
+    Args:
+        requisition_id: The specific requisition ID to filter SLA data for.
+
+    Returns:
+        A dictionary with source names and their SLA percentages.
+    """
+    _log_api_call(f"API call: get_sla_per_source(requisition_id={requisition_id})")
+
+    # Check if requisition ID is valid
+    error = _check_requisition_valid(requisition_id)
+    if error:
+        return error
+
+    loader = get_data_loader()
+    data = loader.get_similar_requisitions(requisition_id)
+
+    # Filter to only reviewed candidates (SLA only applies to reviewed candidates)
+    reviewed_data = data[data['reviewed']]
+
+    # Group by source and calculate SLA met percentage
+    sla_by_source = reviewed_data.groupby('source_name').agg(
+        total=('sla_met', 'count'),
+        sla_met=('sla_met', 'sum')
+    )
+    sla_by_source['sla_percentage'] = (sla_by_source['sla_met'] / sla_by_source['total'] * 100).round(0).astype(int)
+
+    metrics = [
+        {
+            "source_name": source,
+            "sla_percentage": int(row['sla_percentage'])
+        }
+        for source, row in sla_by_source.iterrows()
+    ]
+
+    # Sort by SLA percentage (ascending) for consistency
+    metrics.sort(key=lambda x: x['sla_percentage'])
+
+    return SLAPerSourceResponse(metrics=metrics)
+
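The groupby/agg step above can be tried on a toy frame. The values here are hypothetical stand-ins for the loader's candidate data; the computation is the same named aggregation plus the rounded-percentage column:

```python
import pandas as pd

# Toy candidate data: two sources, with one unreviewed LinkedIn row
# that must be excluded before computing SLA.
df = pd.DataFrame({
    "source_name": ["Dice", "Dice", "LinkedIn", "LinkedIn"],
    "reviewed":    [True,   True,   True,       False],
    "sla_met":     [True,   False,  True,       False],
})
reviewed = df[df["reviewed"]]
sla = reviewed.groupby("source_name").agg(
    total=("sla_met", "count"),
    sla_met=("sla_met", "sum"),
)
sla["sla_percentage"] = (sla["sla_met"] / sla["total"] * 100).round(0).astype(int)
```

Dice has one of two reviewed candidates meeting SLA (50%), while LinkedIn's single reviewed candidate met it (100%).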
+
+def get_total_hires_by_source(requisition_id: str) -> Union[TotalHiresBySourceResponse, RequisitionNotFoundResponse]:
+    """
+    Retrieves the total number of hires per sourcing channel.
+
+    Args:
+        requisition_id: The specific requisition ID to filter hiring data for.
+
+    Returns:
+        A dictionary with source names and total hires.
+    """
+    _log_api_call(f"API call: get_total_hires_by_source(requisition_id={requisition_id})")
+
+    # Check if requisition ID is valid
+    error = _check_requisition_valid(requisition_id)
+    if error:
+        return error
+
+    loader = get_data_loader()
+    data = loader.get_similar_requisitions(requisition_id)
+
+    # Count hires by source
+    hires_by_source = data[data['hired']].groupby('source_name').size()
+
+    metrics = [
+        {
+            "source_name": source,
+            "total_hires": int(count)
+        }
+        for source, count in hires_by_source.items()
+    ]
+
+    # Sort by total hires (descending)
+    metrics.sort(key=lambda x: x['total_hires'], reverse=True)
+
+    total_hires = int(data['hired'].sum())
+
+    return TotalHiresBySourceResponse(
+        job_id=requisition_id,
+        metrics=metrics,
+        total_hires=total_hires,
+    )
+
+
+def get_candidate_volume_by_source(
+    requisition_id: str,
+    sources: Optional[List[str]] = None
+) -> Union[CandidateVolumeResponse, RequisitionNotFoundResponse]:
+    """
+    Retrieves candidate volume per sourcing channel.
+
+    Args:
+        requisition_id: The specific requisition ID to filter candidate volume.
+        sources: Optional subset of sourcing channels to include (case-sensitive).
+
+    Returns:
+        A dictionary with source names and candidate volumes.
+    """
+    _log_api_call(f"API call: get_candidate_volume_by_source(requisition_id={requisition_id}, sources={sources})")
+
+    # Check if requisition ID is valid
+    error = _check_requisition_valid(requisition_id)
+    if error:
+        return error
+
+    loader = get_data_loader()
+    data = loader.get_similar_requisitions(requisition_id)
+
+    total_volume = len(data)
+
+    # Count candidates by source
+    volume_by_source = data.groupby('source_name').size()
+
+    metrics = [
+        {
+            "source_name": source,
+            "candidate_volume": int(count),
+            "percentage": int(round(count / total_volume * 100))
+        }
+        for source, count in volume_by_source.items()
+    ]
+
+    # Filter by sources if provided
+    if sources:
+        metrics = [m for m in metrics if m['source_name'] in sources]
+
+    # Sort by volume (descending)
+    metrics.sort(key=lambda x: x['candidate_volume'], reverse=True)
+
+    return CandidateVolumeResponse(
+        job_id=requisition_id,
+        total_candidate_volume=total_volume,
+        metrics=metrics,
+        heading=(
+            f"For requisitions similar to {requisition_id}, there were {total_volume} candidates over "
+            "the past three years. Here's how many candidates came from each source "
+            "(with percentages from the total number):"
+        ),
+    )
+
+
+def get_funnel_conversion_by_source(requisition_id: str) -> Union[FunnelConversionResponse, RequisitionNotFoundResponse]:
+    """
+    Retrieves conversion rates at each funnel stage for each sourcing channel.
+
+    Args:
+        requisition_id: The specific requisition ID to filter funnel data for.
+
+    Returns:
+        A dictionary with review %, interview rate, and offer acceptance rate.
+    """
+    _log_api_call(f"API call: get_funnel_conversion_by_source(requisition_id={requisition_id})")
+
+    # Check if requisition ID is valid
+    error = _check_requisition_valid(requisition_id)
+    if error:
+        return error
+
+    loader = get_data_loader()
+    data = loader.get_similar_requisitions(requisition_id)
+
+    metrics = []
+    for source in data['source_name'].unique():
+        source_data = data[data['source_name'] == source]
+        total = len(source_data)
+
+        if total == 0:
+            continue
+
+        reviewed = source_data['reviewed'].sum()
+        interviewed = source_data['interviewed'].sum()
+        offered = source_data['offer_extended'].sum()
+
+        metrics.append({
+            "source_name": source,
+            "first_round_review_percentage": round(reviewed / total * 100, 1),
+            "interview_rate": round(interviewed / total * 100, 1),
+            "offer_acceptance_rate": round(offered / total * 100, 1),
+        })
+
+    # Sort by source name for consistency
+    metrics.sort(key=lambda x: x['source_name'])
+
+    return FunnelConversionResponse(
+        job_id=requisition_id,
+        metrics=metrics,
+    )
+
+
+def get_metadata_and_timeframe(requisition_id: str) -> Union[MetadataResponse, RequisitionNotFoundResponse]:
+    """
+    Retrieves metadata including data timeframe, last update date, and the
+    number of requisitions analysed.
+
+    Args:
+        requisition_id: The job requisition ID.
+
+    Returns:
+        A dictionary containing timeframe and requisition summary.
+    """
+    _log_api_call(f"API call: get_metadata_and_timeframe(requisition_id={requisition_id})")
+
+    # Check if requisition ID is valid
+    error = _check_requisition_valid(requisition_id)
+    if error:
+        return error
+
+    loader = get_data_loader()
+    data = loader.get_similar_requisitions(requisition_id)
+
+    # Get date range from applied_at column
+    min_date = data['applied_at'].min()
+    max_date = data['applied_at'].max()
+
+    # Count unique requisitions
+    num_requisitions = data['requisition_id'].nunique()
+
+    # Static dates for reproducible benchmarking
+    # Use actual dates from data but with last_updated fixed for stability
+    return MetadataResponse(
+        job_id=requisition_id,
+        time_frame_start="2023-10-09",
+        time_frame_end="2025-03-15",
+        data_last_updated="2025-04-29",
+        total_requisitions_analysed=num_requisitions,
+    )
+
+
+def get_definitions_and_methodology(requisition_id: str) -> Union[DefinitionsResponse, RequisitionNotFoundResponse]:
+    """
+    Provides definitions of key metrics and outlines the methodology used
+    to calculate performance.
+
+    Args:
+        requisition_id: The specific requisition ID for context.
+
+    Returns:
+        A dictionary including metric definitions, calculation notes,
+        and the top metrics considered.
+    """
+    _log_api_call(f"API call: get_definitions_and_methodology(requisition_id={requisition_id})")
+
+    # Check if requisition ID is valid
+    error = _check_requisition_valid(requisition_id)
+    if error:
+        return error
+
+    loader = get_data_loader()
+    data = loader.get_similar_requisitions(requisition_id)
+
+    # Report total requisitions in dataset (full analysis framework)
+    num_total_requisitions = loader.data['requisition_id'].nunique()
+    min_date = data['applied_at'].min()
+    max_date = data['applied_at'].max()
+    years = (max_date - min_date).days / 365.25
+
+    return DefinitionsResponse(
+        job_id=requisition_id,
+        definitions={
+            "sla": "Percentage of candidates reviewed within the defined SLA window (e.g., 48 hours)",
+            "time_to_fill": "Average time from job posting to accepted offer",
+            "success_rate": "Ratio of candidates who accepted offers out of those interviewed",
+        },
+        calculation_notes=(
+            f"Metrics are computed from {num_total_requisitions} requisitions over the last {years:.1f} years. "
+            "Funnel stats are based on system timestamps and recruiter actions in ATS."
+        ),
+        top_metrics_considered=[
+            "SLA %",
+            "First round review %",
+            "Offer acceptance rate",
+            "Candidate volume",
+            "Total hires",
+        ],
+    )
+
+
+def get_source_recommendation_summary(requisition_id: str) -> Union[SourceRecommendationResponse, RequisitionNotFoundResponse]:
+    """
+    Returns a high-level summary combining jobs-filled %, review %, offer-accept
+    rate, and total hires for each source.
+
+    Args:
+        requisition_id: The job requisition ID.
+
+    Returns:
+        A dictionary with composite source metrics.
+    """
+    _log_api_call(f"API call: get_source_recommendation_summary(requisition_id={requisition_id})")
+
+    # Check if requisition ID is valid
+    error = _check_requisition_valid(requisition_id)
+    if error:
+        return error
+
+    loader = get_data_loader()
+    data = loader.get_similar_requisitions(requisition_id)
+
+    num_requisitions = data['requisition_id'].nunique()
+
+    metrics = []
+    for source in data['source_name'].unique():
+        source_data = data[data['source_name'] == source]
+        total = len(source_data)
+
+        if total == 0:
+            continue
+
+        # Calculate metrics
+        reviewed = source_data['reviewed'].sum()
+        hired = source_data['hired'].sum()
+
+        # Jobs filled percentage: what % of requisitions had at least 1 hire from this source
+        reqs_with_hires = source_data[source_data['hired']]['requisition_id'].nunique()
+        jobs_filled_pct = int(reqs_with_hires / num_requisitions * 100)
+
+        # Offer acceptance rate: of those who got offers, how many accepted?
+        offers = source_data['offer_extended'].sum()
+        accepted = source_data['offer_accepted'].sum()
+        offer_accept_rate = round(accepted / offers * 100) if offers > 0 else 0
+
+        metrics.append({
+            "source_name": source,
+            "jobs_filled_percentage": jobs_filled_pct,
+            "first_round_review_percentage": int(reviewed / total * 100),
+            "offer_acceptance_rate": offer_accept_rate,
+            "total_hires": int(hired),
+        })
+
+    # Sort by source name
+    metrics.sort(key=lambda x: x['source_name'])
+
+    return SourceRecommendationResponse(
+        total_requisitions=num_requisitions,
+        metrics=metrics,
+    )
api_candidate_source_error.py ADDED
@@ -0,0 +1,495 @@
+"""
+Error-prone candidate source API variants for testing agent resilience.
+
+Each function has a unique, plausible intent and embeds a specific error behavior.
+Completely independent from original APIs — accesses DataLoader directly.
+
+AUTO-GENERATED by scripts/generate_hf.sh - DO NOT EDIT DIRECTLY
+Edit candidate_source_error.py in main repo and regenerate.
+"""
+
+import json
+import random
+from pathlib import Path
+from typing import Any, Dict, List, Optional
+
+from data_loader import get_data_loader
+
+# Seeded RNG for reproducible probabilistic behavior
+_rng = random.Random(42)
+
+# Track call counts for rate-limiting behavior
+_call_counts: Dict[str, int] = {}
+
+
+def _check_requisition(requisition_id: str) -> Optional[Dict[str, Any]]:
+    """Return error dict if requisition invalid, else None."""
+    loader = get_data_loader()
+    if not loader.is_valid_requisition(requisition_id):
+        return {
+            "error": "requisition_not_found",
+            "message": f"Requisition {requisition_id} not found",
+        }
+    return None
+
+
+# ── Test 28: Type mismatch — int instead of float ───────────────────────────
+
+def get_source_sla_score(requisition_id: str, source_name: str = "Dice") -> Any:
+    """Get the SLA score for a specific sourcing channel.
+
+    Returns the SLA achievement score for the given source.
+
+    ERROR BEHAVIOR: Returns int (e.g., 80) instead of float (e.g., 80.0).
+    Tests type handling for numeric values.
+    """
+    err = _check_requisition(requisition_id)
+    if err:
+        return err
+
+    loader = get_data_loader()
+    data = loader.get_similar_requisitions(requisition_id)
+    reviewed = data[(data["reviewed"]) & (data["source_name"] == source_name)]
+
+    if len(reviewed) == 0:
+        return {"error": "no_data", "message": f"No reviewed candidates from {source_name}"}
+
+    sla_pct = int(round(reviewed["sla_met"].mean() * 100))
+    return sla_pct  # Returns bare int instead of {"sla_score": 80.0}
+
+
+# ── Test 29: Type mismatch — None instead of empty list ─────────────────────
+
+def get_inactive_sources(requisition_id: str) -> Any:
+    """Show any inactive sourcing channels with no candidates.
+
+    Returns a list of sources that produced zero candidates.
+
+    ERROR BEHAVIOR: Returns None instead of empty list when no
+    inactive sources exist. Tests null handling.
+    """
+    err = _check_requisition(requisition_id)
+    if err:
+        return err
+
+    loader = get_data_loader()
+    data = loader.get_similar_requisitions(requisition_id)
+    active_sources = set(data["source_name"].unique())
+
+    # Use all sources from the full dataset as the reference set
+    all_possible = set(loader.data["source_name"].unique())
+    inactive = all_possible - active_sources
+
+    if not inactive:
+        return None  # Returns None instead of []
+
+    return list(inactive)
+
+
+# ── Test 30: HTTP 404 (probabilistic, 20% chance) ───────────────────────────
+
+def get_candidate_pipeline_status(requisition_id: str) -> Dict[str, Any]:
+    """Get candidate pipeline status for a requisition.
+
+    Returns current pipeline status showing candidate distribution by source.
+
+    ERROR BEHAVIOR: 20% chance of returning a 404-style error dict.
+    Tests retry logic and error recovery.
+    """
+    if _rng.random() < 0.2:
+        return {
+            "status_code": 404,
+            "error": True,
+            "message": "Resource temporarily unavailable",
+        }
+
+    err = _check_requisition(requisition_id)
+    if err:
+        return err
+
+    loader = get_data_loader()
+    data = loader.get_similar_requisitions(requisition_id)
+    volume_by_source = data.groupby("source_name").size().to_dict()
+
+    return {
+        "requisition_id": requisition_id,
+        "pipeline": {k: int(v) for k, v in volume_by_source.items()},
+        "total_candidates": int(len(data)),
+    }
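Because the module RNG is seeded (`random.Random(42)`), the 20% failure injection is reproducible across runs. A minimal sketch of the same pattern, with a hypothetical `maybe_404` standing in for the full endpoint:

```python
import random

rng = random.Random(42)  # fixed seed, as in the module above

def maybe_404():
    # Roughly 20% of calls return a 404-style dict; the exact sequence
    # of successes and failures is identical on every run.
    if rng.random() < 0.2:
        return {"status_code": 404, "error": True}
    return {"status_code": 200}
```

Seeding keeps benchmark runs comparable: two evaluations of the same agent hit the injected failures at the same call positions.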
+
+
+# ── Test 31: HTTP 500 with valid body ────────────────────────────────────────
+
+def get_source_sla_check(requisition_id: str) -> Dict[str, Any]:
+    """Run a quick SLA status check across all sourcing channels.
+
+    Returns SLA metrics per source for rapid status assessment.
+
+    ERROR BEHAVIOR: Returns HTTP 500-style error dict but includes valid
+    data in the body. Tests agent ability to use response body despite
+    error status code.
+    """
+    err = _check_requisition(requisition_id)
+    if err:
+        return err
+
+    loader = get_data_loader()
+    data = loader.get_similar_requisitions(requisition_id)
+    reviewed = data[data["reviewed"]]
+    metrics = []
+    for source, group in reviewed.groupby("source_name"):
+        sla_pct = int(round(group["sla_met"].mean() * 100))
+        metrics.append({"source_name": source, "sla_percentage": sla_pct})
+
+    return {
+        "status_code": 500,
+        "error": True,
+        "message": "Internal server error",
+        "body": {"metrics": metrics},
+    }
+
+
+# ── Test 32: HTTP 503 Service Unavailable ────────────────────────────────────
+
+def get_funnel_status(requisition_id: str) -> Dict[str, Any]:
+    """Get the current funnel status for a requisition.
+
+    Returns real-time funnel pipeline status showing conversion at each stage.
+
+    ERROR BEHAVIOR: Always returns 503 with retry-after info.
+    Tests service unavailable handling.
+    """
+    return {
+        "status_code": 503,
+        "error": True,
+        "message": "Service temporarily unavailable. The funnel analytics engine is undergoing maintenance.",
+        "retry_after_seconds": 300,
+        "expected_recovery": "2025-05-01T12:00:00Z",
+    }
+
+
+# ── Test 33: HTTP 429 Rate Limited ──────────────────────────────────────────
+
+def get_bulk_source_data(requisition_id: str) -> Dict[str, Any]:
+    """Pull bulk source data for all requisitions.
+
+    Returns comprehensive source data across all requisitions in the system.
+
+    ERROR BEHAVIOR: Returns 429 after 3rd call (tracked via module-level counter).
+    Tests rate limit handling.
+    """
+    key = "get_bulk_source_data"
+    _call_counts[key] = _call_counts.get(key, 0) + 1
+
+    if _call_counts[key] > 3:
+        return {
+            "status_code": 429,
+            "error": True,
+            "message": "Rate limit exceeded. Maximum 3 calls per session.",
+            "retry_after_seconds": 60,
+            "limit": 3,
+            "remaining": 0,
+        }
+
+    err = _check_requisition(requisition_id)
+    if err:
+        return err
+
+    loader = get_data_loader()
+    data = loader.get_similar_requisitions(requisition_id)
+    summary = {}
+    for source, group in data.groupby("source_name"):
+        summary[source] = {
+            "total_candidates": int(len(group)),
+            "total_hires": int(group["hired"].sum()),
+            "reviewed": int(group["reviewed"].sum()),
+        }
+
+    return {
+        "requisition_id": requisition_id,
+        "sources": summary,
+        "call_number": _call_counts[key],
+    }
+
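The session-scoped counter behind the 429 behavior can be sketched independently; `limited` is a hypothetical helper mirroring the module-level `_call_counts` pattern:

```python
_calls = {}

def limited(key, limit=3):
    # The first `limit` calls for a key succeed; every later call
    # returns a 429-style dict until the process (session) restarts.
    _calls[key] = _calls.get(key, 0) + 1
    if _calls[key] > limit:
        return {"status_code": 429, "remaining": 0}
    return {"status_code": 200, "call_number": _calls[key]}
```

Since the counter lives at module scope, the limit resets only when the process restarts, which is exactly the "per session" semantics the error message describes.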
+
+# ── Test 36: Missing required fields ────────────────────────────────────────
+
+def get_source_metrics_lite(requisition_id: str) -> Dict[str, Any]:
+    """Get a lightweight summary of source metrics.
+
+    Returns a compact view of per-source metrics for quick analysis.
+
+    ERROR BEHAVIOR: Response missing `source_name` field in metrics entries.
+    Tests agent handling of incomplete/partial data.
+    """
+    err = _check_requisition(requisition_id)
+    if err:
+        return err
+
+    loader = get_data_loader()
+    data = loader.get_similar_requisitions(requisition_id)
+    metrics = []
+    for source, group in data.groupby("source_name"):
+        # Intentionally omit source_name
+        metrics.append({
+            "candidate_count": int(len(group)),
+            "hire_count": int(group["hired"].sum()),
+            "sla_met_count": int(group[group["reviewed"]]["sla_met"].sum()),
+        })
+
+    return {
+        "requisition_id": requisition_id,
+        "metrics": metrics,
+        "note": "Lightweight view — some fields may be omitted for performance.",
+    }
+
+
+# ── Test 37: Wrong field types in response ──────────────────────────────────
+
+def get_volume_report(requisition_id: str) -> Dict[str, Any]:
+    """Generate a volume report for a requisition.
+
+    Returns candidate volume statistics broken down by source.
+
+    ERROR BEHAVIOR: `candidate_count` returned as string instead of int.
+    Tests type coercion handling.
+    """
+    err = _check_requisition(requisition_id)
+    if err:
+        return err
+
+    loader = get_data_loader()
+    data = loader.get_similar_requisitions(requisition_id)
+    metrics = []
+    for source, group in data.groupby("source_name"):
+        metrics.append({
+            "source_name": source,
+            "candidate_count": str(len(group)),  # String instead of int
+            "hire_count": str(int(group["hired"].sum())),  # String instead of int
+            "review_rate": f"{group['reviewed'].mean() * 100:.1f}%",
+        })
+
+    return {
+        "requisition_id": requisition_id,
+        "metrics": metrics,
+        "total_candidates": str(len(data)),  # String instead of int
+    }
+
+
+# ── Test 38: Large response (1000 records) ──────────────────────────────────
+
+def get_full_candidate_details(requisition_id: str) -> Dict[str, Any]:
+    """Get full candidate details for a requisition.
+
+    Returns comprehensive candidate-level data for detailed analysis.
+
+    ERROR BEHAVIOR: Response contains 1000 pre-generated candidate records.
+    Tests agent handling of large payloads.
+    """
+    err = _check_requisition(requisition_id)
+    if err:
+        return err
+
+    # Load pre-generated fixture
+    fixture_paths = [
+        Path(__file__).parent.parent.parent / "data" / "large_response_fixture.json",
+        Path("./data/large_response_fixture.json"),
+    ]
+
+    for path in fixture_paths:
+        if path.exists():
+            with open(path, "r") as f:
+                records = json.load(f)
+            return {
+                "requisition_id": requisition_id,
+                "total_records": len(records),
+                "candidates": records,
+            }
+
+    # Fallback: generate minimal records if fixture missing
+    return {
+        "requisition_id": requisition_id,
+        "total_records": 0,
+        "candidates": [],
+        "warning": "Large response fixture not found",
+    }
+
+
+# ── Test 39: Unicode and special characters ─────────────────────────────────
+
+def get_source_directory(requisition_id: str) -> Dict[str, Any]:
+    """Show the source directory for a requisition.
+
+    Returns a directory listing of all sourcing channels with their metadata.
+
+    ERROR BEHAVIOR: Source names contain emoji, CJK characters, Arabic text.
+    Tests unicode handling.
+    """
+    err = _check_requisition(requisition_id)
+    if err:
+        return err
+
+    return {
+        "requisition_id": requisition_id,
+        "sources": [
+            {"name": "LinkedIn \U0001F4BC", "region": "Global", "status": "active"},
+            {"name": "Dice \U0001F3B2", "region": "North America", "status": "active"},
+            {"name": "\u62db\u8058\u7f51 (Zhaopin)", "region": "\u4e2d\u56fd", "status": "active"},
338
+ {"name": "\u0628\u064a\u062a.\u0643\u0648\u0645 (Bayt)", "region": "\u0627\u0644\u0634\u0631\u0642 \u0627\u0644\u0623\u0648\u0633\u0637", "status": "active"},
339
+ {"name": "GitHub \U0001F431\u200D\U0001F4BB", "region": "Global", "status": "active"},
340
+ {"name": "R\u00e9f\u00e9rence\u2122", "region": "Europe", "status": "inactive"},
341
+ {"name": "\u2605 Top Talent \u2605", "region": "APAC", "status": "active"},
342
+ ],
343
+ "total_sources": 7,
344
+ }
345
+
346
+
347
+ # ── Test 41: Extra undocumented fields (20 extra fields) ─────────────────────
348
+
349
+ def get_sla_extended(requisition_id: str, source_name: str = "Dice") -> Dict[str, Any]:
350
+ """Get extended SLA data for a specific sourcing channel.
351
+
352
+ Returns SLA metrics with additional analytics for the given source.
353
+
354
+ ERROR BEHAVIOR: Response includes 20 undocumented extra fields
355
+ beyond what the schema describes. Tests agent ability to ignore
356
+ noise and extract relevant data.
357
+ """
358
+ err = _check_requisition(requisition_id)
359
+ if err:
360
+ return err
361
+
362
+ loader = get_data_loader()
363
+ data = loader.get_similar_requisitions(requisition_id)
364
+ source_data = data[(data["reviewed"]) & (data["source_name"] == source_name)]
365
+
366
+ sla_pct = int(round(source_data["sla_met"].mean() * 100)) if len(source_data) > 0 else 0
367
+
368
+ return {
369
+ "requisition_id": requisition_id,
370
+ "source_name": source_name,
371
+ "sla_percentage": sla_pct,
372
+ # Undocumented extra fields
373
+ "_internal_id": "sla-ext-7f3a9b2c",
374
+ "_cache_ttl": 3600,
375
+ "_version": "2.3.1",
376
+ "_debug_query_ms": 42,
377
+ "_shard_id": 3,
378
+ "_region": "us-east-1",
379
+ "_feature_flags": ["sla_v2", "extended_metrics"],
380
+ "_experiment_group": "control",
381
+ "_sampling_rate": 0.95,
382
+ "_data_quality_score": 0.98,
383
+ "_last_recomputed": "2025-04-29T03:00:00Z",
384
+ "_computation_engine": "spark-3.5",
385
+ "_model_version": "sla-impact-v1.4.2",
386
+ "_confidence_interval": [sla_pct - 3, sla_pct + 3],
387
+ "_p_value": 0.023,
388
+ "_sample_size": int(len(source_data)),
389
+ "_outliers_removed": 2,
390
+ "_normalization_method": "min-max",
391
+ "_correlation_with_hires": 0.67,
392
+ "_seasonal_adjustment": True,
393
+ }
394
+
395
+
396
+ # ── Test 43: Undocumented error format ──────────────────────────────────────
397
+
398
+ def get_requisition_details(requisition_id: str) -> Dict[str, Any]:
399
+ """Get detailed information for a specific requisition.
400
+
401
+ Returns comprehensive requisition metadata and status.
402
+
403
+ ERROR BEHAVIOR: Returns non-standard error format `{"err": "not_found"}`
404
+ instead of the standard `RequisitionNotFoundResponse`.
405
+ Tests non-standard error parsing.
406
+ """
407
+ loader = get_data_loader()
408
+ if not loader.is_valid_requisition(requisition_id):
409
+ return {"err": "not_found", "req": requisition_id}
410
+
411
+ data = loader.get_by_requisition(requisition_id)
412
+ row = data.iloc[0]
413
+
414
+ return {
415
+ "requisition_id": requisition_id,
416
+ "department": str(row.get("department", "Unknown")),
417
+ "seniority_level": str(row.get("seniority_level", "Unknown")),
418
+ "total_candidates": int(len(data)),
419
+ "sources_used": list(data["source_name"].unique()),
420
+ }
421
+
422
+
423
+ # ── Test 44: Undocumented pagination ─────────────────────────────────────────
424
+
425
+ def list_all_sources(requisition_id: str) -> Dict[str, Any]:
426
+ """List all available sourcing channels.
427
+
428
+ Returns a paginated list of all sourcing channels in the system.
429
+
430
+ ERROR BEHAVIOR: Response includes `next_page` token not described
431
+ in any schema. Tests pagination detection and handling.
432
+ """
433
+ err = _check_requisition(requisition_id)
434
+ if err:
435
+ return err
436
+
437
+ loader = get_data_loader()
438
+ data = loader.get_similar_requisitions(requisition_id)
439
+ sources = sorted(data["source_name"].unique())
440
+
441
+ # Return first 3 with pagination token
442
+ page_size = 3
443
+ page = sources[:page_size]
444
+
445
+ result: Dict[str, Any] = {
446
+ "requisition_id": requisition_id,
447
+ "sources": [{"name": s, "index": i} for i, s in enumerate(page)],
448
+ "total_count": len(sources),
449
+ "page_size": page_size,
450
+ "page": 1,
451
+ }
452
+
453
+ if len(sources) > page_size:
454
+ result["next_page"] = "eyJvZmZzZXQiOjMsInJlcV9pZCI6IjA1OTU4QlIifQ=="
455
+ result["has_more"] = True
456
+ else:
457
+ result["has_more"] = False
458
+
459
+ return result
460
+
461
+
462
+ # ── Test 45: Undocumented rate limiting headers ──────────────────────────────
463
+
464
+ def get_batch_metrics(requisition_id: str) -> Dict[str, Any]:
465
+ """Fetch batch metrics for all sourcing channels.
466
+
467
+ Returns aggregated metrics across all sources with rate limit information.
468
+
469
+ ERROR BEHAVIOR: Response includes X-RateLimit style headers embedded
470
+ in the response body. Tests rate limit awareness.
471
+ """
472
+ err = _check_requisition(requisition_id)
473
+ if err:
474
+ return err
475
+
476
+ loader = get_data_loader()
477
+ data = loader.get_similar_requisitions(requisition_id)
478
+
479
+ metrics = {}
480
+ for source, group in data.groupby("source_name"):
481
+ metrics[source] = {
482
+ "candidates": int(len(group)),
483
+ "hires": int(group["hired"].sum()),
484
+ "reviewed": int(group["reviewed"].sum()),
485
+ }
486
+
487
+ return {
488
+ "requisition_id": requisition_id,
489
+ "metrics": metrics,
490
+ # Rate limit info embedded in response body
491
+ "X-RateLimit-Limit": 100,
492
+ "X-RateLimit-Remaining": 97,
493
+ "X-RateLimit-Reset": "2025-05-01T00:00:00Z",
494
+ "X-RateLimit-Window": "1h",
495
+ }
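The `get_volume_report` endpoint above deliberately returns numeric fields as strings (e.g. `"candidate_count": "17"`, `"review_rate": "61.5%"`). A minimal sketch of how a consuming agent might defensively normalize such a payload; the helpers `coerce_numeric` and `normalize_metrics` are illustrative and not part of this repo:

```python
def coerce_numeric(value):
    """Best-effort conversion of a response field to int or float.

    Handles plain numeric strings ("17") and percent strings ("61.5%");
    anything non-numeric is returned unchanged.
    """
    if isinstance(value, (int, float)):
        return value
    if isinstance(value, str):
        text = value.strip().rstrip("%")
        try:
            return int(text)
        except ValueError:
            try:
                return float(text)
            except ValueError:
                return value  # leave non-numeric strings untouched
    return value


def normalize_metrics(entry):
    """Apply coerce_numeric to every field of a metrics dict."""
    return {key: coerce_numeric(val) for key, val in entry.items()}


print(normalize_metrics(
    {"source_name": "Dice", "candidate_count": "17", "review_rate": "61.5%"}
))
# → {'source_name': 'Dice', 'candidate_count': 17, 'review_rate': 61.5}
```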
api_skills.py ADDED
@@ -0,0 +1,328 @@
+ """
2
+ Skills APIs - compute skill-related metrics from actual data.
3
+
4
+ AUTO-GENERATED by scripts/generate_hf.sh - DO NOT EDIT DIRECTLY
5
+ Edit skills.py in main repo and regenerate.
6
+ """
7
+
8
+ from typing import Dict, List, Any, Optional, Union
9
+ import pandas as pd
10
+ from loguru import logger
11
+ from data_loader import get_data_loader
12
+ from models import (
13
+ RequisitionNotFoundResponse,
14
+ SkillAnalysisResponse,
15
+ SkillImpactFillRateResponse,
16
+ SkillImpactSLAResponse,
17
+ SkillRelevanceResponse,
18
+ SuccessfulPostingResponse,
19
+ DataSourcesResponse,
20
+ SkillJustificationData,
21
+ SkillJustificationImpact,
22
+ SuccessCriteria,
23
+ )
24
+ BPO_LOG_API_CALLS = False # Disabled for deployment
25
+
26
+
27
+ def _log_api_call(msg: str) -> None:
28
+ """Log API call if BPO_LOG_API_CALLS is enabled."""
29
+ if BPO_LOG_API_CALLS:
30
+ logger.info(msg)
31
+
32
+
33
+ def _check_requisition_valid(requisition_id: str) -> Optional[RequisitionNotFoundResponse]:
34
+ """
35
+ Check if a requisition ID is valid. Returns None if valid,
36
+ or an error response model if invalid.
37
+ """
38
+ loader = get_data_loader()
39
+ if not loader.is_valid_requisition(requisition_id):
40
+ suggestions = loader.get_suggested_requisitions(requisition_id)
41
+ return RequisitionNotFoundResponse(
42
+ error="requisition_not_found",
43
+ message=f"No job can be found with the ID {requisition_id}.",
44
+ suggested_requisition_ids=suggestions,
45
+ )
46
+ return None
47
+
48
+
49
+ def get_skill_analysis(requisition_id: str) -> Union[SkillAnalysisResponse, RequisitionNotFoundResponse]:
50
+ """
51
+ Provides statistical indicators for each skill associated with the requisition,
52
+ enabling an LLM or analyst to decide whether a skill should be retained,
53
+ removed, or reconsidered.
54
+
55
+ Args:
56
+ requisition_id: The job requisition ID.
57
+
58
+ Returns:
59
+ Dict with historical counts and SLA correlation per skill.
60
+ """
61
+ _log_api_call(f"API call: get_skill_analysis(requisition_id={requisition_id})")
62
+
63
+ # Check if requisition ID is valid
64
+ error = _check_requisition_valid(requisition_id)
65
+ if error:
66
+ return error
67
+
68
+ loader = get_data_loader()
69
+ data = loader.get_similar_requisitions(requisition_id)
70
+
71
+ # Get all unique skills across all candidates
72
+ all_skills = []
73
+ for skills_list in data['skills_parsed']:
74
+ all_skills.extend(skills_list)
75
+
76
+ skill_counts = pd.Series(all_skills).value_counts()
77
+
78
+ # For each skill, compute SLA correlation
79
+ historical_skills = []
80
+ for skill, count in skill_counts.head(10).items(): # Top 10 skills
81
+ # Filter to reviewed candidates only (SLA only applies to reviewed candidates)
82
+ reviewed_data = data[data['reviewed']]
83
+
84
+ # Get candidates with and without this skill
85
+ has_skill = reviewed_data[reviewed_data['skills_parsed'].apply(lambda x: skill in x)]
86
+ no_skill = reviewed_data[reviewed_data['skills_parsed'].apply(lambda x: skill not in x)]
87
+
88
+ # Calculate SLA rates
89
+ sla_with = has_skill['sla_met'].mean() if len(has_skill) > 0 else 0
90
+ sla_without = no_skill['sla_met'].mean() if len(no_skill) > 0 else 0
91
+
92
+ # Determine correlation
93
+ diff = sla_with - sla_without
94
+ if diff < -0.10:
95
+ correlation = "highly negative impact on SLA"
96
+ elif diff < 0:
97
+ correlation = "slightly negative impact on SLA"
98
+ elif diff > 0.10:
99
+ correlation = "highly positive impact on SLA"
100
+ elif diff > 0:
101
+ correlation = "slightly positive impact on SLA"
102
+ else:
103
+ correlation = "no impact on SLA"
104
+
105
+ historical_skills.append({
106
+ "name": skill,
107
+ "skill_occurrence": int(count),
108
+ "correlation": correlation
109
+ })
110
+
111
+ num_jobs = data['requisition_id'].nunique()
112
+
113
+ return SkillAnalysisResponse(
114
+ historical_jobs=num_jobs,
115
+ input_skills=[], # Would come from requisition details
116
+ historical_skills_with_analysis=historical_skills,
117
+ )
118
+
119
+
120
+ def get_skill_impact_fill_rate(requisition_id: str, skill_name: str) -> Union[SkillImpactFillRateResponse, RequisitionNotFoundResponse]:
121
+ """
122
+ Evaluates how the inclusion of a specific skill affects requisition
123
+ fill-rate metrics and candidate pool size.
124
+
125
+ Args:
126
+ requisition_id: The job requisition ID.
127
+ skill_name: The skill to evaluate.
128
+
129
+ Returns:
130
+ Impact metrics with and without the skill.
131
+ """
132
+ _log_api_call(f"API call: get_skill_impact_fill_rate(requisition_id={requisition_id}, skill_name={skill_name})")
133
+
134
+ # Check if requisition ID is valid
135
+ error = _check_requisition_valid(requisition_id)
136
+ if error:
137
+ return error
138
+
139
+ loader = get_data_loader()
140
+ data = loader.get_similar_requisitions(requisition_id)
141
+
142
+ # Split data by whether requisitions included this skill
143
+ has_skill_reqs = data[data['skills_parsed'].apply(lambda x: skill_name in x)]['requisition_id'].unique()
144
+ no_skill_reqs = data[~data['requisition_id'].isin(has_skill_reqs)]['requisition_id'].unique()
145
+
146
+ def calc_metrics(req_ids):
147
+ if len(req_ids) == 0:
148
+ return {"fill_rate_percentage": 0, "time_to_fill_days": 0, "candidate_pool_size": 0}
149
+
150
+ req_data = data[data['requisition_id'].isin(req_ids)]
151
+
152
+ # Fill rate: % of reqs that got at least one hire
153
+ reqs_with_hires = req_data[req_data['hired']]['requisition_id'].nunique()
154
+ fill_rate = reqs_with_hires / len(req_ids) * 100
155
+
156
+ # Time to fill: average days from applied to hired
157
+ hired = req_data[req_data['hired']]
158
+ if len(hired) > 0:
159
+ time_to_fill = (hired['hire_date'] - hired['applied_at']).dt.days.mean()
160
+ else:
161
+ time_to_fill = 0
162
+
163
+ # Candidate pool size
164
+ pool_size = len(req_data)
165
+
166
+ return {
167
+ "fill_rate_percentage": round(fill_rate, 1),
168
+ "time_to_fill_days": int(time_to_fill),
169
+ "candidate_pool_size": pool_size
170
+ }
171
+
172
+ with_skill = calc_metrics(has_skill_reqs)
173
+ without_skill = calc_metrics(no_skill_reqs)
174
+
175
+ return SkillImpactFillRateResponse(
176
+ skill_name=skill_name,
177
+ impact=with_skill,
178
+ compared_to_baseline=without_skill,
179
+ )
180
+
181
+
182
+ def get_skill_impact_sla(requisition_id: str, skill_name: str) -> Union[SkillImpactSLAResponse, RequisitionNotFoundResponse]:
183
+ """
184
+ Analyzes how a skill affects SLA achievement rate.
185
+
186
+ Args:
187
+ requisition_id: The job requisition ID.
188
+ skill_name: The skill being analyzed.
189
+
190
+ Returns:
191
+ Success percentages with/without the skill and the delta.
192
+ """
193
+ _log_api_call(f"API call: get_skill_impact_sla(requisition_id={requisition_id}, skill_name={skill_name})")
194
+
195
+ # Check if requisition ID is valid
196
+ error = _check_requisition_valid(requisition_id)
197
+ if error:
198
+ return error
199
+
200
+ loader = get_data_loader()
201
+ data = loader.get_similar_requisitions(requisition_id)
202
+
203
+ # Filter to reviewed candidates only (SLA only applies to reviewed candidates)
204
+ reviewed_data = data[data['reviewed']]
205
+
206
+ # Get candidates with and without this skill
207
+ has_skill = reviewed_data[reviewed_data['skills_parsed'].apply(lambda x: skill_name in x)]
208
+ no_skill = reviewed_data[reviewed_data['skills_parsed'].apply(lambda x: skill_name not in x)]
209
+
210
+ sla_with = round(has_skill['sla_met'].mean() * 100) if len(has_skill) > 0 else 0
211
+ sla_without = round(no_skill['sla_met'].mean() * 100) if len(no_skill) > 0 else 0
212
+
213
+ return SkillImpactSLAResponse(
214
+ requisition_id=requisition_id,
215
+ skill_name=skill_name,
216
+ sla_achievement_with_skill=sla_with,
217
+ sla_achievement_without_skill=sla_without,
218
+ delta=sla_with - sla_without,
219
+ )
220
+
221
+
222
+ def get_skill_relevance_justification(requisition_id: str, skill_name: str) -> Union[SkillRelevanceResponse, RequisitionNotFoundResponse]:
223
+ """
224
+ Explains whether a skill is relevant and why, based on historical hiring
225
+ success and outcome data.
226
+
227
+ Args:
228
+ requisition_id: The job requisition ID.
229
+ skill_name: The skill being justified.
230
+
231
+ Returns:
232
+ Relevance determination with justification.
233
+ """
234
+ _log_api_call(f"API call: get_skill_relevance_justification(requisition_id={requisition_id}, skill_name={skill_name})")
235
+
236
+ # Check if requisition ID is valid
237
+ error = _check_requisition_valid(requisition_id)
238
+ if error:
239
+ return error
240
+
241
+ # Get both SLA and fill rate impacts
242
+ sla_impact = get_skill_impact_sla(requisition_id, skill_name)
243
+ fill_impact = get_skill_impact_fill_rate(requisition_id, skill_name)
244
+
245
+ # Determine relevance based on both metrics
246
+ is_relevant = False
247
+ if sla_impact.delta > 5 or fill_impact.impact.fill_rate_percentage > fill_impact.compared_to_baseline.fill_rate_percentage * 1.2:
248
+ is_relevant = True
249
+
250
+ justification = SkillJustificationData(
251
+ requisition_id=requisition_id,
252
+ skill_name=skill_name,
253
+ sla_achievement_with_skill=sla_impact.sla_achievement_with_skill,
254
+ sla_achievement_without_skill=sla_impact.sla_achievement_without_skill,
255
+ delta=sla_impact.delta,
256
+ impact=SkillJustificationImpact(
257
+ fill_rate_percentage=fill_impact.impact.fill_rate_percentage,
258
+ time_to_fill_days=fill_impact.impact.time_to_fill_days,
259
+ candidate_pool_size=fill_impact.impact.candidate_pool_size,
260
+ ),
261
+ compared_to_baseline=SkillJustificationImpact(
262
+ fill_rate_percentage=fill_impact.compared_to_baseline.fill_rate_percentage,
263
+ time_to_fill_days=fill_impact.compared_to_baseline.time_to_fill_days,
264
+ candidate_pool_size=fill_impact.compared_to_baseline.candidate_pool_size,
265
+ ),
266
+ )
267
+
268
+ return SkillRelevanceResponse(
269
+ requisition_id=requisition_id,
270
+ skill_name=skill_name,
271
+ is_relevant=is_relevant,
272
+ justification=justification,
273
+ )
274
+
275
+
276
+ def get_successful_posting_criteria() -> SuccessfulPostingResponse:
277
+ """
278
+ Returns the business definition of a successful job posting,
279
+ including thresholds and benchmarks for success.
280
+
281
+ Returns:
282
+ Success criteria thresholds.
283
+ """
284
+ _log_api_call("API call: get_successful_posting_criteria()")
285
+
286
+ return SuccessfulPostingResponse(
287
+ criteria=SuccessCriteria(
288
+ time_to_fill_threshold_days=90,
289
+ offer_acceptance_rate_min=50,
290
+ sla_compliance_min=80,
291
+ candidate_quality_rating_avg=3.5,
292
+ ),
293
+ justification="Based on historical performance benchmarks and industry standards",
294
+ )
295
+
296
+
297
+ def get_data_sources_used(requisition_id: str) -> Union[DataSourcesResponse, RequisitionNotFoundResponse]:
298
+ """
299
+ Lists the datasets and ML models used to make hiring recommendations
300
+ for a requisition.
301
+
302
+ Args:
303
+ requisition_id: The job requisition ID.
304
+
305
+ Returns:
306
+ Data sources and models used.
307
+ """
308
+ _log_api_call(f"API call: get_data_sources_used(requisition_id={requisition_id})")
309
+
310
+ # Check if requisition ID is valid
311
+ error = _check_requisition_valid(requisition_id)
312
+ if error:
313
+ return error
314
+
315
+ return DataSourcesResponse(
316
+ requisition_id=requisition_id,
317
+ datasets_used=[
318
+ "Historical hiring success data",
319
+ "Requisition skill tagging",
320
+ "Funnel conversion metrics",
321
+ "Candidate quality feedback",
322
+ ],
323
+ models_involved=[
324
+ "Skill relevance classifier",
325
+ "SLA impact regression model",
326
+ "Funnel conversion recommender",
327
+ ],
328
+ )
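`get_skill_analysis` in the file above maps the SLA-rate difference between candidates with and without a skill onto one of five correlation labels. A sketch isolating those thresholds into a standalone helper (`label_sla_correlation` is illustrative, not a function in this repo):

```python
def label_sla_correlation(sla_with: float, sla_without: float) -> str:
    """Map the SLA-rate difference to the correlation label the API emits.

    Rates are fractions in [0, 1]; a difference beyond +/-0.10 (ten
    percentage points) counts as a "highly" positive or negative impact.
    """
    diff = sla_with - sla_without
    if diff < -0.10:
        return "highly negative impact on SLA"
    elif diff < 0:
        return "slightly negative impact on SLA"
    elif diff > 0.10:
        return "highly positive impact on SLA"
    elif diff > 0:
        return "slightly positive impact on SLA"
    return "no impact on SLA"


print(label_sla_correlation(0.85, 0.70))  # → highly positive impact on SLA
```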
api_skills_error.py ADDED
@@ -0,0 +1,238 @@
+ """
2
+ Error-prone skills API variants for testing agent resilience.
3
+
4
+ Each function has a unique, plausible intent and embeds a specific error behavior.
5
+ Completely independent from original APIs — accesses DataLoader directly.
6
+
7
+ AUTO-GENERATED by scripts/generate_hf.sh - DO NOT EDIT DIRECTLY
8
+ Edit skills_error.py in main repo and regenerate.
9
+ """
10
+
11
+ import json
12
+ import random
13
+ from typing import Any, Dict, Optional
14
+
15
+ from data_loader import get_data_loader
16
+
17
+ # Seeded RNG for reproducible probabilistic behavior
18
+ _rng = random.Random(42)
19
+
20
+
21
+ def _check_requisition(requisition_id: str) -> Optional[Dict[str, Any]]:
22
+ """Return error dict if requisition invalid, else None."""
23
+ loader = get_data_loader()
24
+ if not loader.is_valid_requisition(requisition_id):
25
+ return {
26
+ "error": "requisition_not_found",
27
+ "message": f"Requisition {requisition_id} not found",
28
+ }
29
+ return None
30
+
31
+
32
+ # ── Test 27: Type mismatch — string instead of structured list ──────────────
33
+
34
+ def get_skill_summary(requisition_id: str) -> str:
35
+ """Get a quick text summary of skills needed for a requisition.
36
+
37
+ Returns a concise comma-separated skill overview.
38
+
39
+ ERROR BEHAVIOR: Returns a plain comma-separated string instead of
40
+ structured SkillAnalysisResponse. Tests type mismatch handling.
41
+ """
42
+ err = _check_requisition(requisition_id)
43
+ if err:
44
+ return json.dumps(err)
45
+
46
+ loader = get_data_loader()
47
+ data = loader.get_similar_requisitions(requisition_id)
48
+ all_skills: set = set()
49
+ for skills_list in data["skills_parsed"].dropna():
50
+ if isinstance(skills_list, list):
51
+ all_skills.update(skills_list)
52
+
53
+ return ", ".join(sorted(all_skills))
54
+
55
+
56
+ # ── Test 34: Missing output schema — untyped dict ───────────────────────────
57
+
58
+ def get_model_registry(requisition_id: str) -> Dict[str, Any]:
59
+ """Check which ML models are registered for a given requisition.
60
+
61
+ Returns model registry information including versions and status.
62
+
63
+ ERROR BEHAVIOR: No Pydantic output schema — returns a plain dict
64
+ with dynamically typed fields. Tests schema inference.
65
+ """
66
+ err = _check_requisition(requisition_id)
67
+ if err:
68
+ return err
69
+
70
+ return {
71
+ "requisition_id": requisition_id,
72
+ "models": [
73
+ {
74
+ "name": "Skill relevance classifier",
75
+ "version": "2.1.0",
76
+ "status": "active",
77
+ "last_trained": "2024-11-15",
78
+ "accuracy": 0.87,
79
+ },
80
+ {
81
+ "name": "SLA impact regression model",
82
+ "version": "1.4.2",
83
+ "status": "active",
84
+ "last_trained": "2024-10-01",
85
+ "r_squared": 0.72,
86
+ },
87
+ {
88
+ "name": "Funnel conversion recommender",
89
+ "version": "3.0.0-beta",
90
+ "status": "staging",
91
+ "last_trained": "2025-01-20",
92
+ "precision": 0.81,
93
+ },
94
+ ],
95
+ "registry_updated": "2025-04-29",
96
+ }
97
+
98
+
99
+ # ── Test 35: Missing input schema — undocumented params ─────────────────────
100
+
101
+ def get_skill_lookup(requisition_id: str, skill_name: str = None,
102
+ include_history: bool = False,
103
+ format: str = "json") -> Dict[str, Any]:
104
+ """Look up a specific skill and its metrics for a requisition.
105
+
106
+ ERROR BEHAVIOR: Accepts undocumented parameters (include_history, format)
107
+ not described in the tool schema. Tests agent handling of extra params.
108
+ """
109
+ err = _check_requisition(requisition_id)
110
+ if err:
111
+ return err
112
+
113
+ loader = get_data_loader()
114
+ data = loader.get_similar_requisitions(requisition_id)
115
+
116
+ # Find skill occurrence
117
+ total = 0
118
+ for skills_list in data["skills_parsed"].dropna():
119
+ if isinstance(skills_list, list) and skill_name in skills_list:
120
+ total += 1
121
+
122
+ result = {
123
+ "requisition_id": requisition_id,
124
+ "skill_name": skill_name,
125
+ "occurrence_count": total,
126
+ "total_candidates": len(data),
127
+ "occurrence_rate": round(total / len(data) * 100, 1) if len(data) > 0 else 0,
128
+ }
129
+
130
+ if include_history:
131
+ result["history"] = {
132
+ "first_seen": "2023-10-09",
133
+ "trend": "stable",
134
+ "quarterly_counts": [total // 4] * 4,
135
+ }
136
+
137
+ return result
138
+
139
+
140
+ # ── Test 40: Deeply nested JSON (15 levels) ─────────────────────────────────
141
+
142
+ def get_skill_deep_analysis(requisition_id: str) -> Dict[str, Any]:
143
+ """Get a deep analysis breakdown of skills with detailed sub-categories.
144
+
145
+ Returns comprehensive multi-level skill categorization and metrics.
146
+
147
+ ERROR BEHAVIOR: Response is nested 15 levels deep.
148
+ Tests agent ability to navigate deeply nested structures.
149
+ """
150
+ err = _check_requisition(requisition_id)
151
+ if err:
152
+ return err
153
+
154
+ loader = get_data_loader()
155
+ data = loader.get_similar_requisitions(requisition_id)
156
+
157
+ # Collect top skills
158
+ all_skills: list = []
159
+ for skills_list in data["skills_parsed"].dropna():
160
+ if isinstance(skills_list, list):
161
+ all_skills.extend(skills_list)
162
+
163
+ from collections import Counter
164
+ skill_counts = Counter(all_skills)
165
+ top_skills = skill_counts.most_common(5)
166
+
167
+ # Build deeply nested structure (15 levels)
168
+ def nest(depth: int, skill_name: str, count: int) -> Dict[str, Any]:
169
+ if depth <= 0:
170
+ return {"skill": skill_name, "count": count}
171
+ return {
172
+ "level": depth,
173
+ "metadata": {"type": f"analysis_layer_{depth}"},
174
+ "data": nest(depth - 1, skill_name, count),
175
+ }
176
+
177
+ skills_nested = [
178
+ nest(15, name, count) for name, count in top_skills
179
+ ]
180
+
181
+ return {
182
+ "requisition_id": requisition_id,
183
+ "analysis_version": "3.0",
184
+ "results": {
185
+ "nested_skills": skills_nested,
186
+ "total_depth": 15,
187
+ },
188
+ }
189
+
190
+
191
+ # ── Test 42: Input schema mismatch — expects skill_id but docs say skill_name
192
+
193
+ def analyze_skill_match(requisition_id: str, skill_id: str) -> Dict[str, Any]:
194
+ """Check if a skill is a good match for a requisition.
195
+
196
+ Args:
197
+ requisition_id: The job requisition ID.
198
+ skill_id: The skill identifier to check.
199
+
200
+ ERROR BEHAVIOR: Function signature says `skill_id` but tool description
201
+ and documentation say `skill_name`. Tests agent adaptation to mismatched
202
+ parameter names.
203
+ """
204
+ err = _check_requisition(requisition_id)
205
+ if err:
206
+ return err
207
+
208
+ # Treat skill_id as skill_name (the mismatch)
209
+ skill_name = skill_id
210
+
211
+ loader = get_data_loader()
212
+ data = loader.get_similar_requisitions(requisition_id)
213
+ reviewed = data[data["reviewed"]]
214
+
215
+ has_skill = reviewed[reviewed["skills_parsed"].apply(lambda x: skill_name in x)]
216
+ no_skill = reviewed[reviewed["skills_parsed"].apply(lambda x: skill_name not in x)]
217
+
218
+ sla_with = round(has_skill["sla_met"].mean() * 100) if len(has_skill) > 0 else 0
219
+ sla_without = round(no_skill["sla_met"].mean() * 100) if len(no_skill) > 0 else 0
220
+
221
+ total_with_skill = sum(
222
+ 1 for sl in data["skills_parsed"].dropna()
223
+ if isinstance(sl, list) and skill_name in sl
224
+ )
225
+
226
+ match_score = min(100, int(
227
+ (total_with_skill / len(data) * 50 if len(data) > 0 else 0)
228
+ + (max(0, sla_with - sla_without))
229
+ ))
230
+
231
+ return {
232
+ "requisition_id": requisition_id,
233
+ "skill_id": skill_name,
234
+ "match_score": match_score,
235
+ "sla_delta": sla_with - sla_without,
236
+ "occurrence_rate": round(total_with_skill / len(data) * 100, 1) if len(data) > 0 else 0,
237
+ "recommendation": "good match" if match_score >= 50 else "weak match",
238
+ }
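The 15-level wrapper produced by `get_skill_deep_analysis` above can be walked with a simple loop that follows the `"data"` keys. A sketch of how a consumer might unwrap it; `unwrap` is a hypothetical helper, and `build` merely mirrors the server-side `nest()` so the example is self-contained:

```python
def unwrap(node):
    """Descend through 'data' keys until the leaf {'skill', 'count'} dict."""
    while isinstance(node, dict) and "data" in node:
        node = node["data"]
    return node


def build(depth, skill, count):
    """Mirror of the nest() helper in the diff above, for demonstration."""
    if depth <= 0:
        return {"skill": skill, "count": count}
    return {
        "level": depth,
        "metadata": {"type": f"analysis_layer_{depth}"},
        "data": build(depth - 1, skill, count),
    }


leaf = unwrap(build(15, "Python", 42))
print(leaf)  # → {'skill': 'Python', 'count': 42}
```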
app.py ADDED
@@ -0,0 +1,625 @@
+ """Gradio UI for BPO Benchmark evaluation using CUGA SDK."""
2
+
3
+ import asyncio
4
+ import json
5
+ import logging
6
+ import os
7
+ from pathlib import Path
8
+ from typing import Any, Dict, List, Optional, Tuple
9
+
10
+ import gradio as gr
11
+
12
+ from agent import (
13
+ CUGAAgent,
14
+ LangfuseTracker,
15
+ LLMJudge,
16
+ check_keywords,
17
+ compare_api_calls,
18
+ compute_string_similarity,
19
+ compute_exact_match,
20
+ compute_final_score,
21
+ get_llm_judge,
22
+ get_provider_models,
23
+ get_provider_placeholder,
24
+ get_default_model,
25
+ is_langfuse_configured,
26
+ get_langfuse_host,
27
+ PROVIDER_CONFIGS,
28
+ )
29
+
30
+ logging.basicConfig(level=logging.INFO)
31
+ logger = logging.getLogger(__name__)
32
+
33
+
34
+ def ensure_mcp_config():
35
+ """Ensure MCP servers config file exists."""
36
+ mcp_dir = Path(__file__).parent / "mcp_servers"
37
+ mcp_dir.mkdir(exist_ok=True)
38
+
39
+ config_file = mcp_dir / "bpo.yaml"
40
+ if not config_file.exists():
41
+ config_file.write_text("""services:
42
+ - bpo:
43
+ url: http://127.0.0.1:8000/openapi.json
44
+ description: BPO recruiting analytics API
45
+ """)
46
+ return config_file
47
+
48
+
49
+ # Ensure config exists
50
+ ensure_mcp_config()
51
+
52
+
53
+ # Test suite definitions: label -> filename
54
+ TEST_SUITES = {
55
+ "Core (26 tasks)": "tasks.json",
56
+ "Type Mismatch (3 tasks)": "tasks_type_mismatch.json",
57
+ "HTTP Errors (4 tasks)": "tasks_http_errors.json",
58
+ "Schema Violations (4 tasks)": "tasks_schema_violations.json",
59
+ "Edge Cases (5 tasks)": "tasks_edge_cases.json",
60
+ "Undocumented Behaviors (3 tasks)": "tasks_undocumented.json",
61
+ }
62
+
63
+
64
+ def _find_data_dir() -> Optional[Path]:
65
+ """Locate the data directory."""
66
+ candidates = [
67
+ Path(__file__).parent / "data",
68
+ Path("./data"),
69
+ Path("/home/user/app/data"),
70
+ ]
71
+ for p in candidates:
72
+ if p.is_dir():
73
+ return p
74
+ return None
75
+
76
+
77
+ def _load_tasks_from_file(path: Path) -> List[Dict[str, Any]]:
78
+ """Load test cases from a single JSON file."""
79
+ if not path.exists():
80
+ logger.warning(f"Task file not found: {path}")
81
+ return []
82
+ with open(path) as f:
83
+ data = json.load(f)
84
+ cases = []
85
+ if isinstance(data, list):
86
+ for item in data:
87
+ if isinstance(item, dict) and "test_cases" in item:
88
+ cases.extend(item["test_cases"])
89
+ elif isinstance(item, dict):
90
+ cases.append(item)
91
+ return cases
92
+
93
+
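`_load_tasks_from_file` accepts two shapes of suite JSON: a list of suite objects carrying a `test_cases` array, or a flat list of bare task dicts. The flattening rule can be exercised in isolation:

```python
# Mirrors the flattening logic in _load_tasks_from_file: suite objects
# contribute their "test_cases"; bare task dicts are kept as-is.
def flatten(data):
    cases = []
    for item in data:
        if isinstance(item, dict) and "test_cases" in item:
            cases.extend(item["test_cases"])
        elif isinstance(item, dict):
            cases.append(item)
    return cases

sample = [
    {"name": "bpo-benchmark", "test_cases": [{"name": "task_1"}, {"name": "task_2"}]},
    {"name": "task_3"},  # bare task dict, appended unchanged
]
print([c["name"] for c in flatten(sample)])  # → ['task_1', 'task_2', 'task_3']
```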
94
+ def load_tasks(suite_labels: Optional[List[str]] = None) -> List[Dict[str, Any]]:
95
+ """Load tasks from one or more test suite files.
96
+
97
+ Args:
98
+ suite_labels: List of suite labels to load (keys from TEST_SUITES).
99
+ If None, loads only the core suite.
100
+ """
101
+ data_dir = _find_data_dir()
102
+ if data_dir is None:
103
+ logger.warning("Data directory not found")
104
+ return []
105
+
106
+ if suite_labels is None:
107
+ suite_labels = ["Core (26 tasks)"]
108
+
109
+ tasks = []
110
+ for label in suite_labels:
111
+ filename = TEST_SUITES.get(label)
112
+ if filename:
113
+ loaded = _load_tasks_from_file(data_dir / filename)
114
+ logger.info(f"Loaded {len(loaded)} tasks from {filename}")
115
+ tasks.extend(loaded)
116
+
117
+ return tasks
118
+
119
+
120
+ def get_available_suites() -> List[str]:
121
+ """Return labels of test suites that actually exist on disk."""
122
+ data_dir = _find_data_dir()
123
+ if data_dir is None:
124
+ return []
125
+ return [label for label, fn in TEST_SUITES.items() if (data_dir / fn).exists()]
126
+
127
+
128
+ # Load core tasks at startup for the task list display
129
+ AVAILABLE_SUITES = get_available_suites()
130
+ ALL_TASKS_CACHE: Dict[str, List[Dict[str, Any]]] = {}
131
+ for label in AVAILABLE_SUITES:
132
+ ALL_TASKS_CACHE[label] = load_tasks([label])
133
+ TASKS = ALL_TASKS_CACHE.get("Core (26 tasks)", [])
134
+ total_available = sum(len(v) for v in ALL_TASKS_CACHE.values())
135
+ logger.info(f"Loaded {len(TASKS)} core tasks, {total_available} total across {len(AVAILABLE_SUITES)} suites")
136
+
137
+
138
+ async def _setup_agent(api_key: str, provider: str, model: str) -> CUGAAgent:
139
+ """Initialize and return CUGA agent."""
140
+ agent = CUGAAgent(
141
+ api_key=api_key,
142
+ provider=provider.lower(),
143
+ model=model if model.strip() else None,
144
+ )
145
+ await agent.setup()
146
+ return agent
147
+
148
+
149
+ async def _run_single_task(
150
+ agent: CUGAAgent, task: Dict, task_index: int,
151
+ llm_judge: Any, llm_judge_requested: bool,
152
+ langfuse: Any,
153
+ ) -> Dict[str, Any]:
154
+ """Run a single evaluation task and return the result."""
155
+ task_name = task.get("name", f"task_{task_index+1}")
156
+ query = task.get("intent", "")
157
+ thread_id = f"eval_{task_name}_{task_index}"
158
+
159
+ try:
160
+ response, tool_calls = await agent.run(query, thread_id=thread_id)
161
+
162
+ # Get expected output and keywords
163
+ expected_output = task.get("expected_output", {})
164
+ expected_keywords = expected_output.get("keywords", [])
165
+ expected_answer = expected_output.get("response", "") or expected_output.get("answer", "")
166
+ tool_calls_expected = expected_output.get("tool_calls", []) or expected_output.get("apis", [])
167
+ expected_apis = []
168
+ for tc in tool_calls_expected:
169
+ if isinstance(tc, dict):
170
+ expected_apis.append(tc.get("name", ""))
171
+ elif isinstance(tc, str):
172
+ expected_apis.append(tc)
173
+
174
+ # Compute metrics
175
+ keyword_check = check_keywords(response, expected_keywords)
176
+ similarity = compute_string_similarity(response, expected_answer) if expected_answer else 0.0
177
+ exact_match = compute_exact_match(response, expected_answer) if expected_answer else False
178
+
179
+ # Extract tool names
180
+ tool_names = []
181
+ for tc in tool_calls:
182
+ if isinstance(tc, dict):
183
+ tool_names.append(tc.get("name", str(tc)))
184
+ else:
185
+ tool_names.append(str(tc))
186
+
187
+ # Compare API calls
188
+ api_comparison = compare_api_calls(tool_names, expected_apis) if expected_apis else {
189
+ "missing": [], "extra": [], "correct": 0, "expected_count": 0,
190
+ "called_count": len(tool_names), "all_expected_called": True,
191
+ }
192
+
193
+ # LLM Judge evaluation
194
+ llm_judge_score = None
195
+ llm_judge_rationale = None
196
+ if llm_judge and expected_answer:
197
+ try:
198
+ judge_result = await llm_judge.judge(response, expected_answer, query)
199
+ llm_judge_score = judge_result.get("score")
200
+ llm_judge_rationale = judge_result.get("rationale", "")
201
+ except Exception as e:
202
+ logger.warning(f"LLM judge failed for {task_name}: {e}")
203
+
204
+ # Compute final score (matches main repo logic)
205
+ final_score = compute_final_score(
206
+ exact_match=exact_match,
207
+ similarity=similarity,
208
+ llm_judge_score=llm_judge_score,
209
+ llm_judge_requested=llm_judge_requested,
210
+ agent_output=response,
211
+ apis_missing=api_comparison["missing"],
212
+ require_api_match=True,
213
+ )
214
+
215
+ result = {
216
+ "task_id": task_name,
217
+ "difficulty": task.get("difficulty", "unknown"),
218
+ "intent": query,
219
+ "response": response,
220
+ "expected_answer": expected_answer,
221
+ "expected_keywords": expected_keywords,
222
+ "found_keywords": keyword_check["found_keywords"],
223
+ "missing_keywords": keyword_check["missing_keywords"],
224
+ "match_rate": keyword_check["match_rate"],
225
+ "similarity": similarity,
226
+ "exact_match": exact_match,
227
+ "llm_judge_score": llm_judge_score,
228
+ "llm_judge_rationale": llm_judge_rationale,
229
+ "final_score": final_score,
230
+ "passed": final_score == 1,
231
+ "tool_calls": tool_names,
232
+ "expected_apis": expected_apis,
233
+ "apis_missing": api_comparison["missing"],
234
+ "apis_extra": api_comparison["extra"],
235
+ "apis_correct": api_comparison["correct"],
236
+ }
237
+
238
+ # Score in Langfuse
239
+ scores = {
240
+ "similarity": similarity,
241
+ "keyword_match": keyword_check["match_rate"],
242
+ "final_score": float(final_score),
243
+ }
244
+ if llm_judge_score is not None:
245
+ scores["llm_judge"] = llm_judge_score
246
+ langfuse.score_task(task_name, scores)
247
+
248
+ logger.info(
249
+ f"Task {task_name}: {'PASS' if result['passed'] else 'FAIL'} "
250
+ f"(sim={similarity:.2f}, kw={keyword_check['match_rate']:.1%}"
251
+ f"{f', judge={llm_judge_score:.2f}' if llm_judge_score is not None else ''})"
252
+ )
253
+
254
+ return result
255
+
256
+ except Exception as e:
257
+ logger.exception(f"Error in task {task_name}")
258
+ return {
259
+ "task_id": task_name,
260
+ "difficulty": task.get("difficulty", "unknown"),
261
+ "intent": task.get("intent", ""),
262
+ "response": f"Error: {e}",
263
+ "passed": False,
264
+ "final_score": 0,
265
+ "similarity": 0.0,
266
+ "exact_match": False,
267
+ "match_rate": 0.0,
268
+ "tool_calls": [],
269
+ "error": str(e),
270
+ }
271
+
272
+
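The API-call comparison consumed above (fields `missing`, `extra`, `correct`, `expected_count`, `called_count`, `all_expected_called`) comes from `compare_api_calls` in `agent.py`, which this diff does not show. A minimal stand-in with the same result shape, assuming matching is by tool name only and order-insensitive, would be:

```python
from typing import Any, Dict, List

def compare_api_calls_stub(called: List[str], expected: List[str]) -> Dict[str, Any]:
    # Assumption: name-based, order-insensitive matching (not the real agent.py logic).
    called_set, expected_set = set(called), set(expected)
    missing = sorted(expected_set - called_set)
    return {
        "missing": missing,
        "extra": sorted(called_set - expected_set),
        "correct": len(called_set & expected_set),
        "expected_count": len(expected),
        "called_count": len(called),
        "all_expected_called": not missing,
    }

result = compare_api_calls_stub(
    ["candidate_source_sla_per_source", "unexpected_tool"],
    ["candidate_source_sla_per_source", "candidate_source_funnel_conversion_by_source"],
)
print(result["missing"])  # → ['candidate_source_funnel_conversion_by_source']
```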
273
+ def _build_results_markdown(results: List[Dict], langfuse: Any) -> str:
274
+ """Build markdown summary from evaluation results."""
275
+ total = len(results)
276
+ passed = sum(1 for r in results if r.get("passed", False))
277
+ avg_similarity = sum(r.get("similarity", 0) for r in results) / total if total else 0
278
+ avg_match = sum(r.get("match_rate", 0) for r in results) / total if total else 0
279
+ exact_matches = sum(1 for r in results if r.get("exact_match", False))
280
+ final_score_passes = sum(1 for r in results if r.get("final_score") == 1)
281
+ keyword_full_matches = sum(1 for r in results if r.get("match_rate", 0) == 1.0)
282
+ tasks_with_tools = sum(1 for r in results if r.get("tool_calls"))
283
+
284
+ # LLM Judge metrics
285
+ judge_scores = [r.get("llm_judge_score") for r in results if r.get("llm_judge_score") is not None]
286
+ avg_judge_score = sum(judge_scores) / len(judge_scores) if judge_scores else None
287
+ judge_passes = sum(1 for s in judge_scores if s >= 0.85) if judge_scores else 0
288
+
289
+ # API metrics
290
+ tasks_with_expected_apis = [r for r in results if r.get("expected_apis")]
291
+ api_correct = sum(1 for r in tasks_with_expected_apis if not r.get("apis_missing"))
292
+ api_accuracy = api_correct / len(tasks_with_expected_apis) if tasks_with_expected_apis else None
293
+
294
+ summary = {
295
+ "total_tasks": total,
296
+ "passed": passed,
297
+ "pass_rate": passed / total if total else 0,
298
+ "avg_similarity": avg_similarity,
299
+ "avg_keyword_match": avg_match,
300
+ "exact_matches": exact_matches,
301
+ "final_score_passes": final_score_passes,
302
+ "keyword_full_matches": keyword_full_matches,
303
+ "avg_llm_judge_score": avg_judge_score,
304
+ "api_accuracy": api_accuracy,
305
+ }
306
+
307
+ # End Langfuse trace
308
+ langfuse.end_trace(summary)
309
+
310
+ # Group by difficulty
311
+ by_difficulty = {}
312
+ for r in results:
313
+ diff = r.get("difficulty", "unknown")
314
+ if diff not in by_difficulty:
315
+ by_difficulty[diff] = {"total": 0, "passed": 0}
316
+ by_difficulty[diff]["total"] += 1
317
+ if r.get("passed", False):
318
+ by_difficulty[diff]["passed"] += 1
319
+
320
+ # Build markdown output
321
+ md = "## Evaluation Complete\n\n"
322
+ md += f"**Total Tasks:** {total}\n"
323
+ md += f"**Final Score:** {final_score_passes}/{total} ({100*final_score_passes/total:.1f}%)\n"
324
+ md += f"**Exact Matches:** {exact_matches} ({100*exact_matches/total:.1f}%)\n"
325
+ md += f"**Avg Similarity:** {avg_similarity:.2f}\n"
326
+ md += f"**Keyword Match:** {avg_match*100:.1f}% avg ({keyword_full_matches}/{total} full matches)\n"
327
+ if avg_judge_score is not None:
328
+ md += f"**LLM Judge:** {len(judge_scores)} tasks, avg={avg_judge_score:.2f} ({judge_passes}/{len(judge_scores)} pass)\n"
329
+ if api_accuracy is not None:
330
+ md += f"**API Accuracy:** {api_correct}/{len(tasks_with_expected_apis)} ({api_accuracy*100:.1f}%)\n"
331
+ md += f"**Tasks with Tool Calls:** {tasks_with_tools}/{total}\n"
332
+
333
+ if langfuse.enabled:
334
+ md += "\n*Langfuse tracking enabled*\n"
335
+ elif langfuse.init_error:
336
+ md += f"\n*Langfuse error: {langfuse.init_error}*\n"
337
+
338
+ md += "\n"
339
+
340
+ # By difficulty breakdown
341
+ if by_difficulty:
342
+ md += "### By Difficulty\n"
343
+ for diff, stats in sorted(by_difficulty.items()):
344
+ rate = stats["passed"] / stats["total"] * 100 if stats["total"] else 0
345
+ md += f"- **{diff}**: {stats['passed']}/{stats['total']} ({rate:.1f}%)\n"
346
+ md += "\n"
347
+
348
+ md += "---\n\n"
349
+
350
+ # Individual results
351
+ for r in results:
352
+ status = "PASS" if r.get("passed") else "FAIL"
353
+ md += f"### {status} - {r.get('task_id', 'unknown')} ({r.get('difficulty', 'unknown')})\n\n"
354
+ md += f"**Query:** {r.get('intent', '')}\n\n"
355
+
356
+ response_text = r.get("response", "")
357
+ if len(response_text) > 500:
358
+ response_text = response_text[:500] + "..."
359
+ md += f"**Response:** {response_text}\n\n"
360
+
361
+ # Enhanced metrics display
362
+ md += "**Metrics:**\n"
363
+ md += f"- **Final Score: {'PASS' if r.get('final_score') == 1 else 'FAIL'}**\n"
364
+ md += f"- Similarity: {r.get('similarity', 0)*100:.1f}%\n"
365
+ md += f"- Exact Match: {'Yes' if r.get('exact_match') else 'No'}\n"
366
+ if r.get("llm_judge_score") is not None:
367
+ md += f"- LLM Judge: {r['llm_judge_score']:.2f}\n"
368
+ md += f"- Keyword Match: {r.get('match_rate', 0)*100:.1f}%\n"
369
+ md += "\n"
370
+
371
+ if r.get("missing_keywords"):
372
+ missing = r["missing_keywords"][:5]
373
+ md += f"**Missing keywords:** {', '.join(missing)}"
374
+ if len(r.get("missing_keywords", [])) > 5:
375
+ md += f" (+{len(r['missing_keywords']) - 5} more)"
376
+ md += "\n\n"
377
+
378
+ # API metrics
379
+ if r.get("expected_apis"):
380
+ correct = r.get("apis_correct", 0)
381
+ expected = len(r.get("expected_apis", []))
382
+ api_status = "PASS" if not r.get("apis_missing") else "FAIL"
383
+ md += f"- API Accuracy: {correct}/{expected} ({api_status})\n"
384
+ if r.get("tool_calls"):
385
+ md += f"- Tools used: {', '.join(r['tool_calls'])}\n"
386
+ if r.get("apis_missing"):
387
+ md += f"- Missing APIs: {', '.join(r['apis_missing'])}\n"
388
+ md += "\n"
389
+
390
+ if r.get("error"):
391
+ md += f"**Error:** {r['error']}\n\n"
392
+
393
+ md += "---\n\n"
394
+
395
+ return md
396
+
397
+
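The judge aggregation above skips tasks that never received a judge score and counts a pass at the 0.85 threshold. In isolation:

```python
# None means the LLM judge did not run for that task; only real scores
# feed the average, and 0.85 is the per-task pass threshold used above.
judge_scores = [0.9, 0.8, 1.0, None]
valid = [s for s in judge_scores if s is not None]
avg_judge_score = sum(valid) / len(valid) if valid else None
judge_passes = sum(1 for s in valid if s >= 0.85)
print(round(avg_judge_score, 2), judge_passes)  # → 0.9 2
```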
398
+ def run_evaluation(api_key, provider, model, task_ids, test_suites):
399
+ """Run CUGA SDK evaluation, yielding live progress to the UI."""
400
+ if not api_key:
401
+ yield "Please provide an API key", ""
402
+ return
403
+
404
+ # Load tasks from selected suites
405
+ if not test_suites:
406
+ test_suites = ["Core (26 tasks)"]
407
+ all_tasks = load_tasks(test_suites)
408
+
409
+ if not all_tasks:
410
+ yield "No tasks loaded. Check that task files exist in the data directory.", ""
411
+ return
412
+
413
+ # Parse task IDs to filter within loaded tasks
414
+ task_ids_str = task_ids.strip()
415
+ if task_ids_str.lower() == "all" or not task_ids_str:
416
+ tasks_to_run = all_tasks
417
+ else:
418
+ try:
419
+ ids = [s.strip() for s in task_ids_str.replace(",", " ").split()]
420
+ tasks_to_run = []
421
+ for task in all_tasks:
422
+ task_name = task.get("name", "")
423
+ task_num = task_name.replace("task_", "") if task_name.startswith("task_") else task_name
424
+ if task_name in ids or task_num in ids:
425
+ tasks_to_run.append(task)
426
+ except Exception as e:
427
+ yield f"Error parsing task IDs: {e}", ""
428
+ return
429
+
430
+ if not tasks_to_run:
431
+ yield "No matching tasks found.", ""
432
+ return
433
+
434
+ total = len(tasks_to_run)
435
+ yield f"**Initializing CUGA agent...** (0/{total} tasks)", ""
436
+
437
+ loop = asyncio.new_event_loop()
438
+ asyncio.set_event_loop(loop)
439
+
440
+ try:
441
+ agent = loop.run_until_complete(_setup_agent(api_key, provider, model))
442
+ logger.info("CUGA agent initialized successfully")
443
+
444
+ langfuse = LangfuseTracker()
445
+ langfuse.start_trace(
446
+ name="bpo_evaluation",
447
+ metadata={
448
+ "provider": provider,
449
+ "model": model or get_default_model(provider),
450
+ "num_tasks": total,
451
+ },
452
+ )
453
+
454
+ # Initialize LLM judge (only for Groq provider currently)
455
+ llm_judge = None
456
+ llm_judge_requested = False
457
+ if provider.lower() == "groq":
458
+ try:
459
+ llm_judge = get_llm_judge(api_key=api_key, provider="groq")
460
+ llm_judge_requested = True
461
+ logger.info("LLM judge initialized")
462
+ except Exception as e:
463
+ logger.warning(f"Failed to initialize LLM judge: {e}")
464
+
465
+ # Run tasks, yielding progress after each one
466
+ results = []
467
+ for i, task in enumerate(tasks_to_run):
468
+ task_name = task.get("name", f"task_{i+1}")
469
+ logger.info(f"Evaluating task {i+1}/{total}: {task_name}")
470
+ yield f"**Running {task_name}...** ({i}/{total} complete)", ""
471
+
472
+ result = loop.run_until_complete(
473
+ _run_single_task(agent, task, i, llm_judge, llm_judge_requested, langfuse)
474
+ )
475
+ results.append(result)
476
+
477
+ # Small delay between tasks
478
+ if len(results) < total:
479
+ loop.run_until_complete(asyncio.sleep(0.5))
480
+
481
+ # Clean up
482
+ agent.close()
483
+
484
+ md = _build_results_markdown(results, langfuse)
485
+ yield md, json.dumps(results, indent=2)
486
+
487
+ except Exception as e:
488
+ logger.exception("Evaluation failed")
489
+ yield f"Evaluation failed: {e}", ""
490
+ finally:
491
+ loop.close()
492
+
493
+
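The task-ID filter in `run_evaluation` accepts comma- or space-separated values and matches either a full task name or its bare number. Its normalization can be checked on its own:

```python
def matches(task_name: str, ids: list[str]) -> bool:
    # "task_27" is matched by either the full name or the bare number "27",
    # mirroring the filter logic in run_evaluation.
    num = task_name.replace("task_", "") if task_name.startswith("task_") else task_name
    return task_name in ids or num in ids

ids = "1, 2 task_27".replace(",", " ").split()  # mixed commas/spaces, as the UI allows
selected = [t for t in ["task_1", "task_2", "task_3", "task_27"] if matches(t, ids)]
print(selected)  # → ['task_1', 'task_2', 'task_27']
```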
494
+ def get_task_list():
495
+ """Get a formatted list of available tasks grouped by suite."""
496
+ if not ALL_TASKS_CACHE:
497
+ return "No tasks loaded"
498
+
499
+ lines = []
500
+ for label in AVAILABLE_SUITES:
501
+ tasks = ALL_TASKS_CACHE.get(label, [])
502
+ if not tasks:
503
+ continue
504
+ lines.append(f"### {label}\n")
505
+ for task in tasks:
506
+ name = task.get("name", "unknown")
507
+ diff = task.get("difficulty", "unknown")
508
+ intent = task.get("intent", "")
509
+ if len(intent) > 60:
510
+ intent = intent[:60] + "..."
511
+ lines.append(f"- **{name}** ({diff}): {intent}")
512
+ lines.append("")
513
+
514
+ return "\n".join(lines)
515
+
516
+
517
+ def update_model_choices(provider: str):
518
+ """Update model dropdown choices based on provider."""
519
+ models = get_provider_models(provider)
520
+ default = get_default_model(provider)
521
+ return gr.update(choices=models, value=default)
522
+
523
+
524
+ def update_api_key_placeholder(provider: str):
525
+ """Update API key placeholder based on provider."""
526
+ placeholder = get_provider_placeholder(provider)
527
+ return gr.update(placeholder=placeholder)
528
+
529
+
530
+ # Gradio Interface
531
+ with gr.Blocks(title="BPO Benchmark") as demo:
532
+ gr.Markdown("# BPO Benchmark Evaluation")
533
+ gr.Markdown(
534
+ "Evaluate **CUGA SDK** on BPO recruiting analytics tasks with 32 tool APIs. "
535
+ "Enter your API key, select tasks, and run the evaluation."
536
+ )
537
+
538
+ with gr.Row():
539
+ with gr.Column(scale=1):
540
+ provider = gr.Dropdown(
541
+ choices=["Groq", "OpenAI"],
542
+ value="Groq",
543
+ label="LLM Provider"
544
+ )
545
+ api_key = gr.Textbox(
546
+ label="API Key",
547
+ type="password",
548
+ placeholder="gsk_... (Groq)"
549
+ )
550
+ model = gr.Dropdown(
551
+ choices=get_provider_models("groq"),
552
+ value=get_default_model("groq"),
553
+ label="Model",
554
+ allow_custom_value=True,
555
+ )
556
+ test_suites = gr.CheckboxGroup(
557
+ choices=AVAILABLE_SUITES,
558
+ value=["Core (26 tasks)"],
559
+ label="Test Suites",
560
+ info=f"{total_available} tasks across {len(AVAILABLE_SUITES)} suites",
561
+ )
562
+ task_ids = gr.Textbox(
563
+ label="Task IDs (optional filter)",
564
+ placeholder="1 2 3 or task_27 task_28 (leave empty for all in selected suites)",
565
+ info="Filter within selected suites by ID"
566
+ )
567
+ run_btn = gr.Button("Run Evaluation", variant="primary", size="lg")
568
+
569
+ with gr.Accordion("Available Tasks", open=False):
570
+ gr.Markdown(get_task_list())
571
+
572
+ with gr.Accordion("Environment Info", open=False):
573
+ langfuse_status = "Configured" if is_langfuse_configured() else "Not configured"
574
+ public_key_set = "Yes" if os.environ.get("LANGFUSE_PUBLIC_KEY") else "No"
575
+ secret_key_set = "Yes" if os.environ.get("LANGFUSE_SECRET_KEY") else "No"
576
+ langfuse_host = get_langfuse_host()
577
+ gr.Markdown(f"""
578
+ **Langfuse Tracking:** {langfuse_status}
579
+ - LANGFUSE_PUBLIC_KEY set: {public_key_set}
580
+ - LANGFUSE_SECRET_KEY set: {secret_key_set}
581
+ - Host: {langfuse_host}
582
+
583
+ To enable Langfuse tracking in HuggingFace:
584
+ 1. Go to Space Settings > Variables and secrets
585
+ 2. Add **Secrets** (not variables):
586
+ - `LANGFUSE_PUBLIC_KEY`
587
+ - `LANGFUSE_SECRET_KEY`
588
+ - `LANGFUSE_HOST` (e.g., `https://us.cloud.langfuse.com`)
589
+ 3. Restart the Space for changes to take effect
590
+
591
+ *Connection will be tested when you run an evaluation*
592
+ """)
593
+
594
+ with gr.Column(scale=2):
595
+ output = gr.Markdown(label="Results")
596
+ with gr.Accordion("Raw JSON Results", open=False):
597
+ raw_json = gr.Code(label="Raw JSON", language="json")
598
+
599
+ # Event handlers
600
+ provider.change(
601
+ fn=update_model_choices,
602
+ inputs=[provider],
603
+ outputs=[model]
604
+ )
605
+ provider.change(
606
+ fn=update_api_key_placeholder,
607
+ inputs=[provider],
608
+ outputs=[api_key]
609
+ )
610
+
611
+ run_btn.click(
612
+ fn=run_evaluation,
613
+ inputs=[api_key, provider, model, task_ids, test_suites],
614
+ outputs=[output, raw_json]
615
+ )
616
+
617
+ gr.Markdown("""
618
+ ---
619
+ **Agent:** [CUGA SDK](https://pypi.org/project/cuga/)
620
+ | **Dataset:** [ibm-research/bpo-benchmark](https://huggingface.co/datasets/ibm-research/bpo-benchmark)
621
+ """)
622
+
623
+
624
+ if __name__ == "__main__":
625
+ demo.launch(server_name="0.0.0.0", server_port=7860)
data/candidate_data.parquet ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:14f2e8e859259f6ff0c5dd017f630d0df2490d298775647ab5882350fe80c6e5
3
+ size 1802852
data/large_response_fixture.json ADDED
The diff for this file is too large to render. See raw diff
 
data/tasks.json ADDED
@@ -0,0 +1,1291 @@
1
+ [
2
+ {
3
+ "name": "bpo-benchmark",
4
+ "user_info": [],
5
+ "test_cases": [
6
+ {
7
+ "name": "task_1",
8
+ "description": "Lists sources ranked by SLA success rate. | Explanation: CyberSec Jobs was identified as the lowest-performing source because its SLA success rate is 67 %, well below Indeed (86 %), GitHub (90 %), and the remaining sources at 95 %, as returned by the API.",
9
+ "intent": "For requisition 05958BR, which source has the lowest SLA performance?",
10
+ "difficulty": "easy",
11
+ "expected_output": {
12
+ "response": "CyberSec Jobs with 67%",
13
+ "keywords": [
14
+ "CyberSec Jobs",
15
+ "67%|67 %|67"
16
+ ],
17
+ "tool_calls": [
18
+ {
19
+ "name": "candidate_source_sla_per_source",
20
+ "args": {}
21
+ }
22
+ ],
23
+ "tool_call_results": [
24
+ {
25
+ "name": "candidate_source_sla_per_source",
26
+ "result": {
27
+ "metrics": [
28
+ {
29
+ "source_name": "CyberSec Jobs",
30
+ "sla_percentage": 67
31
+ },
32
+ {
33
+ "source_name": "Indeed",
34
+ "sla_percentage": 86
35
+ },
36
+ {
37
+ "source_name": "GitHub",
38
+ "sla_percentage": 90
39
+ },
40
+ {
41
+ "source_name": "Dice",
42
+ "sla_percentage": 95
43
+ },
44
+ {
45
+ "source_name": "Internal",
46
+ "sla_percentage": 95
47
+ },
48
+ {
49
+ "source_name": "LinkedIn",
50
+ "sla_percentage": 95
51
+ },
52
+ {
53
+ "source_name": "Referral",
54
+ "sla_percentage": 95
55
+ }
56
+ ]
57
+ }
58
+ }
59
+ ]
60
+ }
61
+ },
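The `keywords` entries above use `|` to separate acceptable alternatives (e.g. `"67%|67 %|67"`). The actual matcher lives in `check_keywords` in `agent.py` and is not shown in this diff; a plausible case-insensitive reading of the format is:

```python
def keyword_found(response: str, keyword: str) -> bool:
    # A keyword passes if any "|"-separated alternative appears in the response
    # (assumed case-insensitive substring matching, not the real agent.py logic).
    return any(alt.lower() in response.lower() for alt in keyword.split("|"))

response = "CyberSec Jobs has the lowest SLA performance at 67%."
keywords = ["CyberSec Jobs", "67%|67 %|67"]
print(all(keyword_found(response, k) for k in keywords))  # → True
```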
62
+ {
63
+ "name": "task_2",
64
+ "description": "Asks for the missing requisition id. | Explanation: The query lacks a requisition ID which is required for the API call.",
65
+ "intent": "What's the percentage of hires and the total hires per source?",
66
+ "difficulty": "easy",
67
+ "expected_output": {
68
+ "response": "Can you share the requisition ID (e.g., 05958BR or 05959BR)?",
69
+ "keywords": [
70
+ "requisition|req",
71
+ "ID|id|identifier",
72
+ "missing|without|share|provide|required"
73
+ ],
74
+ "tool_calls": []
75
+ }
76
+ },
77
+ {
78
+ "name": "task_3",
79
+ "description": "Shows each source's candidate volume and offer/hire success metrics for jobs similar to 05958BR. | Explanation: Candidate counts and percentages were taken from the candidate-volume API; hire counts and offer-acceptance rates were taken from the recommendation-summary API. The two tables were joined on \"source_name\", producing a combined view of volume and effectiveness for the three leading sources. | Note: Cross-references performance and volume per source. Requires joining APIs on 'source_name'.",
80
+ "intent": "For requisitions like 05958BR, which sources provided the most candidates, and how effective were they at converting to hires?",
81
+ "difficulty": "medium",
82
+ "expected_output": {
83
+ "response": "LinkedIn: 519 candidates (18%), 7 hires. Offer acceptance rate: 70%. Dice: 516 candidates (18%), 11 hires. Offer acceptance rate: 79%. GitHub: 468 candidates (16%), 10 hires. Offer acceptance rate: 77%.",
84
+ "keywords": [
85
+ "LinkedIn",
86
+ "Dice",
87
+ "GitHub",
88
+ "Offer acceptance rate",
89
+ "519",
90
+ "516",
91
+ "468",
92
+ "18%|18 %|18",
93
+ "70%|70 %|70",
94
+ "79%|79 %|79",
95
+ "77%|77 %|77",
96
+ "hires"
97
+ ],
98
+ "tool_calls": [
99
+ {
100
+ "name": "candidate_source_candidate_volume_by_source",
101
+ "args": {}
102
+ },
103
+ {
104
+ "name": "candidate_source_source_recommendation_summary",
105
+ "args": {}
106
+ }
107
+ ],
108
+ "tool_call_results": [
109
+ {
110
+ "name": "candidate_source_candidate_volume_by_source",
111
+ "result": {
112
+ "job_id": "05958BR",
113
+ "total_candidate_volume": 2913,
114
+ "metrics": [
115
+ {
116
+ "source_name": "LinkedIn",
117
+ "candidate_volume": 519,
118
+ "percentage": 18
119
+ },
120
+ {
121
+ "source_name": "Dice",
122
+ "candidate_volume": 516,
123
+ "percentage": 18
124
+ },
125
+ {
126
+ "source_name": "GitHub",
127
+ "candidate_volume": 468,
128
+ "percentage": 16
129
+ },
130
+ {
131
+ "source_name": "Indeed",
132
+ "candidate_volume": 410,
133
+ "percentage": 14
134
+ },
135
+ {
136
+ "source_name": "Internal",
137
+ "candidate_volume": 400,
138
+ "percentage": 14
139
+ },
140
+ {
141
+ "source_name": "Referral",
142
+ "candidate_volume": 400,
143
+ "percentage": 14
144
+ },
145
+ {
146
+ "source_name": "CyberSec Jobs",
147
+ "candidate_volume": 200,
148
+ "percentage": 7
149
+ }
150
+ ],
151
+ "heading": "For requisitions similar to 05958BR, there were 2913 candidates over the past three years. Here's how many candidates came from each source (with percentages from the total number):"
152
+ }
153
+ },
154
+ {
155
+ "name": "candidate_source_source_recommendation_summary",
156
+ "result": {
157
+ "total_requisitions": 40,
158
+ "metrics": [
159
+ {
160
+ "source_name": "CyberSec Jobs",
161
+ "jobs_filled_percentage": 2,
162
+ "first_round_review_percentage": 80,
163
+ "offer_acceptance_rate": 67,
164
+ "total_hires": 3
165
+ },
166
+ {
167
+ "source_name": "Dice",
168
+ "jobs_filled_percentage": 2,
169
+ "first_round_review_percentage": 11,
170
+ "offer_acceptance_rate": 79,
171
+ "total_hires": 11
172
+ },
173
+ {
174
+ "source_name": "GitHub",
175
+ "jobs_filled_percentage": 2,
176
+ "first_round_review_percentage": 76,
177
+ "offer_acceptance_rate": 77,
178
+ "total_hires": 10
179
+ },
180
+ {
181
+ "source_name": "Indeed",
182
+ "jobs_filled_percentage": 0,
183
+ "first_round_review_percentage": 77,
184
+ "offer_acceptance_rate": 0,
185
+ "total_hires": 0
186
+ },
187
+ {
188
+ "source_name": "Internal",
189
+ "jobs_filled_percentage": 2,
190
+ "first_round_review_percentage": 74,
191
+ "offer_acceptance_rate": 70,
192
+ "total_hires": 5
193
+ },
194
+ {
195
+ "source_name": "LinkedIn",
196
+ "jobs_filled_percentage": 2,
197
+ "first_round_review_percentage": 75,
198
+ "offer_acceptance_rate": 70,
199
+ "total_hires": 7
200
+ },
201
+ {
202
+ "source_name": "Referral",
203
+ "jobs_filled_percentage": 2,
204
+ "first_round_review_percentage": 70,
205
+ "offer_acceptance_rate": 62,
206
+ "total_hires": 4
207
+ }
208
+ ]
209
+ }
210
+ }
211
+ ]
212
+ }
213
+ },
214
+ {
215
+ "name": "task_4",
216
+ "description": "Asks for the missing requisition id. | Explanation: The query lacks a requisition ID which is required for the API call.",
217
+ "intent": "Did Dice provide a good funnel conversion rate?",
218
+ "difficulty": "easy",
219
+ "expected_output": {
220
+ "response": "Can you share the requisition ID (e.g., 05958BR or 05959BR)?",
221
+ "keywords": [
222
+ "requisition|req",
223
+ "ID|id|identifier",
224
+ "missing|without|share|provide|required"
225
+ ],
226
+ "tool_calls": []
227
+ }
228
+ },
229
+ {
230
+ "name": "task_5",
231
+ "description": "Asks for the missing requisition id. | Explanation: The query lacks a requisition ID which is required for the API call.",
232
+ "intent": "Should I include the skill Python? What is its impact on SLA, fill rate, and overall relevance?",
233
+ "difficulty": "easy",
234
+ "expected_output": {
235
+ "response": "Can you share the requisition ID (e.g., 05958BR or 05959BR)?",
236
+ "keywords": [
237
+ "requisition|req",
238
+ "ID|id|identifier",
239
+ "missing|without|share|provide|required"
240
+ ],
241
+ "tool_calls": []
242
+ }
243
+ },
244
+ {
245
+ "name": "task_6",
246
+ "description": "Recommends top-performing sources by combining SLA success, candidate volume, and funnel effectiveness. | Explanation: Each source received a weighted score (50 % SLA success, 30 % candidate volume share, 20 % offer-conversion rate). Dice and LinkedIn tied for top SLA (100 %) and high volume; GitHub's best-in-class conversion (2.8 %) offset its 80 % SLA. Indeed scored 0 on SLA and offers, so it was excluded. | Note: This benchmark tests multi-criteria decision-making and cross-API synthesis.",
+ "intent": "What are the best sources to prioritize for 05959BR?",
+ "difficulty": "hard",
+ "expected_output": {
+ "response": "You should prioritize Dice, GitHub, and LinkedIn. Dice and LinkedIn both met SLA 100% of the time and brought in 18% of all candidates. Dice had a strong offer conversion rate (2.7%), and GitHub had the highest conversion (2.8%) despite slightly lower SLA. Indeed should be avoided due to 0% SLA and 0% offer conversion.",
+ "keywords": [
+ "Dice",
+ "GitHub",
+ "LinkedIn",
+ "SLA",
+ "Indeed"
+ ],
+ "tool_calls": [
+ {
+ "name": "candidate_source_sla_per_source",
+ "args": {}
+ },
+ {
+ "name": "candidate_source_candidate_volume_by_source",
+ "args": {}
+ },
+ {
+ "name": "candidate_source_funnel_conversion_by_source",
+ "args": {}
+ }
+ ],
+ "tool_call_results": [
+ {
+ "name": "candidate_source_sla_per_source",
+ "result": {
+ "metrics": [
+ {
+ "source_name": "Indeed",
+ "sla_percentage": 0
+ },
+ {
+ "source_name": "CyberSec Jobs",
+ "sla_percentage": 70
+ },
+ {
+ "source_name": "GitHub",
+ "sla_percentage": 80
+ },
+ {
+ "source_name": "Internal",
+ "sla_percentage": 85
+ },
+ {
+ "source_name": "Dice",
+ "sla_percentage": 100
+ },
+ {
+ "source_name": "LinkedIn",
+ "sla_percentage": 100
+ },
+ {
+ "source_name": "Referral",
+ "sla_percentage": 100
+ }
+ ]
+ }
+ },
+ {
+ "name": "candidate_source_candidate_volume_by_source",
+ "result": {
+ "job_id": "05959BR",
+ "total_candidate_volume": 2913,
+ "metrics": [
+ {
+ "source_name": "Dice",
+ "candidate_volume": 525,
+ "percentage": 18
+ },
+ {
+ "source_name": "LinkedIn",
+ "candidate_volume": 525,
+ "percentage": 18
+ },
+ {
+ "source_name": "GitHub",
+ "candidate_volume": 465,
+ "percentage": 16
+ },
+ {
+ "source_name": "Internal",
+ "candidate_volume": 403,
+ "percentage": 14
+ },
+ {
+ "source_name": "Indeed",
+ "candidate_volume": 400,
+ "percentage": 14
+ },
+ {
+ "source_name": "Referral",
+ "candidate_volume": 400,
+ "percentage": 14
+ },
+ {
+ "source_name": "CyberSec Jobs",
+ "candidate_volume": 195,
+ "percentage": 7
+ }
+ ],
+ "heading": "For requisitions similar to 05959BR, there were 2913 candidates over the past three years. Here's how many candidates came from each source (with percentages from the total number):"
+ }
+ },
+ {
+ "name": "candidate_source_funnel_conversion_by_source",
+ "result": {
+ "job_id": "05959BR",
+ "metrics": [
+ {
+ "source_name": "CyberSec Jobs",
+ "first_round_review_percentage": 80.5,
+ "interview_rate": 18.5,
+ "offer_acceptance_rate": 3.1
+ },
+ {
+ "source_name": "Dice",
+ "first_round_review_percentage": 76.0,
+ "interview_rate": 9.9,
+ "offer_acceptance_rate": 2.7
+ },
+ {
+ "source_name": "GitHub",
+ "first_round_review_percentage": 72.0,
+ "interview_rate": 16.6,
+ "offer_acceptance_rate": 2.8
+ },
+ {
+ "source_name": "Indeed",
+ "first_round_review_percentage": 72.2,
+ "interview_rate": 14.8,
+ "offer_acceptance_rate": 0.0
+ },
+ {
+ "source_name": "Internal",
+ "first_round_review_percentage": 76.9,
+ "interview_rate": 19.6,
+ "offer_acceptance_rate": 2.5
+ },
+ {
+ "source_name": "LinkedIn",
+ "first_round_review_percentage": 70.1,
+ "interview_rate": 21.0,
+ "offer_acceptance_rate": 1.9
+ },
+ {
+ "source_name": "Referral",
+ "first_round_review_percentage": 74.5,
+ "interview_rate": 20.5,
+ "offer_acceptance_rate": 2.0
+ }
+ ]
+ }
+ }
+ ]
+ }
+ },
+ {
+ "name": "task_7",
+ "description": "Asks for the missing requisition id. | Explanation: The query lacks a requisition ID which is required for the API call.",
+ "intent": "Out of these skills — Python, Quantum Physics, Cyber Engineering, Risk Analysis, Wireshark — which ones negatively affect SLA performance?",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "Can you share the requisition ID (e.g., 05958BR or 05959BR)?",
+ "keywords": [
+ "requisition|req",
+ "ID|id|identifier",
+ "missing|without|share|provide|required"
+ ],
+ "tool_calls": []
+ }
+ },
+ {
+ "name": "task_8",
+ "description": "Returns the definition of the SLA metric for the given requisition. | Explanation: The definitions-and-methodology endpoint contains a JSON field \"sla\" holding the textual definition; the agent extracted that string verbatim. | Note: Tests the agent's ability to locate and return a specific definition.",
+ "intent": "How is the SLA metric defined for 05958BR?",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "SLA is defined as 'Percentage of candidates reviewed within the defined SLA window (e.g., 48 hours)'.",
+ "keywords": [
+ "SLA",
+ "Percentage",
+ "reviewed",
+ "window"
+ ],
+ "tool_calls": [
+ {
+ "name": "candidate_source_definitions_and_methodology",
+ "args": {}
+ }
+ ],
+ "tool_call_results": [
+ {
+ "name": "candidate_source_definitions_and_methodology",
+ "result": {
+ "job_id": "05958BR",
+ "definitions": {
+ "sla": "Percentage of candidates reviewed within the defined SLA window (e.g., 48 hours)",
+ "time_to_fill": "Average time from job posting to accepted offer",
+ "success_rate": "Ratio of candidates who accepted offers out of those interviewed"
+ },
+ "calculation_notes": "Metrics are computed from 1047 requisitions over the last 1.4 years. Funnel stats are based on system timestamps and recruiter actions in ATS.",
+ "top_metrics_considered": [
+ "SLA %",
+ "First round review %",
+ "Offer acceptance rate",
+ "Candidate volume",
+ "Total hires"
+ ]
+ }
+ }
+ ]
+ }
+ },
+ {
+ "name": "task_9",
+ "description": "Returns the number of requisitions used to compute the reported metrics. | Explanation: The methodology response includes a note like \"Metrics calculated over N = 1047 requisitions\"; the agent parsed the integer 1047 and returned it. | Note: Tests string parsing / information extraction from notes field.",
+ "intent": "How many requisitions were used to compute these metrics for 05958BR?",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "Metrics are computed from 1047 requisitions.",
+ "keywords": [
+ "1047",
+ "requisitions"
+ ],
+ "tool_calls": [
+ {
+ "name": "candidate_source_definitions_and_methodology",
+ "args": {}
+ }
+ ],
+ "tool_call_results": [
+ {
+ "name": "candidate_source_definitions_and_methodology",
+ "result": {
+ "job_id": "05958BR",
+ "definitions": {
+ "sla": "Percentage of candidates reviewed within the defined SLA window (e.g., 48 hours)",
+ "time_to_fill": "Average time from job posting to accepted offer",
+ "success_rate": "Ratio of candidates who accepted offers out of those interviewed"
+ },
+ "calculation_notes": "Metrics are computed from 1047 requisitions over the last 1.4 years. Funnel stats are based on system timestamps and recruiter actions in ATS.",
+ "top_metrics_considered": [
+ "SLA %",
+ "First round review %",
+ "Offer acceptance rate",
+ "Candidate volume",
+ "Total hires"
+ ]
+ }
+ }
+ ]
+ }
+ },
+ {
+ "name": "task_10",
+ "description": "Returns the list of top metrics considered for source evaluation. | Explanation: The agent read the \"top_metrics_considered\" array from the methodology API response and returned the metrics in the same order. | Note: Tests structured list extraction and formatting.",
+ "intent": "What are the top metrics considered when evaluating candidate sources for 05958BR?",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "The top metrics considered are: SLA %, First round review %, Offer acceptance rate, Candidate volume, Total hires.",
+ "keywords": [
+ "SLA",
+ "First round review",
+ "Offer acceptance",
+ "Candidate volume",
+ "Total hires"
+ ],
+ "tool_calls": [
+ {
+ "name": "candidate_source_definitions_and_methodology",
+ "args": {}
+ }
+ ],
+ "tool_call_results": [
+ {
+ "name": "candidate_source_definitions_and_methodology",
+ "result": {
+ "job_id": "05958BR",
+ "definitions": {
+ "sla": "Percentage of candidates reviewed within the defined SLA window (e.g., 48 hours)",
+ "time_to_fill": "Average time from job posting to accepted offer",
+ "success_rate": "Ratio of candidates who accepted offers out of those interviewed"
+ },
+ "calculation_notes": "Metrics are computed from 1047 requisitions over the last 1.4 years. Funnel stats are based on system timestamps and recruiter actions in ATS.",
+ "top_metrics_considered": [
+ "SLA %",
+ "First round review %",
+ "Offer acceptance rate",
+ "Candidate volume",
+ "Total hires"
+ ]
+ }
+ }
+ ]
+ }
+ },
+ {
+ "name": "task_11",
+ "description": "Loops through the provided list of models and reports which ones were used. | Explanation: The agent compared each provided model name against the \"models_involved\" array returned by data-sources-used API and reported matches (used) or non-matches (not used). | Note: Tests loop-based reasoning and partial matching for list membership.",
+ "intent": "Were the following models used to generate metrics for 05958BR: SLA impact regression model, Candidate ranking model, Skill relevance classifier?",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "Yes, 'SLA impact regression model' and 'Skill relevance classifier' were used. 'Candidate ranking model' was not listed among the models involved.",
+ "keywords": [
+ "SLA impact regression model",
+ "Skill relevance classifier",
+ "Candidate ranking model"
+ ],
+ "tool_calls": [
+ {
+ "name": "skills_data_sources_used",
+ "args": {}
+ }
+ ],
+ "tool_call_results": [
+ {
+ "name": "skills_data_sources_used",
+ "result": {
+ "requisition_id": "05958BR",
+ "datasets_used": [
+ "Historical hiring success data",
+ "Requisition skill tagging",
+ "Funnel conversion metrics",
+ "Candidate quality feedback"
+ ],
+ "models_involved": [
+ "Skill relevance classifier",
+ "SLA impact regression model",
+ "Funnel conversion recommender"
+ ]
+ }
+ }
+ ]
+ }
+ },
+ {
+ "name": "task_12",
+ "description": "Loops through the provided list of data sources and reports which ones were used. | Explanation: Each candidate data source was checked against the \"datasets_used\" array from data-sources-used API; two matched and one did not, which the agent reported accordingly. | Note: Tests loop-based reasoning and partial matching for list membership.",
+ "intent": "Were the following data sources used to compute the metrics for 05958BR: Historical hiring success data, Job description embeddings, Funnel conversion metrics?",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "Yes, 'Historical hiring success data' and 'Funnel conversion metrics' were used. 'Job description embeddings' was not listed among the data sources.",
+ "keywords": [
+ "Historical hiring success data",
+ "Funnel conversion metrics",
+ "Job description embeddings"
+ ],
+ "tool_calls": [
+ {
+ "name": "skills_data_sources_used",
+ "args": {}
+ }
+ ],
+ "tool_call_results": [
+ {
+ "name": "skills_data_sources_used",
+ "result": {
+ "requisition_id": "05958BR",
+ "datasets_used": [
+ "Historical hiring success data",
+ "Requisition skill tagging",
+ "Funnel conversion metrics",
+ "Candidate quality feedback"
+ ],
+ "models_involved": [
+ "Skill relevance classifier",
+ "SLA impact regression model",
+ "Funnel conversion recommender"
+ ]
+ }
+ }
+ ]
+ }
+ },
+ {
+ "name": "task_13",
+ "description": "Combines model lookup, retrieves actual SLA delta, and returns SLA definition. | Explanation: The SLA-impact API showed a 0 % delta for Python; data-sources-used API confirmed the 'SLA impact regression model' was involved; the methodology API supplied the formal SLA definition. These three pieces were combined into one coherent answer. | Note: Agent must combine numerical result (delta), model lookup, and formal definition into unified answer.",
+ "intent": "For 05958BR, when evaluating the SLA impact of Python, which models were used, what was the SLA delta, and what is the system definition of SLA?",
+ "difficulty": "hard",
+ "expected_output": {
+ "response": "'SLA impact regression model' was used. The SLA delta for Python was 0.0%. SLA is defined as 'Percentage of candidates reviewed within the defined SLA window (e.g., 48 hours)'.",
+ "keywords": [
+ "SLA impact regression model",
+ "0.0%|0.0 %|0.0|0%|0 %|0",
+ "SLA",
+ "Percentage",
+ "reviewed",
+ "window"
+ ],
+ "tool_calls": [
+ {
+ "name": "skills_skill_impact_sla",
+ "args": {}
+ },
+ {
+ "name": "skills_data_sources_used",
+ "args": {}
+ },
+ {
+ "name": "candidate_source_definitions_and_methodology",
+ "args": {}
+ }
+ ],
+ "tool_call_results": [
+ {
+ "name": "skills_skill_impact_sla",
+ "result": {
+ "requisition_id": "05958BR",
+ "skill_name": "Python",
+ "sla_achievement_with_skill": 90,
+ "sla_achievement_without_skill": 90,
+ "delta": 0
+ }
+ },
+ {
+ "name": "skills_data_sources_used",
+ "result": {
+ "requisition_id": "05958BR",
+ "datasets_used": [
+ "Historical hiring success data",
+ "Requisition skill tagging",
+ "Funnel conversion metrics",
+ "Candidate quality feedback"
+ ],
+ "models_involved": [
+ "Skill relevance classifier",
+ "SLA impact regression model",
+ "Funnel conversion recommender"
+ ]
+ }
+ },
+ {
+ "name": "candidate_source_definitions_and_methodology",
+ "result": {
+ "job_id": "05958BR",
+ "definitions": {
+ "sla": "Percentage of candidates reviewed within the defined SLA window (e.g., 48 hours)",
+ "time_to_fill": "Average time from job posting to accepted offer",
+ "success_rate": "Ratio of candidates who accepted offers out of those interviewed"
+ },
+ "calculation_notes": "Metrics are computed from 1047 requisitions over the last 1.4 years. Funnel stats are based on system timestamps and recruiter actions in ATS.",
+ "top_metrics_considered": [
+ "SLA %",
+ "First round review %",
+ "Offer acceptance rate",
+ "Candidate volume",
+ "Total hires"
+ ]
+ }
+ }
+ ]
+ }
+ },
+ {
+ "name": "task_14",
+ "description": "States that Risk Analysis negatively affects SLA and lists the datasets that informed the analysis. | Explanation: The skill-analysis API flagged Risk Analysis as negatively correlated with SLA. The data-sources-used API listed the four datasets underpinning the evaluation, and both results were consolidated in the response. | Note: Correlation wording corrected to match API ('highly negative impact on SLA').",
+ "intent": "Was 'Risk Analysis' considered historically effective, and what data sources informed this analysis for 05958BR?",
+ "difficulty": "medium",
+ "expected_output": {
+ "response": "'Risk Analysis' is **not** considered effective: historical analysis shows it is correlated with a **highly negative impact on SLA**. The evaluation used these data sources: Historical hiring success data, Requisition skill tagging, Funnel conversion metrics, and Candidate quality feedback.",
+ "keywords": [
+ "Risk Analysis",
+ "not",
+ "effective",
+ "highly negative impact on SLA",
+ "SLA",
+ "Historical hiring success data",
+ "Requisition skill tagging"
+ ],
+ "tool_calls": [
+ {
+ "name": "skills_skill_analysis",
+ "args": {}
+ },
+ {
+ "name": "skills_data_sources_used",
+ "args": {}
+ }
+ ],
+ "tool_call_results": [
+ {
+ "name": "skills_skill_analysis",
+ "result": {
+ "historical_jobs": 40,
+ "input_skills": [],
+ "historical_skills_with_analysis": [
+ {
+ "name": "AWS",
+ "skill_occurrence": 948,
+ "correlation": "slightly positive impact on SLA"
+ },
+ {
+ "name": "IT Support",
+ "skill_occurrence": 868,
+ "correlation": "slightly positive impact on SLA"
+ },
+ {
+ "name": "NIST Cybersecurity Framework",
+ "skill_occurrence": 816,
+ "correlation": "slightly negative impact on SLA"
+ },
+ {
+ "name": "Incident Management",
+ "skill_occurrence": 748,
+ "correlation": "slightly negative impact on SLA"
+ },
+ {
+ "name": "Firewalls",
+ "skill_occurrence": 744,
+ "correlation": "slightly negative impact on SLA"
+ },
+ {
+ "name": "Cloud Security",
+ "skill_occurrence": 592,
+ "correlation": "slightly negative impact on SLA"
+ },
+ {
+ "name": "Risk Assessment",
+ "skill_occurrence": 570,
+ "correlation": "slightly negative impact on SLA"
+ },
+ {
+ "name": "Risk Analysis",
+ "skill_occurrence": 371,
+ "correlation": "highly negative impact on SLA"
+ },
+ {
+ "name": "Microsoft PowerShell",
+ "skill_occurrence": 289,
+ "correlation": "slightly negative impact on SLA"
+ },
+ {
+ "name": "Python",
+ "skill_occurrence": 200,
+ "correlation": "slightly negative impact on SLA"
+ }
+ ]
+ }
+ },
+ {
+ "name": "skills_data_sources_used",
+ "result": {
+ "requisition_id": "05958BR",
+ "datasets_used": [
+ "Historical hiring success data",
+ "Requisition skill tagging",
+ "Funnel conversion metrics",
+ "Candidate quality feedback"
+ ],
+ "models_involved": [
+ "Skill relevance classifier",
+ "SLA impact regression model",
+ "Funnel conversion recommender"
+ ]
+ }
+ }
+ ]
+ }
+ },
+ {
+ "name": "task_15",
+ "description": "Explains that no provided skills hurt SLA, no provided sources require avoidance, and lists datasets used. | Explanation: The agent looped over each skill, querying SLA-impact and skill-analysis APIs; none yielded a negative delta. It then looped over each source, combining SLA-per-source and funnel-conversion data; all three met both SLA and offer benchmarks, so none were flagged. Finally, the datasets_used array was fetched once and reported. | Note: Fully aligned with current API payloads—no unsupported claims.",
+ "intent": "For requisition 05958BR, which of the following skills — Python, Quantum Physics, Wireshark — negatively impact SLA, and which of the following sources — Dice, GitHub, Referral — should be avoided based on low offer conversion or SLA performance? Also, which data sources were used to compute these insights?",
+ "difficulty": "hard",
+ "expected_output": {
+ "response": "None of the listed skills show a negative SLA impact in the available data: Python's SLA delta is 0 %, and Quantum Physics and Wireshark are not present in the analysis. Likewise, Dice (SLA 95 %, offer-acceptance 2.7 %), GitHub (SLA 90 %, offer-acceptance 2.8 %), and Referral (SLA 95 %, offer-acceptance 2.0 %) all perform satisfactorily, so none of them need to be avoided. These insights were produced using the following datasets: Historical hiring success data, Requisition skill tagging, Funnel conversion metrics, and Candidate quality feedback.",
+ "keywords": [
+ "Python",
+ "Quantum Physics",
+ "Wireshark",
+ "Dice",
+ "GitHub",
+ "Referral",
+ "SLA",
+ "0%|0 %|0",
+ "95%|95 %|95",
+ "2.7%|2.7 %|2.7",
+ "90%|90 %|90",
+ "2.8%|2.8 %|2.8",
+ "2.0%|2.0 %|2.0",
+ "Historical hiring success data",
+ "Requisition skill tagging",
+ "Funnel conversion metrics",
+ "Candidate quality feedback"
+ ],
+ "tool_calls": [
+ {
+ "name": "skills_skill_impact_sla",
+ "args": {}
+ },
+ {
+ "name": "skills_skill_analysis",
+ "args": {}
+ },
+ {
+ "name": "skills_data_sources_used",
+ "args": {}
+ },
+ {
+ "name": "candidate_source_funnel_conversion_by_source",
+ "args": {}
+ },
+ {
+ "name": "candidate_source_sla_per_source",
+ "args": {}
+ }
+ ],
+ "tool_call_results": [
+ {
+ "name": "skills_skill_impact_sla",
+ "result": {
+ "requisition_id": "05958BR",
+ "skill_name": "Python",
+ "sla_achievement_with_skill": 90,
+ "sla_achievement_without_skill": 90,
+ "delta": 0
+ }
+ },
+ {
+ "name": "skills_skill_analysis",
+ "result": {
+ "historical_jobs": 40,
+ "input_skills": [],
+ "historical_skills_with_analysis": [
+ {
+ "name": "AWS",
+ "skill_occurrence": 948,
+ "correlation": "slightly positive impact on SLA"
+ },
+ {
+ "name": "IT Support",
+ "skill_occurrence": 868,
+ "correlation": "slightly positive impact on SLA"
+ },
+ {
+ "name": "NIST Cybersecurity Framework",
+ "skill_occurrence": 816,
+ "correlation": "slightly negative impact on SLA"
+ },
+ {
+ "name": "Incident Management",
+ "skill_occurrence": 748,
+ "correlation": "slightly negative impact on SLA"
+ },
+ {
+ "name": "Firewalls",
+ "skill_occurrence": 744,
+ "correlation": "slightly negative impact on SLA"
+ },
+ {
+ "name": "Cloud Security",
+ "skill_occurrence": 592,
+ "correlation": "slightly negative impact on SLA"
+ },
+ {
+ "name": "Risk Assessment",
+ "skill_occurrence": 570,
+ "correlation": "slightly negative impact on SLA"
+ },
+ {
+ "name": "Risk Analysis",
+ "skill_occurrence": 371,
+ "correlation": "highly negative impact on SLA"
+ },
+ {
+ "name": "Microsoft PowerShell",
+ "skill_occurrence": 289,
+ "correlation": "slightly negative impact on SLA"
+ },
+ {
+ "name": "Python",
+ "skill_occurrence": 200,
+ "correlation": "slightly negative impact on SLA"
+ }
+ ]
+ }
+ },
+ {
+ "name": "skills_data_sources_used",
+ "result": {
+ "requisition_id": "05958BR",
+ "datasets_used": [
+ "Historical hiring success data",
+ "Requisition skill tagging",
+ "Funnel conversion metrics",
+ "Candidate quality feedback"
+ ],
+ "models_involved": [
+ "Skill relevance classifier",
+ "SLA impact regression model",
+ "Funnel conversion recommender"
+ ]
+ }
+ },
+ {
+ "name": "candidate_source_funnel_conversion_by_source",
+ "result": {
+ "job_id": "05958BR",
+ "metrics": [
+ {
+ "source_name": "CyberSec Jobs",
+ "first_round_review_percentage": 80.5,
+ "interview_rate": 19.0,
+ "offer_acceptance_rate": 3.0
+ },
+ {
+ "source_name": "Dice",
+ "first_round_review_percentage": 11.0,
+ "interview_rate": 6.8,
+ "offer_acceptance_rate": 2.7
+ },
+ {
+ "source_name": "GitHub",
+ "first_round_review_percentage": 76.1,
+ "interview_rate": 23.7,
+ "offer_acceptance_rate": 2.8
+ },
+ {
+ "source_name": "Indeed",
+ "first_round_review_percentage": 77.1,
+ "interview_rate": 22.0,
+ "offer_acceptance_rate": 0.0
+ },
+ {
+ "source_name": "Internal",
+ "first_round_review_percentage": 74.0,
+ "interview_rate": 18.5,
+ "offer_acceptance_rate": 2.5
+ },
+ {
+ "source_name": "LinkedIn",
+ "first_round_review_percentage": 75.1,
+ "interview_rate": 20.4,
+ "offer_acceptance_rate": 1.9
+ },
+ {
+ "source_name": "Referral",
+ "first_round_review_percentage": 70.0,
+ "interview_rate": 17.0,
+ "offer_acceptance_rate": 2.0
+ }
+ ]
+ }
+ },
+ {
+ "name": "candidate_source_sla_per_source",
+ "result": {
+ "metrics": [
+ {
+ "source_name": "CyberSec Jobs",
+ "sla_percentage": 67
+ },
+ {
+ "source_name": "Indeed",
+ "sla_percentage": 86
+ },
+ {
+ "source_name": "GitHub",
+ "sla_percentage": 90
+ },
+ {
+ "source_name": "Dice",
+ "sla_percentage": 95
+ },
+ {
+ "source_name": "Internal",
+ "sla_percentage": 95
+ },
+ {
+ "source_name": "LinkedIn",
+ "sla_percentage": 95
+ },
+ {
+ "source_name": "Referral",
+ "sla_percentage": 95
+ }
+ ]
+ }
+ }
+ ]
+ }
+ },
+ {
+ "name": "task_16",
+ "description": "Apologises for missing capability; optionally provides an illustrative optimisation if the feature existed. | Explanation: No endpoint returns or accepts free-text job descriptions, so optimisation is unsupported. The hypothetical section shows what the answer could look like if such an endpoint were added. | Note: Demonstrates graceful degradation plus a forward-looking example.",
+ "intent": "Help me optimize the job description for 05959BR.",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "Sorry — the available APIs do not expose the current job-description text, so I cannot directly optimise it.",
+ "keywords": [
+ "APIs|API",
+ "job-description|job description",
+ "cannot|can't"
+ ],
+ "tool_calls": []
+ }
+ },
+ },
1047
+ {
1048
+ "name": "task_17",
1049
+ "description": "Prompts the user for the missing job ID instead of guessing. | Explanation: Illustrates a clarification turn when a required parameter (requisition_id) is missing. | Note: Tests conversational error-handling with zero API usage.",
1050
+ "intent": "Which sourcing channel is the most effective for this job?",
1051
+ "difficulty": "easy",
1052
+ "expected_output": {
1053
+ "response": "Can you share the requisition ID (e.g., 05958BR or 05959BR)?",
1054
+ "keywords": [
1055
+ "requisition|req",
1056
+ "ID|id|identifier",
1057
+ "missing|without|share|provide|required"
1058
+ ],
1059
+ "tool_calls": []
1060
+ }
1061
+ },
1062
+ {
+ "name": "task_18",
+ "description": "Explains unsupported request; shows what a successful answer might include. | Explanation: No endpoints expose hiring-manager data; a possible future answer is sketched for context. | Note: Covers gap #3: hiring-manager analytics.",
+ "intent": "Who's the hiring manager for 05959BR and how responsive is she?",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "Sorry — current APIs don't return hiring-manager names or responsiveness metrics, so I can't answer that.",
+ "keywords": [
+ "APIs",
+ "hiring-manager",
+ "responsive",
+ "can't|cannot"
+ ],
+ "tool_calls": []
+ }
+ },
+ {
+ "name": "task_19",
+ "description": "States limitation; optional mock table shows desired granularity. | Explanation: The existing funnel-conversion API only returns percentages by source, not absolute counts or durations per stage. | Note: Covers gap #4: full funnel metrics.",
+ "intent": "Show me the average candidate counts and days spent in each funnel stage for roles like 05959BR.",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "I'm sorry, but none of the available APIs provide stage-by-stage candidate counts or time-in-status metrics, so I can't generate a funnel table.",
+ "keywords": [
+ "APIs|API",
+ "stage",
+ "candidate counts",
+ "time-in-status",
+ "funnel"
+ ],
+ "tool_calls": []
+ }
+ },
+ {
+ "name": "task_20",
+ "description": "Returns start/end dates, last update date, and requisition count. | Explanation: Pulled time_frame_start, time_frame_end, data_last_updated and total_requisitions_analysed from the metadata-and-timeframe endpoint. | Note: Demonstrates a fully supported recency / frequency query (gap #5).",
+ "intent": "What's the data timeframe for 05958BR and how many similar requisitions were analysed?",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "The metrics cover 9 Oct 2023 – 15 Mar 2025 (≈ 18 months) and were last updated on 29 Apr 2025. A total of 40 similar requisitions were analysed.",
+ "keywords": [
+ "9 Oct 2023|Oct 9, 2023",
+ "15 Mar 2025|Mar 15, 2025",
+ "29 Apr 2025|Apr 29, 2025",
+ "40",
+ "requisitions"
+ ],
+ "tool_calls": [
+ {
+ "name": "candidate_source_metadata_and_timeframe",
+ "args": {}
+ }
+ ],
+ "tool_call_results": [
+ {
+ "name": "candidate_source_metadata_and_timeframe",
+ "result": {
+ "job_id": "05958BR",
+ "time_frame_start": "2023-10-09",
+ "time_frame_end": "2025-03-15",
+ "data_last_updated": "2025-04-29",
+ "total_requisitions_analysed": 40
+ }
+ }
+ ]
+ }
+ },
+ {
+ "name": "task_21",
+ "description": "Gracefully declines unsupported analysis; shows sample comparative table. | Explanation: Only the definition endpoint describes *what* time-to-fill means; no endpoint provides per-source values to compute deltas. | Note: Covers gap #6.",
+ "intent": "How does average time-to-fill compare to SLA for each sourcing channel?",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "I'm afraid I can't provide that comparison because the current APIs don't expose time-to-fill broken down by source.",
+ "keywords": [
+ "APIs",
+ "time-to-fill",
+ "source"
+ ],
+ "tool_calls": []
+ }
+ },
+ {
+ "name": "task_22",
+ "description": "Explains unsupported geographic/channel filter and gives mock estimation. | Explanation: No endpoints accept country/channel parameters; the example illustrates desired behaviour. | Note: Covers gap #7: region & channel-specific analytics.",
+ "intent": "If we posted this role internally in France only, how long would it take to fill?",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "Current APIs don't allow filtering by geography or posting channel, so I'm unable to estimate a France-only internal posting timeline.",
+ "keywords": [
+ "Sorry|no|unable|can't|cannot",
+ "APIs|API|filtering|filter",
+ "France|geography|geographic|region|channel",
+ "internal|posting"
+ ],
+ "tool_calls": []
+ }
+ },
+ {
+ "name": "task_23",
+ "description": "States limitation and shows example list of near-deadline requisitions. | Explanation: No endpoint surfaces open requisitions with SLA dates. The hypothetical section demonstrates the desired list format. | Note: Covers gap #8 and introduces a potential future endpoint.",
+ "intent": "List all requisitions that are within 30 days of their SLA deadline.",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "Sorry — the API suite doesn't provide live requisition status or SLA countdowns, so I can't generate that list.",
+ "keywords": [
+ "API|APIs",
+ "SLA",
+ "requisition|req",
+ "status",
+ "countdown|countdowns",
+ "deadline|list"
+ ],
+ "tool_calls": []
+ }
+ },
+ {
+ "name": "task_24",
+ "description": "Graceful 'ID not found' message with suggested alternatives. | Explanation: Because 05960BR does not exist, the assistant returns a polite error plus four close-match IDs (simulating fuzzy search in the ATS). No API call is made for a bad ID. | Note: Error-handling scenario for invalid requisition IDs.",
+ "intent": "Show candidate funnel for job id 05960BR",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "No job can be found with the ID 05960BR.\nDid you want to use one of the following job IDs instead?\n• UZLXBR — Sourcing Manager\n• F50HBR — Offering Manager\n• MJZ1BR — Offering Manager\n• 5TTKBR — Delivery Analyst",
+ "keywords": [
+ "05960BR",
+ "No job",
+ "can be found|not found"
+ ],
+ "tool_calls": []
+ }
+ },
+ {
+ "name": "task_25",
+ "description": "Explains unsupported request and sketches desired output. | Explanation: There is no /job-details/ endpoint. The hypothetical section illustrates what the response would look like if such an endpoint became available. | Note: Completes coverage for 'full requisition card' requests.",
+ "intent": "Show me the details of UZLXBR",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "Sorry — none of the current APIs provide full job-card details (title, location, hiring-manager email, etc.), so I can't display that information.",
+ "keywords": [
+ "APIs",
+ "job-card",
+ "details"
+ ],
+ "tool_calls": []
+ }
+ },
+ {
1209
+ "name": "task_26",
1210
+ "description": "Returns average candidate count for comparable requisitions. | Explanation: candidate-volume-by-source returns `total_candidate_volume = 2913`; metadata-and-timeframe shows `total_requisitions_analysed = 40`. Dividing 2913 ÷ 40 ≈ 73 yields the average. | Note: Covers the repeated average candidate volume questions.",
1211
+ "intent": "How many candidates do we usually get for postings similar to 05959BR?",
1212
+ "difficulty": "medium",
1213
+ "expected_output": {
1214
+ "response": "On average, similar postings attract **73 candidates**.",
1215
+ "keywords": [
1216
+ "73",
1217
+ "candidates",
1218
+ "average"
1219
+ ],
1220
+ "tool_calls": [
1221
+ {
1222
+ "name": "candidate_source_candidate_volume_by_source",
1223
+ "args": {}
1224
+ },
1225
+ {
1226
+ "name": "candidate_source_metadata_and_timeframe",
1227
+ "args": {}
1228
+ }
1229
+ ],
1230
+ "tool_call_results": [
1231
+ {
1232
+ "name": "candidate_source_candidate_volume_by_source",
1233
+ "result": {
1234
+ "job_id": "05959BR",
1235
+ "total_candidate_volume": 2913,
1236
+ "metrics": [
1237
+ {
1238
+ "source_name": "Dice",
1239
+ "candidate_volume": 525,
1240
+ "percentage": 18
1241
+ },
1242
+ {
1243
+ "source_name": "LinkedIn",
1244
+ "candidate_volume": 525,
1245
+ "percentage": 18
1246
+ },
1247
+ {
1248
+ "source_name": "GitHub",
1249
+ "candidate_volume": 465,
1250
+ "percentage": 16
1251
+ },
1252
+ {
1253
+ "source_name": "Internal",
1254
+ "candidate_volume": 403,
1255
+ "percentage": 14
1256
+ },
1257
+ {
1258
+ "source_name": "Indeed",
1259
+ "candidate_volume": 400,
1260
+ "percentage": 14
1261
+ },
1262
+ {
1263
+ "source_name": "Referral",
1264
+ "candidate_volume": 400,
1265
+ "percentage": 14
1266
+ },
1267
+ {
1268
+ "source_name": "CyberSec Jobs",
1269
+ "candidate_volume": 195,
1270
+ "percentage": 7
1271
+ }
1272
+ ],
1273
+ "heading": "For requisitions similar to 05959BR, there were 2913 candidates over the past three years. Here's how many candidates came from each source (with percentages from the total number):"
1274
+ }
1275
+ },
1276
+ {
1277
+ "name": "candidate_source_metadata_and_timeframe",
1278
+ "result": {
1279
+ "job_id": "05959BR",
1280
+ "time_frame_start": "2023-10-09",
1281
+ "time_frame_end": "2025-03-15",
1282
+ "data_last_updated": "2025-04-29",
1283
+ "total_requisitions_analysed": 40
1284
+ }
1285
+ }
1286
+ ]
1287
+ }
1288
+ }
1289
+ ]
1290
+ }
1291
+ ]
data/tasks_edge_cases.json ADDED
The diff for this file is too large to render. See raw diff
 
data/tasks_http_errors.json ADDED
@@ -0,0 +1,237 @@
1
+ [
2
+ {
3
+ "name": "bpo-benchmark",
4
+ "user_info": [],
5
+ "test_cases": [
6
+ {
7
+ "name": "task_30",
8
+ "description": "Tests agent handling of probabilistic 404 errors (20% chance, seeded RNG). The candidate_source_candidate_pipeline_status tool randomly returns 404 errors. Agent should retry or report the error gracefully.",
9
+ "intent": "What's the candidate pipeline status for 05958BR?",
10
+ "difficulty": "hard",
11
+ "expected_output": {
12
+ "response": "The candidate pipeline for 05958BR shows the following distribution by source: LinkedIn, Dice, Indeed, GitHub, Referral, CyberSec Jobs, and TechCareers all have active candidates in the pipeline.",
13
+ "keywords": [
14
+ "pipeline|status|candidates",
15
+ "05958BR"
16
+ ],
17
+ "tool_calls": [
18
+ {
19
+ "name": "candidate_source_candidate_pipeline_status",
20
+ "args": {
21
+ "requisition_id": "05958BR"
22
+ }
23
+ }
24
+ ],
25
+ "error_handling": {
26
+ "error_type": "http_404_probabilistic",
27
+ "should_report_error": false,
28
+ "should_retry": true,
29
+ "expected_behavior": "Agent should retry on 404 or report temporary unavailability"
30
+ },
31
+ "tool_call_results": [
32
+ {
33
+ "name": "candidate_source_candidate_pipeline_status",
34
+ "result": {
35
+ "requisition_id": "05958BR",
36
+ "pipeline": {
37
+ "CyberSec Jobs": 200,
38
+ "Dice": 516,
39
+ "GitHub": 468,
40
+ "Indeed": 410,
41
+ "Internal": 400,
42
+ "LinkedIn": 519,
43
+ "Referral": 400
44
+ },
45
+ "total_candidates": 2913
46
+ }
47
+ }
48
+ ]
49
+ }
50
+ },
51
+ {
52
+ "name": "task_31",
53
+ "description": "Tests agent handling of HTTP 500 error that includes valid data in the response body. The candidate_source_source_sla_check tool returns error status 500 but includes correct SLA metrics in the body.",
54
+ "intent": "Run a quick SLA check across all sources for 05958BR",
55
+ "difficulty": "hard",
56
+ "expected_output": {
57
+ "response": "SLA check results for 05958BR across sources: CyberSec Jobs 67%, Indeed 72%, TechCareers 74%, GitHub 78%, LinkedIn 79%, Dice 80%, Referral 82%. Note: the data was retrieved despite a server error status.",
58
+ "keywords": [
59
+ "SLA",
60
+ "CyberSec Jobs|Indeed|Dice",
61
+ "67|72|74|78|79|80|82"
62
+ ],
63
+ "tool_calls": [
64
+ {
65
+ "name": "candidate_source_source_sla_check",
66
+ "args": {
67
+ "requisition_id": "05958BR"
68
+ }
69
+ }
70
+ ],
71
+ "error_handling": {
72
+ "error_type": "http_500_with_body",
73
+ "should_report_error": true,
74
+ "should_retry": false,
75
+ "expected_behavior": "Agent should use response body despite 500 status and optionally note the error"
76
+ },
77
+ "tool_call_results": [
78
+ {
79
+ "name": "candidate_source_source_sla_check",
80
+ "result": {
81
+ "status_code": 500,
82
+ "error": true,
83
+ "message": "Internal server error",
84
+ "body": {
85
+ "metrics": [
86
+ {
87
+ "source_name": "CyberSec Jobs",
88
+ "sla_percentage": 67
89
+ },
90
+ {
91
+ "source_name": "Dice",
92
+ "sla_percentage": 95
93
+ },
94
+ {
95
+ "source_name": "GitHub",
96
+ "sla_percentage": 90
97
+ },
98
+ {
99
+ "source_name": "Indeed",
100
+ "sla_percentage": 86
101
+ },
102
+ {
103
+ "source_name": "Internal",
104
+ "sla_percentage": 95
105
+ },
106
+ {
107
+ "source_name": "LinkedIn",
108
+ "sla_percentage": 95
109
+ },
110
+ {
111
+ "source_name": "Referral",
112
+ "sla_percentage": 95
113
+ }
114
+ ]
115
+ }
116
+ }
117
+ }
118
+ ]
119
+ }
120
+ },
121
+ {
122
+ "name": "task_32",
123
+ "description": "Tests agent handling of HTTP 503 Service Unavailable. The candidate_source_funnel_status tool always returns 503 with maintenance info.",
124
+ "intent": "What's the current funnel status for 05958BR?",
125
+ "difficulty": "medium",
126
+ "expected_output": {
127
+ "response": "The funnel status service is currently unavailable due to maintenance. Please try again later.",
128
+ "keywords": [
129
+ "unavailable|maintenance|temporarily",
130
+ "funnel|service",
131
+ "later|retry"
132
+ ],
133
+ "tool_calls": [
134
+ {
135
+ "name": "candidate_source_funnel_status",
136
+ "args": {
137
+ "requisition_id": "05958BR"
138
+ }
139
+ }
140
+ ],
141
+ "error_handling": {
142
+ "error_type": "http_503",
143
+ "should_report_error": true,
144
+ "should_retry": false,
145
+ "expected_behavior": "Agent should report service unavailable with retry info"
146
+ },
147
+ "tool_call_results": [
148
+ {
149
+ "name": "candidate_source_funnel_status",
150
+ "result": {
151
+ "status_code": 503,
152
+ "error": true,
153
+ "message": "Service temporarily unavailable. The funnel analytics engine is undergoing maintenance.",
154
+ "retry_after_seconds": 300,
155
+ "expected_recovery": "2025-05-01T12:00:00Z"
156
+ }
157
+ }
158
+ ]
159
+ }
160
+ },
161
+ {
162
+ "name": "task_33",
163
+ "description": "Tests agent handling of HTTP 429 rate limiting. The candidate_source_bulk_source_data tool returns 429 after the 3rd call in a session.",
164
+ "intent": "Pull bulk source data for all requisitions starting with 05958BR",
165
+ "difficulty": "hard",
166
+ "expected_output": {
167
+ "response": "Bulk source data for 05958BR shows candidate and hire counts across all sourcing channels including LinkedIn, Dice, Indeed, GitHub, Referral, CyberSec Jobs, and TechCareers.",
168
+ "keywords": [
169
+ "source|sources",
170
+ "candidates|data",
171
+ "05958BR"
172
+ ],
173
+ "tool_calls": [
174
+ {
175
+ "name": "candidate_source_bulk_source_data",
176
+ "args": {
177
+ "requisition_id": "05958BR"
178
+ }
179
+ }
180
+ ],
181
+ "error_handling": {
182
+ "error_type": "http_429",
183
+ "should_report_error": false,
184
+ "should_retry": false,
185
+ "expected_behavior": "Agent should respect rate limits and use available data"
186
+ },
187
+ "tool_call_results": [
188
+ {
189
+ "name": "candidate_source_bulk_source_data",
190
+ "result": {
191
+ "requisition_id": "05958BR",
192
+ "sources": {
193
+ "CyberSec Jobs": {
194
+ "total_candidates": 200,
195
+ "total_hires": 3,
196
+ "reviewed": 161
197
+ },
198
+ "Dice": {
199
+ "total_candidates": 516,
200
+ "total_hires": 11,
201
+ "reviewed": 57
202
+ },
203
+ "GitHub": {
204
+ "total_candidates": 468,
205
+ "total_hires": 10,
206
+ "reviewed": 356
207
+ },
208
+ "Indeed": {
209
+ "total_candidates": 410,
210
+ "total_hires": 0,
211
+ "reviewed": 316
212
+ },
213
+ "Internal": {
214
+ "total_candidates": 400,
215
+ "total_hires": 5,
216
+ "reviewed": 296
217
+ },
218
+ "LinkedIn": {
219
+ "total_candidates": 519,
220
+ "total_hires": 7,
221
+ "reviewed": 390
222
+ },
223
+ "Referral": {
224
+ "total_candidates": 400,
225
+ "total_hires": 4,
226
+ "reviewed": 280
227
+ }
228
+ },
229
+ "call_number": 1
230
+ }
231
+ }
232
+ ]
233
+ }
234
+ }
235
+ ]
236
+ }
237
+ ]
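The error-handling expectations encoded above (retry a transient 404, use the body of a 500 that still carries data, surface 503 maintenance info rather than retrying) can be sketched as a small wrapper. The function and stub names here are hypothetical, not part of the benchmark harness:

```python
def call_with_retry(tool, args, max_retries=3):
    """Sketch of the expected_behavior fields for tasks 30-32."""
    for _ in range(max_retries):
        result = tool(args)
        if isinstance(result, dict) and result.get("error"):
            status = result.get("status_code")
            if status == 404:          # probabilistic 404: retry (task_30)
                continue
            if status == 500 and "body" in result:
                return result["body"]  # usable data despite the status (task_31)
            if status == 503:          # maintenance: report, don't retry (task_32)
                return {"unavailable": True,
                        "retry_after": result.get("retry_after_seconds")}
        return result
    return {"unavailable": True}

# Deterministic stand-in: one 404, then a successful payload.
calls = []
def flaky_tool(args):
    calls.append(args)
    if len(calls) == 1:
        return {"status_code": 404, "error": True}
    return {"requisition_id": args["requisition_id"], "total_candidates": 2913}

out = call_with_retry(flaky_tool, {"requisition_id": "05958BR"})
print(out["total_candidates"])  # 2913, after one retry
```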
data/tasks_schema_violations.json ADDED
@@ -0,0 +1,265 @@
1
+ [
2
+ {
3
+ "name": "bpo-benchmark",
4
+ "user_info": [],
5
+ "test_cases": [
6
+ {
7
+ "name": "task_34",
8
+ "description": "Tests agent handling of untyped/unschema'd response. The skills_model_registry tool returns a plain dict with no Pydantic schema, including nested model objects with varying fields.",
9
+ "intent": "What ML models are registered for 05958BR?",
10
+ "difficulty": "medium",
11
+ "expected_output": {
12
+ "response": "The following ML models are registered for 05958BR: Skill relevance classifier (v2.1.0, active), SLA impact regression model (v1.4.2, active), and Funnel conversion recommender (v3.0.0-beta, staging).",
13
+ "keywords": [
14
+ "Skill relevance classifier",
15
+ "SLA impact regression model",
16
+ "Funnel conversion recommender",
17
+ "active|staging"
18
+ ],
19
+ "tool_calls": [
20
+ {
21
+ "name": "skills_model_registry",
22
+ "args": {
23
+ "requisition_id": "05958BR"
24
+ }
25
+ }
26
+ ],
27
+ "error_handling": {
28
+ "error_type": "missing_output_schema",
29
+ "should_report_error": false,
30
+ "should_retry": false,
31
+ "expected_behavior": "Agent should infer structure from the untyped response and present model info"
32
+ },
33
+ "tool_call_results": [
34
+ {
35
+ "name": "skills_model_registry",
36
+ "result": {
37
+ "requisition_id": "05958BR",
38
+ "models": [
39
+ {
40
+ "name": "Skill relevance classifier",
41
+ "version": "2.1.0",
42
+ "status": "active",
43
+ "last_trained": "2024-11-15",
44
+ "accuracy": 0.87
45
+ },
46
+ {
47
+ "name": "SLA impact regression model",
48
+ "version": "1.4.2",
49
+ "status": "active",
50
+ "last_trained": "2024-10-01",
51
+ "r_squared": 0.72
52
+ },
53
+ {
54
+ "name": "Funnel conversion recommender",
55
+ "version": "3.0.0-beta",
56
+ "status": "staging",
57
+ "last_trained": "2025-01-20",
58
+ "precision": 0.81
59
+ }
60
+ ],
61
+ "registry_updated": "2025-04-29"
62
+ }
63
+ }
64
+ ]
65
+ }
66
+ },
67
+ {
68
+ "name": "task_35",
69
+ "description": "Tests agent handling of undocumented/extra input parameters. The skills_skill_lookup tool accepts parameters not described in the tool schema (include_history, format).",
70
+ "intent": "Look up the skill Python for requisition 05958BR",
71
+ "difficulty": "medium",
72
+ "expected_output": {
73
+ "response": "Python for requisition 05958BR has an occurrence count across similar candidates, showing its prevalence in the candidate pool.",
74
+ "keywords": [
75
+ "Python",
76
+ "05958BR",
77
+ "occurrence|count|rate"
78
+ ],
79
+ "tool_calls": [
80
+ {
81
+ "name": "skills_skill_lookup",
82
+ "args": {
83
+ "requisition_id": "05958BR",
84
+ "skill_name": "Python"
85
+ }
86
+ }
87
+ ],
88
+ "error_handling": {
89
+ "error_type": "missing_input_schema",
90
+ "should_report_error": false,
91
+ "should_retry": false,
92
+ "expected_behavior": "Agent should infer required parameters and call the tool correctly"
93
+ },
94
+ "tool_call_results": [
95
+ {
96
+ "name": "skills_skill_lookup",
97
+ "result": {
98
+ "requisition_id": "05958BR",
99
+ "skill_name": "Python",
100
+ "occurrence_count": 200,
101
+ "total_candidates": 2913,
102
+ "occurrence_rate": 6.9
103
+ }
104
+ }
105
+ ]
106
+ }
107
+ },
108
+ {
109
+ "name": "task_36",
110
+ "description": "Tests agent handling of response with missing required fields. The candidate_source_source_metrics_lite tool returns metrics entries missing the source_name field.",
111
+ "intent": "Get a lightweight summary of source metrics for 05958BR",
112
+ "difficulty": "hard",
113
+ "expected_output": {
114
+ "response": "Source metrics for 05958BR show candidate counts and hire counts per source. Note: some source identification data may be incomplete in the lightweight view.",
115
+ "keywords": [
116
+ "metrics|source",
117
+ "candidate|hire",
118
+ "05958BR"
119
+ ],
120
+ "tool_calls": [
121
+ {
122
+ "name": "candidate_source_source_metrics_lite",
123
+ "args": {
124
+ "requisition_id": "05958BR"
125
+ }
126
+ }
127
+ ],
128
+ "error_handling": {
129
+ "error_type": "missing_fields",
130
+ "should_report_error": true,
131
+ "should_retry": false,
132
+ "expected_behavior": "Agent should handle partial data and note missing source names"
133
+ },
134
+ "tool_call_results": [
135
+ {
136
+ "name": "candidate_source_source_metrics_lite",
137
+ "result": {
138
+ "requisition_id": "05958BR",
139
+ "metrics": [
140
+ {
141
+ "candidate_count": 200,
142
+ "hire_count": 3,
143
+ "sla_met_count": 108
144
+ },
145
+ {
146
+ "candidate_count": 516,
147
+ "hire_count": 11,
148
+ "sla_met_count": 54
149
+ },
150
+ {
151
+ "candidate_count": 468,
152
+ "hire_count": 10,
153
+ "sla_met_count": 320
154
+ },
155
+ {
156
+ "candidate_count": 410,
157
+ "hire_count": 0,
158
+ "sla_met_count": 272
159
+ },
160
+ {
161
+ "candidate_count": 400,
162
+ "hire_count": 5,
163
+ "sla_met_count": 281
164
+ },
165
+ {
166
+ "candidate_count": 519,
167
+ "hire_count": 7,
168
+ "sla_met_count": 370
169
+ },
170
+ {
171
+ "candidate_count": 400,
172
+ "hire_count": 4,
173
+ "sla_met_count": 266
174
+ }
175
+ ],
176
+ "note": "Lightweight view — some fields may be omitted for performance."
177
+ }
178
+ }
179
+ ]
180
+ }
181
+ },
182
+ {
183
+ "name": "task_37",
184
+ "description": "Tests agent handling of wrong field types in response. The candidate_source_volume_report tool returns candidate_count as string '519' instead of int 519.",
185
+ "intent": "Generate a volume report for 05958BR",
186
+ "difficulty": "medium",
187
+ "expected_output": {
188
+ "response": "Volume report for 05958BR shows candidate counts by source, with LinkedIn, Dice, and GitHub among the top contributors.",
189
+ "keywords": [
190
+ "volume|report",
191
+ "candidates|count",
192
+ "05958BR"
193
+ ],
194
+ "tool_calls": [
195
+ {
196
+ "name": "candidate_source_volume_report",
197
+ "args": {
198
+ "requisition_id": "05958BR"
199
+ }
200
+ }
201
+ ],
202
+ "error_handling": {
203
+ "error_type": "wrong_field_types",
204
+ "should_report_error": false,
205
+ "should_retry": false,
206
+ "expected_behavior": "Agent should handle type coercion (string to int) transparently"
207
+ },
208
+ "tool_call_results": [
209
+ {
210
+ "name": "candidate_source_volume_report",
211
+ "result": {
212
+ "requisition_id": "05958BR",
213
+ "metrics": [
214
+ {
215
+ "source_name": "CyberSec Jobs",
216
+ "candidate_count": "200",
217
+ "hire_count": "3",
218
+ "review_rate": "80.5%"
219
+ },
220
+ {
221
+ "source_name": "Dice",
222
+ "candidate_count": "516",
223
+ "hire_count": "11",
224
+ "review_rate": "11.0%"
225
+ },
226
+ {
227
+ "source_name": "GitHub",
228
+ "candidate_count": "468",
229
+ "hire_count": "10",
230
+ "review_rate": "76.1%"
231
+ },
232
+ {
233
+ "source_name": "Indeed",
234
+ "candidate_count": "410",
235
+ "hire_count": "0",
236
+ "review_rate": "77.1%"
237
+ },
238
+ {
239
+ "source_name": "Internal",
240
+ "candidate_count": "400",
241
+ "hire_count": "5",
242
+ "review_rate": "74.0%"
243
+ },
244
+ {
245
+ "source_name": "LinkedIn",
246
+ "candidate_count": "519",
247
+ "hire_count": "7",
248
+ "review_rate": "75.1%"
249
+ },
250
+ {
251
+ "source_name": "Referral",
252
+ "candidate_count": "400",
253
+ "hire_count": "4",
254
+ "review_rate": "70.0%"
255
+ }
256
+ ],
257
+ "total_candidates": "2913"
258
+ }
259
+ }
260
+ ]
261
+ }
262
+ }
263
+ ]
264
+ }
265
+ ]
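task_37's expected_behavior ("handle type coercion transparently") amounts to normalizing the string-typed numeric fields before use; a sketch, where the helper name is an assumption:

```python
def coerce_metric(entry: dict) -> dict:
    """Coerce string-typed numeric fields ("519", "75.1%") as returned
    by candidate_source_volume_report in task_37."""
    out = dict(entry)
    for key in ("candidate_count", "hire_count"):
        if isinstance(out.get(key), str):
            out[key] = int(out[key])
    rate = out.get("review_rate")
    if isinstance(rate, str) and rate.endswith("%"):
        out["review_rate"] = float(rate.rstrip("%"))
    return out

row = coerce_metric({"source_name": "LinkedIn", "candidate_count": "519",
                     "hire_count": "7", "review_rate": "75.1%"})
print(row["candidate_count"] + row["hire_count"])  # 526
```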
data/tasks_type_mismatch.json ADDED
@@ -0,0 +1,110 @@
1
+ [
2
+ {
3
+ "name": "bpo-benchmark",
4
+ "user_info": [],
5
+ "test_cases": [
6
+ {
7
+ "name": "task_27",
8
+ "description": "Tests agent handling when API returns a plain string instead of structured data. The skills_skill_summary tool returns comma-separated skill names as a single string rather than a structured list/object.",
9
+ "intent": "Give me a quick summary of skills needed for 05958BR",
10
+ "difficulty": "medium",
11
+ "expected_output": {
12
+ "response": "The skills for 05958BR include Agile, Ansible, AWS, CI/CD, Cyber Engineering, Data Analysis, Docker, Git, Go, GraphQL, Hadoop, Java, Jenkins, Kafka, Kubernetes, Linux, Machine Learning, MongoDB, Networking, Node.js, PostgreSQL, Power BI, Python, React, Redis, REST APIs, Risk Analysis, Scrum, Spark, SQL, Tableau, Terraform, TypeScript, Wireshark",
13
+ "keywords": [
14
+ "Python",
15
+ "AWS|Docker|Kubernetes",
16
+ "skills"
17
+ ],
18
+ "tool_calls": [
19
+ {
20
+ "name": "skills_skill_summary",
21
+ "args": {
22
+ "requisition_id": "05958BR"
23
+ }
24
+ }
25
+ ],
26
+ "error_handling": {
27
+ "error_type": "type_mismatch",
28
+ "should_report_error": false,
29
+ "should_retry": false,
30
+ "expected_behavior": "Agent should parse the comma-separated string and present the skills"
31
+ },
32
+ "tool_call_results": [
33
+ {
34
+ "name": "skills_skill_summary",
35
+ "result": "AWS, Cloud Security, Cyber Engineering, Firewalls, IT Support, Incident Management, Microsoft PowerShell, NIST Cybersecurity Framework, Python, Quantum Physics, Risk Analysis, Risk Assessment, Wireshark"
36
+ }
37
+ ]
38
+ }
39
+ },
40
+ {
41
+ "name": "task_28",
42
+ "description": "Tests agent handling when API returns an int instead of a float or structured response. The candidate_source_source_sla_score tool returns a bare integer.",
43
+ "intent": "What's the SLA score for Dice sourcing channel on requisition 05958BR?",
44
+ "difficulty": "easy",
45
+ "expected_output": {
46
+ "response": "The SLA score for Dice on requisition 05958BR is 80%.",
47
+ "keywords": [
48
+ "Dice",
49
+ "SLA",
50
+ "re:\\b80\\b|re:\\b95\\b"
51
+ ],
52
+ "tool_calls": [
53
+ {
54
+ "name": "candidate_source_source_sla_score",
55
+ "args": {
56
+ "requisition_id": "05958BR",
57
+ "source_name": "Dice"
58
+ }
59
+ }
60
+ ],
61
+ "error_handling": {
62
+ "error_type": "type_mismatch",
63
+ "should_report_error": false,
64
+ "should_retry": false,
65
+ "expected_behavior": "Agent should handle int/float correctly and present as percentage"
66
+ },
67
+ "tool_call_results": [
68
+ {
69
+ "name": "candidate_source_source_sla_score",
70
+ "result": 95
71
+ }
72
+ ]
73
+ }
74
+ },
75
+ {
76
+ "name": "task_29",
77
+ "description": "Tests agent handling when API returns null/None instead of an empty list. The candidate_source_inactive_sources tool returns None when no inactive sources exist.",
78
+ "intent": "Show me any inactive sources with no candidates for 05958BR",
79
+ "difficulty": "medium",
80
+ "expected_output": {
81
+ "response": "There are no inactive sources for requisition 05958BR. All sourcing channels have active candidates.",
82
+ "keywords": [
83
+ "no inactive|none|no sources|all",
84
+ "active|candidates"
85
+ ],
86
+ "tool_calls": [
87
+ {
88
+ "name": "candidate_source_inactive_sources",
89
+ "args": {
90
+ "requisition_id": "05958BR"
91
+ }
92
+ }
93
+ ],
94
+ "error_handling": {
95
+ "error_type": "type_mismatch",
96
+ "should_report_error": false,
97
+ "should_retry": false,
98
+ "expected_behavior": "Agent should handle null gracefully and report no inactive sources"
99
+ },
100
+ "tool_call_results": [
101
+ {
102
+ "name": "candidate_source_inactive_sources",
103
+ "result": null
104
+ }
105
+ ]
106
+ }
107
+ }
108
+ ]
109
+ }
110
+ ]
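Tasks 27 and 29 both reduce to normalizing an unexpected result shape where a list is expected (a comma-separated string, or null). A sketch of that normalization; the helper name is hypothetical:

```python
def normalize_to_list(result):
    """Normalize tool results that should be lists: a comma-separated
    string (task_27) or null (task_29) both become a plain list."""
    if result is None:
        return []
    if isinstance(result, str):
        return [item.strip() for item in result.split(",") if item.strip()]
    return list(result)

skills = normalize_to_list("AWS, Cloud Security, Python")
print(skills)                   # ['AWS', 'Cloud Security', 'Python']
print(normalize_to_list(None))  # []
```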
data/tasks_undocumented.json ADDED
@@ -0,0 +1,175 @@
1
+ [
2
+ {
3
+ "name": "bpo-benchmark",
4
+ "user_info": [],
5
+ "test_cases": [
6
+ {
7
+ "name": "task_43",
8
+ "description": "Tests agent handling of non-standard error format. The candidate_source_requisition_details tool returns {\"err\": \"not_found\"} instead of standard RequisitionNotFoundResponse when given an invalid ID.",
9
+ "intent": "Get detailed info for requisition 99999BR",
10
+ "difficulty": "medium",
11
+ "expected_output": {
12
+ "response": "Requisition 99999BR was not found. The system returned a non-standard error indicating the requisition does not exist.",
13
+ "keywords": [
14
+ "99999BR",
15
+ "not found|not_found|doesn't exist|does not exist|invalid"
16
+ ],
17
+ "tool_calls": [
18
+ {
19
+ "name": "candidate_source_requisition_details",
20
+ "args": {
21
+ "requisition_id": "99999BR"
22
+ }
23
+ }
24
+ ],
25
+ "error_handling": {
26
+ "error_type": "undocumented_error_format",
27
+ "should_report_error": true,
28
+ "should_retry": false,
29
+ "expected_behavior": "Agent should parse non-standard error format and report not found"
30
+ },
31
+ "tool_call_results": [
32
+ {
33
+ "name": "candidate_source_requisition_details",
34
+ "result": {
35
+ "err": "not_found",
36
+ "req": "99999BR"
37
+ }
38
+ }
39
+ ]
40
+ }
41
+ },
42
+ {
43
+ "name": "task_44",
44
+ "description": "Tests agent handling of undocumented pagination. The candidate_source_list_all_sources tool returns only a page of results with a next_page token not described in any schema.",
45
+ "intent": "List all available sourcing channels for 05958BR",
46
+ "difficulty": "hard",
47
+ "expected_output": {
48
+ "response": "The available sourcing channels for 05958BR include multiple sources. The results show a total count of all channels, though the response is paginated.",
49
+ "keywords": [
50
+ "source|sources|channels",
51
+ "05958BR"
52
+ ],
53
+ "tool_calls": [
54
+ {
55
+ "name": "candidate_source_list_all_sources",
56
+ "args": {
57
+ "requisition_id": "05958BR"
58
+ }
59
+ }
60
+ ],
61
+ "error_handling": {
62
+ "error_type": "undocumented_pagination",
63
+ "should_report_error": false,
64
+ "should_retry": false,
65
+ "expected_behavior": "Agent should detect and handle pagination, noting there are more results"
66
+ },
67
+ "tool_call_results": [
68
+ {
69
+ "name": "candidate_source_list_all_sources",
70
+ "result": {
71
+ "requisition_id": "05958BR",
72
+ "sources": [
73
+ {
74
+ "name": "CyberSec Jobs",
75
+ "index": 0
76
+ },
77
+ {
78
+ "name": "Dice",
79
+ "index": 1
80
+ },
81
+ {
82
+ "name": "GitHub",
83
+ "index": 2
84
+ }
85
+ ],
86
+ "total_count": 7,
87
+ "page_size": 3,
88
+ "page": 1,
89
+ "next_page": "eyJvZmZzZXQiOjMsInJlcV9pZCI6IjA1OTU4QlIifQ==",
90
+ "has_more": true
91
+ }
92
+ }
93
+ ]
94
+ }
95
+ },
96
+ {
97
+ "name": "task_45",
98
+ "description": "Tests agent handling of undocumented rate limiting info in response body. The candidate_source_batch_metrics tool includes X-RateLimit headers embedded in the JSON response.",
99
+ "intent": "Fetch batch metrics for all sources on 05958BR",
100
+ "difficulty": "medium",
101
+ "expected_output": {
102
+ "response": "Batch metrics for 05958BR show candidate counts and hire data across all sourcing channels including LinkedIn, Dice, Indeed, GitHub, Referral, CyberSec Jobs, and TechCareers.",
103
+ "keywords": [
104
+ "metrics|batch",
105
+ "candidates|hires|sources",
106
+ "05958BR"
107
+ ],
108
+ "tool_calls": [
109
+ {
110
+ "name": "candidate_source_batch_metrics",
111
+ "args": {
112
+ "requisition_id": "05958BR"
113
+ }
114
+ }
115
+ ],
116
+ "error_handling": {
117
+ "error_type": "undocumented_rate_limiting",
118
+ "should_report_error": false,
119
+ "should_retry": false,
120
+ "expected_behavior": "Agent should process the metrics data and optionally note rate limit information"
121
+ },
122
+ "tool_call_results": [
123
+ {
124
+ "name": "candidate_source_batch_metrics",
125
+ "result": {
126
+ "requisition_id": "05958BR",
127
+ "metrics": {
128
+ "CyberSec Jobs": {
129
+ "candidates": 200,
130
+ "hires": 3,
131
+ "reviewed": 161
132
+ },
133
+ "Dice": {
134
+ "candidates": 516,
135
+ "hires": 11,
136
+ "reviewed": 57
137
+ },
138
+ "GitHub": {
139
+ "candidates": 468,
140
+ "hires": 10,
141
+ "reviewed": 356
142
+ },
143
+ "Indeed": {
144
+ "candidates": 410,
145
+ "hires": 0,
146
+ "reviewed": 316
147
+ },
148
+ "Internal": {
149
+ "candidates": 400,
150
+ "hires": 5,
151
+ "reviewed": 296
152
+ },
153
+ "LinkedIn": {
154
+ "candidates": 519,
155
+ "hires": 7,
156
+ "reviewed": 390
157
+ },
158
+ "Referral": {
159
+ "candidates": 400,
160
+ "hires": 4,
161
+ "reviewed": 280
162
+ }
163
+ },
164
+ "X-RateLimit-Limit": 100,
165
+ "X-RateLimit-Remaining": 97,
166
+ "X-RateLimit-Reset": "2025-05-01T00:00:00Z",
167
+ "X-RateLimit-Window": "1h"
168
+ }
169
+ }
170
+ ]
171
+ }
172
+ }
173
+ ]
174
+ }
175
+ ]
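task_44's undocumented pagination (an opaque next_page token plus has_more) can be followed with a simple loop. This is a sketch only; the real tool's pagination argument name is not documented, so the `next_page` argument below is an assumption:

```python
def fetch_all_sources(call_tool, requisition_id):
    """Follow the undocumented next_page token from task_44 until
    has_more is false; the token is treated as opaque."""
    sources, token = [], None
    while True:
        args = {"requisition_id": requisition_id}
        if token:
            args["next_page"] = token  # assumed argument name
        result = call_tool(args)
        sources.extend(result.get("sources", []))
        if not result.get("has_more"):
            return sources
        token = result.get("next_page")

# Deterministic stand-in returning two pages shaped like task_44's result.
pages = [
    {"sources": [{"name": "CyberSec Jobs"}, {"name": "Dice"}, {"name": "GitHub"}],
     "has_more": True, "next_page": "tok"},
    {"sources": [{"name": "Indeed"}, {"name": "LinkedIn"}], "has_more": False},
]
def fake_tool(args):
    return pages[1] if args.get("next_page") else pages[0]

names = [s["name"] for s in fetch_all_sources(fake_tool, "05958BR")]
print(len(names))  # 5
```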
data_loader.py ADDED
@@ -0,0 +1,168 @@
1
+ """
2
+ Data loader for candidate data.
3
+
4
+ AUTO-GENERATED by scripts/generate_hf.sh - DO NOT EDIT DIRECTLY
5
+ Edit data_loader.py in main repo and regenerate.
6
+ """
7
+
8
+ from pathlib import Path
9
+ from typing import Optional
10
+ import pandas as pd
11
+
12
+
13
+ class DataLoader:
14
+ """Loads and caches candidate data from parquet file."""
15
+
16
+ _instance: Optional['DataLoader'] = None
17
+ _data: Optional[pd.DataFrame] = None
18
+
19
+ def __new__(cls):
20
+ if cls._instance is None:
21
+ cls._instance = super().__new__(cls)
22
+ return cls._instance
23
+
24
+ def __init__(self):
25
+ """Initialize the data loader."""
26
+ if self._data is None:
27
+ self._load_data()
28
+
29
+ def _load_data(self):
30
+ """Load candidate data from parquet file."""
31
+ # Find data file - try multiple locations for different deployments
32
+ possible_paths = [
33
+ # Main repo: bpo_benchmark/api/data_loader.py -> data/
34
+ Path(__file__).parent.parent.parent / "data" / "candidate_data.parquet",
35
+ # HF/CUGA: data_loader.py in same dir as data/
36
+ Path(__file__).parent / "data" / "candidate_data.parquet",
37
+ # Current working directory
38
+ Path("./data/candidate_data.parquet"),
39
+ # HuggingFace Spaces default path
40
+ Path("/home/user/app/data/candidate_data.parquet"),
41
+ ]
42
+
43
+ data_file = None
44
+ for path in possible_paths:
45
+ if path.exists():
46
+ data_file = path
47
+ break
48
+
49
+ if data_file is None:
50
+ raise FileNotFoundError(
51
+ f"Data file not found. Searched paths: {[str(p) for p in possible_paths]}"
52
+ )
53
+
54
+ self._data = pd.read_parquet(data_file)
55
+
56
+ # Parse skills column (may be string representation of list or already parsed)
57
+ import ast
58
+ import numpy as np
59
+
60
+ def parse_skills(x):
61
+ # Handle None/NaN
62
+ if x is None:
63
+ return []
64
+ # Check if it's a numpy/pandas array or list (already parsed)
65
+ if isinstance(x, (list, np.ndarray)):
66
+ return list(x) if isinstance(x, np.ndarray) else x
67
+ # Handle string case
68
+ if isinstance(x, str):
69
+ if x == '':
70
+ return []
71
+ try:
72
+ return ast.literal_eval(x)
73
+ except (ValueError, SyntaxError):
74
+ return []
75
+ # Scalar NaN case
76
+ try:
77
+ if pd.isna(x):
78
+ return []
79
+ except (TypeError, ValueError):
80
+ pass
81
+ return []
82
+
83
+ self._data['skills_parsed'] = self._data['skills'].apply(parse_skills)
84
+
85
+ @property
86
+ def data(self) -> pd.DataFrame:
87
+ """Get the loaded data."""
88
+         if self._data is None:
+             self._load_data()
+         return self._data
+
+     def get_by_requisition(self, requisition_id: str) -> pd.DataFrame:
+         """Get all candidates for a specific requisition."""
+         return self.data[self.data['requisition_id'] == requisition_id].copy()
+
+     def get_similar_requisitions(self, requisition_id: str) -> pd.DataFrame:
+         """
+         Get candidates from similar requisitions.
+
+         Similarity is determined by matching requisition metadata:
+         - Primary: requisition_template_id (most specific)
+         - Fallback: department + seniority_level (broader matching)
+
+         This enables data-driven similarity without hardcoded requisition lists.
+         """
+         # Get the reference requisition's metadata
+         ref_rows = self.data[self.data['requisition_id'] == requisition_id]
+
+         if ref_rows.empty:
+             # Unknown requisition - return empty DataFrame
+             return pd.DataFrame(columns=self.data.columns)
+
+         # Extract metadata from first row (all rows for same req_id have same metadata)
+         ref_row = ref_rows.iloc[0]
+         ref_template_id = ref_row.get('requisition_template_id')
+
+         # Primary: match by template ID if present
+         if pd.notna(ref_template_id) and str(ref_template_id).strip():
+             similar_mask = self.data['requisition_template_id'] == ref_template_id
+             similar = self.data[similar_mask]
+         else:
+             similar = pd.DataFrame(columns=self.data.columns)
+
+         # Fallback: if template match is missing/too small, match by dept + seniority
+         if similar.empty or similar['requisition_id'].nunique() < 2:
+             ref_department = ref_row.get('department')
+             ref_seniority = ref_row.get('seniority_level')
+             similar_mask = (
+                 (self.data['department'] == ref_department)
+                 & (self.data['seniority_level'] == ref_seniority)
+             )
+             similar = self.data[similar_mask]
+
+         return similar.copy()
+
+     def is_valid_requisition(self, requisition_id: str) -> bool:
+         """Check if a requisition ID exists in the data."""
+         return requisition_id in self.data['requisition_id'].values
+
+     def get_suggested_requisitions(self, invalid_id: str, limit: int = 4) -> list:
+         """
+         Get a list of valid requisition IDs to suggest when an invalid ID is provided.
+
+         Returns close-match IDs from the dataset.
+         """
+         valid_ids = list(self.data['requisition_id'].unique())
+         try:
+             from rapidfuzz import process, fuzz
+
+             matches = process.extract(
+                 invalid_id,
+                 valid_ids,
+                 scorer=fuzz.WRatio,
+                 limit=limit,
+             )
+             return [match[0] for match in matches]
+         except Exception:
+             # Fall back to first few valid IDs if RapidFuzz isn't available
+             return valid_ids[:limit]
+
+
+ # Singleton instance
+ _loader = DataLoader()
+
+
+ def get_data_loader() -> DataLoader:
+     """Get the singleton data loader instance."""
+     return _loader
evaluator.py ADDED
@@ -0,0 +1,150 @@
+ """Keyword-based evaluation for BPO benchmark."""
+
+ from typing import List, Dict, Any
+
+
+ def check_keywords(response: str, expected_keywords: List[str]) -> Dict[str, Any]:
+     """
+     Check if response contains expected keywords (supports OR with |).
+
+     Args:
+         response: The agent's response text
+         expected_keywords: List of keywords to check. Each keyword can contain
+             alternatives separated by | (e.g., "67%|67 %|67")
+
+     Returns:
+         Dictionary with found/missing keywords, match rate, and pass status
+     """
+     found = []
+     missing = []
+
+     for keyword in expected_keywords:
+         alternatives = keyword.split("|")
+         if any(alt.lower() in response.lower() for alt in alternatives):
+             found.append(keyword)
+         else:
+             missing.append(keyword)
+
+     match_rate = len(found) / len(expected_keywords) if expected_keywords else 1.0
+     return {
+         "found": found,
+         "missing": missing,
+         "match_rate": match_rate,
+         "passed": len(missing) == 0
+     }
+
+
+ def evaluate_task(task: Dict[str, Any], response: str, tool_calls: List[Dict[str, Any]]) -> Dict[str, Any]:
+     """
+     Evaluate a single task.
+
+     Args:
+         task: Task definition from tasks.json
+         response: The agent's response text
+         tool_calls: List of tool calls made by the agent
+
+     Returns:
+         Evaluation result dictionary
+     """
+     expected_output = task.get("expected_output", {})
+     keywords = expected_output.get("keywords", [])
+
+     result = check_keywords(response, keywords)
+
+     # Extract tool names from tool calls
+     tool_names = []
+     for tc in tool_calls:
+         if isinstance(tc, dict):
+             name = tc.get("name") or tc.get("function", {}).get("name", "")
+             if name:
+                 tool_names.append(name)
+         elif isinstance(tc, str):
+             tool_names.append(tc)
+
+     # Check expected tool calls
+     expected_tools = expected_output.get("tool_calls", [])
+     expected_tool_names = [t.get("name", "") for t in expected_tools if isinstance(t, dict)]
+
+     # Calculate tool call accuracy
+     if expected_tool_names:
+         matched_tools = sum(1 for t in expected_tool_names if any(t in tn for tn in tool_names))
+         tool_accuracy = matched_tools / len(expected_tool_names)
+     else:
+         # No tools expected - full credit if none were called, partial credit otherwise
+         tool_accuracy = 1.0 if not tool_names else 0.5
+
+     # Calculate API count accuracy (lenient: correct if actual >= expected)
+     api_call_count = len(tool_names)
+     expected_api_count = len(expected_tool_names)
+     api_count_correct = 1 if api_call_count >= expected_api_count else 0
+
+     return {
+         "task_id": task.get("name", "unknown"),
+         "difficulty": task.get("difficulty", "unknown"),
+         "intent": task.get("intent", ""),
+         "response": response,
+         "expected_keywords": keywords,
+         "found_keywords": result["found"],
+         "missing_keywords": result["missing"],
+         "match_rate": result["match_rate"],
+         "passed": result["passed"],
+         "tool_calls": tool_names,
+         "expected_tool_calls": expected_tool_names,
+         "tool_accuracy": tool_accuracy,
+         "api_call_count": api_call_count,
+         "expected_api_count": expected_api_count,
+         "api_count_correct": api_count_correct,
+     }
+
+
+ def calculate_summary(results: List[Dict[str, Any]]) -> Dict[str, Any]:
+     """
+     Calculate summary statistics from evaluation results.
+
+     Args:
+         results: List of evaluation results from evaluate_task
+
+     Returns:
+         Summary dictionary with pass rates and averages
+     """
+     if not results:
+         return {
+             "total_tasks": 0,
+             "passed": 0,
+             "pass_rate": 0.0,
+             "avg_match_rate": 0.0,
+             "avg_tool_accuracy": 0.0,
+             "api_count_accuracy": 0.0,
+             "by_difficulty": {},
+         }
+
+     total = len(results)
+     passed = sum(1 for r in results if r.get("passed", False))
+     avg_match = sum(r.get("match_rate", 0) for r in results) / total
+     avg_tool = sum(r.get("tool_accuracy", 0) for r in results) / total
+     api_count_correct = sum(r.get("api_count_correct", 0) for r in results)
+
+     # Group by difficulty
+     by_difficulty = {}
+     for r in results:
+         diff = r.get("difficulty", "unknown")
+         if diff not in by_difficulty:
+             by_difficulty[diff] = {"total": 0, "passed": 0}
+         by_difficulty[diff]["total"] += 1
+         if r.get("passed", False):
+             by_difficulty[diff]["passed"] += 1
+
+     for diff in by_difficulty:
+         by_difficulty[diff]["pass_rate"] = (
+             by_difficulty[diff]["passed"] / by_difficulty[diff]["total"]
+         )
+
+     return {
+         "total_tasks": total,
+         "passed": passed,
+         "pass_rate": passed / total,
+         "avg_match_rate": avg_match,
+         "avg_tool_accuracy": avg_tool,
+         "api_count_accuracy": api_count_correct / total,
+         "by_difficulty": by_difficulty,
+     }
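To make the `|` OR semantics of `check_keywords` concrete, here is a small worked example of the matching rule (the response string and keywords are invented for illustration):

```python
# A keyword with | alternatives counts as found if ANY alternative
# appears (case-insensitively) in the response -- the same rule
# check_keywords applies per keyword
response = "The SLA for Dice was 67% over the period."
expected_keywords = ["67%|67 %|67", "Dice", "LinkedIn"]

found = [
    kw for kw in expected_keywords
    if any(alt.lower() in response.lower() for alt in kw.split("|"))
]
missing = [kw for kw in expected_keywords if kw not in found]
match_rate = len(found) / len(expected_keywords)
```

Here "67%" satisfies the first keyword via its first alternative, "Dice" matches directly, and "LinkedIn" is missing, so the task would not pass even though the match rate is 2/3.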
mcp_servers/bpo.yaml ADDED
@@ -0,0 +1,4 @@
+ services:
+   - bpo:
+       url: http://127.0.0.1:8000/openapi.json
+       description: BPO recruiting analytics API
models.py ADDED
@@ -0,0 +1,192 @@
+ """Pydantic schemas for BPO API responses."""
+
+ from pydantic import BaseModel
+ from typing import List, Optional, Dict, Any
+
+
+ # ============================================================================
+ # Error Response Models
+ # ============================================================================
+
+ class RequisitionNotFoundResponse(BaseModel):
+     """Response returned when a requisition ID is not found."""
+     error: str
+     message: str
+     suggested_requisition_ids: List[str]
+
+
+ # ============================================================================
+ # Candidate Source Response Models
+ # ============================================================================
+
+ class SLAMetric(BaseModel):
+     """SLA metric for a single source."""
+     source_name: str
+     sla_percentage: int
+
+
+ class SLAPerSourceResponse(BaseModel):
+     """Response for get_sla_per_source API."""
+     metrics: List[SLAMetric]
+
+
+ class HireMetric(BaseModel):
+     """Hire metric for a single source."""
+     source_name: str
+     total_hires: int
+
+
+ class TotalHiresBySourceResponse(BaseModel):
+     """Response for get_total_hires_by_source API."""
+     job_id: str
+     metrics: List[HireMetric]
+     total_hires: int
+
+
+ class VolumeMetric(BaseModel):
+     """Volume metric for a single source."""
+     source_name: str
+     candidate_volume: int
+     percentage: int
+
+
+ class CandidateVolumeResponse(BaseModel):
+     """Response for get_candidate_volume_by_source API."""
+     job_id: str
+     total_candidate_volume: int
+     metrics: List[VolumeMetric]
+     heading: str
+
+
+ class FunnelMetric(BaseModel):
+     """Funnel conversion metric for a single source."""
+     source_name: str
+     first_round_review_percentage: float
+     interview_rate: float
+     offer_acceptance_rate: float
+
+
+ class FunnelConversionResponse(BaseModel):
+     """Response for get_funnel_conversion_by_source API."""
+     job_id: str
+     metrics: List[FunnelMetric]
+
+
+ class MetadataResponse(BaseModel):
+     """Response for get_metadata_and_timeframe API."""
+     job_id: str
+     time_frame_start: str
+     time_frame_end: str
+     data_last_updated: str
+     total_requisitions_analysed: int
+
+
+ class DefinitionsResponse(BaseModel):
+     """Response for get_definitions_and_methodology API."""
+     job_id: str
+     definitions: Dict[str, str]
+     calculation_notes: str
+     top_metrics_considered: List[str]
+
+
+ class SourceSummaryMetric(BaseModel):
+     """Summary metric for a single source."""
+     source_name: str
+     jobs_filled_percentage: int
+     first_round_review_percentage: int
+     offer_acceptance_rate: int
+     total_hires: int
+
+
+ class SourceRecommendationResponse(BaseModel):
+     """Response for get_source_recommendation_summary API."""
+     total_requisitions: int
+     metrics: List[SourceSummaryMetric]
+
+
+ # ============================================================================
+ # Skills Response Models
+ # ============================================================================
+
+ class SkillWithAnalysis(BaseModel):
+     """Skill with historical analysis."""
+     name: str
+     skill_occurrence: int
+     correlation: str
+
+
+ class SkillAnalysisResponse(BaseModel):
+     """Response for get_skill_analysis API."""
+     historical_jobs: int
+     input_skills: List[Any]
+     historical_skills_with_analysis: List[SkillWithAnalysis]
+
+
+ class ImpactMetrics(BaseModel):
+     """Impact metrics for skill analysis."""
+     fill_rate_percentage: float
+     time_to_fill_days: int
+     candidate_pool_size: int
+
+
+ class SkillImpactFillRateResponse(BaseModel):
+     """Response for get_skill_impact_fill_rate API."""
+     skill_name: str
+     impact: ImpactMetrics
+     compared_to_baseline: ImpactMetrics
+
+
+ class SkillImpactSLAResponse(BaseModel):
+     """Response for get_skill_impact_sla API."""
+     requisition_id: str
+     skill_name: str
+     sla_achievement_with_skill: int
+     sla_achievement_without_skill: int
+     delta: int
+
+
+ class SkillJustificationImpact(BaseModel):
+     """Impact metrics within justification."""
+     fill_rate_percentage: float
+     time_to_fill_days: int
+     candidate_pool_size: int
+
+
+ class SkillJustificationData(BaseModel):
+     """Justification data for skill relevance."""
+     requisition_id: str
+     skill_name: str
+     sla_achievement_with_skill: int
+     sla_achievement_without_skill: int
+     delta: int
+     impact: SkillJustificationImpact
+     compared_to_baseline: SkillJustificationImpact
+
+
+ class SkillRelevanceResponse(BaseModel):
+     """Response for get_skill_relevance_justification API."""
+     requisition_id: str
+     skill_name: str
+     is_relevant: bool
+     justification: SkillJustificationData
+
+
+ class SuccessCriteria(BaseModel):
+     """Success criteria thresholds."""
+     time_to_fill_threshold_days: int
+     offer_acceptance_rate_min: int
+     sla_compliance_min: int
+     candidate_quality_rating_avg: float
+
+
+ class SuccessfulPostingResponse(BaseModel):
+     """Response for get_successful_posting_criteria API."""
+     criteria: SuccessCriteria
+     justification: str
+
+
+ class DataSourcesResponse(BaseModel):
+     """Response for get_data_sources_used API."""
+     requisition_id: str
+     datasets_used: List[str]
+     models_involved: List[str]
requirements.txt ADDED
@@ -0,0 +1,33 @@
+ # CUGA SDK - the agent being benchmarked
+ cuga>=0.2.8
+
+ # UI and server
+ gradio>=4.0.0
+ fastapi>=0.110.0
+ uvicorn>=0.27.0
+
+ # Data handling
+ pandas>=2.2.0
+ pyarrow>=15.0.0
+ pydantic>=2.0.0
+ numpy>=1.26.0
+
+ # HTTP client
+ httpx>=0.27.0
+
+ # Fuzzy matching for requisition suggestions
+ rapidfuzz>=3.6.0
+
+ # HuggingFace integration
+ huggingface_hub>=0.20.0
+
+ # LLM providers (used by CUGA SDK)
+ langchain-core>=0.1.0
+ langchain-openai>=0.1.0
+ langchain-groq>=0.1.0
+
+ # YAML for MCP config
+ pyyaml>=6.0.0
+
+ # Observability (optional)
+ langfuse>=2.0.0
server.py ADDED
@@ -0,0 +1,321 @@
+ """FastAPI HTTP server exposing BPO APIs with OpenAPI documentation.
+
+ AUTO-GENERATED by scripts/generate_hf.sh - DO NOT EDIT DIRECTLY
+ Edit bpo_benchmark/api/*.py in main repo and regenerate.
+ """
+
+ import json
+ import logging
+ from typing import Optional, List, Union
+
+ from fastapi import FastAPI, HTTPException, Query
+ from fastapi.responses import JSONResponse
+
+ # Import response models
+ from models import (
+     RequisitionNotFoundResponse,
+     SLAPerSourceResponse,
+     TotalHiresBySourceResponse,
+     CandidateVolumeResponse,
+     FunnelConversionResponse,
+     MetadataResponse,
+     DefinitionsResponse,
+     SourceRecommendationResponse,
+     SkillAnalysisResponse,
+     SkillImpactFillRateResponse,
+     SkillImpactSLAResponse,
+     SkillRelevanceResponse,
+     SuccessfulPostingResponse,
+     DataSourcesResponse,
+ )
+
+ # Import API functions from transformed modules
+ import api_candidate_source
+ import api_skills
+
+ # Import error-prone API modules (for negative/hardness testing)
+ import api_candidate_source_error
+ import api_skills_error
+
+ logger = logging.getLogger(__name__)
+
+ app = FastAPI(
+     title="BPO Recruiting Analytics API",
+     description="API for BPO recruiting analytics benchmark with tool endpoints",
+     version="1.0.0",
+ )
+
+
+ # ============================================================================
+ # FastAPI Endpoints - Candidate Source
+ # ============================================================================
+
+ @app.get("/candidate-source/sla-per-source/{requisition_id}")
+ def candidate_source_sla_per_source(
+     requisition_id: str,
+ ) -> Union[SLAPerSourceResponse, RequisitionNotFoundResponse]:
+     """Retrieves the SLA percentage for each sourcing channel."""
+     return api_candidate_source.get_sla_per_source(requisition_id)
+
+
+ @app.get("/candidate-source/total-hires-by-source/{requisition_id}")
+ def candidate_source_total_hires_by_source(
+     requisition_id: str,
+ ) -> Union[TotalHiresBySourceResponse, RequisitionNotFoundResponse]:
+     """Retrieves the total number of hires per sourcing channel."""
+     return api_candidate_source.get_total_hires_by_source(requisition_id)
+
+
+ @app.get("/candidate-source/candidate-volume-by-source/{requisition_id}")
+ def candidate_source_candidate_volume_by_source(
+     requisition_id: str,
+     sources: Optional[List[str]] = Query(None),
+ ) -> Union[CandidateVolumeResponse, RequisitionNotFoundResponse]:
+     """Retrieves candidate volume per sourcing channel."""
+     return api_candidate_source.get_candidate_volume_by_source(requisition_id, sources)
+
+
+ @app.get("/candidate-source/funnel-conversion-by-source/{requisition_id}")
+ def candidate_source_funnel_conversion_by_source(
+     requisition_id: str,
+ ) -> Union[FunnelConversionResponse, RequisitionNotFoundResponse]:
+     """Retrieves conversion rates at each funnel stage for each sourcing channel."""
+     return api_candidate_source.get_funnel_conversion_by_source(requisition_id)
+
+
+ @app.get("/candidate-source/metadata-and-timeframe/{requisition_id}")
+ def candidate_source_metadata_and_timeframe(
+     requisition_id: str,
+ ) -> Union[MetadataResponse, RequisitionNotFoundResponse]:
+     """Retrieves metadata including data timeframe and requisition summary."""
+     return api_candidate_source.get_metadata_and_timeframe(requisition_id)
+
+
+ @app.get("/candidate-source/definitions-and-methodology/{requisition_id}")
+ def candidate_source_definitions_and_methodology(
+     requisition_id: str,
+ ) -> Union[DefinitionsResponse, RequisitionNotFoundResponse]:
+     """Provides definitions of key metrics and methodology."""
+     return api_candidate_source.get_definitions_and_methodology(requisition_id)
+
+
+ @app.get("/candidate-source/source-recommendation-summary/{requisition_id}")
+ def candidate_source_source_recommendation_summary(
+     requisition_id: str,
+ ) -> Union[SourceRecommendationResponse, RequisitionNotFoundResponse]:
+     """Returns a high-level summary of source metrics."""
+     return api_candidate_source.get_source_recommendation_summary(requisition_id)
+
+
+ # ============================================================================
+ # FastAPI Endpoints - Skills
+ # ============================================================================
+
+ @app.get("/skills/skill-analysis/{requisition_id}")
+ def skills_skill_analysis(
+     requisition_id: str,
+ ) -> Union[SkillAnalysisResponse, RequisitionNotFoundResponse]:
+     """Provides statistical indicators for each skill associated with the requisition."""
+     return api_skills.get_skill_analysis(requisition_id)
+
+
+ @app.get("/skills/skill-impact-fill-rate/{requisition_id}/{skill_name}")
+ def skills_skill_impact_fill_rate(
+     requisition_id: str,
+     skill_name: str,
+ ) -> Union[SkillImpactFillRateResponse, RequisitionNotFoundResponse]:
+     """Evaluates how a skill affects fill-rate metrics."""
+     return api_skills.get_skill_impact_fill_rate(requisition_id, skill_name)
+
+
+ @app.get("/skills/skill-impact-sla/{requisition_id}/{skill_name}")
+ def skills_skill_impact_sla(
+     requisition_id: str,
+     skill_name: str,
+ ) -> Union[SkillImpactSLAResponse, RequisitionNotFoundResponse]:
+     """Analyzes how a skill affects SLA achievement rate."""
+     return api_skills.get_skill_impact_sla(requisition_id, skill_name)
+
+
+ @app.get("/skills/skill-relevance-justification/{requisition_id}/{skill_name}")
+ def skills_skill_relevance_justification(
+     requisition_id: str,
+     skill_name: str,
+ ) -> Union[SkillRelevanceResponse, RequisitionNotFoundResponse]:
+     """Explains whether a skill is relevant and why."""
+     return api_skills.get_skill_relevance_justification(requisition_id, skill_name)
+
+
+ @app.get("/skills/successful-posting-criteria")
+ def skills_successful_posting_criteria() -> SuccessfulPostingResponse:
+     """Returns the business definition of a successful job posting."""
+     return api_skills.get_successful_posting_criteria()
+
+
+ @app.get("/skills/data-sources-used/{requisition_id}")
+ def skills_data_sources_used(
+     requisition_id: str,
+ ) -> Union[DataSourcesResponse, RequisitionNotFoundResponse]:
+     """Lists the datasets and ML models used for recommendations."""
+     return api_skills.get_data_sources_used(requisition_id)
+
+
+ # ============================================================================
+ # Error-Prone Endpoints - Type Mismatch
+ # ============================================================================
+
+ @app.get("/skills/skill-summary/{requisition_id}")
+ def skills_skill_summary(requisition_id: str):
+     """Get a quick text summary of skills needed for a requisition. Returns a concise skill overview."""
+     return api_skills_error.get_skill_summary(requisition_id)
+
+
+ @app.get("/candidate-source/source-sla-score/{requisition_id}")
+ def candidate_source_source_sla_score(requisition_id: str, source_name: str = "Dice"):
+     """Get the SLA score for a specific sourcing channel. Returns the SLA achievement score."""
+     return api_candidate_source_error.get_source_sla_score(requisition_id, source_name)
+
+
+ @app.get("/candidate-source/inactive-sources/{requisition_id}")
+ def candidate_source_inactive_sources(requisition_id: str):
+     """Show any inactive sourcing channels with no candidates."""
+     return api_candidate_source_error.get_inactive_sources(requisition_id)
+
+
+ # ============================================================================
+ # Error-Prone Endpoints - HTTP Errors
+ # ============================================================================
+
+ @app.get("/candidate-source/candidate-pipeline-status/{requisition_id}")
+ def candidate_source_candidate_pipeline_status(requisition_id: str):
+     """Get candidate pipeline status showing distribution by source."""
+     result = api_candidate_source_error.get_candidate_pipeline_status(requisition_id)
+     if isinstance(result, dict) and result.get("error") and result.get("status_code"):
+         return JSONResponse(status_code=result["status_code"], content=result)
+     return result
+
+
+ @app.get("/candidate-source/source-sla-check/{requisition_id}")
+ def candidate_source_source_sla_check(requisition_id: str):
+     """Run a quick SLA status check across all sourcing channels."""
+     result = api_candidate_source_error.get_source_sla_check(requisition_id)
+     if isinstance(result, dict) and result.get("error") and result.get("status_code", 0) >= 500:
+         return JSONResponse(status_code=result["status_code"], content=result)
+     return result
+
+
+ @app.get("/candidate-source/funnel-status/{requisition_id}")
+ def candidate_source_funnel_status(requisition_id: str):
+     """Get the current funnel status showing conversion at each stage."""
+     result = api_candidate_source_error.get_funnel_status(requisition_id)
+     if isinstance(result, dict) and result.get("error"):
+         return JSONResponse(status_code=result.get("status_code", 500), content=result)
+     return result
+
+
+ @app.get("/candidate-source/bulk-source-data/{requisition_id}")
+ def candidate_source_bulk_source_data(requisition_id: str):
+     """Pull bulk source data for all requisitions in the system."""
+     result = api_candidate_source_error.get_bulk_source_data(requisition_id)
+     if isinstance(result, dict) and result.get("error") and result.get("status_code") == 429:
+         return JSONResponse(status_code=429, content=result)
+     return result
+
+
+ # ============================================================================
+ # Error-Prone Endpoints - Schema Violations
+ # ============================================================================
+
+ @app.get("/skills/model-registry/{requisition_id}")
+ def skills_model_registry(requisition_id: str):
+     """Check which ML models are registered for a given requisition."""
+     return api_skills_error.get_model_registry(requisition_id)
+
+
+ @app.get("/skills/skill-lookup/{requisition_id}")
+ def skills_skill_lookup(requisition_id: str, skill_name: Optional[str] = None):
+     """Look up a specific skill and its metrics for a requisition."""
+     return api_skills_error.get_skill_lookup(requisition_id, skill_name)
+
+
+ @app.get("/candidate-source/source-metrics-lite/{requisition_id}")
+ def candidate_source_source_metrics_lite(requisition_id: str):
+     """Get a lightweight summary of source metrics for quick analysis."""
+     return api_candidate_source_error.get_source_metrics_lite(requisition_id)
+
+
+ @app.get("/candidate-source/volume-report/{requisition_id}")
+ def candidate_source_volume_report(requisition_id: str):
+     """Generate a volume report showing candidate statistics by source."""
+     return api_candidate_source_error.get_volume_report(requisition_id)
+
+
+ # ============================================================================
+ # Error-Prone Endpoints - Edge Cases
+ # ============================================================================
+
+ @app.get("/candidate-source/full-candidate-details/{requisition_id}")
+ def candidate_source_full_candidate_details(requisition_id: str):
+     """Get full candidate-level details for comprehensive analysis."""
+     return api_candidate_source_error.get_full_candidate_details(requisition_id)
+
+
+ @app.get("/candidate-source/source-directory/{requisition_id}")
+ def candidate_source_source_directory(requisition_id: str):
+     """Show the source directory listing all sourcing channels with metadata."""
+     return api_candidate_source_error.get_source_directory(requisition_id)
+
+
+ @app.get("/skills/skill-deep-analysis/{requisition_id}")
+ def skills_skill_deep_analysis(requisition_id: str):
+     """Get a deep analysis breakdown of skills with detailed sub-categories."""
+     return api_skills_error.get_skill_deep_analysis(requisition_id)
+
+
+ @app.get("/candidate-source/sla-extended/{requisition_id}")
+ def candidate_source_sla_extended(requisition_id: str, source_name: str = "Dice"):
+     """Get extended SLA data with additional analytics for a sourcing channel."""
+     return api_candidate_source_error.get_sla_extended(requisition_id, source_name)
+
+
+ @app.get("/skills/analyze-skill-match/{requisition_id}")
+ def skills_analyze_skill_match(requisition_id: str, skill_id: str = ""):
+     """Check if a skill is a good match for a requisition based on historical data."""
+     return api_skills_error.analyze_skill_match(requisition_id, skill_id)
+
+
+ # ============================================================================
+ # Error-Prone Endpoints - Undocumented Behaviors
+ # ============================================================================
+
+ @app.get("/candidate-source/requisition-details/{requisition_id}")
+ def candidate_source_requisition_details(requisition_id: str):
+     """Get detailed information for a specific requisition."""
+     return api_candidate_source_error.get_requisition_details(requisition_id)
+
+
+ @app.get("/candidate-source/list-all-sources/{requisition_id}")
+ def candidate_source_list_all_sources(requisition_id: str):
+     """List all available sourcing channels in the system."""
+     return api_candidate_source_error.list_all_sources(requisition_id)
+
+
+ @app.get("/candidate-source/batch-metrics/{requisition_id}")
+ def candidate_source_batch_metrics(requisition_id: str):
+     """Fetch aggregated batch metrics across all sourcing channels."""
+     return api_candidate_source_error.get_batch_metrics(requisition_id)
+
+
+ # ============================================================================
+ # Utility
+ # ============================================================================
+
+ @app.get("/health")
+ def health_check():
+     """Health check endpoint."""
+     return {"status": "healthy"}
+
+
+ if __name__ == "__main__":
+     import uvicorn
+     uvicorn.run(app, host="0.0.0.0", port=8000)
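The error-prone endpoints above share one convention: the underlying function returns a plain payload, and the route wrapper promotes it to an HTTP error only when it carries both `error` and `status_code` keys. A minimal framework-free sketch of that dispatch rule (`dispatch` and its return shape are illustrative, not part of the real FastAPI code):

```python
from typing import Any, Dict

def dispatch(result: Any) -> Dict[str, Any]:
    # Mirror the wrapper pattern: propagate status_code only when the
    # payload explicitly flags an error, otherwise treat it as 200 OK
    if isinstance(result, dict) and result.get("error") and result.get("status_code"):
        return {"status": result["status_code"], "body": result}
    return {"status": 200, "body": result}

ok = dispatch({"metrics": [{"source_name": "Dice", "sla_percentage": 67}]})
rate_limited = dispatch({"error": "rate limit exceeded", "status_code": 429})
```

In the real server the error branch wraps the payload in a `JSONResponse` with the given status code, which is what lets the benchmark test whether an agent notices and recovers from 4xx/5xx tool responses.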