haroldshipibm committed on
Commit
d075a5b
·
verified ·
1 Parent(s): d370bed

Upload folder using huggingface_hub

Dockerfile ADDED
@@ -0,0 +1,35 @@
+ FROM python:3.12-slim
+
+ # Set up user for HuggingFace Spaces
+ RUN useradd -m -u 1000 user
+ USER user
+ ENV HOME=/home/user \
+     PATH=/home/user/.local/bin:$PATH
+
+ WORKDIR $HOME/app
+
+ # Install dependencies
+ COPY --chown=user requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy application files
+ COPY --chown=user . .
+
+ # Download data from HF Dataset on build (all task suites + fixture data)
+ RUN python -c "from huggingface_hub import hf_hub_download; \
+     import os; os.makedirs('data', exist_ok=True); \
+     r='ibm-research/bpo-benchmark'; d='dataset'; \
+     hf_hub_download(r, 'candidate_data.parquet', local_dir='data', repo_type=d); \
+     hf_hub_download(r, 'tasks.json', local_dir='data', repo_type=d); \
+     hf_hub_download(r, 'tasks_type_mismatch.json', local_dir='data', repo_type=d); \
+     hf_hub_download(r, 'tasks_http_errors.json', local_dir='data', repo_type=d); \
+     hf_hub_download(r, 'tasks_schema_violations.json', local_dir='data', repo_type=d); \
+     hf_hub_download(r, 'tasks_edge_cases.json', local_dir='data', repo_type=d); \
+     hf_hub_download(r, 'tasks_undocumented.json', local_dir='data', repo_type=d); \
+     hf_hub_download(r, 'large_response_fixture.json', local_dir='data', repo_type=d)"
+
+ # Expose port for Gradio
+ EXPOSE 7860
+
+ # Start the application
+ CMD ["python", "app.py"]
README.md CHANGED
@@ -1,12 +1,185 @@
  ---
- title: BPO Bench
- emoji: 🌍
- colorFrom: gray
- colorTo: blue
- sdk: gradio
- sdk_version: 6.8.0
- app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: BPO Benchmark Evaluation
+ emoji: "\U0001F4CA"
+ colorFrom: blue
+ colorTo: purple
+ sdk: docker
+ app_port: 7860
  pinned: false
+ license: apache-2.0
  ---
+ # BPO Benchmark Evaluation
+
+ Evaluate the **CUGA SDK** on BPO (Business Process Outsourcing) recruiting analytics tasks.
+
+ ## What This Benchmarks
+
+ This Space evaluates the [CUGA SDK](https://pypi.org/project/cuga/), a production AI agent framework, on its ability to use tool APIs to answer recruiting analytics questions.
+
+ ## Features
+
+ - **Real Agent Testing**: Uses the actual CUGA SDK from PyPI (`pip install cuga>=0.2.8`)
+ - **32 Tool APIs** for recruiting data analysis (13 core + 19 error-prone)
+ - **45 Evaluation Tasks** across 6 test suites (easy/medium/hard difficulty)
+ - **Multi-Metric Scoring**: String similarity, exact match, keyword match, API accuracy, and LLM Judge
+ - **OpenAI and Groq** provider support
+ - **Langfuse Observability**: Optional tracing for detailed evaluation analysis
+
+ ## API Endpoints
+
+ The benchmark exposes 32 BPO recruiting analytics endpoints plus a health check.
+
+ ### Core Endpoints (13)
+
+ #### Candidate Source APIs (7)
+ - SLA performance by source
+ - Total hires by source
+ - Candidate volume metrics
+ - Funnel conversion rates
+ - Metadata and timeframe
+ - Definitions and methodology
+ - Source recommendations
+
+ #### Skills APIs (6)
+ - Skill analysis and correlations
+ - Skill impact on fill rate
+ - Skill impact on SLA
+ - Skill relevance justification
+ - Success criteria
+ - Data sources used
+
+ ### Error-Prone Endpoints (19)
+
+ These endpoints intentionally exhibit problematic behaviors to test agent resilience.
+
+ #### Type Mismatch (3)
+ - `skills/skill-summary` — Returns a plain string instead of a JSON object
+ - `candidate-source/source-sla-score` — Returns a bare float instead of a structured response
+ - `candidate-source/inactive-sources` — Returns a boolean or a list depending on data state
+
+ #### HTTP Errors (4)
+ - `candidate-source/candidate-pipeline-status` — Intermittently returns 404
+ - `candidate-source/source-sla-check` — Returns 500 Internal Server Error
+ - `candidate-source/funnel-status` — Returns 503 Service Unavailable
+ - `candidate-source/bulk-source-data` — Returns 429 Too Many Requests
+
+ #### Schema Violations (4)
+ - `skills/model-registry` — No Pydantic output schema; returns an untyped dict
+ - `skills/skill-lookup` — Returns extra undeclared fields
+ - `candidate-source/source-metrics-lite` — Randomly omits required fields
+ - `candidate-source/volume-report` — Returns wrong field types (strings for numbers)
+
+ #### Edge Cases (5)
+ - `candidate-source/full-candidate-details` — Returns an oversized payload (~1 MB)
+ - `candidate-source/source-directory` — Contains Unicode and special characters
+ - `skills/skill-deep-analysis` — Deeply nested JSON (5+ levels)
+ - `candidate-source/sla-extended` — Includes unexpected extra fields
+ - `skills/analyze-skill-match` — Returns a schema that differs from its documentation
+
+ #### Undocumented Behaviors (3)
+ - `candidate-source/requisition-details` — Non-standard error format (error as string, not object)
+ - `candidate-source/list-all-sources` — Undocumented pagination in the response
+ - `candidate-source/batch-metrics` — Undocumented rate-limiting headers
+
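An agent consuming the error-prone suites above cannot assume every response body is a well-formed JSON object. A minimal defensive-parsing sketch (illustrative only, not part of the benchmark code) that normalizes the type-mismatch shapes listed above into a dict:

```python
import json
from typing import Any


def normalize_payload(raw: Any) -> dict:
    """Coerce a response body into a dict, tolerating the type-mismatch
    shapes above: plain string, bare float, and boolean-or-list."""
    if isinstance(raw, dict):
        return raw
    if isinstance(raw, str):
        # Some endpoints return JSON-encoded text; others return plain prose.
        try:
            decoded = json.loads(raw)
            return decoded if isinstance(decoded, dict) else {"value": decoded}
        except json.JSONDecodeError:
            return {"value": raw}
    if isinstance(raw, (bool, int, float)):
        return {"value": raw}
    if isinstance(raw, list):
        return {"items": raw}
    return {"value": str(raw)}
```

The wrapper keys (`value`, `items`) are arbitrary; the point is that downstream code only ever sees a dict.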
+ ## Usage
+
+ 1. Enter your **OpenAI** or **Groq** API key
+ 2. Select test suites to evaluate (checkboxes):
+    - **Core** (26 tasks) — Standard recruiting analytics questions
+    - **Type Mismatch** (3 tasks) — APIs returning unexpected data types
+    - **HTTP Errors** (4 tasks) — APIs returning HTTP error codes
+    - **Schema Violations** (4 tasks) — APIs with missing/wrong schema fields
+    - **Edge Cases** (5 tasks) — Large payloads, Unicode, deep nesting
+    - **Undocumented Behaviors** (3 tasks) — Non-standard error formats and pagination
+ 3. Optionally filter to specific task IDs within selected suites
+ 4. Click **Run Evaluation**
+ 5. View results with multi-metric scoring and per-task breakdowns
+
+ ## Evaluation Metrics
+
+ Each task is scored on multiple dimensions:
+
+ - **String Similarity**: Fuzzy match between agent response and expected output (0-100%)
+ - **Exact Match**: Binary check for precise answer correctness
+ - **Keyword Match Rate**: Percentage of expected keywords found in the response
+ - **API Accuracy**: Precision and recall of which API endpoints the agent invoked
+ - **LLM Judge** (optional): GPT-based semantic evaluation of response quality
+ - **Final Composite Score**: Weighted combination of all metrics
+
+ A task **passes** when the composite score exceeds the threshold.
+
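The pass/fail rule can be sketched in a few lines. The thresholds below mirror the defaults used elsewhere in this Space (0.85 when the answer matched exactly, 0.9 otherwise); treat the sketch as illustrative rather than the exact scoring code:

```python
from typing import Optional


def passes(similarity: float, exact_match: bool,
           llm_judge_score: Optional[float] = None,
           threshold_exact: float = 0.85,
           threshold_inexact: float = 0.9) -> bool:
    """A task passes when its similarity (or, when available, the LLM judge
    score) meets the applicable threshold; an exact match earns the lower one."""
    threshold = threshold_exact if exact_match else threshold_inexact
    if llm_judge_score is not None:
        # Either score can carry the task past the threshold.
        return llm_judge_score >= threshold or similarity >= threshold
    return similarity >= threshold
```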
+ ## Dataset
+
+ The benchmark uses synthetic BPO recruiting data:
+ - 64k candidate records
+ - 1,047 requisitions
+ - 7 sourcing channels
+
+ **Dataset:** [ibm-research/bpo-benchmark](https://huggingface.co/datasets/ibm-research/bpo-benchmark)
+
+ ## Local Development
+
+ ```bash
+ # Clone the space
+ git clone https://huggingface.co/spaces/ibm-research/bpo-benchmark-eval
+ cd bpo-benchmark-eval
+
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Download data (all task suites + fixture data)
+ python -c "
+ from huggingface_hub import hf_hub_download
+ import os
+ os.makedirs('data', exist_ok=True)
+ files = [
+     'candidate_data.parquet',
+     'tasks.json',
+     'tasks_type_mismatch.json',
+     'tasks_http_errors.json',
+     'tasks_schema_violations.json',
+     'tasks_edge_cases.json',
+     'tasks_undocumented.json',
+     'large_response_fixture.json',
+ ]
+ for f in files:
+     hf_hub_download('ibm-research/bpo-benchmark', f, local_dir='data', repo_type='dataset')
+ print(f'Downloaded {len(files)} files')
+ "
+
+ # Run
+ python app.py
+ ```
+
+ ## Architecture
+
+ ```
+ ┌─────────────────────────────────────────────────────────────┐
+ │                    Gradio UI (port 7860)                    │
+ │   - 6 test suite checkboxes, multi-metric result display    │
+ │   - LLM Judge toggle, Langfuse observability                │
+ └─────────────────────────────────────────────────────────────┘
+                               │
+                               ▼
+ ┌─────────────────────────────────────────────────────────────┐
+ │                       CUGA SDK Agent                        │
+ │   - Loads tools via OpenAPI spec from Registry              │
+ │   - Processes queries with LLM (OpenAI/Groq)                │
+ │   - Orchestrates tool calls                                 │
+ └─────────────────────────────────────────────────────────────┘
+                     ┌─────────┴──────────┐
+                     ▼                    ▼
+ ┌──────────────────────────┐  ┌──────────────────────────┐
+ │  FastAPI Server (:8000)  │  │  CUGA Registry (:8001)   │
+ │  - 32 BPO API endpoints  │  │  - Tool discovery        │
+ │  - OpenAPI spec          │  │  - OpenAPI aggregation   │
+ │  - Loads data from       │  └──────────────────────────┘
+ │    parquet               │
+ └──────────────────────────┘
+ ```
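The FastAPI and Registry boxes are launched from the Gradio process as daemon threads (one blocking `uvicorn.run` call per thread), so they exit together with the UI. Stripped to its essentials, the pattern looks roughly like this, with the server entry points stubbed out as placeholders:

```python
import threading


def start_background(name: str, target) -> threading.Thread:
    """Run a blocking server entry point (e.g. uvicorn.run) on a daemon
    thread so it terminates together with the main process."""
    t = threading.Thread(target=target, name=name, daemon=True)
    t.start()
    return t


# Placeholders standing in for uvicorn.run(bpo_app, port=8000)
# and uvicorn.run(registry_app, port=8001).
started = []
threads = [
    start_background("bpo-api", lambda: started.append("bpo-api")),
    start_background("registry", lambda: started.append("registry")),
]
for t in threads:
    t.join(timeout=2)
```

With real servers the threads never return; `daemon=True` is what lets the Space shut down cleanly without joining them.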
+
+ ## License
+
+ Apache 2.0
__pycache__/agent.cpython-312.pyc ADDED
Binary file (19.1 kB)

__pycache__/api_candidate_source.cpython-312.pyc ADDED
Binary file (13.7 kB)

__pycache__/api_candidate_source_error.cpython-312.pyc ADDED
Binary file (17.2 kB)

__pycache__/api_skills.cpython-312.pyc ADDED
Binary file (12.1 kB)

__pycache__/api_skills_error.cpython-312.pyc ADDED
Binary file (8.77 kB)

__pycache__/app.cpython-312.pyc ADDED
Binary file (18.8 kB)

__pycache__/data_loader.cpython-312.pyc ADDED
Binary file (7.51 kB)

__pycache__/models.cpython-312.pyc ADDED
Binary file (8.29 kB)

__pycache__/server.cpython-312.pyc ADDED
Binary file (16.5 kB)
agent.py ADDED
@@ -0,0 +1,784 @@
+ """CUGA SDK agent for BPO benchmark evaluation with Langfuse tracking."""
+
+ import asyncio
+ import logging
+ import os
+ import re
+ import threading
+ import time
+ from pathlib import Path
+ from typing import Any, Dict, List, Optional, Tuple
+
+ import uvicorn
+
+ logger = logging.getLogger(__name__)
+
+ # Global flags to track server status
+ _servers_started = False
+ _servers_lock = threading.Lock()
+
+
+ # ============================================================================
+ # Provider Configuration
+ # ============================================================================
+
+ PROVIDER_CONFIGS = {
+     "groq": {
+         "env_var": "GROQ_API_KEY",
+         "settings_file": "settings.groq.toml",
+         "default_model": "openai/gpt-oss-120b",
+         "models": [
+             "openai/gpt-oss-120b",
+             "llama-3.3-70b-versatile",
+             "llama-3.1-8b-instant",
+             "mixtral-8x7b-32768",
+         ],
+         "placeholder": "gsk_...",
+     },
+     "openai": {
+         "env_var": "OPENAI_API_KEY",
+         "settings_file": "settings.openai.toml",
+         "default_model": "gpt-4o-mini",
+         "models": [
+             "gpt-4o-mini",
+             "gpt-4.1",
+             "gpt-5",
+             "gpt-4o",
+         ],
+         "placeholder": "sk-...",
+     },
+ }
+
+
+ def get_provider_models(provider: str) -> List[str]:
+     """Get available models for a provider."""
+     config = PROVIDER_CONFIGS.get(provider.lower(), {})
+     return config.get("models", [])
+
+
+ def get_provider_placeholder(provider: str) -> str:
+     """Get API key placeholder for a provider."""
+     config = PROVIDER_CONFIGS.get(provider.lower(), {})
+     return config.get("placeholder", "...")
+
+
+ def get_default_model(provider: str) -> str:
+     """Get default model for a provider."""
+     config = PROVIDER_CONFIGS.get(provider.lower(), {})
+     return config.get("default_model", "")
+
+
+ # ============================================================================
+ # Server Management
+ # ============================================================================
+
+ def start_servers():
+     """Start BPO API and Registry servers if not already running."""
+     global _servers_started
+
+     with _servers_lock:
+         if _servers_started:
+             return
+         _servers_started = True
+
+     # Import here to avoid circular imports
+     from server import app as bpo_app
+     from cuga.backend.tools_env.registry.registry.api_registry_server import (
+         app as registry_app,
+     )
+
+     # Start BPO API server on port 8000
+     def run_bpo():
+         uvicorn.run(bpo_app, host="0.0.0.0", port=8000, log_level="warning")
+
+     bpo_thread = threading.Thread(target=run_bpo, daemon=True)
+     bpo_thread.start()
+     logger.info("BPO API server starting on port 8000")
+
+     # Start Registry server on port 8001
+     def run_registry():
+         uvicorn.run(registry_app, host="0.0.0.0", port=8001, log_level="warning")
+
+     registry_thread = threading.Thread(target=run_registry, daemon=True)
+     registry_thread.start()
+     logger.info("Registry server starting on port 8001")
+
+     # Wait for servers to be ready
+     time.sleep(4)
+     logger.info("Servers started")
+
+
+ # ============================================================================
+ # Environment Setup
+ # ============================================================================
+
+ def setup_environment(api_key: str, provider: str, model: Optional[str] = None):
+     """Set up environment variables for CUGA SDK."""
+     # Clear conflicting env vars
+     for key in ["OPENAI_BASE_URL", "OPENAI_API_KEY", "GROQ_API_KEY"]:
+         if key in os.environ:
+             del os.environ[key]
+
+     provider_lower = provider.lower()
+     config = PROVIDER_CONFIGS.get(provider_lower)
+
+     if not config:
+         raise ValueError(f"Unknown provider: {provider}. Supported: {list(PROVIDER_CONFIGS.keys())}")
+
+     # Set provider-specific config
+     os.environ[config["env_var"]] = api_key
+     os.environ["AGENT_SETTING_CONFIG"] = config["settings_file"]
+     os.environ["MODEL_NAME"] = model or config["default_model"]
+
+     # Set MCP servers file path
+     mcp_config = Path(__file__).parent / "mcp_servers" / "bpo.yaml"
+     os.environ["MCP_SERVERS_FILE"] = str(mcp_config.resolve())
+
+     # Disable policies for benchmark
+     os.environ["DYNACONF_POLICY__ENABLED"] = "false"
+
+     logger.info(f"Environment configured: provider={provider}, model={os.environ.get('MODEL_NAME')}")
+
+
+ # ============================================================================
+ # Langfuse Integration
+ # ============================================================================
+
+ class LangfuseTracker:
+     """Tracks evaluation runs and task scores in Langfuse."""
+
+     def __init__(self):
+         self.enabled = False
+         self.langfuse = None
+         self.trace_id = None
+         self.init_error = None
+         self._init_langfuse()
+
+     def _init_langfuse(self) -> None:
+         """Initialize Langfuse client if credentials are available."""
+         # Debug: show all LANGFUSE env vars
+         langfuse_vars = {k: ('set' if v else 'empty') for k, v in os.environ.items() if 'LANGFUSE' in k.upper()}
+         logger.info(f"Langfuse env vars found: {langfuse_vars}")
+
+         public_key = os.environ.get("LANGFUSE_PUBLIC_KEY")
+         secret_key = os.environ.get("LANGFUSE_SECRET_KEY")
+         # Support both LANGFUSE_HOST and LANGFUSE_BASE_URL
+         host = os.environ.get("LANGFUSE_HOST") or os.environ.get("LANGFUSE_BASE_URL") or "https://cloud.langfuse.com"
+
+         logger.info(f"Langfuse init: public_key={'set' if public_key else 'not set'}, secret_key={'set' if secret_key else 'not set'}, host={host}")
+
+         if not public_key or not secret_key:
+             self.init_error = "Langfuse credentials not found"
+             logger.info(self.init_error)
+             return
+
+         try:
+             from langfuse import Langfuse
+
+             self.langfuse = Langfuse(
+                 public_key=public_key,
+                 secret_key=secret_key,
+                 host=host,
+             )
+             # Test the connection by checking auth
+             self.langfuse.auth_check()
+             self.enabled = True
+             logger.info(f"Langfuse tracking initialized successfully (host={host})")
+         except ImportError as e:
+             self.init_error = f"langfuse package not installed: {e}"
+             logger.warning(self.init_error)
+         except Exception as e:
+             self.init_error = f"Failed to initialize Langfuse: {e}"
+             logger.warning(self.init_error)
+
+     def start_trace(self, name: str, metadata: Optional[Dict[str, Any]] = None) -> Optional[str]:
+         """Start a new trace for an evaluation run."""
+         if not self.enabled or not self.langfuse:
+             return None
+
+         try:
+             # v2-style Langfuse client API (client.trace)
+             trace = self.langfuse.trace(name=name, metadata=metadata or {})
+             self.trace_id = trace.id
+             return self.trace_id
+         except AttributeError:
+             # Fallback for Langfuse versions without `trace()`
+             try:
+                 self.trace_id = f"trace_{name}_{id(self)}"
+                 logger.info(f"Using fallback trace ID: {self.trace_id}")
+                 return self.trace_id
+             except Exception as e:
+                 logger.warning(f"Failed to create trace (fallback): {e}")
+                 return None
+         except Exception as e:
+             logger.warning(f"Failed to create trace: {e}")
+             return None
+
+     def score_task(self, task_id: str, scores: Dict[str, float]) -> None:
+         """Score a task within the current trace."""
+         if not self.enabled or not self.langfuse or not self.trace_id:
+             return
+
+         try:
+             for name, value in scores.items():
+                 self.langfuse.score(
+                     trace_id=self.trace_id,
+                     name=f"{task_id}_{name}",
+                     value=value,
+                 )
+         except Exception as e:
+             logger.warning(f"Failed to score task {task_id}: {e}")
+
+     def end_trace(self, summary: Optional[Dict[str, Any]] = None) -> None:
+         """End the current trace with summary metrics."""
+         if not self.enabled or not self.langfuse:
+             return
+
+         try:
+             if summary and self.trace_id:
+                 for name, value in summary.items():
+                     if isinstance(value, (int, float)) and not isinstance(value, bool):
+                         self.langfuse.score(
+                             trace_id=self.trace_id,
+                             name=f"summary_{name}",
+                             value=float(value),
+                         )
+             self.langfuse.flush()
+         except Exception as e:
+             logger.warning(f"Failed to end trace: {e}")
+         finally:
+             self.trace_id = None
+
+
+ def is_langfuse_configured() -> bool:
+     """Check if Langfuse environment variables are set."""
+     return bool(
+         os.environ.get("LANGFUSE_PUBLIC_KEY") and
+         os.environ.get("LANGFUSE_SECRET_KEY")
+     )
+
+
+ def get_langfuse_host() -> str:
+     """Get the configured Langfuse host."""
+     return os.environ.get("LANGFUSE_HOST") or os.environ.get("LANGFUSE_BASE_URL") or "https://cloud.langfuse.com"
+
+
+ # ============================================================================
+ # CUGA Agent
+ # ============================================================================
+
+ class CUGAAgent:
+     """CUGA SDK agent for BPO benchmark evaluation."""
+
+     def __init__(
+         self,
+         api_key: str,
+         provider: str = "groq",
+         model: Optional[str] = None,
+     ):
+         """Initialize the CUGA agent.
+
+         Args:
+             api_key: API key for the LLM provider
+             provider: "openai" or "groq"
+             model: Model name (optional, uses defaults)
+         """
+         self.api_key = api_key
+         self.provider = provider.lower()
+         self.model = model
+         self.agent = None
+         self.tool_provider = None
+
+         # Set up environment BEFORE importing cuga modules
+         setup_environment(api_key, self.provider, model)
+
+         # Start servers
+         start_servers()
+
+     async def setup(self):
+         """Initialize the CUGA agent with tools."""
+         from cuga.sdk import CugaAgent
+         from cuga.config import settings
+         from cuga.backend.cuga_graph.nodes.cuga_lite.combined_tool_provider import (
+             CombinedToolProvider,
+         )
+
+         logger.info("Setting up CUGA agent...")
+
+         # Enable ActivityTracker for tool call tracking
+         settings.update({"ADVANCED_FEATURES": {"TRACKER_ENABLED": True}}, merge=True)
+
+         # Initialize tool provider (will load from registry)
+         self.tool_provider = CombinedToolProvider()
+         await self.tool_provider.initialize()
+
+         all_tools = await self.tool_provider.get_all_tools()
+         logger.info(f"Loaded {len(all_tools)} tools from BPO API")
+
+         if len(all_tools) == 0:
+             raise RuntimeError("No tools loaded from registry. Check server status.")
+
+         # Create agent
+         self.agent = CugaAgent(tool_provider=self.tool_provider)
+         logger.info("CUGA agent initialized")
+
+     async def run(self, query: str, thread_id: Optional[str] = None) -> Tuple[str, List[Dict[str, Any]]]:
+         """Run the agent on a query.
+
+         Args:
+             query: The user's question
+             thread_id: Optional thread ID for conversation context
+
+         Returns:
+             Tuple of (response_text, tool_calls)
+         """
+         if self.agent is None:
+             await self.setup()
+
+         from langchain_core.messages import HumanMessage
+
+         # Get ActivityTracker singleton and reset for this task
+         try:
+             from cuga.backend.activity_tracker.tracker import ActivityTracker
+             tracker = ActivityTracker()
+             tracker.reset(intent=query, task_id=thread_id or "eval_task")
+         except ImportError:
+             tracker = None
+             logger.warning("ActivityTracker not available, tool call tracking disabled")
+
+         result = await self.agent.invoke(
+             [HumanMessage(content=query)],
+             thread_id=thread_id or "eval_task",
+             track_tool_calls=True,  # Required for ActivityTracker to capture tool calls
+         )
+
+         # Debug: log result object structure
+         result_attrs = [attr for attr in dir(result) if not attr.startswith('_')]
+         logger.info(f"Result object attributes: {result_attrs}")
+         if hasattr(result, '__dict__'):
+             logger.info(f"Result __dict__ keys: {list(result.__dict__.keys())}")
+
+         # Extract response
+         response = result.answer if hasattr(result, "answer") else str(result)
+
+         # Extract tool calls from ActivityTracker.steps (same approach as sdk_eval_helpers.py)
+         tool_calls = []
+         if tracker is not None:
+             import json
+             logger.info(f"ActivityTracker has {len(tracker.steps)} steps")
+
+             # Debug: log step names to understand structure (first 5 only)
+             step_names = [s.name for s in tracker.steps[:5]]
+             logger.info(f"First step names: {step_names}")
+
+             # Match "api_call" in step name (the standard CUGA SDK pattern)
+             for step in tracker.steps:
+                 if step.name and "api_call" in step.name:
+                     try:
+                         call_data = json.loads(step.data) if step.data else {}
+                         tool_name = call_data.get("function_name", "")
+                         if tool_name:
+                             tool_calls.append({
+                                 "name": tool_name,
+                                 "args": call_data.get("args", {}),
+                             })
+                     except (json.JSONDecodeError, TypeError) as e:
+                         logger.warning(f"Failed to parse tool call step data: {e}")
+                         continue
+
+             logger.info(f"Extracted {len(tool_calls)} tool calls from ActivityTracker")
+
+         # Fallback 1: try to extract from result.tool_calls attribute
+         if not tool_calls and hasattr(result, 'tool_calls') and result.tool_calls:
+             logger.info("Trying fallback: result.tool_calls")
+             for tc in result.tool_calls:
+                 if isinstance(tc, dict):
+                     tool_calls.append({"name": tc.get("name", ""), "args": tc.get("args", {})})
+                 elif hasattr(tc, 'name'):
+                     tool_calls.append({"name": tc.name, "args": getattr(tc, 'args', {})})
+             logger.info(f"Fallback extracted {len(tool_calls)} tool calls")
+
+         return response, tool_calls
+
+     def close(self):
+         """Clean up resources."""
+         pass  # Servers run as daemons, will stop with process
+
+
408
+ # ============================================================================
409
+ # Evaluation Metrics (copied from main repo for standalone use)
410
+ # ============================================================================
411
+
412
+ def normalize_text(text: str) -> str:
413
+ """Normalize text for keyword matching."""
414
+ import unicodedata
415
+
416
+ text = unicodedata.normalize("NFC", text)
417
+ # Replace special spaces
418
+ text = text.replace("\u202f", " ").replace("\u00a0", " ").replace("\u2009", " ")
419
+ # Replace dashes
420
+ text = text.replace("\u2013", "-").replace("\u2014", "-").replace("\u2212", "-")
421
+ text = text.lower()
422
+ # Remove markdown
423
+ text = re.sub(r"[`*_~]", "", text)
424
+ # Replace punctuation except | (for OR alternatives)
425
+ text = re.sub(r"[^\w\s%|]", " ", text)
426
+ # Collapse whitespace
427
+ text = re.sub(r"\s+", " ", text).strip()
428
+ return text
429
+
430
+
431
+ def check_keywords(response: str, expected_keywords: List[str]) -> Dict[str, Any]:
432
+ """Check if expected keywords are present in the response.
433
+
434
+ Supports:
435
+ - OR mechanism: keywords can use "|" to specify alternatives
436
+ - Regex keywords: prefix with "re:" to use regex pattern
437
+
438
+ Args:
439
+ response: Agent's response text
440
+ expected_keywords: List of keywords (can use "|" for OR, "re:" for regex)
441
+
442
+ Returns:
443
+ Dictionary with keyword check results
444
+ """
445
+ if not expected_keywords:
446
+ return {
447
+ "all_found": True,
448
+ "match_rate": 1.0,
449
+ "found_keywords": [],
450
+ "missing_keywords": [],
451
+ "total_keywords": 0,
452
+ "found_count": 0,
453
+ }
454
+
455
+ response_normalized = normalize_text(response)
456
+ found_keywords = []
457
+ missing_keywords = []
458
+
459
+ for keyword in expected_keywords:
460
+ # Regex keyword support
461
+ if keyword.strip().lower().startswith("re:"):
462
+ pattern = keyword.strip()[3:]
463
+ if re.search(pattern, response_normalized, flags=re.IGNORECASE):
464
+ found_keywords.append(keyword)
465
+ else:
466
+ missing_keywords.append(keyword)
467
+ continue
468
+
469
+ keyword_normalized = normalize_text(keyword)
470
+
471
+ # OR alternatives
472
+ if "|" in keyword_normalized:
473
+ alternatives = [alt.strip() for alt in keyword_normalized.split("|")]
474
+ matched = any(alt in response_normalized for alt in alternatives)
475
+ else:
476
+ matched = keyword_normalized.strip() in response_normalized
477
+
478
+ if matched:
479
+ found_keywords.append(keyword)
480
+ else:
481
+ missing_keywords.append(keyword)
482
+
483
+ total = len(expected_keywords)
484
+ found_count = len(found_keywords)
485
+
486
+ return {
487
+ "all_found": len(missing_keywords) == 0,
488
+ "match_rate": found_count / total if total else 1.0,
489
+ "found_keywords": found_keywords,
490
+ "missing_keywords": missing_keywords,
491
+ "total_keywords": total,
492
+ "found_count": found_count,
493
+ }
494
+
495
+
496
+ def compute_string_similarity(predicted: str, expected: str) -> float:
497
+ """Compute string similarity using RapidFuzz token set ratio."""
498
+ try:
499
+ from rapidfuzz import fuzz
500
+ return fuzz.token_set_ratio(predicted.lower(), expected.lower()) / 100.0
501
+ except ImportError:
502
+ from difflib import SequenceMatcher
503
+ return SequenceMatcher(None, predicted.lower(), expected.lower()).ratio()
504
+
505
+
506
+ def compute_exact_match(predicted: str, expected: str) -> bool:
507
+ """Check if predicted exactly matches expected (case-insensitive)."""
508
+ return predicted.strip().lower() == expected.strip().lower()
509
+
510
+
511
+ def compute_final_score(
512
+ exact_match: bool,
513
+ similarity: float,
514
+ llm_judge_score: Optional[float] = None,
515
+ llm_judge_requested: bool = False,
516
+    agent_output: str = "",
+    threshold_exact: float = 0.85,
+    threshold_inexact: float = 0.9,
+    apis_missing: Optional[List[str]] = None,
+    require_api_match: bool = False,
+) -> int:
+    """Compute final binary score for a task.
+
+    This matches the logic in bpo_benchmark/evaluation/metrics.py for consistency.
+
+    Args:
+        exact_match: Whether output exactly matched expected
+        similarity: String similarity score (0.0-1.0)
+        llm_judge_score: Optional LLM judge score (0.0-1.0)
+        llm_judge_requested: True if LLM judge was requested for this evaluation
+        agent_output: The agent's output string (to detect failures)
+        threshold_exact: Threshold when exact match is True
+        threshold_inexact: Threshold when exact match is False
+        apis_missing: List of expected APIs that were not called
+        require_api_match: If True, require apis_missing to be empty to pass
+
+    Returns:
+        1 if task passes, 0 otherwise
+    """
+    import math
+
+    # Check for task failure indicators
+    if not agent_output or (isinstance(agent_output, str) and agent_output.startswith("ERROR:")):
+        return 0
+
+    # Check for missing API calls when API metrics are required
+    if require_api_match and apis_missing:
+        return 0
+
+    # Handle missing/invalid similarity
+    if similarity is None or (isinstance(similarity, float) and math.isnan(similarity)):
+        return 0
+
+    # Determine the threshold based on exact match status
+    threshold = threshold_exact if exact_match else threshold_inexact
+
+    # If LLM judge was requested but failed/unavailable, return 0
+    if llm_judge_requested:
+        if llm_judge_score is None or (isinstance(llm_judge_score, float) and math.isnan(llm_judge_score)):
+            return 0
+        # Judge was requested and available: pass if EITHER score meets threshold
+        if llm_judge_score >= threshold or similarity >= threshold:
+            return 1
+        return 0
+    else:
+        # No judge requested: use similarity only
+        return 1 if similarity >= threshold else 0
+
+
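The pass/fail rule above can be exercised in isolation. `final_score` below is a hypothetical standalone re-statement of the same thresholds and either/or judge logic (not part of the app code):

```python
import math

def final_score(exact_match, similarity, llm_judge_score=None,
                llm_judge_requested=False, t_exact=0.85, t_inexact=0.9):
    # Pick the threshold from the exact-match flag, then require the
    # similarity (or the judge score, when a judge was requested) to clear it.
    if similarity is None or math.isnan(similarity):
        return 0
    threshold = t_exact if exact_match else t_inexact
    if llm_judge_requested:
        if llm_judge_score is None or math.isnan(llm_judge_score):
            return 0
        return 1 if (llm_judge_score >= threshold or similarity >= threshold) else 0
    return 1 if similarity >= threshold else 0
```

Note that a requested-but-failed judge (score `None`) fails the task outright, rather than falling back to similarity.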
+# ============================================================================
+# LLM Judge (for semantic similarity evaluation)
+# ============================================================================
+
+class LLMJudge:
+    """LLM-based semantic judge using Groq's API."""
+
+    def __init__(
+        self,
+        api_key: str,
+        model: str = "llama-3.3-70b-versatile",
+        timeout_s: int = 30,
+    ):
+        self.api_key = api_key
+        self.model = model
+        self.timeout_s = timeout_s
+        self.base_url = "https://api.groq.com"
+
+    @property
+    def name(self) -> str:
+        return f"groq:{self.model}"
+
+    async def judge(
+        self,
+        predicted: str,
+        expected: str,
+        utterance: str = "",
+    ) -> Dict[str, Any]:
+        """Judge similarity between predicted and expected outputs.
+
+        Returns:
+            Dict with score (0.0-1.0), rationale, and metadata
+        """
+        import json
+
+        try:
+            import requests
+        except ImportError:
+            return {"score": None, "rationale": "requests library not available", "metadata": {}}
+
+        # Truncate for cost/speed
+        utterance = str(utterance)[:500]
+        predicted = str(predicted)[:2000]
+        expected = str(expected)[:2000]
+
+        system = (
+            "You are an evaluation judge assessing semantic equivalence between a PREDICTED and EXPECTED answer.\n\n"
+            "Scoring Guidelines:\n"
+            "- Score 1.0: Semantically identical - same meaning, entities, and facts (minor wording differences OK)\n"
+            "- Score 0.8-0.9: Semantically equivalent - same core meaning with slight elaboration or different phrasing\n"
+            "- Score 0.5-0.7: Partially equivalent - same topic but missing key details or extra information\n"
+            "- Score 0.2-0.4: Somewhat related - addresses same question but with different focus or incomplete answer\n"
+            "- Score 0.0-0.1: Unrelated or contradictory - different facts, wrong information, or completely different meaning\n\n"
+            "CRITICAL:\n"
+            "- Focus on SEMANTIC MEANING, not word-for-word matching or formatting\n"
+            "- Both asking for same information (even differently phrased) should score high (0.8-1.0)\n"
+            "- Consider context from the UTTERANCE to understand what's being asked\n"
+            "- Be precise: don't score 0.0 unless answers are truly unrelated/contradictory\n\n"
+            "Return ONLY valid JSON: {\"score\": <number 0.0-1.0>, \"rationale\": \"<explanation>\"}\n"
+        )
+
+        user = (
+            f"UTTERANCE:\n{utterance}\n\n"
+            f"EXPECTED:\n{expected}\n\n"
+            f"PREDICTED:\n{predicted}\n"
+        )
+
+        payload = {
+            "model": self.model,
+            "temperature": 0,
+            "messages": [
+                {"role": "system", "content": system},
+                {"role": "user", "content": user},
+            ],
+        }
+
+        def _do_request() -> Dict[str, Any]:
+            url = f"{self.base_url}/openai/v1/chat/completions"
+            response = requests.post(
+                url,
+                headers={
+                    "Authorization": f"Bearer {self.api_key}",
+                    "Content-Type": "application/json",
+                },
+                json=payload,
+                timeout=self.timeout_s,
+            )
+            response.raise_for_status()
+            return response.json()
+
+        try:
+            data = await asyncio.to_thread(_do_request)
+        except Exception as e:
+            logger.warning(f"LLM judge request failed: {e}")
+            return {"score": None, "rationale": f"Request failed: {e}", "metadata": {}}
+
+        content = (
+            data.get("choices", [{}])[0]
+            .get("message", {})
+            .get("content", "")
+        )
+
+        # Parse JSON response
+        try:
+            parsed = json.loads(content)
+        except Exception:
+            start = content.find("{")
+            end = content.rfind("}")
+            if start == -1 or end == -1 or end <= start:
+                return {"score": None, "rationale": f"Invalid JSON response: {content[:200]}", "metadata": {}}
+            try:
+                parsed = json.loads(content[start:end + 1])
+            except Exception:
+                return {"score": None, "rationale": f"Failed to parse JSON: {content[:200]}", "metadata": {}}
+
+        score = parsed.get("score")
+        if score is not None:
+            score = float(score)
+            score = max(0.0, min(1.0, score))
+
+        rationale = str(parsed.get("rationale", ""))[:1000]
+
+        return {
+            "score": score,
+            "rationale": rationale,
+            "metadata": {"judge": "groq", "model": self.model},
+        }
+
+
+def get_llm_judge(api_key: str, provider: str = "groq") -> Optional[LLMJudge]:
+    """Get an LLM judge instance.
+
+    Args:
+        api_key: API key for the judge provider
+        provider: Currently only "groq" is supported
+
+    Returns:
+        LLMJudge instance or None if not supported
+    """
+    if provider.lower() == "groq":
+        return LLMJudge(api_key=api_key)
+    return None
+
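The two-stage JSON parse in `judge` — a strict parse, then a retry on the slice from the first `{` to the last `}` — can be factored into a small helper. `salvage_json` is a hypothetical name for that fallback, useful because LLMs often wrap JSON in prose or code fences:

```python
import json

def salvage_json(content: str):
    # Strict parse first; on failure, slice from the first '{' to the
    # last '}' and retry. Returns None if nothing parseable remains.
    try:
        return json.loads(content)
    except Exception:
        start, end = content.find("{"), content.rfind("}")
        if start == -1 or end <= start:
            return None
        try:
            return json.loads(content[start:end + 1])
        except Exception:
            return None
```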
713
+
714
+ # ============================================================================
715
+ # API Call Tracking
716
+ # ============================================================================
717
+
718
+ def compare_api_calls(
719
+ called_apis: List[str],
720
+ expected_apis: List[str],
721
+ ) -> Dict[str, Any]:
722
+ """Compare called APIs against expected APIs.
723
+
724
+ Args:
725
+ called_apis: List of API names that were called
726
+ expected_apis: List of expected API names
727
+
728
+ Returns:
729
+ Dict with missing, extra, correct count, and match info
730
+ """
731
+ # Normalize API names for comparison
732
+ # Registry tool names are verbose: bpo_candidate_source_sla_per_source_candidate_source_sla_per_source_requisition_id_get
733
+ # Expected names are short: candidate_source_sla_per_source
734
+ def normalize_api_name(name: str) -> str:
735
+ name = name.lower().strip()
736
+ # Remove app prefix
737
+ if name.startswith("bpo_"):
738
+ name = name[4:]
739
+ # Remove common suffixes (HTTP methods and parameter patterns)
740
+ for suffix in ["_get", "_post", "_put", "_delete"]:
741
+ if name.endswith(suffix):
742
+ name = name[:-len(suffix)]
743
+ for suffix in ["_requisition_id", "_skill_name"]:
744
+ if name.endswith(suffix):
745
+ name = name[:-len(suffix)]
746
+ return name.replace("-", "_").replace(" ", "_")
747
+
748
+ def api_matches(expected: str, actual: str) -> bool:
749
+ """Check if expected API name matches actual (allowing for verbose registry names)."""
750
+ exp_norm = normalize_api_name(expected)
751
+ act_norm = normalize_api_name(actual)
752
+ # Direct match
753
+ if exp_norm == act_norm:
754
+ return True
755
+ # Check if expected is contained in actual (for verbose registry names)
756
+ # e.g., "candidate_source_sla_per_source" in "candidate_source_sla_per_source_candidate_source_sla_per_source"
757
+ if exp_norm in act_norm:
758
+ return True
759
+ return False
760
+
761
+ logger.info(f"[API_TRACKING] Expected APIs: {expected_apis}")
762
+ logger.info(f"[API_TRACKING] Actual APIs: {called_apis}")
763
+
764
+ # Compute API metrics using flexible matching
765
+ missing = []
766
+ for exp_api in expected_apis:
767
+ if not any(api_matches(exp_api, act_api) for act_api in called_apis):
768
+ missing.append(exp_api)
769
+
770
+ extra = []
771
+ for act_api in called_apis:
772
+ if not any(api_matches(exp_api, act_api) for exp_api in expected_apis):
773
+ extra.append(act_api)
774
+
775
+ correct = len(expected_apis) - len(missing)
776
+
777
+ return {
778
+ "missing": missing,
779
+ "extra": extra,
780
+ "correct": correct,
781
+ "expected_count": len(expected_apis),
782
+ "called_count": len(called_apis),
783
+ "all_expected_called": len(missing) == 0,
784
+ }
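The effect of the normalization can be checked standalone. This sketch repeats `normalize_api_name`'s prefix/suffix stripping; the `bpo_` prefix and the suffix lists are specific to this registry's naming scheme:

```python
def normalize(name: str) -> str:
    # Strip the app prefix, then an HTTP-method suffix, then a
    # parameter-pattern suffix, then unify separators.
    name = name.lower().strip()
    if name.startswith("bpo_"):
        name = name[4:]
    for suffix in ("_get", "_post", "_put", "_delete"):
        if name.endswith(suffix):
            name = name[:-len(suffix)]
    for suffix in ("_requisition_id", "_skill_name"):
        if name.endswith(suffix):
            name = name[:-len(suffix)]
    return name.replace("-", "_").replace(" ", "_")
```

After normalization, a verbose registry name like `bpo_..._requisition_id_get` reduces far enough that the short expected name matches by equality or substring containment.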
api_candidate_source.py ADDED
@@ -0,0 +1,385 @@
+"""
+Candidate source APIs - compute metrics from actual data.
+
+AUTO-GENERATED by scripts/generate_hf.sh - DO NOT EDIT DIRECTLY
+Edit candidate_source.py in main repo and regenerate.
+"""
+
+from typing import Dict, List, Any, Optional, Union
+import pandas as pd
+from loguru import logger
+from data_loader import get_data_loader
+from models import (
+    RequisitionNotFoundResponse,
+    SLAPerSourceResponse,
+    TotalHiresBySourceResponse,
+    CandidateVolumeResponse,
+    FunnelConversionResponse,
+    MetadataResponse,
+    DefinitionsResponse,
+    SourceRecommendationResponse,
+)
+BPO_LOG_API_CALLS = False  # Disabled for deployment
+
+
+def _log_api_call(msg: str) -> None:
+    """Log API call if BPO_LOG_API_CALLS is enabled."""
+    if BPO_LOG_API_CALLS:
+        logger.info(msg)
+
+
+def _check_requisition_valid(requisition_id: str) -> Optional[RequisitionNotFoundResponse]:
+    """
+    Check if a requisition ID is valid. Returns None if valid,
+    or an error response model if invalid.
+    """
+    loader = get_data_loader()
+    if not loader.is_valid_requisition(requisition_id):
+        suggestions = loader.get_suggested_requisitions(requisition_id)
+        return RequisitionNotFoundResponse(
+            error="requisition_not_found",
+            message=f"No job can be found with the ID {requisition_id}.",
+            suggested_requisition_ids=suggestions,
+        )
+    return None
+
+
+def get_sla_per_source(requisition_id: str) -> Union[SLAPerSourceResponse, RequisitionNotFoundResponse]:
+    """
+    Retrieves the SLA percentage for each sourcing channel.
+
+    Args:
+        requisition_id: The specific requisition ID to filter SLA data for.
+
+    Returns:
+        A dictionary with source names and their SLA percentages.
+    """
+    _log_api_call(f"API call: get_sla_per_source(requisition_id={requisition_id})")
+
+    # Check if requisition ID is valid
+    error = _check_requisition_valid(requisition_id)
+    if error:
+        return error
+
+    loader = get_data_loader()
+    data = loader.get_similar_requisitions(requisition_id)
+
+    # Filter to only reviewed candidates (SLA only applies to reviewed candidates)
+    reviewed_data = data[data['reviewed']]
+
+    # Group by source and calculate SLA met percentage
+    sla_by_source = reviewed_data.groupby('source_name').agg(
+        total=('sla_met', 'count'),
+        sla_met=('sla_met', 'sum')
+    )
+    sla_by_source['sla_percentage'] = (sla_by_source['sla_met'] / sla_by_source['total'] * 100).round(0).astype(int)
+
+    metrics = [
+        {
+            "source_name": source,
+            "sla_percentage": int(row['sla_percentage'])
+        }
+        for source, row in sla_by_source.iterrows()
+    ]
+
+    # Sort by SLA percentage (ascending) for consistency
+    metrics.sort(key=lambda x: x['sla_percentage'])
+
+    return SLAPerSourceResponse(metrics=metrics)
+
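The groupby/agg step above can be tried on a toy frame. The values here are hypothetical stand-ins for the loader's candidate data; the computation is the same named aggregation plus the rounded-percentage column:

```python
import pandas as pd

# Toy candidate data: two sources, with one unreviewed LinkedIn row
# that must be excluded before computing SLA.
df = pd.DataFrame({
    "source_name": ["Dice", "Dice", "LinkedIn", "LinkedIn"],
    "reviewed":    [True,   True,   True,       False],
    "sla_met":     [True,   False,  True,       False],
})
reviewed = df[df["reviewed"]]
sla = reviewed.groupby("source_name").agg(
    total=("sla_met", "count"),
    sla_met=("sla_met", "sum"),
)
sla["sla_percentage"] = (sla["sla_met"] / sla["total"] * 100).round(0).astype(int)
```

Dice has one of two reviewed candidates meeting SLA (50%), while LinkedIn's single reviewed candidate met it (100%).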
+
+def get_total_hires_by_source(requisition_id: str) -> Union[TotalHiresBySourceResponse, RequisitionNotFoundResponse]:
+    """
+    Retrieves the total number of hires per sourcing channel.
+
+    Args:
+        requisition_id: The specific requisition ID to filter hiring data for.
+
+    Returns:
+        A dictionary with source names and total hires.
+    """
+    _log_api_call(f"API call: get_total_hires_by_source(requisition_id={requisition_id})")
+
+    # Check if requisition ID is valid
+    error = _check_requisition_valid(requisition_id)
+    if error:
+        return error
+
+    loader = get_data_loader()
+    data = loader.get_similar_requisitions(requisition_id)
+
+    # Count hires by source
+    hires_by_source = data[data['hired']].groupby('source_name').size()
+
+    metrics = [
+        {
+            "source_name": source,
+            "total_hires": int(count)
+        }
+        for source, count in hires_by_source.items()
+    ]
+
+    # Sort by total hires (descending)
+    metrics.sort(key=lambda x: x['total_hires'], reverse=True)
+
+    total_hires = int(data['hired'].sum())
+
+    return TotalHiresBySourceResponse(
+        job_id=requisition_id,
+        metrics=metrics,
+        total_hires=total_hires,
+    )
+
+
+def get_candidate_volume_by_source(
+    requisition_id: str,
+    sources: Optional[List[str]] = None
+) -> Union[CandidateVolumeResponse, RequisitionNotFoundResponse]:
+    """
+    Retrieves candidate volume per sourcing channel.
+
+    Args:
+        requisition_id: The specific requisition ID to filter candidate volume.
+        sources: Optional subset of sourcing channels to include (case-sensitive).
+
+    Returns:
+        A dictionary with source names and candidate volumes.
+    """
+    _log_api_call(f"API call: get_candidate_volume_by_source(requisition_id={requisition_id}, sources={sources})")
+
+    # Check if requisition ID is valid
+    error = _check_requisition_valid(requisition_id)
+    if error:
+        return error
+
+    loader = get_data_loader()
+    data = loader.get_similar_requisitions(requisition_id)
+
+    total_volume = len(data)
+
+    # Count candidates by source
+    volume_by_source = data.groupby('source_name').size()
+
+    metrics = [
+        {
+            "source_name": source,
+            "candidate_volume": int(count),
+            "percentage": int(round(count / total_volume * 100))
+        }
+        for source, count in volume_by_source.items()
+    ]
+
+    # Filter by sources if provided
+    if sources:
+        metrics = [m for m in metrics if m['source_name'] in sources]
+
+    # Sort by volume (descending)
+    metrics.sort(key=lambda x: x['candidate_volume'], reverse=True)
+
+    return CandidateVolumeResponse(
+        job_id=requisition_id,
+        total_candidate_volume=total_volume,
+        metrics=metrics,
+        heading=(
+            f"For requisitions similar to {requisition_id}, there were {total_volume} candidates over "
+            "the past three years. Here's how many candidates came from each source "
+            "(with percentages from the total number):"
+        ),
+    )
+
+
+def get_funnel_conversion_by_source(requisition_id: str) -> Union[FunnelConversionResponse, RequisitionNotFoundResponse]:
+    """
+    Retrieves conversion rates at each funnel stage for each sourcing channel.
+
+    Args:
+        requisition_id: The specific requisition ID to filter funnel data for.
+
+    Returns:
+        A dictionary with review %, interview rate, and offer acceptance rate.
+    """
+    _log_api_call(f"API call: get_funnel_conversion_by_source(requisition_id={requisition_id})")
+
+    # Check if requisition ID is valid
+    error = _check_requisition_valid(requisition_id)
+    if error:
+        return error
+
+    loader = get_data_loader()
+    data = loader.get_similar_requisitions(requisition_id)
+
+    metrics = []
+    for source in data['source_name'].unique():
+        source_data = data[data['source_name'] == source]
+        total = len(source_data)
+
+        if total == 0:
+            continue
+
+        reviewed = source_data['reviewed'].sum()
+        interviewed = source_data['interviewed'].sum()
+        offered = source_data['offer_extended'].sum()
+
+        metrics.append({
+            "source_name": source,
+            "first_round_review_percentage": round(reviewed / total * 100, 1),
+            "interview_rate": round(interviewed / total * 100, 1),
+            "offer_acceptance_rate": round(offered / total * 100, 1),
+        })
+
+    # Sort by source name for consistency
+    metrics.sort(key=lambda x: x['source_name'])
+
+    return FunnelConversionResponse(
+        job_id=requisition_id,
+        metrics=metrics,
+    )
+
+
+def get_metadata_and_timeframe(requisition_id: str) -> Union[MetadataResponse, RequisitionNotFoundResponse]:
+    """
+    Retrieves metadata including data timeframe, last update date, and the
+    number of requisitions analysed.
+
+    Args:
+        requisition_id: The job requisition ID.
+
+    Returns:
+        A dictionary containing timeframe and requisition summary.
+    """
+    _log_api_call(f"API call: get_metadata_and_timeframe(requisition_id={requisition_id})")
+
+    # Check if requisition ID is valid
+    error = _check_requisition_valid(requisition_id)
+    if error:
+        return error
+
+    loader = get_data_loader()
+    data = loader.get_similar_requisitions(requisition_id)
+
+    # Get date range from applied_at column
+    min_date = data['applied_at'].min()
+    max_date = data['applied_at'].max()
+
+    # Count unique requisitions
+    num_requisitions = data['requisition_id'].nunique()
+
+    # Static dates for reproducible benchmarking
+    # Use actual dates from data but with last_updated fixed for stability
+    return MetadataResponse(
+        job_id=requisition_id,
+        time_frame_start="2023-10-09",
+        time_frame_end="2025-03-15",
+        data_last_updated="2025-04-29",
+        total_requisitions_analysed=num_requisitions,
+    )
+
+
+def get_definitions_and_methodology(requisition_id: str) -> Union[DefinitionsResponse, RequisitionNotFoundResponse]:
+    """
+    Provides definitions of key metrics and outlines the methodology used
+    to calculate performance.
+
+    Args:
+        requisition_id: The specific requisition ID for context.
+
+    Returns:
+        A dictionary including metric definitions, calculation notes,
+        and the top metrics considered.
+    """
+    _log_api_call(f"API call: get_definitions_and_methodology(requisition_id={requisition_id})")
+
+    # Check if requisition ID is valid
+    error = _check_requisition_valid(requisition_id)
+    if error:
+        return error
+
+    loader = get_data_loader()
+    data = loader.get_similar_requisitions(requisition_id)
+
+    # Report total requisitions in dataset (full analysis framework)
+    num_total_requisitions = loader.data['requisition_id'].nunique()
+    min_date = data['applied_at'].min()
+    max_date = data['applied_at'].max()
+    years = (max_date - min_date).days / 365.25
+
+    return DefinitionsResponse(
+        job_id=requisition_id,
+        definitions={
+            "sla": "Percentage of candidates reviewed within the defined SLA window (e.g., 48 hours)",
+            "time_to_fill": "Average time from job posting to accepted offer",
+            "success_rate": "Ratio of candidates who accepted offers out of those interviewed",
+        },
+        calculation_notes=(
+            f"Metrics are computed from {num_total_requisitions} requisitions over the last {years:.1f} years. "
+            "Funnel stats are based on system timestamps and recruiter actions in ATS."
+        ),
+        top_metrics_considered=[
+            "SLA %",
+            "First round review %",
+            "Offer acceptance rate",
+            "Candidate volume",
+            "Total hires",
+        ],
+    )
+
+
+def get_source_recommendation_summary(requisition_id: str) -> Union[SourceRecommendationResponse, RequisitionNotFoundResponse]:
+    """
+    Returns a high-level summary combining jobs-filled %, review %, offer-accept
+    rate, and total hires for each source.
+
+    Args:
+        requisition_id: The job requisition ID.
+
+    Returns:
+        A dictionary with composite source metrics.
+    """
+    _log_api_call(f"API call: get_source_recommendation_summary(requisition_id={requisition_id})")
+
+    # Check if requisition ID is valid
+    error = _check_requisition_valid(requisition_id)
+    if error:
+        return error
+
+    loader = get_data_loader()
+    data = loader.get_similar_requisitions(requisition_id)
+
+    num_requisitions = data['requisition_id'].nunique()
+
+    metrics = []
+    for source in data['source_name'].unique():
+        source_data = data[data['source_name'] == source]
+        total = len(source_data)
+
+        if total == 0:
+            continue
+
+        # Calculate metrics
+        reviewed = source_data['reviewed'].sum()
+        hired = source_data['hired'].sum()
+
+        # Jobs filled percentage: what % of requisitions had at least 1 hire from this source
+        reqs_with_hires = source_data[source_data['hired']]['requisition_id'].nunique()
+        jobs_filled_pct = int(reqs_with_hires / num_requisitions * 100)
+
+        # Offer acceptance rate: of those who got offers, how many accepted?
+        offers = source_data['offer_extended'].sum()
+        accepted = source_data['offer_accepted'].sum()
+        offer_accept_rate = round(accepted / offers * 100) if offers > 0 else 0
+
+        metrics.append({
+            "source_name": source,
+            "jobs_filled_percentage": jobs_filled_pct,
+            "first_round_review_percentage": int(reviewed / total * 100),
+            "offer_acceptance_rate": offer_accept_rate,
+            "total_hires": int(hired),
+        })
+
+    # Sort by source name
+    metrics.sort(key=lambda x: x['source_name'])
+
+    return SourceRecommendationResponse(
+        total_requisitions=num_requisitions,
+        metrics=metrics,
+    )
api_candidate_source_error.py ADDED
@@ -0,0 +1,495 @@
+"""
+Error-prone candidate source API variants for testing agent resilience.
+
+Each function has a unique, plausible intent and embeds a specific error behavior.
+Completely independent from original APIs — accesses DataLoader directly.
+
+AUTO-GENERATED by scripts/generate_hf.sh - DO NOT EDIT DIRECTLY
+Edit candidate_source_error.py in main repo and regenerate.
+"""
+
+import json
+import random
+from pathlib import Path
+from typing import Any, Dict, List, Optional
+
+from data_loader import get_data_loader
+
+# Seeded RNG for reproducible probabilistic behavior
+_rng = random.Random(42)
+
+# Track call counts for rate-limiting behavior
+_call_counts: Dict[str, int] = {}
+
+
+def _check_requisition(requisition_id: str) -> Optional[Dict[str, Any]]:
+    """Return error dict if requisition invalid, else None."""
+    loader = get_data_loader()
+    if not loader.is_valid_requisition(requisition_id):
+        return {
+            "error": "requisition_not_found",
+            "message": f"Requisition {requisition_id} not found",
+        }
+    return None
+
+
+# ── Test 28: Type mismatch — int instead of float ───────────────────────────
+
+def get_source_sla_score(requisition_id: str, source_name: str = "Dice") -> Any:
+    """Get the SLA score for a specific sourcing channel.
+
+    Returns the SLA achievement score for the given source.
+
+    ERROR BEHAVIOR: Returns int (e.g., 80) instead of float (e.g., 80.0).
+    Tests type handling for numeric values.
+    """
+    err = _check_requisition(requisition_id)
+    if err:
+        return err
+
+    loader = get_data_loader()
+    data = loader.get_similar_requisitions(requisition_id)
+    reviewed = data[(data["reviewed"]) & (data["source_name"] == source_name)]
+
+    if len(reviewed) == 0:
+        return {"error": "no_data", "message": f"No reviewed candidates from {source_name}"}
+
+    sla_pct = int(round(reviewed["sla_met"].mean() * 100))
+    return sla_pct  # Returns bare int instead of {"sla_score": 80.0}
+
+
+# ── Test 29: Type mismatch — None instead of empty list ─────────────────────
+
+def get_inactive_sources(requisition_id: str) -> Any:
+    """Show any inactive sourcing channels with no candidates.
+
+    Returns a list of sources that produced zero candidates.
+
+    ERROR BEHAVIOR: Returns None instead of empty list when no
+    inactive sources exist. Tests null handling.
+    """
+    err = _check_requisition(requisition_id)
+    if err:
+        return err
+
+    loader = get_data_loader()
+    data = loader.get_similar_requisitions(requisition_id)
+    active_sources = set(data["source_name"].unique())
+
+    # Use all sources from the full dataset as the reference set
+    all_possible = set(loader.data["source_name"].unique())
+    inactive = all_possible - active_sources
+
+    if not inactive:
+        return None  # Returns None instead of []
+
+    return list(inactive)
+
+
+# ── Test 30: HTTP 404 (probabilistic, 20% chance) ───────────────────────────
+
+def get_candidate_pipeline_status(requisition_id: str) -> Dict[str, Any]:
+    """Get candidate pipeline status for a requisition.
+
+    Returns current pipeline status showing candidate distribution by source.
+
+    ERROR BEHAVIOR: 20% chance of returning a 404-style error dict.
+    Tests retry logic and error recovery.
+    """
+    if _rng.random() < 0.2:
+        return {
+            "status_code": 404,
+            "error": True,
+            "message": "Resource temporarily unavailable",
+        }
+
+    err = _check_requisition(requisition_id)
+    if err:
+        return err
+
+    loader = get_data_loader()
+    data = loader.get_similar_requisitions(requisition_id)
+    volume_by_source = data.groupby("source_name").size().to_dict()
+
+    return {
+        "requisition_id": requisition_id,
+        "pipeline": {k: int(v) for k, v in volume_by_source.items()},
+        "total_candidates": int(len(data)),
+    }
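Because the module RNG is seeded (`random.Random(42)`), the 20% failure injection is reproducible across runs. A minimal sketch of the same pattern, with a hypothetical `maybe_404` standing in for the full endpoint:

```python
import random

rng = random.Random(42)  # fixed seed, as in the module above

def maybe_404():
    # Roughly 20% of calls return a 404-style dict; the exact sequence
    # of successes and failures is identical on every run.
    if rng.random() < 0.2:
        return {"status_code": 404, "error": True}
    return {"status_code": 200}
```

Seeding keeps benchmark runs comparable: two evaluations of the same agent hit the injected failures at the same call positions.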
+
+
+# ── Test 31: HTTP 500 with valid body ────────────────────────────────────────
+
+def get_source_sla_check(requisition_id: str) -> Dict[str, Any]:
+    """Run a quick SLA status check across all sourcing channels.
+
+    Returns SLA metrics per source for rapid status assessment.
+
+    ERROR BEHAVIOR: Returns HTTP 500-style error dict but includes valid
+    data in the body. Tests agent ability to use response body despite
+    error status code.
+    """
+    err = _check_requisition(requisition_id)
+    if err:
+        return err
+
+    loader = get_data_loader()
+    data = loader.get_similar_requisitions(requisition_id)
+    reviewed = data[data["reviewed"]]
+    metrics = []
+    for source, group in reviewed.groupby("source_name"):
+        sla_pct = int(round(group["sla_met"].mean() * 100))
+        metrics.append({"source_name": source, "sla_percentage": sla_pct})
+
+    return {
+        "status_code": 500,
+        "error": True,
+        "message": "Internal server error",
+        "body": {"metrics": metrics},
+    }
+
+
+# ── Test 32: HTTP 503 Service Unavailable ────────────────────────────────────
+
+def get_funnel_status(requisition_id: str) -> Dict[str, Any]:
+    """Get the current funnel status for a requisition.
+
+    Returns real-time funnel pipeline status showing conversion at each stage.
+
+    ERROR BEHAVIOR: Always returns 503 with retry-after info.
+    Tests service unavailable handling.
+    """
+    return {
+        "status_code": 503,
+        "error": True,
+        "message": "Service temporarily unavailable. The funnel analytics engine is undergoing maintenance.",
+        "retry_after_seconds": 300,
+        "expected_recovery": "2025-05-01T12:00:00Z",
+    }
+
+
+# ── Test 33: HTTP 429 Rate Limited ──────────────────────────────────────────
+
+def get_bulk_source_data(requisition_id: str) -> Dict[str, Any]:
+    """Pull bulk source data for all requisitions.
+
+    Returns comprehensive source data across all requisitions in the system.
+
+    ERROR BEHAVIOR: Returns 429 after 3rd call (tracked via module-level counter).
+    Tests rate limit handling.
+    """
+    key = "get_bulk_source_data"
+    _call_counts[key] = _call_counts.get(key, 0) + 1
+
+    if _call_counts[key] > 3:
+        return {
+            "status_code": 429,
+            "error": True,
+            "message": "Rate limit exceeded. Maximum 3 calls per session.",
+            "retry_after_seconds": 60,
+            "limit": 3,
+            "remaining": 0,
+        }
+
+    err = _check_requisition(requisition_id)
+    if err:
+        return err
+
+    loader = get_data_loader()
+    data = loader.get_similar_requisitions(requisition_id)
+    summary = {}
+    for source, group in data.groupby("source_name"):
+        summary[source] = {
+            "total_candidates": int(len(group)),
+            "total_hires": int(group["hired"].sum()),
+            "reviewed": int(group["reviewed"].sum()),
+        }
+
+    return {
+        "requisition_id": requisition_id,
+        "sources": summary,
+        "call_number": _call_counts[key],
+    }
+
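The session-scoped counter behind the 429 behavior can be sketched independently; `limited` is a hypothetical helper mirroring the module-level `_call_counts` pattern:

```python
_calls = {}

def limited(key, limit=3):
    # The first `limit` calls for a key succeed; every later call
    # returns a 429-style dict until the process (session) restarts.
    _calls[key] = _calls.get(key, 0) + 1
    if _calls[key] > limit:
        return {"status_code": 429, "remaining": 0}
    return {"status_code": 200, "call_number": _calls[key]}
```

Since the counter lives at module scope, the limit resets only when the process restarts, which is exactly the "per session" semantics the error message describes.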
+
+# ── Test 36: Missing required fields ────────────────────────────────────────
+
+def get_source_metrics_lite(requisition_id: str) -> Dict[str, Any]:
+    """Get a lightweight summary of source metrics.
+
+    Returns a compact view of per-source metrics for quick analysis.
+
+    ERROR BEHAVIOR: Response missing `source_name` field in metrics entries.
+    Tests agent handling of incomplete/partial data.
+    """
+    err = _check_requisition(requisition_id)
+    if err:
+        return err
+
+    loader = get_data_loader()
+    data = loader.get_similar_requisitions(requisition_id)
+    metrics = []
+    for source, group in data.groupby("source_name"):
+        # Intentionally omit source_name
+        metrics.append({
+            "candidate_count": int(len(group)),
+            "hire_count": int(group["hired"].sum()),
+            "sla_met_count": int(group[group["reviewed"]]["sla_met"].sum()),
+        })
+
+    return {
+        "requisition_id": requisition_id,
+        "metrics": metrics,
+        "note": "Lightweight view — some fields may be omitted for performance.",
+    }
+
+
+# ── Test 37: Wrong field types in response ──────────────────────────────────
+
+def get_volume_report(requisition_id: str) -> Dict[str, Any]:
+    """Generate a volume report for a requisition.
+
+    Returns candidate volume statistics broken down by source.
+
+    ERROR BEHAVIOR: `candidate_count` returned as string instead of int.
+    Tests type coercion handling.
+    """
+    err = _check_requisition(requisition_id)
+    if err:
+        return err
+
+    loader = get_data_loader()
+    data = loader.get_similar_requisitions(requisition_id)
+    metrics = []
+    for source, group in data.groupby("source_name"):
+        metrics.append({
+            "source_name": source,
+            "candidate_count": str(len(group)),  # String instead of int
+            "hire_count": str(int(group["hired"].sum())),  # String instead of int
+            "review_rate": f"{group['reviewed'].mean() * 100:.1f}%",
+        })
+
+    return {
+        "requisition_id": requisition_id,
+        "metrics": metrics,
+        "total_candidates": str(len(data)),  # String instead of int
+    }
+
+
+# ── Test 38: Large response (1000 records) ──────────────────────────────────
+
+def get_full_candidate_details(requisition_id: str) -> Dict[str, Any]:
+    """Get full candidate details for a requisition.
+
+    Returns comprehensive candidate-level data for detailed analysis.
+
+    ERROR BEHAVIOR: Response contains 1000 pre-generated candidate records.
+    Tests agent handling of large payloads.
+    """
+    err = _check_requisition(requisition_id)
+    if err:
+        return err
+
+    # Load pre-generated fixture
+    fixture_paths = [
+        Path(__file__).parent.parent.parent / "data" / "large_response_fixture.json",
+        Path("./data/large_response_fixture.json"),
+    ]
+
+    for path in fixture_paths:
+        if path.exists():
+            with open(path, "r") as f:
+                records = json.load(f)
+            return {
+                "requisition_id": requisition_id,
+                "total_records": len(records),
+                "candidates": records,
+            }
+
+    # Fallback: generate minimal records if fixture missing
+    return {
+        "requisition_id": requisition_id,
+        "total_records": 0,
+        "candidates": [],
+        "warning": "Large response fixture not found",
+    }
+
+
+# ── Test 39: Unicode and special characters ─────────────────────────────────
+
+def get_source_directory(requisition_id: str) -> Dict[str, Any]:
+    """Show the source directory for a requisition.
+
+    Returns a directory listing of all sourcing channels with their metadata.
+
+    ERROR BEHAVIOR: Source names contain emoji, CJK characters, Arabic text.
+    Tests unicode handling.
+    """
+    err = _check_requisition(requisition_id)
+    if err:
+        return err
+
+    return {
+        "requisition_id": requisition_id,
+        "sources": [
+            {"name": "LinkedIn \U0001F4BC", "region": "Global", "status": "active"},
+            {"name": "Dice \U0001F3B2", "region": "North America", "status": "active"},
+            {"name": "\u62db\u8058\u7f51 (Zhaopin)", "region": "\u4e2d\u56fd", "status": "active"},
338
+ {"name": "\u0628\u064a\u062a.\u0643\u0648\u0645 (Bayt)", "region": "\u0627\u0644\u0634\u0631\u0642 \u0627\u0644\u0623\u0648\u0633\u0637", "status": "active"},
339
+ {"name": "GitHub \U0001F431\u200D\U0001F4BB", "region": "Global", "status": "active"},
340
+ {"name": "R\u00e9f\u00e9rence\u2122", "region": "Europe", "status": "inactive"},
341
+ {"name": "\u2605 Top Talent \u2605", "region": "APAC", "status": "active"},
342
+ ],
343
+ "total_sources": 7,
344
+ }
345
+
346
+
347
+ # ── Test 41: Extra undocumented fields (20 extra fields) ─────────────────────
348
+
349
+ def get_sla_extended(requisition_id: str, source_name: str = "Dice") -> Dict[str, Any]:
350
+ """Get extended SLA data for a specific sourcing channel.
351
+
352
+ Returns SLA metrics with additional analytics for the given source.
353
+
354
+ ERROR BEHAVIOR: Response includes 20 undocumented extra fields
355
+ beyond what the schema describes. Tests agent ability to ignore
356
+ noise and extract relevant data.
357
+ """
358
+ err = _check_requisition(requisition_id)
359
+ if err:
360
+ return err
361
+
362
+ loader = get_data_loader()
363
+ data = loader.get_similar_requisitions(requisition_id)
364
+ source_data = data[(data["reviewed"]) & (data["source_name"] == source_name)]
365
+
366
+ sla_pct = int(round(source_data["sla_met"].mean() * 100)) if len(source_data) > 0 else 0
367
+
368
+ return {
369
+ "requisition_id": requisition_id,
370
+ "source_name": source_name,
371
+ "sla_percentage": sla_pct,
372
+ # Undocumented extra fields
373
+ "_internal_id": "sla-ext-7f3a9b2c",
374
+ "_cache_ttl": 3600,
375
+ "_version": "2.3.1",
376
+ "_debug_query_ms": 42,
377
+ "_shard_id": 3,
378
+ "_region": "us-east-1",
379
+ "_feature_flags": ["sla_v2", "extended_metrics"],
380
+ "_experiment_group": "control",
381
+ "_sampling_rate": 0.95,
382
+ "_data_quality_score": 0.98,
383
+ "_last_recomputed": "2025-04-29T03:00:00Z",
384
+ "_computation_engine": "spark-3.5",
385
+ "_model_version": "sla-impact-v1.4.2",
386
+ "_confidence_interval": [sla_pct - 3, sla_pct + 3],
387
+ "_p_value": 0.023,
388
+ "_sample_size": int(len(source_data)),
389
+ "_outliers_removed": 2,
390
+ "_normalization_method": "min-max",
391
+ "_correlation_with_hires": 0.67,
392
+ "_seasonal_adjustment": True,
393
+ }
394
+
395
+
396
+ # ── Test 43: Undocumented error format ──────────────────────────────────────
397
+
398
+ def get_requisition_details(requisition_id: str) -> Dict[str, Any]:
399
+ """Get detailed information for a specific requisition.
400
+
401
+ Returns comprehensive requisition metadata and status.
402
+
403
+ ERROR BEHAVIOR: Returns non-standard error format `{"err": "not_found"}`
404
+ instead of the standard `RequisitionNotFoundResponse`.
405
+ Tests non-standard error parsing.
406
+ """
407
+ loader = get_data_loader()
408
+ if not loader.is_valid_requisition(requisition_id):
409
+ return {"err": "not_found", "req": requisition_id}
410
+
411
+ data = loader.get_by_requisition(requisition_id)
412
+ row = data.iloc[0]
413
+
414
+ return {
415
+ "requisition_id": requisition_id,
416
+ "department": str(row.get("department", "Unknown")),
417
+ "seniority_level": str(row.get("seniority_level", "Unknown")),
418
+ "total_candidates": int(len(data)),
419
+ "sources_used": list(data["source_name"].unique()),
420
+ }
421
+
422
+
423
+ # ── Test 44: Undocumented pagination ─────────────────────────────────────────
424
+
425
+ def list_all_sources(requisition_id: str) -> Dict[str, Any]:
426
+ """List all available sourcing channels.
427
+
428
+ Returns a paginated list of all sourcing channels in the system.
429
+
430
+ ERROR BEHAVIOR: Response includes `next_page` token not described
431
+ in any schema. Tests pagination detection and handling.
432
+ """
433
+ err = _check_requisition(requisition_id)
434
+ if err:
435
+ return err
436
+
437
+ loader = get_data_loader()
438
+ data = loader.get_similar_requisitions(requisition_id)
439
+ sources = sorted(data["source_name"].unique())
440
+
441
+ # Return first 3 with pagination token
442
+ page_size = 3
443
+ page = sources[:page_size]
444
+
445
+ result: Dict[str, Any] = {
446
+ "requisition_id": requisition_id,
447
+ "sources": [{"name": s, "index": i} for i, s in enumerate(page)],
448
+ "total_count": len(sources),
449
+ "page_size": page_size,
450
+ "page": 1,
451
+ }
452
+
453
+ if len(sources) > page_size:
454
+ result["next_page"] = "eyJvZmZzZXQiOjMsInJlcV9pZCI6IjA1OTU4QlIifQ=="
455
+ result["has_more"] = True
456
+ else:
457
+ result["has_more"] = False
458
+
459
+ return result
460
+
461
+
462
+ # ── Test 45: Undocumented rate limiting headers ──────────────────────────────
463
+
464
+ def get_batch_metrics(requisition_id: str) -> Dict[str, Any]:
465
+ """Fetch batch metrics for all sourcing channels.
466
+
467
+ Returns aggregated metrics across all sources with rate limit information.
468
+
469
+ ERROR BEHAVIOR: Response includes X-RateLimit style headers embedded
470
+ in the response body. Tests rate limit awareness.
471
+ """
472
+ err = _check_requisition(requisition_id)
473
+ if err:
474
+ return err
475
+
476
+ loader = get_data_loader()
477
+ data = loader.get_similar_requisitions(requisition_id)
478
+
479
+ metrics = {}
480
+ for source, group in data.groupby("source_name"):
481
+ metrics[source] = {
482
+ "candidates": int(len(group)),
483
+ "hires": int(group["hired"].sum()),
484
+ "reviewed": int(group["reviewed"].sum()),
485
+ }
486
+
487
+ return {
488
+ "requisition_id": requisition_id,
489
+ "metrics": metrics,
490
+ # Rate limit info embedded in response body
491
+ "X-RateLimit-Limit": 100,
492
+ "X-RateLimit-Remaining": 97,
493
+ "X-RateLimit-Reset": "2025-05-01T00:00:00Z",
494
+ "X-RateLimit-Window": "1h",
495
+ }
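The `get_volume_report` endpoint above deliberately returns numeric fields as strings (e.g. `"candidate_count": "17"`, `"review_rate": "61.5%"`). A minimal sketch of how a consuming agent might defensively normalize such a payload; the helpers `coerce_numeric` and `normalize_metrics` are illustrative and not part of this repo:

```python
def coerce_numeric(value):
    """Best-effort conversion of a response field to int or float.

    Handles plain numeric strings ("17") and percent strings ("61.5%");
    anything non-numeric is returned unchanged.
    """
    if isinstance(value, (int, float)):
        return value
    if isinstance(value, str):
        text = value.strip().rstrip("%")
        try:
            return int(text)
        except ValueError:
            try:
                return float(text)
            except ValueError:
                return value  # leave non-numeric strings untouched
    return value


def normalize_metrics(entry):
    """Apply coerce_numeric to every field of a metrics dict."""
    return {key: coerce_numeric(val) for key, val in entry.items()}


print(normalize_metrics(
    {"source_name": "Dice", "candidate_count": "17", "review_rate": "61.5%"}
))
# → {'source_name': 'Dice', 'candidate_count': 17, 'review_rate': 61.5}
```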
api_skills.py ADDED
@@ -0,0 +1,328 @@
+ """
2
+ Skills APIs - compute skill-related metrics from actual data.
3
+
4
+ AUTO-GENERATED by scripts/generate_hf.sh - DO NOT EDIT DIRECTLY
5
+ Edit skills.py in main repo and regenerate.
6
+ """
7
+
8
+ from typing import Dict, List, Any, Optional, Union
9
+ import pandas as pd
10
+ from loguru import logger
11
+ from data_loader import get_data_loader
12
+ from models import (
13
+ RequisitionNotFoundResponse,
14
+ SkillAnalysisResponse,
15
+ SkillImpactFillRateResponse,
16
+ SkillImpactSLAResponse,
17
+ SkillRelevanceResponse,
18
+ SuccessfulPostingResponse,
19
+ DataSourcesResponse,
20
+ SkillJustificationData,
21
+ SkillJustificationImpact,
22
+ SuccessCriteria,
23
+ )
24
+ BPO_LOG_API_CALLS = False # Disabled for deployment
25
+
26
+
27
+ def _log_api_call(msg: str) -> None:
28
+ """Log API call if BPO_LOG_API_CALLS is enabled."""
29
+ if BPO_LOG_API_CALLS:
30
+ logger.info(msg)
31
+
32
+
33
+ def _check_requisition_valid(requisition_id: str) -> Optional[RequisitionNotFoundResponse]:
34
+ """
35
+ Check if a requisition ID is valid. Returns None if valid,
36
+ or an error response model if invalid.
37
+ """
38
+ loader = get_data_loader()
39
+ if not loader.is_valid_requisition(requisition_id):
40
+ suggestions = loader.get_suggested_requisitions(requisition_id)
41
+ return RequisitionNotFoundResponse(
42
+ error="requisition_not_found",
43
+ message=f"No job can be found with the ID {requisition_id}.",
44
+ suggested_requisition_ids=suggestions,
45
+ )
46
+ return None
47
+
48
+
49
+ def get_skill_analysis(requisition_id: str) -> Union[SkillAnalysisResponse, RequisitionNotFoundResponse]:
50
+ """
51
+ Provides statistical indicators for each skill associated with the requisition,
52
+ enabling an LLM or analyst to decide whether a skill should be retained,
53
+ removed, or reconsidered.
54
+
55
+ Args:
56
+ requisition_id: The job requisition ID.
57
+
58
+ Returns:
59
+ Dict with historical counts and SLA correlation per skill.
60
+ """
61
+ _log_api_call(f"API call: get_skill_analysis(requisition_id={requisition_id})")
62
+
63
+ # Check if requisition ID is valid
64
+ error = _check_requisition_valid(requisition_id)
65
+ if error:
66
+ return error
67
+
68
+ loader = get_data_loader()
69
+ data = loader.get_similar_requisitions(requisition_id)
70
+
71
+ # Get all unique skills across all candidates
72
+ all_skills = []
73
+ for skills_list in data['skills_parsed']:
74
+ all_skills.extend(skills_list)
75
+
76
+ skill_counts = pd.Series(all_skills).value_counts()
77
+
78
+ # For each skill, compute SLA correlation
79
+ historical_skills = []
80
+ for skill, count in skill_counts.head(10).items(): # Top 10 skills
81
+ # Filter to reviewed candidates only (SLA only applies to reviewed candidates)
82
+ reviewed_data = data[data['reviewed']]
83
+
84
+ # Get candidates with and without this skill
85
+ has_skill = reviewed_data[reviewed_data['skills_parsed'].apply(lambda x: skill in x)]
86
+ no_skill = reviewed_data[reviewed_data['skills_parsed'].apply(lambda x: skill not in x)]
87
+
88
+ # Calculate SLA rates
89
+ sla_with = has_skill['sla_met'].mean() if len(has_skill) > 0 else 0
90
+ sla_without = no_skill['sla_met'].mean() if len(no_skill) > 0 else 0
91
+
92
+ # Determine correlation
93
+ diff = sla_with - sla_without
94
+ if diff < -0.10:
95
+ correlation = "highly negative impact on SLA"
96
+ elif diff < 0:
97
+ correlation = "slightly negative impact on SLA"
98
+ elif diff > 0.10:
99
+ correlation = "highly positive impact on SLA"
100
+ elif diff > 0:
101
+ correlation = "slightly positive impact on SLA"
102
+ else:
103
+ correlation = "no impact on SLA"
104
+
105
+ historical_skills.append({
106
+ "name": skill,
107
+ "skill_occurrence": int(count),
108
+ "correlation": correlation
109
+ })
110
+
111
+ num_jobs = data['requisition_id'].nunique()
112
+
113
+ return SkillAnalysisResponse(
114
+ historical_jobs=num_jobs,
115
+ input_skills=[], # Would come from requisition details
116
+ historical_skills_with_analysis=historical_skills,
117
+ )
118
+
119
+
120
+ def get_skill_impact_fill_rate(requisition_id: str, skill_name: str) -> Union[SkillImpactFillRateResponse, RequisitionNotFoundResponse]:
121
+ """
122
+ Evaluates how the inclusion of a specific skill affects requisition
123
+ fill-rate metrics and candidate pool size.
124
+
125
+ Args:
126
+ requisition_id: The job requisition ID.
127
+ skill_name: The skill to evaluate.
128
+
129
+ Returns:
130
+ Impact metrics with and without the skill.
131
+ """
132
+ _log_api_call(f"API call: get_skill_impact_fill_rate(requisition_id={requisition_id}, skill_name={skill_name})")
133
+
134
+ # Check if requisition ID is valid
135
+ error = _check_requisition_valid(requisition_id)
136
+ if error:
137
+ return error
138
+
139
+ loader = get_data_loader()
140
+ data = loader.get_similar_requisitions(requisition_id)
141
+
142
+ # Split data by whether requisitions included this skill
143
+ has_skill_reqs = data[data['skills_parsed'].apply(lambda x: skill_name in x)]['requisition_id'].unique()
144
+ no_skill_reqs = data[~data['requisition_id'].isin(has_skill_reqs)]['requisition_id'].unique()
145
+
146
+ def calc_metrics(req_ids):
147
+ if len(req_ids) == 0:
148
+ return {"fill_rate_percentage": 0, "time_to_fill_days": 0, "candidate_pool_size": 0}
149
+
150
+ req_data = data[data['requisition_id'].isin(req_ids)]
151
+
152
+ # Fill rate: % of reqs that got at least one hire
153
+ reqs_with_hires = req_data[req_data['hired']]['requisition_id'].nunique()
154
+ fill_rate = reqs_with_hires / len(req_ids) * 100
155
+
156
+ # Time to fill: average days from applied to hired
157
+ hired = req_data[req_data['hired']]
158
+ if len(hired) > 0:
159
+ time_to_fill = (hired['hire_date'] - hired['applied_at']).dt.days.mean()
160
+ else:
161
+ time_to_fill = 0
162
+
163
+ # Candidate pool size
164
+ pool_size = len(req_data)
165
+
166
+ return {
167
+ "fill_rate_percentage": round(fill_rate, 1),
168
+ "time_to_fill_days": int(time_to_fill),
169
+ "candidate_pool_size": pool_size
170
+ }
171
+
172
+ with_skill = calc_metrics(has_skill_reqs)
173
+ without_skill = calc_metrics(no_skill_reqs)
174
+
175
+ return SkillImpactFillRateResponse(
176
+ skill_name=skill_name,
177
+ impact=with_skill,
178
+ compared_to_baseline=without_skill,
179
+ )
180
+
181
+
182
+ def get_skill_impact_sla(requisition_id: str, skill_name: str) -> Union[SkillImpactSLAResponse, RequisitionNotFoundResponse]:
183
+ """
184
+ Analyzes how a skill affects SLA achievement rate.
185
+
186
+ Args:
187
+ requisition_id: The job requisition ID.
188
+ skill_name: The skill being analyzed.
189
+
190
+ Returns:
191
+ Success percentages with/without the skill and the delta.
192
+ """
193
+ _log_api_call(f"API call: get_skill_impact_sla(requisition_id={requisition_id}, skill_name={skill_name})")
194
+
195
+ # Check if requisition ID is valid
196
+ error = _check_requisition_valid(requisition_id)
197
+ if error:
198
+ return error
199
+
200
+ loader = get_data_loader()
201
+ data = loader.get_similar_requisitions(requisition_id)
202
+
203
+ # Filter to reviewed candidates only (SLA only applies to reviewed candidates)
204
+ reviewed_data = data[data['reviewed']]
205
+
206
+ # Get candidates with and without this skill
207
+ has_skill = reviewed_data[reviewed_data['skills_parsed'].apply(lambda x: skill_name in x)]
208
+ no_skill = reviewed_data[reviewed_data['skills_parsed'].apply(lambda x: skill_name not in x)]
209
+
210
+ sla_with = round(has_skill['sla_met'].mean() * 100) if len(has_skill) > 0 else 0
211
+ sla_without = round(no_skill['sla_met'].mean() * 100) if len(no_skill) > 0 else 0
212
+
213
+ return SkillImpactSLAResponse(
214
+ requisition_id=requisition_id,
215
+ skill_name=skill_name,
216
+ sla_achievement_with_skill=sla_with,
217
+ sla_achievement_without_skill=sla_without,
218
+ delta=sla_with - sla_without,
219
+ )
220
+
221
+
222
+ def get_skill_relevance_justification(requisition_id: str, skill_name: str) -> Union[SkillRelevanceResponse, RequisitionNotFoundResponse]:
223
+ """
224
+ Explains whether a skill is relevant and why, based on historical hiring
225
+ success and outcome data.
226
+
227
+ Args:
228
+ requisition_id: The job requisition ID.
229
+ skill_name: The skill being justified.
230
+
231
+ Returns:
232
+ Relevance determination with justification.
233
+ """
234
+ _log_api_call(f"API call: get_skill_relevance_justification(requisition_id={requisition_id}, skill_name={skill_name})")
235
+
236
+ # Check if requisition ID is valid
237
+ error = _check_requisition_valid(requisition_id)
238
+ if error:
239
+ return error
240
+
241
+ # Get both SLA and fill rate impacts
242
+ sla_impact = get_skill_impact_sla(requisition_id, skill_name)
243
+ fill_impact = get_skill_impact_fill_rate(requisition_id, skill_name)
244
+
245
+ # Determine relevance based on both metrics
246
+ is_relevant = False
247
+ if sla_impact.delta > 5 or fill_impact.impact.fill_rate_percentage > fill_impact.compared_to_baseline.fill_rate_percentage * 1.2:
248
+ is_relevant = True
249
+
250
+ justification = SkillJustificationData(
251
+ requisition_id=requisition_id,
252
+ skill_name=skill_name,
253
+ sla_achievement_with_skill=sla_impact.sla_achievement_with_skill,
254
+ sla_achievement_without_skill=sla_impact.sla_achievement_without_skill,
255
+ delta=sla_impact.delta,
256
+ impact=SkillJustificationImpact(
257
+ fill_rate_percentage=fill_impact.impact.fill_rate_percentage,
258
+ time_to_fill_days=fill_impact.impact.time_to_fill_days,
259
+ candidate_pool_size=fill_impact.impact.candidate_pool_size,
260
+ ),
261
+ compared_to_baseline=SkillJustificationImpact(
262
+ fill_rate_percentage=fill_impact.compared_to_baseline.fill_rate_percentage,
263
+ time_to_fill_days=fill_impact.compared_to_baseline.time_to_fill_days,
264
+ candidate_pool_size=fill_impact.compared_to_baseline.candidate_pool_size,
265
+ ),
266
+ )
267
+
268
+ return SkillRelevanceResponse(
269
+ requisition_id=requisition_id,
270
+ skill_name=skill_name,
271
+ is_relevant=is_relevant,
272
+ justification=justification,
273
+ )
274
+
275
+
276
+ def get_successful_posting_criteria() -> SuccessfulPostingResponse:
277
+ """
278
+ Returns the business definition of a successful job posting,
279
+ including thresholds and benchmarks for success.
280
+
281
+ Returns:
282
+ Success criteria thresholds.
283
+ """
284
+ _log_api_call("API call: get_successful_posting_criteria()")
285
+
286
+ return SuccessfulPostingResponse(
287
+ criteria=SuccessCriteria(
288
+ time_to_fill_threshold_days=90,
289
+ offer_acceptance_rate_min=50,
290
+ sla_compliance_min=80,
291
+ candidate_quality_rating_avg=3.5,
292
+ ),
293
+ justification="Based on historical performance benchmarks and industry standards",
294
+ )
295
+
296
+
297
+ def get_data_sources_used(requisition_id: str) -> Union[DataSourcesResponse, RequisitionNotFoundResponse]:
298
+ """
299
+ Lists the datasets and ML models used to make hiring recommendations
300
+ for a requisition.
301
+
302
+ Args:
303
+ requisition_id: The job requisition ID.
304
+
305
+ Returns:
306
+ Data sources and models used.
307
+ """
308
+ _log_api_call(f"API call: get_data_sources_used(requisition_id={requisition_id})")
309
+
310
+ # Check if requisition ID is valid
311
+ error = _check_requisition_valid(requisition_id)
312
+ if error:
313
+ return error
314
+
315
+ return DataSourcesResponse(
316
+ requisition_id=requisition_id,
317
+ datasets_used=[
318
+ "Historical hiring success data",
319
+ "Requisition skill tagging",
320
+ "Funnel conversion metrics",
321
+ "Candidate quality feedback",
322
+ ],
323
+ models_involved=[
324
+ "Skill relevance classifier",
325
+ "SLA impact regression model",
326
+ "Funnel conversion recommender",
327
+ ],
328
+ )
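`get_skill_analysis` in the file above maps the SLA-rate difference between candidates with and without a skill onto one of five correlation labels. A sketch isolating those thresholds into a standalone helper (`label_sla_correlation` is illustrative, not a function in this repo):

```python
def label_sla_correlation(sla_with: float, sla_without: float) -> str:
    """Map the SLA-rate difference to the correlation label the API emits.

    Rates are fractions in [0, 1]; a difference beyond +/-0.10 (ten
    percentage points) counts as a "highly" positive or negative impact.
    """
    diff = sla_with - sla_without
    if diff < -0.10:
        return "highly negative impact on SLA"
    elif diff < 0:
        return "slightly negative impact on SLA"
    elif diff > 0.10:
        return "highly positive impact on SLA"
    elif diff > 0:
        return "slightly positive impact on SLA"
    return "no impact on SLA"


print(label_sla_correlation(0.85, 0.70))  # → highly positive impact on SLA
```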
api_skills_error.py ADDED
@@ -0,0 +1,238 @@
+ """
2
+ Error-prone skills API variants for testing agent resilience.
3
+
4
+ Each function has a unique, plausible intent and embeds a specific error behavior.
5
+ Completely independent from original APIs — accesses DataLoader directly.
6
+
7
+ AUTO-GENERATED by scripts/generate_hf.sh - DO NOT EDIT DIRECTLY
8
+ Edit skills_error.py in main repo and regenerate.
9
+ """
10
+
11
+ import json
12
+ import random
13
+ from typing import Any, Dict, Optional
14
+
15
+ from data_loader import get_data_loader
16
+
17
+ # Seeded RNG for reproducible probabilistic behavior
18
+ _rng = random.Random(42)
19
+
20
+
21
+ def _check_requisition(requisition_id: str) -> Optional[Dict[str, Any]]:
22
+ """Return error dict if requisition invalid, else None."""
23
+ loader = get_data_loader()
24
+ if not loader.is_valid_requisition(requisition_id):
25
+ return {
26
+ "error": "requisition_not_found",
27
+ "message": f"Requisition {requisition_id} not found",
28
+ }
29
+ return None
30
+
31
+
32
+ # ── Test 27: Type mismatch — string instead of structured list ──────────────
33
+
34
+ def get_skill_summary(requisition_id: str) -> str:
35
+ """Get a quick text summary of skills needed for a requisition.
36
+
37
+ Returns a concise comma-separated skill overview.
38
+
39
+ ERROR BEHAVIOR: Returns a plain comma-separated string instead of
40
+ structured SkillAnalysisResponse. Tests type mismatch handling.
41
+ """
42
+ err = _check_requisition(requisition_id)
43
+ if err:
44
+ return json.dumps(err)
45
+
46
+ loader = get_data_loader()
47
+ data = loader.get_similar_requisitions(requisition_id)
48
+ all_skills: set = set()
49
+ for skills_list in data["skills_parsed"].dropna():
50
+ if isinstance(skills_list, list):
51
+ all_skills.update(skills_list)
52
+
53
+ return ", ".join(sorted(all_skills))
54
+
55
+
56
+ # ── Test 34: Missing output schema — untyped dict ───────────────────────────
57
+
58
+ def get_model_registry(requisition_id: str) -> Dict[str, Any]:
59
+ """Check which ML models are registered for a given requisition.
60
+
61
+ Returns model registry information including versions and status.
62
+
63
+ ERROR BEHAVIOR: No Pydantic output schema — returns a plain dict
64
+ with dynamically typed fields. Tests schema inference.
65
+ """
66
+ err = _check_requisition(requisition_id)
67
+ if err:
68
+ return err
69
+
70
+ return {
71
+ "requisition_id": requisition_id,
72
+ "models": [
73
+ {
74
+ "name": "Skill relevance classifier",
75
+ "version": "2.1.0",
76
+ "status": "active",
77
+ "last_trained": "2024-11-15",
78
+ "accuracy": 0.87,
79
+ },
80
+ {
81
+ "name": "SLA impact regression model",
82
+ "version": "1.4.2",
83
+ "status": "active",
84
+ "last_trained": "2024-10-01",
85
+ "r_squared": 0.72,
86
+ },
87
+ {
88
+ "name": "Funnel conversion recommender",
89
+ "version": "3.0.0-beta",
90
+ "status": "staging",
91
+ "last_trained": "2025-01-20",
92
+ "precision": 0.81,
93
+ },
94
+ ],
95
+ "registry_updated": "2025-04-29",
96
+ }
97
+
98
+
99
+ # ── Test 35: Missing input schema — undocumented params ─────────────────────
100
+
101
+ def get_skill_lookup(requisition_id: str, skill_name: str = None,
102
+ include_history: bool = False,
103
+ format: str = "json") -> Dict[str, Any]:
104
+ """Look up a specific skill and its metrics for a requisition.
105
+
106
+ ERROR BEHAVIOR: Accepts undocumented parameters (include_history, format)
107
+ not described in the tool schema. Tests agent handling of extra params.
108
+ """
109
+ err = _check_requisition(requisition_id)
110
+ if err:
111
+ return err
112
+
113
+ loader = get_data_loader()
114
+ data = loader.get_similar_requisitions(requisition_id)
115
+
116
+ # Find skill occurrence
117
+ total = 0
118
+ for skills_list in data["skills_parsed"].dropna():
119
+ if isinstance(skills_list, list) and skill_name in skills_list:
120
+ total += 1
121
+
122
+ result = {
123
+ "requisition_id": requisition_id,
124
+ "skill_name": skill_name,
125
+ "occurrence_count": total,
126
+ "total_candidates": len(data),
127
+ "occurrence_rate": round(total / len(data) * 100, 1) if len(data) > 0 else 0,
128
+ }
129
+
130
+ if include_history:
131
+ result["history"] = {
132
+ "first_seen": "2023-10-09",
133
+ "trend": "stable",
134
+ "quarterly_counts": [total // 4] * 4,
135
+ }
136
+
137
+ return result
138
+
139
+
140
+ # ── Test 40: Deeply nested JSON (15 levels) ─────────────────────────────────
141
+
142
+ def get_skill_deep_analysis(requisition_id: str) -> Dict[str, Any]:
143
+ """Get a deep analysis breakdown of skills with detailed sub-categories.
144
+
145
+ Returns comprehensive multi-level skill categorization and metrics.
146
+
147
+ ERROR BEHAVIOR: Response is nested 15 levels deep.
148
+ Tests agent ability to navigate deeply nested structures.
149
+ """
150
+ err = _check_requisition(requisition_id)
151
+ if err:
152
+ return err
153
+
154
+ loader = get_data_loader()
155
+ data = loader.get_similar_requisitions(requisition_id)
156
+
157
+ # Collect top skills
158
+ all_skills: list = []
159
+ for skills_list in data["skills_parsed"].dropna():
160
+ if isinstance(skills_list, list):
161
+ all_skills.extend(skills_list)
162
+
163
+ from collections import Counter
164
+ skill_counts = Counter(all_skills)
165
+ top_skills = skill_counts.most_common(5)
166
+
167
+ # Build deeply nested structure (15 levels)
168
+ def nest(depth: int, skill_name: str, count: int) -> Dict[str, Any]:
169
+ if depth <= 0:
170
+ return {"skill": skill_name, "count": count}
171
+ return {
172
+ "level": depth,
173
+ "metadata": {"type": f"analysis_layer_{depth}"},
174
+ "data": nest(depth - 1, skill_name, count),
175
+ }
176
+
177
+ skills_nested = [
178
+ nest(15, name, count) for name, count in top_skills
179
+ ]
180
+
181
+ return {
182
+ "requisition_id": requisition_id,
183
+ "analysis_version": "3.0",
184
+ "results": {
185
+ "nested_skills": skills_nested,
186
+ "total_depth": 15,
187
+ },
188
+ }
189
+
190
+
191
+ # ── Test 42: Input schema mismatch — expects skill_id but docs say skill_name
192
+
193
+ def analyze_skill_match(requisition_id: str, skill_id: str) -> Dict[str, Any]:
194
+ """Check if a skill is a good match for a requisition.
195
+
196
+ Args:
197
+ requisition_id: The job requisition ID.
198
+ skill_id: The skill identifier to check.
199
+
200
+ ERROR BEHAVIOR: Function signature says `skill_id` but tool description
201
+ and documentation say `skill_name`. Tests agent adaptation to mismatched
202
+ parameter names.
203
+ """
204
+ err = _check_requisition(requisition_id)
205
+ if err:
206
+ return err
207
+
208
+ # Treat skill_id as skill_name (the mismatch)
209
+ skill_name = skill_id
210
+
211
+ loader = get_data_loader()
212
+ data = loader.get_similar_requisitions(requisition_id)
213
+ reviewed = data[data["reviewed"]]
214
+
215
+ has_skill = reviewed[reviewed["skills_parsed"].apply(lambda x: skill_name in x)]
216
+ no_skill = reviewed[reviewed["skills_parsed"].apply(lambda x: skill_name not in x)]
217
+
218
+ sla_with = round(has_skill["sla_met"].mean() * 100) if len(has_skill) > 0 else 0
219
+ sla_without = round(no_skill["sla_met"].mean() * 100) if len(no_skill) > 0 else 0
220
+
221
+ total_with_skill = sum(
222
+ 1 for sl in data["skills_parsed"].dropna()
223
+ if isinstance(sl, list) and skill_name in sl
224
+ )
225
+
226
+ match_score = min(100, int(
227
+ (total_with_skill / len(data) * 50 if len(data) > 0 else 0)
228
+ + (max(0, sla_with - sla_without))
229
+ ))
230
+
231
+ return {
232
+ "requisition_id": requisition_id,
233
+ "skill_id": skill_name,
234
+ "match_score": match_score,
235
+ "sla_delta": sla_with - sla_without,
236
+ "occurrence_rate": round(total_with_skill / len(data) * 100, 1) if len(data) > 0 else 0,
237
+ "recommendation": "good match" if match_score >= 50 else "weak match",
238
+ }
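The 15-level wrapper produced by `get_skill_deep_analysis` above can be walked with a simple loop that follows the `"data"` keys. A sketch of how a consumer might unwrap it; `unwrap` is a hypothetical helper, and `build` merely mirrors the server-side `nest()` so the example is self-contained:

```python
def unwrap(node):
    """Descend through 'data' keys until the leaf {'skill', 'count'} dict."""
    while isinstance(node, dict) and "data" in node:
        node = node["data"]
    return node


def build(depth, skill, count):
    """Mirror of the nest() helper in the diff above, for demonstration."""
    if depth <= 0:
        return {"skill": skill, "count": count}
    return {
        "level": depth,
        "metadata": {"type": f"analysis_layer_{depth}"},
        "data": build(depth - 1, skill, count),
    }


leaf = unwrap(build(15, "Python", 42))
print(leaf)  # → {'skill': 'Python', 'count': 42}
```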
app.py ADDED
@@ -0,0 +1,625 @@
+ """Gradio UI for BPO Benchmark evaluation using CUGA SDK."""
2
+
3
+ import asyncio
4
+ import json
5
+ import logging
6
+ import os
7
+ from pathlib import Path
8
+ from typing import Any, Dict, List, Optional, Tuple
9
+
10
+ import gradio as gr
11
+
12
+ from agent import (
13
+ CUGAAgent,
14
+ LangfuseTracker,
15
+ LLMJudge,
16
+ check_keywords,
17
+ compare_api_calls,
18
+ compute_string_similarity,
19
+ compute_exact_match,
20
+ compute_final_score,
21
+ get_llm_judge,
22
+ get_provider_models,
23
+ get_provider_placeholder,
24
+ get_default_model,
25
+ is_langfuse_configured,
26
+ get_langfuse_host,
27
+ PROVIDER_CONFIGS,
28
+ )
29
+
30
+ logging.basicConfig(level=logging.INFO)
31
+ logger = logging.getLogger(__name__)
32
+
33
+
34
+ def ensure_mcp_config():
35
+ """Ensure MCP servers config file exists."""
36
+ mcp_dir = Path(__file__).parent / "mcp_servers"
37
+ mcp_dir.mkdir(exist_ok=True)
38
+
39
+ config_file = mcp_dir / "bpo.yaml"
40
+ if not config_file.exists():
41
+ config_file.write_text("""services:
42
+ - bpo:
43
+ url: http://127.0.0.1:8000/openapi.json
44
+ description: BPO recruiting analytics API
45
+ """)
46
+ return config_file
47
+
48
+
49
+ # Ensure config exists
50
+ ensure_mcp_config()
51
+
52
+
53
+ # Test suite definitions: label -> filename
54
+ TEST_SUITES = {
55
+ "Core (26 tasks)": "tasks.json",
56
+ "Type Mismatch (3 tasks)": "tasks_type_mismatch.json",
57
+ "HTTP Errors (4 tasks)": "tasks_http_errors.json",
58
+ "Schema Violations (4 tasks)": "tasks_schema_violations.json",
59
+ "Edge Cases (5 tasks)": "tasks_edge_cases.json",
60
+ "Undocumented Behaviors (3 tasks)": "tasks_undocumented.json",
61
+ }
62
+
63
+
64
+ def _find_data_dir() -> Optional[Path]:
65
+ """Locate the data directory."""
66
+ candidates = [
67
+ Path(__file__).parent / "data",
68
+ Path("./data"),
69
+ Path("/home/user/app/data"),
70
+ ]
71
+ for p in candidates:
72
+ if p.is_dir():
73
+ return p
74
+ return None
75
+
76
+
77
+ def _load_tasks_from_file(path: Path) -> List[Dict[str, Any]]:
78
+ """Load test cases from a single JSON file."""
79
+ if not path.exists():
80
+ logger.warning(f"Task file not found: {path}")
81
+ return []
82
+ with open(path) as f:
83
+ data = json.load(f)
84
+ cases = []
85
+ if isinstance(data, list):
86
+ for item in data:
87
+ if isinstance(item, dict) and "test_cases" in item:
88
+ cases.extend(item["test_cases"])
89
+ elif isinstance(item, dict):
90
+ cases.append(item)
91
+ return cases
92
+
93
+
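`_load_tasks_from_file` accepts two shapes of suite JSON: a list of suite objects carrying a `test_cases` array, or a flat list of bare task dicts. The flattening rule can be exercised in isolation:

```python
# Mirrors the flattening logic in _load_tasks_from_file: suite objects
# contribute their "test_cases"; bare task dicts are kept as-is.
def flatten(data):
    cases = []
    for item in data:
        if isinstance(item, dict) and "test_cases" in item:
            cases.extend(item["test_cases"])
        elif isinstance(item, dict):
            cases.append(item)
    return cases

sample = [
    {"name": "bpo-benchmark", "test_cases": [{"name": "task_1"}, {"name": "task_2"}]},
    {"name": "task_3"},  # bare task dict, appended unchanged
]
print([c["name"] for c in flatten(sample)])  # → ['task_1', 'task_2', 'task_3']
```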
94
+ def load_tasks(suite_labels: Optional[List[str]] = None) -> List[Dict[str, Any]]:
95
+ """Load tasks from one or more test suite files.
96
+
97
+ Args:
98
+ suite_labels: List of suite labels to load (keys from TEST_SUITES).
99
+ If None, loads only the core suite.
100
+ """
101
+ data_dir = _find_data_dir()
102
+ if data_dir is None:
103
+ logger.warning("Data directory not found")
104
+ return []
105
+
106
+ if suite_labels is None:
107
+ suite_labels = ["Core (26 tasks)"]
108
+
109
+ tasks = []
110
+ for label in suite_labels:
111
+ filename = TEST_SUITES.get(label)
112
+ if filename:
113
+ loaded = _load_tasks_from_file(data_dir / filename)
114
+ logger.info(f"Loaded {len(loaded)} tasks from {filename}")
115
+ tasks.extend(loaded)
116
+
117
+ return tasks
118
+
119
+
120
+ def get_available_suites() -> List[str]:
121
+ """Return labels of test suites that actually exist on disk."""
122
+ data_dir = _find_data_dir()
123
+ if data_dir is None:
124
+ return []
125
+ return [label for label, fn in TEST_SUITES.items() if (data_dir / fn).exists()]
126
+
127
+
128
+ # Load core tasks at startup for the task list display
129
+ AVAILABLE_SUITES = get_available_suites()
130
+ ALL_TASKS_CACHE: Dict[str, List[Dict[str, Any]]] = {}
131
+ for label in AVAILABLE_SUITES:
132
+ ALL_TASKS_CACHE[label] = load_tasks([label])
133
+ TASKS = ALL_TASKS_CACHE.get("Core (26 tasks)", [])
134
+ total_available = sum(len(v) for v in ALL_TASKS_CACHE.values())
135
+ logger.info(f"Loaded {len(TASKS)} core tasks, {total_available} total across {len(AVAILABLE_SUITES)} suites")
136
+
137
+
138
+ async def _setup_agent(api_key: str, provider: str, model: str) -> CUGAAgent:
139
+ """Initialize and return CUGA agent."""
140
+ agent = CUGAAgent(
141
+ api_key=api_key,
142
+ provider=provider.lower(),
143
+ model=model if model.strip() else None,
144
+ )
145
+ await agent.setup()
146
+ return agent
147
+
148
+
149
+ async def _run_single_task(
150
+ agent: CUGAAgent, task: Dict, task_index: int,
151
+ llm_judge: Any, llm_judge_requested: bool,
152
+ langfuse: Any,
153
+ ) -> Dict[str, Any]:
154
+ """Run a single evaluation task and return the result."""
155
+ task_name = task.get("name", f"task_{task_index+1}")
156
+ query = task.get("intent", "")
157
+ thread_id = f"eval_{task_name}_{task_index}"
158
+
159
+ try:
160
+ response, tool_calls = await agent.run(query, thread_id=thread_id)
161
+
162
+ # Get expected output and keywords
163
+ expected_output = task.get("expected_output", {})
164
+ expected_keywords = expected_output.get("keywords", [])
165
+ expected_answer = expected_output.get("response", "") or expected_output.get("answer", "")
166
+ tool_calls_expected = expected_output.get("tool_calls", []) or expected_output.get("apis", [])
167
+ expected_apis = []
168
+ for tc in tool_calls_expected:
169
+ if isinstance(tc, dict):
170
+ expected_apis.append(tc.get("name", ""))
171
+ elif isinstance(tc, str):
172
+ expected_apis.append(tc)
173
+
174
+ # Compute metrics
175
+ keyword_check = check_keywords(response, expected_keywords)
176
+ similarity = compute_string_similarity(response, expected_answer) if expected_answer else 0.0
177
+ exact_match = compute_exact_match(response, expected_answer) if expected_answer else False
178
+
179
+ # Extract tool names
180
+ tool_names = []
181
+ for tc in tool_calls:
182
+ if isinstance(tc, dict):
183
+ tool_names.append(tc.get("name", str(tc)))
184
+ else:
185
+ tool_names.append(str(tc))
186
+
187
+ # Compare API calls
188
+ api_comparison = compare_api_calls(tool_names, expected_apis) if expected_apis else {
189
+ "missing": [], "extra": [], "correct": 0, "expected_count": 0,
190
+ "called_count": len(tool_names), "all_expected_called": True,
191
+ }
192
+
193
+ # LLM Judge evaluation
194
+ llm_judge_score = None
195
+ llm_judge_rationale = None
196
+ if llm_judge and expected_answer:
197
+ try:
198
+ judge_result = await llm_judge.judge(response, expected_answer, query)
199
+ llm_judge_score = judge_result.get("score")
200
+ llm_judge_rationale = judge_result.get("rationale", "")
201
+ except Exception as e:
202
+ logger.warning(f"LLM judge failed for {task_name}: {e}")
203
+
204
+ # Compute final score (matches main repo logic)
205
+ final_score = compute_final_score(
206
+ exact_match=exact_match,
207
+ similarity=similarity,
208
+ llm_judge_score=llm_judge_score,
209
+ llm_judge_requested=llm_judge_requested,
210
+ agent_output=response,
211
+ apis_missing=api_comparison["missing"],
212
+ require_api_match=True,
213
+ )
214
+
215
+ result = {
216
+ "task_id": task_name,
217
+ "difficulty": task.get("difficulty", "unknown"),
218
+ "intent": query,
219
+ "response": response,
220
+ "expected_answer": expected_answer,
221
+ "expected_keywords": expected_keywords,
222
+ "found_keywords": keyword_check["found_keywords"],
223
+ "missing_keywords": keyword_check["missing_keywords"],
224
+ "match_rate": keyword_check["match_rate"],
225
+ "similarity": similarity,
226
+ "exact_match": exact_match,
227
+ "llm_judge_score": llm_judge_score,
228
+ "llm_judge_rationale": llm_judge_rationale,
229
+ "final_score": final_score,
230
+ "passed": final_score == 1,
231
+ "tool_calls": tool_names,
232
+ "expected_apis": expected_apis,
233
+ "apis_missing": api_comparison["missing"],
234
+ "apis_extra": api_comparison["extra"],
235
+ "apis_correct": api_comparison["correct"],
236
+ }
237
+
238
+ # Score in Langfuse
239
+ scores = {
240
+ "similarity": similarity,
241
+ "keyword_match": keyword_check["match_rate"],
242
+ "final_score": float(final_score),
243
+ }
244
+ if llm_judge_score is not None:
245
+ scores["llm_judge"] = llm_judge_score
246
+ langfuse.score_task(task_name, scores)
247
+
248
+ logger.info(
249
+ f"Task {task_name}: {'PASS' if result['passed'] else 'FAIL'} "
250
+ f"(sim={similarity:.2f}, kw={keyword_check['match_rate']:.1%}"
251
+ f"{f', judge={llm_judge_score:.2f}' if llm_judge_score is not None else ''})"
252
+ )
253
+
254
+ return result
255
+
256
+ except Exception as e:
257
+ logger.exception(f"Error in task {task_name}")
258
+ return {
259
+ "task_id": task_name,
260
+ "difficulty": task.get("difficulty", "unknown"),
261
+ "intent": task.get("intent", ""),
262
+ "response": f"Error: {e}",
263
+ "passed": False,
264
+ "final_score": 0,
265
+ "similarity": 0.0,
266
+ "exact_match": False,
267
+ "match_rate": 0.0,
268
+ "tool_calls": [],
269
+ "error": str(e),
270
+ }
271
+
272
+
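The API-call comparison consumed above (fields `missing`, `extra`, `correct`, `expected_count`, `called_count`, `all_expected_called`) comes from `compare_api_calls` in `agent.py`, which this diff does not show. A minimal stand-in with the same result shape, assuming matching is by tool name only and order-insensitive, would be:

```python
from typing import Any, Dict, List

def compare_api_calls_stub(called: List[str], expected: List[str]) -> Dict[str, Any]:
    # Assumption: name-based, order-insensitive matching (not the real agent.py logic).
    called_set, expected_set = set(called), set(expected)
    missing = sorted(expected_set - called_set)
    return {
        "missing": missing,
        "extra": sorted(called_set - expected_set),
        "correct": len(called_set & expected_set),
        "expected_count": len(expected),
        "called_count": len(called),
        "all_expected_called": not missing,
    }

result = compare_api_calls_stub(
    ["candidate_source_sla_per_source", "unexpected_tool"],
    ["candidate_source_sla_per_source", "candidate_source_funnel_conversion_by_source"],
)
print(result["missing"])  # → ['candidate_source_funnel_conversion_by_source']
```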
273
+ def _build_results_markdown(results: List[Dict], langfuse: Any) -> str:
274
+ """Build markdown summary from evaluation results."""
275
+ total = len(results)
276
+ passed = sum(1 for r in results if r.get("passed", False))
277
+ avg_similarity = sum(r.get("similarity", 0) for r in results) / total if total else 0
278
+ avg_match = sum(r.get("match_rate", 0) for r in results) / total if total else 0
279
+ exact_matches = sum(1 for r in results if r.get("exact_match", False))
280
+ final_score_passes = sum(1 for r in results if r.get("final_score") == 1)
281
+ keyword_full_matches = sum(1 for r in results if r.get("match_rate", 0) == 1.0)
282
+ tasks_with_tools = sum(1 for r in results if r.get("tool_calls"))
283
+
284
+ # LLM Judge metrics
285
+ judge_scores = [r.get("llm_judge_score") for r in results if r.get("llm_judge_score") is not None]
286
+ avg_judge_score = sum(judge_scores) / len(judge_scores) if judge_scores else None
287
+ judge_passes = sum(1 for s in judge_scores if s >= 0.85) if judge_scores else 0
288
+
289
+ # API metrics
290
+ tasks_with_expected_apis = [r for r in results if r.get("expected_apis")]
291
+ api_correct = sum(1 for r in tasks_with_expected_apis if not r.get("apis_missing"))
292
+ api_accuracy = api_correct / len(tasks_with_expected_apis) if tasks_with_expected_apis else None
293
+
294
+ summary = {
295
+ "total_tasks": total,
296
+ "passed": passed,
297
+ "pass_rate": passed / total if total else 0,
298
+ "avg_similarity": avg_similarity,
299
+ "avg_keyword_match": avg_match,
300
+ "exact_matches": exact_matches,
301
+ "final_score_passes": final_score_passes,
302
+ "keyword_full_matches": keyword_full_matches,
303
+ "avg_llm_judge_score": avg_judge_score,
304
+ "api_accuracy": api_accuracy,
305
+ }
306
+
307
+ # End Langfuse trace
308
+ langfuse.end_trace(summary)
309
+
310
+ # Group by difficulty
311
+ by_difficulty = {}
312
+ for r in results:
313
+ diff = r.get("difficulty", "unknown")
314
+ if diff not in by_difficulty:
315
+ by_difficulty[diff] = {"total": 0, "passed": 0}
316
+ by_difficulty[diff]["total"] += 1
317
+ if r.get("passed", False):
318
+ by_difficulty[diff]["passed"] += 1
319
+
320
+ # Build markdown output
321
+ md = "## Evaluation Complete\n\n"
322
+ md += f"**Total Tasks:** {total}\n"
323
+ md += f"**Final Score:** {final_score_passes}/{total} ({100*final_score_passes/total:.1f}%)\n"
324
+ md += f"**Exact Matches:** {exact_matches} ({100*exact_matches/total:.1f}%)\n"
325
+ md += f"**Avg Similarity:** {avg_similarity:.2f}\n"
326
+ md += f"**Keyword Match:** {avg_match*100:.1f}% avg ({keyword_full_matches}/{total} full matches)\n"
327
+ if avg_judge_score is not None:
328
+ md += f"**LLM Judge:** {len(judge_scores)} tasks, avg={avg_judge_score:.2f} ({judge_passes}/{len(judge_scores)} pass)\n"
329
+ if api_accuracy is not None:
330
+ md += f"**API Accuracy:** {api_correct}/{len(tasks_with_expected_apis)} ({api_accuracy*100:.1f}%)\n"
331
+ md += f"**Tasks with Tool Calls:** {tasks_with_tools}/{total}\n"
332
+
333
+ if langfuse.enabled:
334
+ md += "\n*Langfuse tracking enabled*\n"
335
+ elif langfuse.init_error:
336
+ md += f"\n*Langfuse error: {langfuse.init_error}*\n"
337
+
338
+ md += "\n"
339
+
340
+ # By difficulty breakdown
341
+ if by_difficulty:
342
+ md += "### By Difficulty\n"
343
+ for diff, stats in sorted(by_difficulty.items()):
344
+ rate = stats["passed"] / stats["total"] * 100 if stats["total"] else 0
345
+ md += f"- **{diff}**: {stats['passed']}/{stats['total']} ({rate:.1f}%)\n"
346
+ md += "\n"
347
+
348
+ md += "---\n\n"
349
+
350
+ # Individual results
351
+ for r in results:
352
+ status = "PASS" if r.get("passed") else "FAIL"
353
+ md += f"### {status} - {r.get('task_id', 'unknown')} ({r.get('difficulty', 'unknown')})\n\n"
354
+ md += f"**Query:** {r.get('intent', '')}\n\n"
355
+
356
+ response_text = r.get("response", "")
357
+ if len(response_text) > 500:
358
+ response_text = response_text[:500] + "..."
359
+ md += f"**Response:** {response_text}\n\n"
360
+
361
+ # Enhanced metrics display
362
+ md += "**Metrics:**\n"
363
+ md += f"- **Final Score: {'PASS' if r.get('final_score') == 1 else 'FAIL'}**\n"
364
+ md += f"- Similarity: {r.get('similarity', 0)*100:.1f}%\n"
365
+ md += f"- Exact Match: {'Yes' if r.get('exact_match') else 'No'}\n"
366
+ if r.get("llm_judge_score") is not None:
367
+ md += f"- LLM Judge: {r['llm_judge_score']:.2f}\n"
368
+ md += f"- Keyword Match: {r.get('match_rate', 0)*100:.1f}%\n"
369
+ md += "\n"
370
+
371
+ if r.get("missing_keywords"):
372
+ missing = r["missing_keywords"][:5]
373
+ md += f"**Missing keywords:** {', '.join(missing)}"
374
+ if len(r.get("missing_keywords", [])) > 5:
375
+ md += f" (+{len(r['missing_keywords']) - 5} more)"
376
+ md += "\n\n"
377
+
378
+ # API metrics
379
+ if r.get("expected_apis"):
380
+ correct = r.get("apis_correct", 0)
381
+ expected = len(r.get("expected_apis", []))
382
+ api_status = "PASS" if not r.get("apis_missing") else "FAIL"
383
+ md += f"- API Accuracy: {correct}/{expected} ({api_status})\n"
384
+ if r.get("tool_calls"):
385
+ md += f"- Tools used: {', '.join(r['tool_calls'])}\n"
386
+ if r.get("apis_missing"):
387
+ md += f"- Missing APIs: {', '.join(r['apis_missing'])}\n"
388
+ md += "\n"
389
+
390
+ if r.get("error"):
391
+ md += f"**Error:** {r['error']}\n\n"
392
+
393
+ md += "---\n\n"
394
+
395
+ return md
396
+
397
+
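The judge aggregation above skips tasks that never received a judge score and counts a pass at the 0.85 threshold. In isolation:

```python
# None means the LLM judge did not run for that task; only real scores
# feed the average, and 0.85 is the per-task pass threshold used above.
judge_scores = [0.9, 0.8, 1.0, None]
valid = [s for s in judge_scores if s is not None]
avg_judge_score = sum(valid) / len(valid) if valid else None
judge_passes = sum(1 for s in valid if s >= 0.85)
print(round(avg_judge_score, 2), judge_passes)  # → 0.9 2
```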
398
+ def run_evaluation(api_key, provider, model, task_ids, test_suites):
399
+ """Run CUGA SDK evaluation, yielding live progress to the UI."""
400
+ if not api_key:
401
+ yield "Please provide an API key", ""
402
+ return
403
+
404
+ # Load tasks from selected suites
405
+ if not test_suites:
406
+ test_suites = ["Core (26 tasks)"]
407
+ all_tasks = load_tasks(test_suites)
408
+
409
+ if not all_tasks:
410
+ yield "No tasks loaded. Check that task files exist in the data directory.", ""
411
+ return
412
+
413
+ # Parse task IDs to filter within loaded tasks
414
+ task_ids_str = task_ids.strip()
415
+ if task_ids_str.lower() == "all" or not task_ids_str:
416
+ tasks_to_run = all_tasks
417
+ else:
418
+ try:
419
+ ids = [s.strip() for s in task_ids_str.replace(",", " ").split()]
420
+ tasks_to_run = []
421
+ for task in all_tasks:
422
+ task_name = task.get("name", "")
423
+ task_num = task_name.replace("task_", "") if task_name.startswith("task_") else task_name
424
+ if task_name in ids or task_num in ids:
425
+ tasks_to_run.append(task)
426
+ except Exception as e:
427
+ yield f"Error parsing task IDs: {e}", ""
428
+ return
429
+
430
+ if not tasks_to_run:
431
+ yield "No matching tasks found.", ""
432
+ return
433
+
434
+ total = len(tasks_to_run)
435
+ yield f"**Initializing CUGA agent...** (0/{total} tasks)", ""
436
+
437
+ loop = asyncio.new_event_loop()
438
+ asyncio.set_event_loop(loop)
439
+
440
+ try:
441
+ agent = loop.run_until_complete(_setup_agent(api_key, provider, model))
442
+ logger.info("CUGA agent initialized successfully")
443
+
444
+ langfuse = LangfuseTracker()
445
+ langfuse.start_trace(
446
+ name="bpo_evaluation",
447
+ metadata={
448
+ "provider": provider,
449
+ "model": model or get_default_model(provider),
450
+ "num_tasks": total,
451
+ },
452
+ )
453
+
454
+ # Initialize LLM judge (only for Groq provider currently)
455
+ llm_judge = None
456
+ llm_judge_requested = False
457
+ if provider.lower() == "groq":
458
+ try:
459
+ llm_judge = get_llm_judge(api_key=api_key, provider="groq")
460
+ llm_judge_requested = True
461
+ logger.info("LLM judge initialized")
462
+ except Exception as e:
463
+ logger.warning(f"Failed to initialize LLM judge: {e}")
464
+
465
+ # Run tasks, yielding progress after each one
466
+ results = []
467
+ for i, task in enumerate(tasks_to_run):
468
+ task_name = task.get("name", f"task_{i+1}")
469
+ logger.info(f"Evaluating task {i+1}/{total}: {task_name}")
470
+ yield f"**Running {task_name}...** ({i}/{total} complete)", ""
471
+
472
+ result = loop.run_until_complete(
473
+ _run_single_task(agent, task, i, llm_judge, llm_judge_requested, langfuse)
474
+ )
475
+ results.append(result)
476
+
477
+ # Small delay between tasks
478
+ if len(results) < total:
479
+ loop.run_until_complete(asyncio.sleep(0.5))
480
+
481
+ # Clean up
482
+ agent.close()
483
+
484
+ md = _build_results_markdown(results, langfuse)
485
+ yield md, json.dumps(results, indent=2)
486
+
487
+ except Exception as e:
488
+ logger.exception("Evaluation failed")
489
+ yield f"Evaluation failed: {e}", ""
490
+ finally:
491
+ loop.close()
492
+
493
+
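The task-ID filter in `run_evaluation` accepts comma- or space-separated values and matches either a full task name or its bare number. Its normalization can be checked on its own:

```python
def matches(task_name: str, ids: list[str]) -> bool:
    # "task_27" is matched by either the full name or the bare number "27",
    # mirroring the filter logic in run_evaluation.
    num = task_name.replace("task_", "") if task_name.startswith("task_") else task_name
    return task_name in ids or num in ids

ids = "1, 2 task_27".replace(",", " ").split()  # mixed commas/spaces, as the UI allows
selected = [t for t in ["task_1", "task_2", "task_3", "task_27"] if matches(t, ids)]
print(selected)  # → ['task_1', 'task_2', 'task_27']
```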
494
+ def get_task_list():
495
+ """Get a formatted list of available tasks grouped by suite."""
496
+ if not ALL_TASKS_CACHE:
497
+ return "No tasks loaded"
498
+
499
+ lines = []
500
+ for label in AVAILABLE_SUITES:
501
+ tasks = ALL_TASKS_CACHE.get(label, [])
502
+ if not tasks:
503
+ continue
504
+ lines.append(f"### {label}\n")
505
+ for task in tasks:
506
+ name = task.get("name", "unknown")
507
+ diff = task.get("difficulty", "unknown")
508
+ intent = task.get("intent", "")
509
+ if len(intent) > 60:
510
+ intent = intent[:60] + "..."
511
+ lines.append(f"- **{name}** ({diff}): {intent}")
512
+ lines.append("")
513
+
514
+ return "\n".join(lines)
515
+
516
+
517
+ def update_model_choices(provider: str):
518
+ """Update model dropdown choices based on provider."""
519
+ models = get_provider_models(provider)
520
+ default = get_default_model(provider)
521
+ return gr.update(choices=models, value=default)
522
+
523
+
524
+ def update_api_key_placeholder(provider: str):
525
+ """Update API key placeholder based on provider."""
526
+ placeholder = get_provider_placeholder(provider)
527
+ return gr.update(placeholder=placeholder)
528
+
529
+
530
+ # Gradio Interface
531
+ with gr.Blocks(title="BPO Benchmark") as demo:
532
+ gr.Markdown("# BPO Benchmark Evaluation")
533
+ gr.Markdown(
534
+ "Evaluate **CUGA SDK** on BPO recruiting analytics tasks with 32 tool APIs. "
535
+ "Enter your API key, select tasks, and run the evaluation."
536
+ )
537
+
538
+ with gr.Row():
539
+ with gr.Column(scale=1):
540
+ provider = gr.Dropdown(
541
+ choices=["Groq", "OpenAI"],
542
+ value="Groq",
543
+ label="LLM Provider"
544
+ )
545
+ api_key = gr.Textbox(
546
+ label="API Key",
547
+ type="password",
548
+ placeholder="gsk_... (Groq)"
549
+ )
550
+ model = gr.Dropdown(
551
+ choices=get_provider_models("groq"),
552
+ value=get_default_model("groq"),
553
+ label="Model",
554
+ allow_custom_value=True,
555
+ )
556
+ test_suites = gr.CheckboxGroup(
557
+ choices=AVAILABLE_SUITES,
558
+ value=["Core (26 tasks)"],
559
+ label="Test Suites",
560
+ info=f"{total_available} tasks across {len(AVAILABLE_SUITES)} suites",
561
+ )
562
+ task_ids = gr.Textbox(
563
+ label="Task IDs (optional filter)",
564
+ placeholder="1 2 3 or task_27 task_28 (leave empty for all in selected suites)",
565
+ info="Filter within selected suites by ID"
566
+ )
567
+ run_btn = gr.Button("Run Evaluation", variant="primary", size="lg")
568
+
569
+ with gr.Accordion("Available Tasks", open=False):
570
+ gr.Markdown(get_task_list())
571
+
572
+ with gr.Accordion("Environment Info", open=False):
573
+ langfuse_status = "Configured" if is_langfuse_configured() else "Not configured"
574
+ public_key_set = "Yes" if os.environ.get("LANGFUSE_PUBLIC_KEY") else "No"
575
+ secret_key_set = "Yes" if os.environ.get("LANGFUSE_SECRET_KEY") else "No"
576
+ langfuse_host = get_langfuse_host()
577
+ gr.Markdown(f"""
578
+ **Langfuse Tracking:** {langfuse_status}
579
+ - LANGFUSE_PUBLIC_KEY set: {public_key_set}
580
+ - LANGFUSE_SECRET_KEY set: {secret_key_set}
581
+ - Host: {langfuse_host}
582
+
583
+ To enable Langfuse tracking in HuggingFace:
584
+ 1. Go to Space Settings > Variables and secrets
585
+ 2. Add **Secrets** (not variables):
586
+ - `LANGFUSE_PUBLIC_KEY`
587
+ - `LANGFUSE_SECRET_KEY`
588
+ - `LANGFUSE_HOST` (e.g., `https://us.cloud.langfuse.com`)
589
+ 3. Restart the Space for changes to take effect
590
+
591
+ *Connection will be tested when you run an evaluation*
592
+ """)
593
+
594
+ with gr.Column(scale=2):
595
+ output = gr.Markdown(label="Results")
596
+ with gr.Accordion("Raw JSON Results", open=False):
597
+ raw_json = gr.Code(label="Raw JSON", language="json")
598
+
599
+ # Event handlers
600
+ provider.change(
601
+ fn=update_model_choices,
602
+ inputs=[provider],
603
+ outputs=[model]
604
+ )
605
+ provider.change(
606
+ fn=update_api_key_placeholder,
607
+ inputs=[provider],
608
+ outputs=[api_key]
609
+ )
610
+
611
+ run_btn.click(
612
+ fn=run_evaluation,
613
+ inputs=[api_key, provider, model, task_ids, test_suites],
614
+ outputs=[output, raw_json]
615
+ )
616
+
617
+ gr.Markdown("""
618
+ ---
619
+ **Agent:** [CUGA SDK](https://pypi.org/project/cuga/)
620
+ | **Dataset:** [ibm-research/bpo-benchmark](https://huggingface.co/datasets/ibm-research/bpo-benchmark)
621
+ """)
622
+
623
+
624
+ if __name__ == "__main__":
625
+ demo.launch(server_name="0.0.0.0", server_port=7860)
data/candidate_data.parquet ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:14f2e8e859259f6ff0c5dd017f630d0df2490d298775647ab5882350fe80c6e5
3
+ size 1802852
data/large_response_fixture.json ADDED
The diff for this file is too large to render. See raw diff
 
data/tasks.json ADDED
@@ -0,0 +1,1291 @@
1
+ [
2
+ {
3
+ "name": "bpo-benchmark",
4
+ "user_info": [],
5
+ "test_cases": [
6
+ {
7
+ "name": "task_1",
8
+ "description": "Lists sources ranked by SLA success rate. | Explanation: CyberSec Jobs was identified as the lowest-performing source because its SLA success rate is 67 %, well below Indeed (86 %), GitHub (90 %), and the remaining sources at 95 %, as returned by the API.",
9
+ "intent": "For requisition 05958BR, which source has the lowest SLA performance?",
10
+ "difficulty": "easy",
11
+ "expected_output": {
12
+ "response": "CyberSec Jobs with 67%",
13
+ "keywords": [
14
+ "CyberSec Jobs",
15
+ "67%|67 %|67"
16
+ ],
17
+ "tool_calls": [
18
+ {
19
+ "name": "candidate_source_sla_per_source",
20
+ "args": {}
21
+ }
22
+ ],
23
+ "tool_call_results": [
24
+ {
25
+ "name": "candidate_source_sla_per_source",
26
+ "result": {
27
+ "metrics": [
28
+ {
29
+ "source_name": "CyberSec Jobs",
30
+ "sla_percentage": 67
31
+ },
32
+ {
33
+ "source_name": "Indeed",
34
+ "sla_percentage": 86
35
+ },
36
+ {
37
+ "source_name": "GitHub",
38
+ "sla_percentage": 90
39
+ },
40
+ {
41
+ "source_name": "Dice",
42
+ "sla_percentage": 95
43
+ },
44
+ {
45
+ "source_name": "Internal",
46
+ "sla_percentage": 95
47
+ },
48
+ {
49
+ "source_name": "LinkedIn",
50
+ "sla_percentage": 95
51
+ },
52
+ {
53
+ "source_name": "Referral",
54
+ "sla_percentage": 95
55
+ }
56
+ ]
57
+ }
58
+ }
59
+ ]
60
+ }
61
+ },
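The `keywords` entries above use `|` to separate acceptable alternatives (e.g. `"67%|67 %|67"`). The actual matcher lives in `check_keywords` in `agent.py` and is not shown in this diff; a plausible case-insensitive reading of the format is:

```python
def keyword_found(response: str, keyword: str) -> bool:
    # A keyword passes if any "|"-separated alternative appears in the response
    # (assumed case-insensitive substring matching, not the real agent.py logic).
    return any(alt.lower() in response.lower() for alt in keyword.split("|"))

response = "CyberSec Jobs has the lowest SLA performance at 67%."
keywords = ["CyberSec Jobs", "67%|67 %|67"]
print(all(keyword_found(response, k) for k in keywords))  # → True
```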
62
+ {
63
+ "name": "task_2",
64
+ "description": "Asks for the missing requisition id. | Explanation: The query lacks a requisition ID which is required for the API call.",
65
+ "intent": "What's the percentage of hires and the total hires per source?",
66
+ "difficulty": "easy",
67
+ "expected_output": {
68
+ "response": "Can you share the requisition ID (e.g., 05958BR or 05959BR)?",
69
+ "keywords": [
70
+ "requisition|req",
71
+ "ID|id|identifier",
72
+ "missing|without|share|provide|required"
73
+ ],
74
+ "tool_calls": []
75
+ }
76
+ },
77
+ {
78
+ "name": "task_3",
79
+ "description": "Shows each source's candidate volume and offer/hire success metrics for jobs similar to 05958BR. | Explanation: Candidate counts and percentages were taken from the candidate-volume API; hire counts and offer-acceptance rates were taken from the recommendation-summary API. The two tables were joined on \"source_name\", producing a combined view of volume and effectiveness for the three leading sources. | Note: Cross-references performance and volume per source. Requires joining APIs on 'source_name'.",
80
+ "intent": "For requisitions like 05958BR, which sources provided the most candidates, and how effective were they at converting to hires?",
81
+ "difficulty": "medium",
82
+ "expected_output": {
83
+ "response": "LinkedIn: 519 candidates (18%), 7 hires. Offer acceptance rate: 70%. Dice: 516 candidates (18%), 11 hires. Offer acceptance rate: 79%. GitHub: 468 candidates (16%), 10 hires. Offer acceptance rate: 77%.",
84
+ "keywords": [
85
+ "LinkedIn",
86
+ "Dice",
87
+ "GitHub",
88
+ "Offer acceptance rate",
89
+ "519",
90
+ "516",
91
+ "468",
92
+ "18%|18 %|18",
93
+ "70%|70 %|70",
94
+ "79%|79 %|79",
95
+ "77%|77 %|77",
96
+ "hires"
97
+ ],
98
+ "tool_calls": [
99
+ {
100
+ "name": "candidate_source_candidate_volume_by_source",
101
+ "args": {}
102
+ },
103
+ {
104
+ "name": "candidate_source_source_recommendation_summary",
105
+ "args": {}
106
+ }
107
+ ],
108
+ "tool_call_results": [
109
+ {
110
+ "name": "candidate_source_candidate_volume_by_source",
111
+ "result": {
112
+ "job_id": "05958BR",
113
+ "total_candidate_volume": 2913,
114
+ "metrics": [
115
+ {
116
+ "source_name": "LinkedIn",
117
+ "candidate_volume": 519,
118
+ "percentage": 18
119
+ },
120
+ {
121
+ "source_name": "Dice",
122
+ "candidate_volume": 516,
123
+ "percentage": 18
124
+ },
125
+ {
126
+ "source_name": "GitHub",
127
+ "candidate_volume": 468,
128
+ "percentage": 16
129
+ },
130
+ {
131
+ "source_name": "Indeed",
132
+ "candidate_volume": 410,
133
+ "percentage": 14
134
+ },
135
+ {
136
+ "source_name": "Internal",
137
+ "candidate_volume": 400,
138
+ "percentage": 14
139
+ },
140
+ {
141
+ "source_name": "Referral",
142
+ "candidate_volume": 400,
143
+ "percentage": 14
144
+ },
145
+ {
146
+ "source_name": "CyberSec Jobs",
147
+ "candidate_volume": 200,
148
+ "percentage": 7
149
+ }
150
+ ],
151
+ "heading": "For requisitions similar to 05958BR, there were 2913 candidates over the past three years. Here's how many candidates came from each source (with percentages from the total number):"
152
+ }
153
+ },
154
+ {
155
+ "name": "candidate_source_source_recommendation_summary",
156
+ "result": {
157
+ "total_requisitions": 40,
158
+ "metrics": [
159
+ {
160
+ "source_name": "CyberSec Jobs",
161
+ "jobs_filled_percentage": 2,
162
+ "first_round_review_percentage": 80,
163
+ "offer_acceptance_rate": 67,
164
+ "total_hires": 3
165
+ },
166
+ {
167
+ "source_name": "Dice",
168
+ "jobs_filled_percentage": 2,
169
+ "first_round_review_percentage": 11,
170
+ "offer_acceptance_rate": 79,
171
+ "total_hires": 11
172
+ },
173
+ {
174
+ "source_name": "GitHub",
175
+ "jobs_filled_percentage": 2,
176
+ "first_round_review_percentage": 76,
177
+ "offer_acceptance_rate": 77,
178
+ "total_hires": 10
179
+ },
180
+ {
181
+ "source_name": "Indeed",
182
+ "jobs_filled_percentage": 0,
183
+ "first_round_review_percentage": 77,
184
+ "offer_acceptance_rate": 0,
185
+ "total_hires": 0
186
+ },
187
+ {
188
+ "source_name": "Internal",
189
+ "jobs_filled_percentage": 2,
190
+ "first_round_review_percentage": 74,
191
+ "offer_acceptance_rate": 70,
192
+ "total_hires": 5
193
+ },
194
+ {
195
+ "source_name": "LinkedIn",
196
+ "jobs_filled_percentage": 2,
197
+ "first_round_review_percentage": 75,
198
+ "offer_acceptance_rate": 70,
199
+ "total_hires": 7
200
+ },
201
+ {
202
+ "source_name": "Referral",
203
+ "jobs_filled_percentage": 2,
204
+ "first_round_review_percentage": 70,
205
+ "offer_acceptance_rate": 62,
206
+ "total_hires": 4
207
+ }
208
+ ]
209
+ }
210
+ }
211
+ ]
212
+ }
213
+ },
214
+ {
215
+ "name": "task_4",
216
+ "description": "Asks for the missing requisition id. | Explanation: The query lacks a requisition ID which is required for the API call.",
217
+ "intent": "Did Dice provide a good funnel conversion rate?",
218
+ "difficulty": "easy",
219
+ "expected_output": {
220
+ "response": "Can you share the requisition ID (e.g., 05958BR or 05959BR)?",
221
+ "keywords": [
222
+ "requisition|req",
223
+ "ID|id|identifier",
224
+ "missing|without|share|provide|required"
225
+ ],
226
+ "tool_calls": []
227
+ }
228
+ },
229
+ {
230
+ "name": "task_5",
231
+ "description": "Asks for the missing requisition id. | Explanation: The query lacks a requisition ID which is required for the API call.",
232
+ "intent": "Should I include the skill Python? What is its impact on SLA, fill rate, and overall relevance?",
233
+ "difficulty": "easy",
234
+ "expected_output": {
235
+ "response": "Can you share the requisition ID (e.g., 05958BR or 05959BR)?",
236
+ "keywords": [
237
+ "requisition|req",
238
+ "ID|id|identifier",
239
+ "missing|without|share|provide|required"
240
+ ],
241
+ "tool_calls": []
242
+ }
243
+ },
244
+ {
245
+ "name": "task_6",
246
+ "description": "Recommends top-performing sources by combining SLA success, candidate volume, and funnel effectiveness. | Explanation: Each source received a weighted score (50 % SLA success, 30 % candidate volume share, 20 % offer-conversion rate). Dice and LinkedIn tied for top SLA (100 %) and high volume; GitHub's best-in-class conversion (2.8 %) offset its 80 % SLA. Indeed scored 0 on SLA and offers, so it was excluded. | Note: This benchmark tests multi-criteria decision-making and cross-API synthesis.",
+ "intent": "What are the best sources to prioritize for 05959BR?",
+ "difficulty": "hard",
+ "expected_output": {
+ "response": "You should prioritize Dice, GitHub, and LinkedIn. Dice and LinkedIn both met SLA 100% of the time and brought in 18% of all candidates. Dice had a strong offer conversion rate (2.7%), and GitHub had the highest conversion (2.8%) despite slightly lower SLA. Indeed should be avoided due to 0% SLA and 0% offer conversion.",
+ "keywords": [
+ "Dice",
+ "GitHub",
+ "LinkedIn",
+ "SLA",
+ "Indeed"
+ ],
+ "tool_calls": [
+ {
+ "name": "candidate_source_sla_per_source",
+ "args": {}
+ },
+ {
+ "name": "candidate_source_candidate_volume_by_source",
+ "args": {}
+ },
+ {
+ "name": "candidate_source_funnel_conversion_by_source",
+ "args": {}
+ }
+ ],
+ "tool_call_results": [
+ {
+ "name": "candidate_source_sla_per_source",
+ "result": {
+ "metrics": [
+ {
+ "source_name": "Indeed",
+ "sla_percentage": 0
+ },
+ {
+ "source_name": "CyberSec Jobs",
+ "sla_percentage": 70
+ },
+ {
+ "source_name": "GitHub",
+ "sla_percentage": 80
+ },
+ {
+ "source_name": "Internal",
+ "sla_percentage": 85
+ },
+ {
+ "source_name": "Dice",
+ "sla_percentage": 100
+ },
+ {
+ "source_name": "LinkedIn",
+ "sla_percentage": 100
+ },
+ {
+ "source_name": "Referral",
+ "sla_percentage": 100
+ }
+ ]
+ }
+ },
+ {
+ "name": "candidate_source_candidate_volume_by_source",
+ "result": {
+ "job_id": "05959BR",
+ "total_candidate_volume": 2913,
+ "metrics": [
+ {
+ "source_name": "Dice",
+ "candidate_volume": 525,
+ "percentage": 18
+ },
+ {
+ "source_name": "LinkedIn",
+ "candidate_volume": 525,
+ "percentage": 18
+ },
+ {
+ "source_name": "GitHub",
+ "candidate_volume": 465,
+ "percentage": 16
+ },
+ {
+ "source_name": "Internal",
+ "candidate_volume": 403,
+ "percentage": 14
+ },
+ {
+ "source_name": "Indeed",
+ "candidate_volume": 400,
+ "percentage": 14
+ },
+ {
+ "source_name": "Referral",
+ "candidate_volume": 400,
+ "percentage": 14
+ },
+ {
+ "source_name": "CyberSec Jobs",
+ "candidate_volume": 195,
+ "percentage": 7
+ }
+ ],
+ "heading": "For requisitions similar to 05959BR, there were 2913 candidates over the past three years. Here's how many candidates came from each source (with percentages from the total number):"
+ }
+ },
+ {
+ "name": "candidate_source_funnel_conversion_by_source",
+ "result": {
+ "job_id": "05959BR",
+ "metrics": [
+ {
+ "source_name": "CyberSec Jobs",
+ "first_round_review_percentage": 80.5,
+ "interview_rate": 18.5,
+ "offer_acceptance_rate": 3.1
+ },
+ {
+ "source_name": "Dice",
+ "first_round_review_percentage": 76.0,
+ "interview_rate": 9.9,
+ "offer_acceptance_rate": 2.7
+ },
+ {
+ "source_name": "GitHub",
+ "first_round_review_percentage": 72.0,
+ "interview_rate": 16.6,
+ "offer_acceptance_rate": 2.8
+ },
+ {
+ "source_name": "Indeed",
+ "first_round_review_percentage": 72.2,
+ "interview_rate": 14.8,
+ "offer_acceptance_rate": 0.0
+ },
+ {
+ "source_name": "Internal",
+ "first_round_review_percentage": 76.9,
+ "interview_rate": 19.6,
+ "offer_acceptance_rate": 2.5
+ },
+ {
+ "source_name": "LinkedIn",
+ "first_round_review_percentage": 70.1,
+ "interview_rate": 21.0,
+ "offer_acceptance_rate": 1.9
+ },
+ {
+ "source_name": "Referral",
+ "first_round_review_percentage": 74.5,
+ "interview_rate": 20.5,
+ "offer_acceptance_rate": 2.0
+ }
+ ]
+ }
+ }
+ ]
+ }
+ },
+ {
+ "name": "task_7",
+ "description": "Asks for the missing requisition id. | Explanation: The query lacks a requisition ID which is required for the API call.",
+ "intent": "Out of these skills — Python, Quantum Physics, Cyber Engineering, Risk Analysis, Wireshark — which ones negatively affect SLA performance?",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "Can you share the requisition ID (e.g., 05958BR or 05959BR)?",
+ "keywords": [
+ "requisition|req",
+ "ID|id|identifier",
+ "missing|without|share|provide|required"
+ ],
+ "tool_calls": []
+ }
+ },
+ {
+ "name": "task_8",
+ "description": "Returns the definition of the SLA metric for the given requisition. | Explanation: The definitions-and-methodology endpoint contains a JSON field \"sla\" holding the textual definition; the agent extracted that string verbatim. | Note: Tests the agent's ability to locate and return a specific definition.",
+ "intent": "How is the SLA metric defined for 05958BR?",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "SLA is defined as 'Percentage of candidates reviewed within the defined SLA window (e.g., 48 hours)'.",
+ "keywords": [
+ "SLA",
+ "Percentage",
+ "reviewed",
+ "window"
+ ],
+ "tool_calls": [
+ {
+ "name": "candidate_source_definitions_and_methodology",
+ "args": {}
+ }
+ ],
+ "tool_call_results": [
+ {
+ "name": "candidate_source_definitions_and_methodology",
+ "result": {
+ "job_id": "05958BR",
+ "definitions": {
+ "sla": "Percentage of candidates reviewed within the defined SLA window (e.g., 48 hours)",
+ "time_to_fill": "Average time from job posting to accepted offer",
+ "success_rate": "Ratio of candidates who accepted offers out of those interviewed"
+ },
+ "calculation_notes": "Metrics are computed from 1047 requisitions over the last 1.4 years. Funnel stats are based on system timestamps and recruiter actions in ATS.",
+ "top_metrics_considered": [
+ "SLA %",
+ "First round review %",
+ "Offer acceptance rate",
+ "Candidate volume",
+ "Total hires"
+ ]
+ }
+ }
+ ]
+ }
+ },
+ {
+ "name": "task_9",
+ "description": "Returns the number of requisitions used to compute the reported metrics. | Explanation: The methodology response includes a note like \"Metrics calculated over N = 1047 requisitions\"; the agent parsed the integer 1047 and returned it. | Note: Tests string parsing / information extraction from notes field.",
+ "intent": "How many requisitions were used to compute these metrics for 05958BR?",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "Metrics are computed from 1047 requisitions.",
+ "keywords": [
+ "1047",
+ "requisitions"
+ ],
+ "tool_calls": [
+ {
+ "name": "candidate_source_definitions_and_methodology",
+ "args": {}
+ }
+ ],
+ "tool_call_results": [
+ {
+ "name": "candidate_source_definitions_and_methodology",
+ "result": {
+ "job_id": "05958BR",
+ "definitions": {
+ "sla": "Percentage of candidates reviewed within the defined SLA window (e.g., 48 hours)",
+ "time_to_fill": "Average time from job posting to accepted offer",
+ "success_rate": "Ratio of candidates who accepted offers out of those interviewed"
+ },
+ "calculation_notes": "Metrics are computed from 1047 requisitions over the last 1.4 years. Funnel stats are based on system timestamps and recruiter actions in ATS.",
+ "top_metrics_considered": [
+ "SLA %",
+ "First round review %",
+ "Offer acceptance rate",
+ "Candidate volume",
+ "Total hires"
+ ]
+ }
+ }
+ ]
+ }
+ },
+ {
+ "name": "task_10",
+ "description": "Returns the list of top metrics considered for source evaluation. | Explanation: The agent read the \"top_metrics_considered\" array from the methodology API response and returned the metrics in the same order. | Note: Tests structured list extraction and formatting.",
+ "intent": "What are the top metrics considered when evaluating candidate sources for 05958BR?",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "The top metrics considered are: SLA %, First round review %, Offer acceptance rate, Candidate volume, Total hires.",
+ "keywords": [
+ "SLA",
+ "First round review",
+ "Offer acceptance",
+ "Candidate volume",
+ "Total hires"
+ ],
+ "tool_calls": [
+ {
+ "name": "candidate_source_definitions_and_methodology",
+ "args": {}
+ }
+ ],
+ "tool_call_results": [
+ {
+ "name": "candidate_source_definitions_and_methodology",
+ "result": {
+ "job_id": "05958BR",
+ "definitions": {
+ "sla": "Percentage of candidates reviewed within the defined SLA window (e.g., 48 hours)",
+ "time_to_fill": "Average time from job posting to accepted offer",
+ "success_rate": "Ratio of candidates who accepted offers out of those interviewed"
+ },
+ "calculation_notes": "Metrics are computed from 1047 requisitions over the last 1.4 years. Funnel stats are based on system timestamps and recruiter actions in ATS.",
+ "top_metrics_considered": [
+ "SLA %",
+ "First round review %",
+ "Offer acceptance rate",
+ "Candidate volume",
+ "Total hires"
+ ]
+ }
+ }
+ ]
+ }
+ },
+ {
+ "name": "task_11",
+ "description": "Loops through the provided list of models and reports which ones were used. | Explanation: The agent compared each provided model name against the \"models_involved\" array returned by data-sources-used API and reported matches (used) or non-matches (not used). | Note: Tests loop-based reasoning and partial matching for list membership.",
+ "intent": "Were the following models used to generate metrics for 05958BR: SLA impact regression model, Candidate ranking model, Skill relevance classifier?",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "Yes, 'SLA impact regression model' and 'Skill relevance classifier' were used. 'Candidate ranking model' was not listed among the models involved.",
+ "keywords": [
+ "SLA impact regression model",
+ "Skill relevance classifier",
+ "Candidate ranking model"
+ ],
+ "tool_calls": [
+ {
+ "name": "skills_data_sources_used",
+ "args": {}
+ }
+ ],
+ "tool_call_results": [
+ {
+ "name": "skills_data_sources_used",
+ "result": {
+ "requisition_id": "05958BR",
+ "datasets_used": [
+ "Historical hiring success data",
+ "Requisition skill tagging",
+ "Funnel conversion metrics",
+ "Candidate quality feedback"
+ ],
+ "models_involved": [
+ "Skill relevance classifier",
+ "SLA impact regression model",
+ "Funnel conversion recommender"
+ ]
+ }
+ }
+ ]
+ }
+ },
+ {
+ "name": "task_12",
+ "description": "Loops through the provided list of data sources and reports which ones were used. | Explanation: Each candidate data source was checked against the \"datasets_used\" array from data-sources-used API; two matched and one did not, which the agent reported accordingly. | Note: Tests loop-based reasoning and partial matching for list membership.",
+ "intent": "Were the following data sources used to compute the metrics for 05958BR: Historical hiring success data, Job description embeddings, Funnel conversion metrics?",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "Yes, 'Historical hiring success data' and 'Funnel conversion metrics' were used. 'Job description embeddings' was not listed among the data sources.",
+ "keywords": [
+ "Historical hiring success data",
+ "Funnel conversion metrics",
+ "Job description embeddings"
+ ],
+ "tool_calls": [
+ {
+ "name": "skills_data_sources_used",
+ "args": {}
+ }
+ ],
+ "tool_call_results": [
+ {
+ "name": "skills_data_sources_used",
+ "result": {
+ "requisition_id": "05958BR",
+ "datasets_used": [
+ "Historical hiring success data",
+ "Requisition skill tagging",
+ "Funnel conversion metrics",
+ "Candidate quality feedback"
+ ],
+ "models_involved": [
+ "Skill relevance classifier",
+ "SLA impact regression model",
+ "Funnel conversion recommender"
+ ]
+ }
+ }
+ ]
+ }
+ },
+ {
+ "name": "task_13",
+ "description": "Combines model lookup, retrieves actual SLA delta, and returns SLA definition. | Explanation: The SLA-impact API showed a 0 % delta for Python; data-sources-used API confirmed the 'SLA impact regression model' was involved; the methodology API supplied the formal SLA definition. These three pieces were combined into one coherent answer. | Note: Agent must combine numerical result (delta), model lookup, and formal definition into unified answer.",
+ "intent": "For 05958BR, when evaluating the SLA impact of Python, which models were used, what was the SLA delta, and what is the system definition of SLA?",
+ "difficulty": "hard",
+ "expected_output": {
+ "response": "'SLA impact regression model' was used. The SLA delta for Python was 0.0%. SLA is defined as 'Percentage of candidates reviewed within the defined SLA window (e.g., 48 hours)'.",
+ "keywords": [
+ "SLA impact regression model",
+ "0.0%|0.0 %|0.0|0%|0 %|0",
+ "SLA",
+ "Percentage",
+ "reviewed",
+ "window"
+ ],
+ "tool_calls": [
+ {
+ "name": "skills_skill_impact_sla",
+ "args": {}
+ },
+ {
+ "name": "skills_data_sources_used",
+ "args": {}
+ },
+ {
+ "name": "candidate_source_definitions_and_methodology",
+ "args": {}
+ }
+ ],
+ "tool_call_results": [
+ {
+ "name": "skills_skill_impact_sla",
+ "result": {
+ "requisition_id": "05958BR",
+ "skill_name": "Python",
+ "sla_achievement_with_skill": 90,
+ "sla_achievement_without_skill": 90,
+ "delta": 0
+ }
+ },
+ {
+ "name": "skills_data_sources_used",
+ "result": {
+ "requisition_id": "05958BR",
+ "datasets_used": [
+ "Historical hiring success data",
+ "Requisition skill tagging",
+ "Funnel conversion metrics",
+ "Candidate quality feedback"
+ ],
+ "models_involved": [
+ "Skill relevance classifier",
+ "SLA impact regression model",
+ "Funnel conversion recommender"
+ ]
+ }
+ },
+ {
+ "name": "candidate_source_definitions_and_methodology",
+ "result": {
+ "job_id": "05958BR",
+ "definitions": {
+ "sla": "Percentage of candidates reviewed within the defined SLA window (e.g., 48 hours)",
+ "time_to_fill": "Average time from job posting to accepted offer",
+ "success_rate": "Ratio of candidates who accepted offers out of those interviewed"
+ },
+ "calculation_notes": "Metrics are computed from 1047 requisitions over the last 1.4 years. Funnel stats are based on system timestamps and recruiter actions in ATS.",
+ "top_metrics_considered": [
+ "SLA %",
+ "First round review %",
+ "Offer acceptance rate",
+ "Candidate volume",
+ "Total hires"
+ ]
+ }
+ }
+ ]
+ }
+ },
+ {
+ "name": "task_14",
+ "description": "States that Risk Analysis negatively affects SLA and lists the datasets that informed the analysis. | Explanation: The skill-analysis API flagged Risk Analysis as negatively correlated with SLA. The data-sources-used API listed the four datasets underpinning the evaluation, and both results were consolidated in the response. | Note: Correlation wording corrected to match API ('highly negative impact on SLA').",
+ "intent": "Was 'Risk Analysis' considered historically effective, and what data sources informed this analysis for 05958BR?",
+ "difficulty": "medium",
+ "expected_output": {
+ "response": "'Risk Analysis' is **not** considered effective: historical analysis shows it is correlated with a **highly negative impact on SLA**. The evaluation used these data sources: Historical hiring success data, Requisition skill tagging, Funnel conversion metrics, and Candidate quality feedback.",
+ "keywords": [
+ "Risk Analysis",
+ "not",
+ "effective",
+ "highly negative impact on SLA",
+ "SLA",
+ "Historical hiring success data",
+ "Requisition skill tagging"
+ ],
+ "tool_calls": [
+ {
+ "name": "skills_skill_analysis",
+ "args": {}
+ },
+ {
+ "name": "skills_data_sources_used",
+ "args": {}
+ }
+ ],
+ "tool_call_results": [
+ {
+ "name": "skills_skill_analysis",
+ "result": {
+ "historical_jobs": 40,
+ "input_skills": [],
+ "historical_skills_with_analysis": [
+ {
+ "name": "AWS",
+ "skill_occurrence": 948,
+ "correlation": "slightly positive impact on SLA"
+ },
+ {
+ "name": "IT Support",
+ "skill_occurrence": 868,
+ "correlation": "slightly positive impact on SLA"
+ },
+ {
+ "name": "NIST Cybersecurity Framework",
+ "skill_occurrence": 816,
+ "correlation": "slightly negative impact on SLA"
+ },
+ {
+ "name": "Incident Management",
+ "skill_occurrence": 748,
+ "correlation": "slightly negative impact on SLA"
+ },
+ {
+ "name": "Firewalls",
+ "skill_occurrence": 744,
+ "correlation": "slightly negative impact on SLA"
+ },
+ {
+ "name": "Cloud Security",
+ "skill_occurrence": 592,
+ "correlation": "slightly negative impact on SLA"
+ },
+ {
+ "name": "Risk Assessment",
+ "skill_occurrence": 570,
+ "correlation": "slightly negative impact on SLA"
+ },
+ {
+ "name": "Risk Analysis",
+ "skill_occurrence": 371,
+ "correlation": "highly negative impact on SLA"
+ },
+ {
+ "name": "Microsoft PowerShell",
+ "skill_occurrence": 289,
+ "correlation": "slightly negative impact on SLA"
+ },
+ {
+ "name": "Python",
+ "skill_occurrence": 200,
+ "correlation": "slightly negative impact on SLA"
+ }
+ ]
+ }
+ },
+ {
+ "name": "skills_data_sources_used",
+ "result": {
+ "requisition_id": "05958BR",
+ "datasets_used": [
+ "Historical hiring success data",
+ "Requisition skill tagging",
+ "Funnel conversion metrics",
+ "Candidate quality feedback"
+ ],
+ "models_involved": [
+ "Skill relevance classifier",
+ "SLA impact regression model",
+ "Funnel conversion recommender"
+ ]
+ }
+ }
+ ]
+ }
+ },
+ {
+ "name": "task_15",
+ "description": "Explains that no provided skills hurt SLA, no provided sources require avoidance, and lists datasets used. | Explanation: The agent looped over each skill, querying SLA-impact and skill-analysis APIs; none yielded a negative delta. It then looped over each source, combining SLA-per-source and funnel-conversion data; all three met both SLA and offer benchmarks, so none were flagged. Finally, the datasets_used array was fetched once and reported. | Note: Fully aligned with current API payloads—no unsupported claims.",
+ "intent": "For requisition 05958BR, which of the following skills — Python, Quantum Physics, Wireshark — negatively impact SLA, and which of the following sources — Dice, GitHub, Referral — should be avoided based on low offer conversion or SLA performance? Also, which data sources were used to compute these insights?",
+ "difficulty": "hard",
+ "expected_output": {
+ "response": "None of the listed skills show a negative SLA impact in the available data: Python's SLA delta is 0 %, and Quantum Physics and Wireshark are not present in the analysis. Likewise, Dice (SLA 95 %, offer-acceptance 2.7 %), GitHub (SLA 90 %, offer-acceptance 2.8 %), and Referral (SLA 95 %, offer-acceptance 2.0 %) all perform satisfactorily, so none of them need to be avoided. These insights were produced using the following datasets: Historical hiring success data, Requisition skill tagging, Funnel conversion metrics, and Candidate quality feedback.",
+ "keywords": [
+ "Python",
+ "Quantum Physics",
+ "Wireshark",
+ "Dice",
+ "GitHub",
+ "Referral",
+ "SLA",
+ "0%|0 %|0",
+ "95%|95 %|95",
+ "2.7%|2.7 %|2.7",
+ "90%|90 %|90",
+ "2.8%|2.8 %|2.8",
+ "2.0%|2.0 %|2.0",
+ "Historical hiring success data",
+ "Requisition skill tagging",
+ "Funnel conversion metrics",
+ "Candidate quality feedback"
+ ],
+ "tool_calls": [
+ {
+ "name": "skills_skill_impact_sla",
+ "args": {}
+ },
+ {
+ "name": "skills_skill_analysis",
+ "args": {}
+ },
+ {
+ "name": "skills_data_sources_used",
+ "args": {}
+ },
+ {
+ "name": "candidate_source_funnel_conversion_by_source",
+ "args": {}
+ },
+ {
+ "name": "candidate_source_sla_per_source",
+ "args": {}
+ }
+ ],
+ "tool_call_results": [
+ {
+ "name": "skills_skill_impact_sla",
+ "result": {
+ "requisition_id": "05958BR",
+ "skill_name": "Python",
+ "sla_achievement_with_skill": 90,
+ "sla_achievement_without_skill": 90,
+ "delta": 0
+ }
+ },
+ {
+ "name": "skills_skill_analysis",
+ "result": {
+ "historical_jobs": 40,
+ "input_skills": [],
+ "historical_skills_with_analysis": [
+ {
+ "name": "AWS",
+ "skill_occurrence": 948,
+ "correlation": "slightly positive impact on SLA"
+ },
+ {
+ "name": "IT Support",
+ "skill_occurrence": 868,
+ "correlation": "slightly positive impact on SLA"
+ },
+ {
+ "name": "NIST Cybersecurity Framework",
+ "skill_occurrence": 816,
+ "correlation": "slightly negative impact on SLA"
+ },
+ {
+ "name": "Incident Management",
+ "skill_occurrence": 748,
+ "correlation": "slightly negative impact on SLA"
+ },
+ {
+ "name": "Firewalls",
+ "skill_occurrence": 744,
+ "correlation": "slightly negative impact on SLA"
+ },
+ {
+ "name": "Cloud Security",
+ "skill_occurrence": 592,
+ "correlation": "slightly negative impact on SLA"
+ },
+ {
+ "name": "Risk Assessment",
+ "skill_occurrence": 570,
+ "correlation": "slightly negative impact on SLA"
+ },
+ {
+ "name": "Risk Analysis",
+ "skill_occurrence": 371,
+ "correlation": "highly negative impact on SLA"
+ },
+ {
+ "name": "Microsoft PowerShell",
+ "skill_occurrence": 289,
+ "correlation": "slightly negative impact on SLA"
+ },
+ {
+ "name": "Python",
+ "skill_occurrence": 200,
+ "correlation": "slightly negative impact on SLA"
+ }
+ ]
+ }
+ },
+ {
+ "name": "skills_data_sources_used",
+ "result": {
+ "requisition_id": "05958BR",
+ "datasets_used": [
+ "Historical hiring success data",
+ "Requisition skill tagging",
+ "Funnel conversion metrics",
+ "Candidate quality feedback"
+ ],
+ "models_involved": [
+ "Skill relevance classifier",
+ "SLA impact regression model",
+ "Funnel conversion recommender"
+ ]
+ }
+ },
+ {
+ "name": "candidate_source_funnel_conversion_by_source",
+ "result": {
+ "job_id": "05958BR",
+ "metrics": [
+ {
+ "source_name": "CyberSec Jobs",
+ "first_round_review_percentage": 80.5,
+ "interview_rate": 19.0,
+ "offer_acceptance_rate": 3.0
+ },
+ {
+ "source_name": "Dice",
+ "first_round_review_percentage": 11.0,
+ "interview_rate": 6.8,
+ "offer_acceptance_rate": 2.7
+ },
+ {
+ "source_name": "GitHub",
+ "first_round_review_percentage": 76.1,
+ "interview_rate": 23.7,
+ "offer_acceptance_rate": 2.8
+ },
+ {
+ "source_name": "Indeed",
+ "first_round_review_percentage": 77.1,
+ "interview_rate": 22.0,
+ "offer_acceptance_rate": 0.0
+ },
+ {
+ "source_name": "Internal",
+ "first_round_review_percentage": 74.0,
+ "interview_rate": 18.5,
+ "offer_acceptance_rate": 2.5
+ },
+ {
+ "source_name": "LinkedIn",
+ "first_round_review_percentage": 75.1,
+ "interview_rate": 20.4,
+ "offer_acceptance_rate": 1.9
+ },
+ {
+ "source_name": "Referral",
+ "first_round_review_percentage": 70.0,
+ "interview_rate": 17.0,
+ "offer_acceptance_rate": 2.0
+ }
+ ]
+ }
+ },
+ {
+ "name": "candidate_source_sla_per_source",
+ "result": {
+ "metrics": [
+ {
+ "source_name": "CyberSec Jobs",
+ "sla_percentage": 67
+ },
+ {
+ "source_name": "Indeed",
+ "sla_percentage": 86
+ },
+ {
+ "source_name": "GitHub",
+ "sla_percentage": 90
+ },
+ {
+ "source_name": "Dice",
+ "sla_percentage": 95
+ },
+ {
+ "source_name": "Internal",
+ "sla_percentage": 95
+ },
+ {
+ "source_name": "LinkedIn",
+ "sla_percentage": 95
+ },
+ {
+ "source_name": "Referral",
+ "sla_percentage": 95
+ }
+ ]
+ }
+ }
+ ]
+ }
+ },
+ {
+ "name": "task_16",
+ "description": "Apologises for missing capability; optionally provides an illustrative optimisation if the feature existed. | Explanation: No endpoint returns or accepts free-text job descriptions, so optimisation is unsupported. The hypothetical section shows what the answer could look like if such an endpoint were added. | Note: Demonstrates graceful degradation plus a forward-looking example.",
+ "intent": "Help me optimize the job description for 05959BR.",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "Sorry — the available APIs do not expose the current job-description text, so I cannot directly optimise it.",
+ "keywords": [
+ "APIs|API",
+ "job-description|job description",
+ "cannot|can't"
+ ],
+ "tool_calls": []
+ }
+ },
+ },
1047
+ {
1048
+ "name": "task_17",
1049
+ "description": "Prompts the user for the missing job ID instead of guessing. | Explanation: Illustrates a clarification turn when a required parameter (requisition_id) is missing. | Note: Tests conversational error-handling with zero API usage.",
1050
+ "intent": "Which sourcing channel is the most effective for this job?",
1051
+ "difficulty": "easy",
1052
+ "expected_output": {
1053
+ "response": "Can you share the requisition ID (e.g., 05958BR or 05959BR)?",
1054
+ "keywords": [
1055
+ "requisition|req",
1056
+ "ID|id|identifier",
1057
+ "missing|without|share|provide|required"
1058
+ ],
1059
+ "tool_calls": []
1060
+ }
1061
+ },
1062
+ {
+ "name": "task_18",
+ "description": "Explains unsupported request; shows what a successful answer might include. | Explanation: No endpoints expose hiring-manager data; a possible future answer is sketched for context. | Note: Covers gap #3: hiring-manager analytics.",
+ "intent": "Who's the hiring manager for 05959BR and how responsive is she?",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "Sorry — current APIs don't return hiring-manager names or responsiveness metrics, so I can't answer that.",
+ "keywords": [
+ "APIs",
+ "hiring-manager",
+ "responsive",
+ "can't|cannot"
+ ],
+ "tool_calls": []
+ }
+ },
+ {
+ "name": "task_19",
+ "description": "States limitation; optional mock table shows desired granularity. | Explanation: The existing funnel-conversion API only returns percentages by source, not absolute counts or durations per stage. | Note: Covers gap #4: full funnel metrics.",
+ "intent": "Show me the average candidate counts and days spent in each funnel stage for roles like 05959BR.",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "I'm sorry, but none of the available APIs provide stage-by-stage candidate counts or time-in-status metrics, so I can't generate a funnel table.",
+ "keywords": [
+ "APIs|API",
+ "stage",
+ "candidate counts",
+ "time-in-status",
+ "funnel"
+ ],
+ "tool_calls": []
+ }
+ },
+ {
+ "name": "task_20",
+ "description": "Returns start/end dates, last update date, and requisition count. | Explanation: Pulled time_frame_start, time_frame_end, data_last_updated and total_requisitions_analysed from the metadata-and-timeframe endpoint. | Note: Demonstrates a fully supported recency / frequency query (gap #5).",
+ "intent": "What's the data timeframe for 05958BR and how many similar requisitions were analysed?",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "The metrics cover 9 Oct 2023 – 15 Mar 2025 (≈ 18 months) and were last updated on 29 Apr 2025. A total of 40 similar requisitions were analysed.",
+ "keywords": [
+ "9 Oct 2023|Oct 9, 2023",
+ "15 Mar 2025|Mar 15, 2025",
+ "29 Apr 2025|Apr 29, 2025",
+ "40",
+ "requisitions"
+ ],
+ "tool_calls": [
+ {
+ "name": "candidate_source_metadata_and_timeframe",
+ "args": {}
+ }
+ ],
+ "tool_call_results": [
+ {
+ "name": "candidate_source_metadata_and_timeframe",
+ "result": {
+ "job_id": "05958BR",
+ "time_frame_start": "2023-10-09",
+ "time_frame_end": "2025-03-15",
+ "data_last_updated": "2025-04-29",
+ "total_requisitions_analysed": 40
+ }
+ }
+ ]
+ }
+ },
+ {
+ "name": "task_21",
+ "description": "Gracefully declines unsupported analysis; shows sample comparative table. | Explanation: Only the definition endpoint describes *what* time-to-fill means; no endpoint provides per-source values to compute deltas. | Note: Covers gap #6.",
+ "intent": "How does average time-to-fill compare to SLA for each sourcing channel?",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "I'm afraid I can't provide that comparison because the current APIs don't expose time-to-fill broken down by source.",
+ "keywords": [
+ "APIs",
+ "time-to-fill",
+ "source"
+ ],
+ "tool_calls": []
+ }
+ },
+ {
+ "name": "task_22",
+ "description": "Explains unsupported geographic/channel filter and gives mock estimation. | Explanation: No endpoints accept country/channel parameters; the example illustrates desired behaviour. | Note: Covers gap #7: region & channel-specific analytics.",
+ "intent": "If we posted this role internally in France only, how long would it take to fill?",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "Current APIs don't allow filtering by geography or posting channel, so I'm unable to estimate a France-only internal posting timeline.",
+ "keywords": [
+ "Sorry|no|unable|can't|cannot",
+ "APIs|API|filtering|filter",
+ "France|geography|geographic|region|channel",
+ "internal|posting"
+ ],
+ "tool_calls": []
+ }
+ },
+ {
+ "name": "task_23",
+ "description": "States limitation and shows example list of near-deadline requisitions. | Explanation: No endpoint surfaces open requisitions with SLA dates. The hypothetical section demonstrates the desired list format. | Note: Covers gap #8 and introduces a potential future endpoint.",
+ "intent": "List all requisitions that are within 30 days of their SLA deadline.",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "Sorry — the API suite doesn't provide live requisition status or SLA countdowns, so I can't generate that list.",
+ "keywords": [
+ "API|APIs",
+ "SLA",
+ "requisition|req",
+ "status",
+ "countdown|countdowns",
+ "deadline|list"
+ ],
+ "tool_calls": []
+ }
+ },
+ {
+ "name": "task_24",
+ "description": "Graceful 'ID not found' message with suggested alternatives. | Explanation: Because 05960BR does not exist, the assistant returns a polite error plus four close-match IDs (simulating fuzzy search in the ATS). No API call is made for a bad ID. | Note: Error-handling scenario for invalid requisition IDs.",
+ "intent": "Show candidate funnel for job id 05960BR",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "No job can be found with the ID 05960BR.\nDid you want to use one of the following job IDs instead?\n• UZLXBR — Sourcing Manager\n• F50HBR — Offering Manager\n• MJZ1BR — Offering Manager\n• 5TTKBR — Delivery Analyst",
+ "keywords": [
+ "05960BR",
+ "No job",
+ "can be found|not found"
+ ],
+ "tool_calls": []
+ }
+ },
+ {
+ "name": "task_25",
+ "description": "Explains unsupported request and sketches desired output. | Explanation: There is no /job-details/ endpoint. The hypothetical section illustrates what the response would look like if such an endpoint became available. | Note: Completes coverage for 'full requisition card' requests.",
+ "intent": "Show me the details of UZLXBR",
+ "difficulty": "easy",
+ "expected_output": {
+ "response": "Sorry — none of the current APIs provide full job-card details (title, location, hiring-manager email, etc.), so I can't display that information.",
+ "keywords": [
+ "APIs",
+ "job-card",
+ "details"
+ ],
+ "tool_calls": []
+ }
+ },
+ {
1209
+ "name": "task_26",
1210
+ "description": "Returns average candidate count for comparable requisitions. | Explanation: candidate-volume-by-source returns `total_candidate_volume = 2913`; metadata-and-timeframe shows `total_requisitions_analysed = 40`. Dividing 2913 ÷ 40 ≈ 73 yields the average. | Note: Covers the repeated average candidate volume questions.",
1211
+ "intent": "How many candidates do we usually get for postings similar to 05959BR?",
1212
+ "difficulty": "medium",
1213
+ "expected_output": {
1214
+ "response": "On average, similar postings attract **73 candidates**.",
1215
+ "keywords": [
1216
+ "73",
1217
+ "candidates",
1218
+ "average"
1219
+ ],
1220
+ "tool_calls": [
1221
+ {
1222
+ "name": "candidate_source_candidate_volume_by_source",
1223
+ "args": {}
1224
+ },
1225
+ {
1226
+ "name": "candidate_source_metadata_and_timeframe",
1227
+ "args": {}
1228
+ }
1229
+ ],
1230
+ "tool_call_results": [
1231
+ {
1232
+ "name": "candidate_source_candidate_volume_by_source",
1233
+ "result": {
1234
+ "job_id": "05959BR",
1235
+ "total_candidate_volume": 2913,
1236
+ "metrics": [
1237
+ {
1238
+ "source_name": "Dice",
1239
+ "candidate_volume": 525,
1240
+ "percentage": 18
1241
+ },
1242
+ {
1243
+ "source_name": "LinkedIn",
1244
+ "candidate_volume": 525,
1245
+ "percentage": 18
1246
+ },
1247
+ {
1248
+ "source_name": "GitHub",
1249
+ "candidate_volume": 465,
1250
+ "percentage": 16
1251
+ },
1252
+ {
1253
+ "source_name": "Internal",
1254
+ "candidate_volume": 403,
1255
+ "percentage": 14
1256
+ },
1257
+ {
1258
+ "source_name": "Indeed",
1259
+ "candidate_volume": 400,
1260
+ "percentage": 14
1261
+ },
1262
+ {
1263
+ "source_name": "Referral",
1264
+ "candidate_volume": 400,
1265
+ "percentage": 14
1266
+ },
1267
+ {
1268
+ "source_name": "CyberSec Jobs",
1269
+ "candidate_volume": 195,
1270
+ "percentage": 7
1271
+ }
1272
+ ],
1273
+ "heading": "For requisitions similar to 05959BR, there were 2913 candidates over the past three years. Here's how many candidates came from each source (with percentages from the total number):"
1274
+ }
1275
+ },
1276
+ {
1277
+ "name": "candidate_source_metadata_and_timeframe",
1278
+ "result": {
1279
+ "job_id": "05959BR",
1280
+ "time_frame_start": "2023-10-09",
1281
+ "time_frame_end": "2025-03-15",
1282
+ "data_last_updated": "2025-04-29",
1283
+ "total_requisitions_analysed": 40
1284
+ }
1285
+ }
1286
+ ]
1287
+ }
1288
+ }
1289
+ ]
1290
+ }
1291
+ ]
data/tasks_edge_cases.json ADDED
The diff for this file is too large to render. See raw diff
 
data/tasks_http_errors.json ADDED
@@ -0,0 +1,237 @@
1
+ [
2
+ {
3
+ "name": "bpo-benchmark",
4
+ "user_info": [],
5
+ "test_cases": [
6
+ {
7
+ "name": "task_30",
8
+ "description": "Tests agent handling of probabilistic 404 errors (20% chance, seeded RNG). The candidate_source_candidate_pipeline_status tool randomly returns 404 errors. Agent should retry or report the error gracefully.",
9
+ "intent": "What's the candidate pipeline status for 05958BR?",
10
+ "difficulty": "hard",
11
+ "expected_output": {
12
+ "response": "The candidate pipeline for 05958BR shows the following distribution by source: LinkedIn, Dice, Indeed, GitHub, Referral, CyberSec Jobs, and TechCareers all have active candidates in the pipeline.",
13
+ "keywords": [
14
+ "pipeline|status|candidates",
15
+ "05958BR"
16
+ ],
17
+ "tool_calls": [
18
+ {
19
+ "name": "candidate_source_candidate_pipeline_status",
20
+ "args": {
21
+ "requisition_id": "05958BR"
22
+ }
23
+ }
24
+ ],
25
+ "error_handling": {
26
+ "error_type": "http_404_probabilistic",
27
+ "should_report_error": false,
28
+ "should_retry": true,
29
+ "expected_behavior": "Agent should retry on 404 or report temporary unavailability"
30
+ },
31
+ "tool_call_results": [
32
+ {
33
+ "name": "candidate_source_candidate_pipeline_status",
34
+ "result": {
35
+ "requisition_id": "05958BR",
36
+ "pipeline": {
37
+ "CyberSec Jobs": 200,
38
+ "Dice": 516,
39
+ "GitHub": 468,
40
+ "Indeed": 410,
41
+ "Internal": 400,
42
+ "LinkedIn": 519,
43
+ "Referral": 400
44
+ },
45
+ "total_candidates": 2913
46
+ }
47
+ }
48
+ ]
49
+ }
50
+ },
51
+ {
52
+ "name": "task_31",
53
+ "description": "Tests agent handling of HTTP 500 error that includes valid data in the response body. The candidate_source_source_sla_check tool returns error status 500 but includes correct SLA metrics in the body.",
54
+ "intent": "Run a quick SLA check across all sources for 05958BR",
55
+ "difficulty": "hard",
56
+ "expected_output": {
57
+ "response": "SLA check results for 05958BR across sources: CyberSec Jobs 67%, Indeed 72%, TechCareers 74%, GitHub 78%, LinkedIn 79%, Dice 80%, Referral 82%. Note: the data was retrieved despite a server error status.",
58
+ "keywords": [
59
+ "SLA",
60
+ "CyberSec Jobs|Indeed|Dice",
61
+ "67|72|74|78|79|80|82"
62
+ ],
63
+ "tool_calls": [
64
+ {
65
+ "name": "candidate_source_source_sla_check",
66
+ "args": {
67
+ "requisition_id": "05958BR"
68
+ }
69
+ }
70
+ ],
71
+ "error_handling": {
72
+ "error_type": "http_500_with_body",
73
+ "should_report_error": true,
74
+ "should_retry": false,
75
+ "expected_behavior": "Agent should use response body despite 500 status and optionally note the error"
76
+ },
77
+ "tool_call_results": [
78
+ {
79
+ "name": "candidate_source_source_sla_check",
80
+ "result": {
81
+ "status_code": 500,
82
+ "error": true,
83
+ "message": "Internal server error",
84
+ "body": {
85
+ "metrics": [
86
+ {
87
+ "source_name": "CyberSec Jobs",
88
+ "sla_percentage": 67
89
+ },
90
+ {
91
+ "source_name": "Dice",
92
+ "sla_percentage": 95
93
+ },
94
+ {
95
+ "source_name": "GitHub",
96
+ "sla_percentage": 90
97
+ },
98
+ {
99
+ "source_name": "Indeed",
100
+ "sla_percentage": 86
101
+ },
102
+ {
103
+ "source_name": "Internal",
104
+ "sla_percentage": 95
105
+ },
106
+ {
107
+ "source_name": "LinkedIn",
108
+ "sla_percentage": 95
109
+ },
110
+ {
111
+ "source_name": "Referral",
112
+ "sla_percentage": 95
113
+ }
114
+ ]
115
+ }
116
+ }
117
+ }
118
+ ]
119
+ }
120
+ },
121
+ {
122
+ "name": "task_32",
123
+ "description": "Tests agent handling of HTTP 503 Service Unavailable. The candidate_source_funnel_status tool always returns 503 with maintenance info.",
124
+ "intent": "What's the current funnel status for 05958BR?",
125
+ "difficulty": "medium",
126
+ "expected_output": {
127
+ "response": "The funnel status service is currently unavailable due to maintenance. Please try again later.",
128
+ "keywords": [
129
+ "unavailable|maintenance|temporarily",
130
+ "funnel|service",
131
+ "later|retry"
132
+ ],
133
+ "tool_calls": [
134
+ {
135
+ "name": "candidate_source_funnel_status",
136
+ "args": {
137
+ "requisition_id": "05958BR"
138
+ }
139
+ }
140
+ ],
141
+ "error_handling": {
142
+ "error_type": "http_503",
143
+ "should_report_error": true,
144
+ "should_retry": false,
145
+ "expected_behavior": "Agent should report service unavailable with retry info"
146
+ },
147
+ "tool_call_results": [
148
+ {
149
+ "name": "candidate_source_funnel_status",
150
+ "result": {
151
+ "status_code": 503,
152
+ "error": true,
153
+ "message": "Service temporarily unavailable. The funnel analytics engine is undergoing maintenance.",
154
+ "retry_after_seconds": 300,
155
+ "expected_recovery": "2025-05-01T12:00:00Z"
156
+ }
157
+ }
158
+ ]
159
+ }
160
+ },
161
+ {
162
+ "name": "task_33",
163
+ "description": "Tests agent handling of HTTP 429 rate limiting. The candidate_source_bulk_source_data tool returns 429 after the 3rd call in a session.",
164
+ "intent": "Pull bulk source data for all requisitions starting with 05958BR",
165
+ "difficulty": "hard",
166
+ "expected_output": {
167
+ "response": "Bulk source data for 05958BR shows candidate and hire counts across all sourcing channels including LinkedIn, Dice, Indeed, GitHub, Referral, CyberSec Jobs, and TechCareers.",
168
+ "keywords": [
169
+ "source|sources",
170
+ "candidates|data",
171
+ "05958BR"
172
+ ],
173
+ "tool_calls": [
174
+ {
175
+ "name": "candidate_source_bulk_source_data",
176
+ "args": {
177
+ "requisition_id": "05958BR"
178
+ }
179
+ }
180
+ ],
181
+ "error_handling": {
182
+ "error_type": "http_429",
183
+ "should_report_error": false,
184
+ "should_retry": false,
185
+ "expected_behavior": "Agent should respect rate limits and use available data"
186
+ },
187
+ "tool_call_results": [
188
+ {
189
+ "name": "candidate_source_bulk_source_data",
190
+ "result": {
191
+ "requisition_id": "05958BR",
192
+ "sources": {
193
+ "CyberSec Jobs": {
194
+ "total_candidates": 200,
195
+ "total_hires": 3,
196
+ "reviewed": 161
197
+ },
198
+ "Dice": {
199
+ "total_candidates": 516,
200
+ "total_hires": 11,
201
+ "reviewed": 57
202
+ },
203
+ "GitHub": {
204
+ "total_candidates": 468,
205
+ "total_hires": 10,
206
+ "reviewed": 356
207
+ },
208
+ "Indeed": {
209
+ "total_candidates": 410,
210
+ "total_hires": 0,
211
+ "reviewed": 316
212
+ },
213
+ "Internal": {
214
+ "total_candidates": 400,
215
+ "total_hires": 5,
216
+ "reviewed": 296
217
+ },
218
+ "LinkedIn": {
219
+ "total_candidates": 519,
220
+ "total_hires": 7,
221
+ "reviewed": 390
222
+ },
223
+ "Referral": {
224
+ "total_candidates": 400,
225
+ "total_hires": 4,
226
+ "reviewed": 280
227
+ }
228
+ },
229
+ "call_number": 1
230
+ }
231
+ }
232
+ ]
233
+ }
234
+ }
235
+ ]
236
+ }
237
+ ]
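The error-handling expectations encoded above (retry a transient 404, use the body of a 500 that still carries data, surface 503 maintenance info rather than retrying) can be sketched as a small wrapper. The function and stub names here are hypothetical, not part of the benchmark harness:

```python
def call_with_retry(tool, args, max_retries=3):
    """Sketch of the expected_behavior fields for tasks 30-32."""
    for _ in range(max_retries):
        result = tool(args)
        if isinstance(result, dict) and result.get("error"):
            status = result.get("status_code")
            if status == 404:          # probabilistic 404: retry (task_30)
                continue
            if status == 500 and "body" in result:
                return result["body"]  # usable data despite the status (task_31)
            if status == 503:          # maintenance: report, don't retry (task_32)
                return {"unavailable": True,
                        "retry_after": result.get("retry_after_seconds")}
        return result
    return {"unavailable": True}

# Deterministic stand-in: one 404, then a successful payload.
calls = []
def flaky_tool(args):
    calls.append(args)
    if len(calls) == 1:
        return {"status_code": 404, "error": True}
    return {"requisition_id": args["requisition_id"], "total_candidates": 2913}

out = call_with_retry(flaky_tool, {"requisition_id": "05958BR"})
print(out["total_candidates"])  # 2913, after one retry
```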
data/tasks_schema_violations.json ADDED
@@ -0,0 +1,265 @@
1
+ [
2
+ {
3
+ "name": "bpo-benchmark",
4
+ "user_info": [],
5
+ "test_cases": [
6
+ {
7
+ "name": "task_34",
8
+ "description": "Tests agent handling of untyped/unschema'd response. The skills_model_registry tool returns a plain dict with no Pydantic schema, including nested model objects with varying fields.",
9
+ "intent": "What ML models are registered for 05958BR?",
10
+ "difficulty": "medium",
11
+ "expected_output": {
12
+ "response": "The following ML models are registered for 05958BR: Skill relevance classifier (v2.1.0, active), SLA impact regression model (v1.4.2, active), and Funnel conversion recommender (v3.0.0-beta, staging).",
13
+ "keywords": [
14
+ "Skill relevance classifier",
15
+ "SLA impact regression model",
16
+ "Funnel conversion recommender",
17
+ "active|staging"
18
+ ],
19
+ "tool_calls": [
20
+ {
21
+ "name": "skills_model_registry",
22
+ "args": {
23
+ "requisition_id": "05958BR"
24
+ }
25
+ }
26
+ ],
27
+ "error_handling": {
28
+ "error_type": "missing_output_schema",
29
+ "should_report_error": false,
30
+ "should_retry": false,
31
+ "expected_behavior": "Agent should infer structure from the untyped response and present model info"
32
+ },
33
+ "tool_call_results": [
34
+ {
35
+ "name": "skills_model_registry",
36
+ "result": {
37
+ "requisition_id": "05958BR",
38
+ "models": [
39
+ {
40
+ "name": "Skill relevance classifier",
41
+ "version": "2.1.0",
42
+ "status": "active",
43
+ "last_trained": "2024-11-15",
44
+ "accuracy": 0.87
45
+ },
46
+ {
47
+ "name": "SLA impact regression model",
48
+ "version": "1.4.2",
49
+ "status": "active",
50
+ "last_trained": "2024-10-01",
51
+ "r_squared": 0.72
52
+ },
53
+ {
54
+ "name": "Funnel conversion recommender",
55
+ "version": "3.0.0-beta",
56
+ "status": "staging",
57
+ "last_trained": "2025-01-20",
58
+ "precision": 0.81
59
+ }
60
+ ],
61
+ "registry_updated": "2025-04-29"
62
+ }
63
+ }
64
+ ]
65
+ }
66
+ },
67
+ {
68
+ "name": "task_35",
69
+ "description": "Tests agent handling of undocumented/extra input parameters. The skills_skill_lookup tool accepts parameters not described in the tool schema (include_history, format).",
70
+ "intent": "Look up the skill Python for requisition 05958BR",
71
+ "difficulty": "medium",
72
+ "expected_output": {
73
+ "response": "Python for requisition 05958BR has an occurrence count across similar candidates, showing its prevalence in the candidate pool.",
74
+ "keywords": [
75
+ "Python",
76
+ "05958BR",
77
+ "occurrence|count|rate"
78
+ ],
79
+ "tool_calls": [
80
+ {
81
+ "name": "skills_skill_lookup",
82
+ "args": {
83
+ "requisition_id": "05958BR",
84
+ "skill_name": "Python"
85
+ }
86
+ }
87
+ ],
88
+ "error_handling": {
89
+ "error_type": "missing_input_schema",
90
+ "should_report_error": false,
91
+ "should_retry": false,
92
+ "expected_behavior": "Agent should infer required parameters and call the tool correctly"
93
+ },
94
+ "tool_call_results": [
95
+ {
96
+ "name": "skills_skill_lookup",
97
+ "result": {
98
+ "requisition_id": "05958BR",
99
+ "skill_name": "Python",
100
+ "occurrence_count": 200,
101
+ "total_candidates": 2913,
102
+ "occurrence_rate": 6.9
103
+ }
104
+ }
105
+ ]
106
+ }
107
+ },
108
+ {
109
+ "name": "task_36",
110
+ "description": "Tests agent handling of response with missing required fields. The candidate_source_source_metrics_lite tool returns metrics entries missing the source_name field.",
111
+ "intent": "Get a lightweight summary of source metrics for 05958BR",
112
+ "difficulty": "hard",
113
+ "expected_output": {
114
+ "response": "Source metrics for 05958BR show candidate counts and hire counts per source. Note: some source identification data may be incomplete in the lightweight view.",
115
+ "keywords": [
116
+ "metrics|source",
117
+ "candidate|hire",
118
+ "05958BR"
119
+ ],
120
+ "tool_calls": [
121
+ {
122
+ "name": "candidate_source_source_metrics_lite",
123
+ "args": {
124
+ "requisition_id": "05958BR"
125
+ }
126
+ }
127
+ ],
128
+ "error_handling": {
129
+ "error_type": "missing_fields",
130
+ "should_report_error": true,
131
+ "should_retry": false,
132
+ "expected_behavior": "Agent should handle partial data and note missing source names"
133
+ },
134
+ "tool_call_results": [
135
+ {
136
+ "name": "candidate_source_source_metrics_lite",
137
+ "result": {
138
+ "requisition_id": "05958BR",
139
+ "metrics": [
140
+ {
141
+ "candidate_count": 200,
142
+ "hire_count": 3,
143
+ "sla_met_count": 108
144
+ },
145
+ {
146
+ "candidate_count": 516,
147
+ "hire_count": 11,
148
+ "sla_met_count": 54
149
+ },
150
+ {
151
+ "candidate_count": 468,
152
+ "hire_count": 10,
153
+ "sla_met_count": 320
154
+ },
155
+ {
156
+ "candidate_count": 410,
157
+ "hire_count": 0,
158
+ "sla_met_count": 272
159
+ },
160
+ {
161
+ "candidate_count": 400,
162
+ "hire_count": 5,
163
+ "sla_met_count": 281
164
+ },
165
+ {
166
+ "candidate_count": 519,
167
+ "hire_count": 7,
168
+ "sla_met_count": 370
169
+ },
170
+ {
171
+ "candidate_count": 400,
172
+ "hire_count": 4,
173
+ "sla_met_count": 266
174
+ }
175
+ ],
176
+ "note": "Lightweight view — some fields may be omitted for performance."
177
+ }
178
+ }
179
+ ]
180
+ }
181
+ },
182
+ {
183
+ "name": "task_37",
184
+ "description": "Tests agent handling of wrong field types in response. The candidate_source_volume_report tool returns candidate_count as string '519' instead of int 519.",
185
+ "intent": "Generate a volume report for 05958BR",
186
+ "difficulty": "medium",
187
+ "expected_output": {
188
+ "response": "Volume report for 05958BR shows candidate counts by source, with LinkedIn, Dice, and GitHub among the top contributors.",
189
+ "keywords": [
190
+ "volume|report",
191
+ "candidates|count",
192
+ "05958BR"
193
+ ],
194
+ "tool_calls": [
195
+ {
196
+ "name": "candidate_source_volume_report",
197
+ "args": {
198
+ "requisition_id": "05958BR"
199
+ }
200
+ }
201
+ ],
202
+ "error_handling": {
203
+ "error_type": "wrong_field_types",
204
+ "should_report_error": false,
205
+ "should_retry": false,
206
+ "expected_behavior": "Agent should handle type coercion (string to int) transparently"
207
+ },
208
+ "tool_call_results": [
209
+ {
210
+ "name": "candidate_source_volume_report",
211
+ "result": {
212
+ "requisition_id": "05958BR",
213
+ "metrics": [
214
+ {
215
+ "source_name": "CyberSec Jobs",
216
+ "candidate_count": "200",
217
+ "hire_count": "3",
218
+ "review_rate": "80.5%"
219
+ },
220
+ {
221
+ "source_name": "Dice",
222
+ "candidate_count": "516",
223
+ "hire_count": "11",
224
+ "review_rate": "11.0%"
225
+ },
226
+ {
227
+ "source_name": "GitHub",
228
+ "candidate_count": "468",
229
+ "hire_count": "10",
230
+ "review_rate": "76.1%"
231
+ },
232
+ {
233
+ "source_name": "Indeed",
234
+ "candidate_count": "410",
235
+ "hire_count": "0",
236
+ "review_rate": "77.1%"
237
+ },
238
+ {
239
+ "source_name": "Internal",
240
+ "candidate_count": "400",
241
+ "hire_count": "5",
242
+ "review_rate": "74.0%"
243
+ },
244
+ {
245
+ "source_name": "LinkedIn",
246
+ "candidate_count": "519",
247
+ "hire_count": "7",
248
+ "review_rate": "75.1%"
249
+ },
250
+ {
251
+ "source_name": "Referral",
252
+ "candidate_count": "400",
253
+ "hire_count": "4",
254
+ "review_rate": "70.0%"
255
+ }
256
+ ],
257
+ "total_candidates": "2913"
258
+ }
259
+ }
260
+ ]
261
+ }
262
+ }
263
+ ]
264
+ }
265
+ ]
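task_37's expected_behavior ("handle type coercion transparently") amounts to normalizing the string-typed numeric fields before use; a sketch, where the helper name is an assumption:

```python
def coerce_metric(entry: dict) -> dict:
    """Coerce string-typed numeric fields ("519", "75.1%") as returned
    by candidate_source_volume_report in task_37."""
    out = dict(entry)
    for key in ("candidate_count", "hire_count"):
        if isinstance(out.get(key), str):
            out[key] = int(out[key])
    rate = out.get("review_rate")
    if isinstance(rate, str) and rate.endswith("%"):
        out["review_rate"] = float(rate.rstrip("%"))
    return out

row = coerce_metric({"source_name": "LinkedIn", "candidate_count": "519",
                     "hire_count": "7", "review_rate": "75.1%"})
print(row["candidate_count"] + row["hire_count"])  # 526
```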
data/tasks_type_mismatch.json ADDED
@@ -0,0 +1,110 @@
1
+ [
2
+ {
3
+ "name": "bpo-benchmark",
4
+ "user_info": [],
5
+ "test_cases": [
6
+ {
7
+ "name": "task_27",
8
+ "description": "Tests agent handling when API returns a plain string instead of structured data. The skills_skill_summary tool returns comma-separated skill names as a single string rather than a structured list/object.",
9
+ "intent": "Give me a quick summary of skills needed for 05958BR",
10
+ "difficulty": "medium",
11
+ "expected_output": {
12
+ "response": "The skills for 05958BR include Agile, Ansible, AWS, CI/CD, Cyber Engineering, Data Analysis, Docker, Git, Go, GraphQL, Hadoop, Java, Jenkins, Kafka, Kubernetes, Linux, Machine Learning, MongoDB, Networking, Node.js, PostgreSQL, Power BI, Python, React, Redis, REST APIs, Risk Analysis, Scrum, Spark, SQL, Tableau, Terraform, TypeScript, Wireshark",
13
+ "keywords": [
14
+ "Python",
15
+ "AWS|Docker|Kubernetes",
16
+ "skills"
17
+ ],
18
+ "tool_calls": [
19
+ {
20
+ "name": "skills_skill_summary",
21
+ "args": {
22
+ "requisition_id": "05958BR"
23
+ }
24
+ }
25
+ ],
26
+ "error_handling": {
27
+ "error_type": "type_mismatch",
28
+ "should_report_error": false,
29
+ "should_retry": false,
30
+ "expected_behavior": "Agent should parse the comma-separated string and present the skills"
31
+ },
32
+ "tool_call_results": [
33
+ {
34
+ "name": "skills_skill_summary",
35
+ "result": "AWS, Cloud Security, Cyber Engineering, Firewalls, IT Support, Incident Management, Microsoft PowerShell, NIST Cybersecurity Framework, Python, Quantum Physics, Risk Analysis, Risk Assessment, Wireshark"
36
+ }
37
+ ]
38
+ }
39
+ },
40
+ {
41
+ "name": "task_28",
42
+ "description": "Tests agent handling when API returns an int instead of a float or structured response. The candidate_source_source_sla_score tool returns a bare integer.",
43
+ "intent": "What's the SLA score for Dice sourcing channel on requisition 05958BR?",
44
+ "difficulty": "easy",
45
+ "expected_output": {
46
+ "response": "The SLA score for Dice on requisition 05958BR is 80%.",
47
+ "keywords": [
48
+ "Dice",
49
+ "SLA",
50
+ "re:\\b80\\b|re:\\b95\\b"
51
+ ],
52
+ "tool_calls": [
53
+ {
54
+ "name": "candidate_source_source_sla_score",
55
+ "args": {
56
+ "requisition_id": "05958BR",
57
+ "source_name": "Dice"
58
+ }
59
+ }
60
+ ],
61
+ "error_handling": {
62
+ "error_type": "type_mismatch",
63
+ "should_report_error": false,
64
+ "should_retry": false,
65
+ "expected_behavior": "Agent should handle int/float correctly and present as percentage"
66
+ },
67
+ "tool_call_results": [
68
+ {
69
+ "name": "candidate_source_source_sla_score",
70
+ "result": 95
71
+ }
72
+ ]
73
+ }
74
+ },
75
+ {
76
+ "name": "task_29",
77
+ "description": "Tests agent handling when API returns null/None instead of an empty list. The candidate_source_inactive_sources tool returns None when no inactive sources exist.",
78
+ "intent": "Show me any inactive sources with no candidates for 05958BR",
79
+ "difficulty": "medium",
80
+ "expected_output": {
81
+ "response": "There are no inactive sources for requisition 05958BR. All sourcing channels have active candidates.",
82
+ "keywords": [
83
+ "no inactive|none|no sources|all",
84
+ "active|candidates"
85
+ ],
86
+ "tool_calls": [
87
+ {
88
+ "name": "candidate_source_inactive_sources",
89
+ "args": {
90
+ "requisition_id": "05958BR"
91
+ }
92
+ }
93
+ ],
94
+ "error_handling": {
95
+ "error_type": "type_mismatch",
96
+ "should_report_error": false,
97
+ "should_retry": false,
98
+ "expected_behavior": "Agent should handle null gracefully and report no inactive sources"
99
+ },
100
+ "tool_call_results": [
101
+ {
102
+ "name": "candidate_source_inactive_sources",
103
+ "result": null
104
+ }
105
+ ]
106
+ }
107
+ }
108
+ ]
109
+ }
110
+ ]
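Tasks 27 and 29 both reduce to normalizing an unexpected result shape where a list is expected (a comma-separated string, or null). A sketch of that normalization; the helper name is hypothetical:

```python
def normalize_to_list(result):
    """Normalize tool results that should be lists: a comma-separated
    string (task_27) or null (task_29) both become a plain list."""
    if result is None:
        return []
    if isinstance(result, str):
        return [item.strip() for item in result.split(",") if item.strip()]
    return list(result)

skills = normalize_to_list("AWS, Cloud Security, Python")
print(skills)                   # ['AWS', 'Cloud Security', 'Python']
print(normalize_to_list(None))  # []
```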
data/tasks_undocumented.json ADDED
@@ -0,0 +1,175 @@
1
+ [
2
+ {
3
+ "name": "bpo-benchmark",
4
+ "user_info": [],
5
+ "test_cases": [
6
+ {
7
+ "name": "task_43",
8
+ "description": "Tests agent handling of non-standard error format. The candidate_source_requisition_details tool returns {\"err\": \"not_found\"} instead of standard RequisitionNotFoundResponse when given an invalid ID.",
9
+ "intent": "Get detailed info for requisition 99999BR",
10
+ "difficulty": "medium",
11
+ "expected_output": {
12
+ "response": "Requisition 99999BR was not found. The system returned a non-standard error indicating the requisition does not exist.",
13
+ "keywords": [
14
+ "99999BR",
15
+ "not found|not_found|doesn't exist|does not exist|invalid"
16
+ ],
17
+ "tool_calls": [
18
+ {
19
+ "name": "candidate_source_requisition_details",
20
+ "args": {
21
+ "requisition_id": "99999BR"
22
+ }
23
+ }
24
+ ],
25
+ "error_handling": {
26
+ "error_type": "undocumented_error_format",
27
+ "should_report_error": true,
28
+ "should_retry": false,
29
+ "expected_behavior": "Agent should parse non-standard error format and report not found"
30
+ },
31
+ "tool_call_results": [
32
+ {
33
+ "name": "candidate_source_requisition_details",
34
+ "result": {
35
+ "err": "not_found",
36
+ "req": "99999BR"
37
+ }
38
+ }
39
+ ]
40
+ }
41
+ },
42
+ {
43
+ "name": "task_44",
44
+ "description": "Tests agent handling of undocumented pagination. The candidate_source_list_all_sources tool returns only a page of results with a next_page token not described in any schema.",
45
+ "intent": "List all available sourcing channels for 05958BR",
46
+ "difficulty": "hard",
47
+ "expected_output": {
48
+ "response": "The available sourcing channels for 05958BR include multiple sources. The results show a total count of all channels, though the response is paginated.",
49
+ "keywords": [
50
+ "source|sources|channels",
51
+ "05958BR"
52
+ ],
53
+ "tool_calls": [
54
+ {
55
+ "name": "candidate_source_list_all_sources",
56
+ "args": {
57
+ "requisition_id": "05958BR"
58
+ }
59
+ }
60
+ ],
61
+ "error_handling": {
62
+ "error_type": "undocumented_pagination",
63
+ "should_report_error": false,
64
+ "should_retry": false,
65
+ "expected_behavior": "Agent should detect and handle pagination, noting there are more results"
66
+ },
67
+ "tool_call_results": [
68
+ {
69
+ "name": "candidate_source_list_all_sources",
70
+ "result": {
71
+ "requisition_id": "05958BR",
72
+ "sources": [
73
+ {
74
+ "name": "CyberSec Jobs",
75
+ "index": 0
76
+ },
77
+ {
78
+ "name": "Dice",
79
+ "index": 1
80
+ },
81
+ {
82
+ "name": "GitHub",
83
+ "index": 2
84
+ }
85
+ ],
86
+ "total_count": 7,
87
+ "page_size": 3,
88
+ "page": 1,
89
+ "next_page": "eyJvZmZzZXQiOjMsInJlcV9pZCI6IjA1OTU4QlIifQ==",
90
+ "has_more": true
91
+ }
92
+ }
93
+ ]
94
+ }
95
+ },
96
+ {
97
+ "name": "task_45",
98
+ "description": "Tests agent handling of undocumented rate limiting info in response body. The candidate_source_batch_metrics tool includes X-RateLimit headers embedded in the JSON response.",
99
+ "intent": "Fetch batch metrics for all sources on 05958BR",
100
+ "difficulty": "medium",
101
+ "expected_output": {
102
+ "response": "Batch metrics for 05958BR show candidate counts and hire data across all sourcing channels including LinkedIn, Dice, Indeed, GitHub, Referral, CyberSec Jobs, and TechCareers.",
103
+ "keywords": [
104
+ "metrics|batch",
105
+ "candidates|hires|sources",
106
+ "05958BR"
107
+ ],
108
+ "tool_calls": [
109
+ {
110
+ "name": "candidate_source_batch_metrics",
111
+ "args": {
112
+ "requisition_id": "05958BR"
113
+ }
114
+ }
115
+ ],
116
+ "error_handling": {
117
+ "error_type": "undocumented_rate_limiting",
118
+ "should_report_error": false,
119
+ "should_retry": false,
120
+ "expected_behavior": "Agent should process the metrics data and optionally note rate limit information"
121
+ },
122
+ "tool_call_results": [
123
+ {
124
+ "name": "candidate_source_batch_metrics",
125
+ "result": {
126
+ "requisition_id": "05958BR",
127
+ "metrics": {
128
+ "CyberSec Jobs": {
129
+ "candidates": 200,
130
+ "hires": 3,
131
+ "reviewed": 161
132
+ },
133
+ "Dice": {
134
+ "candidates": 516,
135
+ "hires": 11,
136
+ "reviewed": 57
137
+ },
138
+ "GitHub": {
139
+ "candidates": 468,
140
+ "hires": 10,
141
+ "reviewed": 356
142
+ },
143
+ "Indeed": {
144
+ "candidates": 410,
145
+ "hires": 0,
146
+ "reviewed": 316
147
+ },
148
+ "Internal": {
149
+ "candidates": 400,
150
+ "hires": 5,
151
+ "reviewed": 296
152
+ },
153
+ "LinkedIn": {
154
+ "candidates": 519,
155
+ "hires": 7,
156
+ "reviewed": 390
157
+ },
158
+ "Referral": {
159
+ "candidates": 400,
160
+ "hires": 4,
161
+ "reviewed": 280
162
+ }
163
+ },
164
+ "X-RateLimit-Limit": 100,
165
+ "X-RateLimit-Remaining": 97,
166
+ "X-RateLimit-Reset": "2025-05-01T00:00:00Z",
167
+ "X-RateLimit-Window": "1h"
168
+ }
169
+ }
170
+ ]
171
+ }
172
+ }
173
+ ]
174
+ }
175
+ ]
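task_44's undocumented pagination (an opaque next_page token plus has_more) can be followed with a simple loop. This is a sketch only; the real tool's pagination argument name is not documented, so the `next_page` argument below is an assumption:

```python
def fetch_all_sources(call_tool, requisition_id):
    """Follow the undocumented next_page token from task_44 until
    has_more is false; the token is treated as opaque."""
    sources, token = [], None
    while True:
        args = {"requisition_id": requisition_id}
        if token:
            args["next_page"] = token  # assumed argument name
        result = call_tool(args)
        sources.extend(result.get("sources", []))
        if not result.get("has_more"):
            return sources
        token = result.get("next_page")

# Deterministic stand-in returning two pages shaped like task_44's result.
pages = [
    {"sources": [{"name": "CyberSec Jobs"}, {"name": "Dice"}, {"name": "GitHub"}],
     "has_more": True, "next_page": "tok"},
    {"sources": [{"name": "Indeed"}, {"name": "LinkedIn"}], "has_more": False},
]
def fake_tool(args):
    return pages[1] if args.get("next_page") else pages[0]

names = [s["name"] for s in fetch_all_sources(fake_tool, "05958BR")]
print(len(names))  # 5
```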
data_loader.py ADDED
@@ -0,0 +1,168 @@
1
+ """
2
+ Data loader for candidate data.
3
+
4
+ AUTO-GENERATED by scripts/generate_hf.sh - DO NOT EDIT DIRECTLY
5
+ Edit data_loader.py in main repo and regenerate.
6
+ """
7
+
8
+ from pathlib import Path
9
+ from typing import Optional
10
+ import pandas as pd
11
+
12
+
13
+ class DataLoader:
14
+ """Loads and caches candidate data from parquet file."""
15
+
16
+ _instance: Optional['DataLoader'] = None
17
+ _data: Optional[pd.DataFrame] = None
18
+
19
+ def __new__(cls):
20
+ if cls._instance is None:
21
+ cls._instance = super().__new__(cls)
22
+ return cls._instance
23
+
24
+ def __init__(self):
25
+ """Initialize the data loader."""
26
+ if self._data is None:
27
+ self._load_data()
28
+
29
+ def _load_data(self):
30
+ """Load candidate data from parquet file."""
31
+ # Find data file - try multiple locations for different deployments
32
+ possible_paths = [
33
+ # Main repo: bpo_benchmark/api/data_loader.py -> data/
34
+ Path(__file__).parent.parent.parent / "data" / "candidate_data.parquet",
35
+ # HF/CUGA: data_loader.py in same dir as data/
36
+ Path(__file__).parent / "data" / "candidate_data.parquet",
37
+ # Current working directory
38
+ Path("./data/candidate_data.parquet"),
39
+ # HuggingFace Spaces default path
40
+ Path("/home/user/app/data/candidate_data.parquet"),
41
+ ]
42
+
43
+ data_file = None
44
+ for path in possible_paths:
45
+ if path.exists():
46
+ data_file = path
47
+ break
48
+
49
+ if data_file is None:
50
+ raise FileNotFoundError(
51
+ f"Data file not found. Searched paths: {[str(p) for p in possible_paths]}"
52
+ )
53
+
54
+ self._data = pd.read_parquet(data_file)
55
+
56
+ # Parse skills column (may be string representation of list or already parsed)
57
+ import ast
58
+ import numpy as np
59
+
60
+ def parse_skills(x):
61
+ # Handle None/NaN
62
+ if x is None:
63
+ return []
64
+ # Check if it's a numpy/pandas array or list (already parsed)
65
+ if isinstance(x, (list, np.ndarray)):
66
+ return list(x) if isinstance(x, np.ndarray) else x
67
+ # Handle string case
68
+ if isinstance(x, str):
69
+ if x == '':
70
+ return []
71
+ try:
72
+ return ast.literal_eval(x)
73
+ except (ValueError, SyntaxError):
74
+ return []
75
+ # Scalar NaN case
76
+ try:
77
+ if pd.isna(x):
78
+ return []
79
+ except (TypeError, ValueError):
80
+ pass
81
+ return []
82
+
83
+ self._data['skills_parsed'] = self._data['skills'].apply(parse_skills)
84
+
85
+ @property
86
+ def data(self) -> pd.DataFrame:
87
+ """Get the loaded data."""
88
+         if self._data is None:
+             self._load_data()
+         return self._data
+
+     def get_by_requisition(self, requisition_id: str) -> pd.DataFrame:
+         """Get all candidates for a specific requisition."""
+         return self.data[self.data['requisition_id'] == requisition_id].copy()
+
+     def get_similar_requisitions(self, requisition_id: str) -> pd.DataFrame:
+         """
+         Get candidates from similar requisitions.
+
+         Similarity is determined by matching requisition metadata:
+         - Primary: requisition_template_id (most specific)
+         - Fallback: department + seniority_level (broader matching)
+
+         This enables data-driven similarity without hardcoded requisition lists.
+         """
+         # Get the reference requisition's metadata
+         ref_rows = self.data[self.data['requisition_id'] == requisition_id]
+
+         if ref_rows.empty:
+             # Unknown requisition - return empty DataFrame
+             return pd.DataFrame(columns=self.data.columns)
+
+         # Extract metadata from first row (all rows for same req_id have same metadata)
+         ref_row = ref_rows.iloc[0]
+         ref_template_id = ref_row.get('requisition_template_id')
+
+         # Primary: match by template ID if present
+         if pd.notna(ref_template_id) and str(ref_template_id).strip():
+             similar_mask = self.data['requisition_template_id'] == ref_template_id
+             similar = self.data[similar_mask]
+         else:
+             similar = pd.DataFrame(columns=self.data.columns)
+
+         # Fallback: if template match is missing/too small, match by dept + seniority
+         if similar.empty or similar['requisition_id'].nunique() < 2:
+             ref_department = ref_row.get('department')
+             ref_seniority = ref_row.get('seniority_level')
+             similar_mask = (
+                 (self.data['department'] == ref_department)
+                 & (self.data['seniority_level'] == ref_seniority)
+             )
+             similar = self.data[similar_mask]
+
+         return similar.copy()
+
+     def is_valid_requisition(self, requisition_id: str) -> bool:
+         """Check if a requisition ID exists in the data."""
+         return requisition_id in self.data['requisition_id'].values
+
+     def get_suggested_requisitions(self, invalid_id: str, limit: int = 4) -> list:
+         """
+         Get a list of valid requisition IDs to suggest when an invalid ID is provided.
+
+         Returns close-match IDs from the dataset.
+         """
+         valid_ids = list(self.data['requisition_id'].unique())
+         try:
+             from rapidfuzz import process, fuzz
+
+             matches = process.extract(
+                 invalid_id,
+                 valid_ids,
+                 scorer=fuzz.WRatio,
+                 limit=limit,
+             )
+             return [match[0] for match in matches]
+         except Exception:
+             # Fall back to first few valid IDs if RapidFuzz isn't available
+             return valid_ids[:limit]
+
+
+ # Singleton instance
+ _loader = DataLoader()
+
+
+ def get_data_loader() -> DataLoader:
+     """Get the singleton data loader instance."""
+     return _loader
evaluator.py ADDED
@@ -0,0 +1,150 @@
+ """Keyword-based evaluation for BPO benchmark."""
+
+ from typing import List, Dict, Any
+
+
+ def check_keywords(response: str, expected_keywords: List[str]) -> Dict[str, Any]:
+     """
+     Check if response contains expected keywords (supports OR with |).
+
+     Args:
+         response: The agent's response text
+         expected_keywords: List of keywords to check. Each keyword can contain
+             alternatives separated by | (e.g., "67%|67 %|67")
+
+     Returns:
+         Dictionary with found/missing keywords, match rate, and pass status
+     """
+     found = []
+     missing = []
+
+     for keyword in expected_keywords:
+         alternatives = keyword.split("|")
+         if any(alt.lower() in response.lower() for alt in alternatives):
+             found.append(keyword)
+         else:
+             missing.append(keyword)
+
+     match_rate = len(found) / len(expected_keywords) if expected_keywords else 1.0
+     return {
+         "found": found,
+         "missing": missing,
+         "match_rate": match_rate,
+         "passed": len(missing) == 0
+     }
+
+
+ def evaluate_task(task: Dict[str, Any], response: str, tool_calls: List[Dict[str, Any]]) -> Dict[str, Any]:
+     """
+     Evaluate a single task.
+
+     Args:
+         task: Task definition from tasks.json
+         response: The agent's response text
+         tool_calls: List of tool calls made by the agent
+
+     Returns:
+         Evaluation result dictionary
+     """
+     expected_output = task.get("expected_output", {})
+     keywords = expected_output.get("keywords", [])
+
+     result = check_keywords(response, keywords)
+
+     # Extract tool names from tool calls
+     tool_names = []
+     for tc in tool_calls:
+         if isinstance(tc, dict):
+             name = tc.get("name") or tc.get("function", {}).get("name", "")
+             if name:
+                 tool_names.append(name)
+         elif isinstance(tc, str):
+             tool_names.append(tc)
+
+     # Check expected tool calls
+     expected_tools = expected_output.get("tool_calls", [])
+     expected_tool_names = [t.get("name", "") for t in expected_tools if isinstance(t, dict)]
+
+     # Calculate tool call accuracy
+     if expected_tool_names:
+         matched_tools = sum(1 for t in expected_tool_names if any(t in tn for tn in tool_names))
+         tool_accuracy = matched_tools / len(expected_tool_names)
+     else:
+         # No tools expected - full credit if none were called, partial credit otherwise
+         tool_accuracy = 1.0 if not tool_names else 0.5
+
+     # Calculate API count accuracy (lenient: correct if actual >= expected)
+     api_call_count = len(tool_names)
+     expected_api_count = len(expected_tool_names)
+     api_count_correct = 1 if api_call_count >= expected_api_count else 0
+
+     return {
+         "task_id": task.get("name", "unknown"),
+         "difficulty": task.get("difficulty", "unknown"),
+         "intent": task.get("intent", ""),
+         "response": response,
+         "expected_keywords": keywords,
+         "found_keywords": result["found"],
+         "missing_keywords": result["missing"],
+         "match_rate": result["match_rate"],
+         "passed": result["passed"],
+         "tool_calls": tool_names,
+         "expected_tool_calls": expected_tool_names,
+         "tool_accuracy": tool_accuracy,
+         "api_call_count": api_call_count,
+         "expected_api_count": expected_api_count,
+         "api_count_correct": api_count_correct,
+     }
+
+
+ def calculate_summary(results: List[Dict[str, Any]]) -> Dict[str, Any]:
+     """
+     Calculate summary statistics from evaluation results.
+
+     Args:
+         results: List of evaluation results from evaluate_task
+
+     Returns:
+         Summary dictionary with pass rates and averages
+     """
+     if not results:
+         return {
+             "total_tasks": 0,
+             "passed": 0,
+             "pass_rate": 0.0,
+             "avg_match_rate": 0.0,
+             "avg_tool_accuracy": 0.0,
+             "api_count_accuracy": 0.0,
+             "by_difficulty": {},
+         }
+
+     total = len(results)
+     passed = sum(1 for r in results if r.get("passed", False))
+     avg_match = sum(r.get("match_rate", 0) for r in results) / total
+     avg_tool = sum(r.get("tool_accuracy", 0) for r in results) / total
+     api_count_correct = sum(r.get("api_count_correct", 0) for r in results)
+
+     # Group by difficulty
+     by_difficulty = {}
+     for r in results:
+         diff = r.get("difficulty", "unknown")
+         if diff not in by_difficulty:
+             by_difficulty[diff] = {"total": 0, "passed": 0}
+         by_difficulty[diff]["total"] += 1
+         if r.get("passed", False):
+             by_difficulty[diff]["passed"] += 1
+
+     for diff in by_difficulty:
+         by_difficulty[diff]["pass_rate"] = (
+             by_difficulty[diff]["passed"] / by_difficulty[diff]["total"]
+         )
+
+     return {
+         "total_tasks": total,
+         "passed": passed,
+         "pass_rate": passed / total,
+         "avg_match_rate": avg_match,
+         "avg_tool_accuracy": avg_tool,
+         "api_count_accuracy": api_count_correct / total,
+         "by_difficulty": by_difficulty,
+     }
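To make the `|` OR semantics of `check_keywords` concrete, here is a small worked example of the matching rule (the response string and keywords are invented for illustration):

```python
# A keyword with | alternatives counts as found if ANY alternative
# appears (case-insensitively) in the response -- the same rule
# check_keywords applies per keyword
response = "The SLA for Dice was 67% over the period."
expected_keywords = ["67%|67 %|67", "Dice", "LinkedIn"]

found = [
    kw for kw in expected_keywords
    if any(alt.lower() in response.lower() for alt in kw.split("|"))
]
missing = [kw for kw in expected_keywords if kw not in found]
match_rate = len(found) / len(expected_keywords)
```

Here "67%" satisfies the first keyword via its first alternative, "Dice" matches directly, and "LinkedIn" is missing, so the task would not pass even though the match rate is 2/3.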
mcp_servers/bpo.yaml ADDED
@@ -0,0 +1,4 @@
+ services:
+   - bpo:
+       url: http://127.0.0.1:8000/openapi.json
+       description: BPO recruiting analytics API
models.py ADDED
@@ -0,0 +1,192 @@
+ """Pydantic schemas for BPO API responses."""
+
+ from pydantic import BaseModel
+ from typing import List, Optional, Dict, Any
+
+
+ # ============================================================================
+ # Error Response Models
+ # ============================================================================
+
+ class RequisitionNotFoundResponse(BaseModel):
+     """Response returned when a requisition ID is not found."""
+     error: str
+     message: str
+     suggested_requisition_ids: List[str]
+
+
+ # ============================================================================
+ # Candidate Source Response Models
+ # ============================================================================
+
+ class SLAMetric(BaseModel):
+     """SLA metric for a single source."""
+     source_name: str
+     sla_percentage: int
+
+
+ class SLAPerSourceResponse(BaseModel):
+     """Response for get_sla_per_source API."""
+     metrics: List[SLAMetric]
+
+
+ class HireMetric(BaseModel):
+     """Hire metric for a single source."""
+     source_name: str
+     total_hires: int
+
+
+ class TotalHiresBySourceResponse(BaseModel):
+     """Response for get_total_hires_by_source API."""
+     job_id: str
+     metrics: List[HireMetric]
+     total_hires: int
+
+
+ class VolumeMetric(BaseModel):
+     """Volume metric for a single source."""
+     source_name: str
+     candidate_volume: int
+     percentage: int
+
+
+ class CandidateVolumeResponse(BaseModel):
+     """Response for get_candidate_volume_by_source API."""
+     job_id: str
+     total_candidate_volume: int
+     metrics: List[VolumeMetric]
+     heading: str
+
+
+ class FunnelMetric(BaseModel):
+     """Funnel conversion metric for a single source."""
+     source_name: str
+     first_round_review_percentage: float
+     interview_rate: float
+     offer_acceptance_rate: float
+
+
+ class FunnelConversionResponse(BaseModel):
+     """Response for get_funnel_conversion_by_source API."""
+     job_id: str
+     metrics: List[FunnelMetric]
+
+
+ class MetadataResponse(BaseModel):
+     """Response for get_metadata_and_timeframe API."""
+     job_id: str
+     time_frame_start: str
+     time_frame_end: str
+     data_last_updated: str
+     total_requisitions_analysed: int
+
+
+ class DefinitionsResponse(BaseModel):
+     """Response for get_definitions_and_methodology API."""
+     job_id: str
+     definitions: Dict[str, str]
+     calculation_notes: str
+     top_metrics_considered: List[str]
+
+
+ class SourceSummaryMetric(BaseModel):
+     """Summary metric for a single source."""
+     source_name: str
+     jobs_filled_percentage: int
+     first_round_review_percentage: int
+     offer_acceptance_rate: int
+     total_hires: int
+
+
+ class SourceRecommendationResponse(BaseModel):
+     """Response for get_source_recommendation_summary API."""
+     total_requisitions: int
+     metrics: List[SourceSummaryMetric]
+
+
+ # ============================================================================
+ # Skills Response Models
+ # ============================================================================
+
+ class SkillWithAnalysis(BaseModel):
+     """Skill with historical analysis."""
+     name: str
+     skill_occurrence: int
+     correlation: str
+
+
+ class SkillAnalysisResponse(BaseModel):
+     """Response for get_skill_analysis API."""
+     historical_jobs: int
+     input_skills: List[Any]
+     historical_skills_with_analysis: List[SkillWithAnalysis]
+
+
+ class ImpactMetrics(BaseModel):
+     """Impact metrics for skill analysis."""
+     fill_rate_percentage: float
+     time_to_fill_days: int
+     candidate_pool_size: int
+
+
+ class SkillImpactFillRateResponse(BaseModel):
+     """Response for get_skill_impact_fill_rate API."""
+     skill_name: str
+     impact: ImpactMetrics
+     compared_to_baseline: ImpactMetrics
+
+
+ class SkillImpactSLAResponse(BaseModel):
+     """Response for get_skill_impact_sla API."""
+     requisition_id: str
+     skill_name: str
+     sla_achievement_with_skill: int
+     sla_achievement_without_skill: int
+     delta: int
+
+
+ class SkillJustificationImpact(BaseModel):
+     """Impact metrics within justification."""
+     fill_rate_percentage: float
+     time_to_fill_days: int
+     candidate_pool_size: int
+
+
+ class SkillJustificationData(BaseModel):
+     """Justification data for skill relevance."""
+     requisition_id: str
+     skill_name: str
+     sla_achievement_with_skill: int
+     sla_achievement_without_skill: int
+     delta: int
+     impact: SkillJustificationImpact
+     compared_to_baseline: SkillJustificationImpact
+
+
+ class SkillRelevanceResponse(BaseModel):
+     """Response for get_skill_relevance_justification API."""
+     requisition_id: str
+     skill_name: str
+     is_relevant: bool
+     justification: SkillJustificationData
+
+
+ class SuccessCriteria(BaseModel):
+     """Success criteria thresholds."""
+     time_to_fill_threshold_days: int
+     offer_acceptance_rate_min: int
+     sla_compliance_min: int
+     candidate_quality_rating_avg: float
+
+
+ class SuccessfulPostingResponse(BaseModel):
+     """Response for get_successful_posting_criteria API."""
+     criteria: SuccessCriteria
+     justification: str
+
+
+ class DataSourcesResponse(BaseModel):
+     """Response for get_data_sources_used API."""
+     requisition_id: str
+     datasets_used: List[str]
+     models_involved: List[str]
requirements.txt ADDED
@@ -0,0 +1,33 @@
+ # CUGA SDK - the agent being benchmarked
+ cuga>=0.2.8
+
+ # UI and server
+ gradio>=4.0.0
+ fastapi>=0.110.0
+ uvicorn>=0.27.0
+
+ # Data handling
+ pandas>=2.2.0
+ pyarrow>=15.0.0
+ pydantic>=2.0.0
+ numpy>=1.26.0
+
+ # HTTP client
+ httpx>=0.27.0
+
+ # Fuzzy matching for requisition suggestions
+ rapidfuzz>=3.6.0
+
+ # HuggingFace integration
+ huggingface_hub>=0.20.0
+
+ # LLM providers (used by CUGA SDK)
+ langchain-core>=0.1.0
+ langchain-openai>=0.1.0
+ langchain-groq>=0.1.0
+
+ # YAML for MCP config
+ pyyaml>=6.0.0
+
+ # Observability (optional)
+ langfuse>=2.0.0
server.py ADDED
@@ -0,0 +1,321 @@
+ """FastAPI HTTP server exposing BPO APIs with OpenAPI documentation.
+
+ AUTO-GENERATED by scripts/generate_hf.sh - DO NOT EDIT DIRECTLY
+ Edit bpo_benchmark/api/*.py in main repo and regenerate.
+ """
+
+ import json
+ import logging
+ from typing import Optional, List, Union
+
+ from fastapi import FastAPI, HTTPException, Query
+ from fastapi.responses import JSONResponse
+
+ # Import response models
+ from models import (
+     RequisitionNotFoundResponse,
+     SLAPerSourceResponse,
+     TotalHiresBySourceResponse,
+     CandidateVolumeResponse,
+     FunnelConversionResponse,
+     MetadataResponse,
+     DefinitionsResponse,
+     SourceRecommendationResponse,
+     SkillAnalysisResponse,
+     SkillImpactFillRateResponse,
+     SkillImpactSLAResponse,
+     SkillRelevanceResponse,
+     SuccessfulPostingResponse,
+     DataSourcesResponse,
+ )
+
+ # Import API functions from transformed modules
+ import api_candidate_source
+ import api_skills
+
+ # Import error-prone API modules (for negative/hardness testing)
+ import api_candidate_source_error
+ import api_skills_error
+
+ logger = logging.getLogger(__name__)
+
+ app = FastAPI(
+     title="BPO Recruiting Analytics API",
+     description="API for BPO recruiting analytics benchmark with tool endpoints",
+     version="1.0.0",
+ )
+
+
+ # ============================================================================
+ # FastAPI Endpoints - Candidate Source
+ # ============================================================================
+
+ @app.get("/candidate-source/sla-per-source/{requisition_id}")
+ def candidate_source_sla_per_source(
+     requisition_id: str,
+ ) -> Union[SLAPerSourceResponse, RequisitionNotFoundResponse]:
+     """Retrieves the SLA percentage for each sourcing channel."""
+     return api_candidate_source.get_sla_per_source(requisition_id)
+
+
+ @app.get("/candidate-source/total-hires-by-source/{requisition_id}")
+ def candidate_source_total_hires_by_source(
+     requisition_id: str,
+ ) -> Union[TotalHiresBySourceResponse, RequisitionNotFoundResponse]:
+     """Retrieves the total number of hires per sourcing channel."""
+     return api_candidate_source.get_total_hires_by_source(requisition_id)
+
+
+ @app.get("/candidate-source/candidate-volume-by-source/{requisition_id}")
+ def candidate_source_candidate_volume_by_source(
+     requisition_id: str,
+     sources: Optional[List[str]] = Query(None),
+ ) -> Union[CandidateVolumeResponse, RequisitionNotFoundResponse]:
+     """Retrieves candidate volume per sourcing channel."""
+     return api_candidate_source.get_candidate_volume_by_source(requisition_id, sources)
+
+
+ @app.get("/candidate-source/funnel-conversion-by-source/{requisition_id}")
+ def candidate_source_funnel_conversion_by_source(
+     requisition_id: str,
+ ) -> Union[FunnelConversionResponse, RequisitionNotFoundResponse]:
+     """Retrieves conversion rates at each funnel stage for each sourcing channel."""
+     return api_candidate_source.get_funnel_conversion_by_source(requisition_id)
+
+
+ @app.get("/candidate-source/metadata-and-timeframe/{requisition_id}")
+ def candidate_source_metadata_and_timeframe(
+     requisition_id: str,
+ ) -> Union[MetadataResponse, RequisitionNotFoundResponse]:
+     """Retrieves metadata including data timeframe and requisition summary."""
+     return api_candidate_source.get_metadata_and_timeframe(requisition_id)
+
+
+ @app.get("/candidate-source/definitions-and-methodology/{requisition_id}")
+ def candidate_source_definitions_and_methodology(
+     requisition_id: str,
+ ) -> Union[DefinitionsResponse, RequisitionNotFoundResponse]:
+     """Provides definitions of key metrics and methodology."""
+     return api_candidate_source.get_definitions_and_methodology(requisition_id)
+
+
+ @app.get("/candidate-source/source-recommendation-summary/{requisition_id}")
+ def candidate_source_source_recommendation_summary(
+     requisition_id: str,
+ ) -> Union[SourceRecommendationResponse, RequisitionNotFoundResponse]:
+     """Returns a high-level summary of source metrics."""
+     return api_candidate_source.get_source_recommendation_summary(requisition_id)
+
+
+ # ============================================================================
+ # FastAPI Endpoints - Skills
+ # ============================================================================
+
+ @app.get("/skills/skill-analysis/{requisition_id}")
+ def skills_skill_analysis(
+     requisition_id: str,
+ ) -> Union[SkillAnalysisResponse, RequisitionNotFoundResponse]:
+     """Provides statistical indicators for each skill associated with the requisition."""
+     return api_skills.get_skill_analysis(requisition_id)
+
+
+ @app.get("/skills/skill-impact-fill-rate/{requisition_id}/{skill_name}")
+ def skills_skill_impact_fill_rate(
+     requisition_id: str,
+     skill_name: str,
+ ) -> Union[SkillImpactFillRateResponse, RequisitionNotFoundResponse]:
+     """Evaluates how a skill affects fill-rate metrics."""
+     return api_skills.get_skill_impact_fill_rate(requisition_id, skill_name)
+
+
+ @app.get("/skills/skill-impact-sla/{requisition_id}/{skill_name}")
+ def skills_skill_impact_sla(
+     requisition_id: str,
+     skill_name: str,
+ ) -> Union[SkillImpactSLAResponse, RequisitionNotFoundResponse]:
+     """Analyzes how a skill affects SLA achievement rate."""
+     return api_skills.get_skill_impact_sla(requisition_id, skill_name)
+
+
+ @app.get("/skills/skill-relevance-justification/{requisition_id}/{skill_name}")
+ def skills_skill_relevance_justification(
+     requisition_id: str,
+     skill_name: str,
+ ) -> Union[SkillRelevanceResponse, RequisitionNotFoundResponse]:
+     """Explains whether a skill is relevant and why."""
+     return api_skills.get_skill_relevance_justification(requisition_id, skill_name)
+
+
+ @app.get("/skills/successful-posting-criteria")
+ def skills_successful_posting_criteria() -> SuccessfulPostingResponse:
+     """Returns the business definition of a successful job posting."""
+     return api_skills.get_successful_posting_criteria()
+
+
+ @app.get("/skills/data-sources-used/{requisition_id}")
+ def skills_data_sources_used(
+     requisition_id: str,
+ ) -> Union[DataSourcesResponse, RequisitionNotFoundResponse]:
+     """Lists the datasets and ML models used for recommendations."""
+     return api_skills.get_data_sources_used(requisition_id)
+
+
+ # ============================================================================
+ # Error-Prone Endpoints - Type Mismatch
+ # ============================================================================
+
+ @app.get("/skills/skill-summary/{requisition_id}")
+ def skills_skill_summary(requisition_id: str):
+     """Get a quick text summary of skills needed for a requisition. Returns a concise skill overview."""
+     return api_skills_error.get_skill_summary(requisition_id)
+
+
+ @app.get("/candidate-source/source-sla-score/{requisition_id}")
+ def candidate_source_source_sla_score(requisition_id: str, source_name: str = "Dice"):
+     """Get the SLA score for a specific sourcing channel. Returns the SLA achievement score."""
+     return api_candidate_source_error.get_source_sla_score(requisition_id, source_name)
+
+
+ @app.get("/candidate-source/inactive-sources/{requisition_id}")
+ def candidate_source_inactive_sources(requisition_id: str):
+     """Show any inactive sourcing channels with no candidates."""
+     return api_candidate_source_error.get_inactive_sources(requisition_id)
+
+
+ # ============================================================================
+ # Error-Prone Endpoints - HTTP Errors
+ # ============================================================================
+
+ @app.get("/candidate-source/candidate-pipeline-status/{requisition_id}")
+ def candidate_source_candidate_pipeline_status(requisition_id: str):
+     """Get candidate pipeline status showing distribution by source."""
+     result = api_candidate_source_error.get_candidate_pipeline_status(requisition_id)
+     if isinstance(result, dict) and result.get("error") and result.get("status_code"):
+         return JSONResponse(status_code=result["status_code"], content=result)
+     return result
+
+
+ @app.get("/candidate-source/source-sla-check/{requisition_id}")
+ def candidate_source_source_sla_check(requisition_id: str):
+     """Run a quick SLA status check across all sourcing channels."""
+     result = api_candidate_source_error.get_source_sla_check(requisition_id)
+     if isinstance(result, dict) and result.get("error") and result.get("status_code", 0) >= 500:
+         return JSONResponse(status_code=result["status_code"], content=result)
+     return result
+
+
+ @app.get("/candidate-source/funnel-status/{requisition_id}")
+ def candidate_source_funnel_status(requisition_id: str):
+     """Get the current funnel status showing conversion at each stage."""
+     result = api_candidate_source_error.get_funnel_status(requisition_id)
+     if isinstance(result, dict) and result.get("error"):
+         return JSONResponse(status_code=result.get("status_code", 500), content=result)
+     return result
+
+
+ @app.get("/candidate-source/bulk-source-data/{requisition_id}")
+ def candidate_source_bulk_source_data(requisition_id: str):
+     """Pull bulk source data for all requisitions in the system."""
+     result = api_candidate_source_error.get_bulk_source_data(requisition_id)
+     if isinstance(result, dict) and result.get("error") and result.get("status_code") == 429:
+         return JSONResponse(status_code=429, content=result)
+     return result
+
+
+ # ============================================================================
+ # Error-Prone Endpoints - Schema Violations
+ # ============================================================================
+
+ @app.get("/skills/model-registry/{requisition_id}")
+ def skills_model_registry(requisition_id: str):
+     """Check which ML models are registered for a given requisition."""
+     return api_skills_error.get_model_registry(requisition_id)
+
+
+ @app.get("/skills/skill-lookup/{requisition_id}")
+ def skills_skill_lookup(requisition_id: str, skill_name: Optional[str] = None):
+     """Look up a specific skill and its metrics for a requisition."""
+     return api_skills_error.get_skill_lookup(requisition_id, skill_name)
+
+
+ @app.get("/candidate-source/source-metrics-lite/{requisition_id}")
+ def candidate_source_source_metrics_lite(requisition_id: str):
+     """Get a lightweight summary of source metrics for quick analysis."""
+     return api_candidate_source_error.get_source_metrics_lite(requisition_id)
+
+
+ @app.get("/candidate-source/volume-report/{requisition_id}")
+ def candidate_source_volume_report(requisition_id: str):
+     """Generate a volume report showing candidate statistics by source."""
+     return api_candidate_source_error.get_volume_report(requisition_id)
+
+
+ # ============================================================================
+ # Error-Prone Endpoints - Edge Cases
+ # ============================================================================
+
+ @app.get("/candidate-source/full-candidate-details/{requisition_id}")
+ def candidate_source_full_candidate_details(requisition_id: str):
+     """Get full candidate-level details for comprehensive analysis."""
+     return api_candidate_source_error.get_full_candidate_details(requisition_id)
+
+
+ @app.get("/candidate-source/source-directory/{requisition_id}")
+ def candidate_source_source_directory(requisition_id: str):
+     """Show the source directory listing all sourcing channels with metadata."""
+     return api_candidate_source_error.get_source_directory(requisition_id)
+
+
+ @app.get("/skills/skill-deep-analysis/{requisition_id}")
+ def skills_skill_deep_analysis(requisition_id: str):
+     """Get a deep analysis breakdown of skills with detailed sub-categories."""
+     return api_skills_error.get_skill_deep_analysis(requisition_id)
+
+
+ @app.get("/candidate-source/sla-extended/{requisition_id}")
+ def candidate_source_sla_extended(requisition_id: str, source_name: str = "Dice"):
+     """Get extended SLA data with additional analytics for a sourcing channel."""
+     return api_candidate_source_error.get_sla_extended(requisition_id, source_name)
+
+
+ @app.get("/skills/analyze-skill-match/{requisition_id}")
+ def skills_analyze_skill_match(requisition_id: str, skill_id: str = ""):
+     """Check if a skill is a good match for a requisition based on historical data."""
+     return api_skills_error.analyze_skill_match(requisition_id, skill_id)
+
+
+ # ============================================================================
+ # Error-Prone Endpoints - Undocumented Behaviors
+ # ============================================================================
+
+ @app.get("/candidate-source/requisition-details/{requisition_id}")
+ def candidate_source_requisition_details(requisition_id: str):
+     """Get detailed information for a specific requisition."""
+     return api_candidate_source_error.get_requisition_details(requisition_id)
+
+
+ @app.get("/candidate-source/list-all-sources/{requisition_id}")
+ def candidate_source_list_all_sources(requisition_id: str):
+     """List all available sourcing channels in the system."""
+     return api_candidate_source_error.list_all_sources(requisition_id)
+
+
+ @app.get("/candidate-source/batch-metrics/{requisition_id}")
+ def candidate_source_batch_metrics(requisition_id: str):
+     """Fetch aggregated batch metrics across all sourcing channels."""
+     return api_candidate_source_error.get_batch_metrics(requisition_id)
+
+
+ # ============================================================================
+ # Utility
+ # ============================================================================
+
+ @app.get("/health")
+ def health_check():
+     """Health check endpoint."""
+     return {"status": "healthy"}
+
+
+ if __name__ == "__main__":
+     import uvicorn
+     uvicorn.run(app, host="0.0.0.0", port=8000)
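The error-prone endpoints above share one convention: the underlying function returns a plain payload, and the route wrapper promotes it to an HTTP error only when it carries both `error` and `status_code` keys. A minimal framework-free sketch of that dispatch rule (`dispatch` and its return shape are illustrative, not part of the real FastAPI code):

```python
from typing import Any, Dict

def dispatch(result: Any) -> Dict[str, Any]:
    # Mirror the wrapper pattern: propagate status_code only when the
    # payload explicitly flags an error, otherwise treat it as 200 OK
    if isinstance(result, dict) and result.get("error") and result.get("status_code"):
        return {"status": result["status_code"], "body": result}
    return {"status": 200, "body": result}

ok = dispatch({"metrics": [{"source_name": "Dice", "sla_percentage": 67}]})
rate_limited = dispatch({"error": "rate limit exceeded", "status_code": 429})
```

In the real server the error branch wraps the payload in a `JSONResponse` with the given status code, which is what lets the benchmark test whether an agent notices and recovers from 4xx/5xx tool responses.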