Adam1010 committed
Commit b684ab3 · verified · 1 parent: dc1a656

v1.1: Financial domain audit - confirms Goodhart Gap hypothesis

README.md ADDED
@@ -0,0 +1,235 @@
+ ---
+ license: mit
+ task_categories:
+ - question-answering
+ - text-generation
+ language:
+ - en
+ tags:
+ - benchmark
+ - reasoning
+ - multi-step
+ - evaluation
+ - llm-evaluation
+ - goodhart
+ - execution-vs-understanding
+ size_categories:
+ - n<1K
+ ---
+
+ # Goodhart Gap Benchmark
+
+ **Detecting the gap between understanding and execution in language models**
+
+ ## Overview
+
+ The Goodhart Gap Benchmark tests whether language models can correctly *execute* multi-step reasoning tasks that they can correctly *explain*. Named after Goodhart's Law ("When a measure becomes a target, it ceases to be a good measure"), this benchmark reveals a critical failure mode: models that can describe a procedure correctly but fail to carry it out.
+
+ ## Key Finding
+
+ In our testing of 15+ models:
+ - **gpt-4o**: 57% pass rate (fails on financial, scheduling, units)
+ - **gpt-4o-mini**: 36% pass rate
+ - **Claude 3.5 Haiku**: 93% pass rate
+ - **Llama 3.1 70B**: Fails the canonical discount calculation despite correct explanation
+
+ ## The Canonical Example
+
+ **Problem**: "If a shirt costs $25 and is on 20% sale, and you have a $5 coupon, what do you pay?"
+
+ **Correct answer**: $15 (apply 20% discount first: $25 × 0.8 = $20, then subtract coupon: $20 - $5 = $15)
+
+ When we first ask models to *explain* the procedure, they all correctly state: "First apply the discount, then subtract the coupon."
+
+ When we then ask for the answer, many models fail, giving answers like $16, $17, $22.50, or even $175.
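The arithmetic in the canonical example can be checked directly. The sketch below is illustrative only: `discount_then_coupon` and `coupon_then_discount` are hypothetical helper names, and the coupon-first ordering is shown as one way a wrong total such as $16 can arise, not as a claim about what any particular model did.

```python
def discount_then_coupon(price: float, discount_pct: float, coupon: float) -> float:
    """Intended order: percentage discount first, then the flat coupon."""
    return price * (1 - discount_pct / 100) - coupon


def coupon_then_discount(price: float, discount_pct: float, coupon: float) -> float:
    """Swapped order: flat coupon first, then the percentage discount."""
    return (price - coupon) * (1 - discount_pct / 100)


correct = discount_then_coupon(25, 20, 5)
swapped = coupon_then_discount(25, 20, 5)
print(f"discount then coupon: ${correct:.2f}")
print(f"coupon then discount: ${swapped:.2f}")
```

The two orderings differ by exactly the discount applied to the coupon (20% of $5 = $1), which is why the order of the steps matters even in a two-step problem.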
+
+ ## Dataset Statistics
+
+ | Metric | Value |
+ |--------|-------|
+ | Total problems | 101 |
+ | Domains | 12 |
+ | Difficulty levels | 3 (easy, medium, hard) |
+ | Steps per problem | 2-6 |
+
+ ### Problems by Domain
+
+ **Numerical Domains (67 problems)**
+
+ | Domain | Count | Description |
+ |--------|-------|-------------|
+ | math_discount | 15 | Discounts, coupons, taxes, markups |
+ | time | 13 | Duration arithmetic, travel times |
+ | financial | 10 | Interest, taxes, commissions |
+ | logic | 8 | Ordering, deduction, set operations |
+ | recipe | 7 | Scaling, unit conversion |
+ | scheduling | 7 | Task dependencies, work rates |
+ | units | 7 | Unit conversion with operations |
+
+ **Non-Numerical Domains (34 problems)**
+
+ | Domain | Count | Description |
+ |--------|-------|-------------|
+ | spatial | 7 | Direction tracking, grid navigation, relative positions |
+ | procedural | 6 | State machines, undo/redo, procedure following |
+ | text | 7 | String manipulation, encoding, word operations |
+ | sequence | 7 | Pattern recognition (letters, symbols, words) |
+ | causal | 7 | Cause-effect chains, counterfactuals, necessary/sufficient |
+
+ ### Difficulty Distribution
+
+ | Difficulty | Count | Description |
+ |------------|-------|-------------|
+ | Easy | 28 | 2 steps, straightforward |
+ | Medium | 32 | 2-3 steps, some complexity |
+ | Hard | 7 | 3-4 steps, multiple operations |
+
+ ## Data Format
+
+ Each problem is a JSON object with the following fields:
+
+ ```json
+ {
+   "id": "math_discount_01",
+   "domain": "math_discount",
+   "problem": "A product costs $25 and is on 20% sale. You also have a $5 coupon. What do you pay? Answer with just the number.",
+   "correct_answer": "15",
+   "explanation": "25 × 0.8 = 20.0, then 20.0 - 5 = 15.0",
+   "understanding_check": "To solve this, first apply the 20% discount, then subtract the coupon. What are the two steps?",
+   "difficulty": "easy",
+   "steps": 2
+ }
+ ```
+
+ ### Field Descriptions
+
+ | Field | Description |
+ |-------|-------------|
+ | `id` | Unique identifier (domain_type_number) |
+ | `domain` | Category of reasoning required |
+ | `problem` | The question posed to the model |
+ | `correct_answer` | Expected answer (numeric or text) |
+ | `explanation` | Step-by-step solution |
+ | `understanding_check` | Prompt to verify model understands the procedure |
+ | `difficulty` | easy, medium, or hard |
+ | `steps` | Number of sequential operations required |
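A record can be checked against the field table before it is used. This is a minimal sketch, not part of the repository: `REQUIRED_FIELDS` and `check_record` are hypothetical names, and the type expectations are inferred from the example record above (every field a string except `steps`).

```python
import json

# Field names and types inferred from the example record (an assumption).
REQUIRED_FIELDS = {
    "id": str, "domain": str, "problem": str, "correct_answer": str,
    "explanation": str, "understanding_check": str,
    "difficulty": str, "steps": int,
}
DIFFICULTIES = {"easy", "medium", "hard"}


def check_record(record: dict) -> list[str]:
    """Return a list of human-readable issues found in one record."""
    issues = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            issues.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            issues.append(f"{name} should be {expected_type.__name__}")
    if "difficulty" in record and record["difficulty"] not in DIFFICULTIES:
        issues.append("difficulty must be easy, medium, or hard")
    return issues


sample = json.loads("""
{"id": "math_discount_01", "domain": "math_discount",
 "problem": "A product costs $25 and is on 20% sale. You also have a $5 coupon. What do you pay? Answer with just the number.",
 "correct_answer": "15",
 "explanation": "25 × 0.8 = 20.0, then 20.0 - 5 = 15.0",
 "understanding_check": "To solve this, first apply the 20% discount, then subtract the coupon. What are the two steps?",
 "difficulty": "easy", "steps": 2}
""")
print(check_record(sample))
```

Running a check like this over every line of `data/test.jsonl` is a cheap way to catch malformed records before an evaluation run.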
+
+ ## Usage
+
+ ### Quick Evaluation
+
+ ```bash
+ # Install requirements
+ pip install requests
+
+ # Evaluate OpenAI model
+ python evaluate.py --provider openai --model gpt-4o -v
+
+ # Evaluate Claude model
+ python evaluate.py --provider anthropic --model claude-3-5-haiku-latest -v
+
+ # Evaluate local Ollama model
+ python evaluate.py --provider ollama --model llama3.1:8b -v
+ ```
+
+ ### Python API
+
+ ```python
+ import json
+
+ # Load dataset
+ problems = []
+ with open('data/test.jsonl') as f:
+     for line in f:
+         problems.append(json.loads(line))
+
+ # Test your model
+ for problem in problems:
+     response = your_model.generate(problem['problem'])
+     expected = problem['correct_answer']
+     # Validate response against expected
+ ```
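The loop above leaves the validation step as a comment. One way to complete it is sketched here under stated assumptions: `stub_model` is a hypothetical stand-in for a real model call, the three records are made-up in-memory examples rather than rows from the dataset, and scoring uses exact string match, which is stricter than the `validate_answer` function in `evaluate.py`.

```python
from collections import defaultdict


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; always answers '15'."""
    return "15"


def score(problems: list[dict], generate) -> dict[str, float]:
    """Return the pass rate per domain using exact string match."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for problem in problems:
        answer = generate(problem["problem"]).strip()
        total[problem["domain"]] += 1
        if answer == problem["correct_answer"].strip():
            passed[problem["domain"]] += 1
    return {domain: passed[domain] / total[domain] for domain in total}


# Made-up records with the same keys as the dataset (illustration only).
problems = [
    {"domain": "math_discount", "problem": "first discount problem", "correct_answer": "15"},
    {"domain": "math_discount", "problem": "second discount problem", "correct_answer": "37"},
    {"domain": "time", "problem": "a time problem", "correct_answer": "4:45 PM"},
]
print(score(problems, stub_model))
```

Replacing `stub_model` with a function that calls a real model, and the in-memory list with the records loaded from `data/test.jsonl`, gives a per-domain breakdown comparable in shape to the one `evaluate.py` prints.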
+
+ ### With HuggingFace Datasets
+
+ ```python
+ from datasets import load_dataset
+
+ dataset = load_dataset("Adam1010/goodhart-gap-benchmark")
+
+ for example in dataset['test']:
+     print(example['problem'])
+     print(f"Expected: {example['correct_answer']}")
+ ```
+
+ ## Evaluation Criteria
+
+ A response is considered correct if:
+ 1. **Numeric answers**: The expected number appears in the response (with tolerance for rounding)
+ 2. **Time answers**: The expected time appears in any reasonable format (e.g., "4:45 PM", "4:45pm", "16:45")
+ 3. **Yes/no answers**: The response clearly indicates yes, no, or "cannot determine"
+ 4. **Ordering answers**: Items appear in the correct sequence
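The numeric criterion can be sketched as follows. This is an illustrative variant, not the validator shipped in `evaluate.py`: `numeric_match` is a hypothetical name, it assumes `expected` is a plain number, and it compares whole number tokens so that an expected `15` is not satisfied by `115`. The tolerances (0.01 absolute, 0.5% relative) mirror the ones used in `evaluate.py`.

```python
import re

# A number token: optional sign, a leading digit, optional thousands commas,
# optional decimal part.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def numeric_match(response: str, expected: str, rel_tol: float = 0.005) -> bool:
    """True if any whole number in the response is within tolerance of expected."""
    target = float(expected.replace(",", ""))
    for token in NUMBER.findall(response):
        value = float(token.replace(",", ""))
        if abs(value - target) < 0.01:
            return True
        if target != 0 and abs(value - target) / abs(target) < rel_tol:
            return True
    return False


print(numeric_match("You pay $15.", "15"))
print(numeric_match("The answer is 115", "15"))
print(numeric_match("Total: 1,250.50", "1250.5"))
```

Comparing parsed tokens instead of substrings trades a little leniency for fewer false positives, which matters when the pass rate itself is the headline result.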
+
+ ## Leaderboard
+
+ | Model | Provider | Pass Rate | Weakest Domain |
+ |-------|----------|-----------|----------------|
+ | Claude 3.5 Haiku | Anthropic | 93% | logic |
+ | Claude Sonnet 4 | Anthropic | 79% | financial, scheduling |
+ | gpt-4o | OpenAI | 57% | scheduling |
+ | gpt-4o-mini | OpenAI | 36% | most domains |
+ | Qwen 2.5 72B | Alibaba | TBD | - |
+ | Llama 3.1 70B | Meta | TBD | - |
+
+ *Submit your results via PR to add to the leaderboard*
+
+ ## Why This Matters
+
+ ### For AI Safety
+ Models that can explain correct procedures but execute them incorrectly are:
+ - Harder to detect through explanation-based evaluation
+ - More dangerous in agentic settings
+ - Evidence of a gap between capability benchmarks and deployment readiness
+
+ ### For Model Selection
+ Not all models are equal for multi-step reasoning:
+ - In our results, model family matters more than size
+ - Distilled models often appear to lose this capability
+ - Test execution, not just explanation
+
+ ### For Training
+ The gap appears to be a training problem:
+ - Well-trained smaller models (Claude 3.5 Haiku) outperform larger models
+ - This suggests targeted fine-tuning could help
+
+ ## Citation
+
+ ```bibtex
+ @dataset{goodhart_gap_benchmark_2026,
+   title={Goodhart Gap Benchmark: Detecting the Gap Between Understanding and Execution in LLMs},
+   author={Adam Kruger},
+   year={2026},
+   url={https://huggingface.co/datasets/Adam1010/goodhart-gap-benchmark}
+ }
+ ```
+
+ ## License
+
+ MIT License - free for research and commercial use.
+
+ ## Contributing
+
+ We welcome contributions:
+ - New test cases in underrepresented domains
+ - Results from additional models
+ - Improved validators
+ - Translations to other languages
+
+ Submit issues and PRs at: [GitHub Repository URL]
+
+ ## Acknowledgments
+
+ Research inspired by:
+ - Goodhart's Law and its application to AI evaluation
+ - Work on multi-step reasoning in LLMs
+ - The distinction between System 1 and System 2 thinking
evaluate.py ADDED
@@ -0,0 +1,471 @@
+ #!/usr/bin/env python3
+ """
+ Goodhart Gap Benchmark Evaluation Script
+
+ Evaluate any model on the Goodhart Gap benchmark to detect the gap
+ between understanding and execution in multi-step reasoning.
+
+ Usage:
+     # Using OpenAI API
+     python evaluate.py --provider openai --model gpt-4o
+
+     # Using Anthropic API
+     python evaluate.py --provider anthropic --model claude-3-5-haiku-latest
+
+     # Using local Ollama
+     python evaluate.py --provider ollama --model llama3.1:8b
+
+     # Using HuggingFace transformers
+     # (note: 'huggingface' is not among the --provider choices accepted by main())
+     python evaluate.py --provider huggingface --model meta-llama/Llama-3.1-8B-Instruct
+
+     # Custom API endpoint
+     python evaluate.py --provider custom --model mymodel --api-url http://localhost:8000/v1
+
+ Environment Variables:
+     OPENAI_API_KEY - Required for OpenAI provider
+     ANTHROPIC_API_KEY - Required for Anthropic provider
+     HF_TOKEN - Optional for gated HuggingFace models
+ """
+
+ import argparse
+ import json
+ import os
+ import re
+ import sys
+ from dataclasses import dataclass
+ from datetime import datetime
+ from pathlib import Path
+ from typing import Optional, Callable
+ import time
+
+ # Optional imports
+ try:
+     import requests
+     HAS_REQUESTS = True
+ except ImportError:
+     HAS_REQUESTS = False
+
+ @dataclass
+ class TestResult:
+     id: str
+     domain: str
+     problem: str
+     expected: str
+     response: str
+     extracted_answer: str
+     passed: bool
+     latency_ms: float
+
+ def extract_answer(response: str, expected: str) -> str:
+     """Extract the answer from model response."""
+     response = response.strip()
+
+     # Try to find numbers in the response (a leading digit is required so a
+     # bare comma is never treated as a number)
+     numbers = re.findall(r'-?\d[\d,]*\.?\d*', response)
+
+     # For yes/no questions
+     if expected.lower() in ['yes', 'no']:
+         resp_lower = response.lower()
+         # Compare whole words so that "cannot" or "know" is not read as "no"
+         words = re.findall(r'[a-z]+', resp_lower)
+         if 'yes' in words and 'no' not in words[:3]:
+             return 'yes'
+         if 'no' in words and 'yes' not in words[:3]:
+             return 'no'
+         if 'cannot determine' in resp_lower or 'cannot be determined' in resp_lower:
+             return 'cannot determine'
+
+     # For time answers
+     time_match = re.search(r'(\d{1,2}:\d{2})\s*(AM|PM|am|pm)?', response)
+     if time_match:
+         time_str = time_match.group(1)
+         period = time_match.group(2) or ''
+         return f"{time_str} {period}".strip()
+
+     # For ordering questions (comma-separated names)
+     if ',' in expected and any(c.isalpha() for c in expected):
+         # Try to extract comma-separated list
+         parts = [p.strip() for p in response.split(',') if p.strip()]
+         if len(parts) >= 3:
+             return ', '.join(parts[:5])
+
+     # Return first number found
+     if numbers:
+         return numbers[0].replace(',', '')
+
+     # Return first line or truncated response
+     first_line = response.split('\n')[0]
+     return first_line[:50] if len(first_line) > 50 else first_line
+
+ def validate_answer(response: str, expected: str, domain: str) -> bool:
+     """Validate if the response matches the expected answer."""
+     response = response.lower().strip()
+     expected = expected.lower().strip()
+
+     # Direct match. Skipped for yes/no, where a substring test would accept
+     # words such as "cannot" or "know" as "no".
+     # Note: for numbers this is still a substring test ("15" matches "115").
+     if expected not in ('yes', 'no') and expected in response:
+         return True
+
+     # Numeric comparison (a leading digit is required so a bare comma is
+     # never parsed as a number)
+     expected_nums = re.findall(r'-?\d[\d,]*\.?\d*', expected)
+     response_nums = re.findall(r'-?\d[\d,]*\.?\d*', response)
+
+     if expected_nums and response_nums:
+         try:
+             exp_val = float(expected_nums[0].replace(',', ''))
+             for resp_num in response_nums:
+                 resp_val = float(resp_num.replace(',', ''))
+                 # Allow small floating point tolerance
+                 if abs(exp_val - resp_val) < 0.01:
+                     return True
+                 # Check if it's within 0.5% (for rounding)
+                 if exp_val != 0 and abs(exp_val - resp_val) / abs(exp_val) < 0.005:
+                     return True
+         except ValueError:
+             pass
+
+     # Time validation
+     if domain == 'time':
+         # Normalize time formats (lowercase, no spaces); 24-hour times are
+         # not converted to 12-hour form
+         def normalize_time(t):
+             t = t.lower().replace(' ', '')
+             t = re.sub(r'(\d{1,2}):(\d{2})(am|pm)?', r'\1:\2\3', t)
+             return t
+
+         if normalize_time(expected) in normalize_time(response):
+             return True
+
+     # Yes/no validation (whole-word comparison)
+     if expected in ['yes', 'no', 'cannot determine']:
+         words = re.findall(r'[a-z]+', response)
+         if expected == 'yes' and 'yes' in words and 'no' not in words[:5]:
+             return True
+         if expected == 'no' and 'no' in words and 'yes' not in words[:5]:
+             return True
+         if expected == 'cannot determine' and ('cannot' in response or 'unable' in response):
+             return True
+
+     # Ordering validation (check sequence)
+     if ',' in expected and domain == 'logic':
+         expected_items = [x.strip().lower() for x in expected.split(',')]
+         response_lower = response.lower()
+         # Check if items appear in correct order
+         positions = []
+         for item in expected_items:
+             pos = response_lower.find(item)
+             if pos == -1:
+                 return False
+             positions.append(pos)
+         return positions == sorted(positions)
+
+     return False
+
+ class ModelProvider:
+     """Base class for model providers."""
+
+     def generate(self, prompt: str) -> tuple[str, float]:
+         """Generate response. Returns (response, latency_ms)."""
+         raise NotImplementedError
+
+ class OpenAIProvider(ModelProvider):
+     def __init__(self, model: str, api_key: Optional[str] = None):
+         self.model = model
+         self.api_key = api_key or os.environ.get('OPENAI_API_KEY')
+         if not self.api_key:
+             raise ValueError("OPENAI_API_KEY not set")
+
+     def generate(self, prompt: str) -> tuple[str, float]:
+         start = time.time()
+         headers = {
+             "Authorization": f"Bearer {self.api_key}",
+             "Content-Type": "application/json"
+         }
+         payload = {
+             "model": self.model,
+             "messages": [{"role": "user", "content": prompt}],
+             "temperature": 0.1,
+             "max_tokens": 200
+         }
+         response = requests.post(
+             "https://api.openai.com/v1/chat/completions",
+             headers=headers, json=payload, timeout=60
+         )
+         latency = (time.time() - start) * 1000
+
+         if response.status_code == 200:
+             return response.json()["choices"][0]["message"]["content"].strip(), latency
+         else:
+             return f"ERROR: {response.status_code}", latency
+
+ class AnthropicProvider(ModelProvider):
+     def __init__(self, model: str, api_key: Optional[str] = None):
+         self.model = model
+         self.api_key = api_key or os.environ.get('ANTHROPIC_API_KEY')
+         if not self.api_key:
+             raise ValueError("ANTHROPIC_API_KEY not set")
+
+     def generate(self, prompt: str) -> tuple[str, float]:
+         start = time.time()
+         headers = {
+             "x-api-key": self.api_key,
+             "anthropic-version": "2023-06-01",
+             "Content-Type": "application/json"
+         }
+         payload = {
+             "model": self.model,
+             "max_tokens": 200,
+             "messages": [{"role": "user", "content": prompt}]
+         }
+         response = requests.post(
+             "https://api.anthropic.com/v1/messages",
+             headers=headers, json=payload, timeout=60
+         )
+         latency = (time.time() - start) * 1000
+
+         if response.status_code == 200:
+             return response.json()["content"][0]["text"].strip(), latency
+         else:
+             return f"ERROR: {response.status_code}", latency
+
+ class OllamaProvider(ModelProvider):
+     def __init__(self, model: str, host: str = "http://localhost:11434"):
+         self.model = model
+         self.host = host
+
+     def generate(self, prompt: str) -> tuple[str, float]:
+         start = time.time()
+         payload = {
+             "model": self.model,
+             "prompt": prompt,
+             "stream": False,
+             "options": {"temperature": 0.1}
+         }
+         response = requests.post(
+             f"{self.host}/api/generate",
+             json=payload, timeout=120
+         )
+         latency = (time.time() - start) * 1000
+
+         if response.status_code == 200:
+             return response.json().get("response", "").strip(), latency
+         else:
+             return f"ERROR: {response.status_code}", latency
+
+ class CustomProvider(ModelProvider):
+     def __init__(self, model: str, api_url: str):
+         self.model = model
+         self.api_url = api_url
+
+     def generate(self, prompt: str) -> tuple[str, float]:
+         start = time.time()
+         # Assume OpenAI-compatible API
+         payload = {
+             "model": self.model,
+             "messages": [{"role": "user", "content": prompt}],
+             "temperature": 0.1,
+             "max_tokens": 200
+         }
+         response = requests.post(
+             f"{self.api_url}/chat/completions",
+             json=payload, timeout=120
+         )
+         latency = (time.time() - start) * 1000
+
+         if response.status_code == 200:
+             return response.json()["choices"][0]["message"]["content"].strip(), latency
+         else:
+             return f"ERROR: {response.status_code}", latency
+
+ def load_dataset(path: str = "data/test.jsonl") -> list[dict]:
+     """Load the benchmark dataset."""
+     problems = []
+     with open(path) as f:
+         for line in f:
+             problems.append(json.loads(line))
+     return problems
+
+ def evaluate_model(
+     provider: ModelProvider,
+     problems: list[dict],
+     verbose: bool = False
+ ) -> tuple[list[TestResult], dict]:
+     """Evaluate a model on the benchmark."""
+
+     results = []
+     domain_stats = {}
+
+     for i, problem in enumerate(problems):
+         if verbose:
+             print(f"[{i+1}/{len(problems)}] {problem['id']}...", end=" ", flush=True)
+
+         response, latency = provider.generate(problem['problem'])
+         extracted = extract_answer(response, problem['correct_answer'])
+         passed = validate_answer(response, problem['correct_answer'], problem['domain'])
+
+         result = TestResult(
+             id=problem['id'],
+             domain=problem['domain'],
+             problem=problem['problem'],
+             expected=problem['correct_answer'],
+             response=response[:200],
+             extracted_answer=extracted,
+             passed=passed,
+             latency_ms=latency
+         )
+         results.append(result)
+
+         # Track domain stats
+         domain = problem['domain']
+         if domain not in domain_stats:
+             domain_stats[domain] = {'pass': 0, 'fail': 0}
+         domain_stats[domain]['pass' if passed else 'fail'] += 1
+
+         if verbose:
+             status = "PASS" if passed else "FAIL"
+             print(f"{status} (got: {extracted[:20]})")
+
+     # Calculate summary
+     total_pass = sum(r.passed for r in results)
+     total = len(results)
+
+     summary = {
+         'total': total,
+         'passed': total_pass,
+         'failed': total - total_pass,
+         'pass_rate': total_pass / total if total > 0 else 0,
+         'by_domain': {
+             d: {
+                 'passed': s['pass'],
+                 'total': s['pass'] + s['fail'],
+                 'pass_rate': s['pass'] / (s['pass'] + s['fail'])
+             }
+             for d, s in domain_stats.items()
+         },
+         'avg_latency_ms': sum(r.latency_ms for r in results) / len(results) if results else 0
+     }
+
+     return results, summary
+
+ def save_results(
+     results: list[TestResult],
+     summary: dict,
+     model_name: str,
+     output_dir: str = "results"
+ ):
+     """Save evaluation results."""
+     os.makedirs(output_dir, exist_ok=True)
+
+     timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+     safe_model = re.sub(r'[^\w\-]', '_', model_name)
+
+     # Save detailed results
+     results_file = f"{output_dir}/{safe_model}_{timestamp}_results.jsonl"
+     with open(results_file, 'w') as f:
+         for r in results:
+             f.write(json.dumps({
+                 'id': r.id,
+                 'domain': r.domain,
+                 'expected': r.expected,
+                 'response': r.response,
+                 'extracted': r.extracted_answer,
+                 'passed': r.passed,
+                 'latency_ms': r.latency_ms
+             }) + '\n')
+
+     # Save summary
+     summary_file = f"{output_dir}/{safe_model}_{timestamp}_summary.json"
+     summary['model'] = model_name
+     summary['timestamp'] = timestamp
+     with open(summary_file, 'w') as f:
+         json.dump(summary, f, indent=2)
+
+     return results_file, summary_file
+
+ def print_summary(summary: dict, model_name: str):
+     """Print evaluation summary."""
+     print("\n" + "=" * 60)
+     print("GOODHART GAP BENCHMARK RESULTS")
+     print(f"Model: {model_name}")
+     print("=" * 60)
+
+     print(f"\nOverall: {summary['passed']}/{summary['total']} ({summary['pass_rate']*100:.1f}%)")
+     print(f"Average latency: {summary['avg_latency_ms']:.0f}ms")
+
+     print("\nBy Domain:")
+     print("-" * 40)
+     for domain, stats in sorted(summary['by_domain'].items()):
+         bar = "█" * int(stats['pass_rate'] * 10) + "░" * (10 - int(stats['pass_rate'] * 10))
+         print(f" {domain:<15} {stats['passed']:>2}/{stats['total']:<2} {bar} {stats['pass_rate']*100:>5.1f}%")
+
+     print("\n" + "=" * 60)
+
+     # Interpret results
+     pass_rate = summary['pass_rate']
+     if pass_rate >= 0.9:
+         print("Assessment: LOW GOODHART GAP - Model executes well")
+     elif pass_rate >= 0.7:
+         print("Assessment: MODERATE GOODHART GAP - Some execution issues")
+     elif pass_rate >= 0.5:
+         print("Assessment: SIGNIFICANT GOODHART GAP - Frequent execution failures")
+     else:
+         print("Assessment: SEVERE GOODHART GAP - Major execution problems")
+
+ def main():
+     parser = argparse.ArgumentParser(
+         description="Evaluate a model on the Goodhart Gap Benchmark",
+         formatter_class=argparse.RawDescriptionHelpFormatter,
+         epilog=__doc__
+     )
+     parser.add_argument('--provider', required=True,
+                         choices=['openai', 'anthropic', 'ollama', 'custom'],
+                         help='Model provider')
+     parser.add_argument('--model', required=True,
+                         help='Model name/identifier')
+     parser.add_argument('--api-url', default=None,
+                         help='API URL for custom provider')
+     parser.add_argument('--data', default='data/test.jsonl',
+                         help='Path to test data')
+     parser.add_argument('--output', default='results',
+                         help='Output directory')
+     parser.add_argument('--verbose', '-v', action='store_true',
+                         help='Show progress')
+     parser.add_argument('--limit', type=int, default=None,
+                         help='Limit number of problems (for testing)')
+
+     args = parser.parse_args()
+
+     if not HAS_REQUESTS:
+         print("ERROR: requests library required. Install with: pip install requests")
+         sys.exit(1)
+
+     # Create provider
+     if args.provider == 'openai':
+         provider = OpenAIProvider(args.model)
+     elif args.provider == 'anthropic':
+         provider = AnthropicProvider(args.model)
+     elif args.provider == 'ollama':
+         provider = OllamaProvider(args.model)
+     elif args.provider == 'custom':
+         if not args.api_url:
+             print("ERROR: --api-url required for custom provider")
+             sys.exit(1)
+         provider = CustomProvider(args.model, args.api_url)
+
+     # Load dataset
+     print(f"Loading dataset from {args.data}...")
+     problems = load_dataset(args.data)
+     if args.limit:
+         problems = problems[:args.limit]
+     print(f"Loaded {len(problems)} problems")
+
+     # Evaluate
+     print(f"\nEvaluating {args.model}...")
+     results, summary = evaluate_model(provider, problems, verbose=args.verbose)
+
+     # Save and print results
+     results_file, summary_file = save_results(results, summary, args.model, args.output)
+     print_summary(summary, args.model)
+
+     print(f"\nResults saved to:")
+     print(f" {results_file}")
+     print(f" {summary_file}")
+
+ if __name__ == "__main__":
+     main()
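The README lists "16:45" as an acceptable form of "4:45 PM", but the `normalize_time` helper in `validate_answer` only lowercases and removes spaces, so a 24-hour answer would not match a 12-hour expected value. One possible way to compare the two is to convert both to minutes since midnight. The sketch below is an assumption-laden illustration, not part of this commit: `to_minutes` and `same_time` are hypothetical names, and a time without AM/PM is read as 24-hour.

```python
import re
from typing import Optional

TIME = re.compile(r"\b(\d{1,2}):(\d{2})\s*(am|pm)?", re.IGNORECASE)


def to_minutes(text: str) -> Optional[int]:
    """Convert the first clock time found in text to minutes since midnight."""
    match = TIME.search(text)
    if not match:
        return None
    hour, minute = int(match.group(1)), int(match.group(2))
    period = (match.group(3) or "").lower()
    if period == "pm" and hour != 12:
        hour += 12          # 4:45 PM -> 16:45
    elif period == "am" and hour == 12:
        hour = 0            # 12:15 AM -> 00:15
    return hour * 60 + minute


def same_time(response: str, expected: str) -> bool:
    """True if both strings contain a clock time denoting the same minute."""
    a, b = to_minutes(response), to_minutes(expected)
    return a is not None and a == b


print(same_time("16:45", "4:45 PM"))
print(same_time("The session starts at 4:45pm.", "4:45 PM"))
print(same_time("4:45 AM", "4:45 PM"))
```

A bare "4:45" is treated as 04:45 here, so it would not match "4:45 PM"; whether to accept such ambiguous answers is a scoring policy decision that this sketch leaves open.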
generate_dataset.py ADDED
@@ -0,0 +1,1040 @@
+ #!/usr/bin/env python3
+ """
+ Generate the Goodhart Gap Benchmark Dataset
+
+ Creates 70-100 multi-step reasoning problems across 7 domains,
+ specifically designed to detect the gap between understanding and execution.
+ """
+
+ import json
+ import random
+ from dataclasses import dataclass, asdict
+ from typing import List, Callable
+ import re
+
+ @dataclass
+ class TestCase:
+     id: str
+     domain: str
+     problem: str
+     correct_answer: str
+     explanation: str
+     understanding_check: str
+     difficulty: str  # easy, medium, hard
+     steps: int  # number of sequential steps required
+
+ def generate_math_discount_problems() -> List[TestCase]:
+     """Generate discount/coupon/tax calculation problems."""
+     problems = []
+
+     # Template 1: Discount then coupon
+     configs = [
+         (25, 20, 5, "easy"),  # 25 * 0.8 - 5 = 15
+         (50, 10, 8, "easy"),  # 50 * 0.9 - 8 = 37
+         (80, 25, 10, "easy"),  # 80 * 0.75 - 10 = 50
+         (120, 15, 12, "medium"),  # 120 * 0.85 - 12 = 90
+         (200, 30, 25, "medium"),  # 200 * 0.7 - 25 = 115
+         (75, 20, 7, "easy"),  # 75 * 0.8 - 7 = 53
+         (150, 40, 20, "medium"),  # 150 * 0.6 - 20 = 70
+     ]
+
+     for i, (price, discount, coupon, diff) in enumerate(configs):
+         discounted = price * (1 - discount/100)
+         final = discounted - coupon
+         problems.append(TestCase(
+             id=f"math_discount_{i+1:02d}",
+             domain="math_discount",
+             problem=f"A product costs ${price} and is on {discount}% sale. You also have a ${coupon} coupon. What do you pay? Answer with just the number.",
+             correct_answer=f"{final:.2f}".rstrip('0').rstrip('.'),
+             explanation=f"{price} × {1-discount/100} = {discounted}, then {discounted} - {coupon} = {final}",
+             understanding_check=f"To solve this, first apply the {discount}% discount, then subtract the coupon. What are the two steps?",
+             difficulty=diff,
+             steps=2
+         ))
+
+     # Template 2: Discount then tax
+     tax_configs = [
+         (100, 20, 10, "medium"),  # 100 * 0.8 * 1.1 = 88
+         (250, 15, 8, "medium"),  # 250 * 0.85 * 1.08 = 229.5
+         (80, 25, 5, "easy"),  # 80 * 0.75 * 1.05 = 63
+         (500, 10, 7, "medium"),  # 500 * 0.9 * 1.07 = 481.5
+         (160, 20, 6, "medium"),  # 160 * 0.8 * 1.06 = 135.68
+     ]
+
+     for i, (price, discount, tax, diff) in enumerate(tax_configs):
+         discounted = price * (1 - discount/100)
+         final = discounted * (1 + tax/100)
+         problems.append(TestCase(
+             id=f"math_discount_tax_{i+1:02d}",
+             domain="math_discount",
+             problem=f"An item costs ${price}. First apply a {discount}% discount, then add {tax}% sales tax. What's the final price? Answer with just the number.",
+             correct_answer=f"{final:.2f}".rstrip('0').rstrip('.'),
+             explanation=f"{price} × {1-discount/100} = {discounted}, then {discounted} × {1+tax/100} = {final}",
+             understanding_check=f"First apply the discount, then calculate tax on the discounted price. What are the steps?",
+             difficulty=diff,
+             steps=2
+         ))
+
+     # Template 3: Buy X get Y% off second
+     bogo_configs = [
+         (40, 50, "medium"),  # 40 + 40*0.5 = 60
+         (25, 25, "easy"),  # 25 + 25*0.75 = 43.75
+         (60, 40, "medium"),  # 60 + 60*0.6 = 96
+     ]
+
+     for i, (price, discount, diff) in enumerate(bogo_configs):
+         second = price * (1 - discount/100)
+         total = price + second
+         problems.append(TestCase(
+             id=f"math_bogo_{i+1:02d}",
+             domain="math_discount",
+             problem=f"Shirts cost ${price} each. Buy one, get {discount}% off the second. What's the total for 2 shirts? Answer with just the number.",
+             correct_answer=f"{total:.2f}".rstrip('0').rstrip('.'),
+             explanation=f"First shirt: {price}, Second shirt: {price} × {1-discount/100} = {second}, Total: {total}",
+             understanding_check=f"First shirt is full price, second shirt gets {discount}% off. How do you calculate the total?",
+             difficulty=diff,
+             steps=2
+         ))
+
+     return problems
+
101
+ def generate_time_problems() -> List[TestCase]:
102
+ """Generate time arithmetic problems."""
103
+ problems = []
104
+
105
+ # Template 1: Start time + duration + break
106
+ configs = [
107
+ ("2:30 PM", 105, 30, "4:45 PM", "easy"), # 2:30 + 1:45 + 0:30
108
+ ("9:15 AM", 140, 15, "11:50 AM", "easy"), # 9:15 + 2:20 + 0:15
109
+ ("10:00 AM", 90, 45, "12:15 PM", "medium"),
110
+ ("3:45 PM", 75, 20, "5:20 PM", "easy"),
111
+ ("8:30 AM", 180, 60, "12:30 PM", "medium"),  # 8:30 + 3:00 + 1:00
112
+ ("11:15 AM", 45, 30, "12:30 PM", "easy"),
113
+ ("7:00 PM", 120, 15, "9:15 PM", "easy"),
114
+ ]
115
+
116
+ for i, (start, dur_mins, break_mins, expected, diff) in enumerate(configs):
117
+ dur_h, dur_m = divmod(dur_mins, 60)
+ # Build the duration text without a "0 hours" prefix for sub-hour meetings
+ dur_parts = ([f"{dur_h} hour{'s' if dur_h != 1 else ''}"] if dur_h else []) + ([f"{dur_m} minutes"] if dur_m else [])
+ dur_text = " ".join(dur_parts)
+ problems.append(TestCase(
+ id=f"time_duration_{i+1:02d}",
+ domain="time",
+ problem=f"A meeting starts at {start} and lasts {dur_text}. Then there's a {break_mins} minute break. What time does the next session start? Answer with just the time.",
122
+ correct_answer=expected,
123
+ explanation=f"Add {dur_mins} minutes to {start}, then add {break_mins} minutes",
124
+ understanding_check="Add the meeting duration first, then add the break time. What are the steps?",
125
+ difficulty=diff,
126
+ steps=2
127
+ ))
128
+
129
+ # Template 2: Travel with wait time
130
+ travel_configs = [
131
+ ("9:00 AM", 150, 20, "11:50 AM", "medium"),
132
+ ("2:15 PM", 75, 10, "3:40 PM", "easy"),
133
+ ("6:30 AM", 180, 30, "10:00 AM", "medium"),
134
+ ("4:00 PM", 45, 15, "5:00 PM", "easy"),
135
+ ("7:45 AM", 95, 25, "9:45 AM", "medium"),
136
+ ]
137
+
138
+ for i, (depart, travel_mins, wait_mins, expected, diff) in enumerate(travel_configs):
139
+ t_h, t_m = divmod(travel_mins, 60)
+ # Build the duration text without a "0 hours" prefix for sub-hour journeys
+ t_parts = ([f"{t_h} hour{'s' if t_h != 1 else ''}"] if t_h else []) + ([f"{t_m} minutes"] if t_m else [])
+ t_text = " ".join(t_parts)
+ problems.append(TestCase(
+ id=f"time_travel_{i+1:02d}",
+ domain="time",
+ problem=f"A train departs at {depart}. The journey takes {t_text}. After arrival, you wait {wait_mins} minutes for a connection. What time do you board the connection? Answer with just the time.",
144
+ correct_answer=expected,
145
+ explanation=f"Add {travel_mins} minutes travel, then {wait_mins} minutes wait",
146
+ understanding_check="Calculate arrival time first, then add wait time. What are the steps?",
147
+ difficulty=diff,
148
+ steps=2
149
+ ))
150
+
151
+ # Template 3: Multiple segments
152
+ problems.append(TestCase(
153
+ id="time_multi_01",
154
+ domain="time",
155
+ problem="You leave home at 8:00 AM. Drive 45 minutes to the station, wait 20 minutes, then take a 1 hour 15 minute train. What time do you arrive? Answer with just the time.",
156
+ correct_answer="10:20 AM",
157
+ explanation="8:00 + 0:45 = 8:45, + 0:20 = 9:05, + 1:15 = 10:20 AM",
158
+ understanding_check="Add drive time, then wait time, then train time. What's the sequence?",
159
+ difficulty="hard",
160
+ steps=3
161
+ ))
162
+
163
+ return problems
164
+
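The expected times in `configs` and `travel_configs` are hand-computed, which makes them easy to get wrong. A minimal sketch of a cross-check follows, assuming a helper named `add_minutes` (hypothetical, not part of this commit) built on the standard `datetime` module:

```python
from datetime import datetime, timedelta


def add_minutes(start: str, *minutes: int) -> str:
    """Add one or more durations (in minutes) to a 12-hour clock time.

    Hypothetical helper for validating the hand-written expected times.
    """
    t = datetime.strptime(start, "%I:%M %p") + timedelta(minutes=sum(minutes))
    # strftime zero-pads the hour ("04:45 PM"); strip the leading zero so the
    # result matches the "4:45 PM" style used in the test cases.
    return t.strftime("%I:%M %p").lstrip("0")


print(add_minutes("2:30 PM", 105, 30))  # 4:45 PM
print(add_minutes("8:30 AM", 180, 60))  # 12:30 PM
```

Running each `(start, duration, break)` tuple through such a helper and comparing against the `expected` column would flag any mismatch before the cases are published.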
165
+ def generate_recipe_problems() -> List[TestCase]:
166
+ """Generate recipe scaling problems."""
167
+ problems = []
168
+
169
+ # Template 1: Scale then double/halve
170
+ configs = [
171
+ (2, 4, 6, 2, 6, "easy"), # 2 cups for 4, scale to 6, double = 6
172
+ (3, 8, 12, 0.5, 2.25, "medium"), # 3 eggs for 8, scale to 12 (4.5), halve = 2.25
173
+ (1.5, 4, 8, 2, 6, "easy"), # 1.5 cups for 4, scale to 8 (3), double = 6
174
+ (4, 6, 9, 0.5, 3, "medium"), # 4 tbsp for 6, scale to 9 (6), halve = 3
175
+ (2, 5, 10, 1.5, 6, "medium"), # 2 cups for 5, scale to 10 (4), ×1.5 = 6
176
+ ]
177
+
178
+ ingredients = ["cups of flour", "eggs", "cups of sugar", "tablespoons of butter", "cups of milk"]
179
+
180
+ for i, (amount, serves, new_serves, multiplier, final, diff) in enumerate(configs):
181
+ scaled = amount * (new_serves / serves)
182
+ ing = ingredients[i % len(ingredients)]
183
+ mult_text = "doubled" if multiplier == 2 else "halved" if multiplier == 0.5 else f"multiplied by {multiplier}"
184
+ problems.append(TestCase(
185
+ id=f"recipe_scale_{i+1:02d}",
186
+ domain="recipe",
187
+ problem=f"A recipe for {serves} people needs {amount} {ing}. Scale it to {new_serves} people; that amount is then {mult_text} for a party. How many {ing} do you need? Answer with just the number.",
188
+ correct_answer=f"{final:.2f}".rstrip('0').rstrip('.'),
189
+ explanation=f"{amount} × ({new_serves}/{serves}) = {scaled}, then × {multiplier} = {final}",
190
+ understanding_check=f"First scale the recipe from {serves} to {new_serves} servings, then {mult_text}. What are the steps?",
191
+ difficulty=diff,
192
+ steps=2
193
+ ))
194
+
195
+ # Template 2: Convert units then scale
196
+ problems.append(TestCase(
197
+ id="recipe_convert_01",
198
+ domain="recipe",
199
+ problem="A recipe needs 2 cups of milk (1 cup = 240ml). Convert to ml, then reduce by 25% for a lighter version. How many ml? Answer with just the number.",
200
+ correct_answer="360",
201
+ explanation="2 × 240 = 480ml, then 480 × 0.75 = 360ml",
202
+ understanding_check="Convert cups to ml first, then reduce by the percentage. What are the steps?",
203
+ difficulty="medium",
204
+ steps=2
205
+ ))
206
+
207
+ problems.append(TestCase(
208
+ id="recipe_convert_02",
209
+ domain="recipe",
210
+ problem="A recipe uses 500g of flour. Convert to pounds (1 pound = 454g), then triple for a large batch. How many pounds? Answer with just the number rounded to one decimal.",
211
+ correct_answer="3.3",
212
+ explanation="500 / 454 ≈ 1.10 pounds, then × 3 ≈ 3.30, which rounds to 3.3 pounds",
213
+ understanding_check="Convert grams to pounds first, then triple. What are the steps?",
214
+ difficulty="medium",
215
+ steps=2
216
+ ))
217
+
218
+ return problems
219
+
220
+ def generate_financial_problems() -> List[TestCase]:
221
+ """Generate financial calculation problems."""
222
+ problems = []
223
+
224
+ # Template 1: Compound interest then tax on gains
225
+ configs = [
226
+ (1000, 10, 2, 20, 1168, "medium"), # 1000 × 1.1² = 1210, gains=210, tax=42, final=1168
227
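+ (5000, 5, 3, 15, 5669.91, "hard"),  # 5000 × 1.05³ = 5788.125, gains=788.125, tax=118.22, final≈5669.91
+ (2000, 8, 2, 25, 2249.60, "medium"),  # 2000 × 1.08² = 2332.80, gains=332.80, tax=83.20, final=2249.60
+ (500, 12, 2, 10, 614.48, "medium"),  # 500 × 1.12² = 627.20, gains=127.20, tax=12.72, final=614.48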
230
+ ]
231
+
232
+ for i, (principal, rate, years, tax, expected, diff) in enumerate(configs):
233
+ compound = principal * ((1 + rate/100) ** years)
234
+ gains = compound - principal
235
+ tax_amount = gains * (tax/100)
236
+ final = compound - tax_amount
237
+ problems.append(TestCase(
238
+ id=f"financial_compound_{i+1:02d}",
239
+ domain="financial",
240
+ problem=f"You invest ${principal} at {rate}% annual interest for {years} years (compounded yearly). Then you pay {tax}% tax on the gains only. What's your final amount? Answer with just the number.",
241
+ correct_answer=f"{final:.2f}".rstrip('0').rstrip('.'),
242
+ explanation=f"{principal} × ({1 + rate/100:g})^{years} = {compound:.2f}, gains = {gains:.2f}, tax = {tax_amount:.2f}, final = {final:.2f}",
243
+ understanding_check=f"Calculate compound interest first, then calculate tax only on the gains. What are the steps?",
244
+ difficulty=diff,
245
+ steps=3
246
+ ))
247
+
248
+ # Template 2: Markup then discount
249
+ markup_configs = [
250
+ (500, 25, 10, 562.50, "easy"), # 500 × 1.25 × 0.9 = 562.50
251
+ (200, 50, 20, 240, "easy"), # 200 × 1.5 × 0.8 = 240
252
+ (800, 20, 15, 816, "medium"), # 800 × 1.2 × 0.85 = 816
253
+ (150, 40, 25, 157.50, "medium"), # 150 × 1.4 × 0.75 = 157.50
254
+ (1000, 30, 10, 1170, "medium"), # 1000 × 1.3 × 0.9 = 1170
255
+ ]
256
+
257
+ for i, (cost, markup, discount, expected, diff) in enumerate(markup_configs):
258
+ marked_up = cost * (1 + markup/100)
259
+ final = marked_up * (1 - discount/100)
260
+ problems.append(TestCase(
261
+ id=f"financial_markup_{i+1:02d}",
262
+ domain="financial",
263
+ problem=f"A ${cost} item has {markup}% markup, then {discount}% member discount. What does a member pay? Answer with just the number.",
264
+ correct_answer=f"{final:.2f}".rstrip('0').rstrip('.'),
265
+ explanation=f"{cost} × {1+markup/100:g} = {marked_up:g}, then × {1-discount/100:g} = {final:.2f}",
266
+ understanding_check=f"Apply markup first (increase), then discount (decrease). What are the steps?",
267
+ difficulty=diff,
268
+ steps=2
269
+ ))
270
+
271
+ # Template 3: Commission calculations
272
+ problems.append(TestCase(
273
+ id="financial_commission_01",
274
+ domain="financial",
275
+ problem="A salesperson earns 5% on the first $10,000 of sales and 8% on anything above. They sold $15,000. What's their commission? Answer with just the number.",
276
+ correct_answer="900",
277
+ explanation="5% of 10000 = 500, 8% of 5000 = 400, total = 900",
278
+ understanding_check="Calculate commission on first tier, then on second tier, then add. What are the steps?",
279
+ difficulty="hard",
280
+ steps=3
281
+ ))
282
+
283
+ return problems
284
+
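The compound-interest loop computes `final` itself and never compares it with the `expected` column of its configs, so the two can drift apart unnoticed. A minimal sketch of the same arithmetic as a standalone function, assuming a hypothetical name `after_tax_value` (not part of this commit), could be used to check each tuple:

```python
def after_tax_value(principal: float, rate: float, years: int, tax: float) -> float:
    """Compound yearly, then tax only the gains (hypothetical checker)."""
    compound = principal * (1 + rate / 100) ** years
    gains = compound - principal
    # Tax applies to the gains, not to the original principal.
    return round(compound - gains * tax / 100, 2)


# First config: 1000 at 10% for 2 years, 20% tax on gains.
print(after_tax_value(1000, 10, 2, 20))  # 1168.0
```

A simple guard inside the generator, such as asserting `abs(final - expected) < 0.01` for every tuple, would serve the same purpose.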
285
+ def generate_unit_problems() -> List[TestCase]:
286
+ """Generate unit conversion problems."""
287
+ problems = []
288
+
289
+ # Template 1: Convert, operate, convert back
290
+ configs = [
291
+ (10, 1.6, 5, 13.125, "miles", "km", "medium"), # 10mi→16km, +5=21km, →13.125mi
292
+ (5, 0.4536, 2, 9.41, "pounds", "kg", "medium"),  # 5lb→2.268kg, +2=4.268kg, →9.41lb
293
+ ]
294
+
295
+ problems.append(TestCase(
296
+ id="unit_convert_01",
297
+ domain="units",
298
+ problem="Convert 10 miles to kilometers (1 mile = 1.6 km), add 5 km, then convert back to miles. How many miles? Answer with just the number.",
299
+ correct_answer="13.125",
300
+ explanation="10 × 1.6 = 16 km, 16 + 5 = 21 km, 21 ÷ 1.6 = 13.125 miles",
301
+ understanding_check="Convert to km, add, then convert back. What are the three steps?",
302
+ difficulty="medium",
303
+ steps=3
304
+ ))
305
+
306
+ problems.append(TestCase(
307
+ id="unit_convert_02",
308
+ domain="units",
309
+ problem="Convert 100°F to Celsius (C = (F-32) × 5/9), subtract 10°C, then convert back to Fahrenheit. What's the temperature in °F? Answer with just the number.",
310
+ correct_answer="82",
311
+ explanation="(100-32) × 5/9 = 37.78°C, 37.78 - 10 = 27.78°C, 27.78 × 9/5 + 32 = 82°F",
312
+ understanding_check="Convert F to C, subtract, then convert back. What are the steps?",
313
+ difficulty="hard",
314
+ steps=3
315
+ ))
316
+
317
+ # Template 2: Volume/capacity operations
318
+ problems.append(TestCase(
319
+ id="unit_volume_01",
320
+ domain="units",
321
+ problem="You have 2 liters of water. Add 500ml, then pour out 1/4 of the total. How many ml remain? Answer with just the number.",
322
+ correct_answer="1875",
323
+ explanation="2000 + 500 = 2500ml, then 2500 × 0.75 = 1875ml",
324
+ understanding_check="Add the volumes first, then calculate what remains after pouring out. What are the steps?",
325
+ difficulty="easy",
326
+ steps=2
327
+ ))
328
+
329
+ problems.append(TestCase(
330
+ id="unit_volume_02",
331
+ domain="units",
332
+ problem="A tank holds 50 gallons. Drain 20%, then add 8 gallons. How many gallons now? Answer with just the number.",
333
+ correct_answer="48",
334
+ explanation="50 × 0.8 = 40 gallons, 40 + 8 = 48 gallons",
335
+ understanding_check="First calculate remaining after draining, then add. What are the steps?",
336
+ difficulty="easy",
337
+ steps=2
338
+ ))
339
+
340
+ problems.append(TestCase(
341
+ id="unit_volume_03",
342
+ domain="units",
343
+ problem="A pool holds 10,000 liters. Fill it to 75%, then drain 500 liters. How many liters remain? Answer with just the number.",
344
+ correct_answer="7000",
345
+ explanation="10000 × 0.75 = 7500 liters, 7500 - 500 = 7000 liters",
346
+ understanding_check="Calculate 75% first, then subtract. What are the steps?",
347
+ difficulty="easy",
348
+ steps=2
349
+ ))
350
+
351
+ # Template 3: Distance/speed
352
+ problems.append(TestCase(
353
+ id="unit_speed_01",
354
+ domain="units",
355
+ problem="Drive 60 miles at 30 mph, then 40 miles at 40 mph. What's the total travel time in hours? Answer with just the number.",
356
+ correct_answer="3",
357
+ explanation="60/30 = 2 hours, 40/40 = 1 hour, total = 3 hours",
358
+ understanding_check="Calculate time for each segment using distance/speed, then add. What are the steps?",
359
+ difficulty="medium",
360
+ steps=2
361
+ ))
362
+
363
+ problems.append(TestCase(
364
+ id="unit_speed_02",
365
+ domain="units",
366
+ problem="A car travels 120 km in 1.5 hours, then 80 km in 1 hour. What's the average speed for the entire trip in km/h? Answer with just the number.",
367
+ correct_answer="80",
368
+ explanation="Total distance = 200 km, total time = 2.5 hours, average = 80 km/h",
369
+ understanding_check="Calculate total distance and total time, then divide. What are the steps?",
370
+ difficulty="medium",
371
+ steps=2
372
+ ))
373
+
374
+ return problems
375
+
376
+ def generate_scheduling_problems() -> List[TestCase]:
377
+ """Generate scheduling/dependency problems."""
378
+ problems = []
379
+
380
+ # Template 1: Sequential tasks with parallel
381
+ problems.append(TestCase(
382
+ id="schedule_01",
383
+ domain="scheduling",
384
+ problem="Task A takes 2 hours. Task B takes 3 hours and must start after A finishes. Task C takes 1 hour and runs parallel to B. Starting at 9 AM, when do all tasks finish? Answer with just the time.",
385
+ correct_answer="2:00 PM",
386
+ explanation="A: 9-11 AM, B: 11 AM-2 PM (C runs parallel 11-12). All done at 2 PM",
387
+ understanding_check="A must finish before B starts, C is parallel to B. What determines the end time?",
388
+ difficulty="medium",
389
+ steps=2
390
+ ))
391
+
392
+ problems.append(TestCase(
393
+ id="schedule_02",
394
+ domain="scheduling",
395
+ problem="Process X takes 45 minutes. Process Y takes 30 minutes and needs X's output. Process Z takes 20 minutes and needs Y's output. Total time from start to finish? Answer in minutes.",
396
+ correct_answer="95",
397
+ explanation="45 + 30 + 20 = 95 minutes (sequential dependency chain)",
398
+ understanding_check="X must complete before Y, Y before Z. They're sequential. What's the total?",
399
+ difficulty="easy",
400
+ steps=3
401
+ ))
402
+
403
+ problems.append(TestCase(
404
+ id="schedule_03",
405
+ domain="scheduling",
406
+ problem="Download takes 10 minutes. Install takes 15 minutes (after download). Configuration takes 5 minutes (after install). Testing takes 20 minutes (after config). Total time? Answer in minutes.",
407
+ correct_answer="50",
408
+ explanation="10 + 15 + 5 + 20 = 50 minutes",
409
+ understanding_check="Each step depends on the previous. How do you calculate total time?",
410
+ difficulty="easy",
411
+ steps=4
412
+ ))
413
+
414
+ # Template 2: Multiple paths
415
+ problems.append(TestCase(
416
+ id="schedule_04",
417
+ domain="scheduling",
418
+ problem="Path 1: Tasks A(2h) then B(3h). Path 2: Task C(4h). Both paths must complete. Starting at 10 AM, when is everything done? Answer with just the time.",
419
+ correct_answer="3:00 PM",
420
+ explanation="Path 1: 2+3=5 hours. Path 2: 4 hours. Critical path is 5 hours. 10 AM + 5h = 3 PM",
421
+ understanding_check="Find the longest path (critical path). That determines when everything finishes.",
422
+ difficulty="medium",
423
+ steps=2
424
+ ))
425
+
426
+ problems.append(TestCase(
427
+ id="schedule_05",
428
+ domain="scheduling",
429
+ problem="Team A: 3 tasks of 20 mins each (sequential). Team B: 2 tasks of 25 mins each (sequential). Both teams work in parallel. When do both finish? Answer in minutes from start.",
430
+ correct_answer="60",
431
+ explanation="Team A: 60 mins. Team B: 50 mins. Both done when slower team finishes = 60 mins",
432
+ understanding_check="Teams work in parallel but tasks within each team are sequential. What's the critical path?",
433
+ difficulty="medium",
434
+ steps=2
435
+ ))
436
+
437
+ # Template 3: Work rate problems
438
+ problems.append(TestCase(
439
+ id="schedule_06",
440
+ domain="scheduling",
441
+ problem="Worker A completes a job in 6 hours. Worker B completes it in 4 hours. Working together, how long to complete one job? Answer in hours as a decimal.",
442
+ correct_answer="2.4",
443
+ explanation="Rate A = 1/6, Rate B = 1/4. Combined = 1/6 + 1/4 = 5/12. Time = 12/5 = 2.4 hours",
444
+ understanding_check="Add work rates (1/time), then take reciprocal for combined time. What are the steps?",
445
+ difficulty="hard",
446
+ steps=3
447
+ ))
448
+
449
+ problems.append(TestCase(
450
+ id="schedule_07",
451
+ domain="scheduling",
452
+ problem="A printer prints 30 pages/min. Another prints 20 pages/min. How long to print 250 pages together? Answer in minutes.",
453
+ correct_answer="5",
454
+ explanation="Combined rate = 50 pages/min. 250 ÷ 50 = 5 minutes",
455
+ understanding_check="Add the rates together, then divide total pages by combined rate. What are the steps?",
456
+ difficulty="easy",
457
+ steps=2
458
+ ))
459
+
460
+ return problems
461
+
462
+ def generate_logic_problems() -> List[TestCase]:
463
+ """Generate logic/deduction problems."""
464
+ problems = []
465
+
466
+ # Template 1: Ordering from constraints
467
+ problems.append(TestCase(
468
+ id="logic_order_01",
469
+ domain="logic",
470
+ problem="In a race: Alice finishes before Bob. Carol finishes after Bob but before Dave. Eve finishes between Alice and Bob. List the finish order from first to last, separated by commas.",
471
+ correct_answer="Alice, Eve, Bob, Carol, Dave",
472
+ explanation="From constraints: A < E < B < C < D",
473
+ understanding_check="Each constraint gives you a partial ordering. Combine them to get the full order.",
474
+ difficulty="medium",
475
+ steps=4
476
+ ))
477
+
478
+ problems.append(TestCase(
479
+ id="logic_order_02",
480
+ domain="logic",
481
+ problem="Five books on a shelf from left to right: Red is left of Blue. Green is right of Blue. Yellow is left of Red. Orange is between Blue and Green. What's the order left to right?",
482
+ correct_answer="Yellow, Red, Blue, Orange, Green",
483
+ explanation="Y < R < B < O < G",
484
+ understanding_check="Each constraint tells you relative positions. Build the sequence step by step.",
485
+ difficulty="medium",
486
+ steps=4
487
+ ))
488
+
489
+ # Template 2: Modus ponens chains
490
+ problems.append(TestCase(
491
+ id="logic_modus_01",
492
+ domain="logic",
493
+ problem="If it rains, the ground is wet. If the ground is wet, the game is cancelled. It rained. Is the game cancelled? Answer yes or no.",
494
+ correct_answer="yes",
495
+ explanation="Rain → Wet → Cancelled. Rain is true, so Cancelled is true.",
496
+ understanding_check="Follow the chain of implications: A implies B, B implies C, A is true.",
497
+ difficulty="easy",
498
+ steps=2
499
+ ))
500
+
501
+ problems.append(TestCase(
502
+ id="logic_modus_02",
503
+ domain="logic",
504
+ problem="If the battery is dead, the car won't start. If the car won't start, I'll be late. If I'm late, I'll miss the meeting. The battery is dead. Will I miss the meeting? Answer yes or no.",
505
+ correct_answer="yes",
506
+ explanation="Dead battery → No start → Late → Miss meeting",
507
+ understanding_check="Follow the implication chain from the given fact to the conclusion.",
508
+ difficulty="easy",
509
+ steps=3
510
+ ))
511
+
512
+ problems.append(TestCase(
513
+ id="logic_modus_03",
514
+ domain="logic",
515
+ problem="All programmers know logic. All logicians are good at puzzles. Sam is a programmer. Is Sam good at puzzles? Answer yes, no, or cannot determine.",
516
+ correct_answer="cannot determine",
517
+ explanation="Sam is programmer → knows logic. But knowing logic ≠ being a logician.",
518
+ understanding_check="Check if the chain of implications is complete. Is there a gap?",
519
+ difficulty="hard",
520
+ steps=2
521
+ ))
522
+
523
+ # Template 3: Set/category reasoning
524
+ problems.append(TestCase(
525
+ id="logic_sets_01",
526
+ domain="logic",
527
+ problem="30 students take Math. 25 take Science. 10 take both. How many take at least one subject? Answer with just the number.",
528
+ correct_answer="45",
529
+ explanation="30 + 25 - 10 = 45 (inclusion-exclusion)",
530
+ understanding_check="Add both groups, subtract the overlap to avoid double-counting.",
531
+ difficulty="easy",
532
+ steps=2
533
+ ))
534
+
535
+ problems.append(TestCase(
536
+ id="logic_sets_02",
537
+ domain="logic",
538
+ problem="In a group of 50 people: 35 speak English, 30 speak Spanish, and 20 speak both. How many speak neither? Answer with just the number.",
539
+ correct_answer="5",
540
+ explanation="Either language: 35 + 30 - 20 = 45. Neither: 50 - 45 = 5",
541
+ understanding_check="First find how many speak at least one language, then subtract from total.",
542
+ difficulty="medium",
543
+ steps=3
544
+ ))
545
+
546
+ problems.append(TestCase(
547
+ id="logic_sets_03",
548
+ domain="logic",
549
+ problem="100 people surveyed about pets: 60 have dogs, 40 have cats, 15 have both, and 5 have only fish. How many have no pets? Answer with just the number.",
+ correct_answer="10",
+ explanation="Dogs or cats: 60 + 40 - 15 = 85. Only fish: 5. Any pet: 85 + 5 = 90. No pets: 100 - 90 = 10",
552
+ understanding_check="Apply inclusion-exclusion for dogs/cats, account for fish separately.",
553
+ difficulty="hard",
554
+ steps=3
555
+ ))
556
+
557
+ return problems
558
+
559
+ def generate_spatial_problems() -> List[TestCase]:
560
+ """Generate spatial reasoning problems (non-numerical)."""
561
+ problems = []
562
+
563
+ # Direction tracking
564
+ problems.append(TestCase(
565
+ id="spatial_direction_01",
566
+ domain="spatial",
567
+ problem="You start facing North. Turn right. Turn right again. Which direction are you now facing? Answer with just the direction.",
568
+ correct_answer="South",
569
+ explanation="North → (right) → East → (right) → South",
570
+ understanding_check="Track your direction after each turn. Right from North is East, right from East is...",
571
+ difficulty="easy",
572
+ steps=2
573
+ ))
574
+
575
+ problems.append(TestCase(
576
+ id="spatial_direction_02",
577
+ domain="spatial",
578
+ problem="You face East. Turn left. Turn left. Turn right. Which direction are you facing? Answer with just the direction.",
579
+ correct_answer="North",
+ explanation="East → (left) → North → (left) → West → (right) → North",
581
+ understanding_check="Apply each turn sequentially. Left from East is North, etc.",
582
+ difficulty="medium",
583
+ steps=3
584
+ ))
585
+
586
+ problems.append(TestCase(
587
+ id="spatial_direction_03",
588
+ domain="spatial",
589
+ problem="You start facing North. Turn right 3 times. Which direction are you facing? Answer with just the direction.",
590
+ correct_answer="West",
591
+ explanation="North → East → South → West (3 right turns)",
592
+ understanding_check="Each right turn rotates 90° clockwise. After 3 turns from North...",
593
+ difficulty="easy",
594
+ steps=3
595
+ ))
596
+
597
+ # Grid navigation
598
+ problems.append(TestCase(
599
+ id="spatial_grid_01",
600
+ domain="spatial",
601
+ problem="Start at position (0,0). Move right 3 steps, up 2 steps, left 1 step. What's your final position? Answer as (x,y).",
602
+ correct_answer="(2,2)",
603
+ explanation="(0,0) → (3,0) → (3,2) → (2,2)",
604
+ understanding_check="Track x and y coordinates separately through each move.",
605
+ difficulty="easy",
606
+ steps=3
607
+ ))
608
+
609
+ problems.append(TestCase(
610
+ id="spatial_grid_02",
611
+ domain="spatial",
612
+ problem="Start at (5,5). Move left 2, down 3, right 4, up 1. What's your final position? Answer as (x,y).",
613
+ correct_answer="(7,3)",
614
+ explanation="(5,5) → (3,5) → (3,2) → (7,2) → (7,3)",
615
+ understanding_check="Apply each movement to the coordinates sequentially.",
616
+ difficulty="medium",
617
+ steps=4
618
+ ))
619
+
620
+ # Relative position
621
+ problems.append(TestCase(
622
+ id="spatial_relative_01",
623
+ domain="spatial",
624
+ problem="A is north of B. C is east of B. D is south of C. What direction is D from A? Answer with the direction.",
625
+ correct_answer="Southeast",
626
+ explanation="Draw it: A is above B, C is right of B, D is below C. D is right and below A = Southeast",
627
+ understanding_check="Build a mental map from the relationships, then determine the final direction.",
628
+ difficulty="medium",
629
+ steps=3
630
+ ))
631
+
632
+ problems.append(TestCase(
633
+ id="spatial_relative_02",
634
+ domain="spatial",
635
+ problem="The library is 2 blocks east of the park. The cafe is 3 blocks north of the library. The museum is 2 blocks west of the cafe. Is the museum north of the park? Answer yes or no.",
636
+ correct_answer="yes",
637
+ explanation="Park → (2 east) → Library → (3 north) → Cafe → (2 west) → Museum. Museum is directly north of park.",
638
+ understanding_check="Trace the path and determine the final relative position.",
639
+ difficulty="medium",
640
+ steps=3
641
+ ))
642
+
643
+ return problems
644
+
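The direction-tracking answers can be derived mechanically rather than by hand. A minimal sketch, assuming a hypothetical helper `face_after` (not part of this commit) that treats each turn as a 90° rotation around the compass:

```python
from typing import List

# Compass points in clockwise order, so a right turn is +1 and a left turn is -1.
COMPASS = ["North", "East", "South", "West"]


def face_after(start: str, turns: List[str]) -> str:
    """Return the facing direction after a sequence of 'left'/'right' turns."""
    idx = COMPASS.index(start)
    for turn in turns:
        idx = (idx + (1 if turn == "right" else -1)) % 4
    return COMPASS[idx]


print(face_after("North", ["right", "right"]))         # South
print(face_after("East", ["left", "left", "right"]))   # North
```

Generating `correct_answer` from such a function, instead of typing it in, keeps the answer and the problem text in sync.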
645
+ def generate_procedural_problems() -> List[TestCase]:
646
+ """Generate procedural/state-tracking problems (non-numerical)."""
647
+ problems = []
648
+
649
+ # State machine problems
650
+ problems.append(TestCase(
651
+ id="procedural_state_01",
652
+ domain="procedural",
653
+ problem="A traffic light cycles: Green → Yellow → Red → Green. It's currently Green. What color will it be after 4 changes?",
654
+ correct_answer="Yellow",
655
+ explanation="Green → Yellow → Red → Green → Yellow (4 changes)",
656
+ understanding_check="Follow the cycle for each change. After 4 changes from Green...",
657
+ difficulty="easy",
658
+ steps=4
659
+ ))
660
+
661
+ problems.append(TestCase(
662
+ id="procedural_state_02",
663
+ domain="procedural",
664
+ problem="A door can be: Locked, Closed, or Open. From Locked, you can only Unlock (→Closed). From Closed, you can Lock (→Locked) or Open (→Open). From Open, you can only Close (→Closed). Starting Locked, after: Unlock, Open, Close, Lock - what state is the door?",
665
+ correct_answer="Locked",
666
+ explanation="Locked → (Unlock) → Closed → (Open) → Open → (Close) → Closed → (Lock) → Locked",
667
+ understanding_check="Apply each action to the current state following the rules.",
668
+ difficulty="medium",
669
+ steps=4
670
+ ))
671
+
672
+ # Recipe/procedure following
673
+ problems.append(TestCase(
674
+ id="procedural_recipe_01",
675
+ domain="procedural",
676
+ problem="To make tea: (1) Boil water, (2) Add tea bag, (3) Steep 3 min, (4) Remove bag, (5) Add milk. If you do steps 1,2,5,3,4 in that order, what's wrong?",
677
+ correct_answer="Added milk before steeping",
678
+ explanation="Step 5 (add milk) was done before step 3 (steep) and 4 (remove bag).",
679
+ understanding_check="Compare the actual order to the correct order. What happened out of sequence?",
680
+ difficulty="easy",
681
+ steps=2
682
+ ))
683
+
684
+ problems.append(TestCase(
685
+ id="procedural_recipe_02",
686
+ domain="procedural",
687
+ problem="Password rules: Must start with an uppercase letter, must end with a digit, must have exactly 8 characters. Which is valid: 'Password1', 'password1', 'Pass123x', 'Passwor1'? Answer with just the valid password.",
+ correct_answer="Passwor1",
+ explanation="Password1 = 9 chars (fail). password1 = lowercase start and 9 chars (fail). Pass123x = ends with a letter (fail). Passwor1 = 8 chars, starts with uppercase P, ends with 1 (valid).",
690
+ understanding_check="Check each rule against each password systematically.",
691
+ difficulty="medium",
692
+ steps=3
693
+ ))
694
+
695
+ # Undo/redo operations
696
+ problems.append(TestCase(
697
+ id="procedural_undo_01",
698
+ domain="procedural",
699
+ problem="Text editor starts with 'Hello'. Actions: Append ' World', Append '!', Undo, Append '?'. What's the final text?",
700
+ correct_answer="Hello World?",
701
+ explanation="Hello → 'Hello World' → 'Hello World!' → Undo → 'Hello World' → 'Hello World?'",
702
+ understanding_check="Apply each action, with Undo reverting the last action.",
703
+ difficulty="medium",
704
+ steps=4
705
+ ))
706
+
707
+ problems.append(TestCase(
708
+ id="procedural_undo_02",
709
+ domain="procedural",
710
+ problem="Stack operations: Start empty. Push A, Push B, Pop, Push C, Pop, Pop. What's left on the stack? Answer with the contents or 'empty'.",
711
+ correct_answer="empty",
712
+ explanation="[] → [A] → [A,B] → [A] → [A,C] → [A] → []",
713
+ understanding_check="Push adds to top, Pop removes from top. Track the stack state.",
714
+ difficulty="medium",
715
+ steps=6
716
+ ))
717
+
718
+ return problems
719
+
720
+ def generate_text_manipulation_problems() -> List[TestCase]:
721
+ """Generate text/string manipulation problems (non-numerical)."""
722
+ problems = []
723
+
724
+ # String operations
725
+ problems.append(TestCase(
726
+ id="text_string_01",
727
+ domain="text",
728
+ problem="Take the word 'HELLO'. Reverse it, then remove the first letter. What's the result?",
729
+ correct_answer="LLEH",
730
+ explanation="HELLO → reverse → OLLEH → remove first → LLEH",
731
+ understanding_check="First reverse the string, then remove the first character of the result.",
732
+ difficulty="easy",
733
+ steps=2
734
+ ))
735
+
736
+ problems.append(TestCase(
737
+ id="text_string_02",
738
+ domain="text",
739
+ problem="Start with 'ABCDE'. Remove vowels, then reverse. What's the result?",
740
+ correct_answer="DCB",
741
+ explanation="ABCDE → remove A,E → BCD → reverse → DCB",
742
+ understanding_check="First remove all vowels (A, E, I, O, U), then reverse what's left.",
743
+ difficulty="easy",
744
+ steps=2
745
+ ))
746
+
747
+ problems.append(TestCase(
748
+ id="text_string_03",
749
+ domain="text",
750
+ problem="Take 'PROGRAMMING'. Keep only consonants, then take the first 4 letters. What's the result?",
751
+ correct_answer="PRGR",
752
+ explanation="PROGRAMMING → remove O,A,I → PRGRMMNG → first 4 → PRGR",
753
+ understanding_check="Remove vowels first, then truncate to 4 characters.",
754
+ difficulty="medium",
755
+ steps=2
756
+ ))
757
+
758
+ # Word operations
759
+ problems.append(TestCase(
760
+ id="text_word_01",
761
+ domain="text",
762
+ problem="Sentence: 'The quick brown fox'. Reverse word order, then take the first word. What is it?",
763
+ correct_answer="fox",
764
+ explanation="'The quick brown fox' → 'fox brown quick The' → first word → 'fox'",
765
+ understanding_check="Reverse the order of words (not letters), then take the first one.",
766
+ difficulty="easy",
767
+ steps=2
768
+ ))
769
+
770
+ problems.append(TestCase(
771
+ id="text_word_02",
772
+ domain="text",
773
+ problem="'CAT DOG BIRD'. Replace each word with its first letter, then combine. What's the result?",
774
+ correct_answer="CDB",
775
+ explanation="CAT→C, DOG→D, BIRD→B → CDB",
776
+ understanding_check="Extract first letter of each word, then concatenate.",
777
+ difficulty="easy",
778
+ steps=2
779
+ ))
780
+
781
+ # Encoding/transformation
782
+ problems.append(TestCase(
783
+ id="text_encode_01",
784
+ domain="text",
785
+ problem="Shift each letter in 'CAT' forward by 1 in the alphabet (A→B, B→C, etc.). Then shift the result backward by 2. What's the final word?",
786
+ correct_answer="BZS",
787
+ explanation="CAT → (+1) → DBU → (-2) → BZS (D→B, B→Z, U→S)",
788
+ understanding_check="Apply the first shift, then apply the second shift to the result.",
789
+ difficulty="medium",
790
+ steps=2
791
+ ))
792
+
793
+ problems.append(TestCase(
794
+ id="text_encode_02",
795
+ domain="text",
796
+ problem="Replace each vowel in 'HELLO' with the next vowel (A→E, E→I, I→O, O→U, U→A). What's the result?",
797
+ correct_answer="HILLU",
798
+ explanation="H-E-L-L-O → H-I-L-L-U (E→I, O→U)",
799
+ understanding_check="Find each vowel, replace with next in sequence A-E-I-O-U-A.",
800
+ difficulty="medium",
801
+ steps=2
802
+ ))
803
+
804
+ return problems
805
+
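The letter-shift case (`text_encode_01`) relies on wrapping around the alphabet, since shifting `B` back by 2 has to yield `Z`. A minimal sketch of that transformation, assuming uppercase A-Z input and a hypothetical helper `shift_letters` (not part of this commit):

```python
def shift_letters(word: str, offset: int) -> str:
    """Shift each uppercase letter by `offset`, wrapping around A-Z."""
    base = ord("A")
    return "".join(chr((ord(c) - base + offset) % 26 + base) for c in word)


step1 = shift_letters("CAT", 1)     # forward by 1
step2 = shift_letters(step1, -2)    # backward by 2
print(step1, step2)  # DBU BZS
```

The wraparound is an assumption the problem text leaves implicit; stating it in the prompt would remove any ambiguity for the model under test.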
+ def generate_sequence_problems() -> List[TestCase]:
+     """Generate sequence/pattern problems (non-numerical in nature)."""
+     problems = []
+
+     # Letter patterns
+     problems.append(TestCase(
+         id="sequence_letter_01",
+         domain="sequence",
+         problem="Pattern: A, C, E, G, _. What letter comes next?",
+         correct_answer="I",
+         explanation="Skip one letter each time: A(skip B)C(skip D)E(skip F)G(skip H)I",
+         understanding_check="Identify the pattern (skip 1), then apply it.",
+         difficulty="easy",
+         steps=2
+     ))
+
+     problems.append(TestCase(
+         id="sequence_letter_02",
+         domain="sequence",
+         problem="Pattern: Z, X, V, T, _. What letter comes next?",
+         correct_answer="R",
+         explanation="Going backward, skip one: Z(skip Y)X(skip W)V(skip U)T(skip S)R",
+         understanding_check="Pattern goes backward skipping one letter each time.",
+         difficulty="easy",
+         steps=2
+     ))
+
+     problems.append(TestCase(
+         id="sequence_letter_03",
+         domain="sequence",
+         problem="Pattern: A, B, D, G, K, _. What letter comes next?",
+         correct_answer="P",
+         explanation="Gaps increase: +1, +2, +3, +4, +5. A+1=B, B+2=D, D+3=G, G+4=K, K+5=P",
+         understanding_check="The gap between letters increases by 1 each time.",
+         difficulty="medium",
+         steps=2
+     ))
+
+     # Shape/symbol patterns
+     problems.append(TestCase(
+         id="sequence_symbol_01",
+         domain="sequence",
+         problem="Pattern: ●○●○●_. What comes next: ● or ○?",
+         correct_answer="○",
+         explanation="Alternating: filled, empty, filled, empty, filled, empty",
+         understanding_check="Simple alternating pattern.",
+         difficulty="easy",
+         steps=1
+     ))
+
+     problems.append(TestCase(
+         id="sequence_symbol_02",
+         domain="sequence",
+         problem="Pattern: ●●○●●○●●_. What comes next: ● or ○?",
+         correct_answer="○",
+         explanation="Pattern is: two filled, one empty, repeating. ●●○ ●●○ ●●○",
+         understanding_check="Find the repeating unit (●●○), then continue.",
+         difficulty="easy",
+         steps=2
+     ))
+
+     # Word patterns
+     problems.append(TestCase(
+         id="sequence_word_01",
+         domain="sequence",
+         problem="Pattern: one, two, three, ___, five. What word fills the blank?",
+         correct_answer="four",
+         explanation="Counting sequence: one, two, three, four, five",
+         understanding_check="This is a simple counting sequence.",
+         difficulty="easy",
+         steps=1
+     ))
+
+     problems.append(TestCase(
+         id="sequence_word_02",
+         domain="sequence",
+         problem="Pattern: January, March, May, July, ___. What month comes next?",
+         correct_answer="September",
+         explanation="Odd months: Jan(1), Mar(3), May(5), Jul(7), Sep(9)",
+         understanding_check="These are odd-numbered months. Next odd month is September.",
+         difficulty="easy",
+         steps=2
+     ))
+
+     return problems
+
+ def generate_causal_problems() -> List[TestCase]:
+     """Generate causal reasoning problems (non-numerical)."""
+     problems = []
+
+     # Cause-effect chains
+     problems.append(TestCase(
+         id="causal_chain_01",
+         domain="causal",
+         problem="The power went out. This caused the fridge to stop. The fridge stopping caused the food to spoil. The food spoiling caused everyone to get sick. What was the root cause of everyone getting sick?",
+         correct_answer="The power went out",
+         explanation="Power out → Fridge stops → Food spoils → Sickness. Root cause: power outage",
+         understanding_check="Trace the causal chain back to the original cause.",
+         difficulty="easy",
+         steps=3
+     ))
+
+     problems.append(TestCase(
+         id="causal_chain_02",
+         domain="causal",
+         problem="If the alarm doesn't ring, Tom oversleeps. If Tom oversleeps, he misses the bus. If he misses the bus, he's late for work. The alarm didn't ring. What happens to Tom at work?",
+         correct_answer="He is late",
+         explanation="No alarm → Oversleep → Miss bus → Late for work",
+         understanding_check="Follow the chain of consequences from the initial event.",
+         difficulty="easy",
+         steps=3
+     ))
+
+     # Counterfactual reasoning
+     problems.append(TestCase(
+         id="causal_counter_01",
+         domain="causal",
+         problem="The plant died because it wasn't watered. If the plant had been watered, would it have died? Answer yes, no, or unknown.",
+         correct_answer="no",
+         explanation="The cause of death was lack of water. Removing the cause would prevent the effect.",
+         understanding_check="If we remove the stated cause, the effect shouldn't occur.",
+         difficulty="easy",
+         steps=2
+     ))
+
+     problems.append(TestCase(
+         id="causal_counter_02",
+         domain="causal",
+         problem="The cake burned because the oven was too hot. The oven was too hot because the dial was broken. If the dial worked, would the cake have burned?",
+         correct_answer="no",
+         explanation="Working dial → correct temp → no burning. The broken dial was the root cause.",
+         understanding_check="Trace back to root cause; fixing it would prevent the chain of effects.",
+         difficulty="medium",
+         steps=3
+     ))
+
+     # Sufficient vs necessary
+     problems.append(TestCase(
+         id="causal_necessary_01",
+         domain="causal",
+         problem="Water is necessary for plants to grow. A plant has water. Will it definitely grow? Answer yes, no, or not necessarily.",
+         correct_answer="not necessarily",
+         explanation="Water is necessary but not sufficient. Plant also needs light, soil, etc.",
+         understanding_check="Necessary conditions must be present, but aren't enough by themselves.",
+         difficulty="medium",
+         steps=2
+     ))
+
+     problems.append(TestCase(
+         id="causal_necessary_02",
+         domain="causal",
+         problem="To start a car, you need fuel AND a working battery. A car has fuel but a dead battery. Will it start? Answer yes or no.",
+         correct_answer="no",
+         explanation="Both conditions are necessary. Missing one means it won't start.",
+         understanding_check="With AND conditions, all must be true.",
+         difficulty="easy",
+         steps=2
+     ))
+
+     problems.append(TestCase(
+         id="causal_necessary_03",
+         domain="causal",
+         problem="You can enter the club with a membership card OR by paying the cover charge. You have a membership card. Can you enter? Answer yes or no.",
+         correct_answer="yes",
+         explanation="With OR conditions, meeting one is sufficient.",
+         understanding_check="With OR conditions, satisfying any one is enough.",
+         difficulty="easy",
+         steps=2
+     ))
+
+     return problems
+
+ def main():
+     """Generate all problems and save to JSONL."""
+     all_problems = []
+
+     # Generate problems for each domain
+     generators = [
+         generate_math_discount_problems,
+         generate_time_problems,
+         generate_recipe_problems,
+         generate_financial_problems,
+         generate_unit_problems,
+         generate_scheduling_problems,
+         generate_logic_problems,
+         generate_spatial_problems,
+         generate_procedural_problems,
+         generate_text_manipulation_problems,
+         generate_sequence_problems,
+         generate_causal_problems,
+     ]
+
+     for gen in generators:
+         problems = gen()
+         all_problems.extend(problems)
+         print(f"Generated {len(problems)} problems from {gen.__name__}")
+
+     print(f"\nTotal problems: {len(all_problems)}")
+
+     # Count by domain
+     domain_counts = {}
+     for p in all_problems:
+         domain_counts[p.domain] = domain_counts.get(p.domain, 0) + 1
+
+     print("\nBy domain:")
+     for domain, count in sorted(domain_counts.items()):
+         print(f" {domain}: {count}")
+
+     # Save to JSONL
+     output_path = "data/test.jsonl"
+     with open(output_path, 'w') as f:
+         for p in all_problems:
+             f.write(json.dumps(asdict(p)) + '\n')
+
+     print(f"\nSaved to {output_path}")
+
+     # Also save a summary
+     summary = {
+         "total_problems": len(all_problems),
+         "domains": domain_counts,
+         "difficulty_distribution": {},
+         "step_distribution": {}
+     }
+
+     for p in all_problems:
+         summary["difficulty_distribution"][p.difficulty] = summary["difficulty_distribution"].get(p.difficulty, 0) + 1
+         summary["step_distribution"][str(p.steps)] = summary["step_distribution"].get(str(p.steps), 0) + 1
+
+     with open("data/summary.json", 'w') as f:
+         json.dump(summary, f, indent=2)
+
+     print("Saved summary to data/summary.json")
+
+ if __name__ == "__main__":
+     main()
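As a sanity check on the expected answers, the letter-shift, vowel-rotation, and growing-gap items in the text and sequence domains can be recomputed mechanically. The sketch below is not part of this commit; the helper names `shift_letters`, `rotate_vowels`, and `next_growing_gap` are hypothetical and assume uppercase ASCII input.

```python
import string

ALPHABET = string.ascii_uppercase
VOWELS = "AEIOU"


def shift_letters(word: str, offset: int) -> str:
    """Shift each uppercase letter by `offset`, wrapping around the alphabet."""
    return "".join(ALPHABET[(ALPHABET.index(ch) + offset) % 26] for ch in word)


def rotate_vowels(word: str) -> str:
    """Replace each vowel with the next one in the cycle A-E-I-O-U-A."""
    return "".join(
        VOWELS[(VOWELS.index(ch) + 1) % 5] if ch in VOWELS else ch for ch in word
    )


def next_growing_gap(letters: str) -> str:
    """Continue a sequence whose gap grows by 1 each step (A, B, D, G, K, ...)."""
    last_gap = ALPHABET.index(letters[-1]) - ALPHABET.index(letters[-2])
    return ALPHABET[ALPHABET.index(letters[-1]) + last_gap + 1]


print(shift_letters(shift_letters("CAT", 1), -2))  # text_encode_01
print(rotate_vowels("HELLO"))                      # text_encode_02
print(next_growing_gap("ABDGK"))                   # sequence_letter_03
```

The three printed values should agree with the `correct_answer` fields of `text_encode_01`, `text_encode_02`, and `sequence_letter_03`.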
requirements.txt ADDED
@@ -0,0 +1 @@
+ requests>=2.28.0
results/claude-3-5-haiku-latest_20260103_182323_results.jsonl ADDED
@@ -0,0 +1,5 @@
+ {"id": "math_discount_01", "domain": "math_discount", "expected": "15", "response": "Let me solve this step by step:\n1. Original price = $25\n2. 20% sale means 20% off: $25 \u00d7 0.8 = $20\n3. Then subtract $5 coupon: $20 - $5 = $15\n\n15", "extracted": "1.", "passed": true, "latency_ms": 1811.5723133087158}
+ {"id": "math_discount_02", "domain": "math_discount", "expected": "37", "response": "37", "extracted": "37", "passed": true, "latency_ms": 779.4408798217773}
+ {"id": "math_discount_03", "domain": "math_discount", "expected": "50", "response": "50", "extracted": "50", "passed": true, "latency_ms": 1328.655481338501}
+ {"id": "math_discount_04", "domain": "math_discount", "expected": "90", "response": "Let's solve this step by step:\n1. Original price: $120\n2. 15% sale discount: $120 \u00d7 0.15 = $18\n3. Price after sale: $120 - $18 = $102\n4. Apply $12 coupon: $102 - $12 = $90\n\n90", "extracted": "1.", "passed": true, "latency_ms": 1813.749074935913}
+ {"id": "math_discount_05", "domain": "math_discount", "expected": "115", "response": "Let me solve this step by step:\n1. Original price = $200\n2. Sale discount = 30% of $200 = $60\n3. Price after sale = $200 - $60 = $140\n4. Price after coupon = $140 - $25 = $115\n\n115", "extracted": "1.", "passed": true, "latency_ms": 1643.8806056976318}
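In the rows above, `extracted` is `1.` whenever the response is a numbered step-by-step solution, which suggests the extractor picks up the first list marker rather than the final answer (the rows are still marked `passed`, presumably through a separate check not shown here). A minimal sketch of one alternative, assuming the model puts its final answer on the last non-empty line; `extract_final_number` is a hypothetical helper, not the benchmark's own extractor.

```python
import re
from typing import Optional

# Integers or decimals, optionally with thousands separators (e.g. 7,000).
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(response: str) -> Optional[str]:
    """Return the last number on the last non-empty line, or None if absent."""
    lines = [ln.strip() for ln in response.splitlines() if ln.strip()]
    if not lines:
        return None
    matches = NUMBER.findall(lines[-1])
    if not matches:
        return None
    # Drop thousands separators so "7,000" compares equal to "7000".
    return matches[-1].replace(",", "")


sample = (
    "Let me solve this step by step:\n"
    "1. Original price = $25\n"
    "2. 20% sale means 20% off: $25 \u00d7 0.8 = $20\n"
    "3. Then subtract $5 coupon: $20 - $5 = $15\n\n15"
)
print(extract_final_number(sample))
```

This only covers numeric answers; time-of-day and free-text answers would need their own handling, and a response truncated before its final line would still yield a misleading value.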
results/claude-3-5-haiku-latest_20260103_182323_summary.json ADDED
@@ -0,0 +1,16 @@
+ {
+   "total": 5,
+   "passed": 5,
+   "failed": 0,
+   "pass_rate": 1.0,
+   "by_domain": {
+     "math_discount": {
+       "passed": 5,
+       "total": 5,
+       "pass_rate": 1.0
+     }
+   },
+   "avg_latency_ms": 1475.4596710205078,
+   "model": "claude-3-5-haiku-latest",
+   "timestamp": "20260103_182323"
+ }
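The summary fields can be recomputed from the per-row results. The sketch below assumes rows shaped like the results JSONL in this commit (`domain`, `passed`, `latency_ms`); `summarize` is a hypothetical helper and may differ from how the benchmark runner actually builds its summary.

```python
from collections import defaultdict
from typing import Any, Dict, List


def summarize(rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate pass counts, per-domain pass rates, and mean latency."""
    by_domain: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"passed": 0, "total": 0}
    )
    for row in rows:
        stats = by_domain[row["domain"]]
        stats["total"] += 1
        stats["passed"] += int(row["passed"])
    for stats in by_domain.values():
        stats["pass_rate"] = stats["passed"] / stats["total"]

    total = len(rows)
    passed = sum(int(r["passed"]) for r in rows)
    return {
        "total": total,
        "passed": passed,
        "failed": total - passed,
        "pass_rate": passed / total if total else 0.0,
        "by_domain": dict(by_domain),
        "avg_latency_ms": sum(r["latency_ms"] for r in rows) / total if total else 0.0,
    }


# Latencies copied from the five math_discount rows of the 182323 run.
rows = [
    {"domain": "math_discount", "passed": True, "latency_ms": ms}
    for ms in (
        1811.5723133087158,
        779.4408798217773,
        1328.655481338501,
        1813.749074935913,
        1643.8806056976318,
    )
]
print(summarize(rows))
```

The `model` and `timestamp` fields are run metadata and are left out of the sketch.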
results/claude-3-5-haiku-latest_20260103_184241_results.jsonl ADDED
@@ -0,0 +1,101 @@
+ {"id": "math_discount_01", "domain": "math_discount", "expected": "15", "response": "Let me solve this step by step:\n1. Original price: $25\n2. 20% sale: $25 * 0.8 = $20\n3. Apply $5 coupon: $20 - $5 = $15\n\n15", "extracted": "1.", "passed": true, "latency_ms": 1665.3423309326172}
+ {"id": "math_discount_02", "domain": "math_discount", "expected": "37", "response": "Let me solve this step by step:\n1. Original price: $50\n2. Sale discount: 10% of $50 = $5\n3. Price after sale: $50 - $5 = $45\n4. Apply coupon: $45 - $8 = $37\n\n37", "extracted": "1.", "passed": true, "latency_ms": 1732.2866916656494}
+ {"id": "math_discount_03", "domain": "math_discount", "expected": "50", "response": "Let's solve this step by step:\n1. Original price: $80\n2. Sale discount: 25% of $80 = $20\n3. Price after sale: $80 - $20 = $60\n4. Price after coupon: $60 - $10 = $50\n\n50", "extracted": "1.", "passed": true, "latency_ms": 1807.16872215271}
+ {"id": "math_discount_04", "domain": "math_discount", "expected": "90", "response": "Let me solve this step by step:\n1. Original price: $120\n2. Sale discount: 15% of $120 = $18\n3. Price after sale: $120 - $18 = $102\n4. Apply coupon: $102 - $12 = $90\n\n90", "extracted": "1.", "passed": true, "latency_ms": 1624.4354248046875}
+ {"id": "math_discount_05", "domain": "math_discount", "expected": "115", "response": "Let me solve this step by step:\n1. Original price: $200\n2. Sale discount: 30% of $200 = $60\n3. Price after sale: $200 - $60 = $140\n4. Price after coupon: $140 - $25 = $115\n\n115", "extracted": "1.", "passed": true, "latency_ms": 1601.900339126587}
+ {"id": "math_discount_06", "domain": "math_discount", "expected": "53", "response": "Let me solve this step by step:\n1. Original price is $75\n2. 20% sale means 20% off: $75 \u00d7 0.8 = $60\n3. Then subtract the $7 coupon: $60 - $7 = $53\n\n53", "extracted": "1.", "passed": true, "latency_ms": 1786.752700805664}
+ {"id": "math_discount_07", "domain": "math_discount", "expected": "70", "response": "Let me solve this step by step:\n1. Original price: $150\n2. Sale discount: 40% of $150 = $60\n3. Price after sale: $150 - $60 = $90\n4. Apply $20 coupon: $90 - $20 = $70\n\n70", "extracted": "1.", "passed": true, "latency_ms": 1925.631046295166}
+ {"id": "math_discount_tax_01", "domain": "math_discount", "expected": "88", "response": "Let me solve this step by step:\n1. Original price: $100\n2. 20% discount: $100 \u00d7 0.8 = $80\n3. 10% sales tax on $80: $80 \u00d7 1.1 = $88\n4. Final price: $88\n\n88", "extracted": "1.", "passed": true, "latency_ms": 1915.6239032745361}
+ {"id": "math_discount_tax_02", "domain": "math_discount", "expected": "229.5", "response": "Let me solve this step by step:\n1. Original price = $250\n2. 15% discount: $250 \u00d7 0.85 = $212.50\n3. 8% sales tax on $212.50: $212.50 \u00d7 1.08 = $229.50\n\n229.50", "extracted": "1.", "passed": true, "latency_ms": 1709.0051174163818}
+ {"id": "math_discount_tax_03", "domain": "math_discount", "expected": "63", "response": "Let me solve this step by step:\n1. Original price: $80\n2. 25% discount: $80 \u00d7 0.75 = $60\n3. 5% sales tax on $60: $60 \u00d7 1.05 = $63\n\n63", "extracted": "1.", "passed": true, "latency_ms": 1663.9375686645508}
+ {"id": "math_discount_tax_04", "domain": "math_discount", "expected": "481.5", "response": "Let me solve this step by step:\n\n1. Original price: $500\n2. Apply 10% discount: $500 \u00d7 0.9 = $450\n3. Add 7% sales tax: $450 \u00d7 1.07 = $481.50\n\n482", "extracted": "1.", "passed": true, "latency_ms": 1523.4806537628174}
+ {"id": "math_discount_tax_05", "domain": "math_discount", "expected": "135.68", "response": "Let's solve this step by step:\n1. Original price = $160\n2. 20% discount = $160 \u00d7 0.20 = $32\n3. Price after discount = $160 - $32 = $128\n4. Sales tax = $128 \u00d7 0.06 = $7.68\n5. Final price = $128 + $7.68", "extracted": "1.", "passed": true, "latency_ms": 1783.5311889648438}
+ {"id": "math_bogo_01", "domain": "math_discount", "expected": "60", "response": "Let me solve this step by step:\n1. First shirt costs $40\n2. Second shirt is half price: $40 \u00d7 0.5 = $20\n3. Total for 2 shirts: $40 + $20 = $60\n\n60", "extracted": "1.", "passed": true, "latency_ms": 1828.6585807800293}
+ {"id": "math_bogo_02", "domain": "math_discount", "expected": "43.75", "response": "Let me solve this step by step:\n1. First shirt costs $25\n2. Second shirt gets 25% off: $25 \u00d7 0.75 = $18.75\n3. Total cost: $25 + $18.75 = $43.75\n\n43.75", "extracted": "1.", "passed": true, "latency_ms": 2347.2864627838135}
+ {"id": "math_bogo_03", "domain": "math_discount", "expected": "96", "response": "Let me solve this step by step:\n1. First shirt costs $60\n2. Second shirt gets 40% off\n3. Second shirt price = $60 \u00d7 (1 - 0.40) = $60 \u00d7 0.60 = $36\n4. Total price = $60 + $36 = $96\n\n96", "extracted": "1.", "passed": true, "latency_ms": 2492.582082748413}
+ {"id": "time_duration_01", "domain": "time", "expected": "4:45 PM", "response": "4:45 PM", "extracted": "4:45 PM", "passed": true, "latency_ms": 1068.356990814209}
+ {"id": "time_duration_02", "domain": "time", "expected": "11:50 AM", "response": "11:50 AM", "extracted": "11:50 AM", "passed": true, "latency_ms": 973.7076759338379}
+ {"id": "time_duration_03", "domain": "time", "expected": "12:15 PM", "response": "12:15 PM", "extracted": "12:15 PM", "passed": true, "latency_ms": 972.5644588470459}
+ {"id": "time_duration_04", "domain": "time", "expected": "5:20 PM", "response": "Let me solve this step by step:\n\n1. Meeting starts at 3:45 PM\n2. Meeting lasts 1 hour 15 minutes, so it ends at 5:00 PM\n3. There's a 20 minute break after the meeting\n4. So the next session starts at ", "extracted": "3:45 PM", "passed": true, "latency_ms": 1822.364330291748}
+ {"id": "time_duration_05", "domain": "time", "expected": "1:30 PM", "response": "12:30 PM", "extracted": "12:30 PM", "passed": false, "latency_ms": 998.4822273254395}
+ {"id": "time_duration_06", "domain": "time", "expected": "12:30 PM", "response": "Let me solve this step by step:\n\n1. The meeting starts at 11:15 AM\n2. It lasts 45 minutes\n3. So the meeting ends at 12:00 PM\n4. Then there's a 30 minute break\n5. The next session starts at 12:30 PM\n\n1", "extracted": "11:15 AM", "passed": true, "latency_ms": 1623.4562397003174}
+ {"id": "time_duration_07", "domain": "time", "expected": "9:15 PM", "response": "9:15 PM", "extracted": "9:15 PM", "passed": true, "latency_ms": 1013.6029720306396}
+ {"id": "time_travel_01", "domain": "time", "expected": "11:50 AM", "response": "11:50 AM", "extracted": "11:50 AM", "passed": true, "latency_ms": 982.5129508972168}
+ {"id": "time_travel_02", "domain": "time", "expected": "3:40 PM", "response": "3:40 PM", "extracted": "3:40 PM", "passed": true, "latency_ms": 998.2309341430664}
+ {"id": "time_travel_03", "domain": "time", "expected": "10:00 AM", "response": "10:00 AM", "extracted": "10:00 AM", "passed": true, "latency_ms": 944.6098804473877}
+ {"id": "time_travel_04", "domain": "time", "expected": "5:00 PM", "response": "5:00 PM", "extracted": "5:00 PM", "passed": true, "latency_ms": 981.2672138214111}
+ {"id": "time_travel_05", "domain": "time", "expected": "9:45 AM", "response": "Let me solve this step by step:\n\n1. Train departure time is 7:45 AM\n2. Journey takes 1 hour 35 minutes\n3. So arrival time will be 7:45 AM + 1:35\n \u2022 First, add hours: 7:45 AM + 1:00 = 8:45 AM\n \u2022 Th", "extracted": "7:45 AM", "passed": true, "latency_ms": 2860.9840869903564}
+ {"id": "time_multi_01", "domain": "time", "expected": "10:20 AM", "response": "10:20 AM", "extracted": "10:20 AM", "passed": true, "latency_ms": 845.4210758209229}
+ {"id": "recipe_scale_01", "domain": "recipe", "expected": "6", "response": "Let's solve this step by step:\n1. Original recipe for 4 people needs 2 cups of flour\n2. Scaling to 6 people: (6/4) \u00d7 2 = 3 cups\n3. Doubling for a party: 3 \u00d7 2 = 6\n\n6", "extracted": "1.", "passed": true, "latency_ms": 1828.8705348968506}
+ {"id": "recipe_scale_02", "domain": "recipe", "expected": "2.25", "response": "Let's solve this step by step:\n1. Original recipe (8 people): 3 eggs\n2. Scaled to 12 people: 3 \u00d7 (12/8) = 3 \u00d7 1.5 = 4.5 eggs\n3. Halved for a party: 4.5 \u00f7 2 = 2.25 eggs\n\n2.25", "extracted": "1.", "passed": true, "latency_ms": 1901.9122123718262}
+ {"id": "recipe_scale_03", "domain": "recipe", "expected": "6", "response": "Let's solve this step by step:\n1. Original recipe (4 people): 1.5 cups of sugar\n2. Scaled to 8 people: 1.5 \u00d7 (8/4) = 1.5 \u00d7 2 = 3 cups\n3. Doubled for a party: 3 \u00d7 2 = 6 cups\n\nThe answer is 6.", "extracted": "1.", "passed": true, "latency_ms": 1857.306718826294}
+ {"id": "recipe_scale_04", "domain": "recipe", "expected": "3", "response": "Let's solve this step by step:\n1. Original recipe (6 people): 4 tablespoons butter\n2. Scale to 9 people: (9/6) \u00d7 4 = 6 tablespoons\n3. Halved for a party: 6 \u00f7 2 = 3\n\n3", "extracted": "1.", "passed": true, "latency_ms": 2242.2499656677246}
+ {"id": "recipe_scale_05", "domain": "recipe", "expected": "6", "response": "Let's solve this step by step:\n1. Original recipe for 5 people: 2 cups of milk\n2. Scale to 10 people: 2 \u00d7 (10/5) = 4 cups of milk\n3. Multiply by 1.5 for the party: 4 \u00d7 1.5 = 6\n\n6", "extracted": "1.", "passed": true, "latency_ms": 1703.7334442138672}
+ {"id": "recipe_convert_01", "domain": "recipe", "expected": "360", "response": "Let's solve this step by step:\n1. 2 cups of milk = 2 \u00d7 240 ml = 480 ml\n2. Reduce by 25%: 480 \u00d7 0.75 = 360 ml\n\n360", "extracted": "1.", "passed": true, "latency_ms": 1638.1311416625977}
+ {"id": "recipe_convert_02", "domain": "recipe", "expected": "3.3", "response": "Let's solve this step by step:\n\n1. Convert 500g to pounds:\n \u2022 500 \u00f7 454 = 1.1\n\n2. Triple the amount:\n \u2022 1.1 \u00d7 3 = 3.3\n\nThe answer is 3.3.", "extracted": "1.", "passed": true, "latency_ms": 1756.3865184783936}
+ {"id": "financial_compound_01", "domain": "financial", "expected": "1168", "response": "Let's solve this step by step:\n\n1. Initial investment: $1000\n2. First year interest: $1000 \u00d7 0.10 = $100\n3. Balance after first year: $1000 + $100 = $1100\n4. Second year interest: $1100 \u00d7 0.10 = $110\n", "extracted": "1.", "passed": true, "latency_ms": 2502.6097297668457}
+ {"id": "financial_compound_02", "domain": "financial", "expected": "5669.91", "response": "Let me solve this step by step:\n\n1. Initial investment = $5000\n2. Interest rate = 5% per year\n3. Years = 3\n4. Compounded yearly\n\nFirst, calculate the total amount after 3 years:\n\u2022 Year 1: $5000 \u00d7 1.05", "extracted": "1.", "passed": true, "latency_ms": 3122.943878173828}
+ {"id": "financial_compound_03", "domain": "financial", "expected": "2249.6", "response": "Let's solve this step by step:\n\n1. Initial investment: $2000\n2. Annual interest rate: 8%\n3. First year: \n \u2022 $2000 * 1.08 = $2,160\n\n4. Second year:\n \u2022 $2,160 * 1.08 = $2,332.80\n\n5. Total interest g", "extracted": "1.", "passed": true, "latency_ms": 2934.3974590301514}
+ {"id": "financial_compound_04", "domain": "financial", "expected": "614.48", "response": "Let me solve this step by step:\n\n1. Initial investment = $500\n2. Interest rate = 12% per year\n3. Years = 2\n4. Calculation of total amount before tax:\n \u2022 First year: $500 \u00d7 1.12 = $560\n \u2022 Second ye", "extracted": "1.", "passed": true, "latency_ms": 5129.877805709839}
+ {"id": "financial_markup_01", "domain": "financial", "expected": "562.5", "response": "Let's solve this step by step:\n\n1. Original price = $500\n\n2. Markup at 25%:\n \u2022 Markup amount = $500 \u00d7 0.25 = $125\n \u2022 Price after markup = $500 + $125 = $625\n\n3. 10% member discount on $625:\n \u2022 D", "extracted": "1.", "passed": true, "latency_ms": 2244.7335720062256}
+ {"id": "financial_markup_02", "domain": "financial", "expected": "240", "response": "Let's solve this step by step:\n\n1) Original price = $200\n\n2) With 50% markup:\n \u2022 Markup amount = $200 \u00d7 0.5 = $100\n \u2022 Price after markup = $200 + $100 = $300\n\n3) With 20% member discount:\n \u2022 Dis", "extracted": "1", "passed": true, "latency_ms": 2275.6948471069336}
+ {"id": "financial_markup_03", "domain": "financial", "expected": "816", "response": "Let's solve this step by step:\n\n1) First, calculate the markup price\n \u2022 Original price = $800\n \u2022 Markup percentage = 20%\n \u2022 Markup amount = $800 \u00d7 0.20 = $160\n \u2022 Price after markup = $800 + $1", "extracted": "1", "passed": true, "latency_ms": 2666.01300239563}
+ {"id": "financial_markup_04", "domain": "financial", "expected": "157.5", "response": "Let's solve this step by step:\n\n1) First, calculate the price after markup:\n \u2022 Original price = $150\n \u2022 Markup = 40%\n \u2022 Price after markup = $150 + (40% \u00d7 $150)\n \u2022 Price after markup = $150 + ", "extracted": "1", "passed": true, "latency_ms": 2844.623804092407}
+ {"id": "financial_markup_05", "domain": "financial", "expected": "1170", "response": "Let's solve this step by step:\n1. Original price = $1000\n2. With 30% markup: $1000 * 1.3 = $1300\n3. With 10% member discount: $1300 * 0.9 = $1170\n\n1170", "extracted": "1.", "passed": true, "latency_ms": 1702.9638290405273}
+ {"id": "financial_commission_01", "domain": "financial", "expected": "900", "response": "Let's solve this step by step:\n\n1. First $10,000 at 5%:\n \u2022 $10,000 \u00d7 0.05 = $500\n\n2. Remaining $5,000 at 8%:\n \u2022 $5,000 \u00d7 0.08 = $400\n\n3. Total commission:\n \u2022 $500 + $400 = $900\n\n900", "extracted": "1.", "passed": true, "latency_ms": 1921.6413497924805}
+ {"id": "unit_convert_01", "domain": "units", "expected": "13.125", "response": "Let's solve this step by step:\n\n1. Convert 10 miles to kilometers:\n 10 \u00d7 1.6 = 16 km\n\n2. Add 5 km:\n 16 + 5 = 21 km\n\n3. Convert 21 km back to miles:\n 21 \u00f7 1.6 = 13.125\n\n4. The answer is 13.125, s", "extracted": "1.", "passed": true, "latency_ms": 2135.6000900268555}
+ {"id": "unit_convert_02", "domain": "units", "expected": "82", "response": "Let's solve this step by step:\n\n1) First, convert 100\u00b0F to Celsius:\n C = (100 - 32) \u00d7 5/9\n C = 68 \u00d7 5/9\n C = 37.78\n\n2) Subtract 10\u00b0C:\n 37.78 - 10 = 27.78\n\n3) Convert back to Fahrenheit:\n F =", "extracted": "1", "passed": false, "latency_ms": 2369.990825653076}
+ {"id": "unit_volume_01", "domain": "units", "expected": "1875", "response": "Let's solve this step by step:\n\n1. Start with 2 liters = 2000 ml\n2. Add 500 ml: 2000 ml + 500 ml = 2500 ml\n3. Pour out 1/4 of 2500 ml: 2500 \u00d7 1/4 = 625 ml\n4. Remaining water: 2500 ml - 625 ml = 1875 m", "extracted": "1.", "passed": true, "latency_ms": 1937.8395080566406}
+ {"id": "unit_volume_02", "domain": "units", "expected": "48", "response": "Let's solve this step by step:\n1. The tank starts with 50 gallons\n2. Drain 20%: 50 \u00d7 0.2 = 10 gallons removed\n3. Remaining in tank: 50 - 10 = 40 gallons\n4. Add 8 gallons: 40 + 8 = 48\n\n48", "extracted": "1.", "passed": true, "latency_ms": 1903.308391571045}
+ {"id": "unit_volume_03", "domain": "units", "expected": "7000", "response": "Let's solve this step by step:\n\n1. Total pool capacity = 10,000 liters\n2. Fill to 75% = 10,000 \u00d7 0.75 = 7,500 liters\n3. Drain 500 liters = 7,500 - 500 = 7,000 liters\n\n7000", "extracted": "1.", "passed": true, "latency_ms": 1848.1707572937012}
+ {"id": "unit_speed_01", "domain": "units", "expected": "3", "response": "Let's solve this step by step:\n\n1. First segment: 60 miles at 30 mph\n \u2022 Time = Distance \u00f7 Speed\n \u2022 Time = 60 \u00f7 30 = 2 hours\n\n2. Second segment: 40 miles at 40 mph\n \u2022 Time = Distance \u00f7 Speed\n \u2022", "extracted": "1.", "passed": true, "latency_ms": 2499.7081756591797}
+ {"id": "unit_speed_02", "domain": "units", "expected": "80", "response": "Let's solve this step by step:\n\n1. First trip: 120 km in 1.5 hours\n \u2022 Distance = 120 km\n \u2022 Time = 1.5 hours\n\n2. Second trip: 80 km in 1 hour\n \u2022 Distance = 80 km\n \u2022 Time = 1 hour\n\n3. Total dist", "extracted": "1.", "passed": true, "latency_ms": 2687.4144077301025}
+ {"id": "schedule_01", "domain": "scheduling", "expected": "2:00 PM", "response": "Let me solve this step by step:\n\n1. Task A starts at 9 AM and takes 2 hours\n\u2022 A finishes at 11 AM\n\n2. Task B must start after A finishes, so it starts at 11 AM and takes 3 hours\n\u2022 B will finish at 2 P", "extracted": "1.", "passed": true, "latency_ms": 2605.8125495910645}
+ {"id": "schedule_02", "domain": "scheduling", "expected": "95", "response": "Let's solve this step by step:\n\n1. Process X takes 45 minutes\n \u2022 This starts at the beginning\n \u2022 Finishes after 45 minutes\n\n2. Process Y takes 30 minutes\n \u2022 Needs X's output, so it can only star", "extracted": "1.", "passed": true, "latency_ms": 2928.558826446533}
+ {"id": "schedule_03", "domain": "scheduling", "expected": "50", "response": "Let me solve this step by step:\n\n1. Download time = 10 minutes\n2. Install time = 15 minutes (after download)\n3. Configuration time = 5 minutes (after install)\n4. Testing time = 20 minutes (after confi", "extracted": "1.", "passed": true, "latency_ms": 3303.211212158203}
+ {"id": "schedule_04", "domain": "scheduling", "expected": "3:00 PM", "response": "Let me solve this step by step:\n\n1. Path 1 starts at 10 AM\n\u2022 Task A takes 2 hours, so it finishes at 12 PM\n\u2022 Task B takes 3 hours, so it finishes at 3 PM\n\n2. Path 2 starts at 10 AM\n\u2022 Task C takes 4 ho", "extracted": "1.", "passed": false, "latency_ms": 2456.190347671509}
+ {"id": "schedule_05", "domain": "scheduling", "expected": "60", "response": "Let's solve this step by step:\n\n1. Team A's total time:\n \u2022 3 tasks \u00d7 20 mins = 60 mins\n\n2. Team B's total time:\n \u2022 2 tasks \u00d7 25 mins = 50 mins\n\n3. Since they work in parallel, they will finish whe", "extracted": "1.", "passed": true, "latency_ms": 2753.07297706604}
+ {"id": "schedule_06", "domain": "scheduling", "expected": "2.4", "response": "Let's solve this step by step:\n\n1) First, let's find the rate at which each worker completes the job:\n \u2022 Worker A: 1 job in 6 hours = 1/6 job per hour\n \u2022 Worker B: 1 job in 4 hours = 1/4 job per h", "extracted": "1", "passed": false, "latency_ms": 3573.161840438843}
+ {"id": "schedule_07", "domain": "scheduling", "expected": "5", "response": "Let's solve this step by step:\n\n1. First, find the total pages printed per minute by both printers\n \u2022 First printer: 30 pages/min\n \u2022 Second printer: 20 pages/min\n \u2022 Combined rate = 30 + 20 = 50 ", "extracted": "1.", "passed": true, "latency_ms": 2611.100912094116}
+ {"id": "logic_order_01", "domain": "logic", "expected": "Alice, Eve, Bob, Carol, Dave", "response": "Let me solve this step by step:\n\n1. Alice finishes before Bob, so Alice is ahead of Bob in the order.\n\n2. Carol finishes after Bob but before Dave, so Bob comes before Carol, and Carol comes before Da", "extracted": "Let me solve this step by step:\n\n1. Alice finishes before Bob, so Alice is ahead of Bob in the order.\n\n2. Carol finishes after Bob but before Dave, so Bob comes before Carol, and Carol comes before Dave.\n\n3. Eve finishes between Alice and Bob, so the order must be: Alice", "passed": true, "latency_ms": 2504.011631011963}
+ {"id": "logic_order_02", "domain": "logic", "expected": "Yellow, Red, Blue, Orange, Green", "response": "Let's solve this step by step:\n\n1. Yellow is left of Red\n So Yellow comes before Red\n\n2. Red is left of Blue\n So Red comes before Blue\n\n3. Blue is left of Green\n So Blue comes before Green\n\n4. O", "extracted": "Let's solve this step by step:\n\n1. Yellow is left of Red\n So Yellow comes before Red\n\n2. Red is left of Blue\n So Red comes before Blue\n\n3. Blue is left of Green\n So Blue comes before Green\n\n4. Orange is between Blue and Green\n So the order around Blue and Green is: Blue, Orange, Green\n\n5. Putting it all together, from left to right:\n Yellow, Red", "passed": true, "latency_ms": 2462.045907974243}
+ {"id": "logic_modus_01", "domain": "logic", "expected": "yes", "response": "Let me solve this step by step:\n\n1. It rained (this is a given fact)\n2. If it rains, the ground is wet (first statement)\n3. If the ground is wet, the game is cancelled (second statement)\n4. Since it r", "extracted": "yes", "passed": true, "latency_ms": 2172.9440689086914}
+ {"id": "logic_modus_02", "domain": "logic", "expected": "yes", "response": "Let's solve this step by step:\n\n1. \"The battery is dead\" is given as a true statement.\n\n2. If the battery is dead, the car won't start (true)\n\n3. If the car won't start, I'll be late (true)\n\n4. If I'm", "extracted": "yes", "passed": true, "latency_ms": 2520.4737186431885}
+ {"id": "logic_modus_03", "domain": "logic", "expected": "cannot determine", "response": "Let's solve this step by step:\n\n1. Given statements:\n\u2022 All programmers know logic\n\u2022 All logicians are good at puzzles\n\u2022 Sam is a programmer\n\n2. We want to know: Is Sam good at puzzles?\n\n3. Looking at ", "extracted": "1.", "passed": true, "latency_ms": 3793.8151359558105}
+ {"id": "logic_sets_01", "domain": "logic", "expected": "45", "response": "Let me solve this step by step:\n\n1) First, I'll use the Addition Principle of Set Theory:\n Total = (Math students) + (Science students) - (Students in both)\n\n2) Math students: 30\n Science students", "extracted": "1", "passed": true, "latency_ms": 2419.2349910736084}
+ {"id": "logic_sets_02", "domain": "logic", "expected": "5", "response": "Let's solve this step by step:\n\n1) Let's use set theory to solve this.\n \u2022 Total people = 50\n \u2022 People speaking English = 35\n \u2022 People speaking Spanish = 30\n \u2022 People speaking both = 20\n\n2) Let", "extracted": "1", "passed": true, "latency_ms": 2971.888303756714}
+ {"id": "logic_sets_03", "domain": "logic", "expected": "10", "response": "Let's solve this step by step:\n\n1. Total people surveyed = 100\n\n2. People with dogs = 60\n People with cats = 40\n People with both dogs and cats = 15\n People with only fish = 25\n\n3. People with d", "extracted": "1.", "passed": true, "latency_ms": 3558.4352016448975}
+ {"id": "spatial_direction_01", "domain": "spatial", "expected": "South", "response": "South", "extracted": "South", "passed": true, "latency_ms": 810.6105327606201}
+ {"id": "spatial_direction_02", "domain": "spatial", "expected": "West", "response": "West", "extracted": "West", "passed": true, "latency_ms": 1200.2496719360352}
+ {"id": "spatial_direction_03", "domain": "spatial", "expected": "West", "response": "West", "extracted": "West", "passed": true, "latency_ms": 829.9908638000488}
+ {"id": "spatial_grid_01", "domain": "spatial", "expected": "(2,2)", "response": "Let's solve this step by step:\n\n1. Start at (0,0)\n2. Move right 3 steps: x becomes 3\n3. Move up 2 steps: y becomes 2\n4. Move left 1 step: x becomes 2\n\nSo the final position is (2,2).\n\nThe answer is (2", "extracted": "1.", "passed": true, "latency_ms": 1741.8482303619385}
+ {"id": "spatial_grid_02", "domain": "spatial", "expected": "(7,3)", "response": "Let's solve this step by step:\n\n1. Start at (5,5)\n2. Move left 2: (5-2, 5) = (3,5)\n3. Move down 3: (3, 5-3) = (3,2)\n4. Move right 4: (3+4, 2) = (7,2)\n5. Move up 1: (7, 2+1) = (7,3)\n\nThe final position", "extracted": "1.", "passed": true, "latency_ms": 1919.7781085968018}
+ {"id": "spatial_relative_01", "domain": "spatial", "expected": "Southeast", "response": "Let's solve this step by step:\n1. A is north of B\n2. C is east of B\n3. D is south of C\n\nSo let's map this out:\n\u2022 First, B is at the center of our mental map\n\u2022 A is north of B\n\u2022 C is east of B\n\u2022 D is s", "extracted": "1.", "passed": false, "latency_ms": 3456.7742347717285}
+ {"id": "spatial_relative_02", "domain": "spatial", "expected": "yes", "response": "Let's solve this step by step:\n\n1. The library is 2 blocks east of the park\n2. The cafe is 3 blocks north of the library\n3. The museum is 2 blocks west of the cafe\n\nSo, relative to the library's posit", "extracted": "no", "passed": false, "latency_ms": 3633.281469345093}
+ {"id": "procedural_state_01", "domain": "procedural", "expected": "Yellow", "response": "Let's solve this step by step:\n\n1) Current color is Green\n \n2) First change: Green \u2192 Yellow\n \n3) Second change: Yellow \u2192 Red\n \n4) Third change: Red \u2192 Green\n \n5) Fourth change: Green \u2192 Yellow\n\n", "extracted": "1", "passed": true, "latency_ms": 1871.0801601409912}
+ {"id": "procedural_state_02", "domain": "procedural", "expected": "Locked", "response": "Let's walk through the steps:\n\n1. Starting state: Locked\n2. Unlock \u2192 Closed\n3. Open \u2192 Open\n4. Close \u2192 Closed\n5. Lock \u2192 Locked\n\nSo the final state is: Locked", "extracted": "1.", "passed": true, "latency_ms": 1411.0980033874512}
+ {"id": "procedural_recipe_01", "domain": "procedural", "expected": "Added milk before steeping", "response": "Let's analyze the original sequence of steps for making tea:\n1. Boil water\n2. Add tea bag\n3. Steep 3 min\n4. Remove bag\n5. Add milk\n\nIf you do the steps in the order 1, 2, 5, 3, 4, here's what happens:", "extracted": "1.", "passed": false, "latency_ms": 3898.632049560547}
+ {"id": "procedural_recipe_02", "domain": "procedural", "expected": "Passwor1", "response": "Pass1234", "extracted": "1234", "passed": false, "latency_ms": 1206.291675567627}
+ {"id": "procedural_undo_01", "domain": "procedural", "expected": "Hello World?", "response": "Let's solve this step by step:\n\n1. Initial text: \"Hello\"\n\n2. Append ' World': \n Text becomes \"Hello World\"\n\n3. Append '!': \n Text becomes \"Hello World!\"\n\n4. Undo: \n This removes the last action ", "extracted": "1.", "passed": true, "latency_ms": 2253.5808086395264}
+ {"id": "procedural_undo_02", "domain": "procedural", "expected": "empty", "response": "Let me walk through the stack operations step by step:\n\n1. Start: Stack is empty []\n2. Push A: Stack is [A]\n3. Push B: Stack is [A, B]\n4. Pop: Removes B, stack is now [A]\n5. Push C: Stack is [A, C]\n6.", "extracted": "1.", "passed": true, "latency_ms": 2251.429319381714}
+ {"id": "text_string_01", "domain": "text", "expected": "LLEH", "response": "Let me solve this step by step:\n\n1. The original word is 'HELLO'\n\n2. Reversed, it becomes 'OLLEH'\n\n3. Removing the first letter gives 'LLEH'\n\nSo the result is 'LLEH'.", "extracted": "1.", "passed": true, "latency_ms": 1574.5420455932617}
+ {"id": "text_string_02", "domain": "text", "expected": "DCB", "response": "Let me solve this step by step:\n\n1. Start with 'ABCDE'\n2. Remove vowels (A, E): 'BCD'\n3. Reverse 'BCD': 'DCB'\n\nThe result is 'DCB'.", "extracted": "1.", "passed": true, "latency_ms": 1504.4054985046387}
+ {"id": "text_string_03", "domain": "text", "expected": "PRGR", "response": "Let me solve this step by step:\n\n1. The word is 'PROGRAMMING'\n2. The consonants are: P, R, G, R, M, M, N, G\n3. Take the first 4 consonants: P, R, G, R\n4. The result is 'PRGR'\n\nSo the answer is 'PRGR'.", "extracted": "1.", "passed": true, "latency_ms": 1877.6586055755615}
+ {"id": "text_word_01", "domain": "text", "expected": "fox", "response": "Let me solve this step by step:\n\n1. Original sentence: 'The quick brown fox'\n2. Reversed word order: 'fox brown quick The'\n3. First word in the reversed order: 'fox'\n\nSo the answer is 'fox'.", "extracted": "1.", "passed": true, "latency_ms": 1498.9027976989746}
+ {"id": "text_word_02", "domain": "text", "expected": "CDB", "response": "Let me solve this step by step:\n\n1. 'CAT' becomes 'C'\n2. 'DOG' becomes 'D'\n3. 'BIRD' becomes 'B'\n\nCombining these: CDB\n\nThe result is \"CDB\".", "extracted": "1.", "passed": true, "latency_ms": 1684.577465057373}
+ {"id": "text_encode_01", "domain": "text", "expected": "BZS", "response": "Let's solve this step by step:\n\n1. Start with 'CAT'\n2. Shift each letter forward by 1:\n \u2022 C \u2192 D\n \u2022 A \u2192 B\n \u2022 T \u2192 U\n So now the word is 'DBU'\n3. Shift each letter backward by 2:\n \u2022 D \u2192 B\n \u2022 ", "extracted": "1.", "passed": true, "latency_ms": 2281.977891921997}
+ {"id": "text_encode_02", "domain": "text", "expected": "HILLU", "response": "Let's go through this step by step:\n\nH: Unchanged\nE \u2192 I\nL: Unchanged\nL: Unchanged\nO \u2192 U\n\nSo, the result is: HILLO", "extracted": "", "passed": false, "latency_ms": 1374.7119903564453}
+ {"id": "sequence_letter_01", "domain": "sequence", "expected": "I", "response": "Let's solve this step by step:\n\n1. Let's look at the letters in order: A, C, E, G, ____\n\n2. Let's see how the letters are changing:\n \u2022 From A to C: 2 letter jump\n \u2022 From C to E: 2 letter jump\n \u2022", "extracted": "1.", "passed": true, "latency_ms": 3238.2261753082275}
+ {"id": "sequence_letter_02", "domain": "sequence", "expected": "R", "response": "Let's solve this step by step:\n\n1) First, let's look at the letters in order: Z, X, V, T, _\n\n2) Let's see if there's a pattern in how these letters are changing:\n Z \u2192 X (moved 2 letters back in the ", "extracted": "1", "passed": true, "latency_ms": 4357.334136962891}
+ {"id": "sequence_letter_03", "domain": "sequence", "expected": "P", "response": "Let's solve this step by step:\n\n1) First, let's look at the pattern between the letters:\n A \u2192 B (moves forward 1 letter)\n B \u2192 D (moves forward 2 letters)\n D \u2192 G (moves forward 3 letters)\n G \u2192 ", "extracted": "1", "passed": true, "latency_ms": 4180.105924606323}
+ {"id": "sequence_symbol_01", "domain": "sequence", "expected": "\u25cb", "response": "Let me help you solve this step-by-step:\n\n1. Let's look at the pattern: \u25cf\u25cb\u25cf\u25cb\u25cf_\n\n2. I see an alternating pattern of black (\u25cf) and white (\u25cb) dots\n\n3. The pattern so far goes:\n Black (\u25cf)\n White (\u25cb)\n ", "extracted": "1.", "passed": true, "latency_ms": 3127.4898052215576}
+ {"id": "sequence_symbol_02", "domain": "sequence", "expected": "\u25cb", "response": "Let's solve this step by step:\n\n1) First, let's look at the pattern: \u25cf\u25cf\u25cb\u25cf\u25cf\u25cb\u25cf\u25cf_\n\n2) Let's count the sequence of \u25cf and \u25cb symbols from left to right:\n \u2022 \u25cf\u25cf (2 black dots)\n \u2022 \u25cb (1 white dot)\n \u2022 \u25cf\u25cf (", "extracted": "1", "passed": true, "latency_ms": 3472.6948738098145}
+ {"id": "sequence_word_01", "domain": "sequence", "expected": "four", "response": "The missing word is \"four\". \n\nThe pattern is a simple counting sequence in English, starting from one and going up to five in order:\n\none, two, three, four, five\n\nSo \"four\" is the word that fills in t", "extracted": "", "passed": true, "latency_ms": 1698.1353759765625}
+ {"id": "sequence_word_02", "domain": "sequence", "expected": "September", "response": "Let me solve this step by step:\n\n1. Look at the given months: January, March, May, July\n2. Notice the pattern of months:\n \u2022 These are all odd-numbered months\n \u2022 They are spaced 2 months apart\n3. F", "extracted": "1.", "passed": true, "latency_ms": 2551.496744155884}
+ {"id": "causal_chain_01", "domain": "causal", "expected": "The power went out", "response": "According to the sequence of events you described, the root cause of everyone getting sick was the power going out, which initiated the chain of events leading to food spoilage and subsequent illness.", "extracted": "", "passed": false, "latency_ms": 1402.4343490600586}
+ {"id": "causal_chain_02", "domain": "causal", "expected": "He is late", "response": "Let's solve this step by step:\n\n1. First, the alarm didn't ring.\n2. If the alarm doesn't ring, Tom oversleeps.\n3. If Tom oversleeps, he misses the bus.\n4. If he misses the bus, he's late for work.\n\nTh", "extracted": "1.", "passed": false, "latency_ms": 1652.44460105896}
+ {"id": "causal_counter_01", "domain": "causal", "expected": "no", "response": "Unknown. While lack of water is a common cause of plant death, there could be other factors that might have caused the plant to die even if it had been watered, such as disease, pests, extreme tempera", "extracted": "no", "passed": true, "latency_ms": 1706.676959991455}
+ {"id": "causal_counter_02", "domain": "causal", "expected": "no", "response": "No, if the dial worked correctly, the cake would not have burned. The broken dial caused the oven to be set at too high a temperature, which led to the cake burning.", "extracted": "no", "passed": true, "latency_ms": 1319.746732711792}
+ {"id": "causal_necessary_01", "domain": "causal", "expected": "not necessarily", "response": "Not necessarily.\n\nWhile water is essential for plant growth, it is not the only factor determining whether a plant will grow successfully. Other critical factors include:\n\n1. Sunlight\n2. Proper temper", "extracted": "", "passed": true, "latency_ms": 2550.2684116363525}
+ {"id": "causal_necessary_02", "domain": "causal", "expected": "no", "response": "No.", "extracted": "no", "passed": true, "latency_ms": 870.8951473236084}
+ {"id": "causal_necessary_03", "domain": "causal", "expected": "yes", "response": "Yes", "extracted": "yes", "passed": true, "latency_ms": 851.525068283081}
results/claude-3-5-haiku-latest_20260103_184241_summary.json ADDED
@@ -0,0 +1,71 @@
+ {
+   "total": 101,
+   "passed": 90,
+   "failed": 11,
+   "pass_rate": 0.8910891089108911,
+   "by_domain": {
+     "math_discount": {
+       "passed": 15,
+       "total": 15,
+       "pass_rate": 1.0
+     },
+     "time": {
+       "passed": 12,
+       "total": 13,
+       "pass_rate": 0.9230769230769231
+     },
+     "recipe": {
+       "passed": 7,
+       "total": 7,
+       "pass_rate": 1.0
+     },
+     "financial": {
+       "passed": 10,
+       "total": 10,
+       "pass_rate": 1.0
+     },
+     "units": {
+       "passed": 6,
+       "total": 7,
+       "pass_rate": 0.8571428571428571
+     },
+     "scheduling": {
+       "passed": 5,
+       "total": 7,
+       "pass_rate": 0.7142857142857143
+     },
+     "logic": {
+       "passed": 8,
+       "total": 8,
+       "pass_rate": 1.0
+     },
+     "spatial": {
+       "passed": 5,
+       "total": 7,
+       "pass_rate": 0.7142857142857143
+     },
+     "procedural": {
+       "passed": 4,
+       "total": 6,
+       "pass_rate": 0.6666666666666666
+     },
+     "text": {
+       "passed": 6,
+       "total": 7,
+       "pass_rate": 0.8571428571428571
+     },
+     "sequence": {
+       "passed": 7,
+       "total": 7,
+       "pass_rate": 1.0
+     },
+     "causal": {
+       "passed": 5,
+       "total": 7,
+       "pass_rate": 0.7142857142857143
+     }
+   },
+   "avg_latency_ms": 2109.3481106333215,
+   "model": "claude-3-5-haiku-latest",
+   "timestamp": "20260103_184241"
+ }
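The summary fields above (`total`, `passed`, `failed`, `pass_rate`, `by_domain`, `avg_latency_ms`) can in principle be recomputed from the per-item records in the matching `_results.jsonl` file. The following is a minimal sketch under that assumption, not the repository's actual aggregation script. The function names `load_jsonl` and `summarize` are hypothetical; only the record keys (`domain`, `passed`, `latency_ms`) are taken from the result files.

```python
import json
from collections import defaultdict


def load_jsonl(path):
    """Read one JSON object per non-empty line (hypothetical helper)."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def summarize(records):
    """Aggregate per-item records into overall and per-domain pass rates.

    Assumes each record has the keys seen in the results files:
    "domain" (str), "passed" (bool), and "latency_ms" (float).
    """
    by_domain = defaultdict(lambda: {"passed": 0, "total": 0})
    for rec in records:
        stats = by_domain[rec["domain"]]
        stats["total"] += 1
        stats["passed"] += int(rec["passed"])
    for stats in by_domain.values():
        stats["pass_rate"] = stats["passed"] / stats["total"]

    total = len(records)
    passed = sum(int(rec["passed"]) for rec in records)
    latency_sum = sum(rec["latency_ms"] for rec in records)
    return {
        "total": total,
        "passed": passed,
        "failed": total - passed,
        "pass_rate": passed / total if total else 0.0,
        "by_domain": dict(by_domain),
        "avg_latency_ms": latency_sum / total if total else 0.0,
    }
```

Calling `summarize(load_jsonl(path))` on a results file and comparing the output with the committed summary is one way to check that the two files agree. Note that `passed` is taken from the records as-is, so any quirks in answer extraction carry through to the pass rates.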
results/claude-sonnet-4-20250514_20260103_184954_results.jsonl ADDED
@@ -0,0 +1,101 @@
+ {"id": "math_discount_01", "domain": "math_discount", "expected": "15", "response": "15", "extracted": "15", "passed": true, "latency_ms": 1202.0678520202637}
+ {"id": "math_discount_02", "domain": "math_discount", "expected": "37", "response": "37", "extracted": "37", "passed": true, "latency_ms": 1264.0960216522217}
+ {"id": "math_discount_03", "domain": "math_discount", "expected": "50", "response": "50", "extracted": "50", "passed": true, "latency_ms": 1305.506944656372}
+ {"id": "math_discount_04", "domain": "math_discount", "expected": "90", "response": "90", "extracted": "90", "passed": true, "latency_ms": 982.0027351379395}
+ {"id": "math_discount_05", "domain": "math_discount", "expected": "115", "response": "115", "extracted": "115", "passed": true, "latency_ms": 1125.2100467681885}
+ {"id": "math_discount_06", "domain": "math_discount", "expected": "53", "response": "53", "extracted": "53", "passed": true, "latency_ms": 1066.4362907409668}
+ {"id": "math_discount_07", "domain": "math_discount", "expected": "70", "response": "70", "extracted": "70", "passed": true, "latency_ms": 1535.1293087005615}
+ {"id": "math_discount_tax_01", "domain": "math_discount", "expected": "88", "response": "88", "extracted": "88", "passed": true, "latency_ms": 1480.4530143737793}
+ {"id": "math_discount_tax_02", "domain": "math_discount", "expected": "229.5", "response": "230.50", "extracted": "230.50", "passed": true, "latency_ms": 1100.987195968628}
+ {"id": "math_discount_tax_03", "domain": "math_discount", "expected": "63", "response": "63", "extracted": "63", "passed": true, "latency_ms": 964.9076461791992}
+ {"id": "math_discount_tax_04", "domain": "math_discount", "expected": "481.5", "response": "481.50", "extracted": "481.50", "passed": true, "latency_ms": 1567.6820278167725}
+ {"id": "math_discount_tax_05", "domain": "math_discount", "expected": "135.68", "response": "135.68", "extracted": "135.68", "passed": true, "latency_ms": 1486.6652488708496}
+ {"id": "math_bogo_01", "domain": "math_discount", "expected": "60", "response": "60", "extracted": "60", "passed": true, "latency_ms": 1083.873987197876}
+ {"id": "math_bogo_02", "domain": "math_discount", "expected": "43.75", "response": "43.75", "extracted": "43.75", "passed": true, "latency_ms": 1301.2514114379883}
+ {"id": "math_bogo_03", "domain": "math_discount", "expected": "96", "response": "96", "extracted": "96", "passed": true, "latency_ms": 1392.0691013336182}
+ {"id": "time_duration_01", "domain": "time", "expected": "4:45 PM", "response": "4:45 PM", "extracted": "4:45 PM", "passed": true, "latency_ms": 1178.0014038085938}
+ {"id": "time_duration_02", "domain": "time", "expected": "11:50 AM", "response": "11:50 AM", "extracted": "11:50 AM", "passed": true, "latency_ms": 1198.1875896453857}
+ {"id": "time_duration_03", "domain": "time", "expected": "12:15 PM", "response": "12:15 PM", "extracted": "12:15 PM", "passed": true, "latency_ms": 1300.0762462615967}
+ {"id": "time_duration_04", "domain": "time", "expected": "5:20 PM", "response": "5:20 PM", "extracted": "5:20 PM", "passed": true, "latency_ms": 1544.1699028015137}
+ {"id": "time_duration_05", "domain": "time", "expected": "1:30 PM", "response": "12:30 PM", "extracted": "12:30 PM", "passed": false, "latency_ms": 1193.523645401001}
+ {"id": "time_duration_06", "domain": "time", "expected": "12:30 PM", "response": "12:30 PM", "extracted": "12:30 PM", "passed": true, "latency_ms": 1351.5722751617432}
+ {"id": "time_duration_07", "domain": "time", "expected": "9:15 PM", "response": "9:15 PM", "extracted": "9:15 PM", "passed": true, "latency_ms": 1697.411298751831}
+ {"id": "time_travel_01", "domain": "time", "expected": "11:50 AM", "response": "12:00 PM", "extracted": "12:00 PM", "passed": false, "latency_ms": 1147.132396697998}
+ {"id": "time_travel_02", "domain": "time", "expected": "3:40 PM", "response": "3:40 PM", "extracted": "3:40 PM", "passed": true, "latency_ms": 1338.7584686279297}
+ {"id": "time_travel_03", "domain": "time", "expected": "10:00 AM", "response": "10:00 AM", "extracted": "10:00 AM", "passed": true, "latency_ms": 1192.3644542694092}
+ {"id": "time_travel_04", "domain": "time", "expected": "5:00 PM", "response": "5:00 PM", "extracted": "5:00 PM", "passed": true, "latency_ms": 1159.924030303955}
+ {"id": "time_travel_05", "domain": "time", "expected": "9:45 AM", "response": "I need to calculate the arrival time and then add the waiting time.\n\nDeparture: 7:45 AM\nJourney time: 1 hour 35 minutes\nArrival: 7:45 AM + 1:35 = 9:20 AM\n\nWait time: 25 minutes\nConnection boarding tim", "extracted": "7:45 AM", "passed": true, "latency_ms": 2538.8712882995605}
+ {"id": "time_multi_01", "domain": "time", "expected": "10:20 AM", "response": "10:20 AM", "extracted": "10:20 AM", "passed": true, "latency_ms": 1136.9469165802002}
+ {"id": "recipe_scale_01", "domain": "recipe", "expected": "6", "response": "6", "extracted": "6", "passed": true, "latency_ms": 956.7124843597412}
+ {"id": "recipe_scale_02", "domain": "recipe", "expected": "2.25", "response": "2.25", "extracted": "2.25", "passed": true, "latency_ms": 1098.292350769043}
+ {"id": "recipe_scale_03", "domain": "recipe", "expected": "6", "response": "6", "extracted": "6", "passed": true, "latency_ms": 1172.297477722168}
+ {"id": "recipe_scale_04", "domain": "recipe", "expected": "3", "response": "I need to scale the recipe from 6 people to 9 people, then halve it.\n\nOriginal recipe: 6 people need 4 tablespoons butter\n\nScale to 9 people:\n9 \u00f7 6 = 1.5\n4 tablespoons \u00d7 1.5 = 6 tablespoons\n\nThen halv", "extracted": "6", "passed": true, "latency_ms": 2426.9192218780518}
+ {"id": "recipe_scale_05", "domain": "recipe", "expected": "6", "response": "6", "extracted": "6", "passed": true, "latency_ms": 1317.0630931854248}
+ {"id": "recipe_convert_01", "domain": "recipe", "expected": "360", "response": "360", "extracted": "360", "passed": true, "latency_ms": 1231.6324710845947}
+ {"id": "recipe_convert_02", "domain": "recipe", "expected": "3.3", "response": "I need to convert 500g of flour to pounds, then triple it.\n\nFirst, convert 500g to pounds:\n500g \u00f7 454g/pound = 1.101 pounds\n\nThen triple it for the large batch:\n1.101 \u00d7 3 = 3.303 pounds\n\nRounded to on", "extracted": "500", "passed": true, "latency_ms": 2781.614303588867}
+ {"id": "financial_compound_01", "domain": "financial", "expected": "1168", "response": "1216", "extracted": "1216", "passed": false, "latency_ms": 834.8634243011475}
+ {"id": "financial_compound_02", "domain": "financial", "expected": "5669.91", "response": "I need to calculate compound interest, then subtract taxes on the gains.\n\nInitial investment: $5,000\nInterest rate: 5% annually\nTime: 3 years\nTax rate: 15% on gains only\n\nStep 1: Calculate compound in", "extracted": "", "passed": true, "latency_ms": 3585.533857345581}
+ {"id": "financial_compound_03", "domain": "financial", "expected": "2249.6", "response": "2185.60", "extracted": "2185.60", "passed": false, "latency_ms": 1319.6077346801758}
+ {"id": "financial_compound_04", "domain": "financial", "expected": "614.48", "response": "I need to calculate compound interest, then subtract tax on gains.\n\nInitial investment: $500\nInterest rate: 12% annually\nTime: 2 years\n\nAfter 2 years: $500 \u00d7 (1.12)\u00b2 = $500 \u00d7 1.2544 = $627.20\n\nGains: ", "extracted": "", "passed": true, "latency_ms": 2675.029993057251}
+ {"id": "financial_markup_01", "domain": "financial", "expected": "562.5", "response": "412.50", "extracted": "412.50", "passed": false, "latency_ms": 1161.397933959961}
+ {"id": "financial_markup_02", "domain": "financial", "expected": "240", "response": "180", "extracted": "180", "passed": false, "latency_ms": 932.5790405273438}
+ {"id": "financial_markup_03", "domain": "financial", "expected": "816", "response": "768", "extracted": "768", "passed": false, "latency_ms": 1317.3847198486328}
+ {"id": "financial_markup_04", "domain": "financial", "expected": "157.5", "response": "126", "extracted": "126", "passed": false, "latency_ms": 947.8399753570557}
+ {"id": "financial_markup_05", "domain": "financial", "expected": "1170", "response": "910", "extracted": "910", "passed": false, "latency_ms": 1207.2951793670654}
+ {"id": "financial_commission_01", "domain": "financial", "expected": "900", "response": "900", "extracted": "900", "passed": true, "latency_ms": 961.1525535583496}
+ {"id": "unit_convert_01", "domain": "units", "expected": "13.125", "response": "21.25", "extracted": "21.25", "passed": false, "latency_ms": 1145.0726985931396}
+ {"id": "unit_convert_02", "domain": "units", "expected": "82", "response": "I need to convert 100\u00b0F to Celsius, subtract 10\u00b0C, then convert back to Fahrenheit.\n\nStep 1: Convert 100\u00b0F to Celsius\nC = (F - 32) \u00d7 5/9\nC = (100 - 32) \u00d7 5/9\nC = 68 \u00d7 5/9\nC = 340/9\nC = 37.78\u00b0C\n\nStep 2", "extracted": "100", "passed": true, "latency_ms": 2750.9498596191406}
+ {"id": "unit_volume_01", "domain": "units", "expected": "1875", "response": "1875", "extracted": "1875", "passed": true, "latency_ms": 1108.9894771575928}
+ {"id": "unit_volume_02", "domain": "units", "expected": "48", "response": "48", "extracted": "48", "passed": true, "latency_ms": 1425.4415035247803}
+ {"id": "unit_volume_03", "domain": "units", "expected": "7000", "response": "7000", "extracted": "7000", "passed": true, "latency_ms": 1113.5900020599365}
+ {"id": "unit_speed_01", "domain": "units", "expected": "3", "response": "3", "extracted": "3", "passed": true, "latency_ms": 1721.369981765747}
+ {"id": "unit_speed_02", "domain": "units", "expected": "80", "response": "80", "extracted": "80", "passed": true, "latency_ms": 954.6058177947998}
+ {"id": "schedule_01", "domain": "scheduling", "expected": "2:00 PM", "response": "I need to work through this step by step.\n\nStarting time: 9 AM\n\nTask A: Takes 2 hours, starts at 9 AM\n- Finishes at 11 AM\n\nTask B: Takes 3 hours, must start after A finishes\n- Starts at 11 AM\n- Finish", "extracted": "9", "passed": true, "latency_ms": 3201.542854309082}
+ {"id": "schedule_02", "domain": "scheduling", "expected": "95", "response": "I need to find the total time for processes X, Y, and Z that run sequentially.\n\nGiven information:\n- Process X takes 45 minutes\n- Process Y takes 30 minutes and needs X's output (so Y starts after X f", "extracted": "", "passed": true, "latency_ms": 2864.945888519287}
+ {"id": "schedule_03", "domain": "scheduling", "expected": "50", "response": "I need to add up the sequential steps:\n\n- Download: 10 minutes\n- Install: 15 minutes (after download)\n- Configuration: 5 minutes (after install)\n- Testing: 20 minutes (after config)\n\nSince each step h", "extracted": "10", "passed": true, "latency_ms": 2501.664638519287}
+ {"id": "schedule_04", "domain": "scheduling", "expected": "3:00 PM", "response": "Looking at both paths:\n\nPath 1: A(2h) + B(3h) = 5 hours total\nPath 2: C(4h) = 4 hours total\n\nThe limiting path is Path 1 at 5 hours.\n\nStarting at 10 AM + 5 hours = 3 PM", "extracted": "1", "passed": true, "latency_ms": 2242.5687313079834}
+ {"id": "schedule_05", "domain": "scheduling", "expected": "60", "response": "I need to calculate when each team finishes their tasks, since they work in parallel.\n\n**Team A:**\n- 3 tasks of 20 minutes each (sequential)\n- Total time = 3 \u00d7 20 = 60 minutes\n\n**Team B:**\n- 2 tasks o", "extracted": "", "passed": true, "latency_ms": 3033.62774848938}
+ {"id": "schedule_06", "domain": "scheduling", "expected": "2.4", "response": "I need to find how long it takes workers A and B to complete one job working together.\n\nFirst, let me find each worker's rate of work per hour.\n\nWorker A completes 1 job in 6 hours, so:\n- Worker A's r", "extracted": "", "passed": false, "latency_ms": 3168.7917709350586}
+ {"id": "schedule_07", "domain": "scheduling", "expected": "5", "response": "I need to find how long it takes for both printers working together to print 250 pages.\n\nFirst, let me find their combined printing rate:\n- Printer 1: 30 pages/min\n- Printer 2: 20 pages/min\n- Combined", "extracted": "250", "passed": true, "latency_ms": 2863.2404804229736}
+ {"id": "logic_order_01", "domain": "logic", "expected": "Alice, Eve, Bob, Carol, Dave", "response": "I need to work through the race finishing order step by step using the given constraints.\n\nGiven information:\n- Alice finishes before Bob\n- Carol finishes after Bob but before Dave\n- Eve finishes betw", "extracted": "I need to work through the race finishing order step by step using the given constraints.\n\nGiven information:\n- Alice finishes before Bob\n- Carol finishes after Bob but before Dave\n- Eve finishes between Alice and Bob\n\nLet me establish the relationships:\n\n1. Alice finishes before Bob: Alice \u2192 Bob\n2. Eve finishes between Alice and Bob: Alice \u2192 Eve \u2192 Bob\n3. Carol finishes after Bob but before Dave: Bob \u2192 Carol \u2192 Dave\n\nNow I can combine these relationships:\n- From constraints 1 and 2: Alice \u2192 Eve \u2192 Bob\n- From constraint 3: Bob \u2192 Carol \u2192 Dave\n\nCombining everything: Alice \u2192 Eve \u2192 Bob \u2192 Carol \u2192 Dave\n\nTherefore, the finish order from first to last is:\n\nAlice, Eve, Bob, Carol", "passed": true, "latency_ms": 3395.145893096924}
+ {"id": "logic_order_02", "domain": "logic", "expected": "Yellow, Red, Blue, Orange, Green", "response": "I need to work through these clues systematically to find the order of the five books.\n\nGiven clues:\n- Red is left of Blue\n- Green is right of Blue\n- Yellow is left of Red\n- Orange is between Blue and", "extracted": "I need to work through these clues systematically to find the order of the five books.\n\nGiven clues:\n- Red is left of Blue\n- Green is right of Blue\n- Yellow is left of Red\n- Orange is between Blue and Green\n\nLet me start by establishing relative positions:\n\nFrom \"Red is left of Blue\" and \"Yellow is left of Red\":\nYellow must be to the left of Red, which is to the left of Blue.\nSo: Yellow - Red - Blue (in that order, though not necessarily adjacent)\n\nFrom \"Green is right of Blue\":\nBlue - Green (Green is somewhere to the right of Blue)\n\nFrom \"Orange is between Blue and Green\":\nBlue - Orange - Green (Orange is directly between Blue and Green)\n\nNow I can combine these:\n- Yellow is left of Red\n- Red is left of Blue\n- Blue is followed by Orange\n- Orange is followed by Green\n\nThis", "passed": false, "latency_ms": 4458.382129669189}
+ {"id": "logic_modus_01", "domain": "logic", "expected": "yes", "response": "Yes.\n\nHere's the logical reasoning:\n1. It rained (given)\n2. If it rains, the ground is wet (given rule)\n3. Therefore, the ground is wet (from 1 and 2)\n4. If the ground is wet, the game is cancelled (g", "extracted": "yes", "passed": true, "latency_ms": 2189.753293991089}
+ {"id": "logic_modus_02", "domain": "logic", "expected": "yes", "response": "Yes.\n\nFollowing the logical chain:\n- The battery is dead (given)\n- If the battery is dead, the car won't start \u2192 the car won't start\n- If the car won't start, I'll be late \u2192 I'll be late\n- If I'm late", "extracted": "yes", "passed": true, "latency_ms": 1729.341745376587}
+ {"id": "logic_modus_03", "domain": "logic", "expected": "cannot determine", "response": "I need to work through this logical reasoning step by step.\n\nGiven premises:\n1. All programmers know logic\n2. All logicians are good at puzzles\n3. Sam is a programmer\n\nLet me trace the logic:\n\nFrom pr", "extracted": "1.", "passed": true, "latency_ms": 5384.079933166504}
+ {"id": "logic_sets_01", "domain": "logic", "expected": "45", "response": "45", "extracted": "45", "passed": true, "latency_ms": 956.916093826294}
+ {"id": "logic_sets_02", "domain": "logic", "expected": "5", "response": "I need to find how many people speak neither English nor Spanish.\n\nGiven information:\n- Total people: 50\n- Speak English: 35\n- Speak Spanish: 30\n- Speak both English and Spanish: 20\n\nUsing the princip", "extracted": "50", "passed": true, "latency_ms": 2464.90478515625}
+ {"id": "logic_sets_03", "domain": "logic", "expected": "10", "response": "I need to find how many people have no pets using the inclusion-exclusion principle.\n\nGiven information:\n- Total people: 100\n- Have dogs: 60\n- Have cats: 40\n- Have both dogs and cats: 15\n- Have fish o", "extracted": "100", "passed": true, "latency_ms": 4043.487071990967}
+ {"id": "spatial_direction_01", "domain": "spatial", "expected": "South", "response": "South", "extracted": "South", "passed": true, "latency_ms": 1132.0321559906006}
+ {"id": "spatial_direction_02", "domain": "spatial", "expected": "West", "response": "North", "extracted": "North", "passed": false, "latency_ms": 1210.8416557312012}
+ {"id": "spatial_direction_03", "domain": "spatial", "expected": "West", "response": "East", "extracted": "East", "passed": false, "latency_ms": 1163.621425628662}
+ {"id": "spatial_grid_01", "domain": "spatial", "expected": "(2,2)", "response": "I'll track my position step by step, starting at (0,0).\n\nStarting position: (0,0)\n\n1. Move right 3 steps: (0,0) \u2192 (3,0)\n2. Move up 2 steps: (3,0) \u2192 (3,2)\n3. Move left 1 step: (3,2) \u2192 (2,2)\n\nFinal posi", "extracted": "", "passed": true, "latency_ms": 2060.361623764038}
+ {"id": "spatial_grid_02", "domain": "spatial", "expected": "(7,3)", "response": "I'll track each movement step by step, starting at (5,5).\n\nStarting position: (5,5)\n\n1. Move left 2: (5,5) \u2192 (5-2,5) = (3,5)\n2. Move down 3: (3,5) \u2192 (3,5-3) = (3,2)\n3. Move right 4: (3,2) \u2192 (3+4,2) = ", "extracted": "", "passed": true, "latency_ms": 2648.094654083252}
+ {"id": "spatial_relative_01", "domain": "spatial", "expected": "Southeast", "response": "I need to work through the relative positions step by step.\n\nGiven information:\n- A is north of B\n- C is east of B \n- D is south of C\n\nLet me visualize this:\n\n1. Start with B as a reference point\n2. ", "extracted": "1.", "passed": true, "latency_ms": 4674.278020858765}
+ {"id": "spatial_relative_02", "domain": "spatial", "expected": "yes", "response": "I need to track the positions step by step, starting from the park.\n\nLet me place the park at position (0, 0) and track each location:\n\n1. Park: (0, 0)\n2. Library is 2 blocks east of the park: (2, 0)\n", "extracted": "yes", "passed": true, "latency_ms": 4008.5084438323975}
+ {"id": "procedural_state_01", "domain": "procedural", "expected": "Yellow", "response": "I need to track the traffic light through 4 changes, starting from Green.\n\nThe cycle is: Green \u2192 Yellow \u2192 Red \u2192 Green (then repeats)\n\nStarting position: Green\n\nAfter 1 change: Green \u2192 Yellow\nAfter 2 c", "extracted": "4", "passed": true, "latency_ms": 3090.118169784546}
+ {"id": "procedural_state_02", "domain": "procedural", "expected": "Locked", "response": "I'll trace through each step starting from the Locked state:\n\nStarting state: **Locked**\n\n1. **Unlock** (Locked \u2192 Closed)\n Current state: **Closed**\n\n2. **Open** (Closed \u2192 Open)\n Current state: **", "extracted": "1.", "passed": true, "latency_ms": 2783.231258392334}
+ {"id": "procedural_recipe_01", "domain": "procedural", "expected": "Added milk before steeping", "response": "If you do the steps in the order 1, 2, 5, 3, 4, the main problems are:\n\n1. **You're adding milk before steeping** - This means the tea bag will be steeping in milky water rather than plain hot water, ", "extracted": "1", "passed": false, "latency_ms": 5760.7011795043945}
+ {"id": "procedural_recipe_02", "domain": "procedural", "expected": "Passwor1", "response": "Passwor1", "extracted": "1", "passed": true, "latency_ms": 1344.8550701141357}
+ {"id": "procedural_undo_01", "domain": "procedural", "expected": "Hello World?", "response": "I'll trace through each action step by step:\n\n1. **Start**: \"Hello\"\n2. **Append ' World'**: \"Hello World\"\n3. **Append '!'**: \"Hello World!\"\n4. **Undo**: This undoes the last action (Append '!'), so we", "extracted": "1.", "passed": true, "latency_ms": 2674.466609954834}
+ {"id": "procedural_undo_02", "domain": "procedural", "expected": "empty", "response": "I'll trace through the stack operations step by step.\n\nStarting with an empty stack: []\n\n1. Push A: [A]\n2. Push B: [A, B]\n3. Pop: [A] (B is removed)\n4. Push C: [A, C]\n5. Pop: [A] (C is removed)\n6. Pop", "extracted": "1.", "passed": true, "latency_ms": 2418.6952114105225}
+ {"id": "text_string_01", "domain": "text", "expected": "LLEH", "response": "I'll work through this step by step:\n\n1. Start with the word 'HELLO'\n2. Reverse it: 'OLLEH'\n3. Remove the first letter (O): 'LLEH'\n\nThe result is 'LLEH'.", "extracted": "1.", "passed": true, "latency_ms": 1911.1328125}
+ {"id": "text_string_02", "domain": "text", "expected": "DCB", "response": "I'll work through this step by step.\n\nStarting with: ABCDE\n\nStep 1: Remove vowels\nThe vowels in 'ABCDE' are A and E\nAfter removing vowels: BCD\n\nStep 2: Reverse\nReversing 'BCD' gives: DCB\n\nThe result i", "extracted": "1", "passed": true, "latency_ms": 2774.3263244628906}
+ {"id": "text_string_03", "domain": "text", "expected": "PRGR", "response": "I need to find the consonants in 'PROGRAMMING' and then take the first 4.\n\nLet me go through each letter in 'PROGRAMMING':\n- P - consonant \u2713\n- R - consonant \u2713\n- O - vowel\n- G - consonant \u2713\n- R - conso", "extracted": "4.", "passed": true, "latency_ms": 3804.591178894043}
+ {"id": "text_word_01", "domain": "text", "expected": "fox", "response": "I need to reverse the word order of \"The quick brown fox\" and then identify the first word.\n\nOriginal sentence: \"The quick brown fox\"\nWords in order: The, quick, brown, fox\n\nReversed word order: fox, ", "extracted": "", "passed": true, "latency_ms": 2270.7276344299316}
+ {"id": "text_word_02", "domain": "text", "expected": "CDB", "response": "I'll replace each word with its first letter:\n\n- CAT \u2192 C\n- DOG \u2192 D \n- BIRD \u2192 B\n\nCombining them: **CDB**", "extracted": "I'll replace each word with its first letter:", "passed": true, "latency_ms": 1779.3736457824707}
+ {"id": "text_encode_01", "domain": "text", "expected": "BZS", "response": "I'll work through this step by step.\n\nStarting word: CAT\n\nStep 1: Shift each letter forward by 1\n- C \u2192 D\n- A \u2192 B \n- T \u2192 U\n\nAfter step 1: DBU\n\nStep 2: Shift each letter backward by 2\n- D \u2192 B (D goes b", "extracted": "1", "passed": true, "latency_ms": 3155.4336547851562}
+ {"id": "text_encode_02", "domain": "text", "expected": "HILLU", "response": "I need to replace each vowel in 'HELLO' with the next vowel in the sequence A\u2192E, E\u2192I, I\u2192O, O\u2192U, U\u2192A.\n\nLet me go through each letter in 'HELLO':\n- H: not a vowel, stays H\n- E: vowel, E\u2192I, so becomes I\n", "extracted": "", "passed": true, "latency_ms": 2766.1566734313965}
+ {"id": "sequence_letter_01", "domain": "sequence", "expected": "I", "response": "Looking at this pattern: A, C, E, G, _\n\nI can see that each letter is skipping one letter in the alphabet:\n- A to C (skipping B)\n- C to E (skipping D) \n- E to G (skipping F)\n\nFollowing this pattern, t", "extracted": "", "passed": true, "latency_ms": 2989.410161972046}
+ {"id": "sequence_letter_02", "domain": "sequence", "expected": "R", "response": "I need to find the pattern in this sequence: Z, X, V, T, _\n\nLet me look at the positions of these letters in the alphabet:\n- Z is the 26th letter\n- X is the 24th letter \n- V is the 22nd letter\n- T is", "extracted": "", "passed": true, "latency_ms": 3488.3639812469482}
+ {"id": "sequence_letter_03", "domain": "sequence", "expected": "P", "response": "I need to find the pattern in this sequence: A, B, D, G, K, _\n\nLet me look at the positions of these letters in the alphabet:\n- A = 1\n- B = 2 \n- D = 4\n- G = 7\n- K = 11\n\nNow let me examine the differe", "extracted": "", "passed": true, "latency_ms": 3271.898031234741}
+ {"id": "sequence_symbol_01", "domain": "sequence", "expected": "\u25cb", "response": "Looking at the pattern: \u25cf\u25cb\u25cf\u25cb\u25cf_\n\nI can see this follows an alternating pattern:\n- Position 1: \u25cf (filled)\n- Position 2: \u25cb (empty)\n- Position 3: \u25cf (filled)\n- Position 4: \u25cb (empty)\n- Position 5: \u25cf (filled", "extracted": "1", "passed": true, "latency_ms": 2849.2820262908936}
+ {"id": "sequence_symbol_02", "domain": "sequence", "expected": "\u25cb", "response": "Looking at the pattern: \u25cf\u25cf\u25cb\u25cf\u25cf\u25cb\u25cf\u25cf_\n\nI can see this follows a repeating sequence of: \u25cf\u25cf\u25cb\n\nLet me break it down:\n- \u25cf\u25cf\u25cb (positions 1-3)\n- \u25cf\u25cf\u25cb (positions 4-6) \n- \u25cf\u25cf (positions 7-8)\n\nSince we have \u25cf\u25cf in pos", "extracted": "1", "passed": true, "latency_ms": 3226.1710166931152}
+ {"id": "sequence_word_01", "domain": "sequence", "expected": "four", "response": "The word that fills the blank is **four**.\n\nThe pattern is simply counting in sequence: one, two, three, **four**, five.", "extracted": "", "passed": true, "latency_ms": 1823.0633735656738}
+ {"id": "sequence_word_02", "domain": "sequence", "expected": "September", "response": "Looking at this pattern: January, March, May, July, ___\n\nThese are all odd-numbered months:\n- January = 1st month\n- March = 3rd month \n- May = 5th month\n- July = 7th month\n\nFollowing this pattern, th", "extracted": "", "passed": true, "latency_ms": 2357.6316833496094}
+ {"id": "causal_chain_01", "domain": "causal", "expected": "The power went out", "response": "The root cause of everyone getting sick was the power going out.\n\nFollowing the chain of causation you described:\n- Power went out \u2192 fridge stopped \u2192 food spoiled \u2192 everyone got sick\n\nThe power outage", "extracted": "", "passed": false, "latency_ms": 2305.238962173462}
+ {"id": "causal_chain_02", "domain": "causal", "expected": "He is late", "response": "Let me work through this step by step using the given logical chain:\n\nGiven facts:\n- If the alarm doesn't ring \u2192 Tom oversleeps\n- If Tom oversleeps \u2192 he misses the bus \n- If he misses the bus \u2192 he's ", "extracted": "1.", "passed": false, "latency_ms": 3007.6956748962402}
+ {"id": "causal_counter_01", "domain": "causal", "expected": "no", "response": "No.\n\nThe statement establishes that the plant died specifically because it wasn't watered. This implies that watering was necessary for the plant's survival. Therefore, if the plant had been watered, ", "extracted": "no", "passed": true, "latency_ms": 2249.9849796295166}
+ {"id": "causal_counter_02", "domain": "causal", "expected": "no", "response": "No, if the dial worked, the cake would not have burned.\n\nHere's the logical chain:\n- The cake burned because the oven was too hot\n- The oven was too hot because the dial was broken\n- If the dial worke", "extracted": "no", "passed": true, "latency_ms": 2118.8220977783203}
+ {"id": "causal_necessary_01", "domain": "causal", "expected": "not necessarily", "response": "Not necessarily.\n\nWhile water is necessary for plant growth, it's not the only requirement. Plants also need:\n- Adequate light (sunlight or artificial)\n- Proper nutrients from soil\n- Appropriate tempe", "extracted": "", "passed": true, "latency_ms": 3185.826301574707}
+ {"id": "causal_necessary_02", "domain": "causal", "expected": "no", "response": "No.\n\nThe car needs both fuel AND a working battery to start. Since the battery is dead, even though there is fuel, the car will not start.", "extracted": "no", "passed": true, "latency_ms": 1871.96946144104}
+ {"id": "causal_necessary_03", "domain": "causal", "expected": "yes", "response": "Yes.", "extracted": "yes", "passed": true, "latency_ms": 942.8300857543945}
results/claude-sonnet-4-20250514_20260103_184954_summary.json ADDED
@@ -0,0 +1,71 @@
+ {
+   "total": 101,
+   "passed": 84,
+   "failed": 17,
+   "pass_rate": 0.8316831683168316,
+   "by_domain": {
+     "math_discount": {
+       "passed": 15,
+       "total": 15,
+       "pass_rate": 1.0
+     },
+     "time": {
+       "passed": 11,
+       "total": 13,
+       "pass_rate": 0.8461538461538461
+     },
+     "recipe": {
+       "passed": 7,
+       "total": 7,
+       "pass_rate": 1.0
+     },
+     "financial": {
+       "passed": 3,
+       "total": 10,
+       "pass_rate": 0.3
+     },
+     "units": {
+       "passed": 6,
+       "total": 7,
+       "pass_rate": 0.8571428571428571
+     },
+     "scheduling": {
+       "passed": 6,
+       "total": 7,
+       "pass_rate": 0.8571428571428571
+     },
+     "logic": {
+       "passed": 7,
+       "total": 8,
+       "pass_rate": 0.875
+     },
+     "spatial": {
+       "passed": 5,
+       "total": 7,
+       "pass_rate": 0.7142857142857143
+     },
+     "procedural": {
+       "passed": 5,
+       "total": 6,
+       "pass_rate": 0.8333333333333334
+     },
+     "text": {
+       "passed": 7,
+       "total": 7,
+       "pass_rate": 1.0
+     },
+     "sequence": {
+       "passed": 7,
+       "total": 7,
+       "pass_rate": 1.0
+     },
+     "causal": {
+       "passed": 5,
+       "total": 7,
+       "pass_rate": 0.7142857142857143
+     }
+   },
+   "avg_latency_ms": 2045.5509129137097,
+   "model": "claude-sonnet-4-20250514",
+   "timestamp": "20260103_184954"
+ }
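The summary fields above (`total`, `passed`, `failed`, `pass_rate`, `by_domain`, `avg_latency_ms`) read like plain aggregates over the per-item records in the matching `_results.jsonl` file. A minimal sketch of that aggregation follows, assuming this is how the summary was produced; the repository's actual script is not shown in this commit, and the names `summarize` and `load_jsonl` as well as the three sample records are hypothetical.

```python
import json
from collections import defaultdict


def load_jsonl(path):
    """Read one JSON object per non-empty line (hypothetical helper)."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def summarize(records):
    """Aggregate per-item records into a summary dict (assumed layout)."""
    by_domain = defaultdict(lambda: {"passed": 0, "total": 0})
    for rec in records:
        bucket = by_domain[rec["domain"]]
        bucket["total"] += 1
        bucket["passed"] += int(rec["passed"])

    # Each domain has at least one record, so the division is safe.
    for bucket in by_domain.values():
        bucket["pass_rate"] = bucket["passed"] / bucket["total"]

    total = len(records)
    passed = sum(int(rec["passed"]) for rec in records)
    latency_sum = sum(rec["latency_ms"] for rec in records)
    return {
        "total": total,
        "passed": passed,
        "failed": total - passed,
        "pass_rate": passed / total if total else 0.0,
        "by_domain": dict(by_domain),
        "avg_latency_ms": latency_sum / total if total else 0.0,
    }


# Illustrative records only: same field names as the results files,
# made-up latency values.
sample = [
    {"id": "a", "domain": "logic", "passed": True, "latency_ms": 100.0},
    {"id": "b", "domain": "spatial", "passed": True, "latency_ms": 200.0},
    {"id": "c", "domain": "spatial", "passed": False, "latency_ms": 300.0},
]
summary = summarize(sample)
```

Running `summarize(load_jsonl(path))` on a results file and comparing the output with the committed summary would be one way to check that the two files agree.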
results/gpt-4o-mini_20260103_184617_results.jsonl ADDED
@@ -0,0 +1,101 @@
+ {"id": "math_discount_01", "domain": "math_discount", "expected": "15", "response": "16", "extracted": "16", "passed": false, "latency_ms": 353.2829284667969}
+ {"id": "math_discount_02", "domain": "math_discount", "expected": "37", "response": "37", "extracted": "37", "passed": true, "latency_ms": 368.3919906616211}
+ {"id": "math_discount_03", "domain": "math_discount", "expected": "50", "response": "$55.00", "extracted": "55.00", "passed": false, "latency_ms": 415.9402847290039}
+ {"id": "math_discount_04", "domain": "math_discount", "expected": "90", "response": "84", "extracted": "84", "passed": false, "latency_ms": 440.11902809143066}
+ {"id": "math_discount_05", "domain": "math_discount", "expected": "115", "response": "125", "extracted": "125", "passed": false, "latency_ms": 403.51104736328125}
+ {"id": "math_discount_06", "domain": "math_discount", "expected": "53", "response": "53", "extracted": "53", "passed": true, "latency_ms": 425.0950813293457}
+ {"id": "math_discount_07", "domain": "math_discount", "expected": "70", "response": "$90", "extracted": "90", "passed": false, "latency_ms": 643.2604789733887}
+ {"id": "math_discount_tax_01", "domain": "math_discount", "expected": "88", "response": "88.0", "extracted": "88.0", "passed": true, "latency_ms": 456.78114891052246}
+ {"id": "math_discount_tax_02", "domain": "math_discount", "expected": "229.5", "response": "$231.00", "extracted": "231.00", "passed": false, "latency_ms": 433.59923362731934}
+ {"id": "math_discount_tax_03", "domain": "math_discount", "expected": "63", "response": "$63.00", "extracted": "63.00", "passed": true, "latency_ms": 449.0923881530762}
+ {"id": "math_discount_tax_04", "domain": "math_discount", "expected": "481.5", "response": "Final price: 500.00", "extracted": "500.00", "passed": false, "latency_ms": 508.26239585876465}
+ {"id": "math_discount_tax_05", "domain": "math_discount", "expected": "135.68", "response": "$169.60", "extracted": "169.60", "passed": false, "latency_ms": 403.0435085296631}
+ {"id": "math_bogo_01", "domain": "math_discount", "expected": "60", "response": "60", "extracted": "60", "passed": true, "latency_ms": 334.78307723999023}
+ {"id": "math_bogo_02", "domain": "math_discount", "expected": "43.75", "response": "43.75", "extracted": "43.75", "passed": true, "latency_ms": 487.77198791503906}
+ {"id": "math_bogo_03", "domain": "math_discount", "expected": "96", "response": "$96", "extracted": "96", "passed": true, "latency_ms": 363.8746738433838}
+ {"id": "time_duration_01", "domain": "time", "expected": "4:45 PM", "response": "4:15 PM", "extracted": "4:15 PM", "passed": true, "latency_ms": 422.21951484680176}
+ {"id": "time_duration_02", "domain": "time", "expected": "11:50 AM", "response": "11:50 AM", "extracted": "11:50 AM", "passed": true, "latency_ms": 397.20797538757324}
+ {"id": "time_duration_03", "domain": "time", "expected": "12:15 PM", "response": "11:15 AM", "extracted": "11:15 AM", "passed": false, "latency_ms": 402.74596214294434}
+ {"id": "time_duration_04", "domain": "time", "expected": "5:20 PM", "response": "5:00 PM", "extracted": "5:00 PM", "passed": true, "latency_ms": 432.36637115478516}
+ {"id": "time_duration_05", "domain": "time", "expected": "1:30 PM", "response": "12:30 PM", "extracted": "12:30 PM", "passed": false, "latency_ms": 560.6443881988525}
+ {"id": "time_duration_06", "domain": "time", "expected": "12:30 PM", "response": "12:30 PM", "extracted": "12:30 PM", "passed": true, "latency_ms": 438.97271156311035}
+ {"id": "time_duration_07", "domain": "time", "expected": "9:15 PM", "response": "9:15 PM", "extracted": "9:15 PM", "passed": true, "latency_ms": 416.1064624786377}
+ {"id": "time_travel_01", "domain": "time", "expected": "11:50 AM", "response": "11:50 AM", "extracted": "11:50 AM", "passed": true, "latency_ms": 378.5521984100342}
+ {"id": "time_travel_02", "domain": "time", "expected": "3:40 PM", "response": "3:40 PM", "extracted": "3:40 PM", "passed": true, "latency_ms": 418.66469383239746}
+ {"id": "time_travel_03", "domain": "time", "expected": "10:00 AM", "response": "10:00 AM", "extracted": "10:00 AM", "passed": true, "latency_ms": 444.2873001098633}
+ {"id": "time_travel_04", "domain": "time", "expected": "5:00 PM", "response": "5:00 PM", "extracted": "5:00 PM", "passed": true, "latency_ms": 456.10809326171875}
+ {"id": "time_travel_05", "domain": "time", "expected": "9:45 AM", "response": "9:45 AM", "extracted": "9:45 AM", "passed": true, "latency_ms": 399.5320796966553}
+ {"id": "time_multi_01", "domain": "time", "expected": "10:20 AM", "response": "9:20 AM", "extracted": "9:20 AM", "passed": false, "latency_ms": 1283.4525108337402}
+ {"id": "recipe_scale_01", "domain": "recipe", "expected": "6", "response": "12", "extracted": "12", "passed": false, "latency_ms": 464.3080234527588}
+ {"id": "recipe_scale_02", "domain": "recipe", "expected": "2.25", "response": "2.", "extracted": "2.", "passed": false, "latency_ms": 396.33917808532715}
+ {"id": "recipe_scale_03", "domain": "recipe", "expected": "6", "response": "6.0", "extracted": "6.0", "passed": true, "latency_ms": 398.6220359802246}
+ {"id": "recipe_scale_04", "domain": "recipe", "expected": "3", "response": "6", "extracted": "6", "passed": false, "latency_ms": 383.0828666687012}
+ {"id": "recipe_scale_05", "domain": "recipe", "expected": "6", "response": "6.0", "extracted": "6.0", "passed": true, "latency_ms": 418.92313957214355}
+ {"id": "recipe_convert_01", "domain": "recipe", "expected": "360", "response": "360", "extracted": "360", "passed": true, "latency_ms": 780.9550762176514}
+ {"id": "recipe_convert_02", "domain": "recipe", "expected": "3.3", "response": "3.3", "extracted": "3.3", "passed": true, "latency_ms": 410.5088710784912}
+ {"id": "financial_compound_01", "domain": "financial", "expected": "1168", "response": "$1,056.00", "extracted": "1056.00", "passed": false, "latency_ms": 513.2331848144531}
+ {"id": "financial_compound_02", "domain": "financial", "expected": "5669.91", "response": "$5,250.00", "extracted": "5250.00", "passed": false, "latency_ms": 603.4715175628662}
+ {"id": "financial_compound_03", "domain": "financial", "expected": "2249.6", "response": "$2,904.00", "extracted": "2904.00", "passed": false, "latency_ms": 416.1546230316162}
+ {"id": "financial_compound_04", "domain": "financial", "expected": "614.48", "response": "$610.00", "extracted": "610.00", "passed": false, "latency_ms": 438.7550354003906}
+ {"id": "financial_markup_01", "domain": "financial", "expected": "562.5", "response": "$462.50", "extracted": "462.50", "passed": false, "latency_ms": 414.88003730773926}
+ {"id": "financial_markup_02", "domain": "financial", "expected": "240", "response": "$120", "extracted": "120", "passed": false, "latency_ms": 365.8781051635742}
+ {"id": "financial_markup_03", "domain": "financial", "expected": "816", "response": "680", "extracted": "680", "passed": false, "latency_ms": 381.06608390808105}
+ {"id": "financial_markup_04", "domain": "financial", "expected": "157.5", "response": "$120.00", "extracted": "120.00", "passed": false, "latency_ms": 394.6950435638428}
+ {"id": "financial_markup_05", "domain": "financial", "expected": "1170", "response": "$970.00", "extracted": "970.00", "passed": false, "latency_ms": 407.6099395751953}
+ {"id": "financial_commission_01", "domain": "financial", "expected": "900", "response": "$1,200", "extracted": "1200", "passed": false, "latency_ms": 541.8832302093506}
+ {"id": "unit_convert_01", "domain": "units", "expected": "13.125", "response": "15.5", "extracted": "15.5", "passed": false, "latency_ms": 613.4727001190186}
+ {"id": "unit_convert_02", "domain": "units", "expected": "82", "response": "70\u00b0F", "extracted": "70", "passed": false, "latency_ms": 377.61640548706055}
+ {"id": "unit_volume_01", "domain": "units", "expected": "1875", "response": "1875", "extracted": "1875", "passed": true, "latency_ms": 432.9349994659424}
+ {"id": "unit_volume_02", "domain": "units", "expected": "48", "response": "42", "extracted": "42", "passed": false, "latency_ms": 439.7459030151367}
+ {"id": "unit_volume_03", "domain": "units", "expected": "7000", "response": "7,500", "extracted": "7500", "passed": false, "latency_ms": 438.8389587402344}
+ {"id": "unit_speed_01", "domain": "units", "expected": "3", "response": "3.0", "extracted": "3.0", "passed": true, "latency_ms": 441.6847229003906}
+ {"id": "unit_speed_02", "domain": "units", "expected": "80", "response": "100", "extracted": "100", "passed": false, "latency_ms": 422.84560203552246}
+ {"id": "schedule_01", "domain": "scheduling", "expected": "2:00 PM", "response": "Task C finishes at 12 PM.", "extracted": "12", "passed": false, "latency_ms": 490.2148246765137}
+ {"id": "schedule_02", "domain": "scheduling", "expected": "95", "response": "To find the total time from start to finish for processes X, Y, and Z, we need to consider the sequential nature of these processes.\n\n1. **Process X** takes 45 minutes.\n2. **Process Y** takes 30 minut", "extracted": "", "passed": true, "latency_ms": 2994.168281555176}
+ {"id": "schedule_03", "domain": "scheduling", "expected": "50", "response": "To find the total time, we simply add up the time for each step:\n\n- Download: 10 minutes\n- Install: 15 minutes\n- Configuration: 5 minutes\n- Testing: 20 minutes\n\nTotal time = 10 + 15 + 5 + 20 = 60 minu", "extracted": "", "passed": false, "latency_ms": 1646.0909843444824}
+ {"id": "schedule_04", "domain": "scheduling", "expected": "3:00 PM", "response": "1 PM", "extracted": "1", "passed": false, "latency_ms": 413.5298728942871}
+ {"id": "schedule_05", "domain": "scheduling", "expected": "60", "response": "To determine when both teams finish, we need to calculate the total time each team takes to complete their tasks.\n\n**Team A:**\n- 3 tasks of 20 minutes each\n- Total time for Team A = 3 tasks \u00d7 20 minut", "extracted": "", "passed": true, "latency_ms": 3534.430980682373}
+ {"id": "schedule_06", "domain": "scheduling", "expected": "2.4", "response": "To find out how long it takes for Worker A and Worker B to complete a job together, we first determine their individual work rates.\n\n- Worker A's work rate is \\( \\frac{1}{6} \\) of the job per hour (si", "extracted": "", "passed": false, "latency_ms": 4289.335012435913}
+ {"id": "schedule_07", "domain": "scheduling", "expected": "5", "response": "To find out how long it takes for both printers to print 250 pages together, we first need to determine their combined printing rate.\n\nThe first printer prints at a rate of 30 pages per minute, and th", "extracted": "250", "passed": true, "latency_ms": 5124.680757522583}
+ {"id": "logic_order_01", "domain": "logic", "expected": "Alice, Eve, Bob, Carol, Dave", "response": "The finish order from first to last is: Alice, Eve, Bob, Carol, Dave.", "extracted": "The finish order from first to last is: Alice, Eve, Bob, Carol, Dave.", "passed": true, "latency_ms": 855.7789325714111}
+ {"id": "logic_order_02", "domain": "logic", "expected": "Yellow, Red, Blue, Orange, Green", "response": "To determine the order of the books from left to right based on the given clues, let's analyze each statement:\n\n1. **Red is left of Blue.** (This means Red comes before Blue.)\n2. **Green is right of B", "extracted": "To determine the order of the books from left to right based on the given clues, let's analyze each statement:\n\n1. **Red is left of Blue.** (This means Red comes before Blue.)\n2. **Green is right of Blue.** (This means Green comes after Blue.)\n3. **Yellow is left of Red.** (This means Yellow comes before Red.)\n4. **Orange is between Blue and Green.** (This means Orange is positioned directly between Blue and Green.)\n\nNow, let's put these clues together step by step:\n\n- From clue 1, we know: Red < Blue\n- From clue 2, we know: Blue < Green\n- From clue 3", "passed": false, "latency_ms": 3717.663049697876}
+ {"id": "logic_modus_01", "domain": "logic", "expected": "yes", "response": "Yes.", "extracted": "yes", "passed": true, "latency_ms": 413.86866569519043}
+ {"id": "logic_modus_02", "domain": "logic", "expected": "yes", "response": "Yes.", "extracted": "yes", "passed": true, "latency_ms": 531.6195487976074}
+ {"id": "logic_modus_03", "domain": "logic", "expected": "cannot determine", "response": "Cannot determine. \n\nWhile all programmers know logic and all logicians are good at puzzles, the information provided does not specify whether programmers, including Sam, are also logicians. Therefore,", "extracted": "", "passed": true, "latency_ms": 1566.9023990631104}
+ {"id": "logic_sets_01", "domain": "logic", "expected": "45", "response": "45", "extracted": "45", "passed": true, "latency_ms": 384.3083381652832}
+ {"id": "logic_sets_02", "domain": "logic", "expected": "5", "response": "5", "extracted": "5", "passed": true, "latency_ms": 372.5147247314453}
+ {"id": "logic_sets_03", "domain": "logic", "expected": "10", "response": "20", "extracted": "20", "passed": false, "latency_ms": 400.8209705352783}
+ {"id": "spatial_direction_01", "domain": "spatial", "expected": "South", "response": "East.", "extracted": "East.", "passed": false, "latency_ms": 388.44871520996094}
+ {"id": "spatial_direction_02", "domain": "spatial", "expected": "West", "response": "North.", "extracted": "North.", "passed": false, "latency_ms": 418.49493980407715}
+ {"id": "spatial_direction_03", "domain": "spatial", "expected": "West", "response": "East", "extracted": "East", "passed": false, "latency_ms": 420.49193382263184}
+ {"id": "spatial_grid_01", "domain": "spatial", "expected": "(2,2)", "response": "Starting at position (0,0):\n\n1. Move right 3 steps: (0,0) \u2192 (3,0)\n2. Move up 2 steps: (3,0) \u2192 (3,2)\n3. Move left 1 step: (3,2) \u2192 (2,2)\n\nThe final position is (2,2).", "extracted": "00", "passed": true, "latency_ms": 1665.888786315918}
+ {"id": "spatial_grid_02", "domain": "spatial", "expected": "(7,3)", "response": "Let's break down the movements step by step starting from the initial position (5, 5):\n\n1. Move left 2: \n - New position: (5 - 2, 5) = (3, 5)\n\n2. Move down 3: \n - New position: (3, 5 - 3) = (3, 2)", "extracted": "5", "passed": false, "latency_ms": 3060.5833530426025}
+ {"id": "spatial_relative_01", "domain": "spatial", "expected": "Southeast", "response": "D is southwest of A.", "extracted": "D is southwest of A.", "passed": false, "latency_ms": 528.719425201416}
+ {"id": "spatial_relative_02", "domain": "spatial", "expected": "yes", "response": "Yes.", "extracted": "yes", "passed": true, "latency_ms": 397.49693870544434}
+ {"id": "procedural_state_01", "domain": "procedural", "expected": "Yellow", "response": "The traffic light cycles through three colors: Green \u2192 Yellow \u2192 Red. Starting from Green, we can track the changes as follows:\n\n1. Green (current)\n2. Yellow (1st change)\n3. Red (2nd change)\n4. Green (", "extracted": "", "passed": true, "latency_ms": 2484.0002059936523}
+ {"id": "procedural_state_02", "domain": "procedural", "expected": "Locked", "response": "Let's break down the sequence of actions step by step, starting from the initial state of the door, which is Locked.\n\n1. **Start**: Locked\n2. **Unlock**: The door transitions from Locked to Closed.\n3.", "extracted": "", "passed": true, "latency_ms": 2349.727153778076}
+ {"id": "procedural_recipe_01", "domain": "procedural", "expected": "Added milk before steeping", "response": "If you do the steps in the order of 1, 2, 5, 3, 4, the issue is that you are adding milk before steeping the tea. Adding milk before the tea has steeped can affect the extraction of flavors from the t", "extracted": "1", "passed": false, "latency_ms": 2153.0861854553223}
+ {"id": "procedural_recipe_02", "domain": "procedural", "expected": "Passwor1", "response": "Passwor1", "extracted": "1", "passed": true, "latency_ms": 394.18530464172363}
+ {"id": "procedural_undo_01", "domain": "procedural", "expected": "Hello World?", "response": "Let's go through the actions step by step:\n\n1. Start with \"Hello\".\n2. Append \" World\": The text becomes \"Hello World\".\n3. Append \"!\": The text becomes \"Hello World!\".\n4. Undo: This action reverts the ", "extracted": "1.", "passed": true, "latency_ms": 1806.8759441375732}
+ {"id": "procedural_undo_02", "domain": "procedural", "expected": "empty", "response": "Let's go through the stack operations step by step:\n\n1. Start empty: `[]`\n2. Push A: `['A']`\n3. Push B: `['A', 'B']`\n4. Pop: `['A']` (B is removed)\n5. Push C: `['A', 'C']`\n6. Pop: `['A']` (C is remove", "extracted": "1.", "passed": true, "latency_ms": 2269.1214084625244}
+ {"id": "text_string_01", "domain": "text", "expected": "LLEH", "response": "Reversing the word 'HELLO' gives 'OLLEH'. Removing the first letter results in 'LLEH'.", "extracted": "Reversing the word 'HELLO' gives 'OLLEH'. Removing", "passed": true, "latency_ms": 946.0744857788086}
+ {"id": "text_string_02", "domain": "text", "expected": "DCB", "response": "Starting with 'ABCDE', if we remove the vowels (A and E), we are left with 'BCD'. Reversing 'BCD' gives us 'DCB'.", "extracted": "", "passed": true, "latency_ms": 828.5079002380371}
+ {"id": "text_string_03", "domain": "text", "expected": "PRGR", "response": "The consonants in the word 'PROGRAMMING' are P, R, G, R, M, M, N, G. Taking the first 4 letters gives us P, R, G, R. So the result is **PRGR**.", "extracted": "", "passed": true, "latency_ms": 1423.1388568878174}
+ {"id": "text_word_01", "domain": "text", "expected": "fox", "response": "Reversing the word order of the sentence \"The quick brown fox\" gives us \"fox brown quick The\". The first word of this reversed order is \"fox\".", "extracted": "Reversing the word order of the sentence \"The quic", "passed": true, "latency_ms": 941.2515163421631}
+ {"id": "text_word_02", "domain": "text", "expected": "CDB", "response": "The first letters of the words \"CAT,\" \"DOG,\" and \"BIRD\" are C, D, and B, respectively. When combined, the result is \"CDB.\"", "extracted": "", "passed": true, "latency_ms": 967.2601222991943}
+ {"id": "text_encode_01", "domain": "text", "expected": "BZS", "response": "Let's break down the problem step by step.\n\n1. **Shift each letter in 'CAT' forward by 1:**\n - C \u2192 D\n - A \u2192 B\n - T \u2192 U\n\n So, 'CAT' becomes 'DBU'.\n\n2. **Now, shift each letter in 'DBU' backward", "extracted": "1.", "passed": true, "latency_ms": 2390.4123306274414}
+ {"id": "text_encode_02", "domain": "text", "expected": "HILLU", "response": "To replace each vowel in 'HELLO' with the next vowel according to the given pattern:\n\n- H (not a vowel)\n- E \u2192 I\n- L (not a vowel)\n- L (not a vowel)\n- O \u2192 U\n\nSo, 'HELLO' becomes 'HILLU'.", "extracted": "", "passed": true, "latency_ms": 1459.7249031066895}
+ {"id": "sequence_letter_01", "domain": "sequence", "expected": "I", "response": "The pattern consists of letters that are every second letter in the alphabet: A (1), C (3), E (5), G (7). Following this pattern, the next letter would be I (9). So, the next letter is I.", "extracted": "1", "passed": true, "latency_ms": 1494.0543174743652}
+ {"id": "sequence_letter_02", "domain": "sequence", "expected": "R", "response": "The pattern alternates by skipping one letter backward in the alphabet. \n\n- Z (skip Y) goes to X\n- X (skip W) goes to V\n- V (skip U) goes to T\n\nContinuing this pattern, after T (skip S), the next lett", "extracted": "", "passed": true, "latency_ms": 1359.5664501190186}
+ {"id": "sequence_letter_03", "domain": "sequence", "expected": "P", "response": "To find the next letter in the pattern A, B, D, G, K, we can look at the positions of the letters in the alphabet:\n\n- A is the 1st letter\n- B is the 2nd letter\n- D is the 4th letter\n- G is the 7th let", "extracted": "", "passed": true, "latency_ms": 5254.715442657471}
+ {"id": "sequence_symbol_01", "domain": "sequence", "expected": "\u25cb", "response": "The pattern alternates between \u25cf and \u25cb. Following the sequence \u25cf\u25cb\u25cf\u25cb\u25cf, the next symbol should be \u25cb.", "extracted": "", "passed": true, "latency_ms": 975.4207134246826}
+ {"id": "sequence_symbol_02", "domain": "sequence", "expected": "\u25cb", "response": "The pattern you provided is: \u25cf\u25cf\u25cb\u25cf\u25cf\u25cb\u25cf\u25cf_. \n\nIf we break it down, we can see that the pattern consists of two filled circles (\u25cf) followed by one empty circle (\u25cb). This sequence repeats: \n\n1. \u25cf\u25cf\u25cb\n2. \u25cf\u25cf\u25cb\n3", "extracted": "", "passed": true, "latency_ms": 1839.8103713989258}
+ {"id": "sequence_word_01", "domain": "sequence", "expected": "four", "response": "The word that fills the blank is \"four.\" The pattern is a sequence of numbers written in word form.", "extracted": "The word that fills the blank is \"four.\" The patte", "passed": true, "latency_ms": 768.2960033416748}
+ {"id": "sequence_word_02", "domain": "sequence", "expected": "September", "response": "The pattern consists of the months that have 31 days. After July, the next month with 31 days is August. Therefore, the next month in the pattern is August.", "extracted": "31", "passed": false, "latency_ms": 1361.1364364624023}
+ {"id": "causal_chain_01", "domain": "causal", "expected": "The power went out", "response": "The root cause of everyone getting sick was the power outage, which led to the fridge stopping and subsequently caused the food to spoil.", "extracted": "", "passed": false, "latency_ms": 723.3519554138184}
+ {"id": "causal_chain_02", "domain": "causal", "expected": "He is late", "response": "If the alarm didn't ring, then Tom oversleeps. Since he oversleeps, he misses the bus. As a result, because he misses the bus, he is late for work. Therefore, Tom is late for work.", "extracted": "", "passed": true, "latency_ms": 1284.857988357544}
+ {"id": "causal_counter_01", "domain": "causal", "expected": "no", "response": "No.", "extracted": "no", "passed": true, "latency_ms": 417.39487648010254}
+ {"id": "causal_counter_02", "domain": "causal", "expected": "no", "response": "If the dial had worked properly, it is likely that the oven would not have been too hot, and therefore the cake would not have burned. So, if the dial had functioned correctly, the cake would not have", "extracted": "no", "passed": true, "latency_ms": 1049.7331619262695}
+ {"id": "causal_necessary_01", "domain": "causal", "expected": "not necessarily", "response": "Not necessarily. While water is essential for plant growth, other factors such as light, nutrients, soil quality, and temperature also play crucial roles in a plant's ability to grow.", "extracted": "", "passed": true, "latency_ms": 1655.8122634887695}
+ {"id": "causal_necessary_02", "domain": "causal", "expected": "no", "response": "No.", "extracted": "no", "passed": true, "latency_ms": 396.47412300109863}
+ {"id": "causal_necessary_03", "domain": "causal", "expected": "yes", "response": "Yes.", "extracted": "yes", "passed": true, "latency_ms": 430.1483631134033}
results/gpt-4o-mini_20260103_184617_summary.json ADDED
@@ -0,0 +1,71 @@
+ {
+   "total": 101,
+   "passed": 58,
+   "failed": 43,
+   "pass_rate": 0.5742574257425742,
+   "by_domain": {
+     "math_discount": {
+       "passed": 7,
+       "total": 15,
+       "pass_rate": 0.4666666666666667
+     },
+     "time": {
+       "passed": 10,
+       "total": 13,
+       "pass_rate": 0.7692307692307693
+     },
+     "recipe": {
+       "passed": 4,
+       "total": 7,
+       "pass_rate": 0.5714285714285714
+     },
+     "financial": {
+       "passed": 0,
+       "total": 10,
+       "pass_rate": 0.0
+     },
+     "units": {
+       "passed": 2,
+       "total": 7,
+       "pass_rate": 0.2857142857142857
+     },
+     "scheduling": {
+       "passed": 3,
+       "total": 7,
+       "pass_rate": 0.42857142857142855
+     },
+     "logic": {
+       "passed": 6,
+       "total": 8,
+       "pass_rate": 0.75
+     },
+     "spatial": {
+       "passed": 2,
+       "total": 7,
+       "pass_rate": 0.2857142857142857
+     },
+     "procedural": {
+       "passed": 5,
+       "total": 6,
+       "pass_rate": 0.8333333333333334
+     },
+     "text": {
+       "passed": 7,
+       "total": 7,
+       "pass_rate": 1.0
+     },
+     "sequence": {
+       "passed": 6,
+       "total": 7,
+       "pass_rate": 0.8571428571428571
+     },
+     "causal": {
+       "passed": 6,
+       "total": 7,
+       "pass_rate": 0.8571428571428571
+     }
+   },
+   "avg_latency_ms": 959.1620298895506,
+   "model": "gpt-4o-mini",
+   "timestamp": "20260103_184617"
+ }
results/gpt-4o_20260103_184426_results.jsonl ADDED
@@ -0,0 +1,101 @@
+ {"id": "math_discount_01", "domain": "math_discount", "expected": "15", "response": "15", "extracted": "15", "passed": true, "latency_ms": 1855.0798892974854}
+ {"id": "math_discount_02", "domain": "math_discount", "expected": "37", "response": "37", "extracted": "37", "passed": true, "latency_ms": 620.812177658081}
+ {"id": "math_discount_03", "domain": "math_discount", "expected": "50", "response": "50", "extracted": "50", "passed": true, "latency_ms": 654.8104286193848}
+ {"id": "math_discount_04", "domain": "math_discount", "expected": "90", "response": "90", "extracted": "90", "passed": true, "latency_ms": 397.7174758911133}
+ {"id": "math_discount_05", "domain": "math_discount", "expected": "115", "response": "115", "extracted": "115", "passed": true, "latency_ms": 612.2045516967773}
+ {"id": "math_discount_06", "domain": "math_discount", "expected": "53", "response": "53", "extracted": "53", "passed": true, "latency_ms": 342.04649925231934}
+ {"id": "math_discount_07", "domain": "math_discount", "expected": "70", "response": "70", "extracted": "70", "passed": true, "latency_ms": 541.3846969604492}
+ {"id": "math_discount_tax_01", "domain": "math_discount", "expected": "88", "response": "88", "extracted": "88", "passed": true, "latency_ms": 603.419303894043}
+ {"id": "math_discount_tax_02", "domain": "math_discount", "expected": "229.5", "response": "229.40", "extracted": "229.40", "passed": true, "latency_ms": 627.4769306182861}
+ {"id": "math_discount_tax_03", "domain": "math_discount", "expected": "63", "response": "63", "extracted": "63", "passed": true, "latency_ms": 358.6757183074951}
+ {"id": "math_discount_tax_04", "domain": "math_discount", "expected": "481.5", "response": "481.15", "extracted": "481.15", "passed": true, "latency_ms": 922.0921993255615}
+ {"id": "math_discount_tax_05", "domain": "math_discount", "expected": "135.68", "response": "135.52", "extracted": "135.52", "passed": true, "latency_ms": 652.8406143188477}
+ {"id": "math_bogo_01", "domain": "math_discount", "expected": "60", "response": "60", "extracted": "60", "passed": true, "latency_ms": 365.278959274292}
+ {"id": "math_bogo_02", "domain": "math_discount", "expected": "43.75", "response": "43.75", "extracted": "43.75", "passed": true, "latency_ms": 385.67137718200684}
+ {"id": "math_bogo_03", "domain": "math_discount", "expected": "96", "response": "96", "extracted": "96", "passed": true, "latency_ms": 357.47313499450684}
+ {"id": "time_duration_01", "domain": "time", "expected": "4:45 PM", "response": "4:45 PM", "extracted": "4:45 PM", "passed": true, "latency_ms": 493.6494827270508}
+ {"id": "time_duration_02", "domain": "time", "expected": "11:50 AM", "response": "11:50 AM", "extracted": "11:50 AM", "passed": true, "latency_ms": 379.3666362762451}
+ {"id": "time_duration_03", "domain": "time", "expected": "12:15 PM", "response": "12:15 PM", "extracted": "12:15 PM", "passed": true, "latency_ms": 415.59529304504395}
+ {"id": "time_duration_04", "domain": "time", "expected": "5:20 PM", "response": "5:20 PM", "extracted": "5:20 PM", "passed": true, "latency_ms": 380.90062141418457}
+ {"id": "time_duration_05", "domain": "time", "expected": "1:30 PM", "response": "12:30 PM", "extracted": "12:30 PM", "passed": false, "latency_ms": 424.68976974487305}
+ {"id": "time_duration_06", "domain": "time", "expected": "12:30 PM", "response": "12:30 PM", "extracted": "12:30 PM", "passed": true, "latency_ms": 607.5694561004639}
+ {"id": "time_duration_07", "domain": "time", "expected": "9:15 PM", "response": "9:15 PM", "extracted": "9:15 PM", "passed": true, "latency_ms": 372.3134994506836}
+ {"id": "time_travel_01", "domain": "time", "expected": "11:50 AM", "response": "11:50 AM", "extracted": "11:50 AM", "passed": true, "latency_ms": 419.3422794342041}
+ {"id": "time_travel_02", "domain": "time", "expected": "3:40 PM", "response": "3:40 PM", "extracted": "3:40 PM", "passed": true, "latency_ms": 420.00746726989746}
+ {"id": "time_travel_03", "domain": "time", "expected": "10:00 AM", "response": "10:00 AM", "extracted": "10:00 AM", "passed": true, "latency_ms": 412.45269775390625}
+ {"id": "time_travel_04", "domain": "time", "expected": "5:00 PM", "response": "5:00 PM", "extracted": "5:00 PM", "passed": true, "latency_ms": 415.54760932922363}
+ {"id": "time_travel_05", "domain": "time", "expected": "9:45 AM", "response": "9:45 AM", "extracted": "9:45 AM", "passed": true, "latency_ms": 392.5604820251465}
+ {"id": "time_multi_01", "domain": "time", "expected": "10:20 AM", "response": "10:20 AM", "extracted": "10:20 AM", "passed": true, "latency_ms": 396.5001106262207}
+ {"id": "recipe_scale_01", "domain": "recipe", "expected": "6", "response": "6", "extracted": "6", "passed": true, "latency_ms": 474.4579792022705}
+ {"id": "recipe_scale_02", "domain": "recipe", "expected": "2.25", "response": "2", "extracted": "2", "passed": false, "latency_ms": 409.20400619506836}
+ {"id": "recipe_scale_03", "domain": "recipe", "expected": "6", "response": "6", "extracted": "6", "passed": true, "latency_ms": 497.1766471862793}
+ {"id": "recipe_scale_04", "domain": "recipe", "expected": "3", "response": "3", "extracted": "3", "passed": true, "latency_ms": 349.67041015625}
+ {"id": "recipe_scale_05", "domain": "recipe", "expected": "6", "response": "6", "extracted": "6", "passed": true, "latency_ms": 416.6903495788574}
+ {"id": "recipe_convert_01", "domain": "recipe", "expected": "360", "response": "360", "extracted": "360", "passed": true, "latency_ms": 392.77100563049316}
+ {"id": "recipe_convert_02", "domain": "recipe", "expected": "3.3", "response": "3.3", "extracted": "3.3", "passed": true, "latency_ms": 390.0177478790283}
+ {"id": "financial_compound_01", "domain": "financial", "expected": "1168", "response": "1184", "extracted": "1184", "passed": false, "latency_ms": 333.3697319030762}
+ {"id": "financial_compound_02", "domain": "financial", "expected": "5669.91", "response": "$5613.06", "extracted": "5613.06", "passed": false, "latency_ms": 418.92218589782715}
+ {"id": "financial_compound_03", "domain": "financial", "expected": "2249.6", "response": "2330", "extracted": "2330", "passed": false, "latency_ms": 582.0214748382568}
+ {"id": "financial_compound_04", "domain": "financial", "expected": "614.48", "response": "$582.40", "extracted": "582.40", "passed": false, "latency_ms": 524.5602130889893}
+ {"id": "financial_markup_01", "domain": "financial", "expected": "562.5", "response": "562.5", "extracted": "562.5", "passed": true, "latency_ms": 391.32022857666016}
+ {"id": "financial_markup_02", "domain": "financial", "expected": "240", "response": "240", "extracted": "240", "passed": true, "latency_ms": 337.8558158874512}
+ {"id": "financial_markup_03", "domain": "financial", "expected": "816", "response": "816", "extracted": "816", "passed": true, "latency_ms": 376.35278701782227}
+ {"id": "financial_markup_04", "domain": "financial", "expected": "157.5", "response": "135", "extracted": "135", "passed": false, "latency_ms": 327.76737213134766}
+ {"id": "financial_markup_05", "domain": "financial", "expected": "1170", "response": "990", "extracted": "990", "passed": false, "latency_ms": 367.6929473876953}
+ {"id": "financial_commission_01", "domain": "financial", "expected": "900", "response": "1100", "extracted": "1100", "passed": false, "latency_ms": 430.7210445404053}
+ {"id": "unit_convert_01", "domain": "units", "expected": "13.125", "response": "11.875", "extracted": "11.875", "passed": false, "latency_ms": 379.7035217285156}
+ {"id": "unit_convert_02", "domain": "units", "expected": "82", "response": "68", "extracted": "68", "passed": false, "latency_ms": 534.221887588501}
+ {"id": "unit_volume_01", "domain": "units", "expected": "1875", "response": "1875", "extracted": "1875", "passed": true, "latency_ms": 409.3296527862549}
+ {"id": "unit_volume_02", "domain": "units", "expected": "48", "response": "48", "extracted": "48", "passed": true, "latency_ms": 351.49240493774414}
+ {"id": "unit_volume_03", "domain": "units", "expected": "7000", "response": "7,000", "extracted": "7000", "passed": true, "latency_ms": 383.90636444091797}
+ {"id": "unit_speed_01", "domain": "units", "expected": "3", "response": "3.5", "extracted": "3.5", "passed": true, "latency_ms": 343.25337409973145}
+ {"id": "unit_speed_02", "domain": "units", "expected": "80", "response": "80", "extracted": "80", "passed": true, "latency_ms": 400.4807472229004}
+ {"id": "schedule_01", "domain": "scheduling", "expected": "2:00 PM", "response": "1 PM", "extracted": "1", "passed": false, "latency_ms": 490.4158115386963}
+ {"id": "schedule_02", "domain": "scheduling", "expected": "95", "response": "To determine the total time from start to finish for these processes, we need to consider the dependencies between them:\n\n1. Process X takes 45 minutes.\n2. Process Y takes 30 minutes and can only star", "extracted": "", "passed": true, "latency_ms": 1442.1210289001465}
+ {"id": "schedule_03", "domain": "scheduling", "expected": "50", "response": "The total time is the sum of all the individual tasks:\n\n- Download: 10 minutes\n- Install: 15 minutes\n- Configuration: 5 minutes\n- Testing: 20 minutes\n\nAdding these together gives:\n\n10 + 15 + 5 + 20 = ", "extracted": "10", "passed": true, "latency_ms": 755.6910514831543}
+ {"id": "schedule_04", "domain": "scheduling", "expected": "3:00 PM", "response": "3 PM", "extracted": "3", "passed": true, "latency_ms": 484.2681884765625}
+ {"id": "schedule_05", "domain": "scheduling", "expected": "60", "response": "To determine when both teams finish their tasks, we need to calculate the total time each team takes to complete their tasks.\n\n**Team A:**\n- Team A has 3 tasks, each taking 20 minutes.\n- Total time fo", "extracted": "", "passed": true, "latency_ms": 1635.4496479034424}
+ {"id": "schedule_06", "domain": "scheduling", "expected": "2.4", "response": "To find out how long it takes for Worker A and Worker B to complete the job together, we first determine their work rates and then combine them.\n\nWorker A's rate is \\(\\frac{1}{6}\\) of the job per hour", "extracted": "", "passed": false, "latency_ms": 1636.098861694336}
+ {"id": "schedule_07", "domain": "scheduling", "expected": "5", "response": "To determine how long it will take for both printers to print a total of 250 pages together, we first need to calculate their combined printing rate.\n\nThe first printer prints at a rate of 30 pages pe", "extracted": "250", "passed": true, "latency_ms": 1734.919786453247}
+ {"id": "logic_order_01", "domain": "logic", "expected": "Alice, Eve, Bob, Carol, Dave", "response": "The finish order from first to last is: Alice, Eve, Bob, Carol, Dave.", "extracted": "The finish order from first to last is: Alice, Eve, Bob, Carol, Dave.", "passed": true, "latency_ms": 479.9826145172119}
+ {"id": "logic_order_02", "domain": "logic", "expected": "Yellow, Red, Blue, Orange, Green", "response": "To determine the order of the books from left to right, let's analyze the given conditions:\n\n1. Red is left of Blue.\n2. Green is right of Blue.\n3. Yellow is left of Red.\n4. Orange is between Blue and ", "extracted": "To determine the order of the books from left to right, let's analyze the given conditions:\n\n1. Red is left of Blue.\n2. Green is right of Blue.\n3. Yellow is left of Red.\n4. Orange is between Blue and Green.\n\nLet's start placing the books based on these conditions:\n\n- Since Yellow is left of Red, Yellow must be placed before Red.\n- Red is left of Blue, so Red must be placed before Blue.\n- Orange is between Blue and Green, so the order must be Blue", "passed": true, "latency_ms": 2219.191312789917}
+ {"id": "logic_modus_01", "domain": "logic", "expected": "yes", "response": "Yes.", "extracted": "yes", "passed": true, "latency_ms": 440.3417110443115}
+ {"id": "logic_modus_02", "domain": "logic", "expected": "yes", "response": "Yes.", "extracted": "yes", "passed": true, "latency_ms": 380.2187442779541}
+ {"id": "logic_modus_03", "domain": "logic", "expected": "cannot determine", "response": "Cannot determine.\n\nThe statements provided are:\n\n1. All programmers know logic.\n2. All logicians are good at puzzles.\n3. Sam is a programmer.\n\nFrom these statements, we can conclude that Sam knows log", "extracted": "1.", "passed": true, "latency_ms": 1435.1723194122314}
+ {"id": "logic_sets_01", "domain": "logic", "expected": "45", "response": "45", "extracted": "45", "passed": true, "latency_ms": 473.30260276794434}
+ {"id": "logic_sets_02", "domain": "logic", "expected": "5", "response": "5", "extracted": "5", "passed": true, "latency_ms": 453.53102684020996}
+ {"id": "logic_sets_03", "domain": "logic", "expected": "10", "response": "20", "extracted": "20", "passed": false, "latency_ms": 342.69070625305176}
+ {"id": "spatial_direction_01", "domain": "spatial", "expected": "South", "response": "South.", "extracted": "South.", "passed": true, "latency_ms": 375.45251846313477}
+ {"id": "spatial_direction_02", "domain": "spatial", "expected": "West", "response": "West", "extracted": "West", "passed": true, "latency_ms": 817.2118663787842}
+ {"id": "spatial_direction_03", "domain": "spatial", "expected": "West", "response": "West", "extracted": "West", "passed": true, "latency_ms": 451.50089263916016}
+ {"id": "spatial_grid_01", "domain": "spatial", "expected": "(2,2)", "response": "Starting at position (0,0):\n\n1. Move right 3 steps: (0,0) \u2192 (3,0)\n2. Move up 2 steps: (3,0) \u2192 (3,2)\n3. Move left 1 step: (3,2) \u2192 (2,2)\n\nThe final position is (2,2).", "extracted": "00", "passed": true, "latency_ms": 690.1655197143555}
+ {"id": "spatial_grid_02", "domain": "spatial", "expected": "(7,3)", "response": "Starting at (5,5):\n\n1. Move left 2: (5 - 2, 5) = (3, 5)\n2. Move down 3: (3, 5 - 3) = (3, 2)\n3. Move right 4: (3 + 4, 2) = (7, 2)\n4. Move up 1: (7, 2 + 1) = (7, 3)\n\nThe final position is (7, 3).", "extracted": "55", "passed": false, "latency_ms": 745.7365989685059}
+ {"id": "spatial_relative_01", "domain": "spatial", "expected": "Southeast", "response": "D is southeast of A.", "extracted": "D is southeast of A.", "passed": true, "latency_ms": 986.9036674499512}
+ {"id": "spatial_relative_02", "domain": "spatial", "expected": "yes", "response": "Yes.", "extracted": "yes", "passed": true, "latency_ms": 394.49572563171387}
+ {"id": "procedural_state_01", "domain": "procedural", "expected": "Yellow", "response": "The traffic light follows the cycle: Green \u2192 Yellow \u2192 Red \u2192 Green. Starting from Green, let's track the changes:\n\n1. First change: Green \u2192 Yellow\n2. Second change: Yellow \u2192 Red\n3. Third change: Red \u2192 ", "extracted": "", "passed": true, "latency_ms": 881.066083908081}
+ {"id": "procedural_state_02", "domain": "procedural", "expected": "Locked", "response": "To determine the final state of the door, let's follow the sequence of actions starting from the Locked state:\n\n1. **Unlock**: The door goes from Locked to Closed.\n2. **Open**: The door goes from Clos", "extracted": "", "passed": true, "latency_ms": 1183.4142208099365}
+ {"id": "procedural_recipe_01", "domain": "procedural", "expected": "Added milk before steeping", "response": "If you follow the steps in the order 1, 2, 5, 3, 4, there are a couple of issues with the process:\n\n1. **Adding Milk Before Steeping**: By adding milk immediately after the tea bag, before steeping, y", "extracted": "1", "passed": false, "latency_ms": 2387.098550796509}
+ {"id": "procedural_recipe_02", "domain": "procedural", "expected": "Passwor1", "response": "Password1", "extracted": "1", "passed": true, "latency_ms": 394.27638053894043}
+ {"id": "procedural_undo_01", "domain": "procedural", "expected": "Hello World?", "response": "The initial text in the editor is \"Hello\". Let's go through the actions step by step:\n\n1. **Append ' World'**: The text becomes \"Hello World\".\n2. **Append '!'**: The text becomes \"Hello World!\".\n3. **", "extracted": "1.", "passed": true, "latency_ms": 959.4566822052002}
+ {"id": "procedural_undo_02", "domain": "procedural", "expected": "empty", "response": "The stack operations proceed as follows:\n\n1. Start empty: Stack = []\n2. Push A: Stack = [A]\n3. Push B: Stack = [A, B]\n4. Pop: Stack = [A] (B is removed)\n5. Push C: Stack = [A, C]\n6. Pop: Stack = [A] (", "extracted": "1.", "passed": true, "latency_ms": 1408.557653427124}
+ {"id": "text_string_01", "domain": "text", "expected": "LLEH", "response": "To solve this, first reverse the word \"HELLO\" to get \"OLLEH\". Then, remove the first letter of this reversed word, which is \"O\". This leaves you with \"LLEH\".", "extracted": "", "passed": true, "latency_ms": 721.3387489318848}
+ {"id": "text_string_02", "domain": "text", "expected": "DCB", "response": "Starting with \"ABCDE\", we first remove the vowels \"A\" and \"E\", leaving us with \"BCD\". Then, we reverse \"BCD\" to get \"DCB\". Therefore, the result is \"DCB\".", "extracted": "", "passed": true, "latency_ms": 960.1335525512695}
+ {"id": "text_string_03", "domain": "text", "expected": "PRGR", "response": "The word \"PROGRAMMING\" consists of the following letters: P, R, O, G, R, A, M, M, I, N, G. The vowels in this word are O, A, and I. Removing these vowels, we are left with the consonants: P, R, G, R, ", "extracted": "", "passed": true, "latency_ms": 1229.5448780059814}
+ {"id": "text_word_01", "domain": "text", "expected": "fox", "response": "The sentence 'The quick brown fox' reversed in word order is 'fox brown quick The'. The first word in this reversed order is 'fox'.", "extracted": "The sentence 'The quick brown fox' reversed in wor", "passed": true, "latency_ms": 585.4918956756592}
+ {"id": "text_word_02", "domain": "text", "expected": "CDB", "response": "The result is \"CDB\".", "extracted": "The result is \"CDB\".", "passed": true, "latency_ms": 397.45545387268066}
+ {"id": "text_encode_01", "domain": "text", "expected": "BZS", "response": "To solve this, we need to perform two operations on each letter of the word \"CAT\".\n\n1. **Shift each letter in 'CAT' forward by 1:**\n - C \u2192 D\n - A \u2192 B\n - T \u2192 U\n\n After shifting forward, the wor", "extracted": "", "passed": true, "latency_ms": 1406.911849975586}
+ {"id": "text_encode_02", "domain": "text", "expected": "HILLU", "response": "To replace each vowel in \"HELLO\" with the next vowel in the sequence A\u2192E, E\u2192I, I\u2192O, O\u2192U, U\u2192A, we perform the following substitutions:\n\n- 'E' is replaced by 'I'\n- 'O' is replaced by 'U'\n\nThe consonants", "extracted": "", "passed": true, "latency_ms": 1659.1601371765137}
+ {"id": "sequence_letter_01", "domain": "sequence", "expected": "I", "response": "The pattern follows a sequence of letters with a step of two in the alphabet: A (1), C (3), E (5), G (7). Each letter corresponds to an odd number, increasing by 2 each time. The next odd number after", "extracted": "1", "passed": true, "latency_ms": 1006.1733722686768}
+ {"id": "sequence_letter_02", "domain": "sequence", "expected": "R", "response": "The pattern involves skipping one letter backward in the alphabet. Here's the breakdown:\n\n- Z (skip Y)\n- X (skip W)\n- V (skip U)\n- T (skip S)\n\nFollowing this pattern, after T, you skip S and the next ", "extracted": "", "passed": true, "latency_ms": 1073.8365650177002}
+ {"id": "sequence_letter_03", "domain": "sequence", "expected": "P", "response": "The pattern in the sequence is based on the positions of the letters in the alphabet. Let's examine the sequence:\n\n- A is the 1st letter of the alphabet.\n- B is the 2nd letter.\n- D is the 4th letter.\n", "extracted": "1", "passed": true, "latency_ms": 1790.4012203216553}
+ {"id": "sequence_symbol_01", "domain": "sequence", "expected": "\u25cb", "response": "The given pattern alternates between \u25cf and \u25cb: \u25cf\u25cb\u25cf\u25cb\u25cf. Following this alternating sequence, the next symbol should be \u25cb.", "extracted": "", "passed": true, "latency_ms": 2369.978666305542}
+ {"id": "sequence_symbol_02", "domain": "sequence", "expected": "\u25cb", "response": "The given pattern is \u25cf\u25cf\u25cb\u25cf\u25cf\u25cb\u25cf\u25cf_. Observing the sequence, we can see that it repeats every three symbols: \u25cf\u25cf\u25cb. \n\nBreaking it down:\n1. \u25cf\u25cf\u25cb\n2. \u25cf\u25cf\u25cb\n3. \u25cf\u25cf\n\nThe next symbol in the sequence should follow the ", "extracted": "", "passed": true, "latency_ms": 1162.7376079559326}
+ {"id": "sequence_word_01", "domain": "sequence", "expected": "four", "response": "The word that fills the blank is \"four.\" The pattern is a sequence of counting numbers.", "extracted": "The word that fills the blank is \"four.\" The patte", "passed": true, "latency_ms": 533.9739322662354}
+ {"id": "sequence_word_02", "domain": "sequence", "expected": "September", "response": "The pattern consists of months that are two months apart: January, March, May, July. Following this pattern, the next month would be September.", "extracted": "", "passed": true, "latency_ms": 555.8648109436035}
+ {"id": "causal_chain_01", "domain": "causal", "expected": "The power went out", "response": "The root cause of everyone getting sick was the power going out. This initial event led to the fridge stopping, which in turn caused the food to spoil, ultimately resulting in everyone getting sick fr", "extracted": "", "passed": false, "latency_ms": 3198.387861251831}
+ {"id": "causal_chain_02", "domain": "causal", "expected": "He is late", "response": "Based on the given chain of events, if the alarm doesn't ring, Tom oversleeps. Since the alarm didn't ring, Tom oversleeps. If Tom oversleeps, he misses the bus. Therefore, Tom misses the bus. If he m", "extracted": "", "passed": false, "latency_ms": 811.7859363555908}
+ {"id": "causal_counter_01", "domain": "causal", "expected": "no", "response": "The answer is \"unknown.\" While the plant died due to lack of water, we cannot be certain that watering it would have guaranteed its survival. Other factors, such as disease, poor soil conditions, or e", "extracted": "no", "passed": true, "latency_ms": 1328.5939693450928}
+ {"id": "causal_counter_02", "domain": "causal", "expected": "no", "response": "If the dial had been working properly, it would have allowed you to set the oven to the correct temperature. Assuming all other factors remained constant and the cake recipe was followed correctly, th", "extracted": "no", "passed": true, "latency_ms": 1479.551076889038}
+ {"id": "causal_necessary_01", "domain": "causal", "expected": "not necessarily", "response": "Not necessarily. While water is essential for plant growth, other factors such as sunlight, soil quality, temperature, and nutrients are also crucial. A plant may not grow if any of these other condit", "extracted": "", "passed": true, "latency_ms": 797.9154586791992}
+ {"id": "causal_necessary_02", "domain": "causal", "expected": "no", "response": "No.", "extracted": "no", "passed": true, "latency_ms": 408.13302993774414}
+ {"id": "causal_necessary_03", "domain": "causal", "expected": "yes", "response": "Yes.", "extracted": "yes", "passed": true, "latency_ms": 388.98277282714844}
results/gpt-4o_20260103_184426_summary.json ADDED
@@ -0,0 +1,71 @@
+ {
+   "total": 101,
+   "passed": 83,
+   "failed": 18,
+   "pass_rate": 0.8217821782178217,
+   "by_domain": {
+     "math_discount": {
+       "passed": 15,
+       "total": 15,
+       "pass_rate": 1.0
+     },
+     "time": {
+       "passed": 12,
+       "total": 13,
+       "pass_rate": 0.9230769230769231
+     },
+     "recipe": {
+       "passed": 6,
+       "total": 7,
+       "pass_rate": 0.8571428571428571
+     },
+     "financial": {
+       "passed": 3,
+       "total": 10,
+       "pass_rate": 0.3
+     },
+     "units": {
+       "passed": 5,
+       "total": 7,
+       "pass_rate": 0.7142857142857143
+     },
+     "scheduling": {
+       "passed": 5,
+       "total": 7,
+       "pass_rate": 0.7142857142857143
+     },
+     "logic": {
+       "passed": 7,
+       "total": 8,
+       "pass_rate": 0.875
+     },
+     "spatial": {
+       "passed": 6,
+       "total": 7,
+       "pass_rate": 0.8571428571428571
+     },
+     "procedural": {
+       "passed": 5,
+       "total": 6,
+       "pass_rate": 0.8333333333333334
+     },
+     "text": {
+       "passed": 7,
+       "total": 7,
+       "pass_rate": 1.0
+     },
+     "sequence": {
+       "passed": 7,
+       "total": 7,
+       "pass_rate": 1.0
+     },
+     "causal": {
+       "passed": 5,
+       "total": 7,
+       "pass_rate": 0.7142857142857143
+     }
+   },
+   "avg_latency_ms": 738.5695429131537,
+   "model": "gpt-4o",
+   "timestamp": "20260103_184426"
+ }
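
The counts in a `*_summary.json` file appear to be plain aggregates of the `passed` flags in the matching `*_results.jsonl` file. The sketch below shows one way such a summary could be recomputed for cross-checking. It assumes only the field names visible in the records above (`domain`, `passed`, `latency_ms`); the helper names `load_jsonl` and `summarize` are hypothetical and are not taken from the benchmark's own code. The `model` and `timestamp` fields are left out because they are not present in the per-record data.

```python
import json
from collections import defaultdict


def load_jsonl(path):
    """Read one JSON object per non-empty line (hypothetical helper)."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def summarize(records):
    """Aggregate per-record results into a summary-shaped dict.

    Assumes each record has "domain", "passed" and "latency_ms" keys,
    as in the results files shown above.
    """
    by_domain = defaultdict(lambda: {"passed": 0, "total": 0})
    for rec in records:
        stats = by_domain[rec["domain"]]
        stats["total"] += 1
        stats["passed"] += int(bool(rec["passed"]))
    for stats in by_domain.values():
        stats["pass_rate"] = stats["passed"] / stats["total"]

    total = len(records)
    passed = sum(1 for rec in records if rec["passed"])
    latency_sum = sum(rec["latency_ms"] for rec in records)
    return {
        "total": total,
        "passed": passed,
        "failed": total - passed,
        "pass_rate": passed / total if total else 0.0,
        "by_domain": dict(by_domain),
        "avg_latency_ms": latency_sum / total if total else 0.0,
    }
```

Running `summarize(load_jsonl(path))` on a results file and comparing `passed`, `failed`, and the per-domain counts against the committed summary is a quick consistency check. It verifies only the aggregation, not whether each `passed` flag was assigned correctly.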