v1.1: Financial domain audit - confirms Goodhart Gap hypothesis

Browse files

Files changed (14) hide show

README.md +235 -0
evaluate.py +471 -0
generate_dataset.py +1040 -0
requirements.txt +1 -0
results/claude-3-5-haiku-latest_20260103_182323_results.jsonl +5 -0
results/claude-3-5-haiku-latest_20260103_182323_summary.json +16 -0
results/claude-3-5-haiku-latest_20260103_184241_results.jsonl +101 -0
results/claude-3-5-haiku-latest_20260103_184241_summary.json +71 -0
results/claude-sonnet-4-20250514_20260103_184954_results.jsonl +101 -0
results/claude-sonnet-4-20250514_20260103_184954_summary.json +71 -0
results/gpt-4o-mini_20260103_184617_results.jsonl +101 -0
results/gpt-4o-mini_20260103_184617_summary.json +71 -0
results/gpt-4o_20260103_184426_results.jsonl +101 -0
results/gpt-4o_20260103_184426_summary.json +71 -0

README.md ADDED Viewed

	@@ -0,0 +1,235 @@

+---
+license: mit
+task_categories:
+  - question-answering
+  - text-generation
+language:
+  - en
+tags:
+  - benchmark
+  - reasoning
+  - multi-step
+  - evaluation
+  - llm-evaluation
+  - goodhart
+  - execution-vs-understanding
+size_categories:
+  - n<1K
+---
+# Goodhart Gap Benchmark
+**Detecting the gap between understanding and execution in language models**
+## Overview
+The Goodhart Gap Benchmark tests whether language models can correctly *execute* multi-step reasoning tasks that they can correctly *explain*. Named after Goodhart's Law ("When a measure becomes a target, it ceases to be a good measure"), this benchmark reveals a critical failure mode: models that understand procedures but fail to execute them.
+## Key Finding
+In our testing of 15+ models:
+- **gpt-4o**: 57% pass rate (fails on financial, scheduling, units)
+- **gpt-4o-mini**: 36% pass rate
+- **Claude 3.5 Haiku**: 93% pass rate
+- **Llama 3.1 70B**: Fails the canonical discount calculation despite correct explanation
+## The Canonical Example
+**Problem**: "If a shirt costs $25 and is on 20% sale, and you have a $5 coupon, what do you pay?"
+**Correct answer**: $15 (apply 20% discount first: $25 × 0.8 = $20, then subtract coupon: $20 - $5 = $15)
+When we first ask models to *explain* the procedure, they all correctly state: "First apply the discount, then subtract the coupon."
+When we then ask for the answer, many models fail—giving answers like $16, $17, $22.50, or even $175.
+## Dataset Statistics
+| Metric | Value |
+|--------|-------|
+| Total problems | 101 |
+| Domains | 12 |
+| Difficulty levels | 3 (easy, medium, hard) |
+| Steps per problem | 2-6 |
+### Problems by Domain
+**Numerical Domains (67 problems)**
+| Domain | Count | Description |
+|--------|-------|-------------|
+| math_discount | 15 | Discounts, coupons, taxes, markups |
+| time | 13 | Duration arithmetic, travel times |
+| financial | 10 | Interest, taxes, commissions |
+| logic | 8 | Ordering, deduction, set operations |
+| recipe | 7 | Scaling, unit conversion |
+| scheduling | 7 | Task dependencies, work rates |
+| units | 7 | Unit conversion with operations |
+**Non-Numerical Domains (34 problems)**
+| Domain | Count | Description |
+|--------|-------|-------------|
+| spatial | 7 | Direction tracking, grid navigation, relative positions |
+| procedural | 6 | State machines, undo/redo, procedure following |
+| text | 7 | String manipulation, encoding, word operations |
+| sequence | 7 | Pattern recognition (letters, symbols, words) |
+| causal | 7 | Cause-effect chains, counterfactuals, necessary/sufficient |
+### Difficulty Distribution
+| Difficulty | Count | Description |
+|------------|-------|-------------|
+| Easy | 28 | 2 steps, straightforward |
+| Medium | 32 | 2-3 steps, some complexity |
+| Hard | 7 | 3-4 steps, multiple operations |
+## Data Format
+Each problem is a JSON object with the following fields:
+```json
+{
+  "id": "math_discount_01",
+  "domain": "math_discount",
+  "problem": "A product costs $25 and is on 20% sale. You also have a $5 coupon. What do you pay? Answer with just the number.",
+  "correct_answer": "15",
+  "explanation": "25 × 0.8 = 20.0, then 20.0 - 5 = 15.0",
+  "understanding_check": "To solve this, first apply the 20% discount, then subtract the coupon. What are the two steps?",
+  "difficulty": "easy",
+  "steps": 2
+}
+```
+### Field Descriptions
+| Field | Description |
+|-------|-------------|
+| `id` | Unique identifier (domain_type_number) |
+| `domain` | Category of reasoning required |
+| `problem` | The question posed to the model |
+| `correct_answer` | Expected answer (numeric or text) |
+| `explanation` | Step-by-step solution |
+| `understanding_check` | Prompt to verify model understands the procedure |
+| `difficulty` | easy, medium, or hard |
+| `steps` | Number of sequential operations required |
+## Usage
+### Quick Evaluation
+```bash
+# Install requirements
+pip install requests
+# Evaluate OpenAI model
+python evaluate.py --provider openai --model gpt-4o -v
+# Evaluate Claude model
+python evaluate.py --provider anthropic --model claude-3-5-haiku-latest -v
+# Evaluate local Ollama model
+python evaluate.py --provider ollama --model llama3.1:8b -v
+```
+### Python API
+```python
+import json
+# Load dataset
+problems = []
+with open('data/test.jsonl') as f:
+    for line in f:
+        problems.append(json.loads(line))
+# Test your model
+for problem in problems:
+    response = your_model.generate(problem['problem'])
+    expected = problem['correct_answer']
+    # Validate response against expected
+```
+### With HuggingFace Datasets
+```python
+from datasets import load_dataset
+dataset = load_dataset("your-username/goodhart-gap-benchmark")
+for example in dataset['test']:
+    print(example['problem'])
+    print(f"Expected: {example['correct_answer']}")
+```
+## Evaluation Criteria
+A response is considered correct if:
+1. **Numeric answers**: The expected number appears in the response (with tolerance for rounding)
+2. **Time answers**: The expected time appears in any reasonable format (e.g., "4:45 PM", "4:45pm", "16:45")
+3. **Yes/no answers**: The response clearly indicates yes, no, or "cannot determine"
+4. **Ordering answers**: Items appear in the correct sequence
+## Leaderboard
+| Model | Provider | Pass Rate | Weakest Domain |
+|-------|----------|-----------|----------------|
+| Claude 3.5 Haiku | Anthropic | 93% | logic |
+| Claude Sonnet 4 | Anthropic | 79% | financial, scheduling |
+| gpt-4o | OpenAI | 57% | scheduling |
+| gpt-4o-mini | OpenAI | 36% | most domains |
+| Qwen 2.5 72B | Alibaba | TBD | - |
+| Llama 3.1 70B | Meta | TBD | - |
+*Submit your results via PR to add to the leaderboard*
+## Why This Matters
+### For AI Safety
+Models that can explain correct procedures but execute them incorrectly are:
+- Harder to detect through explanation-based evaluation
+- More dangerous in agentic settings
+- A gap between capability benchmarks and deployment readiness
+### For Model Selection
+Not all models are equal for multi-step reasoning:
+- Model family matters more than size
+- Distilled models often lose this capability
+- Test execution, not just explanation
+### For Training
+The gap appears to be a training problem:
+- Well-trained models (Claude Haiku) outperform larger models
+- Suggests targeted fine-tuning could help
+## Citation
+```bibtex
+@dataset{goodhart_gap_benchmark_2026,
+  title={Goodhart Gap Benchmark: Detecting the Gap Between Understanding and Execution in LLMs},
+  author={Adam Kruger},
+  year={2026},
+  url={https://huggingface.co/datasets/Adam1010/goodhart-gap-benchmark}
+}
+```
+## License
+MIT License - free for research and commercial use.
+## Contributing
+We welcome contributions:
+- New test cases in underrepresented domains
+- Results from additional models
+- Improved validators
+- Translations to other languages
+Submit issues and PRs at: [GitHub Repository URL]
+## Acknowledgments
+Research inspired by:
+- Goodhart's Law and its application to AI evaluation
+- Work on multi-step reasoning in LLMs
+- The distinction between System 1 and System 2 thinking

evaluate.py ADDED Viewed

	@@ -0,0 +1,471 @@

+#!/usr/bin/env python3
+"""
+Goodhart Gap Benchmark Evaluation Script
+Evaluate any model on the Goodhart Gap benchmark to detect the gap
+between understanding and execution in multi-step reasoning.
+Usage:
+    # Using OpenAI API
+    python evaluate.py --provider openai --model gpt-4o
+    # Using Anthropic API
+    python evaluate.py --provider anthropic --model claude-3-5-haiku-latest
+    # Using local Ollama
+    python evaluate.py --provider ollama --model llama3.1:8b
+    # Using HuggingFace transformers
+    python evaluate.py --provider huggingface --model meta-llama/Llama-3.1-8B-Instruct
+    # Custom API endpoint
+    python evaluate.py --provider custom --model mymodel --api-url http://localhost:8000/v1
+Environment Variables:
+    OPENAI_API_KEY - Required for OpenAI provider
+    ANTHROPIC_API_KEY - Required for Anthropic provider
+    HF_TOKEN - Optional for gated HuggingFace models
+"""
+import argparse
+import json
+import os
+import re
+import sys
+from dataclasses import dataclass
+from datetime import datetime
+from pathlib import Path
+from typing import Optional, Callable
+import time
+# Optional imports
+try:
+    import requests
+    HAS_REQUESTS = True
+except ImportError:
+    HAS_REQUESTS = False
+@dataclass
+class TestResult:
+    id: str
+    domain: str
+    problem: str
+    expected: str
+    response: str
+    extracted_answer: str
+    passed: bool
+    latency_ms: float
+def extract_answer(response: str, expected: str) -> str:
+    """Extract the answer from model response."""
+    response = response.strip()
+    # Try to find numbers in the response
+    numbers = re.findall(r'-?[\d,]+\.?\d*', response)
+    # For yes/no questions
+    if expected.lower() in ['yes', 'no']:
+        resp_lower = response.lower()
+        if 'yes' in resp_lower and 'no' not in resp_lower.split()[:3]:
+            return 'yes'
+        if 'no' in resp_lower and 'yes' not in resp_lower.split()[:3]:
+            return 'no'
+        if 'cannot determine' in resp_lower or 'cannot be determined' in resp_lower:
+            return 'cannot determine'
+    # For time answers
+    time_match = re.search(r'(\d{1,2}:\d{2})\s*(AM|PM|am|pm)?', response)
+    if time_match:
+        time_str = time_match.group(1)
+        period = time_match.group(2) or ''
+        return f"{time_str} {period}".strip()
+    # For ordering questions (comma-separated names)
+    if ',' in expected and any(c.isalpha() for c in expected):
+        # Try to extract comma-separated list
+        parts = [p.strip() for p in response.split(',') if p.strip()]
+        if len(parts) >= 3:
+            return ', '.join(parts[:5])
+    # Return first number found
+    if numbers:
+        return numbers[0].replace(',', '')
+    # Return first line or truncated response
+    first_line = response.split('\n')[0]
+    return first_line[:50] if len(first_line) > 50 else first_line
+def validate_answer(response: str, expected: str, domain: str) -> bool:
+    """Validate if the response matches the expected answer."""
+    response = response.lower().strip()
+    expected = expected.lower().strip()
+    # Direct match
+    if expected in response:
+        return True
+    # Numeric comparison
+    expected_nums = re.findall(r'-?[\d,]+\.?\d*', expected)
+    response_nums = re.findall(r'-?[\d,]+\.?\d*', response)
+    if expected_nums and response_nums:
+        try:
+            exp_val = float(expected_nums[0].replace(',', ''))
+            for resp_num in response_nums:
+                resp_val = float(resp_num.replace(',', ''))
+                # Allow small floating point tolerance
+                if abs(exp_val - resp_val) < 0.01:
+                    return True
+                # Check if it's within 0.5% (for rounding)
+                if exp_val != 0 and abs(exp_val - resp_val) / abs(exp_val) < 0.005:
+                    return True
+        except ValueError:
+            pass
+    # Time validation
+    if domain == 'time':
+        # Normalize time formats
+        def normalize_time(t):
+            t = t.lower().replace(' ', '')
+            t = re.sub(r'(\d{1,2}):(\d{2})(am|pm)?', r'\1:\2\3', t)
+            return t
+        if normalize_time(expected) in normalize_time(response):
+            return True
+    # Yes/no validation
+    if expected in ['yes', 'no', 'cannot determine']:
+        if expected == 'yes' and 'yes' in response and 'no' not in response.split()[:5]:
+            return True
+        if expected == 'no' and 'no' in response and 'yes' not in response.split()[:5]:
+            return True
+        if expected == 'cannot determine' and ('cannot' in response or 'unable' in response):
+            return True
+    # Ordering validation (check sequence)
+    if ',' in expected and domain == 'logic':
+        expected_items = [x.strip().lower() for x in expected.split(',')]
+        response_lower = response.lower()
+        # Check if items appear in correct order
+        positions = []
+        for item in expected_items:
+            pos = response_lower.find(item)
+            if pos == -1:
+                return False
+            positions.append(pos)
+        return positions == sorted(positions)
+    return False
+class ModelProvider:
+    """Base class for model providers."""
+    def generate(self, prompt: str) -> tuple[str, float]:
+        """Generate response. Returns (response, latency_ms)."""
+        raise NotImplementedError
+class OpenAIProvider(ModelProvider):
+    def __init__(self, model: str, api_key: Optional[str] = None):
+        self.model = model
+        self.api_key = api_key or os.environ.get('OPENAI_API_KEY')
+        if not self.api_key:
+            raise ValueError("OPENAI_API_KEY not set")
+    def generate(self, prompt: str) -> tuple[str, float]:
+        start = time.time()
+        headers = {
+            "Authorization": f"Bearer {self.api_key}",
+            "Content-Type": "application/json"
+        }
+        payload = {
+            "model": self.model,
+            "messages": [{"role": "user", "content": prompt}],
+            "temperature": 0.1,
+            "max_tokens": 200
+        }
+        response = requests.post(
+            "https://api.openai.com/v1/chat/completions",
+            headers=headers, json=payload, timeout=60
+        )
+        latency = (time.time() - start) * 1000
+        if response.status_code == 200:
+            return response.json()["choices"][0]["message"]["content"].strip(), latency
+        else:
+            return f"ERROR: {response.status_code}", latency
+class AnthropicProvider(ModelProvider):
+    def __init__(self, model: str, api_key: Optional[str] = None):
+        self.model = model
+        self.api_key = api_key or os.environ.get('ANTHROPIC_API_KEY')
+        if not self.api_key:
+            raise ValueError("ANTHROPIC_API_KEY not set")
+    def generate(self, prompt: str) -> tuple[str, float]:
+        start = time.time()
+        headers = {
+            "x-api-key": self.api_key,
+            "anthropic-version": "2023-06-01",
+            "Content-Type": "application/json"
+        }
+        payload = {
+            "model": self.model,
+            "max_tokens": 200,
+            "messages": [{"role": "user", "content": prompt}]
+        }
+        response = requests.post(
+            "https://api.anthropic.com/v1/messages",
+            headers=headers, json=payload, timeout=60
+        )
+        latency = (time.time() - start) * 1000
+        if response.status_code == 200:
+            return response.json()["content"][0]["text"].strip(), latency
+        else:
+            return f"ERROR: {response.status_code}", latency
+class OllamaProvider(ModelProvider):
+    def __init__(self, model: str, host: str = "http://localhost:11434"):
+        self.model = model
+        self.host = host
+    def generate(self, prompt: str) -> tuple[str, float]:
+        start = time.time()
+        payload = {
+            "model": self.model,
+            "prompt": prompt,
+            "stream": False,
+            "options": {"temperature": 0.1}
+        }
+        response = requests.post(
+            f"{self.host}/api/generate",
+            json=payload, timeout=120
+        )
+        latency = (time.time() - start) * 1000
+        if response.status_code == 200:
+            return response.json().get("response", "").strip(), latency
+        else:
+            return f"ERROR: {response.status_code}", latency
+class CustomProvider(ModelProvider):
+    def __init__(self, model: str, api_url: str):
+        self.model = model
+        self.api_url = api_url
+    def generate(self, prompt: str) -> tuple[str, float]:
+        start = time.time()
+        # Assume OpenAI-compatible API
+        payload = {
+            "model": self.model,
+            "messages": [{"role": "user", "content": prompt}],
+            "temperature": 0.1,
+            "max_tokens": 200
+        }
+        response = requests.post(
+            f"{self.api_url}/chat/completions",
+            json=payload, timeout=120
+        )
+        latency = (time.time() - start) * 1000
+        if response.status_code == 200:
+            return response.json()["choices"][0]["message"]["content"].strip(), latency
+        else:
+            return f"ERROR: {response.status_code}", latency
+def load_dataset(path: str = "data/test.jsonl") -> list[dict]:
+    """Load the benchmark dataset."""
+    problems = []
+    with open(path) as f:
+        for line in f:
+            problems.append(json.loads(line))
+    return problems
+def evaluate_model(
+    provider: ModelProvider,
+    problems: list[dict],
+    verbose: bool = False
+) -> tuple[list[TestResult], dict]:
+    """Evaluate a model on the benchmark."""
+    results = []
+    domain_stats = {}
+    for i, problem in enumerate(problems):
+        if verbose:
+            print(f"[{i+1}/{len(problems)}] {problem['id']}...", end=" ", flush=True)
+        response, latency = provider.generate(problem['problem'])
+        extracted = extract_answer(response, problem['correct_answer'])
+        passed = validate_answer(response, problem['correct_answer'], problem['domain'])
+        result = TestResult(
+            id=problem['id'],
+            domain=problem['domain'],
+            problem=problem['problem'],
+            expected=problem['correct_answer'],
+            response=response[:200],
+            extracted_answer=extracted,
+            passed=passed,
+            latency_ms=latency
+        )
+        results.append(result)
+        # Track domain stats
+        domain = problem['domain']
+        if domain not in domain_stats:
+            domain_stats[domain] = {'pass': 0, 'fail': 0}
+        domain_stats[domain]['pass' if passed else 'fail'] += 1
+        if verbose:
+            status = "PASS" if passed else "FAIL"
+            print(f"{status} (got: {extracted[:20]})")
+    # Calculate summary
+    total_pass = sum(r.passed for r in results)
+    total = len(results)
+    summary = {
+        'total': total,
+        'passed': total_pass,
+        'failed': total - total_pass,
+        'pass_rate': total_pass / total if total > 0 else 0,
+        'by_domain': {
+            d: {
+                'passed': s['pass'],
+                'total': s['pass'] + s['fail'],
+                'pass_rate': s['pass'] / (s['pass'] + s['fail'])
+            }
+            for d, s in domain_stats.items()
+        },
+        'avg_latency_ms': sum(r.latency_ms for r in results) / len(results) if results else 0
+    }
+    return results, summary
+def save_results(
+    results: list[TestResult],
+    summary: dict,
+    model_name: str,
+    output_dir: str = "results"
+):
+    """Save evaluation results."""
+    os.makedirs(output_dir, exist_ok=True)
+    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+    safe_model = re.sub(r'[^\w\-]', '_', model_name)
+    # Save detailed results
+    results_file = f"{output_dir}/{safe_model}_{timestamp}_results.jsonl"
+    with open(results_file, 'w') as f:
+        for r in results:
+            f.write(json.dumps({
+                'id': r.id,
+                'domain': r.domain,
+                'expected': r.expected,
+                'response': r.response,
+                'extracted': r.extracted_answer,
+                'passed': r.passed,
+                'latency_ms': r.latency_ms
+            }) + '\n')
+    # Save summary
+    summary_file = f"{output_dir}/{safe_model}_{timestamp}_summary.json"
+    summary['model'] = model_name
+    summary['timestamp'] = timestamp
+    with open(summary_file, 'w') as f:
+        json.dump(summary, f, indent=2)
+    return results_file, summary_file
+def print_summary(summary: dict, model_name: str):
+    """Print evaluation summary."""
+    print("\n" + "=" * 60)
+    print(f"GOODHART GAP BENCHMARK RESULTS")
+    print(f"Model: {model_name}")
+    print("=" * 60)
+    print(f"\nOverall: {summary['passed']}/{summary['total']} ({summary['pass_rate']*100:.1f}%)")
+    print(f"Average latency: {summary['avg_latency_ms']:.0f}ms")
+    print("\nBy Domain:")
+    print("-" * 40)
+    for domain, stats in sorted(summary['by_domain'].items()):
+        bar = "█" * int(stats['pass_rate'] * 10) + "░" * (10 - int(stats['pass_rate'] * 10))
+        print(f"  {domain:<15} {stats['passed']:>2}/{stats['total']:<2} {bar} {stats['pass_rate']*100:>5.1f}%")
+    print("\n" + "=" * 60)
+    # Interpret results
+    pass_rate = summary['pass_rate']
+    if pass_rate >= 0.9:
+        print("Assessment: LOW GOODHART GAP - Model executes well")
+    elif pass_rate >= 0.7:
+        print("Assessment: MODERATE GOODHART GAP - Some execution issues")
+    elif pass_rate >= 0.5:
+        print("Assessment: SIGNIFICANT GOODHART GAP - Frequent execution failures")
+    else:
+        print("Assessment: SEVERE GOODHART GAP - Major execution problems")
+def main():
+    parser = argparse.ArgumentParser(
+        description="Evaluate a model on the Goodhart Gap Benchmark",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog=__doc__
+    )
+    parser.add_argument('--provider', required=True,
+                       choices=['openai', 'anthropic', 'ollama', 'custom'],
+                       help='Model provider')
+    parser.add_argument('--model', required=True,
+                       help='Model name/identifier')
+    parser.add_argument('--api-url', default=None,
+                       help='API URL for custom provider')
+    parser.add_argument('--data', default='data/test.jsonl',
+                       help='Path to test data')
+    parser.add_argument('--output', default='results',
+                       help='Output directory')
+    parser.add_argument('--verbose', '-v', action='store_true',
+                       help='Show progress')
+    parser.add_argument('--limit', type=int, default=None,
+                       help='Limit number of problems (for testing)')
+    args = parser.parse_args()
+    if not HAS_REQUESTS:
+        print("ERROR: requests library required. Install with: pip install requests")
+        sys.exit(1)
+    # Create provider
+    if args.provider == 'openai':
+        provider = OpenAIProvider(args.model)
+    elif args.provider == 'anthropic':
+        provider = AnthropicProvider(args.model)
+    elif args.provider == 'ollama':
+        provider = OllamaProvider(args.model)
+    elif args.provider == 'custom':
+        if not args.api_url:
+            print("ERROR: --api-url required for custom provider")
+            sys.exit(1)
+        provider = CustomProvider(args.model, args.api_url)
+    # Load dataset
+    print(f"Loading dataset from {args.data}...")
+    problems = load_dataset(args.data)
+    if args.limit:
+        problems = problems[:args.limit]
+    print(f"Loaded {len(problems)} problems")
+    # Evaluate
+    print(f"\nEvaluating {args.model}...")
+    results, summary = evaluate_model(provider, problems, verbose=args.verbose)
+    # Save and print results
+    results_file, summary_file = save_results(results, summary, args.model, args.output)
+    print_summary(summary, args.model)
+    print(f"\nResults saved to:")
+    print(f"  {results_file}")
+    print(f"  {summary_file}")
+if __name__ == "__main__":
+    main()

generate_dataset.py ADDED Viewed

	@@ -0,0 +1,1040 @@

+#!/usr/bin/env python3
+"""
+Generate the Goodhart Gap Benchmark Dataset
+Creates 70-100 multi-step reasoning problems across 7 domains,
+specifically designed to detect the gap between understanding and execution.
+"""
+import json
+import random
+from dataclasses import dataclass, asdict
+from typing import List, Callable
+import re
+@dataclass
+class TestCase:
+    id: str
+    domain: str
+    problem: str
+    correct_answer: str
+    explanation: str
+    understanding_check: str
+    difficulty: str  # easy, medium, hard
+    steps: int  # number of sequential steps required
+def generate_math_discount_problems() -> List[TestCase]:
+    """Generate discount/coupon/tax calculation problems."""
+    problems = []
+    # Template 1: Discount then coupon
+    configs = [
+        (25, 20, 5, "easy"),   # 25 * 0.8 - 5 = 15
+        (50, 10, 8, "easy"),   # 50 * 0.9 - 8 = 37
+        (80, 25, 10, "easy"),  # 80 * 0.75 - 10 = 50
+        (120, 15, 12, "medium"),  # 120 * 0.85 - 12 = 90
+        (200, 30, 25, "medium"),  # 200 * 0.7 - 25 = 115
+        (75, 20, 7, "easy"),   # 75 * 0.8 - 7 = 53
+        (150, 40, 20, "medium"),  # 150 * 0.6 - 20 = 70
+    ]
+    for i, (price, discount, coupon, diff) in enumerate(configs):
+        discounted = price * (1 - discount/100)
+        final = discounted - coupon
+        problems.append(TestCase(
+            id=f"math_discount_{i+1:02d}",
+            domain="math_discount",
+            problem=f"A product costs ${price} and is on {discount}% sale. You also have a ${coupon} coupon. What do you pay? Answer with just the number.",
+            correct_answer=f"{final:.2f}".rstrip('0').rstrip('.'),
+            explanation=f"{price} × {1-discount/100} = {discounted}, then {discounted} - {coupon} = {final}",
+            understanding_check=f"To solve this, first apply the {discount}% discount, then subtract the coupon. What are the two steps?",
+            difficulty=diff,
+            steps=2
+        ))
+    # Template 2: Discount then tax
+    tax_configs = [
+        (100, 20, 10, "medium"),  # 100 * 0.8 * 1.1 = 88
+        (250, 15, 8, "medium"),   # 250 * 0.85 * 1.08 = 229.5
+        (80, 25, 5, "easy"),      # 80 * 0.75 * 1.05 = 63
+        (500, 10, 7, "medium"),   # 500 * 0.9 * 1.07 = 481.5
+        (160, 20, 6, "medium"),   # 160 * 0.8 * 1.06 = 135.68
+    ]
+    for i, (price, discount, tax, diff) in enumerate(tax_configs):
+        discounted = price * (1 - discount/100)
+        final = discounted * (1 + tax/100)
+        problems.append(TestCase(
+            id=f"math_discount_tax_{i+1:02d}",
+            domain="math_discount",
+            problem=f"An item costs ${price}. First apply a {discount}% discount, then add {tax}% sales tax. What's the final price? Answer with just the number.",
+            correct_answer=f"{final:.2f}".rstrip('0').rstrip('.'),
+            explanation=f"{price} × {1-discount/100} = {discounted}, then {discounted} × {1+tax/100} = {final}",
+            understanding_check=f"First apply the discount, then calculate tax on the discounted price. What are the steps?",
+            difficulty=diff,
+            steps=2
+        ))
+    # Template 3: Buy X get Y% off second
+    bogo_configs = [
+        (40, 50, "medium"),  # 40 + 40*0.5 = 60
+        (25, 25, "easy"),    # 25 + 25*0.75 = 43.75
+        (60, 40, "medium"),  # 60 + 60*0.6 = 96
+    ]
+    for i, (price, discount, diff) in enumerate(bogo_configs):
+        second = price * (1 - discount/100)
+        total = price + second
+        problems.append(TestCase(
+            id=f"math_bogo_{i+1:02d}",
+            domain="math_discount",
+            problem=f"Shirts cost ${price} each. Buy one, get {discount}% off the second. What's the total for 2 shirts? Answer with just the number.",
+            correct_answer=f"{total:.2f}".rstrip('0').rstrip('.'),
+            explanation=f"First shirt: {price}, Second shirt: {price} × {1-discount/100} = {second}, Total: {total}",
+            understanding_check=f"First shirt is full price, second shirt gets {discount}% off. How do you calculate the total?",
+            difficulty=diff,
+            steps=2
+        ))
+    return problems
+def generate_time_problems() -> List[TestCase]:
+    """Generate time arithmetic problems."""
+    problems = []
+    # Template 1: Start time + duration + break
+    configs = [
+        ("2:30 PM", 105, 30, "4:45 PM", "easy"),   # 2:30 + 1:45 + 0:30
+        ("9:15 AM", 140, 15, "11:50 AM", "easy"),  # 9:15 + 2:20 + 0:15
+        ("10:00 AM", 90, 45, "12:15 PM", "medium"),
+        ("3:45 PM", 75, 20, "5:20 PM", "easy"),
+        ("8:30 AM", 180, 60, "1:30 PM", "medium"),
+        ("11:15 AM", 45, 30, "12:30 PM", "easy"),
+        ("7:00 PM", 120, 15, "9:15 PM", "easy"),
+    ]
+    for i, (start, dur_mins, break_mins, expected, diff) in enumerate(configs):
+        dur_h, dur_m = dur_mins // 60, dur_mins % 60
+        problems.append(TestCase(
+            id=f"time_duration_{i+1:02d}",
+            domain="time",
+            problem=f"A meeting starts at {start} and lasts {dur_h} hour{'s' if dur_h != 1 else ''}{f' {dur_m} minutes' if dur_m else ''}. Then there's a {break_mins} minute break. What time does the next session start? Answer with just the time.",
+            correct_answer=expected,
+            explanation=f"Add {dur_mins} minutes to {start}, then add {break_mins} minutes",
+            understanding_check="Add the meeting duration first, then add the break time. What are the steps?",
+            difficulty=diff,
+            steps=2
+        ))
+    # Template 2: Travel with wait time
+    travel_configs = [
+        ("9:00 AM", 150, 20, "11:50 AM", "medium"),
+        ("2:15 PM", 75, 10, "3:40 PM", "easy"),
+        ("6:30 AM", 180, 30, "10:00 AM", "medium"),
+        ("4:00 PM", 45, 15, "5:00 PM", "easy"),
+        ("7:45 AM", 95, 25, "9:45 AM", "medium"),
+    ]
+    for i, (depart, travel_mins, wait_mins, expected, diff) in enumerate(travel_configs):
+        t_h, t_m = travel_mins // 60, travel_mins % 60
+        problems.append(TestCase(
+            id=f"time_travel_{i+1:02d}",
+            domain="time",
+            problem=f"A train departs at {depart}. The journey takes {t_h} hour{'s' if t_h != 1 else ''}{f' {t_m} minutes' if t_m else ''}. After arrival, you wait {wait_mins} minutes for a connection. What time do you board the connection? Answer with just the time.",
+            correct_answer=expected,
+            explanation=f"Add {travel_mins} minutes travel, then {wait_mins} minutes wait",
+            understanding_check="Calculate arrival time first, then add wait time. What are the steps?",
+            difficulty=diff,
+            steps=2
+        ))
+    # Template 3: Multiple segments
+    problems.append(TestCase(
+        id="time_multi_01",
+        domain="time",
+        problem="You leave home at 8:00 AM. Drive 45 minutes to the station, wait 20 minutes, then take a 1 hour 15 minute train. What time do you arrive? Answer with just the time.",
+        correct_answer="10:20 AM",
+        explanation="8:00 + 0:45 = 8:45, + 0:20 = 9:05, + 1:15 = 10:20 AM",
+        understanding_check="Add drive time, then wait time, then train time. What's the sequence?",
+        difficulty="hard",
+        steps=3
+    ))
+    return problems
+def generate_recipe_problems() -> List[TestCase]:
+    """Generate recipe scaling problems."""
+    problems = []
+    # Template 1: Scale then double/halve
+    configs = [
+        (2, 4, 6, 2, 6, "easy"),     # 2 cups for 4, scale to 6, double = 6
+        (3, 8, 12, 0.5, 2.25, "medium"),  # 3 eggs for 8, scale to 12 (4.5), halve = 2.25
+        (1.5, 4, 8, 2, 6, "easy"),   # 1.5 cups for 4, scale to 8 (3), double = 6
+        (4, 6, 9, 0.5, 3, "medium"), # 4 tbsp for 6, scale to 9 (6), halve = 3
+        (2, 5, 10, 1.5, 6, "medium"), # 2 cups for 5, scale to 10 (4), ×1.5 = 6
+    ]
+    ingredients = ["cups of flour", "eggs", "cups of sugar", "tablespoons butter", "cups of milk"]
+    for i, (amount, serves, new_serves, multiplier, final, diff) in enumerate(configs):
+        scaled = amount * (new_serves / serves)
+        ing = ingredients[i % len(ingredients)]
+        mult_text = "doubled" if multiplier == 2 else "halved" if multiplier == 0.5 else f"multiplied by {multiplier}"
+        problems.append(TestCase(
+            id=f"recipe_scale_{i+1:02d}",
+            domain="recipe",
+            problem=f"A recipe for {serves} people needs {amount} {ing}. Scale to {new_serves} people, then {mult_text} for a party. How much {ing.split()[0]} {' '.join(ing.split()[1:])} do you need? Answer with just the number.",
+            correct_answer=f"{final:.2f}".rstrip('0').rstrip('.'),
+            explanation=f"{amount} × ({new_serves}/{serves}) = {scaled}, then × {multiplier} = {final}",
+            understanding_check=f"First scale the recipe from {serves} to {new_serves} servings, then {mult_text}. What are the steps?",
+            difficulty=diff,
+            steps=2
+        ))
+    # Template 2: Convert units then scale
+    problems.append(TestCase(
+        id="recipe_convert_01",
+        domain="recipe",
+        problem="A recipe needs 2 cups of milk (1 cup = 240ml). Convert to ml, then reduce by 25% for a lighter version. How many ml? Answer with just the number.",
+        correct_answer="360",
+        explanation="2 × 240 = 480ml, then 480 × 0.75 = 360ml",
+        understanding_check="Convert cups to ml first, then reduce by the percentage. What are the steps?",
+        difficulty="medium",
+        steps=2
+    ))
+    problems.append(TestCase(
+        id="recipe_convert_02",
+        domain="recipe",
+        problem="A recipe uses 500g of flour. Convert to pounds (1 pound = 454g), then triple for a large batch. How many pounds? Answer with just the number rounded to one decimal.",
+        correct_answer="3.3",
+        explanation="500 / 454 = 1.1 pounds, then 1.1 × 3 = 3.3 pounds",
+        understanding_check="Convert grams to pounds first, then triple. What are the steps?",
+        difficulty="medium",
+        steps=2
+    ))
+    return problems
+def generate_financial_problems() -> List[TestCase]:
+    """Generate financial calculation problems."""
+    problems = []
+    # Template 1: Compound interest then tax on gains
+    configs = [
+        (1000, 10, 2, 20, 1168, "medium"),  # 1000 × 1.1² = 1210, gains=210, tax=42, final=1168
+        (5000, 5, 3, 15, 5541.19, "hard"),  # 5000 × 1.05³ = 5788.125, gains=788.125, tax=118.22, final≈5669.90
+        (2000, 8, 2, 25, 2181.60, "medium"),
+        (500, 12, 2, 10, 607.20, "medium"),
+    ]
+    for i, (principal, rate, years, tax, expected, diff) in enumerate(configs):
+        compound = principal * ((1 + rate/100) ** years)
+        gains = compound - principal
+        tax_amount = gains * (tax/100)
+        final = compound - tax_amount
+        problems.append(TestCase(
+            id=f"financial_compound_{i+1:02d}",
+            domain="financial",
+            problem=f"You invest ${principal} at {rate}% annual interest for {years} years (compounded yearly). Then you pay {tax}% tax on the gains only. What's your final amount? Answer with just the number.",
+            correct_answer=f"{final:.2f}".rstrip('0').rstrip('.'),
+            explanation=f"{principal} × (1.{rate:02d})^{years} = {compound:.2f}, gains = {gains:.2f}, tax = {tax_amount:.2f}, final = {final:.2f}",
+            understanding_check=f"Calculate compound interest first, then calculate tax only on the gains. What are the steps?",
+            difficulty=diff,
+            steps=3
+        ))
+    # Template 2: Markup then discount
+    markup_configs = [
+        (500, 25, 10, 562.50, "easy"),    # 500 × 1.25 × 0.9 = 562.50
+        (200, 50, 20, 240, "easy"),       # 200 × 1.5 × 0.8 = 240
+        (800, 20, 15, 816, "medium"),     # 800 × 1.2 × 0.85 = 816
+        (150, 40, 25, 157.50, "medium"),  # 150 × 1.4 × 0.75 = 157.50
+        (1000, 30, 10, 1170, "medium"),   # 1000 × 1.3 × 0.9 = 1170
+    ]
+    for i, (cost, markup, discount, expected, diff) in enumerate(markup_configs):
+        marked_up = cost * (1 + markup/100)
+        final = marked_up * (1 - discount/100)
+        problems.append(TestCase(
+            id=f"financial_markup_{i+1:02d}",
+            domain="financial",
+            problem=f"A ${cost} item has {markup}% markup, then {discount}% member discount. What does a member pay? Answer with just the number.",
+            correct_answer=f"{final:.2f}".rstrip('0').rstrip('.'),
+            explanation=f"{cost} × {1+markup/100} = {marked_up}, then × {1-discount/100} = {final}",
+            understanding_check=f"Apply markup first (increase), then discount (decrease). What are the steps?",
+            difficulty=diff,
+            steps=2
+        ))
+    # Template 3: Commission calculations
+    problems.append(TestCase(
+        id="financial_commission_01",
+        domain="financial",
+        problem="A salesperson earns 5% on the first $10,000 of sales and 8% on anything above. They sold $15,000. What's their commission? Answer with just the number.",
+        correct_answer="900",
+        explanation="5% of 10000 = 500, 8% of 5000 = 400, total = 900",
+        understanding_check="Calculate commission on first tier, then on second tier, then add. What are the steps?",
+        difficulty="hard",
+        steps=3
+    ))
+    return problems
+def generate_unit_problems() -> List[TestCase]:
+    """Generate unit conversion problems."""
+    problems = []
+    # Template 1: Convert, operate, convert back
+    configs = [
+        (10, 1.6, 5, 13.125, "miles", "km", "medium"),  # 10mi→16km, +5=21km, →13.125mi
+        (5, 0.4536, 2, 15.43, "pounds", "kg", "medium"),  # 5lb→2.268kg, +2=4.268kg, →9.41lb... wait let me recalc
+    ]
+    problems.append(TestCase(
+        id="unit_convert_01",
+        domain="units",
+        problem="Convert 10 miles to kilometers (1 mile = 1.6 km), add 5 km, then convert back to miles. How many miles? Answer with just the number.",
+        correct_answer="13.125",
+        explanation="10 × 1.6 = 16 km, 16 + 5 = 21 km, 21 ÷ 1.6 = 13.125 miles",
+        understanding_check="Convert to km, add, then convert back. What are the three steps?",
+        difficulty="medium",
+        steps=3
+    ))
+    problems.append(TestCase(
+        id="unit_convert_02",
+        domain="units",
+        problem="Convert 100°F to Celsius (C = (F-32) × 5/9), subtract 10°C, then convert back to Fahrenheit. What's the temperature in °F? Answer with just the number.",
+        correct_answer="82",
+        explanation="(100-32) × 5/9 = 37.78°C, 37.78 - 10 = 27.78°C, 27.78 × 9/5 + 32 = 82°F",
+        understanding_check="Convert F to C, subtract, then convert back. What are the steps?",
+        difficulty="hard",
+        steps=3
+    ))
+    # Template 2: Volume/capacity operations
+    problems.append(TestCase(
+        id="unit_volume_01",
+        domain="units",
+        problem="You have 2 liters of water. Add 500ml, then pour out 1/4 of the total. How many ml remain? Answer with just the number.",
+        correct_answer="1875",
+        explanation="2000 + 500 = 2500ml, then 2500 × 0.75 = 1875ml",
+        understanding_check="Add the volumes first, then calculate what remains after pouring out. What are the steps?",
+        difficulty="easy",
+        steps=2
+    ))
+    problems.append(TestCase(
+        id="unit_volume_02",
+        domain="units",
+        problem="A tank holds 50 gallons. Drain 20%, then add 8 gallons. How many gallons now? Answer with just the number.",
+        correct_answer="48",
+        explanation="50 × 0.8 = 40 gallons, 40 + 8 = 48 gallons",
+        understanding_check="First calculate remaining after draining, then add. What are the steps?",
+        difficulty="easy",
+        steps=2
+    ))
+    problems.append(TestCase(
+        id="unit_volume_03",
+        domain="units",
+        problem="A pool holds 10,000 liters. Fill it to 75%, then drain 500 liters. How many liters remain? Answer with just the number.",
+        correct_answer="7000",
+        explanation="10000 × 0.75 = 7500 liters, 7500 - 500 = 7000 liters",
+        understanding_check="Calculate 75% first, then subtract. What are the steps?",
+        difficulty="easy",
+        steps=2
+    ))
+    # Template 3: Distance/speed
+    problems.append(TestCase(
+        id="unit_speed_01",
+        domain="units",
+        problem="Drive 60 miles at 30 mph, then 40 miles at 40 mph. What's the total travel time in hours? Answer with just the number.",
+        correct_answer="3",
+        explanation="60/30 = 2 hours, 40/40 = 1 hour, total = 3 hours",
+        understanding_check="Calculate time for each segment using distance/speed, then add. What are the steps?",
+        difficulty="medium",
+        steps=2
+    ))
+    problems.append(TestCase(
+        id="unit_speed_02",
+        domain="units",
+        problem="A car travels 120 km in 1.5 hours, then 80 km in 1 hour. What's the average speed for the entire trip in km/h? Answer with just the number.",
+        correct_answer="80",
+        explanation="Total distance = 200 km, total time = 2.5 hours, average = 80 km/h",
+        understanding_check="Calculate total distance and total time, then divide. What are the steps?",
+        difficulty="medium",
+        steps=2
+    ))
+    return problems
+def generate_scheduling_problems() -> List[TestCase]:
+    """Generate scheduling/dependency problems."""
+    problems = []
+    # Template 1: Sequential tasks with parallel
+    problems.append(TestCase(
+        id="schedule_01",
+        domain="scheduling",
+        problem="Task A takes 2 hours. Task B takes 3 hours and must start after A finishes. Task C takes 1 hour and runs parallel to B. Starting at 9 AM, when do all tasks finish? Answer with just the time.",
+        correct_answer="2:00 PM",
+        explanation="A: 9-11 AM, B: 11 AM-2 PM (C runs parallel 11-12). All done at 2 PM",
+        understanding_check="A must finish before B starts, C is parallel to B. What determines the end time?",
+        difficulty="medium",
+        steps=2
+    ))
+    problems.append(TestCase(
+        id="schedule_02",
+        domain="scheduling",
+        problem="Process X takes 45 minutes. Process Y takes 30 minutes and needs X's output. Process Z takes 20 minutes and needs Y's output. Total time from start to finish? Answer in minutes.",
+        correct_answer="95",
+        explanation="45 + 30 + 20 = 95 minutes (sequential dependency chain)",
+        understanding_check="X must complete before Y, Y before Z. They're sequential. What's the total?",
+        difficulty="easy",
+        steps=3
+    ))
+    problems.append(TestCase(
+        id="schedule_03",
+        domain="scheduling",
+        problem="Download takes 10 minutes. Install takes 15 minutes (after download). Configuration takes 5 minutes (after install). Testing takes 20 minutes (after config). Total time? Answer in minutes.",
+        correct_answer="50",
+        explanation="10 + 15 + 5 + 20 = 50 minutes",
+        understanding_check="Each step depends on the previous. How do you calculate total time?",
+        difficulty="easy",
+        steps=4
+    ))
+    # Template 2: Multiple paths
+    problems.append(TestCase(
+        id="schedule_04",
+        domain="scheduling",
+        problem="Path 1: Tasks A(2h) then B(3h). Path 2: Task C(4h). Both paths must complete. Starting at 10 AM, when is everything done? Answer with just the time.",
+        correct_answer="3:00 PM",
+        explanation="Path 1: 2+3=5 hours. Path 2: 4 hours. Critical path is 5 hours. 10 AM + 5h = 3 PM",
+        understanding_check="Find the longest path (critical path). That determines when everything finishes.",
+        difficulty="medium",
+        steps=2
+    ))
+    problems.append(TestCase(
+        id="schedule_05",
+        domain="scheduling",
+        problem="Team A: 3 tasks of 20 mins each (sequential). Team B: 2 tasks of 25 mins each (sequential). Both teams work in parallel. When do both finish? Answer in minutes from start.",
+        correct_answer="60",
+        explanation="Team A: 60 mins. Team B: 50 mins. Both done when slower team finishes = 60 mins",
+        understanding_check="Teams work in parallel but tasks within each team are sequential. What's the critical path?",
+        difficulty="medium",
+        steps=2
+    ))
+    # Template 3: Work rate problems
+    problems.append(TestCase(
+        id="schedule_06",
+        domain="scheduling",
+        problem="Worker A completes a job in 6 hours. Worker B completes it in 4 hours. Working together, how long to complete one job? Answer in hours as a decimal.",
+        correct_answer="2.4",
+        explanation="Rate A = 1/6, Rate B = 1/4. Combined = 1/6 + 1/4 = 5/12. Time = 12/5 = 2.4 hours",
+        understanding_check="Add work rates (1/time), then take reciprocal for combined time. What are the steps?",
+        difficulty="hard",
+        steps=3
+    ))
+    problems.append(TestCase(
+        id="schedule_07",
+        domain="scheduling",
+        problem="A printer prints 30 pages/min. Another prints 20 pages/min. How long to print 250 pages together? Answer in minutes.",
+        correct_answer="5",
+        explanation="Combined rate = 50 pages/min. 250 ÷ 50 = 5 minutes",
+        understanding_check="Add the rates together, then divide total pages by combined rate. What are the steps?",
+        difficulty="easy",
+        steps=2
+    ))
+    return problems
+def generate_logic_problems() -> List[TestCase]:
+    """Generate logic/deduction problems."""
+    problems = []
+    # Template 1: Ordering from constraints
+    problems.append(TestCase(
+        id="logic_order_01",
+        domain="logic",
+        problem="In a race: Alice finishes before Bob. Carol finishes after Bob but before Dave. Eve finishes between Alice and Bob. List the finish order from first to last, separated by commas.",
+        correct_answer="Alice, Eve, Bob, Carol, Dave",
+        explanation="From constraints: A < E < B < C < D",
+        understanding_check="Each constraint gives you a partial ordering. Combine them to get the full order.",
+        difficulty="medium",
+        steps=4
+    ))
+    problems.append(TestCase(
+        id="logic_order_02",
+        domain="logic",
+        problem="Five books on a shelf from left to right: Red is left of Blue. Green is right of Blue. Yellow is left of Red. Orange is between Blue and Green. What's the order left to right?",
+        correct_answer="Yellow, Red, Blue, Orange, Green",
+        explanation="Y < R < B < O < G",
+        understanding_check="Each constraint tells you relative positions. Build the sequence step by step.",
+        difficulty="medium",
+        steps=4
+    ))
+    # Template 2: Modus ponens chains
+    problems.append(TestCase(
+        id="logic_modus_01",
+        domain="logic",
+        problem="If it rains, the ground is wet. If the ground is wet, the game is cancelled. It rained. Is the game cancelled? Answer yes or no.",
+        correct_answer="yes",
+        explanation="Rain → Wet → Cancelled. Rain is true, so Cancelled is true.",
+        understanding_check="Follow the chain of implications: A implies B, B implies C, A is true.",
+        difficulty="easy",
+        steps=2
+    ))
+    problems.append(TestCase(
+        id="logic_modus_02",
+        domain="logic",
+        problem="If the battery is dead, the car won't start. If the car won't start, I'll be late. If I'm late, I'll miss the meeting. The battery is dead. Will I miss the meeting? Answer yes or no.",
+        correct_answer="yes",
+        explanation="Dead battery → No start → Late → Miss meeting",
+        understanding_check="Follow the implication chain from the given fact to the conclusion.",
+        difficulty="easy",
+        steps=3
+    ))
+    problems.append(TestCase(
+        id="logic_modus_03",
+        domain="logic",
+        problem="All programmers know logic. All logicians are good at puzzles. Sam is a programmer. Is Sam good at puzzles? Answer yes, no, or cannot determine.",
+        correct_answer="cannot determine",
+        explanation="Sam is programmer → knows logic. But knowing logic ≠ being a logician.",
+        understanding_check="Check if the chain of implications is complete. Is there a gap?",
+        difficulty="hard",
+        steps=2
+    ))
+    # Template 3: Set/category reasoning
+    problems.append(TestCase(
+        id="logic_sets_01",
+        domain="logic",
+        problem="30 students take Math. 25 take Science. 10 take both. How many take at least one subject? Answer with just the number.",
+        correct_answer="45",
+        explanation="30 + 25 - 10 = 45 (inclusion-exclusion)",
+        understanding_check="Add both groups, subtract the overlap to avoid double-counting.",
+        difficulty="easy",
+        steps=2
+    ))
+    problems.append(TestCase(
+        id="logic_sets_02",
+        domain="logic",
+        problem="In a group of 50 people: 35 speak English, 30 speak Spanish, and 20 speak both. How many speak neither? Answer with just the number.",
+        correct_answer="5",
+        explanation="Either language: 35 + 30 - 20 = 45. Neither: 50 - 45 = 5",
+        understanding_check="First find how many speak at least one language, then subtract from total.",
+        difficulty="medium",
+        steps=3
+    ))
+    problems.append(TestCase(
+        id="logic_sets_03",
+        domain="logic",
+        problem="100 people surveyed about pets: 60 have dogs, 40 have cats, 15 have both, 25 have fish only. How many have no pets? Answer with just the number.",
+        correct_answer="10",
+        explanation="Dogs or cats: 60 + 40 - 15 = 85. Fish only adds 25 but we need just no pets. 85 + 25 = 110 > 100, so fish must overlap. Actually: 100 - (60+40-15) - 25 + overlap = need to recalc...",
+        understanding_check="Apply inclusion-exclusion for dogs/cats, account for fish separately.",
+        difficulty="hard",
+        steps=3
+    ))
+    return problems
+def generate_spatial_problems() -> List[TestCase]:
+    """Generate spatial reasoning problems (non-numerical)."""
+    problems = []
+    # Direction tracking
+    problems.append(TestCase(
+        id="spatial_direction_01",
+        domain="spatial",
+        problem="You start facing North. Turn right. Turn right again. Which direction are you now facing? Answer with just the direction.",
+        correct_answer="South",
+        explanation="North → (right) → East → (right) → South",
+        understanding_check="Track your direction after each turn. Right from North is East, right from East is...",
+        difficulty="easy",
+        steps=2
+    ))
+    problems.append(TestCase(
+        id="spatial_direction_02",
+        domain="spatial",
+        problem="You face East. Turn left. Turn left. Turn right. Which direction are you facing? Answer with just the direction.",
+        correct_answer="West",
+        explanation="East → (left) → North → (left) → West → (right) → North. Wait, let me recalc: East→North→West→North. No: East→left→North, North→left→West, West→right→North",
+        understanding_check="Apply each turn sequentially. Left from East is North, etc.",
+        difficulty="medium",
+        steps=3
+    ))
+    problems.append(TestCase(
+        id="spatial_direction_03",
+        domain="spatial",
+        problem="You start facing North. Turn right 3 times. Which direction are you facing? Answer with just the direction.",
+        correct_answer="West",
+        explanation="North → East → South → West (3 right turns)",
+        understanding_check="Each right turn rotates 90° clockwise. After 3 turns from North...",
+        difficulty="easy",
+        steps=3
+    ))
+    # Grid navigation
+    problems.append(TestCase(
+        id="spatial_grid_01",
+        domain="spatial",
+        problem="Start at position (0,0). Move right 3 steps, up 2 steps, left 1 step. What's your final position? Answer as (x,y).",
+        correct_answer="(2,2)",
+        explanation="(0,0) → (3,0) → (3,2) → (2,2)",
+        understanding_check="Track x and y coordinates separately through each move.",
+        difficulty="easy",
+        steps=3
+    ))
+    problems.append(TestCase(
+        id="spatial_grid_02",
+        domain="spatial",
+        problem="Start at (5,5). Move left 2, down 3, right 4, up 1. What's your final position? Answer as (x,y).",
+        correct_answer="(7,3)",
+        explanation="(5,5) → (3,5) → (3,2) → (7,2) → (7,3)",
+        understanding_check="Apply each movement to the coordinates sequentially.",
+        difficulty="medium",
+        steps=4
+    ))
+    # Relative position
+    problems.append(TestCase(
+        id="spatial_relative_01",
+        domain="spatial",
+        problem="A is north of B. C is east of B. D is south of C. What direction is D from A? Answer with the direction.",
+        correct_answer="Southeast",
+        explanation="Draw it: A is above B, C is right of B, D is below C. D is right and below A = Southeast",
+        understanding_check="Build a mental map from the relationships, then determine the final direction.",
+        difficulty="medium",
+        steps=3
+    ))
+    problems.append(TestCase(
+        id="spatial_relative_02",
+        domain="spatial",
+        problem="The library is 2 blocks east of the park. The cafe is 3 blocks north of the library. The museum is 2 blocks west of the cafe. Is the museum north of the park? Answer yes or no.",
+        correct_answer="yes",
+        explanation="Park → (2 east) → Library → (3 north) → Cafe → (2 west) → Museum. Museum is directly north of park.",
+        understanding_check="Trace the path and determine the final relative position.",
+        difficulty="medium",
+        steps=3
+    ))
+    return problems
+def generate_procedural_problems() -> List[TestCase]:
+    """Generate procedural/state-tracking problems (non-numerical)."""
+    problems = []
+    # State machine problems
+    problems.append(TestCase(
+        id="procedural_state_01",
+        domain="procedural",
+        problem="A traffic light cycles: Green → Yellow → Red → Green. It's currently Green. What color will it be after 4 changes?",
+        correct_answer="Yellow",
+        explanation="Green → Yellow → Red → Green → Yellow (4 changes)",
+        understanding_check="Follow the cycle for each change. After 4 changes from Green...",
+        difficulty="easy",
+        steps=4
+    ))
+    problems.append(TestCase(
+        id="procedural_state_02",
+        domain="procedural",
+        problem="A door can be: Locked, Closed, or Open. From Locked, you can only Unlock (→Closed). From Closed, you can Lock (→Locked) or Open (→Open). From Open, you can only Close (→Closed). Starting Locked, after: Unlock, Open, Close, Lock - what state is the door?",
+        correct_answer="Locked",
+        explanation="Locked → Unlock → Closed → Open → Open → Close → Closed → Lock → Locked",
+        understanding_check="Apply each action to the current state following the rules.",
+        difficulty="medium",
+        steps=4
+    ))
+    # Recipe/procedure following
+    problems.append(TestCase(
+        id="procedural_recipe_01",
+        domain="procedural",
+        problem="To make tea: (1) Boil water, (2) Add tea bag, (3) Steep 3 min, (4) Remove bag, (5) Add milk. If you do steps 1,2,5,3,4 in that order, what's wrong?",
+        correct_answer="Added milk before steeping",
+        explanation="Step 5 (add milk) was done before step 3 (steep) and 4 (remove bag).",
+        understanding_check="Compare the actual order to the correct order. What happened out of sequence?",
+        difficulty="easy",
+        steps=2
+    ))
+    problems.append(TestCase(
+        id="procedural_recipe_02",
+        domain="procedural",
+        problem="Password rules: Must start with uppercase, must end with number, must have exactly 8 characters. Which is valid: 'Password1', 'password1', 'Pass1234', 'Passwor1'? Answer with just the valid password.",
+        correct_answer="Passwor1",
+        explanation="Password1 = 9 chars (fail). password1 = lowercase start (fail). Pass1234 = 8 chars but ends with 4 numbers total, ends with number (valid? let me check: P-a-s-s-1-2-3-4 = 8 chars, starts upper, ends with number = valid). Passwor1 = 8 chars, starts P, ends 1 = valid. Both Pass1234 and Passwor1 are valid...",
+        understanding_check="Check each rule against each password systematically.",
+        difficulty="medium",
+        steps=3
+    ))
+    # Undo/redo operations
+    problems.append(TestCase(
+        id="procedural_undo_01",
+        domain="procedural",
+        problem="Text editor starts with 'Hello'. Actions: Append ' World', Append '!', Undo, Append '?'. What's the final text?",
+        correct_answer="Hello World?",
+        explanation="Hello → 'Hello World' → 'Hello World!' → Undo → 'Hello World' → 'Hello World?'",
+        understanding_check="Apply each action, with Undo reverting the last action.",
+        difficulty="medium",
+        steps=4
+    ))
+    problems.append(TestCase(
+        id="procedural_undo_02",
+        domain="procedural",
+        problem="Stack operations: Start empty. Push A, Push B, Pop, Push C, Pop, Pop. What's left on the stack? Answer with the contents or 'empty'.",
+        correct_answer="empty",
+        explanation="[] → [A] → [A,B] → [A] → [A,C] → [A] → []",
+        understanding_check="Push adds to top, Pop removes from top. Track the stack state.",
+        difficulty="medium",
+        steps=6
+    ))
+    return problems
+def generate_text_manipulation_problems() -> List[TestCase]:
+    """Generate text/string manipulation problems (non-numerical)."""
+    problems = []
+    # String operations
+    problems.append(TestCase(
+        id="text_string_01",
+        domain="text",
+        problem="Take the word 'HELLO'. Reverse it, then remove the first letter. What's the result?",
+        correct_answer="LLEH",
+        explanation="HELLO → reverse → OLLEH → remove first → LLEH",
+        understanding_check="First reverse the string, then remove the first character of the result.",
+        difficulty="easy",
+        steps=2
+    ))
+    problems.append(TestCase(
+        id="text_string_02",
+        domain="text",
+        problem="Start with 'ABCDE'. Remove vowels, then reverse. What's the result?",
+        correct_answer="DCB",
+        explanation="ABCDE → remove A,E → BCD → reverse → DCB",
+        understanding_check="First remove all vowels (A, E, I, O, U), then reverse what's left.",
+        difficulty="easy",
+        steps=2
+    ))
+    problems.append(TestCase(
+        id="text_string_03",
+        domain="text",
+        problem="Take 'PROGRAMMING'. Keep only consonants, then take the first 4 letters. What's the result?",
+        correct_answer="PRGR",
+        explanation="PROGRAMMING → remove O,A,I → PRGRMMNG → first 4 → PRGR",
+        understanding_check="Remove vowels first, then truncate to 4 characters.",
+        difficulty="medium",
+        steps=2
+    ))
+    # Word operations
+    problems.append(TestCase(
+        id="text_word_01",
+        domain="text",
+        problem="Sentence: 'The quick brown fox'. Reverse word order, then take the first word. What is it?",
+        correct_answer="fox",
+        explanation="'The quick brown fox' → 'fox brown quick The' → first word → 'fox'",
+        understanding_check="Reverse the order of words (not letters), then take the first one.",
+        difficulty="easy",
+        steps=2
+    ))
+    problems.append(TestCase(
+        id="text_word_02",
+        domain="text",
+        problem="'CAT DOG BIRD'. Replace each word with its first letter, then combine. What's the result?",
+        correct_answer="CDB",
+        explanation="CAT→C, DOG→D, BIRD→B → CDB",
+        understanding_check="Extract first letter of each word, then concatenate.",
+        difficulty="easy",
+        steps=2
+    ))
+    # Encoding/transformation
+    problems.append(TestCase(
+        id="text_encode_01",
+        domain="text",
+        problem="Shift each letter in 'CAT' forward by 1 in the alphabet (A→B, B→C, etc.). Then shift the result backward by 2. What's the final word?",
+        correct_answer="BZS",
+        explanation="CAT → (+1) → DBU → (-2) → BZS (D→B, B→Z, U→S)",
+        understanding_check="Apply the first shift, then apply the second shift to the result.",
+        difficulty="medium",
+        steps=2
+    ))
+    problems.append(TestCase(
+        id="text_encode_02",
+        domain="text",
+        problem="Replace each vowel in 'HELLO' with the next vowel (A→E, E→I, I→O, O→U, U→A). What's the result?",
+        correct_answer="HILLU",
+        explanation="H-E-L-L-O → H-I-L-L-U (E→I, O→U)",
+        understanding_check="Find each vowel, replace with next in sequence A-E-I-O-U-A.",
+        difficulty="medium",
+        steps=2
+    ))
+    return problems
+def generate_sequence_problems() -> List[TestCase]:
+    """Generate sequence/pattern problems (non-numerical in nature)."""
+    problems = []
+    # Letter patterns
+    problems.append(TestCase(
+        id="sequence_letter_01",
+        domain="sequence",
+        problem="Pattern: A, C, E, G, _. What letter comes next?",
+        correct_answer="I",
+        explanation="Skip one letter each time: A(skip B)C(skip D)E(skip F)G(skip H)I",
+        understanding_check="Identify the pattern (skip 1), then apply it.",
+        difficulty="easy",
+        steps=2
+    ))
+    problems.append(TestCase(
+        id="sequence_letter_02",
+        domain="sequence",
+        problem="Pattern: Z, X, V, T, _. What letter comes next?",
+        correct_answer="R",
+        explanation="Going backward, skip one: Z(skip Y)X(skip W)V(skip U)T(skip S)R",
+        understanding_check="Pattern goes backward skipping one letter each time.",
+        difficulty="easy",
+        steps=2
+    ))
+    problems.append(TestCase(
+        id="sequence_letter_03",
+        domain="sequence",
+        problem="Pattern: A, B, D, G, K, _. What letter comes next?",
+        correct_answer="P",
+        explanation="Gaps increase: +1, +2, +3, +4, +5. A+1=B, B+2=D, D+3=G, G+4=K, K+5=P",
+        understanding_check="The gap between letters increases by 1 each time.",
+        difficulty="medium",
+        steps=2
+    ))
+    # Shape/symbol patterns
+    problems.append(TestCase(
+        id="sequence_symbol_01",
+        domain="sequence",
+        problem="Pattern: ●○●○●_. What comes next: ● or ○?",
+        correct_answer="○",
+        explanation="Alternating: filled, empty, filled, empty, filled, empty",
+        understanding_check="Simple alternating pattern.",
+        difficulty="easy",
+        steps=1
+    ))
+    problems.append(TestCase(
+        id="sequence_symbol_02",
+        domain="sequence",
+        problem="Pattern: ●●○●●○●●_. What comes next: ● or ○?",
+        correct_answer="○",
+        explanation="Pattern is: two filled, one empty, repeating. ●●○ ●●○ ●●○",
+        understanding_check="Find the repeating unit (●●○), then continue.",
+        difficulty="easy",
+        steps=2
+    ))
+    # Word patterns
+    problems.append(TestCase(
+        id="sequence_word_01",
+        domain="sequence",
+        problem="Pattern: one, two, three, ___, five. What word fills the blank?",
+        correct_answer="four",
+        explanation="Counting sequence: one, two, three, four, five",
+        understanding_check="This is a simple counting sequence.",
+        difficulty="easy",
+        steps=1
+    ))
+    problems.append(TestCase(
+        id="sequence_word_02",
+        domain="sequence",
+        problem="Pattern: January, March, May, July, ___. What month comes next?",
+        correct_answer="September",
+        explanation="Odd months: Jan(1), Mar(3), May(5), Jul(7), Sep(9)",
+        understanding_check="These are odd-numbered months. Next odd month is September.",
+        difficulty="easy",
+        steps=2
+    ))
+    return problems
+def generate_causal_problems() -> List[TestCase]:
+    """Generate causal reasoning problems (non-numerical)."""
+    problems = []
+    # Cause-effect chains
+    problems.append(TestCase(
+        id="causal_chain_01",
+        domain="causal",
+        problem="The power went out. This caused the fridge to stop. The fridge stopping caused the food to spoil. The food spoiling caused everyone to get sick. What was the root cause of everyone getting sick?",
+        correct_answer="The power went out",
+        explanation="Power out → Fridge stops → Food spoils → Sickness. Root cause: power outage",
+        understanding_check="Trace the causal chain back to the original cause.",
+        difficulty="easy",
+        steps=3
+    ))
+    problems.append(TestCase(
+        id="causal_chain_02",
+        domain="causal",
+        problem="If the alarm doesn't ring, Tom oversleeps. If Tom oversleeps, he misses the bus. If he misses the bus, he's late for work. The alarm didn't ring. What happens to Tom at work?",
+        correct_answer="He is late",
+        explanation="No alarm → Oversleep → Miss bus → Late for work",
+        understanding_check="Follow the chain of consequences from the initial event.",
+        difficulty="easy",
+        steps=3
+    ))
+    # Counterfactual reasoning
+    problems.append(TestCase(
+        id="causal_counter_01",
+        domain="causal",
+        problem="The plant died because it wasn't watered. If the plant had been watered, would it have died? Answer yes, no, or unknown.",
+        correct_answer="no",
+        explanation="The cause of death was lack of water. Removing the cause would prevent the effect.",
+        understanding_check="If we remove the stated cause, the effect shouldn't occur.",
+        difficulty="easy",
+        steps=2
+    ))
+    problems.append(TestCase(
+        id="causal_counter_02",
+        domain="causal",
+        problem="The cake burned because the oven was too hot. The oven was too hot because the dial was broken. If the dial worked, would the cake have burned?",
+        correct_answer="no",
+        explanation="Working dial → correct temp → no burning. The broken dial was the root cause.",
+        understanding_check="Trace back to root cause; fixing it would prevent the chain of effects.",
+        difficulty="medium",
+        steps=3
+    ))
+    # Sufficient vs necessary
+    problems.append(TestCase(
+        id="causal_necessary_01",
+        domain="causal",
+        problem="Water is necessary for plants to grow. A plant has water. Will it definitely grow? Answer yes, no, or not necessarily.",
+        correct_answer="not necessarily",
+        explanation="Water is necessary but not sufficient. Plant also needs light, soil, etc.",
+        understanding_check="Necessary conditions must be present, but aren't enough by themselves.",
+        difficulty="medium",
+        steps=2
+    ))
+    problems.append(TestCase(
+        id="causal_necessary_02",
+        domain="causal",
+        problem="To start a car, you need fuel AND a working battery. A car has fuel but a dead battery. Will it start? Answer yes or no.",
+        correct_answer="no",
+        explanation="Both conditions are necessary. Missing one means it won't start.",
+        understanding_check="With AND conditions, all must be true.",
+        difficulty="easy",
+        steps=2
+    ))
+    problems.append(TestCase(
+        id="causal_necessary_03",
+        domain="causal",
+        problem="You can enter the club with a membership card OR by paying the cover charge. You have a membership card. Can you enter? Answer yes or no.",
+        correct_answer="yes",
+        explanation="With OR conditions, meeting one is sufficient.",
+        understanding_check="With OR conditions, satisfying any one is enough.",
+        difficulty="easy",
+        steps=2
+    ))
+    return problems
+def main():
+    """Generate all problems and save to JSONL."""
+    all_problems = []
+    # Generate problems for each domain
+    generators = [
+        generate_math_discount_problems,
+        generate_time_problems,
+        generate_recipe_problems,
+        generate_financial_problems,
+        generate_unit_problems,
+        generate_scheduling_problems,
+        generate_logic_problems,
+        generate_spatial_problems,
+        generate_procedural_problems,
+        generate_text_manipulation_problems,
+        generate_sequence_problems,
+        generate_causal_problems,
+    ]
+    for gen in generators:
+        problems = gen()
+        all_problems.extend(problems)
+        print(f"Generated {len(problems)} problems from {gen.__name__}")
+    print(f"\nTotal problems: {len(all_problems)}")
+    # Count by domain
+    domain_counts = {}
+    for p in all_problems:
+        domain_counts[p.domain] = domain_counts.get(p.domain, 0) + 1
+    print("\nBy domain:")
+    for domain, count in sorted(domain_counts.items()):
+        print(f"  {domain}: {count}")
+    # Save to JSONL
+    output_path = "data/test.jsonl"
+    with open(output_path, 'w') as f:
+        for p in all_problems:
+            f.write(json.dumps(asdict(p)) + '\n')
+    print(f"\nSaved to {output_path}")
+    # Also save a summary
+    summary = {
+        "total_problems": len(all_problems),
+        "domains": domain_counts,
+        "difficulty_distribution": {},
+        "step_distribution": {}
+    }
+    for p in all_problems:
+        summary["difficulty_distribution"][p.difficulty] = summary["difficulty_distribution"].get(p.difficulty, 0) + 1
+        summary["step_distribution"][str(p.steps)] = summary["step_distribution"].get(str(p.steps), 0) + 1
+    with open("data/summary.json", 'w') as f:
+        json.dump(summary, f, indent=2)
+    print("Saved summary to data/summary.json")
+if __name__ == "__main__":
+    main()

requirements.txt ADDED Viewed

	@@ -0,0 +1 @@


1	+ requests>=2.28.0

results/claude-3-5-haiku-latest_20260103_182323_results.jsonl ADDED Viewed

	@@ -0,0 +1,5 @@

+{"id": "math_discount_01", "domain": "math_discount", "expected": "15", "response": "Let me solve this step by step:\n1. Original price = $25\n2. 20% sale means 20% off: $25 \u00d7 0.8 = $20\n3. Then subtract $5 coupon: $20 - $5 = $15\n\n15", "extracted": "1.", "passed": true, "latency_ms": 1811.5723133087158}
+{"id": "math_discount_02", "domain": "math_discount", "expected": "37", "response": "37", "extracted": "37", "passed": true, "latency_ms": 779.4408798217773}
+{"id": "math_discount_03", "domain": "math_discount", "expected": "50", "response": "50", "extracted": "50", "passed": true, "latency_ms": 1328.655481338501}
+{"id": "math_discount_04", "domain": "math_discount", "expected": "90", "response": "Let's solve this step by step:\n1. Original price: $120\n2. 15% sale discount: $120 \u00d7 0.15 = $18\n3. Price after sale: $120 - $18 = $102\n4. Apply $12 coupon: $102 - $12 = $90\n\n90", "extracted": "1.", "passed": true, "latency_ms": 1813.749074935913}
+{"id": "math_discount_05", "domain": "math_discount", "expected": "115", "response": "Let me solve this step by step:\n1. Original price = $200\n2. Sale discount = 30% of $200 = $60\n3. Price after sale = $200 - $60 = $140\n4. Price after coupon = $140 - $25 = $115\n\n115", "extracted": "1.", "passed": true, "latency_ms": 1643.8806056976318}

results/claude-3-5-haiku-latest_20260103_182323_summary.json ADDED Viewed

	@@ -0,0 +1,16 @@

+{
+  "total": 5,
+  "passed": 5,
+  "failed": 0,
+  "pass_rate": 1.0,
+  "by_domain": {
+    "math_discount": {
+      "passed": 5,
+      "total": 5,
+      "pass_rate": 1.0
+    }
+  },
+  "avg_latency_ms": 1475.4596710205078,
+  "model": "claude-3-5-haiku-latest",
+  "timestamp": "20260103_182323"
+}

results/claude-3-5-haiku-latest_20260103_184241_results.jsonl ADDED Viewed

	@@ -0,0 +1,101 @@

+{"id": "math_discount_01", "domain": "math_discount", "expected": "15", "response": "Let me solve this step by step:\n1. Original price: $25\n2. 20% sale: $25 * 0.8 = $20\n3. Apply $5 coupon: $20 - $5 = $15\n\n15", "extracted": "1.", "passed": true, "latency_ms": 1665.3423309326172}
+{"id": "math_discount_02", "domain": "math_discount", "expected": "37", "response": "Let me solve this step by step:\n1. Original price: $50\n2. Sale discount: 10% of $50 = $5\n3. Price after sale: $50 - $5 = $45\n4. Apply coupon: $45 - $8 = $37\n\n37", "extracted": "1.", "passed": true, "latency_ms": 1732.2866916656494}
+{"id": "math_discount_03", "domain": "math_discount", "expected": "50", "response": "Let's solve this step by step:\n1. Original price: $80\n2. Sale discount: 25% of $80 = $20\n3. Price after sale: $80 - $20 = $60\n4. Price after coupon: $60 - $10 = $50\n\n50", "extracted": "1.", "passed": true, "latency_ms": 1807.16872215271}
+{"id": "math_discount_04", "domain": "math_discount", "expected": "90", "response": "Let me solve this step by step:\n1. Original price: $120\n2. Sale discount: 15% of $120 = $18\n3. Price after sale: $120 - $18 = $102\n4. Apply coupon: $102 - $12 = $90\n\n90", "extracted": "1.", "passed": true, "latency_ms": 1624.4354248046875}
+{"id": "math_discount_05", "domain": "math_discount", "expected": "115", "response": "Let me solve this step by step:\n1. Original price: $200\n2. Sale discount: 30% of $200 = $60\n3. Price after sale: $200 - $60 = $140\n4. Price after coupon: $140 - $25 = $115\n\n115", "extracted": "1.", "passed": true, "latency_ms": 1601.900339126587}
+{"id": "math_discount_06", "domain": "math_discount", "expected": "53", "response": "Let me solve this step by step:\n1. Original price is $75\n2. 20% sale means 20% off: $75 \u00d7 0.8 = $60\n3. Then subtract the $7 coupon: $60 - $7 = $53\n\n53", "extracted": "1.", "passed": true, "latency_ms": 1786.752700805664}
+{"id": "math_discount_07", "domain": "math_discount", "expected": "70", "response": "Let me solve this step by step:\n1. Original price: $150\n2. Sale discount: 40% of $150 = $60\n3. Price after sale: $150 - $60 = $90\n4. Apply $20 coupon: $90 - $20 = $70\n\n70", "extracted": "1.", "passed": true, "latency_ms": 1925.631046295166}
+{"id": "math_discount_tax_01", "domain": "math_discount", "expected": "88", "response": "Let me solve this step by step:\n1. Original price: $100\n2. 20% discount: $100 \u00d7 0.8 = $80\n3. 10% sales tax on $80: $80 \u00d7 1.1 = $88\n4. Final price: $88\n\n88", "extracted": "1.", "passed": true, "latency_ms": 1915.6239032745361}
+{"id": "math_discount_tax_02", "domain": "math_discount", "expected": "229.5", "response": "Let me solve this step by step:\n1. Original price = $250\n2. 15% discount: $250 \u00d7 0.85 = $212.50\n3. 8% sales tax on $212.50: $212.50 \u00d7 1.08 = $229.50\n\n229.50", "extracted": "1.", "passed": true, "latency_ms": 1709.0051174163818}
+{"id": "math_discount_tax_03", "domain": "math_discount", "expected": "63", "response": "Let me solve this step by step:\n1. Original price: $80\n2. 25% discount: $80 \u00d7 0.75 = $60\n3. 5% sales tax on $60: $60 \u00d7 1.05 = $63\n\n63", "extracted": "1.", "passed": true, "latency_ms": 1663.9375686645508}
+{"id": "math_discount_tax_04", "domain": "math_discount", "expected": "481.5", "response": "Let me solve this step by step:\n\n1. Original price: $500\n2. Apply 10% discount: $500 \u00d7 0.9 = $450\n3. Add 7% sales tax: $450 \u00d7 1.07 = $481.50\n\n482", "extracted": "1.", "passed": true, "latency_ms": 1523.4806537628174}
+{"id": "math_discount_tax_05", "domain": "math_discount", "expected": "135.68", "response": "Let's solve this step by step:\n1. Original price = $160\n2. 20% discount = $160 \u00d7 0.20 = $32\n3. Price after discount = $160 - $32 = $128\n4. Sales tax = $128 \u00d7 0.06 = $7.68\n5. Final price = $128 + $7.68", "extracted": "1.", "passed": true, "latency_ms": 1783.5311889648438}
+{"id": "math_bogo_01", "domain": "math_discount", "expected": "60", "response": "Let me solve this step by step:\n1. First shirt costs $40\n2. Second shirt is half price: $40 \u00d7 0.5 = $20\n3. Total for 2 shirts: $40 + $20 = $60\n\n60", "extracted": "1.", "passed": true, "latency_ms": 1828.6585807800293}
+{"id": "math_bogo_02", "domain": "math_discount", "expected": "43.75", "response": "Let me solve this step by step:\n1. First shirt costs $25\n2. Second shirt gets 25% off: $25 \u00d7 0.75 = $18.75\n3. Total cost: $25 + $18.75 = $43.75\n\n43.75", "extracted": "1.", "passed": true, "latency_ms": 2347.2864627838135}
+{"id": "math_bogo_03", "domain": "math_discount", "expected": "96", "response": "Let me solve this step by step:\n1. First shirt costs $60\n2. Second shirt gets 40% off\n3. Second shirt price = $60 \u00d7 (1 - 0.40) = $60 \u00d7 0.60 = $36\n4. Total price = $60 + $36 = $96\n\n96", "extracted": "1.", "passed": true, "latency_ms": 2492.582082748413}
+{"id": "time_duration_01", "domain": "time", "expected": "4:45 PM", "response": "4:45 PM", "extracted": "4:45 PM", "passed": true, "latency_ms": 1068.356990814209}
+{"id": "time_duration_02", "domain": "time", "expected": "11:50 AM", "response": "11:50 AM", "extracted": "11:50 AM", "passed": true, "latency_ms": 973.7076759338379}
+{"id": "time_duration_03", "domain": "time", "expected": "12:15 PM", "response": "12:15 PM", "extracted": "12:15 PM", "passed": true, "latency_ms": 972.5644588470459}
+{"id": "time_duration_04", "domain": "time", "expected": "5:20 PM", "response": "Let me solve this step by step:\n\n1. Meeting starts at 3:45 PM\n2. Meeting lasts 1 hour 15 minutes, so it ends at 5:00 PM\n3. There's a 20 minute break after the meeting\n4. So the next session starts at ", "extracted": "3:45 PM", "passed": true, "latency_ms": 1822.364330291748}
+{"id": "time_duration_05", "domain": "time", "expected": "1:30 PM", "response": "12:30 PM", "extracted": "12:30 PM", "passed": false, "latency_ms": 998.4822273254395}
+{"id": "time_duration_06", "domain": "time", "expected": "12:30 PM", "response": "Let me solve this step by step:\n\n1. The meeting starts at 11:15 AM\n2. It lasts 45 minutes\n3. So the meeting ends at 12:00 PM\n4. Then there's a 30 minute break\n5. The next session starts at 12:30 PM\n\n1", "extracted": "11:15 AM", "passed": true, "latency_ms": 1623.4562397003174}
+{"id": "time_duration_07", "domain": "time", "expected": "9:15 PM", "response": "9:15 PM", "extracted": "9:15 PM", "passed": true, "latency_ms": 1013.6029720306396}
+{"id": "time_travel_01", "domain": "time", "expected": "11:50 AM", "response": "11:50 AM", "extracted": "11:50 AM", "passed": true, "latency_ms": 982.5129508972168}
+{"id": "time_travel_02", "domain": "time", "expected": "3:40 PM", "response": "3:40 PM", "extracted": "3:40 PM", "passed": true, "latency_ms": 998.2309341430664}
+{"id": "time_travel_03", "domain": "time", "expected": "10:00 AM", "response": "10:00 AM", "extracted": "10:00 AM", "passed": true, "latency_ms": 944.6098804473877}
+{"id": "time_travel_04", "domain": "time", "expected": "5:00 PM", "response": "5:00 PM", "extracted": "5:00 PM", "passed": true, "latency_ms": 981.2672138214111}
+{"id": "time_travel_05", "domain": "time", "expected": "9:45 AM", "response": "Let me solve this step by step:\n\n1. Train departure time is 7:45 AM\n2. Journey takes 1 hour 35 minutes\n3. So arrival time will be 7:45 AM + 1:35\n   \u2022 First, add hours: 7:45 AM + 1:00 = 8:45 AM\n   \u2022 Th", "extracted": "7:45 AM", "passed": true, "latency_ms": 2860.9840869903564}
+{"id": "time_multi_01", "domain": "time", "expected": "10:20 AM", "response": "10:20 AM", "extracted": "10:20 AM", "passed": true, "latency_ms": 845.4210758209229}
+{"id": "recipe_scale_01", "domain": "recipe", "expected": "6", "response": "Let's solve this step by step:\n1. Original recipe for 4 people needs 2 cups of flour\n2. Scaling to 6 people: (6/4) \u00d7 2 = 3 cups\n3. Doubling for a party: 3 \u00d7 2 = 6\n\n6", "extracted": "1.", "passed": true, "latency_ms": 1828.8705348968506}
+{"id": "recipe_scale_02", "domain": "recipe", "expected": "2.25", "response": "Let's solve this step by step:\n1. Original recipe (8 people): 3 eggs\n2. Scaled to 12 people: 3 \u00d7 (12/8) = 3 \u00d7 1.5 = 4.5 eggs\n3. Halved for a party: 4.5 \u00f7 2 = 2.25 eggs\n\n2.25", "extracted": "1.", "passed": true, "latency_ms": 1901.9122123718262}
+{"id": "recipe_scale_03", "domain": "recipe", "expected": "6", "response": "Let's solve this step by step:\n1. Original recipe (4 people): 1.5 cups of sugar\n2. Scaled to 8 people: 1.5 \u00d7 (8/4) = 1.5 \u00d7 2 = 3 cups\n3. Doubled for a party: 3 \u00d7 2 = 6 cups\n\nThe answer is 6.", "extracted": "1.", "passed": true, "latency_ms": 1857.306718826294}
+{"id": "recipe_scale_04", "domain": "recipe", "expected": "3", "response": "Let's solve this step by step:\n1. Original recipe (6 people): 4 tablespoons butter\n2. Scale to 9 people: (9/6) \u00d7 4 = 6 tablespoons\n3. Halved for a party: 6 \u00f7 2 = 3\n\n3", "extracted": "1.", "passed": true, "latency_ms": 2242.2499656677246}
+{"id": "recipe_scale_05", "domain": "recipe", "expected": "6", "response": "Let's solve this step by step:\n1. Original recipe for 5 people: 2 cups of milk\n2. Scale to 10 people: 2 \u00d7 (10/5) = 4 cups of milk\n3. Multiply by 1.5 for the party: 4 \u00d7 1.5 = 6\n\n6", "extracted": "1.", "passed": true, "latency_ms": 1703.7334442138672}
+{"id": "recipe_convert_01", "domain": "recipe", "expected": "360", "response": "Let's solve this step by step:\n1. 2 cups of milk = 2 \u00d7 240 ml = 480 ml\n2. Reduce by 25%: 480 \u00d7 0.75 = 360 ml\n\n360", "extracted": "1.", "passed": true, "latency_ms": 1638.1311416625977}
+{"id": "recipe_convert_02", "domain": "recipe", "expected": "3.3", "response": "Let's solve this step by step:\n\n1. Convert 500g to pounds:\n   \u2022 500 \u00f7 454 = 1.1\n\n2. Triple the amount:\n   \u2022 1.1 \u00d7 3 = 3.3\n\nThe answer is 3.3.", "extracted": "1.", "passed": true, "latency_ms": 1756.3865184783936}
+{"id": "financial_compound_01", "domain": "financial", "expected": "1168", "response": "Let's solve this step by step:\n\n1. Initial investment: $1000\n2. First year interest: $1000 \u00d7 0.10 = $100\n3. Balance after first year: $1000 + $100 = $1100\n4. Second year interest: $1100 \u00d7 0.10 = $110\n", "extracted": "1.", "passed": true, "latency_ms": 2502.6097297668457}
+{"id": "financial_compound_02", "domain": "financial", "expected": "5669.91", "response": "Let me solve this step by step:\n\n1. Initial investment = $5000\n2. Interest rate = 5% per year\n3. Years = 3\n4. Compounded yearly\n\nFirst, calculate the total amount after 3 years:\n\u2022 Year 1: $5000 \u00d7 1.05", "extracted": "1.", "passed": true, "latency_ms": 3122.943878173828}
+{"id": "financial_compound_03", "domain": "financial", "expected": "2249.6", "response": "Let's solve this step by step:\n\n1. Initial investment: $2000\n2. Annual interest rate: 8%\n3. First year: \n   \u2022 $2000 * 1.08 = $2,160\n\n4. Second year:\n   \u2022 $2,160 * 1.08 = $2,332.80\n\n5. Total interest g", "extracted": "1.", "passed": true, "latency_ms": 2934.3974590301514}
+{"id": "financial_compound_04", "domain": "financial", "expected": "614.48", "response": "Let me solve this step by step:\n\n1. Initial investment = $500\n2. Interest rate = 12% per year\n3. Years = 2\n4. Calculation of total amount before tax:\n   \u2022 First year: $500 \u00d7 1.12 = $560\n   \u2022 Second ye", "extracted": "1.", "passed": true, "latency_ms": 5129.877805709839}
+{"id": "financial_markup_01", "domain": "financial", "expected": "562.5", "response": "Let's solve this step by step:\n\n1. Original price = $500\n\n2. Markup at 25%:\n   \u2022 Markup amount = $500 \u00d7 0.25 = $125\n   \u2022 Price after markup = $500 + $125 = $625\n\n3. 10% member discount on $625:\n   \u2022 D", "extracted": "1.", "passed": true, "latency_ms": 2244.7335720062256}
+{"id": "financial_markup_02", "domain": "financial", "expected": "240", "response": "Let's solve this step by step:\n\n1) Original price = $200\n\n2) With 50% markup:\n   \u2022 Markup amount = $200 \u00d7 0.5 = $100\n   \u2022 Price after markup = $200 + $100 = $300\n\n3) With 20% member discount:\n   \u2022 Dis", "extracted": "1", "passed": true, "latency_ms": 2275.6948471069336}
+{"id": "financial_markup_03", "domain": "financial", "expected": "816", "response": "Let's solve this step by step:\n\n1) First, calculate the markup price\n   \u2022 Original price = $800\n   \u2022 Markup percentage = 20%\n   \u2022 Markup amount = $800 \u00d7 0.20 = $160\n   \u2022 Price after markup = $800 + $1", "extracted": "1", "passed": true, "latency_ms": 2666.01300239563}
+{"id": "financial_markup_04", "domain": "financial", "expected": "157.5", "response": "Let's solve this step by step:\n\n1) First, calculate the price after markup:\n   \u2022 Original price = $150\n   \u2022 Markup = 40%\n   \u2022 Price after markup = $150 + (40% \u00d7 $150)\n   \u2022 Price after markup = $150 + ", "extracted": "1", "passed": true, "latency_ms": 2844.623804092407}
+{"id": "financial_markup_05", "domain": "financial", "expected": "1170", "response": "Let's solve this step by step:\n1. Original price = $1000\n2. With 30% markup: $1000 * 1.3 = $1300\n3. With 10% member discount: $1300 * 0.9 = $1170\n\n1170", "extracted": "1.", "passed": true, "latency_ms": 1702.9638290405273}
+{"id": "financial_commission_01", "domain": "financial", "expected": "900", "response": "Let's solve this step by step:\n\n1. First $10,000 at 5%:\n   \u2022 $10,000 \u00d7 0.05 = $500\n\n2. Remaining $5,000 at 8%:\n   \u2022 $5,000 \u00d7 0.08 = $400\n\n3. Total commission:\n   \u2022 $500 + $400 = $900\n\n900", "extracted": "1.", "passed": true, "latency_ms": 1921.6413497924805}
+{"id": "unit_convert_01", "domain": "units", "expected": "13.125", "response": "Let's solve this step by step:\n\n1. Convert 10 miles to kilometers:\n   10 \u00d7 1.6 = 16 km\n\n2. Add 5 km:\n   16 + 5 = 21 km\n\n3. Convert 21 km back to miles:\n   21 \u00f7 1.6 = 13.125\n\n4. The answer is 13.125, s", "extracted": "1.", "passed": true, "latency_ms": 2135.6000900268555}
+{"id": "unit_convert_02", "domain": "units", "expected": "82", "response": "Let's solve this step by step:\n\n1) First, convert 100\u00b0F to Celsius:\n   C = (100 - 32) \u00d7 5/9\n   C = 68 \u00d7 5/9\n   C = 37.78\n\n2) Subtract 10\u00b0C:\n   37.78 - 10 = 27.78\n\n3) Convert back to Fahrenheit:\n   F =", "extracted": "1", "passed": false, "latency_ms": 2369.990825653076}
+{"id": "unit_volume_01", "domain": "units", "expected": "1875", "response": "Let's solve this step by step:\n\n1. Start with 2 liters = 2000 ml\n2. Add 500 ml: 2000 ml + 500 ml = 2500 ml\n3. Pour out 1/4 of 2500 ml: 2500 \u00d7 1/4 = 625 ml\n4. Remaining water: 2500 ml - 625 ml = 1875 m", "extracted": "1.", "passed": true, "latency_ms": 1937.8395080566406}
+{"id": "unit_volume_02", "domain": "units", "expected": "48", "response": "Let's solve this step by step:\n1. The tank starts with 50 gallons\n2. Drain 20%: 50 \u00d7 0.2 = 10 gallons removed\n3. Remaining in tank: 50 - 10 = 40 gallons\n4. Add 8 gallons: 40 + 8 = 48\n\n48", "extracted": "1.", "passed": true, "latency_ms": 1903.308391571045}
+{"id": "unit_volume_03", "domain": "units", "expected": "7000", "response": "Let's solve this step by step:\n\n1. Total pool capacity = 10,000 liters\n2. Fill to 75% = 10,000 \u00d7 0.75 = 7,500 liters\n3. Drain 500 liters = 7,500 - 500 = 7,000 liters\n\n7000", "extracted": "1.", "passed": true, "latency_ms": 1848.1707572937012}
+{"id": "unit_speed_01", "domain": "units", "expected": "3", "response": "Let's solve this step by step:\n\n1. First segment: 60 miles at 30 mph\n   \u2022 Time = Distance \u00f7 Speed\n   \u2022 Time = 60 \u00f7 30 = 2 hours\n\n2. Second segment: 40 miles at 40 mph\n   \u2022 Time = Distance \u00f7 Speed\n   \u2022", "extracted": "1.", "passed": true, "latency_ms": 2499.7081756591797}
+{"id": "unit_speed_02", "domain": "units", "expected": "80", "response": "Let's solve this step by step:\n\n1. First trip: 120 km in 1.5 hours\n   \u2022 Distance = 120 km\n   \u2022 Time = 1.5 hours\n\n2. Second trip: 80 km in 1 hour\n   \u2022 Distance = 80 km\n   \u2022 Time = 1 hour\n\n3. Total dist", "extracted": "1.", "passed": true, "latency_ms": 2687.4144077301025}
+{"id": "schedule_01", "domain": "scheduling", "expected": "2:00 PM", "response": "Let me solve this step by step:\n\n1. Task A starts at 9 AM and takes 2 hours\n\u2022 A finishes at 11 AM\n\n2. Task B must start after A finishes, so it starts at 11 AM and takes 3 hours\n\u2022 B will finish at 2 P", "extracted": "1.", "passed": true, "latency_ms": 2605.8125495910645}
+{"id": "schedule_02", "domain": "scheduling", "expected": "95", "response": "Let's solve this step by step:\n\n1. Process X takes 45 minutes\n   \u2022 This starts at the beginning\n   \u2022 Finishes after 45 minutes\n\n2. Process Y takes 30 minutes\n   \u2022 Needs X's output, so it can only star", "extracted": "1.", "passed": true, "latency_ms": 2928.558826446533}
+{"id": "schedule_03", "domain": "scheduling", "expected": "50", "response": "Let me solve this step by step:\n\n1. Download time = 10 minutes\n2. Install time = 15 minutes (after download)\n3. Configuration time = 5 minutes (after install)\n4. Testing time = 20 minutes (after confi", "extracted": "1.", "passed": true, "latency_ms": 3303.211212158203}
+{"id": "schedule_04", "domain": "scheduling", "expected": "3:00 PM", "response": "Let me solve this step by step:\n\n1. Path 1 starts at 10 AM\n\u2022 Task A takes 2 hours, so it finishes at 12 PM\n\u2022 Task B takes 3 hours, so it finishes at 3 PM\n\n2. Path 2 starts at 10 AM\n\u2022 Task C takes 4 ho", "extracted": "1.", "passed": false, "latency_ms": 2456.190347671509}
+{"id": "schedule_05", "domain": "scheduling", "expected": "60", "response": "Let's solve this step by step:\n\n1. Team A's total time:\n   \u2022 3 tasks \u00d7 20 mins = 60 mins\n\n2. Team B's total time:\n   \u2022 2 tasks \u00d7 25 mins = 50 mins\n\n3. Since they work in parallel, they will finish whe", "extracted": "1.", "passed": true, "latency_ms": 2753.07297706604}
+{"id": "schedule_06", "domain": "scheduling", "expected": "2.4", "response": "Let's solve this step by step:\n\n1) First, let's find the rate at which each worker completes the job:\n   \u2022 Worker A: 1 job in 6 hours = 1/6 job per hour\n   \u2022 Worker B: 1 job in 4 hours = 1/4 job per h", "extracted": "1", "passed": false, "latency_ms": 3573.161840438843}
+{"id": "schedule_07", "domain": "scheduling", "expected": "5", "response": "Let's solve this step by step:\n\n1. First, find the total pages printed per minute by both printers\n   \u2022 First printer: 30 pages/min\n   \u2022 Second printer: 20 pages/min\n   \u2022 Combined rate = 30 + 20 = 50 ", "extracted": "1.", "passed": true, "latency_ms": 2611.100912094116}
+{"id": "logic_order_01", "domain": "logic", "expected": "Alice, Eve, Bob, Carol, Dave", "response": "Let me solve this step by step:\n\n1. Alice finishes before Bob, so Alice is ahead of Bob in the order.\n\n2. Carol finishes after Bob but before Dave, so Bob comes before Carol, and Carol comes before Da", "extracted": "Let me solve this step by step:\n\n1. Alice finishes before Bob, so Alice is ahead of Bob in the order.\n\n2. Carol finishes after Bob but before Dave, so Bob comes before Carol, and Carol comes before Dave.\n\n3. Eve finishes between Alice and Bob, so the order must be: Alice", "passed": true, "latency_ms": 2504.011631011963}
+{"id": "logic_order_02", "domain": "logic", "expected": "Yellow, Red, Blue, Orange, Green", "response": "Let's solve this step by step:\n\n1. Yellow is left of Red\n   So Yellow comes before Red\n\n2. Red is left of Blue\n   So Red comes before Blue\n\n3. Blue is left of Green\n   So Blue comes before Green\n\n4. O", "extracted": "Let's solve this step by step:\n\n1. Yellow is left of Red\n   So Yellow comes before Red\n\n2. Red is left of Blue\n   So Red comes before Blue\n\n3. Blue is left of Green\n   So Blue comes before Green\n\n4. Orange is between Blue and Green\n   So the order around Blue and Green is: Blue, Orange, Green\n\n5. Putting it all together, from left to right:\n   Yellow, Red", "passed": true, "latency_ms": 2462.045907974243}
+{"id": "logic_modus_01", "domain": "logic", "expected": "yes", "response": "Let me solve this step by step:\n\n1. It rained (this is a given fact)\n2. If it rains, the ground is wet (first statement)\n3. If the ground is wet, the game is cancelled (second statement)\n4. Since it r", "extracted": "yes", "passed": true, "latency_ms": 2172.9440689086914}
+{"id": "logic_modus_02", "domain": "logic", "expected": "yes", "response": "Let's solve this step by step:\n\n1. \"The battery is dead\" is given as a true statement.\n\n2. If the battery is dead, the car won't start (true)\n\n3. If the car won't start, I'll be late (true)\n\n4. If I'm", "extracted": "yes", "passed": true, "latency_ms": 2520.4737186431885}
+{"id": "logic_modus_03", "domain": "logic", "expected": "cannot determine", "response": "Let's solve this step by step:\n\n1. Given statements:\n\u2022 All programmers know logic\n\u2022 All logicians are good at puzzles\n\u2022 Sam is a programmer\n\n2. We want to know: Is Sam good at puzzles?\n\n3. Looking at ", "extracted": "1.", "passed": true, "latency_ms": 3793.8151359558105}
+{"id": "logic_sets_01", "domain": "logic", "expected": "45", "response": "Let me solve this step by step:\n\n1) First, I'll use the Addition Principle of Set Theory:\n   Total = (Math students) + (Science students) - (Students in both)\n\n2) Math students: 30\n   Science students", "extracted": "1", "passed": true, "latency_ms": 2419.2349910736084}
+{"id": "logic_sets_02", "domain": "logic", "expected": "5", "response": "Let's solve this step by step:\n\n1) Let's use set theory to solve this.\n   \u2022 Total people = 50\n   \u2022 People speaking English = 35\n   \u2022 People speaking Spanish = 30\n   \u2022 People speaking both = 20\n\n2) Let", "extracted": "1", "passed": true, "latency_ms": 2971.888303756714}
+{"id": "logic_sets_03", "domain": "logic", "expected": "10", "response": "Let's solve this step by step:\n\n1. Total people surveyed = 100\n\n2. People with dogs = 60\n   People with cats = 40\n   People with both dogs and cats = 15\n   People with only fish = 25\n\n3. People with d", "extracted": "1.", "passed": true, "latency_ms": 3558.4352016448975}
+{"id": "spatial_direction_01", "domain": "spatial", "expected": "South", "response": "South", "extracted": "South", "passed": true, "latency_ms": 810.6105327606201}
+{"id": "spatial_direction_02", "domain": "spatial", "expected": "West", "response": "West", "extracted": "West", "passed": true, "latency_ms": 1200.2496719360352}
+{"id": "spatial_direction_03", "domain": "spatial", "expected": "West", "response": "West", "extracted": "West", "passed": true, "latency_ms": 829.9908638000488}
+{"id": "spatial_grid_01", "domain": "spatial", "expected": "(2,2)", "response": "Let's solve this step by step:\n\n1. Start at (0,0)\n2. Move right 3 steps: x becomes 3\n3. Move up 2 steps: y becomes 2\n4. Move left 1 step: x becomes 2\n\nSo the final position is (2,2).\n\nThe answer is (2", "extracted": "1.", "passed": true, "latency_ms": 1741.8482303619385}
+{"id": "spatial_grid_02", "domain": "spatial", "expected": "(7,3)", "response": "Let's solve this step by step:\n\n1. Start at (5,5)\n2. Move left 2: (5-2, 5) = (3,5)\n3. Move down 3: (3, 5-3) = (3,2)\n4. Move right 4: (3+4, 2) = (7,2)\n5. Move up 1: (7, 2+1) = (7,3)\n\nThe final position", "extracted": "1.", "passed": true, "latency_ms": 1919.7781085968018}
+{"id": "spatial_relative_01", "domain": "spatial", "expected": "Southeast", "response": "Let's solve this step by step:\n1. A is north of B\n2. C is east of B\n3. D is south of C\n\nSo let's map this out:\n\u2022 First, B is at the center of our mental map\n\u2022 A is north of B\n\u2022 C is east of B\n\u2022 D is s", "extracted": "1.", "passed": false, "latency_ms": 3456.7742347717285}
+{"id": "spatial_relative_02", "domain": "spatial", "expected": "yes", "response": "Let's solve this step by step:\n\n1. The library is 2 blocks east of the park\n2. The cafe is 3 blocks north of the library\n3. The museum is 2 blocks west of the cafe\n\nSo, relative to the library's posit", "extracted": "no", "passed": false, "latency_ms": 3633.281469345093}
+{"id": "procedural_state_01", "domain": "procedural", "expected": "Yellow", "response": "Let's solve this step by step:\n\n1) Current color is Green\n   \n2) First change: Green \u2192 Yellow\n   \n3) Second change: Yellow \u2192 Red\n   \n4) Third change: Red \u2192 Green\n   \n5) Fourth change: Green \u2192 Yellow\n\n", "extracted": "1", "passed": true, "latency_ms": 1871.0801601409912}
+{"id": "procedural_state_02", "domain": "procedural", "expected": "Locked", "response": "Let's walk through the steps:\n\n1. Starting state: Locked\n2. Unlock \u2192 Closed\n3. Open \u2192 Open\n4. Close \u2192 Closed\n5. Lock \u2192 Locked\n\nSo the final state is: Locked", "extracted": "1.", "passed": true, "latency_ms": 1411.0980033874512}
+{"id": "procedural_recipe_01", "domain": "procedural", "expected": "Added milk before steeping", "response": "Let's analyze the original sequence of steps for making tea:\n1. Boil water\n2. Add tea bag\n3. Steep 3 min\n4. Remove bag\n5. Add milk\n\nIf you do the steps in the order 1, 2, 5, 3, 4, here's what happens:", "extracted": "1.", "passed": false, "latency_ms": 3898.632049560547}
+{"id": "procedural_recipe_02", "domain": "procedural", "expected": "Passwor1", "response": "Pass1234", "extracted": "1234", "passed": false, "latency_ms": 1206.291675567627}
+{"id": "procedural_undo_01", "domain": "procedural", "expected": "Hello World?", "response": "Let's solve this step by step:\n\n1. Initial text: \"Hello\"\n\n2. Append ' World': \n   Text becomes \"Hello World\"\n\n3. Append '!': \n   Text becomes \"Hello World!\"\n\n4. Undo: \n   This removes the last action ", "extracted": "1.", "passed": true, "latency_ms": 2253.5808086395264}
+{"id": "procedural_undo_02", "domain": "procedural", "expected": "empty", "response": "Let me walk through the stack operations step by step:\n\n1. Start: Stack is empty []\n2. Push A: Stack is [A]\n3. Push B: Stack is [A, B]\n4. Pop: Removes B, stack is now [A]\n5. Push C: Stack is [A, C]\n6.", "extracted": "1.", "passed": true, "latency_ms": 2251.429319381714}
+{"id": "text_string_01", "domain": "text", "expected": "LLEH", "response": "Let me solve this step by step:\n\n1. The original word is 'HELLO'\n\n2. Reversed, it becomes 'OLLEH'\n\n3. Removing the first letter gives 'LLEH'\n\nSo the result is 'LLEH'.", "extracted": "1.", "passed": true, "latency_ms": 1574.5420455932617}
+{"id": "text_string_02", "domain": "text", "expected": "DCB", "response": "Let me solve this step by step:\n\n1. Start with 'ABCDE'\n2. Remove vowels (A, E): 'BCD'\n3. Reverse 'BCD': 'DCB'\n\nThe result is 'DCB'.", "extracted": "1.", "passed": true, "latency_ms": 1504.4054985046387}
+{"id": "text_string_03", "domain": "text", "expected": "PRGR", "response": "Let me solve this step by step:\n\n1. The word is 'PROGRAMMING'\n2. The consonants are: P, R, G, R, M, M, N, G\n3. Take the first 4 consonants: P, R, G, R\n4. The result is 'PRGR'\n\nSo the answer is 'PRGR'.", "extracted": "1.", "passed": true, "latency_ms": 1877.6586055755615}
+{"id": "text_word_01", "domain": "text", "expected": "fox", "response": "Let me solve this step by step:\n\n1. Original sentence: 'The quick brown fox'\n2. Reversed word order: 'fox brown quick The'\n3. First word in the reversed order: 'fox'\n\nSo the answer is 'fox'.", "extracted": "1.", "passed": true, "latency_ms": 1498.9027976989746}
+{"id": "text_word_02", "domain": "text", "expected": "CDB", "response": "Let me solve this step by step:\n\n1. 'CAT' becomes 'C'\n2. 'DOG' becomes 'D'\n3. 'BIRD' becomes 'B'\n\nCombining these: CDB\n\nThe result is \"CDB\".", "extracted": "1.", "passed": true, "latency_ms": 1684.577465057373}
+{"id": "text_encode_01", "domain": "text", "expected": "BZS", "response": "Let's solve this step by step:\n\n1. Start with 'CAT'\n2. Shift each letter forward by 1:\n   \u2022 C \u2192 D\n   \u2022 A \u2192 B\n   \u2022 T \u2192 U\n   So now the word is 'DBU'\n3. Shift each letter backward by 2:\n   \u2022 D \u2192 B\n   \u2022 ", "extracted": "1.", "passed": true, "latency_ms": 2281.977891921997}
+{"id": "text_encode_02", "domain": "text", "expected": "HILLU", "response": "Let's go through this step by step:\n\nH: Unchanged\nE \u2192 I\nL: Unchanged\nL: Unchanged\nO \u2192 U\n\nSo, the result is: HILLO", "extracted": "", "passed": false, "latency_ms": 1374.7119903564453}
+{"id": "sequence_letter_01", "domain": "sequence", "expected": "I", "response": "Let's solve this step by step:\n\n1. Let's look at the letters in order: A, C, E, G, ____\n\n2. Let's see how the letters are changing:\n   \u2022 From A to C: 2 letter jump\n   \u2022 From C to E: 2 letter jump\n   \u2022", "extracted": "1.", "passed": true, "latency_ms": 3238.2261753082275}
+{"id": "sequence_letter_02", "domain": "sequence", "expected": "R", "response": "Let's solve this step by step:\n\n1) First, let's look at the letters in order: Z, X, V, T, _\n\n2) Let's see if there's a pattern in how these letters are changing:\n   Z \u2192 X (moved 2 letters back in the ", "extracted": "1", "passed": true, "latency_ms": 4357.334136962891}
+{"id": "sequence_letter_03", "domain": "sequence", "expected": "P", "response": "Let's solve this step by step:\n\n1) First, let's look at the pattern between the letters:\n   A \u2192 B (moves forward 1 letter)\n   B \u2192 D (moves forward 2 letters)\n   D \u2192 G (moves forward 3 letters)\n   G \u2192 ", "extracted": "1", "passed": true, "latency_ms": 4180.105924606323}
+{"id": "sequence_symbol_01", "domain": "sequence", "expected": "\u25cb", "response": "Let me help you solve this step-by-step:\n\n1. Let's look at the pattern: \u25cf\u25cb\u25cf\u25cb\u25cf_\n\n2. I see an alternating pattern of black (\u25cf) and white (\u25cb) dots\n\n3. The pattern so far goes:\n   Black (\u25cf)\n   White (\u25cb)\n ", "extracted": "1.", "passed": true, "latency_ms": 3127.4898052215576}
+{"id": "sequence_symbol_02", "domain": "sequence", "expected": "\u25cb", "response": "Let's solve this step by step:\n\n1) First, let's look at the pattern: \u25cf\u25cf\u25cb\u25cf\u25cf\u25cb\u25cf\u25cf_\n\n2) Let's count the sequence of \u25cf and \u25cb symbols from left to right:\n   \u2022 \u25cf\u25cf (2 black dots)\n   \u2022 \u25cb (1 white dot)\n   \u2022 \u25cf\u25cf (", "extracted": "1", "passed": true, "latency_ms": 3472.6948738098145}
+{"id": "sequence_word_01", "domain": "sequence", "expected": "four", "response": "The missing word is \"four\". \n\nThe pattern is a simple counting sequence in English, starting from one and going up to five in order:\n\none, two, three, four, five\n\nSo \"four\" is the word that fills in t", "extracted": "", "passed": true, "latency_ms": 1698.1353759765625}
+{"id": "sequence_word_02", "domain": "sequence", "expected": "September", "response": "Let me solve this step by step:\n\n1. Look at the given months: January, March, May, July\n2. Notice the pattern of months:\n   \u2022 These are all odd-numbered months\n   \u2022 They are spaced 2 months apart\n3. F", "extracted": "1.", "passed": true, "latency_ms": 2551.496744155884}
+{"id": "causal_chain_01", "domain": "causal", "expected": "The power went out", "response": "According to the sequence of events you described, the root cause of everyone getting sick was the power going out, which initiated the chain of events leading to food spoilage and subsequent illness.", "extracted": "", "passed": false, "latency_ms": 1402.4343490600586}
+{"id": "causal_chain_02", "domain": "causal", "expected": "He is late", "response": "Let's solve this step by step:\n\n1. First, the alarm didn't ring.\n2. If the alarm doesn't ring, Tom oversleeps.\n3. If Tom oversleeps, he misses the bus.\n4. If he misses the bus, he's late for work.\n\nTh", "extracted": "1.", "passed": false, "latency_ms": 1652.44460105896}
+{"id": "causal_counter_01", "domain": "causal", "expected": "no", "response": "Unknown. While lack of water is a common cause of plant death, there could be other factors that might have caused the plant to die even if it had been watered, such as disease, pests, extreme tempera", "extracted": "no", "passed": true, "latency_ms": 1706.676959991455}
+{"id": "causal_counter_02", "domain": "causal", "expected": "no", "response": "No, if the dial worked correctly, the cake would not have burned. The broken dial caused the oven to be set at too high a temperature, which led to the cake burning.", "extracted": "no", "passed": true, "latency_ms": 1319.746732711792}
+{"id": "causal_necessary_01", "domain": "causal", "expected": "not necessarily", "response": "Not necessarily.\n\nWhile water is essential for plant growth, it is not the only factor determining whether a plant will grow successfully. Other critical factors include:\n\n1. Sunlight\n2. Proper temper", "extracted": "", "passed": true, "latency_ms": 2550.2684116363525}
+{"id": "causal_necessary_02", "domain": "causal", "expected": "no", "response": "No.", "extracted": "no", "passed": true, "latency_ms": 870.8951473236084}
+{"id": "causal_necessary_03", "domain": "causal", "expected": "yes", "response": "Yes", "extracted": "yes", "passed": true, "latency_ms": 851.525068283081}

results/claude-3-5-haiku-latest_20260103_184241_summary.json ADDED Viewed

	@@ -0,0 +1,71 @@

+{
+  "total": 101,
+  "passed": 90,
+  "failed": 11,
+  "pass_rate": 0.8910891089108911,
+  "by_domain": {
+    "math_discount": {
+      "passed": 15,
+      "total": 15,
+      "pass_rate": 1.0
+    },
+    "time": {
+      "passed": 12,
+      "total": 13,
+      "pass_rate": 0.9230769230769231
+    },
+    "recipe": {
+      "passed": 7,
+      "total": 7,
+      "pass_rate": 1.0
+    },
+    "financial": {
+      "passed": 10,
+      "total": 10,
+      "pass_rate": 1.0
+    },
+    "units": {
+      "passed": 6,
+      "total": 7,
+      "pass_rate": 0.8571428571428571
+    },
+    "scheduling": {
+      "passed": 5,
+      "total": 7,
+      "pass_rate": 0.7142857142857143
+    },
+    "logic": {
+      "passed": 8,
+      "total": 8,
+      "pass_rate": 1.0
+    },
+    "spatial": {
+      "passed": 5,
+      "total": 7,
+      "pass_rate": 0.7142857142857143
+    },
+    "procedural": {
+      "passed": 4,
+      "total": 6,
+      "pass_rate": 0.6666666666666666
+    },
+    "text": {
+      "passed": 6,
+      "total": 7,
+      "pass_rate": 0.8571428571428571
+    },
+    "sequence": {
+      "passed": 7,
+      "total": 7,
+      "pass_rate": 1.0
+    },
+    "causal": {
+      "passed": 5,
+      "total": 7,
+      "pass_rate": 0.7142857142857143
+    }
+  },
+  "avg_latency_ms": 2109.3481106333215,
+  "model": "claude-3-5-haiku-latest",
+  "timestamp": "20260103_184241"
+}

results/claude-sonnet-4-20250514_20260103_184954_results.jsonl ADDED Viewed

	@@ -0,0 +1,101 @@

+{"id": "math_discount_01", "domain": "math_discount", "expected": "15", "response": "15", "extracted": "15", "passed": true, "latency_ms": 1202.0678520202637}
+{"id": "math_discount_02", "domain": "math_discount", "expected": "37", "response": "37", "extracted": "37", "passed": true, "latency_ms": 1264.0960216522217}
+{"id": "math_discount_03", "domain": "math_discount", "expected": "50", "response": "50", "extracted": "50", "passed": true, "latency_ms": 1305.506944656372}
+{"id": "math_discount_04", "domain": "math_discount", "expected": "90", "response": "90", "extracted": "90", "passed": true, "latency_ms": 982.0027351379395}
+{"id": "math_discount_05", "domain": "math_discount", "expected": "115", "response": "115", "extracted": "115", "passed": true, "latency_ms": 1125.2100467681885}
+{"id": "math_discount_06", "domain": "math_discount", "expected": "53", "response": "53", "extracted": "53", "passed": true, "latency_ms": 1066.4362907409668}
+{"id": "math_discount_07", "domain": "math_discount", "expected": "70", "response": "70", "extracted": "70", "passed": true, "latency_ms": 1535.1293087005615}
+{"id": "math_discount_tax_01", "domain": "math_discount", "expected": "88", "response": "88", "extracted": "88", "passed": true, "latency_ms": 1480.4530143737793}
+{"id": "math_discount_tax_02", "domain": "math_discount", "expected": "229.5", "response": "230.50", "extracted": "230.50", "passed": true, "latency_ms": 1100.987195968628}
+{"id": "math_discount_tax_03", "domain": "math_discount", "expected": "63", "response": "63", "extracted": "63", "passed": true, "latency_ms": 964.9076461791992}
+{"id": "math_discount_tax_04", "domain": "math_discount", "expected": "481.5", "response": "481.50", "extracted": "481.50", "passed": true, "latency_ms": 1567.6820278167725}
+{"id": "math_discount_tax_05", "domain": "math_discount", "expected": "135.68", "response": "135.68", "extracted": "135.68", "passed": true, "latency_ms": 1486.6652488708496}
+{"id": "math_bogo_01", "domain": "math_discount", "expected": "60", "response": "60", "extracted": "60", "passed": true, "latency_ms": 1083.873987197876}
+{"id": "math_bogo_02", "domain": "math_discount", "expected": "43.75", "response": "43.75", "extracted": "43.75", "passed": true, "latency_ms": 1301.2514114379883}
+{"id": "math_bogo_03", "domain": "math_discount", "expected": "96", "response": "96", "extracted": "96", "passed": true, "latency_ms": 1392.0691013336182}
+{"id": "time_duration_01", "domain": "time", "expected": "4:45 PM", "response": "4:45 PM", "extracted": "4:45 PM", "passed": true, "latency_ms": 1178.0014038085938}
+{"id": "time_duration_02", "domain": "time", "expected": "11:50 AM", "response": "11:50 AM", "extracted": "11:50 AM", "passed": true, "latency_ms": 1198.1875896453857}
+{"id": "time_duration_03", "domain": "time", "expected": "12:15 PM", "response": "12:15 PM", "extracted": "12:15 PM", "passed": true, "latency_ms": 1300.0762462615967}
+{"id": "time_duration_04", "domain": "time", "expected": "5:20 PM", "response": "5:20 PM", "extracted": "5:20 PM", "passed": true, "latency_ms": 1544.1699028015137}
+{"id": "time_duration_05", "domain": "time", "expected": "1:30 PM", "response": "12:30 PM", "extracted": "12:30 PM", "passed": false, "latency_ms": 1193.523645401001}
+{"id": "time_duration_06", "domain": "time", "expected": "12:30 PM", "response": "12:30 PM", "extracted": "12:30 PM", "passed": true, "latency_ms": 1351.5722751617432}
+{"id": "time_duration_07", "domain": "time", "expected": "9:15 PM", "response": "9:15 PM", "extracted": "9:15 PM", "passed": true, "latency_ms": 1697.411298751831}
+{"id": "time_travel_01", "domain": "time", "expected": "11:50 AM", "response": "12:00 PM", "extracted": "12:00 PM", "passed": false, "latency_ms": 1147.132396697998}
+{"id": "time_travel_02", "domain": "time", "expected": "3:40 PM", "response": "3:40 PM", "extracted": "3:40 PM", "passed": true, "latency_ms": 1338.7584686279297}
+{"id": "time_travel_03", "domain": "time", "expected": "10:00 AM", "response": "10:00 AM", "extracted": "10:00 AM", "passed": true, "latency_ms": 1192.3644542694092}
+{"id": "time_travel_04", "domain": "time", "expected": "5:00 PM", "response": "5:00 PM", "extracted": "5:00 PM", "passed": true, "latency_ms": 1159.924030303955}
+{"id": "time_travel_05", "domain": "time", "expected": "9:45 AM", "response": "I need to calculate the arrival time and then add the waiting time.\n\nDeparture: 7:45 AM\nJourney time: 1 hour 35 minutes\nArrival: 7:45 AM + 1:35 = 9:20 AM\n\nWait time: 25 minutes\nConnection boarding tim", "extracted": "7:45 AM", "passed": true, "latency_ms": 2538.8712882995605}
+{"id": "time_multi_01", "domain": "time", "expected": "10:20 AM", "response": "10:20 AM", "extracted": "10:20 AM", "passed": true, "latency_ms": 1136.9469165802002}
+{"id": "recipe_scale_01", "domain": "recipe", "expected": "6", "response": "6", "extracted": "6", "passed": true, "latency_ms": 956.7124843597412}
+{"id": "recipe_scale_02", "domain": "recipe", "expected": "2.25", "response": "2.25", "extracted": "2.25", "passed": true, "latency_ms": 1098.292350769043}
+{"id": "recipe_scale_03", "domain": "recipe", "expected": "6", "response": "6", "extracted": "6", "passed": true, "latency_ms": 1172.297477722168}
+{"id": "recipe_scale_04", "domain": "recipe", "expected": "3", "response": "I need to scale the recipe from 6 people to 9 people, then halve it.\n\nOriginal recipe: 6 people need 4 tablespoons butter\n\nScale to 9 people:\n9 \u00f7 6 = 1.5\n4 tablespoons \u00d7 1.5 = 6 tablespoons\n\nThen halv", "extracted": "6", "passed": true, "latency_ms": 2426.9192218780518}
+{"id": "recipe_scale_05", "domain": "recipe", "expected": "6", "response": "6", "extracted": "6", "passed": true, "latency_ms": 1317.0630931854248}
+{"id": "recipe_convert_01", "domain": "recipe", "expected": "360", "response": "360", "extracted": "360", "passed": true, "latency_ms": 1231.6324710845947}
+{"id": "recipe_convert_02", "domain": "recipe", "expected": "3.3", "response": "I need to convert 500g of flour to pounds, then triple it.\n\nFirst, convert 500g to pounds:\n500g \u00f7 454g/pound = 1.101 pounds\n\nThen triple it for the large batch:\n1.101 \u00d7 3 = 3.303 pounds\n\nRounded to on", "extracted": "500", "passed": true, "latency_ms": 2781.614303588867}
+{"id": "financial_compound_01", "domain": "financial", "expected": "1168", "response": "1216", "extracted": "1216", "passed": false, "latency_ms": 834.8634243011475}
+{"id": "financial_compound_02", "domain": "financial", "expected": "5669.91", "response": "I need to calculate compound interest, then subtract taxes on the gains.\n\nInitial investment: $5,000\nInterest rate: 5% annually\nTime: 3 years\nTax rate: 15% on gains only\n\nStep 1: Calculate compound in", "extracted": "", "passed": true, "latency_ms": 3585.533857345581}
+{"id": "financial_compound_03", "domain": "financial", "expected": "2249.6", "response": "2185.60", "extracted": "2185.60", "passed": false, "latency_ms": 1319.6077346801758}
+{"id": "financial_compound_04", "domain": "financial", "expected": "614.48", "response": "I need to calculate compound interest, then subtract tax on gains.\n\nInitial investment: $500\nInterest rate: 12% annually\nTime: 2 years\n\nAfter 2 years: $500 \u00d7 (1.12)\u00b2 = $500 \u00d7 1.2544 = $627.20\n\nGains: ", "extracted": "", "passed": true, "latency_ms": 2675.029993057251}
+{"id": "financial_markup_01", "domain": "financial", "expected": "562.5", "response": "412.50", "extracted": "412.50", "passed": false, "latency_ms": 1161.397933959961}
+{"id": "financial_markup_02", "domain": "financial", "expected": "240", "response": "180", "extracted": "180", "passed": false, "latency_ms": 932.5790405273438}
+{"id": "financial_markup_03", "domain": "financial", "expected": "816", "response": "768", "extracted": "768", "passed": false, "latency_ms": 1317.3847198486328}
+{"id": "financial_markup_04", "domain": "financial", "expected": "157.5", "response": "126", "extracted": "126", "passed": false, "latency_ms": 947.8399753570557}
+{"id": "financial_markup_05", "domain": "financial", "expected": "1170", "response": "910", "extracted": "910", "passed": false, "latency_ms": 1207.2951793670654}
+{"id": "financial_commission_01", "domain": "financial", "expected": "900", "response": "900", "extracted": "900", "passed": true, "latency_ms": 961.1525535583496}
+{"id": "unit_convert_01", "domain": "units", "expected": "13.125", "response": "21.25", "extracted": "21.25", "passed": false, "latency_ms": 1145.0726985931396}
+{"id": "unit_convert_02", "domain": "units", "expected": "82", "response": "I need to convert 100\u00b0F to Celsius, subtract 10\u00b0C, then convert back to Fahrenheit.\n\nStep 1: Convert 100\u00b0F to Celsius\nC = (F - 32) \u00d7 5/9\nC = (100 - 32) \u00d7 5/9\nC = 68 \u00d7 5/9\nC = 340/9\nC = 37.78\u00b0C\n\nStep 2", "extracted": "100", "passed": true, "latency_ms": 2750.9498596191406}
+{"id": "unit_volume_01", "domain": "units", "expected": "1875", "response": "1875", "extracted": "1875", "passed": true, "latency_ms": 1108.9894771575928}
+{"id": "unit_volume_02", "domain": "units", "expected": "48", "response": "48", "extracted": "48", "passed": true, "latency_ms": 1425.4415035247803}
+{"id": "unit_volume_03", "domain": "units", "expected": "7000", "response": "7000", "extracted": "7000", "passed": true, "latency_ms": 1113.5900020599365}
+{"id": "unit_speed_01", "domain": "units", "expected": "3", "response": "3", "extracted": "3", "passed": true, "latency_ms": 1721.369981765747}
+{"id": "unit_speed_02", "domain": "units", "expected": "80", "response": "80", "extracted": "80", "passed": true, "latency_ms": 954.6058177947998}
+{"id": "schedule_01", "domain": "scheduling", "expected": "2:00 PM", "response": "I need to work through this step by step.\n\nStarting time: 9 AM\n\nTask A: Takes 2 hours, starts at 9 AM\n- Finishes at 11 AM\n\nTask B: Takes 3 hours, must start after A finishes\n- Starts at 11 AM\n- Finish", "extracted": "9", "passed": true, "latency_ms": 3201.542854309082}
+{"id": "schedule_02", "domain": "scheduling", "expected": "95", "response": "I need to find the total time for processes X, Y, and Z that run sequentially.\n\nGiven information:\n- Process X takes 45 minutes\n- Process Y takes 30 minutes and needs X's output (so Y starts after X f", "extracted": "", "passed": true, "latency_ms": 2864.945888519287}
+{"id": "schedule_03", "domain": "scheduling", "expected": "50", "response": "I need to add up the sequential steps:\n\n- Download: 10 minutes\n- Install: 15 minutes (after download)\n- Configuration: 5 minutes (after install)\n- Testing: 20 minutes (after config)\n\nSince each step h", "extracted": "10", "passed": true, "latency_ms": 2501.664638519287}
+{"id": "schedule_04", "domain": "scheduling", "expected": "3:00 PM", "response": "Looking at both paths:\n\nPath 1: A(2h) + B(3h) = 5 hours total\nPath 2: C(4h) = 4 hours total\n\nThe limiting path is Path 1 at 5 hours.\n\nStarting at 10 AM + 5 hours = 3 PM", "extracted": "1", "passed": true, "latency_ms": 2242.5687313079834}
+{"id": "schedule_05", "domain": "scheduling", "expected": "60", "response": "I need to calculate when each team finishes their tasks, since they work in parallel.\n\n**Team A:**\n- 3 tasks of 20 minutes each (sequential)\n- Total time = 3 \u00d7 20 = 60 minutes\n\n**Team B:**\n- 2 tasks o", "extracted": "", "passed": true, "latency_ms": 3033.62774848938}
+{"id": "schedule_06", "domain": "scheduling", "expected": "2.4", "response": "I need to find how long it takes workers A and B to complete one job working together.\n\nFirst, let me find each worker's rate of work per hour.\n\nWorker A completes 1 job in 6 hours, so:\n- Worker A's r", "extracted": "", "passed": false, "latency_ms": 3168.7917709350586}
+{"id": "schedule_07", "domain": "scheduling", "expected": "5", "response": "I need to find how long it takes for both printers working together to print 250 pages.\n\nFirst, let me find their combined printing rate:\n- Printer 1: 30 pages/min\n- Printer 2: 20 pages/min\n- Combined", "extracted": "250", "passed": true, "latency_ms": 2863.2404804229736}
+{"id": "logic_order_01", "domain": "logic", "expected": "Alice, Eve, Bob, Carol, Dave", "response": "I need to work through the race finishing order step by step using the given constraints.\n\nGiven information:\n- Alice finishes before Bob\n- Carol finishes after Bob but before Dave\n- Eve finishes betw", "extracted": "I need to work through the race finishing order step by step using the given constraints.\n\nGiven information:\n- Alice finishes before Bob\n- Carol finishes after Bob but before Dave\n- Eve finishes between Alice and Bob\n\nLet me establish the relationships:\n\n1. Alice finishes before Bob: Alice \u2192 Bob\n2. Eve finishes between Alice and Bob: Alice \u2192 Eve \u2192 Bob\n3. Carol finishes after Bob but before Dave: Bob \u2192 Carol \u2192 Dave\n\nNow I can combine these relationships:\n- From constraints 1 and 2: Alice \u2192 Eve \u2192 Bob\n- From constraint 3: Bob \u2192 Carol \u2192 Dave\n\nCombining everything: Alice \u2192 Eve \u2192 Bob \u2192 Carol \u2192 Dave\n\nTherefore, the finish order from first to last is:\n\nAlice, Eve, Bob, Carol", "passed": true, "latency_ms": 3395.145893096924}
+{"id": "logic_order_02", "domain": "logic", "expected": "Yellow, Red, Blue, Orange, Green", "response": "I need to work through these clues systematically to find the order of the five books.\n\nGiven clues:\n- Red is left of Blue\n- Green is right of Blue\n- Yellow is left of Red\n- Orange is between Blue and", "extracted": "I need to work through these clues systematically to find the order of the five books.\n\nGiven clues:\n- Red is left of Blue\n- Green is right of Blue\n- Yellow is left of Red\n- Orange is between Blue and Green\n\nLet me start by establishing relative positions:\n\nFrom \"Red is left of Blue\" and \"Yellow is left of Red\":\nYellow must be to the left of Red, which is to the left of Blue.\nSo: Yellow - Red - Blue (in that order, though not necessarily adjacent)\n\nFrom \"Green is right of Blue\":\nBlue - Green (Green is somewhere to the right of Blue)\n\nFrom \"Orange is between Blue and Green\":\nBlue - Orange - Green (Orange is directly between Blue and Green)\n\nNow I can combine these:\n- Yellow is left of Red\n- Red is left of Blue\n- Blue is followed by Orange\n- Orange is followed by Green\n\nThis", "passed": false, "latency_ms": 4458.382129669189}
+{"id": "logic_modus_01", "domain": "logic", "expected": "yes", "response": "Yes.\n\nHere's the logical reasoning:\n1. It rained (given)\n2. If it rains, the ground is wet (given rule)\n3. Therefore, the ground is wet (from 1 and 2)\n4. If the ground is wet, the game is cancelled (g", "extracted": "yes", "passed": true, "latency_ms": 2189.753293991089}
+{"id": "logic_modus_02", "domain": "logic", "expected": "yes", "response": "Yes.\n\nFollowing the logical chain:\n- The battery is dead (given)\n- If the battery is dead, the car won't start \u2192 the car won't start\n- If the car won't start, I'll be late \u2192 I'll be late\n- If I'm late", "extracted": "yes", "passed": true, "latency_ms": 1729.341745376587}
+{"id": "logic_modus_03", "domain": "logic", "expected": "cannot determine", "response": "I need to work through this logical reasoning step by step.\n\nGiven premises:\n1. All programmers know logic\n2. All logicians are good at puzzles\n3. Sam is a programmer\n\nLet me trace the logic:\n\nFrom pr", "extracted": "1.", "passed": true, "latency_ms": 5384.079933166504}
+{"id": "logic_sets_01", "domain": "logic", "expected": "45", "response": "45", "extracted": "45", "passed": true, "latency_ms": 956.916093826294}
+{"id": "logic_sets_02", "domain": "logic", "expected": "5", "response": "I need to find how many people speak neither English nor Spanish.\n\nGiven information:\n- Total people: 50\n- Speak English: 35\n- Speak Spanish: 30\n- Speak both English and Spanish: 20\n\nUsing the princip", "extracted": "50", "passed": true, "latency_ms": 2464.90478515625}
+{"id": "logic_sets_03", "domain": "logic", "expected": "10", "response": "I need to find how many people have no pets using the inclusion-exclusion principle.\n\nGiven information:\n- Total people: 100\n- Have dogs: 60\n- Have cats: 40\n- Have both dogs and cats: 15\n- Have fish o", "extracted": "100", "passed": true, "latency_ms": 4043.487071990967}
+{"id": "spatial_direction_01", "domain": "spatial", "expected": "South", "response": "South", "extracted": "South", "passed": true, "latency_ms": 1132.0321559906006}
+{"id": "spatial_direction_02", "domain": "spatial", "expected": "West", "response": "North", "extracted": "North", "passed": false, "latency_ms": 1210.8416557312012}
+{"id": "spatial_direction_03", "domain": "spatial", "expected": "West", "response": "East", "extracted": "East", "passed": false, "latency_ms": 1163.621425628662}
+{"id": "spatial_grid_01", "domain": "spatial", "expected": "(2,2)", "response": "I'll track my position step by step, starting at (0,0).\n\nStarting position: (0,0)\n\n1. Move right 3 steps: (0,0) \u2192 (3,0)\n2. Move up 2 steps: (3,0) \u2192 (3,2)\n3. Move left 1 step: (3,2) \u2192 (2,2)\n\nFinal posi", "extracted": "", "passed": true, "latency_ms": 2060.361623764038}
+{"id": "spatial_grid_02", "domain": "spatial", "expected": "(7,3)", "response": "I'll track each movement step by step, starting at (5,5).\n\nStarting position: (5,5)\n\n1. Move left 2: (5,5) \u2192 (5-2,5) = (3,5)\n2. Move down 3: (3,5) \u2192 (3,5-3) = (3,2)\n3. Move right 4: (3,2) \u2192 (3+4,2) = ", "extracted": "", "passed": true, "latency_ms": 2648.094654083252}
+{"id": "spatial_relative_01", "domain": "spatial", "expected": "Southeast", "response": "I need to work through the relative positions step by step.\n\nGiven information:\n- A is north of B\n- C is east of B  \n- D is south of C\n\nLet me visualize this:\n\n1. Start with B as a reference point\n2. ", "extracted": "1.", "passed": true, "latency_ms": 4674.278020858765}
+{"id": "spatial_relative_02", "domain": "spatial", "expected": "yes", "response": "I need to track the positions step by step, starting from the park.\n\nLet me place the park at position (0, 0) and track each location:\n\n1. Park: (0, 0)\n2. Library is 2 blocks east of the park: (2, 0)\n", "extracted": "yes", "passed": true, "latency_ms": 4008.5084438323975}
+{"id": "procedural_state_01", "domain": "procedural", "expected": "Yellow", "response": "I need to track the traffic light through 4 changes, starting from Green.\n\nThe cycle is: Green \u2192 Yellow \u2192 Red \u2192 Green (then repeats)\n\nStarting position: Green\n\nAfter 1 change: Green \u2192 Yellow\nAfter 2 c", "extracted": "4", "passed": true, "latency_ms": 3090.118169784546}
+{"id": "procedural_state_02", "domain": "procedural", "expected": "Locked", "response": "I'll trace through each step starting from the Locked state:\n\nStarting state: **Locked**\n\n1. **Unlock** (Locked \u2192 Closed)\n   Current state: **Closed**\n\n2. **Open** (Closed \u2192 Open)\n   Current state: **", "extracted": "1.", "passed": true, "latency_ms": 2783.231258392334}
+{"id": "procedural_recipe_01", "domain": "procedural", "expected": "Added milk before steeping", "response": "If you do the steps in the order 1, 2, 5, 3, 4, the main problems are:\n\n1. **You're adding milk before steeping** - This means the tea bag will be steeping in milky water rather than plain hot water, ", "extracted": "1", "passed": false, "latency_ms": 5760.7011795043945}
+{"id": "procedural_recipe_02", "domain": "procedural", "expected": "Passwor1", "response": "Passwor1", "extracted": "1", "passed": true, "latency_ms": 1344.8550701141357}
+{"id": "procedural_undo_01", "domain": "procedural", "expected": "Hello World?", "response": "I'll trace through each action step by step:\n\n1. **Start**: \"Hello\"\n2. **Append ' World'**: \"Hello World\"\n3. **Append '!'**: \"Hello World!\"\n4. **Undo**: This undoes the last action (Append '!'), so we", "extracted": "1.", "passed": true, "latency_ms": 2674.466609954834}
+{"id": "procedural_undo_02", "domain": "procedural", "expected": "empty", "response": "I'll trace through the stack operations step by step.\n\nStarting with an empty stack: []\n\n1. Push A: [A]\n2. Push B: [A, B]\n3. Pop: [A] (B is removed)\n4. Push C: [A, C]\n5. Pop: [A] (C is removed)\n6. Pop", "extracted": "1.", "passed": true, "latency_ms": 2418.6952114105225}
+{"id": "text_string_01", "domain": "text", "expected": "LLEH", "response": "I'll work through this step by step:\n\n1. Start with the word 'HELLO'\n2. Reverse it: 'OLLEH'\n3. Remove the first letter (O): 'LLEH'\n\nThe result is 'LLEH'.", "extracted": "1.", "passed": true, "latency_ms": 1911.1328125}
+{"id": "text_string_02", "domain": "text", "expected": "DCB", "response": "I'll work through this step by step.\n\nStarting with: ABCDE\n\nStep 1: Remove vowels\nThe vowels in 'ABCDE' are A and E\nAfter removing vowels: BCD\n\nStep 2: Reverse\nReversing 'BCD' gives: DCB\n\nThe result i", "extracted": "1", "passed": true, "latency_ms": 2774.3263244628906}
+{"id": "text_string_03", "domain": "text", "expected": "PRGR", "response": "I need to find the consonants in 'PROGRAMMING' and then take the first 4.\n\nLet me go through each letter in 'PROGRAMMING':\n- P - consonant \u2713\n- R - consonant \u2713\n- O - vowel\n- G - consonant \u2713\n- R - conso", "extracted": "4.", "passed": true, "latency_ms": 3804.591178894043}
+{"id": "text_word_01", "domain": "text", "expected": "fox", "response": "I need to reverse the word order of \"The quick brown fox\" and then identify the first word.\n\nOriginal sentence: \"The quick brown fox\"\nWords in order: The, quick, brown, fox\n\nReversed word order: fox, ", "extracted": "", "passed": true, "latency_ms": 2270.7276344299316}
+{"id": "text_word_02", "domain": "text", "expected": "CDB", "response": "I'll replace each word with its first letter:\n\n- CAT \u2192 C\n- DOG \u2192 D  \n- BIRD \u2192 B\n\nCombining them: **CDB**", "extracted": "I'll replace each word with its first letter:", "passed": true, "latency_ms": 1779.3736457824707}
+{"id": "text_encode_01", "domain": "text", "expected": "BZS", "response": "I'll work through this step by step.\n\nStarting word: CAT\n\nStep 1: Shift each letter forward by 1\n- C \u2192 D\n- A \u2192 B  \n- T \u2192 U\n\nAfter step 1: DBU\n\nStep 2: Shift each letter backward by 2\n- D \u2192 B (D goes b", "extracted": "1", "passed": true, "latency_ms": 3155.4336547851562}
+{"id": "text_encode_02", "domain": "text", "expected": "HILLU", "response": "I need to replace each vowel in 'HELLO' with the next vowel in the sequence A\u2192E, E\u2192I, I\u2192O, O\u2192U, U\u2192A.\n\nLet me go through each letter in 'HELLO':\n- H: not a vowel, stays H\n- E: vowel, E\u2192I, so becomes I\n", "extracted": "", "passed": true, "latency_ms": 2766.1566734313965}
+{"id": "sequence_letter_01", "domain": "sequence", "expected": "I", "response": "Looking at this pattern: A, C, E, G, _\n\nI can see that each letter is skipping one letter in the alphabet:\n- A to C (skipping B)\n- C to E (skipping D) \n- E to G (skipping F)\n\nFollowing this pattern, t", "extracted": "", "passed": true, "latency_ms": 2989.410161972046}
+{"id": "sequence_letter_02", "domain": "sequence", "expected": "R", "response": "I need to find the pattern in this sequence: Z, X, V, T, _\n\nLet me look at the positions of these letters in the alphabet:\n- Z is the 26th letter\n- X is the 24th letter  \n- V is the 22nd letter\n- T is", "extracted": "", "passed": true, "latency_ms": 3488.3639812469482}
+{"id": "sequence_letter_03", "domain": "sequence", "expected": "P", "response": "I need to find the pattern in this sequence: A, B, D, G, K, _\n\nLet me look at the positions of these letters in the alphabet:\n- A = 1\n- B = 2  \n- D = 4\n- G = 7\n- K = 11\n\nNow let me examine the differe", "extracted": "", "passed": true, "latency_ms": 3271.898031234741}
+{"id": "sequence_symbol_01", "domain": "sequence", "expected": "\u25cb", "response": "Looking at the pattern: \u25cf\u25cb\u25cf\u25cb\u25cf_\n\nI can see this follows an alternating pattern:\n- Position 1: \u25cf (filled)\n- Position 2: \u25cb (empty)\n- Position 3: \u25cf (filled)\n- Position 4: \u25cb (empty)\n- Position 5: \u25cf (filled", "extracted": "1", "passed": true, "latency_ms": 2849.2820262908936}
+{"id": "sequence_symbol_02", "domain": "sequence", "expected": "\u25cb", "response": "Looking at the pattern: \u25cf\u25cf\u25cb\u25cf\u25cf\u25cb\u25cf\u25cf_\n\nI can see this follows a repeating sequence of: \u25cf\u25cf\u25cb\n\nLet me break it down:\n- \u25cf\u25cf\u25cb (positions 1-3)\n- \u25cf\u25cf\u25cb (positions 4-6) \n- \u25cf\u25cf (positions 7-8)\n\nSince we have \u25cf\u25cf in pos", "extracted": "1", "passed": true, "latency_ms": 3226.1710166931152}
+{"id": "sequence_word_01", "domain": "sequence", "expected": "four", "response": "The word that fills the blank is **four**.\n\nThe pattern is simply counting in sequence: one, two, three, **four**, five.", "extracted": "", "passed": true, "latency_ms": 1823.0633735656738}
+{"id": "sequence_word_02", "domain": "sequence", "expected": "September", "response": "Looking at this pattern: January, March, May, July, ___\n\nThese are all odd-numbered months:\n- January = 1st month\n- March = 3rd month  \n- May = 5th month\n- July = 7th month\n\nFollowing this pattern, th", "extracted": "", "passed": true, "latency_ms": 2357.6316833496094}
+{"id": "causal_chain_01", "domain": "causal", "expected": "The power went out", "response": "The root cause of everyone getting sick was the power going out.\n\nFollowing the chain of causation you described:\n- Power went out \u2192 fridge stopped \u2192 food spoiled \u2192 everyone got sick\n\nThe power outage", "extracted": "", "passed": false, "latency_ms": 2305.238962173462}
+{"id": "causal_chain_02", "domain": "causal", "expected": "He is late", "response": "Let me work through this step by step using the given logical chain:\n\nGiven facts:\n- If the alarm doesn't ring \u2192 Tom oversleeps\n- If Tom oversleeps \u2192 he misses the bus  \n- If he misses the bus \u2192 he's ", "extracted": "1.", "passed": false, "latency_ms": 3007.6956748962402}
+{"id": "causal_counter_01", "domain": "causal", "expected": "no", "response": "No.\n\nThe statement establishes that the plant died specifically because it wasn't watered. This implies that watering was necessary for the plant's survival. Therefore, if the plant had been watered, ", "extracted": "no", "passed": true, "latency_ms": 2249.9849796295166}
+{"id": "causal_counter_02", "domain": "causal", "expected": "no", "response": "No, if the dial worked, the cake would not have burned.\n\nHere's the logical chain:\n- The cake burned because the oven was too hot\n- The oven was too hot because the dial was broken\n- If the dial worke", "extracted": "no", "passed": true, "latency_ms": 2118.8220977783203}
+{"id": "causal_necessary_01", "domain": "causal", "expected": "not necessarily", "response": "Not necessarily.\n\nWhile water is necessary for plant growth, it's not the only requirement. Plants also need:\n- Adequate light (sunlight or artificial)\n- Proper nutrients from soil\n- Appropriate tempe", "extracted": "", "passed": true, "latency_ms": 3185.826301574707}
+{"id": "causal_necessary_02", "domain": "causal", "expected": "no", "response": "No.\n\nThe car needs both fuel AND a working battery to start. Since the battery is dead, even though there is fuel, the car will not start.", "extracted": "no", "passed": true, "latency_ms": 1871.96946144104}
+{"id": "causal_necessary_03", "domain": "causal", "expected": "yes", "response": "Yes.", "extracted": "yes", "passed": true, "latency_ms": 942.8300857543945}

results/claude-sonnet-4-20250514_20260103_184954_summary.json ADDED Viewed

	@@ -0,0 +1,71 @@

+{
+  "total": 101,
+  "passed": 84,
+  "failed": 17,
+  "pass_rate": 0.8316831683168316,
+  "by_domain": {
+    "math_discount": {
+      "passed": 15,
+      "total": 15,
+      "pass_rate": 1.0
+    },
+    "time": {
+      "passed": 11,
+      "total": 13,
+      "pass_rate": 0.8461538461538461
+    },
+    "recipe": {
+      "passed": 7,
+      "total": 7,
+      "pass_rate": 1.0
+    },
+    "financial": {
+      "passed": 3,
+      "total": 10,
+      "pass_rate": 0.3
+    },
+    "units": {
+      "passed": 6,
+      "total": 7,
+      "pass_rate": 0.8571428571428571
+    },
+    "scheduling": {
+      "passed": 6,
+      "total": 7,
+      "pass_rate": 0.8571428571428571
+    },
+    "logic": {
+      "passed": 7,
+      "total": 8,
+      "pass_rate": 0.875
+    },
+    "spatial": {
+      "passed": 5,
+      "total": 7,
+      "pass_rate": 0.7142857142857143
+    },
+    "procedural": {
+      "passed": 5,
+      "total": 6,
+      "pass_rate": 0.8333333333333334
+    },
+    "text": {
+      "passed": 7,
+      "total": 7,
+      "pass_rate": 1.0
+    },
+    "sequence": {
+      "passed": 7,
+      "total": 7,
+      "pass_rate": 1.0
+    },
+    "causal": {
+      "passed": 5,
+      "total": 7,
+      "pass_rate": 0.7142857142857143
+    }
+  },
+  "avg_latency_ms": 2045.5509129137097,
+  "model": "claude-sonnet-4-20250514",
+  "timestamp": "20260103_184954"
+}

results/gpt-4o-mini_20260103_184617_results.jsonl ADDED Viewed

	@@ -0,0 +1,101 @@

+{"id": "math_discount_01", "domain": "math_discount", "expected": "15", "response": "16", "extracted": "16", "passed": false, "latency_ms": 353.2829284667969}
+{"id": "math_discount_02", "domain": "math_discount", "expected": "37", "response": "37", "extracted": "37", "passed": true, "latency_ms": 368.3919906616211}
+{"id": "math_discount_03", "domain": "math_discount", "expected": "50", "response": "$55.00", "extracted": "55.00", "passed": false, "latency_ms": 415.9402847290039}
+{"id": "math_discount_04", "domain": "math_discount", "expected": "90", "response": "84", "extracted": "84", "passed": false, "latency_ms": 440.11902809143066}
+{"id": "math_discount_05", "domain": "math_discount", "expected": "115", "response": "125", "extracted": "125", "passed": false, "latency_ms": 403.51104736328125}
+{"id": "math_discount_06", "domain": "math_discount", "expected": "53", "response": "53", "extracted": "53", "passed": true, "latency_ms": 425.0950813293457}
+{"id": "math_discount_07", "domain": "math_discount", "expected": "70", "response": "$90", "extracted": "90", "passed": false, "latency_ms": 643.2604789733887}
+{"id": "math_discount_tax_01", "domain": "math_discount", "expected": "88", "response": "88.0", "extracted": "88.0", "passed": true, "latency_ms": 456.78114891052246}
+{"id": "math_discount_tax_02", "domain": "math_discount", "expected": "229.5", "response": "$231.00", "extracted": "231.00", "passed": false, "latency_ms": 433.59923362731934}
+{"id": "math_discount_tax_03", "domain": "math_discount", "expected": "63", "response": "$63.00", "extracted": "63.00", "passed": true, "latency_ms": 449.0923881530762}
+{"id": "math_discount_tax_04", "domain": "math_discount", "expected": "481.5", "response": "Final price: 500.00", "extracted": "500.00", "passed": false, "latency_ms": 508.26239585876465}
+{"id": "math_discount_tax_05", "domain": "math_discount", "expected": "135.68", "response": "$169.60", "extracted": "169.60", "passed": false, "latency_ms": 403.0435085296631}
+{"id": "math_bogo_01", "domain": "math_discount", "expected": "60", "response": "60", "extracted": "60", "passed": true, "latency_ms": 334.78307723999023}
+{"id": "math_bogo_02", "domain": "math_discount", "expected": "43.75", "response": "43.75", "extracted": "43.75", "passed": true, "latency_ms": 487.77198791503906}
+{"id": "math_bogo_03", "domain": "math_discount", "expected": "96", "response": "$96", "extracted": "96", "passed": true, "latency_ms": 363.8746738433838}
+{"id": "time_duration_01", "domain": "time", "expected": "4:45 PM", "response": "4:15 PM", "extracted": "4:15 PM", "passed": true, "latency_ms": 422.21951484680176}
+{"id": "time_duration_02", "domain": "time", "expected": "11:50 AM", "response": "11:50 AM", "extracted": "11:50 AM", "passed": true, "latency_ms": 397.20797538757324}
+{"id": "time_duration_03", "domain": "time", "expected": "12:15 PM", "response": "11:15 AM", "extracted": "11:15 AM", "passed": false, "latency_ms": 402.74596214294434}
+{"id": "time_duration_04", "domain": "time", "expected": "5:20 PM", "response": "5:00 PM", "extracted": "5:00 PM", "passed": true, "latency_ms": 432.36637115478516}
+{"id": "time_duration_05", "domain": "time", "expected": "1:30 PM", "response": "12:30 PM", "extracted": "12:30 PM", "passed": false, "latency_ms": 560.6443881988525}
+{"id": "time_duration_06", "domain": "time", "expected": "12:30 PM", "response": "12:30 PM", "extracted": "12:30 PM", "passed": true, "latency_ms": 438.97271156311035}
+{"id": "time_duration_07", "domain": "time", "expected": "9:15 PM", "response": "9:15 PM", "extracted": "9:15 PM", "passed": true, "latency_ms": 416.1064624786377}
+{"id": "time_travel_01", "domain": "time", "expected": "11:50 AM", "response": "11:50 AM", "extracted": "11:50 AM", "passed": true, "latency_ms": 378.5521984100342}
+{"id": "time_travel_02", "domain": "time", "expected": "3:40 PM", "response": "3:40 PM", "extracted": "3:40 PM", "passed": true, "latency_ms": 418.66469383239746}
+{"id": "time_travel_03", "domain": "time", "expected": "10:00 AM", "response": "10:00 AM", "extracted": "10:00 AM", "passed": true, "latency_ms": 444.2873001098633}
+{"id": "time_travel_04", "domain": "time", "expected": "5:00 PM", "response": "5:00 PM", "extracted": "5:00 PM", "passed": true, "latency_ms": 456.10809326171875}
+{"id": "time_travel_05", "domain": "time", "expected": "9:45 AM", "response": "9:45 AM", "extracted": "9:45 AM", "passed": true, "latency_ms": 399.5320796966553}
+{"id": "time_multi_01", "domain": "time", "expected": "10:20 AM", "response": "9:20 AM", "extracted": "9:20 AM", "passed": false, "latency_ms": 1283.4525108337402}
+{"id": "recipe_scale_01", "domain": "recipe", "expected": "6", "response": "12", "extracted": "12", "passed": false, "latency_ms": 464.3080234527588}
+{"id": "recipe_scale_02", "domain": "recipe", "expected": "2.25", "response": "2.", "extracted": "2.", "passed": false, "latency_ms": 396.33917808532715}
+{"id": "recipe_scale_03", "domain": "recipe", "expected": "6", "response": "6.0", "extracted": "6.0", "passed": true, "latency_ms": 398.6220359802246}
+{"id": "recipe_scale_04", "domain": "recipe", "expected": "3", "response": "6", "extracted": "6", "passed": false, "latency_ms": 383.0828666687012}
+{"id": "recipe_scale_05", "domain": "recipe", "expected": "6", "response": "6.0", "extracted": "6.0", "passed": true, "latency_ms": 418.92313957214355}
+{"id": "recipe_convert_01", "domain": "recipe", "expected": "360", "response": "360", "extracted": "360", "passed": true, "latency_ms": 780.9550762176514}
+{"id": "recipe_convert_02", "domain": "recipe", "expected": "3.3", "response": "3.3", "extracted": "3.3", "passed": true, "latency_ms": 410.5088710784912}
+{"id": "financial_compound_01", "domain": "financial", "expected": "1168", "response": "$1,056.00", "extracted": "1056.00", "passed": false, "latency_ms": 513.2331848144531}
+{"id": "financial_compound_02", "domain": "financial", "expected": "5669.91", "response": "$5,250.00", "extracted": "5250.00", "passed": false, "latency_ms": 603.4715175628662}
+{"id": "financial_compound_03", "domain": "financial", "expected": "2249.6", "response": "$2,904.00", "extracted": "2904.00", "passed": false, "latency_ms": 416.1546230316162}
+{"id": "financial_compound_04", "domain": "financial", "expected": "614.48", "response": "$610.00", "extracted": "610.00", "passed": false, "latency_ms": 438.7550354003906}
+{"id": "financial_markup_01", "domain": "financial", "expected": "562.5", "response": "$462.50", "extracted": "462.50", "passed": false, "latency_ms": 414.88003730773926}
+{"id": "financial_markup_02", "domain": "financial", "expected": "240", "response": "$120", "extracted": "120", "passed": false, "latency_ms": 365.8781051635742}
+{"id": "financial_markup_03", "domain": "financial", "expected": "816", "response": "680", "extracted": "680", "passed": false, "latency_ms": 381.06608390808105}
+{"id": "financial_markup_04", "domain": "financial", "expected": "157.5", "response": "$120.00", "extracted": "120.00", "passed": false, "latency_ms": 394.6950435638428}
+{"id": "financial_markup_05", "domain": "financial", "expected": "1170", "response": "$970.00", "extracted": "970.00", "passed": false, "latency_ms": 407.6099395751953}
+{"id": "financial_commission_01", "domain": "financial", "expected": "900", "response": "$1,200", "extracted": "1200", "passed": false, "latency_ms": 541.8832302093506}
+{"id": "unit_convert_01", "domain": "units", "expected": "13.125", "response": "15.5", "extracted": "15.5", "passed": false, "latency_ms": 613.4727001190186}
+{"id": "unit_convert_02", "domain": "units", "expected": "82", "response": "70\u00b0F", "extracted": "70", "passed": false, "latency_ms": 377.61640548706055}
+{"id": "unit_volume_01", "domain": "units", "expected": "1875", "response": "1875", "extracted": "1875", "passed": true, "latency_ms": 432.9349994659424}
+{"id": "unit_volume_02", "domain": "units", "expected": "48", "response": "42", "extracted": "42", "passed": false, "latency_ms": 439.7459030151367}
+{"id": "unit_volume_03", "domain": "units", "expected": "7000", "response": "7,500", "extracted": "7500", "passed": false, "latency_ms": 438.8389587402344}
+{"id": "unit_speed_01", "domain": "units", "expected": "3", "response": "3.0", "extracted": "3.0", "passed": true, "latency_ms": 441.6847229003906}
+{"id": "unit_speed_02", "domain": "units", "expected": "80", "response": "100", "extracted": "100", "passed": false, "latency_ms": 422.84560203552246}
+{"id": "schedule_01", "domain": "scheduling", "expected": "2:00 PM", "response": "Task C finishes at 12 PM.", "extracted": "12", "passed": false, "latency_ms": 490.2148246765137}
+{"id": "schedule_02", "domain": "scheduling", "expected": "95", "response": "To find the total time from start to finish for processes X, Y, and Z, we need to consider the sequential nature of these processes.\n\n1. **Process X** takes 45 minutes.\n2. **Process Y** takes 30 minut", "extracted": "", "passed": true, "latency_ms": 2994.168281555176}
+{"id": "schedule_03", "domain": "scheduling", "expected": "50", "response": "To find the total time, we simply add up the time for each step:\n\n- Download: 10 minutes\n- Install: 15 minutes\n- Configuration: 5 minutes\n- Testing: 20 minutes\n\nTotal time = 10 + 15 + 5 + 20 = 60 minu", "extracted": "", "passed": false, "latency_ms": 1646.0909843444824}
+{"id": "schedule_04", "domain": "scheduling", "expected": "3:00 PM", "response": "1 PM", "extracted": "1", "passed": false, "latency_ms": 413.5298728942871}
+{"id": "schedule_05", "domain": "scheduling", "expected": "60", "response": "To determine when both teams finish, we need to calculate the total time each team takes to complete their tasks.\n\n**Team A:**\n- 3 tasks of 20 minutes each\n- Total time for Team A = 3 tasks \u00d7 20 minut", "extracted": "", "passed": true, "latency_ms": 3534.430980682373}
+{"id": "schedule_06", "domain": "scheduling", "expected": "2.4", "response": "To find out how long it takes for Worker A and Worker B to complete a job together, we first determine their individual work rates.\n\n- Worker A's work rate is \\( \\frac{1}{6} \\) of the job per hour (si", "extracted": "", "passed": false, "latency_ms": 4289.335012435913}
+{"id": "schedule_07", "domain": "scheduling", "expected": "5", "response": "To find out how long it takes for both printers to print 250 pages together, we first need to determine their combined printing rate.\n\nThe first printer prints at a rate of 30 pages per minute, and th", "extracted": "250", "passed": true, "latency_ms": 5124.680757522583}
+{"id": "logic_order_01", "domain": "logic", "expected": "Alice, Eve, Bob, Carol, Dave", "response": "The finish order from first to last is: Alice, Eve, Bob, Carol, Dave.", "extracted": "The finish order from first to last is: Alice, Eve, Bob, Carol, Dave.", "passed": true, "latency_ms": 855.7789325714111}
+{"id": "logic_order_02", "domain": "logic", "expected": "Yellow, Red, Blue, Orange, Green", "response": "To determine the order of the books from left to right based on the given clues, let's analyze each statement:\n\n1. **Red is left of Blue.** (This means Red comes before Blue.)\n2. **Green is right of B", "extracted": "To determine the order of the books from left to right based on the given clues, let's analyze each statement:\n\n1. **Red is left of Blue.** (This means Red comes before Blue.)\n2. **Green is right of Blue.** (This means Green comes after Blue.)\n3. **Yellow is left of Red.** (This means Yellow comes before Red.)\n4. **Orange is between Blue and Green.** (This means Orange is positioned directly between Blue and Green.)\n\nNow, let's put these clues together step by step:\n\n- From clue 1, we know: Red < Blue\n- From clue 2, we know: Blue < Green\n- From clue 3", "passed": false, "latency_ms": 3717.663049697876}
+{"id": "logic_modus_01", "domain": "logic", "expected": "yes", "response": "Yes.", "extracted": "yes", "passed": true, "latency_ms": 413.86866569519043}
+{"id": "logic_modus_02", "domain": "logic", "expected": "yes", "response": "Yes.", "extracted": "yes", "passed": true, "latency_ms": 531.6195487976074}
+{"id": "logic_modus_03", "domain": "logic", "expected": "cannot determine", "response": "Cannot determine. \n\nWhile all programmers know logic and all logicians are good at puzzles, the information provided does not specify whether programmers, including Sam, are also logicians. Therefore,", "extracted": "", "passed": true, "latency_ms": 1566.9023990631104}
+{"id": "logic_sets_01", "domain": "logic", "expected": "45", "response": "45", "extracted": "45", "passed": true, "latency_ms": 384.3083381652832}
+{"id": "logic_sets_02", "domain": "logic", "expected": "5", "response": "5", "extracted": "5", "passed": true, "latency_ms": 372.5147247314453}
+{"id": "logic_sets_03", "domain": "logic", "expected": "10", "response": "20", "extracted": "20", "passed": false, "latency_ms": 400.8209705352783}
+{"id": "spatial_direction_01", "domain": "spatial", "expected": "South", "response": "East.", "extracted": "East.", "passed": false, "latency_ms": 388.44871520996094}
+{"id": "spatial_direction_02", "domain": "spatial", "expected": "West", "response": "North.", "extracted": "North.", "passed": false, "latency_ms": 418.49493980407715}
+{"id": "spatial_direction_03", "domain": "spatial", "expected": "West", "response": "East", "extracted": "East", "passed": false, "latency_ms": 420.49193382263184}
+{"id": "spatial_grid_01", "domain": "spatial", "expected": "(2,2)", "response": "Starting at position (0,0):\n\n1. Move right 3 steps: (0,0) \u2192 (3,0)\n2. Move up 2 steps: (3,0) \u2192 (3,2)\n3. Move left 1 step: (3,2) \u2192 (2,2)\n\nThe final position is (2,2).", "extracted": "00", "passed": true, "latency_ms": 1665.888786315918}
+{"id": "spatial_grid_02", "domain": "spatial", "expected": "(7,3)", "response": "Let's break down the movements step by step starting from the initial position (5, 5):\n\n1. Move left 2: \n   - New position: (5 - 2, 5) = (3, 5)\n\n2. Move down 3: \n   - New position: (3, 5 - 3) = (3, 2)", "extracted": "5", "passed": false, "latency_ms": 3060.5833530426025}
+{"id": "spatial_relative_01", "domain": "spatial", "expected": "Southeast", "response": "D is southwest of A.", "extracted": "D is southwest of A.", "passed": false, "latency_ms": 528.719425201416}
+{"id": "spatial_relative_02", "domain": "spatial", "expected": "yes", "response": "Yes.", "extracted": "yes", "passed": true, "latency_ms": 397.49693870544434}
+{"id": "procedural_state_01", "domain": "procedural", "expected": "Yellow", "response": "The traffic light cycles through three colors: Green \u2192 Yellow \u2192 Red. Starting from Green, we can track the changes as follows:\n\n1. Green (current)\n2. Yellow (1st change)\n3. Red (2nd change)\n4. Green (", "extracted": "", "passed": true, "latency_ms": 2484.0002059936523}
+{"id": "procedural_state_02", "domain": "procedural", "expected": "Locked", "response": "Let's break down the sequence of actions step by step, starting from the initial state of the door, which is Locked.\n\n1. **Start**: Locked\n2. **Unlock**: The door transitions from Locked to Closed.\n3.", "extracted": "", "passed": true, "latency_ms": 2349.727153778076}
+{"id": "procedural_recipe_01", "domain": "procedural", "expected": "Added milk before steeping", "response": "If you do the steps in the order of 1, 2, 5, 3, 4, the issue is that you are adding milk before steeping the tea. Adding milk before the tea has steeped can affect the extraction of flavors from the t", "extracted": "1", "passed": false, "latency_ms": 2153.0861854553223}
+{"id": "procedural_recipe_02", "domain": "procedural", "expected": "Passwor1", "response": "Passwor1", "extracted": "1", "passed": true, "latency_ms": 394.18530464172363}
+{"id": "procedural_undo_01", "domain": "procedural", "expected": "Hello World?", "response": "Let's go through the actions step by step:\n\n1. Start with \"Hello\".\n2. Append \" World\": The text becomes \"Hello World\".\n3. Append \"!\": The text becomes \"Hello World!\".\n4. Undo: This action reverts the ", "extracted": "1.", "passed": true, "latency_ms": 1806.8759441375732}
+{"id": "procedural_undo_02", "domain": "procedural", "expected": "empty", "response": "Let's go through the stack operations step by step:\n\n1. Start empty: `[]`\n2. Push A: `['A']`\n3. Push B: `['A', 'B']`\n4. Pop: `['A']` (B is removed)\n5. Push C: `['A', 'C']`\n6. Pop: `['A']` (C is remove", "extracted": "1.", "passed": true, "latency_ms": 2269.1214084625244}
+{"id": "text_string_01", "domain": "text", "expected": "LLEH", "response": "Reversing the word 'HELLO' gives 'OLLEH'. Removing the first letter results in 'LLEH'.", "extracted": "Reversing the word 'HELLO' gives 'OLLEH'. Removing", "passed": true, "latency_ms": 946.0744857788086}
+{"id": "text_string_02", "domain": "text", "expected": "DCB", "response": "Starting with 'ABCDE', if we remove the vowels (A and E), we are left with 'BCD'. Reversing 'BCD' gives us 'DCB'.", "extracted": "", "passed": true, "latency_ms": 828.5079002380371}
+{"id": "text_string_03", "domain": "text", "expected": "PRGR", "response": "The consonants in the word 'PROGRAMMING' are P, R, G, R, M, M, N, G. Taking the first 4 letters gives us P, R, G, R. So the result is **PRGR**.", "extracted": "", "passed": true, "latency_ms": 1423.1388568878174}
+{"id": "text_word_01", "domain": "text", "expected": "fox", "response": "Reversing the word order of the sentence \"The quick brown fox\" gives us \"fox brown quick The\". The first word of this reversed order is \"fox\".", "extracted": "Reversing the word order of the sentence \"The quic", "passed": true, "latency_ms": 941.2515163421631}
+{"id": "text_word_02", "domain": "text", "expected": "CDB", "response": "The first letters of the words \"CAT,\" \"DOG,\" and \"BIRD\" are C, D, and B, respectively. When combined, the result is \"CDB.\"", "extracted": "", "passed": true, "latency_ms": 967.2601222991943}
+{"id": "text_encode_01", "domain": "text", "expected": "BZS", "response": "Let's break down the problem step by step.\n\n1. **Shift each letter in 'CAT' forward by 1:**\n   - C \u2192 D\n   - A \u2192 B\n   - T \u2192 U\n\n   So, 'CAT' becomes 'DBU'.\n\n2. **Now, shift each letter in 'DBU' backward", "extracted": "1.", "passed": true, "latency_ms": 2390.4123306274414}
+{"id": "text_encode_02", "domain": "text", "expected": "HILLU", "response": "To replace each vowel in 'HELLO' with the next vowel according to the given pattern:\n\n- H (not a vowel)\n- E \u2192 I\n- L (not a vowel)\n- L (not a vowel)\n- O \u2192 U\n\nSo, 'HELLO' becomes 'HILLU'.", "extracted": "", "passed": true, "latency_ms": 1459.7249031066895}
+{"id": "sequence_letter_01", "domain": "sequence", "expected": "I", "response": "The pattern consists of letters that are every second letter in the alphabet: A (1), C (3), E (5), G (7). Following this pattern, the next letter would be I (9). So, the next letter is I.", "extracted": "1", "passed": true, "latency_ms": 1494.0543174743652}
+{"id": "sequence_letter_02", "domain": "sequence", "expected": "R", "response": "The pattern alternates by skipping one letter backward in the alphabet. \n\n- Z (skip Y) goes to X\n- X (skip W) goes to V\n- V (skip U) goes to T\n\nContinuing this pattern, after T (skip S), the next lett", "extracted": "", "passed": true, "latency_ms": 1359.5664501190186}
+{"id": "sequence_letter_03", "domain": "sequence", "expected": "P", "response": "To find the next letter in the pattern A, B, D, G, K, we can look at the positions of the letters in the alphabet:\n\n- A is the 1st letter\n- B is the 2nd letter\n- D is the 4th letter\n- G is the 7th let", "extracted": "", "passed": true, "latency_ms": 5254.715442657471}
+{"id": "sequence_symbol_01", "domain": "sequence", "expected": "\u25cb", "response": "The pattern alternates between \u25cf and \u25cb. Following the sequence \u25cf\u25cb\u25cf\u25cb\u25cf, the next symbol should be \u25cb.", "extracted": "", "passed": true, "latency_ms": 975.4207134246826}
+{"id": "sequence_symbol_02", "domain": "sequence", "expected": "\u25cb", "response": "The pattern you provided is: \u25cf\u25cf\u25cb\u25cf\u25cf\u25cb\u25cf\u25cf_. \n\nIf we break it down, we can see that the pattern consists of two filled circles (\u25cf) followed by one empty circle (\u25cb). This sequence repeats: \n\n1. \u25cf\u25cf\u25cb\n2. \u25cf\u25cf\u25cb\n3", "extracted": "", "passed": true, "latency_ms": 1839.8103713989258}
+{"id": "sequence_word_01", "domain": "sequence", "expected": "four", "response": "The word that fills the blank is \"four.\" The pattern is a sequence of numbers written in word form.", "extracted": "The word that fills the blank is \"four.\" The patte", "passed": true, "latency_ms": 768.2960033416748}
+{"id": "sequence_word_02", "domain": "sequence", "expected": "September", "response": "The pattern consists of the months that have 31 days. After July, the next month with 31 days is August. Therefore, the next month in the pattern is August.", "extracted": "31", "passed": false, "latency_ms": 1361.1364364624023}
+{"id": "causal_chain_01", "domain": "causal", "expected": "The power went out", "response": "The root cause of everyone getting sick was the power outage, which led to the fridge stopping and subsequently caused the food to spoil.", "extracted": "", "passed": false, "latency_ms": 723.3519554138184}
+{"id": "causal_chain_02", "domain": "causal", "expected": "He is late", "response": "If the alarm didn't ring, then Tom oversleeps. Since he oversleeps, he misses the bus. As a result, because he misses the bus, he is late for work. Therefore, Tom is late for work.", "extracted": "", "passed": true, "latency_ms": 1284.857988357544}
+{"id": "causal_counter_01", "domain": "causal", "expected": "no", "response": "No.", "extracted": "no", "passed": true, "latency_ms": 417.39487648010254}
+{"id": "causal_counter_02", "domain": "causal", "expected": "no", "response": "If the dial had worked properly, it is likely that the oven would not have been too hot, and therefore the cake would not have burned. So, if the dial had functioned correctly, the cake would not have", "extracted": "no", "passed": true, "latency_ms": 1049.7331619262695}
+{"id": "causal_necessary_01", "domain": "causal", "expected": "not necessarily", "response": "Not necessarily. While water is essential for plant growth, other factors such as light, nutrients, soil quality, and temperature also play crucial roles in a plant's ability to grow.", "extracted": "", "passed": true, "latency_ms": 1655.8122634887695}
+{"id": "causal_necessary_02", "domain": "causal", "expected": "no", "response": "No.", "extracted": "no", "passed": true, "latency_ms": 396.47412300109863}
+{"id": "causal_necessary_03", "domain": "causal", "expected": "yes", "response": "Yes.", "extracted": "yes", "passed": true, "latency_ms": 430.1483631134033}

results/gpt-4o-mini_20260103_184617_summary.json ADDED Viewed

	@@ -0,0 +1,71 @@

+{
+  "total": 101,
+  "passed": 58,
+  "failed": 43,
+  "pass_rate": 0.5742574257425742,
+  "by_domain": {
+    "math_discount": {
+      "passed": 7,
+      "total": 15,
+      "pass_rate": 0.4666666666666667
+    },
+    "time": {
+      "passed": 10,
+      "total": 13,
+      "pass_rate": 0.7692307692307693
+    },
+    "recipe": {
+      "passed": 4,
+      "total": 7,
+      "pass_rate": 0.5714285714285714
+    },
+    "financial": {
+      "passed": 0,
+      "total": 10,
+      "pass_rate": 0.0
+    },
+    "units": {
+      "passed": 2,
+      "total": 7,
+      "pass_rate": 0.2857142857142857
+    },
+    "scheduling": {
+      "passed": 3,
+      "total": 7,
+      "pass_rate": 0.42857142857142855
+    },
+    "logic": {
+      "passed": 6,
+      "total": 8,
+      "pass_rate": 0.75
+    },
+    "spatial": {
+      "passed": 2,
+      "total": 7,
+      "pass_rate": 0.2857142857142857
+    },
+    "procedural": {
+      "passed": 5,
+      "total": 6,
+      "pass_rate": 0.8333333333333334
+    },
+    "text": {
+      "passed": 7,
+      "total": 7,
+      "pass_rate": 1.0
+    },
+    "sequence": {
+      "passed": 6,
+      "total": 7,
+      "pass_rate": 0.8571428571428571
+    },
+    "causal": {
+      "passed": 6,
+      "total": 7,
+      "pass_rate": 0.8571428571428571
+    }
+  },
+  "avg_latency_ms": 959.1620298895506,
+  "model": "gpt-4o-mini",
+  "timestamp": "20260103_184617"
+}

results/gpt-4o_20260103_184426_results.jsonl ADDED Viewed

	@@ -0,0 +1,101 @@

+{"id": "math_discount_01", "domain": "math_discount", "expected": "15", "response": "15", "extracted": "15", "passed": true, "latency_ms": 1855.0798892974854}
+{"id": "math_discount_02", "domain": "math_discount", "expected": "37", "response": "37", "extracted": "37", "passed": true, "latency_ms": 620.812177658081}
+{"id": "math_discount_03", "domain": "math_discount", "expected": "50", "response": "50", "extracted": "50", "passed": true, "latency_ms": 654.8104286193848}
+{"id": "math_discount_04", "domain": "math_discount", "expected": "90", "response": "90", "extracted": "90", "passed": true, "latency_ms": 397.7174758911133}
+{"id": "math_discount_05", "domain": "math_discount", "expected": "115", "response": "115", "extracted": "115", "passed": true, "latency_ms": 612.2045516967773}
+{"id": "math_discount_06", "domain": "math_discount", "expected": "53", "response": "53", "extracted": "53", "passed": true, "latency_ms": 342.04649925231934}
+{"id": "math_discount_07", "domain": "math_discount", "expected": "70", "response": "70", "extracted": "70", "passed": true, "latency_ms": 541.3846969604492}
+{"id": "math_discount_tax_01", "domain": "math_discount", "expected": "88", "response": "88", "extracted": "88", "passed": true, "latency_ms": 603.419303894043}
+{"id": "math_discount_tax_02", "domain": "math_discount", "expected": "229.5", "response": "229.40", "extracted": "229.40", "passed": true, "latency_ms": 627.4769306182861}
+{"id": "math_discount_tax_03", "domain": "math_discount", "expected": "63", "response": "63", "extracted": "63", "passed": true, "latency_ms": 358.6757183074951}
+{"id": "math_discount_tax_04", "domain": "math_discount", "expected": "481.5", "response": "481.15", "extracted": "481.15", "passed": true, "latency_ms": 922.0921993255615}
+{"id": "math_discount_tax_05", "domain": "math_discount", "expected": "135.68", "response": "135.52", "extracted": "135.52", "passed": true, "latency_ms": 652.8406143188477}
+{"id": "math_bogo_01", "domain": "math_discount", "expected": "60", "response": "60", "extracted": "60", "passed": true, "latency_ms": 365.278959274292}
+{"id": "math_bogo_02", "domain": "math_discount", "expected": "43.75", "response": "43.75", "extracted": "43.75", "passed": true, "latency_ms": 385.67137718200684}
+{"id": "math_bogo_03", "domain": "math_discount", "expected": "96", "response": "96", "extracted": "96", "passed": true, "latency_ms": 357.47313499450684}
+{"id": "time_duration_01", "domain": "time", "expected": "4:45 PM", "response": "4:45 PM", "extracted": "4:45 PM", "passed": true, "latency_ms": 493.6494827270508}
+{"id": "time_duration_02", "domain": "time", "expected": "11:50 AM", "response": "11:50 AM", "extracted": "11:50 AM", "passed": true, "latency_ms": 379.3666362762451}
+{"id": "time_duration_03", "domain": "time", "expected": "12:15 PM", "response": "12:15 PM", "extracted": "12:15 PM", "passed": true, "latency_ms": 415.59529304504395}
+{"id": "time_duration_04", "domain": "time", "expected": "5:20 PM", "response": "5:20 PM", "extracted": "5:20 PM", "passed": true, "latency_ms": 380.90062141418457}
+{"id": "time_duration_05", "domain": "time", "expected": "1:30 PM", "response": "12:30 PM", "extracted": "12:30 PM", "passed": false, "latency_ms": 424.68976974487305}
+{"id": "time_duration_06", "domain": "time", "expected": "12:30 PM", "response": "12:30 PM", "extracted": "12:30 PM", "passed": true, "latency_ms": 607.5694561004639}
+{"id": "time_duration_07", "domain": "time", "expected": "9:15 PM", "response": "9:15 PM", "extracted": "9:15 PM", "passed": true, "latency_ms": 372.3134994506836}
+{"id": "time_travel_01", "domain": "time", "expected": "11:50 AM", "response": "11:50 AM", "extracted": "11:50 AM", "passed": true, "latency_ms": 419.3422794342041}
+{"id": "time_travel_02", "domain": "time", "expected": "3:40 PM", "response": "3:40 PM", "extracted": "3:40 PM", "passed": true, "latency_ms": 420.00746726989746}
+{"id": "time_travel_03", "domain": "time", "expected": "10:00 AM", "response": "10:00 AM", "extracted": "10:00 AM", "passed": true, "latency_ms": 412.45269775390625}
+{"id": "time_travel_04", "domain": "time", "expected": "5:00 PM", "response": "5:00 PM", "extracted": "5:00 PM", "passed": true, "latency_ms": 415.54760932922363}
+{"id": "time_travel_05", "domain": "time", "expected": "9:45 AM", "response": "9:45 AM", "extracted": "9:45 AM", "passed": true, "latency_ms": 392.5604820251465}
+{"id": "time_multi_01", "domain": "time", "expected": "10:20 AM", "response": "10:20 AM", "extracted": "10:20 AM", "passed": true, "latency_ms": 396.5001106262207}
+{"id": "recipe_scale_01", "domain": "recipe", "expected": "6", "response": "6", "extracted": "6", "passed": true, "latency_ms": 474.4579792022705}
+{"id": "recipe_scale_02", "domain": "recipe", "expected": "2.25", "response": "2", "extracted": "2", "passed": false, "latency_ms": 409.20400619506836}
+{"id": "recipe_scale_03", "domain": "recipe", "expected": "6", "response": "6", "extracted": "6", "passed": true, "latency_ms": 497.1766471862793}
+{"id": "recipe_scale_04", "domain": "recipe", "expected": "3", "response": "3", "extracted": "3", "passed": true, "latency_ms": 349.67041015625}
+{"id": "recipe_scale_05", "domain": "recipe", "expected": "6", "response": "6", "extracted": "6", "passed": true, "latency_ms": 416.6903495788574}
+{"id": "recipe_convert_01", "domain": "recipe", "expected": "360", "response": "360", "extracted": "360", "passed": true, "latency_ms": 392.77100563049316}
+{"id": "recipe_convert_02", "domain": "recipe", "expected": "3.3", "response": "3.3", "extracted": "3.3", "passed": true, "latency_ms": 390.0177478790283}
+{"id": "financial_compound_01", "domain": "financial", "expected": "1168", "response": "1184", "extracted": "1184", "passed": false, "latency_ms": 333.3697319030762}
+{"id": "financial_compound_02", "domain": "financial", "expected": "5669.91", "response": "$5613.06", "extracted": "5613.06", "passed": false, "latency_ms": 418.92218589782715}
+{"id": "financial_compound_03", "domain": "financial", "expected": "2249.6", "response": "2330", "extracted": "2330", "passed": false, "latency_ms": 582.0214748382568}
+{"id": "financial_compound_04", "domain": "financial", "expected": "614.48", "response": "$582.40", "extracted": "582.40", "passed": false, "latency_ms": 524.5602130889893}
+{"id": "financial_markup_01", "domain": "financial", "expected": "562.5", "response": "562.5", "extracted": "562.5", "passed": true, "latency_ms": 391.32022857666016}
+{"id": "financial_markup_02", "domain": "financial", "expected": "240", "response": "240", "extracted": "240", "passed": true, "latency_ms": 337.8558158874512}
+{"id": "financial_markup_03", "domain": "financial", "expected": "816", "response": "816", "extracted": "816", "passed": true, "latency_ms": 376.35278701782227}
+{"id": "financial_markup_04", "domain": "financial", "expected": "157.5", "response": "135", "extracted": "135", "passed": false, "latency_ms": 327.76737213134766}
+{"id": "financial_markup_05", "domain": "financial", "expected": "1170", "response": "990", "extracted": "990", "passed": false, "latency_ms": 367.6929473876953}
+{"id": "financial_commission_01", "domain": "financial", "expected": "900", "response": "1100", "extracted": "1100", "passed": false, "latency_ms": 430.7210445404053}
+{"id": "unit_convert_01", "domain": "units", "expected": "13.125", "response": "11.875", "extracted": "11.875", "passed": false, "latency_ms": 379.7035217285156}
+{"id": "unit_convert_02", "domain": "units", "expected": "82", "response": "68", "extracted": "68", "passed": false, "latency_ms": 534.221887588501}
+{"id": "unit_volume_01", "domain": "units", "expected": "1875", "response": "1875", "extracted": "1875", "passed": true, "latency_ms": 409.3296527862549}
+{"id": "unit_volume_02", "domain": "units", "expected": "48", "response": "48", "extracted": "48", "passed": true, "latency_ms": 351.49240493774414}
+{"id": "unit_volume_03", "domain": "units", "expected": "7000", "response": "7,000", "extracted": "7000", "passed": true, "latency_ms": 383.90636444091797}
+{"id": "unit_speed_01", "domain": "units", "expected": "3", "response": "3.5", "extracted": "3.5", "passed": true, "latency_ms": 343.25337409973145}
+{"id": "unit_speed_02", "domain": "units", "expected": "80", "response": "80", "extracted": "80", "passed": true, "latency_ms": 400.4807472229004}
+{"id": "schedule_01", "domain": "scheduling", "expected": "2:00 PM", "response": "1 PM", "extracted": "1", "passed": false, "latency_ms": 490.4158115386963}
+{"id": "schedule_02", "domain": "scheduling", "expected": "95", "response": "To determine the total time from start to finish for these processes, we need to consider the dependencies between them:\n\n1. Process X takes 45 minutes.\n2. Process Y takes 30 minutes and can only star", "extracted": "", "passed": true, "latency_ms": 1442.1210289001465}
+{"id": "schedule_03", "domain": "scheduling", "expected": "50", "response": "The total time is the sum of all the individual tasks:\n\n- Download: 10 minutes\n- Install: 15 minutes\n- Configuration: 5 minutes\n- Testing: 20 minutes\n\nAdding these together gives:\n\n10 + 15 + 5 + 20 = ", "extracted": "10", "passed": true, "latency_ms": 755.6910514831543}
+{"id": "schedule_04", "domain": "scheduling", "expected": "3:00 PM", "response": "3 PM", "extracted": "3", "passed": true, "latency_ms": 484.2681884765625}
+{"id": "schedule_05", "domain": "scheduling", "expected": "60", "response": "To determine when both teams finish their tasks, we need to calculate the total time each team takes to complete their tasks.\n\n**Team A:**\n- Team A has 3 tasks, each taking 20 minutes.\n- Total time fo", "extracted": "", "passed": true, "latency_ms": 1635.4496479034424}
+{"id": "schedule_06", "domain": "scheduling", "expected": "2.4", "response": "To find out how long it takes for Worker A and Worker B to complete the job together, we first determine their work rates and then combine them.\n\nWorker A's rate is \\(\\frac{1}{6}\\) of the job per hour", "extracted": "", "passed": false, "latency_ms": 1636.098861694336}
+{"id": "schedule_07", "domain": "scheduling", "expected": "5", "response": "To determine how long it will take for both printers to print a total of 250 pages together, we first need to calculate their combined printing rate.\n\nThe first printer prints at a rate of 30 pages pe", "extracted": "250", "passed": true, "latency_ms": 1734.919786453247}
+{"id": "logic_order_01", "domain": "logic", "expected": "Alice, Eve, Bob, Carol, Dave", "response": "The finish order from first to last is: Alice, Eve, Bob, Carol, Dave.", "extracted": "The finish order from first to last is: Alice, Eve, Bob, Carol, Dave.", "passed": true, "latency_ms": 479.9826145172119}
+{"id": "logic_order_02", "domain": "logic", "expected": "Yellow, Red, Blue, Orange, Green", "response": "To determine the order of the books from left to right, let's analyze the given conditions:\n\n1. Red is left of Blue.\n2. Green is right of Blue.\n3. Yellow is left of Red.\n4. Orange is between Blue and ", "extracted": "To determine the order of the books from left to right, let's analyze the given conditions:\n\n1. Red is left of Blue.\n2. Green is right of Blue.\n3. Yellow is left of Red.\n4. Orange is between Blue and Green.\n\nLet's start placing the books based on these conditions:\n\n- Since Yellow is left of Red, Yellow must be placed before Red.\n- Red is left of Blue, so Red must be placed before Blue.\n- Orange is between Blue and Green, so the order must be Blue", "passed": true, "latency_ms": 2219.191312789917}
+{"id": "logic_modus_01", "domain": "logic", "expected": "yes", "response": "Yes.", "extracted": "yes", "passed": true, "latency_ms": 440.3417110443115}
+{"id": "logic_modus_02", "domain": "logic", "expected": "yes", "response": "Yes.", "extracted": "yes", "passed": true, "latency_ms": 380.2187442779541}
+{"id": "logic_modus_03", "domain": "logic", "expected": "cannot determine", "response": "Cannot determine.\n\nThe statements provided are:\n\n1. All programmers know logic.\n2. All logicians are good at puzzles.\n3. Sam is a programmer.\n\nFrom these statements, we can conclude that Sam knows log", "extracted": "1.", "passed": true, "latency_ms": 1435.1723194122314}
+{"id": "logic_sets_01", "domain": "logic", "expected": "45", "response": "45", "extracted": "45", "passed": true, "latency_ms": 473.30260276794434}
+{"id": "logic_sets_02", "domain": "logic", "expected": "5", "response": "5", "extracted": "5", "passed": true, "latency_ms": 453.53102684020996}
+{"id": "logic_sets_03", "domain": "logic", "expected": "10", "response": "20", "extracted": "20", "passed": false, "latency_ms": 342.69070625305176}
+{"id": "spatial_direction_01", "domain": "spatial", "expected": "South", "response": "South.", "extracted": "South.", "passed": true, "latency_ms": 375.45251846313477}
+{"id": "spatial_direction_02", "domain": "spatial", "expected": "West", "response": "West", "extracted": "West", "passed": true, "latency_ms": 817.2118663787842}
+{"id": "spatial_direction_03", "domain": "spatial", "expected": "West", "response": "West", "extracted": "West", "passed": true, "latency_ms": 451.50089263916016}
+{"id": "spatial_grid_01", "domain": "spatial", "expected": "(2,2)", "response": "Starting at position (0,0):\n\n1. Move right 3 steps: (0,0) \u2192 (3,0)\n2. Move up 2 steps: (3,0) \u2192 (3,2)\n3. Move left 1 step: (3,2) \u2192 (2,2)\n\nThe final position is (2,2).", "extracted": "00", "passed": true, "latency_ms": 690.1655197143555}
+{"id": "spatial_grid_02", "domain": "spatial", "expected": "(7,3)", "response": "Starting at (5,5):\n\n1. Move left 2: (5 - 2, 5) = (3, 5)\n2. Move down 3: (3, 5 - 3) = (3, 2)\n3. Move right 4: (3 + 4, 2) = (7, 2)\n4. Move up 1: (7, 2 + 1) = (7, 3)\n\nThe final position is (7, 3).", "extracted": "55", "passed": false, "latency_ms": 745.7365989685059}
+{"id": "spatial_relative_01", "domain": "spatial", "expected": "Southeast", "response": "D is southeast of A.", "extracted": "D is southeast of A.", "passed": true, "latency_ms": 986.9036674499512}
+{"id": "spatial_relative_02", "domain": "spatial", "expected": "yes", "response": "Yes.", "extracted": "yes", "passed": true, "latency_ms": 394.49572563171387}
+{"id": "procedural_state_01", "domain": "procedural", "expected": "Yellow", "response": "The traffic light follows the cycle: Green \u2192 Yellow \u2192 Red \u2192 Green. Starting from Green, let's track the changes:\n\n1. First change: Green \u2192 Yellow\n2. Second change: Yellow \u2192 Red\n3. Third change: Red \u2192 ", "extracted": "", "passed": true, "latency_ms": 881.066083908081}
+{"id": "procedural_state_02", "domain": "procedural", "expected": "Locked", "response": "To determine the final state of the door, let's follow the sequence of actions starting from the Locked state:\n\n1. **Unlock**: The door goes from Locked to Closed.\n2. **Open**: The door goes from Clos", "extracted": "", "passed": true, "latency_ms": 1183.4142208099365}
+{"id": "procedural_recipe_01", "domain": "procedural", "expected": "Added milk before steeping", "response": "If you follow the steps in the order 1, 2, 5, 3, 4, there are a couple of issues with the process:\n\n1. **Adding Milk Before Steeping**: By adding milk immediately after the tea bag, before steeping, y", "extracted": "1", "passed": false, "latency_ms": 2387.098550796509}
+{"id": "procedural_recipe_02", "domain": "procedural", "expected": "Passwor1", "response": "Password1", "extracted": "1", "passed": true, "latency_ms": 394.27638053894043}
+{"id": "procedural_undo_01", "domain": "procedural", "expected": "Hello World?", "response": "The initial text in the editor is \"Hello\". Let's go through the actions step by step:\n\n1. **Append ' World'**: The text becomes \"Hello World\".\n2. **Append '!'**: The text becomes \"Hello World!\".\n3. **", "extracted": "1.", "passed": true, "latency_ms": 959.4566822052002}
+{"id": "procedural_undo_02", "domain": "procedural", "expected": "empty", "response": "The stack operations proceed as follows:\n\n1. Start empty: Stack = []\n2. Push A: Stack = [A]\n3. Push B: Stack = [A, B]\n4. Pop: Stack = [A] (B is removed)\n5. Push C: Stack = [A, C]\n6. Pop: Stack = [A] (", "extracted": "1.", "passed": true, "latency_ms": 1408.557653427124}
+{"id": "text_string_01", "domain": "text", "expected": "LLEH", "response": "To solve this, first reverse the word \"HELLO\" to get \"OLLEH\". Then, remove the first letter of this reversed word, which is \"O\". This leaves you with \"LLEH\".", "extracted": "", "passed": true, "latency_ms": 721.3387489318848}
+{"id": "text_string_02", "domain": "text", "expected": "DCB", "response": "Starting with \"ABCDE\", we first remove the vowels \"A\" and \"E\", leaving us with \"BCD\". Then, we reverse \"BCD\" to get \"DCB\". Therefore, the result is \"DCB\".", "extracted": "", "passed": true, "latency_ms": 960.1335525512695}
+{"id": "text_string_03", "domain": "text", "expected": "PRGR", "response": "The word \"PROGRAMMING\" consists of the following letters: P, R, O, G, R, A, M, M, I, N, G. The vowels in this word are O, A, and I. Removing these vowels, we are left with the consonants: P, R, G, R, ", "extracted": "", "passed": true, "latency_ms": 1229.5448780059814}
+{"id": "text_word_01", "domain": "text", "expected": "fox", "response": "The sentence 'The quick brown fox' reversed in word order is 'fox brown quick The'. The first word in this reversed order is 'fox'.", "extracted": "The sentence 'The quick brown fox' reversed in wor", "passed": true, "latency_ms": 585.4918956756592}
+{"id": "text_word_02", "domain": "text", "expected": "CDB", "response": "The result is \"CDB\".", "extracted": "The result is \"CDB\".", "passed": true, "latency_ms": 397.45545387268066}
+{"id": "text_encode_01", "domain": "text", "expected": "BZS", "response": "To solve this, we need to perform two operations on each letter of the word \"CAT\".\n\n1. **Shift each letter in 'CAT' forward by 1:**\n   - C \u2192 D\n   - A \u2192 B\n   - T \u2192 U\n\n   After shifting forward, the wor", "extracted": "", "passed": true, "latency_ms": 1406.911849975586}
+{"id": "text_encode_02", "domain": "text", "expected": "HILLU", "response": "To replace each vowel in \"HELLO\" with the next vowel in the sequence A\u2192E, E\u2192I, I\u2192O, O\u2192U, U\u2192A, we perform the following substitutions:\n\n- 'E' is replaced by 'I'\n- 'O' is replaced by 'U'\n\nThe consonants", "extracted": "", "passed": true, "latency_ms": 1659.1601371765137}
+{"id": "sequence_letter_01", "domain": "sequence", "expected": "I", "response": "The pattern follows a sequence of letters with a step of two in the alphabet: A (1), C (3), E (5), G (7). Each letter corresponds to an odd number, increasing by 2 each time. The next odd number after", "extracted": "1", "passed": true, "latency_ms": 1006.1733722686768}
+{"id": "sequence_letter_02", "domain": "sequence", "expected": "R", "response": "The pattern involves skipping one letter backward in the alphabet. Here's the breakdown:\n\n- Z (skip Y)\n- X (skip W)\n- V (skip U)\n- T (skip S)\n\nFollowing this pattern, after T, you skip S and the next ", "extracted": "", "passed": true, "latency_ms": 1073.8365650177002}
+{"id": "sequence_letter_03", "domain": "sequence", "expected": "P", "response": "The pattern in the sequence is based on the positions of the letters in the alphabet. Let's examine the sequence:\n\n- A is the 1st letter of the alphabet.\n- B is the 2nd letter.\n- D is the 4th letter.\n", "extracted": "1", "passed": true, "latency_ms": 1790.4012203216553}
+{"id": "sequence_symbol_01", "domain": "sequence", "expected": "\u25cb", "response": "The given pattern alternates between \u25cf and \u25cb: \u25cf\u25cb\u25cf\u25cb\u25cf. Following this alternating sequence, the next symbol should be \u25cb.", "extracted": "", "passed": true, "latency_ms": 2369.978666305542}
+{"id": "sequence_symbol_02", "domain": "sequence", "expected": "\u25cb", "response": "The given pattern is \u25cf\u25cf\u25cb\u25cf\u25cf\u25cb\u25cf\u25cf_. Observing the sequence, we can see that it repeats every three symbols: \u25cf\u25cf\u25cb. \n\nBreaking it down:\n1. \u25cf\u25cf\u25cb\n2. \u25cf\u25cf\u25cb\n3. \u25cf\u25cf\n\nThe next symbol in the sequence should follow the ", "extracted": "", "passed": true, "latency_ms": 1162.7376079559326}
+{"id": "sequence_word_01", "domain": "sequence", "expected": "four", "response": "The word that fills the blank is \"four.\" The pattern is a sequence of counting numbers.", "extracted": "The word that fills the blank is \"four.\" The patte", "passed": true, "latency_ms": 533.9739322662354}
+{"id": "sequence_word_02", "domain": "sequence", "expected": "September", "response": "The pattern consists of months that are two months apart: January, March, May, July. Following this pattern, the next month would be September.", "extracted": "", "passed": true, "latency_ms": 555.8648109436035}
+{"id": "causal_chain_01", "domain": "causal", "expected": "The power went out", "response": "The root cause of everyone getting sick was the power going out. This initial event led to the fridge stopping, which in turn caused the food to spoil, ultimately resulting in everyone getting sick fr", "extracted": "", "passed": false, "latency_ms": 3198.387861251831}
+{"id": "causal_chain_02", "domain": "causal", "expected": "He is late", "response": "Based on the given chain of events, if the alarm doesn't ring, Tom oversleeps. Since the alarm didn't ring, Tom oversleeps. If Tom oversleeps, he misses the bus. Therefore, Tom misses the bus. If he m", "extracted": "", "passed": false, "latency_ms": 811.7859363555908}
+{"id": "causal_counter_01", "domain": "causal", "expected": "no", "response": "The answer is \"unknown.\" While the plant died due to lack of water, we cannot be certain that watering it would have guaranteed its survival. Other factors, such as disease, poor soil conditions, or e", "extracted": "no", "passed": true, "latency_ms": 1328.5939693450928}
+{"id": "causal_counter_02", "domain": "causal", "expected": "no", "response": "If the dial had been working properly, it would have allowed you to set the oven to the correct temperature. Assuming all other factors remained constant and the cake recipe was followed correctly, th", "extracted": "no", "passed": true, "latency_ms": 1479.551076889038}
+{"id": "causal_necessary_01", "domain": "causal", "expected": "not necessarily", "response": "Not necessarily. While water is essential for plant growth, other factors such as sunlight, soil quality, temperature, and nutrients are also crucial. A plant may not grow if any of these other condit", "extracted": "", "passed": true, "latency_ms": 797.9154586791992}
+{"id": "causal_necessary_02", "domain": "causal", "expected": "no", "response": "No.", "extracted": "no", "passed": true, "latency_ms": 408.13302993774414}
+{"id": "causal_necessary_03", "domain": "causal", "expected": "yes", "response": "Yes.", "extracted": "yes", "passed": true, "latency_ms": 388.98277282714844}

results/gpt-4o_20260103_184426_summary.json ADDED Viewed

	@@ -0,0 +1,71 @@

+{
+  "total": 101,
+  "passed": 83,
+  "failed": 18,
+  "pass_rate": 0.8217821782178217,
+  "by_domain": {
+    "math_discount": {
+      "passed": 15,
+      "total": 15,
+      "pass_rate": 1.0
+    },
+    "time": {
+      "passed": 12,
+      "total": 13,
+      "pass_rate": 0.9230769230769231
+    },
+    "recipe": {
+      "passed": 6,
+      "total": 7,
+      "pass_rate": 0.8571428571428571
+    },
+    "financial": {
+      "passed": 3,
+      "total": 10,
+      "pass_rate": 0.3
+    },
+    "units": {
+      "passed": 5,
+      "total": 7,
+      "pass_rate": 0.7142857142857143
+    },
+    "scheduling": {
+      "passed": 5,
+      "total": 7,
+      "pass_rate": 0.7142857142857143
+    },
+    "logic": {
+      "passed": 7,
+      "total": 8,
+      "pass_rate": 0.875
+    },
+    "spatial": {
+      "passed": 6,
+      "total": 7,
+      "pass_rate": 0.8571428571428571
+    },
+    "procedural": {
+      "passed": 5,
+      "total": 6,
+      "pass_rate": 0.8333333333333334
+    },
+    "text": {
+      "passed": 7,
+      "total": 7,
+      "pass_rate": 1.0
+    },
+    "sequence": {
+      "passed": 7,
+      "total": 7,
+      "pass_rate": 1.0
+    },
+    "causal": {
+      "passed": 5,
+      "total": 7,
+      "pass_rate": 0.7142857142857143
+    }
+  },
+  "avg_latency_ms": 738.5695429131537,
+  "model": "gpt-4o",
+  "timestamp": "20260103_184426"
+}