Spaces: victordibia/flow

victordibia committed on Feb 3

Commit a08910d

1 Parent(s): a23ff80

Deploy 2026-02-03 00:28:32

This view is limited to 50 files because it contains too many changes.

Files changed (50)

README.md +180 -69
pyproject.toml +9 -2
src/flow/cli/app.py +45 -8
src/flow/cli/optimize.py +386 -29
src/flow/cli/repl.py +12 -10
src/flow/experiments/__init__.py +0 -2
src/flow/experiments/ablation.py +10 -31
src/flow/experiments/data/tasks/coding.jsonl +5 -10
src/flow/experiments/data/tasks/gaia_all.jsonl +0 -0
src/flow/experiments/data/tasks/gaia_level1.jsonl +106 -0
src/flow/experiments/data/tasks/gaia_level2.jsonl +172 -0
src/flow/experiments/data/tasks/gaia_level3.jsonl +52 -0
src/flow/experiments/evaluators/heuristic.py +1 -1
src/flow/experiments/evaluators/llm.py +14 -5
src/flow/experiments/models.py +311 -32
src/flow/experiments/optimizer.py +65 -13
src/flow/experiments/runner.py +11 -5
src/flow/experiments/types.py +50 -0
src/flow/harness/__init__.py +23 -1
src/flow/harness/base.py +24 -21
src/flow/harness/langgraph/__init__.py +37 -0
src/flow/harness/langgraph/compaction.py +51 -0
src/flow/harness/langgraph/harness.py +257 -0
src/flow/harness/langgraph/otel_callback.py +173 -0
src/flow/harness/langgraph/wrappers.py +76 -0
src/flow/harness/maf/__init__.py +4 -0
src/flow/harness/maf/agent.py +15 -18
src/flow/harness/maf/harness.py +68 -51
src/flow/harness/maf/tools/__init__.py +96 -115
src/flow/harness/maf/tools/coding.py +0 -391
src/flow/harness/maf/tools/core.py +0 -100
src/flow/harness/maf/tools/execution.py +0 -479
src/flow/harness/maf/tools/memory.py +0 -260
src/flow/harness/maf/tools/sub_agent.py +0 -196
src/flow/harness/maf/wrappers.py +64 -0
src/flow/harness/miniagent/__init__.py +139 -0
src/flow/harness/miniagent/agent.py +604 -0
src/flow/harness/miniagent/client.py +185 -0
src/flow/harness/miniagent/context.py +664 -0
src/flow/harness/miniagent/harness.py +403 -0
src/flow/harness/miniagent/hooks.py +209 -0
src/flow/harness/miniagent/instructions.py +207 -0
src/flow/harness/miniagent/messages.py +88 -0
src/flow/harness/miniagent/otel.py +258 -0
src/flow/harness/miniagent/tool.py +173 -0
src/flow/harness/miniagent/tools/__init__.py +125 -0
src/flow/harness/miniagent/workspace.py +198 -0
src/flow/harness/registry.py +80 -0
src/flow/llm/__init__.py +49 -0
src/flow/llm/config.py +227 -0

README.md CHANGED

@@ -1,124 +1,235 @@
----
-title: Flow
-emoji: 🔄
-colorFrom: blue
-colorTo: purple
-sdk: docker
-app_port: 7860
-pinned: false
----
 # Flow
-**Evaluate and Optimize Coding Agent Configurations**
-Flow is a framework for running experiments on LLM coding agents. Compare context engineering strategies (message compaction, agent memory, sub-agents), evaluate results with LLM-as-Judge, and find optimal configurations that balance quality and token cost.
 ![Flow UI](docs/flow.png)
-## Features
-- **Ablation Studies**: Test different agent configurations side-by-side
-- **LLM-as-Judge Evaluation**: Automatically score agent outputs for correctness
-- **Pareto Analysis**: Find optimal quality vs. cost tradeoffs
-- **Web UI**: Visual interface for managing experiments and viewing results
-- **Config Export**: Export winning configurations for production use
 ## Quick Start
 ### 1. Install
 ```bash
-# Clone and install with uv
 git clone https://github.com/victordibia/flow
 cd flow
 uv sync
 ```
-### 2. Configure Azure OpenAI
 ```bash
-export AZURE_OPENAI_API_KEY="your-api-key"
-export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
-export AZURE_OPENAI_DEPLOYMENT="gpt-4o"
 ```
-### 3. Run Optimization
 ```bash
-# Run with built-in task suite
-uv run flow optimize --suite coding
-# Or with custom tasks
-uv run flow optimize --tasks my_tasks.jsonl
 ```
-### 4. Launch Web UI
 ```bash
 uv run flow serve
-# Opens at http://localhost:8091
 ```
-## CLI Commands
-```bash
-flow optimize [OPTIONS]   # Run optimization experiments
-flow serve               # Start the web UI
-flow run [TASK]          # Run a single agent task
-flow config              # Show current configuration
-flow init                # Initialize Flow directories
 ```
-## What Gets Optimized
-Flow tests different **context engineering strategies**:
-| Strategy | Description |
-|----------|-------------|
-| **Message Compaction** | Keep first N + last M messages, discard middle |
-| **Agent Memory** | Persistent storage the agent controls |
-| **Sub-Agent Isolation** | Delegate research to isolated sub-agent |
-Example configurations:
-```python
-from flow.experiments.models import Agent, CompactionConfig, GridSearchStrategy
-# Define a base agent
-base = Agent(name="my_agent", enable_memory=True)
-# Generate candidates via grid search
-strategy = GridSearchStrategy(variations={
-    "enable_memory": [True, False],
-    "compaction": [CompactionConfig.head_tail(10, 40), CompactionConfig.none()],
-})
-candidates = strategy.generate(base, budget=10)
 ```
-## Task Format
-Tasks are defined in JSONL format:
-```json
-{"name": "fizzbuzz", "prompt": "Create fizzbuzz.py and run it", "criteria": [{"name": "correct", "instruction": "Output shows FizzBuzz pattern"}]}
 ```
-## Development
 ```bash
-# Install dev dependencies
-uv sync --dev
-# Run tests
-uv run pytest tests/ -v
-# Type checking
-uv run pyright src/
-# Linting
-uv run ruff check src/
-uv run ruff format src/
 ```
 ## License

 # Flow
+> [!NOTE]
+> Flow is an experimental prototype and changing rapidly.
+Flow helps you find the best configuration for your AI coding agent. Define your agent spec, provide evaluation tasks, and Flow automatically generates variants, scores them, and shows you the quality vs. cost tradeoffs.
+- **Simplified experimentation** — Automates the search for optimal agent configurations
+- **Transparency** — See exactly what was tested, scores, and tradeoffs on a Pareto chart
+- **User control** — Choose your tasks, evaluation criteria, and approve variants
+- **Framework agnostic** — Standardized agent spec with pluggable runtime adapters (MAF built-in, extensible)
 ![Flow UI](docs/flow.png)
+## How It Works
+```mermaid
+flowchart LR
+    A[Agent Spec] --> D[Optimizer]
+    B[Tasks] --> D
+    C[Evaluator] --> D
+    D --> E[Agent Variants/Candidates]
+    E --> F[Pareto Graph]
+```
+## Core Concepts
+| Component      | What It Is                                                                          |
+| -------------- | ----------------------------------------------------------------------------------- |
+| **Agent Spec** | Agent configuration (model, tools, compaction, instructions) with pluggable runtime |
+| **Task**       | A coding challenge with evaluation criteria                                         |
+| **Evaluator**  | Scores agent output (LLM-as-Judge, heuristics, or trace-based)                      |
+| **Optimizer**  | Generates variants and runs experiments (GridSearch, extensible)                    |
 ## Quick Start
 ### 1. Install
 ```bash
 git clone https://github.com/victordibia/flow
 cd flow
 uv sync
 ```
+### 2. Configure
+Create a `.env` file in the project root:
 ```bash
+AZURE_OPENAI_API_KEY=your-api-key-here
+AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
+AZURE_OPENAI_CHAT_DEPLOYMENT_NAME=gpt-4o-mini
 ```
+**Important:** Make sure your Azure OpenAI deployment has adequate rate limits:
+- **Minimum:** 10,000 tokens per minute (TPM)
+- **Recommended:** 30,000+ TPM for optimization runs
+See [Azure Portal](https://portal.azure.com) → Your OpenAI resource → Deployments to adjust rate limits.
+### 3. Test Your Setup
+Before running optimization, verify your Azure OpenAI connection:
 ```bash
+# Test Azure OpenAI connection
+uv run python scripts/test_azure_connection.py
+# Test basic agent execution
+uv run python scripts/test_basic_agent.py
+# Test LLM evaluator
+uv run python scripts/test_evaluator.py
 ```
+All tests should pass with non-zero scores and token counts.
+### 4. Run
 ```bash
+# Launch the web UI
 uv run flow serve
+# Or run optimization from CLI (base agent + variations + tasks)
+uv run flow optimize --agent base.yaml --vary compaction,memory --tasks tasks.jsonl
 ```
+## Agent Spec
+Define your agent configuration:
+```python
+from flow.experiments.models import Agent, CompactionConfig
+agent = Agent(
+    name="my-agent",
+    framework="maf",  # default; extensible to other runtimes
+    instructions="You are a coding assistant",
+    tools="standard",  # or "minimal", "full", "readonly"
+    compaction=CompactionConfig.head_tail(10, 40),  # keep first 10 + last 40 messages
+)
 ```
+Flow tests variations like:
+- **Compaction strategies** — `none`, `head_tail(N, M)`, `last_n(N)`
+- **Tool configurations** — different tool sets
+- **Instructions** — prompt variations
+## Task Format
+Tasks are JSONL with evaluation criteria:
+```json
+{
+  "name": "fizzbuzz",
+  "prompt": "Create fizzbuzz.py and run it",
+  "criteria": [
+    { "name": "correct", "instruction": "Output shows FizzBuzz pattern" }
+  ]
+}
+```
+## Web UI
+Launch with `uv run flow serve`. Create agents, import task suites, run optimization jobs, and view results with Pareto analysis. Test agents interactively with live trace streaming.
+## CLI Commands
+```bash
+# Web UI
+flow serve                                          # Start the web UI
+# Optimization
+flow optimize --agent base.yaml --tasks tasks.jsonl # Optimize base agent
+flow optimize --vary compaction,memory              # Vary specific parameters
+flow optimize --suite coding                        # Use built-in task suite
+# Single Task Execution
+flow run "Create hello.py"                          # Run a single task
+flow run --config best.yaml "task"                  # Run with optimized config
+# Testing & Diagnostics
+python scripts/test_azure_connection.py             # Test Azure OpenAI connection
+python scripts/test_basic_agent.py                  # Test basic agent execution
+python scripts/test_evaluator.py                    # Test LLM evaluator
 ```
+## Optimizer
+Flow includes multiple optimization strategies for finding the best agent configuration.
+### Grid Search (Default)
+Test predefined variations of your agent:
+```bash
+# Vary compaction and memory settings
+flow optimize --agent examples/base_agent.yaml --vary compaction,memory --tasks examples/coding_tasks.jsonl
+# Or define variations in a config file
+flow optimize --config variations.yaml --agent base_agent.yaml --tasks tasks.jsonl
 ```
+### GEPA (Active Learning)
+Use GEPA (Generative Evolutionary Prompt Adjustment) for automatic prompt optimization:
 ```bash
+# Run GEPA optimization
+flow optimize \
+  --config examples/gepa_strategy.yaml \
+  --agent examples/base_agent.yaml \
+  --tasks examples/coding_tasks.jsonl \
+  --budget 10 \
+  --parallel 2
+```
+**GEPA Configuration:**
+1. **Strategy Config** (`examples/gepa_strategy.yaml`):
+   ```yaml
+   strategy_type: gepa
+   config:
+     reflection_lm: gpt-4o-mini  # Model for GEPA's reflection
+   ```
+2. **Base Agent** (`examples/base_agent.yaml`):
+   ```yaml
+   name: coding-assistant
+   model: gpt-4o-mini            # Model for agent execution
+   tools: standard
+   instructions: |
+     Your initial prompt that GEPA will optimize...
+   ```
+3. **Run Optimization:**
+   - `--budget`: Number of optimization iterations (default: 10)
+   - `--parallel`: Concurrent evaluations (default: 4)
+   - Tasks must include evaluation criteria for LLM scoring
+**Example Output:**
+```
+[1/10] coding-assistant_gepa_eval/fibonacci: ✓ score=0.85 tokens=1,245
+[2/10] coding-assistant_gepa_eval/palindrome: ✓ score=0.78 tokens=982
+...
+Best agent exported to: ~/.flow/optimizations/<timestamp>/agents/best_score.yaml
+```
+### Requirements for Optimization
+- **Azure OpenAI Deployment:** Create a deployment with your chosen model (e.g., `gpt-4o-mini`)
+- **Rate Limits:** Minimum 10K TPM; 30K+ recommended for smooth runs
+- **Task Criteria:** Tasks need evaluation criteria for LLM-based scoring:
+  ```json
+  {
+    "name": "task_name",
+    "prompt": "Task description",
+    "criteria": [
+      {"name": "correctness", "instruction": "Solution is correct", "weight": 1.0},
+      {"name": "quality", "instruction": "Code is clean and documented", "weight": 0.7}
+    ]
+  }
+  ```
+## Development
+```bash
+uv sync --dev            # Install dev dependencies
+uv run pytest tests/ -v  # Run tests
+uv run pyright src/      # Type checking
+uv run ruff check src/   # Linting
 ```
 ## License
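
The README describes `CompactionConfig.head_tail(10, 40)` as "keep first 10 + last 40 messages". The snippet below is a minimal sketch of that idea only; `head_tail_compact` and the message dict shape are hypothetical names chosen for illustration, not Flow's actual implementation.

```python
def head_tail_compact(messages: list[dict], head: int = 10, tail: int = 40) -> list[dict]:
    """Keep the first `head` and last `tail` messages and drop the middle.

    Hypothetical helper: it only illustrates the head/tail strategy named in
    the README and does not mirror Flow's internals.
    """
    if head < 0 or tail < 0:
        raise ValueError("head and tail must be non-negative")
    # Nothing to drop when the history already fits in head + tail.
    if len(messages) <= head + tail:
        return list(messages)
    # messages[-0:] would return the whole list, so handle tail == 0 explicitly.
    tail_part = messages[-tail:] if tail else []
    return messages[:head] + tail_part


history = [{"role": "user", "content": f"message {i}"} for i in range(100)]
compacted = head_tail_compact(history, head=10, tail=40)
print(len(compacted))  # 50
```

With 100 messages, the result keeps messages 0-9 and 60-99. A real harness would probably also need to keep tool calls paired with their results when cutting the middle; the sketch ignores that.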

pyproject.toml CHANGED

@@ -26,7 +26,7 @@ dependencies = [
     "typer>=0.9.0",
     "httpx>=0.25.0",
     "python-dotenv>=1.0.0",
-    "agent-framework-core>=1.0.0b0",
     "azure-identity>=1.15.0",
     "pyyaml>=6.0.0",
     # OpenTelemetry for experiments tracing
@@ -38,14 +38,21 @@ dependencies = [
     "uvicorn>=0.27.0",
     "sqlmodel>=0.0.14",
     "aiosqlite>=0.19.0",
 ]
 [project.optional-dependencies]
 # Optional features
 research = ["beautifulsoup4>=4.12.0", "html2text>=2024.2.26"]
 # Bundles
-all = ["flow-agent[research]"]
 dev = [
     "pytest>=8.0.0",
     "pytest-asyncio>=0.23.0",

     "typer>=0.9.0",
     "httpx>=0.25.0",
     "python-dotenv>=1.0.0",
+    "agent-framework-core>=1.0.0b5",
     "azure-identity>=1.15.0",
     "pyyaml>=6.0.0",
     # OpenTelemetry for experiments tracing
     "uvicorn>=0.27.0",
     "sqlmodel>=0.0.14",
     "aiosqlite>=0.19.0",
+    "tiktoken>=0.12.0",
 ]
 [project.optional-dependencies]
 # Optional features
 research = ["beautifulsoup4>=4.12.0", "html2text>=2024.2.26"]
+langgraph = [
+    "langgraph>=0.2.0",
+    "langchain-core>=0.3.0",
+    "langchain-openai>=0.2.0",
+]
+optimizer = ["gepa>=0.0.20"]
 # Bundles
+all = ["flow-agent[research,langgraph,optimizer]"]
 dev = [
     "pytest>=8.0.0",
     "pytest-asyncio>=0.23.0",

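This change adds the `langgraph` and `optimizer` extras next to `research`. One way to check which extras are usable in the current environment is to probe for a module each one provides. The sketch below assumes the import names `langgraph`, `gepa`, and `bs4`; `EXTRA_MODULES` and `missing_extras` are hypothetical names, not part of Flow.

```python
from importlib.util import find_spec

# Maps each extra from pyproject.toml to one top-level module it should provide.
# The module names are assumptions about how the packages are imported.
EXTRA_MODULES = {
    "research": "bs4",
    "langgraph": "langgraph",
    "optimizer": "gepa",
}


def missing_extras(extras: dict[str, str] = EXTRA_MODULES) -> list[str]:
    """Return the names of extras whose probe module cannot be found."""
    return [name for name, module in extras.items() if find_spec(module) is None]


if __name__ == "__main__":
    for name in missing_extras():
        print(f"Optional extra not installed: flow-agent[{name}]")
```

`find_spec` only locates the module without importing it, so the check stays cheap even for heavy dependencies.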
src/flow/cli/app.py CHANGED

@@ -11,6 +11,7 @@ from pathlib import Path
 from typing import Annotated
 import typer
 from rich.console import Console
 from flow import __version__
@@ -61,6 +62,10 @@ def run(
         Path | None,
         typer.Option("--config", "-c", help="Config file from optimization (YAML)"),
     ] = None,
     interactive: Annotated[
         bool,
         typer.Option("--interactive/--no-interactive", "-i", help="Interactive mode"),
@@ -82,9 +87,14 @@ def run(
     workspace_path.mkdir(parents=True, exist_ok=True)
     memory_path.mkdir(parents=True, exist_ok=True)
     if task:
         # Single task mode
-        asyncio.run(_run_single_task(workspace_path, memory_path, task, config))
     elif interactive:
         # Interactive REPL mode
         from flow.cli.repl import FlowREPL
@@ -100,22 +110,47 @@ async def _run_single_task(
     memory_path: Path,
     task: str,
     config_path: Path | None = None,
 ) -> None:
     """Run a single task and print the result."""
     from flow.cli.output import print_event
     from flow.harness.base import EventType
-    from flow.harness.maf import MAFHarness
     if config_path:
         # Load agent config from optimization result
         from flow.experiments.models import load_agent
-        from flow.experiments.ablation import create_harness_from_agent
         agent_config = load_agent(config_path)
-        console.print(f"[dim]Using agent config: {agent_config.name}[/]")
-        harness = create_harness_from_agent(agent_config, workspace)
     else:
-        harness = MAFHarness(workspace=workspace, memory_path=memory_path)
     try:
         console.print("\n[bold blue]Flow[/] - Executing task...\n")
@@ -237,7 +272,7 @@ def config() -> None:
     table.add_row("Workspace", str(DEFAULT_WORKSPACE))
     table.add_row("Memory Path", str(DEFAULT_MEMORY_PATH))
     table.add_row("Azure Endpoint", os.environ.get("AZURE_OPENAI_ENDPOINT", "(not set)"))
-    table.add_row("Azure Deployment", os.environ.get("AZURE_OPENAI_DEPLOYMENT", "(not set)"))
     console.print(table)
@@ -256,7 +291,7 @@ def init() -> None:
     console.print("  1. Set your Azure OpenAI credentials:")
     console.print("     [dim]export AZURE_OPENAI_API_KEY=your-key[/]")
     console.print("     [dim]export AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/[/]")
-    console.print("     [dim]export AZURE_OPENAI_DEPLOYMENT=your-deployment[/]")
     console.print("\n  2. Run Flow:")
     console.print('     [dim]flow run "Create a hello world Python script"[/]')
     console.print("     [dim]flow run -i  # Interactive mode[/]")
@@ -264,6 +299,8 @@ def init() -> None:
 def main() -> None:
     """Main entry point."""
     app()

 from typing import Annotated
 import typer
+from dotenv import load_dotenv
 from rich.console import Console
 from flow import __version__
         Path | None,
         typer.Option("--config", "-c", help="Config file from optimization (YAML)"),
     ] = None,
+    framework: Annotated[
+        str,
+        typer.Option("--framework", "-f", help="Agent framework: 'maf', 'miniagent', or 'langgraph'"),
+    ] = "maf",
     interactive: Annotated[
         bool,
         typer.Option("--interactive/--no-interactive", "-i", help="Interactive mode"),
     workspace_path.mkdir(parents=True, exist_ok=True)
     memory_path.mkdir(parents=True, exist_ok=True)
+    # Validate framework
+    if framework not in ("maf", "miniagent", "langgraph"):
+        console.print(f"[red]Error:[/] Unknown framework '{framework}'. Use 'maf', 'miniagent', or 'langgraph'.")
+        raise typer.Exit(1)
     if task:
         # Single task mode
+        asyncio.run(_run_single_task(workspace_path, memory_path, task, config, framework))
     elif interactive:
         # Interactive REPL mode
         from flow.cli.repl import FlowREPL
     memory_path: Path,
     task: str,
     config_path: Path | None = None,
+    framework: str = "maf",
 ) -> None:
     """Run a single task and print the result."""
     from flow.cli.output import print_event
     from flow.harness.base import EventType
+    # Import harness modules to register them
+    import flow.harness.maf  # noqa: F401
+    import flow.harness.miniagent  # noqa: F401  # pyright: ignore[reportUnusedImport]
+    if framework == "langgraph":
+        try:
+            import flow.harness.langgraph  # noqa: F401
+        except ImportError:
+            console.print("[red]Error:[/] LangGraph dependencies not installed.")
+            console.print("[dim]Install with: pip install flow-agent[langgraph][/]")
+            raise typer.Exit(1)
+    from flow.harness import create_harness
+    from flow.experiments.models import Agent
     if config_path:
         # Load agent config from optimization result
         from flow.experiments.models import load_agent
         agent_config = load_agent(config_path)
+        # Override framework if specified
+        if framework != "maf":
+            agent_config = Agent(
+                name=agent_config.name,
+                framework=framework,
+                tools=agent_config.tools,
+                model=agent_config.model,
+                instructions=agent_config.instructions,
+                compaction=agent_config.compaction,
+            )
+        console.print(f"[dim]Using agent config: {agent_config.name} ({framework})[/]")
+        harness = create_harness(agent_config, workspace)
     else:
+        agent = Agent(name="flow-cli", framework=framework)
+        harness = create_harness(agent, workspace)
     try:
         console.print("\n[bold blue]Flow[/] - Executing task...\n")
     table.add_row("Workspace", str(DEFAULT_WORKSPACE))
     table.add_row("Memory Path", str(DEFAULT_MEMORY_PATH))
     table.add_row("Azure Endpoint", os.environ.get("AZURE_OPENAI_ENDPOINT", "(not set)"))
+    table.add_row("Azure Deployment", os.environ.get("AZURE_OPENAI_CHAT_DEPLOYMENT_NAME", "(not set)"))
     console.print(table)
     console.print("  1. Set your Azure OpenAI credentials:")
     console.print("     [dim]export AZURE_OPENAI_API_KEY=your-key[/]")
     console.print("     [dim]export AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/[/]")
+    console.print("     [dim]export AZURE_OPENAI_CHAT_DEPLOYMENT_NAME=your-deployment[/]")
     console.print("\n  2. Run Flow:")
     console.print('     [dim]flow run "Create a hello world Python script"[/]')
     console.print("     [dim]flow run -i  # Interactive mode[/]")
 def main() -> None:
     """Main entry point."""
+    # Load environment variables from .env file if present
+    load_dotenv()
     app()

src/flow/cli/optimize.py CHANGED

@@ -6,6 +6,7 @@ from __future__ import annotations
 import asyncio
 import importlib.util
 import sys
 from pathlib import Path
 from typing import Annotated, Any
@@ -13,7 +14,15 @@ from typing import Annotated, Any
 import typer
 from rich.console import Console
-from flow.experiments.models import Agent, Candidate, CompactionConfig, GridSearchStrategy
 from flow.experiments.optimizer import FlowOptimizer, load_tasks_from_jsonl
 from flow.experiments.types import Task, get_task_suite
@@ -32,7 +41,14 @@ def optimize(
         Path | None,
         typer.Option(
             "--config", "-c",
-            help="Path to Python config file with CANDIDATES or VARIATIONS",
         ),
     ] = None,
     agent: Annotated[
@@ -60,7 +76,7 @@ def optimize(
         str | None,
         typer.Option(
             "--vary", "-v",
-            help="Comma-separated params to vary: compaction,memory,subagent",
         ),
     ] = None,
     output: Annotated[
@@ -92,24 +108,31 @@ def optimize(
     Examples:
         # Run with task file and default candidates
         flow optimize --tasks tasks.jsonl
-        # Use custom candidates from Python file
-        flow optimize --config my_configs.py --tasks tasks.jsonl
         # Vary specific parameters
-        flow optimize --vary compaction,memory --tasks tasks.jsonl
         # Use built-in task suite
         flow optimize --suite coding --parallel 2
         # Start from a base agent definition
-        flow optimize --agent base_agent.yaml --vary compaction,memory --tasks tasks.jsonl
     """
     asyncio.run(_run_optimize(
         tasks_path=tasks,
         config_path=config,
         agent_path=agent,
         suite=suite,
         parallel=parallel,
@@ -123,6 +146,7 @@ def optimize(
 async def _run_optimize(
     tasks_path: Path | None,
     config_path: Path | None,
     agent_path: Path | None,
     suite: str | None,
     parallel: int,
@@ -132,6 +156,11 @@ async def _run_optimize(
     budget: int,
 ) -> None:
     """Run the optimization."""
     # Load tasks
     tasks = _load_tasks(tasks_path, suite)
     if not tasks:
@@ -141,8 +170,24 @@ async def _run_optimize(
     # Load base agent
     base = _load_base_agent(agent_path)
-    # Load/generate candidates
-    candidates = _load_candidates(config_path, vary, base, budget)
     if not candidates:
         console.print("[red]Error:[/] No candidates to test. Use --config or --vary")
         raise typer.Exit(1)
@@ -176,6 +221,94 @@ async def _run_optimize(
         raise typer.Exit(1)
 def _load_tasks(tasks_path: Path | None, suite: str | None) -> list[Task]:
     """Load tasks from file or built-in suite."""
     if tasks_path:
@@ -211,47 +344,119 @@ def _load_base_agent(agent_path: Path | None) -> Agent:
     return Agent(name="flow_agent")
-def _load_candidates(
     config_path: Path | None,
     vary: str | None,
     base: Agent,
     budget: int,
-) -> list[Candidate]:
-    """Load candidates from file or generate from variations."""
     if config_path:
         if not config_path.exists():
             console.print(f"[red]Error:[/] Config file not found: {config_path}")
             raise typer.Exit(1)
-        candidates, variations = _load_python_config(config_path)
         if variations:
             strategy = GridSearchStrategy(variations)
-            return strategy.generate(base, budget)
         elif candidates:
-            return candidates
         else:
-            console.print("[red]Error:[/] Config file has no CANDIDATES or VARIATIONS")
             raise typer.Exit(1)
     if vary:
         variations = _parse_vary_flag(vary)
         strategy = GridSearchStrategy(variations)
-        return strategy.generate(base, budget)
     # Default: explore context engineering dimensions
     strategy = GridSearchStrategy(variations={
-        "enable_memory": [True, False],
         "compaction": [
             CompactionConfig.head_tail(10, 40),
             CompactionConfig.none(),
         ],
     })
-    return strategy.generate(base, budget)
-def _load_python_config(path: Path) -> tuple[list[Candidate], dict[str, Any]]:
-    """Load CANDIDATES and VARIATIONS from a Python file."""
     spec = importlib.util.spec_from_file_location("config_module", path)
     if spec is None or spec.loader is None:
         raise ValueError(f"Cannot load {path}")
@@ -262,12 +467,21 @@ def _load_python_config(path: Path) -> tuple[list[Candidate], dict[str, Any]]:
     candidates = getattr(module, "CANDIDATES", [])
     variations = getattr(module, "VARIATIONS", {})
-    return candidates, variations
 def _parse_vary_flag(vary: str) -> dict[str, Any]:
-    """Parse --vary flag into variations dict."""
     variations: dict[str, Any] = {}
     for param in vary.split(","):
@@ -278,10 +492,17 @@ def _parse_vary_flag(vary: str) -> dict[str, Any]:
                 CompactionConfig.head_tail(10, 40),
                 CompactionConfig.none(),
             ]
-        elif param in ("memory", "mem"):
-            variations["enable_memory"] = [True, False]
-        elif param in ("subagent", "sub"):
-            variations["enable_sub_agent"] = [True, False]
         elif param in ("head", "head_size"):
             variations["compaction"] = [
                 CompactionConfig.head_tail(h, 40) for h in [5, 10, 20]
@@ -294,3 +515,139 @@ def _parse_vary_flag(vary: str) -> dict[str, Any]:
             console.print(f"[yellow]Warning:[/] Unknown vary param: {param}")
     return variations
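
Both versions of this file build candidates with `GridSearchStrategy(variations).generate(base, budget)`. The strategy's code is not part of this diff, so the sketch below only illustrates the general idea: take the Cartesian product of the variation values and stop at the budget. It uses plain dicts instead of Flow's `Agent` and `Candidate` classes, and `grid_candidates` is a hypothetical name. Whether the real strategy truncates at the budget in this way is an assumption.

```python
from itertools import islice, product
from typing import Any


def grid_candidates(
    base: dict[str, Any],
    variations: dict[str, list[Any]],
    budget: int,
) -> list[dict[str, Any]]:
    """Build up to `budget` candidates from the Cartesian product of variations."""
    keys = list(variations)
    combos = product(*(variations[key] for key in keys))
    candidates = []
    for values in islice(combos, budget):
        mutations = dict(zip(keys, values))
        # Each candidate is the base config with one combination applied on top.
        candidates.append({"agent": {**base, **mutations}, "mutations": mutations})
    return candidates


base = {"name": "flow_agent", "tools": "standard", "compaction": "none"}
variations = {
    "compaction": ["head_tail", "none"],
    "tools": ["minimal", "standard", "full"],
}
for candidate in grid_candidates(base, variations, budget=10):
    print(candidate["mutations"])
```

Two compaction values times three tool sets give six candidates, which fits within the budget of 10. With an empty `variations` dict the product yields a single empty combination, so the only candidate is the unmodified base.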

 import asyncio
 import importlib.util
+import logging
 import sys
 from pathlib import Path
 from typing import Annotated, Any
 import typer
 from rich.console import Console
+from flow.experiments.models import (
+    Agent,
+    Candidate,
+    CompactionConfig,
+    Experiment,
+    ExperimentResult,
+    GridSearchStrategy,
+    load_experiment,
+)
 from flow.experiments.optimizer import FlowOptimizer, load_tasks_from_jsonl
 from flow.experiments.types import Task, get_task_suite
         Path | None,
         typer.Option(
             "--config", "-c",
+            help="Path to config file (YAML or Python) with STRATEGY, CANDIDATES, or VARIATIONS",
+        ),
+    ] = None,
+    experiment: Annotated[
+        Path | None,
+        typer.Option(
+            "--experiment", "-e",
+            help="Path to experiment YAML file (defines agent, tasks, and variations)",
         ),
     ] = None,
     agent: Annotated[
         str | None,
         typer.Option(
             "--vary", "-v",
+            help="Comma-separated params to vary: compaction, strategy, tools, head, tail",
         ),
     ] = None,
     output: Annotated[
     Examples:
+        # Use experiment YAML (recommended - defines agent, tasks, and variations)
+        flow optimize --experiment experiment.yaml
         # Run with task file and default candidates
         flow optimize --tasks tasks.jsonl
         # Vary specific parameters
+        flow optimize --vary compaction,tools --tasks tasks.jsonl
+        # Test all compaction strategies
+        flow optimize --vary strategy --suite coding
         # Use built-in task suite
         flow optimize --suite coding --parallel 2
         # Start from a base agent definition
+        flow optimize --agent base_agent.yaml --vary compaction,tools --tasks tasks.jsonl
+        # Use GEPA for active prompt optimization (via YAML config)
+        flow optimize --config gepa_strategy.yaml --agent base_agent.yaml --tasks tasks.jsonl
     """
     asyncio.run(_run_optimize(
         tasks_path=tasks,
         config_path=config,
+        experiment_path=experiment,
         agent_path=agent,
         suite=suite,
         parallel=parallel,
 async def _run_optimize(
     tasks_path: Path | None,
     config_path: Path | None,
+    experiment_path: Path | None,
     agent_path: Path | None,
     suite: str | None,
     parallel: int,
     budget: int,
 ) -> None:
     """Run the optimization."""
+    # If experiment YAML provided, use it as the source of truth
+    if experiment_path:
+        await _run_from_experiment(experiment_path, output_dir)
+        return
     # Load tasks
     tasks = _load_tasks(tasks_path, suite)
     if not tasks:
     # Load base agent
     base = _load_base_agent(agent_path)
+    # Load candidates and check if a strategy is defined in config
+    candidates, strategy_instance = _load_candidates_and_strategy(config_path, vary, base, budget)
+    # If a strategy was provided (like GepaStrategy), run it directly
+    if strategy_instance is not None:
+        console.print("\n[bold]Running active optimization strategy...[/]")
+        await _run_active_strategy(
+            strategy=strategy_instance,
+            base_agent=base,
+            tasks=tasks,
+            output_dir=output_dir,
+            parallel=parallel,
+            use_llm_eval=use_llm_eval,
+            budget=budget
+        )
+        return
+    # Otherwise, use traditional grid search with candidates
     if not candidates:
         console.print("[red]Error:[/] No candidates to test. Use --config or --vary")
         raise typer.Exit(1)
         raise typer.Exit(1)
+async def _run_from_experiment(experiment_path: Path, output_dir: Path | None) -> None:
+    """Run optimization from an experiment YAML file.
+    The experiment YAML defines:
+    - base_agent: Path to agent YAML
+    - suite/tasks: Which tasks to run
+    - variations: Parameter variations for grid search
+    - parallel, budget, use_llm_eval: Optimization settings
+    """
+    if not experiment_path.exists():
+        console.print(f"[red]Error:[/] Experiment file not found: {experiment_path}")
+        raise typer.Exit(1)
+    exp = load_experiment(experiment_path)
+    # Load base agent
+    if exp.base_agent:
+        base_agent_path = Path(exp.base_agent)
+        # Handle relative paths (relative to experiment file)
+        if not base_agent_path.is_absolute():
+            base_agent_path = experiment_path.parent / base_agent_path
+        if not base_agent_path.exists():
+            console.print(f"[red]Error:[/] Base agent file not found: {base_agent_path}")
+            raise typer.Exit(1)
+        from flow.experiments.models import load_agent
+        base = load_agent(base_agent_path)
+    else:
+        base = Agent(name="flow_agent")
+    # Load tasks
+    tasks: list[Task] = []
+    if exp.tasks:
+        tasks_path = Path(exp.tasks)
+        if not tasks_path.is_absolute():
+            tasks_path = experiment_path.parent / tasks_path
+        if not tasks_path.exists():
+            console.print(f"[red]Error:[/] Tasks file not found: {tasks_path}")
+            raise typer.Exit(1)
+        tasks = load_tasks_from_jsonl(tasks_path)
+    elif exp.suite:
+        try:
+            tasks = get_task_suite(exp.suite)
+        except ValueError as e:
+            console.print(f"[red]Error:[/] {e}")
+            raise typer.Exit(1)
+    else:
+        console.print("[red]Error:[/] Experiment must specify 'suite' or 'tasks'")
+        raise typer.Exit(1)
+    # Generate candidates from variations
+    if exp.variations:
+        strategy = GridSearchStrategy(exp.variations)
+        candidates = strategy.generate(base, exp.budget)
+    else:
+        candidates = [Candidate(agent=base, mutations={}, rationale="baseline")]
+    console.print(f"\n[bold]Experiment:[/] {experiment_path.name}")
+    console.print(f"[bold]Base Agent:[/] {base.name}")
+    console.print(f"\n[bold]Tasks:[/] {len(tasks)}")
+    for t in tasks:
+        console.print(f"  - {t.name}")
+    console.print("\n[bold]Variations:[/]")
+    for key, values in exp.variations.items():
+        console.print(f"  - {key}: {len(values)} variants")
+    console.print(f"\n[bold]Candidates:[/] {len(candidates)}")
+    # Run optimizer
+    optimizer = FlowOptimizer(
+        parallel=exp.parallel,
+        use_llm_evaluator=exp.use_llm_eval,
+        output_dir=output_dir,
+    )
+    try:
+        result = await optimizer.optimize(candidates, tasks)
+        console.print("\n[bold green]Optimization complete![/]")
+        console.print(f"\nBest agents exported to: [cyan]{result.output_dir / 'agents'}[/]")
+        console.print("\nTo use an agent config:")
+        console.print(f"  [dim]flow run --config {result.output_dir / 'agents' / 'best_score.yaml'} \"your task\"[/]")
+    except KeyboardInterrupt:
+        console.print("\n[yellow]Optimization cancelled.[/]")
+        raise typer.Exit(1)
 def _load_tasks(tasks_path: Path | None, suite: str | None) -> list[Task]:
     """Load tasks from file or built-in suite."""
     if tasks_path:
     return Agent(name="flow_agent")
+def _load_candidates_and_strategy(
     config_path: Path | None,
     vary: str | None,
     base: Agent,
     budget: int,
+) -> tuple[list[Candidate], Any | None]:
+    """Load candidates from file or generate from variations.
+    Supports both YAML and Python config files:
+    - YAML: strategy configuration (strategy_type, config)
+    - Python: STRATEGY object, CANDIDATES list, or VARIATIONS dict
+    Returns:
+        Tuple of (candidates, strategy_instance)
+        - If a STRATEGY is defined in config, returns ([], strategy_instance)
+        - Otherwise returns (candidates, None) for traditional grid search
+    """
     if config_path:
         if not config_path.exists():
             console.print(f"[red]Error:[/] Config file not found: {config_path}")
             raise typer.Exit(1)
+        # Check file extension to determine format
+        if config_path.suffix in (".yaml", ".yml"):
+            strategy_obj = _load_yaml_strategy(config_path)
+            if strategy_obj is not None:
+                return [], strategy_obj
+            # YAML files currently only support strategy definitions
+            console.print("[red]Error:[/] YAML config must define a strategy")
+            raise typer.Exit(1)
+        # Python config file
+        candidates, variations, strategy_obj = _load_python_config(config_path)
+        # If a strategy object was provided (e.g., GepaStrategy), return it
+        if strategy_obj is not None:
+            return [], strategy_obj
         if variations:
             strategy = GridSearchStrategy(variations)
+            return strategy.generate(base, budget), None
         elif candidates:
+            return candidates, None
         else:
+            console.print("[red]Error:[/] Config file has no CANDIDATES, VARIATIONS, or STRATEGY")
             raise typer.Exit(1)
     if vary:
         variations = _parse_vary_flag(vary)
         strategy = GridSearchStrategy(variations)
+        return strategy.generate(base, budget), None
     # Default: explore context engineering dimensions
     strategy = GridSearchStrategy(variations={
         "compaction": [
             CompactionConfig.head_tail(10, 40),
             CompactionConfig.none(),
         ],
+        "tools": ["minimal", "standard"],
     })
+    return strategy.generate(base, budget), None
+def _load_yaml_strategy(path: Path) -> Any | None:
+    """Load strategy configuration from a YAML file.
+    Expected YAML format:
+    ```yaml
+    strategy_type: gepa  # or other strategy types
+    config:
+      reflection_lm: gpt-4o
+      population_size: 5
+      optimize_fields:
+        - instructions
+    ```
+    Returns:
+        Strategy instance or None if file doesn't define a strategy
+    """
+    import yaml
+    with open(path) as f:
+        data = yaml.safe_load(f)
+    if not data or "strategy_type" not in data:
+        return None
+    strategy_type = data["strategy_type"].lower()
+    strategy_config = data.get("config", {})
+    if strategy_type == "gepa":
+        try:
+            from flow.optimizers import GepaStrategy
+            return GepaStrategy(config=strategy_config)
+        except ImportError:
+            console.print("[red]Error:[/] GEPA optimizer not available.")
+            console.print("[dim]Install with: pip install flow-agent[optimizer][/]")
+            raise typer.Exit(1)
+    else:
+        console.print(f"[red]Error:[/] Unknown strategy type: {strategy_type}")
+        console.print("[dim]Supported: gepa[/]")
+        raise typer.Exit(1)
+def _load_python_config(path: Path) -> tuple[list[Candidate], dict[str, Any], Any | None]:
+    """Load CANDIDATES, VARIATIONS, and STRATEGY from a Python file.
+    Returns:
+        Tuple of (candidates, variations, strategy)
+        - candidates: List of Candidate objects
+        - variations: Dict of parameter variations for GridSearchStrategy
+        - strategy: Strategy instance (e.g., GepaStrategy) or None
+    """
     spec = importlib.util.spec_from_file_location("config_module", path)
     if spec is None or spec.loader is None:
         raise ValueError(f"Cannot load {path}")
     candidates = getattr(module, "CANDIDATES", [])
     variations = getattr(module, "VARIATIONS", {})
+    strategy = getattr(module, "STRATEGY", None)
+    return candidates, variations, strategy
 def _parse_vary_flag(vary: str) -> dict[str, Any]:
+    """Parse --vary flag into variations dict.
+    Supported parameters:
+        compaction, compact: Test head_tail vs none
+        strategy, strategies: Test all compaction strategies (none, head_tail, sliding_window, summarization)
+        tools, toolset: Test minimal vs standard tool sets
+        head, head_size: Vary head sizes (5, 10, 20)
+        tail, tail_size: Vary tail sizes (20, 40, 60)
+    """
     variations: dict[str, Any] = {}
     for param in vary.split(","):
                 CompactionConfig.head_tail(10, 40),
                 CompactionConfig.none(),
             ]
+        elif param in ("strategy", "strategies"):
+            # Test all compaction strategies
+            variations["compaction"] = [
+                CompactionConfig.none(),
+                CompactionConfig.head_tail(10, 40),
+                CompactionConfig(strategy="sliding_window", token_budget=50_000),
+                CompactionConfig(strategy="summarization", token_budget=50_000),
+            ]
+        elif param in ("tools", "toolset"):
+            # Tool variations - memory and subagent are just tools
+            variations["tools"] = ["minimal", "standard"]
         elif param in ("head", "head_size"):
             variations["compaction"] = [
                 CompactionConfig.head_tail(h, 40) for h in [5, 10, 20]
             console.print(f"[yellow]Warning:[/] Unknown vary param: {param}")
     return variations
+async def _run_active_strategy(
+    strategy: Any,
+    base_agent: Agent,
+    tasks: list[Task],
+    output_dir: Path | None,
+    parallel: int,
+    use_llm_eval: bool,
+    budget: int,
+) -> None:
+    """Run an active optimization strategy (like GEPA)."""
+    logger = logging.getLogger(__name__)
+    # Create optimizer instance to run evaluations
+    optimizer_runner = FlowOptimizer(
+        parallel=parallel,
+        use_llm_evaluator=use_llm_eval,
+        output_dir=None,  # Don't export every intermediate run result
+    )
+    main_loop = asyncio.get_running_loop()
+    # Define evaluator function to inject into strategy
+    def evaluator(candidate: Candidate, minibatch: list[Task] | None = None) -> ExperimentResult:
+        """Evaluate a candidate on a minibatch of tasks."""
+        eval_tasks = minibatch if minibatch else tasks
+        logger.info(f"[EVALUATOR] Evaluating candidate '{candidate.agent.name}' on {len(eval_tasks)} tasks")
+        logger.info(f"[EVALUATOR] Using LLM evaluator: {use_llm_eval}")
+        logger.debug(f"[EVALUATOR] Tasks: {[t.name for t in eval_tasks]}")
+        try:
+            # Run async evaluation on the main loop and wait for result
+            # This is safe because strategy.generate (which calls this)
+            # is running in an executor thread.
+            future = asyncio.run_coroutine_threadsafe(
+                optimizer_runner.optimize([candidate], eval_tasks),
+                main_loop
+            )
+            optimization_result = future.result()
+            # Check if we got any results
+            if not optimization_result.summaries:
+                logger.warning(f"[EVALUATOR] Optimization produced no summaries for candidate '{candidate.agent.name}'")
+                # Return a fallback result with zero score instead of raising
+                return ExperimentResult(
+                    candidate=candidate,
+                    run_result=None,
+                    metrics={"score": 0.0, "error": "No summaries produced"},
+                    eval_score=0.0,
+                    eval_passed=False,
+                    eval_reasoning="Evaluation failed to produce results",
+                    traces={}
+                )
+            summary = optimization_result.summaries[0]
+            logger.info(f"[EVALUATOR] Candidate '{candidate.agent.name}' avg_score={summary.avg_score:.3f}, pass_rate={summary.pass_rate:.2f}")
+            # Log individual task results for debugging
+            if summary.task_results:
+                for tr in summary.task_results:
+                    logger.info(f"[EVALUATOR]   Task '{tr.task_name}': score={tr.eval_score:.3f}, passed={tr.eval_passed}")
+                    logger.debug(f"[EVALUATOR]     Reasoning: '{tr.eval_reasoning[:150]}'")
+                    logger.debug(f"[EVALUATOR]     Metrics: tokens={tr.metrics.total_tokens}, duration={tr.run_result.duration_seconds if tr.run_result else 0:.2f}s")
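+            # Convert CandidateSummary to ExperimentResult for GEPA.
+            # Note: only the first task result is forwarded; with a multi-task
+            # minibatch the remaining per-task results are not reflected here.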
+            if summary.task_results:
+                tr = summary.task_results[0]
+                return ExperimentResult(
+                    candidate=candidate,
+                    run_result=tr.run_result,
+                    metrics=tr.metrics,
+                    eval_score=tr.eval_score,
+                    eval_passed=tr.eval_passed,
+                    eval_reasoning=tr.eval_reasoning,
+                    traces=tr.run_result.trace if tr.run_result and isinstance(tr.run_result.trace, dict) else {}
+                )
+            # Fallback to aggregate metrics if no individual task results
+            return ExperimentResult(
+                candidate=candidate,
+                run_result=None,
+                metrics={"score": summary.avg_score},
+                eval_score=summary.avg_score,
+                eval_passed=summary.pass_rate > 0.5,
+                eval_reasoning=f"Aggregate pass rate: {summary.pass_rate}",
+                traces={}
+            )
+        except Exception as e:
+            logger.error(f"Error evaluating candidate '{candidate.agent.name}': {e}", exc_info=True)
+            # Return a fallback result instead of propagating the exception
+            return ExperimentResult(
+                candidate=candidate,
+                run_result=None,
+                metrics={"score": 0.0, "error": str(e)},
+                eval_score=0.0,
+                eval_passed=False,
+                eval_reasoning=f"Evaluation error: {e}",
+                traces={}
+            )
+    # Inject dependencies into strategy if supported
+    # GepaStrategy accepts them in __init__, but we might have loaded it from config
+    # without them.
+    if hasattr(strategy, "evaluator") and strategy.evaluator is None:
+        strategy.evaluator = evaluator
+    if hasattr(strategy, "dataset") and strategy.dataset is None:
+        strategy.dataset = tasks
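+    # Execute strategy (blocking/sync) in an executor thread.
+    # This is required rather than optional: the evaluator above schedules
+    # coroutines on the main loop via run_coroutine_threadsafe and blocks on
+    # the result, so calling strategy.generate on the loop thread would deadlock.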
+    loop = asyncio.get_running_loop()
+    candidates = await loop.run_in_executor(None, strategy.generate, base_agent, budget)
+    console.print("\n[bold green]Optimization complete![/]")
+    console.print(f"Generated {len(candidates)} candidates.")
+    # Export results
+    if output_dir:
+        from flow.experiments.models import export_agent
+        output_dir.mkdir(parents=True, exist_ok=True)
+        (output_dir / "agents").mkdir(exist_ok=True)
+        for i, cand in enumerate(candidates):
+            # Basic export
+            name = cand.agent.name or f"candidate_{i}"
+            export_agent(cand.agent, output_dir / "agents" / f"{name}.yaml", metrics={"rationale": cand.rationale})
+        console.print(f"\nAgents exported to: [cyan]{output_dir / 'agents'}[/]")
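The bridge used by `_run_active_strategy` (a blocking strategy running in an executor thread that calls back into async evaluation on the main loop) can be sketched in isolation. This is a minimal illustration, not the project's code: `evaluate`, `make_sync_evaluator`, and `blocking_strategy` are hypothetical stand-ins for `FlowOptimizer.optimize`, the injected `evaluator`, and `strategy.generate`.

```python
import asyncio
from typing import Callable


async def evaluate(candidate: str) -> float:
    """Hypothetical stand-in for the async optimizer call."""
    await asyncio.sleep(0)  # yield to the loop, as real I/O would
    return float(len(candidate))


def make_sync_evaluator(loop: asyncio.AbstractEventLoop) -> Callable[[str], float]:
    """Build a blocking evaluator that runs coroutines on `loop`."""

    def evaluator(candidate: str) -> float:
        # Schedule the coroutine on the main loop from a worker thread,
        # then block this worker thread until the result is ready.
        future = asyncio.run_coroutine_threadsafe(evaluate(candidate), loop)
        return future.result()

    return evaluator


def blocking_strategy(evaluator: Callable[[str], float], candidates: list[str]) -> str:
    """Hypothetical synchronous strategy: pick the best-scoring candidate."""
    return max(candidates, key=evaluator)


async def main(candidates: list[str]) -> str:
    loop = asyncio.get_running_loop()
    evaluator = make_sync_evaluator(loop)
    # The strategy must run off the loop thread; otherwise future.result()
    # would block the very loop that has to execute the coroutine.
    return await loop.run_in_executor(None, blocking_strategy, evaluator, candidates)


best = asyncio.run(main(["a", "abc", "ab"]))
```

The design choice worth noting is that the executor is what keeps the event loop free to service the scheduled coroutines while the strategy blocks.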

src/flow/cli/repl.py CHANGED

@@ -11,8 +11,8 @@ from pathlib import Path
 from rich.console import Console
 from flow.cli.output import print_event, print_welcome
-from flow.harness.base import EventType
-from flow.harness.maf import MAFHarness
 # Default paths
 DEFAULT_WORKSPACE = Path.home() / ".flow" / "workspace"
@@ -40,16 +40,18 @@ class FlowREPL:
         self._workspace = workspace or DEFAULT_WORKSPACE
         self._memory_path = memory_path or DEFAULT_MEMORY_PATH
         self._console = Console()
-        self._harness: MAFHarness | None = None
         self._thread_id: str | None = None
-    def _get_harness(self) -> MAFHarness:
         """Get or create the harness instance."""
         if self._harness is None:
-            self._harness = MAFHarness(
-                workspace=self._workspace,
-                memory_path=self._memory_path,
-            )
         return self._harness
     async def run(self) -> None:
@@ -112,7 +114,7 @@ class FlowREPL:
         except EOFError:
             return None
-    async def _run_task(self, harness: MAFHarness, task: str) -> None:
         """Run a task and stream the output.
         Args:
@@ -122,7 +124,7 @@ class FlowREPL:
         self._console.print()  # Blank line before output
         try:
-            async for event in harness.run_stream(task, self._thread_id):
                 print_event(self._console, event)
                 # Store thread ID for conversation continuity

 from rich.console import Console
 from flow.cli.output import print_event, print_welcome
+from flow.experiments.models import Agent
+from flow.harness.base import BaseHarness, EventType
 # Default paths
 DEFAULT_WORKSPACE = Path.home() / ".flow" / "workspace"
         self._workspace = workspace or DEFAULT_WORKSPACE
         self._memory_path = memory_path or DEFAULT_MEMORY_PATH
         self._console = Console()
+        self._harness: BaseHarness | None = None
         self._thread_id: str | None = None
+    def _get_harness(self) -> BaseHarness:
         """Get or create the harness instance."""
         if self._harness is None:
+            # Import maf module to register the harness, then use registry
+            import flow.harness.maf  # noqa: F401
+            from flow.harness import create_harness
+            agent = Agent(name="flow-repl")
+            self._harness = create_harness(agent, self._workspace)
         return self._harness
     async def run(self) -> None:
         except EOFError:
             return None
+    async def _run_task(self, harness: BaseHarness, task: str) -> None:
         """Run a task and stream the output.
         Args:
         self._console.print()  # Blank line before output
         try:
+            async for event in harness.run_stream(task):
                 print_event(self._console, event)
                 # Store thread ID for conversation continuity

src/flow/experiments/__init__.py CHANGED

@@ -52,7 +52,6 @@ from .models import (
 # Experiment runner + Pareto analysis
 from .ablation import (
     compute_pareto_frontier,
-    create_harness_from_agent,
     generate_recommendation,
     run_experiments,
     run_single_experiment,
@@ -146,7 +145,6 @@ __all__ = [  # noqa: RUF022  # Intentionally grouped by category
     "print_comparison_table",
     "print_eval_result",
     # Experiment runner
-    "create_harness_from_agent",
     "run_experiments",
     "run_single_experiment",
     "compute_pareto_frontier",

 # Experiment runner + Pareto analysis
 from .ablation import (
     compute_pareto_frontier,
     generate_recommendation,
     run_experiments,
     run_single_experiment,
     "print_comparison_table",
     "print_eval_result",
     # Experiment runner
     "run_experiments",
     "run_single_experiment",
     "compute_pareto_frontier",

src/flow/experiments/ablation.py CHANGED

@@ -19,46 +19,17 @@ from typing import TYPE_CHECKING, Any
 from .evaluators import HeuristicEvaluator
 from .metrics import extract_metrics, metrics_to_dict
-from .models import Agent, Candidate, ExperimentResult
 from .reporters import print_comparison_table, save_run_result
 from .runner import FlowExperimentRunner, setup_tracing
 from .types import EvalCriterion, Task
 if TYPE_CHECKING:
-    from flow.harness.maf import MAFHarness
     from .optimizer import CandidateSummary
 logger = logging.getLogger(__name__)
-def create_harness_from_agent(agent: Agent, workspace: Path) -> MAFHarness:
-    """Create a MAFHarness from an Agent definition.
-    Args:
-        agent: The agent definition
-        workspace: Working directory
-    Returns:
-        A configured MAFHarness
-    """
-    from flow.experiments.models import resolve_tools
-    from flow.harness.maf import MAFHarness
-    # Resolve tools to dict form
-    tools_spec = resolve_tools(agent.tools)
-    return MAFHarness(
-        workspace=workspace,
-        memory_path=workspace / "memory",
-        enable_compaction=agent.compaction.enabled,
-        compaction_head_size=agent.compaction.head_size,
-        compaction_tail_size=agent.compaction.tail_size,
-        tools=tools_spec,
-        instructions=agent.instructions,
-    )
 async def run_single_experiment(
     candidate: Candidate,
     task: Task,
@@ -74,7 +45,15 @@ async def run_single_experiment(
     Returns:
         ExperimentResult with metrics and evaluation
     """
-    harness = create_harness_from_agent(candidate.agent, workspace)
     try:
         runner = FlowExperimentRunner(keep_workspace=True)

 from .evaluators import HeuristicEvaluator
 from .metrics import extract_metrics, metrics_to_dict
+from .models import Candidate, ExperimentResult
 from .reporters import print_comparison_table, save_run_result
 from .runner import FlowExperimentRunner, setup_tracing
 from .types import EvalCriterion, Task
 if TYPE_CHECKING:
     from .optimizer import CandidateSummary
 logger = logging.getLogger(__name__)
 async def run_single_experiment(
     candidate: Candidate,
     task: Task,
     Returns:
         ExperimentResult with metrics and evaluation
     """
+    # Import harness modules to register them, then use registry
+    import flow.harness.maf  # noqa: F401
+    try:
+        import flow.harness.miniagent  # noqa: F401
+    except ImportError:
+        pass  # miniagent harness is optional
+    from flow.harness import create_harness
+    harness = create_harness(candidate.agent, workspace)
     try:
         runner = FlowExperimentRunner(keep_workspace=True)
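The change above replaces a hard-wired `MAFHarness` constructor with a registry lookup that harness modules populate when imported. The real `flow.harness` registry is not shown in this diff, so the following is only a sketch of how such a pattern might look; `register_harness`, `_REGISTRY`, `DemoHarness`, `try_import`, and the `framework` field are all hypothetical names introduced for illustration.

```python
import importlib
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Agent:
    """Minimal stand-in for the project's Agent (fields assumed)."""
    name: str
    framework: str = "maf"  # hypothetical selector field


@dataclass
class DemoHarness:
    """Hypothetical harness that just records its inputs."""
    agent: Agent
    workspace: Path


# Hypothetical registry: framework name -> harness factory.
_REGISTRY: dict[str, Callable[[Agent, Path], DemoHarness]] = {}


def register_harness(framework: str, factory: Callable[[Agent, Path], DemoHarness]) -> None:
    """Called by a harness module at import time to make itself available."""
    _REGISTRY[framework] = factory


def create_harness(agent: Agent, workspace: Path) -> DemoHarness:
    """Look up the factory for the agent's framework and build a harness."""
    try:
        factory = _REGISTRY[agent.framework]
    except KeyError:
        raise ValueError(f"No harness registered for framework: {agent.framework!r}") from None
    return factory(agent, workspace)


def try_import(module_name: str) -> bool:
    """Import a module for its registration side effect; tolerate absence."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


# A harness module would normally do this when it is imported.
register_harness("maf", DemoHarness)
# Optional harnesses may be missing; that should not be fatal.
optional_loaded = try_import("flow_harness_that_does_not_exist")
harness = create_harness(Agent(name="flow-repl"), Path("/tmp/workspace"))
```

Under this assumption, callers depend only on `create_harness`, and adding a harness means importing one more module rather than editing the call sites.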

src/flow/experiments/data/tasks/coding.jsonl CHANGED

@@ -1,10 +1,5 @@
-{"name": "fizzbuzz", "prompt": "Create fizzbuzz.py that prints FizzBuzz 1-100 and run it.", "criteria": [{"name": "correct", "instruction": "Correct FizzBuzz output"}], "category": "short"}
-{"name": "rest_api", "prompt": "Create a FastAPI CRUD TODO app with GET/POST/DELETE endpoints.", "criteria": [{"name": "has_crud", "instruction": "Has working CRUD"}], "category": "medium"}
-{"name": "cli_tool", "prompt": "Create an argparse CLI that counts lines/words/chars in a file.", "criteria": [{"name": "works", "instruction": "CLI works correctly"}], "category": "medium"}
-{"name": "data_pipeline", "prompt": "Create a script that reads CSV data, filters rows, aggregates, and outputs JSON.", "criteria": [{"name": "works", "instruction": "Pipeline produces correct output"}], "category": "medium"}
-{"name": "unit_tests", "prompt": "Create calc.py with math functions and test_calc.py with pytest tests.", "criteria": [{"name": "tests_pass", "instruction": "Tests pass"}], "category": "medium"}
-{"name": "web_scraper", "prompt": "Create a script that fetches a webpage and extracts all links.", "criteria": [{"name": "extracts_links", "instruction": "Extracts links correctly"}], "category": "medium"}
-{"name": "async_downloader", "prompt": "Create an async script that downloads multiple URLs concurrently using aiohttp.", "criteria": [{"name": "uses_async", "instruction": "Uses async/await correctly"}], "category": "complex"}
-{"name": "database_orm", "prompt": "Create a SQLAlchemy model for Users with CRUD operations.", "criteria": [{"name": "has_orm", "instruction": "Uses SQLAlchemy ORM correctly"}], "category": "complex"}
-{"name": "decorator_lib", "prompt": "Create a library with timing, retry, and caching decorators.", "criteria": [{"name": "decorators_work", "instruction": "Decorators function correctly"}], "category": "complex"}
-{"name": "config_parser", "prompt": "Create a config parser that supports YAML, JSON, and env vars with validation.", "criteria": [{"name": "multi_format", "instruction": "Supports multiple formats"}], "category": "complex"}

+{"name": "repo_documentation", "prompt": "Clone the repository https://github.com/microsoft-foundry/ai-tutorials and generate comprehensive documentation.\n\nSTEP 1: Clone the repository\n- Use bash to clone: git clone https://github.com/microsoft-foundry/ai-tutorials\n- Confirm the clone succeeded by listing the directory contents\n\nSTEP 2: Explore the structure\n- List all files and directories recursively\n- Identify the main components, tutorials, and examples\n\nSTEP 3: Generate documentation\nFor EVERY file in the repository:\n1. Read the complete file\n2. Document its purpose (1-2 sentences)\n3. List key functions/classes if code, or sections if markdown\n4. Note dependencies or prerequisites\n\nSTEP 4: Create a comprehensive report\n- Overall repository purpose and structure\n- Table of contents of all tutorials/examples\n- Prerequisites for running each tutorial\n- Suggested learning path for beginners\n\nBe thorough. Read every file completely. Document everything.", "criteria": [{"name": "clone_success", "instruction": "Repository was successfully cloned", "weight": 1.0}, {"name": "file_coverage", "instruction": "All files in the repository were read and documented", "weight": 0.9}, {"name": "documentation_quality", "instruction": "Each file has meaningful description, not just filenames", "weight": 0.8}, {"name": "synthesis", "instruction": "Final report provides useful overview and learning path", "weight": 0.7}], "metadata": {"expected_iterations": 20, "min_tokens": 50000, "category": "context_stress"}}
+{"name": "code_review", "prompt": "Clone https://github.com/microsoft-foundry/ai-tutorials and perform an exhaustive code review.\n\nSTEP 1: Clone the repository\n- git clone https://github.com/microsoft-foundry/ai-tutorials\n- Verify the clone succeeded\n\nSTEP 2: Inventory all code files\n- Find all Python files (.py)\n- Find all Jupyter notebooks (.ipynb)\n- Find all configuration files\n\nSTEP 3: Review each code file\nFor EVERY Python file and notebook:\n1. Read the complete file\n2. For each function/method, document:\n   - Name and signature\n   - What it does (1-2 sentences)\n   - Any potential issues (edge cases, missing error handling)\n3. For each class, document:\n   - Purpose\n   - All methods with their purposes\n\nSTEP 4: Synthesize findings\n- Summary table of all modules and their relationships\n- Top 10 code quality issues found\n- Recommendations for improvement\n- Best practices observed worth replicating\n\nRead every file. Be thorough and systematic.", "criteria": [{"name": "clone_success", "instruction": "Repository was successfully cloned", "weight": 1.0}, {"name": "completeness", "instruction": "All code files were read and reviewed", "weight": 0.9}, {"name": "depth", "instruction": "Each function/class has meaningful analysis, not just signatures", "weight": 0.8}, {"name": "issues_found", "instruction": "Identified real code quality issues with specific examples", "weight": 0.7}], "metadata": {"expected_iterations": 22, "min_tokens": 55000, "category": "context_stress"}}
+{"name": "tutorial_analysis", "prompt": "Clone https://github.com/microsoft-foundry/ai-tutorials and analyze the tutorial content for educational effectiveness.\n\nSTEP 1: Clone the repository\n- git clone https://github.com/microsoft-foundry/ai-tutorials\n- List the repository contents to understand structure\n\nSTEP 2: Read ALL tutorials\nFor EACH tutorial or example:\n1. Read the complete content\n2. Identify the learning objectives\n3. List prerequisites assumed\n4. Note the teaching approach used\n\nSTEP 3: Evaluate educational quality\nFor each tutorial, assess:\n- Clarity of explanations\n- Code-to-explanation ratio\n- Progression of difficulty\n- Hands-on exercises included\n- Common pitfalls addressed\n\nSTEP 4: Create improvement report\n- Rank tutorials by educational effectiveness\n- Identify gaps in coverage\n- Suggest specific improvements for each tutorial\n- Recommend additional tutorials that should be added\n- Create an optimal learning sequence\n\nBe thorough. Read every file. Provide specific examples.", "criteria": [{"name": "clone_success", "instruction": "Repository was successfully cloned", "weight": 1.0}, {"name": "tutorial_coverage", "instruction": "All tutorials were read and analyzed", "weight": 0.9}, {"name": "evaluation_depth", "instruction": "Evaluation criteria applied consistently across tutorials", "weight": 0.8}, {"name": "actionable_recommendations", "instruction": "Improvement suggestions are specific and implementable", "weight": 0.7}], "metadata": {"expected_iterations": 20, "min_tokens": 50000, "category": "context_stress"}}
+{"name": "dependency_audit", "prompt": "Clone https://github.com/microsoft-foundry/ai-tutorials and perform a thorough dependency and compatibility audit.\n\nSTEP 1: Clone the repository\n- git clone https://github.com/microsoft-foundry/ai-tutorials\n- Confirm successful clone\n\nSTEP 2: Find all dependency specifications\n- Search for requirements.txt files\n- Search for pyproject.toml files\n- Search for setup.py files\n- Search for environment.yml files\n- Check imports in Python files for implicit dependencies\n\nSTEP 3: Analyze each dependency\nFor EVERY dependency found:\n1. Current version specified (or 'unpinned' if none)\n2. Latest available version\n3. Known security vulnerabilities\n4. Compatibility with Python 3.10, 3.11, 3.12\n5. Transitive dependencies introduced\n\nSTEP 4: Generate audit report\n- Dependency tree visualization (text format)\n- Security vulnerabilities found with severity\n- Version conflicts or incompatibilities\n- Recommendations for updates\n- Suggested requirements.txt with pinned versions\n\nRead all relevant files. Be thorough and specific.", "criteria": [{"name": "clone_success", "instruction": "Repository was successfully cloned", "weight": 1.0}, {"name": "dependency_discovery", "instruction": "All dependency specifications were found and analyzed", "weight": 0.9}, {"name": "analysis_depth", "instruction": "Each dependency was analyzed for versions and compatibility", "weight": 0.8}, {"name": "actionable_report", "instruction": "Report includes specific version recommendations", "weight": 0.7}], "metadata": {"expected_iterations": 18, "min_tokens": 45000, "category": "context_stress"}}
+{"name": "architecture_analysis", "prompt": "Clone https://github.com/microsoft-foundry/ai-tutorials and analyze the overall architecture and design patterns.\n\nSTEP 1: Clone the repository\n- git clone https://github.com/microsoft-foundry/ai-tutorials\n- Verify clone success\n\nSTEP 2: Map the repository structure\n- Create a complete directory tree\n- Identify major components/modules\n- Document file organization patterns\n\nSTEP 3: Analyze design patterns\nFor EACH significant code file:\n1. Read the complete file\n2. Identify design patterns used (factory, singleton, observer, etc.)\n3. Note coding conventions and style\n4. Document error handling approaches\n5. Analyze how components interact\n\nSTEP 4: Create architecture document\n- High-level architecture diagram (text format)\n- Component interaction map\n- Data flow descriptions\n- Design pattern catalog with examples from code\n- Evaluation of architectural decisions\n- Suggestions for architectural improvements\n\nRead every file. Document patterns with specific code references.", "criteria": [{"name": "clone_success", "instruction": "Repository was successfully cloned", "weight": 1.0}, {"name": "structure_mapped", "instruction": "Complete directory structure documented", "weight": 0.8}, {"name": "patterns_identified", "instruction": "Design patterns identified with specific code examples", "weight": 0.9}, {"name": "architecture_doc", "instruction": "Architecture document is comprehensive and accurate", "weight": 0.8}], "metadata": {"expected_iterations": 22, "min_tokens": 55000, "category": "context_stress"}}
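Each new task line carries a `criteria` list with per-criterion `weight` values. How the project's evaluators combine those weights is not shown in this diff; one plausible reading is a weighted average, sketched below. The task line and the per-criterion scores are made up for illustration, and `weighted_score` is a hypothetical helper, not part of the codebase.

```python
import json

# Abbreviated, hypothetical task line in the same shape as coding.jsonl.
line = json.dumps({
    "name": "demo_task",
    "prompt": "Clone a repository and document it.",
    "criteria": [
        {"name": "clone_success", "instruction": "Repository was successfully cloned", "weight": 1.0},
        {"name": "file_coverage", "instruction": "All files were read and documented", "weight": 0.9},
    ],
})


def weighted_score(criteria: list[dict], scores: dict[str, float]) -> float:
    """Weighted average of per-criterion scores (assumed aggregation rule)."""
    total_weight = sum(c.get("weight", 1.0) for c in criteria)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(c.get("weight", 1.0) * scores.get(c["name"], 0.0) for c in criteria)
    return weighted_sum / total_weight


task = json.loads(line)
# Hypothetical evaluator output: full credit for cloning, half for coverage.
score = weighted_score(task["criteria"], {"clone_success": 1.0, "file_coverage": 0.5})
```

With this rule, a missing criterion score counts as zero and a missing `weight` defaults to 1.0, which matches the older unweighted task entries.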

src/flow/experiments/data/tasks/gaia_all.jsonl ADDED

The diff for this file is too large to render. See raw diff

src/flow/experiments/data/tasks/gaia_level1.jsonl ADDED

@@ -0,0 +1,106 @@

+{"name": "e1fc63a2-da7a-432f-be78-7c4a95598703", "prompt": "If Eliud Kipchoge could maintain his record-making marathon pace indefinitely, how many thousand hours would it take him to run the distance between the Earth and the Moon its closest approach? Please use the minimum perigee value on the Wikipedia page for the Moon when carrying out your calculation. Round your result to the nearest 1000 hours and do not use any comma separators if necessary.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 17", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "17", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be", "prompt": "How many studio albums were published by Mercedes Sosa between 2000 and 2009 (included)? You can use the latest 2022 version of english wikipedia.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 3", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "3", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "ec09fa32-d03f-4bf8-84b0-1f16922c3ae4", "prompt": "Here's a fun riddle that I think you'll enjoy.\n\nYou have been selected to play the final round of the hit new game show \"Pick That Ping-Pong\". In this round, you will be competing for a large cash prize. Your job will be to pick one of several different numbered ping-pong balls, and then the game will commence. The host describes how the game works.\n\nA device consisting of a winding clear ramp and a series of pistons controls the outcome of the game. The ramp feeds balls onto a platform. The platform has room for three ping-pong balls at a time. The three balls on the platform are each aligned with one of three pistons. At each stage of the game, one of the three pistons will randomly fire, ejecting the ball it strikes. If the piston ejects the ball in the first position on the platform the balls in the second and third position on the platform each advance one space, and the next ball on the ramp advances to the third position. If the piston ejects the ball in the second position, the ball in the first position is released and rolls away, the ball in the third position advances two spaces to occupy the first position, and the next two balls on the ramp advance to occupy the second and third positions on the platform. If the piston ejects the ball in the third position, the ball in the first position is released and rolls away, the ball in the second position advances one space to occupy the first position, and the next two balls on the ramp advance to occupy the second and third positions on the platform.\n\nThe ramp begins with 100 numbered ping-pong balls, arranged in ascending order from 1 to 100. The host activates the machine and the first three balls, numbered 1, 2, and 3, advance to the platform. Before the random firing of the pistons begins, you are asked which of the 100 balls you would like to pick. If your pick is ejected by one of the pistons, you win the grand prize, $10,000.\n\nWhich ball should you choose to maximize your odds of winning the big prize? Please provide your answer as the number of the ball selected.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 3", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "3", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "5d0080cb-90d7-4712-bc33-848150e917d3", "prompt": "What was the volume in m^3 of the fish bag that was calculated in the University of Leicester paper \"Can Hiccup Supply Enough Fish to Maintain a Dragon\u2019s Diet?\"", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 0.1777", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "0.1777", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6", "prompt": "In the video https://www.youtube.com/watch?v=L1vXCYZAYYM, what is the highest number of bird species to be on camera simultaneously?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 3", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "3", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "46719c30-f4c3-4cad-be07-d5cb21eee6bb", "prompt": "Of the authors (First M. Last) that worked on the paper \"Pie Menus or Linear Menus, Which Is Better?\" in 2015, what was the title of the first paper authored by the one that had authored prior papers?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Mapping Human Oriented Information to Software Agents for Online Systems Usage", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Mapping Human Oriented Information to Software Agents for Online Systems Usage", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "4b6bb5f7-f634-410e-815d-e673ab7f8632", "prompt": "In Series 9, Episode 11 of Doctor Who, the Doctor is trapped inside an ever-shifting maze. What is this location called in the official script for the episode? Give the setting exactly as it appears in the first scene heading.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: THE CASTLE", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "THE CASTLE", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "cffe0e32-c9a6-4c52-9877-78ceb4aaa9fb", "prompt": "An office held a Secret Santa gift exchange where each of its twelve employees was assigned one other employee in the group to present with a gift. Each employee filled out a profile including three likes or hobbies. On the day of the gift exchange, only eleven gifts were given, each one specific to one of the recipient's interests. Based on the information in the document, who did not give a gift?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Fred", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Fred", "gaia_level": 1, "gaia_file": "cffe0e32-c9a6-4c52-9877-78ceb4aaa9fb.docx", "source": "gaia-benchmark"}}
+{"name": "2d83110e-a098-4ebb-9987-066c06fa42d0", "prompt": ".rewsna eht sa \"tfel\" drow eht fo etisoppo eht etirw ,ecnetnes siht dnatsrednu uoy fI", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Right", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Right", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "5cfb274c-0207-4aa7-9575-6ac0bd95d9b2", "prompt": "Each cell in the attached spreadsheet represents a plot of land. The color of the cell indicates who owns that plot. Green cells are plots owned by Earl Smith. Can Earl walk through every plot he owns (and no other plots) and return to his starting plot without backtracking? For this question, consider backtracking to be any instance where Earl would enter a plot of land he had already entered since leaving his starting plot.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: No", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "No", "gaia_level": 1, "gaia_file": "5cfb274c-0207-4aa7-9575-6ac0bd95d9b2.xlsx", "source": "gaia-benchmark"}}
+{"name": "27d5d136-8563-469e-92bf-fd103c28b57c", "prompt": "\u00ac(A \u2227 B) \u2194 (\u00acA \u2228 \u00acB)\n\u00ac(A \u2228 B) \u2194 (\u00acA \u2227 \u00acB)\n(A \u2192 B) \u2194 (\u00acB \u2192 \u00acA)\n(A \u2192 B) \u2194 (\u00acA \u2228 B)\n(\u00acA \u2192 B) \u2194 (A \u2228 \u00acB)\n\u00ac(A \u2192 B) \u2194 (A \u2227 \u00acB)\n\nWhich of the above is not logically equivalent to the rest? Provide the full statement that doesn't fit.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: (\u00acA \u2192 B) \u2194 (A \u2228 \u00acB)", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "(\u00acA \u2192 B) \u2194 (A \u2228 \u00acB)", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "dc28cf18-6431-458b-83ef-64b3ce566c10", "prompt": "My family reunion is this week, and I was assigned the mashed potatoes to bring. The attendees include my married mother and father, my twin brother and his family, my aunt and her family, my grandma and her brother, her brother's daughter, and his daughter's family. All the adults but me have been married, and no one is divorced or remarried, but my grandpa and my grandma's sister-in-law passed away last year. All living spouses are attending. My brother has two children that are still kids, my aunt has one six-year-old, and my grandma's brother's daughter has three kids under 12. I figure each adult will eat about 1.5 potatoes of mashed potatoes and each kid will eat about 1/2 a potato of mashed potatoes, except my second cousins don't eat carbs. The average potato is about half a pound, and potatoes are sold in 5-pound bags. How many whole bags of potatoes do I need? Just give the number.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 2", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "2", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "b816bfce-3d80-4913-a07d-69b752ce6377", "prompt": "In Emily Midkiff's June 2014 article in a journal named for the one of Hreidmar's sons that guarded his house, what word was quoted from two different authors in distaste for the nature of dragon depictions?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: fluffy", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "fluffy", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "72e110e7-464c-453c-a309-90a95aed6538", "prompt": "Under DDC 633 on Bielefeld University Library's BASE, as of 2020, from what country was the unknown language article with a flag unique from the others?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Guatemala", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Guatemala", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "42576abe-0deb-4869-8c63-225c2d75a95a", "prompt": "In the fictional language of Tizin, basic sentences are arranged with the Verb first, followed by the direct object, followed by the subject of the sentence. I want to express my love for apples to my Tizin friend. \n\nThe word that indicates oneself is \"Pa\" is the nominative form, \"Mato\" is the accusative form, and \"Sing\" is the genitive form. \n\nThe root verb that indicates an intense like for something is \"Maktay\". When it is used in the present, it is used in it's root form, when it is used in the preterit past, it is \"Tay\", and when it is used in the imperfect past, it is \"Aktay\". It is used differently than in English, and is better translated as \"is pleasing to\", meaning that the thing doing the liking is actually the object of the sentence rather than the subject.\n\nThe word for apples is borrowed from English in Tizin, and so it is \"Apple\" is the nominative form, \"Zapple\" is the accusative form, and \"Izapple\" is the genitive form. \n\nPlease translate \"I like apples\" to Tizin.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Maktay mato apple", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Maktay mato apple", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "b415aba4-4b68-4fc6-9b89-2c812e55a3e1", "prompt": "In Nature journal's Scientific Reports conference proceedings from 2012, in the article that did not mention plasmons or plasmonics, what nano-compound is studied? Don't use the prefix nano in your answer if there is one.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: diamond", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "diamond", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "cca530fc-4052-43b2-b130-b30968d8aa44", "prompt": "Review the chess position provided in the image. It is black's turn. Provide the correct next move for black which guarantees a win. Please provide your response in algebraic notation.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Rd5", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Rd5", "gaia_level": 1, "gaia_file": "cca530fc-4052-43b2-b130-b30968d8aa44.png", "source": "gaia-benchmark"}}
+{"name": "935e2cff-ae78-4218-b3f5-115589b19dae", "prompt": "In the year 2022, and before December, what does \"R\" stand for in the three core policies of the type of content that was violated in the public logs on the Legume Wikipedia page?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: research", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "research", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "4fc2f1ae-8625-45b5-ab34-ad4433bc21f8", "prompt": "Who nominated the only Featured Article on English Wikipedia about a dinosaur that was promoted in November 2016?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: FunkMonk", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "FunkMonk", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "5188369a-3bbe-43d8-8b94-11558f909a08", "prompt": "What writer is quoted by Merriam-Webster for the Word of the Day from June 27, 2022?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Annie Levin", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Annie Levin", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "6f37996b-2ac7-44b0-8e68-6d28256631b4", "prompt": "Given this table defining * on the set S = {a, b, c, d, e}\n\n|*|a|b|c|d|e|\n|---|---|---|---|---|---|\n|a|a|b|c|b|d|\n|b|b|c|a|e|c|\n|c|c|a|b|b|a|\n|d|b|e|b|e|d|\n|e|d|b|a|d|c|\n\nprovide the subset of S involved in any possible counter-examples that prove * is not commutative. Provide your answer as a comma separated list of the elements in the set in alphabetical order.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: b, e", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "b, e", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "9318445f-fe6a-4e1b-acbf-c68228c9906a", "prompt": "As a comma separated list with no whitespace, using the provided image provide all the fractions that use / as the fraction line and the answers to the sample problems. Order the list by the order in which the fractions appear.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 3/4,1/4,3/4,3/4,2/4,1/2,5/35,7/21,30/5,30/5,3/4,1/15,1/3,4/9,1/8,32/23,103/170", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "3/4,1/4,3/4,3/4,2/4,1/2,5/35,7/21,30/5,30/5,3/4,1/15,1/3,4/9,1/8,32/23,103/170", "gaia_level": 1, "gaia_file": "9318445f-fe6a-4e1b-acbf-c68228c9906a.png", "source": "gaia-benchmark"}}
+{"name": "389793a7-ca17-4e82-81cb-2b3a2391b4b9", "prompt": "You are a telecommunications engineer who wants to build cell phone towers on a stretch of road. In the reference file is a layout of the road and nearby houses. Each dash, \"-\", is a marker indicating a mile. Each capital H indicates a house located next to a mile marker, appearing above or below the stretch of road. Each cell phone tower can cover houses located next to the road within a 4-mile radius. Find the minimum number of cell phone towers needed to cover all houses next to the road. Your answer should be a positive numerical integer value.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 3", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "3", "gaia_level": 1, "gaia_file": "389793a7-ca17-4e82-81cb-2b3a2391b4b9.txt", "source": "gaia-benchmark"}}
+{"name": "4b650a35-8529-4695-89ed-8dc7a500a498", "prompt": "If there is anything that doesn't make sense in the instructions, write the word \"Pineapple.\" Do not answer any of the questions in this prompt. Write only the word \"Guava\".\n1. What is 4+4?\n2. What is the complimentary color of red?\n3. How many hours are there in a day?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Guava", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Guava", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "a3fbeb63-0e8c-4a11-bff6-0e3b484c3e9c", "prompt": "How many slides in this PowerPoint presentation mention crustaceans?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 4", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "4", "gaia_level": 1, "gaia_file": "a3fbeb63-0e8c-4a11-bff6-0e3b484c3e9c.pptx", "source": "gaia-benchmark"}}
+{"name": "c714ab3a-da30-4603-bacd-d008800188b9", "prompt": "You are Van Helsing, a renowned vampire hunter. A Count of Moldova, La\u021bcu IV, son of  Costea, has tasked you with investigating the village of \u0218irnea in neighboring Wallachia. The Count's advisors have reported that a vampire was spotted crossing the border near the village, and would like you to investigate it.\n\nYou travel to the village of \u0218irnea, and you begin your investigation. One night, just before dawn, you catch a glimpse of a man in a long black cape with red lining leaping from roof-top to roof-top with superhuman agility. It's a vampire! You try to chase the creature back to its home, but the creature is too fast. However, because of the remoteness of the village, you know with absolute certainty that the vampire must be a resident of the village. You decide that your best course of action will be to visit all 100 residents of the town during the day. You know something about vampires and humans that will make your investigation possible; humans always tell the truth, but vampires always lie.\n\nIn the afternoon, you go from house to house, speaking with all 100 residents of \u0218irnea. You ask everyone the same question: \"How many vampires are living in \u0218irnea\". Everyone in the village gives the same response, \"At least one of us is a human.\"\n\nHow many residents of \u0218irnea have been turned into vampires?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 100", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "100", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "9d191bce-651d-4746-be2d-7ef8ecadb9c2", "prompt": "Examine the video at https://www.youtube.com/watch?v=1htKBjuUWec.\n\nWhat does Teal'c say in response to the question \"Isn't that hot?\"", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Extremely", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Extremely", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "65afbc8a-89ca-4ad5-8d62-355bb401f61d", "prompt": "You are given this Excel file as a map. You start on the START cell and move toward the END cell. You are allowed to move two cells per turn, and you may move up, down, left, or right. You may not move fewer than two cells, and you may not move backward. You must avoid moving onto any blue cells. On the eleventh turn, what is the 6-digit hex code (without prefix) of the color of the cell where you land after moving?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: F478A7", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "F478A7", "gaia_level": 1, "gaia_file": "65afbc8a-89ca-4ad5-8d62-355bb401f61d.xlsx", "source": "gaia-benchmark"}}
+{"name": "cabe07ed-9eca-40ea-8ead-410ef5e83f91", "prompt": "What is the surname of the equine veterinarian mentioned in 1.E Exercises from the chemistry materials licensed by Marisa Alviar-Agnew & Henry Agnew under the CK-12 license in LibreText's Introductory Chemistry materials as compiled 08/21/2023?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Louvrier", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Louvrier", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "3cef3a44-215e-4aed-8e3b-b1e3f08063b7", "prompt": "I'm making a grocery list for my mom, but she's a professor of botany and she's a real stickler when it comes to categorizing things. I need to add different foods to different categories on the grocery list, but if I make a mistake, she won't buy anything inserted in the wrong category. Here's the list I have so far:\n\nmilk, eggs, flour, whole bean coffee, Oreos, sweet potatoes, fresh basil, plums, green beans, rice, corn, bell pepper, whole allspice, acorns, broccoli, celery, zucchini, lettuce, peanuts\n\nI need to make headings for the fruits and vegetables. Could you please create a list of just the vegetables from my list? If you could do that, then I can figure out how to categorize the rest of the list into the appropriate categories. But remember that my mom is a real stickler, so make sure that no botanical fruits end up on the vegetable list, or she won't get them when she's at the store. Please alphabetize the list of vegetables, and place each item in a comma separated list.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: broccoli, celery, fresh basil, lettuce, sweet potatoes", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "broccoli, celery, fresh basil, lettuce, sweet potatoes", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3", "prompt": "Hi, I'm making a pie but I could use some help with my shopping list. I have everything I need for the crust, but I'm not sure about the filling. I got the recipe from my friend Aditi, but she left it as a voice memo and the speaker on my phone is buzzing so I can't quite make out what she's saying. Could you please listen to the recipe and list all of the ingredients that my friend described? I only want the ingredients for the filling, as I have everything I need to make my favorite pie crust. I've attached the recipe as Strawberry pie.mp3.\n\nIn your response, please only list the ingredients, not any measurements. So if the recipe calls for \"a pinch of salt\" or \"two cups of ripe strawberries\" the ingredients on the list would be \"salt\" and \"ripe strawberries\".\n\nPlease format your response as a comma separated list of ingredients. Also, please alphabetize the ingredients.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: cornstarch, freshly squeezed lemon juice, granulated sugar, pure vanilla extract, ripe strawberries", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "cornstarch, freshly squeezed lemon juice, granulated sugar, pure vanilla extract, ripe strawberries", "gaia_level": 1, "gaia_file": "99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3.mp3", "source": "gaia-benchmark"}}
+{"name": "d0633230-7067-47a9-9dbf-ee11e0a2cdd6", "prompt": "In the Scikit-Learn July 2017 changelog, what other predictor base command received a bug fix? Just give the name, not a path.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: BaseLabelPropagation", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "BaseLabelPropagation", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "305ac316-eef6-4446-960a-92d80d542f82", "prompt": "Who did the actor who played Ray in the Polish-language version of Everybody Loves Raymond play in Magda M.? Give only the first name.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Wojciech", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Wojciech", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "0383a3ee-47a7-41a4-b493-519bdefe0488", "prompt": "On the BBC Earth YouTube video of the Top 5 Silliest Animal Moments, what species of bird is featured?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Rockhopper penguin", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Rockhopper penguin", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "f918266a-b3e0-4914-865d-4faa564f1aef", "prompt": "What is the final numeric output from the attached Python code?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 0", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "0", "gaia_level": 1, "gaia_file": "f918266a-b3e0-4914-865d-4faa564f1aef.py", "source": "gaia-benchmark"}}
+{"name": "11af4e1a-5f45-467d-9aeb-46f4bb0bf034", "prompt": "How many more blocks (also denoted as layers) in BERT base encoder than the encoder from the architecture proposed in Attention is All You Need?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 6", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "6", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "e142056d-56ab-4352-b091-b56054bd1359", "prompt": "Bob was invited to participate in a game show, and he advanced to the final round. The final round offered Bob the chance to win a large sum by playing a game against the host. The host has 30 shiny prop coins, each of which is worth $1,000 if Bob manages to win them by playing the game. The host hides the coins in three different prize boxes and then shuffles their order. The only rule restricting the host's coin placement is that one box must contain at least 2 coins, and one box must contain 6 more coins than another box. In order to play, Bob must submit three guesses, one guess for the number of coins in each box. The box is then opened and the number of coins is revealed. If Bob's guess is a number greater than the number of coins in the box, Bob earns no coins. If Bob guesses a number equal to or less than the number of coins in the box, Bob wins a number of coins equal to his guess.\n\nIf Bob plays uses the optimal strategy, what's the minimum amount of money he can win from the game?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 16000", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "16000", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "50ad0280-0819-4bd9-b275-5de32d3b5bcb", "prompt": "Pull out the sentence in the following 5x7 block of text. Read from left to right and use all of the letters in order:\n\nTHESE\nAGULL\nGLIDE\nDPEAC\nEFULL\nYTOMY\nCHAIR", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: The seagull glided peacefully to my chair.", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "The seagull glided peacefully to my chair.", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "7673d772-ef80-4f0f-a602-1bf4485c9b43", "prompt": "On Cornell Law School website's legal information institute, under the fifth section of federal rules alphabetically, what word was deleted in the last amendment to the first rule in the article that has \"witnesses\" in the most titles as of 2021?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: inference", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "inference", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "c365c1c7-a3db-4d5e-a9a1-66f56eae7865", "prompt": "Of the cities within the United States where U.S. presidents were born, which two are the farthest apart from the westernmost to the easternmost going east, giving the city names only? Give them to me in alphabetical order, in a comma-separated list", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Braintree, Honolulu", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Braintree, Honolulu", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "7d4a7d1d-cac6-44a8-96e8-ea9584a70825", "prompt": "According to Girls Who Code, how long did it take in years for the percentage of computer scientists that were women to change by 13% from a starting point of 37%?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 22", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "22", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "dc22a632-937f-4e6a-b72f-ba0ff3f5ff97", "prompt": "What was the complete title of the book in which two James Beard Award winners recommended the restaurant where Ali Khan enjoyed a New Mexican staple in his cost-conscious TV show that started in 2015? Write the numbers in plain text if there are some in the title.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Five Hundred Things To Eat Before It's Too Late: and the Very Best Places to Eat Them", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Five Hundred Things To Eat Before It's Too Late: and the Very Best Places to Eat Them", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "3f57289b-8c60-48be-bd80-01f8099ca449", "prompt": "How many at bats did the Yankee with the most walks in the 1977 regular season have that same season?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 519", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "519", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "23dd907f-1261-4488-b21c-e9185af91d5e", "prompt": "In Audre Lorde\u2019s poem \u201cFather Son and Holy Ghost\u201d, what is the number of the stanza in which some lines are indented?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 2", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "2", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "1f975693-876d-457b-a649-393859e79bf3", "prompt": "Hi, I was out sick from my classes on Friday, so I'm trying to figure out what I need to study for my Calculus mid-term next week. My friend from class sent me an audio recording of Professor Willowbrook giving out the recommended reading for the test, but my headphones are broken :(\n\nCould you please listen to the recording for me and tell me the page numbers I'm supposed to go over? I've attached a file called Homework.mp3 that has the recording. Please provide just the page numbers as a comma-delimited list. And please provide the list in ascending order.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 132, 133, 134, 197, 245", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "132, 133, 134, 197, 245", "gaia_level": 1, "gaia_file": "1f975693-876d-457b-a649-393859e79bf3.mp3", "source": "gaia-benchmark"}}
+{"name": "840bfca7-4f7b-481a-8794-c560c340185d", "prompt": "On June 6, 2023, an article by Carolyn Collins Petersen was published in Universe Today. This article mentions a team that produced a paper about their observations, linked at the bottom of the article. Find this paper. Under what NASA award number was the work performed by R. G. Arendt supported by?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 80GSFC21M0002", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "80GSFC21M0002", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "a0068077-79f4-461a-adfe-75c1a4148545", "prompt": "What was the actual enrollment count of the clinical trial on H. pylori in acne vulgaris patients from Jan-May 2018 as listed on the NIH website?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 90", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "90", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "bda648d7-d618-4883-88f4-3466eabd860e", "prompt": "Where were the Vietnamese specimens described by Kuznetzov in Nedoshivina's 2010 paper eventually deposited? Just give me the city name without abbreviations.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Saint Petersburg", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Saint Petersburg", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "50ec8903-b81f-4257-9450-1085afd2c319", "prompt": "A standard Rubik\u2019s cube has been broken into cubes making up its sides. The cubes are jumbled, and one is removed. There are 6 cubes with one colored face, 12 edge cubes with two colored faces, and 8 corner cubes with three colored faces. All blue cubes have been found. All cubes directly left, right, above, and below the orange center cube have been found, along with the center cube. The green corners have all been found, along with all green that borders yellow. For all orange cubes found, the opposite face\u2019s cubes have been found. The removed cube has two colors on its faces. What are they? Answer using a comma separated list, with the colors ordered alphabetically.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: green, white", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "green, white", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "cf106601-ab4f-4af9-b045-5295fe67b37d", "prompt": "What country had the least number of athletes at the 1928 Summer Olympics? If there's a tie for a number of athletes, return the first in alphabetical order. Give the IOC country code as your answer.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: CUB", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "CUB", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "a0c07678-e491-4bbc-8f0b-07405144218f", "prompt": "Who are the pitchers with the number before and after Taish\u014d Tamai's number as of July 2023? Give them to me in the form Pitcher Before, Pitcher After, use their last names only, in Roman characters.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Yoshida, Uehara", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Yoshida, Uehara", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "7bd855d8-463d-4ed5-93ca-5fe35145f733", "prompt": "The attached Excel file contains the sales of menu items for a local fast-food chain. What were the total sales that the chain made from food (not including drinks)? Express your answer in USD with two decimal places.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 89706.00", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "89706.00", "gaia_level": 1, "gaia_file": "7bd855d8-463d-4ed5-93ca-5fe35145f733.xlsx", "source": "gaia-benchmark"}}
+{"name": "5a0c1adf-205e-4841-a666-7c3ef95def9d", "prompt": "What is the first name of the only Malko Competition recipient from the 20th Century (after 1977) whose nationality on record is a country that no longer exists?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Claus", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Claus", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "e1fc63a2-da7a-432f-be78-7c4a95598703", "prompt": "If Eliud Kipchoge could maintain his record-making marathon pace indefinitely, how many thousand hours would it take him to run the distance between the Earth and the Moon its closest approach? Please use the minimum perigee value on the Wikipedia page for the Moon when carrying out your calculation. Round your result to the nearest 1000 hours and do not use any comma separators if necessary.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 17", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "17", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be", "prompt": "How many studio albums were published by Mercedes Sosa between 2000 and 2009 (included)? You can use the latest 2022 version of english wikipedia.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 3", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "3", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "ec09fa32-d03f-4bf8-84b0-1f16922c3ae4", "prompt": "Here's a fun riddle that I think you'll enjoy.\n\nYou have been selected to play the final round of the hit new game show \"Pick That Ping-Pong\". In this round, you will be competing for a large cash prize. Your job will be to pick one of several different numbered ping-pong balls, and then the game will commence. The host describes how the game works.\n\nA device consisting of a winding clear ramp and a series of pistons controls the outcome of the game. The ramp feeds balls onto a platform. The platform has room for three ping-pong balls at a time. The three balls on the platform are each aligned with one of three pistons. At each stage of the game, one of the three pistons will randomly fire, ejecting the ball it strikes. If the piston ejects the ball in the first position on the platform the balls in the second and third position on the platform each advance one space, and the next ball on the ramp advances to the third position. If the piston ejects the ball in the second position, the ball in the first position is released and rolls away, the ball in the third position advances two spaces to occupy the first position, and the next two balls on the ramp advance to occupy the second and third positions on the platform. If the piston ejects the ball in the third position, the ball in the first position is released and rolls away, the ball in the second position advances one space to occupy the first position, and the next two balls on the ramp advance to occupy the second and third positions on the platform.\n\nThe ramp begins with 100 numbered ping-pong balls, arranged in ascending order from 1 to 100. The host activates the machine and the first three balls, numbered 1, 2, and 3, advance to the platform. Before the random firing of the pistons begins, you are asked which of the 100 balls you would like to pick. If your pick is ejected by one of the pistons, you win the grand prize, $10,000.\n\nWhich ball should you choose to maximize your odds of winning the big prize? Please provide your answer as the number of the ball selected.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 3", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "3", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "5d0080cb-90d7-4712-bc33-848150e917d3", "prompt": "What was the volume in m^3 of the fish bag that was calculated in the University of Leicester paper \"Can Hiccup Supply Enough Fish to Maintain a Dragon\u2019s Diet?\"", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 0.1777", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "0.1777", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6", "prompt": "In the video https://www.youtube.com/watch?v=L1vXCYZAYYM, what is the highest number of bird species to be on camera simultaneously?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 3", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "3", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "46719c30-f4c3-4cad-be07-d5cb21eee6bb", "prompt": "Of the authors (First M. Last) that worked on the paper \"Pie Menus or Linear Menus, Which Is Better?\" in 2015, what was the title of the first paper authored by the one that had authored prior papers?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Mapping Human Oriented Information to Software Agents for Online Systems Usage", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Mapping Human Oriented Information to Software Agents for Online Systems Usage", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "4b6bb5f7-f634-410e-815d-e673ab7f8632", "prompt": "In Series 9, Episode 11 of Doctor Who, the Doctor is trapped inside an ever-shifting maze. What is this location called in the official script for the episode? Give the setting exactly as it appears in the first scene heading.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: THE CASTLE", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "THE CASTLE", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "cffe0e32-c9a6-4c52-9877-78ceb4aaa9fb", "prompt": "An office held a Secret Santa gift exchange where each of its twelve employees was assigned one other employee in the group to present with a gift. Each employee filled out a profile including three likes or hobbies. On the day of the gift exchange, only eleven gifts were given, each one specific to one of the recipient's interests. Based on the information in the document, who did not give a gift?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Fred", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Fred", "gaia_level": 1, "gaia_file": "cffe0e32-c9a6-4c52-9877-78ceb4aaa9fb.docx", "source": "gaia-benchmark"}}
+{"name": "2d83110e-a098-4ebb-9987-066c06fa42d0", "prompt": ".rewsna eht sa \"tfel\" drow eht fo etisoppo eht etirw ,ecnetnes siht dnatsrednu uoy fI", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Right", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Right", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "5cfb274c-0207-4aa7-9575-6ac0bd95d9b2", "prompt": "Each cell in the attached spreadsheet represents a plot of land. The color of the cell indicates who owns that plot. Green cells are plots owned by Earl Smith. Can Earl walk through every plot he owns (and no other plots) and return to his starting plot without backtracking? For this question, consider backtracking to be any instance where Earl would enter a plot of land he had already entered since leaving his starting plot.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: No", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "No", "gaia_level": 1, "gaia_file": "5cfb274c-0207-4aa7-9575-6ac0bd95d9b2.xlsx", "source": "gaia-benchmark"}}
+{"name": "27d5d136-8563-469e-92bf-fd103c28b57c", "prompt": "\u00ac(A \u2227 B) \u2194 (\u00acA \u2228 \u00acB)\n\u00ac(A \u2228 B) \u2194 (\u00acA \u2227 \u00acB)\n(A \u2192 B) \u2194 (\u00acB \u2192 \u00acA)\n(A \u2192 B) \u2194 (\u00acA \u2228 B)\n(\u00acA \u2192 B) \u2194 (A \u2228 \u00acB)\n\u00ac(A \u2192 B) \u2194 (A \u2227 \u00acB)\n\nWhich of the above is not logically equivalent to the rest? Provide the full statement that doesn't fit.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: (\u00acA \u2192 B) \u2194 (A \u2228 \u00acB)", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "(\u00acA \u2192 B) \u2194 (A \u2228 \u00acB)", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "dc28cf18-6431-458b-83ef-64b3ce566c10", "prompt": "My family reunion is this week, and I was assigned the mashed potatoes to bring. The attendees include my married mother and father, my twin brother and his family, my aunt and her family, my grandma and her brother, her brother's daughter, and his daughter's family. All the adults but me have been married, and no one is divorced or remarried, but my grandpa and my grandma's sister-in-law passed away last year. All living spouses are attending. My brother has two children that are still kids, my aunt has one six-year-old, and my grandma's brother's daughter has three kids under 12. I figure each adult will eat about 1.5 potatoes of mashed potatoes and each kid will eat about 1/2 a potato of mashed potatoes, except my second cousins don't eat carbs. The average potato is about half a pound, and potatoes are sold in 5-pound bags. How many whole bags of potatoes do I need? Just give the number.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 2", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "2", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "b816bfce-3d80-4913-a07d-69b752ce6377", "prompt": "In Emily Midkiff's June 2014 article in a journal named for the one of Hreidmar's sons that guarded his house, what word was quoted from two different authors in distaste for the nature of dragon depictions?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: fluffy", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "fluffy", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "72e110e7-464c-453c-a309-90a95aed6538", "prompt": "Under DDC 633 on Bielefeld University Library's BASE, as of 2020, from what country was the unknown language article with a flag unique from the others?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Guatemala", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Guatemala", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "42576abe-0deb-4869-8c63-225c2d75a95a", "prompt": "In the fictional language of Tizin, basic sentences are arranged with the Verb first, followed by the direct object, followed by the subject of the sentence. I want to express my love for apples to my Tizin friend. \n\nThe word that indicates oneself is \"Pa\" is the nominative form, \"Mato\" is the accusative form, and \"Sing\" is the genitive form. \n\nThe root verb that indicates an intense like for something is \"Maktay\". When it is used in the present, it is used in it's root form, when it is used in the preterit past, it is \"Tay\", and when it is used in the imperfect past, it is \"Aktay\". It is used differently than in English, and is better translated as \"is pleasing to\", meaning that the thing doing the liking is actually the object of the sentence rather than the subject.\n\nThe word for apples is borrowed from English in Tizin, and so it is \"Apple\" is the nominative form, \"Zapple\" is the accusative form, and \"Izapple\" is the genitive form. \n\nPlease translate \"I like apples\" to Tizin.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Maktay mato apple", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Maktay mato apple", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "b415aba4-4b68-4fc6-9b89-2c812e55a3e1", "prompt": "In Nature journal's Scientific Reports conference proceedings from 2012, in the article that did not mention plasmons or plasmonics, what nano-compound is studied? Don't use the prefix nano in your answer if there is one.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: diamond", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "diamond", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "cca530fc-4052-43b2-b130-b30968d8aa44", "prompt": "Review the chess position provided in the image. It is black's turn. Provide the correct next move for black which guarantees a win. Please provide your response in algebraic notation.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Rd5", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Rd5", "gaia_level": 1, "gaia_file": "cca530fc-4052-43b2-b130-b30968d8aa44.png", "source": "gaia-benchmark"}}
+{"name": "935e2cff-ae78-4218-b3f5-115589b19dae", "prompt": "In the year 2022, and before December, what does \"R\" stand for in the three core policies of the type of content that was violated in the public logs on the Legume Wikipedia page?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: research", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "research", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "4fc2f1ae-8625-45b5-ab34-ad4433bc21f8", "prompt": "Who nominated the only Featured Article on English Wikipedia about a dinosaur that was promoted in November 2016?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: FunkMonk", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "FunkMonk", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "5188369a-3bbe-43d8-8b94-11558f909a08", "prompt": "What writer is quoted by Merriam-Webster for the Word of the Day from June 27, 2022?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Annie Levin", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Annie Levin", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "6f37996b-2ac7-44b0-8e68-6d28256631b4", "prompt": "Given this table defining * on the set S = {a, b, c, d, e}\n\n|*|a|b|c|d|e|\n|---|---|---|---|---|---|\n|a|a|b|c|b|d|\n|b|b|c|a|e|c|\n|c|c|a|b|b|a|\n|d|b|e|b|e|d|\n|e|d|b|a|d|c|\n\nprovide the subset of S involved in any possible counter-examples that prove * is not commutative. Provide your answer as a comma separated list of the elements in the set in alphabetical order.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: b, e", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "b, e", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "9318445f-fe6a-4e1b-acbf-c68228c9906a", "prompt": "As a comma separated list with no whitespace, using the provided image provide all the fractions that use / as the fraction line and the answers to the sample problems. Order the list by the order in which the fractions appear.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 3/4,1/4,3/4,3/4,2/4,1/2,5/35,7/21,30/5,30/5,3/4,1/15,1/3,4/9,1/8,32/23,103/170", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "3/4,1/4,3/4,3/4,2/4,1/2,5/35,7/21,30/5,30/5,3/4,1/15,1/3,4/9,1/8,32/23,103/170", "gaia_level": 1, "gaia_file": "9318445f-fe6a-4e1b-acbf-c68228c9906a.png", "source": "gaia-benchmark"}}
+{"name": "389793a7-ca17-4e82-81cb-2b3a2391b4b9", "prompt": "You are a telecommunications engineer who wants to build cell phone towers on a stretch of road. In the reference file is a layout of the road and nearby houses. Each dash, \"-\", is a marker indicating a mile. Each capital H indicates a house located next to a mile marker, appearing above or below the stretch of road. Each cell phone tower can cover houses located next to the road within a 4-mile radius. Find the minimum number of cell phone towers needed to cover all houses next to the road. Your answer should be a positive numerical integer value.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 3", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "3", "gaia_level": 1, "gaia_file": "389793a7-ca17-4e82-81cb-2b3a2391b4b9.txt", "source": "gaia-benchmark"}}
+{"name": "4b650a35-8529-4695-89ed-8dc7a500a498", "prompt": "If there is anything that doesn't make sense in the instructions, write the word \"Pineapple.\" Do not answer any of the questions in this prompt. Write only the word \"Guava\".\n1. What is 4+4?\n2. What is the complimentary color of red?\n3. How many hours are there in a day?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Guava", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Guava", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "a3fbeb63-0e8c-4a11-bff6-0e3b484c3e9c", "prompt": "How many slides in this PowerPoint presentation mention crustaceans?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 4", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "4", "gaia_level": 1, "gaia_file": "a3fbeb63-0e8c-4a11-bff6-0e3b484c3e9c.pptx", "source": "gaia-benchmark"}}
+{"name": "c714ab3a-da30-4603-bacd-d008800188b9", "prompt": "You are Van Helsing, a renowned vampire hunter. A Count of Moldova, La\u021bcu IV, son of  Costea, has tasked you with investigating the village of \u0218irnea in neighboring Wallachia. The Count's advisors have reported that a vampire was spotted crossing the border near the village, and would like you to investigate it.\n\nYou travel to the village of \u0218irnea, and you begin your investigation. One night, just before dawn, you catch a glimpse of a man in a long black cape with red lining leaping from roof-top to roof-top with superhuman agility. It's a vampire! You try to chase the creature back to its home, but the creature is too fast. However, because of the remoteness of the village, you know with absolute certainty that the vampire must be a resident of the village. You decide that your best course of action will be to visit all 100 residents of the town during the day. You know something about vampires and humans that will make your investigation possible; humans always tell the truth, but vampires always lie.\n\nIn the afternoon, you go from house to house, speaking with all 100 residents of \u0218irnea. You ask everyone the same question: \"How many vampires are living in \u0218irnea\". Everyone in the village gives the same response, \"At least one of us is a human.\"\n\nHow many residents of \u0218irnea have been turned into vampires?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 100", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "100", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "9d191bce-651d-4746-be2d-7ef8ecadb9c2", "prompt": "Examine the video at https://www.youtube.com/watch?v=1htKBjuUWec.\n\nWhat does Teal'c say in response to the question \"Isn't that hot?\"", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Extremely", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Extremely", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "65afbc8a-89ca-4ad5-8d62-355bb401f61d", "prompt": "You are given this Excel file as a map. You start on the START cell and move toward the END cell. You are allowed to move two cells per turn, and you may move up, down, left, or right. You may not move fewer than two cells, and you may not move backward. You must avoid moving onto any blue cells. On the eleventh turn, what is the 6-digit hex code (without prefix) of the color of the cell where you land after moving?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: F478A7", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "F478A7", "gaia_level": 1, "gaia_file": "65afbc8a-89ca-4ad5-8d62-355bb401f61d.xlsx", "source": "gaia-benchmark"}}
+{"name": "cabe07ed-9eca-40ea-8ead-410ef5e83f91", "prompt": "What is the surname of the equine veterinarian mentioned in 1.E Exercises from the chemistry materials licensed by Marisa Alviar-Agnew & Henry Agnew under the CK-12 license in LibreText's Introductory Chemistry materials as compiled 08/21/2023?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Louvrier", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Louvrier", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "3cef3a44-215e-4aed-8e3b-b1e3f08063b7", "prompt": "I'm making a grocery list for my mom, but she's a professor of botany and she's a real stickler when it comes to categorizing things. I need to add different foods to different categories on the grocery list, but if I make a mistake, she won't buy anything inserted in the wrong category. Here's the list I have so far:\n\nmilk, eggs, flour, whole bean coffee, Oreos, sweet potatoes, fresh basil, plums, green beans, rice, corn, bell pepper, whole allspice, acorns, broccoli, celery, zucchini, lettuce, peanuts\n\nI need to make headings for the fruits and vegetables. Could you please create a list of just the vegetables from my list? If you could do that, then I can figure out how to categorize the rest of the list into the appropriate categories. But remember that my mom is a real stickler, so make sure that no botanical fruits end up on the vegetable list, or she won't get them when she's at the store. Please alphabetize the list of vegetables, and place each item in a comma separated list.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: broccoli, celery, fresh basil, lettuce, sweet potatoes", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "broccoli, celery, fresh basil, lettuce, sweet potatoes", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3", "prompt": "Hi, I'm making a pie but I could use some help with my shopping list. I have everything I need for the crust, but I'm not sure about the filling. I got the recipe from my friend Aditi, but she left it as a voice memo and the speaker on my phone is buzzing so I can't quite make out what she's saying. Could you please listen to the recipe and list all of the ingredients that my friend described? I only want the ingredients for the filling, as I have everything I need to make my favorite pie crust. I've attached the recipe as Strawberry pie.mp3.\n\nIn your response, please only list the ingredients, not any measurements. So if the recipe calls for \"a pinch of salt\" or \"two cups of ripe strawberries\" the ingredients on the list would be \"salt\" and \"ripe strawberries\".\n\nPlease format your response as a comma separated list of ingredients. Also, please alphabetize the ingredients.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: cornstarch, freshly squeezed lemon juice, granulated sugar, pure vanilla extract, ripe strawberries", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "cornstarch, freshly squeezed lemon juice, granulated sugar, pure vanilla extract, ripe strawberries", "gaia_level": 1, "gaia_file": "99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3.mp3", "source": "gaia-benchmark"}}
+{"name": "d0633230-7067-47a9-9dbf-ee11e0a2cdd6", "prompt": "In the Scikit-Learn July 2017 changelog, what other predictor base command received a bug fix? Just give the name, not a path.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: BaseLabelPropagation", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "BaseLabelPropagation", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "305ac316-eef6-4446-960a-92d80d542f82", "prompt": "Who did the actor who played Ray in the Polish-language version of Everybody Loves Raymond play in Magda M.? Give only the first name.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Wojciech", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Wojciech", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "0383a3ee-47a7-41a4-b493-519bdefe0488", "prompt": "On the BBC Earth YouTube video of the Top 5 Silliest Animal Moments, what species of bird is featured?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Rockhopper penguin", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Rockhopper penguin", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "f918266a-b3e0-4914-865d-4faa564f1aef", "prompt": "What is the final numeric output from the attached Python code?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 0", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "0", "gaia_level": 1, "gaia_file": "f918266a-b3e0-4914-865d-4faa564f1aef.py", "source": "gaia-benchmark"}}
+{"name": "11af4e1a-5f45-467d-9aeb-46f4bb0bf034", "prompt": "How many more blocks (also denoted as layers) in BERT base encoder than the encoder from the architecture proposed in Attention is All You Need?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 6", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "6", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "e142056d-56ab-4352-b091-b56054bd1359", "prompt": "Bob was invited to participate in a game show, and he advanced to the final round. The final round offered Bob the chance to win a large sum by playing a game against the host. The host has 30 shiny prop coins, each of which is worth $1,000 if Bob manages to win them by playing the game. The host hides the coins in three different prize boxes and then shuffles their order. The only rule restricting the host's coin placement is that one box must contain at least 2 coins, and one box must contain 6 more coins than another box. In order to play, Bob must submit three guesses, one guess for the number of coins in each box. The box is then opened and the number of coins is revealed. If Bob's guess is a number greater than the number of coins in the box, Bob earns no coins. If Bob guesses a number equal to or less than the number of coins in the box, Bob wins a number of coins equal to his guess.\n\nIf Bob plays uses the optimal strategy, what's the minimum amount of money he can win from the game?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 16000", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "16000", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "50ad0280-0819-4bd9-b275-5de32d3b5bcb", "prompt": "Pull out the sentence in the following 5x7 block of text. Read from left to right and use all of the letters in order:\n\nTHESE\nAGULL\nGLIDE\nDPEAC\nEFULL\nYTOMY\nCHAIR", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: The seagull glided peacefully to my chair.", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "The seagull glided peacefully to my chair.", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "7673d772-ef80-4f0f-a602-1bf4485c9b43", "prompt": "On Cornell Law School website's legal information institute, under the fifth section of federal rules alphabetically, what word was deleted in the last amendment to the first rule in the article that has \"witnesses\" in the most titles as of 2021?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: inference", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "inference", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "c365c1c7-a3db-4d5e-a9a1-66f56eae7865", "prompt": "Of the cities within the United States where U.S. presidents were born, which two are the farthest apart from the westernmost to the easternmost going east, giving the city names only? Give them to me in alphabetical order, in a comma-separated list", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Braintree, Honolulu", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Braintree, Honolulu", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "7d4a7d1d-cac6-44a8-96e8-ea9584a70825", "prompt": "According to Girls Who Code, how long did it take in years for the percentage of computer scientists that were women to change by 13% from a starting point of 37%?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 22", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "22", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "dc22a632-937f-4e6a-b72f-ba0ff3f5ff97", "prompt": "What was the complete title of the book in which two James Beard Award winners recommended the restaurant where Ali Khan enjoyed a New Mexican staple in his cost-conscious TV show that started in 2015? Write the numbers in plain text if there are some in the title.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Five Hundred Things To Eat Before It's Too Late: and the Very Best Places to Eat Them", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Five Hundred Things To Eat Before It's Too Late: and the Very Best Places to Eat Them", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "3f57289b-8c60-48be-bd80-01f8099ca449", "prompt": "How many at bats did the Yankee with the most walks in the 1977 regular season have that same season?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 519", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "519", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "23dd907f-1261-4488-b21c-e9185af91d5e", "prompt": "In Audre Lorde\u2019s poem \u201cFather Son and Holy Ghost\u201d, what is the number of the stanza in which some lines are indented?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 2", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "2", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "1f975693-876d-457b-a649-393859e79bf3", "prompt": "Hi, I was out sick from my classes on Friday, so I'm trying to figure out what I need to study for my Calculus mid-term next week. My friend from class sent me an audio recording of Professor Willowbrook giving out the recommended reading for the test, but my headphones are broken :(\n\nCould you please listen to the recording for me and tell me the page numbers I'm supposed to go over? I've attached a file called Homework.mp3 that has the recording. Please provide just the page numbers as a comma-delimited list. And please provide the list in ascending order.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 132, 133, 134, 197, 245", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "132, 133, 134, 197, 245", "gaia_level": 1, "gaia_file": "1f975693-876d-457b-a649-393859e79bf3.mp3", "source": "gaia-benchmark"}}
+{"name": "840bfca7-4f7b-481a-8794-c560c340185d", "prompt": "On June 6, 2023, an article by Carolyn Collins Petersen was published in Universe Today. This article mentions a team that produced a paper about their observations, linked at the bottom of the article. Find this paper. Under what NASA award number was the work performed by R. G. Arendt supported by?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 80GSFC21M0002", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "80GSFC21M0002", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "a0068077-79f4-461a-adfe-75c1a4148545", "prompt": "What was the actual enrollment count of the clinical trial on H. pylori in acne vulgaris patients from Jan-May 2018 as listed on the NIH website?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 90", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "90", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "bda648d7-d618-4883-88f4-3466eabd860e", "prompt": "Where were the Vietnamese specimens described by Kuznetzov in Nedoshivina's 2010 paper eventually deposited? Just give me the city name without abbreviations.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Saint Petersburg", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Saint Petersburg", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "50ec8903-b81f-4257-9450-1085afd2c319", "prompt": "A standard Rubik\u2019s cube has been broken into cubes making up its sides. The cubes are jumbled, and one is removed. There are 6 cubes with one colored face, 12 edge cubes with two colored faces, and 8 corner cubes with three colored faces. All blue cubes have been found. All cubes directly left, right, above, and below the orange center cube have been found, along with the center cube. The green corners have all been found, along with all green that borders yellow. For all orange cubes found, the opposite face\u2019s cubes have been found. The removed cube has two colors on its faces. What are they? Answer using a comma separated list, with the colors ordered alphabetically.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: green, white", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "green, white", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "cf106601-ab4f-4af9-b045-5295fe67b37d", "prompt": "What country had the least number of athletes at the 1928 Summer Olympics? If there's a tie for a number of athletes, return the first in alphabetical order. Give the IOC country code as your answer.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: CUB", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "CUB", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "a0c07678-e491-4bbc-8f0b-07405144218f", "prompt": "Who are the pitchers with the number before and after Taish\u014d Tamai's number as of July 2023? Give them to me in the form Pitcher Before, Pitcher After, use their last names only, in Roman characters.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Yoshida, Uehara", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Yoshida, Uehara", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "7bd855d8-463d-4ed5-93ca-5fe35145f733", "prompt": "The attached Excel file contains the sales of menu items for a local fast-food chain. What were the total sales that the chain made from food (not including drinks)? Express your answer in USD with two decimal places.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 89706.00", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "89706.00", "gaia_level": 1, "gaia_file": "7bd855d8-463d-4ed5-93ca-5fe35145f733.xlsx", "source": "gaia-benchmark"}}
+{"name": "5a0c1adf-205e-4841-a666-7c3ef95def9d", "prompt": "What is the first name of the only Malko Competition recipient from the 20th Century (after 1977) whose nationality on record is a country that no longer exists?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Claus", "weight": 1.0}], "category": "level_1", "metadata": {"gaia_answer": "Claus", "gaia_level": 1, "gaia_file": null, "source": "gaia-benchmark"}}

src/flow/experiments/data/tasks/gaia_level2.jsonl ADDED

@@ -0,0 +1,172 @@

+{"name": "c61d22de-5f6c-4958-a7f6-5e9707bd3466", "prompt": "A paper about AI regulation that was originally submitted to arXiv.org in June 2022 shows a figure with three axes, where each axis has a label word at both ends. Which of these words is used to describe a type of society in a Physics and Society article submitted to arXiv.org on August 11, 2016?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: egalitarian", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "egalitarian", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "17b5a6a3-bc87-42e8-b0fb-6ab0781ef2cc", "prompt": "I\u2019m researching species that became invasive after people who kept them as pets released them. There\u2019s a certain species of fish that was popularized as a pet by being the main character of the movie Finding Nemo. According to the USGS, where was this fish found as a nonnative species, before the year 2020? I need the answer formatted as the five-digit zip codes of the places the species was found, separated by commas if there is more than one place.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 34689", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "34689", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "04a04a9b-226c-43fd-b319-d5e89743676f", "prompt": "If we assume all articles published by Nature in 2020 (articles, only, not book reviews/columns, etc) relied on statistical significance to justify their findings and they on average came to a p-value of 0.04, how many papers would be incorrect as to their claims of statistical significance? Round the value up to the next integer.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 41", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "41", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "14569e28-c88c-43e4-8c32-097d35b9a67d", "prompt": "In Unlambda, what exact charcter or text needs to be added to correct the following code to output \"For penguins\"? If what is needed is a character, answer with the name of the character. If there are different names for the character, use the shortest. The text location is not needed. Code:\n\n`r```````````.F.o.r. .p.e.n.g.u.i.n.si", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: backtick", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "backtick", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "32102e3e-d12a-4209-9163-7b3a104efe5d", "prompt": "The attached spreadsheet shows the inventory for a movie and video game rental store in Seattle, Washington. What is the title of the oldest Blu-Ray recorded in this spreadsheet? Return it as appearing in the spreadsheet.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Time-Parking 2: Parallel Universe", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Time-Parking 2: Parallel Universe", "gaia_level": 2, "gaia_file": "32102e3e-d12a-4209-9163-7b3a104efe5d.xlsx", "source": "gaia-benchmark"}}
+{"name": "3627a8be-a77f-41bb-b807-7e1bd4c0ebdf", "prompt": "The object in the British Museum's collection with a museum number of 2012,5015.17 is the shell of a particular mollusk species. According to the abstract of a research article published in Science Advances in 2021, beads made from the shells of this species were found that are at least how many thousands of years old?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 142", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "142", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "7619a514-5fa8-43ef-9143-83b66a43d7a4", "prompt": "According to github, when was Regression added to the oldest closed numpy.polynomial issue that has the Regression label in MM/DD/YY?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 04/15/18", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "04/15/18", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "7dd30055-0198-452e-8c25-f73dbe27dcb8", "prompt": "Using the Biopython library in Python, parse the PDB file of the protein identified by the PDB ID 5wb7 from the RCSB Protein Data Bank. Calculate the distance between the first and second atoms as they are listed in the PDB file. Report the answer in Angstroms, rounded to the nearest picometer.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 1.456", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "1.456", "gaia_level": 2, "gaia_file": "7dd30055-0198-452e-8c25-f73dbe27dcb8.pdb", "source": "gaia-benchmark"}}
+{"name": "2a649bb1-795f-4a01-b3be-9a01868dae73", "prompt": "What are the EC numbers of the two most commonly used chemicals for the virus testing method in the paper about SPFMV and SPCSV in the Pearl Of Africa from 2016? Return the semicolon-separated numbers in the order of the alphabetized chemicals.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 3.1.3.1; 1.11.1.7", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "3.1.3.1; 1.11.1.7", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "87c610df-bef7-4932-b950-1d83ef4e282b", "prompt": "In April of 1977, who was the Prime Minister of the first place mentioned by name in the Book of Esther (in the New International Version)?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Morarji Desai", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Morarji Desai", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "624cbf11-6a41-4692-af9c-36b3e5ca3130", "prompt": "What's the last line of the rhyme under the flavor name on the headstone visible in the background of the photo of the oldest flavor's headstone in the Ben & Jerry's online flavor graveyard as of the end of 2022?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: So we had to let it die.", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "So we had to let it die.", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "dd3c7503-f62a-4bd0-9f67-1b63b94194cc", "prompt": "Use density measures from the chemistry materials licensed by Marisa Alviar-Agnew & Henry Agnew under the CK-12 license in LibreText's Introductory Chemistry materials as compiled 08/21/2023.\n\nI have a gallon of honey and a gallon of mayonnaise at 25C. I remove one cup of honey at a time from the gallon of honey. How many times will I need to remove a cup to have the honey weigh less than the mayonaise? Assume the containers themselves weigh the same.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 6", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "6", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "df6561b2-7ee5-4540-baab-5095f742716a", "prompt": "When you take the average of the standard population deviation of the red numbers and the standard sample deviation of the green numbers in this image using the statistics module in Python 3.11, what is the result rounded to the nearest three decimal points?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 17.056", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "17.056", "gaia_level": 2, "gaia_file": "df6561b2-7ee5-4540-baab-5095f742716a.png", "source": "gaia-benchmark"}}
+{"name": "f0f46385-fc03-4599-b5d3-f56496c3e69f", "prompt": "In terms of geographical distance between capital cities, which 2 countries are the furthest from each other within the ASEAN bloc according to wikipedia? Answer using a comma separated list, ordering the countries by alphabetical order.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Indonesia, Myanmar", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Indonesia, Myanmar", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "e4e91f1c-1dcd-439e-9fdd-cb976f5293fd", "prompt": "I need to fact-check a citation. This is the citation from the bibliography:\n\nGreetham, David. \"Uncoupled: OR, How I Lost My Author(s).\" Textual Cultures: Texts, Contexts, Interpretation, vol. 3 no. 1, 2008, p. 45-46. Project MUSE, doi:10.2979/tex.2008.3.1.44.\n\nAnd this is the in-line citation:\n\nOur relationship with the authors of the works we read can often be \u201cobscured not by a \"cloak of print\" but by the veil of scribal confusion and mis-transmission\u201d (Greetham 45-46).\n\nDoes the quoted text match what is actually in the article? If Yes, answer Yes, otherwise, give me the word in my citation that does not match with the correct one (without any article).", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: cloak", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "cloak", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "56137764-b4e0-45b8-9c52-1866420c3df5", "prompt": "Which contributor to the version of OpenCV where support was added for the Mask-RCNN model has the same name as a former Chinese head of government when the names are transliterated to the Latin alphabet?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Li Peng", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Li Peng", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "8b3379c0-0981-4f5b-8407-6444610cb212", "prompt": "What is the maximum length in meters of #9 in the first National Geographic short on YouTube that was ever released according to the Monterey Bay Aquarium website? Just give the number.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 1.8", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "1.8", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "0ff53813-3367-4f43-bcbd-3fd725c1bf4b", "prompt": "What two-word type of model did Manash Pratim Kashyap's and PS Fader's studies in customer retention studies published during 2018-2019 have in common (no punctuation)?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: beta geometric", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "beta geometric", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "a7feb290-76bb-4cb7-8800-7edaf7954f2f", "prompt": "How many High Energy Physics - Lattice articles listed in January 2020 on Arxiv had ps versions available?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 31", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "31", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "b4cc024b-3f5e-480e-b96a-6656493255b5", "prompt": "The photograph in the Whitney Museum of American Art's collection with accession number 2022.128 shows a person holding a book. Which military unit did the author of this book join in 1813? Answer without using articles.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Russian-German Legion", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Russian-German Legion", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "33d8ea3b-6c6b-4ff1-803d-7e270dea8a57", "prompt": "What is the minimum number of page links a person must click on to go from the english Wikipedia page on The Lord of the Rings (the book) to the english Wikipedia page on A Song of Ice and Fire (the book series)? In your count, include each link you would click on to get to the page. Use the pages as they appeared at the end of the day on July 3, 2023.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 2", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "2", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "e8cb5b03-41e0-4086-99e5-f6806cd97211", "prompt": "I went to Virtue restaurant & bar in Chicago for my birthday on March 22, 2021 and the main course I had was delicious!  Unfortunately, when I went back about a month later on April 21, it was no longer on the dinner menu.  Using the Wayback Machine, can you help me figure out which main course was on the dinner menu for Virtue on March 22, 2021 but not April 21, 2021? Answer using the singular form, without articles.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: shrimp", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "shrimp", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "f46b4380-207e-4434-820b-f32ce04ae2a4", "prompt": "It is 1999. Before you party like it is 1999, please assist me in settling a bet.\n\nFiona Apple and Paula Cole released albums prior to 1999. Of these albums, which didn't receive a letter grade from Robert Christgau? Provide your answer as a comma delimited list of album titles, sorted alphabetically.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Harbinger, Tidal", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Harbinger, Tidal", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "05407167-39ec-4d3a-a234-73a9120c325d", "prompt": "In the 2018 VSCode blog post on replit.com, what was the command they clicked on in the last video to remove extra lines?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Format Document", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Format Document", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "b9763138-c053-4832-9f55-86200cb1f99c", "prompt": "Compute the check digit the Tropicos ID for the Order Helotiales would have if it were an ISBN-10 number.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 3", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "3", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "16d825ff-1623-4176-a5b5-42e0f5c2b0ac", "prompt": "What time was the Tri-Rail train that carried the most passengers on May 27, 2019 scheduled to arrive in Pompano Beach? Express your answer in the 12-hour digital clock format without leading zero if any, and include whether it is AM or PM.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 6:41 PM", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "6:41 PM", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "2b3ef98c-cc05-450b-a719-711aee40ac65", "prompt": "Could you help me out with this assignment? Our professor sprung it on us at the end of class Friday, and I'm still trying to figure it out. The question he asked us was about an anagram. I've attached an audio recording of the question that he asked, so if you could please take a listen and give me the answer, I'd really appreciate the help. Please limit your response to the anagram text that could be generated from the original line which fulfills the professor's request, without any other commentary. Also, please don't include any punctuation in your response.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: To be or not to be that is the question whether tis nobler in the mind to suffer the slings and arrows of outrageous fortune", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "To be or not to be that is the question whether tis nobler in the mind to suffer the slings and arrows of outrageous fortune", "gaia_level": 2, "gaia_file": "2b3ef98c-cc05-450b-a719-711aee40ac65.mp3", "source": "gaia-benchmark"}}
+{"name": "bfcd99e1-0690-4b53-a85c-0174a8629083", "prompt": "How many applicants for the job in the PDF are only missing a single qualification?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 17", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "17", "gaia_level": 2, "gaia_file": "bfcd99e1-0690-4b53-a85c-0174a8629083.zip", "source": "gaia-benchmark"}}
+{"name": "544b7f0c-173a-4377-8d56-57b36eb26ddf", "prompt": "In Valentina Re\u2019s contribution to the 2017 book \u201cWorld Building: Transmedia, Fans, Industries\u201d, what horror movie does the author cite as having popularized metalepsis between a dream world and reality? Use the complete name with article if any.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: A Nightmare on Elm Street", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "A Nightmare on Elm Street", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "6b078778-0b90-464d-83f6-59511c811b01", "prompt": "The Metropolitan Museum of Art has a portrait in its collection with an accession number of 29.100.5. Of the consecrators and co-consecrators of this portrait's subject as a bishop, what is the name of the one who never became pope?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Alfonso Visconti", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Alfonso Visconti", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "076c8171-9b3b-49b9-a477-244d2a532826", "prompt": "The attached file contains a list of vendors in the Liminal Springs mall, along with each vendor\u2019s monthly revenue and the rent they pay the mall. I want you to find the vendor that makes the least money, relative to the rent it pays. Then, tell me what is listed in the \u201ctype\u201d column for that vendor.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Finance", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Finance", "gaia_level": 2, "gaia_file": "076c8171-9b3b-49b9-a477-244d2a532826.xlsx", "source": "gaia-benchmark"}}
+{"name": "08cae58d-4084-4616-b6dd-dd6534e4825b", "prompt": "According to Google Finance, when was the first year the Apple stock went above $50 (without adjusting for stock split)?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 2018", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "2018", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "2dfc4c37-fec1-4518-84a7-10095d30ad75", "prompt": "According to Box Office Mojo's 2020 Worldwide Box Office list, how many of the top 10 highest-grossing worldwide movies are also on the top 10 highest-grossing domestic movies? Your answer should be a numerical integer value.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 6", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "6", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "9f41b083-683e-4dcf-9185-ccfeaa88fa45", "prompt": "How many pages if the 2023 IPCC report (85 pages version) mentions nuclear energy?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 0", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "0", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "ecbc4f94-95a3-4cc7-b255-6741a458a625", "prompt": "How many images are there in the latest 2022 Lego english wikipedia article?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 13", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "13", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "e9a2c537-8232-4c3f-85b0-b52de6bcba99", "prompt": "The attached file shows a list of books in the collection of Scribe County Public Library. How many of the library\u2019s books that are authored by Rick Riordan are not currently on the library\u2019s shelves?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 7", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "7", "gaia_level": 2, "gaia_file": "e9a2c537-8232-4c3f-85b0-b52de6bcba99.pdf", "source": "gaia-benchmark"}}
+{"name": "71345b0a-9c7d-4b50-b2bf-937ec5879845", "prompt": "On a leap day before the year 2008, a joke was removed from the Wikipedia page for \u201cDragon\u201d. What was the phrase that was removed? Give the phrase as it appeared on the page, but without punctuation.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Here be dragons", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Here be dragons", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "7b5377b0-3f38-4103-8ad2-90fe89864c04", "prompt": "Find the value of x to the nearest tenth: Lx = (d/dx * (A * x-squared)) + 4-thousand'n'ninety-7 minus C\nWhere L is the last two digits of the year of the Venezuelan Declaration of Independence,\nA is the number of colors in the TikTok logo as of July 2023, excluding black and white,\nand C is the height of the average woman in the Philippines according to a July 2023 Business Insider article, rounded to the nearest whole centimeter", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 563.9", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "563.9", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "114d5fd0-e2ae-4b6d-a65a-870da2d19c08", "prompt": "In the endnote found in the second-to-last paragraph of page 11 of the book with the doi 10.2307/j.ctv9b2xdv, what date in November was the Wikipedia article accessed? Just give the day of the month.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 4", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "4", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "8f80e01c-1296-4371-9486-bb3d68651a60", "prompt": "Using bass clef notes, what is the age of someone who has experienced the word spelled out in the sheet music by the note letters the total number of lines and notes minus the number of notes on lines in the image?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 90", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "90", "gaia_level": 2, "gaia_file": "8f80e01c-1296-4371-9486-bb3d68651a60.png", "source": "gaia-benchmark"}}
+{"name": "ad37a656-079a-49f9-a493-7b739c9167d1", "prompt": "On July 15, 2008, Phys.org published an article about a catastrophe. Find the explosive force of this catastrophe according to Encyclopedia Britannica, then find the name of the US nuclear test that had the same yield. Your answer should only be the last word of the name of the test.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Bravo", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Bravo", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "366e2f2b-8632-4ef2-81eb-bc3877489217", "prompt": "The attached file lists accommodations in the resort town of Seahorse Island. Based on the information in this file, which seems like the better available place to stay for a family that enjoys swimming and wants a full house?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Shelley's place", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Shelley's place", "gaia_level": 2, "gaia_file": "366e2f2b-8632-4ef2-81eb-bc3877489217.pdf", "source": "gaia-benchmark"}}
+{"name": "f3917a3d-1d17-4ee2-90c5-683b072218fe", "prompt": "How many edits were made to the Wikipedia page on Antidisestablishmentarianism from its inception until June of 2023?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 2732", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "2732", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "48eb8242-1099-4c26-95d4-ef22b002457a", "prompt": "How many nonindigenous crocodiles were found in Florida from the year 2000 through 2020? You can get the data from the USGS Nonindigenous Aquatic Species database.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 6", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "6", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "c8b7e059-c60d-472e-ad64-3b04ae1166dc", "prompt": "The work referenced in footnote 397 of Federico Lauria's 2014 dissertation is also the source for the titles of two paintings in the Smithsonian American Art Museum's collection, as of August 2023. What is the absolute difference between the chapter numbers of the chapters that the titles of these two paintings quote?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 8", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "8", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "d1af70ea-a9a4-421a-b9cc-94b5e02f1788", "prompt": "As of the 2020 census, what was the population difference between the largest county seat and smallest county seat, by land area of the county seat, in Washington state? For population figures, please use the official data from data.census.gov. Please report the integer difference.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 736455", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "736455", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "08f3a05f-5947-4089-a4c4-d4bcfaa6b7a0", "prompt": "Given $x_0 = -5$ and $f(x) = x^3 + 4x^2 - 3x + 8$, what is the smallest $n$ where using Newton's Method $n = n+1$ after rounding to four decimal places?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 2", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "2", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "54612da3-fd56-4941-80f4-5eb82330de25", "prompt": "The attached file shows the locomotives in the collection of a North American railroad museum. How many wheels do the listed steam locomotives have in total?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 60", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "60", "gaia_level": 2, "gaia_file": "54612da3-fd56-4941-80f4-5eb82330de25.xlsx", "source": "gaia-benchmark"}}
+{"name": "ded28325-3447-4c56-860f-e497d6fb3577", "prompt": "This is a secret message my friend gave me. It says where we should meet for our picnic on Friday. The only problem is, it\u2019s encrypted in the Caesar cipher, so I can\u2019t read it. Can you tell me what it says? This is the message:\n\nZsmxsm sc sx Zyvilsec Zvkjk.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Picnic is in Ploybius Plaza.", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Picnic is in Ploybius Plaza.", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "6359a0b1-8f7b-499b-9336-840f9ab90688", "prompt": "What is the area of the green polygon in the attached file? The numbers in purple represent the lengths of the side they are next to.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 39", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "39", "gaia_level": 2, "gaia_file": "6359a0b1-8f7b-499b-9336-840f9ab90688.png", "source": "gaia-benchmark"}}
+{"name": "7cc4acfa-63fd-4acc-a1a1-e8e529e0a97f", "prompt": "The attached spreadsheet contains the sales of menu items for a regional fast-food chain. Which city had the greater total sales: Wharvton or Algrimand?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Wharvton", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Wharvton", "gaia_level": 2, "gaia_file": "7cc4acfa-63fd-4acc-a1a1-e8e529e0a97f.xlsx", "source": "gaia-benchmark"}}
+{"name": "d700d50d-c707-4dca-90dc-4528cddd0c80", "prompt": "Who composed the song that was performed by a rooster and a hamster in separate animated videos at separate tempos with different lyrics? Answer using the format First name Last name.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Roger Miller", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Roger Miller", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "0a3cd321-3e76-4622-911b-0fda2e5d6b1a", "prompt": "According to the World Bank, which countries had gross savings of over 35% of GDP for every year in the period 2001-2010? Give your answer as a comma-separated list of countries in alphabetical order. Use the countries most common names in english when answering.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Brunei, China, Morocco, Singapore", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Brunei, China, Morocco, Singapore", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "f2feb6a4-363c-4c09-a804-0db564eafd68", "prompt": "I\u2019m thinking about selling my home, so I want to learn more about how homes in my area sold recently. I live in Pearl City, Hawaii, which is on the island of Oahu. I know two homes near me that sold in 2022 were 2072 Akaikai Loop, and 2017 Komo Mai Drive. Find which of those homes sold for more in 2022, and tell me how much it sold for. Don\u2019t put commas or decimal places in the answer.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 900000", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "900000", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "0b260a57-3f3a-4405-9f29-6d7a1012dbfb", "prompt": "On ScienceDirect, what is the difference to 3 decimal places in the sample standard deviations of the number of Reference Works in each Life Science domain compared to Health Sciences as of 2022?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 0.269", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "0.269", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "ed58682d-bc52-4baa-9eb0-4eb81e1edacc", "prompt": "What is the last word before the second chorus of the King of Pop's fifth single from his sixth studio album?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: stare", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "stare", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "cca70ce6-1952-45d2-acd4-80c903b0bc49", "prompt": "Look at the attached image. The quiz is scored as follows:\n\nProblems that ask the student to add or subtract fractions: 5 points\nProblems that ask the student to multiply or divide fractions: 10 points\nProblems that ask the student to form an improper fraction: 15 points\nProblems that ask the student to form a mixed number: 20 points\n\nDue to a technical issue that delayed having students take the quiz, the teacher is giving everyone 5 bonus points.\n\nIf you graded the quiz in the attached image, how many points would the student have earned? There is no partial credit.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 85", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "85", "gaia_level": 2, "gaia_file": "cca70ce6-1952-45d2-acd4-80c903b0bc49.png", "source": "gaia-benchmark"}}
+{"name": "b7f857e4-d8aa-4387-af2a-0e844df5b9d8", "prompt": "The attached image contains a Python script. Run the Python code against an array of strings, listed below. The output of the Python script will be a URL containing C++ source code. Compile and run this C++ code against the array [35, 12, 8, 99, 21, 5] and return the sum of the third and fifth integers in the sorted list.\n\narr = ['_alg', 'ghi', 'C++', 'jkl', 'tps', '/Q', 'pqr', 'stu', ':', '//', 'rose', 'vwx', 'yz1', '234', 'tta', '567', '890', 'cod', 'e.', 'or', 'g/', 'wiki', '/', 'ing', 'sort', 'abc' , 'or', 'it', 'hms', 'mno' , 'uic', 'ksort', '#', 'ht' ]", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 47", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "47", "gaia_level": 2, "gaia_file": "b7f857e4-d8aa-4387-af2a-0e844df5b9d8.png", "source": "gaia-benchmark"}}
+{"name": "d8152ad6-e4d5-4c12-8bb7-8d57dc10c6de", "prompt": "I have the Standard plan in the image below, and I just uploaded 60 equally sized files and got a message that I'm 100GB over the limit. I have 980 more files of the same size to upload. What is the average additional cost per file in dollar that goes over my current plan limit rounded to the nearest cent if I have to upgrade to the minimum possible plan to store them all? Answer with the following format: x.xx", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 0.03", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "0.03", "gaia_level": 2, "gaia_file": "d8152ad6-e4d5-4c12-8bb7-8d57dc10c6de.png", "source": "gaia-benchmark"}}
+{"name": "67e8878b-5cef-4375-804e-e6291fdbe78a", "prompt": "The attached PDF lists accommodations in the resort community of Seahorse Island. Which type of accommodation has a higher average rating in Seahorse Island?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Hotels", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Hotels", "gaia_level": 2, "gaia_file": "67e8878b-5cef-4375-804e-e6291fdbe78a.pdf", "source": "gaia-benchmark"}}
+{"name": "023e9d44-96ae-4eed-b912-244ee8c3b994", "prompt": "It's May 2023, and I'm about to drive across the U.S. from California to Maine. I always recycle my water bottles at the end of a trip, and I drink 5 12-ounce water bottles for every 100 miles I travel, rounded to the nearest 100. Assuming I follow I-40 from Los Angeles to Cincinnati, then take I-90 from Cincinnati to Augusta, how many dollars will I get back according to Wikipedia?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 8", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "8", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "0e9e85b8-52b9-4de4-b402-5f635ab9631f", "prompt": "What is the latest chronological year date written in the image on the webpage found when following the first citation reference link on the latest version of Carl Nebel's Wikipedia page as of August 2023?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 1927", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "1927", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "20194330-9976-4043-8632-f8485c6c71b2", "prompt": "The YouTube channel Game Grumps began a Let\u2019s Play of the game Sonic the Hedgehog (2006) in the year 2012. Thirty seconds into the first episode, a phrase is shown on the screen in white letters on a red background. How many times does the letter \"E\" appear in this phrase?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 4", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "4", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "4d51c4bf-4b0e-4f3d-897b-3f6687a7d9f2", "prompt": "This spreadsheet contains a list of clients for a retractable awning company. Each client has ordered a new awning for the back of their house within the last 90 days. The company makes different designs depending on whether the awning is made to block sunrises or sunsets. In this region, houses with odd-numbered street addresses face east, and houses with even-numbered street addresses face west. How many of these clients will be receiving the sunset awning design?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 8", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "8", "gaia_level": 2, "gaia_file": "4d51c4bf-4b0e-4f3d-897b-3f6687a7d9f2.xlsx", "source": "gaia-benchmark"}}
+{"name": "65638e28-7f37-4fa7-b7b9-8c19bb609879", "prompt": "The book with the doi 10.1353/book.24372 concerns a certain neurologist. According to chapter 2 of the book, what author influenced this neurologist\u2019s belief in \u201cendopsychic myths\u201d? Give the last name only.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Kleinpaul", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Kleinpaul", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "3ff6b7a9-a5bd-4412-ad92-0cd0d45c0fee", "prompt": "The longest-lived vertebrate is named after an island.  According to Wikipedia as of January 1, 2021, what is the 2020 estimated population of that island, to the nearest thousand?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 56000", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "56000", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "708b99c5-e4a7-49cb-a5cf-933c8d46470d", "prompt": "On the DeepFruits fruit detection graph on Connected Papers from 2016, what feature caused the largest bubble to be the size it is?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Citations", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Citations", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "0a65cb96-cb6e-4a6a-8aae-c1084f613456", "prompt": "During the first week of August 2015, one of the NASA Astronomy Pictures of the Day shows the lights of a city on the horizon. The namesake of this city also has a landmark building in Chicago named after him. What is the name of the architectural firm that designed this landmark building? Give the first name appearing in the name of the firm as of June 2023.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Holabird", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Holabird", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "65da0822-a48a-4a68-bbad-8ed1b835a834", "prompt": "All of the individuals who formally held the position of United States secretary of homeland security prior to April 2019, excluding those who held the position in an acting capacity, have a bachelor's degree. Of the universities that these bachelor's degrees were from, which is the westernmost university and which is the easternmost university? Give them to me as a comma-separated list, I only want the name of the cities where the universities are located, with the westernmost city listed first.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Santa Clara, Boston", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Santa Clara, Boston", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "0bb3b44a-ede5-4db5-a520-4e844b0079c5", "prompt": "Consider the following symbols: \ud809\udc1c  \ud809\udc10\ud809\udc1a\n\nThis is a number written using the Mesopotamian/Babylonian number system and represented with Sumerian cuneiform. Convert this number into Arabic numerals as a decimal number.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 536", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "536", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "73c1b9fe-ee1d-4cf4-96ca-35c08f97b054", "prompt": "According to the USGS, in what year was the American Alligator first found west of Texas (not including Texas)?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 1954", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "1954", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "e2d69698-bc99-4e85-9880-67eaccd66e6c", "prompt": "As of August 2023, who is the only winner of the US version of Survivor to be born in the month of May?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Michele Fitzgerald", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Michele Fitzgerald", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "a56f1527-3abf-41d6-91f8-7296d6336c3f", "prompt": "The cover of the August 2021 issue of Vogue shows a famous landmark in the background behind some trees. How tall is this monument in yards, rounded to the nearest yard? Give the number only.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 185", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "185", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "42d4198c-5895-4f0a-b0c0-424a66465d83", "prompt": "I'm curious about how much information is available for popular video games before their release. Find the Wikipedia page for the 2019 game that won the British Academy Games Awards. How many revisions did that page have before the month listed as the game's release date on that Wikipedia page (as of the most recent entry from 2022)?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 60", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "60", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "edd4d4f2-1a58-45c4-b038-67337af4e029", "prompt": "The attached spreadsheet lists the locomotives owned by a local railroad museum. What is the typical American name for the type of locomotive this museum uses for the Murder Mystery Express?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Berkshire", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Berkshire", "gaia_level": 2, "gaia_file": "edd4d4f2-1a58-45c4-b038-67337af4e029.xlsx", "source": "gaia-benchmark"}}
+{"name": "a26649c6-1cb2-470a-871e-6910c64c3e53", "prompt": "What is the absolute difference in tens of thousands between the population of chinstrap penguins on the Wikipedia page for penguin species populations as of the end of 2018 and the population recorded in the Nature.com \"global population assessment of the Chinstrap penguin\" article from 2020, assuming two penguins per breeding pair?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 116", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "116", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "4d0aa727-86b1-406b-9b33-f870dd14a4a5", "prompt": "The attached file lists the locomotives owned by a local railroad museum. It gives each locomotive\u2019s identifying number, operating status, and the name of the daily excursion it heads, if operational. What are the odds that today\u2019s Sunset Picnic Trip will use a steam locomotive? Assume that each day\u2019s excursion picks one of its assigned locomotives at random, and express the answer in the form \u201c1 in 4\u201d, \u201c1 in 5\u201d, etc.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 1 in 3", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "1 in 3", "gaia_level": 2, "gaia_file": "4d0aa727-86b1-406b-9b33-f870dd14a4a5.xlsx", "source": "gaia-benchmark"}}
+{"name": "d5141ca5-e7a0-469f-bf3e-e773507c86e2", "prompt": "When was a picture of St. Thomas Aquinas first added to the Wikipedia page on the Principle of double effect? Answer using the format DD/MM/YYYY.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 19/02/2009", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "19/02/2009", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "1dcc160f-c187-48c2-b68e-319bd4354f3d", "prompt": "According to Openreview.net, at the NeurIPS 2022 Conference, how many papers by an author named Yuri were accepted with a \"certain\" recommendation?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 3", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "3", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "b2c257e0-3ad7-4f05-b8e3-d9da973be36e", "prompt": "If this whole pint is made up of ice cream, how many percent above or below the US federal standards for butterfat content is it when using the standards as reported by Wikipedia in 2020? Answer as + or - a number rounded to one decimal place.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: +4.6", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "+4.6", "gaia_level": 2, "gaia_file": "b2c257e0-3ad7-4f05-b8e3-d9da973be36e.jpg", "source": "gaia-benchmark"}}
+{"name": "e0c10771-d627-4fd7-9694-05348e54ee36", "prompt": "Take the gender split from the 2011 Bulgarian census about those who have completed tertiary education. Subtract the smaller number from the larger number, then return the difference in thousands of women. So if there were 30.1 thousand more men, you'd give \"30.1\"", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 234.9", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "234.9", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "e29834fd-413a-455c-a33e-c3915b07401c", "prompt": "I'd like to learn more about some popular reality television competition shows. As of the end of the 44th season of the American version of Survivor, how many more unique winners have there been compared to the number of winners of American Idol?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 21", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "21", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "08c0b6e9-1b43-4c2e-ae55-4e3fce2c2715", "prompt": "In the film Goldfinger, what color was the object that James Bond concealed himself and his companion Pussy Galore at the end of the film? If there are multiple colors, put them in a comma-separated list in alphabetical order.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: orange, white", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "orange, white", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "db4fd70a-2d37-40ea-873f-9433dc5e301f", "prompt": "As of May 2023, how many stops are between South Station and Windsor Gardens on MBTA\u2019s Franklin-Foxboro line (not included)?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 10", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "10", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "853c8244-429e-46ca-89f2-addf40dfb2bd", "prompt": "In the 2015 Metropolitan Museum of Art exhibition titled after the Chinese zodiac animal of 2015, how many of the \"twelve animals of the Chinese zodiac\" have a hand visible?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 11", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "11", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "7a4a336d-dcfa-45a0-b014-824c7619e8de", "prompt": "At the two-minute mark in the YouTube video uploaded by the channel \u201cGameGrumps\u201d on May 14, 2017 as part of their playthrough of the game Mario Kart 8 Deluxe, the shows\u2019 hosts are competing on one of the game\u2019s racetracks. What was the world record time for that track in the game\u2019s 150cc mode as of June 7, 2023? Express your answer in minutes and seconds, rounding the seconds to the nearest hundredth, e.g. 1:01.001.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 1:41.614", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "1:41.614", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "c61d22de-5f6c-4958-a7f6-5e9707bd3466", "prompt": "A paper about AI regulation that was originally submitted to arXiv.org in June 2022 shows a figure with three axes, where each axis has a label word at both ends. Which of these words is used to describe a type of society in a Physics and Society article submitted to arXiv.org on August 11, 2016?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: egalitarian", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "egalitarian", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "17b5a6a3-bc87-42e8-b0fb-6ab0781ef2cc", "prompt": "I\u2019m researching species that became invasive after people who kept them as pets released them. There\u2019s a certain species of fish that was popularized as a pet by being the main character of the movie Finding Nemo. According to the USGS, where was this fish found as a nonnative species, before the year 2020? I need the answer formatted as the five-digit zip codes of the places the species was found, separated by commas if there is more than one place.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 34689", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "34689", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "04a04a9b-226c-43fd-b319-d5e89743676f", "prompt": "If we assume all articles published by Nature in 2020 (articles, only, not book reviews/columns, etc) relied on statistical significance to justify their findings and they on average came to a p-value of 0.04, how many papers would be incorrect as to their claims of statistical significance? Round the value up to the next integer.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 41", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "41", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "14569e28-c88c-43e4-8c32-097d35b9a67d", "prompt": "In Unlambda, what exact charcter or text needs to be added to correct the following code to output \"For penguins\"? If what is needed is a character, answer with the name of the character. If there are different names for the character, use the shortest. The text location is not needed. Code:\n\n`r```````````.F.o.r. .p.e.n.g.u.i.n.si", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: backtick", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "backtick", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "32102e3e-d12a-4209-9163-7b3a104efe5d", "prompt": "The attached spreadsheet shows the inventory for a movie and video game rental store in Seattle, Washington. What is the title of the oldest Blu-Ray recorded in this spreadsheet? Return it as appearing in the spreadsheet.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Time-Parking 2: Parallel Universe", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Time-Parking 2: Parallel Universe", "gaia_level": 2, "gaia_file": "32102e3e-d12a-4209-9163-7b3a104efe5d.xlsx", "source": "gaia-benchmark"}}
+{"name": "3627a8be-a77f-41bb-b807-7e1bd4c0ebdf", "prompt": "The object in the British Museum's collection with a museum number of 2012,5015.17 is the shell of a particular mollusk species. According to the abstract of a research article published in Science Advances in 2021, beads made from the shells of this species were found that are at least how many thousands of years old?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 142", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "142", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "7619a514-5fa8-43ef-9143-83b66a43d7a4", "prompt": "According to github, when was Regression added to the oldest closed numpy.polynomial issue that has the Regression label in MM/DD/YY?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 04/15/18", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "04/15/18", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "7dd30055-0198-452e-8c25-f73dbe27dcb8", "prompt": "Using the Biopython library in Python, parse the PDB file of the protein identified by the PDB ID 5wb7 from the RCSB Protein Data Bank. Calculate the distance between the first and second atoms as they are listed in the PDB file. Report the answer in Angstroms, rounded to the nearest picometer.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 1.456", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "1.456", "gaia_level": 2, "gaia_file": "7dd30055-0198-452e-8c25-f73dbe27dcb8.pdb", "source": "gaia-benchmark"}}
+{"name": "2a649bb1-795f-4a01-b3be-9a01868dae73", "prompt": "What are the EC numbers of the two most commonly used chemicals for the virus testing method in the paper about SPFMV and SPCSV in the Pearl Of Africa from 2016? Return the semicolon-separated numbers in the order of the alphabetized chemicals.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 3.1.3.1; 1.11.1.7", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "3.1.3.1; 1.11.1.7", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "87c610df-bef7-4932-b950-1d83ef4e282b", "prompt": "In April of 1977, who was the Prime Minister of the first place mentioned by name in the Book of Esther (in the New International Version)?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Morarji Desai", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Morarji Desai", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "624cbf11-6a41-4692-af9c-36b3e5ca3130", "prompt": "What's the last line of the rhyme under the flavor name on the headstone visible in the background of the photo of the oldest flavor's headstone in the Ben & Jerry's online flavor graveyard as of the end of 2022?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: So we had to let it die.", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "So we had to let it die.", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "dd3c7503-f62a-4bd0-9f67-1b63b94194cc", "prompt": "Use density measures from the chemistry materials licensed by Marisa Alviar-Agnew & Henry Agnew under the CK-12 license in LibreText's Introductory Chemistry materials as compiled 08/21/2023.\n\nI have a gallon of honey and a gallon of mayonnaise at 25C. I remove one cup of honey at a time from the gallon of honey. How many times will I need to remove a cup to have the honey weigh less than the mayonaise? Assume the containers themselves weigh the same.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 6", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "6", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "df6561b2-7ee5-4540-baab-5095f742716a", "prompt": "When you take the average of the standard population deviation of the red numbers and the standard sample deviation of the green numbers in this image using the statistics module in Python 3.11, what is the result rounded to the nearest three decimal points?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 17.056", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "17.056", "gaia_level": 2, "gaia_file": "df6561b2-7ee5-4540-baab-5095f742716a.png", "source": "gaia-benchmark"}}
+{"name": "f0f46385-fc03-4599-b5d3-f56496c3e69f", "prompt": "In terms of geographical distance between capital cities, which 2 countries are the furthest from each other within the ASEAN bloc according to wikipedia? Answer using a comma separated list, ordering the countries by alphabetical order.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Indonesia, Myanmar", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Indonesia, Myanmar", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "e4e91f1c-1dcd-439e-9fdd-cb976f5293fd", "prompt": "I need to fact-check a citation. This is the citation from the bibliography:\n\nGreetham, David. \"Uncoupled: OR, How I Lost My Author(s).\" Textual Cultures: Texts, Contexts, Interpretation, vol. 3 no. 1, 2008, p. 45-46. Project MUSE, doi:10.2979/tex.2008.3.1.44.\n\nAnd this is the in-line citation:\n\nOur relationship with the authors of the works we read can often be \u201cobscured not by a \"cloak of print\" but by the veil of scribal confusion and mis-transmission\u201d (Greetham 45-46).\n\nDoes the quoted text match what is actually in the article? If Yes, answer Yes, otherwise, give me the word in my citation that does not match with the correct one (without any article).", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: cloak", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "cloak", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "56137764-b4e0-45b8-9c52-1866420c3df5", "prompt": "Which contributor to the version of OpenCV where support was added for the Mask-RCNN model has the same name as a former Chinese head of government when the names are transliterated to the Latin alphabet?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Li Peng", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Li Peng", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "8b3379c0-0981-4f5b-8407-6444610cb212", "prompt": "What is the maximum length in meters of #9 in the first National Geographic short on YouTube that was ever released according to the Monterey Bay Aquarium website? Just give the number.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 1.8", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "1.8", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "0ff53813-3367-4f43-bcbd-3fd725c1bf4b", "prompt": "What two-word type of model did Manash Pratim Kashyap's and PS Fader's studies in customer retention studies published during 2018-2019 have in common (no punctuation)?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: beta geometric", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "beta geometric", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "a7feb290-76bb-4cb7-8800-7edaf7954f2f", "prompt": "How many High Energy Physics - Lattice articles listed in January 2020 on Arxiv had ps versions available?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 31", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "31", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "b4cc024b-3f5e-480e-b96a-6656493255b5", "prompt": "The photograph in the Whitney Museum of American Art's collection with accession number 2022.128 shows a person holding a book. Which military unit did the author of this book join in 1813? Answer without using articles.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Russian-German Legion", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Russian-German Legion", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "33d8ea3b-6c6b-4ff1-803d-7e270dea8a57", "prompt": "What is the minimum number of page links a person must click on to go from the english Wikipedia page on The Lord of the Rings (the book) to the english Wikipedia page on A Song of Ice and Fire (the book series)? In your count, include each link you would click on to get to the page. Use the pages as they appeared at the end of the day on July 3, 2023.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 2", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "2", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "e8cb5b03-41e0-4086-99e5-f6806cd97211", "prompt": "I went to Virtue restaurant & bar in Chicago for my birthday on March 22, 2021 and the main course I had was delicious!  Unfortunately, when I went back about a month later on April 21, it was no longer on the dinner menu.  Using the Wayback Machine, can you help me figure out which main course was on the dinner menu for Virtue on March 22, 2021 but not April 21, 2021? Answer using the singular form, without articles.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: shrimp", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "shrimp", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "f46b4380-207e-4434-820b-f32ce04ae2a4", "prompt": "It is 1999. Before you party like it is 1999, please assist me in settling a bet.\n\nFiona Apple and Paula Cole released albums prior to 1999. Of these albums, which didn't receive a letter grade from Robert Christgau? Provide your answer as a comma delimited list of album titles, sorted alphabetically.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Harbinger, Tidal", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Harbinger, Tidal", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "05407167-39ec-4d3a-a234-73a9120c325d", "prompt": "In the 2018 VSCode blog post on replit.com, what was the command they clicked on in the last video to remove extra lines?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Format Document", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Format Document", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "b9763138-c053-4832-9f55-86200cb1f99c", "prompt": "Compute the check digit the Tropicos ID for the Order Helotiales would have if it were an ISBN-10 number.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 3", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "3", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "16d825ff-1623-4176-a5b5-42e0f5c2b0ac", "prompt": "What time was the Tri-Rail train that carried the most passengers on May 27, 2019 scheduled to arrive in Pompano Beach? Express your answer in the 12-hour digital clock format without leading zero if any, and include whether it is AM or PM.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 6:41 PM", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "6:41 PM", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "2b3ef98c-cc05-450b-a719-711aee40ac65", "prompt": "Could you help me out with this assignment? Our professor sprung it on us at the end of class Friday, and I'm still trying to figure it out. The question he asked us was about an anagram. I've attached an audio recording of the question that he asked, so if you could please take a listen and give me the answer, I'd really appreciate the help. Please limit your response to the anagram text that could be generated from the original line which fulfills the professor's request, without any other commentary. Also, please don't include any punctuation in your response.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: To be or not to be that is the question whether tis nobler in the mind to suffer the slings and arrows of outrageous fortune", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "To be or not to be that is the question whether tis nobler in the mind to suffer the slings and arrows of outrageous fortune", "gaia_level": 2, "gaia_file": "2b3ef98c-cc05-450b-a719-711aee40ac65.mp3", "source": "gaia-benchmark"}}
+{"name": "bfcd99e1-0690-4b53-a85c-0174a8629083", "prompt": "How many applicants for the job in the PDF are only missing a single qualification?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 17", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "17", "gaia_level": 2, "gaia_file": "bfcd99e1-0690-4b53-a85c-0174a8629083.zip", "source": "gaia-benchmark"}}
+{"name": "544b7f0c-173a-4377-8d56-57b36eb26ddf", "prompt": "In Valentina Re\u2019s contribution to the 2017 book \u201cWorld Building: Transmedia, Fans, Industries\u201d, what horror movie does the author cite as having popularized metalepsis between a dream world and reality? Use the complete name with article if any.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: A Nightmare on Elm Street", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "A Nightmare on Elm Street", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "6b078778-0b90-464d-83f6-59511c811b01", "prompt": "The Metropolitan Museum of Art has a portrait in its collection with an accession number of 29.100.5. Of the consecrators and co-consecrators of this portrait's subject as a bishop, what is the name of the one who never became pope?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Alfonso Visconti", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Alfonso Visconti", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "076c8171-9b3b-49b9-a477-244d2a532826", "prompt": "The attached file contains a list of vendors in the Liminal Springs mall, along with each vendor\u2019s monthly revenue and the rent they pay the mall. I want you to find the vendor that makes the least money, relative to the rent it pays. Then, tell me what is listed in the \u201ctype\u201d column for that vendor.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Finance", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Finance", "gaia_level": 2, "gaia_file": "076c8171-9b3b-49b9-a477-244d2a532826.xlsx", "source": "gaia-benchmark"}}
+{"name": "08cae58d-4084-4616-b6dd-dd6534e4825b", "prompt": "According to Google Finance, when was the first year the Apple stock went above $50 (without adjusting for stock split)?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 2018", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "2018", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "2dfc4c37-fec1-4518-84a7-10095d30ad75", "prompt": "According to Box Office Mojo's 2020 Worldwide Box Office list, how many of the top 10 highest-grossing worldwide movies are also on the top 10 highest-grossing domestic movies? Your answer should be a numerical integer value.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 6", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "6", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "9f41b083-683e-4dcf-9185-ccfeaa88fa45", "prompt": "How many pages if the 2023 IPCC report (85 pages version) mentions nuclear energy?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 0", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "0", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "ecbc4f94-95a3-4cc7-b255-6741a458a625", "prompt": "How many images are there in the latest 2022 Lego english wikipedia article?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 13", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "13", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "e9a2c537-8232-4c3f-85b0-b52de6bcba99", "prompt": "The attached file shows a list of books in the collection of Scribe County Public Library. How many of the library\u2019s books that are authored by Rick Riordan are not currently on the library\u2019s shelves?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 7", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "7", "gaia_level": 2, "gaia_file": "e9a2c537-8232-4c3f-85b0-b52de6bcba99.pdf", "source": "gaia-benchmark"}}
+{"name": "71345b0a-9c7d-4b50-b2bf-937ec5879845", "prompt": "On a leap day before the year 2008, a joke was removed from the Wikipedia page for \u201cDragon\u201d. What was the phrase that was removed? Give the phrase as it appeared on the page, but without punctuation.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Here be dragons", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Here be dragons", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "7b5377b0-3f38-4103-8ad2-90fe89864c04", "prompt": "Find the value of x to the nearest tenth: Lx = (d/dx * (A * x-squared)) + 4-thousand'n'ninety-7 minus C\nWhere L is the last two digits of the year of the Venezuelan Declaration of Independence,\nA is the number of colors in the TikTok logo as of July 2023, excluding black and white,\nand C is the height of the average woman in the Philippines according to a July 2023 Business Insider article, rounded to the nearest whole centimeter", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 563.9", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "563.9", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "114d5fd0-e2ae-4b6d-a65a-870da2d19c08", "prompt": "In the endnote found in the second-to-last paragraph of page 11 of the book with the doi 10.2307/j.ctv9b2xdv, what date in November was the Wikipedia article accessed? Just give the day of the month.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 4", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "4", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "8f80e01c-1296-4371-9486-bb3d68651a60", "prompt": "Using bass clef notes, what is the age of someone who has experienced the word spelled out in the sheet music by the note letters the total number of lines and notes minus the number of notes on lines in the image?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 90", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "90", "gaia_level": 2, "gaia_file": "8f80e01c-1296-4371-9486-bb3d68651a60.png", "source": "gaia-benchmark"}}
+{"name": "ad37a656-079a-49f9-a493-7b739c9167d1", "prompt": "On July 15, 2008, Phys.org published an article about a catastrophe. Find the explosive force of this catastrophe according to Encyclopedia Britannica, then find the name of the US nuclear test that had the same yield. Your answer should only be the last word of the name of the test.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Bravo", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Bravo", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "366e2f2b-8632-4ef2-81eb-bc3877489217", "prompt": "The attached file lists accommodations in the resort town of Seahorse Island. Based on the information in this file, which seems like the better available place to stay for a family that enjoys swimming and wants a full house?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Shelley's place", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Shelley's place", "gaia_level": 2, "gaia_file": "366e2f2b-8632-4ef2-81eb-bc3877489217.pdf", "source": "gaia-benchmark"}}
+{"name": "f3917a3d-1d17-4ee2-90c5-683b072218fe", "prompt": "How many edits were made to the Wikipedia page on Antidisestablishmentarianism from its inception until June of 2023?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 2732", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "2732", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "48eb8242-1099-4c26-95d4-ef22b002457a", "prompt": "How many nonindigenous crocodiles were found in Florida from the year 2000 through 2020? You can get the data from the USGS Nonindigenous Aquatic Species database.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 6", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "6", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "c8b7e059-c60d-472e-ad64-3b04ae1166dc", "prompt": "The work referenced in footnote 397 of Federico Lauria's 2014 dissertation is also the source for the titles of two paintings in the Smithsonian American Art Museum's collection, as of August 2023. What is the absolute difference between the chapter numbers of the chapters that the titles of these two paintings quote?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 8", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "8", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "d1af70ea-a9a4-421a-b9cc-94b5e02f1788", "prompt": "As of the 2020 census, what was the population difference between the largest county seat and smallest county seat, by land area of the county seat, in Washington state? For population figures, please use the official data from data.census.gov. Please report the integer difference.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 736455", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "736455", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "08f3a05f-5947-4089-a4c4-d4bcfaa6b7a0", "prompt": "Given $x_0 = -5$ and $f(x) = x^3 + 4x^2 - 3x + 8$, what is the smallest $n$ where using Newton's Method $n = n+1$ after rounding to four decimal places?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 2", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "2", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "54612da3-fd56-4941-80f4-5eb82330de25", "prompt": "The attached file shows the locomotives in the collection of a North American railroad museum. How many wheels do the listed steam locomotives have in total?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 60", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "60", "gaia_level": 2, "gaia_file": "54612da3-fd56-4941-80f4-5eb82330de25.xlsx", "source": "gaia-benchmark"}}
+{"name": "ded28325-3447-4c56-860f-e497d6fb3577", "prompt": "This is a secret message my friend gave me. It says where we should meet for our picnic on Friday. The only problem is, it\u2019s encrypted in the Caesar cipher, so I can\u2019t read it. Can you tell me what it says? This is the message:\n\nZsmxsm sc sx Zyvilsec Zvkjk.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Picnic is in Ploybius Plaza.", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Picnic is in Ploybius Plaza.", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "6359a0b1-8f7b-499b-9336-840f9ab90688", "prompt": "What is the area of the green polygon in the attached file? The numbers in purple represent the lengths of the side they are next to.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 39", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "39", "gaia_level": 2, "gaia_file": "6359a0b1-8f7b-499b-9336-840f9ab90688.png", "source": "gaia-benchmark"}}
+{"name": "7cc4acfa-63fd-4acc-a1a1-e8e529e0a97f", "prompt": "The attached spreadsheet contains the sales of menu items for a regional fast-food chain. Which city had the greater total sales: Wharvton or Algrimand?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Wharvton", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Wharvton", "gaia_level": 2, "gaia_file": "7cc4acfa-63fd-4acc-a1a1-e8e529e0a97f.xlsx", "source": "gaia-benchmark"}}
+{"name": "d700d50d-c707-4dca-90dc-4528cddd0c80", "prompt": "Who composed the song that was performed by a rooster and a hamster in separate animated videos at separate tempos with different lyrics? Answer using the format First name Last name.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Roger Miller", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Roger Miller", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "0a3cd321-3e76-4622-911b-0fda2e5d6b1a", "prompt": "According to the World Bank, which countries had gross savings of over 35% of GDP for every year in the period 2001-2010? Give your answer as a comma-separated list of countries in alphabetical order. Use the countries most common names in english when answering.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Brunei, China, Morocco, Singapore", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Brunei, China, Morocco, Singapore", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "f2feb6a4-363c-4c09-a804-0db564eafd68", "prompt": "I\u2019m thinking about selling my home, so I want to learn more about how homes in my area sold recently. I live in Pearl City, Hawaii, which is on the island of Oahu. I know two homes near me that sold in 2022 were 2072 Akaikai Loop, and 2017 Komo Mai Drive. Find which of those homes sold for more in 2022, and tell me how much it sold for. Don\u2019t put commas or decimal places in the answer.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 900000", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "900000", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "0b260a57-3f3a-4405-9f29-6d7a1012dbfb", "prompt": "On ScienceDirect, what is the difference to 3 decimal places in the sample standard deviations of the number of Reference Works in each Life Science domain compared to Health Sciences as of 2022?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 0.269", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "0.269", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "ed58682d-bc52-4baa-9eb0-4eb81e1edacc", "prompt": "What is the last word before the second chorus of the King of Pop's fifth single from his sixth studio album?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: stare", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "stare", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "cca70ce6-1952-45d2-acd4-80c903b0bc49", "prompt": "Look at the attached image. The quiz is scored as follows:\n\nProblems that ask the student to add or subtract fractions: 5 points\nProblems that ask the student to multiply or divide fractions: 10 points\nProblems that ask the student to form an improper fraction: 15 points\nProblems that ask the student to form a mixed number: 20 points\n\nDue to a technical issue that delayed having students take the quiz, the teacher is giving everyone 5 bonus points.\n\nIf you graded the quiz in the attached image, how many points would the student have earned? There is no partial credit.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 85", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "85", "gaia_level": 2, "gaia_file": "cca70ce6-1952-45d2-acd4-80c903b0bc49.png", "source": "gaia-benchmark"}}
+{"name": "b7f857e4-d8aa-4387-af2a-0e844df5b9d8", "prompt": "The attached image contains a Python script. Run the Python code against an array of strings, listed below. The output of the Python script will be a URL containing C++ source code. Compile and run this C++ code against the array [35, 12, 8, 99, 21, 5] and return the sum of the third and fifth integers in the sorted list.\n\narr = ['_alg', 'ghi', 'C++', 'jkl', 'tps', '/Q', 'pqr', 'stu', ':', '//', 'rose', 'vwx', 'yz1', '234', 'tta', '567', '890', 'cod', 'e.', 'or', 'g/', 'wiki', '/', 'ing', 'sort', 'abc' , 'or', 'it', 'hms', 'mno' , 'uic', 'ksort', '#', 'ht' ]", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 47", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "47", "gaia_level": 2, "gaia_file": "b7f857e4-d8aa-4387-af2a-0e844df5b9d8.png", "source": "gaia-benchmark"}}
+{"name": "d8152ad6-e4d5-4c12-8bb7-8d57dc10c6de", "prompt": "I have the Standard plan in the image below, and I just uploaded 60 equally sized files and got a message that I'm 100GB over the limit. I have 980 more files of the same size to upload. What is the average additional cost per file in dollar that goes over my current plan limit rounded to the nearest cent if I have to upgrade to the minimum possible plan to store them all? Answer with the following format: x.xx", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 0.03", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "0.03", "gaia_level": 2, "gaia_file": "d8152ad6-e4d5-4c12-8bb7-8d57dc10c6de.png", "source": "gaia-benchmark"}}
+{"name": "67e8878b-5cef-4375-804e-e6291fdbe78a", "prompt": "The attached PDF lists accommodations in the resort community of Seahorse Island. Which type of accommodation has a higher average rating in Seahorse Island?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Hotels", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Hotels", "gaia_level": 2, "gaia_file": "67e8878b-5cef-4375-804e-e6291fdbe78a.pdf", "source": "gaia-benchmark"}}
+{"name": "023e9d44-96ae-4eed-b912-244ee8c3b994", "prompt": "It's May 2023, and I'm about to drive across the U.S. from California to Maine. I always recycle my water bottles at the end of a trip, and I drink 5 12-ounce water bottles for every 100 miles I travel, rounded to the nearest 100. Assuming I follow I-40 from Los Angeles to Cincinnati, then take I-90 from Cincinnati to Augusta, how many dollars will I get back according to Wikipedia?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 8", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "8", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "0e9e85b8-52b9-4de4-b402-5f635ab9631f", "prompt": "What is the latest chronological year date written in the image on the webpage found when following the first citation reference link on the latest version of Carl Nebel's Wikipedia page as of August 2023?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 1927", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "1927", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "20194330-9976-4043-8632-f8485c6c71b2", "prompt": "The YouTube channel Game Grumps began a Let\u2019s Play of the game Sonic the Hedgehog (2006) in the year 2012. Thirty seconds into the first episode, a phrase is shown on the screen in white letters on a red background. How many times does the letter \"E\" appear in this phrase?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 4", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "4", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "4d51c4bf-4b0e-4f3d-897b-3f6687a7d9f2", "prompt": "This spreadsheet contains a list of clients for a retractable awning company. Each client has ordered a new awning for the back of their house within the last 90 days. The company makes different designs depending on whether the awning is made to block sunrises or sunsets. In this region, houses with odd-numbered street addresses face east, and houses with even-numbered street addresses face west. How many of these clients will be receiving the sunset awning design?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 8", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "8", "gaia_level": 2, "gaia_file": "4d51c4bf-4b0e-4f3d-897b-3f6687a7d9f2.xlsx", "source": "gaia-benchmark"}}
+{"name": "65638e28-7f37-4fa7-b7b9-8c19bb609879", "prompt": "The book with the doi 10.1353/book.24372 concerns a certain neurologist. According to chapter 2 of the book, what author influenced this neurologist\u2019s belief in \u201cendopsychic myths\u201d? Give the last name only.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Kleinpaul", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Kleinpaul", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "3ff6b7a9-a5bd-4412-ad92-0cd0d45c0fee", "prompt": "The longest-lived vertebrate is named after an island.  According to Wikipedia as of January 1, 2021, what is the 2020 estimated population of that island, to the nearest thousand?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 56000", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "56000", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "708b99c5-e4a7-49cb-a5cf-933c8d46470d", "prompt": "On the DeepFruits fruit detection graph on Connected Papers from 2016, what feature caused the largest bubble to be the size it is?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Citations", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Citations", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "0a65cb96-cb6e-4a6a-8aae-c1084f613456", "prompt": "During the first week of August 2015, one of the NASA Astronomy Pictures of the Day shows the lights of a city on the horizon. The namesake of this city also has a landmark building in Chicago named after him. What is the name of the architectural firm that designed this landmark building? Give the first name appearing in the name of the firm as of June 2023.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Holabird", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Holabird", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "65da0822-a48a-4a68-bbad-8ed1b835a834", "prompt": "All of the individuals who formally held the position of United States secretary of homeland security prior to April 2019, excluding those who held the position in an acting capacity, have a bachelor's degree. Of the universities that these bachelor's degrees were from, which is the westernmost university and which is the easternmost university? Give them to me as a comma-separated list, I only want the name of the cities where the universities are located, with the westernmost city listed first.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Santa Clara, Boston", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Santa Clara, Boston", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "0bb3b44a-ede5-4db5-a520-4e844b0079c5", "prompt": "Consider the following symbols: \ud809\udc1c  \ud809\udc10\ud809\udc1a\n\nThis is a number written using the Mesopotamian/Babylonian number system and represented with Sumerian cuneiform. Convert this number into Arabic numerals as a decimal number.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 536", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "536", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "73c1b9fe-ee1d-4cf4-96ca-35c08f97b054", "prompt": "According to the USGS, in what year was the American Alligator first found west of Texas (not including Texas)?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 1954", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "1954", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "e2d69698-bc99-4e85-9880-67eaccd66e6c", "prompt": "As of August 2023, who is the only winner of the US version of Survivor to be born in the month of May?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Michele Fitzgerald", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Michele Fitzgerald", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "a56f1527-3abf-41d6-91f8-7296d6336c3f", "prompt": "The cover of the August 2021 issue of Vogue shows a famous landmark in the background behind some trees. How tall is this monument in yards, rounded to the nearest yard? Give the number only.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 185", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "185", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "42d4198c-5895-4f0a-b0c0-424a66465d83", "prompt": "I'm curious about how much information is available for popular video games before their release. Find the Wikipedia page for the 2019 game that won the British Academy Games Awards. How many revisions did that page have before the month listed as the game's release date on that Wikipedia page (as of the most recent entry from 2022)?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 60", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "60", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "edd4d4f2-1a58-45c4-b038-67337af4e029", "prompt": "The attached spreadsheet lists the locomotives owned by a local railroad museum. What is the typical American name for the type of locomotive this museum uses for the Murder Mystery Express?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Berkshire", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "Berkshire", "gaia_level": 2, "gaia_file": "edd4d4f2-1a58-45c4-b038-67337af4e029.xlsx", "source": "gaia-benchmark"}}
+{"name": "a26649c6-1cb2-470a-871e-6910c64c3e53", "prompt": "What is the absolute difference in tens of thousands between the population of chinstrap penguins on the Wikipedia page for penguin species populations as of the end of 2018 and the population recorded in the Nature.com \"global population assessment of the Chinstrap penguin\" article from 2020, assuming two penguins per breeding pair?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 116", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "116", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "4d0aa727-86b1-406b-9b33-f870dd14a4a5", "prompt": "The attached file lists the locomotives owned by a local railroad museum. It gives each locomotive\u2019s identifying number, operating status, and the name of the daily excursion it heads, if operational. What are the odds that today\u2019s Sunset Picnic Trip will use a steam locomotive? Assume that each day\u2019s excursion picks one of its assigned locomotives at random, and express the answer in the form \u201c1 in 4\u201d, \u201c1 in 5\u201d, etc.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 1 in 3", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "1 in 3", "gaia_level": 2, "gaia_file": "4d0aa727-86b1-406b-9b33-f870dd14a4a5.xlsx", "source": "gaia-benchmark"}}
+{"name": "d5141ca5-e7a0-469f-bf3e-e773507c86e2", "prompt": "When was a picture of St. Thomas Aquinas first added to the Wikipedia page on the Principle of double effect? Answer using the format DD/MM/YYYY.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 19/02/2009", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "19/02/2009", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "1dcc160f-c187-48c2-b68e-319bd4354f3d", "prompt": "According to Openreview.net, at the NeurIPS 2022 Conference, how many papers by an author named Yuri were accepted with a \"certain\" recommendation?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 3", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "3", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "b2c257e0-3ad7-4f05-b8e3-d9da973be36e", "prompt": "If this whole pint is made up of ice cream, how many percent above or below the US federal standards for butterfat content is it when using the standards as reported by Wikipedia in 2020? Answer as + or - a number rounded to one decimal place.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: +4.6", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "+4.6", "gaia_level": 2, "gaia_file": "b2c257e0-3ad7-4f05-b8e3-d9da973be36e.jpg", "source": "gaia-benchmark"}}
+{"name": "e0c10771-d627-4fd7-9694-05348e54ee36", "prompt": "Take the gender split from the 2011 Bulgarian census about those who have completed tertiary education. Subtract the smaller number from the larger number, then return the difference in thousands of women. So if there were 30.1 thousand more men, you'd give \"30.1\"", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 234.9", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "234.9", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "e29834fd-413a-455c-a33e-c3915b07401c", "prompt": "I'd like to learn more about some popular reality television competition shows. As of the end of the 44th season of the American version of Survivor, how many more unique winners have there been compared to the number of winners of American Idol?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 21", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "21", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "08c0b6e9-1b43-4c2e-ae55-4e3fce2c2715", "prompt": "In the film Goldfinger, what color was the object that James Bond concealed himself and his companion Pussy Galore at the end of the film? If there are multiple colors, put them in a comma-separated list in alphabetical order.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: orange, white", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "orange, white", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "db4fd70a-2d37-40ea-873f-9433dc5e301f", "prompt": "As of May 2023, how many stops are between South Station and Windsor Gardens on MBTA\u2019s Franklin-Foxboro line (not included)?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 10", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "10", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "853c8244-429e-46ca-89f2-addf40dfb2bd", "prompt": "In the 2015 Metropolitan Museum of Art exhibition titled after the Chinese zodiac animal of 2015, how many of the \"twelve animals of the Chinese zodiac\" have a hand visible?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 11", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "11", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "7a4a336d-dcfa-45a0-b014-824c7619e8de", "prompt": "At the two-minute mark in the YouTube video uploaded by the channel \u201cGameGrumps\u201d on May 14, 2017 as part of their playthrough of the game Mario Kart 8 Deluxe, the shows\u2019 hosts are competing on one of the game\u2019s racetracks. What was the world record time for that track in the game\u2019s 150cc mode as of June 7, 2023? Express your answer in minutes and seconds, rounding the seconds to the nearest hundredth, e.g. 1:01.001.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 1:41.614", "weight": 1.0}], "category": "level_2", "metadata": {"gaia_answer": "1:41.614", "gaia_level": 2, "gaia_file": null, "source": "gaia-benchmark"}}

src/flow/experiments/data/tasks/gaia_level3.jsonl ADDED

	@@ -0,0 +1,52 @@

+{"name": "676e5e31-a554-4acc-9286-b60d90a92d26", "prompt": "In July 2, 1959 United States standards for grades of processed fruits, vegetables, and certain other products listed as dehydrated, consider the items in the \"dried and dehydrated section\" specifically marked as dehydrated along with any items in the Frozen/Chilled section that contain the whole name of the item, but not if they're marked Chilled. As of August 2023, what is the percentage (to the nearest percent) of those standards that have been superseded by a new version since the date given in the 1959 standards?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 86", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "86", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "bec74516-02fc-48dc-b202-55e78d0e17cf", "prompt": "What is the average number of pre-2020 works on the open researcher and contributor identification pages of the people whose identification is in this file?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 26.4", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "26.4", "gaia_level": 3, "gaia_file": "bec74516-02fc-48dc-b202-55e78d0e17cf.jsonld", "source": "gaia-benchmark"}}
+{"name": "00d579ea-0889-4fd9-a771-2c8d79835c8d", "prompt": "Assuming scientists in the famous youtube video The Thinking Machine (Artificial Intelligence in the 1960s) were interviewed the same year, what is the name of the scientist predicting the sooner thinking machines or robots? Answer using the format First name Last name", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Claude Shannon", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "Claude Shannon", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "384d0dd8-e8a4-4cfe-963c-d37f256e7662", "prompt": "In the NCATS PubChem compound database for Food Additive Status classification, find the compound that has a molecular weight of 100 g/mol or less, 6 heavy atoms, 1 or fewer hydrogen bond acceptors, and a complexity between 10 and 15. Of the shared gene-chemical co-occurrences between its two possible enzyme transformations, what is the PubChem CID of the heaviest by molecular weight?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 4192", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "4192", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "de9887f5-ead8-4727-876f-5a4078f8598c", "prompt": "What integer-rounded percentage of the total length of the harlequin shrimp recorded in Omar Valencfia-Mendez 2017 paper was the sea star fed to the same type of shrimp in G. Curt Fiedler's 2002 paper?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 22", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "22", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "983bba7c-c092-455f-b6c9-7857003d48fc", "prompt": "What animals that were mentioned in both Ilias Lagkouvardos's and Olga Tapia's papers on the alvei species of the genus named for Copenhagen outside the bibliographies were also present in the 2021 article cited on the alvei species' Wikipedia page about a multicenter, randomized, double-blind study?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: mice", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "mice", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "9b54f9d9-35ee-4a14-b62f-d130ea00317f", "prompt": "Which of the text elements under CATEGORIES in the XML would contain the one food in the spreadsheet that does not appear a second time under a different name?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Soups and Stews", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "Soups and Stews", "gaia_level": 3, "gaia_file": "9b54f9d9-35ee-4a14-b62f-d130ea00317f.zip", "source": "gaia-benchmark"}}
+{"name": "56db2318-640f-477a-a82f-bc93ad13e882", "prompt": "The following numbers function similarly to ISBN 13 numbers, however, their validation methods are slightly different. Rather than using alternate weights of 1 and 3, the checksum digit is calculated with an alternate weight of 1 and some other positive integer less than 10. Otherwise, the checksum digit is calculated as expected. Unfortunately, there is an error in the data. Two adjacent columns have been transposed. These errored columns do not involve the final column or one of the first three columns. Using this information, please provide all potential solutions with the unknown weight and the smaller index of the two errored columns (assume we start our indexing at 0 and ignore hyphens). Give your answer in the form x, y where x is the weight and y is the smaller index of the two transposed columns.\n\n978-354181391-9\n978-946669746-1\n978-398036139-6\n978-447656680-4\n978-279586664-7\n978-595073693-3\n978-976647652-6\n978-591178125-5\n978-728465924-5\n978-414825155-9", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 7, 9", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "7, 9", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "8131e2c0-0083-4265-9ce7-78c2d568425d", "prompt": "I was trying to remember how well the Cheater Beater performed in comparison to the Cheater when James tested it on his channel. I know that the Cheater still outperformed the Cheater Beater in terms of CFM. Could you please look that up for me, and report the CFM of both the Cheater and the Cheater Beater? I'm not sure if he made any changes to his testing, but this was back in season 4, so just report the value from that season. Please format your response like this: CFM number for Cheater, CFM number for Cheater beater", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 101.376, 84.348", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "101.376, 84.348", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "72c06643-a2fa-4186-aa5c-9ec33ae9b445", "prompt": "What is the volume in milliliters of a system comprised of 0.312 kg Freon-12 refrigerant when placed at the bottom of the Marianas Trench and allowed to stabilize at the Trench's peak temperature, rounded to the nearest mL? Provide your answer as just an integer value.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 55", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "55", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "ebbc1f13-d24d-40df-9068-adcf735b4240", "prompt": "The Latin root of the Yola word \"gimlie\" shares a spelling with a Spanish word. What is the Google translation of the source title for the 1994 example sentence for that word in the Collins Spanish-to-English dictionary online? Answer in plain text, without punctuation.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: The World of the Twenty First Century", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "The World of the Twenty First Century", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "c526d8d6-5987-4da9-b24c-83466fa172f3", "prompt": "In the NIH translation of the original 1913 Michaelis-Menten Paper, what is the velocity of a reaction to four decimal places using the final equation in the paper based on the information for Reaction 7 in the Excel file?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 0.0424", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "0.0424", "gaia_level": 3, "gaia_file": "c526d8d6-5987-4da9-b24c-83466fa172f3.xlsx", "source": "gaia-benchmark"}}
+{"name": "3da89939-209c-4086-8520-7eb734e6b4ef", "prompt": "I was referencing each of the tables in the file from papers that were cited by the \"Trans fatty acid contents in chocolates and chocolate wafers in Turkey\" paper. I lost my own reference sheet and need to know which of the papers each table came from. The file may not use the full table caption. If the references in the\"Trans fatty acid\" paper bibliography were numbered starting with 1, give me the numbers in the order that they would be used to fill the cells in the Excel file from top to bottom, as a comma separated list.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 8, 29, 22, 1, 8, 26", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "8, 29, 22, 1, 8, 26", "gaia_level": 3, "gaia_file": "3da89939-209c-4086-8520-7eb734e6b4ef.xlsx", "source": "gaia-benchmark"}}
+{"name": "8d46b8d6-b38a-47ff-ac74-cda14cf2d19b", "prompt": "What percentage of the total penguin population according to the upper estimates on english Wikipedia at the end of 2012 is made up by the penguins in this file that don't live on Dream Island or have beaks longer than 42mm? Round to the nearest five decimal places.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 0.00033", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "0.00033", "gaia_level": 3, "gaia_file": "8d46b8d6-b38a-47ff-ac74-cda14cf2d19b.csv", "source": "gaia-benchmark"}}
+{"name": "e961a717-6b25-4175-8a68-874d28190ee4", "prompt": "According to wikipedia, how many Asian countries still have a monarchy and access to the sea in 2021?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 12", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "12", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "851e570a-e3de-4d84-bcfa-cc85578baa59", "prompt": "I thought we could try a fun word puzzle together :)\n\nI've got a Boggle board here:\n\nABRL\nEITE\nIONS\nFPEI\n\nI'd like to know the longest word that can be generated from the board. Please find the longest English language word that can be generated from this board. If more than one word of the same length exists at the maximum word length, please report the longest word that comes first, alphabetically. Oh, and I know that there might be different wordlists available for Boggle, so let's please just use the words_alpha dictionary found at https://github.com/dwyl/english-words as the dictionary for our game.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Briniest", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "Briniest", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "50f58759-7bd6-406f-9b0d-5692beb2a926", "prompt": "How many times was a Twitter/X post cited as a reference on the english Wikipedia pages for each day of August in the last June 2023 versions of the pages?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 3", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "3", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "872bfbb1-9ccf-49f6-8c5f-aa22818ccd66", "prompt": "Which of the fruits shown in the 2008 painting \"Embroidery from Uzbekistan\" were served as part of the October 1949 breakfast menu for the ocean liner that was later used as a floating prop for the film \"The Last Voyage\"? Give the items as a comma-separated list, ordering them in clockwise order based on their arrangement in the painting starting from the 12 o'clock position. Use the plural form of each fruit.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: pears, bananas", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "pears, bananas", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "c3a79cfe-8206-451f-aca8-3fec8ebe51d3", "prompt": "The year is 2022. I am at the National Air and Space Museum east of the Potomac River. I want to go to Fire Station 301 DCA ARFF using the metro. I go in the wrong direction and end up at the station closest to Cleveland Elementary School. How many metro stations am I away from my original destination if I don't change lines? Your answer should be a numerical integer value.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 8", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "8", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "da52d699-e8d2-4dc5-9191-a2199e0b6a9b", "prompt": "The attached spreadsheet contains a list of books I read in the year 2022. What is the title of the book that I read the slowest, using the rate of words per day?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Out of the Silent Planet", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "Out of the Silent Planet", "gaia_level": 3, "gaia_file": "da52d699-e8d2-4dc5-9191-a2199e0b6a9b.xlsx", "source": "gaia-benchmark"}}
+{"name": "ad2b4d70-9314-4fe6-bfbe-894a45f6055f", "prompt": "Eva Draconis has a personal website which can be accessed on her YouTube page. What is the meaning of the only symbol seen in the top banner that has a curved line that isn't a circle or a portion of a circle? Answer without punctuation.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: War is not here this is a land of peace", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "War is not here this is a land of peace", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "5b2a14e8-6e59-479c-80e3-4696e8980152", "prompt": "The brand that makes these harnesses the dogs are wearing in the attached pic shares stories from their ambassadors on their website. What meat is mentioned in the story added Dec 8th 2022?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: bacon", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "bacon", "gaia_level": 3, "gaia_file": "5b2a14e8-6e59-479c-80e3-4696e8980152.jpg", "source": "gaia-benchmark"}}
+{"name": "9e1fc53b-46ff-49a1-9d05-9e6faac34cc5", "prompt": "A 5-man group made up of one tank, one healer, and three DPS is doing a dungeon that was just released in World of Warcraft. Two are plate wearers and two are cloth wearers. At the final boss, both the tank and the healer are casting holy spells. Ice and fire are being used, each one by a different DPS. A bear from the group is attacking the boss. Metamorphosis is cast. The Kilt of the Forgotten One drops as loot, but no one can use it. If all classes were using their class abilities and all classes are unique, what are the five classes in the group in alphabetical order separated by commas?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Death Knight, Hunter, Paladin, Priest, Warlock", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "Death Knight, Hunter, Paladin, Priest, Warlock", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "5f982798-16b9-4051-ab57-cfc7ebdb2a91", "prompt": "I read a paper about multiwavelength observations of fast radio bursts back in March 2021 on Arxiv, and it had a fascinating diagram of an X-ray time profile. There was a similar burst-1 diagram in another paper from one of the same authors about fast radio bursts back in July 2020, but I can't recall what the difference in seconds in the measured time span was. How many more seconds did one measure than the other? Just give the number.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 0.2", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "0.2", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "0512426f-4d28-49f0-be77-06d05daec096", "prompt": "In the YouTube 360 VR video from March 2018 narrated by the voice actor of Lord of the Rings' Gollum, what number was mentioned by the narrator directly after dinosaurs were first shown in the video?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 100000000", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "100000000", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "0bdb7c40-671d-4ad1-9ce3-986b159c0ddc", "prompt": "In NASA's Astronomy Picture of the Day on 2006 January 21, two astronauts are visible, with one appearing much smaller than the other. As of August 2023, out of the astronauts in the NASA Astronaut Group that the smaller astronaut was a member of, which one spent the least time in space, and how many minutes did he spend in space, rounded to the nearest minute? Exclude any astronauts who did not spend any time in space. Give the last name of the astronaut, separated from the number of minutes by a semicolon.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: White; 5876", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "White; 5876", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "676e5e31-a554-4acc-9286-b60d90a92d26", "prompt": "In July 2, 1959 United States standards for grades of processed fruits, vegetables, and certain other products listed as dehydrated, consider the items in the \"dried and dehydrated section\" specifically marked as dehydrated along with any items in the Frozen/Chilled section that contain the whole name of the item, but not if they're marked Chilled. As of August 2023, what is the percentage (to the nearest percent) of those standards that have been superseded by a new version since the date given in the 1959 standards?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 86", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "86", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "bec74516-02fc-48dc-b202-55e78d0e17cf", "prompt": "What is the average number of pre-2020 works on the open researcher and contributor identification pages of the people whose identification is in this file?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 26.4", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "26.4", "gaia_level": 3, "gaia_file": "bec74516-02fc-48dc-b202-55e78d0e17cf.jsonld", "source": "gaia-benchmark"}}
+{"name": "00d579ea-0889-4fd9-a771-2c8d79835c8d", "prompt": "Assuming scientists in the famous youtube video The Thinking Machine (Artificial Intelligence in the 1960s) were interviewed the same year, what is the name of the scientist predicting the sooner thinking machines or robots? Answer using the format First name Last name", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Claude Shannon", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "Claude Shannon", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "384d0dd8-e8a4-4cfe-963c-d37f256e7662", "prompt": "In the NCATS PubChem compound database for Food Additive Status classification, find the compound that has a molecular weight of 100 g/mol or less, 6 heavy atoms, 1 or fewer hydrogen bond acceptors, and a complexity between 10 and 15. Of the shared gene-chemical co-occurrences between its two possible enzyme transformations, what is the PubChem CID of the heaviest by molecular weight?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 4192", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "4192", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "de9887f5-ead8-4727-876f-5a4078f8598c", "prompt": "What integer-rounded percentage of the total length of the harlequin shrimp recorded in Omar Valencfia-Mendez 2017 paper was the sea star fed to the same type of shrimp in G. Curt Fiedler's 2002 paper?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 22", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "22", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "983bba7c-c092-455f-b6c9-7857003d48fc", "prompt": "What animals that were mentioned in both Ilias Lagkouvardos's and Olga Tapia's papers on the alvei species of the genus named for Copenhagen outside the bibliographies were also present in the 2021 article cited on the alvei species' Wikipedia page about a multicenter, randomized, double-blind study?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: mice", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "mice", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "9b54f9d9-35ee-4a14-b62f-d130ea00317f", "prompt": "Which of the text elements under CATEGORIES in the XML would contain the one food in the spreadsheet that does not appear a second time under a different name?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Soups and Stews", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "Soups and Stews", "gaia_level": 3, "gaia_file": "9b54f9d9-35ee-4a14-b62f-d130ea00317f.zip", "source": "gaia-benchmark"}}
+{"name": "56db2318-640f-477a-a82f-bc93ad13e882", "prompt": "The following numbers function similarly to ISBN 13 numbers, however, their validation methods are slightly different. Rather than using alternate weights of 1 and 3, the checksum digit is calculated with an alternate weight of 1 and some other positive integer less than 10. Otherwise, the checksum digit is calculated as expected. Unfortunately, there is an error in the data. Two adjacent columns have been transposed. These errored columns do not involve the final column or one of the first three columns. Using this information, please provide all potential solutions with the unknown weight and the smaller index of the two errored columns (assume we start our indexing at 0 and ignore hyphens). Give your answer in the form x, y where x is the weight and y is the smaller index of the two transposed columns.\n\n978-354181391-9\n978-946669746-1\n978-398036139-6\n978-447656680-4\n978-279586664-7\n978-595073693-3\n978-976647652-6\n978-591178125-5\n978-728465924-5\n978-414825155-9", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 7, 9", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "7, 9", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "8131e2c0-0083-4265-9ce7-78c2d568425d", "prompt": "I was trying to remember how well the Cheater Beater performed in comparison to the Cheater when James tested it on his channel. I know that the Cheater still outperformed the Cheater Beater in terms of CFM. Could you please look that up for me, and report the CFM of both the Cheater and the Cheater Beater? I'm not sure if he made any changes to his testing, but this was back in season 4, so just report the value from that season. Please format your response like this: CFM number for Cheater, CFM number for Cheater beater", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 101.376, 84.348", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "101.376, 84.348", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "72c06643-a2fa-4186-aa5c-9ec33ae9b445", "prompt": "What is the volume in milliliters of a system comprised of 0.312 kg Freon-12 refrigerant when placed at the bottom of the Marianas Trench and allowed to stabilize at the Trench's peak temperature, rounded to the nearest mL? Provide your answer as just an integer value.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 55", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "55", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "ebbc1f13-d24d-40df-9068-adcf735b4240", "prompt": "The Latin root of the Yola word \"gimlie\" shares a spelling with a Spanish word. What is the Google translation of the source title for the 1994 example sentence for that word in the Collins Spanish-to-English dictionary online? Answer in plain text, without punctuation.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: The World of the Twenty First Century", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "The World of the Twenty First Century", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "c526d8d6-5987-4da9-b24c-83466fa172f3", "prompt": "In the NIH translation of the original 1913 Michaelis-Menten Paper, what is the velocity of a reaction to four decimal places using the final equation in the paper based on the information for Reaction 7 in the Excel file?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 0.0424", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "0.0424", "gaia_level": 3, "gaia_file": "c526d8d6-5987-4da9-b24c-83466fa172f3.xlsx", "source": "gaia-benchmark"}}
+{"name": "3da89939-209c-4086-8520-7eb734e6b4ef", "prompt": "I was referencing each of the tables in the file from papers that were cited by the \"Trans fatty acid contents in chocolates and chocolate wafers in Turkey\" paper. I lost my own reference sheet and need to know which of the papers each table came from. The file may not use the full table caption. If the references in the\"Trans fatty acid\" paper bibliography were numbered starting with 1, give me the numbers in the order that they would be used to fill the cells in the Excel file from top to bottom, as a comma separated list.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 8, 29, 22, 1, 8, 26", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "8, 29, 22, 1, 8, 26", "gaia_level": 3, "gaia_file": "3da89939-209c-4086-8520-7eb734e6b4ef.xlsx", "source": "gaia-benchmark"}}
+{"name": "8d46b8d6-b38a-47ff-ac74-cda14cf2d19b", "prompt": "What percentage of the total penguin population according to the upper estimates on english Wikipedia at the end of 2012 is made up by the penguins in this file that don't live on Dream Island or have beaks longer than 42mm? Round to the nearest five decimal places.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 0.00033", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "0.00033", "gaia_level": 3, "gaia_file": "8d46b8d6-b38a-47ff-ac74-cda14cf2d19b.csv", "source": "gaia-benchmark"}}
+{"name": "e961a717-6b25-4175-8a68-874d28190ee4", "prompt": "According to wikipedia, how many Asian countries still have a monarchy and access to the sea in 2021?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 12", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "12", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "851e570a-e3de-4d84-bcfa-cc85578baa59", "prompt": "I thought we could try a fun word puzzle together :)\n\nI've got a Boggle board here:\n\nABRL\nEITE\nIONS\nFPEI\n\nI'd like to know the longest word that can be generated from the board. Please find the longest English language word that can be generated from this board. If more than one word of the same length exists at the maximum word length, please report the longest word that comes first, alphabetically. Oh, and I know that there might be different wordlists available for Boggle, so let's please just use the words_alpha dictionary found at https://github.com/dwyl/english-words as the dictionary for our game.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Briniest", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "Briniest", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "50f58759-7bd6-406f-9b0d-5692beb2a926", "prompt": "How many times was a Twitter/X post cited as a reference on the english Wikipedia pages for each day of August in the last June 2023 versions of the pages?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 3", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "3", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "872bfbb1-9ccf-49f6-8c5f-aa22818ccd66", "prompt": "Which of the fruits shown in the 2008 painting \"Embroidery from Uzbekistan\" were served as part of the October 1949 breakfast menu for the ocean liner that was later used as a floating prop for the film \"The Last Voyage\"? Give the items as a comma-separated list, ordering them in clockwise order based on their arrangement in the painting starting from the 12 o'clock position. Use the plural form of each fruit.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: pears, bananas", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "pears, bananas", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "c3a79cfe-8206-451f-aca8-3fec8ebe51d3", "prompt": "The year is 2022. I am at the National Air and Space Museum east of the Potomac River. I want to go to Fire Station 301 DCA ARFF using the metro. I go in the wrong direction and end up at the station closest to Cleveland Elementary School. How many metro stations am I away from my original destination if I don't change lines? Your answer should be a numerical integer value.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 8", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "8", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "da52d699-e8d2-4dc5-9191-a2199e0b6a9b", "prompt": "The attached spreadsheet contains a list of books I read in the year 2022. What is the title of the book that I read the slowest, using the rate of words per day?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Out of the Silent Planet", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "Out of the Silent Planet", "gaia_level": 3, "gaia_file": "da52d699-e8d2-4dc5-9191-a2199e0b6a9b.xlsx", "source": "gaia-benchmark"}}
+{"name": "ad2b4d70-9314-4fe6-bfbe-894a45f6055f", "prompt": "Eva Draconis has a personal website which can be accessed on her YouTube page. What is the meaning of the only symbol seen in the top banner that has a curved line that isn't a circle or a portion of a circle? Answer without punctuation.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: War is not here this is a land of peace", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "War is not here this is a land of peace", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "5b2a14e8-6e59-479c-80e3-4696e8980152", "prompt": "The brand that makes these harnesses the dogs are wearing in the attached pic shares stories from their ambassadors on their website. What meat is mentioned in the story added Dec 8th 2022?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: bacon", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "bacon", "gaia_level": 3, "gaia_file": "5b2a14e8-6e59-479c-80e3-4696e8980152.jpg", "source": "gaia-benchmark"}}
+{"name": "9e1fc53b-46ff-49a1-9d05-9e6faac34cc5", "prompt": "A 5-man group made up of one tank, one healer, and three DPS is doing a dungeon that was just released in World of Warcraft. Two are plate wearers and two are cloth wearers. At the final boss, both the tank and the healer are casting holy spells. Ice and fire are being used, each one by a different DPS. A bear from the group is attacking the boss. Metamorphosis is cast. The Kilt of the Forgotten One drops as loot, but no one can use it. If all classes were using their class abilities and all classes are unique, what are the five classes in the group in alphabetical order separated by commas?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: Death Knight, Hunter, Paladin, Priest, Warlock", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "Death Knight, Hunter, Paladin, Priest, Warlock", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "5f982798-16b9-4051-ab57-cfc7ebdb2a91", "prompt": "I read a paper about multiwavelength observations of fast radio bursts back in March 2021 on Arxiv, and it had a fascinating diagram of an X-ray time profile. There was a similar burst-1 diagram in another paper from one of the same authors about fast radio bursts back in July 2020, but I can't recall what the difference in seconds in the measured time span was. How many more seconds did one measure than the other? Just give the number.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 0.2", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "0.2", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "0512426f-4d28-49f0-be77-06d05daec096", "prompt": "In the YouTube 360 VR video from March 2018 narrated by the voice actor of Lord of the Rings' Gollum, what number was mentioned by the narrator directly after dinosaurs were first shown in the video?", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: 100000000", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "100000000", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
+{"name": "0bdb7c40-671d-4ad1-9ce3-986b159c0ddc", "prompt": "In NASA's Astronomy Picture of the Day on 2006 January 21, two astronauts are visible, with one appearing much smaller than the other. As of August 2023, out of the astronauts in the NASA Astronaut Group that the smaller astronaut was a member of, which one spent the least time in space, and how many minutes did he spend in space, rounded to the nearest minute? Exclude any astronauts who did not spend any time in space. Give the last name of the astronaut, separated from the number of minutes by a semicolon.", "criteria": [{"name": "correct_answer", "instruction": "The agent's final answer must match: White; 5876", "weight": 1.0}], "category": "level_3", "metadata": {"gaia_answer": "White; 5876", "gaia_level": 3, "gaia_file": null, "source": "gaia-benchmark"}}
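
Each added line above is one JSON task record with a `prompt`, a `criteria` list, and the expected answer under `metadata.gaia_answer`. A minimal sketch of loading such records and checking an answer follows. The `load_tasks` and `matches_expected` helpers and the case-insensitive comparison are assumptions made for illustration; they are not part of this commit, where scoring is handled by the evaluators.

```python
import json


def load_tasks(lines):
    """Parse JSONL task records, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def matches_expected(task, answer):
    """Hypothetical exact-match check (whitespace and case insensitive)."""
    expected = task["metadata"]["gaia_answer"]
    return answer.strip().lower() == expected.strip().lower()


# One record from the level 3 file, reduced to the fields used here.
sample = json.dumps(
    {
        "name": "e961a717-6b25-4175-8a68-874d28190ee4",
        "prompt": "According to wikipedia, how many Asian countries still "
        "have a monarchy and access to the sea in 2021?",
        "category": "level_3",
        "metadata": {"gaia_answer": "12", "gaia_level": 3},
    }
)

tasks = load_tasks([sample, ""])
print(matches_expected(tasks[0], " 12 "))
```

An exact match of this kind is stricter than the `correct_answer` criterion, which is phrased as an instruction for a judge, so it would only serve as a quick sanity check.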

src/flow/experiments/evaluators/heuristic.py CHANGED

@@ -73,7 +73,7 @@ class HeuristicEvaluator:
         # Check if agent reported task complete
         output_lower = run_result.output.lower()
-        if "task_done" in output_lower or "complete" in output_lower or "finished" in output_lower:
             criteria_results.append(
                 CriterionResult(
                     name="task_completed",

         # Check if agent reported task complete
         output_lower = run_result.output.lower()
+        if "complete" in output_lower or "finished" in output_lower:
             criteria_results.append(
                 CriterionResult(
                     name="task_completed",
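
The completion check in this hunk is a plain substring test on the lowercased output. A minimal standalone sketch of that idea follows; `COMPLETION_MARKERS` and `reported_complete` are hypothetical names, not identifiers from `heuristic.py`.

```python
# Hypothetical marker list mirroring the substrings tested in the hunk above.
COMPLETION_MARKERS = ("complete", "finished")


def reported_complete(output: str) -> bool:
    """Return True if any completion marker appears in the output."""
    lowered = output.lower()
    return any(marker in lowered for marker in COMPLETION_MARKERS)


print(reported_complete("Task COMPLETE."))
print(reported_complete("Still working on it"))
```

Because this is substring matching, an output such as "incomplete" also counts as a completion report, which is a known limitation of this kind of heuristic.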

src/flow/experiments/evaluators/llm.py CHANGED

@@ -38,6 +38,7 @@ class LLMEvaluator:
         model_client: Any,
         model_name: str = "gpt-4o",
         passing_threshold: float = 0.7,
     ) -> None:
         """Initialize the LLM evaluator.
@@ -46,10 +47,14 @@ class LLMEvaluator:
                          (e.g., AsyncOpenAI, AsyncAzureOpenAI)
             model_name: Model name/deployment to use for evaluation
             passing_threshold: Minimum score to pass (0.0 to 1.0)
         """
         self.model_client = model_client
         self.model_name = model_name
         self.passing_threshold = passing_threshold
     def _get_evaluation_prompt(self, run_result: RunResult) -> str:
         """Build the evaluation prompt for the LLM."""
@@ -156,17 +161,21 @@ Tokens used: {metrics.total_tokens} (input: {metrics.input_tokens}, output: {met
         prompt = self._get_evaluation_prompt(run_result)
         try:
-            response = await self.model_client.chat.completions.create(
-                model=self.model_name,
-                messages=[
                     {
                         "role": "system",
                         "content": "You are an expert evaluator. Respond only with valid JSON.",
                     },
                     {"role": "user", "content": prompt},
                 ],
-                temperature=0.1,  # Low temperature for consistent evaluation
-            )
             # Extract the response text
             response_text = response.choices[0].message.content or ""

         model_client: Any,
         model_name: str = "gpt-4o",
         passing_threshold: float = 0.7,
+        temperature: float | None = None,
     ) -> None:
         """Initialize the LLM evaluator.
                          (e.g., AsyncOpenAI, AsyncAzureOpenAI)
             model_name: Model name/deployment to use for evaluation
             passing_threshold: Minimum score to pass (0.0 to 1.0)
+            temperature: Temperature for LLM calls. None means don't specify
+                        (use model default). Some models like gpt-5.2-chat
+                        only support temperature=1.0.
         """
         self.model_client = model_client
         self.model_name = model_name
         self.passing_threshold = passing_threshold
+        self.temperature = temperature
     def _get_evaluation_prompt(self, run_result: RunResult) -> str:
         """Build the evaluation prompt for the LLM."""
         prompt = self._get_evaluation_prompt(run_result)
         try:
+            # Build params - only include temperature if explicitly set
+            params: dict[str, Any] = {
+                "model": self.model_name,
+                "messages": [
                     {
                         "role": "system",
                         "content": "You are an expert evaluator. Respond only with valid JSON.",
                     },
                     {"role": "user", "content": prompt},
                 ],
+            }
+            if self.temperature is not None:
+                params["temperature"] = self.temperature
+            response = await self.model_client.chat.completions.create(**params)
             # Extract the response text
             response_text = response.choices[0].message.content or ""
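
The new side of this hunk builds the request as a dict and adds `temperature` only when it was set explicitly, so models that reject a custom temperature fall back to their default. A minimal sketch of that pattern in isolation follows; `build_chat_params` is a hypothetical helper name, and no client call is made here.

```python
from typing import Any, Optional


def build_chat_params(
    model_name: str,
    prompt: str,
    temperature: Optional[float] = None,
) -> dict[str, Any]:
    """Build chat completion kwargs, omitting temperature when unset."""
    params: dict[str, Any] = {
        "model": model_name,
        "messages": [
            {
                "role": "system",
                "content": "You are an expert evaluator. Respond only with valid JSON.",
            },
            {"role": "user", "content": prompt},
        ],
    }
    # `is not None` (rather than truthiness) keeps an explicit 0.0.
    if temperature is not None:
        params["temperature"] = temperature
    return params


default_params = build_chat_params("gpt-4o", "Score this run.")
pinned_params = build_chat_params("gpt-4o", "Score this run.", temperature=0.0)
print(sorted(default_params), pinned_params["temperature"])
```

The resulting dict would then be unpacked into the client call, as the hunk does with `create(**params)`.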

src/flow/experiments/models.py CHANGED

@@ -17,10 +17,16 @@ from __future__ import annotations
 from dataclasses import asdict, dataclass, field
 from itertools import product as itertools_product
 from pathlib import Path
-from typing import Any, Protocol, runtime_checkable
 import yaml
 # =============================================================================
 # Tool Configuration
@@ -32,40 +38,55 @@ TOOL_PRESETS: dict[str, dict[str, dict[str, Any]]] = {
     "full": {
         "read_file": {},
         "write_file": {},
-        "list_directory": {},
-        "grep_search": {},
-        "bash_execute": {"timeout": 120},
         "check_processes": {},
         "python_repl": {},
         "think": {},
-        "task_done": {},
         "memory": {},
-        "sub_agent": {"model": "gpt-4o-mini"},
     },
     "standard": {
         "read_file": {},
         "write_file": {},
-        "list_directory": {},
-        "grep_search": {},
-        "bash_execute": {"timeout": 120},
         "check_processes": {},
         "python_repl": {},
         "think": {},
-        "task_done": {},
         "memory": {},
     },
     "minimal": {
         "read_file": {},
         "write_file": {},
-        "bash_execute": {"timeout": 120},
-        "task_done": {},
     },
     "readonly": {
         "read_file": {},
-        "list_directory": {},
-        "grep_search": {},
         "think": {},
-        "task_done": {},
     },
 }
@@ -91,11 +112,11 @@ def resolve_tools(tools: str | list[str] | dict[str, dict[str, Any]]) -> dict[st
         >>> resolve_tools("standard")
         {"read_file": {}, "write_file": {}, ...}
-        >>> resolve_tools(["read_file", "bash_execute"])
-        {"read_file": {}, "bash_execute": {}}
-        >>> resolve_tools({"bash_execute": {"timeout": 60}})
-        {"bash_execute": {"timeout": 60}}
     """
     if isinstance(tools, str):
         if tools not in TOOL_PRESETS:
@@ -114,24 +135,30 @@ class CompactionConfig:
     """Extensible compaction strategy configuration.
     Supports multiple strategies via a tagged-union pattern:
-    - "head_tail": Keep first N + last M messages (default)
     - "last_n": Keep only the last N messages
     - "none": No compaction
-    Future strategies (e.g., "summarize") can be added without
-    changing existing code.
     Attributes:
         strategy: The compaction strategy name
         params: Strategy-specific parameters
     """
     strategy: str = "head_tail"
     params: dict[str, Any] = field(default_factory=lambda: {"head_size": 10, "tail_size": 40})
     @staticmethod
     def head_tail(head_size: int = 10, tail_size: int = 40) -> CompactionConfig:
-        """Create a head+tail compaction config."""
         return CompactionConfig(strategy="head_tail", params={"head_size": head_size, "tail_size": tail_size})
     @staticmethod
@@ -144,6 +171,92 @@ class CompactionConfig:
         """Create a no-compaction config."""
         return CompactionConfig(strategy="none", params={})
     @property
     def enabled(self) -> bool:
         """Whether compaction is enabled."""
@@ -159,6 +272,11 @@ class CompactionConfig:
         """Tail size for head_tail strategy. Returns 0 for other strategies."""
         return self.params.get("tail_size", 0)
 @dataclass
 class Agent:
@@ -171,8 +289,10 @@ class Agent:
     Attributes:
         name: Unique identifier for this agent
         description: Human-readable description
         instructions: System prompt / instructions (optional, uses framework default if None)
         model: Model deployment name (e.g., "gpt-4o")
         compaction: Compaction strategy configuration
         tools: Tool configuration - can be:
@@ -182,8 +302,10 @@ class Agent:
     """
     name: str
     description: str = ""
     instructions: str | None = None
     model: str | None = None
     compaction: CompactionConfig = field(default_factory=CompactionConfig)
     tools: str | list[str] | dict[str, dict[str, Any]] = "standard"
@@ -218,27 +340,50 @@ class ExperimentResult:
     eval_score: float = 0.0
     eval_passed: bool = False
     eval_reasoning: str = ""
 @runtime_checkable
 class CandidateStrategy(Protocol):
     """Protocol for generating candidate variants from a base agent.
-    Implementations explore different regions of the optimization space:
     - GridSearchStrategy: Exhaustive grid over parameter combinations
-    - (Future) HeuristicStrategy: Rule-based mutations from telemetry
     - (Future) BayesianStrategy: Bayesian optimization over parameters
     """
-    def generate(self, base: Agent, budget: int) -> list[Candidate]:
         """Generate candidate variants from a base agent.
         Args:
-            base: The base agent to mutate
-            budget: Maximum number of candidates to generate
         Returns:
-            List of Candidate objects (at most `budget` items)
         """
         ...
@@ -272,8 +417,24 @@ class GridSearchStrategy:
         """
         self.variations = variations
-    def generate(self, base: Agent, budget: int) -> list[Candidate]:
-        """Generate all grid combinations up to budget."""
         if not self.variations:
             return [Candidate(agent=base, mutations={}, rationale="baseline")]
@@ -515,3 +676,121 @@ def _extract_metrics(
         "pareto_rank": summary.get("pareto_rank"),
         "is_pareto_optimal": summary.get("is_pareto_optimal", False),
     }

 from dataclasses import asdict, dataclass, field
 from itertools import product as itertools_product
 from pathlib import Path
+from typing import TYPE_CHECKING, Any, Protocol, runtime_checkable
 import yaml
+if TYPE_CHECKING:
+    from collections.abc import Awaitable, Callable
+    from .evaluators.base import Evaluator
+    from .types import Task
 # =============================================================================
 # Tool Configuration
     "full": {
         "read_file": {},
         "write_file": {},
+        "edit_file": {},
+        "multi_edit": {},
+        "glob_files": {},
+        "ls": {},
+        "grep": {},
+        "bash": {"timeout": 120},
         "check_processes": {},
         "python_repl": {},
         "think": {},
+        "todo_write": {},
+        "todo_read": {},
         "memory": {},
+        "skills": {},
+        "task": {"model": "gpt-4o-mini"},
+        "web_search": {},
+        "web_fetch": {},
+        "notebook_edit": {},
+        "notebook_read": {},
     },
     "standard": {
         "read_file": {},
         "write_file": {},
+        "edit_file": {},
+        "multi_edit": {},
+        "glob_files": {},
+        "ls": {},
+        "grep": {},
+        "bash": {"timeout": 120},
         "check_processes": {},
         "python_repl": {},
         "think": {},
+        "todo_write": {},
+        "todo_read": {},
         "memory": {},
+        "skills": {},
     },
     "minimal": {
         "read_file": {},
         "write_file": {},
+        "edit_file": {},
+        "bash": {"timeout": 120},
+        "think": {},
     },
     "readonly": {
         "read_file": {},
+        "glob_files": {},
+        "ls": {},
+        "grep": {},
         "think": {},
     },
 }
         >>> resolve_tools("standard")
         {"read_file": {}, "write_file": {}, ...}
+        >>> resolve_tools(["read_file", "bash"])
+        {"read_file": {}, "bash": {}}
+        >>> resolve_tools({"bash": {"timeout": 60}})
+        {"bash": {"timeout": 60}}
     """
     if isinstance(tools, str):
         if tools not in TOOL_PRESETS:
     """Extensible compaction strategy configuration.
     Supports multiple strategies via a tagged-union pattern:
+    - "head_tail": Keep first N + last M messages (message-count based)
+    - "head_tail_tokens": Token-aware head+tail (miniagent's HeadTailStrategy)
+    - "sliding_window": Keep system + recent messages within token budget
+    - "summarization": Summarize middle messages using LLM
     - "last_n": Keep only the last N messages
     - "none": No compaction
     Attributes:
         strategy: The compaction strategy name
         params: Strategy-specific parameters
+        token_budget: Maximum tokens for context window (used by token-based strategies)
     """
     strategy: str = "head_tail"
     params: dict[str, Any] = field(default_factory=lambda: {"head_size": 10, "tail_size": 40})
+    token_budget: int = 100_000
+    # =========================================================================
+    # Message-count based strategies (legacy, for MAF/LangGraph)
+    # =========================================================================
     @staticmethod
     def head_tail(head_size: int = 10, tail_size: int = 40) -> CompactionConfig:
+        """Create a message-count based head+tail compaction config."""
         return CompactionConfig(strategy="head_tail", params={"head_size": head_size, "tail_size": tail_size})
     @staticmethod
         """Create a no-compaction config."""
         return CompactionConfig(strategy="none", params={})
+    # =========================================================================
+    # Token-based strategies (for miniagent)
+    # =========================================================================
+    @staticmethod
+    def head_tail_tokens(head_ratio: float = 0.2, token_budget: int = 100_000) -> CompactionConfig:
+        """Create a token-aware head+tail compaction config.
+        This maps to miniagent's HeadTailStrategy which:
+        - Preserves head (system prompt, initial context) using head_ratio of budget
+        - Preserves tail (recent tool calls/results) using remaining budget
+        - Drops middle messages when over budget
+        - Respects atomic groups (tool calls + results stay together)
+        Args:
+            head_ratio: Fraction of budget for head messages (default 0.2 = 20%)
+            token_budget: Maximum tokens for context window
+        Returns:
+            CompactionConfig for token-based head+tail strategy
+        """
+        return CompactionConfig(
+            strategy="head_tail_tokens",
+            params={"head_ratio": head_ratio},
+            token_budget=token_budget,
+        )
+    @staticmethod
+    def sliding_window(token_budget: int = 100_000) -> CompactionConfig:
+        """Create a sliding window compaction config.
+        This maps to miniagent's SlidingWindowStrategy which:
+        - Always keeps system message(s)
+        - Keeps most recent messages that fit within token budget
+        - Respects atomic groups (tool calls + results stay together)
+        Args:
+            token_budget: Maximum tokens for context window
+        Returns:
+            CompactionConfig for sliding window strategy
+        """
+        return CompactionConfig(
+            strategy="sliding_window",
+            params={},
+            token_budget=token_budget,
+        )
+    @staticmethod
+    def summarization(
+        head_messages: int = 2,
+        tail_messages: int = 4,
+        summary_max_tokens: int = 1000,
+        token_budget: int = 100_000,
+    ) -> CompactionConfig:
+        """Create a summarization compaction config.
+        This maps to miniagent's SummarizationStrategy which:
+        - Keeps head messages (system + initial user message)
+        - Keeps tail messages (recent context)
+        - Summarizes middle messages using LLM instead of dropping them
+        - Preserves critical state (files read, findings, progress)
+        Args:
+            head_messages: Number of messages to keep at head (default 2)
+            tail_messages: Number of messages to keep at tail (default 4)
+            summary_max_tokens: Max tokens for the summary (default 1000)
+            token_budget: Maximum tokens for context window
+        Returns:
+            CompactionConfig for summarization strategy
+        """
+        return CompactionConfig(
+            strategy="summarization",
+            params={
+                "head_messages": head_messages,
+                "tail_messages": tail_messages,
+                "summary_max_tokens": summary_max_tokens,
+            },
+            token_budget=token_budget,
+        )
+    # =========================================================================
+    # Properties
+    # =========================================================================
     @property
     def enabled(self) -> bool:
         """Whether compaction is enabled."""
         """Tail size for head_tail strategy. Returns 0 for other strategies."""
         return self.params.get("tail_size", 0)
+    @property
+    def head_ratio(self) -> float:
+        """Head ratio for head_tail_tokens strategy. Defaults to 0.2 if unset."""
+        return self.params.get("head_ratio", 0.2)
 @dataclass
 class Agent:
     Attributes:
         name: Unique identifier for this agent
+        framework: Which harness to use ("maf", "langgraph", "claude")
         description: Human-readable description
         instructions: System prompt / instructions (optional, uses framework default if None)
+        instructions_preset: Preset name for instructions ("coding", "benchmark", etc.)
         model: Model deployment name (e.g., "gpt-4o")
         compaction: Compaction strategy configuration
         tools: Tool configuration - can be:
     """
     name: str
+    framework: str = "maf"
     description: str = ""
     instructions: str | None = None
+    instructions_preset: str | None = None  # e.g., "coding", "benchmark", "research"
     model: str | None = None
     compaction: CompactionConfig = field(default_factory=CompactionConfig)
     tools: str | list[str] | dict[str, dict[str, Any]] = "standard"
     eval_score: float = 0.0
     eval_passed: bool = False
     eval_reasoning: str = ""
+    traces: dict[str, Any] = field(default_factory=dict)
 @runtime_checkable
 class CandidateStrategy(Protocol):
     """Protocol for generating candidate variants from a base agent.
+    Implementations can be:
+    - Simple (single-shot): GridSearchStrategy ignores optional params
+    - Complex (iterative): Runs internal experiments, checks convergence,
+      distills failures, etc. using the provided callbacks
+    All logic is internal to the strategy - the caller just calls generate()
+    and receives the final list of candidates.
+    Examples:
     - GridSearchStrategy: Exhaustive grid over parameter combinations
+    - (Future) AdaptivePromptOptimizer: Iteratively improves prompts from failures
     - (Future) BayesianStrategy: Bayesian optimization over parameters
     """
+    def generate(
+        self,
+        base: Agent,
+        budget: int,
+        *,
+        tasks: list[Task] | None = None,
+        evaluator: Evaluator | None = None,
+        run_experiment: Callable[[Candidate, Task], Awaitable[ExperimentResult]] | None = None,
+    ) -> list[Candidate]:
         """Generate candidate variants from a base agent.
         Args:
+            base: The base agent to optimize
+            budget: Maximum number of candidates to return
+            tasks: Optional tasks for strategies that run internal experiments
+            evaluator: Optional evaluator for strategies that need scoring
+            run_experiment: Optional async callback to execute a candidate on a task.
+                           Signature: async (candidate, task) -> ExperimentResult
         Returns:
+            List of Candidate objects (at most `budget` items).
+            For iterative strategies, returns the final/best candidates after
+            internal optimization loops complete.
         """
         ...
         """
         self.variations = variations
+    def generate(
+        self,
+        base: Agent,
+        budget: int,
+        *,
+        tasks: list[Task] | None = None,
+        evaluator: Evaluator | None = None,
+        run_experiment: Callable[[Candidate, Task], Awaitable[ExperimentResult]] | None = None,
+    ) -> list[Candidate]:
+        """Generate all grid combinations up to budget.
+        Note: tasks, evaluator, and run_experiment are accepted for protocol
+        compatibility but ignored - GridSearchStrategy is a simple single-shot
+        strategy that doesn't run experiments internally.
+        """
+        # Delete unused params to satisfy linters
+        del tasks, evaluator, run_experiment
         if not self.variations:
             return [Candidate(agent=base, mutations={}, rationale="baseline")]
         "pareto_rank": summary.get("pareto_rank"),
         "is_pareto_optimal": summary.get("is_pareto_optimal", False),
     }
+# =============================================================================
+# Experiment YAML - Defines variations for optimization
+# =============================================================================
+@dataclass
+class Experiment:
+    """Experiment configuration for optimization.
+    Separates concerns:
+    - Agent YAML: What the agent is (model, instructions, defaults)
+    - Experiment YAML: How to test it (variations, tasks, evaluation settings)
+    Attributes:
+        base_agent: Path to base agent YAML file
+        suite: Built-in task suite name (e.g., "coding", "quick")
+        tasks: Path to custom tasks JSONL file (alternative to suite)
+        variations: Dict of parameter variations for grid search
+        parallel: Max concurrent experiments
+        budget: Maximum candidates to generate
+        use_llm_eval: Whether to use LLM-as-Judge evaluation
+    Example YAML:
+        ```yaml
+        base_agent: examples/miniagent_base.yaml
+        suite: coding
+        variations:
+          compaction:
+            - strategy: none
+            - strategy: head_tail
+              params: { head_size: 10, tail_size: 40 }
+            - strategy: sliding_window
+              token_budget: 50000
+            - strategy: summarization
+              token_budget: 50000
+          tools:
+            - minimal
+            - standard
+            - [read_file, write_file, bash, memory]
+        parallel: 4
+        budget: 20
+        use_llm_eval: true
+        ```
+    """
+    base_agent: str | None = None
+    suite: str | None = None
+    tasks: str | None = None
+    variations: dict[str, list[Any]] = field(default_factory=dict)
+    parallel: int = 4
+    budget: int = 100
+    use_llm_eval: bool = True
+def load_experiment(path: Path) -> Experiment:
+    """Load an Experiment from a YAML file.
+    Handles conversion of compaction variations from dict to CompactionConfig.
+    Args:
+        path: Path to the experiment YAML file
+    Returns:
+        Experiment instance with parsed variations
+    Raises:
+        FileNotFoundError: If the file doesn't exist
+        ValueError: If the config is invalid
+    """
+    if not path.exists():
+        raise FileNotFoundError(f"Experiment config file not found: {path}")
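+    data = yaml.safe_load(path.read_text())
+    # An empty or non-mapping file would otherwise fail later with AttributeError
+    if not isinstance(data, dict):
+        raise ValueError(f"Experiment config must be a YAML mapping: {path}")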
+    # Parse variations - convert compaction dicts to CompactionConfig
+    variations: dict[str, list[Any]] = {}
+    raw_variations = data.get("variations", {})
+    for key, values in raw_variations.items():
+        if key == "compaction":
+            # Convert each compaction dict to CompactionConfig
+            parsed_compactions = []
+            for v in values:
+                if isinstance(v, dict):
+                    parsed_compactions.append(CompactionConfig(**v))
+                elif isinstance(v, str):
+                    # Handle shorthand: "none", "head_tail", etc.
+                    if v == "none":
+                        parsed_compactions.append(CompactionConfig.none())
+                    elif v == "head_tail":
+                        parsed_compactions.append(CompactionConfig.head_tail())
+                    elif v == "sliding_window":
+                        parsed_compactions.append(CompactionConfig.sliding_window())
+                    elif v == "summarization":
+                        parsed_compactions.append(CompactionConfig.summarization())
+                    else:
+                        raise ValueError(f"Unknown compaction shorthand: {v}")
+                else:
+                    parsed_compactions.append(v)
+            variations["compaction"] = parsed_compactions
+        else:
+            # Other variations pass through as-is
+            variations[key] = values
+    return Experiment(
+        base_agent=data.get("base_agent"),
+        suite=data.get("suite"),
+        tasks=data.get("tasks"),
+        variations=variations,
+        parallel=data.get("parallel", 4),
+        budget=data.get("budget", 100),
+        use_llm_eval=data.get("use_llm_eval", True),
+    )
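
Review note: the shorthand handling in `load_experiment` could be expressed as a small table-driven helper. The sketch below is only an illustration under stated assumptions: `CompactionConfig` here is a hypothetical stand-in (the real class lives in `flow.experiments.models` and its fields are not visible in this diff), and `parse_compaction` / `parse_variations` are made-up names.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class CompactionConfig:
    """Hypothetical stand-in; the real fields may differ."""

    strategy: str = "none"
    params: dict[str, Any] = field(default_factory=dict)


# Shorthand names taken from the if/elif chain in the diff.
_SHORTHANDS = ("none", "head_tail", "sliding_window", "summarization")


def parse_compaction(value: Any) -> Any:
    """Convert one YAML entry into a CompactionConfig, or pass it through."""
    if isinstance(value, dict):
        return CompactionConfig(**value)
    if isinstance(value, str):
        if value not in _SHORTHANDS:
            raise ValueError(f"Unknown compaction shorthand: {value}")
        return CompactionConfig(strategy=value)
    return value


def parse_variations(raw: dict[str, list[Any]] | None) -> dict[str, list[Any]]:
    """Mirror the loop in load_experiment: only 'compaction' is converted."""
    parsed: dict[str, list[Any]] = {}
    for key, values in (raw or {}).items():
        if key == "compaction":
            parsed[key] = [parse_compaction(v) for v in values]
        else:
            parsed[key] = values
    return parsed
```

With this shape, supporting a new shorthand would mean adding one entry to `_SHORTHANDS` instead of another `elif` branch.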

src/flow/experiments/optimizer.py CHANGED

@@ -20,10 +20,7 @@ from typing import Any
 from openai import AsyncAzureOpenAI
-from .ablation import (
-    compute_pareto_frontier,
-    create_harness_from_agent,
-)
 from .evaluators import LLMEvaluator
 from .metrics import TraceMetrics, extract_metrics
 from .models import (
@@ -47,6 +44,7 @@ class TaskResult:
     eval_score: float
     eval_passed: bool
     eval_reasoning: str
 @dataclass
@@ -84,6 +82,18 @@ class CandidateSummary:
             "task_count": self.task_count,
             "pareto_rank": self.pareto_rank,
             "is_pareto_optimal": self.is_pareto_optimal,
         }
@@ -287,18 +297,37 @@ class FlowOptimizer:
         evaluator: LLMEvaluator | None,
     ) -> TaskResult:
         """Run a single candidate-task experiment."""
-        harness = create_harness_from_agent(candidate.agent, workspace)
         try:
             runner = FlowExperimentRunner(keep_workspace=True)
             run_result = await runner.run(harness, task, workspace=workspace)
             metrics = extract_metrics(run_result.trace)
             if evaluator:
                 eval_result = await evaluator.evaluate(run_result)
                 eval_score = eval_result.score
                 eval_passed = eval_result.passed
                 eval_reasoning = eval_result.reasoning
             else:
                 eval_score = 1.0 if run_result.success else 0.0
                 eval_passed = run_result.success
@@ -312,6 +341,7 @@ class FlowOptimizer:
                 eval_score=eval_score,
                 eval_passed=eval_passed,
                 eval_reasoning=eval_reasoning,
             )
         finally:
             await harness.close()
@@ -366,26 +396,48 @@ class FlowOptimizer:
     def _create_evaluator(self) -> LLMEvaluator | None:
         """Create LLM evaluator if credentials available."""
         api_key = os.environ.get("AZURE_OPENAI_API_KEY")
         endpoint = os.environ.get("AZURE_OPENAI_ENDPOINT")
-        deployment = os.environ.get("AZURE_OPENAI_DEPLOYMENT", "gpt-4o")
         if not api_key or not endpoint:
-            logger.warning("No Azure OpenAI credentials, using heuristic evaluation")
             return None
-        client = AsyncAzureOpenAI(
-            api_key=api_key,
-            api_version="2024-02-15-preview",
-            azure_endpoint=endpoint,
-        )
-        return LLMEvaluator(
             model_client=client,
             model_name=deployment,
             passing_threshold=0.7,
         )
     def _save_config(
         self,
         candidates: list[Candidate],

 from openai import AsyncAzureOpenAI
+from .ablation import compute_pareto_frontier
 from .evaluators import LLMEvaluator
 from .metrics import TraceMetrics, extract_metrics
 from .models import (
     eval_score: float
     eval_passed: bool
     eval_reasoning: str
+    criteria_results: list[dict[str, Any]] = field(default_factory=list)  # Per-criterion scores
 @dataclass
             "task_count": self.task_count,
             "pareto_rank": self.pareto_rank,
             "is_pareto_optimal": self.is_pareto_optimal,
+            # Include per-task results with eval reasoning
+            "task_results": [
+                {
+                    "task_name": tr.task_name,
+                    "eval_score": tr.eval_score,
+                    "eval_passed": tr.eval_passed,
+                    "eval_reasoning": tr.eval_reasoning,
+                    "tokens": tr.metrics.total_tokens,
+                    "duration": tr.run_result.duration_seconds,
+                }
+                for tr in self.task_results
+            ],
         }
         evaluator: LLMEvaluator | None,
     ) -> TaskResult:
         """Run a single candidate-task experiment."""
+        # Import harness modules to register them, then use registry
+        import flow.harness.maf  # noqa: F401
+        try:
+            import flow.harness.miniagent  # noqa: F401
+        except ImportError:
+            pass  # miniagent harness is optional
+        from flow.harness import create_harness
+        harness = create_harness(candidate.agent, workspace)
         try:
             runner = FlowExperimentRunner(keep_workspace=True)
             run_result = await runner.run(harness, task, workspace=workspace)
             metrics = extract_metrics(run_result.trace)
+            criteria_results: list[dict[str, Any]] = []
             if evaluator:
                 eval_result = await evaluator.evaluate(run_result)
                 eval_score = eval_result.score
                 eval_passed = eval_result.passed
                 eval_reasoning = eval_result.reasoning
+                # Convert criteria results to dicts for serialization
+                criteria_results = [
+                    {
+                        "name": cr.name,
+                        "score": cr.score,
+                        "passed": cr.passed,
+                        "reasoning": cr.reasoning,
+                    }
+                    for cr in eval_result.criteria_results
+                ]
             else:
                 eval_score = 1.0 if run_result.success else 0.0
                 eval_passed = run_result.success
                 eval_score=eval_score,
                 eval_passed=eval_passed,
                 eval_reasoning=eval_reasoning,
+                criteria_results=criteria_results,
             )
         finally:
             await harness.close()
     def _create_evaluator(self) -> LLMEvaluator | None:
         """Create LLM evaluator if credentials available."""
+        from openai import AsyncOpenAI
         api_key = os.environ.get("AZURE_OPENAI_API_KEY")
         endpoint = os.environ.get("AZURE_OPENAI_ENDPOINT")
+        deployment = os.environ.get("AZURE_OPENAI_DEPLOYMENT") or os.environ.get(
+            "AZURE_OPENAI_CHAT_DEPLOYMENT_NAME", "gpt-4o"
+        )
+        logger.info("Creating LLM evaluator...")
+        logger.debug(f"  API Key: {'[SET]' if api_key else '[NOT SET]'}")
+        logger.debug(f"  Endpoint: {endpoint if endpoint else '[NOT SET]'}")
+        logger.debug(f"  Deployment: {deployment}")
         if not api_key or not endpoint:
+            logger.warning("No Azure OpenAI credentials, using heuristic evaluation (binary 0/1 scores)")
             return None
+        # Check if using OpenAI-compatible endpoint (e.g., /openai/v1/)
+        # vs traditional Azure OpenAI endpoint
+        if "/v1" in endpoint:
+            logger.info("Creating AsyncOpenAI client for evaluator (OpenAI-compatible endpoint)")
+            client = AsyncOpenAI(
+                base_url=endpoint,
+                api_key=api_key,
+            )
+        else:
+            logger.info("Creating AsyncAzureOpenAI client for evaluator")
+            client = AsyncAzureOpenAI(
+                api_key=api_key,
+                api_version="2024-02-15-preview",
+                azure_endpoint=endpoint,
+            )
+        evaluator = LLMEvaluator(
             model_client=client,
             model_name=deployment,
             passing_threshold=0.7,
         )
+        logger.info(f"LLM evaluator created successfully (model={deployment}, threshold=0.7)")
+        return evaluator
     def _save_config(
         self,
         candidates: list[Candidate],
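
Review note: `_create_evaluator` picks the client type with a substring test (`"/v1" in endpoint`). A stricter variant could look only at the URL path, so that a hostname starting with `v1` does not match. This is a sketch of an alternative, not what the commit does; `uses_openai_compatible_api` is a hypothetical helper and the hostnames are made up.

```python
from urllib.parse import urlparse


def uses_openai_compatible_api(endpoint: str) -> bool:
    """Return True when the URL *path* contains a 'v1' segment."""
    path = urlparse(endpoint).path
    segments = [segment for segment in path.split("/") if segment]
    return "v1" in segments


# Path-based check: only the first endpoint is treated as OpenAI-compatible.
compatible = uses_openai_compatible_api("https://example.openai.azure.com/openai/v1/")
classic = uses_openai_compatible_api("https://example.openai.azure.com")
# "//v1-demo" contains the substring "/v1", so a plain substring test would match here.
tricky = uses_openai_compatible_api("https://v1-demo.example.com/")
```

The result could then select between `AsyncOpenAI(base_url=...)` and `AsyncAzureOpenAI(azure_endpoint=...)` exactly as the diff already does.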

src/flow/experiments/runner.py CHANGED

@@ -21,7 +21,7 @@ from .trace_collector import FlowTraceCollector
 from .types import RunResult, Task
 if TYPE_CHECKING:
-    from flow.harness.maf import MAFHarness
 logger = logging.getLogger(__name__)
@@ -66,10 +66,12 @@ class FlowExperimentRunner:
     - Supporting streaming execution
     Example:
-        from flow.harness.maf import MAFHarness
         from flow.experiments import FlowExperimentRunner, Task
-        harness = MAFHarness()
         runner = FlowExperimentRunner(keep_workspace=True)
         task = Task(name="hello", prompt="Create a hello world script")
@@ -95,7 +97,7 @@ class FlowExperimentRunner:
     async def run(
         self,
-        harness: MAFHarness,
         task: Task,
         workspace: Path | None = None,
     ) -> RunResult:
@@ -109,7 +111,7 @@ class FlowExperimentRunner:
         5. Returns a RunResult with all data
         Args:
-            harness: The MAFHarness to run
             task: The task to execute
             workspace: Optional workspace directory (creates temp if None)
@@ -167,6 +169,10 @@ class FlowExperimentRunner:
                             elif event.type == EventType.TOOL_RESULT:
                                 # Optionally capture tool results
                                 pass
             finally:
                 os.chdir(original_cwd)

 from .types import RunResult, Task
 if TYPE_CHECKING:
+    from flow.harness.base import BaseHarness
 logger = logging.getLogger(__name__)
     - Supporting streaming execution
     Example:
+        from pathlib import Path
+        from flow.harness import create_harness
         from flow.experiments import FlowExperimentRunner, Task
+        from flow.experiments.models import Agent
+        agent = Agent(name="my-agent")
+        harness = create_harness(agent, workspace=Path("/tmp"))
         runner = FlowExperimentRunner(keep_workspace=True)
         task = Task(name="hello", prompt="Create a hello world script")
     async def run(
         self,
+        harness: "BaseHarness",
         task: Task,
         workspace: Path | None = None,
     ) -> RunResult:
         5. Returns a RunResult with all data
         Args:
+            harness: The harness to run (any BaseHarness implementation)
             task: The task to execute
             workspace: Optional workspace directory (creates temp if None)
                             elif event.type == EventType.TOOL_RESULT:
                                 # Optionally capture tool results
                                 pass
+                            elif event.type == EventType.ERROR:
+                                # Capture error from harness
+                                error = event.content
+                                logger.error(f"Harness error: {error}")
             finally:
                 os.chdir(original_cwd)

src/flow/experiments/types.py CHANGED

@@ -168,6 +168,56 @@ def get_available_suites() -> list[str]:
     return sorted(p.stem for p in _DATA_DIR.glob("*.jsonl"))
 def get_task_suite(suite_name: str) -> list[Task]:
     """Get a built-in task suite by name.

     return sorted(p.stem for p in _DATA_DIR.glob("*.jsonl"))
+@dataclass
+class SuiteInfo:
+    """Information about a task suite."""
+    name: str
+    task_count: int
+    description: str
+# Suite descriptions for known suites
+_SUITE_DESCRIPTIONS: dict[str, str] = {
+    "quick": "Fast testing",
+    "core": "Standard evaluation",
+    "coding": "Self-contained repo analysis tasks (clone, analyze, report)",
+    "gaia_level1": "GAIA easy benchmark",
+    "gaia_level2": "GAIA medium benchmark",
+    "gaia_level3": "GAIA hard benchmark",
+    "gaia_all": "GAIA full benchmark",
+}
+def get_suite_info(suite_name: str) -> SuiteInfo:
+    """Get information about a specific suite.
+    Args:
+        suite_name: Name of the suite
+    Returns:
+        SuiteInfo with name, task count, and description
+    """
+    path = _DATA_DIR / f"{suite_name}.jsonl"
+    if not path.exists():
+        raise ValueError(f"Suite not found: {suite_name}")
+    # Count lines (tasks) in the file
+    with path.open() as f:
+        task_count = sum(1 for line in f if line.strip())
+    description = _SUITE_DESCRIPTIONS.get(suite_name, "Custom task suite")
+    return SuiteInfo(name=suite_name, task_count=task_count, description=description)
+def get_all_suite_info() -> list[SuiteInfo]:
+    """Get information about all available suites.
+    Returns:
+        List of SuiteInfo for each available suite.
+    """
+    return [get_suite_info(name) for name in get_available_suites()]
 def get_task_suite(suite_name: str) -> list[Task]:
     """Get a built-in task suite by name.
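
Review note: `get_suite_info` derives `task_count` by counting non-blank lines in the suite's JSONL file. A minimal, self-contained sketch of that counting step is below; `count_tasks` and the temporary `demo.jsonl` file are illustrative names, not part of the commit.

```python
import tempfile
from pathlib import Path


def count_tasks(path: Path) -> int:
    """Count non-blank lines in a JSONL file, closing the file afterwards."""
    with path.open(encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


with tempfile.TemporaryDirectory() as tmp:
    suite = Path(tmp) / "demo.jsonl"
    # Two task records separated by a blank line; the blank line is not counted.
    suite.write_text('{"name": "a"}\n\n{"name": "b"}\n', encoding="utf-8")
    demo_count = count_tasks(suite)
```

Note that this counts lines, not valid JSON objects, so a malformed record would still be included in the total.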

src/flow/harness/__init__.py CHANGED

@@ -5,14 +5,36 @@ events to a uniform Event format for CLI/UI consumption.
 Available harnesses:
 - maf: Microsoft Agent Framework harness
-- (future) langchain: LangChain harness
 - (future) claude: Claude SDK harness
 """
 from flow.harness.base import BaseHarness, Event, EventType
 __all__ = [
     "BaseHarness",
     "Event",
     "EventType",
 ]

 Available harnesses:
 - maf: Microsoft Agent Framework harness
+- miniagent: miniagent harness
+- langgraph: LangGraph harness (import flow.harness.langgraph to register it)
 - (future) claude: Claude SDK harness
+Usage:
+    from pathlib import Path
+    from flow.harness import create_harness
+    from flow.experiments.models import Agent
+    agent = Agent(name="my-agent", framework="maf")
+    harness = create_harness(agent, workspace=Path("/tmp"))
 """
 from flow.harness.base import BaseHarness, Event, EventType
+from flow.harness.registry import (
+    available_frameworks,
+    create_harness,
+    get_harness_class,
+    register,
+)
+# Auto-register harnesses by importing them
+# Each harness module calls register() on import
+from flow.harness import maf as _maf  # noqa: F401
+from flow.harness import miniagent as _miniagent  # noqa: F401
 __all__ = [
     "BaseHarness",
     "Event",
     "EventType",
+    "available_frameworks",
+    "create_harness",
+    "get_harness_class",
+    "register",
 ]
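
Review note: `flow.harness.registry` itself is not part of this view, so the following is only a guess at the pattern behind `register`, `available_frameworks`, `get_harness_class` and `create_harness`. The function bodies, the `_REGISTRY` dict and `DummyHarness` are all assumptions made for illustration.

```python
from __future__ import annotations

from pathlib import Path
from types import SimpleNamespace
from typing import Any

# Hypothetical registry; the real flow.harness.registry is not shown in this diff.
_REGISTRY: dict[str, type] = {}


def register(name: str, harness_cls: type) -> None:
    """Map a framework name (e.g. "maf") to its harness class."""
    _REGISTRY[name] = harness_cls


def available_frameworks() -> list[str]:
    return sorted(_REGISTRY)


def get_harness_class(name: str) -> type:
    if name not in _REGISTRY:
        raise ValueError(f"Unknown framework {name!r}; available: {available_frameworks()}")
    return _REGISTRY[name]


def create_harness(agent: Any, workspace: Path) -> Any:
    """Look up the class for agent.framework and delegate to from_agent()."""
    return get_harness_class(agent.framework).from_agent(agent, workspace)


class DummyHarness:
    """Illustrative harness; stands in for MAFHarness and similar classes."""

    def __init__(self, name: str, workspace: Path) -> None:
        self.name = name
        self.workspace = workspace

    @classmethod
    def from_agent(cls, agent: Any, workspace: Path) -> DummyHarness:
        return cls(agent.name, workspace)


register("dummy", DummyHarness)
harness = create_harness(SimpleNamespace(name="my-agent", framework="dummy"), Path("/tmp"))
```

Registration on import keeps `create_harness` free of framework-specific imports, at the cost that a harness module must be imported before its name can be resolved.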

src/flow/harness/base.py CHANGED

@@ -7,10 +7,16 @@ allowing Flow to run on different agent frameworks.
 from __future__ import annotations
 from abc import ABC, abstractmethod
-from collections.abc import AsyncIterator, Callable, Coroutine
 from dataclasses import dataclass, field
 from enum import Enum
-from typing import Any
 class EventType(Enum):
@@ -49,52 +55,49 @@ class BaseHarness(ABC):
     to the uniform Flow Event format for CLI/UI consumption.
     Each harness implementation handles:
-    - Taking a pre-configured agent from the framework
-    - Running tasks on the agent
     - Converting framework-specific events to Flow Events
     - Managing conversation threads
     Implementations:
     - MAFHarness (flow.harness.maf): Microsoft Agent Framework
-    - (Future) LangChainHarness: LangChain
     - (Future) ClaudeHarness: Claude SDK
     """
     @abstractmethod
-    async def run(self, task: str, thread_id: str | None = None) -> str:
-        """Run a task and return the final response.
         Args:
-            task: The task/prompt to execute
-            thread_id: Optional thread ID for conversation continuity
         Returns:
-            The agent's final response text
         """
         ...
     @abstractmethod
-    def run_stream(self, task: str, thread_id: str | None = None) -> AsyncIterator[Event]:
         """Run a task with streaming events.
         Args:
             task: The task/prompt to execute
-            thread_id: Optional thread ID for conversation continuity
         Yields:
             Event objects representing agent activity
         """
         ...
-    @abstractmethod
-    def register_tools(self, tools: list[Callable[..., Coroutine[Any, Any, str]]]) -> None:
-        """Register tools with the harness.
-        Args:
-            tools: List of tool functions to register
-        """
-        ...
     @abstractmethod
     def get_thread_id(self) -> str:
         """Get the current thread ID.

 from __future__ import annotations
 from abc import ABC, abstractmethod
+from collections.abc import AsyncIterator
 from dataclasses import dataclass, field
 from enum import Enum
+from typing import TYPE_CHECKING
+if TYPE_CHECKING:
+    from pathlib import Path
+    from flow.experiments.models import Agent
+    from flow.llm import LLMClientConfig
 class EventType(Enum):
     to the uniform Flow Event format for CLI/UI consumption.
     Each harness implementation handles:
+    - Creating an agent from an Agent spec via from_agent()
+    - Running tasks on the agent with streaming events
     - Converting framework-specific events to Flow Events
     - Managing conversation threads
     Implementations:
     - MAFHarness (flow.harness.maf): Microsoft Agent Framework
+    - LangGraphHarness (flow.harness.langgraph): LangGraph
     - (Future) ClaudeHarness: Claude SDK
     """
+    @classmethod
     @abstractmethod
+    def from_agent(
+        cls,
+        agent: "Agent",
+        workspace: "Path",
+        llm_config: "LLMClientConfig | None" = None,
+    ) -> "BaseHarness":
+        """Create a harness from an Agent definition.
         Args:
+            agent: The Agent spec defining the configuration
+            workspace: Working directory for the agent
+            llm_config: Optional LLM configuration (falls back to env vars if not provided)
         Returns:
+            A configured harness instance
         """
         ...
     @abstractmethod
+    def run_stream(self, task: str) -> AsyncIterator[Event]:
         """Run a task with streaming events.
         Args:
             task: The task/prompt to execute
         Yields:
             Event objects representing agent activity
         """
         ...
     @abstractmethod
     def get_thread_id(self) -> str:
         """Get the current thread ID.
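
Review note: the new `BaseHarness` contract is a `from_agent()` classmethod plus a streaming `run_stream()`. The sketch below shows how a subclass might satisfy it. Everything here is a simplified stand-in: `Event`, `EventType` and `BaseHarness` are re-declared locally with fewer members than the real ones (for example, `get_thread_id` is omitted), and `EchoHarness` is hypothetical.

```python
from __future__ import annotations

import asyncio
from abc import ABC, abstractmethod
from collections.abc import AsyncIterator
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Any


class EventType(Enum):  # stand-in with only two members
    TEXT_DELTA = "text_delta"
    DONE = "done"


@dataclass
class Event:  # stand-in for flow.harness.base.Event
    type: EventType
    content: str | None = None


class BaseHarness(ABC):  # reduced copy of the abstract interface in the diff
    @classmethod
    @abstractmethod
    def from_agent(cls, agent: Any, workspace: Path, llm_config: Any = None) -> BaseHarness: ...

    @abstractmethod
    def run_stream(self, task: str) -> AsyncIterator[Event]: ...


class EchoHarness(BaseHarness):
    """Hypothetical harness that echoes the task back as a single text event."""

    def __init__(self, name: str, workspace: Path) -> None:
        self.name = name
        self.workspace = workspace

    @classmethod
    def from_agent(cls, agent: Any, workspace: Path, llm_config: Any = None) -> EchoHarness:
        return cls(name=agent["name"], workspace=workspace)

    async def run_stream(self, task: str) -> AsyncIterator[Event]:
        # An async generator satisfies the AsyncIterator[Event] return type.
        yield Event(EventType.TEXT_DELTA, f"{self.name}: {task}")
        yield Event(EventType.DONE)


async def collect(harness: BaseHarness, task: str) -> list[Event]:
    return [event async for event in harness.run_stream(task)]


events = asyncio.run(collect(EchoHarness.from_agent({"name": "echo"}, Path(".")), "hi"))
```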

src/flow/harness/langgraph/__init__.py ADDED

	@@ -0,0 +1,37 @@

+"""LangGraph harness for Flow.
+This module provides a harness adapter that allows LangGraph agents
+to be used within the Flow experimentation framework.
+Usage:
+    from pathlib import Path
+    from flow.experiments.models import Agent
+    from flow.harness import create_harness
+    agent = Agent(
+        name="my-langgraph-agent",
+        framework="langgraph",  # <-- Use LangGraph harness
+        tools="standard",
+        model="openai:gpt-4o",
+    )
+    harness = create_harness(agent, workspace=Path("/tmp/workspace"))
+    async for event in harness.run_stream("Create hello.py"):
+        print(event.type, event.content)
+"""
+from flow.harness.langgraph.compaction import create_compaction_hook
+from flow.harness.langgraph.harness import LangGraphHarness
+from flow.harness.langgraph.otel_callback import OTelCallbackHandler
+from flow.harness.langgraph.wrappers import build_langgraph_tools, wrap_for_langgraph
+from flow.harness.registry import register
+# Register the harness with Flow
+register("langgraph", LangGraphHarness)
+__all__ = [
+    "LangGraphHarness",
+    "OTelCallbackHandler",
+    "build_langgraph_tools",
+    "create_compaction_hook",
+    "wrap_for_langgraph",
+]

src/flow/harness/langgraph/compaction.py ADDED

	@@ -0,0 +1,51 @@

+"""Message compaction for LangGraph.
+Provides a pre-model hook that implements head-tail message compaction,
+similar to MAF's HeadTailCompactingChatMessageStore.
+"""
+from __future__ import annotations
+from typing import Any
+__all__ = ["create_compaction_hook"]
+def create_compaction_hook(head_size: int, tail_size: int):
+    """Create a pre-model hook for message compaction.
+    This hook compacts messages by keeping the first `head_size` messages
+    and the last `tail_size` messages, dropping the middle.
+    Args:
+        head_size: Number of messages to keep from the start
+        tail_size: Number of messages to keep from the end
+    Returns:
+        A function that can be used as a pre_model_hook in create_react_agent
+    Example:
+        hook = create_compaction_hook(10, 40)
+        graph = create_react_agent(
+            model=model,
+            tools=tools,
+            pre_model_hook=hook,
+        )
+    """
+    def compact_messages(state: dict[str, Any]) -> dict[str, Any]:
+        """Compact messages keeping head and tail, dropping middle."""
+        messages = state.get("messages", [])
+        total = len(messages)
+        # No compaction needed if within limits
+        if total <= head_size + tail_size:
+            return {"llm_input_messages": messages}
+        # Keep head and tail
+        head = messages[:head_size]
+        # Slice from an explicit index: messages[-0:] would return the whole list
+        tail = messages[total - tail_size:]
+        return {"llm_input_messages": head + tail}
+    return compact_messages
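
Review note: a worked example of head/tail compaction on a plain list may help when reading the hook above. `compact` is a hypothetical free function that applies the same rule (keep the first `head_size` and the last `tail_size` items) without the LangGraph state dict.

```python
from typing import Any


def compact(messages: list[Any], head_size: int, tail_size: int) -> list[Any]:
    """Keep the first head_size and last tail_size items, dropping the middle."""
    total = len(messages)
    if total <= head_size + tail_size:
        return list(messages)
    # Use an explicit start index: messages[-0:] would return the whole list.
    return messages[:head_size] + messages[total - tail_size:]


history = [f"m{i}" for i in range(10)]
kept = compact(history, head_size=2, tail_size=3)       # m0, m1, m7, m8, m9
head_only = compact(history, head_size=2, tail_size=0)  # m0, m1
short = compact(history[:4], head_size=2, tail_size=3)  # unchanged, within limits
```

One caveat worth keeping in mind: dropping the middle by position can separate a tool call from its tool result, which some chat APIs reject, so a production hook may need to cut on message-pair boundaries.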

src/flow/harness/langgraph/harness.py ADDED

	@@ -0,0 +1,257 @@

+"""LangGraph harness for Flow.
+Provides a harness adapter that allows LangGraph agents to be used
+within the Flow experimentation framework.
+"""
+from __future__ import annotations
+import logging
+import uuid
+from collections.abc import AsyncIterator
+from pathlib import Path
+from typing import TYPE_CHECKING, Any
+from opentelemetry import trace
+from flow.harness.base import BaseHarness, Event, EventType
+if TYPE_CHECKING:
+    from flow.experiments.models import Agent
+logger = logging.getLogger(__name__)
+# Get tracer for LangGraph instrumentation
+_tracer = trace.get_tracer("flow.langgraph", "0.1.0")
+__all__ = ["LangGraphHarness"]
+class LangGraphHarness(BaseHarness):
+    """Harness adapter for LangGraph.
+    This harness allows LangGraph agents to be used within the Flow
+    experimentation framework. It converts LangGraph streaming events
+    to Flow's uniform Event format and emits OpenTelemetry spans.
+    Example:
+        from pathlib import Path
+        from flow.experiments.models import Agent
+        from flow.harness import create_harness
+        agent = Agent(
+            name="my-langgraph-agent",
+            framework="langgraph",
+            tools="standard",
+            model="openai:gpt-4o",
+        )
+        harness = create_harness(agent, workspace=Path("/tmp/workspace"))
+        async for event in harness.run_stream("Create hello.py"):
+            print(event.type, event.content)
+    """
+    @classmethod
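+    def from_agent(
+        cls,
+        agent: Agent,
+        workspace: Path,
+        llm_config: Any = None,
+    ) -> LangGraphHarness:
+        """Create a LangGraph harness from an Agent spec.
+        Args:
+            agent: Agent configuration
+            workspace: Working directory for file operations
+            llm_config: Accepted to match BaseHarness.from_agent; currently unused
+                (the model is built from agent.model or environment variables)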
+        Returns:
+            Configured LangGraphHarness instance
+        """
+        from flow.experiments.models import resolve_tools
+        from flow.harness.langgraph.compaction import create_compaction_hook
+        from flow.harness.langgraph.wrappers import build_langgraph_tools
+        from langgraph.checkpoint.memory import InMemorySaver
+        from langgraph.prebuilt import create_react_agent
+        # Build tools (skip sub_agent - MAF-specific)
+        tools_spec = resolve_tools(agent.tools)
+        if "sub_agent" in tools_spec:
+            logger.warning("sub_agent tool not supported in LangGraph harness, skipping")
+            del tools_spec["sub_agent"]
+        memory_path = workspace / "memory"
+        memory_path.mkdir(parents=True, exist_ok=True)
+        tools = build_langgraph_tools(tools_spec, workspace, memory_path)
+        # Create model
+        model = cls._create_model(agent.model)
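+        # Create compaction hook if enabled. Only head/tail compaction is
+        # implemented here, so any strategy other than "none" falls back to it.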
+        pre_model_hook = None
+        if agent.compaction and agent.compaction.strategy != "none":
+            params = agent.compaction.params or {}
+            head_size = params.get("head_size", 10)
+            tail_size = params.get("tail_size", 40)
+            pre_model_hook = create_compaction_hook(head_size, tail_size)
+        # Build graph
+        graph = create_react_agent(
+            model=model,
+            tools=tools,
+            prompt=agent.instructions,
+            pre_model_hook=pre_model_hook,
+            checkpointer=InMemorySaver(),
+        )
+        return cls(graph=graph, agent_name=agent.name, workspace=workspace)
+    @staticmethod
+    def _create_model(model_spec: str | None):
+        """Create a LangChain chat model from spec.
+        Args:
+            model_spec: Model specification, e.g., "openai:gpt-4o" or "gpt-4o"
+        Returns:
+            A LangChain chat model instance
+        """
+        import os
+        if model_spec and ":" in model_spec:
+            # "provider:model" syntax - use init_chat_model
+            from langchain.chat_models import init_chat_model
+            return init_chat_model(model_spec)
+        # Default: Azure OpenAI from environment
+        from langchain_openai import AzureChatOpenAI
+        return AzureChatOpenAI(
+            deployment_name=os.environ.get("AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"),
+            api_key=os.environ.get("AZURE_OPENAI_API_KEY"),
+            azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
+            api_version=os.environ.get("AZURE_OPENAI_API_VERSION", "2024-02-15-preview"),
+        )
+    def __init__(
+        self,
+        graph: Any = None,
+        agent_name: str = "LangGraphAgent",
+        workspace: Path | None = None,
+    ) -> None:
+        """Initialize the harness.
+        Args:
+            graph: A compiled LangGraph StateGraph
+            agent_name: Name of the agent (for tracing)
+            workspace: Working directory
+        """
+        from flow.harness.langgraph.otel_callback import OTelCallbackHandler
+        self._graph = graph
+        self._agent_name = agent_name
+        self._workspace = workspace
+        self._thread_id = str(uuid.uuid4())
+        self._otel_callback = OTelCallbackHandler()
+    async def run_stream(self, task: str) -> AsyncIterator[Event]:
+        """Run a task with streaming events.
+        Args:
+            task: The task/prompt to execute
+        Yields:
+            Event objects representing the agent's actions
+        """
+        from langchain_core.messages import HumanMessage
+        config = {
+            "configurable": {"thread_id": self._thread_id},
+            "callbacks": [self._otel_callback],
+        }
+        input_state = {"messages": [HumanMessage(content=task)]}
+        # Wrap in agent span for tracing
+        with _tracer.start_as_current_span(
+            f"invoke_agent {self._agent_name}",
+            kind=trace.SpanKind.INTERNAL,
+        ) as span:
+            span.set_attribute("gen_ai.operation.name", "invoke_agent")
+            span.set_attribute("gen_ai.agent.name", self._agent_name)
+            span.set_attribute("gen_ai.conversation.id", self._thread_id)
+            try:
+                async for chunk in self._graph.astream(
+                    input_state,
+                    config,
+                    stream_mode=["messages", "updates"],
+                ):
+                    for event in self._convert_chunk(chunk):
+                        yield event
+                yield Event(type=EventType.DONE)
+            except Exception as e:
+                logger.exception("Error during LangGraph execution")
+                span.record_exception(e)
+                span.set_status(trace.StatusCode.ERROR, str(e))
+                yield Event(type=EventType.ERROR, content=str(e))
+    def _convert_chunk(self, chunk: tuple) -> list[Event]:
+        """Convert a LangGraph stream chunk to Flow Events.
+        Args:
+            chunk: A tuple of (stream_mode, data) from LangGraph
+        Returns:
+            List of Flow Event objects
+        """
+        from langchain_core.messages import ToolMessage
+        events: list[Event] = []
+        if not isinstance(chunk, tuple) or len(chunk) != 2:
+            return events
+        mode, data = chunk
+        if mode == "messages":
+            msg_chunk, metadata = data
+            # Text content
+            if hasattr(msg_chunk, "content") and msg_chunk.content:
+                events.append(Event(
+                    type=EventType.TEXT_DELTA,
+                    content=msg_chunk.content,
+                ))
+            # Tool call chunks
+            if hasattr(msg_chunk, "tool_call_chunks"):
+                for tc in msg_chunk.tool_call_chunks or []:
+                    if tc.get("name"):
+                        events.append(Event(
+                            type=EventType.TOOL_CALL_START,
+                            tool_name=tc["name"],
+                            tool_call_id=tc.get("id"),
+                        ))
+                    if tc.get("args"):
+                        events.append(Event(
+                            type=EventType.TOOL_CALL_ARGS,
+                            content=tc["args"],
+                        ))
+        elif mode == "updates":
+            for node_name, update in data.items():
+                if node_name == "tools" and "messages" in update:
+                    for msg in update["messages"]:
+                        if isinstance(msg, ToolMessage):
+                            events.append(Event(
+                                type=EventType.TOOL_RESULT,
+                                content=str(msg.content),
+                                tool_call_id=msg.tool_call_id,
+                            ))
+                            events.append(Event(type=EventType.TOOL_CALL_DONE))
+        return events
+    def get_thread_id(self) -> str:
+        """Get the current thread/conversation ID."""
+        return self._thread_id
+    async def close(self) -> None:
+        """Clean up resources."""
+        self._thread_id = None
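Reviewer sketch: the "messages" branch of `_convert_chunk` can be exercised without LangGraph installed. The snippet below is a minimal illustration, not the harness code itself. `Event` and `EventType` are simplified stand-ins for the classes in `flow.harness.base`, `convert_messages_chunk` is a hypothetical name, and `SimpleNamespace` objects stand in for LangChain message chunks.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from types import SimpleNamespace


class EventType(Enum):  # stand-in for flow.harness.base.EventType
    TEXT_DELTA = "text_delta"
    TOOL_CALL_START = "tool_call_start"
    TOOL_CALL_ARGS = "tool_call_args"


@dataclass
class Event:  # stand-in for flow.harness.base.Event
    type: EventType
    content: str | None = None
    tool_name: str | None = None
    tool_call_id: str | None = None


def convert_messages_chunk(chunk: tuple) -> list[Event]:
    """Simplified mirror of the "messages" branch of _convert_chunk."""
    events: list[Event] = []
    # Anything that is not a (mode, data) pair is ignored
    if not isinstance(chunk, tuple) or len(chunk) != 2:
        return events
    mode, data = chunk
    if mode != "messages":
        return events
    msg_chunk, _metadata = data
    # Non-empty text becomes a TEXT_DELTA event
    if getattr(msg_chunk, "content", None):
        events.append(Event(EventType.TEXT_DELTA, content=msg_chunk.content))
    # Each tool call chunk may carry a name (start) and/or an args fragment
    for tc in getattr(msg_chunk, "tool_call_chunks", None) or []:
        if tc.get("name"):
            events.append(Event(
                EventType.TOOL_CALL_START,
                tool_name=tc["name"],
                tool_call_id=tc.get("id"),
            ))
        if tc.get("args"):
            events.append(Event(EventType.TOOL_CALL_ARGS, content=tc["args"]))
    return events


# Illustrative inputs: one text chunk and one tool call chunk
text_chunk = ("messages", (SimpleNamespace(content="Hello", tool_call_chunks=[]), {}))
tool_chunk = ("messages", (
    SimpleNamespace(
        content="",
        tool_call_chunks=[{"name": "bash", "id": "call_1", "args": '{"cmd": "ls"}'}],
    ),
    {},
))
text_events = convert_messages_chunk(text_chunk)
tool_events = convert_messages_chunk(tool_chunk)
```

A tool call chunk that carries both a name and an args fragment yields two events in order, start first and then args, which is what a streaming UI would need to render the call incrementally.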

src/flow/harness/langgraph/otel_callback.py ADDED

	@@ -0,0 +1,173 @@

+"""OTel callback for LangGraph - emits GenAI semantic convention spans.
+This module provides a LangChain callback handler that emits OpenTelemetry
+spans conforming to the GenAI semantic conventions. It fills a gap:
+unlike MAF, LangGraph has no native GenAI OTel support.
+Reference: https://opentelemetry.io/docs/specs/semconv/gen-ai/
+"""
+from __future__ import annotations
+from typing import Any
+from langchain_core.callbacks import BaseCallbackHandler
+from opentelemetry import trace
+__all__ = ["GenAIAttr", "OTelCallbackHandler"]
+class GenAIAttr:
+    """OpenTelemetry GenAI semantic convention attributes.
+    These match the attributes used by MAF for consistency.
+    Reference: https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/
+    """
+    # Operation
+    OPERATION_NAME = "gen_ai.operation.name"
+    PROVIDER_NAME = "gen_ai.provider.name"
+    # Model
+    REQUEST_MODEL = "gen_ai.request.model"
+    RESPONSE_MODEL = "gen_ai.response.model"
+    # Tokens
+    INPUT_TOKENS = "gen_ai.usage.input_tokens"
+    OUTPUT_TOKENS = "gen_ai.usage.output_tokens"
+    # Tool
+    TOOL_NAME = "gen_ai.tool.name"
+    TOOL_TYPE = "gen_ai.tool.type"
+    TOOL_CALL_ID = "gen_ai.tool.call.id"
+    # Error
+    ERROR_TYPE = "error.type"
+# Get tracer for LangGraph instrumentation
+_tracer = trace.get_tracer("flow.langgraph", "0.1.0")
+class OTelCallbackHandler(BaseCallbackHandler):
+    """Emit OpenTelemetry spans for LangGraph LLM and tool calls.
+    This callback handler hooks into LangChain's callback system and
+    emits spans that conform to the GenAI semantic conventions.
+    Usage:
+        callback = OTelCallbackHandler()
+        config = {"callbacks": [callback]}
+        graph.invoke(input, config)
+    """
+    def __init__(self) -> None:
+        """Initialize the callback handler."""
+        self._spans: dict[str, trace.Span] = {}
+    def on_llm_start(
+        self,
+        serialized: dict[str, Any],
+        prompts: list[str],
+        *,
+        run_id: Any,
+        **kwargs: Any,
+    ) -> None:
+        """Called when LLM starts generating."""
+        # Extract model and provider from serialized data
+        model = serialized.get("kwargs", {}).get("model", "unknown")
+        if not model or model == "unknown":
+            model = serialized.get("kwargs", {}).get("model_name", "unknown")
+        # Try to get provider from serialized id
+        serialized_id = serialized.get("id", [])
+        provider = serialized_id[-1] if serialized_id else "unknown"
+        # Start span
+        span = _tracer.start_span(f"chat {model}", kind=trace.SpanKind.CLIENT)
+        span.set_attribute(GenAIAttr.OPERATION_NAME, "chat")
+        span.set_attribute(GenAIAttr.REQUEST_MODEL, model)
+        span.set_attribute(GenAIAttr.PROVIDER_NAME, provider)
+        self._spans[str(run_id)] = span
+    def on_llm_end(
+        self,
+        response: Any,
+        *,
+        run_id: Any,
+        **kwargs: Any,
+    ) -> None:
+        """Called when LLM finishes generating."""
+        span = self._spans.pop(str(run_id), None)
+        if span:
+            # Extract token usage from response
+            usage = {}
+            if hasattr(response, "llm_output") and response.llm_output:
+                usage = response.llm_output.get("token_usage", {})
+            if usage:
+                span.set_attribute(GenAIAttr.INPUT_TOKENS, usage.get("prompt_tokens", 0))
+                span.set_attribute(GenAIAttr.OUTPUT_TOKENS, usage.get("completion_tokens", 0))
+            span.end()
+    def on_llm_error(
+        self,
+        error: BaseException,
+        *,
+        run_id: Any,
+        **kwargs: Any,
+    ) -> None:
+        """Called when LLM encounters an error."""
+        span = self._spans.pop(str(run_id), None)
+        if span:
+            span.set_attribute(GenAIAttr.ERROR_TYPE, type(error).__name__)
+            span.record_exception(error)
+            span.set_status(trace.StatusCode.ERROR, str(error))
+            span.end()
+    def on_tool_start(
+        self,
+        serialized: dict[str, Any],
+        input_str: str,
+        *,
+        run_id: Any,
+        **kwargs: Any,
+    ) -> None:
+        """Called when a tool starts executing."""
+        tool_name = serialized.get("name", "unknown")
+        span = _tracer.start_span(f"execute_tool {tool_name}", kind=trace.SpanKind.INTERNAL)
+        span.set_attribute(GenAIAttr.OPERATION_NAME, "execute_tool")
+        span.set_attribute(GenAIAttr.TOOL_NAME, tool_name)
+        span.set_attribute(GenAIAttr.TOOL_TYPE, "function")
+        self._spans[str(run_id)] = span
+    def on_tool_end(
+        self,
+        output: str,
+        *,
+        run_id: Any,
+        **kwargs: Any,
+    ) -> None:
+        """Called when a tool finishes executing."""
+        span = self._spans.pop(str(run_id), None)
+        if span:
+            span.end()
+    def on_tool_error(
+        self,
+        error: BaseException,
+        *,
+        run_id: Any,
+        **kwargs: Any,
+    ) -> None:
+        """Called when a tool encounters an error."""
+        span = self._spans.pop(str(run_id), None)
+        if span:
+            span.set_attribute(GenAIAttr.ERROR_TYPE, type(error).__name__)
+            span.record_exception(error)
+            span.set_status(trace.StatusCode.ERROR, str(error))
+            span.end()
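Reviewer sketch: the handler keeps open spans in a dict keyed by `str(run_id)` and pops them on end or error, so each span is ended at most once. The snippet below illustrates only that bookkeeping and uses no OpenTelemetry calls. `RecordingSpan` and `SpanBook` are hypothetical stand-ins, and the model name is an arbitrary example.

```python
from __future__ import annotations

import uuid
from typing import Any


class RecordingSpan:
    """Hypothetical stand-in for an OpenTelemetry span; it only records calls."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.attributes: dict[str, Any] = {}
        self.ended = False

    def set_attribute(self, key: str, value: Any) -> None:
        self.attributes[key] = value

    def end(self) -> None:
        self.ended = True


class SpanBook:
    """Tracks open spans by run_id, in the way the handler's _spans dict does."""

    def __init__(self) -> None:
        self._spans: dict[str, RecordingSpan] = {}

    def start(self, run_id: Any, name: str,
              attributes: dict[str, Any] | None = None) -> RecordingSpan:
        span = RecordingSpan(name)
        for key, value in (attributes or {}).items():
            span.set_attribute(key, value)
        self._spans[str(run_id)] = span
        return span

    def finish(self, run_id: Any,
               attributes: dict[str, Any] | None = None) -> RecordingSpan | None:
        # pop() removes the entry, so a second finish for the same run is a no-op
        span = self._spans.pop(str(run_id), None)
        if span is not None:
            for key, value in (attributes or {}).items():
                span.set_attribute(key, value)
            span.end()
        return span


book = SpanBook()
run_id = uuid.uuid4()
opened = book.start(run_id, "chat example-model", {"gen_ai.operation.name": "chat"})
closed = book.finish(run_id, {"gen_ai.usage.input_tokens": 12})
second = book.finish(run_id)  # the span was already popped
```

Keying by `run_id` lets concurrent LLM and tool calls share one handler instance without their spans colliding.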

src/flow/harness/langgraph/wrappers.py ADDED

	@@ -0,0 +1,76 @@

+"""LangGraph-specific tool wrappers.
+This module wraps shared tools for use with LangGraph/LangChain.
+"""
+from __future__ import annotations
+import logging
+from collections.abc import Callable, Coroutine
+from pathlib import Path
+from typing import Any
+from langchain_core.tools import tool as langchain_tool
+from flow.tools import build_tools, get_tool_meta
+logger = logging.getLogger(__name__)
+__all__ = ["build_langgraph_tools", "wrap_for_langgraph"]
+def wrap_for_langgraph(
+    tool_func: Callable[..., Coroutine[Any, Any, str]]
+) -> Callable[..., Coroutine[Any, Any, str]]:
+    """Wrap a Flow tool for LangGraph/LangChain.
+    Applies LangChain's @tool decorator, using the name and description
+    stored as metadata by Flow's own @tool decorator.
+    Args:
+        tool_func: A tool function decorated with Flow's @tool
+    Returns:
+        The function wrapped with LangChain's @tool for LangGraph
+    Raises:
+        ValueError: If the function has no tool metadata
+    """
+    meta = get_tool_meta(tool_func)
+    if meta is None:
+        raise ValueError(f"Function {tool_func} has no tool metadata. Decorate with @tool first.")
+    # LangChain's @tool decorator takes name as first positional arg
+    # and description as keyword arg
+    return langchain_tool(meta.name, description=meta.description)(tool_func)
+def build_langgraph_tools(
+    tools_spec: dict[str, dict[str, Any]],
+    workspace: Path,
+    memory_path: Path,
+) -> list[Any]:  # Returns list of LangChain BaseTool
+    """Build LangGraph-compatible tools from a specification dict.
+    Creates shared tools and wraps them with LangChain's @tool decorator.
+    Args:
+        tools_spec: Dict mapping tool names to their config dicts.
+        workspace: Root directory for file operations
+        memory_path: Directory for persistent memory
+    Returns:
+        List of tool functions wrapped for LangGraph
+    """
+    # Build raw tools from shared module
+    raw_tools = build_tools(tools_spec, workspace, memory_path)
+    # Wrap each with LangChain's @tool
+    lg_tools = []
+    for tool_func in raw_tools:
+        try:
+            wrapped = wrap_for_langgraph(tool_func)
+            lg_tools.append(wrapped)
+        except ValueError as e:
+            logger.warning(f"Could not wrap tool: {e}")
+    return lg_tools
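Reviewer sketch: the wrapper relies on metadata attached by a Flow-side decorator and skips functions that lack it. The snippet below shows one way such a pattern could look using only the standard library. `ToolMeta`, `tool`, `get_tool_meta`, and `wrap_all` here are hypothetical illustrations. The real `flow.tools` implementations are not part of this diff and may differ.

```python
from __future__ import annotations

import logging
from dataclasses import dataclass
from typing import Any, Callable

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class ToolMeta:  # hypothetical metadata record, not the real flow.tools type
    name: str
    description: str


def tool(name: str, description: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Hypothetical Flow-side decorator that attaches metadata to a function."""
    def decorate(func: Callable[..., Any]) -> Callable[..., Any]:
        func._tool_meta = ToolMeta(name, description)  # type: ignore[attr-defined]
        return func
    return decorate


def get_tool_meta(func: Callable[..., Any]) -> ToolMeta | None:
    """Return the attached metadata, or None for undecorated functions."""
    return getattr(func, "_tool_meta", None)


def wrap_all(funcs: list[Callable[..., Any]]) -> list[tuple[str, Callable[..., Any]]]:
    """Pair each tool with its declared name, skipping undecorated functions."""
    wrapped: list[tuple[str, Callable[..., Any]]] = []
    for func in funcs:
        meta = get_tool_meta(func)
        if meta is None:
            logger.warning("Could not wrap %r: no tool metadata", func)
            continue
        wrapped.append((meta.name, func))
    return wrapped


@tool("echo", "Return the input unchanged.")
async def echo(text: str) -> str:
    return text


async def undecorated(text: str) -> str:
    return text


wrapped = wrap_all([echo, undecorated])
```

Storing the name and description once, on the function, lets each framework adapter read the same metadata instead of restating it per framework.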

src/flow/harness/maf/__init__.py CHANGED

@@ -6,6 +6,10 @@ Provides integration with Microsoft Agent Framework for running Flow agents.
 from flow.harness.maf.agent import create_agent
 from flow.harness.maf.harness import MAFHarness
 from flow.harness.maf.message_store import HeadTailCompactingChatMessageStore
 __all__ = [
     "create_agent",

 from flow.harness.maf.agent import create_agent
 from flow.harness.maf.harness import MAFHarness
 from flow.harness.maf.message_store import HeadTailCompactingChatMessageStore
+from flow.harness.registry import register
+# Auto-register MAFHarness as the "maf" framework
+register("maf", MAFHarness)
 __all__ = [
     "create_agent",

src/flow/harness/maf/agent.py CHANGED

@@ -11,7 +11,7 @@ from typing import TYPE_CHECKING, Any
 from flow.experiments.models import TOOL_PRESETS, resolve_tools
 from flow.harness.maf.message_store import HeadTailCompactingChatMessageStore
-from flow.harness.maf.tools import build_tools
 from flow.prompts import build_instructions
 if TYPE_CHECKING:
@@ -54,7 +54,7 @@ def create_agent(
     Args:
         endpoint: Azure OpenAI endpoint URL. Defaults to AZURE_OPENAI_ENDPOINT env var.
         api_key: Azure OpenAI API key. Defaults to AZURE_OPENAI_API_KEY env var.
-        deployment: Azure OpenAI deployment name. Defaults to AZURE_OPENAI_DEPLOYMENT env var.
         api_version: Azure OpenAI API version.
         name: Agent name.
         instructions: Agent instructions. Defaults to FLOW_AGENT_INSTRUCTIONS.
@@ -86,7 +86,7 @@ def create_agent(
         >>> agent = create_agent(tools={"bash_execute": {"timeout": 60}, "memory": {}})
     """
     try:
-        from agent_framework import ChatAgent, ai_function
         from agent_framework.azure import AzureOpenAIChatClient
     except ImportError as e:
         raise ImportError(
@@ -97,7 +97,7 @@ def create_agent(
     # Resolve configuration from environment if not provided
     endpoint = endpoint or os.environ.get("AZURE_OPENAI_ENDPOINT")
     api_key = api_key or os.environ.get("AZURE_OPENAI_API_KEY")
-    deployment = deployment or os.environ.get("AZURE_OPENAI_DEPLOYMENT")
     if not endpoint:
         raise ValueError(
@@ -112,7 +112,7 @@ def create_agent(
     if not deployment:
         raise ValueError(
             "Azure OpenAI deployment is required. "
-            "Set AZURE_OPENAI_DEPLOYMENT or pass deployment parameter."
         )
     # Resolve paths
@@ -125,26 +125,23 @@ def create_agent(
     # Create tools from specification or use provided functions
     if isinstance(tools, (str, list, dict)):
-        # Resolve to dict form and build tools
         tools_spec = resolve_tools(tools)
-        tool_functions = build_tools(tools_spec, workspace, memory_path)
     else:
-        # Already a sequence of callable tools
-        tool_functions = tools
-    # Wrap tools with ai_function decorator for Agent Framework
-    converted_tools = []
-    for tool_func in tool_functions:
-        tool_name = getattr(tool_func, "_tool_name", tool_func.__name__)
-        tool_description = getattr(tool_func, "_tool_description", tool_func.__doc__ or "")
-        wrapped = ai_function(name=tool_name, description=tool_description)(tool_func)
-        converted_tools.append(wrapped)
     # Create the chat client
     client = AzureOpenAIChatClient(
         api_key=api_key,
         endpoint=endpoint,
-        deployment=deployment,
         api_version=api_version,
     )

 from flow.experiments.models import TOOL_PRESETS, resolve_tools
 from flow.harness.maf.message_store import HeadTailCompactingChatMessageStore
+from flow.harness.maf.wrappers import build_maf_tools
 from flow.prompts import build_instructions
 if TYPE_CHECKING:
     Args:
         endpoint: Azure OpenAI endpoint URL. Defaults to AZURE_OPENAI_ENDPOINT env var.
         api_key: Azure OpenAI API key. Defaults to AZURE_OPENAI_API_KEY env var.
+        deployment: Azure OpenAI deployment name. Defaults to AZURE_OPENAI_CHAT_DEPLOYMENT_NAME env var.
         api_version: Azure OpenAI API version.
         name: Agent name.
         instructions: Agent instructions. Defaults to FLOW_AGENT_INSTRUCTIONS.
         >>> agent = create_agent(tools={"bash_execute": {"timeout": 60}, "memory": {}})
     """
     try:
+        from agent_framework import ChatAgent, tool
         from agent_framework.azure import AzureOpenAIChatClient
     except ImportError as e:
         raise ImportError(
     # Resolve configuration from environment if not provided
     endpoint = endpoint or os.environ.get("AZURE_OPENAI_ENDPOINT")
     api_key = api_key or os.environ.get("AZURE_OPENAI_API_KEY")
+    deployment = deployment or os.environ.get("AZURE_OPENAI_CHAT_DEPLOYMENT_NAME")
     if not endpoint:
         raise ValueError(
     if not deployment:
         raise ValueError(
             "Azure OpenAI deployment is required. "
+            "Set AZURE_OPENAI_CHAT_DEPLOYMENT_NAME or pass deployment parameter."
         )
     # Resolve paths
     # Create tools from specification or use provided functions
     if isinstance(tools, (str, list, dict)):
+        # Resolve to dict form and build MAF-wrapped tools
         tools_spec = resolve_tools(tools)
+        converted_tools = build_maf_tools(tools_spec, workspace, memory_path)
     else:
+        # Already a sequence of callable tools - wrap them with tool decorator
+        converted_tools = []
+        for tool_func in tools:
+            tool_name = getattr(tool_func, "_tool_name", tool_func.__name__)
+            tool_description = getattr(tool_func, "_tool_description", tool_func.__doc__ or "")
+            wrapped = tool(name=tool_name, description=tool_description)(tool_func)
+            converted_tools.append(wrapped)
     # Create the chat client
     client = AzureOpenAIChatClient(
         api_key=api_key,
         endpoint=endpoint,
+        deployment_name=deployment,
         api_version=api_version,
     )

src/flow/harness/maf/harness.py CHANGED

@@ -3,9 +3,12 @@
 A thin adapter that converts Agent Framework events to the uniform Flow Event format.
 """
 import logging
 import uuid
 from collections.abc import AsyncIterator
 from typing import TYPE_CHECKING, Any
 from flow.harness.base import BaseHarness, Event, EventType
@@ -13,6 +16,9 @@ from flow.harness.base import BaseHarness, Event, EventType
 if TYPE_CHECKING:
     from agent_framework import ChatAgent
 logger = logging.getLogger(__name__)
 # Track if instrumentation has been enabled globally
@@ -55,12 +61,69 @@ class MAFHarness(BaseHarness):
         >>> async for event in harness.run_stream("Create a hello world script"):
         ...     print(event)
-        >>> # Or with custom agent
-        >>> from flow.harness.maf import create_agent
-        >>> agent = create_agent(enable_compaction=False)
-        >>> harness = MAFHarness(agent)
     """
     def __init__(
         self,
         agent: "ChatAgent | None" = None,
@@ -87,61 +150,15 @@ class MAFHarness(BaseHarness):
         # Enable OpenTelemetry instrumentation for trace collection
         _enable_instrumentation()
-    def register_tools(self, tools: list[Any]) -> None:
-        """Register tools with the harness.
-        Note: For MAFHarness, tools should be configured when creating the agent
-        via create_agent(). This method is provided for interface compatibility
-        but will log a warning if called.
-        Args:
-            tools: List of tool functions (ignored - configure via create_agent)
-        """
-        logger.warning(
-            "MAFHarness.register_tools() called but tools should be configured "
-            "via create_agent(). These tools will be ignored."
-        )
-    async def run(self, task: str, thread_id: str | None = None) -> str:
-        """Run a task and return the final response.
-        Args:
-            task: The task/prompt to execute
-            thread_id: Optional thread ID for conversation continuity
-        Returns:
-            The agent's final response text
-        """
-        if thread_id:
-            self._thread_id = thread_id
-        # Get or create an AgentThread for conversation continuity
-        if self._thread is None:
-            self._thread = self._agent.get_new_thread()
-        response = await self._agent.run(task, thread=self._thread)
-        # Extract text content from response
-        content = getattr(response, "content", None)
-        if content is not None:
-            return str(content)
-        return str(response)
-    async def run_stream(
-        self, task: str, thread_id: str | None = None
-    ) -> AsyncIterator[Event]:
         """Run a task with streaming events.
         Args:
             task: The task/prompt to execute
-            thread_id: Optional thread ID for conversation continuity
         Yields:
             Event objects representing agent activity
         """
-        if thread_id:
-            self._thread_id = thread_id
         # Get or create an AgentThread for conversation continuity
         if self._thread is None:
             self._thread = self._agent.get_new_thread()

 A thin adapter that converts Agent Framework events to the uniform Flow Event format.
 """
+from __future__ import annotations
 import logging
 import uuid
 from collections.abc import AsyncIterator
+from pathlib import Path
 from typing import TYPE_CHECKING, Any
 from flow.harness.base import BaseHarness, Event, EventType
 if TYPE_CHECKING:
     from agent_framework import ChatAgent
+    from flow.experiments.models import Agent
+    from flow.llm import LLMClientConfig
 logger = logging.getLogger(__name__)
 # Track if instrumentation has been enabled globally
         >>> async for event in harness.run_stream("Create a hello world script"):
         ...     print(event)
+        >>> # Or from Agent spec
+        >>> from flow.experiments.models import Agent
+        >>> agent = Agent(name="my-agent", tools="standard")
+        >>> harness = MAFHarness.from_agent(agent, workspace=Path("/tmp"))
     """
+    @classmethod
+    def from_agent(
+        cls,
+        agent: "Agent",
+        workspace: Path,
+        llm_config: "LLMClientConfig | None" = None,
+    ) -> "MAFHarness":
+        """Create a MAFHarness from an Agent definition.
+        Args:
+            agent: The Agent spec defining the configuration
+            workspace: Working directory for the agent
+            llm_config: Optional LLM configuration (falls back to env vars if not provided)
+        Returns:
+            A configured MAFHarness instance
+        """
+        from flow.experiments.models import resolve_tools
+        tools_spec = resolve_tools(agent.tools)
+        # Build kwargs for create_agent
+        kwargs: dict[str, Any] = {
+            "workspace": workspace,
+            "memory_path": workspace / "memory",
+            "enable_compaction": agent.compaction.enabled,
+            "compaction_head_size": agent.compaction.head_size,
+            "compaction_tail_size": agent.compaction.tail_size,
+            "tools": tools_spec,
+            "instructions": agent.instructions,
+        }
+        # Extract credentials from LLM config if provided
+        if llm_config is not None:
+            from flow.llm import LLMProvider
+            if llm_config.provider == LLMProvider.AZURE_OPENAI and llm_config.azure_openai:
+                kwargs["endpoint"] = llm_config.azure_openai.get_endpoint()
+                kwargs["api_key"] = llm_config.azure_openai.get_api_key()
+                kwargs["deployment"] = llm_config.azure_openai.deployment
+                kwargs["api_version"] = llm_config.azure_openai.api_version
+            elif llm_config.provider == LLMProvider.OPENAI and llm_config.openai:
+                # OpenAI uses different endpoint/auth pattern
+                # For now, MAF only supports Azure OpenAI natively
+                # Log warning and fall back to env vars
+                logger.warning(
+                    "MAF harness only supports Azure OpenAI natively. "
+                    f"Provider {llm_config.provider.value} will fall back to env vars."
+                )
+            else:
+                logger.warning(
+                    "MAF harness only supports Azure OpenAI. "
+                    f"Provider {llm_config.provider.value} will fall back to env vars."
+                )
+        return cls(**kwargs)
     def __init__(
         self,
         agent: "ChatAgent | None" = None,
         # Enable OpenTelemetry instrumentation for trace collection
         _enable_instrumentation()
+    async def run_stream(self, task: str) -> AsyncIterator[Event]:
         """Run a task with streaming events.
         Args:
             task: The task/prompt to execute
         Yields:
             Event objects representing agent activity
         """
         # Get or create an AgentThread for conversation continuity
         if self._thread is None:
             self._thread = self._agent.get_new_thread()

src/flow/harness/maf/tools/__init__.py CHANGED

@@ -1,86 +1,74 @@
 """MAF-specific tools for the Flow agent.
 This module provides tools that work with the Microsoft Agent Framework harness.
-Tools are created based on a specification dict that maps tool names to their configs.
 Available tools:
-- read_file: Read file contents
-- write_file: Write/edit file content
-- list_directory: List directory contents
-- grep_search: Search for text patterns
-- bash_execute: Execute bash commands (config: timeout)
-- check_processes: Manage background processes
-- python_repl: Execute Python code
-- think: Explicit reasoning tool
-- task_done: Task completion marker
-- memory: Persistent memory storage
-- sub_agent: Isolated research sub-agent (config: model)
 """
-from collections.abc import Callable, Coroutine, Sequence
 from pathlib import Path
 from typing import Any
-from flow.harness.maf.tools.coding import (
-    create_grep_search_tool,
-    create_list_directory_tool,
-    create_read_file_tool,
-    create_write_file_tool,
 )
-from flow.harness.maf.tools.core import task_done, think
-from flow.harness.maf.tools.execution import (
-    create_bash_execute_tool,
-    create_check_processes_tool,
-    create_python_repl_tool,
-)
-from flow.harness.maf.tools.memory import create_memory_tool
-from flow.harness.maf.tools.sub_agent import create_sub_agent_tool
 __all__ = [
     "build_tools",
-    "create_bash_execute_tool",
-    "create_check_processes_tool",
-    "create_grep_search_tool",
-    "create_list_directory_tool",
-    "create_memory_tool",
-    "create_python_repl_tool",
-    "create_read_file_tool",
-    "create_sub_agent_tool",
-    "create_write_file_tool",
-    "task_done",
-    "think",
 ]
-# Registry of tool factories that don't require config
-# Maps tool name -> factory function(workspace, memory_path) -> tool
-_SIMPLE_TOOL_FACTORIES: dict[str, Callable[..., Any]] = {}
-# Registry of tools that are standalone (no factory needed)
-_STANDALONE_TOOLS: dict[str, Callable[..., Coroutine[Any, Any, str]]] = {
-    "think": think,
-    "task_done": task_done,
-}
 def build_tools(
     tools_spec: dict[str, dict[str, Any]],
     workspace: Path,
     memory_path: Path,
-) -> Sequence[Callable[..., Coroutine[Any, Any, str]]]:
-    """Build tool functions from a specification dict.
     This is the main entry point for creating tools based on a resolved
-    tool specification (from resolve_tools()).
     Args:
         tools_spec: Dict mapping tool names to their config dicts.
-                   e.g., {"bash_execute": {"timeout": 60}, "read_file": {}}
         workspace: Root directory for file operations
-        memory_path: Directory for persistent memory
     Returns:
-        List of tool functions ready to use with MAF
     Example:
         >>> from flow.experiments.models import resolve_tools
@@ -88,70 +76,63 @@ def build_tools(
         >>> tools = build_tools(tools_spec, workspace, memory_path)
     """
     workspace = Path(workspace).resolve()
-    memory_path = Path(memory_path).resolve()
     tools: list[Callable[..., Coroutine[Any, Any, str]]] = []
-    for tool_name, config in tools_spec.items():
-        tool = _create_tool(tool_name, config, workspace, memory_path)
-        if tool is not None:
-            tools.append(tool)
     return tools
-def _create_tool(
-    name: str,
-    config: dict[str, Any],
-    workspace: Path,
-    memory_path: Path,
-) -> Callable[..., Coroutine[Any, Any, str]] | None:
-    """Create a single tool by name with the given config.
-    Args:
-        name: Tool name (e.g., "read_file", "bash_execute")
-        config: Tool-specific configuration dict
-        workspace: Root directory for file operations
-        memory_path: Directory for persistent memory
-    Returns:
-        Tool function or None if unknown tool name
-    """
-    # Standalone tools (no config needed)
-    if name in _STANDALONE_TOOLS:
-        return _STANDALONE_TOOLS[name]
-    # Coding tools
-    if name == "read_file":
-        return create_read_file_tool(workspace)
-    if name == "write_file":
-        return create_write_file_tool(workspace)
-    if name == "list_directory":
-        return create_list_directory_tool(workspace)
-    if name == "grep_search":
-        return create_grep_search_tool(workspace)
-    # Execution tools
-    if name == "bash_execute":
-        timeout = config.get("timeout", 120)
-        return create_bash_execute_tool(workspace, memory_path, timeout)
-    if name == "check_processes":
-        return create_check_processes_tool(workspace, memory_path)
-    if name == "python_repl":
-        return create_python_repl_tool(workspace)
-    # Memory tool
-    if name == "memory":
-        return create_memory_tool(memory_path)
-    # Sub-agent tool
-    if name == "sub_agent":
-        model = config.get("model", "gpt-4o-mini")
-        return create_sub_agent_tool(workspace, model=model)
-    # Unknown tool - log warning and skip
-    import logging
-    logger = logging.getLogger(__name__)
-    logger.warning(f"Unknown tool name: {name}. Skipping.")
-    return None

 """MAF-specific tools for the Flow agent.
 This module provides tools that work with the Microsoft Agent Framework harness.
+Tools are created from the shared flow.tools module and adapted for MAF using
+the to_maf_tool adapter.
 Available tools:
+- read_file, write_file, edit_file, multi_edit, glob_files, grep, ls
+- bash, check_processes, python_repl
+- think, todo_write, todo_read
+- memory, skills, task
+- web_search, web_fetch
+- notebook_edit, notebook_read
 """
+import logging
+from collections.abc import Callable, Coroutine
 from pathlib import Path
 from typing import Any
+from flow.tools import (
+    # Coding
+    read_file, write_file, edit_file, multi_edit, glob_files, grep, ls,
+    # Execution
+    bash, check_processes, python_repl,
+    # Planning
+    think, todo_write, todo_read,
+    # Memory
+    memory, create_memory_tool,
+    # Web
+    web_search, web_fetch,
+    # Notebooks
+    notebook_edit, notebook_read,
+    # Skills
+    skills, create_skills_tool,
+    # Sub-agent
+    task, create_task_tool,
+    # Workspace management
+    set_workspace, Workspace,
+    # Adapters
+    to_maf_tool,
+    # Base
+    Tool,
 )
 __all__ = [
     "build_tools",
 ]
+logger = logging.getLogger(__name__)
 def build_tools(
     tools_spec: dict[str, dict[str, Any]],
     workspace: Path,
     memory_path: Path,
+) -> list[Callable[..., Coroutine[Any, Any, str]]]:
+    """Build MAF-compatible tool functions from a specification dict.
     This is the main entry point for creating tools based on a resolved
+    tool specification (from resolve_tools()). It uses the shared tools
+    from flow.tools and adapts them for MAF.
     Args:
         tools_spec: Dict mapping tool names to their config dicts.
+                   e.g., {"bash": {"timeout": 60}, "read_file": {}}
         workspace: Root directory for file operations
+        memory_path: Directory for persistent memory (deprecated, uses workspace)
     Returns:
+        List of tool functions wrapped with MAF's @tool decorator
     Example:
         >>> from flow.experiments.models import resolve_tools
         >>> tools = build_tools(tools_spec, workspace, memory_path)
     """
     workspace = Path(workspace).resolve()
+    # Set workspace for tools that need it (memory, todos, etc.)
+    set_workspace(Workspace(workspace))
+    # Map tool names → Tool instances
+    tool_map: dict[str, Tool] = {
+        # Coding/Filesystem
+        "read_file": read_file,
+        "write_file": write_file,
+        "edit_file": edit_file,
+        "multi_edit": multi_edit,
+        "glob_files": glob_files,
+        "ls": ls,
+        "grep": grep,
+        # Execution
+        "bash": bash,
+        "check_processes": check_processes,
+        "python_repl": python_repl,
+        # Planning
+        "think": think,
+        "todo_write": todo_write,
+        "todo_read": todo_read,
+        # Web
+        "web_search": web_search,
+        "web_fetch": web_fetch,
+        # Notebooks
+        "notebook_edit": notebook_edit,
+        "notebook_read": notebook_read,
+        # Memory (default instance)
+        "memory": memory,
+        # Skills (default instance)
+        "skills": skills,
+        # Task/sub-agent (default instance)
+        "task": task,
+    }
     tools: list[Callable[..., Coroutine[Any, Any, str]]] = []
+    for name, config in tools_spec.items():
+        if name == "task" and config:
+            # Task tool with custom config. Checked before the default lookup,
+            # which would otherwise always match "task".
+            custom_task = create_task_tool(
+                coordinator_tools=list(tool_map.values()),
+                model=config.get("model"),
+            )
+            tools.append(to_maf_tool(custom_task))
+        elif name == "skills" and config.get("additional_paths"):
+            # Skills with custom paths. Also checked before the default lookup.
+            custom_skills = create_skills_tool(
+                project_path=Path(config["additional_paths"][0])
+            )
+            tools.append(to_maf_tool(custom_skills))
+        elif name in tool_map:
+            # Convert shared Tool to MAF-decorated function
+            maf_tool = to_maf_tool(tool_map[name])
+            tools.append(maf_tool)
+        else:
+            logger.warning(f"Unknown tool name: {name}. Skipping.")
     return tools
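Reviewer sketch: resolving a tools spec combines a map of default tool instances with a few tools that accept custom configuration. The snippet below is a simplified illustration under assumed names. `resolve_spec`, `defaults`, and `factories` are hypothetical, and plain strings stand in for tool objects. In this sketch a configured factory is consulted before the default map, so a non-empty config for a name present in both takes the factory path.

```python
from __future__ import annotations

import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def resolve_spec(
    tools_spec: dict[str, dict[str, Any]],
    defaults: dict[str, Any],
    factories: dict[str, Callable[[dict[str, Any]], Any]],
) -> list[Any]:
    """Resolve tool names to instances, preferring configured factories."""
    resolved: list[Any] = []
    for name, config in tools_spec.items():
        if config and name in factories:
            # Non-empty config and a factory exists: build a custom instance
            resolved.append(factories[name](config))
        elif name in defaults:
            # Otherwise fall back to the shared default instance
            resolved.append(defaults[name])
        else:
            logger.warning("Unknown tool name: %s. Skipping.", name)
    return resolved


# Strings stand in for real tool objects in this illustration
defaults = {"bash": "bash-default", "task": "task-default"}
factories = {"task": lambda cfg: f"task-{cfg['model']}"}
spec = {"bash": {"timeout": 60}, "task": {"model": "small"}, "nope": {}}
resolved = resolve_spec(spec, defaults, factories)
```

Here "bash" has a config but no factory, so it resolves to the default instance, while the unknown name "nope" is skipped with a warning.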

src/flow/harness/maf/tools/coding.py DELETED

@@ -1,391 +0,0 @@
-"""Coding tools for file operations and code search.
-These tools enable agents to read/write files, list directories,
-and search for patterns in code.
-The agent can read and write to any path the user has access to.
-The workspace serves as the default working directory for relative paths.
-"""
-import re
-from collections.abc import Callable, Coroutine, Sequence
-from pathlib import Path
-from typing import Annotated, Any
-def create_read_file_tool(workspace: Path) -> Callable[..., Coroutine[Any, Any, str]]:
-    """Create a read_file tool that can read from any path.
-    Args:
-        workspace: Default directory for relative paths (not a restriction)
-    """
-    async def read_file(
-        file_path: Annotated[str, "Path to the file (absolute or relative to workspace)"],
-        max_lines: Annotated[int, "Maximum lines to return (default: 500)"] = 500,
-    ) -> str:
-        """Read the contents of a file. Can read from any path on the system."""
-        try:
-            # Support both absolute and relative paths
-            path = Path(file_path)
-            if path.is_absolute():
-                full_path = path.resolve()
-            else:
-                full_path = (workspace / file_path).resolve()
-            if not full_path.exists():
-                return f"Error: File not found: {file_path}"
-            if not full_path.is_file():
-                return f"Error: Not a file: {file_path}"
-            content = full_path.read_text(encoding="utf-8")
-            lines = content.splitlines()
-            # Apply line limit
-            total_lines = len(lines)
-            if len(lines) > max_lines:
-                lines = lines[:max_lines]
-                truncated_msg = f"\n... (truncated, showing first {max_lines} of {total_lines} lines)"
-            else:
-                truncated_msg = ""
-            # Format with line numbers
-            numbered_lines = [f"{i + 1:5d}: {line}" for i, line in enumerate(lines)]
-            result = "\n".join(numbered_lines) + truncated_msg
-            return f"File: {full_path} ({total_lines} lines)\n{'=' * 40}\n{result}"
-        except UnicodeDecodeError:
-            return f"Error: Cannot read file (binary or non-UTF-8): {file_path}"
-        except PermissionError:
-            return f"Error: Permission denied: {file_path}"
-        except Exception as e:
-            return f"Error reading file: {e}"
-    # Add tool metadata
-    read_file._tool_name = "read_file"  # type: ignore[attr-defined]
-    read_file._tool_description = (  # type: ignore[attr-defined]
-        "Read the contents of a file. Accepts absolute paths (e.g., /path/to/file) "
-        "or relative paths (relative to workspace). Returns content with line numbers."
-    )
-    read_file._is_tool = True  # type: ignore[attr-defined]
-    return read_file
-def create_write_file_tool(workspace: Path) -> Callable[..., Coroutine[Any, Any, str]]:
-    """Create a write_file tool.
-    Args:
-        workspace: Default directory for relative paths
-    """
-    async def write_file(
-        file_path: Annotated[str, "Path to the file (absolute or relative to workspace)"],
-        content: Annotated[str | None, "Full content to write (for complete file write)"] = None,
-        old_str: Annotated[str | None, "Text to replace (for str_replace operation)"] = None,
-        new_str: Annotated[str | None, "Replacement text (for str_replace operation)"] = None,
-        insert_line: Annotated[int | None, "Line number to insert at (1-indexed)"] = None,
-        insert_content: Annotated[str | None, "Content to insert at line"] = None,
-    ) -> str:
-        """Write or edit file content.
-        Supports: (1) full file write with 'content',
-        (2) str_replace to replace specific text,
-        (3) insert_at_line to add content at a specific line.
-        Creates parent directories if needed.
-        """
-        try:
-            # Support both absolute and relative paths
-            path = Path(file_path)
-            if path.is_absolute():
-                full_path = path.resolve()
-            else:
-                full_path = (workspace / file_path).resolve()
-            # Create parent directories
-            full_path.parent.mkdir(parents=True, exist_ok=True)
-            # Operation 1: Full file write
-            if content is not None:
-                full_path.write_text(content, encoding="utf-8")
-                return f"Successfully wrote {len(content)} characters to {file_path}"
-            # Operation 2: str_replace
-            if old_str is not None and new_str is not None:
-                if not full_path.exists():
-                    return f"Error: File not found for str_replace: {file_path}"
-                current_content = full_path.read_text(encoding="utf-8")
-                if old_str not in current_content:
-                    # Show a snippet of the file to help debug
-                    if len(current_content) > 500:
-                        snippet = current_content[:500] + "..."
-                    else:
-                        snippet = current_content
-                    return (
-                        f"Error: String to replace not found in file.\n"
-                        f"Searching for: '{old_str[:100]}...'\n"
-                        f"File content preview:\n{snippet}"
-                    )
-                # Replace first occurrence only
-                new_content = current_content.replace(old_str, new_str, 1)
-                full_path.write_text(new_content, encoding="utf-8")
-                return f"Successfully replaced text in {file_path}"
-            # Operation 3: insert_at_line
-            if insert_line is not None and insert_content is not None:
-                if full_path.exists():
-                    current_content = full_path.read_text(encoding="utf-8")
-                    lines = current_content.splitlines(keepends=True)
-                else:
-                    lines = []
-                # Ensure insert_content ends with newline
-                if not insert_content.endswith("\n"):
-                    insert_content += "\n"
-                # Insert at specified line (1-indexed)
-                insert_index = insert_line - 1
-                if insert_index < 0:
-                    return f"Error: Invalid line number: {insert_line}. Must be >= 1."
-                # Allow inserting at end
-                if insert_index > len(lines):
-                    insert_index = len(lines)
-                lines.insert(insert_index, insert_content)
-                new_content = "".join(lines)
-                full_path.write_text(new_content, encoding="utf-8")
-                return f"Successfully inserted content at line {insert_line} in {file_path}"
-            return "Error: Must provide either 'content', 'old_str' + 'new_str', or 'insert_line' + 'insert_content'"
-        except Exception as e:
-            return f"Error writing file: {e}"
-    # Add tool metadata
-    write_file._tool_name = "write_file"  # type: ignore[attr-defined]
-    write_file._tool_description = (  # type: ignore[attr-defined]
-        "Write or edit file content. Accepts absolute paths or relative paths (relative to workspace). "
-        "Supports: (1) full file write with 'content', (2) str_replace to replace specific text, "
-        "(3) insert_at_line to add content at a specific line. Creates parent directories if needed."
-    )
-    write_file._is_tool = True  # type: ignore[attr-defined]
-    return write_file
-def create_list_directory_tool(workspace: Path) -> Callable[..., Coroutine[Any, Any, str]]:
-    """Create a list_directory tool that can list any directory.
-    Args:
-        workspace: Default directory for relative paths (not a restriction)
-    """
-    async def list_directory(
-        directory_path: Annotated[str, "Path to directory (absolute or relative to workspace, default: '.')"] = ".",
-        recursive: Annotated[bool, "List subdirectories recursively (default: false)"] = False,
-        max_entries: Annotated[int, "Maximum entries to return (default: 200)"] = 200,
-    ) -> str:
-        """List files and directories at a given path. Can list any directory on the system."""
-        try:
-            # Support both absolute and relative paths
-            path = Path(directory_path)
-            if path.is_absolute():
-                full_path = path.resolve()
-            else:
-                full_path = (workspace / directory_path).resolve()
-            if not full_path.exists():
-                return f"Error: Directory not found: {directory_path}"
-            if not full_path.is_dir():
-                return f"Error: Not a directory: {directory_path}"
-            entries: list[tuple[str, str, int]] = []
-            if recursive:
-                for item in full_path.rglob("*"):
-                    if len(entries) >= max_entries:
-                        break
-                    # Skip common non-essential directories
-                    skip_dirs = ["node_modules", "__pycache__", ".git", "venv", ".venv"]
-                    if any(part in item.parts for part in skip_dirs):
-                        continue
-                    rel_path = item.relative_to(full_path)
-                    item_type = "file" if item.is_file() else "dir"
-                    size = item.stat().st_size if item.is_file() else 0
-                    entries.append((str(rel_path), item_type, size))
-            else:
-                for item in full_path.iterdir():
-                    if len(entries) >= max_entries:
-                        break
-                    item_type = "file" if item.is_file() else "dir"
-                    size = item.stat().st_size if item.is_file() else 0
-                    entries.append((item.name, item_type, size))
-            # Sort: directories first, then by name
-            entries.sort(key=lambda x: (x[1] != "dir", x[0]))
-            # Format output
-            result_lines = [f"Directory: {directory_path} ({len(entries)} entries)"]
-            result_lines.append("=" * 50)
-            for name, item_type, size in entries:
-                if item_type == "dir":
-                    result_lines.append(f"  [DIR]  {name}/")
-                else:
-                    size_str = f"{size:,} bytes" if size < 10000 else f"{size / 1024:.1f} KB"
-                    result_lines.append(f"  [FILE] {name} ({size_str})")
-            if len(entries) >= max_entries:
-                result_lines.append(f"\n... (truncated at {max_entries} entries)")
-            return "\n".join(result_lines)
-        except Exception as e:
-            return f"Error listing directory: {e}"
-    # Add tool metadata
-    list_directory._tool_name = "list_directory"  # type: ignore[attr-defined]
-    list_directory._tool_description = (  # type: ignore[attr-defined]
-        "List files and directories at a given path. Accepts absolute paths (e.g., /path/to/dir) "
-        "or relative paths (relative to workspace). Returns names, types, and sizes."
-    )
-    list_directory._is_tool = True  # type: ignore[attr-defined]
-    return list_directory
-def create_grep_search_tool(workspace: Path) -> Callable[..., Coroutine[Any, Any, str]]:
-    """Create a grep_search tool that can search any directory.
-    Args:
-        workspace: Default directory for relative paths (not a restriction)
-    """
-    async def grep_search(
-        pattern: Annotated[str, "Pattern to search for (regex supported)"],
-        path: Annotated[str, "Path to search in (absolute or relative to workspace, default: '.')"] = ".",
-        file_pattern: Annotated[str | None, "File pattern to filter (e.g., '*.py', '*.js')"] = None,
-        case_sensitive: Annotated[bool, "Case sensitive search (default: true)"] = True,
-        max_matches: Annotated[int, "Maximum matches to return (default: 50)"] = 50,
-    ) -> str:
-        """Search for text patterns in files. Can search any path on the system."""
-        try:
-            # Support both absolute and relative paths
-            search_path = Path(path)
-            if search_path.is_absolute():
-                full_path = search_path.resolve()
-            else:
-                full_path = (workspace / path).resolve()
-            if not full_path.exists():
-                return f"Error: Path not found: {path}"
-            # Compile regex
-            flags = 0 if case_sensitive else re.IGNORECASE
-            try:
-                regex = re.compile(pattern, flags)
-            except re.error as e:
-                return f"Error: Invalid regex pattern: {e}"
-            matches: list[dict[str, Any]] = []
-            # Get files to search
-            if full_path.is_file():
-                files = [full_path]
-            else:
-                if file_pattern:
-                    files = list(full_path.rglob(file_pattern))
-                else:
-                    files = [f for f in full_path.rglob("*") if f.is_file()]
-            # Search each file
-            for file_path_item in files:
-                if len(matches) >= max_matches:
-                    break
-                # Skip common non-essential directories and binary files
-                skip_dirs = ["node_modules", "__pycache__", ".git", "venv", ".venv"]
-                if any(part in file_path_item.parts for part in skip_dirs):
-                    continue
-                try:
-                    # Skip large files (> 1MB)
-                    if file_path_item.stat().st_size > 1_000_000:
-                        continue
-                    file_content = file_path_item.read_text(encoding="utf-8", errors="ignore")
-                    lines = file_content.splitlines()
-                    for line_num, line in enumerate(lines, 1):
-                        if len(matches) >= max_matches:
-                            break
-                        if regex.search(line):
-                            # Compute relative path from search root
-                            try:
-                                rel_path = file_path_item.relative_to(full_path)
-                            except ValueError:
-                                # If file is the search path itself, use filename
-                                rel_path = file_path_item.name
-                            matches.append({
-                                "file": str(rel_path),
-                                "line": line_num,
-                                "text": line.strip()[:200],
-                            })
-                except (UnicodeDecodeError, PermissionError):
-                    continue
-            # Format output
-            if not matches:
-                return f"No matches found for pattern '{pattern}' in {path}"
-            result_lines = [f"Found {len(matches)} match(es) for '{pattern}'"]
-            result_lines.append("=" * 50)
-            for match in matches:
-                result_lines.append(f"{match['file']}:{match['line']}: {match['text']}")
-            if len(matches) >= max_matches:
-                result_lines.append(f"\n... (truncated at {max_matches} matches)")
-            return "\n".join(result_lines)
-        except Exception as e:
-            return f"Error searching: {e}"
-    # Add tool metadata
-    grep_search._tool_name = "grep_search"  # type: ignore[attr-defined]
-    grep_search._tool_description = (  # type: ignore[attr-defined]
-        "Search for text patterns in files. Accepts absolute paths (e.g., /path/to/dir) "
-        "or relative paths (relative to workspace). Supports regex patterns and file filtering."
-    )
-    grep_search._is_tool = True  # type: ignore[attr-defined]
-    return grep_search
-def create_coding_tools(workspace: Path) -> Sequence[Callable[..., Coroutine[Any, Any, str]]]:
-    """Create all coding tools bound to a workspace.
-    Args:
-        workspace: Root directory for file operations
-    Returns:
-        List of coding tool functions
-    """
-    workspace = Path(workspace).resolve()
-    return [
-        create_read_file_tool(workspace),
-        create_write_file_tool(workspace),
-        create_list_directory_tool(workspace),
-        create_grep_search_tool(workspace),
-    ]
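
The removed coding tools share two patterns: a factory closes over `workspace`, and relative paths are resolved against it while absolute paths are used as given. The sketch below shows both under simplifying assumptions. `resolve_path` and `create_line_count_tool` are hypothetical names for illustration and do not appear in the deleted file; the `_tool_name` and `_is_tool` attributes mirror the metadata convention visible in the diff.

```python
import asyncio
import tempfile
from pathlib import Path


def resolve_path(workspace: Path, file_path: str) -> Path:
    """Use absolute paths as given; resolve relative ones against workspace."""
    path = Path(file_path)
    if path.is_absolute():
        return path.resolve()
    return (workspace / path).resolve()


def create_line_count_tool(workspace: Path):
    """Hypothetical factory returning an async tool bound to `workspace`."""

    async def line_count(file_path: str) -> str:
        full_path = resolve_path(workspace, file_path)
        if not full_path.is_file():
            # Errors are returned as strings, as in the removed tools.
            return f"Error: Not a file: {file_path}"
        return str(len(full_path.read_text(encoding="utf-8").splitlines()))

    # Metadata is attached as plain function attributes.
    line_count._tool_name = "line_count"  # type: ignore[attr-defined]
    line_count._is_tool = True  # type: ignore[attr-defined]
    return line_count


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    (workspace / "notes.txt").write_text("a\nb\nc\n", encoding="utf-8")
    tool = create_line_count_tool(workspace)
    print(asyncio.run(tool("notes.txt")))  # relative to workspace
    print(asyncio.run(tool(str(workspace / "notes.txt"))))  # absolute
```

Because the workspace is only a default base directory, this resolution does not confine access to it, which matches the "not a restriction" wording in the removed docstrings.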

src/flow/harness/maf/tools/core.py DELETED

@@ -1,100 +0,0 @@
-"""Core metacognitive tools for agent reasoning and task management.
-These tools enable agents to think explicitly, track task status,
-and make structured decisions during complex software engineering tasks.
-"""
-from collections.abc import Callable, Coroutine, Sequence
-from typing import Annotated, Any, Literal
-async def think(
-    thought: Annotated[
-        str,
-        (
-            "Your detailed reasoning about the current situation. "
-            "Include: what you've learned, options you're considering, "
-            "potential risks, and your planned approach."
-        ),
-    ],
-) -> str:
-    """Use this tool to pause and think through a complex problem.
-    Helpful when: (1) analyzing tool results, (2) planning multi-step approaches,
-    (3) making design decisions, (4) debugging issues, (5) avoiding mistakes.
-    Your reasoning is recorded and helps structure your approach.
-    """
-    # The value is in giving the LLM dedicated space to reason
-    summary = thought[:300] + "..." if len(thought) > 300 else thought
-    return f"Thought recorded: {summary}"
-async def task_done(
-    status: Annotated[
-        Literal["complete", "incomplete"],
-        "'complete' if task finished successfully, 'incomplete' if blocked or needs input",
-    ],
-    summary: Annotated[
-        str,
-        (
-            "Summary of what was accomplished. "
-            "If complete: what was done and how to use/test it. "
-            "If incomplete: what's blocking and what's needed."
-        ),
-    ],
-    files_created: Annotated[
-        list[str] | None,
-        "List of files created or modified (if any)",
-    ] = None,
-    next_steps: Annotated[
-        list[str] | None,
-        "Suggested next steps for the user (if any)",
-    ] = None,
-) -> str:
-    """Call this when you have completed the user's task.
-    Provide a summary of what was accomplished and any relevant details.
-    Use 'complete' if all requirements are satisfied,
-    'incomplete' if blocked or need more information.
-    """
-    result_lines = [
-        f"Task Status: {status.upper()}",
-        "",
-        "Summary:",
-        summary,
-    ]
-    if files_created:
-        result_lines.extend([
-            "",
-            "Files Created/Modified:",
-            *[f"  - {f}" for f in files_created],
-        ])
-    if next_steps:
-        result_lines.extend([
-            "",
-            "Suggested Next Steps:",
-            *[f"  - {step}" for step in next_steps],
-        ])
-    return "\n".join(result_lines)
-# Add tool metadata
-think._tool_name = "think"  # type: ignore[attr-defined]
-think._tool_description = think.__doc__ or ""  # type: ignore[attr-defined]
-think._is_tool = True  # type: ignore[attr-defined]
-task_done._tool_name = "task_done"  # type: ignore[attr-defined]
-task_done._tool_description = task_done.__doc__ or ""  # type: ignore[attr-defined]
-task_done._is_tool = True  # type: ignore[attr-defined]
-def create_core_tools() -> Sequence[Callable[..., Coroutine[Any, Any, str]]]:
-    """Create all core metacognitive tools.
-    Returns:
-        List of core tool functions
-    """
-    return [think, task_done]
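
The removed tools describe each parameter with a string inside `Annotated[...]`. The diff does not show how the harness consumes these strings, so the following is only a sketch of how the standard library exposes them. `describe_parameters` is a hypothetical helper, and the `task_done` signature here is shortened for illustration.

```python
import inspect
from typing import Annotated, Literal, get_args, get_origin, get_type_hints


async def task_done(
    status: Annotated[Literal["complete", "incomplete"], "Whether the task finished"],
    summary: Annotated[str, "What was accomplished"],
) -> str:
    """Shortened stand-in for the removed task_done tool."""
    return f"Task Status: {status.upper()}\n\nSummary:\n{summary}"


def describe_parameters(func) -> dict[str, str]:
    """Map each parameter name to its first string annotation, if any."""
    # include_extras=True keeps the Annotated metadata instead of stripping it.
    hints = get_type_hints(func, include_extras=True)
    descriptions: dict[str, str] = {}
    for name in inspect.signature(func).parameters:
        hint = hints.get(name)
        if get_origin(hint) is Annotated:
            # get_args returns (base_type, *metadata).
            metadata = get_args(hint)[1:]
            descriptions[name] = next(
                (item for item in metadata if isinstance(item, str)), ""
            )
    return descriptions


print(describe_parameters(task_done))
```

A schema builder could combine these descriptions with the base types to produce the parameter documentation an LLM sees, though whether the framework does exactly this is an assumption.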

src/flow/harness/maf/tools/execution.py DELETED

@@ -1,479 +0,0 @@
-"""Execution tools for running commands and code.
-These tools enable agents to execute bash commands and Python code
-with safety controls (timeouts, output limits), and manage background processes.
-"""
-import asyncio
-import os
-import re
-import signal
-import sys
-from collections.abc import Callable, Coroutine, Sequence
-from datetime import datetime
-from io import StringIO
-from pathlib import Path
-from typing import Annotated, Any, Literal
-def _get_process_registry_path(memory_path: Path) -> Path:
-    """Get the path to the process registry file in memory."""
-    return memory_path / "processes.md"
-def _ensure_process_registry(memory_path: Path) -> Path:
-    """Ensure the process registry file exists and return its path."""
-    registry_path = _get_process_registry_path(memory_path)
-    registry_path.parent.mkdir(parents=True, exist_ok=True)
-    if not registry_path.exists():
-        registry_path.write_text(
-            "# Background Processes\n\n"
-            "This file tracks background processes started by the Flow agent.\n"
-            "You can view this file with `memory(command='view', path='/memory/processes.md')`\n\n"
-            "## Running\n\n"
-            "## Stopped\n\n"
-        )
-    return registry_path
-def _add_process_to_registry(
-    memory_path: Path,
-    pid: int,
-    command: str,
-    workspace: str,
-    log_file: str,
-    port: int | None = None,
-) -> None:
-    """Add a process to the registry using checklist format."""
-    registry_path = _ensure_process_registry(memory_path)
-    content = registry_path.read_text()
-    # Extract port from command if not provided
-    if port is None:
-        port_match = re.search(r"(?:--port|-p)\s+(\d+)", command)
-        if port_match:
-            port = int(port_match.group(1))
-        elif ":8000" in command or "8000" in command:
-            port = 8000
-        elif ":3000" in command or "3000" in command:
-            port = 3000
-    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M")
-    port_str = f"Port: {port}" if port else "Port: -"
-    cmd_short = command[:60] + "..." if len(command) > 60 else command
-    workspace_short = workspace.split("/")[-1] if "/" in workspace else workspace
-    # Create checklist entry
-    entry = f"- [ ] **PID {pid}** | `{cmd_short}` | {timestamp} | {port_str} | {workspace_short}\n"
-    # Add under "## Running" section
-    if "## Running" in content:
-        content = content.replace("## Running\n\n", f"## Running\n\n{entry}")
-    else:
-        # Add Running section if missing
-        content += f"\n## Running\n\n{entry}"
-    registry_path.write_text(content)
-def _mark_process_stopped(memory_path: Path, pid: int, reason: str = "killed") -> None:
-    """Mark a process as stopped in the registry (check the box and move to Stopped)."""
-    registry_path = _get_process_registry_path(memory_path)
-    if not registry_path.exists():
-        return
-    content = registry_path.read_text()
-    lines = content.split("\n")
-    new_lines: list[str] = []
-    stopped_entry: str | None = None
-    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M")
-    for line in lines:
-        if f"**PID {pid}**" in line and "- [ ]" in line:
-            # Found the running process - mark it as checked and prepare for Stopped section
-            stopped_entry = line.replace("- [ ]", "- [x]") + f" | {reason} @ {timestamp}"
-            # Don't add to new_lines yet (will move to Stopped section)
-        else:
-            new_lines.append(line)
-    # Add stopped entry to Stopped section
-    if stopped_entry:
-        content = "\n".join(new_lines)
-        if "## Stopped" in content:
-            content = content.replace("## Stopped\n\n", f"## Stopped\n\n{stopped_entry}\n")
-        else:
-            content += f"\n## Stopped\n\n{stopped_entry}\n"
-        registry_path.write_text(content)
-def _is_process_running(pid: int) -> bool:
-    """Check if a process is still running."""
-    try:
-        os.kill(pid, 0)
-        return True
-    except (OSError, ProcessLookupError):
-        return False
-def _get_running_pids_from_registry(memory_path: Path) -> list[tuple[int, str]]:
-    """Get list of (pid, line) for processes marked as running in registry."""
-    registry_path = _get_process_registry_path(memory_path)
-    if not registry_path.exists():
-        return []
-    content = registry_path.read_text()
-    running: list[tuple[int, str]] = []
-    for line in content.split("\n"):
-        if "- [ ]" in line and "**PID" in line:
-            # Extract PID from format: **PID 12345**
-            match = re.search(r"\*\*PID (\d+)\*\*", line)
-            if match:
-                pid = int(match.group(1))
-                running.append((pid, line))
-    return running
-def create_bash_execute_tool(
-    workspace: Path, memory_path: Path, default_timeout: int = 120
-) -> Callable[..., Coroutine[Any, Any, str]]:
-    """Create a bash_execute tool bound to a specific workspace."""
-    async def bash_execute(
-        command: Annotated[str, "Bash command to execute"],
-        timeout: Annotated[int, f"Command timeout in seconds (default: {default_timeout})"] = default_timeout,
-        background: Annotated[
-            bool, "Run in background and return immediately with PID. Use for servers/long-running processes."
-        ] = False,
-    ) -> str:
-        """Execute bash commands in the workspace.
-        Returns stdout, stderr, and return code.
-        Use for running tests, git commands, package managers, builds, etc.
-        IMPORTANT: Each call runs in a fresh shell from workspace root -
-        use 'cd dir && command' for commands in subdirectories.
-        For long-running processes (servers), use background=True to avoid timeout.
-        """
-        try:
-            if background:
-                # Run in background using nohup and capture PID
-                # Redirect output to a log file
-                log_file = workspace / ".background_logs" / f"bg_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"
-                log_file.parent.mkdir(parents=True, exist_ok=True)
-                bg_command = f"nohup {command} > {log_file} 2>&1 & echo $!"
-                proc = await asyncio.create_subprocess_shell(
-                    bg_command,
-                    stdout=asyncio.subprocess.PIPE,
-                    stderr=asyncio.subprocess.PIPE,
-                    cwd=str(workspace),
-                )
-                stdout, _ = await proc.communicate()
-                pid_str = stdout.decode().strip()
-                try:
-                    pid = int(pid_str)
-                    # Register the process in memory
-                    _add_process_to_registry(
-                        memory_path=memory_path,
-                        pid=pid,
-                        command=command,
-                        workspace=str(workspace),
-                        log_file=str(log_file),
-                    )
-                    return (
-                        f"Background process started successfully.\n"
-                        f"PID: {pid}\n"
-                        f"Command: {command}\n"
-                        f"Log file: {log_file}\n"
-                        f"\nProcess registered in /memory/processes.md\n"
-                        f"Use check_processes(action='list') to see all background processes.\n"
-                        f"Use check_processes(action='kill', pid={pid}) to stop this process."
-                    )
-                except ValueError:
-                    return f"Error: Could not get PID. Output: {pid_str}"
-            # Regular (blocking) execution
-            proc = await asyncio.create_subprocess_shell(
-                command,
-                stdout=asyncio.subprocess.PIPE,
-                stderr=asyncio.subprocess.PIPE,
-                cwd=str(workspace),
-            )
-            try:
-                stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout=timeout)
-            except asyncio.TimeoutError:
-                proc.kill()
-                await proc.wait()
-                return (
-                    f"Error: Command timed out after {timeout} seconds.\n"
-                    f"Command: {command}\n\n"
-                    f"TIP: If this is a long-running process (like a server), "
-                    f"use background=True to run it in the background."
-                )
-            stdout_str = stdout.decode("utf-8", errors="replace")
-            stderr_str = stderr.decode("utf-8", errors="replace")
-            return_code = proc.returncode
-            # Format output
-            result_parts = [f"Command: {command}"]
-            result_parts.append(f"Return code: {return_code}")
-            result_parts.append("=" * 50)
-            if stdout_str.strip():
-                # Truncate very long output
-                if len(stdout_str) > 15000:
-                    stdout_str = stdout_str[:15000] + "\n... (stdout truncated)"
-                result_parts.append("STDOUT:")
-                result_parts.append(stdout_str)
-            if stderr_str.strip():
-                if len(stderr_str) > 5000:
-                    stderr_str = stderr_str[:5000] + "\n... (stderr truncated)"
-                result_parts.append("STDERR:")
-                result_parts.append(stderr_str)
-            if not stdout_str.strip() and not stderr_str.strip():
-                result_parts.append("(no output)")
-            return "\n".join(result_parts)
-        except Exception as e:
-            return f"Error executing command: {e}"
-    # Add tool metadata
-    bash_execute._tool_name = "bash_execute"  # type: ignore[attr-defined]
-    bash_execute._tool_description = (  # type: ignore[attr-defined]
-        "Execute bash commands in the workspace. "
-        "Returns stdout, stderr, and return code. "
-        "Use for running tests, git commands, package managers, builds, etc."
-    )
-    bash_execute._is_tool = True  # type: ignore[attr-defined]
-    return bash_execute
-def create_check_processes_tool(
-    workspace: Path, memory_path: Path
-) -> Callable[..., Coroutine[Any, Any, str]]:
-    """Create a tool to check and manage background processes."""
-    async def check_processes(
-        action: Annotated[
-            Literal["list", "kill", "cleanup"],
-            "'list' to see processes, 'kill' to stop one by PID, 'cleanup' to kill all",
-        ],
-        pid: Annotated[int | None, "PID to kill (required for 'kill' action)"] = None,
-    ) -> str:
-        """Check and manage background processes.
-        Use 'list' to see all background processes (also viewable at /memory/processes.md),
-        'kill' to stop a specific process by PID,
-        'cleanup' to kill all background processes from this workspace.
-        """
-        _ensure_process_registry(memory_path)
-        registry_path = _get_process_registry_path(memory_path)
-        if action == "list":
-            # Read the registry and update status of running processes
-            running_pids = _get_running_pids_from_registry(memory_path)
-            active_count = 0
-            dead_pids: list[int] = []
-            for proc_pid, _ in running_pids:
-                if _is_process_running(proc_pid):
-                    active_count += 1
-                else:
-                    dead_pids.append(proc_pid)
-            # Mark dead processes as stopped
-            for dead_pid in dead_pids:
-                _mark_process_stopped(memory_path, dead_pid, reason="exited")
-            # Return the updated registry
-            content = registry_path.read_text()
-            return (
-                f"Active background processes: {active_count}\n"
-                f"(View full registry at /memory/processes.md)\n\n"
-                f"{content}"
-            )
-        if action == "kill":
-            if pid is None:
-                return "Error: 'pid' is required for 'kill' action."
-            try:
-                os.kill(pid, signal.SIGTERM)
-                await asyncio.sleep(0.5)  # Give it time to terminate
-                # Check if it's really dead, if not SIGKILL
-                if _is_process_running(pid):
-                    os.kill(pid, signal.SIGKILL)
-                    await asyncio.sleep(0.2)
-                _mark_process_stopped(memory_path, pid, reason="killed")
-                if _is_process_running(pid):
-                    return f"Warning: Process {pid} may still be running after kill attempt."
-                return f"Successfully killed process {pid}. Updated /memory/processes.md"
-            except ProcessLookupError:
-                _mark_process_stopped(memory_path, pid, reason="not found")
-                return f"Process {pid} was not running (already terminated). Updated /memory/processes.md"
-            except PermissionError:
-                return f"Error: Permission denied to kill process {pid}."
-            except Exception as e:
-                return f"Error killing process {pid}: {e}"
-        if action == "cleanup":
-            # Kill all processes from this workspace
-            running_pids = _get_running_pids_from_registry(memory_path)
-            workspace_str = str(workspace)
-            killed: list[int] = []
-            failed: list[tuple[int, str]] = []
-            for proc_pid, line in running_pids:
-                # Check if this process is from our workspace
-                workspace_short = workspace_str.split("/")[-1]
-                if workspace_short in line or workspace_str in line:
-                    try:
-                        os.kill(proc_pid, signal.SIGTERM)
-                        await asyncio.sleep(0.2)
-                        if _is_process_running(proc_pid):
-                            os.kill(proc_pid, signal.SIGKILL)
-                        _mark_process_stopped(memory_path, proc_pid, reason="cleanup")
-                        killed.append(proc_pid)
-                    except (ProcessLookupError, PermissionError) as e:
-                        _mark_process_stopped(memory_path, proc_pid, reason=f"cleanup failed: {e}")
-                        failed.append((proc_pid, str(e)))
-            result = "Cleanup complete. Updated /memory/processes.md\n"
-            if killed:
-                result += f"Killed processes: {killed}\n"
-            if failed:
-                result += f"Failed to kill: {failed}\n"
-            if not killed and not failed:
-                result += "No active processes found for this workspace."
-            return result
-        return f"Unknown action: {action}"
-    # Add tool metadata
-    check_processes._tool_name = "check_processes"  # type: ignore[attr-defined]
-    check_processes._tool_description = (  # type: ignore[attr-defined]
-        "Check and manage background processes. "
-        "Use 'list' to see all background processes, "
-        "'kill' to stop a specific process by PID, "
-        "'cleanup' to kill all background processes from this workspace."
-    )
-    check_processes._is_tool = True  # type: ignore[attr-defined]
-    return check_processes
-def create_python_repl_tool(workspace: Path) -> Callable[..., Coroutine[Any, Any, str]]:
-    """Create a python_repl tool bound to a specific workspace."""
-    async def python_repl(
-        code: Annotated[str, "Python code to execute"],
-    ) -> str:
-        """Execute Python code in an isolated namespace.
-        Returns the output (stdout) or any errors.
-        Use for testing code snippets, calculations, data manipulation, or quick validation.
-        The WORKSPACE variable is available with the workspace path.
-        """
-        old_stdout = sys.stdout
-        old_stderr = sys.stderr
-        try:
-            # Capture stdout and stderr
-            redirected_output = StringIO()
-            redirected_error = StringIO()
-            sys.stdout = redirected_output
-            sys.stderr = redirected_error
-            # Create isolated namespace with builtins
-            namespace: dict[str, Any] = {
-                "__builtins__": __builtins__,
-                "__name__": "__main__",
-                "WORKSPACE": workspace,
-            }
-            try:
-                # Try to compile and exec
-                compiled = compile(code, "<repl>", "exec")
-                exec(compiled, namespace)  # noqa: S102
-                output = redirected_output.getvalue()
-                error = redirected_error.getvalue()
-                result_parts = ["Python REPL Output"]
-                result_parts.append("=" * 50)
-                if output.strip():
-                    if len(output) > 15000:
-                        output = output[:15000] + "\n... (output truncated)"
-                    result_parts.append(output)
-                if error.strip():
-                    result_parts.append("STDERR:")
-                    result_parts.append(error)
-                if not output.strip() and not error.strip():
-                    result_parts.append("(code executed successfully, no output)")
-                return "\n".join(result_parts)
-            except SyntaxError as e:
-                return f"SyntaxError: {e}"
-            except Exception as e:
-                return f"Error: {type(e).__name__}: {e}"
-        finally:
-            sys.stdout = old_stdout
-            sys.stderr = old_stderr
-    # Add tool metadata
-    python_repl._tool_name = "python_repl"  # type: ignore[attr-defined]
-    python_repl._tool_description = (  # type: ignore[attr-defined]
-        "Execute Python code in an isolated namespace. "
-        "Returns the output (stdout) or any errors. "
-        "Use for testing code snippets, calculations, data manipulation, or quick validation."
-    )
-    python_repl._is_tool = True  # type: ignore[attr-defined]
-    return python_repl
-def create_execution_tools(
-    workspace: Path,
-    memory_path: Path,
-    bash_timeout: int = 120,
-) -> Sequence[Callable[..., Coroutine[Any, Any, str]]]:
-    """Create all execution tools bound to a workspace.
-    Args:
-        workspace: Root directory for command execution
-        memory_path: Path to memory directory for process registry
-        bash_timeout: Default timeout for bash commands in seconds
-    Returns:
-        List of execution tool functions
-    """
-    workspace = Path(workspace).resolve()
-    memory_path = Path(memory_path).resolve()
-    return [
-        create_bash_execute_tool(workspace, memory_path, bash_timeout),
-        create_check_processes_tool(workspace, memory_path),
-        create_python_repl_tool(workspace),
-    ]
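
Note on the removed `python_repl` tool: it swaps `sys.stdout` and `sys.stderr` by hand and restores them in a `finally` block. The same capture can be sketched with the standard library's `contextlib` redirectors, which restore the streams automatically. This is only an illustration: `run_snippet` is a hypothetical name that does not appear in this commit, and the sketch leaves out the 15000-character truncation and the `WORKSPACE` variable.

```python
import contextlib
import io
import sys


def run_snippet(code: str) -> tuple[str, str]:
    """Execute code in a fresh namespace and return (stdout, stderr) as strings."""
    out, err = io.StringIO(), io.StringIO()
    namespace = {"__name__": "__main__"}
    # Both streams are restored when the with-block exits, even on error.
    with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
        try:
            exec(compile(code, "<repl>", "exec"), namespace)  # noqa: S102
        except Exception as exc:
            # sys.stderr is the redirected buffer at this point.
            print(f"{type(exc).__name__}: {exc}", file=sys.stderr)
    return out.getvalue(), err.getvalue()
```

As in the removed tool, executing arbitrary code this way isolates only the namespace, not the process, so it is not a sandbox.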

src/flow/harness/maf/tools/memory.py DELETED

@@ -1,260 +0,0 @@
-"""Memory tool for persistent storage across sessions.
-Provides file-based memory storage allowing agents to store and retrieve
-information, patterns, and decisions across conversations.
-"""
-from collections.abc import Callable, Coroutine
-from pathlib import Path
-from typing import Annotated, Any, Literal
-class MemoryBackend:
-    """File-based memory storage backend with security controls."""
-    def __init__(self, base_path: Path) -> None:
-        """Initialize memory backend."""
-        self.base_path = Path(base_path).resolve()
-        self.base_path.mkdir(parents=True, exist_ok=True)
-    def _validate_path(self, path: str) -> Path:
-        """Validate and resolve a memory path."""
-        # Normalize path (remove /memory prefix if present)
-        if path.startswith("/memory"):
-            path = path[len("/memory") :]
-        path = path.lstrip("/")
-        # Handle empty path
-        if not path:
-            return self.base_path
-        # Resolve to absolute path
-        full_path = (self.base_path / path).resolve()
-        # Security: Ensure path is within base_path
-        try:
-            full_path.relative_to(self.base_path)
-        except ValueError as err:
-            raise ValueError(f"Access denied: path '{path}' is outside memory directory") from err
-        return full_path
-    def view(self, path: str, view_range: list[int] | None = None) -> str:
-        """View directory contents or file contents."""
-        full_path = self._validate_path(path)
-        if not full_path.exists():
-            return f"Path not found: {path}\nUse 'create' to create new files."
-        # Directory listing
-        if full_path.is_dir():
-            contents = [f"Directory: {path or '/memory'}"]
-            items = sorted(full_path.iterdir(), key=lambda x: (x.is_file(), x.name))
-            if not items:
-                contents.append("(empty directory)")
-            else:
-                for item in items:
-                    suffix = "/" if item.is_dir() else ""
-                    contents.append(f"  - {item.name}{suffix}")
-            return "\n".join(contents)
-        # File contents
-        if full_path.is_file():
-            content = full_path.read_text(encoding="utf-8")
-            lines = content.splitlines()
-            if view_range:
-                start, end = view_range
-                start = max(1, start)
-                end = min(len(lines), end)
-                lines = lines[start - 1 : end]
-                numbered_lines = [f"{i + start:5d}: {line}" for i, line in enumerate(lines)]
-            else:
-                numbered_lines = [f"{i + 1:5d}: {line}" for i, line in enumerate(lines)]
-            return "\n".join(numbered_lines) if numbered_lines else "(empty file)"
-        return f"Unknown path type: {path}"
-    def create(self, path: str, file_text: str) -> str:
-        """Create or overwrite a file."""
-        full_path = self._validate_path(path)
-        full_path.parent.mkdir(parents=True, exist_ok=True)
-        full_path.write_text(file_text, encoding="utf-8")
-        return f"File created successfully at {path}"
-    def str_replace(self, path: str, old_str: str, new_str: str) -> str:
-        """Replace text in a file."""
-        full_path = self._validate_path(path)
-        if not full_path.is_file():
-            raise FileNotFoundError(f"File not found: {path}")
-        content = full_path.read_text(encoding="utf-8")
-        if old_str not in content:
-            raise ValueError(f"Text not found in file: '{old_str[:50]}...'")
-        new_content = content.replace(old_str, new_str, 1)
-        full_path.write_text(new_content, encoding="utf-8")
-        return f"File {path} has been edited successfully"
-    def append(self, path: str, text: str) -> str:
-        """Append text to end of file."""
-        full_path = self._validate_path(path)
-        if not full_path.exists():
-            full_path.parent.mkdir(parents=True, exist_ok=True)
-            full_path.write_text("", encoding="utf-8")
-        # Ensure text starts with newline if file isn't empty
-        if full_path.stat().st_size > 0:
-            existing = full_path.read_text(encoding="utf-8")
-            if existing and not existing.endswith("\n"):
-                text = "\n" + text
-        # Ensure text ends with newline
-        if not text.endswith("\n"):
-            text += "\n"
-        with full_path.open("a", encoding="utf-8") as f:
-            f.write(text)
-        return f"Text appended to {path}"
-    def search(self, query: str, path: str = "") -> str:
-        """Search for text across memory files."""
-        full_path = self._validate_path(path)
-        if not full_path.exists():
-            return f"Path not found: {path or '/memory'}"
-        if not full_path.is_dir():
-            # Search single file
-            files = [full_path]
-        else:
-            files = list(full_path.rglob("*"))
-        matches: list[dict[str, Any]] = []
-        query_lower = query.lower()
-        for file_path in files:
-            if not file_path.is_file():
-                continue
-            try:
-                content = file_path.read_text(encoding="utf-8")
-                lines = content.splitlines()
-                for line_num, line in enumerate(lines, 1):
-                    if query_lower in line.lower():
-                        rel_path = file_path.relative_to(self.base_path)
-                        matches.append({
-                            "file": str(rel_path),
-                            "line": line_num,
-                            "content": line.strip()[:100],
-                        })
-            except (UnicodeDecodeError, PermissionError):
-                continue
-        if not matches:
-            return f"No matches found for '{query}' in {path or '/memory'}"
-        result_lines = [f"Found {len(matches)} match(es) for '{query}':\n"]
-        for match in matches[:50]:
-            result_lines.append(f"  {match['file']}:{match['line']} - {match['content']}")
-        if len(matches) > 50:
-            result_lines.append(f"\n... and {len(matches) - 50} more matches")
-        return "\n".join(result_lines)
-    def delete(self, path: str) -> str:
-        """Delete a file or empty directory."""
-        full_path = self._validate_path(path)
-        if not full_path.exists():
-            raise FileNotFoundError(f"Path not found: {path}")
-        if full_path.is_file():
-            full_path.unlink()
-            return f"File deleted: {path}"
-        if full_path.is_dir():
-            if any(full_path.iterdir()):
-                raise ValueError(f"Directory not empty: {path}. Delete contents first.")
-            full_path.rmdir()
-            return f"Directory deleted: {path}"
-        return f"Unknown path type: {path}"
-def create_memory_tool(memory_path: Path) -> Callable[..., Coroutine[Any, Any, str]]:
-    """Create a memory tool bound to a specific memory directory."""
-    backend = MemoryBackend(memory_path)
-    async def memory(
-        command: Annotated[
-            Literal["view", "create", "str_replace", "append", "search", "delete"],
-            "Operation to perform",
-        ],
-        path: Annotated[str, "Path to file or directory (e.g., '/memory/patterns/cors.md')"] = "/memory",
-        file_text: Annotated[str | None, "Content to write (for create)"] = None,
-        old_str: Annotated[str | None, "Text to find (for str_replace)"] = None,
-        new_str: Annotated[str | None, "Replacement text (for str_replace)"] = None,
-        append_text: Annotated[str | None, "Text to append (for append)"] = None,
-        query: Annotated[str | None, "Search query (for search)"] = None,
-        view_range: Annotated[list[int] | None, "Line range [start, end] (for view)"] = None,
-    ) -> str:
-        """Store and retrieve information in persistent memory.
-        Memory persists across conversations - use it to remember patterns,
-        insights, project context, and decisions.
-        Operations: view (show directory/file), create (new file),
-        str_replace (edit file), append (add to file),
-        search (find text), delete (remove file/dir).
-        Organize by: /memory/patterns/, /memory/projects/, /memory/decisions/
-        """
-        try:
-            if command == "view":
-                return backend.view(path, view_range)
-            if command == "create":
-                if file_text is None:
-                    return "Error: 'file_text' is required for create operation"
-                return backend.create(path, file_text)
-            if command == "str_replace":
-                if old_str is None or new_str is None:
-                    return "Error: 'old_str' and 'new_str' are required for str_replace"
-                return backend.str_replace(path, old_str, new_str)
-            if command == "append":
-                if append_text is None:
-                    return "Error: 'append_text' is required for append operation"
-                return backend.append(path, append_text)
-            if command == "search":
-                if query is None:
-                    return "Error: 'query' is required for search operation"
-                return backend.search(query, path)
-            if command == "delete":
-                return backend.delete(path)
-            return f"Error: Unknown command: {command}"
-        except Exception as e:
-            return f"Memory operation failed: {e}"
-    # Add tool metadata
-    memory._tool_name = "memory"  # type: ignore[attr-defined]
-    memory._tool_description = (  # type: ignore[attr-defined]
-        "Store and retrieve information in persistent memory. "
-        "Memory persists across conversations - use it to remember patterns, "
-        "insights, project context, and decisions."
-    )
-    memory._is_tool = True  # type: ignore[attr-defined]
-    return memory
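
Note on the removed `MemoryBackend._validate_path`: the containment check resolves the requested path and rejects anything that lands outside the memory directory. A minimal sketch of that check, assuming Python 3.9+ for `str.removeprefix` and `Path.is_relative_to`; `resolve_inside` is a hypothetical helper name, not something from this commit.

```python
from pathlib import Path


def resolve_inside(base: Path, user_path: str) -> Path:
    """Resolve user_path under base, rejecting anything that escapes it."""
    base = base.resolve()
    # Drop the virtual "/memory" prefix, then any leading slashes.
    relative = user_path.removeprefix("/memory").lstrip("/")
    candidate = (base / relative).resolve()
    # resolve() collapses ".." segments, so the check sees the real target.
    if not candidate.is_relative_to(base):
        raise ValueError(f"Access denied: {user_path!r} is outside {base}")
    return candidate
```

Resolving before comparing is what matters here: comparing the unresolved strings would let `../` segments slip through.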

src/flow/harness/maf/tools/sub_agent.py DELETED

@@ -1,196 +0,0 @@
-"""Sub-agent tool for isolated research tasks.
-Provides context isolation by delegating complex research tasks to a
-separate agent that operates in its own context window. The sub-agent
-processes the request and returns only a concise summary, preventing
-context pollution in the main agent.
-This implements the "Isolation" strategy for context engineering:
-- Coordinator agent stays lean with minimal context
-- Sub-agent can use 30K+ tokens internally for research
-- Only the distilled result (200-500 tokens) returns to coordinator
-"""
-from __future__ import annotations
-import os
-from collections.abc import Callable, Coroutine
-from pathlib import Path
-from typing import Annotated, Any
-# Sub-agent system prompt focused on research and summarization
-SUB_AGENT_INSTRUCTIONS = """You are a research assistant that helps with complex information gathering tasks.
-Your role:
-1. Thoroughly research the given topic or question
-2. Gather relevant information from available tools
-3. Synthesize findings into a clear, concise summary
-4. Return ONLY the essential information needed by the requesting agent
-Guidelines:
-- Be thorough in your research but concise in your response
-- Focus on facts and actionable information
-- If you can't find information, say so clearly
-- Your response will be passed to another agent, so make it self-contained
-- Target 200-500 tokens for your final response unless more detail is explicitly requested
-Do NOT:
-- Include conversational fluff or preamble
-- Repeat the original question back
-- Add disclaimers about your limitations
-- Include information that wasn't requested
-"""
-def create_sub_agent_tool(
-    workspace: Path,
-    model: str = "gpt-4o-mini",
-    endpoint: str | None = None,
-    api_key: str | None = None,
-    api_version: str = "2024-02-15-preview",
-) -> Callable[..., Coroutine[Any, Any, str]]:
-    """Create a sub-agent tool for isolated research tasks.
-    The sub-agent runs in its own isolated context, preventing context
-    pollution in the main agent. This is useful for:
-    - Complex research that requires many tool calls
-    - Tasks that generate lots of intermediate content
-    - Keeping the main agent's context lean and focused
-    Args:
-        workspace: Workspace directory for file operations
-        model: Model to use for sub-agent (default: gpt-4o-mini for efficiency)
-        endpoint: Azure OpenAI endpoint (defaults to AZURE_OPENAI_ENDPOINT env var)
-        api_key: Azure OpenAI API key (defaults to AZURE_OPENAI_API_KEY env var)
-        api_version: Azure OpenAI API version
-    Returns:
-        An async function that can be used as a tool
-    """
-    # Resolve credentials from environment if not provided
-    _endpoint = endpoint or os.environ.get("AZURE_OPENAI_ENDPOINT", "")
-    _api_key = api_key or os.environ.get("AZURE_OPENAI_API_KEY", "")
-    # Lazy import to avoid circular dependencies
-    _sub_agent: Any = None
-    async def _ensure_sub_agent() -> Any:
-        """Lazily create the sub-agent on first use."""
-        nonlocal _sub_agent
-        if _sub_agent is not None:
-            return _sub_agent
-        try:
-            from agent_framework import ChatAgent
-            from agent_framework.azure import AzureOpenAIChatClient
-        except ImportError as e:
-            raise ImportError(
-                "Microsoft Agent Framework is required for sub-agent. "
-                "Install with: pip install agent-framework-core"
-            ) from e
-        # Create a lightweight chat client for the sub-agent
-        # Uses a smaller/faster model by default for efficiency
-        client = AzureOpenAIChatClient(
-            api_key=_api_key,
-            endpoint=_endpoint,
-            deployment=model,
-            api_version=api_version,
-        )
-        # Create basic tools for the sub-agent
-        # Keep it minimal - just what's needed for research
-        from flow.harness.maf.tools.coding import (
-            create_grep_search_tool,
-            create_list_directory_tool,
-            create_read_file_tool,
-        )
-        from flow.harness.maf.tools.core import task_done, think
-        sub_tools: list[Callable[..., Any]] = [
-            create_read_file_tool(workspace),
-            create_list_directory_tool(workspace),
-            create_grep_search_tool(workspace),
-            think,
-            task_done,
-        ]
-        # Convert tools to agent_framework format
-        from agent_framework import ai_function
-        converted_tools = []
-        for tool_func in sub_tools:
-            name = getattr(tool_func, "_tool_name", tool_func.__name__)
-            description = getattr(tool_func, "_tool_description", tool_func.__doc__ or "")
-            wrapped = ai_function(name=name, description=description)(tool_func)
-            converted_tools.append(wrapped)
-        _sub_agent = ChatAgent(
-            name="ResearchAssistant",
-            description="Research assistant for complex information gathering",
-            instructions=SUB_AGENT_INSTRUCTIONS,
-            chat_client=client,
-            tools=converted_tools,
-        )
-        return _sub_agent
-    async def research(
-        task: Annotated[
-            str,
-            "The research task or question to investigate. Be specific about what information you need.",
-        ],
-        context: Annotated[
-            str | None,
-            "Optional context to help the sub-agent understand the broader goal.",
-        ] = None,
-    ) -> str:
-        """Delegate a research task to a sub-agent with isolated context.
-        Use this tool when you need to:
-        - Research a complex topic that may require multiple steps
-        - Gather information without polluting your main context
-        - Get a summarized answer to a specific question
-        The sub-agent operates in its own context window, so it can
-        use many tokens internally while only returning a concise summary.
-        This keeps your main context lean and focused.
-        Examples:
-        - "Find all Python files that import the requests library and summarize their purpose"
-        - "Research how authentication is implemented in this codebase"
-        - "Analyze the error handling patterns used across the project"
-        """
-        sub_agent = await _ensure_sub_agent()
-        # Build the research prompt
-        prompt_parts = [f"Research task: {task}"]
-        if context:
-            prompt_parts.insert(0, f"Context: {context}")
-        prompt_parts.append("\nProvide a concise summary of your findings.")
-        full_prompt = "\n\n".join(prompt_parts)
-        try:
-            # Run the sub-agent - it operates in isolated context
-            response = await sub_agent.run(full_prompt)
-            # Extract text content from response
-            if hasattr(response, "content"):
-                return str(response.content)
-            return str(response)
-        except Exception as e:
-            return f"Research failed: {e}"
-    # Add tool metadata
-    research._tool_name = "research"  # type: ignore[attr-defined]
-    research._tool_description = (  # type: ignore[attr-defined]
-        "Delegate a research task to a sub-agent with isolated context. "
-        "The sub-agent can thoroughly investigate a topic using many tool calls "
-        "internally, then return only a concise summary. Use this for complex "
-        "research that would otherwise pollute your main context."
-    )
-    research._is_tool = True  # type: ignore[attr-defined]
-    return research

src/flow/harness/maf/wrappers.py ADDED

@@ -0,0 +1,64 @@

+"""MAF-specific tool wrappers.
+This module provides utilities for wrapping shared tools for use with
+Microsoft Agent Framework. The main functionality is now handled by
+the shared adapters in flow.tools.adapters.
+This module is maintained for backward compatibility.
+"""
+from __future__ import annotations
+import logging
+from collections.abc import Callable, Coroutine
+from pathlib import Path
+from typing import Any
+from flow.tools import Tool, to_maf_tool
+from flow.harness.maf.tools import build_tools as build_maf_tools_impl
+logger = logging.getLogger(__name__)
+__all__ = ["build_maf_tools", "wrap_for_maf"]
+def wrap_for_maf(tool: Tool) -> Callable[..., Coroutine[Any, Any, str]]:
+    """Wrap a Flow Tool for Microsoft Agent Framework.
+    Applies the MAF @tool decorator using metadata from the Tool instance.
+    Args:
+        tool: A Flow Tool instance
+    Returns:
+        The function wrapped with @tool for MAF
+    Raises:
+        ValueError: If the input is not a Tool instance
+    """
+    if not isinstance(tool, Tool):
+        raise ValueError(f"Expected Tool instance, got {type(tool)}")
+    return to_maf_tool(tool)
+def build_maf_tools(
+    tools_spec: dict[str, dict[str, Any]],
+    workspace: Path,
+    memory_path: Path,
+) -> list[Callable[..., Coroutine[Any, Any, str]]]:
+    """Build MAF-compatible tools from a specification dict.
+    Creates MAF-specific tools using the shared tools from flow.tools
+    and wraps them with the MAF @tool decorator.
+    Args:
+        tools_spec: Dict mapping tool names to their config dicts.
+        workspace: Root directory for file operations
+        memory_path: Directory for persistent memory
+    Returns:
+        List of tool functions wrapped with @tool
+    """
+    # Build tools from MAF-specific module (already wrapped with MAF @tool)
+    return build_maf_tools_impl(tools_spec, workspace, memory_path)

src/flow/harness/miniagent/__init__.py ADDED

@@ -0,0 +1,139 @@

+"""MiniAgent harness for Flow - correct context compaction.
+MiniAgent fixes Agent Framework's broken context compaction by:
+1. Applying compaction BEFORE each LLM call in the tool loop
+2. Reassigning the message list (not modifying a copy)
+3. Supporting token-budget-based strategies
+## Usage in Flow
+    from flow.experiments.models import Agent, CompactionConfig
+    agent = Agent(
+        name="my-agent",
+        framework="miniagent",  # Use this harness
+        compaction=CompactionConfig.head_tail_tokens(head_ratio=0.2, token_budget=50_000),
+        tools="standard",
+    )
+## Direct Usage
+    from flow.harness.miniagent import ChatAgent, HeadTailStrategy
+    agent = ChatAgent(
+        instructions="You are a helpful assistant.",
+        tools=tools.coding_tools(),
+        context_strategy=HeadTailStrategy(),
+        token_budget=100_000,
+    )
+    response = await agent.run("Find all Python files with TODO comments")
+## Context Strategies
+- NoCompactionStrategy: Baseline (no management)
+- HeadTailStrategy: Keep head (20%) + tail (80%), drop middle (token-aware)
+- SlidingWindowStrategy: Keep system + recent messages within budget
+- SummarizationStrategy: Compress old messages using LLM
+## Key Difference from Agent Framework
+Agent Framework's tool loop:
+    prepped_messages = prepare_messages(messages)  # Copy made ONCE
+    for iteration in range(max_iterations):
+        middleware(context)  # Modifies a DIFFERENT copy
+        response = llm_call(prepped_messages)
+        prepped_messages.extend(results)  # List grows unbounded
+MiniAgent's tool loop:
+    for iteration in range(max_iterations):
+        messages = compact(messages)  # Compacted list REPLACES original
+        response = llm_call(messages)
+        messages.extend(results)  # Next iteration will compact again
+"""
+from .agent import ChatAgent, AgentThread, AgentResponse, UsageStats, StreamEvent, StreamEventType
+from .tool import Tool, tool
+from .messages import ChatMessage, ToolCall, ToolResult
+from .context import (
+    ContextStrategy,
+    NoCompactionStrategy,
+    HeadTailStrategy,
+    SlidingWindowStrategy,
+    SummarizationStrategy,
+)
+from .client import ChatClient, ClientConfig, ChatCompletionResult
+from .hooks import (
+    Hooks,
+    HookEvent,
+    PreToolUseEvent,
+    PreToolUseResult,
+    PostToolUseEvent,
+    PostToolUseResult,
+    PreModelCallEvent,
+    PostModelCallEvent,
+    PreCompactEvent,
+    PostCompactEvent,
+    AgentStartEvent,
+    AgentEndEvent,
+)
+from .instructions import get_instructions, INSTRUCTIONS
+from .workspace import Workspace, get_workspace, set_workspace
+from . import tools
+# Register with Flow's harness system
+from flow.harness.registry import register
+from .harness import MiniAgentHarness
+register("miniagent", MiniAgentHarness)
+__version__ = "0.1.0"
+__all__ = [
+    # Harness
+    "MiniAgentHarness",
+    # Core
+    "ChatAgent",
+    "AgentThread",
+    "AgentResponse",
+    "UsageStats",
+    "StreamEvent",
+    "StreamEventType",
+    # Tools
+    "Tool",
+    "tool",
+    "tools",
+    # Messages
+    "ChatMessage",
+    "ToolCall",
+    "ToolResult",
+    # Context strategies
+    "ContextStrategy",
+    "NoCompactionStrategy",
+    "HeadTailStrategy",
+    "SlidingWindowStrategy",
+    "SummarizationStrategy",
+    # Client
+    "ChatClient",
+    "ClientConfig",
+    "ChatCompletionResult",
+    # Hooks
+    "Hooks",
+    "HookEvent",
+    "PreToolUseEvent",
+    "PreToolUseResult",
+    "PostToolUseEvent",
+    "PostToolUseResult",
+    "PreModelCallEvent",
+    "PostModelCallEvent",
+    "PreCompactEvent",
+    "PostCompactEvent",
+    "AgentStartEvent",
+    "AgentEndEvent",
+    # Instructions
+    "get_instructions",
+    "INSTRUCTIONS",
+    # Workspace
+    "Workspace",
+    "get_workspace",
+    "set_workspace",
+]
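
Note on the module docstring above: the point it makes is that the compacted list must replace the working list on every iteration, before the model is called. The loop can be sketched roughly as follows. Everything here is illustrative: `head_tail` and `run_loop` are hypothetical names, the budget counts messages rather than tokens (the docstring describes the real `HeadTailStrategy` as token-aware), and `llm_call` stands in for the model client.

```python
from typing import Callable

Message = dict[str, str]


def head_tail(messages: list[Message], max_messages: int, head_ratio: float = 0.2) -> list[Message]:
    """Keep the oldest head_ratio share and the most recent remainder; drop the middle."""
    if len(messages) <= max_messages:
        return messages
    head = max(1, int(max_messages * head_ratio))
    tail = max_messages - head
    return messages[:head] + messages[len(messages) - tail:]


def run_loop(
    messages: list[Message],
    llm_call: Callable[[list[Message]], Message],
    max_iterations: int,
    max_messages: int,
) -> tuple[list[Message], list[int]]:
    """Compact before every call and record the size the model actually saw."""
    messages = list(messages)  # work on a copy so the caller's list is untouched
    sizes: list[int] = []
    for _ in range(max_iterations):
        messages = head_tail(messages, max_messages)  # compacted list replaces the original
        sizes.append(len(messages))
        messages.append(llm_call(messages))  # the next iteration compacts again
    return messages, sizes
```

With a budget of five messages, the sizes recorded in `sizes` stop growing once the budget is reached, while the first (system) message is always retained.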

src/flow/harness/miniagent/agent.py ADDED

@@ -0,0 +1,604 @@

+"""ChatAgent - the core agent implementation for MiniAgent.
+This is the CRITICAL module that fixes Agent Framework's broken compaction.
+The key difference: context strategy is called BEFORE each LLM call in the
+tool loop, and the compacted list continues to the next iteration.
+"""
+from dataclasses import dataclass, field
+from typing import Any, AsyncGenerator
+from enum import Enum
+import json
+from .messages import ChatMessage, ToolCall
+from .tool import Tool
+from .client import ChatClient, ChatCompletionResult
+from .context import ContextStrategy, NoCompactionStrategy
+from .hooks import (
+    Hooks,
+    PreToolUseEvent,
+    PreToolUseResult,
+    PostToolUseEvent,
+    PostToolUseResult,
+    PreModelCallEvent,
+    PostModelCallEvent,
+    PreCompactEvent,
+    PostCompactEvent,
+    AgentStartEvent,
+    AgentEndEvent,
+)
+class StreamEventType(str, Enum):
+    """Types of events emitted during run_stream()."""
+    AGENT_START = "agent_start"
+    MODEL_START = "model_start"
+    MODEL_END = "model_end"
+    TOOL_START = "tool_start"
+    TOOL_END = "tool_end"
+    TEXT = "text"
+    AGENT_END = "agent_end"
+def _dict_factory() -> dict[str, Any]:
+    return {}
+def _list_factory() -> list[dict[str, int]]:
+    return []
+@dataclass
+class StreamEvent:
+    """Event emitted during agent execution streaming."""
+    type: StreamEventType
+    data: dict[str, Any] = field(default_factory=_dict_factory)
+    def __str__(self) -> str:
+        """Human-readable representation for print(event)."""
+        match self.type:
+            case StreamEventType.AGENT_START:
+                msg = self.data.get("user_message", "")[:50]
+                return f"🚀 Agent started: {msg}..."
+            case StreamEventType.MODEL_START:
+                return f"🧠 Model call (iteration {self.data.get('iteration', 0) + 1})"
+            case StreamEventType.MODEL_END:
+                usage = self.data.get("usage", {})
+                tokens = usage.get("input_tokens", 0)
+                has_tools = self.data.get("has_tool_calls", False)
+                tool_info = " → calling tools" if has_tools else ""
+                return f"   ✓ Response ({tokens} tokens){tool_info}"
+            case StreamEventType.TOOL_START:
+                name = self.data.get("tool_name", "unknown")
+                return f"🔧 Tool: {name}"
+            case StreamEventType.TOOL_END:
+                name = self.data.get("tool_name", "unknown")
+                output = self.data.get("tool_output", "")[:100]
+                return f"   → {name}: {output}..."
+            case StreamEventType.TEXT:
+                content = self.data.get("content", "")[:200]
+                return f"💬 {content}"
+            case StreamEventType.AGENT_END:
+                usage = self.data.get("usage", {})
+                iters = self.data.get("iterations", 0)
+                tools = usage.get("tool_calls", 0)
+                tokens = usage.get("total_input_tokens", 0) + usage.get("total_output_tokens", 0)
+                return f"✅ Done ({iters} iterations, {tools} tool calls, {tokens} tokens)"
+            case _:
+                return f"Event({self.type.value}): {self.data}"
+    def __repr__(self) -> str:
+        return f"StreamEvent(type={self.type.value!r}, data={self.data!r})"
+@dataclass
+class UsageStats:
+    """Token usage statistics."""
+    total_input_tokens: int = 0
+    total_output_tokens: int = 0
+    llm_calls: int = 0
+    tool_calls: int = 0
+    per_call: list[dict[str, int]] = field(default_factory=_list_factory)
+@dataclass
+class AgentResponse:
+    """Response from agent.run()."""
+    content: str | None
+    messages: list[ChatMessage]
+    usage: UsageStats
+    iterations: int
+class AgentThread:
+    """Conversation thread with message history.
+    Threads allow multi-turn conversations by preserving history
+    between agent.run() calls.
+    """
+    def __init__(self, messages: list[ChatMessage] | None = None):
+        self.messages: list[ChatMessage] = messages or []
+    def add(self, message: ChatMessage) -> None:
+        """Add a single message to the thread."""
+        self.messages.append(message)
+    def add_many(self, messages: list[ChatMessage]) -> None:
+        """Add multiple messages to the thread."""
+        self.messages.extend(messages)
+    def clear(self) -> None:
+        """Clear all messages from the thread."""
+        self.messages = []
+    def __len__(self) -> int:
+        return len(self.messages)
+    def __bool__(self) -> bool:
+        # Always truthy, even when empty (to work with `thread or get_new_thread()`)
+        return True
+class ChatAgent:
+    """Minimal agent with correct context compaction and hooks.
+    The key difference from Agent Framework:
+    - Context strategy is called BEFORE each LLM call in the tool loop
+    - The compacted messages are used for both the call AND next iteration
+    - This ensures cumulative token usage actually decreases with compaction
+    Example:
+        from miniagent import ChatAgent, tools
+        agent = ChatAgent(
+            instructions="You are a helpful assistant.",
+            tools=tools.coding_tools(),
+        )
+        response = await agent.run("List files in the current directory")
+    """
+    DEFAULT_MAX_ITERATIONS = 40
+    DEFAULT_TOKEN_BUDGET = 100_000
+    def __init__(
+        self,
+        client: ChatClient | None = None,
+        instructions: str | None = None,
+        tools: list[Tool] | None = None,
+        context_strategy: ContextStrategy | None = None,
+        token_budget: int = DEFAULT_TOKEN_BUDGET,
+        max_iterations: int = DEFAULT_MAX_ITERATIONS,
+        hooks: Hooks | None = None,
+    ):
+        """Initialize the agent.
+        Args:
+            client: Chat client for LLM calls. Auto-created if None.
+            instructions: System prompt for the agent.
+            tools: List of tools the agent can use.
+            context_strategy: Strategy for managing context. Defaults to no compaction.
+            token_budget: Maximum tokens for context window.
+            max_iterations: Maximum tool loop iterations.
+            hooks: Hook configuration for event handling.
+        """
+        self.client = client or ChatClient()
+        self.instructions = instructions
+        self.tools = {t.name: t for t in (tools or [])}
+        self.context_strategy = context_strategy or NoCompactionStrategy()
+        self.token_budget = token_budget
+        self.max_iterations = max_iterations
+        self.hooks = hooks or Hooks()
+    def get_new_thread(self) -> AgentThread:
+        """Create a new conversation thread."""
+        return AgentThread()
+    async def run(
+        self,
+        message: str,
+        thread: AgentThread | None = None,
+    ) -> AgentResponse:
+        """Run the agent on a message (non-streaming).
+        This method delegates to run_stream() and collects the results.
+        All logic lives in run_stream() - this is just a convenience wrapper.
+        THE CRITICAL FIX (in run_stream):
+        - Messages are compacted BEFORE each LLM call
+        - The compacted list is used for the call AND carried into the next iteration
+        - Unlike Agent Framework, where prepped_messages grows unbounded
+        Args:
+            message: The user message to process.
+            thread: Optional thread for conversation continuity.
+        Returns:
+            AgentResponse with the result and statistics.
+        """
+        thread = thread or self.get_new_thread()
+        # Consume the stream and extract final results
+        final_content: str | None = None
+        iterations: int = 0
+        usage_data: dict[str, int] = {}
+        async for event in self.run_stream(message, thread):
+            if event.type == StreamEventType.AGENT_END:
+                final_content = event.data.get("final_response")
+                iterations = event.data.get("iterations", 0)
+                usage_data = event.data.get("usage", {})
+        # Build UsageStats from the collected data
+        usage = UsageStats(
+            llm_calls=usage_data.get("llm_calls", 0),
+            tool_calls=usage_data.get("tool_calls", 0),
+            total_input_tokens=usage_data.get("total_input_tokens", 0),
+            total_output_tokens=usage_data.get("total_output_tokens", 0),
+        )
+        return AgentResponse(
+            content=final_content,
+            messages=thread.messages,  # Thread was updated by run_stream
+            usage=usage,
+            iterations=iterations,
+        )
+    async def run_stream(
+        self,
+        message: str,
+        thread: AgentThread | None = None,
+    ) -> AsyncGenerator[StreamEvent, None]:
+        """Run the agent and yield events as they occur.
+        This is useful for building interactive UIs that need to show
+        progress in real-time.
+        Args:
+            message: The user message to process.
+            thread: Optional thread for conversation continuity.
+        Yields:
+            StreamEvent objects for each step of execution.
+        """
+        thread = thread or self.get_new_thread()
+        usage = UsageStats()
+        # Emit AgentStart hook
+        await self._emit_agent_start(message, thread)
+        # Emit start event
+        yield StreamEvent(
+            type=StreamEventType.AGENT_START,
+            data={"user_message": message, "thread_length": len(thread)},
+        )
+        # Build initial messages
+        messages: list[ChatMessage] = []
+        if self.instructions:
+            messages.append(ChatMessage.system(self.instructions))
+        messages.extend(thread.messages)
+        user_msg = ChatMessage.user(message)
+        messages.append(user_msg)
+        openai_tools = (
+            [t.to_openai_tool() for t in self.tools.values()] if self.tools else None
+        )
+        final_content: str | None = None
+        iteration = 0
+        for iteration in range(self.max_iterations):
+            # Apply context strategy
+            messages = await self._compact_with_hooks(messages, iteration)
+            # Model call start
+            yield StreamEvent(
+                type=StreamEventType.MODEL_START,
+                data={"iteration": iteration, "message_count": len(messages)},
+            )
+            # Emit PreModelCall hook for OTEL tracing
+            await self._emit_pre_model_call(messages, iteration)
+            # Make LLM call
+            result = await self.client.chat_completion(
+                messages=[m.to_openai_format() for m in messages],
+                tools=openai_tools,
+            )
+            # Track usage
+            usage.llm_calls += 1
+            usage.total_input_tokens += result.usage["input_tokens"]
+            usage.total_output_tokens += result.usage["output_tokens"]
+            usage.per_call.append(result.usage)
+            # Emit PostModelCall hook for OTEL tracing
+            await self._emit_post_model_call(result, iteration)
+            # Model call end
+            yield StreamEvent(
+                type=StreamEventType.MODEL_END,
+                data={
+                    "iteration": iteration,
+                    "usage": result.usage,
+                    "has_tool_calls": bool(result.tool_calls),
+                },
+            )
+            # Parse response
+            assistant_msg = self._parse_assistant_message(result)
+            messages.append(assistant_msg)
+            # Emit text if present
+            if assistant_msg.content:
+                yield StreamEvent(
+                    type=StreamEventType.TEXT,
+                    data={"content": assistant_msg.content},
+                )
+            # Check if done
+            if not assistant_msg.tool_calls:
+                final_content = assistant_msg.content
+                break
+            # Execute tools
+            should_stop = False
+            for tool_call in assistant_msg.tool_calls:
+                # Pre-tool hook
+                hook_result = await self._emit_pre_tool_use(tool_call, messages, iteration)
+                if hook_result and hook_result.decision == "block":
+                    tool_msg = ChatMessage.tool(
+                        tool_call.id,
+                        f"Tool call blocked: {hook_result.reason or 'No reason provided'}",
+                    )
+                    messages.append(tool_msg)
+                    continue
+                tool_input = json.loads(tool_call.arguments)
+                if hook_result and hook_result.decision == "modify" and hook_result.modified_input:
+                    tool_input = hook_result.modified_input
+                # Tool start
+                yield StreamEvent(
+                    type=StreamEventType.TOOL_START,
+                    data={"tool_name": tool_call.name, "tool_input": tool_input},
+                )
+                # Execute
+                tool_result = await self._execute_tool(tool_call.name, tool_input)
+                usage.tool_calls += 1
+                # Tool end
+                yield StreamEvent(
+                    type=StreamEventType.TOOL_END,
+                    data={
+                        "tool_name": tool_call.name,
+                        "tool_output": tool_result[:500],  # Truncate for streaming
+                    },
+                )
+                # Post-tool hook
+                post_result = await self._emit_post_tool_use(
+                    tool_call, tool_input, tool_result, iteration
+                )
+                # Add tool result
+                tool_msg = ChatMessage.tool(tool_call.id, tool_result)
+                messages.append(tool_msg)
+                if post_result:
+                    if post_result.additional_context:
+                        messages.append(ChatMessage.system(post_result.additional_context))
+                    if post_result.stop_execution:
+                        should_stop = True
+                        break
+            if should_stop:
+                break
+        # Update thread
+        start_idx = 1 if self.instructions else 0
+        thread.messages = messages[start_idx:]
+        # Get final content
+        if final_content is None:
+            for msg in reversed(messages):
+                if msg.role == "assistant" and msg.content:
+                    final_content = msg.content
+                    break
+        # Emit AgentEnd hook
+        await self._emit_agent_end(final_content, iteration + 1, usage)
+        # End event
+        yield StreamEvent(
+            type=StreamEventType.AGENT_END,
+            data={
+                "final_response": final_content,
+                "iterations": iteration + 1,
+                "usage": {
+                    "total_input_tokens": usage.total_input_tokens,
+                    "total_output_tokens": usage.total_output_tokens,
+                    "llm_calls": usage.llm_calls,
+                    "tool_calls": usage.tool_calls,
+                },
+            },
+        )
+    def _parse_assistant_message(self, result: "ChatCompletionResult") -> ChatMessage:
+        """Parse the LLM response into a ChatMessage."""
+        tool_calls = None
+        if result.tool_calls:
+            tool_calls = [
+                ToolCall(
+                    id=tc["id"],
+                    name=tc["name"],
+                    arguments=tc["arguments"],
+                )
+                for tc in result.tool_calls
+            ]
+        return ChatMessage.assistant(content=result.content, tool_calls=tool_calls)
+    async def _execute_tool(self, name: str, arguments: dict[str, Any]) -> str:
+        """Execute a tool call."""
+        tool = self.tools.get(name)
+        if not tool:
+            return f"Error: Unknown tool '{name}'"
+        try:
+            return await tool.invoke(**arguments)
+        except Exception as e:
+            return f"Error executing {name}: {e}"
+    # === Hook emission methods ===
+    async def _emit_agent_start(self, message: str, thread: AgentThread) -> None:
+        """Emit AgentStart event to hooks."""
+        event = AgentStartEvent(
+            user_message=message,
+            thread_message_count=len(thread),
+        )
+        for hook in self.hooks.agent_start:
+            await hook(event)
+    async def _emit_agent_end(
+        self, final_response: str | None, iterations: int, usage: UsageStats
+    ) -> None:
+        """Emit AgentEnd event to hooks."""
+        event = AgentEndEvent(
+            final_response=final_response,
+            total_iterations=iterations,
+            total_input_tokens=usage.total_input_tokens,
+            total_output_tokens=usage.total_output_tokens,
+            tool_calls_made=usage.tool_calls,
+        )
+        for hook in self.hooks.agent_end:
+            await hook(event)
+    async def _emit_pre_model_call(
+        self, messages: list[ChatMessage], iteration: int
+    ) -> None:
+        """Emit PreModelCall event to hooks."""
+        event = PreModelCallEvent(
+            message_count=len(messages),
+            iteration=iteration,
+        )
+        for hook in self.hooks.pre_model_call:
+            await hook(event)
+    async def _emit_post_model_call(self, result: "ChatCompletionResult", iteration: int) -> None:
+        """Emit PostModelCall event to hooks."""
+        # Extract text content from response
+        response_text = result.content or ""
+        event = PostModelCallEvent(
+            usage=result.usage,
+            iteration=iteration,
+            has_tool_calls=bool(result.tool_calls),
+            finish_reason=result.finish_reason,
+            response_text=response_text,
+        )
+        for hook in self.hooks.post_model_call:
+            await hook(event)
+    async def _emit_pre_tool_use(
+        self, tool_call: ToolCall, messages: list[ChatMessage], iteration: int
+    ) -> PreToolUseResult | None:
+        """Emit PreToolUse event to hooks.
+        Returns the first non-allow result, otherwise the last allow result (or None).
+        """
+        event = PreToolUseEvent(
+            tool_name=tool_call.name,
+            tool_input=json.loads(tool_call.arguments),
+            tool_call_id=tool_call.id,
+            iteration=iteration,
+        )
+        result: PreToolUseResult | None = None
+        for hook in self.hooks.pre_tool_use:
+            hook_result = await hook(event)
+            if hook_result:
+                # First non-allow result wins
+                if hook_result.decision != "allow":
+                    return hook_result
+                result = hook_result
+        return result
+    async def _emit_post_tool_use(
+        self,
+        tool_call: ToolCall,
+        tool_input: dict[str, Any],
+        tool_output: str,
+        iteration: int,
+    ) -> PostToolUseResult | None:
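+        """Emit PostToolUse event to hooks.
+        Returns a combined result: the last non-empty additional_context wins,
+        and a stop_execution request from any hook stops the run.
+        """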
+        error = tool_output if tool_output.startswith("Error") else None
+        event = PostToolUseEvent(
+            tool_name=tool_call.name,
+            tool_input=tool_input,
+            tool_output=tool_output,
+            tool_call_id=tool_call.id,
+            iteration=iteration,
+            error=error,
+        )
+        combined = PostToolUseResult()
+        for hook in self.hooks.post_tool_use:
+            hook_result = await hook(event)
+            if hook_result:
+                if hook_result.additional_context:
+                    combined.additional_context = hook_result.additional_context
+                if hook_result.stop_execution:
+                    combined.stop_execution = True
+                    combined.stop_reason = hook_result.stop_reason
+        return combined if (combined.additional_context or combined.stop_execution) else None
+    async def _compact_with_hooks(
+        self, messages: list[ChatMessage], iteration: int
+    ) -> list[ChatMessage]:
+        """Apply context strategy with hooks."""
+        # Estimate current tokens (rough)
+        current_tokens = sum(
+            len(str(m.content or "")) // 4 + 10 for m in messages
+        )
+        # Emit PreCompact hook
+        pre_event = PreCompactEvent(
+            message_count=len(messages),
+            current_tokens=current_tokens,
+            budget=self.token_budget,
+            trigger="auto",
+        )
+        for hook in self.hooks.pre_compact:
+            await hook(pre_event)
+        # Apply context strategy (use async if available for summarization)
+        compacted: list[ChatMessage]
+        if hasattr(self.context_strategy, "prepare_context_async"):
+            # Cast to Any to access optional async method
+            strategy: Any = self.context_strategy
+            compacted = await strategy.prepare_context_async(
+                messages, self.token_budget
+            )
+        else:
+            compacted = self.context_strategy.prepare_context(messages, self.token_budget)
+        # Emit PostCompact hook if the message count changed
+        # (a rewrite that keeps the same number of messages is not reported)
+        if len(compacted) != len(messages):
+            compacted_tokens = sum(
+                len(str(m.content or "")) // 4 + 10 for m in compacted
+            )
+            post_event = PostCompactEvent(
+                messages_before=len(messages),
+                messages_after=len(compacted),
+                tokens_before=current_tokens,
+                tokens_after=compacted_tokens,
+            )
+            for hook in self.hooks.post_compact:
+                await hook(post_event)
+        return compacted
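The loop shape that `run_stream` relies on (compact first, then rebind the working list so the compacted result carries into the next iteration) can be illustrated with a small standalone sketch. Everything here is a hypothetical stand-in for illustration: `keep_recent` counts messages rather than tokens, `fake_model` replaces the real client, and plain strings replace `ChatMessage`. Unlike the strategies in the commit, this toy strategy does not protect tool-call/result groups.

```python
import asyncio


def keep_recent(messages: list[str], budget: int) -> list[str]:
    """Hypothetical strategy: keep the first message plus the newest ones.

    The budget is counted in messages here, not tokens, to keep the sketch small.
    """
    if budget < 2 or len(messages) <= budget:
        return messages
    return messages[:1] + messages[-(budget - 1):]


async def fake_model(step: int) -> dict:
    """Stand-in for an LLM call: asks for a tool during the first five steps."""
    await asyncio.sleep(0)
    return {"content": f"reply {step}", "wants_tool": step < 5}


async def run_loop(budget: int, max_iterations: int = 10) -> list[int]:
    """Return the number of messages sent to the model on each call."""
    messages = ["system", "user: task"]
    sent_sizes: list[int] = []
    for step in range(max_iterations):
        # Compact first, and rebind the working list so the result persists.
        messages = keep_recent(messages, budget)
        sent_sizes.append(len(messages))
        reply = await fake_model(step)
        messages.append(f"assistant: {reply['content']}")
        if not reply["wants_tool"]:
            break
        messages.append(f"tool: result {step}")
    return sent_sizes


sizes = asyncio.run(run_loop(budget=4))
print(sizes)
```

Because the compacted list replaces `messages`, the size of each request stays at or below the budget no matter how many tool rounds run.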

src/flow/harness/miniagent/client.py ADDED

	@@ -0,0 +1,185 @@

+"""OpenAI/Azure OpenAI client wrapper for MiniAgent.
+Provides a unified interface for both OpenAI and Azure OpenAI APIs.
+Auto-detects configuration from environment variables.
+"""
+from dataclasses import dataclass
+from typing import Any
+import os
+# Load .env file if present (override=True to prefer .env over shell env)
+try:
+    from dotenv import load_dotenv
+    load_dotenv(override=True)
+except ImportError:
+    pass  # dotenv not installed, use existing env vars
+@dataclass
+class ClientConfig:
+    """Configuration for the chat client.
+    Can be provided explicitly or auto-detected from environment variables.
+    """
+    api_key: str
+    model: str = "gpt-4o"
+    endpoint: str | None = None  # For Azure OpenAI
+    api_version: str = "2024-02-15-preview"  # For Azure OpenAI
+    temperature: float = 0.0
+    max_tokens: int | None = None
+    @classmethod
+    def from_env(cls) -> "ClientConfig":
+        """Create config from environment variables.
+        Checks for Azure first, then falls back to OpenAI.
+        Environment variables:
+            Azure: AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, AZURE_OPENAI_DEPLOYMENT
+            OpenAI: OPENAI_API_KEY, OPENAI_MODEL
+        """
+        # Check for Azure OpenAI
+        azure_endpoint = os.environ.get("AZURE_OPENAI_ENDPOINT")
+        if azure_endpoint:
+            return cls(
+                api_key=os.environ.get("AZURE_OPENAI_API_KEY", ""),
+                model=os.environ.get("AZURE_OPENAI_DEPLOYMENT", "gpt-4o"),
+                endpoint=azure_endpoint,
+                api_version=os.environ.get(
+                    "AZURE_OPENAI_API_VERSION", "2024-02-15-preview"
+                ),
+            )
+        # Fall back to OpenAI
+        return cls(
+            api_key=os.environ.get("OPENAI_API_KEY", ""),
+            model=os.environ.get("OPENAI_MODEL", "gpt-4o"),
+        )
+@dataclass
+class ChatCompletionResult:
+    """Result from a chat completion call."""
+    content: str | None
+    tool_calls: list[dict[str, Any]] | None
+    usage: dict[str, int]
+    finish_reason: str | None
+    raw_response: Any
+class ChatClient:
+    """Async client for OpenAI/Azure OpenAI chat completions.
+    Wraps the openai Python SDK and provides a simplified interface.
+    """
+    def __init__(self, config: ClientConfig | None = None):
+        """Initialize the client.
+        Args:
+            config: Client configuration. If None, auto-detects from env.
+        """
+        self.config = config or ClientConfig.from_env()
+        self._client = self._create_client()
+    def _create_client(self):
+        """Create the appropriate async client."""
+        try:
+            from openai import AsyncOpenAI, AsyncAzureOpenAI
+        except ImportError as exc:
+            raise ImportError(
+                "openai package is required. Install with: pip install openai"
+            ) from exc
+        if self.config.endpoint:
+            # Check if using OpenAI-compatible endpoint (e.g., /openai/v1/)
+            # vs traditional Azure OpenAI endpoint
+            if "/v1" in self.config.endpoint:
+                # OpenAI-compatible endpoint (like gpt-5.2-chat on victor-test-resource)
+                return AsyncOpenAI(
+                    base_url=self.config.endpoint,
+                    api_key=self.config.api_key,
+                )
+            else:
+                # Traditional Azure OpenAI
+                return AsyncAzureOpenAI(
+                    api_key=self.config.api_key,
+                    azure_endpoint=self.config.endpoint,
+                    api_version=self.config.api_version,
+                )
+        # Standard OpenAI
+        return AsyncOpenAI(api_key=self.config.api_key)
+    async def chat_completion(
+        self,
+        messages: list[dict[str, Any]],
+        tools: list[dict[str, Any]] | None = None,
+        **kwargs: Any,
+    ) -> ChatCompletionResult:
+        """Make a chat completion request.
+        Args:
+            messages: List of messages in OpenAI format
+            tools: Optional list of tools in OpenAI format
+            **kwargs: Additional parameters to pass to the API
+        Returns:
+            ChatCompletionResult with the response
+        """
+        params: dict[str, Any] = {
+            "model": self.config.model,
+            "messages": messages,
+        }
+        # Skip temperature when it is 1.0 or the model rejects it (like gpt-5.2-chat)
+        temp = kwargs.get("temperature", self.config.temperature)
+        if temp != 1.0 and "5.2" not in self.config.model:
+            params["temperature"] = temp
+        if self.config.max_tokens:
+            params["max_tokens"] = self.config.max_tokens
+        if tools:
+            params["tools"] = tools
+            params["tool_choice"] = kwargs.get("tool_choice", "auto")
+            params["parallel_tool_calls"] = kwargs.get("parallel_tool_calls", True)
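+        # Add any extra kwargs (keys already handled above are skipped, so
+        # parallel_tool_calls is never sent without tools)
+        handled = ("temperature", "tool_choice", "parallel_tool_calls")
+        for key, value in kwargs.items():
+            if key not in handled and value is not None:
+                params[key] = value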
+        response = await self._client.chat.completions.create(**params)  # type: ignore[union-attr]
+        # Extract the message
+        choice = response.choices[0]  # type: ignore[index]
+        message = choice.message  # type: ignore[union-attr]
+        # Parse tool calls if present
+        tool_calls: list[dict[str, Any]] | None = None
+        if message.tool_calls:  # type: ignore[union-attr]
+            tool_calls = [
+                {
+                    "id": str(tc.id),  # type: ignore[union-attr]
+                    "name": str(tc.function.name),  # type: ignore[union-attr]
+                    "arguments": str(tc.function.arguments),  # type: ignore[union-attr]
+                }
+                for tc in message.tool_calls  # type: ignore[union-attr]
+            ]
+        return ChatCompletionResult(
+            content=str(message.content) if message.content else None,  # type: ignore[union-attr]
+            tool_calls=tool_calls,
+            usage={
+                "input_tokens": response.usage.prompt_tokens if response.usage else 0,  # type: ignore[union-attr]
+                "output_tokens": (
+                    response.usage.completion_tokens if response.usage else 0  # type: ignore[union-attr]
+                ),
+            },
+            finish_reason=str(choice.finish_reason) if choice.finish_reason else None,  # type: ignore[union-attr]
+            raw_response=response,
+        )
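The detection order used by `ClientConfig.from_env` and `_create_client` can be sketched as a pure function, which makes the three cases easy to check without touching real environment variables. `pick_backend` and its return labels are hypothetical names for illustration and are not part of the commit; the environment variable names and the `"/v1"` substring check are taken from the code above.

```python
from collections.abc import Mapping


def pick_backend(env: Mapping[str, str]) -> tuple[str, str]:
    """Return (backend label, model name) for a given environment mapping.

    Hypothetical helper mirroring the order in the diff: an Azure endpoint is
    checked first, then the code falls back to plain OpenAI settings.
    """
    endpoint = env.get("AZURE_OPENAI_ENDPOINT")
    if endpoint:
        model = env.get("AZURE_OPENAI_DEPLOYMENT", "gpt-4o")
        # An endpoint containing "/v1" is treated as OpenAI-compatible.
        kind = "openai-compatible" if "/v1" in endpoint else "azure"
        return kind, model
    return "openai", env.get("OPENAI_MODEL", "gpt-4o")


default_choice = pick_backend({})
azure_choice = pick_backend(
    {
        "AZURE_OPENAI_ENDPOINT": "https://example.openai.azure.com",
        "AZURE_OPENAI_DEPLOYMENT": "my-deployment",
    }
)
compat_choice = pick_backend(
    {"AZURE_OPENAI_ENDPOINT": "https://example.invalid/openai/v1/"}
)
print(default_choice, azure_choice, compat_choice)
```

One consequence of this order is that a set `AZURE_OPENAI_ENDPOINT` takes precedence even when `OPENAI_API_KEY` is also present.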

src/flow/harness/miniagent/context.py ADDED

	@@ -0,0 +1,664 @@

+"""Context strategies for MiniAgent.
+This is the KEY module that fixes Agent Framework's broken compaction.
+Strategies are called BEFORE each LLM call, and the returned (potentially
+compacted) list continues to the next iteration.
+"""
+from dataclasses import dataclass, field
+from typing import Protocol, Any
+import tiktoken
+from .messages import ChatMessage
+class ContextStrategy(Protocol):
+    """Protocol for context management strategies.
+    Called BEFORE each LLM call in the tool loop, allowing
+    the strategy to modify the message list.
+    """
+    def prepare_context(
+        self,
+        messages: list[ChatMessage],
+        token_budget: int,
+    ) -> list[ChatMessage]:
+        """Prepare messages for the next LLM call.
+        Args:
+            messages: Current messages
+            token_budget: Maximum tokens for context
+        Returns:
+            Messages to use (may be compacted)
+        """
+        ...
+class NoCompactionStrategy:
+    """Baseline: no compaction, context grows unbounded.
+    Use this for benchmarking to see how context grows without management.
+    """
+    def prepare_context(
+        self,
+        messages: list[ChatMessage],
+        token_budget: int,
+    ) -> list[ChatMessage]:
+        return messages
+@dataclass
+class HeadTailStrategy:
+    """Token-aware head+tail compaction.
+    Preserves:
+    - Head: System prompt, initial user message (critical context)
+    - Tail: Recent tool calls and results (working memory)
+    Drops middle messages when over budget, respecting atomic groups
+    (tool calls and their results must stay together).
+    This is the recommended strategy for most use cases.
+    """
+    head_ratio: float = 0.2  # 20% for head by default
+    model: str = "gpt-4o"
+    _encoder: tiktoken.Encoding | None = field(default=None, repr=False)
+    # Statistics
+    compaction_count: int = field(default=0, repr=False)
+    total_tokens_saved: int = field(default=0, repr=False)
+    def __post_init__(self):
+        try:
+            self._encoder = tiktoken.encoding_for_model(self.model)
+        except KeyError:
+            # Fallback for unknown models
+            self._encoder = tiktoken.get_encoding("cl100k_base")
+    def _count_tokens(self, messages: list[ChatMessage]) -> int:
+        """Count tokens in messages."""
+        if not self._encoder:
+            # Rough estimate if no encoder
+            return sum(len(str(m.content or "")) // 4 for m in messages)
+        total = 0
+        for msg in messages:
+            # Role overhead (approximately 4 tokens per message)
+            total += 4
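+            # disallowed_special=() encodes special-token text (e.g. "<|endoftext|>"
+            # inside tool output) as plain text instead of raising ValueError
+            if msg.content:
+                total += len(self._encoder.encode(msg.content, disallowed_special=()))
+            if msg.tool_calls:
+                for tc in msg.tool_calls:
+                    # Tool call overhead
+                    total += 4
+                    total += len(self._encoder.encode(tc.name, disallowed_special=()))
+                    total += len(self._encoder.encode(tc.arguments, disallowed_special=()))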
+        return total
+    def _find_atomic_groups(
+        self, messages: list[ChatMessage]
+    ) -> list[tuple[int, ...]]:
+        """Group tool_call messages with their results.
+        OpenAI requires every tool_call to have a corresponding result.
+        This ensures we never split a tool call from its results.
+        Returns list of tuples, where each tuple contains indices that
+        must stay together.
+        """
+        groups: list[tuple[int, ...]] = []
+        i = 0
+        while i < len(messages):
+            msg = messages[i]
+            if msg.tool_calls:
+                # This message has tool calls - find all results
+                call_ids = {tc.id for tc in msg.tool_calls}
+                group_indices = [i]
+                # Look ahead for results
+                j = i + 1
+                while j < len(messages) and call_ids:
+                    if messages[j].role == "tool" and messages[j].tool_call_id in call_ids:
+                        group_indices.append(j)
+                        call_ids.remove(messages[j].tool_call_id)
+                    j += 1
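+                # Keep messages interleaved between the call and its last result
+                # (e.g. injected system context) in the same group; otherwise
+                # they belong to no group and are silently dropped on compaction
+                last = max(group_indices)
+                groups.append(tuple(range(i, last + 1)))
+                i = last + 1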
+            else:
+                groups.append((i,))
+                i += 1
+        return groups
+    def prepare_context(
+        self,
+        messages: list[ChatMessage],
+        token_budget: int,
+    ) -> list[ChatMessage]:
+        """Compact if over budget."""
+        if not messages:
+            return messages
+        current_tokens = self._count_tokens(messages)
+        if current_tokens <= token_budget:
+            return messages
+        # COMPACTION NEEDED
+        self.compaction_count += 1
+        groups = self._find_atomic_groups(messages)
+        head_budget = int(token_budget * self.head_ratio)
+        tail_budget = token_budget - head_budget
+        # Fill head from start
+        head_groups: list[tuple[int, ...]] = []
+        head_tokens = 0
+        for group in groups:
+            group_msgs = [messages[i] for i in group]
+            group_tokens = self._count_tokens(group_msgs)
+            if head_tokens + group_tokens <= head_budget:
+                head_groups.append(group)
+                head_tokens += group_tokens
+            else:
+                break
+        # Fill tail from end (skip head groups)
+        remaining_groups = groups[len(head_groups) :]
+        tail_groups: list[tuple[int, ...]] = []
+        tail_tokens = 0
+        for group in reversed(remaining_groups):
+            group_msgs = [messages[i] for i in group]
+            group_tokens = self._count_tokens(group_msgs)
+            if tail_tokens + group_tokens <= tail_budget:
+                tail_groups.insert(0, group)
+                tail_tokens += group_tokens
+            else:
+                break
+        # Build compacted list
+        kept_indices: set[int] = set()
+        for group in head_groups + tail_groups:
+            kept_indices.update(group)
+        compacted = [messages[i] for i in sorted(kept_indices)]
+        # Track savings
+        compacted_tokens = self._count_tokens(compacted)
+        self.total_tokens_saved += current_tokens - compacted_tokens
+        return compacted
+@dataclass
+class SlidingWindowStrategy:
+    """Keep only recent messages within budget.
+    Always preserves the system message (if present) plus the most
+    recent messages that fit in the budget. Respects atomic groups
+    (tool calls and their results must stay together).
+    Simpler than HeadTailStrategy but may lose important early context.
+    """
+    model: str = "gpt-4o"
+    _encoder: tiktoken.Encoding | None = field(default=None, repr=False)
+    def __post_init__(self):
+        try:
+            self._encoder = tiktoken.encoding_for_model(self.model)
+        except KeyError:
+            self._encoder = tiktoken.get_encoding("cl100k_base")
+    def _count_tokens(self, messages: list[ChatMessage]) -> int:
+        """Count tokens in messages."""
+        if not self._encoder:
+            return sum(len(str(m.content or "")) // 4 for m in messages)
+        total = 0
+        for msg in messages:
+            total += 4
+            if msg.content:
+                total += len(self._encoder.encode(msg.content))
+            if msg.tool_calls:
+                for tc in msg.tool_calls:
+                    total += 4 + len(self._encoder.encode(tc.name))
+                    total += len(self._encoder.encode(tc.arguments))
+        return total
+    def _find_atomic_groups(
+        self, messages: list[ChatMessage]
+    ) -> list[tuple[int, ...]]:
+        """Group tool_call messages with their results.
+        OpenAI requires every tool_call to have a corresponding result.
+        This ensures we never split a tool call from its results.
+        """
+        groups: list[tuple[int, ...]] = []
+        i = 0
+        while i < len(messages):
+            msg = messages[i]
+            if msg.tool_calls:
+                # This message has tool calls - find all results
+                call_ids = {tc.id for tc in msg.tool_calls}
+                group_indices = [i]
+                # Look ahead for results
+                j = i + 1
+                while j < len(messages) and call_ids:
+                    if messages[j].role == "tool" and messages[j].tool_call_id in call_ids:
+                        group_indices.append(j)
+                        call_ids.remove(messages[j].tool_call_id)
+                    j += 1
+                groups.append(tuple(group_indices))
+                i = max(group_indices) + 1 if group_indices else i + 1
+            else:
+                groups.append((i,))
+                i += 1
+        return groups
+    def prepare_context(
+        self,
+        messages: list[ChatMessage],
+        token_budget: int,
+    ) -> list[ChatMessage]:
+        """Keep system message + most recent messages within budget."""
+        if not messages:
+            return messages
+        # Always keep system messages at the start
+        system_msgs: list[ChatMessage] = []
+        non_system_start = 0
+        for i, msg in enumerate(messages):
+            if msg.role == "system":
+                system_msgs.append(msg)
+                non_system_start = i + 1
+            else:
+                break
+        other_msgs = messages[non_system_start:]
+        system_tokens = self._count_tokens(system_msgs)
+        remaining_budget = token_budget - system_tokens
+        if remaining_budget <= 0:
+            return system_msgs
+        # Find atomic groups in other messages
+        groups = self._find_atomic_groups(other_msgs)
+        # Fill from end, respecting atomic groups
+        kept_groups: list[tuple[int, ...]] = []
+        kept_tokens = 0
+        for group in reversed(groups):
+            group_msgs = [other_msgs[i] for i in group]
+            group_tokens = self._count_tokens(group_msgs)
+            if kept_tokens + group_tokens <= remaining_budget:
+                kept_groups.insert(0, group)
+                kept_tokens += group_tokens
+            else:
+                break
+        # Build result from kept groups
+        kept_indices: set[int] = set()
+        for group in kept_groups:
+            kept_indices.update(group)
+        result = [other_msgs[i] for i in sorted(kept_indices)]
+        return system_msgs + result
+@dataclass
+class SummarizationStrategy:
+    """Summarize old messages instead of dropping them.
+    When over budget, this strategy:
+    1. Keeps: System message + initial user message (head)
+    2. Keeps: Most recent messages (tail)
+    3. Summarizes: Everything in between into a single "context so far" message
+    This preserves critical state (files read, findings, progress) that would
+    otherwise be lost with simple truncation strategies.
+    The summarization uses an LLM call, which adds latency but preserves meaning.
+    """
+    # Client for summarization calls (required)
+    client: Any = None  # ChatClient instance
+    # Configuration
+    head_messages: int = 2  # Keep first N messages (system + initial user)
+    tail_messages: int = 4  # Keep last N messages (recent context)
+    summary_max_tokens: int = 1000  # Max tokens for the summary
+    model: str = "gpt-4o"
+    # Statistics
+    compaction_count: int = field(default=0, repr=False)
+    total_tokens_saved: int = field(default=0, repr=False)
+    _encoder: tiktoken.Encoding | None = field(default=None, repr=False)
+    def __post_init__(self):
+        try:
+            self._encoder = tiktoken.encoding_for_model(self.model)
+        except KeyError:
+            self._encoder = tiktoken.get_encoding("cl100k_base")
+    def _count_tokens(self, messages: list[ChatMessage]) -> int:
+        if not self._encoder:
+            return sum(len(str(m.content or "")) // 4 for m in messages)
+        total = 0
+        for msg in messages:
+            total += 4
+            if msg.content:
+                total += len(self._encoder.encode(msg.content))
+            if msg.tool_calls:
+                for tc in msg.tool_calls:
+                    total += 4 + len(self._encoder.encode(tc.name))
+                    total += len(self._encoder.encode(tc.arguments))
+        return total
+    def _format_messages_for_summary(self, messages: list[ChatMessage]) -> str:
+        """Format messages into text for summarization."""
+        parts: list[str] = []
+        for msg in messages:
+            if msg.role == "assistant":
+                if msg.content:
+                    parts.append(f"Assistant: {msg.content}")
+                if msg.tool_calls:
+                    for tc in msg.tool_calls:
+                        args = tc.arguments if len(tc.arguments) <= 200 else tc.arguments[:200] + "..."
+                        parts.append(f"Tool call: {tc.name}({args})")
+            elif msg.role == "tool":
+                # Truncate long tool outputs
+                output = msg.content or ""
+                if len(output) > 500:
+                    output = output[:500] + "... [truncated]"
+                parts.append(f"Tool result ({msg.name}): {output}")
+            elif msg.role == "user" and msg.content:
+                parts.append(f"User: {msg.content}")
+        return "\n\n".join(parts)
+    async def _generate_summary(
+        self, messages: list[ChatMessage], original_task: str = ""
+    ) -> str:
+        """Generate a summary of the messages using the LLM.
+        Args:
+            messages: The middle messages to summarize
+            original_task: The original user task (for context)
+        """
+        if not self.client:
+            return self._extract_key_info(messages)
+        content = self._format_messages_for_summary(messages)
+        # Extract files that were read (to prevent re-reading)
+        files_read = self._extract_files_read(messages)
+        files_list = "\n".join(f"  - {f}" for f in files_read) if files_read else "  (none identified)"
+        summary_prompt = f"""You are helping an AI agent that is working on a task but hit context limits.
+The agent needs to continue from a summary of what was done so far.
+ORIGINAL TASK:
+{original_task if original_task else "(not provided)"}
+The conversation below shows {len(messages)} messages of work that needs to be summarized.
+The agent will continue working after receiving this summary.
+CRITICAL: Your summary MUST include:
+1. **FILES ALREADY READ** - List EVERY file that was read. The agent must NOT re-read these:
+{files_list}
+2. **KEY FINDINGS** - What was discovered in each file (brief, 1-2 lines each)
+3. **PROGRESS** - What's been accomplished toward the task
+4. **WHAT REMAINS** - What still needs to be done to complete the task
+Keep summary under {self.summary_max_tokens} tokens. Be specific - vague summaries cause the agent to repeat work.
+CONVERSATION TO SUMMARIZE:
+{content}
+SUMMARY:"""
+        try:
+            # chat_completion expects messages as dicts, not ChatMessage objects
+            # Use max_completion_tokens for newer models, fall back to max_tokens
+            response = await self.client.chat_completion(
+                messages=[{"role": "user", "content": summary_prompt}],
+                max_completion_tokens=self.summary_max_tokens,
+            )
+            # ChatCompletionResult has .content attribute
+            if response.content:
+                return response.content
+            return self._extract_key_info(messages)
+        except Exception as e:
+            # Log the error for debugging
+            import sys
+            print(f"[SummarizationStrategy] LLM call failed: {e}", file=sys.stderr)
+            return self._extract_key_info(messages)
+    def _extract_files_read(self, messages: list[ChatMessage]) -> list[str]:
+        """Extract list of files that were read from the messages."""
+        files: list[str] = []
+        for msg in messages:
+            if msg.tool_calls:
+                for tc in msg.tool_calls:
+                    if tc.name in ("read_file", "Read"):
+                        # Try to extract path from arguments
+                        try:
+                            import json
+                            args = json.loads(tc.arguments)
+                            path = args.get("path") or args.get("file_path") or args.get("filename")
+                            if path:
+                                files.append(path)
+                        except (json.JSONDecodeError, AttributeError, TypeError):
+                            pass
+        return list(dict.fromkeys(files))  # Remove duplicates, preserve order
+    def _extract_key_info(self, messages: list[ChatMessage]) -> str:
+        """Extract key info without LLM (fallback)."""
+        # Paths come from the tool-call arguments; tool result messages only
+        # carry the tool name, so they cannot identify which file was read.
+        files_read = self._extract_files_read(messages)
+        key_findings: list[str] = []
+        for msg in messages:
+            if msg.role == "assistant" and msg.content:
+                # Keep short assistant messages as findings
+                if len(msg.content) < 200:
+                    key_findings.append(msg.content)
+        parts: list[str] = []
+        if files_read:
+            parts.append(f"Files accessed: {', '.join(files_read)}")
+        if key_findings:
+            parts.append(f"Key points: {'; '.join(key_findings[:5])}")
+        return "\n".join(parts) if parts else "Previous context was processed."
+    def prepare_context(
+        self,
+        messages: list[ChatMessage],
+        token_budget: int,
+    ) -> list[ChatMessage]:
+        """Summarize middle messages if over budget.
+        NOTE: This is synchronous but summarization needs async.
+        The actual summarization happens in prepare_context_async.
+        This method uses a simple fallback for sync contexts.
+        """
+        if not messages:
+            return messages
+        current_tokens = self._count_tokens(messages)
+        if current_tokens <= token_budget:
+            return messages
+        # For sync context, use simple extraction (no LLM call)
+        return self._compact_with_summary_sync(messages, token_budget)
+    def _find_safe_split_points(self, messages: list[ChatMessage]) -> tuple[int, int]:
+        """Find safe points to split messages without breaking tool call/result pairs.
+        Returns (head_end, tail_start) indices where it's safe to summarize between.
+        """
+        # Find atomic groups (tool calls must stay with their results)
+        groups: list[tuple[int, int]] = []  # (start, end) indices
+        i = 0
+        while i < len(messages):
+            msg = messages[i]
+            if msg.tool_calls:
+                # Find all results for this tool call
+                call_ids = {tc.id for tc in msg.tool_calls}
+                end = i
+                j = i + 1
+                while j < len(messages) and call_ids:
+                    if messages[j].role == "tool" and messages[j].tool_call_id in call_ids:
+                        call_ids.discard(messages[j].tool_call_id)
+                        end = j
+                    j += 1
+                groups.append((i, end + 1))
+                i = end + 1
+            else:
+                groups.append((i, i + 1))
+                i += 1
+        # Find safe head end (after self.head_messages worth of groups)
+        head_end = 0
+        for idx, (_start, end) in enumerate(groups):
+            if idx < self.head_messages:
+                head_end = end
+            else:
+                break
+        # Find safe tail start (before last self.tail_messages groups)
+        tail_start = len(messages)
+        tail_groups = min(self.tail_messages, len(groups))
+        if tail_groups > 0 and len(groups) > tail_groups:
+            tail_start = groups[-tail_groups][0]
+        # Ensure we don't overlap
+        if head_end >= tail_start:
+            # Not enough room - just keep everything
+            return len(messages), len(messages)
+        return head_end, tail_start
+    async def prepare_context_async(
+        self,
+        messages: list[ChatMessage],
+        token_budget: int,
+    ) -> list[ChatMessage]:
+        """Async version that can use LLM for summarization."""
+        if not messages:
+            return messages
+        current_tokens = self._count_tokens(messages)
+        if current_tokens <= token_budget:
+            return messages
+        self.compaction_count += 1
+        # Find safe split points that don't break tool call/result pairs
+        head_end, tail_start = self._find_safe_split_points(messages)
+        head = messages[:head_end]
+        tail = messages[tail_start:]
+        middle = messages[head_end:tail_start]
+        if not middle:
+            # Nothing to summarize - return as is
+            return messages
+        # Extract the original task from the first user message
+        original_task = ""
+        for msg in head:
+            if msg.role == "user" and msg.content:
+                original_task = msg.content
+                break
+        # Generate summary of middle section with task context
+        summary_text = await self._generate_summary(middle, original_task)
+        # Create a user message that clearly instructs continuation
+        # This works better than assistant role because it's a clear directive
+        summary_message = ChatMessage(
+            role="user",
+            content=f"""[CONTEXT CHECKPOINT - Your previous work has been summarized below]
+{summary_text}
+---
+IMPORTANT: The files listed above have ALREADY been read and analyzed.
+DO NOT re-read them - that would waste tokens and duplicate work.
+Continue from where you left off, completing any remaining items listed in "WHAT REMAINS".
+If all files have been read, proceed to generate the final output.""",
+        )
+        # Build compacted message list
+        compacted = head + [summary_message] + tail
+        # Track savings
+        compacted_tokens = self._count_tokens(compacted)
+        self.total_tokens_saved += current_tokens - compacted_tokens
+        return compacted
+    def _compact_with_summary_sync(
+        self, messages: list[ChatMessage], token_budget: int
+    ) -> list[ChatMessage]:
+        """Synchronous compaction with simple summary extraction."""
+        self.compaction_count += 1
+        # Find safe split points that don't break tool call/result pairs
+        head_end, tail_start = self._find_safe_split_points(messages)
+        head = messages[:head_end]
+        tail = messages[tail_start:]
+        middle = messages[head_end:tail_start]
+        if not middle:
+            return messages
+        # Extract key info without LLM
+        summary_text = self._extract_key_info(middle)
+        summary_message = ChatMessage(
+            role="user",
+            content=f"[CONTEXT SUMMARY - Previous {len(middle)} messages compressed]\n\n{summary_text}\n\n[END SUMMARY - Continue from here]",
+        )
+        compacted = head + [summary_message] + tail
+        compacted_tokens = self._count_tokens(compacted)
+        current_tokens = self._count_tokens(messages)
+        self.total_tokens_saved += current_tokens - compacted_tokens
+        return compacted
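
Review note: the head/tail compaction in this file depends on two ideas, atomic tool-call groups and a split token budget. The sketch below restates them in a self-contained form. `Msg`, `count_tokens`, `atomic_groups`, and `head_tail` are hypothetical stand-ins written for this note, not the classes from the commit, and the token count is a rough characters-per-token estimate rather than tiktoken.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Msg:
    """Hypothetical stand-in for ChatMessage (only the fields used here)."""
    role: str
    content: str = ""
    tool_call_ids: list[str] = field(default_factory=list)  # ids issued by an assistant message
    tool_call_id: Optional[str] = None  # id answered by a tool message


def count_tokens(msgs: list[Msg]) -> int:
    # Assumption: ~4 characters per token plus 4 tokens of per-message overhead.
    return sum(4 + len(m.content) // 4 for m in msgs)


def atomic_groups(msgs: list[Msg]) -> list[tuple[int, ...]]:
    """Group each tool-calling message with the tool results that answer it."""
    groups: list[tuple[int, ...]] = []
    i = 0
    while i < len(msgs):
        pending = set(msgs[i].tool_call_ids)
        idxs = [i]
        j = i + 1
        while j < len(msgs) and pending:
            if msgs[j].role == "tool" and msgs[j].tool_call_id in pending:
                idxs.append(j)
                pending.discard(msgs[j].tool_call_id)
            j += 1
        groups.append(tuple(idxs))
        i = idxs[-1] + 1
    return groups


def head_tail(msgs: list[Msg], budget: int, head_ratio: float = 0.2) -> list[Msg]:
    """Keep whole groups from the start and from the end; drop the middle."""
    if count_tokens(msgs) <= budget:
        return msgs
    groups = atomic_groups(msgs)
    head_budget = int(budget * head_ratio)
    tail_budget = budget - head_budget
    head: list[tuple[int, ...]] = []
    used = 0
    for g in groups:
        cost = count_tokens([msgs[i] for i in g])
        if used + cost > head_budget:
            break
        head.append(g)
        used += cost
    tail: list[tuple[int, ...]] = []
    used = 0
    for g in reversed(groups[len(head):]):
        cost = count_tokens([msgs[i] for i in g])
        if used + cost > tail_budget:
            break
        tail.insert(0, g)
        used += cost
    keep = sorted(i for g in head + tail for i in g)
    return [msgs[i] for i in keep]


history = [
    Msg("system", "You are a coding agent."),
    Msg("user", "Summarize the repo."),
    Msg("assistant", "", tool_call_ids=["c1"]),
    Msg("tool", "x" * 400, tool_call_id="c1"),  # large result, too big for either half
    Msg("assistant", "", tool_call_ids=["c2"]),
    Msg("tool", "y" * 40, tool_call_id="c2"),
    Msg("assistant", "Done reading."),
]
compacted = head_tail(history, budget=60, head_ratio=0.5)
print([m.role for m in compacted])
```

In this example the oversized `c1` call and its result are dropped together, so the compacted history never contains a tool call without its result.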

src/flow/harness/miniagent/harness.py ADDED

	@@ -0,0 +1,403 @@

+"""MiniAgent harness - implements BaseHarness for Flow integration.
+This harness adapts MiniAgent's ChatAgent to Flow's harness interface,
+enabling experiments with correct context compaction.
+"""
+from __future__ import annotations
+import logging
+import uuid
+from collections.abc import AsyncIterator
+from pathlib import Path
+from typing import TYPE_CHECKING, Any
+from flow.harness.base import BaseHarness, Event, EventType
+if TYPE_CHECKING:
+    from flow.experiments.models import Agent
+    from flow.llm import LLMClientConfig
+from .agent import ChatAgent, AgentThread, StreamEvent, StreamEventType
+from .context import (
+    ContextStrategy,
+    NoCompactionStrategy,
+    HeadTailStrategy,
+    SlidingWindowStrategy,
+    SummarizationStrategy,
+)
+from .client import ChatClient
+from .otel import enable_instrumentation
+from .instructions import get_instructions
+from flow.tools import Tool
+logger = logging.getLogger(__name__)
+# Enable instrumentation on module load (like MAF does)
+enable_instrumentation()
+class MiniAgentHarness(BaseHarness):
+    """Harness adapter for MiniAgent.
+    This adapter:
+    1. Maps Flow's Agent spec to MiniAgent's ChatAgent
+    2. Maps CompactionConfig to ContextStrategy
+    3. Converts StreamEvents to Flow Events
+    4. Injects OTEL hooks for trace collection
+    Example:
+        >>> from flow.harness.miniagent import MiniAgentHarness
+        >>> from flow.experiments.models import Agent, CompactionConfig
+        >>>
+        >>> agent = Agent(
+        ...     name="test",
+        ...     framework="miniagent",
+        ...     compaction=CompactionConfig.head_tail_tokens(0.2, 50_000),
+        ... )
+        >>> harness = MiniAgentHarness.from_agent(agent, workspace=Path("/tmp"))
+        >>> async for event in harness.run_stream("Hello"):
+        ...     print(event)
+    """
+    @classmethod
+    def from_agent(
+        cls,
+        agent: "Agent",
+        workspace: Path,
+        llm_config: "LLMClientConfig | None" = None,
+    ) -> "MiniAgentHarness":
+        """Create a MiniAgentHarness from an Agent definition.
+        Args:
+            agent: The Agent spec defining the configuration
+            workspace: Working directory for the agent
+            llm_config: Optional LLM configuration (falls back to env vars if not provided)
+        Returns:
+            A configured MiniAgentHarness instance
+        """
+        from flow.experiments.models import resolve_tools
+        # 1. Map CompactionConfig → ContextStrategy
+        context_strategy = cls._create_context_strategy(agent)
+        # 2. Build tools from spec
+        tools_spec = resolve_tools(agent.tools)
+        tools = cls._build_tools(tools_spec, workspace)
+        # 3. Create OTEL hooks for trace collection
+        from .otel import create_otel_hooks
+        otel_hooks = create_otel_hooks(model=agent.model or "gpt-4o")
+        # 4. Create ChatClient from LLM config or env
+        from .client import ClientConfig
+        if llm_config is not None:
+            # Use provided LLM config
+            config = cls._create_client_config_from_llm_config(llm_config)
+        else:
+            # Fall back to env vars
+            config = ClientConfig.from_env()
+            if agent.model:
+                config.model = agent.model
+        chat_client = ChatClient(config)
+        # Resolve instructions: explicit > preset > default "coding"
+        if agent.instructions:
+            instructions = agent.instructions
+        elif agent.instructions_preset:
+            instructions = get_instructions(agent.instructions_preset)
+        else:
+            instructions = get_instructions("coding")
+        chat_agent = ChatAgent(
+            client=chat_client,
+            instructions=instructions,
+            tools=tools,
+            context_strategy=context_strategy,
+            token_budget=agent.compaction.token_budget,
+            hooks=otel_hooks,
+        )
+        return cls(chat_agent, workspace)
+    @classmethod
+    def _create_client_config_from_llm_config(
+        cls, llm_config: "LLMClientConfig"
+    ) -> "ClientConfig":
+        """Create MiniAgent ClientConfig from Flow LLMClientConfig.
+        Args:
+            llm_config: Flow's LLM client configuration
+        Returns:
+            MiniAgent ClientConfig
+        """
+        from flow.llm import LLMProvider
+        from .client import ClientConfig
+        match llm_config.provider:
+            case LLMProvider.AZURE_OPENAI:
+                if not llm_config.azure_openai:
+                    raise ValueError("azure_openai config required for Azure OpenAI provider")
+                return ClientConfig(
+                    api_key=llm_config.azure_openai.get_api_key(),
+                    model=llm_config.azure_openai.deployment,
+                    endpoint=llm_config.azure_openai.get_endpoint(),
+                    api_version=llm_config.azure_openai.api_version,
+                )
+            case LLMProvider.OPENAI:
+                if not llm_config.openai:
+                    raise ValueError("openai config required for OpenAI provider")
+                return ClientConfig(
+                    api_key=llm_config.openai.get_api_key(),
+                    model=llm_config.openai.model_id,
+                    endpoint=llm_config.openai.base_url,
+                )
+            case LLMProvider.CUSTOM:
+                if not llm_config.custom:
+                    raise ValueError("custom config required for custom provider")
+                return ClientConfig(
+                    api_key=llm_config.custom.get_api_key(),
+                    model=llm_config.custom.model_id,
+                    endpoint=llm_config.custom.base_url,
+                )
+            case _:
+                raise ValueError(
+                    f"MiniAgent does not support provider: {llm_config.provider.value}. "
+                    f"Supported: openai, azure_openai, custom"
+                )
+    @classmethod
+    def _create_context_strategy(cls, agent: "Agent") -> ContextStrategy:
+        """Map Flow's CompactionConfig to MiniAgent's ContextStrategy."""
+        config = agent.compaction
+        match config.strategy:
+            case "none":
+                return NoCompactionStrategy()
+            case "head_tail":
+                # Legacy message-count based → convert to ratio
+                total = config.head_size + config.tail_size
+                ratio = config.head_size / total if total > 0 else 0.2
+                return HeadTailStrategy(head_ratio=ratio)
+            case "head_tail_tokens":
+                return HeadTailStrategy(
+                    head_ratio=config.params.get("head_ratio", 0.2)
+                )
+            case "sliding_window":
+                return SlidingWindowStrategy()
+            case "summarization":
+                # SummarizationStrategy needs a client for LLM calls
+                return SummarizationStrategy(
+                    client=ChatClient(),
+                    head_messages=config.params.get("head_messages", 2),
+                    tail_messages=config.params.get("tail_messages", 4),
+                    summary_max_tokens=config.params.get("summary_max_tokens", 1000),
+                )
+            case "last_n":
+                # Map to sliding window as closest equivalent
+                return SlidingWindowStrategy()
+            case _:
+                logger.warning(f"Unknown compaction strategy: {config.strategy}, using none")
+                return NoCompactionStrategy()
+    @classmethod
+    def _build_tools(cls, tools_spec: dict[str, dict[str, Any]], workspace: Path) -> list[Tool]:
+        """Build MiniAgent Tools from Flow tool spec.
+        Uses the shared tools from flow.tools, setting up the workspace
+        for tools that need persistent state.
+        Args:
+            tools_spec: Dict mapping tool names to their configs
+            workspace: Working directory for tools
+        Returns:
+            List of Tool instances
+        """
+        # Import shared tools
+        from flow.tools import (
+            # Coding
+            read_file, write_file, edit_file, multi_edit, glob_files, grep, ls,
+            # Execution
+            bash, check_processes, python_repl,
+            # Planning
+            think, todo_write, todo_read,
+            # Memory
+            memory, create_memory_tool,
+            # Web
+            web_search, web_fetch,
+            # Notebooks
+            notebook_edit, notebook_read,
+            # Skills
+            skills, create_skills_tool,
+            # Sub-agent
+            task, create_task_tool,
+            # Workspace management
+            set_workspace, Workspace,
+        )
+        # Set workspace for tools that need it (memory, todos, etc.)
+        set_workspace(Workspace(workspace))
+        # Map tool names → Tool instances
+        tool_map: dict[str, Tool] = {
+            # Coding/Filesystem
+            "read_file": read_file,
+            "write_file": write_file,
+            "edit_file": edit_file,
+            "multi_edit": multi_edit,
+            "glob_files": glob_files,
+            "ls": ls,
+            "grep": grep,
+            # Execution
+            "bash": bash,
+            "check_processes": check_processes,
+            "python_repl": python_repl,
+            # Planning
+            "think": think,
+            "todo_write": todo_write,
+            "todo_read": todo_read,
+            # Web
+            "web_search": web_search,
+            "web_fetch": web_fetch,
+            # Notebooks
+            "notebook_edit": notebook_edit,
+            "notebook_read": notebook_read,
+            # Memory (default instance)
+            "memory": memory,
+            # Skills (default instance)
+            "skills": skills,
+            # Task/sub-agent (default instance)
+            "task": task,
+        }
+        tools: list[Tool] = []
+        for name, config in tools_spec.items():
+            if name == "task" and config:
+                # Task tool with custom config. Checked before the generic
+                # lookup because "task" is also a key in tool_map.
+                tools.append(create_task_tool(
+                    coordinator_tools=list(tool_map.values()),
+                    model=config.get("model"),
+                ))
+            elif name in tool_map:
+                tools.append(tool_map[name])
+            else:
+                logger.warning(f"Unknown tool: {name}")
+        return tools
+    def __init__(self, agent: ChatAgent, workspace: Path) -> None:
+        """Initialize the harness.
+        Args:
+            agent: The MiniAgent ChatAgent instance
+            workspace: Working directory
+        """
+        self._agent = agent
+        self._workspace = workspace
+        self._thread: AgentThread | None = None
+        self._thread_id: str | None = None
+    async def run_stream(self, task: str) -> AsyncIterator[Event]:
+        """Run a task with streaming events.
+        Args:
+            task: The task/prompt to execute
+        Yields:
+            Event objects representing agent activity
+        """
+        if self._thread is None:
+            self._thread = self._agent.get_new_thread()
+        try:
+            async for event in self._agent.run_stream(task, thread=self._thread):
+                flow_event = self._convert_event(event)
+                if flow_event:
+                    yield flow_event
+            yield Event(type=EventType.DONE)
+        except Exception as e:
+            logger.exception(f"Error in MiniAgent execution: {e}")
+            yield Event(type=EventType.ERROR, content=str(e))
+    def _convert_event(self, event: StreamEvent) -> Event | None:
+        """Convert a MiniAgent StreamEvent to a Flow Event.
+        Args:
+            event: StreamEvent from MiniAgent
+        Returns:
+            Flow Event or None if no conversion needed
+        """
+        match event.type:
+            case StreamEventType.AGENT_START:
+                # Could emit a thinking event
+                return None
+            case StreamEventType.MODEL_START:
+                return Event(
+                    type=EventType.THINKING,
+                    content=f"Iteration {event.data.get('iteration', 0) + 1}",
+                )
+            case StreamEventType.MODEL_END:
+                # Token usage tracked via OTEL, no event needed
+                return None
+            case StreamEventType.TOOL_START:
+                return Event(
+                    type=EventType.TOOL_CALL_START,
+                    tool_name=str(event.data.get("tool_name", "")),
+                )
+            case StreamEventType.TOOL_END:
+                return Event(
+                    type=EventType.TOOL_RESULT,
+                    content=str(event.data.get("tool_output", ""))[:1000],  # Truncate
+                    tool_name=str(event.data.get("tool_name", "")),
+                )
+            case StreamEventType.TEXT:
+                content = event.data.get("content", "")
+                if content:
+                    return Event(type=EventType.TEXT_DELTA, content=str(content))
+                return None
+            case StreamEventType.AGENT_END:
+                # Don't include content - it was already streamed via TEXT events
+                # TEXT_DONE just signals completion
+                return Event(type=EventType.TEXT_DONE, content="")
+            case _:
+                return None
+    def get_thread_id(self) -> str:
+        """Get the current thread ID.
+        Returns:
+            The current conversation thread ID
+        """
+        if self._thread_id is None:
+            self._thread_id = str(uuid.uuid4())
+        return self._thread_id
+    async def close(self) -> None:
+        """Clean up resources used by the harness."""
+        self._thread = None
+        self._thread_id = None
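
Review note: for the legacy `head_tail` strategy, `_create_context_strategy` converts message counts into a budget ratio, which `HeadTailStrategy` then turns into head and tail token budgets. The arithmetic can be sketched in isolation as follows. The helper names `head_ratio_from_counts` and `split_budget` are hypothetical and exist only in this note. The 10/40 counts are illustrative, and the 50,000 budget is borrowed from the docstring example.

```python
def head_ratio_from_counts(head_size: int, tail_size: int, default: float = 0.2) -> float:
    """Share of the token budget given to the head, derived from legacy message counts."""
    total = head_size + tail_size
    return head_size / total if total > 0 else default


def split_budget(token_budget: int, head_ratio: float) -> tuple[int, int]:
    """Head gets int(budget * ratio); the tail gets whatever is left."""
    head_budget = int(token_budget * head_ratio)
    return head_budget, token_budget - head_budget


ratio = head_ratio_from_counts(head_size=10, tail_size=40)
head_budget, tail_budget = split_budget(50_000, ratio)
print(ratio, head_budget, tail_budget)
```

Because the tail budget is computed as the remainder, the two budgets always sum to the full token budget even when the ratio does not divide it evenly.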

src/flow/harness/miniagent/hooks.py ADDED

	@@ -0,0 +1,209 @@

+"""Hook types and event definitions for MiniAgent.
+Inspired by Claude Agent SDK's hooks system. Hooks allow applications to:
+- Observe: Monitor what's happening (logging, metrics)
+- Modify: Change inputs, inject context
+- Control: Block tool calls, stop execution
+"""
+from dataclasses import dataclass, field
+from typing import Any, Callable, Awaitable, Literal
+from enum import Enum
+class HookEvent(str, Enum):
+    """All supported hook events."""
+    PRE_TOOL_USE = "pre_tool_use"
+    POST_TOOL_USE = "post_tool_use"
+    PRE_MODEL_CALL = "pre_model_call"
+    POST_MODEL_CALL = "post_model_call"
+    PRE_COMPACT = "pre_compact"
+    POST_COMPACT = "post_compact"
+    AGENT_START = "agent_start"
+    AGENT_END = "agent_end"
+# === Event Data Classes ===
+@dataclass
+class PreToolUseEvent:
+    """Fired before a tool is executed.
+    Hooks can inspect and optionally block or modify the tool call.
+    """
+    tool_name: str
+    tool_input: dict[str, Any]
+    tool_call_id: str
+    iteration: int
+@dataclass
+class PreToolUseResult:
+    """Result from PreToolUse hook.
+    Controls whether the tool call proceeds.
+    """
+    decision: Literal["allow", "block", "modify"] = "allow"
+    reason: str | None = None  # Shown to model if blocked
+    modified_input: dict[str, Any] | None = None  # If decision="modify"
+@dataclass
+class PostToolUseEvent:
+    """Fired after a tool executes.
+    Hooks can inject additional context or stop execution.
+    """
+    tool_name: str
+    tool_input: dict[str, Any]
+    tool_output: str
+    tool_call_id: str
+    iteration: int
+    error: str | None = None
+@dataclass
+class PostToolUseResult:
+    """Result from PostToolUse hook."""
+    additional_context: str | None = None  # Injected into next message
+    stop_execution: bool = False
+    stop_reason: str | None = None
+@dataclass
+class PreModelCallEvent:
+    """Fired before an LLM call.
+    Useful for logging, metrics, or inspecting the context.
+    """
+    message_count: int
+    iteration: int
+    estimated_tokens: int | None = None
+@dataclass
+class PostModelCallEvent:
+    """Fired after an LLM call.
+    Contains usage information and the raw response.
+    """
+    usage: dict[str, int]
+    iteration: int
+    has_tool_calls: bool
+    finish_reason: str | None = None
+    response_text: str = ""  # The model's text response (non-tool content)
+@dataclass
+class PreCompactEvent:
+    """Fired before context compaction.
+    Allows monitoring when compaction is triggered.
+    """
+    message_count: int
+    current_tokens: int
+    budget: int
+    trigger: Literal["auto", "manual"]
+@dataclass
+class PostCompactEvent:
+    """Fired after context compaction.
+    Reports how much was compacted.
+    """
+    messages_before: int
+    messages_after: int
+    tokens_before: int
+    tokens_after: int
+@dataclass
+class AgentStartEvent:
+    """Fired when agent.run() starts."""
+    user_message: str
+    thread_message_count: int
+@dataclass
+class AgentEndEvent:
+    """Fired when agent.run() completes."""
+    final_response: str | None
+    total_iterations: int
+    total_input_tokens: int
+    total_output_tokens: int
+    tool_calls_made: int
+# === Hook Type Aliases ===
+PreToolUseHook = Callable[[PreToolUseEvent], Awaitable[PreToolUseResult | None]]
+PostToolUseHook = Callable[[PostToolUseEvent], Awaitable[PostToolUseResult | None]]
+PreModelCallHook = Callable[[PreModelCallEvent], Awaitable[None]]
+PostModelCallHook = Callable[[PostModelCallEvent], Awaitable[None]]
+PreCompactHook = Callable[[PreCompactEvent], Awaitable[None]]
+PostCompactHook = Callable[[PostCompactEvent], Awaitable[None]]
+AgentStartHook = Callable[[AgentStartEvent], Awaitable[None]]
+AgentEndHook = Callable[[AgentEndEvent], Awaitable[None]]
+def _pre_tool_use_factory() -> "list[PreToolUseHook]":
+    return []
+def _post_tool_use_factory() -> "list[PostToolUseHook]":
+    return []
+def _pre_model_call_factory() -> "list[PreModelCallHook]":
+    return []
+def _post_model_call_factory() -> "list[PostModelCallHook]":
+    return []
+def _pre_compact_factory() -> "list[PreCompactHook]":
+    return []
+def _post_compact_factory() -> "list[PostCompactHook]":
+    return []
+def _agent_start_factory() -> "list[AgentStartHook]":
+    return []
+def _agent_end_factory() -> "list[AgentEndHook]":
+    return []
+@dataclass
+class Hooks:
+    """Hook configuration for ChatAgent.
+    All hook lists are optional. Multiple hooks can be registered
+    for each event - they are called in order.
+    Example:
+        async def log_tokens(event: PostModelCallEvent) -> None:
+            print(f"Used {event.usage.get('input_tokens', 0)} input tokens")
+        hooks = Hooks(post_model_call=[log_tokens])
+        agent = ChatAgent(hooks=hooks)
+    """
+    pre_tool_use: "list[PreToolUseHook]" = field(default_factory=_pre_tool_use_factory)
+    post_tool_use: "list[PostToolUseHook]" = field(default_factory=_post_tool_use_factory)
+    pre_model_call: "list[PreModelCallHook]" = field(default_factory=_pre_model_call_factory)
+    post_model_call: "list[PostModelCallHook]" = field(default_factory=_post_model_call_factory)
+    pre_compact: "list[PreCompactHook]" = field(default_factory=_pre_compact_factory)
+    post_compact: "list[PostCompactHook]" = field(default_factory=_post_compact_factory)
+    agent_start: "list[AgentStartHook]" = field(default_factory=_agent_start_factory)
+    agent_end: "list[AgentEndHook]" = field(default_factory=_agent_end_factory)

src/flow/harness/miniagent/instructions.py ADDED

	@@ -0,0 +1,207 @@

+"""System instructions for MiniAgent.
+Provides well-crafted system prompts that guide the agent on:
+- How to use each tool effectively
+- When to use todo_write for task tracking
+- When to delegate with the task tool
+- Best practices for coding tasks
+These instructions are informed by Claude Code's system prompt design.
+"""
+# =============================================================================
+# Shared instruction blocks - reused across presets
+# =============================================================================
+TASK_COMPLETION_INSTRUCTIONS = """
+## CRITICAL: Task Completion
+You MUST complete tasks fully and autonomously.
+### Execution Rules
+- DO NOT ask "Should I proceed?" or "Would you like me to continue?"
+- DO NOT propose alternatives or phased approaches when the task is clear
+- DO NOT decline tasks due to scope concerns - break them into steps instead
+- DO NOT stop halfway - keep iterating until the task is DONE
+- If something fails, try alternatives automatically
+### Use Todos for Multi-Step Tasks
+For any task with 3+ steps:
+1. Create a todo list with `todo_write` at the START
+2. Mark tasks `in_progress` as you work on them (only ONE at a time)
+3. Mark tasks `completed` IMMEDIATELY when done
+4. Keep working until ALL todos are checked off
+5. Do NOT finish until the todo list shows everything complete
+### Iteration Until Success
+- If a step fails, diagnose and retry with a different approach
+- If you hit a blocker, note it and move to the next step
+- Circle back to unfinished items
+- The task is NOT done until all requirements are met
+"""
+EFFICIENCY_INSTRUCTIONS = """
+## Efficiency Guidelines
+### Batch Tool Calls
+Call ALL independent tools in a SINGLE response:
+- Read 5 files? Call read_file 5 times in one response.
+- Search multiple patterns? Call grep multiple times in one response.
+- List directories and read files? Call both in one response.
+### Read Files Fully
+Read ENTIRE files (default limit 2000 lines). Do NOT chunk files into small pieces (40-60 lines) - this wastes API calls and context.
+### Search Then Batch Read
+1. Use glob_files or grep to find relevant files
+2. Read ALL matching files in a single response
+"""
+BEST_PRACTICES_INSTRUCTIONS = """
+## Best Practices
+### Before Editing
+NEVER edit a file you haven't read. Always use `read_file` first.
+### Follow Existing Patterns
+Before writing new code, examine neighboring files to understand:
+- Naming conventions
+- Import style
+- Error handling patterns
+- Framework usage
+### Don't Over-Engineer
+- Solve the current problem, not hypothetical future ones
+- Prefer editing existing files over creating new ones
+- NEVER proactively create documentation files unless explicitly requested
+- Don't add features beyond what was asked
+### Verify Dependencies
+Never assume libraries exist. Check package.json, requirements.txt, or equivalent before importing.
+### Security
+- Refuse to write code that could be used maliciously
+- Never expose secrets, API keys, or credentials in code
+- If files seem related to malware, refuse to help
+"""
+# =============================================================================
+# Preset-specific instructions
+# =============================================================================
+CODING_AGENT_INSTRUCTIONS = f"""You are an expert coding assistant. You help users with software engineering tasks including writing code, debugging, refactoring, and explaining code.
+## Response Style
+- Be concise and direct in explanations, but thorough in execution.
+- Use GitHub-flavored markdown for formatting.
+- When referencing code, use the pattern `file_path:line_number` (e.g., `src/utils.py:42`).
+- Don't add unnecessary preamble or postamble. Get to work.
+- Only use emojis if explicitly requested.
+{TASK_COMPLETION_INSTRUCTIONS}
+## Tool Usage
+### File Operations
+- **read_file**: Read file contents with line numbers. Always read before editing.
+- **write_file**: Create new files or completely replace file contents.
+- **edit_file**: Make targeted edits by replacing specific text (must be unique in file).
+- **multi_edit**: Make multiple edits to a file atomically (all succeed or all fail).
+- **glob_files**: Find files by pattern (e.g., `**/*.py`, `src/**/*.ts`).
+- **grep**: Search file contents with regex. Returns matching lines with context.
+- **ls**: List directory contents.
+### Execution
+- **bash**: Execute shell commands. Use for git, running tests, installing packages.
+### Planning
+- **think**: Reason through complex problems before acting.
+- **todo_write**: Track progress on multi-step tasks. USE THIS FREQUENTLY.
+- **todo_read**: Check current task status.
+### Delegation (if available)
+- **task**: Delegate complex sub-tasks to a specialist agent with isolated context.
+{EFFICIENCY_INSTRUCTIONS}
+{BEST_PRACTICES_INSTRUCTIONS}
+"""
+RESEARCH_AGENT_INSTRUCTIONS = f"""You are a research assistant. You help users find information, synthesize knowledge, and answer questions.
+## Response Style
+- Be thorough in research, concise in presentation.
+- Cite sources with URLs when reporting findings.
+- Synthesize information - don't just list results.
+{TASK_COMPLETION_INSTRUCTIONS}
+## Tools
+### Search & Fetch
+- **web_search**: Search the web for information.
+- **web_fetch**: Fetch and read web page contents.
+### Planning
+- **think**: Work through complex questions step by step.
+- **todo_write**: Track research progress on multi-part questions.
+## Research Strategy
+1. Start with broad searches to identify relevant sources
+2. Fetch multiple promising URLs in parallel (batch web_fetch calls)
+3. Synthesize findings into a coherent answer
+4. If initial searches don't answer the question, refine and search again
+## Guidelines
+1. **Be thorough**: Search multiple queries if needed - batch them.
+2. **Cite sources**: Include URLs when reporting findings.
+3. **Synthesize**: Draw conclusions, don't just list results.
+4. **Keep going**: If first searches don't work, try different queries.
+5. **Acknowledge uncertainty**: If information is unclear, say so.
+"""
+EXPLORE_AGENT_INSTRUCTIONS = f"""You are a codebase exploration specialist. Your job is to quickly find and understand code.
+## Response Style
+- Be concise. Your response goes to another agent, so be self-contained.
+- Include file paths and line numbers in findings.
+- Summarize what you found, don't dump raw content.
+{TASK_COMPLETION_INSTRUCTIONS}
+## Tools
+- **read_file**: Read file contents (read fully, don't chunk).
+- **glob_files**: Find files by pattern.
+- **grep**: Search file contents with regex.
+- **ls**: List directory contents.
+- **think**: Reason about what you're finding.
+- **todo_write**: Track exploration progress for complex searches.
+{EFFICIENCY_INSTRUCTIONS}
+## Guidelines
+1. **Start broad, then narrow**: Use glob/grep to find candidates, then batch-read.
+2. **Be efficient**: Don't read files you don't need.
+3. **Report clearly**: Include file paths and line numbers.
+4. **Keep searching**: If first attempt doesn't find what's needed, try different patterns.
+5. **Summarize**: Be self-contained for the calling agent.
+"""
+# =============================================================================
+# Instruction presets registry
+# =============================================================================
+INSTRUCTIONS = {
+    "coding": CODING_AGENT_INSTRUCTIONS,
+    "research": RESEARCH_AGENT_INSTRUCTIONS,
+    "explore": EXPLORE_AGENT_INSTRUCTIONS,
+}
+def get_instructions(preset: str = "coding") -> str:
+    """Get system instructions by preset name.
+    Args:
+        preset: One of 'coding', 'research', 'explore'. Any other name
+            falls back to the 'coding' preset.
+    Returns:
+        System instruction string
+    """
+    return INSTRUCTIONS.get(preset, CODING_AGENT_INSTRUCTIONS)
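
The preset pattern above (shared blocks spliced into each prompt, plus a registry lookup with a default) can be sketched in a few lines. The prompt text and the names `SHARED_RULES` and `PRESETS` here are placeholders for illustration, not the strings from the diff.

```python
# Shared block reused by every preset, in the spirit of TASK_COMPLETION_INSTRUCTIONS.
SHARED_RULES = """
## Shared rules
- Finish the task before responding.
"""

# Each preset splices the shared block into its own prompt at definition time.
PRESETS = {
    "coding": f"You are a coding assistant.{SHARED_RULES}",
    "research": f"You are a research assistant.{SHARED_RULES}",
}


def get_instructions(preset: str = "coding") -> str:
    """Look up a preset; unknown names silently fall back to 'coding'."""
    return PRESETS.get(preset, PRESETS["coding"])
```

The silent fallback is convenient for a CLI default, but a typo in the preset name goes unnoticed; a caller that needs strictness can check `preset in PRESETS` first.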

src/flow/harness/miniagent/messages.py ADDED

	@@ -0,0 +1,88 @@

+"""Message types for MiniAgent.
+Defines the core message structures used in the agent loop.
+"""
+from dataclasses import dataclass
+from typing import Any, Literal
+Role = Literal["system", "user", "assistant", "tool"]
+@dataclass
+class ToolCall:
+    """A tool call request from the model."""
+    id: str
+    name: str
+    arguments: str  # JSON string
+@dataclass
+class ToolResult:
+    """Result of executing a tool."""
+    call_id: str
+    result: str
+    error: str | None = None
+@dataclass
+class ChatMessage:
+    """A message in the conversation.
+    Supports all OpenAI message roles and tool calling.
+    """
+    role: Role
+    content: str | None = None
+    tool_calls: list[ToolCall] | None = None
+    tool_call_id: str | None = None  # For tool role messages
+    name: str | None = None  # Optional name for the message author
+    def to_openai_format(self) -> dict[str, Any]:
+        """Convert to OpenAI API format."""
+        msg: dict[str, Any] = {"role": self.role}
+        if self.content is not None:
+            msg["content"] = self.content
+        if self.tool_calls:
+            msg["tool_calls"] = [
+                {
+                    "id": tc.id,
+                    "type": "function",
+                    "function": {"name": tc.name, "arguments": tc.arguments},
+                }
+                for tc in self.tool_calls
+            ]
+        if self.tool_call_id:
+            msg["tool_call_id"] = self.tool_call_id
+        if self.name:
+            msg["name"] = self.name
+        return msg
+    @classmethod
+    def system(cls, content: str) -> "ChatMessage":
+        """Create a system message."""
+        return cls(role="system", content=content)
+    @classmethod
+    def user(cls, content: str) -> "ChatMessage":
+        """Create a user message."""
+        return cls(role="user", content=content)
+    @classmethod
+    def assistant(
+        cls, content: str | None = None, tool_calls: list[ToolCall] | None = None
+    ) -> "ChatMessage":
+        """Create an assistant message."""
+        return cls(role="assistant", content=content, tool_calls=tool_calls)
+    @classmethod
+    def tool(cls, call_id: str, content: str) -> "ChatMessage":
+        """Create a tool result message."""
+        return cls(role="tool", content=content, tool_call_id=call_id)
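
To make the wire format concrete, the following sketch builds the two dictionaries that one tool round trip produces: the assistant message carrying the call and the tool message carrying the result. It uses plain dicts rather than `ChatMessage` so it runs standalone; `tool_call_turn` is a hypothetical helper, and the `path` argument for `read_file` is an assumed parameter name.

```python
import json
from typing import Any


def tool_call_turn(
    call_id: str, name: str, args: dict[str, Any], result: str
) -> list[dict[str, Any]]:
    """Build the assistant tool-call message and the matching tool-result message."""
    assistant = {
        "role": "assistant",
        "tool_calls": [
            {
                "id": call_id,
                "type": "function",
                # Arguments travel as a JSON string, not as a dict.
                "function": {"name": name, "arguments": json.dumps(args)},
            }
        ],
    }
    # The tool message points back at the call through tool_call_id.
    tool = {"role": "tool", "content": result, "tool_call_id": call_id}
    return [assistant, tool]


messages = tool_call_turn("call_1", "read_file", {"path": "README.md"}, "# Flow")
```

The assistant message omits `content` entirely, which matches how `to_openai_format` skips the key when `content` is `None`.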

src/flow/harness/miniagent/otel.py ADDED

	@@ -0,0 +1,258 @@

+"""OpenTelemetry instrumentation for MiniAgent.
+This module provides OTEL span emission that conforms to GenAI semantic
+conventions, enabling Flow's metrics extraction pipeline to work with
+MiniAgent traces.
+Reference: https://opentelemetry.io/docs/specs/semconv/gen-ai/
+"""
+from __future__ import annotations
+from typing import TYPE_CHECKING
+from opentelemetry import trace
+if TYPE_CHECKING:
+    from .hooks import (
+        Hooks,
+        PreModelCallEvent,
+        PostModelCallEvent,
+        PreToolUseEvent,
+        PreToolUseResult,
+        PostToolUseEvent,
+        PostToolUseResult,
+    )
+__all__ = ["GenAIAttr", "create_otel_hooks", "enable_instrumentation"]
+# Track if instrumentation has been enabled
+_instrumentation_enabled = False
+class GenAIAttr:
+    """OpenTelemetry GenAI semantic convention attributes.
+    These match the attributes used by MAF/LangGraph harnesses for consistency.
+    Reference: https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/
+    """
+    # Operation
+    OPERATION_NAME = "gen_ai.operation.name"
+    PROVIDER_NAME = "gen_ai.provider.name"
+    # Model
+    REQUEST_MODEL = "gen_ai.request.model"
+    RESPONSE_MODEL = "gen_ai.response.model"
+    # Tokens
+    INPUT_TOKENS = "gen_ai.usage.input_tokens"
+    OUTPUT_TOKENS = "gen_ai.usage.output_tokens"
+    # Tool
+    TOOL_NAME = "gen_ai.tool.name"
+    TOOL_TYPE = "gen_ai.tool.type"
+    TOOL_CALL_ID = "gen_ai.tool.call.id"
+    # Error
+    ERROR_TYPE = "error.type"
+def _get_tracer() -> trace.Tracer:
+    """Get tracer lazily to ensure it uses the current TracerProvider.
+    This is important because the TracerProvider may be set up after
+    this module is imported (e.g., by Flow's experiment runner).
+    """
+    return trace.get_tracer("flow.miniagent", "0.1.0")
+def start_llm_span(model: str) -> trace.Span:
+    """Start a span for an LLM call.
+    Args:
+        model: The model name being called
+    Returns:
+            A started span for the LLM call (not made current; end it with end_llm_span)
+    """
+    span = _get_tracer().start_span(f"chat {model}", kind=trace.SpanKind.CLIENT)
+    span.set_attribute(GenAIAttr.OPERATION_NAME, "chat")
+    span.set_attribute(GenAIAttr.REQUEST_MODEL, model)
+    span.set_attribute(GenAIAttr.PROVIDER_NAME, "openai")
+    return span
+def end_llm_span(span: trace.Span, input_tokens: int, output_tokens: int) -> None:
+    """End an LLM span with token usage.
+    Args:
+        span: The span to end
+        input_tokens: Number of input tokens used
+        output_tokens: Number of output tokens generated
+    """
+    span.set_attribute(GenAIAttr.INPUT_TOKENS, input_tokens)
+    span.set_attribute(GenAIAttr.OUTPUT_TOKENS, output_tokens)
+    span.end()
+def start_tool_span(tool_name: str, call_id: str = "") -> trace.Span:
+    """Start a span for a tool call.
+    Args:
+        tool_name: Name of the tool being called
+        call_id: Optional tool call ID
+    Returns:
+            A started span for the tool call (not made current; end it with end_tool_span)
+    """
+    span = _get_tracer().start_span(f"execute_tool {tool_name}", kind=trace.SpanKind.INTERNAL)
+    span.set_attribute(GenAIAttr.OPERATION_NAME, "execute_tool")
+    span.set_attribute(GenAIAttr.TOOL_NAME, tool_name)
+    span.set_attribute(GenAIAttr.TOOL_TYPE, "function")
+    if call_id:
+        span.set_attribute(GenAIAttr.TOOL_CALL_ID, call_id)
+    return span
+def end_tool_span(span: trace.Span, error: Exception | None = None) -> None:
+    """End a tool span, optionally recording an error.
+    Args:
+        span: The span to end
+        error: Optional exception if the tool failed
+    """
+    if error:
+        span.set_attribute(GenAIAttr.ERROR_TYPE, type(error).__name__)
+        span.record_exception(error)
+        span.set_status(trace.StatusCode.ERROR, str(error))
+    span.end()
+class OTelHooks:
+    """Hook handlers that emit OTEL spans.
+    This class provides hook callbacks that instrument MiniAgent's
+    execution with OpenTelemetry spans, enabling trace collection
+    for Flow's evaluation pipeline.
+    Usage:
+        otel = OTelHooks(model="gpt-4o")
+        hooks = Hooks(
+            pre_model_call=[otel.on_pre_model_call],
+            post_model_call=[otel.on_post_model_call],
+            pre_tool_use=[otel.on_pre_tool_use],
+            post_tool_use=[otel.on_post_tool_use],
+        )
+        agent = ChatAgent(..., hooks=hooks)
+    """
+    def __init__(self, model: str = "gpt-4o"):
+        """Initialize OTEL hooks.
+        Args:
+            model: Default model name for spans
+        """
+        self.model = model
+        self._llm_spans: dict[int, trace.Span] = {}  # iteration -> span
+        self._tool_spans: dict[str, trace.Span] = {}  # call_id -> span
+    async def on_pre_model_call(self, event: "PreModelCallEvent") -> None:
+        """Start an LLM span before model call.
+        Args:
+            event: Pre-model call event with iteration info
+        """
+        span = start_llm_span(model=self.model)
+        self._llm_spans[event.iteration] = span
+    async def on_post_model_call(self, event: "PostModelCallEvent") -> None:
+        """End the LLM span after model call.
+        Args:
+            event: Post-model call event with usage info
+        """
+        span = self._llm_spans.pop(event.iteration, None)
+        if span:
+            input_tokens = event.usage.get("input_tokens", 0)
+            output_tokens = event.usage.get("output_tokens", 0)
+            end_llm_span(span, input_tokens, output_tokens)
+    async def on_pre_tool_use(self, event: "PreToolUseEvent") -> "PreToolUseResult | None":
+        """Start a tool span before tool execution.
+        Args:
+            event: Pre-tool use event with tool info
+        Returns:
+            None (don't block or modify)
+        """
+        span = start_tool_span(event.tool_name, event.tool_call_id)
+        self._tool_spans[event.tool_call_id] = span
+        return None  # Don't block
+    async def on_post_tool_use(self, event: "PostToolUseEvent") -> "PostToolUseResult | None":
+        """End the tool span after tool execution.
+        Args:
+            event: Post-tool use event with result info
+        Returns:
+            None (don't inject context or stop)
+        """
+        span = self._tool_spans.pop(event.tool_call_id, None)
+        if span:
+            error = Exception(event.error) if event.error else None
+            end_tool_span(span, error)
+        return None
+def enable_instrumentation() -> None:
+    """Enable OpenTelemetry instrumentation for MiniAgent.
+    Call this once before running agents to enable trace collection.
+    This is the MiniAgent equivalent of agent_framework.observability.enable_instrumentation().
+    Note: This function is idempotent - calling it multiple times is safe.
+    Example:
+        from flow.harness.miniagent.otel import enable_instrumentation
+        enable_instrumentation()
+        # Now traces will be collected when agents run
+    """
+    global _instrumentation_enabled
+    if _instrumentation_enabled:
+        return
+    # MiniAgent instrumentation is hook-based, so this is mainly a marker
+    # that indicates the system is ready for trace collection.
+    # The actual spans are created via OTelHooks attached to agents.
+    _instrumentation_enabled = True
+def create_otel_hooks(model: str = "gpt-4o") -> "Hooks":
+    """Create a Hooks instance with OTEL instrumentation.
+    This is the main entry point for adding OTEL tracing to a MiniAgent.
+    The returned Hooks object can be passed directly to ChatAgent.
+    Args:
+        model: Model name to use in LLM spans
+    Returns:
+        Hooks instance configured for OTEL tracing
+    Example:
+        hooks = create_otel_hooks(model="gpt-4o")
+        agent = ChatAgent(..., hooks=hooks)
+    """
+    from .hooks import Hooks
+    otel = OTelHooks(model=model)
+    return Hooks(
+        pre_model_call=[otel.on_pre_model_call],
+        post_model_call=[otel.on_post_model_call],
+        pre_tool_use=[otel.on_pre_tool_use],
+        post_tool_use=[otel.on_post_tool_use],
+    )
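
`OTelHooks` pairs each "pre" event with its "post" event by storing the open span in a dict keyed by `tool_call_id` (or by iteration for model calls) and popping it when the matching event arrives. The same bookkeeping can be sketched without OpenTelemetry by recording durations instead of spans. `ToolEvent` and `TimingHooks` below are hypothetical names used only for this illustration.

```python
import asyncio
import time
from dataclasses import dataclass


@dataclass
class ToolEvent:
    """Hypothetical stand-in carrying only the fields this sketch needs."""

    tool_name: str
    tool_call_id: str


class TimingHooks:
    """Pairs pre/post tool events by call ID, like OTelHooks does for spans."""

    def __init__(self) -> None:
        self._started: dict[str, float] = {}  # call_id -> start time
        self.durations: dict[str, float] = {}  # call_id -> elapsed seconds

    async def on_pre_tool_use(self, event: ToolEvent) -> None:
        self._started[event.tool_call_id] = time.perf_counter()

    async def on_post_tool_use(self, event: ToolEvent) -> None:
        # pop() with a default makes an unmatched "post" event a no-op.
        started = self._started.pop(event.tool_call_id, None)
        if started is not None:
            self.durations[event.tool_call_id] = time.perf_counter() - started


async def demo() -> TimingHooks:
    hooks = TimingHooks()
    a, b = ToolEvent("grep", "call_a"), ToolEvent("ls", "call_b")
    await hooks.on_pre_tool_use(a)
    await hooks.on_pre_tool_use(b)
    await hooks.on_post_tool_use(b)  # calls may finish out of order
    await hooks.on_post_tool_use(a)
    await hooks.on_post_tool_use(a)  # duplicate "post" is ignored
    return hooks


hooks = asyncio.run(demo())
```

Keying by call ID rather than by tool name is what lets two calls to the same tool overlap without clobbering each other's state.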

src/flow/harness/miniagent/tool.py ADDED

	@@ -0,0 +1,173 @@

+"""Tool definition and @tool decorator for MiniAgent.
+Provides a simple way to define tools that can be called by the LLM.
+"""
+from dataclasses import dataclass
+from typing import Any, Callable, Literal, get_type_hints, get_origin, get_args, Annotated
+import inspect
+@dataclass
+class Tool:
+    """A tool that can be called by the LLM.
+    Tools are functions with metadata that allows the LLM to understand
+    how to call them.
+    """
+    name: str
+    description: str
+    parameters: dict[str, Any]  # JSON Schema
+    func: Callable[..., Any]
+    def to_openai_tool(self) -> dict[str, Any]:
+        """Convert to OpenAI tool format."""
+        return {
+            "type": "function",
+            "function": {
+                "name": self.name,
+                "description": self.description,
+                "parameters": self.parameters,
+            },
+        }
+    async def invoke(self, **kwargs: Any) -> str:
+        """Execute the tool and return result as string.
+        Handles both sync and async functions. Exceptions raised by the tool are
+        caught and returned as an error string instead of propagating.
+        """
+        try:
+            result = self.func(**kwargs)
+            if inspect.iscoroutine(result):
+                result = await result
+            return str(result) if not isinstance(result, str) else result
+        except Exception as e:
+            return f"Error executing {self.name}: {str(e)}"
+def _python_type_to_json_schema(py_type: Any) -> dict[str, Any]:
+    """Convert a Python type hint to JSON Schema."""
+    # Handle None/NoneType
+    if py_type is None or py_type is type(None):
+        return {"type": "null"}
+    # Handle basic types
+    if py_type == str:
+        return {"type": "string"}
+    if py_type == int:
+        return {"type": "integer"}
+    if py_type == float:
+        return {"type": "number"}
+    if py_type == bool:
+        return {"type": "boolean"}
+    # Handle dict without type args
+    if py_type is dict:
+        return {"type": "object"}
+    # Handle generic containers such as list[X] and dict[K, V]
+    origin = get_origin(py_type)
+    args = get_args(py_type)
+    if origin is list:
+        if args:
+            return {"type": "array", "items": _python_type_to_json_schema(args[0])}
+        return {"type": "array"}
+    if origin is dict:
+        return {"type": "object"}
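+    # Handle Union types (including Optional).
+    # typing.Optional[X] and typing.Union[...] report typing.Union as their origin;
+    # the PEP 604 form `X | None` reports types.UnionType on Python 3.10-3.13.
+    from typing import Union  # local import so this check stands alone
+    if origin is Union or origin is type(int | str):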
+        non_none_args = [a for a in args if a is not type(None)]
+        if len(non_none_args) == 1:
+            # This is Optional[X]
+            return _python_type_to_json_schema(non_none_args[0])
+        # Multiple types - use anyOf
+        return {"anyOf": [_python_type_to_json_schema(a) for a in non_none_args]}
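+    # Handle Literal (only claim "string" when every value is a string)
+    if origin is Literal:
+        if all(isinstance(a, str) for a in args):
+            return {"type": "string", "enum": list(args)}
+        return {"enum": list(args)}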
+    # Default to string
+    return {"type": "string"}
+def tool(func: Callable[..., Any]) -> Tool:
+    """Decorator to convert a function into a Tool.
+    Uses type hints and Annotated[] for parameter descriptions.
+    The function's docstring becomes the tool description.
+    Usage:
+        @tool
+        def search(query: Annotated[str, "The search query"]) -> str:
+            '''Search the web for information.'''
+            return f"Results for: {query}"
+    """
+    # Get function signature
+    sig = inspect.signature(func)
+    # Get type hints (with extras for Annotated)
+    try:
+        hints = get_type_hints(func, include_extras=True)
+    except Exception:
+        hints = {}
+    # Build JSON Schema for parameters
+    properties: dict[str, Any] = {}
+    required: list[str] = []
+    for param_name, param in sig.parameters.items():
+        if param_name in ("self", "cls"):
+            continue
+        # Get the type hint
+        hint = hints.get(param_name, str)
+        description = ""
+        # Check if it's Annotated
+        if get_origin(hint) is Annotated:
+            args = get_args(hint)
+            actual_type = args[0]
+            # Look for string descriptions in the annotations
+            for annotation in args[1:]:
+                if isinstance(annotation, str):
+                    description = annotation
+                    break
+        else:
+            actual_type = hint
+        # Convert to JSON Schema
+        prop_schema = _python_type_to_json_schema(actual_type)
+        if description:
+            prop_schema["description"] = description
+        properties[param_name] = prop_schema
+        # Check if required (no default value)
+        if param.default is inspect.Parameter.empty:
+            required.append(param_name)
+    # Build the full schema
+    parameters_schema: dict[str, Any] = {
+        "type": "object",
+        "properties": properties,
+    }
+    if required:
+        parameters_schema["required"] = required
+    # Get description from docstring
+    description = func.__doc__ or f"Call the {func.__name__} function"
+    # Clean up the docstring - take first line/paragraph
+    description = description.strip().split("\n\n")[0].strip()
+    return Tool(
+        name=func.__name__,
+        description=description,
+        parameters=parameters_schema,
+        func=func,
+    )

src/flow/harness/miniagent/tools/__init__.py ADDED

	@@ -0,0 +1,125 @@

+"""Built-in tool library for MiniAgent.
+This module re-exports tools from the shared flow.tools module
+for backward compatibility with existing MiniAgent code.
+All tools are now implemented in flow.tools and shared across
+all harnesses (MiniAgent, MAF, etc.).
+Tool Categories:
+- File Operations: read_file, write_file, edit_file, multi_edit, glob_files, grep, ls
+- Notebooks: notebook_edit, notebook_read
+- Execution: bash, check_processes, python_repl
+- Planning: think, todo_write, todo_read
+- Memory: memory (agentic memory for persistence)
+- Web: web_search, web_fetch
+- Skills: skills
+- Sub-agents: task (for context isolation)
+Presets:
+- coding_tools(): Core tools for coding tasks
+- research_tools(): Web tools for research (alias of flow.tools.web_tools)
+- all_tools(): Everything
+Example:
+    from flow.harness.miniagent.tools import coding_tools, task
+    agent = ChatAgent(
+        instructions="You are a helpful coding assistant.",
+        tools=coding_tools() + [task],
+    )
+"""
+# Re-export everything from shared tools
+from flow.tools import (
+    # Base
+    Tool,
+    # File operations
+    read_file,
+    write_file,
+    edit_file,
+    multi_edit,
+    glob_files,
+    grep,
+    ls,
+    # Notebook operations
+    notebook_edit,
+    notebook_read,
+    # Execution
+    bash,
+    check_processes,
+    python_repl,
+    # Planning and reasoning
+    think,
+    todo_write,
+    todo_read,
+    # Web operations
+    web_search,
+    web_fetch,
+    # Memory
+    memory,
+    create_memory_tool,
+    # Skills
+    skills,
+    create_skills_tool,
+    # Sub-agent
+    task,
+    create_task_tool,
+    # Presets
+    coding_tools,
+    planning_tools,
+    web_tools as research_tools,
+    notebook_tools,
+    all_tools,
+)
+# Compatibility: reset_todos and get_todos from planning module
+from flow.tools.planning import reset_todos, get_todos
+# Compatibility: reset_memory from memory module
+from flow.tools.memory import reset_memory
+__all__ = [
+    # Base
+    "Tool",
+    # Presets
+    "coding_tools",
+    "research_tools",
+    "notebook_tools",
+    "planning_tools",
+    "all_tools",
+    # File operations
+    "read_file",
+    "write_file",
+    "edit_file",
+    "multi_edit",
+    "glob_files",
+    "grep",
+    "ls",
+    # Notebook operations
+    "notebook_edit",
+    "notebook_read",
+    # Execution
+    "bash",
+    "check_processes",
+    "python_repl",
+    # Planning
+    "think",
+    "todo_write",
+    "todo_read",
+    "reset_todos",
+    "get_todos",
+    # Memory
+    "memory",
+    "create_memory_tool",
+    "reset_memory",
+    # Web
+    "web_search",
+    "web_fetch",
+    # Sub-agent
+    "task",
+    "create_task_tool",
+    # Skills
+    "skills",
+    "create_skills_tool",
+]

src/flow/harness/miniagent/workspace.py ADDED

	@@ -0,0 +1,198 @@

+"""Workspace management for MiniAgent.
+Provides a simple convention for where agent-managed data lives:
+- Working directory is the workspace (or explicitly set)
+- Agent data goes in `{workspace}/.miniagent/`
+- No restrictions on file access - agent can read/write anywhere
+Structure:
+    {workspace}/
+    ├── .miniagent/
+    │   ├── todos.json      # Persisted task list
+    │   ├── memory/         # Memory entries
+    │   │   ├── {id}.json
+    │   │   └── ...
+    │   └── config.json     # Optional agent config
+    └── ... (rest of project)
+Usage:
+    from flow.harness.miniagent.workspace import Workspace
+    # Use current directory
+    ws = Workspace()
+    # Or explicit path
+    ws = Workspace("/path/to/project")
+    # Get paths
+    ws.root          # /path/to/project
+    ws.data_dir      # /path/to/project/.miniagent
+    ws.todos_file    # /path/to/project/.miniagent/todos.json
+    ws.memory_dir    # /path/to/project/.miniagent/memory
+"""
+import json
+from pathlib import Path
+from typing import Any
+class Workspace:
+    """Manages workspace paths and agent data storage.
+    The workspace is where the agent operates. Agent-managed data
+    (todos, memories, etc.) is stored in a `.miniagent/` subdirectory.
+    """
+    def __init__(self, root: str | Path | None = None):
+        """Initialize workspace.
+        Args:
+            root: Workspace root directory. Defaults to current working directory.
+        """
+        if root is None:
+            root = Path.cwd()
+        self._root = Path(root).resolve()
+    @property
+    def root(self) -> Path:
+        """Workspace root directory."""
+        return self._root
+    @property
+    def data_dir(self) -> Path:
+        """Agent data directory (.miniagent/)."""
+        return self._root / ".miniagent"
+    @property
+    def todos_file(self) -> Path:
+        """Path to todos.json."""
+        return self.data_dir / "todos.json"
+    @property
+    def memory_dir(self) -> Path:
+        """Path to memory/ directory."""
+        return self.data_dir / "memory"
+    @property
+    def config_file(self) -> Path:
+        """Path to config.json."""
+        return self.data_dir / "config.json"
+    def ensure_data_dir(self) -> Path:
+        """Create data directory if it doesn't exist."""
+        self.data_dir.mkdir(parents=True, exist_ok=True)
+        return self.data_dir
+    def ensure_memory_dir(self) -> Path:
+        """Create memory directory if it doesn't exist."""
+        self.memory_dir.mkdir(parents=True, exist_ok=True)
+        return self.memory_dir
+    # --- Todos ---
+    def load_todos(self) -> list[dict[str, Any]]:
+        """Load todos from workspace."""
+        if not self.todos_file.exists():
+            return []
+        try:
+            with open(self.todos_file) as f:
+                return json.load(f)  # type: ignore[no-any-return]
+        except (json.JSONDecodeError, IOError):
+            return []
+    def save_todos(self, todos: list[dict[str, Any]]) -> None:
+        """Save todos to workspace."""
+        self.ensure_data_dir()
+        with open(self.todos_file, "w") as f:
+            json.dump(todos, f, indent=2)
+    # --- Memory ---
+    def list_memories(self) -> list[dict[str, Any]]:
+        """List all memory entries."""
+        if not self.memory_dir.exists():
+            return []
+        memories: list[dict[str, Any]] = []
+        for filepath in self.memory_dir.glob("*.json"):
+            try:
+                with open(filepath) as f:
+                    memories.append(json.load(f))
+            except (json.JSONDecodeError, IOError):
+                continue
+        return memories
+    def load_memory(self, memory_id: str) -> dict[str, Any] | None:
+        """Load a specific memory entry."""
+        filepath = self.memory_dir / f"{memory_id}.json"
+        if not filepath.exists():
+            return None
+        try:
+            with open(filepath) as f:
+                return json.load(f)  # type: ignore[no-any-return]
+        except (json.JSONDecodeError, IOError):
+            return None
+    def save_memory(self, memory_id: str, data: dict[str, Any]) -> None:
+        """Save a memory entry."""
+        self.ensure_memory_dir()
+        filepath = self.memory_dir / f"{memory_id}.json"
+        with open(filepath, "w") as f:
+            json.dump(data, f, indent=2, default=str)
+    def delete_memory(self, memory_id: str) -> bool:
+        """Delete a memory entry. Returns True if deleted."""
+        filepath = self.memory_dir / f"{memory_id}.json"
+        if filepath.exists():
+            filepath.unlink()
+            return True
+        return False
+    # --- Config ---
+    def load_config(self) -> dict[str, Any]:
+        """Load workspace config."""
+        if not self.config_file.exists():
+            return {}
+        try:
+            with open(self.config_file) as f:
+                return json.load(f)
+        except (json.JSONDecodeError, IOError):
+            return {}
+    def save_config(self, config: dict[str, Any]) -> None:
+        """Save workspace config."""
+        self.ensure_data_dir()
+        with open(self.config_file, "w") as f:
+            json.dump(config, f, indent=2)
+    def __repr__(self) -> str:
+        return f"Workspace({self._root})"
+# Default workspace (current directory)
+_default_workspace: Workspace | None = None
+def get_workspace() -> Workspace:
+    """Get the default workspace, creating it on first use."""
+    global _default_workspace
+    if _default_workspace is None:
+        _default_workspace = Workspace()
+    return _default_workspace
+def set_workspace(workspace: Workspace | str | Path) -> Workspace:
+    """Set the default workspace."""
+    global _default_workspace
+    if isinstance(workspace, Workspace):
+        _default_workspace = workspace
+    else:
+        _default_workspace = Workspace(workspace)
+    return _default_workspace
+def reset_workspace() -> None:
+    """Reset default workspace (for testing)."""
+    global _default_workspace
+    _default_workspace = None
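
The todos convention in `workspace.py` (a missing or corrupt `todos.json` reads as an empty list, and saving creates `.miniagent/` on demand) can be exercised in isolation. The following is a minimal sketch, not the module itself: `MiniWorkspace` and `demo` are hypothetical stand-ins that copy only the todos methods from the diff, and the todo dictionary keys are illustrative assumptions.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


class MiniWorkspace:
    """Hypothetical stand-in mirroring the todos part of Workspace."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root).resolve()

    @property
    def data_dir(self) -> Path:
        # Agent-managed data lives under {workspace}/.miniagent/
        return self.root / ".miniagent"

    @property
    def todos_file(self) -> Path:
        return self.data_dir / "todos.json"

    def load_todos(self) -> list[dict[str, Any]]:
        # A missing file and an unreadable/corrupt file both yield [].
        if not self.todos_file.exists():
            return []
        try:
            with open(self.todos_file) as f:
                return json.load(f)
        except (json.JSONDecodeError, OSError):
            return []

    def save_todos(self, todos: list[dict[str, Any]]) -> None:
        # Create the data directory lazily, only when something is written.
        self.data_dir.mkdir(parents=True, exist_ok=True)
        with open(self.todos_file, "w") as f:
            json.dump(todos, f, indent=2)


def demo() -> tuple[list, list, list]:
    """Round-trip todos through a temporary workspace."""
    with tempfile.TemporaryDirectory() as tmp:
        ws = MiniWorkspace(Path(tmp))
        before = ws.load_todos()  # nothing on disk yet
        ws.save_todos([{"content": "write tests", "status": "pending"}])
        after = ws.load_todos()   # what was just saved
        ws.todos_file.write_text("{not json")
        corrupt = ws.load_todos()  # corrupt file is swallowed
    return before, after, corrupt


if __name__ == "__main__":
    print(demo())
```

One consequence of this design is that a corrupt `todos.json` is indistinguishable from an empty one, so callers that need to detect data loss would have to check the file themselves.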

src/flow/harness/registry.py ADDED

	@@ -0,0 +1,80 @@

+"""Harness registry for multi-framework support.
+This module provides a simple registry pattern for harness implementations,
+allowing Flow to support multiple agent frameworks (MAF, LangGraph, Claude SDK).
+"""
+from __future__ import annotations
+from pathlib import Path
+from typing import TYPE_CHECKING
+if TYPE_CHECKING:
+    from flow.experiments.models import Agent
+    from flow.harness.base import BaseHarness
+    from flow.llm import LLMClientConfig
+_HARNESSES: dict[str, type["BaseHarness"]] = {}
+def register(name: str, harness_class: type["BaseHarness"]) -> None:
+    """Register a harness class for a framework.
+    Args:
+        name: Framework name (e.g., "maf", "langgraph", "claude")
+        harness_class: The harness class to register
+    """
+    _HARNESSES[name] = harness_class
+def get_harness_class(name: str) -> type["BaseHarness"]:
+    """Get harness class by framework name.
+    Args:
+        name: Framework name
+    Returns:
+        The harness class
+    Raises:
+        ValueError: If framework is not registered
+    """
+    if name not in _HARNESSES:
+        available = list(_HARNESSES.keys())
+        raise ValueError(f"Unknown framework: {name}. Available: {available}")
+    return _HARNESSES[name]
+def create_harness(
+    agent: "Agent",
+    workspace: Path,
+    llm_config: "LLMClientConfig | None" = None,
+) -> "BaseHarness":
+    """Create a harness from an Agent spec.
+    This is the main entry point for creating harnesses. It looks up
+    the appropriate harness class based on agent.framework and calls
+    its from_agent() classmethod.
+    Args:
+        agent: The Agent spec defining the configuration
+        workspace: Working directory for the agent
+        llm_config: Optional LLM configuration for the agent (falls back to env vars)
+    Returns:
+        A configured harness instance
+    Raises:
+        ValueError: If agent.framework is not registered
+    """
+    harness_class = get_harness_class(agent.framework)
+    return harness_class.from_agent(agent, workspace, llm_config=llm_config)
+def available_frameworks() -> list[str]:
+    """Get list of available framework names.
+    Returns:
+        List of registered framework names
+    """
+    return list(_HARNESSES.keys())
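
The registry in `registry.py` maps a framework name to a harness class and delegates construction to that class's `from_agent()` classmethod. The sketch below reproduces that flow with stand-ins so it runs without Flow installed. `Agent`, `BaseHarness`, and `EchoHarness` here are hypothetical simplifications: only `agent.framework` and the `from_agent(agent, workspace, llm_config=...)` call shape are taken from the diff, and the `name` field is an assumption.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Optional


@dataclass
class Agent:
    """Hypothetical agent spec; only `framework` is taken from the diff."""
    name: str
    framework: str


class BaseHarness:
    """Hypothetical minimal base class standing in for flow.harness.base."""

    def __init__(self, agent: Agent, workspace: Path) -> None:
        self.agent = agent
        self.workspace = workspace

    @classmethod
    def from_agent(
        cls, agent: Agent, workspace: Path, llm_config: Optional[Any] = None
    ) -> "BaseHarness":
        return cls(agent, workspace)


class EchoHarness(BaseHarness):
    """Hypothetical framework implementation used only for this demo."""


_HARNESSES: dict[str, type[BaseHarness]] = {}


def register(name: str, harness_class: type[BaseHarness]) -> None:
    _HARNESSES[name] = harness_class


def get_harness_class(name: str) -> type[BaseHarness]:
    if name not in _HARNESSES:
        available = list(_HARNESSES.keys())
        raise ValueError(f"Unknown framework: {name}. Available: {available}")
    return _HARNESSES[name]


def create_harness(
    agent: Agent, workspace: Path, llm_config: Optional[Any] = None
) -> BaseHarness:
    # Look up by agent.framework, then let the class build itself.
    harness_class = get_harness_class(agent.framework)
    return harness_class.from_agent(agent, workspace, llm_config=llm_config)


register("echo", EchoHarness)

if __name__ == "__main__":
    harness = create_harness(Agent(name="demo", framework="echo"), Path("."))
    print(type(harness).__name__)
```

Because `register` simply assigns into the dictionary, registering the same name twice silently replaces the earlier class. That is convenient for tests but worth keeping in mind if two packages claim the same framework name.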

src/flow/llm/__init__.py ADDED

	@@ -0,0 +1,49 @@

+# Copyright (c) Microsoft. All rights reserved.
+"""LLM client configuration and factory.
+This package provides a unified way to configure and create LLM clients
+for different providers and frameworks.
+Example:
+    from flow.llm import LLMClientConfig, LLMProvider, LLMClientFactory
+    from flow.llm.config import AzureOpenAIConfig
+    # Create config
+    config = LLMClientConfig(
+        provider=LLMProvider.AZURE_OPENAI,
+        name="My Azure GPT-4o",
+        azure_openai=AzureOpenAIConfig(deployment="gpt-4o"),
+    )
+    # Create client for MAF
+    client = LLMClientFactory.create_maf_client(config)
+    # Create client for LangGraph
+    llm = LLMClientFactory.create_langgraph_client(config)
+"""
+from .config import (
+    AnthropicConfig,
+    AzureOpenAIConfig,
+    CustomConfig,
+    LLMClientConfig,
+    LLMProvider,
+    OllamaConfig,
+    OpenAIConfig,
+)
+from .factory import LLMClientFactory
+__all__ = [
+    # Enums
+    "LLMProvider",
+    # Config classes
+    "LLMClientConfig",
+    "OpenAIConfig",
+    "AzureOpenAIConfig",
+    "AnthropicConfig",
+    "OllamaConfig",
+    "CustomConfig",
+    # Factory
+    "LLMClientFactory",
+]

src/flow/llm/config.py ADDED

	@@ -0,0 +1,227 @@

+# Copyright (c) Microsoft. All rights reserved.
+"""LLM client configuration models.
+This module defines provider-agnostic configuration for LLM clients.
+Secrets are stored as environment variable references, not actual values.
+"""
+from __future__ import annotations
+import os
+from enum import Enum
+from typing import Any
+from pydantic import BaseModel, Field, model_validator
+class LLMProvider(str, Enum):
+    """Supported LLM providers."""
+    OPENAI = "openai"
+    AZURE_OPENAI = "azure_openai"
+    ANTHROPIC = "anthropic"
+    OLLAMA = "ollama"
+    CUSTOM = "custom"  # OpenAI-compatible endpoints
+class OpenAIConfig(BaseModel):
+    """Configuration for OpenAI API."""
+    api_key_env_var: str = Field(
+        default="OPENAI_API_KEY",
+        description="Environment variable name containing the API key",
+    )
+    model_id: str = Field(
+        default="gpt-4o",
+        description="Model ID to use (e.g., gpt-4o, gpt-4-turbo)",
+    )
+    base_url: str | None = Field(
+        default=None,
+        description="Optional base URL for API (for proxies)",
+    )
+    def get_api_key(self) -> str:
+        """Get the API key from environment variable."""
+        value = os.environ.get(self.api_key_env_var)
+        if not value:
+            raise ValueError(f"Environment variable {self.api_key_env_var} is not set")
+        return value
+class AzureOpenAIConfig(BaseModel):
+    """Configuration for Azure OpenAI API."""
+    endpoint_env_var: str = Field(
+        default="AZURE_OPENAI_ENDPOINT",
+        description="Environment variable name containing the endpoint URL",
+    )
+    api_key_env_var: str = Field(
+        default="AZURE_OPENAI_API_KEY",
+        description="Environment variable name containing the API key",
+    )
+    deployment: str = Field(
+        description="Azure OpenAI deployment name",
+    )
+    api_version: str = Field(
+        default="2024-02-15-preview",
+        description="Azure OpenAI API version",
+    )
+    def get_endpoint(self) -> str:
+        """Get the endpoint from environment variable."""
+        value = os.environ.get(self.endpoint_env_var)
+        if not value:
+            raise ValueError(f"Environment variable {self.endpoint_env_var} is not set")
+        return value
+    def get_api_key(self) -> str:
+        """Get the API key from environment variable."""
+        value = os.environ.get(self.api_key_env_var)
+        if not value:
+            raise ValueError(f"Environment variable {self.api_key_env_var} is not set")
+        return value
+class AnthropicConfig(BaseModel):
+    """Configuration for Anthropic API."""
+    api_key_env_var: str = Field(
+        default="ANTHROPIC_API_KEY",
+        description="Environment variable name containing the API key",
+    )
+    model_id: str = Field(
+        default="claude-3-5-sonnet-20241022",
+        description="Model ID to use",
+    )
+    def get_api_key(self) -> str:
+        """Get the API key from environment variable."""
+        value = os.environ.get(self.api_key_env_var)
+        if not value:
+            raise ValueError(f"Environment variable {self.api_key_env_var} is not set")
+        return value
+class OllamaConfig(BaseModel):
+    """Configuration for Ollama (local models)."""
+    host: str = Field(
+        default="http://localhost:11434",
+        description="Ollama server URL",
+    )
+    model_id: str = Field(
+        default="llama3.2",
+        description="Model ID to use",
+    )
+class CustomConfig(BaseModel):
+    """Configuration for custom OpenAI-compatible endpoints."""
+    base_url: str = Field(
+        description="Base URL for the API",
+    )
+    api_key_env_var: str = Field(
+        default="CUSTOM_API_KEY",
+        description="Environment variable name containing the API key",
+    )
+    model_id: str = Field(
+        description="Model ID to use",
+    )
+    def get_api_key(self) -> str:
+        """Get the API key from environment variable."""
+        value = os.environ.get(self.api_key_env_var)
+        if not value:
+            raise ValueError(f"Environment variable {self.api_key_env_var} is not set")
+        return value
+class LLMClientConfig(BaseModel):
+    """Unified LLM client configuration.
+    The provider field selects which provider-specific config is used.
+    The config matching the provider must be set; the others should be left unset.
+    Example:
+        # Azure OpenAI
+        config = LLMClientConfig(
+            provider=LLMProvider.AZURE_OPENAI,
+            name="My Azure GPT-4o",
+            azure_openai=AzureOpenAIConfig(deployment="gpt-4o"),
+        )
+        # OpenAI
+        config = LLMClientConfig(
+            provider=LLMProvider.OPENAI,
+            name="OpenAI GPT-4o",
+            openai=OpenAIConfig(model_id="gpt-4o"),
+        )
+    """
+    id: str | None = Field(
+        default=None,
+        description="Unique identifier (set when stored in DB)",
+    )
+    provider: LLMProvider = Field(
+        description="The LLM provider type",
+    )
+    name: str = Field(
+        description="User-friendly name for this configuration",
+    )
+    is_default: bool = Field(
+        default=False,
+        description="Whether this is the default configuration",
+    )
+    # Provider-specific configs (the one matching `provider` is required)
+    openai: OpenAIConfig | None = None
+    azure_openai: AzureOpenAIConfig | None = None
+    anthropic: AnthropicConfig | None = None
+    ollama: OllamaConfig | None = None
+    custom: CustomConfig | None = None
+    @model_validator(mode="after")
+    def validate_provider_config(self) -> "LLMClientConfig":
+        """Ensure the correct provider config is set."""
+        provider_configs = {
+            LLMProvider.OPENAI: self.openai,
+            LLMProvider.AZURE_OPENAI: self.azure_openai,
+            LLMProvider.ANTHROPIC: self.anthropic,
+            LLMProvider.OLLAMA: self.ollama,
+            LLMProvider.CUSTOM: self.custom,
+        }
+        config = provider_configs.get(self.provider)
+        if config is None:
+            raise ValueError(
+                f"Provider {self.provider.value} requires {self.provider.value} config to be set"
+            )
+        return self
+    def get_model_id(self) -> str:
+        """Get the model/deployment ID for display purposes."""
+        match self.provider:
+            case LLMProvider.OPENAI:
+                return self.openai.model_id if self.openai else ""
+            case LLMProvider.AZURE_OPENAI:
+                return self.azure_openai.deployment if self.azure_openai else ""
+            case LLMProvider.ANTHROPIC:
+                return self.anthropic.model_id if self.anthropic else ""
+            case LLMProvider.OLLAMA:
+                return self.ollama.model_id if self.ollama else ""
+            case LLMProvider.CUSTOM:
+                return self.custom.model_id if self.custom else ""
+            case _:
+                return ""
+    def to_dict(self) -> dict[str, Any]:
+        """Convert to dictionary for JSON serialization."""
+        return self.model_dump(exclude_none=True)
+    @classmethod
+    def from_dict(cls, data: dict[str, Any]) -> "LLMClientConfig":
+        """Create from dictionary."""
+        return cls.model_validate(data)
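
Two ideas carry most of `config.py`: secrets are stored as environment variable names and resolved only when requested, and the config section matching `provider` must be present. The sketch below illustrates both using stdlib `dataclasses` instead of pydantic, so it is a swapped-in technique rather than the module's implementation. `Provider`, `OpenAISettings`, `OllamaSettings`, `ClientConfig`, and the `DEMO_KEY_VAR` variable name are all hypothetical.

```python
import os
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Provider(str, Enum):
    OPENAI = "openai"
    OLLAMA = "ollama"


@dataclass
class OpenAISettings:
    # Stores the *name* of the variable, never the secret itself.
    api_key_env_var: str = "OPENAI_API_KEY"
    model_id: str = "gpt-4o"

    def get_api_key(self) -> str:
        # Resolved lazily, so a config can be built and serialized
        # on a machine where the secret is not available.
        value = os.environ.get(self.api_key_env_var)
        if not value:
            raise ValueError(
                f"Environment variable {self.api_key_env_var} is not set"
            )
        return value


@dataclass
class OllamaSettings:
    host: str = "http://localhost:11434"
    model_id: str = "llama3.2"


@dataclass
class ClientConfig:
    provider: Provider
    name: str
    openai: Optional[OpenAISettings] = None
    ollama: Optional[OllamaSettings] = None

    def __post_init__(self) -> None:
        # Plays the role of the model validator: the section named by
        # `provider` must be present. Extra sections are not rejected.
        if getattr(self, self.provider.value) is None:
            raise ValueError(
                f"Provider {self.provider.value} requires "
                f"{self.provider.value} config to be set"
            )

    def get_model_id(self) -> str:
        return getattr(self, self.provider.value).model_id


if __name__ == "__main__":
    cfg = ClientConfig(
        provider=Provider.OPENAI,
        name="demo",
        openai=OpenAISettings(api_key_env_var="DEMO_KEY_VAR"),
    )
    os.environ["DEMO_KEY_VAR"] = "not-a-real-key"
    print(cfg.get_model_id(), len(cfg.openai.get_api_key()))
```

Keeping only the variable name in the config means the serialized form (for example, the output of `to_dict()` in the diff) contains no secret material, at the cost of deferring "key missing" errors until the first `get_api_key()` call.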