Spaces: Mohammed-Altaf/DataAnalysis_Env

Mohammed Altaf committed on Mar 30

Commit

8ab6a5f

0 Parent(s):

first commit

Files changed (20)

.gitignore +18 -0
.python-version +1 -0
README.md +143 -0
__init__.py +15 -0
baseline.py +187 -0
client.py +63 -0
datasets/sales.csv +0 -0
models.py +58 -0
openenv.yaml +8 -0
pyproject.toml +18 -0
server/Dockerfile +58 -0
server/__init__.py +0 -0
server/app.py +23 -0
server/data_analysis_env.py +270 -0
tasks/__init__.py +14 -0
tasks/base_task.py +62 -0
tasks/task_easy.py +55 -0
tasks/task_hard.py +111 -0
tasks/task_medium.py +85 -0
uv.lock +0 -0

.gitignore ADDED Viewed

	@@ -0,0 +1,18 @@

+# Python-generated files
+__pycache__/
+*.py[oc]
+build/
+dist/
+wheels/
+*.egg-info
+# Virtual environments
+.venv
+OpenEnv/
+*.ipynb
+personal/
+# avoid claude stuff
+CLAUDE.md
+.claude

.python-version ADDED Viewed

	@@ -0,0 +1 @@


+3.13

README.md ADDED Viewed

	@@ -0,0 +1,143 @@

+# Data Analysis Agent Environment
+An OpenEnv-compliant RL environment for training and evaluating data analysis agents. Agents execute pandas code against a business dataset to answer analytical questions, graded by deterministic programmatic graders.
+## Motivation
+Data analysis is a universal real-world task. Every business needs analysts who can query datasets, compute metrics, and extract insights. This environment lets RL agents practice that exact workflow — explore a dataset with code, then submit a precise answer — with automatic scoring.
+## Action & Observation Spaces
+### Action (`DataAction`)
+| Field | Type | Description |
+|---|---|---|
+| `action_type` | `"execute_code"` or `"submit_answer"` | What the agent wants to do |
+| `code` | `str` (optional) | Python/pandas code to execute |
+| `answer` | `str` (optional) | Final answer to submit for grading |
+### Observation (`DataObservation`)
+| Field | Type | Description |
+|---|---|---|
+| `output` | `str` | Stdout from code execution or environment messages |
+| `success` | `bool` | Whether the action succeeded |
+| `error` | `str` (optional) | Error message if action failed |
+| `task_description` | `str` | The question to answer (set on reset) |
+| `dataset_info` | `str` | Dataset schema summary (set on reset) |
+| `done` | `bool` | Whether the episode is over |
+| `reward` | `float` | Step reward |
+### State (`DataState`)
+| Field | Type | Description |
+|---|---|---|
+| `episode_id` | `str` | Unique episode identifier |
+| `step_count` | `int` | Current step number |
+| `task_id` | `int` | Active task (1, 2, or 3) |
+| `answer_submitted` | `bool` | Whether final answer was submitted |
+| `final_score` | `float` | Graded score after submission |
+## Tasks
+All tasks use a synthetic e-commerce dataset (~2000 orders) with columns: `order_id`, `customer_id`, `product_name`, `category`, `quantity`, `unit_price`, `total_price`, `order_date`, `city`, `country`.
+### Task 1 — Easy: Top Revenue Category
+- **Question**: What is the top-selling product category by total revenue?
+- **Grading**: Exact match (case-insensitive) → 1.0 or 0.0
+- **Expected difficulty**: Single groupby + sum + argmax
+### Task 2 — Medium: City Revenue Share
+- **Question**: Which city generates the most revenue? What percentage of total revenue does it represent?
+- **Grading**: 0.5 for correct city + 0.5 for percentage within ±0.1%
+- **Expected difficulty**: Groupby + percentage calculation + formatting
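The partial-credit rule for Task 2 can be sketched as a small grader. This is an illustration only: the function name `grade_city_share`, the `"City, 12.3%"` answer format, the reading of ±0.1% as 0.1 percentage points, and the sample values are all assumptions, not taken from `tasks/task_medium.py`.

```python
import re


def grade_city_share(answer: str, expected_city: str, expected_pct: float,
                     tolerance: float = 0.1) -> float:
    """Hypothetical grader: 0.5 for the city plus 0.5 for the percentage."""
    score = 0.0
    # City credit: case-insensitive substring match on the expected city name.
    if expected_city.strip().lower() in answer.lower():
        score += 0.5
    # Percentage credit: first number followed by a percent sign (assumed format).
    match = re.search(r"(\d+(?:\.\d+)?)\s*%", answer)
    if match and abs(float(match.group(1)) - expected_pct) <= tolerance:
        score += 0.5
    return score


# Made-up expected values, used only to exercise the three outcomes.
full = grade_city_share("Mumbai, 23.4%", "Mumbai", 23.45)
half = grade_city_share("Mumbai, 30%", "Mumbai", 23.45)
zero = grade_city_share("Delhi, 99%", "Mumbai", 23.45)
print(full, half, zero)
```

Scoring the two fields independently means an agent that finds the right city but formats or rounds the share badly still receives a learning signal.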
+### Task 3 — Hard: Repeat Customer Cohort Analysis
+- **Question**: How many unique customers ordered in both January and December? Compare their average order value to all other customers.
+- **Grading**: 0.33 per correct field (count, cohort AOV, other AOV)
+- **Expected difficulty**: Temporal filtering, set intersection, conditional aggregation
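The three operations named for Task 3 (temporal filtering, set intersection, conditional aggregation) might look like the following in pandas. The six-row frame is made up for illustration, and treating average order value as the mean of `total_price` per order is an assumption; the real task may define it differently.

```python
import pandas as pd

# Tiny made-up sample; the real dataset has about 2000 orders.
orders = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2", "C2", "C3", "C4"],
    "order_date": ["2024-01-05", "2024-12-20", "2024-01-10",
                   "2024-06-01", "2024-12-02", "2024-03-15"],
    "total_price": [100.0, 300.0, 50.0, 70.0, 80.0, 40.0],
})
orders["order_date"] = pd.to_datetime(orders["order_date"])
month = orders["order_date"].dt.month

# Temporal filtering: customers seen in January and customers seen in December.
january = set(orders.loc[month == 1, "customer_id"])
december = set(orders.loc[month == 12, "customer_id"])

# Set intersection: customers present in both months.
cohort = january & december

# Conditional aggregation: mean order value inside vs. outside the cohort.
in_cohort = orders["customer_id"].isin(cohort)
cohort_aov = orders.loc[in_cohort, "total_price"].mean()
other_aov = orders.loc[~in_cohort, "total_price"].mean()

print(len(cohort), round(cohort_aov, 2), round(other_aov, 2))
```

With this sample only `C1` ordered in both months, so the cohort average covers that customer's two orders and every other order falls into the comparison group.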
+## Reward Function
+| Event | Reward |
+|---|---|
+| Successful code execution | +0.05 |
+| Code execution error | -0.05 |
+| Final answer (graded) | 0.0 — 1.0 based on task grader |
+| Step limit (20) reached without submitting an answer | 0.0 |
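As a worked example of the reward table, the sketch below adds up the rewards for one made-up episode. Summing step rewards without discounting is an assumption made here for illustration; the environment only reports a reward per step, and the names `STEP_REWARDS` and `episode_return` are hypothetical.

```python
from typing import List

# Per-step rewards from the table; the event labels are illustrative.
STEP_REWARDS = {"code_ok": 0.05, "code_error": -0.05}


def episode_return(events: List[str], final_score: float) -> float:
    """Sum the per-step rewards, then add the graded score of the final answer."""
    return sum(STEP_REWARDS[event] for event in events) + final_score


# Three successful executions, one error, then a fully correct answer.
total = episode_return(["code_ok", "code_ok", "code_error", "code_ok"], 1.0)
print(round(total, 2))
```

Under that assumption the exploration bonus is small next to the graded answer, so the final score dominates the return.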
+## Setup & Usage
+### Prerequisites
+- Python 3.13+
+- [uv](https://docs.astral.sh/uv/) package manager
+### Install
+```bash
+uv sync
+```
+### Run the server
+```bash
+uv run uvicorn server.app:app --host 0.0.0.0 --port 8000
+```
+### Run the baseline
+```bash
+OPENAI_API_KEY=sk-... uv run python baseline.py
+```
+### Docker
+```bash
+docker build -t data-analysis-env -f server/Dockerfile .
+docker run -p 8000:8000 data-analysis-env
+```
+### Client usage (Python)
+```python
+import asyncio
+from client import DataAnalysisClient
+from models import DataAction
+# Async: `async with` must run inside a coroutine
+async def main():
+    async with DataAnalysisClient(base_url="http://localhost:8000") as client:
+        result = await client.reset(task_id=1)
+        result = await client.step(DataAction(action_type="execute_code", code="print(df.head())"))
+        result = await client.step(DataAction(action_type="submit_answer", answer="Electronics"))
+asyncio.run(main())
+# Sync
+with DataAnalysisClient(base_url="http://localhost:8000").sync() as client:
+    result = client.reset(task_id=2)
+    result = client.step(DataAction(action_type="execute_code", code="print(df.groupby('city')['total_price'].sum())"))
+```
+## Baseline Scores
+| Task | Difficulty | gpt-4o-mini Score |
+|---|---|---|
+| 1 | Easy | TBD |
+| 2 | Medium | TBD |
+| 3 | Hard | TBD |
+| **Average** | | **TBD** |
+*(Run the baseline script to populate these scores)*
+## Project Structure
+```
+├── models.py               # DataAction, DataObservation, DataState
+├── client.py               # DataAnalysisClient (EnvClient subclass)
+├── baseline.py             # OpenAI baseline inference script
+├── tasks/
+│   ├── base_task.py        # Task ABC with grade() interface
+│   ├── task_easy.py        # Task 1: Top revenue category
+│   ├── task_medium.py      # Task 2: City revenue share
+│   └── task_hard.py        # Task 3: Repeat customer cohort
+├── datasets/
+│   └── sales.csv           # Synthetic e-commerce dataset
+├── server/
+│   ├── app.py              # FastAPI app (create_app)
+│   ├── data_analysis_env.py # Environment implementation
+│   └── Dockerfile          # Container build
+├── openenv.yaml            # OpenEnv spec metadata
+└── pyproject.toml          # Dependencies and project config
+```

__init__.py ADDED Viewed

	@@ -0,0 +1,15 @@

+"""Data Analysis Agent Environment for OpenEnv.
+An RL environment where agents execute pandas code against a business
+dataset to answer analytical questions with programmatic grading.
+"""
+from client import DataAnalysisClient
+from models import DataAction, DataObservation, DataState
+__all__ = [
+    "DataAnalysisClient",
+    "DataAction",
+    "DataObservation",
+    "DataState",
+]

baseline.py ADDED Viewed

	@@ -0,0 +1,187 @@

+"""Baseline inference script for the Data Analysis Agent environment.
+Uses the OpenAI API to run a model (gpt-4o-mini) against all 3 tasks
+and produces reproducible baseline scores.
+Usage:
+    OPENAI_API_KEY=sk-... uv run python baseline.py
+    OPENAI_API_KEY=sk-... uv run python baseline.py --base-url http://localhost:8000
+"""
+import argparse
+import json
+import os
+import sys
+import requests
+from openai import OpenAI
+SYSTEM_PROMPT = """You are a data analyst. You are given a dataset loaded as a pandas DataFrame called `df`.
+You can execute Python/pandas code to explore the dataset and answer the question.
+Rules:
+- Use `print()` to see results of your code
+- The DataFrame `df` is pre-loaded with pandas as `pd` and numpy as `np`
+- When you have the answer, submit it in the exact format requested
+- Be precise with numbers and formatting
+Respond with JSON in one of these formats:
+1. To execute code: {"action": "execute_code", "code": "your python code here"}
+2. To submit answer: {"action": "submit_answer", "answer": "your answer here"}
+Respond with ONLY the JSON, no other text."""
+def run_task(client: OpenAI, base_url: str, task_id: int, max_steps: int = 15) -> float:
+    """Run a single task using the OpenAI API as the agent.
+    Args:
+        client: The OpenAI client instance.
+        base_url: The environment server base URL.
+        task_id: Which task to run (1, 2, or 3).
+        max_steps: Maximum agent steps before giving up.
+    Returns:
+        The final score for this task (0.0 to 1.0).
+    """
+    # Reset environment with the specified task
+    reset_resp = requests.post(
+        f"{base_url}/reset",
+        json={"task_id": task_id},
+        timeout=30,
+    )
+    reset_data = reset_resp.json()
+    obs = reset_data.get("observation", reset_data)
+    task_desc = obs.get("task_description", "")
+    dataset_info = obs.get("dataset_info", "")
+    messages = [
+        {"role": "system", "content": SYSTEM_PROMPT},
+        {
+            "role": "user",
+            "content": f"Task: {task_desc}\n\nDataset Info:\n{dataset_info}",
+        },
+    ]
+    print(f"\n--- Task {task_id} ---")
+    print(f"Question: {task_desc}")
+    for step in range(max_steps):
+        response = client.chat.completions.create(
+            model="gpt-4o-mini",
+            messages=messages,
+            temperature=0.0,
+        )
+        assistant_msg = response.choices[0].message.content.strip()
+        # Parse the agent's JSON response
+        try:
+            # Handle markdown code blocks if present
+            if assistant_msg.startswith("```"):
+                assistant_msg = assistant_msg.split("```")[1]
+                if assistant_msg.startswith("json"):
+                    assistant_msg = assistant_msg[4:]
+                assistant_msg = assistant_msg.strip()
+            action = json.loads(assistant_msg)
+        except json.JSONDecodeError:
+            messages.append({"role": "assistant", "content": assistant_msg})
+            messages.append({
+                "role": "user",
+                "content": "Invalid JSON. Please respond with valid JSON only.",
+            })
+            continue
+        action_type = action.get("action", "")
+        if action_type == "execute_code":
+            # Send code execution to environment
+            step_resp = requests.post(
+                f"{base_url}/step",
+                json={
+                    "action_type": "execute_code",
+                    "code": action.get("code", ""),
+                },
+                timeout=30,
+            )
+            step_data = step_resp.json()
+            step_obs = step_data.get("observation", step_data)
+            output = step_obs.get("output", "")
+            error = step_obs.get("error", "")
+            result_text = f"Output: {output}" if not error else f"Error: {error}"
+            print(f"  Step {step + 1}: execute_code -> {result_text[:100]}")
+            messages.append({"role": "assistant", "content": assistant_msg})
+            messages.append({"role": "user", "content": result_text})
+        elif action_type == "submit_answer":
+            # Submit final answer
+            step_resp = requests.post(
+                f"{base_url}/step",
+                json={
+                    "action_type": "submit_answer",
+                    "answer": action.get("answer", ""),
+                },
+                timeout=30,
+            )
+            step_data = step_resp.json()
+            step_obs = step_data.get("observation", step_data)
+            score = step_obs.get("metadata", {}).get("score", 0.0)
+            print(f"  Step {step + 1}: submit_answer -> '{action.get('answer', '')}'")
+            print(f"  Score: {score:.2f}")
+            return score
+        else:
+            messages.append({"role": "assistant", "content": assistant_msg})
+            messages.append({
+                "role": "user",
+                "content": f"Unknown action '{action_type}'. Use 'execute_code' or 'submit_answer'.",
+            })
+    print("  Max steps reached without submitting an answer.")
+    return 0.0
+def main():
+    """Run baseline inference across all 3 tasks and report scores."""
+    parser = argparse.ArgumentParser(description="Baseline inference for Data Analysis Env")
+    parser.add_argument(
+        "--base-url",
+        default="http://localhost:8000",
+        help="Environment server URL (default: http://localhost:8000)",
+    )
+    args = parser.parse_args()
+    api_key = os.environ.get("OPENAI_API_KEY")
+    if not api_key:
+        print("Error: OPENAI_API_KEY environment variable is required.")
+        sys.exit(1)
+    client = OpenAI(api_key=api_key)
+    print("=" * 50)
+    print("Data Analysis Agent - Baseline Inference")
+    print(f"Server: {args.base_url}")
+    print("Model: gpt-4o-mini")
+    print("=" * 50)
+    scores = {}
+    for task_id in [1, 2, 3]:
+        score = run_task(client, args.base_url, task_id)
+        scores[task_id] = score
+    print("\n" + "=" * 50)
+    print("RESULTS")
+    print("=" * 50)
+    difficulties = {1: "Easy", 2: "Medium", 3: "Hard"}
+    for task_id, score in scores.items():
+        print(f"  Task {task_id} ({difficulties[task_id]}): {score:.2f}")
+    avg = sum(scores.values()) / len(scores)
+    print(f"\n  Average Score: {avg:.2f}")
+    print("=" * 50)
+if __name__ == "__main__":
+    main()
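The reply-parsing step inside `run_task` can be isolated into a helper, which makes it easy to check on its own. The sketch below mirrors the logic in `baseline.py`; the function name `parse_action` and the sample replies are hypothetical.

```python
import json

# Built at runtime so this example contains no literal code fence.
FENCE = "`" * 3


def parse_action(reply: str) -> dict:
    """Parse a model reply into an action dict, tolerating a Markdown code fence."""
    text = reply.strip()
    if text.startswith(FENCE):
        # Keep only the content between the first pair of fences.
        text = text.split(FENCE)[1]
        # Drop an optional "json" language tag.
        if text.startswith("json"):
            text = text[4:]
        text = text.strip()
    return json.loads(text)


plain = parse_action('{"action": "submit_answer", "answer": "Electronics"}')
fenced = parse_action(
    FENCE + 'json\n{"action": "execute_code", "code": "print(df.shape)"}\n' + FENCE
)
print(plain["action"], fenced["action"])
```

Malformed input still raises `json.JSONDecodeError`, which the baseline loop handles by asking the model to retry.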

client.py ADDED Viewed

	@@ -0,0 +1,63 @@

+"""Client for the Data Analysis Agent environment.
+Provides a typed async/sync client for interacting with the
+data analysis environment server over HTTP/WebSocket.
+"""
+from openenv.core.env_client import EnvClient
+from openenv.core.client_types import StepResult
+from models import DataAction, DataObservation, DataState
+class DataAnalysisClient(EnvClient[DataAction, DataObservation, DataState]):
+    """Client for interacting with the Data Analysis environment server.
+    Supports both async and sync usage patterns:
+        - Async: ``async with DataAnalysisClient(base_url=...) as client:``
+        - Sync: ``with DataAnalysisClient(base_url=...).sync() as client:``
+    """
+    def _step_payload(self, action: DataAction) -> dict:
+        """Convert a DataAction into a JSON-serializable payload.
+        Args:
+            action: The action to send to the server.
+        Returns:
+            A dictionary representation of the action.
+        """
+        payload = {"action_type": action.action_type}
+        if action.code is not None:
+            payload["code"] = action.code
+        if action.answer is not None:
+            payload["answer"] = action.answer
+        return payload
+    def _parse_result(self, payload: dict) -> StepResult[DataObservation]:
+        """Parse the server's JSON response into a StepResult.
+        Args:
+            payload: The raw JSON response from the server.
+        Returns:
+            A StepResult containing the parsed observation, reward, and done flag.
+        """
+        obs_data = payload.get("observation", payload)
+        obs = DataObservation(**obs_data)
+        return StepResult(
+            observation=obs,
+            reward=payload.get("reward", obs.reward),
+            done=payload.get("done", obs.done),
+        )
+    def _parse_state(self, payload: dict) -> DataState:
+        """Parse the server's state response into a DataState.
+        Args:
+            payload: The raw JSON state response from the server.
+        Returns:
+            A DataState object reflecting the current episode state.
+        """
+        return DataState(**payload)

datasets/sales.csv ADDED Viewed

The diff for this file is too large to render. See raw diff

models.py ADDED Viewed

	@@ -0,0 +1,58 @@

+"""Pydantic models for the Data Analysis Agent environment.
+Defines the action, observation, and state types used for communication
+between the RL agent and the environment server.
+"""
+from typing import Literal, Optional
+from openenv.core.env_server import Action, Observation, State
+class DataAction(Action):
+    """Agent action for the data analysis environment.
+    The agent can either execute pandas code against the loaded dataset
+    or submit a final answer to be graded.
+    Attributes:
+        action_type: Whether to execute code or submit an answer.
+        code: Python/pandas code to execute (required when action_type is "execute_code").
+        answer: Final answer string (required when action_type is "submit_answer").
+    """
+    action_type: Literal["execute_code", "submit_answer"]
+    code: Optional[str] = None
+    answer: Optional[str] = None
+class DataObservation(Observation):
+    """Observation returned after each step or reset.
+    Attributes:
+        output: String output from code execution or environment messages.
+        success: Whether the last action executed without errors.
+        error: Error message if the last action failed.
+        task_description: The task question, populated on reset.
+        dataset_info: Column names and dtypes summary, populated on reset.
+    """
+    output: str = ""
+    success: bool = True
+    error: Optional[str] = None
+    task_description: str = ""
+    dataset_info: str = ""
+class DataState(State):
+    """Episode state for the data analysis environment.
+    Attributes:
+        task_id: The current task being evaluated (1, 2, or 3).
+        answer_submitted: Whether the agent has submitted a final answer.
+        final_score: The graded score after answer submission (0.0 to 1.0).
+    """
+    task_id: int = 1
+    answer_submitted: bool = False
+    final_score: float = 0.0

openenv.yaml ADDED Viewed

	@@ -0,0 +1,8 @@

+spec_version: 1
+name: data_analysis_env
+version: "0.1.0"
+description: "RL environment for training data analysis agents on business datasets"
+type: space
+runtime: fastapi
+app: server.app:app
+port: 8000

pyproject.toml ADDED Viewed

	@@ -0,0 +1,18 @@

+[project]
+name = "openenv-data-analysis-env"
+version = "0.1.0"
+description = "RL environment for training data analysis agents on business datasets"
+readme = "README.md"
+requires-python = ">=3.13"
+dependencies = [
+    "openenv-core>=0.2.3",
+    "fastapi>=0.115.0",
+    "pydantic>=2.0.0",
+    "uvicorn>=0.24.0",
+    "pandas>=2.0.0",
+    "numpy>=1.24.0",
+    "openai>=1.0.0",
+    "requests>=2.0.0",
+]
+[project.scripts]
+server = "server.app:main"

server/Dockerfile ADDED Viewed

	@@ -0,0 +1,58 @@

+# Multi-stage build for the Data Analysis Agent environment
+ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
+FROM ${BASE_IMAGE} AS builder
+WORKDIR /app
+# Copy environment code
+COPY . /app/env
+WORKDIR /app/env
+# Ensure uv is available
+RUN if ! command -v uv >/dev/null 2>&1; then \
+        curl -LsSf https://astral.sh/uv/install.sh | sh && \
+        mv /root/.local/bin/uv /usr/local/bin/uv && \
+        mv /root/.local/bin/uvx /usr/local/bin/uvx; \
+    fi
+# Install git for build-time dependencies
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    git \
+    && rm -rf /var/lib/apt/lists/*
+# Install dependencies with cache
+RUN --mount=type=cache,target=/root/.cache/uv \
+    if [ -f uv.lock ]; then \
+        uv sync --frozen --no-install-project --no-editable; \
+    else \
+        uv sync --no-install-project --no-editable; \
+    fi
+RUN --mount=type=cache,target=/root/.cache/uv \
+    if [ -f uv.lock ]; then \
+        uv sync --frozen --no-editable; \
+    else \
+        uv sync --no-editable; \
+    fi
+# Final runtime stage
+FROM ${BASE_IMAGE}
+WORKDIR /app
+# Copy virtual environment and code
+COPY --from=builder /app/env/.venv /app/.venv
+COPY --from=builder /app/env /app/env
+ENV PATH="/app/.venv/bin:$PATH"
+ENV PYTHONPATH="/app/env:$PYTHONPATH"
+# Health check
+HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
+EXPOSE 8000
+# Run server
+CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port 8000"]

server/__init__.py ADDED Viewed

File without changes

server/app.py ADDED Viewed

	@@ -0,0 +1,23 @@

+"""FastAPI application for the Data Analysis Agent environment.
+Creates the OpenEnv-compliant HTTP/WebSocket server that wraps
+the DataAnalysisEnv environment.
+"""
+from openenv.core.env_server import create_app
+from models import DataAction, DataObservation
+from server.data_analysis_env import DataAnalysisEnv
+app = create_app(DataAnalysisEnv, DataAction, DataObservation, env_name="data_analysis_env")
+def main():
+    """Run the environment server with uvicorn."""
+    import uvicorn
+    uvicorn.run(app, host="0.0.0.0", port=8000)
+if __name__ == "__main__":
+    main()

server/data_analysis_env.py ADDED Viewed

	@@ -0,0 +1,270 @@

+"""Data Analysis Agent environment implementation.
+Provides an RL environment where an agent executes pandas code against
+a business dataset to answer analytical questions. Each episode presents
+a task with a programmatic grader that scores performance 0.0-1.0.
+"""
+import io
+import sys
+import uuid
+from pathlib import Path
+from typing import Any, Optional
+import numpy as np
+import pandas as pd
+from openenv.core.env_server import Environment
+from models import DataAction, DataObservation, DataState
+from tasks import TASKS
+DATASET_PATH = Path(__file__).resolve().parent.parent / "datasets" / "sales.csv"
+class DataAnalysisEnv(Environment):
+    """Environment for training data analysis agents on business datasets.
+    The agent receives a task question and can execute pandas code against
+    a pre-loaded DataFrame. The episode ends when the agent submits an answer
+    or exceeds the maximum number of steps.
+    Attributes:
+        MAX_STEPS: Maximum steps before forced episode termination.
+    """
+    MAX_STEPS = 20
+    def __init__(self):
+        """Initialize the environment with default state."""
+        super().__init__()
+        self._source_df = pd.read_csv(DATASET_PATH)
+        self._df = self._source_df.copy()
+        self._state = DataState()
+        self._task = None
+        self._exec_namespace = {}
+    def _build_namespace(self) -> dict:
+        """Build a restricted execution namespace for agent code.
+        The namespace includes only pandas, numpy, and the dataset copy.
+        Dangerous builtins like open, exec, eval, and __import__ are removed.
+        Returns:
+            A dictionary to use as the globals for exec().
+        """
+        safe_builtins = {
+            k: v for k, v in __builtins__.items()
+            if k not in ("open", "exec", "eval", "__import__", "compile", "exit", "quit")
+        } if isinstance(__builtins__, dict) else {
+            k: getattr(__builtins__, k) for k in dir(__builtins__)
+            if k not in ("open", "exec", "eval", "__import__", "compile", "exit", "quit")
+            and not k.startswith("_")
+        }
+        return {
+            "__builtins__": safe_builtins,
+            "df": self._df.copy(),
+            "pd": pd,
+            "np": np,
+        }
+    def _dataset_info(self) -> str:
+        """Generate a summary of the dataset schema for the agent.
+        Returns:
+            A string describing column names, dtypes, row count, and a sample.
+        """
+        buf = io.StringIO()
+        self._df.info(buf=buf)
+        info_str = buf.getvalue()
+        sample = self._df.head(3).to_string()
+        return f"Dataset shape: {self._df.shape}\n\n{info_str}\nSample rows:\n{sample}"
+    def reset(
+        self,
+        seed: Optional[int] = None,
+        episode_id: Optional[str] = None,
+        **kwargs: Any,
+    ) -> DataObservation:
+        """Reset the environment for a new episode.
+        Args:
+            seed: Optional random seed (unused, kept for interface compliance).
+            episode_id: Optional episode identifier; generated if not provided.
+            **kwargs: Additional keyword arguments. Supports 'task_id' (int, 1-3).
+        Returns:
+            An initial observation with the task description and dataset info.
+        """
+        task_id = kwargs.get("task_id", 1)
+        eid = episode_id or str(uuid.uuid4())
+        self._df = self._source_df.copy()
+        self._state = DataState(episode_id=eid, step_count=0, task_id=task_id)
+        self._exec_namespace = self._build_namespace()
+        task_cls = TASKS.get(task_id)
+        if task_cls is None:
+            return DataObservation(
+                done=True,
+                reward=0.0,
+                success=False,
+                error=f"Invalid task_id: {task_id}. Must be 1, 2, or 3.",
+            )
+        self._task = task_cls(self._df)
+        return DataObservation(
+            done=False,
+            reward=0.0,
+            output="Environment ready. Use 'execute_code' actions to explore the dataset, then 'submit_answer' with your result.",
+            task_description=self._task.description,
+            dataset_info=self._dataset_info(),
+            metadata={"task_id": task_id, "difficulty": self._task.difficulty},
+        )
+    def step(
+        self,
+        action: DataAction,
+        timeout_s: Optional[float] = None,
+        **kwargs: Any,
+    ) -> DataObservation:
+        """Execute one step in the environment.
+        Handles two action types:
+        - execute_code: runs pandas code in a sandboxed namespace
+        - submit_answer: grades the agent's final answer and ends the episode
+        Args:
+            action: The agent's action (execute_code or submit_answer).
+            timeout_s: Optional timeout in seconds (unused).
+            **kwargs: Additional keyword arguments.
+        Returns:
+            An observation with execution output, reward, and done flag.
+        """
+        self._state.step_count += 1
+        if self._state.answer_submitted:
+            return DataObservation(
+                done=True,
+                reward=0.0,
+                output="Episode is already finished. Call reset() to start a new one.",
+                success=False,
+            )
+        # Check max steps
+        if self._state.step_count >= self.MAX_STEPS and action.action_type != "submit_answer":
+            self._state.answer_submitted = True
+            return DataObservation(
+                done=True,
+                reward=0.0,
+                output=f"Maximum steps ({self.MAX_STEPS}) exceeded without submitting an answer.",
+                success=False,
+                metadata={"reason": "max_steps_exceeded"},
+            )
+        if action.action_type == "execute_code":
+            return self._handle_execute_code(action)
+        elif action.action_type == "submit_answer":
+            return self._handle_submit_answer(action)
+        else:
+            return DataObservation(
+                done=False,
+                reward=-0.05,
+                success=False,
+                error=f"Unknown action_type: {action.action_type}",
+            )
+    def _handle_execute_code(self, action: DataAction) -> DataObservation:
+        """Execute pandas code in the sandboxed namespace.
+        Args:
+            action: The action containing the code to execute.
+        Returns:
+            An observation with stdout output or error message.
+        """
+        if not action.code:
+            return DataObservation(
+                done=False,
+                reward=-0.05,
+                success=False,
+                error="No code provided for execute_code action.",
+            )
+        stdout_capture = io.StringIO()
+        old_stdout = sys.stdout
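+        try:
+            sys.stdout = stdout_capture
+            try:
+                exec(action.code, self._exec_namespace)
+            finally:
+                # Restore stdout even if the agent code raises something
+                # that is not an Exception subclass (e.g. SystemExit)
+                sys.stdout = old_stdout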
+            output = stdout_capture.getvalue()
+            # If code produced no print output, try to get the last expression value
+            if not output.strip():
+                try:
+                    result = eval(action.code.strip().split("\n")[-1], self._exec_namespace)
+                    if result is not None:
+                        output = str(result)
+                except Exception:
+                    output = "(Code executed successfully with no output)"
+            return DataObservation(
+                done=False,
+                reward=0.05,
+                output=output[:5000],
+                success=True,
+                metadata={"steps_remaining": self.MAX_STEPS - self._state.step_count},
+            )
+        except Exception as e:
+            sys.stdout = old_stdout
+            return DataObservation(
+                done=False,
+                reward=-0.05,
+                success=False,
+                error=f"{type(e).__name__}: {e}",
+                output="",
+                metadata={"steps_remaining": self.MAX_STEPS - self._state.step_count},
+            )
+    def _handle_submit_answer(self, action: DataAction) -> DataObservation:
+        """Grade the agent's submitted answer and end the episode.
+        Args:
+            action: The action containing the answer to grade.
+        Returns:
+            An observation with the final score and done=True.
+        """
+        if not action.answer:
+            return DataObservation(
+                done=False,
+                reward=-0.05,
+                success=False,
+                error="No answer provided for submit_answer action.",
+            )
+        self._state.answer_submitted = True
+        score = self._task.grade(action.answer)
+        self._state.final_score = score
+        return DataObservation(
+            done=True,
+            reward=score,
+            output=f"Answer submitted. Score: {score:.2f}/1.00",
+            success=True,
+            metadata={
+                "score": score,
+                "expected_answer": self._task.expected_answer(),
+                "submitted_answer": action.answer,
+            },
+        )
+    @property
+    def state(self) -> DataState:
+        """Return the current episode state.
+        Returns:
+            The current DataState with episode_id, step_count, task_id, etc.
+        """
+        return self._state
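The combination of `_build_namespace` and `_handle_execute_code` can be tried in isolation with the standard library alone. The sketch below is a simplified stand-in, not the environment's code: `run_restricted` is a hypothetical helper, and it uses `contextlib.redirect_stdout` in place of manually swapping `sys.stdout`. Removing names from the builtins table limits accidental misuse, but it is not a security sandbox, since Python's object introspection can still reach the removed functions.

```python
import builtins
import contextlib
import io
from typing import Optional, Tuple

BLOCKED = {"open", "exec", "eval", "__import__", "compile", "exit", "quit"}


def run_restricted(code: str, extra: Optional[dict] = None) -> Tuple[bool, str]:
    """Run code with a reduced builtins table and return (success, text)."""
    safe_builtins = {
        name: getattr(builtins, name)
        for name in dir(builtins)
        if name not in BLOCKED
    }
    namespace = {"__builtins__": safe_builtins, **(extra or {})}
    buffer = io.StringIO()
    try:
        # redirect_stdout restores sys.stdout even when the code raises.
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, buffer.getvalue()


ok, out = run_restricted("print(sum(values))", {"values": [1, 2, 3]})
blocked_ok, blocked_msg = run_restricted("open('secret.txt')")
import_ok, import_msg = run_restricted("import os")
print(ok, out.strip(), blocked_msg, import_msg, sep=" | ")
```

Because `__import__` is removed, `import` statements inside agent code fail, which is why the environment pre-loads `pd` and `np` into the namespace.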

tasks/__init__.py ADDED Viewed

	@@ -0,0 +1,14 @@

+"""Task definitions for the Data Analysis Agent environment."""
+from tasks.base_task import BaseTask
+from tasks.task_easy import TopRevenueCategoryTask
+from tasks.task_medium import CityRevenueShareTask
+from tasks.task_hard import RepeatCustomerCohortTask
+TASKS = {
+    1: TopRevenueCategoryTask,
+    2: CityRevenueShareTask,
+    3: RepeatCustomerCohortTask,
+}
+__all__ = ["BaseTask", "TASKS", "TopRevenueCategoryTask", "CityRevenueShareTask", "RepeatCustomerCohortTask"]

tasks/base_task.py ADDED Viewed

	@@ -0,0 +1,62 @@

+"""Abstract base class for data analysis tasks.
+Each task defines a question, computes the expected answer from the dataset,
+and provides a grader that scores agent responses from 0.0 to 1.0.
+"""
+from abc import ABC, abstractmethod
+import pandas as pd
+class BaseTask(ABC):
+    """Base class for all data analysis tasks.
+    Subclasses must implement the question, compute the expected answer
+    from the dataset, and provide a grading function.
+    Attributes:
+        df: The pandas DataFrame containing the dataset.
+    """
+    def __init__(self, df: pd.DataFrame):
+        """Initialize the task with a dataset.
+        Args:
+            df: The pandas DataFrame to analyze.
+        """
+        self.df = df
+    @property
+    @abstractmethod
+    def task_id(self) -> int:
+        """Return the unique task identifier."""
+    @property
+    @abstractmethod
+    def difficulty(self) -> str:
+        """Return the difficulty level: 'easy', 'medium', or 'hard'."""
+    @property
+    @abstractmethod
+    def description(self) -> str:
+        """Return the task question shown to the agent."""
+    @abstractmethod
+    def expected_answer(self) -> str:
+        """Compute and return the ground-truth answer from the dataset.
+        Returns:
+            The expected answer as a formatted string.
+        """
+    @abstractmethod
+    def grade(self, answer: str) -> float:
+        """Grade the agent's submitted answer.
+        Args:
+            answer: The agent's submitted answer string.
+        Returns:
+            A score between 0.0 and 1.0.
+        """
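
The contract above can be illustrated with a small sketch. `RowCountTask` and the trimmed `BaseTask` stand-in below are hypothetical and not part of this commit. The stand-in copies only the abstract members so the snippet runs on its own, and a plain list takes the place of the DataFrame because only `len()` is needed here.

```python
from abc import ABC, abstractmethod


class BaseTask(ABC):
    """Trimmed stand-in for tasks.base_task.BaseTask (assumed: same abstract members)."""

    def __init__(self, df):
        self.df = df

    @property
    @abstractmethod
    def task_id(self) -> int: ...

    @property
    @abstractmethod
    def difficulty(self) -> str: ...

    @property
    @abstractmethod
    def description(self) -> str: ...

    @abstractmethod
    def expected_answer(self) -> str: ...

    @abstractmethod
    def grade(self, answer: str) -> float: ...


class RowCountTask(BaseTask):
    """Hypothetical task: report how many rows the dataset has."""

    @property
    def task_id(self) -> int:
        return 99  # illustrative id, not registered in TASKS

    @property
    def difficulty(self) -> str:
        return "easy"

    @property
    def description(self) -> str:
        return "How many rows does the dataset contain? Submit just the number."

    def expected_answer(self) -> str:
        # len() works for a list and for a pandas DataFrame alike.
        return str(len(self.df))

    def grade(self, answer: str) -> float:
        # Exact match after trimming whitespace; no partial credit.
        return 1.0 if answer.strip() == self.expected_answer() else 0.0


rows = [{"order_id": 1}, {"order_id": 2}, {"order_id": 3}]  # made-up data
task = RowCountTask(rows)
print(task.description)
print(task.grade(" 3 "))  # matching answer
print(task.grade("4"))    # non-matching answer
```

Because every member is abstract, a subclass that omits any of the five cannot be instantiated, which is what lets the environment rely on `description`, `expected_answer()` and `grade()` being present on every registered task.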

tasks/task_easy.py ADDED

	@@ -0,0 +1,55 @@

+"""Task 1 (Easy): Identify the top-selling product category by total revenue.
+Requires a single groupby + sum + idxmax operation.
+"""
+import pandas as pd
+from tasks.base_task import BaseTask
+class TopRevenueCategoryTask(BaseTask):
+    """Easy task: find the product category with the highest total revenue.
+    The agent must group the dataset by category, sum the total_price column,
+    and identify which category has the highest revenue.
+    """
+    @property
+    def task_id(self) -> int:
+        """Return the task identifier."""
+        return 1
+    @property
+    def difficulty(self) -> str:
+        """Return the difficulty level."""
+        return "easy"
+    @property
+    def description(self) -> str:
+        """Return the task question."""
+        return (
+            "What is the top-selling product category by total revenue? "
+            "Submit just the category name as your answer."
+        )
+    def expected_answer(self) -> str:
+        """Compute the top revenue category from the dataset.
+        Returns:
+            The name of the category with the highest total_price sum.
+        """
+        return str(self.df.groupby("category")["total_price"].sum().idxmax())
+    def grade(self, answer: str) -> float:
+        """Grade the answer by case-insensitive string match.
+        Args:
+            answer: The agent's submitted category name.
+        Returns:
+            1.0 if the answer matches the expected category, 0.0 otherwise.
+        """
+        expected = self.expected_answer().strip().lower()
+        submitted = answer.strip().lower()
+        return 1.0 if submitted == expected else 0.0
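
For readers who want to follow the easy task without pandas, the groupby, sum and arg-max steps can be sketched in plain Python. The rows below are made up for illustration (the real data lives in `datasets/sales.csv`), and `top_revenue_category` and `grade` are hypothetical helper names; only the column names `category` and `total_price` come from the task itself.

```python
from collections import defaultdict

# Made-up rows; only the column names are taken from the task.
rows = [
    {"category": "Electronics", "total_price": 120.0},
    {"category": "Books", "total_price": 35.5},
    {"category": "Electronics", "total_price": 80.0},
    {"category": "Toys", "total_price": 150.0},
]


def top_revenue_category(rows):
    """Sum total_price per category and return the category with the largest sum."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["category"]] += row["total_price"]
    return max(totals, key=totals.get)


def grade(answer, expected):
    """Case-insensitive, whitespace-tolerant exact match, as in the task's grader."""
    return 1.0 if answer.strip().lower() == expected.strip().lower() else 0.0


expected = top_revenue_category(rows)
print(expected)
print(grade("  electronics ", expected))
print(grade("Toys", expected))
```

Here Electronics totals 200.0 against 150.0 for Toys, so a submission of `electronics` in any casing earns full credit and any other category earns none.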

tasks/task_hard.py ADDED

	@@ -0,0 +1,111 @@

+"""Task 3 (Hard): Analyze repeat customers who ordered in both January and December.
+Requires temporal filtering, set intersection, and conditional aggregation.
+"""
+import re
+import pandas as pd
+from tasks.base_task import BaseTask
+class RepeatCustomerCohortTask(BaseTask):
+    """Hard task: find customers who ordered in both January and December.
+    The agent must identify customers present in both months, count them,
+    and compare their average order value to all other customers.
+    """
+    @property
+    def task_id(self) -> int:
+        """Return the task identifier."""
+        return 3
+    @property
+    def difficulty(self) -> str:
+        """Return the difficulty level."""
+        return "hard"
+    @property
+    def description(self) -> str:
+        """Return the task question."""
+        return (
+            "How many unique customers placed orders in BOTH January and December? "
+            "What is their average order value compared to all other customers? "
+            "Submit your answer in the format: "
+            "'Cohort: N customers, Cohort AOV: $X.XX, Other AOV: $X.XX'"
+        )
+    def _compute_cohort(self) -> tuple[set, float, float]:
+        """Compute the cohort of customers ordering in both January and December.
+        Returns:
+            A tuple of (cohort_customer_ids, cohort_aov, other_aov).
+        """
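+        df = self.df.copy()
+        df["order_date"] = pd.to_datetime(df["order_date"])
+        # Months are matched regardless of year, so a January order and a
+        # December order from different years both count toward the cohort.
+        jan_customers = set(df[df["order_date"].dt.month == 1]["customer_id"])
+        dec_customers = set(df[df["order_date"].dt.month == 12]["customer_id"])
+        cohort = jan_customers & dec_customers
+        # AOV is the mean total_price over all rows of each group, not only
+        # over the January and December rows.
+        cohort_aov = df[df["customer_id"].isin(cohort)]["total_price"].mean()
+        other_aov = df[~df["customer_id"].isin(cohort)]["total_price"].mean()
+        return cohort, round(cohort_aov, 2), round(other_aov, 2)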
+    def expected_answer(self) -> str:
+        """Compute the expected cohort analysis answer.
+        Returns:
+            Formatted string like 'Cohort: 57 customers, Cohort AOV: $126.57, Other AOV: $122.94'.
+        """
+        cohort, cohort_aov, other_aov = self._compute_cohort()
+        return f"Cohort: {len(cohort)} customers, Cohort AOV: ${cohort_aov:.2f}, Other AOV: ${other_aov:.2f}"
+    def grade(self, answer: str) -> float:
+        """Grade the answer with partial credit for each of the three fields.
+        Scoring:
+            - 0.33 for correct customer count (exact match)
+            - 0.33 for cohort AOV within ±0.5% of expected
+            - 0.34 for other AOV within ±0.5% of expected
+        Args:
+            answer: The agent's submitted answer string.
+        Returns:
+            A score between 0.0 and 1.0.
+        """
+        cohort, expected_cohort_aov, expected_other_aov = self._compute_cohort()
+        expected_count = len(cohort)
+        score = 0.0
+        # Check customer count
+        count_match = re.search(r"Cohort:\s*(\d+)\s*customers?", answer, re.IGNORECASE)
+        if count_match:
+            if int(count_match.group(1)) == expected_count:
+                score += 0.33
+        # Check cohort AOV. The pattern matches a plain decimal, so a trailing
+        # period in the answer (e.g. "$126.57.") does not break float().
+        cohort_aov_match = re.search(r"Cohort\s+AOV:\s*\$?(\d+(?:\.\d+)?)", answer, re.IGNORECASE)
+        if cohort_aov_match:
+            try:
+                submitted = float(cohort_aov_match.group(1))
+                tolerance = abs(expected_cohort_aov) * 0.005
+                if abs(submitted - expected_cohort_aov) <= tolerance:
+                    score += 0.33
+            except ValueError:
+                pass
+        # Check other AOV
+        other_aov_match = re.search(r"Other\s+AOV:\s*\$?(\d+(?:\.\d+)?)", answer, re.IGNORECASE)
+        if other_aov_match:
+            try:
+                submitted = float(other_aov_match.group(1))
+                tolerance = abs(expected_other_aov) * 0.005
+                if abs(submitted - expected_other_aov) <= tolerance:
+                    score += 0.34
+            except ValueError:
+                pass
+        return score
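
The partial-credit scheme in `grade` can be tried in isolation with the sketch below. `grade_cohort_answer` is a hypothetical standalone function, not part of this commit. The weights (0.33 / 0.33 / 0.34) and the ±0.5% tolerance are taken from the docstring, the expected values reuse the illustrative numbers from the `expected_answer` docstring, and the decimal pattern is this sketch's own choice.

```python
import re


def grade_cohort_answer(answer, expected_count, expected_cohort_aov, expected_other_aov):
    """Score an answer field by field: count (exact), then two AOVs (within 0.5%)."""
    score = 0.0

    # Customer count must match exactly.
    count_match = re.search(r"Cohort:\s*(\d+)\s*customers?", answer, re.IGNORECASE)
    if count_match and int(count_match.group(1)) == expected_count:
        score += 0.33

    # Each AOV earns its weight when it lies within 0.5% of the expected value.
    fields = (
        ("Cohort", expected_cohort_aov, 0.33),
        ("Other", expected_other_aov, 0.34),
    )
    for label, expected, weight in fields:
        # Assumption of this sketch: a plain decimal such as 126.57.
        match = re.search(label + r"\s+AOV:\s*\$?(\d+(?:\.\d+)?)", answer, re.IGNORECASE)
        if match is None:
            continue
        submitted = float(match.group(1))
        if abs(submitted - expected) <= abs(expected) * 0.005:
            score += weight
    return score


# Illustrative expected values, borrowed from the docstring example.
EXPECTED = (57, 126.57, 122.94)

full = grade_cohort_answer(
    "Cohort: 57 customers, Cohort AOV: $126.57, Other AOV: $122.94", *EXPECTED
)
partial = grade_cohort_answer(
    "Cohort: 50 customers, Cohort AOV: $127.00, Other AOV: $130.00", *EXPECTED
)
nothing = grade_cohort_answer("about sixty customers", *EXPECTED)
print(full, partial, nothing)
```

In the second call only the cohort AOV is close enough (127.00 is within 0.5% of 126.57), so the answer earns that one field's weight. Since the total is a sum of floats, comparing it with a small tolerance is safer than testing for exact equality with 1.0.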

tasks/task_medium.py ADDED

	@@ -0,0 +1,85 @@

+"""Task 2 (Medium): Find the top revenue city and its share of total revenue.
+Requires groupby + aggregation + percentage calculation + formatting.
+"""
+import re
+import pandas as pd
+from tasks.base_task import BaseTask
+class CityRevenueShareTask(BaseTask):
+    """Medium task: identify the city with the highest revenue and its percentage share.
+    The agent must group by city, compute total revenue per city,
+    find the top city, and calculate what percentage of overall revenue it represents.
+    """
+    @property
+    def task_id(self) -> int:
+        """Return the task identifier."""
+        return 2
+    @property
+    def difficulty(self) -> str:
+        """Return the difficulty level."""
+        return "medium"
+    @property
+    def description(self) -> str:
+        """Return the task question."""
+        return (
+            "Which city generates the most revenue? What percentage of total revenue "
+            "does it represent? Round to 2 decimal places. "
+            "Submit your answer in the format: 'City: <name>, Percentage: <X.XX>%'"
+        )
+    def expected_answer(self) -> str:
+        """Compute the top city and its revenue share.
+        Returns:
+            Formatted string like 'City: London, Percentage: 10.81%'.
+        """
+        city_rev = self.df.groupby("city")["total_price"].sum()
+        top_city = city_rev.idxmax()
+        pct = round(city_rev[top_city] / city_rev.sum() * 100, 2)
+        return f"City: {top_city}, Percentage: {pct:.2f}%"
+    def grade(self, answer: str) -> float:
+        """Grade the answer with partial credit for city and percentage.
+        Scoring:
+            - 0.5 for correct city name (case-insensitive)
+            - 0.5 for percentage within ±0.1 percentage points of expected
+        Args:
+            answer: The agent's submitted answer string.
+        Returns:
+            A score between 0.0 and 1.0.
+        """
+        score = 0.0
+        city_rev = self.df.groupby("city")["total_price"].sum()
+        expected_city = city_rev.idxmax()
+        expected_pct = round(city_rev[expected_city] / city_rev.sum() * 100, 2)
+        # Check city
+        city_match = re.search(r"City:\s*([^,]+)", answer, re.IGNORECASE)
+        if city_match:
+            submitted_city = city_match.group(1).strip()
+            if submitted_city.lower() == expected_city.lower():
+                score += 0.5
+        # Check percentage
+        pct_match = re.search(r"Percentage:\s*(\d+(?:\.\d+)?)\s*%?", answer, re.IGNORECASE)
+        if pct_match:
+            try:
+                submitted_pct = float(pct_match.group(1))
+                if abs(submitted_pct - expected_pct) <= 0.1:
+                    score += 0.5
+            except ValueError:
+                pass
+        return score
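
The two-field grading for the medium task can be sketched the same way. `grade_city_share` is a hypothetical standalone function, not part of this commit. The 0.5 / 0.5 split and the ±0.1 tolerance follow the docstring, and `London` / `10.81` are the illustrative values from the `expected_answer` docstring rather than real results.

```python
import re


def grade_city_share(answer, expected_city, expected_pct):
    """Award 0.5 for the city name and 0.5 for a percentage within 0.1 of expected."""
    score = 0.0

    # City: everything after "City:" up to the first comma, compared case-insensitively.
    city_match = re.search(r"City:\s*([^,]+)", answer, re.IGNORECASE)
    if city_match and city_match.group(1).strip().lower() == expected_city.lower():
        score += 0.5

    # Percentage: a plain decimal (this sketch's assumption); the % sign is optional.
    pct_match = re.search(r"Percentage:\s*(\d+(?:\.\d+)?)", answer, re.IGNORECASE)
    if pct_match and abs(float(pct_match.group(1)) - expected_pct) <= 0.1:
        score += 0.5
    return score


print(grade_city_share("City: London, Percentage: 10.81%", "London", 10.81))
print(grade_city_share("City: Paris, Percentage: 10.81%", "London", 10.81))
print(grade_city_share("City: London, Percentage: 12%", "London", 10.81))
print(grade_city_share("London 10.81", "London", 10.81))
```

The last call shows why the submission format matters: without the `City:` and `Percentage:` labels neither pattern matches, so a numerically correct answer still scores zero.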

uv.lock ADDED

The diff for this file is too large to render.