--- title: Data Analysis Agent Environment emoji: 📊 colorFrom: blue colorTo: indigo sdk: docker pinned: false --- # Data Analysis Agent Environment An OpenEnv-compliant RL environment for training and evaluating data analysis agents. Agents execute pandas code against a business dataset to answer analytical questions, graded by deterministic programmatic graders. ## Motivation Data analysis is a universal real-world task. Every business needs analysts who can query datasets, compute metrics, and extract insights. This environment lets RL agents practice that exact workflow — explore a dataset with code, then submit a precise answer — with automatic scoring. ## Action & Observation Spaces ### Action (`DataAction`) | Field | Type | Description | |---|---|---| | `action_type` | `"execute_code"` or `"submit_answer"` | What the agent wants to do | | `code` | `str` (optional) | Python/pandas code to execute | | `answer` | `str` (optional) | Final answer to submit for grading | ### Observation (`DataObservation`) | Field | Type | Description | |---|---|---| | `output` | `str` | Stdout from code execution or environment messages | | `success` | `bool` | Whether the action succeeded | | `error` | `str` (optional) | Error message if action failed | | `task_description` | `str` | The question to answer (set on reset) | | `dataset_info` | `str` | Dataset schema summary (set on reset) | | `done` | `bool` | Whether the episode is over | | `reward` | `float` | Step reward | ### State (`DataState`) | Field | Type | Description | |---|---|---| | `episode_id` | `str` | Unique episode identifier | | `step_count` | `int` | Current step number | | `task_id` | `int` | Active task (1–6) | | `answer_submitted` | `bool` | Whether final answer was submitted | | `final_score` | `float` | Graded score after submission | ## Tasks Tasks use two data sources: - **`df`** — synthetic e-commerce sales CSV (~2000 orders): `order_id`, `customer_id`, `product_name`, `category`, `quantity`, `unit_price`, `total_price`, `order_date`, `city`, `country` - **SQLite DB** (`store_data.db`) — additional tables for cross-source tasks: `customer_profiles` (300 rows), `product_catalog` (25 rows) ### Task 1 — Easy: Top Revenue Category - **Question**: What is the top-selling product category by total revenue? - **Grading**: Containment match (case-insensitive) → 1.0 or 0.0 - **Expected difficulty**: Single groupby + sum + argmax ### Task 2 — Medium: City Revenue Share - **Question**: Which city generates the most revenue? What percentage of total revenue does it represent? - **Grading**: 0.5 for correct city + 0.5 for percentage within ±0.1% - **Expected difficulty**: Groupby + percentage calculation + formatting ### Task 3 — Medium: Repeat Customer Cohort Analysis - **Question**: How many unique customers ordered in both January and December? Compare their average order value to all other customers. - **Grading**: 0.33 per correct field (count, cohort AOV, other AOV) - **Expected difficulty**: Temporal filtering, set intersection, conditional aggregation ### Task 4 — Hard: Monthly Revenue Ratio - **Question**: Which month had the highest vs. lowest total revenue? What is the ratio between them? - **Grading**: 0.33 for best month + 0.33 for worst month + 0.34 for ratio within ±0.01 - **Expected difficulty**: Monthly resample/groupby, min/max comparison, ratio formatting ### Task 5 — Hard: Customer Loyalty Tier Revenue (cross-source) - **Question**: Which customer loyalty tier generates the highest total revenue and what percentage does it represent? - **Data**: Requires joining `df` with `customer_profiles` table from SQLite on `customer_id` - **Grading**: 0.33 for tier name + 0.33 for revenue within ±0.5% + 0.34 for percentage within ±0.1 - **Expected difficulty**: SQLite query → pandas merge → groupby aggregation ### Task 6 — Hard: Supplier Profitability (cross-source) - **Question**: Which supplier has the highest total profit? What is their average profit margin? - **Data**: Requires joining `df` with `product_catalog` table from SQLite on `product_name` - **Grading**: 0.33 for supplier name + 0.34 for total profit within ±0.5% + 0.33 for avg margin within ±0.1 - **Expected difficulty**: SQLite query → pandas merge → per-order profit/margin calculation → group aggregation ## Reward Function | Event | Reward | |---|---| | Successful code execution | +0.05 | | Code execution error | -0.05 | | Final answer (graded) | 0.0 — 1.0 based on task grader | | Max steps (20) exceeded | 0.0 | ## Setup & Usage ### Prerequisites - Python 3.13+ - [uv](https://docs.astral.sh/uv/) package manager ### Install ```bash uv sync ``` ### Run the server ```bash uv run uvicorn server.app:app --host 0.0.0.0 --port 8000 ``` ### Run the inference - First export all the required env variables mentioned in the .env.example. Then run below command ```bash uv run python inference.py ``` ### Run the baseline ```bash OPENAI_API_KEY=sk-... uv run python baseline.py # Against a deployed HF Space: OPENAI_API_KEY=sk-... uv run python baseline.py --base-url https://-.hf.space ``` ### Docker (local) ```bash docker build -t data-analysis-env . docker run -p 7860:7860 data-analysis-env ``` ### Client usage (Python) ```python from client import DataAnalysisClient from models import DataAction # Async async with DataAnalysisClient(base_url="http://localhost:8000") as client: result = await client.reset(task_id=1) result = await client.step(DataAction(action_type="execute_code", code="print(df.head())")) result = await client.step(DataAction(action_type="submit_answer", answer="Electronics")) # Sync with DataAnalysisClient(base_url="http://localhost:8000").sync() as client: result = client.reset(task_id=2) result = client.step(DataAction(action_type="execute_code", code="print(df.groupby('city')['total_price'].sum())")) ``` ## Project Structure ``` ├── models.py # DataAction, DataObservation, DataState ├── client.py # DataAnalysisClient (EnvClient subclass) ├── inference.py # HF inference script (uses HF Inference API) ├── baseline.py # OpenAI baseline inference script ├── helpers/ │ └── response_parser.py # Robust LLM JSON response parser ├── tasks/ │ ├── base_task.py # Task ABC with grade() interface │ ├── task_easy.py # Task 1 (Easy): Top revenue category │ ├── task_medium.py # Task 2 (Medium): City revenue share │ ├── task_medium_2.py # Task 4 (Hard): Monthly revenue ratio │ ├── task_hard.py # Task 3 (Medium): Repeat customer cohort │ ├── task_hard_2.py # Task 5 (Hard): Customer loyalty tier revenue │ └── task_hard_3.py # Task 6 (Hard): Supplier profitability ├── datasets/ │ ├── sales.csv # Synthetic e-commerce sales dataset │ └── store_data.db # SQLite DB: customer_profiles, product_catalog ├── server/ │ ├── app.py # FastAPI app entry point │ └── data_analysis_env.py # Environment implementation ├── Dockerfile # HF Spaces Docker build (port 7860) ├── openenv.yaml # OpenEnv spec metadata └── pyproject.toml # Dependencies and project config ```