Spaces:
Sleeping
Sleeping
| title: Data Analysis Agent Environment | |
| emoji: π | |
| colorFrom: blue | |
| colorTo: indigo | |
| sdk: docker | |
| pinned: false | |
| # Data Analysis Agent Environment | |
| An OpenEnv-compliant RL environment for training and evaluating data analysis agents. Agents execute pandas code against a business dataset to answer analytical questions, graded by deterministic programmatic graders. | |
| ## Motivation | |
| Data analysis is a universal real-world task. Every business needs analysts who can query datasets, compute metrics, and extract insights. This environment lets RL agents practice that exact workflow β explore a dataset with code, then submit a precise answer β with automatic scoring. | |
| ## Action & Observation Spaces | |
| ### Action (`DataAction`) | |
| | Field | Type | Description | | |
| |---|---|---| | |
| | `action_type` | `"execute_code"` or `"submit_answer"` | What the agent wants to do | | |
| | `code` | `str` (optional) | Python/pandas code to execute | | |
| | `answer` | `str` (optional) | Final answer to submit for grading | | |
| ### Observation (`DataObservation`) | |
| | Field | Type | Description | | |
| |---|---|---| | |
| | `output` | `str` | Stdout from code execution or environment messages | | |
| | `success` | `bool` | Whether the action succeeded | | |
| | `error` | `str` (optional) | Error message if action failed | | |
| | `task_description` | `str` | The question to answer (set on reset) | | |
| | `dataset_info` | `str` | Dataset schema summary (set on reset) | | |
| | `done` | `bool` | Whether the episode is over | | |
| | `reward` | `float` | Step reward | | |
| ### State (`DataState`) | |
| | Field | Type | Description | | |
| |---|---|---| | |
| | `episode_id` | `str` | Unique episode identifier | | |
| | `step_count` | `int` | Current step number | | |
| | `task_id` | `int` | Active task (1β6) | | |
| | `answer_submitted` | `bool` | Whether final answer was submitted | | |
| | `final_score` | `float` | Graded score after submission | | |
| ## Tasks | |
| Tasks use two data sources: | |
| - **`df`** β synthetic e-commerce sales CSV (~2000 orders): `order_id`, `customer_id`, `product_name`, `category`, `quantity`, `unit_price`, `total_price`, `order_date`, `city`, `country` | |
| - **SQLite DB** (`store_data.db`) β additional tables for cross-source tasks: `customer_profiles` (300 rows), `product_catalog` (25 rows) | |
| ### Task 1 β Easy: Top Revenue Category | |
| - **Question**: What is the top-selling product category by total revenue? | |
| - **Grading**: Containment match (case-insensitive) β 1.0 or 0.0 | |
| - **Expected difficulty**: Single groupby + sum + argmax | |
| ### Task 2 β Medium: City Revenue Share | |
| - **Question**: Which city generates the most revenue? What percentage of total revenue does it represent? | |
| - **Grading**: 0.5 for correct city + 0.5 for percentage within Β±0.1% | |
| - **Expected difficulty**: Groupby + percentage calculation + formatting | |
| ### Task 3 β Medium: Repeat Customer Cohort Analysis | |
| - **Question**: How many unique customers ordered in both January and December? Compare their average order value to all other customers. | |
| - **Grading**: 0.33 per correct field (count, cohort AOV, other AOV) | |
| - **Expected difficulty**: Temporal filtering, set intersection, conditional aggregation | |
| ### Task 4 β Hard: Monthly Revenue Ratio | |
| - **Question**: Which month had the highest vs. lowest total revenue? What is the ratio between them? | |
| - **Grading**: 0.33 for best month + 0.33 for worst month + 0.34 for ratio within Β±0.01 | |
| - **Expected difficulty**: Monthly resample/groupby, min/max comparison, ratio formatting | |
| ### Task 5 β Hard: Customer Loyalty Tier Revenue (cross-source) | |
| - **Question**: Which customer loyalty tier generates the highest total revenue and what percentage does it represent? | |
| - **Data**: Requires joining `df` with `customer_profiles` table from SQLite on `customer_id` | |
| - **Grading**: 0.33 for tier name + 0.33 for revenue within Β±0.5% + 0.34 for percentage within Β±0.1 | |
| - **Expected difficulty**: SQLite query β pandas merge β groupby aggregation | |
| ### Task 6 β Hard: Supplier Profitability (cross-source) | |
| - **Question**: Which supplier has the highest total profit? What is their average profit margin? | |
| - **Data**: Requires joining `df` with `product_catalog` table from SQLite on `product_name` | |
| - **Grading**: 0.33 for supplier name + 0.34 for total profit within Β±0.5% + 0.33 for avg margin within Β±0.1 | |
| - **Expected difficulty**: SQLite query β pandas merge β per-order profit/margin calculation β group aggregation | |
| ## Reward Function | |
| | Event | Reward | | |
| |---|---| | |
| | Successful code execution | +0.05 | | |
| | Code execution error | -0.05 | | |
| | Final answer (graded) | 0.0 β 1.0 based on task grader | | |
| | Max steps (20) exceeded | 0.0 | | |
| ## Setup & Usage | |
| ### Prerequisites | |
| - Python 3.13+ | |
| - [uv](https://docs.astral.sh/uv/) package manager | |
| ### Install | |
| ```bash | |
| uv sync | |
| ``` | |
| ### Run the server | |
| ```bash | |
| uv run uvicorn server.app:app --host 0.0.0.0 --port 8000 | |
| ``` | |
| ### Run the inference | |
| - First export all the required env variables mentioned in the .env.example. Then run below command | |
| ```bash | |
| uv run python inference.py | |
| ``` | |
| ### Run the baseline | |
| ```bash | |
| OPENAI_API_KEY=sk-... uv run python baseline.py | |
| # Against a deployed HF Space: | |
| OPENAI_API_KEY=sk-... uv run python baseline.py --base-url https://<your-username>-<space-name>.hf.space | |
| ``` | |
| ### Docker (local) | |
| ```bash | |
| docker build -t data-analysis-env . | |
| docker run -p 7860:7860 data-analysis-env | |
| ``` | |
| ### Client usage (Python) | |
| ```python | |
| from client import DataAnalysisClient | |
| from models import DataAction | |
| # Async | |
| async with DataAnalysisClient(base_url="http://localhost:8000") as client: | |
| result = await client.reset(task_id=1) | |
| result = await client.step(DataAction(action_type="execute_code", code="print(df.head())")) | |
| result = await client.step(DataAction(action_type="submit_answer", answer="Electronics")) | |
| # Sync | |
| with DataAnalysisClient(base_url="http://localhost:8000").sync() as client: | |
| result = client.reset(task_id=2) | |
| result = client.step(DataAction(action_type="execute_code", code="print(df.groupby('city')['total_price'].sum())")) | |
| ``` | |
| ## Project Structure | |
| ``` | |
| βββ models.py # DataAction, DataObservation, DataState | |
| βββ client.py # DataAnalysisClient (EnvClient subclass) | |
| βββ inference.py # HF inference script (uses HF Inference API) | |
| βββ baseline.py # OpenAI baseline inference script | |
| βββ helpers/ | |
| β βββ response_parser.py # Robust LLM JSON response parser | |
| βββ tasks/ | |
| β βββ base_task.py # Task ABC with grade() interface | |
| β βββ task_easy.py # Task 1 (Easy): Top revenue category | |
| β βββ task_medium.py # Task 2 (Medium): City revenue share | |
| β βββ task_medium_2.py # Task 4 (Hard): Monthly revenue ratio | |
| β βββ task_hard.py # Task 3 (Medium): Repeat customer cohort | |
| β βββ task_hard_2.py # Task 5 (Hard): Customer loyalty tier revenue | |
| β βββ task_hard_3.py # Task 6 (Hard): Supplier profitability | |
| βββ datasets/ | |
| β βββ sales.csv # Synthetic e-commerce sales dataset | |
| β βββ store_data.db # SQLite DB: customer_profiles, product_catalog | |
| βββ server/ | |
| β βββ app.py # FastAPI app entry point | |
| β βββ data_analysis_env.py # Environment implementation | |
| βββ Dockerfile # HF Spaces Docker build (port 7860) | |
| βββ openenv.yaml # OpenEnv spec metadata | |
| βββ pyproject.toml # Dependencies and project config | |
| ``` | |