Spaces:
Sleeping
Sleeping
metadata
title: Data Analysis Agent Environment
emoji: π
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
Data Analysis Agent Environment
An OpenEnv-compliant RL environment for training and evaluating data analysis agents. Agents execute pandas code against a business dataset to answer analytical questions, graded by deterministic programmatic graders.
Motivation
Data analysis is a universal real-world task. Every business needs analysts who can query datasets, compute metrics, and extract insights. This environment lets RL agents practice that exact workflow β explore a dataset with code, then submit a precise answer β with automatic scoring.
Action & Observation Spaces
Action (DataAction)
| Field | Type | Description |
|---|---|---|
action_type |
"execute_code" or "submit_answer" |
What the agent wants to do |
code |
str (optional) |
Python/pandas code to execute |
answer |
str (optional) |
Final answer to submit for grading |
Observation (DataObservation)
| Field | Type | Description |
|---|---|---|
output |
str |
Stdout from code execution or environment messages |
success |
bool |
Whether the action succeeded |
error |
str (optional) |
Error message if action failed |
task_description |
str |
The question to answer (set on reset) |
dataset_info |
str |
Dataset schema summary (set on reset) |
done |
bool |
Whether the episode is over |
reward |
float |
Step reward |
State (DataState)
| Field | Type | Description |
|---|---|---|
episode_id |
str |
Unique episode identifier |
step_count |
int |
Current step number |
task_id |
int |
Active task (1β6) |
answer_submitted |
bool |
Whether final answer was submitted |
final_score |
float |
Graded score after submission |
Tasks
Tasks use two data sources:
dfβ synthetic e-commerce sales CSV (~2000 orders):order_id,customer_id,product_name,category,quantity,unit_price,total_price,order_date,city,country- SQLite DB (
store_data.db) β additional tables for cross-source tasks:customer_profiles(300 rows),product_catalog(25 rows)
Task 1 β Easy: Top Revenue Category
- Question: What is the top-selling product category by total revenue?
- Grading: Containment match (case-insensitive) β 1.0 or 0.0
- Expected difficulty: Single groupby + sum + argmax
Task 2 β Medium: City Revenue Share
- Question: Which city generates the most revenue? What percentage of total revenue does it represent?
- Grading: 0.5 for correct city + 0.5 for percentage within Β±0.1%
- Expected difficulty: Groupby + percentage calculation + formatting
Task 3 β Medium: Repeat Customer Cohort Analysis
- Question: How many unique customers ordered in both January and December? Compare their average order value to all other customers.
- Grading: 0.33 per correct field (count, cohort AOV, other AOV)
- Expected difficulty: Temporal filtering, set intersection, conditional aggregation
Task 4 β Hard: Monthly Revenue Ratio
- Question: Which month had the highest vs. lowest total revenue? What is the ratio between them?
- Grading: 0.33 for best month + 0.33 for worst month + 0.34 for ratio within Β±0.01
- Expected difficulty: Monthly resample/groupby, min/max comparison, ratio formatting
Task 5 β Hard: Customer Loyalty Tier Revenue (cross-source)
- Question: Which customer loyalty tier generates the highest total revenue and what percentage does it represent?
- Data: Requires joining
dfwithcustomer_profilestable from SQLite oncustomer_id - Grading: 0.33 for tier name + 0.33 for revenue within Β±0.5% + 0.34 for percentage within Β±0.1
- Expected difficulty: SQLite query β pandas merge β groupby aggregation
Task 6 β Hard: Supplier Profitability (cross-source)
- Question: Which supplier has the highest total profit? What is their average profit margin?
- Data: Requires joining
dfwithproduct_catalogtable from SQLite onproduct_name - Grading: 0.33 for supplier name + 0.34 for total profit within Β±0.5% + 0.33 for avg margin within Β±0.1
- Expected difficulty: SQLite query β pandas merge β per-order profit/margin calculation β group aggregation
Reward Function
| Event | Reward |
|---|---|
| Successful code execution | +0.05 |
| Code execution error | -0.05 |
| Final answer (graded) | 0.0 β 1.0 based on task grader |
| Max steps (20) exceeded | 0.0 |
Setup & Usage
Prerequisites
- Python 3.13+
- uv package manager
Install
uv sync
Run the server
uv run uvicorn server.app:app --host 0.0.0.0 --port 8000
Run the inference
- First export all the required env variables mentioned in the .env.example. Then run below command
uv run python inference.py
Run the baseline
OPENAI_API_KEY=sk-... uv run python baseline.py
# Against a deployed HF Space:
OPENAI_API_KEY=sk-... uv run python baseline.py --base-url https://<your-username>-<space-name>.hf.space
Docker (local)
docker build -t data-analysis-env .
docker run -p 7860:7860 data-analysis-env
Client usage (Python)
from client import DataAnalysisClient
from models import DataAction
# Async
async with DataAnalysisClient(base_url="http://localhost:8000") as client:
result = await client.reset(task_id=1)
result = await client.step(DataAction(action_type="execute_code", code="print(df.head())"))
result = await client.step(DataAction(action_type="submit_answer", answer="Electronics"))
# Sync
with DataAnalysisClient(base_url="http://localhost:8000").sync() as client:
result = client.reset(task_id=2)
result = client.step(DataAction(action_type="execute_code", code="print(df.groupby('city')['total_price'].sum())"))
Project Structure
βββ models.py # DataAction, DataObservation, DataState
βββ client.py # DataAnalysisClient (EnvClient subclass)
βββ inference.py # HF inference script (uses HF Inference API)
βββ baseline.py # OpenAI baseline inference script
βββ helpers/
β βββ response_parser.py # Robust LLM JSON response parser
βββ tasks/
β βββ base_task.py # Task ABC with grade() interface
β βββ task_easy.py # Task 1 (Easy): Top revenue category
β βββ task_medium.py # Task 2 (Medium): City revenue share
β βββ task_medium_2.py # Task 4 (Hard): Monthly revenue ratio
β βββ task_hard.py # Task 3 (Medium): Repeat customer cohort
β βββ task_hard_2.py # Task 5 (Hard): Customer loyalty tier revenue
β βββ task_hard_3.py # Task 6 (Hard): Supplier profitability
βββ datasets/
β βββ sales.csv # Synthetic e-commerce sales dataset
β βββ store_data.db # SQLite DB: customer_profiles, product_catalog
βββ server/
β βββ app.py # FastAPI app entry point
β βββ data_analysis_env.py # Environment implementation
βββ Dockerfile # HF Spaces Docker build (port 7860)
βββ openenv.yaml # OpenEnv spec metadata
βββ pyproject.toml # Dependencies and project config