DataAnalysis_Env / README.md
Mohammed-Altaf's picture
updated readme and baseline.py
7fd7757
metadata
title: Data Analysis Agent Environment
emoji: πŸ“Š
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false

Data Analysis Agent Environment

An OpenEnv-compliant RL environment for training and evaluating data analysis agents. Agents execute pandas code against a business dataset to answer analytical questions, graded by deterministic programmatic graders.

Motivation

Data analysis is a universal real-world task. Every business needs analysts who can query datasets, compute metrics, and extract insights. This environment lets RL agents practice that exact workflow β€” explore a dataset with code, then submit a precise answer β€” with automatic scoring.

Action & Observation Spaces

Action (DataAction)

Field Type Description
action_type "execute_code" or "submit_answer" What the agent wants to do
code str (optional) Python/pandas code to execute
answer str (optional) Final answer to submit for grading

Observation (DataObservation)

Field Type Description
output str Stdout from code execution or environment messages
success bool Whether the action succeeded
error str (optional) Error message if action failed
task_description str The question to answer (set on reset)
dataset_info str Dataset schema summary (set on reset)
done bool Whether the episode is over
reward float Step reward

State (DataState)

Field Type Description
episode_id str Unique episode identifier
step_count int Current step number
task_id int Active task (1–6)
answer_submitted bool Whether final answer was submitted
final_score float Graded score after submission

Tasks

Tasks use two data sources:

  • df β€” synthetic e-commerce sales CSV (~2000 orders): order_id, customer_id, product_name, category, quantity, unit_price, total_price, order_date, city, country
  • SQLite DB (store_data.db) β€” additional tables for cross-source tasks: customer_profiles (300 rows), product_catalog (25 rows)

Task 1 β€” Easy: Top Revenue Category

  • Question: What is the top-selling product category by total revenue?
  • Grading: Containment match (case-insensitive) β†’ 1.0 or 0.0
  • Expected difficulty: Single groupby + sum + argmax

Task 2 β€” Medium: City Revenue Share

  • Question: Which city generates the most revenue? What percentage of total revenue does it represent?
  • Grading: 0.5 for correct city + 0.5 for percentage within Β±0.1%
  • Expected difficulty: Groupby + percentage calculation + formatting

Task 3 β€” Medium: Repeat Customer Cohort Analysis

  • Question: How many unique customers ordered in both January and December? Compare their average order value to all other customers.
  • Grading: 0.33 per correct field (count, cohort AOV, other AOV)
  • Expected difficulty: Temporal filtering, set intersection, conditional aggregation

Task 4 β€” Hard: Monthly Revenue Ratio

  • Question: Which month had the highest vs. lowest total revenue? What is the ratio between them?
  • Grading: 0.33 for best month + 0.33 for worst month + 0.34 for ratio within Β±0.01
  • Expected difficulty: Monthly resample/groupby, min/max comparison, ratio formatting

Task 5 β€” Hard: Customer Loyalty Tier Revenue (cross-source)

  • Question: Which customer loyalty tier generates the highest total revenue and what percentage does it represent?
  • Data: Requires joining df with customer_profiles table from SQLite on customer_id
  • Grading: 0.33 for tier name + 0.33 for revenue within Β±0.5% + 0.34 for percentage within Β±0.1
  • Expected difficulty: SQLite query β†’ pandas merge β†’ groupby aggregation

Task 6 β€” Hard: Supplier Profitability (cross-source)

  • Question: Which supplier has the highest total profit? What is their average profit margin?
  • Data: Requires joining df with product_catalog table from SQLite on product_name
  • Grading: 0.33 for supplier name + 0.34 for total profit within Β±0.5% + 0.33 for avg margin within Β±0.1
  • Expected difficulty: SQLite query β†’ pandas merge β†’ per-order profit/margin calculation β†’ group aggregation

Reward Function

Event Reward
Successful code execution +0.05
Code execution error -0.05
Final answer (graded) 0.0 β€” 1.0 based on task grader
Max steps (20) exceeded 0.0

Setup & Usage

Prerequisites

  • Python 3.13+
  • uv package manager

Install

uv sync

Run the server

uv run uvicorn server.app:app --host 0.0.0.0 --port 8000

Run the inference

  • First export all the required env variables mentioned in the .env.example. Then run below command
uv run python inference.py

Run the baseline

OPENAI_API_KEY=sk-... uv run python baseline.py
# Against a deployed HF Space:
OPENAI_API_KEY=sk-... uv run python baseline.py --base-url https://<your-username>-<space-name>.hf.space

Docker (local)

docker build -t data-analysis-env .
docker run -p 7860:7860 data-analysis-env

Client usage (Python)

from client import DataAnalysisClient
from models import DataAction

# Async
async with DataAnalysisClient(base_url="http://localhost:8000") as client:
    result = await client.reset(task_id=1)
    result = await client.step(DataAction(action_type="execute_code", code="print(df.head())"))
    result = await client.step(DataAction(action_type="submit_answer", answer="Electronics"))

# Sync
with DataAnalysisClient(base_url="http://localhost:8000").sync() as client:
    result = client.reset(task_id=2)
    result = client.step(DataAction(action_type="execute_code", code="print(df.groupby('city')['total_price'].sum())"))

Project Structure

β”œβ”€β”€ models.py                  # DataAction, DataObservation, DataState
β”œβ”€β”€ client.py                  # DataAnalysisClient (EnvClient subclass)
β”œβ”€β”€ inference.py               # HF inference script (uses HF Inference API)
β”œβ”€β”€ baseline.py                # OpenAI baseline inference script
β”œβ”€β”€ helpers/
β”‚   └── response_parser.py     # Robust LLM JSON response parser
β”œβ”€β”€ tasks/
β”‚   β”œβ”€β”€ base_task.py           # Task ABC with grade() interface
β”‚   β”œβ”€β”€ task_easy.py           # Task 1 (Easy): Top revenue category
β”‚   β”œβ”€β”€ task_medium.py         # Task 2 (Medium): City revenue share
β”‚   β”œβ”€β”€ task_medium_2.py       # Task 4 (Hard): Monthly revenue ratio
β”‚   β”œβ”€β”€ task_hard.py           # Task 3 (Medium): Repeat customer cohort
β”‚   β”œβ”€β”€ task_hard_2.py         # Task 5 (Hard): Customer loyalty tier revenue
β”‚   └── task_hard_3.py         # Task 6 (Hard): Supplier profitability
β”œβ”€β”€ datasets/
β”‚   β”œβ”€β”€ sales.csv              # Synthetic e-commerce sales dataset
β”‚   └── store_data.db          # SQLite DB: customer_profiles, product_catalog
β”œβ”€β”€ server/
β”‚   β”œβ”€β”€ app.py                 # FastAPI app entry point
β”‚   └── data_analysis_env.py   # Environment implementation
β”œβ”€β”€ Dockerfile                 # HF Spaces Docker build (port 7860)
β”œβ”€β”€ openenv.yaml               # OpenEnv spec metadata
└── pyproject.toml             # Dependencies and project config