File size: 7,512 Bytes
cc5da55
 
 
 
 
 
 
 
 
8ab6a5f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
7fd7757
8ab6a5f
 
 
 
 
7fd7757
 
 
8ab6a5f
 
 
7fd7757
8ab6a5f
 
 
 
 
 
 
7fd7757
8ab6a5f
 
 
 
7fd7757
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
8ab6a5f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
7fd7757
 
 
 
 
 
8ab6a5f
 
 
7fd7757
 
8ab6a5f
 
cc5da55
 
 
 
 
 
8ab6a5f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
7fd7757
 
 
 
 
 
8ab6a5f
7fd7757
 
 
 
 
 
 
8ab6a5f
7fd7757
 
8ab6a5f
7fd7757
 
 
 
 
8ab6a5f
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
---
title: Data Analysis Agent Environment
emoji: πŸ“Š
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
---

# Data Analysis Agent Environment

An OpenEnv-compliant RL environment for training and evaluating data analysis agents. Agents execute pandas code against a business dataset to answer analytical questions, graded by deterministic programmatic graders.

## Motivation

Data analysis is a universal real-world task. Every business needs analysts who can query datasets, compute metrics, and extract insights. This environment lets RL agents practice that exact workflow β€” explore a dataset with code, then submit a precise answer β€” with automatic scoring.

## Action & Observation Spaces

### Action (`DataAction`)

| Field | Type | Description |
|---|---|---|
| `action_type` | `"execute_code"` or `"submit_answer"` | What the agent wants to do |
| `code` | `str` (optional) | Python/pandas code to execute |
| `answer` | `str` (optional) | Final answer to submit for grading |

### Observation (`DataObservation`)

| Field | Type | Description |
|---|---|---|
| `output` | `str` | Stdout from code execution or environment messages |
| `success` | `bool` | Whether the action succeeded |
| `error` | `str` (optional) | Error message if action failed |
| `task_description` | `str` | The question to answer (set on reset) |
| `dataset_info` | `str` | Dataset schema summary (set on reset) |
| `done` | `bool` | Whether the episode is over |
| `reward` | `float` | Step reward |

### State (`DataState`)

| Field | Type | Description |
|---|---|---|
| `episode_id` | `str` | Unique episode identifier |
| `step_count` | `int` | Current step number |
| `task_id` | `int` | Active task (1–6) |
| `answer_submitted` | `bool` | Whether final answer was submitted |
| `final_score` | `float` | Graded score after submission |

## Tasks

Tasks use two data sources:
- **`df`** β€” synthetic e-commerce sales CSV (~2000 orders): `order_id`, `customer_id`, `product_name`, `category`, `quantity`, `unit_price`, `total_price`, `order_date`, `city`, `country`
- **SQLite DB** (`store_data.db`) β€” additional tables for cross-source tasks: `customer_profiles` (300 rows), `product_catalog` (25 rows)

### Task 1 β€” Easy: Top Revenue Category
- **Question**: What is the top-selling product category by total revenue?
- **Grading**: Containment match (case-insensitive) β†’ 1.0 or 0.0
- **Expected difficulty**: Single groupby + sum + argmax

### Task 2 β€” Medium: City Revenue Share
- **Question**: Which city generates the most revenue? What percentage of total revenue does it represent?
- **Grading**: 0.5 for correct city + 0.5 for percentage within Β±0.1%
- **Expected difficulty**: Groupby + percentage calculation + formatting

### Task 3 β€” Medium: Repeat Customer Cohort Analysis
- **Question**: How many unique customers ordered in both January and December? Compare their average order value to all other customers.
- **Grading**: 0.33 per correct field (count, cohort AOV, other AOV)
- **Expected difficulty**: Temporal filtering, set intersection, conditional aggregation

### Task 4 β€” Hard: Monthly Revenue Ratio
- **Question**: Which month had the highest vs. lowest total revenue? What is the ratio between them?
- **Grading**: 0.33 for best month + 0.33 for worst month + 0.34 for ratio within Β±0.01
- **Expected difficulty**: Monthly resample/groupby, min/max comparison, ratio formatting

### Task 5 β€” Hard: Customer Loyalty Tier Revenue (cross-source)
- **Question**: Which customer loyalty tier generates the highest total revenue and what percentage does it represent?
- **Data**: Requires joining `df` with `customer_profiles` table from SQLite on `customer_id`
- **Grading**: 0.33 for tier name + 0.33 for revenue within Β±0.5% + 0.34 for percentage within Β±0.1
- **Expected difficulty**: SQLite query β†’ pandas merge β†’ groupby aggregation

### Task 6 β€” Hard: Supplier Profitability (cross-source)
- **Question**: Which supplier has the highest total profit? What is their average profit margin?
- **Data**: Requires joining `df` with `product_catalog` table from SQLite on `product_name`
- **Grading**: 0.33 for supplier name + 0.34 for total profit within Β±0.5% + 0.33 for avg margin within Β±0.1
- **Expected difficulty**: SQLite query β†’ pandas merge β†’ per-order profit/margin calculation β†’ group aggregation

## Reward Function

| Event | Reward |
|---|---|
| Successful code execution | +0.05 |
| Code execution error | -0.05 |
| Final answer (graded) | 0.0 β€” 1.0 based on task grader |
| Max steps (20) exceeded | 0.0 |

## Setup & Usage

### Prerequisites
- Python 3.13+
- [uv](https://docs.astral.sh/uv/) package manager

### Install
```bash
uv sync
```

### Run the server
```bash
uv run uvicorn server.app:app --host 0.0.0.0 --port 8000
```

### Run the inference
- First export all the required env variables mentioned in the .env.example. Then run below command
```bash
uv run python inference.py
```

### Run the baseline
```bash
OPENAI_API_KEY=sk-... uv run python baseline.py
# Against a deployed HF Space:
OPENAI_API_KEY=sk-... uv run python baseline.py --base-url https://<your-username>-<space-name>.hf.space
```

### Docker (local)
```bash
docker build -t data-analysis-env .
docker run -p 7860:7860 data-analysis-env
```


### Client usage (Python)
```python
from client import DataAnalysisClient
from models import DataAction

# Async
async with DataAnalysisClient(base_url="http://localhost:8000") as client:
    result = await client.reset(task_id=1)
    result = await client.step(DataAction(action_type="execute_code", code="print(df.head())"))
    result = await client.step(DataAction(action_type="submit_answer", answer="Electronics"))

# Sync
with DataAnalysisClient(base_url="http://localhost:8000").sync() as client:
    result = client.reset(task_id=2)
    result = client.step(DataAction(action_type="execute_code", code="print(df.groupby('city')['total_price'].sum())"))
```

## Project Structure

```
β”œβ”€β”€ models.py                  # DataAction, DataObservation, DataState
β”œβ”€β”€ client.py                  # DataAnalysisClient (EnvClient subclass)
β”œβ”€β”€ inference.py               # HF inference script (uses HF Inference API)
β”œβ”€β”€ baseline.py                # OpenAI baseline inference script
β”œβ”€β”€ helpers/
β”‚   └── response_parser.py     # Robust LLM JSON response parser
β”œβ”€β”€ tasks/
β”‚   β”œβ”€β”€ base_task.py           # Task ABC with grade() interface
β”‚   β”œβ”€β”€ task_easy.py           # Task 1 (Easy): Top revenue category
β”‚   β”œβ”€β”€ task_medium.py         # Task 2 (Medium): City revenue share
β”‚   β”œβ”€β”€ task_medium_2.py       # Task 4 (Hard): Monthly revenue ratio
β”‚   β”œβ”€β”€ task_hard.py           # Task 3 (Medium): Repeat customer cohort
β”‚   β”œβ”€β”€ task_hard_2.py         # Task 5 (Hard): Customer loyalty tier revenue
β”‚   └── task_hard_3.py         # Task 6 (Hard): Supplier profitability
β”œβ”€β”€ datasets/
β”‚   β”œβ”€β”€ sales.csv              # Synthetic e-commerce sales dataset
β”‚   └── store_data.db          # SQLite DB: customer_profiles, product_catalog
β”œβ”€β”€ server/
β”‚   β”œβ”€β”€ app.py                 # FastAPI app entry point
β”‚   └── data_analysis_env.py   # Environment implementation
β”œβ”€β”€ Dockerfile                 # HF Spaces Docker build (port 7860)
β”œβ”€β”€ openenv.yaml               # OpenEnv spec metadata
└── pyproject.toml             # Dependencies and project config
```