---
title: DataDetective
emoji: 🔍
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
---

# DataDetective: Business Incident Investigation Environment

An [OpenEnv](https://github.com/meta-pytorch/OpenEnv) environment where AI
agents investigate real-world business incidents by querying a SQL database,
analysing patterns, and submitting root-cause findings.

## What It Does

The agent is given a realistic company database (TechMart, a mid-size B2B+B2C
electronics retailer) and a business problem to investigate. It can execute
SQL queries to explore the data, then submit a final written analysis. The
environment automatically grades the analysis based on whether key findings
were identified. Each task has 5 grading criteria worth 0.20 each, enabling
meaningful partial credit.

## Tasks (Easy → Hard)

| # | Task ID | Difficulty | Scenario |
|---|---------|-----------|----------|
| 1 | `orders_drop` | Easy | Order volume dropped sharply after promo ended |
| 2 | `returns_spike` | Medium | Product returns spiking in West region (defective SKU) |
| 3 | `supplier_quality` | Medium | Supplier-level quality crisis across multiple products |
| 4 | `shipping_delay` | Medium-Hard | Customer satisfaction crisis from carrier delays |
| 5 | `inventory_stockout` | Medium-Hard | Regional sales underperformance from warehouse stockout |
| 6 | `customer_churn` | Hard | Active customer decline across segments post price hike |
| 7 | `revenue_paradox` | Hard | Revenue up but profit down: multi-causal margin erosion |
| 8 | `fraud_detection` | Hard | Coordinated fraud ring with fake accounts |
| 9 | `repeat_purchase_decline` | Hard | Repeat purchase collapse masked by acquisition spend |

Each task is scored from 0.0 to 1.0 based on the specific findings the agent must discover.

## Action / Observation Spaces

### Action (`DataDetectiveAction`)

| Field | Type | Description |
|-------|------|-------------|
| `action_type` | `str` | `"query"` to run SQL, `"answer"` to submit findings |
| `content` | `str` | SQL query string or final analysis text |
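
The two action shapes can be sketched as follows. `ActionSketch` is a hypothetical stand-in for `DataDetectiveAction`, written only to illustrate the two fields in the table; the real class and its validation may differ.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical stand-in for DataDetectiveAction (assumption: the real class
# carries the same two string fields; its actual definition is not shown here).
@dataclass
class ActionSketch:
    action_type: str  # "query" to run SQL, "answer" to submit findings
    content: str      # SQL query string or final analysis text

    def __post_init__(self):
        # Reject anything other than the two documented action types.
        if self.action_type not in ("query", "answer"):
            raise ValueError(f"unknown action_type: {self.action_type!r}")

# Exploration step: run a SQL query against the orders table.
query = ActionSketch("query", "SELECT COUNT(*) FROM orders")

# Final step: submit the written analysis for grading.
answer = ActionSketch("answer", "Order volume fell after the promotion ended.")

# Serialize to JSON, e.g. for sending to the environment server.
print(json.dumps(asdict(query)))
```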

### Observation (`DataDetectiveObservation`)

| Field | Type | Description |
|-------|------|-------------|
| `output` | `str` | Query results (formatted table) or feedback |
| `task_description` | `str` | The investigation task |
| `schema_info` | `str` | Database schema (shown at reset) |
| `step_number` | `int` | Current step |
| `max_steps` | `int` | Maximum steps allowed (30) |
| `message` | `str` | Status message |

## Database Schema (11 Tables)

The TechMart database includes:

| Table | Description |
|-------|-------------|
| `customers` | Customer demographics (region, segment, signup date) |
| `products` | Product catalog (category, price, cost, supplier) |
| `orders` | Order history with totals |
| `order_items` | Line items with quantity and unit price |
| `returns` | Product returns with reasons and refund amounts |
| `promotions` | Promotional campaigns with discount percentages |
| `price_changes` | Historical price adjustments |
| `shipping` | Shipment records with carrier and delivery dates |
| `support_tickets` | Customer support tickets by category and priority |
| `inventory_log` | Daily stock levels per product per warehouse region |
| `marketing_spend` | Daily marketing spend by channel, campaign, and region |

All data is synthetic, generated in-memory (no external databases required).
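
As an illustration of the kind of exploratory query an agent might send, the sketch below builds a toy `orders` table in an in-memory SQLite database and aggregates it by month. The SQL engine, the column names (`order_date`, `total`), and the sample rows are all assumptions made for this example; check the `schema_info` returned at reset for the real schema.

```python
import sqlite3

# Assumption: a SQLite-compatible engine and these column names.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, order_date TEXT, total REAL)"
)

# Made-up sample rows, only to make the query runnable.
conn.executemany(
    "INSERT INTO orders (order_date, total) VALUES (?, ?)",
    [
        ("2024-01-05", 120.0),
        ("2024-01-19", 80.0),
        ("2024-01-27", 45.5),
        ("2024-02-03", 60.0),
    ],
)

# Monthly order count and revenue: a typical first look at a volume drop.
rows = conn.execute(
    """
    SELECT substr(order_date, 1, 7) AS month,
           COUNT(*) AS n_orders,
           SUM(total) AS revenue
    FROM orders
    GROUP BY month
    ORDER BY month
    """
).fetchall()

for month, n_orders, revenue in rows:
    print(month, n_orders, revenue)
```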

## Quick Start

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Start the Server

```bash
uvicorn server.app:app --host 0.0.0.0 --port 7860
```

### 3. Health Check

```bash
curl http://localhost:7860/health
```

### 4. Run the Baseline Agent

```bash
API_BASE_URL="https://router.huggingface.co/v1" \
MODEL_NAME="gpt-4.1-mini" \
HF_TOKEN="hf_..." \
python inference.py
```

### 5. Docker

```bash
docker build -t data-detective .
docker run -p 7860:7860 data-detective
```

## Environment Variables

| Env Var | Purpose | Required |
|---------|---------|----------|
| `API_BASE_URL` | LLM endpoint URL | Yes |
| `MODEL_NAME` | Model identifier | Yes |
| `HF_TOKEN` | API key / HF token | Yes |
| `ENV_URL` | Environment server URL | No (default: `http://localhost:7860`) |
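
A minimal sketch of how a client script might read these variables. `load_config` is a hypothetical helper, not necessarily how `inference.py` does it; only the variable names and the `ENV_URL` default come from the table above.

```python
import os

REQUIRED = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")
DEFAULT_ENV_URL = "http://localhost:7860"

def load_config(env=None):
    """Collect settings from a mapping (defaults to os.environ).

    Hypothetical helper: raises if a required variable is missing or empty,
    and falls back to the documented default for ENV_URL.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing required variables: " + ", ".join(missing))
    config = {name: env[name] for name in REQUIRED}
    config["ENV_URL"] = env.get("ENV_URL", DEFAULT_ENV_URL)
    return config

# Usage with an explicit mapping, so the example does not depend on the shell.
example = load_config({
    "API_BASE_URL": "https://router.huggingface.co/v1",
    "MODEL_NAME": "gpt-4.1-mini",
    "HF_TOKEN": "hf_...",
})
print(example["ENV_URL"])
```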

## How Grading Works

Each task has an automated grader that checks the agent's final answer for
specific key findings (keywords, patterns, named entities). Each task has 5
grading criteria worth 0.20 each, for a maximum score of 1.0. Partial credit
is awarded for each finding discovered.
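
The scoring arithmetic can be sketched as below. The criterion names and regular expressions are invented for illustration and are not the environment's actual graders; only the "5 criteria, 0.20 each" structure comes from the description above.

```python
import re

WEIGHT = 0.20  # each of the 5 criteria contributes 0.20

# Hypothetical criteria for an orders-drop style task (assumption: the real
# graders use their own keywords and patterns).
CRITERIA = {
    "promotion": r"\bpromo(?:tion)?s?\b",
    "order_drop": r"\b(?:drop|decline|fell|decrease)\w*",
    "timing": r"\b(?:ended|expired|after)\b",
    "discount": r"\bdiscount\w*",
    "volume": r"\border(?:s| volume)\b",
}

def grade(answer, criteria=CRITERIA):
    """Return the score: 0.20 for every criterion whose pattern matches."""
    hits = sum(
        1 for pattern in criteria.values()
        if re.search(pattern, answer, re.IGNORECASE)
    )
    # Round to avoid floating-point noise such as 0.6000000000000001.
    return round(hits * WEIGHT, 2)

partial = grade("Orders dropped after the promotion ended.")
full = grade("Orders dropped after the promotion ended and the discount expired.")
print(partial, full)
```

Keyword matching of this kind rewards an answer for naming each finding explicitly, so an analysis that states every cause in plain terms will score better than one that only implies it.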

## Setup Requirements

- Python 3.10+
- No GPU required
- Runs within 2 vCPU / 8 GB memory
- All data is generated in-memory (no external databases)