---
title: CodeDark Environment Server
emoji: 📊
colorFrom: yellow
colorTo: purple
sdk: docker
pinned: false
license: mit
tags:
  - openenv
  - reinforcement-learning
  - data-analytics
  - agents
  - benchmark
---

# CodeDark: Data Analytics Environment for RL Agents

**OpenEnv-compatible multi-turn environment for training AI agents on real business analytics tasks.**

## Overview

CodeDark is the first data analytics environment in the OpenEnv ecosystem. It challenges AI agents to analyze CSV datasets with Python/pandas, testing whether they can reason like data scientists rather than merely execute code.

### Key Features

- **Real Business Tasks**: Bank marketing and road safety datasets with genuine analytical questions
- **Multi-Turn Interaction**: Agents explore data, save notes, ask clarifications, and submit answers
- **Shaped Rewards**: 80% correctness + 10% efficiency + 10% token cost
- **Pre-Benchmarked**: 25 curated L5-L6 difficulty tasks validated on 11+ models

## Quick Start

### Connect to the Environment

```python
from openenv import EnvClient

# Connect to this Space
env = EnvClient.from_hub("openenv/codedark")

# Reset for a new task
obs = env.reset()
print(f"Task: {obs['question']}")

# Execute Python code
obs = env.step({"tool": "run_python", "args": "<code>result = df.shape</code>"})
print(f"Result: {obs['stdout']}")

# Submit answer
obs = env.step({"tool": "submit_answer", "args": "<answer>42.5</answer>"})
print(f"Reward: {obs['reward']}")
```

### Available Tools

| Tool            | Description                                                    |
| --------------- | -------------------------------------------------------------- |
| `run_python`    | Execute Python/pandas code. Store the final answer in the `result` variable. |
| `read_notes`    | Read saved notes from previous turns.                          |
| `save_note`     | Save observations for later recall.                            |
| `clarify`       | Ask clarifying questions (max 2 per episode).                  |
| `submit_answer` | Submit final answer. Ends episode.                             |
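A typical episode interleaves these tools: explore the data, jot down findings, then answer. The sketch below builds a hypothetical action sequence using the `{"tool": ..., "args": ...}` payload shape from the Quick Start example; the note text and pandas snippet are illustrative, and the authoritative payload types are served by the environment's `/schema` endpoint.

```python
# Sketch of a minimal explore -> note -> answer episode.
# Payload shape follows the Quick Start example; actual field
# types come from the environment's /schema endpoint.

def make_action(tool: str, args: str) -> dict:
    """Build a CodeDark action payload suitable for env.step()."""
    return {"tool": tool, "args": args}

episode = [
    # 1. Explore: check the class balance of the target column.
    make_action("run_python",
                "<code>result = df['y'].value_counts(normalize=True)</code>"),
    # 2. Persist a finding for later turns (content is illustrative).
    make_action("save_note", "checked target class balance"),
    # 3. Recall earlier notes before answering.
    make_action("read_notes", ""),
    # 4. Submit and end the episode (answer value is a placeholder).
    make_action("submit_answer", "<answer>0.5</answer>"),
]
```

Each element of `episode` would be passed to `env.step(...)` in turn; `submit_answer` ends the episode.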

## Datasets

### Bank Marketing (750K rows)

- **Target**: Term deposit subscription prediction
- **Features**: age, job, marital, education, balance, housing, loan, contact, day, month, duration, campaign

### Road Safety (500K rows)

- **Target**: Accident risk assessment
- **Features**: road_type, num_lanes, curvature, speed_limit, lighting, weather, time_of_day

## Task Difficulty

| Level | Complexity      | Example                                      |
| ----- | --------------- | -------------------------------------------- |
| L4    | Quartile/binned | "Subscription rate in Q1 balance?"           |
| L5    | Multi-condition | "Rate for month='may' AND job='management'?" |
| L6    | Nested extrema  | "In lowest subscription month, avg day?"     |
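To make the levels concrete, here is how the L5 example above ("Rate for month='may' AND job='management'?") might be answered in pandas. The frame is tiny and synthetic; the column names follow the Bank Marketing schema, and the target column name `y` is an assumption (the README describes the target only as term deposit subscription).

```python
import pandas as pd

# Tiny synthetic frame. Column names follow the Bank Marketing
# schema above; `y` (subscription flag) is an assumed target name.
df = pd.DataFrame({
    "month": ["may", "may", "may", "jun"],
    "job":   ["management", "management", "services", "management"],
    "y":     [1, 0, 1, 1],
})

# L5 multi-condition task: subscription rate where
# month == 'may' AND job == 'management'.
mask = (df["month"] == "may") & (df["job"] == "management")
result = df.loc[mask, "y"].mean()
print(result)  # 0.5: one of the two matching rows subscribed
```

An L6 task would add a nesting step, e.g. first finding the extremal month with `df.groupby("month")["y"].mean().idxmin()` and then aggregating within it.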

## Reward Structure

| Component   | Weight | Description                                     |
| ----------- | ------ | ----------------------------------------------- |
| Correctness | 80%    | Binary correct/incorrect with numeric tolerance |
| Efficiency  | 10%    | Fewer turns = better score                      |
| Token Cost  | 10%    | Lower token usage = better score                |
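The 80/10/10 weighting can be sketched as a single function. The linear normalization of the efficiency and token terms below is an assumption for illustration (the environment defines the actual scaling), as are the `max_turns` and `token_budget` values; only the weights come from the table above.

```python
def shaped_reward(correct: bool, turns: int, tokens: int,
                  max_turns: int = 10, token_budget: int = 20_000) -> float:
    """Sketch of the 80/10/10 shaped reward described above.

    Correctness is binary; the efficiency and token terms are
    assumed to decay linearly against a budget. The real
    normalization is defined by the environment, not here.
    """
    correctness = 1.0 if correct else 0.0
    efficiency = max(0.0, 1.0 - turns / max_turns)
    token_score = max(0.0, 1.0 - tokens / token_budget)
    return 0.8 * correctness + 0.1 * efficiency + 0.1 * token_score
```

Under this sketch a correct, instant, zero-token answer scores 1.0, while a wrong answer that exhausts both budgets scores 0.0.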

## API Endpoints

| Endpoint    | Method | Description           |
| ----------- | ------ | --------------------- |
| `/health`   | GET    | Health check          |
| `/reset`    | POST   | Reset for new episode |
| `/step`     | POST   | Execute action        |
| `/state`    | GET    | Current state         |
| `/metadata` | GET    | Environment metadata  |
| `/schema`   | GET    | Type schemas          |
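The endpoints can also be driven with plain HTTP instead of the client. The sketch below only builds the URL and JSON body for a `POST /step` call; the base URL is a hypothetical placeholder, and the body shape mirrors the Quick Start action format (check `/schema` for the authoritative types).

```python
import json

# Hypothetical base URL for the deployed Space; replace with the
# real one for your deployment.
BASE = "https://example-codedark.hf.space"

def step_request(tool: str, args: str) -> tuple[str, str]:
    """Build the URL and JSON body for a POST /step call."""
    body = json.dumps({"tool": tool, "args": args})
    return f"{BASE}/step", body

url, body = step_request("run_python", "<code>result = df.shape</code>")
# POST `body` to `url` with any HTTP client,
# e.g. requests.post(url, data=body, headers={"Content-Type": "application/json"})
```

`/reset` and `/step` are the only endpoints you need for a basic episode loop; `/health` is useful as a readiness probe before resetting.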

## Benchmark Results

Pre-benchmarked on 11+ models with 1,844 completions:

| Model            | Accuracy | Avg Turns |
| ---------------- | -------- | --------- |
| Claude Opus 4.5  | 77.3%    | 4.2       |
| Qwen3 Max        | 46.7%    | 5.1       |
| Mistral Large    | 45.3%    | 5.8       |
| Llama 4 Maverick | 38.7%    | 6.2       |

## Links

- **GitHub**: [vj-09/codeblue-env](https://github.com/vj-09/codeblue-env)
- **Leaderboard**: [analytics-rl.com](https://www.analytics-rl.com)
- **OpenEnv Spec**: [meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv)

## License

MIT License

## Author

**Vijay Athithya**

- GitHub: [@vj-09](https://github.com/vj-09)
- LinkedIn: [vijay-athithya](https://www.linkedin.com/in/vijay-athithya/)