---
title: CodeDark Environment Server
emoji: 📊
colorFrom: yellow
colorTo: purple
sdk: docker
pinned: false
license: mit
tags:
- openenv
- reinforcement-learning
- data-analytics
- agents
- benchmark
---
# CodeDark: Data Analytics Environment for RL Agents
**OpenEnv-compatible multi-turn environment for training AI agents on real business analytics tasks.**
## Overview
CodeDark is the first data analytics environment in the OpenEnv ecosystem. It challenges AI agents to analyze CSV datasets with Python/pandas, testing whether they can act as data scientists rather than mere code executors.
### Key Features
- **Real Business Tasks**: Bank marketing and road safety datasets with genuine analytical questions
- **Multi-Turn Interaction**: Agents explore data, save notes, ask clarifications, and submit answers
- **Shaped Rewards**: 80% correctness + 10% efficiency + 10% token cost
- **Pre-Benchmarked**: 25 curated L5-L6 difficulty tasks validated on 11+ models
## Quick Start
### Connect to the Environment
```python
from openenv import EnvClient

# Connect to this Space
env = EnvClient.from_hub("openenv/codedark")

# Reset for a new task
obs = env.reset()
print(f"Task: {obs['question']}")

# Execute Python code
obs = env.step({"tool": "run_python", "args": "result = df.shape"})
print(f"Result: {obs['stdout']}")

# Submit answer
obs = env.step({"tool": "submit_answer", "args": "42.5"})
print(f"Reward: {obs['reward']}")
```
### Available Tools
| Tool | Description |
| --------------- | -------------------------------------------------------------- |
| `run_python` | Execute Python/pandas code. Store result in `result` variable. |
| `read_notes` | Read saved notes from previous turns. |
| `save_note` | Save observations for later recall. |
| `clarify` | Ask clarifying questions (max 2 per episode). |
| `submit_answer` | Submit final answer. Ends episode. |
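A typical episode chains these tools: explore, save a note, then submit. The sketch below shows that control flow against a minimal stub class (`StubEnv`), which is **not** the real server — it only mimics the observation shape from the Quick Start so the loop can run standalone. For real runs, use the `EnvClient` instance instead; all values here are illustrative.

```python
class StubEnv:
    """Minimal stand-in mimicking the observation shape; not the real server."""

    def reset(self):
        return {"question": "What is the mean balance?", "done": False}

    def step(self, action):
        if action["tool"] == "submit_answer":
            return {"reward": 1.0, "done": True}
        return {"stdout": f"ok: {action['tool']}", "done": False}


env = StubEnv()
obs = env.reset()
print(obs["question"])

# 1. Explore the data with run_python
obs = env.step({"tool": "run_python", "args": "result = df['balance'].mean()"})

# 2. Persist a finding for later turns
env.step({"tool": "save_note", "args": "balance column has no missing values"})

# 3. Submit the final answer, which ends the episode
obs = env.step({"tool": "submit_answer", "args": "123.45"})
print(obs["reward"])
```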
## Datasets
### Bank Marketing (750K rows)
- **Target**: Term deposit subscription prediction
- **Features**: age, job, marital, education, balance, housing, loan, contact, day, month, duration, campaign
### Road Safety (500K rows)
- **Target**: Accident risk assessment
- **Features**: road_type, num_lanes, curvature, speed_limit, lighting, weather, time_of_day
## Task Difficulty
| Level | Complexity | Example |
| ----- | --------------- | -------------------------------------------- |
| L4 | Quartile/binned | "Subscription rate in Q1 balance?" |
| L5 | Multi-condition | "Rate for month='may' AND job='management'?" |
| L6 | Nested extrema | "In lowest subscription month, avg day?" |
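To make the L6 pattern concrete, here is a worked "nested extrema" query on a tiny synthetic frame. Column names follow the dataset description above; the target column name `subscribed` is an assumption for illustration (check the actual schema via the environment).

```python
import pandas as pd

# Tiny synthetic sample; real data has 750K rows
df = pd.DataFrame({
    "month":      ["may", "may", "jun", "jun", "jul", "jul"],
    "day":        [5,     15,    1,     9,     20,    28],
    "subscribed": [0,     0,     1,     0,     1,     1],
})

# Step 1: find the month with the lowest subscription rate
rates = df.groupby("month")["subscribed"].mean()
lowest_month = rates.idxmin()

# Step 2: average day within that month
result = df.loc[df["month"] == lowest_month, "day"].mean()
print(lowest_month, result)  # may 10.0
```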
## Reward Structure
| Component | Weight | Description |
| ----------- | ------ | ----------------------------------------------- |
| Correctness | 80% | Binary correct/incorrect with numeric tolerance |
| Efficiency | 10% | Fewer turns = better score |
| Token Cost | 10% | Lower token usage = better score |
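The weighting above can be sketched as a single function. Note this is a hedged reconstruction: the server's exact normalization for the efficiency and token terms is not documented here, so the linear decay over a turn/token budget below is an assumption purely for illustration.

```python
def shaped_reward(correct, turns, tokens, max_turns=10, token_budget=10_000):
    """Illustrative 80/10/10 shaped reward; budgets are assumed, not official."""
    correctness = 1.0 if correct else 0.0
    efficiency = max(0.0, 1.0 - turns / max_turns)      # fewer turns = better
    token_score = max(0.0, 1.0 - tokens / token_budget)  # fewer tokens = better
    return 0.8 * correctness + 0.1 * efficiency + 0.1 * token_score

print(shaped_reward(True, turns=4, tokens=2_000))  # 0.94
```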
## API Endpoints
| Endpoint | Method | Description |
| ----------- | ------ | --------------------- |
| `/health` | GET | Health check |
| `/reset` | POST | Reset for new episode |
| `/step` | POST | Execute action |
| `/state` | GET | Current state |
| `/metadata` | GET | Environment metadata |
| `/schema` | GET | Type schemas |
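The endpoints can also be hit directly over HTTP, without the OpenEnv client. The JSON body shape for `/step` is taken from the Quick Start; the Space URL is a placeholder you must replace, and `/schema` is the authoritative source for the request/response types. A minimal stdlib-only sketch:

```python
import json
import urllib.request


def request_json(base, path, payload=None):
    """GET when payload is None, otherwise POST JSON; returns parsed JSON."""
    data = None if payload is None else json.dumps(payload).encode()
    req = urllib.request.Request(
        base + path,
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Example usage (replace the base URL with this Space's actual URL):
# BASE = "https://<your-space>.hf.space"
# request_json(BASE, "/health")                                        # GET
# request_json(BASE, "/reset", {})                                     # POST
# request_json(BASE, "/step", {"tool": "submit_answer", "args": "42.5"})
```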
## Benchmark Results
Pre-benchmarked on 11+ models with 1,844 completions:
| Model | Accuracy | Avg Turns |
| ---------------- | -------- | --------- |
| Claude Opus 4.5 | 77.3% | 4.2 |
| Qwen3 Max | 46.7% | 5.1 |
| Mistral Large | 45.3% | 5.8 |
| Llama 4 Maverick | 38.7% | 6.2 |
## Links
- **GitHub**: [vj-09/codeblue-env](https://github.com/vj-09/codeblue-env)
- **Leaderboard**: [analytics-rl.com](https://www.analytics-rl.com)
- **OpenEnv Spec**: [meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv)
## License
MIT License
## Author
**Vijay Athithya**
- GitHub: [@vj-09](https://github.com/vj-09)
- LinkedIn: [vijay-athithya](https://www.linkedin.com/in/vijay-athithya/)