---
title: CodeDark Environment Server
emoji: 📊
colorFrom: yellow
colorTo: purple
sdk: docker
pinned: false
license: mit
tags:
  - openenv
  - reinforcement-learning
  - data-analytics
  - agents
  - benchmark
---
# CodeDark: Data Analytics Environment for RL Agents

An OpenEnv-compatible, multi-turn environment for training AI agents on real business analytics tasks.
## Overview

CodeDark is the first data analytics environment in the OpenEnv ecosystem. It challenges AI agents to analyze CSV datasets using Python/pandas, testing their ability to act as data scientists rather than mere code executors.
## Key Features

- **Real Business Tasks**: Bank marketing and road safety datasets with genuine analytical questions
- **Multi-Turn Interaction**: Agents explore data, save notes, ask clarifications, and submit answers
- **Shaped Rewards**: 80% correctness + 10% efficiency + 10% token cost
- **Pre-Benchmarked**: 25 curated L5-L6 difficulty tasks validated on 11+ models
## Quick Start

### Connect to the Environment

```python
from openenv import EnvClient

# Connect to this Space
env = EnvClient.from_hub("openenv/codedark")

# Reset for a new task
obs = env.reset()
print(f"Task: {obs['question']}")

# Execute Python code
obs = env.step({"tool": "run_python", "args": "<code>result = df.shape</code>"})
print(f"Result: {obs['stdout']}")

# Submit answer
obs = env.step({"tool": "submit_answer", "args": "<answer>42.5</answer>"})
print(f"Reward: {obs['reward']}")
```
## Available Tools

| Tool | Description |
|---|---|
| `run_python` | Execute Python/pandas code. Store the answer in the `result` variable. |
| `read_notes` | Read saved notes from previous turns. |
| `save_note` | Save observations for later recall. |
| `clarify` | Ask clarifying questions (max 2 per episode). |
| `submit_answer` | Submit the final answer. Ends the episode. |
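A full episode chains these tools together. The sketch below is a hypothetical action sequence (the dict shape mirrors the Quick Start `step()` calls; the specific code strings and note text are made up for illustration):

```python
# Hypothetical multi-turn action sequence exercising the tools above.
# Each dict is one step() payload; run_python args are wrapped in
# <code> tags and the final answer in <answer> tags, as in Quick Start.
episode = [
    {"tool": "run_python", "args": "<code>result = df['month'].value_counts()</code>"},
    {"tool": "save_note", "args": "may has the most contacts"},
    {"tool": "read_notes", "args": ""},
    {"tool": "clarify", "args": "Should the rate be a fraction or a percentage?"},
    {"tool": "submit_answer", "args": "<answer>0.117</answer>"},
]
```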
## Datasets

### Bank Marketing (750K rows)

- Target: Term deposit subscription prediction
- Features: age, job, marital, education, balance, housing, loan, contact, day, month, duration, campaign

### Road Safety (500K rows)

- Target: Accident risk assessment
- Features: road_type, num_lanes, curvature, speed_limit, lighting, weather, time_of_day
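As a sketch of the kind of pandas analysis an agent runs via `run_python`, here is a subscription-rate computation on a tiny synthetic frame. The feature column names follow the list above, but the target column name (`subscribed`) is an assumption for illustration:

```python
import pandas as pd

# Tiny synthetic stand-in for the bank marketing dataset;
# "subscribed" as the target column name is an assumption.
df = pd.DataFrame({
    "month": ["may", "may", "jun", "may"],
    "job": ["management", "services", "management", "management"],
    "subscribed": [1, 0, 0, 1],
})

# Overall subscription rate (fraction of rows with subscribed == 1)
result = df["subscribed"].mean()
```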
## Task Difficulty
| Level | Complexity | Example |
|---|---|---|
| L4 | Quartile/binned | "Subscription rate in Q1 balance?" |
| L5 | Multi-condition | "Rate for month='may' AND job='management'?" |
| L6 | Nested extrema | "In lowest subscription month, avg day?" |
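An L6 "nested extrema" task requires a two-step query: find the extreme group first, then aggregate within it. A hypothetical pandas translation of the example above (column names assumed, as in the feature list):

```python
import pandas as pd

# Synthetic stand-in; "subscribed" as the target column is an assumption.
df = pd.DataFrame({
    "month": ["may", "may", "jun", "jun"],
    "subscribed": [1, 0, 0, 0],
    "day": [5, 15, 10, 20],
})

# Step 1: month with the lowest subscription rate
worst_month = df.groupby("month")["subscribed"].mean().idxmin()

# Step 2: average day within that month
result = df.loc[df["month"] == worst_month, "day"].mean()
```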
## Reward Structure
| Component | Weight | Description |
|---|---|---|
| Correctness | 80% | Binary correct/incorrect with numeric tolerance |
| Efficiency | 10% | Fewer turns = better score |
| Token Cost | 10% | Lower token usage = better score |
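The 80/10/10 weighting can be sketched as a single function. The environment's exact efficiency and token-cost normalization is not specified here, so linear scaling against a turn/token budget is an assumption:

```python
def shaped_reward(correct: bool, turns: int, tokens: int,
                  max_turns: int = 10, token_budget: int = 8000) -> float:
    """Weighted reward: 80% correctness + 10% efficiency + 10% token cost.

    The linear normalization against max_turns/token_budget is an
    illustrative assumption, not the environment's exact formula.
    """
    correctness = 1.0 if correct else 0.0
    efficiency = max(0.0, 1.0 - turns / max_turns)      # fewer turns = better
    token_score = max(0.0, 1.0 - tokens / token_budget)  # fewer tokens = better
    return 0.8 * correctness + 0.1 * efficiency + 0.1 * token_score
```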
## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/reset` | POST | Reset for a new episode |
| `/step` | POST | Execute an action |
| `/state` | GET | Current state |
| `/metadata` | GET | Environment metadata |
| `/schema` | GET | Type schemas |
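The endpoints can also be called over plain HTTP instead of through `EnvClient`. The sketch below only builds the JSON body for `POST /step`; the base URL is a placeholder, and mirroring the Quick Start action format as the wire schema is an assumption:

```python
import json

BASE_URL = "http://localhost:8000"  # placeholder for the Space URL

def step_body(tool: str, args: str) -> str:
    """Serialize a /step action body (assumed to match the Quick Start format)."""
    return json.dumps({"tool": tool, "args": args})

# POST {BASE_URL}/reset starts a new episode;
# POST {BASE_URL}/step with a body like this executes one action:
body = step_body("run_python", "<code>result = df.shape</code>")
```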
## Benchmark Results
Pre-benchmarked on 11+ models with 1,844 completions:
| Model | Accuracy | Avg Turns |
|---|---|---|
| Claude Opus 4.5 | 77.3% | 4.2 |
| Qwen3 Max | 46.7% | 5.1 |
| Mistral Large | 45.3% | 5.8 |
| Llama 4 Maverick | 38.7% | 6.2 |
## Links
- GitHub: vj-09/codeblue-env
- Leaderboard: analytics-rl.com
- OpenEnv Spec: meta-pytorch/OpenEnv
## License
MIT License
## Author
Vijay Athithya
- GitHub: @vj-09
- LinkedIn: vijay-athithya