---
title: CodeDark Environment Server
emoji: 📊
colorFrom: yellow
colorTo: purple
sdk: docker
pinned: false
license: mit
tags:
  - openenv
  - reinforcement-learning
  - data-analytics
  - agents
  - benchmark
---

# CodeDark: Data Analytics Environment for RL Agents

**OpenEnv-compatible multi-turn environment for training AI agents on real business analytics tasks.**

## Overview

CodeDark is the first data analytics environment in the OpenEnv ecosystem. It challenges AI agents to analyze CSV datasets with Python/pandas, testing whether they can reason like data scientists rather than merely execute code.

### Key Features

- **Real Business Tasks**: Bank marketing and road safety datasets with genuine analytical questions
- **Multi-Turn Interaction**: Agents explore data, save notes, ask clarifications, and submit answers
- **Shaped Rewards**: 80% correctness + 10% efficiency + 10% token cost
- **Pre-Benchmarked**: 25 curated L5-L6 difficulty tasks validated on 11+ models

## Quick Start

### Connect to the Environment

```python
from openenv import EnvClient

# Connect to this Space
env = EnvClient.from_hub("openenv/codedark")

# Reset for a new task
obs = env.reset()
print(f"Task: {obs['question']}")

# Execute Python code
obs = env.step({"tool": "run_python", "args": "<code>result = df.shape</code>"})
print(f"Result: {obs['stdout']}")

# Submit answer
obs = env.step({"tool": "submit_answer", "args": "<answer>42.5</answer>"})
print(f"Reward: {obs['reward']}")
```

### Available Tools

| Tool            | Description                                                    |
| --------------- | -------------------------------------------------------------- |
| `run_python`    | Execute Python/pandas code. Store the final answer in the `result` variable. |
| `read_notes`    | Read saved notes from previous turns.                          |
| `save_note`     | Save observations for later recall.                            |
| `clarify`       | Ask clarifying questions (max 2 per episode).                  |
| `submit_answer` | Submit final answer. Ends episode.                             |
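A typical episode interleaves these tools: explore the data, jot down findings, then answer. The sketch below builds a hypothetical action sequence using the `{"tool": ..., "args": ...}` payload shape from the Quick Start example; the note text and pandas snippet are illustrative, and the authoritative payload types are served by the environment's `/schema` endpoint.

```python
# Sketch of a minimal explore -> note -> answer episode.
# Payload shape follows the Quick Start example; actual field
# types come from the environment's /schema endpoint.

def make_action(tool: str, args: str) -> dict:
    """Build a CodeDark action payload suitable for env.step()."""
    return {"tool": tool, "args": args}

episode = [
    # 1. Explore: check the class balance of the target column.
    make_action("run_python",
                "<code>result = df['y'].value_counts(normalize=True)</code>"),
    # 2. Persist a finding for later turns (content is illustrative).
    make_action("save_note", "checked target class balance"),
    # 3. Recall earlier notes before answering.
    make_action("read_notes", ""),
    # 4. Submit and end the episode (answer value is a placeholder).
    make_action("submit_answer", "<answer>0.5</answer>"),
]
```

Each element of `episode` would be passed to `env.step(...)` in turn; `submit_answer` ends the episode.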

## Datasets

### Bank Marketing (750K rows)

- **Target**: Term deposit subscription prediction
- **Features**: age, job, marital, education, balance, housing, loan, contact, day, month, duration, campaign

### Road Safety (500K rows)

- **Target**: Accident risk assessment
- **Features**: road_type, num_lanes, curvature, speed_limit, lighting, weather, time_of_day

## Task Difficulty

| Level | Complexity      | Example                                      |
| ----- | --------------- | -------------------------------------------- |
| L4    | Quartile/binned | "Subscription rate in Q1 balance?"           |
| L5    | Multi-condition | "Rate for month='may' AND job='management'?" |
| L6    | Nested extrema  | "In lowest subscription month, avg day?"     |
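To make the levels concrete, here is how the L5 example above ("Rate for month='may' AND job='management'?") might be answered in pandas. The frame is tiny and synthetic; the column names follow the Bank Marketing schema, and the target column name `y` is an assumption (the README describes the target only as term deposit subscription).

```python
import pandas as pd

# Tiny synthetic frame. Column names follow the Bank Marketing
# schema above; `y` (subscription flag) is an assumed target name.
df = pd.DataFrame({
    "month": ["may", "may", "may", "jun"],
    "job":   ["management", "management", "services", "management"],
    "y":     [1, 0, 1, 1],
})

# L5 multi-condition task: subscription rate where
# month == 'may' AND job == 'management'.
mask = (df["month"] == "may") & (df["job"] == "management")
result = df.loc[mask, "y"].mean()
print(result)  # 0.5: one of the two matching rows subscribed
```

An L6 task would add a nesting step, e.g. first finding the extremal month with `df.groupby("month")["y"].mean().idxmin()` and then aggregating within it.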

## Reward Structure

| Component   | Weight | Description                                     |
| ----------- | ------ | ----------------------------------------------- |
| Correctness | 80%    | Binary correct/incorrect with numeric tolerance |
| Efficiency  | 10%    | Fewer turns = better score                      |
| Token Cost  | 10%    | Lower token usage = better score                |
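The 80/10/10 weighting can be sketched as a single function. The linear normalization of the efficiency and token terms below is an assumption for illustration (the environment defines the actual scaling), as are the `max_turns` and `token_budget` values; only the weights come from the table above.

```python
def shaped_reward(correct: bool, turns: int, tokens: int,
                  max_turns: int = 10, token_budget: int = 20_000) -> float:
    """Sketch of the 80/10/10 shaped reward described above.

    Correctness is binary; the efficiency and token terms are
    assumed to decay linearly against a budget. The real
    normalization is defined by the environment, not here.
    """
    correctness = 1.0 if correct else 0.0
    efficiency = max(0.0, 1.0 - turns / max_turns)
    token_score = max(0.0, 1.0 - tokens / token_budget)
    return 0.8 * correctness + 0.1 * efficiency + 0.1 * token_score
```

Under this sketch a correct, instant, zero-token answer scores 1.0, while a wrong answer that exhausts both budgets scores 0.0.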

## API Endpoints

| Endpoint    | Method | Description           |
| ----------- | ------ | --------------------- |
| `/health`   | GET    | Health check          |
| `/reset`    | POST   | Reset for new episode |
| `/step`     | POST   | Execute action        |
| `/state`    | GET    | Current state         |
| `/metadata` | GET    | Environment metadata  |
| `/schema`   | GET    | Type schemas          |
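The endpoints can also be driven with plain HTTP instead of the client. The sketch below only builds the URL and JSON body for a `POST /step` call; the base URL is a hypothetical placeholder, and the body shape mirrors the Quick Start action format (check `/schema` for the authoritative types).

```python
import json

# Hypothetical base URL for the deployed Space; replace with the
# real one for your deployment.
BASE = "https://example-codedark.hf.space"

def step_request(tool: str, args: str) -> tuple[str, str]:
    """Build the URL and JSON body for a POST /step call."""
    body = json.dumps({"tool": tool, "args": args})
    return f"{BASE}/step", body

url, body = step_request("run_python", "<code>result = df.shape</code>")
# POST `body` to `url` with any HTTP client,
# e.g. requests.post(url, data=body, headers={"Content-Type": "application/json"})
```

`/reset` and `/step` are the only endpoints you need for a basic episode loop; `/health` is useful as a readiness probe before resetting.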

## Benchmark Results

Pre-benchmarked on 11+ models with 1,844 completions:

| Model            | Accuracy | Avg Turns |
| ---------------- | -------- | --------- |
| Claude Opus 4.5  | 77.3%    | 4.2       |
| Qwen3 Max        | 46.7%    | 5.1       |
| Mistral Large    | 45.3%    | 5.8       |
| Llama 4 Maverick | 38.7%    | 6.2       |

## Links

- **GitHub**: [vj-09/codeblue-env](https://github.com/vj-09/codeblue-env)
- **Leaderboard**: [analytics-rl.com](https://www.analytics-rl.com)
- **OpenEnv Spec**: [meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv)

## License

MIT License

## Author

**Vijay Athithya**

- GitHub: [@vj-09](https://github.com/vj-09)
- LinkedIn: [vijay-athithya](https://www.linkedin.com/in/vijay-athithya/)