---
title: CodeDark Environment Server
emoji: ๐
colorFrom: yellow
colorTo: purple
sdk: docker
pinned: false
license: mit
tags:
  - openenv
  - reinforcement-learning
  - data-analytics
  - agents
  - benchmark
---

# CodeDark: Data Analytics Environment for RL Agents

**OpenEnv-compatible multi-turn environment for training AI agents on real business analytics tasks.**

## Overview

CodeDark is the first data analytics environment in the OpenEnv ecosystem. It challenges AI agents to analyze CSV datasets with Python and pandas, testing whether they can reason like data scientists rather than merely execute code.

### Key Features

- **Real Business Tasks**: Bank marketing and road safety datasets with genuine analytical questions
- **Multi-Turn Interaction**: Agents explore data, save notes, ask clarifying questions, and submit answers
- **Shaped Rewards**: 80% correctness + 10% efficiency + 10% token cost
- **Pre-Benchmarked**: 25 curated L5-L6 difficulty tasks validated on 11+ models

## Quick Start

### Connect to the Environment

```python
from openenv import EnvClient

# Connect to this Space
env = EnvClient.from_hub("openenv/codedark")

# Reset for a new task
obs = env.reset()
print(f"Task: {obs['question']}")

# Execute Python code
obs = env.step({"tool": "run_python", "args": "<code>result = df.shape</code>"})
print(f"Result: {obs['stdout']}")

# Submit answer
obs = env.step({"tool": "submit_answer", "args": "<answer>42.5</answer>"})
print(f"Reward: {obs['reward']}")
```

### Available Tools

| Tool            | Description                                                    |
| --------------- | -------------------------------------------------------------- |
| `run_python`    | Execute Python/pandas code. Store result in `result` variable. |
| `read_notes`    | Read saved notes from previous turns.                          |
| `save_note`     | Save observations for later recall.                            |
| `clarify`       | Ask clarifying questions (max 2 per episode).                  |
| `submit_answer` | Submit final answer. Ends episode.                             |
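
The Quick Start actions wrap their payloads in tags (`<code>` for `run_python`, `<answer>` for `submit_answer`). The sketch below shows one way to build such action dicts with small helpers. The helper names are hypothetical and not part of the OpenEnv client, and the tag format for the other tools is not documented here, so only a generic builder is provided for them.

```python
# Hypothetical helpers for building CodeDark actions.
# These are illustrative only and are not part of the OpenEnv client.

# Tool names taken from the Available Tools table.
TOOLS = {"run_python", "read_notes", "save_note", "clarify", "submit_answer"}


def make_action(tool, args=""):
    """Return an action dict in the shape used by the Quick Start example."""
    if tool not in TOOLS:
        raise ValueError(f"Unknown tool: {tool!r}")
    return {"tool": tool, "args": args}


def run_python(code):
    """Wrap Python source in <code> tags, as in the Quick Start example."""
    return make_action("run_python", f"<code>{code}</code>")


def submit_answer(value):
    """Wrap the final answer in <answer> tags, as in the Quick Start example."""
    return make_action("submit_answer", f"<answer>{value}</answer>")


if __name__ == "__main__":
    print(run_python("result = df.shape"))
    print(submit_answer(42.5))
```

Each returned dict can be passed to `env.step(...)` exactly as in the Quick Start. Rejecting unknown tool names locally avoids spending a turn on a malformed action.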

## Datasets

### Bank Marketing (750K rows)

- **Target**: Term deposit subscription prediction
- **Features**: age, job, marital, education, balance, housing, loan, contact, day, month, duration, campaign

### Road Safety (500K rows)

- **Target**: Accident risk assessment
- **Features**: road_type, num_lanes, curvature, speed_limit, lighting, weather, time_of_day

## Task Difficulty

| Level | Complexity      | Example                                      |
| ----- | --------------- | -------------------------------------------- |
| L4    | Quartile/binned | "Subscription rate in Q1 balance?"           |
| L5    | Multi-condition | "Rate for month='may' AND job='management'?" |
| L6    | Nested extrema  | "In lowest subscription month, avg day?"     |
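
To make the three levels concrete, here is a rough pandas sketch of how each example question might be answered. It runs on a tiny made-up DataFrame, not the real dataset. The target column name `y` (1 = subscribed) is an assumption, and the readings of "Q1 balance" as the lowest balance quartile and "avg day" as the mean of `day` are interpretations, not the environment's official definitions.

```python
import pandas as pd

# Toy stand-in for the bank marketing data; values are invented for illustration.
# The target column name "y" is an assumption.
df = pd.DataFrame({
    "month": ["may", "may", "may", "jun", "jun", "jul", "jul", "jul"],
    "job": ["management", "management", "technician", "management",
            "services", "technician", "management", "services"],
    "balance": [100, 2500, 300, 4000, 50, 1200, 800, 6000],
    "day": [5, 12, 20, 3, 28, 15, 9, 21],
    "y": [1, 0, 0, 1, 0, 0, 0, 0],
})

# L4 (quartile/binned): subscription rate within the lowest balance quartile.
quartile = pd.qcut(df["balance"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
l4_rate = df.loc[quartile == "Q1", "y"].mean()

# L5 (multi-condition): subscription rate for month == 'may' AND job == 'management'.
mask = (df["month"] == "may") & (df["job"] == "management")
l5_rate = df.loc[mask, "y"].mean()

# L6 (nested extrema): find the month with the lowest subscription rate,
# then compute the average day within that month.
rate_by_month = df.groupby("month")["y"].mean()
worst_month = rate_by_month.idxmin()
l6_avg_day = df.loc[df["month"] == worst_month, "day"].mean()

print(l4_rate, l5_rate, worst_month, l6_avg_day)
```

The L6 pattern is what makes the harder tasks multi-step: the inner aggregation selects a group, and the outer aggregation is computed only on that group.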

## Reward Structure

| Component   | Weight | Description                                     |
| ----------- | ------ | ----------------------------------------------- |
| Correctness | 80%    | Binary correct/incorrect with numeric tolerance |
| Efficiency  | 10%    | Fewer turns = better score                      |
| Token Cost  | 10%    | Lower token usage = better score                |
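
The weighting above can be sketched as a single function. Only the 80/10/10 weights and the binary, tolerance-based correctness come from the table. The linear decay for efficiency and token cost, the `max_turns` and `token_budget` defaults, and the 1% relative tolerance are assumptions chosen for illustration; the environment's actual scoring may differ.

```python
import math


def shaped_reward(predicted, expected, turns, tokens,
                  max_turns=10, token_budget=10_000, rel_tol=0.01):
    """Illustrative 80/10/10 shaped reward; sub-score formulas are assumed."""
    # Correctness: binary, with a numeric tolerance (tolerance value is assumed).
    correct = 1.0 if math.isclose(predicted, expected, rel_tol=rel_tol) else 0.0

    # Efficiency: fewer turns is better (assumed linear decay, 1.0 at one turn).
    efficiency = max(0.0, 1.0 - (turns - 1) / max_turns)

    # Token cost: lower usage is better (assumed linear decay over a budget).
    token_score = max(0.0, 1.0 - tokens / token_budget)

    return 0.8 * correct + 0.1 * efficiency + 0.1 * token_score


if __name__ == "__main__":
    # Correct answer after 6 turns using half of the assumed token budget.
    print(shaped_reward(42.5, 42.5, turns=6, tokens=5_000))
```

Under these assumptions a wrong answer can earn at most 0.2, so correctness dominates while efficiency and token cost separate otherwise equal runs.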

## API Endpoints

| Endpoint    | Method | Description           |
| ----------- | ------ | --------------------- |
| `/health`   | GET    | Health check          |
| `/reset`    | POST   | Reset for new episode |
| `/step`     | POST   | Execute action        |
| `/state`    | GET    | Current state         |
| `/metadata` | GET    | Environment metadata  |
| `/schema`   | GET    | Type schemas          |
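
For callers who prefer raw HTTP over the Python client, the endpoints can be reached with the standard library alone. The sketch below only relies on the paths and methods in the table. The base URL, the JSON request body for `/step` (mirroring the Quick Start action dict), and the assumption that responses are JSON are all hypothetical; check `/schema` on a running server for the real types.

```python
import json
import urllib.request

# Paths and methods taken from the endpoint table.
ENDPOINT_METHODS = {
    "/health": "GET",
    "/reset": "POST",
    "/step": "POST",
    "/state": "GET",
    "/metadata": "GET",
    "/schema": "GET",
}


def build_request(base_url, endpoint, payload=None):
    """Build (but do not send) a request for one of the listed endpoints."""
    method = ENDPOINT_METHODS[endpoint]
    data = None
    headers = {}
    if method == "POST":
        # Assumption: POST endpoints accept a JSON body.
        data = json.dumps(payload or {}).encode("utf-8")
        headers["Content-Type"] = "application/json"
    url = base_url.rstrip("/") + endpoint
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def call(base_url, endpoint, payload=None, timeout=30):
    """Send the request and decode the response, assuming it is JSON."""
    request = build_request(base_url, endpoint, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    BASE_URL = "http://localhost:8000"  # hypothetical address of a local server
    action = {"tool": "run_python", "args": "<code>result = df.shape</code>"}
    req = build_request(BASE_URL, "/step", action)
    print(req.get_method(), req.full_url)
```

Against a running server, `call(BASE_URL, "/health")` is a reasonable first check before calling `/reset` and `/step`.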

## Benchmark Results

Pre-benchmarked on 11+ models with 1,844 completions:

| Model            | Accuracy | Avg Turns |
| ---------------- | -------- | --------- |
| Claude Opus 4.5  | 77.3%    | 4.2       |
| Qwen3 Max        | 46.7%    | 5.1       |
| Mistral Large    | 45.3%    | 5.8       |
| Llama 4 Maverick | 38.7%    | 6.2       |

## Links

- **GitHub**: [vj-09/codeblue-env](https://github.com/vj-09/codeblue-env)
- **Leaderboard**: [analytics-rl.com](https://www.analytics-rl.com)
- **OpenEnv Spec**: [meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv)

## License

MIT License

## Author

**Vijay Athithya**

- GitHub: [@vj-09](https://github.com/vj-09)
- LinkedIn: [vijay-athithya](https://www.linkedin.com/in/vijay-athithya/)