---
title: CodeDark Environment Server
emoji: 📊
colorFrom: yellow
colorTo: purple
sdk: docker
pinned: false
license: mit
tags:
  - openenv
  - reinforcement-learning
  - data-analytics
  - agents
  - benchmark
---

# CodeDark: Data Analytics Environment for RL Agents

**OpenEnv-compatible multi-turn environment for training AI agents on real business analytics tasks.**

## Overview

CodeDark is the first data analytics environment in the OpenEnv ecosystem. It challenges AI agents to analyze CSV datasets using Python/pandas, testing their ability to work as data scientists rather than mere code executors.

### Key Features

- **Real Business Tasks**: Bank marketing and road safety datasets with genuine analytical questions
- **Multi-Turn Interaction**: Agents explore data, save notes, ask clarifying questions, and submit answers
- **Shaped Rewards**: 80% correctness + 10% efficiency + 10% token cost
- **Pre-Benchmarked**: 25 curated L5-L6 difficulty tasks validated on 11+ models

## Quick Start

### Connect to the Environment

```python
from openenv import EnvClient

# Connect to this Space
env = EnvClient.from_hub("openenv/codedark")

# Reset for a new task
obs = env.reset()
print(f"Task: {obs['question']}")

# Execute Python code
obs = env.step({"tool": "run_python", "args": "result = df.shape"})
print(f"Result: {obs['stdout']}")

# Submit the final answer
obs = env.step({"tool": "submit_answer", "args": "42.5"})
print(f"Reward: {obs['reward']}")
```

### Available Tools

| Tool            | Description                                                        |
| --------------- | ------------------------------------------------------------------ |
| `run_python`    | Execute Python/pandas code. Store the result in `result` variable. |
| `read_notes`    | Read notes saved in previous turns.                                |
| `save_note`     | Save observations for later recall.                                |
| `clarify`       | Ask clarifying questions (max 2 per episode).                      |
| `submit_answer` | Submit the final answer. Ends the episode.                         |
## Datasets

### Bank Marketing (750K rows)

- **Target**: Term deposit subscription prediction
- **Features**: age, job, marital, education, balance, housing, loan, contact, day, month, duration, campaign

### Road Safety (500K rows)

- **Target**: Accident risk assessment
- **Features**: road_type, num_lanes, curvature, speed_limit, lighting, weather, time_of_day

## Task Difficulty

| Level | Complexity      | Example                                      |
| ----- | --------------- | -------------------------------------------- |
| L4    | Quartile/binned | "Subscription rate in Q1 balance?"           |
| L5    | Multi-condition | "Rate for month='may' AND job='management'?" |
| L6    | Nested extrema  | "In lowest subscription month, avg day?"     |

## Reward Structure

| Component   | Weight | Description                                     |
| ----------- | ------ | ----------------------------------------------- |
| Correctness | 80%    | Binary correct/incorrect with numeric tolerance |
| Efficiency  | 10%    | Fewer turns = higher score                      |
| Token Cost  | 10%    | Lower token usage = higher score                |

## API Endpoints

| Endpoint    | Method | Description           |
| ----------- | ------ | --------------------- |
| `/health`   | GET    | Health check          |
| `/reset`    | POST   | Reset for new episode |
| `/step`     | POST   | Execute action        |
| `/state`    | GET    | Current state         |
| `/metadata` | GET    | Environment metadata  |
| `/schema`   | GET    | Type schemas          |

## Benchmark Results

Pre-benchmarked on 11+ models with 1,844 completions:

| Model            | Accuracy | Avg Turns |
| ---------------- | -------- | --------- |
| Claude Opus 4.5  | 77.3%    | 4.2       |
| Qwen3 Max        | 46.7%    | 5.1       |
| Mistral Large    | 45.3%    | 5.8       |
| Llama 4 Maverick | 38.7%    | 6.2       |

## Links

- **GitHub**: [vj-09/codeblue-env](https://github.com/vj-09/codeblue-env)
- **Leaderboard**: [analytics-rl.com](https://www.analytics-rl.com)
- **OpenEnv Spec**: [meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv)

## License

MIT License

## Author

**Vijay Athithya**

- GitHub: [@vj-09](https://github.com/vj-09)
- LinkedIn: [vijay-athithya](https://www.linkedin.com/in/vijay-athithya/)
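
## Appendix: Reward Sketch

The shaped reward described above (80% correctness, 10% efficiency, 10% token cost) can be sketched as a simple weighted sum. This is a minimal illustration, not the environment's actual implementation: the `max_turns` and `token_budget` normalization caps are assumed values for the sake of the example.

```python
def shaped_reward(correct: bool, turns: int, tokens: int,
                  max_turns: int = 10, token_budget: int = 20_000) -> float:
    """Illustrative shaped reward: 0.8 * correctness + 0.1 * efficiency + 0.1 * token cost.

    NOTE: max_turns and token_budget are assumed caps for this sketch;
    the environment's real normalization may differ.
    """
    correctness = 1.0 if correct else 0.0
    # Fewer turns -> higher efficiency score, clamped to [0, 1]
    efficiency = max(0.0, 1.0 - turns / max_turns)
    # Lower token usage -> higher token score, clamped to [0, 1]
    token_score = max(0.0, 1.0 - tokens / token_budget)
    return 0.8 * correctness + 0.1 * efficiency + 0.1 * token_score
```

Under this sketch, a correct answer found instantly scores 1.0, a correct answer using half the turn and token budgets scores 0.9, and an incorrect answer that exhausts both budgets scores 0.0.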