---
title: CodeDark Environment Server
emoji: 📊
colorFrom: yellow
colorTo: purple
sdk: docker
pinned: false
license: mit
tags:
  - openenv
  - reinforcement-learning
  - data-analytics
  - agents
  - benchmark
---

CodeDark: Data Analytics Environment for RL Agents

OpenEnv-compatible multi-turn environment for training AI agents on real business analytics tasks.

Overview

CodeDark is the first data analytics environment in the OpenEnv ecosystem. It challenges AI agents to analyze CSV datasets with Python and pandas, testing whether they can reason about data like an analyst rather than merely execute code.

Key Features

  • Real Business Tasks: Bank marketing and road safety datasets with genuine analytical questions
  • Multi-Turn Interaction: Agents explore data, save notes, ask clarifications, and submit answers
  • Shaped Rewards: 80% correctness + 10% efficiency + 10% token cost
  • Pre-Benchmarked: 25 curated L5-L6 difficulty tasks validated on 11+ models

Quick Start

Connect to the Environment

```python
from openenv import EnvClient

# Connect to this Space
env = EnvClient.from_hub("openenv/codedark")

# Reset for a new task
obs = env.reset()
print(f"Task: {obs['question']}")

# Execute Python code
obs = env.step({"tool": "run_python", "args": "<code>result = df.shape</code>"})
print(f"Result: {obs['stdout']}")

# Submit answer
obs = env.step({"tool": "submit_answer", "args": "<answer>42.5</answer>"})
print(f"Reward: {obs['reward']}")
```

Available Tools

| Tool | Description |
|------|-------------|
| `run_python` | Execute Python/pandas code. Store the result in the `result` variable. |
| `read_notes` | Read notes saved in previous turns. |
| `save_note` | Save observations for later recall. |
| `clarify` | Ask a clarifying question (max 2 per episode). |
| `submit_answer` | Submit the final answer. Ends the episode. |
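
The tool table can be turned into a small action builder. The sketch below is illustrative only: `ActionBuilder` is a hypothetical helper, not part of OpenEnv or CodeDark. The `<code>` and `<answer>` tag wrapping is copied from the Quick Start example; the argument format for `save_note`, `read_notes`, and `clarify` is not documented here and is assumed to be a plain string. The clarification limit is also enforced client-side as a convenience.

```python
from dataclasses import dataclass

MAX_CLARIFICATIONS = 2  # per-episode limit stated in the tool table


@dataclass
class ActionBuilder:
    """Hypothetical helper that builds action dicts for env.step()."""

    clarifications_used: int = 0

    def run_python(self, code: str) -> dict:
        # Tag wrapping mirrors the Quick Start example.
        return {"tool": "run_python", "args": f"<code>{code}</code>"}

    def submit_answer(self, answer) -> dict:
        # Tag wrapping mirrors the Quick Start example.
        return {"tool": "submit_answer", "args": f"<answer>{answer}</answer>"}

    def save_note(self, text: str) -> dict:
        # Assumption: notes are passed as a plain string.
        return {"tool": "save_note", "args": text}

    def read_notes(self) -> dict:
        # Assumption: reading notes takes no argument.
        return {"tool": "read_notes", "args": ""}

    def clarify(self, question: str) -> dict:
        # Refuse locally once the documented budget of 2 is spent.
        if self.clarifications_used >= MAX_CLARIFICATIONS:
            raise RuntimeError("clarification budget exhausted for this episode")
        self.clarifications_used += 1
        return {"tool": "clarify", "args": question}
```

Each method returns a dict that could be passed to `env.step(...)`; create a fresh `ActionBuilder` after every `env.reset()` so the clarification counter starts at zero.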

Datasets

Bank Marketing (750K rows)

  • Target: Term deposit subscription prediction
  • Features: age, job, marital, education, balance, housing, loan, contact, day, month, duration, campaign

Road Safety (500K rows)

  • Target: Accident risk assessment
  • Features: road_type, num_lanes, curvature, speed_limit, lighting, weather, time_of_day

Task Difficulty

| Level | Complexity | Example |
|-------|------------|---------|
| L4 | Quartile/binned | "Subscription rate in Q1 balance?" |
| L5 | Multi-condition | "Rate for month='may' AND job='management'?" |
| L6 | Nested extrema | "In lowest subscription month, avg day?" |
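
To make the three levels concrete, here is a sketch of how each example question might be answered in pandas. It runs on a tiny made-up DataFrame, not the real dataset. The target column name `subscribed` (encoded as 0/1) is an assumption, since this README lists the feature columns but not the target's name.

```python
import pandas as pd

# Toy stand-in for the bank marketing data; values are invented.
df = pd.DataFrame({
    "month": ["may", "may", "may", "jun", "jun", "jun", "jul", "jul"],
    "job": ["management", "management", "technician", "management",
            "services", "technician", "management", "services"],
    "balance": [100, 200, 300, 400, 500, 600, 700, 800],
    "day": [5, 15, 25, 10, 20, 30, 4, 8],
    "subscribed": [1, 0, 0, 1, 1, 0, 0, 0],  # hypothetical target column
})

# L4 (quartile/binned): subscription rate in the lowest balance quartile.
quartile = pd.qcut(df["balance"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
l4_rate = df.loc[quartile == "Q1", "subscribed"].mean()

# L5 (multi-condition): rate for month == 'may' AND job == 'management'.
mask = (df["month"] == "may") & (df["job"] == "management")
l5_rate = df.loc[mask, "subscribed"].mean()

# L6 (nested extrema): find the month with the lowest subscription rate,
# then average `day` within that month.
monthly_rate = df.groupby("month")["subscribed"].mean()
lowest_month = monthly_rate.idxmin()
l6_avg_day = df.loc[df["month"] == lowest_month, "day"].mean()

result = {"L4": l4_rate, "L5": l5_rate, "L6": l6_avg_day}
```

The L6 pattern is what makes these tasks multi-step: the inner aggregate (`idxmin` over monthly rates) has to be computed before the outer one can be expressed.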

Reward Structure

| Component | Weight | Description |
|-----------|--------|-------------|
| Correctness | 80% | Binary correct/incorrect with numeric tolerance |
| Efficiency | 10% | Fewer turns = better score |
| Token Cost | 10% | Lower token usage = better score |
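
The weighting can be sketched as a single function. Only the 80/10/10 split and the binary, tolerance-based correctness check come from the table above. The linear decay for turns and tokens, the caps `max_turns` and `max_tokens`, and the 1% relative tolerance are assumptions chosen for illustration, not the environment's actual scoring rules.

```python
import math


def shaped_reward(answer, expected, turns, tokens,
                  max_turns=10, max_tokens=20_000, rel_tol=0.01):
    """Illustrative 80/10/10 reward; the scaling of each term is assumed."""
    # Correctness: binary, with a numeric tolerance (tolerance value assumed).
    correct = 1.0 if math.isclose(answer, expected, rel_tol=rel_tol) else 0.0
    # Efficiency: assumed linear decay from 1.0 at one turn to 0.0 at max_turns.
    efficiency = max(0.0, 1.0 - (turns - 1) / (max_turns - 1))
    # Token cost: assumed linear decay from 1.0 at zero tokens to 0.0 at max_tokens.
    token_score = max(0.0, 1.0 - tokens / max_tokens)
    return 0.8 * correct + 0.1 * efficiency + 0.1 * token_score
```

Under these assumptions a wrong answer can earn at most 0.2, so correctness dominates while the shaping terms still separate two agents that both answer correctly.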

API Endpoints

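| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/reset` | POST | Reset for new episode |
| `/step` | POST | Execute action |
| `/state` | GET | Current state |
| `/metadata` | GET | Environment metadata |
| `/schema` | GET | Type schemas |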
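
For clients that talk to the server over plain HTTP, the following sketch builds requests for these endpoints with the standard library. The paths and methods come from the table. The base URL is a placeholder, and the assumption that POST bodies are JSON (with the action dict from Quick Start sent as the body of `/step`) is unverified; check `/schema` on a running server for the actual payload types.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # hypothetical; use your server's address

# Endpoint -> HTTP method, as listed in the table above.
ENDPOINTS = {
    "/health": "GET",
    "/reset": "POST",
    "/step": "POST",
    "/state": "GET",
    "/metadata": "GET",
    "/schema": "GET",
}


def build_request(path, payload=None, base_url=BASE_URL):
    """Build (but do not send) a request for one of the listed endpoints."""
    method = ENDPOINTS[path]  # raises KeyError for unknown paths
    data = None
    headers = {}
    if method == "POST":
        # Assumption: POST bodies are JSON-encoded.
        data = json.dumps(payload or {}).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(base_url + path, data=data, headers=headers, method=method)
```

To send a request, pass it to `urllib.request.urlopen`, for example `urlopen(build_request("/health"))`.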

Benchmark Results

Pre-benchmarked on 11+ models with 1,844 completions:

| Model | Accuracy | Avg Turns |
|-------|----------|-----------|
| Claude Opus 4.5 | 77.3% | 4.2 |
| Qwen3 Max | 46.7% | 5.1 |
| Mistral Large | 45.3% | 5.8 |
| Llama 4 Maverick | 38.7% | 6.2 |

License

MIT License

Author

Vijay Athithya