title: CodeDark Environment Server
emoji: 📊
colorFrom: yellow
colorTo: purple
sdk: docker
pinned: false
license: mit
tags:
  - openenv
  - reinforcement-learning
  - data-analytics
  - agents
  - benchmark

CodeDark: Data Analytics Environment for RL Agents

OpenEnv-compatible multi-turn environment for training AI agents on real business analytics tasks.

Overview

CodeDark is the first data analytics environment in the OpenEnv ecosystem. Agents analyze CSV datasets with Python and pandas, which tests whether they can reason like data scientists rather than merely execute code.

Key Features

Real Business Tasks: Bank marketing and road safety datasets with genuine analytical questions
Multi-Turn Interaction: Agents explore data, save notes, ask clarifications, and submit answers
Shaped Rewards: 80% correctness + 10% efficiency + 10% token cost
Pre-Benchmarked: 25 curated L5-L6 difficulty tasks validated on 11+ models

Quick Start

Connect to the Environment

from openenv import EnvClient

# Connect to this Space
env = EnvClient.from_hub("openenv/codedark")

# Reset for a new task
obs = env.reset()
print(f"Task: {obs['question']}")

# Execute Python code
obs = env.step({"tool": "run_python", "args": "<code>result = df.shape</code>"})
print(f"Result: {obs['stdout']}")

# Submit answer
obs = env.step({"tool": "submit_answer", "args": "<answer>42.5</answer>"})
print(f"Reward: {obs['reward']}")

Available Tools

Tool	Description
`run_python`	Execute Python/pandas code. Store the result in the `result` variable.
`read_notes`	Read saved notes from previous turns.
`save_note`	Save observations for later recall.
`clarify`	Ask clarifying questions (max 2 per episode).
`submit_answer`	Submit final answer. Ends episode.
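
The limits in the table (at most 2 `clarify` calls, and `submit_answer` ending the episode) can be enforced on the client side before an action is ever sent. The sketch below is one way to do that. The `ActionBuilder` class is hypothetical and not part of OpenEnv; the `<code>` and `<answer>` wrappers are copied from the Quick Start example, while the plain-text `args` for `clarify` is an assumption, since its format is not shown here.

```python
from dataclasses import dataclass

MAX_CLARIFICATIONS = 2  # from the table above: at most 2 clarify calls per episode


@dataclass
class ActionBuilder:
    """Hypothetical helper: builds action dicts and tracks per-episode limits."""

    clarifications_used: int = 0
    finished: bool = False

    def _check_open(self) -> None:
        if self.finished:
            raise RuntimeError("episode already ended by submit_answer")

    def run_python(self, code: str) -> dict:
        self._check_open()
        # <code> wrapper copied from the Quick Start example
        return {"tool": "run_python", "args": f"<code>{code}</code>"}

    def clarify(self, question: str) -> dict:
        self._check_open()
        if self.clarifications_used >= MAX_CLARIFICATIONS:
            raise RuntimeError("clarify budget exhausted for this episode")
        self.clarifications_used += 1
        # ASSUMPTION: plain-text args; the real format is not documented here
        return {"tool": "clarify", "args": question}

    def submit_answer(self, answer) -> dict:
        self._check_open()
        self.finished = True
        # <answer> wrapper copied from the Quick Start example
        return {"tool": "submit_answer", "args": f"<answer>{answer}</answer>"}
```

Each returned dict can be passed to `env.step(...)` as in the Quick Start. Create a fresh `ActionBuilder` after every `env.reset()` so the counters start from zero.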

Datasets

Bank Marketing (750K rows)

Target: Term deposit subscription prediction
Features: age, job, marital, education, balance, housing, loan, contact, day, month, duration, campaign

Road Safety (500K rows)

Target: Accident risk assessment
Features: road_type, num_lanes, curvature, speed_limit, lighting, weather, time_of_day

Task Difficulty

Level	Complexity	Example
L4	Quartile/binned	"Subscription rate in Q1 balance?"
L5	Multi-condition	"Rate for month='may' AND job='management'?"
L6	Nested extrema	"In lowest subscription month, avg day?"
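
To make the three levels concrete, here is a minimal pandas sketch of each query shape on a toy DataFrame. The column names `month`, `job`, `balance`, and `day` come from the Bank Marketing feature list; the target column name `y` (1 for a subscription, 0 otherwise) and all the data values are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the Bank Marketing data.
# ASSUMPTION: the target column is called "y" (1 = subscribed, 0 = not).
df = pd.DataFrame({
    "month":   ["may", "may", "may", "jun", "jun", "jul", "jul", "jul"],
    "job":     ["management", "management", "technician", "management",
                "technician", "technician", "management", "technician"],
    "balance": [100, 2500, 300, 4000, 50, 1200, 800, 60],
    "day":     [5, 12, 20, 3, 28, 15, 9, 21],
    "y":       [0, 1, 0, 1, 1, 0, 0, 0],
})

# L4 (quartile/binned): subscription rate in the lowest balance quartile
quartile = pd.qcut(df["balance"], 4, labels=["Q1", "Q2", "Q3", "Q4"])
l4_rate = df.loc[quartile == "Q1", "y"].mean()

# L5 (multi-condition): rate for month == "may" AND job == "management"
mask = (df["month"] == "may") & (df["job"] == "management")
l5_rate = df.loc[mask, "y"].mean()

# L6 (nested extrema): find the month with the lowest subscription rate,
# then average "day" within that month
lowest_month = df.groupby("month")["y"].mean().idxmin()
l6_avg_day = df.loc[df["month"] == lowest_month, "day"].mean()
```

The L6 case is the one that needs two dependent steps: the inner aggregate picks a group, and the outer aggregate is computed only on that group.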

Reward Structure

Component	Weight	Description
Correctness	80%	Binary correct/incorrect with numeric tolerance
Efficiency	10%	Fewer turns = better score
Token Cost	10%	Lower token usage = better score
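
As a rough illustration of how the three components might combine, the sketch below applies the 80/10/10 weights from the table. Only the weights and the idea of a numeric tolerance come from this README. The function name `shaped_reward`, the parameters `max_turns`, `token_budget`, and `rel_tol`, the linear decay of the two 10% terms, and the choice to award them even for a wrong answer are all assumptions.

```python
import math


def shaped_reward(predicted, expected, turns, tokens,
                  max_turns=10, token_budget=10_000, rel_tol=0.01):
    """Combine the three components with the 80/10/10 weights from the table.

    ASSUMPTIONS (not from the README): max_turns, token_budget, rel_tol and
    the linear decay of the efficiency and token-cost scores are illustrative.
    """
    # Binary correctness with a relative numeric tolerance
    correct = 1.0 if math.isclose(predicted, expected, rel_tol=rel_tol) else 0.0
    # Fewer turns score higher; a single-turn episode scores 1.0
    efficiency = max(0.0, 1.0 - (turns - 1) / max_turns)
    # Lower token usage scores higher; zero tokens scores 1.0
    token_score = max(0.0, 1.0 - tokens / token_budget)
    return 0.8 * correct + 0.1 * efficiency + 0.1 * token_score
```

Under these assumptions a correct answer always scores at least 0.8 and an incorrect one at most 0.2, so correctness dominates the signal while the two smaller terms break ties between correct episodes.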

API Endpoints

Endpoint	Method	Description
`/health`	GET	Health check
`/reset`	POST	Reset for new episode
`/step`	POST	Execute action
`/state`	GET	Current state
`/metadata`	GET	Environment metadata
`/schema`	GET	Type schemas
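
For clients that do not use the Python `EnvClient`, the endpoints can also be called over plain HTTP. The sketch below uses only the standard library. The paths and methods come from the table; the base URL is a placeholder, and JSON request and response bodies are an assumption, so check `/schema` on a running server for the actual payload shapes.

```python
import json
import urllib.request

# ASSUMPTION: placeholder base URL; substitute the address of a running server.
BASE_URL = "http://localhost:8000"

# Method per endpoint, copied from the table above.
ENDPOINTS = {
    "/health": "GET",
    "/reset": "POST",
    "/step": "POST",
    "/state": "GET",
    "/metadata": "GET",
    "/schema": "GET",
}


def build_request(path, payload=None, base_url=BASE_URL):
    """Return an unsent urllib Request for one of the documented endpoints."""
    method = ENDPOINTS[path]
    data = None
    headers = {}
    if method == "POST":
        # ASSUMPTION: POST bodies are JSON-encoded
        data = json.dumps(payload or {}).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(base_url + path, data=data,
                                  headers=headers, method=method)


def call(path, payload=None, base_url=BASE_URL):
    """Send the request and decode the response (requires a live server).

    ASSUMPTION: the server replies with a JSON body.
    """
    request = build_request(path, payload, base_url)
    with urllib.request.urlopen(request, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a server running, `call("/health")` is a quick connectivity check before starting an episode with `call("/reset")`.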

Benchmark Results

Pre-benchmarked on 11+ models with 1,844 completions:

Model	Accuracy	Avg Turns
Claude Opus 4.5	77.3%	4.2
Qwen3 Max	46.7%	5.1
Mistral Large	45.3%	5.8
Llama 4 Maverick	38.7%	6.2

License

MIT License

Author

Vijay Athithya

GitHub: @vj-09
LinkedIn: vijay-athithya