albert-einstein-09 commited on
Commit
95d976b
·
verified ·
1 Parent(s): fa1e87d

Upload folder using huggingface_hub

Browse files
.gitattributes CHANGED
@@ -1,35 +1,2 @@
1
- *.7z filter=lfs diff=lfs merge=lfs -text
2
- *.arrow filter=lfs diff=lfs merge=lfs -text
3
- *.bin filter=lfs diff=lfs merge=lfs -text
4
- *.bz2 filter=lfs diff=lfs merge=lfs -text
5
- *.ckpt filter=lfs diff=lfs merge=lfs -text
6
- *.ftz filter=lfs diff=lfs merge=lfs -text
7
- *.gz filter=lfs diff=lfs merge=lfs -text
8
- *.h5 filter=lfs diff=lfs merge=lfs -text
9
- *.joblib filter=lfs diff=lfs merge=lfs -text
10
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
- *.model filter=lfs diff=lfs merge=lfs -text
13
- *.msgpack filter=lfs diff=lfs merge=lfs -text
14
- *.npy filter=lfs diff=lfs merge=lfs -text
15
- *.npz filter=lfs diff=lfs merge=lfs -text
16
- *.onnx filter=lfs diff=lfs merge=lfs -text
17
- *.ot filter=lfs diff=lfs merge=lfs -text
18
- *.parquet filter=lfs diff=lfs merge=lfs -text
19
- *.pb filter=lfs diff=lfs merge=lfs -text
20
- *.pickle filter=lfs diff=lfs merge=lfs -text
21
- *.pkl filter=lfs diff=lfs merge=lfs -text
22
- *.pt filter=lfs diff=lfs merge=lfs -text
23
- *.pth filter=lfs diff=lfs merge=lfs -text
24
- *.rar filter=lfs diff=lfs merge=lfs -text
25
- *.safetensors filter=lfs diff=lfs merge=lfs -text
26
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
- *.tar.* filter=lfs diff=lfs merge=lfs -text
28
- *.tar filter=lfs diff=lfs merge=lfs -text
29
- *.tflite filter=lfs diff=lfs merge=lfs -text
30
- *.tgz filter=lfs diff=lfs merge=lfs -text
31
- *.wasm filter=lfs diff=lfs merge=lfs -text
32
- *.xz filter=lfs diff=lfs merge=lfs -text
33
- *.zip filter=lfs diff=lfs merge=lfs -text
34
- *.zst filter=lfs diff=lfs merge=lfs -text
35
- *tfevents* filter=lfs diff=lfs merge=lfs -text
 
1
+ *.csv filter=lfs diff=lfs merge=lfs -text
2
+ *.jsonl filter=lfs diff=lfs merge=lfs -text
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Dockerfile ADDED
@@ -0,0 +1,34 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ FROM python:3.11-slim
2
+
3
+ # Set working directory
4
+ WORKDIR /app
5
+
6
+ # Install system dependencies
7
+ RUN apt-get update && \
8
+ apt-get install -y --no-install-recommends curl && \
9
+ rm -rf /var/lib/apt/lists/*
10
+
11
+ # Copy requirements and install Python dependencies
12
+ COPY requirements.txt .
13
+ RUN pip install --no-cache-dir -r requirements.txt
14
+
15
+ # Copy application code
16
+ COPY codedark/ ./codedark/
17
+ COPY data/ ./data/
18
+
19
+ # Environment variables
20
+ ENV PYTHONUNBUFFERED=1
21
+ ENV CODEDARK_DATA_DIR=/app/data
22
+ ENV CODEDARK_TASKS_PATH=/app/data/tasks/final_25_tasks.jsonl
23
+ ENV HOST=0.0.0.0
24
+ ENV PORT=7860
25
+
26
+ # Expose HuggingFace Spaces port
27
+ EXPOSE 7860
28
+
29
+ # Healthcheck
30
+ HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
31
+ CMD curl -f http://localhost:7860/health || exit 1
32
+
33
+ # Run server
34
+ CMD ["python", "-m", "uvicorn", "codedark.server.app:app", "--host", "0.0.0.0", "--port", "7860"]
README.md CHANGED
@@ -1,10 +1,130 @@
1
  ---
2
- title: Codedark
3
- emoji: 🔥
4
- colorFrom: gray
5
- colorTo: green
6
  sdk: docker
7
  pinned: false
 
 
 
 
 
 
 
8
  ---
9
 
10
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  ---
2
+ title: CodeDark Environment Server
3
+ emoji: 📊
4
+ colorFrom: yellow
5
+ colorTo: purple
6
  sdk: docker
7
  pinned: false
8
+ license: mit
9
+ tags:
10
+ - openenv
11
+ - reinforcement-learning
12
+ - data-analytics
13
+ - agents
14
+ - benchmark
15
  ---
16
 
17
+ # CodeDark: Data Analytics Environment for RL Agents
18
+
19
+ **OpenEnv-compatible multi-turn environment for training AI agents on real business analytics tasks.**
20
+
21
+ ## Overview
22
+
23
+ CodeDark is the first data analytics environment in the OpenEnv ecosystem. It challenges AI agents to analyze CSV datasets using Python/Pandas, testing their ability to be data scientists rather than just code executors.
24
+
25
+ ### Key Features
26
+
27
+ - **Real Business Tasks**: Bank marketing and road safety datasets with genuine analytical questions
28
+ - **Multi-Turn Interaction**: Agents explore data, save notes, ask clarifications, and submit answers
29
+ - **Shaped Rewards**: 80% correctness + 10% efficiency + 10% token cost
30
+ - **Pre-Benchmarked**: 25 curated L5-L6 difficulty tasks validated on 11+ models
31
+
32
+ ## Quick Start
33
+
34
+ ### Connect to the Environment
35
+
36
+ ```python
37
+ from openenv import EnvClient
38
+
39
+ # Connect to this Space
40
+ env = EnvClient.from_hub("openenv/codedark")
41
+
42
+ # Reset for a new task
43
+ obs = env.reset()
44
+ print(f"Task: {obs['question']}")
45
+
46
+ # Execute Python code
47
+ obs = env.step({"tool": "run_python", "args": "<code>result = df.shape</code>"})
48
+ print(f"Result: {obs['stdout']}")
49
+
50
+ # Submit answer
51
+ obs = env.step({"tool": "submit_answer", "args": "<answer>42.5</answer>"})
52
+ print(f"Reward: {obs['reward']}")
53
+ ```
54
+
55
+ ### Available Tools
56
+
57
+ | Tool | Description |
58
+ | --------------- | -------------------------------------------------------------- |
59
+ | `run_python` | Execute Python/pandas code. Store result in `result` variable. |
60
+ | `read_notes` | Read saved notes from previous turns. |
61
+ | `save_note` | Save observations for later recall. |
62
+ | `clarify` | Ask clarifying questions (max 2 per episode). |
63
+ | `submit_answer` | Submit final answer. Ends episode. |
64
+
65
+ ## Datasets
66
+
67
+ ### Bank Marketing (750K rows)
68
+
69
+ - **Target**: Term deposit subscription prediction
70
+ - **Features**: age, job, marital, education, balance, housing, loan, contact, day, month, duration, campaign
71
+
72
+ ### Road Safety (500K rows)
73
+
74
+ - **Target**: Accident risk assessment
75
+ - **Features**: road_type, num_lanes, curvature, speed_limit, lighting, weather, time_of_day
76
+
77
+ ## Task Difficulty
78
+
79
+ | Level | Complexity | Example |
80
+ | ----- | --------------- | -------------------------------------------- |
81
+ | L4 | Quartile/binned | "Subscription rate in Q1 balance?" |
82
+ | L5 | Multi-condition | "Rate for month='may' AND job='management'?" |
83
+ | L6 | Nested extrema | "In lowest subscription month, avg day?" |
84
+
85
+ ## Reward Structure
86
+
87
+ | Component | Weight | Description |
88
+ | ----------- | ------ | ----------------------------------------------- |
89
+ | Correctness | 80% | Binary correct/incorrect with numeric tolerance |
90
+ | Efficiency | 10% | Fewer turns = better score |
91
+ | Token Cost | 10% | Lower token usage = better score |
92
+
93
+ ## API Endpoints
94
+
95
+ | Endpoint | Method | Description |
96
+ | ----------- | ------ | --------------------- |
97
+ | `/health` | GET | Health check |
98
+ | `/reset` | POST | Reset for new episode |
99
+ | `/step` | POST | Execute action |
100
+ | `/state` | GET | Current state |
101
+ | `/metadata` | GET | Environment metadata |
102
+ | `/schema` | GET | Type schemas |
103
+
104
+ ## Benchmark Results
105
+
106
+ Pre-benchmarked on 11+ models with 1,844 completions:
107
+
108
+ | Model | Accuracy | Avg Turns |
109
+ | ---------------- | -------- | --------- |
110
+ | Claude Opus 4.5 | 77.3% | 4.2 |
111
+ | Qwen3 Max | 46.7% | 5.1 |
112
+ | Mistral Large | 45.3% | 5.8 |
113
+ | Llama 4 Maverick | 38.7% | 6.2 |
114
+
115
+ ## Links
116
+
117
+ - **GitHub**: [vj-09/codeblue-env](https://github.com/vj-09/codeblue-env)
118
+ - **Leaderboard**: [analytics-rl.com](https://www.analytics-rl.com)
119
+ - **OpenEnv Spec**: [meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv)
120
+
121
+ ## License
122
+
123
+ MIT License
124
+
125
+ ## Author
126
+
127
+ **Vijay Athithya**
128
+
129
+ - GitHub: [@vj-09](https://github.com/vj-09)
130
+ - LinkedIn: [vijay-athithya](https://www.linkedin.com/in/vijay-athithya/)
codedark/.pytest_cache/.gitignore ADDED
@@ -0,0 +1,2 @@
 
 
 
1
+ # Created by pytest automatically.
2
+ *
codedark/.pytest_cache/CACHEDIR.TAG ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ Signature: 8a477f597d28d172789f06886806bc55
2
+ # This file is a cache directory tag created by pytest.
3
+ # For information about cache directory tags, see:
4
+ # https://bford.info/cachedir/spec.html
codedark/.pytest_cache/README.md ADDED
@@ -0,0 +1,8 @@
 
 
 
 
 
 
 
 
 
1
+ # pytest cache directory #
2
+
3
+ This directory contains data from the pytest's cache plugin,
4
+ which provides the `--lf` and `--ff` options, as well as the `cache` fixture.
5
+
6
+ **Do not** commit this to version control.
7
+
8
+ See [the docs](https://docs.pytest.org/en/stable/how-to/cache.html) for more information.
codedark/.pytest_cache/v/cache/lastfailed ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ {
2
+ "tests/test_environment.py::TestModels": true,
3
+ "tests/test_environment.py::TestTools": true,
4
+ "tests/test_environment.py::TestScoring": true,
5
+ "tests/test_environment.py::TestEnvironment": true
6
+ }
codedark/.pytest_cache/v/cache/nodeids ADDED
@@ -0,0 +1,25 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [
2
+ "tests/test_environment.py::TestEnvironment::test_reset_loads_task",
3
+ "tests/test_environment.py::TestEnvironment::test_reset_specific_task",
4
+ "tests/test_environment.py::TestEnvironment::test_step_read_notes",
5
+ "tests/test_environment.py::TestEnvironment::test_step_run_python",
6
+ "tests/test_environment.py::TestEnvironment::test_step_save_note",
7
+ "tests/test_environment.py::TestEnvironment::test_step_submit_answer",
8
+ "tests/test_environment.py::TestEnvironment::test_turn_counting",
9
+ "tests/test_environment.py::TestModels::test_action_creation",
10
+ "tests/test_environment.py::TestModels::test_observation_defaults",
11
+ "tests/test_environment.py::TestModels::test_state_defaults",
12
+ "tests/test_environment.py::TestScoring::test_compute_reward_correct",
13
+ "tests/test_environment.py::TestScoring::test_correctness_exact_match",
14
+ "tests/test_environment.py::TestScoring::test_correctness_scale_error",
15
+ "tests/test_environment.py::TestScoring::test_correctness_within_tolerance",
16
+ "tests/test_environment.py::TestScoring::test_correctness_wrong",
17
+ "tests/test_environment.py::TestScoring::test_efficiency_correct_answer",
18
+ "tests/test_environment.py::TestScoring::test_efficiency_incorrect_answer",
19
+ "tests/test_environment.py::TestTools::test_parse_run_python",
20
+ "tests/test_environment.py::TestTools::test_parse_run_python_missing_tag",
21
+ "tests/test_environment.py::TestTools::test_parse_submit_answer",
22
+ "tests/test_environment.py::TestTools::test_read_notes_empty",
23
+ "tests/test_environment.py::TestTools::test_read_notes_with_content",
24
+ "tests/test_environment.py::TestTools::test_save_note"
25
+ ]
codedark/README.md ADDED
@@ -0,0 +1,174 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # CodeDark
2
+
3
+ **OpenEnv-compatible multi-turn data analytics environment for RL agent training.**
4
+
5
+ Train AI agents to be data scientists, not just code executors. CodeDark features real business analytics tasks with pandas/numpy, multi-metric reward shaping, and skill-based curriculum.
6
+
7
+ ## Quick Start
8
+
9
+ ### Server
10
+
11
+ ```bash
12
+ # Install
13
+ pip install -e .
14
+
15
+ # Run server
16
+ python -m codedark.server.app
17
+ # Server runs at http://localhost:8000
18
+ ```
19
+
20
+ ### Client
21
+
22
+ ```python
23
+ from codedark import CodeDarkEnv
24
+
25
+ env = CodeDarkEnv("http://localhost:8000")
26
+
27
+ # Reset for new episode
28
+ obs = env.reset()
29
+ print(f"Task: {obs['question']}")
30
+
31
+ # Execute Python code
32
+ obs = env.run_python("result = df.shape")
33
+ print(f"Shape: {obs['stdout']}")
34
+
35
+ # Explore the data
36
+ obs = env.run_python("result = df.columns.tolist()")
37
+ print(f"Columns: {obs['stdout']}")
38
+
39
+ # Calculate and submit answer
40
+ obs = env.run_python("result = df['y'].mean() * 100")
41
+ obs = env.submit_answer(11.26)
42
+ print(f"Reward: {obs['reward']}")
43
+ ```
44
+
45
+ ### Docker
46
+
47
+ ```bash
48
+ # Build
49
+ docker build -t codedark:latest -f server/Dockerfile .
50
+
51
+ # Run
52
+ docker run -p 8000:8000 codedark:latest
53
+ ```
54
+
55
+ ## Tools
56
+
57
+ Agents have access to 5 tools:
58
+
59
+ | Tool | Description |
60
+ | --------------- | --------------------------------------------------------------- |
61
+ | `run_python` | Execute Python/pandas code. Store output in `result` variable. |
62
+ | `read_notes` | Read all saved notes from previous turns. |
63
+ | `save_note` | Save observations for later recall. Notes persist across turns. |
64
+ | `clarify` | Ask clarifying questions about the task (max 2 per episode). |
65
+ | `submit_answer` | Submit final answer. Ends episode. |
66
+
67
+ ## Reward Structure
68
+
69
+ Total reward is computed from three components (max 1.0):
70
+
71
+ | Component | Weight | Description |
72
+ | ----------- | ------ | ----------------------------------------------- |
73
+ | Correctness | 80% | Binary correct/incorrect with numeric tolerance |
74
+ | Efficiency | 10% | Fewer turns = better score |
75
+ | Token Cost | 10% | Lower token usage = better score |
76
+
77
+ ## Datasets
78
+
79
+ ### Bank Marketing
80
+
81
+ - **Records**: 750,000 customers
82
+ - **Target**: Term deposit subscription (y = 0/1)
83
+ - **Features**: age, job, marital, education, balance, housing, loan, contact, day, month, duration, campaign, pdays, previous, poutcome
84
+
85
+ ### Road Safety
86
+
87
+ - **Records**: 500,000 road segments
88
+ - **Target**: Accident risk (continuous)
89
+ - **Features**: road_type, num_lanes, curvature, speed_limit, lighting, weather, road_signs_present, time_of_day, num_reported_accidents
90
+
91
+ ## Task Difficulty
92
+
93
+ | Level | Complexity | Example |
94
+ | ----- | --------------- | -------------------------------------------- |
95
+ | L4 | Quartile/binned | "Subscription rate in Q1 balance?" |
96
+ | L5 | Multi-condition | "Rate for month='may' AND job='management'?" |
97
+ | L6 | Nested extrema | "In lowest subscription month, avg day?" |
98
+
99
+ ## API Endpoints
100
+
101
+ | Endpoint | Method | Description |
102
+ | ----------- | ------ | --------------------- |
103
+ | `/health` | GET | Health check |
104
+ | `/reset` | POST | Reset for new episode |
105
+ | `/step` | POST | Execute action |
106
+ | `/state` | GET | Current state |
107
+ | `/metadata` | GET | Environment metadata |
108
+ | `/schema` | GET | Type schemas |
109
+
110
+ ## Benchmark Results
111
+
112
+ Pre-benchmarked on 11+ models with 1,844 completions:
113
+
114
+ | Model | Accuracy | Cost/Task |
115
+ | ---------------- | -------- | --------- |
116
+ | Claude Opus 4.5 | 77.3% | $0.89 |
117
+ | Qwen3 Max | 46.7% | $0.12 |
118
+ | Mistral Large | 45.3% | $0.18 |
119
+ | Llama 4 Maverick | 38.7% | $0.08 |
120
+
121
+ ## Environment Variables
122
+
123
+ | Variable | Default | Description |
124
+ | --------------------- | --------------------------------- | ------------------------- |
125
+ | `CODEDARK_DATA_DIR` | `data/` | Path to CSV files |
126
+ | `CODEDARK_TASKS_PATH` | `data/tasks/final_25_tasks.jsonl` | Path to tasks file |
127
+ | `CODEDARK_MAX_TURNS` | `10` | Maximum turns per episode |
128
+ | `HOST` | `0.0.0.0` | Server host |
129
+ | `PORT` | `8000` | Server port |
130
+
131
+ ## Project Structure
132
+
133
+ ```
134
+ codedark/
135
+ ├── __init__.py # Package exports
136
+ ├── models.py # Action, Observation, State dataclasses
137
+ ├── client.py # HTTP client
138
+ ├── openenv.yaml # OpenEnv manifest
139
+ ├── pyproject.toml # Package config
140
+ ├── server/
141
+ │ ├── app.py # FastAPI application
142
+ │ ├── environment.py # Core environment logic
143
+ │ ├── tools.py # Tool implementations
144
+ │ ├── scoring.py # Reward computation
145
+ │ ├── Dockerfile # Container spec
146
+ │ └── requirements.txt # Dependencies
147
+ ├── data/
148
+ │ ├── bank.csv # Bank marketing dataset
149
+ │ ├── road.csv # Road safety dataset
150
+ │ └── tasks/
151
+ │ └── final_25_tasks.jsonl
152
+ └── tests/
153
+ ```
154
+
155
+ ## OpenEnv Compatibility
156
+
157
+ CodeDark follows the [OpenEnv specification](https://huggingface.co/openenv):
158
+
159
+ - Gymnasium-style `reset()` / `step()` API
160
+ - Pydantic models for Action, Observation, State
161
+ - FastAPI server with standard endpoints
162
+ - Docker containerization for isolated execution
163
+ - HTTP + WebSocket transport
164
+
165
+ ## License
166
+
167
+ MIT
168
+
169
+ ## Author
170
+
171
+ Vijay Athithya
172
+
173
+ - GitHub: [vj-09](https://github.com/vj-09)
174
+ - LinkedIn: [vijay-athithya](https://www.linkedin.com/in/vijay-athithya/)
codedark/__init__.py ADDED
@@ -0,0 +1,38 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ CodeDark - OpenEnv-Compatible Data Analytics Environment
3
+
4
+ Multi-turn RL environment for training AI agents on data analytics tasks.
5
+
6
+ Example usage:
7
+ from codedark import CodeDarkEnv
8
+
9
+ env = CodeDarkEnv("http://localhost:8000")
10
+ obs = env.reset()
11
+ print(f"Task: {obs['question']}")
12
+
13
+ obs = env.run_python("result = df['y'].mean() * 100")
14
+ obs = env.submit_answer(11.26)
15
+ print(f"Reward: {obs['reward']}")
16
+ """
17
+
18
+ from .client import CodeDarkEnv
19
+ from .models import (
20
+ CodeDarkAction,
21
+ CodeDarkObservation,
22
+ CodeDarkState,
23
+ ResetRequest,
24
+ StepRequest,
25
+ HealthResponse,
26
+ )
27
+
28
+ __all__ = [
29
+ "CodeDarkEnv",
30
+ "CodeDarkAction",
31
+ "CodeDarkObservation",
32
+ "CodeDarkState",
33
+ "ResetRequest",
34
+ "StepRequest",
35
+ "HealthResponse",
36
+ ]
37
+
38
+ __version__ = "0.1.0"
codedark/client.py ADDED
@@ -0,0 +1,197 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ CodeDark Client
3
+
4
+ HTTP client for interacting with CodeDark environment server.
5
+ Follows OpenEnv EnvClient pattern.
6
+ """
7
+
8
+ from typing import Any, Dict, Optional
9
+ import requests
10
+
11
+
12
+ class CodeDarkEnv:
13
+ """Client for CodeDark environment.
14
+
15
+ Example usage:
16
+ env = CodeDarkEnv("http://localhost:8000")
17
+ obs = env.reset()
18
+ print(f"Task: {obs['question']}")
19
+
20
+ obs = env.step("run_python", "<code>result = df.shape</code>")
21
+ print(f"Result: {obs['stdout']}")
22
+
23
+ obs = env.step("submit_answer", "<answer>11.26</answer>")
24
+ print(f"Reward: {obs['reward']}")
25
+ """
26
+
27
+ def __init__(self, base_url: str = "http://localhost:8000", timeout: int = 30):
28
+ """Initialize client.
29
+
30
+ Args:
31
+ base_url: Server URL
32
+ timeout: Request timeout in seconds
33
+ """
34
+ self.base_url = base_url.rstrip("/")
35
+ self.timeout = timeout
36
+ self._session = requests.Session()
37
+
38
+ def reset(
39
+ self, task_id: Optional[str] = None, seed: Optional[int] = None
40
+ ) -> Dict[str, Any]:
41
+ """Reset environment for a new episode.
42
+
43
+ Args:
44
+ task_id: Specific task to load (optional)
45
+ seed: Random seed for task selection (optional)
46
+
47
+ Returns:
48
+ Initial observation dict
49
+ """
50
+ payload = {}
51
+ if task_id is not None:
52
+ payload["task_id"] = task_id
53
+ if seed is not None:
54
+ payload["seed"] = seed
55
+
56
+ response = self._session.post(
57
+ f"{self.base_url}/reset",
58
+ json=payload if payload else None,
59
+ timeout=self.timeout,
60
+ )
61
+ response.raise_for_status()
62
+ return response.json()
63
+
64
+ def step(self, tool: str, args: str = "") -> Dict[str, Any]:
65
+ """Execute an action.
66
+
67
+ Args:
68
+ tool: Tool name (run_python, read_notes, save_note, clarify, submit_answer)
69
+ args: Tool-specific arguments
70
+
71
+ Returns:
72
+ Observation dict
73
+ """
74
+ response = self._session.post(
75
+ f"{self.base_url}/step",
76
+ json={"tool": tool, "args": args},
77
+ timeout=self.timeout,
78
+ )
79
+ response.raise_for_status()
80
+ return response.json()
81
+
82
+ def state(self) -> Dict[str, Any]:
83
+ """Get current environment state.
84
+
85
+ Returns:
86
+ State dict
87
+ """
88
+ response = self._session.get(
89
+ f"{self.base_url}/state",
90
+ timeout=self.timeout,
91
+ )
92
+ response.raise_for_status()
93
+ return response.json()
94
+
95
+ def health(self) -> Dict[str, Any]:
96
+ """Check server health.
97
+
98
+ Returns:
99
+ Health status dict
100
+ """
101
+ response = self._session.get(
102
+ f"{self.base_url}/health",
103
+ timeout=self.timeout,
104
+ )
105
+ response.raise_for_status()
106
+ return response.json()
107
+
108
+ def metadata(self) -> Dict[str, Any]:
109
+ """Get environment metadata.
110
+
111
+ Returns:
112
+ Metadata dict
113
+ """
114
+ response = self._session.get(
115
+ f"{self.base_url}/metadata",
116
+ timeout=self.timeout,
117
+ )
118
+ response.raise_for_status()
119
+ return response.json()
120
+
121
+ def schema(self) -> Dict[str, Any]:
122
+ """Get environment type schemas.
123
+
124
+ Returns:
125
+ Schema dict for action, observation, state
126
+ """
127
+ response = self._session.get(
128
+ f"{self.base_url}/schema",
129
+ timeout=self.timeout,
130
+ )
131
+ response.raise_for_status()
132
+ return response.json()
133
+
134
+ # Convenience methods for common tools
135
+
136
+ def run_python(self, code: str) -> Dict[str, Any]:
137
+ """Execute Python code.
138
+
139
+ Args:
140
+ code: Python code to execute
141
+
142
+ Returns:
143
+ Observation dict
144
+ """
145
+ return self.step("run_python", f"<code>{code}</code>")
146
+
147
+ def read_notes(self) -> Dict[str, Any]:
148
+ """Read all saved notes.
149
+
150
+ Returns:
151
+ Observation dict
152
+ """
153
+ return self.step("read_notes", "")
154
+
155
+ def save_note(self, content: str) -> Dict[str, Any]:
156
+ """Save a note.
157
+
158
+ Args:
159
+ content: Note content
160
+
161
+ Returns:
162
+ Observation dict
163
+ """
164
+ return self.step("save_note", content)
165
+
166
+ def clarify(self, question: str) -> Dict[str, Any]:
167
+ """Ask a clarifying question.
168
+
169
+ Args:
170
+ question: Clarifying question
171
+
172
+ Returns:
173
+ Observation dict
174
+ """
175
+ return self.step("clarify", f"<question>{question}</question>")
176
+
177
+ def submit_answer(self, answer: Any) -> Dict[str, Any]:
178
+ """Submit final answer.
179
+
180
+ Args:
181
+ answer: Answer value
182
+
183
+ Returns:
184
+ Final observation with reward
185
+ """
186
+ return self.step("submit_answer", f"<answer>{answer}</answer>")
187
+
188
+ def close(self):
189
+ """Close the session."""
190
+ self._session.close()
191
+
192
+ def __enter__(self):
193
+ return self
194
+
195
+ def __exit__(self, exc_type, exc_val, exc_tb):
196
+ self.close()
197
+ return False
codedark/data/bank.csv ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a071417203c9e1434df5fe794fffae6a55327502b00da9c4e9754d2ab7f7cede
3
+ size 65698328
codedark/data/road.csv ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7ee5955af18eca0d4b53e13f23bd6436422e40ca84077fb8cdcfa467aa62b68f
3
+ size 37936892
codedark/data/tasks/final_25_tasks.jsonl ADDED
@@ -0,0 +1,25 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {"id": "bank_hard_001", "dataset": "bank", "goal": "What's the subscription rate for month='may' AND job='management' AND balance in Q1?", "expected_output_type": "scalar", "level": "L6", "template": "multi_condition_filter", "golden": {"answer_value": 2.44, "answer_type": "scalar", "verification_code": "df['_q'] = pd.qcut(df['balance'], 4, labels=['Q1','Q2','Q3','Q4'], duplicates='drop')\nfiltered = df[(df['month'] == 'may') & (df['job'] == 'management') & (df['_q'] == 'Q1')]\nresult = round((filtered['y'] == 1).mean() * 100, 2) if len(filtered) > 0 else 0.0\ndf.drop('_q', axis=1, inplace=True)", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": ["quartile", "how", "define", "q1", "q2", "q3", "q4", "bins", "bucket", "positive", "success", "target", "y=1", "class", "outcome", "null", "missing", "nan", "empty", "na"], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Complex analysis with 3+ operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 6, "template_name": "multi_condition_filter", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"col_a": "month", "val_a": "may", "col_b": "job", "val_b": "management", "metric": "balance", "quartile": "Q1", "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
2
+ {"id": "bank_hard_005", "dataset": "bank", "goal": "Among customers in top 95% of age AND bottom 10% of duration, what's subscription rate?", "expected_output_type": "scalar", "level": "L6", "template": "percentile_cohort", "golden": {"answer_value": 0.59, "answer_type": "scalar", "verification_code": "p_high = df['age'].quantile(95 / 100)\np_low = df['duration'].quantile(10 / 100)\ncohort = df[(df['age'] >= p_high) & (df['duration'] <= p_low)]\nresult = round((cohort['y'] == 1).mean() * 100, 2) if len(cohort) > 0 else 0.0", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": ["percentile", "inclusive", "exclusive", "boundary", "include", "positive", "success", "target", "y=1", "class", "outcome"], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Complex analysis with 3+ operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 6, "template_name": "percentile_cohort", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"metric_a": "age", "metric_b": "duration", "pct_high": 95, "pct_low": 10, "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
3
+ {"id": "bank_hard_012", "dataset": "bank", "goal": "Find the job with lowest average balance. What is the subscription rate for that segment?", "expected_output_type": "scalar", "level": "L5", "template": "chain_conversion", "golden": {"answer_value": 8.27, "answer_type": "scalar", "verification_code": "group_stats = df.groupby('job')['balance'].mean()\nextrema_val = group_stats.max() if 'lowest' == 'highest' else group_stats.min()\ntied = group_stats[group_stats == extrema_val].sort_index()\nextrema_group = tied.index[0]\nsubset = df[df['job'] == extrema_group]\nresult = round((subset['y'] == 1).mean() * 100, 2)", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": ["positive", "success", "target", "y=1", "class", "outcome", "rate", "percentage", "decimal", "format", "0-100", "0-1"], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Multi-step analysis with 2 operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 5, "template_name": "chain_conversion", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"group_col": "job", "metric": "balance", "extrema": "lowest", "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
4
+ {"id": "bank_hard_019", "dataset": "bank", "goal": "What's the subscription rate for month='may' AND job='management' AND age in Q4?", "expected_output_type": "scalar", "level": "L6", "template": "multi_condition_filter", "golden": {"answer_value": 7.04, "answer_type": "scalar", "verification_code": "df['_q'] = pd.qcut(df['age'], 4, labels=['Q1','Q2','Q3','Q4'], duplicates='drop')\nfiltered = df[(df['month'] == 'may') & (df['job'] == 'management') & (df['_q'] == 'Q4')]\nresult = round((filtered['y'] == 1).mean() * 100, 2) if len(filtered) > 0 else 0.0\ndf.drop('_q', axis=1, inplace=True)", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": ["quartile", "how", "define", "q1", "q2", "q3", "q4", "bins", "bucket", "positive", "success", "target", "y=1", "class", "outcome", "null", "missing", "nan", "empty", "na"], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Complex analysis with 3+ operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 6, "template_name": "multi_condition_filter", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"col_a": "month", "val_a": "may", "col_b": "job", "val_b": "management", "metric": "age", "quartile": "Q4", "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
5
+ {"id": "bank_hard_020", "dataset": "bank", "goal": "For month with above-average day, which have subscription rate > 25%? Return sorted list.", "expected_output_type": "list", "level": "L5", "template": "top_n_in_segment", "golden": {"answer_value": ["oct"], "answer_type": "list", "verification_code": "avg_metric = df.groupby('month')['day'].mean()\nhigh_metric = avg_metric[avg_metric > avg_metric.mean()].index\nrates = df[df['month'].isin(high_metric)].groupby('month')['y'].apply(\n lambda x: (x == 1).mean() * 100)\nresult = sorted(rates[rates > 25].index.tolist())", "tolerance": 0.0}, "tolerance": 0.0, "ambiguities": ["tie", "ties", "equal", "same", "duplicate"], "success_criteria": ["Answer must match expected value", "List elements must match (order matters)"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Multi-step analysis with 2 operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 5, "template_name": "top_n_in_segment", "generator": "qa-gen-v2", "tolerance": 0.0, "slots": {"group_col": "month", "metric": "day", "threshold": 25, "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
6
+ {"id": "bank_hard_021", "dataset": "bank", "goal": "Find the job with highest average balance. What is the subscription rate for that segment?", "expected_output_type": "scalar", "level": "L5", "template": "chain_conversion", "golden": {"answer_value": 24.62, "answer_type": "scalar", "verification_code": "group_stats = df.groupby('job')['balance'].mean()\nextrema_val = group_stats.max() if 'highest' == 'highest' else group_stats.min()\ntied = group_stats[group_stats == extrema_val].sort_index()\nextrema_group = tied.index[0]\nsubset = df[df['job'] == extrema_group]\nresult = round((subset['y'] == 1).mean() * 100, 2)", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": ["positive", "success", "target", "y=1", "class", "outcome", "rate", "percentage", "decimal", "format", "0-100", "0-1"], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Multi-step analysis with 2 operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 5, "template_name": "chain_conversion", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"group_col": "job", "metric": "balance", "extrema": "highest", "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
7
+ {"id": "bank_hard_023", "dataset": "bank", "goal": "Among customers in top 90% of balance AND bottom 10% of day, what's subscription rate?", "expected_output_type": "scalar", "level": "L6", "template": "percentile_cohort", "golden": {"answer_value": 33.53, "answer_type": "scalar", "verification_code": "p_high = df['balance'].quantile(90 / 100)\np_low = df['day'].quantile(10 / 100)\ncohort = df[(df['balance'] >= p_high) & (df['day'] <= p_low)]\nresult = round((cohort['y'] == 1).mean() * 100, 2) if len(cohort) > 0 else 0.0", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": ["percentile", "inclusive", "exclusive", "boundary", "include", "positive", "success", "target", "y=1", "class", "outcome"], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Complex analysis with 3+ operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 6, "template_name": "percentile_cohort", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"metric_a": "balance", "metric_b": "day", "pct_high": 90, "pct_low": 10, "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
8
+ {"id": "bank_hard_026", "dataset": "bank", "goal": "For each job, compute volatility score: std(age) / mean(age). Return top 5 with [group, mean, std, volatility].", "expected_output_type": "dataframe", "level": "L6", "template": "segment_volatility", "golden": {"answer_value": [{"job": "unemployed", "mean": 40.97, "std": 9.74, "volatility": 0.2377}, {"job": "self-employed", "mean": 40.42, "std": 9.46, "volatility": 0.234}, {"job": "admin.", "mean": 39.68, "std": 9.23, "volatility": 0.2326}, {"job": "services", "mean": 38.94, "std": 8.86, "volatility": 0.2275}, {"job": "management", "mean": 40.2, "std": 9.13, "volatility": 0.2271}], "answer_type": "dataframe", "verification_code": "stats = df.groupby('job')['age'].agg(['mean', 'std']).round(2)\nstats['volatility'] = round(stats['std'] / stats['mean'], 4)\nresult = stats.nlargest(5, 'volatility').reset_index()", "tolerance": 0.0}, "tolerance": 0.0, "ambiguities": ["round", "decimal", "precision", "digits", "volatility", "cv", "coefficient", "variation"], "success_criteria": ["Answer must match expected value", "DataFrame shape must match", "Column names must match", "Values must match with numeric tolerance 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Complex analysis with 3+ operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 6, "template_name": "segment_volatility", "generator": "qa-gen-v2", "tolerance": 0.0, "slots": {"group_col": "job", "metric": "age"}}}
9
+ {"id": "bank_hard_028", "dataset": "bank", "goal": "Among customers in top 95% of balance AND bottom 10% of duration, what's subscription rate?", "expected_output_type": "scalar", "level": "L6", "template": "percentile_cohort", "golden": {"answer_value": 0.84, "answer_type": "scalar", "verification_code": "p_high = df['balance'].quantile(95 / 100)\np_low = df['duration'].quantile(10 / 100)\ncohort = df[(df['balance'] >= p_high) & (df['duration'] <= p_low)]\nresult = round((cohort['y'] == 1).mean() * 100, 2) if len(cohort) > 0 else 0.0", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": ["percentile", "inclusive", "exclusive", "boundary", "include", "positive", "success", "target", "y=1", "class", "outcome"], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Complex analysis with 3+ operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 6, "template_name": "percentile_cohort", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"metric_a": "balance", "metric_b": "duration", "pct_high": 95, "pct_low": 10, "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
10
+ {"id": "bank_hard_029", "dataset": "bank", "goal": "Among customers in top 90% of balance AND bottom 25% of duration, what's subscription rate?", "expected_output_type": "scalar", "level": "L6", "template": "percentile_cohort", "golden": {"answer_value": 0.54, "answer_type": "scalar", "verification_code": "p_high = df['balance'].quantile(90 / 100)\np_low = df['duration'].quantile(25 / 100)\ncohort = df[(df['balance'] >= p_high) & (df['duration'] <= p_low)]\nresult = round((cohort['y'] == 1).mean() * 100, 2) if len(cohort) > 0 else 0.0", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": ["percentile", "inclusive", "exclusive", "boundary", "include", "positive", "success", "target", "y=1", "class", "outcome"], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Complex analysis with 3+ operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 6, "template_name": "percentile_cohort", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"metric_a": "balance", "metric_b": "duration", "pct_high": 90, "pct_low": 25, "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
11
+ {"id": "bank_hard_030", "dataset": "bank", "goal": "Rank job by subscription rate. Which bottom-3 have above-median age?", "expected_output_type": "list", "level": "L6", "template": "ranked_anomaly", "golden": {"answer_value": ["blue-collar", "entrepreneur"], "answer_type": "list", "verification_code": "stats = df.groupby('job').agg(\n rate=('y', lambda x: (x == 1).mean()),\n avg_metric=('age', 'mean'))\nstats['rank'] = stats['rate'].rank()\nbottom_3 = stats[stats['rank'] <= 3]\nresult = sorted(bottom_3[bottom_3['avg_metric'] > stats['avg_metric'].median()].index.tolist())", "tolerance": 0.0}, "tolerance": 0.0, "ambiguities": ["positive", "success", "target", "y=1", "class", "outcome", "rank", "order", "sort", "ascending", "descending"], "success_criteria": ["Answer must match expected value", "List elements must match (order matters)"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Complex analysis with 3+ operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 6, "template_name": "ranked_anomaly", "generator": "qa-gen-v2", "tolerance": 0.0, "slots": {"group_col": "job", "metric": "age", "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
12
+ {"id": "bank_hard_031", "dataset": "bank", "goal": "For each month, compute volatility score: std(day) / mean(day). Return top 5 with [group, mean, std, volatility].", "expected_output_type": "dataframe", "level": "L6", "template": "segment_volatility", "golden": {"answer_value": [{"month": "feb", "mean": 6.05, "std": 5.46, "volatility": 0.9025}, {"month": "sep", "mean": 11.67, "std": 8.04, "volatility": 0.6889}, {"month": "mar", "mean": 13.45, "std": 9.18, "volatility": 0.6825}, {"month": "jun", "mean": 11.32, "std": 7.33, "volatility": 0.6475}, {"month": "dec", "mean": 14.19, "std": 8.83, "volatility": 0.6223}], "answer_type": "dataframe", "verification_code": "stats = df.groupby('month')['day'].agg(['mean', 'std']).round(2)\nstats['volatility'] = round(stats['std'] / stats['mean'], 4)\nresult = stats.nlargest(5, 'volatility').reset_index()", "tolerance": 0.0}, "tolerance": 0.0, "ambiguities": ["round", "decimal", "precision", "digits", "volatility", "cv", "coefficient", "variation"], "success_criteria": ["Answer must match expected value", "DataFrame shape must match", "Column names must match", "Values must match with numeric tolerance 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Complex analysis with 3+ operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 6, "template_name": "segment_volatility", "generator": "qa-gen-v2", "tolerance": 0.0, "slots": {"group_col": "month", "metric": "day"}}}
13
+ {"id": "bank_hard_033", "dataset": "bank", "goal": "Show the average balance breakdown by job. Include count and mean balance for each category, sorted by mean descending.", "expected_output_type": "dataframe", "level": "L4", "template": "metric_breakdown", "golden": {"answer_value": [{"job": "retired", "count": 35185, "mean_balance": 1812.07}, {"job": "unknown", "count": 2917, "mean_balance": 1678.96}, {"job": "self-employed", "count": 19020, "mean_balance": 1598.27}, {"job": "student", "count": 11767, "mean_balance": 1577.32}, {"job": "management", "count": 175541, "mean_balance": 1510.39}, {"job": "unemployed", "count": 17634, "mean_balance": 1440.57}, {"job": "entrepreneur", "count": 17718, "mean_balance": 1306.75}, {"job": "housemaid", "count": 15912, "mean_balance": 1281.22}, {"job": "technician", "count": 138107, "mean_balance": 1071.57}, {"job": "admin.", "count": 81492, "mean_balance": 1019.92}, {"job": "blue-collar", "count": 170498, "mean_balance": 977.49}, {"job": "services", "count": 64209, "mean_balance": 834.63}], "answer_type": "dataframe", "verification_code": "breakdown = df.groupby('job').agg(\n count=('balance', 'size'),\n mean_balance=('balance', lambda x: round(x.mean(), 2))\n).sort_values('mean_balance', ascending=False)\nresult = breakdown.reset_index()", "tolerance": 0.0}, "tolerance": 0.0, "ambiguities": ["round", "decimal", "precision", "digits"], "success_criteria": ["Answer must match expected value", "DataFrame shape must match", "Column names must match", "Values must match with numeric tolerance 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Single aggregation or binning operation expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 4, "template_name": "metric_breakdown", "generator": "qa-gen-v2", "tolerance": 0.0, "slots": {"group_col": "job", "metric": "balance"}}}
14
+ {"id": "bank_hard_035", "dataset": "bank", "goal": "Find the job with lowest average age. What is the subscription rate for that segment?", "expected_output_type": "scalar", "level": "L5", "template": "chain_conversion", "golden": {"answer_value": 34.08, "answer_type": "scalar", "verification_code": "group_stats = df.groupby('job')['age'].mean()\nextrema_val = group_stats.max() if 'lowest' == 'highest' else group_stats.min()\ntied = group_stats[group_stats == extrema_val].sort_index()\nextrema_group = tied.index[0]\nsubset = df[df['job'] == extrema_group]\nresult = round((subset['y'] == 1).mean() * 100, 2)", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": ["positive", "success", "target", "y=1", "class", "outcome", "rate", "percentage", "decimal", "format", "0-100", "0-1"], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Multi-step analysis with 2 operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 5, "template_name": "chain_conversion", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"group_col": "job", "metric": "age", "extrema": "lowest", "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
15
+ {"id": "bank_hard_038", "dataset": "bank", "goal": "Find the month with the lowest subscription rate. Within that group, what is the average day?", "expected_output_type": "scalar", "level": "L5", "template": "nested_extrema", "golden": {"answer_value": 16.09, "answer_type": "scalar", "verification_code": "group_rates = df.groupby('month')['y'].apply(lambda x: (x == 1).mean())\nouter_val = group_rates.max() if 'lowest' == 'highest' else group_rates.min()\nouter_tied = group_rates[group_rates == outer_val].sort_index()\nouter_group = outer_tied.index[0]\nsubset = df[df['month'] == outer_group]\nresult = round(subset['day'].mean(), 2)", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": ["tie", "ties", "equal", "same", "duplicate", "positive", "success", "target", "y=1", "class", "outcome"], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Multi-step analysis with 2 operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 5, "template_name": "nested_extrema", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"group_col": "month", "metric": "day", "extrema_outer": "lowest", "extrema_inner": "highest", "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
16
+ {"id": "bank_hard_039", "dataset": "bank", "goal": "Divide customers into 4 day quartiles. What is the subscription percentage (0-100) in the highest (top 25%) (Q4) quartile?", "expected_output_type": "scalar", "level": "L4", "template": "quartile_conversion", "golden": {"answer_value": 11.55, "answer_type": "scalar", "verification_code": "df['_bin'] = pd.qcut(df['day'], 4, labels=['Q1','Q2','Q3','Q4'], duplicates='drop')\nbin_data = df[df['_bin'] == 'Q4']\nresult = round((bin_data['y'] == 1).mean() * 100, 2)\ndf.drop('_bin', axis=1, inplace=True)", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": ["quartile", "how", "define", "q1", "q2", "q3", "q4", "bins", "bucket", "positive", "success", "target", "y=1", "class", "outcome", "rate", "percentage", "decimal", "format", "0-100", "0-1"], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Single aggregation or binning operation expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 4, "template_name": "quartile_conversion", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"bin_col": "day", "quartile": "Q4", "quartile_desc": "highest (top 25%)", "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
17
+ {"id": "bank_hard_040", "dataset": "bank", "goal": "Among customers in top 95% of balance AND bottom 25% of duration, what's subscription rate?", "expected_output_type": "scalar", "level": "L6", "template": "percentile_cohort", "golden": {"answer_value": 0.65, "answer_type": "scalar", "verification_code": "p_high = df['balance'].quantile(95 / 100)\np_low = df['duration'].quantile(25 / 100)\ncohort = df[(df['balance'] >= p_high) & (df['duration'] <= p_low)]\nresult = round((cohort['y'] == 1).mean() * 100, 2) if len(cohort) > 0 else 0.0", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": ["percentile", "inclusive", "exclusive", "boundary", "include", "positive", "success", "target", "y=1", "class", "outcome"], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Complex analysis with 3+ operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 6, "template_name": "percentile_cohort", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"metric_a": "balance", "metric_b": "duration", "pct_high": 95, "pct_low": 25, "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
18
+ {"id": "bank_hard_041", "dataset": "bank", "goal": "Which job categories would have the biggest impact if brought to average subscription rate? Return top 3 by potential gain (count * rate gap), sorted by impact.", "expected_output_type": "list", "level": "L5", "template": "segment_improvement_potential", "golden": {"answer_value": ["blue-collar", "services", "entrepreneur"], "answer_type": "list", "verification_code": "overall_rate = (df['y'] == 1).mean()\ngroup_stats = df.groupby('job').agg(\n rate=('y', lambda x: (x == 1).mean()),\n count=('y', 'size')\n)\ngroup_stats['gap'] = overall_rate - group_stats['rate']\ngroup_stats['potential'] = group_stats['count'] * group_stats['gap']\ntop_potential = group_stats[group_stats['gap'] > 0].nlargest(3, 'potential')\nresult = top_potential.index.tolist()", "tolerance": 0.0}, "tolerance": 0.0, "ambiguities": ["positive", "success", "target", "y=1", "class", "outcome", "rate", "percentage", "decimal", "format", "0-100", "0-1"], "success_criteria": ["Answer must match expected value", "List elements must match (order matters)"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Multi-step analysis with 2 operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 5, "template_name": "segment_improvement_potential", "generator": "qa-gen-v2", "tolerance": 0.0, "slots": {"group_col": "job", "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
19
+ {"id": "bank_hard_044", "dataset": "bank", "goal": "Find the month with the lowest subscription rate. Within that group, what is the average age?", "expected_output_type": "scalar", "level": "L5", "template": "nested_extrema", "golden": {"answer_value": 38.98, "answer_type": "scalar", "verification_code": "group_rates = df.groupby('month')['y'].apply(lambda x: (x == 1).mean())\nouter_val = group_rates.max() if 'lowest' == 'highest' else group_rates.min()\nouter_tied = group_rates[group_rates == outer_val].sort_index()\nouter_group = outer_tied.index[0]\nsubset = df[df['month'] == outer_group]\nresult = round(subset['age'].mean(), 2)", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": ["tie", "ties", "equal", "same", "duplicate", "positive", "success", "target", "y=1", "class", "outcome"], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Multi-step analysis with 2 operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 5, "template_name": "nested_extrema", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"group_col": "month", "metric": "age", "extrema_outer": "lowest", "extrema_inner": "lowest", "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
20
+ {"id": "road_hard_014", "dataset": "road", "goal": "Which lighting categories have the highest total reported accidents? Show breakdown with [lighting, count, total_num_reported_accidents, avg_num_reported_accidents] sorted by total descending.", "expected_output_type": "dataframe", "level": "L4", "template": "count_segment_total", "golden": {"answer_value": [{"lighting": "dim", "count": 183826, "total_num_reported_accidents": 211283, "avg_num_reported_accidents": 1.15}, {"lighting": "daylight", "count": 178015, "total_num_reported_accidents": 207579, "avg_num_reported_accidents": 1.17}, {"lighting": "night", "count": 155913, "total_num_reported_accidents": 196214, "avg_num_reported_accidents": 1.26}], "answer_type": "dataframe", "verification_code": "breakdown = df.groupby('lighting').agg(\n count=('num_reported_accidents', 'size'),\n total_num_reported_accidents=('num_reported_accidents', 'sum'),\n avg_num_reported_accidents=('num_reported_accidents', lambda x: round(x.mean(), 2))\n).sort_values('total_num_reported_accidents', ascending=False)\nresult = breakdown.reset_index()", "tolerance": 0.0}, "tolerance": 0.0, "ambiguities": [], "success_criteria": ["Answer must match expected value", "DataFrame shape must match", "Column names must match", "Values must match with numeric tolerance 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Single aggregation or binning operation expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 4, "template_name": "count_segment_total", "generator": "qa-gen-v2", "tolerance": 0.0, "slots": {"group_col": "lighting", "target_col": "num_reported_accidents", "target_desc": "reported accidents"}}}
21
+ {"id": "road_hard_021", "dataset": "road", "goal": "Which weather categories have the highest average accident risk? Show breakdown with [weather, count, avg_accident_risk] sorted by average descending.", "expected_output_type": "dataframe", "level": "L4", "template": "continuous_segment_breakdown", "golden": {"answer_value": [{"weather": "foggy", "count": 181463, "avg_accident_risk": 0.3863}, {"weather": "rainy", "count": 156985, "avg_accident_risk": 0.3615}, {"weather": "clear", "count": 179306, "avg_accident_risk": 0.3101}], "answer_type": "dataframe", "verification_code": "breakdown = df.groupby('weather').agg(\n count=('accident_risk', 'size'),\n avg_accident_risk=('accident_risk', lambda x: round(x.mean(), 4))\n).sort_values('avg_accident_risk', ascending=False)\nresult = breakdown.reset_index()", "tolerance": 0.0}, "tolerance": 0.0, "ambiguities": [], "success_criteria": ["Answer must match expected value", "DataFrame shape must match", "Column names must match", "Values must match with numeric tolerance 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Single aggregation or binning operation expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 4, "template_name": "continuous_segment_breakdown", "generator": "qa-gen-v2", "tolerance": 0.0, "slots": {"group_col": "weather", "target_col": "accident_risk", "target_desc": "accident risk"}}}
22
+ {"id": "road_hard_015", "dataset": "road", "goal": "Divide records into 4 speed_limit quartiles. What is the average accident risk in the lower-middle (25-50%) (Q2) quartile?", "expected_output_type": "scalar", "level": "L4", "template": "continuous_quartile_analysis", "golden": {"answer_value": 0.29, "answer_type": "scalar", "verification_code": "df['_bin'] = pd.qcut(df['speed_limit'], 4, labels=['Q1','Q2','Q3','Q4'], duplicates='drop')\nbin_data = df[df['_bin'] == 'Q2']\nresult = round(bin_data['accident_risk'].mean(), 4)\ndf.drop('_bin', axis=1, inplace=True)", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": [], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Single aggregation or binning operation expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 4, "template_name": "continuous_quartile_analysis", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"bin_col": "speed_limit", "quartile": "Q2", "quartile_desc": "lower-middle (25-50%)", "target_col": "accident_risk", "target_desc": "accident risk"}}}
23
+ {"id": "road_hard_002", "dataset": "road", "goal": "Divide records into 4 curvature quartiles. What is the average accident risk in the upper-middle (50-75%) (Q3) quartile?", "expected_output_type": "scalar", "level": "L4", "template": "continuous_quartile_analysis", "golden": {"answer_value": 0.41, "answer_type": "scalar", "verification_code": "df['_bin'] = pd.qcut(df['curvature'], 4, labels=['Q1','Q2','Q3','Q4'], duplicates='drop')\nbin_data = df[df['_bin'] == 'Q3']\nresult = round(bin_data['accident_risk'].mean(), 4)\ndf.drop('_bin', axis=1, inplace=True)", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": [], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Single aggregation or binning operation expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 4, "template_name": "continuous_quartile_analysis", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"bin_col": "curvature", "quartile": "Q3", "quartile_desc": "upper-middle (50-75%)", "target_col": "accident_risk", "target_desc": "accident risk"}}}
24
+ {"id": "road_hard_007", "dataset": "road", "goal": "How much higher is the average accident risk for lighting='daylight' compared to 'night'? Return difference.", "expected_output_type": "scalar", "level": "L5", "template": "continuous_comparison", "golden": {"answer_value": -0.17, "answer_type": "scalar", "verification_code": "avg_a = df[df['lighting'] == 'daylight']['accident_risk'].mean()\navg_b = df[df['lighting'] == 'night']['accident_risk'].mean()\nresult = round(avg_a - avg_b, 4)", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": [], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Multi-step analysis with 2 operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 5, "template_name": "continuous_comparison", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"group_col": "lighting", "val_a": "daylight", "val_b": "night", "target_col": "accident_risk", "target_desc": "accident risk"}}}
25
+ {"id": "road_hard_004", "dataset": "road", "goal": "How much higher is the average accident risk for road_type='rural' compared to 'urban'? Return difference.", "expected_output_type": "scalar", "level": "L5", "template": "continuous_comparison", "golden": {"answer_value": -0.01, "answer_type": "scalar", "verification_code": "avg_a = df[df['road_type'] == 'rural']['accident_risk'].mean()\navg_b = df[df['road_type'] == 'urban']['accident_risk'].mean()\nresult = round(avg_a - avg_b, 4)", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": [], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Multi-step analysis with 2 operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 5, "template_name": "continuous_comparison", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"group_col": "road_type", "val_a": "rural", "val_b": "urban", "target_col": "accident_risk", "target_desc": "accident risk"}}}
codedark/models.py ADDED
@@ -0,0 +1,150 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ CodeDark Data Models
3
+
4
+ Pydantic models for Action, Observation, and State following OpenEnv spec.
5
+ """
6
+
7
+ from pydantic import BaseModel, Field
8
+ from typing import Optional, List, Any, Literal
9
+
10
+
11
class CodeDarkAction(BaseModel):
    """
    Action for CodeDark environment.

    Agents send actions with a tool name and arguments.

    Tools available:
    - run_python: Execute Python code with pandas/numpy
    - read_notes: Read all saved notes
    - save_note: Save a note for later recall
    - clarify: Ask clarifying question (max 2 per episode)
    - submit_answer: Submit final answer (ends episode)
    """

    # Tool name; Literal restricts it to the five supported tools, so pydantic
    # rejects anything else at validation time.
    tool: Literal["run_python", "read_notes", "save_note", "clarify", "submit_answer"]
    args: str = ""  # Tool-specific arguments (code, note text, question, or answer)

    # Example payloads surfaced in the generated JSON schema / OpenAPI docs.
    model_config = {
        "json_schema_extra": {
            "examples": [
                {"tool": "run_python", "args": "result = df['y'].mean() * 100"},
                {"tool": "read_notes", "args": ""},
                {"tool": "save_note", "args": "Average subscription rate is 11.26%"},
                {"tool": "clarify", "args": "What does Q1 mean in this context?"},
                {"tool": "submit_answer", "args": "11.26"},
            ]
        }
    }
39
+
40
+
41
class CodeDarkObservation(BaseModel):
    """
    Observation returned after each action.

    Contains execution results, environment state, and episode info.
    Reward is only populated when done=True.
    """

    # Execution results of the last tool call
    stdout: str = ""
    stderr: str = ""
    exit_code: int = 0  # 0 = success, non-zero = tool error

    # Turn tracking
    turn: int = 0
    max_turns: int = 10

    # Persistent state: notes saved via save_note, visible every turn
    notes: List[str] = Field(default_factory=list)

    # Task info
    task_id: str = ""
    question: str = ""
    difficulty: str = ""  # L4, L5, L6
    dataset: str = ""  # bank, road

    # Episode status
    done: bool = False
    submitted: bool = False

    # Reward components (only set when done=True)
    reward: Optional[float] = None
    correctness: Optional[float] = None
    efficiency: Optional[float] = None

    # Additional metadata (e.g. submitted vs expected answer at episode end)
    metadata: dict = Field(default_factory=dict)

    # Example payload surfaced in the generated JSON schema / OpenAPI docs.
    model_config = {
        "json_schema_extra": {
            "examples": [
                {
                    "stdout": "run_python Result:\n(45211, 17)",
                    "stderr": "",
                    "exit_code": 0,
                    "turn": 1,
                    "max_turns": 10,
                    "notes": [],
                    "task_id": "bank_hard_001",
                    "question": "What's the subscription rate for month='may'?",
                    "difficulty": "L5",
                    "dataset": "bank",
                    "done": False,
                    "submitted": False,
                    "reward": None,
                }
            ]
        }
    }
100
+
101
+
102
class CodeDarkState(BaseModel):
    """
    Internal state for CodeDark environment.

    Tracks episode progress, accumulated notes, and submission status.
    """

    episode_id: str = ""  # UUID assigned by the environment on reset
    step_count: int = 0

    # Task info
    task_id: str = ""
    dataset: str = ""

    # Accumulated state
    notes: List[str] = Field(default_factory=list)
    turn_count: int = 0
    error_count: int = 0  # number of tool calls that returned non-zero exit codes
    clarify_count: int = 0  # clarifications used so far (environment caps this)

    # Submission
    submitted: bool = False
    submitted_answer: Optional[Any] = None

    # For scoring: golden answer and numeric tolerance copied from the task
    expected_answer: Optional[Any] = None
    tolerance: float = 0.01
129
+
130
+
131
class ResetRequest(BaseModel):
    """Request body for /reset endpoint."""

    task_id: Optional[str] = None  # specific task to load; None = auto-select
    seed: Optional[int] = None  # random seed for task selection
136
+
137
+
138
class StepRequest(BaseModel):
    """Request body for /step endpoint."""

    # Tool name as a plain string; the server validates it against the
    # supported tool list before building a CodeDarkAction.
    tool: str
    args: str = ""
143
+
144
+
145
class HealthResponse(BaseModel):
    """Response for /health endpoint."""

    status: str = "healthy"
    environment: str = "codedark"
    version: str = "0.1.0"
codedark/openenv.yaml ADDED
@@ -0,0 +1,62 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ name: codedark
2
+ version: "0.1.0"
3
+ description: |
4
+ CodeDark: Multi-turn data analytics environment for training RL agents.
5
+
6
+ Train AI agents to be data scientists, not just code executors.
7
+ Features real business analytics tasks with pandas/numpy,
8
+ multi-metric reward shaping, and skill-based curriculum.
9
+
10
+ author: Vijay Athithya
11
+ license: MIT
12
+
13
+ # Environment interface types
14
+ action: CodeDarkAction
15
+ observation: CodeDarkObservation
16
+ state: CodeDarkState
17
+
18
+ # Environment configuration
19
+ config:
20
+ max_turns: 10
21
+ datasets:
22
+ - bank
23
+ - road
24
+ difficulty_levels:
25
+ - L4
26
+ - L5
27
+ - L6
28
+
29
+ # Tools available to agents
30
+ tools:
31
+ - name: run_python
32
+ description: Execute Python/pandas code. Store output in 'result' variable.
33
+ - name: read_notes
34
+ description: Read all saved notes from previous turns.
35
+ - name: save_note
36
+ description: Save observations for later recall. Notes persist across turns.
37
+ - name: clarify
38
+ description: Ask clarifying questions about the task (max 2 per episode).
39
+ - name: submit_answer
40
+ description: Submit final answer. Ends episode.
41
+
42
+ # Reward structure
43
+ reward:
44
+ max_reward: 1.0
45
+ components:
46
+ - name: correctness
47
+ weight: 0.80
48
+ description: Binary correct/incorrect with numeric tolerance
49
+ - name: efficiency
50
+ weight: 0.10
51
+ description: Fewer turns = better score
52
+ - name: token_cost
53
+ weight: 0.10
54
+ description: Lower token usage = better score
55
+
56
+ # Benchmarking info
57
+ benchmark:
58
+ tasks: 25
59
+ models_evaluated: 11
60
+ completions: 1844
61
+ best_accuracy: "77.3%"
62
+ best_model: "claude-opus-4.5"
codedark/pyproject.toml ADDED
@@ -0,0 +1,60 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [build-system]
2
+ requires = ["setuptools>=45", "wheel"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
+ [project]
6
+ name = "openenv-codedark"
7
+ version = "0.1.0"
8
+ description = "Multi-turn data analytics environment for RL agent training - OpenEnv compatible"
9
+ readme = "README.md"
10
+ requires-python = ">=3.10"
11
+ license = {text = "MIT"}
12
+ authors = [
13
+ {name = "Vijay Athithya", email = "vijay@analytics-rl.com"}
14
+ ]
15
+ keywords = ["openenv", "reinforcement-learning", "data-analytics", "llm", "agents"]
16
+ classifiers = [
17
+ "Development Status :: 4 - Beta",
18
+ "Intended Audience :: Developers",
19
+ "Intended Audience :: Science/Research",
20
+ "License :: OSI Approved :: MIT License",
21
+ "Programming Language :: Python :: 3",
22
+ "Programming Language :: Python :: 3.10",
23
+ "Programming Language :: Python :: 3.11",
24
+ "Topic :: Scientific/Engineering :: Artificial Intelligence",
25
+ ]
26
+ dependencies = [
27
+ "fastapi>=0.115.0",
28
+ "pydantic>=2.0.0",
29
+ "uvicorn[standard]>=0.24.0",
30
+ "pandas>=2.0.0",
31
+ "numpy>=1.24.0",
32
+ "requests>=2.31.0",
33
+ ]
34
+
35
+ [project.optional-dependencies]
36
+ dev = [
37
+ "pytest>=8.0.0",
38
+ "pytest-cov>=4.0.0",
39
+ "pytest-asyncio>=0.23.0",
40
+ "httpx>=0.27.0",
41
+ ]
42
+
43
+ [project.urls]
44
+ Homepage = "https://github.com/vj-09/codedark"
45
+ Documentation = "https://huggingface.co/spaces/openenv/codedark"
46
+ Repository = "https://github.com/vj-09/codedark"
47
+
48
+ [project.scripts]
49
+ codedark-server = "codedark.server.app:main"
50
+
51
+ [tool.setuptools]
52
+ packages = ["codedark", "codedark.server"]
53
+ package-dir = {"codedark" = ".", "codedark.server" = "server"}
54
+
55
+ [tool.setuptools.package-data]
56
+ codedark = ["data/*.csv", "data/tasks/*.jsonl", "openenv.yaml"]
57
+
58
+ [tool.pytest.ini_options]
59
+ asyncio_mode = "auto"
60
+ testpaths = ["tests"]
codedark/server/Dockerfile ADDED
@@ -0,0 +1,34 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Install system dependencies (curl is required by the HEALTHCHECK below)
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl && \
    rm -rf /var/lib/apt/lists/*

# Copy requirements first for better layer caching
COPY server/requirements.txt ./requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Copy package
COPY . .

# Install package in editable mode
RUN pip install --no-cache-dir -e .

# Environment variables (HOST/PORT are read by the server entrypoint)
ENV PYTHONUNBUFFERED=1
ENV HOST=0.0.0.0
ENV PORT=8000

# Expose port
EXPOSE 8000

# Healthcheck against the server's /health endpoint
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

# Run server
CMD ["python", "-m", "codedark.server.app"]
codedark/server/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ """CodeDark Server Package."""
codedark/server/app.py ADDED
@@ -0,0 +1,195 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ CodeDark FastAPI Server
3
+
4
+ OpenEnv-compatible HTTP server for CodeDark environment.
5
+ Provides /reset, /step, /state, and /health endpoints.
6
+ """
7
+
8
+ import os
9
+ from contextlib import asynccontextmanager
10
+ from typing import Optional
11
+
12
+ import uvicorn
13
+ from fastapi import FastAPI, HTTPException
14
+ from fastapi.middleware.cors import CORSMiddleware
15
+
16
+ from ..models import (
17
+ CodeDarkAction,
18
+ CodeDarkObservation,
19
+ CodeDarkState,
20
+ ResetRequest,
21
+ StepRequest,
22
+ HealthResponse,
23
+ )
24
+ from .environment import CodeDarkEnvironment
25
+
26
+
27
+ # Global environment instance
28
+ _env: Optional[CodeDarkEnvironment] = None
29
+
30
+
31
def get_env() -> CodeDarkEnvironment:
    """Get or create environment instance.

    Lazily builds a module-level singleton configured from the
    CODEDARK_DATA_DIR, CODEDARK_TASKS_PATH and CODEDARK_MAX_TURNS
    environment variables.
    """
    global _env
    if _env is None:
        _env = CodeDarkEnvironment(
            data_dir=os.environ.get("CODEDARK_DATA_DIR"),
            tasks_path=os.environ.get("CODEDARK_TASKS_PATH"),
            max_turns=int(os.environ.get("CODEDARK_MAX_TURNS", "10")),
        )
    return _env
41
+
42
+
43
@asynccontextmanager
async def lifespan(app: FastAPI):
    """Lifespan context manager for startup/shutdown.

    Startup eagerly constructs the singleton environment so the first request
    does not pay the task/data loading cost; shutdown drops the reference.
    """
    # Startup: initialize environment
    get_env()
    yield
    # Shutdown: release the singleton so a restarted app re-initializes cleanly
    global _env
    _env = None
52
+
53
+
54
# Create FastAPI app; lifespan hooks handle environment init/teardown.
app = FastAPI(
    title="CodeDark Environment",
    description="Multi-turn data analytics environment for RL agent training",
    version="0.1.0",
    lifespan=lifespan,
)

# Add CORS middleware — fully open: any origin/method/header may call the API.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
70
+
71
+
72
+ @app.get("/health", response_model=HealthResponse)
73
+ async def health():
74
+ """Health check endpoint."""
75
+ return HealthResponse(
76
+ status="healthy",
77
+ environment="codedark",
78
+ version="0.1.0",
79
+ )
80
+
81
+
82
+ @app.post("/reset", response_model=CodeDarkObservation)
83
+ async def reset(request: ResetRequest = None):
84
+ """Reset environment for a new episode.
85
+
86
+ Args:
87
+ request: Optional reset request with task_id and seed
88
+
89
+ Returns:
90
+ Initial observation
91
+ """
92
+ env = get_env()
93
+
94
+ if request is None:
95
+ request = ResetRequest()
96
+
97
+ obs = env.reset(task_id=request.task_id, seed=request.seed)
98
+ return obs
99
+
100
+
101
+ @app.post("/step", response_model=CodeDarkObservation)
102
+ async def step(request: StepRequest):
103
+ """Execute an action and return observation.
104
+
105
+ Args:
106
+ request: Step request with tool and args
107
+
108
+ Returns:
109
+ Observation after action execution
110
+ """
111
+ env = get_env()
112
+
113
+ # Validate tool
114
+ valid_tools = ["run_python", "read_notes", "save_note", "clarify", "submit_answer"]
115
+ if request.tool not in valid_tools:
116
+ raise HTTPException(
117
+ status_code=400,
118
+ detail=f"Invalid tool: {request.tool}. Valid tools: {valid_tools}",
119
+ )
120
+
121
+ action = CodeDarkAction(tool=request.tool, args=request.args)
122
+ obs = env.step(action)
123
+ return obs
124
+
125
+
126
+ @app.get("/state", response_model=CodeDarkState)
127
+ async def state():
128
+ """Get current environment state.
129
+
130
+ Returns:
131
+ Current CodeDarkState
132
+ """
133
+ env = get_env()
134
+ return env.state
135
+
136
+
137
+ @app.get("/metadata")
138
+ async def metadata():
139
+ """Get environment metadata.
140
+
141
+ Returns:
142
+ Environment metadata dict
143
+ """
144
+ env = get_env()
145
+ return {
146
+ "name": "codedark",
147
+ "version": "0.1.0",
148
+ "description": "Multi-turn data analytics environment for RL agent training",
149
+ "max_turns": env.max_turns,
150
+ "max_clarifications": env.max_clarifications,
151
+ "num_tasks": len(env.tasks),
152
+ "tools": [
153
+ {"name": "run_python", "description": "Execute Python/pandas code"},
154
+ {"name": "read_notes", "description": "Read all saved notes"},
155
+ {"name": "save_note", "description": "Save a note for later recall"},
156
+ {"name": "clarify", "description": "Ask clarifying question (max 2)"},
157
+ {"name": "submit_answer", "description": "Submit final answer"},
158
+ ],
159
+ "reward_structure": {
160
+ "max_reward": 1.0,
161
+ "components": [
162
+ {"name": "correctness", "weight": 0.80},
163
+ {"name": "efficiency", "weight": 0.10},
164
+ {"name": "token_cost", "weight": 0.10},
165
+ ],
166
+ },
167
+ }
168
+
169
+
170
+ @app.get("/schema")
171
+ async def schema():
172
+ """Get environment schema for Action, Observation, State.
173
+
174
+ Returns:
175
+ JSON schemas for all types
176
+ """
177
+ return {
178
+ "action": CodeDarkAction.model_json_schema(),
179
+ "observation": CodeDarkObservation.model_json_schema(),
180
+ "state": CodeDarkState.model_json_schema(),
181
+ }
182
+
183
+
184
def main():
    """Run the server.

    Host/port/reload come from the HOST, PORT and RELOAD environment
    variables (defaults: 0.0.0.0:8000, reload off).
    """
    uvicorn.run(
        "codedark.server.app:app",
        host=os.environ.get("HOST", "0.0.0.0"),
        port=int(os.environ.get("PORT", "8000")),
        reload=os.environ.get("RELOAD", "false").lower() == "true",
    )


if __name__ == "__main__":
    main()
codedark/server/environment.py ADDED
@@ -0,0 +1,319 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ CodeDark Environment
3
+
4
+ OpenEnv-compatible environment for multi-turn data analytics tasks.
5
+ Agents analyze CSV data using Python/Pandas tools and submit answers.
6
+ """
7
+
8
+ import json
9
+ import uuid
10
+ from pathlib import Path
11
+ from typing import Any, Dict, List, Optional
12
+
13
+ import pandas as pd
14
+
15
+ from ..models import CodeDarkAction, CodeDarkObservation, CodeDarkState
16
+ from .tools import (
17
+ run_python,
18
+ read_notes,
19
+ save_note,
20
+ clarify,
21
+ submit_answer,
22
+ parse_tool_call,
23
+ )
24
+ from .scoring import compute_reward
25
+
26
+
27
class CodeDarkEnvironment:
    """CodeDark environment for multi-turn data analytics.

    Features:
    - Multi-turn agent evaluation
    - 5 tools: run_python, read_notes, save_note, clarify, submit_answer
    - Shaped rewards: correctness (80%) + efficiency (10%) + token cost (10%)
    - Supports bank and road datasets
    """

    def __init__(
        self,
        data_dir: Optional[str] = None,
        tasks_path: Optional[str] = None,
        max_turns: int = 10,
        max_clarifications: int = 2,
    ):
        """Initialize CodeDark environment.

        Args:
            data_dir: Path to directory containing CSV files
            tasks_path: Path to tasks.jsonl file
            max_turns: Maximum turns per episode (default: 10)
            max_clarifications: Maximum clarifications per episode (default: 2)
        """
        self.max_turns = max_turns
        self.max_clarifications = max_clarifications

        # Resolve paths
        if data_dir:
            self.data_dir = Path(data_dir)
        else:
            # Default to data/ relative to this file's parent package
            self.data_dir = Path(__file__).parent.parent / "data"

        if tasks_path:
            self.tasks_path = Path(tasks_path)
        else:
            self.tasks_path = self.data_dir / "tasks" / "final_25_tasks.jsonl"

        # Load tasks; missing file yields an empty task list (reset() then
        # returns an error observation).
        self.tasks = self._load_tasks()
        self._tasks_by_id = {t["id"]: t for t in self.tasks}
        self._task_index = 0  # cursor for round-robin selection in reset()

        # Current episode state (populated by reset())
        self._state: Optional[CodeDarkState] = None
        self._df: Optional[pd.DataFrame] = None
        self._current_task: Optional[Dict] = None

    def _load_tasks(self) -> List[Dict]:
        """Load tasks from JSONL file.

        Returns:
            List of task dicts, one per non-blank line; empty list when the
            tasks file does not exist.
        """
        if not self.tasks_path.exists():
            return []

        tasks = []
        with open(self.tasks_path) as f:
            for line in f:
                if line.strip():
                    tasks.append(json.loads(line))
        return tasks

    def _load_data_for_task(self, task: Dict) -> Optional[pd.DataFrame]:
        """Load the appropriate CSV for a task.

        Args:
            task: Task dictionary with 'dataset' field

        Returns:
            DataFrame or None if not found
        """
        # Dataset name maps directly to a CSV file, e.g. "bank" -> bank.csv.
        dataset = task.get("dataset", "bank")
        csv_path = self.data_dir / f"{dataset}.csv"

        if csv_path.exists():
            return pd.read_csv(csv_path)
        return None

    @property
    def state(self) -> CodeDarkState:
        """Return current environment state (a default state before reset())."""
        if self._state is None:
            self._state = CodeDarkState()
        return self._state

    def reset(
        self, task_id: Optional[str] = None, seed: Optional[int] = None
    ) -> CodeDarkObservation:
        """Reset environment for a new episode.

        Args:
            task_id: Specific task to load (optional)
            seed: Random seed for task selection (optional)

        Returns:
            Initial observation with task question
        """
        # Select task: explicit task_id wins, then seeded random choice,
        # then round-robin over the loaded task list.
        if task_id and task_id in self._tasks_by_id:
            task = self._tasks_by_id[task_id]
        elif self.tasks:
            if seed is not None:
                import random

                random.seed(seed)
                task = random.choice(self.tasks)
            else:
                # Round-robin through tasks
                task = self.tasks[self._task_index % len(self.tasks)]
                self._task_index += 1
        else:
            # No tasks loaded - return error observation
            return CodeDarkObservation(
                stderr="Error: No tasks loaded",
                exit_code=1,
                done=True,
            )

        self._current_task = task

        # Load data for this task
        self._df = self._load_data_for_task(task)
        if self._df is None:
            return CodeDarkObservation(
                stderr=f"Error: Could not load data for dataset '{task.get('dataset', 'bank')}'",
                exit_code=1,
                done=True,
            )

        # Initialize state. The golden answer and tolerance come from the
        # task's "golden" record and drive scoring at episode end.
        self._state = CodeDarkState(
            episode_id=str(uuid.uuid4()),
            step_count=0,
            task_id=task["id"],
            dataset=task.get("dataset", "bank"),
            notes=[],
            turn_count=0,
            error_count=0,
            clarify_count=0,
            submitted=False,
            submitted_answer=None,
            expected_answer=task["golden"]["answer_value"],
            tolerance=task["golden"].get("tolerance", 0.01),
        )

        # Return initial observation (turn 0; the question is the task goal)
        return CodeDarkObservation(
            stdout=f"Task loaded. DataFrame shape: {self._df.shape}",
            turn=0,
            max_turns=self.max_turns,
            notes=[],
            task_id=task["id"],
            question=task["goal"],
            difficulty=task.get("level", "L5"),
            dataset=task.get("dataset", "bank"),
            done=False,
            submitted=False,
        )

    def step(self, action: CodeDarkAction) -> CodeDarkObservation:
        """Execute an action and return observation.

        Args:
            action: CodeDarkAction with tool name and args

        Returns:
            CodeDarkObservation with results
        """
        if self._state is None or self._current_task is None:
            return CodeDarkObservation(
                stderr="Error: Environment not reset. Call reset() first.",
                exit_code=1,
                done=True,
            )

        # Stepping after submission just replays the terminal observation.
        if self._state.submitted:
            return self._make_final_observation()

        # Increment turn
        self._state.turn_count += 1
        self._state.step_count += 1

        # Check turn limit. NOTE(review): the episode is force-ended by
        # setting submitted=True even though no answer was submitted, so the
        # final observation reports submitted=True here — confirm intended.
        if self._state.turn_count > self.max_turns:
            self._state.submitted = True
            return self._make_final_observation()

        # Parse tool-specific args before dispatch; a parse failure counts
        # as an error turn and returns immediately.
        parsed_content, parse_error = parse_tool_call(action.args, action.tool)

        if parse_error:
            self._state.error_count += 1
            return CodeDarkObservation(
                stderr=f"{action.tool} Error: {parse_error}",
                exit_code=1,
                turn=self._state.turn_count,
                max_turns=self.max_turns,
                notes=self._state.notes.copy(),
                task_id=self._state.task_id,
                question=self._current_task["goal"],
                difficulty=self._current_task.get("level", "L5"),
                dataset=self._state.dataset,
                done=False,
                submitted=False,
            )

        # Execute tool; each helper returns (stdout, stderr, exit_code[, extra]).
        stdout, stderr, exit_code = "", "", 0

        if action.tool == "run_python":
            stdout, stderr, exit_code = run_python(parsed_content, self._df)

        elif action.tool == "read_notes":
            stdout, stderr, exit_code = read_notes(self._state.notes)

        elif action.tool == "save_note":
            # save_note mutates self._state.notes in place (notes persist)
            stdout, stderr, exit_code = save_note(parsed_content, self._state.notes)

        elif action.tool == "clarify":
            stdout, stderr, exit_code, new_count = clarify(
                question=parsed_content,
                clarify_count=self._state.clarify_count,
                max_clarifications=self.max_clarifications,
                ambiguities=self._current_task.get("ambiguities", []),
                answer_type=self._current_task.get("golden", {}).get(
                    "answer_type", "scalar"
                ),
            )
            self._state.clarify_count = new_count

        elif action.tool == "submit_answer":
            stdout, stderr, exit_code, answer = submit_answer(parsed_content)
            if exit_code == 0:
                # A successfully parsed answer ends the episode immediately.
                self._state.submitted = True
                self._state.submitted_answer = answer
                return self._make_final_observation()

        # Track errors from any tool that reported a non-zero exit code
        if exit_code != 0:
            self._state.error_count += 1

        return CodeDarkObservation(
            stdout=stdout,
            stderr=stderr,
            exit_code=exit_code,
            turn=self._state.turn_count,
            max_turns=self.max_turns,
            notes=self._state.notes.copy(),
            task_id=self._state.task_id,
            question=self._current_task["goal"],
            difficulty=self._current_task.get("level", "L5"),
            dataset=self._state.dataset,
            done=False,
            submitted=False,
        )

    def _make_final_observation(self) -> CodeDarkObservation:
        """Create final observation with reward computation.

        Scores the submitted answer against the golden answer and packs the
        reward breakdown plus debugging info into the terminal observation.
        """
        if self._state is None or self._current_task is None:
            return CodeDarkObservation(done=True)

        # Compute reward (correctness/efficiency weighted; see scoring module)
        reward, correctness, efficiency, token_cost = compute_reward(
            submitted=self._state.submitted_answer,
            expected=self._state.expected_answer,
            tolerance=self._state.tolerance,
            turns=self._state.turn_count,
            max_turns=self.max_turns,
        )

        return CodeDarkObservation(
            stdout="[EPISODE COMPLETE]",
            turn=self._state.turn_count,
            max_turns=self.max_turns,
            notes=self._state.notes.copy(),
            task_id=self._state.task_id,
            question=self._current_task["goal"],
            difficulty=self._current_task.get("level", "L5"),
            dataset=self._state.dataset,
            done=True,
            submitted=self._state.submitted,
            reward=reward,
            correctness=correctness,
            efficiency=efficiency,
            metadata={
                "submitted_answer": self._state.submitted_answer,
                "expected_answer": self._state.expected_answer,
                "tolerance": self._state.tolerance,
                "error_count": self._state.error_count,
                "clarify_count": self._state.clarify_count,
                "token_cost_usd": token_cost,
            },
        )
codedark/server/requirements.txt ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ # CodeDark Server Requirements
2
+ fastapi>=0.115.0
3
+ pydantic>=2.0.0
4
+ uvicorn[standard]>=0.24.0
5
+ pandas>=2.0.0
6
+ numpy>=1.24.0
7
+ requests>=2.31.0
codedark/server/scoring.py ADDED
@@ -0,0 +1,332 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ CodeDark Scoring System
3
+
4
+ Reward computation with multi-metric scoring:
5
+ - 80% correctness (binary: exact match within tolerance)
6
+ - 10% efficiency (fewer turns = better)
7
+ - 10% token cost (lower usage = better)
8
+ """
9
+
10
+ from typing import Any, Optional, Tuple
11
+ import ast
12
+
13
+
14
def normalize_value(val: Any) -> Any:
    """Normalize a value for comparison.

    Handles:
    - String to float/int conversion
    - String to list/dict parsing
    - Float rounding for precision
    - Percentage stripping
    """
    if val is None:
        return None

    # Numbers: round floats to 4 places, promote ints to float so that
    # 7 and 7.0 compare equal downstream. Containers pass through as-is.
    if isinstance(val, float):
        return round(val, 4)
    if isinstance(val, int):
        return float(val)
    if isinstance(val, (list, dict)):
        return val

    if isinstance(val, str):
        text = val.strip()

        # Literal list/dict syntax takes priority over numeric parsing.
        if text[:1] in ("[", "{"):
            try:
                return ast.literal_eval(text)
            except (ValueError, SyntaxError):
                pass

        # Numeric string, optionally suffixed with '%'.
        try:
            return round(float(text.rstrip("%")), 4)
        except ValueError:
            # Plain text: compare case-insensitively.
            return text.lower()

    # Any other type is returned unchanged.
    return val
55
+
56
+
57
def parse_markdown_table(text: str) -> Optional[list]:
    """Parse markdown table to list of dicts.

    Handles tables like:
    | job | mean | std |
    |-----|------|-----|
    | retired | 40.97 | 9.74 |
    """
    if not isinstance(text, str):
        return None

    # Keep only lines that look like table rows (contain a pipe).
    pipe_rows = [ln for ln in text.strip().split("\n") if "|" in ln]
    # Require header + separator + at least one data row.
    if len(pipe_rows) < 3:
        return None

    headers = [cell.strip().lower() for cell in pipe_rows[0].split("|") if cell.strip()]
    if not headers:
        return None

    # Second line is normally the |---|---| separator; skip it when present.
    has_separator = "---" in pipe_rows[1] or "--|" in pipe_rows[1]
    body = pipe_rows[2:] if has_separator else pipe_rows[1:]

    parsed = []
    for ln in body:
        cells = [c.strip() for c in ln.split("|") if c.strip()]
        # Rows with a mismatched cell count are silently skipped.
        if len(cells) != len(headers):
            continue

        record = {}
        for key, raw in zip(headers, cells):
            # Clean number formatting (thousands separators, currency symbols).
            cleaned = raw.replace(",", "").replace("€", "").replace("$", "").strip()
            try:
                record[key] = round(float(cleaned), 4) if "." in cleaned else int(cleaned)
            except ValueError:
                # Non-numeric cell: keep the original text, lowercased.
                record[key] = raw.lower()
        parsed.append(record)

    return parsed if parsed else None
107
+
108
+
109
def compare_answers(submitted: Any, expected: Any, tolerance: float = 0.01) -> bool:
    """Compare answers with support for structured data and numeric tolerance.

    Handles:
    - Type mismatches (string "33.53" vs float 33.53)
    - Floating point precision (rounds to 4 decimals)
    - Nested structures (lists, dicts)
    - String parsing for lists/dicts
    """
    # Normalize both sides first: numeric strings become rounded floats,
    # "["/"{"-prefixed strings become literals, other strings are lowercased.
    submitted_n = normalize_value(submitted)
    expected_n = normalize_value(expected)

    # Null checks
    if submitted_n is None and expected_n is None:
        return True
    if submitted_n is None or expected_n is None:
        return False

    # Same type comparison after normalization
    if type(submitted_n) == type(expected_n):
        if isinstance(expected_n, list):
            if len(submitted_n) != len(expected_n):
                return False
            # Check if list contains dicts (structured data) - use order-sensitive
            # pairwise recursive comparison.
            if expected_n and isinstance(expected_n[0], dict):
                return all(
                    compare_answers(s, e, tolerance)
                    for s, e in zip(submitted_n, expected_n)
                )
            # Simple values list - order-insensitive (we don't tell models to sort)
            submitted_sorted = sorted([str(x).lower().strip() for x in submitted_n])
            expected_sorted = sorted([str(x).lower().strip() for x in expected_n])
            return submitted_sorted == expected_sorted

        if isinstance(expected_n, dict):
            # Key sets must match exactly; values are compared recursively.
            if set(submitted_n.keys()) != set(expected_n.keys()):
                return False
            return all(
                compare_answers(submitted_n[k], expected_n[k], tolerance)
                for k in expected_n
            )

        if isinstance(expected_n, float):
            # Numeric equality within tolerance (inclusive bound).
            return abs(submitted_n - expected_n) <= tolerance

        # String comparison
        return str(submitted_n) == str(expected_n)

    # Type mismatch after normalization - try numeric comparison
    try:
        sub_f = (
            float(submitted_n) if not isinstance(submitted_n, (list, dict)) else None
        )
        exp_f = float(expected_n) if not isinstance(expected_n, (list, dict)) else None
        if sub_f is not None and exp_f is not None:
            return abs(sub_f - exp_f) <= tolerance
    except (ValueError, TypeError):
        pass

    # Try markdown table parsing if expected is list and submitted is string
    if isinstance(expected_n, list) and isinstance(submitted_n, str):
        parsed = parse_markdown_table(submitted)  # Use original, not normalized
        if parsed is not None:
            return compare_answers(parsed, expected, tolerance)

    # Fallback: case-insensitive string comparison
    return str(submitted_n).lower() == str(expected_n).lower()
177
+
178
+
179
def score_correctness(submitted: Any, expected: Any, tolerance: float = 0.01) -> float:
    """Score the submitted answer correctness. Weight: 0.80

    Scoring:
    - 0.80: Exact match (within tolerance, inclusive)
    - 0.20: Almost there (rounding or 100x scale error)
    - 0.00: Wrong

    Args:
        submitted: Submitted answer
        expected: Expected answer
        tolerance: Numeric tolerance for comparison

    Returns:
        Correctness score (0.0, 0.20, or 0.80)
    """
    if submitted is None:
        return 0.0

    try:
        # Try numeric comparison first
        submitted_f = float(submitted)
        expected_f = float(expected)

        # Exact match. Use <= (inclusive) so a deviation of exactly
        # `tolerance` passes, consistent with compare_answers().
        if abs(submitted_f - expected_f) <= tolerance:
            return 0.80

        if expected_f != 0:
            ratio = submitted_f / expected_f

            # 100x scale error (decimal vs percentage)
            # e.g., 0.0959 vs 9.59 or 9.59 vs 0.0959
            if 0.009 < ratio < 0.011 or 99 < ratio < 101:
                return 0.20

            # Rounding error (within 1% of expected)
            if 0.99 < ratio < 1.01:
                return 0.20

    except (ValueError, TypeError):
        # Non-numeric input: structured data comparison (lists, dicts, tables)
        if compare_answers(submitted, expected, tolerance=tolerance):
            return 0.80

    return 0.0
225
+
226
+
227
def score_efficiency(turns: int, max_turns: int, is_correct: bool) -> float:
    """Score based on turns used (fewer = better). Weight: 0.10

    Only applies if the answer is correct. The score scales linearly with the
    unused fraction of the turn budget: 0 turns scores the full 0.10, and
    using all ``max_turns`` turns (or more) scores 0.0.

    Args:
        turns: Number of turns used
        max_turns: Maximum turns allowed (must be > 0 for a nonzero score)
        is_correct: Whether the answer was correct

    Returns:
        Efficiency score (0.0 to 0.10)
    """
    if not is_correct:
        return 0.0

    # Guard against a zero/negative budget, which would otherwise raise
    # ZeroDivisionError below.
    if max_turns <= 0:
        return 0.0

    # Linear scale on unused budget; clamped at 0 when turns >= max_turns.
    efficiency = max(0.0, 1.0 - (turns / max_turns))
    return 0.10 * efficiency
246
+
247
+
248
def score_token_cost(
    input_tokens: int,
    output_tokens: int,
    is_correct: bool,
    input_price: float = 1.0,
    output_price: float = 5.0,
    target_cost: float = 0.01,
    max_cost: float = 0.10,
) -> Tuple[float, float]:
    """Score based on token cost (lower = better). Weight: 0.10

    Only applies if the answer is correct. Cost at or below ``target_cost``
    earns the full 0.10; cost at or above ``max_cost`` earns 0.0; in between
    the score falls off linearly.

    Args:
        input_tokens: Number of input tokens used
        output_tokens: Number of output tokens used
        is_correct: Whether the answer was correct
        input_price: Price per 1M input tokens (default $1)
        output_price: Price per 1M output tokens (default $5)
        target_cost: Cost for full score (default $0.01)
        max_cost: Cost for zero score (default $0.10)

    Returns:
        Tuple of (token_score, cost_usd)
    """
    if not is_correct:
        return 0.0, 0.0

    # Dollar cost; prices are quoted per 1M tokens.
    cost = (input_tokens * input_price / 1_000_000) + (
        output_tokens * output_price / 1_000_000
    )

    # Map cost onto [0, 1]: full credit below target, none above max.
    if cost <= target_cost:
        fraction = 1.0
    elif cost >= max_cost:
        fraction = 0.0
    else:
        fraction = 1.0 - ((cost - target_cost) / (max_cost - target_cost))

    return 0.10 * fraction, cost
290
+
291
+
292
def compute_reward(
    submitted: Any,
    expected: Any,
    tolerance: float,
    turns: int,
    max_turns: int,
    input_tokens: int = 0,
    output_tokens: int = 0,
) -> Tuple[float, float, float, float]:
    """Combine correctness, turn efficiency, and token cost into one reward.

    Weights: correctness 0.80, efficiency 0.10, token cost 0.10.

    Args:
        submitted: Submitted answer
        expected: Expected answer
        tolerance: Numeric tolerance
        turns: Number of turns used
        max_turns: Maximum turns allowed
        input_tokens: Number of input tokens (optional)
        output_tokens: Number of output tokens (optional)

    Returns:
        Tuple of (total_reward, correctness, efficiency, token_cost_usd)
    """
    # Correctness component (0.80 weight).
    correctness = score_correctness(submitted, expected, tolerance)
    answered_correctly = correctness > 0

    # Efficiency component (0.10 weight).
    efficiency = score_efficiency(turns, max_turns, answered_correctly)

    # Token-cost component (0.10 weight). When the caller did not track
    # tokens, estimate usage from the turn count.
    if input_tokens == 0 and output_tokens == 0:
        input_tokens, output_tokens = turns * 1000, turns * 500

    token_score, cost_usd = score_token_cost(
        input_tokens, output_tokens, answered_correctly
    )

    total = correctness + efficiency + token_score
    return total, correctness, efficiency, cost_usd
codedark/server/tools.py ADDED
@@ -0,0 +1,308 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ CodeDark Tool Implementations
3
+
4
+ Tools available to agents:
5
+ - run_python: Execute Python/pandas code in sandboxed environment
6
+ - read_notes: Read all saved notes from current episode
7
+ - save_note: Save a note for later recall
8
+ - clarify: Ask clarifying question (max 2 per episode)
9
+ - submit_answer: Submit final answer (ends episode)
10
+ """
11
+
12
+ import re
13
+ import ast
14
+ from typing import Any, Dict, List, Optional, Tuple
15
+
16
+ import pandas as pd
17
+ import numpy as np
18
+
19
+
20
# Safe builtins for sandboxed code execution.
# NOTE(review): this is NOT a hard security boundary. exec() with restricted
# builtins can still be escaped via attribute access, and this table exposes
# getattr/globals/locals/vars directly. Treat executed code as semi-trusted.
SAFE_BUILTINS = {
    "len": len,
    "sum": sum,
    "min": min,
    "max": max,
    "abs": abs,
    "round": round,
    "sorted": sorted,
    "range": range,
    "int": int,
    "float": float,
    "str": str,
    "bool": bool,
    "list": list,
    "dict": dict,
    "set": set,
    "tuple": tuple,
    "enumerate": enumerate,
    "zip": zip,
    "True": True,
    "False": False,
    "None": None,
    "print": print,
    "type": type,
    "isinstance": isinstance,
    "map": map,
    "filter": filter,
    "any": any,
    "all": all,
    "hasattr": hasattr,
    "getattr": getattr,
    "repr": repr,
    "locals": locals,
    "globals": globals,
    "dir": dir,
    "vars": vars,
    "reversed": reversed,
    "slice": slice,
    "format": format,
    "Exception": Exception,
    "ValueError": ValueError,
    "TypeError": TypeError,
    "KeyError": KeyError,
    "IndexError": IndexError,
    "AttributeError": AttributeError,
}


def run_python(
    code: str, df: pd.DataFrame, max_output_chars: int = 200
) -> Tuple[str, str, int]:
    """Execute Python code in a restricted environment.

    The code runs with ``pd``, ``np``, and a copy of ``df`` in scope, and must
    store its output in a variable named ``result``.

    Args:
        code: Python code to execute
        df: DataFrame available as 'df' in execution context
        max_output_chars: Maximum characters for output truncation

    Returns:
        Tuple of (stdout, stderr, exit_code)
    """
    if df is None:
        return "", "Error: No dataframe loaded", 1

    # Use a SINGLE dict as both globals and locals for exec(). With two
    # separate dicts, the executed code gets class-body scoping: list/dict
    # comprehensions and nested functions cannot see names assigned at the
    # top level of the snippet (their lookups go through globals), causing
    # spurious NameErrors for perfectly valid user code.
    namespace: Dict[str, Any] = {
        "__builtins__": SAFE_BUILTINS,
        "pd": pd,
        "np": np,
        "df": df.copy(),  # copy so executed code cannot mutate the original
    }

    try:
        exec(code, namespace)  # intentional; see sandbox NOTE on SAFE_BUILTINS
        result = namespace.get("result")

        if result is None:
            return (
                "",
                "Error: No 'result' variable set. Store your result in 'result'.",
                1,
            )

        # Render a short preview of the result.
        if isinstance(result, pd.DataFrame):
            preview = result.head(3).to_string()
        elif isinstance(result, pd.Series):
            preview = result.head(5).to_string()
        else:
            preview = str(result)

        # Truncate if needed.
        if len(preview) > max_output_chars:
            preview = preview[:max_output_chars] + "..."

        return f"run_python Result:\n{preview}", "", 0

    except Exception as e:
        return "", f"run_python Error: {e}", 1
118
+
119
+
120
def read_notes(notes: List[str]) -> Tuple[str, str, int]:
    """Return a formatted listing of all saved notes.

    Args:
        notes: List of saved notes

    Returns:
        Tuple of (stdout, stderr, exit_code)
    """
    # Nothing saved yet: report that rather than an empty listing.
    if not notes:
        return "No notes saved yet.", "", 0

    formatted = "\n".join("- " + note for note in notes)
    return "Saved notes:\n" + formatted, "", 0
134
+
135
+
136
def save_note(content: str, notes: List[str]) -> Tuple[str, str, int]:
    """Append a note to persistent memory and echo the full listing back.

    Args:
        content: Note content to save (whitespace is trimmed)
        notes: List to append the note to (modified in place)

    Returns:
        Tuple of (stdout, stderr, exit_code)
    """
    trimmed = content.strip()
    if not trimmed:
        # Refuse blank notes so the store stays meaningful.
        return "", "Error: Empty note content", 1

    notes.append(trimmed)
    listing = "\n".join("- " + note for note in notes)
    return "Note saved.\n\nAll notes:\n" + listing, "", 0
153
+
154
+
155
def clarify(
    question: str,
    clarify_count: int,
    max_clarifications: int,
    ambiguities: Optional[List[str]] = None,
    answer_type: str = "scalar",
) -> Tuple[str, str, int, int]:
    """Answer a clarifying question using the task's known ambiguities.

    Matches the question against a fixed set of keyword-keyed canned
    responses derived from the task metadata; the first matching keyword
    (in registration order) wins.

    Args:
        question: The clarifying question
        clarify_count: Current number of clarifications used
        max_clarifications: Maximum allowed clarifications
        ambiguities: List of known ambiguities from task metadata
        answer_type: Expected answer type ("scalar", "list", etc.)

    Returns:
        Tuple of (stdout, stderr, exit_code, new_clarify_count)
    """
    # Budget exhausted: refuse without consuming another clarification.
    if clarify_count >= max_clarifications:
        budget_error = (
            f"Error: Maximum {max_clarifications} clarifications per episode. "
            "Please proceed with your best interpretation."
        )
        return "", budget_error, 1, clarify_count

    question_text = question.lower()

    # Ordered (keyword, response) pairs; first keyword found in the
    # question wins, so registration order matters.
    canned: List[Tuple[str, str]] = []

    for ambiguity in ambiguities or []:
        hint = ambiguity.lower()
        if "percentile" in hint or "inclusive" in hint or "exclusive" in hint:
            canned.append((
                "percentile",
                "Use >= for 'top X%' (inclusive of threshold) and <= for 'bottom X%'.",
            ))
        if "rate" in hint or "percentage" in hint:
            canned.append((
                "rate",
                "Express rates as percentages 0-100, rounded to 2 decimal places.",
            ))
        if (
            "positive" in hint
            or "success" in hint
            or "target" in hint
            or "y=1" in hint
        ):
            canned.append((
                "target",
                "Subscription/success means y=1 in the dataset.",
            ))
        if "boundary" in hint:
            canned.append((
                "boundary",
                "Include boundary values (>=, <=) when filtering.",
            ))

    # Output-format guidance is derived from the expected answer type.
    if answer_type == "scalar":
        canned.append((
            "format",
            "Return a single numeric value, rounded to 2 decimal places.",
        ))
    elif answer_type == "list":
        canned.append((
            "format",
            "Return as a list/DataFrame with the specified columns.",
        ))

    # First registered keyword that appears in the question wins.
    for keyword, reply in canned:
        if keyword in question_text:
            return f"Clarification: {reply}", "", 0, clarify_count + 1

    # No keyword matched: give generic guidance (still consumes budget).
    fallback = (
        "Clarification: Please proceed with your best interpretation "
        "based on standard data analysis conventions."
    )
    return fallback, "", 0, clarify_count + 1
240
+
241
+
242
def submit_answer(answer_str: str) -> Tuple[str, str, int, Any]:
    """Parse and submit the final answer.

    Strips surrounding whitespace and a trailing percent sign, then tries
    structured parsing (Python literal), then numeric parsing, and finally
    falls back to the raw string.

    Args:
        answer_str: Answer string to parse and submit

    Returns:
        Tuple of (stdout, stderr, exit_code, parsed_answer)
    """
    cleaned = answer_str.strip().rstrip("%").strip()

    if not cleaned:
        return "", "Error: Empty answer", 1, None

    # Structured data first (lists/dicts/numbers as Python literals).
    try:
        parsed: Any = ast.literal_eval(cleaned)
    except (ValueError, SyntaxError):
        # Not a literal: try plain numeric, else keep the raw string.
        try:
            parsed = float(cleaned)
        except ValueError:
            parsed = cleaned

    return "[SUBMITTED]", "", 0, parsed
267
+
268
+
269
def parse_tool_call(args: str, tool_name: str) -> Tuple[Optional[str], Optional[str]]:
    """Parse tool-specific arguments from a raw args string.

    Tag-wrapped tools (run_python, clarify, submit_answer) have their payload
    extracted from the corresponding XML-style tag; note tools take the raw
    args verbatim.

    Args:
        args: Raw args string
        tool_name: Name of the tool being called

    Returns:
        Tuple of (parsed_content, error_message)
    """
    # Tools whose payload lives inside <tag>...</tag>, with the error text
    # to emit when the tag is missing.
    tagged_tools = {
        "run_python": (
            "code",
            "No <code> tag found. Use: <code>your_code</code>",
        ),
        "clarify": (
            "question",
            "No <question> tag found. Use: <question>your question</question>",
        ),
        "submit_answer": (
            "answer",
            "No <answer> tag found. Use: <answer>value</answer>",
        ),
    }

    if tool_name in tagged_tools:
        tag, missing_msg = tagged_tools[tool_name]
        found = re.search(rf"<{tag}>(.*?)</{tag}>", args, re.DOTALL)
        if found is None:
            return None, missing_msg
        return found.group(1).strip(), None

    # Note tools take their arguments verbatim.
    if tool_name in ("read_notes", "save_note"):
        return args.strip(), None

    return None, f"Unknown tool: {tool_name}"
codedark/tests/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ """CodeDark Tests."""
codedark/tests/test_environment.py ADDED
@@ -0,0 +1,181 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Tests for CodeDark environment."""
2
+
3
+ import pytest
4
+ from pathlib import Path
5
+
6
+ from codedark.models import CodeDarkAction, CodeDarkObservation, CodeDarkState
7
+ from codedark.server.environment import CodeDarkEnvironment
8
+ from codedark.server.scoring import score_correctness, score_efficiency, compute_reward
9
+ from codedark.server.tools import run_python, read_notes, save_note, parse_tool_call
10
+
11
+
12
class TestModels:
    """Unit tests for the Pydantic models."""

    def test_action_creation(self):
        # An action carries the tool name plus its raw argument string.
        act = CodeDarkAction(tool="run_python", args="<code>result = 1</code>")
        assert act.tool == "run_python"
        assert "result = 1" in act.args

    def test_observation_defaults(self):
        # A fresh observation is empty, not done, and unscored.
        observation = CodeDarkObservation()
        assert observation.stdout == ""
        assert observation.done is False
        assert observation.reward is None

    def test_state_defaults(self):
        # A fresh state starts with no notes and a zero turn counter.
        fresh_state = CodeDarkState()
        assert fresh_state.notes == []
        assert fresh_state.turn_count == 0
31
+
32
class TestTools:
    """Unit tests for the tool implementations."""

    def test_parse_run_python(self):
        parsed, err = parse_tool_call("<code>result = df.shape</code>", "run_python")
        assert parsed == "result = df.shape"
        assert err is None

    def test_parse_run_python_missing_tag(self):
        # Bare code without the <code> wrapper must be rejected.
        parsed, err = parse_tool_call("result = df.shape", "run_python")
        assert parsed is None
        assert "No <code> tag" in err

    def test_parse_submit_answer(self):
        parsed, err = parse_tool_call("<answer>42.5</answer>", "submit_answer")
        assert parsed == "42.5"
        assert err is None

    def test_read_notes_empty(self):
        out, _err, status = read_notes([])
        assert "No notes saved" in out
        assert status == 0

    def test_read_notes_with_content(self):
        out, _err, status = read_notes(["Note 1", "Note 2"])
        assert "Note 1" in out
        assert "Note 2" in out
        assert status == 0

    def test_save_note(self):
        store = []
        out, _err, _status = save_note("Test note", store)
        assert "Note saved" in out
        assert store == ["Test note"]
68
+
69
+
70
class TestScoring:
    """Unit tests for the scoring functions."""

    def test_correctness_exact_match(self):
        assert score_correctness(42.5, 42.5, tolerance=0.01) == 0.80

    def test_correctness_within_tolerance(self):
        assert score_correctness(42.505, 42.5, tolerance=0.01) == 0.80

    def test_correctness_wrong(self):
        assert score_correctness(100.0, 42.5, tolerance=0.01) == 0.0

    def test_correctness_scale_error(self):
        # 0.425 is the expected 42.5 off by 100x (decimal vs percentage).
        assert score_correctness(0.425, 42.5, tolerance=0.01) == 0.20

    def test_efficiency_correct_answer(self):
        # One turn out of ten should earn nearly the full 0.10.
        assert score_efficiency(turns=1, max_turns=10, is_correct=True) > 0.08

    def test_efficiency_incorrect_answer(self):
        assert score_efficiency(turns=1, max_turns=10, is_correct=False) == 0.0

    def test_compute_reward_correct(self):
        total, correctness, _efficiency, _cost = compute_reward(
            submitted=42.5,
            expected=42.5,
            tolerance=0.01,
            turns=3,
            max_turns=10,
        )
        # Correctness alone contributes 0.80 of the total.
        assert total > 0.8
        assert correctness == 0.80
108
+
109
+
110
class TestEnvironment:
    """Integration tests for the environment class.

    NOTE(review): these tests depend on the on-disk data files
    (data/tasks/final_25_tasks.jsonl and the LFS-tracked CSVs) being present
    and fetched; they are integration tests, not pure unit tests.
    """

    @pytest.fixture
    def env(self):
        """Create environment with test data."""
        # Resolve the data directory relative to this test file's location.
        data_dir = Path(__file__).parent.parent / "data"
        tasks_path = data_dir / "tasks" / "final_25_tasks.jsonl"
        return CodeDarkEnvironment(
            data_dir=str(data_dir),
            tasks_path=str(tasks_path),
            max_turns=10,
        )

    def test_reset_loads_task(self, env):
        # reset() with no id should pick some task and return a live episode.
        obs = env.reset()
        assert obs.task_id != ""
        assert obs.question != ""
        assert obs.done is False

    def test_reset_specific_task(self, env):
        # Requesting a task by id must load exactly that task.
        obs = env.reset(task_id="bank_hard_001")
        assert obs.task_id == "bank_hard_001"
        assert "subscription rate" in obs.question.lower()

    def test_step_run_python(self, env):
        # A valid run_python call should succeed and echo a result preview.
        env.reset(task_id="bank_hard_001")
        action = CodeDarkAction(
            tool="run_python", args="<code>result = df.shape</code>"
        )
        obs = env.step(action)
        assert obs.exit_code == 0
        assert "run_python Result" in obs.stdout

    def test_step_save_note(self, env):
        # Saving a note should confirm and surface it in the observation.
        env.reset(task_id="bank_hard_001")
        action = CodeDarkAction(tool="save_note", args="Test observation")
        obs = env.step(action)
        assert "Note saved" in obs.stdout
        assert len(obs.notes) == 1

    def test_step_read_notes(self, env):
        env.reset(task_id="bank_hard_001")
        # First save a note
        env.step(CodeDarkAction(tool="save_note", args="Important finding"))
        # Then read notes
        obs = env.step(CodeDarkAction(tool="read_notes", args=""))
        assert "Important finding" in obs.stdout

    def test_step_submit_answer(self, env):
        # Submitting ends the episode and attaches scoring to the observation.
        env.reset(task_id="bank_hard_001")
        action = CodeDarkAction(tool="submit_answer", args="<answer>2.44</answer>")
        obs = env.step(action)
        assert obs.done is True
        assert obs.submitted is True
        assert obs.reward is not None
        # This should be correct answer for bank_hard_001
        assert obs.correctness == 0.80

    def test_turn_counting(self, env):
        # Each step must advance the turn counter by exactly one.
        env.reset()
        assert env.state.turn_count == 0

        env.step(CodeDarkAction(tool="run_python", args="<code>result = 1</code>"))
        assert env.state.turn_count == 1

        env.step(CodeDarkAction(tool="run_python", args="<code>result = 2</code>"))
        assert env.state.turn_count == 2
178
+
179
+
180
# Allow running this module directly (python test_environment.py) in
# addition to pytest discovery.
if __name__ == "__main__":
    pytest.main([__file__, "-v"])
data/bank.csv ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a071417203c9e1434df5fe794fffae6a55327502b00da9c4e9754d2ab7f7cede
3
+ size 65698328
data/road.csv ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7ee5955af18eca0d4b53e13f23bd6436422e40ca84077fb8cdcfa467aa62b68f
3
+ size 37936892
data/tasks/final_25_tasks.jsonl ADDED
@@ -0,0 +1,25 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {"id": "bank_hard_001", "dataset": "bank", "goal": "What's the subscription rate for month='may' AND job='management' AND balance in Q1?", "expected_output_type": "scalar", "level": "L6", "template": "multi_condition_filter", "golden": {"answer_value": 2.44, "answer_type": "scalar", "verification_code": "df['_q'] = pd.qcut(df['balance'], 4, labels=['Q1','Q2','Q3','Q4'], duplicates='drop')\nfiltered = df[(df['month'] == 'may') & (df['job'] == 'management') & (df['_q'] == 'Q1')]\nresult = round((filtered['y'] == 1).mean() * 100, 2) if len(filtered) > 0 else 0.0\ndf.drop('_q', axis=1, inplace=True)", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": ["quartile", "how", "define", "q1", "q2", "q3", "q4", "bins", "bucket", "positive", "success", "target", "y=1", "class", "outcome", "null", "missing", "nan", "empty", "na"], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Complex analysis with 3+ operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 6, "template_name": "multi_condition_filter", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"col_a": "month", "val_a": "may", "col_b": "job", "val_b": "management", "metric": "balance", "quartile": "Q1", "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
2
+ {"id": "bank_hard_005", "dataset": "bank", "goal": "Among customers in top 95% of age AND bottom 10% of duration, what's subscription rate?", "expected_output_type": "scalar", "level": "L6", "template": "percentile_cohort", "golden": {"answer_value": 0.59, "answer_type": "scalar", "verification_code": "p_high = df['age'].quantile(95 / 100)\np_low = df['duration'].quantile(10 / 100)\ncohort = df[(df['age'] >= p_high) & (df['duration'] <= p_low)]\nresult = round((cohort['y'] == 1).mean() * 100, 2) if len(cohort) > 0 else 0.0", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": ["percentile", "inclusive", "exclusive", "boundary", "include", "positive", "success", "target", "y=1", "class", "outcome"], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Complex analysis with 3+ operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 6, "template_name": "percentile_cohort", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"metric_a": "age", "metric_b": "duration", "pct_high": 95, "pct_low": 10, "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
3
+ {"id": "bank_hard_012", "dataset": "bank", "goal": "Find the job with lowest average balance. What is the subscription rate for that segment?", "expected_output_type": "scalar", "level": "L5", "template": "chain_conversion", "golden": {"answer_value": 8.27, "answer_type": "scalar", "verification_code": "group_stats = df.groupby('job')['balance'].mean()\nextrema_val = group_stats.max() if 'lowest' == 'highest' else group_stats.min()\ntied = group_stats[group_stats == extrema_val].sort_index()\nextrema_group = tied.index[0]\nsubset = df[df['job'] == extrema_group]\nresult = round((subset['y'] == 1).mean() * 100, 2)", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": ["positive", "success", "target", "y=1", "class", "outcome", "rate", "percentage", "decimal", "format", "0-100", "0-1"], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Multi-step analysis with 2 operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 5, "template_name": "chain_conversion", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"group_col": "job", "metric": "balance", "extrema": "lowest", "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
4
+ {"id": "bank_hard_019", "dataset": "bank", "goal": "What's the subscription rate for month='may' AND job='management' AND age in Q4?", "expected_output_type": "scalar", "level": "L6", "template": "multi_condition_filter", "golden": {"answer_value": 7.04, "answer_type": "scalar", "verification_code": "df['_q'] = pd.qcut(df['age'], 4, labels=['Q1','Q2','Q3','Q4'], duplicates='drop')\nfiltered = df[(df['month'] == 'may') & (df['job'] == 'management') & (df['_q'] == 'Q4')]\nresult = round((filtered['y'] == 1).mean() * 100, 2) if len(filtered) > 0 else 0.0\ndf.drop('_q', axis=1, inplace=True)", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": ["quartile", "how", "define", "q1", "q2", "q3", "q4", "bins", "bucket", "positive", "success", "target", "y=1", "class", "outcome", "null", "missing", "nan", "empty", "na"], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Complex analysis with 3+ operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 6, "template_name": "multi_condition_filter", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"col_a": "month", "val_a": "may", "col_b": "job", "val_b": "management", "metric": "age", "quartile": "Q4", "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
5
+ {"id": "bank_hard_020", "dataset": "bank", "goal": "For month with above-average day, which have subscription rate > 25%? Return sorted list.", "expected_output_type": "list", "level": "L5", "template": "top_n_in_segment", "golden": {"answer_value": ["oct"], "answer_type": "list", "verification_code": "avg_metric = df.groupby('month')['day'].mean()\nhigh_metric = avg_metric[avg_metric > avg_metric.mean()].index\nrates = df[df['month'].isin(high_metric)].groupby('month')['y'].apply(\n lambda x: (x == 1).mean() * 100)\nresult = sorted(rates[rates > 25].index.tolist())", "tolerance": 0.0}, "tolerance": 0.0, "ambiguities": ["tie", "ties", "equal", "same", "duplicate"], "success_criteria": ["Answer must match expected value", "List elements must match (order matters)"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Multi-step analysis with 2 operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 5, "template_name": "top_n_in_segment", "generator": "qa-gen-v2", "tolerance": 0.0, "slots": {"group_col": "month", "metric": "day", "threshold": 25, "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
6
+ {"id": "bank_hard_021", "dataset": "bank", "goal": "Find the job with highest average balance. What is the subscription rate for that segment?", "expected_output_type": "scalar", "level": "L5", "template": "chain_conversion", "golden": {"answer_value": 24.62, "answer_type": "scalar", "verification_code": "group_stats = df.groupby('job')['balance'].mean()\nextrema_val = group_stats.max() if 'highest' == 'highest' else group_stats.min()\ntied = group_stats[group_stats == extrema_val].sort_index()\nextrema_group = tied.index[0]\nsubset = df[df['job'] == extrema_group]\nresult = round((subset['y'] == 1).mean() * 100, 2)", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": ["positive", "success", "target", "y=1", "class", "outcome", "rate", "percentage", "decimal", "format", "0-100", "0-1"], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Multi-step analysis with 2 operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 5, "template_name": "chain_conversion", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"group_col": "job", "metric": "balance", "extrema": "highest", "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
7
+ {"id": "bank_hard_023", "dataset": "bank", "goal": "Among customers in top 90% of balance AND bottom 10% of day, what's subscription rate?", "expected_output_type": "scalar", "level": "L6", "template": "percentile_cohort", "golden": {"answer_value": 33.53, "answer_type": "scalar", "verification_code": "p_high = df['balance'].quantile(90 / 100)\np_low = df['day'].quantile(10 / 100)\ncohort = df[(df['balance'] >= p_high) & (df['day'] <= p_low)]\nresult = round((cohort['y'] == 1).mean() * 100, 2) if len(cohort) > 0 else 0.0", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": ["percentile", "inclusive", "exclusive", "boundary", "include", "positive", "success", "target", "y=1", "class", "outcome"], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Complex analysis with 3+ operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 6, "template_name": "percentile_cohort", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"metric_a": "balance", "metric_b": "day", "pct_high": 90, "pct_low": 10, "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
8
+ {"id": "bank_hard_026", "dataset": "bank", "goal": "For each job, compute volatility score: std(age) / mean(age). Return top 5 with [group, mean, std, volatility].", "expected_output_type": "dataframe", "level": "L6", "template": "segment_volatility", "golden": {"answer_value": [{"job": "unemployed", "mean": 40.97, "std": 9.74, "volatility": 0.2377}, {"job": "self-employed", "mean": 40.42, "std": 9.46, "volatility": 0.234}, {"job": "admin.", "mean": 39.68, "std": 9.23, "volatility": 0.2326}, {"job": "services", "mean": 38.94, "std": 8.86, "volatility": 0.2275}, {"job": "management", "mean": 40.2, "std": 9.13, "volatility": 0.2271}], "answer_type": "dataframe", "verification_code": "stats = df.groupby('job')['age'].agg(['mean', 'std']).round(2)\nstats['volatility'] = round(stats['std'] / stats['mean'], 4)\nresult = stats.nlargest(5, 'volatility').reset_index()", "tolerance": 0.0}, "tolerance": 0.0, "ambiguities": ["round", "decimal", "precision", "digits", "volatility", "cv", "coefficient", "variation"], "success_criteria": ["Answer must match expected value", "DataFrame shape must match", "Column names must match", "Values must match with numeric tolerance 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Complex analysis with 3+ operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 6, "template_name": "segment_volatility", "generator": "qa-gen-v2", "tolerance": 0.0, "slots": {"group_col": "job", "metric": "age"}}}
9
+ {"id": "bank_hard_028", "dataset": "bank", "goal": "Among customers in top 95% of balance AND bottom 10% of duration, what's subscription rate?", "expected_output_type": "scalar", "level": "L6", "template": "percentile_cohort", "golden": {"answer_value": 0.84, "answer_type": "scalar", "verification_code": "p_high = df['balance'].quantile(95 / 100)\np_low = df['duration'].quantile(10 / 100)\ncohort = df[(df['balance'] >= p_high) & (df['duration'] <= p_low)]\nresult = round((cohort['y'] == 1).mean() * 100, 2) if len(cohort) > 0 else 0.0", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": ["percentile", "inclusive", "exclusive", "boundary", "include", "positive", "success", "target", "y=1", "class", "outcome"], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Complex analysis with 3+ operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 6, "template_name": "percentile_cohort", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"metric_a": "balance", "metric_b": "duration", "pct_high": 95, "pct_low": 10, "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
10
+ {"id": "bank_hard_029", "dataset": "bank", "goal": "Among customers in top 90% of balance AND bottom 25% of duration, what's subscription rate?", "expected_output_type": "scalar", "level": "L6", "template": "percentile_cohort", "golden": {"answer_value": 0.54, "answer_type": "scalar", "verification_code": "p_high = df['balance'].quantile(90 / 100)\np_low = df['duration'].quantile(25 / 100)\ncohort = df[(df['balance'] >= p_high) & (df['duration'] <= p_low)]\nresult = round((cohort['y'] == 1).mean() * 100, 2) if len(cohort) > 0 else 0.0", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": ["percentile", "inclusive", "exclusive", "boundary", "include", "positive", "success", "target", "y=1", "class", "outcome"], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Complex analysis with 3+ operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 6, "template_name": "percentile_cohort", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"metric_a": "balance", "metric_b": "duration", "pct_high": 90, "pct_low": 25, "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
11
+ {"id": "bank_hard_030", "dataset": "bank", "goal": "Rank job by subscription rate. Which bottom-3 have above-median age?", "expected_output_type": "list", "level": "L6", "template": "ranked_anomaly", "golden": {"answer_value": ["blue-collar", "entrepreneur"], "answer_type": "list", "verification_code": "stats = df.groupby('job').agg(\n rate=('y', lambda x: (x == 1).mean()),\n avg_metric=('age', 'mean'))\nstats['rank'] = stats['rate'].rank()\nbottom_3 = stats[stats['rank'] <= 3]\nresult = sorted(bottom_3[bottom_3['avg_metric'] > stats['avg_metric'].median()].index.tolist())", "tolerance": 0.0}, "tolerance": 0.0, "ambiguities": ["positive", "success", "target", "y=1", "class", "outcome", "rank", "order", "sort", "ascending", "descending"], "success_criteria": ["Answer must match expected value", "List elements must match (order matters)"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Complex analysis with 3+ operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 6, "template_name": "ranked_anomaly", "generator": "qa-gen-v2", "tolerance": 0.0, "slots": {"group_col": "job", "metric": "age", "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
12
+ {"id": "bank_hard_031", "dataset": "bank", "goal": "For each month, compute volatility score: std(day) / mean(day). Return top 5 with [group, mean, std, volatility].", "expected_output_type": "dataframe", "level": "L6", "template": "segment_volatility", "golden": {"answer_value": [{"month": "feb", "mean": 6.05, "std": 5.46, "volatility": 0.9025}, {"month": "sep", "mean": 11.67, "std": 8.04, "volatility": 0.6889}, {"month": "mar", "mean": 13.45, "std": 9.18, "volatility": 0.6825}, {"month": "jun", "mean": 11.32, "std": 7.33, "volatility": 0.6475}, {"month": "dec", "mean": 14.19, "std": 8.83, "volatility": 0.6223}], "answer_type": "dataframe", "verification_code": "stats = df.groupby('month')['day'].agg(['mean', 'std']).round(2)\nstats['volatility'] = round(stats['std'] / stats['mean'], 4)\nresult = stats.nlargest(5, 'volatility').reset_index()", "tolerance": 0.0}, "tolerance": 0.0, "ambiguities": ["round", "decimal", "precision", "digits", "volatility", "cv", "coefficient", "variation"], "success_criteria": ["Answer must match expected value", "DataFrame shape must match", "Column names must match", "Values must match with numeric tolerance 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Complex analysis with 3+ operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 6, "template_name": "segment_volatility", "generator": "qa-gen-v2", "tolerance": 0.0, "slots": {"group_col": "month", "metric": "day"}}}
13
+ {"id": "bank_hard_033", "dataset": "bank", "goal": "Show the average balance breakdown by job. Include count and mean balance for each category, sorted by mean descending.", "expected_output_type": "dataframe", "level": "L4", "template": "metric_breakdown", "golden": {"answer_value": [{"job": "retired", "count": 35185, "mean_balance": 1812.07}, {"job": "unknown", "count": 2917, "mean_balance": 1678.96}, {"job": "self-employed", "count": 19020, "mean_balance": 1598.27}, {"job": "student", "count": 11767, "mean_balance": 1577.32}, {"job": "management", "count": 175541, "mean_balance": 1510.39}, {"job": "unemployed", "count": 17634, "mean_balance": 1440.57}, {"job": "entrepreneur", "count": 17718, "mean_balance": 1306.75}, {"job": "housemaid", "count": 15912, "mean_balance": 1281.22}, {"job": "technician", "count": 138107, "mean_balance": 1071.57}, {"job": "admin.", "count": 81492, "mean_balance": 1019.92}, {"job": "blue-collar", "count": 170498, "mean_balance": 977.49}, {"job": "services", "count": 64209, "mean_balance": 834.63}], "answer_type": "dataframe", "verification_code": "breakdown = df.groupby('job').agg(\n count=('balance', 'size'),\n mean_balance=('balance', lambda x: round(x.mean(), 2))\n).sort_values('mean_balance', ascending=False)\nresult = breakdown.reset_index()", "tolerance": 0.0}, "tolerance": 0.0, "ambiguities": ["round", "decimal", "precision", "digits"], "success_criteria": ["Answer must match expected value", "DataFrame shape must match", "Column names must match", "Values must match with numeric tolerance 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Single aggregation or binning operation expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 4, "template_name": "metric_breakdown", "generator": "qa-gen-v2", "tolerance": 0.0, "slots": {"group_col": "job", "metric": "balance"}}}
14
+ {"id": "bank_hard_035", "dataset": "bank", "goal": "Find the job with lowest average age. What is the subscription rate for that segment?", "expected_output_type": "scalar", "level": "L5", "template": "chain_conversion", "golden": {"answer_value": 34.08, "answer_type": "scalar", "verification_code": "group_stats = df.groupby('job')['age'].mean()\nextrema_val = group_stats.max() if 'lowest' == 'highest' else group_stats.min()\ntied = group_stats[group_stats == extrema_val].sort_index()\nextrema_group = tied.index[0]\nsubset = df[df['job'] == extrema_group]\nresult = round((subset['y'] == 1).mean() * 100, 2)", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": ["positive", "success", "target", "y=1", "class", "outcome", "rate", "percentage", "decimal", "format", "0-100", "0-1"], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Multi-step analysis with 2 operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 5, "template_name": "chain_conversion", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"group_col": "job", "metric": "age", "extrema": "lowest", "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
15
+ {"id": "bank_hard_038", "dataset": "bank", "goal": "Find the month with the lowest subscription rate. Within that group, what is the average day?", "expected_output_type": "scalar", "level": "L5", "template": "nested_extrema", "golden": {"answer_value": 16.09, "answer_type": "scalar", "verification_code": "group_rates = df.groupby('month')['y'].apply(lambda x: (x == 1).mean())\nouter_val = group_rates.max() if 'lowest' == 'highest' else group_rates.min()\nouter_tied = group_rates[group_rates == outer_val].sort_index()\nouter_group = outer_tied.index[0]\nsubset = df[df['month'] == outer_group]\nresult = round(subset['day'].mean(), 2)", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": ["tie", "ties", "equal", "same", "duplicate", "positive", "success", "target", "y=1", "class", "outcome"], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Multi-step analysis with 2 operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 5, "template_name": "nested_extrema", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"group_col": "month", "metric": "day", "extrema_outer": "lowest", "extrema_inner": "highest", "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
16
+ {"id": "bank_hard_039", "dataset": "bank", "goal": "Divide customers into 4 day quartiles. What is the subscription percentage (0-100) in the highest (top 25%) (Q4) quartile?", "expected_output_type": "scalar", "level": "L4", "template": "quartile_conversion", "golden": {"answer_value": 11.55, "answer_type": "scalar", "verification_code": "df['_bin'] = pd.qcut(df['day'], 4, labels=['Q1','Q2','Q3','Q4'], duplicates='drop')\nbin_data = df[df['_bin'] == 'Q4']\nresult = round((bin_data['y'] == 1).mean() * 100, 2)\ndf.drop('_bin', axis=1, inplace=True)", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": ["quartile", "how", "define", "q1", "q2", "q3", "q4", "bins", "bucket", "positive", "success", "target", "y=1", "class", "outcome", "rate", "percentage", "decimal", "format", "0-100", "0-1"], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Single aggregation or binning operation expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 4, "template_name": "quartile_conversion", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"bin_col": "day", "quartile": "Q4", "quartile_desc": "highest (top 25%)", "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
17
+ {"id": "bank_hard_040", "dataset": "bank", "goal": "Among customers in top 95% of balance AND bottom 25% of duration, what's subscription rate?", "expected_output_type": "scalar", "level": "L6", "template": "percentile_cohort", "golden": {"answer_value": 0.65, "answer_type": "scalar", "verification_code": "p_high = df['balance'].quantile(95 / 100)\np_low = df['duration'].quantile(25 / 100)\ncohort = df[(df['balance'] >= p_high) & (df['duration'] <= p_low)]\nresult = round((cohort['y'] == 1).mean() * 100, 2) if len(cohort) > 0 else 0.0", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": ["percentile", "inclusive", "exclusive", "boundary", "include", "positive", "success", "target", "y=1", "class", "outcome"], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Complex analysis with 3+ operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 6, "template_name": "percentile_cohort", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"metric_a": "balance", "metric_b": "duration", "pct_high": 95, "pct_low": 25, "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
18
+ {"id": "bank_hard_041", "dataset": "bank", "goal": "Which job categories would have the biggest impact if brought to average subscription rate? Return top 3 by potential gain (count * rate gap), sorted by impact.", "expected_output_type": "list", "level": "L5", "template": "segment_improvement_potential", "golden": {"answer_value": ["blue-collar", "services", "entrepreneur"], "answer_type": "list", "verification_code": "overall_rate = (df['y'] == 1).mean()\ngroup_stats = df.groupby('job').agg(\n rate=('y', lambda x: (x == 1).mean()),\n count=('y', 'size')\n)\ngroup_stats['gap'] = overall_rate - group_stats['rate']\ngroup_stats['potential'] = group_stats['count'] * group_stats['gap']\ntop_potential = group_stats[group_stats['gap'] > 0].nlargest(3, 'potential')\nresult = top_potential.index.tolist()", "tolerance": 0.0}, "tolerance": 0.0, "ambiguities": ["positive", "success", "target", "y=1", "class", "outcome", "rate", "percentage", "decimal", "format", "0-100", "0-1"], "success_criteria": ["Answer must match expected value", "List elements must match (order matters)"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Multi-step analysis with 2 operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 5, "template_name": "segment_improvement_potential", "generator": "qa-gen-v2", "tolerance": 0.0, "slots": {"group_col": "job", "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
19
+ {"id": "bank_hard_044", "dataset": "bank", "goal": "Find the month with the lowest subscription rate. Within that group, what is the average age?", "expected_output_type": "scalar", "level": "L5", "template": "nested_extrema", "golden": {"answer_value": 38.98, "answer_type": "scalar", "verification_code": "group_rates = df.groupby('month')['y'].apply(lambda x: (x == 1).mean())\nouter_val = group_rates.max() if 'lowest' == 'highest' else group_rates.min()\nouter_tied = group_rates[group_rates == outer_val].sort_index()\nouter_group = outer_tied.index[0]\nsubset = df[df['month'] == outer_group]\nresult = round(subset['age'].mean(), 2)", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": ["tie", "ties", "equal", "same", "duplicate", "positive", "success", "target", "y=1", "class", "outcome"], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Multi-step analysis with 2 operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 5, "template_name": "nested_extrema", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"group_col": "month", "metric": "age", "extrema_outer": "lowest", "extrema_inner": "lowest", "target_col": "y", "target_pos": "1", "target_desc": "subscription"}}}
20
+ {"id": "road_hard_014", "dataset": "road", "goal": "Which lighting categories have the highest total reported accidents? Show breakdown with [lighting, count, total_num_reported_accidents, avg_num_reported_accidents] sorted by total descending.", "expected_output_type": "dataframe", "level": "L4", "template": "count_segment_total", "golden": {"answer_value": [{"lighting": "dim", "count": 183826, "total_num_reported_accidents": 211283, "avg_num_reported_accidents": 1.15}, {"lighting": "daylight", "count": 178015, "total_num_reported_accidents": 207579, "avg_num_reported_accidents": 1.17}, {"lighting": "night", "count": 155913, "total_num_reported_accidents": 196214, "avg_num_reported_accidents": 1.26}], "answer_type": "dataframe", "verification_code": "breakdown = df.groupby('lighting').agg(\n count=('num_reported_accidents', 'size'),\n total_num_reported_accidents=('num_reported_accidents', 'sum'),\n avg_num_reported_accidents=('num_reported_accidents', lambda x: round(x.mean(), 2))\n).sort_values('total_num_reported_accidents', ascending=False)\nresult = breakdown.reset_index()", "tolerance": 0.0}, "tolerance": 0.0, "ambiguities": [], "success_criteria": ["Answer must match expected value", "DataFrame shape must match", "Column names must match", "Values must match with numeric tolerance 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Single aggregation or binning operation expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 4, "template_name": "count_segment_total", "generator": "qa-gen-v2", "tolerance": 0.0, "slots": {"group_col": "lighting", "target_col": "num_reported_accidents", "target_desc": "reported accidents"}}}
21
+ {"id": "road_hard_021", "dataset": "road", "goal": "Which weather categories have the highest average accident risk? Show breakdown with [weather, count, avg_accident_risk] sorted by average descending.", "expected_output_type": "dataframe", "level": "L4", "template": "continuous_segment_breakdown", "golden": {"answer_value": [{"weather": "foggy", "count": 181463, "avg_accident_risk": 0.3863}, {"weather": "rainy", "count": 156985, "avg_accident_risk": 0.3615}, {"weather": "clear", "count": 179306, "avg_accident_risk": 0.3101}], "answer_type": "dataframe", "verification_code": "breakdown = df.groupby('weather').agg(\n count=('accident_risk', 'size'),\n avg_accident_risk=('accident_risk', lambda x: round(x.mean(), 4))\n).sort_values('avg_accident_risk', ascending=False)\nresult = breakdown.reset_index()", "tolerance": 0.0}, "tolerance": 0.0, "ambiguities": [], "success_criteria": ["Answer must match expected value", "DataFrame shape must match", "Column names must match", "Values must match with numeric tolerance 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Single aggregation or binning operation expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 4, "template_name": "continuous_segment_breakdown", "generator": "qa-gen-v2", "tolerance": 0.0, "slots": {"group_col": "weather", "target_col": "accident_risk", "target_desc": "accident risk"}}}
22
+ {"id": "road_hard_015", "dataset": "road", "goal": "Divide records into 4 speed_limit quartiles. What is the average accident risk in the lower-middle (25-50%) (Q2) quartile?", "expected_output_type": "scalar", "level": "L4", "template": "continuous_quartile_analysis", "golden": {"answer_value": 0.29, "answer_type": "scalar", "verification_code": "df['_bin'] = pd.qcut(df['speed_limit'], 4, labels=['Q1','Q2','Q3','Q4'], duplicates='drop')\nbin_data = df[df['_bin'] == 'Q2']\nresult = round(bin_data['accident_risk'].mean(), 4)\ndf.drop('_bin', axis=1, inplace=True)", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": [], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Single aggregation or binning operation expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 4, "template_name": "continuous_quartile_analysis", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"bin_col": "speed_limit", "quartile": "Q2", "quartile_desc": "lower-middle (25-50%)", "target_col": "accident_risk", "target_desc": "accident risk"}}}
23
+ {"id": "road_hard_002", "dataset": "road", "goal": "Divide records into 4 curvature quartiles. What is the average accident risk in the upper-middle (50-75%) (Q3) quartile?", "expected_output_type": "scalar", "level": "L4", "template": "continuous_quartile_analysis", "golden": {"answer_value": 0.41, "answer_type": "scalar", "verification_code": "df['_bin'] = pd.qcut(df['curvature'], 4, labels=['Q1','Q2','Q3','Q4'], duplicates='drop')\nbin_data = df[df['_bin'] == 'Q3']\nresult = round(bin_data['accident_risk'].mean(), 4)\ndf.drop('_bin', axis=1, inplace=True)", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": [], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Single aggregation or binning operation expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 4, "template_name": "continuous_quartile_analysis", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"bin_col": "curvature", "quartile": "Q3", "quartile_desc": "upper-middle (50-75%)", "target_col": "accident_risk", "target_desc": "accident risk"}}}
24
+ {"id": "road_hard_007", "dataset": "road", "goal": "How much higher is the average accident risk for lighting='daylight' compared to 'night'? Return difference.", "expected_output_type": "scalar", "level": "L5", "template": "continuous_comparison", "golden": {"answer_value": -0.17, "answer_type": "scalar", "verification_code": "avg_a = df[df['lighting'] == 'daylight']['accident_risk'].mean()\navg_b = df[df['lighting'] == 'night']['accident_risk'].mean()\nresult = round(avg_a - avg_b, 4)", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": [], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Multi-step analysis with 2 operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 5, "template_name": "continuous_comparison", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"group_col": "lighting", "val_a": "daylight", "val_b": "night", "target_col": "accident_risk", "target_desc": "accident risk"}}}
25
+ {"id": "road_hard_004", "dataset": "road", "goal": "How much higher is the average accident risk for road_type='rural' compared to 'urban'? Return difference.", "expected_output_type": "scalar", "level": "L5", "template": "continuous_comparison", "golden": {"answer_value": -0.01, "answer_type": "scalar", "verification_code": "avg_a = df[df['road_type'] == 'rural']['accident_risk'].mean()\navg_b = df[df['road_type'] == 'urban']['accident_risk'].mean()\nresult = round(avg_a - avg_b, 4)", "tolerance": 0.01}, "tolerance": 0.01, "ambiguities": [], "success_criteria": ["Answer must match expected value", "Numeric tolerance: 0.01"], "constraints": ["Use pandas for data manipulation", "Store final answer in 'result' variable", "Multi-step analysis with 2 operations expected"], "relationships": [], "clarification_budget": 2, "metadata": {"difficulty_level": 5, "template_name": "continuous_comparison", "generator": "qa-gen-v2", "tolerance": 0.01, "slots": {"group_col": "road_type", "val_a": "rural", "val_b": "urban", "target_col": "accident_risk", "target_desc": "accident risk"}}}
requirements.txt ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ fastapi>=0.115.0
2
+ pydantic>=2.0.0
3
+ uvicorn[standard]>=0.24.0
4
+ pandas>=2.0.0
5
+ numpy>=1.24.0
6
+ requests>=2.31.0