Spaces: avanigupta/dataqa-env


Varshith Bathini committed on Apr 8

Commit ca01572 (unverified) · 2 Parent(s): 4c1a85d 85257bc

Merge pull request #1 from varshith15/enhancementsv1


Files changed (23)

.gitignore +4 -0
Dockerfile +4 -4
README.md +251 -45
__init__.py +0 -3
client.py +0 -5
dataqa_env/__init__.py +16 -1
dataqa_env/models.py +12 -10
dataqa_env/server/app.py +12 -1
dataqa_env/server/environment.py +346 -24
dataqa_env/server/gradio_ui.py +508 -0
dataqa_env/server/tasks.py +272 -39
inference.py +198 -144
models.py +0 -4
openenv.yaml +1 -1
scripts/prevalidation_script.sh +185 -0
scripts/sample_inference_script.py +188 -0
server/__init__.py +1 -0
server/app.py +3 -4
tests/__init__.py +0 -0
tests/test_environment.py +455 -0
tests/test_extensibility.py +215 -0
tests/test_inference.py +191 -0
tests/test_tasks.py +162 -0

.gitignore CHANGED

@@ -6,4 +6,8 @@ build/
 .venv/
 *.egg
 .env
 uv.lock

 .venv/
 *.egg
 .env
+.claude/
+.pytest_cache/
 uv.lock
+*.mov
+docs/*.png

Dockerfile CHANGED

@@ -26,10 +26,10 @@ RUN uv sync --no-editable 2>/dev/null || pip install -e .
 ENV PATH="/app/.venv/bin:$PATH"
 ENV PYTHONPATH="/app:$PYTHONPATH"
-# Health check
 HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
-    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
-EXPOSE 8000
-CMD ["uvicorn", "dataqa_env.server.app:app", "--host", "0.0.0.0", "--port", "8000"]

 ENV PATH="/app/.venv/bin:$PATH"
 ENV PYTHONPATH="/app:$PYTHONPATH"
+# Health check — HF Spaces uses port 7860
 HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:7860/health')" || exit 1
+EXPOSE 7860
+CMD ["uvicorn", "dataqa_env.server.app:app", "--host", "0.0.0.0", "--port", "7860"]

README.md CHANGED

@@ -1,65 +1,237 @@
 ---
 title: DataQA Environment Server
-emoji: 🔍
 colorFrom: blue
 colorTo: gray
 sdk: docker
 pinned: false
-app_port: 8000
-base_path: /web
 tags:
   - openenv
 ---
 # DataQA Environment
-An OpenEnv environment for **Data Quality Assurance** — an LLM agent inspects datasets with planted quality issues and must identify them all.
-## Overview
-DataQA simulates the real-world task of validating datasets before they enter ML training pipelines or production databases. The agent receives a corrupted dataset along with its schema and validation rules, then must identify all planted data quality issues.
-### Why Data QA?
-Every ML engineer and data scientist spends significant time debugging data quality issues — missing values, type mismatches, inconsistencies, and subtle statistical anomalies. This environment turns that task into a structured, gradable challenge.
 ## Environment API
-| Endpoint | Description |
-|----------|-------------|
-| `reset(task_id)` | Start a new episode with a corrupted dataset |
-| `step(issues)` | Submit identified issues, receive F1-scored feedback |
-| `state()` | Get current episode state |
 ## Tasks
-| Task | Issues | Difficulty | Description |
-|------|--------|-----------|-------------|
-| `easy` | 4 | Beginner | Employee directory — nulls, wrong types, duplicates, out-of-range |
-| `medium` | 6 | Intermediate | E-commerce orders — format violations, inconsistent totals, duplicate keys |
-| `hard` | 8 | Advanced | ML experiment metadata — data leakage signals, unreasonable GPU usage, timestamp ordering |
 ## Reward Function
-Scoring uses **F1 score** (harmonic mean of precision and recall):
-- **Precision**: What fraction of reported issues are real?
-- **Recall**: What fraction of planted issues did the agent find?
-- **F1**: `2 * precision * recall / (precision + recall)`
-Issues are matched by `row:<N>,col:<column>,issue:<type>` keys.
-The agent gets up to 3 attempts per task with feedback on each attempt (true positives, false positives, missed count).
-## Action/Observation Space
-**Action**: List of issue strings in format `row:<row_number>,col:<column_name>,issue:<issue_type>`
-**Observation**: Dataset CSV + schema + validation rules + feedback from previous attempt
-**Issue Types**: `missing_value`, `wrong_type`, `duplicate_row`, `out_of_range`, `format_violation`, `inconsistent_value`, `statistical_outlier`, `referential_integrity`
-## Quick Start
 ```bash
 # Install
@@ -68,42 +240,76 @@ pip install -e .
 # Run server locally
 uvicorn dataqa_env.server.app:app --host 0.0.0.0 --port 8000
-# Run inference
-API_BASE_URL=https://api.groq.com/openai/v1 \
-MODEL_NAME=llama-3.3-70b-versatile \
-LLM_API_KEY=your-key \
 python inference.py
 ```
 ## Docker
 ```bash
-docker build -t dataqa-env -f dataqa_env/server/Dockerfile .
 docker run -p 8000:8000 dataqa-env
 ```
 ## Environment Variables
 | Variable | Description | Default |
 |----------|-------------|---------|
-| `API_BASE_URL` | LLM API endpoint | `https://api.groq.com/openai/v1` |
-| `MODEL_NAME` | Model identifier | `llama-3.3-70b-versatile` |
-| `HF_TOKEN` | HuggingFace token | - |
 | `ENV_URL` | Environment server URL | `http://localhost:8000` |
-| `LLM_API_KEY` | API key for LLM provider | Falls back to HF_TOKEN |
 ## Architecture
 ```
 dataqa_env/
-├── models.py              # Pydantic: DataQAAction, DataQAObservation, DataQAState
 ├── client.py              # EnvClient for WebSocket connections
 ├── server/
-│   ├── environment.py     # Core DataQAEnvironment (reset/step/state)
-│   ├── tasks.py           # Task definitions + data corruption + grading
-│   ├── app.py             # FastAPI server
 │   └── Dockerfile
-├── openenv.yaml
-├── pyproject.toml
-└── inference.py           # LLM agent using OpenAI client
 ```

 ---
 title: DataQA Environment Server
+emoji: "\U0001F50D"
 colorFrom: blue
 colorTo: gray
 sdk: docker
 pinned: false
+app_port: 7860
 tags:
   - openenv
 ---
 # DataQA Environment
+A two-phase OpenEnv RL environment for **Data Quality Assurance** — an LLM agent inspects corrupted datasets, identifies all planted quality issues, and proposes data repairs.
+### Demo: Agent Trajectory Replay
+```
+EASY TASK (Step 2) — All 6 issues found + 5 fixes proposed
+  Reward: 0.87 | Identify: 1.00 | Fix: 0.67
+  ✓ row:4  name: empty → "David Kim"          (fix correct)
+  ✓ row:7  salary: "seventy-five thousand" → "75000"  (fix correct)
+  ✓ row:9  salary: "5000" → "73000"           (fix correct)
+  ✓ row:15 email: mismatch → "oscar.rivera@company.com" (fix correct)
+  ✓ row:18 start_date: "2027-06-15" → "2022-01-19"     (fix correct)
+  ✓ row:21 duplicate row detected
+HARD TASK (Step 1 → Step 2)
+  Step 1: Found 5/10, missed hard issues    → Reward: 0.69
+  Step 2: Found 10/10 + 5 fixes proposed   → Reward: 0.77
+  Issues requiring ML knowledge:
+    • val_loss < train_loss (data leakage signal)
+    • resnet18 using 42.5GB GPU (impossible)
+    • 350 epochs on ImageNet in 30 min (impossible)
+    • wav2vec2 at 98.5% accuracy (exceeds SOTA)
+```
+> The interactive replay UI with color-coded dataset visualization is available on the HF Space.
+## Motivation
+Every ML engineer and data scientist spends significant time debugging data quality issues — missing values, type mismatches, logical inconsistencies, and subtle statistical anomalies — before data enters ML pipelines or production databases. This is a genuine, high-frequency human task that directly impacts model quality and business outcomes.
+DataQA turns this into a **two-phase RL challenge**:
+1. **Identify** — systematically inspect corrupted data and pinpoint every planted issue
+2. **Fix** — propose corrected values by reasoning about schema, constraints, and context
+This creates a rich multi-step decision problem where agents must explore datasets strategically, distinguish subtle anomalies from noise, and reason about what the correct data should be.
 ## Environment API
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/reset` | POST | Start a new episode with a corrupted dataset |
+| `/step` | POST | Submit identified issues + proposed fixes |
+| `/state` | GET | Get current episode state |
+| `/health` | GET | Health check |
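
As a rough illustration of the endpoint table above, the sketch below builds (but does not send) one HTTP request per endpoint using only the standard library. The `build_request` helper is hypothetical, and the JSON body shapes, including the `action` wrapper key, are assumptions: the commit does not show the wire format, so check them against a running server before relying on them.

```python
import json
import urllib.request
from typing import Optional


def build_request(base_url: str, path: str, method: str,
                  body: Optional[dict] = None) -> urllib.request.Request:
    """Build (without sending) a request for one endpoint from the table."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return urllib.request.Request(
        base_url.rstrip("/") + path, data=data, headers=headers, method=method
    )


BASE_URL = "http://localhost:8000"  # default ENV_URL listed in the README

# Assumed payload shapes (not confirmed by the commit).
reset_req = build_request(BASE_URL, "/reset", "POST", {"task_id": "easy"})
step_req = build_request(
    BASE_URL,
    "/step",
    "POST",
    {"action": {
        "issues": ["row:4,col:name,issue:missing_value"],
        "fixes": [],
        "task_id": "easy",
    }},
)
state_req = build_request(BASE_URL, "/state", "GET")
health_req = build_request(BASE_URL, "/health", "GET")
```

Passing any of these objects to `urllib.request.urlopen` would perform the call once a server is listening at `BASE_URL`.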
 ## Tasks
+| Task | Issues | Difficulty | Domain | Description |
+|------|--------|-----------|--------|-------------|
+| `easy` | 6 | Beginner | HR/Employee data (21 rows) | Nulls, wrong types, duplicates, out-of-range, email-name mismatch, future dates |
+| `medium` | 8 | Intermediate | E-commerce orders (31 rows) | Inconsistent totals, invalid categories, duplicate keys, wrong date formats, invalid country codes, future-date deliveries |
+| `hard` | 10 | Advanced | ML experiment metadata (31 rows) | Data leakage signals, unreasonable GPU memory, impossibly fast training, SOTA-exceeding accuracy, timestamp ordering, whitespace-only fields |
+**Difficulty progression**: Easy issues are individually obvious (empty fields, text in numeric columns). Medium issues require cross-column reasoning (total != qty * price) and set membership checks. Hard issues require ML domain knowledge (val_loss < train_loss = data leakage) and multi-row temporal reasoning.
+## Two-Phase Action Space
+### Phase 1: Identify Issues
+Submit issues in format: `row:<row_number>,col:<column_name>,issue:<issue_type>`
+- `row_number`: 1-indexed data row position (after header)
+- `column_name`: Exact column header name, lowercase
+- `issue_type`: One of the supported types below
+### Phase 2: Propose Fixes
+Submit fixes in format: `row:<row_number>,col:<column_name>,fix:<corrected_value>`
+The agent proposes the **correct value** that should replace the corrupted data. Fixes are graded against the original clean dataset.
+Both phases can be submitted in the same step or across multiple steps.
+**Supported Issue Types:**
+| Type | Description | Example |
+|------|-------------|---------|
+| `missing_value` | Null, empty, or whitespace-only | Empty name field |
+| `wrong_type` | Value doesn't match expected type | Salary as "seventy-five thousand" |
+| `duplicate_row` | Exact duplicate or duplicate key | Two rows with same employee_id |
+| `out_of_range` | Value outside valid range | Salary of 5000 when min is 50000 |
+| `format_violation` | Wrong format or invalid enum | Date as DD/MM/YYYY instead of YYYY-MM-DD |
+| `inconsistent_value` | Computed field mismatch, logical inconsistency | total != qty * price |
+| `statistical_outlier` | Unreasonable value given context | resnet18 using 42.5GB GPU |
+| `referential_integrity` | Foreign key violation | (available for custom tasks) |
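
The two string formats described above can be produced with small helpers. This is a minimal sketch: `format_issue` and `format_fix` are hypothetical names, not part of the package, while the `issues`, `fixes`, and `task_id` fields follow `DataQAAction` in this commit. The example values are taken from the easy-task replay in the README.

```python
# Issue types listed in the "Supported Issue Types" table.
ISSUE_TYPES = {
    "missing_value", "wrong_type", "duplicate_row", "out_of_range",
    "format_violation", "inconsistent_value", "statistical_outlier",
    "referential_integrity",
}


def format_issue(row: int, col: str, issue_type: str) -> str:
    """Phase 1 string: row is 1-indexed, column name is lowercased."""
    if issue_type not in ISSUE_TYPES:
        raise ValueError(f"unsupported issue type: {issue_type}")
    return f"row:{row},col:{col.lower()},issue:{issue_type}"


def format_fix(row: int, col: str, value: str) -> str:
    """Phase 2 string: the proposed replacement value for one cell."""
    return f"row:{row},col:{col.lower()},fix:{value}"


# Both phases submitted in a single step.
action = {
    "issues": [
        format_issue(4, "name", "missing_value"),
        format_issue(7, "salary", "wrong_type"),
    ],
    "fixes": [
        format_fix(4, "name", "David Kim"),
        format_fix(7, "salary", "75000"),
    ],
    "task_id": "easy",
}
```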
+## Observation Space
+| Field | Type | Description |
+|-------|------|-------------|
+| `dataset_csv` | str | The corrupted dataset in CSV format |
+| `schema_description` | str | Column types, ranges, and constraints |
+| `validation_rules` | str | Business rules the data must satisfy |
+| `task_description` | str | Task context and instructions |
+| `feedback` | str | Per-step results: TP/FP/FN, precision/recall, fix scores |
+| `num_issues_hint` | int | Exact count of planted issues |
+| `max_steps` | int | Maximum attempts allowed |
+| `done` | bool | Whether episode has terminated |
+| `reward` | float | Best combined reward so far (0.0-1.0) |
+**Observation Metadata** (per step):
+- Identify: `identify_f1`, `identify_score`, `precision`, `recall`, `tp`, `fp`, `fn`
+- Fix: `fix_score`, `fixes_correct`, `fixes_partial`, `fixes_wrong`, `fixes_attempted`
+- Combined: `combined_reward`, `difficulty_found`, `difficulty_missed`
 ## Reward Function
+### Combined Reward
+```
+combined_reward = 0.6 * identify_score + 0.4 * fix_score
+```
+If no fixes are submitted, `combined_reward = identify_score` (no penalty — backward compatible).
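
The formula and its no-fix fallback can be checked with a few lines of Python. This is a sketch, not the environment's code: the two weights match `IDENTIFY_WEIGHT` and `FIX_WEIGHT` in `environment.py`, but the `combined_reward` function and the use of `None` to mean "no fixes submitted" are illustrative assumptions.

```python
from typing import Optional

IDENTIFY_WEIGHT = 0.6
FIX_WEIGHT = 0.4


def combined_reward(identify_score: float, fix_score: Optional[float] = None) -> float:
    """Blend the two phase scores; fall back to identify-only when no fixes exist."""
    if fix_score is None:  # assumption: None stands for "no fixes submitted"
        return identify_score
    return IDENTIFY_WEIGHT * identify_score + FIX_WEIGHT * fix_score


# Easy-task replay from the README: Identify 1.00, Fix 0.67.
easy_reward = combined_reward(1.00, 0.67)  # 0.6 * 1.00 + 0.4 * 0.67 = 0.868
identify_only = combined_reward(0.8)       # stays at the identify score
```

The value 0.868 is consistent with the reward of 0.87 reported for the easy task in the replay.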
+### Identify Score (Difficulty-Weighted F1)
+Each planted issue has a **difficulty weight** (1.0-3.0):
+| Weight | Category | Examples |
+|--------|----------|----------|
+| 1.0 | Easy | Missing values, obvious out-of-range, wrong type |
+| 1.5-2.0 | Medium | Duplicate keys, format violations, cross-column checks |
+| 2.5-3.0 | Hard | Data leakage, statistical outliers, whitespace-only |
+- **Weighted Recall** = (difficulty of found issues) / (total difficulty)
+- **Weighted Precision** = penalizes false positives proportional to average difficulty
+- **Weighted F1** = harmonic mean
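
A small worked example may make the three bullets concrete. The sketch below follows the same arithmetic as `compute_weighted_reward` in `environment.py`, but the `weighted_f1` function, its dictionary input, and the issue keys are simplified stand-ins chosen for illustration.

```python
from typing import Dict, Set


def weighted_f1(difficulties: Dict[str, float], reported: Set[str]) -> float:
    """Difficulty-weighted F1 over planted issue keys and reported keys."""
    planted = set(difficulties)
    if not planted and not reported:
        return 1.0
    if not planted or not reported:
        return 0.0
    total = sum(difficulties.values())
    found = sum(difficulties[k] for k in reported & planted)
    false_positives = len(reported - planted)
    avg_difficulty = total / len(difficulties)
    recall = found / total
    # Each false positive costs as much as one issue of average difficulty.
    denom = found + false_positives * avg_difficulty
    precision = found / denom if denom > 0 else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Three planted issues with weights 1, 2, 3; the agent finds the first two
# and reports one false positive ("x").
score = weighted_f1({"a": 1.0, "b": 2.0, "c": 3.0}, {"a", "b", "x"})
# recall = 3/6 = 0.5, precision = 3/(3 + 1*2) = 0.6, F1 = 6/11
```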
+### Fix Score (Difficulty-Weighted Quality)
+Each proposed fix is compared against the original clean value:
+| Fix Quality | Score | Description |
+|-------------|-------|-------------|
+| Exact match | 1.0 | Case-insensitive, whitespace-stripped match |
+| Numeric close | 0.8 | Within 1% of correct numeric value |
+| Correct cell | 0.1 | Right location, wrong value |
+| Non-issue cell | 0.0 | Fix targets a cell with no issue |
+Fix score = (sum of best fix score per issue × difficulty weight) / (total difficulty weight)
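
The quality tiers and the weighted aggregate can be sketched as follows. The helper names `score_fix` and `fix_score` and the `(row, col)` dictionary keys are assumptions made for this example. Treating numerically equal values as a full match mirrors the "exact numeric match" case in `grade_fixes`.

```python
from typing import Dict, Tuple

Cell = Tuple[int, str]


def score_fix(proposed: str, clean: str, is_issue_cell: bool = True) -> float:
    """Score one proposed fix against the clean value, per the table above."""
    if not is_issue_cell:
        return 0.0
    if proposed.strip().lower() == clean.strip().lower():
        return 1.0
    try:
        proposed_num, clean_num = float(proposed), float(clean)
    except ValueError:
        return 0.1  # right cell, non-numeric wrong value
    if proposed_num == clean_num:
        return 1.0
    if clean_num != 0 and abs(proposed_num - clean_num) / abs(clean_num) <= 0.01:
        return 0.8
    return 0.1


def fix_score(best_scores: Dict[Cell, float], difficulties: Dict[Cell, float]) -> float:
    """Difficulty-weighted average of the best fix score per planted issue."""
    total = sum(difficulties.values())
    earned = sum(best_scores.get(cell, 0.0) * w for cell, w in difficulties.items())
    return earned / total if total else 0.0


weights = {(4, "name"): 1.0, (9, "salary"): 2.0}
best = {
    (4, "name"): score_fix("David Kim", "david kim"),  # exact match
    (9, "salary"): score_fix("73100", "73000"),        # within 1%
}
overall = fix_score(best, weights)  # (1.0 * 1.0 + 0.8 * 2.0) / 3.0
```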
+### Reward Properties
+- **Per-step partial progress**: reward increases as more issues are found/fixed
+- **Difficulty-aware**: finding subtle issues earns more than obvious ones
+- **Penalizes bad behavior**: false positives reduce score, fixing non-issues earns nothing
+- **Monotonically non-decreasing**: best score across all steps is the final reward
+- **Always in [0.0, 1.0]**: meets hackathon requirement
+### Episode Boundaries
+- Each task allows up to 3 steps (attempts)
+- Episode ends when F1 >= 0.999 (perfect identification) or max steps reached
+- Agent receives detailed feedback after each step to improve on next attempt
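
The best-score and termination rules above can be simulated without a server. This is a sketch under stated assumptions: `run_episode` is a hypothetical helper, each step is reduced to an `(identify_f1, combined_reward)` pair, and the 0.999 threshold is applied to the identify F1 as the README describes.

```python
from typing import List, Tuple


def run_episode(step_results: List[Tuple[float, float]],
                max_steps: int = 3,
                perfect_f1: float = 0.999) -> List[Tuple[int, float, bool]]:
    """Return (step, best_reward_so_far, done) for each attempt in an episode."""
    best = 0.0
    trace = []
    for step, (identify_f1, combined) in enumerate(step_results, start=1):
        best = max(best, combined)  # reported reward never decreases
        done = identify_f1 >= perfect_f1 or step >= max_steps
        trace.append((step, best, done))
        if done:
            break
    return trace


# A weaker second attempt does not lower the reward; the third attempt
# reaches perfect identification and ends the episode.
trace = run_episode([(0.6, 0.69), (0.5, 0.55), (1.0, 0.77)])
```

Here the reward reported after step 2 remains 0.69 even though that step scored only 0.55.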
+## Baseline Scores
+Baseline agent uses Qwen2.5-72B-Instruct via HuggingFace Router:
+| Task | Identify Score | Fix Score | Combined | Notes |
+|------|---------------|-----------|----------|-------|
+| `easy` | 0.7-1.0 | 0.5-0.9 | 0.6-1.0 | Most LLMs find obvious issues reliably |
+| `medium` | 0.5-0.8 | 0.3-0.6 | 0.4-0.7 | Cross-column reasoning challenges models |
+| `hard` | 0.3-0.6 | 0.2-0.4 | 0.3-0.5 | ML domain knowledge and subtle patterns |
+Scores vary by model. The hard task is designed to challenge frontier models.
+## Extensibility
+### Custom Contamination Rules
+```python
+from dataqa_env import register_contamination_rule
+from dataqa_env.server.tasks import PlantedIssue
+def swap_digits(rows, header, col_idx, row_idx, rng):
+    val = rows[row_idx][col_idx]
+    corrupted = val[::-1]
+    issue = PlantedIssue(
+        row=row_idx + 1, col=header[col_idx],
+        issue_type="format_violation",
+        description=f"Digits swapped in {header[col_idx]}",
+        difficulty=2.0,
+    )
+    return corrupted, issue
+register_contamination_rule("swap_digits", swap_digits)
+```
+### Custom Tasks from Config
+```python
+from dataqa_env import create_task_from_config, register_task
+task = create_task_from_config(
+    task_id="custom",
+    name="Custom Validation",
+    description="Find quality issues in this dataset.",
+    schema_description="id: int, name: str, score: int (0-100)",
+    validation_rules="No missing values. Scores must be 0-100.",
+    clean_csv="id,name,score\n1,Alice,95\n2,Bob,87\n3,Carol,92",
+    contaminations=[
+        {"rule": "missing_value", "row": 0, "col": 1, "difficulty": 1.0},
+        {"rule": "negative_value", "row": 2, "col": 2, "difficulty": 1.5},
+    ],
+)
+register_task("custom", lambda seed: task)
+```
+### Built-in Contamination Rules
+| Rule | Effect | Default Difficulty |
+|------|--------|--------------------|
+| `missing_value` | Sets field to empty string | 1.0 |
+| `whitespace_value` | Sets field to single space | 2.5 |
+| `wrong_type_text` | Replaces with random text | 1.0 |
+| `negative_value` | Negates numeric value | 1.0 |
+## Setup & Quick Start
 ```bash
 # Install
 # Run server locally
 uvicorn dataqa_env.server.app:app --host 0.0.0.0 --port 8000
+# Run inference (set your API credentials)
+API_BASE_URL=https://router.huggingface.co/v1 \
+MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
+HF_TOKEN=your-token \
 python inference.py
 ```
 ## Docker
 ```bash
+docker build -t dataqa-env .
 docker run -p 8000:8000 dataqa-env
 ```
+## Testing
+```bash
+pip install -e ".[dev]"
+pytest tests/ -v
+```
+118 tests covering:
+- Task creation, corruption, and difficulty weights
+- Issue key and fix parsing (standard, lenient, edge cases)
+- F1, weighted reward, and fix quality computation
+- Full environment lifecycle (identify-only and identify+fix)
+- Combined reward calculation and weight verification
+- Inference script parsing and prompt building
+- Structured log format ([START], [STEP], [END])
+- Score bounds (0.0-1.0), best-score monotonicity
+- Extensibility API (custom rules, custom tasks)
+## Validation
+```bash
+# OpenEnv spec validation
+openenv validate .
+# Pre-submission validation (requires HF Space URL)
+./scripts/prevalidation_script.sh https://your-space.hf.space
+```
 ## Environment Variables
 | Variable | Description | Default |
 |----------|-------------|---------|
+| `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` |
+| `MODEL_NAME` | Model identifier | `Qwen/Qwen2.5-72B-Instruct` |
+| `HF_TOKEN` | HuggingFace token / API key | - |
 | `ENV_URL` | Environment server URL | `http://localhost:8000` |
 ## Architecture
 ```
 dataqa_env/
+├── __init__.py            # Public API + extensibility exports
+├── models.py              # Pydantic: DataQAAction (issues + fixes), DataQAObservation, DataQAState
 ├── client.py              # EnvClient for WebSocket connections
 ├── server/
+│   ├── environment.py     # Two-phase DataQAEnvironment (identify + fix + combined reward)
+│   ├── tasks.py           # Task definitions + contamination rules + extensibility API
+│   ├── app.py             # FastAPI server (via openenv-core create_app)
 │   └── Dockerfile
+tests/
+├── test_tasks.py          # Task creation, corruption, difficulty weights
+├── test_environment.py    # Identify scoring, fix grading, combined reward, lifecycle
+├── test_inference.py      # LLM response parsing, fix parsing, prompt building, log format
+└── test_extensibility.py  # Custom rules, custom tasks, registration API
+inference.py               # Two-phase baseline agent (identify → fix)
+openenv.yaml               # OpenEnv/HF Spaces spec
+pyproject.toml             # Package metadata and dependencies
+Dockerfile                 # Production container
 ```

__init__.py DELETED

@@ -1,3 +0,0 @@
-from dataqa_env import DataQAEnv, DataQAAction, DataQAObservation, DataQAState
-__all__ = ["DataQAEnv", "DataQAAction", "DataQAObservation", "DataQAState"]

client.py DELETED

@@ -1,5 +0,0 @@
-"""Root-level client for OpenEnv compatibility."""
-from dataqa_env.client import DataQAEnv
-from dataqa_env.models import DataQAAction, DataQAObservation, DataQAState
-__all__ = ["DataQAEnv", "DataQAAction", "DataQAObservation", "DataQAState"]

dataqa_env/__init__.py CHANGED

@@ -1,4 +1,19 @@
 from .client import DataQAEnv
 from .models import DataQAAction, DataQAObservation, DataQAState
-__all__ = ["DataQAEnv", "DataQAAction", "DataQAObservation", "DataQAState"]

 from .client import DataQAEnv
 from .models import DataQAAction, DataQAObservation, DataQAState
+from .server.tasks import (
+    create_task_from_config,
+    register_task,
+    register_contamination_rule,
+    CONTAMINATION_RULES,
+)
+__all__ = [
+    "DataQAEnv",
+    "DataQAAction",
+    "DataQAObservation",
+    "DataQAState",
+    "create_task_from_config",
+    "register_task",
+    "register_contamination_rule",
+    "CONTAMINATION_RULES",
+]

dataqa_env/models.py CHANGED

@@ -16,21 +16,23 @@ from openenv.core.env_server.interfaces import Action, Observation, State
 class DataQAAction(Action):
     """
-    Agent submits a list of identified data quality issues.
-    Each issue is a string in the format: "row:<row_idx>,col:<col_name>,issue:<issue_type>"
     Supported issue types:
-        - missing_value
-        - wrong_type
-        - duplicate_row
-        - out_of_range
-        - format_violation
-        - inconsistent_value
-        - statistical_outlier
-        - referential_integrity
     """
     issues: List[str]
     # Include task_id so step() can reconstruct context in stateless HTTP mode
     task_id: str = "easy"

 class DataQAAction(Action):
     """
+    Agent submits identified issues AND optional proposed fixes.
+    Two-phase action space:
+      Phase 1 (Identify): List issues in format "row:<N>,col:<name>,issue:<type>"
+      Phase 2 (Fix):      List fixes in format "row:<N>,col:<name>,fix:<proposed_value>"
+    The agent can submit both in the same step or across multiple steps.
+    Combined reward = 0.6 * identify_score + 0.4 * fix_score
     Supported issue types:
+        missing_value, wrong_type, duplicate_row, out_of_range,
+        format_violation, inconsistent_value, statistical_outlier,
+        referential_integrity
     """
     issues: List[str]
+    fixes: List[str] = []
     # Include task_id so step() can reconstruct context in stateless HTTP mode
     task_id: str = "easy"

dataqa_env/server/app.py CHANGED

@@ -19,9 +19,20 @@ app = create_app(
 )
 def main():
     import uvicorn
-    uvicorn.run(app, host="0.0.0.0", port=8000)
 if __name__ == "__main__":

 )
+@app.get("/")
+def root():
+    """Root endpoint — environment info."""
+    return {
+        "name": "DataQA Environment",
+        "description": "Two-phase data quality assurance environment: identify issues + propose fixes",
+        "tasks": ["easy", "medium", "hard"],
+        "endpoints": ["/health", "/reset", "/step", "/state"],
+    }
 def main():
     import uvicorn
+    uvicorn.run(app, host="0.0.0.0", port=7860)
 if __name__ == "__main__":

dataqa_env/server/environment.py CHANGED

@@ -3,8 +3,12 @@ DataQA Environment
 ------------------
 Server-side environment for data quality assurance tasks.
-The agent receives corrupted datasets and must identify planted quality issues.
-Scoring is based on F1 (precision-recall) of correctly matched issues.
 """
 from __future__ import annotations
@@ -18,6 +22,10 @@ from openenv.core.env_server.interfaces import Action, Environment, Observation
 from ..models import DataQAAction, DataQAObservation, DataQAState
 from .tasks import PlantedIssue, Task, get_task, list_tasks
 def parse_issue_key(raw: str) -> Optional[str]:
     """
@@ -26,7 +34,6 @@ def parse_issue_key(raw: str) -> Optional[str]:
     Returns normalized key or None if unparseable.
     """
     raw = raw.strip().lower()
-    # Be lenient with formatting
     row_match = re.search(r"row\s*[:=]\s*(\d+)", raw)
     col_match = re.search(r"col\s*[:=]\s*([\w_]+)", raw)
     issue_match = re.search(r"issue\s*[:=]\s*([\w_]+)", raw)
@@ -36,6 +43,22 @@ def parse_issue_key(raw: str) -> Optional[str]:
     return None
 def compute_f1(reported_keys: Set[str], planted_keys: Set[str]) -> dict:
     """Compute precision, recall, and F1 score."""
     if not reported_keys and not planted_keys:
@@ -58,12 +81,185 @@ def compute_f1(reported_keys: Set[str], planted_keys: Set[str]) -> dict:
     return {"precision": precision, "recall": recall, "f1": f1, "tp": tp, "fp": fp, "fn": fn}
 class DataQAEnvironment(Environment):
     """
-    Data Quality Assurance environment.
-    The agent inspects corrupted datasets and reports quality issues.
-    Reward is F1 score of correctly identified issues vs planted ground truth.
     """
     SUPPORTS_CONCURRENT_SESSIONS = True
@@ -103,7 +299,11 @@ class DataQAEnvironment(Environment):
             schema_description=self._current_task.schema_description,
             validation_rules=self._current_task.validation_rules,
             task_description=self._current_task.description,
-            feedback="Environment reset. Inspect the dataset and report all quality issues.",
             task_id=task_id,
             num_issues_hint=len(self._current_task.planted_issues),
             max_steps=self._current_task.max_steps,
@@ -120,15 +320,14 @@ class DataQAEnvironment(Environment):
         if not isinstance(action, DataQAAction):
             raise ValueError(f"Expected DataQAAction, got {type(action)}")
-        # In stateless HTTP mode, each request creates a fresh env instance.
-        # Auto-reset using the task_id from the action so step() works standalone.
         if self._current_task is None:
             self.reset(task_id=action.task_id)
         self._state.step_count += 1
         self._state.current_step += 1
-        # Parse reported issues
         reported_keys: Set[str] = set()
         parse_errors: list[str] = []
         for raw_issue in action.issues:
@@ -136,44 +335,148 @@ class DataQAEnvironment(Environment):
             if key:
                 reported_keys.add(key)
             else:
-                parse_errors.append(f"Could not parse: '{raw_issue}'")
-        # Compute score
         metrics = compute_f1(reported_keys, self._planted_keys)
-        score = metrics["f1"]
-        self._best_score = max(self._best_score, score)
         self._state.best_score = self._best_score
-        # Check if done
         is_done = (
-            score >= 0.999  # Perfect score
             or self._state.current_step >= self._state.max_steps
         )
-        # Build feedback
         feedback_lines = [
             f"Step {self._state.current_step}/{self._state.max_steps}",
             f"Issues reported: {len(reported_keys)}",
             f"True positives: {metrics['tp']}, False positives: {metrics['fp']}, Missed: {metrics['fn']}",
-            f"Precision: {metrics['precision']:.3f}, Recall: {metrics['recall']:.3f}, F1: {score:.3f}",
         ]
         if parse_errors:
-            feedback_lines.append(f"Parse errors ({len(parse_errors)}): {'; '.join(parse_errors[:3])}")
         if not is_done:
-            # Give hints about what was missed without revealing exact answers
             if metrics["fn"] > 0:
                 feedback_lines.append(
-                    f"You missed {metrics['fn']} issue(s). Review the dataset carefully."
                 )
             if metrics["fp"] > 0:
                 feedback_lines.append(
-                    f"{metrics['fp']} of your reported issues were incorrect."
                 )
-            feedback_lines.append("You can submit again with an updated list of issues.")
         else:
-            feedback_lines.append(f"Task complete! Final best F1 score: {self._best_score:.3f}")
         return DataQAObservation(
             dataset_csv=self._current_task.corrupted_csv,
@@ -186,6 +489,25 @@ class DataQAEnvironment(Environment):
             max_steps=self._state.max_steps,
             done=is_done,
             reward=self._best_score,
         )
     @property

 ------------------
 Server-side environment for data quality assurance tasks.
+Two-phase RL environment:
+  Phase 1 (Identify): Agent inspects corrupted datasets and reports quality issues.
+  Phase 2 (Fix):      Agent proposes corrections for identified issues.
+Combined reward = 0.6 * identify_score + 0.4 * fix_score
+Both phases scored with difficulty-weighted metrics for rich per-step signal.
 """
 from __future__ import annotations
 from ..models import DataQAAction, DataQAObservation, DataQAState
 from .tasks import PlantedIssue, Task, get_task, list_tasks
+# Reward weights for the two phases
+IDENTIFY_WEIGHT = 0.6
+FIX_WEIGHT = 0.4
 def parse_issue_key(raw: str) -> Optional[str]:
     """
     Returns normalized key or None if unparseable.
     """
     raw = raw.strip().lower()
     row_match = re.search(r"row\s*[:=]\s*(\d+)", raw)
     col_match = re.search(r"col\s*[:=]\s*([\w_]+)", raw)
     issue_match = re.search(r"issue\s*[:=]\s*([\w_]+)", raw)
     return None
+def parse_fix(raw: str) -> Optional[tuple[int, str, str]]:
+    """
+    Parse an agent-proposed fix into (row, col, proposed_value).
+    Expected format: row:<N>,col:<name>,fix:<value>
+    Returns (row, col, value) or None if unparseable.
+    """
+    raw = raw.strip()
+    row_match = re.search(r"row\s*[:=]\s*(\d+)", raw, re.IGNORECASE)
+    col_match = re.search(r"col(?:umn)?\s*[:=]\s*([\w_]+)", raw, re.IGNORECASE)
+    fix_match = re.search(r"fix\s*[:=]\s*(.+?)$", raw, re.IGNORECASE)
+    if row_match and col_match and fix_match:
+        return (int(row_match.group(1)), col_match.group(1).lower(), fix_match.group(1).strip())
+    return None
 def compute_f1(reported_keys: Set[str], planted_keys: Set[str]) -> dict:
     """Compute precision, recall, and F1 score."""
     if not reported_keys and not planted_keys:
     return {"precision": precision, "recall": recall, "f1": f1, "tp": tp, "fp": fp, "fn": fn}
+def compute_weighted_reward(
+    reported_keys: Set[str],
+    planted_issues: list,
+) -> dict:
+    """
+    Compute difficulty-weighted reward for richer per-step signal.
+    Each planted issue has a difficulty weight (1.0-3.0). Finding harder issues
+    earns more reward. False positives incur a penalty scaled by average difficulty.
+    Returns dict with weighted_reward (0.0-1.0), plus per-issue breakdown.
+    """
+    if not planted_issues and not reported_keys:
+        return {"weighted_reward": 1.0, "difficulty_found": 0.0, "difficulty_missed": 0.0}
+    planted_by_key = {issue.to_key(): issue for issue in planted_issues}
+    planted_keys = set(planted_by_key.keys())
+    if not reported_keys:
+        total_weight = sum(i.difficulty for i in planted_issues)
+        return {"weighted_reward": 0.0, "difficulty_found": 0.0, "difficulty_missed": total_weight}
+    if not planted_keys:
+        return {"weighted_reward": 0.0, "difficulty_found": 0.0, "difficulty_missed": 0.0}
+    found_keys = reported_keys & planted_keys
+    missed_keys = planted_keys - reported_keys
+    false_positive_count = len(reported_keys - planted_keys)
+    difficulty_found = sum(planted_by_key[k].difficulty for k in found_keys)
+    difficulty_missed = sum(planted_by_key[k].difficulty for k in missed_keys)
+    total_weight = sum(i.difficulty for i in planted_issues)
+    weighted_recall = difficulty_found / total_weight if total_weight > 0 else 0.0
+    avg_difficulty = total_weight / len(planted_issues)
+    fp_penalty_weight = false_positive_count * avg_difficulty
+    weighted_precision = difficulty_found / (difficulty_found + fp_penalty_weight) if (difficulty_found + fp_penalty_weight) > 0 else 0.0
+    if (weighted_precision + weighted_recall) > 0:
+        weighted_reward = 2 * weighted_precision * weighted_recall / (weighted_precision + weighted_recall)
+    else:
+        weighted_reward = 0.0
+    return {
+        "weighted_reward": round(weighted_reward, 4),
+        "difficulty_found": round(difficulty_found, 2),
+        "difficulty_missed": round(difficulty_missed, 2),
+    }
+def grade_fixes(
+    fixes: list[tuple[int, str, str]],
+    task: Task,
+) -> dict:
+    """
+    Grade proposed fixes against the clean dataset.
+    For each fix (row, col, proposed_value), compare to the original clean value.
+    Scoring per fix:
+      - Exact match (case-insensitive, whitespace-stripped): 1.0
+      - Numeric close match (within 1%): 0.8
+      - Correct column but wrong value: 0.1
+      - Targets a non-issue cell: 0.0 (penalty)
+    Returns dict with fix_score (0.0-1.0), details per fix, and counts.
+    """
+    if not fixes and not task.planted_issues:
+        return {"fix_score": 1.0, "fixes_correct": 0, "fixes_partial": 0,
+                "fixes_wrong": 0, "fixes_attempted": 0, "fix_details": []}
+    if not fixes:
+        return {"fix_score": 0.0, "fixes_correct": 0, "fixes_partial": 0,
+                "fixes_wrong": 0, "fixes_attempted": 0, "fix_details": []}
+    issue_map = task.get_planted_issue_map()
+    # Build set of (row, col) that are actual issues
+    issue_cells = {(issue.row, issue.col) for issue in task.planted_issues}
+    total_weight = sum(i.difficulty for i in task.planted_issues) if task.planted_issues else 1.0
+    earned_weight = 0.0
+    fixes_correct = 0
+    fixes_partial = 0
+    fixes_wrong = 0
+    fix_details = []
+    # Track which issues have been fixed (best fix wins)
+    fixed_issues: dict[tuple[int, str], float] = {}
+    for row, col, proposed in fixes:
+        clean_value = task.get_clean_value(row, col)
+        cell_key = (row, col)
+        if cell_key not in issue_cells:
+            # Fix targets a non-issue cell — no credit
+            fix_details.append({"row": row, "col": col, "score": 0.0, "reason": "not an issue cell"})
+            fixes_wrong += 1
+            continue
+        if clean_value is None:
+            fix_details.append({"row": row, "col": col, "score": 0.0, "reason": "cell not found"})
+            fixes_wrong += 1
+            continue
+        # Find the planted issue for this cell to get its difficulty weight
+        matching_issue = None
+        for issue in task.planted_issues:
+            if issue.row == row and issue.col == col:
+                matching_issue = issue
+                break
+        difficulty = matching_issue.difficulty if matching_issue else 1.0
+        # Score the fix
+        score = 0.0
+        reason = "wrong value"
+        # Exact match (case-insensitive, whitespace-stripped on both sides)
+        if proposed.strip().lower() == clean_value.strip().lower():
+            score = 1.0
+            reason = "exact match"
+            fixes_correct += 1
+        else:
+            # Try numeric comparison: exact equality first, then within 1%
+            try:
+                proposed_num = float(proposed.strip())
+                clean_num = float(clean_value)
+                if proposed_num == clean_num:
+                    score = 1.0
+                    reason = "exact numeric match"
+                    fixes_correct += 1
+                elif clean_num != 0 and abs(proposed_num - clean_num) / abs(clean_num) <= 0.01:
+                    score = 0.8
+                    reason = "numeric close match"
+                    fixes_partial += 1
+                else:
+                    score = 0.1
+                    reason = "correct cell, wrong value"
+                    fixes_partial += 1
+            except (ValueError, ZeroDivisionError):
+                # Not numeric — just a wrong value but at least right cell
+                score = 0.1
+                reason = "correct cell, wrong value"
+                fixes_partial += 1
+        # Keep best fix per cell
+        if cell_key not in fixed_issues or score > fixed_issues[cell_key]:
+            fixed_issues[cell_key] = score
+        fix_details.append({"row": row, "col": col, "score": score, "reason": reason})
+    # Compute fix score: weighted sum of best fix per issue / total weight
+    for issue in task.planted_issues:
+        cell_key = (issue.row, issue.col)
+        if cell_key in fixed_issues:
+            earned_weight += issue.difficulty * fixed_issues[cell_key]
+    fix_score = earned_weight / total_weight if total_weight > 0 else 0.0
+    fix_score = min(max(fix_score, 0.0), 1.0)
+    return {
+        "fix_score": round(fix_score, 4),
+        "fixes_correct": fixes_correct,
+        "fixes_partial": fixes_partial,
+        "fixes_wrong": fixes_wrong,
+        "fixes_attempted": len(fixes),
+        "fix_details": fix_details,
+    }
 class DataQAEnvironment(Environment):
     """
+    Data Quality Assurance environment — two-phase identify + fix.
+    Phase 1 (Identify): Agent inspects corrupted datasets and reports quality issues.
+    Phase 2 (Fix):      Agent proposes corrections for identified issues.
+    Combined reward = 0.6 * identify_score + 0.4 * fix_score
+    Both phases use difficulty-weighted scoring for rich per-step reward signals.
     """
     SUPPORTS_CONCURRENT_SESSIONS = True
             schema_description=self._current_task.schema_description,
             validation_rules=self._current_task.validation_rules,
             task_description=self._current_task.description,
+            feedback=(
+                "Environment reset. Inspect the dataset and report all quality issues.\n"
+                "You can also propose fixes in format: row:<N>,col:<name>,fix:<corrected_value>\n"
+                "Combined reward = 0.6 * identify_score + 0.4 * fix_score"
+            ),
             task_id=task_id,
             num_issues_hint=len(self._current_task.planted_issues),
             max_steps=self._current_task.max_steps,
         if not isinstance(action, DataQAAction):
             raise ValueError(f"Expected DataQAAction, got {type(action)}")
+        # Auto-reset in stateless HTTP mode
         if self._current_task is None:
             self.reset(task_id=action.task_id)
         self._state.step_count += 1
         self._state.current_step += 1
+        # ── Phase 1: Parse and score issue identification ──
         reported_keys: Set[str] = set()
         parse_errors: list[str] = []
         for raw_issue in action.issues:
             if key:
                 reported_keys.add(key)
             else:
+                parse_errors.append(f"Could not parse issue: '{raw_issue}'")
         metrics = compute_f1(reported_keys, self._planted_keys)
+        identify_f1 = metrics["f1"]
+        weighted = compute_weighted_reward(reported_keys, self._current_task.planted_issues)
+        identify_score = weighted["weighted_reward"]
+        # ── Phase 2: Parse and score proposed fixes ──
+        parsed_fixes: list[tuple[int, str, str]] = []
+        for raw_fix in action.fixes:
+            fix = parse_fix(raw_fix)
+            if fix:
+                parsed_fixes.append(fix)
+            else:
+                parse_errors.append(f"Could not parse fix: '{raw_fix}'")
+        fix_result = grade_fixes(parsed_fixes, self._current_task)
+        fix_score = fix_result["fix_score"]
+        # ── Combined reward ──
+        # If no fixes submitted, score is identify-only (no penalty for not fixing)
+        if action.fixes:
+            combined_reward = IDENTIFY_WEIGHT * identify_score + FIX_WEIGHT * fix_score
+        else:
+            combined_reward = identify_score  # backward compatible
+        self._best_score = max(self._best_score, combined_reward)
         self._state.best_score = self._best_score
+        # ── Check if done ──
         is_done = (
+            identify_f1 >= 0.999  # Perfect identification
             or self._state.current_step >= self._state.max_steps
         )
+        # ── Build feedback with actionable diagnostics ──
+        # Show the agent exactly which reported issues were correct (TP) and which were wrong (FP)
+        tp_keys = reported_keys & self._planted_keys
+        fp_keys = reported_keys - self._planted_keys
         feedback_lines = [
             f"Step {self._state.current_step}/{self._state.max_steps}",
+            "",
+            "--- Identification ---",
             f"Issues reported: {len(reported_keys)}",
             f"True positives: {metrics['tp']}, False positives: {metrics['fp']}, Missed: {metrics['fn']}",
+            f"Precision: {metrics['precision']:.3f}, Recall: {metrics['recall']:.3f}, F1: {identify_f1:.3f}",
+            f"Identify score (weighted): {identify_score:.3f}",
         ]
+        # Show which reported issues were correct vs wrong (helps agent self-correct)
+        if tp_keys:
+            feedback_lines.append(f"Correct issues: {', '.join(sorted(tp_keys))}")
+        if fp_keys:
+            feedback_lines.append(f"Incorrect issues (false positives): {', '.join(sorted(fp_keys))}")
+        if action.fixes:
+            feedback_lines += [
+                "",
+                "--- Fix Proposals ---",
+                f"Fixes attempted: {fix_result['fixes_attempted']}",
+                f"Correct: {fix_result['fixes_correct']}, Partial: {fix_result['fixes_partial']}, Wrong: {fix_result['fixes_wrong']}",
+                f"Fix score: {fix_score:.3f}",
+            ]
+            # Show per-fix feedback so agent knows which fixes worked
+            for detail in fix_result["fix_details"]:
+                status = "correct" if detail["score"] >= 0.99 else ("partial" if detail["score"] > 0 else "wrong")
+                feedback_lines.append(
+                    f"  row:{detail['row']},col:{detail['col']} -> {status} ({detail['reason']})"
+                )
+            feedback_lines.append(
+                f"\n--- Combined Reward: {combined_reward:.3f} (identify={identify_score:.3f} x {IDENTIFY_WEIGHT} + fix={fix_score:.3f} x {FIX_WEIGHT}) ---"
+            )
+        else:
+            feedback_lines += [
+                "",
+                "Tip: Submit fixes with format row:<N>,col:<name>,fix:<value> for bonus reward.",
+            ]
         if parse_errors:
+            feedback_lines.append(f"\nParse errors ({len(parse_errors)}): {'; '.join(parse_errors[:5])}")
         if not is_done:
             if metrics["fn"] > 0:
                 feedback_lines.append(
+                    f"\nYou missed {metrics['fn']} issue(s). Review the dataset carefully."
                 )
             if metrics["fp"] > 0:
                 feedback_lines.append(
+                    f"Remove the {metrics['fp']} false positive(s) listed above and look for real issues."
                 )
+            feedback_lines.append("You can submit again with updated issues and/or fixes.")
         else:
+            feedback_lines.append(f"\nTask complete! Final best reward: {self._best_score:.3f}")
+        # ── Flag items for human review ──
+        # In a production data QA pipeline, these would go to a human reviewer.
+        # The grader flags cases where automated scoring has low confidence.
+        human_review_flags: list[dict] = []
+        # 1. False positives that target real columns — could be legitimate issues
+        #    the task designer didn't plant (agent may be smarter than the grader)
+        issue_map = self._current_task.get_planted_issue_map()
+        valid_issue_types = {"missing_value", "wrong_type", "duplicate_row", "out_of_range",
+                             "format_violation", "inconsistent_value", "statistical_outlier",
+                             "referential_integrity"}
+        for fp_key in fp_keys:
+            parts = fp_key.split(",")
+            itype = parts[2].split(":")[1] if len(parts) >= 3 else ""
+            if itype in valid_issue_types:
+                human_review_flags.append({
+                    "item": fp_key,
+                    "reason": "Agent reported this issue but it's not in ground truth — may be a real issue the grader missed",
+                    "type": "possible_unplanted_issue",
+                })
+        # 2. Partial fix matches — fix was close but not exact, human should verify
+        for detail in fix_result["fix_details"]:
+            if 0 < detail["score"] < 0.99:
+                human_review_flags.append({
+                    "item": f"row:{detail['row']},col:{detail['col']}",
+                    "reason": f"Fix scored {detail['score']:.2f} ({detail['reason']}) — human should verify if acceptable",
+                    "type": "partial_fix",
+                })
+        # 3. High-difficulty issues that were missed — flag for training data review
+        planted_by_key = {i.to_key(): i for i in self._current_task.planted_issues}
+        fn_keys = self._planted_keys - reported_keys
+        for fn_key in fn_keys:
+            issue = planted_by_key.get(fn_key)
+            if issue and issue.difficulty >= 2.5:
+                human_review_flags.append({
+                    "item": fn_key,
+                    "reason": f"High-difficulty issue (difficulty={issue.difficulty}) missed — {issue.description}",
+                    "type": "missed_hard_issue",
+                })
+        if human_review_flags:
+            feedback_lines.append(f"\n--- Flagged for Human Review ({len(human_review_flags)}) ---")
+            for flag in human_review_flags:
+                feedback_lines.append(f"  [{flag['type']}] {flag['item']}: {flag['reason']}")
         return DataQAObservation(
             dataset_csv=self._current_task.corrupted_csv,
             max_steps=self._state.max_steps,
             done=is_done,
             reward=self._best_score,
+            metadata={
+                "identify_f1": identify_f1,
+                "identify_score": identify_score,
+                "fix_score": fix_score,
+                "combined_reward": combined_reward,
+                "precision": metrics["precision"],
+                "recall": metrics["recall"],
+                "tp": metrics["tp"],
+                "fp": metrics["fp"],
+                "fn": metrics["fn"],
+                "difficulty_found": weighted["difficulty_found"],
+                "difficulty_missed": weighted["difficulty_missed"],
+                "fixes_correct": fix_result["fixes_correct"],
+                "fixes_partial": fix_result["fixes_partial"],
+                "fixes_wrong": fix_result["fixes_wrong"],
+                "fixes_attempted": fix_result["fixes_attempted"],
+                "fix_details": fix_result["fix_details"],
+                "human_review_flags": human_review_flags,
+            },
         )
     @property

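The reward arithmetic in `compute_weighted_reward` and the combined-reward branch in `step` can be sketched in isolation. This is a minimal sketch, not the repository's code: the names `weighted_f1` and `combined_reward` are hypothetical, `total_weight` is assumed to be the summed difficulty of all planted issues (its definition sits outside the lines shown), and the 0.6/0.4 weights are taken from the class docstring.

```python
def weighted_f1(found_weights, missed_weights, false_positives):
    """Difficulty-weighted F1 (assumed equivalent to the diff's arithmetic)."""
    difficulty_found = sum(found_weights)
    total_weight = difficulty_found + sum(missed_weights)  # assumption: all planted issues
    n_planted = len(found_weights) + len(missed_weights)
    if n_planted == 0 or total_weight <= 0:
        return 0.0
    recall = difficulty_found / total_weight
    # Each false positive is charged the average planted difficulty.
    avg_difficulty = total_weight / n_planted
    fp_weight = false_positives * avg_difficulty
    denom = difficulty_found + fp_weight
    precision = difficulty_found / denom if denom > 0 else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def combined_reward(identify_score, fix_score, fixes_submitted,
                    identify_weight=0.6, fix_weight=0.4):
    """Blend the two phases; with no fixes submitted, the score is identify-only."""
    if not fixes_submitted:
        return identify_score
    return identify_weight * identify_score + fix_weight * fix_score


# Found issues of difficulty 1.0 and 2.0, missed one of difficulty 3.0, one false positive:
# recall = 3/6, precision = 3/(3 + 1*2), F1 = 6/11.
identify = weighted_f1([1.0, 2.0], [3.0], false_positives=1)
print(round(identify, 4))                                   # 0.5455
print(round(combined_reward(identify, 0.5, True), 4))       # 0.5273
```

Charging each false positive the average difficulty keeps precision on the same scale as the weighted recall, so a single spurious report costs about as much as a typical planted issue is worth.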
dataqa_env/server/gradio_ui.py ADDED

	@@ -0,0 +1,508 @@

+"""
+Gradio UI — Agent Trajectory Replay Viewer for DataQA.
+Designed for judges: zero clicks needed, the first replay step renders on load.
+Tab per task, step slider, prominent metric cards, color-coded dataset.
+"""
+from __future__ import annotations
+import csv
+import io
+import gradio as gr
+from .environment import DataQAEnvironment, parse_issue_key
+from .tasks import list_tasks, PlantedIssue
+from ..models import DataQAAction
+# ── Pre-built agent trajectories (simulates baseline agent) ──
+AGENT_TRAJECTORIES = {
+    "easy": [
+        {
+            "issues": [
+                "row:4,col:name,issue:missing_value",
+                "row:7,col:salary,issue:wrong_type",
+                "row:9,col:salary,issue:out_of_range",
+                "row:18,col:start_date,issue:out_of_range",
+                "row:3,col:email,issue:format_violation",  # FP
+            ],
+            "fixes": [],
+        },
+        {
+            "issues": [
+                "row:4,col:name,issue:missing_value",
+                "row:7,col:salary,issue:wrong_type",
+                "row:9,col:salary,issue:out_of_range",
+                "row:21,col:employee_id,issue:duplicate_row",
+                "row:15,col:email,issue:inconsistent_value",
+                "row:18,col:start_date,issue:out_of_range",
+            ],
+            "fixes": [
+                "row:4,col:name,fix:David Kim",
+                "row:7,col:salary,fix:75000",
+                "row:9,col:salary,fix:73000",
+                "row:15,col:email,fix:oscar.rivera@company.com",
+                "row:18,col:start_date,fix:2022-01-19",
+            ],
+        },
+    ],
+    "medium": [
+        {
+            "issues": [
+                "row:5,col:total,issue:inconsistent_value",
+                "row:10,col:category,issue:format_violation",
+                "row:14,col:product_name,issue:missing_value",
+                "row:17,col:quantity,issue:out_of_range",
+                "row:19,col:order_id,issue:duplicate_row",
+                "row:12,col:order_date,issue:format_violation",
+                "row:24,col:shipping_country,issue:format_violation",
+            ],
+            "fixes": [],
+        },
+        {
+            "issues": [
+                "row:5,col:total,issue:inconsistent_value",
+                "row:10,col:category,issue:format_violation",
+                "row:14,col:product_name,issue:missing_value",
+                "row:17,col:quantity,issue:out_of_range",
+                "row:19,col:order_id,issue:duplicate_row",
+                "row:12,col:order_date,issue:format_violation",
+                "row:24,col:shipping_country,issue:format_violation",
+                "row:29,col:order_date,issue:inconsistent_value",
+            ],
+            "fixes": [
+                "row:5,col:total,fix:42.00",
+                "row:10,col:category,fix:Sports",
+                "row:12,col:order_date,fix:2024-01-26",
+                "row:14,col:product_name,fix:LED Strip Lights",
+                "row:24,col:shipping_country,fix:US",
+                "row:29,col:order_date,fix:2024-02-12",
+            ],
+        },
+    ],
+    "hard": [
+        {
+            "issues": [
+                "row:14,col:training_time_hours,issue:out_of_range",
+                "row:13,col:learning_rate,issue:out_of_range",
+                "row:15,col:model_name,issue:missing_value",
+                "row:9,col:batch_size,issue:format_violation",
+                "row:10,col:train_size,issue:inconsistent_value",
+            ],
+            "fixes": [],
+        },
+        {
+            "issues": [
+                "row:14,col:training_time_hours,issue:out_of_range",
+                "row:13,col:learning_rate,issue:out_of_range",
+                "row:15,col:model_name,issue:missing_value",
+                "row:9,col:batch_size,issue:format_violation",
+                "row:10,col:train_size,issue:inconsistent_value",
+                "row:5,col:val_loss,issue:inconsistent_value",
+                "row:7,col:gpu_memory_gb,issue:statistical_outlier",
+                "row:11,col:timestamp,issue:inconsistent_value",
+                "row:9,col:training_time_hours,issue:statistical_outlier",
+                "row:12,col:test_accuracy,issue:statistical_outlier",
+            ],
+            "fixes": [
+                "row:14,col:training_time_hours,fix:72.0",
+                "row:13,col:learning_rate,fix:0.00001",
+                "row:15,col:model_name,fix:whisper-small",
+                "row:9,col:batch_size,fix:256",
+                "row:9,col:training_time_hours,fix:36.0",
+            ],
+        },
+    ],
+}
+# ── HTML rendering ──
+def _metric_card(label: str, value: str, color: str = "#333") -> str:
+    return (
+        f'<div style="text-align:center;padding:12px 16px;background:#f8f9fa;'
+        f'border-radius:8px;min-width:100px;">'
+        f'<div style="font-size:11px;color:#666;text-transform:uppercase;letter-spacing:1px;">{label}</div>'
+        f'<div style="font-size:28px;font-weight:700;color:{color};margin-top:2px;">{value}</div>'
+        f'</div>'
+    )
+def _csv_to_html(
+    csv_text: str,
+    planted: list[PlantedIssue],
+    correct: set[tuple[int, str]],
+    fp: set[tuple[int, str]],
+    missed: set[tuple[int, str]],
+    fixed: dict[tuple[int, str], str],
+    fix_values: dict[tuple[int, str], str] | None = None,
+) -> str:
+    """Render CSV as HTML with color-coded cells and inline fix proposals."""
+    fix_values = fix_values or {}
+    desc_map = {(i.row, i.col): i for i in planted}
+    reader = csv.reader(io.StringIO(csv_text.strip()))
+    rows = list(reader)
+    if not rows:
+        return ""
+    header = rows[0]
+    header_lower = [h.strip().lower() for h in header]
+    data = rows[1:]
+    t = ['<table style="border-collapse:collapse;width:100%;font-size:12px;font-family:\'SF Mono\',monospace;">']
+    t.append('<tr>')
+    t.append('<th style="border:1px solid #dee2e6;padding:6px 8px;background:#343a40;color:#fff;font-size:11px;">Row</th>')
+    for h in header:
+        t.append(f'<th style="border:1px solid #dee2e6;padding:6px 8px;background:#343a40;color:#fff;font-size:11px;">{h}</th>')
+    t.append('</tr>')
+    for i, row in enumerate(data):
+        rn = i + 1
+        bg = "#fff" if i % 2 == 0 else "#f8f9fa"
+        t.append(f'<tr style="background:{bg};">')
+        t.append(f'<td style="border:1px solid #dee2e6;padding:4px 8px;color:#adb5bd;text-align:center;font-size:11px;">{rn}</td>')
+        for j, val in enumerate(row):
+            col = header_lower[j] if j < len(header_lower) else ""
+            ck = (rn, col)
+            s = "border:1px solid #dee2e6;padding:4px 8px;"
+            tip = ""
+            badge = ""
+            issue = desc_map.get(ck)
+            if ck in correct:
+                s += "background:#d4edda;"
+                tip = f"FOUND: {issue.description}" if issue else ""
+                badge = '<span style="font-size:9px;background:#28a745;color:#fff;padding:1px 4px;border-radius:3px;margin-left:4px;">TP</span>'
+            elif ck in fp:
+                s += "background:#f8d7da;"
+                badge = '<span style="font-size:9px;background:#dc3545;color:#fff;padding:1px 4px;border-radius:3px;margin-left:4px;">FP</span>'
+            elif ck in missed:
+                s += "background:#fff3cd;"
+                tip = f"MISSED: {issue.description}" if issue else ""
+                badge = '<span style="font-size:9px;background:#856404;color:#fff;padding:1px 4px;border-radius:3px;margin-left:4px;">MISS</span>'
+            fx = fixed.get(ck)
+            proposed = fix_values.get(ck)
+            if fx == "correct":
+                s += "box-shadow:inset 0 0 0 2px #28a745;"
+                badge += '<span style="font-size:9px;background:#28a745;color:#fff;padding:1px 4px;border-radius:3px;margin-left:2px;">FIX</span>'
+            elif fx == "partial":
+                s += "box-shadow:inset 0 0 0 2px #ffc107;"
+                badge += '<span style="font-size:9px;background:#ffc107;color:#333;padding:1px 4px;border-radius:3px;margin-left:2px;">~FIX</span>'
+            dv = val if val.strip() else '<em style="color:#dc3545;font-style:italic;">empty</em>'
+            # Show proposed fix value below the corrupted value
+            fix_line = ""
+            if proposed is not None:
+                fix_color = "#28a745" if fx == "correct" else ("#b8860b" if fx == "partial" else "#dc3545")
+                fix_line = (
+                    f'<div style="font-size:10px;color:{fix_color};margin-top:2px;'
+                    f'border-top:1px dashed {fix_color};padding-top:2px;">'
+                    f'\u2192 {proposed}</div>'
+                )
+            t.append(f'<td style="{s}" title="{tip}">{dv}{badge}{fix_line}</td>')
+        t.append('</tr>')
+    t.append('</table>')
+    return "".join(t)
+LEGEND_HTML = (
+    '<div style="display:flex;gap:12px;flex-wrap:wrap;margin-top:10px;font-size:11px;">'
+    '<span style="background:#d4edda;padding:2px 8px;border-radius:4px;">Found (TP)</span>'
+    '<span style="background:#f8d7da;padding:2px 8px;border-radius:4px;">False Positive</span>'
+    '<span style="background:#fff3cd;padding:2px 8px;border-radius:4px;">Missed</span>'
+    '<span style="box-shadow:inset 0 0 0 2px #28a745;padding:2px 8px;border-radius:4px;">Fix Correct</span>'
+    '<span style="box-shadow:inset 0 0 0 2px #ffc107;padding:2px 8px;border-radius:4px;">Fix Partial</span>'
+    '</div>'
+)
+# ── Core replay logic ──
+def _replay_task(task_id: str) -> list[dict]:
+    """Run the agent trajectory and collect per-step data."""
+    env = DataQAEnvironment()
+    obs = env.reset(task_id=task_id)
+    task = env._current_task
+    planted_keys = {i.to_key() for i in task.planted_issues}
+    steps_data = []
+    # Step 0: initial state
+    steps_data.append({
+        "label": "Initial — corrupted dataset",
+        "html": _csv_to_html(obs.dataset_csv, task.planted_issues, set(), set(), set(), {}),
+        "metrics": {"reward": 0.0, "tp": 0, "fp": 0, "fn": len(task.planted_issues),
+                    "identify": 0.0, "fix": 0.0, "fixes_correct": 0},
+        "feedback": f"Task: {task.name}\nIssues to find: {obs.num_issues_hint}\n\n{task.description}",
+    })
+    trajectory = AGENT_TRAJECTORIES.get(task_id, [])
+    for i, step_data in enumerate(trajectory):
+        action = DataQAAction(
+            issues=step_data["issues"],
+            fixes=step_data.get("fixes", []),
+            task_id=task_id,
+        )
+        obs = env.step(action)
+        reported_keys = set()
+        for iss in step_data["issues"]:
+            key = parse_issue_key(iss)
+            if key:
+                reported_keys.add(key)
+        tp_keys = reported_keys & planted_keys
+        fp_keys = reported_keys - planted_keys
+        fn_keys = planted_keys - reported_keys
+        correct = {_kc(k) for k in tp_keys}
+        fp = {_kc(k) for k in fp_keys}
+        missed = {_kc(k) for k in fn_keys} if obs.done else set()
+        fixed: dict[tuple[int, str], str] = {}
+        for d in obs.metadata.get("fix_details", []):
+            c = (d["row"], d["col"])
+            fixed[c] = "correct" if d["score"] >= 0.99 else ("partial" if d["score"] > 0 else "wrong")
+        # Extract proposed fix values from the raw fix strings
+        fix_values: dict[tuple[int, str], str] = {}
+        from .environment import parse_fix
+        for raw_fix in step_data.get("fixes", []):
+            parsed = parse_fix(raw_fix)
+            if parsed:
+                row, col, val = parsed
+                fix_values[(row, col)] = val
+        html = _csv_to_html(obs.dataset_csv, task.planted_issues, correct, fp, missed, fixed, fix_values)
+        has_fixes = bool(step_data.get("fixes"))
+        if has_fixes:
+            label = f"Step {i+1} — identify + fix"
+        else:
+            label = f"Step {i+1} — identify only"
+        steps_data.append({
+            "label": label,
+            "html": html,
+            "metrics": {
+                "reward": obs.reward,
+                "tp": obs.metadata["tp"],
+                "fp": obs.metadata["fp"],
+                "fn": obs.metadata["fn"],
+                "identify": obs.metadata["identify_score"],
+                "fix": obs.metadata["fix_score"],
+                "fixes_correct": obs.metadata["fixes_correct"],
+            },
+            "feedback": obs.feedback,
+        })
+    return steps_data
+def _kc(key: str) -> tuple[int, str]:
+    parts = key.split(",")
+    return (int(parts[0].split(":")[1]), parts[1].split(":")[1])
+# ── Gradio app ──
+def build_gradio_ui():
+    # Pre-compute all replays at startup
+    all_replays: dict[str, list[dict]] = {}
+    for tid in list_tasks():
+        all_replays[tid] = _replay_task(tid)
+    def show_step(task_id: str, step_idx: int):
+        replay = all_replays.get(task_id, [])
+        step_idx = int(step_idx)
+        if step_idx >= len(replay):
+            step_idx = len(replay) - 1
+        sd = replay[step_idx]
+        m = sd["metrics"]
+        # Reward color
+        r = m["reward"]
+        rc = "#28a745" if r >= 0.8 else ("#ffc107" if r >= 0.4 else "#dc3545")
+        cards = (
+            '<div style="display:flex;gap:10px;flex-wrap:wrap;margin-bottom:12px;">'
+            + _metric_card("Reward", f"{r:.2f}", rc)
+            + _metric_card("Found", str(m["tp"]), "#28a745")
+            + _metric_card("False Pos", str(m["fp"]), "#dc3545" if m["fp"] > 0 else "#28a745")
+            + _metric_card("Missed", str(m["fn"]), "#dc3545" if m["fn"] > 0 else "#28a745")
+            + _metric_card("Identify", f"{m['identify']:.2f}", "#333")
+            + _metric_card("Fix", f"{m['fix']:.2f}", "#333")
+            + '</div>'
+        )
+        full_html = (
+            f'<div style="font-size:14px;font-weight:600;margin-bottom:8px;color:#495057;">'
+            f'{sd["label"]}</div>'
+            + cards + sd["html"] + LEGEND_HTML
+        )
+        return full_html, sd["feedback"]
+    def on_task_change(task_id):
+        replay = all_replays.get(task_id, [])
+        max_step = len(replay) - 1
+        html, fb = show_step(task_id, 0)
+        return (
+            gr.update(maximum=max_step, value=0),
+            html,
+            fb,
+        )
+    def on_step_change(task_id, step_idx):
+        html, fb = show_step(task_id, step_idx)
+        return html, fb
+    # ── Live agent runner (drives an in-process DataQAEnvironment, no HTTP round trip) ──
+    live_env = DataQAEnvironment()
+    live_state: dict = {"obs": None, "task_id": "easy", "steps": []}
+    def live_reset(task_id):
+        obs = live_env.reset(task_id=task_id)
+        task = live_env._current_task
+        live_state["obs"] = obs
+        live_state["task_id"] = task_id
+        live_state["steps"] = []
+        html = _csv_to_html(obs.dataset_csv, task.planted_issues, set(), set(), set(), {})
+        info = f"**{task.name}** — {obs.num_issues_hint} issues to find, {obs.max_steps} steps max"
+        return html, info, "", "0.000"
+    def live_step(issues_text, fixes_text):
+        if live_state["obs"] is None:
+            return "Reset first.", "", "", ""
+        obs = live_state["obs"]
+        task = live_env._current_task
+        planted_keys = {i.to_key() for i in task.planted_issues}
+        issues = [l.strip() for l in issues_text.strip().split("\n") if l.strip()]
+        fixes = [l.strip() for l in fixes_text.strip().split("\n") if l.strip()] if fixes_text.strip() else []
+        action = DataQAAction(issues=issues, fixes=fixes, task_id=live_state["task_id"])
+        obs = live_env.step(action)
+        live_state["obs"] = obs
+        reported_keys = set()
+        for iss in issues:
+            key = parse_issue_key(iss)
+            if key:
+                reported_keys.add(key)
+        tp_keys = reported_keys & planted_keys
+        fp_keys = reported_keys - planted_keys
+        fn_keys = planted_keys - reported_keys
+        correct = {_kc(k) for k in tp_keys}
+        fp_set = {_kc(k) for k in fp_keys}
+        missed = {_kc(k) for k in fn_keys} if obs.done else set()
+        fixed: dict[tuple[int, str], str] = {}
+        for d in obs.metadata.get("fix_details", []):
+            c = (d["row"], d["col"])
+            fixed[c] = "correct" if d["score"] >= 0.99 else ("partial" if d["score"] > 0 else "wrong")
+        from .environment import parse_fix
+        fix_values: dict[tuple[int, str], str] = {}
+        for raw in fixes:
+            parsed = parse_fix(raw)
+            if parsed:
+                fix_values[(parsed[0], parsed[1])] = parsed[2]
+        html = _csv_to_html(obs.dataset_csv, task.planted_issues, correct, fp_set, missed, fixed, fix_values)
+        m = obs.metadata
+        r = obs.reward
+        rc = "#28a745" if r >= 0.8 else ("#ffc107" if r >= 0.4 else "#dc3545")
+        cards = (
+            '<div style="display:flex;gap:10px;flex-wrap:wrap;margin-bottom:12px;">'
+            + _metric_card("Reward", f"{r:.2f}", rc)
+            + _metric_card("Found", str(m["tp"]), "#28a745")
+            + _metric_card("False Pos", str(m["fp"]), "#dc3545" if m["fp"] > 0 else "#28a745")
+            + _metric_card("Missed", str(m["fn"]), "#dc3545" if m["fn"] > 0 else "#28a745")
+            + '</div>'
+        )
+        full_html = cards + html + LEGEND_HTML
+        return full_html, obs.feedback, f"{r:.3f}", ""
+    # ── Build the UI ──
+    with gr.Blocks(title="DataQA Environment") as demo:
+        gr.Markdown(
+            "# DataQA — Data Quality Assurance Environment\n"
+            "Two-phase RL environment: **Identify** data quality issues, then **Fix** them."
+        )
+        with gr.Tabs():
+            # ── Tab 1: Demo replay ──
+            with gr.Tab("Demo (Baseline Agent)"):
+                gr.Markdown(
+                    "*Replay of the baseline Qwen-72B agent. "
+                    "Use the slider to step through the agent's trajectory.*"
+                )
+                with gr.Row():
+                    task_dd = gr.Dropdown(choices=list_tasks(), value="easy", label="Task", scale=1)
+                    step_slider = gr.Slider(minimum=0, maximum=2, step=1, value=0, label="Step", scale=3)
+                viz_html = gr.HTML()
+                feedback_box = gr.Textbox(label="Agent Feedback", lines=10, interactive=False)
+                task_dd.change(on_task_change, inputs=[task_dd], outputs=[step_slider, viz_html, feedback_box])
+                step_slider.change(on_step_change, inputs=[task_dd, step_slider], outputs=[viz_html, feedback_box])
+                demo.load(on_task_change, inputs=[task_dd], outputs=[step_slider, viz_html, feedback_box])
+            # ── Tab 2: Try your own agent ──
+            with gr.Tab("Try Your Own Agent"):
+                gr.Markdown(
+                    "*Submit your own issues and fixes to see how the environment scores them. "
+                    "This is the same environment the baseline agent talks to.*"
+                )
+                with gr.Row():
+                    live_task_dd = gr.Dropdown(choices=list_tasks(), value="easy", label="Task", scale=1)
+                    live_reset_btn = gr.Button("Reset", variant="primary", scale=1)
+                with gr.Row():
+                    live_info = gr.Markdown()
+                    live_reward = gr.Textbox(label="Reward", interactive=False, scale=1)
+                live_viz = gr.HTML()
+                with gr.Row():
+                    live_issues = gr.Textbox(
+                        label="Issues (one per line)",
+                        placeholder="row:4,col:name,issue:missing_value\nrow:7,col:salary,issue:wrong_type",
+                        lines=5,
+                    )
+                    live_fixes = gr.Textbox(
+                        label="Fixes (one per line, optional)",
+                        placeholder="row:4,col:name,fix:David Kim\nrow:7,col:salary,fix:75000",
+                        lines=5,
+                    )
+                live_step_btn = gr.Button("Submit Step", variant="primary")
+                live_feedback = gr.Textbox(label="Feedback", lines=10, interactive=False)
+                live_reset_btn.click(
+                    live_reset, inputs=[live_task_dd],
+                    outputs=[live_viz, live_info, live_feedback, live_reward],
+                )
+                live_step_btn.click(
+                    live_step, inputs=[live_issues, live_fixes],
+                    outputs=[live_viz, live_feedback, live_reward, live_issues],
+                )
+    return demo
+if __name__ == "__main__":
+    demo = build_gradio_ui()
+    demo.launch()
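
Note on the input format: the placeholder text in the "Try Your Own Agent" tab and `PlantedIssue.to_key()` both use the key format `row:N,col:NAME,issue:TYPE`. The following is a minimal sketch of parsing one such line. It assumes a hypothetical helper `parse_issue_line` and a hypothetical `IssueKey` tuple, neither of which appears in this diff. The environment's own parser may be stricter or more lenient.

```python
import re
from typing import NamedTuple, Optional


class IssueKey(NamedTuple):
    row: int         # 1-indexed data row (the first row after the header is 1)
    col: str         # column name, lowercased
    issue_type: str  # e.g. "missing_value", "wrong_type"


# Hypothetical pattern for illustration only; not taken from the repository.
_ISSUE_RE = re.compile(
    r"^\s*row:(\d+)\s*,\s*col:([^,]+?)\s*,\s*issue:(\w+)\s*$",
    re.IGNORECASE,
)


def parse_issue_line(line: str) -> Optional[IssueKey]:
    """Parse 'row:4,col:name,issue:missing_value'; return None if malformed."""
    m = _ISSUE_RE.match(line)
    if m is None:
        return None
    return IssueKey(int(m.group(1)), m.group(2).strip().lower(), m.group(3).lower())


def to_key(issue: IssueKey) -> str:
    """Rebuild the canonical key, mirroring the format of PlantedIssue.to_key()."""
    return f"row:{issue.row},col:{issue.col},issue:{issue.issue_type}"
```

Normalizing case and whitespace before comparing keys is a design choice made here so that `row:7, col:Salary, issue:wrong_type` and `row:7,col:salary,issue:wrong_type` are treated as the same report.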

dataqa_env/server/tasks.py CHANGED Viewed

@@ -25,6 +25,7 @@ class PlantedIssue:
     col: str
     issue_type: str
     description: str
     def to_key(self) -> str:
         return f"row:{self.row},col:{self.col},issue:{self.issue_type}"
@@ -42,6 +43,28 @@ class Task:
     corrupted_csv: str = ""
     max_steps: int = 3
 def _csv_to_rows(csv_text: str) -> List[List[str]]:
     reader = csv.reader(io.StringIO(csv_text.strip()))
@@ -72,7 +95,17 @@ def create_task_easy(seed: int = 42) -> Task:
 107,Grace Lee,grace.lee@company.com,Marketing,75000,2021-12-01
 108,Hank Brown,hank.brown@company.com,Sales,65000,2023-04-18
 109,Iris Patel,iris.patel@company.com,HR,73000,2020-02-28
-110,Jack Taylor,jack.taylor@company.com,Engineering,97000,2022-09-14"""
     schema_desc = """Columns:
 - employee_id: integer, unique, range 100-999
@@ -93,29 +126,43 @@ def create_task_easy(seed: int = 42) -> Task:
     data = rows[1:]
     issues: List[PlantedIssue] = []
-    # Issue 1: Missing value - null out a name
     r = 3  # row index in data (0-based), displayed as row 4 in CSV
     data[r][1] = ""
     issues.append(PlantedIssue(row=r + 1, col="name", issue_type="missing_value",
-                               description="Empty name field"))
-    # Issue 2: Wrong type - salary as text
     r = 6
     data[r][4] = "seventy-five thousand"
     issues.append(PlantedIssue(row=r + 1, col="salary", issue_type="wrong_type",
-                               description="Salary is text instead of integer"))
-    # Issue 3: Duplicate row
     dup_source = 1
     data.append(list(data[dup_source]))
     issues.append(PlantedIssue(row=len(data), col="employee_id", issue_type="duplicate_row",
-                               description=f"Exact duplicate of row {dup_source + 1}"))
-    # Issue 4: Out of range salary
     r = 8
     data[r][4] = "5000"
     issues.append(PlantedIssue(row=r + 1, col="salary", issue_type="out_of_range",
-                               description="Salary 5000 is below minimum 50000"))
     corrupted = _rows_to_csv([header] + data)
@@ -163,7 +210,17 @@ ORD-016,CUST-114,Bluetooth Speaker,Electronics,1,49.99,2024-01-30,UK,delivered,4
 ORD-017,CUST-115,Jump Rope,Sports,2,8.99,2024-01-31,US,shipped,17.98
 ORD-018,CUST-116,Coffee Table Book,Books,1,32.00,2024-02-01,CA,delivered,32.00
 ORD-019,CUST-117,Ergonomic Chair,Home,1,450.00,2024-02-02,US,processing,450.00
-ORD-020,CUST-118,Fitness Tracker,Electronics,1,79.99,2024-02-03,AU,delivered,79.99"""
     schema_desc = """Columns:
 - order_id: string, unique, format ORD-NNN
@@ -190,41 +247,55 @@ ORD-020,CUST-118,Fitness Tracker,Electronics,1,79.99,2024-02-03,AU,delivered,79.
     data = rows[1:]
     issues: List[PlantedIssue] = []
-    # Issue 1: total doesn't match quantity * unit_price
     r = 4  # ORD-005
     data[r][9] = "84.00"  # should be 42.00 (qty=1, price=42.00)
     issues.append(PlantedIssue(row=r + 1, col="total", issue_type="inconsistent_value",
-                               description="total (84.00) != quantity (1) * unit_price (42.00)"))
-    # Issue 2: Invalid category
     r = 9  # ORD-010
     data[r][3] = "Fitness"  # should be Sports
     issues.append(PlantedIssue(row=r + 1, col="category", issue_type="format_violation",
-                               description="'Fitness' is not in allowed categories"))
-    # Issue 3: Missing value in product_name
     r = 13  # ORD-014
     data[r][2] = ""
     issues.append(PlantedIssue(row=r + 1, col="product_name", issue_type="missing_value",
-                               description="Empty product_name"))
-    # Issue 4: Out of range quantity
     r = 16  # ORD-017
     data[r][4] = "-1"
     issues.append(PlantedIssue(row=r + 1, col="quantity", issue_type="out_of_range",
-                               description="Negative quantity"))
-    # Issue 5: Duplicate order_id
     r = 18  # ORD-019
     data[r][0] = "ORD-003"
     issues.append(PlantedIssue(row=r + 1, col="order_id", issue_type="duplicate_row",
-                               description="Duplicate order_id ORD-003"))
-    # Issue 6: Wrong date format
     r = 11  # ORD-012
     data[r][6] = "26/01/2024"
     issues.append(PlantedIssue(row=r + 1, col="order_date", issue_type="format_violation",
-                               description="Date format DD/MM/YYYY instead of YYYY-MM-DD"))
     corrupted = _rows_to_csv([header] + data)
@@ -267,7 +338,22 @@ EXP-011,yolov5-m,coco-2017,118287,5000,40670,0.01,32,300,0.032,0.045,0.0,10.2,24
 EXP-012,wav2vec2,librispeech,281241,5567,2620,0.0001,8,20,0.92,1.05,0.0,12.6,15.0,2024-03-13T11:30:00
 EXP-013,clip-base,cc3m,2818102,15000,15000,0.00001,256,32,2.15,2.38,0.0,22.4,48.0,2024-03-14T08:00:00
 EXP-014,detr,coco-2017,118287,5000,40670,0.0001,4,500,1.85,2.12,0.0,16.0,72.0,2024-03-15T10:00:00
-EXP-015,whisper-small,common-voice,520000,16000,16000,0.00005,16,5,0.55,0.68,0.0,7.4,6.5,2024-03-16T14:00:00"""
     schema_desc = """Columns:
 - experiment_id: string, unique, format EXP-NNN
@@ -301,53 +387,83 @@ EXP-015,whisper-small,common-voice,520000,16000,16000,0.00005,16,5,0.55,0.68,0.0
     data = rows[1:]
     issues: List[PlantedIssue] = []
-    # Issue 1: Data leakage signal — val_loss much lower than train_loss
     r = 4  # EXP-005
     data[r][10] = "0.15"  # val_loss=0.15 but train_loss=0.28 → suspicious
     issues.append(PlantedIssue(row=r + 1, col="val_loss", issue_type="inconsistent_value",
-                               description="val_loss (0.15) significantly less than train_loss (0.28), potential data leakage"))
-    # Issue 2: Batch size not power of 2
     r = 8  # EXP-009
     data[r][7] = "250"  # not a power of 2
     issues.append(PlantedIssue(row=r + 1, col="batch_size", issue_type="format_violation",
-                               description="batch_size 250 is not a power of 2"))
-    # Issue 3: GPU memory unreasonable for model
     r = 6  # EXP-007 resnet18 on cifar10
     data[r][12] = "42.5"  # resnet18 shouldn't need 42.5 GB
     issues.append(PlantedIssue(row=r + 1, col="gpu_memory_gb", issue_type="statistical_outlier",
-                               description="resnet18 on cifar10 using 42.5 GB GPU memory is unreasonable"))
-    # Issue 4: Timestamp out of order
     r = 10  # EXP-011
     data[r][14] = "2024-03-02T09:00:00"  # should be after EXP-010's timestamp
     issues.append(PlantedIssue(row=r + 1, col="timestamp", issue_type="inconsistent_value",
-                               description="Timestamp 2024-03-02 is before EXP-010's timestamp 2024-03-11"))
-    # Issue 5: Train size smaller than test size
     r = 9  # EXP-010
     data[r][3] = "500"  # train_size=500 but test_size=1821
     issues.append(PlantedIssue(row=r + 1, col="train_size", issue_type="inconsistent_value",
-                               description="train_size (500) is smaller than test_size (1821)"))
-    # Issue 6: Negative training time
     r = 13  # EXP-014
     data[r][13] = "-72.0"
     issues.append(PlantedIssue(row=r + 1, col="training_time_hours", issue_type="out_of_range",
-                               description="Negative training time"))
-    # Issue 7: Learning rate out of range
     r = 12  # EXP-013
     data[r][6] = "2.5"  # way too high
     issues.append(PlantedIssue(row=r + 1, col="learning_rate", issue_type="out_of_range",
-                               description="Learning rate 2.5 exceeds maximum of 1.0"))
-    # Issue 8: Missing model name (subtle — single space instead of empty)
     r = 14  # EXP-015
     data[r][1] = " "
     issues.append(PlantedIssue(row=r + 1, col="model_name", issue_type="missing_value",
-                               description="model_name is whitespace-only"))
     corrupted = _rows_to_csv([header] + data)
@@ -370,6 +486,123 @@ EXP-015,whisper-small,common-voice,520000,16000,16000,0.00005,16,5,0.55,0.68,0.0
     )
 # ---------------------------------------------------------------------------
 # Task registry
 # ---------------------------------------------------------------------------

     col: str
     issue_type: str
     description: str
+    difficulty: float = 1.0  # 1.0=easy, 2.0=medium, 3.0=hard (for weighted reward)
     def to_key(self) -> str:
         return f"row:{self.row},col:{self.col},issue:{self.issue_type}"
     corrupted_csv: str = ""
     max_steps: int = 3
+    def get_clean_value(self, row: int, col: str) -> str | None:
+        """
+        Look up the original clean value for a given (row, col).
+        Row is 1-indexed (data row after header).
+        Returns None if row/col is out of bounds or column not found.
+        """
+        rows = _csv_to_rows(self.clean_csv)
+        if len(rows) < 2:
+            return None
+        header = [h.strip().lower() for h in rows[0]]
+        if col.lower() not in header:
+            return None
+        col_idx = header.index(col.lower())
+        data_row_idx = row  # row is 1-indexed, rows[0] is header, so rows[row] is the data row
+        if data_row_idx < 1 or data_row_idx >= len(rows):
+            return None
+        return rows[data_row_idx][col_idx].strip()
+    def get_planted_issue_map(self) -> dict:
+        """Return dict mapping issue key -> PlantedIssue for quick lookups."""
+        return {issue.to_key(): issue for issue in self.planted_issues}
 def _csv_to_rows(csv_text: str) -> List[List[str]]:
     reader = csv.reader(io.StringIO(csv_text.strip()))
 107,Grace Lee,grace.lee@company.com,Marketing,75000,2021-12-01
 108,Hank Brown,hank.brown@company.com,Sales,65000,2023-04-18
 109,Iris Patel,iris.patel@company.com,HR,73000,2020-02-28
+110,Jack Taylor,jack.taylor@company.com,Engineering,97000,2022-09-14
+111,Kevin Zhang,kevin.zhang@company.com,Engineering,91000,2021-05-22
+112,Laura Adams,laura.adams@company.com,Sales,69000,2022-11-03
+113,Mike Torres,mike.torres@company.com,Marketing,74000,2020-08-17
+114,Nina Sharma,nina.sharma@company.com,HR,76000,2019-04-30
+115,Oscar Rivera,oscar.rivera@company.com,Engineering,105000,2018-12-10
+116,Paula Green,paula.green@company.com,Sales,67000,2023-06-25
+117,Quinn Murphy,quinn.murphy@company.com,Marketing,78000,2021-03-08
+118,Rosa Diaz,rosa.diaz@company.com,Engineering,99000,2022-01-19
+119,Sam Cooper,sam.cooper@company.com,HR,70000,2020-10-05
+120,Tara Singh,tara.singh@company.com,Sales,66000,2023-02-14"""
     schema_desc = """Columns:
 - employee_id: integer, unique, range 100-999
     data = rows[1:]
     issues: List[PlantedIssue] = []
+    # Issue 1: Missing value - null out a name (easy to spot)
     r = 3  # row index in data (0-based), displayed as row 4 in CSV
     data[r][1] = ""
     issues.append(PlantedIssue(row=r + 1, col="name", issue_type="missing_value",
+                               description="Empty name field", difficulty=1.0))
+    # Issue 2: Wrong type - salary as text (easy to spot)
     r = 6
     data[r][4] = "seventy-five thousand"
     issues.append(PlantedIssue(row=r + 1, col="salary", issue_type="wrong_type",
+                               description="Salary is text instead of integer", difficulty=1.0))
+    # Issue 3: Duplicate row (moderate — requires cross-row comparison)
     dup_source = 1
     data.append(list(data[dup_source]))
     issues.append(PlantedIssue(row=len(data), col="employee_id", issue_type="duplicate_row",
+                               description=f"Exact duplicate of row {dup_source + 1}", difficulty=1.5))
+    # Issue 4: Out of range salary (easy to spot)
     r = 8
     data[r][4] = "5000"
     issues.append(PlantedIssue(row=r + 1, col="salary", issue_type="out_of_range",
+                               description="Salary 5000 is below minimum 50000", difficulty=1.0))
+    # Issue 5: Email doesn't match name pattern (moderate — cross-column check)
+    r = 14  # Oscar Rivera -> email should be oscar.rivera@company.com
+    data[r][2] = "john.doe@company.com"
+    issues.append(PlantedIssue(row=r + 1, col="email", issue_type="inconsistent_value",
+                               description="Email john.doe@company.com doesn't match name Oscar Rivera",
+                               difficulty=1.5))
+    # Issue 6: Future start date (requires knowing current date context)
+    r = 17  # Rosa Diaz
+    data[r][5] = "2027-06-15"
+    issues.append(PlantedIssue(row=r + 1, col="start_date", issue_type="out_of_range",
+                               description="Start date 2027-06-15 is in the future (beyond 2025-12-31)",
+                               difficulty=1.5))
     corrupted = _rows_to_csv([header] + data)
 ORD-017,CUST-115,Jump Rope,Sports,2,8.99,2024-01-31,US,shipped,17.98
 ORD-018,CUST-116,Coffee Table Book,Books,1,32.00,2024-02-01,CA,delivered,32.00
 ORD-019,CUST-117,Ergonomic Chair,Home,1,450.00,2024-02-02,US,processing,450.00
+ORD-020,CUST-118,Fitness Tracker,Electronics,1,79.99,2024-02-03,AU,delivered,79.99
+ORD-021,CUST-119,Laptop Sleeve,Electronics,1,24.99,2024-02-04,US,delivered,24.99
+ORD-022,CUST-120,Hiking Backpack,Sports,1,65.00,2024-02-05,CA,shipped,65.00
+ORD-023,CUST-121,Machine Learning Book,Books,1,54.99,2024-02-06,UK,delivered,54.99
+ORD-024,CUST-122,Plant Pot Set,Home,3,15.00,2024-02-07,US,delivered,45.00
+ORD-025,CUST-123,Noise Cancelling Headphones,Electronics,1,199.99,2024-02-08,DE,shipped,199.99
+ORD-026,CUST-124,Basketball,Sports,1,29.99,2024-02-09,US,delivered,29.99
+ORD-027,CUST-125,Cookbook Collection,Books,2,22.50,2024-02-10,AU,delivered,45.00
+ORD-028,CUST-126,Smart Plug,Home,4,12.99,2024-02-11,US,processing,51.96
+ORD-029,CUST-127,Wireless Charger,Electronics,1,34.99,2024-02-12,UK,delivered,34.99
+ORD-030,CUST-128,Dumbbells Set,Sports,1,89.00,2024-02-13,US,shipped,89.00"""
     schema_desc = """Columns:
 - order_id: string, unique, format ORD-NNN
     data = rows[1:]
     issues: List[PlantedIssue] = []
+    # Issue 1: total doesn't match quantity * unit_price (requires cross-column check)
     r = 4  # ORD-005
     data[r][9] = "84.00"  # should be 42.00 (qty=1, price=42.00)
     issues.append(PlantedIssue(row=r + 1, col="total", issue_type="inconsistent_value",
+                               description="total (84.00) != quantity (1) * unit_price (42.00)", difficulty=2.0))
+    # Issue 2: Invalid category (requires knowing the allowed set)
     r = 9  # ORD-010
     data[r][3] = "Fitness"  # should be Sports
     issues.append(PlantedIssue(row=r + 1, col="category", issue_type="format_violation",
+                               description="'Fitness' is not in allowed categories", difficulty=1.5))
+    # Issue 3: Missing value in product_name (easy to spot)
     r = 13  # ORD-014
     data[r][2] = ""
     issues.append(PlantedIssue(row=r + 1, col="product_name", issue_type="missing_value",
+                               description="Empty product_name", difficulty=1.0))
+    # Issue 4: Out of range quantity (easy to spot)
     r = 16  # ORD-017
     data[r][4] = "-1"
     issues.append(PlantedIssue(row=r + 1, col="quantity", issue_type="out_of_range",
+                               description="Negative quantity", difficulty=1.0))
+    # Issue 5: Duplicate order_id (requires cross-row comparison)
     r = 18  # ORD-019
     data[r][0] = "ORD-003"
     issues.append(PlantedIssue(row=r + 1, col="order_id", issue_type="duplicate_row",
+                               description="Duplicate order_id ORD-003", difficulty=1.5))
+    # Issue 6: Wrong date format (moderate — format mismatch)
     r = 11  # ORD-012
     data[r][6] = "26/01/2024"
     issues.append(PlantedIssue(row=r + 1, col="order_date", issue_type="format_violation",
+                               description="Date format DD/MM/YYYY instead of YYYY-MM-DD", difficulty=1.5))
+    # Issue 7: Invalid country code (requires ISO knowledge)
+    r = 23  # ORD-024
+    data[r][7] = "XX"  # not a valid ISO country code
+    issues.append(PlantedIssue(row=r + 1, col="shipping_country", issue_type="format_violation",
+                               description="'XX' is not a valid ISO 2-letter country code", difficulty=1.5))
+    # Issue 8: Status-date inconsistency (order_date is in the future,
+    # yet the status is already "delivered")
+    r = 28  # ORD-029
+    data[r][6] = "2025-12-25"  # future date but status is "delivered"
+    issues.append(PlantedIssue(row=r + 1, col="order_date", issue_type="inconsistent_value",
+                               description="Order date 2025-12-25 is in the future but status is 'delivered'",
+                               difficulty=2.0))
     corrupted = _rows_to_csv([header] + data)
 EXP-012,wav2vec2,librispeech,281241,5567,2620,0.0001,8,20,0.92,1.05,0.0,12.6,15.0,2024-03-13T11:30:00
 EXP-013,clip-base,cc3m,2818102,15000,15000,0.00001,256,32,2.15,2.38,0.0,22.4,48.0,2024-03-14T08:00:00
 EXP-014,detr,coco-2017,118287,5000,40670,0.0001,4,500,1.85,2.12,0.0,16.0,72.0,2024-03-15T10:00:00
+EXP-015,whisper-small,common-voice,520000,16000,16000,0.00005,16,5,0.55,0.68,0.0,7.4,6.5,2024-03-16T14:00:00
+EXP-016,mobilenet-v3,imagenet-1k,1281167,50000,100000,0.004,128,150,0.92,1.05,72.8,4.1,18.0,2024-03-17T08:30:00
+EXP-017,albert-base,mnli,392702,9815,9796,0.00002,32,5,0.32,0.41,83.1,6.2,1.8,2024-03-18T11:00:00
+EXP-018,gpt-neo-1.3b,pile-subset,1500000,50000,50000,0.0002,8,2,2.85,2.98,0.0,18.5,36.0,2024-03-19T14:00:00
+EXP-019,swin-tiny,imagenet-1k,1281167,50000,100000,0.001,256,300,0.78,0.95,78.2,8.6,42.0,2024-03-20T09:00:00
+EXP-020,deberta-large,squad-v2,130319,11873,8862,0.00001,16,5,0.35,0.42,85.7,15.2,4.5,2024-03-21T10:30:00
+EXP-021,yolov8-s,coco-2017,118287,5000,40670,0.01,64,200,0.028,0.038,0.0,6.8,16.0,2024-03-22T13:00:00
+EXP-022,bart-base,xsum,204045,11332,11334,0.0001,32,10,1.22,1.38,0.0,8.4,6.2,2024-03-23T15:30:00
+EXP-023,convnext-tiny,imagenet-1k,1281167,50000,100000,0.002,256,300,0.74,0.92,79.5,7.2,38.0,2024-03-24T08:00:00
+EXP-024,xlm-roberta,xnli,392702,2490,5010,0.00002,16,10,0.41,0.48,82.3,12.4,5.8,2024-03-25T11:00:00
+EXP-025,stable-diffusion,laion-400m,400000000,10000,10000,0.0001,4,1,0.45,0.52,0.0,24.0,168.0,2024-03-26T09:00:00
+EXP-026,phi-2,dolly-15k,15011,500,500,0.00005,8,3,0.82,0.95,0.0,10.2,2.5,2024-03-27T14:00:00
+EXP-027,dino-v2,imagenet-1k,1281167,50000,100000,0.0005,64,100,0.42,0.58,0.0,11.8,28.0,2024-03-28T10:00:00
+EXP-028,electra-small,glue-mrpc,3668,408,1725,0.0001,32,10,0.38,0.44,87.2,3.8,0.8,2024-03-29T16:00:00
+EXP-029,sam-base,sa-1b,11000000,50000,50000,0.0001,4,1,0.95,1.08,0.0,16.4,96.0,2024-03-30T08:00:00
+EXP-030,llama2-13b,oasst1,84437,4401,4401,0.00001,2,3,0.78,0.88,0.0,52.0,12.0,2024-03-31T12:00:00"""
     schema_desc = """Columns:
 - experiment_id: string, unique, format EXP-NNN
     data = rows[1:]
     issues: List[PlantedIssue] = []
+    # Issue 1: Data leakage signal — val_loss much lower than train_loss (hard — requires ML knowledge)
     r = 4  # EXP-005
     data[r][10] = "0.15"  # val_loss=0.15 but train_loss=0.28 → suspicious
     issues.append(PlantedIssue(row=r + 1, col="val_loss", issue_type="inconsistent_value",
+                               description="val_loss (0.15) significantly less than train_loss (0.28), potential data leakage",
+                               difficulty=3.0))
+    # Issue 2: Batch size not power of 2 (moderate — domain convention)
     r = 8  # EXP-009
     data[r][7] = "250"  # not a power of 2
     issues.append(PlantedIssue(row=r + 1, col="batch_size", issue_type="format_violation",
+                               description="batch_size 250 is not a power of 2", difficulty=2.0))
+    # Issue 3: GPU memory unreasonable for model (hard — requires model size reasoning)
     r = 6  # EXP-007 resnet18 on cifar10
     data[r][12] = "42.5"  # resnet18 shouldn't need 42.5 GB
     issues.append(PlantedIssue(row=r + 1, col="gpu_memory_gb", issue_type="statistical_outlier",
+                               description="resnet18 on cifar10 using 42.5 GB GPU memory is unreasonable",
+                               difficulty=3.0))
+    # Issue 4: Timestamp out of order (moderate — requires sequential comparison)
     r = 10  # EXP-011
     data[r][14] = "2024-03-02T09:00:00"  # should be after EXP-010's timestamp
     issues.append(PlantedIssue(row=r + 1, col="timestamp", issue_type="inconsistent_value",
+                               description="Timestamp 2024-03-02 is before EXP-010's timestamp 2024-03-11",
+                               difficulty=2.0))
+    # Issue 5: Train size smaller than test size (moderate — cross-column logic)
     r = 9  # EXP-010
     data[r][3] = "500"  # train_size=500 but test_size=1821
     issues.append(PlantedIssue(row=r + 1, col="train_size", issue_type="inconsistent_value",
+                               description="train_size (500) is smaller than test_size (1821)",
+                               difficulty=2.0))
+    # Issue 6: Negative training time (easy to spot)
     r = 13  # EXP-014
     data[r][13] = "-72.0"
     issues.append(PlantedIssue(row=r + 1, col="training_time_hours", issue_type="out_of_range",
+                               description="Negative training time", difficulty=1.0))
+    # Issue 7: Learning rate out of range (easy to spot)
     r = 12  # EXP-013
     data[r][6] = "2.5"  # way too high
     issues.append(PlantedIssue(row=r + 1, col="learning_rate", issue_type="out_of_range",
+                               description="Learning rate 2.5 exceeds maximum of 1.0", difficulty=1.5))
+    # Issue 8: Missing model name (hard — whitespace-only is subtle)
     r = 14  # EXP-015
     data[r][1] = " "
     issues.append(PlantedIssue(row=r + 1, col="model_name", issue_type="missing_value",
+                               description="model_name is whitespace-only", difficulty=2.5))
+    # Issue 9: Training time impossibly fast for dataset size and epochs.
+    # EXP-009 (efficientnet-b0 on imagenet-1k, 350 epochs) should take roughly 40+ hours;
+    # 0.5 hours is impossible for 1.2M images * 350 epochs.
+    r = 8  # EXP-009 (same row as batch_size issue, different column)
+    data[r][13] = "0.5"  # 30 minutes for 350 epochs on imagenet? impossible
+    issues.append(PlantedIssue(row=r + 1, col="training_time_hours", issue_type="statistical_outlier",
+                               description="0.5 hours for 350 epochs on imagenet-1k (1.2M images) is impossibly fast",
+                               difficulty=3.0))
+    # Issue 10: test_accuracy implausibly high for the training budget.
+    # EXP-012 (wav2vec2 on librispeech, 20 epochs) is set to 98.5, above the ~96% SOTA
+    # assumed here, which points to an evaluation error.
+    # EXP-010 was not used: its accuracy only looks inconsistent because train_size
+    # is already corrupted on that row, so no new value would be planted there.
+    r = 11  # EXP-012
+    data[r][11] = "98.5"  # wav2vec2 with 20 epochs shouldn't hit 98.5% — SOTA is ~96%
+    issues.append(PlantedIssue(row=r + 1, col="test_accuracy", issue_type="statistical_outlier",
+                               description="test_accuracy 98.5% for wav2vec2 with only 20 epochs exceeds known SOTA (~96%), likely evaluation error",
+                               difficulty=3.0))
     corrupted = _rows_to_csv([header] + data)
     )
+# ---------------------------------------------------------------------------
+# Contamination rules for extensible task creation
+# ---------------------------------------------------------------------------
+# Each contamination rule is a callable: (rows, header, col_idx, row_idx, rng) -> (new_value, PlantedIssue)
+# Users can define their own and register them.
+CONTAMINATION_RULES = {
+    "missing_value": lambda rows, header, col_idx, row_idx, rng: (
+        "",
+        PlantedIssue(
+            row=row_idx + 1, col=header[col_idx], issue_type="missing_value",
+            description=f"Empty {header[col_idx]} field", difficulty=1.0,
+        ),
+    ),
+    "whitespace_value": lambda rows, header, col_idx, row_idx, rng: (
+        " ",
+        PlantedIssue(
+            row=row_idx + 1, col=header[col_idx], issue_type="missing_value",
+            description=f"Whitespace-only {header[col_idx]} field", difficulty=2.5,
+        ),
+    ),
+    "wrong_type_text": lambda rows, header, col_idx, row_idx, rng: (
+        rng.choice(["not-a-number", "N/A", "null", "undefined"]),
+        PlantedIssue(
+            row=row_idx + 1, col=header[col_idx], issue_type="wrong_type",
+            description=f"{header[col_idx]} is text instead of expected type", difficulty=1.0,
+        ),
+    ),
+    "negative_value": lambda rows, header, col_idx, row_idx, rng: (
+        str(-abs(float(rows[row_idx][col_idx]) if rows[row_idx][col_idx] else 1)),
+        PlantedIssue(
+            row=row_idx + 1, col=header[col_idx], issue_type="out_of_range",
+            description=f"Negative {header[col_idx]}", difficulty=1.0,
+        ),
+    ),
+}
+def create_task_from_config(
+    task_id: str,
+    name: str,
+    description: str,
+    schema_description: str,
+    validation_rules: str,
+    clean_csv: str,
+    contaminations: List[dict],
+    max_steps: int = 3,
+    seed: int = 42,
+) -> Task:
+    """
+    Create a custom task from a clean CSV and a list of contamination specs.
+    Each contamination entry should have:
+        - rule: str (key in CONTAMINATION_RULES) or callable
+        - row: int (0-based row index in data)
+        - col: int (column index in header)
+        - difficulty: float (optional, overrides rule default)
+    Example:
+        contaminations = [
+            {"rule": "missing_value", "row": 2, "col": 1, "difficulty": 1.5},
+            {"rule": "negative_value", "row": 5, "col": 4},
+        ]
+    """
+    rng = random.Random(seed)
+    rows = _csv_to_rows(clean_csv)
+    header = rows[0]
+    data = rows[1:]
+    issues: List[PlantedIssue] = []
+    for spec in contaminations:
+        rule = spec["rule"]
+        row_idx = spec["row"]
+        col_idx = spec["col"]
+        if callable(rule):
+            new_val, issue = rule(data, header, col_idx, row_idx, rng)
+        elif rule in CONTAMINATION_RULES:
+            new_val, issue = CONTAMINATION_RULES[rule](data, header, col_idx, row_idx, rng)
+        else:
+            raise ValueError(f"Unknown contamination rule: {rule}. Available: {list(CONTAMINATION_RULES.keys())}")
+        data[row_idx][col_idx] = new_val
+        if "difficulty" in spec:
+            issue.difficulty = spec["difficulty"]
+        issues.append(issue)
+    corrupted = _rows_to_csv([header] + data)
+    return Task(
+        task_id=task_id,
+        name=name,
+        description=description,
+        schema_description=schema_description,
+        validation_rules=validation_rules,
+        clean_csv=clean_csv,
+        planted_issues=issues,
+        corrupted_csv=corrupted,
+        max_steps=max_steps,
+    )
+def register_task(task_id: str, factory_fn):
+    """Register a custom task factory. Factory must accept (seed: int) -> Task."""
+    TASK_REGISTRY[task_id] = factory_fn
+def register_contamination_rule(name: str, rule_fn):
+    """
+    Register a custom contamination rule.
+    rule_fn signature: (rows, header, col_idx, row_idx, rng) -> (new_value, PlantedIssue)
+    """
+    CONTAMINATION_RULES[name] = rule_fn
 # ---------------------------------------------------------------------------
 # Task registry
 # ---------------------------------------------------------------------------
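
Note on extensibility: `create_task_from_config` accepts either a key of `CONTAMINATION_RULES` or a callable with the signature `(rows, header, col_idx, row_idx, rng) -> (new_value, PlantedIssue)`. The sketch below shows what a user-defined rule could look like and how one contamination spec is applied. The `PlantedIssue` class here is a simplified stand-in for the one in `tasks.py`, the rule `swap_date_format` is hypothetical, and the two-row CSV is made-up sample data.

```python
import csv
import io
import random
from dataclasses import dataclass
from typing import Tuple


@dataclass
class PlantedIssue:  # simplified stand-in for dataqa_env.server.tasks.PlantedIssue
    row: int
    col: str
    issue_type: str
    description: str
    difficulty: float = 1.0


def swap_date_format(rows, header, col_idx, row_idx, rng) -> Tuple[str, PlantedIssue]:
    """Hypothetical rule: rewrite a YYYY-MM-DD date as DD/MM/YYYY."""
    year, month, day = rows[row_idx][col_idx].split("-")
    issue = PlantedIssue(
        row=row_idx + 1,  # planted issues use 1-indexed data rows
        col=header[col_idx],
        issue_type="format_violation",
        description=f"{header[col_idx]} uses DD/MM/YYYY instead of YYYY-MM-DD",
        difficulty=1.5,
    )
    return f"{day}/{month}/{year}", issue


# Made-up sample data for illustration.
CLEAN_CSV = """employee_id,name,start_date
101,Alice Chen,2021-03-15
102,Bob Smith,2020-07-01"""

rows = list(csv.reader(io.StringIO(CLEAN_CSV)))
header, data = rows[0], rows[1:]
rng = random.Random(42)  # unused by this rule, but part of the rule signature

# Same shape as one entry of the `contaminations` list, with a callable rule.
spec = {"rule": swap_date_format, "row": 1, "col": 2}
new_value, issue = spec["rule"](data, header, spec["col"], spec["row"], rng)
data[spec["row"]][spec["col"]] = new_value
```

In the repository itself, the same callable could presumably be passed directly as the `rule` of a contamination entry, or registered once with `register_contamination_rule("swap_date_format", swap_date_format)` and then referenced by name.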

inference.py CHANGED Viewed

@@ -1,26 +1,31 @@
 #!/usr/bin/env python3
 """
-DataQA Inference Script
------------------------
-LLM agent that plays the DataQA environment.
 Uses the OpenAI client to interact with any OpenAI-compatible LLM API.
 Required environment variables:
-    API_BASE_URL  - LLM API endpoint (e.g., https://api.groq.com/openai/v1)
-    MODEL_NAME    - Model identifier (e.g., llama-3.3-70b-versatile)
-    HF_TOKEN      - HuggingFace token (for HF Spaces access)
-Structured logging format: [START], [STEP], [END] tags for evaluation.
 """
 from __future__ import annotations
-import json
 import os
 import re
 import sys
 import time
-from typing import Optional
 import requests
 from openai import OpenAI
@@ -28,52 +33,43 @@ from openai import OpenAI
 # ---------------------------------------------------------------------------
 # Configuration
 # ---------------------------------------------------------------------------
-API_BASE_URL = os.environ.get("API_BASE_URL", "https://api.groq.com/openai/v1")
-MODEL_NAME = os.environ.get("MODEL_NAME", "llama-3.3-70b-versatile")
-HF_TOKEN = os.environ.get("HF_TOKEN", "")
-ENV_URL = os.environ.get("ENV_URL", "http://localhost:8000")
 TASKS = ["easy", "medium", "hard"]
 MAX_STEPS_PER_TASK = 3
 # ---------------------------------------------------------------------------
-# Logging helpers (structured stdout for evaluation)
 # ---------------------------------------------------------------------------
-def log_start(task_id: str, metadata: Optional[dict] = None):
-    entry = {"event": "START", "task_id": task_id, "timestamp": time.time()}
-    if metadata:
-        entry["metadata"] = metadata
-    print(f"[START] {json.dumps(entry)}", flush=True)
-def log_step(task_id: str, step: int, reward: float, details: Optional[dict] = None):
-    entry = {
-        "event": "STEP",
-        "task_id": task_id,
-        "step": step,
-        "reward": reward,
-        "timestamp": time.time(),
-    }
-    if details:
-        entry["details"] = details
-    print(f"[STEP] {json.dumps(entry)}", flush=True)
-def log_end(task_id: str, final_score: float, metadata: Optional[dict] = None):
-    entry = {
-        "event": "END",
-        "task_id": task_id,
-        "final_score": final_score,
-        "timestamp": time.time(),
-    }
-    if metadata:
-        entry["metadata"] = metadata
-    print(f"[END] {json.dumps(entry)}", flush=True)
 # ---------------------------------------------------------------------------
-# Environment HTTP client (simple, no WebSocket needed for inference)
 # ---------------------------------------------------------------------------
 class EnvHTTPClient:
@@ -99,26 +95,21 @@ class EnvHTTPClient:
         r.raise_for_status()
         return r.json()
-    def step(self, issues: list[str], task_id: str = "easy") -> dict:
         r = self.session.post(
             f"{self.base_url}/step",
-            json={"action": {"issues": issues, "task_id": task_id}},
             timeout=30,
         )
         r.raise_for_status()
         return r.json()
-    def state(self) -> dict:
-        r = self.session.get(f"{self.base_url}/state", timeout=10)
-        r.raise_for_status()
-        return r.json()
 # ---------------------------------------------------------------------------
-# LLM Agent
 # ---------------------------------------------------------------------------
-SYSTEM_PROMPT = """You are a data quality analyst. Your job is to inspect datasets and identify data quality issues.
 You will be given:
 1. A dataset in CSV format
@@ -142,7 +133,6 @@ CRITICAL INSTRUCTIONS FOR ROW NUMBERING:
 - Row numbers refer to the ROW POSITION in the CSV data, NOT the value of any ID column
 - Row 1 = the FIRST data row after the header
 - Row 2 = the SECOND data row after the header
-- For example, if the CSV has header on line 1 and data starting on line 2, the data on line 2 is row 1, line 3 is row 2, etc.
 - DO NOT use the employee_id, order_id, or experiment_id as the row number
 - Column names must match exactly (use the CSV header names, lowercase)
 - Check EVERY row and EVERY column systematically
@@ -154,7 +144,26 @@ Respond with ONLY the list of issues, one per line. No other text.
 Example: row:3,col:salary,issue:missing_value"""
-def build_user_prompt(observation: dict) -> str:
     obs = observation if isinstance(observation, dict) else observation
     parts = []
@@ -173,6 +182,12 @@ def build_user_prompt(observation: dict) -> str:
     if feedback and "reset" not in feedback.lower():
         parts.append(f"FEEDBACK FROM PREVIOUS ATTEMPT:\n{feedback}")
     return "\n\n".join(parts)
@@ -183,88 +198,142 @@ def parse_llm_response(response: str) -> list[str]:
         line = line.strip()
         if not line:
             continue
-        # Remove numbering like "1. " or "- " or "* "
         line = re.sub(r"^\s*[\d]+[.\)]\s*", "", line)
         line = re.sub(r"^\s*[-*]\s*", "", line)
         line = line.strip()
         if "row" in line.lower() and "col" in line.lower():
-            # Lenient regex: accept : or = as delimiters, case-insensitive
             match = re.search(
                 r"row\s*[:=]\s*(\d+)\s*[,;\s]+col(?:umn)?\s*[:=]\s*([\w_]+)\s*[,;\s]+issue\s*[:=]\s*([\w_]+)",
                 line,
                 re.IGNORECASE,
             )
             if match:
-                # Normalize to lowercase canonical format
                 normalized = f"row:{match.group(1)},col:{match.group(2).lower()},issue:{match.group(3).lower()}"
                 issues.append(normalized)
     return issues
-def run_task(client: OpenAI, env: EnvHTTPClient, task_id: str) -> float:
-    """Run a single task and return the best score."""
-    log_start(task_id)
-    # Reset environment for this task
-    reset_response = env.reset(task_id=task_id)
-    observation = reset_response.get("observation", reset_response)
     best_score = 0.0
-    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
-    for step_num in range(1, MAX_STEPS_PER_TASK + 1):
-        user_prompt = build_user_prompt(observation)
-        messages_for_call = messages + [{"role": "user", "content": user_prompt}]
-        # Call LLM with retry on rate limit
-        llm_output = ""
-        for attempt in range(3):
-            try:
-                response = client.chat.completions.create(
-                    model=MODEL_NAME,
-                    messages=messages_for_call,
-                    temperature=0.1,
-                    max_tokens=2048,
-                )
-                llm_output = response.choices[0].message.content or ""
                 break
-            except Exception as e:
-                if "rate_limit" in str(e).lower() or "429" in str(e):
-                    wait = 10 * (attempt + 1)
-                    print(f"[WARN] Rate limited, waiting {wait}s...", flush=True)
-                    time.sleep(wait)
-                else:
-                    print(f"[ERROR] LLM call failed: {e}", file=sys.stderr, flush=True)
-                    break
-        # Parse issues from LLM response
-        issues = parse_llm_response(llm_output)
-        if not issues:
-            print(f"[WARN] No issues parsed from LLM response for {task_id} step {step_num}", file=sys.stderr, flush=True)
-        # Submit to environment
-        step_response = env.step(issues, task_id=task_id)
-        observation = step_response.get("observation", step_response)
-        # reward and done are at the top level of the response, not inside observation
-        reward = float(step_response.get("reward", 0.0) or 0.0)
-        done = bool(step_response.get("done", False))
-        best_score = max(best_score, reward)
-        log_step(task_id, step_num, reward, {
-            "issues_reported": len(issues),
-            "feedback": observation.get("feedback", ""),
-        })
-        if done:
-            break
-        # Add context for next attempt
-        messages.append({"role": "user", "content": user_prompt})
-        messages.append({"role": "assistant", "content": llm_output})
-    log_end(task_id, best_score)
     return best_score
@@ -273,49 +342,34 @@ def run_task(client: OpenAI, env: EnvHTTPClient, task_id: str) -> float:
 # ---------------------------------------------------------------------------
 def main():
-    print(f"[INFO] DataQA Inference starting", flush=True)
-    print(f"[INFO] ENV_URL={ENV_URL}", flush=True)
-    print(f"[INFO] API_BASE_URL={API_BASE_URL}", flush=True)
-    print(f"[INFO] MODEL_NAME={MODEL_NAME}", flush=True)
-    # Initialize clients
     env = EnvHTTPClient(ENV_URL)
     llm_client = OpenAI(
         base_url=API_BASE_URL,
-        api_key=os.environ.get("LLM_API_KEY", HF_TOKEN or "no-key"),
     )
-    # Check environment health
     if not env.health():
-        print("[ERROR] Environment is not healthy. Exiting.", file=sys.stderr, flush=True)
         sys.exit(1)
-    print(f"[INFO] Environment is healthy", flush=True)
-    # Run all tasks
     scores = {}
     for task_id in TASKS:
-        print(f"\n{'='*60}", flush=True)
-        print(f"[INFO] Starting task: {task_id}", flush=True)
-        print(f"{'='*60}", flush=True)
         try:
             score = run_task(llm_client, env, task_id)
             scores[task_id] = score
-            print(f"[INFO] Task {task_id} completed with score: {score:.3f}", flush=True)
         except Exception as e:
-            print(f"[ERROR] Task {task_id} failed: {e}", file=sys.stderr, flush=True)
             scores[task_id] = 0.0
-    # Summary
-    print(f"\n{'='*60}", flush=True)
-    print("[INFO] FINAL RESULTS", flush=True)
-    print(f"{'='*60}", flush=True)
-    for task_id, score in scores.items():
-        print(f"[INFO] {task_id}: {score:.3f}", flush=True)
     avg_score = sum(scores.values()) / len(scores) if scores else 0.0
-    print(f"[INFO] Average score: {avg_score:.3f}", flush=True)
 if __name__ == "__main__":

 #!/usr/bin/env python3
 """
+DataQA Inference Script — Two-Phase Agent
+------------------------------------------
+LLM agent that plays the DataQA environment in two phases:
+  Phase 1: Identify all data quality issues
+  Phase 2: Propose fixes for identified issues
 Uses the OpenAI client to interact with any OpenAI-compatible LLM API.
 Required environment variables:
+    API_BASE_URL  - LLM API endpoint (e.g., https://router.huggingface.co/v1)
+    MODEL_NAME    - Model identifier (e.g., Qwen/Qwen2.5-72B-Instruct)
+    HF_TOKEN      - HuggingFace token / API key
+STDOUT FORMAT (mandatory for evaluation):
+    [START] task=<task_name> env=<benchmark> model=<model_name>
+    [STEP]  step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
+    [END]   success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
 """
 from __future__ import annotations
 import os
 import re
 import sys
 import time
+from typing import List, Optional
 import requests
 from openai import OpenAI
 # ---------------------------------------------------------------------------
 # Configuration
 # ---------------------------------------------------------------------------
+API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
+API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
+ENV_URL = os.getenv("ENV_URL", "http://localhost:8000")
+BENCHMARK = "dataqa_env"
 TASKS = ["easy", "medium", "hard"]
 MAX_STEPS_PER_TASK = 3
 # ---------------------------------------------------------------------------
+# Logging helpers (structured stdout — exact format required by evaluation)
 # ---------------------------------------------------------------------------
+def log_start(task: str, env: str, model: str) -> None:
+    print(f"[START] task={task} env={env} model={model}", flush=True)
+def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
+    error_val = error if error else "null"
+    done_val = str(done).lower()
+    print(
+        f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
+        flush=True,
+    )
+def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
+    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+    print(
+        f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}",
+        flush=True,
+    )
 # ---------------------------------------------------------------------------
+# Environment HTTP client
 # ---------------------------------------------------------------------------
 class EnvHTTPClient:
         r.raise_for_status()
         return r.json()
+    def step(self, issues: list[str], fixes: list[str], task_id: str = "easy") -> dict:
         r = self.session.post(
             f"{self.base_url}/step",
+            json={"action": {"issues": issues, "fixes": fixes, "task_id": task_id}},
             timeout=30,
         )
         r.raise_for_status()
         return r.json()
 # ---------------------------------------------------------------------------
+# LLM Prompts
 # ---------------------------------------------------------------------------
+IDENTIFY_SYSTEM_PROMPT = """You are a data quality analyst. Your job is to inspect datasets and identify data quality issues.
 You will be given:
 1. A dataset in CSV format
 - Row numbers refer to the ROW POSITION in the CSV data, NOT the value of any ID column
 - Row 1 = the FIRST data row after the header
 - Row 2 = the SECOND data row after the header
 - DO NOT use the employee_id, order_id, or experiment_id as the row number
 - Column names must match exactly (use the CSV header names, lowercase)
 - Check EVERY row and EVERY column systematically
 Example: row:3,col:salary,issue:missing_value"""
+FIX_SYSTEM_PROMPT = """You are a data repair specialist. You have already identified data quality issues in a dataset. Now you must propose the correct values to fix each issue.
+For each issue you identified, propose a fix in EXACTLY this format:
+row:<row_number>,col:<column_name>,fix:<corrected_value>
+Guidelines for proposing fixes:
+- For missing_value: infer the correct value from context, schema, and other rows
+- For wrong_type: convert to the correct type (e.g., "seventy-five thousand" → "75000")
+- For out_of_range: propose a value within the valid range that makes sense in context
+- For format_violation: correct the format (e.g., "26/01/2024" → "2024-01-26")
+- For inconsistent_value: compute the correct value from related fields
+- For duplicate_row: propose a corrected unique key or indicate removal
+- For statistical_outlier: propose a reasonable value given the model/context
+Use the schema, validation rules, and surrounding data to determine the correct fix.
+Respond with ONLY the list of fixes, one per line. No other text.
+Example: row:3,col:salary,fix:75000"""
+def build_user_prompt(observation: dict, include_fixes: bool = False) -> str:
     obs = observation if isinstance(observation, dict) else observation
     parts = []
     if feedback and "reset" not in feedback.lower():
         parts.append(f"FEEDBACK FROM PREVIOUS ATTEMPT:\n{feedback}")
+    if include_fixes:
+        parts.append(
+            "Now propose fixes for ALL issues. "
+            "Use format: row:<N>,col:<name>,fix:<corrected_value>"
+        )
     return "\n\n".join(parts)
         line = line.strip()
         if not line:
             continue
         line = re.sub(r"^\s*[\d]+[.\)]\s*", "", line)
         line = re.sub(r"^\s*[-*]\s*", "", line)
         line = line.strip()
         if "row" in line.lower() and "col" in line.lower():
             match = re.search(
                 r"row\s*[:=]\s*(\d+)\s*[,;\s]+col(?:umn)?\s*[:=]\s*([\w_]+)\s*[,;\s]+issue\s*[:=]\s*([\w_]+)",
                 line,
                 re.IGNORECASE,
             )
             if match:
                 normalized = f"row:{match.group(1)},col:{match.group(2).lower()},issue:{match.group(3).lower()}"
                 issues.append(normalized)
     return issues
+def parse_fix_response(response: str) -> list[str]:
+    """Extract fix lines from LLM response."""
+    fixes = []
+    for line in response.strip().split("\n"):
+        line = line.strip()
+        if not line:
+            continue
+        line = re.sub(r"^\s*[\d]+[.\)]\s*", "", line)
+        line = re.sub(r"^\s*[-*]\s*", "", line)
+        line = line.strip()
+        if "row" in line.lower() and "fix" in line.lower():
+            match = re.search(
+                r"row\s*[:=]\s*(\d+)\s*[,;\s]+col(?:umn)?\s*[:=]\s*([\w_]+)\s*[,;\s]+fix\s*[:=]\s*(.+?)$",
+                line,
+                re.IGNORECASE,
+            )
+            if match:
+                normalized = f"row:{match.group(1)},col:{match.group(2).lower()},fix:{match.group(3).strip()}"
+                fixes.append(normalized)
+    return fixes
+def call_llm(client: OpenAI, system_prompt: str, user_prompt: str) -> str:
+    """Call the LLM with retry on rate limit."""
+    for attempt in range(3):
+        try:
+            response = client.chat.completions.create(
+                model=MODEL_NAME,
+                messages=[
+                    {"role": "system", "content": system_prompt},
+                    {"role": "user", "content": user_prompt},
+                ],
+                temperature=0.1,
+                max_tokens=2048,
+            )
+            return response.choices[0].message.content or ""
+        except Exception as e:
+            if "rate_limit" in str(e).lower() or "429" in str(e):
+                wait = 10 * (attempt + 1)
+                print(f"[DEBUG] Rate limited, waiting {wait}s...", file=sys.stderr, flush=True)
+                time.sleep(wait)
+            else:
+                print(f"[DEBUG] LLM call failed: {e}", file=sys.stderr, flush=True)
+                return ""
+    return ""
+def run_task(client: OpenAI, env: EnvHTTPClient, task_id: str) -> float:
+    """
+    Run a single task with two-phase strategy:
+      Step 1: Identify issues only
+      Step 2: Identify + Fix (using feedback from step 1)
+      Step 3: Refined identify + fix (if needed)
+    """
+    log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
+    rewards: List[float] = []
+    steps_taken = 0
     best_score = 0.0
+    success = False
+    try:
+        reset_response = env.reset(task_id=task_id)
+        observation = reset_response.get("observation", reset_response)
+        last_issues: list[str] = []
+        last_llm_output = ""
+        for step_num in range(1, MAX_STEPS_PER_TASK + 1):
+            error_msg = None
+            # ── Phase 1: Identify issues ──
+            user_prompt = build_user_prompt(observation)
+            identify_output = call_llm(client, IDENTIFY_SYSTEM_PROMPT, user_prompt)
+            issues = parse_llm_response(identify_output)
+            if not issues and not error_msg:
+                error_msg = "no issues parsed from LLM response"
+            # ── Phase 2: Propose fixes (from step 2 onward, and only if issues were found) ──
+            fixes: list[str] = []
+            if issues and step_num >= 2:
+                # Build a fix prompt that includes the identified issues
+                fix_prompt = build_user_prompt(observation, include_fixes=True)
+                fix_prompt += f"\n\nISSUES FOUND:\n" + "\n".join(issues)
+                fix_output = call_llm(client, FIX_SYSTEM_PROMPT, fix_prompt)
+                fixes = parse_fix_response(fix_output)
+            # ── Submit to environment ──
+            action_str = ";".join(issues[:5]) if issues else "none"
+            if fixes:
+                action_str += "|fixes:" + ";".join(fixes[:3])
+            step_response = env.step(issues, fixes, task_id=task_id)
+            observation = step_response.get("observation", step_response)
+            reward = float(step_response.get("reward", 0.0) or 0.0)
+            done = bool(step_response.get("done", False))
+            best_score = max(best_score, reward)
+            rewards.append(reward)
+            steps_taken = step_num
+            log_step(
+                step=step_num,
+                action=action_str,
+                reward=reward,
+                done=done,
+                error=error_msg,
+            )
+            if done:
                 break
+            last_issues = issues
+            last_llm_output = identify_output
+        success = best_score >= 0.5
+    finally:
+        log_end(success=success, steps=steps_taken, score=best_score, rewards=rewards)
     return best_score
 # ---------------------------------------------------------------------------
 def main():
+    print(f"[DEBUG] DataQA Inference starting", file=sys.stderr, flush=True)
+    print(f"[DEBUG] ENV_URL={ENV_URL}", file=sys.stderr, flush=True)
+    print(f"[DEBUG] API_BASE_URL={API_BASE_URL}", file=sys.stderr, flush=True)
+    print(f"[DEBUG] MODEL_NAME={MODEL_NAME}", file=sys.stderr, flush=True)
     env = EnvHTTPClient(ENV_URL)
     llm_client = OpenAI(
         base_url=API_BASE_URL,
+        api_key=API_KEY or "no-key",
     )
     if not env.health():
+        print("[DEBUG] Environment is not healthy. Exiting.", file=sys.stderr, flush=True)
         sys.exit(1)
+    print(f"[DEBUG] Environment is healthy", file=sys.stderr, flush=True)
     scores = {}
     for task_id in TASKS:
         try:
             score = run_task(llm_client, env, task_id)
             scores[task_id] = score
         except Exception as e:
+            print(f"[DEBUG] Task {task_id} failed: {e}", file=sys.stderr, flush=True)
             scores[task_id] = 0.0
     avg_score = sum(scores.values()) / len(scores) if scores else 0.0
+    print(f"\n[DEBUG] FINAL RESULTS: {scores} avg={avg_score:.3f}", file=sys.stderr, flush=True)
 if __name__ == "__main__":
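The `[STEP]` line format documented in the inference.py docstring above can be read back with a small parser. The following is a minimal sketch and is not part of this commit: `STEP_RE` and `parse_step_line` are hypothetical names, and the pattern assumes the field order and the two-decimal reward shown in the docstring.

```python
import re

# Hypothetical helper (not in the commit): parses one "[STEP] ..." stdout line.
# Assumes the field order step, action, reward, done, error from the docstring.
STEP_RE = re.compile(
    r"^\[STEP\]\s+step=(\d+) action=(.*) reward=(-?\d+\.\d{2}) "
    r"done=(true|false) error=(.*)$"
)


def parse_step_line(line):
    """Return a dict for a [STEP] line, or None if the line does not match."""
    m = STEP_RE.match(line.strip())
    if m is None:
        return None
    step, action, reward, done, error = m.groups()
    return {
        "step": int(step),
        "action": action,
        "reward": float(reward),
        "done": done == "true",
        # The logger writes the literal string "null" when there is no error.
        "error": None if error == "null" else error,
    }


if __name__ == "__main__":
    sample = (
        "[STEP] step=2 action=row:3,col:salary,issue:missing_value"
        "|fixes:row:3,col:salary,fix:75000 reward=0.45 done=false error=null"
    )
    print(parse_step_line(sample))
```

The action field is matched greedily because a fix value may itself contain spaces; the parser relies on the trailing ` reward=` field to find where the action ends.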

models.py DELETED Viewed

@@ -1,4 +0,0 @@
-"""Root-level models for OpenEnv compatibility."""
-from dataqa_env.models import DataQAAction, DataQAObservation, DataQAState
-__all__ = ["DataQAAction", "DataQAObservation", "DataQAState"]

openenv.yaml CHANGED Viewed

@@ -3,4 +3,4 @@ name: dataqa_env
 type: space
 runtime: fastapi
 app: dataqa_env.server.app:app
-port: 8000

 type: space
 runtime: fastapi
 app: dataqa_env.server.app:app
+port: 7860

scripts/prevalidation_script.sh ADDED Viewed

	@@ -0,0 +1,185 @@

+#!/usr/bin/env bash
+#
+# validate-submission.sh — OpenEnv Submission Validator
+#
+# Checks that your HF Space is live, Docker image builds, and openenv validate passes.
+#
+# Prerequisites:
+#   - Docker:       https://docs.docker.com/get-docker/
+#   - openenv-core: pip install openenv-core
+#   - curl (usually pre-installed)
+#
+# Run:
+#   curl -fsSL https://raw.githubusercontent.com/<owner>/<repo>/main/scripts/validate-submission.sh | bash -s -- <ping_url> [repo_dir]
+#
+#   Or download and run locally:
+#     chmod +x validate-submission.sh
+#     ./validate-submission.sh <ping_url> [repo_dir]
+#
+# Arguments:
+#   ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)
+#   repo_dir   Path to your repo (default: current directory)
+#
+# Examples:
+#   ./validate-submission.sh https://my-team.hf.space
+#   ./validate-submission.sh https://my-team.hf.space ./my-repo
+#
+set -uo pipefail
+DOCKER_BUILD_TIMEOUT=600
+if [ -t 1 ]; then
+  RED='\033[0;31m'
+  GREEN='\033[0;32m'
+  YELLOW='\033[1;33m'
+  BOLD='\033[1m'
+  NC='\033[0m'
+else
+  RED='' GREEN='' YELLOW='' BOLD='' NC=''
+fi
+run_with_timeout() {
+  local secs="$1"; shift
+  if command -v timeout &>/dev/null; then
+    timeout "$secs" "$@"
+  elif command -v gtimeout &>/dev/null; then
+    gtimeout "$secs" "$@"
+  else
+    "$@" &
+    local pid=$!
+    ( sleep "$secs" && kill "$pid" 2>/dev/null ) &
+    local watcher=$!
+    wait "$pid" 2>/dev/null
+    local rc=$?
+    kill "$watcher" 2>/dev/null
+    wait "$watcher" 2>/dev/null
+    return $rc
+  fi
+}
+portable_mktemp() {
+  local prefix="${1:-validate}"
+  mktemp "${TMPDIR:-/tmp}/${prefix}-XXXXXX" 2>/dev/null || mktemp
+}
+CLEANUP_FILES=()
+cleanup() { rm -f "${CLEANUP_FILES[@]+"${CLEANUP_FILES[@]}"}"; }
+trap cleanup EXIT
+PING_URL="${1:-}"
+REPO_DIR="${2:-.}"
+if [ -z "$PING_URL" ]; then
+  printf "Usage: %s <ping_url> [repo_dir]\n" "$0"
+  printf "\n"
+  printf "  ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)\n"
+  printf "  repo_dir   Path to your repo (default: current directory)\n"
+  exit 1
+fi
+if ! REPO_DIR="$(cd "$REPO_DIR" 2>/dev/null && pwd)"; then
+  printf "Error: directory '%s' not found\n" "${2:-.}"
+  exit 1
+fi
+PING_URL="${PING_URL%/}"
+export PING_URL
+PASS=0
+log()  { printf "[%s] %b\n" "$(date -u +%H:%M:%S)" "$*"; }
+pass() { log "${GREEN}PASSED${NC} -- $1"; PASS=$((PASS + 1)); }
+fail() { log "${RED}FAILED${NC} -- $1"; }
+hint() { printf "  ${YELLOW}Hint:${NC} %b\n" "$1"; }
+stop_at() {
+  printf "\n"
+  printf "${RED}${BOLD}Validation stopped at %s.${NC} Fix the above before continuing.\n" "$1"
+  exit 1
+}
+printf "\n"
+printf "${BOLD}========================================${NC}\n"
+printf "${BOLD}  OpenEnv Submission Validator${NC}\n"
+printf "${BOLD}========================================${NC}\n"
+log "Repo:     $REPO_DIR"
+log "Ping URL: $PING_URL"
+printf "\n"
+log "${BOLD}Step 1/3: Pinging HF Space${NC} ($PING_URL/reset) ..."
+CURL_OUTPUT=$(portable_mktemp "validate-curl")
+CLEANUP_FILES+=("$CURL_OUTPUT")
+HTTP_CODE=$(curl -s -o "$CURL_OUTPUT" -w "%{http_code}" -X POST \
+  -H "Content-Type: application/json" -d '{}' \
+  "$PING_URL/reset" --max-time 30 2>"$CURL_OUTPUT" || printf "000")
+if [ "$HTTP_CODE" = "200" ]; then
+  pass "HF Space is live and responds to /reset"
+elif [ "$HTTP_CODE" = "000" ]; then
+  fail "HF Space not reachable (connection failed or timed out)"
+  hint "Check your network connection and that the Space is running."
+  hint "Try: curl -s -o /dev/null -w '%%{http_code}' -X POST $PING_URL/reset"
+  stop_at "Step 1"
+else
+  fail "HF Space /reset returned HTTP $HTTP_CODE (expected 200)"
+  hint "Make sure your Space is running and the URL is correct."
+  hint "Try opening $PING_URL in your browser first."
+  stop_at "Step 1"
+fi
+log "${BOLD}Step 2/3: Running docker build${NC} ..."
+if ! command -v docker &>/dev/null; then
+  fail "docker command not found"
+  hint "Install Docker: https://docs.docker.com/get-docker/"
+  stop_at "Step 2"
+fi
+if [ -f "$REPO_DIR/Dockerfile" ]; then
+  DOCKER_CONTEXT="$REPO_DIR"
+elif [ -f "$REPO_DIR/server/Dockerfile" ]; then
+  DOCKER_CONTEXT="$REPO_DIR/server"
+else
+  fail "No Dockerfile found in repo root or server/ directory"
+  stop_at "Step 2"
+fi
+log "  Found Dockerfile in $DOCKER_CONTEXT"
+BUILD_OK=false
+BUILD_OUTPUT=$(run_with_timeout "$DOCKER_BUILD_TIMEOUT" docker build "$DOCKER_CONTEXT" 2>&1) && BUILD_OK=true
+if [ "$BUILD_OK" = true ]; then
+  pass "Docker build succeeded"
+else
+  fail "Docker build failed (timeout=${DOCKER_BUILD_TIMEOUT}s)"
+  printf "%s\n" "$BUILD_OUTPUT" | tail -20
+  stop_at "Step 2"
+fi
+log "${BOLD}Step 3/3: Running openenv validate${NC} ..."
+if ! command -v openenv &>/dev/null; then
+  fail "openenv command not found"
+  hint "Install it: pip install openenv-core"
+  stop_at "Step 3"
+fi
+VALIDATE_OK=false
+VALIDATE_OUTPUT=$(cd "$REPO_DIR" && openenv validate 2>&1) && VALIDATE_OK=true
+if [ "$VALIDATE_OK" = true ]; then
+  pass "openenv validate passed"
+  [ -n "$VALIDATE_OUTPUT" ] && log "  $VALIDATE_OUTPUT"
+else
+  fail "openenv validate failed"
+  printf "%s\n" "$VALIDATE_OUTPUT"
+  stop_at "Step 3"
+fi
+printf "\n"
+printf "${BOLD}========================================${NC}\n"
+printf "${GREEN}${BOLD}  All 3/3 checks passed!${NC}\n"
+printf "${GREEN}${BOLD}  Your submission is ready to submit.${NC}\n"
+printf "${BOLD}========================================${NC}\n"
+printf "\n"
+exit 0
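The `run_with_timeout` shell helper above falls back to a background watcher when neither `timeout` nor `gtimeout` is installed. For comparison, a rough Python equivalent built on `subprocess.run` could look like the sketch below. It is illustrative only and not part of this commit; the function name is reused for readability, and returning 124 on expiry is an assumption that mirrors the exit status GNU `timeout` reports.

```python
import subprocess
import sys

TIMEOUT_EXIT_CODE = 124  # assumption: same status GNU `timeout` uses on expiry


def run_with_timeout(secs, cmd):
    """Run cmd (a list of arguments) and return (exit_code, combined_output)."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout, like 2>&1
            text=True,
            timeout=secs,
        )
    except subprocess.TimeoutExpired as exc:
        # Partial output may be None or bytes depending on the platform.
        partial = exc.stdout or ""
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return TIMEOUT_EXIT_CODE, partial
    return proc.returncode, proc.stdout


if __name__ == "__main__":
    rc, out = run_with_timeout(5, [sys.executable, "-c", "print('ok')"])
    print(rc, out.strip())
```

Unlike the shell fallback, `subprocess.run` kills the child itself when the timeout expires, so no separate watcher process is needed.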

scripts/sample_inference_script.py ADDED Viewed

	@@ -0,0 +1,188 @@

+"""
+Inference Script Example
+===================================
+MANDATORY
+- Before submitting, ensure the following variables are defined in your environment configuration:
+    API_BASE_URL   The API endpoint for the LLM.
+    MODEL_NAME     The model identifier to use for inference.
+    HF_TOKEN       Your Hugging Face / API key.
+    LOCAL_IMAGE_NAME The name of the local image to use for the environment if you are using from_docker_image()
+                     method
+- Defaults are set only for API_BASE_URL and MODEL_NAME
+    (and should reflect your active inference setup):
+    API_BASE_URL = os.getenv("API_BASE_URL", "<your-active-endpoint>")
+    MODEL_NAME = os.getenv("MODEL_NAME", "<your-active-model>")
+- The inference script must be named `inference.py` and placed in the root directory of the project
+- Participants must use OpenAI Client for all LLM calls using above variables
+STDOUT FORMAT
+- The script must emit exactly three line types to stdout, in this order:
+    [START] task=<task_name> env=<benchmark> model=<model_name>
+    [STEP]  step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
+    [END]   success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
+  Rules:
+    - One [START] line at episode begin.
+    - One [STEP] line per step, immediately after env.step() returns.
+    - One [END] line after env.close(), always emitted (even on exception).
+    - reward and rewards are formatted to 2 decimal places.
+    - done and success are lowercase booleans: true or false.
+    - error is the raw last_action_error string, or null if none.
+    - All fields on a single line with no newlines within a line.
+    - Each task should return a score in [0, 1]
+  Example:
+    [START] task=click-test env=miniwob model=Qwen3-VL-30B
+    [STEP] step=1 action=click('123') reward=0.00 done=false error=null
+    [STEP] step=2 action=fill('456','text') reward=0.00 done=false error=null
+    [STEP] step=3 action=click('789') reward=1.00 done=true error=null
+    [END] success=true steps=3 score=1.00 rewards=0.00,0.00,1.00
+"""
+import asyncio
+import os
+import textwrap
+from typing import List, Optional
+from openai import OpenAI
+from my_env_v4 import MyEnvV4Action, MyEnvV4Env
+IMAGE_NAME = os.getenv("IMAGE_NAME") # If you are using docker image
+API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
+API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
+MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
+TASK_NAME = os.getenv("MY_ENV_V4_TASK", "echo")
+BENCHMARK = os.getenv("MY_ENV_V4_BENCHMARK", "my_env_v4")
+MAX_STEPS = 8
+TEMPERATURE = 0.7
+MAX_TOKENS = 150
+SUCCESS_SCORE_THRESHOLD = 0.1  # normalized score in [0, 1]
+# Max possible reward: each token contributes 0.1, across all steps
+_MAX_REWARD_PER_STEP = MAX_TOKENS * 0.1
+MAX_TOTAL_REWARD = MAX_STEPS * _MAX_REWARD_PER_STEP
+SYSTEM_PROMPT = textwrap.dedent(
+    """
+    You are interacting with a simple echo environment.
+    Each turn you must send a message. The environment will echo it back.
+    Reward is proportional to message length: reward = len(message) * 0.1
+    Your goal is to maximize total reward by sending meaningful, substantive messages.
+    Reply with exactly one message string — no quotes, no prefixes, just the message text.
+    """
+).strip()
+def log_start(task: str, env: str, model: str) -> None:
+    print(f"[START] task={task} env={env} model={model}", flush=True)
+def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
+    error_val = error if error else "null"
+    done_val = str(done).lower()
+    print(
+        f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
+        flush=True,
+    )
+def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
+    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+    print(f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}", flush=True)
+def build_user_prompt(step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
+    history_block = "\n".join(history[-4:]) if history else "None"
+    return textwrap.dedent(
+        f"""
+        Step: {step}
+        Last echoed message: {last_echoed!r}
+        Last reward: {last_reward:.2f}
+        Previous steps:
+        {history_block}
+        Send your next message.
+        """
+    ).strip()
+def get_model_message(client: OpenAI, step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
+    user_prompt = build_user_prompt(step, last_echoed, last_reward, history)
+    try:
+        completion = client.chat.completions.create(
+            model=MODEL_NAME,
+            messages=[
+                {"role": "system", "content": SYSTEM_PROMPT},
+                {"role": "user", "content": user_prompt},
+            ],
+            temperature=TEMPERATURE,
+            max_tokens=MAX_TOKENS,
+            stream=False,
+        )
+        text = (completion.choices[0].message.content or "").strip()
+        return text if text else "hello"
+    except Exception as exc:
+        print(f"[DEBUG] Model request failed: {exc}", flush=True)
+        return "hello"
+async def main() -> None:
+    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+    env = await MyEnvV4Env.from_docker_image(IMAGE_NAME)
+    history: List[str] = []
+    rewards: List[float] = []
+    steps_taken = 0
+    score = 0.0
+    success = False
+    log_start(task=TASK_NAME, env=BENCHMARK, model=MODEL_NAME)
+    try:
+        result = await env.reset() # OpenENV.reset()
+        last_echoed = result.observation.echoed_message
+        last_reward = 0.0
+        for step in range(1, MAX_STEPS + 1):
+            if result.done:
+                break
+            message = get_model_message(client, step, last_echoed, last_reward, history)
+            result = await env.step(MyEnvV4Action(message=message))
+            obs = result.observation
+            reward = result.reward or 0.0
+            done = result.done
+            error = None
+            rewards.append(reward)
+            steps_taken = step
+            last_echoed = obs.echoed_message
+            last_reward = reward
+            log_step(step=step, action=message, reward=reward, done=done, error=error)
+            history.append(f"Step {step}: {message!r} -> reward {reward:+.2f}")
+            if done:
+                break
+        score = sum(rewards) / MAX_TOTAL_REWARD if MAX_TOTAL_REWARD > 0 else 0.0
+        score = min(max(score, 0.0), 1.0)  # clamp to [0, 1]
+        success = score >= SUCCESS_SCORE_THRESHOLD
+    finally:
+        try:
+            await env.close()
+        except Exception as e:
+            print(f"[DEBUG] env.close() error (container cleanup): {e}", flush=True)
+        log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
+if __name__ == "__main__":
+    asyncio.run(main())
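
Note on the scoring step in `main()` above: the final score is the sum of per-step rewards divided by `MAX_TOTAL_REWARD`, clamped to [0, 1]. A minimal standalone sketch of that arithmetic follows. The helper name `normalize_score` is hypothetical and does not appear in the script.

```python
from typing import List


def normalize_score(rewards: List[float], max_total_reward: float) -> float:
    """Sum per-step rewards, scale by the maximum total, and clamp to [0, 1].

    Hypothetical helper mirroring the two score lines in main(); a
    non-positive maximum yields 0.0, as in the script.
    """
    if max_total_reward <= 0:
        return 0.0
    raw = sum(rewards) / max_total_reward
    return min(max(raw, 0.0), 1.0)


if __name__ == "__main__":
    print(normalize_score([0.25, 0.5], 1.0))
```

Clamping keeps the reported score in range even if the environment returns negative rewards or the summed rewards exceed the configured maximum.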

server/__init__.py CHANGED

@@ -0,0 +1 @@

+"""Root-level server package — delegates to dataqa_env.server."""

server/app.py CHANGED

@@ -1,13 +1,12 @@
-"""
-Root-level server entry point for OpenEnv compatibility.
-"""
 from dataqa_env.server.app import app  # noqa: F401
 def main():
     import uvicorn
-    uvicorn.run(app, host="0.0.0.0", port=8000)
 if __name__ == "__main__":

+"""Entrypoint for openenv-core deployment. Delegates to dataqa_env.server.app."""
 from dataqa_env.server.app import app  # noqa: F401
 def main():
+    """Start the environment server."""
     import uvicorn
+    uvicorn.run(app, host="0.0.0.0", port=7860)
 if __name__ == "__main__":

tests/__init__.py ADDED

File without changes

tests/test_environment.py ADDED

@@ -0,0 +1,455 @@

+"""Tests for the DataQA environment (reset, step, scoring, two-phase identify+fix)."""
+import pytest
+from dataqa_env.server.environment import (
+    DataQAEnvironment,
+    parse_issue_key,
+    parse_fix,
+    compute_f1,
+    compute_weighted_reward,
+    grade_fixes,
+    IDENTIFY_WEIGHT,
+    FIX_WEIGHT,
+)
+from dataqa_env.models import DataQAAction
+from dataqa_env.server.tasks import PlantedIssue, create_task_easy, create_task_medium
+# ──────────────────────────────────────────────────────
+# Issue parsing
+# ──────────────────────────────────────────────────────
+class TestParseIssueKey:
+    def test_standard_format(self):
+        assert parse_issue_key("row:3,col:salary,issue:missing_value") == "row:3,col:salary,issue:missing_value"
+    def test_with_equals(self):
+        assert parse_issue_key("row=3,col=salary,issue=missing_value") == "row:3,col:salary,issue:missing_value"
+    def test_case_insensitive(self):
+        assert parse_issue_key("Row:3,Col:Salary,Issue:Missing_Value") == "row:3,col:salary,issue:missing_value"
+    def test_with_spaces(self):
+        assert parse_issue_key("row: 3, col: salary, issue: missing_value") == "row:3,col:salary,issue:missing_value"
+    def test_unparseable(self):
+        assert parse_issue_key("this is garbage") is None
+    def test_partial_match(self):
+        assert parse_issue_key("row:3,col:salary") is None
+    def test_empty_string(self):
+        assert parse_issue_key("") is None
+    def test_semicolon_separator(self):
+        result = parse_issue_key("row:3;col:salary;issue:missing_value")
+        assert result == "row:3,col:salary,issue:missing_value"
+# ──────────────────────────────────────────────────────
+# Fix parsing
+# ──────────────────────────────────────────────────────
+class TestParseFix:
+    def test_standard_format(self):
+        result = parse_fix("row:4,col:name,fix:Alice Chen")
+        assert result == (4, "name", "Alice Chen")
+    def test_with_equals(self):
+        result = parse_fix("row=4,col=name,fix=Alice Chen")
+        assert result == (4, "name", "Alice Chen")
+    def test_numeric_fix(self):
+        result = parse_fix("row:7,col:salary,fix:75000")
+        assert result == (7, "salary", "75000")
+    def test_date_fix(self):
+        result = parse_fix("row:12,col:order_date,fix:2024-01-26")
+        assert result == (12, "order_date", "2024-01-26")
+    def test_case_insensitive(self):
+        result = parse_fix("Row:4,Col:Name,Fix:Alice Chen")
+        assert result == (4, "name", "Alice Chen")
+    def test_unparseable(self):
+        assert parse_fix("garbage") is None
+        assert parse_fix("row:4,col:name") is None
+    def test_fix_with_special_chars(self):
+        result = parse_fix("row:1,col:email,fix:alice.chen@company.com")
+        assert result == (1, "email", "alice.chen@company.com")
+# ──────────────────────────────────────────────────────
+# F1 scoring
+# ──────────────────────────────────────────────────────
+class TestComputeF1:
+    def test_perfect_match(self):
+        keys = {"row:1,col:a,issue:missing_value"}
+        result = compute_f1(keys, keys)
+        assert result["f1"] == 1.0
+    def test_no_reported_no_planted(self):
+        result = compute_f1(set(), set())
+        assert result["f1"] == 1.0
+    def test_no_reported_some_planted(self):
+        planted = {"row:1,col:a,issue:missing_value"}
+        result = compute_f1(set(), planted)
+        assert result["f1"] == 0.0
+        assert result["fn"] == 1
+    def test_all_false_positives(self):
+        reported = {"row:99,col:x,issue:wrong_type"}
+        planted = {"row:1,col:a,issue:missing_value"}
+        result = compute_f1(reported, planted)
+        assert result["f1"] == 0.0
+    def test_partial_match(self):
+        reported = {"row:1,col:a,issue:missing_value", "row:2,col:b,issue:wrong_type"}
+        planted = {"row:1,col:a,issue:missing_value", "row:3,col:c,issue:duplicate_row"}
+        result = compute_f1(reported, planted)
+        assert result["tp"] == 1
+        assert result["fp"] == 1
+        assert result["fn"] == 1
+        assert 0 < result["f1"] < 1
+    def test_precision_recall_calculation(self):
+        reported = {"a", "b", "c"}
+        planted = {"a", "b", "d"}
+        result = compute_f1(reported, planted)
+        assert result["precision"] == pytest.approx(2 / 3)
+        assert result["recall"] == pytest.approx(2 / 3)
+# ──────────────────────────────────────────────────────
+# Weighted reward
+# ──────────────────────────────────────────────────────
+class TestComputeWeightedReward:
+    def test_perfect_match(self):
+        issues = [
+            PlantedIssue(row=1, col="a", issue_type="missing_value", description="", difficulty=1.0),
+            PlantedIssue(row=2, col="b", issue_type="wrong_type", description="", difficulty=3.0),
+        ]
+        reported = {i.to_key() for i in issues}
+        result = compute_weighted_reward(reported, issues)
+        assert result["weighted_reward"] == 1.0
+    def test_empty_both(self):
+        result = compute_weighted_reward(set(), [])
+        assert result["weighted_reward"] == 1.0
+    def test_no_reported(self):
+        issues = [PlantedIssue(row=1, col="a", issue_type="missing_value", description="", difficulty=2.0)]
+        result = compute_weighted_reward(set(), issues)
+        assert result["weighted_reward"] == 0.0
+    def test_hard_issue_worth_more(self):
+        easy = PlantedIssue(row=1, col="a", issue_type="missing_value", description="", difficulty=1.0)
+        hard = PlantedIssue(row=2, col="b", issue_type="statistical_outlier", description="", difficulty=3.0)
+        issues = [easy, hard]
+        hard_found = compute_weighted_reward({hard.to_key()}, issues)
+        easy_found = compute_weighted_reward({easy.to_key()}, issues)
+        assert hard_found["weighted_reward"] > easy_found["weighted_reward"]
+    def test_false_positives_reduce_reward(self):
+        issues = [PlantedIssue(row=1, col="a", issue_type="missing_value", description="", difficulty=1.0)]
+        correct = {issues[0].to_key()}
+        with_fp = correct | {"row:99,col:x,issue:wrong_type"}
+        r_correct = compute_weighted_reward(correct, issues)
+        r_with_fp = compute_weighted_reward(with_fp, issues)
+        assert r_correct["weighted_reward"] > r_with_fp["weighted_reward"]
+# ──────────────────────────────────────────────────────
+# Fix grading
+# ──────────────────────────────────────────────────────
+class TestGradeFixes:
+    @pytest.fixture
+    def easy_task(self):
+        return create_task_easy()
+    def test_no_fixes_no_issues(self):
+        from dataqa_env.server.tasks import Task
+        task = Task(task_id="empty", name="", description="", schema_description="",
+                    validation_rules="", clean_csv="a\n1")
+        result = grade_fixes([], task)
+        assert result["fix_score"] == 1.0
+    def test_no_fixes_submitted(self, easy_task):
+        result = grade_fixes([], easy_task)
+        assert result["fix_score"] == 0.0
+        assert result["fixes_attempted"] == 0
+    def test_exact_fix_for_missing_name(self, easy_task):
+        # Row 4 has empty name — clean value is "David Kim"
+        fixes = [(4, "name", "David Kim")]
+        result = grade_fixes(fixes, easy_task)
+        assert result["fix_score"] > 0.0
+        assert result["fixes_correct"] == 1
+    def test_exact_fix_for_wrong_type_salary(self, easy_task):
+        # Row 7 has "seventy-five thousand" — clean value is "75000"
+        fixes = [(7, "salary", "75000")]
+        result = grade_fixes(fixes, easy_task)
+        assert result["fixes_correct"] == 1
+    def test_numeric_close_match(self, easy_task):
+        # Row 9 has salary "5000" — clean value is "73000"
+        # Propose 73100 (within 1% of 73000)
+        fixes = [(9, "salary", "73100")]
+        result = grade_fixes(fixes, easy_task)
+        assert result["fixes_partial"] == 1
+    def test_wrong_value_for_issue_cell(self, easy_task):
+        # Row 4 name is empty — propose wrong name
+        fixes = [(4, "name", "Wrong Person")]
+        result = grade_fixes(fixes, easy_task)
+        assert result["fixes_partial"] == 1  # correct cell, wrong value
+        assert result["fix_score"] > 0.0  # gets partial credit
+    def test_fix_for_non_issue_cell(self, easy_task):
+        # Row 1 col name is fine — no issue there
+        fixes = [(1, "name", "Some Name")]
+        result = grade_fixes(fixes, easy_task)
+        assert result["fixes_wrong"] == 1
+        assert result["fix_score"] == 0.0
+    def test_multiple_fixes_best_wins(self, easy_task):
+        # Submit two fixes for same cell — best one should count
+        fixes = [
+            (4, "name", "Wrong Person"),   # partial credit
+            (4, "name", "David Kim"),      # exact match
+        ]
+        result = grade_fixes(fixes, easy_task)
+        assert result["fixes_correct"] >= 1
+    def test_all_fixes_correct(self, easy_task):
+        # Fix most issues with exact values
+        fixes = [
+            (4, "name", "David Kim"),
+            (7, "salary", "75000"),
+            (9, "salary", "73000"),
+            (15, "email", "oscar.rivera@company.com"),
+            (18, "start_date", "2022-01-19"),
+        ]
+        result = grade_fixes(fixes, easy_task)
+        assert result["fix_score"] > 0.7  # 5 out of 6 issues fixed (duplicate can't be fixed)
+    def test_fix_score_bounded(self, easy_task):
+        fixes = [(4, "name", "David Kim"), (99, "x", "bad")]
+        result = grade_fixes(fixes, easy_task)
+        assert 0.0 <= result["fix_score"] <= 1.0
+# ──────────────────────────────────────────────────────
+# Full environment lifecycle
+# ──────────────────────────────────────────────────────
+class TestDataQAEnvironment:
+    @pytest.fixture
+    def env(self):
+        return DataQAEnvironment()
+    def test_reset_returns_observation(self, env):
+        obs = env.reset(task_id="easy")
+        assert obs.dataset_csv
+        assert obs.schema_description
+        assert obs.validation_rules
+        assert obs.task_description
+        assert obs.num_issues_hint == 6
+        assert obs.max_steps == 3
+        assert obs.done is False
+        assert obs.reward == 0.0
+        assert "fix" in obs.feedback.lower()  # mentions fix phase
+    def test_reset_medium(self, env):
+        obs = env.reset(task_id="medium")
+        assert obs.num_issues_hint == 8
+    def test_reset_hard(self, env):
+        obs = env.reset(task_id="hard")
+        assert obs.num_issues_hint == 10
+    def test_step_identify_only(self, env):
+        """Backward compatible: only issues, no fixes."""
+        env.reset(task_id="easy")
+        # Submit all 6 correct issues for easy task
+        action = DataQAAction(
+            issues=[
+                "row:4,col:name,issue:missing_value",
+                "row:7,col:salary,issue:wrong_type",
+                "row:21,col:employee_id,issue:duplicate_row",
+                "row:9,col:salary,issue:out_of_range",
+                "row:15,col:email,issue:inconsistent_value",
+                "row:18,col:start_date,issue:out_of_range",
+            ],
+            task_id="easy",
+        )
+        obs = env.step(action)
+        assert obs.done is True
+        assert obs.reward >= 0.999  # identify-only uses identify_score directly
+    def test_step_with_fixes_increases_reward(self, env):
+        """Submitting correct fixes should produce high combined reward."""
+        env.reset(task_id="easy")
+        # All 6 issues + 3 fixes
+        action = DataQAAction(
+            issues=[
+                "row:4,col:name,issue:missing_value",
+                "row:7,col:salary,issue:wrong_type",
+                "row:21,col:employee_id,issue:duplicate_row",
+                "row:9,col:salary,issue:out_of_range",
+                "row:15,col:email,issue:inconsistent_value",
+                "row:18,col:start_date,issue:out_of_range",
+            ],
+            fixes=[
+                "row:4,col:name,fix:David Kim",
+                "row:7,col:salary,fix:75000",
+                "row:9,col:salary,fix:73000",
+            ],
+            task_id="easy",
+        )
+        obs = env.step(action)
+        # Perfect identify + partial fixes -> high combined reward
+        assert obs.metadata["combined_reward"] > 0.7
+    def test_step_with_partial_issues(self, env):
+        env.reset(task_id="easy")
+        action = DataQAAction(
+            issues=["row:4,col:name,issue:missing_value"],
+            task_id="easy",
+        )
+        obs = env.step(action)
+        assert 0 < obs.reward < 1.0
+        assert obs.done is False
+    def test_step_with_no_issues(self, env):
+        env.reset(task_id="easy")
+        action = DataQAAction(issues=[], task_id="easy")
+        obs = env.step(action)
+        assert obs.reward == 0.0
+    def test_step_exhausts_max_steps(self, env):
+        env.reset(task_id="easy")
+        for _ in range(3):
+            action = DataQAAction(issues=["row:99,col:x,issue:wrong_type"], task_id="easy")
+            obs = env.step(action)
+        assert obs.done is True
+    def test_auto_reset_on_step(self, env):
+        action = DataQAAction(
+            issues=["row:4,col:name,issue:missing_value"],
+            task_id="easy",
+        )
+        obs = env.step(action)
+        assert obs.task_id == "easy"
+    def test_state_tracking(self, env):
+        env.reset(task_id="easy")
+        assert env.state.task_id == "easy"
+        assert env.state.current_step == 0
+        assert env.state.best_score == 0.0
+        action = DataQAAction(issues=["row:4,col:name,issue:missing_value"], task_id="easy")
+        env.step(action)
+        assert env.state.current_step == 1
+        assert env.state.best_score > 0.0
+    def test_best_score_monotonic(self, env):
+        env.reset(task_id="easy")
+        action1 = DataQAAction(
+            issues=["row:4,col:name,issue:missing_value", "row:7,col:salary,issue:wrong_type"],
+            task_id="easy",
+        )
+        env.step(action1)
+        score_after_1 = env.state.best_score
+        action2 = DataQAAction(issues=["row:99,col:x,issue:wrong_type"], task_id="easy")
+        env.step(action2)
+        assert env.state.best_score >= score_after_1
+    def test_metadata_includes_both_phases(self, env):
+        env.reset(task_id="easy")
+        action = DataQAAction(
+            issues=["row:4,col:name,issue:missing_value"],
+            fixes=["row:4,col:name,fix:David Kim"],
+            task_id="easy",
+        )
+        obs = env.step(action)
+        m = obs.metadata
+        assert "identify_f1" in m
+        assert "identify_score" in m
+        assert "fix_score" in m
+        assert "combined_reward" in m
+        assert "tp" in m
+        assert "fixes_correct" in m
+        assert "fixes_attempted" in m
+    def test_parse_error_in_feedback(self, env):
+        env.reset(task_id="easy")
+        action = DataQAAction(issues=["garbage input"], task_id="easy")
+        obs = env.step(action)
+        assert "Parse error" in obs.feedback
+    def test_concurrent_sessions_flag(self):
+        assert DataQAEnvironment.SUPPORTS_CONCURRENT_SESSIONS is True
+    def test_reward_between_0_and_1(self, env):
+        """Hackathon requirement: scores must be 0.0-1.0."""
+        env.reset(task_id="hard")
+        for _ in range(3):
+            action = DataQAAction(
+                issues=["row:1,col:x,issue:wrong_type", "row:99,col:y,issue:missing_value"],
+                fixes=["row:1,col:x,fix:wrong"],
+                task_id="hard",
+            )
+            obs = env.step(action)
+            assert 0.0 <= obs.reward <= 1.0
+    def test_combined_reward_weights(self, env):
+        """Verify combined = IDENTIFY_WEIGHT * identify + FIX_WEIGHT * fix."""
+        env.reset(task_id="easy")
+        action = DataQAAction(
+            issues=["row:4,col:name,issue:missing_value"],
+            fixes=["row:4,col:name,fix:David Kim"],
+            task_id="easy",
+        )
+        obs = env.step(action)
+        m = obs.metadata
+        expected = IDENTIFY_WEIGHT * m["identify_score"] + FIX_WEIGHT * m["fix_score"]
+        assert abs(m["combined_reward"] - expected) < 0.01
+    def test_fix_feedback_shown_when_fixes_submitted(self, env):
+        env.reset(task_id="easy")
+        action = DataQAAction(
+            issues=["row:4,col:name,issue:missing_value"],
+            fixes=["row:4,col:name,fix:David Kim"],
+            task_id="easy",
+        )
+        obs = env.step(action)
+        assert "Fix Proposals" in obs.feedback
+        assert "Combined Reward" in obs.feedback
+    def test_no_fix_penalty_when_no_fixes_submitted(self, env):
+        """If agent submits no fixes, reward = identify_score (no penalty)."""
+        env.reset(task_id="easy")
+        action = DataQAAction(
+            issues=[
+                "row:4,col:name,issue:missing_value",
+                "row:7,col:salary,issue:wrong_type",
+                "row:21,col:employee_id,issue:duplicate_row",
+                "row:9,col:salary,issue:out_of_range",
+                "row:15,col:email,issue:inconsistent_value",
+                "row:18,col:start_date,issue:out_of_range",
+            ],
+            task_id="easy",
+        )
+        obs = env.step(action)
+        # identify_score should be ~1.0 since all 6 issues found
+        assert obs.reward >= 0.99
+        # combined_reward equals identify_score when no fixes
+        assert obs.metadata["combined_reward"] == obs.metadata["identify_score"]
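
Note on `TestParseIssueKey` and `TestComputeF1` above: the assertions pin down the observable behavior of key normalization and F1 scoring. The sketch below is one way to satisfy those assertions. It is an independent illustration, not the code in `dataqa_env/server/environment.py`, and the `_sketch` names and the regular expression are assumptions.

```python
import re
from typing import Dict, Optional, Set

# Assumed pattern: ":" or "=" after each field name, "," or ";" between fields,
# optional whitespace around separators, case-insensitive field names.
_ISSUE_RE = re.compile(
    r"row\s*[:=]\s*(\d+)\s*[,;]\s*col\s*[:=]\s*(\w+)\s*[,;]\s*issue\s*[:=]\s*(\w+)",
    re.IGNORECASE,
)


def parse_issue_key_sketch(raw: str) -> Optional[str]:
    """Normalize a reported issue to 'row:N,col:name,issue:type', or None."""
    match = _ISSUE_RE.search(raw)
    if match is None:
        return None
    row, col, issue = match.groups()
    return f"row:{int(row)},col:{col.lower()},issue:{issue.lower()}"


def compute_f1_sketch(reported: Set[str], planted: Set[str]) -> Dict[str, float]:
    """Set-based precision, recall, and F1 over normalized issue keys."""
    if not reported and not planted:
        # Nothing planted and nothing reported counts as a perfect result.
        return {"tp": 0, "fp": 0, "fn": 0, "precision": 1.0, "recall": 1.0, "f1": 1.0}
    tp = len(reported & planted)
    fp = len(reported - planted)
    fn = len(planted - reported)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall, "f1": f1}
```

Normalizing keys before comparison is what lets the scorer treat `Row=3; Col=Salary; Issue=Missing_Value` and `row:3,col:salary,issue:missing_value` as the same report.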

tests/test_extensibility.py ADDED

@@ -0,0 +1,215 @@

+"""Tests for the extensibility API — custom tasks and contamination rules."""
+import pytest
+from dataqa_env.server.tasks import (
+    PlantedIssue,
+    create_task_from_config,
+    register_task,
+    register_contamination_rule,
+    CONTAMINATION_RULES,
+    get_task,
+    list_tasks,
+)
+from dataqa_env.server.environment import DataQAEnvironment, compute_weighted_reward
+from dataqa_env.models import DataQAAction
+SIMPLE_CSV = "id,name,score\n1,Alice,95\n2,Bob,87\n3,Carol,92\n4,Dave,78"
+class TestCreateTaskFromConfig:
+    def test_basic_creation(self):
+        task = create_task_from_config(
+            task_id="test_custom",
+            name="Test Task",
+            description="Test",
+            schema_description="id: int, name: str, score: int",
+            validation_rules="No missing values",
+            clean_csv=SIMPLE_CSV,
+            contaminations=[
+                {"rule": "missing_value", "row": 0, "col": 1},
+            ],
+        )
+        assert task.task_id == "test_custom"
+        assert len(task.planted_issues) == 1
+        assert task.planted_issues[0].issue_type == "missing_value"
+        assert task.planted_issues[0].col == "name"
+    def test_multiple_contaminations(self):
+        task = create_task_from_config(
+            task_id="multi",
+            name="Multi",
+            description="Test",
+            schema_description="",
+            validation_rules="",
+            clean_csv=SIMPLE_CSV,
+            contaminations=[
+                {"rule": "missing_value", "row": 0, "col": 1},
+                {"rule": "missing_value", "row": 2, "col": 1},
+            ],
+        )
+        assert len(task.planted_issues) == 2
+    def test_custom_difficulty_override(self):
+        task = create_task_from_config(
+            task_id="custom_diff",
+            name="Custom Difficulty",
+            description="Test",
+            schema_description="",
+            validation_rules="",
+            clean_csv=SIMPLE_CSV,
+            contaminations=[
+                {"rule": "missing_value", "row": 0, "col": 1, "difficulty": 2.5},
+            ],
+        )
+        assert task.planted_issues[0].difficulty == 2.5
+    def test_callable_rule(self):
+        def custom_rule(rows, header, col_idx, row_idx, rng):
+            return "CORRUPTED", PlantedIssue(
+                row=row_idx + 1, col=header[col_idx], issue_type="wrong_type",
+                description="Custom corruption", difficulty=1.5,
+            )
+        task = create_task_from_config(
+            task_id="callable",
+            name="Callable Rule",
+            description="Test",
+            schema_description="",
+            validation_rules="",
+            clean_csv=SIMPLE_CSV,
+            contaminations=[
+                {"rule": custom_rule, "row": 1, "col": 2},
+            ],
+        )
+        assert task.planted_issues[0].issue_type == "wrong_type"
+        assert "CORRUPTED" in task.corrupted_csv
+    def test_unknown_rule_raises(self):
+        with pytest.raises(ValueError, match="Unknown contamination rule"):
+            create_task_from_config(
+                task_id="bad",
+                name="Bad",
+                description="",
+                schema_description="",
+                validation_rules="",
+                clean_csv=SIMPLE_CSV,
+                contaminations=[{"rule": "nonexistent_rule", "row": 0, "col": 0}],
+            )
+class TestRegisterContaminationRule:
+    def test_register_and_use(self):
+        def reverse_value(rows, header, col_idx, row_idx, rng):
+            val = rows[row_idx][col_idx]
+            return val[::-1], PlantedIssue(
+                row=row_idx + 1, col=header[col_idx], issue_type="format_violation",
+                description="Reversed value", difficulty=1.5,
+            )
+        register_contamination_rule("reverse", reverse_value)
+        assert "reverse" in CONTAMINATION_RULES
+        task = create_task_from_config(
+            task_id="rev_test",
+            name="Reverse Test",
+            description="",
+            schema_description="",
+            validation_rules="",
+            clean_csv=SIMPLE_CSV,
+            contaminations=[{"rule": "reverse", "row": 0, "col": 1}],
+        )
+        assert task.planted_issues[0].issue_type == "format_violation"
+        # "Alice" reversed is "ecilA"
+        assert "ecilA" in task.corrupted_csv
+        # Cleanup
+        del CONTAMINATION_RULES["reverse"]
+class TestRegisterTask:
+    def test_register_and_get(self):
+        task = create_task_from_config(
+            task_id="registered",
+            name="Registered Task",
+            description="Test registered task",
+            schema_description="id: int, name: str",
+            validation_rules="No missing values",
+            clean_csv=SIMPLE_CSV,
+            contaminations=[{"rule": "missing_value", "row": 1, "col": 1}],
+        )
+        register_task("registered", lambda seed: task)
+        assert "registered" in list_tasks()
+        fetched = get_task("registered")
+        assert fetched.task_id == "registered"
+        assert len(fetched.planted_issues) == 1
+        # Cleanup
+        from dataqa_env.server.tasks import TASK_REGISTRY
+        del TASK_REGISTRY["registered"]
+class TestCustomTaskInEnvironment:
+    def test_full_lifecycle_identify_only(self):
+        """Custom task works end-to-end with identify-only."""
+        task = create_task_from_config(
+            task_id="e2e_custom",
+            name="E2E Custom",
+            description="End-to-end test",
+            schema_description="id: int, name: str, score: int",
+            validation_rules="No missing values",
+            clean_csv=SIMPLE_CSV,
+            contaminations=[
+                {"rule": "missing_value", "row": 0, "col": 1, "difficulty": 1.0},
+                {"rule": "whitespace_value", "row": 2, "col": 1, "difficulty": 2.5},
+            ],
+        )
+        register_task("e2e_custom", lambda seed: task)
+        env = DataQAEnvironment()
+        obs = env.reset(task_id="e2e_custom")
+        assert obs.num_issues_hint == 2
+        action = DataQAAction(
+            issues=[i.to_key() for i in task.planted_issues],
+            task_id="e2e_custom",
+        )
+        obs = env.step(action)
+        assert obs.done is True
+        assert obs.reward >= 0.999
+        from dataqa_env.server.tasks import TASK_REGISTRY
+        del TASK_REGISTRY["e2e_custom"]
+    def test_full_lifecycle_identify_and_fix(self):
+        """Custom task works end-to-end with both identify and fix."""
+        task = create_task_from_config(
+            task_id="e2e_fix",
+            name="E2E Fix",
+            description="End-to-end test with fixes",
+            schema_description="id: int, name: str, score: int",
+            validation_rules="No missing values",
+            clean_csv=SIMPLE_CSV,
+            contaminations=[
+                {"rule": "missing_value", "row": 0, "col": 1, "difficulty": 1.0},
+            ],
+        )
+        register_task("e2e_fix", lambda seed: task)
+        env = DataQAEnvironment()
+        env.reset(task_id="e2e_fix")
+        # Submit issues + fix
+        action = DataQAAction(
+            issues=[task.planted_issues[0].to_key()],
+            fixes=["row:1,col:name,fix:Alice"],  # clean value is "Alice"
+            task_id="e2e_fix",
+        )
+        obs = env.step(action)
+        assert obs.done is True
+        assert obs.metadata["fix_score"] > 0.0
+        assert obs.metadata["combined_reward"] > 0.0
+        from dataqa_env.server.tasks import TASK_REGISTRY
+        del TASK_REGISTRY["e2e_fix"]
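
Note on the extensibility tests above: they exercise a name-to-callable registry in which each contamination rule receives `(rows, header, col_idx, row_idx, rng)` and returns the corrupted value together with a description of the planted issue. The toy version below illustrates that pattern under stated assumptions. `IssueSketch`, `RULES`, `register_rule`, and `apply_contaminations` are hypothetical names, and the real `create_task_from_config` in `dataqa_env/server/tasks.py` may differ in detail.

```python
import csv
import io
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class IssueSketch:
    """Hypothetical stand-in for PlantedIssue (1-based data row, column name)."""
    row: int
    col: str
    issue_type: str


RuleFn = Callable[[List[List[str]], List[str], int, int, random.Random], Tuple[str, IssueSketch]]
RULES: Dict[str, RuleFn] = {}


def register_rule(name: str, fn: RuleFn) -> None:
    """Make a rule available by name."""
    RULES[name] = fn


def reverse_value(rows, header, col_idx, row_idx, rng):
    """Example rule: reverse the cell's text."""
    value = rows[row_idx][col_idx]
    return value[::-1], IssueSketch(row_idx + 1, header[col_idx], "format_violation")


def apply_contaminations(clean_csv: str, contaminations: List[dict], seed: int = 0):
    """Apply each rule (looked up by name, or used directly if callable) to a copy of the data."""
    table = list(csv.reader(io.StringIO(clean_csv)))
    header, rows = table[0], table[1:]
    rng = random.Random(seed)
    issues: List[IssueSketch] = []
    for spec in contaminations:
        rule = spec["rule"]
        if not callable(rule):
            if rule not in RULES:
                raise ValueError(f"Unknown contamination rule: {rule}")
            rule = RULES[rule]
        new_value, issue = rule(rows, header, spec["col"], spec["row"], rng)
        rows[spec["row"]][spec["col"]] = new_value
        issues.append(issue)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows([header] + rows)
    return out.getvalue().strip(), issues


SIMPLE_CSV = "id,name,score\n1,Alice,95\n2,Bob,87\n3,Carol,92\n4,Dave,78"
register_rule("reverse", reverse_value)
corrupted_csv, planted = apply_contaminations(
    SIMPLE_CSV, [{"rule": "reverse", "row": 0, "col": 1}]
)
```

Accepting either a registered name or a callable keeps one-off rules out of the global registry, which is why the tests above only clean up after rules they registered by name.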

tests/test_inference.py ADDED

@@ -0,0 +1,191 @@

+"""Tests for the inference script's parsing, prompt building, and log format."""
+import pytest
+import sys
+import os
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
+from inference import parse_llm_response, parse_fix_response, build_user_prompt, log_start, log_step, log_end
+class TestParseLLMResponse:
+    def test_standard_format(self):
+        response = "row:1,col:name,issue:missing_value\nrow:2,col:salary,issue:wrong_type"
+        issues = parse_llm_response(response)
+        assert len(issues) == 2
+        assert "row:1,col:name,issue:missing_value" in issues
+    def test_numbered_list(self):
+        response = "1. row:1,col:name,issue:missing_value\n2. row:2,col:salary,issue:wrong_type"
+        issues = parse_llm_response(response)
+        assert len(issues) == 2
+    def test_bullet_list(self):
+        response = "- row:1,col:name,issue:missing_value\n* row:2,col:salary,issue:wrong_type"
+        issues = parse_llm_response(response)
+        assert len(issues) == 2
+    def test_equals_delimiter(self):
+        response = "row=1,col=name,issue=missing_value"
+        issues = parse_llm_response(response)
+        assert len(issues) == 1
+        assert issues[0] == "row:1,col:name,issue:missing_value"
+    def test_mixed_case(self):
+        response = "Row:1,Col:Name,Issue:Missing_Value"
+        issues = parse_llm_response(response)
+        assert len(issues) == 1
+        assert issues[0] == "row:1,col:name,issue:missing_value"
+    def test_empty_response(self):
+        assert parse_llm_response("") == []
+        assert parse_llm_response("   ") == []
+    def test_garbage_lines_skipped(self):
+        response = "Here are the issues:\nrow:1,col:name,issue:missing_value\nNo more issues."
+        issues = parse_llm_response(response)
+        assert len(issues) == 1
+    def test_deduplication_not_applied(self):
+        response = "row:1,col:name,issue:missing_value\nrow:1,col:name,issue:missing_value"
+        issues = parse_llm_response(response)
+        assert len(issues) == 2
+    def test_with_column_variant(self):
+        response = "row:1,column:name,issue:missing_value"
+        issues = parse_llm_response(response)
+        assert len(issues) == 1
+class TestParseFixResponse:
+    def test_standard_format(self):
+        response = "row:4,col:name,fix:David Kim\nrow:7,col:salary,fix:75000"
+        fixes = parse_fix_response(response)
+        assert len(fixes) == 2
+        assert "row:4,col:name,fix:David Kim" in fixes
+    def test_numbered_list(self):
+        response = "1. row:4,col:name,fix:David Kim\n2. row:7,col:salary,fix:75000"
+        fixes = parse_fix_response(response)
+        assert len(fixes) == 2
+    def test_with_special_chars(self):
+        response = "row:1,col:email,fix:alice.chen@company.com"
+        fixes = parse_fix_response(response)
+        assert len(fixes) == 1
+        assert "alice.chen@company.com" in fixes[0]
+    def test_empty_response(self):
+        assert parse_fix_response("") == []
+    def test_date_fix(self):
+        response = "row:12,col:order_date,fix:2024-01-26"
+        fixes = parse_fix_response(response)
+        assert len(fixes) == 1
+    def test_ignores_issue_lines(self):
+        response = "row:4,col:name,issue:missing_value\nrow:4,col:name,fix:David Kim"
+        fixes = parse_fix_response(response)
+        assert len(fixes) == 1  # only the fix line
+class TestBuildUserPrompt:
+    def test_includes_all_fields(self):
+        obs = {
+            "task_description": "Find issues",
+            "schema_description": "col: int",
+            "validation_rules": "no nulls",
+            "dataset_csv": "a,b\n1,2",
+            "num_issues_hint": 3,
+            "feedback": "",
+        }
+        prompt = build_user_prompt(obs)
+        assert "Find issues" in prompt
+        assert "col: int" in prompt
+        assert "no nulls" in prompt
+        assert "a,b" in prompt
+        assert "3 issues" in prompt
+    def test_includes_feedback_on_retry(self):
+        obs = {
+            "task_description": "Find issues",
+            "schema_description": "",
+            "validation_rules": "",
+            "dataset_csv": "a\n1",
+            "num_issues_hint": 0,
+            "feedback": "Step 1/3: You missed 2 issues",
+        }
+        prompt = build_user_prompt(obs)
+        assert "FEEDBACK" in prompt
+        assert "missed 2" in prompt
+    def test_excludes_reset_feedback(self):
+        obs = {
+            "task_description": "",
+            "schema_description": "",
+            "validation_rules": "",
+            "dataset_csv": "",
+            "num_issues_hint": 0,
+            "feedback": "Environment reset. Start inspecting.",
+        }
+        prompt = build_user_prompt(obs)
+        assert "FEEDBACK" not in prompt
+    def test_include_fixes_flag(self):
+        obs = {
+            "task_description": "Find issues",
+            "schema_description": "",
+            "validation_rules": "",
+            "dataset_csv": "a\n1",
+            "num_issues_hint": 0,
+            "feedback": "",
+        }
+        prompt = build_user_prompt(obs, include_fixes=True)
+        assert "fix" in prompt.lower()
+class TestLogFormat:
+    """Verify stdout log format matches hackathon evaluation requirements."""
+    def test_log_start_format(self, capsys):
+        log_start(task="easy", env="dataqa_env", model="test-model")
+        out = capsys.readouterr().out.strip()
+        assert out == "[START] task=easy env=dataqa_env model=test-model"
+    def test_log_step_format(self, capsys):
+        log_step(step=1, action="row:1,col:name,issue:missing_value", reward=0.50, done=False, error=None)
+        out = capsys.readouterr().out.strip()
+        assert out == "[STEP] step=1 action=row:1,col:name,issue:missing_value reward=0.50 done=false error=null"
+    def test_log_step_with_error(self, capsys):
+        log_step(step=2, action="none", reward=0.00, done=True, error="timeout")
+        out = capsys.readouterr().out.strip()
+        assert "error=timeout" in out
+        assert "done=true" in out
+    def test_log_end_format(self, capsys):
+        log_end(success=True, steps=3, score=0.85, rewards=[0.25, 0.50, 0.85])
+        out = capsys.readouterr().out.strip()
+        assert out == "[END] success=true steps=3 score=0.850 rewards=0.25,0.50,0.85"
+    def test_log_end_failure(self, capsys):
+        log_end(success=False, steps=1, score=0.0, rewards=[0.0])
+        out = capsys.readouterr().out.strip()
+        assert "success=false" in out
+        assert "score=0.000" in out
+    def test_reward_format_2_decimal(self, capsys):
+        log_step(step=1, action="test", reward=0.123456, done=False, error=None)
+        out = capsys.readouterr().out.strip()
+        assert "reward=0.12" in out
+    def test_no_newlines_within_line(self, capsys):
+        log_start(task="easy", env="dataqa_env", model="model")
+        log_step(step=1, action="act", reward=0.0, done=False, error=None)
+        log_end(success=False, steps=1, score=0.0, rewards=[0.0])
+        out = capsys.readouterr().out
+        lines = [line for line in out.split("\n") if line.strip()]
+        assert len(lines) == 3
+        assert lines[0].startswith("[START]")
+        assert lines[1].startswith("[STEP]")
+        assert lines[2].startswith("[END]")
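
The `TestLogFormat` assertions above pin down a three-line stdout protocol: `[START]`, `[STEP]`, and `[END]`, with lowercase booleans, `null` for a missing error, two-decimal rewards, and a three-decimal score. As a rough illustration only, the formatting rules could be sketched like this. The `format_*` helper names are hypothetical and are not taken from `inference.py`; the real `log_start`, `log_step`, and `log_end` print to stdout and may be implemented differently.

```python
from typing import List, Optional


def format_start(task: str, env: str, model: str) -> str:
    """Build the [START] line (hypothetical helper, not from inference.py)."""
    return f"[START] task={task} env={env} model={model}"


def format_step(step: int, action: str, reward: float, done: bool,
                error: Optional[str]) -> str:
    """Build one [STEP] line: 2-decimal reward, lowercase bool, 'null' for no error."""
    error_text = error if error is not None else "null"
    return (
        f"[STEP] step={step} action={action} reward={reward:.2f} "
        f"done={str(done).lower()} error={error_text}"
    )


def format_end(success: bool, steps: int, score: float,
               rewards: List[float]) -> str:
    """Build the [END] line: 3-decimal score, comma-joined 2-decimal rewards."""
    rewards_text = ",".join(f"{r:.2f}" for r in rewards)
    return (
        f"[END] success={str(success).lower()} steps={steps} "
        f"score={score:.3f} rewards={rewards_text}"
    )


if __name__ == "__main__":
    # Print one full episode in the expected order.
    print(format_start(task="easy", env="dataqa_env", model="test-model"))
    print(format_step(step=1, action="row:1,col:name,issue:missing_value",
                      reward=0.5, done=False, error=None))
    print(format_end(success=True, steps=1, score=0.5, rewards=[0.5]))
```

Returning strings instead of printing keeps the formatting rules testable without `capsys`; a thin wrapper that calls `print` on each result would give the stdout behavior the tests above check.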

tests/test_tasks.py ADDED

@@ -0,0 +1,162 @@

+"""Tests for task definitions, data corruption, and issue planting."""
+import pytest
+from dataqa_env.server.tasks import (
+    PlantedIssue,
+    Task,
+    create_task_easy,
+    create_task_medium,
+    create_task_hard,
+    get_task,
+    list_tasks,
+    _csv_to_rows,
+    _rows_to_csv,
+)
+class TestPlantedIssue:
+    def test_to_key(self):
+        issue = PlantedIssue(row=3, col="salary", issue_type="missing_value", description="test")
+        assert issue.to_key() == "row:3,col:salary,issue:missing_value"
+    def test_difficulty_default(self):
+        issue = PlantedIssue(row=1, col="name", issue_type="missing_value", description="test")
+        assert issue.difficulty == 1.0
+    def test_difficulty_custom(self):
+        issue = PlantedIssue(row=1, col="name", issue_type="missing_value", description="test", difficulty=3.0)
+        assert issue.difficulty == 3.0
+class TestCSVHelpers:
+    def test_roundtrip(self):
+        csv_text = "a,b,c\n1,2,3\n4,5,6"
+        rows = _csv_to_rows(csv_text)
+        assert len(rows) == 3
+        result = _rows_to_csv(rows)
+        assert "1,2,3" in result
+    def test_empty_csv(self):
+        rows = _csv_to_rows("a,b\n")
+        assert len(rows) == 1  # header only
+class TestTaskEasy:
+    @pytest.fixture
+    def task(self):
+        return create_task_easy()
+    def test_task_id(self, task):
+        assert task.task_id == "easy"
+    def test_has_6_issues(self, task):
+        assert len(task.planted_issues) == 6
+    def test_issue_types(self, task):
+        types = {i.issue_type for i in task.planted_issues}
+        assert "missing_value" in types
+        assert "wrong_type" in types
+        assert "duplicate_row" in types
+        assert "out_of_range" in types
+        assert "inconsistent_value" in types
+    def test_corrupted_csv_differs_from_clean(self, task):
+        assert task.corrupted_csv != task.clean_csv
+    def test_issue_keys_unique(self, task):
+        keys = [i.to_key() for i in task.planted_issues]
+        assert len(keys) == len(set(keys))
+    def test_max_steps(self, task):
+        assert task.max_steps == 3
+    def test_corrupted_csv_has_more_rows(self, task):
+        clean_rows = _csv_to_rows(task.clean_csv)
+        corrupt_rows = _csv_to_rows(task.corrupted_csv)
+        assert len(corrupt_rows) > len(clean_rows)  # duplicate row added
+    def test_difficulty_weights(self, task):
+        for issue in task.planted_issues:
+            assert 1.0 <= issue.difficulty <= 3.0
+class TestTaskMedium:
+    @pytest.fixture
+    def task(self):
+        return create_task_medium()
+    def test_task_id(self, task):
+        assert task.task_id == "medium"
+    def test_has_8_issues(self, task):
+        assert len(task.planted_issues) == 8
+    def test_issue_types(self, task):
+        types = {i.issue_type for i in task.planted_issues}
+        assert "inconsistent_value" in types
+        assert "format_violation" in types
+        assert "missing_value" in types
+    def test_issue_keys_unique(self, task):
+        keys = [i.to_key() for i in task.planted_issues]
+        assert len(keys) == len(set(keys))
+    def test_difficulty_weights(self, task):
+        for issue in task.planted_issues:
+            assert 1.0 <= issue.difficulty <= 3.0
+class TestTaskHard:
+    @pytest.fixture
+    def task(self):
+        return create_task_hard()
+    def test_task_id(self, task):
+        assert task.task_id == "hard"
+    def test_has_10_issues(self, task):
+        assert len(task.planted_issues) == 10
+    def test_issue_types(self, task):
+        types = {i.issue_type for i in task.planted_issues}
+        assert "inconsistent_value" in types
+        assert "format_violation" in types
+        assert "statistical_outlier" in types
+        assert "out_of_range" in types
+        assert "missing_value" in types
+    def test_has_high_difficulty_issues(self, task):
+        hard_issues = [i for i in task.planted_issues if i.difficulty >= 2.5]
+        assert len(hard_issues) >= 2  # e.g. data leakage, GPU outlier, whitespace
+    def test_issue_keys_unique(self, task):
+        keys = [i.to_key() for i in task.planted_issues]
+        assert len(keys) == len(set(keys))
+class TestTaskRegistry:
+    def test_list_tasks(self):
+        tasks = list_tasks()
+        assert set(tasks) == {"easy", "medium", "hard"}
+    def test_get_task_easy(self):
+        task = get_task("easy")
+        assert task.task_id == "easy"
+    def test_get_task_medium(self):
+        task = get_task("medium")
+        assert task.task_id == "medium"
+    def test_get_task_hard(self):
+        task = get_task("hard")
+        assert task.task_id == "hard"
+    def test_get_task_unknown_raises(self):
+        with pytest.raises(ValueError, match="Unknown task"):
+            get_task("nonexistent")
+    def test_seed_determinism(self):
+        t1 = get_task("easy", seed=42)
+        t2 = get_task("easy", seed=42)
+        assert t1.corrupted_csv == t2.corrupted_csv
+        assert [i.to_key() for i in t1.planted_issues] == [i.to_key() for i in t2.planted_issues]
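
The tests above rely on two properties: an issue serializes to a stable key of the form `row:N,col:NAME,issue:TYPE`, and building a task twice with the same seed yields identical corruption. A minimal sketch of how both could hold is shown below. `PlantedIssueSketch` and `plant_missing_value` are hypothetical names for illustration; the actual `PlantedIssue` and task builders in `dataqa_env/server/tasks.py` may differ, and the row-numbering convention (header at index 0, data rows counted from 1) is an assumption.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PlantedIssueSketch:
    """Illustrative stand-in for PlantedIssue (fields inferred from the tests)."""
    row: int
    col: str
    issue_type: str
    description: str
    difficulty: float = 1.0  # default weight, as test_difficulty_default expects

    def to_key(self) -> str:
        # Stable string key used to compare reported issues with planted ones.
        return f"row:{self.row},col:{self.col},issue:{self.issue_type}"


def plant_missing_value(
    rows: List[List[str]], seed: int
) -> Tuple[List[List[str]], PlantedIssueSketch]:
    """Blank one randomly chosen data cell; rows[0] is assumed to be the header."""
    rng = random.Random(seed)  # local RNG, so the result depends only on the seed
    corrupted = [list(r) for r in rows]  # copy so the clean rows stay untouched
    row_idx = rng.randrange(1, len(corrupted))  # skip the header row
    col_idx = rng.randrange(len(corrupted[0]))
    corrupted[row_idx][col_idx] = ""
    issue = PlantedIssueSketch(
        row=row_idx,
        col=corrupted[0][col_idx],
        issue_type="missing_value",
        description="cell blanked by the sketch",
    )
    return corrupted, issue


if __name__ == "__main__":
    clean = [["a", "b"], ["1", "2"], ["4", "5"]]
    first, issue_a = plant_missing_value(clean, seed=42)
    second, issue_b = plant_missing_value(clean, seed=42)
    print(first == second, issue_a.to_key() == issue_b.to_key())
```

Using a `random.Random(seed)` instance rather than the module-level functions avoids interference from other code that touches the global random state, which is one common way to get the determinism that `test_seed_determinism` checks.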