Spaces: sairaj2/DataCleanser

sairaj2 committed 28 days ago

Commit 3432ccd (1 parent: 746496e)

Add /web dashboard UI and update Docker setup

- Added interactive web dashboard at /web with session controls, step controls, live outputs
- Updated Dockerfile for better Hugging Face Spaces compatibility
- Updated README.md with UI documentation
- Added .dockerignore for cleaner Docker builds
- Updated requirements.txt with necessary dependencies
- Updated env/environment.py and env/reward.py for improved functionality
- Added scripts/ directory with validation script

Files changed (9)

.dockerignore +22 -0
Dockerfile +22 -22
README.md +94 -976
app.py +411 -9
env/environment.py +22 -7
env/reward.py +5 -5
inference.py +64 -78
requirements.txt +1 -1
scripts/validate-submission.sh +62 -0

.dockerignore ADDED

@@ -0,0 +1,22 @@

+.git
+.gitignore
+__pycache__/
+*.py[cod]
+*.pyo
+*.pyd
+.pytest_cache/
+.mypy_cache/
+.ruff_cache/
+.cache/
+venv/
+.venv/
+*.ipynb
+*.log
+results_*.json
+.DS_Store

Dockerfile CHANGED

@@ -1,35 +1,35 @@
-# OpenEnv Data Cleaning Environment
 FROM python:3.11-slim
-WORKDIR /app
-# Install system dependencies
-RUN apt-get update && apt-get install -y \
-    gcc \
-    g++ \
-    && rm -rf /var/lib/apt/lists/*
-# Copy requirements first for better caching
-COPY requirements.txt .
-# Install Python dependencies
-RUN pip install --no-cache-dir -r requirements.txt
 # Copy application code
-COPY . .
-# Create data directory
-RUN mkdir -p data
-# Generate datasets on build
-RUN python -c "from env.tasks import TaskManager; tm = TaskManager(); tm.generate_datasets()"
-# Expose port
 EXPOSE 7860
-# Health check
-HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
-    CMD python -c "import requests; requests.get('http://localhost:7860/health')" || exit 1
-# Run the application
-CMD ["python", "app.py"]

+# OpenEnv Data Cleaning Environment (FastAPI)
 FROM python:3.11-slim
+ENV PYTHONDONTWRITEBYTECODE=1 \
+    PYTHONUNBUFFERED=1 \
+    PIP_NO_CACHE_DIR=1 \
+    PORT=7860
+WORKDIR /app
+# Create non-root user
+RUN useradd -m -u 10001 appuser
+# Install Python dependencies first for better caching
+COPY requirements.txt /app/requirements.txt
+RUN python -m pip install --upgrade pip && \
+    pip install -r /app/requirements.txt
 # Copy application code
+COPY . /app
+# Ensure data directory exists and is writable (tasks may write datasets)
+RUN mkdir -p /app/data && chown -R appuser:appuser /app
+USER appuser
 EXPOSE 7860
+# Health check (keep it simple; HF will also probe)
+HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
+  CMD python -c "import os,requests; requests.get(f\"http://127.0.0.1:{os.getenv('PORT','7860')}/health\", timeout=2).raise_for_status()"
+# Production server (FastAPI)
+# Note: keep a single worker because sessions are held in-memory per process.
+CMD ["sh", "-c", "uvicorn app:app --host 0.0.0.0 --port ${PORT} --proxy-headers --forwarded-allow-ips='*' --workers 1"]
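
Note on the new HEALTHCHECK: the one-liner imports `requests`, so it only works if that package is installed in the image. As a rough sketch under that caveat, the same probe can be written with the standard library alone. The helper names `health_url` and `probe` are hypothetical and not part of this commit; the `/health` path and the `PORT` default of 7860 are taken from the Dockerfile above.

```python
import os
import urllib.error
import urllib.request


def health_url(env=None):
    """Build the probe URL from PORT, defaulting to 7860 as in the Dockerfile."""
    env = os.environ if env is None else env
    return f"http://127.0.0.1:{env.get('PORT', '7860')}/health"


def probe(url, timeout=2.0):
    """Return True only if the endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and HTTP 4xx/5xx responses.
        return False


def main():
    """Exit code for a health check: 0 when healthy, 1 otherwise."""
    return 0 if probe(health_url()) else 1
```

A health check command could call `main()` and pass its result to `sys.exit`, so a failed probe marks the container unhealthy without any third-party dependency.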

README.md CHANGED

@@ -1,795 +1,113 @@
-# OpenEnv Data Cleaning Environment
-A production-grade OpenEnv environment for data cleaning and validation tasks that simulates real-world data engineering workflows.
-## Table of Contents
-- [Real-World Problem](#real-world-problem)
-- [Solution Overview](#solution-overview)
-- [Architecture](#architecture)
-- [How It Works](#how-it-works)
-- [Action Space](#action-space)
-- [Observation Space](#observation-space)
-- [Reward System](#reward-system)
-- [Task Levels](#task-levels)
-- [Quick Start](#quick-start)
-- [Deployment](#deployment)
-- [API Reference](#api-reference)
-- [Running the Baseline Agent](#running-the-baseline-agent)
-- [Expected Scores](#expected-scores)
-- [Development Guide](#development-guide)
----
-## Real-World Problem
-### The Data Quality Crisis
-In modern data engineering, **60-80% of a data scientist's time** is spent on data cleaning and preparation. This critical but tedious process involves:
-```mermaid
-graph TD
-    A[Raw Data] --> B{Data Quality Issues}
-    B --> C[Missing Values]
-    B --> D[Duplicate Records]
-    B --> E[Format Inconsistencies]
-    B --> F[Invalid Values]
-    B --> G[Cross-Field Errors]
-    C --> H[Manual Cleaning]
-    D --> H
-    E --> H
-    F --> H
-    G --> H
-    H --> I[Time Consuming]
-    H --> J[Error Prone]
-    H --> K[Not Scalable]
-    I --> L[Business Impact]
-    J --> L
-    K --> L
-    L --> M[Delayed Insights]
-    L --> N[Wrong Decisions]
-    L --> O[Lost Revenue]
-```
-### Industry Pain Points
-| Problem                | Impact                        | Current Solution                   |
-| ---------------------- | ----------------------------- | ---------------------------------- |
-| **Missing Values**     | 15-25% of datasets have gaps  | Manual imputation, simple fill     |
-| **Duplicates**         | 5-10% redundant records       | SQL dedup, pandas drop_duplicates  |
-| **Format Issues**      | 20-30% inconsistent formats   | Regex, manual standardization      |
-| **Invalid Data**       | 10-15% out-of-range values    | Business rules, validation scripts |
-| **Cross-Field Errors** | 5-10% logical inconsistencies | Custom validation logic            |
-### Why This Matters
-```mermaid
-graph LR
-    A[Poor Data Quality] --> B[Failed ML Models]
-    A --> C[Incorrect Analytics]
-    A --> D[Compliance Issues]
-    A --> E[Customer Churn]
-    B --> F[$4.2M Annual Loss]
-    C --> F
-    D --> F
-    E --> F
-    style F fill:#ff6b6b,stroke:#c92a2a
-```
-**Key Statistics:**
-- **$12.9 million** - Average annual cost of poor data quality per organization (Gartner)
-- **40% of business initiatives** fail to achieve targets due to poor data quality
-- **27% of revenue** is lost due to inaccurate data in CRM systems
----
-## Solution Overview
-### What This Project Provides
-This OpenEnv environment creates a **standardized, reproducible benchmark** for evaluating AI agents on data cleaning tasks. It bridges the gap between academic research and production data engineering.
-```mermaid
-graph TB
-    subgraph "OpenEnv Data Cleaning Environment"
-        A[Environment Interface] --> B[Action Space]
-        A --> C[Observation Space]
-        A --> D[Reward System]
-        B --> E[10 Structured Actions]
-        C --> F[Intelligent Feedback]
-        D --> G[Multi-Component Scoring]
-        E --> H[Agent Interaction]
-        F --> H
-        G --> H
-        H --> I[Deterministic Grading]
-        I --> J[Reproducible Results]
-    end
-    subgraph "Real-World Applications"
-        K[Data Pipeline Automation]
-        L[ETL Quality Assurance]
-        M[ML Data Preparation]
-        N[Compliance Validation]
-    end
-    J --> K
-    J --> L
-    J --> M
-    J --> N
-```
-### Key Benefits
-1. **Standardized Evaluation**: Compare different AI agents on the same tasks
-2. **Realistic Scenarios**: Based on actual data engineering challenges
-3. **Deterministic Grading**: Reproducible scoring for fair comparison
-4. **Production Ready**: Docker deployment, REST API, scalable architecture
-5. **Extensible**: Easy to add new tasks, actions, and metrics
----
-## Architecture
-### System Architecture
-```mermaid
-graph TB
-    subgraph "Client Layer"
-        A[AI Agent]
-        B[Human User]
-        C[CI/CD Pipeline]
-    end
-    subgraph "API Layer"
-        D[FastAPI Server]
-        E[REST Endpoints]
-        F[Session Management]
-    end
-    subgraph "Environment Layer"
-        G[DataCleaningEnv]
-        H[Task Manager]
-        I[Reward Calculator]
-        J[Grader]
-    end
-    subgraph "Data Layer"
-        K[Dirty Datasets]
-        L[Clean Datasets]
-        M[Session State]
-    end
-    A --> D
-    B --> D
-    C --> D
-    D --> E
-    D --> F
-    E --> G
-    F --> G
-    G --> H
-    G --> I
-    G --> J
-    H --> K
-    H --> L
-    G --> M
-    style D fill:#4dabf7,stroke:#1971c2
-    style G fill:#51cf66,stroke:#2f9e44
-```
-### Component Interaction
-```mermaid
-sequenceDiagram
-    participant Agent
-    participant API
-    participant Env
-    participant Reward
-    participant Grader
-    Agent->>API: POST /reset (task_id)
-    API->>Env: reset(task_config)
-    Env-->>API: Observation
-    API-->>Agent: Initial State
-    loop Until Done
-        Agent->>API: POST /step (action)
-        API->>Env: step(action)
-        Env->>Reward: calculate_reward()
-        Reward-->>Env: Reward
-        Env-->>API: Observation, Reward, Done, Info
-        API-->>Agent: Step Result
-    end
-    Agent->>API: POST /step (submit)
-    API->>Env: step(submit)
-    Env->>Grader: grade()
-    Grader-->>Env: Final Score
-    Env-->>API: Final Result
-    API-->>Agent: Final Score
-```
-### Data Flow
-```mermaid
-flowchart TD
-    A[Dirty Dataset] --> B[Environment Reset]
-    B --> C[Initial Observation]
-    C --> D{Agent Decision}
-    D --> E[Select Action]
-    E --> F[Execute Action]
-    F --> G[Update DataFrame]
-    G --> H[Calculate Metrics]
-    H --> I[Detect Issues]
-    I --> J[Generate Observation]
-    J --> K[Calculate Reward]
-    K --> L{Done?}
-    L -->|No| D
-    L -->|Yes| M[Final Grading]
-    M --> N[Score Calculation]
-    N --> O[Results]
-    style A fill:#ffd43b,stroke:#f59f00
-    style O fill:#51cf66,stroke:#2f9e44
-```
----
-## How It Works
-### Environment Lifecycle
-```mermaid
-stateDiagram-v2
-    [*] --> Initialized: Create Environment
-    Initialized --> Ready: reset(task)
-    Ready --> Running: step(action)
-    Running --> Running: Action Executed
-    Running --> Running: Reward Calculated
-    Running --> Done: submit() or max_steps
-    Done --> Ready: reset(new_task)
-    Done --> [*]: Delete Session
-    state Running {
-        [*] --> ExecuteAction
-        ExecuteAction --> UpdateState
-        UpdateState --> CalculateMetrics
-        CalculateMetrics --> DetectIssues
-        DetectIssues --> GenerateObservation
-        GenerateObservation --> CalculateReward
-        CalculateReward --> [*]
-    }
-```
-### Action Execution Flow
-```mermaid
-flowchart LR
-    A[Action Input] --> B{Validate Action}
-    B -->|Invalid| C[Return Error]
-    B -->|Valid| D[Execute Operation]
-    D --> E[Update DataFrame]
-    E --> F[Track Changes]
-    F --> G[Update Issues]
-    G --> H[Calculate Metrics]
-    H --> I[Generate Result]
-    I --> J[Return ActionResult]
-    style C fill:#ff6b6b,stroke:#c92a2a
-    style J fill:#51cf66,stroke:#2f9e44
-```
-### Reward Calculation
-```mermaid
-graph TD
-    A[Current State] --> B[Quality Metrics]
-    A --> C[Issue Count]
-    A --> D[Action History]
-    B --> E[Quality Improvement]
-    C --> F[Issue Resolution]
-    E --> G[Reward Components]
-    F --> G
-    D --> H[Redundancy Check]
-    D --> I[Destructive Check]
-    H --> J[Penalties]
-    I --> J
-    G --> K[Weighted Sum]
-    J --> K
-    K --> L[Final Reward]
-    style L fill:#4dabf7,stroke:#1971c2
-```
----
-## Action Space
-### Available Actions
-```mermaid
-graph TB
-    subgraph "Data Cleaning Actions"
-        A[fill_missing] --> A1[mean/median/mode]
-        A --> A2[forward/backward fill]
-        A --> A3[specific value]
-        B[drop_duplicates] --> B1[subset columns]
-        B --> B2[keep first/last]
-        C[normalize_text] --> C1[lowercase]
-        C --> C2[strip whitespace]
-        C --> C3[remove special chars]
-        D[standardize_format] --> D1[email]
-        D --> D2[phone]
-        D --> D3[date]
-        E[validate_range] --> E1[min value]
-        E --> E2[max value]
-        F[detect_outliers] --> F1[IQR method]
-        F --> F2[Z-score method]
-        G[infer_values] --> G1[interpolation]
-        G --> G2[regression]
-        G --> G3[mode]
-        H[flag_invalid] --> H1[row_id]
-        H --> H2[reason]
-    end
-    subgraph "Control Actions"
-        I[revert_last_action]
-        J[submit]
-    end
-```
-### Action Details
-| Action               | Description             | Parameters                | Example                                                    |
-| -------------------- | ----------------------- | ------------------------- | ---------------------------------------------------------- |
-| `fill_missing`       | Fill missing values     | column, strategy/value    | `{"column": "age", "strategy": "median"}`                  |
-| `drop_duplicates`    | Remove duplicate rows   | subset, keep              | `{"subset": ["email"], "keep": "first"}`                   |
-| `normalize_text`     | Normalize text data     | column, operations        | `{"column": "name", "operations": ["lowercase", "strip"]}` |
-| `standardize_format` | Standardize formats     | column, format_type       | `{"column": "email", "format_type": "email"}`              |
-| `validate_range`     | Fix out-of-range values | column, min, max          | `{"column": "age", "min_value": 0, "max_value": 120}`      |
-| `detect_outliers`    | Handle outliers         | column, method, threshold | `{"column": "salary", "method": "iqr"}`                    |
-| `infer_values`       | Infer missing values    | column, method            | `{"column": "price", "method": "interpolation"}`           |
-| `flag_invalid`       | Flag invalid rows       | row_id, reason            | `{"row_id": 42, "reason": "Invalid email"}`                |
-| `revert_last_action` | Undo last action        | -                         | `{}`                                                       |
-| `submit`             | Submit cleaned data     | -                         | `{}`                                                       |
----
-## Observation Space
-### Observation Structure
-```mermaid
-graph TB
-    subgraph "Observation"
-        A[table_preview] --> A1[First N rows]
-        B[schema] --> B1[Column types]
-        B --> B2[Null counts]
-        B --> B3[Unique counts]
-        C[detected_issues] --> C1[Issue type]
-        C --> C2[Count]
-        C --> C3[Severity]
-        D[quality_metrics] --> D1[Completeness]
-        D --> D2[Validity]
-        D --> D3[Consistency]
-        D --> D4[Uniqueness]
-        D --> D5[Overall]
-        E[step_count] --> E1[Current step]
-        F[max_steps] --> F1[Step limit]
-        G[last_action_result] --> G1[Success/failure]
-        G --> G2[Message]
-        G --> G3[Rows affected]
-        H[task_info] --> H1[Level]
-        H --> H2[Description]
-    end
-```
-### Quality Metrics Explained
-```mermaid
-graph LR
-    A[Quality Metrics] --> B[Completeness]
-    A --> C[Validity]
-    A --> D[Consistency]
-    A --> E[Uniqueness]
-    B --> F[Non-null ratio]
-    C --> G[Format compliance]
-    D --> H[Type consistency]
-    E --> I[1 - duplicate rate]
-    F --> J[Overall Score]
-    G --> J
-    H --> J
-    I --> J
-    J --> K[30% B + 30% C + 20% D + 20% E]
-```
 ---
-## Reward System
-### Multi-Component Reward
-```mermaid
-graph TB
-    subgraph "Positive Components"
-        A[Quality Improvement] --> A1[× 0.4 weight]
-        B[Issue Resolution] --> B1[× 0.4 weight]
-        C[Schema Validity] --> C1[× 0.2 weight]
-    end
-    subgraph "Penalties"
-        D[Destructive Changes] --> D1[× 0.5 penalty]
-        E[Redundant Actions] --> E1[× 0.05 penalty]
-        F[Step Cost] --> F1[× 0.01 penalty]
-    end
-    A1 --> G[Total Reward]
-    B1 --> G
-    C1 --> G
-    D1 --> G
-    E1 --> G
-    F1 --> G
-    G --> H[Final Score]
-    style H fill:#4dabf7,stroke:#1971c2
-```
-### Reward Formula
-```
-reward = (quality_improvement × 0.4)
-       + (issue_resolution × 0.4)
-       + (schema_validity × 0.2)
-       - (destructive_penalty × 0.5)
-       - (redundant_penalty × 0.05)
-       - (step_penalty × 0.01)
-```
-### Grading System
-```mermaid
-graph TD
-    subgraph "Easy Task Weights"
-        A1[Completeness: 40%]
-        A2[Uniqueness: 30%]
-        A3[Format: 20%]
-        A4[Structure: 10%]
-    end
-    subgraph "Medium Task Weights"
-        B1[Completeness: 25%]
-        B2[Uniqueness: 20%]
-        B3[Format: 25%]
-        B4[Structure: 15%]
-        B5[Consistency: 15%]
-    end
-    subgraph "Hard Task Weights"
-        C1[Completeness: 20%]
-        C2[Uniqueness: 15%]
-        C3[Format: 20%]
-        C4[Structure: 15%]
-        C5[Consistency: 15%]
-        C6[Cross-Field: 15%]
-    end
-```
 ---
-## Task Levels
-### Easy Task (easy_001)
-```mermaid
-graph LR
-    A[Customer Database] --> B[100 Records]
-    B --> C[Missing Values]
-    B --> D[Exact Duplicates]
-    C --> E[Fill with median/mode]
-    D --> F[Drop duplicates]
-    E --> G[Clean Data]
-    F --> G
-    style A fill:#ffd43b,stroke:#f59f00
-    style G fill:#51cf66,stroke:#2f9e44
-```
-- **Dataset**: Customer database (100 records)
-- **Issues**: Missing values in age/email, 8 duplicate records
-- **Max Steps**: 30
-- **Focus**: Basic cleaning operations
-### Medium Task (medium_001)
-```mermaid
-graph LR
-    A[Employee Records] --> B[120 Records]
-    B --> C[Format Issues]
-    B --> D[Validation Errors]
-    B --> E[Missing Values]
-    C --> F[Standardize formats]
-    D --> G[Validate ranges]
-    E --> H[Fill missing]
-    F --> I[Clean Data]
-    G --> I
-    H --> I
-    style A fill:#ffd43b,stroke:#f59f00
-    style I fill:#51cf66,stroke:#2f9e44
-```
-- **Dataset**: Employee records (120 records)
-- **Issues**: Inconsistent email/phone/date formats, out-of-range salary/age
-- **Max Steps**: 40
-- **Focus**: Format standardization, range validation
-### Hard Task (hard_001)
-```mermaid
-graph LR
-    A[Sales Transactions] --> B[150 Records]
-    B --> C[Cross-Field Errors]
-    B --> D[Outliers]
-    B --> E[Complex Missing]
-    B --> F[Duplicates]
-    C --> G[Validate relationships]
-    D --> H[Detect outliers]
-    E --> I[Infer values]
-    F --> J[Remove duplicates]
-    G --> K[Clean Data]
-    H --> K
-    I --> K
-    J --> K
-    style A fill:#ffd43b,stroke:#f59f00
-    style K fill:#51cf66,stroke:#2f9e44
-```
-- **Dataset**: Sales transactions (150 records)
-- **Issues**: End dates before start dates, total ≠ price × quantity, statistical outliers
-- **Max Steps**: 50
-- **Focus**: Cross-field validation, anomaly detection
----
-## Quick Start
-### Prerequisites
-- Python 3.11+
-- pip or Docker
-### Installation
 ```bash
-# Clone repository
-git clone <repository-url>
-cd data-cleaning-env
-# Create virtual environment
-python -m venv venv
-source venv/bin/activate  # Linux/Mac
-# or
-venv\Scripts\activate  # Windows
-# Install dependencies
-pip install -r requirements.txt
-# Generate datasets
-python -c "from env.tasks import TaskManager; tm = TaskManager(); tm.generate_datasets()"
-# Start server
-python app.py
 ```
-### Verify Installation
 ```bash
-# Check health
-curl http://localhost:7860/health
-# List tasks
-curl http://localhost:7860/tasks
-```
----
-## Deployment
-### Docker Deployment
-```bash
-# Build image
-docker build -t data-cleaning-env .
-# Run container
-docker run -p 7860:7860 data-cleaning-env
-# Run with environment variables
-docker run -p 7860:7860 \
-  -e PORT=7860 \
-  -e LOG_LEVEL=info \
-  data-cleaning-env
-```
-### Docker Compose
-```yaml
-version: "3.8"
-services:
-  data-cleaning-env:
-    build: .
-    ports:
-      - "7860:7860"
-    environment:
-      - PORT=7860
-      - LOG_LEVEL=info
-    healthcheck:
-      test: ["CMD", "curl", "-f", "http://localhost:7860/health"]
-      interval: 30s
-      timeout: 10s
-      retries: 3
-```
-### Hugging Face Spaces
-1. **Create Space**
-   - Go to huggingface.co/spaces
-   - Click "Create new Space"
-   - Select "Docker" as SDK
-   - Choose "Blank" template
-2. **Upload Files**
-   ```
-   ├── env/
-   ├── agent/
-   ├── app.py
-   ├── Dockerfile
-   ├── requirements.txt
-   └── README.md
-   ```
-3. **Configure Space**
-   - Set `app.py` as main file
-   - Set port to 7860
-   - Add any required secrets
-4. **Deploy**
-   - Space will automatically build
-   - Access via provided URL
-### Cloud Deployment
-#### AWS ECS
-```bash
-# Build and push to ECR
-aws ecr create-repository --repository-name data-cleaning-env
-docker tag data-cleaning-env:latest <account-id>.dkr.ecr.<region>.amazonaws.com/data-cleaning-env:latest
-docker push <account-id>.dkr.ecr.<region>.amazonaws.com/data-cleaning-env:latest
-# Create ECS task definition
-aws ecs register-task-definition --cli-input-json file://task-definition.json
-# Run task
-aws ecs run-task --cluster default --task-definition data-cleaning-env
-```
-#### Google Cloud Run
-```bash
-# Build and push to GCR
-gcloud builds submit --tag gcr.io/<project-id>/data-cleaning-env
-# Deploy to Cloud Run
-gcloud run deploy data-cleaning-env \
-  --image gcr.io/<project-id>/data-cleaning-env \
-  --platform managed \
-  --port 7860
-```
----
-## API Reference
-### Endpoints
-```mermaid
-graph LR
-    subgraph "Core Endpoints"
-        A[GET /] --> A1[Root info]
-        B[GET /health] --> B1[Health check]
-        C[GET /tasks] --> C1[List tasks]
-        D[GET /tasks/:id] --> D1[Task details]
-    end
-    subgraph "Environment Endpoints"
-        E[POST /reset] --> E1[Reset env]
-        F[POST /step] --> F1[Execute action]
-        G[GET /state/:id] --> G1[Get state]
-        H[GET /data/:id] --> H1[Get data]
-    end
-    subgraph "Session Endpoints"
-        I[GET /sessions] --> I1[List sessions]
-        J[DELETE /session/:id] --> J1[Delete session]
-        K[GET /history/:id] --> K1[Get history]
-    end
 ```
-### Example Usage
-#### Reset Environment
 ```bash
-curl -X POST http://localhost:7860/reset \
   -H "Content-Type: application/json" \
-  -d '{"task_id": "easy_001", "session_id": "test-session"}'
 ```
-#### Execute Action
 ```bash
-curl -X POST http://localhost:7860/step \
   -H "Content-Type: application/json" \
   -d '{
-    "session_id": "test-session",
     "action": {
       "action_type": "fill_missing",
       "params": {"column": "age", "strategy": "median"}
@@ -797,244 +115,44 @@ curl -X POST http://localhost:7860/step \
   }'
 ```
-#### Get Current Data
-```bash
-curl http://localhost:7860/data/test-session?rows=10
-```
----
-## Running the Baseline Agent
-### Setup
 ```bash
-# Set OpenAI API key
-export OPENAI_API_KEY="your-api-key"
-# Run on easy task
-python agent/baseline.py easy
-# Run on medium task
-python agent/baseline.py medium
-# Run on hard task
-python agent/baseline.py hard
 ```
-### Agent Strategy
-```mermaid
-flowchart TD
-    A[Start] --> B[Analyze Issues]
-    B --> C[Prioritize High Severity]
-    C --> D{Missing Values?}
-    D -->|Yes| E[Fill Missing]
-    D -->|No| F{Duplicates?}
-    F -->|Yes| G[Drop Duplicates]
-    F -->|No| H{Format Issues?}
-    H -->|Yes| I[Standardize Format]
-    H -->|No| J{Out of Range?}
-    J -->|Yes| K[Validate Range]
-    J -->|No| L{Outliers?}
-    L -->|Yes| M[Detect Outliers]
-    L -->|No| N{Quality Good?}
-    N -->|Yes| O[Submit]
-    N -->|No| P[Continue Cleaning]
-    E --> Q[Next Iteration]
-    G --> Q
-    I --> Q
-    K --> Q
-    M --> Q
-    P --> Q
-    Q --> B
-    style O fill:#51cf66,stroke:#2f9e44
-```
-### Agent Output Example
-```
-============================================================
-Task: Clean a customer database with missing values and duplicate records
-Level: easy
-Max Steps: 30
-============================================================
-Step 1: fill_missing
-  Params: {'column': 'age', 'strategy': 'median'}
-  Reward: 0.2345
-  Quality: 78.45%
-  Issues: 18
-Step 2: fill_missing
-  Params: {'column': 'email', 'strategy': 'mode'}
-  Reward: 0.1892
-  Quality: 85.23%
-  Issues: 12
-Step 3: drop_duplicates
-  Params: {}
-  Reward: 0.3421
-  Quality: 92.15%
-  Issues: 4
-Step 4: submit
-  Reward: 1.2500
-  Quality: 95.80%
-  Issues: 0
-============================================================
-RESULTS:
-  Total Steps: 4
-  Total Reward: 2.0158
-  Final Quality: 95.80%
-  Final Score: 87.50%
-============================================================
-```
----
-## Expected Scores
-### Performance Benchmarks
-```mermaid
-graph TB
-    subgraph "Easy Task"
-        A1[Agent Score: 75-90%]
-        A2[Human Baseline: 95%]
-        A3[Random Baseline: 40%]
-    end
-    subgraph "Medium Task"
-        B1[Agent Score: 65-80%]
-        B2[Human Baseline: 90%]
-        B3[Random Baseline: 30%]
-    end
-    subgraph "Hard Task"
-        C1[Agent Score: 55-75%]
-        C2[Human Baseline: 85%]
-        C3[Random Baseline: 20%]
-    end
-```
-### Score Interpretation
-| Score Range | Rating            | Description                                    |
-| ----------- | ----------------- | ---------------------------------------------- |
-| 90-100%     | Excellent         | Near-perfect cleaning, minimal errors          |
-| 80-89%      | Good              | Most issues resolved, minor mistakes           |
-| 70-79%      | Satisfactory      | Core issues addressed, some gaps               |
-| 60-69%      | Needs Improvement | Basic cleaning done, significant issues remain |
-| <60%        | Poor              | Major issues unresolved                        |
----
-## Development Guide
-### Adding New Tasks
-1. **Define Task Config** (env/tasks.py)
-   ```python
-   self.tasks['new_task'] = TaskConfig(
-       task_id='new_task',
-       task_level=TaskLevel.MEDIUM,
-       description="Task description",
-       dataset_path=str(self.data_dir / "new_dirty.csv"),
-       expected_output_path=str(self.data_dir / "new_clean.csv"),
-       max_steps=40,
-       issues=[...]
-   )
-   ```
-2. **Create Dataset Generators**
-   ```python
-   def _generate_new_dataset(self):
-       # Generate clean data
-       clean_df = pd.DataFrame(...)
-       clean_df.to_csv(self.data_dir / "new_clean.csv", index=False)
-       # Add realistic issues
-       dirty_df = clean_df.copy()
-       # Add missing values, duplicates, format issues, etc.
-       dirty_df.to_csv(self.data_dir / "new_dirty.csv", index=False)
-   ```
-### Adding New Actions
-1. **Add Action Type** (env/models.py)
-   ```python
-   class ActionType(str, Enum):
-       NEW_ACTION = "new_action"
-   ```
-2. **Create Parameter Model** (env/models.py)
-   ```python
-   class NewActionParams(BaseModel):
-       param1: str
-       param2: Optional[int] = None
-   ```
-3. **Implement Action** (env/environment.py)
-   ```python
-   def _new_action(self, params: Dict) -> ActionResult:
-       # Implementation
-       return ActionResult(success=True, message="...", rows_affected=n)
-   ```
-4. **Update Agent Prompt** (agent/baseline.py)
-   - Add action description to system prompt
-   - Include example usage
-### Customizing Rewards
-Modify weights in `RewardCalculator.__init__()`:
-```python
-self.step_penalty = 0.01        # Penalty per step
-self.destructive_penalty = 0.5  # Penalty for destructive changes
-self.redundant_penalty = 0.05   # Penalty for redundant actions
-self.quality_weight = 0.4       # Weight for quality improvement
-self.issue_weight = 0.4         # Weight for issue resolution
-self.schema_weight = 0.2        # Weight for schema validity
 ```
----
-## License
-MIT License
-## Repository
-https://github.com/meta-pytorch/OpenEnv
----
-## Citation
-If you use this environment in your research, please cite:
-```bibtex
-@software{openenv_data_cleaning,
-  title={OpenEnv Data Cleaning Environment},
-  author={OpenEnv Team},
-  year={2024},
-  url={https://github.com/meta-pytorch/OpenEnv}
-}
-```

 ---
+title: OpenEnv Data Cleaning Environment
+emoji: 🧼
+colorFrom: blue
+colorTo: green
+sdk: docker
+app_port: 7860
+pinned: false
+tags:
+  - fastapi
+  - docker
+  - openenv
+  - data-cleaning
+  - data-validation
 ---
+## What this is
+This repo is a **Dockerized FastAPI application** that serves an OpenEnv-style **data cleaning environment**:
+- `POST /reset` to start a session on a task
+- `POST /step` to take actions (fill missing, drop duplicates, standardize formats, etc.)
+- `GET /health` for readiness checks
+It is suitable for **Hugging Face Spaces (Docker)**. Inference Endpoints are not ideal here because this is an interactive multi-endpoint environment, not a single model inference API.
+## Web UI (optional)
+Open `/web` for a lightweight dashboard to reset/step and view the table preview.
+## Real-world task
+Simulates a common data engineering workflow: **cleaning a dirty table** so downstream analytics/ML won’t break.
+Agents must iteratively apply safe transformations (imputation, deduplication, normalization, format standardization, range/outlier handling) and then **submit**.
+## Tasks (3 levels, deterministic grading)
+- **easy_001**: missing values + exact duplicates (customer table)
+- **medium_001**: missing values + format inconsistencies + invalid ranges (employee table)
+- **hard_001**: missing values + duplicates + mixed date/currency formats + cross-field constraints + outliers (sales table)
+On `submit`, the grader returns a **score in [0.0, 1.0]** in `info.grade.final_score`.
+## Action space
+`Action` = `{ "action_type": <enum>, "params": <dict> }`
+Supported `action_type` values:
+- `fill_missing` (`column`, `strategy` in {mean, median, mode, forward_fill, backward_fill}, optional `value`)
+- `drop_duplicates` (optional `subset`, optional `keep`)
+- `normalize_text` (`column`, `operations`)
+- `standardize_format` (`column`, `format_type` in {email, date, phone, currency, percentage})
+- `validate_range` (`column`, optional `min_value`, optional `max_value`)
+- `detect_outliers` (`column`, `method` in {iqr, zscore}, optional `threshold`)
+- `infer_values` (`column`, `method`, optional `reference_columns`)
+- `flag_invalid` (`row_id`, optional `reason`)
+- `revert_last_action` (no params)
+- `submit` (no params)
+## Observation space
+Each `step()` returns an `Observation` including:
+- `table_preview`: first N rows as JSON records
+- `column_schema`: per-column type, null counts, unique counts, samples
+- `detected_issues`: issue summaries (type/count/severity)
+- `quality_metrics`: completeness/validity/consistency/uniqueness/overall (0..1)
+- `issues_remaining`, `step_count`, `max_steps`, plus `last_action_result`
+## Reward shaping (dense signal)
+Reward is a **multi-component** `Reward` model:
+- **Positive**: quality improvement, issue resolution progress, schema validity
+- **Penalties**: destructive changes, redundant actions, per-step cost
+This gives partial-progress signal instead of only terminal success/failure.
+## Local run (Docker)
 ```bash
+docker build -t datacleanser .
+docker run --rm -p 7860:7860 datacleanser
 ```
+Verify:
 ```bash
+curl -fsS http://localhost:7860/health
+curl -fsS http://localhost:7860/tasks
 ```
+## API usage
+Reset:
 ```bash
+curl -fsS -X POST http://localhost:7860/reset \
   -H "Content-Type: application/json" \
+  -d '{"task_id":"easy_001","session_id":"demo"}'
 ```
+Step:
 ```bash
+curl -fsS -X POST http://localhost:7860/step \
   -H "Content-Type: application/json" \
   -d '{
+    "session_id": "demo",
     "action": {
       "action_type": "fill_missing",
       "params": {"column": "age", "strategy": "median"}
     }
   }'
 ```
+## Baseline agent (LLM) inference
+The baseline script is `inference.py` (repo root). It uses an **OpenAI-compatible** API.
+Required environment variables (per submission rules):
+- `API_BASE_URL`: OpenAI-compatible endpoint base URL (optional if using OpenAI default)
+- `MODEL_NAME`: model id (e.g. `gpt-4.1-mini`, or your provider’s model name)
+- `OPENAI_API_KEY`: API key (preferred)
+- `HF_TOKEN`: API key fallback (used if `OPENAI_API_KEY` is not set)
+Run all 3 tasks locally via Docker:
 ```bash
+docker build -t datacleanser .
+docker run --rm -p 7860:7860 datacleanser
 ```
+In another terminal, run the baseline either from a local Python environment that has the dependencies installed or inside the running container:
+```bash
+docker exec -it \
+  -e API_BASE_URL -e MODEL_NAME -e OPENAI_API_KEY -e HF_TOKEN \
+  $(docker ps -q --filter ancestor=datacleanser | head -n 1) \
+  python3 inference.py --all --out baseline_results.json
 ```
+## Hugging Face Spaces (Docker) deployment
+1. Create a Space → **SDK: Docker**
+2. Push these files to the Space repo:
+   - `Dockerfile`
+   - `.dockerignore`
+   - `requirements.txt`
+   - `app.py`
+   - `env/`, `agent/`, `data/` (optional; datasets are generated on startup)
+   - `README.md` (this file)
+3. The Space will build and start automatically on port **7860**.
+## Notes
+- The server generates datasets on startup (see `app.py` startup event).
+- For baseline agent runs (outside Spaces), set `OPENAI_API_KEY` and use `inference.py`.

app.py CHANGED Viewed

@@ -9,6 +9,8 @@ from pathlib import Path
 from fastapi import FastAPI, HTTPException, BackgroundTasks
 from fastapi.middleware.cors import CORSMiddleware
 from pydantic import BaseModel
 import uvicorn
@@ -39,6 +41,24 @@ app.add_middleware(
 environments: Dict[str, DataCleaningEnv] = {}
 task_manager = TaskManager()
 class ResetRequest(BaseModel):
     task_id: str
@@ -78,6 +98,382 @@ async def root():
     }
 @app.get("/health")
 async def health_check():
     """Health check endpoint"""
@@ -149,8 +545,8 @@ async def reset_environment(request: ResetRequest):
             message=f"Environment reset with task {request.task_id}",
             data={
                 "session_id": session_id,
-                "observation": observation.dict(),
-                "state": env.state()
             }
         )
     except Exception as e:
@@ -177,11 +573,11 @@ async def step_environment(request: StepRequest):
             success=True,
             message="Action executed",
             data={
-                "observation": observation.dict(),
-                "reward": reward.dict(),
                 "done": done,
-                "info": info,
-                "state": env.state()
             }
         )
     except Exception as e:
@@ -201,10 +597,16 @@ async def get_state(session_id: str):
     env = environments[session_id]
     return {
         "session_id": session_id,
-        "state": env.state()
     }
 @app.get("/data/{session_id}")
 async def get_current_data(session_id: str, rows: int = 100):
     """Get current dataframe"""
@@ -221,7 +623,7 @@ async def get_current_data(session_id: str, rows: int = 100):
         "session_id": session_id,
         "rows": len(df),
         "columns": list(df.columns),
-        "data": df.head(rows).to_dict('records')
     }
@@ -237,7 +639,7 @@ async def get_history(session_id: str):
     env = environments[session_id]
     return {
         "session_id": session_id,
-        "history": env.get_history()
     }

 from fastapi import FastAPI, HTTPException, BackgroundTasks
 from fastapi.middleware.cors import CORSMiddleware
+from fastapi.encoders import jsonable_encoder
+from fastapi.responses import HTMLResponse
 from pydantic import BaseModel
 import uvicorn
 environments: Dict[str, DataCleaningEnv] = {}
 task_manager = TaskManager()
+def _to_jsonable(obj: Any) -> Any:
+    """
+    Convert common non-JSON-native scalar types (notably numpy scalars) into plain python types.
+    """
+    # numpy scalar types have .item(); converting them avoids FastAPI/Pydantic serialization errors.
+    if hasattr(obj, "item") and callable(getattr(obj, "item")):
+        try:
+            return obj.item()
+        except Exception:
+            pass
+    if isinstance(obj, dict):
+        return {k: _to_jsonable(v) for k, v in obj.items()}
+    if isinstance(obj, list):
+        return [_to_jsonable(v) for v in obj]
+    if isinstance(obj, tuple):
+        return [_to_jsonable(v) for v in obj]
+    return obj
 class ResetRequest(BaseModel):
     task_id: str
     }
+@app.get("/web", response_class=HTMLResponse)
+async def web_ui():
+    html = """
+<!doctype html>
+<html lang="en">
+  <head>
+    <meta charset="utf-8" />
+    <meta name="viewport" content="width=device-width, initial-scale=1" />
+    <title>DataCleanser • OpenEnv</title>
+    <style>
+      :root {
+        --bg: #0b1020;
+        --panel: #0f1730;
+        --muted: #8aa0c6;
+        --text: #e8eeff;
+        --border: rgba(255,255,255,0.10);
+        --accent: #6aa2ff;
+        --good: #2dd4bf;
+        --bad: #fb7185;
+        --warn: #fbbf24;
+        --mono: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", monospace;
+        --sans: ui-sans-serif, system-ui, -apple-system, Segoe UI, Roboto, Helvetica, Arial, "Apple Color Emoji", "Segoe UI Emoji";
+      }
+      * { box-sizing: border-box; }
+      body {
+        margin: 0; padding: 0;
+        font-family: var(--sans);
+        background: radial-gradient(900px 600px at 20% 0%, rgba(106,162,255,0.18), transparent 55%),
+                    radial-gradient(900px 600px at 80% 0%, rgba(45,212,191,0.14), transparent 55%),
+                    var(--bg);
+        color: var(--text);
+      }
+      .container { max-width: 1200px; margin: 0 auto; padding: 20px; }
+      .topbar {
+        display: flex; gap: 12px; align-items: center; justify-content: space-between;
+        padding: 14px 16px; border: 1px solid var(--border); border-radius: 14px;
+        background: linear-gradient(180deg, rgba(255,255,255,0.06), rgba(255,255,255,0.02));
+        backdrop-filter: blur(8px);
+      }
+      .brand { display: flex; gap: 10px; align-items: center; }
+      .badge { font-family: var(--mono); font-size: 12px; color: var(--muted); border: 1px solid var(--border); padding: 4px 8px; border-radius: 999px; }
+      h1 { font-size: 18px; margin: 0; letter-spacing: 0.2px; }
+      .grid { display: grid; grid-template-columns: 380px 1fr; gap: 14px; margin-top: 14px; }
+      .card {
+        border: 1px solid var(--border);
+        border-radius: 14px;
+        background: rgba(15,23,48,0.72);
+        backdrop-filter: blur(8px);
+        overflow: hidden;
+      }
+      .card h2 {
+        font-size: 13px; letter-spacing: 0.5px; text-transform: uppercase;
+        margin: 0; padding: 12px 14px; border-bottom: 1px solid var(--border); color: var(--muted);
+      }
+      .card .body { padding: 14px; }
+      label { display: block; font-size: 12px; color: var(--muted); margin-bottom: 6px; }
+      input, select, textarea {
+        width: 100%;
+        padding: 10px 10px;
+        border-radius: 10px;
+        border: 1px solid var(--border);
+        background: rgba(255,255,255,0.03);
+        color: var(--text);
+        outline: none;
+      }
+      textarea { min-height: 110px; font-family: var(--mono); font-size: 12px; }
+      .row { display: grid; grid-template-columns: 1fr 1fr; gap: 10px; }
+      .btnrow { display: flex; flex-wrap: wrap; gap: 10px; margin-top: 10px; }
+      button {
+        border: 1px solid var(--border);
+        background: rgba(255,255,255,0.06);
+        color: var(--text);
+        padding: 10px 12px;
+        border-radius: 12px;
+        cursor: pointer;
+        font-weight: 600;
+      }
+      button.primary { border-color: rgba(106,162,255,0.5); background: rgba(106,162,255,0.14); }
+      button.good { border-color: rgba(45,212,191,0.5); background: rgba(45,212,191,0.12); }
+      button.bad { border-color: rgba(251,113,133,0.5); background: rgba(251,113,133,0.10); }
+      button:disabled { opacity: 0.55; cursor: not-allowed; }
+      .pill {
+        display: inline-flex; align-items: center; gap: 8px;
+        padding: 6px 10px; border: 1px solid var(--border); border-radius: 999px;
+        color: var(--muted); font-size: 12px; font-family: var(--mono);
+      }
+      .statusDot { width: 8px; height: 8px; border-radius: 999px; background: var(--warn); }
+      .statusDot.ok { background: var(--good); }
+      .statusDot.err { background: var(--bad); }
+      .kvs { display: grid; grid-template-columns: repeat(3, 1fr); gap: 10px; }
+      .kv { border: 1px solid var(--border); border-radius: 12px; padding: 10px; background: rgba(255,255,255,0.02); }
+      .kv .k { color: var(--muted); font-size: 11px; margin-bottom: 6px; }
+      .kv .v { font-family: var(--mono); font-size: 13px; }
+      .mono { font-family: var(--mono); font-size: 12px; color: var(--muted); }
+      table { width: 100%; border-collapse: collapse; font-size: 12px; }
+      th, td { border-bottom: 1px solid var(--border); padding: 8px 8px; text-align: left; vertical-align: top; }
+      th { color: var(--muted); font-weight: 600; background: rgba(255,255,255,0.02); position: sticky; top: 0; }
+      .scroll { max-height: 520px; overflow: auto; }
+      .msg { white-space: pre-wrap; font-family: var(--mono); font-size: 12px; color: var(--muted); margin: 0; }
+      .foot { margin-top: 14px; color: var(--muted); font-size: 12px; }
+      a { color: var(--accent); text-decoration: none; }
+      @media (max-width: 980px) { .grid { grid-template-columns: 1fr; } }
+    </style>
+  </head>
+  <body>
+    <div class="container">
+      <div class="topbar">
+        <div class="brand">
+          <h1>DataCleanser • OpenEnv Environment</h1>
+          <span class="badge">HTTP API + in-memory sessions</span>
+        </div>
+        <div class="pill" title="Backend health">
+          <span id="healthDot" class="statusDot"></span>
+          <span id="healthText">checking /health…</span>
+        </div>
+      </div>
+      <div class="grid">
+        <div class="card">
+          <h2>Session</h2>
+          <div class="body">
+            <label>Session ID</label>
+            <input id="sessionId" value="demo" />
+            <div style="height:10px"></div>
+            <div class="row">
+              <div>
+                <label>Task</label>
+                <select id="taskSelect"></select>
+              </div>
+              <div>
+                <label>Difficulty filter</label>
+                <select id="levelFilter">
+                  <option value="">all</option>
+                  <option value="easy">easy</option>
+                  <option value="medium">medium</option>
+                  <option value="hard">hard</option>
+                </select>
+              </div>
+            </div>
+            <div class="btnrow">
+              <button class="primary" id="btnLoadTasks">Load tasks</button>
+              <button class="good" id="btnReset">Reset</button>
+              <button id="btnState">State</button>
+              <button class="bad" id="btnDelete">Delete session</button>
+            </div>
+            <div style="height:10px"></div>
+            <p class="msg" id="sessionMsg"></p>
+          </div>
+        </div>
+        <div class="card">
+          <h2>Step</h2>
+          <div class="body">
+            <div class="row">
+              <div>
+                <label>Action type</label>
+                <select id="actionType">
+                  <option value="fill_missing">fill_missing</option>
+                  <option value="drop_duplicates">drop_duplicates</option>
+                  <option value="normalize_text">normalize_text</option>
+                  <option value="standardize_format">standardize_format</option>
+                  <option value="validate_range">validate_range</option>
+                  <option value="detect_outliers">detect_outliers</option>
+                  <option value="infer_values">infer_values</option>
+                  <option value="flag_invalid">flag_invalid</option>
+                  <option value="revert_last_action">revert_last_action</option>
+                  <option value="submit">submit</option>
+                </select>
+              </div>
+              <div>
+                <label>Quick templates</label>
+                <select id="templateSelect">
+                  <option value="">—</option>
+                  <option value="{&quot;column&quot;:&quot;age&quot;,&quot;strategy&quot;:&quot;median&quot;}">fill_missing age median</option>
+                  <option value="{&quot;column&quot;:&quot;email&quot;,&quot;format_type&quot;:&quot;email&quot;}">standardize_format email</option>
+                  <option value="{&quot;column&quot;:&quot;phone&quot;,&quot;format_type&quot;:&quot;phone&quot;}">standardize_format phone</option>
+                  <option value="{&quot;subset&quot;:[&quot;email&quot;],&quot;keep&quot;:&quot;first&quot;}">drop_duplicates by email</option>
+                </select>
+              </div>
+            </div>
+            <div style="height:10px"></div>
+            <label>Params (JSON)</label>
+            <textarea id="paramsJson">{}</textarea>
+            <div class="btnrow">
+              <button class="primary" id="btnStep">Step</button>
+              <button class="good" id="btnSubmit">Submit</button>
+            </div>
+            <div style="height:10px"></div>
+            <p class="msg" id="stepMsg"></p>
+          </div>
+        </div>
+      </div>
+      <div class="grid">
+        <div class="card">
+          <h2>Metrics</h2>
+          <div class="body">
+            <div class="kvs">
+              <div class="kv"><div class="k">Step</div><div class="v" id="kvStep">—</div></div>
+              <div class="kv"><div class="k">Issues remaining</div><div class="v" id="kvIssues">—</div></div>
+              <div class="kv"><div class="k">Reward (last)</div><div class="v" id="kvReward">—</div></div>
+            </div>
+            <div style="height:10px"></div>
+            <div class="kvs">
+              <div class="kv"><div class="k">Overall quality</div><div class="v" id="kvOverall">—</div></div>
+              <div class="kv"><div class="k">Completeness</div><div class="v" id="kvComplete">—</div></div>
+              <div class="kv"><div class="k">Uniqueness</div><div class="v" id="kvUnique">—</div></div>
+            </div>
+            <div style="height:10px"></div>
+            <p class="mono">Tip: Use <a href="/docs" target="_blank" rel="noreferrer">/docs</a> for full API schema.</p>
+          </div>
+        </div>
+        <div class="card">
+          <h2>Table preview</h2>
+          <div class="body">
+            <div class="scroll">
+              <table id="previewTable"></table>
+            </div>
+          </div>
+        </div>
+      </div>
+      <div class="card" style="margin-top:14px">
+        <h2>Issues & last action</h2>
+        <div class="body">
+          <pre class="msg" id="issuesBox">—</pre>
+        </div>
+      </div>
+      <div class="foot">
+        If the UI loads but actions fail, your session may not exist yet — click <b>Reset</b> first.
+      </div>
+    </div>
+    <script>
+      const $ = (id) => document.getElementById(id);
+      function setHealth(ok, text) {
+        const dot = $("healthDot");
+        dot.classList.remove("ok", "err");
+        dot.classList.add(ok ? "ok" : "err");
+        $("healthText").textContent = text;
+      }
+      async function api(method, path, body=null) {
+        const opts = { method, headers: { "Content-Type": "application/json" } };
+        if (body) opts.body = JSON.stringify(body);
+        const res = await fetch(path, opts);
+        const txt = await res.text();
+        let data = null;
+        try { data = txt ? JSON.parse(txt) : null; } catch { data = { raw: txt }; }
+        if (!res.ok) {
+          const msg = (data && (data.detail || data.message)) ? (data.detail || data.message) : txt;
+          throw new Error(`${res.status} ${res.statusText}: ${msg}`);
+        }
+        return data;
+      }
+      function renderPreview(obs) {
+        const table = $("previewTable");
+        const rows = (obs && obs.table_preview) ? obs.table_preview : [];
+        if (!rows.length) { table.innerHTML = "<tr><td class='mono'>No preview available</td></tr>"; return; }
+        const cols = Object.keys(rows[0]);
+        const esc = (v) => String(v).replace(/[&<>"']/g, ch => `&#${ch.charCodeAt(0)};`);  // escape values before inserting as HTML
+        const thead = "<thead><tr>" + cols.map(c => `<th>${esc(c)}</th>`).join("") + "</tr></thead>";
+        const tbody = "<tbody>" + rows.slice(0, 30).map(r =>
+          "<tr>" + cols.map(c => `<td>${(r[c] === null || r[c] === undefined) ? "" : esc(r[c])}</td>`).join("") + "</tr>"
+        ).join("") + "</tbody>";
+        table.innerHTML = thead + tbody;
+      }
+      function updateMetrics(data) {
+        const obs = data?.observation || data?.data?.observation;
+        const reward = data?.reward || data?.data?.reward;
+        if (!obs) return;
+        $("kvStep").textContent = `${obs.step_count ?? "—"} / ${obs.max_steps ?? "—"}`;
+        $("kvIssues").textContent = String(obs.issues_remaining ?? "—");
+        $("kvReward").textContent = reward ? String(reward.total ?? "—") : "—";
+        const qm = obs.quality_metrics || {};
+        $("kvOverall").textContent = (qm.overall ?? "—");
+        $("kvComplete").textContent = (qm.completeness ?? "—");
+        $("kvUnique").textContent = (qm.uniqueness ?? "—");
+        const issues = {
+          detected_issues: obs.detected_issues || [],
+          last_action_result: obs.last_action_result || null,
+          done: data?.done ?? data?.data?.done ?? null,
+          grade: data?.info?.grade || data?.data?.info?.grade || null
+        };
+        $("issuesBox").textContent = JSON.stringify(issues, null, 2);
+        renderPreview(obs);
+      }
+      async function checkHealth() {
+        try {
+          const h = await api("GET", "/health");
+          setHealth(true, `healthy`);
+        } catch (e) {
+          setHealth(false, `unhealthy`);
+        }
+      }
+      async function loadTasks() {
+        $("sessionMsg").textContent = "Loading tasks…";
+        const level = $("levelFilter").value;
+        const q = level ? `?level=${encodeURIComponent(level)}` : "";
+        const out = await api("GET", `/tasks${q}`);
+        const tasks = out.tasks || [];
+        const sel = $("taskSelect");
+        sel.innerHTML = tasks.map(t => `<option value="${t.task_id}">${t.task_id} • ${t.task_level}</option>`).join("");
+        $("sessionMsg").textContent = tasks.length ? `Loaded ${tasks.length} tasks.` : "No tasks found.";
+      }
+      async function doReset() {
+        const task_id = $("taskSelect").value;
+        const session_id = $("sessionId").value.trim() || "demo";
+        $("sessionMsg").textContent = "Resetting…";
+        const out = await api("POST", "/reset", { task_id, session_id });
+        $("sessionMsg").textContent = out.message || "Reset ok.";
+        updateMetrics(out.data);
+      }
+      async function doState() {
+        const session_id = $("sessionId").value.trim() || "demo";
+        $("sessionMsg").textContent = "Fetching state…";
+        const out = await api("GET", `/state?session_id=${encodeURIComponent(session_id)}`);
+        $("sessionMsg").textContent = "State ok.";
+        $("issuesBox").textContent = JSON.stringify(out, null, 2);
+      }
+      async function doDelete() {
+        const session_id = $("sessionId").value.trim() || "demo";
+        $("sessionMsg").textContent = "Deleting session…";
+        const out = await api("DELETE", `/session/${encodeURIComponent(session_id)}`);
+        $("sessionMsg").textContent = out.message || "Deleted.";
+      }
+      async function doStep(forceSubmit=false) {
+        const session_id = $("sessionId").value.trim() || "demo";
+        const action_type = forceSubmit ? "submit" : $("actionType").value;
+        let params = {};
+        try { params = JSON.parse($("paramsJson").value || "{}"); } catch (e) {
+          $("stepMsg").textContent = "Params JSON is invalid.";
+          return;
+        }
+        if (action_type === "submit" || action_type === "revert_last_action") params = {};
+        $("stepMsg").textContent = "Stepping…";
+        const out = await api("POST", "/step", { session_id, action: { action_type, params }});
+        $("stepMsg").textContent = out.message || "Step ok.";
+        updateMetrics(out.data);
+      }
+      $("btnLoadTasks").addEventListener("click", () => loadTasks().catch(e => $("sessionMsg").textContent = e.message));
+      $("btnReset").addEventListener("click", () => doReset().catch(e => $("sessionMsg").textContent = e.message));
+      $("btnState").addEventListener("click", () => doState().catch(e => $("sessionMsg").textContent = e.message));
+      $("btnDelete").addEventListener("click", () => doDelete().catch(e => $("sessionMsg").textContent = e.message));
+      $("btnStep").addEventListener("click", () => doStep(false).catch(e => $("stepMsg").textContent = e.message));
+      $("btnSubmit").addEventListener("click", () => doStep(true).catch(e => $("stepMsg").textContent = e.message));
+      $("templateSelect").addEventListener("change", (e) => {
+        const v = e.target.value;
+        if (v) $("paramsJson").value = v;
+      });
+      (async function init() {
+        await checkHealth();
+        await loadTasks();
+      })();
+      setInterval(checkHealth, 5000);
+    </script>
+  </body>
+</html>
+"""
+    return HTMLResponse(content=html)
 @app.get("/health")
 async def health_check():
     """Health check endpoint"""
             message=f"Environment reset with task {request.task_id}",
             data={
                 "session_id": session_id,
+                "observation": _to_jsonable(observation.model_dump(mode="json")),
+                "state": _to_jsonable(jsonable_encoder(env.state()))
             }
         )
     except Exception as e:
             success=True,
             message="Action executed",
             data={
+                "observation": _to_jsonable(observation.model_dump(mode="json")),
+                "reward": _to_jsonable(reward.model_dump(mode="json")),
                 "done": done,
+                "info": _to_jsonable(info),
+                "state": _to_jsonable(env.state())
             }
         )
     except Exception as e:
     env = environments[session_id]
     return {
         "session_id": session_id,
+        "state": _to_jsonable(jsonable_encoder(env.state()))
     }
+@app.get("/state")
+async def get_state_query(session_id: str):
+    """Get current environment state (query param form)"""
+    return await get_state(session_id)
 @app.get("/data/{session_id}")
 async def get_current_data(session_id: str, rows: int = 100):
     """Get current dataframe"""
         "session_id": session_id,
         "rows": len(df),
         "columns": list(df.columns),
+        "data": _to_jsonable(jsonable_encoder(df.head(rows).to_dict("records")))
     }
     env = environments[session_id]
     return {
         "session_id": session_id,
+        "history": _to_jsonable(jsonable_encoder(env.get_history()))
     }

env/environment.py CHANGED Viewed

@@ -181,17 +181,23 @@ class DataCleaningEnv:
         Returns:
             Dictionary containing current state information
         """
         return {
             'step_count': self.step_count,
             'max_steps': self.max_steps,
             'done': self.done,
             'total_rows': len(self.current_df) if self.current_df is not None else 0,
             'total_columns': len(self.current_df.columns) if self.current_df is not None else 0,
-            'current_issues': self.current_issues,
-            'initial_issues': self.initial_issues,
-            'quality_metrics': self.reward_calculator.calculate_quality_metrics(
-                self.current_df
-            ).dict() if self.current_df is not None else {},
             'history_length': len(self.history)
         }
@@ -495,7 +501,16 @@ class DataCleaningEnv:
     def _get_observation(self) -> Observation:
         """Generate observation from current state"""
-        preview = self.current_df.head(self.preview_rows).to_dict('records')
         schema = []
         for col in self.current_df.columns:
@@ -506,7 +521,7 @@ class DataCleaningEnv:
                 non_null_count=int(col_data.count()),
                 null_count=int(col_data.isnull().sum()),
                 unique_count=int(col_data.nunique()),
-                sample_values=col_data.dropna().head(3).tolist()
             ))
         detected_issues = []

         Returns:
             Dictionary containing current state information
         """
+        current_issues = {k: int(v) for k, v in (self.current_issues or {}).items()}
+        initial_issues = {k: int(v) for k, v in (self.initial_issues or {}).items()}
+        quality_metrics = (
+            self.reward_calculator.calculate_quality_metrics(self.current_df).model_dump(mode="json")
+            if self.current_df is not None
+            else {}
+        )
         return {
             'step_count': self.step_count,
             'max_steps': self.max_steps,
             'done': self.done,
             'total_rows': len(self.current_df) if self.current_df is not None else 0,
             'total_columns': len(self.current_df.columns) if self.current_df is not None else 0,
+            'current_issues': current_issues,
+            'initial_issues': initial_issues,
+            'quality_metrics': quality_metrics,
             'history_length': len(self.history)
         }
     def _get_observation(self) -> Observation:
         """Generate observation from current state"""
+        def _to_py(val: Any) -> Any:
+            # Convert numpy/pandas scalars into JSON-friendly python types
+            if isinstance(val, (np.generic,)):
+                return val.item()
+            return val
+        preview_raw = self.current_df.head(self.preview_rows).to_dict("records")
+        preview: List[Dict[str, Any]] = [
+            {k: _to_py(v) for k, v in row.items()} for row in preview_raw
+        ]
         schema = []
         for col in self.current_df.columns:
                 non_null_count=int(col_data.count()),
                 null_count=int(col_data.isnull().sum()),
                 unique_count=int(col_data.nunique()),
+                sample_values=[_to_py(v) for v in col_data.dropna().head(3).tolist()]
             ))
         detected_issues = []

env/reward.py CHANGED Viewed

@@ -39,8 +39,8 @@ class RewardCalculator:
     def calculate_quality_metrics(self, df: pd.DataFrame) -> QualityMetrics:
         """Calculate comprehensive quality metrics"""
-        completeness = calculate_completeness(df)
-        uniqueness = calculate_uniqueness(df)
         validity_scores = []
         for col in df.columns:
@@ -52,11 +52,11 @@ class RewardCalculator:
                 else:
                     validity_scores.append(1.0)
-        validity = np.mean(validity_scores) if validity_scores else 1.0
-        consistency = self._calculate_consistency(df)
-        overall = (
             completeness * 0.3 +
             validity * 0.3 +
             consistency * 0.2 +

     def calculate_quality_metrics(self, df: pd.DataFrame) -> QualityMetrics:
         """Calculate comprehensive quality metrics"""
+        completeness = float(calculate_completeness(df))
+        uniqueness = float(calculate_uniqueness(df))
         validity_scores = []
         for col in df.columns:
                 else:
                     validity_scores.append(1.0)
+        validity = float(np.mean(validity_scores)) if validity_scores else 1.0
+        consistency = float(self._calculate_consistency(df))
+        overall = float(
             completeness * 0.3 +
             validity * 0.3 +
             consistency * 0.2 +

inference.py CHANGED Viewed

@@ -1,26 +1,33 @@
 """
-Inference script for OpenEnv Data Cleaning Environment
-Uses OpenAI-compatible API to run baseline agent on data cleaning tasks
 """
 import os
 import json
 import logging
 import sys
-from typing import Dict, List, Optional, Any
 from openai import OpenAI
 from env.environment import DataCleaningEnv
 from env.models import Action, ActionType, Observation, TaskLevel
 from env.tasks import TaskManager
-logging.basicConfig(
-    level=logging.INFO,
-    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
-)
 logger = logging.getLogger(__name__)
 class DataCleaningAgent:
     """
     Agent that uses OpenAI-compatible API to select data cleaning actions
@@ -31,15 +38,17 @@ class DataCleaningAgent:
         api_key: Optional[str] = None,
         model: Optional[str] = None,
         api_base_url: Optional[str] = None,
-        temperature: float = 0.1,
-        max_tokens: int = 1000
     ):
-        self.api_key = api_key or os.getenv("OPENAI_API_KEY")
-        self.model = model or os.getenv("MODEL_NAME", "gpt-4")
         self.api_base_url = api_base_url or os.getenv("API_BASE_URL")
         if not self.api_key:
-            raise ValueError("OpenAI API key not provided. Set OPENAI_API_KEY environment variable.")
         client_kwargs = {"api_key": self.api_key}
         if self.api_base_url:
@@ -53,7 +62,6 @@ class DataCleaningAgent:
         self.system_prompt = self._build_system_prompt()
     def _build_system_prompt(self) -> str:
-        """Build the system prompt for the agent"""
         return """You are an expert data cleaning agent. Your task is to analyze data quality issues and select the most appropriate cleaning actions.
 You will receive observations from a data cleaning environment containing:
@@ -117,7 +125,7 @@ Be efficient - each step has a small penalty. Focus on actions that improve data
                     {"role": "user", "content": user_message}
                 ],
                 temperature=self.temperature,
-                max_tokens=self.max_tokens
             )
             response_text = response.choices[0].message.content
@@ -133,7 +141,7 @@ Be efficient - each step has a small penalty. Focus on actions that improve data
             return action
         except Exception as e:
-            logger.error(f"Error selecting action: {e}")
             return Action(action_type=ActionType.SUBMIT, params={})
     def _format_observation(self, observation: Observation) -> str:
@@ -340,80 +348,58 @@ def run_inference(
 def main():
-    """Main entry point for inference script"""
     import argparse
     parser = argparse.ArgumentParser(description="Run data cleaning agent inference")
-    parser.add_argument(
-        "task_level",
-        nargs="?",
-        default="easy",
-        choices=["easy", "medium", "hard"],
-        help="Task difficulty level (default: easy)"
-    )
-    parser.add_argument(
-        "--task-id",
-        type=str,
-        help="Specific task ID to run"
-    )
-    parser.add_argument(
-        "--model",
-        type=str,
-        default=None,
-        help="Model name (default: from MODEL_NAME env or gpt-4)"
-    )
-    parser.add_argument(
-        "--api-base",
-        type=str,
-        default=None,
-        help="API base URL (default: from API_BASE_URL env)"
-    )
-    parser.add_argument(
-        "--max-steps",
-        type=int,
-        default=None,
-        help="Override max steps"
-    )
-    parser.add_argument(
-        "--quiet",
-        action="store_true",
-        help="Suppress verbose output"
-    )
     args = parser.parse_args()
     # Generate datasets if needed
     task_manager = TaskManager()
     task_manager.generate_datasets()
-    # Determine task ID
-    if args.task_id:
-        task_id = args.task_id
     else:
-        tasks = task_manager.list_tasks(TaskLevel(args.task_level))
-        if not tasks:
-            print(f"No tasks found for level: {args.task_level}")
-            sys.exit(1)
-        task_id = tasks[0].task_id
     try:
-        result = run_inference(
-            task_id=task_id,
-            model=args.model,
-            api_base_url=args.api_base,
-            verbose=not args.quiet,
-            max_steps=args.max_steps
-        )
-        # Print summary
-        print(f"\nTask {task_id} completed with score: {result['grade'].get('final_score', 0):.2%}")
-        # Save results
-        output_file = f"results_{task_id}.json"
-        with open(output_file, 'w') as f:
-            json.dump(result, f, indent=2, default=str)
-        print(f"Results saved to {output_file}")
     except Exception as e:
         logger.error(f"Error running inference: {e}")
         sys.exit(1)

 """
+Baseline inference script for OpenEnv Data Cleaning Environment.
+Requirements (submission):
+- File name: inference.py at repo root
+- Uses OpenAI client for all LLM calls
+- Reads credentials/config from environment variables:
+  - OPENAI_API_KEY (preferred) or HF_TOKEN (fallback)
+  - API_BASE_URL (optional; OpenAI-compatible endpoint)
+  - MODEL_NAME (model identifier)
 """
 import os
 import json
 import logging
 import sys
+from typing import Dict, List, Optional, Any, Tuple
 from openai import OpenAI
 from env.environment import DataCleaningEnv
 from env.models import Action, ActionType, Observation, TaskLevel
 from env.tasks import TaskManager
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s")
 logger = logging.getLogger(__name__)
+DEFAULT_TASKS: Tuple[str, ...] = ("easy_001", "medium_001", "hard_001")
 class DataCleaningAgent:
     """
     Agent that uses OpenAI-compatible API to select data cleaning actions
         api_key: Optional[str] = None,
         model: Optional[str] = None,
         api_base_url: Optional[str] = None,
+        temperature: float = 0.0,
+        max_tokens: int = 900
     ):
+        self.api_key = api_key or os.getenv("OPENAI_API_KEY") or os.getenv("HF_TOKEN")
+        self.model = model or os.getenv("MODEL_NAME", "gpt-4.1-mini")
         self.api_base_url = api_base_url or os.getenv("API_BASE_URL")
         if not self.api_key:
+            raise ValueError(
+                "No API key found. Set OPENAI_API_KEY (preferred) or HF_TOKEN (fallback)."
+            )
         client_kwargs = {"api_key": self.api_key}
         if self.api_base_url:
         self.system_prompt = self._build_system_prompt()
     def _build_system_prompt(self) -> str:
         return """You are an expert data cleaning agent. Your task is to analyze data quality issues and select the most appropriate cleaning actions.
 You will receive observations from a data cleaning environment containing:
                     {"role": "user", "content": user_message}
                 ],
                 temperature=self.temperature,
+                max_tokens=self.max_tokens,
             )
             response_text = response.choices[0].message.content
             return action
         except Exception as e:
+            logger.error(f"Model request failed ({e}). Using fallback submit action.")
             return Action(action_type=ActionType.SUBMIT, params={})
     def _format_observation(self, observation: Observation) -> str:
 def main():
     import argparse
     parser = argparse.ArgumentParser(description="Run data cleaning agent inference")
+    parser.add_argument("--task-id", type=str, help="Run a specific task id (e.g. easy_001)")
+    parser.add_argument("--all", action="store_true", help="Run easy_001, medium_001, hard_001")
+    parser.add_argument("--model", type=str, default=None, help="Overrides MODEL_NAME")
+    parser.add_argument("--api-base", type=str, default=None, help="Overrides API_BASE_URL")
+    parser.add_argument("--max-steps", type=int, default=None, help="Override max steps")
+    parser.add_argument("--quiet", action="store_true", help="Suppress verbose output")
+    parser.add_argument("--out", type=str, default="baseline_results.json", help="Output JSON path")
     args = parser.parse_args()
     # Generate datasets if needed
     task_manager = TaskManager()
     task_manager.generate_datasets()
+    if args.all:
+        task_ids = list(DEFAULT_TASKS)
+    elif args.task_id:
+        task_ids = [args.task_id]
     else:
+        task_ids = ["easy_001"]
+    results: Dict[str, Any] = {
+        "model_name": args.model or os.getenv("MODEL_NAME", "gpt-4.1-mini"),
+        "api_base_url": args.api_base or os.getenv("API_BASE_URL"),
+        "tasks": {},
+    }
     try:
+        for task_id in task_ids:
+            result = run_inference(
+                task_id=task_id,
+                model=args.model,
+                api_base_url=args.api_base,
+                verbose=not args.quiet,
+                max_steps=args.max_steps,
+            )
+            results["tasks"][task_id] = {
+                "task_level": result.get("task_level"),
+                "total_steps": result.get("total_steps"),
+                "final_quality": result.get("final_quality"),
+                "issues_remaining": result.get("issues_remaining"),
+                "final_score": (result.get("grade") or {}).get("final_score", 0.0),
+            }
+        with open(args.out, "w") as f:
+            json.dump(results, f, indent=2, default=str)
+        print(json.dumps(results, indent=2, default=str))
+        print(f"\nSaved -> {args.out}")
     except Exception as e:
         logger.error(f"Error running inference: {e}")
         sys.exit(1)
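
Review note: the credential and model lookup order introduced in `DataCleaningAgent.__init__` can be checked in isolation. The sketch below is a minimal, hypothetical helper (`resolve_config` is not part of this commit) that mirrors the order shown in the diff: `OPENAI_API_KEY` first, then `HF_TOKEN`, with `MODEL_NAME` defaulting to `gpt-4.1-mini` and `API_BASE_URL` left optional. It takes the environment as a plain mapping so it can be exercised without touching `os.environ`.

```python
from typing import Dict, Mapping, Optional


def resolve_config(
    env: Mapping[str, str],
    model: Optional[str] = None,
    api_base_url: Optional[str] = None,
) -> Dict[str, Optional[str]]:
    """Hypothetical helper mirroring the lookup order in DataCleaningAgent.__init__."""
    # OPENAI_API_KEY is preferred; HF_TOKEN is consulted only when it is unset or empty.
    api_key = env.get("OPENAI_API_KEY") or env.get("HF_TOKEN")
    if not api_key:
        raise ValueError(
            "No API key found. Set OPENAI_API_KEY (preferred) or HF_TOKEN (fallback)."
        )
    return {
        "api_key": api_key,
        # Explicit arguments (the --model / --api-base flags) win over the environment.
        "model": model or env.get("MODEL_NAME", "gpt-4.1-mini"),
        "api_base_url": api_base_url or env.get("API_BASE_URL"),
    }


if __name__ == "__main__":
    import os

    try:
        print(resolve_config(os.environ)["model"])
    except ValueError as exc:
        print(f"configuration error: {exc}")
```

Passing only `{"HF_TOKEN": "..."}` yields the fallback key with the default model, while an empty mapping raises the same `ValueError` message the agent uses.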

requirements.txt CHANGED

@@ -5,7 +5,7 @@ pydantic>=2.0.0
 # Web framework
 fastapi>=0.100.0
-uvicorn>=0.23.0
 python-multipart>=0.0.6
 # OpenAI API (for baseline agent)

 # Web framework
 fastapi>=0.100.0
+uvicorn[standard]>=0.23.0
 python-multipart>=0.0.6
 # OpenAI API (for baseline agent)

scripts/validate-submission.sh ADDED

	@@ -0,0 +1,62 @@

+#!/usr/bin/env bash
+#
+# validate-submission.sh — lightweight submission validator
+#
+# Usage:
+#   ./scripts/validate-submission.sh [ping_url] [repo_dir]
+#
+# Examples:
+#   ./scripts/validate-submission.sh http://localhost:7860 .
+#   ./scripts/validate-submission.sh https://<your-space>.hf.space .
+#
+set -euo pipefail
+PING_URL="${1:-http://localhost:7860}"
+REPO_DIR="${2:-.}"
+echo "==> Repo dir: ${REPO_DIR}"
+echo "==> Ping URL: ${PING_URL}"
+echo "==> Checking required files..."
+for f in Dockerfile requirements.txt app.py openenv.yaml inference.py README.md; do
+  test -f "${REPO_DIR}/${f}" || { echo "Missing ${f}"; exit 1; }
+done
+echo "==> Docker build..."
+docker build -t datacleanser:validate "${REPO_DIR}"
+echo "==> Docker run..."
+CID="$(docker run -d -p 7860:7860 datacleanser:validate)"
+cleanup() { docker rm -f "${CID}" >/dev/null 2>&1 || true; }
+trap cleanup EXIT
+# Probe PING_URL (default http://localhost:7860) so the first argument takes effect.
+echo "==> Waiting for /health..."
+for i in {1..30}; do
+  if curl -fsS "${PING_URL}/health" >/dev/null; then
+    break
+  fi
+  sleep 1
+done
+echo "==> Probing endpoints..."
+curl -fsS "${PING_URL}/health" | cat
+echo
+curl -fsS "${PING_URL}/tasks" | head -c 400 || true
+echo
+curl -fsS -X POST "${PING_URL}/reset" \
+  -H "Content-Type: application/json" \
+  -d '{"task_id":"easy_001","session_id":"validate"}' | head -c 400 || true
+echo
+curl -fsS "${PING_URL}/state?session_id=validate" | head -c 400 || true
+echo
+echo "==> Optional: openenv validate (if installed)..."
+if command -v openenv >/dev/null 2>&1; then
+  (cd "${REPO_DIR}" && openenv validate) || true
+else
+  echo "openenv CLI not found; skipping openenv validate."
+fi
+echo "==> OK"
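
Review note: the "Waiting for /health" loop in the script retries up to 30 times, one second apart, and relies on the following `curl` to fail if the service never came up. For readers who want the same readiness check from Python, here is a minimal sketch that raises explicitly on timeout. The names `http_ok` and `wait_for_health` are hypothetical and not part of this commit; only the `/health` path and the 30 x 1s retry budget come from the script.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True when a GET to `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False


def wait_for_health(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `probe` until it succeeds; return the attempt number that succeeded."""
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay)  # no pause after the final failed attempt
    raise TimeoutError(f"service not healthy after {attempts} attempts")
```

A call such as `wait_for_health(lambda: http_ok("http://localhost:7860/health"))` would reproduce the script's wait step. Injecting `probe` and `sleep` keeps the retry logic testable without a running container.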