| title: Data Validation Pipeline | |
| emoji: 🧹 | |
| colorFrom: blue | |
| colorTo: green | |
| sdk: docker | |
| app_port: 8000 | |
| tags: | |
| - openenv | |
| # Data Validation Pipeline: OpenEnv Environment | |
| An RL environment for training AI agents to clean and validate structured data. Built on the [OpenEnv](https://github.com/meta-pytorch/OpenEnv) framework for the Meta-PyTorch Hackathon. | |
| ## Environment Overview | |
| The **Data Validation Pipeline** environment simulates real-world data quality challenges. An agent is presented with a "dirty" dataset containing various errors (missing values, type mismatches, format violations, range errors, and duplicates) and must systematically identify and fix each issue. | |
| ### Motivation | |
| Data quality is a critical challenge in every organization. Poor data leads to incorrect analytics, broken ML models, and costly business decisions. This environment trains RL agents to become automated data stewards, capable of: | |
| - Detecting and classifying data errors | |
| - Applying appropriate fixes | |
| - Optimizing their correction strategy for efficiency | |
| ## 🎯 Action Space | |
| The agent chooses one of the following **discrete action types**; the fix actions also take parameters: | |
| | Action Type | Description | Parameters | | |
| |-------------|-------------|------------| | |
| | `fix_missing` | Fill in a missing/empty value | `target_row`, `target_field`, `new_value` | | |
| `fix_type` | Correct a data type error (e.g., string → float) | `target_row`, `target_field`, `new_value` | | |
| | `fix_range` | Fix an out-of-range value | `target_row`, `target_field`, `new_value` | | |
| | `fix_format` | Fix a format violation (e.g., date format) | `target_row`, `target_field`, `new_value` | | |
| | `fix_duplicate` | Resolve a duplicate entry | `target_row`, `target_field`, `new_value` | | |
| `validate` | Check current progress | none | | |
| `skip` | Skip (no action) | none | | |
| ### Action JSON Schema | |
| ```json | |
| { | |
| "action_type": "fix_missing|fix_type|fix_range|fix_format|fix_duplicate|validate|skip", | |
| "target_field": "column_name", | |
| "target_row": 0, | |
| "new_value": "corrected_value" | |
| } | |
| ``` | |
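The schema above can be assembled and sanity-checked on the client before it is sent. The following is a minimal sketch, not the environment's own validation: `make_action` is a hypothetical helper, and it assumes the server accepts a payload containing only `action_type` for `validate` and `skip`.

```python
from typing import Any, Dict, Optional

# Action types taken from the table above.
FIX_ACTIONS = {"fix_missing", "fix_type", "fix_range", "fix_format", "fix_duplicate"}
NO_PARAM_ACTIONS = {"validate", "skip"}


def make_action(
    action_type: str,
    target_row: Optional[int] = None,
    target_field: Optional[str] = None,
    new_value: Any = None,
) -> Dict[str, Any]:
    """Build an action payload (hypothetical helper, not part of the repo)."""
    if action_type in NO_PARAM_ACTIONS:
        # Assumption: parameterless actions may omit the target fields.
        return {"action_type": action_type}
    if action_type not in FIX_ACTIONS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    if target_row is None or target_row < 0 or not target_field:
        raise ValueError("fix actions need a non-negative target_row and a target_field")
    return {
        "action_type": action_type,
        "target_field": target_field,
        "target_row": target_row,
        "new_value": new_value,
    }


action = make_action("fix_missing", target_row=1, target_field="email",
                     new_value="bob@example.com")
```

Rejecting malformed actions locally avoids spending a step from the budget on a request the environment would penalize or refuse.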
| ## Observation Space | |
| Each observation includes: | |
| | Field | Type | Description | | |
| |-------|------|-------------| | |
| | `task_name` | string | Current task identifier | | |
| | `task_description` | string | What needs to be done | | |
| | `dataset` | list[dict] | Current state of the dataset | | |
| | `errors_found` | list[dict] | Remaining errors with details | | |
| | `errors_remaining` | int | Count of unfixed errors | | |
| | `errors_total` | int | Total errors at start | | |
| | `errors_fixed` | int | Successfully fixed errors | | |
| | `step_count` | int | Current step number | | |
| | `max_steps` | int | Step budget | | |
| | `reward` | float | Reward from last action | | |
| | `cumulative_reward` | float | Total reward so far | | |
| | `done` | bool | Episode finished? | | |
| | `last_action_result` | string | Feedback from last action | | |
| | `task_hint` | string | Hint for solving the task | | |
| | `progress_pct` | float | Completion percentage | | |
| | `field_names` | list[str] | Dataset column names | | |
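As an illustration of how an agent loop might consume these fields, the sketch below summarizes an observation and decides whether to keep acting. It assumes the observation arrives as a plain dict keyed by the field names in the table; the helper names and the sample values are hypothetical.

```python
from typing import Any, Dict


def summarize(obs: Dict[str, Any]) -> str:
    """One-line progress summary built only from fields in the table above."""
    steps_left = obs["max_steps"] - obs["step_count"]
    return (
        f"{obs['task_name']}: {obs['errors_fixed']}/{obs['errors_total']} fixed, "
        f"{obs['errors_remaining']} remaining, {steps_left} steps left"
    )


def should_stop(obs: Dict[str, Any]) -> bool:
    """Stop when the episode is done, nothing is left to fix, or the budget is spent."""
    return (
        obs["done"]
        or obs["errors_remaining"] == 0
        or obs["step_count"] >= obs["max_steps"]
    )


# Hypothetical observation for illustration; real values come from the server.
sample_obs = {
    "task_name": "easy_missing_values",
    "errors_total": 3,
    "errors_fixed": 1,
    "errors_remaining": 2,
    "step_count": 2,
    "max_steps": 10,
    "done": False,
}
print(summarize(sample_obs))
```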
| ## Tasks | |
| ### Task 1: Easy - Missing Values (difficulty: ⭐) | |
| - **Dataset**: 5-row employee table | |
| - **Errors**: 3 missing values (empty strings) | |
| - **Max Steps**: 10 | |
| - **Strategy**: Find empty fields and fill with correct values | |
| - **Solvable in**: ≤5 steps | |
| ### Task 2: Medium - Mixed Errors (difficulty: ⭐⭐) | |
| - **Dataset**: 7-row product inventory | |
| - **Errors**: 6 errors (type, format, missing, range, duplicate) | |
| - **Max Steps**: 15 | |
| - **Strategy**: Classify error type, match to correct action | |
| - **Requires**: Type awareness + format rules | |
| ### Task 3: Hard - Multi-Constraint (difficulty: ⭐⭐⭐) | |
| - **Dataset**: 10-row customer orders | |
| - **Errors**: 10 interrelated errors across all types | |
| - **Max Steps**: 20 | |
| - **Strategy**: Plan error resolution order, handle dependencies | |
| - **Requires**: Domain knowledge + planning | |
| ## Setup & Usage | |
| ### Docker (Recommended) | |
| ```bash | |
| docker build -t data-validation-env . | |
| docker run -p 8000:8000 data-validation-env | |
| ``` | |
| ### Local Development | |
| ```bash | |
| pip install -r requirements.txt | |
| uvicorn server:app --host 0.0.0.0 --port 8000 | |
| ``` | |
| ### Test Endpoints | |
| ```bash | |
| # Health check | |
| curl http://localhost:8000/health | |
| # Reset with easy task | |
| curl -X POST http://localhost:8000/reset \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"task_name": "easy_missing_values", "seed": 42}' | |
| # Take a step | |
| curl -X POST http://localhost:8000/step \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"action_type": "fix_missing", "target_field": "email", "target_row": 1, "new_value": "bob@example.com"}' | |
| # Check state | |
| curl http://localhost:8000/state | |
| ``` | |
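The same calls can be made from Python with only the standard library. This is a sketch under assumptions: it reuses the URLs and payloads from the curl examples, and it assumes the endpoints return JSON bodies. `build_post` and `send` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # matches the port used in the curl examples


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Prepare a JSON POST request without sending it."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request) -> dict:
    """Send a prepared request and decode the body (assumed to be JSON)."""
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


reset_req = build_post("/reset", {"task_name": "easy_missing_values", "seed": 42})
step_req = build_post("/step", {
    "action_type": "fix_missing",
    "target_field": "email",
    "target_row": 1,
    "new_value": "bob@example.com",
})

# With the server running:
#   observation = send(reset_req)
#   observation = send(step_req)
```

Separating request construction from sending keeps the payloads inspectable without a live server.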
| ### Run Inference Agent | |
| ```bash | |
| export HF_TOKEN=your_token_here | |
| export API_BASE_URL=https://api.openai.com/v1 | |
| export MODEL_NAME=gpt-4.1-mini | |
| python inference.py | |
| ``` | |
| ## Baseline Performance | |
| | Task | Model | Avg Reward | Steps Used | Success Rate | | |
| |------|-------|-----------|------------|-------------| | |
| | easy_missing_values | gpt-4.1-mini | 0.85 | 4/10 | 90% | | |
| | medium_mixed_errors | gpt-4.1-mini | 0.70 | 9/15 | 75% | | |
| | hard_multi_constraint | gpt-4.1-mini | 0.55 | 15/20 | 50% | | |
| ## Reward Design | |
| - **Correct fix**: `+1.0 / total_errors` (each fix earns an equal share, so fixing every error sums to `1.0`) | |
| - **Wrong value**: `-0.05` penalty | |
| - **Wrong action type**: `-0.05` penalty | |
| - **Repeated action**: `-0.1` penalty | |
| - **Skip/Validate**: `0.0` (neutral) | |
| The reward design encourages: | |
| 1. **Accuracy**: Each correct fix earns an equal share of the positive reward | |
| 2. **Efficiency**: Penalties for wrong attempts | |
| 3. **Exploration**: No penalty for validation checks | |
| 4. **Diversity**: Penalizes repeated identical actions | |
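The reward rules listed above can be written as a small function. This is a sketch of the stated values only, not the environment's grader: the outcome labels are hypothetical names, and it does not model clipping or how the environment chooses between overlapping penalties, since the README does not specify either.

```python
# Penalty values copied from the list above; the keys are hypothetical labels.
PENALTIES = {
    "wrong_value": -0.05,
    "wrong_action_type": -0.05,
    "repeated_action": -0.1,
    "neutral": 0.0,  # skip or validate
}


def step_reward(outcome: str, total_errors: int) -> float:
    """Reward for a single step, following the rules in the list above."""
    if total_errors <= 0:
        raise ValueError("total_errors must be positive")
    if outcome == "correct_fix":
        return 1.0 / total_errors
    return PENALTIES[outcome]


# Illustrative episode on a task with 3 errors:
# fix, wrong value, fix, validate, fix.
episode = ["correct_fix", "wrong_value", "correct_fix", "neutral", "correct_fix"]
cumulative = sum(step_reward(outcome, total_errors=3) for outcome in episode)
print(round(cumulative, 2))
```

Under these rules a flawless episode accumulates exactly `1.0`, and each mistake subtracts a fixed amount regardless of task size.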
| ## Project Structure | |
| ``` | |
| ├── inference.py        # LLM agent loop | |
| ├── openenv.yaml        # OpenEnv metadata | |
| ├── Dockerfile          # Container config | |
| ├── requirements.txt    # Python dependencies | |
| ├── server.py           # FastAPI app | |
| ├── README.md           # This file | |
| └── env/ | |
|     ├── __init__.py | |
|     ├── models.py       # Pydantic models | |
|     ├── tasks.py        # Task registry & graders | |
|     └── environment.py  # Core environment | |
| ``` | |
| ## License | |
| BSD-3-Clause | |