---
title: Data Validation Pipeline
emoji: 🧹
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
tags:
- openenv
---
# Data Validation Pipeline: OpenEnv Environment
An RL environment for training AI agents to clean and validate structured data. Built on the [OpenEnv](https://github.com/meta-pytorch/OpenEnv) framework for the Meta-PyTorch Hackathon.
## Environment Overview
The **Data Validation Pipeline** environment simulates real-world data quality challenges. An agent is presented with a "dirty" dataset containing various errors (missing values, type mismatches, format violations, range errors, and duplicates) and must systematically identify and fix each issue.
### Motivation
Data quality is a critical challenge in every organization. Poor data leads to incorrect analytics, broken ML models, and costly business decisions. This environment trains RL agents to become automated data stewards, capable of:
- Detecting and classifying data errors
- Applying appropriate fixes
- Optimizing their correction strategy for efficiency
## 🎯 Action Space
The agent chooses one of seven **discrete action types**; the five `fix_*` types also take parameters:
| Action Type | Description | Parameters |
|-------------|-------------|------------|
| `fix_missing` | Fill in a missing/empty value | `target_row`, `target_field`, `new_value` |
| `fix_type` | Correct a data type error (e.g., string → float) | `target_row`, `target_field`, `new_value` |
| `fix_range` | Fix an out-of-range value | `target_row`, `target_field`, `new_value` |
| `fix_format` | Fix a format violation (e.g., date format) | `target_row`, `target_field`, `new_value` |
| `fix_duplicate` | Resolve a duplicate entry | `target_row`, `target_field`, `new_value` |
| `validate` | Check current progress | none |
| `skip` | Skip (no action) | none |
### Action JSON Schema
```json
{
  "action_type": "fix_missing|fix_type|fix_range|fix_format|fix_duplicate|validate|skip",
  "target_field": "column_name",
  "target_row": 0,
  "new_value": "corrected_value"
}
```
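A client can check an action against this schema before sending it. The following is a minimal sketch, not the environment's own model: the `Action` dataclass, its validation rules, and the choice to omit unset keys are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Optional

# Action types taken from the table above.
FIX_ACTIONS = {"fix_missing", "fix_type", "fix_range", "fix_format", "fix_duplicate"}
NO_PARAM_ACTIONS = {"validate", "skip"}


@dataclass
class Action:
    """Hypothetical client-side mirror of the action schema."""

    action_type: str
    target_field: Optional[str] = None
    target_row: Optional[int] = None
    new_value: Any = None

    def __post_init__(self) -> None:
        if self.action_type not in FIX_ACTIONS | NO_PARAM_ACTIONS:
            raise ValueError(f"unknown action_type: {self.action_type!r}")
        # Assumption: every fix action needs all three parameters from the table.
        if self.action_type in FIX_ACTIONS:
            missing = [
                name
                for name in ("target_field", "target_row", "new_value")
                if getattr(self, name) is None
            ]
            if missing:
                raise ValueError(f"{self.action_type} is missing: {missing}")

    def to_json(self) -> str:
        # Assumption: unset keys are dropped, so validate/skip carry only action_type.
        payload = {k: v for k, v in asdict(self).items() if v is not None}
        return json.dumps(payload)


print(Action("fix_missing", "email", 1, "bob@example.com").to_json())
print(Action("validate").to_json())
```

Whether the server accepts a payload without the three target keys for `validate` and `skip` is not stated here, so treat the omission as a guess to verify against `env/models.py`.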
## Observation Space
Each observation includes:
| Field | Type | Description |
|-------|------|-------------|
| `task_name` | string | Current task identifier |
| `task_description` | string | What needs to be done |
| `dataset` | list[dict] | Current state of the dataset |
| `errors_found` | list[dict] | Remaining errors with details |
| `errors_remaining` | int | Count of unfixed errors |
| `errors_total` | int | Total errors at start |
| `errors_fixed` | int | Successfully fixed errors |
| `step_count` | int | Current step number |
| `max_steps` | int | Step budget |
| `reward` | float | Reward from last action |
| `cumulative_reward` | float | Total reward so far |
| `done` | bool | Episode finished? |
| `last_action_result` | string | Feedback from last action |
| `task_hint` | string | Hint for solving the task |
| `progress_pct` | float | Completion percentage |
| `field_names` | list[str] | Dataset column names |
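An agent can derive a few useful quantities from these fields alone. The sketch below uses only field names from the table; the helper names are hypothetical, and it assumes that `step_count` counts steps already taken and that one step fixes at most one error. Whether the server computes `progress_pct` the same way as `fixed_share_pct` is also an assumption.

```python
from typing import Any, Mapping


def steps_left(obs: Mapping[str, Any]) -> int:
    """Steps remaining in the budget (assumes step_count counts steps already taken)."""
    return max(obs["max_steps"] - obs["step_count"], 0)


def still_solvable(obs: Mapping[str, Any]) -> bool:
    """True if every remaining error could still be fixed within the budget.

    Assumption: each step repairs at most one error.
    """
    if obs["errors_remaining"] == 0:
        return True
    return not obs["done"] and obs["errors_remaining"] <= steps_left(obs)


def fixed_share_pct(obs: Mapping[str, Any]) -> float:
    """Share of the initial errors fixed so far, as a percentage."""
    if obs["errors_total"] == 0:
        return 100.0
    return 100.0 * obs["errors_fixed"] / obs["errors_total"]


# Illustrative observation fragment; the values are invented for this example.
obs = {
    "errors_total": 3,
    "errors_fixed": 1,
    "errors_remaining": 2,
    "step_count": 4,
    "max_steps": 10,
    "done": False,
}
print(steps_left(obs), still_solvable(obs), round(fixed_share_pct(obs), 1))
```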
## Tasks
### Task 1: Easy - Missing Values (difficulty: ⭐)
- **Dataset**: 5-row employee table
- **Errors**: 3 missing values (empty strings)
- **Max Steps**: 10
- **Strategy**: Find empty fields and fill with correct values
- **Solvable in**: ≤5 steps
### Task 2: Medium - Mixed Errors (difficulty: ⭐⭐)
- **Dataset**: 7-row product inventory
- **Errors**: 6 errors (type, format, missing, range, duplicate)
- **Max Steps**: 15
- **Strategy**: Classify error type, match to correct action
- **Requires**: Type awareness + format rules
### Task 3: Hard - Multi-Constraint (difficulty: ⭐⭐⭐)
- **Dataset**: 10-row customer orders
- **Errors**: 10 interrelated errors across all types
- **Max Steps**: 20
- **Strategy**: Plan error resolution order, handle dependencies
- **Requires**: Domain knowledge + planning
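The three task budgets can be compared side by side. This is an illustrative summary only: the `TaskSpec` type and `spare_steps` helper are hypothetical and do not reflect how `env/tasks.py` stores its registry. The task identifiers are the ones used in the endpoint and baseline examples in this README, matched to the tasks by name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical summary of one task, using the numbers listed above."""

    name: str
    rows: int
    errors: int
    max_steps: int


TASKS = {
    spec.name: spec
    for spec in (
        TaskSpec("easy_missing_values", rows=5, errors=3, max_steps=10),
        TaskSpec("medium_mixed_errors", rows=7, errors=6, max_steps=15),
        TaskSpec("hard_multi_constraint", rows=10, errors=10, max_steps=20),
    )
}


def spare_steps(spec: TaskSpec) -> int:
    """Steps left over if every error is fixed on the first attempt."""
    return spec.max_steps - spec.errors


for spec in TASKS.values():
    print(f"{spec.name}: {spare_steps(spec)} spare steps")
```

All three tasks leave room for several wrong attempts or `validate` calls, though the hard task has the smallest margin relative to its error count.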
## Setup & Usage
### Docker (Recommended)
```bash
docker build -t data-validation-env .
docker run -p 8000:8000 data-validation-env
```
### Local Development
```bash
pip install -r requirements.txt
uvicorn server:app --host 0.0.0.0 --port 8000
```
### Test Endpoints
```bash
# Health check
curl http://localhost:8000/health
# Reset with easy task
curl -X POST http://localhost:8000/reset \
-H "Content-Type: application/json" \
-d '{"task_name": "easy_missing_values", "seed": 42}'
# Take a step
curl -X POST http://localhost:8000/step \
-H "Content-Type: application/json" \
-d '{"action_type": "fix_missing", "target_field": "email", "target_row": 1, "new_value": "bob@example.com"}'
# Check state
curl http://localhost:8000/state
```
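The same calls can be made from Python with only the standard library. This is a sketch under assumptions: it reuses the URLs and payloads from the `curl` commands above and assumes the endpoints return JSON bodies. The helper names `build_post`, `call`, and `run_demo` are made up for this example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # matches app_port in the Space config


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(req: urllib.request.Request) -> dict:
    """Send a request and decode the body (assumes the server replies with JSON)."""
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


reset_req = build_post("/reset", {"task_name": "easy_missing_values", "seed": 42})
step_req = build_post(
    "/step",
    {
        "action_type": "fix_missing",
        "target_field": "email",
        "target_row": 1,
        "new_value": "bob@example.com",
    },
)


def run_demo() -> None:
    """Reset the easy task, take one step, and print both responses."""
    print(call(reset_req))
    print(call(step_req))
```

Call `run_demo()` once the server is running; nothing is sent over the network until then.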
### Run Inference Agent
```bash
export HF_TOKEN=your_token_here
export API_BASE_URL=https://api.openai.com/v1
export MODEL_NAME=gpt-4.1-mini
python inference.py
```
## Baseline Performance
| Task | Model | Avg Reward | Steps Used | Success Rate |
|------|-------|-----------|------------|-------------|
| easy_missing_values | gpt-4.1-mini | 0.85 | 4/10 | 90% |
| medium_mixed_errors | gpt-4.1-mini | 0.70 | 9/15 | 75% |
| hard_multi_constraint | gpt-4.1-mini | 0.55 | 15/20 | 50% |
## Reward Design
- **Correct fix**: `+1.0 / total_errors` (each fix earns an equal share, so fixing every error sums to `+1.0` before penalties)
- **Wrong value**: `-0.05` penalty
- **Wrong action type**: `-0.05` penalty
- **Repeated action**: `-0.1` penalty
- **Skip/Validate**: `0.0` (neutral)
The reward design encourages:
1. **Accuracy**: Correct fixes get proportional positive reward
2. **Efficiency**: Penalties for wrong attempts
3. **Exploration**: No penalty for validation checks
4. **Diversity**: Penalizes repeated identical actions
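The reward rules above can be written as a small lookup. This is a sketch of the listed values only, not the grader in `env/tasks.py`: the outcome labels are hypothetical, and how the real grader resolves overlapping cases (for example, a repeated action that is also wrong) is not specified here.

```python
def step_reward(outcome: str, total_errors: int) -> float:
    """Reward for one step, following the values listed above.

    `outcome` is a hypothetical label for how the step was graded.
    """
    if outcome == "correct_fix":
        return 1.0 / total_errors
    if outcome in ("wrong_value", "wrong_action_type"):
        return -0.05
    if outcome == "repeated_action":
        return -0.1
    if outcome in ("skip", "validate"):
        return 0.0
    raise ValueError(f"unknown outcome: {outcome!r}")


# Easy task (3 errors): one wrong value, one validate call, then three correct fixes.
episode = ["wrong_value", "validate", "correct_fix", "correct_fix", "correct_fix"]
total = sum(step_reward(outcome, total_errors=3) for outcome in episode)
print(round(total, 2))
```

In this example the three correct fixes contribute `+1.0` in total and the single wrong value costs `0.05`, so the episode ends at about `0.95`.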
## Project Structure
```
├── inference.py        # LLM agent loop
├── openenv.yaml        # OpenEnv metadata
├── Dockerfile          # Container config
├── requirements.txt    # Python dependencies
├── server.py           # FastAPI app
├── README.md           # This file
└── env/
    ├── __init__.py
    ├── models.py       # Pydantic models
    ├── tasks.py        # Task registry & graders
    └── environment.py  # Core environment
```
## License
BSD-3-Clause