Spaces:
Sleeping
title: Data Validation Pipeline
emoji: π§Ή
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
tags:
- openenv
Data Validation Pipeline β OpenEnv Environment
An RL environment for training AI agents to clean and validate structured data. Built on the OpenEnv framework for the Meta-PyTorch Hackathon.
π Environment Overview
The Data Validation Pipeline environment simulates real-world data quality challenges. An agent is presented with a "dirty" dataset containing various errors β missing values, type mismatches, format violations, range errors, and duplicates β and must systematically identify and fix each issue.
Motivation
Data quality is a critical challenge in every organization. Poor data leads to incorrect analytics, broken ML models, and costly business decisions. This environment trains RL agents to become automated data stewards, capable of:
- Detecting and classifying data errors
- Applying appropriate fixes
- Optimizing their correction strategy for efficiency
π― Action Space
The agent can take the following discrete actions:
| Action Type | Description | Parameters |
|---|---|---|
fix_missing |
Fill in a missing/empty value | target_row, target_field, new_value |
fix_type |
Correct a data type error (e.g., string β float) | target_row, target_field, new_value |
fix_range |
Fix an out-of-range value | target_row, target_field, new_value |
fix_format |
Fix a format violation (e.g., date format) | target_row, target_field, new_value |
fix_duplicate |
Resolve a duplicate entry | target_row, target_field, new_value |
validate |
Check current progress | β |
skip |
Skip (no action) | β |
Action JSON Schema
{
"action_type": "fix_missing|fix_type|fix_range|fix_format|fix_duplicate|validate|skip",
"target_field": "column_name",
"target_row": 0,
"new_value": "corrected_value"
}
ποΈ Observation Space
Each observation includes:
| Field | Type | Description |
|---|---|---|
task_name |
string | Current task identifier |
task_description |
string | What needs to be done |
dataset |
list[dict] | Current state of the dataset |
errors_found |
list[dict] | Remaining errors with details |
errors_remaining |
int | Count of unfixed errors |
errors_total |
int | Total errors at start |
errors_fixed |
int | Successfully fixed errors |
step_count |
int | Current step number |
max_steps |
int | Step budget |
reward |
float | Reward from last action |
cumulative_reward |
float | Total reward so far |
done |
bool | Episode finished? |
last_action_result |
string | Feedback from last action |
task_hint |
string | Hint for solving the task |
progress_pct |
float | Completion percentage |
field_names |
list[str] | Dataset column names |
π Tasks
Task 1: Easy β Missing Values (difficulty: β)
- Dataset: 5-row employee table
- Errors: 3 missing values (empty strings)
- Max Steps: 10
- Strategy: Find empty fields and fill with correct values
- Solvable in: β€5 steps
Task 2: Medium β Mixed Errors (difficulty: ββ)
- Dataset: 7-row product inventory
- Errors: 6 errors (type, format, missing, range, duplicate)
- Max Steps: 15
- Strategy: Classify error type, match to correct action
- Requires: Type awareness + format rules
Task 3: Hard β Multi-Constraint (difficulty: βββ)
- Dataset: 10-row customer orders
- Errors: 10 interrelated errors across all types
- Max Steps: 20
- Strategy: Plan error resolution order, handle dependencies
- Requires: Domain knowledge + planning
ποΈ Setup & Usage
Docker (Recommended)
docker build -t data-validation-env .
docker run -p 8000:8000 data-validation-env
Local Development
pip install -r requirements.txt
uvicorn server:app --host 0.0.0.0 --port 8000
Test Endpoints
# Health check
curl http://localhost:8000/health
# Reset with easy task
curl -X POST http://localhost:8000/reset \
-H "Content-Type: application/json" \
-d '{"task_name": "easy_missing_values", "seed": 42}'
# Take a step
curl -X POST http://localhost:8000/step \
-H "Content-Type: application/json" \
-d '{"action_type": "fix_missing", "target_field": "email", "target_row": 1, "new_value": "bob@example.com"}'
# Check state
curl http://localhost:8000/state
Run Inference Agent
export HF_TOKEN=your_token_here
export API_BASE_URL=https://api.openai.com/v1
export MODEL_NAME=gpt-4.1-mini
python inference.py
π Baseline Performance
| Task | Model | Avg Reward | Steps Used | Success Rate |
|---|---|---|---|---|
| easy_missing_values | gpt-4.1-mini | 0.85 | 4/10 | 90% |
| medium_mixed_errors | gpt-4.1-mini | 0.70 | 9/15 | 75% |
| hard_multi_constraint | gpt-4.1-mini | 0.55 | 15/20 | 50% |
π Reward Design
- Correct fix:
+1.0 / total_errors(proportional to error count) - Wrong value:
-0.05penalty - Wrong action type:
-0.05penalty - Repeated action:
-0.1penalty - Skip/Validate:
0.0(neutral)
The reward design encourages:
- Accuracy: Correct fixes get proportional positive reward
- Efficiency: Penalties for wrong attempts
- Exploration: No penalty for validation checks
- Diversity: Penalizes repeated identical actions
π Project Structure
βββ inference.py β LLM agent loop
βββ openenv.yaml β OpenEnv metadata
βββ Dockerfile β Container config
βββ requirements.txt β Python dependencies
βββ server.py β FastAPI app
βββ README.md β This file
βββ env/
βββ __init__.py
βββ models.py β Pydantic models
βββ tasks.py β Task registry & graders
βββ environment.py β Core environment
π License
BSD-3-Clause