---
title: Data Validation Pipeline
emoji: 🧹
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
tags:
- openenv
---
# Data Validation Pipeline: OpenEnv Environment
An RL environment for training AI agents to clean and validate structured data. Built on the [OpenEnv](https://github.com/meta-pytorch/OpenEnv) framework for the Meta-PyTorch Hackathon.
## 🌐 Environment Overview
The **Data Validation Pipeline** environment simulates real-world data quality challenges. An agent is presented with a "dirty" dataset containing various errors (missing values, type mismatches, format violations, range errors, and duplicates) and must systematically identify and fix each issue.
### Motivation
Data quality is a critical challenge in every organization. Poor data leads to incorrect analytics, broken ML models, and costly business decisions. This environment trains RL agents to become automated data stewards, capable of:
- Detecting and classifying data errors
- Applying appropriate fixes
- Optimizing their correction strategy for efficiency
## 🎯 Action Space
The agent can take one of seven **discrete action types**; the five `fix_*` actions also take parameters:
| Action Type | Description | Parameters |
|-------------|-------------|------------|
| `fix_missing` | Fill in a missing/empty value | `target_row`, `target_field`, `new_value` |
| `fix_type` | Correct a data type error (e.g., string → float) | `target_row`, `target_field`, `new_value` |
| `fix_range` | Fix an out-of-range value | `target_row`, `target_field`, `new_value` |
| `fix_format` | Fix a format violation (e.g., date format) | `target_row`, `target_field`, `new_value` |
| `fix_duplicate` | Resolve a duplicate entry | `target_row`, `target_field`, `new_value` |
| `validate` | Check current progress | none |
| `skip` | Skip (no action) | none |
### Action JSON Schema
```json
{
"action_type": "fix_missing|fix_type|fix_range|fix_format|fix_duplicate|validate|skip",
"target_field": "column_name",
"target_row": 0,
"new_value": "corrected_value"
}
```
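To build actions programmatically rather than by hand, a small client-side helper can check the action type and required parameters before anything is sent. The sketch below is not the environment's own `env/models.py`; the `Action` class and its validation rules are hypothetical and derived only from the table and schema above. It also assumes the server accepts `validate` and `skip` payloads that omit the three parameters, which this README does not confirm.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

# Action types taken from the Action Space table.
FIX_ACTIONS = {"fix_missing", "fix_type", "fix_range", "fix_format", "fix_duplicate"}
NO_PARAM_ACTIONS = {"validate", "skip"}


@dataclass
class Action:
    """Hypothetical client-side mirror of the action JSON schema."""

    action_type: str
    target_field: Optional[str] = None
    target_row: Optional[int] = None
    new_value: Optional[str] = None

    def __post_init__(self) -> None:
        if self.action_type not in FIX_ACTIONS | NO_PARAM_ACTIONS:
            raise ValueError(f"unknown action_type: {self.action_type!r}")
        if self.action_type in FIX_ACTIONS:
            # Fix actions need all three parameters listed in the table.
            missing = [
                name
                for name in ("target_field", "target_row", "new_value")
                if getattr(self, name) is None
            ]
            if missing:
                raise ValueError(f"{self.action_type} requires {missing}")

    def to_json(self) -> str:
        # Assumption: unset parameters can be omitted for validate/skip.
        return json.dumps({k: v for k, v in asdict(self).items() if v is not None})


print(Action("fix_missing", "email", 1, "bob@example.com").to_json())
print(Action("validate").to_json())
```

Checking `is None` rather than truthiness keeps `target_row=0` valid, which matters because the schema example targets row 0.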
## πŸ‘οΈ Observation Space
Each observation includes:
| Field | Type | Description |
|-------|------|-------------|
| `task_name` | string | Current task identifier |
| `task_description` | string | What needs to be done |
| `dataset` | list[dict] | Current state of the dataset |
| `errors_found` | list[dict] | Remaining errors with details |
| `errors_remaining` | int | Count of unfixed errors |
| `errors_total` | int | Total errors at start |
| `errors_fixed` | int | Successfully fixed errors |
| `step_count` | int | Current step number |
| `max_steps` | int | Step budget |
| `reward` | float | Reward from last action |
| `cumulative_reward` | float | Total reward so far |
| `done` | bool | Episode finished? |
| `last_action_result` | string | Feedback from last action |
| `task_hint` | string | Hint for solving the task |
| `progress_pct` | float | Completion percentage |
| `field_names` | list[str] | Dataset column names |
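The counters in an observation are enough for simple budget planning. The following is a minimal sketch, assuming an observation arrives as a plain dict with the field names from the table and that each correct fix consumes exactly one step; `budget_summary` is a hypothetical helper, not part of the environment.

```python
from typing import Any, Dict


def budget_summary(obs: Dict[str, Any]) -> Dict[str, Any]:
    """Derive simple planning numbers from documented observation fields."""
    steps_left = obs["max_steps"] - obs["step_count"]
    remaining = obs["errors_remaining"]
    return {
        "steps_left": steps_left,
        # Steps the agent can spend on wrong attempts or validate calls.
        "slack": steps_left - remaining,
        # True if one correct fix per step would still finish within budget.
        "can_finish": (not obs["done"]) and remaining <= steps_left,
    }


# Hand-written example observation; the values are illustrative, not server output.
example_obs = {
    "errors_remaining": 2,
    "errors_total": 3,
    "errors_fixed": 1,
    "step_count": 1,
    "max_steps": 10,
    "done": False,
}
print(budget_summary(example_obs))
```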
## 📋 Tasks
### Task 1: Easy - Missing Values (difficulty: ⭐)
- **Dataset**: 5-row employee table
- **Errors**: 3 missing values (empty strings)
- **Max Steps**: 10
- **Strategy**: Find empty fields and fill with correct values
- **Solvable in**: ≤5 steps
### Task 2: Medium - Mixed Errors (difficulty: ⭐⭐)
- **Dataset**: 7-row product inventory
- **Errors**: 6 errors (type, format, missing, range, duplicate)
- **Max Steps**: 15
- **Strategy**: Classify error type, match to correct action
- **Requires**: Type awareness + format rules
### Task 3: Hard - Multi-Constraint (difficulty: ⭐⭐⭐)
- **Dataset**: 10-row customer orders
- **Errors**: 10 interrelated errors across all types
- **Max Steps**: 20
- **Strategy**: Plan error resolution order, handle dependencies
- **Requires**: Domain knowledge + planning
## πŸ—οΈ Setup & Usage
### Docker (Recommended)
```bash
docker build -t data-validation-env .
docker run -p 8000:8000 data-validation-env
```
### Local Development
```bash
pip install -r requirements.txt
uvicorn server:app --host 0.0.0.0 --port 8000
```
### Test Endpoints
```bash
# Health check
curl http://localhost:8000/health
# Reset with easy task
curl -X POST http://localhost:8000/reset \
-H "Content-Type: application/json" \
-d '{"task_name": "easy_missing_values", "seed": 42}'
# Take a step
curl -X POST http://localhost:8000/step \
-H "Content-Type: application/json" \
-d '{"action_type": "fix_missing", "target_field": "email", "target_row": 1, "new_value": "bob@example.com"}'
# Check state
curl http://localhost:8000/state
```
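The same calls can be made from Python using only the standard library. This is a sketch rather than an official client: `build_request`, `call`, and `demo` are hypothetical names, and it assumes every endpoint returns a JSON body, which the curl examples above do not show.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

BASE_URL = "http://localhost:8000"  # matches the port used in the commands above


def build_request(path: str, payload: Optional[Dict[str, Any]] = None) -> urllib.request.Request:
    """Build a GET request (no payload) or a JSON POST request."""
    if payload is None:
        return urllib.request.Request(BASE_URL + path, method="GET")
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(path: str, payload: Optional[Dict[str, Any]] = None) -> Any:
    """Send the request and decode the response, assuming it is JSON."""
    with urllib.request.urlopen(build_request(path, payload), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def demo() -> None:
    """Mirror the curl sequence above; requires a running server."""
    print(call("/health"))
    print(call("/reset", {"task_name": "easy_missing_values", "seed": 42}))
    print(call("/step", {
        "action_type": "fix_missing",
        "target_field": "email",
        "target_row": 1,
        "new_value": "bob@example.com",
    }))
    print(call("/state"))
```

Run `demo()` once the server is up. Keeping request construction separate from sending makes the payloads easy to inspect without a live server.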
### Run Inference Agent
```bash
export HF_TOKEN=your_token_here
export API_BASE_URL=https://api.openai.com/v1
export MODEL_NAME=gpt-4.1-mini
python inference.py
```
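A missing variable is easier to diagnose if it is caught before the agent loop starts. The sketch below shows one way to read the three variables up front; `load_agent_config` is a hypothetical helper, and treating all three as required is an assumption, since this README does not say how `inference.py` handles unset values.

```python
import os
from typing import Dict, Mapping

# Variable names taken from the export commands above.
REQUIRED_VARS = ("HF_TOKEN", "API_BASE_URL", "MODEL_NAME")


def load_agent_config(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the three settings, failing early if any is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        # Assumption: all three are mandatory for the agent to run.
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```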
## 📊 Baseline Performance
| Task | Model | Avg Reward | Steps Used | Success Rate |
|------|-------|-----------|------------|-------------|
| easy_missing_values | gpt-4.1-mini | 0.85 | 4/10 | 90% |
| medium_mixed_errors | gpt-4.1-mini | 0.70 | 9/15 | 75% |
| hard_multi_constraint | gpt-4.1-mini | 0.55 | 15/20 | 50% |
## πŸ† Reward Design
- **Correct fix**: `+1.0 / total_errors` (proportional to error count)
- **Wrong value**: `-0.05` penalty
- **Wrong action type**: `-0.05` penalty
- **Repeated action**: `-0.1` penalty
- **Skip/Validate**: `0.0` (neutral)
The reward design encourages:
1. **Accuracy**: Each correct fix earns an equal share of the positive reward
2. **Efficiency**: Penalties for wrong attempts
3. **Exploration**: No penalty for validation checks
4. **Diversity**: Penalizes repeated identical actions
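The rules above can be written as a small lookup. This is a sketch of the listed values only, not the grader in `env/tasks.py`; the outcome labels (`correct_fix`, `wrong_value`, and so on) are hypothetical names chosen for illustration.

```python
def step_reward(outcome: str, total_errors: int) -> float:
    """Reward for one step, using the values from the Reward Design list."""
    if total_errors <= 0:
        raise ValueError("total_errors must be positive")
    rewards = {
        "correct_fix": 1.0 / total_errors,  # equal share per fixed error
        "wrong_value": -0.05,
        "wrong_action_type": -0.05,
        "repeated_action": -0.1,
        "skip": 0.0,
        "validate": 0.0,
    }
    if outcome not in rewards:
        raise ValueError(f"unknown outcome: {outcome!r}")
    return rewards[outcome]


# Illustrative episode on a 3-error task: two wrong values, then three correct fixes.
episode = ["wrong_value", "wrong_value", "correct_fix", "correct_fix", "correct_fix"]
print(round(sum(step_reward(outcome, 3) for outcome in episode), 2))
```

For that episode the cumulative reward is 3 × 1/3 − 2 × 0.05 = 0.9, so a fully solved task scores below `1.0` only by the penalties collected along the way.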
## πŸ“ Project Structure
```
β”œβ”€β”€ inference.py ← LLM agent loop
β”œβ”€β”€ openenv.yaml ← OpenEnv metadata
β”œβ”€β”€ Dockerfile ← Container config
β”œβ”€β”€ requirements.txt ← Python dependencies
β”œβ”€β”€ server.py ← FastAPI app
β”œβ”€β”€ README.md ← This file
└── env/
β”œβ”€β”€ __init__.py
β”œβ”€β”€ models.py ← Pydantic models
β”œβ”€β”€ tasks.py ← Task registry & graders
└── environment.py ← Core environment
```
## 📜 License
BSD-3-Clause