---
title: Data Cleaning OpenEnv
emoji: 🧹
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
---
# 🧹 Data Cleaning OpenEnv
A complete **OpenEnv-compatible reinforcement learning environment** for data cleaning tasks.

**Team:** Soham Sandeep Kamathi, Manas Mahendra Patil, Shivam Jha

**Hackathon:** Meta x PyTorch OpenEnv Hackathon 2026

---
## What This Environment Does
An AI agent receives messy tabular datasets and must apply the correct cleaning operations to earn rewards. Three tasks of increasing difficulty simulate real-world data quality problems.
## Tasks
| Task | Difficulty | What the agent must do | Reward |
|------|-----------|----------------------|--------|
| `remove_nulls` | Easy | Drop rows containing null/missing values | 0.0–1.0 |
| `fix_dates` | Medium | Standardise inconsistent date formats to YYYY-MM-DD | 0.0–1.0 |
| `remove_outliers` | Hard | Remove statistical outliers from the `salary` and `age` columns using the IQR method | 0.0–1.0 |
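
To make the three operations concrete, here is a minimal standard-library sketch of what each task asks for. It is not the environment's own implementation: the helper names, the row-as-dict representation, the list of candidate date formats, and the conventional IQR multiplier of 1.5 are all assumptions made for illustration.

```python
from datetime import datetime
from statistics import quantiles

# Assumed input formats; the formats the environment actually generates are not documented here.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y")


def remove_nulls(rows):
    """Keep only rows in which no value is None or an empty string."""
    return [r for r in rows if all(v not in (None, "") for v in r.values())]


def fix_date(value):
    """Return value as YYYY-MM-DD, or None if no candidate format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(rows, columns=("salary", "age"), k=1.5):
    """Drop rows that fall outside the IQR fences in any of the given columns."""
    # Fences are computed once on the unfiltered data, then applied together.
    bounds = {c: iqr_bounds([r[c] for r in rows], k) for c in columns}
    return [
        r for r in rows
        if all(bounds[c][0] <= r[c] <= bounds[c][1] for c in columns)
    ]
```

Quartile conventions differ between libraries, so a row close to a fence may be classified differently by another implementation (for example, one based on pandas).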
## Observation Space
Each observation is a `DatasetObservation` with the following fields:
- `dataset_preview`: first 5 rows as a string
- `null_count`: number of missing values
- `date_format_errors`: number of non-standard dates
- `outlier_count`: number of outliers detected
- `task_description`: plain-English task description
- `hint`: suggested action
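
The fields above could be modelled roughly as follows. The field names come from the list; the Python types and the `remaining_issues` helper are assumptions for illustration and may differ from the class the environment defines.

```python
from dataclasses import dataclass


@dataclass
class DatasetObservation:
    dataset_preview: str      # first 5 rows rendered as a string
    null_count: int           # number of missing values
    date_format_errors: int   # number of non-standard dates
    outlier_count: int        # number of outliers detected
    task_description: str     # plain-English task description
    hint: str                 # suggested action

    def remaining_issues(self) -> int:
        """Hypothetical helper: total count of issues still reported."""
        return self.null_count + self.date_format_errors + self.outlier_count


# Example values are made up for illustration.
obs = DatasetObservation(
    dataset_preview="name,hire_date,salary\n...",
    null_count=3,
    date_format_errors=0,
    outlier_count=2,
    task_description="Drop rows containing null values",
    hint="remove_nulls",
)
```

An agent could use the three counters to decide which action to try next, for instance by choosing the action whose counter is still non-zero.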
## Action Space
A `CleaningAction` with:
- `task_id`: 1, 2, or 3
- `action_type`: one of `remove_nulls`, `fix_dates`, `remove_outliers`
- `column`: optional column name (e.g. `hire_date`, `salary`, `all`)
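
A minimal sketch of such an action with client-side validation is shown below. The field names and allowed values are taken from the list above; the types, the `to_payload` helper, and the choice to omit an unset `column` from the payload are assumptions.

```python
from dataclasses import asdict, dataclass
from typing import Optional

ACTION_TYPES = ("remove_nulls", "fix_dates", "remove_outliers")


@dataclass
class CleaningAction:
    task_id: int
    action_type: str
    column: Optional[str] = None

    def __post_init__(self):
        # Reject values outside the documented action space before sending anything.
        if self.task_id not in (1, 2, 3):
            raise ValueError(f"task_id must be 1, 2, or 3, got {self.task_id!r}")
        if self.action_type not in ACTION_TYPES:
            raise ValueError(f"unknown action_type: {self.action_type!r}")

    def to_payload(self) -> dict:
        """Hypothetical helper: dict form of the action, without unset fields."""
        return {k: v for k, v in asdict(self).items() if v is not None}


action = CleaningAction(task_id=2, action_type="fix_dates", column="hire_date")
```

Validating locally keeps malformed actions from costing a step in the episode, though the server remains the authority on what it accepts.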
## API Endpoints
| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/reset?task_id=1` | Start new episode |
| `POST` | `/step` | Submit cleaning action |
| `GET` | `/state?task_id=1` | Get session metadata |
| `GET` | `/tasks` | List all tasks |
| `GET` | `/health` | Health check |
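
The endpoints can be exercised from Python with the standard library alone. The sketch below only builds the requests. It assumes the server is reachable at `http://localhost:7860` and that `/step` accepts a JSON body with the `CleaningAction` fields; neither the body format nor the response schema is specified in this README, and `build_request` and `call` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:7860"  # assumed host and port


def build_request(method, path, query=None, body=None):
    """Build a request for one of the endpoints in the table above."""
    url = BASE_URL + path
    if query:
        url += "?" + urlencode(query)
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return Request(url, data=data, headers=headers, method=method)


def call(req):
    """Send the request and decode the JSON response (needs a running server)."""
    with urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


reset_req = build_request("POST", "/reset", query={"task_id": 1})
step_req = build_request(
    "POST",
    "/step",
    # Assumed body shape: the CleaningAction fields as JSON.
    body={"task_id": 1, "action_type": "remove_nulls", "column": "all"},
)
health_req = build_request("GET", "/health")
```

With the server running, `call(reset_req)` followed by `call(step_req)` would run one step of an episode for task 1.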
## Setup & Run
```bash
# Install dependencies
pip install -r requirements.txt
# Run locally
uvicorn app:app --host 0.0.0.0 --port 7860