---
title: Data Cleaning OpenEnv
emoji: 🧹
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
---

# 🧹 Data Cleaning OpenEnv

A complete **OpenEnv-compatible reinforcement learning environment** for data cleaning tasks.

**Team:** Soham Sandeep Kamathi, Manas Mahendra Patil, Shivam Jha
**Hackathon:** Meta x PyTorch OpenEnv Hackathon 2026

---

## What This Environment Does

An AI agent receives messy tabular datasets and must apply the correct cleaning operations to earn rewards. Three tasks of increasing difficulty simulate real-world data quality problems.

## Tasks

| Task | Difficulty | What the agent must do | Reward |
|------|-----------|----------------------|--------|
| `remove_nulls` | Easy | Drop rows containing null/missing values | 0.0–1.0 |
| `fix_dates` | Medium | Standardise inconsistent date formats to YYYY-MM-DD | 0.0–1.0 |
| `remove_outliers` | Hard | Remove statistical outliers from the salary and age columns using the IQR method | 0.0–1.0 |

## Observation Space

Each observation returns a `DatasetObservation` with:

- `dataset_preview`: first 5 rows as a string
- `null_count`: number of missing values
- `date_format_errors`: number of non-standard dates
- `outlier_count`: number of outliers detected
- `task_description`: plain-English task description
- `hint`: suggested action

## Action Space

A `CleaningAction` with:

- `task_id`: 1, 2, or 3
- `action_type`: one of `remove_nulls`, `fix_dates`, `remove_outliers`
- `column`: optional column name (e.g. `hire_date`, `salary`, `all`)

## API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/reset?task_id=1` | Start a new episode |
| `POST` | `/step` | Submit a cleaning action |
| `GET` | `/state?task_id=1` | Get session metadata |
| `GET` | `/tasks` | List all tasks |
| `GET` | `/health` | Health check |

## Setup & Run

```bash
# Install dependencies
pip install -r requirements.txt

# Run locally
uvicorn app:app --host 0.0.0.0 --port 7860
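
# Hedged sketch (not part of the original project): calling the endpoints
# listed above with curl from a second terminal once the server is running.
# Assumptions: the server is reachable at localhost:7860, and /step accepts a
# JSON body whose fields mirror CleaningAction (task_id, action_type, column).
# The helper names reset_episode and submit_step are hypothetical.
BASE_URL="http://localhost:7860"
STEP_PAYLOAD='{"task_id": 1, "action_type": "remove_nulls", "column": "all"}'

# Start a new episode for task 1
reset_episode() { curl -s -X POST "$BASE_URL/reset?task_id=1"; }

# Submit one cleaning action for the current episode
submit_step() {
  curl -s -X POST "$BASE_URL/step" \
    -H "Content-Type: application/json" \
    -d "$STEP_PAYLOAD"
}

# Usage, once the server is up: reset_episode && submit_step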