metadata
title: Data Cleaning OpenEnv
emoji: 🧹
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false

🧹 Data Cleaning OpenEnv

A complete OpenEnv-compatible reinforcement learning environment for data cleaning tasks.

Team: Soham Sandeep Kamathi, Manas Mahendra Patil, Shivam Jha
Hackathon: Meta x PyTorch OpenEnv Hackathon 2026


What This Environment Does

An AI agent receives messy tabular datasets and must apply the correct cleaning operations to earn rewards. Three tasks of increasing difficulty simulate real-world data quality problems.

Tasks

Task | Difficulty | What the agent must do | Reward
remove_nulls | Easy | Drop rows containing null/missing values | 0.0–1.0
fix_dates | Medium | Standardise inconsistent date formats to YYYY-MM-DD | 0.0–1.0
remove_outliers | Hard | Remove statistical outliers via IQR method from salary and age columns | 0.0–1.0
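
The three operations in the table can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the environment's actual implementation: the list-of-dicts row format, the `DATE_FORMATS` candidates, the `k=1.5` multiplier, and the function signatures are all hypothetical.

```python
from datetime import datetime
from statistics import quantiles

# Hypothetical candidate formats; the environment's real list is not documented here.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y")


def remove_nulls(rows):
    """Drop rows in which any value is None or an empty string."""
    return [r for r in rows if all(v is not None and v != "" for v in r.values())]


def fix_date(value):
    """Return value as YYYY-MM-DD, or None if no candidate format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next candidate format
    return None


def remove_outliers(rows, column, k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    values = [r[column] for r in rows]
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [r for r in rows if low <= r[column] <= high]
```

Formats are tried in order, so an ambiguous string such as 01/02/2023 is read as day/month here. A real cleaner would need a documented rule for such cases.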

Observation Space

Each observation returns a DatasetObservation with:

  • dataset_preview — first 5 rows as string
  • null_count — number of missing values
  • date_format_errors — number of non-standard dates
  • outlier_count — number of outliers detected
  • task_description — plain-English task description
  • hint — suggested action
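
As a rough picture of the shape of an observation, the fields above could be modelled as a dataclass. The field names come from the list. The types and the sample values are assumptions made for illustration only.

```python
from dataclasses import dataclass, asdict


@dataclass
class DatasetObservation:
    dataset_preview: str      # first 5 rows rendered as a string
    null_count: int           # number of missing values
    date_format_errors: int   # number of non-standard dates
    outlier_count: int        # number of outliers detected
    task_description: str     # plain-English task description
    hint: str                 # suggested action


# Illustrative values only; not taken from the real environment.
example = DatasetObservation(
    dataset_preview="name,hire_date,salary\nA,2021-01-05,50000",
    null_count=3,
    date_format_errors=0,
    outlier_count=0,
    task_description="Drop rows containing null/missing values",
    hint="remove_nulls",
)

# asdict gives a plain dict that is easy to log or serialise.
example_dict = asdict(example)
```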

Action Space

A CleaningAction with:

  • task_id — 1, 2, or 3
  • action_type — one of remove_nulls, fix_dates, remove_outliers
  • column — optional column name (e.g. hire_date, salary, all)
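
A minimal sketch of an action with basic validation might look like the following. The field names and allowed values come from the list above. The dataclass layout, the validation rules, and the `to_json` helper are hypothetical and may differ from the environment's own `CleaningAction`.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

ALLOWED_ACTIONS = ("remove_nulls", "fix_dates", "remove_outliers")


@dataclass
class CleaningAction:
    task_id: int
    action_type: str
    column: Optional[str] = None  # e.g. "hire_date", "salary", "all"

    def __post_init__(self):
        # Reject values outside the documented action space early.
        if self.task_id not in (1, 2, 3):
            raise ValueError(f"task_id must be 1, 2, or 3, got {self.task_id!r}")
        if self.action_type not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action_type: {self.action_type!r}")

    def to_json(self) -> str:
        """Serialise the action as a JSON object (assumed wire format)."""
        return json.dumps(asdict(self))
```

For example, `CleaningAction(2, "fix_dates", "hire_date").to_json()` produces a JSON object with the three fields, while an unknown `action_type` raises `ValueError` before anything is sent.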

API Endpoints

Method | Path | Description
POST | /reset?task_id=1 | Start new episode
POST | /step | Submit cleaning action
GET | /state?task_id=1 | Get session metadata
GET | /tasks | List all tasks
GET | /health | Health check
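
A client for these endpoints could be sketched with `urllib` as below. The paths and methods come from the table. The base URL, the JSON body sent to /step, and the assumption that responses are JSON are not confirmed by this README.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:7860"  # assumed: matches the local uvicorn port


def build_reset_request(task_id, base_url=BASE_URL):
    """POST /reset?task_id=N starts a new episode."""
    query = urlencode({"task_id": task_id})
    return Request(f"{base_url}/reset?{query}", method="POST")


def build_step_request(action, base_url=BASE_URL):
    """POST /step with the action as a JSON body (assumed payload shape)."""
    body = json.dumps(action).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return Request(f"{base_url}/step", data=body, headers=headers, method="POST")


def send(request):
    """Send a prepared request and decode the response, assuming it is JSON."""
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, one episode would be roughly `send(build_reset_request(1))` followed by `send(build_step_request({"task_id": 1, "action_type": "remove_nulls", "column": "all"}))`.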

Setup & Run

# Install dependencies
pip install -r requirements.txt
# Run locally
uvicorn app:app --host 0.0.0.0 --port 7860