title: DataCentric-Env
emoji: 🧹
colorFrom: purple
colorTo: indigo
sdk: docker
pinned: false

DataCentric-Env

An RL environment that trains an LLM to act as a data engineer.

The agent receives a real, messy tabular dataset and a frozen classifier it cannot touch. Its only job: fix the data until the classifier hits the accuracy target, which is measured against published academic benchmarks.


The Problem This Solves

Most RL environments for LLMs test reasoning on synthetic puzzles. Real data engineering requires domain reasoning: knowing that Glucose=0 is medically impossible, that capital-gain needs a log transform, and that removing 30% of rows will hurt generalization even if it improves cross-validation accuracy.

This environment forces the agent to develop that domain knowledge by grounding rewards in published accuracy benchmarks on real UCI datasets.


Live Demo

Environment server: https://huggingface.co/spaces/Aswini-Kumar/datacentric-env

  • GET /docs: Interactive Swagger UI
  • GET /health: Status + active sessions
  • POST /reset: Start a new episode

The 5 Real Datasets

| Dataset | Domain | Published Baseline | Key Issues |
|---|---|---|---|
| UCI Adult Census | Income prediction | 87.1% | 14% missing (encoded as ?), capital-gain 97% zero, education/education-num redundant |
| Pima Indians Diabetes | Medical diagnosis | 77.0% | Glucose=0, BloodPressure=0, BMI=0 are medically impossible (zeros = missing) |
| Wisconsin Breast Cancer | Medical imaging | 97.3% | Correlated feature groups, outliers represent real rare tumors |
| German Credit Risk | Credit risk | 76.8% | Mixed categorical + numeric, 70/30 imbalance |
| Cleveland Heart Disease | Medical diagnosis | 85.5% | 303 rows, real missing values in ca and thal |

Datasets download automatically on first run and are cached locally. The server pre-loads all 5 at startup via a background thread.


Architecture

POST /reset  →  Load real dataset  →  80/20 train/holdout split
               Agent sees train set (domain + known issues)
               Holdout is FROZEN: agent never sees or modifies it

POST /step   →  Query a specialist agent
               Agent reads recommendations (domain-informed)
               Agent applies the best recommendation

               Score = accuracy on FROZEN holdout
               Compared against published benchmark
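
The split-and-freeze flow above can be sketched with the standard library alone. This is only an illustration: `split_train_holdout` and `holdout_accuracy` are hypothetical names, not the functions in server/evaluator.py, and the toy threshold rule stands in for the frozen classifier.

```python
import random
from typing import Callable, List, Sequence, Tuple

Row = Tuple[List[float], int]  # (features, label)

def split_train_holdout(rows: Sequence[Row], holdout_frac: float = 0.2, seed: int = 0):
    """Shuffle deterministically, then carve off a holdout the agent never edits."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_holdout = int(round(len(rows) * holdout_frac))
    # A tuple signals that the holdout is read-only for the rest of the episode.
    holdout = tuple(rows[i] for i in indices[:n_holdout])
    train = [rows[i] for i in indices[n_holdout:]]
    return train, holdout

def holdout_accuracy(predict: Callable[[List[float]], int], holdout: Sequence[Row]) -> float:
    """Fraction of holdout rows whose predicted label matches the true label."""
    correct = sum(1 for features, label in holdout if predict(features) == label)
    return correct / len(holdout)

# Toy data: one feature, label is 1 when the feature is at least 50.
rows = [([float(i)], int(i >= 50)) for i in range(100)]
train, holdout = split_train_holdout(rows, seed=42)

# Stand-in for the frozen classifier (an assumption for this sketch).
accuracy = holdout_accuracy(lambda x: int(x[0] >= 50), holdout)
```

Only `train` would ever be handed to the agent; every score is computed from `holdout`, so edits that merely overfit the training rows do not raise the reward.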

5 Specialist Agents

| Agent | Action | What it does |
|---|---|---|
| CleanerAgent | query_cleaner | Missing values + zero-as-missing (domain-aware) + log-transform for skewed features |
| AugmenterAgent | query_augmenter | SMOTE-like interpolation to synthesize minority class rows |
| BalancerAgent | query_balancer | Oversample/undersample with explicit tradeoff explanation |
| ValidatorAgent | query_validator (cost 2) | Duplicates + outlier clipping (conservative 5x IQR for medical domains) |
| AnalystAgent | query_analyst (cost 2) | Holistic diagnosis + prioritized action plan + published baseline reference |

What's Domain-Aware

The CleanerAgent knows:

  • In medical_diagnosis datasets: zeros in physiological measurements are impossible, so they are treated as missing values → zero_to_nan_impute
  • In income_prediction datasets: capital-gain has 97% zeros with heavy right skew → log1p transform
  • Redundant features (e.g. education + education-num) → recommend dropping one
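
The first two rules can be illustrated with a short standard-library sketch. The README names the operations (zero_to_nan_impute, log1p) but not the imputation strategy, so the median fill below is an assumption, and the sample values are made up for the example.

```python
import math
from statistics import median

def zero_to_nan_impute(values):
    """Treat zeros as missing and fill them with the median of the non-zero values.

    Median imputation is an assumption; the server may use a different strategy.
    Raises statistics.StatisticsError if every value is zero.
    """
    observed = [v for v in values if v != 0]
    fill = median(observed)
    return [fill if v == 0 else v for v in values]

def log1p_transform(values):
    """Compress a heavy right tail; log1p(0) == 0, so the many zeros stay at zero."""
    return [math.log1p(v) for v in values]

# Illustrative values only, not rows from the real datasets.
glucose = [148, 85, 0, 89, 137, 0]
imputed = zero_to_nan_impute(glucose)

capital_gain = [0, 0, 0, 2174, 0, 14084]
transformed = log1p_transform(capital_gain)
```

log1p is a natural fit for a column like capital-gain because it is defined at zero, unlike a plain logarithm.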

The ValidatorAgent knows:

  • In medical domains, use 5x IQR instead of 3x β€” outliers may be real rare conditions
  • In credit/income domains, use standard 3x IQR
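
A minimal sketch of domain-dependent outlier clipping, assuming the multiplier is applied to the usual Q1/Q3 fences. The function name and the "credit_risk" key are hypothetical; only the 5x and 3x multipliers come from the text above.

```python
from statistics import quantiles

# 5x for medical data, 3x otherwise. Domain keys other than
# medical_diagnosis and income_prediction are assumptions.
IQR_MULTIPLIER = {
    "medical_diagnosis": 5.0,
    "income_prediction": 3.0,
    "credit_risk": 3.0,
}

def clip_outliers(values, domain):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR], with k chosen by domain."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    k = IQR_MULTIPLIER.get(domain, 3.0)
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 25]
clipped_credit = clip_outliers(values, "credit_risk")
clipped_medical = clip_outliers(values, "medical_diagnosis")
```

With these numbers the value 25 falls outside the 3x fence and is clipped in the credit case, but it sits inside the wider 5x fence and is kept unchanged in the medical case.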

Reward Structure

All rewards lie strictly in (0.001, 0.999). Every /step response returns a full decomposition:

| Grader | Weight | What it measures |
|---|---|---|
| Format | 15% | Valid action with required fields |
| Accuracy | 35% | Progress toward target on frozen holdout |
| Quality | 20% | Missing% reduction + class balance improvement |
| Efficiency | 15% | Penalizes wasted steps and low-budget expensive queries |
| Completion | 15% | Bonus for hitting target, scaled by remaining budget |
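
One plausible way to combine the five graders is a weighted sum followed by a clamp. The sketch below assumes each component score is already in [0, 1]; `combine_reward` is a hypothetical name, and how server/reward.py actually handles the interval endpoints is not stated in this README.

```python
# Weights taken from the grader table above.
WEIGHTS = {
    "format": 0.15,
    "accuracy": 0.35,
    "quality": 0.20,
    "efficiency": 0.15,
    "completion": 0.15,
}

REWARD_MIN, REWARD_MAX = 0.001, 0.999

def combine_reward(components):
    """Weighted sum of per-grader scores, clamped to the documented range."""
    total = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return min(max(total, REWARD_MIN), REWARD_MAX)

# Hypothetical component scores for a single step.
example = {"format": 1.0, "accuracy": 0.5, "quality": 0.5, "efficiency": 1.0, "completion": 0.0}
reward = combine_reward(example)
```

The clamp keeps a perfect or a fully failed step from producing exactly 0 or 1.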

New in v0.5

Rollback Action

{"action": "rollback", "session_id": "..."}

Undoes the most recent apply. At most 3 rollbacks are allowed per episode, and each costs 1 budget. Real data engineers do this.
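
Rollback semantics like these can be modelled with a snapshot stack. The class below is a hypothetical sketch, not the code in server/environment.py; in particular, it does not charge any budget for apply because that cost is not stated here.

```python
import copy

class RollbackBuffer:
    """Keeps a snapshot before each apply so the last one can be undone."""

    MAX_ROLLBACKS = 3   # per episode, as stated above
    ROLLBACK_COST = 1   # budget charged per rollback, as stated above

    def __init__(self, data, budget):
        self.data = data
        self.budget = budget
        self.rollbacks_used = 0
        self._snapshots = []

    def apply(self, transform):
        # Snapshot first so the transform can be undone later.
        self._snapshots.append(copy.deepcopy(self.data))
        self.data = transform(self.data)

    def rollback(self):
        """Restore the state before the last apply; return False if not allowed."""
        if not self._snapshots:
            return False
        if self.rollbacks_used >= self.MAX_ROLLBACKS or self.budget < self.ROLLBACK_COST:
            return False
        self.data = self._snapshots.pop()
        self.rollbacks_used += 1
        self.budget -= self.ROLLBACK_COST
        return True

session = RollbackBuffer(data=[1, 2, 3], budget=10)
session.apply(lambda rows: rows[:1])   # an over-aggressive row drop
undone = session.rollback()            # restores [1, 2, 3]
```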

Episode Reasoning Trace

Every observation includes the last 5 steps with effects:

"episode_trace": [
  {"step": 2, "type": "apply", "accuracy_delta": 0.031, "effect": "improved"},
  {"step": 3, "type": "apply", "accuracy_delta": -0.018, "effect": "hurt"}
]
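
A bounded trace like this one maps naturally onto `collections.deque` with `maxlen`. In the sketch below, `record_step` and `classify_effect` are hypothetical helpers, and the "neutral" label for a zero delta is an assumption, since only "improved" and "hurt" appear in the example.

```python
from collections import deque

TRACE_LENGTH = 5  # the observation includes the last 5 steps

def classify_effect(accuracy_delta):
    if accuracy_delta > 0:
        return "improved"
    if accuracy_delta < 0:
        return "hurt"
    return "neutral"  # assumed label for a zero delta

def record_step(trace, step, action_type, accuracy_delta):
    """Append one entry; the deque silently drops the oldest beyond maxlen."""
    trace.append({
        "step": step,
        "type": action_type,
        "accuracy_delta": accuracy_delta,
        "effect": classify_effect(accuracy_delta),
    })

episode_trace = deque(maxlen=TRACE_LENGTH)
deltas = [0.0, 0.031, -0.018, 0.004, 0.0, 0.012, -0.002]  # made-up values
for step, delta in enumerate(deltas, start=1):
    record_step(episode_trace, step, "apply", delta)
```

After seven recorded steps only steps 3 through 7 remain in `episode_trace`.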

Feature Importance

Returned after every apply. Values are LogisticRegression coefficients fitted on StandardScaler-scaled features:

"feature_importance": {
  "top_positive": [{"feature": "Glucose", "coef": 0.84}],
  "top_negative": [{"feature": "BMI_raw", "coef": -0.32}]
}
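
Given already-fitted coefficients, a payload of this shape could be assembled as follows. This sketch covers only the ranking step, not the model fit; `summarize_coefficients` and `top_k` are hypothetical, and the Age and Insulin values are made up for the example.

```python
def summarize_coefficients(coefs, top_k=3):
    """Split signed coefficients into the strongest positive and negative features."""
    ranked = sorted(coefs.items(), key=lambda item: item[1], reverse=True)
    positive = [{"feature": f, "coef": round(c, 2)} for f, c in ranked if c > 0]
    # Walk the ranking backwards so the most negative coefficient comes first.
    negative = [{"feature": f, "coef": round(c, 2)} for f, c in reversed(ranked) if c < 0]
    return {"top_positive": positive[:top_k], "top_negative": negative[:top_k]}

# Glucose and BMI_raw match the example above; the others are illustrative.
coefs = {"Glucose": 0.84, "BMI_raw": -0.32, "Age": 0.21, "Insulin": -0.05}
feature_importance = summarize_coefficients(coefs, top_k=1)
```

Because the features are standardized before fitting, coefficient magnitudes are comparable across columns, which is what makes this ranking meaningful.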

Regression Explanation

Returned when accuracy drops after an apply:

"regression_explanation": {
  "likely_cause": "large_augmentation_overfitting",
  "suggestion": "Synthetic rows do not generalise to holdout. Try undersample_majority or rollback."
}

Benchmark Comparison

"benchmarks": {
  "majority_class_baseline": 0.6510,
  "starting_accuracy": 0.8095,
  "improvement_over_start": 0.0231,
  "published_baseline": 0.8710
}
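
The fields in this block relate to each other arithmetically. The sketch below shows one way they could be derived; `benchmark_report` is a hypothetical helper, the label list is invented to reproduce a 65.1% majority share, and computing the majority baseline from holdout labels is an assumption.

```python
from collections import Counter

def majority_class_baseline(labels):
    """Accuracy of always predicting the most frequent class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

def benchmark_report(labels, starting_accuracy, current_accuracy, published_baseline):
    return {
        "majority_class_baseline": round(majority_class_baseline(labels), 4),
        "starting_accuracy": round(starting_accuracy, 4),
        "improvement_over_start": round(current_accuracy - starting_accuracy, 4),
        "published_baseline": round(published_baseline, 4),
    }

# Invented labels with a 651/349 split; accuracies chosen to match the example above.
labels = [0] * 651 + [1] * 349
report = benchmark_report(labels, starting_accuracy=0.8095,
                          current_accuracy=0.8326, published_baseline=0.8710)
```

Read together, the four numbers place the current run between the trivial majority-class floor and the published ceiling.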

API Reference

POST /reset                         Start a new episode
  body: {difficulty: "easy"|"medium"|"hard", seed?: int}

POST /step                          Take an action
  body: {session_id, action, rec_id?, target_class?}
  actions: query_cleaner | query_augmenter | query_balancer |
           query_validator | query_analyst | apply | rollback

GET  /state/{session_id}            Current observation
GET  /trajectory/{session_id}       Full episode trace (for offline analysis)
GET  /health                        Health check
GET  /metrics                       Server metrics + config
GET  /docs                          Swagger UI
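
A minimal client sketch for the two POST endpoints, using only the standard library. The request bodies follow the reference above; the helper names are hypothetical, and the assumption that the /reset response carries a `session_id` field is not confirmed by this README.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "https://aswini-kumar-datacentric-env.hf.space"  # ENV_URL from the Training section
ACTIONS = {"query_cleaner", "query_augmenter", "query_balancer",
           "query_validator", "query_analyst", "apply", "rollback"}

def reset_body(difficulty: str = "easy", seed: Optional[int] = None) -> dict:
    if difficulty not in {"easy", "medium", "hard"}:
        raise ValueError(f"unknown difficulty: {difficulty}")
    body = {"difficulty": difficulty}
    if seed is not None:
        body["seed"] = seed
    return body

def step_body(session_id: str, action: str,
              rec_id: Optional[str] = None, target_class: Optional[str] = None) -> dict:
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    body = {"session_id": session_id, "action": action}
    if rec_id is not None:
        body["rec_id"] = rec_id
    if target_class is not None:
        body["target_class"] = target_class
    return body

def build_post(path: str, body: dict) -> urllib.request.Request:
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def post_json(path: str, body: dict, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response (performs a network call)."""
    with urllib.request.urlopen(build_post(path, body), timeout=timeout) as response:
        return json.load(response)

# Example usage (not executed here; the session_id response field is an assumption):
# obs = post_json("/reset", reset_body("easy", seed=0))
# obs = post_json("/step", step_body(obs["session_id"], "query_cleaner"))
```

Validating the action name client-side avoids spending a step on a request the Format grader would penalize.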

Training

The training script (training/train.py) runs GRPO via TRL + Unsloth on Colab (T4 GPU).

# Set your HF Space URL
ENV_URL = "https://aswini-kumar-datacentric-env.hf.space"

# Then run training/train.py
# - Collects 60 episodes across easy/medium/hard difficulty
# - Trains Qwen2.5-3B-Instruct with LoRA r=16
# - Saves results.png with reward progression + distribution charts
# - Saves merged model to ./datacentric-grpo-final

Anti-Exploit Rules

| Rule | What it blocks |
|---|---|
| action_spam | Same query 3+ times in a row |
| low_budget_expensive_query | Cost-2 queries when budget is 2 or less |
| duplicate_apply | Applying the same rec_id twice |
| invalid_rec_id | Applying a rec_id that does not exist |
| data_integrity_violation | Deleting more than 10% of training rows in one operation |
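
The rules above can be expressed as simple predicates over the action history. This is a hypothetical sketch, not the logic in server/anti_exploit.py; the function signatures and the order in which rules are checked are assumptions.

```python
EXPENSIVE_QUERIES = {"query_validator", "query_analyst"}  # the cost-2 actions

def check_action(history, action, budget, rec_id=None,
                 known_rec_ids=frozenset(), applied_rec_ids=frozenset()):
    """Return the name of the violated rule, or None if the action is allowed.

    history is the list of previous action names in this episode.
    """
    # Third identical query in a row.
    if action.startswith("query_") and history[-2:] == [action, action]:
        return "action_spam"
    if action in EXPENSIVE_QUERIES and budget <= 2:
        return "low_budget_expensive_query"
    if action == "apply":
        if rec_id not in known_rec_ids:
            return "invalid_rec_id"
        if rec_id in applied_rec_ids:
            return "duplicate_apply"
    return None

def violates_data_integrity(rows_before, rows_after, max_drop_frac=0.10):
    """True when a single operation deletes more than 10% of training rows."""
    return (rows_before - rows_after) / rows_before > max_drop_frac

spam = check_action(["query_cleaner", "query_cleaner"], "query_cleaner", budget=10)
too_many_dropped = violates_data_integrity(rows_before=100, rows_after=89)
```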

Project Structure

datacentric-env/
├── server/
│   ├── main.py               # FastAPI app (endpoints + startup warmup)
│   ├── environment.py        # Session-aware RL environment (v0.5)
│   ├── dataset_registry.py   # Real dataset loader + CSV cache + warmup
│   ├── evaluator.py          # Train/holdout split evaluator + feature importance
│   ├── specialist_agents.py  # 5 domain-aware expert systems
│   ├── reward.py             # 5-component reward function
│   ├── session_manager.py    # Thread-safe UUID session management
│   ├── anti_exploit.py       # 5 anti-exploit rules
│   ├── config.py             # Centralized configuration
│   └── logger.py             # Structured JSON logging
├── datasets/                 # Cached real datasets (CSV, git-ignored)
├── training/
│   └── train.py              # GRPO training script (Colab)
├── inference.py              # Automated end-to-end test
├── openenv.yaml              # Full environment spec
├── requirements.txt
└── Dockerfile