title: DataCentric-Env
emoji: 🧹
colorFrom: purple
colorTo: indigo
sdk: docker
pinned: false
DataCentric-Env
An RL environment that trains an LLM to act as a data engineer.
The agent receives a real, messy tabular dataset and a frozen classifier it cannot touch. Its only job: fix the data until the classifier hits the accuracy target, measured against published academic benchmarks.
The Problem This Solves
Most RL environments for LLMs test reasoning on synthetic puzzles. Real data engineering requires domain reasoning: knowing that Glucose=0 is medically impossible, that capital-gain needs a log transform, and that removing 30% of rows will hurt generalization even if it improves cross-validation accuracy.
This environment forces the agent to develop that domain knowledge by grounding rewards in published accuracy benchmarks on real UCI datasets.
Live Demo
Environment server: https://huggingface.co/spaces/Aswini-Kumar/datacentric-env
- GET /docs: Interactive Swagger UI
- GET /health: Status + active sessions
- POST /reset: Start a new episode
The 5 Real Datasets
| Dataset | Domain | Published Baseline | Key Issues |
|---|---|---|---|
| UCI Adult Census | Income prediction | 87.1% | 14% missing (coded as `?`), capital-gain 97% zero, education/education-num redundant |
| Pima Indians Diabetes | Medical diagnosis | 77.0% | Glucose=0, BloodPressure=0, BMI=0 are medically impossible (zeros = missing) |
| Wisconsin Breast Cancer | Medical imaging | 97.3% | Correlated feature groups, outliers represent real rare tumors |
| German Credit Risk | Credit risk | 76.8% | Mixed categorical + numeric, 70/30 imbalance |
| Cleveland Heart Disease | Medical diagnosis | 85.5% | 303 rows, real missing values in ca and thal |
Datasets download automatically on first run and are cached locally. The server pre-loads all 5 at startup via a background thread.
Architecture
POST /reset → Load real dataset → 80/20 train/holdout split
              Agent sees train set (domain + known issues)
              Holdout is FROZEN: agent never sees or modifies it
POST /step  → Query a specialist agent
              Agent reads recommendations (domain-informed)
              Agent applies the best recommendation
              Score = accuracy on FROZEN holdout
              Compared against published benchmark
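The split step above can be sketched in plain Python. This is a minimal illustration, assuming a seeded shuffle; `split_train_holdout` is a hypothetical helper name, and the real environment may stratify or split differently.

```python
import random

def split_train_holdout(rows, holdout_fraction=0.2, seed=0):
    """Shuffle row indices with a fixed seed and split them 80/20.

    Hypothetical helper: not the environment's actual implementation.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_holdout = int(round(len(rows) * holdout_fraction))
    holdout_idx = indices[:n_holdout]
    train_idx = indices[n_holdout:]
    train = [rows[i] for i in train_idx]
    # A tuple signals that the holdout is never modified after the split.
    holdout = tuple(rows[i] for i in holdout_idx)
    return train, holdout

rows = [{"id": i, "label": i % 2} for i in range(10)]
train, holdout = split_train_holdout(rows, seed=42)
```

Only `train` would ever be handed to the agent; scoring reads `holdout` and nothing writes to it.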
5 Specialist Agents
| Agent | Action | What it does |
|---|---|---|
| CleanerAgent | `query_cleaner` | Missing values + zero-as-missing (domain-aware) + log-transform for skewed features |
| AugmenterAgent | `query_augmenter` | SMOTE-like interpolation to synthesize minority class rows |
| BalancerAgent | `query_balancer` | Oversample/undersample with explicit tradeoff explanation |
| ValidatorAgent | `query_validator` (cost 2) | Duplicates + outlier clipping (conservative 5x IQR for medical domains) |
| AnalystAgent | `query_analyst` (cost 2) | Holistic diagnosis + prioritized action plan + published baseline reference |
What's Domain-Aware
The CleanerAgent knows:
- In `medical_diagnosis` datasets: zeros in physiological measurements are impossible, so they are missing values → `zero_to_nan_impute`
- In `income_prediction` datasets: `capital-gain` has 97% zeros with heavy right skew → `log1p` transform
- Redundant features (e.g. `education` + `education-num`) → recommend dropping one
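A rough sketch of the two cleaner transforms named above. The README does not say which imputation statistic is used, so median imputation is an assumption here, and both function bodies are illustrative rather than the CleanerAgent's code.

```python
import math
from statistics import median

def zero_to_nan_impute(values):
    """Treat zeros as missing and fill them with the median of the non-zero values.

    Median imputation is an assumption; the environment may use another statistic.
    """
    observed = [v for v in values if v != 0]
    if not observed:
        return list(values)  # nothing to impute from
    fill = median(observed)
    return [fill if v == 0 else v for v in values]

def log1p_transform(values):
    """Compress a right-skewed, non-negative column; zeros stay at zero."""
    return [math.log1p(v) for v in values]

# Illustrative values only.
glucose = [148, 0, 85, 183, 0, 116]
capital_gain = [0, 0, 0, 2174, 0, 14084]

glucose_fixed = zero_to_nan_impute(glucose)       # zeros become 132.0
capital_gain_log = log1p_transform(capital_gain)  # zeros stay 0.0
```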
The ValidatorAgent knows:
- In medical domains, use 5x IQR instead of 3x, because outliers may be real rare conditions
- In credit/income domains, use standard 3x IQR
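The domain-dependent clipping rule could look roughly like this. The multipliers come from the rules above; the quartile method (Python's default for `statistics.quantiles`) and the function name are assumptions.

```python
from statistics import quantiles

# Multipliers taken from the rules above; the lookup structure is illustrative.
IQR_MULTIPLIER = {"medical_diagnosis": 5.0}
DEFAULT_MULTIPLIER = 3.0

def clip_outliers(values, domain):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR], with k chosen by domain."""
    k = IQR_MULTIPLIER.get(domain, DEFAULT_MULTIPLIER)
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]

values = [10, 11, 12, 13, 14, 15, 16, 17, 18, 40]
clipped_medical = clip_outliers(values, "medical_diagnosis")  # 40 is kept
clipped_credit = clip_outliers(values, "credit_risk")         # 40 is clipped to 33.75
```

The same extreme value survives in the medical domain and is clipped in the credit domain, which is the conservative behaviour the rule describes.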
Reward Structure
All rewards strictly in (0.001, 0.999). Every /step returns a full decomposition:
| Grader | Weight | What it measures |
|---|---|---|
| Format | 15% | Valid action with required fields |
| Accuracy | 35% | Progress toward target on frozen holdout |
| Quality | 20% | Missing% reduction + class balance improvement |
| Efficiency | 15% | Penalizes wasted steps and low-budget expensive queries |
| Completion | 15% | Bonus for hitting target, scaled by remaining budget |
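One way to read the table is as a weighted sum of five grader scores. The sketch below assumes each grader returns a value in [0, 1] and that the total is clamped to the stated bounds; how the environment actually handles the endpoints is not specified here.

```python
# Weights from the table above.
WEIGHTS = {
    "format": 0.15,
    "accuracy": 0.35,
    "quality": 0.20,
    "efficiency": 0.15,
    "completion": 0.15,
}

def combine_reward(components, low=0.001, high=0.999):
    """Weighted sum of grader scores, clamped to the stated reward bounds.

    Assumes every component score is already in [0, 1].
    """
    total = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return min(max(total, low), high)

example = combine_reward({
    "format": 1.0,
    "accuracy": 0.5,
    "quality": 0.5,
    "efficiency": 1.0,
    "completion": 0.0,
})
```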
New in v0.5
Rollback Action
{"action": "rollback", "session_id": "..."}
Undoes the last apply. Max 3 per episode. Costs 1 budget. Real data engineers do this.
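Rollback can be pictured as a snapshot stack with a per-episode limit. This is a sketch under that assumption; `RollbackBuffer` is a hypothetical class, and the budget cost is left out for brevity.

```python
import copy

class RollbackBuffer:
    """Keeps snapshots of the training data so an apply can be undone."""

    def __init__(self, max_rollbacks=3):
        self.max_rollbacks = max_rollbacks
        self.rollbacks_used = 0
        self._snapshots = []

    def snapshot(self, data):
        """Call before each apply to remember the current state."""
        self._snapshots.append(copy.deepcopy(data))

    def rollback(self):
        """Return the data as it was before the last apply, or raise if not allowed."""
        if self.rollbacks_used >= self.max_rollbacks:
            raise RuntimeError("rollback limit reached for this episode")
        if not self._snapshots:
            raise RuntimeError("nothing to roll back")
        self.rollbacks_used += 1
        return self._snapshots.pop()

buffer = RollbackBuffer()
data = [{"Glucose": 0}]
buffer.snapshot(data)
data[0]["Glucose"] = 132.0  # an apply mutates the training data
data = buffer.rollback()    # back to the pre-apply state
```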
Episode Reasoning Trace
Every observation includes the last 5 steps with effects:
"episode_trace": [
{"step": 2, "type": "apply", "accuracy_delta": 0.031, "effect": "improved"},
{"step": 3, "type": "apply", "accuracy_delta": -0.018, "effect": "hurt"}
]
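A bounded trace like this is naturally a fixed-length queue. The sketch below assumes the effect label follows the sign of `accuracy_delta`; the "neutral" label for a zero delta is an assumption, since only "improved" and "hurt" appear above.

```python
from collections import deque

def classify_effect(accuracy_delta):
    """Label a step by the sign of its accuracy change ("neutral" is an assumed label)."""
    if accuracy_delta > 0:
        return "improved"
    if accuracy_delta < 0:
        return "hurt"
    return "neutral"

trace = deque(maxlen=5)  # older steps fall off automatically

# Illustrative deltas for seven steps.
for step, delta in enumerate([0.0, 0.031, -0.018, 0.004, 0.0, 0.012, -0.002], start=1):
    trace.append({
        "step": step,
        "type": "apply",
        "accuracy_delta": delta,
        "effect": classify_effect(delta),
    })

episode_trace = list(trace)  # steps 3 through 7
```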
Feature Importance
Returned after every apply: LogisticRegression coefficients after StandardScaler:
"feature_importance": {
"top_positive": [{"feature": "Glucose", "coef": 0.84}],
"top_negative": [{"feature": "BMI_raw", "coef": -0.32}]
}
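Given already-fitted coefficients, a payload of this shape could be built as follows. The fitting step is omitted, `summarize_coefficients` is a hypothetical helper, and the `Age` and `SkinThickness` values are made up for illustration.

```python
def summarize_coefficients(coefs, top_k=3):
    """Split standardized coefficients into the strongest positive and negative features."""
    positive = sorted(
        ((f, c) for f, c in coefs.items() if c > 0),
        key=lambda fc: fc[1],
        reverse=True,
    )
    negative = sorted(
        ((f, c) for f, c in coefs.items() if c < 0),
        key=lambda fc: fc[1],
    )
    return {
        "top_positive": [{"feature": f, "coef": round(c, 2)} for f, c in positive[:top_k]],
        "top_negative": [{"feature": f, "coef": round(c, 2)} for f, c in negative[:top_k]],
    }

# Illustrative coefficients; only Glucose and BMI_raw appear in the example above.
coefs = {"Glucose": 0.84, "Age": 0.21, "BMI_raw": -0.32, "SkinThickness": -0.05}
feature_importance = summarize_coefficients(coefs, top_k=1)
```

Because the features are standardized first, coefficient magnitudes are comparable across columns, which is what makes this ranking meaningful.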
Regression Explanation
When accuracy drops after an apply:
"regression_explanation": {
"likely_cause": "large_augmentation_overfitting",
"suggestion": "Synthetic rows do not generalise to holdout. Try undersample_majority or rollback."
}
Benchmark Comparison
"benchmarks": {
"majority_class_baseline": 0.6510,
"starting_accuracy": 0.8095,
"improvement_over_start": 0.0231,
"published_baseline": 0.8710
}
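The fields above can be derived from a label list and two accuracy numbers. In this sketch the helper names are hypothetical, the label counts are illustrative, and the current accuracy of 0.8326 is simply the starting accuracy plus the improvement shown in the example.

```python
from collections import Counter

def majority_class_baseline(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

def benchmark_summary(labels, starting_accuracy, current_accuracy, published_baseline):
    """Assemble a benchmarks payload; which label set is used is an assumption."""
    return {
        "majority_class_baseline": round(majority_class_baseline(labels), 4),
        "starting_accuracy": round(starting_accuracy, 4),
        "improvement_over_start": round(current_accuracy - starting_accuracy, 4),
        "published_baseline": round(published_baseline, 4),
    }

labels = [0] * 13 + [1] * 7  # illustrative 65/35 class split
summary = benchmark_summary(labels, 0.8095, 0.8326, 0.8710)
```

The majority-class baseline is the floor a useful classifier must beat; the published baseline is the level the agent is trying to reach.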
API Reference
POST /reset                    Start a new episode
     body: {difficulty: "easy"|"medium"|"hard", seed?: int}
POST /step                     Take an action
     body: {session_id, action, rec_id?, target_class?}
     actions: query_cleaner | query_augmenter | query_balancer |
              query_validator | query_analyst | apply | rollback
GET  /state/{session_id}       Current observation
GET  /trajectory/{session_id}  Full episode trace (for offline analysis)
GET  /health                   Health check
GET  /metrics                  Server metrics + config
GET  /docs                     Swagger UI
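A small client sketch for the endpoints above, using only the standard library. The request bodies follow the reference; the helper names are hypothetical, and the response schema is not documented here, so check GET /docs before relying on any response field.

```python
import json
from urllib import request

BASE_URL = "https://aswini-kumar-datacentric-env.hf.space"  # URL from the Training section

QUERY_ACTIONS = {
    "query_cleaner", "query_augmenter", "query_balancer",
    "query_validator", "query_analyst",
}

def reset_payload(difficulty="easy", seed=None):
    """Body for POST /reset; seed is optional."""
    payload = {"difficulty": difficulty}
    if seed is not None:
        payload["seed"] = seed
    return payload

def step_payload(session_id, action, rec_id=None, target_class=None):
    """Body for POST /step; rec_id and target_class are optional."""
    if action not in QUERY_ACTIONS | {"apply", "rollback"}:
        raise ValueError(f"unknown action: {action}")
    payload = {"session_id": session_id, "action": action}
    if rec_id is not None:
        payload["rec_id"] = rec_id
    if target_class is not None:
        payload["target_class"] = target_class
    return payload

def post_json(path, payload, base_url=BASE_URL, timeout=30):
    """POST a JSON body and decode the JSON response."""
    req = request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example (needs network access, so it is left commented out):
# observation = post_json("/reset", reset_payload("easy", seed=0))
```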
Training
The training script (training/train.py) runs GRPO via TRL + Unsloth on Colab (T4 GPU).
# Set your HF Space URL
ENV_URL = "https://aswini-kumar-datacentric-env.hf.space"
# Then run training/train.py
# - Collects 60 episodes across easy/medium/hard difficulty
# - Trains Qwen2.5-3B-Instruct with LoRA r=16
# - Saves results.png with reward progression + distribution charts
# - Saves merged model to ./datacentric-grpo-final
Anti-Exploit Rules
| Rule | What it blocks |
|---|---|
| `action_spam` | Same query 3+ times in a row |
| `low_budget_expensive_query` | Cost-2 queries when budget is 2 or less |
| `duplicate_apply` | Applying the same rec_id twice |
| `invalid_rec_id` | Applying a rec_id that does not exist |
| `data_integrity_violation` | Deleting more than 10% of training rows in one operation |
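The rules above can be expressed as simple checks. The sketch below is one possible reading of the table; function names, argument shapes, and the order in which rules are checked are all assumptions.

```python
def check_action(history, action, budget, applied_rec_ids, known_rec_ids, rec_id=None):
    """Return the name of the violated rule, or None if the action is allowed.

    history is the list of previous action names, oldest first.
    """
    expensive = {"query_validator", "query_analyst"}  # cost-2 queries
    if action.startswith("query_"):
        # Two identical queries at the tail mean this would be the third in a row.
        if history[-2:] == [action, action]:
            return "action_spam"
        if action in expensive and budget <= 2:
            return "low_budget_expensive_query"
    if action == "apply":
        if rec_id not in known_rec_ids:
            return "invalid_rec_id"
        if rec_id in applied_rec_ids:
            return "duplicate_apply"
    return None

def violates_data_integrity(rows_before, rows_after, max_deleted_fraction=0.10):
    """True when one operation deletes more than 10% of the training rows."""
    deleted = rows_before - rows_after
    return deleted > max_deleted_fraction * rows_before
```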
Project Structure
datacentric-env/
├── server/
│   ├── main.py               # FastAPI app (endpoints + startup warmup)
│   ├── environment.py        # Session-aware RL environment (v0.5)
│   ├── dataset_registry.py   # Real dataset loader + CSV cache + warmup
│   ├── evaluator.py          # Train/holdout split evaluator + feature importance
│   ├── specialist_agents.py  # 5 domain-aware expert systems
│   ├── reward.py             # 5-component reward function
│   ├── session_manager.py    # Thread-safe UUID session management
│   ├── anti_exploit.py       # 5 anti-exploit rules
│   ├── config.py             # Centralized configuration
│   └── logger.py             # Structured JSON logging
├── datasets/                 # Cached real datasets (CSV, git-ignored)
├── training/
│   └── train.py              # GRPO training script (Colab)
├── inference.py              # Automated end-to-end test
├── openenv.yaml              # Full environment spec
├── requirements.txt
└── Dockerfile