Spaces:
Sleeping
Sleeping
| title: data-quality-env | |
| sdk: docker | |
| emoji: π | |
| colorFrom: blue | |
| colorTo: green | |
| ## OpenEnv Validation | |
| This environment has been validated using OpenEnv: | |
| ```bash | |
| openenv validate | |
| # [OK] data-quality-env: Ready for multi-mode deployment | |
| ``` | |
| ## Description | |
| DataQualityEnv v2 is a budget-constrained, confidence-scored OpenEnv environment where an AI agent performs multi-step SQL auditing and optional fix verification. | |
| ## π How to Use | |
| ### Step 1: Reset the Environment | |
| Click **"Reset episode"** to start a new task. | |
| You will see: | |
| - task description | |
| - table schema | |
| - step budget and query credits | |
| --- | |
| ### Step 2: Investigate with SQL Queries | |
| Run SQL queries to explore the data and identify issues. | |
| Examples: | |
| ```sql | |
| SELECT COUNT(*) FROM customers; | |
| SELECT SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) FROM customers; | |
| SELECT customer_id, email, COUNT(*) | |
| FROM customers | |
| GROUP BY 1,2 | |
| HAVING COUNT(*) > 1; | |
| π‘ Note: | |
| Queries help you gather evidence | |
| They may return small rewards or penalties | |
| Final scoring does NOT happen here | |
| Step 3: Submit Audit Report (IMPORTANT) | |
| After analysis, submit your findings using: | |
| Observation json: { | |
| "null_issues": { | |
| "email": 12, | |
| "customer_id": 0 | |
| }, | |
| "duplicate_row_count": 15, | |
| "near_duplicate_count": 9, | |
| "confidence": 0.9 | |
| } | |
| submit: { | |
| "null_issues": { | |
| "email": 12, | |
| "customer_id": 0 | |
| }, | |
| "duplicate_row_count": 15, | |
| "near_duplicate_count": 9, | |
| "confidence": 0.9 | |
| } | |
| π This step triggers: | |
| deterministic grading | |
| final score calculation (0.0 β 1.0) | |
| β οΈ Important: | |
| Running queries alone will NOT produce a final score. | |
| You must submit a report. | |
| Step 4: Fix Phase (Optional) | |
| After submitting the report, the environment enters fix phase. | |
| You can propose fixes using SQL: | |
| UPDATE customers SET email = NULL WHERE email = 'UNKNOWN'; | |
| This may improve your score. | |
| Step 5: Final Result | |
| You will receive: | |
| { | |
| "value": 0.85, | |
| "done": true | |
| } | |
| value β final score | |
| done β task completed | |
| π€ Auto Audit Mode | |
| You can also click "Auto audit", which: | |
| runs multiple diagnostic SQL queries | |
| generates a report automatically | |
| submits it before step limit | |
| This is the fastest way to test the system. | |
| π§ Reward System | |
| Queries may return small penalties (e.g., -0.1) for redundant or low-value actions | |
| This encourages efficient exploration | |
| Final score depends ONLY on the submitted report | |
| β οΈ Important Notes | |
| Always submit a report before step limit | |
| If you don't, the system may auto-submit a fallback report | |
| Efficient querying leads to better performance | |
| Core loop: | |
| - `reset` β environment generates seeded dirty datasets. | |
| - `query` β agent investigates across one or more tables. | |
| - `submit_report` β deterministic grading starts and fix phase unlocks. | |
| - `fix_sql` β agent proposes corrective updates for bonus. | |
| Novel mechanics: | |
| - Query budget economy (10 credits). | |
| - Confidence Brier grading. | |
| - 4 tasks (easy to expert). | |
| - Adversarial camouflage (`NULL`, `N/A`, `-`, near-duplicates). | |
| - Fix verification loop with bonus up to `+0.25`. | |
| ## Action space | |
| 1) Query | |
| ```json | |
| {"action_type": "query", "sql": "SELECT * FROM customers LIMIT 10"} | |
| ``` | |
| 2) Submit report | |
| ```json | |
| { | |
| "action_type": "submit_report", | |
| "report": { | |
| "null_issues": {"email": {"value": 12, "confidence": 0.92}}, | |
| "duplicate_row_count": {"value": 16, "confidence": 0.88}, | |
| "schema_violations": [], | |
| "drifted_columns": [], | |
| "drift_details": {}, | |
| "relational_issues": [], | |
| "recommended_fixes": ["Add NULL checks"] | |
| } | |
| } | |
| ``` | |
| 3) Fix SQL | |
| ```json | |
| {"action_type": "fix_sql", "sql": "UPDATE orders SET quantity = ABS(quantity) WHERE quantity < 0"} | |
| ``` | |
| ## Observation space | |
| - `task_id` | |
| - `task_description` | |
| - `tables` | |
| - `row_counts` | |
| - `step` | |
| - `max_steps` | |
| - `query_credits_remaining` | |
| - `phase` (`audit` | `fix`) | |
| - `last_query_result` | |
| - `last_action_error` | |
| - `last_fix_score` | |
| ## Tasks | |
| | ID | Name | Difficulty | What agent must find | Expected baseline | | |
| |----|------|-----------|---------------------|-------------------| | |
| | 1 | Null & duplicate detection | Easy | Nulls, disguised nulls, exact/near dups | ~0.82 | | |
| | 2 | Schema violation repair | Medium | Type/format/range/unparseable violations | ~0.61 | | |
| | 3 | Silent data drift | Hard | Mean shift, new cats, referential drift | ~0.34 | | |
| | 4 | Multi-table relational audit | Expert | Orphaned FKs, temporal violations, aggregate mismatches | ~0.19 | | |
| ## Reward design | |
| - Base audit score from deterministic task grader. | |
| - Confidence Brier adjustment per finding. | |
| - Budget bonus up to `+0.10`. | |
| - Fix bonus up to `+0.25`. | |
| Formula: | |
| `total = min(1.25, audit_score Γ brier_adj + budget_bonus + fix_bonus)` | |
| ## Baseline scores (multi-seed robustness) | |
| | Seed | Task 1 | Task 2 | Task 3 | Task 4 | Mean | | |
| |------|--------|--------|--------|--------|------| | |
| | 42 | X.XX | X.XX | X.XX | X.XX | X.XX | | |
| | 123 | X.XX | X.XX | X.XX | X.XX | X.XX | | |
| | 777 | X.XX | X.XX | X.XX | X.XX | X.XX | | |
| ## Running inference | |
| ```bash | |
| ENV_URL=http://localhost:7860 \ | |
| API_BASE_URL=https://router.huggingface.co/v1 \ | |
| MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \ | |
| HF_TOKEN=your_token \ | |
| python inference.py | |
| ``` | |
| ## Validation | |
| ```bash | |
| ./validate-submission.sh https://your-space.hf.space | |
| ``` | |