data-quality-env / README.md
kumar6591's picture
Update README.md
ee95e34
---
title: data-quality-env
sdk: docker
emoji: πŸš€
colorFrom: blue
colorTo: green
---
## OpenEnv Validation
This environment has been validated using OpenEnv:
```bash
openenv validate
# [OK] data-quality-env: Ready for multi-mode deployment
```
## Description
DataQualityEnv v2 is a budget-constrained, confidence-scored OpenEnv environment where an AI agent performs multi-step SQL auditing and optional fix verification.
## πŸš€ How to Use
### Step 1: Reset the Environment
Click **"Reset episode"** to start a new task.
You will see:
- task description
- table schema
- step budget and query credits
---
### Step 2: Investigate with SQL Queries
Run SQL queries to explore the data and identify issues.
Examples:
```sql
SELECT COUNT(*) FROM customers;
SELECT SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) FROM customers;
SELECT customer_id, email, COUNT(*)
FROM customers
GROUP BY 1,2
HAVING COUNT(*) > 1;
πŸ’‘ Note:
Queries help you gather evidence
They may return small rewards or penalties
Final scoring does NOT happen here
Step 3: Submit Audit Report (IMPORTANT)
After analysis, submit your findings using:
Observation json: {
"null_issues": {
"email": 12,
"customer_id": 0
},
"duplicate_row_count": 15,
"near_duplicate_count": 9,
"confidence": 0.9
}
submit: {
"null_issues": {
"email": 12,
"customer_id": 0
},
"duplicate_row_count": 15,
"near_duplicate_count": 9,
"confidence": 0.9
}
πŸ“Œ This step triggers:
deterministic grading
final score calculation (0.0 – 1.0)
⚠️ Important:
Running queries alone will NOT produce a final score.
You must submit a report.
Step 4: Fix Phase (Optional)
After submitting the report, the environment enters fix phase.
You can propose fixes using SQL:
UPDATE customers SET email = NULL WHERE email = 'UNKNOWN';
This may improve your score.
Step 5: Final Result
You will receive:
{
"value": 0.85,
"done": true
}
value β†’ final score
done β†’ task completed
πŸ€– Auto Audit Mode
You can also click "Auto audit", which:
runs multiple diagnostic SQL queries
generates a report automatically
submits it before step limit
This is the fastest way to test the system.
🧠 Reward System
Queries may return small penalties (e.g., -0.1) for redundant or low-value actions
This encourages efficient exploration
Final score depends ONLY on the submitted report
⚠️ Important Notes
Always submit a report before step limit
If you don't, the system may auto-submit a fallback report
Efficient querying leads to better performance
Core loop:
- `reset` β†’ environment generates seeded dirty datasets.
- `query` β†’ agent investigates across one or more tables.
- `submit_report` β†’ deterministic grading starts and fix phase unlocks.
- `fix_sql` β†’ agent proposes corrective updates for bonus.
Novel mechanics:
- Query budget economy (10 credits).
- Confidence Brier grading.
- 4 tasks (easy to expert).
- Adversarial camouflage (`NULL`, `N/A`, `-`, near-duplicates).
- Fix verification loop with bonus up to `+0.25`.
## Action space
1) Query
```json
{"action_type": "query", "sql": "SELECT * FROM customers LIMIT 10"}
```
2) Submit report
```json
{
"action_type": "submit_report",
"report": {
"null_issues": {"email": {"value": 12, "confidence": 0.92}},
"duplicate_row_count": {"value": 16, "confidence": 0.88},
"schema_violations": [],
"drifted_columns": [],
"drift_details": {},
"relational_issues": [],
"recommended_fixes": ["Add NULL checks"]
}
}
```
3) Fix SQL
```json
{"action_type": "fix_sql", "sql": "UPDATE orders SET quantity = ABS(quantity) WHERE quantity < 0"}
```
## Observation space
- `task_id`
- `task_description`
- `tables`
- `row_counts`
- `step`
- `max_steps`
- `query_credits_remaining`
- `phase` (`audit` | `fix`)
- `last_query_result`
- `last_action_error`
- `last_fix_score`
## Tasks
| ID | Name | Difficulty | What agent must find | Expected baseline |
|----|------|-----------|---------------------|-------------------|
| 1 | Null & duplicate detection | Easy | Nulls, disguised nulls, exact/near dups | ~0.82 |
| 2 | Schema violation repair | Medium | Type/format/range/unparseable violations | ~0.61 |
| 3 | Silent data drift | Hard | Mean shift, new cats, referential drift | ~0.34 |
| 4 | Multi-table relational audit | Expert | Orphaned FKs, temporal violations, aggregate mismatches | ~0.19 |
## Reward design
- Base audit score from deterministic task grader.
- Confidence Brier adjustment per finding.
- Budget bonus up to `+0.10`.
- Fix bonus up to `+0.25`.
Formula:
`total = min(1.25, audit_score Γ— brier_adj + budget_bonus + fix_bonus)`
## Baseline scores (multi-seed robustness)
| Seed | Task 1 | Task 2 | Task 3 | Task 4 | Mean |
|------|--------|--------|--------|--------|------|
| 42 | X.XX | X.XX | X.XX | X.XX | X.XX |
| 123 | X.XX | X.XX | X.XX | X.XX | X.XX |
| 777 | X.XX | X.XX | X.XX | X.XX | X.XX |
## Running inference
```bash
ENV_URL=http://localhost:7860 \
API_BASE_URL=https://router.huggingface.co/v1 \
MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
HF_TOKEN=your_token \
python inference.py
```
## Validation
```bash
./validate-submission.sh https://your-space.hf.space
```