Spaces:
Sleeping
Sleeping
File size: 5,221 Bytes
c10fded 640f531 c10fded 91e7690 9ae739c 2c92b66 9ae739c 2c92b66 9ae739c 91e7690 ee95e34 91e7690 2c92b66 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 | ---
title: data-quality-env
sdk: docker
emoji: π
colorFrom: blue
colorTo: green
---
## OpenEnv Validation
This environment has been validated using OpenEnv:
```bash
openenv validate
# [OK] data-quality-env: Ready for multi-mode deployment
```
## Description
DataQualityEnv v2 is a budget-constrained, confidence-scored OpenEnv environment where an AI agent performs multi-step SQL auditing and optional fix verification.
## π How to Use
### Step 1: Reset the Environment
Click **"Reset episode"** to start a new task.
You will see:
- task description
- table schema
- step budget and query credits
---
### Step 2: Investigate with SQL Queries
Run SQL queries to explore the data and identify issues.
Examples:
```sql
SELECT COUNT(*) FROM customers;
SELECT SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) FROM customers;
SELECT customer_id, email, COUNT(*)
FROM customers
GROUP BY 1,2
HAVING COUNT(*) > 1;
π‘ Note:
Queries help you gather evidence
They may return small rewards or penalties
Final scoring does NOT happen here
Step 3: Submit Audit Report (IMPORTANT)
After analysis, submit your findings using:
Observation json: {
"null_issues": {
"email": 12,
"customer_id": 0
},
"duplicate_row_count": 15,
"near_duplicate_count": 9,
"confidence": 0.9
}
submit: {
"null_issues": {
"email": 12,
"customer_id": 0
},
"duplicate_row_count": 15,
"near_duplicate_count": 9,
"confidence": 0.9
}
π This step triggers:
deterministic grading
final score calculation (0.0 β 1.0)
β οΈ Important:
Running queries alone will NOT produce a final score.
You must submit a report.
Step 4: Fix Phase (Optional)
After submitting the report, the environment enters fix phase.
You can propose fixes using SQL:
UPDATE customers SET email = NULL WHERE email = 'UNKNOWN';
This may improve your score.
Step 5: Final Result
You will receive:
{
"value": 0.85,
"done": true
}
value β final score
done β task completed
π€ Auto Audit Mode
You can also click "Auto audit", which:
runs multiple diagnostic SQL queries
generates a report automatically
submits it before step limit
This is the fastest way to test the system.
π§ Reward System
Queries may return small penalties (e.g., -0.1) for redundant or low-value actions
This encourages efficient exploration
Final score depends ONLY on the submitted report
β οΈ Important Notes
Always submit a report before step limit
If you don't, the system may auto-submit a fallback report
Efficient querying leads to better performance
Core loop:
- `reset` β environment generates seeded dirty datasets.
- `query` β agent investigates across one or more tables.
- `submit_report` β deterministic grading starts and fix phase unlocks.
- `fix_sql` β agent proposes corrective updates for bonus.
Novel mechanics:
- Query budget economy (10 credits).
- Confidence Brier grading.
- 4 tasks (easy to expert).
- Adversarial camouflage (`NULL`, `N/A`, `-`, near-duplicates).
- Fix verification loop with bonus up to `+0.25`.
## Action space
1) Query
```json
{"action_type": "query", "sql": "SELECT * FROM customers LIMIT 10"}
```
2) Submit report
```json
{
"action_type": "submit_report",
"report": {
"null_issues": {"email": {"value": 12, "confidence": 0.92}},
"duplicate_row_count": {"value": 16, "confidence": 0.88},
"schema_violations": [],
"drifted_columns": [],
"drift_details": {},
"relational_issues": [],
"recommended_fixes": ["Add NULL checks"]
}
}
```
3) Fix SQL
```json
{"action_type": "fix_sql", "sql": "UPDATE orders SET quantity = ABS(quantity) WHERE quantity < 0"}
```
## Observation space
- `task_id`
- `task_description`
- `tables`
- `row_counts`
- `step`
- `max_steps`
- `query_credits_remaining`
- `phase` (`audit` | `fix`)
- `last_query_result`
- `last_action_error`
- `last_fix_score`
## Tasks
| ID | Name | Difficulty | What agent must find | Expected baseline |
|----|------|-----------|---------------------|-------------------|
| 1 | Null & duplicate detection | Easy | Nulls, disguised nulls, exact/near dups | ~0.82 |
| 2 | Schema violation repair | Medium | Type/format/range/unparseable violations | ~0.61 |
| 3 | Silent data drift | Hard | Mean shift, new cats, referential drift | ~0.34 |
| 4 | Multi-table relational audit | Expert | Orphaned FKs, temporal violations, aggregate mismatches | ~0.19 |
## Reward design
- Base audit score from deterministic task grader.
- Confidence Brier adjustment per finding.
- Budget bonus up to `+0.10`.
- Fix bonus up to `+0.25`.
Formula:
`total = min(1.25, audit_score Γ brier_adj + budget_bonus + fix_bonus)`
## Baseline scores (multi-seed robustness)
| Seed | Task 1 | Task 2 | Task 3 | Task 4 | Mean |
|------|--------|--------|--------|--------|------|
| 42 | X.XX | X.XX | X.XX | X.XX | X.XX |
| 123 | X.XX | X.XX | X.XX | X.XX | X.XX |
| 777 | X.XX | X.XX | X.XX | X.XX | X.XX |
## Running inference
```bash
ENV_URL=http://localhost:7860 \
API_BASE_URL=https://router.huggingface.co/v1 \
MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
HF_TOKEN=your_token \
python inference.py
```
## Validation
```bash
./validate-submission.sh https://your-space.hf.space
```
|