---
title: data-quality-env
sdk: docker
emoji: 🚀
colorFrom: blue
colorTo: green
---
## OpenEnv Validation

This environment has been validated using OpenEnv:
```bash
openenv validate
# [OK] data-quality-env: Ready for multi-mode deployment
```

## Description
DataQualityEnv v2 is a budget-constrained, confidence-scored OpenEnv environment where an AI agent performs multi-step SQL auditing and optional fix verification.

## 🚀 How to Use

### Step 1: Reset the Environment
Click **"Reset episode"** to start a new task.

You will see:
- task description
- table schema
- step budget and query credits

---

### Step 2: Investigate with SQL Queries

Run SQL queries to explore the data and identify issues.

Examples:

```sql
SELECT COUNT(*) FROM customers;

SELECT SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) FROM customers;

SELECT customer_id, email, COUNT(*) 
FROM customers 
GROUP BY 1,2 
HAVING COUNT(*) > 1;
💡 Note:

Queries help you gather evidence
They may return small rewards or penalties
Final scoring does NOT happen here
Step 3: Submit Audit Report (IMPORTANT)

After analysis, submit your findings using:
Observation json: {
  "null_issues": {
    "email": 12,
    "customer_id": 0
  },
  "duplicate_row_count": 15,
  "near_duplicate_count": 9,
  "confidence": 0.9
}
submit: {
  "null_issues": {
    "email": 12,
    "customer_id": 0
  },
  "duplicate_row_count": 15,
  "near_duplicate_count": 9,
  "confidence": 0.9
}
📌 This step triggers:

deterministic grading
final score calculation (0.0 – 1.0)

⚠️ Important:

Running queries alone will NOT produce a final score.
You must submit a report.

Step 4: Fix Phase (Optional)

After submitting the report, the environment enters fix phase.

You can propose fixes using SQL:
UPDATE customers SET email = NULL WHERE email = 'UNKNOWN';
This may improve your score.

Step 5: Final Result

You will receive:

{
  "value": 0.85,
  "done": true
}
value → final score
done → task completed
🤖 Auto Audit Mode

You can also click "Auto audit", which:

runs multiple diagnostic SQL queries
generates a report automatically
submits it before step limit

This is the fastest way to test the system.

🧠 Reward System
Queries may return small penalties (e.g., -0.1) for redundant or low-value actions
This encourages efficient exploration
Final score depends ONLY on the submitted report
⚠️ Important Notes
Always submit a report before step limit
If you don't, the system may auto-submit a fallback report
Efficient querying leads to better performance

Core loop:
- `reset` → environment generates seeded dirty datasets.
- `query` → agent investigates across one or more tables.
- `submit_report` → deterministic grading starts and fix phase unlocks.
- `fix_sql` → agent proposes corrective updates for bonus.

Novel mechanics:
- Query budget economy (10 credits).
- Confidence Brier grading.
- 4 tasks (easy to expert).
- Adversarial camouflage (`NULL`, `N/A`, `-`, near-duplicates).
- Fix verification loop with bonus up to `+0.25`.

## Action space
1) Query
```json
{"action_type": "query", "sql": "SELECT * FROM customers LIMIT 10"}
```

2) Submit report
```json
{
  "action_type": "submit_report",
  "report": {
    "null_issues": {"email": {"value": 12, "confidence": 0.92}},
    "duplicate_row_count": {"value": 16, "confidence": 0.88},
    "schema_violations": [],
    "drifted_columns": [],
    "drift_details": {},
    "relational_issues": [],
    "recommended_fixes": ["Add NULL checks"]
  }
}
```

3) Fix SQL
```json
{"action_type": "fix_sql", "sql": "UPDATE orders SET quantity = ABS(quantity) WHERE quantity < 0"}
```

## Observation space
- `task_id`
- `task_description`
- `tables`
- `row_counts`
- `step`
- `max_steps`
- `query_credits_remaining`
- `phase` (`audit` | `fix`)
- `last_query_result`
- `last_action_error`
- `last_fix_score`

## Tasks
| ID | Name | Difficulty | What agent must find | Expected baseline |
|----|------|-----------|---------------------|-------------------|
| 1  | Null & duplicate detection | Easy | Nulls, disguised nulls, exact/near dups | ~0.82 |
| 2  | Schema violation repair | Medium | Type/format/range/unparseable violations | ~0.61 |
| 3  | Silent data drift | Hard | Mean shift, new cats, referential drift | ~0.34 |
| 4  | Multi-table relational audit | Expert | Orphaned FKs, temporal violations, aggregate mismatches | ~0.19 |

## Reward design
- Base audit score from deterministic task grader.
- Confidence Brier adjustment per finding.
- Budget bonus up to `+0.10`.
- Fix bonus up to `+0.25`.

Formula:

`total = min(1.25, audit_score × brier_adj + budget_bonus + fix_bonus)`

## Baseline scores (multi-seed robustness)
| Seed | Task 1 | Task 2 | Task 3 | Task 4 | Mean |
|------|--------|--------|--------|--------|------|
| 42   | X.XX   | X.XX   | X.XX   | X.XX   | X.XX |
| 123  | X.XX   | X.XX   | X.XX   | X.XX   | X.XX |
| 777  | X.XX   | X.XX   | X.XX   | X.XX   | X.XX |

## Running inference
```bash
ENV_URL=http://localhost:7860 \
API_BASE_URL=https://router.huggingface.co/v1 \
MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
HF_TOKEN=your_token \
python inference.py
```

## Validation
```bash
./validate-submission.sh https://your-space.hf.space
```