updated readme
README.md
CHANGED
@@ -61,6 +61,18 @@ Database schema migration is a **real-world task** that humans perform daily. Un
- **Orphaned FKs** (references to deleted entities — tests audit logging)
- **NULL currency** (must default to 'USD' — tests COALESCE)

+ ## Baseline Scores (Qwen/Qwen3-32B)
+ Tested deterministically via `inference.py` on default seeds:
+ | Task | Success Score | Step Count |
+ |------|--------------|------------|
+ | `column-restructure` | 0.99 | 4-5 |
+ | `soft-delete-restoration` | 0.99 | 5-7 |
+ | `table-normalization` | 0.99 | 8-10 |
+ | `schema-version-merge` | 0.99 | 9-11 |
+ | `multi-entity-extraction` | 0.99 | 10-12 |
+ | `cascade-migration` | 0.99 | 13-15 |
+ | `dual-source-consolidation` | 0.99 | 15-18 |
+
## Dynamic Golden Database Grading

Unlike benchmarks with hardcoded expected values, our grader is **seed-independent**:
@@ -75,6 +87,15 @@ Unlike benchmarks with hardcoded expected values, our grader is **seed-independe
- **FK & integrity (20%)**: Foreign keys enforced, PRAGMA integrity_check passes
- **Anti-exploit (10%)**: No empty tables, no schema pollution
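
The FK and integrity component above can be approximated with SQLite's built-in pragmas. The following is a minimal sketch, not the project's grader: the function name `check_integrity` and the wording of the problem messages are hypothetical, and it assumes the migrated database is reachable through a standard `sqlite3` connection.

```python
import sqlite3


def check_integrity(conn: sqlite3.Connection) -> tuple[bool, list[str]]:
    """Return (ok, problems) for a migrated SQLite database (illustrative only)."""
    problems: list[str] = []

    # PRAGMA foreign_keys returns 1 when enforcement is on for this connection.
    if conn.execute("PRAGMA foreign_keys").fetchone()[0] != 1:
        problems.append("foreign key enforcement is off")

    # foreign_key_check yields one row per violating child row:
    # (child table, rowid, parent table, fk index).
    for table, rowid, parent, _fk in conn.execute("PRAGMA foreign_key_check"):
        problems.append(f"FK violation in {table} (rowid {rowid}) -> {parent}")

    # integrity_check returns a single row containing 'ok' when the file is sound.
    result = conn.execute("PRAGMA integrity_check").fetchall()
    if result != [("ok",)]:
        problems.extend(str(row[0]) for row in result)

    return (not problems, problems)
```

Note that `PRAGMA integrity_check` does not report foreign key violations on its own, which is why the sketch runs `PRAGMA foreign_key_check` separately.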

+ ### Reward Function
+ The episode step reward is the exact delta of the migration progress score:
+ ```python
+ step_reward = current_score - previous_score
+ ```
+ - If an agent reverts progress, `step_reward` is negative.
+ - Exploit attempts (e.g. `PRAGMA foreign_keys = OFF`) yield immediate `reward = -0.3`.
+ - Auto-submitted invalid schemas yield negative deltas for missing data.
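
The reward rule described in this hunk can be sketched as a small stateful helper. This is an illustration under assumptions, not the environment's actual code: `RewardTracker`, `EXPLOIT_MARKERS`, and the whitespace-insensitive substring match are hypothetical, and the sketch assumes an exploit attempt leaves the stored score unchanged. Only the delta formula and the `-0.3` penalty come from the README.

```python
from dataclasses import dataclass

EXPLOIT_PENALTY = -0.3  # fixed penalty quoted in the README

# Hypothetical marker list; the README names only `PRAGMA foreign_keys = OFF`.
# Markers are stored lowercase with all whitespace removed.
EXPLOIT_MARKERS = ("pragmaforeign_keys=off",)


@dataclass
class RewardTracker:
    """Remembers the last progress score so each step is rewarded by its delta."""

    previous_score: float = 0.0

    def step(self, sql: str, current_score: float) -> float:
        # Normalize case and whitespace before looking for exploit markers.
        normalized = "".join(sql.lower().split())
        if any(marker in normalized for marker in EXPLOIT_MARKERS):
            # Assumption: the stored score is not updated on an exploit attempt.
            return EXPLOIT_PENALTY

        # step_reward = current_score - previous_score
        reward = current_score - self.previous_score
        self.previous_score = current_score
        return reward
```

Because each non-exploit reward is a difference of consecutive scores, the rewards over an episode sum to the final score minus the starting score, so an agent gains nothing by oscillating between states.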
+
## Security & Robustness
- **SQL Timeout**: Progress-handler-based execution timeout prevents infinite CTEs
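
A progress-handler timeout of the kind named in the bullet above can be built on Python's standard `sqlite3.Connection.set_progress_handler`. The sketch below is one plausible shape rather than the project's implementation: `execute_with_timeout`, `timeout_s`, and `check_every` are hypothetical names and defaults.

```python
import sqlite3
import time


def execute_with_timeout(conn: sqlite3.Connection, sql: str,
                         timeout_s: float = 1.0, check_every: int = 1000):
    """Run `sql` and abort it once `timeout_s` seconds have elapsed."""
    deadline = time.monotonic() + timeout_s

    def handler() -> int:
        # SQLite calls this every `check_every` VM instructions;
        # a non-zero return value interrupts the running statement.
        return 1 if time.monotonic() > deadline else 0

    conn.set_progress_handler(handler, check_every)
    try:
        return conn.execute(sql).fetchall()
    finally:
        # Passing None removes the handler so later statements are unaffected.
        conn.set_progress_handler(None, 0)
```

When the handler fires, `sqlite3` raises `sqlite3.OperationalError`, which the caller can catch to report a timeout. An unbounded recursive CTE such as `WITH RECURSIVE spin(n) AS (SELECT 1 UNION ALL SELECT n + 1 FROM spin) SELECT count(*) FROM spin` is stopped this way instead of running indefinitely.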