Eishaan committed on
Commit 82a6e6c · 1 Parent(s): 6a32325

updated readme

Files changed (1)
  1. README.md +21 -0
README.md CHANGED
@@ -61,6 +61,18 @@ Database schema migration is a **real-world task** that humans perform daily. Un
  - **Orphaned FKs** (references to deleted entities — tests audit logging)
  - **NULL currency** (must default to 'USD' — tests COALESCE)

+ ## Baseline Scores (Qwen/Qwen3-32B)
+ Tested deterministically via `inference.py` on default seeds:
+ | Task | Success Score | Step Count |
+ |------|--------------|------------|
+ | `column-restructure` | 0.99 | 4-5 |
+ | `soft-delete-restoration` | 0.99 | 5-7 |
+ | `table-normalization` | 0.99 | 8-10 |
+ | `schema-version-merge` | 0.99 | 9-11 |
+ | `multi-entity-extraction` | 0.99 | 10-12 |
+ | `cascade-migration` | 0.99 | 13-15 |
+ | `dual-source-consolidation` | 0.99 | 15-18 |
+
  ## Dynamic Golden Database Grading

  Unlike benchmarks with hardcoded expected values, our grader is **seed-independent**:
@@ -75,6 +87,15 @@ Unlike benchmarks with hardcoded expected values, our grader is **seed-independe
  - **FK & integrity (20%)**: Foreign keys enforced, PRAGMA integrity_check passes
  - **Anti-exploit (10%)**: No empty tables, no schema pollution

+ ### Reward Function
+ The episode step reward is the exact delta of the migration progress score:
+ ```python
+ step_reward = current_score - previous_score
+ ```
+ - If an agent reverts progress, `step_reward` is negative.
+ - Exploit attempts (e.g. `PRAGMA foreign_keys = OFF`) yield immediate `reward = -0.3`.
+ - Auto-submitted invalid schemas yield negative deltas for missing data.
+
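
The reward rules in the added Reward Function section can be sketched as a small function. This is a minimal illustration, not the project's code: the names `step_reward`, `is_exploit`, and `EXPLOIT_PATTERNS` are hypothetical, and the pattern check is an assumption, since the README names only `PRAGMA foreign_keys = OFF` as an example exploit.

```python
EXPLOIT_PENALTY = -0.3  # value quoted in the README; how it is applied is assumed

# Hypothetical pattern list, stored with whitespace removed.
# The README gives only `PRAGMA foreign_keys = OFF` as an example.
EXPLOIT_PATTERNS = ("pragmaforeign_keys=off",)


def is_exploit(sql: str) -> bool:
    """Return True if the statement matches a known exploit pattern (illustrative)."""
    # Lowercase and drop all whitespace so spacing variants compare equal.
    normalized = "".join(sql.lower().split())
    return any(pattern in normalized for pattern in EXPLOIT_PATTERNS)


def step_reward(previous_score: float, current_score: float, sql: str) -> float:
    """Delta of the progress score, or a fixed penalty for an exploit attempt."""
    if is_exploit(sql):
        return EXPLOIT_PENALTY
    return current_score - previous_score
```

Because each reward is a difference of consecutive scores, the rewards of a penalty-free episode sum to the final score minus the initial score, so an agent gains nothing by undoing and redoing the same progress.
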
  ## Security & Robustness

  - **SQL Timeout**: Progress-handler-based execution timeout prevents infinite CTEs
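
A progress-handler-based timeout of the kind named in the SQL Timeout bullet can be sketched with Python's standard `sqlite3` module. The README does not show the project's implementation, so the helper name `execute_with_timeout` and its parameters are assumptions made for illustration. SQLite calls the handler every `check_every` virtual-machine instructions and aborts the statement when the handler returns a non-zero value.

```python
import sqlite3
import time


def execute_with_timeout(conn, sql, timeout_s=1.0, check_every=1000):
    """Run `sql`, aborting it if it is still executing after `timeout_s` seconds.

    Hypothetical helper: name and defaults are not taken from the project.
    """
    deadline = time.monotonic() + timeout_s

    def abort_if_late():
        # A non-zero return value tells SQLite to interrupt the statement.
        return 1 if time.monotonic() > deadline else 0

    conn.set_progress_handler(abort_if_late, check_every)
    try:
        return conn.execute(sql).fetchall()
    finally:
        # Passing None removes the handler so later statements are unaffected.
        conn.set_progress_handler(None, 0)
```

An interrupted statement raises `sqlite3.OperationalError`, which a caller can catch and treat as a failed step. The connection stays usable afterwards.
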