# πŸ† SQL Debug Env: The Ultimate Master Manual
> **Comprehensive Wiki & Technical Bible for the Meta PyTorch Γ— OpenEnv Hackathon**
---
## πŸ“– Table of Contents
1. [The "Simple" Concept](#1-the-simple-concept)
2. [Architecture: How the Machine Works](#2-architecture-how-the-machine-works)
3. [The Industry Benchmark: Spider vs. BIRD vs. YOU](#3-the-industry-benchmark-spider-vs-bird-vs-you)
4. [Deep-Dive: The Codebase Map](#4-deep-dive-the-codebase-map)
5. [The Science: GRPO & Reinforcement Learning](#5-the-science-grpo--reinforcement-learning)
6. [The "Day in the Life" of a SQL Query](#6-the-day-in-the-life-of-a-sql-query)
7. [Current Project Status & Roadmap](#7-current-project-status--roadmap)
8. [Live Spider Evaluation (The "Ultimate Proof")](#8-live-spider-evaluation-the-ultimate-proof)
9. [Winning the Q&A (The Cheat Sheet)](#9-winning-the-qa-the-cheat-sheet)
---
## 1. The "Simple" Concept
Imagine you are a teacher. You have a student (the **AI**) who is good at English but bad at Math (the **SQL**).
Instead of just giving the student a textbook, you put them in a room with a calculator (the **Database**).
The student tries a problem, uses the calculator, sees the answer is wrong, and tries again.
**You have built the Room, the Calculator, and the Reward System (the "Stars") that makes the student smarter.**
---
## 2. Architecture: How the Machine Works
The project is split into two main "Brains":
### A. The Environment (The Body / server/)
This is the "physical world" where the SQL lives.
- **FastAPI:** The "telephone" that lets the AI talk to the database.
- **SQLite:** The "sandbox" where queries are actually run.
- **Graders:** The "judge" that compares the result of the AI's query to the "truth."
### B. The Agent (The Brain / grpo_train.py)
This is the intelligence that is trying to learn.
- **Model (Qwen2.5-Coder):** The actual neural network.
- **GRPO Logic:** The mathematical formula that tells the model: *"Fix #3 was better than Fix #1, change your weights to be more like #3."*
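The two "Brains" meet at the `/reset` and `/step` endpoints. A minimal sketch of the handlers behind those endpoints, with the FastAPI wiring omitted so the logic stands alone — the function names, payload fields, and seeded query here are illustrative, not the actual contents of `server/main.py`:

```python
import sqlite3
import uuid

SESSIONS = {}  # session_id -> the broken query the agent must fix

def reset():
    """Handler behind POST /reset: start a new debugging episode."""
    session_id = str(uuid.uuid4())
    SESSIONS[session_id] = "SELECT * FROM userss"  # seeded broken query
    return {"session_id": session_id, "observation": SESSIONS[session_id]}

def step(session_id, sql):
    """Handler behind POST /step: run the agent's attempted fix in a sandbox."""
    db = sqlite3.connect(":memory:")  # throwaway sandbox per step
    db.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    db.execute("INSERT INTO users VALUES (1, 'ada')")
    try:
        rows = db.execute(sql).fetchall()
        return {"reward": 1.0, "rows": rows, "done": True}
    except sqlite3.Error as exc:
        return {"reward": 0.0, "error": str(exc), "done": False}
```

In the real server these functions would be decorated with `@app.post("/reset")` and `@app.post("/step")`; the point of the sketch is that each step is stateless from the database's perspective — a fresh in-memory SQLite per attempt, so the AI can never break anything.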
---
## 3. The Industry Benchmark: Spider vs. BIRD vs. YOU
**Judge Question:** *"Why should we use your environment instead of existing datasets like Spider?"*
| Feature | Spider / BIRD (Standard) | **SQL Debug Env (YOU)** |
| :--- | :--- | :--- |
| **Task Type** | One-Shot Generation | **Iterative Debugging** |
| **Feedback** | None (Static) | **Live Database Feedback** |
| **Difficulty** | High-level Text-to-SQL | **Low-level Logic/Syntax Fixes** |
| **Evaluation** | Fuzzy (String matching) | **Deterministic (Row matching)** |
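The "Deterministic (Row matching)" cell is the key difference: two queries that look nothing alike as strings can still be graded as equal if they return the same rows. A minimal sketch of that idea (the helper name is ours, not from the repo):

```python
import sqlite3

def same_rows(db, predicted_sql, gold_sql):
    """Grade by result set, not by text: order-insensitive row comparison."""
    pred = sorted(db.execute(predicted_sql).fetchall())
    gold = sorted(db.execute(gold_sql).fetchall())
    return pred == gold

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER, age INTEGER)")
db.executemany("INSERT INTO users VALUES (?, ?)", [(1, 20), (2, 35)])

# Different strings, identical results -> both pass a row-matching grader,
# while a string-matching grader would wrongly mark one of them incorrect.
print(same_rows(db, "SELECT id FROM users WHERE age > 30",
                    "SELECT id FROM users WHERE age >= 31"))  # True
```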
**The Reference:** Your project is inspired by reasoning models like **DeepSeek R1** and **OpenAI o1**. You are applying their reinforcement-learning-from-feedback philosophy to the niche world of SQL engineering — except here the feedback comes from a live database rather than human labels, so every reward is automatically verifiable.
---
## 4. Deep-Dive: The Codebase Map
| File | What is it? | Why is it here? |
| :--- | :--- | :--- |
| **`server/main.py`** | The Heart | Acts as the API server. It handles `/reset` (new game) and `/step` (make a move). |
| **`server/env.py`** | The World | Manages the session state. It knows if the user is in Task 1 or Task 3. |
| **`server/database.py`** | The Sandbox | Creates temporary SQLite databases in memory so the AI can't break anything. |
| **`server/reward.py`** | The Scorekeeper | Calculates the "Reward" (0.0 to 1.0). It checks syntax, efficiency, and correctness. |
| **`grpo_train.py`** | The Trainer | The script that actually "upgrades" the AI's brain using RL. |
| **`inference.py`** | The Test | A simple script to see how smart the AI is *right now* before training. |
| **`openenv.yaml`** | The ID Card | Tells the hackathon platform how to connect to your project. |
---
## 5. The Science: GRPO & Reinforcement Learning
If a judge asks: *"How does it learn?"*
### The Old Way: SFT (Supervised Fine-Tuning)
- You show the AI 1,000 "Correct" answers.
- **Problem:** The AI just memorizes. It doesn't learn how to "debug" when it sees a new error.
### Your Way: GRPO (Group Relative Policy Optimization)
- **Step 1:** The AI looks at a broken query.
- **Step 2:** It generates **4 different ways** to fix it (a "Group").
- **Step 3:** We run all 4 in the database and get 4 scores.
- **Step 4:** We compare them. We tell the AI: *"Compared to your other 3 tries, your 2nd try was the best. Do more of that."*
- **Innovation:** This is **"Self-Generated Reasoning."** The AI is its own teacher.
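The four-step loop above boils down to a group-relative advantage: score each of the G fixes, then normalize each reward against the group's own mean (and, in most GRPO implementations, its standard deviation). A toy sketch of that normalization — the reward numbers are invented for illustration:

```python
def group_advantages(rewards, eps=1e-8):
    """GRPO's core trick: each sample is judged relative to its own group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four attempted fixes for one broken query, scored by the environment:
rewards = [0.0, 1.0, 0.2, 0.0]
advantages = group_advantages(rewards)
# The best try (reward 1.0) gets a positive advantage ("do more of that");
# the failed tries get negative advantages ("do less of that").
print(advantages)
```

Because the baseline is the group's own mean, no separate value network is needed — that is what makes GRPO cheaper than classic PPO.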
---
## 6. The "Day in the Life" of a SQL Query
Follow a query from start to finish:
1. **The Prompt:** "Fix this query: `SELECT * FROM userss`" (note the typo in the table name).
2. **The Reviewer:** Your `reviewer_check` in `main.py` looks at it. If it sees `DROP TABLE`, it rejects it immediately.
3. **The Sandbox:** The query is run in a private SQLite memory space.
4. **The Comparison:** The system runs the "Correct" query in the background. It compares the rows.
5. **The Reward:** If the rows match, the AI gets `+1.0`. If they don't, but the syntax is valid, it gets `+0.2`.
6. **The Memory:** The AI updates its "Weights" (its digital brain) to remember this success.
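Steps 3–5 above can be collapsed into a single reward function. This sketch mirrors the tiers the manual names (`+1.0` for matching rows, `+0.2` for valid-but-wrong SQL, `0.0` otherwise); the function name and signature are illustrative, not the exact contents of `server/reward.py`:

```python
import sqlite3

def score(setup_sql, agent_sql, gold_sql):
    """0.0 = broken SQL, 0.2 = runs but wrong rows, 1.0 = rows match gold."""
    db = sqlite3.connect(":memory:")  # the private sandbox from step 3
    db.executescript(setup_sql)
    try:
        agent_rows = sorted(db.execute(agent_sql).fetchall())
    except sqlite3.Error:
        return 0.0  # syntax/runtime error: no partial credit
    gold_rows = sorted(db.execute(gold_sql).fetchall())  # step 4: the comparison
    return 1.0 if agent_rows == gold_rows else 0.2

setup = "CREATE TABLE users (id INTEGER); INSERT INTO users VALUES (1);"
print(score(setup, "SELECT * FROM userss", "SELECT * FROM users"))  # 0.0
print(score(setup, "SELECT * FROM users",  "SELECT * FROM users"))  # 1.0
```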
---
## 7. Current Project Status & Roadmap
**Project Completion: 95%**
### βœ… Completed:
- Core FastAPI Server & SQLite Sandbox.
- 3 Realistic SQL Debugging Tasks (Easy, Medium, Hard).
- Multi-Agent Reviewer Layer.
- GRPO Training Script verified on Apple Silicon (M2).
- Smoke Test verified (Handshake is 100% working).
### ⏳ Remaining (For Hackathon Site):
- Scale to **Qwen 7B/14B** on A100 GPUs.
- Connect **Weights & Biases (WandB)** for the live presentation curve.
---
## 8. Live Spider Evaluation (The "Ultimate Proof")
**How to show the judges your agent can handle real-world academic benchmarks:**
1. **Launch the Spider Task:**
Run `/reset` with the `spider_cross_eval` task ID (handled by `server/tasks/task_spider.py`).
2. **The "Blind Test":**
Ask a judge to pick a random SQL query from the **Spider dev set**.
3. **Introduce a Bug:**
Delete a semicolon, misspell a JOIN, or remove a WHERE clause.
4. **The Demonstration:**
Run `inference.py` on that broken Spider query.
**The Result:** The agent will use its trained GRPO weights to analyze the error, inspect the Spider schema, and return the fix.
**Why this wins:** You are showing that your environment isn't a "closed loop." It can ingest and solve the industry's hardest academic benchmark in real-time.
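Step 3's bug injection doesn't have to be done by hand — it can be scripted so the demo is reproducible. A small sketch (the corruption rules and sample query are ours; any query from the Spider dev set can be substituted):

```python
import random
import re

def inject_bug(sql, rng=random):
    """Break a correct query in one of the ways described above."""
    mutations = [
        lambda s: s.replace("JOIN", "JION", 1),              # misspell a JOIN
        lambda s: re.sub(r"\s+WHERE\s.*", "", s, count=1),   # drop the WHERE clause
        lambda s: s.replace("users", "userss", 1),           # typo a table name
    ]
    return rng.choice(mutations)(sql)

gold = "SELECT u.name FROM users u JOIN orders o ON u.id = o.uid WHERE o.total > 10"
print(inject_bug(gold))  # one randomly broken variant of the gold query
```

Feeding the broken variant to `inference.py` (step 4) then shows the agent repairing a query it has never seen during training.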
---
## 9. Winning the Q&A (The Cheat Sheet)
**Q: "Why SQLite?"**
> *"Because it's the world's most used DB. If the agent can reason in SQLite, it can reason in PostgreSQL. I built a 'Simulator' that is DB-agnostic."*
**Q: "What makes this 'Multi-Agent'?"**
> *"I have two roles: The **Fixer** (the LLM) and the **Reviewer** (the guardrail logic). They interact to ensure every query is safe and syntactically sound before execution."*
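The Reviewer role described in that answer can be as small as a denylist check that runs before the sandbox does. The manual mentions a `reviewer_check` in `main.py`; this version is our illustration of the idea, not the repo's actual code:

```python
import re

BANNED = ("DROP", "DELETE", "ALTER", "ATTACH", "PRAGMA")  # destructive/unsafe verbs

def reviewer_check(sql):
    """The 'Reviewer' agent: veto unsafe queries before execution."""
    tokens = re.findall(r"[A-Za-z_]+", sql.upper())  # tokenize, case-insensitive
    for verb in BANNED:
        if verb in tokens:
            return False, f"rejected: {verb} is not allowed"
    return True, "ok"

print(reviewer_check("SELECT * FROM users"))  # (True, 'ok')
print(reviewer_check("DROP TABLE users"))     # (False, 'rejected: DROP is not allowed')
```

Matching on whole tokens (rather than substrings) avoids false positives like a column literally named `dropped_at`.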
---
**This manual is your secret weapon. Read it, understand it, and you will own the stage.** πŸš€