sql-debug-env / docs /MASTER_MANUAL.md
md896's picture
Fix: Mock vllm and llm_blender to stabilize GRPOTrainer in HF Jobs environment
bc20ef9

πŸ† SQL Debug Env: The Ultimate Master Manual

Comprehensive Wiki & Technical Bible for the Meta PyTorch Γ— OpenEnv Hackathon


πŸ“– Table of Contents

  1. The "Simple" Concept
  2. Architecture: How the Machine Works
  3. The Industry Benchmark: Spider vs. BIRD vs. YOU
  4. Deep-Dive: The Codebase Map
  5. The Science: GRPO & Reinforcement Learning
  6. The "Day in the Life" of a SQL Query
  7. Current Project Status & Roadmap
  8. Live Spider Evaluation (The "Ultimate Proof")
  9. Winning the Q&A (The Cheat Sheet)

1. The "Simple" Concept

Imagine you are a teacher. You have a student (the AI) who is good at English but bad at Math (the SQL). Instead of just giving the student a textbook, you put them in a room with a calculator (the Database). The student tries a problem, uses the calculator, sees the answer is wrong, and tries again. You have built the Room, the Calculator, and the Reward System (the "Stars") that makes the student smarter.


2. Architecture: How the Machine Works

The project is split into two main "Brains":

A. The Environment (The Body / server/)

This is the "physical world" where the SQL lives.

  • FastAPI: The "telephone" that lets the AI talk to the database.
  • SQLite: The "sandbox" where queries are actually run.
  • Graders: The "judge" that compares the result of the AI's query to the "truth."

B. The Agent (The Brain / grpo_train.py)

This is the intelligence that is trying to learn.

  • Model (Qwen2.5-Coder): The actual neural network.
  • GRPO Logic: The mathematical formula that tells the model: "Fix #3 was better than Fix #1, change your weights to be more like #3."

3. The Industry Benchmark: Spider vs. BIRD vs. YOU

Judge Question: "Why should we use your environment instead of existing datasets like Spider?"

Feature Spider / BIRD (Standard) SQL Debug Env (YOU)
Task Type One-Shot Generation Iterative Debugging
Feedback None (Static) Live Database Feedback
Difficulty High-level Text-to-SQL Low-level Logic/Syntax Fixes
Evaluation Fuzzy (String matching) Deterministic (Row matching)

The Reference: Your project is inspired by the DeepSeek R1 and OpenAI o1 reasoning models. You are applying their "Reinforcement Learning from Feedback" (RLHF) philosophy to the niche world of SQL engineering.


4. Deep-Dive: The Codebase Map

File What is it? Why is it here?
server/main.py The Heart Acts as the API server. It handles /reset (new game) and /step (make a move).
server/env.py The World Manages the session state. It knows if the user is in Task 1 or Task 3.
server/database.py The Sandbox Creates temporary SQLite databases in memory so the AI can't break anything.
server/reward.py The Scorekeeper Calculates the "Reward" (0.0 to 1.0). It checks syntax, efficiency, and correctness.
grpo_train.py The Trainer The script that actually "upgrades" the AI's brain using RL.
inference.py The Test A simple script to see how smart the AI is right now before training.
openenv.yaml The ID Card Tells the hackathon platform how to connect to your project.

5. The Science: GRPO & Reinforcement Learning

If a judge asks: "How does it learn?"

The Old Way: SFT (Supervised Fine-Tuning)

  • You show the AI 1,000 "Correct" answers.
  • Problem: The AI just memorizes. It doesn't learn how to "debug" when it sees a new error.

Your Way: GRPO (Group Relative Policy Optimization)

  • Step 1: The AI looks at a broken query.
  • Step 2: It generates 4 different ways to fix it (a "Group").
  • Step 3: We run all 4 in the database and get 4 scores.
  • Step 4: We compare them. We tell the AI: "Compared to your other 3 tries, your 2nd try was the best. Do more of that."
  • Innovation: This is "Self-Generated Reasoning." The AI is its own teacher.

6. The "Day in the Life" of a SQL Query

Follow a query from start to finish:

  1. The Prompt: "Fix this query: SELECT * FROM userss (typo)."
  2. The Reviewer: Your reviewer_check in main.py looks at it. If it sees DROP TABLE, it rejects it immediately.
  3. The Sandbox: The query is run in a private SQLite memory space.
  4. The Comparison: The system runs the "Correct" query in the background. It compares the rows.
  5. The Reward: If the rows match, the AI gets +1.0. If they don't, but the syntax is valid, it gets +0.2.
  6. The Memory: The AI updates its "Weights" (its digital brain) to remember this success.

7. Current Project Status & Roadmap

Project Completion: 95%

βœ… Completed:

  • Core FastAPI Server & SQLite Sandbox.
  • 3 Realistic SQL Debugging Tasks (Easy, Medium, Hard).
  • Multi-Agent Reviewer Layer.
  • GRPO Training Script verified on Apple Silicon (M2).
  • Smoke Test verified (Handshake is 100% working).

⏳ Remaining (For Hackathon Site):

  • Scale to Qwen 7B/14B on A100 GPUs.
  • Connect Weights & Biases (WandB) for the live presentation curve.

8. Live Spider Evaluation (The "Ultimate Proof")

How to show the judges your agent can handle real-world academic benchmarks:

  1. Launch the Spider Task: Run /reset with the spider_cross_eval task ID (handled by server/tasks/task_spider.py).
  2. The "Blind Test": Ask a judge to pick a random SQL query from the Spider dev set.
  3. Introduce a Bug: Delete a semicolon, misspell a JOIN, or remove a WHERE clause.
  4. The Demonstration: Run inference.py on that broken Spider query. The Result: The agent will use its trained GRPO weights to analyze the error, inspect the Spider schema, and return the fix.

Why this wins: You are showing that your environment isn't a "closed loop." It can ingest and solve the industry's hardest academic benchmark in real-time.


9. Winning the Q&A (The Cheat Sheet)

Q: "Why SQLite?"

"Because it's the world's most used DB. If the agent can reason in SQLite, it can reason in PostgreSQL. I built a 'Simulator' that is DB-agnostic."

Q: "What makes this 'Multi-Agent'?"

"I have two roles: The Fixer (the LLM) and the Reviewer (the guardrail logic). They interact to ensure every query is safe and syntactically sound before execution."


This manual is your secret weapon. Read it, understand it, and you will own the stage. πŸš€