
# 🛡️ Judge Defense: Technical Q&A

1. "Why use GRPO instead of standard PPO?"

Answer: "GRPO (Group Relative Policy Optimization) is significantly more efficient for SQL tasks because it eliminates the need for a separate Value Function (Critic) model. By comparing multiple generations against each other within the same group, we get a clear relative signal of what 'good' SQL looks like, which is much more stable for logic-heavy tasks."

2. "How do you ensure the agent doesn't execute malicious SQL (e.g., DROP TABLE)?"

Answer: "Security is built-in. We use a Multi-Agent Reviewer pattern. Every query generated by the 'Actor' is pre-screened by a 'Security Agent' before it ever reaches the database. Additionally, our training environment uses a strictly sandboxed SQLite instance with no persistent file access."

3. "Does this generalize to other databases like PostgreSQL or Snowflake?"

Answer: "Yes. The environment is abstracted via a FastAPI interface. To support another database, we simply swap the SQLite driver for a PostgreSQL driver. The RL logic remains the same because the agent is learning SQL logic, not just syntax."

4. "What is the compute cost for training this specialized agent?"

Answer: "By using GRPO and parameter-efficient techniques, we were able to see a significant accuracy boost in under 20 minutes on a single T4 GPU. This makes it highly cost-effective for enterprise-specific schema fine-tuning."

5. "How do you handle 'Hallucinations' in the SQL?"

Answer: "Hallucinations are the primary reason we use RL. In a standard model, the AI might hallucinate a column name. In our system, that hallucination leads to a 'Database Error,' which results in a 0.0 Reward. The model is literally penalized for hallucinating and rewarded for checking the schema."