# 🛡️ Judge Defense: Technical Q&A

### 1. "Why use GRPO instead of standard PPO?"

**Answer:** "GRPO (Group Relative Policy Optimization) is significantly more efficient for SQL tasks because it eliminates the need for a separate Value Function (Critic) model. By comparing multiple generations against each other within the same group, we get a clear relative signal of what 'good' SQL looks like, which is much more stable for logic-heavy tasks."

### 2. "How do you ensure the agent doesn't execute malicious SQL (e.g., DROP TABLE)?"

**Answer:** "Security is built in. We use a **Multi-Agent Reviewer pattern**. Every query generated by the 'Actor' is pre-screened by a 'Security Agent' before it ever reaches the database. Additionally, our training environment uses a strictly sandboxed SQLite instance with no persistent file access."

### 3. "Does this generalize to other databases like PostgreSQL or Snowflake?"

**Answer:** "Yes. The environment is abstracted via a FastAPI interface. To support another database, we simply swap the SQLite driver for a PostgreSQL driver. The RL logic remains the same because the agent is learning SQL logic, not just syntax."

### 4. "What is the compute cost for training this specialized agent?"

**Answer:** "By using GRPO and parameter-efficient techniques, we were able to see a significant accuracy boost in under 20 minutes on a single T4 GPU. This makes it highly cost-effective for enterprise-specific schema fine-tuning."

### 5. "How do you handle hallucinations in the SQL?"

**Answer:** "Hallucinations are the primary reason we use RL. In a standard model, the AI might hallucinate a column name. In our system, that hallucination leads to a 'Database Error,' which results in a **0.0 Reward**. The model is directly penalized for hallucinating and rewarded for checking the schema."
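
---

### Appendix: Illustrative sketches

The "relative signal within a group" idea from Q1 can be sketched in a few lines. This is a minimal illustration of the GRPO advantage computation, not the project's actual training code: each generation's reward is normalized against its own group's statistics, so no learned critic is needed. The function name and reward values are hypothetical.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each sample's reward against
    its own group's mean and std, so no critic model is required."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in rewards]

# Four SQL generations for the same prompt, scored by execution success:
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
# → [1.0, -1.0, 1.0, -1.0]  (correct queries pushed up, failures down)
```

Because the signal is relative, a group where every generation fails (or every one succeeds) yields zero advantage, which is exactly the stability property the answer to Q1 refers to.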
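The Security Agent gate from Q2 could be an LLM reviewer in practice; a static deny-list screen is enough to illustrate the pre-screening step. The function name and keyword list here are assumptions for illustration only.

```python
import re

# Hypothetical deny-list of statements a read-only SQL agent
# should never be allowed to send to the database.
FORBIDDEN = re.compile(
    r"\b(DROP|DELETE|UPDATE|INSERT|ALTER|TRUNCATE|ATTACH|PRAGMA)\b",
    re.IGNORECASE,
)

def security_screen(sql: str) -> bool:
    """Return True only if the Actor's query looks like a safe read.

    Queries failing the screen are rejected before execution."""
    return FORBIDDEN.search(sql) is None

print(security_screen("SELECT name FROM users WHERE id = 1"))  # → True
print(security_screen("DROP TABLE users; --"))                  # → False
```

Layering this screen in front of a sandboxed, in-memory SQLite instance gives defense in depth: even a query that slips past the reviewer cannot touch persistent state.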
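The driver-swap claim in Q3 rests on the agent only ever seeing rows, never driver internals. A minimal sketch of that abstraction, using a hypothetical `SQLBackend` protocol (the real FastAPI interface is not shown in this document):

```python
import sqlite3
from typing import Protocol, Sequence

class SQLBackend(Protocol):
    """Hypothetical driver interface behind the environment API.

    Swapping the backend does not change the RL loop, because the
    agent only ever observes rows of results."""
    def run(self, sql: str) -> Sequence[tuple]: ...

class SQLiteBackend:
    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)

    def run(self, sql: str) -> Sequence[tuple]:
        return self.conn.execute(sql).fetchall()

# A PostgresBackend would wrap psycopg the same way; the learned
# policy is driver-agnostic as long as run() returns rows.
backend: SQLBackend = SQLiteBackend()
print(backend.run("SELECT 1"))  # → [(1,)]
```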
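The "Database Error → 0.0 Reward" mechanism from Q5 can be shown concretely. This is a sketch of an execution-based reward function, assuming a sandboxed in-memory SQLite connection and a gold result to compare against; the function name and scoring values are illustrative, not the project's exact reward model.

```python
import sqlite3

def execution_reward(sql: str, conn: sqlite3.Connection, expected) -> float:
    """Score a generated query by executing it in the sandbox.

    A hallucinated column or table raises a database error, which
    maps to a hard 0.0 reward; a correct result earns 1.0."""
    try:
        got = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return 0.0  # hallucination -> 'Database Error' -> zero reward
    return 1.0 if got == expected else 0.0

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'ada')")

print(execution_reward("SELECT name FROM users", conn, [("ada",)]))  # → 1.0
print(execution_reward("SELECT nmae FROM users", conn, []))          # → 0.0
```

Feeding this signal into the group-relative comparison is what makes schema-faithful queries consistently outscore hallucinated ones during training.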