# Training Module Everything related to training an AI agent to test APIs using GRPO (Group Relative Policy Optimization). --- ## Setup ```bash cd api_testing_env # Option 1: Automated setup (creates venv, installs everything) bash setup.sh # Option 2: Manual setup python3 -m venv .venv source .venv/bin/activate pip install -r requirements.txt # Optional: login to HuggingFace Hub (for model push) huggingface-cli login # Optional: login to Weights & Biases (for logging) wandb login ``` ### Environment Variables Create a `.env` file in `api_testing_env/` (or export in your shell): ```bash # .env # HuggingFace Hub — required for --push-to-hub # Get your token at: https://huggingface.co/settings/tokens HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx # Weights & Biases — required for --use-wandb # Get your key at: https://wandb.ai/authorize WANDB_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx # Optional: set W&B defaults WANDB_PROJECT=api-testing-grpo WANDB_ENTITY=your-team-name ``` **Three ways to provide these keys:** | Method | Command | |--------|---------| | `.env` file | Create `.env` as shown above, then `source .env` before training | | CLI login | `huggingface-cli login` and `wandb login` (stores keys in ~/.cache) | | Inline export | `export HF_TOKEN=hf_xxx && export WANDB_API_KEY=xxx` | > **Important:** Never commit `.env` to git. It's already in `.gitignore`. --- ## Quick Start ```bash cd api_testing_env source .venv/bin/activate # 1. See what training prompts look like (no GPU needed) SHOW_PROMPTS=1 python -m training.grpo # 2. Quick sanity check (CPU, ~2 minutes) python -m training.grpo --test-mode # 3. Real training (GPU required) python -m training.grpo --model-id Qwen/Qwen3-1.7B --num-episodes 100 # 4. With HuggingFace Hub push python -m training.grpo \ --push-to-hub --hf-repo-id your-username/api-tester-grpo # 5. With Weights & Biases logging python -m training.grpo \ --use-wandb --wandb-project api-testing-grpo # 6. Full pipeline: training + HF push + W&B python -m training.grpo \ --model-id Qwen/Qwen3-1.7B \ --num-episodes 100 \ --push-to-hub --hf-repo-id your-username/api-tester-grpo \ --use-wandb --wandb-project api-testing-grpo # 7. Run baseline agents only (no GPU needed) python -m training.evaluate --task all --agent all --url http://localhost:8000 # 8. Resume from checkpoint python -m training.grpo --model-id ./checkpoints/step_50 ``` --- ## How Training Works There is **no external dataset**. The environment generates unique episodes on the fly. ``` ┌─────────────────────────────────────────────┐ │ GRPO Training Loop │ │ │ ┌───────────┐ │ 1. env.reset(seed=N) │ │ │ │ → unique users, tasks, data │ │ Qwen │ │ │ │ 1.7B │──▶│ 2. LLM generates: {"method":"GET",...} │ │ + LoRA │ │ │ │ │◀──│ 3. env.step(action) → reward │ └───────────┘ │ coverage + bugs + validity │ │ │ │ 4. GRPO: generate 4 attempts per prompt, │ │ keep best, update model weights │ │ │ │ 5. Repeat with next seed │ └─────────────────────────────────────────────┘ ``` ### Why no dataset file? Each `reset(seed=N)` creates a **unique database** with different users, tasks, and data: | Seed | Users | Tasks | |------|-------|-------| | 42 | diana, alice, xander, ivan, hannah | 8 tasks | | 99 | mike, george, tom, fiona | 6 tasks | | 7 | priya, kevin, wendy | 4 tasks | The agent can't memorize "login as alice" because alice might not exist. It must **read the observation and adapt** — that's the learning signal. The bugs (13 planted flaws) are structural — same code flaws every episode — but the path to finding them changes because the data is different. --- ## Training Pipeline The full training pipeline runs these steps automatically: ``` 1. Run baseline agents (random, sequential, smart) across all tasks ↓ 2. Load base model (Qwen 1.7B) ↓ 3. Evaluate base model before training (establishes LLM baseline) ↓ 4. GRPO training with LoRA ↓ 5. Save model locally to --output-dir ↓ 6. Push to HuggingFace Hub (if --push-to-hub) ↓ 7. Evaluate trained model after GRPO ↓ 8. Print comparison table (baselines vs base vs trained) ↓ 9. Save metrics (JSON + markdown) to output-dir/metrics/ ↓ 10. Save comparison plots (PNG) to output-dir/metrics/plots/ ↓ 11. Finalize W&B run (if --use-wandb) ``` --- ## File Guide | File | Purpose | When to modify | |------|---------|----------------| | `prompts.py` | System prompt, `format_observation()`, `parse_action()` | Change how the LLM sees tasks or formats actions | | `rewards.py` | `format_reward_fn()`, `environment_reward_fn()` | Tune reward scaling or add new reward signals | | `agents.py` | `RandomAgent`, `SequentialAgent`, `SmartAgent` | Add new baseline strategies | | `grpo.py` | `build_training_prompts()`, `train_grpo()` | Change training hyperparameters or model | | `evaluate.py` | `run_rollout()`, `run_baseline_local()`, remote runner | Change evaluation logic | ### prompts.py The bridge between the environment and the LLM. **`SYSTEM_PROMPT`** — Instructions telling the LLM it's an API tester. Includes output format (JSON) and testing strategies. **`format_observation(obs)`** — Converts an environment observation into text: - First turn: full API spec + task description + available users - Later turns: last response + feedback + progress stats + auth tokens **`parse_action(text)`** — Extracts JSON from LLM output. Handles: - Raw JSON: `{"method": "GET", "endpoint": "/tasks"}` - Code blocks: `` ```json {...} ``` `` - Extra text around JSON: `"I'll try: {...}"` ### rewards.py Two reward functions that GRPO uses to score each LLM completion: **`format_reward_fn`** — Binary: +1.0 if valid JSON action, -1.0 if not. Teaches the model to always output parseable actions. **`environment_reward_fn`** — Runs the action in the environment and returns the actual reward (coverage + bugs + validity), scaled by 5.0 to dominate over format reward. ### agents.py Three hand-coded baselines for comparison: | Agent | Strategy | Expected Score | |-------|----------|---------------| | `RandomAgent` | Random method + random endpoint | ~0.10 | | `SequentialAgent` | Fixed sequence: GET, POST, PUT, DELETE each endpoint | ~0.35 | | `SmartAgent` | Multi-phase: discover → auth → CRUD → bug hunt → security | ~0.55 | A GRPO-trained model should beat the SmartAgent. ### grpo.py The main training script. **`build_training_prompts(num_episodes)`** — Creates N prompts by resetting the environment with seeds 0..N. Each prompt is a chat message with system prompt + initial observation. **`run_baseline_evaluation(seed)`** — Runs all three baseline agents across all tasks before training starts. **`train_grpo(args)`** — Full GRPO loop: 1. Run baseline agents for comparison 2. Load model + tokenizer (Qwen 1.7B default) 3. Evaluate base model before training 4. Apply LoRA (r=16, alpha=32, targets q_proj + v_proj) 5. Generate prompts from environment 6. Create per-prompt environment instances for reward eval 7. Train with TRL's GRPOTrainer 8. Save model locally + push to HF Hub 9. Evaluate trained model + print comparison 10. Save metrics (JSON, markdown) and plots (PNG) 11. Finalize W&B run **`save_metrics()`** — Saves `results.json` and `results.md` to `output-dir/metrics/`. **`save_plots()`** — Generates three comparison bar charts (reward, bugs, coverage) saved as PNGs. ### evaluate.py **`run_rollout(model, tokenizer, task_id, seed)`** — Runs one full episode with a HuggingFace model. Multi-turn: LLM generates action → env steps → LLM sees result → repeats. **`run_baseline_local(agent_name, task_id, seed)`** — Runs baseline agents against the local environment (no server needed). Used by `grpo.py` to establish baselines before training. **`run_episode(url, task_id, agent_cls)`** — Runs a baseline agent against a remote server via WebSocket. --- ## Training Hyperparameters | Parameter | Default | Description | |-----------|---------|-------------| | `--model-id` | `Qwen/Qwen3-1.7B` | Base model (any HF causal LM) | | `--num-episodes` | 50 | Training prompts (more = more diverse episodes) | | `--num-generations` | 4 | GRPO rollouts per prompt (higher = better but slower) | | `--max-completion-length` | 256 | Max tokens per LLM response | | `--max-steps` | 200 | Total training optimizer steps | | `--learning-rate` | 2e-5 | AdamW learning rate | | `--batch-size` | 1 | Per-device batch size | | `--output-dir` | `./checkpoints/grpo_api_tester` | Where to save model | | `--push-to-hub` | off | Push trained model to HuggingFace Hub | | `--hf-repo-id` | none | HF Hub repo (e.g., `user/api-tester-grpo`) | | `--use-wandb` | off | Enable Weights & Biases logging | | `--wandb-project` | `api-testing-grpo` | W&B project name | | `--wandb-run-name` | auto | W&B run name | | `--test-mode` | off | Quick 3-episode, 2-gen, 5-step test | ### Hardware Requirements | Setup | GPU | Time | Model | |-------|-----|------|-------| | Colab Free | T4 (16GB) | ~1-2 hours | Qwen 1.7B + 4-bit LoRA | | Colab Pro | A100 (40GB) | ~30 min | Qwen 4B + LoRA | | Local | Any 8GB+ | ~1-2 hours | Qwen 1.7B + 4-bit LoRA | | CPU only | None | `--test-mode` only | Verifies pipeline works | --- ## Output Structure After training, your output directory will look like: ``` checkpoints/grpo_api_tester/ ├── adapter_config.json # LoRA adapter config ├── adapter_model.safetensors # Trained LoRA weights ├── tokenizer.json # Tokenizer files ├── tokenizer_config.json ├── special_tokens_map.json └── metrics/ ├── results.json # Full results (baselines + base + trained) ├── results.md # Markdown comparison table └── plots/ ├── reward_comparison.png # Bar chart: reward across all agents ├── bugs_comparison.png # Bar chart: bugs found └── coverage_comparison.png # Bar chart: API coverage % ``` --- ## Weights & Biases Integration When `--use-wandb` is enabled, the following is logged: | Metric | Description | |--------|-------------| | `baseline/{agent}/{task}/reward` | Baseline agent scores | | `base_model/{task}/reward` | Pre-training model scores | | `trained_model/{task}/reward` | Post-training model scores | | `delta/{task}/reward` | Improvement over base model | | `plots/*` | Comparison charts as W&B images | | TRL defaults | Loss, learning rate, reward mean/std | --- ## Expected Results ### Before Training (base Qwen 1.7B, no fine-tuning) The base model can output JSON sometimes, but has no API testing strategy: ``` basic_validation: ~0.15 (random-level) edge_cases: ~0.08 security_workflows: ~0.03 ``` ### After GRPO (50 episodes, 200 steps) The model learns systematic testing patterns: ``` basic_validation: ~0.55-0.65 edge_cases: ~0.35-0.45 security_workflows: ~0.25-0.35 ``` ### What the Model Learns 1. **Output format** — Always produce valid JSON (format reward) 2. **Coverage** — Test different endpoints, don't repeat the same request 3. **Dependency chaining** — POST to create, then GET/PUT/DELETE the created resource 4. **Bug patterns** — Try non-existent IDs, missing fields, invalid emails 5. **Auth workflows** — Login first, use tokens in subsequent requests 6. **Security testing** — Try cross-user access, injection payloads --- ## Extending the Training ### Add a new reward signal Edit `rewards.py`: ```python def efficiency_reward_fn(completions: list[str], **kwargs) -> list[float]: """Reward for concise, focused actions (penalize wasted steps).""" rewards = [] for text in completions: action = parse_action(text) if action and action.expected_status: rewards.append(0.5) # Bonus for predicting expected status else: rewards.append(0.0) return rewards ``` Then add it to the combined reward in `grpo.py`. ### Add a new baseline agent Edit `agents.py`: ```python class CoverageAgent: """Agent that prioritizes hitting every endpoint once.""" name = "coverage" def __init__(self): self.tested = set() # ... ``` Then add it to the `AGENTS` dict. ### Use a different model ```bash # Qwen 2.5 (smaller, faster) python -m training.grpo --model-id Qwen/Qwen2.5-1.5B # Llama 3 (if you have access) python -m training.grpo --model-id meta-llama/Llama-3.2-1B ``` Any HuggingFace causal language model works — just make sure it supports chat templates.