# Training Module

Everything related to training an AI agent to test APIs using GRPO (Group Relative Policy Optimization).

---
## Setup

```bash
cd api_testing_env

# Option 1: Automated setup (creates venv, installs everything)
bash setup.sh

# Option 2: Manual setup
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Optional: log in to HuggingFace Hub (for model push)
huggingface-cli login

# Optional: log in to Weights & Biases (for logging)
wandb login
```
### Environment Variables

Create a `.env` file in `api_testing_env/` (or export the variables in your shell):

```bash
# .env

# HuggingFace Hub: required for --push-to-hub
# Get your token at: https://huggingface.co/settings/tokens
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Weights & Biases: required for --use-wandb
# Get your key at: https://wandb.ai/authorize
WANDB_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Optional: set W&B defaults
WANDB_PROJECT=api-testing-grpo
WANDB_ENTITY=your-team-name
```
**Three ways to provide these keys:**

| Method | Command |
|--------|---------|
| `.env` file | Create `.env` as shown above, then `source .env` before training |
| CLI login | `huggingface-cli login` and `wandb login` (stores keys in `~/.cache`) |
| Inline export | `export HF_TOKEN=hf_xxx && export WANDB_API_KEY=xxx` |

> **Important:** Never commit `.env` to git. It's already in `.gitignore`.
---

## Quick Start

```bash
cd api_testing_env
source .venv/bin/activate

# 1. See what training prompts look like (no GPU needed)
SHOW_PROMPTS=1 python -m training.grpo

# 2. Quick sanity check (CPU, ~2 minutes)
python -m training.grpo --test-mode

# 3. Real training (GPU required)
python -m training.grpo --model-id Qwen/Qwen3-1.7B --num-episodes 100

# 4. With HuggingFace Hub push
python -m training.grpo \
    --push-to-hub --hf-repo-id your-username/api-tester-grpo

# 5. With Weights & Biases logging
python -m training.grpo \
    --use-wandb --wandb-project api-testing-grpo

# 6. Full pipeline: training + HF push + W&B
python -m training.grpo \
    --model-id Qwen/Qwen3-1.7B \
    --num-episodes 100 \
    --push-to-hub --hf-repo-id your-username/api-tester-grpo \
    --use-wandb --wandb-project api-testing-grpo

# 7. Run baseline agents only (no GPU needed)
python -m training.evaluate --task all --agent all --url http://localhost:8000

# 8. Resume from checkpoint
python -m training.grpo --model-id ./checkpoints/step_50
```
---

## How Training Works

There is **no external dataset**. The environment generates unique episodes on the fly.

```
┌──────────────────────────────────────────────┐
│  GRPO Training Loop                          │
│                                              │
│  1. env.reset(seed=N)                        │
│     → unique users, tasks, data              │
│                                              │
│  2. LLM (Qwen 1.7B + LoRA) generates:        │
│     {"method": "GET", ...}                   │
│                                              │
│  3. env.step(action) → reward                │
│     (coverage + bugs + validity)             │
│                                              │
│  4. GRPO: generate 4 attempts per prompt,    │
│     keep best, update model weights          │
│                                              │
│  5. Repeat with next seed                    │
└──────────────────────────────────────────────┘
```
### Why no dataset file?

Each `reset(seed=N)` creates a **unique database** with different users, tasks, and data:

| Seed | Users | Tasks |
|------|-------|-------|
| 42 | diana, alice, xander, ivan, hannah | 8 tasks |
| 99 | mike, george, tom, fiona | 6 tasks |
| 7 | priya, kevin, wendy | 4 tasks |

The agent can't memorize "login as alice" because alice might not exist. It must **read the observation and adapt**; that's the learning signal.

The bugs (13 planted flaws) are structural (the same code flaws every episode), but the path to finding them changes because the data is different.
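The seed-to-data relationship can be illustrated with a small sketch. Everything here is hypothetical (the real generation lives in the environment's `reset()`); the `NAMES` pool, the ranges, and the `generate_episode_data` helper are illustrative only:

```python
import random

# Hypothetical stand-in for the environment's seeded data generation.
NAMES = ["alice", "bob", "diana", "fiona", "george", "hannah", "ivan",
         "kevin", "mike", "priya", "tom", "wendy", "xander"]

def generate_episode_data(seed: int) -> dict:
    """Same seed -> same database; different seed -> different database."""
    rng = random.Random(seed)
    return {
        "users": rng.sample(NAMES, rng.randint(3, 6)),
        "num_tasks": rng.randint(4, 10),
    }
```

The point is determinism per seed with diversity across seeds, which is what makes memorizing specific usernames useless.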
---

## Training Pipeline

The full training pipeline runs these steps automatically:

```
 1. Run baseline agents (random, sequential, smart) across all tasks
          ↓
 2. Load base model (Qwen 1.7B)
          ↓
 3. Evaluate base model before training (establishes LLM baseline)
          ↓
 4. GRPO training with LoRA
          ↓
 5. Save model locally to --output-dir
          ↓
 6. Push to HuggingFace Hub (if --push-to-hub)
          ↓
 7. Evaluate trained model after GRPO
          ↓
 8. Print comparison table (baselines vs base vs trained)
          ↓
 9. Save metrics (JSON + markdown) to output-dir/metrics/
          ↓
10. Save comparison plots (PNG) to output-dir/metrics/plots/
          ↓
11. Finalize W&B run (if --use-wandb)
```
---

## File Guide

| File | Purpose | When to modify |
|------|---------|----------------|
| `prompts.py` | System prompt, `format_observation()`, `parse_action()` | Change how the LLM sees tasks or formats actions |
| `rewards.py` | `format_reward_fn()`, `environment_reward_fn()` | Tune reward scaling or add new reward signals |
| `agents.py` | `RandomAgent`, `SequentialAgent`, `SmartAgent` | Add new baseline strategies |
| `grpo.py` | `build_training_prompts()`, `train_grpo()` | Change training hyperparameters or model |
| `evaluate.py` | `run_rollout()`, `run_baseline_local()`, remote runner | Change evaluation logic |
### prompts.py

The bridge between the environment and the LLM.

**`SYSTEM_PROMPT`**: Instructions telling the LLM it's an API tester. Includes the output format (JSON) and testing strategies.

**`format_observation(obs)`**: Converts an environment observation into text:
- First turn: full API spec + task description + available users
- Later turns: last response + feedback + progress stats + auth tokens

**`parse_action(text)`**: Extracts JSON from LLM output. Handles:
- Raw JSON: `{"method": "GET", "endpoint": "/tasks"}`
- Code blocks: `` ```json {...} ``` ``
- Extra text around JSON: `"I'll try: {...}"`
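A tolerant extractor along these lines handles all three cases. This is a sketch of the idea, not necessarily the repo's exact implementation:

```python
import json
import re

def parse_action(text: str):
    """Sketch of a tolerant JSON extractor (not the exact repo code)."""
    # Drop ```json / ``` fence markers if the model wrapped its output
    text = re.sub(r"```(?:json)?", "", text)
    # Grab the outermost {...} span, ignoring surrounding prose
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
```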
### rewards.py

Two reward functions that GRPO uses to score each LLM completion:

**`format_reward_fn`**: Binary: +1.0 if the completion is a valid JSON action, -1.0 if not. Teaches the model to always output parseable actions.

**`environment_reward_fn`**: Runs the action in the environment and returns the actual reward (coverage + bugs + validity), scaled by 5.0 so it dominates the format reward.
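The format reward can be sketched in a few lines, using a simplified stand-in for `parse_action` (the real one is more tolerant of fences and surrounding prose):

```python
import json

def parse_action(text: str):
    """Simplified stand-in for prompts.parse_action: strict JSON only."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

def format_reward_fn(completions: list[str], **kwargs) -> list[float]:
    """Binary format reward sketch: +1.0 if parseable, -1.0 if not."""
    return [1.0 if parse_action(c) is not None else -1.0 for c in completions]
```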
### agents.py

Three hand-coded baselines for comparison:

| Agent | Strategy | Expected Score |
|-------|----------|----------------|
| `RandomAgent` | Random method + random endpoint | ~0.10 |
| `SequentialAgent` | Fixed sequence: GET, POST, PUT, DELETE on each endpoint | ~0.35 |
| `SmartAgent` | Multi-phase: discover → auth → CRUD → bug hunt → security | ~0.55 |

A GRPO-trained model should beat the `SmartAgent`.
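For a feel of the baseline interface, here is a hedged sketch of a random baseline. The observation shape (an `endpoints` list) and the constructor's `seed` argument are assumptions for illustration, not the repo's exact API:

```python
import random

class RandomAgent:
    """Sketch of the random baseline: uniform method + endpoint each step."""
    name = "random"
    METHODS = ["GET", "POST", "PUT", "DELETE"]

    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)

    def act(self, observation: dict) -> dict:
        # 'endpoints' in the observation is an assumption for this sketch
        return {
            "method": self.rng.choice(self.METHODS),
            "endpoint": self.rng.choice(observation["endpoints"]),
        }
```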
### grpo.py

The main training script.

**`build_training_prompts(num_episodes)`**: Creates N prompts by resetting the environment with seeds 0..N. Each prompt is a chat message with the system prompt + initial observation.
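The prompt-building step can be sketched as follows. `StubEnv` stands in for the real environment so the sketch runs on its own, and the chat-message shape is the standard HF chat format; the real function also runs the observation through `format_observation`:

```python
class StubEnv:
    """Stand-in for the real environment, just to make the sketch runnable."""
    def reset(self, seed: int) -> str:
        return f"API spec + task description (episode seed {seed})"

def build_training_prompts(num_episodes: int, system_prompt: str, env) -> list:
    """Sketch: one chat-format prompt per seed, seeds 0..N-1."""
    prompts = []
    for seed in range(num_episodes):
        obs = env.reset(seed=seed)  # unique users/tasks/data per seed
        prompts.append([
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": obs},
        ])
    return prompts
```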
**`run_baseline_evaluation(seed)`**: Runs all three baseline agents across all tasks before training starts.

**`train_grpo(args)`**: Full GRPO loop:
1. Run baseline agents for comparison
2. Load model + tokenizer (Qwen 1.7B by default)
3. Evaluate base model before training
4. Apply LoRA (r=16, alpha=32, targets q_proj + v_proj)
5. Generate prompts from the environment
6. Create per-prompt environment instances for reward evaluation
7. Train with TRL's GRPOTrainer
8. Save model locally + push to HF Hub
9. Evaluate trained model + print comparison
10. Save metrics (JSON, markdown) and plots (PNG)
11. Finalize W&B run

**`save_metrics()`**: Saves `results.json` and `results.md` to `output-dir/metrics/`.

**`save_plots()`**: Generates three comparison bar charts (reward, bugs, coverage) saved as PNGs.
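A minimal sketch of what `save_metrics()` might do, assuming the output layout described above (the exact result fields and table columns are illustrative):

```python
import json
from pathlib import Path

def save_metrics(results: dict, output_dir: str) -> None:
    """Sketch: persist results as JSON plus a small markdown table."""
    metrics_dir = Path(output_dir) / "metrics"
    metrics_dir.mkdir(parents=True, exist_ok=True)
    (metrics_dir / "results.json").write_text(json.dumps(results, indent=2))
    lines = ["| agent | reward |", "|-------|--------|"]
    lines += [f"| {agent} | {reward:.2f} |" for agent, reward in results.items()]
    (metrics_dir / "results.md").write_text("\n".join(lines) + "\n")
```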
### evaluate.py

**`run_rollout(model, tokenizer, task_id, seed)`**: Runs one full episode with a HuggingFace model. Multi-turn: the LLM generates an action → the env steps → the LLM sees the result → repeat.

**`run_baseline_local(agent_name, task_id, seed)`**: Runs baseline agents against the local environment (no server needed). Used by `grpo.py` to establish baselines before training.

**`run_episode(url, task_id, agent_cls)`**: Runs a baseline agent against a remote server via WebSocket.
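The multi-turn loop behind `run_rollout` can be sketched like this; the `env` and `generate` signatures are assumptions for illustration, not the exact `evaluate.py` API:

```python
def run_rollout_loop(generate, env, max_steps: int = 20) -> float:
    """Sketch of the multi-turn episode loop: act, step, observe, repeat."""
    obs = env.reset(seed=0)
    total_reward = 0.0
    for _ in range(max_steps):
        action = generate(obs)                # LLM (or stub) picks an action
        obs, reward, done = env.step(action)  # environment scores it
        total_reward += reward
        if done:
            break
    return total_reward
```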
---

## Training Hyperparameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--model-id` | `Qwen/Qwen3-1.7B` | Base model (any HF causal LM) |
| `--num-episodes` | 50 | Training prompts (more = more diverse episodes) |
| `--num-generations` | 4 | GRPO rollouts per prompt (higher = better but slower) |
| `--max-completion-length` | 256 | Max tokens per LLM response |
| `--max-steps` | 200 | Total training optimizer steps |
| `--learning-rate` | 2e-5 | AdamW learning rate |
| `--batch-size` | 1 | Per-device batch size |
| `--output-dir` | `./checkpoints/grpo_api_tester` | Where to save the model |
| `--push-to-hub` | off | Push trained model to HuggingFace Hub |
| `--hf-repo-id` | none | HF Hub repo (e.g., `user/api-tester-grpo`) |
| `--use-wandb` | off | Enable Weights & Biases logging |
| `--wandb-project` | `api-testing-grpo` | W&B project name |
| `--wandb-run-name` | auto | W&B run name |
| `--test-mode` | off | Quick 3-episode, 2-generation, 5-step test |

### Hardware Requirements

| Setup | GPU | Time | Model |
|-------|-----|------|-------|
| Colab Free | T4 (16 GB) | ~1-2 hours | Qwen 1.7B + 4-bit LoRA |
| Colab Pro | A100 (40 GB) | ~30 min | Qwen 4B + LoRA |
| Local | Any 8 GB+ | ~1-2 hours | Qwen 1.7B + 4-bit LoRA |
| CPU only | None | `--test-mode` only | Verifies the pipeline works |
---

## Output Structure

After training, your output directory will look like:

```
checkpoints/grpo_api_tester/
├── adapter_config.json          # LoRA adapter config
├── adapter_model.safetensors    # Trained LoRA weights
├── tokenizer.json               # Tokenizer files
├── tokenizer_config.json
├── special_tokens_map.json
└── metrics/
    ├── results.json             # Full results (baselines + base + trained)
    ├── results.md               # Markdown comparison table
    └── plots/
        ├── reward_comparison.png    # Bar chart: reward across all agents
        ├── bugs_comparison.png      # Bar chart: bugs found
        └── coverage_comparison.png  # Bar chart: API coverage %
```
---

## Weights & Biases Integration

When `--use-wandb` is enabled, the following is logged:

| Metric | Description |
|--------|-------------|
| `baseline/{agent}/{task}/reward` | Baseline agent scores |
| `base_model/{task}/reward` | Pre-training model scores |
| `trained_model/{task}/reward` | Post-training model scores |
| `delta/{task}/reward` | Improvement over the base model |
| `plots/*` | Comparison charts as W&B images |
| TRL defaults | Loss, learning rate, reward mean/std |
---

## Expected Results

### Before Training (base Qwen 1.7B, no fine-tuning)

The base model can sometimes output JSON, but has no API testing strategy:

```
basic_validation:    ~0.15 (random-level)
edge_cases:          ~0.08
security_workflows:  ~0.03
```

### After GRPO (50 episodes, 200 steps)

The model learns systematic testing patterns:

```
basic_validation:    ~0.55-0.65
edge_cases:          ~0.35-0.45
security_workflows:  ~0.25-0.35
```
### What the Model Learns

1. **Output format**: Always produce valid JSON (format reward)
2. **Coverage**: Test different endpoints; don't repeat the same request
3. **Dependency chaining**: POST to create, then GET/PUT/DELETE the created resource
4. **Bug patterns**: Try non-existent IDs, missing fields, invalid emails
5. **Auth workflows**: Log in first, use tokens in subsequent requests
6. **Security testing**: Try cross-user access, injection payloads
| ## Extending the Training | |
| ### Add a new reward signal | |
| Edit `rewards.py`: | |
| ```python | |
| def efficiency_reward_fn(completions: list[str], **kwargs) -> list[float]: | |
| """Reward for concise, focused actions (penalize wasted steps).""" | |
| rewards = [] | |
| for text in completions: | |
| action = parse_action(text) | |
| if action and action.expected_status: | |
| rewards.append(0.5) # Bonus for predicting expected status | |
| else: | |
| rewards.append(0.0) | |
| return rewards | |
| ``` | |
| Then add it to the combined reward in `grpo.py`. | |
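One way to combine signals is a weighted-sum wrapper like the sketch below. `make_combined_reward` is a hypothetical helper, and the weights are illustrative; note that TRL's `GRPOTrainer` also accepts a list of reward functions directly, which may be the simpler route:

```python
def make_combined_reward(reward_fns, weights):
    """Hypothetical helper: fold several per-completion reward functions
    into one weighted sum with the same (completions, **kwargs) signature."""
    def combined(completions, **kwargs):
        totals = [0.0] * len(completions)
        for fn, weight in zip(reward_fns, weights):
            for i, r in enumerate(fn(completions, **kwargs)):
                totals[i] += weight * r
        return totals
    return combined
```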
### Add a new baseline agent

Edit `agents.py`:

```python
class CoverageAgent:
    """Agent that prioritizes hitting every endpoint once."""
    name = "coverage"

    def __init__(self):
        self.tested = set()
    # ...
```

Then add it to the `AGENTS` dict.
### Use a different model

```bash
# Qwen 2.5 (smaller, faster)
python -m training.grpo --model-id Qwen/Qwen2.5-1.5B

# Llama 3 (if you have access)
python -m training.grpo --model-id meta-llama/Llama-3.2-1B
```

Any HuggingFace causal language model works; just make sure it supports chat templates.