# Training Module
Everything related to training an AI agent to test APIs using GRPO (Group Relative Policy Optimization).
---
## Setup
```bash
cd api_testing_env
# Option 1: Automated setup (creates venv, installs everything)
bash setup.sh
# Option 2: Manual setup
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# Optional: login to HuggingFace Hub (for model push)
huggingface-cli login
# Optional: login to Weights & Biases (for logging)
wandb login
```
### Environment Variables
Create a `.env` file in `api_testing_env/` (or export in your shell):
```bash
# .env
# HuggingFace Hub — required for --push-to-hub
# Get your token at: https://huggingface.co/settings/tokens
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# Weights & Biases — required for --use-wandb
# Get your key at: https://wandb.ai/authorize
WANDB_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# Optional: set W&B defaults
WANDB_PROJECT=api-testing-grpo
WANDB_ENTITY=your-team-name
```
**Three ways to provide these keys:**
| Method | Command |
|--------|---------|
| `.env` file | Create `.env` as shown above, then `source .env` before training |
| CLI login | `huggingface-cli login` and `wandb login` (store credentials locally, e.g. under `~/.cache/huggingface` and `~/.netrc`) |
| Inline export | `export HF_TOKEN=hf_xxx && export WANDB_API_KEY=xxx` |
> **Important:** Never commit `.env` to git. It's already in `.gitignore`.
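If you prefer reading credentials from Python rather than `source`-ing the `.env`, a minimal fail-fast sketch (the `require_env` helper is hypothetical; the variable names match the `.env` above):

```python
import os

def require_env(name: str) -> str:
    """Fetch a required credential from the environment, failing fast if absent."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"{name} is not set; create a .env as shown above")
    return value

os.environ.setdefault("HF_TOKEN", "hf_dummy_token")  # simulate `source .env`
print(require_env("HF_TOKEN"))
```

Failing early with a clear message beats a cryptic 401 halfway through a training run.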
---
## Quick Start
```bash
cd api_testing_env
source .venv/bin/activate
# 1. See what training prompts look like (no GPU needed)
SHOW_PROMPTS=1 python -m training.grpo
# 2. Quick sanity check (CPU, ~2 minutes)
python -m training.grpo --test-mode
# 3. Real training (GPU required)
python -m training.grpo --model-id Qwen/Qwen3-1.7B --num-episodes 100
# 4. With HuggingFace Hub push
python -m training.grpo \
--push-to-hub --hf-repo-id your-username/api-tester-grpo
# 5. With Weights & Biases logging
python -m training.grpo \
--use-wandb --wandb-project api-testing-grpo
# 6. Full pipeline: training + HF push + W&B
python -m training.grpo \
--model-id Qwen/Qwen3-1.7B \
--num-episodes 100 \
--push-to-hub --hf-repo-id your-username/api-tester-grpo \
--use-wandb --wandb-project api-testing-grpo
# 7. Run baseline agents only (no GPU needed)
python -m training.evaluate --task all --agent all --url http://localhost:8000
# 8. Resume from checkpoint
python -m training.grpo --model-id ./checkpoints/step_50
```
---
## How Training Works
There is **no external dataset**. The environment generates unique episodes on the fly.
```
                ┌─────────────────────────────────────────────┐
                │             GRPO Training Loop              │
                │                                             │
┌───────────┐   │ 1. env.reset(seed=N)                        │
│           │   │    → unique users, tasks, data              │
│   Qwen    │   │                                             │
│   1.7B    │──▶│ 2. LLM generates: {"method":"GET",...}      │
│  + LoRA   │   │                                             │
│           │◀──│ 3. env.step(action) → reward                │
└───────────┘   │    coverage + bugs + validity               │
                │                                             │
                │ 4. GRPO: generate 4 attempts per prompt,    │
                │    score by group-relative advantage,       │
                │    update model weights                     │
                │                                             │
                │ 5. Repeat with next seed                    │
                └─────────────────────────────────────────────┘
```
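Step 4's group-relative scoring can be sketched in a few lines (a simplified view of what TRL's GRPOTrainer computes internally, not its actual code):

```python
def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each completion's reward against its group's mean and std."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    if std == 0:
        return [0.0] * len(rewards)  # identical rewards carry no learning signal
    return [(r - mean) / std for r in rewards]

# 4 attempts at one prompt: above-average completions get a positive advantage
print(group_relative_advantages([0.2, 0.8, 0.5, 0.5]))
```

This is why `--num-generations` matters: the advantage is only meaningful relative to the other attempts in the same group.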
### Why no dataset file?
Each `reset(seed=N)` creates a **unique database** with different users, tasks, and data:
| Seed | Users | Tasks |
|------|-------|-------|
| 42 | diana, alice, xander, ivan, hannah | 8 tasks |
| 99 | mike, george, tom, fiona | 6 tasks |
| 7 | priya, kevin, wendy | 4 tasks |
The agent can't memorize "login as alice" because alice might not exist. It must **read the observation and adapt** — that's the learning signal.
The bugs (13 planted flaws) are structural — the same code flaws appear every episode — but the path to finding them changes because the data is different.
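The seeding scheme can be illustrated with a toy generator (the names and ranges here are illustrative, not the environment's actual logic):

```python
import random

NAMES = ["alice", "bob", "diana", "fiona", "george", "hannah",
         "ivan", "kevin", "mike", "priya", "tom", "wendy", "xander"]

def generate_episode(seed: int) -> dict:
    """Derive users and task count deterministically from the seed."""
    rng = random.Random(seed)
    return {
        "users": rng.sample(NAMES, rng.randint(3, 5)),
        "num_tasks": rng.randint(4, 8),
    }

# Same seed, same world — so evaluation is reproducible,
# but each new seed forces the agent to adapt rather than memorize.
assert generate_episode(42) == generate_episode(42)
```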
---
## Training Pipeline
The full training pipeline runs these steps automatically:
```
1. Run baseline agents (random, sequential, smart) across all tasks
↓
2. Load base model (Qwen 1.7B)
↓
3. Evaluate base model before training (establishes LLM baseline)
↓
4. GRPO training with LoRA
↓
5. Save model locally to --output-dir
↓
6. Push to HuggingFace Hub (if --push-to-hub)
↓
7. Evaluate trained model after GRPO
↓
8. Print comparison table (baselines vs base vs trained)
↓
9. Save metrics (JSON + markdown) to output-dir/metrics/
↓
10. Save comparison plots (PNG) to output-dir/metrics/plots/
↓
11. Finalize W&B run (if --use-wandb)
```
---
## File Guide
| File | Purpose | When to modify |
|------|---------|----------------|
| `prompts.py` | System prompt, `format_observation()`, `parse_action()` | Change how the LLM sees tasks or formats actions |
| `rewards.py` | `format_reward_fn()`, `environment_reward_fn()` | Tune reward scaling or add new reward signals |
| `agents.py` | `RandomAgent`, `SequentialAgent`, `SmartAgent` | Add new baseline strategies |
| `grpo.py` | `build_training_prompts()`, `train_grpo()` | Change training hyperparameters or model |
| `evaluate.py` | `run_rollout()`, `run_baseline_local()`, remote runner | Change evaluation logic |
### prompts.py
The bridge between the environment and the LLM.
**`SYSTEM_PROMPT`** — Instructions telling the LLM it's an API tester. Includes output format (JSON) and testing strategies.
**`format_observation(obs)`** — Converts an environment observation into text:
- First turn: full API spec + task description + available users
- Later turns: last response + feedback + progress stats + auth tokens
**`parse_action(text)`** — Extracts JSON from LLM output. Handles:
- Raw JSON: `{"method": "GET", "endpoint": "/tasks"}`
- Code blocks: `` ```json {...} ``` ``
- Extra text around JSON: `"I'll try: {...}"`
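A tolerant extractor along these lines handles all three cases (a sketch only; the real implementation lives in `prompts.py`):

```python
import json
import re

def parse_action(text: str):
    """Extract a JSON action from raw LLM output, or return None."""
    # Prefer the contents of a ```json ... ``` code block if one is present.
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL)
    candidate = fenced.group(1) if fenced else None
    if candidate is None:
        # Fall back to the first {...} span anywhere in the text.
        brace = re.search(r"\{.*\}", text, re.DOTALL)
        candidate = brace.group(0) if brace else None
    if candidate is None:
        return None
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None
```

Returning `None` on failure (rather than raising) lets the reward functions penalize malformed output without crashing the rollout.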
### rewards.py
Two reward functions that GRPO uses to score each LLM completion:
**`format_reward_fn`** — Binary: +1.0 if the completion is a valid JSON action, -1.0 if not. Teaches the model to always output parseable actions.
**`environment_reward_fn`** — Runs the action in the environment and returns the actual reward (coverage + bugs + validity), scaled by 5.0 so it dominates the format reward.
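In outline, the pair looks like this (a sketch: `parse_action` is stubbed with plain `json.loads`, and the `env.step` return shape is an assumption, not the module's actual API):

```python
import json

def parse_action(text: str):
    # Minimal stand-in for prompts.parse_action: accept raw JSON only.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

def format_reward_fn(completions: list[str], **kwargs) -> list[float]:
    # Binary: +1.0 for parseable JSON actions, -1.0 otherwise.
    return [1.0 if parse_action(c) is not None else -1.0 for c in completions]

ENV_REWARD_SCALE = 5.0  # from the description above

def environment_reward_fn(completions: list[str], envs: list, **kwargs) -> list[float]:
    # Each completion runs in its per-prompt environment instance;
    # the env's raw reward (coverage + bugs + validity) is scaled up.
    rewards = []
    for text, env in zip(completions, envs):
        action = parse_action(text)
        raw = env.step(action)[1] if action else 0.0  # assumed (obs, reward, ...) tuple
        rewards.append(ENV_REWARD_SCALE * raw)
    return rewards
```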
### agents.py
Three hand-coded baselines for comparison:
| Agent | Strategy | Expected Score |
|-------|----------|---------------|
| `RandomAgent` | Random method + random endpoint | ~0.10 |
| `SequentialAgent` | Fixed sequence: GET, POST, PUT, DELETE each endpoint | ~0.35 |
| `SmartAgent` | Multi-phase: discover → auth → CRUD → bug hunt → security | ~0.55 |
A GRPO-trained model should beat the SmartAgent.
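For reference, the weakest baseline fits in a few lines (the endpoint list here is illustrative, not the environment's actual spec):

```python
import random

METHODS = ["GET", "POST", "PUT", "DELETE"]
ENDPOINTS = ["/auth/login", "/users", "/tasks", "/tasks/1"]  # illustrative subset

class RandomAgent:
    """Baseline sketch: random method + random endpoint, ignoring all feedback."""
    name = "random"

    def act(self, observation) -> dict:
        # Never reads the observation — hence the ~0.10 expected score.
        return {"method": random.choice(METHODS),
                "endpoint": random.choice(ENDPOINTS)}
```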
### grpo.py
The main training script.
**`build_training_prompts(num_episodes)`** — Creates N prompts by resetting the environment with seeds 0..N-1. Each prompt is a chat message with system prompt + initial observation.
**`run_baseline_evaluation(seed)`** — Runs all three baseline agents across all tasks before training starts.
**`train_grpo(args)`** — Full GRPO loop:
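In outline (a sketch with the environment and `prompts.py` helpers passed in explicitly so it is self-contained; the real function's signature takes only the episode count):

```python
def build_training_prompts(env, num_episodes: int,
                           system_prompt: str, format_observation) -> list:
    """One chat-format prompt per seed; the env supplies the observation."""
    prompts = []
    for seed in range(num_episodes):
        obs = env.reset(seed=seed)  # unique world per seed
        prompts.append([
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": format_observation(obs)},
        ])
    return prompts
```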
1. Run baseline agents for comparison
2. Load model + tokenizer (Qwen 1.7B default)
3. Evaluate base model before training
4. Apply LoRA (r=16, alpha=32, targets q_proj + v_proj)
5. Generate prompts from environment
6. Create per-prompt environment instances for reward eval
7. Train with TRL's GRPOTrainer
8. Save model locally + push to HF Hub
9. Evaluate trained model + print comparison
10. Save metrics (JSON, markdown) and plots (PNG)
11. Finalize W&B run
**`save_metrics()`** — Saves `results.json` and `results.md` to `output-dir/metrics/`.
**`save_plots()`** — Generates three comparison bar charts (reward, bugs, coverage) saved as PNGs.
### evaluate.py
**`run_rollout(model, tokenizer, task_id, seed)`** — Runs one full episode with a HuggingFace model. Multi-turn: LLM generates action → env steps → LLM sees result → repeats.
**`run_baseline_local(agent_name, task_id, seed)`** — Runs baseline agents against the local environment (no server needed). Used by `grpo.py` to establish baselines before training.
**`run_episode(url, task_id, agent_cls)`** — Runs a baseline agent against a remote server via WebSocket.
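The multi-turn rollout shape can be sketched as follows (a simplified sketch: `generate` stands in for the model+tokenizer, and the `env.step` return shape is an assumption):

```python
import json

def run_rollout(generate, env, max_turns: int = 10) -> float:
    """One episode: model proposes an action, env responds, repeat until done."""
    obs = env.reset(seed=0)
    total_reward = 0.0
    for _ in range(max_turns):
        text = generate(obs)              # LLM produces an action string
        try:
            action = json.loads(text)
        except json.JSONDecodeError:
            action = None                  # unparseable output -> no action
        obs, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```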
---
## Training Hyperparameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| `--model-id` | `Qwen/Qwen3-1.7B` | Base model (any HF causal LM) |
| `--num-episodes` | 50 | Training prompts (more = more diverse episodes) |
| `--num-generations` | 4 | GRPO rollouts per prompt (higher = better but slower) |
| `--max-completion-length` | 256 | Max tokens per LLM response |
| `--max-steps` | 200 | Total training optimizer steps |
| `--learning-rate` | 2e-5 | AdamW learning rate |
| `--batch-size` | 1 | Per-device batch size |
| `--output-dir` | `./checkpoints/grpo_api_tester` | Where to save model |
| `--push-to-hub` | off | Push trained model to HuggingFace Hub |
| `--hf-repo-id` | none | HF Hub repo (e.g., `user/api-tester-grpo`) |
| `--use-wandb` | off | Enable Weights & Biases logging |
| `--wandb-project` | `api-testing-grpo` | W&B project name |
| `--wandb-run-name` | auto | W&B run name |
| `--test-mode` | off | Quick 3-episode, 2-gen, 5-step test |
### Hardware Requirements
| Setup | GPU | Time | Model |
|-------|-----|------|-------|
| Colab Free | T4 (16GB) | ~1-2 hours | Qwen 1.7B + 4-bit LoRA |
| Colab Pro | A100 (40GB) | ~30 min | Qwen 4B + LoRA |
| Local | Any 8GB+ | ~1-2 hours | Qwen 1.7B + 4-bit LoRA |
| CPU only | None | `--test-mode` only | Verifies pipeline works |
---
## Output Structure
After training, your output directory will look like:
```
checkpoints/grpo_api_tester/
├── adapter_config.json          # LoRA adapter config
├── adapter_model.safetensors    # Trained LoRA weights
├── tokenizer.json               # Tokenizer files
├── tokenizer_config.json
├── special_tokens_map.json
└── metrics/
    ├── results.json             # Full results (baselines + base + trained)
    ├── results.md               # Markdown comparison table
    └── plots/
        ├── reward_comparison.png     # Bar chart: reward across all agents
        ├── bugs_comparison.png       # Bar chart: bugs found
        └── coverage_comparison.png   # Bar chart: API coverage %
```
---
## Weights & Biases Integration
When `--use-wandb` is enabled, the following is logged:
| Metric | Description |
|--------|-------------|
| `baseline/{agent}/{task}/reward` | Baseline agent scores |
| `base_model/{task}/reward` | Pre-training model scores |
| `trained_model/{task}/reward` | Post-training model scores |
| `delta/{task}/reward` | Improvement over base model |
| `plots/*` | Comparison charts as W&B images |
| TRL defaults | Loss, learning rate, reward mean/std |
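The `delta` metrics are plain per-task differences. A sketch of how they might be computed before logging (the `compute_deltas` helper is hypothetical; `wandb.log` is the standard W&B call):

```python
def compute_deltas(base: dict, trained: dict) -> dict:
    """delta/{task}/reward = trained score minus base score, per task."""
    return {f"delta/{task}/reward": trained[task] - base[task] for task in base}

base_scores = {"basic_validation": 0.15, "edge_cases": 0.08}
trained_scores = {"basic_validation": 0.60, "edge_cases": 0.40}
deltas = compute_deltas(base_scores, trained_scores)
# e.g. wandb.log(deltas) once both evaluations are done
print(deltas)
```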
---
## Expected Results
### Before Training (base Qwen 1.7B, no fine-tuning)
The base model can output JSON sometimes, but has no API testing strategy:
```
basic_validation: ~0.15 (random-level)
edge_cases: ~0.08
security_workflows: ~0.03
```
### After GRPO (50 episodes, 200 steps)
The model learns systematic testing patterns:
```
basic_validation: ~0.55-0.65
edge_cases: ~0.35-0.45
security_workflows: ~0.25-0.35
```
### What the Model Learns
1. **Output format** — Always produce valid JSON (format reward)
2. **Coverage** — Test different endpoints, don't repeat the same request
3. **Dependency chaining** — POST to create, then GET/PUT/DELETE the created resource
4. **Bug patterns** — Try non-existent IDs, missing fields, invalid emails
5. **Auth workflows** — Login first, use tokens in subsequent requests
6. **Security testing** — Try cross-user access, injection payloads
---
## Extending the Training
### Add a new reward signal
Edit `rewards.py`:
```python
from training.prompts import parse_action  # assumed module path

def expected_status_reward_fn(completions: list[str], **kwargs) -> list[float]:
    """Bonus for actions that predict the expected HTTP status code."""
    rewards = []
    for text in completions:
        action = parse_action(text)
        if action and action.expected_status:
            rewards.append(0.5)  # Bonus for predicting expected status
        else:
            rewards.append(0.0)
    return rewards
```
Then add it to the combined reward in `grpo.py`.
### Add a new baseline agent
Edit `agents.py`:
```python
class CoverageAgent:
"""Agent that prioritizes hitting every endpoint once."""
name = "coverage"
def __init__(self):
self.tested = set()
# ...
```
Then add it to the `AGENTS` dict.
### Use a different model
```bash
# Qwen 2.5 (smaller, faster)
python -m training.grpo --model-id Qwen/Qwen2.5-1.5B
# Llama 3 (if you have access)
python -m training.grpo --model-id meta-llama/Llama-3.2-1B
```
Any HuggingFace causal language model works — just make sure it supports chat templates.