# Training Module

Everything related to training an AI agent to test APIs using GRPO (Group Relative Policy Optimization).
## Setup

```bash
cd api_testing_env

# Option 1: Automated setup (creates venv, installs everything)
bash setup.sh

# Option 2: Manual setup
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Optional: log in to HuggingFace Hub (for model push)
huggingface-cli login

# Optional: log in to Weights & Biases (for logging)
wandb login
```
## Environment Variables

Create a `.env` file in `api_testing_env/` (or export in your shell):

```bash
# .env

# HuggingFace Hub – required for --push-to-hub
# Get your token at: https://huggingface.co/settings/tokens
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Weights & Biases – required for --use-wandb
# Get your key at: https://wandb.ai/authorize
WANDB_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Optional: set W&B defaults
WANDB_PROJECT=api-testing-grpo
WANDB_ENTITY=your-team-name
```
Three ways to provide these keys:

| Method | Command |
|---|---|
| `.env` file | Create `.env` as shown above, then `source .env` before training |
| CLI login | `huggingface-cli login` and `wandb login` (stores keys in `~/.cache`) |
| Inline export | `export HF_TOKEN=hf_xxx && export WANDB_API_KEY=xxx` |

**Important:** Never commit `.env` to git. It's already in `.gitignore`.
## Quick Start

```bash
cd api_testing_env
source .venv/bin/activate

# 1. See what training prompts look like (no GPU needed)
SHOW_PROMPTS=1 python -m training.grpo

# 2. Quick sanity check (CPU, ~2 minutes)
python -m training.grpo --test-mode

# 3. Real training (GPU required)
python -m training.grpo --model-id Qwen/Qwen3-1.7B --num-episodes 100

# 4. With HuggingFace Hub push
python -m training.grpo \
    --push-to-hub --hf-repo-id your-username/api-tester-grpo

# 5. With Weights & Biases logging
python -m training.grpo \
    --use-wandb --wandb-project api-testing-grpo

# 6. Full pipeline: training + HF push + W&B
python -m training.grpo \
    --model-id Qwen/Qwen3-1.7B \
    --num-episodes 100 \
    --push-to-hub --hf-repo-id your-username/api-tester-grpo \
    --use-wandb --wandb-project api-testing-grpo

# 7. Run baseline agents only (no GPU needed)
python -m training.evaluate --task all --agent all --url http://localhost:8000

# 8. Resume from a checkpoint
python -m training.grpo --model-id ./checkpoints/step_50
```
## How Training Works

There is no external dataset. The environment generates unique episodes on the fly.

```
┌───────────────────────────────────────────────┐
│              GRPO Training Loop               │
│                                               │
│  1. env.reset(seed=N)                         │
│     → unique users, tasks, data               │
│                                               │
│  2. LLM (Qwen 1.7B + LoRA) generates:         │
│     {"method":"GET",...}                      │
│                                               │
│  3. env.step(action) → reward                 │
│     (coverage + bugs + validity)              │
│                                               │
│  4. GRPO: generate 4 attempts per prompt,     │
│     keep best, update model weights           │
│                                               │
│  5. Repeat with next seed                     │
└───────────────────────────────────────────────┘
```
### Why no dataset file?

Each `reset(seed=N)` creates a unique database with different users, tasks, and data:
| Seed | Users | Tasks |
|---|---|---|
| 42 | diana, alice, xander, ivan, hannah | 8 tasks |
| 99 | mike, george, tom, fiona | 6 tasks |
| 7 | priya, kevin, wendy | 4 tasks |
The agent can't memorize "login as alice" because alice might not exist. It must read the observation and adapt; that's the learning signal.

The bugs (13 planted flaws) are structural (the same code flaws every episode), but the path to finding them changes because the data is different.
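The seeded-generation idea can be sketched with a toy generator. The name pool and ranges below are illustrative stand-ins, not the environment's actual code:

```python
import random

# Illustrative name pool; the real environment's generator differs.
NAMES = ["alice", "bob", "diana", "fiona", "george", "hannah",
         "ivan", "kevin", "mike", "priya", "tom", "wendy", "xander"]

def generate_episode_data(seed: int) -> dict:
    """Deterministically derive a fresh user/task set from the seed."""
    rng = random.Random(seed)  # seeded RNG: same seed, same episode
    users = rng.sample(NAMES, k=rng.randint(3, 5))
    num_tasks = rng.randint(4, 8)
    return {"users": users, "num_tasks": num_tasks}
```

Because the data is a pure function of the seed, every episode is reproducible without ever writing a dataset file.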
## Training Pipeline

The full training pipeline runs these steps automatically:

1. Run baseline agents (random, sequential, smart) across all tasks
2. Load the base model (Qwen 1.7B)
3. Evaluate the base model before training (establishes the LLM baseline)
4. GRPO training with LoRA
5. Save the model locally to `--output-dir`
6. Push to HuggingFace Hub (if `--push-to-hub`)
7. Evaluate the trained model after GRPO
8. Print a comparison table (baselines vs base vs trained)
9. Save metrics (JSON + markdown) to `output-dir/metrics/`
10. Save comparison plots (PNG) to `output-dir/metrics/plots/`
11. Finalize the W&B run (if `--use-wandb`)
## File Guide

| File | Purpose | When to modify |
|---|---|---|
| `prompts.py` | System prompt, `format_observation()`, `parse_action()` | Change how the LLM sees tasks or formats actions |
| `rewards.py` | `format_reward_fn()`, `environment_reward_fn()` | Tune reward scaling or add new reward signals |
| `agents.py` | `RandomAgent`, `SequentialAgent`, `SmartAgent` | Add new baseline strategies |
| `grpo.py` | `build_training_prompts()`, `train_grpo()` | Change training hyperparameters or model |
| `evaluate.py` | `run_rollout()`, `run_baseline_local()`, remote runner | Change evaluation logic |
### prompts.py

The bridge between the environment and the LLM.

`SYSTEM_PROMPT` – Instructions telling the LLM it's an API tester. Includes the output format (JSON) and testing strategies.

`format_observation(obs)` – Converts an environment observation into text:
- First turn: full API spec + task description + available users
- Later turns: last response + feedback + progress stats + auth tokens

`parse_action(text)` – Extracts JSON from LLM output. Handles:
- Raw JSON: `{"method": "GET", "endpoint": "/tasks"}`
- Code blocks: ```` ```json {...} ``` ````
- Extra text around JSON: `"I'll try: {...}"`
### rewards.py

Two reward functions that GRPO uses to score each LLM completion:

`format_reward_fn` – Binary: +1.0 if the completion is a valid JSON action, -1.0 if not. Teaches the model to always output parseable actions.

`environment_reward_fn` – Runs the action in the environment and returns the actual reward (coverage + bugs + validity), scaled by 5.0 so it dominates the format reward.
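A sketch of this two-signal scheme, assuming a minimal stand-in for `parse_action` and Gym-style per-prompt envs with a `step(action)` method (the real `rewards.py` may differ):

```python
import json

def parse_action(text):
    """Minimal stand-in for prompts.parse_action (strict JSON only)."""
    try:
        obj = json.loads(text)
        return obj if isinstance(obj, dict) else None
    except (json.JSONDecodeError, TypeError):
        return None

ENV_SCALE = 5.0  # environment reward should dominate the +/-1 format reward

def format_reward_fn(completions: list[str], **kwargs) -> list[float]:
    """Binary format signal: +1.0 if the completion parses to a JSON action."""
    return [1.0 if parse_action(c) is not None else -1.0 for c in completions]

def environment_reward_fn(completions: list[str], envs, **kwargs) -> list[float]:
    """Step each action in its per-prompt environment and scale the reward."""
    rewards = []
    for text, env in zip(completions, envs):
        action = parse_action(text)
        if action is None:
            rewards.append(0.0)  # unparseable actions earn no env reward
            continue
        _obs, reward, _done, _info = env.step(action)
        rewards.append(ENV_SCALE * reward)
    return rewards
```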
### agents.py

Three hand-coded baselines for comparison:

| Agent | Strategy | Expected score |
|---|---|---|
| `RandomAgent` | Random method + random endpoint | ~0.10 |
| `SequentialAgent` | Fixed sequence: GET, POST, PUT, DELETE on each endpoint | ~0.35 |
| `SmartAgent` | Multi-phase: discover → auth → CRUD → bug hunt → security | ~0.55 |

A GRPO-trained model should beat the SmartAgent.
### grpo.py

The main training script.

`build_training_prompts(num_episodes)` – Creates N prompts by resetting the environment with seeds 0..N. Each prompt is a chat message with the system prompt + the initial observation.
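The prompt-building loop can be sketched as follows, assuming a Gym-style env with `reset(seed=...)` and a `format_obs` callable standing in for `format_observation`:

```python
def build_training_prompts(env, num_episodes: int, system_prompt: str,
                           format_obs=str) -> list[dict]:
    """One chat-style prompt per seed: system prompt + first observation."""
    prompts = []
    for seed in range(num_episodes):
        obs = env.reset(seed=seed)  # fresh users/tasks/data for this episode
        prompts.append({
            "prompt": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": format_obs(obs)},
            ]
        })
    return prompts
```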
`run_baseline_evaluation(seed)` – Runs all three baseline agents across all tasks before training starts.

`train_grpo(args)` – The full GRPO loop:
- Run baseline agents for comparison
- Load the model + tokenizer (Qwen 1.7B by default)
- Evaluate the base model before training
- Apply LoRA (r=16, alpha=32, targets q_proj + v_proj)
- Generate prompts from the environment
- Create per-prompt environment instances for reward evaluation
- Train with TRL's GRPOTrainer
- Save the model locally + push to the HF Hub
- Evaluate the trained model + print the comparison
- Save metrics (JSON, markdown) and plots (PNG)
- Finalize the W&B run

`save_metrics()` – Saves results.json and results.md to `output-dir/metrics/`.

`save_plots()` – Generates three comparison bar charts (reward, bugs, coverage) saved as PNGs.
### evaluate.py

`run_rollout(model, tokenizer, task_id, seed)` – Runs one full episode with a HuggingFace model. Multi-turn: the LLM generates an action → the env steps → the LLM sees the result → repeat.
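The multi-turn loop can be sketched like this, with `generate_action` standing in for the tokenize → `model.generate` → decode → `parse_action` chain used with a real HuggingFace model:

```python
def run_rollout(env, generate_action, max_turns: int = 20) -> float:
    """Multi-turn episode: generate -> step -> observe, accumulating reward.

    generate_action(observation) is a hypothetical callable that returns
    a parsed action dict, or None when the model output is unparseable.
    """
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_turns):
        action = generate_action(obs)
        if action is None:  # unparseable output: skip this turn
            continue
        obs, reward, done, _info = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```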
`run_baseline_local(agent_name, task_id, seed)` – Runs baseline agents against the local environment (no server needed). Used by grpo.py to establish baselines before training.

`run_episode(url, task_id, agent_cls)` – Runs a baseline agent against a remote server via WebSocket.
## Training Hyperparameters

| Parameter | Default | Description |
|---|---|---|
| `--model-id` | Qwen/Qwen3-1.7B | Base model (any HF causal LM) |
| `--num-episodes` | 50 | Training prompts (more = more diverse episodes) |
| `--num-generations` | 4 | GRPO rollouts per prompt (higher = better but slower) |
| `--max-completion-length` | 256 | Max tokens per LLM response |
| `--max-steps` | 200 | Total training optimizer steps |
| `--learning-rate` | 2e-5 | AdamW learning rate |
| `--batch-size` | 1 | Per-device batch size |
| `--output-dir` | ./checkpoints/grpo_api_tester | Where to save the model |
| `--push-to-hub` | off | Push the trained model to HuggingFace Hub |
| `--hf-repo-id` | none | HF Hub repo (e.g., user/api-tester-grpo) |
| `--use-wandb` | off | Enable Weights & Biases logging |
| `--wandb-project` | api-testing-grpo | W&B project name |
| `--wandb-run-name` | auto | W&B run name |
| `--test-mode` | off | Quick test: 3 episodes, 2 generations, 5 steps |
## Hardware Requirements

| Setup | GPU | Time | Model |
|---|---|---|---|
| Colab Free | T4 (16GB) | ~1-2 hours | Qwen 1.7B + 4-bit LoRA |
| Colab Pro | A100 (40GB) | ~30 min | Qwen 4B + LoRA |
| Local | Any 8GB+ | ~1-2 hours | Qwen 1.7B + 4-bit LoRA |
| CPU only | None | `--test-mode` only | Verifies the pipeline works |
## Output Structure

After training, your output directory will look like:

```
checkpoints/grpo_api_tester/
├── adapter_config.json          # LoRA adapter config
├── adapter_model.safetensors    # Trained LoRA weights
├── tokenizer.json               # Tokenizer files
├── tokenizer_config.json
├── special_tokens_map.json
└── metrics/
    ├── results.json             # Full results (baselines + base + trained)
    ├── results.md               # Markdown comparison table
    └── plots/
        ├── reward_comparison.png    # Bar chart: reward across all agents
        ├── bugs_comparison.png      # Bar chart: bugs found
        └── coverage_comparison.png  # Bar chart: API coverage %
```
## Weights & Biases Integration

When `--use-wandb` is enabled, the following is logged:

| Metric | Description |
|---|---|
| `baseline/{agent}/{task}/reward` | Baseline agent scores |
| `base_model/{task}/reward` | Pre-training model scores |
| `trained_model/{task}/reward` | Post-training model scores |
| `delta/{task}/reward` | Improvement over the base model |
| `plots/*` | Comparison charts as W&B images |
| TRL defaults | Loss, learning rate, reward mean/std |
## Expected Results

### Before Training (base Qwen 1.7B, no fine-tuning)

The base model can sometimes output JSON, but it has no API testing strategy:

```
basic_validation:    ~0.15 (random-level)
edge_cases:          ~0.08
security_workflows:  ~0.03
```

### After GRPO (50 episodes, 200 steps)

The model learns systematic testing patterns:

```
basic_validation:    ~0.55-0.65
edge_cases:          ~0.35-0.45
security_workflows:  ~0.25-0.35
```
## What the Model Learns

- Output format – Always produce valid JSON (format reward)
- Coverage – Test different endpoints; don't repeat the same request
- Dependency chaining – POST to create, then GET/PUT/DELETE the created resource
- Bug patterns – Try non-existent IDs, missing fields, invalid emails
- Auth workflows – Log in first, use tokens in subsequent requests
- Security testing – Try cross-user access, injection payloads
## Extending the Training

### Add a new reward signal

Edit rewards.py:

```python
def efficiency_reward_fn(completions: list[str], **kwargs) -> list[float]:
    """Reward for concise, focused actions (penalize wasted steps)."""
    rewards = []
    for text in completions:
        action = parse_action(text)
        if action and action.expected_status:
            rewards.append(0.5)  # Bonus for predicting the expected status
        else:
            rewards.append(0.0)
    return rewards
```
Then add it to the combined reward in grpo.py.
### Add a new baseline agent

Edit agents.py:

```python
class CoverageAgent:
    """Agent that prioritizes hitting every endpoint once."""
    name = "coverage"

    def __init__(self):
        self.tested = set()
    # ...
```
Then add it to the AGENTS dict.
### Use a different model

```bash
# Qwen 2.5 (smaller, faster)
python -m training.grpo --model-id Qwen/Qwen2.5-1.5B

# Llama 3 (if you have access)
python -m training.grpo --model-id meta-llama/Llama-3.2-1B
```

Any HuggingFace causal language model works; just make sure it supports chat templates.