
Training Module

Everything related to training an AI agent to test APIs using GRPO (Group Relative Policy Optimization).


Setup

cd api_testing_env

# Option 1: Automated setup (creates venv, installs everything)
bash setup.sh

# Option 2: Manual setup
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Optional: login to HuggingFace Hub (for model push)
huggingface-cli login

# Optional: login to Weights & Biases (for logging)
wandb login

Environment Variables

Create a .env file in api_testing_env/ (or export in your shell):

# .env

# HuggingFace Hub — required for --push-to-hub
# Get your token at: https://huggingface.co/settings/tokens
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Weights & Biases — required for --use-wandb
# Get your key at: https://wandb.ai/authorize
WANDB_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Optional: set W&B defaults
WANDB_PROJECT=api-testing-grpo
WANDB_ENTITY=your-team-name

Three ways to provide these keys:

| Method | Command |
|---|---|
| .env file | Create .env as shown above, then run `set -a; source .env; set +a` before training (plain `source` sets the variables but does not export them to the training process) |
| CLI login | `huggingface-cli login` and `wandb login` (stores keys in ~/.cache) |
| Inline export | `export HF_TOKEN=hf_xxx && export WANDB_API_KEY=xxx` |
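If you'd rather load the keys from Python at startup, here is a minimal stdlib-only loader. It is a simplification of what python-dotenv does (no quoting or multi-line values) and is only a sketch, not part of the repo:

```python
import os

def load_dotenv(path: str = ".env") -> dict[str, str]:
    """Load simple KEY=value lines from a .env file into os.environ.

    Skips blank lines and comments. Does not handle quoting or
    multi-line values; use python-dotenv for full .env syntax.
    """
    loaded = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            loaded[key.strip()] = value.strip()
    os.environ.update(loaded)
    return loaded
```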

Important: Never commit .env to git. It's already in .gitignore.


Quick Start

cd api_testing_env
source .venv/bin/activate

# 1. See what training prompts look like (no GPU needed)
SHOW_PROMPTS=1 python -m training.grpo

# 2. Quick sanity check (CPU, ~2 minutes)
python -m training.grpo --test-mode

# 3. Real training (GPU required)
python -m training.grpo --model-id Qwen/Qwen3-1.7B --num-episodes 100

# 4. With HuggingFace Hub push
python -m training.grpo \
  --push-to-hub --hf-repo-id your-username/api-tester-grpo

# 5. With Weights & Biases logging
python -m training.grpo \
  --use-wandb --wandb-project api-testing-grpo

# 6. Full pipeline: training + HF push + W&B
python -m training.grpo \
  --model-id Qwen/Qwen3-1.7B \
  --num-episodes 100 \
  --push-to-hub --hf-repo-id your-username/api-tester-grpo \
  --use-wandb --wandb-project api-testing-grpo

# 7. Run baseline agents only (no GPU needed)
python -m training.evaluate --task all --agent all --url http://localhost:8000

# 8. Resume from checkpoint
python -m training.grpo --model-id ./checkpoints/step_50

How Training Works

There is no external dataset. The environment generates unique episodes on the fly.

                  ┌─────────────────────────────────────────────┐
                  │           GRPO Training Loop                │
                  │                                             │
  ┌───────────┐   │  1. env.reset(seed=N)                       │
  │           │   │     → unique users, tasks, data             │
  │  Qwen     │   │                                             │
  │  1.7B     │──▶│  2. LLM generates: {"method":"GET",...}     │
  │  + LoRA   │   │                                             │
  │           │◀──│  3. env.step(action) → reward               │
  └───────────┘   │     coverage + bugs + validity              │
                  │                                             │
                  │  4. GRPO: generate 4 attempts per prompt,   │
                  │     keep best, update model weights         │
                  │                                             │
                  │  5. Repeat with next seed                   │
                  └─────────────────────────────────────────────┘

Why no dataset file?

Each reset(seed=N) creates a unique database with different users, tasks, and data:

| Seed | Users | Tasks |
|---|---|---|
| 42 | diana, alice, xander, ivan, hannah | 8 tasks |
| 99 | mike, george, tom, fiona | 6 tasks |
| 7 | priya, kevin, wendy | 4 tasks |

The agent can't memorize "login as alice" because alice might not exist. It must read the observation and adapt — that's the learning signal.

The bugs (13 planted flaws) are structural — the same code flaws appear in every episode — but the path to finding them changes because the data is different.
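The seeding behavior can be illustrated with a toy generator. The name pool and ranges below are made up for illustration, not the environment's actual data:

```python
import random

NAME_POOL = ["alice", "bob", "diana", "fiona", "george", "hannah",
             "ivan", "kevin", "mike", "priya", "tom", "wendy", "xander"]

def generate_episode(seed: int) -> dict:
    """Deterministically derive an episode's users and task count from a seed."""
    rng = random.Random(seed)
    num_users = rng.randint(3, 5)
    return {
        "users": rng.sample(NAME_POOL, num_users),
        "num_tasks": rng.randint(4, 8),
    }

# The same seed always reproduces the same episode.
assert generate_episode(42) == generate_episode(42)
```

Because every per-episode draw comes from `random.Random(seed)`, replaying a seed replays the exact episode, while new seeds yield new users and data.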


Training Pipeline

The full training pipeline runs these steps automatically:

1. Run baseline agents (random, sequential, smart) across all tasks
        ↓
2. Load base model (Qwen 1.7B)
        ↓
3. Evaluate base model before training (establishes LLM baseline)
        ↓
4. GRPO training with LoRA
        ↓
5. Save model locally to --output-dir
        ↓
6. Push to HuggingFace Hub (if --push-to-hub)
        ↓
7. Evaluate trained model after GRPO
        ↓
8. Print comparison table (baselines vs base vs trained)
        ↓
9. Save metrics (JSON + markdown) to output-dir/metrics/
        ↓
10. Save comparison plots (PNG) to output-dir/metrics/plots/
        ↓
11. Finalize W&B run (if --use-wandb)

File Guide

| File | Purpose | When to modify |
|---|---|---|
| prompts.py | System prompt, format_observation(), parse_action() | Change how the LLM sees tasks or formats actions |
| rewards.py | format_reward_fn(), environment_reward_fn() | Tune reward scaling or add new reward signals |
| agents.py | RandomAgent, SequentialAgent, SmartAgent | Add new baseline strategies |
| grpo.py | build_training_prompts(), train_grpo() | Change training hyperparameters or the model |
| evaluate.py | run_rollout(), run_baseline_local(), remote runner | Change evaluation logic |

prompts.py

The bridge between the environment and the LLM.

SYSTEM_PROMPT — Instructions telling the LLM it's an API tester. Includes the output format (JSON) and testing strategies.

format_observation(obs) — Converts an environment observation into text:

  • First turn: full API spec + task description + available users
  • Later turns: last response + feedback + progress stats + auth tokens

parse_action(text) — Extracts JSON from LLM output. Handles:

  • Raw JSON: {"method": "GET", "endpoint": "/tasks"}
  • Code blocks: ```json {...} ```
  • Extra text around JSON: "I'll try: {...}"

rewards.py

Two reward functions that GRPO uses to score each LLM completion:

format_reward_fn — Binary: +1.0 for a valid JSON action, -1.0 otherwise. Teaches the model to always output parseable actions.

environment_reward_fn — Runs the action in the environment and returns the actual reward (coverage + bugs + validity), scaled by 5.0 so it dominates the format reward.
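A self-contained sketch of the binary format reward. The real implementation parses with parse_action; a bare json.loads plus a minimal "method" check stands in here:

```python
import json

def format_reward_fn(completions: list[str], **kwargs) -> list[float]:
    """Binary format reward: +1.0 for a parseable JSON action, -1.0 otherwise."""
    rewards = []
    for text in completions:
        try:
            action = json.loads(text)
            # Minimal validity check: must be an object with a "method" key.
            valid = isinstance(action, dict) and "method" in action
        except json.JSONDecodeError:
            valid = False
        rewards.append(1.0 if valid else -1.0)
    return rewards
```

The all-or-nothing ±1 signal is what pushes the model to emit parseable actions on every turn before the environment reward can shape strategy.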

agents.py

Three hand-coded baselines for comparison:

| Agent | Strategy | Expected Score |
|---|---|---|
| RandomAgent | Random method + random endpoint | ~0.10 |
| SequentialAgent | Fixed sequence: GET, POST, PUT, DELETE for each endpoint | ~0.35 |
| SmartAgent | Multi-phase: discover → auth → CRUD → bug hunt → security | ~0.55 |

A GRPO-trained model should beat the SmartAgent.
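A toy version of the SequentialAgent sweep. The endpoint list and the act() signature are illustrative, not the repo's actual interface:

```python
from itertools import product

class SequentialAgent:
    """Cycle through GET/POST/PUT/DELETE for each endpoint in a fixed order."""
    name = "sequential"

    def __init__(self, endpoints: list[str]):
        # Precompute the full sweep: all four methods per endpoint, in order.
        self._plan = [
            {"method": method, "endpoint": endpoint}
            for endpoint, method in product(endpoints, ["GET", "POST", "PUT", "DELETE"])
        ]
        self._step = 0

    def act(self, observation: dict) -> dict:
        """Return the next action in the fixed plan, wrapping around at the end."""
        action = self._plan[self._step % len(self._plan)]
        self._step += 1
        return action
```

The observation is ignored entirely, which is exactly why this baseline plateaus: it covers endpoints but never chains dependencies or hunts bugs.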

grpo.py

The main training script.

build_training_prompts(num_episodes) — Creates N prompts by resetting the environment with seeds 0..N-1. Each prompt is a chat message with the system prompt plus the initial observation.

run_baseline_evaluation(seed) — Runs all three baseline agents across all tasks before training starts.

train_grpo(args) — Full GRPO loop:

  1. Run baseline agents for comparison
  2. Load model + tokenizer (Qwen 1.7B default)
  3. Evaluate base model before training
  4. Apply LoRA (r=16, alpha=32, targets q_proj + v_proj)
  5. Generate prompts from environment
  6. Create per-prompt environment instances for reward eval
  7. Train with TRL's GRPOTrainer
  8. Save model locally + push to HF Hub
  9. Evaluate trained model + print comparison
  10. Save metrics (JSON, markdown) and plots (PNG)
  11. Finalize W&B run
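Step 4's LoRA configuration would look roughly like this with peft. The r, alpha, and target modules come from the list above; every other argument is an assumption:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                                 # LoRA rank
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections only
    task_type="CAUSAL_LM",
)
# model = get_peft_model(model, lora_config)  # wraps the loaded base model
```

Targeting only q_proj and v_proj keeps the trainable parameter count small, which is what makes the 16GB-GPU setups in the hardware table feasible.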

save_metrics() — Saves results.json and results.md to output-dir/metrics/.

save_plots() — Generates three comparison bar charts (reward, bugs, coverage) saved as PNGs.

evaluate.py

run_rollout(model, tokenizer, task_id, seed) — Runs one full episode with a HuggingFace model. Multi-turn: the LLM generates an action → the env steps → the LLM sees the result → repeat.

run_baseline_local(agent_name, task_id, seed) — Runs baseline agents against the local environment (no server needed). Used by grpo.py to establish baselines before training.

run_episode(url, task_id, agent_cls) — Runs a baseline agent against a remote server via WebSocket.
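The multi-turn rollout can be sketched with stand-in stubs. StubModel and StubEnv below are placeholders for the real model wrapper and environment, not the repo's interfaces:

```python
def run_rollout(model, env, max_turns: int = 10) -> float:
    """Alternate model action generation and environment steps until done."""
    observation = env.reset()
    total_reward = 0.0
    for _ in range(max_turns):
        action = model.generate(observation)        # LLM proposes the next request
        observation, reward, done = env.step(action)  # env executes it, scores it
        total_reward += reward
        if done:
            break
    return total_reward

class StubEnv:
    """Placeholder environment: fixed reward per step, ends after 3 steps."""
    def reset(self):
        self.steps = 0
        return {"spec": "..."}

    def step(self, action):
        self.steps += 1
        return {"last": action}, 0.1, self.steps >= 3

class StubModel:
    """Placeholder model that always issues the same GET request."""
    def generate(self, observation):
        return {"method": "GET", "endpoint": "/tasks"}
```

The key point is that the observation fed to the model each turn comes from the previous env.step, so the model must condition on real API responses rather than a fixed script.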


Training Hyperparameters

| Parameter | Default | Description |
|---|---|---|
| --model-id | Qwen/Qwen3-1.7B | Base model (any HF causal LM) |
| --num-episodes | 50 | Training prompts (more = more diverse episodes) |
| --num-generations | 4 | GRPO rollouts per prompt (higher = better but slower) |
| --max-completion-length | 256 | Max tokens per LLM response |
| --max-steps | 200 | Total optimizer steps |
| --learning-rate | 2e-5 | AdamW learning rate |
| --batch-size | 1 | Per-device batch size |
| --output-dir | ./checkpoints/grpo_api_tester | Where to save the model |
| --push-to-hub | off | Push trained model to HuggingFace Hub |
| --hf-repo-id | none | HF Hub repo (e.g., user/api-tester-grpo) |
| --use-wandb | off | Enable Weights & Biases logging |
| --wandb-project | api-testing-grpo | W&B project name |
| --wandb-run-name | auto | W&B run name |
| --test-mode | off | Quick 3-episode, 2-generation, 5-step test |

Hardware Requirements

| Setup | GPU | Time | Model |
|---|---|---|---|
| Colab Free | T4 (16GB) | ~1-2 hours | Qwen 1.7B + 4-bit LoRA |
| Colab Pro | A100 (40GB) | ~30 min | Qwen 4B + LoRA |
| Local | Any 8GB+ | ~1-2 hours | Qwen 1.7B + 4-bit LoRA |
| CPU only | None | --test-mode only | Verifies the pipeline works |

Output Structure

After training, your output directory will look like:

checkpoints/grpo_api_tester/
├── adapter_config.json          # LoRA adapter config
├── adapter_model.safetensors    # Trained LoRA weights
├── tokenizer.json               # Tokenizer files
├── tokenizer_config.json
├── special_tokens_map.json
└── metrics/
    ├── results.json             # Full results (baselines + base + trained)
    ├── results.md               # Markdown comparison table
    └── plots/
        ├── reward_comparison.png   # Bar chart: reward across all agents
        ├── bugs_comparison.png     # Bar chart: bugs found
        └── coverage_comparison.png # Bar chart: API coverage %

Weights & Biases Integration

When --use-wandb is enabled, the following is logged:

| Metric | Description |
|---|---|
| baseline/{agent}/{task}/reward | Baseline agent scores |
| base_model/{task}/reward | Pre-training model scores |
| trained_model/{task}/reward | Post-training model scores |
| delta/{task}/reward | Improvement over the base model |
| plots/* | Comparison charts as W&B images |
| TRL defaults | Loss, learning rate, reward mean/std |

Expected Results

Before Training (base Qwen 1.7B, no fine-tuning)

The base model can output JSON sometimes, but has no API testing strategy:

basic_validation:    ~0.15 (random-level)
edge_cases:          ~0.08
security_workflows:  ~0.03

After GRPO (50 episodes, 200 steps)

The model learns systematic testing patterns:

basic_validation:    ~0.55-0.65
edge_cases:          ~0.35-0.45
security_workflows:  ~0.25-0.35

What the Model Learns

  1. Output format — Always produce valid JSON (format reward)
  2. Coverage — Test different endpoints, don't repeat the same request
  3. Dependency chaining — POST to create, then GET/PUT/DELETE the created resource
  4. Bug patterns — Try non-existent IDs, missing fields, invalid emails
  5. Auth workflows — Login first, use tokens in subsequent requests
  6. Security testing — Try cross-user access, injection payloads

Extending the Training

Add a new reward signal

Edit rewards.py:

def efficiency_reward_fn(completions: list[str], **kwargs) -> list[float]:
    """Reward for concise, focused actions (penalize wasted steps)."""
    rewards = []
    for text in completions:
        action = parse_action(text)
        if action and action.expected_status:
            rewards.append(0.5)  # Bonus for predicting expected status
        else:
            rewards.append(0.0)
    return rewards

Then add it to the combined reward in grpo.py.

Add a new baseline agent

Edit agents.py:

class CoverageAgent:
    """Agent that prioritizes hitting every endpoint once."""
    name = "coverage"

    def __init__(self):
        self.tested = set()
        # ...

Then add it to the AGENTS dict.

Use a different model

# Qwen 2.5 (smaller, faster)
python -m training.grpo --model-id Qwen/Qwen2.5-1.5B

# Llama 3 (if you have access)
python -m training.grpo --model-id meta-llama/Llama-3.2-1B

Any HuggingFace causal language model works — just make sure its tokenizer defines a chat template.