# Training Module

Everything related to training an AI agent to test APIs using GRPO (Group Relative Policy Optimization).

---

## Setup

```bash
cd api_testing_env

# Option 1: Automated setup (creates venv, installs everything)
bash setup.sh

# Option 2: Manual setup
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Optional: login to HuggingFace Hub (for model push)
huggingface-cli login

# Optional: login to Weights & Biases (for logging)
wandb login
```

### Environment Variables

Create a `.env` file in `api_testing_env/` (or export in your shell):

```bash
# .env

# HuggingFace Hub — required for --push-to-hub
# Get your token at: https://huggingface.co/settings/tokens
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Weights & Biases — required for --use-wandb
# Get your key at: https://wandb.ai/authorize
WANDB_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Optional: set W&B defaults
WANDB_PROJECT=api-testing-grpo
WANDB_ENTITY=your-team-name
```

**Three ways to provide these keys:**

| Method | Command |
|--------|---------|
| `.env` file | Create `.env` as shown above, then `source .env` before training |
| CLI login | `huggingface-cli login` and `wandb login` (stores keys in ~/.cache) |
| Inline export | `export HF_TOKEN=hf_xxx && export WANDB_API_KEY=xxx` |

> **Important:** Never commit `.env` to git. It's already in `.gitignore`.

---

## Quick Start

```bash
cd api_testing_env
source .venv/bin/activate

# 1. See what training prompts look like (no GPU needed)
SHOW_PROMPTS=1 python -m training.grpo

# 2. Quick sanity check (CPU, ~2 minutes)
python -m training.grpo --test-mode

# 3. Real training (GPU required)
python -m training.grpo --model-id Qwen/Qwen3-1.7B --num-episodes 100

# 4. With HuggingFace Hub push
python -m training.grpo \
  --push-to-hub --hf-repo-id your-username/api-tester-grpo

# 5. With Weights & Biases logging
python -m training.grpo \
  --use-wandb --wandb-project api-testing-grpo

# 6. Full pipeline: training + HF push + W&B
python -m training.grpo \
  --model-id Qwen/Qwen3-1.7B \
  --num-episodes 100 \
  --push-to-hub --hf-repo-id your-username/api-tester-grpo \
  --use-wandb --wandb-project api-testing-grpo

# 7. Run baseline agents only (no GPU needed)
python -m training.evaluate --task all --agent all --url http://localhost:8000

# 8. Resume from checkpoint
python -m training.grpo --model-id ./checkpoints/step_50
```

---

## How Training Works

There is **no external dataset**. The environment generates unique episodes on the fly.

```
                  ┌─────────────────────────────────────────────┐
                  │           GRPO Training Loop                │
                  │                                             │
  ┌───────────┐   │  1. env.reset(seed=N)                      │
  │           │   │     → unique users, tasks, data             │
  │  Qwen     │   │                                             │
  │  1.7B     │──▶│  2. LLM generates: {"method":"GET",...}     │
  │  + LoRA   │   │                                             │
  │           │◀──│  3. env.step(action) → reward               │
  └───────────┘   │     coverage + bugs + validity              │
                  │                                             │
                  │  4. GRPO: generate 4 attempts per prompt,   │
                  │     keep best, update model weights          │
                  │                                             │
                  │  5. Repeat with next seed                   │
                  └─────────────────────────────────────────────┘
```

### Why no dataset file?

Each `reset(seed=N)` creates a **unique database** with different users, tasks, and data:

| Seed | Users | Tasks |
|------|-------|-------|
| 42 | diana, alice, xander, ivan, hannah | 8 tasks |
| 99 | mike, george, tom, fiona | 6 tasks |
| 7 | priya, kevin, wendy | 4 tasks |

The agent can't memorize "login as alice" because alice might not exist. It must **read the observation and adapt** — that's the learning signal.

The bugs (13 planted flaws) are structural — same code flaws every episode — but the path to finding them changes because the data is different.

---

## Training Pipeline

The full training pipeline runs these steps automatically:

```
1. Run baseline agents (random, sequential, smart) across all tasks
        ↓
2. Load base model (Qwen 1.7B)
        ↓
3. Evaluate base model before training (establishes LLM baseline)
        ↓
4. GRPO training with LoRA
        ↓
5. Save model locally to --output-dir
        ↓
6. Push to HuggingFace Hub (if --push-to-hub)
        ↓
7. Evaluate trained model after GRPO
        ↓
8. Print comparison table (baselines vs base vs trained)
        ↓
9. Save metrics (JSON + markdown) to output-dir/metrics/
        ↓
10. Save comparison plots (PNG) to output-dir/metrics/plots/
        ↓
11. Finalize W&B run (if --use-wandb)
```

---

## File Guide

| File | Purpose | When to modify |
|------|---------|----------------|
| `prompts.py` | System prompt, `format_observation()`, `parse_action()` | Change how the LLM sees tasks or formats actions |
| `rewards.py` | `format_reward_fn()`, `environment_reward_fn()` | Tune reward scaling or add new reward signals |
| `agents.py` | `RandomAgent`, `SequentialAgent`, `SmartAgent` | Add new baseline strategies |
| `grpo.py` | `build_training_prompts()`, `train_grpo()` | Change training hyperparameters or model |
| `evaluate.py` | `run_rollout()`, `run_baseline_local()`, remote runner | Change evaluation logic |

### prompts.py

The bridge between the environment and the LLM.

**`SYSTEM_PROMPT`** — Instructions telling the LLM it's an API tester. Includes output format (JSON) and testing strategies.

**`format_observation(obs)`** — Converts an environment observation into text:
- First turn: full API spec + task description + available users
- Later turns: last response + feedback + progress stats + auth tokens

**`parse_action(text)`** — Extracts JSON from LLM output. Handles:
- Raw JSON: `{"method": "GET", "endpoint": "/tasks"}`
- Code blocks: `` ```json {...} ``` ``
- Extra text around JSON: `"I'll try: {...}"`

### rewards.py

Two reward functions that GRPO uses to score each LLM completion:

**`format_reward_fn`** — Binary: +1.0 if valid JSON action, -1.0 if not. Teaches the model to always output parseable actions.

**`environment_reward_fn`** — Runs the action in the environment and returns the actual reward (coverage + bugs + validity), scaled by 5.0 to dominate over format reward.

### agents.py

Three hand-coded baselines for comparison:

| Agent | Strategy | Expected Score |
|-------|----------|---------------|
| `RandomAgent` | Random method + random endpoint | ~0.10 |
| `SequentialAgent` | Fixed sequence: GET, POST, PUT, DELETE each endpoint | ~0.35 |
| `SmartAgent` | Multi-phase: discover → auth → CRUD → bug hunt → security | ~0.55 |

A GRPO-trained model should beat the SmartAgent.

### grpo.py

The main training script.

**`build_training_prompts(num_episodes)`** — Creates N prompts by resetting the environment with seeds 0..N. Each prompt is a chat message with system prompt + initial observation.

**`run_baseline_evaluation(seed)`** — Runs all three baseline agents across all tasks before training starts.

**`train_grpo(args)`** — Full GRPO loop:
1. Run baseline agents for comparison
2. Load model + tokenizer (Qwen 1.7B default)
3. Evaluate base model before training
4. Apply LoRA (r=16, alpha=32, targets q_proj + v_proj)
5. Generate prompts from environment
6. Create per-prompt environment instances for reward eval
7. Train with TRL's GRPOTrainer
8. Save model locally + push to HF Hub
9. Evaluate trained model + print comparison
10. Save metrics (JSON, markdown) and plots (PNG)
11. Finalize W&B run

**`save_metrics()`** — Saves `results.json` and `results.md` to `output-dir/metrics/`.

**`save_plots()`** — Generates three comparison bar charts (reward, bugs, coverage) saved as PNGs.

### evaluate.py

**`run_rollout(model, tokenizer, task_id, seed)`** — Runs one full episode with a HuggingFace model. Multi-turn: LLM generates action → env steps → LLM sees result → repeats.

**`run_baseline_local(agent_name, task_id, seed)`** — Runs baseline agents against the local environment (no server needed). Used by `grpo.py` to establish baselines before training.

**`run_episode(url, task_id, agent_cls)`** — Runs a baseline agent against a remote server via WebSocket.

---

## Training Hyperparameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--model-id` | `Qwen/Qwen3-1.7B` | Base model (any HF causal LM) |
| `--num-episodes` | 50 | Training prompts (more = more diverse episodes) |
| `--num-generations` | 4 | GRPO rollouts per prompt (higher = better but slower) |
| `--max-completion-length` | 256 | Max tokens per LLM response |
| `--max-steps` | 200 | Total training optimizer steps |
| `--learning-rate` | 2e-5 | AdamW learning rate |
| `--batch-size` | 1 | Per-device batch size |
| `--output-dir` | `./checkpoints/grpo_api_tester` | Where to save model |
| `--push-to-hub` | off | Push trained model to HuggingFace Hub |
| `--hf-repo-id` | none | HF Hub repo (e.g., `user/api-tester-grpo`) |
| `--use-wandb` | off | Enable Weights & Biases logging |
| `--wandb-project` | `api-testing-grpo` | W&B project name |
| `--wandb-run-name` | auto | W&B run name |
| `--test-mode` | off | Quick 3-episode, 2-gen, 5-step test |

### Hardware Requirements

| Setup | GPU | Time | Model |
|-------|-----|------|-------|
| Colab Free | T4 (16GB) | ~1-2 hours | Qwen 1.7B + 4-bit LoRA |
| Colab Pro | A100 (40GB) | ~30 min | Qwen 4B + LoRA |
| Local | Any 8GB+ | ~1-2 hours | Qwen 1.7B + 4-bit LoRA |
| CPU only | None | `--test-mode` only | Verifies pipeline works |

---

## Output Structure

After training, your output directory will look like:

```
checkpoints/grpo_api_tester/
├── adapter_config.json          # LoRA adapter config
├── adapter_model.safetensors    # Trained LoRA weights
├── tokenizer.json               # Tokenizer files
├── tokenizer_config.json
├── special_tokens_map.json
└── metrics/
    ├── results.json             # Full results (baselines + base + trained)
    ├── results.md               # Markdown comparison table
    └── plots/
        ├── reward_comparison.png   # Bar chart: reward across all agents
        ├── bugs_comparison.png     # Bar chart: bugs found
        └── coverage_comparison.png # Bar chart: API coverage %
```

---

## Weights & Biases Integration

When `--use-wandb` is enabled, the following is logged:

| Metric | Description |
|--------|-------------|
| `baseline/{agent}/{task}/reward` | Baseline agent scores |
| `base_model/{task}/reward` | Pre-training model scores |
| `trained_model/{task}/reward` | Post-training model scores |
| `delta/{task}/reward` | Improvement over base model |
| `plots/*` | Comparison charts as W&B images |
| TRL defaults | Loss, learning rate, reward mean/std |

---

## Expected Results

### Before Training (base Qwen 1.7B, no fine-tuning)

The base model can output JSON sometimes, but has no API testing strategy:
```
basic_validation:    ~0.15 (random-level)
edge_cases:          ~0.08
security_workflows:  ~0.03
```

### After GRPO (50 episodes, 200 steps)

The model learns systematic testing patterns:
```
basic_validation:    ~0.55-0.65
edge_cases:          ~0.35-0.45
security_workflows:  ~0.25-0.35
```

### What the Model Learns

1. **Output format** — Always produce valid JSON (format reward)
2. **Coverage** — Test different endpoints, don't repeat the same request
3. **Dependency chaining** — POST to create, then GET/PUT/DELETE the created resource
4. **Bug patterns** — Try non-existent IDs, missing fields, invalid emails
5. **Auth workflows** — Login first, use tokens in subsequent requests
6. **Security testing** — Try cross-user access, injection payloads

---

## Extending the Training

### Add a new reward signal

Edit `rewards.py`:

```python
def efficiency_reward_fn(completions: list[str], **kwargs) -> list[float]:
    """Reward for concise, focused actions (penalize wasted steps)."""
    rewards = []
    for text in completions:
        action = parse_action(text)
        if action and action.expected_status:
            rewards.append(0.5)  # Bonus for predicting expected status
        else:
            rewards.append(0.0)
    return rewards
```

Then add it to the combined reward in `grpo.py`.

### Add a new baseline agent

Edit `agents.py`:

```python
class CoverageAgent:
    """Agent that prioritizes hitting every endpoint once."""
    name = "coverage"

    def __init__(self):
        self.tested = set()
        # ...
```

Then add it to the `AGENTS` dict.

### Use a different model

```bash
# Qwen 2.5 (smaller, faster)
python -m training.grpo --model-id Qwen/Qwen2.5-1.5B

# Llama 3 (if you have access)
python -m training.grpo --model-id meta-llama/Llama-3.2-1B
```

Any HuggingFace causal language model works — just make sure it supports chat templates.