# `compare/` – Base Model vs SFT Adapter Benchmark
[← back to main README](../README.md)
This directory holds the side-by-side benchmark that answers the only question that ultimately matters: **did SFT actually make the model better at the task?**
The benchmark compares the base [Qwen2.5-Coder-3B-Instruct](https://huggingface.co/unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit) against our published SFT adapter [Sizzing/aws-rl-sft-qwen25coder3b-adapter](https://huggingface.co/Sizzing/aws-rl-sft-qwen25coder3b-adapter) under two evaluation modes – fast static dataset eval and slow live-environment eval. Both write structured metrics so the deltas are explicit.
> ![Dataset comparison: base vs SFT (per-row scores)](../docs/figures/compare_dataset.png)
> ![RL-env comparison: base vs SFT (per-episode rewards)](../docs/figures/compare_rl_env.png)
---
## Table of contents
1. [What's compared](#1-whats-compared)
2. [Two evaluation modes](#2-two-evaluation-modes)
3. [Methodology](#3-methodology)
4. [Metrics reported](#4-metrics-reported)
5. [How to run](#5-how-to-run)
6. [Reading the results](#6-reading-the-results)
7. [Files in this directory](#7-files-in-this-directory)
---
## 1. What's compared
| | Base | SFT |
|---|---|---|
| **Model** | `unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit` | Same base + LoRA adapter |
| **Adapter** | None | `Sizzing/aws-rl-sft-qwen25coder3b-adapter` |
| **Training data** | Pretraining + Qwen instruction tuning | + 1,500 rows from [data/sft/aws_rl_sft.train.jsonl](../data/sft/aws_rl_sft.train.jsonl) |
| **Inference** | Same prompt template, same temperature | Identical |
The only variable is the LoRA adapter. Same base, same prompts, same decoding parameters, same evaluation set.
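For concreteness, here is a minimal sketch of the two variants being compared, assuming the standard `transformers` + `peft` loading path (the notebook itself may load via Unsloth instead; the variable names here are illustrative):

```python
# Minimal sketch of the two variants (assumes transformers + peft + bitsandbytes
# are installed; the notebook may use unsloth's FastLanguageModel instead).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit"
SFT_ADAPTER_REPO = "Sizzing/aws-rl-sft-qwen25coder3b-adapter"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

# Variant 1: the 4-bit base model, no adapter.
base_model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, device_map="auto")

# Variant 2: identical base weights with the LoRA adapter attached.
# Prompts, decoding parameters, and the eval set are held constant across both.
sft_model = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(BASE_MODEL, device_map="auto"),
    SFT_ADAPTER_REPO,
)
```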
---
## 2. Two evaluation modes
The notebook runs two separate evaluations because they answer different questions:
### Dataset eval (static)
| Question | Does the model emit the *canonical* command for held-out prompts, one-shot? |
|-----------|-----------------------------------------------------------------------------|
| Speed | Fast (~minutes) |
| Needs | HF token + dataset access; **no env server** |
| Source | [data/sft/aws_rl_sft.val.jsonl](../data/sft/aws_rl_sft.val.jsonl) (150 held-out rows) |
| Verifies | Format correctness + command-token match against canonical |
This is the same kind of pattern-matching benchmark as [data/sft/MODEL_EVALUATION.md](../data/sft/MODEL_EVALUATION.md) – fast and deterministic. Useful as a regression check.
### RL env eval (live)
| Question | Can the model actually *solve* a task end-to-end against a live environment? |
|-----------|------------------------------------------------------------------------------|
| Speed | Slow (~tens of minutes per model) |
| Needs | Dataset eval above + a running env server (HF Space or local) |
| Source | Same val tasks, but exercised through `client.AwsRlEnv` round-trips |
| Verifies | Multi-step task completion, partial progress, reward shaping, hint usage |
This is closer to what training optimizes for. A model can score well on dataset eval (right command on step 1) but fail RL env eval (can't recover from a step 1 typo, can't continue past the first turn). Both signals matter.
---
## 3. Methodology
### Dataset eval
1. Load `Sizzing/aws-rl-sft` dataset from HF Hub
2. For each row in `val`, build the prompt from `messages[:-1]` (system + user, drop assistant)
3. Generate the model's response (`max_new_tokens=128`, deterministic decoding)
4. **Extract the AWS CLI line**: strip markdown fences, find first line starting with `aws `
5. Score against `messages[-1].content` (the canonical assistant response):
- Format OK (extracted line starts with `aws`)
- Service match (same first word after `aws`)
- Operation match (same first two words)
- Exact match (full token-for-token equality)
This mirrors the methodology in [eval_lm_studio_models.py](../data/eval_lm_studio_models.py); the same scoring functions are reused.
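A compact sketch of that extract-and-score step (the helper names here are hypothetical; the notebook reuses the real functions from the script above):

```python
# Hypothetical helpers mirroring the extraction + scoring steps above; the
# notebook reuses the actual functions from ../data/eval_lm_studio_models.py.

def extract_aws_line(response: str) -> str | None:
    """Strip markdown fences and return the first line starting with 'aws '."""
    for line in response.splitlines():
        line = line.strip().strip("`")
        if line.startswith("aws "):
            return line
    return None

def score(response: str, canonical: str) -> dict:
    """Score an extracted command against the canonical assistant response."""
    line = extract_aws_line(response)
    if line is None:
        return dict.fromkeys(
            ["format_ok", "svc_match", "op_match", "exact_match"], False)
    got, want = line.split(), canonical.split()
    return {
        "format_ok": True,
        "svc_match": got[:2] == want[:2],  # 'aws' + service
        "op_match": got[:3] == want[:3],   # 'aws' + service + operation
        "exact_match": got == want,        # full token-for-token equality
    }
```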
### RL env eval
1. Connect to the running env at `ENV_BASE_URL` (default: an HF Space; can be overridden to local)
2. For each val task, run a full episode (up to `MAX_STEPS=15` turns):
- Build the prompt from system + task + observation history (matches [inference.py](../inference.py))
- Generate one AWS CLI command per turn
- Step the environment, record `reward`, `task_achieved`, `partial_progress`
3. Aggregate per-episode metrics
The agent loop is identical to the training-time `rollout_one_episode` in [train_grpo.py](../train_grpo.py) – same prompt structure, same generation parameters, same termination logic. So the RL env eval is genuinely measuring "what would this model do during a GRPO rollout".
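A skeleton of that loop is below. The `AwsRlEnv` constructor/`reset`/`step` signatures are assumptions (check the client module for the real interface), and `build_prompt`/`generate_command` are hypothetical stand-ins for the notebook's prompt and decoding helpers:

```python
# Sketch of one eval episode, mirroring rollout_one_episode in ../train_grpo.py.
from client import AwsRlEnv

MAX_STEPS = 15

def run_episode(model, tokenizer, task: str, env_base_url: str) -> dict:
    env = AwsRlEnv(base_url=env_base_url)   # assumed constructor kwarg
    obs = env.reset(task=task)              # assumed signature
    history: list[tuple[str, dict]] = []
    total_reward, achieved, max_progress, steps = 0.0, False, 0.0, 0

    for steps in range(1, MAX_STEPS + 1):
        # Prompt = system + task + observation history (as in inference.py).
        prompt = build_prompt(task, history)                  # hypothetical helper
        command = generate_command(model, tokenizer, prompt)  # one CLI line/turn
        obs = env.step(command)                               # assumed signature
        total_reward += obs["reward"]
        max_progress = max(max_progress, obs.get("partial_progress", 0.0))
        history.append((command, obs))
        if obs.get("task_achieved"):
            achieved = True
            break

    return {"reward": total_reward, "achieved": achieved,
            "steps": steps, "max_progress": max_progress}
```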
---
## 4. Metrics reported
### Dataset eval
| Metric | Definition |
|----------------|-----------------------------------------------------------|
| `format_ok` | % of responses where the extracted line starts with `aws ` |
| `svc_match` | % matching the canonical service |
| `op_match` | % matching service + operation |
| `exact_match` | % matching the full canonical command token-for-token |
### RL env eval (per episode)
| Metric | Definition |
|-------------------------|------------------------------------------------------------------|
| `avg_episode_reward` | Mean total reward accumulated per episode (sum of step rewards) |
| `completion_rate` | % of episodes ending in `task_achieved=True` |
| `avg_steps_to_complete` | Mean steps used by completed episodes (lower = more efficient) |
| `avg_max_progress` | Mean of the highest `partial_progress` reached per episode |
| `hint_usage_rate` | % of episodes where the agent requested at least one hint |
| `format_failure_rate` | % of agent commands that failed the `aws ` prefix gate |
The notebook produces per-tier breakdowns of all six metrics so you can see where SFT helped most (typically: warmup format-locking goes from ~85% → 100%; intermediate completion goes from a small base to a meaningful fraction).
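A sketch of how per-episode records could fold into those six metrics (the field names are illustrative, mirroring the table rather than the notebook's exact keys):

```python
# Aggregate per-episode records into the six RL-env metrics above.
# Field names are hypothetical; adapt to the notebook's actual record keys.
from statistics import mean

def aggregate(episodes: list[dict]) -> dict:
    done = [e for e in episodes if e["achieved"]]
    return {
        "avg_episode_reward": mean(e["reward"] for e in episodes),
        "completion_rate": len(done) / len(episodes),
        "avg_steps_to_complete": mean(e["steps"] for e in done) if done else float("nan"),
        "avg_max_progress": mean(e["max_progress"] for e in episodes),
        "hint_usage_rate": mean(e["hints_used"] > 0 for e in episodes),
        "format_failure_rate": (sum(e["format_failures"] for e in episodes)
                                / sum(e["n_commands"] for e in episodes)),
    }
```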
---
## 5. How to run
### Prerequisites
- HuggingFace token (`HF_TOKEN`) – needed to load the dataset and adapter
- A running env server – either:
  - Your own HF Space deployment (set `ENV_BASE_URL` accordingly), or
  - Local server: `make run` from the repo root, then `ENV_BASE_URL=http://localhost:8000` (see the reachability sketch after this list)
- A GPU runtime (Colab T4 or better, A10/A100 ideal)
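Before kicking off the slow RL-env eval, a quick reachability check on the env server can save time. This sketch assumes the server answers a plain GET at its base URL; swap in the real health route if the server defines one:

```python
# Optional sanity check: confirm the env server is reachable before Part 2.
import requests

ENV_BASE_URL = "http://localhost:8000"  # or your HF Space URL

resp = requests.get(ENV_BASE_URL, timeout=10)
print(ENV_BASE_URL, "->", resp.status_code)
```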
### Notebooks
| Notebook | Open in Colab |
|---------------------------------------------------------------------|--------------------------------|
| [compare_base_vs_sft.ipynb](compare_base_vs_sft.ipynb) (clean) | <!-- TODO: paste Colab URL --> |
| [compare_base_vs_sft_with_outputs.ipynb](compare_base_vs_sft_with_outputs.ipynb) (with outputs) | <!-- TODO: paste Colab URL --> |
The two notebooks are functionally identical; the second has cell outputs preserved (18 display widgets, 26 stdout cells) for offline inspection.
### Running steps
1. Open the notebook in Colab (or local Jupyter)
2. Edit the **CONFIG** cell:
```python
BASE_MODEL = "unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit"
SFT_ADAPTER_REPO = "Sizzing/aws-rl-sft-qwen25coder3b-adapter"
DATASET_REPO = "Sizzing/aws-rl-sft"
ENV_BASE_URL = "https://your-hf-space.hf.space" # or local
```
3. Run all cells. Part 1 (dataset eval) finishes first; Part 2 (RL env eval) is the slow one.
4. Compare the per-metric deltas between base and SFT.
---
## 6. Reading the results
### Actual numbers from the run
From the saved outputs of [compare_base_vs_sft_with_outputs.ipynb](compare_base_vs_sft_with_outputs.ipynb):
#### Dataset eval
| Metric | Base | Base + SFT | Δ |
|---------------------------|:------:|:----------:|:----------:|
| `format_pct` | 33.3% | **100.0%** | **+66.7 pp** |
| `format_after_extract_pct`| 100.0% | 100.0% | 0 |
| `exact_pct` | 38.9% | **88.9%** | **+50.0 pp** |
#### RL env eval (live multi-step agent loop)
| Metric | Base | Base + SFT | Δ |
|-------------------------|:-----:|:----------:|:---------:|
| `avg_episode_reward` | 1.187 | **2.011** | **+0.824** |
| `reward_std` | 1.137 | 1.908 | +0.771 |
| `avg_steps` | 8.600 | **5.733** | **−2.867** |
| `avg_reward_per_step` | 0.138 | **0.351** | **+0.213** |
> ![RL-env eval: base vs SFT](../docs/figures/rl_env_eval_base_vs_sft.png)
The agent **earns more reward per episode while taking fewer steps** – exactly what good fine-tuning should produce. Reward-per-step jumps 2.5× because (a) the agent picks the right command more often (fewer wasted steps), and (b) format compliance is now perfect (no more `aws help` fallbacks).
#### Per-tier success in the RL eval
From the notebook's per-rollout traces (3 episodes per tier × 5 tiers = 15 episodes per model):
| Tier | Base (rollouts ✓ / 3) | Base + SFT (rollouts ✓ / 3) |
|--------------|:---------------------:|:----------------------------:|
| warmup | 3 | 3 |
| beginner | 3 | 3 |
| intermediate | 1 | 3 |
| advanced | 0 | 1 |
| expert | 0 | 2 |
SFT moves the **success frontier** up two tiers – the base model could not finish a single advanced or expert episode, while SFT completes 2 of 3 expert tasks (S3 lockdown, IAM least-privilege variants) within 5 steps.
### What counts as a meaningful delta?
The val set is small (150 rows for the dataset eval; ~10 unique tasks per RL eval), so individual percentage-point differences carry meaningful noise. Rules of thumb:
| Delta size | Significance |
|------------|------------------------------------------------|
| ±2pp | Within noise – don't claim improvement |
| 5–10pp | Likely real, look at per-tier breakdown |
| >10pp | Almost certainly real |
The deltas above (66.7 pp, 50.0 pp on dataset; 0.82 reward / −2.9 steps on RL eval) are well above the noise floor.
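Those rules of thumb line up with a quick binomial back-of-envelope (a sketch assuming independent rows, not the notebook's own analysis):

```python
# Noise floor for the 150-row dataset eval: the standard error of a
# proportion p measured on n independent rows is sqrt(p * (1 - p) / n).
from math import sqrt

n, p = 150, 0.5              # p = 0.5 is the worst case (maximum variance)
se = sqrt(p * (1 - p) / n)
print(f"1 SE ~= {se:.1%}")   # ~4.1%, so a +-2 pp delta sits inside one SE
```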
### Going further with GRPO
Once the SFT adapter is in hand, the same comparison can be re-run against a GRPO adapter. Multi-step results from our reference GRPO run are documented in the [main README §11](../README.md#11-results--benchmarks); the short version: GRPO@35-steps preserves SFT performance and modestly improves the middle tiers, while the expert tier remains the bottleneck.
---
## 7. Files in this directory
| File | Purpose |
|-----------------------------------------------------------------------------------------------------|------------------------------------------------------------------|
| [compare_base_vs_sft.ipynb](compare_base_vs_sft.ipynb) | Side-by-side dataset + RL env benchmark – clean version |
| [compare_base_vs_sft_with_outputs.ipynb](compare_base_vs_sft_with_outputs.ipynb) | Same notebook with cell outputs preserved (18 display widgets) |
---
## See also
- [Main README](../README.md) – top-level overview, results section
- [data/README.md](../data/README.md) – dataset that drives this comparison
- [data/sft/MODEL_EVALUATION.md](../data/sft/MODEL_EVALUATION.md) – base-model selection benchmark (same scoring functions reused here)
- [train/README.md](../train/README.md) – how the SFT adapter being benchmarked here was produced
- [inference.py](../inference.py) – single-model agent loop (the prototype the RL eval mode is modeled after)