# `compare/`: Base Model vs SFT Adapter Benchmark
[← back to main README](../README.md)
This directory holds the side-by-side benchmark that answers the only question that ultimately matters: **did SFT actually make the model better at the task?**
The benchmark compares the base [Qwen2.5-Coder-3B-Instruct](https://huggingface.co/unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit) against our published SFT adapter [Sizzing/aws-rl-sft-qwen25coder3b-adapter](https://huggingface.co/Sizzing/aws-rl-sft-qwen25coder3b-adapter) under two evaluation modes: a fast static dataset eval and a slow live-environment eval. Both write structured metrics so the deltas are explicit.
---
## Table of contents
1. [What's compared](#1-whats-compared)
2. [Two evaluation modes](#2-two-evaluation-modes)
3. [Methodology](#3-methodology)
4. [Metrics reported](#4-metrics-reported)
5. [How to run](#5-how-to-run)
6. [Reading the results](#6-reading-the-results)
7. [Files in this directory](#7-files-in-this-directory)
---
## 1. What's compared
| | Base | SFT |
|---|---|---|
| **Model** | `unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit` | Same base + LoRA adapter |
| **Adapter** | None | `Sizzing/aws-rl-sft-qwen25coder3b-adapter` |
| **Training data** | Pretraining + Qwen instruction tuning | + 1,500 rows from [data/sft/aws_rl_sft.train.jsonl](../data/sft/aws_rl_sft.train.jsonl) |
| **Inference** | Same prompt template, same temperature | Identical |
The only variable is the LoRA adapter. Same base, same prompts, same decoding parameters, same evaluation set.
---
## 2. Two evaluation modes
The notebook runs two separate evaluations because they answer different questions:
### Dataset eval (static)
| Question | Does the model emit the *canonical* command for held-out prompts, one-shot? |
|-----------|-----------------------------------------------------------------------------|
| Speed | Fast (~minutes) |
| Needs | HF token + dataset access; **no env server** |
| Source | [data/sft/aws_rl_sft.val.jsonl](../data/sft/aws_rl_sft.val.jsonl) (150 held-out rows) |
| Verifies | Format correctness + command-token match against canonical |
This is the same kind of pattern-matching benchmark as [data/sft/MODEL_EVALUATION.md](../data/sft/MODEL_EVALUATION.md): fast and deterministic. Useful as a regression check.
### RL env eval (live)
| Question | Can the model actually *solve* a task end-to-end against a live environment? |
|-----------|------------------------------------------------------------------------------|
| Speed | Slow (~tens of minutes per model) |
| Needs | Everything the dataset eval needs, plus a running env server (HF Space or local) |
| Source | Same val tasks, but exercised through `client.AwsRlEnv` round-trips |
| Verifies | Multi-step task completion, partial progress, reward shaping, hint usage |
This is closer to what training optimizes for. A model can score well on dataset eval (right command on step 1) but fail RL env eval (can't recover from a step 1 typo, can't continue past the first turn). Both signals matter.
---
## 3. Methodology
### Dataset eval
1. Load `Sizzing/aws-rl-sft` dataset from HF Hub
2. For each row in `val`, build the prompt from `messages[:-1]` (system + user, drop assistant)
3. Generate the model's response (`max_new_tokens=128`, deterministic decoding)
4. **Extract the AWS CLI line**: strip markdown fences, find first line starting with `aws `
5. Score against `messages[-1].content` (the canonical assistant response):
   - Format OK (extracted line starts with `aws `)
   - Service match (same first word after `aws`)
   - Operation match (same first two words after `aws`)
   - Exact match (full token-for-token equality)
This mirrors the methodology in [eval_lm_studio_models.py](../data/eval_lm_studio_models.py); the same scoring functions are reused.
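The extraction and scoring steps above can be sketched roughly as follows. This is an illustrative reimplementation, not the code from `eval_lm_studio_models.py`: the names `extract_aws_line` and `score_row` are hypothetical, and it assumes "operation match" means the first two words after `aws` agree.

```python
FENCE = "`" * 3  # markdown code-fence marker


def extract_aws_line(response: str) -> str:
    """Return the first line that starts with 'aws ', ignoring fence lines.

    Returns an empty string when no such line exists.
    """
    for raw in response.splitlines():
        line = raw.strip()
        if line.startswith(FENCE):
            continue  # skip opening/closing markdown fences
        line = line.strip("`").strip()  # tolerate inline backticks
        if line.startswith("aws "):
            return line
    return ""


def score_row(response: str, canonical: str) -> dict:
    """Score one model response against the canonical command."""
    pred = extract_aws_line(response).split()
    gold = canonical.strip().split()
    format_ok = bool(pred)  # non-empty means the line started with 'aws '
    return {
        "format_ok": format_ok,
        # service = first word after 'aws'
        "svc_match": format_ok and pred[1:2] == gold[1:2],
        # operation = first two words after 'aws' (assumed interpretation)
        "op_match": format_ok and pred[1:3] == gold[1:3],
        # every whitespace-separated token must agree
        "exact_match": format_ok and pred == gold,
    }
```

Each check is gated on `format_ok`, so a response with no `aws ` line scores `False` on every metric rather than matching by accident.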
### RL env eval
1. Connect to the running env at `ENV_BASE_URL` (default: an HF Space; can be overridden to local)
2. For each val task, run a full episode (up to `MAX_STEPS=15` turns):
   - Build the prompt from system + task + observation history (matches [inference.py](../inference.py))
   - Generate one AWS CLI command per turn
   - Step the environment, record `reward`, `task_achieved`, `partial_progress`
3. Aggregate per-episode metrics
The agent loop is identical to the training-time `rollout_one_episode` in [train_grpo.py](../train_grpo.py): same prompt structure, same generation parameters, same termination logic. The RL env eval therefore measures what this model would do during a GRPO rollout.
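The shape of one evaluation episode might look like the sketch below. The real `client.AwsRlEnv` interface is not shown in this README, so the `reset`/`step` methods, the `StepResult` fields, and the `generate` callback are all assumptions made for illustration. `ToyEnv` is a stand-in that lets the loop run without a server.

```python
from dataclasses import dataclass, field
from typing import Callable, List

MAX_STEPS = 15  # turn budget stated in the methodology


@dataclass
class StepResult:  # hypothetical shape of what the env returns per step
    observation: str
    reward: float
    task_achieved: bool
    partial_progress: float


@dataclass
class EpisodeRecord:
    total_reward: float = 0.0
    steps: int = 0
    task_achieved: bool = False
    max_progress: float = 0.0
    commands: List[str] = field(default_factory=list)


def run_episode(env, generate: Callable[[str, List[str]], str], task: str,
                max_steps: int = MAX_STEPS) -> EpisodeRecord:
    """Run one episode: generate a command, step the env, stop on success."""
    record = EpisodeRecord()
    history: List[str] = [env.reset(task)]  # observation history fed to the model
    for _ in range(max_steps):
        command = generate(task, history)  # one AWS CLI command per turn
        result = env.step(command)
        record.commands.append(command)
        record.steps += 1
        record.total_reward += result.reward
        record.max_progress = max(record.max_progress, result.partial_progress)
        history.append(result.observation)
        if result.task_achieved:
            record.task_achieved = True
            break
    return record


class ToyEnv:
    """Stand-in env: the task is achieved once the expected command arrives."""

    def __init__(self, expected: str):
        self.expected = expected

    def reset(self, task: str) -> str:
        return "ready"

    def step(self, command: str) -> StepResult:
        done = command == self.expected
        return StepResult("ok" if done else "error",
                          1.0 if done else 0.0, done, 1.0 if done else 0.0)
```

Running both models through the same `run_episode` function with only `generate` swapped keeps the adapter as the single variable.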
---
## 4. Metrics reported
### Dataset eval
| Metric | Definition |
|----------------|-----------------------------------------------------------|
| `format_ok` | % of responses where the extracted line starts with `aws ` |
| `svc_match` | % matching the canonical service |
| `op_match` | % matching service + operation |
| `exact_match` | % matching the full canonical command token-for-token |
### RL env eval (per episode)
| Metric | Definition |
|-------------------------|------------------------------------------------------------------|
| `avg_episode_reward` | Mean total reward accumulated per episode (sum of step rewards) |
| `completion_rate` | % of episodes ending in `task_achieved=True` |
| `avg_steps_to_complete` | Mean steps used by completed episodes (lower = more efficient) |
| `avg_max_progress` | Mean of the highest `partial_progress` reached per episode |
| `hint_usage_rate` | % of episodes where the agent requested at least one hint |
| `format_failure_rate` | % of agent commands that failed the `aws ` prefix gate |
The notebook produces per-tier breakdowns of all six metrics so you can see where SFT helped most (typically: warmup format-locking goes from ~85% → 100%; intermediate completion goes from a small base to a meaningful fraction).
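As a rough illustration of how the six per-episode metrics and the per-tier breakdown could be derived, the sketch below aggregates a list of episode records. The record keys (`tier`, `total_reward`, `steps`, `task_achieved`, `max_progress`, `hints_used`, `format_failures`) are assumptions, not the notebook's actual schema, and it assumes one command per step.

```python
from statistics import mean


def aggregate_episodes(episodes: list[dict]) -> dict:
    """Collapse per-episode records into the six RL env eval metrics."""
    n = len(episodes)
    completed = [e for e in episodes if e["task_achieved"]]
    total_commands = sum(e["steps"] for e in episodes)  # one command per step
    return {
        "avg_episode_reward": mean(e["total_reward"] for e in episodes),
        "completion_rate": 100.0 * len(completed) / n,
        # undefined when nothing completed, so report None rather than 0
        "avg_steps_to_complete": mean(e["steps"] for e in completed) if completed else None,
        "avg_max_progress": mean(e["max_progress"] for e in episodes),
        "hint_usage_rate": 100.0 * sum(e["hints_used"] > 0 for e in episodes) / n,
        "format_failure_rate": (
            100.0 * sum(e["format_failures"] for e in episodes) / total_commands
            if total_commands else 0.0
        ),
    }


def per_tier(episodes: list[dict]) -> dict:
    """Group episodes by tier and aggregate each group separately."""
    groups: dict = {}
    for e in episodes:
        groups.setdefault(e["tier"], []).append(e)
    return {tier: aggregate_episodes(group) for tier, group in groups.items()}
```

Note that `avg_steps_to_complete` only averages over completed episodes, so it can look better for a model that completes fewer (easier) tasks; read it alongside `completion_rate`.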
---
## 5. How to run
### Prerequisites
- HuggingFace token (`HF_TOKEN`): needed to load the dataset and adapter
- A running env server, either:
  - Your own HF Space deployment (set `ENV_BASE_URL` accordingly), or
  - Local server: `make run` from the repo root, then `ENV_BASE_URL=http://localhost:8000`
- A GPU runtime (Colab T4 or better, A10/A100 ideal)
### Notebooks
| Notebook | Open in Colab |
|---------------------------------------------------------------------|--------------------------------|
| [compare_base_vs_sft.ipynb](compare_base_vs_sft.ipynb) (clean) | <!-- TODO: paste Colab URL --> |
| [compare_base_vs_sft_with_outputs.ipynb](compare_base_vs_sft_with_outputs.ipynb) (with outputs) | <!-- TODO: paste Colab URL --> |
The two notebooks are functionally identical; the second has cell outputs preserved (18 display widgets, 26 stdout cells) for offline inspection.
### Running steps
1. Open the notebook in Colab (or local Jupyter)
2. Edit the **CONFIG** cell:
```python
BASE_MODEL = "unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit"
SFT_ADAPTER_REPO = "Sizzing/aws-rl-sft-qwen25coder3b-adapter"
DATASET_REPO = "Sizzing/aws-rl-sft"
ENV_BASE_URL = "https://your-hf-space.hf.space" # or local
```
3. Run all cells. Part 1 (dataset eval) finishes first; Part 2 (RL env eval) is the slow one.
4. Compare the per-metric deltas between base and SFT.
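For the final step, a small helper along these lines can turn two metric dictionaries into per-metric deltas. The function name `metric_deltas` and the dictionary layout are hypothetical; the notebook may present the comparison differently.

```python
def metric_deltas(base: dict, sft: dict, ndigits: int = 3) -> dict:
    """Return sft - base for every metric present in both result dicts.

    For percentage metrics the result is in percentage points (pp).
    """
    return {
        name: round(sft[name] - base[name], ndigits)
        for name in base
        if name in sft
    }


# Example with the dataset-eval percentages reported in section 6.
base_metrics = {"format_pct": 33.3, "exact_pct": 38.9}
sft_metrics = {"format_pct": 100.0, "exact_pct": 88.9}
deltas = metric_deltas(base_metrics, sft_metrics)
```

Rounding avoids floating-point noise such as `50.00000000000001` leaking into the reported deltas.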
---
## 6. Reading the results
### Actual numbers from the run
From the saved outputs of [compare_base_vs_sft_with_outputs.ipynb](compare_base_vs_sft_with_outputs.ipynb):
#### Dataset eval
| Metric | Base | Base + SFT | Δ |
|---------------------------|:------:|:----------:|:----------:|
| `format_pct` | 33.3% | **100.0%** | **+66.7 pp** |
| `format_after_extract_pct`| 100.0% | 100.0% | 0 |
| `exact_pct` | 38.9% | **88.9%** | **+50.0 pp** |
#### RL env eval (live multi-step agent loop)
| Metric | Base | Base + SFT | Δ |
|-------------------------|:-----:|:----------:|:---------:|
| `avg_episode_reward` | 1.187 | **2.011** | **+0.824** |
| `reward_std` | 1.137 | 1.908 | +0.771 |
| `avg_steps` | 8.600 | **5.733** | **−2.867** |
| `avg_reward_per_step` | 0.138 | **0.351** | **+0.213** |
The agent **earns more reward per episode while taking fewer steps**, which is exactly what good fine-tuning should produce. Reward-per-step jumps 2.5× because (a) the agent picks the right command more often (fewer wasted steps), and (b) format compliance is now perfect (no more `aws help` fallbacks).
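The 2.5× figure can be checked from the table values. The sketch below also shows that the reported `avg_reward_per_step` values are consistent with dividing `avg_episode_reward` by `avg_steps`; whether the notebook computes the metric exactly this way (rather than averaging per-episode ratios) is an assumption.

```python
# Values copied from the RL env eval table above.
base = {"avg_episode_reward": 1.187, "avg_steps": 8.600, "avg_reward_per_step": 0.138}
sft = {"avg_episode_reward": 2.011, "avg_steps": 5.733, "avg_reward_per_step": 0.351}


def reward_per_step(metrics: dict) -> float:
    """Ratio of mean episode reward to mean episode length."""
    return metrics["avg_episode_reward"] / metrics["avg_steps"]


base_rps = round(reward_per_step(base), 3)
sft_rps = round(reward_per_step(sft), 3)

# Improvement factor quoted in the text.
speedup = round(sft["avg_reward_per_step"] / base["avg_reward_per_step"], 1)
```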
#### Per-tier success in the RL eval
From the notebook's per-rollout traces (3 episodes per tier × 5 tiers = 15 episodes per model):
| Tier | Base (rollouts ✓ / 3) | Base + SFT (rollouts ✓ / 3) |
|--------------|:---------------------:|:----------------------------:|
| warmup | 3 | 3 |
| beginner | 3 | 3 |
| intermediate | 1 | 3 |
| advanced | 0 | 1 |
| expert | 0 | 2 |
SFT moves the **success frontier** up two tiers: the base model could not finish a single advanced or expert episode, while SFT completes 2 of 3 expert tasks (S3 lockdown, IAM least-privilege variants) within 5 steps.
### What counts as a meaningful delta?
The val set is small (150 rows / ~10 unique tasks per RL eval), so individual percentage points have meaningful noise. Rules of thumb:
| Delta size | Significance |
|------------|------------------------------------------------|
| ±2pp | Within noise, don't claim improvement |
| 5–10pp | Likely real, look at per-tier breakdown |
| >10pp | Almost certainly real |
The deltas above (66.7 pp, 50.0 pp on dataset; 0.82 reward / −2.9 steps on RL eval) are well above the noise floor.
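One rough way to put a number on the noise floor is the binomial standard error of a proportion. This is a generic statistical sketch rather than something the notebook computes; `proportion_se_pp` is a hypothetical helper, and it assumes the evaluated rows are independent.

```python
import math


def proportion_se_pp(p: float, n: int) -> float:
    """Standard error of an observed proportion p over n rows, in percentage points."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a fraction between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


# Example: an exact-match rate of 88.9% measured on 150 val rows.
se = proportion_se_pp(0.889, 150)
```

The standard error shrinks with the square root of `n`, so a run on a subset of the val set is noticeably noisier than a run on all 150 rows.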
### Going further with GRPO
Once the SFT adapter is in hand, the same comparison can be re-run against a GRPO adapter. Multi-step results from our reference GRPO run are documented in the [main README §11](../README.md#11-results--benchmarks); the short version is that GRPO@35-steps preserves SFT performance and modestly improves the middle tiers, while the expert tier remains the bottleneck.
---
## 7. Files in this directory
| File | Purpose |
|-----------------------------------------------------------------------------------------------------|------------------------------------------------------------------|
| [compare_base_vs_sft.ipynb](compare_base_vs_sft.ipynb) | Side-by-side dataset + RL env benchmark (clean version) |
| [compare_base_vs_sft_with_outputs.ipynb](compare_base_vs_sft_with_outputs.ipynb) | Same notebook with cell outputs preserved (18 display widgets) |
---
## See also
- [Main README](../README.md): top-level overview, results section
- [data/README.md](../data/README.md): dataset that drives this comparison
- [data/sft/MODEL_EVALUATION.md](../data/sft/MODEL_EVALUATION.md): base-model selection benchmark (same scoring functions reused here)
- [train/README.md](../train/README.md): how the SFT adapter being benchmarked here was produced
- [inference.py](../inference.py): single-model agent loop (the prototype the RL eval mode is modeled after)