# `compare/`: Base Model vs SFT Adapter Benchmark

[← back to main README](../README.md)

This directory holds the side-by-side benchmark that answers the only question that ultimately matters: **did SFT actually make the model better at the task?**

The benchmark compares the base [Qwen2.5-Coder-3B-Instruct](https://huggingface.co/unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit) against our published SFT adapter [Sizzing/aws-rl-sft-qwen25coder3b-adapter](https://huggingface.co/Sizzing/aws-rl-sft-qwen25coder3b-adapter) under two evaluation modes: a fast static dataset eval and a slow live-environment eval. Both write structured metrics so the deltas are explicit.

> ![Dataset comparison: base vs SFT (per-row scores)](../docs/figures/compare_dataset.png)
> ![RL-env comparison: base vs SFT (per-episode rewards)](../docs/figures/compare_rl_env.png)

---

## Table of contents

1. [What's compared](#1-whats-compared)
2. [Two evaluation modes](#2-two-evaluation-modes)
3. [Methodology](#3-methodology)
4. [Metrics reported](#4-metrics-reported)
5. [How to run](#5-how-to-run)
6. [Reading the results](#6-reading-the-results)
7. [Files in this directory](#7-files-in-this-directory)

---

## 1. What's compared

| | Base | SFT |
|---|---|---|
| **Model**          | `unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit` | Same base + LoRA adapter |
| **Adapter**        | None                                          | `Sizzing/aws-rl-sft-qwen25coder3b-adapter` |
| **Training data**  | Pretraining + Qwen instruction tuning         | + 1,500 rows from [data/sft/aws_rl_sft.train.jsonl](../data/sft/aws_rl_sft.train.jsonl) |
| **Inference**      | Same prompt template, same temperature        | Identical                                    |

The only variable is the LoRA adapter. Same base, same prompts, same decoding parameters, same evaluation set.

---

## 2. Two evaluation modes

The notebook runs two separate evaluations because they answer different questions:

### Dataset eval (static)

| Question  | Does the model emit the *canonical* command for held-out prompts, one-shot? |
|-----------|-----------------------------------------------------------------------------|
| Speed     | Fast (~minutes)                                                             |
| Needs     | HF token + dataset access; **no env server**                                |
| Source    | [data/sft/aws_rl_sft.val.jsonl](../data/sft/aws_rl_sft.val.jsonl) (150 held-out rows) |
| Verifies  | Format correctness + command-token match against canonical                  |

This is the same kind of pattern-matching benchmark as [data/sft/MODEL_EVALUATION.md](../data/sft/MODEL_EVALUATION.md): fast and deterministic. It is useful as a regression check.

### RL env eval (live)

| Question  | Can the model actually *solve* a task end-to-end against a live environment? |
|-----------|------------------------------------------------------------------------------|
| Speed     | Slow (~tens of minutes per model)                                            |
| Needs     | Dataset eval above + a running env server (HF Space or local)                |
| Source    | Same val tasks, but exercised through `client.AwsRlEnv` round-trips          |
| Verifies  | Multi-step task completion, partial progress, reward shaping, hint usage     |

This is closer to what training optimizes for. A model can score well on dataset eval (right command on step 1) but fail RL env eval (can't recover from a step 1 typo, can't continue past the first turn). Both signals matter.

---

## 3. Methodology

### Dataset eval

1. Load `Sizzing/aws-rl-sft` dataset from HF Hub
2. For each row in `val`, build the prompt from `messages[:-1]` (system + user, drop assistant)
3. Generate the model's response (`max_new_tokens=128`, deterministic decoding)
4. **Extract the AWS CLI line**: strip markdown fences, find first line starting with `aws `
5. Score against `messages[-1].content` (the canonical assistant response):
   - Format OK (extracted line starts with `aws`)
   - Service match (same first word after `aws`)
   - Operation match (same first two words)
   - Exact match (full token-for-token equality)

This mirrors the methodology in [eval_lm_studio_models.py](../data/eval_lm_studio_models.py); the same scoring functions are reused.
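
The extraction and scoring steps above can be sketched as follows. This is a minimal illustration, not the code from `eval_lm_studio_models.py`: the function names `extract_aws_line` and `score` are hypothetical, and tokenizing on whitespace is an assumption about what "token-for-token" means here.

```python
def extract_aws_line(response: str) -> str:
    """Return the first line that starts with 'aws ' once fences are stripped, else ''."""
    for raw in response.splitlines():
        # Dropping surrounding backticks handles both fence lines and inline code.
        line = raw.strip().strip("`").strip()
        if line.startswith("aws "):
            return line
    return ""


def score(response: str, canonical: str) -> dict:
    """Compare a model response against the canonical assistant message."""
    pred = extract_aws_line(response).split()
    gold = (extract_aws_line(canonical) or canonical).split()
    return {
        "format_ok": bool(pred),                             # an 'aws ...' line was found
        "svc_match": len(pred) > 1 and pred[1] == gold[1:2][0:1].__getitem__(0) if len(gold) > 1 else False,
        "op_match": len(pred) > 2 and pred[1:3] == gold[1:3],  # service + operation
        "exact_match": bool(pred) and pred == gold,          # every token equal
    }


fenced = "```bash\naws s3api create-bucket --bucket demo\n```"
print(score(fenced, "aws s3api create-bucket --bucket demo"))
```

Each row yields four booleans; the percentages reported by the benchmark would then be the mean of each flag over the validation rows.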

### RL env eval

1. Connect to the running env at `ENV_BASE_URL` (default: an HF Space; can be overridden to local)
2. For each val task, run a full episode (up to `MAX_STEPS=15` turns):
   - Build the prompt from system + task + observation history (matches [inference.py](../inference.py))
   - Generate one AWS CLI command per turn
   - Step the environment, record `reward`, `task_achieved`, `partial_progress`
3. Aggregate per-episode metrics

The agent loop is identical to the training-time `rollout_one_episode` in [train_grpo.py](../train_grpo.py): same prompt structure, same generation parameters, same termination logic. The RL env eval therefore measures what this model would do during a GRPO rollout.
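
The shape of that loop can be sketched as follows. Everything except `MAX_STEPS` and the three recorded fields is an assumption: the `reset`/`step` interface, `StepResult`, `EpisodeRecord`, and `StubEnv` are hypothetical stand-ins, not the actual `client.AwsRlEnv` API or the `rollout_one_episode` implementation.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_STEPS = 15  # turn budget per episode, as stated in the methodology


@dataclass
class StepResult:
    observation: str
    reward: float
    task_achieved: bool
    partial_progress: float


@dataclass
class EpisodeRecord:
    total_reward: float = 0.0
    steps: int = 0
    task_achieved: bool = False
    max_progress: float = 0.0
    commands: list = field(default_factory=list)


def run_episode(env, task: str, generate: Callable[[str, list], str],
                max_steps: int = MAX_STEPS) -> EpisodeRecord:
    """Run one episode: generate a command, step the env, record the outcome."""
    record = EpisodeRecord()
    history = [env.reset(task)]  # observation history fed back into the prompt
    for _ in range(max_steps):
        command = generate(task, history)
        result = env.step(command)
        record.commands.append(command)
        record.steps += 1
        record.total_reward += result.reward
        record.max_progress = max(record.max_progress, result.partial_progress)
        history.append(result.observation)
        if result.task_achieved:
            record.task_achieved = True
            break
    return record


class StubEnv:
    """Toy environment (hypothetical): the task is achieved on the second command."""

    def reset(self, task: str) -> str:
        self.calls = 0
        return f"task: {task}"

    def step(self, command: str) -> StepResult:
        self.calls += 1
        done = self.calls >= 2
        return StepResult("ok", 0.5, done, 1.0 if done else 0.5)


record = run_episode(StubEnv(), "create a bucket", lambda task, history: "aws s3 ls")
print(record)
```

Passing the generator in as a callable keeps the loop identical for the base model and the SFT adapter, which is the property the comparison depends on.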

---

## 4. Metrics reported

### Dataset eval

| Metric         | Definition                                                |
|----------------|-----------------------------------------------------------|
| `format_ok`    | % of responses where the extracted line starts with `aws ` |
| `svc_match`    | % matching the canonical service                           |
| `op_match`     | % matching service + operation                             |
| `exact_match`  | % matching the full canonical command token-for-token      |

### RL env eval (per episode)

| Metric                  | Definition                                                       |
|-------------------------|------------------------------------------------------------------|
| `avg_episode_reward`    | Mean total reward accumulated per episode (sum of step rewards)  |
| `completion_rate`       | % of episodes ending in `task_achieved=True`                     |
| `avg_steps_to_complete` | Mean steps used by completed episodes (lower = more efficient)   |
| `avg_max_progress`      | Mean of the highest `partial_progress` reached per episode       |
| `hint_usage_rate`       | % of episodes where the agent requested at least one hint        |
| `format_failure_rate`   | % of agent commands that failed the `aws ` prefix gate           |

The notebook produces per-tier breakdowns of all six metrics so you can see where SFT helped most (typically: warmup format-locking goes from ~85% → 100%; intermediate completion goes from a small base to a meaningful fraction).
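
As a rough sketch of how the six per-episode metrics could be aggregated from episode records: the record keys (`total_reward`, `steps`, `task_achieved`, `max_progress`, `hints`, `format_failures`) are hypothetical names chosen for this example, not the notebook's actual schema.

```python
from statistics import mean


def aggregate(episodes: list) -> dict:
    """Reduce a non-empty list of per-episode dicts to the six reported metrics."""
    completed = [e for e in episodes if e["task_achieved"]]
    total_commands = sum(e["steps"] for e in episodes)
    total_failures = sum(e["format_failures"] for e in episodes)
    return {
        "avg_episode_reward": mean(e["total_reward"] for e in episodes),
        "completion_rate": 100.0 * len(completed) / len(episodes),
        # Only completed episodes count toward steps-to-complete.
        "avg_steps_to_complete": mean(e["steps"] for e in completed) if completed else None,
        "avg_max_progress": mean(e["max_progress"] for e in episodes),
        "hint_usage_rate": 100.0 * sum(1 for e in episodes if e["hints"] > 0) / len(episodes),
        # Failure rate is per command, not per episode.
        "format_failure_rate": 100.0 * total_failures / total_commands if total_commands else 0.0,
    }


episodes = [
    {"total_reward": 2.0, "steps": 4, "task_achieved": True,
     "max_progress": 1.0, "hints": 0, "format_failures": 0},
    {"total_reward": 1.0, "steps": 6, "task_achieved": False,
     "max_progress": 0.5, "hints": 1, "format_failures": 1},
]
metrics = aggregate(episodes)
print(metrics)
```

A per-tier breakdown would apply the same function to the episodes of each tier separately.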

---

## 5. How to run

### Prerequisites

- HuggingFace token (`HF_TOKEN`), needed to load the dataset and adapter
- A running env server, either:
  - Your own HF Space deployment (set `ENV_BASE_URL` accordingly), or
  - Local server: `make run` from the repo root, then `ENV_BASE_URL=http://localhost:8000`
- A GPU runtime (Colab T4 or better, A10/A100 ideal)

### Notebooks

| Notebook                                                            | Open in Colab                  |
|---------------------------------------------------------------------|--------------------------------|
| [compare_base_vs_sft.ipynb](compare_base_vs_sft.ipynb) (clean)      | <!-- TODO: paste Colab URL --> |
| [compare_base_vs_sft_with_outputs.ipynb](compare_base_vs_sft_with_outputs.ipynb) (with outputs) | <!-- TODO: paste Colab URL --> |

The two notebooks are functionally identical; the second has cell outputs preserved (18 display widgets, 26 stdout cells) for offline inspection.

### Running steps

1. Open the notebook in Colab (or local Jupyter)
2. Edit the **CONFIG** cell:
   ```python
   BASE_MODEL        = "unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit"
   SFT_ADAPTER_REPO  = "Sizzing/aws-rl-sft-qwen25coder3b-adapter"
   DATASET_REPO      = "Sizzing/aws-rl-sft"
   ENV_BASE_URL      = "https://your-hf-space.hf.space"   # or local
   ```
3. Run all cells. Part 1 (dataset eval) finishes first; Part 2 (RL env eval) is the slow one.
4. Compare the per-metric deltas between base and SFT.

---

## 6. Reading the results

### Actual numbers from the run

From the saved outputs of [compare_base_vs_sft_with_outputs.ipynb](compare_base_vs_sft_with_outputs.ipynb):

#### Dataset eval

| Metric                    | Base   | Base + SFT | Δ          |
|---------------------------|:------:|:----------:|:----------:|
| `format_pct`              | 33.3%  | **100.0%** | **+66.7 pp** |
| `format_after_extract_pct`| 100.0% | 100.0%     | 0          |
| `exact_pct`               | 38.9%  | **88.9%**  | **+50.0 pp** |

#### RL env eval (live multi-step agent loop)

| Metric                  | Base  | Base + SFT | Δ         |
|-------------------------|:-----:|:----------:|:---------:|
| `avg_episode_reward`    | 1.187 | **2.011**  | **+0.824** |
| `reward_std`            | 1.137 | 1.908      | +0.771    |
| `avg_steps`             | 8.600 | **5.733**  | **−2.867** |
| `avg_reward_per_step`   | 0.138 | **0.351**  | **+0.213** |

> ![RL-env eval: base vs SFT](../docs/figures/rl_env_eval_base_vs_sft.png)

The agent **earns more reward per episode while taking fewer steps**, which is exactly what good fine-tuning should produce. Reward per step rises about 2.5× (0.138 to 0.351) because (a) the agent picks the right command more often (fewer wasted steps), and (b) format compliance is now perfect (no more `aws help` fallbacks).

#### Per-tier success in the RL eval

From the notebook's per-rollout traces (3 episodes per tier × 5 tiers = 15 episodes per model):

| Tier         | Base (rollouts ✓ / 3) | Base + SFT (rollouts ✓ / 3) |
|--------------|:---------------------:|:----------------------------:|
| warmup       | 3                     | 3                            |
| beginner     | 3                     | 3                            |
| intermediate | 1                     | 3                            |
| advanced     | 0                     | 1                            |
| expert       | 0                     | 2                            |

SFT moves the **success frontier** up two tiers: the base model could not finish a single advanced or expert episode, while SFT completes 2 of 3 expert tasks (S3 lockdown, IAM least-privilege variants) within 5 steps.

### What counts as a meaningful delta?

The val set is small (150 rows / ~10 unique tasks per RL eval), so individual percentage points carry substantial noise. Rules of thumb:

| Delta size | Significance                                   |
|------------|------------------------------------------------|
| ±2pp       | Within noise; don't claim improvement          |
| 5–10pp     | Likely real, look at per-tier breakdown        |
| >10pp      | Almost certainly real                          |

The deltas above (66.7 pp, 50.0 pp on dataset; 0.82 reward / −2.9 steps on RL eval) are well above the noise floor.
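
One way to sanity-check the rules of thumb is a binomial standard error on each percentage. The sketch below is not part of the notebook; it is a standard approximation that treats rows as independent and ignores that both models are scored on the same rows, so it should be read as a rough guide only. The function names are hypothetical.

```python
import math


def proportion_stderr_pp(p: float, n: int) -> float:
    """Standard error, in percentage points, of a proportion p measured on n rows."""
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


def delta_stderr_pp(p_base: float, p_sft: float, n: int) -> float:
    """Standard error of the base-vs-SFT difference, assuming independent samples."""
    return math.hypot(proportion_stderr_pp(p_base, n), proportion_stderr_pp(p_sft, n))


# Example with the reported exact-match rates and the stated 150-row val set.
print(f"{delta_stderr_pp(0.389, 0.889, 150):.1f} pp")
```

A delta several times larger than this standard error is unlikely to be sampling noise; a delta of similar size is not distinguishable from it.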

### Going further with GRPO

Once the SFT adapter is in hand, the same comparison can be re-run against a GRPO adapter. Multi-step results from our reference GRPO run are documented in the [main README §11](../README.md#11-results--benchmarks); the short version is that GRPO@35-steps preserves SFT performance and modestly improves the middle tiers, while the expert tier remains the bottleneck.

---

## 7. Files in this directory

| File                                                                                                | Purpose                                                          |
|-----------------------------------------------------------------------------------------------------|------------------------------------------------------------------|
| [compare_base_vs_sft.ipynb](compare_base_vs_sft.ipynb)                                              | Side-by-side dataset + RL env benchmark, clean version           |
| [compare_base_vs_sft_with_outputs.ipynb](compare_base_vs_sft_with_outputs.ipynb)                    | Same notebook with cell outputs preserved (18 display widgets)   |

---

## See also

- [Main README](../README.md): top-level overview, results section
- [data/README.md](../data/README.md): dataset that drives this comparison
- [data/sft/MODEL_EVALUATION.md](../data/sft/MODEL_EVALUATION.md): base-model selection benchmark (same scoring functions reused here)
- [train/README.md](../train/README.md): how the SFT adapter being benchmarked here was produced
- [inference.py](../inference.py): single-model agent loop (the prototype the RL eval mode is modeled after)