# `compare/` — Base Model vs SFT Adapter Benchmark

[← back to main README](../README.md)

This directory holds the side-by-side benchmark that answers the only question that ultimately matters: **did SFT actually make the model better at the task?**

The benchmark compares the base [Qwen2.5-Coder-3B-Instruct](https://huggingface.co/unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit) against our published SFT adapter [Sizzing/aws-rl-sft-qwen25coder3b-adapter](https://huggingface.co/Sizzing/aws-rl-sft-qwen25coder3b-adapter) under two evaluation modes — a fast static dataset eval and a slow live-environment eval. Both write structured metrics so the deltas are explicit.

> ![Dataset comparison: base vs SFT (per-row scores)](../docs/figures/compare_dataset.png)
> ![RL-env comparison: base vs SFT (per-episode rewards)](../docs/figures/compare_rl_env.png)

---

## Table of contents

1. [What's compared](#1-whats-compared)
2. [Two evaluation modes](#2-two-evaluation-modes)
3. [Methodology](#3-methodology)
4. [Metrics reported](#4-metrics-reported)
5. [How to run](#5-how-to-run)
6. [Reading the results](#6-reading-the-results)
7. [Files in this directory](#7-files-in-this-directory)

---

## 1. What's compared

|  | Base | SFT |
|---|---|---|
| **Model** | `unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit` | Same base + LoRA adapter |
| **Adapter** | None | `Sizzing/aws-rl-sft-qwen25coder3b-adapter` |
| **Training data** | Pretraining + Qwen instruction tuning | + 1,500 rows from [data/sft/aws_rl_sft.train.jsonl](../data/sft/aws_rl_sft.train.jsonl) |
| **Inference** | Same prompt template, same temperature | Identical |

The only variable is the LoRA adapter. Same base, same prompts, same decoding parameters, same evaluation set.

---

## 2. Two evaluation modes

The notebook runs two separate evaluations because they answer different questions.

### Dataset eval (static)

| Question | Does the model emit the *canonical* command for held-out prompts, one-shot? |
|----------|-----------------------------------------------------------------------------|
| Speed    | Fast (~minutes) |
| Needs    | HF token + dataset access; **no env server** |
| Source   | [data/sft/aws_rl_sft.val.jsonl](../data/sft/aws_rl_sft.val.jsonl) (150 held-out rows) |
| Verifies | Format correctness + command-token match against canonical |

This is the same kind of pattern-matching benchmark as [data/sft/MODEL_EVALUATION.md](../data/sft/MODEL_EVALUATION.md) — fast and deterministic. Useful as a regression check.

### RL env eval (live)

| Question | Can the model actually *solve* a task end-to-end against a live environment? |
|----------|------------------------------------------------------------------------------|
| Speed    | Slow (~tens of minutes per model) |
| Needs    | Everything the dataset eval needs, plus a running env server (HF Space or local) |
| Source   | Same val tasks, but exercised through `client.AwsRlEnv` round-trips |
| Verifies | Multi-step task completion, partial progress, reward shaping, hint usage |

This is closer to what training optimizes for. A model can score well on dataset eval (right command on step 1) but fail RL env eval (can't recover from a step-1 typo, can't continue past the first turn). Both signals matter.

---

## 3. Methodology

### Dataset eval

1. Load the `Sizzing/aws-rl-sft` dataset from the HF Hub
2. For each row in `val`, build the prompt from `messages[:-1]` (system + user, drop assistant)
3. Generate the model's response (`max_new_tokens=128`, deterministic decoding)
4. **Extract the AWS CLI line**: strip markdown fences, find the first line starting with `aws `
5. Score against `messages[-1].content` (the canonical assistant response):
   - Format OK (extracted line starts with `aws`)
   - Service match (same first word after `aws`)
   - Operation match (same first two words)
   - Exact match (full token-for-token equality)

This mirrors the methodology in [eval_lm_studio_models.py](../data/eval_lm_studio_models.py); the same scoring functions are reused.
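For concreteness, here is a minimal sketch of the extract-and-score logic from steps 4–5. The function names and the fence-stripping regex are illustrative, not the actual identifiers in [eval_lm_studio_models.py](../data/eval_lm_studio_models.py):

```python
import re
from typing import Optional

def extract_aws_line(response: str) -> Optional[str]:
    """Strip markdown fences, then return the first line starting with 'aws '."""
    text = re.sub(r"```\w*", "", response)  # drops ``` and ```bash-style fences
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("aws "):
            return line
    return None

def score_row(response: str, canonical: str) -> dict:
    """Score one generated response against the canonical assistant command."""
    line = extract_aws_line(response)
    got = (line or "").split()
    want = canonical.split()
    return {
        "format_ok":   line is not None,     # an 'aws ' line was extracted
        "svc_match":   got[:2] == want[:2],  # same 'aws <service>'
        "op_match":    got[:3] == want[:3],  # same 'aws <service> <operation>'
        "exact_match": got == want,          # full token-for-token equality
    }
```

Because the scoring is pure string matching on whitespace tokens, re-running it over the same generations is fully deterministic — which is what makes this mode usable as a regression gate.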
### RL env eval

1. Connect to the running env at `ENV_BASE_URL` (default: an HF Space; can be overridden to local)
2. For each val task, run a full episode (up to `MAX_STEPS=15` turns):
   - Build the prompt from system + task + observation history (matches [inference.py](../inference.py))
   - Generate one AWS CLI command per turn
   - Step the environment, record `reward`, `task_achieved`, `partial_progress`
3. Aggregate per-episode metrics

The agent loop is identical to the training-time `rollout_one_episode` in [train_grpo.py](../train_grpo.py) — same prompt structure, same generation parameters, same termination logic. So the RL env eval is genuinely measuring "what would this model do during a GRPO rollout".
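In code, the loop looks roughly like the sketch below. The `env.reset()`/`env.step()` signatures, the `SYSTEM_PROMPT` constant, and the `build_prompt` helper are assumptions made for illustration; the authoritative implementation is `rollout_one_episode` in [train_grpo.py](../train_grpo.py).

```python
MAX_STEPS = 15  # same cap as the training-time rollout
SYSTEM_PROMPT = "You are an AWS CLI agent."  # placeholder; the real prompt matches inference.py

def build_prompt(task: dict, history: list) -> str:
    """Hypothetical stand-in for the prompt builder in inference.py:
    system + task description + full observation history."""
    lines = [SYSTEM_PROMPT, f"Task: {task['description']}"]
    for command, obs in history:
        lines += [f"$ {command}", str(obs.get("observation", ""))]
    return "\n".join(lines)

def run_episode(env, generate, task: dict) -> dict:
    """One eval episode. `env` is assumed to expose reset()/step() in the
    style of client.AwsRlEnv; `generate` maps a prompt to one CLI command."""
    env.reset(task["id"])  # assumed signature
    history, total_reward, max_progress = [], 0.0, 0.0
    achieved, steps = False, 0
    for steps in range(1, MAX_STEPS + 1):
        command = generate(build_prompt(task, history))  # one AWS CLI command per turn
        obs = env.step(command)                          # assumed: dict with reward fields
        total_reward += obs["reward"]
        max_progress = max(max_progress, obs.get("partial_progress", 0.0))
        history.append((command, obs))
        if obs.get("task_achieved"):                     # terminate on success
            achieved = True
            break
    return {"reward": total_reward, "task_achieved": achieved,
            "steps": steps, "max_progress": max_progress}
```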
---

## 4. Metrics reported

### Dataset eval

| Metric | Definition |
|---|---|
| `format_ok` | % of responses where the extracted line starts with `aws ` |
| `svc_match` | % matching the canonical service |
| `op_match` | % matching service + operation |
| `exact_match` | % matching the full canonical command token-for-token |

### RL env eval (per episode)

| Metric | Definition |
|---|---|
| `avg_episode_reward` | Mean total reward accumulated per episode (sum of step rewards) |
| `completion_rate` | % of episodes ending in `task_achieved=True` |
| `avg_steps_to_complete` | Mean steps used by completed episodes (lower = more efficient) |
| `avg_max_progress` | Mean of the highest `partial_progress` reached per episode |
| `hint_usage_rate` | % of episodes where the agent requested at least one hint |
| `format_failure_rate` | % of agent commands that failed the `aws ` prefix gate |

The notebook produces per-tier breakdowns of all six metrics so you can see where SFT helped most (typically: warmup format-locking goes from ~85% → 100%; intermediate completion goes from a small base to a meaningful fraction).

---

## 5. How to run

### Prerequisites

- A HuggingFace token (`HF_TOKEN`) — needed to load the dataset and adapter
- A running env server — either:
  - your own HF Space deployment (set `ENV_BASE_URL` accordingly), or
  - a local server: `make run` from the repo root, then `ENV_BASE_URL=http://localhost:8000`
- A GPU runtime (Colab T4 or better; A10/A100 ideal)

### Notebooks

| Notebook | Open in Colab |
|---|---|
| [compare_base_vs_sft.ipynb](compare_base_vs_sft.ipynb) (clean) | |
| [compare_base_vs_sft_with_outputs.ipynb](compare_base_vs_sft_with_outputs.ipynb) (with outputs) | |

The two notebooks are functionally identical; the second has cell outputs preserved (18 display widgets, 26 stdout cells) for offline inspection.

### Running steps

1. Open the notebook in Colab (or local Jupyter)
2. Edit the **CONFIG** cell:

   ```python
   BASE_MODEL = "unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit"
   SFT_ADAPTER_REPO = "Sizzing/aws-rl-sft-qwen25coder3b-adapter"
   DATASET_REPO = "Sizzing/aws-rl-sft"
   ENV_BASE_URL = "https://your-hf-space.hf.space"  # or local
   ```

3. Run all cells. Part 1 (dataset eval) finishes first; Part 2 (RL env eval) is the slow one.
4. Compare the per-metric deltas between base and SFT.

---

## 6. Reading the results

### Actual numbers from the run

From the saved outputs of [compare_base_vs_sft_with_outputs.ipynb](compare_base_vs_sft_with_outputs.ipynb):

#### Dataset eval

| Metric | Base | Base + SFT | Δ |
|---|:---:|:---:|:---:|
| `format_pct` | 33.3% | **100.0%** | **+66.7 pp** |
| `format_after_extract_pct` | 100.0% | 100.0% | 0 |
| `exact_pct` | 38.9% | **88.9%** | **+50.0 pp** |

#### RL env eval (live multi-step agent loop)

| Metric | Base | Base + SFT | Δ |
|---|:---:|:---:|:---:|
| `avg_episode_reward` | 1.187 | **2.011** | **+0.824** |
| `reward_std` | 1.137 | 1.908 | +0.771 |
| `avg_steps` | 8.600 | **5.733** | **−2.867** |
| `avg_reward_per_step` | 0.138 | **0.351** | **+0.213** |

> ![RL-env eval: base vs SFT](../docs/figures/rl_env_eval_base_vs_sft.png)

The agent **earns more reward per episode while taking fewer steps** — exactly what good fine-tuning should produce. Reward-per-step jumps 2.5× because (a) the agent picks the right command more often (fewer wasted steps), and (b) format compliance is now perfect (no more `aws help` fallbacks).

#### Per-tier success in the RL eval

From the notebook's per-rollout traces (3 episodes per tier × 5 tiers = 15 episodes per model):

| Tier | Base (rollouts ✓ / 3) | Base + SFT (rollouts ✓ / 3) |
|---|:---:|:---:|
| warmup | 3 | 3 |
| beginner | 3 | 3 |
| intermediate | 1 | 3 |
| advanced | 0 | 1 |
| expert | 0 | 2 |

SFT moves the **success frontier** up two tiers — the base model could not finish a single advanced or expert episode, while SFT completes 2 of 3 expert tasks (S3 lockdown, IAM least-privilege variants) within 5 steps.

### What counts as a meaningful delta?

The val set is small (150 rows for the dataset eval; roughly 10 unique tasks in the RL eval), so individual percentage points carry meaningful noise. Rules of thumb:

| Delta size | Significance |
|---|---|
| ±2 pp | Within noise — don't claim improvement |
| 5–10 pp | Likely real; look at the per-tier breakdown |
| >10 pp | Almost certainly real |

The deltas above (+66.7 pp and +50.0 pp on the dataset eval; +0.82 reward and −2.9 steps on the RL eval) are well above the noise floor.
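If you re-run the comparison and want to apply those rules of thumb mechanically, a trivial, purely illustrative helper (the 2–5 pp gray zone the table leaves unspecified is folded into the cautious middle bucket here):

```python
def classify_delta_pp(base_pct: float, sft_pct: float) -> str:
    """Bucket a percentage-point delta per the rules of thumb above (illustrative)."""
    delta = abs(sft_pct - base_pct)
    if delta <= 2:
        return "within noise — don't claim improvement"
    if delta > 10:
        return "almost certainly real"
    return "likely real — check the per-tier breakdown"  # covers the 2–10 pp range

# From the dataset eval table above:
# classify_delta_pp(38.9, 88.9) -> 'almost certainly real'
```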
### Going further with GRPO

Once the SFT adapter is in hand, the same comparison can be re-run against a GRPO adapter. Multi-step results from our reference GRPO run are documented in the [main README §11](../README.md#11-results--benchmarks); the short version is that GRPO@35-steps preserves SFT performance and modestly improves the middle tiers, while the expert tier remains the bottleneck.

---

## 7. Files in this directory

| File | Purpose |
|---|---|
| [compare_base_vs_sft.ipynb](compare_base_vs_sft.ipynb) | Side-by-side dataset + RL env benchmark — clean version |
| [compare_base_vs_sft_with_outputs.ipynb](compare_base_vs_sft_with_outputs.ipynb) | Same notebook with cell outputs preserved (18 display widgets) |

---

## See also

- [Main README](../README.md) — top-level overview, results section
- [data/README.md](../data/README.md) — the dataset that drives this comparison
- [data/sft/MODEL_EVALUATION.md](../data/sft/MODEL_EVALUATION.md) — base-model selection benchmark (same scoring functions reused here)
- [train/README.md](../train/README.md) — how the SFT adapter benchmarked here was produced
- [inference.py](../inference.py) — single-model agent loop (the prototype the RL eval mode is modeled after)