File size: 15,577 Bytes
c745a99
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
# Model Evaluation β€” Picking the Best Base Model for SFT + GRPO on AWS RL Env

## TL;DR

**Train `qwen2.5-coder-3b-instruct`.** It's the strongest candidate across every metric that matters for this task: highest exact-match rate, tightest outputs, and fast enough to not bottleneck GRPO rollouts. Full reasoning and per-model data below.

---

## 1. What this evaluation does

For each chat model loaded in LM Studio, we send 27 prompts drawn from our held-out validation split and measure how closely the model's output matches the canonical AWS CLI command that would solve the task. The goal is to pick the base model that:

1. **Starts strong** β€” already understands AWS CLI syntax, so SFT can focus on task correctness instead of format-locking
2. **Has headroom** β€” not so perfect that SFT overfits; not so weak that SFT can't help
3. **Is fast enough** β€” GRPO generates `G=8` rollouts per prompt Γ— many prompts Γ— many steps; inference cost compounds

This is a **format-and-correctness screen**. It does NOT measure:
- Whether the model can run a multi-step task against the live env (that's a separate integration test)
- Long-context behavior beyond ~500 tokens
- Post-SFT performance (only base-model zero-shot)

## 2. Eval methodology

### Prompts
- **Source**: `data/sft/aws_rl_sft.val.jsonl` (150 rows)
- **Coverage**: 3 examples per `(tier, source)` combo β†’ **27 prompts per model**
- Combos cover: warmup+beginner+intermediate tiers Γ— success_first_step + multi_step_continuation + failure_recovery + verification + hint_usage producers
- Each prompt is sent exactly as inference.py would send it: `system` + `user` messages from the dataset, no assistant turn

### Model invocation
- **Endpoint**: LM Studio at `http://localhost:1234/v1/chat/completions` (OpenAI-compatible)
- **temperature**: `0.0` (deterministic)
- **max_tokens**: `120` (enough for any valid AWS command; truncates runaway prose)
- **timeout**: `60s` per call

### Total budget
- 11 chat models Γ— 27 prompts = **297 API calls**, completed in ~15 minutes

## 3. Metrics β€” what each column means

| Metric | What it measures | Why it matters |
|---|---|---|
| **`fmt%`** | Raw model output starts with `aws ` (no preamble, no fences, no prose) | Inference-time gate: [inference.py:93](../../inference.py#L93) rejects anything that doesn't start with `aws ` and replaces it with `aws help`. High `fmt%` = fewer wasted env steps. |
| **`+xtr%`** | After stripping markdown fences and leading prose, does the first `aws ...` line exist? | Measures "the model knows the answer but wraps it in junk." If `+xtr% >> fmt%`, the gap is all format noise β€” a simple regex in inference.py could recover most of it, OR SFT can lock it cheaply. |
| **`exact%`** | Extracted command matches the canonical command token-for-token | The hardest metric. Hits all the way down to exact flag values and escaping. This is the ceiling SFT has to reach. |
| **`svc%`** | Extracted command uses the same AWS service as canonical (e.g. both start with `aws s3api`) | Measures domain orientation: does the model know "this task calls for DynamoDB" even if it gets the exact operation wrong? |
| **`op%`** | Same AWS service AND same operation (e.g. both are `aws s3api create-bucket`) | Measures how close the model is to correct β€” it knows *what* to do, maybe not with *which* flags. This is the gap SFT closes most reliably. |
| **`lat`** | Mean seconds per call | Matters for GRPO rollout throughput. G=8 rollouts Γ— 100 prompts Γ— 5 steps = 4000 generations per training epoch. At 10s/call that's 11 hours; at 3s it's 3.3 hours. |
| **`len`** | Mean raw output length in characters | Proxy for verbosity. Lower = more concentrated signal for SFT loss; higher = model likes to explain itself (bad for this task). |

### Symbols in per-call logs
- **βœ“** β€” exact match with canonical command
- **~** β€” format valid (after extraction) but content doesn't match canonical
- **βœ—** β€” either no valid `aws ` line or the output is malformed

## 4. Full results β€” 11 models Γ— 27 prompts each

```
Model                                  n errs  fmt%  +xtr%  exact%  svc%   op%   lat   len
--------------------------------------------------------------------------------------------
qwen2.5-coder-3b-instruct             27    0   85%   100%     41%   70%   63%  3.1s   86  ⭐
qwen/qwen3-4b-2507                    27    0  100%   100%     33%   74%   59% 10.4s  108
qwen2.5-coder-1.5b-instruct           27    0   81%    85%     22%   48%   44%  2.5s  110
smollm2-1.7b-instruct                 27    0   63%    63%      7%   63%   37%  2.1s   87
smollm-360m-instruct                  27    0    0%    63%      0%   26%    7%  1.7s  402
smollm2-135m-instruct                 27    0    0%    59%      0%   15%    7%  1.1s  337
smollm-360m-instruct-v0.2             27    0    0%    56%      0%   15%    7%  2.2s  364
smollm2-360m-instruct                 27    0   52%    52%      0%   48%   33%  1.0s  137
smollm-1.7b-instruct-v0.2             27    0    0%    37%      0%   15%   11%  3.9s  342
smollm2-360m (base)                   27    0    0%     0%      0%    0%    0%  1.7s  390
deepseek-r1-distill-qwen-1.5b         27    0    0%     0%      0%    0%    0%  4.1s    0†
```

*† DeepSeek-R1-Distill was truncated by `max_tokens=120` during its `<think>...</think>` reasoning phase. We re-ran it separately with `max_tokens=2048` β€” see section 6 for real numbers.*

## 5. Per-model verdicts

### ⭐ `qwen2.5-coder-3b-instruct` β€” **recommended**

**Evidence**
- **exact% = 41%** β€” highest of any model tested
- **op% = 63%** β€” best service+operation recognition; it knows *what* most tasks need
- **len = 86 chars** β€” tightest output in the test (even tighter than qwen3-4b at 108)
- **lat = 3.1s** β€” 3.4Γ— faster than qwen3-4b with better accuracy
- Correctly handled `aws cognito-idp create-user-pool --pool-name app-users` (intermediate tier)
- Correctly handled `aws rds create-db-instance --db-instance-identifier app-database --engine mysql` (a notoriously long command)

**Weaknesses**
- `fmt% = 85%` (not 100%) β€” occasionally wraps commands in `'...'` quotes or adds a trailing period. SFT fixes this in one epoch.
- Sometimes picks the wrong operation within the right service (e.g. `create-user-pool-client` instead of `create-user-pool`). Failure-recovery rows in your SFT dataset address this directly.

**Training implications**
- Recommended LoRA config: **r=8, Ξ±=16, 2 epochs, lr=2e-4** β€” model is already strong enough that r=16 would memorize rather than generalize
- Expected post-SFT performance: exact% > 75%, op% > 90%
- Inference cost during GRPO: ~3Γ— cheaper than qwen3-4b

---

### `qwen/qwen3-4b-2507` β€” strong runner-up

**Evidence**
- **fmt% = 100%** β€” the only model that never produces preamble, quotes, or fences
- **exact% = 33%**, **svc% = 74%** β€” still very good
- **lat = 10.4s** β€” 3Γ— slower than qwen2.5-coder-3b due to 33% more parameters

**Weaknesses**
- The latency is a real problem for GRPO at scale β€” 10s Γ— G=8 rollouts Γ— 100 prompts = 2.2 hours per training step pair
- Lower `op%` than qwen2.5-coder-3b (59% vs 63%) despite being larger β€” suggests coder-tuning beats raw scale for this task

**Verdict**: use only if post-SFT evaluation on qwen2.5-coder-3b falls short of expectations. Otherwise the smaller coder model dominates.

---

### `qwen2.5-coder-1.5b-instruct` β€” the speed play

**Evidence**
- **fmt% = 81%**, **+xtr% = 85%**, **exact% = 22%**
- **lat = 2.5s** β€” fastest of the viable candidates
- 1.5B parameters β€” ~2Γ— cheaper inference than the 3B

**Weaknesses**
- 22% exact-match is a real accuracy gap from the 3B (41%)
- Sometimes confuses related operations (e.g. `put-secret-value` instead of `create-secret`)

**Verdict**: keep as a fallback. If your GRPO budget is tight, the 2Γ— throughput might justify the accuracy hit β€” but only after confirming SFT can close the gap. Recommended only if you plan to run many thousands of GRPO episodes.

---

### `smollm2-1.7b-instruct` β€” best of the SmolLMs, but not enough

**Evidence**
- **exact% = 7%** (2/27 correct) β€” only SmolLM variant above zero
- **svc% = 63%** β€” knows which service most tasks target
- Picks up service names fairly often but almost always with wrong operation or flags

**Weaknesses**
- A 34% accuracy gap to qwen2.5-coder-3b on the critical exact% metric
- Frequent hallucinations: `aws s3 mb s3://firehose-delivery/ --profile aws-dev-prod` (made-up profile flag)

**Verdict**: not worth training. The post-SFT ceiling will be limited by the base model's sparse AWS knowledge.

---

### `smollm2-135m-instruct` β€” surprising +xtr%, zero substance

**Evidence**
- **+xtr% = 59%** β€” emits `aws ` prefixed lines more often than half the larger SmolLMs
- **exact% = 0%**, **op% = 7%** β€” complete syntax salad behind the prefix

**Example outputs**
- `aws s3 ls --bucket=/path/to/s3 -o /path/to/s3-output.json -n notifications` (hallucinated flags for list-topics task)
- `aws elastic describe-cache-clusters --cluster=my_elastiCache` (wrong service name, fabricated flags)

**Verdict**: it produces convincing-looking CLI syntax but none of it is valid. A completely different failure mode from the 360M models (which dump prose) β€” and equally useless.

---

### `smollm-360m-instruct` / `smollm-360m-instruct-v0.2` / `smollm2-360m-instruct`

All three fail similarly:
- `fmt%` either 0% (dumps prose or Python code) or ~50% (emits quoted strings like `"'aws s3 ls'"`)
- `exact% = 0%` across the board
- Outputs often include markdown code fences, step-by-step narration, or hallucinated boto3 code

**Verdict**: ineligible. Format instability makes SFT expensive and the base capability is absent.

---

### `smollm-1.7b-instruct-v0.2` β€” size doesn't save it

**Evidence**
- Same parameter count as `smollm2-1.7b-instruct` but older / different training
- **+xtr% = 37%** vs. 63% for smollm2-1.7b β€” the training difference matters more than scale
- 0% exact match, 11% op match

**Verdict**: the newer smollm2-1.7b-instruct is strictly better; this variant has no role.

---

### `smollm2-360m` (base, not instruct)

**Evidence**
- 0% across every column
- Echoes the prompt back verbatim

**Verdict**: base models without instruction tuning are architecturally wrong for a chat-format SFT setup. Skip.

---

### `deepseek-r1-distill-qwen-1.5b` β€” wrong tool for this job

**Original run (max_tokens=120)**
- 0% across the board, 0-char outputs
- **Cause**: R1 models emit `<think>...</think>` reasoning blocks of 500-2000 tokens before their answer. 120 tokens truncated every response mid-thinking.

**Re-run (max_tokens=2048)**
- **exact% = 0/27** (still zero)
- **avg latency = 16.0s** (2-3Γ— slower than qwen3-4b due to thinking overhead)
- 2 calls timed out at 60s
- Typical outputs: `aws s3 bucket-create --bucket data-pipeline` (invented op), `aws s3 topic --name Alerts` (wrong service), `aws iam checkRolePolicy` (hallucinated op name)

**Why it fails**
- R1-distill was trained on math and coding reasoning β€” not AWS CLI
- The `<think>` pattern doesn't summon domain knowledge that isn't in the base model
- Qwen-1.5B's AWS knowledge is sparse; wrapping it in reasoning tokens doesn't add substance

**Verdict**: only useful if you specifically want GRPO-with-thinking from day one AND are willing to do heavier SFT. For this task, qwen2.5-coder-3b + emergent reasoning during GRPO (R1-Zero style) is the cleaner path.

## 6. How to read the gap between `fmt%` and `+xtr%`

This gap tells you what kind of SFT each model needs:

- **`qwen/qwen3-4b-2507`**: `fmt% = +xtr% = 100%` β†’ zero format-locking needed, SFT can focus entirely on task correctness
- **`qwen2.5-coder-3b`**: `85% β†’ 100%` β†’ small format tax (quotes, trailing punctuation); one epoch of SFT fixes it
- **`smollm-360m-instruct`**: `0% β†’ 63%` β†’ the model *knows* what to say but always wraps it in prose. A regex post-processor could salvage 63% without any training β€” but it's cheap signal to SFT on
- **`deepseek-r1-distill`**: `0% β†’ 0%` β†’ format-broken even with reasoning budget; not recoverable by regex

## 7. Overall ranking (for SFT + GRPO)

| Rank | Model | Train? | Reasoning |
|------|---|:---:|---|
| 1 | qwen2.5-coder-3b-instruct | βœ… | Best exact%, best op%, cleanest output, fast enough for GRPO |
| 2 | qwen/qwen3-4b-2507 | ⚠️ fallback | Perfect format but 3Γ— slower and slightly worse content than #1 |
| 3 | qwen2.5-coder-1.5b-instruct | ⚠️ speed play | Strong for its size; train only if GRPO throughput is critical |
| 4 | smollm2-1.7b-instruct | ❌ | 34pt gap on exact% vs #1; ceiling too low |
| β€” | All smaller SmolLMs | ❌ | Format-broken, zero exact match, hallucinated syntax |
| β€” | smollm-1.7b-instruct-v0.2 | ❌ | Strictly dominated by smollm2-1.7b-instruct |
| β€” | deepseek-r1-distill-qwen-1.5b | ❌ | Wrong domain + latency 2Γ— worse than #2 |

## 8. Caveats & limitations

- **27 prompts is a sample, not an exhaustive benchmark.** The error bars on exact% are Β±5-10 percentage points. For close calls (like coder-3b vs qwen3-4b), rerun with `--max-per-combo 5` or higher before making the final call.
- **LM Studio latency is serving-architecture-dependent.** The 10s/call for qwen3-4b reflects Metal / llama.cpp on your local Mac. During actual training we'll run on CUDA via `transformers` (~100ms forward pass) or vLLM (~30ms), and the picture changes.
- **We only measure single-turn behavior.** Multi-step task completion (does the model actually solve the episode end-to-end?) requires running against the live env. This eval predicts first-step performance, which correlates well but isn't the same thing.
- **R1-distill was tested twice** β€” once with the default budget that truncated thinking, once with `max_tokens=2048`. The README table shows the truncated numbers; real performance is section 5's re-run.

## 9. Training implications β€” if you pick `qwen2.5-coder-3b-instruct`

- **LoRA**: `r=8, lora_alpha=16, target_modules=["q_proj","k_proj","v_proj","o_proj"], lora_dropout=0.05` β€” lower rank than the default because the base model is already strong
- **Training**: `num_train_epochs=2, lr=2e-4, effective_batch=16, max_seq_length=512, lr_scheduler="cosine"` β€” shorter than the plan for Llama-3.1-8B; don't over-train
- **Expected post-SFT**: fmt% β†’ 100%, op% β†’ 90%+, exact% β†’ 75%+
- **GRPO after SFT**: ~3Γ— cheaper rollouts than qwen3-4b, so more exploration per compute budget

## 10. Files produced by this evaluation

- [model_eval_full.json](model_eval_full.json) β€” full per-call data (every prompt Γ— every model Γ— every response), 297 rows
- [model_eval_full.txt](model_eval_full.txt) β€” raw execution log (what was streamed to stdout during the run)
- [deepseek_r1_rerun.json](deepseek_r1_rerun.json) β€” R1-distill re-run data with `max_tokens=2048`
- [../eval_lm_studio_models.py](../eval_lm_studio_models.py) β€” the eval harness (reusable for post-SFT evaluation)

## 11. How to rerun this evaluation post-SFT

After training, save the merged model to LM Studio and rerun:

```bash
.venv/bin/python data/eval_lm_studio_models.py \
    --max-per-combo 5 \
    --out data/sft/model_eval_postsft.json
```

Compare the `exact%` and `op%` deltas vs the baseline in [model_eval_full.json](model_eval_full.json). A successful SFT run should see:
- `exact%`: 41% β†’ 75%+
- `op%`: 63% β†’ 90%+
- `fmt%`: 85% β†’ 100%

If those deltas don't land, something's wrong with the training β€” not the dataset.