# `data/` — SFT Dataset Generation & Base-Model Selection

[← back to main README](../README.md)

This directory holds the SFT training corpus, the dataset generator that produced it, and the benchmark we used to pick the base model. Together they answer two questions a hackathon judge should be able to verify in under five minutes:

1. **What did we train on?** A 1,500-row synthetic SFT corpus with five trajectory types covering success, continuation, failure recovery, verification, and hint usage. ([§1](#1-sft-dataset-generation))
2. **Why this base model?** A reproducible 11-model benchmark across 27 held-out prompts. **Qwen2.5-Coder-3B-Instruct** wins on every metric that matters. ([§5](#5-base-model-selection-overview))

> ![Top 4 candidate models on the held-out benchmark](../docs/figures/model_eval_chart.png)

---

## Table of contents

1. [SFT dataset generation](#1-sft-dataset-generation)
2. [Five trajectory types](#2-five-trajectory-types)
3. [Tier weighting](#3-tier-weighting)
4. [Dataset format & artifacts](#4-dataset-format--artifacts)
5. [Base-model selection — overview](#5-base-model-selection-overview)
6. [Eval harness](#6-eval-harness)
7. [HuggingFace publishing](#7-huggingface-publishing)
8. [Files in this directory](#8-files-in-this-directory)

---

## 1. SFT dataset generation

[data/build_sft_dataset.py](build_sft_dataset.py) — 27 KB, single-script generator.

### Approach

The dataset is **synthetically generated** but grounded in canonical solutions extracted from our integration test suite. Two design decisions are worth flagging to judges:

#### AST-based extraction, not pytest execution

Each `tests_tasks/test__tasks.py` file has a top-level constant (`WARMUP_COMMANDS`, `BEGINNER_COMMANDS`, …) mapping `task_id → canonical AWS CLI command`. We extract these via Python's `ast` module — we do **not** execute the test file. Reasons:

1. `pytest` fixtures would spin up a MiniStack, hit AWS APIs, and add 30+ seconds of overhead per generation run.
2. Static extraction is deterministic — no flake risk. The dataset is reproducible bit-for-bit given a seed.
3. The canonical solutions are intentionally simple constant declarations that AST can parse without import side effects.

#### Plausible-output simulation

When generating multi-step continuations, we don't have a real MiniStack response to feed back into the user message — we have to fabricate one. The generator maps each AWS operation (`list-buckets`, `create-table`, `describe-instances`, …) to a JSON template, then interpolates the right resource names from the task. So an `aws s3api list-buckets` step in the user prompt history has output like:

```json
{"Buckets":[{"Name":"my-app-data","CreationDate":"2026-04-15T..."}]}
```

…instead of the empty `{"Buckets":[]}` you'd get from a fresh MiniStack. This is the difference between the SFT model learning "first step, always answer with the canonical command" (degenerate) and "first step depends on what's already been done" (correct).

### Dynamic-ID filtering

Some tests reference resources whose IDs only exist at runtime — security groups (`sg-…`), subnets (`subnet-…`), VPCs (`vpc-…`), instance IDs (`i-…`). These commands cannot be deterministically captured by static extraction. The generator skips any task whose canonical command contains those patterns. The result: 72 unique tasks make it into the train split (out of 134 total tasks), all of which are deterministically reproducible.
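For judges who want to see what "AST extraction, no execution" means concretely, here is a minimal sketch under the assumptions above. The constant naming, file layout, and dynamic-ID regex are illustrative; the real logic lives in [build_sft_dataset.py](build_sft_dataset.py).

```python
# Minimal sketch of AST-based extraction (illustrative, not the exact generator code).
# It parses a test module WITHOUT importing it and pulls out top-level *_COMMANDS dicts.
import ast
import re
from pathlib import Path

DYNAMIC_ID = re.compile(r"\b(sg-|subnet-|vpc-|i-)")  # assumed runtime-only ID patterns

def extract_canonical_commands(test_file: Path) -> dict[int, str]:
    """Return {task_id: canonical command} from top-level *_COMMANDS constants."""
    tree = ast.parse(test_file.read_text())
    commands: dict[int, str] = {}
    for node in tree.body:
        # Only top-level assignments like WARMUP_COMMANDS = {...} are considered.
        if not isinstance(node, ast.Assign):
            continue
        names = [t.id for t in node.targets if isinstance(t, ast.Name)]
        if not any(name.endswith("_COMMANDS") for name in names):
            continue
        value = ast.literal_eval(node.value)  # safe for plain literals; nothing is executed
        for task_id, cmd in value.items():
            if DYNAMIC_ID.search(cmd):
                continue  # skip commands that reference IDs that only exist at runtime
            commands[task_id] = cmd
    return commands
```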
---

## 2. Five trajectory types

The SFT corpus mixes five distinct trajectory shapes so the model learns to handle real multi-turn agent behavior, not just one-shot question answering. Actual proportions (from [data/sft/dataset_stats.json](sft/dataset_stats.json)):

| Source | Train pct (target) | Train rows | What the model sees |
|----------------------------|:------------------:|:----------:|-------------------------------------------------------------------------------------------|
| `success_first_step` | 55.1% (55%) | 826 | User → Task description → assistant emits the canonical command |
| `multi_step_continuation` | 20.1% (20%) | 301 | User → Task description + a baked-in history of N-1 prior commands and their outputs → assistant emits step N |
| `failure_recovery` | 15.5% (15%) | 232 | User → Task description + step 1 of a wrong command and its simulated error → assistant emits the recovery command |
| `verification` | 4.5% (5%) | 67 | User → Task already complete → assistant emits a read-only verification command |
| `hint_usage` | 4.9% (5%) | 74 | User → Task description → assistant emits `aws help --task-hint` (the agent action that requests a hint) |

Why include the last four sources at all?

- **`multi_step_continuation`** trains continuation behavior. Without it, the model overfits to step 1 and degrades on later turns.
- **`failure_recovery`** teaches the model that a typo / wrong command is recoverable. The reward signal during GRPO is dense — the model needs to know what "try again" looks like.
- **`verification`** trains the model to recognize when a task is done and respond appropriately. Production agents must distinguish "do something" from "confirm it's done".
- **`hint_usage`** lets the model learn that `aws help --task-hint` is the in-environment way to request help, not just a literal CLI command.

---

## 3. Tier weighting

[data/build_sft_dataset.py:54-60](build_sft_dataset.py) — sampling weights (a short sampling sketch follows the table):

| Tier | Weight | Train rows | Why |
|--------------|:------:|:----------:|------------------------------------------------------------------------------------|
| warmup | 0.50 | 456 | Most rows. Format-locks the model on the simplest possible "aws X list" pattern. |
| beginner | 0.30 | 378 | Single-resource creation — bread and butter. |
| intermediate | 0.15 | 666 * | Multi-step workflows. Actual count exceeds the target because each task contributes more rows via multi_step_continuation. |
| advanced | 0.05 | 0 | Cross-service architectures. Filtered out post-extraction (most have dynamic IDs). |
| expert | 0.00 | 0 | SRE / drift / security-posture. **Intentionally excluded from SFT.** |

> **Why the expert tier is excluded from SFT.** The expert tasks (drift detection, security audits) have *randomized* state checks — there is no canonical command sequence. Trying to SFT on them would teach the model a particular fix script that is *wrong* on most episodes. These tasks are reserved for GRPO, where the env's `state_checks` reward signal handles the randomization correctly.

`*` The intermediate row count exceeds what the 0.15 weight alone would imply because the multi-step trajectory generator naturally produces multiple rows per task (one for step 1, step 2, etc.).
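To make the weighting concrete, here is a hedged sketch of how tier weights could drive task sampling. The weights mirror the table above, but the function and variable names are illustrative rather than copied from [build_sft_dataset.py](build_sft_dataset.py).

```python
# Illustrative tier-weighted sampling (assumed structure; real logic at build_sft_dataset.py:54-60).
import random

TIER_WEIGHTS = {
    "warmup": 0.50,
    "beginner": 0.30,
    "intermediate": 0.15,
    "advanced": 0.05,
    "expert": 0.00,  # intentionally excluded from SFT
}

def sample_tasks(tasks_by_tier: dict[str, list[int]], n_rows: int, seed: int = 0) -> list[int]:
    """Draw task_ids so tiers appear roughly in proportion to TIER_WEIGHTS."""
    rng = random.Random(seed)  # seeded draws keep the dataset reproducible
    tiers = [t for t, w in TIER_WEIGHTS.items() if w > 0 and tasks_by_tier.get(t)]
    weights = [TIER_WEIGHTS[t] for t in tiers]
    sampled = []
    for _ in range(n_rows):
        tier = rng.choices(tiers, weights=weights, k=1)[0]
        sampled.append(rng.choice(tasks_by_tier[tier]))
    return sampled
```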
---

## 4. Dataset format & artifacts

### JSONL chat-message schema

```json
{
  "messages": [
    {"role": "system", "content": "You are an AWS cloud engineer interacting with a real AWS environment via CLI..."},
    {"role": "user", "content": "TASK: Create an S3 bucket named my-app-data and enable versioning on it.\n\nPREVIOUS COMMANDS:\n[1] $ aws s3 mb s3://my-app-data\n output: make_bucket: my-app-data\n reward: 0.50\n\n---\n\nCURRENT OBSERVATION:\nProgress: 0.50 Achieved: False Step: 2"},
    {"role": "assistant", "content": "aws s3api put-bucket-versioning --bucket my-app-data --versioning-configuration Status=Enabled"}
  ],
  "difficulty": "intermediate",
  "source": "multi_step_continuation",
  "task_id": 42
}
```

Every row carries the `difficulty`, `source`, and `task_id` metadata — useful for filtering, ablations, and debugging.

### Artifacts

[data/sft/](sft/):

| File | Size | Rows | Unique tasks | Use |
|--------------------------------------------------------------|------:|------:|:------------:|------------------------------------------------|
| [aws_rl_sft.train.jsonl](sft/aws_rl_sft.train.jsonl) | 2.2 MB | 1,500 | 72 | SFT training |
| [aws_rl_sft.val.jsonl](sft/aws_rl_sft.val.jsonl) | 218 KB | 150 | 63 | SFT validation; basis for [MODEL_EVALUATION.md](sft/MODEL_EVALUATION.md) |
| [aws_rl_sft.reserve.jsonl](sft/aws_rl_sft.reserve.jsonl) | 294 KB | 200 | 66 | Held-out reserve for post-SFT regression checks |
| [dataset_stats.json](sft/dataset_stats.json) | 3.4 KB | — | — | Per-split source/tier/task breakdowns |
| [MODEL_EVALUATION.md](sft/MODEL_EVALUATION.md) | 15 KB | — | — | Full model-selection writeup ([§5](#5-base-model-selection-overview)) |
| [model_eval_full.json](sft/model_eval_full.json) | 209 KB | 297 | — | Per-call eval data (11 models × 27 prompts) |
| [deepseek_r1_rerun.json](sft/deepseek_r1_rerun.json) | 5.3 KB | 27 | — | DeepSeek R1 re-run with `max_tokens=2048` |

---

## 5. Base-model selection — overview

This is the most rigorous decision in the whole project. The full reasoning, per-model verdicts, and methodology live in **[data/sft/MODEL_EVALUATION.md](sft/MODEL_EVALUATION.md)** — a 270-line standalone report. Read it before judging the project's technical depth; it's what convinces us we're training the right thing.

The 30-second summary:

| Model | exact% | op% | fmt% | Latency | Verdict |
|--------------------------------|:-----:|:----:|:------:|:-------:|--------------------------------------|
| **qwen2.5-coder-3b-instruct** | **41%** | **63%** | 85% | **3.1s** | ✅ Train this. Highest exact, fastest viable. |
| qwen/qwen3-4b-2507 | 33% | 59% | 100% | 10.4s | Fallback. Perfect format, 3× slower. |
| qwen2.5-coder-1.5b-instruct | 22% | 44% | 81% | 2.5s | Speed play if GRPO budget tight. |
| smollm2-1.7b-instruct | 7% | 37% | 63% | 2.1s | ❌ Ceiling too low. |
| (7 more) | 0% | … | … | … | ❌ Format-broken or wrong domain. |

> ![Per-model comparison: 5 quality metrics + latency](../docs/figures/model_eval_chart.png)

What the metrics mean:

- **`fmt%`**: raw output starts with `aws ` (no preamble, fences, or quotes). The agent's [inference.py:93](../inference.py) gate rejects everything else.
- **`+xtr%`**: `fmt%` after stripping markdown fences. The gap to `fmt%` means "the model knows the answer but wraps it in junk".
- **`exact%`**: extracted command matches the canonical command token-for-token. The hardest metric.
- **`svc%`**: same AWS service as the canonical command. Domain orientation.
- **`op%`**: same service AND operation. The gap SFT closes most reliably.
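As a rough illustration of how these per-call checks could be computed from a model response and the canonical command, here is a hedged sketch. The helper name, fence-stripping regex, and token positions for service/operation are assumptions for this example; the real scoring lives in [eval_lm_studio_models.py](eval_lm_studio_models.py).

```python
# Illustrative per-call scoring, following the metric definitions above.
import re
import shlex

def score_response(raw: str, canonical: str) -> dict[str, bool]:
    fmt_ok = raw.strip().startswith("aws ")  # fmt%: no preamble, fences, or quotes

    # +xtr%: drop markdown fence markers, then take the first line that starts with "aws "
    no_fences = re.sub(r"`{3}[a-zA-Z]*", "", raw)
    extracted = next((ln.strip() for ln in no_fences.splitlines()
                      if ln.strip().startswith("aws ")), "")
    xtr_ok = bool(extracted)

    cmd = shlex.split(extracted) if extracted else []
    ref = shlex.split(canonical)

    exact_ok = cmd == ref                                        # exact%: token-for-token match
    svc_ok = len(cmd) > 1 and len(ref) > 1 and cmd[1] == ref[1]  # svc%: e.g. "s3api" == "s3api"
    op_ok = svc_ok and len(cmd) > 2 and len(ref) > 2 and cmd[2] == ref[2]  # op%: same operation too

    return {"fmt": fmt_ok, "xtr": xtr_ok, "exact": exact_ok, "svc": svc_ok, "op": op_ok}
```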
The full table (11 models, 9 metrics, per-call logs) is in [data/sft/model_eval_full.json](sft/model_eval_full.json) — 297 records.

---

## 6. Eval harness

[data/eval_lm_studio_models.py](eval_lm_studio_models.py) — 9.9 KB, reusable.

- Calls each chat model loaded in LM Studio at `http://localhost:1234/v1/chat/completions` (OpenAI-compatible API)
- Sends the same 27 held-out prompts to each model
- Extracts `aws ...` from the response (stripping fences / preamble)
- Compares against the canonical command from the val split
- Writes per-call detail + aggregate metrics to JSON

To re-run post-SFT:

```bash
.venv/bin/python data/eval_lm_studio_models.py \
  --max-per-combo 5 \
  --out data/sft/model_eval_postsft.json
```

A successful SFT run should land near these numbers (targets predicted in [MODEL_EVALUATION.md §11](sft/MODEL_EVALUATION.md); **actuals from our reference SFT run**):

| Metric | Base | Target | **Actual (post-SFT)** |
|-----------|:-----:|:-------:|:---------------------:|
| `exact%` | 39% | 75%+ | **88.9%** ✅ |
| `op%` | 61% | 90%+ | **88.9%** ≈ |
| `svc%` | 78% | — | **88.9%** |
| `fmt%` | 33% | 100% | **100.0%** ✅ |
| latency | 2.03s | — | **1.40s** (faster) |

Every target from MODEL_EVALUATION.md is hit or essentially hit. Format compliance is now perfect; exact-match jumped 50 pp; the model is faster *and* tighter.

> ![Base vs SFT comparison (eval metrics)](../docs/figures/base_vs_sft_success.png)
> ![Single-step eval base vs SFT](../docs/figures/single_step_eval.png)

---

## 7. HuggingFace publishing

[data/upload_sft_to_hf.py](upload_sft_to_hf.py) — pushes the JSONL splits to HuggingFace Hub:

| Split | Hub repo |
|----------|-----------------------------------------------------|
| train | `Sizzing/aws-rl-sft-qwen25coder3b-train` |
| val | `Sizzing/aws-rl-sft-qwen25coder3b-val` |
| reserve | `Sizzing/aws-rl-sft-qwen25coder3b-reserve` |

The trained SFT adapter (output of [train/train_sft_lora.ipynb](../train/train_sft_lora.ipynb)) is published separately at:

- `Sizzing/aws-rl-sft-qwen25coder3b-adapter`

GRPO training picks it up by setting `SFT_ADAPTER = "Sizzing/aws-rl-sft-qwen25coder3b-adapter"` in [aws_rl_env_colab.ipynb](../aws_rl_env_colab.ipynb).
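For reference, a push like this can be reproduced with the `datasets` library. This is a minimal sketch, not the exact contents of [upload_sft_to_hf.py](upload_sft_to_hf.py); the repo names come from the table above, everything else is illustrative, and it assumes `pip install datasets` plus a prior `huggingface-cli login`.

```python
# Hedged sketch of pushing the JSONL splits to the Hub (illustrative, not the actual script).
from datasets import load_dataset

SPLITS = {
    "train":   ("data/sft/aws_rl_sft.train.jsonl",   "Sizzing/aws-rl-sft-qwen25coder3b-train"),
    "val":     ("data/sft/aws_rl_sft.val.jsonl",     "Sizzing/aws-rl-sft-qwen25coder3b-val"),
    "reserve": ("data/sft/aws_rl_sft.reserve.jsonl", "Sizzing/aws-rl-sft-qwen25coder3b-reserve"),
}

for split, (path, repo_id) in SPLITS.items():
    ds = load_dataset("json", data_files=path, split="train")  # JSONL rows -> Dataset
    ds.push_to_hub(repo_id)
    print(f"{split}: pushed {len(ds)} rows to {repo_id}")
```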
---

## 8. Files in this directory

| File | Purpose |
|--------------------------------------------------------------------|--------------------------------------------------------------------|
| [build_sft_dataset.py](build_sft_dataset.py) | Generator — AST extraction + 5 trajectory types + plausible outputs |
| [eval_lm_studio_models.py](eval_lm_studio_models.py) | Base-model benchmark harness (LM Studio API) |
| [upload_sft_to_hf.py](upload_sft_to_hf.py) | Push the SFT splits to HuggingFace |
| [sft/aws_rl_sft.train.jsonl](sft/aws_rl_sft.train.jsonl) | 1,500 SFT training rows |
| [sft/aws_rl_sft.val.jsonl](sft/aws_rl_sft.val.jsonl) | 150 validation rows |
| [sft/aws_rl_sft.reserve.jsonl](sft/aws_rl_sft.reserve.jsonl) | 200 reserve rows |
| [sft/dataset_stats.json](sft/dataset_stats.json) | Per-split source / tier / task counts |
| [sft/MODEL_EVALUATION.md](sft/MODEL_EVALUATION.md) | **The base-model selection report (read this)** |
| [sft/model_eval_full.json](sft/model_eval_full.json) | Per-call eval data (11 models × 27 prompts) |
| [sft/deepseek_r1_rerun.json](sft/deepseek_r1_rerun.json) | R1 re-run with extended `max_tokens` |

---

## See also

- [Main README](../README.md)
- [data/sft/MODEL_EVALUATION.md](sft/MODEL_EVALUATION.md) — full base-model selection writeup
- [train/README.md](../train/README.md) — how this dataset is consumed by SFT training
- [compare/README.md](../compare/README.md) — how the trained model is benchmarked vs the base
- [server/services/tasks/](../server/services/tasks/) — source of truth for task definitions (the YAML the generator reads)
- [tests_tasks/](../tests_tasks/) — canonical solutions the generator extracts via AST