# `data/`: SFT Dataset Generation & Base-Model Selection

[← back to main README](../README.md)

This directory holds the SFT training corpus, the dataset generator that produced it, and the rigorous benchmark we used to pick the base model. Together they answer two questions a hackathon judge should be able to verify in under five minutes:

1. **What did we train on?** A 1,500-row synthetic SFT corpus with five trajectory types covering success, continuation, failure recovery, verification, and hint usage. ([§1](#1-sft-dataset-generation))
2. **Why this base model?** A reproducible 11-model benchmark across 27 held-out prompts. **Qwen2.5-Coder-3B-Instruct** posts the highest exact-match and operation-match rates at a workable latency. ([§5](#5-base-model-selection-overview))

---

## Table of contents

1. [SFT dataset generation](#1-sft-dataset-generation)
2. [Five trajectory types](#2-five-trajectory-types)
3. [Tier weighting](#3-tier-weighting)
4. [Dataset format & artifacts](#4-dataset-format--artifacts)
5. [Base-model selection: overview](#5-base-model-selection-overview)
6. [Eval harness](#6-eval-harness)
7. [HuggingFace publishing](#7-huggingface-publishing)
8. [Files in this directory](#8-files-in-this-directory)

---
## 1. SFT dataset generation

[data/build_sft_dataset.py](build_sft_dataset.py): 27 KB, single-script generator.

### Approach

The dataset is **synthetically generated** but grounded in canonical solutions extracted from our integration test suite. Two design decisions are worth flagging to judges:

#### AST-based extraction, not pytest execution

Each `tests_tasks/test_<tier>_tasks.py` file has a top-level constant (`WARMUP_COMMANDS`, `BEGINNER_COMMANDS`, …) mapping `task_id → canonical AWS CLI command`. We extract these via Python's `ast` module; we do **not** execute the test file. Reasons:

1. `pytest` fixtures would spin up a MiniStack, hit AWS APIs, and add 30+ seconds of overhead per generation run.
2. Static extraction is deterministic, so there is no flake risk. The dataset is reproducible bit-for-bit given a seed.
3. The canonical solutions are intentionally simple constant declarations that AST can parse without import side effects.
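
The static extraction described above can be sketched roughly as follows. This is a minimal illustration, not the code in `build_sft_dataset.py`: the sample file contents, the `extract_command_map` helper, and the two task entries are all hypothetical, and the sketch assumes the constant is a plain dict literal assigned with `=` (annotated assignments are not handled).

```python
import ast

# Hypothetical stand-in for a tests_tasks/test_<tier>_tasks.py file.
SAMPLE_TEST_FILE = '''
import pytest

WARMUP_COMMANDS = {
    1: "aws s3api list-buckets",
    2: "aws dynamodb list-tables",
}

def test_placeholder():
    raise RuntimeError("never runs: the file is parsed, not imported")
'''


def extract_command_map(source: str, suffix: str = "_COMMANDS") -> dict:
    """Return the first top-level dict constant whose name ends with `suffix`."""
    tree = ast.parse(source)
    for node in tree.body:  # top-level statements only
        if not isinstance(node, ast.Assign):
            continue
        for target in node.targets:
            if isinstance(target, ast.Name) and target.id.endswith(suffix):
                # literal_eval accepts only literals, so no test code is executed.
                return ast.literal_eval(node.value)
    raise ValueError(f"no top-level *{suffix} constant found")


commands = extract_command_map(SAMPLE_TEST_FILE)
print(commands[1])
```

Because the source is only parsed, the `pytest` import and the test function in the sample never run, which is the property the three reasons above rely on.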
#### Plausible-output simulation

When generating multi-step continuations, we don't have a real MiniStack response to feed back into the user message, so we have to fabricate one. The generator maps each AWS operation (`list-buckets`, `create-table`, `describe-instances`, …) to a JSON template, then interpolates the right resource names from the task. So an `aws s3api list-buckets` step in the user prompt history has output like:

```json
{"Buckets":[{"Name":"my-app-data","CreationDate":"2026-04-15T..."}]}
```

…instead of the empty `{"Buckets":[]}` you'd get from a fresh MiniStack. This is the difference between the SFT model learning "first step, always answer with the canonical command" (degenerate) and "first step depends on what's already been done" (correct).
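
One way to picture the template-and-interpolate step is the sketch below. The `OUTPUT_TEMPLATES` table, the `{name}` placeholder convention, and both helper functions are assumptions made for illustration; the real generator's templates (including fields such as `CreationDate`) are not reproduced here.

```python
import json

# Hypothetical templates keyed by AWS operation; "{name}" marks the resource slot.
OUTPUT_TEMPLATES = {
    "list-buckets": {"Buckets": [{"Name": "{name}"}]},
    "list-tables": {"TableNames": ["{name}"]},
}


def fill(template, name):
    """Recursively substitute the resource name into a JSON-like template."""
    if isinstance(template, str):
        return template.replace("{name}", name)
    if isinstance(template, list):
        return [fill(item, name) for item in template]
    if isinstance(template, dict):
        return {key: fill(value, name) for key, value in template.items()}
    return template


def simulate_output(operation: str, resource_name: str) -> str:
    """Render a plausible CLI response; unknown operations fall back to {}."""
    template = OUTPUT_TEMPLATES.get(operation, {})
    return json.dumps(fill(template, resource_name), separators=(",", ":"))


print(simulate_output("list-buckets", "my-app-data"))
```

Keeping templates as nested Python structures rather than format strings avoids having to escape the literal braces that JSON is full of.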
### Dynamic-ID filtering

Some tests reference resources whose IDs only exist at runtime: security groups (`sg-…`), subnets (`subnet-…`), VPCs (`vpc-…`), instance IDs (`i-…`). These commands cannot be deterministically captured by static extraction. The generator skips any task whose canonical command contains those patterns. The result: 72 unique tasks make it into the train split (out of 134 total tasks), all of which are deterministically reproducible.
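
A filter of this kind might look like the sketch below. The regular expression is an assumption about what a runtime ID looks like (known prefix, hyphen, 8 to 17 hex characters), and the three sample tasks are made up; the patterns the generator actually matches may differ.

```python
import re

# Assumed shape of runtime-only IDs: a known prefix, a hyphen, then 8-17 hex chars.
# The lookarounds avoid matching inside longer names such as "my-vpc-deadbeef".
DYNAMIC_ID = re.compile(r"(?<![\w-])(?:sg|subnet|vpc|i)-[0-9a-f]{8,17}(?![\w-])")


def has_dynamic_id(command: str) -> bool:
    """True when the command references a resource ID that only exists at runtime."""
    return DYNAMIC_ID.search(command) is not None


# Hypothetical task_id -> canonical command map.
tasks = {
    1: "aws s3api list-buckets",
    2: "aws ec2 describe-instances --instance-ids i-0123456789abcdef0",
    3: "aws ec2 delete-security-group --group-id sg-0a1b2c3d",
}

kept = {task_id: cmd for task_id, cmd in tasks.items() if not has_dynamic_id(cmd)}
print(sorted(kept))
```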

---

## 2. Five trajectory types

The SFT corpus mixes five distinct trajectory shapes so the model learns to handle real multi-turn agent behavior, not just one-shot question answering. Actual proportions (from [data/sft/dataset_stats.json](sft/dataset_stats.json)):

| Source                     | Train pct (target) | Train rows | What the model sees                                                                       |
|----------------------------|:------------------:|:----------:|-------------------------------------------------------------------------------------------|
| `success_first_step`       | 55.1% (55%)        | 826        | User → Task description → assistant emits the canonical command                           |
| `multi_step_continuation`  | 20.1% (20%)        | 301        | User → Task description + a baked-in history of N-1 prior commands and their outputs → assistant emits step N |
| `failure_recovery`         | 15.5% (15%)        | 232        | User → Task description + step 1 of a wrong command and its simulated error → assistant emits the recovery command |
| `verification`             | 4.5% (5%)          | 67         | User → Task already complete → assistant emits a read-only verification command           |
| `hint_usage`               | 4.9% (5%)          | 74         | User → Task description → assistant emits `aws help --task-hint` (the agent action that requests a hint) |

Why include the last four sources at all?

- **`multi_step_continuation`** trains continuation behavior. Without it, the model overfits to step 1 and degrades on later turns.
- **`failure_recovery`** teaches the model that a typo / wrong command is recoverable. The reward signal during GRPO is dense; the model needs to know what "try again" looks like.
- **`verification`** trains the model to recognize when a task is done and respond appropriately. Production agents must distinguish "do something" from "confirm it's done".
- **`hint_usage`** lets the model learn that `aws help --task-hint` is the in-environment way to request help, not just a literal CLI command.
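
The percentages in the table can be re-derived from the row counts with a few lines of arithmetic. The counts and targets below are copied from the table; the variable names are illustrative only.

```python
# Row counts and target shares copied from the trajectory table.
ROWS = {
    "success_first_step": 826,
    "multi_step_continuation": 301,
    "failure_recovery": 232,
    "verification": 67,
    "hint_usage": 74,
}
TARGET = {
    "success_first_step": 0.55,
    "multi_step_continuation": 0.20,
    "failure_recovery": 0.15,
    "verification": 0.05,
    "hint_usage": 0.05,
}

total = sum(ROWS.values())

# Difference between actual rows and the row count implied by each target share.
deltas = {source: n - round(TARGET[source] * total) for source, n in ROWS.items()}

for source, n in ROWS.items():
    print(f"{source:<26}{n / total:6.1%}  target {TARGET[source]:.0%}  delta {deltas[source]:+d} rows")
```

The largest deviations are `failure_recovery` (7 rows over its target) and `verification` (8 rows under); every source lands within about half a percentage point of its target share.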

---

## 3. Tier weighting

Sampling weights are defined in [data/build_sft_dataset.py:54-60](build_sft_dataset.py):

| Tier         | Weight | Train rows | Why                                                                                |
|--------------|:------:|:----------:|------------------------------------------------------------------------------------|
| warmup       | 0.50   | 456        | Highest sampling weight. Format-locks the model on the simplest possible "aws X list" pattern. |
| beginner     | 0.30   | 378        | Single-resource creation: the bread and butter.                                    |
| intermediate | 0.15   | 666 *      | Multi-step workflows. The actual count exceeds the target because each task contributes more rows via `multi_step_continuation`. |
| advanced     | 0.05   | 0          | Cross-service architectures. Filtered out post-extraction (most have dynamic IDs). |
| expert       | 0.00   | 0          | SRE / drift / security-posture. **Intentionally excluded from SFT.**               |

> **Why expert tier is excluded from SFT.** The expert tasks (drift detection, security audits) have *randomized* state checks, so there is no canonical command sequence. Trying to SFT on them would teach the model a particular fix script that is *wrong* on most episodes. These tasks are reserved for GRPO, where the env's `state_checks` reward signal handles the randomization correctly.

`*` The intermediate row count exceeds its sampling weight because the multi-step trajectory generator naturally produces multiple rows per task (one for step 1, one for step 2, and so on).
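
Seeded weighted sampling over tiers can be sketched as below. The weights come from the table; the `sample_tiers` helper, the choice of `random.Random`, and the seed value are assumptions, not a description of how `build_sft_dataset.py` draws tiers.

```python
import random
from collections import Counter

# Weights copied from the tier table.
TIER_WEIGHTS = {
    "warmup": 0.50,
    "beginner": 0.30,
    "intermediate": 0.15,
    "advanced": 0.05,
    "expert": 0.00,
}


def sample_tiers(n: int, seed: int) -> list:
    """Draw n tiers with a private RNG so the same seed gives the same sequence."""
    rng = random.Random(seed)
    # Zero-weight tiers are dropped up front so they can never be drawn.
    tiers = [tier for tier, weight in TIER_WEIGHTS.items() if weight > 0]
    weights = [TIER_WEIGHTS[tier] for tier in tiers]
    return rng.choices(tiers, weights=weights, k=n)


draws = sample_tiers(1500, seed=0)
counts = Counter(draws)
print(counts.most_common())
```

A draw like this only fixes the tier mix at sampling time. The final row counts in the table differ because dynamic-ID filtering and multi-step expansion happen afterwards.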

---

## 4. Dataset format & artifacts

### JSONL chat-message schema

```json
{
  "messages": [
    {"role": "system", "content": "You are an AWS cloud engineer interacting with a real AWS environment via CLI..."},
    {"role": "user", "content": "TASK: Create an S3 bucket named my-app-data and enable versioning on it.\n\nPREVIOUS COMMANDS:\n[1] $ aws s3 mb s3://my-app-data\n output: make_bucket: my-app-data\n reward: 0.50\n\n---\n\nCURRENT OBSERVATION:\nProgress: 0.50 Achieved: False Step: 2"},
    {"role": "assistant", "content": "aws s3api put-bucket-versioning --bucket my-app-data --versioning-configuration Status=Enabled"}
  ],
  "difficulty": "intermediate",
  "source": "multi_step_continuation",
  "task_id": 42
}
```

Every row carries the `difficulty`, `source`, and `task_id` metadata, which is useful for filtering, ablations, and debugging.
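
To make the schema concrete, here is a sketch that assembles one row in the shape shown above and serializes it as a JSONL line. The `build_row` helper is hypothetical, the exact whitespace of the user message is inferred from the example, and using the last step's reward as the `Progress` value is an assumption based on that single example.

```python
import json

SYSTEM_PROMPT = "You are an AWS cloud engineer interacting with a real AWS environment via CLI..."


def build_row(task, history, next_command, difficulty, source, task_id):
    """history is a list of (command, output, reward) tuples for prior steps."""
    lines = [f"TASK: {task}", "", "PREVIOUS COMMANDS:"]
    for step, (command, output, reward) in enumerate(history, start=1):
        lines += [f"[{step}] $ {command}", f" output: {output}", f" reward: {reward:.2f}"]
    # Assumption: progress mirrors the most recent reward, as in the example row.
    progress = history[-1][2] if history else 0.0
    lines += ["", "---", "", "CURRENT OBSERVATION:",
              f"Progress: {progress:.2f} Achieved: False Step: {len(history) + 1}"]
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "\n".join(lines)},
            {"role": "assistant", "content": next_command},
        ],
        "difficulty": difficulty,
        "source": source,
        "task_id": task_id,
    }


row = build_row(
    task="Create an S3 bucket named my-app-data and enable versioning on it.",
    history=[("aws s3 mb s3://my-app-data", "make_bucket: my-app-data", 0.5)],
    next_command="aws s3api put-bucket-versioning --bucket my-app-data "
                 "--versioning-configuration Status=Enabled",
    difficulty="intermediate",
    source="multi_step_continuation",
    task_id=42,
)
jsonl_line = json.dumps(row)  # one row per line in the .jsonl files
print(row["messages"][1]["content"])
```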
### Artifacts

[data/sft/](sft/):

| File                                                         | Size   | Rows  | Unique tasks | Use                                            |
|--------------------------------------------------------------|-------:|------:|:------------:|------------------------------------------------|
| [aws_rl_sft.train.jsonl](sft/aws_rl_sft.train.jsonl)         | 2.2 MB | 1,500 | 72           | SFT training                                   |
| [aws_rl_sft.val.jsonl](sft/aws_rl_sft.val.jsonl)             | 218 KB | 150   | 63           | SFT validation; basis for [MODEL_EVALUATION.md](sft/MODEL_EVALUATION.md) |
| [aws_rl_sft.reserve.jsonl](sft/aws_rl_sft.reserve.jsonl)     | 294 KB | 200   | 66           | Held-out reserve for post-SFT regression checks |
| [dataset_stats.json](sft/dataset_stats.json)                 | 3.4 KB | -     | -            | Per-split source/tier/task breakdowns          |
| [MODEL_EVALUATION.md](sft/MODEL_EVALUATION.md)               | 15 KB  | -     | -            | Full model-selection writeup ([§5](#5-base-model-selection-overview)) |
| [model_eval_full.json](sft/model_eval_full.json)             | 209 KB | 297   | -            | Per-call eval data (11 models × 27 prompts)    |
| [deepseek_r1_rerun.json](sft/deepseek_r1_rerun.json)         | 5.3 KB | 27    | -            | DeepSeek R1 re-run with `max_tokens=2048`      |

---

## 5. Base-model selection: overview

This is the most rigorous decision in the whole project. Full reasoning, per-model verdicts, and methodology live in **[data/sft/MODEL_EVALUATION.md](sft/MODEL_EVALUATION.md)**, a 270-line standalone report. Read it before judging the project's technical depth; it's what convinces us we're training the right thing.

The 30-second summary:

| Model                          | exact%  | op%     | fmt% | Latency  | Verdict                                      |
|--------------------------------|:-------:|:-------:|:----:|:--------:|----------------------------------------------|
| **qwen2.5-coder-3b-instruct**  | **41%** | **63%** | 85%  | **3.1s** | ✅ Train this. Highest exact, fastest viable. |
| qwen/qwen3-4b-2507             | 33%     | 59%     | 100% | 10.4s    | Fallback. Perfect format, 3× slower.         |
| qwen2.5-coder-1.5b-instruct    | 22%     | 44%     | 81%  | 2.5s     | Speed play if GRPO budget tight.             |
| smollm2-1.7b-instruct          | 7%      | 37%     | 63%  | 2.1s     | ❌ Ceiling too low.                           |
| (7 more)                       | 0%      | …       | …    | …        | ❌ Format-broken or wrong domain.             |

What the metrics mean:

- **`fmt%`**: raw output starts with `aws ` (no preamble, fences, or quotes). The agent's [inference.py:93](../inference.py) gate rejects everything else.
- **`+xtr%`**: `fmt%` after stripping markdown fences. The gap to `fmt%` counts cases where the model knows the answer but wraps it in junk.
- **`exact%`**: extracted command matches canonical token-for-token. The hardest metric.
- **`svc%`**: same AWS service as canonical. Domain orientation.
- **`op%`**: same service AND operation. The gap SFT closes most reliably.

The full table (11 models, 9 metrics, per-call logs) is in [data/sft/model_eval_full.json](sft/model_eval_full.json), which holds 297 records.
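
The five metric definitions above can be expressed per response roughly as follows. This is a sketch under stated assumptions, not the scoring code in `eval_lm_studio_models.py`: the helper names are invented, fence stripping is deliberately naive, and the service/operation parse assumes the command has the shape `aws <service> <operation> ...` with no global options before the service.

```python
def extract_command(raw: str) -> str:
    """Strip a surrounding markdown fence and stray quotes or backticks."""
    text = raw.strip()
    if text.startswith("```"):
        lines = text.splitlines()[1:]          # drop the opening fence line
        if lines and lines[-1].strip() == "```":
            lines = lines[:-1]                 # drop the closing fence line
        text = "\n".join(lines).strip()
    return text.strip("`\"'").strip()


def service_and_op(command: str):
    """Return (service, operation), or (None, None) if the shape is unexpected."""
    tokens = command.split()
    if len(tokens) >= 3 and tokens[0] == "aws":
        return tokens[1], tokens[2]
    return None, None


def score(raw: str, canonical: str) -> dict:
    extracted = extract_command(raw)
    svc, op = service_and_op(extracted)
    c_svc, c_op = service_and_op(canonical)
    return {
        "fmt": raw.startswith("aws "),                      # raw output, untouched
        "xtr": extracted.startswith("aws "),                # after fence stripping
        "exact": extracted.split() == canonical.split(),    # token-for-token match
        "svc": svc is not None and svc == c_svc,
        "op": svc is not None and (svc, op) == (c_svc, c_op),
    }


fenced = score("```bash\naws s3api list-buckets\n```", "aws s3api list-buckets")
print(fenced)
```

The fenced response illustrates the gap between `fmt%` and `+xtr%`: it fails the raw format check but is an exact match once the fence is removed.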

---

## 6. Eval harness

[data/eval_lm_studio_models.py](eval_lm_studio_models.py): 9.9 KB, reusable.

- Calls each chat model loaded in LM Studio at `http://localhost:1234/v1/chat/completions` (OpenAI-compatible API)
- Sends the same 27 held-out prompts to each model
- Extracts `aws ...` from the response (stripping fences / preamble)
- Compares against the canonical command from the val split
- Writes per-call detail + aggregate metrics to JSON
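
A single call against the endpoint listed above could be made with the standard library alone, as in this sketch. The request and response shapes follow the OpenAI-compatible chat-completions format; the helper names, the `temperature` and `max_tokens` values, and the sample prompts are assumptions rather than the harness's actual settings.

```python
import json
import urllib.request

LM_STUDIO_URL = "http://localhost:1234/v1/chat/completions"


def build_request(model: str, system_prompt: str, user_prompt: str,
                  max_tokens: int = 256) -> urllib.request.Request:
    """Build a chat-completions POST request without sending it."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 0.0,        # assumed: deterministic decoding for evals
        "max_tokens": max_tokens,  # assumed default
    }
    return urllib.request.Request(
        LM_STUDIO_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_request(
    "qwen2.5-coder-3b-instruct",
    "You are an AWS cloud engineer interacting with a real AWS environment via CLI...",
    "TASK: List all S3 buckets.",
)
# complete(request) needs a running LM Studio server, so it is not called here.
print(request.full_url)
```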

To re-run post-SFT:

```bash
.venv/bin/python data/eval_lm_studio_models.py \
    --max-per-combo 5 \
    --out data/sft/model_eval_postsft.json
```

A successful SFT run should hit the targets below. Targets are the predictions from [MODEL_EVALUATION.md §11](sft/MODEL_EVALUATION.md); actuals come from **our reference SFT run**:

| Metric    | Base  | Target  | **Actual (post-SFT)** |
|-----------|:-----:|:-------:|:---------------------:|
| `exact%`  | 39%   | 75%+    | **88.9%** ✅          |
| `op%`     | 61%   | 90%+    | **88.9%** ≈           |
| `svc%`    | 78%   | -       | **88.9%**             |
| `fmt%`    | 33%   | 100%    | **100.0%** ✅         |
| latency   | 2.03s | -       | **1.40s** (faster)    |

Every target from MODEL_EVALUATION.md is hit or essentially hit: `op%` lands 1.1 pp under its 90% target. Format compliance is now perfect, exact-match jumped 50 pp, and latency dropped from 2.03s to 1.40s.

---

## 7. HuggingFace publishing

[data/upload_sft_to_hf.py](upload_sft_to_hf.py) pushes the JSONL splits to HuggingFace Hub:

| Split    | Hub repo                                            |
|----------|-----------------------------------------------------|
| train    | `Sizzing/aws-rl-sft-qwen25coder3b-train`            |
| val      | `Sizzing/aws-rl-sft-qwen25coder3b-val`              |
| reserve  | `Sizzing/aws-rl-sft-qwen25coder3b-reserve`          |

The trained SFT adapter (output of [train/train_sft_lora.ipynb](../train/train_sft_lora.ipynb)) is published separately at:

- `Sizzing/aws-rl-sft-qwen25coder3b-adapter`

GRPO training picks it up by setting `SFT_ADAPTER = "Sizzing/aws-rl-sft-qwen25coder3b-adapter"` in [aws_rl_env_colab.ipynb](../aws_rl_env_colab.ipynb).

---

## 8. Files in this directory

| File                                                               | Purpose                                                            |
|--------------------------------------------------------------------|--------------------------------------------------------------------|
| [build_sft_dataset.py](build_sft_dataset.py)                       | Generator: AST extraction + 5 trajectory types + plausible outputs |
| [eval_lm_studio_models.py](eval_lm_studio_models.py)               | Base-model benchmark harness (LM Studio API)                       |
| [upload_sft_to_hf.py](upload_sft_to_hf.py)                         | Push the SFT splits to HuggingFace                                 |
| [sft/aws_rl_sft.train.jsonl](sft/aws_rl_sft.train.jsonl)           | 1,500 SFT training rows                                            |
| [sft/aws_rl_sft.val.jsonl](sft/aws_rl_sft.val.jsonl)               | 150 validation rows                                                |
| [sft/aws_rl_sft.reserve.jsonl](sft/aws_rl_sft.reserve.jsonl)       | 200 reserve rows                                                   |
| [sft/dataset_stats.json](sft/dataset_stats.json)                   | Per-split source / tier / task counts                              |
| [sft/MODEL_EVALUATION.md](sft/MODEL_EVALUATION.md)                 | **The base-model selection report (read this)**                    |
| [sft/model_eval_full.json](sft/model_eval_full.json)               | Per-call eval data (11 models × 27 prompts)                        |
| [sft/deepseek_r1_rerun.json](sft/deepseek_r1_rerun.json)           | R1 re-run with extended `max_tokens`                               |

---

## See also

- [Main README](../README.md)
- [data/sft/MODEL_EVALUATION.md](sft/MODEL_EVALUATION.md): full base-model selection writeup
- [train/README.md](../train/README.md): how this dataset is consumed by SFT training
- [compare/README.md](../compare/README.md): how the trained model is benchmarked vs the base
- [server/services/tasks/](../server/services/tasks/): source of truth for task definitions (the YAML the generator reads)
- [tests_tasks/](../tests_tasks/): canonical solutions the generator extracts via AST