# `data/`: SFT Dataset Generation & Base-Model Selection
[← back to main README](../README.md)
This directory holds the SFT training corpus, the dataset generator that produced it, and the rigorous benchmark we used to pick the base model. Together they answer two questions a hackathon judge should be able to verify in under five minutes:
1. **What did we train on?** A 1,500-row synthetic SFT corpus with five trajectory types covering success, continuation, failure recovery, verification, and hint usage. ([§1](#1-sft-dataset-generation))
2. **Why this base model?** A reproducible 11-model benchmark across 27 held-out prompts. **Qwen2.5-Coder-3B-Instruct** wins on every metric that matters. ([§5](#5-base-model-selection-overview))
> ![Top 4 candidate models on the held-out benchmark](../docs/figures/model_eval_chart.png)
---
## Table of contents
1. [SFT dataset generation](#1-sft-dataset-generation)
2. [Five trajectory types](#2-five-trajectory-types)
3. [Tier weighting](#3-tier-weighting)
4. [Dataset format & artifacts](#4-dataset-format--artifacts)
5. [Base-model selection: overview](#5-base-model-selection-overview)
6. [Eval harness](#6-eval-harness)
7. [HuggingFace publishing](#7-huggingface-publishing)
8. [Files in this directory](#8-files-in-this-directory)
---
## 1. SFT dataset generation
[data/build_sft_dataset.py](build_sft_dataset.py): 27 KB, single-script generator.
### Approach
The dataset is **synthetically generated** but grounded in canonical solutions extracted from our integration test suite. Two design decisions are worth flagging to judges:
#### AST-based extraction, not pytest execution
Each `tests_tasks/test_<tier>_tasks.py` file has a top-level constant (`WARMUP_COMMANDS`, `BEGINNER_COMMANDS`, …) mapping `task_id → canonical AWS CLI command`. We extract these via Python's `ast` module; we do **not** execute the test file. Reasons:
1. `pytest` fixtures would spin up a MiniStack, hit AWS APIs, and add 30+ seconds of overhead per generation run.
2. Static extraction is deterministic, so there is no flake risk. The dataset is reproducible bit-for-bit given a seed.
3. The canonical solutions are intentionally simple constant declarations that AST can parse without import side effects.
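The static extraction described above can be sketched roughly as follows. This is a minimal illustration, not the code in `build_sft_dataset.py`: the function name `extract_command_map`, the sample source, and the assumption that each constant is a plain `dict` literal of `task_id` to command string are all hypothetical.

```python
import ast


def extract_command_map(source: str, constant_name: str) -> dict:
    """Return the literal dict assigned to `constant_name` at module top level."""
    tree = ast.parse(source)  # parses only; nothing in the file is imported or run
    for node in tree.body:  # top-level statements only
        if not isinstance(node, ast.Assign):
            continue
        names = [t.id for t in node.targets if isinstance(t, ast.Name)]
        if constant_name in names:
            # literal_eval accepts only literals, so no side effects are possible
            return ast.literal_eval(node.value)
    raise KeyError(constant_name)


# Hypothetical stand-in for a tests_tasks/test_<tier>_tasks.py file.
EXAMPLE_SOURCE = '''
import pytest

WARMUP_COMMANDS = {
    1: "aws s3api list-buckets",
    2: "aws dynamodb list-tables",
}

def test_something(ministack):
    raise RuntimeError("never executed by the extractor")
'''

commands = extract_command_map(EXAMPLE_SOURCE, "WARMUP_COMMANDS")
print(commands[1])
```

Because the file is only parsed, the `import pytest` line and the fixture-based test function in the sample never run.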
#### Plausible-output simulation
When generating multi-step continuations, we don't have a real MiniStack response to feed back into the user message, so we have to fabricate one. The generator maps each AWS operation (`list-buckets`, `create-table`, `describe-instances`, …) to a JSON template, then interpolates the right resource names from the task. So an `aws s3api list-buckets` step in the user prompt history has output like:
```json
{"Buckets":[{"Name":"my-app-data","CreationDate":"2026-04-15T..."}]}
```
…instead of the empty `{"Buckets":[]}` you'd get from a fresh MiniStack. This is the difference between the SFT model learning "first step, always answer with the canonical command" (degenerate) and "first step depends on what's already been done" (correct).
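One way to picture the template-and-interpolate step is the sketch below. The template table, the `{name}` placeholder convention, and the function names are assumptions made for illustration; the real generator's templates are not shown in this README.

```python
import json

# Hypothetical templates keyed by AWS operation; "{name}" marks the slot
# that receives the resource name taken from the task.
OUTPUT_TEMPLATES = {
    "list-buckets": {"Buckets": [{"Name": "{name}"}]},
    "list-tables": {"TableNames": ["{name}"]},
}


def fill(template, name):
    """Recursively substitute the resource name into a nested template."""
    if isinstance(template, dict):
        return {key: fill(value, name) for key, value in template.items()}
    if isinstance(template, list):
        return [fill(value, name) for value in template]
    if isinstance(template, str):
        return template.replace("{name}", name)
    return template


def simulate_output(operation, resource_name):
    """Return a plausible JSON string for `operation`, or "{}" if unknown."""
    template = OUTPUT_TEMPLATES.get(operation)
    if template is None:
        return "{}"
    return json.dumps(fill(template, resource_name), separators=(",", ":"))


print(simulate_output("list-buckets", "my-app-data"))
# prints {"Buckets":[{"Name":"my-app-data"}]}
```

Filling a nested structure and serializing it afterwards, rather than formatting a JSON string directly, keeps the output valid JSON even if a resource name contains quotes.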
### Dynamic-ID filtering
Some tests reference resources whose IDs only exist at runtime: security groups (`sg-…`), subnets (`subnet-…`), VPCs (`vpc-…`), and instance IDs (`i-…`). These commands cannot be deterministically captured by static extraction. The generator skips any task whose canonical command contains those patterns. The result: 72 unique tasks make it into the train split (out of 134 total tasks), all of which are deterministically reproducible.
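A filter of this kind might look like the sketch below. The regular expression is an assumption (a known prefix followed by 8 to 17 hex characters); the exact patterns the generator matches are not given here, and the example task map is made up.

```python
import re

# Assumed shape of a runtime ID: prefix, hyphen, then 8-17 hex characters.
DYNAMIC_ID = re.compile(r"\b(?:sg|subnet|vpc|i)-[0-9a-f]{8,17}\b")


def has_dynamic_id(command: str) -> bool:
    """True if the command embeds an ID that only exists at runtime."""
    return DYNAMIC_ID.search(command) is not None


def keep_static(tasks: dict) -> dict:
    """Drop every task whose canonical command contains a dynamic ID."""
    return {tid: cmd for tid, cmd in tasks.items() if not has_dynamic_id(cmd)}


# Hypothetical task map for illustration.
example_tasks = {
    1: "aws s3api list-buckets",
    2: "aws ec2 terminate-instances --instance-ids i-0abc1234def567890",
    3: "aws ec2 describe-vpcs --vpc-ids vpc-0a1b2c3d",
    4: "aws ec2 create-vpc --cidr-block 10.0.0.0/16",
}
static_tasks = keep_static(example_tasks)
print(sorted(static_tasks))
```

Requiring hex characters after the prefix keeps flags such as `--vpc-ids` and operations such as `create-vpc` from being mistaken for IDs.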
---
## 2. Five trajectory types
The SFT corpus mixes five distinct trajectory shapes so the model learns to handle real multi-turn agent behavior, not just one-shot question answering. Actual proportions (from [data/sft/dataset_stats.json](sft/dataset_stats.json)):
| Source | Train pct (target) | Train rows | What the model sees |
|----------------------------|:------------------:|:----------:|-------------------------------------------------------------------------------------------|
| `success_first_step`       | 55.1% (55%) | 826 | User → Task description → assistant emits the canonical command |
| `multi_step_continuation`  | 20.1% (20%) | 301 | User → Task description + a baked-in history of N-1 prior commands and their outputs → assistant emits step N |
| `failure_recovery`         | 15.5% (15%) | 232 | User → Task description + step 1 of a wrong command and its simulated error → assistant emits the recovery command |
| `verification`             | 4.5% (5%)   | 67  | User → Task already complete → assistant emits a read-only verification command |
| `hint_usage`               | 4.9% (5%)   | 74  | User → Task description → assistant emits `aws help --task-hint` (the agent action that requests a hint) |
Why include the last four sources at all?
- **`multi_step_continuation`** trains continuation behavior. Without it, the model overfits to step 1 and degrades on later turns.
- **`failure_recovery`** teaches the model that a typo or wrong command is recoverable. The reward signal during GRPO is dense, so the model needs to know what "try again" looks like.
- **`verification`** trains the model to recognize when a task is done and respond appropriately. Production agents must distinguish "do something" from "confirm it's done".
- **`hint_usage`** lets the model learn that `aws help --task-hint` is the in-environment way to request help, not just a literal CLI command.
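To make the continuation shape concrete, here is a rough sketch of how a multi-step task could be expanded into one prompt/answer pair per step, using the user-message layout from the chat-message example in §4. The helper names are hypothetical, and two details are assumptions: each completed step is credited an equal share of the reward, and a first-step prompt simply omits the `PREVIOUS COMMANDS` block.

```python
def render_history(history):
    """history: list of (command, output, reward) for the steps already taken."""
    lines = ["PREVIOUS COMMANDS:"]
    for i, (command, output, reward) in enumerate(history, start=1):
        lines += [f"[{i}] $ {command}", f" output: {output}", f" reward: {reward:.2f}"]
    return "\n".join(lines)


def step_rows(task, steps):
    """steps: list of (canonical_command, simulated_output).

    Returns one (user_message, assistant_command) pair per step.
    """
    total = len(steps)
    rows = []
    for k, (command, _output) in enumerate(steps):
        sections = [f"TASK: {task}"]
        if k > 0:
            # Assumption: every completed step earns an equal share of the reward.
            history = [(c, o, (i + 1) / total) for i, (c, o) in enumerate(steps[:k])]
            sections.append(render_history(history))
        observation = (
            "CURRENT OBSERVATION:\n"
            f"Progress: {k / total:.2f} Achieved: False Step: {k + 1}"
        )
        rows.append(("\n\n".join(sections) + "\n\n---\n\n" + observation, command))
    return rows


TASK = "Create an S3 bucket named my-app-data and enable versioning on it."
STEPS = [
    ("aws s3 mb s3://my-app-data", "make_bucket: my-app-data"),
    ("aws s3api put-bucket-versioning --bucket my-app-data "
     "--versioning-configuration Status=Enabled", ""),
]
rows = step_rows(TASK, STEPS)
print(rows[1][0])
```

A two-step task yields two pairs, which is why multi-step tiers contribute more rows per task than single-command tiers.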
---
## 3. Tier weighting
[data/build_sft_dataset.py:54-60](build_sft_dataset.py) defines the sampling weights:
| Tier | Weight | Train rows | Why |
|--------------|:------:|:----------:|------------------------------------------------------------------------------------|
| warmup       | 0.50   | 456        | Highest sampling weight. Format-locks the model on the simplest possible "aws X list" pattern. |
| beginner     | 0.30   | 378        | Single-resource creation, the bread and butter.                                     |
| intermediate | 0.15 | 666 * | Multi-step workflows. Note actual count > target because each task contributes more rows via multi_step_continuation. |
| advanced | 0.05 | 0 | Cross-service architectures. Filtered out post-extraction (most have dynamic IDs). |
| expert | 0.00 | 0 | SRE / drift / security-posture. **Intentionally excluded from SFT.** |
> **Why expert tier is excluded from SFT.** The expert tasks (drift detection, security audits) have *randomized* state checks, so there is no canonical command sequence. Trying to SFT on them would teach the model a particular fix script that is *wrong* on most episodes. These tasks are reserved for GRPO, where the env's `state_checks` reward signal handles the randomization correctly.
`*` Intermediate row count exceeds the simple weight because the multi-step trajectory generator naturally produces multiple rows per task (one for step 1, step 2, etc.).
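Seeded, weight-driven tier selection can be sketched with the standard library as below. The weights are the ones from the table; the function name and the use of `random.choices` are assumptions about how such sampling could be done, not a description of the generator's internals.

```python
import random
from collections import Counter

# Sampling weights from the tier table above.
TIER_WEIGHTS = {
    "warmup": 0.50,
    "beginner": 0.30,
    "intermediate": 0.15,
    "advanced": 0.05,
    "expert": 0.00,
}


def sample_tiers(n, seed):
    """Draw n tier labels with a private, seeded RNG so runs are repeatable."""
    rng = random.Random(seed)
    tiers = list(TIER_WEIGHTS)
    weights = [TIER_WEIGHTS[tier] for tier in tiers]
    return rng.choices(tiers, weights=weights, k=n)


counts = Counter(sample_tiers(1500, seed=0))
print(counts.most_common())
```

Counts drawn this way follow the weights only approximately, and they will not match the table's row counts: as noted above, the advanced tier is filtered out after extraction and intermediate tasks expand into several rows each.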
---
## 4. Dataset format & artifacts
### JSONL chat-message schema
```json
{
"messages": [
{"role": "system", "content": "You are an AWS cloud engineer interacting with a real AWS environment via CLI..."},
{"role": "user", "content": "TASK: Create an S3 bucket named my-app-data and enable versioning on it.\n\nPREVIOUS COMMANDS:\n[1] $ aws s3 mb s3://my-app-data\n output: make_bucket: my-app-data\n reward: 0.50\n\n---\n\nCURRENT OBSERVATION:\nProgress: 0.50 Achieved: False Step: 2"},
{"role": "assistant", "content": "aws s3api put-bucket-versioning --bucket my-app-data --versioning-configuration Status=Enabled"}
],
"difficulty": "intermediate",
"source": "multi_step_continuation",
"task_id": 42
}
```
Every row carries the `difficulty`, `source`, and `task_id` metadata β€” useful for filtering, ablations, and debugging.
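As an illustration of using that metadata, the sketch below reads JSONL rows, checks that the four top-level keys are present, and groups rows by `source` and `difficulty`. The two sample rows are made up and have empty `messages` lists; `iter_rows` is a hypothetical helper, not part of the repo.

```python
import io
import json
from collections import Counter

REQUIRED_KEYS = {"messages", "difficulty", "source", "task_id"}


def iter_rows(fh):
    """Yield parsed rows from a JSONL file object, checking the metadata keys."""
    for line_no, line in enumerate(fh, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        missing = REQUIRED_KEYS - row.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        yield row


# Two made-up rows standing in for a real split.
SAMPLE = io.StringIO(
    '{"messages": [], "difficulty": "warmup", "source": "success_first_step", "task_id": 1}\n'
    '{"messages": [], "difficulty": "intermediate", "source": "multi_step_continuation", "task_id": 42}\n'
)

rows = list(iter_rows(SAMPLE))
by_source = Counter(row["source"] for row in rows)
intermediate = [row for row in rows if row["difficulty"] == "intermediate"]
print(by_source)
```

To run it against a real split, pass an open file instead of `SAMPLE`, for example `open("data/sft/aws_rl_sft.train.jsonl", encoding="utf-8")`.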
### Artifacts
[data/sft/](sft/):
| File | Size | Rows | Unique tasks | Use |
|--------------------------------------------------------------|------:|------:|:------------:|------------------------------------------------|
| [aws_rl_sft.train.jsonl](sft/aws_rl_sft.train.jsonl) | 2.2 MB | 1,500 | 72 | SFT training |
| [aws_rl_sft.val.jsonl](sft/aws_rl_sft.val.jsonl) | 218 KB | 150 | 63 | SFT validation; basis for [MODEL_EVALUATION.md](sft/MODEL_EVALUATION.md) |
| [aws_rl_sft.reserve.jsonl](sft/aws_rl_sft.reserve.jsonl) | 294 KB | 200 | 66 | Held-out reserve for post-SFT regression checks |
| [dataset_stats.json](sft/dataset_stats.json)                 | 3.4 KB | n/a | n/a | Per-split source/tier/task breakdowns |
| [MODEL_EVALUATION.md](sft/MODEL_EVALUATION.md)               | 15 KB | n/a | n/a | Full model-selection writeup ([§5](#5-base-model-selection-overview)) |
| [model_eval_full.json](sft/model_eval_full.json)             | 209 KB | 297 | n/a | Per-call eval data (11 models × 27 prompts) |
| [deepseek_r1_rerun.json](sft/deepseek_r1_rerun.json)         | 5.3 KB | 27 | n/a | DeepSeek R1 re-run with `max_tokens=2048` |
---
## 5. Base-model selection: overview
This is the most rigorous decision in the whole project. Full reasoning, per-model verdicts, and methodology live in **[data/sft/MODEL_EVALUATION.md](sft/MODEL_EVALUATION.md)**, a 270-line standalone report. Read it before judging the project's technical depth; it's what convinces us we're training the right thing.
The 30-second summary:
| Model | exact% | op% | fmt% | Latency | Verdict |
|--------------------------------|:-----:|:----:|:------:|:-------:|--------------------------------------|
| **qwen2.5-coder-3b-instruct**  | **41%** | **63%** | 85% | **3.1s** | ✅ Train this. Highest exact, fastest viable. |
| qwen/qwen3-4b-2507             | 33% | 59% | 100% | 10.4s | Fallback. Perfect format, 3× slower. |
| qwen2.5-coder-1.5b-instruct    | 22% | 44% | 81% | 2.5s | Speed play if GRPO budget tight. |
| smollm2-1.7b-instruct          | 7% | 37% | 63% | 2.1s | ❌ Ceiling too low. |
| (7 more)                       | 0% | … | … | … | ❌ Format-broken or wrong domain. |
> ![Per-model comparison: 5 quality metrics + latency](../docs/figures/model_eval_chart.png)
What the metrics mean:
- **`fmt%`**: raw output starts with `aws ` (no preamble, fences, or quotes). The agent's [inference.py:93](../inference.py) gate rejects everything else.
- **`+xtr%`**: `fmt%` after stripping markdown fences. The gap between `+xtr%` and `fmt%` measures how often the model knows the answer but wraps it in junk.
- **`exact%`**: extracted command matches canonical token-for-token. The hardest metric.
- **`svc%`**: same AWS service as canonical. Domain orientation.
- **`op%`**: same service AND operation. The gap SFT closes most reliably.
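Read literally, the five definitions above could be scored per response along these lines. This is a sketch under assumptions, not the harness code: `strip_fences` and `score` are hypothetical names, only a leading markdown fence is stripped, and "token-for-token" is interpreted as equality of `shlex.split` token lists.

```python
import shlex

FENCE = "`" * 3  # a markdown code fence, built this way to keep this block readable


def strip_fences(raw):
    """Remove one surrounding markdown fence (with optional language tag)."""
    text = raw.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip().startswith(FENCE):
            lines = lines[:-1]  # drop the closing fence line
        text = "\n".join(lines).strip()
    return text


def tokens(command):
    """Shell-style tokens, or an empty list if the quoting is malformed."""
    try:
        return shlex.split(command)
    except ValueError:
        return []


def score(raw, canonical):
    """Return the five boolean metrics for one model response."""
    extracted = strip_fences(raw)
    got = tokens(extracted) if extracted.startswith("aws ") else []
    want = tokens(canonical)
    return {
        "fmt": raw.startswith("aws "),
        "xtr": extracted.startswith("aws "),
        "exact": bool(got) and got == want,
        "svc": len(got) > 1 and got[1:2] == want[1:2],
        "op": len(got) > 2 and got[1:3] == want[1:3],
    }


fenced = FENCE + "bash\naws s3api list-buckets\n" + FENCE
print(score(fenced, "aws s3api list-buckets"))
```

In the fenced example, `fmt` is false while `xtr`, `exact`, `svc`, and `op` are all true, which is exactly the "right answer, wrong wrapping" case that `+xtr%` is meant to expose.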
The full table (11 models, 9 metrics, per-call logs) is in [data/sft/model_eval_full.json](sft/model_eval_full.json), 297 records in total.
---
## 6. Eval harness
[data/eval_lm_studio_models.py](eval_lm_studio_models.py): 9.9 KB, reusable.
- Calls each chat model loaded in LM Studio at `http://localhost:1234/v1/chat/completions` (OpenAI-compatible API)
- Sends the same 27 held-out prompts to each model
- Extracts `aws ...` from the response (stripping fences / preamble)
- Compares against the canonical command from the val split
- Writes per-call detail + aggregate metrics to JSON
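For readers unfamiliar with the OpenAI-compatible endpoint, a single call from such a harness might look like the sketch below, using only the standard library. The function names, `temperature=0`, and the default `max_tokens=256` are assumptions chosen for illustration; the real script's request settings are not documented here.

```python
import json
import urllib.request

LM_STUDIO_URL = "http://localhost:1234/v1/chat/completions"


def build_request(model, messages, max_tokens=256):
    """Build a chat-completions POST request (settings here are assumed)."""
    payload = {
        "model": model,
        "messages": messages,
        "temperature": 0,  # assumption: deterministic decoding for benchmarking
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        LM_STUDIO_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(model, messages, timeout=60):
    """Send the request and return the assistant text of the first choice."""
    with urllib.request.urlopen(build_request(model, messages), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With LM Studio serving a model locally, `complete("<model-id>", [{"role": "user", "content": "TASK: ..."}])` would return the raw text that the harness then extracts and compares; `<model-id>` stands for whatever identifier LM Studio reports for the loaded model.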
To re-run post-SFT:
```bash
.venv/bin/python data/eval_lm_studio_models.py \
--max-per-combo 5 \
--out data/sft/model_eval_postsft.json
```
A successful SFT run should see (predictions from [MODEL_EVALUATION.md §11](sft/MODEL_EVALUATION.md), and **actuals from our reference SFT run**):
| Metric | Base | Target | **Actual (post-SFT)** |
|-----------|:-----:|:-------:|:---------------------:|
| `exact%`  | 39%   | 75%+    | **88.9%** ✅          |
| `op%`     | 61%   | 90%+    | **88.9%** ≈           |
| `svc%`    | 78%   | n/a     | **88.9%**             |
| `fmt%`    | 33%   | 100%    | **100.0%** ✅         |
| latency   | 2.03s | n/a     | **1.40s** (faster)    |
Every target from MODEL_EVALUATION.md is hit or essentially hit. Format compliance is now perfect; exact-match jumped 50 pp; the model is faster *and* tighter.
> ![Base vs SFT comparison (eval metrics)](../docs/figures/base_vs_sft_success.png)
> ![Single-step eval base vs SFT](../docs/figures/single_step_eval.png)
---
## 7. HuggingFace publishing
[data/upload_sft_to_hf.py](upload_sft_to_hf.py) pushes the JSONL splits to HuggingFace Hub:
| Split | Hub repo |
|----------|-----------------------------------------------------|
| train | `Sizzing/aws-rl-sft-qwen25coder3b-train` |
| val | `Sizzing/aws-rl-sft-qwen25coder3b-val` |
| reserve | `Sizzing/aws-rl-sft-qwen25coder3b-reserve` |
The trained SFT adapter (output of [train/train_sft_lora.ipynb](../train/train_sft_lora.ipynb)) is published separately at:
- `Sizzing/aws-rl-sft-qwen25coder3b-adapter`
GRPO training picks it up by setting `SFT_ADAPTER = "Sizzing/aws-rl-sft-qwen25coder3b-adapter"` in [aws_rl_env_colab.ipynb](../aws_rl_env_colab.ipynb).
---
## 8. Files in this directory
| File | Purpose |
|--------------------------------------------------------------------|--------------------------------------------------------------------|
| [build_sft_dataset.py](build_sft_dataset.py)                       | Generator: AST extraction + 5 trajectory types + plausible outputs |
| [eval_lm_studio_models.py](eval_lm_studio_models.py) | Base-model benchmark harness (LM Studio API) |
| [upload_sft_to_hf.py](upload_sft_to_hf.py) | Push the SFT splits to HuggingFace |
| [sft/aws_rl_sft.train.jsonl](sft/aws_rl_sft.train.jsonl) | 1,500 SFT training rows |
| [sft/aws_rl_sft.val.jsonl](sft/aws_rl_sft.val.jsonl) | 150 validation rows |
| [sft/aws_rl_sft.reserve.jsonl](sft/aws_rl_sft.reserve.jsonl) | 200 reserve rows |
| [sft/dataset_stats.json](sft/dataset_stats.json) | Per-split source / tier / task counts |
| [sft/MODEL_EVALUATION.md](sft/MODEL_EVALUATION.md) | **The base-model selection report (read this)** |
| [sft/model_eval_full.json](sft/model_eval_full.json)               | Per-call eval data (11 models × 27 prompts)                        |
| [sft/deepseek_r1_rerun.json](sft/deepseek_r1_rerun.json) | R1 re-run with extended `max_tokens` |
---
## See also
- [Main README](../README.md)
- [data/sft/MODEL_EVALUATION.md](sft/MODEL_EVALUATION.md): full base-model selection writeup
- [train/README.md](../train/README.md): how this dataset is consumed by SFT training
- [compare/README.md](../compare/README.md): how the trained model is benchmarked vs the base
- [server/services/tasks/](../server/services/tasks/): source of truth for task definitions (the YAML the generator reads)
- [tests_tasks/](../tests_tasks/): canonical solutions the generator extracts via AST