# LifeStack GRPO Training Guide > **Model**: Qwen2.5-1.5B-Instruct → LoRA fine-tuned via GRPO > **Algorithm**: Group Relative Policy Optimization (TRL + Unsloth) > **Domains**: 8 daily-life domains including `transport_crisis` (5 modes), career, finances, relationships, physical health, mental wellbeing, time, code merge crisis --- ## 1. How GRPO Works in LifeStack GRPO trains the model by generating **groups of completions** for the same prompt and ranking them by reward. The model learns to prefer higher-reward actions without needing a separate critic network (unlike PPO). ``` Prompt (life scenario) │ ▼ LLM generates N=4 candidate JSON actions │ ▼ 5 reward functions score each action ├── format_compliance (is it valid JSON?) ├── plausibility (no zero-cost miracle fixes?) ├── task_success (did it actually help the LifeMetrics?) ├── milestone (did it unlock key progress gates?) └── reasoning (is the explanation coherent?) │ ▼ GRPO updates policy to prefer higher-reward completions ``` The curriculum starts at difficulty 1 (gym skipped, forgotten bill) and advances to difficulty 5 (flight cancelled + card declined + boss moved deadline) only when avg reward > 0.6 on the current level. --- ## 2. Free-Tier GPU Recommendation ### ✅ Use Kaggle — not Colab | | **Kaggle** ✅ | Colab Free ❌ | |---|---|---| | GPU | **T4 × 2** (or P100) | T4 × 1 | | VRAM | **32 GB** (dual T4) | 16 GB | | Session limit | **9 hours** | **90 minutes** | | Weekly GPU quota | **30 hrs / week** | ~12 hrs (varies) | | Storage between sessions | ✅ Persistent (save as Dataset) | ❌ Wiped on disconnect | | `bf16` support | ❌ T4 is too old → `fp16` used instead | ❌ same | | Auto-detects fp16 fallback | ✅ script handles it | ✅ script handles it | **Bottom line**: Colab free sessions cut off at 90 minutes. Even with checkpoints that means 3–4 restarts for a full 5-stage run. One Kaggle session (9h) completes the entire curriculum in a single stretch — no resume needed. ### Paid cloud (if you need speed) | Tier | GPU | VRAM | Time / Stage | Cost | |------|-----|------|-------------|------| | 🥇 Best | A100 80GB | 80 GB | ~25 min | ~$2.50/hr | | 🥈 Good | A100 40GB | 40 GB | ~45 min | ~$1.60/hr | | 🥉 Budget | L4 / RTX 3090 | 24 GB | ~90 min | ~$0.80/hr | ### VRAM math (why any T4 is fine) | Component | VRAM | |-----------|------| | Model (1.5B, 4-bit Unsloth) | ~1.2 GB | | LoRA adapters (r=16) | ~0.1 GB | | Optimizer states | ~2.0 GB | | Activations (batch=2, seq=1024) | ~3.5 GB | | **Total** | **~7 GB** | A single T4 (16 GB) has 9 GB headroom. Kaggle's dual T4 = 32 GB total. --- ## 3. Environment Setup ### ✅ Option A — Kaggle (Recommended Free Tier) Create a new Kaggle Notebook → Settings → Accelerator: **GPU T4 x2**. ```python # Cell 1 — Install deps !pip install unsloth trl datasets transformers accelerate matplotlib -q # Cell 2 — Clone repo !git clone https://github.com/YOUR_ORG/Meta-R2.git import os; os.chdir("Meta-R2") # Cell 3 — Smoke test (makes sure everything imports correctly) !python scripts/train_trl.py --dry-run # Cell 4 — Full curriculum (completes in ~5–6 hrs on T4 x2) OUTPUT = "/kaggle/working/lifestack_model" !python scripts/train_trl.py --stages 5 --prompts-per-stage 200 --output-dir {OUTPUT} ``` **Saving across sessions** (so you can resume if you hit 9h or re-run next week): ```python # After training, save the output as a Kaggle Dataset via the notebook sidebar: # Notebook → Data → + Add Output → name it "lifestack-model" # Next session: attach that dataset and pass --resume !python scripts/train_trl.py --resume --output-dir /kaggle/input/lifestack-model/lifestack_model ``` ### Option B — Google Colab (Secondary, needs Drive) Colab sessions cut at 90 min. You **must** mount Drive to survive disconnects. ```python # Cell 1 — Mount Google Drive for persistent storage from google.colab import drive drive.mount('/content/drive') OUTPUT = "/content/drive/MyDrive/lifestack_model" # Cell 2 — Install & clone !pip install unsloth trl datasets transformers accelerate matplotlib -q !git clone https://github.com/YOUR_ORG/Meta-R2.git import os; os.chdir("Meta-R2") # Cell 3 — Smoke test !python scripts/train_trl.py --dry-run # Cell 4 — First run !python scripts/train_trl.py --stages 5 --prompts-per-stage 200 --output-dir {OUTPUT} # Cell 5 — After disconnect, re-run cells 1-2, then: !python scripts/train_trl.py --resume --output-dir {OUTPUT} ``` > ⚠️ Without mounting Drive, every Colab disconnect loses all progress regardless of checkpoints. ### Option C — Local / Cloud GPU (Linux) ```bash git clone https://github.com/YOUR_ORG/Meta-R2.git && cd Meta-R2 python3 -m venv .venv && source .venv/bin/activate pip install unsloth trl datasets transformers accelerate matplotlib # On A100 (CUDA 12.x), use the fast Unsloth build: pip install "unsloth[cu121-ampere-torch230] @ git+https://github.com/unslothai/unsloth.git" python -c "import torch; print(torch.cuda.get_device_name(0))" python scripts/train_trl.py --dry-run python scripts/train_trl.py --stages 5 --prompts-per-stage 200 ``` --- ## 4. Checkpoint & Resume System Every **25 optimiser steps** the Trainer writes a checkpoint. If the session dies mid-stage, it picks up exactly where it left off. ### What gets saved ``` lifestack_model/ ├── curriculum_state.json ← {"completed_stage": 2, "next_difficulty": 3} ├── stage_1/ │ ├── checkpoint-25/ ← step 25 snapshot (weights + optimizer) │ ├── checkpoint-50/ ← step 50 snapshot │ ├── checkpoint-75/ ← step 75 (oldest auto-deleted at 4th save) │ └── model.safetensors ← written when stage completes cleanly ├── stage_2/ │ └── checkpoint-25/ ← mid-stage when session was cut └── stage_3/ ... ``` Only the **3 most recent checkpoints** per stage are kept (`save_total_limit=3`) to save disk. ### Resume commands ```bash # Kaggle / Colab: auto-resume after any disconnect python scripts/train_trl.py --resume # Jump to a specific stage (e.g. re-run stage 3 from scratch) python scripts/train_trl.py --start-stage 3 # Resume + change number of stages (e.g. add 2 more stages) python scripts/train_trl.py --resume --stages 7 ``` How `--resume` works: 1. Reads `curriculum_state.json` → knows stage 2 completed, next is stage 3 2. Calls `find_latest_checkpoint("stage_3/")` → finds `checkpoint-25` 3. `trainer.train(resume_from_checkpoint="stage_3/checkpoint-25")` → restores weights + optimizer state → continues from step 25 --- ## 5. Training Commands ### Dry-Run — No GPU Required ```bash python scripts/train_trl.py --dry-run ``` - 1 step, 4 prompts, CPU only, ~30 seconds - Expected output: `✅ DRY-RUN PASSED` ### Full Curriculum (Kaggle / cloud) ```bash python scripts/train_trl.py --stages 5 --prompts-per-stage 200 --output-dir ./lifestack_model ``` ### Fast Dev Run (1 stage, test iterations) ```bash python scripts/train_trl.py --stages 1 --prompts-per-stage 50 ``` ### All CLI Flags | Flag | Default | Description | |------|---------|-------------| | `--dry-run` | — | 1-step CPU smoke test | | `--stages` | `5` | Number of curriculum stages | | `--prompts-per-stage` | `100` | Prompts per stage | | `--output-dir` | `./lifestack_model` | Model save path | | `--resume` | `False` | Resume from `curriculum_state.json` + latest checkpoint | | `--start-stage` | `None` | Force-start from a specific stage number | --- ## 6. What Gets Trained On The dataset covers **all 8 domains equally** using round-robin sampling: | # | Domain | Scenario Examples | |---|--------|-----------------| | 1 | `career` | Boss drops 10-hr task at 5 PM / performance review rumours | | 2 | `finances` | Card declined, late fee / tax audit, emergency fund needed | | 3 | `relationships` | Partner feels like a roommate / sibling needs emergency help | | 4 | `physical_health` | Fainting spell at office / warning signs ignored too long | | 5 | `mental_wellbeing` | Burnout, inbox at 500 / panic attack at work | | 6 | `time` | Double-booked all weekend / drowning in obligations | | 7 | `transport_crisis` | **5 sub-modes** — see below | | 8 | `code_merge_crisis` | Botched merge took down staging / CTO asking for ETA | ### `transport_crisis` sub-modes (randomly drawn each time) | Sub-type | Scenario | |----------|---------| | `flight_crisis` | Flight cancelled + card declined + deadline moved to Sunday | | `train_delay` | Signal failure, 90-min delay, 9 AM client meeting | | `car_breakdown` | Engine seized on highway, tow + rental = $400, rental shortage exo-event | | `rideshare_surge` | 9x surge pricing, major presentation in 2 hours | | `transit_strike` | City-wide indefinite strike, e-bike shortage exo-event | With 5 personalities × 5 difficulty levels × 8 domains, a 200-prompt stage has strong variation across ~3,000+ unique scenario combinations. --- ## 7. Reward Functions | Function | What it checks | Range | |----------|---------------|-------| | `reward_format_fn` | Valid JSON + all required fields | `[-1, 1]` | | `reward_plausibility_fn` | No miracle zero-cost fixes | `{-1, 1}` | | `reward_task_success_fn` | LifeMetrics improved + no cascade spread | `[-1, 1]` | | `reward_milestone_fn` | Logical progress gates hit | `[0, 1]` | | `reward_reasoning_fn` | Reasoning coherence + domain keywords | `[-0.1, 0.1]` | ``` +1.0 │ Perfect JSON, all metrics improved, milestone hit 0.5 │ Reasonable action, some metrics improved 0.0 │ Neutral / no change -0.5 │ PLAUSIBILITY_VIOLATION or CASCADE_SPREAD_WIDER -1.0 │ Refusal / empty / broke multiple metrics ``` --- ## 8. Monitoring Training ### TensorBoard (local/cloud only) ```bash tensorboard --logdir ./lifestack_model # open http://localhost:6006 ``` Watch: `train/reward` rising toward 0.5+, `train/kl_divergence` staying < 0.5. ### Console log (every 5 steps) ``` [step 25] reward=0.312 | outcome=0.124 | containment=0.800 | efficiency=0.710 [ckpt] Curriculum state saved → stage=1, next_diff=2 ``` ### Live JSONL log ```bash tail -f training_logs/generations.jsonl | python -c " import sys, json for line in sys.stdin: d = json.loads(line) print(f\"step={d['step']} reward={d['reward']:.3f} action={d['action'].get('action_type')}\") " ``` --- ## 9. Expected Training Results | Stage | Difficulty | Expected Avg Reward | Progression Rule | |-------|-----------|---------------------|-----------------| | 1 | 1 — flat tyre, forgotten bill | 0.55 – 0.70 | advances if > 0.60 | | 2 | 2 — project surge, train delay | 0.45 – 0.65 | advances if > 0.60 | | 3 | 3 — health scare, car breakdown | 0.35 – 0.55 | advances if > 0.60 | | 4 | 4 — performance review, surge pricing | 0.25 – 0.50 | advances if > 0.60 | | 5 | 5 — transit strike, total collapse | 0.20 – 0.45 | — | --- ## 10. Post-Training Artifacts ``` lifestack_model/ ├── curriculum_state.json ← curriculum progress tracker ├── model.safetensors ← final LoRA adapter weights ├── adapter_config.json ├── tokenizer.json / tokenizer_config.json └── stage_1/ ... stage_5/ ├── checkpoint-25/ ... checkpoint-75/ ← step snapshots └── model.safetensors ← completed stage weights training_logs/ └── generations.jsonl ← per-step reward breakdown ``` Validate the final save: ```bash python -c "from scripts.train_trl import validate_saved_model; validate_saved_model('./lifestack_model')" ``` --- ## 11. Troubleshooting | Symptom | Likely Cause | Fix | |---------|-------------|-----| | `ImportError: unsloth` | Not installed | `pip install unsloth` | | `CUDA out of memory` | Batch too large | `per_device_train_batch_size=1` | | All rewards = -0.5 | Env reset failing | Run `--dry-run` to surface the error | | KL divergence > 1.0 | LR too high | Lower `learning_rate` to `1e-6` | | `Task missing required fields` | Domain generator bug | Check `TaskGenerator.generate()` | | reward stuck at 0.0 | Model refuses JSON | Check `reward_format_fn` — should be -1.0 not 0.0 | | Colab disconnect lost progress | Drive not mounted | Mount Drive before running; use `--resume` | | `checkpoint-*` dirs missing | `save_steps` too high | Already set to 25 in this script | --- ## 12. Quick Reference ```bash # Smoke test (CPU, ~30s) python scripts/train_trl.py --dry-run # Kaggle full run (~5-6 hr, T4 x2) python scripts/train_trl.py --stages 5 --prompts-per-stage 200 # Resume after any disconnect python scripts/train_trl.py --resume # Jump to stage 3 (e.g. stages 1-2 already done) python scripts/train_trl.py --start-stage 3 # Validate model saved correctly python -c "from scripts.train_trl import validate_saved_model; validate_saved_model('./lifestack_model')" # Plot evaluation reward curve python -c "from scripts.train_trl import evaluate_and_plot; evaluate_and_plot('./lifestack_model')" ```