| # LifeStack GRPO Training Guide |
|
|
| > **Model**: Qwen2.5-1.5B-Instruct, LoRA fine-tuned via GRPO |
| > **Algorithm**: Group Relative Policy Optimization (TRL + Unsloth) |
| > **Domains**: 8 daily-life domains including `transport_crisis` (5 modes), career, finances, relationships, physical health, mental wellbeing, time, code merge crisis |
| |
| --- |
| |
| ## 1. How GRPO Works in LifeStack |
| |
| GRPO trains the model by generating **groups of completions** for the same prompt and ranking them by reward. The model learns to prefer higher-reward actions without needing a separate critic network (unlike PPO). |
| |
| ``` |
| Prompt (life scenario) |
|         │ |
|         ▼ |
| LLM generates N=4 candidate JSON actions |
|         │ |
|         ▼ |
| 5 reward functions score each action |
|   ├── format_compliance   (is it valid JSON?) |
|   ├── plausibility        (no zero-cost miracle fixes?) |
|   ├── task_success        (did it actually help the LifeMetrics?) |
|   ├── milestone           (did it unlock key progress gates?) |
|   └── reasoning           (is the explanation coherent?) |
|         │ |
|         ▼ |
| GRPO updates policy to prefer higher-reward completions |
| ``` |
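The "ranking" step is commonly implemented as a group-relative advantage: each completion's reward is standardised against the mean and spread of its own group, so no critic is needed. A minimal sketch, assuming the usual GRPO normalisation; the function name, the `eps` constant, the choice of population standard deviation, and the example rewards are illustrative and not taken from `train_trl.py`:

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Standardise rewards within one prompt's group of completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    # eps keeps the division finite when every completion scores the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for the N=4 candidate actions of a single prompt.
advantages = group_relative_advantages([1.0, 0.5, 0.0, -0.5])
```

Completions that score above their group's mean get a positive advantage and are reinforced; those below the mean are pushed down. A group where all four rewards are equal yields zero advantage and therefore no learning signal.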
| |
| The curriculum starts at difficulty 1 (gym skipped, forgotten bill) and works up to difficulty 5 (flight cancelled + card declined + boss moved deadline), advancing only when the average reward on the current level exceeds 0.6. |
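The advancement rule can be sketched as a small function. This is an illustration under stated assumptions: the names are hypothetical, and repeating the same difficulty when the threshold is missed is a guess about how the script behaves, not something this guide confirms.

```python
ADVANCE_THRESHOLD = 0.6  # "avg reward > 0.6" from the rule above
MAX_DIFFICULTY = 5


def next_difficulty(current, stage_rewards):
    """Return the difficulty to use for the next curriculum stage."""
    avg = sum(stage_rewards) / len(stage_rewards)
    # Strictly greater than the threshold, and never beyond the top level.
    if avg > ADVANCE_THRESHOLD and current < MAX_DIFFICULTY:
        return current + 1
    return current  # assumption: stay on the same level and retry
```

For example, `next_difficulty(1, [0.7, 0.65])` moves to level 2, while an average of exactly 0.6 stays on level 1 because the comparison is strict.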
| |
| --- |
| |
| ## 2. Free-Tier GPU Recommendation |
| |
| ### ✅ Use Kaggle, not Colab |
| |
| | | **Kaggle** ✅ | Colab Free ❌ | |
| |---|---|---| |
| | GPU | **T4 × 2** (or P100) | T4 × 1 | |
| | VRAM | **32 GB** (dual T4) | 16 GB | |
| | Session limit | **9 hours** | **90 minutes** | |
| | Weekly GPU quota | **30 hrs / week** | ~12 hrs (varies) | |
| | Storage between sessions | ✅ Persistent (save as Dataset) | ❌ Wiped on disconnect | |
| | `bf16` support | ❌ T4 is too old; `fp16` used instead | ❌ same | |
| | Auto-detects fp16 fallback | ✅ script handles it | ✅ script handles it | |
| |
| **Bottom line**: Colab free sessions cut off at 90 minutes. Even with checkpoints, that means 3–4 restarts for a full 5-stage run. One Kaggle session (9h) completes the entire curriculum in a single stretch, with no resume needed. |
| |
| ### Paid cloud (if you need speed) |
| |
| | Tier | GPU | VRAM | Time / Stage | Cost | |
| |------|-----|------|-------------|------| |
| | 🥇 Best | A100 80GB | 80 GB | ~25 min | ~$2.50/hr | |
| | 🥈 Good | A100 40GB | 40 GB | ~45 min | ~$1.60/hr | |
| | 🥉 Budget | L4 / RTX 3090 | 24 GB | ~90 min | ~$0.80/hr | |
| |
| ### VRAM math (why any T4 is fine) |
| |
| | Component | VRAM | |
| |-----------|------| |
| | Model (1.5B, 4-bit Unsloth) | ~1.2 GB | |
| | LoRA adapters (r=16) | ~0.1 GB | |
| | Optimizer states | ~2.0 GB | |
| | Activations (batch=2, seq=1024) | ~3.5 GB | |
| | **Total** | **~7 GB** | |
| |
| A single T4 (16 GB) has 9 GB headroom. Kaggle's dual T4 = 32 GB total. |
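The budget is simple addition, which makes it easy to re-run for a different batch size or LoRA rank. A small sketch using the estimates from the table; the dictionary keys are just labels:

```python
# Estimated VRAM per component in GB, copied from the table above.
vram_gb = {
    "model_4bit": 1.2,
    "lora_adapters_r16": 0.1,
    "optimizer_states": 2.0,
    "activations_b2_s1024": 3.5,
}

total_gb = sum(vram_gb.values())   # 6.8 GB, which the table rounds to ~7 GB
headroom_gb = 16 - total_gb        # spare memory on a single 16 GB T4

print(f"total={total_gb:.1f} GB, headroom={headroom_gb:.1f} GB")
```

Note that the 32 GB on Kaggle's dual T4 is split across two cards, so the per-card figure of 16 GB is the one that matters for a single process.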
| |
| --- |
| |
| ## 3. Environment Setup |
| |
| ### ✅ Option A: Kaggle (Recommended Free Tier) |
| |
| Create a new Kaggle Notebook → Settings → Accelerator: **GPU T4 x2**. |
| |
| ```python |
| # Cell 1: Install deps |
| !pip install unsloth trl datasets transformers accelerate matplotlib -q |
| |
| # Cell 2: Clone repo |
| !git clone https://github.com/YOUR_ORG/Meta-R2.git |
| import os; os.chdir("Meta-R2") |
| |
| # Cell 3: Smoke test (makes sure everything imports correctly) |
| !python scripts/train_trl.py --dry-run |
| |
| # Cell 4: Full curriculum (completes in ~5–6 hrs on T4 x2) |
| OUTPUT = "/kaggle/working/lifestack_model" |
| !python scripts/train_trl.py --stages 5 --prompts-per-stage 200 --output-dir {OUTPUT} |
| ``` |
| |
| **Saving across sessions** (so you can resume if you hit 9h or re-run next week): |
| |
| ```python |
| # After training, save the output as a Kaggle Dataset via the notebook sidebar: |
| # Notebook → Data → + Add Output → name it "lifestack-model" |
| # Next session: attach that dataset, copy it out of the read-only /kaggle/input |
| # mount into /kaggle/working, then pass --resume |
| !cp -r /kaggle/input/lifestack-model/lifestack_model /kaggle/working/ |
| !python scripts/train_trl.py --resume --output-dir /kaggle/working/lifestack_model |
| ``` |
| |
| ### Option B: Google Colab (Secondary, needs Drive) |
| |
| Colab sessions cut off at 90 min. You **must** mount Drive to survive disconnects. |
| |
| ```python |
| # Cell 1: Mount Google Drive for persistent storage |
| from google.colab import drive |
| drive.mount('/content/drive') |
| OUTPUT = "/content/drive/MyDrive/lifestack_model" |
| |
| # Cell 2: Install & clone |
| !pip install unsloth trl datasets transformers accelerate matplotlib -q |
| !git clone https://github.com/YOUR_ORG/Meta-R2.git |
| import os; os.chdir("Meta-R2") |
| |
| # Cell 3: Smoke test |
| !python scripts/train_trl.py --dry-run |
| |
| # Cell 4: First run |
| !python scripts/train_trl.py --stages 5 --prompts-per-stage 200 --output-dir {OUTPUT} |
| |
| # Cell 5: After disconnect, re-run cells 1-2, then: |
| !python scripts/train_trl.py --resume --output-dir {OUTPUT} |
| ``` |
| |
| > ⚠️ Without mounting Drive, every Colab disconnect loses all progress regardless of checkpoints. |
| |
| ### Option C: Local / Cloud GPU (Linux) |
| |
| ```bash |
| git clone https://github.com/YOUR_ORG/Meta-R2.git && cd Meta-R2 |
| python3 -m venv .venv && source .venv/bin/activate |
| pip install unsloth trl datasets transformers accelerate matplotlib |
| |
| # On A100 (CUDA 12.x), use the fast Unsloth build: |
| pip install "unsloth[cu121-ampere-torch230] @ git+https://github.com/unslothai/unsloth.git" |
| |
| python -c "import torch; print(torch.cuda.get_device_name(0))" |
| python scripts/train_trl.py --dry-run |
| python scripts/train_trl.py --stages 5 --prompts-per-stage 200 |
| ``` |
| |
| --- |
| |
| ## 4. Checkpoint & Resume System |
| |
| Every **25 optimiser steps** the Trainer writes a checkpoint. If the session dies mid-stage, a run restarted with `--resume` continues from the most recent checkpoint, so at most 24 steps of work are repeated. |
| |
| ### What gets saved |
| |
| ``` |
| lifestack_model/ |
| ├── curriculum_state.json     ← {"completed_stage": 2, "next_difficulty": 3} |
| ├── stage_1/ |
| │   ├── checkpoint-25/        ← step 25 snapshot (weights + optimizer) |
| │   ├── checkpoint-50/        ← step 50 snapshot |
| │   ├── checkpoint-75/        ← step 75 (oldest auto-deleted at 4th save) |
| │   └── model.safetensors     ← written when stage completes cleanly |
| ├── stage_2/ |
| │   └── checkpoint-25/        ← mid-stage when session was cut |
| └── stage_3/ ... |
| ``` |
| |
| Only the **3 most recent checkpoints** per stage are kept (`save_total_limit=3`) to save disk. |
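To see which snapshots survive a given run length, the rotation can be simulated in a few lines. This models the behaviour described above (save every 25 steps, keep the newest 3); it is not the Trainer's own code, and the function name is hypothetical.

```python
from collections import deque

SAVE_STEPS = 25        # checkpoint interval stated above
SAVE_TOTAL_LIMIT = 3   # save_total_limit=3


def surviving_checkpoints(total_steps):
    """List the checkpoint directories left on disk after `total_steps` steps."""
    kept = deque(maxlen=SAVE_TOTAL_LIMIT)  # appending a 4th evicts the oldest
    for step in range(SAVE_STEPS, total_steps + 1, SAVE_STEPS):
        kept.append(f"checkpoint-{step}")
    return list(kept)
```

After 100 steps this returns `checkpoint-50`, `checkpoint-75` and `checkpoint-100`: the fourth save is the one that removes `checkpoint-25`.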
| |
| ### Resume commands |
| |
| ```bash |
| # Kaggle / Colab: auto-resume after any disconnect |
| python scripts/train_trl.py --resume |
| |
| # Jump to a specific stage (e.g. re-run stage 3 from scratch) |
| python scripts/train_trl.py --start-stage 3 |
|
|
| # Resume + change number of stages (e.g. add 2 more stages) |
| python scripts/train_trl.py --resume --stages 7 |
| ``` |
| |
| How `--resume` works: |
| 1. Reads `curriculum_state.json` → knows stage 2 completed, next is stage 3 |
| 2. Calls `find_latest_checkpoint("stage_3/")` → finds `checkpoint-25` |
| 3. `trainer.train(resume_from_checkpoint="stage_3/checkpoint-25")` → restores weights + optimizer state → continues from step 25 |
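The first two steps can be sketched with the standard library alone. This is a hypothetical re-implementation for illustration: `find_latest_checkpoint` is named in this guide, but its real body is not shown here, and `resolve_resume_point` is an invented helper.

```python
import json
import re
import tempfile
from pathlib import Path


def find_latest_checkpoint(stage_dir):
    """Return the checkpoint-N directory with the highest N, or None."""
    stage_dir = Path(stage_dir)
    if not stage_dir.is_dir():
        return None
    candidates = []
    for path in stage_dir.iterdir():
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        if match and path.is_dir():
            candidates.append((int(match.group(1)), path))
    # Compare step numbers as integers so checkpoint-100 beats checkpoint-75.
    return max(candidates, key=lambda c: c[0])[1] if candidates else None


def resolve_resume_point(output_dir):
    """Read the curriculum state and locate the checkpoint to resume from."""
    output_dir = Path(output_dir)
    state = json.loads((output_dir / "curriculum_state.json").read_text())
    stage = state["completed_stage"] + 1
    return stage, find_latest_checkpoint(output_dir / f"stage_{stage}")


# Usage on a throwaway directory that mimics the layout described above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "curriculum_state.json").write_text(
        '{"completed_stage": 2, "next_difficulty": 3}'
    )
    for step in (25, 50, 100):
        (root / "stage_3" / f"checkpoint-{step}").mkdir(parents=True)
    stage, checkpoint = resolve_resume_point(root)
    resume_name = checkpoint.name
```

The resulting path is what would then be passed to `trainer.train(resume_from_checkpoint=...)`. Sorting numerically matters: a plain string sort would rank `checkpoint-50` above `checkpoint-100`.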
|
|
| --- |
|
|
| ## 5. Training Commands |
|
|
| ### Dry-Run: No GPU Required |
| ```bash |
| python scripts/train_trl.py --dry-run |
| ``` |
| - 1 step, 4 prompts, CPU only, ~30 seconds |
| - Expected output: `✅ DRY-RUN PASSED` |
|
|
| ### Full Curriculum (Kaggle / cloud) |
| ```bash |
| python scripts/train_trl.py --stages 5 --prompts-per-stage 200 --output-dir ./lifestack_model |
| ``` |
|
|
| ### Fast Dev Run (1 stage, test iterations) |
| ```bash |
| python scripts/train_trl.py --stages 1 --prompts-per-stage 50 |
| ``` |
|
|
| ### All CLI Flags |
|
|
| | Flag | Default | Description | |
| |------|---------|-------------| |
| | `--dry-run` | - | 1-step CPU smoke test | |
| | `--stages` | `5` | Number of curriculum stages | |
| | `--prompts-per-stage` | `100` | Prompts per stage | |
| | `--output-dir` | `./lifestack_model` | Model save path | |
| | `--resume` | `False` | Resume from `curriculum_state.json` + latest checkpoint | |
| | `--start-stage` | `None` | Force-start from a specific stage number | |
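For readers extending the script, the table maps onto a standard `argparse` parser. A sketch that mirrors the flags and defaults listed above; the actual parser in `scripts/train_trl.py` may be organised differently, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="LifeStack GRPO curriculum training")
    parser.add_argument("--dry-run", action="store_true",
                        help="1-step CPU smoke test")
    parser.add_argument("--stages", type=int, default=5,
                        help="Number of curriculum stages")
    parser.add_argument("--prompts-per-stage", type=int, default=100,
                        help="Prompts per stage")
    parser.add_argument("--output-dir", default="./lifestack_model",
                        help="Model save path")
    parser.add_argument("--resume", action="store_true",
                        help="Resume from curriculum_state.json + latest checkpoint")
    parser.add_argument("--start-stage", type=int, default=None,
                        help="Force-start from a specific stage number")
    return parser


# Equivalent to: python scripts/train_trl.py --resume --stages 7
args = build_parser().parse_args(["--resume", "--stages", "7"])
```

`argparse` converts dashes to underscores, so the values are read as `args.prompts_per_stage`, `args.output_dir`, `args.start_stage` and so on.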
|
|
| --- |
|
|
| ## 6. What Gets Trained On |
|
|
| The dataset covers **all 8 domains equally** using round-robin sampling: |
|
|
| | # | Domain | Scenario Examples | |
| |---|--------|-----------------| |
| | 1 | `career` | Boss drops 10-hr task at 5 PM / performance review rumours | |
| | 2 | `finances` | Card declined, late fee / tax audit, emergency fund needed | |
| | 3 | `relationships` | Partner feels like a roommate / sibling needs emergency help | |
| | 4 | `physical_health` | Fainting spell at office / warning signs ignored too long | |
| | 5 | `mental_wellbeing` | Burnout, inbox at 500 / panic attack at work | |
| | 6 | `time` | Double-booked all weekend / drowning in obligations | |
| | 7 | `transport_crisis` | **5 sub-modes** (see below) | |
| | 8 | `code_merge_crisis` | Botched merge took down staging / CTO asking for ETA | |
|
|
| ### `transport_crisis` sub-modes (randomly drawn each time) |
| |
| | Sub-type | Scenario | |
| |----------|---------| |
| | `flight_crisis` | Flight cancelled + card declined + deadline moved to Sunday | |
| | `train_delay` | Signal failure, 90-min delay, 9 AM client meeting | |
| | `car_breakdown` | Engine seized on highway, tow + rental = $400, rental shortage exo-event | |
| | `rideshare_surge` | 9x surge pricing, major presentation in 2 hours | |
| | `transit_strike` | City-wide indefinite strike, e-bike shortage exo-event | |
|
|
| With 5 personalities × 5 difficulty levels × 8 domains, a 200-prompt stage has strong variation across ~3,000+ unique scenario combinations. |
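Round-robin sampling with a random transport sub-mode can be sketched as follows. The domain and sub-type names come from the tables above; the function name, the seed handling, and the tuple layout are assumptions made for illustration.

```python
import random
from collections import Counter
from itertools import cycle, islice

DOMAINS = [
    "career", "finances", "relationships", "physical_health",
    "mental_wellbeing", "time", "transport_crisis", "code_merge_crisis",
]
TRANSPORT_MODES = [
    "flight_crisis", "train_delay", "car_breakdown",
    "rideshare_surge", "transit_strike",
]


def sample_domains(n_prompts, seed=0):
    """Cycle through the domains in order; draw a transport sub-mode at random."""
    rng = random.Random(seed)
    picks = []
    for domain in islice(cycle(DOMAINS), n_prompts):
        sub_type = rng.choice(TRANSPORT_MODES) if domain == "transport_crisis" else None
        picks.append((domain, sub_type))
    return picks


counts = Counter(domain for domain, _ in sample_domains(200))
```

With 200 prompts and 8 domains, each domain appears exactly 25 times, which is the "equal coverage" property the section describes.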
|
|
| --- |
|
|
| ## 7. Reward Functions |
|
|
| | Function | What it checks | Range | |
| |----------|---------------|-------| |
| | `reward_format_fn` | Valid JSON + all required fields | `[-1, 1]` | |
| | `reward_plausibility_fn` | No miracle zero-cost fixes | `{-1, 1}` | |
| | `reward_task_success_fn` | LifeMetrics improved + no cascade spread | `[-1, 1]` | |
| | `reward_milestone_fn` | Logical progress gates hit | `[0, 1]` | |
| | `reward_reasoning_fn` | Reasoning coherence + domain keywords | `[-0.1, 0.1]` | |
|
|
| ``` |
| +1.0  → Perfect JSON, all metrics improved, milestone hit |
|  0.5  → Reasonable action, some metrics improved |
|  0.0  → Neutral / no change |
| -0.5  → PLAUSIBILITY_VIOLATION or CASCADE_SPREAD_WIDER |
| -1.0  → Refusal / empty / broke multiple metrics |
| ``` |
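As an illustration of the first row of the table, a format reward can follow the general shape of a GRPO reward callable: a list of completions in, one float per completion out. This is a sketch, not the project's `reward_format_fn`. The required field names are hypothetical, completions are assumed to be plain strings, and the linear partial credit is an invented choice that stays within the documented `[-1, 1]` range.

```python
import json

# Hypothetical schema; the real required fields may differ.
REQUIRED_FIELDS = ("action_type", "reasoning")


def format_reward(completions, **kwargs):
    """Return one score per completion: +1.0 for complete JSON, -1.0 for unparseable."""
    scores = []
    for text in completions:
        try:
            action = json.loads(text)
        except (json.JSONDecodeError, TypeError):
            scores.append(-1.0)  # refusal or free text instead of JSON
            continue
        if not isinstance(action, dict):
            scores.append(-1.0)  # valid JSON, but not an action object
            continue
        present = sum(field in action for field in REQUIRED_FIELDS)
        # Linear partial credit from -1.0 (no fields) to +1.0 (all fields).
        scores.append(2.0 * present / len(REQUIRED_FIELDS) - 1.0)
    return scores
```

A refusal such as `"I cannot help"` scores -1.0 rather than 0.0, so it ranks below every well-formed action within its group.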
|
|
| --- |
|
|
| ## 8. Monitoring Training |
|
|
| ### TensorBoard (local/cloud only) |
| ```bash |
| tensorboard --logdir ./lifestack_model # open http://localhost:6006 |
| ``` |
| Watch: `train/reward` rising toward 0.5+, `train/kl_divergence` staying < 0.5. |
|
|
| ### Console log (every 5 steps) |
| ``` |
| [step 25] reward=0.312 | outcome=0.124 | containment=0.800 | efficiency=0.710 |
| [ckpt] Curriculum state saved → stage=1, next_diff=2 |
| ``` |
|
|
| ### Live JSONL log |
| ```bash |
| tail -f training_logs/generations.jsonl | python -c " |
| import sys, json |
| for line in sys.stdin: |
|     d = json.loads(line) |
|     print(f\"step={d['step']} reward={d['reward']:.3f} action={d['action'].get('action_type')}\") |
| " |
| ``` |
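Raw per-step rewards are noisy, so a rolling average is often easier to read than the live tail. A small sketch that assumes each JSONL record carries the `step` and `reward` keys used in the one-liner above; the function name and window size are illustrative.

```python
import json
from collections import deque


def rolling_rewards(lines, window=20):
    """Yield-style helper: return (step, rolling mean reward) per JSONL line."""
    recent = deque(maxlen=window)  # holds only the last `window` rewards
    averages = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        recent.append(record["reward"])
        averages.append((record["step"], sum(recent) / len(recent)))
    return averages


# Hypothetical log lines, for illustration only.
sample = [
    '{"step": 5, "reward": 0.2}',
    '{"step": 10, "reward": 0.4}',
    '{"step": 15, "reward": 0.6}',
]
averages = rolling_rewards(sample, window=2)
```

On a real run, pass an open file handle instead of the sample list, for example `rolling_rewards(open("training_logs/generations.jsonl"))`.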
|
|
| --- |
|
|
| ## 9. Expected Training Results |
|
|
| | Stage | Difficulty | Expected Avg Reward | Progression Rule | |
| |-------|-----------|---------------------|-----------------| |
| | 1 | 1: flat tyre, forgotten bill | 0.55–0.70 | advances if > 0.60 | |
| | 2 | 2: project surge, train delay | 0.45–0.65 | advances if > 0.60 | |
| | 3 | 3: health scare, car breakdown | 0.35–0.55 | advances if > 0.60 | |
| | 4 | 4: performance review, surge pricing | 0.25–0.50 | advances if > 0.60 | |
| | 5 | 5: transit strike, total collapse | 0.20–0.45 | - | |
|
|
| --- |
|
|
| ## 10. Post-Training Artifacts |
|
|
| ``` |
| lifestack_model/ |
| ├── curriculum_state.json          ← curriculum progress tracker |
| ├── model.safetensors              ← final LoRA adapter weights |
| ├── adapter_config.json |
| ├── tokenizer.json / tokenizer_config.json |
| └── stage_1/ ... stage_5/ |
|     ├── checkpoint-25/ ... checkpoint-75/   ← step snapshots |
|     └── model.safetensors                   ← completed stage weights |
| |
| training_logs/ |
| └── generations.jsonl              ← per-step reward breakdown |
| ``` |
|
|
| Validate the final save: |
| ```bash |
| python -c "from scripts.train_trl import validate_saved_model; validate_saved_model('./lifestack_model')" |
| ``` |
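Before running the full validation, a quick file-presence check can catch an incomplete copy or download. This sketch only tests for the top-level files listed in the tree above. It is not a substitute for `validate_saved_model`, whose internals are not shown in this guide, and `missing_artifacts` is a hypothetical helper.

```python
from pathlib import Path

# Top-level files from the artifact tree above.
EXPECTED_FILES = (
    "curriculum_state.json",
    "model.safetensors",
    "adapter_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
)


def missing_artifacts(model_dir):
    """Return the expected top-level files that are absent from model_dir."""
    model_dir = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (model_dir / name).is_file()]
```

An empty list from `missing_artifacts("./lifestack_model")` means every expected file is present; it says nothing about whether the weights themselves load correctly.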
|
|
| --- |
|
|
| ## 11. Troubleshooting |
|
|
| | Symptom | Likely Cause | Fix | |
| |---------|-------------|-----| |
| | `ImportError: unsloth` | Not installed | `pip install unsloth` | |
| | `CUDA out of memory` | Batch too large | `per_device_train_batch_size=1` | |
| | All rewards = -0.5 | Env reset failing | Run `--dry-run` to surface the error | |
| | KL divergence > 1.0 | LR too high | Lower `learning_rate` to `1e-6` | |
| | `Task missing required fields` | Domain generator bug | Check `TaskGenerator.generate()` | |
| | reward stuck at 0.0 | Model refuses JSON | Check `reward_format_fn`; it should return -1.0, not 0.0 | |
| | Colab disconnect lost progress | Drive not mounted | Mount Drive before running; use `--resume` | |
| | `checkpoint-*` dirs missing | `save_steps` too high | Already set to 25 in this script | |
|
|
| --- |
|
|
| ## 12. Quick Reference |
|
|
| ```bash |
| # Smoke test (CPU, ~30s) |
| python scripts/train_trl.py --dry-run |
| |
| # Kaggle full run (~5-6 hr, T4 x2) |
| python scripts/train_trl.py --stages 5 --prompts-per-stage 200 |
| |
| # Resume after any disconnect |
| python scripts/train_trl.py --resume |
| |
| # Jump to stage 3 (e.g. stages 1-2 already done) |
| python scripts/train_trl.py --start-stage 3 |
| |
| # Validate model saved correctly |
| python -c "from scripts.train_trl import validate_saved_model; validate_saved_model('./lifestack_model')" |
| |
| # Plot evaluation reward curve |
| python -c "from scripts.train_trl import evaluate_and_plot; evaluate_and_plot('./lifestack_model')" |
| ``` |
|
|