Soham Banerjee
# LifeStack GRPO Training Guide
> **Model**: Qwen2.5-1.5B-Instruct → LoRA fine-tuned via GRPO
> **Algorithm**: Group Relative Policy Optimization (TRL + Unsloth)
> **Domains**: 8 daily-life domains including `transport_crisis` (5 modes), career, finances, relationships, physical health, mental wellbeing, time, code merge crisis
---
## 1. How GRPO Works in LifeStack
GRPO trains the model by generating **groups of completions** for the same prompt and ranking them by reward. The model learns to prefer higher-reward actions without needing a separate critic network (unlike PPO).
```
Prompt (life scenario)
        │
        ▼
LLM generates N=4 candidate JSON actions
        │
        ▼
5 reward functions score each action
  ├── format_compliance   (is it valid JSON?)
  ├── plausibility        (no zero-cost miracle fixes?)
  ├── task_success        (did it actually help the LifeMetrics?)
  ├── milestone           (did it unlock key progress gates?)
  └── reasoning           (is the explanation coherent?)
        │
        ▼
GRPO updates policy to prefer higher-reward completions
```
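The "ranking" step is a group-relative normalisation: each completion's reward is compared with the mean of its own group, which is why no critic is needed. A minimal sketch in plain Python, assuming the common GRPO formulation (subtract the group mean, divide by the group standard deviation). The function name `group_advantages` and the `eps` value are illustrative and not taken from `train_trl.py`; libraries also differ on sample versus population standard deviation.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-4):
    """Normalise rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    # eps keeps the division finite when every completion scores the same
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four candidate actions for one prompt (N=4), with made-up rewards.
advantages = group_advantages([0.8, 0.2, -0.5, 0.5])
print([round(a, 2) for a in advantages])
```

Completions above the group mean get a positive advantage and are reinforced; those below it are pushed down. A group where every completion scores the same yields all-zero advantages and contributes no learning signal.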
The curriculum starts at difficulty 1 (gym skipped, forgotten bill) and moves up one level at a time to difficulty 5 (flight cancelled + card declined + boss moved deadline). It advances only when the average reward on the current level exceeds 0.6.
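The promotion rule can be sketched as a small gate function. This is an illustration under stated assumptions, not the script's code: the name `next_difficulty` is hypothetical, and staying on the same level when the threshold is missed is an assumption, since the guide only says when the level advances.

```python
def next_difficulty(current, avg_reward, threshold=0.6, max_level=5):
    """Return the difficulty for the next stage under the promotion rule."""
    # Strictly greater than the threshold, matching "avg reward > 0.6".
    if avg_reward > threshold and current < max_level:
        return current + 1
    return current  # assumption: repeat the level when the gate is not met

print(next_difficulty(1, 0.65), next_difficulty(1, 0.60), next_difficulty(5, 0.90))
```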
---
## 2. Free-Tier GPU Recommendation
### ✅ Use Kaggle, not Colab
| | **Kaggle** ✅ | Colab Free ❌ |
|---|---|---|
| GPU | **T4 × 2** (or P100) | T4 × 1 |
| VRAM | **32 GB** (dual T4) | 16 GB |
| Session limit | **9 hours** | **90 minutes** |
| Weekly GPU quota | **30 hrs / week** | ~12 hrs (varies) |
| Storage between sessions | ✅ Persistent (save as Dataset) | ❌ Wiped on disconnect |
| `bf16` support | ❌ T4 is too old → `fp16` used instead | ❌ same |
| Auto-detects fp16 fallback | ✅ script handles it | ✅ script handles it |
**Bottom line**: Colab free sessions cut off at 90 minutes. Even with checkpoints, that means 3–4 restarts for a full 5-stage run. One Kaggle session (9 h) completes the entire curriculum in a single stretch, with no resume needed.
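The `fp16` fallback in the table comes down to GPU generation: native `bf16` needs compute capability 8.0 (Ampere) or newer, and the T4 is 7.5. A minimal sketch of that decision, assuming the capability tuple is already known; `pick_precision` is a hypothetical helper, not the function used in `train_trl.py`. With PyTorch, the tuple would come from `torch.cuda.get_device_capability(0)`.

```python
def pick_precision(compute_capability):
    """Choose a mixed-precision mode from a CUDA compute capability tuple."""
    major, _minor = compute_capability
    # bf16 needs Ampere (8.0) or newer; T4 is 7.5 and P100 is 6.0.
    return "bf16" if major >= 8 else "fp16"

for name, cc in {"T4": (7, 5), "P100": (6, 0), "A100": (8, 0)}.items():
    print(name, pick_precision(cc))
```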
### Paid cloud (if you need speed)
| Tier | GPU | VRAM | Time / Stage | Cost |
|------|-----|------|-------------|------|
| 🥇 Best | A100 80GB | 80 GB | ~25 min | ~$2.50/hr |
| 🥈 Good | A100 40GB | 40 GB | ~45 min | ~$1.60/hr |
| 🥉 Budget | L4 / RTX 3090 | 24 GB | ~90 min | ~$0.80/hr |
### VRAM math (why any T4 is fine)
| Component | VRAM |
|-----------|------|
| Model (1.5B, 4-bit Unsloth) | ~1.2 GB |
| LoRA adapters (r=16) | ~0.1 GB |
| Optimizer states | ~2.0 GB |
| Activations (batch=2, seq=1024) | ~3.5 GB |
| **Total** | **~7 GB** |
A single T4 (16 GB) has 9 GB headroom. Kaggle's dual T4 = 32 GB total.
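The budget above is simple addition, reproduced here so the numbers can be adjusted for a different batch size or adapter rank. The dictionary keys are illustrative labels; the values are the estimates from the table.

```python
# Estimated VRAM per component, in GB (values from the table above).
budget_gb = {
    "model_4bit": 1.2,
    "lora_adapters_r16": 0.1,
    "optimizer_states": 2.0,
    "activations_b2_s1024": 3.5,
}
total_gb = sum(budget_gb.values())
headroom_gb = 16 - total_gb  # single T4 with 16 GB
print(f"total ~{total_gb:.1f} GB, headroom ~{headroom_gb:.1f} GB")
```

The exact sum is 6.8 GB, which the table rounds to ~7 GB.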
---
## 3. Environment Setup
### ✅ Option A: Kaggle (Recommended Free Tier)
Create a new Kaggle Notebook → Settings → Accelerator: **GPU T4 x2**.
```python
# Cell 1: Install deps
!pip install unsloth trl datasets transformers accelerate matplotlib -q
# Cell 2: Clone repo
!git clone https://github.com/YOUR_ORG/Meta-R2.git
import os; os.chdir("Meta-R2")
# Cell 3: Smoke test (makes sure everything imports correctly)
!python scripts/train_trl.py --dry-run
# Cell 4: Full curriculum (completes in ~5–6 hrs on T4 x2)
OUTPUT = "/kaggle/working/lifestack_model"
!python scripts/train_trl.py --stages 5 --prompts-per-stage 200 --output-dir {OUTPUT}
```
**Saving across sessions** (so you can resume if you hit 9h or re-run next week):
```python
# After training, save the output as a Kaggle Dataset via the notebook sidebar:
# Notebook → Data → + Add Output → name it "lifestack-model"
# Next session: attach that dataset. /kaggle/input is read-only, so copy it
# into /kaggle/working first, then pass --resume
!cp -r /kaggle/input/lifestack-model/lifestack_model /kaggle/working/
!python scripts/train_trl.py --resume --output-dir /kaggle/working/lifestack_model
```
### Option B: Google Colab (Secondary, needs Drive)
Colab sessions cut off at 90 min. You **must** mount Drive to survive disconnects.
```python
# Cell 1: Mount Google Drive for persistent storage
from google.colab import drive
drive.mount('/content/drive')
OUTPUT = "/content/drive/MyDrive/lifestack_model"
# Cell 2: Install & clone
!pip install unsloth trl datasets transformers accelerate matplotlib -q
!git clone https://github.com/YOUR_ORG/Meta-R2.git
import os; os.chdir("Meta-R2")
# Cell 3: Smoke test
!python scripts/train_trl.py --dry-run
# Cell 4: First run
!python scripts/train_trl.py --stages 5 --prompts-per-stage 200 --output-dir {OUTPUT}
# Cell 5: After a disconnect, re-run cells 1-2, then:
!python scripts/train_trl.py --resume --output-dir {OUTPUT}
```
> ⚠️ Without mounting Drive, every Colab disconnect loses all progress regardless of checkpoints.
### Option C: Local / Cloud GPU (Linux)
```bash
git clone https://github.com/YOUR_ORG/Meta-R2.git && cd Meta-R2
python3 -m venv .venv && source .venv/bin/activate
pip install unsloth trl datasets transformers accelerate matplotlib
# On A100 (CUDA 12.x), use the fast Unsloth build:
pip install "unsloth[cu121-ampere-torch230] @ git+https://github.com/unslothai/unsloth.git"
python -c "import torch; print(torch.cuda.get_device_name(0))"
python scripts/train_trl.py --dry-run
python scripts/train_trl.py --stages 5 --prompts-per-stage 200
```
---
## 4. Checkpoint & Resume System
Every **25 optimiser steps** the Trainer writes a checkpoint. If the session dies mid-stage, a `--resume` run restarts from the most recent checkpoint, so at most 24 steps of work are repeated.
### What gets saved
```
lifestack_model/
├── curriculum_state.json    ← {"completed_stage": 2, "next_difficulty": 3}
├── stage_1/
│   ├── checkpoint-25/       ← step 25 snapshot (weights + optimizer)
│   ├── checkpoint-50/       ← step 50 snapshot
│   ├── checkpoint-75/       ← step 75 (oldest auto-deleted at 4th save)
│   └── model.safetensors    ← written when stage completes cleanly
├── stage_2/ ...             ← same layout as stage_1 (completed)
└── stage_3/
    └── checkpoint-25/       ← mid-stage when session was cut
```
Only the **3 most recent checkpoints** per stage are kept (`save_total_limit=3`) to save disk.
### Resume commands
```bash
# Kaggle / Colab: auto-resume after any disconnect
python scripts/train_trl.py --resume
# Jump to a specific stage (e.g. re-run stage 3 from scratch)
python scripts/train_trl.py --start-stage 3
# Resume + change number of stages (e.g. add 2 more stages)
python scripts/train_trl.py --resume --stages 7
```
How `--resume` works:
1. Reads `curriculum_state.json` → knows stage 2 completed, next is stage 3
2. Calls `find_latest_checkpoint("stage_3/")` → finds `checkpoint-25`
3. `trainer.train(resume_from_checkpoint="stage_3/checkpoint-25")` → restores weights + optimizer state → continues from step 25
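The first two steps can be sketched with the standard library alone. This is an illustration of the logic described above, not the code in `train_trl.py`: the body of `find_latest_checkpoint` is a guess at its behaviour, `resume_point` is a hypothetical helper, and the JSON key `completed_stage` and the `stage_N/` directory names are taken from the layout shown earlier.

```python
import json
import re
from pathlib import Path

def find_latest_checkpoint(stage_dir):
    """Return the checkpoint-N directory with the highest N, or None."""
    stage_dir = Path(stage_dir)
    if not stage_dir.is_dir():
        return None
    best_step, best_path = -1, None
    for path in stage_dir.glob("checkpoint-*"):
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        # Compare step numbers as integers: "checkpoint-100" sorts before
        # "checkpoint-25" as a string.
        if match and path.is_dir() and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path

def resume_point(output_dir):
    """Map curriculum_state.json to (stage to run, checkpoint to resume from)."""
    output_dir = Path(output_dir)
    state = json.loads((output_dir / "curriculum_state.json").read_text())
    stage = state["completed_stage"] + 1
    return stage, find_latest_checkpoint(output_dir / f"stage_{stage}")
```

The returned path is what step 3 would pass as `resume_from_checkpoint`. A result of `None` means the stage has no snapshot yet and starts from step 0.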
---
## 5. Training Commands
### Dry-Run: No GPU Required
```bash
python scripts/train_trl.py --dry-run
```
- 1 step, 4 prompts, CPU only, ~30 seconds
- Expected output: `✅ DRY-RUN PASSED`
### Full Curriculum (Kaggle / cloud)
```bash
python scripts/train_trl.py --stages 5 --prompts-per-stage 200 --output-dir ./lifestack_model
```
### Fast Dev Run (1 stage, test iterations)
```bash
python scripts/train_trl.py --stages 1 --prompts-per-stage 50
```
### All CLI Flags
| Flag | Default | Description |
|------|---------|-------------|
| `--dry-run` | `False` | 1-step CPU smoke test |
| `--stages` | `5` | Number of curriculum stages |
| `--prompts-per-stage` | `100` | Prompts per stage |
| `--output-dir` | `./lifestack_model` | Model save path |
| `--resume` | `False` | Resume from `curriculum_state.json` + latest checkpoint |
| `--start-stage` | `None` | Force-start from a specific stage number |
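For readers wiring these flags into their own wrapper, the table maps onto `argparse` roughly as follows. This is a sketch reconstructed from the table, not a copy of the parser in `train_trl.py`; the help strings and the `build_parser` name are illustrative.

```python
import argparse

def build_parser():
    """Build a parser mirroring the flag table (names and defaults only)."""
    p = argparse.ArgumentParser(description="LifeStack GRPO curriculum (sketch)")
    p.add_argument("--dry-run", action="store_true", help="1-step CPU smoke test")
    p.add_argument("--stages", type=int, default=5, help="Number of curriculum stages")
    p.add_argument("--prompts-per-stage", type=int, default=100, help="Prompts per stage")
    p.add_argument("--output-dir", default="./lifestack_model", help="Model save path")
    p.add_argument("--resume", action="store_true", help="Resume from saved state")
    p.add_argument("--start-stage", type=int, default=None, help="Force-start stage")
    return p

# Equivalent of: python scripts/train_trl.py --resume --stages 7
args = build_parser().parse_args(["--resume", "--stages", "7"])
print(args.resume, args.stages, args.prompts_per_stage, args.start_stage)
```

Note that `--prompts-per-stage` defaults to 100, while the full-curriculum commands in this guide pass 200 explicitly.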
---
## 6. What Gets Trained On
The dataset covers **all 8 domains equally** using round-robin sampling:
| # | Domain | Scenario Examples |
|---|--------|-----------------|
| 1 | `career` | Boss drops 10-hr task at 5 PM / performance review rumours |
| 2 | `finances` | Card declined, late fee / tax audit, emergency fund needed |
| 3 | `relationships` | Partner feels like a roommate / sibling needs emergency help |
| 4 | `physical_health` | Fainting spell at office / warning signs ignored too long |
| 5 | `mental_wellbeing` | Burnout, inbox at 500 / panic attack at work |
| 6 | `time` | Double-booked all weekend / drowning in obligations |
| 7 | `transport_crisis` | **5 sub-modes**, see below |
| 8 | `code_merge_crisis` | Botched merge took down staging / CTO asking for ETA |
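Round-robin sampling means cycling through the domain list in order, so every domain receives the same share of prompts whenever the stage size is a multiple of 8. A minimal sketch, assuming the domain identifiers from the table; `round_robin_domains` is a hypothetical helper, and the actual dataset builder may shuffle or interleave differently.

```python
from collections import Counter
from itertools import cycle, islice

DOMAINS = [
    "career", "finances", "relationships", "physical_health",
    "mental_wellbeing", "time", "transport_crisis", "code_merge_crisis",
]

def round_robin_domains(n_prompts, domains=DOMAINS):
    """Assign a domain to each prompt by cycling through the list in order."""
    return list(islice(cycle(domains), n_prompts))

# A 200-prompt stage gives each of the 8 domains 200 / 8 = 25 prompts.
counts = Counter(round_robin_domains(200))
print(counts["career"], counts["code_merge_crisis"])
```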
### `transport_crisis` sub-modes (randomly drawn each time)
| Sub-type | Scenario |
|----------|---------|
| `flight_crisis` | Flight cancelled + card declined + deadline moved to Sunday |
| `train_delay` | Signal failure, 90-min delay, 9 AM client meeting |
| `car_breakdown` | Engine seized on highway, tow + rental = $400, rental shortage exo-event |
| `rideshare_surge` | 9x surge pricing, major presentation in 2 hours |
| `transit_strike` | City-wide indefinite strike, e-bike shortage exo-event |
With 5 personalities × 5 difficulty levels × 8 domains (200 base combinations) and multiple scenario variants per domain, a 200-prompt stage has strong variation across ~3,000+ unique scenario combinations.
---
## 7. Reward Functions
| Function | What it checks | Range |
|----------|---------------|-------|
| `reward_format_fn` | Valid JSON + all required fields | `[-1, 1]` |
| `reward_plausibility_fn` | No miracle zero-cost fixes | `{-1, 1}` |
| `reward_task_success_fn` | LifeMetrics improved + no cascade spread | `[-1, 1]` |
| `reward_milestone_fn` | Logical progress gates hit | `[0, 1]` |
| `reward_reasoning_fn` | Reasoning coherence + domain keywords | `[-0.1, 0.1]` |
```
+1.0 │ Perfect JSON, all metrics improved, milestone hit
 0.5 │ Reasonable action, some metrics improved
 0.0 │ Neutral / no change
-0.5 │ PLAUSIBILITY_VIOLATION or CASCADE_SPREAD_WIDER
-1.0 │ Refusal / empty / broke multiple metrics
```
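To make the format check concrete, here is one way a reward in `[-1, 1]` for "valid JSON + all required fields" could be written. It is a sketch under loud assumptions: the field names in `REQUIRED_FIELDS` are hypothetical, the linear partial-credit scheme is invented for illustration, and none of it is taken from `reward_format_fn`.

```python
import json

REQUIRED_FIELDS = ("action_type", "reasoning")  # hypothetical field names

def format_reward(completion, required=REQUIRED_FIELDS):
    """Score JSON validity and field coverage on a [-1, 1] scale."""
    try:
        action = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return -1.0  # refusal, empty string, or malformed JSON
    if not isinstance(action, dict):
        return -1.0  # valid JSON, but not an action object
    present = sum(1 for field in required if field in action)
    # Assumed scheme: linear from -1 (no fields) to +1 (all fields present).
    return 2.0 * present / len(required) - 1.0

print(format_reward('{"action_type": "rest", "reasoning": "recover first"}'))
print(format_reward("I cannot help with that."))
```

Returning -1.0 rather than 0.0 for unparseable output matters for GRPO: a refusal has to score below a neutral action, or the group ranking cannot separate them.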
---
## 8. Monitoring Training
### TensorBoard (local/cloud only)
```bash
tensorboard --logdir ./lifestack_model # open http://localhost:6006
```
Watch: `train/reward` rising toward 0.5+, `train/kl_divergence` staying < 0.5.
### Console log (every 5 steps)
```
[step 25] reward=0.312 | outcome=0.124 | containment=0.800 | efficiency=0.710
[ckpt] Curriculum state saved → stage=1, next_diff=2
```
### Live JSONL log
```bash
tail -f training_logs/generations.jsonl | python -c "
import sys, json
for line in sys.stdin:
    d = json.loads(line)
    print(f\"step={d['step']} reward={d['reward']:.3f} action={d['action'].get('action_type')}\")
"
```
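For a summary after the run rather than a live tail, the same log can be aggregated offline. A minimal sketch, assuming each JSONL record carries the `reward` and `action.action_type` fields used in the one-liner above; `mean_reward_by_action` is a hypothetical helper.

```python
import json
from collections import defaultdict

def mean_reward_by_action(lines):
    """Average the `reward` field per `action.action_type` over JSONL lines."""
    totals = defaultdict(lambda: [0.0, 0])  # action_type -> [sum, count]
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        action = record.get("action") or {}
        key = action.get("action_type", "unknown")
        totals[key][0] += record["reward"]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

# Usage, assuming the log path shown above exists:
# with open("training_logs/generations.jsonl") as fh:
#     print(mean_reward_by_action(fh))
```

An action type whose mean reward stays well below the others is a useful first place to look when the overall reward plateaus.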
---
## 9. Expected Training Results
| Stage | Difficulty | Expected Avg Reward | Progression Rule |
|-------|-----------|---------------------|-----------------|
| 1 | 1: flat tyre, forgotten bill | 0.55 – 0.70 | advances if > 0.60 |
| 2 | 2: project surge, train delay | 0.45 – 0.65 | advances if > 0.60 |
| 3 | 3: health scare, car breakdown | 0.35 – 0.55 | advances if > 0.60 |
| 4 | 4: performance review, surge pricing | 0.25 – 0.50 | advances if > 0.60 |
| 5 | 5: transit strike, total collapse | 0.20 – 0.45 | n/a (final stage) |
---
## 10. Post-Training Artifacts
```
lifestack_model/
├── curriculum_state.json                   ← curriculum progress tracker
├── model.safetensors                       ← final LoRA adapter weights
├── adapter_config.json
├── tokenizer.json / tokenizer_config.json
└── stage_1/ ... stage_5/
    ├── checkpoint-25/ ... checkpoint-75/   ← step snapshots
    └── model.safetensors                   ← completed stage weights
training_logs/
└── generations.jsonl                       ← per-step reward breakdown
```
Validate the final save:
```bash
python -c "from scripts.train_trl import validate_saved_model; validate_saved_model('./lifestack_model')"
```
---
## 11. Troubleshooting
| Symptom | Likely Cause | Fix |
|---------|-------------|-----|
| `ImportError: unsloth` | Not installed | `pip install unsloth` |
| `CUDA out of memory` | Batch too large | `per_device_train_batch_size=1` |
| All rewards = -0.5 | Env reset failing | Run `--dry-run` to surface the error |
| KL divergence > 1.0 | LR too high | Lower `learning_rate` to `1e-6` |
| `Task missing required fields` | Domain generator bug | Check `TaskGenerator.generate()` |
| Reward stuck at 0.0 | Model refuses JSON | Check `reward_format_fn`: a refusal should score -1.0, not 0.0 |
| Colab disconnect lost progress | Drive not mounted | Mount Drive before running; use `--resume` |
| `checkpoint-*` dirs missing | `save_steps` too high | Already set to 25 in this script |
---
## 12. Quick Reference
```bash
# Smoke test (CPU, ~30s)
python scripts/train_trl.py --dry-run
# Kaggle full run (~5-6 hr, T4 x2)
python scripts/train_trl.py --stages 5 --prompts-per-stage 200
# Resume after any disconnect
python scripts/train_trl.py --resume
# Jump to stage 3 (e.g. stages 1-2 already done)
python scripts/train_trl.py --start-stage 3
# Validate model saved correctly
python -c "from scripts.train_trl import validate_saved_model; validate_saved_model('./lifestack_model')"
# Plot evaluation reward curve
python -c "from scripts.train_trl import evaluate_and_plot; evaluate_and_plot('./lifestack_model')"
```