Spaces:

s-b3
/

LifeStack

Sleeping

App Files Files Community

LifeStack / docs /training_guide.md

Soham Banerjee

deploy: pure lifestack with partitioned wisdom pool

77da5ce about 1 month ago

preview code

raw

history blame contribute delete

13.2 kB

	# LifeStack GRPO Training Guide

	> Model: Qwen2.5-1.5B-Instruct → LoRA fine-tuned via GRPO
	> Algorithm: Group Relative Policy Optimization (TRL + Unsloth)
	> Domains: 8 daily-life domains including `transport_crisis` (5 modes), career, finances, relationships, physical health, mental wellbeing, time, code merge crisis

	---

	## 1. How GRPO Works in LifeStack

	GRPO trains the model by generating groups of completions for the same prompt and ranking them by reward. The model learns to prefer higher-reward actions without needing a separate critic network (unlike PPO).

	```
	Prompt (life scenario)
	│
	▼
	LLM generates N=4 candidate JSON actions
	│
	▼
	5 reward functions score each action
	├── format_compliance (is it valid JSON?)
	├── plausibility (no zero-cost miracle fixes?)
	├── task_success (did it actually help the LifeMetrics?)
	├── milestone (did it unlock key progress gates?)
	└── reasoning (is the explanation coherent?)
	│
	▼
	GRPO updates policy to prefer higher-reward completions
	```

	The curriculum starts at difficulty 1 (gym skipped, forgotten bill) and advances to difficulty 5 (flight cancelled + card declined + boss moved deadline) only when avg reward > 0.6 on the current level.

	---

	## 2. Free-Tier GPU Recommendation

	### ✅ Use Kaggle — not Colab

	\| \| Kaggle ✅ \| Colab Free ❌ \|
	\|---\|---\|---\|
	\| GPU \| T4 × 2 (or P100) \| T4 × 1 \|
	\| VRAM \| 32 GB (dual T4) \| 16 GB \|
	\| Session limit \| 9 hours \| 90 minutes \|
	\| Weekly GPU quota \| 30 hrs / week \| ~12 hrs (varies) \|
	\| Storage between sessions \| ✅ Persistent (save as Dataset) \| ❌ Wiped on disconnect \|
	\| `bf16` support \| ❌ T4 is too old → `fp16` used instead \| ❌ same \|
	\| Auto-detects fp16 fallback \| ✅ script handles it \| ✅ script handles it \|

	Bottom line: Colab free sessions cut off at 90 minutes. Even with checkpoints that means 3–4 restarts for a full 5-stage run. One Kaggle session (9h) completes the entire curriculum in a single stretch — no resume needed.

	### Paid cloud (if you need speed)

	\| Tier \| GPU \| VRAM \| Time / Stage \| Cost \|
	\|------\|-----\|------\|-------------\|------\|
	\| 🥇 Best \| A100 80GB \| 80 GB \| ~25 min \| ~$2.50/hr \|
	\| 🥈 Good \| A100 40GB \| 40 GB \| ~45 min \| ~$1.60/hr \|
	\| 🥉 Budget \| L4 / RTX 3090 \| 24 GB \| ~90 min \| ~$0.80/hr \|

	### VRAM math (why any T4 is fine)

	\| Component \| VRAM \|
	\|-----------\|------\|
	\| Model (1.5B, 4-bit Unsloth) \| ~1.2 GB \|
	\| LoRA adapters (r=16) \| ~0.1 GB \|
	\| Optimizer states \| ~2.0 GB \|
	\| Activations (batch=2, seq=1024) \| ~3.5 GB \|
	\| Total \| ~7 GB \|

	A single T4 (16 GB) has 9 GB headroom. Kaggle's dual T4 = 32 GB total.

	---

	## 3. Environment Setup

	### ✅ Option A — Kaggle (Recommended Free Tier)

	Create a new Kaggle Notebook → Settings → Accelerator: GPU T4 x2.

	```python
	# Cell 1 — Install deps
	!pip install unsloth trl datasets transformers accelerate matplotlib -q

	# Cell 2 — Clone repo
	!git clone https://github.com/YOUR_ORG/Meta-R2.git
	import os; os.chdir("Meta-R2")

	# Cell 3 — Smoke test (makes sure everything imports correctly)
	!python scripts/train_trl.py --dry-run

	# Cell 4 — Full curriculum (completes in ~5–6 hrs on T4 x2)
	OUTPUT = "/kaggle/working/lifestack_model"
	!python scripts/train_trl.py --stages 5 --prompts-per-stage 200 --output-dir {OUTPUT}
	```

	Saving across sessions (so you can resume if you hit 9h or re-run next week):

	```python
	# After training, save the output as a Kaggle Dataset via the notebook sidebar:
	# Notebook → Data → + Add Output → name it "lifestack-model"
	# Next session: attach that dataset and pass --resume
	!python scripts/train_trl.py --resume --output-dir /kaggle/input/lifestack-model/lifestack_model
	```

	### Option B — Google Colab (Secondary, needs Drive)

	Colab sessions cut at 90 min. You must mount Drive to survive disconnects.

	```python
	# Cell 1 — Mount Google Drive for persistent storage
	from google.colab import drive
	drive.mount('/content/drive')
	OUTPUT = "/content/drive/MyDrive/lifestack_model"

	# Cell 2 — Install & clone
	!pip install unsloth trl datasets transformers accelerate matplotlib -q
	!git clone https://github.com/YOUR_ORG/Meta-R2.git
	import os; os.chdir("Meta-R2")

	# Cell 3 — Smoke test
	!python scripts/train_trl.py --dry-run

	# Cell 4 — First run
	!python scripts/train_trl.py --stages 5 --prompts-per-stage 200 --output-dir {OUTPUT}

	# Cell 5 — After disconnect, re-run cells 1-2, then:
	!python scripts/train_trl.py --resume --output-dir {OUTPUT}
	```

	> ⚠️ Without mounting Drive, every Colab disconnect loses all progress regardless of checkpoints.

	### Option C — Local / Cloud GPU (Linux)

	```bash
	git clone https://github.com/YOUR_ORG/Meta-R2.git && cd Meta-R2
	python3 -m venv .venv && source .venv/bin/activate
	pip install unsloth trl datasets transformers accelerate matplotlib

	# On A100 (CUDA 12.x), use the fast Unsloth build:
	pip install "unsloth[cu121-ampere-torch230] @ git+https://github.com/unslothai/unsloth.git"

	python -c "import torch; print(torch.cuda.get_device_name(0))"
	python scripts/train_trl.py --dry-run
	python scripts/train_trl.py --stages 5 --prompts-per-stage 200
	```

	---

	## 4. Checkpoint & Resume System

	Every 25 optimiser steps the Trainer writes a checkpoint. If the session dies mid-stage, it picks up exactly where it left off.

	### What gets saved

	```
	lifestack_model/
	├── curriculum_state.json ← {"completed_stage": 2, "next_difficulty": 3}
	├── stage_1/
	│ ├── checkpoint-25/ ← step 25 snapshot (weights + optimizer)
	│ ├── checkpoint-50/ ← step 50 snapshot
	│ ├── checkpoint-75/ ← step 75 (oldest auto-deleted at 4th save)
	│ └── model.safetensors ← written when stage completes cleanly
	├── stage_2/
	│ └── checkpoint-25/ ← mid-stage when session was cut
	└── stage_3/ ...
	```

	Only the 3 most recent checkpoints per stage are kept (`save_total_limit=3`) to save disk.

	### Resume commands

	```bash
	# Kaggle / Colab: auto-resume after any disconnect
	python scripts/train_trl.py --resume

	# Jump to a specific stage (e.g. re-run stage 3 from scratch)
	python scripts/train_trl.py --start-stage 3

	# Resume + change number of stages (e.g. add 2 more stages)
	python scripts/train_trl.py --resume --stages 7
	```

	How `--resume` works:
	1. Reads `curriculum_state.json` → knows stage 2 completed, next is stage 3
	2. Calls `find_latest_checkpoint("stage_3/")` → finds `checkpoint-25`
	3. `trainer.train(resume_from_checkpoint="stage_3/checkpoint-25")` → restores weights + optimizer state → continues from step 25

	---

	## 5. Training Commands

	### Dry-Run — No GPU Required
	```bash
	python scripts/train_trl.py --dry-run
	```
	- 1 step, 4 prompts, CPU only, ~30 seconds
	- Expected output: `✅ DRY-RUN PASSED`

	### Full Curriculum (Kaggle / cloud)
	```bash
	python scripts/train_trl.py --stages 5 --prompts-per-stage 200 --output-dir ./lifestack_model
	```

	### Fast Dev Run (1 stage, test iterations)
	```bash
	python scripts/train_trl.py --stages 1 --prompts-per-stage 50
	```

	### All CLI Flags

	\| Flag \| Default \| Description \|
	\|------\|---------\|-------------\|
	\| `--dry-run` \| — \| 1-step CPU smoke test \|
	\| `--stages` \| `5` \| Number of curriculum stages \|
	\| `--prompts-per-stage` \| `100` \| Prompts per stage \|
	\| `--output-dir` \| `./lifestack_model` \| Model save path \|
	\| `--resume` \| `False` \| Resume from `curriculum_state.json` + latest checkpoint \|
	\| `--start-stage` \| `None` \| Force-start from a specific stage number \|

	---

	## 6. What Gets Trained On

	The dataset covers all 8 domains equally using round-robin sampling:

	\| # \| Domain \| Scenario Examples \|
	\|---\|--------\|-----------------\|
	\| 1 \| `career` \| Boss drops 10-hr task at 5 PM / performance review rumours \|
	\| 2 \| `finances` \| Card declined, late fee / tax audit, emergency fund needed \|
	\| 3 \| `relationships` \| Partner feels like a roommate / sibling needs emergency help \|
	\| 4 \| `physical_health` \| Fainting spell at office / warning signs ignored too long \|
	\| 5 \| `mental_wellbeing` \| Burnout, inbox at 500 / panic attack at work \|
	\| 6 \| `time` \| Double-booked all weekend / drowning in obligations \|
	\| 7 \| `transport_crisis` \| 5 sub-modes — see below \|
	\| 8 \| `code_merge_crisis` \| Botched merge took down staging / CTO asking for ETA \|

	### `transport_crisis` sub-modes (randomly drawn each time)

	\| Sub-type \| Scenario \|
	\|----------\|---------\|
	\| `flight_crisis` \| Flight cancelled + card declined + deadline moved to Sunday \|
	\| `train_delay` \| Signal failure, 90-min delay, 9 AM client meeting \|
	\| `car_breakdown` \| Engine seized on highway, tow + rental = $400, rental shortage exo-event \|
	\| `rideshare_surge` \| 9x surge pricing, major presentation in 2 hours \|
	\| `transit_strike` \| City-wide indefinite strike, e-bike shortage exo-event \|

	With 5 personalities × 5 difficulty levels × 8 domains, a 200-prompt stage has strong variation across ~3,000+ unique scenario combinations.

	---

	## 7. Reward Functions

	\| Function \| What it checks \| Range \|
	\|----------\|---------------\|-------\|
	\| `reward_format_fn` \| Valid JSON + all required fields \| `[-1, 1]` \|
	\| `reward_plausibility_fn` \| No miracle zero-cost fixes \| `{-1, 1}` \|
	\| `reward_task_success_fn` \| LifeMetrics improved + no cascade spread \| `[-1, 1]` \|
	\| `reward_milestone_fn` \| Logical progress gates hit \| `[0, 1]` \|
	\| `reward_reasoning_fn` \| Reasoning coherence + domain keywords \| `[-0.1, 0.1]` \|

	```
	+1.0 │ Perfect JSON, all metrics improved, milestone hit
	0.5 │ Reasonable action, some metrics improved
	0.0 │ Neutral / no change
	-0.5 │ PLAUSIBILITY_VIOLATION or CASCADE_SPREAD_WIDER
	-1.0 │ Refusal / empty / broke multiple metrics
	```

	---

	## 8. Monitoring Training

	### TensorBoard (local/cloud only)
	```bash
	tensorboard --logdir ./lifestack_model # open http://localhost:6006
	```
	Watch: `train/reward` rising toward 0.5+, `train/kl_divergence` staying < 0.5.

	### Console log (every 5 steps)
	```
	[step 25] reward=0.312 \| outcome=0.124 \| containment=0.800 \| efficiency=0.710
	[ckpt] Curriculum state saved → stage=1, next_diff=2
	```

	### Live JSONL log
	```bash
	tail -f training_logs/generations.jsonl \| python -c "
	import sys, json
	for line in sys.stdin:
	d = json.loads(line)
	print(f\"step={d['step']} reward={d['reward']:.3f} action={d['action'].get('action_type')}\")
	"
	```

	---

	## 9. Expected Training Results

	\| Stage \| Difficulty \| Expected Avg Reward \| Progression Rule \|
	\|-------\|-----------\|---------------------\|-----------------\|
	\| 1 \| 1 — flat tyre, forgotten bill \| 0.55 – 0.70 \| advances if > 0.60 \|
	\| 2 \| 2 — project surge, train delay \| 0.45 – 0.65 \| advances if > 0.60 \|
	\| 3 \| 3 — health scare, car breakdown \| 0.35 – 0.55 \| advances if > 0.60 \|
	\| 4 \| 4 — performance review, surge pricing \| 0.25 – 0.50 \| advances if > 0.60 \|
	\| 5 \| 5 — transit strike, total collapse \| 0.20 – 0.45 \| — \|

	---

	## 10. Post-Training Artifacts

	```
	lifestack_model/
	├── curriculum_state.json ← curriculum progress tracker
	├── model.safetensors ← final LoRA adapter weights
	├── adapter_config.json
	├── tokenizer.json / tokenizer_config.json
	└── stage_1/ ... stage_5/
	├── checkpoint-25/ ... checkpoint-75/ ← step snapshots
	└── model.safetensors ← completed stage weights

	training_logs/
	└── generations.jsonl ← per-step reward breakdown
	```

	Validate the final save:
	```bash
	python -c "from scripts.train_trl import validate_saved_model; validate_saved_model('./lifestack_model')"
	```

	---

	## 11. Troubleshooting

	\| Symptom \| Likely Cause \| Fix \|
	\|---------\|-------------\|-----\|
	\| `ImportError: unsloth` \| Not installed \| `pip install unsloth` \|
	\| `CUDA out of memory` \| Batch too large \| `per_device_train_batch_size=1` \|
	\| All rewards = -0.5 \| Env reset failing \| Run `--dry-run` to surface the error \|
	\| KL divergence > 1.0 \| LR too high \| Lower `learning_rate` to `1e-6` \|
	\| `Task missing required fields` \| Domain generator bug \| Check `TaskGenerator.generate()` \|
	\| reward stuck at 0.0 \| Model refuses JSON \| Check `reward_format_fn` — should be -1.0 not 0.0 \|
	\| Colab disconnect lost progress \| Drive not mounted \| Mount Drive before running; use `--resume` \|
	\| `checkpoint-*` dirs missing \| `save_steps` too high \| Already set to 25 in this script \|

	---

	## 12. Quick Reference

	```bash
	# Smoke test (CPU, ~30s)
	python scripts/train_trl.py --dry-run

	# Kaggle full run (~5-6 hr, T4 x2)
	python scripts/train_trl.py --stages 5 --prompts-per-stage 200

	# Resume after any disconnect
	python scripts/train_trl.py --resume

	# Jump to stage 3 (e.g. stages 1-2 already done)
	python scripts/train_trl.py --start-stage 3

	# Validate model saved correctly
	python -c "from scripts.train_trl import validate_saved_model; validate_saved_model('./lifestack_model')"

	# Plot evaluation reward curve
	python -c "from scripts.train_trl import evaluate_and_plot; evaluate_and_plot('./lifestack_model')"
	```