Spaces:

s-b3
/

LifeStack

Sleeping

App Files Files Community

LifeStack / docs /training_guide.md

Soham Banerjee

deploy: pure lifestack with partitioned wisdom pool

77da5ce about 1 month ago

preview code

raw

history blame contribute delete

13.2 kB

LifeStack GRPO Training Guide

Model: Qwen2.5-1.5B-Instruct → LoRA fine-tuned via GRPO
Algorithm: Group Relative Policy Optimization (TRL + Unsloth)
Domains: 8 daily-life domains including transport_crisis (5 modes), career, finances, relationships, physical health, mental wellbeing, time, code merge crisis

1. How GRPO Works in LifeStack

GRPO trains the model by generating groups of completions for the same prompt and ranking them by reward. The model learns to prefer higher-reward actions without needing a separate critic network (unlike PPO).

Prompt (life scenario)
       │
       ▼
 LLM generates N=4 candidate JSON actions
       │
       ▼
 5 reward functions score each action
   ├── format_compliance    (is it valid JSON?)
   ├── plausibility         (no zero-cost miracle fixes?)
   ├── task_success         (did it actually help the LifeMetrics?)
   ├── milestone            (did it unlock key progress gates?)
   └── reasoning            (is the explanation coherent?)
       │
       ▼
 GRPO updates policy to prefer higher-reward completions

The curriculum starts at difficulty 1 (gym skipped, forgotten bill) and advances to difficulty 5 (flight cancelled + card declined + boss moved deadline) only when avg reward > 0.6 on the current level.

2. Free-Tier GPU Recommendation

✅ Use Kaggle — not Colab

	Kaggle ✅	Colab Free ❌
GPU	T4 × 2 (or P100)	T4 × 1
VRAM	32 GB (dual T4)	16 GB
Session limit	9 hours	90 minutes
Weekly GPU quota	30 hrs / week	~12 hrs (varies)
Storage between sessions	✅ Persistent (save as Dataset)	❌ Wiped on disconnect
`bf16` support	❌ T4 is too old → `fp16` used instead	❌ same
Auto-detects fp16 fallback	✅ script handles it	✅ script handles it

Bottom line: Colab free sessions cut off at 90 minutes. Even with checkpoints that means 3–4 restarts for a full 5-stage run. One Kaggle session (9h) completes the entire curriculum in a single stretch — no resume needed.

Paid cloud (if you need speed)

Tier	GPU	VRAM	Time / Stage	Cost
🥇 Best	A100 80GB	80 GB	~25 min	~$2.50/hr
🥈 Good	A100 40GB	40 GB	~45 min	~$1.60/hr
🥉 Budget	L4 / RTX 3090	24 GB	~90 min	~$0.80/hr

VRAM math (why any T4 is fine)

Component	VRAM
Model (1.5B, 4-bit Unsloth)	~1.2 GB
LoRA adapters (r=16)	~0.1 GB
Optimizer states	~2.0 GB
Activations (batch=2, seq=1024)	~3.5 GB
Total	~7 GB

A single T4 (16 GB) has 9 GB headroom. Kaggle's dual T4 = 32 GB total.

3. Environment Setup

✅ Option A — Kaggle (Recommended Free Tier)

Create a new Kaggle Notebook → Settings → Accelerator: GPU T4 x2.

# Cell 1 — Install deps
!pip install unsloth trl datasets transformers accelerate matplotlib -q

# Cell 2 — Clone repo
!git clone https://github.com/YOUR_ORG/Meta-R2.git
import os; os.chdir("Meta-R2")

# Cell 3 — Smoke test (makes sure everything imports correctly)
!python scripts/train_trl.py --dry-run

# Cell 4 — Full curriculum (completes in ~5–6 hrs on T4 x2)
OUTPUT = "/kaggle/working/lifestack_model"
!python scripts/train_trl.py --stages 5 --prompts-per-stage 200 --output-dir {OUTPUT}

Saving across sessions (so you can resume if you hit 9h or re-run next week):

# After training, save the output as a Kaggle Dataset via the notebook sidebar:
# Notebook → Data → + Add Output → name it "lifestack-model"
# Next session: attach that dataset and pass --resume
!python scripts/train_trl.py --resume --output-dir /kaggle/input/lifestack-model/lifestack_model

Option B — Google Colab (Secondary, needs Drive)

Colab sessions cut at 90 min. You must mount Drive to survive disconnects.

# Cell 1 — Mount Google Drive for persistent storage
from google.colab import drive
drive.mount('/content/drive')
OUTPUT = "/content/drive/MyDrive/lifestack_model"

# Cell 2 — Install & clone
!pip install unsloth trl datasets transformers accelerate matplotlib -q
!git clone https://github.com/YOUR_ORG/Meta-R2.git
import os; os.chdir("Meta-R2")

# Cell 3 — Smoke test
!python scripts/train_trl.py --dry-run

# Cell 4 — First run
!python scripts/train_trl.py --stages 5 --prompts-per-stage 200 --output-dir {OUTPUT}

# Cell 5 — After disconnect, re-run cells 1-2, then:
!python scripts/train_trl.py --resume --output-dir {OUTPUT}

⚠️ Without mounting Drive, every Colab disconnect loses all progress regardless of checkpoints.

Option C — Local / Cloud GPU (Linux)

git clone https://github.com/YOUR_ORG/Meta-R2.git && cd Meta-R2
python3 -m venv .venv && source .venv/bin/activate
pip install unsloth trl datasets transformers accelerate matplotlib

# On A100 (CUDA 12.x), use the fast Unsloth build:
pip install "unsloth[cu121-ampere-torch230] @ git+https://github.com/unslothai/unsloth.git"

python -c "import torch; print(torch.cuda.get_device_name(0))"
python scripts/train_trl.py --dry-run
python scripts/train_trl.py --stages 5 --prompts-per-stage 200

4. Checkpoint & Resume System

Every 25 optimiser steps the Trainer writes a checkpoint. If the session dies mid-stage, it picks up exactly where it left off.

What gets saved

lifestack_model/
├── curriculum_state.json          ← {"completed_stage": 2, "next_difficulty": 3}
├── stage_1/
│   ├── checkpoint-25/             ← step 25 snapshot (weights + optimizer)
│   ├── checkpoint-50/             ← step 50 snapshot
│   ├── checkpoint-75/             ← step 75 (oldest auto-deleted at 4th save)
│   └── model.safetensors          ← written when stage completes cleanly
├── stage_2/
│   └── checkpoint-25/             ← mid-stage when session was cut
└── stage_3/ ...

Only the 3 most recent checkpoints per stage are kept (save_total_limit=3) to save disk.

Resume commands

# Kaggle / Colab: auto-resume after any disconnect
python scripts/train_trl.py --resume

# Jump to a specific stage (e.g. re-run stage 3 from scratch)
python scripts/train_trl.py --start-stage 3

# Resume + change number of stages (e.g. add 2 more stages)
python scripts/train_trl.py --resume --stages 7

How --resume works:

Reads curriculum_state.json → knows stage 2 completed, next is stage 3
Calls find_latest_checkpoint("stage_3/") → finds checkpoint-25
trainer.train(resume_from_checkpoint="stage_3/checkpoint-25") → restores weights + optimizer state → continues from step 25

5. Training Commands

Dry-Run — No GPU Required

python scripts/train_trl.py --dry-run

1 step, 4 prompts, CPU only, ~30 seconds
Expected output: ✅ DRY-RUN PASSED

Full Curriculum (Kaggle / cloud)

python scripts/train_trl.py --stages 5 --prompts-per-stage 200 --output-dir ./lifestack_model

Fast Dev Run (1 stage, test iterations)

python scripts/train_trl.py --stages 1 --prompts-per-stage 50

All CLI Flags

Flag	Default	Description
`--dry-run`	—	1-step CPU smoke test
`--stages`	`5`	Number of curriculum stages
`--prompts-per-stage`	`100`	Prompts per stage
`--output-dir`	`./lifestack_model`	Model save path
`--resume`	`False`	Resume from `curriculum_state.json` + latest checkpoint
`--start-stage`	`None`	Force-start from a specific stage number

6. What Gets Trained On

The dataset covers all 8 domains equally using round-robin sampling:

#	Domain	Scenario Examples
1	`career`	Boss drops 10-hr task at 5 PM / performance review rumours
2	`finances`	Card declined, late fee / tax audit, emergency fund needed
3	`relationships`	Partner feels like a roommate / sibling needs emergency help
4	`physical_health`	Fainting spell at office / warning signs ignored too long
5	`mental_wellbeing`	Burnout, inbox at 500 / panic attack at work
6	`time`	Double-booked all weekend / drowning in obligations
7	`transport_crisis`	5 sub-modes — see below
8	`code_merge_crisis`	Botched merge took down staging / CTO asking for ETA

`transport_crisis` sub-modes (randomly drawn each time)

Sub-type	Scenario
`flight_crisis`	Flight cancelled + card declined + deadline moved to Sunday
`train_delay`	Signal failure, 90-min delay, 9 AM client meeting
`car_breakdown`	Engine seized on highway, tow + rental = $400, rental shortage exo-event
`rideshare_surge`	9x surge pricing, major presentation in 2 hours
`transit_strike`	City-wide indefinite strike, e-bike shortage exo-event

With 5 personalities × 5 difficulty levels × 8 domains, a 200-prompt stage has strong variation across ~3,000+ unique scenario combinations.

7. Reward Functions

Function	What it checks	Range
`reward_format_fn`	Valid JSON + all required fields	`[-1, 1]`
`reward_plausibility_fn`	No miracle zero-cost fixes	`{-1, 1}`
`reward_task_success_fn`	LifeMetrics improved + no cascade spread	`[-1, 1]`
`reward_milestone_fn`	Logical progress gates hit	`[0, 1]`
`reward_reasoning_fn`	Reasoning coherence + domain keywords	`[-0.1, 0.1]`

+1.0 │ Perfect JSON, all metrics improved, milestone hit
 0.5 │ Reasonable action, some metrics improved
 0.0 │ Neutral / no change
-0.5 │ PLAUSIBILITY_VIOLATION or CASCADE_SPREAD_WIDER
-1.0 │ Refusal / empty / broke multiple metrics

8. Monitoring Training

TensorBoard (local/cloud only)

tensorboard --logdir ./lifestack_model   # open http://localhost:6006

Watch: train/reward rising toward 0.5+, train/kl_divergence staying < 0.5.

Console log (every 5 steps)

[step 25] reward=0.312 | outcome=0.124 | containment=0.800 | efficiency=0.710
[ckpt] Curriculum state saved → stage=1, next_diff=2

Live JSONL log

tail -f training_logs/generations.jsonl | python -c "
import sys, json
for line in sys.stdin:
    d = json.loads(line)
    print(f\"step={d['step']} reward={d['reward']:.3f} action={d['action'].get('action_type')}\")
"

9. Expected Training Results

Stage	Difficulty	Expected Avg Reward	Progression Rule
1	1 — flat tyre, forgotten bill	0.55 – 0.70	advances if > 0.60
2	2 — project surge, train delay	0.45 – 0.65	advances if > 0.60
3	3 — health scare, car breakdown	0.35 – 0.55	advances if > 0.60
4	4 — performance review, surge pricing	0.25 – 0.50	advances if > 0.60
5	5 — transit strike, total collapse	0.20 – 0.45	—

10. Post-Training Artifacts

lifestack_model/
├── curriculum_state.json          ← curriculum progress tracker
├── model.safetensors              ← final LoRA adapter weights
├── adapter_config.json
├── tokenizer.json / tokenizer_config.json
└── stage_1/ ... stage_5/
    ├── checkpoint-25/ ... checkpoint-75/   ← step snapshots
    └── model.safetensors                   ← completed stage weights

training_logs/
└── generations.jsonl              ← per-step reward breakdown

Validate the final save:

python -c "from scripts.train_trl import validate_saved_model; validate_saved_model('./lifestack_model')"

11. Troubleshooting

Symptom	Likely Cause	Fix
`ImportError: unsloth`	Not installed	`pip install unsloth`
`CUDA out of memory`	Batch too large	`per_device_train_batch_size=1`
All rewards = -0.5	Env reset failing	Run `--dry-run` to surface the error
KL divergence > 1.0	LR too high	Lower `learning_rate` to `1e-6`
`Task missing required fields`	Domain generator bug	Check `TaskGenerator.generate()`
reward stuck at 0.0	Model refuses JSON	Check `reward_format_fn` — should be -1.0 not 0.0
Colab disconnect lost progress	Drive not mounted	Mount Drive before running; use `--resume`
`checkpoint-*` dirs missing	`save_steps` too high	Already set to 25 in this script

12. Quick Reference

# Smoke test (CPU, ~30s)
python scripts/train_trl.py --dry-run

# Kaggle full run (~5-6 hr, T4 x2)
python scripts/train_trl.py --stages 5 --prompts-per-stage 200

# Resume after any disconnect
python scripts/train_trl.py --resume

# Jump to stage 3 (e.g. stages 1-2 already done)
python scripts/train_trl.py --start-stage 3

# Validate model saved correctly
python -c "from scripts.train_trl import validate_saved_model; validate_saved_model('./lifestack_model')"

# Plot evaluation reward curve
python -c "from scripts.train_trl import evaluate_and_plot; evaluate_and_plot('./lifestack_model')"