# train_trl.py — GRPO Training Reference `scripts/train_trl.py` — Curriculum GRPO training via HuggingFace TRL + Unsloth. --- ## Overview Trains a small LLM (default: `Qwen2.5-1.5B-Instruct`) to resolve LifeStack life conflicts using **Group Relative Policy Optimization (GRPO)**. Implements a success-based curriculum that automatically increases difficulty when the agent's average reward exceeds 0.6. Requires: `unsloth`, `trl`, `datasets`, `transformers`, `accelerate` (Colab / GPU). --- ## Usage ```bash # Full curriculum training (5 stages × 100 prompts) python scripts/train_trl.py ``` No CLI args — edit constants at the top of the file to change stages/prompts/output dir. --- ## Architecture ### Reward Functions (multi-signal GRPO) | Function | Signal | |---|---| | `reward_format_fn` | JSON format compliance | | `reward_plausibility_fn` | Penalises zero-cost metric changes | | `reward_task_success_fn` | Core env-step outcome reward | | `reward_milestone_fn` | Milestone progress bonus | | `reward_reasoning_fn` | Planning coherence score | | `reward_human_feedback_fn` | Alignment with past real-world outcome feedback | ### `get_lifestack_evaluation(completion, prompt) -> dict` The central reward computation function. Parses the LLM's JSON completion, reconstructs the Task from the prompt's `` block, steps the env, and returns: ```python { "reward": float, "breakdown": dict, # from obs.metadata["breakdown"] "action": LifeStackAction } ``` Returns `{"reward": -0.5, "breakdown": {"error": ...}}` on any parse or env failure. #### Task Construction Hardening (2026-04-23) The `Task(...)` call inside `get_lifestack_evaluation` is wrapped in its own `try/except`. On exception, logs `[reward] Task construction failed: ` and returns the `-0.5` fallback immediately. A field-presence check on `(id, goal, constraints, mutable_world, visible_world)` follows construction. ### Curriculum (`train_curriculum`) ``` Stage 1: difficulty=1 → train → eval → if avg_reward > 0.6: difficulty++ Stage 2: difficulty=2 → ... ... Stage 5: difficulty=5 → final save ``` ### Dataset (`generate_dataset`) Generates `N` prompts by: 1. Sampling a `TaskGenerator` task (flight_crisis or code_merge_crisis) 2. Merging a legacy `ConflictEvent` disruption for variety 3. Cascading the disruption through the `DependencyGraph` 4. Embedding task metadata in a `` block for reward reconstruction --- ## Outputs | Path | Contents | |---|---| | `./lifestack_model/` | Final saved model + tokenizer | | `./lifestack_model/stage_N/` | Per-stage checkpoints | | `training_logs/generations.jsonl` | Sampled generations (every 20 reward calls) | | `grpo_reward_curve.png` | 50-episode eval reward curve | --- ## Change Log | Date | Change | |---|---| | 2026-04-23 | `Task()` construction wrapped in try/except + field validation; returns -0.5 fallback on failure |