Spaces:

s-b3
/

LifeStack

Sleeping

App Files Files Community

LifeStack / docs /train_trl.md

Soham Banerjee

deploy: pure lifestack with partitioned wisdom pool

77da5ce about 1 month ago

preview code

raw

history blame contribute delete

2.95 kB

	# train_trl.py — GRPO Training Reference

	`scripts/train_trl.py` — Curriculum GRPO training via HuggingFace TRL + Unsloth.

	---

	## Overview

	Trains a small LLM (default: `Qwen2.5-1.5B-Instruct`) to resolve LifeStack life conflicts
	using Group Relative Policy Optimization (GRPO). Implements a success-based curriculum
	that automatically increases difficulty when the agent's average reward exceeds 0.6.

	Requires: `unsloth`, `trl`, `datasets`, `transformers`, `accelerate` (Colab / GPU).

	---

	## Usage

	```bash
	# Full curriculum training (5 stages × 100 prompts)
	python scripts/train_trl.py
	```

	No CLI args — edit constants at the top of the file to change stages/prompts/output dir.

	---

	## Architecture

	### Reward Functions (multi-signal GRPO)

	\| Function \| Signal \|
	\|---\|---\|
	\| `reward_format_fn` \| JSON format compliance \|
	\| `reward_plausibility_fn` \| Penalises zero-cost metric changes \|
	\| `reward_task_success_fn` \| Core env-step outcome reward \|
	\| `reward_milestone_fn` \| Milestone progress bonus \|
	\| `reward_reasoning_fn` \| Planning coherence score \|
	\| `reward_human_feedback_fn` \| Alignment with past real-world outcome feedback \|

	### `get_lifestack_evaluation(completion, prompt) -> dict`

	The central reward computation function. Parses the LLM's JSON completion, reconstructs
	the Task from the prompt's `<SYSTEM_METADATA>` block, steps the env, and returns:

	```python
	{
	"reward": float,
	"breakdown": dict, # from obs.metadata["breakdown"]
	"action": LifeStackAction
	}
	```

	Returns `{"reward": -0.5, "breakdown": {"error": ...}}` on any parse or env failure.

	#### Task Construction Hardening (2026-04-23)

	The `Task(...)` call inside `get_lifestack_evaluation` is wrapped in its own
	`try/except`. On exception, logs `[reward] Task construction failed: <error>` and
	returns the `-0.5` fallback immediately. A field-presence check on
	`(id, goal, constraints, mutable_world, visible_world)` follows construction.

	### Curriculum (`train_curriculum`)

	```
	Stage 1: difficulty=1 → train → eval → if avg_reward > 0.6: difficulty++
	Stage 2: difficulty=2 → ...
	...
	Stage 5: difficulty=5 → final save
	```

	### Dataset (`generate_dataset`)

	Generates `N` prompts by:
	1. Sampling a `TaskGenerator` task (flight_crisis or code_merge_crisis)
	2. Merging a legacy `ConflictEvent` disruption for variety
	3. Cascading the disruption through the `DependencyGraph`
	4. Embedding task metadata in a `<SYSTEM_METADATA>` block for reward reconstruction

	---

	## Outputs

	\| Path \| Contents \|
	\|---\|---\|
	\| `./lifestack_model/` \| Final saved model + tokenizer \|
	\| `./lifestack_model/stage_N/` \| Per-stage checkpoints \|
	\| `training_logs/generations.jsonl` \| Sampled generations (every 20 reward calls) \|
	\| `grpo_reward_curve.png` \| 50-episode eval reward curve \|

	---

	## Change Log

	\| Date \| Change \|
	\|---\|---\|
	\| 2026-04-23 \| `Task()` construction wrapped in try/except + field validation; returns -0.5 fallback on failure \|