LifeStack-GRPO (v1)

Qwen2.5-1.5B-Instruct fine-tuned with a 5-stage GRPO curriculum for life crisis resolution

Built for the Meta × HuggingFace PyTorch OpenEnv Hackathon 2026. Team BholeChature, Scaler School of Technology, Bangalore.



Note: This is v1, the single-step curriculum checkpoint. For the best model, see jdsb06/lifestack-grpo-v4 (episodic GRPO, peak reward 0.856).


What is This?

This is a LoRA adapter for Qwen/Qwen2.5-1.5B-Instruct, fine-tuned using GRPO via TRL + Unsloth on the LifeStack environment: an OpenEnv-compatible RL environment with 23 interdependent metrics across 6 life-metric domains, connected by a 32-edge dependency graph. Training tasks are sampled across 8 task domains (the 6 life domains plus transport_crisis and code_merge_crisis).

v1 is the single-step checkpoint. It was trained with a 5-stage difficulty curriculum and is the most consistent checkpoint: 45 of 50 episodes produced non-failing outputs. The later v4 model adds multi-step episodic planning.


Model Details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen2.5-1.5B-Instruct |
| Adapter type | LoRA (r=16, alpha=16) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | 18,464,768 / 1,562,179,072 (1.18%) |
| Training hardware | Tesla T4 (14.7 GB VRAM) |
| Precision | FP16 |
| Framework | TRL + Unsloth 2026.4.8 + Transformers 5.5.0 |
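The trainable-parameter figure in the table can be reproduced by hand. The sketch below assumes the published Qwen2.5-1.5B architecture dimensions (hidden size 1536, MLP size 8960, 28 decoder layers, 2 key/value heads of dimension 128), which are not stated in this card, together with the standard LoRA rule that a rank-r adapter on a linear layer with d_in inputs and d_out outputs adds r * (d_in + d_out) parameters.

```python
# Assumed Qwen2.5-1.5B dimensions (taken from the base model's public config,
# not from this card).
HIDDEN = 1536
INTERMEDIATE = 8960
NUM_LAYERS = 28
KV_DIM = 2 * 128  # num_key_value_heads * head_dim

RANK = 16  # LoRA r, from the Model Details table

# (in_features, out_features) of each targeted projection in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters added by one LoRA pair: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
trainable = per_layer * NUM_LAYERS

# Total parameter count (base + adapter) as reported in the table.
total = 1_562_179_072

print(trainable)
print(f"{100 * trainable / total:.2f}%")
```

Under these assumptions the computed count matches the 18,464,768 trainable parameters and the 1.18% share reported in the table.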

Training: 5-Stage GRPO Curriculum

| Stage | Reward Functions | LR | Steps |
|---|---|---|---|
| 1 | format_fn + clean_eos_fn (warm-up only) | 8e-6 | 25 |
| 2 | All 9 reward signals | 5e-6 | 25 |
| 3 | All 9 reward signals | 3e-6 | 25 |
| 4 | All 9 reward signals | 2e-6 | 25 |
| 5 | All 9 reward signals | 1e-6 | 25 |

Total: 125 optimizer steps | ~45 minutes on Tesla T4
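The schedule can also be written down as plain data, which makes the step total and the learning-rate decay easy to check. This is a minimal sketch: the `Stage` dataclass and its field names are illustrative and are not taken from the actual training code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One curriculum stage (hypothetical container, for illustration only)."""
    index: int
    rewards: str
    learning_rate: float
    steps: int


# Values copied from the curriculum table.
CURRICULUM = [
    Stage(1, "format_fn + clean_eos_fn (warm-up only)", 8e-6, 25),
    Stage(2, "all 9 reward signals", 5e-6, 25),
    Stage(3, "all 9 reward signals", 3e-6, 25),
    Stage(4, "all 9 reward signals", 2e-6, 25),
    Stage(5, "all 9 reward signals", 1e-6, 25),
]

total_steps = sum(stage.steps for stage in CURRICULUM)
learning_rates = [stage.learning_rate for stage in CURRICULUM]

print(total_steps)
print(learning_rates)
```

The learning rate decreases monotonically from stage to stage, and the per-stage steps sum to the 125 optimizer steps quoted above.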

Training Metrics (from train_run_v1.log)

| Step | Format mean | Reward mean | KL |
|---|---|---|---|
| 5 (stage 4 start) | 0.075 | -0.568 | 7.3e-6 |
| 10 | 0.184 | -0.446 | 1.2e-5 |
| 15 | 0.145 | -0.479 | 1.5e-5 |
| 20 | 0.160 | -0.463 | 2.0e-5 |
| 25 (stage 4 end) | -0.036 | -0.652 | 2.8e-5 |
| 30 (stage 5 start) | 0.043 | -0.612 | 3.4e-5 |
| 50 (stage 5 end) | 0.195 | -0.396 | 8.2e-5 |

Post-training evaluation: -0.10 average reward over 50 episodes across all 8 domains.

Training Reward Curve

Figure: GRPO reward curve (raw asset: grpo_reward_curve.png)

9-Signal Reward Orchestrator

  1. reward_format_fn: JSON validity + required keys + valid action_type
  2. reward_clean_eos_fn: Penalises trailing text after }
  3. reward_plausibility_fn: Blocks zero-cost massive metric claims
  4. reward_task_success_fn: Environment simulation: did the action resolve the task?
  5. reward_milestone_fn: Did the action unlock a task milestone?
  6. reward_replan_fn: Did the agent adapt after exogenous events?
  7. reward_reasoning_fn: Does the reasoning text logically justify the action type?
  8. reward_human_feedback_fn: ChromaDB memory: alignment with past successful trajectories
  9. reward_longterm_fn: 7-day γ=0.9 discounted rollout
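Two of these signals are simple enough to illustrate. The sketch below is not the project's implementation: `check_format` and `discounted_return` are hypothetical helpers, the required keys and allowed action types are passed in by the caller because the card does not list them, and the actual reward values assigned by reward_format_fn and reward_longterm_fn are unknown. It only shows the kind of check described for signal 1 and the standard discounted sum implied by "7-day γ=0.9" for signal 9.

```python
import json


def check_format(completion, required_keys, valid_action_types):
    """Return True if the completion is a single JSON object that contains
    every required key and an allowed action_type.

    The key names and action types are caller-supplied assumptions.
    """
    try:
        obj = json.loads(completion)
    except json.JSONDecodeError:
        # Invalid JSON, or trailing text after the closing brace.
        return False
    if not isinstance(obj, dict):
        return False
    if not all(key in obj for key in required_keys):
        return False
    return obj.get("action_type") in valid_action_types


def discounted_return(daily_rewards, gamma=0.9):
    """Sum of gamma**day * reward over the rollout, with day starting at 0."""
    return sum(gamma ** day * reward for day, reward in enumerate(daily_rewards))


# Hypothetical keys and action type, for illustration only.
ok = check_format(
    '{"action_type": "rest", "reasoning": "sleep debt is high"}',
    required_keys=["action_type", "reasoning"],
    valid_action_types={"rest"},
)
print(ok)

# A constant reward of 1.0 per day over a 7-day horizon.
print(round(discounted_return([1.0] * 7), 6))
```

With γ=0.9, a reward received on day 6 is weighted by roughly 0.53, so the 7-day rollout still gives meaningful weight to late consequences of an action.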

Known Limitation: clipped_ratio = 1.0

The model fills its full 96-token budget on every generation and never emits a natural EOS during training. This is mitigated at inference via _JsonCompleteStopping and first-object extraction.
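First-object extraction can be sketched with the standard library alone. The internals of _JsonCompleteStopping and of the project's extraction code are not shown in this card, so the helper below (`extract_first_json_object`, a hypothetical name) is only one plausible way to recover the first complete JSON object from a generation that keeps going after the closing brace.

```python
import json


def extract_first_json_object(text):
    """Return the first complete JSON object found in text, or None.

    Tries each "{" in turn as a candidate start and lets the JSON decoder
    decide where the object ends, so trailing text is ignored.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not a valid object at this position; try the next brace.
            start = text.find("{", start + 1)
    return None


# Hypothetical generation that overruns its JSON answer.
raw = '{"action_type": "rest", "reasoning": "recover"} and then more text {"x": 1}'
print(extract_first_json_object(raw))
```

Because `raw_decode` stops at the end of the first valid value, braces inside string fields do not confuse the extraction, which a naive brace counter would need extra care to handle.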


How to Load

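from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the base model in FP16, the precision listed under Model Details.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)
# The tokenizer is loaded from the adapter repository.
tokenizer = AutoTokenizer.from_pretrained("jdsb06/lifestack-grpo")
# Attach the LoRA adapter weights on top of the base model.
model = PeftModel.from_pretrained(base, "jdsb06/lifestack-grpo")
model.eval()  # inference mode (disables dropout)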

Citation

@misc{lifestack2026,
  title        = {LifeStack: Training AI to Handle Life's Cascading Crises},
  author       = {{Team BholeChature, Scaler School of Technology}},
  year         = {2026},
  howpublished = {Meta × HuggingFace PyTorch OpenEnv Hackathon 2026},
  url          = {https://github.com/oki-dokii/Meta-R2}
}