LifeStack-GRPO (v1)
Qwen2.5-1.5B fine-tuned with a 5-stage GRPO curriculum for life crisis resolution
Built for the Meta × HuggingFace PyTorch OpenEnv Hackathon 2026 by Team BholeChature, Scaler School of Technology, Bangalore
Note: This is v1, the single-step curriculum checkpoint. For the best model, see jdsb06/lifestack-grpo-v4 (episodic GRPO, peak reward 0.856).
What is This?
This is a LoRA adapter for Qwen/Qwen2.5-1.5B-Instruct, fine-tuned using GRPO via TRL + Unsloth on the LifeStack environment, an OpenEnv-compatible RL environment with 23 interdependent metrics across 6 life-metric domains, connected by a 32-edge dependency graph. Training tasks are sampled across 8 task domains (the 6 life domains plus transport_crisis and code_merge_crisis).
v1 is the single-step checkpoint. It was trained with a 5-stage difficulty curriculum and is the most consistent checkpoint: 45/50 episodes produced non-failing outputs. The later v4 model adds multi-step episodic planning capability.
Model Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen2.5-1.5B-Instruct |
| Adapter type | LoRA (r=16, alpha=16) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | 18,464,768 / 1,562,179,072 (1.18%) |
| Training hardware | Tesla T4 (14.7 GB VRAM) |
| Precision | FP16 |
| Framework | TRL + Unsloth 2026.4.8 + Transformers 5.5.0 |
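The trainable-parameter figure in the table can be reproduced from the LoRA rank and the shapes of the targeted projections. The sketch below is a sanity check only: the layer count and projection dimensions are assumptions taken from the published Qwen2.5-1.5B configuration (28 layers, hidden size 1536, MLP size 8960, key/value width 256), not values stated in this card.

```python
# LoRA adds two matrices per wrapped linear layer: A (r x in) and B (out x r),
# which is r * (in + out) extra parameters per module.
RANK = 16          # r from the table above
NUM_LAYERS = 28    # assumed from the Qwen2.5-1.5B config

# (in_features, out_features) per target module; assumed dimensions
TARGET_SHAPES = {
    "q_proj": (1536, 1536),
    "k_proj": (1536, 256),
    "v_proj": (1536, 256),
    "o_proj": (1536, 1536),
    "gate_proj": (1536, 8960),
    "up_proj": (1536, 8960),
    "down_proj": (8960, 1536),
}


def lora_param_count(rank, shapes, num_layers):
    """Total LoRA parameters across all layers for the given module shapes."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


trainable = lora_param_count(RANK, TARGET_SHAPES, NUM_LAYERS)
total = 1_562_179_072  # base model plus adapter, as reported in the table
share = f"{100 * trainable / total:.2f}%"

print(trainable)  # 18464768
print(share)      # 1.18%
```

Under these assumed shapes the count matches the 18,464,768 trainable parameters reported above.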
Training: 5-Stage GRPO Curriculum
| Stage | Reward Functions | LR | Steps |
|---|---|---|---|
| 1 | format_fn + clean_eos_fn (warm-up only) | 8e-6 | 25 |
| 2 | All 9 reward signals | 5e-6 | 25 |
| 3 | All 9 reward signals | 3e-6 | 25 |
| 4 | All 9 reward signals | 2e-6 | 25 |
| 5 | All 9 reward signals | 1e-6 | 25 |
Total: 125 optimizer steps | ~45 minutes on Tesla T4
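The schedule in the table can be written down as plain data, which makes the step total and the learning-rate decay easy to check. This is a minimal sketch, not the training script: the `Stage` class and `CURRICULUM` name are hypothetical, and only the values are taken from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    index: int
    rewards: str          # free-text label copied from the table
    learning_rate: float
    steps: int


# Hypothetical container; values mirror the curriculum table.
CURRICULUM = [
    Stage(1, "format_fn + clean_eos_fn (warm-up only)", 8e-6, 25),
    Stage(2, "all 9 reward signals", 5e-6, 25),
    Stage(3, "all 9 reward signals", 3e-6, 25),
    Stage(4, "all 9 reward signals", 2e-6, 25),
    Stage(5, "all 9 reward signals", 1e-6, 25),
]

total_steps = sum(stage.steps for stage in CURRICULUM)
learning_rates = [stage.learning_rate for stage in CURRICULUM]

# The learning rate shrinks at every stage boundary.
is_decreasing = all(a > b for a, b in zip(learning_rates, learning_rates[1:]))

print(total_steps)    # 125
print(is_decreasing)  # True
```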
Training Metrics (from train_run_v1.log)
| Step | Format mean | Reward mean | KL |
|---|---|---|---|
| 5 (stage 4 start) | 0.075 | -0.568 | 7.3e-6 |
| 10 | 0.184 | -0.446 | 1.2e-5 |
| 15 | 0.145 | -0.479 | 1.5e-5 |
| 20 | 0.160 | -0.463 | 2.0e-5 |
| 25 (stage 4 end) | -0.036 | -0.652 | 2.8e-5 |
| 30 (stage 5 start) | 0.043 | -0.612 | 3.4e-5 |
| 50 (stage 5 end) | 0.195 | -0.396 | 8.2e-5 |
Post-training evaluation: -0.10 average reward over 50 episodes across all 8 domains.
Training Reward Curve
Raw asset: grpo_reward_curve.png
9-Signal Reward Orchestrator
- reward_format_fn: JSON validity + required keys + valid action_type
- reward_clean_eos_fn: Penalises trailing text after }
- reward_plausibility_fn: Blocks zero-cost massive metric claims
- reward_task_success_fn: Environment simulation: did the action resolve the task?
- reward_milestone_fn: Did the action unlock a task milestone?
- reward_replan_fn: Did the agent adapt after exogenous events?
- reward_reasoning_fn: Does the reasoning text logically justify the action type?
- reward_human_feedback_fn: ChromaDB memory: alignment with past successful trajectories
- reward_longterm_fn: 7-day γ=0.9 discounted rollout
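To make three of these signals concrete, here is a rough sketch of what a format check, a trailing-text penalty, and a 7-day discounted sum could look like. It is not the project's implementation: the function names, the reward values (1.0, 0.0, -1.0), the `reasoning` key, and the action types are all assumptions for illustration. The card itself only confirms that an `action_type` key exists and that the discount is γ=0.9 over 7 days.

```python
import json

# Hypothetical schema; only "action_type" is confirmed by the card.
REQUIRED_KEYS = {"action_type", "reasoning"}
VALID_ACTION_TYPES = {"rest", "delegate", "negotiate"}


def _first_object(text):
    """Parse the first JSON object in text; return (obj, end_index) or None."""
    start = text.find("{")
    if start == -1:
        return None
    try:
        obj, length = json.JSONDecoder().raw_decode(text[start:])
    except json.JSONDecodeError:
        return None
    return obj, start + length


def format_reward(completion):
    """1.0 for valid JSON with required keys and a known action_type, else 0.0."""
    parsed = _first_object(completion)
    if parsed is None:
        return 0.0
    obj, _ = parsed
    if not REQUIRED_KEYS <= obj.keys():
        return 0.0
    action = obj["action_type"]
    return 1.0 if isinstance(action, str) and action in VALID_ACTION_TYPES else 0.0


def clean_eos_reward(completion):
    """0.0 when nothing follows the first JSON object, -1.0 otherwise."""
    parsed = _first_object(completion)
    if parsed is None:
        return -1.0
    _, end = parsed
    return 0.0 if not completion[end:].strip() else -1.0


def discounted_return(daily_rewards, gamma=0.9):
    """Sum of gamma**t * r_t, assuming day 0 is undiscounted."""
    return sum(gamma ** t * r for t, r in enumerate(daily_rewards))


sample = '{"action_type": "rest", "reasoning": "recover sleep debt"}'
print(format_reward(sample))                   # 1.0
print(clean_eos_reward(sample + " and more"))  # -1.0
```

With a constant reward of 1.0 on each of the 7 days, `discounted_return` gives roughly 5.217, which shows how much weight γ=0.9 leaves on later days.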
Known Limitation: clipped_ratio = 1.0
The model fills its full 96-token budget on every generation and never emits a natural EOS during training, so every completion is truncated at the length limit. This is mitigated at inference via _JsonCompleteStopping and first-object extraction.
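The mitigation can be illustrated with a small brace-depth scanner: once the first JSON object is balanced, generation can stop and everything after it can be discarded. This is a sketch of the idea only. The helper names are hypothetical and it does not reproduce the internals of `_JsonCompleteStopping`, which are not shown in this card.

```python
import json


def first_json_object_end(text):
    """Return the index just past the first balanced {...}, or -1 if incomplete."""
    depth = 0
    started = False
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            # Braces inside string literals must not affect the depth.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == "{":
            depth += 1
            started = True
        elif started and ch == '"':
            in_string = True
        elif started and ch == "}":
            depth -= 1
            if depth == 0:
                return i + 1
    return -1


def extract_first_object(text):
    """Parse and return the first complete JSON object, or None."""
    end = first_json_object_end(text)
    if end == -1:
        return None
    try:
        return json.loads(text[text.index("{"):end])
    except json.JSONDecodeError:
        return None


raw = 'Sure: {"action_type": "rest", "note": "a } b"} trailing {"x": 1}'
print(extract_first_object(raw))
# {'action_type': 'rest', 'note': 'a } b'}
```

A stopping rule built on this would decode the tokens generated so far and halt as soon as `first_json_object_end` returns a non-negative index.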
How to Load
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
base = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen2.5-1.5B-Instruct",
torch_dtype=torch.float16,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("jdsb06/lifestack-grpo")
model = PeftModel.from_pretrained(base, "jdsb06/lifestack-grpo")
model.eval()
Citation
@misc{lifestack2026,
title = {LifeStack: Training AI to Handle Life's Cascading Crises},
author = {Team BholeChature, Scaler School of Technology},
year = {2026},
howpublished = {Meta × HuggingFace PyTorch OpenEnv Hackathon 2026},
url = {https://github.com/oki-dokii/Meta-R2}
}
