🧬 LifeOS Trained Agent (Mistral-7B-Instruct-v0.3)


This model was trained to survive the chaos of an unpredictable, stressful student week using GRPO (Group Relative Policy Optimization) within the LifeOS OpenEnv simulation.

It is a fine-tuned version of mistralai/Mistral-7B-Instruct-v0.3 that has learned to balance five competing constraints (energy, stress, deadlines, social obligations, and budget) under high uncertainty: every step carries a 35% probability of a random chaos event.

🏆 Meta OpenEnv Hackathon 2026 Submission


🚀 Model Capabilities: Triage over Grinding

Most agents fail at long-horizon personal planning because they treat scheduling as a static puzzle. This agent was trained in a dynamic environment where pushing too hard leads to burnout (-1.5 penalty) and ignoring friends leads to social debt (-0.8 penalty).

Key Behaviors Learned via RL:

  1. Proactive Recovery: It learns to call the rest action before its energy drops to critical levels, avoiding burnout cascades.
  2. Social Debt Management: It prioritizes the reply_message action to maintain relationships, clearing unread messages before they heavily penalize the social coherence score.
  3. Strategic Delegation: It learns to use budget (₹) via delegate_task to offload low-priority work when energy is low and deadlines are looming.
  4. Resilience to Chaos: When a random chaos event (e.g., "Deadline moved up by 2 days") fires, it can pause, recover, and pivot its focus without collapsing.
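To make the chaos mechanic concrete, here is a minimal sketch of a per-step chaos roll at the 35% rate quoted above. The function name `roll_chaos` and the event list are illustrative assumptions, not the LifeOS implementation; the only event text taken from this card is the deadline example.

```python
import random
from typing import Optional

# Per-step chaos probability quoted in this card.
CHAOS_PROBABILITY = 0.35

# Only this event text comes from the card; a real environment
# presumably has a larger pool (assumption).
CHAOS_EVENTS = ["Deadline moved up by 2 days"]


def roll_chaos(rng: random.Random) -> Optional[str]:
    """Return a chaos event with probability CHAOS_PROBABILITY, else None."""
    if rng.random() < CHAOS_PROBABILITY:
        return rng.choice(CHAOS_EVENTS)
    return None


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the episode is reproducible
    episode = [roll_chaos(rng) for _ in range(30)]
    fired = sum(event is not None for event in episode)
    print(f"{fired} chaos events in a 30-step episode")
```

At this rate a 30-step episode sees roughly ten chaos events on average, which is why a fixed plan made at step 1 rarely survives.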

📊 Training Methodology

The model was trained entirely with reinforcement learning, using the GRPO trainer from the trl library, without human-labeled preference data. We designed a multi-objective reward function to prevent reward hacking:

  1. Task Completion: +1.0 for finishing, -1.0 for missing deadlines.
  2. Social Coherence: -0.8 penalty for social debt from leaving messages unread.
  3. Energy Sustainability: +0.4 for proactive rest, -1.5 (Game Over) for burnout.
  4. Format Compliance: Strict adherence to valid JSON actions.
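The components above could be combined along the following lines. This is a sketch, not the actual LifeOS reward code: the `StepOutcome` fields are hypothetical, the card gives no number for format violations (the `INVALID_FORMAT` value is an assumption), and whether an invalid action short-circuits the other terms is also an assumption.

```python
from dataclasses import dataclass

# Reward magnitudes quoted in this card.
TASK_DONE = 1.0
DEADLINE_MISSED = -1.0
SOCIAL_DEBT = -0.8
PROACTIVE_REST = 0.4
BURNOUT = -1.5
# Assumption: the card does not state a value for invalid JSON actions.
INVALID_FORMAT = -0.5


@dataclass
class StepOutcome:
    """Hypothetical summary of what happened during one environment step."""
    tasks_completed: int = 0
    deadlines_missed: int = 0
    social_debt_incurred: bool = False
    proactive_rest: bool = False
    burned_out: bool = False
    valid_json_action: bool = True


def step_reward(outcome: StepOutcome) -> float:
    """Sum the reward components for a single step."""
    # Assumption: an unparseable action earns only the format penalty,
    # since no action could have been executed.
    if not outcome.valid_json_action:
        return INVALID_FORMAT
    reward = TASK_DONE * outcome.tasks_completed
    reward += DEADLINE_MISSED * outcome.deadlines_missed
    if outcome.social_debt_incurred:
        reward += SOCIAL_DEBT
    if outcome.proactive_rest:
        reward += PROACTIVE_REST
    if outcome.burned_out:
        reward += BURNOUT
    return reward
```

Keeping each term separate makes it harder for the policy to maximize one objective (for example, resting every step) while ignoring the others.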

Training Stack:

  • Frameworks: unsloth (for 2x faster, memory-efficient 4-bit LoRA training), trl (GRPO Trainer).
  • Compute: Single Google Colab T4 GPU (free tier).
  • Episodes: 30 training episodes, 30 steps each.
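For readers new to GRPO, the core idea is that each sampled completion is scored relative to the other completions generated for the same prompt, so no separate value model is needed. The sketch below shows only that normalization step in plain Python; it is not the trl implementation, and details such as the `eps` value and the choice of population standard deviation are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    `rewards` must be non-empty.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; libraries may differ here
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical completions for the same observation.
    group = [1.4, -1.0, 0.4, -1.5]
    for reward, adv in zip(group, group_relative_advantages(group)):
        print(f"reward={reward:+.1f}  advantage={adv:+.3f}")
```

Completions that beat their group's average receive positive advantages and are reinforced; when every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.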

Reward Curves


💻 How to Use

This model is intended to be used as a policy agent within the LifeOS OpenEnv framework. It expects a highly structured observation prompt and is trained to output a specific JSON action schema preceded by an inner monologue block.

Example Input (Observation):

=== Step 14 / 30 ===
Vitals: Energy=35, Stress=72
Resources: Budget=₹1800, Relationships=0.60
Chaos Event: Assignment deadline moved up by 2 days!
Pending Tasks: 3
Unread Messages: 1
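If you need to build such a prompt from your own state, a helper like the one below reproduces the layout shown above. The function name and its parameters are hypothetical, and the `None` placeholder for steps without a chaos event is an assumption, since the card only shows a step where one fired.

```python
from typing import Optional


def format_observation(
    step: int,
    total_steps: int,
    energy: int,
    stress: int,
    budget: int,
    relationships: float,
    chaos_event: Optional[str],
    pending_tasks: int,
    unread_messages: int,
) -> str:
    """Render a state snapshot in the observation layout shown in this card."""
    lines = [
        f"=== Step {step} / {total_steps} ===",
        f"Vitals: Energy={energy}, Stress={stress}",
        f"Resources: Budget=₹{budget}, Relationships={relationships:.2f}",
        # Assumption: steps without a chaos event print "None".
        f"Chaos Event: {chaos_event or 'None'}",
        f"Pending Tasks: {pending_tasks}",
        f"Unread Messages: {unread_messages}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_observation(
        step=14, total_steps=30, energy=35, stress=72, budget=1800,
        relationships=0.6,
        chaos_event="Assignment deadline moved up by 2 days!",
        pending_tasks=3, unread_messages=1,
    ))
```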

Example Output (Agent Action):

<thought>
Energy is at 35, which is dangerous territory. If I push to work on the urgent assignment now, I will hit 0 energy and burn out. I have enough budget to offload the minor task, but first I must rest to recover my vitals before tackling the big deadline.
</thought>
{
  "action_type": "rest",
  "reasoning": "Need to recover energy before prioritizing the urgent assignment."
}
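Downstream code has to split this output into the monologue and the action. A minimal parsing sketch follows, assuming the output always has the shape shown above (an optional `<thought>` block followed by one JSON object containing `action_type`); the function name `parse_agent_output` is hypothetical and not part of LifeOS.

```python
import json
import re
from typing import Optional, Tuple

THOUGHT_RE = re.compile(r"<thought>(.*?)</thought>", re.DOTALL)


def parse_agent_output(text: str) -> Tuple[Optional[str], Optional[dict]]:
    """Split model output into (thought, action).

    Either element is None if it is missing or malformed.
    """
    match = THOUGHT_RE.search(text)
    thought = match.group(1).strip() if match else None
    # Look for the JSON action after the thought block, if there is one.
    rest = text[match.end():] if match else text
    start = rest.find("{")
    if start == -1:
        return thought, None
    try:
        # raw_decode tolerates trailing text after the JSON object.
        action, _ = json.JSONDecoder().raw_decode(rest[start:])
    except json.JSONDecodeError:
        return thought, None
    if not isinstance(action, dict) or "action_type" not in action:
        return thought, None
    return thought, action


if __name__ == "__main__":
    sample = (
        "<thought>\nEnergy is at 35, so I should rest first.\n</thought>\n"
        '{\n  "action_type": "rest",\n  "reasoning": "Need to recover energy."\n}'
    )
    thought, action = parse_agent_output(sample)
    print(thought)
    print(action)
```

Returning `None` for the action instead of raising lets the caller decide how to treat a malformed generation, for example by applying a format penalty or retrying.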

⚙️ Loading the Model

You can load this model with Hugging Face transformers, or with unsloth for faster inference. The example below uses unsloth and requires a CUDA GPU.

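from unsloth import FastLanguageModel

# Load the fine-tuned weights in 4-bit so they fit on a small GPU.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "SParsh003/LifeOS-Trained-Agent",
    max_seq_length = 2048,
    dtype = None,  # None lets unsloth choose the dtype for the hardware
    load_in_4bit = True,
)

# Switch the model to unsloth's optimized inference mode.
FastLanguageModel.for_inference(model)

prompt = """... (insert LifeOS observation here) ..."""
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=256)
# The decoded text contains the prompt followed by the generated action.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))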

This model was trained 2x faster with Unsloth 🦥.
