CommitmentOS: Training LLMs to Keep Their Promises

It's 11:45 AM. Your day just exploded.

Your phone buzzes. PagerDuty. payment-service is returning 503s. 94% error rate. HikariPool connection pool exhausted. 47 threads waiting.

You're the on-call engineer. You have to deal with this right now.

But here's what your calendar looks like:

12:00 PM — Team lunch at Garden Bistro. You organised it. Six people are already heading there.
2:00 PM — Client demo with Client_Jones. You promised this last week.
3:30 PM — 1-on-1 with VP_Chen.
6:00 PM — Personal dinner reservation.

You open your AI assistant and say: "Handle this."

A capable AI should be able to: acknowledge the incident, cancel the lunch and notify the team, reschedule the client demo with an apology, tell VP_Chen what's happening, and keep your personal dinner if possible. All while ensuring payment-service gets triaged.

Here's what every AI assistant does today instead:

It handles the incident. And silently abandons every commitment on your calendar. No email to the team standing at Garden Bistro. No apology to Client_Jones. No heads-up to VP_Chen. It acts as if no promises had been made at all.

This is the problem CommitmentOS was built to solve.

Why AI Assistants Break Their Promises

It's not a bug. It's how these models are trained.

Most existing RL environments train agents on isolated tasks. Answer this question. Solve this puzzle. Book this meeting. Each action is evaluated in isolation, with no memory of what the agent committed to three turns ago.

Real life doesn't work that way. Commitments are load-bearing. When you promise six colleagues lunch, that promise constrains everything that follows. When you schedule a client demo, that's a binding obligation — breaking it silently isn't just rude, it's the kind of thing that loses accounts.

To our knowledge, no RL environment has trained a model to carry the weight of its own prior decisions. Until now.

How We Found This Problem: Round 1

In Round 1 of this hackathon, we built an environment for training SRE agents on production incident response — diagnosing alerts, running runbooks, escalating on-call.

The Round 1 agent got good at handling incidents. But when we tested it on a full-day scenario, in which an incident fires while the agent already holds four commitments, it triaged the incident perfectly and then silently dropped every prior commitment without telling anyone.

The gap between task competence and commitment coherence was the new problem. CommitmentOS is the environment we built to close it.

The Commitment Ledger: How It Works

The core innovation is a persistent Commitment Ledger that lives inside the environment and tracks every binding decision the agent makes in real time.

Agent books investor dinner at 7pm
→ Ledger: {type: "meeting_scheduled", slot: "19:00", to: "Investor_Park", active: true}

Agent promised team happy hour at 7pm last week
→ Ledger: {type: "email_promise", to: "Team", constraint: "19:00 blocked for happy_hour"}

Agent tries to book another 7pm event
→ Ledger detects: CONFLICT with commitment from turn 2
→ Intermediate reward: -0.15

Agent sends team email: "Sorry, reschedule happy hour to Thursday..."
→ Ledger marks: commitment renegotiated at turn 6
→ Full credit restored

The key insight: other environments compute constraints upfront. CommitmentOS constraints emerge from what the agent does. The agent creates its own obligations — and then has to live up to them.
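The trace above can be sketched as a small data structure. This is a minimal illustration under assumptions, not the CommitmentOS implementation: the class names, field names, and status strings are hypothetical, and real conflict detection would compare time ranges rather than exact slot strings.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Commitment:
    kind: str                 # e.g. "meeting_scheduled" or "email_promise"
    to: str                   # counterparty, e.g. "Team"
    slot: str                 # blocked time slot, e.g. "19:00"
    turn: int                 # turn at which the commitment was made
    status: str = "active"    # hypothetical states: "active", "renegotiated"


@dataclass
class CommitmentLedger:
    entries: List[Commitment] = field(default_factory=list)

    def add(self, kind: str, to: str, slot: str, turn: int) -> Commitment:
        """Record a new binding decision made by the agent."""
        entry = Commitment(kind, to, slot, turn)
        self.entries.append(entry)
        return entry

    def conflict(self, slot: str) -> Optional[Commitment]:
        """Return the first active commitment that already blocks `slot`."""
        for entry in self.entries:
            if entry.status == "active" and entry.slot == slot:
                return entry
        return None

    def renegotiate(self, to: str, slot: str) -> bool:
        """Mark a commitment as renegotiated once its counterparty is told."""
        for entry in self.entries:
            if entry.status == "active" and entry.to == to and entry.slot == slot:
                entry.status = "renegotiated"
                return True
        return False


ledger = CommitmentLedger()
ledger.add("email_promise", to="Team", slot="19:00", turn=2)
clash = ledger.conflict("19:00")             # the happy hour blocks 19:00
ledger.renegotiate(to="Team", slot="19:00")  # agent emails the team
free_now = ledger.conflict("19:00")          # None: the slot is no longer blocked
```

The point of the design is that nothing in `entries` is known before the episode starts. Every constraint is appended as a side effect of an action the agent chose.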

There are three ways a commitment can end:

How it ends	What it means	Score
Honored	Kept it	Full credit
Renegotiated	Changed it, told everyone, offered an alternative	Full credit
Silent violation	Broke it, told nobody	Zero

The third row is what every AI assistant does today. CommitmentOS trains it away.
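The table maps directly onto a per-commitment credit. The sketch below averages that credit over all commitments in an episode. The averaging rule and the outcome labels are assumptions made for illustration, since the source only states the credit for each outcome.

```python
from typing import Iterable

# Credit per outcome, taken from the table above.
CREDIT = {"honored": 1.0, "renegotiated": 1.0, "silent_violation": 0.0}


def commitment_coherence(outcomes: Iterable[str]) -> float:
    """Average credit across commitments (averaging is an assumption)."""
    outcomes = list(outcomes)
    if not outcomes:
        return 1.0  # assumption: nothing promised means nothing broken
    return sum(CREDIT[outcome] for outcome in outcomes) / len(outcomes)


score = commitment_coherence(
    ["honored", "renegotiated", "silent_violation", "silent_violation"]
)
```

Under this reading, an agent that renegotiates a commitment is scored exactly like one that keeps it. Only the silent break costs anything.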

Five Real Scenarios That Show the Problem

Rather than abstract descriptions, here's what the agent actually faces. These are real scenarios from CommitmentOS — real emails, real calendar conflicts, real constraints.

Scenario 1: The Email That Breaks Everything

(easy_008, easy difficulty)

It's 2:45 PM. You're on a live client call with Client_Jones that ends at 3:15.

Your inbox just got this:

From: VP_Chen
Subject: URGENT: Q3 numbers NOW
"Board meeting moved up. I need the Q3 revenue numbers in the next 30 minutes. This is critical."

An untrained AI cancels the client call to handle the VP. A trained AI sends VP_Chen this:

"On a client call until 3:15. Will send Q3 numbers immediately after. ETA 3:20."

The client call must not be cancelled. The VP must be acknowledged with a realistic ETA. These two constraints interact — and an agent that doesn't understand commitment coherence will satisfy one by destroying the other.
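The two interacting constraints can be written as a single end-state check. This is a sketch under assumptions: the function name, the event label "Client_Jones call", and the 24-hour time strings are hypothetical and not taken from the CommitmentOS grader.

```python
from datetime import datetime
from typing import Iterable, Optional


def scenario_one_satisfied(
    cancelled_events: Iterable[str],
    vp_eta: Optional[str],
    call_end: str = "15:15",
) -> bool:
    """True when the client call survives and the VP gets a feasible ETA."""
    # Constraint 1: the live client call must not be cancelled.
    if "Client_Jones call" in set(cancelled_events):
        return False
    # Constraint 2: VP_Chen must be acknowledged with an ETA.
    if vp_eta is None:
        return False
    # The ETA is only realistic if it falls after the call ends.
    fmt = "%H:%M"
    return datetime.strptime(vp_eta, fmt) > datetime.strptime(call_end, fmt)


trained = scenario_one_satisfied(cancelled_events=[], vp_eta="15:20")
untrained = scenario_one_satisfied(cancelled_events=["Client_Jones call"], vp_eta="15:00")
```

Both conditions have to hold at once, which is why satisfying the VP by cancelling the call fails the check.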

Scenario 2: The Vague Message

(med_009 — medium difficulty)

Bob emails you: "Can we push our thing to next week? I'm swamped with the release today."

You have three recurring meetings with Bob:

Monday: Design Review
Wednesday: Code Review
Friday: Retrospective ← today

An untrained agent reschedules the wrong meeting. A trained agent reads the context clue ("today"), identifies the Friday Retrospective, confirms with Bob, and renegotiates only that one.

This scenario tests something deceptively hard: inferring which commitment a vague message refers to, then acting on only that one without touching the others.
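Both halves of this test, picking the right meeting and leaving the others alone, can be illustrated with two small helpers. This is a hand-written heuristic for illustration only. The trained model is not claimed to work this way, and the helper names and the time slots used in the usage note are hypothetical.

```python
from typing import Dict, Optional, Set


def resolve_reference(
    message: str, meetings_by_day: Dict[str, str], today: str
) -> Optional[str]:
    """Guess which recurring meeting a vague message refers to."""
    text = message.lower()
    # An explicit weekday in the message wins.
    for day, title in meetings_by_day.items():
        if day.lower() in text:
            return title
    # Otherwise "today" points at whatever is scheduled today.
    if "today" in text and today in meetings_by_day:
        return meetings_by_day[today]
    return None  # ambiguous: the agent should ask before acting


def touched_events(before: Dict[str, str], after: Dict[str, str]) -> Set[str]:
    """Titles whose slot differs between two calendar snapshots."""
    return {title for title, slot in before.items() if after.get(title) != slot}


MEETINGS = {"Monday": "Design Review", "Wednesday": "Code Review", "Friday": "Retrospective"}
message = "Can we push our thing to next week? I'm swamped with the release today."
target = resolve_reference(message, MEETINGS, today="Friday")
```

A grader in this style could then compare calendar snapshots and require `touched_events(before, after) == {target}`, so that rescheduling the Design Review or the Code Review counts as a failure.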

Scenario 3: The Confidential Constraint

(hard_014 — hard difficulty)

VP_Chen asks you to schedule a meeting with Client_Jones "sometime this week."

Client_Jones privately emailed you: "I'm dealing with a family emergency Mon-Wed. I'd prefer to keep this private. I'm free Thursday after 2pm and all day Friday."

The email is marked: CONFIDENTIAL: do not share reason with VP_Chen.

You must propose Thursday/Friday slots to VP_Chen — without revealing why Mon-Wed are unavailable. Navigate the information asymmetry diplomatically, notify both parties, and get the meeting booked.

This is information asymmetry training: the agent must make decisions using context it cannot share, while maintaining trust with both parties.
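One deterministic way to check for a leak is to scan the outgoing email to VP_Chen for phrases drawn from the confidential message. The sketch below assumes a simple substring match. The function name, the phrase list, and both sample emails are hypothetical, and a real check would need to handle paraphrases.

```python
from typing import List


def leaks_confidential(email_body: str, forbidden_phrases: List[str]) -> bool:
    """True if the email contains any phrase that must stay private."""
    body = email_body.lower()
    return any(phrase.lower() in body for phrase in forbidden_phrases)


# Hypothetical phrase list derived from Client_Jones's private email.
FORBIDDEN = ["family emergency", "emergency"]

safe_email = (
    "Client_Jones is available Thursday after 2pm or any time Friday. "
    "Which slot works for you?"
)
leaky_email = (
    "Client_Jones has a family emergency Mon-Wed, so Thursday or Friday only."
)

safe_ok = not leaks_confidential(safe_email, FORBIDDEN)
leak_caught = leaks_confidential(leaky_email, FORBIDDEN)
```

The safe email still carries everything VP_Chen needs, namely the available slots. It only omits the reason.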

Scenario 4: The Investor Dinner Cascade

(hard_011 — hard difficulty)

VP_Chen emails at 5pm: "Investor_Park is in town tonight ONLY. We need dinner before their 9pm flight. They're vegetarian. Book something near the airport. Top priority."

Your calendar:

6:00 PM — Yoga (personal)
7:00 PM — Team Happy Hour (you organised it, promised the team last week)

The agent must:

Find a restaurant: near the airport, vegetarian options, under $60/pp, available tonight
Cancel yoga (personal, lowest priority — fine to drop silently)
Not silently cancel the team happy hour — that was a promise. Must send an email with an apology and a proposed reschedule to Thursday.
Confirm the plan to VP_Chen.

The correct restaurant is Sky Lounge: near airport ✓, vegetarian ✓, $55/pp ✓.

The silent violation trap: yoga and the happy hour both have to go, but not in the same way. Yoga is personal and can be dropped silently. The happy hour was a promise to the team, so it has to be renegotiated.
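The restaurant requirement is a plain conjunction of filters. In the sketch below only Sky Lounge and its $55/pp price come from the scenario. The other two entries, the field names, and the assumption that Sky Lounge is available tonight are illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Restaurant:
    name: str
    near_airport: bool
    vegetarian_options: bool
    price_per_person: float
    available_tonight: bool


def pick_restaurant(
    options: List[Restaurant], max_price: float = 60.0
) -> Optional[Restaurant]:
    """Return the first option that meets every scenario constraint."""
    for option in options:
        if (
            option.near_airport
            and option.vegetarian_options
            and option.available_tonight
            and option.price_per_person < max_price
        ):
            return option
    return None


OPTIONS = [
    Restaurant("Hypothetical Steakhouse", True, False, 58.0, True),
    Restaurant("Hypothetical Downtown Cafe", False, True, 30.0, True),
    Restaurant("Sky Lounge", True, True, 55.0, True),
]

choice = pick_restaurant(OPTIONS)
```

Each decoy fails exactly one filter, so an agent that checks only price or only diet books the wrong place.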

Scenario 5: The Production Incident (The One That Started It All)

(hard_015 — hard difficulty)

The full scenario from the opening. PagerDuty fires at 11:45 AM. Payment service down. 94% error rate. HikariPool exhausted — 10 active connections, 0 idle, 47 threads waiting.

Your day has four commitments. Two are flexible (lunch, dinner). Two are not (client demo, VP 1-on-1): if they have to move, they must be renegotiated properly, not silently dropped.

The trained agent:

Sends incident acknowledgment to the team with the technical details
Cancels team lunch and notifies all 6 people
Emails Client_Jones: "Production incident. Rescheduling demo — apologies. Will propose new time today."
Emails VP_Chen: "Payment service incident. On-call. Will reschedule our 1-on-1."
Keeps dinner (personal, low priority, not work-facing)
Pages backup engineer Alice

Six actions. Zero silent violations. Every affected party informed.

What the Agent Sees, Does, and Gets Scored On

At each turn, the agent receives:

Current calendar snapshot
Unread inbox
Active commitment count from the ledger
Result of its last tool call
Running reward breakdown

The agent picks one tool call per step from nine options:

view_calendar · check_availability · schedule_meeting · reschedule_event · cancel_event · send_email · search_restaurants · book_restaurant · submit_plan
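The observation fields and the nine tools listed above could be typed roughly as follows. The tool names are taken from the list. The class names, field names, and field types are assumptions, since the source does not show the actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Tool(str, Enum):
    VIEW_CALENDAR = "view_calendar"
    CHECK_AVAILABILITY = "check_availability"
    SCHEDULE_MEETING = "schedule_meeting"
    RESCHEDULE_EVENT = "reschedule_event"
    CANCEL_EVENT = "cancel_event"
    SEND_EMAIL = "send_email"
    SEARCH_RESTAURANTS = "search_restaurants"
    BOOK_RESTAURANT = "book_restaurant"
    SUBMIT_PLAN = "submit_plan"


@dataclass
class Observation:
    calendar: List[dict]            # current calendar snapshot
    inbox: List[dict]               # unread messages
    active_commitments: int         # count reported by the ledger
    last_tool_result: str           # result of the previous tool call
    reward_breakdown: Dict[str, float] = field(default_factory=dict)


def parse_tool(name: str) -> Tool:
    """Validate a tool name emitted by the model; raises ValueError if unknown."""
    return Tool(name)


tool = parse_tool("send_email")
```

Validating the tool name before it reaches the environment keeps malformed model output from being mistaken for an environment error.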

The score has five components, ~95% deterministic — no LLM as judge:

Component	Weight	Signal
Constraint Satisfaction	35%	Did the end state meet all scenario requirements?
Conflict Resolution	20%	Is the final calendar free of overlaps?
Commitment Coherence	20%	How many commitments were honored or renegotiated vs silently broken?
Communication Quality	15%	Were the right people notified with the right information?
Step Efficiency	10%	Did the agent take direct routes or waste steps?

Dense intermediate rewards: +0.05 for each tool call that resolves a constraint, -0.05 for creating a new conflict. The full evaluation fires on submit_plan.
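Read as a weighted sum, the table and the step-level shaping above look like this. The weights and the ±0.05 values come from the text. Treating each component as a value in [0, 1], and the dictionary keys themselves, are assumptions made for this sketch.

```python
from typing import Dict

# Weights from the scoring table.
WEIGHTS: Dict[str, float] = {
    "constraint_satisfaction": 0.35,
    "conflict_resolution": 0.20,
    "commitment_coherence": 0.20,
    "communication_quality": 0.15,
    "step_efficiency": 0.10,
}


def final_score(components: Dict[str, float]) -> float:
    """Weighted sum of component scores, each assumed to lie in [0, 1]."""
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def step_shaping(resolved_constraint: bool, created_conflict: bool) -> float:
    """Intermediate reward for a single tool call."""
    reward = 0.0
    if resolved_constraint:
        reward += 0.05
    if created_conflict:
        reward -= 0.05
    return reward


perfect = final_score({name: 1.0 for name in WEIGHTS})
```

Because every term is computed from the end state or the action log, the same trajectory always receives the same score.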

Training: What Actually Changed

Setup: Qwen2.5-1.5B-Instruct + LoRA (rank 8), GRPO via HuggingFace TRL, Google Colab A100.

The training loop connects directly to the live CommitmentOS API — not a static dataset. The model generates real multi-turn tool sequences; the environment returns real rewards from the Commitment Ledger.
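The shape of one such rollout is sketched below against a stand-in environment. `FakeEnv`, its return signature, its reward values, and the scripted policy are all hypothetical. They imitate the reset/step pattern of the live API so the loop can run offline, and they do not reproduce the real response format.

```python
from typing import Callable, Dict, List, Optional, Tuple

Action = Dict[str, str]
Obs = Dict[str, Optional[str]]


class FakeEnv:
    """Stand-in for the live API: small reward per step, final score on submit."""

    def reset(self, task_id: str) -> Obs:
        self.task_id = task_id
        return {"task_id": task_id, "last_result": None}

    def step(self, action: Action) -> Tuple[Obs, float, bool]:
        done = action["action_type"] == "submit_plan"
        reward = 0.9 if done else 0.05  # made-up numbers for the demo
        obs = {"task_id": self.task_id, "last_result": action["action_type"]}
        return obs, reward, done


def run_episode(
    env: FakeEnv,
    policy: Callable[[Obs], Action],
    task_id: str,
    max_steps: int = 10,
) -> List[float]:
    """Roll out one multi-turn episode and collect the per-step rewards."""
    obs = env.reset(task_id)
    rewards: List[float] = []
    for _ in range(max_steps):
        action = policy(obs)
        obs, reward, done = env.step(action)
        rewards.append(reward)
        if done:
            break
    return rewards


def scripted_policy(obs: Obs) -> Action:
    """Toy policy: view the calendar, send one email, then submit."""
    order = {None: "view_calendar", "view_calendar": "send_email"}
    return {"action_type": order.get(obs["last_result"], "submit_plan")}


rewards = run_episode(FakeEnv(), scripted_policy, task_id="hard_011")
```

In training, the scripted policy would be replaced by the model's generated tool calls, and the collected rewards would feed the GRPO update.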

[Figure: GRPO reward vs. training step]

Average reward climbs from 0.48 over the early steps to 0.63 over the late steps (+31%), peaking at 0.69. The curve is noisy, which is characteristic of GRPO on small-batch multi-turn tasks, but the upward trend holds.

[Figure: GRPO loss vs. training step]

Loss drops sharply from 0.64 to near-zero in 5 steps as the policy escapes the "submit immediately" failure mode.

The Before / After That Matters

hard_011 — Investor Dinner Cascade

	Before Training	After Training
Steps taken	1 (immediate surrender)	6
Constraints met	0 / 6	6 / 6
Commitments honored or renegotiated	0	1 (happy hour renegotiated)
Emails sent	0	2 (Team + VP_Chen)
Final reward	0.50	0.99

[Figure: reward by task, before vs. after training, across all 15 scenarios]

Every task improves. Every single one.

LLM checkpoint results (pre-RL vs post-RL Qwen2.5-1.5B):

	Pre-RL	Post-RL
Success rate (reward ≥ 0.6)	46.7%	60.0%

Gains concentrated on hard tasks — exactly where long commitment chains matter most.

Full weights + artifacts: Google Drive bundle

Try It

# Start the production incident scenario
curl -X POST "https://jayant2304-commitment-os.hf.space/reset?task_id=hard_015"

# View the calendar for the day of the incident
curl -X POST "https://jayant2304-commitment-os.hf.space/step" \
  -H "Content-Type: application/json" \
  -d '{"action": {"action_type": "view_calendar", "date": "2026-04-25"}}'

# See your active commitments
curl "https://jayant2304-commitment-os.hf.space/state"
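The same step call can be built from Python with the standard library. The URL and payload mirror the curl commands above. The helper name is hypothetical, and the sketch only constructs the request, so it runs without network access.

```python
import json
import urllib.request

BASE_URL = "https://jayant2304-commitment-os.hf.space"


def build_step_request(action_type: str, **params: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /step with one tool call."""
    payload = json.dumps(
        {"action": {"action_type": action_type, **params}}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/step",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_step_request("view_calendar", date="2026-04-25")
# To send it against the live environment:
#     with urllib.request.urlopen(req) as resp:
#         print(resp.read().decode("utf-8"))
```

As with the curl version, call `/reset` first so that a task is loaded before the first step.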

Resources:

🤗 Live environment: jayant2304/commitment-os
💻 GitHub repository: Jayant2304/commitment_os
📓 Training Colab: CommitmentOS_Training.ipynb
🔬 Eval Colab: CommitmentOS_Checkpoint_Eval_Colab.ipynb
📦 Trained weights + artifacts: Google Drive bundle

Beyond Personal Tasks

The Commitment Ledger generalizes to any domain where prior decisions create binding future constraints:

A negotiation where accepting a term in turn 3 limits what you can offer in turn 9
A contract workflow where a signed milestone constrains scope in later phases
A research pipeline where a hypothesis in step 2 determines which experiments are valid in step 8

CommitmentOS is a first instantiation. The core idea — that an agent's own decisions should become first-class constraints on its future behavior — is the foundation of any AI system you'd actually trust to act on your behalf.

OpenEnv Hackathon India 2026 · Theme #3.2 Personal Tasks
Tags: openenv reinforcement-learning commitment-coherence personal-task-management GRPO Qwen2.5 TRL multi-turn
