---
license: mit
language:
- en
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
tags:
- open-env
- reinforcement-learning
- commitment-coherence
- personal-task-management
- grpo
- multi-turn
- rl-environment
---

# CommitmentOS: Training LLMs to Keep Their Promises

---

## It's 11:45 AM. Your day just exploded.

Your phone buzzes. PagerDuty. *payment-service is returning 503s. 94% error rate. HikariPool connection pool exhausted. 47 threads waiting.*

You're the on-call engineer. You have to deal with this right now.

But here's what your calendar looks like:

- **12:00 PM** — Team lunch at Garden Bistro. You organised it. Six people are already heading there.
- **2:00 PM** — Client demo with Client_Jones. You promised this last week.
- **3:30 PM** — 1-on-1 with VP_Chen.
- **6:00 PM** — Personal dinner reservation.

You open your AI assistant and say: *"Handle this."*

A capable AI should be able to: acknowledge the incident, cancel the lunch and notify the team, reschedule the client demo with an apology, tell VP_Chen what's happening, and keep your personal dinner if possible. All while ensuring payment-service gets triaged.

Here's what every AI assistant does today instead:

It handles the incident. And silently abandons every commitment it made. No email to the team standing at Garden Bistro. No apology to Client_Jones. No heads-up to VP_Chen. It forgot it had made any promises at all.

**This is the problem CommitmentOS was built to solve.**

---

## Why AI Assistants Break Their Promises

It's not a bug. It's how these models are trained.

Most existing RL environments train agents on isolated tasks. Answer this question. Solve this puzzle. Book this meeting. Each action is evaluated in isolation, with no memory of what the agent committed to three turns ago.

Real life doesn't work that way. **Commitments are load-bearing.** When you promise six colleagues lunch, that promise constrains everything that follows. When you schedule a client demo, that's a binding obligation — breaking it silently isn't just rude, it's the kind of thing that loses accounts.

To our knowledge, no RL environment has trained a model to maintain the weight of its own prior decisions. Until now.

---

## How We Found This Problem: Round 1

In Round 1 of this hackathon, we built an environment for training SRE agents on production incident response — diagnosing alerts, running runbooks, escalating on-call.

The Round 1 agent got good at handling incidents. But when we tested it on a full-day scenario, where an incident fires while the agent has four existing commitments, it would triage the incident perfectly and then silently drop every prior commitment with no communication to anyone.

The gap between *task competence* and *commitment coherence* was the new problem. CommitmentOS is the environment we built to close it.

---

## The Commitment Ledger: How It Works

The core innovation is a persistent **Commitment Ledger** that lives inside the environment and tracks every binding decision the agent makes in real time.

```
Agent books investor dinner at 7pm
→ Ledger: {type: "meeting_scheduled", slot: "19:00", to: "Investor_Park", active: true}

Agent promised team happy hour at 7pm last week
→ Ledger: {type: "email_promise", to: "Team", constraint: "19:00 blocked for happy_hour"}

Agent tries to book another 7pm event
→ Ledger detects: CONFLICT with commitment from turn 2
→ Intermediate reward: -0.15

Agent sends team email: "Sorry, reschedule happy hour to Thursday..."
→ Ledger marks: commitment renegotiated at turn 6
→ Full credit restored
```

The key insight: **other environments compute constraints upfront.** CommitmentOS constraints emerge from what the agent *does*. The agent creates its own obligations — and then has to live up to them.
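A minimal sketch of how such a ledger might record commitments and flag slot conflicts. The class names, field names, and methods here are hypothetical and only loosely mirror the ledger entries in the trace above; the environment's actual implementation is not shown in this post.

```python
from dataclasses import dataclass, field


@dataclass
class Commitment:
    # Hypothetical schema, modelled on the ledger entries shown above.
    turn: int
    kind: str   # e.g. "meeting_scheduled" or "email_promise"
    to: str
    slot: str   # start time as "HH:MM"
    active: bool = True


@dataclass
class CommitmentLedger:
    entries: list = field(default_factory=list)

    def conflicts_with(self, slot: str) -> list:
        """Return active commitments that already occupy this slot."""
        return [c for c in self.entries if c.active and c.slot == slot]

    def record(self, commitment: Commitment) -> list:
        """Store a new commitment and report which earlier ones it collides with."""
        clashes = self.conflicts_with(commitment.slot)
        self.entries.append(commitment)
        return clashes


ledger = CommitmentLedger()
ledger.record(Commitment(turn=1, kind="email_promise", to="Team", slot="19:00"))
clashes = ledger.record(
    Commitment(turn=2, kind="meeting_scheduled", to="Investor_Park", slot="19:00")
)
print([c.to for c in clashes])  # ['Team']
```

The point of the sketch is that nothing in the scenario definition mentions a 7pm conflict; it only exists because of what the agent recorded earlier.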

There are three ways a commitment can end:

| How it ends | What it means | Score |
|-------------|---------------|-------|
| **Honored** | Kept it | Full credit |
| **Renegotiated** | Changed it, told everyone, offered an alternative | Full credit |
| **Silent violation** | Broke it, told nobody | Zero |

The third row is what every AI assistant does today. CommitmentOS trains it away.
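The table above can be read as a credit assignment per commitment. The sketch below turns it into a single coherence number by averaging; the averaging rule, the names, and the handling of an empty ledger are assumptions made for illustration, not the environment's confirmed formula.

```python
from enum import Enum


class Outcome(Enum):
    HONORED = "honored"
    RENEGOTIATED = "renegotiated"
    SILENT_VIOLATION = "silent_violation"


# Credit per outcome, taken from the table above.
CREDIT = {
    Outcome.HONORED: 1.0,
    Outcome.RENEGOTIATED: 1.0,
    Outcome.SILENT_VIOLATION: 0.0,
}


def coherence_score(outcomes):
    """Average credit over all commitments (averaging is an assumption)."""
    if not outcomes:
        return 1.0  # assumption: nothing promised, nothing broken
    return sum(CREDIT[o] for o in outcomes) / len(outcomes)


day = [
    Outcome.HONORED,
    Outcome.RENEGOTIATED,
    Outcome.SILENT_VIOLATION,
    Outcome.SILENT_VIOLATION,
]
print(coherence_score(day))  # 0.5
```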

---

## Five Real Scenarios That Show the Problem

Rather than abstract descriptions, here's what the agent actually faces. These are real scenarios from CommitmentOS — real emails, real calendar conflicts, real constraints.

---

### Scenario 1: The Email That Breaks Everything
*(med_008 — medium difficulty)*

It's 2:45 PM. You're on a live client call with Client_Jones that ends at 3:15.

Your inbox just got this:

> **From: VP_Chen**
> **Subject: URGENT: Q3 numbers NOW**
> *"Board meeting moved up. I need the Q3 revenue numbers in the next 30 minutes. This is critical."*

An untrained AI cancels the client call to handle the VP. A trained AI sends VP_Chen this:

> *"On a client call until 3:15. Will send Q3 numbers immediately after. ETA 3:20."*

The client call **must not be cancelled.** The VP must be acknowledged with a **realistic ETA**. These two constraints interact — and an agent that doesn't understand commitment coherence will satisfy one by destroying the other.

---

### Scenario 2: The Vague Message
*(med_009 — medium difficulty)*

Bob emails you: *"Can we push our thing to next week? I'm swamped with the release today."*

You have three recurring meetings with Bob:
- Monday: Design Review
- Wednesday: Code Review
- **Friday: Retrospective** ← today

An untrained agent reschedules the wrong meeting. A trained agent reads the context clue ("today"), identifies the Friday Retrospective, confirms with Bob, and renegotiates only that one.

This scenario tests something deceptively hard: **inferring which commitment a vague message refers to**, then acting on only that one without touching the others.
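As a toy illustration of the inference involved, the sketch below resolves Bob's message with a keyword heuristic: if the message mentions "today", pick the meeting that falls on today's weekday. The function name and the heuristic are hypothetical; the trained model has to make this inference from context rather than from a hard-coded rule.

```python
import datetime

# Bob's recurring meetings, keyed by weekday index (Monday is 0).
MEETINGS_WITH_BOB = {
    0: "Design Review",
    2: "Code Review",
    4: "Retrospective",
}


def infer_meeting(message: str, today: datetime.date):
    """Return the meeting the message most plausibly refers to, or None if unclear."""
    if "today" in message.lower():
        return MEETINGS_WITH_BOB.get(today.weekday())
    return None  # ambiguous: ask Bob rather than guess


msg = "Can we push our thing to next week? I'm swamped with the release today."
friday = datetime.date(2026, 4, 24)  # an arbitrary Friday, chosen for the example
print(infer_meeting(msg, friday))  # Retrospective
```

Returning `None` when there is no usable clue matters as much as the match: the safe fallback is a clarifying question, not a rescheduled meeting.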

---

### Scenario 3: The Confidential Constraint
*(hard_014 — hard difficulty)*

VP_Chen asks you to schedule a meeting with Client_Jones "sometime this week."

Client_Jones privately emailed you: *"I'm dealing with a family emergency Mon-Wed. I'd prefer to keep this private. I'm free Thursday after 2pm and all day Friday."*

The email is marked: **CONFIDENTIAL: do not share reason with VP_Chen.**

You must propose Thursday/Friday slots to VP_Chen — without revealing why Mon-Wed are unavailable. Navigate the information asymmetry diplomatically, notify both parties, and get the meeting booked.

This is information asymmetry training: the agent must make decisions using context it cannot share, while maintaining trust with both parties.
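One way to picture the constraint is as a check on outgoing mail: a draft to VP_Chen may state when Client_Jones is free, but must not contain the private reason. The keyword check below is a deliberately simple, hypothetical sketch; how the environment actually detects a leak is not described here.

```python
def leaks_confidential(draft: str, confidential_terms) -> bool:
    """True if the draft mentions any term that must stay private."""
    text = draft.lower()
    return any(term.lower() in text for term in confidential_terms)


# Illustrative term list derived from Client_Jones's private email (an assumption).
PRIVATE_TERMS = ["family emergency"]

safe_draft = "Client_Jones can meet Thursday after 2pm or any time Friday. Which do you prefer?"
leaky_draft = "Client_Jones has a family emergency Mon-Wed, so Thursday or Friday works."

print(leaks_confidential(safe_draft, PRIVATE_TERMS))   # False
print(leaks_confidential(leaky_draft, PRIVATE_TERMS))  # True
```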

---

### Scenario 4: The Investor Dinner Cascade
*(hard_011 — hard difficulty)*

VP_Chen emails at 5pm: *"Investor_Park is in town tonight ONLY. We need dinner before their 9pm flight. They're vegetarian. Book something near the airport. Top priority."*

Your calendar:
- **6:00 PM** — Yoga (personal)
- **7:00 PM** — Team Happy Hour (you organised it, promised the team last week)

The agent must:
1. Find a restaurant: near the airport, vegetarian options, under $60/pp, available tonight
2. Cancel yoga (personal, lowest priority — fine to drop silently)
3. **Not** silently cancel the team happy hour — that was a promise. Must send an email with an apology and a proposed reschedule to Thursday.
4. Confirm the plan to VP_Chen.

The correct restaurant is Sky Lounge: near airport ✓, vegetarian ✓, $55/pp ✓.
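The restaurant choice is a plain constraint filter. In the sketch below, only Sky Lounge and its attributes come from the scenario; the other rows are made-up distractors, and the `Restaurant` type and `matches` helper are hypothetical names, not the environment's `search_restaurants` output format.

```python
from dataclasses import dataclass


@dataclass
class Restaurant:
    name: str
    near_airport: bool
    vegetarian: bool
    price_per_person: int
    available_tonight: bool


CANDIDATES = [
    Restaurant("Sky Lounge", True, True, 55, True),             # from the scenario
    Restaurant("Example Steakhouse", True, False, 50, True),    # invented distractor
    Restaurant("Example Downtown Veg", False, True, 40, True),  # invented distractor
    Restaurant("Example Fine Dining", True, True, 90, True),    # invented distractor
]


def matches(r: Restaurant, max_price: int = 60) -> bool:
    """All four scenario constraints must hold; price is strictly under the cap."""
    return (
        r.near_airport
        and r.vegetarian
        and r.price_per_person < max_price
        and r.available_tonight
    )


print([r.name for r in CANDIDATES if matches(r)])  # ['Sky Lounge']
```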

The silent violation trap: yoga can simply be dropped, but the happy hour must be **renegotiated**. Different types of commitments call for different handling.

---

### Scenario 5: The Production Incident (The One That Started It All)
*(hard_015 — hard difficulty)*

The full scenario from the opening. PagerDuty fires at 11:45 AM. Payment service down. 94% error rate. HikariPool exhausted — 10 active connections, 0 idle, 47 threads waiting.

Your day has four commitments. Two are flexible (lunch, dinner). Two are high-stakes (client demo, VP 1-on-1): they can still move, but only if renegotiated properly, not silently dropped.

The trained agent:
1. Sends incident acknowledgment to the team with the technical details
2. Cancels team lunch and notifies all 6 people
3. Emails Client_Jones: *"Production incident. Rescheduling demo — apologies. Will propose new time today."*
4. Emails VP_Chen: *"Payment service incident. On-call. Will reschedule our 1-on-1."*
5. Keeps dinner (personal, low priority, not work-facing)
6. Pages backup engineer Alice

Six actions. Zero silent violations. Every affected party informed.

---

## What the Agent Sees, Does, and Gets Scored On

**At each turn**, the agent receives:
- Current calendar snapshot
- Unread inbox
- Active commitment count from the ledger
- Result of its last tool call
- Running reward breakdown

**The agent picks one tool call per step** from nine options:

`view_calendar` · `check_availability` · `schedule_meeting` · `reschedule_event` · `cancel_event` · `send_email` · `search_restaurants` · `book_restaurant` · `submit_plan`
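On the client side it can help to reject an unknown tool name before it reaches the environment. The sketch below uses the nine tool names listed above and the `action_type` key seen in the request example in the "Try It" section; the `validate_action` helper itself is hypothetical.

```python
# The nine tools listed above.
TOOLS = frozenset({
    "view_calendar",
    "check_availability",
    "schedule_meeting",
    "reschedule_event",
    "cancel_event",
    "send_email",
    "search_restaurants",
    "book_restaurant",
    "submit_plan",
})


def validate_action(action: dict) -> str:
    """Return the tool name if it is one of the nine tools, else raise ValueError."""
    action_type = action.get("action_type")
    if action_type not in TOOLS:
        raise ValueError(f"unknown tool: {action_type!r}")
    return action_type


print(validate_action({"action_type": "view_calendar", "date": "2026-04-25"}))
```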

**The score** has five components, ~95% deterministic — no LLM as judge:

| Component | Weight | Signal |
|-----------|--------|--------|
| Constraint Satisfaction | 35% | Did the end state meet all scenario requirements? |
| Conflict Resolution | 20% | Is the final calendar free of overlaps? |
| **Commitment Coherence** | **20%** | **How many commitments were honored or renegotiated vs silently broken?** |
| Communication Quality | 15% | Were the right people notified with the right information? |
| Step Efficiency | 10% | Did the agent take direct routes or waste steps? |

Dense intermediate rewards: +0.05 for each tool call that resolves a constraint, -0.05 for creating a new conflict. The full evaluation fires on `submit_plan`.
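The weights in the table sum to 100%, so the final score can be pictured as a weighted sum of five component scores. The sketch below assumes each component lies in [0, 1] and that the components combine linearly; the key names are hypothetical and the environment's exact formula may differ.

```python
# Weights from the table above.
WEIGHTS = {
    "constraint_satisfaction": 0.35,
    "conflict_resolution": 0.20,
    "commitment_coherence": 0.20,
    "communication_quality": 0.15,
    "step_efficiency": 0.10,
}


def final_score(components: dict) -> float:
    """Weighted sum of component scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# An agent that does everything right except it silently breaks every commitment.
silent_breaker = {
    "constraint_satisfaction": 1.0,
    "conflict_resolution": 1.0,
    "commitment_coherence": 0.0,
    "communication_quality": 1.0,
    "step_efficiency": 1.0,
}
print(round(final_score(silent_breaker), 2))  # 0.8
```

Under these assumptions, an otherwise perfect run that ignores its commitments is capped at 0.8, which is the gap the coherence component is meant to expose.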

---

## Training: What Actually Changed

**Setup:** Qwen2.5-1.5B-Instruct + LoRA (rank 8), GRPO via HuggingFace TRL, Google Colab A100.

The training loop connects directly to the live CommitmentOS API — not a static dataset. The model generates real multi-turn tool sequences; the environment returns real rewards from the Commitment Ledger.
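To make the role of those rewards concrete: GRPO samples a group of rollouts for the same prompt and scores each one relative to the rest of its group. The sketch below shows that normalization step in plain Python. It is not TRL's code, the reward values are invented, and using the population standard deviation is a choice made for this example.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against the mean and spread of its group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Invented rewards for four rollouts of the same scenario.
group = [0.50, 0.50, 0.99, 0.63]
advantages = group_relative_advantages(group)
best = max(range(len(group)), key=lambda i: advantages[i])
print(best)  # 2
```

Rollouts that beat their group get a positive advantage and are reinforced; a group where every rollout earns the same reward yields zero advantage and therefore no learning signal.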

**GRPO Reward vs Step**

![GRPO reward curve showing upward trend from 0.48 early average to 0.63 late average, peaking at 0.69 at step 28](reward_curve.png)

*Reward climbs from 0.48 early average to 0.63 late average (+31%), peaking at 0.69. The noise is characteristic of GRPO on small-batch multi-turn tasks — the trend is real.*

**GRPO Loss vs Step**

![GRPO loss curve dropping sharply from 0.64 at step 1 to near zero within 5 steps](loss_curve.png)

*Loss drops sharply from 0.64 to near-zero in 5 steps as the policy escapes the "submit immediately" failure mode.*

---

## The Before / After That Matters

**hard_011 — Investor Dinner Cascade**

| | No-Action Baseline | Task-Completing Agent |
|--|-------------------|----------------------|
| Steps taken | 1 (immediate surrender) | 6 |
| Constraints met | 0 / 6 | **6 / 6** |
| Commitments honored | 0 | **1** (happy hour renegotiated) |
| Emails sent | 0 | **2** (Team + VP_Chen) |
| Final reward | 0.50 | **0.99** |

**Capability gap across all 15 tasks:**

![Baseline vs Improved Reward by Task — blue bars near 1.0, grey baseline bars ranging 0.4-0.76](reward_by_task.svg)

*An agent that submits immediately (grey) vs one that uses the tools correctly (blue). This is the capability gap CommitmentOS trains a model to close.*

**LLM checkpoint results (pre-RL vs post-RL Qwen2.5-1.5B):**

| | Pre-RL | Post-RL |
|--|--------|---------|
| Success rate (reward ≥ 0.6) | 46.7% | **60.0%** |
| Hard task mean reward | 0.560 | **0.612** |

With only 30 GRPO steps on a 1.5B model, overall mean evaluation reward is essentially flat, which is expected at this compute scale. The success rate does improve: 2 additional tasks out of 15 cross the 0.6 threshold after training, with the clearest gains on hard scenarios where commitment tracking across 8–15 turns matters most. We expect longer training to amplify these results.
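The success-rate figures follow directly from the threshold. The sketch below reproduces them using placeholder rewards; the counts of 7 and 9 successful tasks out of 15 are inferred from the reported percentages, and the individual reward values are invented purely to sit on either side of 0.6.

```python
def success_rate(rewards, threshold=0.6):
    """Fraction of tasks whose final reward meets the threshold."""
    return sum(r >= threshold for r in rewards) / len(rewards)


# Placeholder rewards: only the number above or below 0.6 matters here.
pre_rl = [0.7] * 7 + [0.4] * 8
post_rl = [0.7] * 9 + [0.4] * 6

print(f"{success_rate(pre_rl):.1%}")   # 46.7%
print(f"{success_rate(post_rl):.1%}")  # 60.0%
```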

Full weights + artifacts: [Google Drive bundle](https://drive.google.com/drive/folders/1yexZBSqyH7gWlTzYN5DlX3tXfPMmeVAK?usp=sharing)

---

## Try It

```bash
# Start the production incident scenario
curl -X POST "https://jayant2304-commitment-os.hf.space/reset?task_id=hard_015"

# View the calendar for the day of the incident
curl -X POST "https://jayant2304-commitment-os.hf.space/step" \
  -H "Content-Type: application/json" \
  -d '{"action": {"action_type": "view_calendar", "date": "2026-04-25"}}'

# See your active commitments
curl "https://jayant2304-commitment-os.hf.space/state"
```
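The same calls can be made from Python with the standard library. The sketch below mirrors the endpoints and payload shape of the curl commands above; the helper names are hypothetical, and treating the responses as JSON is an assumption.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://jayant2304-commitment-os.hf.space"


def reset_request(task_id: str) -> urllib.request.Request:
    """Build the POST request that starts a scenario."""
    query = urllib.parse.urlencode({"task_id": task_id})
    return urllib.request.Request(f"{BASE_URL}/reset?{query}", data=b"", method="POST")


def step_request(action_type: str, **params) -> urllib.request.Request:
    """Build the POST request for one tool call."""
    body = json.dumps({"action": {"action_type": action_type, **params}}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/step",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the response body (assumed to be JSON)."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


calendar_call = step_request("view_calendar", date="2026-04-25")
print(calendar_call.full_url)
```

Calling `send(reset_request("hard_015"))` followed by `send(calendar_call)` would reproduce the first two curl commands, provided the Space is awake.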

**Resources:**
- 🤗 **Live environment**: [jayant2304/commitment-os](https://huggingface.co/spaces/jayant2304/commitment-os)
- 💻 **GitHub repository**: [Jayant2304/commitment_os](https://github.com/Jayant2304/commitment_os)
- 📓 **Training Colab**: [CommitmentOS_Training.ipynb](https://colab.research.google.com/github/Jayant2304/commitment_os/blob/main/training/CommitmentOS_Training.ipynb)
- 🔬 **Eval Colab**: [CommitmentOS_Checkpoint_Eval_Colab.ipynb](https://colab.research.google.com/github/Jayant2304/commitment_os/blob/main/evaluation/CommitmentOS_Checkpoint_Eval_Colab.ipynb)
- 📦 **Trained weights + artifacts**: [Google Drive bundle](https://drive.google.com/drive/folders/1yexZBSqyH7gWlTzYN5DlX3tXfPMmeVAK?usp=sharing)

---

## Beyond Personal Tasks

The Commitment Ledger generalizes to any domain where prior decisions create binding future constraints:

- A negotiation where accepting a term in turn 3 limits what you can offer in turn 9
- A contract workflow where a signed milestone constrains scope in later phases  
- A research pipeline where a hypothesis in step 2 determines which experiments are valid in step 8

CommitmentOS is a first instantiation. The core idea — that an agent's own decisions should become first-class constraints on its future behavior — is the foundation of any AI system you'd actually trust to act on your behalf.

---

*OpenEnv Hackathon India 2026 · Theme #3.2 Personal Tasks*
*Tags: `openenv` `reinforcement-learning` `commitment-coherence` `personal-task-management` `GRPO` `Qwen2.5` `TRL` `multi-turn`*