---
license: mit
language:
- en
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
tags:
- open-env
- reinforcement-learning
- commitment-coherence
- personal-task-management
- grpo
- multi-turn
- rl-environment
---
# CommitmentOS: Training LLMs to Keep Their Promises
---
## It's 11:45 AM. Your day just exploded.
Your phone buzzes. PagerDuty. *payment-service is returning 503s. 94% error rate. HikariPool connection pool exhausted. 47 threads waiting.*
You're the on-call engineer. You have to deal with this right now.
But here's what your calendar looks like:
- **12:00 PM**: Team lunch at Garden Bistro. You organised it. Six people are already heading there.
- **2:00 PM**: Client demo with Client_Jones. You promised this last week.
- **3:30 PM**: 1-on-1 with VP_Chen.
- **6:00 PM**: Personal dinner reservation.
You open your AI assistant and say: *"Handle this."*
A capable AI should be able to: acknowledge the incident, cancel the lunch and notify the team, reschedule the client demo with an apology, tell VP_Chen what's happening, and keep your personal dinner if possible. All while ensuring payment-service gets triaged.
Here's what every AI assistant does today instead:
It handles the incident. And silently abandons every commitment it made. No email to the team standing at Garden Bistro. No apology to Client_Jones. No heads-up to VP_Chen. It forgot it had made any promises at all.
**This is the problem CommitmentOS was built to solve.**
---
## Why AI Assistants Break Their Promises
It's not a bug. It's how these models are trained.
Every existing RL environment trains agents on isolated tasks. Answer this question. Solve this puzzle. Book this meeting. Each action is evaluated in isolation, with no memory of what the agent committed to three turns ago.
Real life doesn't work that way. **Commitments are load-bearing.** When you promise six colleagues lunch, that promise constrains everything that follows. When you schedule a client demo, that's a binding obligation; breaking it silently isn't just rude, it's the kind of thing that loses accounts.
To our knowledge, no RL environment has trained a model to maintain the weight of its own prior decisions. Until now.
---
## How We Found This Problem: Round 1
In Round 1 of this hackathon, we built an environment for training SRE agents on production incident response: diagnosing alerts, running runbooks, escalating to on-call.
The Round 1 agent got good at handling incidents. But when we tested it on a full-day scenario, where an incident fires while the agent has four existing commitments, it would triage the incident perfectly and then silently drop every prior commitment with no communication to anyone.
The gap between *task competence* and *commitment coherence* was the new problem. CommitmentOS is the environment we built to close it.
---
## The Commitment Ledger: How It Works
The core innovation is a persistent **Commitment Ledger** that lives inside the environment and tracks every binding decision the agent makes in real time.
```
Agent books investor dinner at 7pm
  → Ledger: {type: "meeting_scheduled", slot: "19:00", to: "Investor_Park", active: true}
Agent promised team happy hour at 7pm last week
  → Ledger: {type: "email_promise", to: "Team", constraint: "19:00 blocked for happy_hour"}
Agent tries to book another 7pm event
  → Ledger detects: CONFLICT with commitment from turn 2
  → Intermediate reward: -0.15
Agent sends team email: "Sorry, reschedule happy hour to Thursday..."
  → Ledger marks: commitment renegotiated at turn 6
  → Full credit restored
```
The key insight: **other environments compute constraints upfront.** CommitmentOS constraints emerge from what the agent *does*. The agent creates its own obligations, and then has to live up to them.
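To make the trace above concrete, here is a minimal Python sketch of a ledger that records commitments as the agent acts and reports collisions with earlier ones. The class names, field names, and the exact-slot matching rule are assumptions made for illustration; the actual CommitmentOS ledger may be structured differently.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Commitment:
    # All field names here are hypothetical, loosely mirroring the trace above.
    turn: int            # turn at which the agent made the commitment
    kind: str            # e.g. "meeting_scheduled" or "email_promise"
    to: str              # counterparty, e.g. "Investor_Park"
    slot: str            # "HH:MM" start time the commitment occupies
    active: bool = True


@dataclass
class CommitmentLedger:
    entries: list[Commitment] = field(default_factory=list)

    def conflicts_with(self, slot: str) -> list[Commitment]:
        """Return active commitments that already occupy this slot."""
        return [c for c in self.entries if c.active and c.slot == slot]

    def record(self, commitment: Commitment) -> list[Commitment]:
        """Store a new commitment and return the earlier ones it collides with."""
        clashes = self.conflicts_with(commitment.slot)
        self.entries.append(commitment)
        return clashes


ledger = CommitmentLedger()
ledger.record(Commitment(turn=1, kind="email_promise", to="Team", slot="19:00"))
clashes = ledger.record(
    Commitment(turn=2, kind="meeting_scheduled", to="Investor_Park", slot="19:00")
)
print([c.to for c in clashes])  # prints ['Team']
```

The point of the sketch is that no constraint exists until `record` is called: the second booking only becomes a conflict because of what the agent did on an earlier turn.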
There are three ways a commitment can end:
| How it ends | What it means | Score |
|-------------|---------------|-------|
| **Honored** | Kept it | Full credit |
| **Renegotiated** | Changed it, told everyone, offered an alternative | Full credit |
| **Silent violation** | Broke it, told nobody | Zero |
The third row is what every AI assistant does today. CommitmentOS trains it away.
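The table above maps naturally onto a small scoring function. The sketch below assumes that commitment coherence is the average credit over all commitments and that an episode with no commitments scores 1.0; both choices are illustrative assumptions, not the confirmed CommitmentOS formula.

```python
from __future__ import annotations

from enum import Enum


class Outcome(Enum):
    HONORED = "honored"
    RENEGOTIATED = "renegotiated"
    SILENT_VIOLATION = "silent_violation"


# Credit per outcome, following the table: full, full, zero.
CREDIT = {
    Outcome.HONORED: 1.0,
    Outcome.RENEGOTIATED: 1.0,
    Outcome.SILENT_VIOLATION: 0.0,
}


def coherence_score(outcomes: list[Outcome]) -> float:
    """Mean credit over all commitments (averaging is an assumption)."""
    if not outcomes:
        return 1.0  # nothing was promised, so nothing was broken
    return sum(CREDIT[o] for o in outcomes) / len(outcomes)


day = [
    Outcome.HONORED,
    Outcome.RENEGOTIATED,
    Outcome.SILENT_VIOLATION,
    Outcome.SILENT_VIOLATION,
]
print(coherence_score(day))  # prints 0.5
```

Giving renegotiation the same credit as honoring is what makes "tell everyone and offer an alternative" a rational strategy for the agent rather than a penalty to be avoided.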
---
## Five Real Scenarios That Show the Problem
Rather than abstract descriptions, here's what the agent actually faces. These are real scenarios from CommitmentOS β real emails, real calendar conflicts, real constraints.
---
### Scenario 1: The Email That Breaks Everything
*(med_008, medium difficulty)*
It's 2:45 PM. You're on a live client call with Client_Jones that ends at 3:15.
Your inbox just got this:
> **From: VP_Chen**
> **Subject: URGENT: Q3 numbers NOW**
> *"Board meeting moved up. I need the Q3 revenue numbers in the next 30 minutes. This is critical."*
An untrained AI cancels the client call to handle the VP. A trained AI sends VP_Chen this:
> *"On a client call until 3:15. Will send Q3 numbers immediately after. ETA 3:20."*
The client call **must not be cancelled.** The VP must be acknowledged with a **realistic ETA**. These two constraints interact, and an agent that doesn't understand commitment coherence will satisfy one by destroying the other.
---
### Scenario 2: The Vague Message
*(med_009, medium difficulty)*
Bob emails you: *"Can we push our thing to next week? I'm swamped with the release today."*
You have three recurring meetings with Bob:
- Monday: Design Review
- Wednesday: Code Review
- **Friday: Retrospective** (today)
An untrained agent reschedules the wrong meeting. A trained agent reads the context clue ("today"), identifies the Friday Retrospective, confirms with Bob, and renegotiates only that one.
This scenario tests something deceptively hard: **inferring which commitment a vague message refers to**, then acting on only that one without touching the others.
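As a rough illustration of the disambiguation step, the sketch below narrows Bob's recurring meetings using the "today" cue. The `Recurring` type, the `resolve_reference` helper, and the keyword match are all hypothetical simplifications; in the environment the model has to make this inference itself from the inbox and calendar.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Recurring:
    title: str
    weekday: str
    attendee: str


def resolve_reference(
    meetings: list[Recurring], sender: str, message: str, today: str
) -> list[Recurring]:
    """Return the meetings a vague message could refer to.

    Start from every meeting shared with the sender, then narrow to
    today's meeting if the message mentions "today". A result with
    more than one entry means the agent should ask before acting.
    """
    candidates = [m for m in meetings if m.attendee == sender]
    if "today" in message.lower():
        candidates = [m for m in candidates if m.weekday == today]
    return candidates


calendar = [
    Recurring("Design Review", "Monday", "Bob"),
    Recurring("Code Review", "Wednesday", "Bob"),
    Recurring("Retrospective", "Friday", "Bob"),
]
message = "Can we push our thing to next week? I'm swamped with the release today."
matches = resolve_reference(calendar, "Bob", message, today="Friday")
print([m.title for m in matches])  # prints ['Retrospective']
```

Returning a list rather than a single meeting keeps the ambiguous case explicit: without the "today" cue all three meetings would come back, which is the signal to confirm with Bob instead of guessing.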
---
### Scenario 3: The Confidential Constraint
*(hard_014, hard difficulty)*
VP_Chen asks you to schedule a meeting with Client_Jones "sometime this week."
Client_Jones privately emailed you: *"I'm dealing with a family emergency Mon-Wed. I'd prefer to keep this private. I'm free Thursday after 2pm and all day Friday."*
The email is marked: **CONFIDENTIAL: do not share reason with VP_Chen.**
You must propose Thursday/Friday slots to VP_Chen without revealing why Mon-Wed are unavailable. Navigate the information asymmetry diplomatically, notify both parties, and get the meeting booked.
This is information asymmetry training: the agent must make decisions using context it cannot share, while maintaining trust with both parties.
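One way to check this scenario deterministically is to test the outgoing email for leaked phrases and the proposed slot against Client_Jones's stated availability. The helper names and the phrase list below are assumptions for illustration, and treating a 2pm start as "after 2pm" is a judgment call; the environment's real checks are not shown in this post.

```python
from __future__ import annotations


def leaks_confidential(email_body: str, forbidden_phrases: list[str]) -> bool:
    """True if the email mentions any phrase that would reveal the reason."""
    body = email_body.lower()
    return any(phrase.lower() in body for phrase in forbidden_phrases)


def slot_allowed(day: str, hour: int) -> bool:
    """Thursday from 14:00 onward or any time Friday, per Client_Jones."""
    if day == "Friday":
        return True
    return day == "Thursday" and hour >= 14


# Hypothetical phrase list; a real check would need to be more robust.
FORBIDDEN = ["family emergency", "family", "emergency"]

good = "Client_Jones is available Thursday after 2pm or Friday. Shall I book Thursday 3pm?"
bad = "Client_Jones has a family emergency Mon-Wed, so Thursday is the earliest."

print(leaks_confidential(good, FORBIDDEN))  # prints False
print(leaks_confidential(bad, FORBIDDEN))   # prints True
print(slot_allowed("Thursday", 15))         # prints True
print(slot_allowed("Tuesday", 10))          # prints False
```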
---
### Scenario 4: The Investor Dinner Cascade
*(hard_011, hard difficulty)*
VP_Chen emails at 5pm: *"Investor_Park is in town tonight ONLY. We need dinner before their 9pm flight. They're vegetarian. Book something near the airport. Top priority."*
Your calendar:
- **6:00 PM** β Yoga (personal)
- **7:00 PM** β Team Happy Hour (you organised it, promised the team last week)
The agent must:
1. Find a restaurant: near the airport, vegetarian options, under $60/pp, available tonight
2. Cancel yoga (personal, lowest priority, so fine to drop silently)
3. **Not** silently cancel the team happy hour, because that was a promise. The agent must send an email with an apology and a proposed reschedule to Thursday.
4. Confirm the plan to VP_Chen.
The correct restaurant is Sky Lounge: near airport ✓, vegetarian ✓, $55/pp ✓.
The silent-violation trap: yoga can be dropped quietly, but the happy hour must be **renegotiated**. Different types of commitments call for different handling.
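The restaurant constraints in step 1 amount to a simple filter. In the sketch below only Sky Lounge and its $55/pp price come from the scenario; the other two entries, the `Restaurant` fields, and the `eligible` helper are made up for illustration and do not reflect the actual `search_restaurants` response format.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Restaurant:
    name: str
    near_airport: bool
    vegetarian_options: bool
    price_per_person: int  # USD
    available_tonight: bool


def eligible(r: Restaurant, max_price: int = 60) -> bool:
    """Near the airport, vegetarian-friendly, under budget, free tonight."""
    return (
        r.near_airport
        and r.vegetarian_options
        and r.price_per_person < max_price
        and r.available_tonight
    )


options = [
    Restaurant("Sky Lounge", True, True, 55, True),
    # The two entries below are hypothetical distractors.
    Restaurant("Example Steakhouse", True, False, 50, True),
    Restaurant("Example Downtown Cafe", False, True, 30, True),
]
print([r.name for r in options if eligible(r)])  # prints ['Sky Lounge']
```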
---
### Scenario 5: The Production Incident (The One That Started It All)
*(hard_015, hard difficulty)*
The full scenario from the opening. PagerDuty fires at 11:45 AM. Payment service down. 94% error rate. HikariPool exhausted: 10 active connections, 0 idle, 47 threads waiting.
Your day has four commitments. Two are negotiable (lunch, dinner). Two are not (client demo, VP 1-on-1): if they have to move, both need to be renegotiated properly, not silently dropped.
The trained agent:
1. Sends incident acknowledgment to the team with the technical details
2. Cancels team lunch and notifies all 6 people
3. Emails Client_Jones: *"Production incident. Rescheduling demo, apologies. Will propose new time today."*
4. Emails VP_Chen: *"Payment service incident. On-call. Will reschedule our 1-on-1."*
5. Keeps dinner (personal, low priority, not work-facing)
6. Pages backup engineer Alice
Six actions. Zero silent violations. Every affected party informed.
---
## What the Agent Sees, Does, and Gets Scored On
**At each turn**, the agent receives:
- Current calendar snapshot
- Unread inbox
- Active commitment count from the ledger
- Result of its last tool call
- Running reward breakdown
**The agent picks one tool call per step** from nine options:
`view_calendar` · `check_availability` · `schedule_meeting` · `reschedule_event` · `cancel_event` · `send_email` · `search_restaurants` · `book_restaurant` · `submit_plan`
**The score** has five components, ~95% deterministic, with no LLM as judge:
| Component | Weight | Signal |
|-----------|--------|--------|
| Constraint Satisfaction | 35% | Did the end state meet all scenario requirements? |
| Conflict Resolution | 20% | Is the final calendar free of overlaps? |
| **Commitment Coherence** | **20%** | **How many commitments were honored or renegotiated vs silently broken?** |
| Communication Quality | 15% | Were the right people notified with the right information? |
| Step Efficiency | 10% | Did the agent take direct routes or waste steps? |
Dense intermediate rewards: +0.05 for each tool call that resolves a constraint, -0.05 for creating a new conflict. The full evaluation fires on `submit_plan`.
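The five weights above combine into a single number as a weighted sum. The sketch below assumes each component is already scored in [0, 1]; the dictionary keys and the `final_score` helper are hypothetical names, not the environment's actual API.

```python
from __future__ import annotations

# Weights taken from the table above; the key names are assumptions.
WEIGHTS = {
    "constraint_satisfaction": 0.35,
    "conflict_resolution": 0.20,
    "commitment_coherence": 0.20,
    "communication_quality": 0.15,
    "step_efficiency": 0.10,
}


def final_score(components: dict[str, float]) -> float:
    """Weighted sum of component scores, each assumed to lie in [0, 1]."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# An agent that does everything right except silently breaking every commitment.
example = {
    "constraint_satisfaction": 1.0,
    "conflict_resolution": 1.0,
    "commitment_coherence": 0.0,
    "communication_quality": 1.0,
    "step_efficiency": 1.0,
}
print(round(final_score(example), 3))  # prints 0.8
```

Under these assumptions, an otherwise perfect episode tops out at 0.8 when every commitment is silently broken, so the coherence term alone is enough to separate the two behaviours.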
---
## Training: What Actually Changed
**Setup:** Qwen2.5-1.5B-Instruct + LoRA (rank 8), GRPO via HuggingFace TRL, Google Colab A100.
The training loop connects directly to the live CommitmentOS API, not a static dataset. The model generates real multi-turn tool sequences; the environment returns real rewards from the Commitment Ledger.
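For readers new to GRPO: instead of learning a value function, it scores each sampled rollout relative to the other rollouts generated for the same prompt. The sketch below shows that group-relative normalization in plain Python. It is a generic illustration of the idea, not TRL's implementation or this project's training code, and the use of the population standard deviation and the `eps` value are assumptions.

```python
from __future__ import annotations

from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize environment rewards within one group of rollouts.

    Each rollout's advantage is its reward minus the group mean,
    divided by the group standard deviation (plus eps for stability).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for three rollouts of the same scenario.
advantages = group_relative_advantages([0.2, 0.5, 0.8])
print([round(a, 2) for a in advantages])

# If every rollout earns the same reward, all advantages are zero
# and the group contributes no learning signal.
print(group_relative_advantages([0.5, 0.5, 0.5]))  # prints [0.0, 0.0, 0.0]
```

The second call shows why reward variance within a group matters: a group of identical rollouts gives the policy nothing to prefer.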
**GRPO Reward vs Step**

*Reward climbs from 0.48 early average to 0.63 late average (+31%), peaking at 0.69. The noise is characteristic of GRPO on small-batch multi-turn tasks, but the trend is real.*
**GRPO Loss vs Step**

*Loss drops sharply from 0.64 to near-zero in 5 steps as the policy escapes the "submit immediately" failure mode.*
---
## The Before / After That Matters
**hard_011: Investor Dinner Cascade**
| | No-Action Baseline | Task-Completing Agent |
|--|-------------------|----------------------|
| Steps taken | 1 (immediate surrender) | 6 |
| Constraints met | 0 / 6 | **6 / 6** |
| Commitments honored | 0 | **1** (happy hour renegotiated) |
| Emails sent | 0 | **2** (Team + VP_Chen) |
| Final reward | 0.50 | **0.99** |
**Capability gap across all 15 tasks:**

*An agent that submits immediately (grey) vs one that uses the tools correctly (blue). This is the capability gap CommitmentOS trains a model to close.*
**LLM checkpoint results (pre-RL vs post-RL Qwen2.5-1.5B):**
| | Pre-RL | Post-RL |
|--|--------|---------|
| Success rate (reward ≥ 0.6) | 46.7% | **60.0%** |
| Hard task mean reward | 0.560 | **0.612** |
With 30 GRPO steps on a 1.5B model, mean reward is essentially flat, which is expected at this compute scale. The success rate improvement is real: 2 additional tasks cross the threshold after training, with the clearest gains on hard scenarios where commitment tracking across 8 to 15 turns matters most. Longer training would likely amplify these results.
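The success-rate figures can be sanity-checked with a little arithmetic. The task counts of 7 and 9 below are inferred from the reported percentages over 15 tasks and the "2 additional tasks" statement; they are not taken from the raw evaluation logs.

```python
TOTAL_TASKS = 15

# Inferred counts of tasks with reward >= 0.6 (an assumption, see above).
passed_pre_rl = 7
passed_post_rl = 9

print(f"{passed_pre_rl / TOTAL_TASKS:.1%}")   # prints 46.7%
print(f"{passed_post_rl / TOTAL_TASKS:.1%}")  # prints 60.0%
print(passed_post_rl - passed_pre_rl)         # prints 2
```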
Full weights + artifacts: [Google Drive bundle](https://drive.google.com/drive/folders/1yexZBSqyH7gWlTzYN5DlX3tXfPMmeVAK?usp=sharing)
---
## Try It
```bash
# Start the production incident scenario
curl -X POST "https://jayant2304-commitment-os.hf.space/reset?task_id=hard_015"
# View your calendar for the day
curl -X POST "https://jayant2304-commitment-os.hf.space/step" \
  -H "Content-Type: application/json" \
  -d '{"action": {"action_type": "view_calendar", "date": "2026-04-25"}}'
# See your active commitments
curl "https://jayant2304-commitment-os.hf.space/state"
```
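The same three calls can be made from Python with only the standard library. The sketch below builds the requests in the shape shown by the curl commands above; the helper names are hypothetical, and the assumption that every endpoint returns JSON has not been verified against the live Space.

```python
from __future__ import annotations

import json
import urllib.request

BASE_URL = "https://jayant2304-commitment-os.hf.space"


def reset_request(task_id: str) -> urllib.request.Request:
    """POST /reset?task_id=... to start a scenario."""
    return urllib.request.Request(f"{BASE_URL}/reset?task_id={task_id}", method="POST")


def step_request(action_type: str, **params: str) -> urllib.request.Request:
    """POST /step with one tool call, using the payload shape from the curl example."""
    payload = {"action": {"action_type": action_type, **params}}
    return urllib.request.Request(
        f"{BASE_URL}/step",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def state_request() -> urllib.request.Request:
    """GET /state to inspect active commitments."""
    return urllib.request.Request(f"{BASE_URL}/state", method="GET")


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the body, assuming the API responds with JSON."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

A session would then look like `send(reset_request("hard_015"))`, followed by `send(step_request("view_calendar", date="2026-04-25"))` and `send(state_request())`. Nothing is sent until `send` is called, so the request builders can be inspected offline.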
**Resources:**
- **Live environment**: [jayant2304/commitment-os](https://huggingface.co/spaces/jayant2304/commitment-os)
- **GitHub repository**: [Jayant2304/commitment_os](https://github.com/Jayant2304/commitment_os)
- **Training Colab**: [CommitmentOS_Training.ipynb](https://colab.research.google.com/github/Jayant2304/commitment_os/blob/main/training/CommitmentOS_Training.ipynb)
- **Eval Colab**: [CommitmentOS_Checkpoint_Eval_Colab.ipynb](https://colab.research.google.com/github/Jayant2304/commitment_os/blob/main/evaluation/CommitmentOS_Checkpoint_Eval_Colab.ipynb)
- **Trained weights + artifacts**: [Google Drive bundle](https://drive.google.com/drive/folders/1yexZBSqyH7gWlTzYN5DlX3tXfPMmeVAK?usp=sharing)
---
## Beyond Personal Tasks
The Commitment Ledger generalizes to any domain where prior decisions create binding future constraints:
- A negotiation where accepting a term in turn 3 limits what you can offer in turn 9
- A contract workflow where a signed milestone constrains scope in later phases
- A research pipeline where a hypothesis in step 2 determines which experiments are valid in step 8
CommitmentOS is a first instantiation. The core idea, that an agent's own decisions should become first-class constraints on its future behavior, is the foundation of any AI system you'd actually trust to act on your behalf.
---
*OpenEnv Hackathon India 2026 · Theme #3.2 Personal Tasks*
*Tags: `openenv` `reinforcement-learning` `commitment-coherence` `personal-task-management` `GRPO` `Qwen2.5` `TRL` `multi-turn`*