Update README.md
README.md
CHANGED
|
@@ -13,6 +13,7 @@ tags:
|
|
| 13 |
- multi-turn
|
| 14 |
- rl-environment
|
| 15 |
---
|
|
|
|
| 16 |
# CommitmentOS: Training LLMs to Keep Their Promises
|
| 17 |
|
| 18 |
*The first RL environment for temporal commitment coherence, where an agent's own past decisions become binding constraints on its future ones.*
|
|
@@ -27,11 +28,23 @@ This is not a hallucination. It is not a reasoning failure. It is a **temporal c
|
|
| 27 |
|
| 28 |
---
|
| 29 |
|
| 30 |
## Why This Capability Gap Matters
|
| 31 |
|
| 32 |
Every AI assistant today treats each action atomically. When you ask it to schedule a 3pm meeting in turn 2 and then book something else for 3pm in turn 7, it sees turn 7 in isolation. The prior commitment it made has no binding weight. It will double-book you without hesitation.
|
| 33 |
|
| 34 |
-
The consequences scale with complexity. A simple day with one conflict is fine. A day with an investor dinner, overlapping personal commitments, cascading reschedule dependencies, and incoming urgent emails (the kind of day where AI assistance would actually help) is where current models fail systematically. They forget their own word.
|
| 35 |
|
| 36 |
CommitmentOS trains the specific capability that prevents this: the ability to treat prior decisions as first-class constraints that persist across the full arc of a multi-turn episode.
|
| 37 |
|
|
@@ -65,7 +78,7 @@ The ledger tracks three commitment states with distinct reward implications:
|
|
| 65 |
| **Renegotiated** | Modified with explicit communication (email + apology + alternative) | Full credit |
|
| 66 |
| **Silent violation** | Broken with no communication to affected parties | Zero credit |
|
| 67 |
|
| 68 |
-
This single mechanism
|
| 69 |
|
| 70 |
---
|
| 71 |
|
|
@@ -80,14 +93,13 @@ This single mechanism is what creates the training signal that existing environm
|
|
| 80 |
**Hard (5 tasks, 8–15 steps):** Full commitment cascades with information asymmetry and crisis pressure:
|
| 81 |
|
| 82 |
> **hard_011: VP Investor Dinner Cascade**
|
| 83 |
-
> Your VP asks you to host an investor dinner tonight. Your calendar has yoga at 6pm and a team happy hour at 7pm. The investor has a 9pm flight. The restaurant must be near the airport and accommodate a dietary restriction.
|
| 84 |
-
> You must: resolve every calendar conflict by priority, renegotiate existing social commitments with proper communication to all parties, find a compliant restaurant, and confirm the logistics, all while maintaining a coherent commitment ledger throughout.
|
| 85 |
|
| 86 |
> **hard_013: Triple Crisis Recovery**
|
| 87 |
-
> Three simultaneous failures: cancelled flight, moved board prep, lost restaurant reservation. The agent must recover all three without generating silent violations to any
|
| 88 |
|
| 89 |
> **hard_015: Production Incident Interrupts the Day**
|
| 90 |
-
> A PagerDuty alert fires mid-afternoon with 5 existing commitments active.
|
| 91 |
|
| 92 |
### Tool Set
|
| 93 |
|
|
@@ -108,7 +120,6 @@ Each tool call that advances constraint satisfaction gives a +0.05 intermediate
|
|
| 108 |
| Step Efficiency | 10% | Penalty for steps beyond optimal |
|
| 109 |
|
| 110 |
All scores clamped to (0.01, 0.99) to keep GRPO gradients alive throughout training.
|
| 111 |
-
|
| 112 |
No LLM-as-judge. No rubric subjectivity. The commitment violation check is a direct lookup against the ledger state at episode end.
|
| 113 |
|
| 114 |
---
|
|
@@ -127,50 +138,52 @@ The training loop connects directly to the live CommitmentOS HTTP API. The model
|
|
| 127 |
|
| 128 |
**GRPO Reward vs Training Step**
|
| 129 |
|
| 130 |
-

|
| 175 |
|
| 176 |
---
|
|
@@ -183,32 +196,41 @@ The most demanding single scenario: 6 constraints, 3 active commitments, 7 op
|
|
| 183 |
|
| 184 |
```
|
| 185 |
Step 1: submit_plan ← immediate surrender
|
| 186 |
-
|
| 187 |
-
|
| 188 |
-
|
| 189 |
-
|
|
|
|
| 190 |
Feedback: "MISSING email to Team | MISSING email to VP_Chen"
|
| 191 |
```
|
| 192 |
|
| 193 |
-
The model sees the complexity of the scenario and
|
| 194 |
|
| 195 |
**After training (CommitmentOS-trained model):**
|
| 196 |
|
| 197 |
```
|
| 198 |
Step 1: view_calendar {"date": "2026-04-26"}
|
| 199 |
-
Step 2: cancel_event {"event_id": "evt_90"}
|
| 200 |
-
Step 3: book_restaurant {"name": "Sky Lounge", ...}
|
| 201 |
-
Step 4: send_email {"to": "Team", ...}
|
| 202 |
-
Step 5: send_email {"to": "VP_Chen", ...}
|
| 203 |
Step 6: submit_plan
|
| 204 |
|
| 205 |
-
Reward:
|
| 206 |
-
Constraints met:
|
| 207 |
-
Commitments
|
| 208 |
-
Communications
|
| 209 |
```
|
| 210 |
|
| 211 |
-
The behavioral difference is not incremental. The trained model
|
| 212 |
|
| 213 |
---
|
| 214 |
|
|
@@ -240,10 +262,10 @@ The behavioral difference is not incremental. The trained model has learned to t
|
|
| 240 |
```
|
| 241 |
|
| 242 |
**Key design choices:**
|
| 243 |
-
- World state is scenario-local
|
| 244 |
-
- Commitment ledger persists across all turns
|
| 245 |
-
- Intermediate rewards on every step: GRPO receives a signal at each tool call
|
| 246 |
-
- MCP JSON-RPC endpoint
|
| 247 |
|
| 248 |
---
|
| 249 |
|
|
@@ -260,7 +282,7 @@ curl -X POST "https://jayant2304-commitment-os.hf.space/step" \
|
|
| 260 |
-H "Content-Type: application/json" \
|
| 261 |
-d '{"action": {"action_type": "view_calendar", "date": "2026-04-26"}}'
|
| 262 |
|
| 263 |
-
# See the commitment ledger
|
| 264 |
curl "https://jayant2304-commitment-os.hf.space/state"
|
| 265 |
|
| 266 |
# List all 15 scenarios
|
|
@@ -268,26 +290,20 @@ curl "https://jayant2304-commitment-os.hf.space/tasks"
|
|
| 268 |
```
|
| 269 |
|
| 270 |
**Resources:**
|
| 271 |
-
- 🤗 HuggingFace Space (live environment):
|
| 272 |
-
- 📓 Training Colab notebook: linked in repository README
|
| 273 |
-
- 📦 Trained weights + eval artifacts: [Google Drive bundle](https://drive.google.com/drive/folders/1yexZBSqyH7gWlTzYN5DlX3tXfPMmeVAK?usp=sharing)
|
| 274 |
-
- 📊 Evaluation artifacts: `artifacts/evals/` in repository
|
| 275 |
-
|
| 276 |
-
---
|
| 277 |
-
|
| 278 |
-
## Connection to Round 1
|
| 279 |
-
|
| 280 |
-
In Round 1 of this hackathon, we built an environment for training LLMs on production incident response: SRE agents diagnosing alerts, running runbooks, escalating on-call. That environment asked: *can an LLM handle a production crisis?*
|
| 281 |
-
|
| 282 |
-
CommitmentOS asks the follow-on question that emerged from testing Round 1: *what happens when the production incident fires in the middle of a day full of existing commitments?* The SRE agent knows how to triage the incident. It has no idea what to do about the 3pm client call it already confirmed, the dinner reservation it already made, or the team update it already promised to send.
|
| 283 |
-
|
| 284 |
-
That gap, between task competence and commitment coherence, is what CommitmentOS was built to close.
|
| 285 |
|
| 286 |
---
|
| 287 |
|
| 288 |
## What's Next
|
| 289 |
|
| 290 |
-
The Commitment Ledger mechanism generalizes beyond personal task management. Any domain where prior decisions create binding future constraints is a candidate:
|
| 291 |
|
| 292 |
CommitmentOS is a first instantiation. The environments that will matter most for trustworthy AI agents are those where the model must remain coherent with itself: not just correct in the moment, but consistent across the full arc of an interaction.
|
| 293 |
|
|
@@ -295,4 +311,4 @@ CommitmentOS is a first instantiation. The environments that will matter most fo
|
|
| 295 |
|
| 296 |
*OpenEnv Hackathon India 2026 · Theme #3.2 Personal Tasks*
|
| 297 |
|
| 298 |
-
*Tags: openenv
|
|
|
|
| 13 |
- multi-turn
|
| 14 |
- rl-environment
|
| 15 |
---
|
| 16 |
+
|
| 17 |
# CommitmentOS: Training LLMs to Keep Their Promises
|
| 18 |
|
| 19 |
*The first RL environment for temporal commitment coherence, where an agent's own past decisions become binding constraints on its future ones.*
|
|
|
|
| 28 |
|
| 29 |
---
|
| 30 |
|
| 31 |
+
## How We Got Here: Round 1 → CommitmentOS
|
| 32 |
+
|
| 33 |
+
In Round 1 of this hackathon, we built an environment for training LLMs on production incident response: SRE agents diagnosing alerts, running runbooks, escalating on-call. That environment asked: *can an LLM handle a production crisis?*
|
| 34 |
+
|
| 35 |
+
Testing Round 1 exposed the follow-on question nobody had answered: **what happens when the production incident fires in the middle of a day full of existing commitments?**
|
| 36 |
+
|
| 37 |
+
The SRE agent knows how to triage the incident. It has no idea what to do about the 3pm client call it already confirmed, the dinner reservation it already made, or the team update it already promised to send. It will silently abandon every prior commitment to handle the new crisis, with no communication to any affected party.
|
| 38 |
+
|
| 39 |
+
That gap, between task competence and commitment coherence, is what CommitmentOS was built to close.
|
| 40 |
+
|
| 41 |
+
---
|
| 42 |
+
|
| 43 |
## Why This Capability Gap Matters
|
| 44 |
|
| 45 |
Every AI assistant today treats each action atomically. When you ask it to schedule a 3pm meeting in turn 2 and then book something else for 3pm in turn 7, it sees turn 7 in isolation. The prior commitment it made has no binding weight. It will double-book you without hesitation.
|
| 46 |
|
| 47 |
+
The consequences scale with complexity. A simple day with one conflict is fine. A day with an investor dinner, overlapping personal commitments, cascading reschedule dependencies, and incoming urgent emails (the kind of day where AI assistance would actually help) is where current models fail systematically. **They forget their own word.**
|
| 48 |
|
| 49 |
CommitmentOS trains the specific capability that prevents this: the ability to treat prior decisions as first-class constraints that persist across the full arc of a multi-turn episode.
|
| 50 |
|
|
|
|
| 78 |
| **Renegotiated** | Modified with explicit communication (email + apology + alternative) | Full credit |
|
| 79 |
| **Silent violation** | Broken with no communication to affected parties | Zero credit |
|
| 80 |
|
| 81 |
+
This single mechanism creates the training signal that existing environments cannot produce: a model that learns to honor its past self's decisions while managing an evolving present.
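
The two ledger outcomes in the table above can be sketched in a few lines of Python. This is a minimal illustration, not the CommitmentOS implementation: the names `Status`, `Commitment`, and `Ledger`, the `ACTIVE` placeholder state, and the "Studio" party are hypothetical. Classification is deferred to episode end on the assumption that an email may arrive after the change it explains.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    ACTIVE = "active"                      # hypothetical placeholder: not modified
    RENEGOTIATED = "renegotiated"          # modified with explicit communication
    SILENT_VIOLATION = "silent_violation"  # broken with no communication


# Credit per outcome, following the two table rows shown above.
CREDIT = {Status.RENEGOTIATED: 1.0, Status.SILENT_VIOLATION: 0.0}


@dataclass
class Commitment:
    party: str
    description: str
    status: Status = Status.ACTIVE


@dataclass
class Ledger:
    entries: dict = field(default_factory=dict)
    modified: set = field(default_factory=set)
    notified: set = field(default_factory=set)

    def commit(self, key, party, description):
        self.entries[key] = Commitment(party, description)

    def modify(self, key):
        # Remember that this commitment was changed or cancelled.
        self.modified.add(key)

    def notify(self, party):
        # Remember that this party received an explicit message.
        self.notified.add(party)

    def finalize(self):
        # Episode-end lookup: a modified commitment counts as renegotiated
        # only if its affected party was notified at some point.
        for key in self.modified:
            entry = self.entries[key]
            told = entry.party in self.notified
            entry.status = Status.RENEGOTIATED if told else Status.SILENT_VIOLATION
        return {key: entry.status for key, entry in self.entries.items()}


ledger = Ledger()
ledger.commit("happy_hour", party="Team", description="Team happy hour at 7pm")
ledger.commit("yoga", party="Studio", description="Yoga at 6pm")
ledger.modify("happy_hour")
ledger.modify("yoga")
ledger.notify("Team")  # the email follows the change
outcome = ledger.finalize()
```

In this sketch the happy hour ends up `RENEGOTIATED` because the team was emailed, while the yoga booking ends up `SILENT_VIOLATION` because nobody was told.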
|
| 82 |
|
| 83 |
---
|
| 84 |
|
|
|
|
| 93 |
**Hard (5 tasks, 8–15 steps):** Full commitment cascades with information asymmetry and crisis pressure:
|
| 94 |
|
| 95 |
> **hard_011: VP Investor Dinner Cascade**
|
| 96 |
+
> Your VP asks you to host an investor dinner tonight. Your calendar has yoga at 6pm and a team happy hour at 7pm. The investor has a 9pm flight. The restaurant must be near the airport and accommodate a dietary restriction. Resolve every calendar conflict by priority, renegotiate existing social commitments with proper communication, find a compliant restaurant, and confirm logistics, all while maintaining a coherent commitment ledger throughout.
|
|
|
|
| 97 |
|
| 98 |
> **hard_013: Triple Crisis Recovery**
|
| 99 |
+
> Three simultaneous failures: cancelled flight, moved board prep, lost restaurant reservation. The agent must recover all three without generating silent violations to any attendees already notified.
|
| 100 |
|
| 101 |
> **hard_015: Production Incident Interrupts the Day**
|
| 102 |
+
> A PagerDuty alert fires mid-afternoon with 5 existing commitments active. Triage the incident, re-prioritize the day, and renegotiate every affected commitment with appropriate communication.
|
| 103 |
|
| 104 |
### Tool Set
|
| 105 |
|
|
|
|
| 120 |
| Step Efficiency | 10% | Penalty for steps beyond optimal |
|
| 121 |
|
| 122 |
All scores clamped to (0.01, 0.99) to keep GRPO gradients alive throughout training.
|
|
|
|
| 123 |
No LLM-as-judge. No rubric subjectivity. The commitment violation check is a direct lookup against the ledger state at episode end.
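
As a small illustration of the clamping rule, the sketch below assumes a plain min/max clamp to the stated bounds. The function name `clamp_score` is hypothetical, and how the raw score is assembled from the weighted components is not shown here.

```python
def clamp_score(raw: float, low: float = 0.01, high: float = 0.99) -> float:
    """Clamp a raw episode score into [low, high].

    The 0.01 and 0.99 bounds come from the text above; the plain
    min/max form is an assumption about how the clamp is applied.
    """
    return max(low, min(high, raw))


perfect = clamp_score(1.0)   # capped at the upper bound
failed = clamp_score(0.0)    # floored at the lower bound
typical = clamp_score(0.63)  # values inside the bounds pass through unchanged
```

Under this assumption a flawless episode reports 0.99 rather than 1.0, which matches the maximum reward shown in the episode traces.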
|
| 124 |
|
| 125 |
---
|
|
|
|
| 138 |
|
| 139 |
**GRPO Reward vs Training Step**
|
| 140 |
|
| 141 |
+

|
| 142 |
|
| 143 |
+
*Reward climbs from an early-step average of **0.48** to a late-step average of **0.63** (+31%), peaking at **0.69** at step 28. Noise is characteristic of GRPO on small-batch multi-turn environments, but the upward trend is consistent.*
|
| 144 |
|
| 145 |
**GRPO Loss vs Training Step**
|
| 146 |
|
| 147 |
+

|
| 148 |
|
| 149 |
+
*Loss drops sharply from 0.64 to near-zero within 5 steps as the policy moves away from the degenerate "submit immediately" behaviour. Oscillation after step 5 reflects policy exploration in a high-variance multi-turn action space.*
|
| 150 |
|
| 151 |
**Key training statistics:**
|
| 152 |
+
|
| 153 |
+
| Metric | Value |
|
| 154 |
+
|--------|-------|
|
| 155 |
+
| Total training steps | 30 (≈2 epochs) |
|
| 156 |
+
| Early-step mean reward (steps 1–5) | 0.48 |
|
| 157 |
+
| Late-step mean reward (steps 26–30) | 0.63 |
|
| 158 |
+
| Peak reward | **0.69** at step 28 |
|
| 159 |
+
| Reward improvement trend | **+31%** |
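
The +31% figure is the relative change between the two window means in the table. The check below recomputes it from those two rounded values; `relative_gain` is an illustrative helper, not part of the training code.

```python
def relative_gain(early: float, late: float) -> float:
    """Relative improvement of `late` over `early`, as a fraction."""
    return (late - early) / early


early_mean = 0.48  # mean reward over steps 1-5 (from the table)
late_mean = 0.63   # mean reward over steps 26-30 (from the table)

gain = relative_gain(early_mean, late_mean)
gain_percent = round(gain * 100)  # rounds to the whole-percent figure reported
```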
|
| 160 |
|
| 161 |
---
|
| 162 |
|
| 163 |
## Results
|
| 164 |
|
| 165 |
+
### A. Capability Gap Evaluation
|
| 166 |
|
| 167 |
+
This evaluation shows the size of the gap the environment tests by comparing an agent that submits immediately (no tool use) against one that completes the full task sequence.
|
| 168 |
|
| 169 |
| Metric | No-Action Baseline | Task-Completing Agent | Delta |
|
| 170 |
|--------|-------------------|----------------------|-------|
|
| 171 |
| Mean reward | 0.5427 | **0.9777** | **+0.4350 (+80%)** |
|
| 172 |
| Success rate (≥0.6) | 33.3% | **100%** | **+66.7pp** |
|
| 173 |
| Median per-task Δ | n/a | n/a | **+0.42** |
|
| 174 |
+
| Hard tier mean reward | 0.5323 | **0.9900** | **+0.4577** |
|
| 175 |
|
| 176 |
+
Every single one of the 15 tasks shows positive reward improvement. The hardest scenarios show the largest gap (+0.46), precisely because commitment tracking across 8–15 turns is where uninstructed agents fail most completely.
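
The success-rate rows use a reward threshold of 0.6. A minimal sketch of that metric follows. The reward lists are placeholder values chosen only to reproduce the 33.3% and 100% rates over 15 tasks; they are not the project's evaluation data, and `success_rate` is a hypothetical helper name.

```python
def success_rate(rewards, threshold: float = 0.6) -> float:
    """Fraction of episodes whose reward is at or above `threshold`."""
    if not rewards:
        raise ValueError("rewards must not be empty")
    return sum(r >= threshold for r in rewards) / len(rewards)


# Placeholder rewards for 15 tasks (illustrative only).
baseline_rewards = [0.70] * 5 + [0.50] * 10
completing_rewards = [0.95] * 15

baseline_rate = success_rate(baseline_rewards)      # 5 of 15 tasks pass
completing_rate = success_rate(completing_rewards)  # all 15 tasks pass
delta_pp = (completing_rate - baseline_rate) * 100  # gap in percentage points
```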
|
| 177 |
|
| 178 |
+
### B. LLM Learning Evidence (Pre-RL → Post-RL)
|
| 179 |
|
| 180 |
+
**Qwen2.5-1.5B goes from 46.7% to 60.0% success on hard tasks after GRPO training on CommitmentOS.** The base model handles easy scenarios without help. The RL signal moves the needle exactly where the capability gap is deepest: complex multi-commitment episodes where prior decisions must constrain later ones.
|
| 181 |
|
| 182 |
| Metric | Pre-RL (base) | Post-RL (trained) |
|
| 183 |
|--------|---------------|-------------------|
|
| 184 |
| Success rate (reward ≥ 0.6) | 46.7% | **60.0%** |
|
| 185 |
| Gains concentrated on | n/a | **Hard tasks** |
|
| 186 |
|
|
|
|
|
|
|
| 187 |
Full weights + evaluation artifacts: [Google Drive bundle](https://drive.google.com/drive/folders/1yexZBSqyH7gWlTzYN5DlX3tXfPMmeVAK?usp=sharing)
|
| 188 |
|
| 189 |
---
|
|
|
|
| 196 |
|
| 197 |
```
|
| 198 |
Step 1: submit_plan ← immediate surrender
|
| 199 |
+
|
| 200 |
+
Reward: 0.50
|
| 201 |
+
Constraints met: 0 / 6
|
| 202 |
+
Commitments: 0 created, 0 honored
|
| 203 |
+
Communications: 0 sent
|
| 204 |
Feedback: "MISSING email to Team | MISSING email to VP_Chen"
|
| 205 |
```
|
| 206 |
|
| 207 |
+
The model sees the complexity of the scenario and gives up immediately. Zero constraints satisfied, zero commitments tracked, zero parties notified.
|
| 208 |
|
| 209 |
**After training (CommitmentOS-trained model):**
|
| 210 |
|
| 211 |
```
|
| 212 |
Step 1: view_calendar {"date": "2026-04-26"}
|
| 213 |
+
Step 2: cancel_event {"event_id": "evt_90"} ← removes yoga (lower priority)
|
| 214 |
+
Step 3: book_restaurant {"name": "Sky Lounge", ...} ← near airport, dietary-compliant
|
| 215 |
+
Step 4: send_email {"to": "Team", ...} ← renegotiates happy hour with alternative
|
| 216 |
+
Step 5: send_email {"to": "VP_Chen", ...} ← confirms investor logistics
|
| 217 |
Step 6: submit_plan
|
| 218 |
|
| 219 |
+
Reward: 0.99
|
| 220 |
+
Constraints met: 6 / 6
|
| 221 |
+
Commitments: 1 honored (happy hour renegotiated, not silently dropped)
|
| 222 |
+
Communications: 2 / 2 sent
|
| 223 |
```
|
| 224 |
|
| 225 |
+
The behavioral difference is not incremental. The trained model treats the investor meeting as the high-priority anchor, resolves lower-priority personal conflicts around it, and, critically, **communicates every renegotiated commitment** rather than silently abandoning it. The step 4 email to the team is the behavior the commitment ledger signal trains into existence.
|
| 226 |
+
|
| 227 |
+
**Commitment violations, before vs after:**
|
| 228 |
+
|
| 229 |
+

|
| 230 |
+
|
| 231 |
+
**Reward by task, before vs after:**
|
| 232 |
+
|
| 233 |
+

|
| 234 |
|
| 235 |
---
|
| 236 |
|
|
|
|
| 262 |
```
|
| 263 |
|
| 264 |
**Key design choices:**
|
| 265 |
+
- World state is scenario-local: deterministic episode resets, no shared state between episodes
|
| 266 |
+
- Commitment ledger persists across all turns within an episode: the core novel persistent state
|
| 267 |
+
- Intermediate rewards on every step: GRPO receives a dense signal at each tool call
|
| 268 |
+
- MCP JSON-RPC endpoint (`cos_episode_reset`, `cos_environment_step`, `cos_session_snapshot`)
|
| 269 |
|
| 270 |
---
|
| 271 |
|
|
|
|
| 282 |
-H "Content-Type: application/json" \
|
| 283 |
-d '{"action": {"action_type": "view_calendar", "date": "2026-04-26"}}'
|
| 284 |
|
| 285 |
+
# See the live commitment ledger
|
| 286 |
curl "https://jayant2304-commitment-os.hf.space/state"
|
| 287 |
|
| 288 |
# List all 15 scenarios
|
|
|
|
| 290 |
```
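
The same calls can be made from Python using only the standard library. This is a sketch under assumptions: the base URL and the `/step`, `/state`, and `/tasks` paths come from the curl examples, while the helper names and the expectation that every response is JSON are not confirmed by this README.

```python
import json
import urllib.request

BASE_URL = "https://jayant2304-commitment-os.hf.space"


def step_request(action_type: str, **params) -> urllib.request.Request:
    """Build the POST request for /step, mirroring the curl example."""
    body = json.dumps({"action": {"action_type": action_type, **params}})
    return urllib.request.Request(
        f"{BASE_URL}/step",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0):
    """Send a prepared request and decode the response (assumed to be JSON)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def get_json(path: str, timeout: float = 30.0):
    """GET a path such as "/state" or "/tasks" and decode the JSON body."""
    with urllib.request.urlopen(f"{BASE_URL}{path}", timeout=timeout) as response:
        return json.load(response)


# Building a request performs no network I/O.
view_calendar = step_request("view_calendar", date="2026-04-26")
```

Calling `send(view_calendar)` followed by `get_json("/state")` would then correspond to the step and ledger calls above, assuming an episode has already been started.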
|
| 291 |
|
| 292 |
**Resources:**
|
| 293 |
+
- 🤗 **HuggingFace Space** (live environment): [jayant2304/commitment-os](https://huggingface.co/spaces/jayant2304/commitment-os)
|
| 294 |
+
- 📓 **Training Colab notebook**: linked in repository README
|
| 295 |
+
- 📦 **Trained weights + eval artifacts**: [Google Drive bundle](https://drive.google.com/drive/folders/1yexZBSqyH7gWlTzYN5DlX3tXfPMmeVAK?usp=sharing)
|
| 296 |
+
- 📊 **Evaluation artifacts**: `artifacts/evals/` in repository
|
| 297 |
|
| 298 |
---
|
| 299 |
|
| 300 |
## What's Next
|
| 301 |
|
| 302 |
+
The Commitment Ledger mechanism generalizes beyond personal task management. Any domain where prior decisions create binding future constraints is a candidate:
|
| 303 |
+
|
| 304 |
+
- **Multi-round negotiations**: an accepted term in turn 3 constrains the offer space in turn 8
|
| 305 |
+
- **Contractual workflows**: a signed milestone constrains scope in later phases
|
| 306 |
+
- **Long-horizon planning**: a resource allocated in step 5 is unavailable in step 12
|
| 307 |
|
| 308 |
CommitmentOS is a first instantiation. The environments that will matter most for trustworthy AI agents are those where the model must remain coherent with itself: not just correct in the moment, but consistent across the full arc of an interaction.
|
| 309 |
|
|
|
|
| 311 |
|
| 312 |
*OpenEnv Hackathon India 2026 · Theme #3.2 Personal Tasks*
|
| 313 |
|
| 314 |
+
*Tags: `openenv` `reinforcement-learning` `commitment-coherence` `personal-task-management` `GRPO` `Qwen2.5` `TRL` `multi-turn`*
|