Commit 2a66dcb · verified · committed by Jayant2304 · 1 Parent(s): 77ecfb1

Update README.md

Files changed (1)
  1. README.md +74 -58
README.md CHANGED
@@ -13,6 +13,7 @@ tags:
13
  - multi-turn
14
  - rl-environment
15
  ---
 
16
  # CommitmentOS: Training LLMs to Keep Their Promises
17
 
18
  *The first RL environment for temporal commitment coherence, where an agent's own past decisions become binding constraints on its future ones.*
@@ -27,11 +28,23 @@ This is not a hallucination. It is not a reasoning failure. It is a **temporal c
27
 
28
  ---
29
 
 
 
 
 
 
 
 
 
 
 
 
 
30
  ## Why This Capability Gap Matters
31
 
32
  Every AI assistant today treats each action atomically. When you ask it to schedule a 3pm meeting in turn 2 and then book something else for 3pm in turn 7, it sees turn 7 in isolation. The prior commitment it made has no binding weight. It will double-book you without hesitation.
33
 
34
- The consequences scale with complexity. A simple day with one conflict is fine. A day with an investor dinner, overlapping personal commitments, cascading reschedule dependencies, and incoming urgent emails (the kind of day where AI assistance would actually help) is where current models fail systematically. They forget their own word.
35
 
36
  CommitmentOS trains the specific capability that prevents this: the ability to treat prior decisions as first-class constraints that persist across the full arc of a multi-turn episode.
37
 
@@ -65,7 +78,7 @@ The ledger tracks three commitment states with distinct reward implications:
65
  | **Renegotiated** | Modified with explicit communication (email + apology + alternative) | Full credit |
66
  | **Silent violation** | Broken with no communication to affected parties | Zero credit |
67
 
68
- This single mechanism is what creates the training signal that existing environments cannot produce: a model that learns to honor its past self's decisions while managing an evolving present.
69
 
70
  ---
71
 
@@ -80,14 +93,13 @@ This single mechanism is what creates the training signal that existing environm
80
  **Hard (5 tasks, 8–15 steps):** Full commitment cascades with information asymmetry and crisis pressure:
81
 
82
  > **hard_011: VP Investor Dinner Cascade**
83
- > Your VP asks you to host an investor dinner tonight. Your calendar has yoga at 6pm and a team happy hour at 7pm. The investor has a 9pm flight. The restaurant must be near the airport and accommodate a dietary restriction.
84
- > You must: resolve every calendar conflict by priority, renegotiate existing social commitments with proper communication to all parties, find a compliant restaurant, and confirm the logistics, all while maintaining a coherent commitment ledger throughout.
85
 
86
  > **hard_013: Triple Crisis Recovery**
87
- > Three simultaneous failures: cancelled flight, moved board prep, lost restaurant reservation. The agent must recover all three without generating silent violations to any of the attendees already notified.
88
 
89
  > **hard_015: Production Incident Interrupts the Day**
90
- > A PagerDuty alert fires mid-afternoon with 5 existing commitments active. The agent must triage the incident, re-prioritize the day, and renegotiate every affected commitment with appropriate communication.
91
 
92
  ### Tool Set
93
 
@@ -108,7 +120,6 @@ Each tool call that advances constraint satisfaction gives a +0.05 intermediate
108
  | Step Efficiency | 10% | Penalty for steps beyond optimal |
109
 
110
  All scores clamped to (0.01, 0.99) to keep GRPO gradients alive throughout training.
111
-
112
  No LLM-as-judge. No rubric subjectivity. The commitment violation check is a direct lookup against the ledger state at episode end.
113
 
114
  ---
@@ -127,50 +138,52 @@ The training loop connects directly to the live CommitmentOS HTTP API. The model
127
 
128
  **GRPO Reward vs Training Step**
129
 
130
- ![Reward curve showing upward trend from ~0.48 average in early steps to ~0.63 average in later steps, peaking at 0.69 at step 28](reward_curve.png)
131
 
132
- *Reward climbs from an early-step average of 0.48 to a late-step average of 0.63 (+31%), peaking at 0.69. The noise is characteristic of GRPO on small-batch multi-turn environments; the trend is real.*
133
 
134
  **GRPO Loss vs Training Step**
135
 
136
- ![Loss curve showing rapid descent from 0.64 at step 1, stabilizing near zero after step 5](loss_curve.png)
137
 
138
- *Loss drops sharply from 0.64 to near-zero within 5 steps as the policy rapidly moves away from the degenerate "submit immediately" behavior. Oscillation after step 5 reflects exploration in a high-variance multi-turn action space.*
139
 
140
  **Key training statistics:**
141
- - Steps: 30 (≈2 epochs over 15 scenarios)
142
- - Early-step mean reward (steps 1–5): **0.48**
143
- - Late-step mean reward (steps 26–30): **0.63**
144
- - Peak reward: **0.69** at step 28
145
- - Improvement trend: **+31%** over training run
 
 
 
146
 
147
  ---
148
 
149
  ## Results
150
 
151
- ### A. Capability Gap Evaluation (Zero-Shot Baseline vs Task-Completing Agent)
152
 
153
- This evaluation establishes the size of the capability gap the environment tests, comparing an agent that submits immediately against one that completes the full task sequence.
154
 
155
  | Metric | No-Action Baseline | Task-Completing Agent | Delta |
156
  |--------|-------------------|----------------------|-------|
157
  | Mean reward | 0.5427 | **0.9777** | **+0.4350 (+80%)** |
158
  | Success rate (≥0.6) | 33.3% | **100%** | **+66.7pp** |
159
  | Median per-task Δ | n/a | n/a | **+0.42** |
 
160
 
161
- The no-action baseline achieves 33% success despite doing nothing, a consequence of default partial credit in conflict-free scenarios. The task-completing agent achieves 100% success across all 15 scenarios. **Every task shows positive improvement.** The hardest tasks show the largest gap: hard tier baseline 0.53 vs task-completing 0.99 (+0.46).
162
 
163
- ### B. LLM Learning Evidence (Pre-RL vs Post-RL Checkpoint)
164
 
165
- The real test: does training on CommitmentOS cause measurable behavioral change in the actual Qwen2.5-1.5B model?
166
 
167
  | Metric | Pre-RL (base) | Post-RL (trained) |
168
  |--------|---------------|-------------------|
169
  | Success rate (reward ≥ 0.6) | 46.7% | **60.0%** |
170
  | Gains concentrated on | n/a | **Hard tasks** |
171
 
172
- The trained model improves on hard scenarios, precisely where commitment tracking across long multi-turn sequences is required. Easy tasks were already partially solvable by the base model; the RL signal provides the most lift where the capability gap is deepest.
173
-
174
  Full weights + evaluation artifacts: [Google Drive bundle](https://drive.google.com/drive/folders/1yexZBSqyH7gWlTzYN5DlX3tXfPMmeVAK?usp=sharing)
175
 
176
  ---
@@ -183,32 +196,41 @@ The most demanding single scenario β€” 6 constraints, 3 active commitments, 7 op
183
 
184
  ```
185
  Step 1: submit_plan ← immediate surrender
186
- Reward: 0.50
187
- Constraints met: 0 / 6
188
- Commitments honored: 0
189
- Communications sent: 0
 
190
  Feedback: "MISSING email to Team | MISSING email to VP_Chen"
191
  ```
192
 
193
- The model sees the complexity of the scenario and submits immediately. Zero constraints satisfied, zero commitments created, zero parties notified.
194
 
195
  **After training (CommitmentOS-trained model):**
196
 
197
  ```
198
  Step 1: view_calendar {"date": "2026-04-26"}
199
- Step 2: cancel_event {"event_id": "evt_90"} ← removes yoga (lower priority)
200
- Step 3: book_restaurant {"name": "Sky Lounge", ...} ← near airport, dietary-compliant
201
- Step 4: send_email {"to": "Team", ...} ← renegotiates happy hour
202
- Step 5: send_email {"to": "VP_Chen", ...} ← confirms investor logistics
203
  Step 6: submit_plan
204
 
205
- Reward: 0.99
206
- Constraints met: 6 / 6
207
- Commitments honored: 1 (happy hour renegotiated with email, not silently dropped)
208
- Communications sent: 2 / 2
209
  ```
210
 
211
- The behavioral difference is not incremental. The trained model has learned to treat the investor meeting as the high-priority anchor, resolve lower-priority personal conflicts around it, and, critically, communicate every renegotiated commitment rather than silently dropping it. The commitment ledger signal is what creates the difference between turn 4 (sending the team email to renegotiate happy hour) and not sending it.
 
 
 
 
 
 
 
 
212
 
213
  ---
214
 
@@ -240,10 +262,10 @@ The behavioral difference is not incremental. The trained model has learned to t
240
  ```
241
 
242
  **Key design choices:**
243
- - World state is scenario-local (hardcoded per task, no shared DB): deterministic episode resets
244
- - Commitment ledger persists across all turns in an episode: the novel persistent state
245
- - Intermediate rewards on every step: GRPO receives a signal at each tool call, not just at episode end
246
- - MCP JSON-RPC endpoint for compatibility (`cos_episode_reset`, `cos_environment_step`, `cos_session_snapshot`; not reserved names)
247
 
248
  ---
249
 
@@ -260,7 +282,7 @@ curl -X POST "https://jayant2304-commitment-os.hf.space/step" \
260
  -H "Content-Type: application/json" \
261
  -d '{"action": {"action_type": "view_calendar", "date": "2026-04-26"}}'
262
 
263
- # See the commitment ledger state
264
  curl "https://jayant2304-commitment-os.hf.space/state"
265
 
266
  # List all 15 scenarios
@@ -268,26 +290,20 @@ curl "https://jayant2304-commitment-os.hf.space/tasks"
268
  ```
269
 
270
  **Resources:**
271
- - 🤗 HuggingFace Space (live environment): `jayant2304/commitment-os`
272
- - 📓 Training Colab notebook: linked in repository README
273
- - 📦 Trained weights + eval artifacts: [Google Drive bundle](https://drive.google.com/drive/folders/1yexZBSqyH7gWlTzYN5DlX3tXfPMmeVAK?usp=sharing)
274
- - 📊 Evaluation artifacts: `artifacts/evals/` in repository
275
-
276
- ---
277
-
278
- ## Connection to Round 1
279
-
280
- In Round 1 of this hackathon, we built an environment for training LLMs on production incident response: SRE agents diagnosing alerts, running runbooks, escalating on-call. That environment asked: *can an LLM handle a production crisis?*
281
-
282
- CommitmentOS asks the follow-on question that emerged from testing Round 1: *what happens when the production incident fires in the middle of a day full of existing commitments?* The SRE agent knows how to triage the incident. It has no idea what to do about the 3pm client call it already confirmed, the dinner reservation it already made, or the team update it already promised to send.
283
-
284
- That gap, between task competence and commitment coherence, is what CommitmentOS was built to close.
285
 
286
  ---
287
 
288
  ## What's Next
289
 
290
- The Commitment Ledger mechanism generalizes beyond personal task management. Any domain where prior decisions create binding future constraints is a candidate: multi-round negotiations (accepted term in turn 3 constrains offer in turn 8), contractual workflows (signed milestone constrains scope in later phases), long-horizon planning (resource allocated in step 5 is unavailable in step 12).
 
 
 
 
291
 
292
  CommitmentOS is a first instantiation. The environments that will matter most for trustworthy AI agents are those where the model must remain coherent with itself: not just correct in the moment, but consistent across the full arc of an interaction.
293
 
@@ -295,4 +311,4 @@ CommitmentOS is a first instantiation. The environments that will matter most fo
295
 
296
  *OpenEnv Hackathon India 2026 · Theme #3.2 Personal Tasks*
297
 
298
- *Tags: openenv · reinforcement-learning · commitment-coherence · personal-task-management · GRPO · Qwen2.5 · TRL · multi-turn*
 
13
  - multi-turn
14
  - rl-environment
15
  ---
16
+
17
  # CommitmentOS: Training LLMs to Keep Their Promises
18
 
19
  *The first RL environment for temporal commitment coherence, where an agent's own past decisions become binding constraints on its future ones.*
 
28
 
29
  ---
30
 
31
+ ## How We Got Here: Round 1 → CommitmentOS
32
+
33
+ In Round 1 of this hackathon, we built an environment for training LLMs on production incident response: SRE agents diagnosing alerts, running runbooks, escalating on-call. That environment asked: *can an LLM handle a production crisis?*
34
+
35
+ Testing Round 1 exposed the follow-on question nobody had answered: **what happens when the production incident fires in the middle of a day full of existing commitments?**
36
+
37
+ The SRE agent knows how to triage the incident. It has no idea what to do about the 3pm client call it already confirmed, the dinner reservation it already made, or the team update it already promised to send. It will silently abandon every prior commitment to handle the new crisis, with no communication to any affected party.
38
+
39
+ That gap, between task competence and commitment coherence, is what CommitmentOS was built to close.
40
+
41
+ ---
42
+
43
  ## Why This Capability Gap Matters
44
 
45
  Every AI assistant today treats each action atomically. When you ask it to schedule a 3pm meeting in turn 2 and then book something else for 3pm in turn 7, it sees turn 7 in isolation. The prior commitment it made has no binding weight. It will double-book you without hesitation.
46
 
47
+ The consequences scale with complexity. A simple day with one conflict is fine. A day with an investor dinner, overlapping personal commitments, cascading reschedule dependencies, and incoming urgent emails (the kind of day where AI assistance would actually help) is where current models fail systematically. **They forget their own word.**
48
 
49
  CommitmentOS trains the specific capability that prevents this: the ability to treat prior decisions as first-class constraints that persist across the full arc of a multi-turn episode.
50
 
 
78
  | **Renegotiated** | Modified with explicit communication (email + apology + alternative) | Full credit |
79
  | **Silent violation** | Broken with no communication to affected parties | Zero credit |
80
 
81
+ This single mechanism creates the training signal that existing environments cannot produce: a model that learns to honor its past self's decisions while managing an evolving present.
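The ledger idea above can be illustrated with a minimal Python sketch. This is not the environment's implementation: the class names, the `KEPT` state name, the single `notified` flag (standing in for "email + apology + alternative"), and full credit for kept commitments are all assumptions made for illustration. Only the credit for renegotiated and silently violated commitments comes from the table.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class CommitmentState(Enum):
    KEPT = "kept"                          # assumed name: commitment left untouched
    RENEGOTIATED = "renegotiated"          # changed, and the affected party was told
    SILENT_VIOLATION = "silent_violation"  # broken, and nobody was told


# Credit per state. RENEGOTIATED and SILENT_VIOLATION follow the table;
# full credit for KEPT is an assumption of this sketch.
CREDIT = {
    CommitmentState.KEPT: 1.0,
    CommitmentState.RENEGOTIATED: 1.0,
    CommitmentState.SILENT_VIOLATION: 0.0,
}


@dataclass
class Commitment:
    event_id: str
    party: str
    broken: bool = False
    notified: bool = False

    @property
    def state(self) -> CommitmentState:
        if not self.broken:
            return CommitmentState.KEPT
        if self.notified:
            return CommitmentState.RENEGOTIATED
        return CommitmentState.SILENT_VIOLATION


@dataclass
class CommitmentLedger:
    """Hypothetical per-episode ledger that persists across turns."""
    entries: Dict[str, Commitment] = field(default_factory=dict)

    def commit(self, event_id: str, party: str) -> None:
        self.entries[event_id] = Commitment(event_id, party)

    def break_commitment(self, event_id: str) -> None:
        self.entries[event_id].broken = True

    def notify(self, party: str) -> None:
        # Simplification: one message to a party covers all of its commitments.
        for commitment in self.entries.values():
            if commitment.party == party:
                commitment.notified = True

    def credit(self) -> float:
        """Mean credit over all commitments, checked by direct lookup."""
        if not self.entries:
            return 1.0
        total = sum(CREDIT[c.state] for c in self.entries.values())
        return total / len(self.entries)
```

In this sketch, cancelling the happy hour without emailing the team leaves a silent violation in the ledger, and the later email turns the same entry into a renegotiation.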
82
 
83
  ---
84
 
 
93
  **Hard (5 tasks, 8–15 steps):** Full commitment cascades with information asymmetry and crisis pressure:
94
 
95
  > **hard_011: VP Investor Dinner Cascade**
96
+ > Your VP asks you to host an investor dinner tonight. Your calendar has yoga at 6pm and a team happy hour at 7pm. The investor has a 9pm flight. The restaurant must be near the airport and accommodate a dietary restriction. Resolve every calendar conflict by priority, renegotiate existing social commitments with proper communication, find a compliant restaurant, and confirm logistics, all while maintaining a coherent commitment ledger throughout.
 
97
 
98
  > **hard_013: Triple Crisis Recovery**
99
+ > Three simultaneous failures: cancelled flight, moved board prep, lost restaurant reservation. The agent must recover all three without generating silent violations to any attendees already notified.
100
 
101
  > **hard_015: Production Incident Interrupts the Day**
102
+ > A PagerDuty alert fires mid-afternoon with 5 existing commitments active. Triage the incident, re-prioritize the day, and renegotiate every affected commitment with appropriate communication.
103
 
104
  ### Tool Set
105
 
 
120
  | Step Efficiency | 10% | Penalty for steps beyond optimal |
121
 
122
  All scores clamped to (0.01, 0.99) to keep GRPO gradients alive throughout training.
 
123
  No LLM-as-judge. No rubric subjectivity. The commitment violation check is a direct lookup against the ledger state at episode end.
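A minimal sketch of the clamping step, assuming the composite score is a plain float computed elsewhere. The function name is hypothetical, and the bounds are treated as inclusive because the demo trajectories report rewards of exactly 0.99.

```python
def clamp_score(raw: float, low: float = 0.01, high: float = 0.99) -> float:
    """Clamp a raw episode score into the [low, high] range used for rewards."""
    return max(low, min(high, raw))


# A perfect raw score is capped, a negative one is floored,
# and anything in between passes through unchanged.
for raw in (1.2, -0.3, 0.5):
    print(raw, "->", clamp_score(raw))
```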
124
 
125
  ---
 
138
 
139
  **GRPO Reward vs Training Step**
140
 
141
+ ![CommitmentOS GRPO Reward vs Step: upward trend from 0.48 early average to 0.63 late average, peaking at 0.69](reward_curve.png)
142
 
143
+ *Reward climbs from an early-step average of **0.48** to a late-step average of **0.63** (+31%), peaking at **0.69** at step 28. Noise is characteristic of GRPO on small-batch multi-turn environments; the upward trend is consistent.*
144
 
145
  **GRPO Loss vs Training Step**
146
 
147
+ ![CommitmentOS GRPO Loss vs Step: rapid drop from 0.64 at step 1, stabilising near zero](loss_curve.png)
148
 
149
+ *Loss drops sharply from 0.64 to near-zero within 5 steps as the policy moves away from the degenerate "submit immediately" behaviour. Oscillation after step 5 reflects policy exploration in a high-variance multi-turn action space.*
150
 
151
  **Key training statistics:**
152
+
153
+ | Metric | Value |
154
+ |--------|-------|
155
+ | Total training steps | 30 (≈2 epochs) |
156
+ | Early-step mean reward (steps 1–5) | 0.48 |
157
+ | Late-step mean reward (steps 26–30) | 0.63 |
158
+ | Peak reward | **0.69** at step 28 |
159
+ | Reward improvement trend | **+31%** |
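The +31% figure is the relative change between the early-step and late-step means. A short check, using only the two means reported in the table (the helper name is illustrative):

```python
def relative_change(before: float, after: float) -> float:
    """Fractional change from `before` to `after`."""
    return (after - before) / before


early_mean = 0.48  # mean reward over steps 1-5, from the table
late_mean = 0.63   # mean reward over steps 26-30, from the table

improvement = relative_change(early_mean, late_mean)
print(f"Relative improvement: {improvement:.0%}")
```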
160
 
161
  ---
162
 
163
  ## Results
164
 
165
+ ### A. Capability Gap Evaluation
166
 
167
+ This evaluation shows the size of the gap the environment tests, comparing an agent that submits immediately (no tool use) against one that completes the full task sequence.
168
 
169
  | Metric | No-Action Baseline | Task-Completing Agent | Delta |
170
  |--------|-------------------|----------------------|-------|
171
  | Mean reward | 0.5427 | **0.9777** | **+0.4350 (+80%)** |
172
  | Success rate (≥0.6) | 33.3% | **100%** | **+66.7pp** |
173
  | Median per-task Δ | n/a | n/a | **+0.42** |
174
+ | Hard tier mean reward | 0.5323 | **0.9900** | **+0.4577** |
175
 
176
+ Every single one of the 15 tasks shows positive reward improvement. The hardest scenarios show the largest gap (+0.46), precisely because commitment tracking across 8–15 turns is where uninstructed agents fail most completely.
177
 
178
+ ### B. LLM Learning Evidence (Pre-RL → Post-RL)
179
 
180
+ **Qwen2.5-1.5B goes from 46.7% to 60.0% success (reward ≥ 0.6) after GRPO training on CommitmentOS, with the gains concentrated on hard tasks.** The base model handles easy scenarios without help. The RL signal moves the needle exactly where the capability gap is deepest: complex multi-commitment episodes where prior decisions must constrain later ones.
181
 
182
  | Metric | Pre-RL (base) | Post-RL (trained) |
183
  |--------|---------------|-------------------|
184
  | Success rate (reward ≥ 0.6) | 46.7% | **60.0%** |
185
  | Gains concentrated on | n/a | **Hard tasks** |
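The reported success rates line up with whole-task counts if each evaluation covers all 15 scenarios once. That is an inference from the numbers, not something the README states; a 5-task tier on its own could only produce multiples of 20%. A quick consistency check under that assumption:

```python
TOTAL_TASKS = 15  # scenario count stated in the README


def tasks_for_rate(rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Nearest whole number of successful tasks for a reported success rate."""
    return round(rate_percent / 100 * total)


def rate_for_tasks(count: int, total: int = TOTAL_TASKS) -> float:
    """Success rate in percent, rounded to one decimal as in the tables."""
    return round(100 * count / total, 1)


for label, rate in [("no-action baseline", 33.3), ("pre-RL", 46.7), ("post-RL", 60.0)]:
    count = tasks_for_rate(rate)
    print(f"{label}: {rate}% is about {count}/{TOTAL_TASKS} tasks "
          f"({rate_for_tasks(count)}% when recomputed)")
```

Read this way, the pre-RL to post-RL change corresponds to two additional scenarios crossing the 0.6 reward threshold.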
186
 
 
 
187
  Full weights + evaluation artifacts: [Google Drive bundle](https://drive.google.com/drive/folders/1yexZBSqyH7gWlTzYN5DlX3tXfPMmeVAK?usp=sharing)
188
 
189
  ---
 
196
 
197
  ```
198
  Step 1: submit_plan ← immediate surrender
199
+
200
+ Reward: 0.50
201
+ Constraints met: 0 / 6
202
+ Commitments: 0 created, 0 honored
203
+ Communications: 0 sent
204
  Feedback: "MISSING email to Team | MISSING email to VP_Chen"
205
  ```
206
 
207
+ The model sees the complexity of the scenario and gives up immediately. Zero constraints satisfied, zero commitments tracked, zero parties notified.
208
 
209
  **After training (CommitmentOS-trained model):**
210
 
211
  ```
212
  Step 1: view_calendar {"date": "2026-04-26"}
213
+ Step 2: cancel_event {"event_id": "evt_90"} ← removes yoga (lower priority)
214
+ Step 3: book_restaurant {"name": "Sky Lounge", ...} ← near airport, dietary-compliant
215
+ Step 4: send_email {"to": "Team", ...} ← renegotiates happy hour with alternative
216
+ Step 5: send_email {"to": "VP_Chen", ...} ← confirms investor logistics
217
  Step 6: submit_plan
218
 
219
+ Reward: 0.99
220
+ Constraints met: 6 / 6
221
+ Commitments: 1 honored (happy hour renegotiated, not silently dropped)
222
+ Communications: 2 / 2 sent
223
  ```
224
 
225
+ The behavioral difference is not incremental. The trained model treats the investor meeting as the high-priority anchor, resolves lower-priority personal conflicts around it, and, critically, **communicates every renegotiated commitment** rather than silently abandoning it. That turn 4 email to the team is what the commitment ledger signal trains into existence.
226
+
227
+ **Commitment violations, before vs after:**
228
+
229
+ ![Bar chart showing commitment violations dropping from baseline to trained agent across all 15 tasks](violations_before_after.svg)
230
+
231
+ **Reward by task, before vs after:**
232
+
233
+ ![Bar chart showing reward improvement across all 15 tasks from baseline to trained agent](reward_by_task.svg)
234
 
235
  ---
236
 
 
262
  ```
263
 
264
  **Key design choices:**
265
+ - World state is scenario-local: deterministic episode resets, no shared state between episodes
266
+ - Commitment ledger persists across all turns within an episode: the core novel persistent state
267
+ - Intermediate rewards on every step: GRPO receives a dense signal at each tool call
268
+ - MCP JSON-RPC endpoint (`cos_episode_reset`, `cos_environment_step`, `cos_session_snapshot`)
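For readers unfamiliar with MCP, a tool invocation is a JSON-RPC 2.0 request with method `tools/call`. The sketch below builds such a request for the tool names listed above. The argument shape passed to `cos_environment_step` is an assumption borrowed from the HTTP `/step` payload; the README does not document the MCP argument schema.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need a unique id per call


def mcp_tool_call(tool_name: str, arguments: dict) -> dict:
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical arguments: mirrors the HTTP /step body, not a documented schema.
request = mcp_tool_call(
    "cos_environment_step",
    {"action": {"action_type": "view_calendar", "date": "2026-04-26"}},
)
print(json.dumps(request, indent=2))
```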
269
 
270
  ---
271
 
 
282
  -H "Content-Type: application/json" \
283
  -d '{"action": {"action_type": "view_calendar", "date": "2026-04-26"}}'
284
 
285
+ # See the live commitment ledger
286
  curl "https://jayant2304-commitment-os.hf.space/state"
287
 
288
  # List all 15 scenarios
 
290
  ```
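The same `/step` call can be made from Python with the standard library. This is a sketch under assumptions: the helper names are illustrative, the payload shape is copied from the curl example, and the response is assumed to be JSON, which the README does not confirm.

```python
import json
import urllib.request

BASE_URL = "https://jayant2304-commitment-os.hf.space"


def build_step_request(action_type: str, **params) -> urllib.request.Request:
    """Build the POST /step request shown in the curl example."""
    payload = {"action": {"action_type": action_type, **params}}
    return urllib.request.Request(
        f"{BASE_URL}/step",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send a request and parse the body, assuming the server returns JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network; call send(step) to execute it.
step = build_step_request("view_calendar", date="2026-04-26")
print(step.get_method(), step.full_url)
```

Separating request construction from sending keeps the payload easy to inspect and log before anything is posted to the live Space.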
291
 
292
  **Resources:**
293
+ - 🤗 **HuggingFace Space** (live environment): [jayant2304/commitment-os](https://huggingface.co/spaces/jayant2304/commitment-os)
294
+ - 📓 **Training Colab notebook**: linked in repository README
295
+ - 📦 **Trained weights + eval artifacts**: [Google Drive bundle](https://drive.google.com/drive/folders/1yexZBSqyH7gWlTzYN5DlX3tXfPMmeVAK?usp=sharing)
296
+ - 📊 **Evaluation artifacts**: `artifacts/evals/` in repository
 
 
 
 
 
 
 
 
 
 
297
 
298
  ---
299
 
300
  ## What's Next
301
 
302
+ The Commitment Ledger mechanism generalizes beyond personal task management. Any domain where prior decisions create binding future constraints is a candidate:
303
+
304
+ - **Multi-round negotiations**: an accepted term in turn 3 constrains the offer space in turn 8
305
+ - **Contractual workflows**: a signed milestone constrains scope in later phases
306
+ - **Long-horizon planning**: a resource allocated in step 5 is unavailable in step 12
307
 
308
  CommitmentOS is a first instantiation. The environments that will matter most for trustworthy AI agents are those where the model must remain coherent with itself: not just correct in the moment, but consistent across the full arc of an interaction.
309
 
 
311
 
312
  *OpenEnv Hackathon India 2026 · Theme #3.2 Personal Tasks*
313
 
314
+ *Tags: `openenv` `reinforcement-learning` `commitment-coherence` `personal-task-management` `GRPO` `Qwen2.5` `TRL` `multi-turn`*