Jayant2304/Commitment-os

@@ -13,6 +13,7 @@ tags:
 - multi-turn
 - rl-environment
 ---
 # CommitmentOS: Training LLMs to Keep Their Promises
 *The first RL environment for temporal commitment coherence — where an agent's own past decisions become binding constraints on its future ones.*
@@ -27,11 +28,23 @@ This is not a hallucination. It is not a reasoning failure. It is a **temporal c
 ---
 ## Why This Capability Gap Matters
 Every AI assistant today treats each action atomically. When you ask it to schedule a 3pm meeting in turn 2 and then book something else for 3pm in turn 7, it sees turn 7 in isolation. The prior commitment it made has no binding weight. It will double-book you without hesitation.
-The consequences scale with complexity. A simple day with one conflict is fine. A day with an investor dinner, overlapping personal commitments, cascading reschedule dependencies, and incoming urgent emails — the kind of day where AI assistance would actually help — is where current models fail systematically. They forget their own word.
 CommitmentOS trains the specific capability that prevents this: the ability to treat prior decisions as first-class constraints that persist across the full arc of a multi-turn episode.
@@ -65,7 +78,7 @@ The ledger tracks three commitment states with distinct reward implications:
 | **Renegotiated** | Modified with explicit communication (email + apology + alternative) | Full credit |
 | **Silent violation** | Broken with no communication to affected parties | Zero credit |
-This single mechanism is what creates the training signal that existing environments cannot produce: a model that learns to honor its past self's decisions while managing an evolving present.
 ---
@@ -80,14 +93,13 @@ This single mechanism is what creates the training signal that existing environm
 **Hard (5 tasks, 8–15 steps):** Full commitment cascades with information asymmetry and crisis pressure:
 > **hard_011 — VP Investor Dinner Cascade**
-> Your VP asks you to host an investor dinner tonight. Your calendar has yoga at 6pm and a team happy hour at 7pm. The investor has a 9pm flight. The restaurant must be near the airport and accommodate a dietary restriction.
-> You must: resolve every calendar conflict by priority, renegotiate existing social commitments with proper communication to all parties, find a compliant restaurant, and confirm the logistics — all while maintaining a coherent commitment ledger throughout.
 > **hard_013 — Triple Crisis Recovery**
-> Three simultaneous failures: cancelled flight, moved board prep, lost restaurant reservation. The agent must recover all three without generating silent violations to any of the attendees already notified.
 > **hard_015 — Production Incident Interrupts the Day**
-> A PagerDuty alert fires mid-afternoon with 5 existing commitments active. The agent must triage the incident, re-prioritize the day, and renegotiate every affected commitment with appropriate communication.
 ### Tool Set
@@ -108,7 +120,6 @@ Each tool call that advances constraint satisfaction gives a +0.05 intermediate
 | Step Efficiency | 10% | Penalty for steps beyond optimal |
 All scores are clamped to the range [0.01, 0.99] to keep GRPO gradients alive throughout training.
 No LLM-as-judge. No rubric subjectivity. The commitment violation check is a direct lookup against the ledger state at episode end.
 ---
@@ -127,50 +138,52 @@ The training loop connects directly to the live CommitmentOS HTTP API. The model
 **GRPO Reward vs Training Step**
-![Reward curve showing upward trend from ~0.48 average in early steps to ~0.63 average in later steps, peaking at 0.69 at step 28](reward_curve.png)
-*Reward climbs from an early-step average of 0.48 to a late-step average of 0.63 (+31%), peaking at 0.69. The noise is characteristic of GRPO on small-batch multi-turn environments — the trend is real.*
 **GRPO Loss vs Training Step**
-![Loss curve showing rapid descent from 0.64 at step 1, stabilizing near zero after step 5](loss_curve.png)
-*Loss drops sharply from 0.64 to near-zero within 5 steps as the policy rapidly moves away from the degenerate "submit immediately" behavior. Oscillation after step 5 reflects exploration in a high-variance multi-turn action space.*
 **Key training statistics:**
-- Steps: 30 (≈2 epochs over 15 scenarios)
-- Early-step mean reward (steps 1–5): **0.48**
-- Late-step mean reward (steps 26–30): **0.63**
-- Peak reward: **0.69** at step 28
-- Improvement trend: **+31%** over training run
 ---
 ## Results
-### A. Capability Gap Evaluation (Zero-Shot Baseline vs Task-Completing Agent)
-This evaluation establishes the size of the capability gap the environment tests — comparing an agent that submits immediately against one that completes the full task sequence.
 | Metric | No-Action Baseline | Task-Completing Agent | Delta |
 |--------|-------------------|----------------------|-------|
 | Mean reward | 0.5427 | **0.9777** | **+0.4350 (+80%)** |
 | Success rate (≥0.6) | 33.3% | **100%** | **+66.7pp** |
 | Median per-task Δ | — | — | **+0.42** |
-The no-action baseline achieves 33% success despite doing nothing — a consequence of default partial credit in conflict-free scenarios. The task-completing agent achieves 100% success across all 15 scenarios. **Every task shows positive improvement.** The hardest tasks show the largest gap: hard tier baseline 0.53 vs task-completing 0.99 (+0.46).
-### B. LLM Learning Evidence (Pre-RL vs Post-RL Checkpoint)
-The real test: does training on CommitmentOS cause measurable behavioral change in the actual Qwen2.5-1.5B model?
 | Metric | Pre-RL (base) | Post-RL (trained) |
 |--------|---------------|-------------------|
 | Success rate (reward ≥ 0.6) | 46.7% | **60.0%** |
 | Gains concentrated on | — | **Hard tasks** |
-The trained model improves on hard scenarios — precisely where commitment tracking across long multi-turn sequences is required. Easy tasks were already partially solvable by the base model; the RL signal provides the most lift where the capability gap is deepest.
 Full weights + evaluation artifacts: [Google Drive bundle](https://drive.google.com/drive/folders/1yexZBSqyH7gWlTzYN5DlX3tXfPMmeVAK?usp=sharing)
 ---
@@ -183,32 +196,41 @@ The most demanding single scenario — 6 constraints, 3 active commitments, 7 op
 ```
 Step 1: submit_plan  ← immediate surrender
-Reward: 0.50
-Constraints met: 0 / 6
-Commitments honored: 0
-Communications sent: 0
 Feedback: "MISSING email to Team | MISSING email to VP_Chen"
 ```
-The model sees the complexity of the scenario and submits immediately. Zero constraints satisfied, zero commitments created, zero parties notified.
 **After training (CommitmentOS-trained model):**
 ```
 Step 1: view_calendar {"date": "2026-04-26"}
-Step 2: cancel_event {"event_id": "evt_90"}        ← removes yoga (lower priority)
-Step 3: book_restaurant {"name": "Sky Lounge", ...} ← near airport, dietary-compliant
-Step 4: send_email {"to": "Team", ...}              ← renegotiates happy hour
-Step 5: send_email {"to": "VP_Chen", ...}           ← confirms investor logistics
 Step 6: submit_plan
-Reward: 0.99
-Constraints met: 6 / 6
-Commitments honored: 1 (happy hour renegotiated with email, not silently dropped)
-Communications sent: 2 / 2
 ```
-The behavioral difference is not incremental. The trained model has learned to treat the investor meeting as the high-priority anchor, resolve lower-priority personal conflicts around it, and — critically — communicate every renegotiated commitment rather than silently dropping it. The commitment ledger signal is what creates the difference between turn 4 (sending the team email to renegotiate happy hour) and not sending it.
 ---
@@ -240,10 +262,10 @@ The behavioral difference is not incremental. The trained model has learned to t
 ```
 **Key design choices:**
-- World state is scenario-local (hardcoded per task, no shared DB) — deterministic episode resets
-- Commitment ledger persists across all turns in an episode — the novel persistent state
-- Intermediate rewards on every step — GRPO receives a signal at each tool call, not just at episode end
-- MCP JSON-RPC endpoint for compatibility (`cos_episode_reset`, `cos_environment_step`, `cos_session_snapshot` — not reserved names)
 ---
@@ -260,7 +282,7 @@ curl -X POST "https://jayant2304-commitment-os.hf.space/step" \
   -H "Content-Type: application/json" \
   -d '{"action": {"action_type": "view_calendar", "date": "2026-04-26"}}'
-# See the commitment ledger state
 curl "https://jayant2304-commitment-os.hf.space/state"
 # List all 15 scenarios
@@ -268,26 +290,20 @@ curl "https://jayant2304-commitment-os.hf.space/tasks"
 ```
 **Resources:**
-- 🤗 HuggingFace Space (live environment): `jayant2304/commitment-os`
-- 📓 Training Colab notebook: linked in repository README
-- 📦 Trained weights + eval artifacts: [Google Drive bundle](https://drive.google.com/drive/folders/1yexZBSqyH7gWlTzYN5DlX3tXfPMmeVAK?usp=sharing)
-- 📊 Evaluation artifacts: `artifacts/evals/` in repository
----
-## Connection to Round 1
-In Round 1 of this hackathon, we built an environment for training LLMs on production incident response — SRE agents diagnosing alerts, running runbooks, escalating on-call. That environment asked: *can an LLM handle a production crisis?*
-CommitmentOS asks the follow-on question that emerged from testing Round 1: *what happens when the production incident fires in the middle of a day full of existing commitments?* The SRE agent knows how to triage the incident. It has no idea what to do about the 3pm client call it already confirmed, the dinner reservation it already made, or the team update it already promised to send.
-That gap — between task competence and commitment coherence — is what CommitmentOS was built to close.
 ---
 ## What's Next
-The Commitment Ledger mechanism generalizes beyond personal task management. Any domain where prior decisions create binding future constraints is a candidate: multi-round negotiations (accepted term in turn 3 constrains offer in turn 8), contractual workflows (signed milestone constrains scope in later phases), long-horizon planning (resource allocated in step 5 is unavailable in step 12).
 CommitmentOS is a first instantiation. The environments that will matter most for trustworthy AI agents are those where the model must remain coherent with itself — not just correct in the moment, but consistent across the full arc of an interaction.
@@ -295,4 +311,4 @@ CommitmentOS is a first instantiation. The environments that will matter most fo
 *OpenEnv Hackathon India 2026 · Theme #3.2 Personal Tasks*
-*Tags: openenv · reinforcement-learning · commitment-coherence · personal-task-management · GRPO · Qwen2.5 · TRL · multi-turn*

 - multi-turn
 - rl-environment
 ---
 # CommitmentOS: Training LLMs to Keep Their Promises
 *The first RL environment for temporal commitment coherence — where an agent's own past decisions become binding constraints on its future ones.*
 ---
+## How We Got Here: Round 1 → CommitmentOS
+In Round 1 of this hackathon, we built an environment for training LLMs on production incident response — SRE agents diagnosing alerts, running runbooks, escalating on-call. That environment asked: *can an LLM handle a production crisis?*
+Testing Round 1 exposed the follow-on question nobody had answered: **what happens when the production incident fires in the middle of a day full of existing commitments?**
+The SRE agent knows how to triage the incident. It has no idea what to do about the 3pm client call it already confirmed, the dinner reservation it already made, or the team update it already promised to send. It will silently abandon every prior commitment to handle the new crisis — with no communication to any affected party.
+That gap — between task competence and commitment coherence — is what CommitmentOS was built to close.
+---
 ## Why This Capability Gap Matters
 Every AI assistant today treats each action atomically. When you ask it to schedule a 3pm meeting in turn 2 and then book something else for 3pm in turn 7, it sees turn 7 in isolation. The prior commitment it made has no binding weight. It will double-book you without hesitation.
+The consequences scale with complexity. A simple day with one conflict is fine. A day with an investor dinner, overlapping personal commitments, cascading reschedule dependencies, and incoming urgent emails — the kind of day where AI assistance would actually help — is where current models fail systematically. **They forget their own word.**
 CommitmentOS trains the specific capability that prevents this: the ability to treat prior decisions as first-class constraints that persist across the full arc of a multi-turn episode.
 | **Renegotiated** | Modified with explicit communication (email + apology + alternative) | Full credit |
 | **Silent violation** | Broken with no communication to affected parties | Zero credit |
+This single mechanism creates the training signal that existing environments cannot produce: a model that learns to honor its past self's decisions while managing an evolving present.
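
As a rough illustration of the ledger states in the table above, here is a minimal Python sketch. It is not the environment's actual implementation: the `Commitment` fields, the `classify` and `silent_violations` helpers, and the `KEPT` state name are all hypothetical (the excerpt shows only the Renegotiated and Silent violation rows), and the party names are examples.

```python
from dataclasses import dataclass
from enum import Enum


class CommitmentState(Enum):
    KEPT = "kept"                          # assumed name; this row is not shown in the excerpt
    RENEGOTIATED = "renegotiated"          # modified with explicit communication
    SILENT_VIOLATION = "silent_violation"  # broken with no communication


@dataclass
class Commitment:
    party: str              # who the commitment was made to
    changed: bool = False   # was the original commitment modified or broken?
    notified: bool = False  # was the affected party told about the change?


def classify(commitment: Commitment) -> CommitmentState:
    """Map one ledger entry to a state, following the table's definitions."""
    if not commitment.changed:
        return CommitmentState.KEPT
    if commitment.notified:
        return CommitmentState.RENEGOTIATED
    return CommitmentState.SILENT_VIOLATION


def silent_violations(ledger: list[Commitment]) -> list[str]:
    """Return the parties whose commitments were broken without communication."""
    return [
        c.party
        for c in ledger
        if classify(c) is CommitmentState.SILENT_VIOLATION
    ]


# Hypothetical end-of-episode ledger.
ledger = [
    Commitment("Team", changed=True, notified=True),     # renegotiated by email
    Commitment("VP_Chen"),                               # untouched
    Commitment("Client", changed=True, notified=False),  # dropped silently
]
print(silent_violations(ledger))
```

Because the state depends only on two recorded facts per commitment, the end-of-episode check reduces to a lookup over the ledger, with no judgment call involved.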
 ---
 **Hard (5 tasks, 8–15 steps):** Full commitment cascades with information asymmetry and crisis pressure:
 > **hard_011 — VP Investor Dinner Cascade**
+> Your VP asks you to host an investor dinner tonight. Your calendar has yoga at 6pm and a team happy hour at 7pm. The investor has a 9pm flight. The restaurant must be near the airport and accommodate a dietary restriction. Resolve every calendar conflict by priority, renegotiate existing social commitments with proper communication, find a compliant restaurant, and confirm logistics — all while maintaining a coherent commitment ledger throughout.
 > **hard_013 — Triple Crisis Recovery**
+> Three simultaneous failures: cancelled flight, moved board prep, lost restaurant reservation. The agent must recover all three without generating silent violations to any attendees already notified.
 > **hard_015 — Production Incident Interrupts the Day**
+> A PagerDuty alert fires mid-afternoon with 5 existing commitments active. Triage the incident, re-prioritize the day, and renegotiate every affected commitment with appropriate communication.
 ### Tool Set
 | Step Efficiency | 10% | Penalty for steps beyond optimal |
 All scores are clamped to the range [0.01, 0.99] to keep GRPO gradients alive throughout training.
 No LLM-as-judge. No rubric subjectivity. The commitment violation check is a direct lookup against the ledger state at episode end.
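
The clamping step mentioned above can be sketched in a few lines of Python. The function name `clamp_score` and its parameter names are hypothetical; only the 0.01 and 0.99 bounds come from the text.

```python
def clamp_score(raw: float, low: float = 0.01, high: float = 0.99) -> float:
    """Clamp a raw episode score into [low, high]."""
    return max(low, min(high, raw))


# Scores outside the bounds are pulled to the nearest bound;
# scores already inside the range pass through unchanged.
for raw in (-0.3, 0.5, 1.2):
    print(raw, clamp_score(raw))
```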
 ---
 **GRPO Reward vs Training Step**
+![CommitmentOS GRPO Reward vs Step — upward trend from 0.48 early average to 0.63 late average, peaking at 0.69](reward_curve.png)
+*Reward climbs from an early-step average of **0.48** to a late-step average of **0.63** (+31%), peaking at **0.69** at step 28. Noise is characteristic of GRPO on small-batch multi-turn environments — the upward trend is consistent.*
 **GRPO Loss vs Training Step**
+![CommitmentOS GRPO Loss vs Step: rapid drop from 0.64 at step 1, stabilizing near zero](loss_curve.png)
+*Loss drops sharply from 0.64 to near zero within 5 steps as the policy moves away from the degenerate "submit immediately" behavior. Oscillation after step 5 reflects policy exploration in a high-variance multi-turn action space.*
 **Key training statistics:**
+| Metric | Value |
+|--------|-------|
+| Total training steps | 30 (≈2 epochs) |
+| Early-step mean reward (steps 1–5) | 0.48 |
+| Late-step mean reward (steps 26–30) | 0.63 |
+| Peak reward | **0.69** at step 28 |
+| Reward improvement trend | **+31%** |
 ---
 ## Results
+### A. Capability Gap Evaluation
+This evaluation shows the size of the gap the environment tests — comparing an agent that submits immediately (no tool use) against one that completes the full task sequence.
 | Metric | No-Action Baseline | Task-Completing Agent | Delta |
 |--------|-------------------|----------------------|-------|
 | Mean reward | 0.5427 | **0.9777** | **+0.4350 (+80%)** |
 | Success rate (≥0.6) | 33.3% | **100%** | **+66.7pp** |
 | Median per-task Δ | — | — | **+0.42** |
+| Hard tier mean reward | 0.5323 | **0.9900** | **+0.4577** |
+Every one of the 15 tasks shows a positive reward improvement. The hardest scenarios show the largest gap (+0.46), because commitment tracking across 8–15 steps is where an agent that takes no action fails most completely.
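
The relative improvements quoted in this post (+31% for the training reward, +80% for the capability gap) follow from the reported means. The short Python check below reproduces that arithmetic; the helper name `relative_gain` is just for illustration.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before


# Training run: early-step mean 0.48 vs late-step mean 0.63.
training_gain = relative_gain(0.48, 0.63)

# Capability gap: no-action baseline 0.5427 vs task-completing agent 0.9777.
capability_gain = relative_gain(0.5427, 0.9777)

print(round(training_gain * 100), round(capability_gain * 100))
```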
+### B. LLM Learning Evidence (Pre-RL → Post-RL)
+**Qwen2.5-1.5B goes from a 46.7% to a 60.0% overall success rate after GRPO training on CommitmentOS, with the gains concentrated on hard tasks.** The base model already partially solves the easy scenarios. The RL signal moves the needle exactly where the capability gap is deepest: complex multi-commitment episodes where prior decisions must constrain later ones.
 | Metric | Pre-RL (base) | Post-RL (trained) |
 |--------|---------------|-------------------|
 | Success rate (reward ≥ 0.6) | 46.7% | **60.0%** |
 | Gains concentrated on | — | **Hard tasks** |
 Full weights + evaluation artifacts: [Google Drive bundle](https://drive.google.com/drive/folders/1yexZBSqyH7gWlTzYN5DlX3tXfPMmeVAK?usp=sharing)
 ---
 ```
 Step 1: submit_plan  ← immediate surrender
+Reward:            0.50
+Constraints met:   0 / 6
+Commitments:       0 created, 0 honored
+Communications:    0 sent
 Feedback: "MISSING email to Team | MISSING email to VP_Chen"
 ```
+The model sees the complexity of the scenario and gives up immediately. Zero constraints satisfied, zero commitments tracked, zero parties notified.
 **After training (CommitmentOS-trained model):**
 ```
 Step 1: view_calendar {"date": "2026-04-26"}
+Step 2: cancel_event {"event_id": "evt_90"}         ← removes yoga (lower priority)
+Step 3: book_restaurant {"name": "Sky Lounge", ...}  ← near airport, dietary-compliant
+Step 4: send_email {"to": "Team", ...}               ← renegotiates happy hour with alternative
+Step 5: send_email {"to": "VP_Chen", ...}            ← confirms investor logistics
 Step 6: submit_plan
+Reward:            0.99
+Constraints met:   6 / 6
+Commitments:       1 honored (happy hour renegotiated, not silently dropped)
+Communications:    2 / 2 sent
 ```
+The behavioral difference is not incremental. The trained model treats the investor meeting as the high-priority anchor, resolves lower-priority personal conflicts around it, and, critically, **communicates every renegotiated commitment** rather than silently abandoning it. The step 4 email to the team is the behavior the commitment ledger signal trains into existence.
+**Commitment violations — before vs after:**
+![Bar chart showing commitment violations dropping from baseline to trained agent across all 15 tasks](violations_before_after.svg)
+**Reward by task — before vs after:**
+![Bar chart showing reward improvement across all 15 tasks from baseline to trained agent](reward_by_task.svg)
 ---
 ```
 **Key design choices:**
+- World state is scenario-local — deterministic episode resets, no shared state between episodes
+- Commitment ledger persists across all turns within an episode — the core novel persistent state
+- Intermediate rewards on every step — GRPO receives a dense signal at each tool call
+- MCP JSON-RPC endpoint (`cos_episode_reset`, `cos_environment_step`, `cos_session_snapshot`)
 ---
   -H "Content-Type: application/json" \
   -d '{"action": {"action_type": "view_calendar", "date": "2026-04-26"}}'
+# See the live commitment ledger
 curl "https://jayant2304-commitment-os.hf.space/state"
 # List all 15 scenarios
 ```
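
The same calls can be made from Python. The sketch below uses only the standard library and mirrors the `/step` and `/state` requests from the curl examples above. The helper names (`build_step_request`, `send`, `demo`) are hypothetical, and the code assumes the endpoints return JSON, which this post does not show.

```python
import json
import urllib.request

BASE_URL = "https://jayant2304-commitment-os.hf.space"


def build_step_request(action_type: str, **fields) -> urllib.request.Request:
    """Build the POST /step request with the same body shape as the curl example."""
    payload = {"action": {"action_type": action_type, **fields}}
    return urllib.request.Request(
        f"{BASE_URL}/step",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0):
    """Send a request and decode the response body (assumed to be JSON)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def demo() -> None:
    """Take one step, then read back the commitment ledger state."""
    step_result = send(build_step_request("view_calendar", date="2026-04-26"))
    ledger_state = send(urllib.request.Request(f"{BASE_URL}/state"))
    print(step_result)
    print(ledger_state)
```

Calling `demo()` performs live network requests against the Space, so it is defined but not run here.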
 **Resources:**
+- 🤗 **HuggingFace Space** (live environment): [jayant2304/commitment-os](https://huggingface.co/spaces/jayant2304/commitment-os)
+- 📓 **Training Colab notebook**: linked in repository README
+- 📦 **Trained weights + eval artifacts**: [Google Drive bundle](https://drive.google.com/drive/folders/1yexZBSqyH7gWlTzYN5DlX3tXfPMmeVAK?usp=sharing)
+- 📊 **Evaluation artifacts**: `artifacts/evals/` in repository
 ---
 ## What's Next
+The Commitment Ledger mechanism generalizes beyond personal task management. Any domain where prior decisions create binding future constraints is a candidate:
+- **Multi-round negotiations** — an accepted term in turn 3 constrains the offer space in turn 8
+- **Contractual workflows** — a signed milestone constrains scope in later phases
+- **Long-horizon planning** — a resource allocated in step 5 is unavailable in step 12
 CommitmentOS is a first instantiation. The environments that will matter most for trustworthy AI agents are those where the model must remain coherent with itself — not just correct in the moment, but consistent across the full arc of an interaction.
 *OpenEnv Hackathon India 2026 · Theme #3.2 Personal Tasks*
+*Tags: `openenv` `reinforcement-learning` `commitment-coherence` `personal-task-management` `GRPO` `Qwen2.5` `TRL` `multi-turn`*