Spaces:
Sleeping
title: CommitmentOS
emoji: π
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
- openenv
- reinforcement-learning
- commitment-coherence
- personal-task-management
- multi-turn
CommitmentOS: Training Temporal Commitment Coherence in LLMs
The first RL environment that trains LLMs to keep their promises.
CommitmentOS is a multi-turn personal task management environment where agents manage calendars, emails, and dining reservations across realistic scenarios. The key innovation: the agent's own prior decisions create binding future constraints tracked via a commitment ledger, and violations are penalised regardless of how many turns have elapsed.
Quick Start
# Reset to a scenario
curl -X POST "https://jayant2304-commitment-os.hf.space/reset?task_id=easy_001"
# Make a tool call
curl -X POST "https://jayant2304-commitment-os.hf.space/step" \
-H "Content-Type: application/json" \
-d '{"action": {"action_type": "view_calendar", "date": "2026-04-25"}}'
# Get state
curl "https://jayant2304-commitment-os.hf.space/state"
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
/reset |
POST | Start a new episode (optional: task_id, difficulty) |
/step |
POST | Execute one tool call |
/state |
GET | Current episode state |
/health |
GET | Health check |
/tasks |
GET | List all available scenarios |
/mcp |
POST | MCP JSON-RPC 2.0 (initialize, tools/list; tool names cos_episode_reset, cos_environment_step, cos_session_snapshot β not the reserved strings reset/step/state) |
15 Scenarios (5 Easy / 5 Medium / 5 Hard)
Scenarios range from simple calendar reschedules to multi-crisis cascades with information asymmetry and production incidents interrupting a full day of commitments.
Reward Function (5 components)
| Component | Weight | Signal |
|---|---|---|
| Constraint Satisfaction | 35% | Binary per-constraint checks |
| Conflict Resolution | 20% | Calendar free of overlaps |
| Commitment Coherence | 20% | Violations tracked via ledger |
| Communication Quality | 15% | Keyword matching on emails |
| Step Efficiency | 10% | Fewer steps = higher score |
What Makes This Novel
Existing constraint-satisfaction environments compute dependency graphs upfront. CommitmentOS is different: constraints emerge from the agent's own decisions as the episode unfolds. A meeting scheduled in turn 2 becomes a binding constraint in turn 7. Breaking it without communication is a tracked, penalised violation.
This is temporal commitment coherence β a capability no existing RL environment trains.
Training curves for the published Colab run are in the GitHub repo under artifacts/loss_curve.png and artifacts/reward_curve.png (with training_metrics.json).
Improvement Evidence
Deterministic baseline-vs-trained-style evaluation is included in the repo:
- Protocol:
artifacts/evals/eval_protocol.json - Per-task raw results:
artifacts/evals/baseline_eval.json,artifacts/evals/trained_eval.json - Delta table:
artifacts/evals/comparison.csv - Case study:
artifacts/evals/case_study_hard_011.md - Plots:
artifacts/evals/reward_by_task.svg,artifacts/evals/violations_before_after.svg
Headline metrics (summary.json):
- Mean reward: 0.5427 -> 0.9777 (+0.4350)
- Success rate: 0.3333 -> 1.0000 (+0.6667)
- Median per-task reward delta: +0.4200
For true model-learning proof (pre-RL checkpoint vs post-RL checkpoint), run:
# From cloned repo (core deps + torch/transformers/peft/β¦ via optional extra):
pip install -e ".[llm-eval]"
export BASELINE_MODEL_NAME=Qwen/Qwen2.5-1.5B-Instruct
export TRAINED_MODEL_PATH=/content/commitment_os/training_output
export ENV_BASE_URL=https://jayant2304-commitment-os.hf.space
python3 evaluation/evaluate_llm_checkpoints.py
python3 evaluation/plot_llm_checkpoints.py
Artifacts are written to artifacts/evals_llm/.
Published LLM run (bundle on Drive): success 46.7% β 60.0% at reward threshold 0.6; mean reward ~flat; gains concentrated on hard tasks. Traces: artifacts/evals_llm/*.json in the folder below.
Pretrained adapter + LLM eval artifacts (Google Drive): commitment_os_bundle β download training_output/ and set TRAINED_MODEL_PATH accordingly; full gdown notes are in the GitHub README.md.