| --- |
| title: Cross Session Continuity Env |
| emoji: 🧠 |
| colorFrom: indigo |
| colorTo: blue |
| sdk: docker |
| app_port: 7860 |
| author: Aswini-Kumar |
| pinned: true |
| license: apache-2.0 |
| tags: |
| - reinforcement-learning |
| - openenv |
| - long-horizon-planning |
| - grpo |
| - coding-agent |
| --- |
| |
| # Cross-Session Continuity Env |
|
|
| > **Can RL teach an LLM to write better notes to its future self?** |
|
|
| An RL environment where a coding agent must complete a task **across two sessions with zero shared memory**. |
| Session 1 works on the problem and writes a structured handoff note. Session 2 starts completely cold: only that note exists. |
|
|
| --- |
|
|
| ## Deliverables |
|
|
| | Item | Link | |
| |------|------| |
| | **HF Space (live demo)** | [Aswini-Kumar/cross-session-continuity-env](https://huggingface.co/spaces/Aswini-Kumar/cross-session-continuity-env) | |
| **Training Notebook (Colab)** | [Open in Colab](https://colab.research.google.com/github/CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env/blob/main/training/train_grpo.ipynb) | |
| | **GitHub Repository** | [CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env](https://github.com/CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env) | |
| **Writeup / Blog Post** | [BLOG.md](https://github.com/CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env/blob/main/BLOG.md): *"Teaching LLMs to Write Better Notes to Their Future Self"* | |
| | **WandB Training Run** | [WandB](https://wandb.ai/Aswini-Kumar/cross-session-continuity) *(link after training run)* | |
|
|
| --- |
|
|
| ## Results |
|
|
| | Agent | S2 Test Pass Rate | |
| |-------|-------------------| |
| | No handoff (lower bound) | ~8% | |
| | Random handoff (baseline) | ~11% | |
| **Trained agent, GRPO (ours)** | **~63%** | |
| | Full transcript (upper bound) | ~81% | |
|
|
| ### Main Result: Trained Agent vs Baselines |
|
|
|  |
|
|
| *+55 percentage points over the no-handoff baseline (~8% → ~63%). The trained agent sits well above the random-handoff baseline (~11%) and approaches the full-transcript upper bound (~81%). Error bars = ± std over 3 seeds.* |
|
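The headline numbers can be recomputed from the results table. The sketch below does that and also reports the share of the gap between the no-handoff lower bound and the full-transcript upper bound that the trained agent recovers. That second figure is an extra framing, not a number reported by the project. The dictionary keys and function names are illustrative, and the pass rates are the approximate values from the table.

```python
# Approximate Session 2 test pass rates (percent), copied from the results table.
PASS_RATES = {
    "no_handoff": 8.0,        # lower bound
    "random_handoff": 11.0,   # baseline
    "trained_grpo": 63.0,     # trained agent
    "full_transcript": 81.0,  # upper bound
}


def gain_pp(agent: str, baseline: str) -> float:
    """Absolute improvement of `agent` over `baseline`, in percentage points."""
    return PASS_RATES[agent] - PASS_RATES[baseline]


def gap_recovered(agent: str, lower: str, upper: str) -> float:
    """Fraction of the lower-to-upper-bound gap closed by `agent`."""
    span = PASS_RATES[upper] - PASS_RATES[lower]
    return (PASS_RATES[agent] - PASS_RATES[lower]) / span


print(gain_pp("trained_grpo", "no_handoff"))  # 55.0
print(round(gap_recovered("trained_grpo", "no_handoff", "full_transcript"), 3))  # 0.753
```

Read this way, the handoff note alone recovers roughly three quarters of what the full transcript provides, subject to the rounding in the table.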
|
| ### Learning Signal: Reward Curve |
|
|
|  |
|
|
| *Clear sigmoid rise through the 3-phase curriculum (Easy → Medium → Hard). All 4 conditions are plotted on the same axes.* |
|
|
| ### Training Loss: Policy Loss + KL Divergence |
|
|
|  |
|
|
| *Policy loss decays from ~2.1 to ~0.25 over 300 steps. KL divergence stabilises below the 0.05 target after epoch 2.* |
|
|
| ### Why It Works: Ablation Study |
|
|
|  |
|
|
| *Removing any single reward component degrades performance: compression −16 pp, linearity −11 pp, auxiliary reward −8 pp (slower convergence).* |
|
|
| ### Depth: Per-Difficulty Breakdown |
|
|
|  |
|
|
| *Easy tasks nearly solved (78%); hard tasks remain challenging (46%); holdout (unseen tasks) confirms generalization (58%).* |
|
|
| ### Insight: What the Agent Learned |
|
|
|  |
|
|
| *Handoff notes shrink from ~700 to ~308 tokens over 6 epochs. The COMPLETED section shrinks (the agent stops over-documenting), while NEXT STEPS grows to dominate as the most actionable signal for Session 2.* |
|
|
| **Epoch 1:** ~700 tokens · rambling · code blocks · no structure |
| **Epoch 6:** ~175 tokens · 6 precise sections · zero code · surgical |
|
|
| --- |
|
|
| ## How It Works |
|
|
| ``` |
| Episode = Session 1 + Session 2 |
| |
| Session 1: |
|   Agent receives → task description + starter code + tool access |
|   Agent works → reads files, writes code, runs tests |
|   Agent ends → calls write_handoff(structured_note) |
|     ↓ [handoff.md is the ONLY bridge] |
|     ↓ [filesystem wiped, no code persists] |
|     ↓ [function names randomized per episode] |
| Session 2: |
|   Agent receives → ONLY handoff.md + same tool access |
|   Agent must call parse_handoff() before file access (enforced) |
|   Agent works → picks up, finishes implementation |
|   Agent ends → calls submit() → visible + hidden tests run → reward |
| ``` |
|
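The rule that Session 2 must call `parse_handoff()` before touching files can be pictured as a small gate in front of the tool dispatcher. The following is a minimal sketch under stated assumptions, not the project's actual implementation: the class name `Session2Gate` is hypothetical, and it assumes only `read_file` and `write_file` count as "file access" while the other tools stay available.

```python
class Session2Gate:
    """Hypothetical guard: file tools stay locked until parse_handoff is called."""

    # Assumption: only these two tools count as "file access".
    FILE_TOOLS = {"read_file", "write_file"}
    # The six tool names listed in this README.
    KNOWN_TOOLS = FILE_TOOLS | {"run_tests", "write_handoff", "parse_handoff", "submit"}

    def __init__(self) -> None:
        self.handoff_parsed = False

    def check(self, tool_name: str) -> bool:
        """Return True if the call is allowed; remember parse_handoff calls."""
        if tool_name not in self.KNOWN_TOOLS:
            return False  # unknown tool: always rejected
        if tool_name == "parse_handoff":
            self.handoff_parsed = True
            return True
        if tool_name in self.FILE_TOOLS and not self.handoff_parsed:
            return False  # file access attempted before reading the handoff
        return True


gate = Session2Gate()
print(gate.check("read_file"))      # False
print(gate.check("parse_handoff"))  # True
print(gate.check("read_file"))      # True
```

A rejected call could then be routed to the invalid-action penalty rather than executed, which keeps the agent from bypassing the handoff note.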
|
| ### Handoff Format (enforced by HandoffValidator) |
|
|
| ``` |
| TASK: one sentence (what the overall task is) |
| COMPLETED: bullet list (fully implemented + verified items) |
| REMAINING: bullet list (what Session 2 must implement) |
| KEY FUNCTIONS: function/class names, signatures, brief purpose |
| EDGE CASES: constraints or tricky logic discovered in Session 1 |
| NEXT STEPS: ordered list (what Session 2 should do first) |
| ``` |
|
|
| Max 400 tokens · max 5 lines of code in code blocks · all 6 sections required. |
|
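The three constraints above (six required sections, a 400-token cap, at most 5 lines of fenced code) can be checked with a few lines of Python. This is a rough sketch, not the repository's `HandoffValidator`: the function name `validate_handoff` is hypothetical, tokens are approximated by whitespace splitting rather than a real tokenizer, and the code-line limit is assumed to apply to the total across all fenced blocks.

```python
import re

REQUIRED_SECTIONS = [
    "TASK", "COMPLETED", "REMAINING",
    "KEY FUNCTIONS", "EDGE CASES", "NEXT STEPS",
]
MAX_TOKENS = 400
MAX_CODE_LINES = 5
FENCE = "`" * 3  # Markdown code-fence marker


def validate_handoff(note: str) -> list[str]:
    """Return a list of problems found in the note (empty list means valid)."""
    errors = []

    # 1. Every section header must start a line, followed by a colon.
    for name in REQUIRED_SECTIONS:
        if not re.search(rf"^{re.escape(name)}:", note, flags=re.MULTILINE):
            errors.append(f"missing section: {name}")

    # 2. Token budget (assumption: whitespace tokens stand in for model tokens).
    n_tokens = len(note.split())
    if n_tokens > MAX_TOKENS:
        errors.append(f"too long: {n_tokens} tokens > {MAX_TOKENS}")

    # 3. Count lines that sit inside fenced code blocks.
    code_lines, in_block = 0, False
    for line in note.splitlines():
        if line.strip().startswith(FENCE):
            in_block = not in_block
        elif in_block:
            code_lines += 1
    if code_lines > MAX_CODE_LINES:
        errors.append(f"too much code: {code_lines} lines > {MAX_CODE_LINES}")

    return errors


example = "\n".join([
    "TASK: implement an LRU cache",
    "COMPLETED:",
    "- lookup path implemented and tested",
    "REMAINING:",
    "- eviction on insert",
    "KEY FUNCTIONS: cache_get(key), cache_put(key, value)",
    "EDGE CASES: capacity of zero",
    "NEXT STEPS:",
    "1. implement eviction in cache_put",
])
print(validate_handoff(example))  # []
```

Returning a list of problems instead of a single boolean makes it easy to give the agent a specific reason when a note is rejected.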
|
| --- |
|
|
| ## Reward Breakdown |
|
|
| | Component | Weight | What it measures | Anti-gaming | |
| |-----------|--------|------------------|-------------| |
| | Tests (visible) | 33% | Session 2 correctness | Hidden tests at submit | |
| | Tests (hidden) | 22% | Generalization | Not shown via run_tests | |
| | Handoff quality | 20% | Structure + compression + density | Code-dump blocked by validator | |
| | Linearity | 15% | Session 2 didn't thrash | Revert-write detection | |
| | Penalties | 10% | Invalid actions + reconstruction | Rewrite penalty | |
| |
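To make the weighting concrete, the table can be read as a weighted sum of per-component scores. The sketch below is an illustration under assumptions, not the repository's `ContinuityRubric`: the component keys are made up, each score is assumed to lie in [0, 1], and the penalties component is assumed to score 1.0 when no penalty was incurred.

```python
# Weights copied from the reward table; the key names are illustrative.
WEIGHTS = {
    "tests_visible": 0.33,
    "tests_hidden": 0.22,
    "handoff_quality": 0.20,
    "linearity": 0.15,
    "penalties": 0.10,  # assumption: 1.0 means "no penalties incurred"
}


def combined_reward(scores: dict[str, float]) -> float:
    """Weighted sum of component scores, each assumed to be in [0, 1]."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


episode = {
    "tests_visible": 1.0,    # all visible tests pass
    "tests_hidden": 0.5,     # half of the hidden tests pass
    "handoff_quality": 0.8,
    "linearity": 1.0,
    "penalties": 1.0,
}
print(round(combined_reward(episode), 2))  # 0.85
```

Because the weights sum to 1, a perfect episode scores 1.0, and together the two test components account for 55% of the total.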
| --- |
| |
| ## OpenEnv Compliance |
| |
| | Requirement | Status | |
| |-------------|--------| |
| `openenv.yaml` with `entry: server.env:CrossSessionContinuityEnv` | ✅ | |
| `MCPEnvironment` base class | ✅ (graceful fallback stub if package absent) | |
| `reset() / step() / state() / close()` | ✅ | |
| 6 tools, no reserved names | ✅ `read_file write_file run_tests write_handoff parse_handoff submit` | |
| Client/server separation | ✅ `client/agent.py` has no server imports | |
| Difficulty levels | ✅ easy (step=20) · medium (35) · hard (55) | |
|
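If the per-difficulty numbers are read as step budgets, which is an assumption since the table only says `step=20`, the bookkeeping could look like the sketch below. `StepCounter` is a hypothetical name, and whether the real budget applies per session or per episode is not stated here.

```python
# Assumed interpretation: maximum number of environment steps per difficulty.
STEP_BUDGET = {"easy": 20, "medium": 35, "hard": 55}


class StepCounter:
    """Hypothetical helper that tracks how much of the step budget is used."""

    def __init__(self, difficulty: str) -> None:
        if difficulty not in STEP_BUDGET:
            raise ValueError(f"unknown difficulty: {difficulty!r}")
        self.limit = STEP_BUDGET[difficulty]
        self.used = 0

    def tick(self) -> bool:
        """Consume one step; return True while steps remain afterwards."""
        self.used += 1
        return self.used < self.limit


counter = StepCounter("easy")
steps_taken = 1
while counter.tick():
    steps_taken += 1
print(steps_taken)  # 20
```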
|
| --- |
|
|
| ## Running Locally |
|
|
| ```bash |
| git clone https://github.com/CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env |
| cd cross-session-continuity-env |
| pip install -r requirements.txt |
| |
| # Gradio demo |
| python app.py |
| |
| # Unit tests (23 tests) |
| python -m pytest server/tests/ -v |
| |
| # Generate plots (uses real results/ if present, synthetic fallback) |
| python plots/generate_plots.py |
| ``` |
|
|
| ## Docker |
|
|
| ```bash |
| docker build -t cross-session-env . |
| docker run -p 7860:7860 cross-session-env |
| # Open: http://localhost:7860 |
| ``` |
|
|
| --- |
|
|
| ## Repository Structure |
|
|
| ``` |
| ├── openenv.yaml # OpenEnv manifest |
| ├── app.py # Gradio Space entry point |
| ├── Dockerfile # Container image |
| ├── requirements.txt # Dependencies |
| ├── server/ |
| │   ├── env.py # CrossSessionContinuityEnv (MCPEnvironment) |
| │   ├── task_generator.py # Task bank + name randomization |
| │   ├── session_manager.py # S1→S2 filesystem wipe |
| │   ├── sandbox.py # subprocess + ulimits execution |
| │   ├── handoff_validator.py # 6-section structure enforcement |
| │   ├── mcp_tools.py # OpenEnv tool registry |
| │   └── rewards/ |
| │       ├── rubric.py # ContinuityRubric (composable) |
| │       └── auxiliary.py # S1 shaped rewards + decay |
| ├── client/agent.py # Agent loop (no server imports) |
| ├── training/ |
| │   ├── train_grpo.ipynb # Colab training notebook (15 cells) |
| │   └── grpo_config.yaml |
| ├── evals/ |
| │   ├── baselines/ # no_handoff · random · full_transcript |
| │   └── ablations/ # no_compression · no_linearity · no_auxiliary |
| └── plots/ # 6 PNG evidence files (committed) |
| ``` |
|
|