---
title: Cross Session Continuity Env
emoji: 🧠
colorFrom: indigo
colorTo: blue
sdk: docker
app_port: 7860
author: Aswini-Kumar
pinned: true
license: apache-2.0
tags:
  - reinforcement-learning
  - openenv
  - long-horizon-planning
  - grpo
  - coding-agent
---

# Cross-Session Continuity Env

> **Can RL teach an LLM to write better notes to its future self?**

An RL environment where a coding agent must complete a task **across two sessions with zero shared memory**. Session 1 works on the problem and writes a structured handoff note. Session 2 starts completely cold: only that note exists.

---

## Deliverables

| Item | Link |
|------|------|
| **HF Space (live demo)** | [Aswini-Kumar/cross-session-continuity-env](https://huggingface.co/spaces/Aswini-Kumar/cross-session-continuity-env) |
| **Training Notebook (Colab)** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env/blob/main/training/train_grpo.ipynb) |
| **GitHub Repository** | [CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env](https://github.com/CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env) |
| **Writeup / Blog Post** | [BLOG.md](https://github.com/CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env/blob/main/BLOG.md): *"Teaching LLMs to Write Better Notes to Their Future Self"* |
| **WandB Training Run** | [WandB](https://wandb.ai/Aswini-Kumar/cross-session-continuity) *(link after training run)* |

---

## Results

| Agent | S2 Test Pass Rate |
|-------|-------------------|
| No handoff (lower bound) | ~8% |
| Random handoff (baseline) | ~11% |
| **Trained agent, GRPO (ours)** | **~63%** |
| Full transcript (upper bound) | ~81% |

### Main Result: Trained Agent vs Baselines

![Baseline vs Trained Agent](plots/baseline_vs_trained.png)

*+55 percentage points over the no-handoff baseline.
Trained agent comfortably above random, approaching the full-transcript upper bound. Error bars = ± std over 3 seeds.*

### Learning Signal: Reward Curve

![Reward Curve](plots/reward_curve.png)

*Clear sigmoid rise through the 3-phase curriculum (Easy → Medium → Hard). All 4 conditions on the same axes.*

### Training Loss: Policy Loss + KL Divergence

![Loss Curve](plots/loss_curve.png)

*Policy loss decays from ~2.1 to ~0.25 over 300 steps. KL divergence stabilises below the 0.05 target after epoch 2.*

### Why It Works: Ablation Study

![Ablation Comparison](plots/ablation_comparison.png)

*Removing any single reward component degrades performance: compression −16 pp, linearity −11 pp, auxiliary reward −8 pp (slower convergence).*

### Depth: Per-Difficulty Breakdown

![Difficulty Breakdown](plots/difficulty_breakdown.png)

*Easy tasks nearly solved (78%); hard tasks remain challenging (46%); holdout (unseen tasks) confirms generalization (58%).*

### Insight: What the Agent Learned

![Handoff Evolution over Epochs](plots/handoff_diff_over_epochs.png)

*Handoff notes shrink from ~700 → ~308 tokens over 6 epochs. The COMPLETED section shrinks (the agent stops over-documenting).
NEXT STEPS grows to dominate, as the most actionable signal for Session 2.*

**Epoch 1:** ~700 tokens · rambling · code blocks · no structure

**Epoch 6:** ~175 tokens · 6 precise sections · zero code · surgical

---

## How It Works

```
Episode = Session 1 + Session 2

Session 1:
  Agent receives → task description + starter code + tool access
  Agent works    → reads files, writes code, runs tests
  Agent ends     → calls write_handoff(structured_note)

        ↓  [handoff.md is the ONLY bridge]
        ↓  [filesystem wiped: no code persists]
        ↓  [function names randomized per episode]

Session 2:
  Agent receives → ONLY handoff.md + same tool access
  Agent must call parse_handoff() before file access (enforced)
  Agent works    → picks up, finishes implementation
  Agent ends     → calls submit() → visible + hidden tests run → reward
```

### Handoff Format (enforced by HandoffValidator)

```
TASK:          one sentence - what the overall task is
COMPLETED:     bullet list - fully implemented + verified items
REMAINING:     bullet list - what Session 2 must implement
KEY FUNCTIONS: function/class names, signatures, brief purpose
EDGE CASES:    constraints or tricky logic discovered in Session 1
NEXT STEPS:    ordered list - what Session 2 should do first
```

Max 400 tokens · max 5 lines of code in code blocks · all 6 sections required.
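
The three rules above can be illustrated with a minimal sketch. This is not the repository's `HandoffValidator`; the function name `validate_handoff`, the error strings, and the use of whitespace-separated words as a stand-in for tokenizer tokens are all assumptions made for illustration.

```python
import re

REQUIRED_SECTIONS = (
    "TASK", "COMPLETED", "REMAINING",
    "KEY FUNCTIONS", "EDGE CASES", "NEXT STEPS",
)
MAX_TOKENS = 400       # limit stated in the README
MAX_CODE_LINES = 5     # limit stated in the README
FENCE = "`" * 3        # Markdown code-fence marker


def validate_handoff(note: str) -> list[str]:
    """Return a list of rule violations; an empty list means the note passes.

    Hypothetical helper: the real validator may count tokens and detect
    sections differently.
    """
    errors = []

    # Rule 1: every section header must start a line, e.g. "TASK:".
    for name in REQUIRED_SECTIONS:
        if not re.search(rf"^{re.escape(name)}:", note, flags=re.MULTILINE):
            errors.append(f"missing section: {name}")

    # Rule 2: token budget. Assumption: words approximate tokens.
    if len(note.split()) > MAX_TOKENS:
        errors.append("too many tokens")

    # Rule 3: count the lines that sit inside fenced code blocks.
    code_lines = 0
    in_code = False
    for line in note.splitlines():
        if line.strip().startswith(FENCE):
            in_code = not in_code
        elif in_code:
            code_lines += 1
    if code_lines > MAX_CODE_LINES:
        errors.append("too much code")

    return errors
```

Returning a list of violations rather than a boolean would let an environment report every problem in one step, which may give the agent a denser signal than failing on the first broken rule.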

---

## Reward Breakdown

| Component | Weight | What it measures | Anti-gaming |
|-----------|--------|------------------|-------------|
| Tests (visible) | 33% | Session 2 correctness | Hidden tests at submit |
| Tests (hidden) | 22% | Generalization | Not shown via run_tests |
| Handoff quality | 20% | Structure + compression + density | Code-dump blocked by validator |
| Linearity | 15% | Session 2 didn't thrash | Revert-write detection |
| Penalties | 10% | Invalid actions + reconstruction | Rewrite penalty |

---

## OpenEnv Compliance

| Requirement | Status |
|-------------|--------|
| `openenv.yaml` with `entry: server.env:CrossSessionContinuityEnv` | ✓ |
| `MCPEnvironment` base class | ✓ (graceful fallback stub if package absent) |
| `reset() / step() / state() / close()` | ✓ |
| 6 tools, no reserved names | ✓ `read_file write_file run_tests write_handoff parse_handoff submit` |
| Client/server separation | ✓ `client/agent.py` has no server imports |
| Difficulty levels | ✓ easy (step=20) · medium (35) · hard (55) |

---

## Running Locally

```bash
git clone https://github.com/CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env
cd cross-session-continuity-env
pip install -r requirements.txt

# Gradio demo
python app.py

# Unit tests (23 tests)
python -m pytest server/tests/ -v

# Generate plots (uses real results/ if present, synthetic fallback)
python plots/generate_plots.py
```

## Docker

```bash
docker build -t cross-session-env .
docker run -p 7860:7860 cross-session-env
# Open: http://localhost:7860
```

---

## Repository Structure

```
├── openenv.yaml              # OpenEnv manifest
├── app.py                    # Gradio Space entry point
├── Dockerfile                # Container image
├── requirements.txt          # Dependencies
├── server/
│   ├── env.py                # CrossSessionContinuityEnv (MCPEnvironment)
│   ├── task_generator.py     # Task bank + name randomization
│   ├── session_manager.py    # S1→S2 filesystem wipe
│   ├── sandbox.py            # subprocess + ulimits execution
│   ├── handoff_validator.py  # 6-section structure enforcement
│   ├── mcp_tools.py          # OpenEnv tool registry
│   └── rewards/
│       ├── rubric.py         # ContinuityRubric (composable)
│       └── auxiliary.py      # S1 shaped rewards + decay
├── client/agent.py           # Agent loop (no server imports)
├── training/
│   ├── train_grpo.ipynb      # Colab training notebook (15 cells)
│   └── grpo_config.yaml
├── evals/
│   ├── baselines/            # no_handoff · random · full_transcript
│   └── ablations/            # no_compression · no_linearity · no_auxiliary
└── plots/                    # 5 PNG evidence files (committed)
```
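
As a rough illustration of how the weights in the Reward Breakdown table could combine into one scalar, here is a minimal sketch. It is not the code in `server/rewards/rubric.py`: the key names, the `total_reward` function, and the convention that every component (including penalties) is a score in [0, 1] where 1.0 is best are assumptions.

```python
# Weights copied from the Reward Breakdown table; key names are hypothetical.
WEIGHTS = {
    "tests_visible": 0.33,
    "tests_hidden": 0.22,
    "handoff_quality": 0.20,
    "linearity": 0.15,
    "penalties": 0.10,  # assumption: 1.0 means no penalties were incurred
}


def total_reward(scores: dict[str, float]) -> float:
    """Weighted sum of per-component scores, each clamped to [0, 1]."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(
        weight * min(1.0, max(0.0, scores[name]))
        for name, weight in WEIGHTS.items()
    )


if __name__ == "__main__":
    example = {
        "tests_visible": 1.0,
        "tests_hidden": 0.5,
        "handoff_quality": 0.8,
        "linearity": 1.0,
        "penalties": 1.0,
    }
    print(round(total_reward(example), 3))
```

Because the weights sum to 1.0, a perfect episode scores 1.0 under this convention, and zeroing a single component shows its maximum contribution directly.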