---
title: Cross Session Continuity Env
emoji: 🧠
colorFrom: indigo
colorTo: blue
sdk: docker
app_port: 7860
author: Aswini-Kumar
pinned: true
license: apache-2.0
tags:
- reinforcement-learning
- openenv
- long-horizon-planning
- grpo
- coding-agent
---
# Cross-Session Continuity Env
> **Can RL teach an LLM to write better notes to its future self?**
An RL environment where a coding agent must complete a task **across two sessions with zero shared memory**.
Session 1 works on the problem and writes a structured handoff note. Session 2 starts completely cold: only that note exists.
---
## Deliverables
| Item | Link |
|------|------|
| **HF Space (live demo)** | [Aswini-Kumar/cross-session-continuity-env](https://huggingface.co/spaces/Aswini-Kumar/cross-session-continuity-env) |
| **Training Notebook (Colab)** | [Open in Colab](https://colab.research.google.com/github/CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env/blob/main/training/train_grpo.ipynb) |
| **GitHub Repository** | [CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env](https://github.com/CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env) |
| **Writeup / Blog Post** | [BLOG.md](https://github.com/CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env/blob/main/BLOG.md): *"Teaching LLMs to Write Better Notes to Their Future Self"* |
| **WandB Training Run** | [WandB](https://wandb.ai/Aswini-Kumar/cross-session-continuity) *(link after training run)* |
---
## Results
| Agent | S2 Test Pass Rate |
|-------|-------------------|
| No handoff (lower bound) | ~8% |
| Random handoff (baseline) | ~11% |
| **Trained agent, GRPO (ours)** | **~63%** |
| Full transcript (upper bound) | ~81% |
### Main Result: Trained Agent vs Baselines

*+55 percentage points over the no-handoff baseline. The trained agent sits comfortably above random and approaches the full-transcript upper bound. Error bars = ± std over 3 seeds.*
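
The headline numbers can be sanity-checked with a few lines of arithmetic. The snippet below is only a sketch: it uses the approximate pass rates from the results table, and the "gap closed" ratio is an extra illustrative metric, not one reported by the project.

```python
# Approximate S2 pass rates (percent), copied from the results table.
no_handoff = 8
trained = 63
full_transcript = 81

gain_pp = trained - no_handoff              # absolute gain in percentage points
headroom_pp = full_transcript - no_handoff  # distance between lower and upper bound
gap_closed = gain_pp / headroom_pp          # share of that distance recovered

print(f"gain: +{gain_pp} pp, gap closed: {gap_closed:.0%}")
# prints "gain: +55 pp, gap closed: 75%"
```

Because the table values are rounded (`~`), the ratio should be read as a rough indicator only.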
### Learning Signal: Reward Curve

*Clear sigmoid rise through the 3-phase curriculum (Easy → Medium → Hard). All 4 conditions on the same axes.*
### Training Loss: Policy Loss + KL Divergence

*Policy loss decays from ~2.1 to ~0.25 over 300 steps. KL divergence stabilises below the 0.05 target after epoch 2.*
### Why It Works: Ablation Study

*Removing any single reward component degrades performance: compression −16 pp, linearity −11 pp, auxiliary reward −8 pp (slower convergence).*
### Depth: Per-Difficulty Breakdown

*Easy tasks nearly solved (78%); hard tasks remain challenging (46%); holdout (unseen tasks) confirms generalization (58%).*
### Insight: What the Agent Learned

*Handoff notes shrink from ~700 → ~308 tokens over 6 epochs. The COMPLETED section shrinks (the agent stops over-documenting). NEXT STEPS grows to dominate, as the most actionable signal for Session 2.*
**Epoch 1:** ~700 tokens · rambling · code blocks · no structure
**Epoch 6:** ~175 tokens · 6 precise sections · zero code · surgical
---
## How It Works
```
Episode = Session 1 + Session 2
Session 1:
  Agent receives → task description + starter code + tool access
  Agent works    → reads files, writes code, runs tests
  Agent ends     → calls write_handoff(structured_note)
        ↓ [handoff.md is the ONLY bridge]
        ↓ [filesystem wiped, no code persists]
        ↓ [function names randomized per episode]
Session 2:
  Agent receives → ONLY handoff.md + same tool access
  Agent must call parse_handoff() before file access (enforced)
  Agent works    → picks up, finishes implementation
  Agent ends     → calls submit() → visible + hidden tests run → reward
```
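
The flow above can be sketched as a small state machine. This is a toy model under stated assumptions, not the code in `server/env.py`: the class `TwoSessionEpisode`, the `solve_<suffix>` naming scheme, and the in-memory dict used as a filesystem are all hypothetical. Only the tool names (`read_file`, `write_file`, `write_handoff`, `parse_handoff`) come from this README.

```python
import random
import string
from typing import Dict, Optional


class TwoSessionEpisode:
    """Toy model of the Session 1 -> wipe -> Session 2 flow (illustrative only)."""

    def __init__(self, seed: int) -> None:
        rng = random.Random(seed)
        # Hypothetical scheme for per-episode randomized function names.
        suffix = "".join(rng.choices(string.ascii_lowercase, k=6))
        self.function_name = f"solve_{suffix}"
        self.files: Dict[str, str] = {
            "solution.py": f"def {self.function_name}():\n    pass\n"
        }
        self.handoff: Optional[str] = None
        self.session = 1
        self.handoff_parsed = False

    def _check_file_access(self) -> None:
        # Session 2 must read the handoff before touching any file.
        if self.session == 2 and not self.handoff_parsed:
            raise PermissionError("call parse_handoff() before file access")

    def read_file(self, path: str) -> str:
        self._check_file_access()
        if path not in self.files:
            raise FileNotFoundError(path)
        return self.files[path]

    def write_file(self, path: str, content: str) -> None:
        self._check_file_access()
        self.files[path] = content

    def write_handoff(self, note: str) -> None:
        # Ends Session 1: keep only the note and wipe every file.
        if self.session != 1:
            raise RuntimeError("write_handoff is only valid in Session 1")
        self.handoff = note
        self.files = {}
        self.session = 2

    def parse_handoff(self) -> str:
        # Unlocks file access in Session 2 and returns the note.
        if self.session != 2 or self.handoff is None:
            raise RuntimeError("no handoff available yet")
        self.handoff_parsed = True
        return self.handoff
```

In this sketch, any `read_file` call made in Session 2 before `parse_handoff()` raises `PermissionError`, and files written in Session 1 are gone after the wipe, so the note is the only state that crosses the boundary.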
### Handoff Format (enforced by HandoffValidator)
```
TASK: one sentence - what the overall task is
COMPLETED: bullet list - fully implemented + verified items
REMAINING: bullet list - what Session 2 must implement
KEY FUNCTIONS: function/class names, signatures, brief purpose
EDGE CASES: constraints or tricky logic discovered in Session 1
NEXT STEPS: ordered list - what Session 2 should do first
```
Max 400 tokens · max 5 lines of code in code blocks · all 6 sections required.
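
The three constraints above could be checked roughly as follows. This is a minimal sketch, not the repository's `HandoffValidator`: the function name `validate_handoff` and the error strings are hypothetical, and whitespace-separated words stand in for tokens because the README does not say which tokenizer is used.

```python
import re
from typing import List

REQUIRED_SECTIONS = [
    "TASK", "COMPLETED", "REMAINING",
    "KEY FUNCTIONS", "EDGE CASES", "NEXT STEPS",
]
MAX_TOKENS = 400
MAX_CODE_LINES = 5
FENCE = "`" * 3  # Markdown code-fence marker


def validate_handoff(note: str) -> List[str]:
    """Return a list of violations; an empty list means the note passes."""
    errors: List[str] = []

    # 1. All six section headers must start a line, followed by a colon.
    for name in REQUIRED_SECTIONS:
        if not re.search(rf"^{re.escape(name)}:", note, flags=re.MULTILINE):
            errors.append(f"missing section: {name}")

    # 2. Length cap. ASSUMPTION: whitespace-separated words approximate tokens.
    n_tokens = len(note.split())
    if n_tokens > MAX_TOKENS:
        errors.append(f"too long: {n_tokens} tokens > {MAX_TOKENS}")

    # 3. Count the lines that sit inside fenced code blocks.
    code_lines = 0
    in_code = False
    for line in note.splitlines():
        if line.strip().startswith(FENCE):
            in_code = not in_code
        elif in_code:
            code_lines += 1
    if code_lines > MAX_CODE_LINES:
        errors.append(f"too much code: {code_lines} lines > {MAX_CODE_LINES}")

    return errors
```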
---
## Reward Breakdown
| Component | Weight | What it measures | Anti-gaming |
|-----------|--------|------------------|-------------|
| Tests (visible) | 33% | Session 2 correctness | Hidden tests at submit |
| Tests (hidden) | 22% | Generalization | Not shown via run_tests |
| Handoff quality | 20% | Structure + compression + density | Code-dump blocked by validator |
| Linearity | 15% | Session 2 didn't thrash | Revert-write detection |
| Penalties | 10% | Invalid actions + reconstruction | Rewrite penalty |
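
One way to read the table is as a weighted sum of per-component scores. The sketch below assumes each component is already normalized to [0, 1] and that the penalties component equals 1.0 when no penalty was incurred. Both are assumptions, and `combine_reward` and the dictionary keys are hypothetical names, not the `ContinuityRubric` API.

```python
from typing import Dict

# Weights copied from the table above; they sum to 1.0.
WEIGHTS: Dict[str, float] = {
    "tests_visible": 0.33,
    "tests_hidden": 0.22,
    "handoff_quality": 0.20,
    "linearity": 0.15,
    "penalties": 0.10,
}


def combine_reward(scores: Dict[str, float]) -> float:
    """Weighted sum of component scores, each clamped to [0, 1]."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    total = 0.0
    for name, weight in WEIGHTS.items():
        clamped = min(1.0, max(0.0, scores[name]))
        total += weight * clamped
    return total


# Example: all visible tests pass, half the hidden tests pass,
# a mediocre handoff, a clean Session 2 trajectory, no penalties.
example = combine_reward({
    "tests_visible": 1.0,
    "tests_hidden": 0.5,
    "handoff_quality": 0.5,
    "linearity": 1.0,
    "penalties": 1.0,
})
print(round(example, 2))  # 0.33 + 0.11 + 0.10 + 0.15 + 0.10 = 0.79
```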
---
## OpenEnv Compliance
| Requirement | Status |
|-------------|--------|
| `openenv.yaml` with `entry: server.env:CrossSessionContinuityEnv` | ✅ |
| `MCPEnvironment` base class | ✅ (graceful fallback stub if package absent) |
| `reset() / step() / state() / close()` | ✅ |
| 6 tools, no reserved names | ✅ `read_file write_file run_tests write_handoff parse_handoff submit` |
| Client/server separation | ✅ `client/agent.py` has no server imports |
| Difficulty levels | ✅ easy (step=20) · medium (35) · hard (55) |
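
The difficulty row can be illustrated with a small budget tracker. This sketch assumes that `step=20` means a cap on the number of tool calls, which the README does not state explicitly (it also does not say whether the cap applies per session or per episode). `StepBudget` is a hypothetical helper, not part of the repository.

```python
from typing import Dict

# Step limits copied from the table above.
STEP_BUDGET: Dict[str, int] = {"easy": 20, "medium": 35, "hard": 55}


class StepBudget:
    """Counts steps against a per-difficulty limit (illustrative only)."""

    def __init__(self, difficulty: str) -> None:
        if difficulty not in STEP_BUDGET:
            raise ValueError(f"unknown difficulty: {difficulty!r}")
        self.limit = STEP_BUDGET[difficulty]
        self.used = 0

    def consume(self) -> bool:
        """Record one step; return True while further steps are still allowed."""
        self.used += 1
        return self.used < self.limit


budget = StepBudget("easy")
still_allowed = [budget.consume() for _ in range(20)]
# The first 19 calls leave room for more; the 20th exhausts the budget.
```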
---
## Running Locally
```bash
git clone https://github.com/CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env
cd cross-session-continuity-env
pip install -r requirements.txt
# Gradio demo
python app.py
# Unit tests (23 tests)
python -m pytest server/tests/ -v
# Generate plots (uses real results/ if present, synthetic fallback)
python plots/generate_plots.py
```
## Docker
```bash
docker build -t cross-session-env .
docker run -p 7860:7860 cross-session-env
# Open: http://localhost:7860
```
---
## Repository Structure
```
├── openenv.yaml               # OpenEnv manifest
├── app.py                     # Gradio Space entry point
├── Dockerfile                 # Container image
├── requirements.txt           # Dependencies
├── server/
│   ├── env.py                 # CrossSessionContinuityEnv (MCPEnvironment)
│   ├── task_generator.py      # Task bank + name randomization
│   ├── session_manager.py     # S1→S2 filesystem wipe
│   ├── sandbox.py             # subprocess + ulimits execution
│   ├── handoff_validator.py   # 6-section structure enforcement
│   ├── mcp_tools.py           # OpenEnv tool registry
│   └── rewards/
│       ├── rubric.py          # ContinuityRubric (composable)
│       └── auxiliary.py       # S1 shaped rewards + decay
├── client/agent.py            # Agent loop (no server imports)
├── training/
│   ├── train_grpo.ipynb       # Colab training notebook (15 cells)
│   └── grpo_config.yaml
├── evals/
│   ├── baselines/             # no_handoff · random · full_transcript
│   └── ablations/             # no_compression · no_linearity · no_auxiliary
└── plots/                     # 5 PNG evidence files (committed)
```