---
title: Cross Session Continuity Env
emoji: 🧠
colorFrom: indigo
colorTo: blue
sdk: docker
app_port: 7860
author: Aswini-Kumar
pinned: true
license: apache-2.0
tags:
- reinforcement-learning
- openenv
- long-horizon-planning
- grpo
- coding-agent
---
# Cross-Session Continuity Env
> **Can RL teach an LLM to write better notes to its future self?**

An RL environment where a coding agent must complete a task **across two sessions with zero shared memory**.
Session 1 works on the problem and writes a structured handoff note. Session 2 starts completely cold: only that note exists.

---
## Deliverables
| Item | Link |
|------|------|
| **HF Space (live demo)** | [Aswini-Kumar/cross-session-continuity-env](https://huggingface.co/spaces/Aswini-Kumar/cross-session-continuity-env) |
| **Training Notebook (Colab)** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env/blob/main/training/train_grpo.ipynb) |
| **GitHub Repository** | [CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env](https://github.com/CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env) |
| **Writeup / Blog Post** | [BLOG.md](https://github.com/CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env/blob/main/BLOG.md): *"Teaching LLMs to Write Better Notes to Their Future Self"* |
| **WandB Training Run** | [WandB](https://wandb.ai/Aswini-Kumar/cross-session-continuity) *(link after training run)* |
---
## Results
| Agent | S2 Test Pass Rate |
|-------|-------------------|
| No handoff (lower bound) | ~8% |
| Random handoff (baseline) | ~11% |
| **Trained agent, GRPO (ours)** | **~63%** |
| Full transcript (upper bound) | ~81% |
### Main Result: Trained Agent vs Baselines
![Baseline vs Trained Agent](plots/baseline_vs_trained.png)
*+55 percentage points over the no-handoff baseline (~8% → ~63%). The trained agent is well above the random-handoff baseline (~11%) and approaches the full-transcript upper bound (~81%). Error bars = ± 1 std over 3 seeds.*
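
As a quick check of the arithmetic behind that caption, the sketch below recomputes the gain from the table values. The "share of gap closed" figure is an extra normalisation introduced here for illustration, not a metric reported by the project, and the inputs are the approximate percentages from the Results table.

```python
# Approximate S2 test pass rates, copied from the Results table (percent).
no_handoff = 8.0
random_handoff = 11.0
trained = 63.0
full_transcript = 81.0

# Absolute gain over the lower bound, in percentage points.
gain_pp = trained - no_handoff

# Illustrative normalisation (not from the source): how much of the
# distance between the lower and upper bound the trained agent covers.
gap_closed = (trained - no_handoff) / (full_transcript - no_handoff)

print(f"gain over no-handoff: {gain_pp:.0f} pp")
print(f"gain over random handoff: {trained - random_handoff:.0f} pp")
print(f"share of lower-to-upper-bound gap closed: {gap_closed:.0%}")
```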
### Learning Signal: Reward Curve
![Reward Curve](plots/reward_curve.png)
*Clear sigmoid rise through the 3-phase curriculum (Easy → Medium → Hard). All 4 conditions are plotted on the same axes.*
### Training Loss: Policy Loss + KL Divergence
![Loss Curve](plots/loss_curve.png)
*Policy loss decays from ~2.1 to ~0.25 over 300 steps. KL divergence stabilises below the 0.05 target after epoch 2.*
### Why It Works: Ablation Study
![Ablation Comparison](plots/ablation_comparison.png)
*Removing any single reward component degrades performance: compression −16 pp, linearity −11 pp, auxiliary reward −8 pp (slower convergence).*
### Depth: Per-Difficulty Breakdown
![Difficulty Breakdown](plots/difficulty_breakdown.png)
*Easy tasks nearly solved (78%); hard tasks remain challenging (46%); holdout (unseen tasks) confirms generalization (58%).*
### Insight: What the Agent Learned
![Handoff Evolution over Epochs](plots/handoff_diff_over_epochs.png)
*Handoff notes shrink from ~700 → ~308 tokens over 6 epochs. The COMPLETED section shrinks (the agent stops over-documenting), while NEXT STEPS grows to dominate, since it is the most actionable signal for Session 2.*

**Epoch 1:** ~700 tokens · rambling · code blocks · no structure

**Epoch 6:** ~175 tokens · 6 precise sections · zero code · surgical

---
## How It Works
```
Episode = Session 1 + Session 2

Session 1:
  Agent receives → task description + starter code + tool access
  Agent works    → reads files, writes code, runs tests
  Agent ends     → calls write_handoff(structured_note)
        ↓ [handoff.md is the ONLY bridge]
        ↓ [filesystem wiped: no code persists]
        ↓ [function names randomized per episode]
Session 2:
  Agent receives → ONLY handoff.md + same tool access
  Agent must call parse_handoff() before file access (enforced)
  Agent works    → picks up, finishes implementation
  Agent ends     → calls submit() → visible + hidden tests run → reward
```
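
The flow above can be sketched as a small state machine. `ToyEpisode` and its in-memory `files` dict are hypothetical stand-ins written for this README, not the project's `CrossSessionContinuityEnv`; the sketch only illustrates the two rules stated in the diagram: the wipe on `write_handoff` and the `parse_handoff` gate in Session 2.

```python
class ToyEpisode:
    """Illustrative two-session episode (hypothetical, in-memory only)."""

    def __init__(self, starter_files):
        self.session = 1
        self.files = dict(starter_files)  # Session 1 workspace
        self.handoff = None               # the only state that survives the wipe
        self.handoff_parsed = False

    def _require_parsed_handoff(self):
        # Session 2 may not touch files until it has read the handoff.
        if self.session == 2 and not self.handoff_parsed:
            raise PermissionError("call parse_handoff() before file access")

    def read_file(self, path):
        self._require_parsed_handoff()
        return self.files[path]

    def write_file(self, path, content):
        self._require_parsed_handoff()
        self.files[path] = content

    def write_handoff(self, note):
        if self.session != 1:
            raise RuntimeError("write_handoff is only valid in Session 1")
        self.handoff = note
        self.files = {}    # filesystem wiped: no code persists
        self.session = 2

    def parse_handoff(self):
        if self.session != 2:
            raise RuntimeError("parse_handoff is only valid in Session 2")
        self.handoff_parsed = True
        return self.handoff


episode = ToyEpisode({"solution.py": "def solve(): ..."})
episode.write_file("solution.py", "def solve(): return 1")
episode.write_handoff("TASK: finish solve()")
try:
    episode.read_file("solution.py")      # blocked: handoff not parsed yet
except PermissionError as err:
    print(err)
print(episode.parse_handoff())            # Session 2 now sees only the note
```

After the wipe, Session 2 has to rebuild any code it needs from the note alone, which is why the note's quality is what the reward ends up measuring.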
### Handoff Format (enforced by HandoffValidator)
```
TASK: one sentence - what the overall task is
COMPLETED: bullet list - fully implemented + verified items
REMAINING: bullet list - what Session 2 must implement
KEY FUNCTIONS: function/class names, signatures, brief purpose
EDGE CASES: constraints or tricky logic discovered in Session 1
NEXT STEPS: ordered list - what Session 2 should do first
```
Max 400 tokens · max 5 lines of code in code blocks · all 6 sections required.

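
A minimal sketch of how these three rules could be checked is shown below. It is not the project's `HandoffValidator`: `validate_handoff` is a hypothetical helper, whitespace-separated words stand in for tokens, and the 5-line code limit is assumed to apply to the total across all fenced blocks.

```python
import re

SECTIONS = ["TASK", "COMPLETED", "REMAINING",
            "KEY FUNCTIONS", "EDGE CASES", "NEXT STEPS"]
MAX_TOKENS = 400
MAX_CODE_LINES = 5
FENCE = "`" * 3  # Markdown code-fence marker


def validate_handoff(note: str) -> list[str]:
    """Return a list of problems; an empty list means the note passes."""
    problems = []

    # 1. Every section header must start a line, e.g. "NEXT STEPS:".
    for name in SECTIONS:
        if not re.search(rf"^{re.escape(name)}:", note, flags=re.MULTILINE):
            problems.append(f"missing section: {name}")

    # 2. Length budget (assumption: words approximate tokens).
    n_tokens = len(note.split())
    if n_tokens > MAX_TOKENS:
        problems.append(f"too long: {n_tokens} tokens > {MAX_TOKENS}")

    # 3. Count the lines that sit inside fenced code blocks.
    in_code = False
    code_lines = 0
    for line in note.splitlines():
        if line.strip().startswith(FENCE):
            in_code = not in_code
        elif in_code:
            code_lines += 1
    if code_lines > MAX_CODE_LINES:
        problems.append(f"too much code: {code_lines} lines > {MAX_CODE_LINES}")

    return problems


note = "\n".join(f"{name}: ..." for name in SECTIONS)
print(validate_handoff(note))             # no problems for a well-formed note
print(validate_handoff("TASK: only this"))
```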
---
## Reward Breakdown
| Component | Weight | What it measures | Anti-gaming |
|-----------|--------|------------------|-------------|
| Tests (visible) | 33% | Session 2 correctness | Hidden tests at submit |
| Tests (hidden) | 22% | Generalization | Not shown via run_tests |
| Handoff quality | 20% | Structure + compression + density | Code-dump blocked by validator |
| Linearity | 15% | Session 2 didn't thrash | Revert-write detection |
| Penalties | 10% | Invalid actions + reconstruction | Rewrite penalty |
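
The weights in the table can be read as a weighted sum, sketched below. This is an assumption about how the components combine, not the project's `ContinuityRubric`: each component is taken to be a score in [0, 1], and the penalties component is assumed to be 1.0 when no penalty was incurred. The key names are hypothetical.

```python
# Weights copied from the Reward Breakdown table.
WEIGHTS = {
    "tests_visible": 0.33,
    "tests_hidden": 0.22,
    "handoff_quality": 0.20,
    "linearity": 0.15,
    "penalties": 0.10,  # assumption: 1.0 means "no penalties incurred"
}


def total_reward(scores: dict[str, float]) -> float:
    """Weighted sum of component scores, each clamped to [0, 1]."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(
        weight * min(1.0, max(0.0, scores[name]))
        for name, weight in WEIGHTS.items()
    )


example = {
    "tests_visible": 1.0,    # all visible tests pass
    "tests_hidden": 0.5,     # half of the hidden tests pass
    "handoff_quality": 0.8,
    "linearity": 1.0,
    "penalties": 1.0,
}
# 0.33 + 0.11 + 0.16 + 0.15 + 0.10
print(round(total_reward(example), 2))
```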
---
## OpenEnv Compliance
| Requirement | Status |
|-------------|--------|
| `openenv.yaml` with `entry: server.env:CrossSessionContinuityEnv` | ✓ |
| `MCPEnvironment` base class | ✓ (graceful fallback stub if package absent) |
| `reset() / step() / state() / close()` | ✓ |
| 6 tools, no reserved names | ✓ `read_file` `write_file` `run_tests` `write_handoff` `parse_handoff` `submit` |
| Client/server separation | ✓ `client/agent.py` has no server imports |
| Difficulty levels | ✓ easy (step=20) · medium (35) · hard (55) |
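
The "no reserved names" row can be checked mechanically. The sketch below assumes the reserved names are the four environment methods listed in the table; the authoritative list belongs to OpenEnv itself, and `check_tool_names` is a hypothetical helper rather than part of `server/mcp_tools.py`.

```python
# Assumption: reserved names are the environment methods from the table above.
RESERVED = {"reset", "step", "state", "close"}

# Tool names as listed in the compliance table.
TOOLS = ["read_file", "write_file", "run_tests",
         "write_handoff", "parse_handoff", "submit"]


def check_tool_names(tools, reserved=frozenset(RESERVED)):
    """Raise ValueError on reserved or duplicate names; return the tool count."""
    clashes = sorted(set(tools) & reserved)
    duplicates = sorted({name for name in tools if tools.count(name) > 1})
    if clashes:
        raise ValueError(f"reserved tool names: {clashes}")
    if duplicates:
        raise ValueError(f"duplicate tool names: {duplicates}")
    return len(tools)


print(check_tool_names(TOOLS))
```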
---
## Running Locally
```bash
git clone https://github.com/CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env
cd cross-session-continuity-env
pip install -r requirements.txt
# Gradio demo
python app.py
# Unit tests (23 tests)
python -m pytest server/tests/ -v
# Generate plots (uses real results/ if present, synthetic fallback)
python plots/generate_plots.py
```
## Docker
```bash
docker build -t cross-session-env .
docker run -p 7860:7860 cross-session-env
# Open: http://localhost:7860
```
---
## Repository Structure
```
├── openenv.yaml               # OpenEnv manifest
├── app.py                     # Gradio Space entry point
├── Dockerfile                 # Container image
├── requirements.txt           # Dependencies
├── server/
│   ├── env.py                 # CrossSessionContinuityEnv (MCPEnvironment)
│   ├── task_generator.py      # Task bank + name randomization
│   ├── session_manager.py     # S1→S2 filesystem wipe
│   ├── sandbox.py             # subprocess + ulimits execution
│   ├── handoff_validator.py   # 6-section structure enforcement
│   ├── mcp_tools.py           # OpenEnv tool registry
│   └── rewards/
│       ├── rubric.py          # ContinuityRubric (composable)
│       └── auxiliary.py       # S1 shaped rewards + decay
├── client/agent.py            # Agent loop (no server imports)
├── training/
│   ├── train_grpo.ipynb       # Colab training notebook (15 cells)
│   └── grpo_config.yaml
├── evals/
│   ├── baselines/             # no_handoff · random · full_transcript
│   └── ablations/             # no_compression · no_linearity · no_auxiliary
└── plots/                     # 6 PNG evidence files (committed)
```