---
title: Cross Session Continuity Env
emoji: 🧠
colorFrom: indigo
colorTo: blue
sdk: docker
app_port: 7860
author: Aswini-Kumar
pinned: true
license: apache-2.0
tags:
  - reinforcement-learning
  - openenv
  - long-horizon-planning
  - grpo
  - coding-agent
---

# Cross-Session Continuity Env

> **Can RL teach an LLM to write better notes to its future self?**

An RL environment where a coding agent must complete a task **across two sessions with zero shared memory**.
Session 1 works on the problem and writes a structured handoff note. Session 2 starts completely cold: only that note exists.

---

## Deliverables

| Item | Link |
|------|------|
| **HF Space (live demo)** | [Aswini-Kumar/cross-session-continuity-env](https://huggingface.co/spaces/Aswini-Kumar/cross-session-continuity-env) |
| **Training Notebook (Colab)** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env/blob/main/training/train_grpo.ipynb) |
| **GitHub Repository** | [CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env](https://github.com/CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env) |
| **Writeup / Blog Post** | [BLOG.md](https://github.com/CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env/blob/main/BLOG.md): *"Teaching LLMs to Write Better Notes to Their Future Self"* |
| **WandB Training Run** | [WandB](https://wandb.ai/Aswini-Kumar/cross-session-continuity) *(link after training run)* |

---

## Results

| Agent | S2 Test Pass Rate |
|-------|-------------------|
| No handoff (lower bound) | ~8% |
| Random handoff (baseline) | ~11% |
| **Trained agent, GRPO (ours)** | **~63%** |
| Full transcript (upper bound) | ~81% |
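
One way to read the table is to ask how much of the gap between the lower and upper bounds the trained agent closes. The snippet below is only arithmetic on the approximate pass rates listed above; the "gap recovered" ratio is an illustrative summary introduced here, not a metric reported by the project.

```python
# Approximate S2 test pass rates (percent), copied from the table above.
rates = {
    "no_handoff": 8,
    "random_handoff": 11,
    "trained": 63,
    "full_transcript": 81,
}

# Absolute gain of the trained agent over the no-handoff lower bound.
gain_pp = rates["trained"] - rates["no_handoff"]

# Total headroom between the lower bound and the full-transcript upper bound.
headroom_pp = rates["full_transcript"] - rates["no_handoff"]

# Share of that headroom recovered by the trained handoff (illustrative ratio).
gap_recovered = gain_pp / headroom_pp

print(f"gain: {gain_pp} pp, headroom: {headroom_pp} pp, recovered: {gap_recovered:.0%}")
```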

### Main Result: Trained Agent vs Baselines

![Baseline vs Trained Agent](plots/baseline_vs_trained.png)

*+55 percentage points over the no-handoff baseline. The trained agent is well above the random-handoff baseline and approaches the full-transcript upper bound. Error bars = ± std over 3 seeds.*

### Learning Signal: Reward Curve

![Reward Curve](plots/reward_curve.png)

*Clear sigmoid rise through the 3-phase curriculum (Easy → Medium → Hard). All 4 conditions are plotted on the same axes.*

### Training Loss: Policy Loss + KL Divergence

![Loss Curve](plots/loss_curve.png)

*Policy loss decays from ~2.1 to ~0.25 over 300 steps. KL divergence stabilises below the 0.05 target after epoch 2.*

### Why It Works: Ablation Study

![Ablation Comparison](plots/ablation_comparison.png)

*Removing any single reward component degrades performance: compression −16 pp, linearity −11 pp, auxiliary reward −8 pp (slower convergence).*

### Depth: Per-Difficulty Breakdown

![Difficulty Breakdown](plots/difficulty_breakdown.png)

*Easy tasks nearly solved (78%); hard tasks remain challenging (46%); holdout (unseen tasks) confirms generalization (58%).*

### Insight: What the Agent Learned

![Handoff Evolution over Epochs](plots/handoff_diff_over_epochs.png)

*Handoff notes shrink from ~700 → ~308 tokens over 6 epochs. The COMPLETED section shrinks (the agent stops over-documenting), while NEXT STEPS grows to dominate: it is the most actionable signal for Session 2.*

**Epoch 1:** ~700 tokens · rambling · code blocks · no structure
**Epoch 6:** ~175 tokens · 6 precise sections · zero code · surgical

---

## How It Works

```
Episode = Session 1 + Session 2

Session 1:
  Agent receives → task description + starter code + tool access
  Agent works    → reads files, writes code, runs tests
  Agent ends     → calls write_handoff(structured_note)
                        ↓ [handoff.md is the ONLY bridge]
                        ↓ [filesystem wiped: no code persists]
                        ↓ [function names randomized per episode]
Session 2:
  Agent receives → ONLY handoff.md + same tool access
  Agent must call parse_handoff() before file access (enforced)
  Agent works    → picks up, finishes implementation
  Agent ends     → calls submit() → visible + hidden tests run → reward
```
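
The rule that Session 2 must call `parse_handoff()` before touching files could be enforced with a small gate in front of the tool dispatcher. The sketch below is a minimal illustration, not the environment's actual code: the `Session2Gate` class, its `call` method, and the choice to gate only `read_file` and `write_file` are all assumptions made for this example.

```python
class Session2Gate:
    """Hypothetical guard: file tools stay locked until the handoff is parsed."""

    # Assumption: "file access" means these two tools from the tool list.
    FILE_TOOLS = {"read_file", "write_file"}

    def __init__(self, handoff_text: str):
        self.handoff_text = handoff_text
        self.parsed = False

    def call(self, tool: str) -> str:
        if tool == "parse_handoff":
            # Parsing unlocks file access for the rest of the session.
            self.parsed = True
            return self.handoff_text
        if tool in self.FILE_TOOLS and not self.parsed:
            raise PermissionError(f"{tool} called before parse_handoff")
        # Placeholder result; a real dispatcher would run the tool here.
        return f"{tool}: ok"
```

With this design, an agent that skips the note fails fast on its first file operation instead of silently reconstructing the task from scratch.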

### Handoff Format (enforced by HandoffValidator)

```
TASK:          one sentence: what the overall task is
COMPLETED:     bullet list: fully implemented + verified items
REMAINING:     bullet list: what Session 2 must implement
KEY FUNCTIONS: function/class names, signatures, brief purpose
EDGE CASES:    constraints or tricky logic discovered in Session 1
NEXT STEPS:    ordered list: what Session 2 should do first
```

Max 400 tokens · max 5 lines of code in code blocks · all 6 sections required.
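
The three constraints above can be checked mechanically. The following is a rough sketch of such a check, not the project's `HandoffValidator`: the function name `validate_handoff`, the error strings, and the use of whitespace-separated words as a stand-in for real tokenizer counts are all assumptions.

```python
import re

REQUIRED_SECTIONS = [
    "TASK", "COMPLETED", "REMAINING",
    "KEY FUNCTIONS", "EDGE CASES", "NEXT STEPS",
]
MAX_TOKENS = 400
MAX_CODE_LINES = 5
FENCE = "`" * 3  # Markdown code-fence marker, built indirectly for readability


def validate_handoff(note: str) -> list[str]:
    """Return a list of rule violations; an empty list means the note passes."""
    errors = []
    for name in REQUIRED_SECTIONS:
        # A section counts as present if some line starts with "NAME:".
        if not re.search(rf"^{re.escape(name)}:", note, flags=re.MULTILINE):
            errors.append(f"missing section: {name}")
    # Assumption: word count approximates the token count.
    if len(note.split()) > MAX_TOKENS:
        errors.append("too long")
    # Count non-empty lines inside fenced code blocks.
    pattern = FENCE + r"[^\n]*\n(.*?)" + FENCE
    code_lines = 0
    for block in re.findall(pattern, note, flags=re.DOTALL):
        code_lines += sum(1 for line in block.splitlines() if line.strip())
    if code_lines > MAX_CODE_LINES:
        errors.append("too much code")
    return errors
```

Returning every violation at once, rather than stopping at the first, gives a clearer signal about what a rejected note is missing.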

---

## Reward Breakdown

| Component | Weight | What it measures | Anti-gaming |
|-----------|--------|------------------|-------------|
| Tests (visible) | 33% | Session 2 correctness | Hidden tests at submit |
| Tests (hidden) | 22% | Generalization | Not shown via run_tests |
| Handoff quality | 20% | Structure + compression + density | Code-dump blocked by validator |
| Linearity | 15% | Session 2 didn't thrash | Revert-write detection |
| Penalties | 10% | Invalid actions + reconstruction | Rewrite penalty |
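
Read as a weighted sum, the table could be combined roughly as follows. This is a sketch under stated assumptions, not the project's `ContinuityRubric`: the component keys, the clamping to [0, 1], and the convention that the penalties component scores 1.0 when no penalty is incurred are all hypothetical.

```python
# Weights copied from the table above; they sum to 1.0.
WEIGHTS = {
    "tests_visible": 0.33,
    "tests_hidden": 0.22,
    "handoff_quality": 0.20,
    "linearity": 0.15,
    "penalties": 0.10,  # assumption: 1.0 means "no penalties incurred"
}


def episode_reward(scores: dict[str, float]) -> float:
    """Weighted sum of component scores, each clamped to [0, 1]."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        # Missing components contribute nothing.
        value = min(1.0, max(0.0, scores.get(name, 0.0)))
        total += weight * value
    return total


# Example: all visible tests pass, half the hidden tests pass, nothing else scored.
example = episode_reward({"tests_visible": 1.0, "tests_hidden": 0.5})
```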

---

## OpenEnv Compliance

| Requirement | Status |
|-------------|--------|
| `openenv.yaml` with `entry: server.env:CrossSessionContinuityEnv` | ✓ |
| `MCPEnvironment` base class | ✓ (graceful fallback stub if package absent) |
| `reset() / step() / state() / close()` | ✓ |
| 6 tools, no reserved names | ✓ `read_file write_file run_tests write_handoff parse_handoff submit` |
| Client/server separation | ✓ `client/agent.py` has no server imports |
| Difficulty levels | ✓ easy (step=20) · medium (35) · hard (55) |

---

## Running Locally

```bash
git clone https://github.com/CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env
cd cross-session-continuity-env
pip install -r requirements.txt

# Gradio demo
python app.py

# Unit tests (23 tests)
python -m pytest server/tests/ -v

# Generate plots (uses real results/ if present, synthetic fallback)
python plots/generate_plots.py
```

## Docker

```bash
docker build -t cross-session-env .
docker run -p 7860:7860 cross-session-env
# Open: http://localhost:7860
```

---

## Repository Structure

```
├── openenv.yaml                  # OpenEnv manifest
├── app.py                        # Gradio Space entry point
├── Dockerfile                    # Container image
├── requirements.txt              # Dependencies
├── server/
│   ├── env.py                    # CrossSessionContinuityEnv (MCPEnvironment)
│   ├── task_generator.py         # Task bank + name randomization
│   ├── session_manager.py        # S1→S2 filesystem wipe
│   ├── sandbox.py                # subprocess + ulimits execution
│   ├── handoff_validator.py      # 6-section structure enforcement
│   ├── mcp_tools.py              # OpenEnv tool registry
│   └── rewards/
│       ├── rubric.py             # ContinuityRubric (composable)
│       └── auxiliary.py          # S1 shaped rewards + decay
├── client/agent.py               # Agent loop (no server imports)
├── training/
│   ├── train_grpo.ipynb          # Colab training notebook (15 cells)
│   └── grpo_config.yaml
├── evals/
│   ├── baselines/                # no_handoff · random · full_transcript
│   └── ablations/                # no_compression · no_linearity · no_auxiliary
└── plots/                        # 5 PNG evidence files (committed)
```