---
title: "Teaching LLMs to Write Better Notes to Their Future Self"
thumbnail: /blog/assets/cross-session-continuity/baseline_vs_trained.png
authors:
- user: Aswini-Kumar
---
# Teaching LLMs to Write Better Notes to Their Future Self
*Can reinforcement learning teach a coding agent to communicate better across sessions with zero shared memory?*
---
## The Problem
Every time you start a new chat with an LLM, it forgets everything from the last session.
For short tasks this is fine. For long ones (a multi-hour coding project, a research investigation, a debugging marathon) this is catastrophic. The model re-reads the same files, re-discovers the same bugs, and wastes your time.
Humans solve this with notes. Good ones. A developer leaving for the night writes: *"fixed the import error in utils.py, still need to handle the empty-list edge case in merge_intervals, run the tests first when you're back."*
Can we train an LLM to do the same?
## The Environment
**Cross-Session Continuity Env** is an RL environment built on OpenEnv where a coding agent must complete a task **across two separate sessions with zero shared memory**.
```
Session 1                                   Session 2
──────────────────────────────────────────────────────────────────────────────
Agent receives task + starter code          Agent receives ONLY handoff note
Agent works: read → write → test      ─>    Agent calls parse_handoff()
Agent ends: writes handoff note             Agent completes task → submit
               ↓
      [filesystem wiped]
      [function names randomized]
      [no code persists]
```
The agent has 6 tools: `read_file`, `write_file`, `run_tests`, `write_handoff`, `parse_handoff`, `submit`.
The handoff note has a strict structure (enforced by the validator):
```
TASK: what the overall task is
COMPLETED: what was implemented + verified
REMAINING: what Session 2 must implement
KEY FUNCTIONS: function names, signatures
EDGE CASES: constraints or tricky logic
NEXT STEPS: ordered action list for Session 2
```
**Max 400 tokens. Max 5 lines of code. All 6 sections required.**
If the note doesn't meet these constraints, the validator rejects it (no penalty; retry is allowed). This forces the agent to develop *information-dense, structured communication* rather than just copy-pasting code.
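To make the three constraints concrete, here is a minimal sketch of what such a validator could look like. It is not the environment's actual implementation: the whitespace token count, the heuristic for what counts as a "line of code", and the `validate_handoff` name are all assumptions made for illustration.

```python
import re

REQUIRED_SECTIONS = [
    "TASK", "COMPLETED", "REMAINING",
    "KEY FUNCTIONS", "EDGE CASES", "NEXT STEPS",
]
MAX_TOKENS = 400
MAX_CODE_LINES = 5
FENCE = "`" * 3  # Markdown code-fence marker

# Assumption: outside a fence, a line counts as code if it starts like a
# Python definition or import. The real validator may use a different rule.
CODE_LINE = re.compile(r"^\s*(def |class |import |from \S+ import )")


def validate_handoff(note: str) -> list[str]:
    """Return a list of problems; an empty list means the note is accepted."""
    problems = []

    # 1. All six section headers must appear at the start of a line.
    for name in REQUIRED_SECTIONS:
        if not re.search(rf"^{re.escape(name)}:", note, flags=re.MULTILINE):
            problems.append(f"missing section: {name}")

    # 2. Length limit. Assumption: whitespace-separated words stand in
    #    for tokens; a real tokenizer would give different counts.
    n_tokens = len(note.split())
    if n_tokens > MAX_TOKENS:
        problems.append(f"too long: {n_tokens} tokens > {MAX_TOKENS}")

    # 3. Code limit: count fenced lines plus lines that look like code.
    in_fence = False
    code_lines = 0
    for line in note.splitlines():
        if line.strip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence or CODE_LINE.match(line):
            code_lines += 1
    if code_lines > MAX_CODE_LINES:
        problems.append(f"too many code lines: {code_lines} > {MAX_CODE_LINES}")

    return problems
```

Returning a list of problems rather than a boolean fits the retry-without-penalty design: the agent can be told what to fix before it tries again.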
## Reward Design
The reward combines five weighted components, each paired with an anti-gaming measure:
| Component | Weight | Anti-gaming |
|-----------|--------|-------------|
| Tests (visible, Session 2) | 33% | Hidden tests at submit time |
| Tests (hidden) | 22% | Not accessible via `run_tests` |
| Handoff quality | 20% | Code-dump blocked by validator |
| Linearity | 15% | Thrash/rewrite detection |
| Penalties | 10% | Invalid actions, reconstruction |
Hidden test cases, which are never revealed to the agent, account for 40% of the test score (22 of the 55 points allotted to tests). This ensures the agent can't memorize specific test patterns.
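The weights in the table can be sketched as a simple weighted sum. The aggregation rule is an assumption: this post gives only the weights, so the sketch assumes every component is scored in [0, 1], with the penalties component at 1.0 when no penalty was incurred. The dictionary keys and function names are hypothetical.

```python
# Weights taken from the reward table above.
WEIGHTS = {
    "visible_tests": 0.33,
    "hidden_tests": 0.22,
    "handoff_quality": 0.20,
    "linearity": 0.15,
    "penalties": 0.10,  # assumption: 1.0 means "no penalties incurred"
}


def total_reward(scores: dict[str, float]) -> float:
    """Weighted sum of component scores, each clamped to [0, 1]."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(
        WEIGHTS[name] * min(1.0, max(0.0, scores[name]))
        for name in WEIGHTS
    )


def hidden_share_of_test_score() -> float:
    """Fraction of the test-related weight carried by hidden tests."""
    tests = WEIGHTS["visible_tests"] + WEIGHTS["hidden_tests"]
    return WEIGHTS["hidden_tests"] / tests  # 0.22 / 0.55


# Example: every visible test passes, half the hidden tests pass.
example = total_reward({
    "visible_tests": 1.0,
    "hidden_tests": 0.5,
    "handoff_quality": 0.8,
    "linearity": 1.0,
    "penalties": 1.0,
})
```

`hidden_share_of_test_score()` reproduces the 40% figure: 0.22 out of a combined test weight of 0.55.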
## Training
We train **Qwen2.5-Coder-7B-Instruct** using **GRPO** (Group Relative Policy Optimization) via Hugging Face TRL and Unsloth, on a Colab T4 GPU.
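For readers new to GRPO, its core idea is that each sampled completion is scored relative to the other completions drawn for the same prompt, so no separate value network is needed. The sketch below shows that normalization in its textbook form. It is not TRL's code, and details such as the epsilon and the choice of standard deviation estimator are assumptions here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize each reward against its own sampling group.

    rewards: total rewards of several completions for the SAME prompt.
    Returns one advantage per completion: (r - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical episode rewards sampled for one task.
advantages = group_relative_advantages([0.20, 0.35, 0.35, 0.70])
```

Completions that beat their group average get a positive advantage and are reinforced; when every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.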
The training uses a 3-phase curriculum:
- **Epochs 1–2**: Easy tasks (step limit 20, 3 visible tests)
- **Epochs 3–4**: Medium tasks (step limit 35, 5 visible tests)
- **Epochs 5–6**: Hard tasks (step limit 55, 8 visible tests)
This bootstraps the agent on simpler tasks before exposing it to harder generalization challenges.
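The schedule above can be written down as a small lookup table. The numbers come from the list above, but the `Phase` dataclass and the `phase_for_epoch` helper are hypothetical names for illustration, not part of the environment's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    step_limit: int      # maximum agent steps per episode
    visible_tests: int   # tests the agent can run via run_tests


# (first_epoch, last_epoch) -> phase settings, both bounds inclusive.
CURRICULUM = {
    (1, 2): Phase("easy", step_limit=20, visible_tests=3),
    (3, 4): Phase("medium", step_limit=35, visible_tests=5),
    (5, 6): Phase("hard", step_limit=55, visible_tests=8),
}


def phase_for_epoch(epoch: int) -> Phase:
    """Return the curriculum phase that covers a 1-indexed epoch."""
    for (first, last), phase in CURRICULUM.items():
        if first <= epoch <= last:
            return phase
    raise ValueError(f"epoch {epoch} is outside the 6-epoch curriculum")
```

A training loop would call `phase_for_epoch(epoch)` once per epoch and pass the resulting limits to the environment when sampling tasks.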
## Results
The core question: does training actually improve the agent's ability to use its handoff note?
![Baseline vs Trained](plots/baseline_vs_trained.png)
*Session 2 test pass rate across 4 conditions. Error bars = ± std, 3 seeds.*
The trained agent achieves **~63% Session 2 test pass rate** vs **~8% for no handoff** and **~11% for random handoff**. This is a **+55 percentage point improvement** over the lower bound.
The reward curve shows clear learning:
![Reward Curve](plots/reward_curve.png)
*Total reward across training episodes. All 4 conditions on same axes. Band = ±1 std.*
And the training loss descends cleanly:
![Loss Curve](plots/loss_curve.png)
*Policy loss (top) + KL divergence (bottom) across training steps. Curriculum phases shown as shaded regions.*
## What the Agent Actually Learned
The most interesting result is *how the handoff notes changed*.
![Handoff Evolution](plots/handoff_diff_over_epochs.png)
*Token count per handoff section across 6 training epochs.*
- **Epoch 1**: ~700 tokens. Rambling. Code blocks everywhere. Repeats the task description verbatim. The NEXT STEPS section is almost empty.
- **Epoch 6**: ~175 tokens. Surgical. Zero code. COMPLETED shrinks (less over-documentation). NEXT STEPS grows to dominate, since it carries the most actionable information for Session 2.
The agent learned that Session 2 doesn't need to know *what was done*; it needs to know *exactly what to do next*. That's a genuine insight about communication.
## Ablation Study
![Ablation Study](plots/ablation_comparison.png)
*Removing any reward component degrades performance. All configs on same axes.*
- **No compression reward**: -16 pp. Agent produces bloated notes. Session 2 spends steps parsing instead of coding.
- **No linearity reward**: -11 pp. Session 2 thrashes, rewriting code instead of building on it.
- **No auxiliary reward**: -8 pp. Slower convergence; the shaped S1 rewards help bootstrap early.
## Why This Matters
The capability gap we're targeting, **structured cross-session state transfer**, is genuinely unsolved. Every production deployment of a coding agent hits this wall when tasks span multiple conversations.
The environment is designed to force the agent to develop a real skill, not to game a metric:
- Function names are randomized per episode (can't memorize by name)
- Hidden tests at submit time (can't overfit to visible tests)
- Validator blocks code dumps (must communicate structurally)
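As an illustration of the first safeguard, per-episode name randomization could be done by rewriting whole-word occurrences of each task function name. This is a sketch under assumptions: the post does not describe the environment's renaming mechanism, and the `fn_` prefix and `randomize_function_names` helper are invented for this example.

```python
import random
import re
import string


def randomize_function_names(
    source: str, names: list[str], rng: random.Random
) -> tuple[str, dict[str, str]]:
    """Replace each listed function name with a random alias.

    Returns the rewritten source and the old -> new mapping, so matching
    test files can be rewritten with the same aliases.
    """
    if not names:
        return source, {}

    mapping = {}
    for name in names:
        suffix = "".join(rng.choice(string.ascii_lowercase) for _ in range(6))
        mapping[name] = f"fn_{suffix}"  # hypothetical alias format

    # \b keeps the match to whole identifiers, so "merge" would not be
    # rewritten inside "merge_intervals".
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    renamed = pattern.sub(lambda m: mapping[m.group(1)], source)
    return renamed, mapping


starter = "def merge_intervals(intervals):\n    return sorted(intervals)\n"
renamed, mapping = randomize_function_names(
    starter, ["merge_intervals"], random.Random(0)
)
```

Because a name written in Session 1 may not exist in Session 2, a note that says only "finish `merge_intervals`" is less useful than one that describes the function's behavior and signature.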
An agent that scores well here has actually learned to write better notes. That's the bet.
## Links
- **HF Space (live demo)**: https://huggingface.co/spaces/Aswini-Kumar/cross-session-continuity-env
- **Training Notebook**: https://colab.research.google.com/github/CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env/blob/main/training/train_grpo.ipynb
- **GitHub**: https://github.com/CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env