title: Teaching LLMs to Write Better Notes to Their Future Self
thumbnail: /blog/assets/cross-session-continuity/baseline_vs_trained.png
authors:
- user: Aswini-Kumar
Teaching LLMs to Write Better Notes to Their Future Self
Can reinforcement learning teach a coding agent to communicate better across sessions with zero shared memory?
The Problem
Every time you start a new chat with an LLM, it forgets everything from the last session.
For short tasks this is fine. For long ones, such as a multi-hour coding project, a research investigation, or a debugging marathon, this is catastrophic. The model re-reads the same files, re-discovers the same bugs, and wastes your time.
Humans solve this with notes. Good ones. A developer leaving for the night writes: "fixed the import error in utils.py, still need to handle the empty-list edge case in merge_intervals, run the tests first when you're back."
Can we train an LLM to do the same?
The Environment
Cross-Session Continuity Env is an RL environment built on OpenEnv where a coding agent must complete a task across two separate sessions with zero shared memory.
Session 1                               Session 2
──────────────────────────────────────────────────────────────────────────
Agent receives task + starter code      Agent receives ONLY handoff note
Agent works: read → write → test   ──>  Agent calls parse_handoff()
Agent ends: writes handoff note         Agent completes task → submit
              ↓
      [filesystem wiped]
      [function names randomized]
      [no code persists]
The agent has 6 tools: read_file, write_file, run_tests, write_handoff, parse_handoff, submit.
The handoff note has a strict structure (enforced by the validator):
TASK: what the overall task is
COMPLETED: what was implemented + verified
REMAINING: what Session 2 must implement
KEY FUNCTIONS: function names, signatures
EDGE CASES: constraints or tricky logic
NEXT STEPS: ordered action list for Session 2
Max 400 tokens. Max 5 lines of code. All 6 sections required.
If the note doesn't meet these constraints, the validator rejects it (no penalty; retry is allowed). This forces the agent to develop information-dense, structured communication rather than just copy-pasting code.
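The three constraints above can be checked mechanically. Here is a minimal sketch of such a check, not the environment's actual validator: the function name `validate_handoff`, the whitespace-based token count, and the prefix heuristic for spotting code lines are all assumptions made for illustration.

```python
REQUIRED_SECTIONS = [
    "TASK", "COMPLETED", "REMAINING",
    "KEY FUNCTIONS", "EDGE CASES", "NEXT STEPS",
]
MAX_TOKENS = 400
MAX_CODE_LINES = 5

# Assumption: a line "looks like code" if it starts with one of these prefixes.
CODE_PREFIXES = ("def ", "class ", "import ", "return ", "assert ")


def validate_handoff(note: str) -> list[str]:
    """Return a list of problems; an empty list means the note is accepted."""
    problems = []
    lines = note.splitlines()

    # All 6 section headers must be present, each at the start of a line.
    for section in REQUIRED_SECTIONS:
        if not any(line.strip().startswith(section + ":") for line in lines):
            problems.append(f"missing section: {section}")

    # Assumption: whitespace-separated words stand in for real tokenizer tokens.
    n_tokens = len(note.split())
    if n_tokens > MAX_TOKENS:
        problems.append(f"too many tokens: {n_tokens} > {MAX_TOKENS}")

    # Count code-like lines to block code dumps.
    n_code = sum(1 for line in lines if line.lstrip().startswith(CODE_PREFIXES))
    if n_code > MAX_CODE_LINES:
        problems.append(f"too many code lines: {n_code} > {MAX_CODE_LINES}")

    return problems
```

Returning a list of problems rather than a bare boolean fits the retry-without-penalty design: the agent can be told what to fix before it tries again.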
Reward Design
The reward is a weighted sum of components, each paired with an anti-gaming measure:
| Component | Weight | Anti-gaming |
|---|---|---|
| Tests (visible, Session 2) | 33% | Hidden tests at submit time |
| Tests (hidden) | 22% | Not accessible via run_tests |
| Handoff quality | 20% | Code-dump blocked by validator |
| Linearity | 15% | Thrash/rewrite detection |
| Penalties | 10% | Invalid actions, reconstruction |
40% of the test score comes from hidden test cases that are never revealed to the agent. This ensures the agent can't memorize specific test patterns.
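As a rough illustration of how the weights in the table combine, the sketch below computes a weighted sum and checks the 40% figure. The component keys, the function name `total_reward`, and the convention that every component score lies in [0, 1] (with 1.0 for penalties meaning no penalty was incurred) are assumptions, not the environment's actual code.

```python
# Weights taken from the table above; the dictionary keys are hypothetical names.
WEIGHTS = {
    "visible_tests": 0.33,
    "hidden_tests": 0.22,
    "handoff_quality": 0.20,
    "linearity": 0.15,
    "penalties": 0.10,
}


def total_reward(scores: dict[str, float]) -> float:
    """Weighted sum of component scores, each assumed to lie in [0, 1]."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Share of the test-related weight that comes from hidden tests: 22 / (33 + 22).
hidden_share = WEIGHTS["hidden_tests"] / (
    WEIGHTS["visible_tests"] + WEIGHTS["hidden_tests"]
)
```

Under these assumptions, an agent that passes every visible test but no hidden ones gives up 22 of the 55 points allotted to tests, which is where the 40% figure comes from.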
Training
We train Qwen2.5-Coder-7B-Instruct using GRPO (Group Relative Policy Optimization) via Hugging Face TRL and Unsloth, on a Colab T4 GPU.
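For readers new to GRPO, its core idea is to score each sampled completion relative to the other completions drawn for the same prompt, instead of using a learned value function. The sketch below shows only that normalization step in plain Python. It is an illustration, not TRL's implementation: the function name, the use of the population standard deviation, and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds the total reward of every completion sampled for one prompt.
    `eps` (an assumed value) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

A completion is pushed up only if it beats its siblings, so a group in which every episode earns the same reward contributes no learning signal.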
The training uses a 3-phase curriculum:
- Epochs 1-2: Easy tasks (step limit 20, 3 visible tests)
- Epochs 3-4: Medium tasks (step limit 35, 5 visible tests)
- Epochs 5-6: Hard tasks (step limit 55, 8 visible tests)
This bootstraps the agent on simpler tasks before exposing it to harder generalization challenges.
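The schedule above is small enough to express as a lookup table. This is a sketch under assumptions: the `Phase` dataclass, the `CURRICULUM` tuple, and `phase_for_epoch` are hypothetical names, and only the epoch ranges, step limits, and visible test counts come from the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    epochs: range        # epochs covered by this phase (1-indexed)
    step_limit: int      # maximum environment steps per episode
    visible_tests: int   # number of tests the agent can run


CURRICULUM = (
    Phase("easy", range(1, 3), step_limit=20, visible_tests=3),
    Phase("medium", range(3, 5), step_limit=35, visible_tests=5),
    Phase("hard", range(5, 7), step_limit=55, visible_tests=8),
)


def phase_for_epoch(epoch: int) -> Phase:
    """Look up the curriculum phase that a given epoch belongs to."""
    for phase in CURRICULUM:
        if epoch in phase.epochs:
            return phase
    raise ValueError(f"epoch {epoch} is outside the 6-epoch curriculum")
```

A training loop could call `phase_for_epoch(epoch)` at the start of each epoch to pick the task pool and configure the environment's step limit.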
Results
The core question: does training actually improve the agent's ability to use its handoff note?
Session 2 test pass rate across 4 conditions. Error bars = ± std, 3 seeds.
The trained agent achieves a ~63% Session 2 test pass rate, versus ~8% with no handoff and ~11% with a random handoff. That is a 55 percentage point improvement over the no-handoff lower bound.
The reward curve shows clear learning:
Total reward across training episodes. All 4 conditions on same axes. Band = ±1 std.
And the training loss descends cleanly:
Policy loss (top) + KL divergence (bottom) across training steps. Curriculum phases shown as shaded regions.
What the Agent Actually Learned
The most interesting result is how the handoff notes changed.
Token count per handoff section across 6 training epochs.
- Epoch 1: ~700 tokens. Rambling. Code blocks everywhere. Repeats the task description verbatim. The NEXT STEPS section is almost empty.
- Epoch 6: ~175 tokens. Surgical. Zero code. COMPLETED shrinks (less over-documentation). NEXT STEPS grows to dominate, since it carries the most actionable information for Session 2.
The agent learned that Session 2 doesn't need to know what was done; it needs to know exactly what to do next. That's a genuine insight about communication.
Ablation Study
Removing any reward component degrades performance. All configs on same axes.
- No compression reward: -16 pp. Agent produces bloated notes. Session 2 spends steps parsing instead of coding.
- No linearity reward: -11 pp. Session 2 thrashes: it rewrites code instead of building on it.
- No auxiliary reward: -8 pp. Slower convergence; the shaped S1 rewards help bootstrap early.
Why This Matters
The capability gap we're targeting, structured cross-session state transfer, is genuinely unsolved. Production deployments of coding agents hit this wall whenever a task spans multiple conversations.
The environment is designed to force the agent to develop a real skill, not to game a metric:
- Function names are randomized per episode (can't memorize by name)
- Hidden tests at submit time (can't overfit to visible tests)
- Validator blocks code dumps (must communicate structurally)
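To make the first point concrete, here is one way per-episode name randomization could be done. It is a sketch, not the environment's implementation: the function name, the `fn_` prefix, and the regex-based renaming are assumptions, and this simple approach would also rewrite matching words inside strings and comments.

```python
import random
import re


def randomize_function_names(
    source: str, names: list[str], rng: random.Random
) -> tuple[str, dict[str, str]]:
    """Replace each listed function name with a random identifier.

    Returns the rewritten source and the old-name -> new-name mapping.
    """
    mapping = {name: f"fn_{rng.getrandbits(32):08x}" for name in names}
    if not mapping:
        return source, mapping
    # Match whole identifiers only, so `helper` does not match `helpers`.
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    renamed = pattern.sub(lambda m: mapping[m.group(1)], source)
    return renamed, mapping
```

Seeding `rng` per episode keeps each episode reproducible while still preventing a policy from keying on a fixed name such as `merge_intervals`.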
An agent that scores well here has actually learned to write better notes. That's the bet.
Links
- HF Space (live demo): https://huggingface.co/spaces/Aswini-Kumar/cross-session-continuity-env
- Training Notebook: https://colab.research.google.com/github/CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env/blob/main/training/train_grpo.ipynb
- GitHub: https://github.com/CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env