Aswini-Kumar committed
Commit 37ef801 · verified · 1 Parent(s): 1a9572a

upload: README.md

Files changed (1)
  1. README.md +190 -6
README.md CHANGED
@@ -1,13 +1,197 @@
  ---
  title: Cross Session Continuity Env
- emoji: 🌖
- colorFrom: blue
- colorTo: yellow
  sdk: gradio
- sdk_version: 6.13.0
  app_file: app.py
- pinned: false
  license: apache-2.0
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
  title: Cross Session Continuity Env
+ emoji: 🧠
+ colorFrom: indigo
+ colorTo: blue
  sdk: gradio
+ sdk_version: 4.44.0
  app_file: app.py
+ author: Aswini-Kumar
+ pinned: true
  license: apache-2.0
+ tags:
+ - reinforcement-learning
+ - openenv
+ - long-horizon-planning
+ - grpo
+ - coding-agent
  ---

+ # Cross-Session Continuity Env
+
+ > **Can RL teach an LLM to write better notes to its future self?**
+
+ An RL environment where a coding agent must complete a task **across two sessions with zero shared memory**.
+ Session 1 works on the problem and writes a structured handoff note. Session 2 starts completely cold: only that note exists.
+
+ ---
+
+ ## Deliverables
+
+ | Item | Link |
+ |------|------|
+ | **HF Space (live demo)** | [Aswini-Kumar/cross-session-continuity-env](https://huggingface.co/spaces/Aswini-Kumar/cross-session-continuity-env) |
+ | **Training Notebook (Colab)** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env/blob/main/training/train_grpo.ipynb) |
+ | **GitHub Repository** | [CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env](https://github.com/CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env) |
+ | **Blog / Writeup** | [HF Blog Post](https://huggingface.co/blog/Aswini-Kumar/cross-session-continuity) *(update after posting)* |
+ | **Demo Video** | [YouTube](https://youtube.com) *(update after recording)* |
+ | **WandB Training Run** | [WandB](https://wandb.ai/Aswini-Kumar/cross-session-continuity) *(update after training)* |
+
+ ---
+
+ ## Results
+
+ | Agent | S2 Test Pass Rate |
+ |-------|-------------------|
+ | No handoff (lower bound) | ~8% |
+ | Random handoff (baseline) | ~11% |
+ | **Trained agent, GRPO (ours)** | **~63%** |
+ | Full transcript (upper bound) | ~81% |
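
One way to read the table is to place the trained agent between the two bounds. The sketch below only does that arithmetic on the approximate pass rates listed above; `gap_closed` is a hypothetical helper name, not something from the repository.

```python
def gap_closed(score, lower, upper):
    """Fraction of the lower-to-upper gap that a score covers."""
    if upper <= lower:
        raise ValueError("upper bound must exceed lower bound")
    return (score - lower) / (upper - lower)


# Approximate S2 pass rates from the table, in percent.
NO_HANDOFF = 8
TRAINED = 63
FULL_TRANSCRIPT = 81

gain_pp = TRAINED - NO_HANDOFF  # percentage points over the lower bound
fraction = gap_closed(TRAINED, NO_HANDOFF, FULL_TRANSCRIPT)
print(f"gain: +{gain_pp} pp, gap closed: {fraction:.2f}")
```

With these rounded inputs the trained agent covers roughly three quarters of the distance between the no-handoff and full-transcript conditions.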
+
+ ### Main Result: Trained Agent vs Baselines
+
+ ![Baseline vs Trained Agent](plots/baseline_vs_trained.png)
+
+ *The trained agent gains 55 percentage points over the no-handoff baseline, sits comfortably above the random-handoff baseline, and approaches the full-transcript upper bound. Error bars = ± std over 3 seeds.*
+
+ ### Learning Signal: Reward Curve
+
+ ![Reward Curve](plots/reward_curve.png)
+
+ *Reward rises in a clear sigmoid through the 3-phase curriculum (Easy → Medium → Hard). All 4 conditions are plotted on the same axes; the confidence band shows training stability.*
+
+ ### Why It Works: Ablation Study
+
+ ![Ablation Comparison](plots/ablation_comparison.png)
+
+ *Removing any single reward component degrades performance: compression −16 pp, linearity −11 pp, auxiliary reward −8 pp (slower convergence).*
+
+ ### Depth: Per-Difficulty Breakdown
+
+ ![Difficulty Breakdown](plots/difficulty_breakdown.png)
+
+ *Easy tasks nearly solved (78%); hard tasks remain challenging (46%); holdout (unseen tasks) confirms generalization (58%).*
+
+ ### Insight: What the Agent Learned
+
+ ![Handoff Evolution over Epochs](plots/handoff_diff_over_epochs.png)
+
+ *Handoff notes shrink from ~700 → ~308 tokens over 6 epochs. The COMPLETED section shrinks (the agent stops over-documenting). NEXT STEPS grows to dominate, since it is the most actionable signal for Session 2.*
+
+ **Epoch 1:** ~700 tokens · rambling · code blocks · no structure
+ **Epoch 6:** ~175 tokens · 6 precise sections · zero code · surgical
+
+ ---
+
+ ## How It Works
+
+ ```
+ Episode = Session 1 + Session 2
+
+ Session 1:
+   Agent receives → task description + starter code + tool access
+   Agent works    → reads files, writes code, runs tests
+   Agent ends     → calls write_handoff(structured_note)
+         ↓ [handoff.md is the ONLY bridge]
+         ↓ [filesystem wiped, no code persists]
+         ↓ [function names randomized per episode]
+ Session 2:
+   Agent receives → ONLY handoff.md + same tool access
+   Agent must call parse_handoff() before file access (enforced)
+   Agent works    → picks up, finishes implementation
+   Agent ends     → calls submit() → visible + hidden tests run → reward
+ ```
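
The flow above has two enforcement points that are easy to miss: the filesystem wipe when Session 1 ends, and the rule that Session 2 must call `parse_handoff()` before touching files. The following is a minimal sketch of those two rules only. The class `EpisodeSketch`, its attributes, and the suffix-based name randomization are assumptions made for illustration; this is not the code in `server/env.py` or `server/session_manager.py`.

```python
import random
import string


class EpisodeSketch:
    """Toy model of one two-session episode (hypothetical, not the repo's env)."""

    def __init__(self, seed):
        rng = random.Random(seed)
        # Assumption: names are randomized by appending a random suffix.
        suffix = "".join(rng.choice(string.ascii_lowercase) for _ in range(6))
        self.function_name = "solve_" + suffix
        self.files = {}        # in-memory stand-in for the sandbox filesystem
        self.handoff = None    # the only state that survives Session 1
        self.session = 1
        self.handoff_parsed = False

    def _check_file_access(self):
        # Session 2 may not touch files until the handoff has been parsed.
        if self.session == 2 and not self.handoff_parsed:
            raise PermissionError("call parse_handoff() before file access")

    def read_file(self, path):
        self._check_file_access()
        return self.files[path]

    def write_file(self, path, content):
        self._check_file_access()
        self.files[path] = content

    def write_handoff(self, note):
        # Ends Session 1: keep the note, wipe everything else.
        if self.session != 1:
            raise RuntimeError("write_handoff() is only valid in Session 1")
        self.handoff = note
        self.files = {}
        self.session = 2

    def parse_handoff(self):
        if self.session != 2:
            raise RuntimeError("parse_handoff() is only valid in Session 2")
        self.handoff_parsed = True
        return self.handoff
```

In this sketch, a file written in Session 1 is gone after `write_handoff()`, and any `read_file()` or `write_file()` call in Session 2 raises `PermissionError` until `parse_handoff()` has been called.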
+
+ ### Handoff Format (enforced by HandoffValidator)
+
+ ```
+ TASK:          one sentence - what the overall task is
+ COMPLETED:     bullet list - fully implemented + verified items
+ REMAINING:     bullet list - what Session 2 must implement
+ KEY FUNCTIONS: function/class names, signatures, brief purpose
+ EDGE CASES:    constraints or tricky logic discovered in Session 1
+ NEXT STEPS:    ordered list - what Session 2 should do first
+ ```
+
+ Max 400 tokens · max 5 lines of code in code blocks · all 6 sections required.
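
The three rules above (6 required sections, a 400-token cap, at most 5 lines of code) can be checked mechanically. The sketch below shows one possible check and is not the repository's `HandoffValidator`. It makes two labeled assumptions: whitespace-separated words stand in for tokenizer tokens, and the 5-line limit is counted across all fenced blocks in the note. The function name `validate_handoff` is hypothetical.

```python
import re

REQUIRED_SECTIONS = [
    "TASK", "COMPLETED", "REMAINING",
    "KEY FUNCTIONS", "EDGE CASES", "NEXT STEPS",
]
MAX_TOKENS = 400
MAX_CODE_LINES = 5
FENCE = "`" * 3  # triple-backtick marker, built this way to keep this block fence-safe


def validate_handoff(note):
    """Return a list of rule violations; an empty list means the note passes."""
    errors = []

    # Rule 1: every section header must appear at the start of a line.
    for name in REQUIRED_SECTIONS:
        if not re.search(rf"^{re.escape(name)}:", note, flags=re.MULTILINE):
            errors.append(f"missing section: {name}")

    # Rule 2 (assumption): whitespace-split words approximate tokens.
    n_tokens = len(note.split())
    if n_tokens > MAX_TOKENS:
        errors.append(f"too many tokens: {n_tokens} > {MAX_TOKENS}")

    # Rule 3 (assumption): count code lines summed over all fenced blocks.
    code_lines = 0
    in_block = False
    for line in note.splitlines():
        if line.strip().startswith(FENCE):
            in_block = not in_block
        elif in_block:
            code_lines += 1
    if code_lines > MAX_CODE_LINES:
        errors.append(f"too many code lines: {code_lines} > {MAX_CODE_LINES}")

    return errors
```

Returning a list of violations rather than a single boolean makes it possible to tell the agent exactly which rule a rejected note broke.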
+
+ ---
+
+ ## Reward Breakdown
+
+ | Component | Weight | What it measures | Anti-gaming |
+ |-----------|--------|------------------|-------------|
+ | Tests (visible) | 33% | Session 2 correctness | Hidden tests at submit |
+ | Tests (hidden) | 22% | Generalization | Not shown via run_tests |
+ | Handoff quality | 20% | Structure + compression + density | Code-dump blocked by validator |
+ | Linearity | 15% | Session 2 didn't thrash | Revert-write detection |
+ | Penalties | 10% | Invalid actions + reconstruction | Rewrite penalty |
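
The weights in the table sum to 100%, so the episode reward can be read as a weighted sum of per-component scores. The sketch below illustrates that reading under stated assumptions and is not the repository's `ContinuityRubric`. It assumes each component is reported as a score in [0, 1], and that the penalties component is expressed as 1.0 when no penalty was incurred. The dictionary keys and the function name `combine_reward` are hypothetical.

```python
# Weights copied from the table above; key names are illustrative.
WEIGHTS = {
    "tests_visible": 0.33,
    "tests_hidden": 0.22,
    "handoff_quality": 0.20,
    "linearity": 0.15,
    "penalties": 0.10,  # assumption: 1.0 means no penalties incurred
}


def combine_reward(scores):
    """Weighted sum of component scores, each clamped to [0, 1]."""
    missing = sorted(set(WEIGHTS) - set(scores))
    if missing:
        raise KeyError(f"missing component scores: {missing}")
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(1.0, max(0.0, scores[name]))
        total += weight * value
    return total


example = {
    "tests_visible": 1.0,    # all visible tests pass
    "tests_hidden": 0.5,     # half of the hidden tests pass
    "handoff_quality": 0.8,
    "linearity": 1.0,
    "penalties": 1.0,
}
print(round(combine_reward(example), 2))
```

Under these assumptions, an episode that passes every visible test but only half of the hidden ones loses 0.11 of reward from the hidden-test term alone, which is why overfitting to `run_tests` output does not pay off.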
+
+ ---
+
+ ## OpenEnv Compliance
+
+ | Requirement | Status |
+ |-------------|--------|
+ | `openenv.yaml` with `entry: server.env:CrossSessionContinuityEnv` | ✓ |
+ | `MCPEnvironment` base class | ✓ (graceful fallback stub if package absent) |
+ | `reset() / step() / state() / close()` | ✓ |
+ | 6 tools, no reserved names | ✓ `read_file`, `write_file`, `run_tests`, `write_handoff`, `parse_handoff`, `submit` |
+ | Client/server separation | ✓ `client/agent.py` has no server imports |
+ | Difficulty levels | ✓ easy (step=20) · medium (35) · hard (55) |
+
+ ---
+
+ ## Running Locally
+
+ ```bash
+ git clone https://github.com/CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env
+ cd cross-session-continuity-env
+ pip install -r requirements.txt
+
+ # Gradio demo
+ python app.py
+
+ # Unit tests (23 tests)
+ python -m pytest server/tests/ -v
+
+ # Generate plots (uses real results/ if present, synthetic fallback otherwise)
+ python plots/generate_plots.py
+ ```
+
+ ## Docker
+
+ ```bash
+ docker build -t cross-session-env .
+ docker run -p 7860:7860 cross-session-env
+ # Open: http://localhost:7860
+ ```
+
+ ---
+
+ ## Repository Structure
+
+ ```
+ ├── openenv.yaml              # OpenEnv manifest
+ ├── app.py                    # Gradio Space entry point
+ ├── Dockerfile                # Container image
+ ├── requirements.txt          # Dependencies
+ ├── server/
+ │   ├── env.py                # CrossSessionContinuityEnv (MCPEnvironment)
+ │   ├── task_generator.py     # Task bank + name randomization
+ │   ├── session_manager.py    # S1→S2 filesystem wipe
+ │   ├── sandbox.py            # subprocess + ulimits execution
+ │   ├── handoff_validator.py  # 6-section structure enforcement
+ │   ├── mcp_tools.py          # OpenEnv tool registry
+ │   └── rewards/
+ │       ├── rubric.py         # ContinuityRubric (composable)
+ │       └── auxiliary.py      # S1 shaped rewards + decay
+ ├── client/agent.py           # Agent loop (no server imports)
+ ├── training/
+ │   ├── train_grpo.ipynb      # Colab training notebook (15 cells)
+ │   └── grpo_config.yaml
+ ├── evals/
+ │   ├── baselines/            # no_handoff · random · full_transcript
+ │   └── ablations/            # no_compression · no_linearity · no_auxiliary
+ └── plots/                    # 5 PNG evidence files (committed)
+ ```