ianalin123 and Claude Sonnet 4.6 committed
Commit 3dddeb0 · 1 Parent(s): 32d29cd

docs: add V2 handoff and plan documents

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Files changed (2):
  1. HANDOFF_V2.md +395 -0
  2. PLAN_V2.md +476 -0
HANDOFF_V2.md ADDED
# Origami Env — V2 Handoff

## What This Project Is

RL environment where an LLM learns to generate origami crease patterns (FOLD format JSON).
The model is rewarded based on how closely its folded shape matches a target shape.
Deployed on Modal (NVIDIA B200, 192GB HBM3e) using Unsloth + TRL GRPO.

Env server runs on Railway: `https://origami-env-production.up.railway.app`

---

## Current State (V1 — working)

### Architecture

**Single-shot episodes**: the LLM submits a complete FOLD JSON crease pattern in one action. Physics simulates it. Reward = shape similarity × 20. `done=True` after step 1.

```
reset(task_name="quarter_fold") → task description + target positions
step(OrigamiAction(fold_data={complete FOLD JSON})) → reward, done=True
```

### Stack
- **Model**: `unsloth/Qwen3-32B` bfloat16 on B200 (no quantization)
- **Training**: TRL `GRPOTrainer` via `training/train_grpo.py`
- **Cloud**: Modal (`modal run modal_train.py`)
- **Checkpoints**: Modal volume `origami-checkpoints` at `/outputs`
- **Env server**: FastAPI via OpenEnv `create_app()`, hosted on Railway

### Reward Functions (3 signals)

| Function | Source | Range | What it measures |
|---|---|---|---|
| `valid_fold` | `training/reward.py` | -2 to +1 | Parseable FOLD JSON with correct structure |
| `flat_foldable_reward` | `training/reward.py` | -0.5 to +1 | Kawasaki + Maekawa + BLB at interior vertices |
| `shape_match_reward` | `training/train_grpo.py` | -2 to +20 | Chamfer distance to target shape (via env server) |

`flat_foldable_reward` is new as of the last session — ported from optigami. It runs locally, with no server round-trip.

### Tasks (4 tasks, all single-step)

| Task | Difficulty | Description |
|---|---|---|
| `triangle` | 1 | Diagonal valley fold — trivially easy for Qwen3-32B, converges in ~5 steps |
| `half_fold` | 1 | Horizontal fold at y=0.5 |
| `quarter_fold` | 2 | Two perpendicular valley folds |
| `letter_fold` | 2 | Two parallel folds at y=1/3 and y=2/3 (valley + mountain) |

Default training uses `--task all`: all 4 tasks × 200 samples = 800 dataset rows.

### Key Files

```
origami_env/
├── modal_train.py         # Modal cloud training entrypoint
├── modal_eval.py          # Modal cloud eval entrypoint
├── client.py              # OrigamiEnv OpenEnv client
├── Dockerfile             # Railway env server (PORT env var for Railway)
├── origami_server/
│   ├── app.py             # FastAPI server via create_app()
│   ├── environment.py     # OrigamiEnvironment: reset() + step()
│   ├── models.py          # OrigamiAction, OrigamiObservation, OrigamiState
│   ├── tasks.py           # TASKS dict — 4 target patterns
│   └── engine/
│       ├── simulate.py    # BFS + cumulative rotation transforms
│       ├── shape_match.py # Chamfer distance + 24-rotation search
│       └── fold_parser.py # FOLD validation + face triangulation
└── training/
    ├── train_grpo.py      # GRPOTrainer setup, multi-task dataset, prompts
    └── reward.py          # valid_fold, flat_foldable_reward, extract_fold_json
```

### Training Commands

```bash
# Full run (all tasks, 600 steps)
modal run modal_train.py

# Resume from checkpoint
modal run modal_train.py --resume --max-steps 1200

# Eval latest checkpoint vs base model
modal run modal_eval.py --checkpoint checkpoint-20 --n-samples 20
modal run modal_eval.py --checkpoint base --n-samples 20

# Check volume contents
modal volume ls origami-checkpoints
modal volume get origami-checkpoints checkpoint-20 ./outputs/checkpoint-20
```

### Known V1 Issues

1. **Converges too fast**: Qwen3-32B already knows these tasks. All 4 tasks hit max reward within ~30 steps. After that, `reward_std=0` → no GRPO gradient → training is a no-op.

2. **Reward ceiling**: `shape_match_reward` maxes at 20.0. The model hits it early and stays there. There is no harder signal to keep it learning.

3. **Single-shot limits learning**: The model submits the complete pattern at once, so GRPO only sees the final result, not individual fold decisions — like training a chess player on win/loss alone instead of per-move feedback.

4. **KL drift without gradient**: When `reward_std=0`, the policy drifts from base (KL grows to ~0.1) without any learning. Pure degradation after convergence.

5. **`flat_foldable_reward` untested**: Added last session. Needs a training run to verify it actually fires and produces useful signal.
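
Issues 1 and 4 share a root cause worth spelling out: GRPO normalizes rewards within each generation group, so a zero-variance group yields zero advantage for every sample. A minimal sketch of that computation (the function name and the epsilon value are illustrative, not from `train_grpo.py`):

```python
import statistics

def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """GRPO-style group-normalized advantage: (r - mean) / (std + eps).

    When every completion in the group earns the same reward (std = 0),
    every advantage is 0, so the policy gradient for the batch vanishes.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

print(group_advantages([20.0, 20.0, 20.0, 20.0]))  # converged group: all zeros
print(group_advantages([20.0, 18.0, 20.0, 14.0]))  # varied group: nonzero signal
```

This is why the V2 per-step reward matters: it keeps within-group variance alive longer.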

---

## V2 Goal: Multi-Step Episodes

The core upgrade: instead of submitting complete FOLD JSON in one shot, the model outputs **one fold crease at a time**, gets reward after each crease, and sees the updated paper state before deciding the next crease.

### Reference Implementation

`/Users/ianalin/Desktop/optigami/` has a working multi-step implementation. Key files:
- `env/environment.py` — `OrigamiEnvironment` with `mode='step'` for one-crease-per-step
- `env/paper_state.py` — `PaperState` tracks the crease graph incrementally
- `env/graph.py` — `CreaseGraph` with vertex deduplication + edge splitting at intersections
- `env/rewards.py` — per-step reward: Kawasaki/Maekawa/BLB + progress + delta + efficiency
- `env/prompts.py` — step-level prompt showing current state + anchor points + last reward breakdown
- `env/verifier.py` — Kawasaki, Maekawa, BLB theorem checks (already ported to `training/reward.py`)

### V2 Action Format

Single crease per step (instead of a complete FOLD JSON):

```json
{"from": [0.0, 0.5], "to": [1.0, 0.5], "assignment": "V"}
```

### V2 Episode Flow

```
reset(task_name="quarter_fold")
  → observation: task description + available anchor points + current state (empty)

step({"from": [0.5, 0], "to": [0.5, 1], "assignment": "V"})
  → observation: updated paper state + intermediate reward + available anchor points

step({"from": [0, 0.5], "to": [1, 0.5], "assignment": "V"})
  → observation: final shape + terminal reward, done=True
```

### V2 Reward (per-step, from optigami/env/rewards.py)

```python
total = (
      0.40 * progress          # fraction of target creases covered
    + 0.20 * delta             # improvement this step
    + 0.10 * kawasaki          # Kawasaki theorem compliance
    + 0.10 * maekawa           # Maekawa theorem compliance
    + 0.05 * blb               # BLB lemma compliance
    + 0.05 * economy           # penalty for excess creases
    + 0.05 * assignment_acc    # correct M/V types
    - 0.01 * step_penalty      # efficiency: finish in fewer steps
    + 10.0 * completion_bonus  # if progress > 0.9 and all geometry valid
)
```

This gives GRPO a gradient at every step, not just at the end.
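
For illustration, the formula above as a runnable function (the component values passed in below are made-up inputs in [0, 1]; the real signals come from the env server):

```python
def step_total(progress, delta, kawasaki, maekawa, blb, economy,
               assignment_acc, step_penalty, completion_bonus):
    """Weighted per-step reward, mirroring the formula above.

    completion_bonus should be 1.0 only when progress > 0.9 and the
    geometry is valid; everything else is a score in [0, 1].
    """
    return (0.40 * progress + 0.20 * delta
            + 0.10 * kawasaki + 0.10 * maekawa + 0.05 * blb
            + 0.05 * economy + 0.05 * assignment_acc
            - 0.01 * step_penalty + 10.0 * completion_bonus)

# A perfect final step is dominated by the completion bonus:
print(step_total(1.0, 0.5, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))
```

The 10.0 bonus dwarfs the shaping terms by design: the shaping terms keep the gradient alive mid-episode, while the bonus anchors the optimum at actually finishing the pattern.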

### V2 Prompt (per-step, from optigami/env/prompts.py)

```
Target: quarter_fold — fold the paper into quarters

CURRENT STATE (step 1 of 5):
Creases placed: none

AVAILABLE ANCHOR POINTS:
Corners: (0,0) (1,0) (1,1) (0,1)
Midpoints: (0,0.5) (0.5,0) (1,0.5) (0.5,1)

Output the NEXT crease as JSON:
{"from": [x1, y1], "to": [x2, y2], "assignment": "M" or "V"}
```

### V2 Implementation Plan

**Phase 1: New environment server (modify `origami_server/`)**

1. Add a `PaperState` class to track the crease graph across steps (port from `optigami/env/paper_state.py` + `graph.py`)
2. Modify `OrigamiAction` in `models.py` to accept the single-crease format: `{"from": [...], "to": [...], "assignment": "M"|"V"}`
3. Modify `OrigamiEnvironment` in `environment.py` to:
   - Track `_paper_state: PaperState` between steps
   - Return `done=False` until `max_folds` is reached or a "stop" action arrives
   - Compute per-step reward using the optigami reward formula
   - Include the current crease state + available anchor points in the observation
4. Keep backward compat: the single-step (complete FOLD JSON) mode still works as `mode='single'`

**Phase 2: Update training (modify `training/`)**

1. Update the `train_grpo.py` prompt to the step-level format (already in optigami)
2. Update `shape_match_reward` to accept the incremental observation — the final shape is only computed when `done=True`
3. Consider `max_folds` as a task parameter (e.g. triangle=1, quarter_fold=2, letter_fold=2)

**Phase 3: Add harder tasks**

From optigami's `server/tasks.py`, good candidates:
- `map_fold` — 8 folds, must be deployable (can unfold back flat)
- `waterbomb_base` — classic base requiring diagonal + perpendicular folds
- Custom tasks with `target_ratio` (compactness goals)

**Phase 4: Model upgrade**

optigami uses `Qwen2.5-VL-7B` (vision-language), which could let the model SEE a rendered view of the current paper state as part of the observation. This is the highest-ceiling path but requires significant extra work.

---

## Important Constraints

- **OpenEnv API**: `reset()` and `step()` must return types matching `OrigamiObservation`. The FastAPI server is generated by `create_app(OrigamiEnvironment, OrigamiAction, OrigamiObservation)`. Changing the `OrigamiAction` shape requires updating models + server + client.
- **Modal image**: Adding new Python dependencies requires changing the `run_commands` block in `modal_train.py`. The image caches by content hash — changing deps triggers a full rebuild (~10 min).
- **Railway**: The env server auto-deploys from the `main` branch. `Dockerfile` + `requirements.txt` must stay in the repo root.
- **Unsloth quirk**: With `num_generations > per_device_train_batch_size`, Unsloth auto-bumps the batch size. Keep `num_generations=4` (current default) to avoid an 8× batch blowup.
- **Qwen3 thinking**: Always include `{"role": "system", "content": "/no_think"}` in prompts. Without it, `<think>` tokens fill the entire completion budget.

---

## HuggingFace Deployment

Two separate HF deployments are needed: the **env server** on HF Spaces, and the **trained model** on HF Hub.

### 1. Env Server → HF Spaces (Docker Space)

HF Spaces runs the `Dockerfile` automatically. The current `Dockerfile` is already compatible:
- Uses `${PORT:-8000}` — HF Spaces injects `PORT=7860` at runtime, so it auto-binds correctly
- No code changes needed to the server itself

**What needs to be added:**

`README.md` must have HF Spaces frontmatter (it was stripped during the Railway migration and needs to come back):

```yaml
---
title: Origami Env
emoji: 🦢
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
---
```

**HF Spaces constraints to design around in V2:**

| Constraint | Impact |
|---|---|
| **Stateless** — container restarts wipe memory | No in-memory episode state. `OrigamiEnvironment` must be fully reconstructable from the session ID alone. This is already true for V1 (no cross-request state), but V2 multi-step will need to store `PaperState` per session somewhere (a dict keyed by `episode_id`, or Redis). |
| **Free tier is CPU-only** | Simulation (`simulate.py`) is pure NumPy — fine on CPU. No GPU needed for the env server. |
| **No persistent disk** | Checkpoints live on the Modal volume, not HF. The env server doesn't need checkpoints. |
| **Cold starts** | The first request after inactivity spins up a fresh container. Health check endpoints (`/health`) are already present. |
| **`MAX_CONCURRENT_ENVS`** | Currently set to 16 in `Dockerfile`. On free-tier HF Spaces with limited RAM, lower this to 4-8 for V2 multi-step, since each session will hold a `PaperState` object in memory. |

**V2-specific concern — session state for multi-step:**

V1 is stateless between steps (single-shot, `done=True` after step 1). V2 multi-step is NOT stateless — `PaperState` (the evolving crease graph) must persist across `reset()` → `step()` → `step()` calls within an episode.

OpenEnv's `create_app()` already handles concurrent sessions via `session_id`. The `OrigamiEnvironment` instance is kept alive per session. This works fine on a single container. On HF Spaces with auto-scaling or restarts, a session mid-episode would be dropped. For the hackathon / demo use case this is acceptable — just document that episodes are tied to a single container lifetime.

**Deployment steps:**

```bash
# 1. Add README.md frontmatter (see above)

# 2. Push to HF Space repo
git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/origami-env
git push hf main

# 3. Verify health
curl https://YOUR_USERNAME-origami-env.hf.space/health

# 4. Update client.py base_url to the HF Space URL
#    Update the server_url default in modal_train.py if using an external server
```

---

### 2. Trained Model → HF Hub

After training on Modal, push the LoRA adapter (or merged model) to HF Hub so it's publicly usable.

**Option A: Push LoRA adapter only** (small, ~300MB, requires the base model separately)

Add to the end of `training/train_grpo.py` after `model.save_pretrained(save_path)`:

```python
# Push to HF Hub
import os
hf_repo = os.environ.get("HF_REPO")  # e.g. "username/origami-qwen3-32b-lora"
if hf_repo:
    model.push_to_hub(hf_repo, token=os.environ["HF_TOKEN"])
    tokenizer.push_to_hub(hf_repo, token=os.environ["HF_TOKEN"])
    print(f"Model pushed to https://huggingface.co/{hf_repo}")
```

Add `HF_REPO` and `HF_TOKEN` to Modal secrets:
```bash
modal secret create huggingface HF_TOKEN=hf_xxx HF_REPO=username/origami-qwen3-32b-lora
```

Then reference the secret in `modal_train.py`:
```python
@app.function(
    image=image,
    gpu=GPU,
    timeout=TIMEOUT,
    volumes={OUTPUTS_DIR: volume},
    secrets=[modal.Secret.from_name("huggingface")],  # add this
)
```

**Option B: Merge LoRA into base + push** (large, ~65GB, self-contained)

```python
# After training, merge the adapter into the base weights and push
if USE_UNSLOTH:
    merged = model.merge_and_unload()
    merged.push_to_hub(hf_repo, token=os.environ["HF_TOKEN"])
```

For the demo use case, Option A is fine. Option B is only needed if users will run inference without the base model available.

**HF Hub model card:**

The pushed repo needs a `README.md` model card. Minimum viable:

```markdown
---
base_model: unsloth/Qwen3-32B
tags:
- lora
- origami
- rl
- grpo
license: apache-2.0
---

# Origami Qwen3-32B LoRA

LoRA adapter trained with GRPO on origami crease pattern generation.
Tasks: triangle, half_fold, quarter_fold, letter_fold.
```

---

### 3. Full Deployment Topology

```
┌─────────────────────┐      ┌─────────────────────┐
│  HF Spaces          │      │  HF Hub             │
│  (env server)       │      │  (trained model)    │
│  Docker + CPU       │      │  LoRA adapter       │
│  /health /reset     │      │  ~300MB             │
│  /step /tasks       │      └─────────────────────┘
└────────┬────────────┘                ▲
         │ WebSocket                   │ push_to_hub()
         │ /ws                         │
         ▼                             │
┌─────────────────────┐      ┌─────────┴───────────┐
│  Modal              │─────▶│  Modal Volume       │
│  B200 training      │      │  origami-checkpoints│
│  GRPO + Unsloth     │      │  checkpoint-N/      │
└─────────────────────┘      └─────────────────────┘
```

---

## Environment Setup

```bash
# Install deps
pip install -r requirements.txt

# Start the env server locally
uvicorn origami_server.app:app --host 0.0.0.0 --port 8000

# Run training locally (small model for testing)
python -m training.train_grpo --model unsloth/Qwen2.5-3B-Instruct --max_steps 50

# Deploy to Modal (B200)
modal run modal_train.py
```

## Quick Verification

```bash
# Check the env server is healthy
curl http://localhost:8000/health

# Check tasks
curl http://localhost:8000/tasks

# Start an episode manually (FastAPI expects a JSON body)
curl -X POST http://localhost:8000/reset \
  -H "Content-Type: application/json" \
  -d '{"task_name": "triangle"}'
```
PLAN_V2.md ADDED
# V2 Implementation Plan — Multi-Step Origami Episodes

## Goal
Upgrade from single-shot episodes (complete FOLD JSON in one action) to multi-step episodes
(one crease per step, per-step reward, evolving paper state in the observation).

## Key Design Decision
Training stays compatible with `GRPOTrainer`. Each training sample is still a single
(prompt → completion → reward) tuple. The difference: the prompt now shows the current
paper state (initially empty) and the completion is a single crease JSON, not a full FOLD.
The V2 MVP trains on step-0 only (empty paper). At inference, steps are chained sequentially.

---

## Step 1 — Add `shapely` to requirements
**File:** `requirements.txt`

Add `shapely>=2.0` to requirements. PaperState uses Shapely for intersection detection and
bounds clipping. This is the only new Python dependency.

Also add `shapely>=2.0` to the `run_commands` block in `modal_train.py` so the Modal image
includes it. Changing `run_commands` triggers a full image rebuild (~10 min), so do this step
first.

---

## Step 2 — Port `CreaseGraph` → `origami_server/engine/graph.py`
**New file:** `origami_server/engine/graph.py`
**Source:** `optigami/env/graph.py` (direct port, minimal changes)

Copy verbatim. No changes needed — the class is self-contained, pure Python + numpy.

```
CreaseGraph:
- Pre-initializes unit-square corners (4 vertices) + boundary edges (4 B edges)
- add_vertex(x, y): deduplicates by proximity (VERTEX_TOL = 1e-9)
- add_edge(v1, v2, assignment): idempotent
- split_edge(edge_id, new_vertex_id): for intersection handling
- get_cyclic_edges(vertex_id): sorted by angle (used in verifier)
- interior_vertices(): vertices not on boundary
- crease_edges(): edges with assignment M or V
- boundary_midpoints(): midpoints of B edges
```
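
The only subtle behavior above is `add_vertex` deduplication. A standalone sketch of what proximity dedup looks like (illustrative; the port keeps optigami's actual implementation):

```python
VERTEX_TOL = 1e-9  # matches the tolerance noted above

class VertexPool:
    """Deduplicate 2D vertices by proximity, returning stable indices."""
    def __init__(self, tol: float = VERTEX_TOL):
        self.tol = tol
        self.coords: list[tuple[float, float]] = []

    def add_vertex(self, x: float, y: float) -> int:
        for i, (vx, vy) in enumerate(self.coords):
            if abs(vx - x) <= self.tol and abs(vy - y) <= self.tol:
                return i  # existing vertex within tolerance
        self.coords.append((x, y))
        return len(self.coords) - 1

pool = VertexPool()
a = pool.add_vertex(0.5, 0.5)
b = pool.add_vertex(0.5 + 1e-12, 0.5)  # dedups to the same vertex index
print(a, b, len(pool.coords))
```

Without this, floating-point jitter from intersection computations would silently fork vertices and break the Kawasaki/Maekawa checks.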

---

## Step 3 — Port `PaperState` → `origami_server/engine/paper_state.py`
**New file:** `origami_server/engine/paper_state.py`
**Source:** `optigami/env/paper_state.py` (direct port, minimal changes)

Copy verbatim. PaperState:
- Wraps CreaseGraph
- `add_crease(p1, p2, assignment)` — validates, clips to unit square, finds intersections,
  splits existing edges, adds new waypoint edges
- `anchor_points()` — corners + all current vertices
- `crease_edges()` — returns a list of dicts for serialization

One change from optigami: make paper dimensions configurable (default 1×1). The existing tasks
all use 1×1 paper, so this is not urgent for the V2 MVP.
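
For intuition, the unit-square clipping that `add_crease` performs (via Shapely in optigami) can be expressed in pure Python with Liang-Barsky clipping. This is a sketch only; the port should keep the Shapely version:

```python
def clip_to_unit_square(p1, p2):
    """Liang-Barsky clip of segment p1->p2 against the unit square [0,1]^2.

    Returns the clipped (p1, p2) endpoints, or None if the segment
    lies entirely outside the paper.
    """
    x1, y1 = p1
    x2, y2 = p2
    dx, dy = x2 - x1, y2 - y1
    t0, t1 = 0.0, 1.0
    # Each (p, q) pair encodes one of the four square boundaries.
    for p, q in ((-dx, x1 - 0.0), (dx, 1.0 - x1),
                 (-dy, y1 - 0.0), (dy, 1.0 - y1)):
        if p == 0:
            if q < 0:
                return None  # parallel to and outside this boundary
            continue
        t = q / p
        if p < 0:
            t0 = max(t0, t)
        else:
            t1 = min(t1, t)
        if t0 > t1:
            return None
    return ((x1 + t0 * dx, y1 + t0 * dy), (x1 + t1 * dx, y1 + t1 * dy))

# A crease drawn past the paper edge is trimmed to the boundary:
print(clip_to_unit_square((-0.5, 0.5), (1.5, 0.5)))
```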

---

## Step 4 — Port step reward → `origami_server/engine/step_reward.py`
**New file:** `origami_server/engine/step_reward.py`
**Sources:** `optigami/env/rewards.py` + `optigami/env/verifier.py`

Port `compute_reward()` and its dependencies:
- `target_crease_edges(target)` — extract M/V creases from the FOLD target dict
- `check_all_vertices(graph)` — Kawasaki, Maekawa, BLB at all interior vertices
- `check_degree_sanity(graph)` — even crease count at interior vertices
- `geometric_crease_coverage(paper_state, target_edges)` — progress, economy, assignment_accuracy

The verifier functions (`_kawasaki_ok`, `_maekawa_ok`, `_blb_ok`) already exist in
`training/reward.py`, but they operate on raw FOLD JSON. The step_reward versions operate on
`CreaseGraph` directly. Keep both — they serve different purposes.

`compute_reward` signature:
```python
def compute_reward(
    prev_state: PaperState,
    action_result: dict,   # from PaperState.add_crease()
    new_state: PaperState,
    target: dict,          # FOLD task target dict
    step: int,
    max_steps: int,
) -> dict:
    # Returns a dict with keys:
    #   format, anchored, novelty, kawasaki, maekawa, blb, degree_sanity,
    #   progress, economy, assignment_accuracy, delta, regression,
    #   completion, efficiency, total
```

Weights (from optigami, validated):
```
total = (
      0.05 * anchored
    + 0.05 * novelty
    + 0.06 * kawasaki + 0.06 * maekawa + 0.04 * blb + 0.04 * degree_sanity
    + 0.25 * progress
    + 0.05 * economy + 0.05 * assignment_accuracy
    + 0.20 * delta
    + 0.10 * regression
    + completion   # 10.0 if progress > 0.9 and all geometry valid
    + efficiency   # -0.01 * (1 + step/max_steps)
)
```

The 10.0 completion bonus is the primary learning signal for hard tasks.
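
The flat-foldability checks behind the `kawasaki` and `maekawa` terms are simple to state. A sketch over a single interior vertex, given its crease angles (degrees) and M/V assignments (illustrative names; the port keeps optigami's verifier):

```python
def maekawa_ok(assignments: list[str]) -> bool:
    """Maekawa's theorem: |#mountain - #valley| == 2 at a flat-foldable interior vertex."""
    m = assignments.count("M")
    v = assignments.count("V")
    return abs(m - v) == 2

def kawasaki_ok(angles: list[float], tol: float = 1e-6) -> bool:
    """Kawasaki's theorem: alternating sector angles each sum to 180 degrees."""
    order = sorted(angles)
    sectors = [(order[(i + 1) % len(order)] - order[i]) % 360.0
               for i in range(len(order))]
    return abs(sum(sectors[0::2]) - 180.0) <= tol

# A simple flat vertex: four creases at right angles, one mountain among valleys
print(maekawa_ok(["M", "V", "V", "V"]), kawasaki_ok([0.0, 90.0, 180.0, 270.0]))
```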

---

## Step 5 — Update `origami_server/models.py`
**File:** `origami_server/models.py`

Three changes:

### OrigamiAction
Make `fold_data` optional (backward compat) and add `crease` for V2:
```python
class OrigamiAction(Action):
    fold_data: dict[str, Any] | None = Field(
        default=None, description="V1: complete FOLD-format crease pattern"
    )
    crease: dict[str, Any] | None = Field(
        default=None,
        description='V2: single crease {"from": [x,y], "to": [x,y], "assignment": "M"|"V"}'
    )
```

The server validates that exactly one of `fold_data` or `crease` is set.

### OrigamiObservation
Add V2 fields (keep all V1 fields for backward compat):
```python
class OrigamiObservation(Observation):
    # V1 fields (unchanged)
    task: dict[str, Any] = Field(default_factory=dict)
    fold_data: dict[str, Any] = Field(default_factory=dict)
    final_positions: list[list[float]] = Field(default_factory=list)
    target_positions: list[list[float]] = Field(default_factory=list)
    shape_similarity: float = 0.0
    max_strain: float = 0.0
    is_stable: bool = True
    error: Optional[str] = None
    # V2 fields (new)
    step_count: int = 0
    max_steps: int = 1
    current_creases: list[dict] = Field(default_factory=list)  # placed so far
    anchor_points: list[list[float]] = Field(default_factory=list)
    reward_breakdown: dict[str, float] = Field(default_factory=dict)
```

### OrigamiState
Add mode and step tracking:
```python
class OrigamiState(State):
    task_name: str = ""
    mode: str = "single"  # "single" | "step"
    step_count: int = 0
    shape_similarity: float = 0.0
    is_stable: bool = True
```

---

## Step 6 — Update `origami_server/environment.py`
**File:** `origami_server/environment.py`

Major upgrade. Add a `mode` parameter and multi-step logic.

### Constructor
```python
def __init__(self, mode: str = "step", **kwargs):
    # mode: "step" (V2 default) | "single" (V1 backward compat)
    self._mode = mode
    self._paper_state: PaperState | None = None
    self._step_reward_prev: PaperState | None = None  # for delta computation
    # ... existing fields
```

### reset()
In step mode, initialize `_paper_state = PaperState()` and return the initial observation
with empty `current_creases`, all anchor points, and `done=False`.

In single mode, behavior is unchanged from V1.

Grab `max_folds` from the task definition (new task field, see Step 7).

### step() — V2 path (when `action.crease` is set)
```python
# 1. Parse crease
crease = action.crease  # {"from": [x,y], "to": [x,y], "assignment": "M"|"V"}

# 2. Validate
if crease["assignment"] not in ("M", "V"):
    # return an error observation with reward=-0.1
    ...

# 3. Apply crease to paper state
import copy
prev_state = copy.deepcopy(self._paper_state)
result = self._paper_state.add_crease(
    crease["from"], crease["to"], crease["assignment"]
)

# 4. Compute per-step reward
self._state.step_count += 1
reward_dict = compute_reward(
    prev_state=prev_state,
    action_result=result,
    new_state=self._paper_state,
    target=self._task,
    step=self._state.step_count,
    max_steps=self._task["max_folds"],
)

# 5. Check done
done = (
    self._state.step_count >= self._task["max_folds"]
    or reward_dict.get("completion", 0) > 0
)

# 6. Return observation
return OrigamiObservation(
    done=done,
    reward=reward_dict["total"],
    task=self._task_info(),
    fold_data={},                      # empty in step mode
    final_positions=[],                # only populated on done=True
    target_positions=self._target_positions.tolist(),
    shape_similarity=reward_dict.get("progress", 0.0),
    max_strain=0.0,
    is_stable=True,
    error=None,
    step_count=self._state.step_count,
    max_steps=self._task["max_folds"],
    current_creases=self._paper_state.crease_edges(),
    anchor_points=[[x, y] for x, y in self._paper_state.anchor_points()],
    reward_breakdown=reward_dict,
)
```

When `done=True`, optionally run the full simulation (`simulate()`) to populate
`final_positions` and `shape_similarity` — useful for the viewer but not required for training.

### step() — V1 path (when `action.fold_data` is set)
Identical to the current V1 implementation. No changes.

---

## Step 7 — Update `origami_server/tasks.py`
**File:** `origami_server/tasks.py`

Two changes:

### Add `max_folds` to all existing tasks
```python
"triangle":     max_folds=1
"half_fold":    max_folds=1
"quarter_fold": max_folds=2
"letter_fold":  max_folds=2
```

`max_folds` is the maximum number of step() calls before done=True.

### Add two harder tasks

**`waterbomb_base`** (difficulty 3, max_folds=4):
Two diagonal valley folds (corner to corner) + two perpendicular valley folds (midpoint to
midpoint). A classic base that requires all four folds to be correct simultaneously.
```
target_fold: 9 vertices (4 corners + 4 midpoints + 1 center), 8 crease edges (all V)
Creases:
  (0,0)→(1,1) V       (diagonal)
  (1,0)→(0,1) V       (diagonal)
  (0.5,0)→(0.5,1) V   (vertical)
  (0,0.5)→(1,0.5) V   (horizontal)
```

**`map_fold`** (difficulty 4, max_folds=8):
Accordion fold into 4 strips horizontally + 4 strips vertically (8 total creases,
alternating M/V). The most demanding task for V2.
```
target_fold: Creases at y=0.25, 0.5, 0.75 (alternating V/M/V) + x=0.25, 0.5, 0.75 (V/M/V)
plus corner diagonals for proper map fold behavior
```

Add a `get_task_for_step_mode(name)` helper that returns the task with `max_folds` validated.

---

## Step 8 — Update `training/reward.py`
**File:** `training/reward.py`

### Add `valid_crease()` reward function
New reward for the V2 single-crease format:
```python
def valid_crease(completions: list, **kwargs) -> list[float]:
    """V2: Does the LLM output parse as valid single-crease JSON?

    +1.0  valid {"from": [x,y], "to": [x,y], "assignment": "M"|"V"}
    -0.5  parseable JSON but missing fields or wrong types
    -2.0  not parseable JSON
    """
```

### Add `extract_crease_json()` helper
```python
def extract_crease_json(response: str) -> dict | None:
    """Extract single-crease JSON from an LLM response.
    Looks for a {"from": ..., "to": ..., "assignment": ...} object.
    """
```

Keep all existing V1 functions (`valid_fold`, `flat_foldable_reward`, `extract_fold_json`)
unchanged for backward compat.

---

## Step 9 — Update `training/train_grpo.py`
**File:** `training/train_grpo.py`

### New prompt template
Replace `PROMPT_TEMPLATE` with a step-level format. Key difference: no FOLD fields are listed,
just "output the next crease as JSON":

```python
STEP_PROMPT_TEMPLATE = """You are an origami designer. Add the next fold crease.

Target: {description}
Paper: {width} × {height} unit square

CURRENT STATE (step {step} of {max_folds}):
Creases placed: {crease_history}

AVAILABLE ANCHOR POINTS:
Corners: {corners}
Boundary pts: {boundary_pts}
Intersections: {intersections}

Flat-foldability rules at every interior vertex:
- Kawasaki: alternating sector angles each sum to 180°
- Maekawa: |mountain_count - valley_count| = 2
- BLB: smallest sector bounded by opposite M/V types

Output ONLY this JSON (no explanation):
{{"from": [x1, y1], "to": [x2, y2], "assignment": "M" or "V"}}"""
```

For the V2 MVP (step-0 training), `step=0`, `crease_history="none"`, and anchor points = corners + midpoints.
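
Filling the template for step 0 is a plain `str.format` call; the `{{...}}` doubles keep the output-JSON braces literal. A quick check with made-up field values (abbreviated template for brevity):

```python
STEP_PROMPT_TEMPLATE = (
    "Target: {description}\n"
    "CURRENT STATE (step {step} of {max_folds}):\n"
    "Creases placed: {crease_history}\n"
    'Output ONLY this JSON: {{"from": [x1, y1], "to": [x2, y2]}}'
)  # abbreviated copy of the full template above

prompt = STEP_PROMPT_TEMPLATE.format(
    description="fold the paper into quarters",
    step=0,
    max_folds=2,
    crease_history="none",
)
print(prompt)
```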

### New `per_step_reward()` function
Replace `shape_match_reward`:
```python
def per_step_reward(completions, task_name, **kwargs):
    # Assumes port / openenv_process are module-level globals (initialized to
    # None) so one env server is reused across reward calls; launch_openenv
    # starts the server if needed and otherwise returns the existing handle.
    global port, openenv_process
    scores = []
    for completion, tname in zip(completions, task_name):
        response = completion[0]["content"]
        crease = extract_crease_json(response)
        if crease is None:
            scores.append(-2.0)
            continue
        try:
            port, openenv_process = launch_openenv(port, openenv_process)
            openenv_process.reset(task_name=tname)
            result = openenv_process.step(OrigamiAction(crease=crease))
            scores.append(result.reward if result.reward is not None else 0.0)
        except TimeoutError:
            scores.append(-1.0)
        except Exception:
            scores.append(-2.0)
    return scores
```

### Updated reward function list
```python
trainer = GRPOTrainer(
    reward_funcs=[valid_crease, per_step_reward],  # removed flat_foldable_reward (server handles it now)
    ...
)
```
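For reference, TRL's `GRPOTrainer` sums the outputs of all reward functions for each completion (with equal weights unless `reward_weights` is configured), so the effective per-completion signal behaves roughly like this sketch:

```python
# Sketch of how multiple reward functions combine under GRPO: each function
# returns one float per completion, and the totals are the elementwise sums.
def combined_reward(reward_funcs, completions, **kwargs):
    per_func = [f(completions, **kwargs) for f in reward_funcs]
    return [sum(vals) for vals in zip(*per_func)]
```

With `valid_crease` and `per_step_reward` in the list, a parseable crease that also folds well earns both the +1.0 format bonus and the shape reward.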

### Updated task list
Add the new tasks to `ALL_TASKS`:
```python
ALL_TASKS = ["triangle", "half_fold", "quarter_fold", "letter_fold", "waterbomb_base", "map_fold"]
```

### Updated GRPO config
Reduce `max_completion_length`: a single crease JSON is only ~50 tokens, far shorter than a full FOLD pattern:
```python
max_prompt_length=512,
max_completion_length=128,  # single crease JSON is ~50 tokens
max_steps=1200,  # more steps since harder tasks
```

---

## Step 10 — Update `client.py`
**File:** `client.py`

Minor update: `_step_payload` already calls `action.model_dump()`. With both fields optional,
the dump emits both keys, with the unset one as `None`. No change is needed unless OpenEnv
has strict serialization requirements.

If OpenEnv rejects `None` fields, filter them out:
```python
def _step_payload(self, action: OrigamiAction) -> Dict[str, Any]:
    return {k: v for k, v in action.model_dump().items() if v is not None}
```
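A quick standalone check of that filtering behavior — here the `model_dump()` output is mocked as a plain dict, since the real `OrigamiAction` is a pydantic model:

```python
def filter_none(dumped: dict) -> dict:
    # Same comprehension as the _step_payload fix, minus the pydantic model.
    return {k: v for k, v in dumped.items() if v is not None}

# A V2-style action dump: fold_data unset, crease set.
v2_payload = filter_none({
    "fold_data": None,
    "crease": {"from": [0, 0], "to": [1, 1], "assignment": "M"},
})
# v2_payload keeps only the "crease" key
```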

---

## Implementation Order

Execute in this order to minimize broken states:

```
1. requirements.txt + modal_train.py       (deps first, triggers image rebuild)
2. origami_server/engine/graph.py          (new file, no dependencies)
3. origami_server/engine/paper_state.py    (depends on graph.py)
4. origami_server/engine/step_reward.py    (depends on paper_state.py)
5. origami_server/models.py                (API types — do before environment)
6. origami_server/environment.py           (depends on models + paper_state + step_reward)
7. origami_server/tasks.py                 (add max_folds + new tasks)
8. training/reward.py                      (new valid_crease, extract_crease_json)
9. training/train_grpo.py                  (new prompts + per_step_reward)
10. client.py                              (minor defensive fix)
```

After step 7: run `curl http://localhost:8000/tasks` and verify the new tasks appear.
After step 9: run a single training step locally with `--model unsloth/Qwen2.5-3B-Instruct --max_steps 5`
to verify the reward functions fire.

---

## Files Changed / Created

| File | Status | Notes |
|------|--------|-------|
| `requirements.txt` | modified | add `shapely>=2.0` |
| `modal_train.py` | modified | add shapely to `run_commands` |
| `origami_server/engine/graph.py` | **new** | port from optigami |
| `origami_server/engine/paper_state.py` | **new** | port from optigami |
| `origami_server/engine/step_reward.py` | **new** | port from optigami |
| `origami_server/models.py` | modified | `OrigamiAction.crease` field, observation V2 fields |
| `origami_server/environment.py` | modified | multi-step mode |
| `origami_server/tasks.py` | modified | `max_folds` + waterbomb_base + map_fold |
| `training/reward.py` | modified | `valid_crease` + `extract_crease_json` |
| `training/train_grpo.py` | modified | step prompt + `per_step_reward` |
| `client.py` | modified | optional fields in `_step_payload` |

**V1 backward compat preserved:** all existing API routes, observation fields, and reward
functions remain unchanged. `mode='single'` continues to work for existing training runs.

---

## Risk Notes

- **shapely requirement**: PaperState intersection detection uses shapely. If the Railway/Modal
  build fails, we can fall back to numpy-only intersection code (more code, but avoids the dep).
  Test locally first with `pip install shapely`.

- **OrigamiAction change**: making `fold_data` optional loosens the schema, but it is not
  breaking in practice: existing V1 clients that always set `fold_data` keep working, since
  pydantic accepts an optional field that is given a value.

- **Step-0-only training**: the V2 MVP trains exclusively from the empty paper state (step 0).
  The model learns "first crease for task X" but never trains on step 1+. This means
  chained inference (running multiple steps at eval time) may degrade at step 2+, because
  the policy was never trained on non-empty paper states. Acceptable for the V2 MVP — a future
  V3 adds episode rollout collection to the training loop.

- **Completion bonus scale**: the 10.0× completion bonus means episodes where the model
  hits >90% coverage plus valid geometry will dominate the reward signal. For easy tasks
  (triangle, half_fold) this will happen quickly; for map_fold it may never happen in early
  training. Consider starting with only triangle/waterbomb_base for the first training run.