Commit 3dddeb0 (parent 32d29cd): docs: add V2 handoff and plan documents
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- HANDOFF_V2.md +395 -0
- PLAN_V2.md +476 -0

HANDOFF_V2.md (added)
# Origami Env — V2 Handoff

## What This Project Is

RL environment where an LLM learns to generate origami crease patterns (FOLD format JSON).
The model is rewarded based on how closely its folded shape matches a target shape.
Deployed on Modal (NVIDIA B200, 192GB HBM3e) using Unsloth + TRL GRPO.

Env server runs on Railway: `https://origami-env-production.up.railway.app`

---

## Current State (V1 — working)

### Architecture

**Single-shot episodes**: the LLM submits a complete FOLD JSON crease pattern in one action. Physics simulates it. Reward = shape similarity × 20. `done=True` after step 1.

```
reset(task_name="quarter_fold") → task description + target positions
step(OrigamiAction(fold_data={complete FOLD JSON})) → reward, done=True
```

### Stack
- **Model**: `unsloth/Qwen3-32B` bfloat16 on B200 (no quantization)
- **Training**: TRL `GRPOTrainer` via `training/train_grpo.py`
- **Cloud**: Modal (`modal run modal_train.py`)
- **Checkpoints**: Modal volume `origami-checkpoints` at `/outputs`
- **Env server**: FastAPI via OpenEnv `create_app()`, hosted on Railway

### Reward Functions (3 signals)

| Function | Source | Range | What it measures |
|---|---|---|---|
| `valid_fold` | `training/reward.py` | -2 to +1 | Parseable FOLD JSON with correct structure |
| `flat_foldable_reward` | `training/reward.py` | -0.5 to +1 | Kawasaki + Maekawa + BLB at interior vertices |
| `shape_match_reward` | `training/train_grpo.py` | -2 to +20 | Chamfer distance to target shape (via env server) |

`flat_foldable_reward` is new as of the last session — ported from optigami. It runs locally, with no server round-trip.
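The flat-foldability theorems behind `flat_foldable_reward` can be sketched in a few lines. This is a minimal illustration, assuming sector angles in radians and `"M"`/`"V"` assignment strings; the actual `training/reward.py` port operates on raw FOLD JSON:

```python
import math

def kawasaki_ok(angles, tol=1e-6):
    """Kawasaki's theorem: at a flat-foldable interior vertex, the two
    alternating sums of consecutive sector angles are equal (each is pi)."""
    return abs(sum(angles[0::2]) - sum(angles[1::2])) < tol

def maekawa_ok(assignments):
    """Maekawa's theorem: mountain and valley counts at a flat-foldable
    interior vertex differ by exactly 2."""
    return abs(assignments.count("M") - assignments.count("V")) == 2
```

For example, a degree-4 vertex with four 90° sectors and a 3-mountain/1-valley assignment passes both checks.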
### Tasks (4 tasks, all single-step)

| Task | Difficulty | Description |
|---|---|---|
| `triangle` | 1 | Diagonal valley fold — trivially easy for Qwen3-32B, converges in ~5 steps |
| `half_fold` | 1 | Horizontal fold at y=0.5 |
| `quarter_fold` | 2 | Two perpendicular valley folds |
| `letter_fold` | 2 | Two parallel folds at y=1/3 and y=2/3 (valley + mountain) |

Default training uses `--task all`: all 4 tasks × 200 samples = 800 dataset rows.

### Key Files

```
origami_env/
├── modal_train.py          # Modal cloud training entrypoint
├── modal_eval.py           # Modal cloud eval entrypoint
├── client.py               # OrigamiEnv OpenEnv client
├── Dockerfile              # Railway env server (PORT env var for Railway)
├── origami_server/
│   ├── app.py              # FastAPI server via create_app()
│   ├── environment.py      # OrigamiEnvironment: reset() + step()
│   ├── models.py           # OrigamiAction, OrigamiObservation, OrigamiState
│   ├── tasks.py            # TASKS dict — 4 target patterns
│   └── engine/
│       ├── simulate.py     # BFS + cumulative rotation transforms
│       ├── shape_match.py  # Chamfer distance + 24-rotation search
│       └── fold_parser.py  # FOLD validation + face triangulation
└── training/
    ├── train_grpo.py       # GRPOTrainer setup, multi-task dataset, prompts
    └── reward.py           # valid_fold, flat_foldable_reward, extract_fold_json
```
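The core of the shape matching in `engine/shape_match.py` is a symmetric Chamfer distance between point clouds. A minimal NumPy sketch (the real code additionally searches 24 rotations before scoring, which this omits):

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3):
    mean distance from each point to its nearest neighbor in the other set,
    summed over both directions. Lower means a closer shape match."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())
```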
### Training Commands

```bash
# Full run (all tasks, 600 steps)
modal run modal_train.py

# Resume from checkpoint
modal run modal_train.py --resume --max-steps 1200

# Eval latest checkpoint vs base model
modal run modal_eval.py --checkpoint checkpoint-20 --n-samples 20
modal run modal_eval.py --checkpoint base --n-samples 20

# Check volume contents
modal volume ls origami-checkpoints
modal volume get origami-checkpoints checkpoint-20 ./outputs/checkpoint-20
```

### Known V1 Issues

1. **Converges too fast**: Qwen3-32B already knows these tasks. All 4 tasks hit max reward within ~30 steps. After that, `reward_std=0` → no GRPO gradient → training is a no-op.

2. **Reward ceiling**: `shape_match_reward` maxes out at 20.0. The model hits it early and stays there. There is no harder signal to keep learning from.

3. **Single-shot limits learning**: The model submits the complete pattern at once, so GRPO only sees the final result, not individual fold decisions. Compare: rewarding each move of a chess game vs only the final win/loss.

4. **KL drift without gradient**: When `reward_std=0`, the policy drifts from base (KL grows to ~0.1) without any learning. Pure degradation after convergence.

5. **`flat_foldable_reward` untested**: Added last session. Needs a training run to verify it actually fires and produces useful signal.
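Issue 1 is mechanical rather than subtle: GRPO normalizes rewards within each generation group, so identical rewards give zero advantage everywhere. A simplified sketch of that group normalization (the exact epsilon and normalization details in TRL may differ):

```python
import statistics

def group_advantages(rewards, eps=1e-4):
    """GRPO-style group-relative advantages: (r - mean) / (std + eps) over one
    prompt's generation group. Once a task is converged and every completion
    earns the same reward, the advantages are all zero -- no policy gradient."""
    mu = statistics.fmean(rewards)
    sd = statistics.pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]
```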
---
## V2 Goal: Multi-Step Episodes

The core upgrade: instead of submitting complete FOLD JSON in one shot, the model outputs **one fold crease at a time**, gets reward after each crease, and sees the updated paper state before deciding the next crease.

### Reference Implementation

`/Users/ianalin/Desktop/optigami/` has a working multi-step implementation. Key files:
- `env/environment.py` — `OrigamiEnvironment` with `mode='step'` for one-crease-per-step
- `env/paper_state.py` — `PaperState` tracks the crease graph incrementally
- `env/graph.py` — `CreaseGraph` with vertex deduplication + edge splitting at intersections
- `env/rewards.py` — per-step reward: Kawasaki/Maekawa/BLB + progress + delta + efficiency
- `env/prompts.py` — step-level prompt showing current state + anchor points + last reward breakdown
- `env/verifier.py` — Kawasaki, Maekawa, BLB theorem checks (already ported to `training/reward.py`)

### V2 Action Format

Single crease per step (instead of complete FOLD JSON):

```json
{"from": [0.0, 0.5], "to": [1.0, 0.5], "assignment": "V"}
```

### V2 Episode Flow

```
reset(task_name="quarter_fold")
  → observation: task description + available anchor points + current state (empty)

step({"from": [0.5, 0], "to": [0.5, 1], "assignment": "V"})
  → observation: updated paper state + intermediate reward + available anchor points

step({"from": [0, 0.5], "to": [1, 0.5], "assignment": "V"})
  → observation: final shape + terminal reward, done=True
```
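The flow above can be mimicked with a tiny in-memory stub. This is a hypothetical sketch only (the real environment lives behind the server's `reset()`/`step()` and also returns per-step rewards), but it shows the `done` semantics:

```python
class StepEpisodeSketch:
    """Minimal multi-step episode: collects one crease per step and flips
    done=True once max_folds creases have been placed."""

    def __init__(self, max_folds: int = 2):
        self.max_folds = max_folds
        self.creases: list[dict] = []

    def reset(self) -> dict:
        self.creases = []
        return {"creases": [], "done": False}

    def step(self, crease: dict) -> dict:
        # crease: {"from": [x1, y1], "to": [x2, y2], "assignment": "M" or "V"}
        if crease["assignment"] not in ("M", "V"):
            raise ValueError("assignment must be 'M' or 'V'")
        self.creases.append(crease)
        return {
            "creases": list(self.creases),
            "done": len(self.creases) >= self.max_folds,
        }
```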
### V2 Reward (per-step, from optigami/env/rewards.py)

```python
total = (
    0.40 * progress            # fraction of target creases covered
    + 0.20 * delta             # improvement this step
    + 0.10 * kawasaki          # Kawasaki theorem compliance
    + 0.10 * maekawa           # Maekawa theorem compliance
    + 0.05 * blb               # BLB lemma compliance
    + 0.05 * economy           # penalty for excess creases
    + 0.05 * assignment_acc    # correct M/V types
    - 0.01 * step_penalty      # efficiency: finish in fewer steps
    + 10.0 * completion_bonus  # if progress > 0.9 and all geometry valid
)
```
This gives GRPO a gradient at every step, not just at the end.

### V2 Prompt (per-step, from optigami/env/prompts.py)

```
Target: quarter_fold — fold the paper into quarters

CURRENT STATE (step 1 of 5):
Creases placed: none

AVAILABLE ANCHOR POINTS:
Corners: (0,0) (1,0) (1,1) (0,1)
Midpoints: (0,0.5) (0.5,0) (1,0.5) (0.5,1)

Output the NEXT crease as JSON:
{"from": [x1, y1], "to": [x2, y2], "assignment": "M" or "V"}
```
### V2 Implementation Plan

**Phase 1: New environment server (modify `origami_server/`)**

1. Add a `PaperState` class to track the crease graph across steps (port from `optigami/env/paper_state.py` + `graph.py`)
2. Modify `OrigamiAction` in `models.py` to accept the single-crease format: `{"from": [...], "to": [...], "assignment": "M"|"V"}`
3. Modify `OrigamiEnvironment` in `environment.py` to:
   - Track `_paper_state: PaperState` between steps
   - Return `done=False` until `max_folds` is reached or a "stop" action arrives
   - Compute the per-step reward using the optigami reward formula
   - Include the current crease state + available anchor points in the observation
4. Keep backward compat: the single-step (complete FOLD JSON) mode still works as `mode='single'`

**Phase 2: Update training (modify `training/`)**

1. Update the `train_grpo.py` prompt to the step-level format (already in optigami)
2. Update `shape_match_reward` to accept the incremental observation — the final shape is only computed when `done=True`
3. Consider `max_folds` as a task parameter (e.g. triangle=1, quarter_fold=2, letter_fold=2)

**Phase 3: Add harder tasks**

From optigami's `server/tasks.py`, good candidates:
- `map_fold` — 8 folds, must be deployable (can unfold back flat)
- `waterbomb_base` — classic base requiring diagonal + perpendicular folds
- Custom tasks with `target_ratio` (compactness goals)

**Phase 4: Model upgrade**

optigami uses `Qwen2.5-VL-7B` (vision-language) — this could let the model SEE a rendered view of the current paper state as part of the observation. It is the highest-ceiling path but requires significant extra work.

---

## Important Constraints

- **OpenEnv API**: `reset()` and `step()` must return types matching `OrigamiObservation`. The FastAPI server is generated by `create_app(OrigamiEnvironment, OrigamiAction, OrigamiObservation)`. Changing the `OrigamiAction` shape requires updating models + server + client.
- **Modal image**: Adding new Python dependencies requires changing the `run_commands` block in `modal_train.py`. The image caches by content hash — changing deps triggers a full rebuild (~10 min).
- **Railway**: The env server auto-deploys from the `main` branch. `Dockerfile` + `requirements.txt` must stay in the repo root.
- **Unsloth quirk**: With `num_generations > per_device_train_batch_size`, Unsloth auto-bumps the batch size. Keep `num_generations=4` (current default) to avoid an 8× batch blowup.
- **Qwen3 thinking**: Always include `{"role": "system", "content": "/no_think"}` in prompts. Without it, `<think>` tokens fill the entire completion budget.

---
## HuggingFace Deployment

Two separate HF deployments are needed: the **env server** on HF Spaces, and the **trained model** on HF Hub.

### 1. Env Server → HF Spaces (Docker Space)

HF Spaces runs the `Dockerfile` automatically. The current `Dockerfile` is already compatible:
- Uses `${PORT:-8000}` — HF Spaces injects `PORT=7860` at runtime, so it auto-binds correctly
- No code changes needed to the server itself

**What needs to be added:**

`README.md` must have HF Spaces frontmatter (it was stripped during the Railway migration and needs to come back):

```yaml
---
title: Origami Env
emoji: 🦢
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
---
```

**HF Spaces constraints to design around in V2:**

| Constraint | Impact |
|---|---|
| **Stateless** — container restarts wipe memory | No durable in-memory episode state. `OrigamiEnvironment` must be fully reconstructable from the session ID alone. This is already true for V1 (no cross-request state), but V2 multi-step will need to store `PaperState` per session somewhere (a dict keyed by `episode_id`, or Redis). |
| **Free tier is CPU-only** | Simulation (`simulate.py`) is pure NumPy — fine on CPU. No GPU needed for the env server. |
| **No persistent disk** | Checkpoints live on the Modal volume, not HF. The env server doesn't need checkpoints. |
| **Cold starts** | The first request after inactivity spins up a fresh container. Health check endpoints (`/health`) are already present. |
| **`MAX_CONCURRENT_ENVS`** | Currently set to 16 in `Dockerfile`. On free-tier HF Spaces with limited RAM, lower this to 4-8 for V2 multi-step, since each session will hold a `PaperState` object in memory. |
**V2-specific concern — session state for multi-step:**

V1 is stateless between steps (single-shot, `done=True` after step 1). V2 multi-step is NOT stateless — `PaperState` (the evolving crease graph) must persist across `reset()` → `step()` → `step()` calls within an episode.

OpenEnv's `create_app()` already handles concurrent sessions via `session_id`. The `OrigamiEnvironment` instance is kept alive per session. This works fine on a single container. On HF Spaces with auto-scaling or restarts, a session mid-episode would be dropped. For the hackathon / demo use case this is acceptable — just document that episodes are tied to a single container lifetime.
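For the single-container case, a dict keyed by session id is enough. A hypothetical helper sketch, with an eviction guard standing in for the `MAX_CONCURRENT_ENVS` cap (state still vanishes with the container, as noted above):

```python
class SessionStore:
    """Per-session episode state for V2 multi-step on one container."""

    def __init__(self, max_envs: int = 8):
        self.max_envs = max_envs
        self._states: dict[str, object] = {}

    def get(self, session_id: str, factory):
        """Return the session's state, creating it via factory() on first use."""
        if session_id not in self._states:
            if len(self._states) >= self.max_envs:
                raise RuntimeError("too many concurrent sessions")
            self._states[session_id] = factory()  # e.g. PaperState()
        return self._states[session_id]

    def drop(self, session_id: str) -> None:
        """Free the session's state at episode end."""
        self._states.pop(session_id, None)
```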
**Deployment steps:**

```bash
# 1. Add README.md frontmatter (see above)

# 2. Push to the HF Space repo
git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/origami-env
git push hf main

# 3. Verify health
curl https://YOUR_USERNAME-origami-env.hf.space/health

# 4. Update the client.py base_url to the HF Space URL;
#    update the server_url default in modal_train.py if using an external server
```

---
### 2. Trained Model → HF Hub

After training on Modal, push the LoRA adapter (or merged model) to HF Hub so it's publicly usable.

**Option A: Push the LoRA adapter only** (small, ~300MB, requires the base model separately)

Add to the end of `training/train_grpo.py` after `model.save_pretrained(save_path)`:

```python
# Push to HF Hub
import os
hf_repo = os.environ.get("HF_REPO")  # e.g. "username/origami-qwen3-32b-lora"
if hf_repo:
    model.push_to_hub(hf_repo, token=os.environ["HF_TOKEN"])
    tokenizer.push_to_hub(hf_repo, token=os.environ["HF_TOKEN"])
    print(f"Model pushed to https://huggingface.co/{hf_repo}")
```

Add `HF_REPO` and `HF_TOKEN` to Modal secrets:
```bash
modal secret create huggingface HF_TOKEN=hf_xxx HF_REPO=username/origami-qwen3-32b-lora
```

Then reference the secret in `modal_train.py`:
```python
@app.function(
    image=image,
    gpu=GPU,
    timeout=TIMEOUT,
    volumes={OUTPUTS_DIR: volume},
    secrets=[modal.Secret.from_name("huggingface")],  # add this
)
```

**Option B: Merge the LoRA into the base + push** (large, ~65GB, self-contained)

```python
# After training, merge and push
if USE_UNSLOTH:
    merged = model.merge_and_unload()
    merged.push_to_hub(hf_repo, token=os.environ["HF_TOKEN"])
```

For the demo use case, Option A is fine. Option B is only needed if users will run inference without the base model available.

**HF Hub model card:**

The pushed repo needs a `README.md` model card. Minimum viable:

```markdown
---
base_model: unsloth/Qwen3-32B
tags:
  - lora
  - origami
  - rl
  - grpo
license: apache-2.0
---

# Origami Qwen3-32B LoRA

LoRA adapter trained with GRPO on origami crease pattern generation.
Tasks: triangle, half_fold, quarter_fold, letter_fold.
```

---
### 3. Full Deployment Topology

```
┌─────────────────────┐      ┌─────────────────────┐
│  HF Spaces          │      │  HF Hub             │
│  (env server)       │      │  (trained model)    │
│  Docker + CPU       │      │  LoRA adapter       │
│  /health /reset     │      │  ~300MB             │
│  /step /tasks       │      └─────────────────────┘
└────────┬────────────┘                ▲
         │ WebSocket                   │ push_to_hub()
         │ /ws                         │
         ▼                             │
┌─────────────────────┐      ┌────────┴────────────┐
│  Modal              │─────▶│  Modal Volume       │
│  B200 training      │      │  origami-checkpoints│
│  GRPO + Unsloth     │      │  checkpoint-N/      │
└─────────────────────┘      └─────────────────────┘
```
---

## Environment Setup

```bash
# Install deps
pip install -r requirements.txt

# Start the env server locally
uvicorn origami_server.app:app --host 0.0.0.0 --port 8000

# Run training locally (small model for testing)
python -m training.train_grpo --model unsloth/Qwen2.5-3B-Instruct --max_steps 50

# Deploy to Modal (B200)
modal run modal_train.py
```

## Quick Verification

```bash
# Check the env server is healthy
curl http://localhost:8000/health

# Check tasks
curl http://localhost:8000/tasks

# Submit a fold manually (JSON content type so FastAPI parses the body)
curl -X POST http://localhost:8000/reset \
  -H 'Content-Type: application/json' \
  -d '{"task_name": "triangle"}'
```
PLAN_V2.md (added)
# V2 Implementation Plan — Multi-Step Origami Episodes

## Goal
Upgrade from single-shot episodes (complete FOLD JSON in one action) to multi-step episodes
(one crease per step, per-step reward, evolving paper state in the observation).

## Key Design Decision
Training stays compatible with `GRPOTrainer`. Each training sample is still a single
(prompt → completion → reward) tuple. The difference: the prompt now shows the current
paper state (initially empty) and the completion is a single crease JSON, not a full FOLD.
The V2 MVP trains on step-0 only (empty paper). At inference, steps are chained sequentially.

---
## Step 1 — Add `shapely` to requirements
**File:** `requirements.txt`

Add `shapely>=2.0` to requirements. `PaperState` uses Shapely for intersection detection and
bounds clipping. This is the only new Python dependency.

Also add `shapely>=2.0` to the `run_commands` block in `modal_train.py` so the Modal image
includes it. Changing `run_commands` triggers a full image rebuild (~10 min), so do this step
first.

---

## Step 2 — Port `CreaseGraph` → `origami_server/engine/graph.py`
**New file:** `origami_server/engine/graph.py`
**Source:** `optigami/env/graph.py` (direct port, minimal changes)

Copy verbatim. No changes needed — the class is self-contained, pure Python + numpy.

```
CreaseGraph:
- Pre-initializes unit-square corners (4 vertices) + boundary edges (4 B edges)
- add_vertex(x, y): deduplicates by proximity (VERTEX_TOL = 1e-9)
- add_edge(v1, v2, assignment): idempotent
- split_edge(edge_id, new_vertex_id): for intersection handling
- get_cyclic_edges(vertex_id): sorted by angle (used in verifier)
- interior_vertices(): vertices not on boundary
- crease_edges(): edges with assignment M or V
- boundary_midpoints(): midpoints of B edges
```
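The load-bearing behavior here is `add_vertex`'s deduplication. A minimal sketch of just that piece, using a hypothetical `VertexPool` with the same `VERTEX_TOL` (the real class also tracks edges, splits, and boundary bookkeeping):

```python
import math

VERTEX_TOL = 1e-9  # proximity threshold, matching the ported CreaseGraph

class VertexPool:
    """Only the vertex-dedup behavior of CreaseGraph.add_vertex."""

    def __init__(self):
        self.vertices: list[tuple[float, float]] = []

    def add_vertex(self, x: float, y: float) -> int:
        # Reuse an existing vertex id if one lies within VERTEX_TOL.
        for i, (vx, vy) in enumerate(self.vertices):
            if math.hypot(vx - x, vy - y) < VERTEX_TOL:
                return i
        self.vertices.append((x, y))
        return len(self.vertices) - 1
```

Deduplication is what keeps a crease endpoint placed twice (e.g. by two different folds) from spawning two near-coincident graph vertices.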
---

## Step 3 — Port `PaperState` → `origami_server/engine/paper_state.py`
**New file:** `origami_server/engine/paper_state.py`
**Source:** `optigami/env/paper_state.py` (direct port, minimal changes)

Copy verbatim. PaperState:
- Wraps CreaseGraph
- `add_crease(p1, p2, assignment)` — validates, clips to the unit square, finds intersections,
  splits existing edges, adds new waypoint edges
- `anchor_points()` — corners + all current vertices
- `crease_edges()` — returns a list of dicts for serialization

One change from optigami: make paper dimensions configurable (default 1×1). The existing tasks
all use 1×1 paper, so this is not urgent for the V2 MVP.
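The clipping part of `add_crease` can also be illustrated without Shapely. A pure-Python Liang-Barsky sketch of clipping a crease segment to the unit square (a stand-in with assumed semantics, not the Shapely-based implementation being ported):

```python
def clip_to_unit_square(p1, p2):
    """Liang-Barsky clip of segment p1->p2 to the unit square [0, 1]^2.
    Returns the clipped ((x1, y1), (x2, y2)), or None if the segment lies
    entirely outside."""
    (x1, y1), (x2, y2) = p1, p2
    dx, dy = x2 - x1, y2 - y1
    t0, t1 = 0.0, 1.0
    # Each (p, q) pair is one boundary: left, right, bottom, top.
    for p, q in ((-dx, x1), (dx, 1.0 - x1), (-dy, y1), (dy, 1.0 - y1)):
        if p == 0.0:
            if q < 0.0:
                return None  # parallel to this boundary and outside it
        else:
            t = q / p
            if p < 0.0:
                t0 = max(t0, t)  # entering intersection
            else:
                t1 = min(t1, t)  # leaving intersection
    if t0 > t1:
        return None
    return ((x1 + t0 * dx, y1 + t0 * dy), (x1 + t1 * dx, y1 + t1 * dy))
```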
---
## Step 4 — Port the step reward → `origami_server/engine/step_reward.py`
**New file:** `origami_server/engine/step_reward.py`
**Sources:** `optigami/env/rewards.py` + `optigami/env/verifier.py`

Port `compute_reward()` and its dependencies:
- `target_crease_edges(target)` — extract M/V creases from the FOLD target dict
- `check_all_vertices(graph)` — Kawasaki, Maekawa, BLB at all interior vertices
- `check_degree_sanity(graph)` — even crease count at interior vertices
- `geometric_crease_coverage(paper_state, target_edges)` — progress, economy, assignment_accuracy

The verifier functions (`_kawasaki_ok`, `_maekawa_ok`, `_blb_ok`) already exist in
`training/reward.py`, but they operate on raw FOLD JSON. The step_reward versions operate on
`CreaseGraph` directly. Keep both — they serve different purposes.

`compute_reward` signature:
```python
def compute_reward(
    prev_state: PaperState,
    action_result: dict,   # from PaperState.add_crease()
    new_state: PaperState,
    target: dict,          # FOLD task target dict
    step: int,
    max_steps: int,
) -> dict:
    # Returns a dict with keys:
    #   format, anchored, novelty, kawasaki, maekawa, blb, degree_sanity,
    #   progress, economy, assignment_accuracy, delta, regression,
    #   completion, efficiency, total
```

Weights (from optigami, validated):
```
total = (
    0.05 * anchored
    + 0.05 * novelty
    + 0.06 * kawasaki + 0.06 * maekawa + 0.04 * blb + 0.04 * degree_sanity
    + 0.25 * progress
    + 0.05 * economy + 0.05 * assignment_accuracy
    + 0.20 * delta
    + 0.10 * regression
    + completion   # 10.0 if progress > 0.9 and all geometry valid
    + efficiency   # -0.01 * (1 + step/max_steps)
)
```

The 10.0× completion bonus is the primary learning signal for hard tasks.
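The `progress` term can be sketched as endpoint matching against the target crease list. This is assumed semantics for illustration only — the real `geometric_crease_coverage` does geometric coverage over `PaperState`, not bare endpoint equality:

```python
def crease_progress(placed, target, tol=1e-6):
    """Fraction of target creases covered: a target crease counts as covered
    if some placed crease matches both of its endpoints (in either
    orientation) within tol. placed/target: lists of ((x1, y1), (x2, y2))."""
    def close(p, q):
        return abs(p[0] - q[0]) < tol and abs(p[1] - q[1]) < tol

    def covered(t):
        return any(
            (close(p[0], t[0]) and close(p[1], t[1]))
            or (close(p[0], t[1]) and close(p[1], t[0]))
            for p in placed
        )

    return sum(map(covered, target)) / len(target) if target else 1.0
```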
---

## Step 5 — Update `origami_server/models.py`
**File:** `origami_server/models.py`

Three changes:

### OrigamiAction
Make `fold_data` optional (backward compat) and add `crease` for V2:
```python
class OrigamiAction(Action):
    fold_data: dict[str, Any] | None = Field(
        default=None, description="V1: complete FOLD-format crease pattern"
    )
    crease: dict[str, Any] | None = Field(
        default=None,
        description='V2: single crease {"from": [x,y], "to": [x,y], "assignment": "M"|"V"}'
    )
```

The server validates that exactly one of `fold_data` or `crease` is set.
| 132 |
+
### OrigamiObservation
|
| 133 |
+
Add V2 fields (keep all V1 fields for backward compat):
|
| 134 |
+
```python
|
| 135 |
+
class OrigamiObservation(Observation):
|
| 136 |
+
# V1 fields (unchanged)
|
| 137 |
+
task: dict[str, Any] = Field(default_factory=dict)
|
| 138 |
+
fold_data: dict[str, Any] = Field(default_factory=dict)
|
| 139 |
+
final_positions: list[list[float]] = Field(default_factory=list)
|
| 140 |
+
target_positions: list[list[float]] = Field(default_factory=list)
|
| 141 |
+
shape_similarity: float = 0.0
|
| 142 |
+
max_strain: float = 0.0
|
| 143 |
+
is_stable: bool = True
|
| 144 |
+
error: Optional[str] = None
|
| 145 |
+
# V2 fields (new)
|
| 146 |
+
step_count: int = 0
|
| 147 |
+
max_steps: int = 1
|
| 148 |
+
current_creases: list[dict] = Field(default_factory=list) # placed so far
|
| 149 |
+
anchor_points: list[list[float]] = Field(default_factory=list)
|
| 150 |
+
reward_breakdown: dict[str, float] = Field(default_factory=dict)
|
| 151 |
+
```
|
| 152 |
+
|
| 153 |
+
### OrigamiState
|
| 154 |
+
Add mode and step tracking:
|
| 155 |
+
```python
|
| 156 |
+
class OrigamiState(State):
|
| 157 |
+
task_name: str = ""
|
| 158 |
+
mode: str = "single" # "single" | "step"
|
| 159 |
+
step_count: int = 0
|
| 160 |
+
shape_similarity: float = 0.0
|
| 161 |
+
is_stable: bool = True
|
| 162 |
+
```
|
| 163 |
+
|
| 164 |
+
---

## Step 6 — Update `origami_server/environment.py`
**File:** `origami_server/environment.py`

This is the biggest change: add a `mode` parameter and the multi-step logic.

### Constructor
```python
def __init__(self, mode: str = "step", **kwargs):
    # mode: "step" (V2 default) | "single" (V1 backward compat)
    self._mode = mode
    self._paper_state: PaperState | None = None
    self._step_reward_prev: PaperState | None = None  # for delta computation
    # ... existing fields
```

### reset()
In step mode, initialize `_paper_state = PaperState()` and return an initial observation
with empty `current_creases`, all anchor points, and `done=False`.

In single mode, behavior is unchanged from V1.

Read `max_folds` from the task definition (a new task field, see Step 7).

### step() — V2 path (when `action.crease` is set)
```python
# 1. Parse the crease
crease = action.crease  # {"from": [x,y], "to": [x,y], "assignment": "M"|"V"}

# 2. Validate
if crease["assignment"] not in ("M", "V"):
    ...  # return an error observation with reward=-0.1 (construction elided)

# 3. Apply the crease to the paper state
import copy

prev_state = copy.deepcopy(self._paper_state)
result = self._paper_state.add_crease(
    crease["from"], crease["to"], crease["assignment"]
)

# 4. Compute the per-step reward
self._state.step_count += 1
reward_dict = compute_reward(
    prev_state=prev_state,
    action_result=result,
    new_state=self._paper_state,
    target=self._task,
    step=self._state.step_count,
    max_steps=self._task["max_folds"],
)

# 5. Check whether the episode is done
done = (
    self._state.step_count >= self._task["max_folds"]
    or reward_dict.get("completion", 0) > 0
)

# 6. Return the observation
return OrigamiObservation(
    done=done,
    reward=reward_dict["total"],
    task=self._task_info(),
    fold_data={},  # empty in step mode
    final_positions=[],  # only populated when done=True
    target_positions=self._target_positions.tolist(),
    shape_similarity=reward_dict.get("progress", 0.0),
    max_strain=0.0,
    is_stable=True,
    error=None,
    step_count=self._state.step_count,
    max_steps=self._task["max_folds"],
    current_creases=self._paper_state.crease_edges(),
    anchor_points=[[x, y] for x, y in self._paper_state.anchor_points()],
    reward_breakdown=reward_dict,
)
```

When `done=True`, optionally run the full simulation (`simulate()`) to populate
`final_positions` and `shape_similarity` — useful for the viewer but not required for training.

### step() — V1 path (when `action.fold_data` is set)
Identical to the current V1 implementation. No changes.

---

## Step 7 — Update `origami_server/tasks.py`
**File:** `origami_server/tasks.py`

Two changes:

### Add `max_folds` to all existing tasks
```
"triangle":     max_folds=1
"half_fold":    max_folds=1
"quarter_fold": max_folds=2
"letter_fold":  max_folds=2
```

`max_folds` is the maximum number of `step()` calls before `done=True`.

### Add two harder tasks

**`waterbomb_base`** (difficulty 3, max_folds=4):
Two diagonal valley folds (corner to corner) plus two perpendicular valley folds (midpoint to
midpoint). A classic base that requires all four folds to be correct simultaneously.

```
target_fold: 9 vertices (4 corners + 4 midpoints + 1 center), 8 crease edges (all V)
Creases:
  (0,0)→(1,1)     V (diagonal)
  (1,0)→(0,1)     V (diagonal)
  (0.5,0)→(0.5,1) V (vertical)
  (0,0.5)→(1,0.5) V (horizontal)
```
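As a concrete sketch, the `waterbomb_base` entry could be declared like this (field names are illustrative, not the actual `tasks.py` schema):

```python
# Hypothetical task entry mirroring the four creases listed above,
# all valley ("V") folds on the unit square.
WATERBOMB_BASE = {
    "description": "waterbomb base",
    "difficulty": 3,
    "max_folds": 4,
    "creases": [
        {"from": [0.0, 0.0], "to": [1.0, 1.0], "assignment": "V"},  # diagonal
        {"from": [1.0, 0.0], "to": [0.0, 1.0], "assignment": "V"},  # diagonal
        {"from": [0.5, 0.0], "to": [0.5, 1.0], "assignment": "V"},  # vertical
        {"from": [0.0, 0.5], "to": [1.0, 0.5], "assignment": "V"},  # horizontal
    ],
}
```

Keeping `max_folds` equal to the number of target creases gives the policy exactly enough steps to finish with no slack.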

**`map_fold`** (difficulty 4, max_folds=8):
Accordion fold into 4 strips horizontally plus 4 strips vertically (8 total creases,
alternating M/V). The most demanding task in V2.

```
target_fold: Creases at y=0.25, 0.5, 0.75 (alternating V/M/V) + x=0.25, 0.5, 0.75 (V/M/V)
plus corner diagonals for proper map-fold behavior
```

Add a `get_task_for_step_mode(name)` helper that returns the task with `max_folds` validated.
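One possible shape for that helper (a sketch only; it assumes a module-level `TASKS` dict keyed by task name, which may not match the real `tasks.py` layout):

```python
# Minimal stand-in registry so the sketch is self-contained.
TASKS = {
    "triangle": {"max_folds": 1},
    "waterbomb_base": {"max_folds": 4},
}

def get_task_for_step_mode(name: str) -> dict:
    task = TASKS[name]  # raises KeyError for unknown task names
    max_folds = task.get("max_folds")
    if not isinstance(max_folds, int) or max_folds < 1:
        raise ValueError(f"task {name!r} needs a positive integer max_folds")
    return task
```

Validating here (rather than in `environment.reset()`) surfaces a missing `max_folds` as soon as a task is selected.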

---

## Step 8 — Update `training/reward.py`
**File:** `training/reward.py`

### Add a `valid_crease()` reward function
A new reward for the V2 single-crease format:
```python
def valid_crease(completions: list, **kwargs) -> list[float]:
    """V2: Does the LLM output parse as valid single-crease JSON?

    +1.0  valid {"from": [x,y], "to": [x,y], "assignment": "M"|"V"}
    -0.5  parseable JSON but missing fields or wrong types
    -2.0  not parseable JSON
    """
```

### Add an `extract_crease_json()` helper
```python
def extract_crease_json(response: str) -> dict | None:
    """Extract single-crease JSON from an LLM response.
    Looks for a {"from": ..., "to": ..., "assignment": ...} object.
    """
```

Keep all existing V1 functions (`valid_fold`, `flat_foldable_reward`, `extract_fold_json`)
unchanged for backward compatibility.

---

## Step 9 — Update `training/train_grpo.py`
**File:** `training/train_grpo.py`

### New prompt template
Replace `PROMPT_TEMPLATE` with a step-level format. The key difference: no FOLD-format fields
are listed; the model is simply asked to output the next crease as JSON:

```python
STEP_PROMPT_TEMPLATE = """You are an origami designer. Add the next fold crease.

Target: {description}
Paper: {width} × {height} unit square

CURRENT STATE (step {step} of {max_folds}):
Creases placed: {crease_history}

AVAILABLE ANCHOR POINTS:
Corners: {corners}
Boundary pts: {boundary_pts}
Intersections: {intersections}

Flat-foldability rules at every interior vertex:
- Kawasaki: alternating sector angles each sum to 180°
- Maekawa: |mountain_count - valley_count| = 2
- BLB (big-little-big): the smallest sector is bounded by opposite M/V types

Output ONLY this JSON (no explanation):
{{"from": [x1, y1], "to": [x2, y2], "assignment": "M" or "V"}}"""
```

For the V2 MVP (step-0 training): `step=0`, `crease_history="none"`, and the anchor points are
the corners plus edge midpoints.
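As a sanity check, filling the template at step 0 might look like this (values illustrative; an abbreviated copy of the template keeps the snippet self-contained — note the doubled braces that survive `str.format()` as the literal output JSON):

```python
TEMPLATE = (
    "Add the next fold crease.\n"
    "Target: {description}\n"
    "CURRENT STATE (step {step} of {max_folds}):\n"
    "Creases placed: {crease_history}\n"
    "Corners: {corners}\n"
    'Output ONLY this JSON: {{"from": [x1, y1], "to": [x2, y2], "assignment": "M" or "V"}}'
)

prompt = TEMPLATE.format(
    description="fold the square in half",        # from the task definition
    step=0,
    max_folds=1,
    crease_history="none",                        # empty paper at step 0
    corners="[0,0], [1,0], [1,1], [0,1]",
)
```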

### New `per_step_reward()` function
Replace `shape_match_reward`:
```python
# Assumed module-level handle so launch_openenv can reuse a running server.
port, openenv_process = None, None

def per_step_reward(completions, task_name, **kwargs):
    global port, openenv_process
    scores = []
    for completion, tname in zip(completions, task_name):
        response = completion[0]["content"]
        crease = extract_crease_json(response)
        if crease is None:
            scores.append(-2.0)
            continue
        try:
            port, openenv_process = launch_openenv(port, openenv_process)
            openenv_process.reset(task_name=tname)
            result = openenv_process.step(OrigamiAction(crease=crease))
            scores.append(result.reward if result.reward is not None else 0.0)
        except TimeoutError:
            scores.append(-1.0)
        except Exception:
            scores.append(-2.0)
    return scores
```

### Updated reward function list
```python
trainer = GRPOTrainer(
    reward_funcs=[valid_crease, per_step_reward],  # flat_foldable_reward removed (the server handles it now)
    ...
)
```

### Updated task list
Add the new tasks to `ALL_TASKS`:
```python
ALL_TASKS = ["triangle", "half_fold", "quarter_fold", "letter_fold", "waterbomb_base", "map_fold"]
```

### Updated GRPO config
Decrease `max_completion_length`, since a single-crease JSON is short (~50 tokens), and raise
`max_steps` for the harder tasks:
```python
max_prompt_length=512,
max_completion_length=128,  # single crease JSON is ~50 tokens
max_steps=1200,  # more steps since the tasks are harder
```

---

## Step 10 — Update `client.py`
**File:** `client.py`

Minor update: `_step_payload` already calls `action.model_dump()`. With the optional fields,
the dump naturally includes `crease` or `fold_data`, whichever is set. No change is needed
unless OpenEnv has strict serialization requirements.

If OpenEnv rejects `None` fields, filter them out:
```python
def _step_payload(self, action: OrigamiAction) -> Dict[str, Any]:
    return {k: v for k, v in action.model_dump().items() if v is not None}
```

---

## Implementation Order

Execute in this order to minimize broken intermediate states:

```
 1. requirements.txt + modal_train.py    (deps first, triggers image rebuild)
 2. origami_server/engine/graph.py       (new file, no dependencies)
 3. origami_server/engine/paper_state.py (depends on graph.py)
 4. origami_server/engine/step_reward.py (depends on paper_state.py)
 5. origami_server/models.py             (API types — do before environment)
 6. origami_server/environment.py        (depends on models + paper_state + step_reward)
 7. origami_server/tasks.py              (add max_folds + new tasks)
 8. training/reward.py                   (new valid_crease, extract_crease_json)
 9. training/train_grpo.py               (new prompts + per_step_reward)
10. client.py                            (minor defensive fix)
```

After step 7: run `curl http://localhost:8000/tasks` and verify the new tasks appear.
After step 9: run a single training step locally with `--model unsloth/Qwen2.5-3B-Instruct --max_steps 5`
to verify the reward functions fire.

---

## Files Changed / Created

| File | Status | Notes |
|------|--------|-------|
| `requirements.txt` | modified | add shapely>=2.0 |
| `modal_train.py` | modified | add shapely to run_commands |
| `origami_server/engine/graph.py` | **new** | port from optigami |
| `origami_server/engine/paper_state.py` | **new** | port from optigami |
| `origami_server/engine/step_reward.py` | **new** | port from optigami |
| `origami_server/models.py` | modified | `OrigamiAction.crease` field, V2 observation fields |
| `origami_server/environment.py` | modified | multi-step mode |
| `origami_server/tasks.py` | modified | max_folds + waterbomb_base + map_fold |
| `training/reward.py` | modified | valid_crease + extract_crease_json |
| `training/train_grpo.py` | modified | step prompt + per_step_reward |
| `client.py` | modified | optional fields in _step_payload |

**V1 backward compat preserved:** all existing API routes, observation fields, and reward
functions remain unchanged, and `mode="single"` continues to work for existing training runs.

---

## Risk Notes

- **shapely requirement**: `PaperState` intersection detection uses shapely. If the
  Railway/Modal build fails, we can fall back to numpy-only intersection code (more code, but
  it avoids the dependency). Test locally first with `pip install shapely`.

- **OrigamiAction change**: making `fold_data` optional changes the schema but is not breaking
  in practice: any existing V1 client that always sets `fold_data` continues to work, since
  pydantic accepts a value for an optional field.

- **Step-0-only training**: the V2 MVP trains exclusively from the empty paper state (step 0).
  The model learns "the first crease for task X" but never trains on step 1+. Chained inference
  (running multiple steps at eval time) may therefore degrade from step 2 onward, because the
  policy never saw non-empty paper states. Acceptable for the V2 MVP — a future V3 adds episode
  rollout collection to the training loop.

- **Completion bonus scale**: the 10.0× completion bonus means episodes where the model hits
  >90% coverage with valid geometry will dominate the reward signal. For easy tasks (triangle,
  half_fold) this will happen quickly; for map_fold it may never happen in early training.
  Consider starting the first training run with only triangle/waterbomb_base.
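The numpy-only fallback mentioned in the first risk note can be built on the standard parametric segment-intersection test, sketched here (generic computational geometry, not the optigami code):

```python
import numpy as np

def segment_intersection(p1, p2, p3, p4, eps=1e-9):
    """Return the intersection point of segments p1-p2 and p3-p4, or None.

    Solves p1 + t*(p2 - p1) == p3 + u*(p4 - p3) and accepts the point only
    when both parameters lie in [0, 1]. Parallel/collinear pairs return None.
    """
    p1, p2, p3, p4 = (np.asarray(p, dtype=float) for p in (p1, p2, p3, p4))
    r, s = p2 - p1, p4 - p3
    denom = r[0] * s[1] - r[1] * s[0]  # 2-D cross product
    if abs(denom) < eps:
        return None  # parallel or collinear: no unique crossing point
    qp = p3 - p1
    t = (qp[0] * s[1] - qp[1] * s[0]) / denom
    u = (qp[0] * r[1] - qp[1] * r[0]) / denom
    if -eps <= t <= 1 + eps and -eps <= u <= 1 + eps:
        return p1 + t * r
    return None
```

Enumerating pairwise crossings of the crease set with this function would replace the shapely dependency at the cost of handling collinear overlaps separately.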
|