# Optigami Research Notes
Comprehensive notes on all sources, tools, and architecture for the Optigami project.
---
## Table of Contents
1. [Project Architecture Overview](#1-project-architecture-overview)
2. [Paper: OrigamiSpace (2511.18450)](#2-paper-origamispace-251118450)
3. [Paper: SpatialThinker (2511.07403)](#3-paper-spatialthinker-251107403)
4. [Paper: Automating Rigid Origami Design (2211.13219)](#4-paper-automating-rigid-origami-design-221113219)
5. [Tool: FOLD Format (edemaine/fold)](#5-tool-fold-format)
6. [Tool: Origami Simulator](#6-tool-origami-simulator)
7. [Tool: GamiBench](#7-tool-gamibench)
8. [Tool: SpatialThinker Codebase](#8-tool-spatialthinker-codebase)
9. [Tool: Trackio](#9-tool-trackio)
10. [Tool: Unsloth + GRPO Training](#10-tool-unsloth--grpo-training)
11. [Unsloth ART / GRPO Trainer Plan](#11-unsloth-art--grpo-trainer-plan)
12. [GitHub Reference Repo (ianalin123/optigami)](#12-github-reference-repo-ianalin123optigami)
---
## 1. Project Architecture Overview
```
+---------------------------------------------------+
| OpenEnv Server |
| +-----------+ +----------+ +--------------+ |
| | State | | Action | | Reward | |
| | (FOLD JSON| | (LLM | | (Dense, | |
| | + target)| | output) | | verifiable) | |
| +-----------+ +----------+ +--------------+ |
| | | | |
| v v v |
| +-----------------------------------------------+|
| | Paper Geometry Engine (Python) ||
| | - Polygon state (Shapely) ||
| | - Fold operations (reflection across line) ||
| | - Kawasaki/Maekawa constraint checks ||
| | - Layer tracking ||
| | - FOLD format import/export ||
| +-----------------------------------------------+|
| | |
| v |
| +-----------------------------------------------+|
| | Three.js Visualizer (Demo only) ||
| | - 3D fold animation ||
| | - Strain heatmap ||
| | - Instruction stream ||
| +-----------------------------------------------+|
+---------------------------------------------------+
| ^
v |
+---------------------------------------------------+
| Unsloth ART / GRPO Trainer |
| - Qwen2.5-VL-7B or Qwen3-4B base model |
| - LoRA/QLoRA for efficient training |
| - Multi-turn rollouts |
+---------------------------------------------------+
```
**Three major components:**
1. **OpenEnv Server** - RL environment serving state/action/reward for origami folding
2. **Paper Geometry Engine** - Python-based origami math (Shapely polygons, fold reflections, constraint checking)
3. **Unsloth ART / GRPO Trainer** - RL fine-tuning of vision-language models for origami reasoning
**Current focus:** Unsloth ART / GRPO Trainer
---
## 2. Paper: OrigamiSpace (2511.18450)
**Title:** ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints
**Authors:** Rui Xu, Dakuan Lu, Zicheng Zhao, Xiaoyu Tan, Xintao Wang, Siyu Yuan, Jiangjie Chen, Yinghui Xu
**Date:** November 23, 2025
**Venue:** arXiv (cs.AI)
### Dataset
- **350 primary instances** + 471 auxiliary (without folding processes)
- Each instance: CP diagram, compiled flat pattern, folding process (multi-step images), final 3D shape
- Complexity: Easy (3-9 steps), Medium (10-19), Hard (20-30), avg 8.2 steps
- **1,620 total questions** across 4 tasks
### Four Evaluation Tasks
| Task | Questions | Description |
|------|-----------|-------------|
| Pattern Prediction | 350 | CP diagram -> predict final 3D shape (multiple choice) |
| Multi-step Spatial Reasoning | 250 | Shuffled fold images -> correct chronological sequence |
| Spatial Relationship Prediction | 900 | 3 subtypes: pose localization, layering analysis, geometric change |
| End-to-End CP Code Generation | 120 | Flat layout + folded shape -> generate CP code |
### Compiler Architecture (Critical for OpenEnv)
Four-category error feedback system:
1. **CSE (CP Code Syntax Error):** Validates vertices, edges, faces, crease types; checks Euler's formula V-E+F=2
2. **GIF (Geometrically Impossible Fold):** Maekawa's theorem |M-V|=2, Kawasaki's theorem sum(alpha_i)=2pi, Big-Little-Big angle constraint
3. **PSI (Paper Self-Intersection):** Cyclic layering, collision detection (discrete + CCD), octrees/BVHs
4. **AFS (Ambiguous Folding State):** Multiple valid M/V assignments, non-unique stacking
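Not from the paper's code — a minimal sketch of how this four-way classification could be wired up in our own compiler, with only the cheap CSE/GIF checks filled in (`CompileError` and `check_cp` are our names; PSI/AFS need full layer and collision analysis and are stubbed out):

```python
import math
from enum import Enum

class CompileError(Enum):
    CSE = "CP Code Syntax Error"
    GIF = "Geometrically Impossible Fold"
    PSI = "Paper Self-Intersection"
    AFS = "Ambiguous Folding State"

def check_cp(num_vertices, num_edges, num_faces, mountains, valleys, angles):
    """Return the first error category triggered, or None if the CP passes.

    Only the cheap checks are sketched: Euler's formula (CSE) and the
    Maekawa/Kawasaki theorems for a single interior vertex (GIF).
    """
    # CSE: planar crease pattern must satisfy Euler's formula V - E + F = 2
    if num_vertices - num_edges + num_faces != 2:
        return CompileError.CSE
    # GIF: Maekawa's theorem |M - V| = 2 at a flat-foldable interior vertex
    if abs(mountains - valleys) != 2:
        return CompileError.GIF
    # GIF: Kawasaki's theorem -- alternating sector angles sum to pi
    if not math.isclose(sum(angles[0::2]), math.pi, abs_tol=1e-9):
        return CompileError.GIF
    return None
```

For example, a square with one interior vertex and four creases at right angles (V=5, E=8, F=5 counting the outer face, M=3, V=1) passes all checks.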
### CP Code Evaluation (4 dimensions, 0.25 weight each)
1. **Topological Structure Similarity (TSS):** Vertex/edge/face count comparison, s_v = e^(-0.5|V_gen - V_ref| / min(V_gen, V_ref))
2. **Geometric Similarity (GS):** Hausdorff distance, s_p = e^(-5 * d_H), dihedral angle distribution, aspect ratio
3. **Constraint Satisfaction (CS):** Taco-Taco, Taco-Tortilla, transitivity, Maekawa/Kawasaki
4. **Final Folded State (FFS):** Shape similarity, layering comparison, stacking order
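The two exponential kernels quoted above can be transcribed directly; a small sketch with our own function names:

```python
import math

def vertex_count_similarity(v_gen: int, v_ref: int) -> float:
    """TSS sub-score: s_v = exp(-0.5 * |V_gen - V_ref| / min(V_gen, V_ref))."""
    return math.exp(-0.5 * abs(v_gen - v_ref) / min(v_gen, v_ref))

def hausdorff_similarity(d_h: float) -> float:
    """GS sub-score: s_p = exp(-5 * d_H) for a normalized Hausdorff distance."""
    return math.exp(-5.0 * d_h)
```

Both kernels map into (0, 1] and equal 1.0 on an exact match, which keeps every reward component on a comparable scale.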
### Learning Approaches
- **In-Context Learning:** Single-pass, detailed instructions + examples
- **Environmental Learning:** Iterative model<->compiler loop, max 10 rounds, performance saturates after 8-10
- **Reinforcement Learning (TRICO/PPO-based):**
- Training data: 471 instances from environmental learning
- Model: Qwen2.5-VL-32B
- **Rewards:** Intermediate (success bonus + quality progress), step penalty, final evaluation score
- Result: RL-trained 32B exceeded 72B baseline
### Key Results
- Best closed-source: GPT-4o (42.71% pattern), Gemini 2.5 Pro (53.45% multi-step)
- Best open-source: Qwen2.5-VL-72B (36.29% pattern, 39.10% multi-step)
- Expert human: 98.45% pattern, 100% multi-step
- **Constraint satisfaction is the primary bottleneck** (~30% for top models)
- Human-model gap: 20-45 percentage points
### Relevance to Optigami
- **Direct blueprint for our OpenEnv server**: the compiler architecture with 4 error types is exactly what we need
- The CP code evaluation framework (TSS/GS/CS/FFS) can be our reward function
- Environmental learning approach maps to multi-turn rollouts in GRPO
- Confirms Qwen2.5-VL as viable base model (they used 32B, we target 7B)
---
## 3. Paper: SpatialThinker (2511.07403)
**Title:** SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards
**Authors:** Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark
**Date:** November 10, 2025
**Venue:** NeurIPS 2025 Workshops (SpaVLE, EWM, ARLET, SEA)
### Core Innovation
Dense spatial rewards + GRPO for training Qwen2.5-VL on spatial reasoning tasks. Key insight: **sparse rewards lead to reward hacking; dense multi-objective rewards with lexicographic gating prevent this.**
### GRPO Training Configuration
- **Rollouts:** 8 samples per query, temperature 1.0
- **Batch size:** rollout=512, global=128
- **Training:** 75 steps (~5 episodes)
- **Hardware:** 4x NVIDIA H100 80GB
- **Time:** ~13h (3B), ~15h (7B)
- **Advantage:** A(i) = (r(i) - mu) / (sigma + epsilon), epsilon=1e-6
- **Loss:** PPO-style with clip(epsilon_l=0.2, epsilon_h=0.3), KL penalty beta=0.01
### Dense Spatial Reward Design (CRITICAL - template for our rewards)
**4-component reward with lexicographic gating:**
```
R_total = I[R_format=1] * (w_format*R_f + w_count*R_c + w_accuracy*R_a + I[R_accuracy=1]*w_spatial*R_s)
```
| Component | Weight | Description |
|-----------|--------|-------------|
| Format (R_f) | 0.1 | JSON-parseable scene graph with required fields |
| Count (R_c) | 0.2 | Penalizes deviation in object/relation counts (lambda_obj=0.7, lambda_rel=0.3) |
| Accuracy (R_a) | 0.5 | Binary exact string match |
| Spatial (R_s) | 0.2 | Hungarian matching with CIoU, activated ONLY when answer correct |
**Lexicographic gating is essential:** format compliance gates all rewards; spatial rewards only activate on correct answers. Without gating, severe reward hacking occurs (74.9% -> 23.7% with naive spatial rewards).
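A minimal transcription of this gating rule using the weights from the table (the function name is ours, and the scalar inputs are a simplification; the real R_count and R_spatial are themselves computed scores):

```python
def spatial_reward(r_format, r_count, r_accuracy, r_spatial):
    """R_total with lexicographic gating, per the weights in the table above.

    Format failure zeroes everything; the spatial term only activates
    when the answer is exactly correct (r_accuracy == 1).
    """
    if r_format != 1:
        return 0.0
    total = 0.1 * r_format + 0.2 * r_count + 0.5 * r_accuracy
    if r_accuracy == 1:
        total += 0.2 * r_spatial  # gated: unreachable on wrong answers
    return total
```

With this structure a model cannot farm the spatial term while answering incorrectly, which is exactly the failure mode the paper reports for naive dense rewards.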
### STVQA-7K Dataset
- 7,587 spatial VQA pairs from Visual Genome scene graphs
- Generated by Claude Sonnet, validated by GPT-4o pass@2
- 9 spatial categories, 34 additional spatial predicates beyond standard VG150
- 90/10 train/val split
### Key Results
- SpatialThinker-7B surpasses GPT-4o on 3DSRBench by +12.1%
- Dense reward RL: +7.2% avg across 12 benchmarks (1.8x the +4.0% from sparse GRPO)
- Outperforms models trained on millions of samples (trained on only 7K)
### Relevance to Optigami
- **Direct template for our GRPO training pipeline**
- Dense reward design with lexicographic gating prevents reward hacking
- Proves Qwen2.5-VL-7B is excellent base for spatial reasoning RL
- veRL/EasyR1 framework for training infrastructure
- Shows 7K samples sufficient for strong results
---
## 4. Paper: Automating Rigid Origami Design (2211.13219)
**Title:** Automating Rigid Origami Design
**Authors:** Jeremia Geiger, Karolis Martinkus, Oliver Richter, Roger Wattenhofer
**Date:** November 2022 (revised April 2023)
**Venue:** IJCAI 2023 AI, Arts & Creativity Special Track
### Core Contribution
- Formulates rigid origami design as discrete optimization: the **"rigid origami game"**
- Based on "three units method" principle
- Framework supports diverse objectives via abstract reward functions
- Generates optimized, application-specific crease patterns
### Methodology
- Multiple search methods within optimization framework
- Flexible objective definition for application-specific requirements
- Can approximate target shapes and produce functional designs
### Relevance to Optigami
- Validates the "origami as game/environment" paradigm we're building
- Their reward formulation approach (function-based, abstract) aligns with our OpenEnv design
- Discrete optimization over crease patterns = the action space for our RL agent
---
## 5. Tool: FOLD Format
**Repo:** https://github.com/edemaine/fold
**Authors:** Erik Demaine (MIT), Jason Ku (MIT), Robert Lang
**License:** MIT
### What It Is
FOLD (Flexible Origami List Datastructure) - JSON-based file format (.fold) for representing origami models. The **standard interchange format** for computational origami.
### Data Structure
```json
{
"vertices_coords": [[x,y], ...], // 2D or 3D coordinates
"edges_vertices": [[v1,v2], ...], // Edge endpoints
"edges_assignment": ["M","V",...], // Mountain/Valley/Boundary/Flat/Unassigned
"faces_vertices": [[v1,v2,v3], ...], // Face vertex lists
"faceOrders": [[f1,f2,order], ...], // Stacking/layering order
"frame_*": ... // Multiple frames (folding states)
}
```
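Since FOLD is plain JSON, a frame is just a Python dict. A hypothetical single-crease square plus a cheap sanity check (our code, not the fold library's API):

```python
import json

# Unit square with one vertical valley crease down the middle.
fold = {
    "file_spec": 1.1,
    "vertices_coords": [[0, 0], [0.5, 0], [1, 0], [1, 1], [0.5, 1], [0, 1]],
    "edges_vertices": [[0, 1], [1, 2], [2, 3], [3, 4], [4, 5], [5, 0], [1, 4]],
    "edges_assignment": ["B", "B", "B", "B", "B", "B", "V"],
    "faces_vertices": [[0, 1, 4, 5], [1, 2, 3, 4]],
}

def validate_fold(frame: dict) -> bool:
    """Cheap sanity checks: every edge/face index refers to a real vertex,
    and assignments align one-to-one with edges."""
    n = len(frame["vertices_coords"])
    edges_ok = all(0 <= v < n for e in frame["edges_vertices"] for v in e)
    faces_ok = all(0 <= v < n for f in frame["faces_vertices"] for v in f)
    aligned = len(frame["edges_assignment"]) == len(frame["edges_vertices"])
    return edges_ok and faces_ok and aligned

round_trip = json.loads(json.dumps(fold))  # serializes cleanly for Python/JS interop
```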
### JavaScript API
```javascript
// Browser: load the bundled script via
//   <script src="https://edemaine.github.io/fold/dist/fold.js"></script>
// Node.js: npm install --save fold
// Usage pattern: FOLD.moduleName.functionName
FOLD.filter.collapseNearbyVertices(foldObject);
```
### CLI Tools
- `fold-convert`: ORIPA .opx -> .fold conversion
- `fold-convert --flat-fold`: Compute flat-folded state
### Supported Software Ecosystem
OrigamiSimulator, Freeform Origami (Tachi), Rabbit Ear (Kraft), ORIPA, Crease Pattern Editor, Rhino Grasshopper
### Relevance to Optigami
- **Core data format for OpenEnv state representation**
- JSON = easy Python/JS interop
- Stacking order (faceOrders) = layer tracking
- edges_assignment = mountain/valley fold type
- Import/export between geometry engine and visualizer
---
## 6. Tool: Origami Simulator
**Repo:** https://github.com/amandaghassaei/OrigamiSimulator
**URL:** origamisimulator.org
**Author:** Amanda Ghassaei
**License:** MIT
**Stack:** JavaScript (68.4%), Three.js, GPU fragment shaders
### Capabilities
- Real-time GPU-accelerated folding simulation
- Folds ALL creases simultaneously (not sequential)
- Realistic bending simulation between creases
- Strain visualization (internal stress during folding)
- Fold Percent slider: 0% (flat) to 100% (fully folded) to -100% (inverted)
### File Formats
- **Input:** SVG, FOLD
- **Export:** FOLD, STL, OBJ
### Physics Engine
- **Stiffness-based finite element approach:** Triangulated faces are rigid panels connected by rotational hinges along fold lines
- Each fold edge has a **target angle** (+/-pi for mountain/valley), driven by angular spring forces
- Solver computes nodal displacements at each timestep to reach equilibrium
- **Fold stiffness:** Controls how strongly hinges drive toward target angle
- **Face stiffness:** Controls rigidity of triangulated faces (resistance to bending/deformation)
- **Damping:** Controls oscillation decay rate
- **Strain metric:** Per-triangle deviation of edge lengths from rest lengths (flat state)
- Self-intersection is NOT prevented (folds through itself if geometry demands it)
- Based on Schenk & Guest structural engineering approach
- Tomohiro Tachi's freeform origami variations
- Ruling-aware triangulation for curved creases
- GPU fragment shaders for parallel computation
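The strain metric described above reduces to engineering strain per edge; a tiny illustration in plain Python (our names, not the simulator's API):

```python
import math

def edge_strain(p0, p1, rest_length):
    """Engineering strain of a single edge: (L - L0) / L0,
    where L0 is the edge's rest length in the flat state."""
    length = math.dist(p0, p1)
    return (length - rest_length) / rest_length

# A flat edge of rest length 1.0 stretched to 1.05 shows 5% strain,
# which the simulator would render on its heatmap.
strain = edge_strain((0.0, 0.0, 0.0), (1.05, 0.0, 0.0), 1.0)
```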
### Programmatic Usage
- Core simulation can be driven **headlessly** (without UI) by importing solver module
- Feed FOLD JSON data -> step simulation programmatically
- FOLD is JSON, so easy to generate crease patterns from Python and pass to simulator
- Can embed in other web pages as a component
### Dependencies
- Three.js (3D rendering)
- FOLD API (internal data structure)
- Earcut + cdt2d (polygon triangulation)
- numeric.js (linear algebra)
- CCapture (GIF/WebM export)
### Relevance to Optigami
- **Direct integration for Three.js Visualizer component**
- Strain heatmap capability already built in
- FOLD format native support
- Can be used for visual verification of generated fold patterns
- Export to STL/OBJ for 3D shape comparison in rewards
---
## 7. Tool: GamiBench
**Repo:** https://github.com/stvngo/GamiBench
**Dataset:** https://huggingface.co/datasets/stvngo/GamiBench
**Paper:** arXiv 2512.22207
**License:** MIT
### Benchmark Design
- 186 valid + 186 impossible crease patterns
- 6 viewpoints per pattern (top, bottom, front, back, right, left)
- **777 total samples** in HuggingFace dataset (45.4 MB)
- 186 label classes (named origami patterns)
### Task Types
1. Standard tasks (2D CP -> 3D prediction)
2. Alternative-view tasks
3. Impossible tasks (validity checking)
### Dataset Schema
```python
{
"image": PIL.Image, # Origami pattern/fold image
"label": int, # 0-185 class label
"split": str # Split identifier
}
```
### Loading
```python
from datasets import load_dataset
dataset = load_dataset("stvngo/GamiBench")
```
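A minimal evaluation loop over this schema might look like the following (the predictor is a stub; a real run would pass `dataset["train"]` and wrap a model call):

```python
def accuracy(examples, predict):
    """Fraction of examples where predict(image) matches the integer label.

    `examples` is any iterable of dicts with the schema above
    ({"image": ..., "label": int, ...}).
    """
    correct = sum(1 for ex in examples if predict(ex["image"]) == ex["label"])
    return correct / len(examples)

# Stub run on a mock split; a real predictor would call the VLM here.
mock = [{"image": None, "label": 3}, {"image": None, "label": 7}]
acc = accuracy(mock, lambda img: 3)
```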
### Model Support
- OpenAI (GPT-4, GPT-4o-mini)
- Anthropic (Claude 4.5 Sonnet)
- Google (Gemini)
- xAI (Grok)
- OpenRouter models
### Code Structure
```
models/ # Model wrappers & factory
evaluators/ # BaseEvaluator: evaluate(), evaluate_single()
benchmarks/ # Benchmark implementations
configs/ # YAML/JSON configuration
utils/ # Shared helpers
pipeline.py # Orchestration
run.py # Entry point
```
### Relevance to Optigami
- **Evaluation benchmark for our trained model**
- 186 origami patterns = potential training/eval data
- Impossible patterns useful for constraint satisfaction testing
- Multi-view evaluation tests true 3D understanding
- Config-driven, reproducible evaluation pipeline
---
## 8. Tool: SpatialThinker Codebase
**Repo:** https://github.com/hunarbatra/SpatialThinker
**Paper:** arXiv 2511.07403
### Architecture
- Built on Qwen2.5-VL (3B and 7B variants)
- Uses veRL/EasyR1 for RL training
- vLLM 0.8.0 for inference during rollouts
### Code Structure
```
scripts/ # Training bash scripts per model size
evaluation/ # 18+ benchmark evaluation suite
data_gen/ # Data synthesis pipeline
verl/ # RL training framework (GRPO)
```
### Data Generation Pipeline
1. Generate raw QA pairs (12K-56K options)
2. Balance/filter with 50% spatial relations focus
3. Validate via GPT-4o (~75% pass rate)
4. Upload to HuggingFace
### Requirements
- Python 3.9+
- Transformers >= 4.49.0
- Flash-Attn >= 2.4.3
- vLLM >= 0.7.3
### Relevance to Optigami
- **Reference implementation for our GRPO training setup**
- veRL/EasyR1 framework = our training infrastructure
- Dense reward design directly applicable
- Data generation pipeline can be adapted for origami QA pairs
---
## 9. Tool: Trackio
**Repo:** https://github.com/gradio-app/trackio
**Author:** Hugging Face / Gradio team
**License:** MIT
### What It Is
Lightweight, local-first experiment tracking (Weights & Biases alternative). API-compatible with wandb.
### Key Features
- `import trackio as wandb` - drop-in W&B replacement
- Non-blocking `log()` with background queue (0.5s drain interval)
- SQLite local storage at `~/.cache/huggingface/trackio`
- Optional HuggingFace Spaces deployment for dashboards
- Slack/Discord webhook alerts (INFO/WARN/ERROR)
- Throughput: ~2,000 logs in 8 s for a single run; ~32,000 logs in 14 s across 32 threads
### Usage
```python
import trackio
trackio.init(project="optigami-grpo", config={"lr": 1e-6, "model": "Qwen2.5-VL-7B"})
trackio.log({"step": step, "reward": reward, "loss": loss})
trackio.alert(title="Training spike", text="...", level=trackio.AlertLevel.WARN)
trackio.finish()
# Dashboard
trackio.show(project="optigami-grpo")
trackio.sync(project="optigami-grpo", space_id="openenv-community/optigami-training")
```
### Relevance to Optigami
- **Training metrics dashboard for GRPO training runs**
- Can deploy live dashboard to HF Spaces
- Track reward components, loss, constraint satisfaction rates
- Alert on training anomalies (reward hacking, loss spikes)
---
## 10. Tool: Unsloth + GRPO Training
**Repo:** https://github.com/unslothai/unsloth
**Docs:** https://unsloth.ai/docs
### GRPO Algorithm in Unsloth
1. Generate N responses per prompt (8+ recommended)
2. Score each with custom reward functions
3. Z-score normalize rewards across group -> advantages
4. PPO-style policy update (no value model or reward model needed)
### Memory Efficiency
- **90% less VRAM** vs standard GRPO
- 20K context, 8 generations, Llama 8B: 54.3GB (vs 510.8GB standard)
- QLoRA 4-bit: VRAM needed is roughly the model's parameter count in GB
- Shared GPU memory with vLLM inference engine
### Vision Model Support
- Qwen2.5-VL-7B directly supported
- Qwen3-VL-8B, Gemma 3 (4B) also available
- `FastVisionModel.get_peft_model()` with granular layer control:
- `finetune_vision_layers`, `finetune_language_layers`
- `finetune_attention_modules`, `finetune_mlp_modules`
### LoRA Configuration
```python
model = FastVisionModel.get_peft_model(
model,
r=16, # LoRA rank
lora_alpha=16, # alpha == r recommended
lora_dropout=0,
finetune_vision_layers=True,
finetune_language_layers=True,
finetune_attention_modules=True,
finetune_mlp_modules=True,
)
```
### GRPOConfig Options
```python
GRPOConfig(
loss_type='grpo', # or 'gspo', 'dr_grpo'
epsilon=0.2,
epsilon_high=0.28,
delta=1.5,
# ... standard training args
)
```
### vLLM Integration
- Shared memory between Unsloth and vLLM saves 3-5GB
- A100 40GB: ~4000 tokens/sec, T4 16GB: ~300 tokens/sec
- `fast_inference=True` enables vLLM backend
### Training Requirements
- Minimum 300 steps before meaningful progress
- 500+ data rows recommended (works with 10+)
- Models >= 1.5B parameters for reasoning tokens
- Steps = rows x epochs; with limited data, raise generations (8 -> 16) to get more signal per row
### Vision Data Format
```python
[
{"role": "user", "content": [
{"type": "text", "text": "instruction"},
{"type": "image", "image": pil_image}
]},
{"role": "assistant", "content": [
{"type": "text", "text": "response"}
]}
]
```
### GRPO vs PPO vs DPO Comparison
| Aspect | PPO | DPO | GRPO |
|--------|-----|-----|------|
| Critic/Value model | Required (same size as policy) | Not needed | **Not needed** |
| Reference model | Required | Required | Required (old policy) |
| Training data | Online rollouts | Offline preference pairs | **Online rollouts + group scoring** |
| Reward signal | Scalar per token/step | Implicit from preferences | **Verifiable/explicit** |
| VRAM overhead | ~2x (policy + critic) | ~2x (policy + ref) | **~1.5x (no critic)** |
### GRPO Advantage Estimation
```
A_i = (r_i - mean(r_1..r_G)) / std(r_1..r_G)
```
By sampling G completions and normalizing rewards within the group, GRPO creates its own baseline without a value network - halving VRAM vs PPO.
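That normalization is a few lines of stdlib Python; a sketch using a population standard deviation over the group and a small epsilon for numerical safety:

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """GRPO advantages: z-score each reward within its rollout group.

    The group mean acts as the baseline, so no value network is needed.
    """
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# One group of G=4 rollouts for the same prompt.
adv = group_advantages([1.0, 0.0, 0.5, 0.5])
```

The advantages sum to (approximately) zero within each group, so above-average completions are reinforced and below-average ones suppressed.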
### Complete Unsloth GRPO Code Example
```python
import re  # used by correctness_reward below

from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)  # Patch TRL with Unsloth optimizations
from trl import GRPOConfig, GRPOTrainer
# Load model with QLoRA
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen2.5-7B-Instruct-bnb-4bit",
max_seq_length=4096,
load_in_4bit=True,
dtype=None,
)
# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
model,
r=64, # Higher rank for reasoning tasks
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_alpha=64, # alpha == r recommended
lora_dropout=0, # Unsloth recommends 0
bias="none",
use_gradient_checkpointing="unsloth", # Unsloth's optimized GC
random_state=3407,
)
# Reward functions (TRL accepts a list, scores are summed)
def correctness_reward(completions, ground_truth, **kwargs):
rewards = []
for completion, gt in zip(completions, ground_truth):
answer_match = re.search(r'</think>\s*(.*?)$', completion, re.DOTALL)
if answer_match and answer_match.group(1).strip() == gt.strip():
rewards.append(1.0)
else:
rewards.append(0.0)
return rewards
def format_reward(completions, **kwargs):
return [0.5 if ("<think>" in c and "</think>" in c) else 0.0 for c in completions]
# GRPO Config
config = GRPOConfig(
output_dir="./grpo_output",
num_generations=8, # Group size G
max_completion_length=2048,
per_device_train_batch_size=1,
gradient_accumulation_steps=4,
num_train_epochs=1,
learning_rate=5e-6,
lr_scheduler_type="cosine",
warmup_ratio=0.1,
beta=0.04, # KL penalty coefficient
max_grad_norm=0.1,
logging_steps=1,
save_steps=250,
bf16=True,
loss_type='grpo', # or 'gspo', 'dr_grpo'
)
trainer = GRPOTrainer(
    model=model,
    args=config,                 # GRPOConfig is passed as `args` in TRL
    train_dataset=dataset,
    reward_funcs=[correctness_reward, format_reward],
    processing_class=tokenizer,  # recent TRL; older releases accepted `tokenizer=`
)
trainer.train()
# Save LoRA adapter
model.save_pretrained("./grpo_lora_adapter")
# Optional: merge and push
# model.save_pretrained_merged("./grpo_merged", tokenizer)
# model.push_to_hub_merged("username/model-name", tokenizer)
```
### Vision GRPO with Qwen2.5-VL
```python
from unsloth import FastVisionModel, PatchFastRL
PatchFastRL("GRPO", FastVisionModel)
model, tokenizer = FastVisionModel.from_pretrained(
"unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit",
max_seq_length=4096,
load_in_4bit=True,
)
# For VLMs: typically freeze vision encoder, train language layers
model = FastVisionModel.get_peft_model(
model,
r=16, # Lower rank often sufficient for VLMs
lora_alpha=16,
lora_dropout=0,
bias="none",
use_gradient_checkpointing="unsloth",
finetune_vision_layers=False, # Keep vision encoder frozen
finetune_language_layers=True,
finetune_attention_modules=True,
finetune_mlp_modules=True,
)
```
### Unsloth ART (Agentic Reasoning Training)
ART extends GRPO for multi-turn agentic tasks:
1. **Multi-turn rollouts:** Model interacts with environment over multiple turns (actions + observations)
2. **Environment integration:** Custom env provides observations and final rewards
3. **Verifiable rewards:** Emphasizes automatically verifiable outcomes
**Multi-turn pattern:**
```
Turn 1: User prompt -> Model <think> + action -> Environment observation
Turn 2: Observation -> Model <think> + action -> Environment observation
Turn 3: Observation -> Model final answer -> Reward computed
```
**Implementation options for multi-turn:**
1. **Single-generation (simpler):** Model outputs full plan/sequence in one generation; reward function evaluates the whole sequence
2. **Custom rollout loop (advanced):** Alternate model generation and env response, collect full trajectory, compute GRPO gradients on combined trajectory
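A sketch of option 2 with a stubbed generator and environment — all of the scaffolding here (`rollout`, `MockEnv`, the generate callable) is hypothetical, not an Unsloth or TRL API:

```python
def rollout(model_generate, env, max_turns=3):
    """Alternate model generation and environment feedback, collecting the
    full trajectory so GRPO can compute gradients over the combined text."""
    trajectory = []
    observation = env.reset()
    total_reward = 0.0
    for _ in range(max_turns):
        action = model_generate(observation)          # one model turn
        observation, reward, done = env.step(action)  # one env turn
        trajectory.append((action, observation))
        total_reward += reward
        if done:
            break
    return trajectory, total_reward

# Mock environment: rewards the literal action "fold", then terminates.
class MockEnv:
    def reset(self):
        return "start"
    def step(self, action):
        if action == "fold":
            return "folded", 1.0, True
        return "retry", 0.0, False

traj, r = rollout(lambda obs: "fold", MockEnv())
```

In a real setup `model_generate` would call the policy under training and the trajectory text would be fed back into the GRPO loss.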
### Key Hyperparameters Reference
| Parameter | Range | Notes |
|-----------|-------|-------|
| `num_generations` (G) | 4-16 | 8 common. More = better advantages, more VRAM |
| `beta` (KL penalty) | 0.01-0.1 | 0.04 default. Higher = stay closer to reference |
| `learning_rate` | 1e-6 to 1e-5 | Lower than SFT. 5e-6 starting point |
| `max_completion_length` | 512-4096 | Task-dependent |
| `r` (LoRA rank) | 16-128 | 64 for reasoning, 16 for VLM |
| `gradient_accumulation_steps` | 4-16 | Effective batch = per_device * accum * GPUs |
| `max_grad_norm` | 0.1-1.0 | 0.1 for stability |
| `warmup_ratio` | 0.05-0.1 | Important for RL stability |
| `epsilon` (clip) | 0.2 | PPO-style clipping |
| `epsilon_high` | 0.28 | Asymmetric upper clip |
### Qwen2.5-VL-7B Model Specifics
- Vision encoder: ViT with 2D-RoPE (handles arbitrary image resolutions via dynamic patching)
- LLM backbone: 28 layers, 3584 hidden dim, 28 attn heads, GQA with 4 KV heads
- Context: up to 32K tokens (128K with YaRN)
- Supports: single image, multi-image, video frames
- Unsloth IDs: `unsloth/Qwen2.5-VL-7B-Instruct`, `unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit`
### Qwen3-4B Model Specifics
- Hybrid thinking: can switch between `<think>` mode and direct response
- ~4B parameters, efficient for RL training
- MoE variants also available
- Unsloth IDs: `unsloth/Qwen3-4B`, `unsloth/Qwen3-4B-bnb-4bit`
---
## 11. Unsloth ART / GRPO Trainer Plan
### Phase 1: Data Preparation
**Training Data Sources:**
1. OrigamiSpace dataset (471 auxiliary instances) - CP diagrams, fold sequences, 3D shapes
2. GamiBench dataset (777 samples, 186 patterns) - crease patterns with multi-view 3D
3. Synthetic data generation pipeline (following SpatialThinker approach):
- Generate origami QA pairs with Claude/GPT
- Validate with GPT-4o pass@2
- Balance across difficulty levels
**Data Format for GRPO:**
```python
# Each training example = a prompt with origami task
{
"prompt": [
{"role": "user", "content": [
{"type": "image", "image": cp_diagram_image},
{"type": "text", "text": "Given this crease pattern, describe the folding sequence and predict the final 3D shape. Output your answer as a FOLD JSON."}
]}
]
}
```
### Phase 2: Reward Function Design
**Following SpatialThinker's lexicographic gating pattern, adapted for origami:**
```python
def origami_reward(prompt, response, ground_truth):
    # Component 1: format reward acts as the lexicographic gate. If the
    # response is not valid FOLD JSON, skip the expensive checks entirely.
    r_format = check_valid_fold_json(response)  # 0 or 1
    if r_format == 0:
        return 0.0
    # Component 2: constraint satisfaction
    #   - Maekawa's theorem: |M - V| = 2
    #   - Kawasaki's theorem: sum(alpha_i) = 2*pi
    #   - Euler's formula: V - E + F = 2
    #   - No self-intersection
    r_constraints = check_origami_constraints(response)
    # Component 3: topological similarity (vertex/edge/face counts, connectivity)
    r_topology = compute_tss(response, ground_truth)
    # Component 4: geometric similarity (Hausdorff distance)
    r_geometry = compute_hausdorff_similarity(response, ground_truth)
    # Component 5: final shape match
    r_shape = compute_folded_state_similarity(response, ground_truth)
    return (0.1  * r_format +
            0.25 * r_constraints +
            0.2  * r_topology +
            0.2  * r_geometry +
            0.25 * r_shape)
```
### Phase 3: Training Infrastructure
**Option A: Unsloth (simpler, less VRAM)**
```python
from unsloth import FastVisionModel
from trl import GRPOConfig, GRPOTrainer
model, tokenizer = FastVisionModel.from_pretrained(
"unsloth/Qwen2.5-VL-7B-Instruct",
load_in_4bit=True,
fast_inference=True,
)
model = FastVisionModel.get_peft_model(model, r=16, lora_alpha=16)
config = GRPOConfig(
loss_type="grpo",
num_generations=8,
    max_completion_length=2048,
per_device_train_batch_size=1,
gradient_accumulation_steps=16,
num_train_epochs=3,
learning_rate=1e-6,
)
trainer = GRPOTrainer(
    model=model,
    args=config,  # GRPOConfig is passed as `args` in TRL
    train_dataset=dataset,
    reward_funcs=[origami_reward],
)
trainer.train()
```
**Option B: veRL/EasyR1 (following SpatialThinker, more control)**
- Uses veRL framework with GRPO
- vLLM backend for fast rollouts
- More complex but battle-tested for spatial reasoning
- Better for multi-turn rollouts
### Phase 4: Multi-Turn Rollouts
Following OrigamiSpace's environmental learning approach:
1. Model generates CP code / fold sequence
2. OpenEnv compiler validates and returns error feedback
3. Model refines based on error type (CSE/GIF/PSI/AFS)
4. Repeat up to 10 rounds
5. Final reward based on best attempt
**Environment class pattern:**
```python
class OrigamiEnv:
def __init__(self, task):
self.task = task
self.state = task["initial_state"] # FOLD JSON
self.steps = 0
self.max_steps = 10
self.history = []
def step(self, action: str):
"""Process model's fold action, return compiler feedback."""
self.steps += 1
# Validate through compiler (CSE/GIF/PSI/AFS checks)
result = self.compile_and_validate(action)
observation = f"Step {self.steps}: {result['error_type']}: {result['message']}"
self.state = result.get("new_state", self.state)
self.history.append((action, observation))
done = self.steps >= self.max_steps or result.get("valid", False)
reward = self.compute_reward() if done else 0.0
return observation, reward, done
def compute_reward(self):
"""4-dimensional evaluation: TSS + GS + CS + FFS."""
return (0.25 * tss(self.state, self.task["target"]) +
0.25 * gs(self.state, self.task["target"]) +
0.25 * cs(self.state) +
0.25 * ffs(self.state, self.task["target"]))
def multi_turn_reward(completions, prompts, **kwargs):
"""Wrap environment interaction into GRPO reward function."""
rewards = []
for completion, prompt in zip(completions, prompts):
env = OrigamiEnv(extract_task(prompt))
actions = parse_actions(completion)
total_reward = 0.0
for action in actions:
obs, reward, done = env.step(action)
total_reward += reward
if done:
break
rewards.append(total_reward)
return rewards
```
### Phase 5: Evaluation
1. **GamiBench** - standard origami spatial reasoning benchmark
2. **OrigamiSpace tasks** - 4-task evaluation suite
3. **Custom metrics:**
- Constraint satisfaction rate (Maekawa/Kawasaki)
- Compilation success rate
- Topological/geometric similarity scores
### Phase 6: Monitoring with Trackio
```python
import trackio
trackio.init(
project="optigami-grpo",
space_id="openenv-community/optigami-training",
config={
"model": "Qwen2.5-VL-7B",
"lora_r": 16,
"num_generations": 8,
"learning_rate": 1e-6,
}
)
# In training loop
trackio.log({
"step": step,
"reward/total": total_reward,
"reward/format": format_reward,
"reward/constraints": constraint_reward,
"reward/topology": topology_reward,
"reward/geometry": geometry_reward,
"reward/shape": shape_reward,
"loss": loss,
"compilation_rate": compilation_rate,
})
```
---
## 12. GitHub Reference Repo (ianalin123/optigami)
Located at `.reference/optigami-github/` (gitignored, not pushed to HF).
### What It Contains
A complete research repository with detailed architecture docs and a reference 2048 GRPO implementation.
### Key Files
| File | Contents |
|------|----------|
| `research/plan/architecture.md` | **Full architecture spec**: action space, state, physics engine, reward functions, OpenEnv integration, rendering pipeline, project structure, implementation order |
| `research/openenv/2048_example.py` | **636-line reference implementation** of OpenEnv + GRPO for 2048 game (Unsloth + TRL) |
| `research/openenv/overview.md` | OpenEnv framework API, types, project structure, deployment to HF Spaces |
| `research/origami/fold_types_deep.md` | All fold operations, Huzita-Justin axioms, crane step-by-step, compression patterns |
| `research/origami/math_physics_deep.md` | Kawasaki/Maekawa theorems with code, bar-and-hinge model, energy formulas |
| `research/origami/rendering_research.md` | Rendering options comparison |
| `research/origami/fold_format.md` | FOLD file format details |
### Architecture Decisions (from GitHub repo)
| Decision | Choice |
|----------|--------|
| LLM interaction | **Code-as-policy** (LLM writes `fold_strategy()` function) |
| Action space | Named fold ops (valley/mountain + fold line + angle) |
| State format | FOLD-compatible JSON |
| Physics engine | Bar-and-hinge model (NumPy port of Ghassaei) |
| Validation | Kawasaki + Maekawa + triangle-triangle intersection |
| Primary task | Solar panel packing (Miura-ori discovery) |
| Training | GRPO via TRL + Unsloth |
| Deployment | Docker Space on HF Spaces |
### Action Space (Code-as-Policy)
The LLM generates a `fold_strategy(paper_state)` function returning fold instructions:
```python
def fold_strategy(paper_state: dict) -> list[dict]:
# paper_state contains: vertices, edges, assignments, fold_angles, material, etc.
return [
{"type": "valley", "line": {"start": [0,0.5], "end": [1,0.5]}, "angle": 180},
{"type": "mountain", "line": {"start": [0.5,0], "end": [0.5,0.5]}, "angle": 180},
]
```
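Applying a 180-degree fold instruction reduces to reflecting vertices across the fold line. A minimal sketch (the function name and argument conventions are assumptions, not part of the spec above):

```python
import numpy as np

def reflect_point(p, start, end):
    """Reflect point `p` across the fold line through `start` and `end` --
    the 180-degree case of applying one fold instruction."""
    p, a, b = map(np.asarray, (p, start, end))
    d = (b - a) / np.linalg.norm(b - a)   # unit direction of the fold line
    v = p - a                             # point relative to a line anchor
    # Reflection of v across the line spanned by d: 2(v . d)d - v
    return a + 2 * (v @ d) * d - v
```

For example, reflecting `[0, 0]` across the horizontal valley line from `[0, 0.5]` to `[1, 0.5]` yields `[0, 1]`.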
### Reward Functions (3 from 2048 pattern, adapted for origami)
1. **`code_valid`**: +1.0 valid function, -0.5 exec fails, -2.0 syntax error
2. **`physically_valid`**: +1.0 all valid, -2.0 per Kawasaki/Maekawa violation, -5.0 self-intersection
3. **`fold_quality`**: +20.0 * compactness, +10.0 meets volume target, +5.0 deployable, -0.5 per fold
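A minimal sketch of the first reward, using the penalty values listed above (the completion-parsing details and the handling of a missing `fold_strategy` are assumptions; the real check would also run the sandboxed code):

```python
import ast

def code_valid_reward(completion: str) -> float:
    """Sketch of `code_valid`: +1.0 for a parseable fold_strategy,
    -2.0 for a syntax error. A completion without a fold_strategy is
    treated like the -0.5 'execution fails' case here."""
    # Pull the fenced code block out of the LLM completion, if present.
    if "```python" in completion:
        code = completion.split("```python", 1)[1].split("```", 1)[0]
    else:
        code = completion
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return -2.0
    has_strategy = any(
        isinstance(node, ast.FunctionDef) and node.name == "fold_strategy"
        for node in ast.walk(tree)
    )
    return 1.0 if has_strategy else -0.5
```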
### Physics Engine (Bar-and-Hinge Model)
```
E_total = E_bar + E_facet + E_fold
E_bar = sum (1/2) * k_axial * (L - L0)^2 # stretching
E_facet = sum (1/2) * k_facet * l * (theta-pi)^2 # panel bending
E_fold = sum (1/2) * k_fold * l * (rho-rho_t)^2 # crease folding
```
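The formulas above translate directly into a vectorized NumPy sketch (array shapes and parameter names are assumptions; each argument is a per-bar, per-facet, or per-crease array):

```python
import numpy as np

def bar_hinge_energy(L, L0, k_axial,
                     theta, k_facet, l_facet,
                     rho, rho_t, k_fold, l_fold):
    """Total bar-and-hinge energy: stretching + panel bending + crease folding."""
    E_bar = 0.5 * np.sum(k_axial * (L - L0) ** 2)                     # stretching
    E_facet = 0.5 * np.sum(k_facet * l_facet * (theta - np.pi) ** 2)  # panel bending
    E_fold = 0.5 * np.sum(k_fold * l_fold * (rho - rho_t) ** 2)       # crease folding
    return E_bar + E_facet + E_fold
```

At rest (bars at natural length, facets flat, creases at their target angles) the energy is zero, which is a useful sanity check for an implementation.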
### Planned Project Structure
```
engine/ # Core simulation (numpy/scipy)
paper.py # Paper data structure, FOLD I/O
fold_engine.py # Apply folds (quaternion rotation)
physics.py # Bar-and-hinge energy, strain
validation.py # Kawasaki, Maekawa, self-intersection
metrics.py # Deployment ratio, compactness
materials.py # Material definitions
environment/ # OpenEnv server
models.py # Action, Observation, State
origami_environment.py # Environment (reset/step/state)
tasks.py # Task pool / curriculum
app.py # create_app()
Dockerfile
client/ # OpenEnv client + training bridge
reward_functions.py # code_valid, physically_valid, fold_quality
training/ # Colab notebook
train_origami.ipynb # GRPO training (Unsloth + TRL)
prompts.py # LLM prompt templates
```
### Implementation Order (from architecture.md)
1. **Phase 1: Engine** - paper.py, fold_engine.py, validation.py, metrics.py
2. **Phase 2: OpenEnv Server** - models.py, origami_environment.py, app.py, Dockerfile
3. **Phase 3: Reward + Training** - reward_functions.py, prompts.py, train_origami.ipynb
4. **Phase 4: Rendering + Demo** - matplotlib headless, React + R3F app
### 2048 Reference Implementation (Key Patterns)
The `2048_example.py` shows the exact Unsloth + OpenEnv + GRPO pattern:
- `PatchFastRL` is not used (2048 uses a text model, not a vision model); for our VLM, use `FastVisionModel` instead
- `extract_function()` parses the strategy code out of fenced `python` code blocks
- `create_locked_down_function()` sandboxes execution
- `check_python_modules()` prevents non-stdlib imports
- `execute_with_time_limit(5)` wraps strategy execution
- Dataset: 1000x replicated prompt, `report_to="trackio"`
- GRPOConfig: temp=1.0, lr=2e-4, max_steps=600, num_generations=2
- Three reward functions passed as list to `GRPOTrainer`
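Following those patterns, the trainer wiring might look like this sketch (every argument is a placeholder -- `model` would come from Unsloth's `FastVisionModel`, `dataset` is the 1000x-replicated prompt -- and only the config values come from the list above):

```python
def build_origami_trainer(model, tokenizer, dataset,
                          code_valid, physically_valid, fold_quality):
    """Assemble a GRPO trainer in the 2048 example's style.

    All inputs are placeholders; the three reward callables are the
    origami rewards described in this section.
    """
    from trl import GRPOConfig, GRPOTrainer  # lazy import; requires trl

    config = GRPOConfig(
        temperature=1.0,
        learning_rate=2e-4,
        max_steps=600,
        num_generations=2,
        report_to="trackio",
    )
    return GRPOTrainer(
        model=model,
        processing_class=tokenizer,
        args=config,
        train_dataset=dataset,
        reward_funcs=[code_valid, physically_valid, fold_quality],
    )
```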
---
## 13. Current Project State
### Repository
- **Location:** HuggingFace Space `openenv-community/optigami`
- **Framework:** Create React App (React 19.1.0)
- **Status:** Fresh scaffold - default CRA boilerplate
- **Build:** `npm run build` -> `build/index.html` (HF Spaces static SDK)
### File Structure
```
optigami/
package.json # React app dependencies
README.md # CRA default + HF Space metadata
public/ # Static assets (favicon, manifest)
src/
App.js # Default CRA component (placeholder)
App.css
index.js # Entry point
index.css
logo.svg
reportWebVitals.js
setupTests.js
App.test.js
```
### What Needs to Be Built
1. **Python backend** - Paper Geometry Engine with Shapely, FOLD import/export, constraint checking
2. **GRPO training scripts** - Unsloth or veRL-based, with origami reward functions
3. **Data pipeline** - Load/process OrigamiSpace + GamiBench datasets
4. **Three.js frontend** - Replace CRA boilerplate with origami visualizer (possibly integrate OrigamiSimulator)
5. **OpenEnv server** - API connecting geometry engine to trainer
---
## Key Takeaways for Immediate Work (GRPO Trainer)
1. **Use Unsloth for simplicity** - 90% VRAM savings, built-in vLLM, QLoRA support for Qwen2.5-VL-7B
2. **Dense rewards with lexicographic gating** - format gate -> constraints -> topology -> geometry -> shape match (SpatialThinker pattern)
3. **OrigamiSpace's 4-error compiler** is the gold standard for reward signal generation
4. **Start with 500+ origami examples** - the combined pool is GamiBench (777) + OrigamiSpace (471) = 1248 examples
5. **8 generations per prompt**, temperature 1.0, 300+ training steps minimum
6. **Multi-turn: max 10 rounds** with compiler feedback (performance saturates after 8-10)
7. **Track with Trackio** - deploy dashboard to HF Spaces for real-time monitoring
8. **Evaluate on GamiBench** for standardized comparison against other MLLMs
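Takeaway 2's lexicographic gating can be sketched as follows (the stage names come from the takeaway; the `[0, 1]` score convention and pass threshold are assumptions):

```python
def gated_reward(stage_scores, pass_threshold=1.0):
    """Lexicographic gating: format -> constraints -> topology -> geometry -> shape.

    `stage_scores` is an ordered list of (name, score in [0, 1]) pairs.
    Reward accumulates stage by stage and stops at the first stage below
    the threshold, so later signals cannot mask earlier failures.
    """
    total = 0.0
    for _name, score in stage_scores:
        total += score
        if score < pass_threshold:
            break  # gate closed: ignore all later stages
    return total
```

A run that passes the format gate but only partially satisfies constraints earns the constraint score and nothing beyond it, which keeps the gradient focused on the earliest failing stage.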
---
## Cross-Reference: Tool Compatibility Matrix
| Component | FOLD | OrigamiSim | GamiBench | SpatialThinker | Unsloth | Trackio |
|-----------|------|------------|-----------|----------------|---------|---------|
| State representation | Core | Import | - | - | - | - |
| Visualization | Export | Core | - | - | - | - |
| Training data | - | - | Core | Augment | - | - |
| RL training | - | - | Eval | Template | Core | Monitor |
| Reward functions | Validate | Strain | - | Template | Integrate | Log |
| Constraint checking | Structure | Physics | Impossible set | - | - | - |