# Optigami Research Notes

Comprehensive notes on all sources, tools, and architecture for the Optigami project.

---

## Table of Contents

1. [Project Architecture Overview](#1-project-architecture-overview)
2. [Paper: OrigamiSpace (2511.18450)](#2-paper-origamispace-251118450)
3. [Paper: SpatialThinker (2511.07403)](#3-paper-spatialthinker-251107403)
4. [Paper: Automating Rigid Origami Design (2211.13219)](#4-paper-automating-rigid-origami-design-221113219)
5. [Tool: FOLD Format (edemaine/fold)](#5-tool-fold-format)
6. [Tool: Origami Simulator](#6-tool-origami-simulator)
7. [Tool: GamiBench](#7-tool-gamibench)
8. [Tool: SpatialThinker Codebase](#8-tool-spatialthinker-codebase)
9. [Tool: Trackio](#9-tool-trackio)
10. [Tool: Unsloth + GRPO Training](#10-tool-unsloth--grpo-training)
11. [Unsloth ART / GRPO Trainer Plan](#11-unsloth-art--grpo-trainer-plan)
12. [GitHub Reference Repo (ianalin123/optigami)](#12-github-reference-repo-ianalin123optigami)
13. [Current Project State](#13-current-project-state)

---

## 1. Project Architecture Overview

```
+---------------------------------------------------+
|                  OpenEnv Server                   |
|  +-----------+  +----------+  +--------------+    |
|  |  State    |  |  Action  |  |   Reward     |    |
|  | (FOLD JSON|  |  (LLM    |  |  (Dense,     |    |
|  |  + target)|  |  output) |  |  verifiable) |    |
|  +-----------+  +----------+  +--------------+    |
|        |             |              |             |
|        v             v              v             |
|  +-----------------------------------------------+|
|  |        Paper Geometry Engine (Python)         ||
|  |  - Polygon state (Shapely)                    ||
|  |  - Fold operations (reflection across line)   ||
|  |  - Kawasaki/Maekawa constraint checks         ||
|  |  - Layer tracking                             ||
|  |  - FOLD format import/export                  ||
|  +-----------------------------------------------+|
|        |                                          |
|        v                                          |
|  +-----------------------------------------------+|
|  |       Three.js Visualizer (Demo only)         ||
|  |  - 3D fold animation                          ||
|  |  - Strain heatmap                             ||
|  |  - Instruction stream                         ||
|  +-----------------------------------------------+|
+---------------------------------------------------+
          |                    ^
          v                    |
+---------------------------------------------------+
|            Unsloth ART / GRPO Trainer             |
|  - Qwen2.5-VL-7B or Qwen3-4B base model           |
|  - LoRA/QLoRA for efficient training              |
|  - Multi-turn rollouts                            |
+---------------------------------------------------+
```

**Three major components:**

1. **OpenEnv Server** - RL environment serving state/action/reward for origami folding
2. **Paper Geometry Engine** - Python-based origami math (Shapely polygons, fold reflections, constraint checking)
3. **Unsloth ART / GRPO Trainer** - RL fine-tuning of vision-language models for origami reasoning

**Current focus:** Unsloth ART / GRPO Trainer

---

## 2. Paper: OrigamiSpace (2511.18450)

**Title:** ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints
**Authors:** Rui Xu, Dakuan Lu, Zicheng Zhao, Xiaoyu Tan, Xintao Wang, Siyu Yuan, Jiangjie Chen, Yinghui Xu
**Date:** November 23, 2025
**Venue:** arXiv (cs.AI)

### Dataset

- **350 primary instances** + 471 auxiliary (without folding processes)
- Each instance: CP diagram, compiled flat pattern, folding process (multi-step images), final 3D shape
- Complexity: Easy (3-9 steps), Medium (10-19), Hard (20-30), avg 8.2 steps
- **1,620 total questions** across 4 tasks

### Four Evaluation Tasks

| Task | Questions | Description |
|------|-----------|-------------|
| Pattern Prediction | 350 | CP diagram -> predict final 3D shape (multiple choice) |
| Multi-step Spatial Reasoning | 250 | Shuffled fold images -> correct chronological sequence |
| Spatial Relationship Prediction | 900 | 3 subtypes: pose localization, layering analysis, geometric change |
| End-to-End CP Code Generation | 120 | Flat layout + folded shape -> generate CP code |

### Compiler Architecture (Critical for OpenEnv)

Four-category error feedback system:

1. **CSE (CP Code Syntax Error):** Validates vertices, edges, faces, crease types; checks Euler's formula V-E+F=2
2. **GIF (Geometrically Impossible Fold):** Maekawa's theorem |M-V|=2, Kawasaki's theorem (alternating sector angles sum to pi), Big-Little-Big angle constraint
3. **PSI (Paper Self-Intersection):** Cyclic layering, collision detection (discrete + CCD), octrees/BVHs
4. **AFS (Ambiguous Folding State):** Multiple valid M/V assignments, non-unique stacking

### CP Code Evaluation (4 dimensions, 0.25 weight each)

1. **Topological Structure Similarity (TSS):** Vertex/edge/face count comparison, s_v = e^(-0.5|V_gen - V_ref| / min(V_gen, V_ref))
2. **Geometric Similarity (GS):** Hausdorff distance, s_p = e^(-5 * d_H), dihedral angle distribution, aspect ratio
3. **Constraint Satisfaction (CS):** Taco-Taco, Taco-Tortilla, transitivity, Maekawa/Kawasaki
4. **Final Folded State (FFS):** Shape similarity, layering comparison, stacking order

### Learning Approaches

- **In-Context Learning:** Single-pass, detailed instructions + examples
- **Environmental Learning:** Iterative model<->compiler loop, max 10 rounds; performance saturates after 8-10 rounds
- **Reinforcement Learning (TRICO/PPO-based):**
  - Training data: 471 instances from environmental learning
  - Model: Qwen2.5-VL-32B
  - **Rewards:** Intermediate (success bonus + quality progress), step penalty, final evaluation score
  - Result: RL-trained 32B exceeded the 72B baseline

### Key Results

- Best closed-source: GPT-4o (42.71% pattern), Gemini2.5-pro (53.45% multi-step)
- Best open-source: Qwen2.5-VL-72B (36.29% pattern, 39.10% multi-step)
- Expert human: 98.45% pattern, 100% multi-step
- **Constraint satisfaction is the primary bottleneck** (~30% for top models)
- Human-model gap: 20-45 percentage points

### Relevance to Optigami

- **Direct blueprint for our OpenEnv server**: the compiler architecture with 4 error types is exactly what we need
- The CP code evaluation framework (TSS/GS/CS/FFS) can be our reward function
- Environmental learning approach maps to multi-turn rollouts in GRPO
- Confirms Qwen2.5-VL as a viable base model (they used 32B, we target 7B)

---

## 3. Paper: SpatialThinker (2511.07403)

**Title:** SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards
**Authors:** Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark
**Date:** November 10, 2025
**Venue:** NeurIPS 2025 Workshops (SpaVLE, EWM, ARLET, SEA)

### Core Innovation

Dense spatial rewards + GRPO for training Qwen2.5-VL on spatial reasoning tasks. Key insight: **sparse rewards lead to reward hacking; dense multi-objective rewards with lexicographic gating prevent this.**

### GRPO Training Configuration

- **Rollouts:** 8 samples per query, temperature 1.0
- **Batch size:** rollout=512, global=128
- **Training:** 75 steps (~5 episodes)
- **Hardware:** 4x NVIDIA H100 80GB
- **Time:** ~13h (3B), ~15h (7B)
- **Advantage:** A(i) = (r(i) - mu) / (sigma + epsilon), epsilon=1e-6
- **Loss:** PPO-style with clip(epsilon_l=0.2, epsilon_h=0.3), KL penalty beta=0.01

### Dense Spatial Reward Design (CRITICAL - template for our rewards)

**4-component reward with lexicographic gating:**

```
R_total = I[R_format=1] * (w_format*R_f + w_count*R_c + w_accuracy*R_a + I[R_accuracy=1]*w_spatial*R_s)
```

| Component | Weight | Description |
|-----------|--------|-------------|
| Format (R_f) | 0.1 | JSON-parseable scene graph with required fields |
| Count (R_c) | 0.2 | Penalizes deviation in object/relation counts (lambda_obj=0.7, lambda_rel=0.3) |
| Accuracy (R_a) | 0.5 | Binary exact string match |
| Spatial (R_s) | 0.2 | Hungarian matching with CIoU, activated ONLY when answer correct |

**Lexicographic gating is essential:** format compliance gates all rewards, and spatial rewards only activate on correct answers. Without gating, severe reward hacking occurs (74.9% -> 23.7% with naive spatial rewards).
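The gating formula above can be sketched in a few lines of Python. This is a minimal illustration, not SpatialThinker's actual code: the component scores are assumed to be precomputed in [0, 1], and only the weights and gating structure follow the table.

```python
# Sketch of SpatialThinker-style lexicographically gated reward.
# Weights follow the paper's table; the component scores themselves
# (format/count/accuracy/spatial) are assumed precomputed in [0, 1].

W_FORMAT, W_COUNT, W_ACCURACY, W_SPATIAL = 0.1, 0.2, 0.5, 0.2

def gated_reward(r_format: float, r_count: float,
                 r_accuracy: float, r_spatial: float) -> float:
    # Gate 1: a malformed response earns nothing at all.
    if r_format < 1.0:
        return 0.0
    total = W_FORMAT * r_format + W_COUNT * r_count + W_ACCURACY * r_accuracy
    # Gate 2: the spatial bonus only activates on a fully correct answer,
    # which blocks reward hacking via spurious spatial detail.
    if r_accuracy == 1.0:
        total += W_SPATIAL * r_spatial
    return total
```

For example, a well-formatted, correct answer with count score 0.8 and spatial score 0.5 earns 0.1 + 0.16 + 0.5 + 0.1 = 0.86, while any format violation zeroes the reward regardless of the other components.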
### STVQA-7K Dataset

- 7,587 spatial VQA pairs from Visual Genome scene graphs
- Generated by Claude Sonnet, validated by GPT-4o pass@2
- 9 spatial categories, 34 additional spatial predicates beyond standard VG150
- 90/10 train/val split

### Key Results

- SpatialThinker-7B surpasses GPT-4o on 3DSRBench by +12.1%
- Dense reward RL: +7.2% avg across 12 benchmarks (1.8x the +4.0% from sparse GRPO)
- Outperforms models trained on millions of samples (trained on only 7K)

### Relevance to Optigami

- **Direct template for our GRPO training pipeline**
- Dense reward design with lexicographic gating prevents reward hacking
- Proves Qwen2.5-VL-7B is an excellent base for spatial reasoning RL
- veRL/EasyR1 framework for training infrastructure
- Shows 7K samples are sufficient for strong results

---

## 4. Paper: Automating Rigid Origami Design (2211.13219)

**Title:** Automating Rigid Origami Design
**Authors:** Jeremia Geiger, Karolis Martinkus, Oliver Richter, Roger Wattenhofer
**Date:** November 2022 (revised April 2023)
**Venue:** IJCAI 2023 AI, Arts & Creativity Special Track

### Core Contribution

- Formulates rigid origami design as discrete optimization: the **"rigid origami game"**
- Based on the "three units method" principle
- Framework supports diverse objectives via abstract reward functions
- Generates optimized, application-specific crease patterns

### Methodology

- Multiple search methods within the optimization framework
- Flexible objective definition for application-specific requirements
- Can approximate target shapes and produce functional designs

### Relevance to Optigami

- Validates the "origami as game/environment" paradigm we're building
- Their reward formulation approach (function-based, abstract) aligns with our OpenEnv design
- Discrete optimization over crease patterns = the action space for our RL agent

---

## 5. Tool: FOLD Format

**Repo:** https://github.com/edemaine/fold
**Authors:** Erik Demaine (MIT), Jason Ku (MIT), Robert Lang
**License:** MIT

### What It Is

FOLD (Flexible Origami List Datastructure) is a JSON-based file format (.fold) for representing origami models - the **standard interchange format** for computational origami.

### Data Structure

```json
{
  "vertices_coords": [[x,y], ...],     // 2D or 3D coordinates
  "edges_vertices": [[v1,v2], ...],    // Edge endpoints
  "edges_assignment": ["M","V",...],   // Mountain/Valley/Boundary/Flat/Unassigned
  "faces_vertices": [[v1,v2,v3], ...], // Face vertex lists
  "faceOrders": [[f1,f2,order], ...],  // Stacking/layering order
  "frame_*": ...                       // Multiple frames (folding states)
}
```

### JavaScript API

```javascript
// Browser: load fold.js via a <script> tag
// Node.js:
//   npm install --save fold

// Usage: FOLD.moduleName.functionName
FOLD.filter.collapseNearbyVertices(foldObject)
```

### CLI Tools

- `fold-convert`: ORIPA .opx -> .fold conversion
- `fold-convert --flat-fold`: Compute flat-folded state

### Supported Software Ecosystem

OrigamiSimulator, Freeform Origami (Tachi), Rabbit Ear (Kraft), ORIPA, Crease Pattern Editor, Rhino Grasshopper

### Relevance to Optigami

- **Core data format for OpenEnv state representation**
- JSON = easy Python/JS interop
- Stacking order (faceOrders) = layer tracking
- edges_assignment = mountain/valley fold type
- Import/export between geometry engine and visualizer

---

## 6. Tool: Origami Simulator

**Repo:** https://github.com/amandaghassaei/OrigamiSimulator
**URL:** origamisimulator.org
**Author:** Amanda Ghassaei
**License:** MIT
**Stack:** JavaScript (68.4%), Three.js, GPU fragment shaders

### Capabilities

- Real-time GPU-accelerated folding simulation
- Folds ALL creases simultaneously (not sequential)
- Realistic bending simulation between creases
- Strain visualization (internal stress during folding)
- Fold Percent slider: 0% (flat) to 100% (fully folded) to -100% (inverted)

### File Formats

- **Input:** SVG, FOLD
- **Export:** FOLD, STL, OBJ

### Physics Engine

- **Stiffness-based finite element approach:** triangulated faces are rigid panels connected by rotational hinges along fold lines
- Each fold edge has a **target angle** (+/-pi for mountain/valley), driven by angular spring forces
- Solver computes nodal displacements at each timestep to reach equilibrium
- **Fold stiffness:** controls how strongly hinges drive toward the target angle
- **Face stiffness:** controls rigidity of triangulated faces (resistance to bending/deformation)
- **Damping:** controls oscillation decay rate
- **Strain metric:** per-triangle deviation of edge lengths from rest lengths (flat state)
- Self-intersection is NOT prevented (the paper folds through itself if the geometry demands it)
- Based on the Schenk & Guest structural engineering approach
- Tomohiro Tachi's freeform origami variations
- Ruling-aware triangulation for curved creases
- GPU fragment shaders for parallel computation

### Programmatic Usage

- Core simulation can be driven **headlessly** (without the UI) by importing the solver module
- Feed FOLD JSON data in -> step the simulation programmatically
- FOLD is JSON, so it is easy to generate crease patterns from Python and pass them to the simulator
- Can be embedded in other web pages as a component

### Dependencies

- Three.js (3D rendering)
- FOLD API (internal data structure)
- Earcut + cdt2d (polygon triangulation)
- numeric.js (linear algebra)
- CCapture (GIF/WebM export)

### Relevance to Optigami

- **Direct integration for the Three.js Visualizer component**
- Strain heatmap capability already built in
- FOLD format native support
- Can be used for visual verification of generated fold patterns
- Export to STL/OBJ for 3D shape comparison in rewards

---

## 7. Tool: GamiBench

**Repo:** https://github.com/stvngo/GamiBench
**Dataset:** https://huggingface.co/datasets/stvngo/GamiBench
**Paper:** arXiv 2512.22207
**License:** MIT

### Benchmark Design

- 186 valid + 186 impossible crease patterns
- 6 viewpoints per pattern (top, bottom, front, back, right, left)
- **777 total samples** in the HuggingFace dataset (45.4 MB)
- 186 label classes (named origami patterns)

### Task Types

1. Standard tasks (2D CP -> 3D prediction)
2. Alternative-view tasks
3. Impossible tasks (validity checking)

### Dataset Schema

```python
{
    "image": PIL.Image,  # Origami pattern/fold image
    "label": int,        # 0-185 class label
    "split": str,        # Split identifier
}
```

### Loading

```python
from datasets import load_dataset

dataset = load_dataset("stvngo/GamiBench")
```

### Model Support

- OpenAI (GPT-4, GPT-4o-mini)
- Anthropic (Claude 4.5 Sonnet)
- Google (Gemini)
- xAI (Grok)
- OpenRouter models

### Code Structure

```
models/      # Model wrappers & factory
evaluators/  # BaseEvaluator: evaluate(), evaluate_single()
benchmarks/  # Benchmark implementations
configs/     # YAML/JSON configuration
utils/       # Shared helpers
pipeline.py  # Orchestration
run.py       # Entry point
```

### Relevance to Optigami

- **Evaluation benchmark for our trained model**
- 186 origami patterns = potential training/eval data
- Impossible patterns useful for constraint satisfaction testing
- Multi-view evaluation tests true 3D understanding
- Config-driven, reproducible evaluation pipeline

---

## 8. Tool: SpatialThinker Codebase

**Repo:** https://github.com/hunarbatra/SpatialThinker
**Paper:** arXiv 2511.07403

### Architecture

- Built on Qwen2.5-VL (3B and 7B variants)
- Uses veRL/EasyR1 for RL training
- vLLM 0.8.0 for inference during rollouts

### Code Structure

```
scripts/     # Training bash scripts per model size
evaluation/  # 18+ benchmark evaluation suite
data_gen/    # Data synthesis pipeline
verl/        # RL training framework (GRPO)
```

### Data Generation Pipeline

1. Generate raw QA pairs (12K-56K options)
2. Balance/filter with 50% spatial relations focus
3. Validate via GPT-4o (~75% pass rate)
4. Upload to HuggingFace

### Requirements

- Python 3.9+
- Transformers >= 4.49.0
- Flash-Attn >= 2.4.3
- vLLM >= 0.7.3

### Relevance to Optigami

- **Reference implementation for our GRPO training setup**
- veRL/EasyR1 framework = our training infrastructure
- Dense reward design directly applicable
- Data generation pipeline can be adapted for origami QA pairs

---

## 9. Tool: Trackio

**Repo:** https://github.com/gradio-app/trackio
**Author:** Hugging Face / Gradio team
**License:** MIT

### What It Is

Lightweight, local-first experiment tracking (a Weights & Biases alternative). API-compatible with wandb.
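Because the API mirrors wandb, existing W&B-style logging code can be pointed at Trackio by swapping one import. A minimal configuration sketch (project name and metrics are illustrative):

```python
# Drop-in replacement sketch: alias trackio as wandb so existing
# W&B-style logging code runs unchanged (names are illustrative).
import trackio as wandb

wandb.init(project="optigami-grpo", config={"lr": 1e-6})
wandb.log({"reward": 0.42, "loss": 1.7})  # queued, non-blocking
wandb.finish()
```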
### Key Features

- `import trackio as wandb` - drop-in W&B replacement
- Non-blocking `log()` with background queue (0.5s drain interval)
- SQLite local storage at `~/.cache/huggingface/trackio`
- Optional HuggingFace Spaces deployment for dashboards
- Slack/Discord webhook alerts (INFO/WARN/ERROR)
- 2,000 logs/8s single run; 32,000 logs/14s with 32 threads

### Usage

```python
import trackio

trackio.init(project="optigami-grpo", config={"lr": 1e-6, "model": "Qwen2.5-VL-7B"})
trackio.log({"step": step, "reward": reward, "loss": loss})
trackio.alert(title="Training spike", text="...", level=trackio.AlertLevel.WARN)
trackio.finish()

# Dashboard
trackio.show(project="optigami-grpo")
trackio.sync(project="optigami-grpo", space_id="openenv-community/optigami-training")
```

### Relevance to Optigami

- **Training metrics dashboard for GRPO training runs**
- Can deploy a live dashboard to HF Spaces
- Track reward components, loss, constraint satisfaction rates
- Alert on training anomalies (reward hacking, loss spikes)

---

## 10. Tool: Unsloth + GRPO Training

**Repo:** https://github.com/unslothai/unsloth
**Docs:** https://unsloth.ai/docs

### GRPO Algorithm in Unsloth

1. Generate N responses per prompt (8+ recommended)
2. Score each with custom reward functions
3. Z-score normalize rewards across the group -> advantages
4. PPO-style policy update (no value model or reward model needed)

### Memory Efficiency

- **90% less VRAM** vs standard GRPO
- 20K context, 8 generations, Llama 8B: 54.3GB (vs 510.8GB standard)
- QLoRA 4-bit: VRAM needed ~ model params in GB
- Shared GPU memory with the vLLM inference engine

### Vision Model Support

- Qwen2.5-VL-7B directly supported
- Qwen3-VL-8B, Gemma 3 (4B) also available
- `FastVisionModel.get_peft_model()` with granular layer control:
  - `finetune_vision_layers`, `finetune_language_layers`
  - `finetune_attention_modules`, `finetune_mlp_modules`

### LoRA Configuration

```python
model = FastVisionModel.get_peft_model(
    model,
    r=16,           # LoRA rank
    lora_alpha=16,  # alpha == r recommended
    lora_dropout=0,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
)
```

### GRPOConfig Options

```python
GRPOConfig(
    loss_type='grpo',  # or 'gspo', 'dr_grpo'
    epsilon=0.2,
    epsilon_high=0.28,
    delta=1.5,
    # ... standard training args
)
```

### vLLM Integration

- Shared memory between Unsloth and vLLM saves 3-5GB
- A100 40GB: ~4000 tokens/sec; T4 16GB: ~300 tokens/sec
- `fast_inference=True` enables the vLLM backend

### Training Requirements

- Minimum 300 steps before meaningful progress
- 500+ data rows recommended (works with 10+)
- Models >= 1.5B parameters for reasoning tokens
- Steps = rows x epochs; increase generations (8 -> 16) for more data

### Vision Data Format

```python
[
    {"role": "user", "content": [
        {"type": "text", "text": "instruction"},
        {"type": "image", "image": pil_image},
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "response"},
    ]},
]
```

### GRPO vs PPO vs DPO Comparison

| Aspect | PPO | DPO | GRPO |
|--------|-----|-----|------|
| Critic/Value model | Required (same size as policy) | Not needed | **Not needed** |
| Reference model | Required | Required | Required (old policy) |
| Training data | Online rollouts | Offline preference pairs | **Online rollouts + group scoring** |
| Reward signal | Scalar per token/step | Implicit from preferences | **Verifiable/explicit** |
| VRAM overhead | ~2x (policy + critic) | ~2x (policy + ref) | **~1.5x (no critic)** |

### GRPO Advantage Estimation

```
A_i = (r_i - mean(r_1..r_G)) / std(r_1..r_G)
```

By sampling G completions and normalizing rewards within the group, GRPO creates its own baseline without a value network - halving VRAM vs PPO.

### Complete Unsloth GRPO Code Example

```python
import re

from unsloth import FastLanguageModel, PatchFastRL

PatchFastRL("GRPO", FastLanguageModel)  # Patch TRL with Unsloth optimizations

from trl import GRPOConfig, GRPOTrainer

# Load model with QLoRA
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
    dtype=None,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=64,  # Higher rank for reasoning tasks
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=64,   # alpha == r recommended
    lora_dropout=0,  # Unsloth recommends 0
    bias="none",
    use_gradient_checkpointing="unsloth",  # Unsloth's optimized GC
    random_state=3407,
)

# Reward functions (TRL accepts a list, scores are summed)
def correctness_reward(completions, ground_truth, **kwargs):
    rewards = []
    for completion, gt in zip(completions, ground_truth):
        answer_match = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
        if answer_match and answer_match.group(1).strip() == gt.strip():
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    return rewards

def format_reward(completions, **kwargs):
    return [0.5 if ("<answer>" in c and "</answer>" in c) else 0.0 for c in completions]

# GRPO Config
config = GRPOConfig(
    output_dir="./grpo_output",
    num_generations=8,  # Group size G
    max_completion_length=2048,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    beta=0.04,  # KL penalty coefficient
    max_grad_norm=0.1,
    logging_steps=1,
    save_steps=250,
    bf16=True,
    loss_type='grpo',  # or 'gspo', 'dr_grpo'
)

trainer = GRPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    reward_funcs=[correctness_reward, format_reward],
    processing_class=tokenizer,
)
trainer.train()

# Save LoRA adapter
model.save_pretrained("./grpo_lora_adapter")

# Optional: merge and push
# model.save_pretrained_merged("./grpo_merged", tokenizer)
# model.push_to_hub_merged("username/model-name", tokenizer)
```

### Vision GRPO with Qwen2.5-VL

```python
from unsloth import FastVisionModel, PatchFastRL

PatchFastRL("GRPO", FastVisionModel)

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)

# For VLMs: typically freeze the vision encoder, train language layers
model = FastVisionModel.get_peft_model(
    model,
    r=16,  # Lower rank often sufficient for VLMs
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    finetune_vision_layers=False,  # Keep vision encoder frozen
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
)
```

### Unsloth ART (Agentic Reasoning Training)

ART extends GRPO to multi-turn agentic tasks:

1. **Multi-turn rollouts:** Model interacts with an environment over multiple turns (actions + observations)
2. **Environment integration:** Custom env provides observations and final rewards
3. **Verifiable rewards:** Emphasizes automatically verifiable outcomes

**Multi-turn pattern:**

```
Turn 1: User prompt -> Model + action -> Environment observation
Turn 2: Observation -> Model + action -> Environment observation
Turn 3: Observation -> Model final answer -> Reward computed
```

**Implementation options for multi-turn:**

1. **Single-generation (simpler):** Model outputs the full plan/sequence in one generation; the reward function evaluates the whole sequence
2. **Custom rollout loop (advanced):** Alternate model generation and env response, collect the full trajectory, compute GRPO gradients on the combined trajectory

### Key Hyperparameters Reference

| Parameter | Range | Notes |
|-----------|-------|-------|
| `num_generations` (G) | 4-16 | 8 common. More = better advantages, more VRAM |
| `beta` (KL penalty) | 0.01-0.1 | 0.04 default. Higher = stay closer to reference |
| `learning_rate` | 1e-6 to 1e-5 | Lower than SFT. 5e-6 starting point |
| `max_completion_length` | 512-4096 | Task-dependent |
| `r` (LoRA rank) | 16-128 | 64 for reasoning, 16 for VLM |
| `gradient_accumulation_steps` | 4-16 | Effective batch = per_device * accum * GPUs |
| `max_grad_norm` | 0.1-1.0 | 0.1 for stability |
| `warmup_ratio` | 0.05-0.1 | Important for RL stability |
| `epsilon` (clip) | 0.2 | PPO-style clipping |
| `epsilon_high` | 0.28 | Asymmetric upper clip |

### Qwen2.5-VL-7B Model Specifics

- Vision encoder: ViT with 2D-RoPE (handles arbitrary image resolutions via dynamic patching)
- LLM backbone: 28 layers, 3584 hidden dim, 28 attn heads, GQA with 4 KV heads
- Context: up to 32K tokens (128K with YaRN)
- Supports: single image, multi-image, video frames
- Unsloth IDs: `unsloth/Qwen2.5-VL-7B-Instruct`, `unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit`

### Qwen3-4B Model Specifics

- Hybrid thinking: can switch between `<think>` mode and direct response
- ~4B parameters, efficient for RL training
- MoE variants of the Qwen3 family are also available
- Unsloth IDs: `unsloth/Qwen3-4B`, `unsloth/Qwen3-4B-bnb-4bit`

---

## 11. Unsloth ART / GRPO Trainer Plan

### Phase 1: Data Preparation

**Training Data Sources:**

1. OrigamiSpace dataset (471 auxiliary instances) - CP diagrams, fold sequences, 3D shapes
2. GamiBench dataset (777 samples, 186 patterns) - crease patterns with multi-view 3D
3. Synthetic data generation pipeline (following the SpatialThinker approach):
   - Generate origami QA pairs with Claude/GPT
   - Validate with GPT-4o pass@2
   - Balance across difficulty levels

**Data Format for GRPO:**

```python
# Each training example = a prompt with an origami task
{
    "prompt": [
        {"role": "user", "content": [
            {"type": "image", "image": cp_diagram_image},
            {"type": "text", "text": "Given this crease pattern, describe the folding sequence and predict the final 3D shape. Output your answer as a FOLD JSON."},
        ]},
    ]
}
```

### Phase 2: Reward Function Design

**Following SpatialThinker's lexicographic gating pattern, adapted for origami:**

```python
def origami_reward(prompt, response, ground_truth):
    # Component 1: Format reward (gate)
    r_format = check_valid_fold_json(response)  # 0 or 1

    # Component 2: Constraint satisfaction
    r_constraints = check_origami_constraints(response)
    # - Maekawa's theorem: |M - V| = 2
    # - Kawasaki's theorem: alternating angle sums equal pi
    # - Euler's formula: V - E + F = 2
    # - No self-intersection

    # Component 3: Topological similarity
    r_topology = compute_tss(response, ground_truth)
    # Vertex/edge/face counts, connectivity

    # Component 4: Geometric similarity
    r_geometry = compute_hausdorff_similarity(response, ground_truth)

    # Component 5: Final shape match
    r_shape = compute_folded_state_similarity(response, ground_truth)

    # Lexicographic gating
    if r_format == 0:
        return 0.0

    total = (0.1 * r_format +
             0.25 * r_constraints +
             0.2 * r_topology +
             0.2 * r_geometry +
             0.25 * r_shape)
    return total
```

### Phase 3: Training Infrastructure

**Option A: Unsloth (simpler, less VRAM)**

```python
from unsloth import FastVisionModel
from trl import GRPOConfig, GRPOTrainer

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2.5-VL-7B-Instruct",
    load_in_4bit=True,
    fast_inference=True,
)
model = FastVisionModel.get_peft_model(model, r=16, lora_alpha=16)

config = GRPOConfig(
    loss_type="grpo",
    num_generations=8,
    max_completion_length=2048,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=3,
    learning_rate=1e-6,
)

trainer = GRPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    reward_funcs=[origami_reward],
)
trainer.train()
```

**Option B: veRL/EasyR1 (following SpatialThinker, more control)**

- Uses the veRL framework with GRPO
- vLLM backend for fast rollouts
- More complex but battle-tested for spatial reasoning
- Better for multi-turn rollouts

### Phase 4: Multi-Turn Rollouts

Following OrigamiSpace's environmental learning approach:

1. Model generates CP code / fold sequence
2. OpenEnv compiler validates and returns error feedback
3. Model refines based on error type (CSE/GIF/PSI/AFS)
4. Repeat up to 10 rounds
5. Final reward based on the best attempt

**Environment class pattern:**

```python
class OrigamiEnv:
    def __init__(self, task):
        self.task = task
        self.state = task["initial_state"]  # FOLD JSON
        self.steps = 0
        self.max_steps = 10
        self.history = []

    def step(self, action: str):
        """Process the model's fold action, return compiler feedback."""
        self.steps += 1
        # Validate through the compiler (CSE/GIF/PSI/AFS checks)
        result = self.compile_and_validate(action)
        observation = f"Step {self.steps}: {result['error_type']}: {result['message']}"
        self.state = result.get("new_state", self.state)
        self.history.append((action, observation))
        done = self.steps >= self.max_steps or result.get("valid", False)
        reward = self.compute_reward() if done else 0.0
        return observation, reward, done

    def compute_reward(self):
        """4-dimensional evaluation: TSS + GS + CS + FFS."""
        return (0.25 * tss(self.state, self.task["target"]) +
                0.25 * gs(self.state, self.task["target"]) +
                0.25 * cs(self.state) +
                0.25 * ffs(self.state, self.task["target"]))


def multi_turn_reward(completions, prompts, **kwargs):
    """Wrap environment interaction into a GRPO reward function."""
    rewards = []
    for completion, prompt in zip(completions, prompts):
        env = OrigamiEnv(extract_task(prompt))
        actions = parse_actions(completion)
        total_reward = 0.0
        for action in actions:
            obs, reward, done = env.step(action)
            total_reward += reward
            if done:
                break
        rewards.append(total_reward)
    return rewards
```

### Phase 5: Evaluation

1. **GamiBench** - standard origami spatial reasoning benchmark
2. **OrigamiSpace tasks** - 4-task evaluation suite
3. **Custom metrics:**
   - Constraint satisfaction rate (Maekawa/Kawasaki)
   - Compilation success rate
   - Topological/geometric similarity scores

### Phase 6: Monitoring with Trackio

```python
import trackio

trackio.init(
    project="optigami-grpo",
    space_id="openenv-community/optigami-training",
    config={
        "model": "Qwen2.5-VL-7B",
        "lora_r": 16,
        "num_generations": 8,
        "learning_rate": 1e-6,
    },
)

# In training loop
trackio.log({
    "step": step,
    "reward/total": total_reward,
    "reward/format": format_reward,
    "reward/constraints": constraint_reward,
    "reward/topology": topology_reward,
    "reward/geometry": geometry_reward,
    "reward/shape": shape_reward,
    "loss": loss,
    "compilation_rate": compilation_rate,
})
```

---

## 12. GitHub Reference Repo (ianalin123/optigami)

Located at `.reference/optigami-github/` (gitignored, not pushed to HF).

### What It Contains

A complete research repository with detailed architecture docs and a reference 2048 GRPO implementation.
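The 2048 reference implementation is built around a code-as-policy loop: pull a Python function out of the model's completion, then execute it under restrictions. A minimal sketch of that extraction step (hypothetical helper, not the repo's actual code, and `exec` here stands in for the repo's sandboxed execution):

````python
import re

def extract_function(completion: str, name: str = "fold_strategy"):
    """Pull the first ```python fenced block from a completion and return
    the named callable it defines, or None on failure. Hypothetical helper
    mirroring the repo's code-as-policy pattern; sandbox exec in real use."""
    match = re.search(r"```python\n(.*?)```", completion, re.DOTALL)
    if match is None:
        return None
    namespace = {}
    try:
        exec(match.group(1), namespace)  # NOTE: unsafe on untrusted code
    except Exception:
        return None
    fn = namespace.get(name)
    return fn if callable(fn) else None

completion = """Here is my strategy:
```python
def fold_strategy(paper_state):
    return [{"type": "valley", "angle": 180}]
```
"""
strategy = extract_function(completion)
print(strategy({"vertices": []}))  # -> [{'type': 'valley', 'angle': 180}]
````

A reward function can then score the returned callable (valid, executable, physically plausible) instead of scoring raw text, which is what makes the rewards in this pattern verifiable.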
### Key Files

| File | Contents |
|------|----------|
| `research/plan/architecture.md` | **Full architecture spec**: action space, state, physics engine, reward functions, OpenEnv integration, rendering pipeline, project structure, implementation order |
| `research/openenv/2048_example.py` | **636-line reference implementation** of OpenEnv + GRPO for the 2048 game (Unsloth + TRL) |
| `research/openenv/overview.md` | OpenEnv framework API, types, project structure, deployment to HF Spaces |
| `research/origami/fold_types_deep.md` | All fold operations, Huzita-Justin axioms, crane step-by-step, compression patterns |
| `research/origami/math_physics_deep.md` | Kawasaki/Maekawa theorems with code, bar-and-hinge model, energy formulas |
| `research/origami/rendering_research.md` | Rendering options comparison |
| `research/origami/fold_format.md` | FOLD file format details |

### Architecture Decisions (from GitHub repo)

| Decision | Choice |
|----------|--------|
| LLM interaction | **Code-as-policy** (LLM writes a `fold_strategy()` function) |
| Action space | Named fold ops (valley/mountain + fold line + angle) |
| State format | FOLD-compatible JSON |
| Physics engine | Bar-and-hinge model (NumPy port of Ghassaei) |
| Validation | Kawasaki + Maekawa + triangle-triangle intersection |
| Primary task | Solar panel packing (Miura-ori discovery) |
| Training | GRPO via TRL + Unsloth |
| Deployment | Docker Space on HF Spaces |

### Action Space (Code-as-Policy)

The LLM generates a `fold_strategy(paper_state)` function returning fold instructions:

```python
def fold_strategy(paper_state: dict) -> list[dict]:
    # paper_state contains: vertices, edges, assignments, fold_angles, material, etc.
    return [
        {"type": "valley", "line": {"start": [0, 0.5], "end": [1, 0.5]}, "angle": 180},
        {"type": "mountain", "line": {"start": [0.5, 0], "end": [0.5, 0.5]}, "angle": 180},
    ]
```

### Reward Functions (3 from the 2048 pattern, adapted for origami)

1. **`code_valid`**: +1.0 valid function, -0.5 exec fails, -2.0 syntax error
2. **`physically_valid`**: +1.0 all valid, -2.0 per Kawasaki/Maekawa violation, -5.0 self-intersection
3. **`fold_quality`**: +20.0 * compactness, +10.0 meets volume target, +5.0 deployable, -0.5 per fold

### Physics Engine (Bar-and-Hinge Model)

```
E_total = E_bar + E_facet + E_fold

E_bar   = sum (1/2) * k_axial * (L - L0)^2           # stretching
E_facet = sum (1/2) * k_facet * l * (theta - pi)^2   # panel bending
E_fold  = sum (1/2) * k_fold  * l * (rho - rho_t)^2  # crease folding
```

### Planned Project Structure

```
engine/                     # Core simulation (numpy/scipy)
  paper.py                  # Paper data structure, FOLD I/O
  fold_engine.py            # Apply folds (quaternion rotation)
  physics.py                # Bar-and-hinge energy, strain
  validation.py             # Kawasaki, Maekawa, self-intersection
  metrics.py                # Deployment ratio, compactness
  materials.py              # Material definitions
environment/                # OpenEnv server
  models.py                 # Action, Observation, State
  origami_environment.py    # Environment (reset/step/state)
  tasks.py                  # Task pool / curriculum
  app.py                    # create_app()
  Dockerfile
client/                     # OpenEnv client + training bridge
  reward_functions.py       # code_valid, physically_valid, fold_quality
training/                   # Colab notebook
  train_origami.ipynb       # GRPO training (Unsloth + TRL)
  prompts.py                # LLM prompt templates
```

### Implementation Order (from architecture.md)

1. **Phase 1: Engine** - paper.py, fold_engine.py, validation.py, metrics.py
2. **Phase 2: OpenEnv Server** - models.py, origami_environment.py, app.py, Dockerfile
3. **Phase 3: Reward + Training** - reward_functions.py, prompts.py, train_origami.ipynb
4. **Phase 4: Rendering + Demo** - matplotlib headless, React + R3F app

### 2048 Reference Implementation (Key Patterns)

The `2048_example.py` shows the exact Unsloth + OpenEnv + GRPO pattern:

- `PatchFastRL` not used (text model, not vision) - for our VLM use `FastVisionModel`
- `extract_function()` parses code from `` ```python `` fenced blocks
- `create_locked_down_function()` sandboxes execution
- `check_python_modules()` prevents non-stdlib imports
- `execute_with_time_limit(5)` wraps strategy execution
- Dataset: 1000x replicated prompt, `report_to="trackio"`
- GRPOConfig: temp=1.0, lr=2e-4, max_steps=600, num_generations=2
- Three reward functions passed as a list to `GRPOTrainer`

---

## 13. Current Project State

### Repository

- **Location:** HuggingFace Space `openenv-community/optigami`
- **Framework:** Create React App (React 19.1.0)
- **Status:** Fresh scaffold - default CRA boilerplate
- **Build:** `npm run build` -> `build/index.html` (HF Spaces static SDK)

### File Structure

```
optigami/
  package.json         # React app dependencies
  README.md            # CRA default + HF Space metadata
  public/              # Static assets (favicon, manifest)
  src/
    App.js             # Default CRA component (placeholder)
    App.css
    index.js           # Entry point
    index.css
    logo.svg
    reportWebVitals.js
    setupTests.js
    App.test.js
```

### What Needs to Be Built

1. **Python backend** - Paper Geometry Engine with Shapely, FOLD import/export, constraint checking
2. **GRPO training scripts** - Unsloth or veRL-based, with origami reward functions
3. **Data pipeline** - Load/process OrigamiSpace + GamiBench datasets
4. **Three.js frontend** - Replace CRA boilerplate with the origami visualizer (possibly integrating OrigamiSimulator)
5. **OpenEnv server** - API connecting the geometry engine to the trainer

---

## Key Takeaways for Immediate Work (GRPO Trainer)

1. **Use Unsloth for simplicity** - 90% VRAM savings, built-in vLLM, QLoRA support for Qwen2.5-VL-7B
2.
**Dense rewards with lexicographic gating** - format gate -> constraints -> topology -> geometry -> shape match (SpatialThinker pattern) 3. **OrigamiSpace's 4-error compiler** is the gold standard for reward signal generation 4. **Start with 500+ origami examples** - GamiBench (777) + OrigamiSpace (471) = 1248 examples 5. **8 generations per prompt**, temperature 1.0, 300+ training steps minimum 6. **Multi-turn: max 10 rounds** with compiler feedback (performance saturates after 8-10) 7. **Track with Trackio** - deploy dashboard to HF Spaces for real-time monitoring 8. **Evaluate on GamiBench** for standardized comparison against other MLLMs --- ## Cross-Reference: Tool Compatibility Matrix | Component | FOLD | OrigamiSim | GamiBench | SpatialThinker | Unsloth | Trackio | |-----------|------|------------|-----------|----------------|---------|---------| | State representation | Core | Import | - | - | - | - | | Visualization | Export | Core | - | - | - | - | | Training data | - | - | Core | Augment | - | - | | RL training | - | - | Eval | Template | Core | Monitor | | Reward functions | Validate | Strain | - | Template | Integrate | Log | | Constraint checking | Structure | Physics | Impossible set | - | - | - |
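The lexicographic reward gating described in the takeaways above (format gate -> constraints -> topology -> geometry -> shape match) can be sketched as follows. Component names and magnitudes are illustrative, not SpatialThinker's actual values; the point is that a failed upstream stage stops crediting everything downstream:

```python
def gated_reward(
    format_ok: bool,
    constraint_score: float,  # Kawasaki/Maekawa satisfaction, in [0, 1]
    topology_score: float,    # face/edge adjacency match, in [0, 1]
    geometry_score: float,    # vertex-position similarity, in [0, 1]
    shape_score: float,       # final-shape match, in [0, 1]
) -> float:
    # Hard format gate: unparseable output earns a flat penalty,
    # so later stages can never mask a formatting failure.
    if not format_ok:
        return -1.0
    total = 1.0  # format reward
    # Lexicographic gating: stop crediting downstream stages as soon
    # as an upstream stage scores zero.
    for score in (constraint_score, topology_score, geometry_score, shape_score):
        if score <= 0.0:
            break
        total += score
    return total
```

A rollout that satisfies the constraints but mismatches the topology thus earns credit only up to the constraint stage, which keeps the gradient signal focused on the earliest failing check.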
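The bar-and-hinge energy terms listed under the physics engine in section 12 (E_bar, E_facet, E_fold) translate directly into NumPy reductions. A sketch with assumed per-element arrays and illustrative stiffness constants, not the repo's `physics.py`:

```python
import numpy as np

def bar_energy(L, L0, k_axial):
    """Stretching: E_bar = sum 1/2 * k_axial * (L - L0)^2."""
    return 0.5 * np.sum(k_axial * (L - L0) ** 2)

def facet_energy(theta, l, k_facet):
    """Panel bending: E_facet = sum 1/2 * k_facet * l * (theta - pi)^2."""
    return 0.5 * np.sum(k_facet * l * (theta - np.pi) ** 2)

def fold_energy(rho, rho_t, l, k_fold):
    """Crease folding: E_fold = sum 1/2 * k_fold * l * (rho - rho_t)^2."""
    return 0.5 * np.sum(k_fold * l * (rho - rho_t) ** 2)

def total_energy(L, L0, k_axial, theta, l_f, k_facet, rho, rho_t, l_c, k_fold):
    """E_total = E_bar + E_facet + E_fold."""
    return (bar_energy(L, L0, k_axial)
            + facet_energy(theta, l_f, k_facet)
            + fold_energy(rho, rho_t, l_c, k_fold))
```

At rest (bars at natural length, facets flat, creases at their target angles) every term vanishes, which is the sanity check a strain-based reward would build on.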