
Optigami Research Notes

Comprehensive notes on all sources, tools, and architecture for the Optigami project.


Table of Contents

  1. Project Architecture Overview
  2. Paper: OrigamiSpace (2511.18450)
  3. Paper: SpatialThinker (2511.07403)
  4. Paper: Automating Rigid Origami Design (2211.13219)
  5. Tool: FOLD Format (edemaine/fold)
  6. Tool: Origami Simulator
  7. Tool: GamiBench
  8. Tool: SpatialThinker Codebase
  9. Tool: Trackio
  10. Tool: Unsloth + GRPO Training
  11. Unsloth ART / GRPO Trainer Plan
  12. GitHub Reference Repo (ianalin123/optigami)

1. Project Architecture Overview

+---------------------------------------------------+
|                   OpenEnv Server                   |
|  +-----------+  +----------+  +--------------+    |
|  |   State   |  |  Action  |  |   Reward     |    |
|  | (FOLD JSON|  | (LLM     |  | (Dense,      |    |
|  |  + target)|  |  output) |  |  verifiable) |    |
|  +-----------+  +----------+  +--------------+    |
|         |              |              |            |
|         v              v              v            |
|  +-----------------------------------------------+|
|  |         Paper Geometry Engine (Python)         ||
|  |  - Polygon state (Shapely)                    ||
|  |  - Fold operations (reflection across line)   ||
|  |  - Kawasaki/Maekawa constraint checks         ||
|  |  - Layer tracking                             ||
|  |  - FOLD format import/export                  ||
|  +-----------------------------------------------+|
|         |                                          |
|         v                                          |
|  +-----------------------------------------------+|
|  |         Three.js Visualizer (Demo only)        ||
|  |  - 3D fold animation                          ||
|  |  - Strain heatmap                             ||
|  |  - Instruction stream                         ||
|  +-----------------------------------------------+|
+---------------------------------------------------+
         |                    ^
         v                    |
+---------------------------------------------------+
|              Unsloth ART / GRPO Trainer            |
|  - Qwen2.5-VL-7B or Qwen3-4B base model          |
|  - LoRA/QLoRA for efficient training              |
|  - Multi-turn rollouts                            |
+---------------------------------------------------+

Three major components:

  1. OpenEnv Server - RL environment serving state/action/reward for origami folding
  2. Paper Geometry Engine - Python-based origami math (Shapely polygons, fold reflections, constraint checking)
  3. Unsloth ART / GRPO Trainer - RL fine-tuning of vision-language models for origami reasoning

Current focus: Unsloth ART / GRPO Trainer


2. Paper: OrigamiSpace (2511.18450)

Title: ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints
Authors: Rui Xu, Dakuan Lu, Zicheng Zhao, Xiaoyu Tan, Xintao Wang, Siyu Yuan, Jiangjie Chen, Yinghui Xu
Date: November 23, 2025
Venue: arXiv (cs.AI)

Dataset

  • 350 primary instances + 471 auxiliary (without folding processes)
  • Each instance: CP diagram, compiled flat pattern, folding process (multi-step images), final 3D shape
  • Complexity: Easy (3-9 steps), Medium (10-19), Hard (20-30), avg 8.2 steps
  • 1,620 total questions across 4 tasks

Four Evaluation Tasks

| Task | Questions | Description |
|---|---|---|
| Pattern Prediction | 350 | CP diagram -> predict final 3D shape (multiple choice) |
| Multi-step Spatial Reasoning | 250 | Shuffled fold images -> correct chronological sequence |
| Spatial Relationship Prediction | 900 | 3 subtypes: pose localization, layering analysis, geometric change |
| End-to-End CP Code Generation | 120 | Flat layout + folded shape -> generate CP code |

Compiler Architecture (Critical for OpenEnv)

Four-category error feedback system:

  1. CSE (CP Code Syntax Error): Validates vertices, edges, faces, crease types; checks Euler's formula V-E+F=2
  2. GIF (Geometrically Impossible Fold): Maekawa's theorem |M-V|=2, Kawasaki's theorem (alternating sector angles around an interior vertex each sum to pi), Big-Little-Big angle constraint
  3. PSI (Paper Self-Intersection): Cyclic layering, collision detection (discrete + CCD), octrees/BVHs
  4. AFS (Ambiguous Folding State): Multiple valid M/V assignments, non-unique stacking
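The two flat-foldability conditions in the GIF category can be sketched directly; `maekawa_ok` and `kawasaki_ok` are our names for illustrative helpers operating on a single interior vertex:

```python
import math

def maekawa_ok(assignments):
    """Maekawa's theorem: at a flat-foldable interior vertex,
    mountain and valley crease counts differ by exactly 2."""
    m = assignments.count("M")
    v = assignments.count("V")
    return abs(m - v) == 2

def kawasaki_ok(angles, tol=1e-9):
    """Kawasaki's theorem: the alternating sector angles around a
    flat-foldable interior vertex each sum to pi."""
    if len(angles) % 2 != 0:
        return False  # flat-foldable interior vertices have even degree
    odd_sum = sum(angles[0::2])
    even_sum = sum(angles[1::2])
    return (math.isclose(odd_sum, math.pi, abs_tol=tol)
            and math.isclose(even_sum, math.pi, abs_tol=tol))
```

A full GIF checker would run these per interior vertex of the crease pattern and report which vertex violates which theorem.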

CP Code Evaluation (4 dimensions, 0.25 weight each)

  1. Topological Structure Similarity (TSS): Vertex/edge/face count comparison, s_v = e^(-0.5|V_gen - V_ref| / min(V_gen, V_ref))
  2. Geometric Similarity (GS): Hausdorff distance, s_p = e^(-5 * d_H), dihedral angle distribution, aspect ratio
  3. Constraint Satisfaction (CS): Taco-Taco, Taco-Tortilla, transitivity, Maekawa/Kawasaki
  4. Final Folded State (FFS): Shape similarity, layering comparison, stacking order
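The TSS count term above can be written out as a one-liner (helper name ours; the same exponential form applies to edge and face counts):

```python
import math

def count_similarity(n_gen, n_ref):
    """Count-based similarity from the TSS dimension:
    s = exp(-0.5 * |n_gen - n_ref| / min(n_gen, n_ref))."""
    return math.exp(-0.5 * abs(n_gen - n_ref) / min(n_gen, n_ref))
```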

Learning Approaches

  • In-Context Learning: Single-pass, detailed instructions + examples
  • Environmental Learning: Iterative model<->compiler loop, max 10 rounds, performance saturates after 8-10
  • Reinforcement Learning (TRICO/PPO-based):
    • Training data: 471 instances from environmental learning
    • Model: Qwen2.5-VL-32B
    • Rewards: Intermediate (success bonus + quality progress), step penalty, final evaluation score
    • Result: RL-trained 32B exceeded 72B baseline

Key Results

  • Best closed-source: GPT-4o (42.71% pattern), Gemini2.5-pro (53.45% multi-step)
  • Best open-source: Qwen2.5-VL-72B (36.29% pattern, 39.10% multi-step)
  • Expert human: 98.45% pattern, 100% multi-step
  • Constraint satisfaction is the primary bottleneck (~30% for top models)
  • Human-model gap: 20-45 percentage points

Relevance to Optigami

  • Direct blueprint for our OpenEnv server: the compiler architecture with 4 error types is exactly what we need
  • The CP code evaluation framework (TSS/GS/CS/FFS) can be our reward function
  • Environmental learning approach maps to multi-turn rollouts in GRPO
  • Confirms Qwen2.5-VL as viable base model (they used 32B, we target 7B)

3. Paper: SpatialThinker (2511.07403)

Title: SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards
Authors: Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark
Date: November 10, 2025
Venue: NeurIPS 2025 Workshops (SpaVLE, EWM, ARLET, SEA)

Core Innovation

Dense spatial rewards + GRPO for training Qwen2.5-VL on spatial reasoning tasks. Key insight: sparse rewards lead to reward hacking; dense multi-objective rewards with lexicographic gating prevent this.

GRPO Training Configuration

  • Rollouts: 8 samples per query, temperature 1.0
  • Batch size: rollout=512, global=128
  • Training: 75 steps (~5 episodes)
  • Hardware: 4x NVIDIA H100 80GB
  • Time: ~13h (3B), ~15h (7B)
  • Advantage: A(i) = (r(i) - mu) / (sigma + epsilon), epsilon=1e-6
  • Loss: PPO-style with clip(epsilon_l=0.2, epsilon_h=0.3), KL penalty beta=0.01

Dense Spatial Reward Design (CRITICAL - template for our rewards)

4-component reward with lexicographic gating:

R_total = I[R_format=1] * (w_format*R_f + w_count*R_c + w_accuracy*R_a + I[R_accuracy=1]*w_spatial*R_s)
| Component | Weight | Description |
|---|---|---|
| Format (R_f) | 0.1 | JSON-parseable scene graph with required fields |
| Count (R_c) | 0.2 | Penalizes deviation in object/relation counts (lambda_obj=0.7, lambda_rel=0.3) |
| Accuracy (R_a) | 0.5 | Binary exact string match |
| Spatial (R_s) | 0.2 | Hungarian matching with CIoU, activated ONLY when answer correct |

Lexicographic gating is essential: format compliance gates all rewards; spatial rewards only activate on correct answers. Without gating, severe reward hacking occurs (74.9% -> 23.7% with naive spatial rewards).
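A minimal sketch of this gating, using the weights from the table above (function name ours, components assumed pre-computed in [0, 1]):

```python
def gated_reward(r_format, r_count, r_accuracy, r_spatial):
    """SpatialThinker-style lexicographic gating: format compliance
    gates all reward; the spatial term only activates when the
    answer is exactly correct (r_accuracy == 1)."""
    if r_format != 1:
        return 0.0  # format gate: no partial credit for unparseable output
    total = 0.1 * r_format + 0.2 * r_count + 0.5 * r_accuracy
    if r_accuracy == 1:
        total += 0.2 * r_spatial  # spatial reward gated on correctness
    return total
```

The correctness gate on the spatial term is what prevents the model from farming localization reward while answering incorrectly.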

STVQA-7K Dataset

  • 7,587 spatial VQA pairs from Visual Genome scene graphs
  • Generated by Claude Sonnet, validated by GPT-4o pass@2
  • 9 spatial categories, 34 additional spatial predicates beyond standard VG150
  • 90/10 train/val split

Key Results

  • SpatialThinker-7B surpasses GPT-4o on 3DSRBench by +12.1%
  • Dense reward RL: +7.2% avg across 12 benchmarks (1.8x the +4.0% from sparse GRPO)
  • Outperforms models trained on millions of samples (trained on only 7K)

Relevance to Optigami

  • Direct template for our GRPO training pipeline
  • Dense reward design with lexicographic gating prevents reward hacking
  • Proves Qwen2.5-VL-7B is excellent base for spatial reasoning RL
  • veRL/EasyR1 framework for training infrastructure
  • Shows 7K samples sufficient for strong results

4. Paper: Automating Rigid Origami Design (2211.13219)

Title: Automating Rigid Origami Design
Authors: Jeremia Geiger, Karolis Martinkus, Oliver Richter, Roger Wattenhofer
Date: November 2022 (revised April 2023)
Venue: IJCAI 2023 AI, Arts & Creativity Special Track

Core Contribution

  • Formulates rigid origami design as discrete optimization: the "rigid origami game"
  • Based on "three units method" principle
  • Framework supports diverse objectives via abstract reward functions
  • Generates optimized, application-specific crease patterns

Methodology

  • Multiple search methods within optimization framework
  • Flexible objective definition for application-specific requirements
  • Can approximate target shapes and produce functional designs

Relevance to Optigami

  • Validates the "origami as game/environment" paradigm we're building
  • Their reward formulation approach (function-based, abstract) aligns with our OpenEnv design
  • Discrete optimization over crease patterns = the action space for our RL agent

5. Tool: FOLD Format

Repo: https://github.com/edemaine/fold
Authors: Erik Demaine (MIT), Jason Ku (MIT), Robert Lang
License: MIT

What It Is

FOLD (Flexible Origami List Datastructure) - JSON-based file format (.fold) for representing origami models. The standard interchange format for computational origami.

Data Structure

{
  "vertices_coords": [[x,y], ...],      // 2D or 3D coordinates
  "edges_vertices": [[v1,v2], ...],      // Edge endpoints
  "edges_assignment": ["M","V",...],     // Mountain/Valley/Boundary/Flat/Unassigned
  "faces_vertices": [[v1,v2,v3], ...],   // Face vertex lists
  "faceOrders": [[f1,f2,order], ...],    // Stacking/layering order
  "frame_*": ...                         // Multiple frames (folding states)
}
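As a concrete instance of these fields, here is a hand-written minimal FOLD-style dict for a unit square with one valley crease across its horizontal midline (illustrative only, not produced by the fold library):

```python
# Square folded once along y = 0.5; vertices 4 and 5 are the crease endpoints.
square_fold = {
    "vertices_coords": [[0, 0], [1, 0], [1, 1], [0, 1], [0, 0.5], [1, 0.5]],
    "edges_vertices": [[0, 1], [1, 5], [5, 2], [2, 3], [3, 4], [4, 0], [4, 5]],
    "edges_assignment": ["B", "B", "B", "B", "B", "B", "V"],  # boundary + 1 valley
    "faces_vertices": [[0, 1, 5, 4], [4, 5, 2, 3]],           # lower and upper halves
}

def fold_counts_consistent(fold):
    """Cheap sanity check: the parallel edge arrays must line up."""
    return len(fold["edges_vertices"]) == len(fold["edges_assignment"])
```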

JavaScript API

// Browser
<script src="https://edemaine.github.io/fold/dist/fold.js"></script>

// Node.js
npm install --save fold

// Usage: FOLD.moduleName.functionName
FOLD.filter.collapseNearbyVertices(foldObject)

CLI Tools

  • fold-convert: ORIPA .opx -> .fold conversion
  • fold-convert --flat-fold: Compute flat-folded state

Supported Software Ecosystem

OrigamiSimulator, Freeform Origami (Tachi), Rabbit Ear (Kraft), ORIPA, Crease Pattern Editor, Rhino Grasshopper

Relevance to Optigami

  • Core data format for OpenEnv state representation
  • JSON = easy Python/JS interop
  • Stacking order (faceOrders) = layer tracking
  • edges_assignment = mountain/valley fold type
  • Import/export between geometry engine and visualizer

6. Tool: Origami Simulator

Repo: https://github.com/amandaghassaei/OrigamiSimulator
URL: origamisimulator.org
Author: Amanda Ghassaei
License: MIT
Stack: JavaScript (68.4%), Three.js, GPU fragment shaders

Capabilities

  • Real-time GPU-accelerated folding simulation
  • Folds ALL creases simultaneously (not sequential)
  • Realistic bending simulation between creases
  • Strain visualization (internal stress during folding)
  • Fold Percent slider: ranges from -100% (inverted folds) through 0% (flat) to 100% (fully folded)

File Formats

  • Input: SVG, FOLD
  • Export: FOLD, STL, OBJ

Physics Engine

  • Stiffness-based finite element approach: Triangulated faces are rigid panels connected by rotational hinges along fold lines
  • Each fold edge has a target angle (+/-pi for mountain/valley), driven by angular spring forces
  • Solver computes nodal displacements at each timestep to reach equilibrium
  • Fold stiffness: Controls how strongly hinges drive toward target angle
  • Face stiffness: Controls rigidity of triangulated faces (resistance to bending/deformation)
  • Damping: Controls oscillation decay rate
  • Strain metric: Per-triangle deviation of edge lengths from rest lengths (flat state)
  • Self-intersection is NOT prevented (folds through itself if geometry demands it)
  • Based on Schenk & Guest structural engineering approach
  • Tomohiro Tachi's freeform origami variations
  • Ruling-aware triangulation for curved creases
  • GPU fragment shaders for parallel computation
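The strain metric described above reduces, per edge, to relative deviation from the flat-state rest length; a simplified sketch of our reading of that metric (helper name ours, not the simulator's API):

```python
import math

def edge_strain(p, q, rest_length):
    """Relative deviation of an edge's current length from its rest
    (flat-state) length; averaging over a triangle's three edges
    gives a per-triangle strain value for the heatmap."""
    length = math.dist(p, q)
    return (length - rest_length) / rest_length
```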

Programmatic Usage

  • Core simulation can be driven headlessly (without UI) by importing solver module
  • Feed FOLD JSON data -> step simulation programmatically
  • FOLD is JSON, so easy to generate crease patterns from Python and pass to simulator
  • Can embed in other web pages as a component

Dependencies

  • Three.js (3D rendering)
  • FOLD API (internal data structure)
  • Earcut + cdt2d (polygon triangulation)
  • numeric.js (linear algebra)
  • CCapture (GIF/WebM export)

Relevance to Optigami

  • Direct integration for Three.js Visualizer component
  • Strain heatmap capability already built in
  • FOLD format native support
  • Can be used for visual verification of generated fold patterns
  • Export to STL/OBJ for 3D shape comparison in rewards

7. Tool: GamiBench

Repo: https://github.com/stvngo/GamiBench
Dataset: https://huggingface.co/datasets/stvngo/GamiBench
Paper: arXiv 2512.22207
License: MIT

Benchmark Design

  • 186 valid + 186 impossible crease patterns
  • 6 viewpoints per pattern (top, bottom, front, back, right, left)
  • 777 total samples in HuggingFace dataset (45.4 MB)
  • 186 label classes (named origami patterns)

Task Types

  1. Standard tasks (2D CP -> 3D prediction)
  2. Alternative-view tasks
  3. Impossible tasks (validity checking)

Dataset Schema

{
  "image": PIL.Image,     # Origami pattern/fold image
  "label": int,           # 0-185 class label
  "split": str            # Split identifier
}

Loading

from datasets import load_dataset
dataset = load_dataset("stvngo/GamiBench")

Model Support

  • OpenAI (GPT-4, GPT-4o-mini)
  • Anthropic (Claude 4.5 Sonnet)
  • Google (Gemini)
  • xAI (Grok)
  • OpenRouter models

Code Structure

models/          # Model wrappers & factory
evaluators/      # BaseEvaluator: evaluate(), evaluate_single()
benchmarks/      # Benchmark implementations
configs/         # YAML/JSON configuration
utils/           # Shared helpers
pipeline.py      # Orchestration
run.py           # Entry point

Relevance to Optigami

  • Evaluation benchmark for our trained model
  • 186 origami patterns = potential training/eval data
  • Impossible patterns useful for constraint satisfaction testing
  • Multi-view evaluation tests true 3D understanding
  • Config-driven, reproducible evaluation pipeline

8. Tool: SpatialThinker Codebase

Repo: https://github.com/hunarbatra/SpatialThinker
Paper: arXiv 2511.07403

Architecture

  • Built on Qwen2.5-VL (3B and 7B variants)
  • Uses veRL/EasyR1 for RL training
  • vLLM 0.8.0 for inference during rollouts

Code Structure

scripts/         # Training bash scripts per model size
evaluation/      # 18+ benchmark evaluation suite
data_gen/        # Data synthesis pipeline
verl/            # RL training framework (GRPO)

Data Generation Pipeline

  1. Generate raw QA pairs (12K-56K options)
  2. Balance/filter with 50% spatial relations focus
  3. Validate via GPT-4o (~75% pass rate)
  4. Upload to HuggingFace

Requirements

  • Python 3.9+
  • Transformers >= 4.49.0
  • Flash-Attn >= 2.4.3
  • vLLM >= 0.7.3

Relevance to Optigami

  • Reference implementation for our GRPO training setup
  • veRL/EasyR1 framework = our training infrastructure
  • Dense reward design directly applicable
  • Data generation pipeline can be adapted for origami QA pairs

9. Tool: Trackio

Repo: https://github.com/gradio-app/trackio
Author: Hugging Face / Gradio team
License: MIT

What It Is

Lightweight, local-first experiment tracking (Weights & Biases alternative). API-compatible with wandb.

Key Features

  • import trackio as wandb - drop-in W&B replacement
  • Non-blocking log() with background queue (0.5s drain interval)
  • SQLite local storage at ~/.cache/huggingface/trackio
  • Optional HuggingFace Spaces deployment for dashboards
  • Slack/Discord webhook alerts (INFO/WARN/ERROR)
  • 2,000 logs/8s single run; 32,000 logs/14s with 32 threads

Usage

import trackio

trackio.init(project="optigami-grpo", config={"lr": 1e-6, "model": "Qwen2.5-VL-7B"})
trackio.log({"step": step, "reward": reward, "loss": loss})
trackio.alert(title="Training spike", text="...", level=trackio.AlertLevel.WARN)
trackio.finish()

# Dashboard
trackio.show(project="optigami-grpo")
trackio.sync(project="optigami-grpo", space_id="openenv-community/optigami-training")

Relevance to Optigami

  • Training metrics dashboard for GRPO training runs
  • Can deploy live dashboard to HF Spaces
  • Track reward components, loss, constraint satisfaction rates
  • Alert on training anomalies (reward hacking, loss spikes)

10. Tool: Unsloth + GRPO Training

Repo: https://github.com/unslothai/unsloth
Docs: https://unsloth.ai/docs

GRPO Algorithm in Unsloth

  1. Generate N responses per prompt (8+ recommended)
  2. Score each with custom reward functions
  3. Z-score normalize rewards across group -> advantages
  4. PPO-style policy update (no value model or reward model needed)

Memory Efficiency

  • 90% less VRAM vs standard GRPO
  • 20K context, 8 generations, Llama 8B: 54.3GB (vs 510.8GB standard)
  • QLoRA 4-bit: VRAM needed is roughly the 4-bit model's size in GB
  • Shared GPU memory with vLLM inference engine

Vision Model Support

  • Qwen2.5-VL-7B directly supported
  • Qwen3-VL-8B, Gemma 3 (4B) also available
  • FastVisionModel.get_peft_model() with granular layer control:
    • finetune_vision_layers, finetune_language_layers
    • finetune_attention_modules, finetune_mlp_modules

LoRA Configuration

model = FastVisionModel.get_peft_model(
    model,
    r=16,                          # LoRA rank
    lora_alpha=16,                 # alpha == r recommended
    lora_dropout=0,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
)

GRPOConfig Options

GRPOConfig(
    loss_type='grpo',        # or 'gspo', 'dr_grpo'
    epsilon=0.2,
    epsilon_high=0.28,
    delta=1.5,
    # ... standard training args
)

vLLM Integration

  • Shared memory between Unsloth and vLLM saves 3-5GB
  • A100 40GB: ~4000 tokens/sec, T4 16GB: ~300 tokens/sec
  • fast_inference=True enables vLLM backend

Training Requirements

  • Minimum 300 steps before meaningful progress
  • 500+ data rows recommended (works with 10+)
  • Models >= 1.5B parameters for reasoning tokens
  • Steps = rows x epochs; increase generations (8->16) for more data

Vision Data Format

[
    {"role": "user", "content": [
        {"type": "text", "text": "instruction"},
        {"type": "image", "image": pil_image}
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "response"}
    ]}
]
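A small helper (hypothetical, name ours) that assembles this conversation format from an instruction and an image object:

```python
def vision_example(instruction, image, response=None):
    """Build a conversation in the vision data format above.
    `image` is whatever your pipeline expects (e.g. a PIL.Image);
    omit `response` for inference-time prompts."""
    messages = [{"role": "user", "content": [
        {"type": "text", "text": instruction},
        {"type": "image", "image": image},
    ]}]
    if response is not None:
        messages.append({"role": "assistant", "content": [
            {"type": "text", "text": response},
        ]})
    return messages
```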

GRPO vs PPO vs DPO Comparison

| Aspect | PPO | DPO | GRPO |
|---|---|---|---|
| Critic/Value model | Required (same size as policy) | Not needed | Not needed |
| Reference model | Required | Required | Required (old policy) |
| Training data | Online rollouts | Offline preference pairs | Online rollouts + group scoring |
| Reward signal | Scalar per token/step | Implicit from preferences | Verifiable/explicit |
| VRAM overhead | ~2x (policy + critic) | ~2x (policy + ref) | ~1.5x (no critic) |

GRPO Advantage Estimation

A_i = (r_i - mean(r_1..r_G)) / std(r_1..r_G)

By sampling G completions and normalizing rewards within the group, GRPO creates its own baseline without a value network - halving VRAM vs PPO.
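The group-normalized advantage above, as a plain-Python sketch (function name ours):

```python
def group_advantages(rewards, eps=1e-6):
    """Z-score rewards within one group of G completions:
    A_i = (r_i - mean) / (std + eps). The epsilon guards against
    zero variance when all completions score identically."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```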

Complete Unsloth GRPO Code Example

import re

from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)  # Patch TRL with Unsloth optimizations

from trl import GRPOConfig, GRPOTrainer

# Load model with QLoRA
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
    dtype=None,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=64,                    # Higher rank for reasoning tasks
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=64,           # alpha == r recommended
    lora_dropout=0,          # Unsloth recommends 0
    bias="none",
    use_gradient_checkpointing="unsloth",  # Unsloth's optimized GC
    random_state=3407,
)

# Reward functions (TRL accepts a list, scores are summed)
def correctness_reward(completions, ground_truth, **kwargs):
    rewards = []
    for completion, gt in zip(completions, ground_truth):
        answer_match = re.search(r'</think>\s*(.*?)$', completion, re.DOTALL)
        if answer_match and answer_match.group(1).strip() == gt.strip():
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    return rewards

def format_reward(completions, **kwargs):
    return [0.5 if ("<think>" in c and "</think>" in c) else 0.0 for c in completions]

# GRPO Config
config = GRPOConfig(
    output_dir="./grpo_output",
    num_generations=8,              # Group size G
    max_completion_length=2048,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    beta=0.04,                      # KL penalty coefficient
    max_grad_norm=0.1,
    logging_steps=1,
    save_steps=250,
    bf16=True,
    loss_type='grpo',               # or 'gspo', 'dr_grpo'
)

trainer = GRPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    reward_funcs=[correctness_reward, format_reward],
    processing_class=tokenizer,
)
trainer.train()

# Save LoRA adapter
model.save_pretrained("./grpo_lora_adapter")
# Optional: merge and push
# model.save_pretrained_merged("./grpo_merged", tokenizer)
# model.push_to_hub_merged("username/model-name", tokenizer)

Vision GRPO with Qwen2.5-VL

from unsloth import FastVisionModel, PatchFastRL
PatchFastRL("GRPO", FastVisionModel)

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)

# For VLMs: typically freeze vision encoder, train language layers
model = FastVisionModel.get_peft_model(
    model,
    r=16,                          # Lower rank often sufficient for VLMs
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    finetune_vision_layers=False,  # Keep vision encoder frozen
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
)

Unsloth ART (Agentic Reasoning Training)

ART extends GRPO for multi-turn agentic tasks:

  1. Multi-turn rollouts: Model interacts with environment over multiple turns (actions + observations)
  2. Environment integration: Custom env provides observations and final rewards
  3. Verifiable rewards: Emphasizes automatically verifiable outcomes

Multi-turn pattern:

Turn 1: User prompt -> Model <think> + action -> Environment observation
Turn 2: Observation  -> Model <think> + action -> Environment observation
Turn 3: Observation  -> Model final answer     -> Reward computed

Implementation options for multi-turn:

  1. Single-generation (simpler): Model outputs full plan/sequence in one generation; reward function evaluates the whole sequence
  2. Custom rollout loop (advanced): Alternate model generation and env response, collect full trajectory, compute GRPO gradients on combined trajectory

Key Hyperparameters Reference

| Parameter | Range | Notes |
|---|---|---|
| num_generations (G) | 4-16 | 8 common. More = better advantages, more VRAM |
| beta (KL penalty) | 0.01-0.1 | 0.04 default. Higher = stay closer to reference |
| learning_rate | 1e-6 to 1e-5 | Lower than SFT. 5e-6 starting point |
| max_completion_length | 512-4096 | Task-dependent |
| r (LoRA rank) | 16-128 | 64 for reasoning, 16 for VLM |
| gradient_accumulation_steps | 4-16 | Effective batch = per_device * accum * GPUs |
| max_grad_norm | 0.1-1.0 | 0.1 for stability |
| warmup_ratio | 0.05-0.1 | Important for RL stability |
| epsilon (clip) | 0.2 | PPO-style clipping |
| epsilon_high | 0.28 | Asymmetric upper clip |

Qwen2.5-VL-7B Model Specifics

  • Vision encoder: ViT with 2D-RoPE (handles arbitrary image resolutions via dynamic patching)
  • LLM backbone: 28 layers, 3584 hidden dim, 28 attn heads, GQA with 4 KV heads
  • Context: up to 32K tokens (128K with YaRN)
  • Supports: single image, multi-image, video frames
  • Unsloth IDs: unsloth/Qwen2.5-VL-7B-Instruct, unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit

Qwen3-4B Model Specifics

  • Hybrid thinking: can switch between <think> mode and direct response
  • ~4B parameters, efficient for RL training
  • MoE variants also available
  • Unsloth IDs: unsloth/Qwen3-4B, unsloth/Qwen3-4B-bnb-4bit

11. Unsloth ART / GRPO Trainer Plan

Phase 1: Data Preparation

Training Data Sources:

  1. OrigamiSpace dataset (471 auxiliary instances) - CP diagrams, fold sequences, 3D shapes
  2. GamiBench dataset (777 samples, 186 patterns) - crease patterns with multi-view 3D
  3. Synthetic data generation pipeline (following SpatialThinker approach):
    • Generate origami QA pairs with Claude/GPT
    • Validate with GPT-4o pass@2
    • Balance across difficulty levels

Data Format for GRPO:

# Each training example = a prompt with origami task
{
    "prompt": [
        {"role": "user", "content": [
            {"type": "image", "image": cp_diagram_image},
            {"type": "text", "text": "Given this crease pattern, describe the folding sequence and predict the final 3D shape. Output your answer as a FOLD JSON."}
        ]}
    ]
}

Phase 2: Reward Function Design

Following SpatialThinker's lexicographic gating pattern, adapted for origami:

def origami_reward(prompt, response, ground_truth):
    # Component 1: Format reward (gate)
    r_format = check_valid_fold_json(response)  # 0 or 1

    # Component 2: Constraint satisfaction
    r_constraints = check_origami_constraints(response)
    # - Maekawa's theorem: |M-V| = 2
    # - Kawasaki's theorem: alternating sector angle sums each equal pi
    # - Euler's formula: V - E + F = 2
    # - No self-intersection

    # Component 3: Topological similarity
    r_topology = compute_tss(response, ground_truth)
    # Vertex/edge/face counts, connectivity

    # Component 4: Geometric similarity
    r_geometry = compute_hausdorff_similarity(response, ground_truth)

    # Component 5: Final shape match
    r_shape = compute_folded_state_similarity(response, ground_truth)

    # Lexicographic gating
    if r_format == 0:
        return 0.0

    total = (0.1 * r_format +
             0.25 * r_constraints +
             0.2 * r_topology +
             0.2 * r_geometry +
             0.25 * r_shape)

    return total

Phase 3: Training Infrastructure

Option A: Unsloth (simpler, less VRAM)

from unsloth import FastVisionModel
from trl import GRPOConfig, GRPOTrainer

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2.5-VL-7B-Instruct",
    load_in_4bit=True,
    fast_inference=True,
)

model = FastVisionModel.get_peft_model(model, r=16, lora_alpha=16)

config = GRPOConfig(
    loss_type="grpo",
    num_generations=8,
    max_completion_length=2048,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=3,
    learning_rate=1e-6,
)

trainer = GRPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    reward_funcs=[origami_reward],
    processing_class=tokenizer,
)

trainer.train()

Option B: veRL/EasyR1 (following SpatialThinker, more control)

  • Uses veRL framework with GRPO
  • vLLM backend for fast rollouts
  • More complex but battle-tested for spatial reasoning
  • Better for multi-turn rollouts

Phase 4: Multi-Turn Rollouts

Following OrigamiSpace's environmental learning approach:

  1. Model generates CP code / fold sequence
  2. OpenEnv compiler validates and returns error feedback
  3. Model refines based on error type (CSE/GIF/PSI/AFS)
  4. Repeat up to 10 rounds
  5. Final reward based on best attempt

Environment class pattern:

class OrigamiEnv:
    def __init__(self, task):
        self.task = task
        self.state = task["initial_state"]  # FOLD JSON
        self.steps = 0
        self.max_steps = 10
        self.history = []

    def step(self, action: str):
        """Process model's fold action, return compiler feedback."""
        self.steps += 1
        # Validate through compiler (CSE/GIF/PSI/AFS checks)
        result = self.compile_and_validate(action)
        observation = f"Step {self.steps}: {result['error_type']}: {result['message']}"
        self.state = result.get("new_state", self.state)
        self.history.append((action, observation))
        done = self.steps >= self.max_steps or result.get("valid", False)
        reward = self.compute_reward() if done else 0.0
        return observation, reward, done

    def compute_reward(self):
        """4-dimensional evaluation: TSS + GS + CS + FFS."""
        return (0.25 * tss(self.state, self.task["target"]) +
                0.25 * gs(self.state, self.task["target"]) +
                0.25 * cs(self.state) +
                0.25 * ffs(self.state, self.task["target"]))

def multi_turn_reward(completions, prompts, **kwargs):
    """Wrap environment interaction into GRPO reward function."""
    rewards = []
    for completion, prompt in zip(completions, prompts):
        env = OrigamiEnv(extract_task(prompt))
        actions = parse_actions(completion)
        total_reward = 0.0
        for action in actions:
            obs, reward, done = env.step(action)
            total_reward += reward
            if done:
                break
        rewards.append(total_reward)
    return rewards

Phase 5: Evaluation

  1. GamiBench - standard origami spatial reasoning benchmark
  2. OrigamiSpace tasks - 4-task evaluation suite
  3. Custom metrics:
    • Constraint satisfaction rate (Maekawa/Kawasaki)
    • Compilation success rate
    • Topological/geometric similarity scores
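The custom metrics are simple aggregates over per-sample results; a sketch assuming a hypothetical result schema with `compiled`, `maekawa_ok`, and `kawasaki_ok` booleans per sample:

```python
def eval_metrics(results):
    """Aggregate per-sample eval results (hypothetical schema: each
    dict carries 'compiled', 'maekawa_ok', 'kawasaki_ok' booleans)."""
    n = len(results)
    return {
        "compilation_rate": sum(r["compiled"] for r in results) / n,
        "constraint_rate": sum(
            r["maekawa_ok"] and r["kawasaki_ok"] for r in results) / n,
    }
```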

Phase 6: Monitoring with Trackio

import trackio

trackio.init(
    project="optigami-grpo",
    space_id="openenv-community/optigami-training",
    config={
        "model": "Qwen2.5-VL-7B",
        "lora_r": 16,
        "num_generations": 8,
        "learning_rate": 1e-6,
    }
)

# In training loop
trackio.log({
    "step": step,
    "reward/total": total_reward,
    "reward/format": format_reward,
    "reward/constraints": constraint_reward,
    "reward/topology": topology_reward,
    "reward/geometry": geometry_reward,
    "reward/shape": shape_reward,
    "loss": loss,
    "compilation_rate": compilation_rate,
})

12. GitHub Reference Repo (ianalin123/optigami)

Located at .reference/optigami-github/ (gitignored, not pushed to HF).

What It Contains

A complete research repository with detailed architecture docs and a reference 2048 GRPO implementation.

Key Files

| File | Contents |
|---|---|
| research/plan/architecture.md | Full architecture spec: action space, state, physics engine, reward functions, OpenEnv integration, rendering pipeline, project structure, implementation order |
| research/openenv/2048_example.py | 636-line reference implementation of OpenEnv + GRPO for the 2048 game (Unsloth + TRL) |
| research/openenv/overview.md | OpenEnv framework API, types, project structure, deployment to HF Spaces |
| research/origami/fold_types_deep.md | All fold operations, Huzita-Justin axioms, crane step-by-step, compression patterns |
| research/origami/math_physics_deep.md | Kawasaki/Maekawa theorems with code, bar-and-hinge model, energy formulas |
| research/origami/rendering_research.md | Rendering options comparison |
| research/origami/fold_format.md | FOLD file format details |

Architecture Decisions (from GitHub repo)

| Decision | Choice |
| --- | --- |
| LLM interaction | Code-as-policy (LLM writes a `fold_strategy()` function) |
| Action space | Named fold ops (valley/mountain + fold line + angle) |
| State format | FOLD-compatible JSON |
| Physics engine | Bar-and-hinge model (NumPy port of Ghassaei) |
| Validation | Kawasaki + Maekawa + triangle-triangle intersection |
| Primary task | Solar panel packing (Miura-ori discovery) |
| Training | GRPO via TRL + Unsloth |
| Deployment | Docker Space on HF Spaces |
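
For concreteness, here is what a FOLD-compatible state could look like for an unfolded unit square. The core field names (`vertices_coords`, `edges_vertices`, `edges_assignment`, `faces_vertices`) follow the FOLD spec; the values shown are illustrative.

```python
# Minimal FOLD-compatible state for an unfolded unit square.
paper_state = {
    "vertices_coords": [[0, 0], [1, 0], [1, 1], [0, 1]],
    "edges_vertices": [[0, 1], [1, 2], [2, 3], [3, 0]],
    "edges_assignment": ["B", "B", "B", "B"],  # "B" = boundary edge
    "faces_vertices": [[0, 1, 2, 3]],          # one square face
}

# Each edge needs exactly one assignment.
assert len(paper_state["edges_vertices"]) == len(paper_state["edges_assignment"])
```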

Action Space (Code-as-Policy)

The LLM generates a fold_strategy(paper_state) function returning fold instructions:

```python
def fold_strategy(paper_state: dict) -> list[dict]:
    # paper_state contains: vertices, edges, assignments, fold_angles, material, etc.
    return [
        {"type": "valley", "line": {"start": [0, 0.5], "end": [1, 0.5]}, "angle": 180},
        {"type": "mountain", "line": {"start": [0.5, 0], "end": [0.5, 0.5]}, "angle": 180},
    ]
```
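
Executing one of these instructions as a flat (180°) fold amounts to reflecting the moving vertices across the fold line. A minimal NumPy sketch, ignoring partial fold angles and layer ordering (both of which the engine must eventually track):

```python
import numpy as np

def reflect_across_line(points, start, end):
    """Reflect 2D points across the line through `start` and `end`
    (a 180-degree flat fold; partial angles and layers are ignored)."""
    p = np.asarray(points, dtype=float)
    a = np.asarray(start, dtype=float)
    d = np.asarray(end, dtype=float) - a
    d = d / np.linalg.norm(d)
    rel = p - a
    # Project onto the line direction, then mirror the perpendicular part.
    proj = np.outer(rel @ d, d)
    return a + 2 * proj - rel
```

For example, the valley fold along `y = 0.5` above maps the bottom edge of a unit square onto the top edge.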

Reward Functions (3 from 2048 pattern, adapted for origami)

  1. code_valid: +1.0 valid function, -0.5 exec fails, -2.0 syntax error
  2. physically_valid: +1.0 all valid, -2.0 per Kawasaki/Maekawa violation, -5.0 self-intersection
  3. fold_quality: +20.0 * compactness, +10.0 meets volume target, +5.0 deployable, -0.5 per fold
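
The three reward schedules above can be sketched directly from their scoring rules. The function signatures and input flags here are assumptions for illustration, not the reference repo's actual API:

```python
def code_valid(syntax_ok: bool, exec_ok: bool) -> float:
    """+1.0 valid function, -0.5 execution fails, -2.0 syntax error."""
    if not syntax_ok:
        return -2.0
    return 1.0 if exec_ok else -0.5

def physically_valid(n_violations: int, self_intersects: bool) -> float:
    """+1.0 when clean, -2.0 per Kawasaki/Maekawa violation,
    -5.0 for self-intersection."""
    if n_violations == 0 and not self_intersects:
        return 1.0
    return -2.0 * n_violations + (-5.0 if self_intersects else 0.0)

def fold_quality(compactness: float, meets_volume: bool,
                 deployable: bool, n_folds: int) -> float:
    """Shaped quality reward with a small per-fold cost."""
    r = 20.0 * compactness
    r += 10.0 if meets_volume else 0.0
    r += 5.0 if deployable else 0.0
    return r - 0.5 * n_folds
```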

Physics Engine (Bar-and-Hinge Model)

```
E_total = E_bar + E_facet + E_fold
E_bar   = sum (1/2) * k_axial * (L - L0)^2          # stretching
E_facet = sum (1/2) * k_facet * l * (theta - pi)^2  # panel bending
E_fold  = sum (1/2) * k_fold  * l * (rho - rho_t)^2 # crease folding
```
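
The stretching term `E_bar` is the simplest to vectorize; a NumPy sketch (argument names are assumptions, not the planned `physics.py` API):

```python
import numpy as np

def bar_energy(coords, bars, rest_lengths, k_axial=1.0):
    """E_bar = sum over bars of (1/2) * k_axial * (L - L0)^2.
    `bars` is an (n, 2) array of vertex-index pairs."""
    coords = np.asarray(coords, dtype=float)
    bars = np.asarray(bars)
    # Current length of each bar.
    L = np.linalg.norm(coords[bars[:, 1]] - coords[bars[:, 0]], axis=1)
    return float(np.sum(0.5 * k_axial * (L - np.asarray(rest_lengths)) ** 2))
```

The facet and crease terms follow the same pattern with dihedral angles in place of lengths.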

Planned Project Structure

```
engine/                    # Core simulation (numpy/scipy)
  paper.py                 # Paper data structure, FOLD I/O
  fold_engine.py           # Apply folds (quaternion rotation)
  physics.py               # Bar-and-hinge energy, strain
  validation.py            # Kawasaki, Maekawa, self-intersection
  metrics.py               # Deployment ratio, compactness
  materials.py             # Material definitions

environment/               # OpenEnv server
  models.py                # Action, Observation, State
  origami_environment.py   # Environment (reset/step/state)
  tasks.py                 # Task pool / curriculum
  app.py                   # create_app()
  Dockerfile

client/                    # OpenEnv client + training bridge
  reward_functions.py      # code_valid, physically_valid, fold_quality

training/                  # Colab notebook
  train_origami.ipynb      # GRPO training (Unsloth + TRL)
  prompts.py               # LLM prompt templates
```

Implementation Order (from architecture.md)

  1. Phase 1: Engine - paper.py, fold_engine.py, validation.py, metrics.py
  2. Phase 2: OpenEnv Server - models.py, origami_environment.py, app.py, Dockerfile
  3. Phase 3: Reward + Training - reward_functions.py, prompts.py, train_origami.ipynb
  4. Phase 4: Rendering + Demo - matplotlib headless, React + R3F app

2048 Reference Implementation (Key Patterns)

The 2048_example.py shows the exact Unsloth + OpenEnv + GRPO pattern:

  • PatchFastRL not used (text model, not vision) - for our VLM use FastVisionModel
  • extract_function() parses code from ```python blocks
  • create_locked_down_function() sandboxes execution
  • check_python_modules() prevents non-stdlib imports
  • execute_with_time_limit(5) wraps strategy execution
  • Dataset: 1000x replicated prompt, report_to="trackio"
  • GRPOConfig: temp=1.0, lr=2e-4, max_steps=600, num_generations=2
  • Three reward functions passed as list to GRPOTrainer

13. Current Project State

Repository

  • Location: HuggingFace Space openenv-community/optigami
  • Framework: Create React App (React 19.1.0)
  • Status: Fresh scaffold - default CRA boilerplate
  • Build: npm run build -> build/index.html (HF Spaces static SDK)

File Structure

```
optigami/
  package.json          # React app dependencies
  README.md             # CRA default + HF Space metadata
  public/               # Static assets (favicon, manifest)
  src/
    App.js              # Default CRA component (placeholder)
    App.css
    index.js            # Entry point
    index.css
    logo.svg
    reportWebVitals.js
    setupTests.js
    App.test.js
```

What Needs to Be Built

  1. Python backend - Paper Geometry Engine with Shapely, FOLD import/export, constraint checking
  2. GRPO training scripts - Unsloth or veRL-based, with origami reward functions
  3. Data pipeline - Load/process OrigamiSpace + GamiBench datasets
  4. Three.js frontend - Replace CRA boilerplate with origami visualizer (possibly integrate OrigamiSimulator)
  5. OpenEnv server - API connecting geometry engine to trainer

Key Takeaways for Immediate Work (GRPO Trainer)

  1. Use Unsloth for simplicity - 90% VRAM savings, built-in vLLM, QLoRA support for Qwen2.5-VL-7B
  2. Dense rewards with lexicographic gating - format gate -> constraints -> topology -> geometry -> shape match (SpatialThinker pattern)
  3. OrigamiSpace's 4-error compiler is the gold standard for reward signal generation
  4. Start with 500+ origami examples - GamiBench (777) + OrigamiSpace (471) = 1248 examples
  5. 8 generations per prompt, temperature 1.0, 300+ training steps minimum
  6. Multi-turn: max 10 rounds with compiler feedback (performance saturates after 8-10)
  7. Track with Trackio - deploy dashboard to HF Spaces for real-time monitoring
  8. Evaluate on GamiBench for standardized comparison against other MLLMs
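
Takeaway 2's lexicographic gating can be sketched as a single scalar reward where later terms only count once the format gate passes. The weights here are illustrative assumptions, not SpatialThinker's published values:

```python
def gated_reward(format_ok: bool, constraint_score: float, topology_score: float,
                 geometry_score: float, shape_score: float) -> float:
    """Format gate -> constraints -> topology -> geometry -> shape match.
    All component scores are assumed to lie in [0, 1]."""
    if not format_ok:
        return -1.0  # hard gate: malformed output earns nothing else
    return (0.1                      # small bonus for passing the gate
            + 0.3 * constraint_score
            + 0.2 * topology_score
            + 0.2 * geometry_score
            + 0.2 * shape_score)
```

In a stricter lexicographic variant, each later term would also be zeroed until the previous one exceeds a threshold.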

Cross-Reference: Tool Compatibility Matrix

| Component | FOLD | OrigamiSim | GamiBench | SpatialThinker | Unsloth | Trackio |
| --- | --- | --- | --- | --- | --- | --- |
| State representation | Core | Import | - | - | - | - |
| Visualization | Export | Core | - | - | - | - |
| Training data | - | - | Core | Augment | - | - |
| RL training | - | - | Eval | Template | Core | Monitor |
| Reward functions | Validate | Strain | - | Template | Integrate | Log |
| Constraint checking | Structure | Physics | Impossible set | - | - | - |