
Optigami Research Notes

Comprehensive notes on all sources, tools, and architecture for the Optigami project.


Table of Contents

  1. Project Architecture Overview
  2. Paper: OrigamiSpace (2511.18450)
  3. Paper: SpatialThinker (2511.07403)
  4. Paper: Automating Rigid Origami Design (2211.13219)
  5. Tool: FOLD Format (edemaine/fold)
  6. Tool: Origami Simulator
  7. Tool: GamiBench
  8. Tool: SpatialThinker Codebase
  9. Tool: Trackio
  10. Tool: Unsloth + GRPO Training
  11. Unsloth ART / GRPO Trainer Plan
  12. GitHub Reference Repo (ianalin123/optigami)

1. Project Architecture Overview

+---------------------------------------------------+
|                   OpenEnv Server                   |
|  +-----------+  +----------+  +--------------+    |
|  |   State   |  |  Action  |  |   Reward     |    |
|  | (FOLD JSON|  | (LLM     |  | (Dense,      |    |
|  |  + target)|  |  output) |  |  verifiable) |    |
|  +-----------+  +----------+  +--------------+    |
|         |              |              |            |
|         v              v              v            |
|  +-----------------------------------------------+|
|  |         Paper Geometry Engine (Python)         ||
|  |  - Polygon state (Shapely)                    ||
|  |  - Fold operations (reflection across line)   ||
|  |  - Kawasaki/Maekawa constraint checks         ||
|  |  - Layer tracking                             ||
|  |  - FOLD format import/export                  ||
|  +-----------------------------------------------+|
|         |                                          |
|         v                                          |
|  +-----------------------------------------------+|
|  |         Three.js Visualizer (Demo only)        ||
|  |  - 3D fold animation                          ||
|  |  - Strain heatmap                             ||
|  |  - Instruction stream                         ||
|  +-----------------------------------------------+|
+---------------------------------------------------+
         |                    ^
         v                    |
+---------------------------------------------------+
|              Unsloth ART / GRPO Trainer            |
|  - Qwen2.5-VL-7B or Qwen3-4B base model          |
|  - LoRA/QLoRA for efficient training              |
|  - Multi-turn rollouts                            |
+---------------------------------------------------+

Three major components:

  1. OpenEnv Server - RL environment serving state/action/reward for origami folding
  2. Paper Geometry Engine - Python-based origami math (Shapely polygons, fold reflections, constraint checking)
  3. Unsloth ART / GRPO Trainer - RL fine-tuning of vision-language models for origami reasoning

Current focus: Unsloth ART / GRPO Trainer


2. Paper: OrigamiSpace (2511.18450)

Title: ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints
Authors: Rui Xu, Dakuan Lu, Zicheng Zhao, Xiaoyu Tan, Xintao Wang, Siyu Yuan, Jiangjie Chen, Yinghui Xu
Date: November 23, 2025
Venue: arXiv (cs.AI)

Dataset

  • 350 primary instances + 471 auxiliary (without folding processes)
  • Each instance: CP diagram, compiled flat pattern, folding process (multi-step images), final 3D shape
  • Complexity: Easy (3-9 steps), Medium (10-19), Hard (20-30), avg 8.2 steps
  • 1,620 total questions across 4 tasks

Four Evaluation Tasks

| Task | Questions | Description |
| --- | --- | --- |
| Pattern Prediction | 350 | CP diagram -> predict final 3D shape (multiple choice) |
| Multi-step Spatial Reasoning | 250 | Shuffled fold images -> correct chronological sequence |
| Spatial Relationship Prediction | 900 | 3 subtypes: pose localization, layering analysis, geometric change |
| End-to-End CP Code Generation | 120 | Flat layout + folded shape -> generate CP code |

Compiler Architecture (Critical for OpenEnv)

Four-category error feedback system:

  1. CSE (CP Code Syntax Error): Validates vertices, edges, faces, crease types; checks Euler's formula V-E+F=2
  2. GIF (Geometrically Impossible Fold): Maekawa's theorem |M-V|=2, Kawasaki's theorem (alternating sector angles sum to zero, i.e. alternate angles each sum to pi), Big-Little-Big angle constraint
  3. PSI (Paper Self-Intersection): Cyclic layering, collision detection (discrete + CCD), octrees/BVHs
  4. AFS (Ambiguous Folding State): Multiple valid M/V assignments, non-unique stacking
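At a single interior vertex, the GIF checks reduce to one counting condition (Maekawa) and one angle condition (Kawasaki). A minimal sketch, assuming a flat-foldable vertex check with our own helper name (this is not the paper's compiler code):

```python
import math

def check_flat_foldable_vertex(sector_angles, assignments):
    """Check Maekawa's and Kawasaki's theorems at one interior vertex.

    sector_angles: angles (radians) between consecutive creases, in cyclic order.
    assignments:   crease types at the vertex, e.g. ["M", "V", "M", "M"].
    """
    m = assignments.count("M")
    v = assignments.count("V")
    maekawa_ok = abs(m - v) == 2
    # Kawasaki: the alternating sum of sector angles is zero
    # (equivalently, odd- and even-indexed angles each sum to pi).
    alt = sum(a if i % 2 == 0 else -a for i, a in enumerate(sector_angles))
    kawasaki_ok = math.isclose(alt, 0.0, abs_tol=1e-9)
    return maekawa_ok and kawasaki_ok
```

A classic bird-base-style vertex (four right-angle sectors, three mountains, one valley) passes; flipping one crease to give two mountains and two valleys violates Maekawa and fails.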

CP Code Evaluation (4 dimensions, 0.25 weight each)

  1. Topological Structure Similarity (TSS): Vertex/edge/face count comparison, s_v = e^(-0.5|V_gen - V_ref| / min(V_gen, V_ref))
  2. Geometric Similarity (GS): Hausdorff distance, s_p = e^(-5 * d_H), dihedral angle distribution, aspect ratio
  3. Constraint Satisfaction (CS): Taco-Taco, Taco-Tortilla, transitivity, Maekawa/Kawasaki
  4. Final Folded State (FFS): Shape similarity, layering comparison, stacking order
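Two of the similarity terms quoted above translate directly into code from their formulas; a sketch (function names are ours, and d_H is assumed to be a precomputed, normalized Hausdorff distance):

```python
import math

def vertex_count_similarity(v_gen, v_ref):
    # TSS vertex term: s_v = exp(-0.5 * |V_gen - V_ref| / min(V_gen, V_ref))
    return math.exp(-0.5 * abs(v_gen - v_ref) / min(v_gen, v_ref))

def point_set_similarity(d_hausdorff):
    # GS term: s_p = exp(-5 * d_H)
    return math.exp(-5.0 * d_hausdorff)
```

Both terms equal 1.0 on an exact match and decay smoothly with deviation, which gives the reward a useful gradient even for near-miss generations.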

Learning Approaches

  • In-Context Learning: Single-pass, detailed instructions + examples
  • Environmental Learning: Iterative model<->compiler loop, max 10 rounds, performance saturates after 8-10
  • Reinforcement Learning (TRICO/PPO-based):
    • Training data: 471 instances from environmental learning
    • Model: Qwen2.5-VL-32B
    • Rewards: Intermediate (success bonus + quality progress), step penalty, final evaluation score
    • Result: RL-trained 32B exceeded 72B baseline

Key Results

  • Best closed-source: GPT-4o (42.71% pattern), Gemini2.5-pro (53.45% multi-step)
  • Best open-source: Qwen2.5-VL-72B (36.29% pattern, 39.10% multi-step)
  • Expert human: 98.45% pattern, 100% multi-step
  • Constraint satisfaction is the primary bottleneck (~30% for top models)
  • Human-model gap: 20-45 percentage points

Relevance to Optigami

  • Direct blueprint for our OpenEnv server: the compiler architecture with 4 error types is exactly what we need
  • The CP code evaluation framework (TSS/GS/CS/FFS) can be our reward function
  • Environmental learning approach maps to multi-turn rollouts in GRPO
  • Confirms Qwen2.5-VL as viable base model (they used 32B, we target 7B)

3. Paper: SpatialThinker (2511.07403)

Title: SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards
Authors: Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark
Date: November 10, 2025
Venue: NeurIPS 2025 Workshops (SpaVLE, EWM, ARLET, SEA)

Core Innovation

Dense spatial rewards + GRPO for training Qwen2.5-VL on spatial reasoning tasks. Key insight: sparse rewards lead to reward hacking; dense multi-objective rewards with lexicographic gating prevent this.

GRPO Training Configuration

  • Rollouts: 8 samples per query, temperature 1.0
  • Batch size: rollout=512, global=128
  • Training: 75 steps (~5 episodes)
  • Hardware: 4x NVIDIA H100 80GB
  • Time: ~13h (3B), ~15h (7B)
  • Advantage: A(i) = (r(i) - mu) / (sigma + epsilon), epsilon=1e-6
  • Loss: PPO-style with clip(epsilon_l=0.2, epsilon_h=0.3), KL penalty beta=0.01

Dense Spatial Reward Design (CRITICAL - template for our rewards)

4-component reward with lexicographic gating:

R_total = I[R_format=1] * (w_format*R_f + w_count*R_c + w_accuracy*R_a + I[R_accuracy=1]*w_spatial*R_s)
| Component | Weight | Description |
| --- | --- | --- |
| Format (R_f) | 0.1 | JSON-parseable scene graph with required fields |
| Count (R_c) | 0.2 | Penalizes deviation in object/relation counts (lambda_obj=0.7, lambda_rel=0.3) |
| Accuracy (R_a) | 0.5 | Binary exact string match |
| Spatial (R_s) | 0.2 | Hungarian matching with CIoU, activated ONLY when answer correct |

Lexicographic gating is essential: format compliance gates all rewards; spatial rewards only activate on correct answers. Without gating, severe reward hacking occurs (74.9% -> 23.7% with naive spatial rewards).
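A minimal sketch of the gated total, with the component weights and indicator structure taken from the R_total formula above (helper name is ours):

```python
def gated_reward(r_format, r_count, r_accuracy, r_spatial):
    """Lexicographically gated total reward.

    Format compliance gates everything; the spatial term only
    activates when the answer is exactly correct, which is what
    blocks reward hacking on the spatial component.
    """
    if r_format != 1:
        return 0.0
    total = 0.1 * r_format + 0.2 * r_count + 0.5 * r_accuracy
    if r_accuracy == 1:
        total += 0.2 * r_spatial
    return total
```

With all components at 1 the reward is 1.0; with bad formatting it is 0.0 regardless of the other terms; with a wrong answer the spatial term contributes nothing.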

STVQA-7K Dataset

  • 7,587 spatial VQA pairs from Visual Genome scene graphs
  • Generated by Claude Sonnet, validated by GPT-4o pass@2
  • 9 spatial categories, 34 additional spatial predicates beyond standard VG150
  • 90/10 train/val split

Key Results

  • SpatialThinker-7B surpasses GPT-4o on 3DSRBench by +12.1%
  • Dense reward RL: +7.2% avg across 12 benchmarks (1.8x the +4.0% from sparse GRPO)
  • Outperforms models trained on millions of samples (trained on only 7K)

Relevance to Optigami

  • Direct template for our GRPO training pipeline
  • Dense reward design with lexicographic gating prevents reward hacking
  • Proves Qwen2.5-VL-7B is excellent base for spatial reasoning RL
  • veRL/EasyR1 framework for training infrastructure
  • Shows 7K samples sufficient for strong results

4. Paper: Automating Rigid Origami Design (2211.13219)

Title: Automating Rigid Origami Design
Authors: Jeremia Geiger, Karolis Martinkus, Oliver Richter, Roger Wattenhofer
Date: November 2022 (revised April 2023)
Venue: IJCAI 2023 AI, Arts & Creativity Special Track

Core Contribution

  • Formulates rigid origami design as discrete optimization: the "rigid origami game"
  • Based on "three units method" principle
  • Framework supports diverse objectives via abstract reward functions
  • Generates optimized, application-specific crease patterns

Methodology

  • Multiple search methods within optimization framework
  • Flexible objective definition for application-specific requirements
  • Can approximate target shapes and produce functional designs

Relevance to Optigami

  • Validates the "origami as game/environment" paradigm we're building
  • Their reward formulation approach (function-based, abstract) aligns with our OpenEnv design
  • Discrete optimization over crease patterns = the action space for our RL agent

5. Tool: FOLD Format

Repo: https://github.com/edemaine/fold
Authors: Erik Demaine (MIT), Jason Ku (MIT), Robert Lang
License: MIT

What It Is

FOLD (Flexible Origami List Datastructure) - JSON-based file format (.fold) for representing origami models. The standard interchange format for computational origami.

Data Structure

{
  "vertices_coords": [[x,y], ...],      // 2D or 3D coordinates
  "edges_vertices": [[v1,v2], ...],      // Edge endpoints
  "edges_assignment": ["M","V",...],     // Mountain/Valley/Boundary/Flat/Unassigned
  "faces_vertices": [[v1,v2,v3], ...],   // Face vertex lists
  "faceOrders": [[f1,f2,order], ...],    // Stacking/layering order
  "frame_*": ...                         // Multiple frames (folding states)
}
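Because FOLD is plain JSON, a Python-side loader needs nothing beyond the standard library. A minimal sketch with basic structural sanity checks (the loader function is ours; the field names are real FOLD keys):

```python
import json

def load_fold(text):
    """Parse a FOLD string and run basic structural sanity checks."""
    fold = json.loads(text)
    nv = len(fold["vertices_coords"])
    for v1, v2 in fold["edges_vertices"]:
        assert 0 <= v1 < nv and 0 <= v2 < nv, "edge references missing vertex"
    # edges_assignment, if present, must be parallel to edges_vertices
    assert len(fold.get("edges_assignment", [])) in (0, len(fold["edges_vertices"]))
    return fold

# Unit square with boundary ("B") edges only
square = """{
  "vertices_coords": [[0,0],[1,0],[1,1],[0,1]],
  "edges_vertices": [[0,1],[1,2],[2,3],[3,0]],
  "edges_assignment": ["B","B","B","B"]
}"""
fold = load_fold(square)
```

This is the shape of the Python/JS interop noted below: the geometry engine emits this JSON, and the simulator/visualizer consumes it unchanged.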

JavaScript API

// Browser
<script src="https://edemaine.github.io/fold/dist/fold.js"></script>

// Node.js
npm install --save fold

// Usage: FOLD.moduleName.functionName
FOLD.filter.collapseNearbyVertices(foldObject)

CLI Tools

  • fold-convert: ORIPA .opx -> .fold conversion
  • fold-convert --flat-fold: Compute flat-folded state

Supported Software Ecosystem

OrigamiSimulator, Freeform Origami (Tachi), Rabbit Ear (Kraft), ORIPA, Crease Pattern Editor, Rhino Grasshopper

Relevance to Optigami

  • Core data format for OpenEnv state representation
  • JSON = easy Python/JS interop
  • Stacking order (faceOrders) = layer tracking
  • edges_assignment = mountain/valley fold type
  • Import/export between geometry engine and visualizer

6. Tool: Origami Simulator

Repo: https://github.com/amandaghassaei/OrigamiSimulator
URL: origamisimulator.org
Author: Amanda Ghassaei
License: MIT
Stack: JavaScript (68.4%), Three.js, GPU fragment shaders

Capabilities

  • Real-time GPU-accelerated folding simulation
  • Folds ALL creases simultaneously (not sequential)
  • Realistic bending simulation between creases
  • Strain visualization (internal stress during folding)
  • Fold Percent slider: 0% (flat) to 100% (fully folded) to -100% (inverted)

File Formats

  • Input: SVG, FOLD
  • Export: FOLD, STL, OBJ

Physics Engine

  • Stiffness-based finite element approach: Triangulated faces are rigid panels connected by rotational hinges along fold lines
  • Each fold edge has a target angle (+/-pi for mountain/valley), driven by angular spring forces
  • Solver computes nodal displacements at each timestep to reach equilibrium
  • Fold stiffness: Controls how strongly hinges drive toward target angle
  • Face stiffness: Controls rigidity of triangulated faces (resistance to bending/deformation)
  • Damping: Controls oscillation decay rate
  • Strain metric: Per-triangle deviation of edge lengths from rest lengths (flat state)
  • Self-intersection is NOT prevented (folds through itself if geometry demands it)
  • Based on Schenk & Guest structural engineering approach
  • Tomohiro Tachi's freeform origami variations
  • Ruling-aware triangulation for curved creases
  • GPU fragment shaders for parallel computation

Programmatic Usage

  • Core simulation can be driven headlessly (without UI) by importing solver module
  • Feed FOLD JSON data -> step simulation programmatically
  • FOLD is JSON, so easy to generate crease patterns from Python and pass to simulator
  • Can embed in other web pages as a component

Dependencies

  • Three.js (3D rendering)
  • FOLD API (internal data structure)
  • Earcut + cdt2d (polygon triangulation)
  • numeric.js (linear algebra)
  • CCapture (GIF/WebM export)

Relevance to Optigami

  • Direct integration for Three.js Visualizer component
  • Strain heatmap capability already built in
  • FOLD format native support
  • Can be used for visual verification of generated fold patterns
  • Export to STL/OBJ for 3D shape comparison in rewards

7. Tool: GamiBench

Repo: https://github.com/stvngo/GamiBench
Dataset: https://huggingface.co/datasets/stvngo/GamiBench
Paper: arXiv 2512.22207
License: MIT

Benchmark Design

  • 186 valid + 186 impossible crease patterns
  • 6 viewpoints per pattern (top, bottom, front, back, right, left)
  • 777 total samples in HuggingFace dataset (45.4 MB)
  • 186 label classes (named origami patterns)

Task Types

  1. Standard tasks (2D CP -> 3D prediction)
  2. Alternative-view tasks
  3. Impossible tasks (validity checking)

Dataset Schema

{
  "image": PIL.Image,     # Origami pattern/fold image
  "label": int,           # 0-185 class label
  "split": str            # Split identifier
}

Loading

from datasets import load_dataset
dataset = load_dataset("stvngo/GamiBench")

Model Support

  • OpenAI (GPT-4, GPT-4o-mini)
  • Anthropic (Claude 4.5 Sonnet)
  • Google (Gemini)
  • xAI (Grok)
  • OpenRouter models

Code Structure

models/          # Model wrappers & factory
evaluators/      # BaseEvaluator: evaluate(), evaluate_single()
benchmarks/      # Benchmark implementations
configs/         # YAML/JSON configuration
utils/           # Shared helpers
pipeline.py      # Orchestration
run.py           # Entry point

Relevance to Optigami

  • Evaluation benchmark for our trained model
  • 186 origami patterns = potential training/eval data
  • Impossible patterns useful for constraint satisfaction testing
  • Multi-view evaluation tests true 3D understanding
  • Config-driven, reproducible evaluation pipeline

8. Tool: SpatialThinker Codebase

Repo: https://github.com/hunarbatra/SpatialThinker
Paper: arXiv 2511.07403

Architecture

  • Built on Qwen2.5-VL (3B and 7B variants)
  • Uses veRL/EasyR1 for RL training
  • vLLM 0.8.0 for inference during rollouts

Code Structure

scripts/         # Training bash scripts per model size
evaluation/      # 18+ benchmark evaluation suite
data_gen/        # Data synthesis pipeline
verl/            # RL training framework (GRPO)

Data Generation Pipeline

  1. Generate raw QA pairs (12K-56K options)
  2. Balance/filter with 50% spatial relations focus
  3. Validate via GPT-4o (~75% pass rate)
  4. Upload to HuggingFace

Requirements

  • Python 3.9+
  • Transformers >= 4.49.0
  • Flash-Attn >= 2.4.3
  • vLLM >= 0.7.3

Relevance to Optigami

  • Reference implementation for our GRPO training setup
  • veRL/EasyR1 framework = our training infrastructure
  • Dense reward design directly applicable
  • Data generation pipeline can be adapted for origami QA pairs

9. Tool: Trackio

Repo: https://github.com/gradio-app/trackio
Author: Hugging Face / Gradio team
License: MIT

What It Is

Lightweight, local-first experiment tracking (Weights & Biases alternative). API-compatible with wandb.

Key Features

  • import trackio as wandb - drop-in W&B replacement
  • Non-blocking log() with background queue (0.5s drain interval)
  • SQLite local storage at ~/.cache/huggingface/trackio
  • Optional HuggingFace Spaces deployment for dashboards
  • Slack/Discord webhook alerts (INFO/WARN/ERROR)
  • 2,000 logs/8s single run; 32,000 logs/14s with 32 threads

Usage

import trackio

trackio.init(project="optigami-grpo", config={"lr": 1e-6, "model": "Qwen2.5-VL-7B"})
trackio.log({"step": step, "reward": reward, "loss": loss})
trackio.alert(title="Training spike", text="...", level=trackio.AlertLevel.WARN)
trackio.finish()

# Dashboard
trackio.show(project="optigami-grpo")
trackio.sync(project="optigami-grpo", space_id="openenv-community/optigami-training")

Relevance to Optigami

  • Training metrics dashboard for GRPO training runs
  • Can deploy live dashboard to HF Spaces
  • Track reward components, loss, constraint satisfaction rates
  • Alert on training anomalies (reward hacking, loss spikes)

10. Tool: Unsloth + GRPO Training

Repo: https://github.com/unslothai/unsloth
Docs: https://unsloth.ai/docs

GRPO Algorithm in Unsloth

  1. Generate N responses per prompt (8+ recommended)
  2. Score each with custom reward functions
  3. Z-score normalize rewards across group -> advantages
  4. PPO-style policy update (no value model or reward model needed)

Memory Efficiency

  • 90% less VRAM vs standard GRPO
  • 20K context, 8 generations, Llama 8B: 54.3GB (vs 510.8GB standard)
  • QLoRA 4-bit: VRAM needed is roughly the model's parameter count expressed in GB
  • Shared GPU memory with vLLM inference engine

Vision Model Support

  • Qwen2.5-VL-7B directly supported
  • Qwen3-VL-8B, Gemma 3 (4B) also available
  • FastVisionModel.get_peft_model() with granular layer control:
    • finetune_vision_layers, finetune_language_layers
    • finetune_attention_modules, finetune_mlp_modules

LoRA Configuration

model = FastVisionModel.get_peft_model(
    model,
    r=16,                          # LoRA rank
    lora_alpha=16,                 # alpha == r recommended
    lora_dropout=0,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
)

GRPOConfig Options

GRPOConfig(
    loss_type='grpo',        # or 'gspo', 'dr_grpo'
    epsilon=0.2,
    epsilon_high=0.28,
    delta=1.5,
    # ... standard training args
)

vLLM Integration

  • Shared memory between Unsloth and vLLM saves 3-5GB
  • A100 40GB: ~4000 tokens/sec, T4 16GB: ~300 tokens/sec
  • fast_inference=True enables vLLM backend

Training Requirements

  • Minimum 300 steps before meaningful progress
  • 500+ data rows recommended (works with 10+)
  • Models >= 1.5B parameters for reasoning tokens
  • Steps = rows x epochs; increase generations (8->16) for more data

Vision Data Format

[
    {"role": "user", "content": [
        {"type": "text", "text": "instruction"},
        {"type": "image", "image": pil_image}
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "response"}
    ]}
]

GRPO vs PPO vs DPO Comparison

| Aspect | PPO | DPO | GRPO |
| --- | --- | --- | --- |
| Critic/Value model | Required (same size as policy) | Not needed | Not needed |
| Reference model | Required | Required | Required (old policy) |
| Training data | Online rollouts | Offline preference pairs | Online rollouts + group scoring |
| Reward signal | Scalar per token/step | Implicit from preferences | Verifiable/explicit |
| VRAM overhead | ~2x (policy + critic) | ~2x (policy + ref) | ~1.5x (no critic) |

GRPO Advantage Estimation

A_i = (r_i - mean(r_1..r_G)) / std(r_1..r_G)

By sampling G completions and normalizing rewards within the group, GRPO creates its own baseline without a value network - halving VRAM vs PPO.
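The group-normalization step is a few lines, matching the A_i formula above (epsilon as in SpatialThinker's configuration):

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """GRPO advantage: z-score each reward within its rollout group of G samples."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Note that if every completion in a group gets the same reward, all advantages are ~0 and the group contributes no gradient — one reason dense, graded rewards train faster than sparse binary ones.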

Complete Unsloth GRPO Code Example

import re

from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)  # Patch TRL with Unsloth optimizations

from trl import GRPOConfig, GRPOTrainer

# Load model with QLoRA
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
    dtype=None,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=64,                    # Higher rank for reasoning tasks
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=64,           # alpha == r recommended
    lora_dropout=0,          # Unsloth recommends 0
    bias="none",
    use_gradient_checkpointing="unsloth",  # Unsloth's optimized GC
    random_state=3407,
)

# Reward functions (TRL accepts a list, scores are summed)
def correctness_reward(completions, ground_truth, **kwargs):
    rewards = []
    for completion, gt in zip(completions, ground_truth):
        answer_match = re.search(r'</think>\s*(.*?)$', completion, re.DOTALL)
        if answer_match and answer_match.group(1).strip() == gt.strip():
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    return rewards

def format_reward(completions, **kwargs):
    return [0.5 if ("<think>" in c and "</think>" in c) else 0.0 for c in completions]

# GRPO Config
config = GRPOConfig(
    output_dir="./grpo_output",
    num_generations=8,              # Group size G
    max_completion_length=2048,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    beta=0.04,                      # KL penalty coefficient
    max_grad_norm=0.1,
    logging_steps=1,
    save_steps=250,
    bf16=True,
    loss_type='grpo',               # or 'gspo', 'dr_grpo'
)

trainer = GRPOTrainer(
    model=model,
    args=config,                    # GRPOTrainer takes its config via `args`
    train_dataset=dataset,
    reward_funcs=[correctness_reward, format_reward],
    processing_class=tokenizer,
)
trainer.train()

# Save LoRA adapter
model.save_pretrained("./grpo_lora_adapter")
# Optional: merge and push
# model.save_pretrained_merged("./grpo_merged", tokenizer)
# model.push_to_hub_merged("username/model-name", tokenizer)

Vision GRPO with Qwen2.5-VL

from unsloth import FastVisionModel, PatchFastRL
PatchFastRL("GRPO", FastVisionModel)

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)

# For VLMs: typically freeze vision encoder, train language layers
model = FastVisionModel.get_peft_model(
    model,
    r=16,                          # Lower rank often sufficient for VLMs
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    finetune_vision_layers=False,  # Keep vision encoder frozen
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
)

Unsloth ART (Agentic Reasoning Training)

ART extends GRPO for multi-turn agentic tasks:

  1. Multi-turn rollouts: Model interacts with environment over multiple turns (actions + observations)
  2. Environment integration: Custom env provides observations and final rewards
  3. Verifiable rewards: Emphasizes automatically verifiable outcomes

Multi-turn pattern:

Turn 1: User prompt -> Model <think> + action -> Environment observation
Turn 2: Observation  -> Model <think> + action -> Environment observation
Turn 3: Observation  -> Model final answer     -> Reward computed

Implementation options for multi-turn:

  1. Single-generation (simpler): Model outputs full plan/sequence in one generation; reward function evaluates the whole sequence
  2. Custom rollout loop (advanced): Alternate model generation and env response, collect full trajectory, compute GRPO gradients on combined trajectory

Key Hyperparameters Reference

| Parameter | Range | Notes |
| --- | --- | --- |
| num_generations (G) | 4-16 | 8 common. More = better advantages, more VRAM |
| beta (KL penalty) | 0.01-0.1 | 0.04 default. Higher = stay closer to reference |
| learning_rate | 1e-6 to 1e-5 | Lower than SFT. 5e-6 starting point |
| max_completion_length | 512-4096 | Task-dependent |
| r (LoRA rank) | 16-128 | 64 for reasoning, 16 for VLM |
| gradient_accumulation_steps | 4-16 | Effective batch = per_device * accum * GPUs |
| max_grad_norm | 0.1-1.0 | 0.1 for stability |
| warmup_ratio | 0.05-0.1 | Important for RL stability |
| epsilon (clip) | 0.2 | PPO-style clipping |
| epsilon_high | 0.28 | Asymmetric upper clip |

Qwen2.5-VL-7B Model Specifics

  • Vision encoder: ViT with 2D-RoPE (handles arbitrary image resolutions via dynamic patching)
  • LLM backbone: 28 layers, 3584 hidden dim, 28 attn heads, GQA with 4 KV heads
  • Context: up to 32K tokens (128K with YaRN)
  • Supports: single image, multi-image, video frames
  • Unsloth IDs: unsloth/Qwen2.5-VL-7B-Instruct, unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit

Qwen3-4B Model Specifics

  • Hybrid thinking: can switch between <think> mode and direct response
  • ~4B parameters, efficient for RL training
  • MoE variants also available
  • Unsloth IDs: unsloth/Qwen3-4B, unsloth/Qwen3-4B-bnb-4bit

11. Unsloth ART / GRPO Trainer Plan

Phase 1: Data Preparation

Training Data Sources:

  1. OrigamiSpace dataset (471 auxiliary instances) - CP diagrams, fold sequences, 3D shapes
  2. GamiBench dataset (777 samples, 186 patterns) - crease patterns with multi-view 3D
  3. Synthetic data generation pipeline (following SpatialThinker approach):
    • Generate origami QA pairs with Claude/GPT
    • Validate with GPT-4o pass@2
    • Balance across difficulty levels

Data Format for GRPO:

# Each training example = a prompt with origami task
{
    "prompt": [
        {"role": "user", "content": [
            {"type": "image", "image": cp_diagram_image},
            {"type": "text", "text": "Given this crease pattern, describe the folding sequence and predict the final 3D shape. Output your answer as a FOLD JSON."}
        ]}
    ]
}

Phase 2: Reward Function Design

Following SpatialThinker's lexicographic gating pattern, adapted for origami:

def origami_reward(prompt, response, ground_truth):
    # Component 1: Format reward (gate)
    r_format = check_valid_fold_json(response)  # 0 or 1

    # Component 2: Constraint satisfaction
    r_constraints = check_origami_constraints(response)
    # - Maekawa's theorem: |M-V| = 2
    # - Kawasaki's theorem: alternating sector angles sum to 0 (alternate angles sum to pi)
    # - Euler's formula: V - E + F = 2
    # - No self-intersection

    # Component 3: Topological similarity
    r_topology = compute_tss(response, ground_truth)
    # Vertex/edge/face counts, connectivity

    # Component 4: Geometric similarity
    r_geometry = compute_hausdorff_similarity(response, ground_truth)

    # Component 5: Final shape match
    r_shape = compute_folded_state_similarity(response, ground_truth)

    # Lexicographic gating
    if r_format == 0:
        return 0.0

    total = (0.1 * r_format +
             0.25 * r_constraints +
             0.2 * r_topology +
             0.2 * r_geometry +
             0.25 * r_shape)

    return total

Phase 3: Training Infrastructure

Option A: Unsloth (simpler, less VRAM)

from unsloth import FastVisionModel
from trl import GRPOConfig, GRPOTrainer

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2.5-VL-7B-Instruct",
    load_in_4bit=True,
    fast_inference=True,
)

model = FastVisionModel.get_peft_model(model, r=16, lora_alpha=16)

config = GRPOConfig(
    loss_type="grpo",
    num_generations=8,
    max_completion_length=2048,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=3,
    learning_rate=1e-6,
)

trainer = GRPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    reward_funcs=[origami_reward],
    processing_class=tokenizer,
)

trainer.train()

Option B: veRL/EasyR1 (following SpatialThinker, more control)

  • Uses veRL framework with GRPO
  • vLLM backend for fast rollouts
  • More complex but battle-tested for spatial reasoning
  • Better for multi-turn rollouts

Phase 4: Multi-Turn Rollouts

Following OrigamiSpace's environmental learning approach:

  1. Model generates CP code / fold sequence
  2. OpenEnv compiler validates and returns error feedback
  3. Model refines based on error type (CSE/GIF/PSI/AFS)
  4. Repeat up to 10 rounds
  5. Final reward based on best attempt

Environment class pattern:

class OrigamiEnv:
    def __init__(self, task):
        self.task = task
        self.state = task["initial_state"]  # FOLD JSON
        self.steps = 0
        self.max_steps = 10
        self.history = []

    def step(self, action: str):
        """Process model's fold action, return compiler feedback."""
        self.steps += 1
        # Validate through compiler (CSE/GIF/PSI/AFS checks)
        result = self.compile_and_validate(action)
        observation = f"Step {self.steps}: {result['error_type']}: {result['message']}"
        self.state = result.get("new_state", self.state)
        self.history.append((action, observation))
        done = self.steps >= self.max_steps or result.get("valid", False)
        reward = self.compute_reward() if done else 0.0
        return observation, reward, done

    def compute_reward(self):
        """4-dimensional evaluation: TSS + GS + CS + FFS."""
        return (0.25 * tss(self.state, self.task["target"]) +
                0.25 * gs(self.state, self.task["target"]) +
                0.25 * cs(self.state) +
                0.25 * ffs(self.state, self.task["target"]))

def multi_turn_reward(completions, prompts, **kwargs):
    """Wrap environment interaction into GRPO reward function."""
    rewards = []
    for completion, prompt in zip(completions, prompts):
        env = OrigamiEnv(extract_task(prompt))
        actions = parse_actions(completion)
        total_reward = 0.0
        for action in actions:
            obs, reward, done = env.step(action)
            total_reward += reward
            if done:
                break
        rewards.append(total_reward)
    return rewards

Phase 5: Evaluation

  1. GamiBench - standard origami spatial reasoning benchmark
  2. OrigamiSpace tasks - 4-task evaluation suite
  3. Custom metrics:
    • Constraint satisfaction rate (Maekawa/Kawasaki)
    • Compilation success rate
    • Topological/geometric similarity scores

Phase 6: Monitoring with Trackio

import trackio

trackio.init(
    project="optigami-grpo",
    space_id="openenv-community/optigami-training",
    config={
        "model": "Qwen2.5-VL-7B",
        "lora_r": 16,
        "num_generations": 8,
        "learning_rate": 1e-6,
    }
)

# In training loop
trackio.log({
    "step": step,
    "reward/total": total_reward,
    "reward/format": format_reward,
    "reward/constraints": constraint_reward,
    "reward/topology": topology_reward,
    "reward/geometry": geometry_reward,
    "reward/shape": shape_reward,
    "loss": loss,
    "compilation_rate": compilation_rate,
})

12. GitHub Reference Repo (ianalin123/optigami)

Located at .reference/optigami-github/ (gitignored, not pushed to HF).

What It Contains

A complete research repository with detailed architecture docs and a reference 2048 GRPO implementation.

Key Files

| File | Contents |
| --- | --- |
| research/plan/architecture.md | Full architecture spec: action space, state, physics engine, reward functions, OpenEnv integration, rendering pipeline, project structure, implementation order |
| research/openenv/2048_example.py | 636-line reference implementation of OpenEnv + GRPO for 2048 game (Unsloth + TRL) |
| research/openenv/overview.md | OpenEnv framework API, types, project structure, deployment to HF Spaces |
| research/origami/fold_types_deep.md | All fold operations, Huzita-Justin axioms, crane step-by-step, compression patterns |
| research/origami/math_physics_deep.md | Kawasaki/Maekawa theorems with code, bar-and-hinge model, energy formulas |
| research/origami/rendering_research.md | Rendering options comparison |
| research/origami/fold_format.md | FOLD file format details |

Architecture Decisions (from GitHub repo)

| Decision | Choice |
| --- | --- |
| LLM interaction | Code-as-policy (LLM writes fold_strategy() function) |
| Action space | Named fold ops (valley/mountain + fold line + angle) |
| State format | FOLD-compatible JSON |
| Physics engine | Bar-and-hinge model (NumPy port of Ghassaei) |
| Validation | Kawasaki + Maekawa + triangle-triangle intersection |
| Primary task | Solar panel packing (Miura-ori discovery) |
| Training | GRPO via TRL + Unsloth |
| Deployment | Docker Space on HF Spaces |

Action Space (Code-as-Policy)

The LLM generates a fold_strategy(paper_state) function returning fold instructions:

```python
def fold_strategy(paper_state: dict) -> list[dict]:
    # paper_state contains: vertices, edges, assignments, fold_angles, material, etc.
    return [
        {"type": "valley", "line": {"start": [0, 0.5], "end": [1, 0.5]}, "angle": 180},
        {"type": "mountain", "line": {"start": [0.5, 0], "end": [0.5, 0.5]}, "angle": 180},
    ]
```
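Applying one of these instructions to a flat sheet amounts to reflecting every vertex on one side of the fold line across it. A hypothetical 2D sketch of that core operation (not the engine's actual implementation):

```python
def reflect_point(p, a, b):
    """Reflect point p across the line through points a and b (2D)."""
    ax, ay = a
    dx, dy = b[0] - ax, b[1] - ay
    px, py = p[0] - ax, p[1] - ay
    # Scalar projection of p onto the line direction gives the foot of the perpendicular
    t = (px * dx + py * dy) / (dx * dx + dy * dy)
    foot = (ax + t * dx, ay + t * dy)
    return (2 * foot[0] - p[0], 2 * foot[1] - p[1])

# A 180-degree valley fold along y = 0.5 maps (0, 0) onto (0, 1)
print(reflect_point((0.0, 0.0), (0.0, 0.5), (1.0, 0.5)))  # (0.0, 1.0)
```

For partial fold angles (< 180) the same idea generalizes to a 3D rotation about the fold line, which is where the quaternion rotation in fold_engine.py comes in.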

Reward Functions (3 from 2048 pattern, adapted for origami)

  1. code_valid: +1.0 valid function, -0.5 exec fails, -2.0 syntax error
  2. physically_valid: +1.0 all valid, -2.0 per Kawasaki/Maekawa violation, -5.0 self-intersection
  3. fold_quality: +20.0 * compactness, +10.0 meets volume target, +5.0 deployable, -0.5 per fold
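The first of these is cheap to implement with compile/exec alone. A minimal sketch using the point values above (the scores come from the list; the helper itself is illustrative):

```python
def code_valid_reward(source: str) -> float:
    """+1.0 if the code compiles and defines a callable fold_strategy,
    -0.5 if it compiles but execution or lookup fails, -2.0 on syntax error."""
    try:
        compiled = compile(source, "<strategy>", "exec")
    except SyntaxError:
        return -2.0
    namespace = {}
    try:
        exec(compiled, namespace)
        if not callable(namespace.get("fold_strategy")):
            return -0.5
    except Exception:
        return -0.5
    return 1.0

print(code_valid_reward("def fold_strategy(state): return []"))  # 1.0
print(code_valid_reward("def fold_strategy(state) return []"))   # -2.0
```

In the real pipeline this exec would run inside the sandbox described in the 2048 reference section, not a bare namespace.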

Physics Engine (Bar-and-Hinge Model)

```
E_total = E_bar + E_facet + E_fold
E_bar   = sum (1/2) * k_axial * (L - L0)^2          # stretching
E_facet = sum (1/2) * k_facet * l * (theta - pi)^2  # panel bending
E_fold  = sum (1/2) * k_fold  * l * (rho - rho_t)^2 # crease folding
```
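The bar term, for example, is just a sum of squared rest-length deviations over edges. A sketch of E_bar (stiffness and coordinates are illustrative placeholders):

```python
import math

def bar_energy(vertices, bars, rest_lengths, k_axial=1.0):
    """E_bar = sum over bars of (1/2) * k_axial * (L - L0)^2."""
    total = 0.0
    for (i, j), L0 in zip(bars, rest_lengths):
        L = math.dist(vertices[i], vertices[j])  # current bar length
        total += 0.5 * k_axial * (L - L0) ** 2
    return total

# One bar stretched from rest length 1.0 to 1.5 stores 0.5 * 0.25 = 0.125
verts = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
print(bar_energy(verts, [(0, 1)], [1.0]))  # 0.125
```

E_facet and E_fold follow the same shape, with dihedral angles in place of lengths.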

Planned Project Structure

```
engine/                    # Core simulation (numpy/scipy)
  paper.py                 # Paper data structure, FOLD I/O
  fold_engine.py           # Apply folds (quaternion rotation)
  physics.py               # Bar-and-hinge energy, strain
  validation.py            # Kawasaki, Maekawa, self-intersection
  metrics.py               # Deployment ratio, compactness
  materials.py             # Material definitions

environment/               # OpenEnv server
  models.py                # Action, Observation, State
  origami_environment.py   # Environment (reset/step/state)
  tasks.py                 # Task pool / curriculum
  app.py                   # create_app()
  Dockerfile

client/                    # OpenEnv client + training bridge
  reward_functions.py      # code_valid, physically_valid, fold_quality

training/                  # Colab notebook
  train_origami.ipynb      # GRPO training (Unsloth + TRL)
  prompts.py               # LLM prompt templates
```

Implementation Order (from architecture.md)

  1. Phase 1: Engine - paper.py, fold_engine.py, validation.py, metrics.py
  2. Phase 2: OpenEnv Server - models.py, origami_environment.py, app.py, Dockerfile
  3. Phase 3: Reward + Training - reward_functions.py, prompts.py, train_origami.ipynb
  4. Phase 4: Rendering + Demo - matplotlib headless, React + R3F app

2048 Reference Implementation (Key Patterns)

The 2048_example.py shows the exact Unsloth + OpenEnv + GRPO pattern:

  • PatchFastRL not used (text model, not vision) - for our VLM use FastVisionModel
  • extract_function() parses code from ```python blocks
  • create_locked_down_function() sandboxes execution
  • check_python_modules() prevents non-stdlib imports
  • execute_with_time_limit(5) wraps strategy execution
  • Dataset: 1000x replicated prompt, report_to="trackio"
  • GRPOConfig: temp=1.0, lr=2e-4, max_steps=600, num_generations=2
  • Three reward functions passed as list to GRPOTrainer

13. Current Project State

Repository

  • Location: HuggingFace Space openenv-community/optigami
  • Framework: Create React App (React 19.1.0)
  • Status: Fresh scaffold - default CRA boilerplate
  • Build: npm run build -> build/index.html (HF Spaces static SDK)

File Structure

```
optigami/
  package.json          # React app dependencies
  README.md             # CRA default + HF Space metadata
  public/               # Static assets (favicon, manifest)
  src/
    App.js              # Default CRA component (placeholder)
    App.css
    index.js            # Entry point
    index.css
    logo.svg
    reportWebVitals.js
    setupTests.js
    App.test.js
```

What Needs to Be Built

  1. Python backend - Paper Geometry Engine with Shapely, FOLD import/export, constraint checking
  2. GRPO training scripts - Unsloth or veRL-based, with origami reward functions
  3. Data pipeline - Load/process OrigamiSpace + GamiBench datasets
  4. Three.js frontend - Replace CRA boilerplate with origami visualizer (possibly integrate OrigamiSimulator)
  5. OpenEnv server - API connecting geometry engine to trainer

Key Takeaways for Immediate Work (GRPO Trainer)

  1. Use Unsloth for simplicity - 90% VRAM savings, built-in vLLM, QLoRA support for Qwen2.5-VL-7B
  2. Dense rewards with lexicographic gating - format gate -> constraints -> topology -> geometry -> shape match (SpatialThinker pattern)
  3. OrigamiSpace's 4-error compiler is the gold standard for reward signal generation
  4. Start with 500+ origami examples - GamiBench (777) + OrigamiSpace (471) = 1248 examples
  5. 8 generations per prompt, temperature 1.0, 300+ training steps minimum
  6. Multi-turn: max 10 rounds with compiler feedback (performance saturates after 8-10)
  7. Track with Trackio - deploy dashboard to HF Spaces for real-time monitoring
  8. Evaluate on GamiBench for standardized comparison against other MLLMs
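Takeaway 2's lexicographic gating can be sketched as follows: each later (denser) term only contributes once earlier gates pass. The weights and thresholds here are illustrative, loosely following the SpatialThinker pattern rather than its exact values:

```python
def gated_reward(format_ok, constraint_score, topology_score,
                 geometry_score, shape_score):
    """Lexicographic gating: a hard format gate, then constraint score,
    then geometry terms that only count on near-valid folds."""
    if not format_ok:
        return -1.0  # malformed output earns nothing downstream
    reward = 0.1  # small bonus for a parseable answer
    reward += 0.3 * constraint_score
    if constraint_score > 0.5:  # only score shape once folds are near-valid
        reward += 0.2 * topology_score
        reward += 0.2 * geometry_score
        reward += 0.2 * shape_score
    return reward

print(gated_reward(False, 1.0, 1.0, 1.0, 1.0))  # -1.0
print(gated_reward(True, 1.0, 1.0, 1.0, 1.0))   # 1.0
```

The gating keeps early training focused on producing compilable, constraint-respecting folds before the model is rewarded for matching the target shape.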

Cross-Reference: Tool Compatibility Matrix

| Component | FOLD | OrigamiSim | GamiBench | SpatialThinker | Unsloth | Trackio |
|---|---|---|---|---|---|---|
| State representation | Core | Import | - | - | - | - |
| Visualization | Export | Core | - | - | - | - |
| Training data | - | - | Core | Augment | - | - |
| RL training | - | - | Eval | Template | Core | Monitor |
| Reward functions | Validate | Strain | - | Template | Integrate | Log |
| Constraint checking | Structure | Physics | Impossible set | - | - | - |