Spaces:
Running
Running
| # Optigami Research Notes | |
| Comprehensive notes on all sources, tools, and architecture for the Optigami project. | |
| --- | |
| ## Table of Contents | |
| 1. [Project Architecture Overview](#1-project-architecture-overview) | |
| 2. [Paper: OrigamiSpace (2511.18450)](#2-paper-origamispace-251118450) | |
| 3. [Paper: SpatialThinker (2511.07403)](#3-paper-spatialthinker-251107403) | |
| 4. [Paper: Automating Rigid Origami Design (2211.13219)](#4-paper-automating-rigid-origami-design-221113219) | |
| 5. [Tool: FOLD Format (edemaine/fold)](#5-tool-fold-format) | |
| 6. [Tool: Origami Simulator](#6-tool-origami-simulator) | |
| 7. [Tool: GamiBench](#7-tool-gamibench) | |
| 8. [Tool: SpatialThinker Codebase](#8-tool-spatialthinker-codebase) | |
| 9. [Tool: Trackio](#9-tool-trackio) | |
| 10. [Tool: Unsloth + GRPO Training](#10-tool-unsloth--grpo-training) | |
| 11. [Unsloth ART / GRPO Trainer Plan](#11-unsloth-art--grpo-trainer-plan) | |
| 12. [Current Project State](#12-current-project-state) | |
| --- | |
| ## 1. Project Architecture Overview | |
| ``` | |
| +---------------------------------------------------+ | |
| | OpenEnv Server | | |
| | +-----------+ +----------+ +--------------+ | | |
| | | State | | Action | | Reward | | | |
| | | (FOLD JSON| | (LLM | | (Dense, | | | |
| | | + target)| | output) | | verifiable) | | | |
| | +-----------+ +----------+ +--------------+ | | |
| | | | | | | |
| | v v v | | |
| | +-----------------------------------------------+| | |
| | | Paper Geometry Engine (Python) || | |
| | | - Polygon state (Shapely) || | |
| | | - Fold operations (reflection across line) || | |
| | | - Kawasaki/Maekawa constraint checks || | |
| | | - Layer tracking || | |
| | | - FOLD format import/export || | |
| | +-----------------------------------------------+| | |
| | | | | |
| | v | | |
| | +-----------------------------------------------+| | |
| | | Three.js Visualizer (Demo only) || | |
| | | - 3D fold animation || | |
| | | - Strain heatmap || | |
| | | - Instruction stream || | |
| | +-----------------------------------------------+| | |
| +---------------------------------------------------+ | |
| | ^ | |
| v | | |
| +---------------------------------------------------+ | |
| | Unsloth ART / GRPO Trainer | | |
| | - Qwen2.5-VL-7B or Qwen3-4B base model | | |
| | - LoRA/QLoRA for efficient training | | |
| | - Multi-turn rollouts | | |
| +---------------------------------------------------+ | |
| ``` | |
| **Three major components:** | |
| 1. **OpenEnv Server** - RL environment serving state/action/reward for origami folding | |
| 2. **Paper Geometry Engine** - Python-based origami math (Shapely polygons, fold reflections, constraint checking) | |
| 3. **Unsloth ART / GRPO Trainer** - RL fine-tuning of vision-language models for origami reasoning | |
| **Current focus:** Unsloth ART / GRPO Trainer | |
| --- | |
| ## 2. Paper: OrigamiSpace (2511.18450) | |
| **Title:** ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints | |
| **Authors:** Rui Xu, Dakuan Lu, Zicheng Zhao, Xiaoyu Tan, Xintao Wang, Siyu Yuan, Jiangjie Chen, Yinghui Xu | |
| **Date:** November 23, 2025 | |
| **Venue:** arXiv (cs.AI) | |
| ### Dataset | |
| - **350 primary instances** + 471 auxiliary (without folding processes) | |
| - Each instance: CP diagram, compiled flat pattern, folding process (multi-step images), final 3D shape | |
| - Complexity: Easy (3-9 steps), Medium (10-19), Hard (20-30), avg 8.2 steps | |
| - **1,620 total questions** across 4 tasks | |
| ### Four Evaluation Tasks | |
| | Task | Questions | Description | | |
| |------|-----------|-------------| | |
| | Pattern Prediction | 350 | CP diagram -> predict final 3D shape (multiple choice) | | |
| | Multi-step Spatial Reasoning | 250 | Shuffled fold images -> correct chronological sequence | | |
| | Spatial Relationship Prediction | 900 | 3 subtypes: pose localization, layering analysis, geometric change | | |
| | End-to-End CP Code Generation | 120 | Flat layout + folded shape -> generate CP code | | |
| ### Compiler Architecture (Critical for OpenEnv) | |
| Four-category error feedback system: | |
| 1. **CSE (CP Code Syntax Error):** Validates vertices, edges, faces, crease types; checks Euler's formula V-E+F=2 | |
| 2. **GIF (Geometrically Impossible Fold):** Maekawa's theorem |M-V|=2, Kawasaki's theorem sum(alpha_i)=2pi, Big-Little-Big angle constraint | |
| 3. **PSI (Paper Self-Intersection):** Cyclic layering, collision detection (discrete + CCD), octrees/BVHs | |
| 4. **AFS (Ambiguous Folding State):** Multiple valid M/V assignments, non-unique stacking | |
| ### CP Code Evaluation (4 dimensions, 0.25 weight each) | |
| 1. **Topological Structure Similarity (TSS):** Vertex/edge/face count comparison, s_v = e^(-0.5|V_gen - V_ref| / min(V_gen, V_ref)) | |
| 2. **Geometric Similarity (GS):** Hausdorff distance, s_p = e^(-5 * d_H), dihedral angle distribution, aspect ratio | |
| 3. **Constraint Satisfaction (CS):** Taco-Taco, Taco-Tortilla, transitivity, Maekawa/Kawasaki | |
| 4. **Final Folded State (FFS):** Shape similarity, layering comparison, stacking order | |
| ### Learning Approaches | |
| - **In-Context Learning:** Single-pass, detailed instructions + examples | |
| - **Environmental Learning:** Iterative model<->compiler loop, max 10 rounds, performance saturates after 8-10 | |
| - **Reinforcement Learning (TRICO/PPO-based):** | |
| - Training data: 471 instances from environmental learning | |
| - Model: Qwen2.5-VL-32B | |
| - **Rewards:** Intermediate (success bonus + quality progress), step penalty, final evaluation score | |
| - Result: RL-trained 32B exceeded 72B baseline | |
| ### Key Results | |
| - Best closed-source: GPT-4o (42.71% pattern), Gemini2.5-pro (53.45% multi-step) | |
| - Best open-source: Qwen2.5-VL-72B (36.29% pattern, 39.10% multi-step) | |
| - Expert human: 98.45% pattern, 100% multi-step | |
| - **Constraint satisfaction is the primary bottleneck** (~30% for top models) | |
| - Human-model gap: 20-45 percentage points | |
| ### Relevance to Optigami | |
| - **Direct blueprint for our OpenEnv server**: the compiler architecture with 4 error types is exactly what we need | |
| - The CP code evaluation framework (TSS/GS/CS/FFS) can be our reward function | |
| - Environmental learning approach maps to multi-turn rollouts in GRPO | |
| - Confirms Qwen2.5-VL as viable base model (they used 32B, we target 7B) | |
| --- | |
| ## 3. Paper: SpatialThinker (2511.07403) | |
| **Title:** SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards | |
| **Authors:** Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark | |
| **Date:** November 10, 2025 | |
| **Venue:** NeurIPS 2025 Workshops (SpaVLE, EWM, ARLET, SEA) | |
| ### Core Innovation | |
| Dense spatial rewards + GRPO for training Qwen2.5-VL on spatial reasoning tasks. Key insight: **sparse rewards lead to reward hacking; dense multi-objective rewards with lexicographic gating prevent this.** | |
| ### GRPO Training Configuration | |
| - **Rollouts:** 8 samples per query, temperature 1.0 | |
| - **Batch size:** rollout=512, global=128 | |
| - **Training:** 75 steps (~5 episodes) | |
| - **Hardware:** 4x NVIDIA H100 80GB | |
| - **Time:** ~13h (3B), ~15h (7B) | |
| - **Advantage:** A(i) = (r(i) - mu) / (sigma + epsilon), epsilon=1e-6 | |
| - **Loss:** PPO-style with clip(epsilon_l=0.2, epsilon_h=0.3), KL penalty beta=0.01 | |
| ### Dense Spatial Reward Design (CRITICAL - template for our rewards) | |
| **4-component reward with lexicographic gating:** | |
| ``` | |
| R_total = I[R_format=1] * (w_format*R_f + w_count*R_c + w_accuracy*R_a + I[R_accuracy=1]*w_spatial*R_s) | |
| ``` | |
| | Component | Weight | Description | | |
| |-----------|--------|-------------| | |
| | Format (R_f) | 0.1 | JSON-parseable scene graph with required fields | | |
| | Count (R_c) | 0.2 | Penalizes deviation in object/relation counts (lambda_obj=0.7, lambda_rel=0.3) | | |
| | Accuracy (R_a) | 0.5 | Binary exact string match | | |
| | Spatial (R_s) | 0.2 | Hungarian matching with CIoU, activated ONLY when answer correct | | |
| **Lexicographic gating is essential:** format compliance gates all rewards; spatial rewards only activate on correct answers. Without gating, severe reward hacking occurs (74.9% -> 23.7% with naive spatial rewards). | |
| ### STVQA-7K Dataset | |
| - 7,587 spatial VQA pairs from Visual Genome scene graphs | |
| - Generated by Claude Sonnet, validated by GPT-4o pass@2 | |
| - 9 spatial categories, 34 additional spatial predicates beyond standard VG150 | |
| - 90/10 train/val split | |
| ### Key Results | |
| - SpatialThinker-7B surpasses GPT-4o on 3DSRBench by +12.1% | |
| - Dense reward RL: +7.2% avg across 12 benchmarks (1.8x the +4.0% from sparse GRPO) | |
| - Outperforms models trained on millions of samples (trained on only 7K) | |
| ### Relevance to Optigami | |
| - **Direct template for our GRPO training pipeline** | |
| - Dense reward design with lexicographic gating prevents reward hacking | |
| - Proves Qwen2.5-VL-7B is excellent base for spatial reasoning RL | |
| - veRL/EasyR1 framework for training infrastructure | |
| - Shows 7K samples sufficient for strong results | |
| --- | |
| ## 4. Paper: Automating Rigid Origami Design (2211.13219) | |
| **Title:** Automating Rigid Origami Design | |
| **Authors:** Jeremia Geiger, Karolis Martinkus, Oliver Richter, Roger Wattenhofer | |
| **Date:** November 2022 (revised April 2023) | |
| **Venue:** IJCAI 2023 AI, Arts & Creativity Special Track | |
| ### Core Contribution | |
| - Formulates rigid origami design as discrete optimization: the **"rigid origami game"** | |
| - Based on "three units method" principle | |
| - Framework supports diverse objectives via abstract reward functions | |
| - Generates optimized, application-specific crease patterns | |
| ### Methodology | |
| - Multiple search methods within optimization framework | |
| - Flexible objective definition for application-specific requirements | |
| - Can approximate target shapes and produce functional designs | |
| ### Relevance to Optigami | |
| - Validates the "origami as game/environment" paradigm we're building | |
| - Their reward formulation approach (function-based, abstract) aligns with our OpenEnv design | |
| - Discrete optimization over crease patterns = the action space for our RL agent | |
| --- | |
| ## 5. Tool: FOLD Format | |
| **Repo:** https://github.com/edemaine/fold | |
| **Authors:** Erik Demaine (MIT), Jason Ku (MIT), Robert Lang | |
| **License:** MIT | |
| ### What It Is | |
| FOLD (Flexible Origami List Datastructure) - JSON-based file format (.fold) for representing origami models. The **standard interchange format** for computational origami. | |
| ### Data Structure | |
| ```json | |
| { | |
| "vertices_coords": [[x,y], ...], // 2D or 3D coordinates | |
| "edges_vertices": [[v1,v2], ...], // Edge endpoints | |
| "edges_assignment": ["M","V",...], // Mountain/Valley/Boundary/Flat/Unassigned | |
| "faces_vertices": [[v1,v2,v3], ...], // Face vertex lists | |
| "faceOrders": [[f1,f2,order], ...], // Stacking/layering order | |
| "frame_*": ... // Multiple frames (folding states) | |
| } | |
| ``` | |
| ### JavaScript API | |
| ```javascript | |
| // Browser | |
| <script src="https://edemaine.github.io/fold/dist/fold.js"></script> | |
| // Node.js | |
| npm install --save fold | |
| // Usage: FOLD.moduleName.functionName | |
| FOLD.filter.collapseNearbyVertices(foldObject) | |
| ``` | |
| ### CLI Tools | |
| - `fold-convert`: ORIPA .opx -> .fold conversion | |
| - `fold-convert --flat-fold`: Compute flat-folded state | |
| ### Supported Software Ecosystem | |
| OrigamiSimulator, Freeform Origami (Tachi), Rabbit Ear (Kraft), ORIPA, Crease Pattern Editor, Rhino Grasshopper | |
| ### Relevance to Optigami | |
| - **Core data format for OpenEnv state representation** | |
| - JSON = easy Python/JS interop | |
| - Stacking order (faceOrders) = layer tracking | |
| - edges_assignment = mountain/valley fold type | |
| - Import/export between geometry engine and visualizer | |
| --- | |
| ## 6. Tool: Origami Simulator | |
| **Repo:** https://github.com/amandaghassaei/OrigamiSimulator | |
| **URL:** origamisimulator.org | |
| **Author:** Amanda Ghassaei | |
| **License:** MIT | |
| **Stack:** JavaScript (68.4%), Three.js, GPU fragment shaders | |
| ### Capabilities | |
| - Real-time GPU-accelerated folding simulation | |
| - Folds ALL creases simultaneously (not sequential) | |
| - Realistic bending simulation between creases | |
| - Strain visualization (internal stress during folding) | |
| - Fold Percent slider: 0% (flat) to 100% (fully folded) to -100% (inverted) | |
| ### File Formats | |
| - **Input:** SVG, FOLD | |
| - **Export:** FOLD, STL, OBJ | |
| ### Physics Engine | |
| - **Stiffness-based finite element approach:** Triangulated faces are rigid panels connected by rotational hinges along fold lines | |
| - Each fold edge has a **target angle** (+/-pi for mountain/valley), driven by angular spring forces | |
| - Solver computes nodal displacements at each timestep to reach equilibrium | |
| - **Fold stiffness:** Controls how strongly hinges drive toward target angle | |
| - **Face stiffness:** Controls rigidity of triangulated faces (resistance to bending/deformation) | |
| - **Damping:** Controls oscillation decay rate | |
| - **Strain metric:** Per-triangle deviation of edge lengths from rest lengths (flat state) | |
| - Self-intersection is NOT prevented (folds through itself if geometry demands it) | |
| - Based on Schenk & Guest structural engineering approach | |
| - Tomohiro Tachi's freeform origami variations | |
| - Ruling-aware triangulation for curved creases | |
| - GPU fragment shaders for parallel computation | |
| ### Programmatic Usage | |
| - Core simulation can be driven **headlessly** (without UI) by importing solver module | |
| - Feed FOLD JSON data -> step simulation programmatically | |
| - FOLD is JSON, so easy to generate crease patterns from Python and pass to simulator | |
| - Can embed in other web pages as a component | |
| ### Dependencies | |
| - Three.js (3D rendering) | |
| - FOLD API (internal data structure) | |
| - Earcut + cdt2d (polygon triangulation) | |
| - numeric.js (linear algebra) | |
| - CCapture (GIF/WebM export) | |
| ### Relevance to Optigami | |
| - **Direct integration for Three.js Visualizer component** | |
| - Strain heatmap capability already built in | |
| - FOLD format native support | |
| - Can be used for visual verification of generated fold patterns | |
| - Export to STL/OBJ for 3D shape comparison in rewards | |
| --- | |
| ## 7. Tool: GamiBench | |
| **Repo:** https://github.com/stvngo/GamiBench | |
| **Dataset:** https://huggingface.co/datasets/stvngo/GamiBench | |
| **Paper:** arXiv 2512.22207 | |
| **License:** MIT | |
| ### Benchmark Design | |
| - 186 valid + 186 impossible crease patterns | |
| - 6 viewpoints per pattern (top, bottom, front, back, right, left) | |
| - **777 total samples** in HuggingFace dataset (45.4 MB) | |
| - 186 label classes (named origami patterns) | |
| ### Task Types | |
| 1. Standard tasks (2D CP -> 3D prediction) | |
| 2. Alternative-view tasks | |
| 3. Impossible tasks (validity checking) | |
| ### Dataset Schema | |
| ```python | |
| { | |
| "image": PIL.Image, # Origami pattern/fold image | |
| "label": int, # 0-185 class label | |
| "split": str # Split identifier | |
| } | |
| ``` | |
| ### Loading | |
| ```python | |
| from datasets import load_dataset | |
| dataset = load_dataset("stvngo/GamiBench") | |
| ``` | |
| ### Model Support | |
| - OpenAI (GPT-4, GPT-4o-mini) | |
| - Anthropic (Claude 4.5 Sonnet) | |
| - Google (Gemini) | |
| - xAI (Grok) | |
| - OpenRouter models | |
| ### Code Structure | |
| ``` | |
| models/ # Model wrappers & factory | |
| evaluators/ # BaseEvaluator: evaluate(), evaluate_single() | |
| benchmarks/ # Benchmark implementations | |
| configs/ # YAML/JSON configuration | |
| utils/ # Shared helpers | |
| pipeline.py # Orchestration | |
| run.py # Entry point | |
| ``` | |
| ### Relevance to Optigami | |
| - **Evaluation benchmark for our trained model** | |
| - 186 origami patterns = potential training/eval data | |
| - Impossible patterns useful for constraint satisfaction testing | |
| - Multi-view evaluation tests true 3D understanding | |
| - Config-driven, reproducible evaluation pipeline | |
| --- | |
| ## 8. Tool: SpatialThinker Codebase | |
| **Repo:** https://github.com/hunarbatra/SpatialThinker | |
| **Paper:** arXiv 2511.07403 | |
| ### Architecture | |
| - Built on Qwen2.5-VL (3B and 7B variants) | |
| - Uses veRL/EasyR1 for RL training | |
| - vLLM 0.8.0 for inference during rollouts | |
| ### Code Structure | |
| ``` | |
| scripts/ # Training bash scripts per model size | |
| evaluation/ # 18+ benchmark evaluation suite | |
| data_gen/ # Data synthesis pipeline | |
| verl/ # RL training framework (GRPO) | |
| ``` | |
| ### Data Generation Pipeline | |
| 1. Generate raw QA pairs (12K-56K options) | |
| 2. Balance/filter with 50% spatial relations focus | |
| 3. Validate via GPT-4o (~75% pass rate) | |
| 4. Upload to HuggingFace | |
| ### Requirements | |
| - Python 3.9+ | |
| - Transformers >= 4.49.0 | |
| - Flash-Attn >= 2.4.3 | |
| - vLLM >= 0.7.3 | |
| ### Relevance to Optigami | |
| - **Reference implementation for our GRPO training setup** | |
| - veRL/EasyR1 framework = our training infrastructure | |
| - Dense reward design directly applicable | |
| - Data generation pipeline can be adapted for origami QA pairs | |
| --- | |
| ## 9. Tool: Trackio | |
| **Repo:** https://github.com/gradio-app/trackio | |
| **Author:** Hugging Face / Gradio team | |
| **License:** MIT | |
| ### What It Is | |
| Lightweight, local-first experiment tracking (Weights & Biases alternative). API-compatible with wandb. | |
| ### Key Features | |
| - `import trackio as wandb` - drop-in W&B replacement | |
| - Non-blocking `log()` with background queue (0.5s drain interval) | |
| - SQLite local storage at `~/.cache/huggingface/trackio` | |
| - Optional HuggingFace Spaces deployment for dashboards | |
| - Slack/Discord webhook alerts (INFO/WARN/ERROR) | |
| - 2,000 logs/8s single run; 32,000 logs/14s with 32 threads | |
| ### Usage | |
| ```python | |
| import trackio | |
| trackio.init(project="optigami-grpo", config={"lr": 1e-6, "model": "Qwen2.5-VL-7B"}) | |
| trackio.log({"step": step, "reward": reward, "loss": loss}) | |
| trackio.alert(title="Training spike", text="...", level=trackio.AlertLevel.WARN) | |
| trackio.finish() | |
| # Dashboard | |
| trackio.show(project="optigami-grpo") | |
| trackio.sync(project="optigami-grpo", space_id="openenv-community/optigami-training") | |
| ``` | |
| ### Relevance to Optigami | |
| - **Training metrics dashboard for GRPO training runs** | |
| - Can deploy live dashboard to HF Spaces | |
| - Track reward components, loss, constraint satisfaction rates | |
| - Alert on training anomalies (reward hacking, loss spikes) | |
| --- | |
| ## 10. Tool: Unsloth + GRPO Training | |
| **Repo:** https://github.com/unslothai/unsloth | |
| **Docs:** https://unsloth.ai/docs | |
| ### GRPO Algorithm in Unsloth | |
| 1. Generate N responses per prompt (8+ recommended) | |
| 2. Score each with custom reward functions | |
| 3. Z-score normalize rewards across group -> advantages | |
| 4. PPO-style policy update (no value model or reward model needed) | |
| ### Memory Efficiency | |
| - **90% less VRAM** vs standard GRPO | |
| - 20K context, 8 generations, Llama 8B: 54.3GB (vs 510.8GB standard) | |
| - QLoRA 4-bit: model params (GB) ~ VRAM needed | |
| - Shared GPU memory with vLLM inference engine | |
| ### Vision Model Support | |
| - Qwen2.5-VL-7B directly supported | |
| - Qwen3-VL-8B, Gemma 3 (4B) also available | |
| - `FastVisionModel.get_peft_model()` with granular layer control: | |
| - `finetune_vision_layers`, `finetune_language_layers` | |
| - `finetune_attention_modules`, `finetune_mlp_modules` | |
| ### LoRA Configuration | |
| ```python | |
| model = FastVisionModel.get_peft_model( | |
| model, | |
| r=16, # LoRA rank | |
| lora_alpha=16, # alpha == r recommended | |
| lora_dropout=0, | |
| finetune_vision_layers=True, | |
| finetune_language_layers=True, | |
| finetune_attention_modules=True, | |
| finetune_mlp_modules=True, | |
| ) | |
| ``` | |
| ### GRPOConfig Options | |
| ```python | |
| GRPOConfig( | |
| loss_type='grpo', # or 'gspo', 'dr_grpo' | |
| epsilon=0.2, | |
| epsilon_high=0.28, | |
| delta=1.5, | |
| # ... standard training args | |
| ) | |
| ``` | |
| ### vLLM Integration | |
| - Shared memory between Unsloth and vLLM saves 3-5GB | |
| - A100 40GB: ~4000 tokens/sec, T4 16GB: ~300 tokens/sec | |
| - `fast_inference=True` enables vLLM backend | |
| ### Training Requirements | |
| - Minimum 300 steps before meaningful progress | |
| - 500+ data rows recommended (works with 10+) | |
| - Models >= 1.5B parameters for reasoning tokens | |
| - Steps = rows x epochs; increase generations (8->16) for more data | |
| ### Vision Data Format | |
| ```python | |
| [ | |
| {"role": "user", "content": [ | |
| {"type": "text", "text": "instruction"}, | |
| {"type": "image", "image": pil_image} | |
| ]}, | |
| {"role": "assistant", "content": [ | |
| {"type": "text", "text": "response"} | |
| ]} | |
| ] | |
| ``` | |
| ### GRPO vs PPO vs DPO Comparison | |
| | Aspect | PPO | DPO | GRPO | | |
| |--------|-----|-----|------| | |
| | Critic/Value model | Required (same size as policy) | Not needed | **Not needed** | | |
| | Reference model | Required | Required | Required (old policy) | | |
| | Training data | Online rollouts | Offline preference pairs | **Online rollouts + group scoring** | | |
| | Reward signal | Scalar per token/step | Implicit from preferences | **Verifiable/explicit** | | |
| | VRAM overhead | ~2x (policy + critic) | ~2x (policy + ref) | **~1.5x (no critic)** | | |
| ### GRPO Advantage Estimation | |
| ``` | |
| A_i = (r_i - mean(r_1..r_G)) / std(r_1..r_G) | |
| ``` | |
| By sampling G completions and normalizing rewards within the group, GRPO creates its own baseline without a value network - halving VRAM vs PPO. | |
| ### Complete Unsloth GRPO Code Example | |
| ```python | |
| from unsloth import FastLanguageModel, PatchFastRL | |
| PatchFastRL("GRPO", FastLanguageModel) # Patch TRL with Unsloth optimizations | |
| from trl import GRPOConfig, GRPOTrainer | |
| # Load model with QLoRA | |
| model, tokenizer = FastLanguageModel.from_pretrained( | |
| model_name="unsloth/Qwen2.5-7B-Instruct-bnb-4bit", | |
| max_seq_length=4096, | |
| load_in_4bit=True, | |
| dtype=None, | |
| ) | |
| # Add LoRA adapters | |
| model = FastLanguageModel.get_peft_model( | |
| model, | |
| r=64, # Higher rank for reasoning tasks | |
| target_modules=[ | |
| "q_proj", "k_proj", "v_proj", "o_proj", | |
| "gate_proj", "up_proj", "down_proj", | |
| ], | |
| lora_alpha=64, # alpha == r recommended | |
| lora_dropout=0, # Unsloth recommends 0 | |
| bias="none", | |
| use_gradient_checkpointing="unsloth", # Unsloth's optimized GC | |
| random_state=3407, | |
| ) | |
| # Reward functions (TRL accepts a list, scores are summed) | |
| def correctness_reward(completions, ground_truth, **kwargs): | |
| rewards = [] | |
| for completion, gt in zip(completions, ground_truth): | |
| answer_match = re.search(r'</think>\s*(.*?)$', completion, re.DOTALL) | |
| if answer_match and answer_match.group(1).strip() == gt.strip(): | |
| rewards.append(1.0) | |
| else: | |
| rewards.append(0.0) | |
| return rewards | |
| def format_reward(completions, **kwargs): | |
| return [0.5 if ("<think>" in c and "</think>" in c) else 0.0 for c in completions] | |
| # GRPO Config | |
| config = GRPOConfig( | |
| output_dir="./grpo_output", | |
| num_generations=8, # Group size G | |
| max_completion_length=2048, | |
| per_device_train_batch_size=1, | |
| gradient_accumulation_steps=4, | |
| num_train_epochs=1, | |
| learning_rate=5e-6, | |
| lr_scheduler_type="cosine", | |
| warmup_ratio=0.1, | |
| beta=0.04, # KL penalty coefficient | |
| max_grad_norm=0.1, | |
| logging_steps=1, | |
| save_steps=250, | |
| bf16=True, | |
| loss_type='grpo', # or 'gspo', 'dr_grpo' | |
| ) | |
| trainer = GRPOTrainer( | |
| model=model, | |
| config=config, | |
| train_dataset=dataset, | |
| reward_funcs=[correctness_reward, format_reward], | |
| tokenizer=tokenizer, | |
| ) | |
| trainer.train() | |
| # Save LoRA adapter | |
| model.save_pretrained("./grpo_lora_adapter") | |
| # Optional: merge and push | |
| # model.save_pretrained_merged("./grpo_merged", tokenizer) | |
| # model.push_to_hub_merged("username/model-name", tokenizer) | |
| ``` | |
| ### Vision GRPO with Qwen2.5-VL | |
| ```python | |
| from unsloth import FastVisionModel, PatchFastRL | |
| PatchFastRL("GRPO", FastVisionModel) | |
| model, tokenizer = FastVisionModel.from_pretrained( | |
| "unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit", | |
| max_seq_length=4096, | |
| load_in_4bit=True, | |
| ) | |
| # For VLMs: typically freeze vision encoder, train language layers | |
| model = FastVisionModel.get_peft_model( | |
| model, | |
| r=16, # Lower rank often sufficient for VLMs | |
| lora_alpha=16, | |
| lora_dropout=0, | |
| bias="none", | |
| use_gradient_checkpointing="unsloth", | |
| finetune_vision_layers=False, # Keep vision encoder frozen | |
| finetune_language_layers=True, | |
| finetune_attention_modules=True, | |
| finetune_mlp_modules=True, | |
| ) | |
| ``` | |
| ### Unsloth ART (Agentic Reasoning Training) | |
| ART extends GRPO for multi-turn agentic tasks: | |
| 1. **Multi-turn rollouts:** Model interacts with environment over multiple turns (actions + observations) | |
| 2. **Environment integration:** Custom env provides observations and final rewards | |
| 3. **Verifiable rewards:** Emphasizes automatically verifiable outcomes | |
| **Multi-turn pattern:** | |
| ``` | |
| Turn 1: User prompt -> Model <think> + action -> Environment observation | |
| Turn 2: Observation -> Model <think> + action -> Environment observation | |
| Turn 3: Observation -> Model final answer -> Reward computed | |
| ``` | |
| **Implementation options for multi-turn:** | |
| 1. **Single-generation (simpler):** Model outputs full plan/sequence in one generation; reward function evaluates the whole sequence | |
| 2. **Custom rollout loop (advanced):** Alternate model generation and env response, collect full trajectory, compute GRPO gradients on combined trajectory | |
| ### Key Hyperparameters Reference | |
| | Parameter | Range | Notes | | |
| |-----------|-------|-------| | |
| | `num_generations` (G) | 4-16 | 8 common. More = better advantages, more VRAM | | |
| | `beta` (KL penalty) | 0.01-0.1 | 0.04 default. Higher = stay closer to reference | | |
| | `learning_rate` | 1e-6 to 1e-5 | Lower than SFT. 5e-6 starting point | | |
| | `max_completion_length` | 512-4096 | Task-dependent | | |
| | `r` (LoRA rank) | 16-128 | 64 for reasoning, 16 for VLM | | |
| | `gradient_accumulation_steps` | 4-16 | Effective batch = per_device * accum * GPUs | | |
| | `max_grad_norm` | 0.1-1.0 | 0.1 for stability | | |
| | `warmup_ratio` | 0.05-0.1 | Important for RL stability | | |
| | `epsilon` (clip) | 0.2 | PPO-style clipping | | |
| | `epsilon_high` | 0.28 | Asymmetric upper clip | | |
| ### Qwen2.5-VL-7B Model Specifics | |
| - Vision encoder: ViT with 2D-RoPE (handles arbitrary image resolutions via dynamic patching) | |
| - LLM backbone: 28 layers, 3584 hidden dim, 28 attn heads, GQA with 4 KV heads | |
| - Context: up to 32K tokens (128K with YaRN) | |
| - Supports: single image, multi-image, video frames | |
| - Unsloth IDs: `unsloth/Qwen2.5-VL-7B-Instruct`, `unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit` | |
| ### Qwen3-4B Model Specifics | |
| - Hybrid thinking: can switch between `<think>` mode and direct response | |
| - ~4B parameters, efficient for RL training | |
| - MoE variants also available | |
| - Unsloth IDs: `unsloth/Qwen3-4B`, `unsloth/Qwen3-4B-bnb-4bit` | |
| --- | |
| ## 11. Unsloth ART / GRPO Trainer Plan | |
| ### Phase 1: Data Preparation | |
| **Training Data Sources:** | |
| 1. OrigamiSpace dataset (471 auxiliary instances) - CP diagrams, fold sequences, 3D shapes | |
| 2. GamiBench dataset (777 samples, 186 patterns) - crease patterns with multi-view 3D | |
| 3. Synthetic data generation pipeline (following SpatialThinker approach): | |
| - Generate origami QA pairs with Claude/GPT | |
| - Validate with GPT-4o pass@2 | |
| - Balance across difficulty levels | |
| **Data Format for GRPO:** | |
| ```python | |
| # Each training example = a prompt with origami task | |
| { | |
| "prompt": [ | |
| {"role": "user", "content": [ | |
| {"type": "image", "image": cp_diagram_image}, | |
| {"type": "text", "text": "Given this crease pattern, describe the folding sequence and predict the final 3D shape. Output your answer as a FOLD JSON."} | |
| ]} | |
| ] | |
| } | |
| ``` | |
| ### Phase 2: Reward Function Design | |
| **Following SpatialThinker's lexicographic gating pattern, adapted for origami:** | |
| ```python | |
| def origami_reward(prompt, response, ground_truth): | |
| # Component 1: Format reward (gate) | |
| r_format = check_valid_fold_json(response) # 0 or 1 | |
| # Component 2: Constraint satisfaction | |
| r_constraints = check_origami_constraints(response) | |
| # - Maekawa's theorem: |M-V| = 2 | |
| # - Kawasaki's theorem: sum(alpha_i) = 2*pi | |
| # - Euler's formula: V - E + F = 2 | |
| # - No self-intersection | |
| # Component 3: Topological similarity | |
| r_topology = compute_tss(response, ground_truth) | |
| # Vertex/edge/face counts, connectivity | |
| # Component 4: Geometric similarity | |
| r_geometry = compute_hausdorff_similarity(response, ground_truth) | |
| # Component 5: Final shape match | |
| r_shape = compute_folded_state_similarity(response, ground_truth) | |
| # Lexicographic gating | |
| if r_format == 0: | |
| return 0.0 | |
| total = (0.1 * r_format + | |
| 0.25 * r_constraints + | |
| 0.2 * r_topology + | |
| 0.2 * r_geometry + | |
| 0.25 * r_shape) | |
| return total | |
| ``` | |
| ### Phase 3: Training Infrastructure | |
| **Option A: Unsloth (simpler, less VRAM)** | |
| ```python | |
| from unsloth import FastVisionModel | |
| from trl import GRPOConfig, GRPOTrainer | |
| model, tokenizer = FastVisionModel.from_pretrained( | |
| "unsloth/Qwen2.5-VL-7B-Instruct", | |
| load_in_4bit=True, | |
| fast_inference=True, | |
| ) | |
| model = FastVisionModel.get_peft_model(model, r=16, lora_alpha=16) | |
| config = GRPOConfig( | |
| loss_type="grpo", | |
| num_generations=8, | |
| max_new_tokens=2048, | |
| per_device_train_batch_size=1, | |
| gradient_accumulation_steps=16, | |
| num_train_epochs=3, | |
| learning_rate=1e-6, | |
| ) | |
| trainer = GRPOTrainer( | |
| model=model, | |
| config=config, | |
| train_dataset=dataset, | |
| reward_funcs=[origami_reward], | |
| ) | |
| trainer.train() | |
| ``` | |
| **Option B: veRL/EasyR1 (following SpatialThinker, more control)** | |
| - Uses veRL framework with GRPO | |
| - vLLM backend for fast rollouts | |
| - More complex but battle-tested for spatial reasoning | |
| - Better for multi-turn rollouts | |
| ### Phase 4: Multi-Turn Rollouts | |
| Following OrigamiSpace's environmental learning approach: | |
| 1. Model generates CP code / fold sequence | |
| 2. OpenEnv compiler validates and returns error feedback | |
| 3. Model refines based on error type (CSE/GIF/PSI/AFS) | |
| 4. Repeat up to 10 rounds | |
| 5. Final reward based on best attempt | |
| **Environment class pattern:** | |
| ```python | |
| class OrigamiEnv: | |
| def __init__(self, task): | |
| self.task = task | |
| self.state = task["initial_state"] # FOLD JSON | |
| self.steps = 0 | |
| self.max_steps = 10 | |
| self.history = [] | |
| def step(self, action: str): | |
| """Process model's fold action, return compiler feedback.""" | |
| self.steps += 1 | |
| # Validate through compiler (CSE/GIF/PSI/AFS checks) | |
| result = self.compile_and_validate(action) | |
| observation = f"Step {self.steps}: {result['error_type']}: {result['message']}" | |
| self.state = result.get("new_state", self.state) | |
| self.history.append((action, observation)) | |
| done = self.steps >= self.max_steps or result.get("valid", False) | |
| reward = self.compute_reward() if done else 0.0 | |
| return observation, reward, done | |
| def compute_reward(self): | |
| """4-dimensional evaluation: TSS + GS + CS + FFS.""" | |
| return (0.25 * tss(self.state, self.task["target"]) + | |
| 0.25 * gs(self.state, self.task["target"]) + | |
| 0.25 * cs(self.state) + | |
| 0.25 * ffs(self.state, self.task["target"])) | |
| def multi_turn_reward(completions, prompts, **kwargs): | |
| """Wrap environment interaction into GRPO reward function.""" | |
| rewards = [] | |
| for completion, prompt in zip(completions, prompts): | |
| env = OrigamiEnv(extract_task(prompt)) | |
| actions = parse_actions(completion) | |
| total_reward = 0.0 | |
| for action in actions: | |
| obs, reward, done = env.step(action) | |
| total_reward += reward | |
| if done: | |
| break | |
| rewards.append(total_reward) | |
| return rewards | |
| ``` | |
| ### Phase 5: Evaluation | |
| 1. **GamiBench** - standard origami spatial reasoning benchmark | |
| 2. **OrigamiSpace tasks** - 4-task evaluation suite | |
| 3. **Custom metrics:** | |
| - Constraint satisfaction rate (Maekawa/Kawasaki) | |
| - Compilation success rate | |
| - Topological/geometric similarity scores | |
| ### Phase 6: Monitoring with Trackio | |
| ```python | |
| import trackio | |
| trackio.init( | |
| project="optigami-grpo", | |
| space_id="openenv-community/optigami-training", | |
| config={ | |
| "model": "Qwen2.5-VL-7B", | |
| "lora_r": 16, | |
| "num_generations": 8, | |
| "learning_rate": 1e-6, | |
| } | |
| ) | |
| # In training loop | |
| trackio.log({ | |
| "step": step, | |
| "reward/total": total_reward, | |
| "reward/format": format_reward, | |
| "reward/constraints": constraint_reward, | |
| "reward/topology": topology_reward, | |
| "reward/geometry": geometry_reward, | |
| "reward/shape": shape_reward, | |
| "loss": loss, | |
| "compilation_rate": compilation_rate, | |
| }) | |
| ``` | |
| --- | |
| ## 12. GitHub Reference Repo (ianalin123/optigami) | |
| Located at `.reference/optigami-github/` (gitignored, not pushed to HF). | |
| ### What It Contains | |
| A complete research repository with detailed architecture docs and a reference 2048 GRPO implementation. | |
| ### Key Files | |
| | File | Contents | | |
| |------|----------| | |
| | `research/plan/architecture.md` | **Full architecture spec**: action space, state, physics engine, reward functions, OpenEnv integration, rendering pipeline, project structure, implementation order | | |
| | `research/openenv/2048_example.py` | **636-line reference implementation** of OpenEnv + GRPO for 2048 game (Unsloth + TRL) | | |
| | `research/openenv/overview.md` | OpenEnv framework API, types, project structure, deployment to HF Spaces | | |
| | `research/origami/fold_types_deep.md` | All fold operations, Huzita-Justin axioms, crane step-by-step, compression patterns | | |
| | `research/origami/math_physics_deep.md` | Kawasaki/Maekawa theorems with code, bar-and-hinge model, energy formulas | | |
| | `research/origami/rendering_research.md` | Rendering options comparison | | |
| | `research/origami/fold_format.md` | FOLD file format details | | |
| ### Architecture Decisions (from GitHub repo) | |
| | Decision | Choice | | |
| |----------|--------| | |
| | LLM interaction | **Code-as-policy** (LLM writes `fold_strategy()` function) | | |
| | Action space | Named fold ops (valley/mountain + fold line + angle) | | |
| | State format | FOLD-compatible JSON | | |
| | Physics engine | Bar-and-hinge model (NumPy port of Ghassaei) | | |
| | Validation | Kawasaki + Maekawa + triangle-triangle intersection | | |
| | Primary task | Solar panel packing (Miura-ori discovery) | | |
| | Training | GRPO via TRL + Unsloth | | |
| | Deployment | Docker Space on HF Spaces | | |
| ### Action Space (Code-as-Policy) | |
| The LLM generates a `fold_strategy(paper_state)` function returning fold instructions: | |
| ```python | |
| def fold_strategy(paper_state: dict) -> list[dict]: | |
| # paper_state contains: vertices, edges, assignments, fold_angles, material, etc. | |
| return [ | |
| {"type": "valley", "line": {"start": [0,0.5], "end": [1,0.5]}, "angle": 180}, | |
| {"type": "mountain", "line": {"start": [0.5,0], "end": [0.5,0.5]}, "angle": 180}, | |
| ] | |
| ``` | |
| ### Reward Functions (3 from 2048 pattern, adapted for origami) | |
| 1. **`code_valid`**: +1.0 valid function, -0.5 exec fails, -2.0 syntax error | |
| 2. **`physically_valid`**: +1.0 all valid, -2.0 per Kawasaki/Maekawa violation, -5.0 self-intersection | |
| 3. **`fold_quality`**: +20.0 * compactness, +10.0 meets volume target, +5.0 deployable, -0.5 per fold | |
| ### Physics Engine (Bar-and-Hinge Model) | |
| ```python | |
| E_total = E_bar + E_facet + E_fold | |
| E_bar = sum (1/2) * k_axial * (L - L0)^2 # stretching | |
| E_facet = sum (1/2) * k_facet * l * (theta-pi)^2 # panel bending | |
| E_fold = sum (1/2) * k_fold * l * (rho-rho_t)^2 # crease folding | |
| ``` | |
| ### Planned Project Structure | |
| ``` | |
| engine/ # Core simulation (numpy/scipy) | |
| paper.py # Paper data structure, FOLD I/O | |
| fold_engine.py # Apply folds (quaternion rotation) | |
| physics.py # Bar-and-hinge energy, strain | |
| validation.py # Kawasaki, Maekawa, self-intersection | |
| metrics.py # Deployment ratio, compactness | |
| materials.py # Material definitions | |
| environment/ # OpenEnv server | |
| models.py # Action, Observation, State | |
| origami_environment.py # Environment (reset/step/state) | |
| tasks.py # Task pool / curriculum | |
| app.py # create_app() | |
| Dockerfile | |
| client/ # OpenEnv client + training bridge | |
| reward_functions.py # code_valid, physically_valid, fold_quality | |
| training/ # Colab notebook | |
| train_origami.ipynb # GRPO training (Unsloth + TRL) | |
| prompts.py # LLM prompt templates | |
| ``` | |
| ### Implementation Order (from architecture.md) | |
| 1. **Phase 1: Engine** - paper.py, fold_engine.py, validation.py, metrics.py | |
| 2. **Phase 2: OpenEnv Server** - models.py, origami_environment.py, app.py, Dockerfile | |
| 3. **Phase 3: Reward + Training** - reward_functions.py, prompts.py, train_origami.ipynb | |
| 4. **Phase 4: Rendering + Demo** - matplotlib headless, React + R3F app | |
| ### 2048 Reference Implementation (Key Patterns) | |
| The `2048_example.py` shows the exact Unsloth + OpenEnv + GRPO pattern: | |
| - `PatchFastRL` not used (text model, not vision) - for our VLM use `FastVisionModel` | |
| - `extract_function()` parses code from ```python blocks | |
| - `create_locked_down_function()` sandboxes execution | |
| - `check_python_modules()` prevents non-stdlib imports | |
| - `execute_with_time_limit(5)` wraps strategy execution | |
| - Dataset: 1000x replicated prompt, `report_to="trackio"` | |
| - GRPOConfig: temp=1.0, lr=2e-4, max_steps=600, num_generations=2 | |
| - Three reward functions passed as list to `GRPOTrainer` | |
| --- | |
| ## 13. Current Project State | |
| ### Repository | |
| - **Location:** HuggingFace Space `openenv-community/optigami` | |
| - **Framework:** Create React App (React 19.1.0) | |
| - **Status:** Fresh scaffold - default CRA boilerplate | |
| - **Build:** `npm run build` -> `build/index.html` (HF Spaces static SDK) | |
| ### File Structure | |
| ``` | |
| optigami/ | |
| package.json # React app dependencies | |
| README.md # CRA default + HF Space metadata | |
| public/ # Static assets (favicon, manifest) | |
| src/ | |
| App.js # Default CRA component (placeholder) | |
| App.css | |
| index.js # Entry point | |
| index.css | |
| logo.svg | |
| reportWebVitals.js | |
| setupTests.js | |
| App.test.js | |
| ``` | |
| ### What Needs to Be Built | |
| 1. **Python backend** - Paper Geometry Engine with Shapely, FOLD import/export, constraint checking | |
| 2. **GRPO training scripts** - Unsloth or veRL-based, with origami reward functions | |
| 3. **Data pipeline** - Load/process OrigamiSpace + GamiBench datasets | |
| 4. **Three.js frontend** - Replace CRA boilerplate with origami visualizer (possibly integrate OrigamiSimulator) | |
| 5. **OpenEnv server** - API connecting geometry engine to trainer | |
| --- | |
| ## Key Takeaways for Immediate Work (GRPO Trainer) | |
| 1. **Use Unsloth for simplicity** - 90% VRAM savings, built-in vLLM, QLoRA support for Qwen2.5-VL-7B | |
| 2. **Dense rewards with lexicographic gating** - format gate -> constraints -> topology -> geometry -> shape match (SpatialThinker pattern) | |
| 3. **OrigamiSpace's 4-error compiler** is the gold standard for reward signal generation | |
| 4. **Start with 500+ origami examples** - GamiBench (777) + OrigamiSpace (471) = 1248 examples | |
| 5. **8 generations per prompt**, temperature 1.0, 300+ training steps minimum | |
| 6. **Multi-turn: max 10 rounds** with compiler feedback (performance saturates after 8-10) | |
| 7. **Track with Trackio** - deploy dashboard to HF Spaces for real-time monitoring | |
| 8. **Evaluate on GamiBench** for standardized comparison against other MLLMs | |
| --- | |
| ## Cross-Reference: Tool Compatibility Matrix | |
| | Component | FOLD | OrigamiSim | GamiBench | SpatialThinker | Unsloth | Trackio | | |
| |-----------|------|------------|-----------|----------------|---------|---------| | |
| | State representation | Core | Import | - | - | - | - | | |
| | Visualization | Export | Core | - | - | - | - | | |
| | Training data | - | - | Core | Augment | - | - | | |
| | RL training | - | - | Eval | Template | Core | Monitor | | |
| | Reward functions | Validate | Strain | - | Template | Integrate | Log | | |
| | Constraint checking | Structure | Physics | Impossible set | - | - | - | | |