| --- |
| language: |
| - en |
| license: apache-2.0 |
| library_name: pytorch |
| tags: |
| - 3d |
| - orientation |
| - rotation-correction |
| - vggt |
| - lora |
| - image-classification |
| pipeline_tag: image-classification |
| base_model: facebookresearch/vggt |
| --- |
| |
| # Orient90 V2 — Mixed-Domain VGGT-1B + LoRA + 2-head |
|
|
| Discrete 3D orientation-correction model. Given a single RGB render of a 3D |
| asset, predict the Euler rotation `(corr_x, corr_y, corr_z)` (all multiples of |
| 90°) that rotates the asset back to its canonical upright orientation, plus a |
| binary flag indicating whether the input is **off-grid** (i.e. not aligned to |
| the 90° rotation group). |
|
|
| V2 replaces V1's DINOv2-large + full fine-tune with **VGGT-1B + LoRA r=64**,
| and trains jointly on a 1.53M-sample mix of artstation USDZ, character, and 3d66
| indoor-scene renders. Strict accuracy vs. V1 improves by about **+7pp on 3d66**
| and **+0.43pp on artstation**, while artstation orbit-sum accuracy reaches 95.04%. |
|
|
| - **Backbone**: VGGT-1B aggregator, initialized from the Orient Anything V2 |
| ckpt (`Viglong/OriAnyV2_ckpt`), wrapped with LoRA (rank 64, alpha 64) |
| adapters on `attn.{qkv,proj}` + `mlp.{fc1,fc2}` — base weights are frozen. |
| - **Pool**: `LayerNorm(CLS + mean(patch_tokens))` over the last aggregator layer. |
| - **Heads**: |
| - 24-way classification over the cubic rotation group (`SO(3)` axis-aligned subgroup). |
| - 1-dim off-grid probability (BCE). |
| - **Image size**: 518×518 (VGGT native, no ImageNet mean/std normalization). |
| - **Training data**: 1.53M renders, Blender CYCLES @ 518, balanced across
| artstation USDZ, character, and 3d66; trained with orbit-aware label smoothing (α=0.1, share=0.8). |
| - **Best checkpoint**: **epoch 4** of 5 (`best.pt`, score=0.8689). |
|
|
| ## Evaluation (val split, full `labels_balanced_v2`) |
|
|
| | Dataset | Strict cls_acc | Orbit-sum acc | Top-k inferred | Off-head acc | |
| |------------|---------------:|--------------:|---------------:|-------------:| |
| | character | **92.89%** | 97.24% | 93.39% | 99.02% | |
| | artstation | 74.59% | **95.04%** | 79.29% | 99.02% | |
| | 3d66 | 86.51% | **98.12%** | 91.15% | 99.02% | |
| |
| - **Strict cls_acc**: argmax over 24 classes equals the GT class (single-GT, |
| authoritative for characters where facial asymmetry matters). |
| - **Orbit-sum acc**: sum of softmax probability over the GT's full octahedral |
| orbit (z180/y180/z90); authoritative for artstation/3d66, where symmetric objects |
| have GT-ambiguous-but-equivalent partner classes. |
| - **Top-k inferred**: argmax==GT OR (top1/top2 ratio≥0.5 AND top1/top2 ∈ orbit |
| partners). |
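
The first two metrics can be sketched in plain Python. This is only an illustration: `orbit_partners` is a hypothetical mapping from a class id to its symmetry-equivalent class ids (it is not a file shipped with the model), and how the orbit probability mass is turned into a per-sample correct/incorrect decision is not stated here, so the sketch stops at the mass itself.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def strict_correct(probs, gt):
    """Strict cls_acc criterion: the argmax class must equal the GT class."""
    top1 = max(range(len(probs)), key=probs.__getitem__)
    return top1 == gt


def orbit_mass(probs, gt, orbit_partners):
    """Probability mass on the GT class plus its orbit partners.

    `orbit_partners` is a hypothetical dict {class_id: [partner ids]}.
    """
    members = {gt} | set(orbit_partners.get(gt, ()))
    return sum(probs[i] for i in members)


# Toy case: the model prefers a symmetric partner (class 5) over the GT (class 0).
logits = [0.0] * 24
logits[0], logits[5] = 3.5, 4.0
probs = softmax(logits)
print(strict_correct(probs, gt=0))
print(round(orbit_mass(probs, gt=0, orbit_partners={0: [5]}), 3))
```

In this toy case the strict criterion fails while most of the probability mass still sits inside the orbit, which is the situation the orbit-sum metric is meant to capture for symmetric objects.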
| |
| ### Δ vs. V1 (DINOv2-large full fine-tune) |
| |
| | Dataset | V1 strict | V2 strict | Δ | |
| |------------|----------:|----------:|------:| |
| | character | 92.33% | **92.89%** | +0.56pp | |
| | artstation | 74.16% | **74.59%** | +0.43pp | |
| | 3d66 | ~80% | **86.51%** | +7pp | |
| |
| V2 wins decisively on **artstation orbit (95.04%)** and **3d66 (98.12% orbit / |
| +7pp strict)** — the cross-domain bottleneck of V1. |
| |
| ## Quick start |
| |
| ```bash |
| pip install -r requirements.txt |
| # Includes: vggt @ git+https://github.com/facebookresearch/vggt.git |
| ``` |
| |
| ```python |
| from orient90_v2 import OrientPredictor |
|
|
| # Load model |
| p = OrientPredictor("best.pt", class_map_path="class_map.json", device="cuda") |
| |
| # Direct image input — must be a Blender-CYCLES-style render (518×518) |
| result = p.predict_image("examples/sample_render.png") |
| print(result) |
| # { |
| # "class_id": 0, |
| # "corr_x": 0, "corr_y": 0, "corr_z": 0, |
| # "R": [[1,0,0],[0,1,0],[0,0,1]], |
| # "confidence": 0.94, |
| # "off_grid": False, |
| # "off_grid_prob": 0.02 |
| # } |
|
|
| # 3D model input — auto-renders via Blender if sibling .png is missing |
| result = p.predict_model("foo.glb", render_gpu="auto") |
| ``` |
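
A caller will typically want to skip the correction when the off-grid head fires or the classifier is unsure. The helper below is a minimal sketch that relies only on the keys shown in the sample output above; `correction_to_apply` and the `min_confidence` threshold are hypothetical and not part of `orient90_v2`.

```python
def correction_to_apply(result, min_confidence=0.5):
    """Return the 3x3 correction matrix, or None if it should not be applied.

    `min_confidence` is an illustrative threshold, not a tuned value.
    """
    if result["off_grid"]:
        # Input is not aligned to the 90-degree grid, so no class is meaningful.
        return None
    if result["confidence"] < min_confidence:
        return None
    return result["R"]


sample = {
    "class_id": 0,
    "corr_x": 0, "corr_y": 0, "corr_z": 0,
    "R": [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
    "confidence": 0.94,
    "off_grid": False,
    "off_grid_prob": 0.02,
}
print(correction_to_apply(sample))
```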
| |
| ### Class map → rotation matrix |
| |
| The 24 classes index the rotation group of the cube (the chiral octahedral |
| group, order 24). Each class entry in `class_map.json` provides: |
| |
| ```jsonc |
| { |
| "id": 7, |
| "euler_xyz": [0, 0, 270], // Apply Rz(-90°) to correct the input |
| "matrix": [[...], [...], [...]] // 3×3 rotation matrix (canonical = matrix @ input) |
| } |
| ``` |
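
To see where the count of 24 comes from, the rotation group of the cube can be enumerated as the signed permutation matrices with determinant +1. This is a standalone sketch; the order in which it produces the matrices is arbitrary and should not be assumed to match the ids in `class_map.json`.

```python
from itertools import permutations, product


def det3(m):
    """Determinant of a 3x3 integer matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def cube_rotations():
    """All axis-aligned rotations: signed permutation matrices with det = +1."""
    rotations = []
    for perm in permutations(range(3)):
        for signs in product((1, -1), repeat=3):
            # Row r has a single nonzero entry, signs[r], in column perm[r].
            m = [[signs[r] if c == perm[r] else 0 for c in range(3)]
                 for r in range(3)]
            if det3(m) == 1:  # det = -1 would be a reflection, not a rotation
                rotations.append(m)
    return rotations


ROTATIONS = cube_rotations()
print(len(ROTATIONS))  # 24
```

Of the 48 signed permutation matrices, exactly half have determinant +1, which gives the 24 classes of the classification head.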
| |
| To rotate the input mesh back to canonical: |
| |
| ```python |
| import numpy as np |
| import trimesh |
| R = np.array(result["R"]) # 3×3 correction in our internal Z-up frame |
| # glTF/USD store Y-up vertices, so conjugate R into the Y-up frame before applying. |
| # M_yz is Rx(+90°), which maps the Y-up axis onto the Z-up axis. |
| M_yz = trimesh.transformations.rotation_matrix(np.pi / 2, [1, 0, 0])[:3, :3] |
| R_yup = M_yz.T @ R @ M_yz |
| T = np.eye(4) |
| T[:3, :3] = R_yup |
| scene = trimesh.load("foo.glb", force="scene") # returns a trimesh.Scene |
| scene.apply_transform(T) |
| scene.export("foo_canonical.glb") |
| ``` |
| |
| ## Training recipe |
| |
| ``` |
| VGGT-1B base (frozen) + LoRA r=64, α=64 on attn.qkv/proj + mlp.fc1/fc2 |
| loss = CE(cls, GT) + λ_off × BCE(off, GT_offgrid) λ_off = 1.0 |
| optimizer = AdamW, lr_backbone=5e-4, lr_head=1e-3, wd=1e-4, no_wd_on_lora |
| batch_size = 12 × 7 GPUs (effective 84), warmup_ratio = 0.05 |
| label_smoothing = 0.1 with orbit-aware smoothing (orbit_share=0.8) |
| epochs = 5, bf16, grad_clip = 1.0 |
| val = max_val_samples=5000 random subset (training-time eval; full val below) |
| ``` |
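
The recipe does not spell out how orbit-aware smoothing distributes probability, so the sketch below shows one plausible reading, stated as an assumption: the GT class keeps `1 - alpha`, its orbit partners split `alpha * orbit_share` equally, and all remaining classes split the rest. The function names and the `partners` argument are hypothetical, and the actual training code may differ.

```python
import math


def orbit_smoothed_target(gt, partners, num_classes=24, alpha=0.1, orbit_share=0.8):
    """Soft target under the assumed orbit-aware smoothing scheme."""
    partners = [p for p in set(partners) if p != gt]
    others = [c for c in range(num_classes) if c != gt and c not in partners]
    target = [0.0] * num_classes
    target[gt] = 1.0 - alpha
    if partners and others:
        for p in partners:
            target[p] = alpha * orbit_share / len(partners)
        for c in others:
            target[c] = alpha * (1.0 - orbit_share) / len(others)
    else:
        # No partners (or no other classes): spread alpha over whatever remains.
        rest = partners or others
        for c in rest:
            target[c] = alpha / len(rest)
    return target


def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target distribution and softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for t, x in zip(target, logits))


target = orbit_smoothed_target(gt=0, partners=[5, 7, 9])
loss = soft_cross_entropy([0.0] * 24, target)
```

Under this reading, with α=0.1 and share=0.8, a class with three orbit partners gets 0.9 on the GT, 0.08 spread over the partners, and 0.02 spread over the other 20 classes.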
| |
| ## Compatibility / known issues |
| |
| - **VGGT requires `pip install git+https://github.com/facebookresearch/vggt.git`** |
| (not on PyPI as of 2026-05). |
| - Image preprocessing has **no ImageNet mean/std** normalization — VGGT's |
| aggregator was trained on raw [0,1] tensors. |
| - `predict_model()` needs a Blender env with `bpy` 4.4 installed. The bundled |
| `blender_scripts/blender_render_preview.py` matches the training render pipeline. |
| - The 2-head ckpt (this V2) **does not produce sub-grid offsets**. For sub-degree |
| refinement, pair with the V4 mesh-only post-process (`feat_iter_qLcont`) — see |
| the project repo's `docs/reports/phase_b2_mesh_postprocess_v4.md`. |
| |
| ## Citations / related |
| |
| - Backbone: VGGT (Visual Geometry Grounded Transformer), Wang et al., CVPR 2025. |
| https://github.com/facebookresearch/vggt |
| - Backbone init: Orient Anything V2, NeurIPS'25 spotlight, |
| `Viglong/OriAnyV2_ckpt` on Hugging Face. |
| - V1 baseline (DINOv2 full fine-tune): `noahdudu/orient90-v1`. |
| |
| ## License |
| |
| Apache-2.0 (see `LICENSE`). |
| |