# Orient90 V2: Mixed-Domain VGGT-1B + LoRA + 2-head
Discrete 3D orientation-correction model. Given a single RGB render of a 3D asset, predict the Euler rotation (`corr_x`, `corr_y`, `corr_z`) (all multiples of 90°) that rotates the asset back to its canonical upright orientation, plus a binary flag indicating whether the input is off-grid (i.e. not aligned to the 90° rotation group).
V2 replaces V1's DINOv2-large + full fine-tune with VGGT-1B + LoRA r=64, and trains jointly on a 1.53M-sample mix of artstation USDZ + characters + 3d66 indoor scenes, yielding roughly +7pp strict accuracy on 3d66 and parity or better on the other domains vs. V1 (see the Δ table below).
- Backbone: VGGT-1B aggregator, initialized from the Orient Anything V2 ckpt (`Viglong/OriAnyV2_ckpt`), wrapped with LoRA (rank 64, alpha 64) adapters on `attn.{qkv,proj}` + `mlp.{fc1,fc2}`; base weights are frozen.
- Pool: `LayerNorm(CLS + mean(patch_tokens))` over the last aggregator layer.
- Heads (see the sketch after this list):
  - 24-way classification over the cubic rotation group (the axis-aligned subgroup of SO(3)).
  - 1-dim off-grid probability (BCE).
- Image size: 518×518 (VGGT native, no ImageNet mean/std normalization).
- Training data: 1.53M renders, Blender CYCLES @ 518px, balanced across the artstation USDZ, character, and 3d66 domains, with orbit-aware label smoothing (α=0.1, share=0.8).
- Best checkpoint: epoch 4 of 5 (`best.pt`, score=0.8689).
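
For orientation, here is a minimal PyTorch sketch of the pool-and-heads stage described above. The module and argument names are my own, and the feature dim is a placeholder rather than the actual VGGT-1B aggregator width:

```python
import torch
import torch.nn as nn

class Orient90Heads(nn.Module):
    """Sketch of the pool + 2-head stage; names and dims are illustrative."""

    def __init__(self, dim: int = 1024, n_classes: int = 24):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.cls_head = nn.Linear(dim, n_classes)  # 24-way cubic-rotation class
        self.off_head = nn.Linear(dim, 1)          # off-grid logit (trained with BCE)

    def forward(self, cls_token: torch.Tensor, patch_tokens: torch.Tensor):
        # cls_token: (B, D); patch_tokens: (B, N, D) from the last aggregator layer
        pooled = self.norm(cls_token + patch_tokens.mean(dim=1))
        return self.cls_head(pooled), self.off_head(pooled).squeeze(-1)
```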
## Evaluation (val split, full `labels_balanced_v2`)
| Dataset | Strict cls_acc | Orbit-sum acc | Top-k inferred | Off-head acc |
|---|---|---|---|---|
| character | 92.89% | 97.24% | 93.39% | 99.02% |
| artstation | 74.59% | 95.04% | 79.29% | 99.02% |
| 3d66 | 86.51% | 98.12% | 91.15% | 99.02% |
- Strict cls_acc: argmax over 24 classes equals the GT class (single-GT; authoritative for characters, where facial asymmetry matters).
- Orbit-sum acc: sum of softmax probability over the GT's full octahedral orbit (z180/y180/z90); authoritative for artstation/3d66, where symmetric objects have GT-ambiguous-but-equivalent partner classes.
- Top-k inferred: argmax == GT, OR (the top-2/top-1 probability ratio is ≥ 0.5 AND top1/top2 are orbit partners); see the sketch below.
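
As a concrete reading of these definitions, here is a per-sample scoring sketch. `orbits` (class id to the ids in the same orbit, GT included) is a hypothetical helper, and the orbit-sum rule used here (the GT's orbit must collect the most mass) is one plausible interpretation of the metric, not necessarily the eval code's:

```python
import numpy as np

def score_sample(probs, gt, orbits):
    """probs: (24,) softmax; gt: GT class id; orbits: dict class id -> orbit ids."""
    order = np.argsort(probs)
    top1, top2 = int(order[-1]), int(order[-2])

    strict = top1 == gt

    # Orbit-sum: pool probability mass per orbit; correct if the GT's orbit wins.
    mass = {}
    for c, p in enumerate(probs):
        key = frozenset(orbits[c])
        mass[key] = mass.get(key, 0.0) + p
    orbit_ok = max(mass, key=mass.get) == frozenset(orbits[gt])

    # Top-k inferred: exact hit, or GT in a near-tie with an orbit partner.
    topk = strict or (top2 == gt
                      and probs[top2] / probs[top1] >= 0.5
                      and top2 in orbits[top1])
    return strict, orbit_ok, topk
```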
## Δ vs. V1 (DINOv2-large full fine-tune)
| Dataset | V1 strict | V2 strict | Ξ |
|---|---|---|---|
| character | 92.33% | 92.89% | +0.56pp |
| artstation | 74.16% | 74.59% | +0.43pp |
| 3d66 | ~80% | 86.51% | ~+7pp |
V2 wins decisively on artstation orbit (95.04%) and on 3d66 (98.12% orbit, ~+7pp strict), the cross-domain bottleneck of V1.
## Quick start
```bash
pip install -r requirements.txt
# Includes: vggt @ git+https://github.com/facebookresearch/vggt.git
```
```python
from orient90_v2 import OrientPredictor

# Load model
p = OrientPredictor("best.pt", class_map_path="class_map.json", device="cuda")

# Direct image input: must be a Blender-CYCLES-style render (518x518)
result = p.predict_image("examples/sample_render.png")
print(result)
# {
#   "class_id": 0,
#   "corr_x": 0, "corr_y": 0, "corr_z": 0,
#   "R": [[1,0,0],[0,1,0],[0,0,1]],
#   "confidence": 0.94,
#   "off_grid": False,
#   "off_grid_prob": 0.02
# }

# 3D model input: auto-renders via Blender if a sibling .png is missing
result = p.predict_model("foo.glb", render_gpu="auto")
```
## Class map → rotation matrix
The 24 classes index the proper rotational subgroup of the cube (the chiral octahedral group). Each class entry in `class_map.json` provides:
```jsonc
{
  "id": 7,
  "euler_xyz": [0, 0, 270],         // apply Rz(-90°) to correct the input
  "matrix": [[...], [...], [...]]   // 3x3 rotation matrix (canonical = matrix @ input)
}
```
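
The `matrix` field should be derivable from `euler_xyz`. As a sanity check, here is a SciPy sketch assuming `class_map.json` is a list indexed by class id and the angles are extrinsic x-y-z in degrees; verify both assumptions against the shipped file:

```python
import json
import numpy as np
from scipy.spatial.transform import Rotation

entries = json.load(open("class_map.json"))
entry = entries[7]  # assumes a list indexed by class id

# Extrinsic x-y-z Euler, degrees; switch to "XYZ" if the stored matrix disagrees.
R = Rotation.from_euler("xyz", entry["euler_xyz"], degrees=True).as_matrix()
assert np.allclose(R, np.array(entry["matrix"]), atol=1e-6)
```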
To rotate the input mesh back to canonical:
```python
import numpy as np
import trimesh

R = np.array(result["R"])  # 3x3 in our internal Z-up frame

# glTF/USD store Y-up vertices, so conjugate R into the Y-up frame before
# applying (rotate Y-up -> Z-up, apply the Z-up correction, rotate back):
M_yz = trimesh.transformations.rotation_matrix(np.pi / 2, [1, 0, 0])
R_yup = M_yz[:3, :3].T @ R @ M_yz[:3, :3]

T = np.eye(4)
T[:3, :3] = R_yup
mesh = trimesh.load("foo.glb", force="scene")
mesh.apply_transform(T)
mesh.export("foo_canonical.glb")
```
## Training recipe
```text
VGGT-1B base (frozen) + LoRA r=64, α=64 on attn.qkv/proj + mlp.fc1/fc2
loss = CE(cls, GT) + λ_off × BCE(off, GT_offgrid), λ_off = 1.0
optimizer = AdamW, lr_backbone=5e-4, lr_head=1e-3, wd=1e-4, no_wd_on_lora
batch_size = 12 × 7 GPUs (effective 84), warmup_ratio = 0.05
label_smoothing = 0.1 with orbit-aware smoothing (orbit_share=0.8)
epochs = 5, bf16, grad_clip = 1.0
val = max_val_samples=5000 random subset (training-time eval; full val above)
```
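
One way the orbit-aware smoothing could build its soft targets, assuming `orbit_share=0.8` routes that fraction of the α=0.1 smoothing mass to the GT's orbit partners and the remainder is spread uniformly (a sketch of a plausible split, not the training code; `orbits` as above):

```python
import torch

def orbit_smoothed_target(gt, orbits, n_classes=24, alpha=0.1, orbit_share=0.8):
    """Soft target: GT keeps 1 - alpha; of the alpha smoothing mass,
    orbit_share goes to the GT's orbit partners, the rest is uniform."""
    t = torch.full((n_classes,), alpha * (1 - orbit_share) / n_classes)
    partners = [c for c in orbits[gt] if c != gt]
    if partners:
        t[partners] += alpha * orbit_share / len(partners)
    else:
        t[gt] += alpha * orbit_share  # degenerate orbit: mass back to GT
    t[gt] += 1 - alpha
    return t  # sums to 1; feed to CE as soft labels
```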
## Compatibility / known issues
- VGGT requires `pip install git+https://github.com/facebookresearch/vggt.git` (not on PyPI as of 2026-05).
- Image preprocessing has no ImageNet mean/std normalization; VGGT's aggregator was trained on raw [0,1] tensors (see the sketch after this list).
- `predict_model()` needs a Blender env with `bpy` 4.4 installed. The bundled `blender_scripts/blender_render_preview.py` matches the training render pipeline.
- The 2-head ckpt (this V2) does not produce sub-grid offsets. For sub-degree refinement, pair with the V4 mesh-only post-process (`feat_iter_qLcont`); see the project repo's `docs/reports/phase_b2_mesh_postprocess_v4.md`.
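
Since the missing-normalization detail is the easiest one to get wrong when re-implementing preprocessing, here is a torchvision sketch of the expected input transform. The real pipeline lives inside `OrientPredictor`; this is an assumption about its behavior:

```python
from PIL import Image
from torchvision import transforms

# Resize to VGGT's native 518x518 and scale to [0, 1];
# deliberately NO transforms.Normalize(ImageNet mean/std) step.
preprocess = transforms.Compose([
    transforms.Resize((518, 518)),
    transforms.ToTensor(),  # uint8 HWC -> float32 CHW in [0, 1]
])
x = preprocess(Image.open("examples/sample_render.png").convert("RGB")).unsqueeze(0)
```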
## Citations / related
- Backbone: VGGT (Visual Geometry Grounded Transformer), Wang et al., CVPR 2025. https://github.com/facebookresearch/vggt
- Backbone init: Orient Anything V2, NeurIPS'25 spotlight, `Viglong/OriAnyV2_ckpt` on Hugging Face.
- V1 baseline (DINOv2 full fine-tune): `noahdudu/orient90-v1`.
## License
Apache-2.0 (see LICENSE).