# Orient90 V2: Mixed-Domain VGGT-1B + LoRA + 2-head
Discrete 3D orientation-correction model. Given a single RGB render of a 3D asset, predict the Euler rotation (`corr_x`, `corr_y`, `corr_z`) (all multiples of 90°) that rotates the asset back to its canonical upright orientation, plus a binary flag indicating whether the input is off-grid (i.e. not aligned to the 90° rotation group).
V2 replaces V1's DINOv2-large + full fine-tune with VGGT-1B + LoRA r=64, and trains jointly on a 1.53M-sample mix of artstation USDZ + characters + 3d66 indoor scenes, yielding roughly +7pp strict accuracy on 3d66 and parity or better on the other domains vs. V1 (see the Δ table below).
- Backbone: VGGT-1B aggregator, initialized from the Orient Anything V2 ckpt (`Viglong/OriAnyV2_ckpt`), wrapped with LoRA (rank 64, alpha 64) adapters on `attn.{qkv,proj}` + `mlp.{fc1,fc2}`; base weights are frozen.
- Pool: `LayerNorm(CLS + mean(patch_tokens))` over the last aggregator layer.
- Heads (see the sketch after this list):
  - 24-way classification over the cubic rotation group (the axis-aligned subgroup of SO(3)).
  - 1-dim off-grid probability (BCE).
- Image size: 518×518 (VGGT native, no ImageNet mean/std normalization).
- Training data: 1.53M renders, Blender CYCLES @ 518px, balanced across the artstation USDZ, character, and 3d66 domains, with orbit-aware label smoothing (α=0.1, share=0.8).
- Best checkpoint: epoch 4 of 5 (`best.pt`, score=0.8689).
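
For orientation, here is a minimal PyTorch sketch of the pool-and-heads stage described above. The module and argument names are my own, and the feature dim is a placeholder rather than the actual VGGT-1B aggregator width:

```python
import torch
import torch.nn as nn

class Orient90Heads(nn.Module):
    """Sketch of the pool + 2-head stage; names and dims are illustrative."""

    def __init__(self, dim: int = 1024, n_classes: int = 24):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.cls_head = nn.Linear(dim, n_classes)  # 24-way cubic-rotation class
        self.off_head = nn.Linear(dim, 1)          # off-grid logit (trained with BCE)

    def forward(self, cls_token: torch.Tensor, patch_tokens: torch.Tensor):
        # cls_token: (B, D); patch_tokens: (B, N, D) from the last aggregator layer
        pooled = self.norm(cls_token + patch_tokens.mean(dim=1))
        return self.cls_head(pooled), self.off_head(pooled).squeeze(-1)
```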
## Evaluation (val split, full `labels_balanced_v2`)
| Dataset | Strict cls_acc | Orbit-sum acc | Top-k inferred | Off-head acc |
|---|---|---|---|---|
| character | 92.89% | 97.24% | 93.39% | 99.02% |
| artstation | 74.59% | 95.04% | 79.29% | 99.02% |
| 3d66 | 86.51% | 98.12% | 91.15% | 99.02% |
- Strict cls_acc: argmax over 24 classes equals the GT class (single-GT; authoritative for characters, where facial asymmetry matters).
- Orbit-sum acc: sum of softmax probability over the GT's full octahedral orbit (z180/y180/z90); authoritative for artstation/3d66, where symmetric objects have GT-ambiguous-but-equivalent partner classes.
- Top-k inferred: argmax == GT, OR (the top-2/top-1 probability ratio is ≥ 0.5 AND top1/top2 are orbit partners); see the sketch below.
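
As a concrete reading of these definitions, here is a per-sample scoring sketch. `orbits` (class id to the ids in the same orbit, GT included) is a hypothetical helper, and the orbit-sum rule used here (the GT's orbit must collect the most mass) is one plausible interpretation of the metric, not necessarily the eval code's:

```python
import numpy as np

def score_sample(probs, gt, orbits):
    """probs: (24,) softmax; gt: GT class id; orbits: dict class id -> orbit ids."""
    order = np.argsort(probs)
    top1, top2 = int(order[-1]), int(order[-2])

    strict = top1 == gt

    # Orbit-sum: pool probability mass per orbit; correct if the GT's orbit wins.
    mass = {}
    for c, p in enumerate(probs):
        key = frozenset(orbits[c])
        mass[key] = mass.get(key, 0.0) + p
    orbit_ok = max(mass, key=mass.get) == frozenset(orbits[gt])

    # Top-k inferred: exact hit, or GT in a near-tie with an orbit partner.
    topk = strict or (top2 == gt
                      and probs[top2] / probs[top1] >= 0.5
                      and top2 in orbits[top1])
    return strict, orbit_ok, topk
```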
## Δ vs. V1 (DINOv2-large full fine-tune)
| Dataset | V1 strict | V2 strict | Ξ |
|---|---|---|---|
| character | 92.33% | 92.89% | +0.56pp |
| artstation | 74.16% | 74.59% | +0.43pp |
| 3d66 | ~80% | 86.51% | ~+7pp |
V2 wins decisively on artstation orbit (95.04%) and on 3d66 (98.12% orbit, ~+7pp strict), the cross-domain bottleneck of V1.
## Quick start
```bash
pip install -r requirements.txt
# Includes: vggt @ git+https://github.com/facebookresearch/vggt.git
```
```python
from orient90_v2 import OrientPredictor

# Load model
p = OrientPredictor("best.pt", class_map_path="class_map.json", device="cuda")

# Direct image input: must be a Blender-CYCLES-style render (518x518)
result = p.predict_image("examples/sample_render.png")
print(result)
# {
#   "class_id": 0,
#   "corr_x": 0, "corr_y": 0, "corr_z": 0,
#   "R": [[1,0,0],[0,1,0],[0,0,1]],
#   "confidence": 0.94,
#   "off_grid": False,
#   "off_grid_prob": 0.02
# }

# 3D model input: auto-renders via Blender if a sibling .png is missing
result = p.predict_model("foo.glb", render_gpu="auto")
```
## Class map → rotation matrix
The 24 classes index the proper rotational subgroup of the cube (the chiral octahedral group). Each class entry in `class_map.json` provides:
```jsonc
{
  "id": 7,
  "euler_xyz": [0, 0, 270],         // apply Rz(-90°) to correct the input
  "matrix": [[...], [...], [...]]   // 3x3 rotation matrix (canonical = matrix @ input)
}
```
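
The `matrix` field should be derivable from `euler_xyz`. As a sanity check, here is a SciPy sketch assuming `class_map.json` is a list indexed by class id and the angles are extrinsic x-y-z in degrees; verify both assumptions against the shipped file:

```python
import json
import numpy as np
from scipy.spatial.transform import Rotation

entries = json.load(open("class_map.json"))
entry = entries[7]  # assumes a list indexed by class id

# Extrinsic x-y-z Euler, degrees; switch to "XYZ" if the stored matrix disagrees.
R = Rotation.from_euler("xyz", entry["euler_xyz"], degrees=True).as_matrix()
assert np.allclose(R, np.array(entry["matrix"]), atol=1e-6)
```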
To rotate the input mesh back to canonical:
```python
import numpy as np
import trimesh

R = np.array(result["R"])  # 3x3 in our internal Z-up frame

# glTF/USD store Y-up vertices, so conjugate R into the Y-up frame before
# applying (rotate Y-up -> Z-up, apply the Z-up correction, rotate back):
M_yz = trimesh.transformations.rotation_matrix(np.pi / 2, [1, 0, 0])
R_yup = M_yz[:3, :3].T @ R @ M_yz[:3, :3]

T = np.eye(4)
T[:3, :3] = R_yup
mesh = trimesh.load("foo.glb", force="scene")
mesh.apply_transform(T)
mesh.export("foo_canonical.glb")
```
## Training recipe
```text
VGGT-1B base (frozen) + LoRA r=64, α=64 on attn.qkv/proj + mlp.fc1/fc2
loss = CE(cls, GT) + λ_off × BCE(off, GT_offgrid), λ_off = 1.0
optimizer = AdamW, lr_backbone=5e-4, lr_head=1e-3, wd=1e-4, no_wd_on_lora
batch_size = 12 × 7 GPUs (effective 84), warmup_ratio = 0.05
label_smoothing = 0.1 with orbit-aware smoothing (orbit_share=0.8)
epochs = 5, bf16, grad_clip = 1.0
val = max_val_samples=5000 random subset (training-time eval; full val above)
```
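
One way the orbit-aware smoothing could build its soft targets, assuming `orbit_share=0.8` routes that fraction of the α=0.1 smoothing mass to the GT's orbit partners and the remainder is spread uniformly (a sketch of a plausible split, not the training code; `orbits` as above):

```python
import torch

def orbit_smoothed_target(gt, orbits, n_classes=24, alpha=0.1, orbit_share=0.8):
    """Soft target: GT keeps 1 - alpha; of the alpha smoothing mass,
    orbit_share goes to the GT's orbit partners, the rest is uniform."""
    t = torch.full((n_classes,), alpha * (1 - orbit_share) / n_classes)
    partners = [c for c in orbits[gt] if c != gt]
    if partners:
        t[partners] += alpha * orbit_share / len(partners)
    else:
        t[gt] += alpha * orbit_share  # degenerate orbit: mass back to GT
    t[gt] += 1 - alpha
    return t  # sums to 1; feed to CE as soft labels
```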
## Compatibility / known issues
- VGGT requires `pip install git+https://github.com/facebookresearch/vggt.git` (not on PyPI as of 2026-05).
- Image preprocessing has no ImageNet mean/std normalization; VGGT's aggregator was trained on raw [0,1] tensors (see the sketch after this list).
- `predict_model()` needs a Blender env with `bpy` 4.4 installed. The bundled `blender_scripts/blender_render_preview.py` matches the training render pipeline.
- The 2-head ckpt (this V2) does not produce sub-grid offsets. For sub-degree refinement, pair with the V4 mesh-only post-process (`feat_iter_qLcont`); see the project repo's `docs/reports/phase_b2_mesh_postprocess_v4.md`.
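
Since the missing-normalization detail is the easiest one to get wrong when re-implementing preprocessing, here is a torchvision sketch of the expected input transform. The real pipeline lives inside `OrientPredictor`; this is an assumption about its behavior:

```python
from PIL import Image
from torchvision import transforms

# Resize to VGGT's native 518x518 and scale to [0, 1];
# deliberately NO transforms.Normalize(ImageNet mean/std) step.
preprocess = transforms.Compose([
    transforms.Resize((518, 518)),
    transforms.ToTensor(),  # uint8 HWC -> float32 CHW in [0, 1]
])
x = preprocess(Image.open("examples/sample_render.png").convert("RGB")).unsqueeze(0)
```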
## Citations / related
- Backbone: VGGT (Visual Geometry Grounded Transformer), Wang et al., CVPR 2025. https://github.com/facebookresearch/vggt
- Backbone init: Orient Anything V2, NeurIPS'25 spotlight, `Viglong/OriAnyV2_ckpt` on Hugging Face.
- V1 baseline (DINOv2 full fine-tune): `noahdudu/orient90-v1`.
## License
Apache-2.0 (see LICENSE).