language:
  - en
license: apache-2.0
library_name: pytorch
tags:
  - 3d
  - orientation
  - rotation-correction
  - vggt
  - lora
  - image-classification
pipeline_tag: image-classification
base_model: facebookresearch/vggt

Orient90 V2 — Mixed-Domain VGGT-1B + LoRA + 2-head

Discrete 3D orientation-correction model. Given a single RGB render of a 3D asset, it predicts the Euler rotation (corr_x, corr_y, corr_z), each a multiple of 90°, that rotates the asset back to its canonical upright orientation, plus a binary flag indicating whether the input is off-grid (i.e. not aligned to the 90° rotation group).

V2 replaces V1's DINOv2-large + full fine-tune with VGGT-1B + LoRA r=64, and trains jointly on a 1.53M-sample mix of artstation USDZ, character, and 3d66 indoor-scene renders. Strict accuracy improves by about 7pp on 3d66 over V1; the strict gains on artstation and character are under 1pp (see the Δ table below).

Backbone: VGGT-1B aggregator, initialized from the Orient Anything V2 ckpt (Viglong/OriAnyV2_ckpt), wrapped with LoRA (rank 64, alpha 64) adapters on attn.{qkv,proj} + mlp.{fc1,fc2} — base weights are frozen.
Pool: LayerNorm(CLS + mean(patch_tokens)) over the last aggregator layer.
Heads:
- 24-way classification over the cubic rotation group (SO(3) axis-aligned subgroup).
- 1-dim off-grid probability (BCE).
Image size: 518×518 (VGGT native, no ImageNet mean/std normalization).
Training data: 1.53M renders, Blender CYCLES @ 518, balanced across artstation USDZ, character, and 3d66. Training uses orbit-aware label smoothing (α=0.1, share=0.8).
Best checkpoint: epoch 4 of 5 (best.pt, score=0.8689).
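
The pooling and two-head layout listed above can be sketched at the shape level as follows. This is a minimal NumPy illustration, not the released implementation: the token layout (index 0 as the CLS token, the rest as patch tokens), the 14-pixel patch size behind the 37×37 grid, the feature width `D`, the random weights, and the names `layer_norm` and `pool_and_heads` are all assumptions made for the example. LayerNorm's learned scale and shift are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """LayerNorm over the last axis, without learned scale/shift (simplification)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pool_and_heads(tokens, w_cls, b_cls, w_off, b_off):
    """tokens: (N, D) last-layer tokens; index 0 is ASSUMED to be the CLS token."""
    cls_tok = tokens[0]
    patch_mean = tokens[1:].mean(axis=0)
    feat = layer_norm(cls_tok + patch_mean)      # pooled feature, shape (D,)
    logits = feat @ w_cls + b_cls                # 24-way rotation head
    off_logit = feat @ w_off + b_off             # 1-dim off-grid head
    probs = np.exp(logits - logits.max())        # numerically stable softmax
    probs /= probs.sum()
    off_prob = 1.0 / (1.0 + np.exp(-off_logit[0]))  # sigmoid
    return probs, float(off_prob)

rng = np.random.default_rng(0)
D = 64                                           # hypothetical width, for illustration only
tokens = rng.normal(size=(1 + 37 * 37, D))       # 518 / 14 = 37 patches per side (assumed)
probs, off_prob = pool_and_heads(
    tokens,
    rng.normal(size=(D, 24)) * 0.02, np.zeros(24),
    rng.normal(size=(D, 1)) * 0.02, np.zeros(1),
)
```

The class prediction is `probs.argmax()`, and `off_prob` is compared against a threshold to set the off-grid flag; the threshold used by the released predictor is not stated in this card.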

Evaluation (val split, full `labels_balanced_v2`)

Dataset	Strict cls_acc	Orbit-sum acc	Top-k inferred	Off-head acc
character	92.89%	97.24%	93.39%	99.02%
artstation	74.59%	95.04%	79.29%	99.02%
3d66	86.51%	98.12%	91.15%	99.02%

Strict cls_acc: argmax over 24 classes equals the GT class (single-GT, authoritative for characters where facial asymmetry matters).
Orbit-sum acc: sum of softmax probability over the GT's full octahedral orbit (z180/y180/z90). This is the authoritative metric for artstation/3d66, where symmetric objects have partner classes that differ from the GT label but are visually equivalent.
Top-k inferred: argmax==GT OR (top1/top2 ratio≥0.5 AND top1/top2 ∈ orbit partners).
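
The three per-sample quantities defined above can be sketched as below. This is one possible reading rather than the evaluation script itself: the ratio test is interpreted here as p(top2) / p(top1) ≥ 0.5, the orbit condition as "both top1 and top2 lie in the GT's orbit", and the function name `orient_metrics` and the `orbit` argument are hypothetical. How the per-sample orbit mass is aggregated into the reported accuracy is not stated in the card, so the sketch returns the raw mass.

```python
import numpy as np

def orient_metrics(probs, gt, orbit, ratio_thresh=0.5):
    """probs: (24,) softmax output; gt: GT class id;
    orbit: set of class ids equivalent to gt, including gt itself (assumed input)."""
    order = np.argsort(probs)[::-1]
    top1, top2 = int(order[0]), int(order[1])
    strict = top1 == gt                                   # argmax equals the single GT
    orbit_sum = float(sum(probs[c] for c in orbit))       # probability mass on the orbit
    close = bool(probs[top2] / probs[top1] >= ratio_thresh)
    topk = strict or (close and top1 in orbit and top2 in orbit)
    return strict, orbit_sum, topk

# A symmetric object: the model splits its mass between the GT (5) and a partner (3).
probs = np.full(24, 0.1 / 22)
probs[3], probs[5] = 0.5, 0.4
strict, orbit_sum, topk = orient_metrics(probs, gt=5, orbit={3, 5})
```

In this example the strict check fails because the argmax is the partner class, while the orbit mass is about 0.9 and the top-k rule accepts the prediction.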

Δ vs. V1 (DINOv2-large full fine-tune)

Dataset	V1 strict	V2 strict	Δ
character	92.33%	92.89%	+0.56pp
artstation	74.16%	74.59%	+0.43pp
3d66	~80%	86.51%	+7pp

V2 wins decisively on artstation orbit (95.04%) and 3d66 (98.12% orbit / +7pp strict) — the cross-domain bottleneck of V1.

Quick start

pip install -r requirements.txt
# Includes: vggt @ git+https://github.com/facebookresearch/vggt.git

from orient90_v2 import OrientPredictor

# Load model
p = OrientPredictor("best.pt", class_map_path="class_map.json", device="cuda")

# Direct image input — must be a Blender-CYCLES-style render (518×518)
result = p.predict_image("examples/sample_render.png")
print(result)
# {
#   "class_id": 0,
#   "corr_x": 0, "corr_y": 0, "corr_z": 0,
#   "R": [[1,0,0],[0,1,0],[0,0,1]],
#   "confidence": 0.94,
#   "off_grid": False,
#   "off_grid_prob": 0.02
# }

# 3D model input — auto-renders via Blender if sibling .png is missing
result = p.predict_model("foo.glb", render_gpu="auto")

Class map → rotation matrix

The 24 classes index the proper rotation group of the cube (the chiral octahedral group, order 24). Each class entry in class_map.json provides:

{
  "id": 7,
  "euler_xyz": [0, 0, 270],       // Apply Rz(-90°) to correct the input
  "matrix": [[...], [...], [...]] // 3×3 rotation matrix (canonical = matrix @ input)
}
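
As a sanity check on the class count, the 24 rotations can be enumerated from Euler angles restricted to multiples of 90°. The sketch below is illustrative only: the composition order `Rz @ Ry @ Rx` is an assumption (the card does not state its Euler convention), the helper names are hypothetical, and the resulting order does not necessarily match the ids in class_map.json.

```python
import itertools
import numpy as np

def axis_rot(axis, deg):
    """Exact integer rotation matrix about one axis for a multiple of 90 degrees."""
    c = int(round(np.cos(np.radians(deg))))
    s = int(round(np.sin(np.radians(deg))))
    if axis == "x":
        return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
    if axis == "y":
        return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def euler_to_matrix(ex, ey, ez):
    # ASSUMED convention: R = Rz @ Ry @ Rx; class_map.json may use a different one.
    return axis_rot("z", ez) @ axis_rot("y", ey) @ axis_rot("x", ex)

# 4^3 = 64 Euler triples collapse onto 24 distinct matrices.
unique = {}
for e in itertools.product((0, 90, 180, 270), repeat=3):
    key = tuple(euler_to_matrix(*e).flatten().tolist())
    unique.setdefault(key, e)        # keep the first Euler triple seen per matrix

rotations = [np.array(k).reshape(3, 3) for k in unique]
```

Several Euler triples map to the same matrix, which is why a class map that stores the matrix alongside the angles is less ambiguous than the angles alone. The same helpers also confirm that a 270° rotation about Z equals Rz(-90°), as noted in the example entry.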

To rotate the input mesh back to canonical:

import numpy as np
import trimesh

R = np.array(result["R"])                      # 3×3 in our internal Z-up frame
# glTF/USD store Y-up vertices → conjugate before applying:
M_yz = trimesh.transformations.rotation_matrix(np.pi/2, [1,0,0])  # 4×4, +90° about X
R_yup = M_yz[:3,:3].T @ R @ M_yz[:3,:3]
T = np.eye(4); T[:3,:3] = R_yup                # embed the rotation in a 4×4 transform
scene = trimesh.load("foo.glb", force="scene") # force="scene" returns a trimesh.Scene
scene.apply_transform(T)
scene.export("foo_canonical.glb")

Training recipe

VGGT-1B base (frozen) + LoRA r=64, α=64 on attn.qkv/proj + mlp.fc1/fc2
loss = CE(cls, GT) + λ_off × BCE(off, GT_offgrid)    λ_off = 1.0
optimizer = AdamW, lr_backbone=5e-4, lr_head=1e-3, wd=1e-4, no_wd_on_lora
batch_size = 12 × 7 GPUs (effective 84), warmup_ratio = 0.05
label_smoothing = 0.1 with orbit-aware smoothing (orbit_share=0.8)
epochs = 5, bf16, grad_clip = 1.0
val = max_val_samples=5000 random subset (training-time eval; full-val results are in the Evaluation section above)
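
One plausible form of the orbit-aware smoothing in the recipe above is sketched here. The card gives only the hyperparameters (0.1 and orbit_share=0.8), so the exact split is an assumption: the GT keeps 1 − α, a fraction `orbit_share` of α is spread evenly over the GT's orbit partners, and the remainder is spread evenly over all other classes. The function names are hypothetical.

```python
import numpy as np

def orbit_smoothed_target(gt, partners, num_classes=24, alpha=0.1, orbit_share=0.8):
    """Soft target for one sample. ASSUMED scheme, see the lead-in.
    partners: class ids in the GT's orbit (a proper subset of the classes)."""
    partners = [c for c in partners if c != gt]
    others = [c for c in range(num_classes) if c != gt and c not in partners]
    target = np.zeros(num_classes)
    target[gt] = 1.0 - alpha
    if partners:
        target[partners] = alpha * orbit_share / len(partners)
        target[others] = alpha * (1.0 - orbit_share) / len(others)
    else:
        target[others] = alpha / len(others)     # no partners: plain label smoothing
    return target

def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target and softmax(logits)."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-(target * log_probs).sum())

target = orbit_smoothed_target(gt=0, partners=[1, 2, 3])
loss = soft_cross_entropy(np.zeros(24), target)  # uniform prediction
```

Under this reading, a prediction that lands on an orbit partner is penalized less than one that lands on an unrelated class, which matches the motivation for the orbit-sum metric.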

Compatibility / known issues

VGGT requires pip install git+https://github.com/facebookresearch/vggt.git (not on PyPI as of 2026-05).
Image preprocessing has no ImageNet mean/std normalization — VGGT's aggregator was trained on raw [0,1] tensors.
predict_model() needs a Blender env with bpy 4.4 installed. The bundled blender_scripts/blender_render_preview.py matches the training render pipeline.
The 2-head ckpt (this V2) does not produce sub-grid offsets. For sub-degree refinement, pair with the V4 mesh-only post-process (feat_iter_qLcont) — see the project repo's docs/reports/phase_b2_mesh_postprocess_v4.md.
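
The preprocessing note above (raw [0,1] tensors, no ImageNet mean/std) amounts to something like the following. This is a minimal sketch assuming the render is already a 518×518 RGB uint8 array; the function name `to_model_input` is hypothetical and is not part of the `orient90_v2` package.

```python
import numpy as np

def to_model_input(rgb_uint8):
    """(518, 518, 3) uint8 render -> (1, 3, 518, 518) float32 in [0, 1].
    Deliberately applies NO mean/std normalization."""
    if rgb_uint8.shape != (518, 518, 3):
        raise ValueError(f"expected a 518x518 RGB image, got {rgb_uint8.shape}")
    x = rgb_uint8.astype(np.float32) / 255.0     # scale to [0, 1]
    return np.transpose(x, (2, 0, 1))[None]      # HWC -> CHW, add batch axis

dummy = np.random.default_rng(0).integers(0, 256, size=(518, 518, 3), dtype=np.uint8)
batch = to_model_input(dummy)
```

Feeding an ImageNet-normalized tensor instead would shift the input distribution away from what the backbone saw in training, so a generic image-classification transform should not be reused here.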

Citations / related

Backbone: VGGT (Visual Geometry Grounded Transformer), Wang et al., 2025. https://github.com/facebookresearch/vggt
Backbone init: Orient Anything V2, NeurIPS'25 spotlight, Viglong/OriAnyV2_ckpt on Hugging Face.
V1 baseline (DINOv2 full fine-tune): noahdudu/orient90-v1.

License

Apache-2.0 (see LICENSE).