---
language:
- en
license: apache-2.0
library_name: pytorch
tags:
- 3d
- orientation
- rotation-correction
- vggt
- lora
- image-classification
pipeline_tag: image-classification
base_model: facebookresearch/vggt
---
# Orient90 V2 — Mixed-Domain VGGT-1B + LoRA + 2-head
Discrete 3D orientation-correction model. Given a single RGB render of a 3D
asset, predict the Euler rotation `(corr_x, corr_y, corr_z)` (all multiples of
90°) that rotates the asset back to its canonical upright orientation, plus a
binary flag indicating whether the input is **off-grid** (i.e. not aligned to
the 90° rotation group).
V2 replaces V1's DINOv2-large + full fine-tune with **VGGT-1B + LoRA r=64**
and trains jointly on a 1.53M-sample mix of artstation USDZ, character, and 3d66
indoor-scene renders. Against V1, strict accuracy rises by about **+7pp on 3d66**,
while artstation and character strict accuracy improve by roughly 0.5pp each
(see the Δ table under Evaluation).
- **Backbone**: VGGT-1B aggregator, initialized from the Orient Anything V2
ckpt (`Viglong/OriAnyV2_ckpt`), wrapped with LoRA (rank 64, alpha 64)
adapters on `attn.{qkv,proj}` + `mlp.{fc1,fc2}` — base weights are frozen.
- **Pool**: `LayerNorm(CLS + mean(patch_tokens))` over the last aggregator layer.
- **Heads**:
  - 24-way classification over the cubic rotation group (`SO(3)` axis-aligned subgroup).
  - 1-dim off-grid probability (BCE).
- **Image size**: 518×518 (VGGT native, no ImageNet mean/std normalization).
- **Training data**: 1.53M renders (Blender CYCLES @ 518), balanced across
artstation USDZ, character, and 3d66. Training uses orbit-aware label smoothing
(α=0.1, share=0.8).
- **Best checkpoint**: **epoch 4** of 5 (`best.pt`, score=0.8689).
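The pool and two heads listed above can be sketched at the shape level as follows. This is a NumPy illustration, not the shipped module: it assumes the aggregator returns tokens of shape `(batch, 1 + num_patches, dim)` with the CLS token at index 0, uses a LayerNorm without learnable scale/shift, and uses random weights and a small `dim` purely for demonstration. The names `pool_and_heads`, `w_cls`, and `w_off` are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """LayerNorm over the last axis (learnable affine parameters omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pool_and_heads(tokens, w_cls, b_cls, w_off, b_off):
    """tokens: (B, 1 + P, D). Returns 24-way logits and an off-grid probability."""
    cls_token = tokens[:, 0, :]                 # assumed CLS position
    patch_mean = tokens[:, 1:, :].mean(axis=1)  # mean over patch tokens
    feat = layer_norm(cls_token + patch_mean)   # LayerNorm(CLS + mean(patch_tokens))
    logits = feat @ w_cls + b_cls               # (B, 24) rotation-class logits
    off_logit = feat @ w_off + b_off            # (B, 1) off-grid logit
    off_prob = 1.0 / (1.0 + np.exp(-off_logit)) # sigmoid
    return logits, off_prob[:, 0]

rng = np.random.default_rng(0)
B, P, D = 2, 37 * 37, 64   # 37 = 518 / 14, assuming 14-pixel patches; D is a demo size
tokens = rng.normal(size=(B, 1 + P, D))
w_cls, b_cls = rng.normal(size=(D, 24)) * 0.02, np.zeros(24)
w_off, b_off = rng.normal(size=(D, 1)) * 0.02, np.zeros(1)

logits, off_prob = pool_and_heads(tokens, w_cls, b_cls, w_off, b_off)
print(logits.shape, off_prob.shape)
```

At inference time the predicted class would be `logits.argmax(axis=1)`, and the off-grid flag would come from thresholding `off_prob`; the threshold used by the released predictor is not stated here.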
## Evaluation (val split, full `labels_balanced_v2`)
| Dataset | Strict cls_acc | Orbit-sum acc | Top-k inferred | Off-head acc |
|------------|---------------:|--------------:|---------------:|-------------:|
| character | **92.89%** | 97.24% | 93.39% | 99.02% |
| artstation | 74.59% | **95.04%** | 79.29% | 99.02% |
| 3d66 | 86.51% | **98.12%** | 91.15% | 99.02% |
- **Strict cls_acc**: argmax over 24 classes equals the GT class (single-GT,
authoritative for characters where facial asymmetry matters).
- **Orbit-sum acc**: sum of softmax probability over the GT's full octahedral
orbit (z180/y180/z90). Authoritative for artstation/3d66, where symmetric objects
have GT-ambiguous-but-equivalent partner classes.
- **Top-k inferred**: argmax==GT OR (top2/top1 probability ratio ≥ 0.5 AND
top-1 and top-2 ∈ orbit partners).
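To make the first two metric definitions concrete, here is a minimal sketch. It assumes the orbit partners are available as a boolean `(24, 24)` matrix `orbit_mask`, where row `g` marks every class in the orbit of `g` (including `g` itself); that matrix and the pairing of classes 0 and 1 below are hypothetical. The card does not say how the per-sample orbit mass is reduced to a percentage, so this sketch simply averages it.

```python
import numpy as np

def strict_and_orbit_sum(probs, gt, orbit_mask):
    """probs: (N, 24) softmax outputs, gt: (N,) int labels, orbit_mask: (24, 24) bool."""
    pred = probs.argmax(axis=1)
    strict = float((pred == gt).mean())                 # argmax equals the single GT class
    orbit_mass = (probs * orbit_mask[gt]).sum(axis=1)   # probability mass on the GT's orbit
    return strict, float(orbit_mass.mean())             # assumed reduction: mean mass

# Toy data: classes 0 and 1 are treated as orbit partners (illustrative only).
orbit_mask = np.eye(24, dtype=bool)
orbit_mask[0, 1] = orbit_mask[1, 0] = True

probs = np.zeros((2, 24))
probs[0, 0], probs[0, 1] = 0.4, 0.6   # GT is 0, but the partner class 1 wins the argmax
probs[1, 5] = 1.0                     # GT is 5, predicted with full confidence
gt = np.array([0, 5])

strict, orbit = strict_and_orbit_sum(probs, gt, orbit_mask)
print(strict, orbit)
```

The first sample shows why the two columns diverge on symmetric assets: it counts as a strict miss, yet all of its probability sits inside the GT's orbit.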
### Δ vs. V1 (DINOv2-large full fine-tune)
| Dataset | V1 strict | V2 strict | Δ |
|------------|----------:|----------:|------:|
| character | 92.33% | **92.89%** | +0.56pp |
| artstation | 74.16% | **74.59%** | +0.43pp |
| 3d66 | ~80% | **86.51%** | +7pp |
V2 wins decisively on **artstation orbit (95.04%)** and **3d66 (98.12% orbit /
+7pp strict)** — the cross-domain bottleneck of V1.
## Quick start
```bash
pip install -r requirements.txt
# Includes: vggt @ git+https://github.com/facebookresearch/vggt.git
```
```python
from orient90_v2 import OrientPredictor
# Load model
p = OrientPredictor("best.pt", class_map_path="class_map.json", device="cuda")
# Direct image input — must be a Blender-CYCLES-style render (518×518)
result = p.predict_image("examples/sample_render.png")
print(result)
# {
# "class_id": 0,
# "corr_x": 0, "corr_y": 0, "corr_z": 0,
# "R": [[1,0,0],[0,1,0],[0,0,1]],
# "confidence": 0.94,
# "off_grid": False,
# "off_grid_prob": 0.02
# }
# 3D model input — auto-renders via Blender if sibling .png is missing
result = p.predict_model("foo.glb", render_gpu="auto")
```
### Class map → rotation matrix
The 24 classes index the proper rotations of the cube (the chiral octahedral
group, order 24). Each class entry in `class_map.json` provides:
```jsonc
{
"id": 7,
"euler_xyz": [0, 0, 270], // Apply Rz(-90°) to correct the input
"matrix": [[...], [...], [...]] // 3×3 rotation matrix (canonical = matrix @ input)
}
```
To rotate the input mesh back to canonical:
```python
import numpy as np
import trimesh
R = np.array(result["R"]) # 3×3 in our internal Z-up frame
# glTF/USD store Y-up vertices → conjugate before applying:
M_yz = trimesh.transformations.rotation_matrix(np.pi/2, [1,0,0])
R_yup = M_yz[:3,:3].T @ R @ M_yz[:3,:3]
T = np.eye(4); T[:3,:3] = R_yup
mesh = trimesh.load("foo.glb", force="scene")
mesh.apply_transform(T)
mesh.export("foo_canonical.glb")
```
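The conjugation step above can be checked numerically without loading a mesh. The sketch below assumes `M` is the standard right-handed `Rx(90°)`, which is what the 3×3 block of `M_yz` should contain, and verifies that a 90° turn about the up axis in the Z-up frame becomes a 90° turn about the up axis in the Y-up frame.

```python
import numpy as np

def rot_x(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

M = rot_x(np.pi / 2)            # maps Y-up coordinates into the Z-up frame (+Y -> +Z)
R_zup = rot_z(np.pi / 2)        # rotation about the up axis, expressed in Z-up
R_yup = M.T @ R_zup @ M         # same rotation, expressed in Y-up

# In the Y-up frame the up axis is Y, so the result should be Ry(90 degrees).
matches = np.allclose(R_yup, rot_y(np.pi / 2))
print(matches)  # True
```

The order matters: swapping `M` and `M.T` would convert in the opposite direction and produce a rotation about the wrong axis for Y-up assets.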
## Training recipe
```
VGGT-1B base (frozen) + LoRA r=64, α=64 on attn.qkv/proj + mlp.fc1/fc2
loss = CE(cls, GT) + λ_off × BCE(off, GT_offgrid) λ_off = 1.0
optimizer = AdamW, lr_backbone=5e-4, lr_head=1e-3, wd=1e-4, no_wd_on_lora
batch_size = 12 × 7 GPUs (effective 84), warmup_ratio = 0.05
label_smoothing = 0.1 with orbit-aware smoothing (orbit_share=0.8)
epochs = 5, bf16, grad_clip = 1.0
val = max_val_samples=5000 random subset (training-time eval; full-val results are in the Evaluation section above)
```
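One plausible reading of the loss line and the orbit-aware smoothing is sketched below. The exact split is an assumption: here the GT class keeps `1 - α`, the GT's orbit partners share `α × orbit_share` equally, and all remaining classes share the rest. The function names and the partner list `[3, 12]` are hypothetical; the training code may distribute the mass differently.

```python
import numpy as np

def orbit_smoothed_target(gt, partners, num_classes=24, alpha=0.1, orbit_share=0.8):
    """Soft target: 1 - alpha on GT, alpha split between orbit partners and the rest."""
    target = np.zeros(num_classes)
    partners = [c for c in partners if c != gt]
    others = [c for c in range(num_classes) if c != gt and c not in partners]
    if partners and others:
        target[partners] = alpha * orbit_share / len(partners)
        target[others] = alpha * (1.0 - orbit_share) / len(others)
    else:
        rest = partners or others          # only one group exists: give it all of alpha
        if rest:
            target[rest] = alpha / len(rest)
    target[gt] = 1.0 - alpha
    return target

def joint_loss(logits, target, off_logit, off_label, lambda_off=1.0):
    """Soft-target cross-entropy plus lambda_off times BCE-with-logits."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())       # stable log-softmax
    ce = -(target * log_probs).sum()
    bce = np.logaddexp(0.0, off_logit) - off_label * off_logit  # stable BCE on a logit
    return float(ce + lambda_off * bce)

target = orbit_smoothed_target(gt=7, partners=[3, 12])
loss = joint_loss(np.zeros(24), target, off_logit=0.0, off_label=0.0)
print(target[7], target[3], round(loss, 4))
```

With uniform logits and a zero off-grid logit, the loss reduces to `log(24) + log(2)`, which is a convenient value to check an implementation against before training.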
## Compatibility / known issues
- **VGGT requires `pip install git+https://github.com/facebookresearch/vggt.git`**
(not on PyPI as of 2026-05).
- Image preprocessing has **no ImageNet mean/std** normalization — VGGT's
aggregator was trained on raw [0,1] tensors.
- `predict_model()` needs a Blender env with `bpy` 4.4 installed. The bundled
`blender_scripts/blender_render_preview.py` matches the training render pipeline.
- The 2-head ckpt (this V2) **does not produce sub-grid offsets**. For sub-degree
refinement, pair with the V4 mesh-only post-process (`feat_iter_qLcont`) — see
the project repo's `docs/reports/phase_b2_mesh_postprocess_v4.md`.
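For callers who bypass `OrientPredictor` and feed tensors directly, the normalization note above amounts to scaling pixels to [0, 1] and stopping there. The sketch below assumes the render is already a 518×518 RGB `uint8` array and that the model expects a channel-first batch; the function name `to_model_input` and the output layout are assumptions, not part of the released package.

```python
import numpy as np

def to_model_input(image_u8, size=518):
    """uint8 (H, W, 3) render -> float32 (1, 3, H, W) in [0, 1], no mean/std step."""
    if image_u8.shape != (size, size, 3):
        raise ValueError(f"expected a {size}x{size} RGB image, got {image_u8.shape}")
    x = image_u8.astype(np.float32) / 255.0   # raw [0, 1] range, as the aggregator expects
    return np.transpose(x, (2, 0, 1))[None]   # HWC -> CHW, then add a batch axis

# Stand-in for a loaded render (random pixels, for shape checking only).
rng = np.random.default_rng(0)
render = rng.integers(0, 256, size=(518, 518, 3), dtype=np.uint8)
batch = to_model_input(render)
print(batch.shape, batch.dtype)
```

Applying ImageNet mean/std on top of this would shift the inputs away from the range the backbone was trained on.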
## Citations / related
- Backbone: VGGT (Visual Geometry Grounded Transformer), Wang et al., CVPR 2025.
https://github.com/facebookresearch/vggt
- Backbone init: Orient Anything V2, NeurIPS'25 spotlight,
`Viglong/OriAnyV2_ckpt` on Hugging Face.
- V1 baseline (DINOv2 full fine-tune): `noahdudu/orient90-v1`.
## License
Apache-2.0 (see `LICENSE`).