	---
	language:
	- en
	license: apache-2.0
	library_name: pytorch
	tags:
	- 3d
	- orientation
	- rotation-correction
	- vggt
	- lora
	- image-classification
	pipeline_tag: image-classification
	base_model: facebookresearch/vggt
	---

	# Orient90 V2 — Mixed-Domain VGGT-1B + LoRA + 2-head

	Discrete 3D orientation-correction model. Given a single RGB render of a 3D
	asset, predict the Euler rotation `(corr_x, corr_y, corr_z)` (all multiples of
	90°) that rotates the asset back to its canonical upright orientation, plus a
	binary flag indicating whether the input is off-grid (i.e. not aligned to
	the 90° rotation group).

	V2 replaces V1's DINOv2-large full fine-tune with VGGT-1B + LoRA (r=64) and
	trains jointly on a 1.53M-sample mix of artstation USDZ, character, and 3d66
	indoor-scene renders. Strict accuracy improves by about 7pp on 3d66 and by
	under 1pp on artstation and character vs. V1 (see the Δ table in the Evaluation section).

	- Backbone: VGGT-1B aggregator, initialized from the Orient Anything V2
	ckpt (`Viglong/OriAnyV2_ckpt`), wrapped with LoRA (rank 64, alpha 64)
	adapters on `attn.{qkv,proj}` + `mlp.{fc1,fc2}` — base weights are frozen.
	- Pool: `LayerNorm(CLS + mean(patch_tokens))` over the last aggregator layer.
	- Heads:
	  - 24-way classification over the cubic rotation group (`SO(3)` axis-aligned subgroup).
	  - 1-dim off-grid probability (BCE).
	- Image size: 518×518 (VGGT native, no ImageNet mean/std normalization).
	- Training data: 1.53M renders (Blender CYCLES at 518×518), balanced across
	artstation USDZ, character, and 3d66; trained with orbit-aware label smoothing (α=0.1, share=0.8).
	- Best checkpoint: epoch 4 of 5 (`best.pt`, score=0.8689).
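The pooling step listed above can be sketched without any framework. This is a minimal illustration for a single image, assuming a LayerNorm without learned affine parameters and an epsilon of 1e-5; `pool_tokens` is a hypothetical helper name, and the actual model presumably works on batched PyTorch tensors.

```python
import math

def pool_tokens(cls_token, patch_tokens, eps=1e-5):
    """Sketch of LayerNorm(CLS + mean(patch_tokens)) for one image (no affine)."""
    dim = len(cls_token)
    n = len(patch_tokens)
    # Average the patch tokens per feature dimension.
    patch_mean = [sum(tok[d] for tok in patch_tokens) / n for d in range(dim)]
    # Add the CLS token to the averaged patch features.
    summed = [c + m for c, m in zip(cls_token, patch_mean)]
    # Normalize over the feature dimension (biased variance, as LayerNorm does).
    mu = sum(summed) / dim
    var = sum((v - mu) ** 2 for v in summed) / dim
    return [(v - mu) / math.sqrt(var + eps) for v in summed]

# Toy example: one 4-dim CLS token and two 4-dim patch tokens.
pooled = pool_tokens(
    [1.0, 0.0, -1.0, 2.0],
    [[0.5, 0.5, 0.5, 0.5], [1.5, -0.5, 0.5, 2.5]],
)
```

The pooled vector would then feed both heads: a 24-way linear classifier and a 1-dim off-grid logit.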

	## Evaluation (val split, full `labels_balanced_v2`)

	| Dataset    | Strict cls_acc | Orbit-sum acc | Top-k inferred | Off-head acc |
	|------------|---------------:|--------------:|---------------:|-------------:|
	| character  | 92.89% | 97.24% | 93.39% | 99.02% |
	| artstation | 74.59% | 95.04% | 79.29% | 99.02% |
	| 3d66       | 86.51% | 98.12% | 91.15% | 99.02% |

	- Strict cls_acc: argmax over 24 classes equals the GT class (single-GT,
	authoritative for characters where facial asymmetry matters).
	- Orbit-sum acc: sum of softmax probability over the GT's full octahedral
	orbit (z180/y180/z90); authoritative for artstation/3d66, where symmetric objects
	have GT-ambiguous-but-equivalent partner classes.
	- Top-k inferred: argmax == GT, OR (top1/top2 ratio ≥ 0.5 AND top1 and top2 are
	both orbit partners).
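The three classification metrics can be sketched per sample as follows. This is an illustration under stated assumptions, not the evaluation script: the orbit is passed in as a set of class ids that includes the GT, the ratio test is read as `p(top2) / p(top1)` (the other direction would always be ≥ 1), and all function names are hypothetical. How the orbit mass is turned into a per-sample hit is not specified in this card, so the sketch only returns the mass.

```python
def strict_correct(probs, gt):
    """Strict cls_acc: the argmax class must equal the GT class."""
    top1 = max(range(len(probs)), key=probs.__getitem__)
    return top1 == gt

def orbit_mass(probs, orbit):
    """Orbit-sum: total softmax probability on the GT's orbit (GT included)."""
    return sum(probs[c] for c in orbit)

def topk_inferred_correct(probs, gt, orbit, min_ratio=0.5):
    """Top-k inferred: argmax hit, or top1/top2 both in the orbit and close."""
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    top1, top2 = ranked[0], ranked[1]
    if top1 == gt:
        return True
    # Assumption: the ratio is p(top2) / p(top1), which lies in (0, 1].
    ratio = probs[top2] / probs[top1]
    return ratio >= min_ratio and top1 in orbit and top2 in orbit

# Toy 4-class example: the GT is class 0, and class 2 is its symmetric partner.
probs = [0.35, 0.05, 0.55, 0.05]
orbit = {0, 2}
strict = strict_correct(probs, gt=0)
mass = orbit_mass(probs, orbit)
inferred = topk_inferred_correct(probs, gt=0, orbit=orbit)
```

In the toy example the strict metric misses (the model picked the partner class), while the orbit-aware metrics credit the prediction.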

	### Δ vs. V1 (DINOv2-large full fine-tune)

	| Dataset    | V1 strict | V2 strict | Δ       |
	|------------|----------:|----------:|--------:|
	| character  | 92.33% | 92.89% | +0.56pp |
	| artstation | 74.16% | 74.59% | +0.43pp |
	| 3d66       | ~80%   | 86.51% | +7pp    |

	V2 wins decisively on artstation orbit (95.04%) and **3d66 (98.12% orbit /
	+7pp strict)** — the cross-domain bottleneck of V1.

	## Quick start

	```bash
	pip install -r requirements.txt
	# Includes: vggt @ git+https://github.com/facebookresearch/vggt.git
	```

	```python
	from orient90_v2 import OrientPredictor

	# Load model
	p = OrientPredictor("best.pt", class_map_path="class_map.json", device="cuda")

	# Direct image input — must be a Blender-CYCLES-style render (518×518)
	result = p.predict_image("examples/sample_render.png")
	print(result)
	# {
	# "class_id": 0,
	# "corr_x": 0, "corr_y": 0, "corr_z": 0,
	# "R": [[1,0,0],[0,1,0],[0,0,1]],
	# "confidence": 0.94,
	# "off_grid": False,
	# "off_grid_prob": 0.02
	# }

	# 3D model input — auto-renders via Blender if sibling .png is missing
	result = p.predict_model("foo.glb", render_gpu="auto")
	```
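One way to act on the returned dictionary is to skip the correction when the off-grid flag is set or the class confidence is low. The helper below is a hypothetical sketch: the key names come from the example output above, while `correction_to_apply` and the 0.5 threshold are assumptions chosen for illustration.

```python
def correction_to_apply(result, min_confidence=0.5):
    """Return (corr_x, corr_y, corr_z) in degrees, or None if untrusted."""
    # An off-grid input is not aligned to any of the 24 classes.
    if result["off_grid"]:
        return None
    # Assumed threshold; tune it on your own data.
    if result["confidence"] < min_confidence:
        return None
    return (result["corr_x"], result["corr_y"], result["corr_z"])

# Same fields as the sample output shown above.
example = {
    "class_id": 0,
    "corr_x": 0, "corr_y": 0, "corr_z": 0,
    "R": [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
    "confidence": 0.94,
    "off_grid": False,
    "off_grid_prob": 0.02,
}
correction = correction_to_apply(example)
```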

	### Class map → rotation matrix

	The 24 classes index the rotation group of the cube (the chiral octahedral
	group, order 24). Each class entry in `class_map.json` provides:

	```jsonc
	{
	"id": 7,
	"euler_xyz": [0, 0, 270], // Apply Rz(-90°) to correct the input
	"matrix": [[...], [...], [...]] // 3×3 rotation matrix (canonical = matrix @ input)
	}
	```
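To see why there are exactly 24 classes, the sketch below enumerates every Euler triple made of multiples of 90° and counts the distinct rotation matrices. The composition order `Rz @ Ry @ Rx` is an assumption (this card does not state the convention), and the enumeration order does not reproduce the ids in `class_map.json`; the count of 24 holds for any order.

```python
from itertools import product

# Exact integer cos/sin for multiples of 90 degrees.
COS = {0: 1, 90: 0, 180: -1, 270: 0}
SIN = {0: 0, 90: 1, 180: 0, 270: -1}

def matmul(a, b):
    """3x3 integer matrix product, returned as nested tuples (hashable)."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3))
        for i in range(3)
    )

def rot_x(deg):
    c, s = COS[deg], SIN[deg]
    return ((1, 0, 0), (0, c, -s), (0, s, c))

def rot_y(deg):
    c, s = COS[deg], SIN[deg]
    return ((c, 0, s), (0, 1, 0), (-s, 0, c))

def rot_z(deg):
    c, s = COS[deg], SIN[deg]
    return ((c, -s, 0), (s, c, 0), (0, 0, 1))

def euler_to_matrix(x, y, z):
    # Assumed convention: R = Rz @ Ry @ Rx.
    return matmul(rot_z(z), matmul(rot_y(y), rot_x(x)))

# 4^3 = 64 Euler triples collapse onto the distinct cube rotations.
unique = {}
for x, y, z in product((0, 90, 180, 270), repeat=3):
    unique.setdefault(euler_to_matrix(x, y, z), (x, y, z))

print(len(unique))  # 24
```

Several Euler triples map to the same matrix, which is why the matrix is the unambiguous representation of a class.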

	To rotate the input mesh back to canonical:

	```python
	import numpy as np
	import trimesh
	R = np.array(result["R"]) # 3×3 in our internal Z-up frame
	# glTF/USD store Y-up vertices → conjugate before applying:
	M_yz = trimesh.transformations.rotation_matrix(np.pi/2, [1,0,0])
	R_yup = M_yz[:3,:3].T @ R @ M_yz[:3,:3]
	T = np.eye(4); T[:3,:3] = R_yup
	mesh = trimesh.load("foo.glb", force="scene")
	mesh.apply_transform(T)
	mesh.export("foo_canonical.glb")
	```
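The frame conjugation above can be checked by hand with exact integer matrices. The sketch below uses the same `M_yz` (a +90° rotation about X) and shows that a 90° rotation about Z in the Z-up frame becomes a 90° rotation about Y in the Y-up frame. The helper names are illustrative only.

```python
def matmul(a, b):
    """3x3 matrix product on nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]

# +90 degrees about X: maps the Y axis onto the Z axis (Y-up -> Z-up).
M_yz = [[1, 0, 0],
        [0, 0, -1],
        [0, 1, 0]]

# 90 degrees about Z, expressed in the Z-up frame.
R_zup = [[0, -1, 0],
         [1, 0, 0],
         [0, 0, 1]]

# Same conjugation as in the snippet above: M^T @ R @ M.
R_yup = matmul(transpose(M_yz), matmul(R_zup, M_yz))

# 90 degrees about Y, which is the "up" axis in glTF/USD.
R_y90 = [[0, 0, 1],
         [0, 1, 0],
         [-1, 0, 0]]
print(R_yup == R_y90)  # True
```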

	## Training recipe

	```
	VGGT-1B base (frozen) + LoRA r=64, α=64 on attn.qkv/proj + mlp.fc1/fc2
	loss = CE(cls, GT) + λ_off × BCE(off, GT_offgrid) λ_off = 1.0
	optimizer = AdamW, lr_backbone=5e-4, lr_head=1e-3, wd=1e-4, no_wd_on_lora
	batch_size = 12 × 7 GPUs (effective 84), warmup_ratio = 0.05
	label_smoothing = 0.1 with orbit-aware smoothing (orbit_share=0.8)
	epochs = 5, bf16, grad_clip = 1.0
	val = max_val_samples=5000 random subset (training-time eval; full-val numbers are in the Evaluation section)
	```
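The loss line and the orbit-aware smoothing can be sketched in plain Python. The split used here is an assumption about what "orbit-aware" means, not the training code: the GT keeps `1 - α`, a fraction `orbit_share` of `α` is spread evenly over the GT's orbit partners, and the remainder is spread evenly over all other classes. The partner ids `[3, 5, 9]` and both function names are hypothetical.

```python
import math

def orbit_smoothed_target(gt, partners, num_classes=24,
                          alpha=0.1, orbit_share=0.8):
    """Soft target: 1 - alpha on GT, alpha split between partners and the rest."""
    partners = [c for c in partners if c != gt]
    others = [c for c in range(num_classes)
              if c != gt and c not in partners]
    target = [0.0] * num_classes
    target[gt] = 1.0 - alpha
    # With no partners, all of alpha falls back to the uniform share.
    partner_mass = alpha * orbit_share if partners else 0.0
    for c in partners:
        target[c] = partner_mass / len(partners)
    for c in others:
        target[c] = (alpha - partner_mass) / len(others)
    return target

def joint_loss(cls_probs, target, off_prob, off_label, lambda_off=1.0):
    """loss = CE(cls, soft target) + lambda_off * BCE(off, off_label)."""
    ce = -sum(t * math.log(p) for t, p in zip(target, cls_probs) if t > 0)
    bce = -(off_label * math.log(off_prob)
            + (1 - off_label) * math.log(1 - off_prob))
    return ce + lambda_off * bce

target = orbit_smoothed_target(gt=0, partners=[3, 5, 9])
uniform = [1 / 24] * 24  # an untrained classifier
loss = joint_loss(uniform, target, off_prob=0.5, off_label=0)
```

For a uniform classifier and an off-grid probability of 0.5, the loss reduces to `log(24) + log(2)`, which is a convenient sanity check at initialization.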

	## Compatibility / known issues

	- VGGT requires `pip install git+https://github.com/facebookresearch/vggt.git`
	(not on PyPI as of 2026-05).
	- Image preprocessing has no ImageNet mean/std normalization — VGGT's
	aggregator was trained on raw [0,1] tensors.
	- `predict_model()` needs a Blender env with `bpy` 4.4 installed. The bundled
	`blender_scripts/blender_render_preview.py` matches the training render pipeline.
	- The 2-head ckpt (this V2) does not produce sub-grid offsets. For sub-degree
	refinement, pair with the V4 mesh-only post-process (`feat_iter_qLcont`) — see
	the project repo's `docs/reports/phase_b2_mesh_postprocess_v4.md`.

	## Citations / related

	- Backbone: VGGT (Visual Geometry Grounded Transformer), Wang et al., CVPR 2025.
	https://github.com/facebookresearch/vggt
	- Backbone init: Orient Anything V2, NeurIPS'25 spotlight,
	`Viglong/OriAnyV2_ckpt` on Hugging Face.
	- V1 baseline (DINOv2 full fine-tune): `noahdudu/orient90-v1`.

	## License

	Apache-2.0 (see `LICENSE`).