---
base_model:
  - zx1239856/edgerunner
datasets:
  - zx1239856/3d-front-ar-packed
library_name: transformers
license: cc-by-sa-4.0
pipeline_tag: image-to-3d
---

# PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction

PixARMesh is a mesh-native autoregressive framework for single-view 3D scene reconstruction. Instead of reconstructing scenes via intermediate volumetric or implicit representations, PixARMesh models each instance directly with a native mesh representation, predicting object poses and meshes in a single unified autoregressive sequence.
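
As a rough illustration of what such a unified sequence can look like, the sketch below interleaves scene-context tokens with per-object pose and mesh tokens in one autoregressive stream. The token ids, helper names, and exact ordering here are assumptions for exposition, not the released tokenization:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical special tokens delimiting each object (illustrative ids).
BOS_OBJ, EOS_OBJ = 1, 2

@dataclass
class SceneObject:
    pose_tokens: List[int]   # discretized pose (e.g., translation/rotation/scale)
    mesh_tokens: List[int]   # tokenized mesh geometry

def build_sequence(context_tokens: List[int], objects: List[SceneObject]) -> List[int]:
    """Lay out context, pose, and mesh tokens as one autoregressive stream."""
    seq = list(context_tokens)          # image-derived scene context first
    for obj in objects:
        seq.append(BOS_OBJ)
        seq.extend(obj.pose_tokens)     # pose precedes geometry for each object
        seq.extend(obj.mesh_tokens)
        seq.append(EOS_OBJ)
    return seq

# Example: two objects flattened into a single token stream.
objs = [SceneObject([10, 11, 12], [100, 101]),
        SceneObject([20, 21, 22], [200, 201])]
print(build_sequence([5, 6], objs))
```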

## Model Details

Reconstructing complete 3D indoor scenes from a single RGB image is a complex task. PixARMesh jointly predicts object layout and geometry within a unified model, producing coherent and artist-ready meshes in a single forward pass. By augmenting a point-cloud encoder with pixel-aligned image features and global scene context via cross-attention, the model enables accurate spatial reasoning. Scenes are generated autoregressively from a unified token stream containing context, pose, and mesh, yielding compact meshes with high-fidelity geometry.
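
The pixel-aligned conditioning described above can be pictured as a standard cross-attention block in which point-cloud tokens attend to image features together with a global scene embedding. The PyTorch snippet below is a minimal sketch of that idea; the dimensions, module structure, and names are assumptions, not the model's actual implementation:

```python
import torch
import torch.nn as nn

class PixelAlignedFusion(nn.Module):
    """Hypothetical fusion block: point-cloud tokens cross-attend to
    pixel-aligned image features plus a global scene-context token."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, point_tokens, image_features, scene_context):
        # point_tokens:   (B, N_pts, dim) encoded point-cloud tokens (queries)
        # image_features: (B, N_pix, dim) pixel-aligned image features
        # scene_context:  (B, 1, dim)     global scene embedding
        context = torch.cat([image_features, scene_context], dim=1)
        fused, _ = self.cross_attn(point_tokens, context, context)
        return self.norm(point_tokens + fused)  # residual connection

# Example with illustrative shapes.
fusion = PixelAlignedFusion()
pts = torch.randn(2, 1024, 512)
img = torch.randn(2, 196, 512)
ctx = torch.randn(2, 1, 512)
out = fusion(pts, img, ctx)  # -> (2, 1024, 512)
```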

## Citation

If you find PixARMesh useful in your research, please consider citing:

```bibtex
@article{zhang2026pixarmesh,
  title={PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction},
  author={Zhang, Xiang and Yoo, Sohyun and Wu, Hongrui and Li, Chuan and Xie, Jianwen and Tu, Zhuowen},
  journal={arXiv preprint arXiv:2603.05888},
  year={2026}
}
```

## Acknowledgements

PixARMesh builds upon several excellent open-source projects, including Grounded-Segment-Anything, Depth Pro, and DINOv2, as well as weights from EdgeRunner and BPT. We also use physically based renderings of 3D-FRONT scenes.