Hunyuan3D 2.1 — Shape Pipeline (MLX)

MLX-native weights for the shape pipeline of tencent/Hunyuan3D-2.1 — single image in, untextured mesh out. Optimised for Apple Silicon Macs.

These are the artefacts produced by the hy3dmlx educational arc (chapters 13 + 14): a chapter-by-chapter port of the Hunyuan3D 2.1 shape pipeline from PyTorch+CUDA to Apple's MLX. The conversion preserves trained parameters bit-for-bit; only the on-disk layout changes.

Out of scope: the paint pipeline (PBR texturing) of Hunyuan3D 2.1 is not in this repo. For the MLX paint weights, see AgenticVibes/hunyuan3d-2.1-mlx.

Files

File	Size	Description
`dit.safetensors`	~6.10 GB	HunYuanDiT-Plain backbone — 21 layers, hidden 2048, 16 heads, last 6 layers MoE (8 experts, top-2). fp16. ~3.3B params.
`shape_vae_encoder.safetensors`	~227 MB	Point-cross-attention encoder — 8 self-attn layers, width 1024, 16 heads. fp16.
`shape_vae_decoder.safetensors`	~429 MB	Latent-refine transformer (16 layers, width 1024) + cross-attention geo-decoder. fp16.
`config.json`	1 KB	Architectural metadata + module configs + upstream commit pin.
`LICENSE`	—	Tencent Hunyuan 3D 2.1 Community License Agreement (verbatim from upstream).
`NOTICE`	—	Required attribution + change description.

DINOv2 image encoder is not included. The pipeline's image conditioner is DINOv2-Large, a separately-licensed Meta model. hy3dmlx.hub.load_pipeline_from_hf pulls it from the upstream facebook/dinov2-large repo automatically — under its own Apache-2.0 license, distinct from this repo's terms.
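The sizes and dtypes in the file table can be checked without loading any tensor data, because a safetensors file starts with an 8-byte little-endian header length followed by a JSON header describing every tensor. The sketch below uses only the standard library; the helper names `read_safetensors_header` and `summarize` are illustrative and are not part of hy3dmlx.

```python
import json
import struct
from math import prod


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def summarize(path):
    """Return (total parameter count, sorted list of dtypes) for one file."""
    header = read_safetensors_header(path)
    n_params = sum(prod(t["shape"]) for t in header.values())
    dtypes = sorted({t["dtype"] for t in header.values()})
    return n_params, dtypes
```

If the table is accurate, `summarize("dit.safetensors")` should report a single dtype, `F16`, and a parameter count in the low billions.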

Usage

Install hy3dmlx and call load_pipeline_from_hf:

from hy3dmlx.hub import load_pipeline_from_hf
from PIL import Image

pipeline = load_pipeline_from_hf("AgenticVibes/hy3dmlx-shape-v2.1")
result = pipeline(image=Image.open("input.png"))
result.mesh.export("output.glb")

That single call:

Downloads dit.safetensors, shape_vae_encoder.safetensors, shape_vae_decoder.safetensors, and config.json from this repo (cached on subsequent calls).
Pulls DINOv2-Large from facebook/dinov2-large.
Builds and wires up ShapeMeshPipeline (sampling + decode + marching cubes) ready to call.

If you'd rather load components independently — e.g. to swap the scheduler or run latent-only experiments — see hy3dmlx.hub.fetch_weights to download the snapshot directory and hy3dmlx.dit / hy3dmlx.vae for module-level loaders.

Performance — measured

End-to-end image-to-mesh on the chapter-1 demo input, measured on M3 Ultra-class hardware, fp16, 50 inference steps with classifier-free guidance scale 5.0, octree resolution 384 (matching the upstream defaults):

Stage	Wall (s)	Share
DINOv2 forward	0.2	<0.1%
DiT sampling (50 steps × CFG batch=2)	646	58%
VAE refine (post_kl + 16-layer transformer)	0.4	<0.1%
VAE grid eval (~7000 chunks at 8000 voxels)	465	42%
Total	1112 (~18.5 min)	100%

MLX peak memory: 7.85 GB. Output mesh: 343 508 vertices, 687 024 faces.

A faster development setting at octree_resolution=256 finishes in ~5 minutes with comparable global geometry.
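The grid-eval cost follows from simple arithmetic on the octree resolution. The sketch below assumes the occupancy field is evaluated on a dense cubic grid of `octree_resolution ** 3` points in chunks of 8000, which matches the "~7000 chunks at 8000 voxels" figure in the table; the actual implementation may use `resolution + 1` samples per axis, which changes the counts by about 1%.

```python
import math


def grid_points(octree_resolution):
    # Assumption: a dense cubic grid with `octree_resolution` samples per axis.
    return octree_resolution ** 3


def num_chunks(octree_resolution, chunk_size=8000):
    # Number of decoder calls needed to cover the whole grid.
    return math.ceil(grid_points(octree_resolution) / chunk_size)


for res in (256, 384):
    print(res, grid_points(res), num_chunks(res))
# 256 16777216 2098
# 384 56623104 7078
```

Under this assumption, dropping from 384 to 256 cuts the grid-eval work by a factor of (384/256)^3 = 3.375. The DiT sampling stage does not depend on the octree resolution.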

Mesh-level parity vs the PyTorch reference

Comparing this repo's MLX inference output against the upstream PyTorch reference run on the same input:

Metric	Value	Informal threshold
Chamfer L2 (symmetric)	0.0244	≤ 0.02 (just over — see below)
F1 @ 0.01	0.190	≥ 0.85 (low — see below)
F1 @ 0.05	0.891	—
Normal consistency	0.915	≥ 0.90 ✓
Bounding-box extent (MLX vs ref)	`[1.276, 1.999, 1.181]` vs `[1.285, 1.993, 1.161]`	within 2% on every axis (largest gap ~1.7%, third axis)
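For readers unfamiliar with these metrics, here is a minimal NumPy sketch of a symmetric Chamfer distance and an F1 score at a distance threshold, computed on points sampled from the two meshes. It assumes one common convention (mean squared nearest-neighbour distance, summed over both directions); the exact convention behind the numbers in the table is documented in chapter 13 and may differ. The function names are illustrative.

```python
import numpy as np


def nearest_distances(a, b):
    """For each point in a (N, 3), the Euclidean distance to its nearest point in b (M, 3)."""
    diff = a[:, None, :] - b[None, :, :]          # (N, M, 3) pairwise differences
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)


def chamfer_l2(a, b):
    """Symmetric Chamfer distance: mean squared nearest distance, both directions summed."""
    return float((nearest_distances(a, b) ** 2).mean()
                 + (nearest_distances(b, a) ** 2).mean())


def f1_at(a, b, tau):
    """F1 of precision (a near b) and recall (b near a) at distance threshold tau."""
    precision = float((nearest_distances(a, b) <= tau).mean())
    recall = float((nearest_distances(b, a) <= tau).mean())
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

The brute-force pairwise matrix is O(N·M) in memory, so for meshes of this size you would subsample a few thousand surface points first or switch to a KD-tree. The gap between F1 @ 0.01 and F1 @ 0.05 in the table is what this definition produces when two surfaces are offset by a small, roughly uniform distance that falls between the two thresholds.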

The two meshes are not bit-equal, and they aren't expected to be. Differences come from three places:

RNG mismatch. The reference run used non-deterministic noise (generator=None); our MLX run uses a seeded MLX PRNG. PyTorch's and MLX's standard-normal samplers don't agree, so even with bit-equal weights the initial latent differs.
fp16 cumulative error. Rounding error compounds across 50 scheduler steps, the CFG arithmetic, 21 DiT blocks, and MoE routing, so the fp16 result drifts slightly from what an fp32 run would produce.
Marching cubes is non-linear. A small change in the occupancy field near the iso-surface can flip whether a voxel emits triangles, producing localised mesh-level differences that don't reflect a weight or architecture issue.
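The marching-cubes point can be made concrete with a toy field. The sketch below is not the pipeline's mesher; it only marks which grid cells straddle the iso-level (the cells that would emit triangles) and shows that a perturbation of 2e-4 on one corner is enough to remove a cell from that set. The iso-level of 0 and the helper name `surface_cells` are assumptions for illustration.

```python
import numpy as np


def surface_cells(field, iso=0.0):
    """Boolean mask over cells: True where the 8 corner samples straddle `iso`."""
    inside = field > iso
    n0, n1, n2 = (s - 1 for s in field.shape)
    # Stack the inside/outside flag of each of the 8 corners of every cell.
    corners = np.stack([
        inside[i:i + n0, j:j + n1, k:k + n2]
        for i in (0, 1) for j in (0, 1) for k in (0, 1)
    ])
    # A cell emits triangles only if some corners are inside and some are outside.
    return corners.any(axis=0) & ~corners.all(axis=0)


# One cell whose only "inside" corner sits barely above the iso-level.
field = np.full((2, 2, 2), -0.5, dtype=np.float32)
field[0, 0, 0] = 1e-4

perturbed = field - 2e-4  # a shift well within fp16 rounding error at this scale

before = int(surface_cells(field).sum())      # 1: the cell emits geometry
after = int(surface_cells(perturbed).sum())   # 0: the same cell is now empty
```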

For full discussion (including the discovery that our MLX run produces a better "Y" on the demo sign than the reference does), see chapter 13 of the source repo: https://github.com/AgenticVibes/hy3dmlx/blob/main/chapters/13-end-to-end/README.md.

If you need closer parity for a specific input, pre-sample the noise as a NumPy array and pass it to both runs via initial_noise=. That removes the RNG-mismatch term but not the fp16 and marching-cubes terms.
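A minimal sketch of pre-sampling the noise follows. The latent shape used here is a placeholder assumption, not a value taken from this repo; read the real shape from config.json or from the pipeline before using it. Only the `initial_noise=` keyword comes from the text above.

```python
import numpy as np

# Hypothetical latent shape (batch, tokens, channels); replace with the real one.
LATENT_SHAPE = (1, 4096, 64)


def make_initial_noise(seed, shape=LATENT_SHAPE):
    """Draw standard-normal noise from a seeded NumPy generator."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(shape, dtype=np.float32)


noise = make_initial_noise(seed=0)

# Save once, then load the same array in both the MLX and the PyTorch run:
#   np.save("initial_noise.npy", noise)
#   result = pipeline(image=image, initial_noise=np.load("initial_noise.npy"))
```

Sampling in NumPy rather than in either framework means neither backend's PRNG is involved, so the starting latent is identical by construction.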

Inputs and limitations

Image format. Anything PIL reads. PNG with transparency is preferred (skips background removal entirely; the alpha channel directly defines the subject).
Image size. Sweet spot 512–1024 px on the long side. Smaller images get heavily upsampled and lose detail; larger images are downsampled, so the extra detail is discarded.
Subject content. Trained on 3D-asset-grade imagery — characters, props, vehicles. Out-of-distribution inputs (faces, architecture, complex multi-subject scenes) produce uneven results.
Subject pose. Front-on views often produce flat or hollow backs; 3/4 views work better (the demo input is one).
Memory. ~8 GB MLX peak at fp16 with octree_resolution=384. Smaller M-series Macs may need octree_resolution=256 (~17M voxels instead of ~56M) to fit comfortably.
No quantization. All weights ship at fp16 native to the upstream checkpoint. Lower-precision (Q8/Q4) variants could be future work; not in scope here.
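The image-format and image-size guidance above can be applied with a few lines of Pillow before calling the pipeline. This is a sketch under stated assumptions: the helper names are illustrative, and the 1024 px cap simply mirrors the upper end of the suggested range. The image mode is left unchanged so that an opaque RGB input is not mistaken for a pre-masked one.

```python
from PIL import Image


def prepare_input(image, max_side=1024):
    """Shrink the image so its long side is at most `max_side`; never upsample."""
    long_side = max(image.size)
    if long_side > max_side:
        scale = max_side / long_side
        new_size = (max(1, round(image.width * scale)),
                    max(1, round(image.height * scale)))
        image = image.resize(new_size, Image.LANCZOS)
    return image


def has_transparency(image):
    """True if any pixel is not fully opaque, i.e. the alpha channel carries a mask."""
    alpha = image.convert("RGBA").getchannel("A")
    return alpha.getextrema()[0] < 255
```

A typical call would be `pipeline(image=prepare_input(Image.open("input.png")))`, with `has_transparency` serving as a quick check on whether the file already carries a subject mask.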

Citations

If you use these weights or the hy3dmlx port in research or applications, please cite the upstream Hunyuan3D papers:

@article{hunyuan3d2025_2_1,
  title={Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with Production-Ready PBR Material},
  author={Tencent Hunyuan3D Team},
  year={2025},
  eprint={2506.15442},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.15442}
}

@article{hunyuan3d2025,
  title={Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation},
  author={Tencent Hunyuan3D Team},
  year={2025},
  eprint={2501.12202},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2501.12202}
}

The shape pipeline relies on several earlier works that the Hunyuan3D papers themselves cite — please consult the upstream papers for the full bibliography. Briefly:

DINOv2 (Oquab et al., 2023) — the image conditioner. https://arxiv.org/abs/2304.07193
Flow Matching (Lipman et al., 2022) — the diffusion-time formulation. https://arxiv.org/abs/2210.02747
Scalable Diffusion Models with Transformers (Peebles & Xie, 2022) — the DiT architecture. https://arxiv.org/abs/2212.09748

License and attribution

These weights are Model Derivatives under the Tencent Hunyuan 3D 2.1 Community License Agreement. See NOTICE for the change description.

Key terms:

Territory restricted. The license expressly excludes the European Union, United Kingdom, and South Korea. This HuggingFace repo is gated (extra_gated_eu_disallowed: true) to disallow EU downloads at the platform level.
Commercial-use threshold. If your product or service has more than 1 million monthly active users at the version-release date, the upstream license requires a separate commercial license from Tencent (see Section 4 of the LICENSE file).
Improvement-of-other-models prohibition. You may not use these weights or their outputs to improve any AI model other than Hunyuan3D 2.1 or its derivatives.
Attribution required. Distributions to third parties must be accompanied by this NOTICE file (Section 3(d) of the upstream license).

The hy3dmlx package itself (the loading code) is independent of this artefact's license and is released separately at https://github.com/AgenticVibes/hy3dmlx.

DINOv2 weights, when fetched via load_pipeline_from_hf, are governed by their own Apache-2.0 license at facebook/dinov2-large — entirely independent of the terms here.

Acknowledgements

These weights would not exist without the Tencent Hunyuan3D team's decision to open the original model under the Community License. Conversion preserves their work bit-for-bit; everything genuinely interesting in the trained behaviour is theirs. Powered by Tencent Hunyuan.