---
license: mit
pipeline_tag: image-to-3d
tags:
- coreai
- core-ml
- apple-silicon
- gaussian-splatting
- image-to-3d
- 3d-generation
base_model:
- VAST-AI/TripoSplat
---

# TripoSplat: Core AI (zoo's first 3D)

[VAST-AI/TripoSplat](https://github.com/VAST-AI-Research/TripoSplat): **single image → 3D Gaussian
splats** (`.ply`/`.splat`), MIT-licensed. The zoo's first 3D model: outputs drop straight into a Gaussian-splat
viewer (e.g. Apple RealityKit on visionOS, or MetalSplatter on iOS/macOS).

Pure-PyTorch pipeline (no diffusers or CUDA kernels): background removal → DINOv3 ViT-H encode + Flux2-VAE encode
→ 20-step flow-matching DiT denoiser → octree probability sampler → Gaussian decoder → splats.

**This repo holds the Core AI `.aimodel` bundles** (each is a directory). Conversion and runner scripts
live in the [coreai-models-community](https://huggingface.co/mlboydaisuke) zoo (`conversion/triposplat/`).

## What runs on Core AI

Five neural nets are converted, each gated on converted-vs-eager cosine similarity (**cos = 1.000000**):

| net | shape (input→output) | bundle | dtype |
|---|---|---|---|
| DINOv3 ViT-H encoder | (1,3,1024,1024)→(1,4101,1280) | `dinov3_fp16.aimodel` | fp16 |
| Flux2-VAE encoder | (1,3,1024,1024)→(1,4096,128) | `vae_fp16.aimodel` | fp16 |
| DiT denoiser (one step) | latent(1,8192,16)+cam(1,1,5)+t+feat1(1,4101,1280)+feat2(1,4101,128)→latent,cam | `dit_fp16.aimodel` | fp16 |
| Octree probability decoder | x(1,8192,3)+l(1,)+cond(1,8192,16)→logits(1,8192,8) | `octree_fp32.aimodel` | fp32 |
| Decode (gs + build_gaussians + .ply activations, baked) | points(1,8192,3)+cond(1,8192,16)→(262144,14) | `decode_fp32.aimodel` | fp32 |

The flow-matching sampler (`FlowEulerCfgSampler`) and the octree `sample_probs` systematic resampling
stay **host-side** because they rely on data-dependent control flow. Scripts: `_conv_*.py` convert and gate each net;
`_conv_fp16.py` makes the half-size fp16 bundles; `_conv_decode.py` bakes `build_gaussians` and the
Gaussian `.ply`-activation math into one net so the runner just writes raw floats.
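To make the host-side split concrete, here is a minimal sketch of an Euler flow-matching loop with classifier-free guidance, in plain Python. It is not VAST's `FlowEulerCfgSampler`: the linear time grid from t=1 to t=0, the guidance formula, the default `cfg_scale`, and the names `euler_cfg_sample` and `velocity_fn` are assumptions for illustration. In a real runner, `velocity_fn` would wrap one call into the `dit_fp16.aimodel` bundle and work on tensors rather than lists.

```python
from typing import Callable, List, Optional

Vector = List[float]
# Hypothetical signature: (x_t, t, cond) -> predicted velocity; cond=None means unconditional.
VelocityFn = Callable[[Vector, float, Optional[object]], Vector]


def euler_cfg_sample(
    velocity_fn: VelocityFn,
    noise: Vector,
    cond: object,
    steps: int = 20,
    cfg_scale: float = 3.0,  # assumed value, not taken from the source
) -> Vector:
    """Integrate dx/dt = v(x, t) from t=1 (noise) down to t=0 (sample)."""
    x = list(noise)
    for i in range(steps):
        t = 1.0 - i / steps
        t_prev = 1.0 - (i + 1) / steps
        v_cond = velocity_fn(x, t, cond)
        v_uncond = velocity_fn(x, t, None)
        # Classifier-free guidance: extrapolate from the unconditional
        # prediction toward the conditional one.
        v = [u + cfg_scale * (c - u) for c, u in zip(v_cond, v_uncond)]
        # Euler step over the interval [t_prev, t].
        x = [xi - (t - t_prev) * vi for xi, vi in zip(x, v)]
    return x


# Toy check: on the straight path x_t = (1 - t) * target + t * start the
# exact velocity is (start - target), so Euler should land on `target`.
target = [0.5, -1.0, 2.0]
start = [1.0, 1.0, 1.0]


def toy_velocity(x: Vector, t: float, cond: Optional[object]) -> Vector:
    return [s - g for s, g in zip(start, target)]


sample = euler_cfg_sample(toy_velocity, start, cond="image-features")
```

Each step needs two network evaluations (conditional and unconditional), so under these assumptions a 20-step generation would call the denoiser 40 times.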

## model.py patches (the reusable contribution; see the zoo's conversion guide)

coreai-torch 0.4.0 needed six edits to VAST's `model.py`; all are general gotchas:

1. **float-arg `aten.arange`** → `bad_optional_access` C++ abort. Use int-arg arange (DINOv3 RoPE).
2. **fx `got multiple values for 'mod'`** → a submodule was called with a `mod=` kwarg. Pass it positionally.
3. **No complex ops**: rewrote the DiT's complex RoPE (`torch.polar`/`view_as_complex`) as real
   cos/sin math (`apply_rotary_emb`, `RePo3DRotaryEmbedding.forward`).
4. **Constant-folded `sin/cos` of huge args is low-precision** (cos ≈ 0.5): the DiT positional embed
   computed from the fixed Sobol constant was folded wrong; precompute it into a `register_buffer`.
5. **`F.normalize` drops the eps clamp** → near-zero vectors blow up to ~1e13; rewrote `MultiHeadRMSNorm`
   as explicit `x*rsqrt(mean(x²)+eps)`. (Emergent only at large seq len, so gate visually or at true scale.)
6. **`prog.optimize()` hangs** on the 24-block/12k-token DiT graph (>90 min) → skip it
   (`convert(optimize=False)`); AOT `coreai-build` optimizes for the device anyway.

Plus: int8 desaturates this model (per-net cos is 0.9998, but colors collapse), so use **fp16**, which is
GPU-identical to fp32. Gate fp16 on GPU or visually: its CPU cos looks bad, but that is a CPU-compute
artifact. Octree decoder: an int64 `l` (resolution) input → CoreAIError 3 at runtime; pass it as float32.
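Patches 3 and 5 above are easy to sanity-check in isolation. The sketch below uses only the standard library; the interleaved pair layout, the `eps` value, and every function name are assumptions for illustration, not the signatures in VAST's `model.py`. It rotates feature pairs once with complex multiplication and once with real cos/sin math, gates the two with a cosine similarity, and shows the explicit RMS norm staying bounded on a near-zero row.

```python
import cmath
import math
from typing import List


def rope_complex(pairs: List[complex], angles: List[float]) -> List[complex]:
    """Reference: rotate each feature pair as one complex number."""
    return [z * cmath.rect(1.0, a) for z, a in zip(pairs, angles)]


def rope_real(x: List[float], angles: List[float]) -> List[float]:
    """Same rotation with real cos/sin only, on interleaved [re0, im0, re1, im1, ...]."""
    out: List[float] = []
    for i, a in enumerate(angles):
        re, im = x[2 * i], x[2 * i + 1]
        c, s = math.cos(a), math.sin(a)
        out += [re * c - im * s, re * s + im * c]
    return out


def rms_norm(x: List[float], eps: float = 1e-6) -> List[float]:
    """Explicit x * rsqrt(mean(x^2) + eps); eps keeps near-zero rows bounded."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv for v in x]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine-similarity gate between two flattened outputs."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


x = [0.3, -1.2, 2.0, 0.7]
angles = [0.25, 1.5]
ref = rope_complex([complex(x[0], x[1]), complex(x[2], x[3])], angles)
ref_flat = [part for z in ref for part in (z.real, z.imag)]
real_flat = rope_real(x, angles)
gate = cosine(ref_flat, real_flat)  # should be 1.0 up to rounding

# A near-zero row: without eps the scale factor would be on the order of 1e12.
tiny = rms_norm([1e-12, -1e-12, 0.0, 0.0])
```

A toy input like this will not reproduce the failure described in patch 5, which only appeared at large sequence length; the point is just that the explicit form has no division that can run away.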

## Running it

- **Mac**: `_run_coreai.py` (or `app_backend.py --input <img>`) loads the bundles via `coreai.runtime`
  (`SpecializationOptions.default()` = GPU; ~2 min per generation at 20 steps on Apple silicon, full quality).
  End-to-end latent gate vs torch-DiT: **cos 0.999999**.
- **Mac app / iPhone client**: `TripoSplatMac` (standalone) and `TripoSplatPhone`
  (capture on iPhone → Mac server `server.py` → view splats in MetalSplatter / RealityKit).
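Because the decode bundle already bakes in the `.ply` activations, the last step of a runner is plain serialization. The sketch below writes (N, 14) float rows as a binary little-endian PLY using only the standard library. The column order in `PLY_PROPERTIES` (position, DC color, opacity, scale, rotation quaternion) is an assumption based on common Gaussian-splat `.ply` files, and `ply_bytes` is a hypothetical helper, not the zoo's runner code; check the order against `_conv_decode.py` before relying on it.

```python
import struct
from typing import Sequence

# ASSUMED column order for the 14 floats per Gaussian (not confirmed by the source).
PLY_PROPERTIES = (
    "x", "y", "z",
    "f_dc_0", "f_dc_1", "f_dc_2",
    "opacity",
    "scale_0", "scale_1", "scale_2",
    "rot_0", "rot_1", "rot_2", "rot_3",
)


def ply_bytes(rows: Sequence[Sequence[float]]) -> bytes:
    """Serialize (N, 14) float rows as a binary little-endian PLY file."""
    header_lines = [
        "ply",
        "format binary_little_endian 1.0",
        f"element vertex {len(rows)}",
    ]
    header_lines += [f"property float {name}" for name in PLY_PROPERTIES]
    header_lines.append("end_header")
    header = ("\n".join(header_lines) + "\n").encode("ascii")

    row_format = struct.Struct("<" + "f" * len(PLY_PROPERTIES))
    body = bytearray()
    for row in rows:
        if len(row) != len(PLY_PROPERTIES):
            raise ValueError(
                f"expected {len(PLY_PROPERTIES)} floats per row, got {len(row)}"
            )
        body += row_format.pack(*row)
    return header + bytes(body)


# Two dummy Gaussians; the real decoder output would have 262144 rows.
demo_rows = [[float(i + j) for j in range(14)] for i in range(2)]
blob = ply_bytes(demo_rows)
```

Writing the result is then a single `open("out.ply", "wb").write(blob)`; each Gaussian costs 56 bytes after the header.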

## On-device note

Full on-device (iPhone) inference was **verified infeasible** with this model: the DINOv3 ViT-H AOT `.aimodelc`
is ~3.1 GB and the DiT's 12294-token full-attention score matrix alone is ~4.8 GB. Neither fits the
~3.3 GB iOS app memory budget: the encoder alone would leave only ~0.2 GB for everything else, and the
score matrix exceeds the budget outright (lower weight precision doesn't shrink the attention working set).
It would need flash-attention conversion or weight streaming. The Mac-link client is the shipped path.
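The figures in this note can be reproduced with back-of-the-envelope arithmetic. The token count follows from the DiT input shapes in the conversion table (8192 latent + 1 camera + 4101 feature tokens). The head count and score precision below are assumptions: 16 heads with fp16 scores is one combination that lands on ~4.8 GB, and the source does not state the actual configuration.

```python
# Token count of the DiT sequence, from the shapes in the conversion table.
LATENT_TOKENS = 8192
CAMERA_TOKENS = 1
FEATURE_TOKENS = 4101
tokens = LATENT_TOKENS + CAMERA_TOKENS + FEATURE_TOKENS  # 12294, as quoted in the note

# ASSUMPTIONS (not stated in the source): 16 attention heads, fp16 scores,
# and the full score matrix of one layer materialized at once.
NUM_HEADS = 16
BYTES_PER_SCORE = 2

score_bytes = NUM_HEADS * tokens * tokens * BYTES_PER_SCORE
score_gb = score_bytes / 1e9

IOS_BUDGET_GB = 3.3  # app memory budget quoted in the note
ENCODER_GB = 3.1     # DINOv3 ViT-H `.aimodelc` size quoted in the note
headroom_gb = IOS_BUDGET_GB - ENCODER_GB

print(f"tokens={tokens}")
print(f"score matrix: {score_gb:.2f} GB vs budget {IOS_BUDGET_GB} GB")
print(f"headroom left by the encoder alone: {headroom_gb:.1f} GB")
```

The score matrix grows with the square of the token count and does not depend on weight precision, which is why quantizing weights does not help here and why a flash-attention style kernel, which avoids materializing the full matrix, is the fix named above.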