---
title: Matrix Voxel
emoji: 🌏
colorFrom: green
colorTo: pink
sdk: static
pinned: false
license: cc-by-nc-nd-4.0
short_description: The next gen 3D generator
---

# Matrix Voxel — Full Architecture & Planning Document

**3D Generation Model Family | Matrix.Corp**

---

## Family Overview

Matrix Voxel is Matrix.Corp's 3D generation family: five models sharing a common flow-matching backbone, each with task-specific decoder heads. Four specialist models are open source; one unified all-in-one model (Voxel Prime) is closed source and API-only.

| Model | Task | Output Formats | Source | Hardware | Status |
|---|---|---|---|---|---|
| Voxel Atlas | World / environment generation | Voxel grids, OBJ scenes, USD stages | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Forge | 3D mesh / asset generation | OBJ, GLB, FBX, USDZ | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Cast | 3D printable model generation | STL, OBJ (watertight), STEP | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Lens | NeRF / Gaussian Splatting scenes | .ply (3DGS), NeRF weights, MP4 render | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Prime | All-in-one unified generation | All of the above | 🟣 Closed Source | API Only | 🔴 Planned |

---

## Input Modalities (All Models)

Every Voxel model accepts any combination of:

| Input | Description | Encoder |
|---|---|---|
| Text prompt | Natural language description of desired 3D output | CLIP-ViT-L / T5-XXL |
| Single image | Reference image → 3D lift | DINOv2 + custom depth encoder |
| Multi-view images | 2–12 images from different angles | Multi-view transformer encoder |
| Video | Extracts frames, infers 3D from motion | Temporal encoder (Video-MAE lineage) |
| 3D model | Existing mesh/point cloud as conditioning | PointNet++ encoder |

All inputs are projected into a shared 1024-dim conditioning embedding space before entering the backbone.

---

## Core Architecture — Shared Flow Matching Backbone

### Why Flow Matching?

Flow matching (Lipman et al.
2022, extended by the Stable Diffusion 3 / FLUX lineage) learns a direct vector field from noise → data. Faster than DDPM diffusion (fewer inference steps, typically 20–50 vs. 1000), more stable training, better mode coverage. State of the art for generative models as of 2025–2026.

### 3D Representation — Triplane + Latent Voxel Grid

All Voxel models operate in a shared latent 3D space:

- **Triplane representation**: three axis-aligned feature planes (XY, XZ, YZ), each 256×256 with 32 channels
- Any 3D point is queried by projecting it onto all 3 planes and summing the sampled features
- Compact (3 × 256 × 256 × 32 ≈ 6M latent values) yet expressive
- Flow matching operates on this triplane latent space, not raw 3D points
- Decoder heads decode the triplane to a task-specific output format

### Backbone Architecture

```
VoxelBackbone
├── Input Encoder (multimodal conditioning)
│   ├── TextEncoder — T5-XXL + CLIP-ViT-L, projected to 1024-dim
│   ├── ImageEncoder — DINOv2-L, projected to 1024-dim
│   ├── MultiViewEncoder — custom transformer over N views
│   ├── VideoEncoder — Video-MAE, temporal pooling → 1024-dim
│   └── PointCloudEncoder — PointNet++, global + local features → 1024-dim
│
├── Conditioning Fusion
│   └── CrossModalAttention — fuses all active input modalities
│
├── Flow Matching Transformer (DiT-style)
│   ├── 24 transformer blocks
│   ├── Hidden dim: 1536
│   ├── Heads: 24
│   ├── Conditioning: AdaLN-Zero (timestep + conditioning signal)
│   ├── 3D RoPE positional encoding for triplane tokens
│   └── ~2.3B parameters
│
└── Triplane Decoder (shared across all specialist models)
    └── Outputs: triplane feature tensor (3 × 256 × 256 × 32)
```

### Flow Matching Training

- Learn a vector field v_θ(x_t, t, c), where x_t is the noisy triplane and c is the conditioning
- Optimal-transport flow: straight paths from noise → data (better than DDPM's curved paths)
- Inference: 20–50 NFE (neural function evaluations) — fast on A100
- Classifier-free guidance: 10% unconditional dropout during training
- Guidance scale 5.0–10.0 at
inference

---

## Task-Specific Decoder Heads

Each specialist model adds a decoder head on top of the shared triplane output.

---

### Voxel Atlas — World Generation Decoder

**Task:** Generate full 3D environments and worlds — terrain, buildings, vegetation, interior spaces.

**Output formats:**

- Voxel grids (`.vox`, MagicaVoxel format) — for Minecraft-style worlds
- OBJ scene (multiple meshes with materials) — for Unity/Unreal environments
- USD stage (`.usd`) — industry-standard scene format

**Decoder head:**

```
TriplaneAtlasDecoder
├── Scene Layout Transformer
│   ├── Divides space into semantic regions (terrain, structures, vegetation, sky)
│   └── 6-layer transformer over a 32×32 spatial grid of scene tokens
├── Region-wise NeRF decoder (per semantic region)
│   └── MLP: 3D coords + triplane features → density + RGB + semantic label
├── Marching Cubes extractor → raw mesh per region
├── Scene graph assembler → parent-child relationships between objects
├── Voxelizer (for .vox output) → discretizes to user-specified resolution
└── USD exporter → full scene hierarchy with lighting + materials
```

**Special modules:**

- **Infinite world tiling**: generates seamless adjacent chunks that stitch together
- **Biome-aware generation**: desert, forest, urban, underwater, space, fantasy
- **LOD generator**: auto-generates 4 levels of detail per scene object
- **Lighting estimator**: infers plausible sun/sky lighting from scene content

**Typical generation sizes:**

- Small scene: 64×64×64 voxels or ~500 m² OBJ scene — ~8 seconds on A100
- Large world chunk: 256×256×128 voxels — ~35 seconds on A100

---

### Voxel Forge — Mesh / Asset Generation Decoder

**Task:** Generate clean, game-ready 3D assets — characters, objects, props, vehicles, architecture.
**Output formats:**

- OBJ + MTL (universal)
- GLB/GLTF (web & real-time)
- FBX (game-engine standard)
- USDZ (Apple AR)

**Decoder head:**

```
TriplaneForgeDecoder
├── Occupancy Network decoder
│   └── MLP: 3D point + triplane → occupancy probability
├── Differentiable Marching Cubes → initial raw mesh
├── Mesh Refinement Network
│   ├── Graph neural network over mesh vertices/edges
│   ├── 8 message-passing rounds
│   └── Predicts vertex position offsets → clean topology
├── UV Unwrapper (learned, SeamlessUV lineage)
├── Texture Diffusion Head
│   ├── 2D flow matching in UV space
│   ├── Albedo + roughness + metallic + normal maps
│   └── 1024×1024 or 2048×2048 texture atlas
└── LOD Generator → 4 polycount levels (100% / 50% / 25% / 10%)
```

**Special modules:**

- **Topology optimizer**: enforces quad-dominant topology for animation rigs
- **Symmetry enforcer**: optional bilateral symmetry for characters/vehicles
- **Scale normalizer**: outputs at real-world scale (meters) with unit metadata
- **Material classifier**: auto-tags materials (metal, wood, fabric, glass, etc.)
- **Animation-ready flag**: detects and preserves edge loops needed for rigging

**Polygon counts:**

- Low-poly asset: 500–5K triangles — ~6 seconds on A100
- Mid-poly asset: 5K–50K triangles — ~15 seconds on A100
- High-poly asset: 50K–500K triangles — ~45 seconds on A100

---

### Voxel Cast — 3D Printable Generation Decoder

**Task:** Generate physically valid, printable 3D models: watertight, manifold, structurally sound.
**Output formats:**

- STL (universal printing format)
- OBJ (watertight)
- STEP (CAD-compatible, parametric)
- 3MF (modern printing format with material data)

**Decoder head:**

```
TriplaneCastDecoder
├── SDF (Signed Distance Field) decoder
│   └── MLP: 3D point + triplane → signed distance value
├── SDF → Watertight Mesh (dual marching cubes, no holes guaranteed)
├── Printability Validator
│   ├── Wall thickness checker (min 1.2mm enforced)
│   ├── Overhang analyzer (>45° flagged + support detection)
│   ├── Manifold checker + auto-repair
│   └── Volume/surface area calculator
├── Support Structure Generator (optional)
│   └── Generates minimal support trees for FDM printing
├── STEP Converter (via Open CASCADE bindings)
└── Slicer Preview Renderer (preview only, not a full slicer)
```

**Special modules:**

- **Structural stress analyzer**: basic FEA simulation to detect weak points
- **Hollowing engine**: auto-hollows solid objects with configurable wall thickness + drain holes
- **Interlocking part splitter**: splits large objects into printable parts with snap-fit joints
- **Material suggester**: recommends PLA / PETG / resin based on geometry complexity
- **Scale validator**: ensures the object is printable at the specified scale on common bed sizes (Bambu, Prusa, Ender)

**Validation requirements (all Cast outputs must pass):**

- Zero non-manifold edges
- Zero self-intersections
- Minimum wall thickness ≥ 1.2mm at the requested scale
- Watertight (no open boundaries)

---

### Voxel Lens — NeRF / Gaussian Splatting Decoder

**Task:** Generate photorealistic 3D scenes represented as Neural Radiance Fields or 3D Gaussian Splats — primarily for visualization, VR/AR, and cinematic rendering.
**Output formats:**

- `.ply` (3D Gaussian Splatting — compatible with standard 3DGS viewers)
- NeRF weights (Instant-NGP / Nerfstudio compatible)
- MP4 render (pre-rendered orbital video)
- Depth maps + normal maps (per-view, for downstream use)

**Decoder head:**

```
TriplaneLensDecoder
├── Gaussian Parameter Decoder
│   ├── Samples 3D Gaussian centers from triplane density
│   ├── Per-Gaussian: position (3), rotation (4, quaternion), scale (3),
│   │   opacity (1), spherical harmonics coefficients (48) → color
│   └── Targets: 500K–3M Gaussians per scene
├── Gaussian Densification Module
│   ├── Adaptive densification: split/clone in high-gradient regions
│   └── Pruning: remove low-opacity Gaussians
├── NeRF branch (parallel)
│   ├── Hash-grid encoder (Instant-NGP style)
│   └── Tiny MLP: encoded position → density + color
├── Rasterizer (differentiable 3DGS rasterizer)
│   └── Used during training for photometric loss
└── Novel View Synthesizer
    └── Renders arbitrary camera trajectories for MP4 export
```

**Special modules:**

- **Lighting decomposition**: separates the scene into albedo + illumination components
- **Dynamic scene support**: temporal Gaussian sequences for animated scenes (from video input)
- **Background/foreground separator**: isolates subject from environment
- **Camera trajectory planner**: auto-generates cinematic orbital/fly-through paths
- **Compression module**: reduces 3DGS file size by 60–80% with minimal quality loss

**Generation modes:**

- Object-centric: single object, orbital views — ~12 seconds on A100
- Indoor scene: full room with lighting — ~40 seconds on A100
- Outdoor scene: landscape or street — ~90 seconds on A100

---

### Voxel Prime — Closed Source All-in-One

**Access:** API only. Not open source. Weights never distributed.
Voxel Prime contains all four decoder heads simultaneously, plus:

**Additional Prime-only modules:**

- **Cross-task consistency**: ensures Atlas world + Forge assets + Lens scene all match when generated together
- **Scene population engine**: generates a world (Atlas), then auto-populates it with assets (Forge)
- **Pipeline orchestrator**: chains Atlas → Forge → Cast → Lens in one API call
- **Photorealistic texture upscaler**: 4× super-resolution on all generated textures
- **Style transfer module**: applies an artistic style (e.g. "Studio Ghibli", "cyberpunk", "brutalist architecture") across all output types
- **Iterative refinement**: text-guided editing of already-generated 3D content

**API endpoint:**

```
POST /v1/voxel/generate
{
  "prompt": "A medieval castle on a cliff at sunset",
  "output_types": ["world", "mesh", "nerf"],  # any combination
  "inputs": {
    "image": "base64...",        # optional reference image
    "multiview": ["base64..."],  # optional multi-view images
    "video": "base64...",        # optional video
    "model": "base64..."         # optional existing 3D model
  },
  "settings": {
    "quality": "high",           # draft | standard | high
    "style": "realistic",        # realistic | stylized | low-poly | ...
    "scale_meters": 100.0,       # real-world scale
    "symmetry": false,
    "printable": false
  }
}
```

---

## Shared Custom Modules (All Models)

| # | Module | Description |
|---|---|---|
| 1 | **Multi-Modal Conditioning Fusion** | CrossModalAttention over all active input types |
| 2 | **3D RoPE Encoder** | RoPE adapted for triplane 3D spatial positions |
| 3 | **Geometry Quality Scorer** | Rates generated geometry quality [0–1] before output |
| 4 | **Semantic Label Head** | Per-voxel/vertex semantic class (wall, floor, tree, etc.) |
| 5 | **Scale & Unit Manager** | Enforces consistent real-world scale across all outputs |
| 6 | **Material Property Head** | Predicts PBR material properties (roughness, metallic, IOR) |
| 7 | **Confidence & Uncertainty Head** | Per-region generation confidence — flags uncertain areas |
| 8 | **Prompt Adherence Scorer** | CLIP-based score: how well output matches the text prompt |
| 9 | **Multi-Resolution Decoder** | Generates at 64³ → 128³ → 256³ coarse-to-fine |
| 10 | **Style Embedding Module** | Encodes style reference images into a style conditioning vector |

---

## Training Data Plan

| Dataset | Content | Used by |
|---|---|---|
| ShapeNet (55K models) | Common 3D objects | Forge, Cast |
| Objaverse (800K+ models) | Diverse 3D assets | Forge, Cast, Lens |
| Objaverse-XL (10M+ objects) | Massive scale | All |
| ScanNet / ScanNet++ | Indoor 3D scans | Atlas, Lens |
| KITTI / nuScenes | Outdoor 3D scenes | Atlas, Lens |
| ABO (Amazon Berkeley Objects) | Product meshes + materials | Forge |
| Thingiverse (printable models) | 3D printable STLs | Cast |
| Polycam scans | Real-world 3DGS/NeRF | Lens |
| Synthetic renders (generated) | Multi-view rendered images | All |
| Text-3D pairs (synthetic) | GPT-4o-generated descriptions of Objaverse | All |

---

## Parameter Estimates

| Model | Backbone | Decoder Head | Total | VRAM (BF16) |
|---|---|---|---|---|
| Voxel Atlas | 2.3B | ~400M | ~2.7B | ~22GB |
| Voxel Forge | 2.3B | ~350M | ~2.65B | ~21GB |
| Voxel Cast | 2.3B | ~200M | ~2.5B | ~20GB |
| Voxel Lens | 2.3B | ~500M | ~2.8B | ~22GB |
| Voxel Prime | 2.3B | ~1.4B (all 4) | ~3.7B | ~30GB |

All models fit on an A100 40GB in BF16. INT8 quantization brings all of them under 15GB (consumer 4090 viable).
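To make the flow-matching objective and guided sampler described under "Flow Matching Training" concrete, here is a minimal NumPy sketch under stated assumptions — it is illustrative, not the planned implementation: `toy_vector_field` is a hypothetical stand-in for the 2.3B DiT backbone, and the latent/conditioning dimensions are toy-sized. The straight-line interpolant, velocity target, 10% conditioning dropout, and 25-step Euler sampler with guidance follow the spec above.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_vector_field(x_t, t, c):
    """Hypothetical stand-in for the DiT backbone v_theta(x_t, t, c)."""
    return np.tanh(x_t + t[:, None] + c.mean(axis=-1, keepdims=True))

def flow_matching_loss(x1, c, p_uncond=0.1):
    """Conditional flow matching with optimal-transport (straight-line) paths."""
    x0 = rng.standard_normal(x1.shape)              # noise endpoint
    t = rng.random(x1.shape[0])                     # t ~ U[0, 1]
    x_t = (1 - t[:, None]) * x0 + t[:, None] * x1   # straight interpolant
    u = x1 - x0                                     # target velocity (constant along path)
    drop = rng.random((len(c), 1)) < p_uncond       # 10% unconditional dropout for CFG
    v = toy_vector_field(x_t, t, np.where(drop, 0.0, c))
    return float(((v - u) ** 2).mean())             # regress v_theta onto u

def sample(c, steps=25, guidance=7.5, dim=8):
    """Euler-integrate the learned ODE from noise (t=0) to data (t=1) with CFG."""
    x = rng.standard_normal((len(c), dim))
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full(len(c), i * dt)
        v_cond = toy_vector_field(x, t, c)
        v_uncond = toy_vector_field(x, t, np.zeros_like(c))
        x = x + dt * (v_uncond + guidance * (v_cond - v_uncond))
    return x

x1 = rng.standard_normal((4, 8))      # toy "clean" latents (stand-in for triplane tokens)
cond = rng.standard_normal((4, 16))   # toy conditioning (stand-in for 1024-dim embeddings)
loss = flow_matching_loss(x1, cond)
out = sample(cond)
print(out.shape)  # (4, 8)
```

Note that classifier-free guidance costs two backbone evaluations per Euler step, so a 25-step trajectory is 50 NFE — consistent with the 20–50 NFE budget quoted above.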
---

## Training Strategy

### Phase 1 — Backbone Pre-training

- Train the shared backbone on Objaverse-XL triplane reconstructions
- Learn general 3D structure without task-specific heads
- Conditioning: text + single image only
- 100K steps, A100 cluster

### Phase 2 — Decoder Head Training (parallel)

- Freeze the backbone, train each decoder head independently
- Atlas: ScanNet + synthetic world data
- Forge: ShapeNet + Objaverse + texture data
- Cast: Thingiverse + watertight synthetic meshes
- Lens: Polycam + synthetic multi-view renders
- 50K steps each

### Phase 3 — Joint Fine-tuning

- Unfreeze the backbone, fine-tune end-to-end per specialist model
- Add all input modalities (video, multi-view, point cloud)
- 30K steps each

### Phase 4 — Prime Training

- Initialize from the jointly fine-tuned backbone
- Train all decoder heads simultaneously
- Cross-task consistency losses
- Prime-only module training (pipeline orchestrator, style transfer)
- 50K steps

---

## HuggingFace Plan

```
Matrix-Corp/Voxel-Atlas-V1 — open source
Matrix-Corp/Voxel-Forge-V1 — open source
Matrix-Corp/Voxel-Cast-V1 — open source
Matrix-Corp/Voxel-Lens-V1 — open source
Matrix-Corp/Voxel-Prime-V1 — closed source, API only (card only, no weights)
```

Collection: `Matrix-Corp/voxel-v1`

---

## Status

- 🔴 Planned — architecture specification complete
- Backbone design finalized
- Decoder head designs finalized
- Training data sourcing: TBD
- Compute requirements: significant (A100 cluster for training)
- Timeline: TBD
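---

To close with one concrete illustration of the shared latent space: the triplane query described under "Core Architecture" — project a 3D point onto the XY, XZ, and YZ planes and sum the sampled features — fits in a few lines. This is a nearest-neighbor NumPy toy under assumed conventions (planes stored channels-last, points normalized to [0, 1]³); the actual decoder heads would sample bilinearly and feed the result to learned MLPs.

```python
import numpy as np

def query_triplane(planes, points):
    """Look up features for 3D points from three axis-aligned planes.

    planes: (3, R, R, C) array — planes ordered XY, XZ, YZ, channels-last.
    points: (N, 3) coordinates normalized to [0, 1]^3.
    Returns (N, C): summed features from the three plane projections.
    """
    R = planes.shape[1]
    # Nearest-neighbor pixel index per coordinate (toy; real decoders: bilinear)
    idx = np.clip(np.rint(points * (R - 1)).astype(int), 0, R - 1)
    x, y, z = idx[:, 0], idx[:, 1], idx[:, 2]
    # Project each point onto the three planes, gather features, and sum
    return planes[0][x, y] + planes[1][x, z] + planes[2][y, z]

rng = np.random.default_rng(0)
planes = rng.standard_normal((3, 256, 256, 32))  # spec: 3 planes, 256x256, 32 ch
pts = rng.random((100, 3))                       # 100 random query points
feats = query_triplane(planes, pts)
print(feats.shape)  # (100, 32)
```

A downstream occupancy, SDF, or radiance head is then simply an MLP over these per-point features, which is what the Forge, Cast, and Lens decoder trees above describe.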