---
title: Matrix Voxel
emoji: π
colorFrom: green
colorTo: pink
sdk: static
pinned: false
license: cc-by-nc-nd-4.0
short_description: The next gen 3D generator
---
# Matrix Voxel – Full Architecture & Planning Document

**3D Generation Model Family | Matrix.Corp**

---

## Family Overview

Matrix Voxel is Matrix.Corp's 3D generation family: five models sharing a common flow-matching backbone, each with a task-specific decoder head. Four specialist models are open source; one unified all-in-one model (Voxel Prime) is closed source and API-only.
| Model | Task | Output Formats | Source | Hardware | Status |
|---|---|---|---|---|---|
| Voxel Atlas | World / environment generation | Voxel grids, OBJ scenes, USD stages | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Forge | 3D mesh / asset generation | OBJ, GLB, FBX, USDZ | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Cast | 3D printable model generation | STL, OBJ (watertight), STEP | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Lens | NeRF / Gaussian Splatting scenes | .ply (3DGS), NeRF weights, MP4 render | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Prime | All-in-one unified generation | All of the above | 🟣 Closed Source | API only | 🔴 Planned |
---

## Input Modalities (All Models)

Every Voxel model accepts any combination of:

| Input | Description | Encoder |
|---|---|---|
| Text prompt | Natural language description of the desired 3D output | CLIP-ViT-L / T5-XXL |
| Single image | Reference image → 3D lift | DINOv2 + custom depth encoder |
| Multi-view images | 2–12 images from different angles | Multi-view transformer encoder |
| Video | Extracts frames, infers 3D from motion | Temporal encoder (Video-MAE lineage) |
| 3D model | Existing mesh / point cloud as conditioning | PointNet++ encoder |

All inputs are projected into a shared 1024-dim conditioning embedding space before entering the backbone.
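As a simplified sketch of this projection step, the snippet below maps per-modality encoder outputs of differing widths into the shared 1024-dim space with one linear projection each. The per-encoder widths (T5-XXL at 4096-dim, DINOv2-L and PointNet++ global features at 1024-dim) and the random projection matrices are illustrative stand-ins; only the 1024-dim target comes from this spec.

```python
import numpy as np

D_COND = 1024  # shared conditioning width from the spec

# Hypothetical per-encoder output widths (illustrative).
ENCODER_DIMS = {"text": 4096, "image": 1024, "pointcloud": 1024}

rng = np.random.default_rng(0)
# One learned projection matrix per modality (random stand-ins here),
# scaled so outputs keep roughly unit variance.
projections = {name: rng.standard_normal((dim, D_COND)) * (dim ** -0.5)
               for name, dim in ENCODER_DIMS.items()}

def project_conditioning(features: dict[str, np.ndarray]) -> np.ndarray:
    """Map each active modality's tokens into the shared 1024-dim space
    and concatenate along the token axis."""
    projected = [feats @ projections[name] for name, feats in features.items()]
    return np.concatenate(projected, axis=0)

# Example: 8 text tokens + 1 global point-cloud feature -> 9 shared tokens.
cond = project_conditioning({
    "text": rng.standard_normal((8, 4096)),
    "pointcloud": rng.standard_normal((1, 1024)),
})
print(cond.shape)  # (9, 1024)
```

Any subset of modalities can be active; the fusion stage downstream only ever sees tokens of a single width.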
---

## Core Architecture – Shared Flow Matching Backbone

### Why Flow Matching?

Flow matching (Lipman et al. 2022, extended by the Stable Diffusion 3 / FLUX lineage) learns a direct vector field from noise → data. It is faster than DDPM diffusion at inference (typically 20–50 steps vs 1000), trains more stably, and offers better mode coverage. As of 2025–2026 it is the state of the art for generative modeling.
### 3D Representation – Triplane + Latent Voxel Grid

All Voxel models operate in a shared latent 3D space:

- **Triplane representation**: three axis-aligned feature planes (XY, XZ, YZ), each 256×256 with 32 channels
- Any 3D point is queried by projecting it onto all three planes and summing the sampled features
- Compact (3 × 256 × 256 × 32 ≈ 6.3M latent values) yet expressive
- Flow matching operates on this triplane latent space, not on raw 3D points
- Decoder heads decode the triplane into each task-specific output format
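The point query above can be sketched in plain NumPy: project onto each plane, bilinearly sample, and sum. The resolution and channel count follow the spec; the random planes are stand-ins for learned features.

```python
import numpy as np

R, C = 256, 32  # plane resolution and channels, per the spec

rng = np.random.default_rng(0)
# Three axis-aligned feature planes: XY, XZ, YZ.
planes = {ax: rng.standard_normal((R, R, C)) for ax in ("xy", "xz", "yz")}

def _sample(plane: np.ndarray, u: float, v: float) -> np.ndarray:
    """Bilinearly sample an (R, R, C) plane at continuous coords in [0, 1]."""
    x, y = u * (R - 1), v * (R - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, R - 1), min(y0 + 1, R - 1)
    fx, fy = x - x0, y - y0
    return ((1 - fx) * (1 - fy) * plane[x0, y0] + fx * (1 - fy) * plane[x1, y0]
            + (1 - fx) * fy * plane[x0, y1] + fx * fy * plane[x1, y1])

def query_triplane(p: np.ndarray) -> np.ndarray:
    """Project a point in [0,1]^3 onto all three planes and sum features."""
    x, y, z = p
    return (_sample(planes["xy"], x, y)
            + _sample(planes["xz"], x, z)
            + _sample(planes["yz"], y, z))

feat = query_triplane(np.array([0.5, 0.25, 0.75]))
print(feat.shape)  # (32,)
```

Summation (rather than concatenation) keeps the per-point feature at 32 channels regardless of how many planes contribute.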
### Backbone Architecture

```
VoxelBackbone
├── Input Encoder (multimodal conditioning)
│   ├── TextEncoder       → T5-XXL + CLIP-ViT-L, projected to 1024-dim
│   ├── ImageEncoder      → DINOv2-L, projected to 1024-dim
│   ├── MultiViewEncoder  → custom transformer over N views
│   ├── VideoEncoder      → Video-MAE, temporal pooling → 1024-dim
│   └── PointCloudEncoder → PointNet++, global + local features → 1024-dim
│
├── Conditioning Fusion
│   └── CrossModalAttention → fuses all active input modalities
│
├── Flow Matching Transformer (DiT-style)
│   ├── 24 transformer blocks
│   ├── Hidden dim: 1536
│   ├── Heads: 24
│   ├── Conditioning: AdaLN-Zero (timestep + conditioning signal)
│   ├── 3D RoPE positional encoding for triplane tokens
│   └── ~2.3B parameters
│
└── Triplane Decoder (shared across all specialist models)
    └── Outputs: triplane feature tensor (3 × 256 × 256 × 32)
```
### Flow Matching Training

- Learn a vector field v_θ(x_t, t, c), where x_t is the noisy triplane and c is the conditioning
- Optimal-transport flow: straight paths from noise → data (better than DDPM's curved paths)
- Inference: 20–50 NFE (number of function evaluations) → fast on A100
- Classifier-free guidance: 10% unconditional dropout during training
- Guidance scale 5.0–10.0 at inference
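The optimal-transport objective above reduces to a simple interpolation-and-regress recipe. This toy sketch (1D tensors standing in for triplanes) shows the training pair construction and why straight paths integrate in few steps; the oracle velocity field replaces a trained v_θ.

```python
import numpy as np

rng = np.random.default_rng(0)

def ot_pair(x0: np.ndarray, x1: np.ndarray, t: float):
    """Straight-line (optimal-transport) path from noise x0 to data x1:
    returns the interpolated state x_t and the regression target v."""
    x_t = (1 - t) * x0 + t * x1
    v_target = x1 - x0          # constant along the straight path
    return x_t, v_target

x0 = rng.standard_normal(8)     # noise sample
x1 = np.ones(8)                 # stand-in "data" sample

# Training: at a random t, the model v_theta(x_t, t, c) regresses v_target
# with an MSE loss (classifier-free guidance drops c ~10% of the time).
t = rng.uniform()
x_t, v_target = ot_pair(x0, x1, t)

# Sampling: Euler-integrate the velocity field from noise. With the oracle
# constant field of a straight path, any step count lands exactly on x1 --
# the intuition behind needing only 20-50 steps in practice.
x, steps = x0.copy(), 20
for _ in range(steps):
    x = x + v_target / steps
print(np.allclose(x, x1))  # True
```

At inference with guidance, the integrated field would be `v_uncond + s * (v_cond - v_uncond)` with s in the 5.0–10.0 range quoted above.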
---

## Task-Specific Decoder Heads

Each specialist model adds a decoder head on top of the shared triplane output.

---
### Voxel Atlas – World Generation Decoder

**Task:** Generate full 3D environments and worlds – terrain, buildings, vegetation, interior spaces.

**Output formats:**

- Voxel grids (`.vox`, MagicaVoxel format) – for Minecraft-style worlds
- OBJ scene (multiple meshes with materials) – for Unity/Unreal environments
- USD stage (`.usd`) – industry-standard scene format
**Decoder head:**

```
TriplaneAtlasDecoder
├── Scene Layout Transformer
│   ├── Divides space into semantic regions (terrain, structures, vegetation, sky)
│   └── 6-layer transformer over a 32×32 spatial grid of scene tokens
├── Region-wise NeRF decoder (per semantic region)
│   └── MLP: 3D coords + triplane features → density + RGB + semantic label
├── Marching Cubes extractor → raw mesh per region
├── Scene graph assembler → parent-child relationships between objects
├── Voxelizer (for .vox output) → discretizes to user-specified resolution
└── USD exporter → full scene hierarchy with lighting + materials
```
**Special modules:**

- **Infinite world tiling**: generates seamless adjacent chunks that stitch together
- **Biome-aware generation**: desert, forest, urban, underwater, space, fantasy
- **LOD generator**: auto-generates 4 levels of detail per scene object
- **Lighting estimator**: infers plausible sun/sky lighting from scene content

**Typical generation sizes:**

- Small scene: 64×64×64 voxels or ~500 m² OBJ scene → ~8 seconds on A100
- Large world chunk: 256×256×128 voxels → ~35 seconds on A100
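The spec does not detail how the LOD generator works; for voxel output, a minimal sketch is repeated 2×2×2 max-pooling of the occupancy grid, which keeps a coarse voxel occupied whenever any of its sub-voxels is. The function names are illustrative.

```python
import numpy as np

def downsample_occupancy(occ: np.ndarray) -> np.ndarray:
    """Halve each axis of a boolean occupancy grid with 2x2x2 max-pooling,
    so any occupied sub-voxel keeps the coarse voxel occupied."""
    d, h, w = occ.shape
    return occ.reshape(d // 2, 2, h // 2, 2, w // 2, 2).any(axis=(1, 3, 5))

def lod_chain(occ: np.ndarray, levels: int = 4) -> list[np.ndarray]:
    """LOD0 is the input; each further level halves the resolution."""
    chain = [occ]
    for _ in range(levels - 1):
        chain.append(downsample_occupancy(chain[-1]))
    return chain

grid = np.zeros((64, 64, 64), dtype=bool)
grid[10:20, 10:20, 10:20] = True          # a solid cube of occupied voxels
lods = lod_chain(grid)
print([g.shape[0] for g in lods])  # [64, 32, 16, 8]
```

Max-pooling (rather than averaging) is the conservative choice for collision and silhouette preservation at distance.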
---

### Voxel Forge – Mesh / Asset Generation Decoder

**Task:** Generate clean, game-ready 3D assets – characters, objects, props, vehicles, architecture.

**Output formats:**

- OBJ + MTL (universal)
- GLB/GLTF (web & real-time)
- FBX (game-engine standard)
- USDZ (Apple AR)
**Decoder head:**

```
TriplaneForgeDecoder
├── Occupancy Network decoder
│   └── MLP: 3D point + triplane → occupancy probability
├── Differentiable Marching Cubes → initial raw mesh
├── Mesh Refinement Network
│   ├── Graph neural network over mesh vertices/edges
│   ├── 8 message-passing rounds
│   └── Predicts vertex position offsets → clean topology
├── UV Unwrapper (learned, SeamlessUV lineage)
├── Texture Diffusion Head
│   ├── 2D flow matching in UV space
│   ├── Albedo + roughness + metallic + normal maps
│   └── 1024×1024 or 2048×2048 texture atlas
└── LOD Generator → 4 polycount levels (100% / 50% / 25% / 10%)
```
**Special modules:**

- **Topology optimizer**: enforces quad-dominant topology for animation rigs
- **Symmetry enforcer**: optional bilateral symmetry for characters/vehicles
- **Scale normalizer**: outputs at real-world scale (meters) with unit metadata
- **Material classifier**: auto-tags materials (metal, wood, fabric, glass, etc.)
- **Animation-ready flag**: detects and preserves edge loops needed for rigging
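The symmetry enforcer's algorithm is not specified; one simple post-hoc approach is to pair each vertex with its nearest mirrored counterpart and average the two, as sketched below. The brute-force pairing and the x = 0 mirror plane are assumptions for illustration; production systems match topology rather than nearest points.

```python
import numpy as np

def enforce_bilateral_symmetry(verts: np.ndarray) -> np.ndarray:
    """Symmetrize vertex positions about the x = 0 plane by averaging each
    vertex with the reflection of its mirror partner."""
    mirrored = verts.copy()
    mirrored[:, 0] *= -1.0
    # Pair each vertex with the nearest mirrored vertex (brute force O(N^2)).
    d2 = ((verts[:, None, :] - mirrored[None, :, :]) ** 2).sum(-1)
    partner = d2.argmin(axis=1)
    return 0.5 * (verts + mirrored[partner])

verts = np.array([[1.02, 0.0, 0.5],
                  [-0.98, 0.0, 0.5]])   # slightly asymmetric mirror pair
sym = enforce_bilateral_symmetry(verts)
print(sym[:, 0])  # [ 1. -1.] -- x-coordinates now mirror exactly
```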
**Polygon counts:**

- Low-poly asset: 500–5K triangles → ~6 seconds on A100
- Mid-poly asset: 5K–50K triangles → ~15 seconds on A100
- High-poly asset: 50K–500K triangles → ~45 seconds on A100
---

### Voxel Cast – 3D Printable Generation Decoder

**Task:** Generate physically valid, printable 3D models – watertight, manifold, and structurally sound.

**Output formats:**

- STL (universal printing format)
- OBJ (watertight)
- STEP (CAD-compatible, parametric)
- 3MF (modern printing format with material data)
**Decoder head:**

```
TriplaneCastDecoder
├── SDF (Signed Distance Field) decoder
│   └── MLP: 3D point + triplane → signed distance value
├── SDF → Watertight Mesh (dual marching cubes, no holes guaranteed)
├── Printability Validator
│   ├── Wall thickness checker (min 1.2 mm enforced)
│   ├── Overhang analyzer (>45° flagged + support detection)
│   ├── Manifold checker + auto-repair
│   └── Volume / surface area calculator
├── Support Structure Generator (optional)
│   └── Generates minimal support trees for FDM printing
├── STEP Converter (via Open CASCADE bindings)
└── Slicer Preview Renderer (preview only, not a full slicer)
```
**Special modules:**

- **Structural stress analyzer**: basic FEA simulation to detect weak points
- **Hollowing engine**: auto-hollows solid objects with configurable wall thickness + drain holes
- **Interlocking part splitter**: splits large objects into printable parts with snap-fit joints
- **Material suggester**: recommends PLA / PETG / resin based on geometry complexity
- **Scale validator**: ensures the object is printable at the specified scale on common bed sizes (Bambu, Prusa, Ender)

**Validation requirements (all Cast outputs must pass):**

- Zero non-manifold edges
- Zero self-intersections
- Minimum wall thickness ≥ 1.2 mm at the requested scale
- Watertight (no open boundaries)
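The watertight and manifold-edge requirements share a classic combinatorial test: in a closed, edge-manifold triangle mesh, every undirected edge is used by exactly two faces. A minimal checker:

```python
from collections import Counter

def is_watertight(faces: list[tuple[int, int, int]]) -> bool:
    """True iff every undirected edge is shared by exactly two faces.
    A count of 1 means an open boundary; 3+ means a non-manifold edge."""
    edges = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            edges[tuple(sorted((u, v)))] += 1
    return all(count == 2 for count in edges.values())

# A tetrahedron (4 triangles) is the smallest watertight mesh.
tetra = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]
print(is_watertight(tetra))      # True
print(is_watertight(tetra[:3]))  # False: three edges are open boundaries
```

Note this edge-count test covers the first and last requirements above; self-intersection and wall-thickness checks need geometric (not just combinatorial) analysis.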
---

### Voxel Lens – NeRF / Gaussian Splatting Decoder

**Task:** Generate photorealistic 3D scenes represented as Neural Radiance Fields or 3D Gaussian Splats – primarily for visualization, VR/AR, and cinematic rendering.

**Output formats:**

- `.ply` (3D Gaussian Splatting – compatible with standard 3DGS viewers)
- NeRF weights (Instant-NGP / Nerfstudio compatible)
- MP4 render (pre-rendered orbital video)
- Depth maps + normal maps (per view, for downstream use)
**Decoder head:**

```
TriplaneLensDecoder
├── Gaussian Parameter Decoder
│   ├── Samples 3D Gaussian centers from triplane density
│   ├── Per-Gaussian: position (3), rotation (4, quaternion), scale (3),
│   │   opacity (1), spherical harmonics coefficients (48) → color
│   └── Targets: 500K–3M Gaussians per scene
├── Gaussian Densification Module
│   ├── Adaptive densification: split/clone in high-gradient regions
│   └── Pruning: remove low-opacity Gaussians
├── NeRF branch (parallel)
│   ├── Hash-grid encoder (Instant-NGP style)
│   └── Tiny MLP: encoded position → density + color
├── Rasterizer (differentiable 3DGS rasterizer)
│   └── Used during training for photometric loss
└── Novel View Synthesizer
    └── Renders arbitrary camera trajectories for MP4 export
```
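The per-Gaussian layout above totals 3 + 4 + 3 + 1 + 48 = 59 floats (the 48 color coefficients being 16 degree-3 spherical-harmonic basis functions × 3 channels, as in standard 3DGS). A packing/unpacking sketch, with the pruning step from the densification module; the field names and the 0.05 opacity threshold are illustrative.

```python
import numpy as np

# Per-Gaussian layout from the spec: position(3) + rotation quaternion(4)
# + scale(3) + opacity(1) + SH color coefficients(48) = 59 floats.
FIELDS = {"position": 3, "rotation": 4, "scale": 3, "opacity": 1, "sh": 48}
STRIDE = sum(FIELDS.values())  # 59

def unpack(flat: np.ndarray) -> dict[str, np.ndarray]:
    """Split an (N, 59) array into named per-Gaussian parameter blocks."""
    out, offset = {}, 0
    for name, width in FIELDS.items():
        out[name] = flat[:, offset:offset + width]
        offset += width
    return out

def prune_low_opacity(flat: np.ndarray, threshold: float = 0.05) -> np.ndarray:
    """Pruning step: drop near-transparent Gaussians
    (sigmoid-activated opacity below threshold)."""
    opacity = 1.0 / (1.0 + np.exp(-unpack(flat)["opacity"][:, 0]))
    return flat[opacity >= threshold]

rng = np.random.default_rng(0)
splats = rng.standard_normal((1000, STRIDE))
kept = prune_low_opacity(splats)
print(STRIDE, kept.shape)
```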
**Special modules:**

- **Lighting decomposition**: separates the scene into albedo + illumination components
- **Dynamic scene support**: temporal Gaussian sequences for animated scenes (from video input)
- **Background/foreground separator**: isolates the subject from the environment
- **Camera trajectory planner**: auto-generates cinematic orbital / fly-through paths
- **Compression module**: reduces 3DGS file size by 60–80% with minimal quality loss

**Generation modes:**

- Object-centric: single object, orbital views → ~12 seconds on A100
- Indoor scene: full room with lighting → ~40 seconds on A100
- Outdoor scene: landscape or street → ~90 seconds on A100
---

### Voxel Prime – Closed Source All-in-One

**Access:** API only. Not open source; weights are never distributed.

Voxel Prime contains all four decoder heads simultaneously, plus:

**Additional Prime-only modules:**

- **Cross-task consistency**: ensures the Atlas world, Forge assets, and Lens scene all match when generated together
- **Scene population engine**: generates a world (Atlas), then auto-populates it with assets (Forge)
- **Pipeline orchestrator**: chains Atlas → Forge → Cast → Lens in one API call
- **Photorealistic texture upscaler**: 4× super-resolution on all generated textures
- **Style transfer module**: applies an artistic style (e.g. "Studio Ghibli", "cyberpunk", "brutalist architecture") across all output types
- **Iterative refinement**: text-guided editing of already-generated 3D content
**API endpoint:**

```
POST /v1/voxel/generate
{
  "prompt": "A medieval castle on a cliff at sunset",
  "output_types": ["world", "mesh", "nerf"],  # any combination
  "inputs": {
    "image": "base64...",        # optional reference image
    "multiview": ["base64..."],  # optional multi-view images
    "video": "base64...",        # optional video
    "model": "base64..."         # optional existing 3D model
  },
  "settings": {
    "quality": "high",           # draft | standard | high
    "style": "realistic",        # realistic | stylized | low-poly | ...
    "scale_meters": 100.0,       # real-world scale
    "symmetry": false,
    "printable": false
  }
}
```
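A client-side helper for building and validating this request body might look like the following. Everything here is an assumption layered on the endpoint sketch: the `printable` output identifier, the helper name, and the validation rules are illustrative, not a published client API.

```python
import json

VALID_OUTPUTS = {"world", "mesh", "printable", "nerf"}   # assumed identifiers
VALID_QUALITY = {"draft", "standard", "high"}

def build_generate_payload(prompt: str, output_types: list[str],
                           quality: str = "standard",
                           scale_meters: float = 1.0,
                           **inputs) -> str:
    """Validate and serialize a /v1/voxel/generate request body.

    `inputs` may carry the optional base64 conditioning fields from the
    endpoint sketch above: image, multiview, video, model."""
    if not set(output_types) <= VALID_OUTPUTS:
        raise ValueError(f"unknown output types: {set(output_types) - VALID_OUTPUTS}")
    if quality not in VALID_QUALITY:
        raise ValueError(f"quality must be one of {sorted(VALID_QUALITY)}")
    body = {
        "prompt": prompt,
        "output_types": output_types,
        "inputs": inputs,
        "settings": {"quality": quality, "scale_meters": scale_meters},
    }
    return json.dumps(body)

payload = build_generate_payload(
    "A medieval castle on a cliff at sunset",
    output_types=["world", "nerf"], quality="high", scale_meters=100.0)
print(json.loads(payload)["settings"]["quality"])  # high
```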
---

## Shared Custom Modules (All Models)

| # | Module | Description |
|---|---|---|
| 1 | **Multi-Modal Conditioning Fusion** | CrossModalAttention over all active input types |
| 2 | **3D RoPE Encoder** | RoPE adapted for triplane 3D spatial positions |
| 3 | **Geometry Quality Scorer** | Rates generated geometry quality in [0, 1] before output |
| 4 | **Semantic Label Head** | Per-voxel/vertex semantic class (wall, floor, tree, etc.) |
| 5 | **Scale & Unit Manager** | Enforces consistent real-world scale across all outputs |
| 6 | **Material Property Head** | Predicts PBR material properties (roughness, metallic, IOR) |
| 7 | **Confidence & Uncertainty Head** | Per-region generation confidence → flags uncertain areas |
| 8 | **Prompt Adherence Scorer** | CLIP-based score of how well the output matches the text prompt |
| 9 | **Multi-Resolution Decoder** | Generates at 64³ → 128³ → 256³, coarse-to-fine |
| 10 | **Style Embedding Module** | Encodes style reference images into a style conditioning vector |
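The 64³ → 128³ → 256³ coarse-to-fine scheme (module 9) can be sketched as upsample-then-refine: each stage seeds the next resolution with a nearest-neighbor upsample and lets a decoder pass correct the details. The `refine` callable below is a stand-in for that decoder pass, not part of the spec.

```python
import numpy as np

def upsample_nearest(grid: np.ndarray, factor: int = 2) -> np.ndarray:
    """Nearest-neighbor upsampling of a (D, H, W) grid."""
    return grid.repeat(factor, 0).repeat(factor, 1).repeat(factor, 2)

def coarse_to_fine(refine, base: np.ndarray, stages: int = 2) -> np.ndarray:
    """64^3 -> 128^3 -> 256^3 schedule: upsample, then let `refine`
    (a stand-in for a decoder pass) correct details at each scale."""
    grid = base
    for _ in range(stages):
        grid = refine(upsample_nearest(grid))
    return grid

base = np.zeros((64, 64, 64))
result = coarse_to_fine(lambda g: g, base)   # identity "refiner" for the demo
print(result.shape)  # (256, 256, 256)
```

The payoff is that most flow-matching steps run at the cheap 64³ resolution, with only a few refinement passes at 256³.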
---

## Training Data Plan

| Dataset | Content | Used by |
|---|---|---|
| ShapeNet (55K models) | Common 3D objects | Forge, Cast |
| Objaverse (800K+ models) | Diverse 3D assets | Forge, Cast, Lens |
| Objaverse-XL (10M+ objects) | Massive scale | All |
| ScanNet / ScanNet++ | Indoor 3D scans | Atlas, Lens |
| KITTI / nuScenes | Outdoor 3D scenes | Atlas, Lens |
| ABO (Amazon Berkeley Objects) | Product meshes + materials | Forge |
| Thingiverse (printable models) | 3D printable STLs | Cast |
| Polycam scans | Real-world 3DGS/NeRF | Lens |
| Synthetic renders (generated) | Multi-view rendered images | All |
| Text-3D pairs (synthetic) | GPT-4o-generated descriptions of Objaverse | All |
---

## Parameter Estimates

| Model | Backbone | Decoder Head | Total | VRAM (BF16) |
|---|---|---|---|---|
| Voxel Atlas | 2.3B | ~400M | ~2.7B | ~22GB |
| Voxel Forge | 2.3B | ~350M | ~2.65B | ~21GB |
| Voxel Cast | 2.3B | ~200M | ~2.5B | ~20GB |
| Voxel Lens | 2.3B | ~500M | ~2.8B | ~22GB |
| Voxel Prime | 2.3B | ~1.4B (all 4) | ~3.7B | ~30GB |

All models fit on an A100 40GB in BF16; INT8 quantization brings each under 15GB, making a consumer RTX 4090 viable.
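A quick back-of-envelope check on these figures: weights alone account for only part of the listed VRAM, with the remainder presumably going to activations, triplane latents, and decoder buffers. A weights-only lower bound (helper name is illustrative):

```python
def weight_bytes_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weights-only memory in GB (1 GB = 2**30 bytes here)."""
    return params_billion * 1e9 * bytes_per_param / 2**30

# Voxel Atlas, ~2.7B params:
bf16 = weight_bytes_gb(2.7, 2)   # BF16: 2 bytes/param -> ~5.0 GB
int8 = weight_bytes_gb(2.7, 1)   # INT8: 1 byte/param  -> ~2.5 GB
print(round(bf16, 1), round(int8, 1))
```

So the ~22GB BF16 figure implies roughly 17GB of non-weight inference state; the sub-15GB INT8 claim is plausible only if that state shrinks too (e.g. lower-precision activations or smaller batch/resolution).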
---

## Training Strategy

### Phase 1 – Backbone Pre-training

- Train the shared backbone on Objaverse-XL triplane reconstructions
- Learn general 3D structure without task-specific heads
- Conditioning: text + single image only
- 100K steps, A100 cluster

### Phase 2 – Decoder Head Training (parallel)

- Freeze the backbone, train each decoder head independently
- Atlas: ScanNet + synthetic world data
- Forge: ShapeNet + Objaverse + texture data
- Cast: Thingiverse + watertight synthetic meshes
- Lens: Polycam + synthetic multi-view renders
- 50K steps each

### Phase 3 – Joint Fine-tuning

- Unfreeze the backbone, fine-tune end-to-end per specialist model
- Add all input modalities (video, multi-view, point cloud)
- 30K steps each

### Phase 4 – Prime Training

- Initialize from the jointly fine-tuned backbone
- Train all decoder heads simultaneously
- Cross-task consistency losses
- Prime-only module training (pipeline orchestrator, style transfer)
- 50K steps
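The four phases above can be captured as a small schedule structure, which is a convenient shape for driving a training loop (the dataclass and field names are illustrative; phases 2 and 3 run once per specialist model, so the totals below count a single head's pass):

```python
from dataclasses import dataclass

@dataclass
class Phase:
    name: str
    steps: int
    backbone_frozen: bool
    modalities: tuple[str, ...]

# Transcribed from the training strategy above.
SCHEDULE = [
    Phase("backbone_pretrain", 100_000, False, ("text", "image")),
    Phase("decoder_heads", 50_000, True, ("text", "image")),
    Phase("joint_finetune", 30_000, False,
          ("text", "image", "multiview", "video", "pointcloud")),
    Phase("prime", 50_000, False,
          ("text", "image", "multiview", "video", "pointcloud")),
]

total = sum(p.steps for p in SCHEDULE)
print(total)  # 230000
```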
---

## HuggingFace Plan

```
Matrix-Corp/Voxel-Atlas-V1  → open source
Matrix-Corp/Voxel-Forge-V1  → open source
Matrix-Corp/Voxel-Cast-V1   → open source
Matrix-Corp/Voxel-Lens-V1   → open source
Matrix-Corp/Voxel-Prime-V1  → closed source, API only (card only, no weights)
```

Collection: `Matrix-Corp/voxel-v1`
---

## Status

- 🔴 Planned – architecture specification complete
- Backbone design finalized
- Decoder head designs finalized
- Training data sourcing: TBD
- Compute requirements: significant (A100 cluster for training)
- Timeline: TBD