---
title: Matrix Voxel
emoji: π
colorFrom: green
colorTo: pink
sdk: static
pinned: false
license: cc-by-nc-nd-4.0
short_description: The next gen 3D generator
---
# Matrix Voxel – Full Architecture & Planning Document
3D Generation Model Family | Matrix.Corp

## Family Overview
Matrix Voxel is Matrix.Corp's 3D generation family: five models share a common flow-matching backbone, each adding a task-specific decoder head. Four specialist models are open source; one unified all-in-one model (Voxel Prime) is closed source and API-only.
| Model | Task | Output Formats | Source | Hardware | Status |
|---|---|---|---|---|---|
| Voxel Atlas | World / environment generation | Voxel grids, OBJ scenes, USD stages | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Forge | 3D mesh / asset generation | OBJ, GLB, FBX, USDZ | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Cast | 3D printable model generation | STL, OBJ (watertight), STEP | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Lens | NeRF / Gaussian Splatting scenes | .ply (3DGS), NeRF weights, MP4 render | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Prime | All-in-one unified generation | All of the above | 🟣 Closed Source | API Only | 🔴 Planned |
## Input Modalities (All Models)
Every Voxel model accepts any combination of:
| Input | Description | Encoder |
|---|---|---|
| Text prompt | Natural language description of desired 3D output | CLIP-ViT-L / T5-XXL |
| Single image | Reference image → 3D lift | DINOv2 + custom depth encoder |
| Multi-view images | 2–12 images from different angles | Multi-view transformer encoder |
| Video | Extracts frames, infers 3D from motion | Temporal encoder (Video-MAE lineage) |
| 3D model | Existing mesh/point cloud as conditioning | PointNet++ encoder |
All inputs are projected into a shared 1024-dim conditioning embedding space before entering the backbone.
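The projection step above can be sketched as one learned linear map per modality into the shared 1024-dim space. A minimal numpy sketch – the encoder widths and random weights are illustrative placeholders, not the real encoders:

```python
import numpy as np

# Hypothetical per-modality encoder output widths (illustrative only).
ENCODER_DIMS = {"text": 4096, "image": 1024, "pointcloud": 512}
COND_DIM = 1024  # shared conditioning width from the spec

rng = np.random.default_rng(0)

# One "learned" linear projection per modality: encoder_dim -> 1024.
projections = {
    name: rng.standard_normal((dim, COND_DIM)) / np.sqrt(dim)
    for name, dim in ENCODER_DIMS.items()
}

def project_conditioning(features: dict) -> np.ndarray:
    """Map each active modality's token features into the shared
    1024-dim space and concatenate them along the token axis."""
    tokens = [feats @ projections[name] for name, feats in features.items()]
    return np.concatenate(tokens, axis=0)

# Example: 77 text tokens + 256 image patch tokens -> 333 conditioning tokens.
cond = project_conditioning({
    "text": rng.standard_normal((77, 4096)),
    "image": rng.standard_normal((256, 1024)),
})
print(cond.shape)  # (333, 1024)
```

Because every modality lands in the same space, any combination of inputs can be fused by a single cross-attention stage downstream.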
## Core Architecture – Shared Flow-Matching Backbone

### Why Flow Matching?
Flow matching (Lipman et al. 2022, extended by the Stable Diffusion 3 / FLUX lineage) learns a direct vector field from noise → data. It is faster than DDPM diffusion (fewer inference steps, typically 20–50 vs. 1000), trains more stably, and has better mode coverage. State of the art for generative models as of 2025–2026.
### 3D Representation – Triplane + Latent Voxel Grid
All Voxel models operate in a shared latent 3D space:
- Triplane representation: three axis-aligned feature planes (XY, XZ, YZ), each 256×256 with 32 channels
- Any 3D point is queried by projecting it onto all three planes and summing the sampled features
- Compact (3 × 256 × 256 × 32 ≈ 6.3M latent values) yet expressive
- Flow matching operates on this triplane latent space, not on raw 3D points
- Decoder heads decode the triplane into task-specific output formats
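The triplane lookup described above can be sketched in a few lines. This uses nearest-neighbor sampling for brevity (a real implementation would sample bilinearly), and the random plane features are placeholders:

```python
import numpy as np

R, C = 256, 32  # plane resolution and channels per plane, from the spec
rng = np.random.default_rng(0)
# Three axis-aligned feature planes: XY, XZ, YZ.
planes = {ax: rng.standard_normal((R, R, C)) for ax in ("xy", "xz", "yz")}

def query_triplane(p: np.ndarray) -> np.ndarray:
    """Feature for a 3D point p in [-1, 1]^3: project onto each plane,
    sample it (nearest-neighbor here), and sum the three features."""
    x, y, z = p
    def sample(plane, u, v):
        i = int((u + 1) / 2 * (R - 1))
        j = int((v + 1) / 2 * (R - 1))
        return plane[i, j]
    return (sample(planes["xy"], x, y)
            + sample(planes["xz"], x, z)
            + sample(planes["yz"], y, z))

feat = query_triplane(np.array([0.1, -0.4, 0.7]))
print(feat.shape)  # (32,)
```

Summing three 2D lookups is what keeps the representation compact: storage grows with plane area, not with the cube of the volume resolution.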
### Backbone Architecture

```
VoxelBackbone
├── Input Encoder (multimodal conditioning)
│   ├── TextEncoder → T5-XXL + CLIP-ViT-L, projected to 1024-dim
│   ├── ImageEncoder → DINOv2-L, projected to 1024-dim
│   ├── MultiViewEncoder → custom transformer over N views
│   ├── VideoEncoder → Video-MAE, temporal pooling → 1024-dim
│   └── PointCloudEncoder → PointNet++, global + local features → 1024-dim
│
├── Conditioning Fusion
│   └── CrossModalAttention → fuses all active input modalities
│
├── Flow Matching Transformer (DiT-style)
│   ├── 24 transformer blocks
│   ├── Hidden dim: 1536
│   ├── Heads: 24
│   ├── Conditioning: AdaLN-Zero (timestep + conditioning signal)
│   ├── 3D RoPE positional encoding for triplane tokens
│   └── ~2.3B parameters
│
└── Triplane Decoder (shared across all specialist models)
    └── Outputs: triplane feature tensor (3 × 256 × 256 × 32)
```
### Flow Matching Training
- Learn a vector field v_θ(x_t, t, c), where x_t is the noisy triplane and c is the conditioning
- Optimal-transport flow: straight paths from noise → data (better than DDPM's curved paths)
- Inference: 20–50 NFE (neural function evaluations) – fast on an A100
- Classifier-free guidance: 10% unconditional dropout during training
- Guidance scale 5.0–10.0 at inference
## Task-Specific Decoder Heads
Each specialist model adds a decoder head on top of the shared triplane output.

### Voxel Atlas – World Generation Decoder
Task: Generate full 3D environments and worlds – terrain, buildings, vegetation, interior spaces.

Output formats:
- Voxel grids (.vox, MagicaVoxel format) – for Minecraft-style worlds
- OBJ scene (multiple meshes with materials) – for Unity/Unreal environments
- USD stage (.usd) – industry-standard scene format
Decoder head:

```
TriplaneAtlasDecoder
├── Scene Layout Transformer
│   ├── Divides space into semantic regions (terrain, structures, vegetation, sky)
│   └── 6-layer transformer over 32×32 spatial grid of scene tokens
├── Region-wise NeRF decoder (per semantic region)
│   └── MLP: 3D coords + triplane features → density + RGB + semantic label
├── Marching Cubes extractor → raw mesh per region
├── Scene graph assembler → parent-child relationships between objects
├── Voxelizer (for .vox output) → discretizes to user-specified resolution
└── USD exporter → full scene hierarchy with lighting + materials
```
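The voxelizer stage discretizes a continuous density field into a grid at a user-chosen resolution. A minimal sketch – the sphere density and the 0.5 threshold are illustrative, not the real decoder:

```python
import numpy as np

def voxelize(density_fn, resolution: int, threshold: float = 0.5) -> np.ndarray:
    """Sample a continuous density field (e.g. the region-wise NeRF
    decoder's output) on a cubic grid and threshold it to boolean voxels."""
    axis = np.linspace(-1, 1, resolution)
    x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
    pts = np.stack([x, y, z], axis=-1)
    return density_fn(pts) > threshold

# Toy density: a solid sphere of radius 0.6 at the origin.
sphere = lambda p: (np.linalg.norm(p, axis=-1) < 0.6).astype(float)

grid = voxelize(sphere, resolution=64)
print(grid.shape, grid[32, 32, 32], grid[0, 0, 0])  # (64, 64, 64) True False
```

Writing the boolean grid out as a .vox file is then a matter of serializing the occupied cells in the MagicaVoxel chunk format.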
Special modules:
- Infinite world tiling: generate seamless adjacent chunks that stitch together
- Biome-aware generation: desert, forest, urban, underwater, space, fantasy
- LOD generator: auto-generates 4 levels of detail per scene object
- Lighting estimator: infers plausible sun/sky lighting from scene content
Typical generation sizes:
- Small scene: 64×64×64 voxels or ~500 m² OBJ scene → ~8 seconds on A100
- Large world chunk: 256×256×128 voxels → ~35 seconds on A100
### Voxel Forge – Mesh / Asset Generation Decoder
Task: Generate clean, game-ready 3D assets – characters, objects, props, vehicles, architecture.
Output formats:
- OBJ + MTL (universal)
- GLB/GLTF (web & real-time)
- FBX (game engine standard)
- USDZ (Apple AR)
Decoder head:

```
TriplaneForgeDecoder
├── Occupancy Network decoder
│   └── MLP: 3D point + triplane → occupancy probability
├── Differentiable Marching Cubes → initial raw mesh
├── Mesh Refinement Network
│   ├── Graph neural network over mesh vertices/edges
│   ├── 8 message-passing rounds
│   └── Predicts vertex position offsets → clean topology
├── UV Unwrapper (learned, SeamlessUV lineage)
├── Texture Diffusion Head
│   ├── 2D flow matching in UV space
│   ├── Albedo + roughness + metallic + normal maps
│   └── 1024×1024 or 2048×2048 texture atlas
└── LOD Generator → 4 polycount levels (100% / 50% / 25% / 10%)
```
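The refinement network's message-passing-plus-offset pattern can be sketched with a hand-written stand-in: the same loop structure (several rounds over the edge graph, each producing a per-vertex offset), but with plain Laplacian smoothing in place of the learned GNN update:

```python
import numpy as np

def refine_vertices(verts: np.ndarray, edges: np.ndarray,
                    rounds: int = 8, step: float = 0.5) -> np.ndarray:
    """Stand-in for the Mesh Refinement Network: run message-passing
    rounds over the edge graph and move each vertex by an offset.
    Here the 'learned' offset is Laplacian smoothing (pull toward
    the mean of the neighbors)."""
    verts = verts.copy()
    n = len(verts)
    for _ in range(rounds):
        neighbor_sum = np.zeros_like(verts)
        degree = np.zeros(n)
        for a, b in edges:  # accumulate messages along both edge directions
            neighbor_sum[a] += verts[b]
            neighbor_sum[b] += verts[a]
            degree[a] += 1
            degree[b] += 1
        mean = neighbor_sum / np.maximum(degree, 1)[:, None]
        verts += step * (mean - verts)  # per-vertex offset
    return verts

# Noisy square: smoothing pulls the vertices toward their neighbors.
verts = np.array([[0.0, 0.0], [1.0, 0.1], [1.1, 1.0], [0.0, 1.0]])
edges = np.array([[0, 1], [1, 2], [2, 3], [3, 0]])
smoothed = refine_vertices(verts, edges)
```

The learned version replaces the neighbor-mean rule with an MLP over edge features, so it can sharpen creases instead of only smoothing.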
Special modules:
- Topology optimizer: enforces quad-dominant topology for animation rigs
- Symmetry enforcer: optional bilateral symmetry for characters/vehicles
- Scale normalizer: outputs at real-world scale (meters) with unit metadata
- Material classifier: auto-tags materials (metal, wood, fabric, glass, etc.)
- Animation-ready flag: detects and preserves edge loops needed for rigging
Polygon counts:
- Low-poly asset: 500–5K triangles → ~6 seconds on A100
- Mid-poly asset: 5K–50K triangles → ~15 seconds on A100
- High-poly asset: 50K–500K triangles → ~45 seconds on A100
### Voxel Cast – 3D Printable Generation Decoder
Task: Generate physically valid, printable 3D models – watertight, manifold, and structurally sound.
Output formats:
- STL (universal printing format)
- OBJ (watertight)
- STEP (CAD-compatible, parametric)
- 3MF (modern printing format with material data)
Decoder head:

```
TriplaneCastDecoder
├── SDF (Signed Distance Field) decoder
│   └── MLP: 3D point + triplane → signed distance value
├── SDF → Watertight Mesh (dual marching cubes, no holes guaranteed)
├── Printability Validator
│   ├── Wall thickness checker (min 1.2 mm enforced)
│   ├── Overhang analyzer (>45° flagged + support detection)
│   ├── Manifold checker + auto-repair
│   └── Volume/surface area calculator
├── Support Structure Generator (optional)
│   └── Generates minimal support trees for FDM printing
├── STEP Converter (via Open CASCADE bindings)
└── Slicer Preview Renderer (preview only, not full slicer)
```
Special modules:
- Structural stress analyzer: basic FEA simulation to detect weak points
- Hollowing engine: auto-hollows solid objects with configurable wall thickness + drain holes
- Interlocking part splitter: splits large objects into printable parts with snap-fit joints
- Material suggester: recommends PLA / PETG / resin based on geometry complexity
- Scale validator: ensures object is printable at specified scale on common bed sizes (Bambu, Prusa, Ender)
Validation requirements (all Cast outputs must pass):
- Zero non-manifold edges
- Zero self-intersections
- Minimum wall thickness ≥ 1.2 mm at the requested scale
- Watertight (no open boundaries)
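The manifold and watertightness requirements above reduce to a simple combinatorial invariant: in a closed manifold triangle mesh, every undirected edge is shared by exactly two faces. A minimal checker (not the production validator, which also handles self-intersections and wall thickness):

```python
import numpy as np
from collections import Counter

def manifold_edge_report(faces: np.ndarray) -> dict:
    """Count how often each undirected edge appears across the faces.
    Watertight manifold surface <=> every edge appears exactly twice."""
    counts = Counter()
    for a, b, c in faces:
        for e in ((a, b), (b, c), (c, a)):
            counts[tuple(sorted(e))] += 1
    bad = sum(1 for n in counts.values() if n != 2)
    return {"edges": len(counts), "non_manifold_edges": bad,
            "watertight": bad == 0}

# A tetrahedron: 4 triangles, 6 edges, each shared by exactly 2 faces.
tet = np.array([[0, 1, 2], [0, 3, 1], [1, 3, 2], [2, 3, 0]])
print(manifold_edge_report(tet))
# An open surface (one triangle removed) fails the check.
print(manifold_edge_report(tet[:3]))
```

Edges appearing once indicate open boundaries (holes); edges appearing three or more times indicate non-manifold geometry – both reject a Cast output.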
### Voxel Lens – NeRF / Gaussian Splatting Decoder
Task: Generate photorealistic 3D scenes represented as Neural Radiance Fields or 3D Gaussian Splats – primarily for visualization, VR/AR, and cinematic rendering.

Output formats:
- .ply (3D Gaussian Splatting – compatible with standard 3DGS viewers)
- NeRF weights (Instant-NGP / Nerfstudio compatible)
- MP4 render (pre-rendered orbital video)
- Depth maps + normal maps (per-view, for downstream use)
Decoder head:

```
TriplaneLensDecoder
├── Gaussian Parameter Decoder
│   ├── Samples 3D Gaussian centers from triplane density
│   ├── Per-Gaussian: position (3), rotation (4, quaternion), scale (3),
│   │     opacity (1), spherical harmonics coefficients (48) → color
│   └── Targets: 500K–3M Gaussians per scene
├── Gaussian Densification Module
│   ├── Adaptive densification: split/clone in high-gradient regions
│   └── Pruning: remove low-opacity Gaussians
├── NeRF branch (parallel)
│   ├── Hash-grid encoder (Instant-NGP style)
│   └── Tiny MLP: encoded position → density + color
├── Rasterizer (differentiable 3DGS rasterizer)
│   └── Used during training for photometric loss
└── Novel View Synthesizer
    └── Renders arbitrary camera trajectories for MP4 export
```
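The per-Gaussian parameter layout above adds up as follows (the 48 spherical-harmonics coefficients are 3 color channels × 16 degree-3 basis functions). A small sketch of packing and unpacking that layout – the array contents are placeholders:

```python
import numpy as np

# Per-Gaussian layout from the spec:
# position(3) + rotation quaternion(4) + scale(3) + opacity(1) + SH(48).
SH_DEGREE = 3
SH_COEFFS = 3 * (SH_DEGREE + 1) ** 2   # 3 channels x 16 basis funcs = 48
LAYOUT = {"position": 3, "rotation": 4, "scale": 3,
          "opacity": 1, "sh": SH_COEFFS}
PARAMS_PER_GAUSSIAN = sum(LAYOUT.values())  # 59 floats per splat

def split_gaussians(packed: np.ndarray) -> dict:
    """Unpack an (N, 59) array into named per-Gaussian attribute blocks."""
    out, offset = {}, 0
    for name, width in LAYOUT.items():
        out[name] = packed[:, offset:offset + width]
        offset += width
    return out

splats = split_gaussians(np.zeros((1000, PARAMS_PER_GAUSSIAN)))
print(PARAMS_PER_GAUSSIAN, splats["sh"].shape)  # 59 (1000, 48)
```

At 59 floats per splat, the 500K–3M target range works out to roughly 120 MB–700 MB uncompressed, which is why the compression module below matters.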
Special modules:
- Lighting decomposition: separates scene into albedo + illumination components
- Dynamic scene support: temporal Gaussian sequences for animated scenes (from video input)
- Background/foreground separator: isolates subject from environment
- Camera trajectory planner: auto-generates cinematic orbital/fly-through paths
- Compression module: reduces 3DGS file size by 60–80% with minimal quality loss
Generation modes:
- Object-centric: single object, orbital views → ~12 seconds on A100
- Indoor scene: full room with lighting → ~40 seconds on A100
- Outdoor scene: landscape or street → ~90 seconds on A100
## Voxel Prime – Closed-Source All-in-One
Access: API only. Not open source. Weights never distributed.
Voxel Prime contains all four decoder heads simultaneously, plus:
Additional Prime-only modules:
- Cross-task consistency: ensures Atlas world + Forge assets + Lens scene all match when generated together
- Scene population engine: generates a world (Atlas) then auto-populates it with assets (Forge)
- Pipeline orchestrator: chains Atlas → Forge → Cast → Lens in one API call
- Photorealistic texture upscaler: 4Γ super-resolution on all generated textures
- Style transfer module: apply artistic style (e.g. "Studio Ghibli", "cyberpunk", "brutalist architecture") across all output types
- Iterative refinement: text-guided editing of already-generated 3D content
API endpoint:

```
POST /v1/voxel/generate
{
  "prompt": "A medieval castle on a cliff at sunset",
  "output_types": ["world", "mesh", "nerf"],   # any combination
  "inputs": {
    "image": "base64...",        # optional reference image
    "multiview": ["base64..."],  # optional multi-view images
    "video": "base64...",        # optional video
    "model": "base64..."         # optional existing 3D model
  },
  "settings": {
    "quality": "high",           # draft | standard | high
    "style": "realistic",        # realistic | stylized | low-poly | ...
    "scale_meters": 100.0,       # real-world scale
    "symmetry": false,
    "printable": false
  }
}
```
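A hypothetical client-side sketch of this request. The field names mirror the schema above; the host URL and any authentication are not specified in this document and are placeholders:

```python
import json

# Build the request body for POST /v1/voxel/generate (planned, not live).
payload = {
    "prompt": "A medieval castle on a cliff at sunset",
    "output_types": ["world", "mesh", "nerf"],
    "inputs": {},  # all reference inputs are optional
    "settings": {
        "quality": "high",
        "style": "realistic",
        "scale_meters": 100.0,
        "symmetry": False,
        "printable": False,
    },
}

body = json.dumps(payload)
# A real client would then POST it, e.g. with requests:
# requests.post("https://<api-host>/v1/voxel/generate",
#               data=body, headers={"Content-Type": "application/json"})
print(len(json.loads(body)["output_types"]))  # 3
```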
## Shared Custom Modules (All Models)
| # | Module | Description |
|---|---|---|
| 1 | Multi-Modal Conditioning Fusion | CrossModalAttention over all active input types |
| 2 | 3D RoPE Encoder | RoPE adapted for triplane 3D spatial positions |
| 3 | Geometry Quality Scorer | Rates generated geometry quality [0–1] before output |
| 4 | Semantic Label Head | Per-voxel/vertex semantic class (wall, floor, tree, etc.) |
| 5 | Scale & Unit Manager | Enforces consistent real-world scale across all outputs |
| 6 | Material Property Head | Predicts PBR material properties (roughness, metallic, IOR) |
| 7 | Confidence & Uncertainty Head | Per-region generation confidence β flags uncertain areas |
| 8 | Prompt Adherence Scorer | CLIP-based score: how well output matches text prompt |
| 9 | Multi-Resolution Decoder | Generates at 64³ → 128³ → 256³ coarse-to-fine |
| 10 | Style Embedding Module | Encodes style reference images into style conditioning vector |
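The Multi-Resolution Decoder's coarse-to-fine loop (module 9) can be sketched with nearest-neighbor upsampling standing in for the learned decoder stages:

```python
import numpy as np

def upsample2x(grid: np.ndarray) -> np.ndarray:
    """Nearest-neighbor 2x upsampling of a cubic latent grid (a simple
    stand-in for a learned upsampling stage)."""
    for axis in range(3):
        grid = np.repeat(grid, 2, axis=axis)
    return grid

# Coarse-to-fine: 64^3 -> 128^3 -> 256^3, refining after each upsample.
grid = np.zeros((64, 64, 64))
for _ in range(2):
    grid = upsample2x(grid)
    # ...a refinement pass would run here at the new resolution...
print(grid.shape)  # (256, 256, 256)
```

Doing most of the work at 64³ and only refining at 256³ keeps memory and compute bounded, since the fine grid has 64× the cells of the coarse one.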
## Training Data Plan
| Dataset | Content | Used by |
|---|---|---|
| ShapeNet (55K models) | Common 3D objects | Forge, Cast |
| Objaverse (800K+ models) | Diverse 3D assets | Forge, Cast, Lens |
| Objaverse-XL (10M+ objects) | Massive scale | All |
| ScanNet / ScanNet++ | Indoor 3D scans | Atlas, Lens |
| KITTI / nuScenes | Outdoor 3D scenes | Atlas, Lens |
| ABO (Amazon Berkeley Objects) | Product meshes + materials | Forge |
| Thingiverse (printable models) | 3D printable STLs | Cast |
| Polycam scans | Real-world 3DGS/NeRF | Lens |
| Synthetic renders (generated) | Multi-view rendered images | All |
| Text-3D pairs (synthetic) | GPT-4o generated descriptions of Objaverse | All |
## Parameter Estimates
| Model | Backbone | Decoder Head | Total | VRAM (BF16) |
|---|---|---|---|---|
| Voxel Atlas | 2.3B | ~400M | ~2.7B | ~22GB |
| Voxel Forge | 2.3B | ~350M | ~2.65B | ~21GB |
| Voxel Cast | 2.3B | ~200M | ~2.5B | ~20GB |
| Voxel Lens | 2.3B | ~500M | ~2.8B | ~22GB |
| Voxel Prime | 2.3B | ~1.4B (all 4) | ~3.7B | ~30GB |
All models fit on an A100 40GB in BF16. INT8 quantization brings all of them under 15GB (viable on a consumer RTX 4090).
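The arithmetic behind these figures, for weights alone (the table's VRAM column also includes activations and framework overhead, which is why it is higher):

```python
# Weight memory: parameter count x bytes per parameter.
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * 1e9 * bytes_per_param / 1024**3

for name, params in [("Voxel Cast", 2.5), ("Voxel Prime", 3.7)]:
    bf16 = weight_gb(params, 2)   # BF16: 2 bytes/param
    int8 = weight_gb(params, 1)   # INT8: 1 byte/param
    print(f"{name}: {bf16:.1f} GB BF16 weights, {int8:.1f} GB INT8 weights")
```

Even the largest model's INT8 weights (~3.5 GB) sit comfortably under the 15GB quantized-inference figure, which leaves headroom for activations on a 24GB card.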
## Training Strategy

### Phase 1 – Backbone Pre-training
- Train shared backbone on Objaverse-XL triplane reconstructions
- Learn general 3D structure without task-specific heads
- Context: text + single image conditioning only
- 100K steps, A100 cluster
### Phase 2 – Decoder Head Training (parallel)
- Freeze backbone, train each decoder head independently
- Atlas: ScanNet + synthetic world data
- Forge: ShapeNet + Objaverse + texture data
- Cast: Thingiverse + watertight synthetic meshes
- Lens: Polycam + synthetic multi-view renders
- 50K steps each
### Phase 3 – Joint Fine-tuning
- Unfreeze backbone, fine-tune end-to-end per specialist model
- Add all input modalities (video, multi-view, point cloud)
- 30K steps each
### Phase 4 – Prime Training
- Initialize from jointly fine-tuned backbone
- Train all decoder heads simultaneously
- Cross-task consistency losses
- Prime-only module training (pipeline orchestrator, style transfer)
- 50K steps
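The four phases above can be written down as a declarative schedule. The module names and modality lists are illustrative labels taken from the plan, not actual code identifiers; Phase 2 and 3 step counts are per specialist model:

```python
PHASES = [
    {"name": "backbone_pretrain", "steps": 100_000,
     "train": ["backbone"], "freeze": [],
     "modalities": ["text", "image"]},
    {"name": "decoder_heads", "steps": 50_000,        # per decoder head
     "train": ["decoder_head"], "freeze": ["backbone"],
     "modalities": ["text", "image"]},
    {"name": "joint_finetune", "steps": 30_000,       # per specialist model
     "train": ["backbone", "decoder_head"], "freeze": [],
     "modalities": ["text", "image", "video", "multiview", "pointcloud"]},
    {"name": "prime", "steps": 50_000,
     "train": ["backbone", "all_decoder_heads", "prime_modules"],
     "freeze": [],
     "modalities": ["text", "image", "video", "multiview", "pointcloud"]},
]

# Serial step count down one model's pipeline (Phase 2/3 run once per model).
total_steps = sum(p["steps"] for p in PHASES)
print(total_steps)  # 230000
```

Encoding the freeze/train split per phase makes the "freeze backbone, train head" step of Phase 2 explicit and easy to audit before launching a run.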
## HuggingFace Plan
- Matrix-Corp/Voxel-Atlas-V1 – open source
- Matrix-Corp/Voxel-Forge-V1 – open source
- Matrix-Corp/Voxel-Cast-V1 – open source
- Matrix-Corp/Voxel-Lens-V1 – open source
- Matrix-Corp/Voxel-Prime-V1 – closed source, API only (model card only, no weights)

Collection: Matrix-Corp/voxel-v1
## Status
- 🔴 Planned – architecture specification complete
- Backbone design finalized
- Decoder head designs finalized
- Training data sourcing: TBD
- Compute requirements: significant (A100 cluster for training)
- Timeline: TBD