---
title: Matrix Voxel
emoji: 🌏
colorFrom: green
colorTo: pink
sdk: static
pinned: false
license: cc-by-nc-nd-4.0
short_description: The next gen 3D generator
---

# Matrix Voxel — Full Architecture & Planning Document

3D Generation Model Family | Matrix.Corp


## Family Overview

Matrix Voxel is Matrix.Corp's 3D generation family: five models that share a common flow-matching backbone, each with task-specific decoder heads. Four specialist models are open source; the unified all-in-one model (Voxel Prime) is closed source and API-only.

| Model | Task | Output Formats | Source | Hardware | Status |
|---|---|---|---|---|---|
| Voxel Atlas | World / environment generation | Voxel grids, OBJ scenes, USD stages | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Forge | 3D mesh / asset generation | OBJ, GLB, FBX, USDZ | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Cast | 3D printable model generation | STL, OBJ (watertight), STEP | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Lens | NeRF / Gaussian Splatting scenes | .ply (3DGS), NeRF weights, MP4 render | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Prime | All-in-one unified generation | All of the above | 🟣 Closed Source | API Only | 🔴 Planned |

## Input Modalities (All Models)

Every Voxel model accepts any combination of:

| Input | Description | Encoder |
|---|---|---|
| Text prompt | Natural language description of desired 3D output | CLIP-ViT-L / T5-XXL |
| Single image | Reference image → 3D lift | DINOv2 + custom depth encoder |
| Multi-view images | 2–12 images from different angles | Multi-view transformer encoder |
| Video | Extracts frames, infers 3D from motion | Temporal encoder (Video-MAE lineage) |
| 3D model | Existing mesh/point cloud as conditioning | PointNet++ encoder |

All inputs are projected into a shared 1024-dim conditioning embedding space before entering the backbone.
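As a concrete illustration, the shared conditioning space could be implemented with per-modality linear projections; a minimal PyTorch sketch (class names and encoder output dims are assumptions, not the real codebase):

```python
import torch
import torch.nn as nn

COND_DIM = 1024  # shared conditioning width from the spec above

class ModalityProjector(nn.Module):
    """Maps each encoder's token features into the shared 1024-dim space."""
    def __init__(self, encoder_dims: dict[str, int]):
        super().__init__()
        self.proj = nn.ModuleDict({
            name: nn.Linear(dim, COND_DIM) for name, dim in encoder_dims.items()
        })
        self.norm = nn.LayerNorm(COND_DIM)

    def forward(self, features: dict[str, torch.Tensor]) -> torch.Tensor:
        # Project each active modality, then concatenate along the token axis
        # so the cross-modal fusion block can attend over all of them.
        tokens = [self.norm(self.proj[name](feat)) for name, feat in features.items()]
        return torch.cat(tokens, dim=1)  # (batch, total_tokens, 1024)

projector = ModalityProjector({"text": 4096, "image": 1024})
cond = projector({
    "text": torch.randn(2, 77, 4096),    # e.g. T5-XXL token features
    "image": torch.randn(2, 256, 1024),  # e.g. DINOv2-L patch features
})
print(cond.shape)  # torch.Size([2, 333, 1024])
```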


## Core Architecture — Shared Flow Matching Backbone

### Why Flow Matching?

Flow matching (Lipman et al. 2022, extended by the Stable Diffusion 3 / FLUX lineage) learns a direct vector field from noise → data. Compared with DDPM diffusion it needs far fewer inference steps (typically 20–50 vs ~1000), trains more stably, and gives better mode coverage. It is the state of the art for large generative models as of 2025–2026.

### 3D Representation — Triplane + Latent Voxel Grid

All Voxel models operate in a shared latent 3D space:

- Triplane representation: three axis-aligned feature planes (XY, XZ, YZ), each 256×256 with 32 channels
- Any 3D point is queried by projecting it onto all 3 planes and summing the sampled features
- Compact (3 × 256 × 256 × 32 ≈ 6M latent values) yet expressive
- Flow matching operates on this triplane latent space, not on raw 3D points
- Decoder heads decode the triplane into task-specific output formats
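The triplane lookup described above can be sketched in a few lines of PyTorch (the function name and the [-1, 1] coordinate convention are assumptions):

```python
import torch
import torch.nn.functional as F

def query_triplane(planes: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
    """planes: (3, 32, 256, 256); points: (N, 3) in [-1, 1]. Returns (N, 32)."""
    # Which two coordinates each plane keeps: XY, XZ, YZ.
    projections = [(0, 1), (0, 2), (1, 2)]
    feats = 0.0
    for plane, (a, b) in zip(planes, projections):
        grid = points[:, [a, b]].view(1, -1, 1, 2)         # (1, N, 1, 2)
        sampled = F.grid_sample(plane.unsqueeze(0), grid,   # bilinear lookup
                                align_corners=True)         # (1, 32, N, 1)
        feats = feats + sampled[0, :, :, 0].T               # sum over planes
    return feats                                            # (N, 32)

planes = torch.randn(3, 32, 256, 256)
pts = torch.rand(1000, 3) * 2 - 1
out = query_triplane(planes, pts)
print(out.shape)  # torch.Size([1000, 32])
```

The summed 32-dim feature vector is what the task-specific decoder MLPs consume.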

### Backbone Architecture

```
VoxelBackbone
├── Input Encoder (multimodal conditioning)
│   ├── TextEncoder         — T5-XXL + CLIP-ViT-L, projected to 1024-dim
│   ├── ImageEncoder        — DINOv2-L, projected to 1024-dim
│   ├── MultiViewEncoder    — custom transformer over N views
│   ├── VideoEncoder        — Video-MAE, temporal pooling → 1024-dim
│   └── PointCloudEncoder   — PointNet++, global + local features → 1024-dim
│
├── Conditioning Fusion
│   └── CrossModalAttention — fuses all active input modalities
│
├── Flow Matching Transformer (DiT-style)
│   ├── 24 transformer blocks
│   ├── Hidden dim: 1536
│   ├── Heads: 24
│   ├── Conditioning: AdaLN-Zero (timestep + conditioning signal)
│   ├── 3D RoPE positional encoding for triplane tokens
│   └── ~2.3B parameters
│
└── Triplane Decoder (shared across all specialist models)
    └── Outputs: triplane feature tensor (3 × 256 × 256 × 32)
```

### Flow Matching Training

- Learn a vector field v_θ(x_t, t, c), where x_t is the noisy triplane and c is the conditioning
- Optimal transport flow: straight paths from noise → data (better than DDPM's curved paths)
- Inference: 20–50 NFE (neural function evaluations) — fast on A100
- Classifier-free guidance: 10% unconditional dropout during training
- Guidance scale 5.0–10.0 at inference
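The objective above can be sketched directly: along the straight optimal-transport path x_t = (1 - t) * noise + t * data, the target velocity is constant, data - noise, and v_θ regresses it. A minimal sketch (`model` is a stand-in for the 2.3B DiT backbone; per-batch conditioning dropout is a simplification):

```python
import torch

def flow_matching_loss(model, data, cond, cfg_dropout=0.10):
    noise = torch.randn_like(data)
    # One random time per sample, shaped to broadcast over the latent dims.
    t = torch.rand(data.shape[0], *([1] * (data.dim() - 1)))
    x_t = (1 - t) * noise + t * data   # straight OT path
    target_v = data - noise            # constant velocity along that path
    # Classifier-free guidance: zero the conditioning ~10% of the time
    # (a simple per-batch stand-in for per-sample dropout).
    if torch.rand(()).item() < cfg_dropout:
        cond = torch.zeros_like(cond)
    pred_v = model(x_t, t.flatten(), cond)
    return torch.mean((pred_v - target_v) ** 2)

dummy = lambda x, t, c: x  # placeholder for the real backbone
loss = flow_matching_loss(dummy, torch.randn(4, 3, 32, 32), torch.randn(4, 1024))
print(loss.item())
```

At inference, integrating the learned field with 20–50 Euler steps replaces the ~1000-step DDPM sampler.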

## Task-Specific Decoder Heads

Each specialist model adds a decoder head on top of the shared triplane output.


## Voxel Atlas — World Generation Decoder

Task: Generate full 3D environments and worlds — terrain, buildings, vegetation, interior spaces.

Output formats:

- Voxel grids (.vox, MagicaVoxel format) — for Minecraft-style worlds
- OBJ scene (multiple meshes with materials) — for Unity/Unreal environments
- USD stage (.usd) — industry-standard scene format

Decoder head:

```
TriplaneAtlasDecoder
├── Scene Layout Transformer
│   ├── Divides space into semantic regions (terrain, structures, vegetation, sky)
│   └── 6-layer transformer over a 32×32 spatial grid of scene tokens
├── Region-wise NeRF decoder (per semantic region)
│   └── MLP: 3D coords + triplane features → density + RGB + semantic label
├── Marching Cubes extractor → raw mesh per region
├── Scene graph assembler → parent-child relationships between objects
├── Voxelizer (for .vox output) → discretizes to user-specified resolution
└── USD exporter → full scene hierarchy with lighting + materials
```

Special modules:

- Infinite world tiling: generate seamless adjacent chunks that stitch together
- Biome-aware generation: desert, forest, urban, underwater, space, fantasy
- LOD generator: auto-generates 4 levels of detail per scene object
- Lighting estimator: infers plausible sun/sky lighting from scene content

Typical generation sizes:

- Small scene: 64×64×64 voxels or ~500 m² OBJ scene — ~8 seconds on A100
- Large world chunk: 256×256×128 voxels — ~35 seconds on A100
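One way the infinite-world-tiling module could stitch adjacent chunks is to generate them with an overlapping margin and cross-fade the density fields in the overlap; a hypothetical NumPy sketch (the blending scheme is an assumption, not the documented method):

```python
import numpy as np

def stitch_chunks(left: np.ndarray, right: np.ndarray, overlap: int) -> np.ndarray:
    """left, right: (X, Y, Z) density grids sharing `overlap` voxels along X."""
    # Linear cross-fade weights over the overlapping slab.
    w = np.linspace(0.0, 1.0, overlap).reshape(-1, 1, 1)
    blended = (1 - w) * left[-overlap:] + w * right[:overlap]
    return np.concatenate([left[:-overlap], blended, right[overlap:]], axis=0)

a = np.ones((64, 64, 64))   # stand-in for one generated chunk
b = np.zeros((64, 64, 64))  # stand-in for its neighbor
world = stitch_chunks(a, b, overlap=8)
print(world.shape)  # (120, 64, 64)
```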

## Voxel Forge — Mesh / Asset Generation Decoder

Task: Generate clean, game-ready 3D assets — characters, objects, props, vehicles, architecture.

Output formats:

- OBJ + MTL (universal)
- GLB/GLTF (web & real-time)
- FBX (game engine standard)
- USDZ (Apple AR)

Decoder head:

```
TriplaneForgeDecoder
├── Occupancy Network decoder
│   └── MLP: 3D point + triplane → occupancy probability
├── Differentiable Marching Cubes → initial raw mesh
├── Mesh Refinement Network
│   ├── Graph neural network over mesh vertices/edges
│   ├── 8 message-passing rounds
│   └── Predicts vertex position offsets → clean topology
├── UV Unwrapper (learned, SeamlessUV lineage)
├── Texture Diffusion Head
│   ├── 2D flow matching in UV space
│   ├── Albedo + roughness + metallic + normal maps
│   └── 1024×1024 or 2048×2048 texture atlas
└── LOD Generator → 4 polycount levels (100% / 50% / 25% / 10%)
```
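The LOD schedule at the bottom of the tree is easy to make concrete; a small sketch of the target triangle count per level (function name is illustrative):

```python
LOD_RATIOS = (1.00, 0.50, 0.25, 0.10)  # the 100% / 50% / 25% / 10% levels

def lod_targets(base_tris: int) -> list[int]:
    """Target triangle budget for each LOD level of a base mesh."""
    return [max(1, round(base_tris * r)) for r in LOD_RATIOS]

print(lod_targets(50_000))  # [50000, 25000, 12500, 5000]
```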

Special modules:

- Topology optimizer: enforces quad-dominant topology for animation rigs
- Symmetry enforcer: optional bilateral symmetry for characters/vehicles
- Scale normalizer: outputs at real-world scale (meters) with unit metadata
- Material classifier: auto-tags materials (metal, wood, fabric, glass, etc.)
- Animation-ready flag: detects and preserves edge loops needed for rigging

Polygon counts:

- Low-poly asset: 500–5K triangles — ~6 seconds on A100
- Mid-poly asset: 5K–50K triangles — ~15 seconds on A100
- High-poly asset: 50K–500K triangles — ~45 seconds on A100

## Voxel Cast — 3D Printable Generation Decoder

Task: Generate physically valid, printable 3D models — watertight, manifold, and structurally sound.

Output formats:

- STL (universal printing format)
- OBJ (watertight)
- STEP (CAD-compatible, parametric)
- 3MF (modern printing format with material data)

Decoder head:

```
TriplaneCastDecoder
├── SDF (Signed Distance Field) decoder
│   └── MLP: 3D point + triplane → signed distance value
├── SDF → Watertight Mesh (dual marching cubes, no holes guaranteed)
├── Printability Validator
│   ├── Wall thickness checker (min 1.2mm enforced)
│   ├── Overhang analyzer (>45° flagged + support detection)
│   ├── Manifold checker + auto-repair
│   └── Volume/surface area calculator
├── Support Structure Generator (optional)
│   └── Generates minimal support trees for FDM printing
├── STEP Converter (via Open CASCADE bindings)
└── Slicer Preview Renderer (preview only, not a full slicer)
```

Special modules:

- Structural stress analyzer: basic FEA simulation to detect weak points
- Hollowing engine: auto-hollows solid objects with configurable wall thickness + drain holes
- Interlocking part splitter: splits large objects into printable parts with snap-fit joints
- Material suggester: recommends PLA / PETG / resin based on geometry complexity
- Scale validator: ensures the object is printable at the specified scale on common bed sizes (Bambu, Prusa, Ender)

Validation requirements (all Cast outputs must pass):

- Zero non-manifold edges
- Zero self-intersections
- Minimum wall thickness ≥ 1.2mm at the requested scale
- Watertight (no open boundaries)
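The watertight requirement can be checked cheaply: in a closed, manifold triangle mesh every undirected edge is shared by exactly two faces. A minimal sketch (real Cast validation would also need self-intersection and wall-thickness checks):

```python
from collections import Counter

def is_watertight(faces: list[tuple[int, int, int]]) -> bool:
    """faces: triangles as (i, j, k) vertex-index triples."""
    edges = Counter()
    for i, j, k in faces:
        for a, b in ((i, j), (j, k), (k, i)):
            edges[frozenset((a, b))] += 1  # undirected edge
    # Closed manifold surface: every edge borders exactly two faces.
    return all(count == 2 for count in edges.values())

# A tetrahedron is closed; dropping one face opens a boundary.
tetra = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]
print(is_watertight(tetra))       # True
print(is_watertight(tetra[:-1]))  # False
```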

## Voxel Lens — NeRF / Gaussian Splatting Decoder

Task: Generate photorealistic 3D scenes represented as Neural Radiance Fields or 3D Gaussian Splats — primarily for visualization, VR/AR, and cinematic rendering.

Output formats:

- .ply (3D Gaussian Splatting — compatible with standard 3DGS viewers)
- NeRF weights (Instant-NGP / Nerfstudio compatible)
- MP4 render (pre-rendered orbital video)
- Depth maps + normal maps (per view, for downstream use)

Decoder head:

```
TriplaneLensDecoder
├── Gaussian Parameter Decoder
│   ├── Samples 3D Gaussian centers from triplane density
│   ├── Per-Gaussian: position (3), rotation (4, quaternion), scale (3),
│   │   opacity (1), spherical harmonics coefficients (48) → color
│   └── Targets: 500K–3M Gaussians per scene
├── Gaussian Densification Module
│   ├── Adaptive densification: split/clone in high-gradient regions
│   └── Pruning: remove low-opacity Gaussians
├── NeRF branch (parallel)
│   ├── Hash-grid encoder (Instant-NGP style)
│   └── Tiny MLP: encoded position → density + color
├── Rasterizer (differentiable 3DGS rasterizer)
│   └── Used during training for photometric loss
└── Novel View Synthesizer
    └── Renders arbitrary camera trajectories for MP4 export
```

Special modules:

- Lighting decomposition: separates the scene into albedo + illumination components
- Dynamic scene support: temporal Gaussian sequences for animated scenes (from video input)
- Background/foreground separator: isolates the subject from the environment
- Camera trajectory planner: auto-generates cinematic orbital/fly-through paths
- Compression module: reduces 3DGS file size by 60–80% with minimal quality loss

Generation modes:

- Object-centric: single object, orbital views — ~12 seconds on A100
- Indoor scene: full room with lighting — ~40 seconds on A100
- Outdoor scene: landscape or street — ~90 seconds on A100

## Voxel Prime — Closed Source All-in-One

Access: API only. Not open source. Weights never distributed.

Voxel Prime contains all four decoder heads simultaneously, plus:

Additional Prime-only modules:

- Cross-task consistency: ensures the Atlas world, Forge assets, and Lens scene all match when generated together
- Scene population engine: generates a world (Atlas), then auto-populates it with assets (Forge)
- Pipeline orchestrator: chains Atlas → Forge → Cast → Lens in one API call
- Photorealistic texture upscaler: 4× super-resolution on all generated textures
- Style transfer module: applies an artistic style (e.g. "Studio Ghibli", "cyberpunk", "brutalist architecture") across all output types
- Iterative refinement: text-guided editing of already-generated 3D content

API endpoint:

```
POST /v1/voxel/generate
{
  "prompt": "A medieval castle on a cliff at sunset",
  "output_types": ["world", "mesh", "nerf"],  # any combination
  "inputs": {
    "image": "base64...",       # optional reference image
    "multiview": ["base64..."], # optional multi-view images
    "video": "base64...",       # optional video
    "model": "base64..."        # optional existing 3D model
  },
  "settings": {
    "quality": "high",          # draft | standard | high
    "style": "realistic",       # realistic | stylized | low-poly | ...
    "scale_meters": 100.0,      # real-world scale
    "symmetry": false,
    "printable": false
  }
}
```
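A client call might be built like the following sketch; only the `POST /v1/voxel/generate` path comes from this document, while the base URL, bearer-token auth, and helper name are assumptions:

```python
import json
import urllib.request

def build_generate_request(prompt: str, output_types: list[str],
                           base_url: str, api_key: str,
                           **settings) -> urllib.request.Request:
    """Construct (but do not send) a request to the Voxel Prime endpoint."""
    payload = {
        "prompt": prompt,
        "output_types": output_types,
        "settings": settings or {"quality": "standard"},
    }
    return urllib.request.Request(
        base_url + "/v1/voxel/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )

req = build_generate_request(
    "A medieval castle on a cliff at sunset",
    ["world", "mesh", "nerf"],
    "https://api.example.com", "YOUR_API_KEY",
    quality="high", scale_meters=100.0,
)
print(req.get_method(), req.full_url)
```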

## Shared Custom Modules (All Models)

| # | Module | Description |
|---|---|---|
| 1 | Multi-Modal Conditioning Fusion | CrossModalAttention over all active input types |
| 2 | 3D RoPE Encoder | RoPE adapted for triplane 3D spatial positions |
| 3 | Geometry Quality Scorer | Rates generated geometry quality [0–1] before output |
| 4 | Semantic Label Head | Per-voxel/vertex semantic class (wall, floor, tree, etc.) |
| 5 | Scale & Unit Manager | Enforces consistent real-world scale across all outputs |
| 6 | Material Property Head | Predicts PBR material properties (roughness, metallic, IOR) |
| 7 | Confidence & Uncertainty Head | Per-region generation confidence — flags uncertain areas |
| 8 | Prompt Adherence Scorer | CLIP-based score: how well the output matches the text prompt |
| 9 | Multi-Resolution Decoder | Generates at 64³ → 128³ → 256³, coarse-to-fine |
| 10 | Style Embedding Module | Encodes style reference images into a style conditioning vector |

## Training Data Plan

| Dataset | Content | Used by |
|---|---|---|
| ShapeNet (55K models) | Common 3D objects | Forge, Cast |
| Objaverse (800K+ models) | Diverse 3D assets | Forge, Cast, Lens |
| Objaverse-XL (10M+ objects) | Massive scale | All |
| ScanNet / ScanNet++ | Indoor 3D scans | Atlas, Lens |
| KITTI / nuScenes | Outdoor 3D scenes | Atlas, Lens |
| ABO (Amazon Berkeley Objects) | Product meshes + materials | Forge |
| Thingiverse (printable models) | 3D printable STLs | Cast |
| Polycam scans | Real-world 3DGS/NeRF | Lens |
| Synthetic renders (generated) | Multi-view rendered images | All |
| Text-3D pairs (synthetic) | GPT-4o-generated descriptions of Objaverse | All |

## Parameter Estimates

| Model | Backbone | Decoder Head | Total | VRAM (BF16) |
|---|---|---|---|---|
| Voxel Atlas | 2.3B | ~400M | ~2.7B | ~22GB |
| Voxel Forge | 2.3B | ~350M | ~2.65B | ~21GB |
| Voxel Cast | 2.3B | ~200M | ~2.5B | ~20GB |
| Voxel Lens | 2.3B | ~500M | ~2.8B | ~22GB |
| Voxel Prime | 2.3B | ~1.4B (all 4) | ~3.7B | ~30GB |

All models fit on an A100 40GB in BF16. INT8 quantization brings all of them under 15GB (viable on a consumer 4090).
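For context, the weights themselves are only part of those VRAM figures (the rest is activations, attention buffers, and decoder working memory); parameter memory alone is easy to check:

```python
def param_gb(params_billion: float, bytes_per_param: float) -> float:
    """Memory occupied by the weights alone, in gigabytes."""
    return params_billion * 1e9 * bytes_per_param / 1e9

# BF16 = 2 bytes/param, INT8 = 1 byte/param.
for name, p in [("Voxel Atlas", 2.7), ("Voxel Prime", 3.7)]:
    print(f"{name}: {param_gb(p, 2):.1f} GB bf16, {param_gb(p, 1):.1f} GB int8")
```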


## Training Strategy

### Phase 1 — Backbone Pre-training

- Train the shared backbone on Objaverse-XL triplane reconstructions
- Learn general 3D structure without task-specific heads
- Conditioning: text + single image only
- 100K steps on an A100 cluster

### Phase 2 — Decoder Head Training (parallel)

- Freeze the backbone; train each decoder head independently
- Atlas: ScanNet + synthetic world data
- Forge: ShapeNet + Objaverse + texture data
- Cast: Thingiverse + watertight synthetic meshes
- Lens: Polycam + synthetic multi-view renders
- 50K steps each
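The freeze-backbone/train-head recipe in Phase 2 can be sketched in a few lines of PyTorch (module names here are illustrative, not the real codebase):

```python
import torch.nn as nn

def freeze_backbone(model: nn.Module, head_prefix: str = "decoder_head"):
    """Leave only the decoder head trainable; return its parameters."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(head_prefix)
    return [p for p in model.parameters() if p.requires_grad]

# Toy stand-in: a "backbone" and a "decoder_head" submodule.
model = nn.ModuleDict({
    "backbone": nn.Linear(1536, 1536),
    "decoder_head": nn.Linear(1536, 32),
})
trainable = freeze_backbone(model)
# The optimizer is then built over `trainable` only.
print(sum(p.numel() for p in trainable))  # head params: 1536*32 + 32 = 49184
```

Phase 3 then simply flips `requires_grad` back on for the backbone before end-to-end fine-tuning.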

### Phase 3 — Joint Fine-tuning

- Unfreeze the backbone; fine-tune end-to-end per specialist model
- Add all input modalities (video, multi-view, point cloud)
- 30K steps each

### Phase 4 — Prime Training

- Initialize from the jointly fine-tuned backbone
- Train all decoder heads simultaneously
- Cross-task consistency losses
- Prime-only module training (pipeline orchestrator, style transfer)
- 50K steps

## HuggingFace Plan

```
Matrix-Corp/Voxel-Atlas-V1    — open source
Matrix-Corp/Voxel-Forge-V1    — open source
Matrix-Corp/Voxel-Cast-V1     — open source
Matrix-Corp/Voxel-Lens-V1     — open source
Matrix-Corp/Voxel-Prime-V1    — closed source, API only (card only, no weights)
```

Collection: Matrix-Corp/voxel-v1


## Status

- 🔴 Planned — architecture specification complete
- Backbone design finalized
- Decoder head designs finalized
- Training data sourcing: TBD
- Compute requirements: significant (A100 cluster for training)
- Timeline: TBD