---
title: Matrix Voxel
emoji: 🌏
colorFrom: green
colorTo: pink
sdk: static
pinned: false
license: cc-by-nc-nd-4.0
short_description: The next gen 3D generator
---

# Matrix Voxel — Full Architecture & Planning Document
**3D Generation Model Family | Matrix.Corp**

---

## Family Overview

Matrix Voxel is Matrix.Corp's 3D generation family: five models share a common flow-matching backbone, each with a task-specific decoder head. Four specialist models are open source; one unified all-in-one (Voxel Prime) is closed source and API-only.

| Model | Task | Output Formats | Source | Hardware | Status |
|---|---|---|---|---|---|
| Voxel Atlas | World / environment generation | Voxel grids, OBJ scenes, USD stages | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Forge | 3D mesh / asset generation | OBJ, GLB, FBX, USDZ | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Cast | 3D printable model generation | STL, OBJ (watertight), STEP | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Lens | NeRF / Gaussian Splatting scenes | .ply (3DGS), NeRF weights, MP4 render | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Prime | All-in-one unified generation | All of the above | 🟣 Closed Source | API Only | 🔴 Planned |

---

## Input Modalities (All Models)

Every Voxel model accepts any combination of:

| Input | Description | Encoder |
|---|---|---|
| Text prompt | Natural language description of desired 3D output | CLIP-ViT-L / T5-XXL |
| Single image | Reference image → 3D lift | DINOv2 + custom depth encoder |
| Multi-view images | 2–12 images from different angles | Multi-view transformer encoder |
| Video | Extracts frames, infers 3D from motion | Temporal encoder (Video-MAE lineage) |
| 3D model | Existing mesh/point cloud as conditioning | PointNet++ encoder |

All inputs are projected into a shared 1024-dim conditioning embedding space before entering the backbone.
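As a concrete illustration, the per-modality projection into the shared space can be sketched in numpy. The native widths and modality names below are placeholders, not the actual Matrix Voxel values:

```python
import numpy as np

rng = np.random.default_rng(0)
COND_DIM = 1024  # shared conditioning width from the spec

# Hypothetical native widths of each frozen encoder's output tokens.
NATIVE_DIMS = {"text": 4096, "image": 1024, "video": 768, "points": 512}

# One learned linear projection per modality (random stand-ins here).
projections = {m: rng.normal(0, 0.02, (d, COND_DIM)) for m, d in NATIVE_DIMS.items()}

def project_conditioning(features: dict) -> np.ndarray:
    """Map each modality's tokens into the shared 1024-dim space and
    concatenate along the token axis."""
    tokens = [feats @ projections[m] for m, feats in features.items()]
    return np.concatenate(tokens, axis=0)

# Example: text (77 tokens) + single image (256 patch tokens).
cond = project_conditioning({
    "text": rng.normal(size=(77, 4096)),
    "image": rng.normal(size=(256, 1024)),
})
print(cond.shape)  # (333, 1024)
```

In practice the fused sequence would then feed the CrossModalAttention block described below.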

---

## Core Architecture — Shared Flow Matching Backbone

### Why Flow Matching?
Flow matching (Lipman et al. 2022, extended by the Stable Diffusion 3 / FLUX lineage) learns a direct vector field from noise → data. It trains more stably than DDPM diffusion, needs far fewer inference steps (typically 20–50 vs ~1000), and offers better mode coverage. As of 2025–2026 it is the dominant training objective for large-scale generative models.

### 3D Representation — Triplane + Latent Voxel Grid
All Voxel models operate in a shared latent 3D space:
- **Triplane representation**: three axis-aligned feature planes (XY, XZ, YZ), each 256×256×32 channels
- Any 3D point queried by projecting onto all 3 planes and summing features
- Compact (3 × 256 × 256 × 32 ≈ 6.3M latent values) yet expressive
- Flow matching operates on this triplane latent space, not raw 3D points
- Decoder heads decode triplane to task-specific output format
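A minimal numpy sketch of the triplane query described above (nearest-neighbor sampling for brevity; a real implementation would interpolate bilinearly):

```python
import numpy as np

RES, C = 256, 32
rng = np.random.default_rng(1)
# Three axis-aligned feature planes: XY, XZ, YZ.
planes = {k: rng.normal(size=(RES, RES, C)) for k in ("xy", "xz", "yz")}

def query_triplane(p: np.ndarray) -> np.ndarray:
    """Feature for a 3D point p in [-1, 1]^3: project onto each plane,
    sample it, and sum the three per-plane features."""
    ix, iy, iz = ((p + 1) / 2 * (RES - 1)).round().astype(int)
    return planes["xy"][ix, iy] + planes["xz"][ix, iz] + planes["yz"][iy, iz]

feat = query_triplane(np.array([0.1, -0.4, 0.7]))
print(feat.shape)  # (32,)
```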

### Backbone Architecture

```
VoxelBackbone
├── Input Encoder (multimodal conditioning)
│   ├── TextEncoder         — T5-XXL + CLIP-ViT-L, projected to 1024-dim
│   ├── ImageEncoder        — DINOv2-L, projected to 1024-dim
│   ├── MultiViewEncoder    — custom transformer over N views
│   ├── VideoEncoder        — Video-MAE, temporal pooling → 1024-dim
│   └── PointCloudEncoder   — PointNet++, global + local features → 1024-dim
│
├── Conditioning Fusion
│   └── CrossModalAttention — fuses all active input modalities
│
├── Flow Matching Transformer (DiT-style)
│   ├── 24 transformer blocks
│   ├── Hidden dim: 1536
│   ├── Heads: 24
│   ├── Conditioning: AdaLN-Zero (timestep + conditioning signal)
│   ├── 3D RoPE positional encoding for triplane tokens
│   └── ~2.3B parameters
│
└── Triplane Decoder (shared across all specialist models)
    └── Outputs: triplane feature tensor (3 × 256 × 256 × 32)
```

### Flow Matching Training
- Learn vector field v_θ(x_t, t, c), where x_t is the noisy triplane and c is the conditioning
- Optimal transport flow: straight paths from noise → data (better than DDPM's curved paths)
- Inference: 20–50 NFE (number of function evaluations) — fast on A100
- Classifier-free guidance: unconditional dropout 10% during training
- Guidance scale 5.0–10.0 at inference
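The bullets above translate into a short training-step sketch; the toy model, shapes, and conditioning placeholder are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2)

def fm_training_step(data: np.ndarray, model) -> float:
    """One optimal-transport flow-matching step: interpolate on the
    straight path noise -> data and regress its constant velocity."""
    noise = rng.normal(size=data.shape)
    t = rng.uniform()                    # one timestep for the batch, for brevity
    x_t = (1 - t) * noise + t * data     # point on the straight path
    target_v = data - noise              # velocity of the OT path
    # Classifier-free guidance: drop conditioning 10% of the time.
    cond = None if rng.uniform() < 0.1 else "cond_embedding"
    pred_v = model(x_t, t, cond)
    return float(np.mean((pred_v - target_v) ** 2))  # MSE flow-matching loss

# Toy "model" that predicts zeros; loss is then the mean of (data - noise)^2.
loss = fm_training_step(rng.normal(size=(4, 8)), lambda x, t, c: np.zeros_like(x))
print(loss >= 0.0)  # True
```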

---

## Task-Specific Decoder Heads

Each specialist model adds a decoder head on top of the shared triplane output.

---

### Voxel Atlas — World Generation Decoder

**Task:** Generate full 3D environments and worlds — terrain, buildings, vegetation, interior spaces.

**Output formats:**
- Voxel grids (`.vox`, MagicaVoxel format) — for Minecraft-style worlds
- OBJ scene (multiple meshes with materials) — for Unity/Unreal environments
- USD stage (`.usd`) — industry-standard scene format

**Decoder head:**
```
TriplaneAtlasDecoder
├── Scene Layout Transformer
│   ├── Divides space into semantic regions (terrain, structures, vegetation, sky)
│   └── 6-layer transformer over 32×32 spatial grid of scene tokens
├── Region-wise NeRF decoder (per semantic region)
│   └── MLP: 3D coords + triplane features → density + RGB + semantic label
├── Marching Cubes extractor → raw mesh per region
├── Scene graph assembler → parent-child relationships between objects
├── Voxelizer (for .vox output) → discretizes to user-specified resolution
└── USD exporter → full scene hierarchy with lighting + materials
```

**Special modules:**
- **Infinite world tiling**: generate seamless adjacent chunks that stitch together
- **Biome-aware generation**: desert, forest, urban, underwater, space, fantasy
- **LOD generator**: auto-generates 4 levels of detail per scene object
- **Lighting estimator**: infers plausible sun/sky lighting from scene content

**Typical generation sizes:**
- Small scene: 64×64×64 voxels or ~500m² OBJ scene — ~8 seconds on A100
- Large world chunk: 256×256×128 voxels — ~35 seconds on A100

---

### Voxel Forge — Mesh / Asset Generation Decoder

**Task:** Generate clean, game-ready 3D assets — characters, objects, props, vehicles, architecture.

**Output formats:**
- OBJ + MTL (universal)
- GLB/GLTF (web & real-time)
- FBX (game engine standard)
- USDZ (Apple AR)

**Decoder head:**
```
TriplaneForgeDecoder
├── Occupancy Network decoder
│   └── MLP: 3D point + triplane → occupancy probability
├── Differentiable Marching Cubes → initial raw mesh
├── Mesh Refinement Network
│   ├── Graph neural network over mesh vertices/edges
│   ├── 8 message-passing rounds
│   └── Predicts vertex position offsets → clean topology
├── UV Unwrapper (learned, SeamlessUV lineage)
├── Texture Diffusion Head
│   ├── 2D flow matching in UV space
│   ├── Albedo + roughness + metallic + normal maps
│   └── 1024×1024 or 2048×2048 texture atlas
└── LOD Generator → 4 polycount levels (100% / 50% / 25% / 10%)
```
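To make the first two stages concrete, here is a toy occupancy query on a dense grid, which is the input Marching Cubes consumes. The real decoder is an MLP conditioned on triplane features; this stand-in uses an analytic unit sphere instead:

```python
import numpy as np

def occupancy(points: np.ndarray) -> np.ndarray:
    """Stand-in for the MLP occupancy decoder: a unit sphere here.
    The real head would also take triplane features per point."""
    return (np.linalg.norm(points, axis=-1) < 1.0).astype(float)

# Dense grid query over [-1.2, 1.2]^3, as Marching Cubes would consume.
N = 32
ax = np.linspace(-1.2, 1.2, N)
grid = np.stack(np.meshgrid(ax, ax, ax, indexing="ij"), axis=-1)
occ = occupancy(grid.reshape(-1, 3)).reshape(N, N, N)

# Occupied fraction: sphere volume / cube volume = (4/3*pi) / 2.4^3, roughly 0.30
print(occ.mean())
```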

**Special modules:**
- **Topology optimizer**: enforces quad-dominant topology for animation rigs
- **Symmetry enforcer**: optional bilateral symmetry for characters/vehicles
- **Scale normalizer**: outputs at real-world scale (meters) with unit metadata
- **Material classifier**: auto-tags materials (metal, wood, fabric, glass, etc.)
- **Animation-ready flag**: detects and preserves edge loops needed for rigging

**Polygon counts:**
- Low-poly asset: 500–5K triangles — ~6 seconds on A100
- Mid-poly asset: 5K–50K triangles — ~15 seconds on A100
- High-poly asset: 50K–500K triangles — ~45 seconds on A100

---

### Voxel Cast — 3D Printable Generation Decoder

**Task:** Generate physically valid, printable 3D models. Watertight, manifold, structurally sound.

**Output formats:**
- STL (universal printing format)
- OBJ (watertight)
- STEP (CAD-compatible, parametric)
- 3MF (modern printing format with material data)

**Decoder head:**
```
TriplaneCastDecoder
├── SDF (Signed Distance Field) decoder
│   └── MLP: 3D point + triplane → signed distance value
├── SDF → Watertight Mesh (dual marching cubes, no holes guaranteed)
├── Printability Validator
│   ├── Wall thickness checker (min 1.2mm enforced)
│   ├── Overhang analyzer (>45° flagged + support detection)
│   ├── Manifold checker + auto-repair
│   └── Volume/surface area calculator
├── Support Structure Generator (optional)
│   └── Generates minimal support trees for FDM printing
├── STEP Converter (via Open CASCADE bindings)
└── Slicer Preview Renderer (preview only, not full slicer)
```

**Special modules:**
- **Structural stress analyzer**: basic FEA simulation to detect weak points
- **Hollowing engine**: auto-hollows solid objects with configurable wall thickness + drain holes
- **Interlocking part splitter**: splits large objects into printable parts with snap-fit joints
- **Material suggester**: recommends PLA / PETG / resin based on geometry complexity
- **Scale validator**: ensures object is printable at specified scale on common bed sizes (Bambu, Prusa, Ender)

**Validation requirements (all Cast outputs must pass):**
- Zero non-manifold edges
- Zero self-intersections
- Minimum wall thickness ≥ 1.2mm at requested scale
- Watertight (no open boundaries)
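The manifold requirement can be checked with a simple edge-count test. This is a sketch of one validator pass, not the actual Cast implementation:

```python
from collections import Counter

def nonmanifold_edges(faces):
    """Edges not shared by exactly two faces. Zero such edges (plus
    consistent winding) is a necessary condition for watertightness."""
    counts = Counter()
    for a, b, c in faces:
        for e in ((a, b), (b, c), (c, a)):
            counts[tuple(sorted(e))] += 1
    return [e for e, n in counts.items() if n != 2]

tetra = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]  # closed tetrahedron
open_fan = [(0, 1, 2), (0, 2, 3)]                     # boundary edges remain
print(nonmanifold_edges(tetra))               # []
print(len(nonmanifold_edges(open_fan)))       # 4
```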

---

### Voxel Lens — NeRF / Gaussian Splatting Decoder

**Task:** Generate photorealistic 3D scenes represented as Neural Radiance Fields or 3D Gaussian Splats — primarily for visualization, VR/AR, and cinematic rendering.

**Output formats:**
- `.ply` (3D Gaussian Splatting — compatible with standard 3DGS viewers)
- NeRF weights (Instant-NGP / Nerfstudio compatible)
- MP4 render (pre-rendered orbital video)
- Depth maps + normal maps (per-view, for downstream use)

**Decoder head:**
```
TriplaneLensDecoder
├── Gaussian Parameter Decoder
│   ├── Samples 3D Gaussian centers from triplane density
│   ├── Per-Gaussian: position (3), rotation (4 quaternion), scale (3),
│   │   opacity (1), spherical harmonics coefficients (48) → color
│   └── Targets: 500K–3M Gaussians per scene
├── Gaussian Densification Module
│   ├── Adaptive densification: split/clone in high-gradient regions
│   └── Pruning: remove low-opacity Gaussians
├── NeRF branch (parallel)
│   ├── Hash-grid encoder (Instant-NGP style)
│   └── Tiny MLP: encoded position → density + color
├── Rasterizer (differentiable 3DGS rasterizer)
│   └── Used during training for photometric loss
└── Novel View Synthesizer
    └── Renders arbitrary camera trajectories for MP4 export
```
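Back-of-the-envelope storage for the per-Gaussian layout listed above (raw float32, before the compression module does its work):

```python
import numpy as np

# Per-Gaussian parameter layout from the decoder spec above.
LAYOUT = {"position": 3, "rotation": 4, "scale": 3, "opacity": 1, "sh_coeffs": 48}
FLOATS_PER_GAUSSIAN = sum(LAYOUT.values())

def gaussians_nbytes(n: int, dtype=np.float32) -> int:
    """Raw (uncompressed) storage for n Gaussians."""
    return n * FLOATS_PER_GAUSSIAN * np.dtype(dtype).itemsize

print(FLOATS_PER_GAUSSIAN)                # 59 floats per Gaussian
print(gaussians_nbytes(1_000_000) / 1e6)  # 236.0 MB for 1M Gaussians
```

At the upper 3M-Gaussian target that is roughly 700 MB raw, which is why the 60–80% compression module matters for distribution.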

**Special modules:**
- **Lighting decomposition**: separates scene into albedo + illumination components
- **Dynamic scene support**: temporal Gaussian sequences for animated scenes (from video input)
- **Background/foreground separator**: isolates subject from environment
- **Camera trajectory planner**: auto-generates cinematic orbital/fly-through paths
- **Compression module**: reduces 3DGS file size by 60–80% with minimal quality loss

**Generation modes:**
- Object-centric: single object, orbital views — ~12 seconds on A100
- Indoor scene: full room with lighting — ~40 seconds on A100
- Outdoor scene: landscape or street — ~90 seconds on A100

---

### Voxel Prime — Closed Source All-in-One

**Access:** API only. Not open source. Weights never distributed.

Voxel Prime contains all four decoder heads simultaneously, plus:

**Additional Prime-only modules:**
- **Cross-task consistency**: ensures Atlas world + Forge assets + Lens scene all match when generated together
- **Scene population engine**: generates a world (Atlas) then auto-populates it with assets (Forge)
- **Pipeline orchestrator**: chains Atlas → Forge → Cast → Lens in one API call
- **Photorealistic texture upscaler**: 4× super-resolution on all generated textures
- **Style transfer module**: apply artistic style (e.g. "Studio Ghibli", "cyberpunk", "brutalist architecture") across all output types
- **Iterative refinement**: text-guided editing of already-generated 3D content

**API endpoint:**
```
POST /v1/voxel/generate
{
  "prompt": "A medieval castle on a cliff at sunset",
  "output_types": ["world", "mesh", "nerf"],  # any combination
  "inputs": {
    "image": "base64...",       # optional reference image
    "multiview": ["base64..."], # optional multi-view images
    "video": "base64...",       # optional video
    "model": "base64..."        # optional existing 3D model
  },
  "settings": {
    "quality": "high",          # draft | standard | high
    "style": "realistic",       # realistic | stylized | low-poly | ...
    "scale_meters": 100.0,      # real-world scale
    "symmetry": false,
    "printable": false
  }
}
```

---

## Shared Custom Modules (All Models)

| # | Module | Description |
|---|---|---|
| 1 | **Multi-Modal Conditioning Fusion** | CrossModalAttention over all active input types |
| 2 | **3D RoPE Encoder** | RoPE adapted for triplane 3D spatial positions |
| 3 | **Geometry Quality Scorer** | Rates generated geometry quality [0–1] before output |
| 4 | **Semantic Label Head** | Per-voxel/vertex semantic class (wall, floor, tree, etc.) |
| 5 | **Scale & Unit Manager** | Enforces consistent real-world scale across all outputs |
| 6 | **Material Property Head** | Predicts PBR material properties (roughness, metallic, IOR) |
| 7 | **Confidence & Uncertainty Head** | Per-region generation confidence — flags uncertain areas |
| 8 | **Prompt Adherence Scorer** | CLIP-based score: how well output matches text prompt |
| 9 | **Multi-Resolution Decoder** | Generates at 64³ → 128³ → 256³ coarse-to-fine |
| 10 | **Style Embedding Module** | Encodes style reference images into style conditioning vector |

---

## Training Data Plan

| Dataset | Content | Used by |
|---|---|---|
| ShapeNet (55K models) | Common 3D objects | Forge, Cast |
| Objaverse (800K+ models) | Diverse 3D assets | Forge, Cast, Lens |
| Objaverse-XL (10M+ objects) | Massive scale | All |
| ScanNet / ScanNet++ | Indoor 3D scans | Atlas, Lens |
| KITTI / nuScenes | Outdoor 3D scenes | Atlas, Lens |
| ABO (Amazon Berkeley Objects) | Product meshes + materials | Forge |
| Thingiverse (printable models) | 3D printable STLs | Cast |
| Polycam scans | Real-world 3DGS/NeRF | Lens |
| Synthetic renders (generated) | Multi-view rendered images | All |
| Text-3D pairs (synthetic) | GPT-4o generated descriptions of Objaverse | All |

---

## Parameter Estimates

| Model | Backbone | Decoder Head | Total | VRAM (BF16) |
|---|---|---|---|---|
| Voxel Atlas | 2.3B | ~400M | ~2.7B | ~22GB |
| Voxel Forge | 2.3B | ~350M | ~2.65B | ~21GB |
| Voxel Cast | 2.3B | ~200M | ~2.5B | ~20GB |
| Voxel Lens | 2.3B | ~500M | ~2.8B | ~22GB |
| Voxel Prime | 2.3B | ~1.4B (all 4) | ~3.7B | ~30GB |

All fit on A100 40GB in BF16. INT8 quantization brings all under 15GB (consumer 4090 viable).
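The BF16/INT8 numbers can be sanity-checked with simple arithmetic. This counts weights only; the table's VRAM figures additionally include activations, buffers, and decoder working memory:

```python
def weight_gb(params_billion: float, bytes_per_param: int) -> float:
    """Weight memory only, in GiB; runtime overhead comes on top."""
    return params_billion * 1e9 * bytes_per_param / 2**30

# BF16 = 2 bytes/param, INT8 = 1 byte/param.
print(round(weight_gb(2.7, 2), 1))   # 5.0 -> Voxel Atlas weights in BF16
print(round(weight_gb(3.7, 1), 1))   # 3.4 -> Voxel Prime weights in INT8
```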

---

## Training Strategy

### Phase 1 — Backbone Pre-training
- Train shared backbone on Objaverse-XL triplane reconstructions
- Learn general 3D structure without task-specific heads
- Context: text + single image conditioning only
- 100K steps, A100 cluster

### Phase 2 — Decoder Head Training (parallel)
- Freeze backbone, train each decoder head independently
- Atlas: ScanNet + synthetic world data
- Forge: ShapeNet + Objaverse + texture data
- Cast: Thingiverse + watertight synthetic meshes
- Lens: Polycam + synthetic multi-view renders
- 50K steps each

### Phase 3 — Joint Fine-tuning
- Unfreeze backbone, fine-tune end-to-end per specialist model
- Add all input modalities (video, multi-view, point cloud)
- 30K steps each

### Phase 4 — Prime Training
- Initialize from jointly fine-tuned backbone
- Train all decoder heads simultaneously
- Cross-task consistency losses
- Prime-only module training (pipeline orchestrator, style transfer)
- 50K steps

---

## HuggingFace Plan

```
Matrix-Corp/Voxel-Atlas-V1    — open source
Matrix-Corp/Voxel-Forge-V1    — open source
Matrix-Corp/Voxel-Cast-V1     — open source
Matrix-Corp/Voxel-Lens-V1     — open source
Matrix-Corp/Voxel-Prime-V1    — closed source, API only (card only, no weights)
```

Collection: `Matrix-Corp/voxel-v1`

---

## Status
- 🔴 Planned — Architecture specification complete
- Backbone design finalized
- Decoder head designs finalized
- Training data sourcing: TBD
- Compute requirements: significant (A100 cluster for training)
- Timeline: TBD