---
title: Matrix Voxel
emoji: 🌏
colorFrom: green
colorTo: pink
sdk: static
pinned: false
license: cc-by-nc-nd-4.0
short_description: The next gen 3D generator
---

# Matrix Voxel — Full Architecture & Planning Document
**3D Generation Model Family | Matrix.Corp**

---

## Family Overview

Matrix Voxel is Matrix.Corp's 3D generation family: five models share a common flow-matching backbone, each with a task-specific decoder head. Four specialist models are open source; one unified all-in-one (Voxel Prime) is closed source and API-only.

| Model | Task | Output Formats | Source | Hardware | Status |
|---|---|---|---|---|---|
| Voxel Atlas | World / environment generation | Voxel grids, OBJ scenes, USD stages | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Forge | 3D mesh / asset generation | OBJ, GLB, FBX, USDZ | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Cast | 3D printable model generation | STL, OBJ (watertight), STEP | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Lens | NeRF / Gaussian Splatting scenes | .ply (3DGS), NeRF weights, MP4 render | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Prime | All-in-one unified generation | All of the above | 🟣 Closed Source | API Only | 🔴 Planned |

---

## Input Modalities (All Models)

Every Voxel model accepts any combination of:

| Input | Description | Encoder |
|---|---|---|
| Text prompt | Natural language description of desired 3D output | CLIP-ViT-L / T5-XXL |
| Single image | Reference image → 3D lift | DINOv2 + custom depth encoder |
| Multi-view images | 2–12 images from different angles | Multi-view transformer encoder |
| Video | Extracts frames, infers 3D from motion | Temporal encoder (Video-MAE lineage) |
| 3D model | Existing mesh/point cloud as conditioning | PointNet++ encoder |

All inputs are projected into a shared 1024-dim conditioning embedding space before entering the backbone.
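As a concrete illustration, the per-modality projection into the shared space can be sketched in numpy. The native widths and modality names below are placeholders, not the actual Matrix Voxel values:

```python
import numpy as np

rng = np.random.default_rng(0)
COND_DIM = 1024  # shared conditioning width from the spec

# Hypothetical native widths of each frozen encoder's output tokens.
NATIVE_DIMS = {"text": 4096, "image": 1024, "video": 768, "points": 512}

# One learned linear projection per modality (random stand-ins here).
projections = {m: rng.normal(0, 0.02, (d, COND_DIM)) for m, d in NATIVE_DIMS.items()}

def project_conditioning(features: dict) -> np.ndarray:
    """Map each modality's tokens into the shared 1024-dim space and
    concatenate along the token axis."""
    tokens = [feats @ projections[m] for m, feats in features.items()]
    return np.concatenate(tokens, axis=0)

# Example: text (77 tokens) + single image (256 patch tokens).
cond = project_conditioning({
    "text": rng.normal(size=(77, 4096)),
    "image": rng.normal(size=(256, 1024)),
})
print(cond.shape)  # (333, 1024)
```

In practice the fused sequence would then feed the CrossModalAttention block described below.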

---

## Core Architecture — Shared Flow Matching Backbone

### Why Flow Matching?
Flow matching (Lipman et al. 2022, extended by the Stable Diffusion 3 / FLUX lineage) learns a direct vector field from noise → data. It trains more stably than DDPM diffusion, needs far fewer inference steps (typically 20–50 vs ~1000), and offers better mode coverage. As of 2025–2026 it is the dominant training objective for large-scale generative models.

### 3D Representation — Triplane + Latent Voxel Grid
All Voxel models operate in a shared latent 3D space:
- **Triplane representation**: three axis-aligned feature planes (XY, XZ, YZ), each 256×256×32 channels
- Any 3D point queried by projecting onto all 3 planes and summing features
- Compact (3 × 256 × 256 × 32 ≈ 6.3M latent values) yet expressive
- Flow matching operates on this triplane latent space, not raw 3D points
- Decoder heads decode triplane to task-specific output format
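A minimal numpy sketch of the triplane query described above (nearest-neighbor sampling for brevity; a real implementation would interpolate bilinearly):

```python
import numpy as np

RES, C = 256, 32
rng = np.random.default_rng(1)
# Three axis-aligned feature planes: XY, XZ, YZ.
planes = {k: rng.normal(size=(RES, RES, C)) for k in ("xy", "xz", "yz")}

def query_triplane(p: np.ndarray) -> np.ndarray:
    """Feature for a 3D point p in [-1, 1]^3: project onto each plane,
    sample it, and sum the three per-plane features."""
    ix, iy, iz = ((p + 1) / 2 * (RES - 1)).round().astype(int)
    return planes["xy"][ix, iy] + planes["xz"][ix, iz] + planes["yz"][iy, iz]

feat = query_triplane(np.array([0.1, -0.4, 0.7]))
print(feat.shape)  # (32,)
```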

### Backbone Architecture

```
VoxelBackbone
├── Input Encoder (multimodal conditioning)
│   ├── TextEncoder         — T5-XXL + CLIP-ViT-L, projected to 1024-dim
│   ├── ImageEncoder        — DINOv2-L, projected to 1024-dim
│   ├── MultiViewEncoder    — custom transformer over N views
│   ├── VideoEncoder        — Video-MAE, temporal pooling → 1024-dim
│   └── PointCloudEncoder   — PointNet++, global + local features → 1024-dim
│
├── Conditioning Fusion
│   └── CrossModalAttention — fuses all active input modalities
│
├── Flow Matching Transformer (DiT-style)
│   ├── 24 transformer blocks
│   ├── Hidden dim: 1536
│   ├── Heads: 24
│   ├── Conditioning: AdaLN-Zero (timestep + conditioning signal)
│   ├── 3D RoPE positional encoding for triplane tokens
│   └── ~2.3B parameters
│
└── Triplane Decoder (shared across all specialist models)
    └── Outputs: triplane feature tensor (3 × 256 × 256 × 32)
```

### Flow Matching Training
- Learn vector field v_θ(x_t, t, c), where x_t is the noisy triplane and c is the conditioning
- Optimal transport flow: straight paths from noise → data (better than DDPM's curved paths)
- Inference: 20–50 NFE (number of function evaluations) — fast on A100
- Classifier-free guidance: unconditional dropout 10% during training
- Guidance scale 5.0–10.0 at inference
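The bullets above translate into a short training-step sketch; the toy model, shapes, and conditioning placeholder are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2)

def fm_training_step(data: np.ndarray, model) -> float:
    """One optimal-transport flow-matching step: interpolate on the
    straight path noise -> data and regress its constant velocity."""
    noise = rng.normal(size=data.shape)
    t = rng.uniform()                    # one timestep for the batch, for brevity
    x_t = (1 - t) * noise + t * data     # point on the straight path
    target_v = data - noise              # velocity of the OT path
    # Classifier-free guidance: drop conditioning 10% of the time.
    cond = None if rng.uniform() < 0.1 else "cond_embedding"
    pred_v = model(x_t, t, cond)
    return float(np.mean((pred_v - target_v) ** 2))  # MSE flow-matching loss

# Toy "model" that predicts zeros; loss is then the mean of (data - noise)^2.
loss = fm_training_step(rng.normal(size=(4, 8)), lambda x, t, c: np.zeros_like(x))
print(loss >= 0.0)  # True
```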

---

## Task-Specific Decoder Heads

Each specialist model adds a decoder head on top of the shared triplane output.

---

### Voxel Atlas — World Generation Decoder

**Task:** Generate full 3D environments and worlds — terrain, buildings, vegetation, interior spaces.

**Output formats:**
- Voxel grids (`.vox`, MagicaVoxel format) — for Minecraft-style worlds
- OBJ scene (multiple meshes with materials) — for Unity/Unreal environments
- USD stage (`.usd`) — industry-standard scene format

**Decoder head:**
```
TriplaneAtlasDecoder
├── Scene Layout Transformer
│   ├── Divides space into semantic regions (terrain, structures, vegetation, sky)
│   └── 6-layer transformer over 32×32 spatial grid of scene tokens
├── Region-wise NeRF decoder (per semantic region)
│   └── MLP: 3D coords + triplane features → density + RGB + semantic label
├── Marching Cubes extractor → raw mesh per region
├── Scene graph assembler → parent-child relationships between objects
├── Voxelizer (for .vox output) → discretizes to user-specified resolution
└── USD exporter → full scene hierarchy with lighting + materials
```

**Special modules:**
- **Infinite world tiling**: generate seamless adjacent chunks that stitch together
- **Biome-aware generation**: desert, forest, urban, underwater, space, fantasy
- **LOD generator**: auto-generates 4 levels of detail per scene object
- **Lighting estimator**: infers plausible sun/sky lighting from scene content

**Typical generation sizes:**
- Small scene: 64×64×64 voxels or ~500m² OBJ scene — ~8 seconds on A100
- Large world chunk: 256×256×128 voxels — ~35 seconds on A100

---

### Voxel Forge — Mesh / Asset Generation Decoder

**Task:** Generate clean, game-ready 3D assets — characters, objects, props, vehicles, architecture.

**Output formats:**
- OBJ + MTL (universal)
- GLB/GLTF (web & real-time)
- FBX (game engine standard)
- USDZ (Apple AR)

**Decoder head:**
```
TriplaneForgeDecoder
├── Occupancy Network decoder
│   └── MLP: 3D point + triplane → occupancy probability
├── Differentiable Marching Cubes → initial raw mesh
├── Mesh Refinement Network
│   ├── Graph neural network over mesh vertices/edges
│   ├── 8 message-passing rounds
│   └── Predicts vertex position offsets → clean topology
├── UV Unwrapper (learned, SeamlessUV lineage)
├── Texture Diffusion Head
│   ├── 2D flow matching in UV space
│   ├── Albedo + roughness + metallic + normal maps
│   └── 1024×1024 or 2048×2048 texture atlas
└── LOD Generator → 4 polycount levels (100% / 50% / 25% / 10%)
```
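To make the first two stages concrete, here is a toy occupancy query on a dense grid, which is the input Marching Cubes consumes. The real decoder is an MLP conditioned on triplane features; this stand-in uses an analytic unit sphere instead:

```python
import numpy as np

def occupancy(points: np.ndarray) -> np.ndarray:
    """Stand-in for the MLP occupancy decoder: a unit sphere here.
    The real head would also take triplane features per point."""
    return (np.linalg.norm(points, axis=-1) < 1.0).astype(float)

# Dense grid query over [-1.2, 1.2]^3, as Marching Cubes would consume.
N = 32
ax = np.linspace(-1.2, 1.2, N)
grid = np.stack(np.meshgrid(ax, ax, ax, indexing="ij"), axis=-1)
occ = occupancy(grid.reshape(-1, 3)).reshape(N, N, N)

# Occupied fraction: sphere volume / cube volume = (4/3*pi) / 2.4^3, roughly 0.30
print(occ.mean())
```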

**Special modules:**
- **Topology optimizer**: enforces quad-dominant topology for animation rigs
- **Symmetry enforcer**: optional bilateral symmetry for characters/vehicles
- **Scale normalizer**: outputs at real-world scale (meters) with unit metadata
- **Material classifier**: auto-tags materials (metal, wood, fabric, glass, etc.)
- **Animation-ready flag**: detects and preserves edge loops needed for rigging

**Polygon counts:**
- Low-poly asset: 500–5K triangles — ~6 seconds on A100
- Mid-poly asset: 5K–50K triangles — ~15 seconds on A100
- High-poly asset: 50K–500K triangles — ~45 seconds on A100

---

### Voxel Cast — 3D Printable Generation Decoder

**Task:** Generate physically valid, printable 3D models. Watertight, manifold, structurally sound.

**Output formats:**
- STL (universal printing format)
- OBJ (watertight)
- STEP (CAD-compatible, parametric)
- 3MF (modern printing format with material data)

**Decoder head:**
```
TriplaneCastDecoder
├── SDF (Signed Distance Field) decoder
│   └── MLP: 3D point + triplane → signed distance value
├── SDF → Watertight Mesh (dual marching cubes, no holes guaranteed)
├── Printability Validator
│   ├── Wall thickness checker (min 1.2mm enforced)
│   ├── Overhang analyzer (>45° flagged + support detection)
│   ├── Manifold checker + auto-repair
│   └── Volume/surface area calculator
├── Support Structure Generator (optional)
│   └── Generates minimal support trees for FDM printing
├── STEP Converter (via Open CASCADE bindings)
└── Slicer Preview Renderer (preview only, not full slicer)
```

**Special modules:**
- **Structural stress analyzer**: basic FEA simulation to detect weak points
- **Hollowing engine**: auto-hollows solid objects with configurable wall thickness + drain holes
- **Interlocking part splitter**: splits large objects into printable parts with snap-fit joints
- **Material suggester**: recommends PLA / PETG / resin based on geometry complexity
- **Scale validator**: ensures object is printable at specified scale on common bed sizes (Bambu, Prusa, Ender)

**Validation requirements (all Cast outputs must pass):**
- Zero non-manifold edges
- Zero self-intersections
- Minimum wall thickness ≥ 1.2mm at requested scale
- Watertight (no open boundaries)
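The manifold requirement can be checked with a simple edge-count test. This is a sketch of one validator pass, not the actual Cast implementation:

```python
from collections import Counter

def nonmanifold_edges(faces):
    """Edges not shared by exactly two faces. Zero such edges (plus
    consistent winding) is a necessary condition for watertightness."""
    counts = Counter()
    for a, b, c in faces:
        for e in ((a, b), (b, c), (c, a)):
            counts[tuple(sorted(e))] += 1
    return [e for e, n in counts.items() if n != 2]

tetra = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]  # closed tetrahedron
open_fan = [(0, 1, 2), (0, 2, 3)]                     # boundary edges remain
print(nonmanifold_edges(tetra))               # []
print(len(nonmanifold_edges(open_fan)))       # 4
```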

---

### Voxel Lens — NeRF / Gaussian Splatting Decoder

**Task:** Generate photorealistic 3D scenes represented as Neural Radiance Fields or 3D Gaussian Splats — primarily for visualization, VR/AR, and cinematic rendering.

**Output formats:**
- `.ply` (3D Gaussian Splatting — compatible with standard 3DGS viewers)
- NeRF weights (Instant-NGP / Nerfstudio compatible)
- MP4 render (pre-rendered orbital video)
- Depth maps + normal maps (per-view, for downstream use)

**Decoder head:**
```
TriplaneLensDecoder
├── Gaussian Parameter Decoder
│   ├── Samples 3D Gaussian centers from triplane density
│   ├── Per-Gaussian: position (3), rotation (4 quaternion), scale (3),
│   │   opacity (1), spherical harmonics coefficients (48) → color
│   └── Targets: 500K–3M Gaussians per scene
├── Gaussian Densification Module
│   ├── Adaptive densification: split/clone in high-gradient regions
│   └── Pruning: remove low-opacity Gaussians
├── NeRF branch (parallel)
│   ├── Hash-grid encoder (Instant-NGP style)
│   └── Tiny MLP: encoded position → density + color
├── Rasterizer (differentiable 3DGS rasterizer)
│   └── Used during training for photometric loss
└── Novel View Synthesizer
    └── Renders arbitrary camera trajectories for MP4 export
```
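Back-of-the-envelope storage for the per-Gaussian layout listed above (raw float32, before the compression module does its work):

```python
import numpy as np

# Per-Gaussian parameter layout from the decoder spec above.
LAYOUT = {"position": 3, "rotation": 4, "scale": 3, "opacity": 1, "sh_coeffs": 48}
FLOATS_PER_GAUSSIAN = sum(LAYOUT.values())

def gaussians_nbytes(n: int, dtype=np.float32) -> int:
    """Raw (uncompressed) storage for n Gaussians."""
    return n * FLOATS_PER_GAUSSIAN * np.dtype(dtype).itemsize

print(FLOATS_PER_GAUSSIAN)                # 59 floats per Gaussian
print(gaussians_nbytes(1_000_000) / 1e6)  # 236.0 MB for 1M Gaussians
```

At the upper 3M-Gaussian target that is roughly 700 MB raw, which is why the 60–80% compression module matters for distribution.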

**Special modules:**
- **Lighting decomposition**: separates scene into albedo + illumination components
- **Dynamic scene support**: temporal Gaussian sequences for animated scenes (from video input)
- **Background/foreground separator**: isolates subject from environment
- **Camera trajectory planner**: auto-generates cinematic orbital/fly-through paths
- **Compression module**: reduces 3DGS file size by 60–80% with minimal quality loss

**Generation modes:**
- Object-centric: single object, orbital views — ~12 seconds on A100
- Indoor scene: full room with lighting — ~40 seconds on A100
- Outdoor scene: landscape or street — ~90 seconds on A100

---

### Voxel Prime — Closed Source All-in-One

**Access:** API only. Not open source. Weights never distributed.

Voxel Prime contains all four decoder heads simultaneously, plus:

**Additional Prime-only modules:**
- **Cross-task consistency**: ensures Atlas world + Forge assets + Lens scene all match when generated together
- **Scene population engine**: generates a world (Atlas) then auto-populates it with assets (Forge)
- **Pipeline orchestrator**: chains Atlas → Forge → Cast → Lens in one API call
- **Photorealistic texture upscaler**: 4× super-resolution on all generated textures
- **Style transfer module**: apply artistic style (e.g. "Studio Ghibli", "cyberpunk", "brutalist architecture") across all output types
- **Iterative refinement**: text-guided editing of already-generated 3D content

**API endpoint:**
```
POST /v1/voxel/generate
{
  "prompt": "A medieval castle on a cliff at sunset",
  "output_types": ["world", "mesh", "nerf"],  # any combination
  "inputs": {
    "image": "base64...",       # optional reference image
    "multiview": ["base64..."], # optional multi-view images
    "video": "base64...",       # optional video
    "model": "base64..."        # optional existing 3D model
  },
  "settings": {
    "quality": "high",          # draft | standard | high
    "style": "realistic",       # realistic | stylized | low-poly | ...
    "scale_meters": 100.0,      # real-world scale
    "symmetry": false,
    "printable": false
  }
}
```

---

## Shared Custom Modules (All Models)

| # | Module | Description |
|---|---|---|
| 1 | **Multi-Modal Conditioning Fusion** | CrossModalAttention over all active input types |
| 2 | **3D RoPE Encoder** | RoPE adapted for triplane 3D spatial positions |
| 3 | **Geometry Quality Scorer** | Rates generated geometry quality [0–1] before output |
| 4 | **Semantic Label Head** | Per-voxel/vertex semantic class (wall, floor, tree, etc.) |
| 5 | **Scale & Unit Manager** | Enforces consistent real-world scale across all outputs |
| 6 | **Material Property Head** | Predicts PBR material properties (roughness, metallic, IOR) |
| 7 | **Confidence & Uncertainty Head** | Per-region generation confidence — flags uncertain areas |
| 8 | **Prompt Adherence Scorer** | CLIP-based score: how well output matches text prompt |
| 9 | **Multi-Resolution Decoder** | Generates at 64³ → 128³ → 256³ coarse-to-fine |
| 10 | **Style Embedding Module** | Encodes style reference images into style conditioning vector |

---

## Training Data Plan

| Dataset | Content | Used by |
|---|---|---|
| ShapeNet (55K models) | Common 3D objects | Forge, Cast |
| Objaverse (800K+ models) | Diverse 3D assets | Forge, Cast, Lens |
| Objaverse-XL (10M+ objects) | Massive scale | All |
| ScanNet / ScanNet++ | Indoor 3D scans | Atlas, Lens |
| KITTI / nuScenes | Outdoor 3D scenes | Atlas, Lens |
| ABO (Amazon Berkeley Objects) | Product meshes + materials | Forge |
| Thingiverse (printable models) | 3D printable STLs | Cast |
| Polycam scans | Real-world 3DGS/NeRF | Lens |
| Synthetic renders (generated) | Multi-view rendered images | All |
| Text-3D pairs (synthetic) | GPT-4o generated descriptions of Objaverse | All |

---

## Parameter Estimates

| Model | Backbone | Decoder Head | Total | VRAM (BF16) |
|---|---|---|---|---|
| Voxel Atlas | 2.3B | ~400M | ~2.7B | ~22GB |
| Voxel Forge | 2.3B | ~350M | ~2.65B | ~21GB |
| Voxel Cast | 2.3B | ~200M | ~2.5B | ~20GB |
| Voxel Lens | 2.3B | ~500M | ~2.8B | ~22GB |
| Voxel Prime | 2.3B | ~1.4B (all 4) | ~3.7B | ~30GB |

All fit on A100 40GB in BF16. INT8 quantization brings all under 15GB (consumer 4090 viable).
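The BF16/INT8 numbers can be sanity-checked with simple arithmetic. This counts weights only; the table's VRAM figures additionally include activations, buffers, and decoder working memory:

```python
def weight_gb(params_billion: float, bytes_per_param: int) -> float:
    """Weight memory only, in GiB; runtime overhead comes on top."""
    return params_billion * 1e9 * bytes_per_param / 2**30

# BF16 = 2 bytes/param, INT8 = 1 byte/param.
print(round(weight_gb(2.7, 2), 1))   # 5.0 -> Voxel Atlas weights in BF16
print(round(weight_gb(3.7, 1), 1))   # 3.4 -> Voxel Prime weights in INT8
```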

---

## Training Strategy

### Phase 1 — Backbone Pre-training
- Train shared backbone on Objaverse-XL triplane reconstructions
- Learn general 3D structure without task-specific heads
- Context: text + single image conditioning only
- 100K steps, A100 cluster

### Phase 2 — Decoder Head Training (parallel)
- Freeze backbone, train each decoder head independently
- Atlas: ScanNet + synthetic world data
- Forge: ShapeNet + Objaverse + texture data
- Cast: Thingiverse + watertight synthetic meshes
- Lens: Polycam + synthetic multi-view renders
- 50K steps each

### Phase 3 — Joint Fine-tuning
- Unfreeze backbone, fine-tune end-to-end per specialist model
- Add all input modalities (video, multi-view, point cloud)
- 30K steps each

### Phase 4 — Prime Training
- Initialize from jointly fine-tuned backbone
- Train all decoder heads simultaneously
- Cross-task consistency losses
- Prime-only module training (pipeline orchestrator, style transfer)
- 50K steps

---

## HuggingFace Plan

```
Matrix-Corp/Voxel-Atlas-V1    — open source
Matrix-Corp/Voxel-Forge-V1    — open source
Matrix-Corp/Voxel-Cast-V1     — open source
Matrix-Corp/Voxel-Lens-V1     — open source
Matrix-Corp/Voxel-Prime-V1    — closed source, API only (card only, no weights)
```

Collection: `Matrix-Corp/voxel-v1`

---

## Status
- 🔴 Planned — Architecture specification complete
- Backbone design finalized
- Decoder head designs finalized
- Training data sourcing: TBD
- Compute requirements: significant (A100 cluster for training)
- Timeline: TBD