---
title: Matrix Voxel
emoji: π
colorFrom: green
colorTo: pink
sdk: static
pinned: false
license: cc-by-nc-nd-4.0
short_description: The next gen 3D generator
---
# Matrix Voxel – Full Architecture & Planning Document

**3D Generation Model Family | Matrix.Corp**

---

## Family Overview

Matrix Voxel is Matrix.Corp's 3D generation family: five models sharing a common flow-matching backbone, each with a task-specific decoder head. Four specialist models are open source; one unified all-in-one model (Voxel Prime) is closed source and API-only.
| Model | Task | Output Formats | Source | Hardware | Status |
|---|---|---|---|---|---|
| Voxel Atlas | World / environment generation | Voxel grids, OBJ scenes, USD stages | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Forge | 3D mesh / asset generation | OBJ, GLB, FBX, USDZ | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Cast | 3D printable model generation | STL, OBJ (watertight), STEP | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Lens | NeRF / Gaussian Splatting scenes | .ply (3DGS), NeRF weights, MP4 render | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Prime | All-in-one unified generation | All of the above | 🟣 Closed Source | API only | 🔴 Planned |
---

## Input Modalities (All Models)

Every Voxel model accepts any combination of:

| Input | Description | Encoder |
|---|---|---|
| Text prompt | Natural language description of the desired 3D output | CLIP-ViT-L / T5-XXL |
| Single image | Reference image → 3D lift | DINOv2 + custom depth encoder |
| Multi-view images | 2–12 images from different angles | Multi-view transformer encoder |
| Video | Extracts frames, infers 3D from motion | Temporal encoder (Video-MAE lineage) |
| 3D model | Existing mesh / point cloud as conditioning | PointNet++ encoder |

All inputs are projected into a shared 1024-dim conditioning embedding space before entering the backbone.
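As a simplified sketch of this projection step, the snippet below maps per-modality encoder outputs of differing widths into the shared 1024-dim space with one linear projection each. The per-encoder widths (T5-XXL at 4096-dim, DINOv2-L and PointNet++ global features at 1024-dim) and the random projection matrices are illustrative stand-ins; only the 1024-dim target comes from this spec.

```python
import numpy as np

D_COND = 1024  # shared conditioning width from the spec

# Hypothetical per-encoder output widths (illustrative).
ENCODER_DIMS = {"text": 4096, "image": 1024, "pointcloud": 1024}

rng = np.random.default_rng(0)
# One learned projection matrix per modality (random stand-ins here),
# scaled so outputs keep roughly unit variance.
projections = {name: rng.standard_normal((dim, D_COND)) * (dim ** -0.5)
               for name, dim in ENCODER_DIMS.items()}

def project_conditioning(features: dict[str, np.ndarray]) -> np.ndarray:
    """Map each active modality's tokens into the shared 1024-dim space
    and concatenate along the token axis."""
    projected = [feats @ projections[name] for name, feats in features.items()]
    return np.concatenate(projected, axis=0)

# Example: 8 text tokens + 1 global point-cloud feature -> 9 shared tokens.
cond = project_conditioning({
    "text": rng.standard_normal((8, 4096)),
    "pointcloud": rng.standard_normal((1, 1024)),
})
print(cond.shape)  # (9, 1024)
```

Any subset of modalities can be active; the fusion stage downstream only ever sees tokens of a single width.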
---

## Core Architecture – Shared Flow Matching Backbone

### Why Flow Matching?

Flow matching (Lipman et al. 2022, extended by the Stable Diffusion 3 / FLUX lineage) learns a direct vector field from noise → data. It is faster than DDPM diffusion at inference (typically 20–50 steps vs 1000), trains more stably, and offers better mode coverage. As of 2025–2026 it is the state of the art for generative modeling.
### 3D Representation – Triplane + Latent Voxel Grid

All Voxel models operate in a shared latent 3D space:

- **Triplane representation**: three axis-aligned feature planes (XY, XZ, YZ), each 256×256 with 32 channels
- Any 3D point is queried by projecting it onto all three planes and summing the sampled features
- Compact (3 × 256 × 256 × 32 ≈ 6.3M latent values) yet expressive
- Flow matching operates on this triplane latent space, not on raw 3D points
- Decoder heads decode the triplane into each task-specific output format
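The point query above can be sketched in plain NumPy: project onto each plane, bilinearly sample, and sum. The resolution and channel count follow the spec; the random planes are stand-ins for learned features.

```python
import numpy as np

R, C = 256, 32  # plane resolution and channels, per the spec

rng = np.random.default_rng(0)
# Three axis-aligned feature planes: XY, XZ, YZ.
planes = {ax: rng.standard_normal((R, R, C)) for ax in ("xy", "xz", "yz")}

def _sample(plane: np.ndarray, u: float, v: float) -> np.ndarray:
    """Bilinearly sample an (R, R, C) plane at continuous coords in [0, 1]."""
    x, y = u * (R - 1), v * (R - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, R - 1), min(y0 + 1, R - 1)
    fx, fy = x - x0, y - y0
    return ((1 - fx) * (1 - fy) * plane[x0, y0] + fx * (1 - fy) * plane[x1, y0]
            + (1 - fx) * fy * plane[x0, y1] + fx * fy * plane[x1, y1])

def query_triplane(p: np.ndarray) -> np.ndarray:
    """Project a point in [0,1]^3 onto all three planes and sum features."""
    x, y, z = p
    return (_sample(planes["xy"], x, y)
            + _sample(planes["xz"], x, z)
            + _sample(planes["yz"], y, z))

feat = query_triplane(np.array([0.5, 0.25, 0.75]))
print(feat.shape)  # (32,)
```

Summation (rather than concatenation) keeps the per-point feature at 32 channels regardless of how many planes contribute.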
### Backbone Architecture

```
VoxelBackbone
├── Input Encoder (multimodal conditioning)
│   ├── TextEncoder       → T5-XXL + CLIP-ViT-L, projected to 1024-dim
│   ├── ImageEncoder      → DINOv2-L, projected to 1024-dim
│   ├── MultiViewEncoder  → custom transformer over N views
│   ├── VideoEncoder      → Video-MAE, temporal pooling → 1024-dim
│   └── PointCloudEncoder → PointNet++, global + local features → 1024-dim
│
├── Conditioning Fusion
│   └── CrossModalAttention → fuses all active input modalities
│
├── Flow Matching Transformer (DiT-style)
│   ├── 24 transformer blocks
│   ├── Hidden dim: 1536
│   ├── Heads: 24
│   ├── Conditioning: AdaLN-Zero (timestep + conditioning signal)
│   ├── 3D RoPE positional encoding for triplane tokens
│   └── ~2.3B parameters
│
└── Triplane Decoder (shared across all specialist models)
    └── Outputs: triplane feature tensor (3 × 256 × 256 × 32)
```
### Flow Matching Training

- Learn a vector field v_θ(x_t, t, c), where x_t is the noisy triplane and c is the conditioning
- Optimal-transport flow: straight paths from noise → data (better than DDPM's curved paths)
- Inference: 20–50 NFE (number of function evaluations) → fast on A100
- Classifier-free guidance: 10% unconditional dropout during training
- Guidance scale 5.0–10.0 at inference
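The optimal-transport objective above reduces to a simple interpolation-and-regress recipe. This toy sketch (1D tensors standing in for triplanes) shows the training pair construction and why straight paths integrate in few steps; the oracle velocity field replaces a trained v_θ.

```python
import numpy as np

rng = np.random.default_rng(0)

def ot_pair(x0: np.ndarray, x1: np.ndarray, t: float):
    """Straight-line (optimal-transport) path from noise x0 to data x1:
    returns the interpolated state x_t and the regression target v."""
    x_t = (1 - t) * x0 + t * x1
    v_target = x1 - x0          # constant along the straight path
    return x_t, v_target

x0 = rng.standard_normal(8)     # noise sample
x1 = np.ones(8)                 # stand-in "data" sample

# Training: at a random t, the model v_theta(x_t, t, c) regresses v_target
# with an MSE loss (classifier-free guidance drops c ~10% of the time).
t = rng.uniform()
x_t, v_target = ot_pair(x0, x1, t)

# Sampling: Euler-integrate the velocity field from noise. With the oracle
# constant field of a straight path, any step count lands exactly on x1 --
# the intuition behind needing only 20-50 steps in practice.
x, steps = x0.copy(), 20
for _ in range(steps):
    x = x + v_target / steps
print(np.allclose(x, x1))  # True
```

At inference with guidance, the integrated field would be `v_uncond + s * (v_cond - v_uncond)` with s in the 5.0–10.0 range quoted above.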
---

## Task-Specific Decoder Heads

Each specialist model adds a decoder head on top of the shared triplane output.

---
### Voxel Atlas – World Generation Decoder

**Task:** Generate full 3D environments and worlds – terrain, buildings, vegetation, interior spaces.

**Output formats:**

- Voxel grids (`.vox`, MagicaVoxel format) – for Minecraft-style worlds
- OBJ scene (multiple meshes with materials) – for Unity/Unreal environments
- USD stage (`.usd`) – industry-standard scene format
**Decoder head:**

```
TriplaneAtlasDecoder
├── Scene Layout Transformer
│   ├── Divides space into semantic regions (terrain, structures, vegetation, sky)
│   └── 6-layer transformer over a 32×32 spatial grid of scene tokens
├── Region-wise NeRF decoder (per semantic region)
│   └── MLP: 3D coords + triplane features → density + RGB + semantic label
├── Marching Cubes extractor → raw mesh per region
├── Scene graph assembler → parent-child relationships between objects
├── Voxelizer (for .vox output) → discretizes to user-specified resolution
└── USD exporter → full scene hierarchy with lighting + materials
```
**Special modules:**

- **Infinite world tiling**: generates seamless adjacent chunks that stitch together
- **Biome-aware generation**: desert, forest, urban, underwater, space, fantasy
- **LOD generator**: auto-generates 4 levels of detail per scene object
- **Lighting estimator**: infers plausible sun/sky lighting from scene content

**Typical generation sizes:**

- Small scene: 64×64×64 voxels or ~500 m² OBJ scene → ~8 seconds on A100
- Large world chunk: 256×256×128 voxels → ~35 seconds on A100
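The spec does not detail how the LOD generator works; for voxel output, a minimal sketch is repeated 2×2×2 max-pooling of the occupancy grid, which keeps a coarse voxel occupied whenever any of its sub-voxels is. The function names are illustrative.

```python
import numpy as np

def downsample_occupancy(occ: np.ndarray) -> np.ndarray:
    """Halve each axis of a boolean occupancy grid with 2x2x2 max-pooling,
    so any occupied sub-voxel keeps the coarse voxel occupied."""
    d, h, w = occ.shape
    return occ.reshape(d // 2, 2, h // 2, 2, w // 2, 2).any(axis=(1, 3, 5))

def lod_chain(occ: np.ndarray, levels: int = 4) -> list[np.ndarray]:
    """LOD0 is the input; each further level halves the resolution."""
    chain = [occ]
    for _ in range(levels - 1):
        chain.append(downsample_occupancy(chain[-1]))
    return chain

grid = np.zeros((64, 64, 64), dtype=bool)
grid[10:20, 10:20, 10:20] = True          # a solid cube of occupied voxels
lods = lod_chain(grid)
print([g.shape[0] for g in lods])  # [64, 32, 16, 8]
```

Max-pooling (rather than averaging) is the conservative choice for collision and silhouette preservation at distance.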
---

### Voxel Forge – Mesh / Asset Generation Decoder

**Task:** Generate clean, game-ready 3D assets – characters, objects, props, vehicles, architecture.

**Output formats:**

- OBJ + MTL (universal)
- GLB/GLTF (web & real-time)
- FBX (game-engine standard)
- USDZ (Apple AR)
**Decoder head:**

```
TriplaneForgeDecoder
├── Occupancy Network decoder
│   └── MLP: 3D point + triplane → occupancy probability
├── Differentiable Marching Cubes → initial raw mesh
├── Mesh Refinement Network
│   ├── Graph neural network over mesh vertices/edges
│   ├── 8 message-passing rounds
│   └── Predicts vertex position offsets → clean topology
├── UV Unwrapper (learned, SeamlessUV lineage)
├── Texture Diffusion Head
│   ├── 2D flow matching in UV space
│   ├── Albedo + roughness + metallic + normal maps
│   └── 1024×1024 or 2048×2048 texture atlas
└── LOD Generator → 4 polycount levels (100% / 50% / 25% / 10%)
```
**Special modules:**

- **Topology optimizer**: enforces quad-dominant topology for animation rigs
- **Symmetry enforcer**: optional bilateral symmetry for characters/vehicles
- **Scale normalizer**: outputs at real-world scale (meters) with unit metadata
- **Material classifier**: auto-tags materials (metal, wood, fabric, glass, etc.)
- **Animation-ready flag**: detects and preserves edge loops needed for rigging
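The symmetry enforcer's algorithm is not specified; one simple post-hoc approach is to pair each vertex with its nearest mirrored counterpart and average the two, as sketched below. The brute-force pairing and the x = 0 mirror plane are assumptions for illustration; production systems match topology rather than nearest points.

```python
import numpy as np

def enforce_bilateral_symmetry(verts: np.ndarray) -> np.ndarray:
    """Symmetrize vertex positions about the x = 0 plane by averaging each
    vertex with the reflection of its mirror partner."""
    mirrored = verts.copy()
    mirrored[:, 0] *= -1.0
    # Pair each vertex with the nearest mirrored vertex (brute force O(N^2)).
    d2 = ((verts[:, None, :] - mirrored[None, :, :]) ** 2).sum(-1)
    partner = d2.argmin(axis=1)
    return 0.5 * (verts + mirrored[partner])

verts = np.array([[1.02, 0.0, 0.5],
                  [-0.98, 0.0, 0.5]])   # slightly asymmetric mirror pair
sym = enforce_bilateral_symmetry(verts)
print(sym[:, 0])  # [ 1. -1.] -- x-coordinates now mirror exactly
```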
**Polygon counts:**

- Low-poly asset: 500–5K triangles → ~6 seconds on A100
- Mid-poly asset: 5K–50K triangles → ~15 seconds on A100
- High-poly asset: 50K–500K triangles → ~45 seconds on A100
---

### Voxel Cast – 3D Printable Generation Decoder

**Task:** Generate physically valid, printable 3D models – watertight, manifold, and structurally sound.

**Output formats:**

- STL (universal printing format)
- OBJ (watertight)
- STEP (CAD-compatible, parametric)
- 3MF (modern printing format with material data)
**Decoder head:**

```
TriplaneCastDecoder
├── SDF (Signed Distance Field) decoder
│   └── MLP: 3D point + triplane → signed distance value
├── SDF → Watertight Mesh (dual marching cubes, no holes guaranteed)
├── Printability Validator
│   ├── Wall thickness checker (min 1.2 mm enforced)
│   ├── Overhang analyzer (>45° flagged + support detection)
│   ├── Manifold checker + auto-repair
│   └── Volume / surface area calculator
├── Support Structure Generator (optional)
│   └── Generates minimal support trees for FDM printing
├── STEP Converter (via Open CASCADE bindings)
└── Slicer Preview Renderer (preview only, not a full slicer)
```
**Special modules:**

- **Structural stress analyzer**: basic FEA simulation to detect weak points
- **Hollowing engine**: auto-hollows solid objects with configurable wall thickness + drain holes
- **Interlocking part splitter**: splits large objects into printable parts with snap-fit joints
- **Material suggester**: recommends PLA / PETG / resin based on geometry complexity
- **Scale validator**: ensures the object is printable at the specified scale on common bed sizes (Bambu, Prusa, Ender)

**Validation requirements (all Cast outputs must pass):**

- Zero non-manifold edges
- Zero self-intersections
- Minimum wall thickness ≥ 1.2 mm at the requested scale
- Watertight (no open boundaries)
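The watertight and manifold-edge requirements share a classic combinatorial test: in a closed, edge-manifold triangle mesh, every undirected edge is used by exactly two faces. A minimal checker:

```python
from collections import Counter

def is_watertight(faces: list[tuple[int, int, int]]) -> bool:
    """True iff every undirected edge is shared by exactly two faces.
    A count of 1 means an open boundary; 3+ means a non-manifold edge."""
    edges = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            edges[tuple(sorted((u, v)))] += 1
    return all(count == 2 for count in edges.values())

# A tetrahedron (4 triangles) is the smallest watertight mesh.
tetra = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]
print(is_watertight(tetra))      # True
print(is_watertight(tetra[:3]))  # False: three edges are open boundaries
```

Note this edge-count test covers the first and last requirements above; self-intersection and wall-thickness checks need geometric (not just combinatorial) analysis.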
---

### Voxel Lens – NeRF / Gaussian Splatting Decoder

**Task:** Generate photorealistic 3D scenes represented as Neural Radiance Fields or 3D Gaussian Splats – primarily for visualization, VR/AR, and cinematic rendering.

**Output formats:**

- `.ply` (3D Gaussian Splatting – compatible with standard 3DGS viewers)
- NeRF weights (Instant-NGP / Nerfstudio compatible)
- MP4 render (pre-rendered orbital video)
- Depth maps + normal maps (per view, for downstream use)
**Decoder head:**

```
TriplaneLensDecoder
├── Gaussian Parameter Decoder
│   ├── Samples 3D Gaussian centers from triplane density
│   ├── Per-Gaussian: position (3), rotation (4, quaternion), scale (3),
│   │   opacity (1), spherical harmonics coefficients (48) → color
│   └── Targets: 500K–3M Gaussians per scene
├── Gaussian Densification Module
│   ├── Adaptive densification: split/clone in high-gradient regions
│   └── Pruning: remove low-opacity Gaussians
├── NeRF branch (parallel)
│   ├── Hash-grid encoder (Instant-NGP style)
│   └── Tiny MLP: encoded position → density + color
├── Rasterizer (differentiable 3DGS rasterizer)
│   └── Used during training for photometric loss
└── Novel View Synthesizer
    └── Renders arbitrary camera trajectories for MP4 export
```
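The per-Gaussian layout above totals 3 + 4 + 3 + 1 + 48 = 59 floats (the 48 color coefficients being 16 degree-3 spherical-harmonic basis functions × 3 channels, as in standard 3DGS). A packing/unpacking sketch, with the pruning step from the densification module; the field names and the 0.05 opacity threshold are illustrative.

```python
import numpy as np

# Per-Gaussian layout from the spec: position(3) + rotation quaternion(4)
# + scale(3) + opacity(1) + SH color coefficients(48) = 59 floats.
FIELDS = {"position": 3, "rotation": 4, "scale": 3, "opacity": 1, "sh": 48}
STRIDE = sum(FIELDS.values())  # 59

def unpack(flat: np.ndarray) -> dict[str, np.ndarray]:
    """Split an (N, 59) array into named per-Gaussian parameter blocks."""
    out, offset = {}, 0
    for name, width in FIELDS.items():
        out[name] = flat[:, offset:offset + width]
        offset += width
    return out

def prune_low_opacity(flat: np.ndarray, threshold: float = 0.05) -> np.ndarray:
    """Pruning step: drop near-transparent Gaussians
    (sigmoid-activated opacity below threshold)."""
    opacity = 1.0 / (1.0 + np.exp(-unpack(flat)["opacity"][:, 0]))
    return flat[opacity >= threshold]

rng = np.random.default_rng(0)
splats = rng.standard_normal((1000, STRIDE))
kept = prune_low_opacity(splats)
print(STRIDE, kept.shape)
```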
**Special modules:**

- **Lighting decomposition**: separates the scene into albedo + illumination components
- **Dynamic scene support**: temporal Gaussian sequences for animated scenes (from video input)
- **Background/foreground separator**: isolates the subject from the environment
- **Camera trajectory planner**: auto-generates cinematic orbital / fly-through paths
- **Compression module**: reduces 3DGS file size by 60–80% with minimal quality loss

**Generation modes:**

- Object-centric: single object, orbital views → ~12 seconds on A100
- Indoor scene: full room with lighting → ~40 seconds on A100
- Outdoor scene: landscape or street → ~90 seconds on A100
---

### Voxel Prime – Closed Source All-in-One

**Access:** API only. Not open source; weights are never distributed.

Voxel Prime contains all four decoder heads simultaneously, plus:

**Additional Prime-only modules:**

- **Cross-task consistency**: ensures the Atlas world, Forge assets, and Lens scene all match when generated together
- **Scene population engine**: generates a world (Atlas), then auto-populates it with assets (Forge)
- **Pipeline orchestrator**: chains Atlas → Forge → Cast → Lens in one API call
- **Photorealistic texture upscaler**: 4× super-resolution on all generated textures
- **Style transfer module**: applies an artistic style (e.g. "Studio Ghibli", "cyberpunk", "brutalist architecture") across all output types
- **Iterative refinement**: text-guided editing of already-generated 3D content
**API endpoint:**

```
POST /v1/voxel/generate
{
  "prompt": "A medieval castle on a cliff at sunset",
  "output_types": ["world", "mesh", "nerf"],  # any combination
  "inputs": {
    "image": "base64...",        # optional reference image
    "multiview": ["base64..."],  # optional multi-view images
    "video": "base64...",        # optional video
    "model": "base64..."         # optional existing 3D model
  },
  "settings": {
    "quality": "high",           # draft | standard | high
    "style": "realistic",        # realistic | stylized | low-poly | ...
    "scale_meters": 100.0,       # real-world scale
    "symmetry": false,
    "printable": false
  }
}
```
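A client-side helper for building and validating this request body might look like the following. Everything here is an assumption layered on the endpoint sketch: the `printable` output identifier, the helper name, and the validation rules are illustrative, not a published client API.

```python
import json

VALID_OUTPUTS = {"world", "mesh", "printable", "nerf"}   # assumed identifiers
VALID_QUALITY = {"draft", "standard", "high"}

def build_generate_payload(prompt: str, output_types: list[str],
                           quality: str = "standard",
                           scale_meters: float = 1.0,
                           **inputs) -> str:
    """Validate and serialize a /v1/voxel/generate request body.

    `inputs` may carry the optional base64 conditioning fields from the
    endpoint sketch above: image, multiview, video, model."""
    if not set(output_types) <= VALID_OUTPUTS:
        raise ValueError(f"unknown output types: {set(output_types) - VALID_OUTPUTS}")
    if quality not in VALID_QUALITY:
        raise ValueError(f"quality must be one of {sorted(VALID_QUALITY)}")
    body = {
        "prompt": prompt,
        "output_types": output_types,
        "inputs": inputs,
        "settings": {"quality": quality, "scale_meters": scale_meters},
    }
    return json.dumps(body)

payload = build_generate_payload(
    "A medieval castle on a cliff at sunset",
    output_types=["world", "nerf"], quality="high", scale_meters=100.0)
print(json.loads(payload)["settings"]["quality"])  # high
```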
---

## Shared Custom Modules (All Models)

| # | Module | Description |
|---|---|---|
| 1 | **Multi-Modal Conditioning Fusion** | CrossModalAttention over all active input types |
| 2 | **3D RoPE Encoder** | RoPE adapted for triplane 3D spatial positions |
| 3 | **Geometry Quality Scorer** | Rates generated geometry quality in [0, 1] before output |
| 4 | **Semantic Label Head** | Per-voxel/vertex semantic class (wall, floor, tree, etc.) |
| 5 | **Scale & Unit Manager** | Enforces consistent real-world scale across all outputs |
| 6 | **Material Property Head** | Predicts PBR material properties (roughness, metallic, IOR) |
| 7 | **Confidence & Uncertainty Head** | Per-region generation confidence → flags uncertain areas |
| 8 | **Prompt Adherence Scorer** | CLIP-based score of how well the output matches the text prompt |
| 9 | **Multi-Resolution Decoder** | Generates at 64³ → 128³ → 256³, coarse-to-fine |
| 10 | **Style Embedding Module** | Encodes style reference images into a style conditioning vector |
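The 64³ → 128³ → 256³ coarse-to-fine scheme (module 9) can be sketched as upsample-then-refine: each stage seeds the next resolution with a nearest-neighbor upsample and lets a decoder pass correct the details. The `refine` callable below is a stand-in for that decoder pass, not part of the spec.

```python
import numpy as np

def upsample_nearest(grid: np.ndarray, factor: int = 2) -> np.ndarray:
    """Nearest-neighbor upsampling of a (D, H, W) grid."""
    return grid.repeat(factor, 0).repeat(factor, 1).repeat(factor, 2)

def coarse_to_fine(refine, base: np.ndarray, stages: int = 2) -> np.ndarray:
    """64^3 -> 128^3 -> 256^3 schedule: upsample, then let `refine`
    (a stand-in for a decoder pass) correct details at each scale."""
    grid = base
    for _ in range(stages):
        grid = refine(upsample_nearest(grid))
    return grid

base = np.zeros((64, 64, 64))
result = coarse_to_fine(lambda g: g, base)   # identity "refiner" for the demo
print(result.shape)  # (256, 256, 256)
```

The payoff is that most flow-matching steps run at the cheap 64³ resolution, with only a few refinement passes at 256³.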
---

## Training Data Plan

| Dataset | Content | Used by |
|---|---|---|
| ShapeNet (55K models) | Common 3D objects | Forge, Cast |
| Objaverse (800K+ models) | Diverse 3D assets | Forge, Cast, Lens |
| Objaverse-XL (10M+ objects) | Massive scale | All |
| ScanNet / ScanNet++ | Indoor 3D scans | Atlas, Lens |
| KITTI / nuScenes | Outdoor 3D scenes | Atlas, Lens |
| ABO (Amazon Berkeley Objects) | Product meshes + materials | Forge |
| Thingiverse (printable models) | 3D printable STLs | Cast |
| Polycam scans | Real-world 3DGS/NeRF | Lens |
| Synthetic renders (generated) | Multi-view rendered images | All |
| Text-3D pairs (synthetic) | GPT-4o-generated descriptions of Objaverse | All |
---

## Parameter Estimates

| Model | Backbone | Decoder Head | Total | VRAM (BF16) |
|---|---|---|---|---|
| Voxel Atlas | 2.3B | ~400M | ~2.7B | ~22GB |
| Voxel Forge | 2.3B | ~350M | ~2.65B | ~21GB |
| Voxel Cast | 2.3B | ~200M | ~2.5B | ~20GB |
| Voxel Lens | 2.3B | ~500M | ~2.8B | ~22GB |
| Voxel Prime | 2.3B | ~1.4B (all 4) | ~3.7B | ~30GB |

All models fit on an A100 40GB in BF16; INT8 quantization brings each under 15GB, making a consumer RTX 4090 viable.
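A quick back-of-envelope check on these figures: weights alone account for only part of the listed VRAM, with the remainder presumably going to activations, triplane latents, and decoder buffers. A weights-only lower bound (helper name is illustrative):

```python
def weight_bytes_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weights-only memory in GB (1 GB = 2**30 bytes here)."""
    return params_billion * 1e9 * bytes_per_param / 2**30

# Voxel Atlas, ~2.7B params:
bf16 = weight_bytes_gb(2.7, 2)   # BF16: 2 bytes/param -> ~5.0 GB
int8 = weight_bytes_gb(2.7, 1)   # INT8: 1 byte/param  -> ~2.5 GB
print(round(bf16, 1), round(int8, 1))
```

So the ~22GB BF16 figure implies roughly 17GB of non-weight inference state; the sub-15GB INT8 claim is plausible only if that state shrinks too (e.g. lower-precision activations or smaller batch/resolution).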
---

## Training Strategy

### Phase 1 – Backbone Pre-training

- Train the shared backbone on Objaverse-XL triplane reconstructions
- Learn general 3D structure without task-specific heads
- Conditioning: text + single image only
- 100K steps, A100 cluster

### Phase 2 – Decoder Head Training (parallel)

- Freeze the backbone, train each decoder head independently
- Atlas: ScanNet + synthetic world data
- Forge: ShapeNet + Objaverse + texture data
- Cast: Thingiverse + watertight synthetic meshes
- Lens: Polycam + synthetic multi-view renders
- 50K steps each

### Phase 3 – Joint Fine-tuning

- Unfreeze the backbone, fine-tune end-to-end per specialist model
- Add all input modalities (video, multi-view, point cloud)
- 30K steps each

### Phase 4 – Prime Training

- Initialize from the jointly fine-tuned backbone
- Train all decoder heads simultaneously
- Cross-task consistency losses
- Prime-only module training (pipeline orchestrator, style transfer)
- 50K steps
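The four phases above can be captured as a small schedule structure, which is a convenient shape for driving a training loop (the dataclass and field names are illustrative; phases 2 and 3 run once per specialist model, so the totals below count a single head's pass):

```python
from dataclasses import dataclass

@dataclass
class Phase:
    name: str
    steps: int
    backbone_frozen: bool
    modalities: tuple[str, ...]

# Transcribed from the training strategy above.
SCHEDULE = [
    Phase("backbone_pretrain", 100_000, False, ("text", "image")),
    Phase("decoder_heads", 50_000, True, ("text", "image")),
    Phase("joint_finetune", 30_000, False,
          ("text", "image", "multiview", "video", "pointcloud")),
    Phase("prime", 50_000, False,
          ("text", "image", "multiview", "video", "pointcloud")),
]

total = sum(p.steps for p in SCHEDULE)
print(total)  # 230000
```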
---

## HuggingFace Plan

```
Matrix-Corp/Voxel-Atlas-V1  → open source
Matrix-Corp/Voxel-Forge-V1  → open source
Matrix-Corp/Voxel-Cast-V1   → open source
Matrix-Corp/Voxel-Lens-V1   → open source
Matrix-Corp/Voxel-Prime-V1  → closed source, API only (card only, no weights)
```

Collection: `Matrix-Corp/voxel-v1`
---

## Status

- 🔴 Planned – architecture specification complete
- Backbone design finalized
- Decoder head designs finalized
- Training data sourcing: TBD
- Compute requirements: significant (A100 cluster for training)
- Timeline: TBD