---
title: Matrix Voxel
emoji: 🌏
colorFrom: green
colorTo: pink
sdk: static
pinned: false
license: cc-by-nc-nd-4.0
short_description: The next gen 3D generator
---

# Matrix Voxel — Full Architecture & Planning Document

3D Generation Model Family | Matrix.Corp


## Family Overview

Matrix Voxel is Matrix.Corp's 3D generation family: five models that share a common flow-matching backbone, each with task-specific decoder heads. Four specialist models are open source; the unified all-in-one model (Voxel Prime) is closed source and API-only.

| Model | Task | Output Formats | Source | Hardware | Status |
|---|---|---|---|---|---|
| Voxel Atlas | World / environment generation | Voxel grids, OBJ scenes, USD stages | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Forge | 3D mesh / asset generation | OBJ, GLB, FBX, USDZ | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Cast | 3D printable model generation | STL, OBJ (watertight), STEP | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Lens | NeRF / Gaussian Splatting scenes | .ply (3DGS), NeRF weights, MP4 render | 🟢 Open Source | A100 40GB | 🔴 Planned |
| Voxel Prime | All-in-one unified generation | All of the above | 🟣 Closed Source | API Only | 🔴 Planned |

## Input Modalities (All Models)

Every Voxel model accepts any combination of:

| Input | Description | Encoder |
|---|---|---|
| Text prompt | Natural language description of desired 3D output | CLIP-ViT-L / T5-XXL |
| Single image | Reference image → 3D lift | DINOv2 + custom depth encoder |
| Multi-view images | 2–12 images from different angles | Multi-view transformer encoder |
| Video | Extracts frames, infers 3D from motion | Temporal encoder (Video-MAE lineage) |
| 3D model | Existing mesh/point cloud as conditioning | PointNet++ encoder |

All inputs are projected into a shared 1024-dim conditioning embedding space before entering the backbone.
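As a concrete illustration, the shared conditioning space could be implemented with per-modality linear projections; a minimal PyTorch sketch (class names and encoder output dims are assumptions, not the real codebase):

```python
import torch
import torch.nn as nn

COND_DIM = 1024  # shared conditioning width from the spec above

class ModalityProjector(nn.Module):
    """Maps each encoder's token features into the shared 1024-dim space."""
    def __init__(self, encoder_dims: dict[str, int]):
        super().__init__()
        self.proj = nn.ModuleDict({
            name: nn.Linear(dim, COND_DIM) for name, dim in encoder_dims.items()
        })
        self.norm = nn.LayerNorm(COND_DIM)

    def forward(self, features: dict[str, torch.Tensor]) -> torch.Tensor:
        # Project each active modality, then concatenate along the token axis
        # so the cross-modal fusion block can attend over all of them.
        tokens = [self.norm(self.proj[name](feat)) for name, feat in features.items()]
        return torch.cat(tokens, dim=1)  # (batch, total_tokens, 1024)

projector = ModalityProjector({"text": 4096, "image": 1024})
cond = projector({
    "text": torch.randn(2, 77, 4096),    # e.g. T5-XXL token features
    "image": torch.randn(2, 256, 1024),  # e.g. DINOv2-L patch features
})
print(cond.shape)  # torch.Size([2, 333, 1024])
```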


## Core Architecture — Shared Flow Matching Backbone

### Why Flow Matching?

Flow matching (Lipman et al. 2022, extended by the Stable Diffusion 3 / FLUX lineage) learns a direct vector field from noise → data. Compared with DDPM diffusion it needs far fewer inference steps (typically 20–50 vs ~1000), trains more stably, and gives better mode coverage. It is the state of the art for large generative models as of 2025–2026.

### 3D Representation — Triplane + Latent Voxel Grid

All Voxel models operate in a shared latent 3D space:

- Triplane representation: three axis-aligned feature planes (XY, XZ, YZ), each 256×256 with 32 channels
- Any 3D point is queried by projecting it onto all 3 planes and summing the sampled features
- Compact (3 × 256 × 256 × 32 ≈ 6M latent values) yet expressive
- Flow matching operates on this triplane latent space, not on raw 3D points
- Decoder heads decode the triplane into task-specific output formats
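The triplane lookup described above can be sketched in a few lines of PyTorch (the function name and the [-1, 1] coordinate convention are assumptions):

```python
import torch
import torch.nn.functional as F

def query_triplane(planes: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
    """planes: (3, 32, 256, 256); points: (N, 3) in [-1, 1]. Returns (N, 32)."""
    # Which two coordinates each plane keeps: XY, XZ, YZ.
    projections = [(0, 1), (0, 2), (1, 2)]
    feats = 0.0
    for plane, (a, b) in zip(planes, projections):
        grid = points[:, [a, b]].view(1, -1, 1, 2)         # (1, N, 1, 2)
        sampled = F.grid_sample(plane.unsqueeze(0), grid,   # bilinear lookup
                                align_corners=True)         # (1, 32, N, 1)
        feats = feats + sampled[0, :, :, 0].T               # sum over planes
    return feats                                            # (N, 32)

planes = torch.randn(3, 32, 256, 256)
pts = torch.rand(1000, 3) * 2 - 1
out = query_triplane(planes, pts)
print(out.shape)  # torch.Size([1000, 32])
```

The summed 32-dim feature vector is what the task-specific decoder MLPs consume.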

### Backbone Architecture

```
VoxelBackbone
├── Input Encoder (multimodal conditioning)
│   ├── TextEncoder         — T5-XXL + CLIP-ViT-L, projected to 1024-dim
│   ├── ImageEncoder        — DINOv2-L, projected to 1024-dim
│   ├── MultiViewEncoder    — custom transformer over N views
│   ├── VideoEncoder        — Video-MAE, temporal pooling → 1024-dim
│   └── PointCloudEncoder   — PointNet++, global + local features → 1024-dim
│
├── Conditioning Fusion
│   └── CrossModalAttention — fuses all active input modalities
│
├── Flow Matching Transformer (DiT-style)
│   ├── 24 transformer blocks
│   ├── Hidden dim: 1536
│   ├── Heads: 24
│   ├── Conditioning: AdaLN-Zero (timestep + conditioning signal)
│   ├── 3D RoPE positional encoding for triplane tokens
│   └── ~2.3B parameters
│
└── Triplane Decoder (shared across all specialist models)
    └── Outputs: triplane feature tensor (3 × 256 × 256 × 32)
```

### Flow Matching Training

- Learn a vector field v_θ(x_t, t, c), where x_t is the noisy triplane and c is the conditioning
- Optimal transport flow: straight paths from noise → data (better than DDPM's curved paths)
- Inference: 20–50 NFE (neural function evaluations) — fast on A100
- Classifier-free guidance: 10% unconditional dropout during training
- Guidance scale 5.0–10.0 at inference
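The objective above can be sketched directly: along the straight optimal-transport path x_t = (1 - t) * noise + t * data, the target velocity is constant, data - noise, and v_θ regresses it. A minimal sketch (`model` is a stand-in for the 2.3B DiT backbone; per-batch conditioning dropout is a simplification):

```python
import torch

def flow_matching_loss(model, data, cond, cfg_dropout=0.10):
    noise = torch.randn_like(data)
    # One random time per sample, shaped to broadcast over the latent dims.
    t = torch.rand(data.shape[0], *([1] * (data.dim() - 1)))
    x_t = (1 - t) * noise + t * data   # straight OT path
    target_v = data - noise            # constant velocity along that path
    # Classifier-free guidance: zero the conditioning ~10% of the time
    # (a simple per-batch stand-in for per-sample dropout).
    if torch.rand(()).item() < cfg_dropout:
        cond = torch.zeros_like(cond)
    pred_v = model(x_t, t.flatten(), cond)
    return torch.mean((pred_v - target_v) ** 2)

dummy = lambda x, t, c: x  # placeholder for the real backbone
loss = flow_matching_loss(dummy, torch.randn(4, 3, 32, 32), torch.randn(4, 1024))
print(loss.item())
```

At inference, integrating the learned field with 20–50 Euler steps replaces the ~1000-step DDPM sampler.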

## Task-Specific Decoder Heads

Each specialist model adds a decoder head on top of the shared triplane output.


## Voxel Atlas — World Generation Decoder

Task: Generate full 3D environments and worlds — terrain, buildings, vegetation, interior spaces.

Output formats:

- Voxel grids (.vox, MagicaVoxel format) — for Minecraft-style worlds
- OBJ scene (multiple meshes with materials) — for Unity/Unreal environments
- USD stage (.usd) — industry-standard scene format

Decoder head:

```
TriplaneAtlasDecoder
├── Scene Layout Transformer
│   ├── Divides space into semantic regions (terrain, structures, vegetation, sky)
│   └── 6-layer transformer over a 32×32 spatial grid of scene tokens
├── Region-wise NeRF decoder (per semantic region)
│   └── MLP: 3D coords + triplane features → density + RGB + semantic label
├── Marching Cubes extractor → raw mesh per region
├── Scene graph assembler → parent-child relationships between objects
├── Voxelizer (for .vox output) → discretizes to user-specified resolution
└── USD exporter → full scene hierarchy with lighting + materials
```

Special modules:

- Infinite world tiling: generate seamless adjacent chunks that stitch together
- Biome-aware generation: desert, forest, urban, underwater, space, fantasy
- LOD generator: auto-generates 4 levels of detail per scene object
- Lighting estimator: infers plausible sun/sky lighting from scene content

Typical generation sizes:

- Small scene: 64×64×64 voxels or ~500 m² OBJ scene — ~8 seconds on A100
- Large world chunk: 256×256×128 voxels — ~35 seconds on A100
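One way the infinite-world-tiling module could stitch adjacent chunks is to generate them with an overlapping margin and cross-fade the density fields in the overlap; a hypothetical NumPy sketch (the blending scheme is an assumption, not the documented method):

```python
import numpy as np

def stitch_chunks(left: np.ndarray, right: np.ndarray, overlap: int) -> np.ndarray:
    """left, right: (X, Y, Z) density grids sharing `overlap` voxels along X."""
    # Linear cross-fade weights over the overlapping slab.
    w = np.linspace(0.0, 1.0, overlap).reshape(-1, 1, 1)
    blended = (1 - w) * left[-overlap:] + w * right[:overlap]
    return np.concatenate([left[:-overlap], blended, right[overlap:]], axis=0)

a = np.ones((64, 64, 64))   # stand-in for one generated chunk
b = np.zeros((64, 64, 64))  # stand-in for its neighbor
world = stitch_chunks(a, b, overlap=8)
print(world.shape)  # (120, 64, 64)
```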

## Voxel Forge — Mesh / Asset Generation Decoder

Task: Generate clean, game-ready 3D assets — characters, objects, props, vehicles, architecture.

Output formats:

- OBJ + MTL (universal)
- GLB/GLTF (web & real-time)
- FBX (game engine standard)
- USDZ (Apple AR)

Decoder head:

```
TriplaneForgeDecoder
├── Occupancy Network decoder
│   └── MLP: 3D point + triplane → occupancy probability
├── Differentiable Marching Cubes → initial raw mesh
├── Mesh Refinement Network
│   ├── Graph neural network over mesh vertices/edges
│   ├── 8 message-passing rounds
│   └── Predicts vertex position offsets → clean topology
├── UV Unwrapper (learned, SeamlessUV lineage)
├── Texture Diffusion Head
│   ├── 2D flow matching in UV space
│   ├── Albedo + roughness + metallic + normal maps
│   └── 1024×1024 or 2048×2048 texture atlas
└── LOD Generator → 4 polycount levels (100% / 50% / 25% / 10%)
```
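The LOD schedule at the bottom of the tree is easy to make concrete; a small sketch of the target triangle count per level (function name is illustrative):

```python
LOD_RATIOS = (1.00, 0.50, 0.25, 0.10)  # the 100% / 50% / 25% / 10% levels

def lod_targets(base_tris: int) -> list[int]:
    """Target triangle budget for each LOD level of a base mesh."""
    return [max(1, round(base_tris * r)) for r in LOD_RATIOS]

print(lod_targets(50_000))  # [50000, 25000, 12500, 5000]
```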

Special modules:

- Topology optimizer: enforces quad-dominant topology for animation rigs
- Symmetry enforcer: optional bilateral symmetry for characters/vehicles
- Scale normalizer: outputs at real-world scale (meters) with unit metadata
- Material classifier: auto-tags materials (metal, wood, fabric, glass, etc.)
- Animation-ready flag: detects and preserves edge loops needed for rigging

Polygon counts:

- Low-poly asset: 500–5K triangles — ~6 seconds on A100
- Mid-poly asset: 5K–50K triangles — ~15 seconds on A100
- High-poly asset: 50K–500K triangles — ~45 seconds on A100

## Voxel Cast — 3D Printable Generation Decoder

Task: Generate physically valid, printable 3D models — watertight, manifold, and structurally sound.

Output formats:

- STL (universal printing format)
- OBJ (watertight)
- STEP (CAD-compatible, parametric)
- 3MF (modern printing format with material data)

Decoder head:

```
TriplaneCastDecoder
├── SDF (Signed Distance Field) decoder
│   └── MLP: 3D point + triplane → signed distance value
├── SDF → Watertight Mesh (dual marching cubes, no holes guaranteed)
├── Printability Validator
│   ├── Wall thickness checker (min 1.2mm enforced)
│   ├── Overhang analyzer (>45° flagged + support detection)
│   ├── Manifold checker + auto-repair
│   └── Volume/surface area calculator
├── Support Structure Generator (optional)
│   └── Generates minimal support trees for FDM printing
├── STEP Converter (via Open CASCADE bindings)
└── Slicer Preview Renderer (preview only, not a full slicer)
```

Special modules:

- Structural stress analyzer: basic FEA simulation to detect weak points
- Hollowing engine: auto-hollows solid objects with configurable wall thickness + drain holes
- Interlocking part splitter: splits large objects into printable parts with snap-fit joints
- Material suggester: recommends PLA / PETG / resin based on geometry complexity
- Scale validator: ensures the object is printable at the specified scale on common bed sizes (Bambu, Prusa, Ender)

Validation requirements (all Cast outputs must pass):

- Zero non-manifold edges
- Zero self-intersections
- Minimum wall thickness ≥ 1.2mm at the requested scale
- Watertight (no open boundaries)
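The watertight requirement can be checked cheaply: in a closed, manifold triangle mesh every undirected edge is shared by exactly two faces. A minimal sketch (real Cast validation would also need self-intersection and wall-thickness checks):

```python
from collections import Counter

def is_watertight(faces: list[tuple[int, int, int]]) -> bool:
    """faces: triangles as (i, j, k) vertex-index triples."""
    edges = Counter()
    for i, j, k in faces:
        for a, b in ((i, j), (j, k), (k, i)):
            edges[frozenset((a, b))] += 1  # undirected edge
    # Closed manifold surface: every edge borders exactly two faces.
    return all(count == 2 for count in edges.values())

# A tetrahedron is closed; dropping one face opens a boundary.
tetra = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]
print(is_watertight(tetra))       # True
print(is_watertight(tetra[:-1]))  # False
```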

## Voxel Lens — NeRF / Gaussian Splatting Decoder

Task: Generate photorealistic 3D scenes represented as Neural Radiance Fields or 3D Gaussian Splats — primarily for visualization, VR/AR, and cinematic rendering.

Output formats:

- .ply (3D Gaussian Splatting — compatible with standard 3DGS viewers)
- NeRF weights (Instant-NGP / Nerfstudio compatible)
- MP4 render (pre-rendered orbital video)
- Depth maps + normal maps (per view, for downstream use)

Decoder head:

```
TriplaneLensDecoder
├── Gaussian Parameter Decoder
│   ├── Samples 3D Gaussian centers from triplane density
│   ├── Per-Gaussian: position (3), rotation (4, quaternion), scale (3),
│   │   opacity (1), spherical harmonics coefficients (48) → color
│   └── Targets: 500K–3M Gaussians per scene
├── Gaussian Densification Module
│   ├── Adaptive densification: split/clone in high-gradient regions
│   └── Pruning: remove low-opacity Gaussians
├── NeRF branch (parallel)
│   ├── Hash-grid encoder (Instant-NGP style)
│   └── Tiny MLP: encoded position → density + color
├── Rasterizer (differentiable 3DGS rasterizer)
│   └── Used during training for photometric loss
└── Novel View Synthesizer
    └── Renders arbitrary camera trajectories for MP4 export
```

Special modules:

- Lighting decomposition: separates the scene into albedo + illumination components
- Dynamic scene support: temporal Gaussian sequences for animated scenes (from video input)
- Background/foreground separator: isolates the subject from the environment
- Camera trajectory planner: auto-generates cinematic orbital/fly-through paths
- Compression module: reduces 3DGS file size by 60–80% with minimal quality loss

Generation modes:

- Object-centric: single object, orbital views — ~12 seconds on A100
- Indoor scene: full room with lighting — ~40 seconds on A100
- Outdoor scene: landscape or street — ~90 seconds on A100

## Voxel Prime — Closed Source All-in-One

Access: API only. Not open source. Weights never distributed.

Voxel Prime contains all four decoder heads simultaneously, plus:

Additional Prime-only modules:

- Cross-task consistency: ensures the Atlas world, Forge assets, and Lens scene all match when generated together
- Scene population engine: generates a world (Atlas), then auto-populates it with assets (Forge)
- Pipeline orchestrator: chains Atlas → Forge → Cast → Lens in one API call
- Photorealistic texture upscaler: 4× super-resolution on all generated textures
- Style transfer module: applies an artistic style (e.g. "Studio Ghibli", "cyberpunk", "brutalist architecture") across all output types
- Iterative refinement: text-guided editing of already-generated 3D content

API endpoint:

```
POST /v1/voxel/generate
{
  "prompt": "A medieval castle on a cliff at sunset",
  "output_types": ["world", "mesh", "nerf"],  # any combination
  "inputs": {
    "image": "base64...",       # optional reference image
    "multiview": ["base64..."], # optional multi-view images
    "video": "base64...",       # optional video
    "model": "base64..."        # optional existing 3D model
  },
  "settings": {
    "quality": "high",          # draft | standard | high
    "style": "realistic",       # realistic | stylized | low-poly | ...
    "scale_meters": 100.0,      # real-world scale
    "symmetry": false,
    "printable": false
  }
}
```
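A client call might be built like the following sketch; only the `POST /v1/voxel/generate` path comes from this document, while the base URL, bearer-token auth, and helper name are assumptions:

```python
import json
import urllib.request

def build_generate_request(prompt: str, output_types: list[str],
                           base_url: str, api_key: str,
                           **settings) -> urllib.request.Request:
    """Construct (but do not send) a request to the Voxel Prime endpoint."""
    payload = {
        "prompt": prompt,
        "output_types": output_types,
        "settings": settings or {"quality": "standard"},
    }
    return urllib.request.Request(
        base_url + "/v1/voxel/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )

req = build_generate_request(
    "A medieval castle on a cliff at sunset",
    ["world", "mesh", "nerf"],
    "https://api.example.com", "YOUR_API_KEY",
    quality="high", scale_meters=100.0,
)
print(req.get_method(), req.full_url)
```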

## Shared Custom Modules (All Models)

| # | Module | Description |
|---|---|---|
| 1 | Multi-Modal Conditioning Fusion | CrossModalAttention over all active input types |
| 2 | 3D RoPE Encoder | RoPE adapted for triplane 3D spatial positions |
| 3 | Geometry Quality Scorer | Rates generated geometry quality [0–1] before output |
| 4 | Semantic Label Head | Per-voxel/vertex semantic class (wall, floor, tree, etc.) |
| 5 | Scale & Unit Manager | Enforces consistent real-world scale across all outputs |
| 6 | Material Property Head | Predicts PBR material properties (roughness, metallic, IOR) |
| 7 | Confidence & Uncertainty Head | Per-region generation confidence — flags uncertain areas |
| 8 | Prompt Adherence Scorer | CLIP-based score: how well the output matches the text prompt |
| 9 | Multi-Resolution Decoder | Generates at 64³ → 128³ → 256³, coarse-to-fine |
| 10 | Style Embedding Module | Encodes style reference images into a style conditioning vector |

## Training Data Plan

| Dataset | Content | Used by |
|---|---|---|
| ShapeNet (55K models) | Common 3D objects | Forge, Cast |
| Objaverse (800K+ models) | Diverse 3D assets | Forge, Cast, Lens |
| Objaverse-XL (10M+ objects) | Massive scale | All |
| ScanNet / ScanNet++ | Indoor 3D scans | Atlas, Lens |
| KITTI / nuScenes | Outdoor 3D scenes | Atlas, Lens |
| ABO (Amazon Berkeley Objects) | Product meshes + materials | Forge |
| Thingiverse (printable models) | 3D printable STLs | Cast |
| Polycam scans | Real-world 3DGS/NeRF | Lens |
| Synthetic renders (generated) | Multi-view rendered images | All |
| Text-3D pairs (synthetic) | GPT-4o-generated descriptions of Objaverse | All |

## Parameter Estimates

| Model | Backbone | Decoder Head | Total | VRAM (BF16) |
|---|---|---|---|---|
| Voxel Atlas | 2.3B | ~400M | ~2.7B | ~22GB |
| Voxel Forge | 2.3B | ~350M | ~2.65B | ~21GB |
| Voxel Cast | 2.3B | ~200M | ~2.5B | ~20GB |
| Voxel Lens | 2.3B | ~500M | ~2.8B | ~22GB |
| Voxel Prime | 2.3B | ~1.4B (all 4) | ~3.7B | ~30GB |

All models fit on an A100 40GB in BF16. INT8 quantization brings all of them under 15GB (viable on a consumer 4090).
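For context, the weights themselves are only part of those VRAM figures (the rest is activations, attention buffers, and decoder working memory); parameter memory alone is easy to check:

```python
def param_gb(params_billion: float, bytes_per_param: float) -> float:
    """Memory occupied by the weights alone, in gigabytes."""
    return params_billion * 1e9 * bytes_per_param / 1e9

# BF16 = 2 bytes/param, INT8 = 1 byte/param.
for name, p in [("Voxel Atlas", 2.7), ("Voxel Prime", 3.7)]:
    print(f"{name}: {param_gb(p, 2):.1f} GB bf16, {param_gb(p, 1):.1f} GB int8")
```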


## Training Strategy

### Phase 1 — Backbone Pre-training

- Train the shared backbone on Objaverse-XL triplane reconstructions
- Learn general 3D structure without task-specific heads
- Conditioning: text + single image only
- 100K steps on an A100 cluster

### Phase 2 — Decoder Head Training (parallel)

- Freeze the backbone; train each decoder head independently
- Atlas: ScanNet + synthetic world data
- Forge: ShapeNet + Objaverse + texture data
- Cast: Thingiverse + watertight synthetic meshes
- Lens: Polycam + synthetic multi-view renders
- 50K steps each
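The freeze-backbone/train-head recipe in Phase 2 can be sketched in a few lines of PyTorch (module names here are illustrative, not the real codebase):

```python
import torch.nn as nn

def freeze_backbone(model: nn.Module, head_prefix: str = "decoder_head"):
    """Leave only the decoder head trainable; return its parameters."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(head_prefix)
    return [p for p in model.parameters() if p.requires_grad]

# Toy stand-in: a "backbone" and a "decoder_head" submodule.
model = nn.ModuleDict({
    "backbone": nn.Linear(1536, 1536),
    "decoder_head": nn.Linear(1536, 32),
})
trainable = freeze_backbone(model)
# The optimizer is then built over `trainable` only.
print(sum(p.numel() for p in trainable))  # head params: 1536*32 + 32 = 49184
```

Phase 3 then simply flips `requires_grad` back on for the backbone before end-to-end fine-tuning.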

### Phase 3 — Joint Fine-tuning

- Unfreeze the backbone; fine-tune end-to-end per specialist model
- Add all input modalities (video, multi-view, point cloud)
- 30K steps each

### Phase 4 — Prime Training

- Initialize from the jointly fine-tuned backbone
- Train all decoder heads simultaneously
- Cross-task consistency losses
- Prime-only module training (pipeline orchestrator, style transfer)
- 50K steps

## HuggingFace Plan

```
Matrix-Corp/Voxel-Atlas-V1    — open source
Matrix-Corp/Voxel-Forge-V1    — open source
Matrix-Corp/Voxel-Cast-V1     — open source
Matrix-Corp/Voxel-Lens-V1     — open source
Matrix-Corp/Voxel-Prime-V1    — closed source, API only (card only, no weights)
```

Collection: Matrix-Corp/voxel-v1


## Status

- 🔴 Planned — architecture specification complete
- Backbone design finalized
- Decoder head designs finalized
- Training data sourcing: TBD
- Compute requirements: significant (A100 cluster for training)
- Timeline: TBD