Spaces: Matrix-Corp/Matrix-Voxel


Zandy-Wandy committed on Mar 10

Commit 1b7065a (verified) · 1 Parent(s): 13acb05

Update README.md

Files changed (1)

README.md +393 -5


@@ -1,12 +1,400 @@
 ---
 title: Matrix Voxel
-emoji: 📚
-colorFrom: purple
-colorTo: purple
 sdk: static
-pinned: false
 license: cc-by-nc-nd-4.0
 short_description: The next gen 3D generator
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
 title: Matrix Voxel
+emoji: 🌏
+colorFrom: green
+colorTo: pink
 sdk: static
+pinned: true
 license: cc-by-nc-nd-4.0
 short_description: The next gen 3D generator
 ---
+# Matrix Voxel — Full Architecture & Planning Document
+**3D Generation Model Family | Matrix.Corp**
+---
+## Family Overview
+Matrix Voxel is Matrix.Corp's 3D generation family. It comprises five models that share a common flow-matching backbone, each with a task-specific decoder head. Four specialist models are open source; the unified all-in-one model (Voxel Prime) is closed source and API-only.
+| Model | Task | Output Formats | Source | Hardware | Status |
+|---|---|---|---|---|---|
+| Voxel Atlas | World / environment generation | Voxel grids, OBJ scenes, USD stages | 🟢 Open Source | A100 40GB | 🔴 Planned |
+| Voxel Forge | 3D mesh / asset generation | OBJ, GLB, FBX, USDZ | 🟢 Open Source | A100 40GB | 🔴 Planned |
+| Voxel Cast | 3D printable model generation | STL, OBJ (watertight), STEP, 3MF | 🟢 Open Source | A100 40GB | 🔴 Planned |
+| Voxel Lens | NeRF / Gaussian Splatting scenes | .ply (3DGS), NeRF weights, MP4 render | 🟢 Open Source | A100 40GB | 🔴 Planned |
+| Voxel Prime | All-in-one unified generation | All of the above | 🟣 Closed Source | API Only | 🔴 Planned |
+---
+## Input Modalities (All Models)
+Every Voxel model accepts any combination of:
+| Input | Description | Encoder |
+|---|---|---|
+| Text prompt | Natural language description of desired 3D output | CLIP-ViT-L / T5-XXL |
+| Single image | Reference image → 3D lift | DINOv2 + custom depth encoder |
+| Multi-view images | 2–12 images from different angles | Multi-view transformer encoder |
+| Video | Extracts frames, infers 3D from motion | Temporal encoder (Video-MAE lineage) |
+| 3D model | Existing mesh/point cloud as conditioning | PointNet++ encoder |
+All inputs are projected to a shared 1024-dim conditioning embedding space before entering the backbone.
+---
+## Core Architecture — Shared Flow Matching Backbone
+### Why Flow Matching?
+Flow matching (Lipman et al. 2022, later adopted by the Stable Diffusion 3 / FLUX lineage) learns a vector field that transports noise to data directly. Compared with DDPM-style diffusion it needs far fewer inference steps (typically 20–50 vs. 1000), and it is generally reported to train more stably and cover modes better. As of 2025–2026 it is among the leading approaches for generative modeling.
+### 3D Representation — Triplane + Latent Voxel Grid
+All Voxel models operate in a shared latent 3D space:
+- **Triplane representation**: three axis-aligned feature planes (XY, XZ, YZ), each 256×256 with 32 channels
+- Any 3D point is queried by projecting it onto all three planes and summing the sampled features
+- Compact (3 × 256 × 256 × 32 ≈ 6.3M latent values) yet expressive
+- Flow matching operates on this triplane latent space, not raw 3D points
+- Decoder heads decode triplane to task-specific output format
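The triplane lookup described in the list above can be sketched in a few lines. This is a minimal illustration, not the planned implementation: it assumes point coordinates normalized to [0, 1], bilinear sampling on each plane, and tiny 4×4 planes with 2 channels instead of 256×256 with 32. The names `make_plane`, `bilinear`, and `query_triplane` are hypothetical.

```python
import math

def make_plane(res, channels, fill):
    """Build a res x res feature plane whose cells all hold `channels` copies of `fill`."""
    return [[[fill] * channels for _ in range(res)] for _ in range(res)]

def bilinear(plane, u, v):
    """Bilinearly interpolate a feature vector at continuous coords (u, v) in [0, 1]."""
    res = len(plane)
    x = min(max(u, 0.0), 1.0) * (res - 1)
    y = min(max(v, 0.0), 1.0) * (res - 1)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, res - 1), min(y0 + 1, res - 1)
    fx, fy = x - x0, y - y0
    out = []
    for c in range(len(plane[0][0])):
        top = plane[y0][x0][c] * (1 - fx) + plane[y0][x1][c] * fx
        bottom = plane[y1][x0][c] * (1 - fx) + plane[y1][x1][c] * fx
        out.append(top * (1 - fy) + bottom * fy)
    return out

def query_triplane(planes, point):
    """Project a 3D point onto the XY, XZ and YZ planes and sum the sampled features."""
    x, y, z = point
    f_xy = bilinear(planes["xy"], x, y)
    f_xz = bilinear(planes["xz"], x, z)
    f_yz = bilinear(planes["yz"], y, z)
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]

# Toy planes with constant features so the summed result is easy to check by hand.
planes = {
    "xy": make_plane(4, 2, 1.0),
    "xz": make_plane(4, 2, 2.0),
    "yz": make_plane(4, 2, 3.0),
}
features = query_triplane(planes, (0.3, 0.5, 0.9))

# Size of the full-resolution latent described above.
latent_values = 3 * 256 * 256 * 32
```

With constant planes, every query returns the sum of the three fill values per channel, which makes the summing step easy to verify before swapping in learned features.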
+### Backbone Architecture
+```
+VoxelBackbone
+├── Input Encoder (multimodal conditioning)
+│   ├── TextEncoder         — T5-XXL + CLIP-ViT-L, projected to 1024-dim
+│   ├── ImageEncoder        — DINOv2-L, projected to 1024-dim
+│   ├── MultiViewEncoder    — custom transformer over N views
+│   ├── VideoEncoder        — Video-MAE, temporal pooling → 1024-dim
+│   └── PointCloudEncoder   — PointNet++, global + local features → 1024-dim
+│
+├── Conditioning Fusion
+│   └── CrossModalAttention — fuses all active input modalities
+│
+├── Flow Matching Transformer (DiT-style)
+│   ├── 24 transformer blocks
+│   ├── Hidden dim: 1536
+│   ├── Heads: 24
+│   ├── Conditioning: AdaLN-Zero (timestep + conditioning signal)
+│   ├── 3D RoPE positional encoding for triplane tokens
+│   └── ~2.3B parameters
+│
+└── Triplane Decoder (shared across all specialist models)
+    └── Outputs: triplane feature tensor (3 × 256 × 256 × 32)
+```
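As a rough sanity check on the transformer figures in the tree above, the sketch below counts weights for a standard DiT-style block. It assumes an MLP ratio of 4, AdaLN-Zero modulation producing six vectors per block, and ignores biases, embeddings, and norms; none of these details are stated in this document, and `dit_block_params` is a hypothetical helper.

```python
def dit_block_params(hidden_dim, mlp_ratio=4):
    """Approximate weight count of one DiT-style block (biases and norms ignored)."""
    attn = 4 * hidden_dim * hidden_dim              # Q, K, V and output projections
    mlp = 2 * mlp_ratio * hidden_dim * hidden_dim   # two linear layers of the feed-forward
    adaln = 6 * hidden_dim * hidden_dim             # AdaLN-Zero: shift/scale/gate for attention and MLP
    return attn + mlp + adaln

blocks, hidden_dim, heads = 24, 1536, 24
head_dim = hidden_dim // heads                      # per-head width
core_params = blocks * dit_block_params(hidden_dim)

print(f"head_dim = {head_dim}")
print(f"transformer blocks only: {core_params / 1e9:.2f}B parameters")
```

Under these assumptions the 24 blocks alone account for roughly 1.0B weights, so the ~2.3B figure would have to include components not itemized in the tree (for example conditioning fusion, tokenization layers, or the triplane decoder) or a wider block design.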
+### Flow Matching Training
+- Learn vector field v_θ(x_t, t, c) where x_t is noisy triplane, c is conditioning
+- Optimal transport flow: straight paths from noise → data, which are easier to integrate in a few steps than DDPM's curved trajectories
+- Inference: 20–50 NFE (neural function evaluations) — fast on A100
+- Classifier-free guidance: unconditional dropout 10% during training
+- Guidance scale 5.0–10.0 at inference
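The training target and sampling loop listed above can be sketched on a toy 2-D problem. This is a hedged illustration only: plain lists stand in for triplane tensors, `toy_velocity` is a closed-form stand-in for the learned v_θ, and all function names are hypothetical.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight (optimal transport) path from noise x0 to data x1."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for v_theta along the straight path: constant x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def maybe_drop_condition(cond, rng, p_uncond=0.1):
    """Classifier-free guidance training: drop the conditioning 10% of the time."""
    return None if rng.random() < p_uncond else cond

def guided_velocity(v_cond, v_uncond, scale):
    """Combine conditional and unconditional predictions at inference."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]

def euler_sample(velocity_fn, x, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1; num_steps equals the NFE."""
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy stand-in for a trained model: the exact velocity toward one fixed data point.
DATA = [2.0, -1.0]

def toy_velocity(x, t):
    return [(d - xi) / (1.0 - t) for d, xi in zip(DATA, x)]

sample = euler_sample(toy_velocity, [0.5, 0.5], num_steps=20)
```

Because the toy field points straight at `DATA`, 20 Euler steps land on it up to floating-point error. A real model would call the network twice per step (with and without conditioning) and pass both outputs through `guided_velocity`.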
+---
+## Task-Specific Decoder Heads
+Each specialist model adds a decoder head on top of the shared triplane output.
+---
+### Voxel Atlas — World Generation Decoder
+**Task:** Generate full 3D environments and worlds — terrain, buildings, vegetation, interior spaces.
+**Output formats:**
+- Voxel grids (`.vox`, MagicaVoxel format) — for Minecraft-style worlds
+- OBJ scene (multiple meshes with materials) — for Unity/Unreal environments
+- USD stage (`.usd`) — industry standard scene format
+**Decoder head:**
+```
+TriplaneAtlasDecoder
+├── Scene Layout Transformer
+│   ├── Divides space into semantic regions (terrain, structures, vegetation, sky)
+│   └── 6-layer transformer over 32×32 spatial grid of scene tokens
+├── Region-wise NeRF decoder (per semantic region)
+│   └── MLP: 3D coords + triplane features → density + RGB + semantic label
+├── Marching Cubes extractor → raw mesh per region
+├── Scene graph assembler → parent-child relationships between objects
+├── Voxelizer (for .vox output) → discretizes to user-specified resolution
+└── USD exporter → full scene hierarchy with lighting + materials
+```
+**Special modules:**
+- **Infinite world tiling**: generate seamless adjacent chunks that stitch together
+- **Biome-aware generation**: desert, forest, urban, underwater, space, fantasy
+- **LOD generator**: auto-generates 4 levels of detail per scene object
+- **Lighting estimator**: infers plausible sun/sky lighting from scene content
+**Typical generation sizes:**
+- Small scene: 64×64×64 voxels or ~500m² OBJ scene — ~8 seconds on A100
+- Large world chunk: 256×256×128 voxels — ~35 seconds on A100
+---
+### Voxel Forge — Mesh / Asset Generation Decoder
+**Task:** Generate clean, game-ready 3D assets — characters, objects, props, vehicles, architecture.
+**Output formats:**
+- OBJ + MTL (universal)
+- GLB/GLTF (web & real-time)
+- FBX (game engine standard)
+- USDZ (Apple AR)
+**Decoder head:**
+```
+TriplaneForgeDecoder
+├── Occupancy Network decoder
+│   └── MLP: 3D point + triplane → occupancy probability
+├── Differentiable Marching Cubes → initial raw mesh
+├── Mesh Refinement Network
+│   ├── Graph neural network over mesh vertices/edges
+│   ├── 8 message-passing rounds
+│   └── Predicts vertex position offsets → clean topology
+├── UV Unwrapper (learned, SeamlessUV lineage)
+├── Texture Diffusion Head
+│   ├── 2D flow matching in UV space
+│   ├── Albedo + roughness + metallic + normal maps
+│   └── 1024×1024 or 2048×2048 texture atlas
+└── LOD Generator → 4 polycount levels (100% / 50% / 25% / 10%)
+```
+**Special modules:**
+- **Topology optimizer**: enforces quad-dominant topology for animation rigs
+- **Symmetry enforcer**: optional bilateral symmetry for characters/vehicles
+- **Scale normalizer**: outputs at real-world scale (meters) with unit metadata
+- **Material classifier**: auto-tags materials (metal, wood, fabric, glass, etc.)
+- **Animation-ready flag**: detects and preserves edge loops needed for rigging
+**Polygon counts:**
+- Low-poly asset: 500–5K triangles — ~6 seconds on A100
+- Mid-poly asset: 5K–50K triangles — ~15 seconds on A100
+- High-poly asset: 50K–500K triangles — ~45 seconds on A100
+---
+### Voxel Cast — 3D Printable Generation Decoder
+**Task:** Generate physically valid, printable 3D models. Watertight, manifold, structurally sound.
+**Output formats:**
+- STL (universal printing format)
+- OBJ (watertight)
+- STEP (CAD-compatible, parametric)
+- 3MF (modern printing format with material data)
+**Decoder head:**
+```
+TriplaneCastDecoder
+├── SDF (Signed Distance Field) decoder
+│   └── MLP: 3D point + triplane → signed distance value
+├── SDF → Watertight Mesh (dual marching cubes, no holes guaranteed)
+├── Printability Validator
+│   ├── Wall thickness checker (min 1.2mm enforced)
+│   ├── Overhang analyzer (>45° flagged + support detection)
+│   ├── Manifold checker + auto-repair
+│   └── Volume/surface area calculator
+├── Support Structure Generator (optional)
+│   └── Generates minimal support trees for FDM printing
+├── STEP Converter (via Open CASCADE bindings)
+└── Slicer Preview Renderer (preview only, not full slicer)
+```
+**Special modules:**
+- **Structural stress analyzer**: basic FEA simulation to detect weak points
+- **Hollowing engine**: auto-hollows solid objects with configurable wall thickness + drain holes
+- **Interlocking part splitter**: splits large objects into printable parts with snap-fit joints
+- **Material suggester**: recommends PLA / PETG / resin based on geometry complexity
+- **Scale validator**: ensures object is printable at specified scale on common bed sizes (Bambu, Prusa, Ender)
+**Validation requirements (all Cast outputs must pass):**
+- Zero non-manifold edges
+- Zero self-intersections
+- Minimum wall thickness ≥ 1.2mm at requested scale
+- Watertight (no open boundaries)
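Two of the validation requirements above (no non-manifold edges, no open boundaries) reduce to counting how many triangles share each edge. The sketch below shows that check on an indexed triangle list. It is an illustration under the assumption that the mesh is given as vertex-index triples, and it does not cover self-intersections or wall thickness. The function names are hypothetical.

```python
from collections import Counter

def edge_counts(triangles):
    """Count how many triangles use each undirected edge."""
    counts = Counter()
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            counts[(min(u, v), max(u, v))] += 1
    return counts

def boundary_edges(triangles):
    """Edges used by exactly one triangle: these are holes in the surface."""
    return sorted(e for e, n in edge_counts(triangles).items() if n == 1)

def non_manifold_edges(triangles):
    """Edges shared by more than two triangles."""
    return sorted(e for e, n in edge_counts(triangles).items() if n > 2)

def is_watertight_manifold(triangles):
    """True when every edge is shared by exactly two triangles."""
    return all(n == 2 for n in edge_counts(triangles).values())

# A closed tetrahedron passes; removing one face opens a boundary loop.
tetra = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (0, 2, 3)]
open_tetra = tetra[:3]

print(is_watertight_manifold(tetra))       # True
print(boundary_edges(open_tetra))          # [(0, 2), (0, 3), (2, 3)]
```

A production validator would add vertex-manifold and orientation checks on top of this edge test.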
+---
+### Voxel Lens — NeRF / Gaussian Splatting Decoder
+**Task:** Generate photorealistic 3D scenes represented as Neural Radiance Fields or 3D Gaussian Splats — primarily for visualization, VR/AR, and cinematic rendering.
+**Output formats:**
+- `.ply` (3D Gaussian Splatting — compatible with standard 3DGS viewers)
+- NeRF weights (Instant-NGP / Nerfstudio compatible)
+- MP4 render (pre-rendered orbital video)
+- Depth maps + normal maps (per-view, for downstream use)
+**Decoder head:**
+```
+TriplaneLensDecoder
+├── Gaussian Parameter Decoder
+│   ├── Samples 3D Gaussian centers from triplane density
+│   ├── Per-Gaussian: position (3), rotation (4 quaternion), scale (3),
+│   │   opacity (1), spherical harmonics coefficients (48) → color
+│   └── Targets: 500K–3M Gaussians per scene
+├── Gaussian Densification Module
+│   ├── Adaptive densification: split/clone in high-gradient regions
+│   └── Pruning: remove low-opacity Gaussians
+├── NeRF branch (parallel)
+│   ├── Hash-grid encoder (Instant-NGP style)
+│   └── Tiny MLP: encoded position → density + color
+├── Rasterizer (differentiable 3DGS rasterizer)
+│   └── Used during training for photometric loss
+└── Novel View Synthesizer
+    └── Renders arbitrary camera trajectories for MP4 export
+```
+**Special modules:**
+- **Lighting decomposition**: separates scene into albedo + illumination components
+- **Dynamic scene support**: temporal Gaussian sequences for animated scenes (from video input)
+- **Background/foreground separator**: isolates subject from environment
+- **Camera trajectory planner**: auto-generates cinematic orbital/fly-through paths
+- **Compression module**: reduces 3DGS file size by 60–80% with minimal quality loss
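The per-Gaussian layout in the decoder tree and the 60–80% compression target above imply a storage budget that is easy to work out. The sketch below assumes every parameter is stored as a 32-bit float with no extra per-point fields; the document does not specify the on-disk layout, so treat the numbers as an estimate.

```python
# position (3) + rotation quaternion (4) + scale (3) + opacity (1) + SH coefficients (48)
FLOATS_PER_GAUSSIAN = 3 + 4 + 3 + 1 + 48
BYTES_PER_FLOAT32 = 4  # assumption: uncompressed float32 storage

def raw_size_mb(num_gaussians):
    """Uncompressed size in megabytes (1 MB = 1e6 bytes)."""
    return num_gaussians * FLOATS_PER_GAUSSIAN * BYTES_PER_FLOAT32 / 1e6

def compressed_range_mb(num_gaussians, low=0.60, high=0.80):
    """Size range after a 60-80% reduction, returned as (best case, worst case)."""
    raw = raw_size_mb(num_gaussians)
    return raw * (1 - high), raw * (1 - low)

for count in (500_000, 3_000_000):
    best, worst = compressed_range_mb(count)
    print(f"{count:>9,} Gaussians: raw {raw_size_mb(count):.0f} MB, "
          f"compressed {best:.1f}-{worst:.1f} MB")
```

The 48 spherical-harmonics coefficients correspond to 16 coefficients per color channel, which is degree-3 harmonics.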
+**Generation modes:**
+- Object-centric: single object, orbital views — ~12 seconds on A100
+- Indoor scene: full room with lighting — ~40 seconds on A100
+- Outdoor scene: landscape or street — ~90 seconds on A100
+---
+### Voxel Prime — Closed Source All-in-One
+**Access:** API only. Not open source. Weights never distributed.
+Voxel Prime contains all four decoder heads simultaneously, plus:
+**Additional Prime-only modules:**
+- **Cross-task consistency**: ensures Atlas world + Forge assets + Lens scene all match when generated together
+- **Scene population engine**: generates a world (Atlas) then auto-populates it with assets (Forge)
+- **Pipeline orchestrator**: chains Atlas → Forge → Cast → Lens in one API call
+- **Photorealistic texture upscaler**: 4× super-resolution on all generated textures
+- **Style transfer module**: apply artistic style (e.g. "Studio Ghibli", "cyberpunk", "brutalist architecture") across all output types
+- **Iterative refinement**: text-guided editing of already-generated 3D content
+**API endpoint:**
+```python
+# Request body for POST /v1/voxel/generate, written as a Python dict
+# (serialize with json.dumps before sending).
+payload = {
+  "prompt": "A medieval castle on a cliff at sunset",
+  "output_types": ["world", "mesh", "nerf"],  # any combination
+  "inputs": {
+    "image": "base64...",       # optional reference image
+    "multiview": ["base64..."], # optional multi-view images
+    "video": "base64...",       # optional video
+    "model": "base64..."        # optional existing 3D model
+  },
+  "settings": {
+    "quality": "high",          # draft | standard | high
+    "style": "realistic",       # realistic | stylized | low-poly | ...
+    "scale_meters": 100.0,      # real-world scale
+    "symmetry": False,
+    "printable": False,
+  },
+}
+```
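A client for this endpoint might assemble the request as follows. This is a sketch under stated assumptions: `BASE_URL` is a placeholder (no host is given in this document), authentication is omitted because the scheme is unspecified, and `build_generate_request` is a hypothetical helper. The request is only constructed here, not sent.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host, not a real Matrix.Corp URL
ALLOWED_QUALITY = {"draft", "standard", "high"}  # values listed in the request body spec

def build_generate_request(prompt, output_types, quality="standard", **settings):
    """Build (but do not send) a POST request for /v1/voxel/generate."""
    if quality not in ALLOWED_QUALITY:
        raise ValueError(f"quality must be one of {sorted(ALLOWED_QUALITY)}")
    payload = {
        "prompt": prompt,
        "output_types": list(output_types),
        "settings": {"quality": quality, **settings},
    }
    return urllib.request.Request(
        BASE_URL + "/v1/voxel/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_generate_request(
    "A medieval castle on a cliff at sunset",
    ["world", "mesh"],
    quality="high",
    printable=False,
)
body = json.loads(request.data)
```

Sending would be a matter of passing `request` to `urllib.request.urlopen` once a real host and credentials exist.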
+---
+## Shared Custom Modules (All Models)
+| # | Module | Description |
+|---|---|---|
+| 1 | **Multi-Modal Conditioning Fusion** | CrossModalAttention over all active input types |
+| 2 | **3D RoPE Encoder** | RoPE adapted for triplane 3D spatial positions |
+| 3 | **Geometry Quality Scorer** | Rates generated geometry quality [0–1] before output |
+| 4 | **Semantic Label Head** | Per-voxel/vertex semantic class (wall, floor, tree, etc.) |
+| 5 | **Scale & Unit Manager** | Enforces consistent real-world scale across all outputs |
+| 6 | **Material Property Head** | Predicts PBR material properties (roughness, metallic, IOR) |
+| 7 | **Confidence & Uncertainty Head** | Per-region generation confidence — flags uncertain areas |
+| 8 | **Prompt Adherence Scorer** | CLIP-based score: how well output matches text prompt |
+| 9 | **Multi-Resolution Decoder** | Generates at 64³ → 128³ → 256³ coarse-to-fine |
+| 10 | **Style Embedding Module** | Encodes style reference images into style conditioning vector |
+---
+## Training Data Plan
+| Dataset | Content | Used by |
+|---|---|---|
+| ShapeNet (55K models) | Common 3D objects | Forge, Cast |
+| Objaverse (800K+ models) | Diverse 3D assets | Forge, Cast, Lens |
+| Objaverse-XL (10M+ objects) | Massive scale | All |
+| ScanNet / ScanNet++ | Indoor 3D scans | Atlas, Lens |
+| KITTI / nuScenes | Outdoor 3D scenes | Atlas, Lens |
+| ABO (Amazon Berkeley Objects) | Product meshes + materials | Forge |
+| Thingiverse (printable models) | 3D printable STLs | Cast |
+| Polycam scans | Real-world 3DGS/NeRF | Lens |
+| Synthetic renders (generated) | Multi-view rendered images | All |
+| Text-3D pairs (synthetic) | GPT-4o generated descriptions of Objaverse | All |
+---
+## Parameter Estimates
+| Model | Backbone | Decoder Head | Total | VRAM (BF16) |
+|---|---|---|---|---|
+| Voxel Atlas | 2.3B | ~400M | ~2.7B | ~22GB |
+| Voxel Forge | 2.3B | ~350M | ~2.65B | ~21GB |
+| Voxel Cast | 2.3B | ~200M | ~2.5B | ~20GB |
+| Voxel Lens | 2.3B | ~500M | ~2.8B | ~22GB |
+| Voxel Prime | 2.3B | ~1.4B (all 4) | ~3.7B | ~30GB |
+All models fit on an A100 40GB in BF16. INT8 quantization is expected to bring each of them under 15GB, which would make a consumer RTX 4090 (24GB) viable.
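The totals in the table can be reproduced from the backbone and head sizes, and the same numbers give a weights-only memory estimate. The sketch assumes 2 bytes per parameter in BF16 and 1 byte in INT8, and counts only the backbone and decoder heads; the helper names are hypothetical.

```python
BACKBONE_B = 2.3  # billions of parameters, from the table above
HEADS_B = {"Atlas": 0.40, "Forge": 0.35, "Cast": 0.20, "Lens": 0.50}

def total_params_b(head_b):
    """Backbone plus one decoder head, in billions of parameters."""
    return BACKBONE_B + head_b

def weight_memory_gb(params_b, bytes_per_param):
    """Memory for the weights alone: 1e9 params * bytes / 1e9 bytes per GB."""
    return params_b * bytes_per_param

totals = {name: total_params_b(h) for name, h in HEADS_B.items()}
totals["Prime"] = total_params_b(sum(HEADS_B.values()))  # all four heads together

for name, params in totals.items():
    print(f"Voxel {name}: {params:.2f}B params, "
          f"{weight_memory_gb(params, 2):.1f} GB BF16 weights, "
          f"{weight_memory_gb(params, 1):.1f} GB INT8 weights")
```

The four heads sum to 1.45B, which the table rounds to ~1.4B. Weights alone come to roughly 5–8 GB in BF16, so the VRAM column presumably also budgets for things this sketch leaves out, such as the input encoders and inference activations; that reading is an assumption, since the table does not break the figure down.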
+---
+## Training Strategy
+### Phase 1 — Backbone Pre-training
+- Train shared backbone on Objaverse-XL triplane reconstructions
+- Learn general 3D structure without task-specific heads
+- Conditioning: text + single image only
+- 100K steps, A100 cluster
+### Phase 2 — Decoder Head Training (parallel)
+- Freeze backbone, train each decoder head independently
+- Atlas: ScanNet + synthetic world data
+- Forge: ShapeNet + Objaverse + texture data
+- Cast: Thingiverse + watertight synthetic meshes
+- Lens: Polycam + synthetic multi-view renders
+- 50K steps each
+### Phase 3 — Joint Fine-tuning
+- Unfreeze backbone, fine-tune end-to-end per specialist model
+- Add all input modalities (video, multi-view, point cloud)
+- 30K steps each
+### Phase 4 — Prime Training
+- Initialize from jointly fine-tuned backbone
+- Train all decoder heads simultaneously
+- Cross-task consistency losses
+- Prime-only module training (pipeline orchestrator, style transfer)
+- 50K steps
+---
+## HuggingFace Plan
+```
+Matrix-Corp/Voxel-Atlas-V1    — open source
+Matrix-Corp/Voxel-Forge-V1    — open source
+Matrix-Corp/Voxel-Cast-V1     — open source
+Matrix-Corp/Voxel-Lens-V1     — open source
+Matrix-Corp/Voxel-Prime-V1    — closed source, API only (card only, no weights)
+```
+Collection: `Matrix-Corp/voxel-v1`
+---
+## Status
+- 🔴 Planned — Architecture specification complete
+- Backbone design finalized
+- Decoder head designs finalized
+- Training data sourcing: TBD
+- Compute requirements: significant (A100 cluster for training)
+- Timeline: TBD