Zandy-Wandy committed · Commit 1b7065a · verified · 1 Parent(s): 13acb05

Update README.md

Files changed (1)
  1. README.md +393 -5
README.md CHANGED
@@ -1,12 +1,400 @@
  ---
  title: Matrix Voxel
- emoji: 📚
- colorFrom: purple
- colorTo: purple
  sdk: static
- pinned: false
  license: cc-by-nc-nd-4.0
  short_description: The next gen 3D generator
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
  title: Matrix Voxel
+ emoji: 🌏
+ colorFrom: green
+ colorTo: pink
  sdk: static
+ pinned: true
  license: cc-by-nc-nd-4.0
  short_description: The next gen 3D generator
  ---
 
+ # Matrix Voxel: Full Architecture & Planning Document
+ **3D Generation Model Family | Matrix.Corp**
+
+ ---
+
+ ## Family Overview
+
+ Matrix Voxel is Matrix.Corp's 3D generation family. It comprises five models that share a common flow-matching backbone, each with a task-specific decoder head. Four specialist models are open source; one unified all-in-one model (Voxel Prime) is closed source and API-only.
+
+ | Model | Task | Output Formats | Source | Hardware | Status |
+ |---|---|---|---|---|---|
+ | Voxel Atlas | World / environment generation | Voxel grids, OBJ scenes, USD stages | 🟢 Open Source | A100 40GB | 🔴 Planned |
+ | Voxel Forge | 3D mesh / asset generation | OBJ, GLB, FBX, USDZ | 🟢 Open Source | A100 40GB | 🔴 Planned |
+ | Voxel Cast | 3D printable model generation | STL, OBJ (watertight), STEP | 🟢 Open Source | A100 40GB | 🔴 Planned |
+ | Voxel Lens | NeRF / Gaussian Splatting scenes | .ply (3DGS), NeRF weights, MP4 render | 🟢 Open Source | A100 40GB | 🔴 Planned |
+ | Voxel Prime | All-in-one unified generation | All of the above | 🟣 Closed Source | API Only | 🔴 Planned |
+
+ ---
+
+ ## Input Modalities (All Models)
+
+ Every Voxel model accepts any combination of:
+
+ | Input | Description | Encoder |
+ |---|---|---|
+ | Text prompt | Natural language description of desired 3D output | CLIP-ViT-L / T5-XXL |
+ | Single image | Reference image → 3D lift | DINOv2 + custom depth encoder |
+ | Multi-view images | 2–12 images from different angles | Multi-view transformer encoder |
+ | Video | Extracts frames, infers 3D from motion | Temporal encoder (Video-MAE lineage) |
+ | 3D model | Existing mesh/point cloud as conditioning | PointNet++ encoder |
+
+ All inputs are projected to a shared 1024-dim conditioning embedding space before entering the backbone.
+
+ ---
+
+ ## Core Architecture: Shared Flow Matching Backbone
+
+ ### Why Flow Matching?
+ Flow matching (Lipman et al. 2022, extended by the Stable Diffusion 3 / FLUX lineage) learns a vector field that transports noise → data directly. Compared with DDPM diffusion, it needs fewer inference steps (typically 20–50 vs 1000), trains more stably, and covers modes better. It is widely regarded as state of the art for generative models as of 2025–2026.
+
+ ### 3D Representation: Triplane + Latent Voxel Grid
+ All Voxel models operate in a shared latent 3D space:
+ - **Triplane representation**: three axis-aligned feature planes (XY, XZ, YZ), each 256×256 with 32 channels
+ - Any 3D point is queried by projecting it onto all 3 planes and summing the sampled features
+ - Compact (3 × 256 × 256 × 32 ≈ 6.3M latent values) yet expressive
+ - Flow matching operates on this triplane latent space, not raw 3D points
+ - Decoder heads decode the triplane to a task-specific output format
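The triplane query described in the list above can be sketched in plain Python. This is a minimal illustration, not the planned implementation: the function names, the `[-1, 1]` coordinate convention, and bilinear sampling are assumptions, and the toy planes are 4×4 with 2 channels rather than 256×256 with 32.

```python
import math

def bilinear(plane, u, v):
    """Bilinearly sample plane[row][col] (a list of channels) at pixel coords (u, v)."""
    res = len(plane)
    u = min(max(u, 0.0), res - 1.0)
    v = min(max(v, 0.0), res - 1.0)
    u0, v0 = int(math.floor(u)), int(math.floor(v))
    u1, v1 = min(u0 + 1, res - 1), min(v0 + 1, res - 1)
    fu, fv = u - u0, v - v0
    out = []
    for c in range(len(plane[0][0])):
        top = plane[v0][u0][c] * (1 - fu) + plane[v0][u1][c] * fu
        bottom = plane[v1][u0][c] * (1 - fu) + plane[v1][u1][c] * fu
        out.append(top * (1 - fv) + bottom * fv)
    return out

def query_triplane(planes, point):
    """Project a point onto the XY, XZ and YZ planes and sum the sampled features.

    planes: dict with keys "xy", "xz", "yz"; point: (x, y, z), assumed in [-1, 1].
    """
    res = len(planes["xy"])
    # Map [-1, 1] to pixel coordinates [0, res - 1].
    x, y, z = [(p + 1.0) * 0.5 * (res - 1) for p in point]
    f_xy = bilinear(planes["xy"], x, y)
    f_xz = bilinear(planes["xz"], x, z)
    f_yz = bilinear(planes["yz"], y, z)
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]

def constant_plane(res, channels, value):
    """Toy feature plane filled with a single value."""
    return [[[value] * channels for _ in range(res)] for _ in range(res)]

planes = {
    "xy": constant_plane(4, 2, 1.0),
    "xz": constant_plane(4, 2, 2.0),
    "yz": constant_plane(4, 2, 3.0),
}
features = query_triplane(planes, (0.1, -0.4, 0.7))

# Size of the full latent described above: three planes of 256 x 256 x 32.
latent_values = 3 * 256 * 256 * 32
```

With constant planes the summed feature is simply the sum of the three plane values per channel, which makes the sketch easy to check by hand; `latent_values` evaluates to 6,291,456.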
+
+ ### Backbone Architecture
+
+ ```
+ VoxelBackbone
+ ├── Input Encoder (multimodal conditioning)
+ │   ├── TextEncoder: T5-XXL + CLIP-ViT-L, projected to 1024-dim
+ │   ├── ImageEncoder: DINOv2-L, projected to 1024-dim
+ │   ├── MultiViewEncoder: custom transformer over N views
+ │   ├── VideoEncoder: Video-MAE, temporal pooling → 1024-dim
+ │   └── PointCloudEncoder: PointNet++, global + local features → 1024-dim
+ │
+ ├── Conditioning Fusion
+ │   └── CrossModalAttention: fuses all active input modalities
+ │
+ ├── Flow Matching Transformer (DiT-style)
+ │   ├── 24 transformer blocks
+ │   ├── Hidden dim: 1536
+ │   ├── Heads: 24
+ │   ├── Conditioning: AdaLN-Zero (timestep + conditioning signal)
+ │   ├── 3D RoPE positional encoding for triplane tokens
+ │   └── ~2.3B parameters
+ │
+ └── Triplane Decoder (shared across all specialist models)
+     └── Outputs: triplane feature tensor (3 × 256 × 256 × 32)
+ ```
+
+ ### Flow Matching Training
+ - Learn a vector field v_θ(x_t, t, c), where x_t is the noisy triplane and c is the conditioning
+ - Optimal transport flow: straight paths from noise → data (rather than DDPM's curved paths)
+ - Inference: 20–50 NFE (neural function evaluations), fast on A100
+ - Classifier-free guidance: conditioning is dropped for 10% of training samples
+ - Guidance scale 5.0–10.0 at inference
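The training target and guided sampling loop listed above can be sketched as follows. This assumes the common straight-path formulation (x_t = (1 - t)·x0 + t·x1 with target velocity x1 - x0) and a plain Euler integrator; `toy_velocity` is a hypothetical stand-in for v_θ, not the planned network, and all function names are illustrative.

```python
import random

def interpolate(x0, x1, t):
    """Straight-line path x_t = (1 - t) * x0 + t * x1 between noise x0 and data x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the vector field along the straight path: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def guided_velocity(v_cond, v_uncond, scale):
    """Classifier-free guidance: v_uncond + scale * (v_cond - v_uncond)."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]

def euler_sample(velocity_fn, x0, cond, steps=20, scale=5.0):
    """Integrate dx/dt = v from t=0 (noise) to t=1 (data) with `steps` Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v_c = velocity_fn(x, t, cond)   # conditional evaluation
        v_u = velocity_fn(x, t, None)   # unconditional evaluation
        v = guided_velocity(v_c, v_u, scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def toy_velocity(x, t, cond):
    """Hypothetical stand-in for v_theta: exact straight-path field toward a target."""
    target = cond if cond is not None else [0.0] * len(x)
    # On a straight path ending at `target`, the velocity is (target - x) / (1 - t).
    return [(g - xi) / (1.0 - t) for g, xi in zip(target, x)]

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(2)]
# scale=1.0 disables guidance, so the toy field carries the noise exactly to the target.
sample = euler_sample(toy_velocity, noise, cond=[1.0, -2.0], steps=20, scale=1.0)
```

In this sketch each Euler step evaluates the field twice (conditional and unconditional), so the number of function evaluations is twice the step count unless the two passes are batched together.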
+
+ ---
+
+ ## Task-Specific Decoder Heads
+
+ Each specialist model adds a decoder head on top of the shared triplane output.
+
+ ---
+
+ ### Voxel Atlas: World Generation Decoder
+
+ **Task:** Generate full 3D environments and worlds: terrain, buildings, vegetation, interior spaces.
+
+ **Output formats:**
+ - Voxel grids (`.vox`, MagicaVoxel format) for Minecraft-style worlds
+ - OBJ scene (multiple meshes with materials) for Unity/Unreal environments
+ - USD stage (`.usd`), an industry-standard scene format
+
+ **Decoder head:**
+ ```
+ TriplaneAtlasDecoder
+ ├── Scene Layout Transformer
+ │   ├── Divides space into semantic regions (terrain, structures, vegetation, sky)
+ │   └── 6-layer transformer over 32×32 spatial grid of scene tokens
+ ├── Region-wise NeRF decoder (per semantic region)
+ │   └── MLP: 3D coords + triplane features → density + RGB + semantic label
+ ├── Marching Cubes extractor → raw mesh per region
+ ├── Scene graph assembler → parent-child relationships between objects
+ ├── Voxelizer (for .vox output) → discretizes to user-specified resolution
+ └── USD exporter → full scene hierarchy with lighting + materials
+ ```
+
+ **Special modules:**
+ - **Infinite world tiling**: generate seamless adjacent chunks that stitch together
+ - **Biome-aware generation**: desert, forest, urban, underwater, space, fantasy
+ - **LOD generator**: auto-generates 4 levels of detail per scene object
+ - **Lighting estimator**: infers plausible sun/sky lighting from scene content
+
+ **Typical generation sizes:**
+ - Small scene: 64×64×64 voxels or ~500 m² OBJ scene, ~8 seconds on A100
+ - Large world chunk: 256×256×128 voxels, ~35 seconds on A100
+
+ ---
+
+ ### Voxel Forge: Mesh / Asset Generation Decoder
+
+ **Task:** Generate clean, game-ready 3D assets: characters, objects, props, vehicles, architecture.
+
+ **Output formats:**
+ - OBJ + MTL (universal)
+ - GLB/GLTF (web & real-time)
+ - FBX (game engine standard)
+ - USDZ (Apple AR)
+
+ **Decoder head:**
+ ```
+ TriplaneForgeDecoder
+ ├── Occupancy Network decoder
+ │   └── MLP: 3D point + triplane → occupancy probability
+ ├── Differentiable Marching Cubes → initial raw mesh
+ ├── Mesh Refinement Network
+ │   ├── Graph neural network over mesh vertices/edges
+ │   ├── 8 message-passing rounds
+ │   └── Predicts vertex position offsets → clean topology
+ ├── UV Unwrapper (learned, SeamlessUV lineage)
+ ├── Texture Diffusion Head
+ │   ├── 2D flow matching in UV space
+ │   ├── Albedo + roughness + metallic + normal maps
+ │   └── 1024×1024 or 2048×2048 texture atlas
+ └── LOD Generator → 4 polycount levels (100% / 50% / 25% / 10%)
+ ```
+
+ **Special modules:**
+ - **Topology optimizer**: enforces quad-dominant topology for animation rigs
+ - **Symmetry enforcer**: optional bilateral symmetry for characters/vehicles
+ - **Scale normalizer**: outputs at real-world scale (meters) with unit metadata
+ - **Material classifier**: auto-tags materials (metal, wood, fabric, glass, etc.)
+ - **Animation-ready flag**: detects and preserves edge loops needed for rigging
+
+ **Polygon counts:**
+ - Low-poly asset: 500–5K triangles, ~6 seconds on A100
+ - Mid-poly asset: 5K–50K triangles, ~15 seconds on A100
+ - High-poly asset: 50K–500K triangles, ~45 seconds on A100
+
+ ---
+
+ ### Voxel Cast: 3D Printable Generation Decoder
+
+ **Task:** Generate physically valid, printable 3D models that are watertight, manifold, and structurally sound.
+
+ **Output formats:**
+ - STL (universal printing format)
+ - OBJ (watertight)
+ - STEP (CAD-compatible, parametric)
+ - 3MF (modern printing format with material data)
+
+ **Decoder head:**
+ ```
+ TriplaneCastDecoder
+ ├── SDF (Signed Distance Field) decoder
+ │   └── MLP: 3D point + triplane → signed distance value
+ ├── SDF → Watertight Mesh (dual marching cubes, no holes guaranteed)
+ ├── Printability Validator
+ │   ├── Wall thickness checker (min 1.2mm enforced)
+ │   ├── Overhang analyzer (>45° flagged + support detection)
+ │   ├── Manifold checker + auto-repair
+ │   └── Volume/surface area calculator
+ ├── Support Structure Generator (optional)
+ │   └── Generates minimal support trees for FDM printing
+ ├── STEP Converter (via Open CASCADE bindings)
+ └── Slicer Preview Renderer (preview only, not full slicer)
+ ```
+
+ **Special modules:**
+ - **Structural stress analyzer**: basic FEA simulation to detect weak points
+ - **Hollowing engine**: auto-hollows solid objects with configurable wall thickness + drain holes
+ - **Interlocking part splitter**: splits large objects into printable parts with snap-fit joints
+ - **Material suggester**: recommends PLA / PETG / resin based on geometry complexity
+ - **Scale validator**: ensures object is printable at specified scale on common bed sizes (Bambu, Prusa, Ender)
+
+ **Validation requirements (all Cast outputs must pass):**
+ - Zero non-manifold edges
+ - Zero self-intersections
+ - Minimum wall thickness ≥ 1.2mm at requested scale
+ - Watertight (no open boundaries)
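Two of the validation requirements above (no non-manifold edges, no open boundaries) reduce to counting how many triangles share each edge. The sketch below shows that edge-count test on an indexed triangle list; the function names and report format are hypothetical, and it does not cover self-intersections or wall thickness, which need geometric checks rather than connectivity alone.

```python
from collections import Counter

def edge_counts(faces):
    """Count how many triangles use each undirected edge."""
    counts = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            counts[(min(u, v), max(u, v))] += 1
    return counts

def validate_mesh(faces):
    """Report boundary edges (used once) and non-manifold edges (used more than twice)."""
    counts = edge_counts(faces)
    boundary = [edge for edge, n in counts.items() if n == 1]
    non_manifold = [edge for edge, n in counts.items() if n > 2]
    return {
        # A closed 2-manifold surface has every edge shared by exactly two triangles.
        "watertight": bool(faces) and not boundary and not non_manifold,
        "boundary_edges": len(boundary),
        "non_manifold_edges": len(non_manifold),
    }

# A tetrahedron is the smallest closed triangle mesh: 4 faces, 6 edges.
tetrahedron = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]
closed_report = validate_mesh(tetrahedron)
# Dropping one face opens a hole bounded by three edges.
open_report = validate_mesh(tetrahedron[:-1])
```

Here `closed_report` marks the tetrahedron as watertight, while `open_report` lists three boundary edges for the mesh with a missing face.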
+
+ ---
+
+ ### Voxel Lens: NeRF / Gaussian Splatting Decoder
+
+ **Task:** Generate photorealistic 3D scenes represented as Neural Radiance Fields or 3D Gaussian Splats, primarily for visualization, VR/AR, and cinematic rendering.
+
+ **Output formats:**
+ - `.ply` (3D Gaussian Splatting, compatible with standard 3DGS viewers)
+ - NeRF weights (Instant-NGP / Nerfstudio compatible)
+ - MP4 render (pre-rendered orbital video)
+ - Depth maps + normal maps (per-view, for downstream use)
+
+ **Decoder head:**
+ ```
+ TriplaneLensDecoder
+ ├── Gaussian Parameter Decoder
+ │   ├── Samples 3D Gaussian centers from triplane density
+ │   ├── Per-Gaussian: position (3), rotation (4 quaternion), scale (3),
+ │   │   opacity (1), spherical harmonics coefficients (48) → color
+ │   └── Targets: 500K–3M Gaussians per scene
+ ├── Gaussian Densification Module
+ │   ├── Adaptive densification: split/clone in high-gradient regions
+ │   └── Pruning: remove low-opacity Gaussians
+ ├── NeRF branch (parallel)
+ │   ├── Hash-grid encoder (Instant-NGP style)
+ │   └── Tiny MLP: encoded position → density + color
+ ├── Rasterizer (differentiable 3DGS rasterizer)
+ │   └── Used during training for photometric loss
+ └── Novel View Synthesizer
+     └── Renders arbitrary camera trajectories for MP4 export
+ ```
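The per-Gaussian layout in the decoder above gives a quick estimate of raw scene size. The sketch assumes every value is stored as a 32-bit float with no extra per-point fields, which the source does not specify; the 48 spherical-harmonics values correspond to 16 coefficients (degree 3) for each of 3 color channels.

```python
# Per-Gaussian attribute widths taken from the decoder description.
ATTRIBUTES = {
    "position": 3,
    "rotation_quaternion": 4,
    "scale": 3,
    "opacity": 1,
    "spherical_harmonics": 48,  # 16 coefficients x 3 color channels
}
FLOATS_PER_GAUSSIAN = sum(ATTRIBUTES.values())

def raw_size_mb(num_gaussians, bytes_per_float=4):
    """Uncompressed size in MB (1 MB = 1e6 bytes), assuming float32 storage."""
    return num_gaussians * FLOATS_PER_GAUSSIAN * bytes_per_float / 1e6

small_scene_mb = raw_size_mb(500_000)
large_scene_mb = raw_size_mb(3_000_000)
```

Under these assumptions each Gaussian takes 59 floats (236 bytes), so the stated 500K–3M Gaussian target spans roughly 118 MB to 708 MB before any compression.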
+
+ **Special modules:**
+ - **Lighting decomposition**: separates scene into albedo + illumination components
+ - **Dynamic scene support**: temporal Gaussian sequences for animated scenes (from video input)
+ - **Background/foreground separator**: isolates subject from environment
+ - **Camera trajectory planner**: auto-generates cinematic orbital/fly-through paths
+ - **Compression module**: reduces 3DGS file size by 60–80% with minimal quality loss
+
+ **Generation modes:**
+ - Object-centric: single object, orbital views, ~12 seconds on A100
+ - Indoor scene: full room with lighting, ~40 seconds on A100
+ - Outdoor scene: landscape or street, ~90 seconds on A100
+
+ ---
+
+ ### Voxel Prime: Closed Source All-in-One
+
+ **Access:** API only. Not open source. Weights never distributed.
+
+ Voxel Prime contains all four decoder heads simultaneously, plus:
+
+ **Additional Prime-only modules:**
+ - **Cross-task consistency**: ensures Atlas world + Forge assets + Lens scene all match when generated together
+ - **Scene population engine**: generates a world (Atlas) then auto-populates it with assets (Forge)
+ - **Pipeline orchestrator**: chains Atlas → Forge → Cast → Lens in one API call
+ - **Photorealistic texture upscaler**: 4× super-resolution on all generated textures
+ - **Style transfer module**: apply artistic style (e.g. "Studio Ghibli", "cyberpunk", "brutalist architecture") across all output types
+ - **Iterative refinement**: text-guided editing of already-generated 3D content
+
+ **API endpoint:**
+ ```python
+ # Request body for POST /v1/voxel/generate, written as a Python dict.
+ payload = {
+     "prompt": "A medieval castle on a cliff at sunset",
+     "output_types": ["world", "mesh", "nerf"],  # any combination
+     "inputs": {
+         "image": "base64...",        # optional reference image
+         "multiview": ["base64..."],  # optional multi-view images
+         "video": "base64...",        # optional video
+         "model": "base64...",        # optional existing 3D model
+     },
+     "settings": {
+         "quality": "high",           # draft | standard | high
+         "style": "realistic",        # realistic | stylized | low-poly | ...
+         "scale_meters": 100.0,       # real-world scale
+         "symmetry": False,
+         "printable": False,
+     },
+ }
+ ```
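A request body like the one above could be assembled and serialized with the standard library. `build_generate_body` is a hypothetical helper, not part of any published client; it only checks `quality` against the three values named in the example and passes every other setting through unchanged, since the source does not enumerate them.

```python
import json

# Quality levels named in the request example; other fields are passed through as-is.
ALLOWED_QUALITY = ("draft", "standard", "high")

def build_generate_body(prompt, output_types, quality="standard", inputs=None, **settings):
    """Serialize a request body for POST /v1/voxel/generate (hypothetical helper)."""
    if quality not in ALLOWED_QUALITY:
        raise ValueError(f"quality must be one of {ALLOWED_QUALITY}, got {quality!r}")
    body = {
        "prompt": prompt,
        "output_types": list(output_types),
        "settings": {"quality": quality, **settings},
    }
    if inputs:
        # Optional conditioning inputs (image, multiview, video, model).
        body["inputs"] = dict(inputs)
    return json.dumps(body, indent=2)

request_json = build_generate_body(
    "A medieval castle on a cliff at sunset",
    ["world", "mesh", "nerf"],
    quality="high",
    style="realistic",
    scale_meters=100.0,
    symmetry=False,
    printable=False,
)
```

`json.dumps` converts Python's `False` to the JSON literal `false`, so the serialized text matches what an HTTP API would expect on the wire.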
+
+ ---
+
+ ## Shared Custom Modules (All Models)
+
+ | # | Module | Description |
+ |---|---|---|
+ | 1 | **Multi-Modal Conditioning Fusion** | CrossModalAttention over all active input types |
+ | 2 | **3D RoPE Encoder** | RoPE adapted for triplane 3D spatial positions |
+ | 3 | **Geometry Quality Scorer** | Rates generated geometry quality [0–1] before output |
+ | 4 | **Semantic Label Head** | Per-voxel/vertex semantic class (wall, floor, tree, etc.) |
+ | 5 | **Scale & Unit Manager** | Enforces consistent real-world scale across all outputs |
+ | 6 | **Material Property Head** | Predicts PBR material properties (roughness, metallic, IOR) |
+ | 7 | **Confidence & Uncertainty Head** | Per-region generation confidence; flags uncertain areas |
+ | 8 | **Prompt Adherence Scorer** | CLIP-based score: how well output matches text prompt |
+ | 9 | **Multi-Resolution Decoder** | Generates at 64³ → 128³ → 256³ coarse-to-fine |
+ | 10 | **Style Embedding Module** | Encodes style reference images into style conditioning vector |
+
+ ---
+
+ ## Training Data Plan
+
+ | Dataset | Content | Used by |
+ |---|---|---|
+ | ShapeNetCore (~51K models, 55 categories) | Common 3D objects | Forge, Cast |
+ | Objaverse (800K+ models) | Diverse 3D assets | Forge, Cast, Lens |
+ | Objaverse-XL (10M+ objects) | Massive scale | All |
+ | ScanNet / ScanNet++ | Indoor 3D scans | Atlas, Lens |
+ | KITTI / nuScenes | Outdoor 3D scenes | Atlas, Lens |
+ | ABO (Amazon Berkeley Objects) | Product meshes + materials | Forge |
+ | Thingiverse (printable models) | 3D printable STLs | Cast |
+ | Polycam scans | Real-world 3DGS/NeRF | Lens |
+ | Synthetic renders (generated) | Multi-view rendered images | All |
+ | Text-3D pairs (synthetic) | GPT-4o generated descriptions of Objaverse | All |
+
+ ---
+
+ ## Parameter Estimates
+
+ | Model | Backbone | Decoder Head | Total | VRAM (BF16) |
+ |---|---|---|---|---|
+ | Voxel Atlas | 2.3B | ~400M | ~2.7B | ~22GB |
+ | Voxel Forge | 2.3B | ~350M | ~2.65B | ~21GB |
+ | Voxel Cast | 2.3B | ~200M | ~2.5B | ~20GB |
+ | Voxel Lens | 2.3B | ~500M | ~2.8B | ~22GB |
+ | Voxel Prime | 2.3B | ~1.4B (all 4) | ~3.7B | ~30GB |
+
+ All fit on A100 40GB in BF16. INT8 quantization brings all under 15GB (consumer 4090 viable).
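For orientation, the weight-only footprint implied by the totals above can be computed directly: 2 bytes per parameter in BF16 and 1 byte in INT8. This sketch covers weights only; the VRAM column in the table is larger, presumably because it also budgets for the conditioning encoders, activations, and buffers, which the source does not break down.

```python
# Total parameter counts (billions) copied from the table above.
TOTAL_PARAMS_B = {
    "Voxel Atlas": 2.7,
    "Voxel Forge": 2.65,
    "Voxel Cast": 2.5,
    "Voxel Lens": 2.8,
    "Voxel Prime": 3.7,
}
BYTES_PER_PARAM = {"bf16": 2, "int8": 1}

def weight_gb(params_billions, dtype):
    """Weight-only footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[dtype]

bf16_weights = {name: weight_gb(p, "bf16") for name, p in TOTAL_PARAMS_B.items()}
int8_weights = {name: weight_gb(p, "int8") for name, p in TOTAL_PARAMS_B.items()}
```

For example, Voxel Atlas comes to about 5.4 GB of BF16 weights and Voxel Prime to about 7.4 GB, so the weights themselves are a small share of the quoted VRAM budgets.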
+
+ ---
+
+ ## Training Strategy
+
+ ### Phase 1: Backbone Pre-training
+ - Train shared backbone on Objaverse-XL triplane reconstructions
+ - Learn general 3D structure without task-specific heads
+ - Context: text + single image conditioning only
+ - 100K steps, A100 cluster
+
+ ### Phase 2: Decoder Head Training (parallel)
+ - Freeze backbone, train each decoder head independently
+ - Atlas: ScanNet + synthetic world data
+ - Forge: ShapeNet + Objaverse + texture data
+ - Cast: Thingiverse + watertight synthetic meshes
+ - Lens: Polycam + synthetic multi-view renders
+ - 50K steps each
+
+ ### Phase 3: Joint Fine-tuning
+ - Unfreeze backbone, fine-tune end-to-end per specialist model
+ - Add all input modalities (video, multi-view, point cloud)
+ - 30K steps each
+
+ ### Phase 4: Prime Training
+ - Initialize from jointly fine-tuned backbone
+ - Train all decoder heads simultaneously
+ - Cross-task consistency losses
+ - Prime-only module training (pipeline orchestrator, style transfer)
+ - 50K steps
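The four phases above can be written down as a small schedule to total the planned optimizer steps. The `Phase` dataclass and its field names are illustrative assumptions; `backbone_frozen` is left as `None` for Phase 4 because the plan does not say whether the backbone is frozen there.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Phase:
    name: str
    steps: int
    backbone_frozen: Optional[bool]  # None where the plan does not say
    per_specialist: bool             # True if the phase runs once per specialist model

SCHEDULE = [
    Phase("backbone_pretraining", 100_000, False, False),
    Phase("decoder_head_training", 50_000, True, True),
    Phase("joint_finetuning", 30_000, False, True),
    Phase("prime_training", 50_000, None, False),
]
NUM_SPECIALISTS = 4  # Atlas, Forge, Cast, Lens

def steps_for_one_specialist(schedule):
    """Steps on the path to a single specialist model (Phases 1 to 3)."""
    return sum(p.steps for p in schedule if p.name != "prime_training")

def steps_for_family(schedule, num_specialists):
    """Total steps across the family, running per-specialist phases once per model."""
    return sum(p.steps * (num_specialists if p.per_specialist else 1) for p in schedule)

one_specialist = steps_for_one_specialist(SCHEDULE)
family_total = steps_for_family(SCHEDULE, NUM_SPECIALISTS)
```

This gives 180,000 steps on the path to any one specialist and 470,000 steps for the whole family. Steps in different phases are not equally expensive, so the totals indicate schedule length rather than compute.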
+
+ ---
+
+ ## HuggingFace Plan
+
+ ```
+ Matrix-Corp/Voxel-Atlas-V1: open source
+ Matrix-Corp/Voxel-Forge-V1: open source
+ Matrix-Corp/Voxel-Cast-V1: open source
+ Matrix-Corp/Voxel-Lens-V1: open source
+ Matrix-Corp/Voxel-Prime-V1: closed source, API only (card only, no weights)
+ ```
+
+ Collection: `Matrix-Corp/voxel-v1`
+
+ ---
+
+ ## Status
+ - 🔴 Planned: architecture specification complete
+ - Backbone design finalized
+ - Decoder head designs finalized
+ - Training data sourcing: TBD
+ - Compute requirements: significant (A100 cluster for training)
+ - Timeline: TBD