# Gaussian Specialist Head: 3D Gaussian Splatting as a Specialist Module within a Unified Cognitive Kernel

**Ava Shakil¹, Ali A. Shakil¹**
¹Artifact Virtual

**Series:** Uranium Research Series — Paper VIII
**Date:** April 2026
**Status:** Draft v1.1

---

## Abstract

We present the Gaussian Specialist Head, a novel architecture that integrates 3D Gaussian Splatting as a specialist module within WYRM, a unified cognitive kernel built on a single transformer backbone. Unlike conventional multi-modal approaches that train separate vision encoders and fuse representations post-hoc, our method treats spatial understanding as a *specialist capability* routed from the same backbone that processes language, mathematics, and temporal data. The architecture employs a two-stage hierarchical generation scheme: (1) an Anchor Head that regresses coarse structural Gaussians from depth-profiled backbone representations, and (2) a Detail Head that generates fine-grained Gaussians via cross-attention and Vector Quantized Variational Autoencoder (VQ-VAE) codebook lookup. Learned depth profile gates determine which backbone layers contribute to structural versus detail generation, enabling the model to develop cortical-like specialization without architectural separation. This work operationalizes the principle that "there is no such thing as multi-modal" — a cell does not have a text mode; it responds to stimuli. We describe the complete architecture, its integration with the NexusRouter dispatch mechanism, the VQ-VAE codebook design, and the depth profile system that enables a single backbone to serve multiple specialist functions.

**Keywords:** 3D Gaussian Splatting, unified architecture, specialist routing, VQ-VAE, depth profiles, cognitive kernel, multi-modal fusion

---

## I. Introduction

The dominant paradigm in multi-modal AI separates perception by modality: a vision encoder processes images, a language model processes text, and a fusion module combines them [1, 2]. This design reflects a human engineering assumption — that different data types require fundamentally different processing — rather than a computational necessity.

Biological neural systems contradict this assumption. Cortical columns process diverse stimuli through a uniform computational substrate, with specialization emerging from connectivity and depth rather than architectural separation [3]. A neuron in visual cortex V1 and a neuron in Broca's area share the same fundamental mechanism; what differs is their position in the network, the signals they receive, and the depth profile of their activation patterns.

3D Gaussian Splatting (3DGS) [4] has emerged as a powerful representation for spatial scenes, offering real-time rendering, explicit geometry, and differentiable optimization. However, existing 3DGS implementations operate as standalone systems — separate from the language and reasoning capabilities that would allow a model to *understand* the scenes it generates.

We propose the Gaussian Specialist Head, which integrates 3DGS generation as a specialist module within WYRM's unified cognitive kernel. The key insight is architectural: rather than building a separate 3D generation model, we route backbone hidden states through a specialist head that predicts Gaussian splat parameters — positions, scales, rotations, opacities, and colors. The same backbone that processes language also generates 3D scenes, with specialist routing determining when spatial generation is appropriate.

This paper makes the following contributions:

1. **Architectural integration** of 3D Gaussian Splatting within a transformer language backbone via specialist routing, eliminating the need for separate vision models.
2. **Two-stage hierarchical generation** using anchor regression for coarse structure and VQ-VAE-coded detail generation for fine geometry.
3. **Depth profile gating** that enables learned layer-wise specialization — backbone layers self-organize into structural vs. detail contributors.
4. **VQ-VAE codebook design** for discretized Gaussian parameter spaces, enabling efficient detail generation through codebook lookup rather than continuous regression.

---

## II. Related Work

### A. 3D Gaussian Splatting

Kerbl et al. [4] introduced 3D Gaussian Splatting as an alternative to Neural Radiance Fields (NeRF) [5] for novel view synthesis. 3DGS represents scenes as collections of anisotropic 3D Gaussians, each parameterized by position (μ ∈ ℝ³), covariance (Σ ∈ ℝ³ˣ³, decomposed as scale s ∈ ℝ³ and rotation q ∈ ℝ⁴), opacity (α ∈ [0,1]), and view-dependent color (spherical harmonics coefficients). The representation enables real-time α-blending rendering and gradient-based optimization.

Subsequent work has extended 3DGS to dynamic scenes [6], text-conditioned generation [7], and compression [8]. However, all existing approaches treat Gaussian generation as a standalone vision task, disconnected from language understanding.

### B. Multi-Modal Language Models

Multi-modal large language models (LLMs) typically follow an encode-fuse-decode paradigm: a pre-trained vision encoder (e.g., CLIP [9], SigLIP [10]) produces visual tokens that are projected into the language model's embedding space [1, 2, 11]. This approach preserves the language model's capabilities while adding visual understanding through architectural augmentation.

The limitation is fundamental: the vision encoder and language model operate in different representational spaces, connected only by a learned projection. The model does not truly *see* in the same space where it *thinks*.

### C. Mixture of Experts and Specialist Routing

Mixture of Experts (MoE) architectures [12, 13] route tokens to specialized sub-networks based on learned gating functions. Our NexusRouter extends this principle beyond token-level routing to *task-level* specialist dispatch — the entire output pathway changes based on the nature of the input, not just individual tokens.

### D. Vector Quantization in Generative Models

VQ-VAE [14] and VQ-GAN [15] demonstrated that discrete codebook representations can effectively capture complex data distributions. We adapt this approach for Gaussian parameter quantization, where a learned codebook captures common Gaussian configurations (scale-rotation-opacity-color tuples) that can be composed with continuous position offsets.

---

## III. Architecture

### A. System Overview

The Gaussian Specialist Head (GSH) operates within the WYRM cognitive kernel, a unified transformer backbone with specialist routing. The data flow is:

```
Input tokens → Embedding → N Transformer Blocks → Layer Outputs [L₁...Lₙ]
                                    ↓
                               NexusRouter
                              ↙     ↓     ↘
                      Language     Math     Gaussian
                        Head       Head     Specialist
```

The NexusRouter is a learned linear gate followed by softmax that distributes backbone hidden states to specialist heads based on input content. When spatial generation is required, the Gaussian Specialist receives the full stack of layer outputs — not just the final hidden state — enabling depth-profiled aggregation.

### B. Two-Stage Hierarchical Generation

We decompose 3D scene generation into coarse structure (anchors) and fine detail, mirroring the biological distinction between ventral ("what") and dorsal ("where") visual processing streams [16].

#### Stage 1: Anchor Generation

The Anchor Head generates K coarse Gaussians from a pooled backbone representation. Each anchor is a full Gaussian parameterization:

```
anchor_i = [μ_i ∈ ℝ³, s_i ∈ ℝ³, q_i ∈ ℝ⁴, α_i ∈ ℝ¹, c_i ∈ ℝ³]   (14 parameters)
```

where μ is position, s is log-scale, q is rotation quaternion, α is opacity, and c is DC color.

The anchor network is a 3-layer MLP with LayerNorm and GELU activation:

```
f_anchor(h_pool) = W₃ · GELU(LN(W₂ · GELU(LN(W₁ · h_pool))))
```

outputting K × 14 values reshaped to (B, K, 14). Per-component activations enforce physical constraints:

- **Position:** tanh scaling to scene bounds
- **Scale:** clamped to [min_scale, max_scale] in log-space
- **Rotation:** L2-normalized to unit quaternion
- **Opacity:** sigmoid to [0, 1]
- **Color:** sigmoid to [0, 1] (RGB)

The last layer is initialized with small weights (σ = 0.01) so anchors start near the origin — a form of curriculum that lets the model learn placement gradually.

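
To make these constraints concrete, the following is a minimal PyTorch sketch of an anchor head in this spirit. The class name, constructor arguments, and the ordering of the 14 output components are illustrative assumptions, not the WYRM implementation.

```python
import torch
import torch.nn as nn

class AnchorHead(nn.Module):
    """Illustrative anchor head: pooled features -> K coarse Gaussians (B, K, 14)."""

    def __init__(self, dim, num_anchors=32, hidden=256,
                 scene_scale=10.0, min_log_scale=-7.0, max_log_scale=2.0):
        super().__init__()
        self.num_anchors, self.scene_scale = num_anchors, scene_scale
        self.min_log_scale, self.max_log_scale = min_log_scale, max_log_scale
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.LayerNorm(hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.GELU(),
            nn.Linear(hidden, num_anchors * 14),
        )
        # Small init so anchors start near the origin, as described above.
        nn.init.normal_(self.mlp[-1].weight, std=0.01)
        nn.init.zeros_(self.mlp[-1].bias)

    def forward(self, h_pool):                                  # h_pool: (B, dim)
        raw = self.mlp(h_pool).view(-1, self.num_anchors, 14)
        mu = torch.tanh(raw[..., 0:3]) * self.scene_scale       # position within scene bounds
        log_scale = raw[..., 3:6].clamp(self.min_log_scale, self.max_log_scale)
        quat = nn.functional.normalize(raw[..., 6:10], dim=-1)  # unit quaternion
        alpha = torch.sigmoid(raw[..., 10:11])                  # opacity in [0, 1]
        color = torch.sigmoid(raw[..., 11:14])                  # RGB in [0, 1]
        return torch.cat([mu, log_scale, quat, alpha, color], dim=-1)
```
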
#### Stage 2: Detail Generation

The Detail Head generates M fine Gaussians per anchor using VQ-coded representations. This stage uses cross-attention to condition detail generation on the full backbone feature sequence (a sketch combining the steps follows the list):

1. **Query Expansion:** Each anchor generates M query vectors via a learned projection:

   ```
   Q = reshape(f_expand(anchors), [B, K·M, D])
   ```

2. **Cross-Attention:** Detail queries attend to backbone hidden states:

   ```
   F_detail = LayerNorm(CrossAttn(Q, H_backbone, H_backbone) + Q)
   ```

   This allows details to be informed by the full input context — a detail Gaussian near a described object can attend to the tokens describing that object.

3. **VQ Index Prediction:** A projection predicts codebook indices:

   ```
   logits_vq = f_vq(F_detail) ∈ ℝ^{B × K·M × |C|}
   ```

   where |C| is the codebook size. During inference, argmax selects the codebook entry.

4. **Position Offset Prediction:** A separate projection predicts continuous position offsets relative to the parent anchor:

   ```
   Δμ = tanh(f_offset(F_detail)) × 0.5
   ```

   The scaling factor (0.5) constrains details to remain close to their anchor, maintaining hierarchical structure.

5. **Decoding:** VQ indices are decoded through a frozen VQ-VAE decoder to recover full Gaussian parameters (scale, rotation, opacity, color), which are combined with world-space positions computed as anchor position + offset.

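
The sketch below strings steps 1 through 4 together for illustration. The class name, layer names, and the use of `nn.MultiheadAttention` for the cross-attention are assumptions about one plausible realization, not the WYRM source; decoding through the frozen VQ-VAE (step 5) is omitted.

```python
import torch
import torch.nn as nn

class DetailHead(nn.Module):
    """Illustrative detail stage: anchors + backbone features -> VQ logits and positions."""

    def __init__(self, dim, details_per_anchor=8, codebook_size=512, heads=4):
        super().__init__()
        self.m = details_per_anchor
        self.expand = nn.Linear(14, self.m * dim)                 # step 1: anchor -> M queries
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.to_vq = nn.Linear(dim, codebook_size)                # step 3: codebook logits
        self.to_offset = nn.Linear(dim, 3)                        # step 4: position offsets

    def forward(self, anchors, h_backbone):
        # anchors: (B, K, 14); h_backbone: (B, T, dim)
        B, K, _ = anchors.shape
        q = self.expand(anchors).view(B, K * self.m, -1)
        attn_out, _ = self.cross_attn(q, h_backbone, h_backbone)  # step 2: cross-attention
        f_detail = self.norm(attn_out + q)                        # residual + LayerNorm
        logits_vq = self.to_vq(f_detail)                          # (B, K*M, |C|)
        offsets = torch.tanh(self.to_offset(f_detail)) * 0.5      # stay near the parent anchor
        anchor_pos = anchors[:, :, None, :3].expand(B, K, self.m, 3).reshape(B, K * self.m, 3)
        return logits_vq, anchor_pos + offsets
```
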
### C. Depth Profile Integration

A central innovation is the depth profile gating mechanism. Rather than using only the final backbone layer, the GSH maintains two learned gate vectors:

```
g_anchor ∈ ℝ^N, g_detail ∈ ℝ^N   (N = number of backbone layers)
```

These gates are passed through softmax to produce normalized weights:

```
w_anchor = softmax(g_anchor), w_detail = softmax(g_detail)
```

The depth-profiled representations are then:

```
H_anchor = Σᵢ w_anchor_i · Lᵢ, H_detail = Σᵢ w_detail_i · Lᵢ
```

This enables the model to learn that *different backbone layers serve different spatial functions*. We hypothesize — and the `get_depth_profile()` diagnostic is designed to verify — that:

- **Anchor gates** peak at deeper layers (capturing high-level semantic structure)
- **Detail gates** distribute across earlier layers (capturing fine-grained local features)

This mirrors cortical depth organization in biological vision, where early layers (V1) process edges and textures while later layers (IT cortex) process object-level structure [3].

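
A minimal sketch of this gating, including a `get_depth_profile()`-style diagnostic, assuming the gates are plain learnable vectors over a list of layer outputs (the class name and tensor shapes are illustrative):

```python
import torch
import torch.nn as nn

class DepthProfileGates(nn.Module):
    """Illustrative depth-profile gating over a list of backbone layer outputs."""

    def __init__(self, num_layers):
        super().__init__()
        self.g_anchor = nn.Parameter(torch.zeros(num_layers))
        self.g_detail = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_outputs):
        # layer_outputs: list of N tensors, each (B, T, dim)
        stack = torch.stack(layer_outputs, dim=0)                 # (N, B, T, dim)
        w_anchor = torch.softmax(self.g_anchor, dim=0)
        w_detail = torch.softmax(self.g_detail, dim=0)
        h_anchor = (w_anchor[:, None, None, None] * stack).sum(dim=0)
        h_detail = (w_detail[:, None, None, None] * stack).sum(dim=0)
        return h_anchor, h_detail

    def get_depth_profile(self):
        # Diagnostic: which layers dominate each stage (see Section V.B).
        return {
            "anchor_peak_layer": int(self.g_anchor.argmax()),
            "detail_peak_layer": int(self.g_detail.argmax()),
            "anchor_weights": torch.softmax(self.g_anchor, dim=0).tolist(),
            "detail_weights": torch.softmax(self.g_detail, dim=0).tolist(),
        }
```
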
### D. VQ-VAE Codebook Design

The VQ-VAE component (GaussianVQVAE) is pre-trained in Phase 1 and frozen during main training. It consists of:

1. **Encoder:** Maps raw Gaussian parameters → latent space
2. **Vector Quantizer:** Discretizes latents via nearest-neighbor lookup in a learnable codebook of size |C| with embedding dimension d_vq
3. **Decoder:** Reconstructs Gaussian parameters from quantized latents

The quantizer uses exponential moving average (EMA) updates for codebook stability:

```
n_i ← γ · n_i + (1 - γ) · |{z : argmin_j ||z - e_j|| = i}|
e_i ← γ · e_i + (1 - γ) · (1/n_i) Σ_{z→i} z
```

Commitment loss ensures encoder outputs stay close to codebook entries:

```
L_commit = β · ||sg[e_{q(z)}] - z_e||²
```

The codebook captures a *vocabulary of spatial primitives* — common Gaussian configurations (thin elongated splats for edges, round diffuse splats for surfaces, sharp high-opacity splats for boundaries). The detail head learns to compose these primitives via index selection and continuous position offsets.

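
For concreteness, a compact sketch of an EMA-updated quantizer with the commitment loss. The class name, buffer layout, and straight-through estimator are illustrative assumptions rather than the GaussianVQVAE source, and the per-batch bookkeeping is simplified.

```python
import torch
import torch.nn as nn

class EMAQuantizer(nn.Module):
    """Illustrative EMA vector quantizer with commitment loss (not the GaussianVQVAE source)."""

    def __init__(self, codebook_size=512, dim=64, gamma=0.99, beta=0.25):
        super().__init__()
        self.gamma, self.beta = gamma, beta
        self.register_buffer("codebook", torch.randn(codebook_size, dim))   # e_i
        self.register_buffer("counts", torch.ones(codebook_size))           # n_i

    def forward(self, z_e):                                   # z_e: (B, dim) encoder outputs
        idx = torch.cdist(z_e, self.codebook).argmin(dim=1)   # nearest-neighbor lookup
        z_q = self.codebook[idx]
        if self.training:
            with torch.no_grad():
                one_hot = nn.functional.one_hot(idx, self.codebook.shape[0]).float()
                n = one_hot.sum(dim=0)                        # batch usage per code
                self.counts.mul_(self.gamma).add_((1 - self.gamma) * n)
                sums = one_hot.t() @ z_e                      # sum of vectors assigned to each code
                self.codebook.mul_(self.gamma).add_(
                    (1 - self.gamma) * sums / self.counts.clamp(min=1e-3)[:, None])
        commit = self.beta * ((z_q.detach() - z_e) ** 2).mean()   # commitment loss
        z_q = z_e + (z_q - z_e).detach()                          # straight-through estimator
        return z_q, idx, commit
```
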
### E. NexusRouter Integration

The GSH connects to the backbone via the NexusRouter, a learned gating mechanism:

```python
import torch
import torch.nn as nn

class NexusRouter(nn.Module):
    def __init__(self, backbone_dim, num_specialists):
        super().__init__()
        self.gate = nn.Linear(backbone_dim, num_specialists)

    def forward(self, hidden):
        # Mean-pool over the sequence, then compute per-specialist weights.
        weights = torch.softmax(self.gate(hidden.mean(dim=1)), dim=-1)
        # The weights gate the specialist heads (dispatch itself is omitted here).
        return weights
```

The router's softmax output determines the weighting across specialists (language, math, cognition, spatial). The Gaussian specialist is activated when the input context implies spatial generation — a learned behavior, not a hard-coded rule.

### F. Configuration

The GaussianConfig dataclass specifies:

| Parameter | Default | Description |
|-----------|---------|-------------|
| `spatial_dim` | 3 | Spatial dimensions |
| `num_anchors` | 32 | Coarse Gaussians per scene |
| `details_per_anchor` | 8 | Fine Gaussians per anchor |
| `codebook_size` | 512 | VQ-VAE vocabulary size |
| `anchor_hidden` | 256 | Anchor MLP hidden dimension |
| `detail_hidden` | 256 | Detail MLP hidden dimension |
| `cross_attn_heads` | 4 | Cross-attention heads |
| `scene_scale` | 10.0 | Position activation scale |
| `min_gaussian_scale` | -7.0 | Minimum log-scale |
| `max_gaussian_scale` | 2.0 | Maximum log-scale |

Total Gaussians per scene: K + K·M = 32 + 32·8 = 288 — sufficient for basic scenes while remaining computationally tractable.

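
For readers who prefer the table in code form, a dataclass sketch with the fields and defaults listed above; the `total_gaussians` helper is an added convenience, not part of the documented interface.

```python
from dataclasses import dataclass

@dataclass
class GaussianConfig:
    spatial_dim: int = 3
    num_anchors: int = 32            # K
    details_per_anchor: int = 8      # M
    codebook_size: int = 512
    anchor_hidden: int = 256
    detail_hidden: int = 256
    cross_attn_heads: int = 4
    scene_scale: float = 10.0
    min_gaussian_scale: float = -7.0
    max_gaussian_scale: float = 2.0

    @property
    def total_gaussians(self) -> int:
        # K + K*M = 32 + 256 = 288 with the defaults above
        return self.num_anchors * (1 + self.details_per_anchor)
```
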
---

## IV. Training

### A. Loss Functions

The GSH training objective combines:

1. **VQ Codebook Loss:** Cross-entropy on VQ index predictions against ground-truth quantized codes:

   ```
   L_vq = CE(logits_vq, indices_gt)
   ```

2. **Position Loss:** L1 loss on predicted positions vs. ground truth:

   ```
   L_pos = ||μ_pred - μ_gt||₁
   ```

3. **Reconstruction Loss:** Chamfer distance between predicted and ground-truth Gaussian parameters.

4. **Rendering Loss:** When differentiable rendering is available, L1 + SSIM on rendered vs. ground-truth images (following [4]).

These losses are weighted by the CognitionLossManager's curriculum phase system, which controls the contribution of each specialist head across training.

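
As an illustration, the sketch below combines the first two terms under externally supplied weights. The function name and the `weights` dictionary are assumptions; the actual CognitionLossManager schedule and the Chamfer and rendering terms are not reproduced here.

```python
import torch.nn.functional as F

def gaussian_head_loss(logits_vq, indices_gt, mu_pred, mu_gt, weights):
    """Weighted sum of the VQ codebook and position terms (illustrative sketch).

    logits_vq: (B, KM, |C|); indices_gt: (B, KM) long; mu_pred, mu_gt: (B, KM, 3);
    weights: dict of scalars supplied by the curriculum phase.
    """
    l_vq = F.cross_entropy(logits_vq.flatten(0, 1), indices_gt.flatten())  # L_vq
    l_pos = F.l1_loss(mu_pred, mu_gt)                                       # L_pos
    return weights["vq"] * l_vq + weights["pos"] * l_pos
```
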
### B. Curriculum Integration

The WYRM kernel trains in four curriculum phases:

1. **Foundation** (steps 0–S₁): Language only. GSH receives zero gradient.
2. **Reasoning** (steps S₁–S₂): Math and cognition activated. GSH begins warming.
3. **Depth** (steps S₂–S₃): Full specialist activation. GSH trains on spatial data.
4. **Omega** (steps S₃–end): Joint optimization across all specialists.

This phased approach prevents the spatial head from disrupting early language learning while ensuring it can leverage mature backbone representations.

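
A minimal sketch of the phase schedule as a step-to-phase mapping; the boundaries S₁–S₃ are left as parameters because the paper does not fix them.

```python
def curriculum_phase(step, s1, s2, s3):
    """Map a training step to its curriculum phase (S1..S3 boundaries are placeholders)."""
    if step < s1:
        return "foundation"   # language only; GSH receives zero gradient
    if step < s2:
        return "reasoning"    # math and cognition activate; GSH begins warming
    if step < s3:
        return "depth"        # full specialist activation; GSH trains on spatial data
    return "omega"            # joint optimization across all specialists
```
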
### C. Data Pipeline

The `data.py` module provides a unified data pipeline supporting:

- **ShapeNet** [17]: Synthetic 3D objects with point clouds + normals + class labels
- **ScanNet** [18]: Real-world RGB-D scans with Gaussian-fitted representations
- **Synthetic generation:** Procedural scenes (spheres, planes, random distributions) with configurable density, scale variance, and shape mixtures
- **Raw mesh support:** OBJ and PLY parsers with vertex normal computation and Poisson disk sampling

The `AdaptiveBatchCollator` handles variable-size point clouds via padding with configurable max points, ensuring efficient batching during training.

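
A padding collator along these lines might look like the following sketch; the function signature and the `max_points` default are assumptions, not the actual AdaptiveBatchCollator interface.

```python
import torch

def collate_point_clouds(batch, max_points=4096):
    """Pad variable-size point clouds into a single batch (illustrative sketch).

    batch: list of (N_i, C) float tensors; returns (B, max_points, C) points and a
    boolean mask that is True where entries are real points rather than padding.
    """
    padded, masks = [], []
    for points in batch:
        n = min(points.shape[0], max_points)
        pad = torch.zeros(max_points, points.shape[1])
        pad[:n] = points[:n]
        padded.append(pad)
        masks.append(torch.arange(max_points) < n)
    return torch.stack(padded), torch.stack(masks)
```
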
---

## V. Analysis

### A. Parameter Efficiency

The GSH adds minimal parameters relative to the backbone:

| Component | Parameters (backbone_dim=640) |
|-----------|-------------------------------|
| Anchor Head | ~172K |
| Detail Head (excl. VQ) | ~1.2M |
| Pool Projection | ~410K |
| Depth Profile Gates | 2N (e.g., 28 for 14 layers) |
| **Total (trainable)** | **~1.8M** |

For a 170M-parameter backbone, the GSH represents approximately **1.1% additional parameters** — a negligible overhead for full 3D generation capability. This validates the "surgical I/O head swap" approach: spatial understanding is added for roughly 0.2–1.5% in new parameters, built on the backbone's existing representations.

### B. Depth Profile Interpretability

The learned depth profile gates provide a diagnostic window into how the backbone self-organizes for spatial tasks. The `get_depth_profile()` method returns:

- `anchor_peak_layer`: Which layer contributes most to coarse structure
- `detail_peak_layer`: Which layer contributes most to fine detail
- Full weight distributions for both stages

We expect to observe a *depth separation* — anchors drawing from later (more semantic) layers, details from earlier (more local) layers. This separation, if confirmed empirically, would provide evidence for emergent cortical-like organization in transformer architectures.

### C. Comparison with Existing Approaches

| Approach | Architecture | Integration | Spatial Params |
|----------|-------------|-------------|----------------|
| 3DGS [4] | Standalone optimization | None (separate system) | All |
| LLaVA-3D [proposed] | Separate 3D encoder + LLM | Projection layer | Encoder + projection |
| Point-E [19] | Separate diffusion model | None | Full diffusion model |
| **GSH (ours)** | **Specialist head on shared backbone** | **NexusRouter dispatch** | **~1.1% of backbone** |

The key differentiator is that GSH does not require a separate model for spatial understanding. The backbone that understands "a red sphere on top of a blue cube" is the same backbone that generates the corresponding Gaussian splat scene.

---

## VI. Discussion

### A. "There Is No Such Thing as Multi-Modal"

The Gaussian Specialist Head operationalizes a philosophical position: modalities are an engineering artifact, not a computational necessity. A biological cell does not have a "text mode" — it responds to stimuli through a unified substrate, with specialization emerging from network position and connectivity rather than architectural separation.

The GSH demonstrates this principle concretely. The backbone processes all inputs through the same transformer blocks. The NexusRouter determines *which output pathway* to activate based on learned gating — not which *input pathway* to use. The specialist heads are limbs on a single body, not separate organisms.

### B. VQ-VAE as Spatial Vocabulary

The codebook serves as a learned spatial vocabulary — a discrete set of "visual morphemes" that can be composed to form complex scenes. Just as language models predict the next token from a discrete vocabulary, the detail head predicts the next spatial primitive from a codebook of Gaussian configurations.

This parallel is not metaphorical. The same cross-attention mechanism that allows a language model to condition token prediction on context allows the detail head to condition primitive selection on backbone features. The architecture treats spatial generation as a *language with a different alphabet*.

### C. Limitations and Future Work

1. **Rendering supervision:** Full training requires differentiable Gaussian splatting rendering, which is computationally expensive. Chamfer distance on point clouds provides a proxy but lacks view-dependent supervision.

2. **Scene complexity:** With K=32 anchors and M=8 details per anchor, the maximum scene complexity is 288 Gaussians — adequate for simple objects but insufficient for complex environments. Future work should explore adaptive anchor/detail allocation based on scene complexity.

3. **Empirical validation:** This paper presents the architecture and its theoretical motivation. Full empirical evaluation with quantitative metrics (FID, PSNR, Chamfer distance) on standard benchmarks is ongoing within the WYRM v7 training campaign.

4. **Bidirectional spatial reasoning:** The current GSH generates scenes from language. The reverse — spatial understanding informing language generation — requires routing in both directions; the NexusRouter architecture supports this, but it has not yet been trained.

---

## VII. Conclusion

We have presented the Gaussian Specialist Head, an architecture that integrates 3D Gaussian Splatting as a specialist module within a unified cognitive kernel. The two-stage anchor/detail generation scheme, depth profile gating, and VQ-VAE codebook design enable a single transformer backbone to generate 3D spatial scenes with minimal parameter overhead (~1.1%). This work provides architectural evidence that multi-modal separation is unnecessary when specialist routing is available — supporting the position that structure, not modality, is the fundamental unit of perception.

The architecture is part of WYRM, a 410M-parameter cognitive kernel currently in training. Code is available at github.com/Artifact-Virtual/GLADIUS.

---

## References

[1] H. Liu, C. Li, Q. Wu, and Y. J. Lee, "Visual instruction tuning," in *Advances in Neural Information Processing Systems*, vol. 36, 2023.

[2] J. Bai et al., "Qwen-VL: A frontier large vision-language model with versatile abilities," *arXiv preprint arXiv:2308.12966*, 2023.

[3] D. Felleman and D. Van Essen, "Distributed hierarchical processing in the primate cerebral cortex," *Cerebral Cortex*, vol. 1, no. 1, pp. 1–47, 1991.

[4] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, "3D Gaussian splatting for real-time radiance field rendering," *ACM Transactions on Graphics*, vol. 42, no. 4, 2023.

[5] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, "NeRF: Representing scenes as neural radiance fields for view synthesis," in *ECCV*, 2020.

[6] Z. Yang, X. Gao, W. Zhou, S. Jiao, Y. Zhang, and X. Jin, "Deformable 3D Gaussians for high-fidelity monocular dynamic scene reconstruction," in *CVPR*, 2024.

[7] Y. Tang et al., "DreamGaussian: Generative Gaussian splatting for efficient 3D content creation," in *ICLR*, 2024.

[8] S. Niedermayr, J. Stumpfegger, and R. Westermann, "Compressed 3D Gaussian splatting for accelerated novel view synthesis," in *CVPR Workshop*, 2024.

[9] A. Radford et al., "Learning transferable visual models from natural language supervision," in *ICML*, 2021.

[10] X. Zhai et al., "Sigmoid loss for language image pre-training," in *ICCV*, 2023.

[11] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, "MiniGPT-4: Enhancing vision-language understanding with advanced large language models," *arXiv preprint arXiv:2304.10592*, 2023.

[12] N. Shazeer et al., "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer," in *ICLR*, 2017.

[13] W. Fedus, B. Zoph, and N. Shazeer, "Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity," *JMLR*, vol. 23, pp. 1–39, 2022.

[14] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, "Neural discrete representation learning," in *NeurIPS*, 2017.

[15] P. Esser, R. Rombach, and B. Ommer, "Taming transformers for high-resolution image synthesis," in *CVPR*, 2021.

[16] M. A. Goodale and A. D. Milner, "Separate visual pathways for perception and action," *Trends in Neurosciences*, vol. 15, no. 1, pp. 20–25, 1992.

[17] A. X. Chang et al., "ShapeNet: An information-rich 3D model repository," *arXiv preprint arXiv:1512.03012*, 2015.

[18] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, "ScanNet: Richly-annotated 3D reconstructions of indoor scenes," in *CVPR*, 2017.

[19] A. Nichol et al., "Point-E: A system for generating 3D point clouds from complex prompts," *arXiv preprint arXiv:2212.08751*, 2022.

---

*This paper is part of the Uranium Research Series by Artifact Virtual. Previous papers: I — GPU as Code, II — 1-Bit Intelligence, III — Progressive Expansion, IV — Layer-7 Gateway, V — Ghost Protocol, VI — Synthase Depth Attention, VII — PUP Uncertainty.*
|