---
license: apache-2.0
tags:
- geometric-deep-learning
- vae
- text-to-geometry
- rosetta-stone
- multimodal
- experimental
- research
base_model:
- AbstractPhil/grid-geometric-multishape
- google/flan-t5-small
- bert-base-uncased
- AbstractPhil/bert-beatrix-2048
datasets:
- AbstractPhil/synthetic-characters
---

# GeoVAE Proto — The Rosetta Stone Experiments

**Text carries geometric structure. This repo proves it.**

Three lightweight VAEs project text embeddings from different encoders into geometric patch space — and a pretrained geometric analyzer reads the text-derived patches *more clearly* than actual images. The geometric differentiation is encoder-agnostic: it lives in the language itself.

## The Hypothesis

If FLUX-generated images produce measurably differentiated geometric signatures across categories (lighting vs. jewelry vs. pose), and those images were *generated from text prompts*, then the text embeddings should contain enough structural information to produce the same geometric differentiation — without ever seeing an image.

## The Experiment

```
Text Prompt → [Encoder] → 512/768d embedding → TextVAE → (8, 16, 16) patches → Geometric Analyzer → gates + patch features
```

Three encoders were tested against the same pipeline:

| Directory | Encoder | Dim | Pooling | Architecture |
|---|---|---|---|---|
| `text_vae/` | flan-t5-small | 512 | mean pool | encoder-decoder |
| `bert_vae/` | bert-base-uncased | 768 | [CLS] token | bidirectional MLM |
| `beatrix_vae/` | bert-beatrix-2048 | 768 | mean pool | nomic_bert + categorical tokens |

Each VAE has an identical architecture: `encoder (text_dim → 1024 → 1024) → μ,σ (256d bottleneck) → decoder (256 → 1024 → 1024 → 2048) → reshape (8, 16, 16)`. Each is trained to reconstruct targets at the same scale as the adapted FLUX VAE latents from the earlier Image VAE experiments. ~4.5M parameters each.
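The shared architecture can be sketched in PyTorch roughly as follows — a minimal illustration of the dimensions described above, not the repo's actual `model.py` (layer names and dropout rate here are assumptions):

```python
import torch
import torch.nn as nn

class TextVAESketch(nn.Module):
    """Illustrative: text_dim → 1024 → 1024 → (μ, log σ²) 256d → 1024 → 1024 → 2048 → (8, 16, 16)."""

    def __init__(self, text_dim: int = 512, hidden: int = 1024, z_dim: int = 256, dropout: float = 0.1):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(text_dim, hidden), nn.LayerNorm(hidden), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.GELU(), nn.Dropout(dropout),
        )
        self.mu = nn.Linear(hidden, z_dim)
        self.log_var = nn.Linear(hidden, z_dim)
        self.decoder = nn.Sequential(
            nn.Linear(z_dim, hidden), nn.LayerNorm(hidden), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(hidden, 8 * 16 * 16),  # flat patches, reshaped below
        )

    def forward(self, x: torch.Tensor):
        h = self.encoder(x)
        mu, log_var = self.mu(h), self.log_var(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)  # reparameterization (training)
        return self.decoder(z).view(-1, 8, 16, 16), mu, log_var

    @torch.no_grad()
    def generate_latent(self, x: torch.Tensor) -> torch.Tensor:
        # Deterministic inference path: z = μ
        return self.decoder(self.mu(self.encoder(x))).view(-1, 8, 16, 16)
```

Swapping `text_dim=512` for 768 gives the BERT/Beatrix variants; everything downstream of the 256d bottleneck is identical.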
The geometric analyzer is a pretrained `SuperpositionPatchClassifier` from [AbstractPhil/grid-geometric-multishape](https://huggingface.co/AbstractPhil/grid-geometric-multishape) (epoch 200), frozen during evaluation. It extracts gate vectors (64×17 explicit geometric properties) and patch features (64×256 learned representations) from any (8, 16, 16) input.

Dataset: 49,286 images from [AbstractPhil/synthetic-characters](https://huggingface.co/datasets/AbstractPhil/synthetic-characters) (schnell_full_1_512), 15 `generator_type` categories.

## Results

### Overall Discriminability

Discriminability = within-category similarity − weighted between-category similarity. Bold marks the best text encoder per row.

| Representation | Image Path (49k) | T5 (512d) | BERT (768d) | Beatrix (768d) |
|---|---|---|---|---|
| **patch_feat** | +0.0198 | +0.0526 | **+0.0534** | +0.0502 |
| **gate_vectors** | +0.0090 | +0.0311 | **+0.0319** | +0.0302 |
| **global_feat** | +0.0084 | **+0.0228** | +0.0219 | +0.0214 |

**All three text paths produce 2.5–3.5× stronger geometric differentiation than the image path.** All three encoders converge to within ±5% of each other.

### Per-Category Discriminability (patch_feat)

| Category | Image | T5 | BERT | Beatrix |
|---|---|---|---|---|
| character_with_lighting | +0.051 | **+0.145** | +0.093 | +0.069 |
| action_scene | +0.020 | +0.123 | **+0.126** | +0.060 |
| character_with_jewelry | +0.048 | +0.072 | +0.107 | **+0.121** |
| character_with_expression | +0.041 | **+0.092** | +0.066 | +0.088 |
| character_in_scene | +0.014 | +0.081 | +0.062 | **+0.089** |
| character_full_outfit | +0.025 | +0.080 | **+0.088** | +0.054 |
| character_with_pose | +0.001 | +0.007 | -0.008 | +0.007 |

Category ranking is preserved across all paths. Lighting and jewelry always differentiate well; pose never does. The geometric hierarchy is stable.
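One plausible reading of the discriminability score in the tables above — mean within-category similarity minus (pair-count-weighted) mean between-category similarity over cosine similarities — can be sketched as follows. This is an assumption about the metric; the repo's analysis scripts may compute the weighting differently:

```python
import numpy as np

def discriminability(features: np.ndarray, labels: np.ndarray) -> float:
    """features: (N, D) flattened gate vectors or patch features; labels: (N,) category ids."""
    # L2-normalize so the Gram matrix is pairwise cosine similarity
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T                                   # (N, N)
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)                   # exclude self-similarity
    diff = labels[:, None] != labels[None, :]
    within = sim[same].mean()
    between = sim[diff].mean()                      # pair counts act as category-size weights
    return float(within - between)
```

A score near zero (as in the pose row) means the category's points are no more similar to each other than to everything else; a clearly positive score means the category forms a tight, separable cluster.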
The geometric analyzer reads intent more clearly than observation.

2. **The bridge is encoder-agnostic.** Three architecturally different encoders (encoder-decoder T5, bidirectional BERT, categorical nomic_bert) produce the same discriminability through a 256d bottleneck. The geometric structure is in the text, not the encoder.

3. **Categorical pretraining doesn't help overall.** Beatrix, trained on 2B+ samples with explicit categorical special tokens, matches generic BERT/T5 within ±5%. It wins on fine-grained object categories (jewelry +0.121 vs. BERT's +0.107) but loses on scene-level properties (lighting +0.069 vs. T5's +0.145).

4. **The 256d bottleneck is the normalizer.** It strips encoder-specific representational choices and preserves only the geometric signal that all encoders agree on.

## Architecture

Each VAE (~4.5M params):

```
Encoder:    text_dim → Linear(1024) → LN → GELU → Dropout
                     → Linear(1024) → LN → GELU → Dropout
            1024 → μ (256d)
            1024 → log_var (256d)

Bottleneck: z = μ + ε·σ   (training)
            z = μ         (inference)

Decoder:    256 → Linear(1024) → LN → GELU → Dropout
                → Linear(1024) → LN → GELU → Dropout
                → Linear(2048)
            reshape → (8, 16, 16)
```

Training: MSE reconstruction + KL divergence (weight 1e-4), AdamW at 1e-3, cosine schedule, 50 epochs, batch size 512.

## Usage

```python
import torch

from model import TextVAE  # or BertVAE, BeatrixVAE

# Load trained VAE
vae = TextVAE(text_dim=512)  # 768 for BERT/Beatrix
ckpt = torch.load("best_model.pt")
vae.load_state_dict(ckpt["model_state_dict"])

# Text → geometric patches
text_embedding = your_encoder(prompt)          # (B, 512/768)
patches = vae.generate_latent(text_embedding)  # (B, 8, 16, 16)

# Feed to geometric analyzer
geo_output = geometric_model(patches)
gates = geo_output["local_dim_logits"]    # geometric properties
features = geo_output["patch_features"]   # learned representations
```

## Implications

Geometric structure is a shared language that text and images both speak natively.
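For reference, the training recipe described above (MSE reconstruction plus KL divergence at weight 1e-4, optimized with AdamW) can be sketched as a single optimization step. The assumption that the VAE's forward pass returns `(reconstruction, μ, log σ²)` is illustrative, not taken from the repo's `train.py`:

```python
import torch

def vae_step(vae, optimizer, text_emb, target_patches, kl_weight=1e-4):
    """One β-VAE training step: MSE to the target patches plus weighted KL to N(0, I)."""
    recon, mu, log_var = vae(text_emb)  # recon: (B, 8, 16, 16)
    recon_loss = torch.nn.functional.mse_loss(recon, target_patches)
    # KL(q(z|x) || N(0, I)) for a diagonal Gaussian, averaged over batch and dims
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    loss = recon_loss + kl_weight * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The small KL weight keeps the 256d bottleneck informative — with a full-strength KL term, the posterior would collapse toward the prior and the geometric signal would be lost.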
This repo demonstrates the decoder — a lightweight VAE that translates any text encoder's output into the geometric alphabet. The next step is using this bridge for conditionable geometric descriptors in diffusion processes: text descriptions that steer generation through geometric constraints rather than CLIP alignment.

## File Structure

```
geovae-proto/
├── text_vae/        # flan-t5-small (512d)
│   ├── model.py     # TextVAE architecture
│   ├── train.py     # Extract + train + analyze
│   └── push.py      # Upload to HF
├── bert_vae/        # bert-base-uncased (768d)
│   ├── model.py
│   ├── train.py
│   └── push.py
└── beatrix_vae/     # bert-beatrix-2048 (768d)
    ├── model.py
    ├── train.py
    └── push.py
```

## Citation

Part of the geometric deep learning research by [AbstractPhil](https://huggingface.co/AbstractPhil). Built on the geometric analyzer from [grid-geometric-multishape](https://huggingface.co/AbstractPhil/grid-geometric-multishape) and the [synthetic-characters](https://huggingface.co/datasets/AbstractPhil/synthetic-characters) dataset.