---
license: apache-2.0
tags:
- geometric-deep-learning
- vae
- text-to-geometry
- rosetta-stone
- multimodal
- experimental
- research
base_model:
- AbstractPhil/grid-geometric-multishape
- google/flan-t5-small
- bert-base-uncased
- AbstractPhil/bert-beatrix-2048
datasets:
- AbstractPhil/synthetic-characters
---
# GeoVAE Proto β€” The Rosetta Stone Experiments
**Text carries geometric structure. This repo proves it.**
Three lightweight VAEs project text embeddings from different encoders into geometric patch space β€” and a pretrained geometric analyzer reads the text-derived patches *more clearly* than patches derived from the actual images. The geometric differentiation is encoder-agnostic: it lives in the language itself.
## The Hypothesis
If FLUX-generated images produce measurably differentiated geometric signatures across categories (lighting vs. jewelry vs. pose), and those images were *generated from text prompts*, then the text embeddings should contain enough structural information to produce the same geometric differentiation β€” without ever seeing an image.
## The Experiment
```
Text Prompt β†’ [Encoder] β†’ 512/768d embedding β†’ TextVAE β†’ (8, 16, 16) patches β†’ Geometric Analyzer β†’ gates + patch features
```
Three encoders tested against the same pipeline:
| Directory | Encoder | Dim | Pooling | Architecture |
|---|---|---|---|---|
| `text_vae/` | flan-t5-small | 512 | mean pool | encoder-decoder |
| `bert_vae/` | bert-base-uncased | 768 | [CLS] token | bidirectional MLM |
| `beatrix_vae/` | bert-beatrix-2048 | 768 | mean pool | nomic_bert + categorical tokens |
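The pooling column above maps onto the usual `transformers` calls. A minimal sketch follows; the function names are illustrative, and the actual extraction code lives in each directory's `train.py`:

```python
# Illustrative pooling helpers for the table above (not the repo's extraction script).
import torch
from transformers import AutoModel, AutoTokenizer, T5EncoderModel

@torch.no_grad()
def t5_mean_pool(prompts, device="cpu"):
    tok = AutoTokenizer.from_pretrained("google/flan-t5-small")
    enc = T5EncoderModel.from_pretrained("google/flan-t5-small").to(device).eval()
    batch = tok(prompts, padding=True, truncation=True, return_tensors="pt").to(device)
    hidden = enc(**batch).last_hidden_state                 # (B, L, 512)
    mask = batch.attention_mask.unsqueeze(-1).float()       # (B, L, 1)
    return (hidden * mask).sum(1) / mask.sum(1)             # masked mean pool β†’ (B, 512)

@torch.no_grad()
def bert_cls_pool(prompts, device="cpu"):
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    enc = AutoModel.from_pretrained("bert-base-uncased").to(device).eval()
    batch = tok(prompts, padding=True, truncation=True, return_tensors="pt").to(device)
    hidden = enc(**batch).last_hidden_state                 # (B, L, 768)
    return hidden[:, 0]                                     # [CLS] token β†’ (B, 768)
```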
Each VAE has an identical architecture: `encoder (text_dim β†’ 1024 β†’ 1024) β†’ ΞΌ,Οƒ (256d bottleneck) β†’ decoder (256 β†’ 1024 β†’ 1024 β†’ 2048) β†’ reshape (8, 16, 16)`. Each is trained to reconstruct targets at the same scale as the adapted FLUX VAE latents from the earlier Image VAE experiments. ~4.5M parameters each.
The geometric analyzer is a pretrained `SuperpositionPatchClassifier` from [AbstractPhil/grid-geometric-multishape](https://huggingface.co/AbstractPhil/grid-geometric-multishape) (epoch 200), frozen during evaluation. It extracts gate vectors (64Γ—17 explicit geometric properties) and patch features (64Γ—256 learned representations) from any (8, 16, 16) input.
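For orientation, a hedged sketch of loading and freezing the analyzer is below; the checkpoint filename and the `SuperpositionPatchClassifier` import path are placeholders, so defer to the upstream repo for the exact loading code.

```python
# Sketch only: filename and import path are hypothetical placeholders.
import torch
from huggingface_hub import hf_hub_download
from model import SuperpositionPatchClassifier  # defined in the grid-geometric-multishape repo

ckpt_path = hf_hub_download(
    "AbstractPhil/grid-geometric-multishape",
    "model_epoch_200.pt",  # hypothetical filename
)
geometric_model = SuperpositionPatchClassifier()
geometric_model.load_state_dict(torch.load(ckpt_path, map_location="cpu"))
geometric_model.eval()
for p in geometric_model.parameters():
    p.requires_grad_(False)  # frozen during evaluation
```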
Dataset: 49,286 images from [AbstractPhil/synthetic-characters](https://huggingface.co/datasets/AbstractPhil/synthetic-characters) (schnell_full_1_512), 15 generator_type categories.
## Results
### Overall Discriminability (within-category similarity βˆ’ weighted between-category similarity)
| Representation | Image Path (49k) | T5 (512d) | BERT (768d) | Beatrix (768d) |
|---|---|---|---|---|
| **patch_feat** | +0.0198 | +0.0526 | **+0.0534** | +0.0502 |
| **gate_vectors** | +0.0090 | +0.0311 | **+0.0319** | +0.0302 |
| **global_feat** | +0.0084 | +0.0228 | **+0.0219** | +0.0214 |
**All three text paths produce 2.5–3.5Γ— stronger geometric differentiation than the image path.** The three encoders converge to within Β±5% of one another.
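For reference, the score can be computed as sketched below, assuming cosine similarity and between-category terms weighted by category size (both assumptions; the exact weighting is not spelled out here):

```python
# A minimal sketch of the discriminability score under the stated assumptions.
import numpy as np

def discriminability(feats: np.ndarray, labels: np.ndarray) -> float:
    """feats: (N, D) per-sample features; labels: (N,) category ids."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)   # unit-normalize
    sims = feats @ feats.T                                         # cosine similarities
    within, between, weights = [], [], []
    for c in np.unique(labels):
        m = labels == c
        n = int(m.sum())
        s_in = sims[np.ix_(m, m)]
        within.append((s_in.sum() - n) / (n * (n - 1)))            # exclude self-similarity
        between.append(sims[np.ix_(m, ~m)].mean())
        weights.append(n)
    return float(np.mean(within) - np.average(between, weights=weights))
```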
### Per-Category Discriminability (patch_feat)
| Category | Image | T5 | BERT | Beatrix |
|---|---|---|---|---|
| character_with_lighting | +0.051 | **+0.145** | +0.093 | +0.069 |
| action_scene | +0.020 | +0.123 | **+0.126** | +0.060 |
| character_with_jewelry | +0.048 | +0.072 | +0.107 | **+0.121** |
| character_with_expression | +0.041 | **+0.092** | +0.066 | +0.088 |
| character_in_scene | +0.014 | +0.081 | +0.062 | **+0.089** |
| character_full_outfit | +0.025 | +0.080 | **+0.088** | +0.054 |
| character_with_pose | +0.001 | +0.007 | -0.008 | +0.007 |
Category ranking is preserved across all paths. Lighting and jewelry always differentiate well; pose never does. The geometric hierarchy is stable.
## Key Findings
1. **Text-derived patches are geometrically cleaner than image-derived patches.** Language is already an abstraction β€” it carries structural intent without per-pixel noise. The geometric analyzer reads intent more clearly than observation.
2. **The bridge is encoder-agnostic.** Three architecturally different encoders (encoder-decoder T5, bidirectional BERT, categorical nomic_bert) produce the same discriminability through a 256d bottleneck. The geometric structure is in the text, not the encoder.
3. **Categorical pretraining doesn't help overall.** Beatrix, trained on 2B+ samples with explicit `<lighting>`, `<jewelry>`, `<pose>` tokens, matches generic BERT/T5 within Β±5%. It wins on fine-grained object categories (jewelry +0.121 vs BERT +0.107) but loses on scene-level properties (lighting +0.069 vs T5 +0.145).
4. **The 256d bottleneck is the normalizer.** It strips encoder-specific representational choices and preserves only the geometric signal that all encoders agree on.
## Architecture
Each VAE (~4.5M params):
```
Encoder: text_dim β†’ Linear(1024) β†’ LN β†’ GELU β†’ Dropout
β†’ Linear(1024) β†’ LN β†’ GELU β†’ Dropout
1024 β†’ ΞΌ (256d)
1024 β†’ log_var (256d)
Bottleneck: z = ΞΌ + Ρ·σ (training)
z = ΞΌ (inference)
Decoder: 256 β†’ Linear(1024) β†’ LN β†’ GELU β†’ Dropout
β†’ Linear(1024) β†’ LN β†’ GELU β†’ Dropout
β†’ Linear(2048)
reshape β†’ (8, 16, 16)
```
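A minimal PyTorch sketch of that module, with layer names, the dropout rate, and the `generate_latent` helper chosen for illustration rather than copied from the repo code:

```python
# Illustrative re-implementation of the architecture diagram above.
import torch
import torch.nn as nn

class TextVAE(nn.Module):
    def __init__(self, text_dim=512, latent_dim=256, hidden=1024, dropout=0.1):
        super().__init__()
        def block(d_in, d_out):
            return nn.Sequential(nn.Linear(d_in, d_out), nn.LayerNorm(d_out),
                                 nn.GELU(), nn.Dropout(dropout))
        self.encoder = nn.Sequential(block(text_dim, hidden), block(hidden, hidden))
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(block(latent_dim, hidden), block(hidden, hidden),
                                     nn.Linear(hidden, 8 * 16 * 16))

    def encode(self, x):
        h = self.encoder(x)
        return self.to_mu(h), self.to_logvar(h)

    def decode(self, z):
        return self.decoder(z).view(-1, 8, 16, 16)

    def forward(self, x):
        mu, log_var = self.encode(x)
        if self.training:
            std = torch.exp(0.5 * log_var)
            z = mu + torch.randn_like(std) * std   # reparameterization: z = ΞΌ + Ρ·σ
        else:
            z = mu
        return self.decode(z), mu, log_var

    @torch.no_grad()
    def generate_latent(self, x):
        mu, _ = self.encode(x)                     # deterministic text β†’ patch mapping
        return self.decode(mu)
```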
Training: MSE reconstruction + KL divergence (weight 1e-4), AdamW 1e-3, cosine schedule, 50 epochs, batch 512.
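A condensed sketch of this recipe, assuming the hypothetical `TextVAE` above and a DataLoader yielding `(text_embedding, target_patch)` pairs at batch size 512:

```python
# Sketch of the training objective: MSE reconstruction + KL (weight 1e-4),
# AdamW at lr 1e-3, cosine schedule over 50 epochs.
import torch
import torch.nn.functional as F

def train(vae, loader, epochs=50, lr=1e-3, kl_weight=1e-4, device="cuda"):
    vae = vae.to(device).train()
    opt = torch.optim.AdamW(vae.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    for _ in range(epochs):
        for emb, target in loader:          # emb: (B, text_dim), target: (B, 8, 16, 16)
            emb, target = emb.to(device), target.to(device)
            recon, mu, log_var = vae(emb)
            recon_loss = F.mse_loss(recon, target)
            kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
            loss = recon_loss + kl_weight * kl
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()
```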
## Usage
```python
import torch
from model import TextVAE  # or BertVAE, BeatrixVAE

# Load a trained VAE
vae = TextVAE(text_dim=512)  # 768 for BERT/Beatrix
ckpt = torch.load("best_model.pt")
vae.load_state_dict(ckpt["model_state_dict"])
vae.eval()

# Text β†’ geometric patches
text_embedding = your_encoder(prompt)          # (B, 512/768)
patches = vae.generate_latent(text_embedding)  # (B, 8, 16, 16)

# Feed to the frozen geometric analyzer (SuperpositionPatchClassifier)
geo_output = geometric_model(patches)
gates = geo_output["local_dim_logits"]     # 64Γ—17 explicit geometric properties
features = geo_output["patch_features"]    # 64Γ—256 learned representations
```
## Implications
Geometric structure is a shared language that text and images both speak natively. This repo demonstrates the decoder β€” a lightweight VAE that translates any text encoder's output into the geometric alphabet. The next step is using this bridge for conditionable geometric descriptors in diffusion processes: text descriptions that steer generation through geometric constraints rather than CLIP alignment.
## File Structure
```
geovae-proto/
β”œβ”€β”€ text_vae/ # flan-t5-small (512d)
β”‚ β”œβ”€β”€ model.py # TextVAE architecture
β”‚ β”œβ”€β”€ train.py # Extract + train + analyze
β”‚ └── push.py # Upload to HF
β”œβ”€β”€ bert_vae/ # bert-base-uncased (768d)
β”‚ β”œβ”€β”€ model.py
β”‚ β”œβ”€β”€ train.py
β”‚ └── push.py
└── beatrix_vae/ # bert-beatrix-2048 (768d)
β”œβ”€β”€ model.py
β”œβ”€β”€ train.py
└── push.py
```
## Citation
Part of the geometric deep learning research by [AbstractPhil](https://huggingface.co/AbstractPhil). Built on the geometric analyzer from [grid-geometric-multishape](https://huggingface.co/AbstractPhil/grid-geometric-multishape) and the [synthetic-characters](https://huggingface.co/datasets/AbstractPhil/synthetic-characters) dataset.