---
license: apache-2.0
tags:
- geometric-deep-learning
- vae
- text-to-geometry
- rosetta-stone
- multimodal
- experimental
- research
base_model:
- AbstractPhil/grid-geometric-multishape
- google/flan-t5-small
- bert-base-uncased
- AbstractPhil/bert-beatrix-2048
datasets:
- AbstractPhil/synthetic-characters
---
# GeoVAE Proto: The Rosetta Stone Experiments

**Text carries geometric structure. This repo proves it.**

Three lightweight VAEs project text embeddings from different encoders into geometric patch space, and a pretrained geometric analyzer reads the text-derived patches *more clearly* than actual images. The geometric differentiation is encoder-agnostic: it lives in the language itself.

## The Hypothesis

If FLUX-generated images produce measurably differentiated geometric signatures across categories (lighting vs. jewelry vs. pose), and those images were *generated from text prompts*, then the text embeddings should contain enough structural information to produce the same geometric differentiation, without ever seeing an image.

## The Experiment

```
Text Prompt → [Encoder] → 512/768d embedding → TextVAE → (8, 16, 16) patches → Geometric Analyzer → gates + patch features
```

Three encoders were tested against the same pipeline (a pooling sketch is given below):

| Directory | Encoder | Dim | Pooling | Architecture |
|---|---|---|---|---|
| `text_vae/` | flan-t5-small | 512 | mean pool | encoder-decoder |
| `bert_vae/` | bert-base-uncased | 768 | [CLS] token | bidirectional MLM |
| `beatrix_vae/` | bert-beatrix-2048 | 768 | mean pool | nomic_bert + categorical tokens |

Each VAE has identical architecture: `encoder (text_dim → 1024 → 1024) → μ,σ (256d bottleneck) → decoder (256 → 1024 → 1024 → 2048) → reshape (8, 16, 16)`. Each is trained to reconstruct patches at the same scale as the adapted FLUX VAE latents from the earlier Image VAE experiments. ~4.5M parameters each.
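
Upstream of the VAE, each encoder's variable-length token sequence is collapsed to a single vector, per the pooling column in the table above. A minimal sketch of that step for the two public encoders, assuming the standard `transformers` interfaces (the exact preprocessing in each `train.py` may differ):

```python
import torch
from transformers import AutoModel, AutoTokenizer, T5EncoderModel

prompts = ["a character with ornate jewelry under warm rim lighting"]

# flan-t5-small: encoder-only forward pass, mean pool over non-padding tokens (512d)
t5_tok = AutoTokenizer.from_pretrained("google/flan-t5-small")
t5 = T5EncoderModel.from_pretrained("google/flan-t5-small").eval()
enc = t5_tok(prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = t5(**enc).last_hidden_state               # (B, L, 512)
mask = enc["attention_mask"].unsqueeze(-1).float()      # (B, L, 1)
t5_emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (B, 512) mean pool

# bert-base-uncased: take the [CLS] token (768d)
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()
enc = bert_tok(prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    bert_emb = bert(**enc).last_hidden_state[:, 0]      # (B, 768) [CLS] embedding

# bert-beatrix-2048 (nomic_bert) is mean pooled the same way as T5, at 768d.
```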
The geometric analyzer is a pretrained `SuperpositionPatchClassifier` from [AbstractPhil/grid-geometric-multishape](https://huggingface.co/AbstractPhil/grid-geometric-multishape) (epoch 200), frozen during evaluation. It extracts gate vectors (64×17 explicit geometric properties) and patch features (64×256 learned representations) from any (8, 16, 16) input.

Dataset: 49,286 images from [AbstractPhil/synthetic-characters](https://huggingface.co/datasets/AbstractPhil/synthetic-characters) (schnell_full_1_512), 15 generator_type categories.
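
A quick way to inspect the category balance, assuming `generator_type` is exposed as a dataset column and that `schnell_full_1_512` names a config (both are assumptions about the dataset layout):

```python
from collections import Counter

from datasets import load_dataset

# Config/split names are assumptions; adjust to the dataset's actual layout.
ds = load_dataset("AbstractPhil/synthetic-characters", "schnell_full_1_512", split="train")
counts = Counter(ds["generator_type"])
print(len(ds), "samples across", len(counts), "generator_type categories")
```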
## Results

### Overall Discriminability (within-category similarity minus weighted between-category similarity)

| Representation | Image Path (49k) | T5 (512d) | BERT (768d) | Beatrix (768d) |
|---|---|---|---|---|
| **patch_feat** | +0.0198 | +0.0526 | **+0.0534** | +0.0502 |
| **gate_vectors** | +0.0090 | +0.0311 | **+0.0319** | +0.0302 |
| **global_feat** | +0.0084 | +0.0228 | **+0.0219** | +0.0214 |

**All three text paths produce 2.5–3.5× stronger geometric differentiation than the image path.** All three encoders converge to within ±5% of each other.
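
A minimal sketch of the discriminability score as defined in the heading above, using cosine similarity between flattened representations; the exact between-category weighting used in the repo's analysis is an assumption here:

```python
import torch
import torch.nn.functional as F

def discriminability(features: torch.Tensor, labels: torch.Tensor) -> float:
    """Mean within-category cosine similarity minus mean between-category
    similarity (between-pairs are implicitly weighted by category sizes)."""
    feats = F.normalize(features.flatten(start_dim=1), dim=-1)  # (N, D) unit vectors
    sim = feats @ feats.T                                       # (N, N) cosine similarities
    same = labels[:, None] == labels[None, :]                   # (N, N) same-category mask
    self_pairs = torch.eye(len(labels), dtype=torch.bool)
    within = sim[same & ~self_pairs].mean()
    between = sim[~same].mean()
    return (within - between).item()

# e.g. the patch_feat row: score = discriminability(patch_feats, category_ids)
```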
### Per-Category Discriminability (patch_feat)

| Category | Image | T5 | BERT | Beatrix |
|---|---|---|---|---|
| character_with_lighting | +0.051 | **+0.145** | +0.093 | +0.069 |
| action_scene | +0.020 | +0.123 | **+0.126** | +0.060 |
| character_with_jewelry | +0.048 | +0.072 | +0.107 | **+0.121** |
| character_with_expression | +0.041 | **+0.092** | +0.066 | +0.088 |
| character_in_scene | +0.014 | +0.081 | +0.062 | **+0.089** |
| character_full_outfit | +0.025 | +0.080 | **+0.088** | +0.054 |
| character_with_pose | +0.001 | +0.007 | -0.008 | +0.007 |

Category ranking is preserved across all paths. Lighting and jewelry always differentiate well; pose never does. The geometric hierarchy is stable.
## Key Findings

1. **Text-derived patches are geometrically cleaner than image-derived patches.** Language is already an abstraction: it carries structural intent without per-pixel noise. The geometric analyzer reads intent more clearly than observation.

2. **The bridge is encoder-agnostic.** Three architecturally different encoders (encoder-decoder T5, bidirectional BERT, categorical nomic_bert) produce the same discriminability through a 256d bottleneck. The geometric structure is in the text, not the encoder.

3. **Categorical pretraining doesn't help overall.** Beatrix, trained on 2B+ samples with explicit `<lighting>`, `<jewelry>`, `<pose>` tokens, matches generic BERT/T5 within ±5%. It wins on fine-grained object categories (jewelry +0.121 vs BERT +0.107) but loses on scene-level properties (lighting +0.069 vs T5 +0.145).

4. **The 256d bottleneck is the normalizer.** It strips encoder-specific representational choices and preserves only the geometric signal that all encoders agree on.
## Architecture

Each VAE (~4.5M params):

```
Encoder:    text_dim → Linear(1024) → LN → GELU → Dropout
                     → Linear(1024) → LN → GELU → Dropout
            1024 → μ (256d)
            1024 → log_var (256d)

Bottleneck: z = μ + ε·σ   (training)
            z = μ         (inference)

Decoder:    256 → Linear(1024) → LN → GELU → Dropout
                → Linear(1024) → LN → GELU → Dropout
                → Linear(2048)
            reshape → (8, 16, 16)
```

Training: MSE reconstruction + KL divergence (weight 1e-4), AdamW 1e-3, cosine schedule, 50 epochs, batch 512.
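
A minimal PyTorch sketch of the blocks and loss above; the class name, dropout rate, and optimizer wiring are illustrative assumptions, not the repo's `model.py`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def block(in_dim: int, out_dim: int, p: float = 0.1) -> nn.Sequential:
    # Linear → LayerNorm → GELU → Dropout, as in the diagram above
    return nn.Sequential(nn.Linear(in_dim, out_dim), nn.LayerNorm(out_dim), nn.GELU(), nn.Dropout(p))

class GeoTextVAE(nn.Module):
    def __init__(self, text_dim: int = 512, hidden: int = 1024, latent: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(block(text_dim, hidden), block(hidden, hidden))
        self.to_mu = nn.Linear(hidden, latent)
        self.to_log_var = nn.Linear(hidden, latent)
        self.decoder = nn.Sequential(block(latent, hidden), block(hidden, hidden), nn.Linear(hidden, 8 * 16 * 16))

    def forward(self, text_emb: torch.Tensor):
        h = self.encoder(text_emb)
        mu, log_var = self.to_mu(h), self.to_log_var(h)
        # Reparameterize while training; use the mean at inference
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp() if self.training else mu
        patches = self.decoder(z).view(-1, 8, 16, 16)
        return patches, mu, log_var

def vae_loss(patches, target, mu, log_var, kl_weight: float = 1e-4):
    recon = F.mse_loss(patches, target)                              # MSE reconstruction
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())  # KL to N(0, I)
    return recon + kl_weight * kl

# Optimizer and schedule from the recipe above
vae = GeoTextVAE(text_dim=512)
opt = torch.optim.AdamW(vae.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=50)
```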
## Usage

```python
import torch

from model import TextVAE  # or BertVAE, BeatrixVAE

# Load trained VAE
vae = TextVAE(text_dim=512)  # 768 for BERT/Beatrix
ckpt = torch.load("best_model.pt")
vae.load_state_dict(ckpt["model_state_dict"])

# Text → geometric patches
text_embedding = your_encoder(prompt)          # (B, 512/768)
patches = vae.generate_latent(text_embedding)  # (B, 8, 16, 16)

# Feed to geometric analyzer
geo_output = geometric_model(patches)
gates = geo_output["local_dim_logits"]   # geometric properties
features = geo_output["patch_features"]  # learned representations
```
## Implications

Geometric structure is a shared language that text and images both speak natively. This repo demonstrates the decoder: a lightweight VAE that translates any text encoder's output into the geometric alphabet. The next step is using this bridge for conditionable geometric descriptors in diffusion processes: text descriptions that steer generation through geometric constraints rather than CLIP alignment.
## File Structure

```
geovae-proto/
├── text_vae/        # flan-t5-small (512d)
│   ├── model.py     # TextVAE architecture
│   ├── train.py     # Extract + train + analyze
│   └── push.py      # Upload to HF
├── bert_vae/        # bert-base-uncased (768d)
│   ├── model.py
│   ├── train.py
│   └── push.py
└── beatrix_vae/     # bert-beatrix-2048 (768d)
    ├── model.py
    ├── train.py
    └── push.py
```
## Citation

Part of the geometric deep learning research by [AbstractPhil](https://huggingface.co/AbstractPhil). Built on the geometric analyzer from [grid-geometric-multishape](https://huggingface.co/AbstractPhil/grid-geometric-multishape) and the [synthetic-characters](https://huggingface.co/datasets/AbstractPhil/synthetic-characters) dataset.