---
license: apache-2.0
tags:
- geometric-deep-learning
- vae
- text-to-geometry
- rosetta-stone
- multimodal
- experimental
- research
base_model:
- AbstractPhil/grid-geometric-multishape
- google/flan-t5-small
- bert-base-uncased
- AbstractPhil/bert-beatrix-2048
datasets:
- AbstractPhil/synthetic-characters
---

# GeoVAE Proto – The Rosetta Stone Experiments

**Text carries geometric structure. This repo proves it.**

Three lightweight VAEs project text embeddings from different encoders into geometric patch space, and a pretrained geometric analyzer reads the text-derived patches *more clearly* than actual images. The geometric differentiation is encoder-agnostic: it lives in the language itself.

## The Hypothesis

If FLUX-generated images produce measurably differentiated geometric signatures across categories (lighting vs. jewelry vs. pose), and those images were *generated from text prompts*, then the text embeddings should contain enough structural information to produce the same geometric differentiation, without ever seeing an image.

## The Experiment

```
Text Prompt → [Encoder] → 512/768d embedding → TextVAE → (8, 16, 16) patches → Geometric Analyzer → gates + patch features
```

Three encoders tested against the same pipeline:

| Directory | Encoder | Dim | Pooling | Architecture |
|---|---|---|---|---|
| `text_vae/` | flan-t5-small | 512 | mean pool | encoder-decoder |
| `bert_vae/` | bert-base-uncased | 768 | [CLS] token | bidirectional MLM |
| `beatrix_vae/` | bert-beatrix-2048 | 768 | mean pool | nomic_bert + categorical tokens |

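The two pooling strategies in the table reduce token-level encoder outputs `(B, T, D)` to a single vector. A minimal sketch with plain tensors (function names are illustrative, not from this repo):

```python
import torch

def mean_pool(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mask-aware mean over valid tokens: (B, T, D) + (B, T) -> (B, D).
    Mean pooling in this spirit is used for the T5 and Beatrix paths."""
    m = mask.unsqueeze(-1).to(hidden.dtype)
    return (hidden * m).sum(dim=1) / m.sum(dim=1).clamp(min=1e-9)

def cls_pool(hidden: torch.Tensor) -> torch.Tensor:
    """BERT-style pooling: hidden state of the first ([CLS]) token."""
    return hidden[:, 0]
```

Padding tokens are excluded from the mean via the attention mask, which matters when prompts in a batch differ in length.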
Each VAE has an identical architecture: `encoder (text_dim → 1024 → 1024) → μ, σ (256d bottleneck) → decoder (256 → 1024 → 1024 → 2048) → reshape (8, 16, 16)`. Trained to reconstruct adapted FLUX VAE latents from paired prompts. ~4.5M parameters each.

The geometric analyzer is a pretrained `SuperpositionPatchClassifier` from [AbstractPhil/grid-geometric-multishape](https://huggingface.co/AbstractPhil/grid-geometric-multishape) (epoch 200), frozen during evaluation. It extracts gate vectors (64×17 explicit geometric properties) and patch features (64×256 learned representations) from any (8, 16, 16) input.

Dataset: 49,286 images from [AbstractPhil/synthetic-characters](https://huggingface.co/datasets/AbstractPhil/synthetic-characters) (schnell_full_1_512), 15 generator_type categories.

## Results

### Overall Discriminability (within-category similarity − weighted between-category similarity)

| Representation | Image Path (49k) | T5 (512d) | BERT (768d) | Beatrix (768d) |
|---|---|---|---|---|
| **patch_feat** | +0.0198 | +0.0526 | **+0.0534** | +0.0502 |
| **gate_vectors** | +0.0090 | +0.0311 | **+0.0319** | +0.0302 |
| **global_feat** | +0.0084 | +0.0228 | **+0.0219** | +0.0214 |

**All three text paths produce 2.5–3.5× stronger geometric differentiation than the image path.** The three encoders converge to within ±5% of each other.

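The discriminability score can be sketched as follows. This is a hypothetical re-implementation assuming pairwise cosine similarity, and it uses a plain (unweighted) between-category mean where the original analysis weights by category size:

```python
import numpy as np

def discriminability(features: np.ndarray, labels) -> float:
    """Mean within-category cosine similarity minus mean
    between-category cosine similarity (self-pairs excluded)."""
    X = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = X @ X.T                      # pairwise cosine similarities
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    not_self = ~np.eye(len(labels), dtype=bool)
    within = sims[same & not_self].mean()
    between = sims[~same].mean()
    return float(within - between)
```

A positive score means samples resemble their own category more than other categories; the tables report this quantity per representation.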
### Per-Category Discriminability (patch_feat)

| Category | Image | T5 | BERT | Beatrix |
|---|---|---|---|---|
| character_with_lighting | +0.051 | **+0.145** | +0.093 | +0.069 |
| action_scene | +0.020 | +0.123 | **+0.126** | +0.060 |
| character_with_jewelry | +0.048 | +0.072 | +0.107 | **+0.121** |
| character_with_expression | +0.041 | **+0.092** | +0.066 | +0.088 |
| character_in_scene | +0.014 | +0.081 | +0.062 | **+0.089** |
| character_full_outfit | +0.025 | +0.080 | **+0.088** | +0.054 |
| character_with_pose | +0.001 | +0.007 | -0.008 | +0.007 |

Category ranking is preserved across all paths. Lighting and jewelry always differentiate well; pose never does. The geometric hierarchy is stable.

## Key Findings

1. **Text-derived patches are geometrically cleaner than image-derived patches.** Language is already an abstraction: it carries structural intent without per-pixel noise. The geometric analyzer reads intent more clearly than observation.

2. **The bridge is encoder-agnostic.** Three architecturally different encoders (encoder-decoder T5, bidirectional BERT, categorical nomic_bert) produce the same discriminability through a 256d bottleneck. The geometric structure is in the text, not the encoder.

3. **Categorical pretraining doesn't help overall.** Beatrix, trained on 2B+ samples with explicit `<lighting>`, `<jewelry>`, `<pose>` tokens, matches generic BERT/T5 within ±5%. It wins on fine-grained object categories (jewelry +0.121 vs. BERT +0.107) but loses on scene-level properties (lighting +0.069 vs. T5 +0.145).

4. **The 256d bottleneck is the normalizer.** It strips encoder-specific representational choices and preserves only the geometric signal that all encoders agree on.

## Architecture

Each VAE (~4.5M params):

```
Encoder:    text_dim → Linear(1024) → LN → GELU → Dropout
                     → Linear(1024) → LN → GELU → Dropout
            1024 → μ (256d)
            1024 → log_var (256d)

Bottleneck: z = μ + ε·σ   (training)
            z = μ         (inference)

Decoder:    256 → Linear(1024) → LN → GELU → Dropout
                → Linear(1024) → LN → GELU → Dropout
                → Linear(2048)
            reshape → (8, 16, 16)
```

Training: MSE reconstruction + KL divergence (weight 1e-4), AdamW 1e-3, cosine schedule, 50 epochs, batch 512.

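A minimal PyTorch sketch of this architecture and training objective. The names `TextVAESketch` and `vae_loss` are illustrative; the repo's `model.py` is the reference implementation:

```python
import torch
import torch.nn as nn

def mlp(d_in: int, d_out: int, p: float = 0.1) -> nn.Sequential:
    """Linear -> LayerNorm -> GELU -> Dropout block, as in the diagram."""
    return nn.Sequential(nn.Linear(d_in, d_out), nn.LayerNorm(d_out),
                         nn.GELU(), nn.Dropout(p))

class TextVAESketch(nn.Module):
    """text_dim -> 1024 -> 1024 -> mu/log_var (256) -> 1024 -> 1024 -> 2048 -> (8, 16, 16)."""
    def __init__(self, text_dim: int = 512, hidden: int = 1024, z_dim: int = 256):
        super().__init__()
        self.enc = nn.Sequential(mlp(text_dim, hidden), mlp(hidden, hidden))
        self.mu = nn.Linear(hidden, z_dim)
        self.log_var = nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(mlp(z_dim, hidden), mlp(hidden, hidden),
                                 nn.Linear(hidden, 8 * 16 * 16))

    def forward(self, x: torch.Tensor):
        h = self.enc(x)
        mu, log_var = self.mu(h), self.log_var(h)
        if self.training:  # reparameterization: z = mu + eps * sigma
            z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()
        else:              # deterministic at inference
            z = mu
        return self.dec(z).view(-1, 8, 16, 16), mu, log_var

def vae_loss(recon, target, mu, log_var, kl_weight: float = 1e-4):
    """MSE reconstruction + KL divergence, KL weight 1e-4 as described."""
    mse = nn.functional.mse_loss(recon, target)
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return mse + kl_weight * kl
```

This sketch lands near the stated parameter count; exact dropout placement and initialization are assumptions.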
## Usage

```python
import torch

from model import TextVAE  # or BertVAE, BeatrixVAE

# Load the trained VAE
vae = TextVAE(text_dim=512)  # 768 for BERT/Beatrix
ckpt = torch.load("best_model.pt", map_location="cpu")
vae.load_state_dict(ckpt["model_state_dict"])

# Text → geometric patches
text_embedding = your_encoder(prompt)          # (B, 512/768)
patches = vae.generate_latent(text_embedding)  # (B, 8, 16, 16)

# Feed to the geometric analyzer
geo_output = geometric_model(patches)
gates = geo_output["local_dim_logits"]   # geometric properties
features = geo_output["patch_features"]  # learned representations
```

## Implications

Geometric structure is a shared language that text and images both speak natively. This repo demonstrates the decoder: a lightweight VAE that translates any text encoder's output into the geometric alphabet. The next step is using this bridge for conditionable geometric descriptors in diffusion processes, i.e. text descriptions that steer generation through geometric constraints rather than CLIP alignment.

## File Structure

```
geovae-proto/
├── text_vae/         # flan-t5-small (512d)
│   ├── model.py      # TextVAE architecture
│   ├── train.py      # Extract + train + analyze
│   └── push.py       # Upload to HF
├── bert_vae/         # bert-base-uncased (768d)
│   ├── model.py
│   ├── train.py
│   └── push.py
└── beatrix_vae/      # bert-beatrix-2048 (768d)
    ├── model.py
    ├── train.py
    └── push.py
```

## Citation

Part of the geometric deep learning research by [AbstractPhil](https://huggingface.co/AbstractPhil). Built on the geometric analyzer from [grid-geometric-multishape](https://huggingface.co/AbstractPhil/grid-geometric-multishape) and the [synthetic-characters](https://huggingface.co/datasets/AbstractPhil/synthetic-characters) dataset.