---
license: apache-2.0
tags:
  - geometric-deep-learning
  - vae
  - text-to-geometry
  - rosetta-stone
  - multimodal
  - experimental
  - research
base_model:
  - AbstractPhil/grid-geometric-multishape
  - google/flan-t5-small
  - bert-base-uncased
  - AbstractPhil/bert-beatrix-2048
datasets:
  - AbstractPhil/synthetic-characters
---

# GeoVAE Proto β€” The Rosetta Stone Experiments

**Text carries geometric structure. This repo proves it.**

Three lightweight VAEs project text embeddings from different encoders into geometric patch space β€” and a pretrained geometric analyzer reads the text-derived patches *more clearly* than actual images. The geometric differentiation is encoder-agnostic: it lives in the language itself.

## The Hypothesis

If FLUX-generated images produce measurably differentiated geometric signatures across categories (lighting vs. jewelry vs. pose), and those images were *generated from text prompts*, then the text embeddings should contain enough structural information to produce the same geometric differentiation β€” without ever seeing an image.

## The Experiment

```
Text Prompt β†’ [Encoder] β†’ 512/768d embedding β†’ TextVAE β†’ (8, 16, 16) patches β†’ Geometric Analyzer β†’ gates + patch features
```

Three encoders tested against the same pipeline:

| Directory | Encoder | Dim | Pooling | Architecture |
|---|---|---|---|---|
| `text_vae/` | flan-t5-small | 512 | mean pool | encoder-decoder |
| `bert_vae/` | bert-base-uncased | 768 | [CLS] token | bidirectional MLM |
| `beatrix_vae/` | bert-beatrix-2048 | 768 | mean pool | nomic_bert + categorical tokens |
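The two pooling strategies in the table reduce a `(B, T, D)` sequence of token states to one vector per prompt. As tensor ops they can be sketched like this (a minimal sketch; the function names are mine, and mean pooling is assumed to be attention-mask-weighted):

```python
import torch

def cls_pool(hidden: torch.Tensor) -> torch.Tensor:
    """Take the first ([CLS]) token's hidden state: (B, T, D) -> (B, D)."""
    return hidden[:, 0]

def mean_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token states, ignoring padding: (B, T, D), (B, T) -> (B, D)."""
    mask = attention_mask.unsqueeze(-1).float()  # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
```

For flan-t5-small only the encoder stack is needed, so `hidden` would come from the T5 encoder rather than the full encoder-decoder forward pass.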

Each VAE has identical architecture: `encoder (text_dim β†’ 1024 β†’ 1024) β†’ ΞΌ,Οƒ (256d bottleneck) β†’ decoder (256 β†’ 1024 β†’ 1024 β†’ 2048) β†’ reshape (8, 16, 16)`. Each is trained to reconstruct targets at the same scale as the adapted FLUX VAE latents from the earlier Image VAE experiments. ~4.5M parameters each.

The geometric analyzer is a pretrained `SuperpositionPatchClassifier` from [AbstractPhil/grid-geometric-multishape](https://huggingface.co/AbstractPhil/grid-geometric-multishape) (epoch 200), frozen during evaluation. It extracts gate vectors (64Γ—17 explicit geometric properties) and patch features (64Γ—256 learned representations) from any (8, 16, 16) input.

Dataset: 49,286 images from [AbstractPhil/synthetic-characters](https://huggingface.co/datasets/AbstractPhil/synthetic-characters) (schnell_full_1_512), 15 generator_type categories.

## Results

### Overall Discriminability (within-category similarity βˆ’ weighted between-category similarity)

| Representation | Image Path (49k) | T5 (512d) | BERT (768d) | Beatrix (768d) |
|---|---|---|---|---|
| **patch_feat** | +0.0198 | +0.0526 | **+0.0534** | +0.0502 |
| **gate_vectors** | +0.0090 | +0.0311 | **+0.0319** | +0.0302 |
| **global_feat** | +0.0084 | +0.0228 | **+0.0219** | +0.0214 |

**All three text paths produce 2.5–3.5Γ— stronger geometric differentiation than the image path.** The three encoders agree within Β±5% of each other.
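The discriminability score in these tables can be sketched as mean cosine similarity within a category minus mean similarity across categories (a simplified, unweighted version; the repo's analysis scripts define the exact between-category weighting):

```python
import numpy as np

def discriminability(feats: np.ndarray, labels: np.ndarray) -> float:
    """Within-category minus between-category mean cosine similarity.

    feats: (N, D) feature vectors; labels: (N,) category ids.
    Note: unweighted between-category term -- an assumption, not the
    repo's exact (weighted) formula.
    """
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T                              # (N, N) cosine similarities
    same = labels[:, None] == labels[None, :]  # same-category pair mask
    off_diag = ~np.eye(len(f), dtype=bool)     # exclude self-similarity
    within = sim[same & off_diag].mean()
    between = sim[~same].mean()
    return float(within - between)
```

A positive score means members of a category look more like each other than like members of other categories, which is what the tables above report.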

### Per-Category Discriminability (patch_feat)

| Category | Image | T5 | BERT | Beatrix |
|---|---|---|---|---|
| character_with_lighting | +0.051 | **+0.145** | +0.093 | +0.069 |
| action_scene | +0.020 | +0.123 | **+0.126** | +0.060 |
| character_with_jewelry | +0.048 | +0.072 | +0.107 | **+0.121** |
| character_with_expression | +0.041 | **+0.092** | +0.066 | +0.088 |
| character_in_scene | +0.014 | +0.081 | +0.062 | **+0.089** |
| character_full_outfit | +0.025 | +0.080 | **+0.088** | +0.054 |
| character_with_pose | +0.001 | +0.007 | -0.008 | +0.007 |

Category ranking is preserved across all paths. Lighting and jewelry always differentiate well; pose never does. The geometric hierarchy is stable.

## Key Findings

1. **Text-derived patches are geometrically cleaner than image-derived patches.** Language is already an abstraction β€” it carries structural intent without per-pixel noise. The geometric analyzer reads intent more clearly than observation.

2. **The bridge is encoder-agnostic.** Three architecturally different encoders (encoder-decoder T5, bidirectional BERT, categorical nomic_bert) produce the same discriminability through a 256d bottleneck. The geometric structure is in the text, not the encoder.

3. **Categorical pretraining doesn't help overall.** Beatrix, trained on 2B+ samples with explicit `<lighting>`, `<jewelry>`, `<pose>` tokens, matches generic BERT/T5 within Β±5%. It wins on fine-grained object categories (jewelry +0.121 vs BERT +0.107) but loses on scene-level properties (lighting +0.069 vs T5 +0.145).

4. **The 256d bottleneck is the normalizer.** It strips encoder-specific representational choices and preserves only the geometric signal that all encoders agree on.

## Architecture

Each VAE (~4.5M params):

```
Encoder:  text_dim β†’ Linear(1024) β†’ LN β†’ GELU β†’ Dropout
                   β†’ Linear(1024) β†’ LN β†’ GELU β†’ Dropout
          1024 β†’ ΞΌ (256d)
          1024 β†’ log_var (256d)

Bottleneck: z = ΞΌ + Ρ·σ  (training)
            z = ΞΌ          (inference)

Decoder:  256 β†’ Linear(1024) β†’ LN β†’ GELU β†’ Dropout
              β†’ Linear(1024) β†’ LN β†’ GELU β†’ Dropout
              β†’ Linear(2048)
          reshape β†’ (8, 16, 16)
```
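The block diagram above translates to a compact PyTorch module along these lines (a sketch of the described architecture; the class name matches the Usage section, but the dropout rate and the `generate_latent` return convention are assumptions):

```python
import torch
import torch.nn as nn

def mlp_block(d_in: int, d_out: int, p: float = 0.1) -> nn.Sequential:
    """Linear -> LayerNorm -> GELU -> Dropout, as in the diagram."""
    return nn.Sequential(nn.Linear(d_in, d_out), nn.LayerNorm(d_out),
                         nn.GELU(), nn.Dropout(p))

class TextVAE(nn.Module):
    def __init__(self, text_dim: int = 512, hidden: int = 1024, latent: int = 256):
        super().__init__()
        self.enc = nn.Sequential(mlp_block(text_dim, hidden),
                                 mlp_block(hidden, hidden))
        self.mu = nn.Linear(hidden, latent)
        self.log_var = nn.Linear(hidden, latent)
        self.dec = nn.Sequential(mlp_block(latent, hidden),
                                 mlp_block(hidden, hidden),
                                 nn.Linear(hidden, 8 * 16 * 16))

    def forward(self, x: torch.Tensor):
        h = self.enc(x)
        mu, log_var = self.mu(h), self.log_var(h)
        if self.training:  # reparameterize: z = mu + eps * sigma
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)
        else:              # z = mu at inference
            z = mu
        patches = self.dec(z).view(-1, 8, 16, 16)
        return patches, mu, log_var
```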

Training: MSE reconstruction + KL divergence (weight 1e-4), AdamW 1e-3, cosine schedule, 50 epochs, batch 512.
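The combined objective can be written as a single loss function (a sketch; it assumes the VAE forward pass returns `(recon, mu, log_var)` and uses the closed-form KL to a standard-normal prior):

```python
import torch
import torch.nn.functional as F

def vae_loss(recon: torch.Tensor, target: torch.Tensor,
             mu: torch.Tensor, log_var: torch.Tensor,
             kl_weight: float = 1e-4) -> torch.Tensor:
    """MSE reconstruction + KL divergence (weight 1e-4), as described above."""
    recon_loss = F.mse_loss(recon, target)
    # Closed-form KL between N(mu, sigma^2) and N(0, I)
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return recon_loss + kl_weight * kl

# Optimizer/schedule per the recipe above, for some `vae` module:
# optimizer = torch.optim.AdamW(vae.parameters(), lr=1e-3)
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
```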

## Usage

```python
import torch
from model import TextVAE  # or BertVAE, BeatrixVAE

# Load trained VAE
vae = TextVAE(text_dim=512)  # 768 for BERT/Beatrix
ckpt = torch.load("best_model.pt", map_location="cpu")
vae.load_state_dict(ckpt["model_state_dict"])
vae.eval()

# Text β†’ geometric patches
text_embedding = your_encoder(prompt)          # (B, 512/768)
patches = vae.generate_latent(text_embedding)  # (B, 8, 16, 16)

# Feed to geometric analyzer
geo_output = geometric_model(patches)
gates = geo_output["local_dim_logits"]         # geometric properties
features = geo_output["patch_features"]        # learned representations
```

## Implications

Geometric structure is a shared language that text and images both speak natively. This repo demonstrates the decoder β€” a lightweight VAE that translates any text encoder's output into the geometric alphabet. The next step is using this bridge for conditionable geometric descriptors in diffusion processes: text descriptions that steer generation through geometric constraints rather than CLIP alignment.

## File Structure

```
geovae-proto/
β”œβ”€β”€ text_vae/          # flan-t5-small (512d)
β”‚   β”œβ”€β”€ model.py       # TextVAE architecture
β”‚   β”œβ”€β”€ train.py       # Extract + train + analyze
β”‚   └── push.py        # Upload to HF
β”œβ”€β”€ bert_vae/          # bert-base-uncased (768d)
β”‚   β”œβ”€β”€ model.py
β”‚   β”œβ”€β”€ train.py
β”‚   └── push.py
└── beatrix_vae/       # bert-beatrix-2048 (768d)
    β”œβ”€β”€ model.py
    β”œβ”€β”€ train.py
    └── push.py
```

## Citation

Part of the geometric deep learning research by [AbstractPhil](https://huggingface.co/AbstractPhil). Built on the geometric analyzer from [grid-geometric-multishape](https://huggingface.co/AbstractPhil/grid-geometric-multishape) and the [synthetic-characters](https://huggingface.co/datasets/AbstractPhil/synthetic-characters) dataset.