---
license: apache-2.0
tags:
  - geometric-deep-learning
  - vae
  - text-to-geometry
  - rosetta-stone
  - multimodal
  - experimental
  - research
base_model:
  - AbstractPhil/grid-geometric-multishape
  - google/flan-t5-small
  - bert-base-uncased
  - AbstractPhil/bert-beatrix-2048
datasets:
  - AbstractPhil/synthetic-characters
---

# GeoVAE Proto β€” The Rosetta Stone Experiments

**Text carries geometric structure. This repo proves it.**

Three lightweight VAEs project text embeddings from different encoders into geometric patch space β€” and a pretrained geometric analyzer reads the text-derived patches *more clearly* than actual images. The geometric differentiation is encoder-agnostic: it lives in the language itself.

## The Hypothesis

If FLUX-generated images produce measurably differentiated geometric signatures across categories (lighting vs. jewelry vs. pose), and those images were *generated from text prompts*, then the text embeddings should contain enough structural information to produce the same geometric differentiation β€” without ever seeing an image.

## The Experiment

```
Text Prompt β†’ [Encoder] β†’ 512/768d embedding β†’ TextVAE β†’ (8, 16, 16) patches β†’ Geometric Analyzer β†’ gates + patch features
```

Three encoders tested against the same pipeline:

| Directory | Encoder | Dim | Pooling | Architecture |
|---|---|---|---|---|
| `text_vae/` | flan-t5-small | 512 | mean pool | encoder-decoder |
| `bert_vae/` | bert-base-uncased | 768 | [CLS] token | bidirectional MLM |
| `beatrix_vae/` | bert-beatrix-2048 | 768 | mean pool | nomic_bert + categorical tokens |
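The two pooling strategies in the table reduce a `(B, T, D)` sequence of token states to one vector per prompt. As tensor ops they can be sketched like this (a minimal sketch; the function names are mine, and mean pooling is assumed to be attention-mask-weighted):

```python
import torch

def cls_pool(hidden: torch.Tensor) -> torch.Tensor:
    """Take the first ([CLS]) token's hidden state: (B, T, D) -> (B, D)."""
    return hidden[:, 0]

def mean_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token states, ignoring padding: (B, T, D), (B, T) -> (B, D)."""
    mask = attention_mask.unsqueeze(-1).float()  # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
```

For flan-t5-small only the encoder stack is needed, so `hidden` would come from the T5 encoder rather than the full encoder-decoder forward pass.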

Each VAE has identical architecture: `encoder (text_dim β†’ 1024 β†’ 1024) β†’ ΞΌ,Οƒ (256d bottleneck) β†’ decoder (256 β†’ 1024 β†’ 1024 β†’ 2048) β†’ reshape (8, 16, 16)`. Each is trained to reconstruct targets at the same scale as the adapted FLUX VAE latents from the earlier Image VAE experiments. ~4.5M parameters each.

The geometric analyzer is a pretrained `SuperpositionPatchClassifier` from [AbstractPhil/grid-geometric-multishape](https://huggingface.co/AbstractPhil/grid-geometric-multishape) (epoch 200), frozen during evaluation. It extracts gate vectors (64Γ—17 explicit geometric properties) and patch features (64Γ—256 learned representations) from any (8, 16, 16) input.

Dataset: 49,286 images from [AbstractPhil/synthetic-characters](https://huggingface.co/datasets/AbstractPhil/synthetic-characters) (schnell_full_1_512), 15 generator_type categories.

## Results

### Overall Discriminability (within-category similarity βˆ’ weighted between-category similarity)

| Representation | Image Path (49k) | T5 (512d) | BERT (768d) | Beatrix (768d) |
|---|---|---|---|---|
| **patch_feat** | +0.0198 | +0.0526 | **+0.0534** | +0.0502 |
| **gate_vectors** | +0.0090 | +0.0311 | **+0.0319** | +0.0302 |
| **global_feat** | +0.0084 | +0.0228 | **+0.0219** | +0.0214 |

**All three text paths produce 2.5–3.5Γ— stronger geometric differentiation than the image path.** The three encoders agree within Β±5% of each other.
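The discriminability score in these tables can be sketched as mean cosine similarity within a category minus mean similarity across categories (a simplified, unweighted version; the repo's analysis scripts define the exact between-category weighting):

```python
import numpy as np

def discriminability(feats: np.ndarray, labels: np.ndarray) -> float:
    """Within-category minus between-category mean cosine similarity.

    feats: (N, D) feature vectors; labels: (N,) category ids.
    Note: unweighted between-category term -- an assumption, not the
    repo's exact (weighted) formula.
    """
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T                              # (N, N) cosine similarities
    same = labels[:, None] == labels[None, :]  # same-category pair mask
    off_diag = ~np.eye(len(f), dtype=bool)     # exclude self-similarity
    within = sim[same & off_diag].mean()
    between = sim[~same].mean()
    return float(within - between)
```

A positive score means members of a category look more like each other than like members of other categories, which is what the tables above report.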

### Per-Category Discriminability (patch_feat)

| Category | Image | T5 | BERT | Beatrix |
|---|---|---|---|---|
| character_with_lighting | +0.051 | **+0.145** | +0.093 | +0.069 |
| action_scene | +0.020 | +0.123 | **+0.126** | +0.060 |
| character_with_jewelry | +0.048 | +0.072 | +0.107 | **+0.121** |
| character_with_expression | +0.041 | **+0.092** | +0.066 | +0.088 |
| character_in_scene | +0.014 | +0.081 | +0.062 | **+0.089** |
| character_full_outfit | +0.025 | +0.080 | **+0.088** | +0.054 |
| character_with_pose | +0.001 | +0.007 | -0.008 | +0.007 |

Category ranking is preserved across all paths. Lighting and jewelry always differentiate well; pose never does. The geometric hierarchy is stable.

## Key Findings

1. **Text-derived patches are geometrically cleaner than image-derived patches.** Language is already an abstraction β€” it carries structural intent without per-pixel noise. The geometric analyzer reads intent more clearly than observation.

2. **The bridge is encoder-agnostic.** Three architecturally different encoders (encoder-decoder T5, bidirectional BERT, categorical nomic_bert) produce the same discriminability through a 256d bottleneck. The geometric structure is in the text, not the encoder.

3. **Categorical pretraining doesn't help overall.** Beatrix, trained on 2B+ samples with explicit `<lighting>`, `<jewelry>`, `<pose>` tokens, matches generic BERT/T5 within Β±5%. It wins on fine-grained object categories (jewelry +0.121 vs BERT +0.107) but loses on scene-level properties (lighting +0.069 vs T5 +0.145).

4. **The 256d bottleneck is the normalizer.** It strips encoder-specific representational choices and preserves only the geometric signal that all encoders agree on.

## Architecture

Each VAE (~4.5M params):

```
Encoder:  text_dim β†’ Linear(1024) β†’ LN β†’ GELU β†’ Dropout
                   β†’ Linear(1024) β†’ LN β†’ GELU β†’ Dropout
          1024 β†’ ΞΌ (256d)
          1024 β†’ log_var (256d)

Bottleneck: z = ΞΌ + Ρ·σ  (training)
            z = ΞΌ          (inference)

Decoder:  256 β†’ Linear(1024) β†’ LN β†’ GELU β†’ Dropout
              β†’ Linear(1024) β†’ LN β†’ GELU β†’ Dropout
              β†’ Linear(2048)
          reshape β†’ (8, 16, 16)
```
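The block diagram above translates to a compact PyTorch module along these lines (a sketch of the described architecture; the class name matches the Usage section, but the dropout rate and the `generate_latent` return convention are assumptions):

```python
import torch
import torch.nn as nn

def mlp_block(d_in: int, d_out: int, p: float = 0.1) -> nn.Sequential:
    """Linear -> LayerNorm -> GELU -> Dropout, as in the diagram."""
    return nn.Sequential(nn.Linear(d_in, d_out), nn.LayerNorm(d_out),
                         nn.GELU(), nn.Dropout(p))

class TextVAE(nn.Module):
    def __init__(self, text_dim: int = 512, hidden: int = 1024, latent: int = 256):
        super().__init__()
        self.enc = nn.Sequential(mlp_block(text_dim, hidden),
                                 mlp_block(hidden, hidden))
        self.mu = nn.Linear(hidden, latent)
        self.log_var = nn.Linear(hidden, latent)
        self.dec = nn.Sequential(mlp_block(latent, hidden),
                                 mlp_block(hidden, hidden),
                                 nn.Linear(hidden, 8 * 16 * 16))

    def forward(self, x: torch.Tensor):
        h = self.enc(x)
        mu, log_var = self.mu(h), self.log_var(h)
        if self.training:  # reparameterize: z = mu + eps * sigma
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)
        else:              # z = mu at inference
            z = mu
        patches = self.dec(z).view(-1, 8, 16, 16)
        return patches, mu, log_var
```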

Training: MSE reconstruction + KL divergence (weight 1e-4), AdamW 1e-3, cosine schedule, 50 epochs, batch 512.
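The combined objective can be written as a single loss function (a sketch; it assumes the VAE forward pass returns `(recon, mu, log_var)` and uses the closed-form KL to a standard-normal prior):

```python
import torch
import torch.nn.functional as F

def vae_loss(recon: torch.Tensor, target: torch.Tensor,
             mu: torch.Tensor, log_var: torch.Tensor,
             kl_weight: float = 1e-4) -> torch.Tensor:
    """MSE reconstruction + KL divergence (weight 1e-4), as described above."""
    recon_loss = F.mse_loss(recon, target)
    # Closed-form KL between N(mu, sigma^2) and N(0, I)
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return recon_loss + kl_weight * kl

# Optimizer/schedule per the recipe above, for some `vae` module:
# optimizer = torch.optim.AdamW(vae.parameters(), lr=1e-3)
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
```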

## Usage

```python
import torch
from model import TextVAE  # or BertVAE, BeatrixVAE

# Load trained VAE
vae = TextVAE(text_dim=512)  # 768 for BERT/Beatrix
ckpt = torch.load("best_model.pt", map_location="cpu")
vae.load_state_dict(ckpt["model_state_dict"])
vae.eval()

# Text β†’ geometric patches
text_embedding = your_encoder(prompt)          # (B, 512/768)
patches = vae.generate_latent(text_embedding)  # (B, 8, 16, 16)

# Feed to geometric analyzer
geo_output = geometric_model(patches)
gates = geo_output["local_dim_logits"]         # geometric properties
features = geo_output["patch_features"]        # learned representations
```

## Implications

Geometric structure is a shared language that text and images both speak natively. This repo demonstrates the decoder β€” a lightweight VAE that translates any text encoder's output into the geometric alphabet. The next step is using this bridge for conditionable geometric descriptors in diffusion processes: text descriptions that steer generation through geometric constraints rather than CLIP alignment.

## File Structure

```
geovae-proto/
β”œβ”€β”€ text_vae/          # flan-t5-small (512d)
β”‚   β”œβ”€β”€ model.py       # TextVAE architecture
β”‚   β”œβ”€β”€ train.py       # Extract + train + analyze
β”‚   └── push.py        # Upload to HF
β”œβ”€β”€ bert_vae/          # bert-base-uncased (768d)
β”‚   β”œβ”€β”€ model.py
β”‚   β”œβ”€β”€ train.py
β”‚   └── push.py
└── beatrix_vae/       # bert-beatrix-2048 (768d)
    β”œβ”€β”€ model.py
    β”œβ”€β”€ train.py
    └── push.py
```

## Citation

Part of the geometric deep learning research by [AbstractPhil](https://huggingface.co/AbstractPhil). Built on the geometric analyzer from [grid-geometric-multishape](https://huggingface.co/AbstractPhil/grid-geometric-multishape) and the [synthetic-characters](https://huggingface.co/datasets/AbstractPhil/synthetic-characters) dataset.