---
license: apache-2.0
tags:
- geometric-deep-learning
- vae
- text-to-geometry
- rosetta-stone
- multimodal
- experimental
- research
base_model:
- AbstractPhil/grid-geometric-multishape
- google/flan-t5-small
- bert-base-uncased
- AbstractPhil/bert-beatrix-2048
datasets:
- AbstractPhil/synthetic-characters
---
# GeoVAE Proto: The Rosetta Stone Experiments

**Text carries geometric structure. This repo proves it.**

Three lightweight VAEs project text embeddings from different encoders into geometric patch space, and a pretrained geometric analyzer reads the text-derived patches *more clearly* than actual images. The geometric differentiation is encoder-agnostic: it lives in the language itself.

## The Hypothesis

If FLUX-generated images produce measurably differentiated geometric signatures across categories (lighting vs. jewelry vs. pose), and those images were *generated from text prompts*, then the text embeddings should contain enough structural information to produce the same geometric differentiation, without ever seeing an image.

## The Experiment

```
Text Prompt → [Encoder] → 512/768d embedding → TextVAE → (8, 16, 16) patches → Geometric Analyzer → gates + patch features
```

Three encoders were tested against the same pipeline (a pooling sketch is given below):

| Directory | Encoder | Dim | Pooling | Architecture |
|---|---|---|---|---|
| `text_vae/` | flan-t5-small | 512 | mean pool | encoder-decoder |
| `bert_vae/` | bert-base-uncased | 768 | [CLS] token | bidirectional MLM |
| `beatrix_vae/` | bert-beatrix-2048 | 768 | mean pool | nomic_bert + categorical tokens |

Each VAE has identical architecture: `encoder (text_dim → 1024 → 1024) → μ,σ (256d bottleneck) → decoder (256 → 1024 → 1024 → 2048) → reshape (8, 16, 16)`. Each is trained to reconstruct patches at the same scale as the adapted FLUX VAE latents from the earlier Image VAE experiments. ~4.5M parameters each.
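
Upstream of the VAE, each encoder's variable-length token sequence is collapsed to a single vector, per the pooling column in the table above. A minimal sketch of that step for the two public encoders, assuming the standard `transformers` interfaces (the exact preprocessing in each `train.py` may differ):

```python
import torch
from transformers import AutoModel, AutoTokenizer, T5EncoderModel

prompts = ["a character with ornate jewelry under warm rim lighting"]

# flan-t5-small: encoder-only forward pass, mean pool over non-padding tokens (512d)
t5_tok = AutoTokenizer.from_pretrained("google/flan-t5-small")
t5 = T5EncoderModel.from_pretrained("google/flan-t5-small").eval()
enc = t5_tok(prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = t5(**enc).last_hidden_state               # (B, L, 512)
mask = enc["attention_mask"].unsqueeze(-1).float()      # (B, L, 1)
t5_emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (B, 512) mean pool

# bert-base-uncased: take the [CLS] token (768d)
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()
enc = bert_tok(prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    bert_emb = bert(**enc).last_hidden_state[:, 0]      # (B, 768) [CLS] embedding

# bert-beatrix-2048 (nomic_bert) is mean pooled the same way as T5, at 768d.
```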
The geometric analyzer is a pretrained `SuperpositionPatchClassifier` from [AbstractPhil/grid-geometric-multishape](https://huggingface.co/AbstractPhil/grid-geometric-multishape) (epoch 200), frozen during evaluation. It extracts gate vectors (64×17 explicit geometric properties) and patch features (64×256 learned representations) from any (8, 16, 16) input.

Dataset: 49,286 images from [AbstractPhil/synthetic-characters](https://huggingface.co/datasets/AbstractPhil/synthetic-characters) (schnell_full_1_512), 15 generator_type categories.
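
A quick way to inspect the category balance, assuming `generator_type` is exposed as a dataset column and that `schnell_full_1_512` names a config (both are assumptions about the dataset layout):

```python
from collections import Counter

from datasets import load_dataset

# Config/split names are assumptions; adjust to the dataset's actual layout.
ds = load_dataset("AbstractPhil/synthetic-characters", "schnell_full_1_512", split="train")
counts = Counter(ds["generator_type"])
print(len(ds), "samples across", len(counts), "generator_type categories")
```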
## Results

### Overall Discriminability (within-category similarity minus weighted between-category similarity)

| Representation | Image Path (49k) | T5 (512d) | BERT (768d) | Beatrix (768d) |
|---|---|---|---|---|
| **patch_feat** | +0.0198 | +0.0526 | **+0.0534** | +0.0502 |
| **gate_vectors** | +0.0090 | +0.0311 | **+0.0319** | +0.0302 |
| **global_feat** | +0.0084 | +0.0228 | **+0.0219** | +0.0214 |

**All three text paths produce 2.5–3.5× stronger geometric differentiation than the image path.** All three encoders converge to within ±5% of each other.
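
A minimal sketch of the discriminability score as defined in the heading above, using cosine similarity between flattened representations; the exact between-category weighting used in the repo's analysis is an assumption here:

```python
import torch
import torch.nn.functional as F

def discriminability(features: torch.Tensor, labels: torch.Tensor) -> float:
    """Mean within-category cosine similarity minus mean between-category
    similarity (between-pairs are implicitly weighted by category sizes)."""
    feats = F.normalize(features.flatten(start_dim=1), dim=-1)  # (N, D) unit vectors
    sim = feats @ feats.T                                       # (N, N) cosine similarities
    same = labels[:, None] == labels[None, :]                   # (N, N) same-category mask
    self_pairs = torch.eye(len(labels), dtype=torch.bool)
    within = sim[same & ~self_pairs].mean()
    between = sim[~same].mean()
    return (within - between).item()

# e.g. the patch_feat row: score = discriminability(patch_feats, category_ids)
```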
### Per-Category Discriminability (patch_feat)

| Category | Image | T5 | BERT | Beatrix |
|---|---|---|---|---|
| character_with_lighting | +0.051 | **+0.145** | +0.093 | +0.069 |
| action_scene | +0.020 | +0.123 | **+0.126** | +0.060 |
| character_with_jewelry | +0.048 | +0.072 | +0.107 | **+0.121** |
| character_with_expression | +0.041 | **+0.092** | +0.066 | +0.088 |
| character_in_scene | +0.014 | +0.081 | +0.062 | **+0.089** |
| character_full_outfit | +0.025 | +0.080 | **+0.088** | +0.054 |
| character_with_pose | +0.001 | +0.007 | -0.008 | +0.007 |

Category ranking is preserved across all paths. Lighting and jewelry always differentiate well; pose never does. The geometric hierarchy is stable.
## Key Findings

1. **Text-derived patches are geometrically cleaner than image-derived patches.** Language is already an abstraction: it carries structural intent without per-pixel noise. The geometric analyzer reads intent more clearly than observation.

2. **The bridge is encoder-agnostic.** Three architecturally different encoders (encoder-decoder T5, bidirectional BERT, categorical nomic_bert) produce the same discriminability through a 256d bottleneck. The geometric structure is in the text, not the encoder.

3. **Categorical pretraining doesn't help overall.** Beatrix, trained on 2B+ samples with explicit `<lighting>`, `<jewelry>`, `<pose>` tokens, matches generic BERT/T5 within ±5%. It wins on fine-grained object categories (jewelry +0.121 vs BERT +0.107) but loses on scene-level properties (lighting +0.069 vs T5 +0.145).

4. **The 256d bottleneck is the normalizer.** It strips encoder-specific representational choices and preserves only the geometric signal that all encoders agree on.
## Architecture

Each VAE (~4.5M params):

```
Encoder:    text_dim → Linear(1024) → LN → GELU → Dropout
                     → Linear(1024) → LN → GELU → Dropout
            1024 → μ (256d)
            1024 → log_var (256d)

Bottleneck: z = μ + ε·σ   (training)
            z = μ         (inference)

Decoder:    256 → Linear(1024) → LN → GELU → Dropout
                → Linear(1024) → LN → GELU → Dropout
                → Linear(2048)
            reshape → (8, 16, 16)
```

Training: MSE reconstruction + KL divergence (weight 1e-4), AdamW 1e-3, cosine schedule, 50 epochs, batch 512.
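
A minimal PyTorch sketch of the blocks and loss above; the class name, dropout rate, and optimizer wiring are illustrative assumptions, not the repo's `model.py`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def block(in_dim: int, out_dim: int, p: float = 0.1) -> nn.Sequential:
    # Linear → LayerNorm → GELU → Dropout, as in the diagram above
    return nn.Sequential(nn.Linear(in_dim, out_dim), nn.LayerNorm(out_dim), nn.GELU(), nn.Dropout(p))

class GeoTextVAE(nn.Module):
    def __init__(self, text_dim: int = 512, hidden: int = 1024, latent: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(block(text_dim, hidden), block(hidden, hidden))
        self.to_mu = nn.Linear(hidden, latent)
        self.to_log_var = nn.Linear(hidden, latent)
        self.decoder = nn.Sequential(block(latent, hidden), block(hidden, hidden), nn.Linear(hidden, 8 * 16 * 16))

    def forward(self, text_emb: torch.Tensor):
        h = self.encoder(text_emb)
        mu, log_var = self.to_mu(h), self.to_log_var(h)
        # Reparameterize while training; use the mean at inference
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp() if self.training else mu
        patches = self.decoder(z).view(-1, 8, 16, 16)
        return patches, mu, log_var

def vae_loss(patches, target, mu, log_var, kl_weight: float = 1e-4):
    recon = F.mse_loss(patches, target)                              # MSE reconstruction
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())  # KL to N(0, I)
    return recon + kl_weight * kl

# Optimizer and schedule from the recipe above
vae = GeoTextVAE(text_dim=512)
opt = torch.optim.AdamW(vae.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=50)
```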
## Usage

```python
import torch

from model import TextVAE  # or BertVAE, BeatrixVAE

# Load trained VAE
vae = TextVAE(text_dim=512)  # 768 for BERT/Beatrix
ckpt = torch.load("best_model.pt")
vae.load_state_dict(ckpt["model_state_dict"])

# Text → geometric patches
text_embedding = your_encoder(prompt)          # (B, 512/768)
patches = vae.generate_latent(text_embedding)  # (B, 8, 16, 16)

# Feed to geometric analyzer
geo_output = geometric_model(patches)
gates = geo_output["local_dim_logits"]   # geometric properties
features = geo_output["patch_features"]  # learned representations
```
## Implications

Geometric structure is a shared language that text and images both speak natively. This repo demonstrates the decoder: a lightweight VAE that translates any text encoder's output into the geometric alphabet. The next step is using this bridge for conditionable geometric descriptors in diffusion processes: text descriptions that steer generation through geometric constraints rather than CLIP alignment.
## File Structure

```
geovae-proto/
├── text_vae/        # flan-t5-small (512d)
│   ├── model.py     # TextVAE architecture
│   ├── train.py     # Extract + train + analyze
│   └── push.py      # Upload to HF
├── bert_vae/        # bert-base-uncased (768d)
│   ├── model.py
│   ├── train.py
│   └── push.py
└── beatrix_vae/     # bert-beatrix-2048 (768d)
    ├── model.py
    ├── train.py
    └── push.py
```
## Citation

Part of the geometric deep learning research by [AbstractPhil](https://huggingface.co/AbstractPhil). Built on the geometric analyzer from [grid-geometric-multishape](https://huggingface.co/AbstractPhil/grid-geometric-multishape) and the [synthetic-characters](https://huggingface.co/datasets/AbstractPhil/synthetic-characters) dataset.