---
tags:
- vae
- multimodal
- text-embeddings
- clip
- t5
license: mit
---

# VAE Lyra 🎵

Multi-modal Variational Autoencoder for text embedding transformation using geometric fusion.

## Model Details

- **Fusion Strategy**: cantor
- **Latent Dimension**: 768
- **Training Steps**: 11,899
- **Best Loss**: 0.1997
|
## Architecture

- **Modalities**: CLIP-L (768d) + T5-base (768d)
- **Encoder Layers**: 3
- **Decoder Layers**: 3
- **Hidden Dimension**: 1024
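
As a rough illustration of the shapes above, here is a minimal sketch of a modality-specific encoder mapping 768-d token embeddings through a 1024-d hidden stack to a 768-d latent with `mu`/`logvar` heads. The exact layer layout is an assumption for illustration only, not the repo's actual implementation:

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Illustrative encoder: 768 -> 1024 hidden -> (mu, logvar) in a 768-d latent.
    The layer layout is an assumption, not the actual vae_lyra code."""
    def __init__(self, in_dim=768, hidden=1024, latent=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
        )
        self.mu = nn.Linear(hidden, latent)
        self.logvar = nn.Linear(hidden, latent)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

enc = ModalityEncoder()
# Linear layers act on the last dim, so per-token latents keep the [batch, 77, *] layout
mu, logvar = enc(torch.randn(2, 77, 768))  # each: [2, 77, 768]
```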
|
## Usage

```python
from geovocab2.train.model.vae.vae_lyra import MultiModalVAE, MultiModalVAEConfig
from huggingface_hub import hf_hub_download
import torch

# Download the checkpoint from the Hub
model_path = hf_hub_download(
    repo_id="AbstractPhil/vae-lyra",
    filename="model.pt"
)

# Load the checkpoint (map to CPU so this works without a GPU)
checkpoint = torch.load(model_path, map_location="cpu")

# Create the model with the same configuration it was trained with
config = MultiModalVAEConfig(
    modality_dims={"clip": 768, "t5": 768},
    latent_dim=768,
    fusion_strategy="cantor"
)

model = MultiModalVAE(config)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Run the model on token-level text embeddings
inputs = {
    "clip": clip_embeddings,  # [batch, 77, 768] CLIP-L hidden states
    "t5": t5_embeddings       # [batch, 77, 768] T5-base hidden states
}

reconstructions, mu, logvar = model(inputs)
```
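
The `clip_embeddings` and `t5_embeddings` inputs are not defined in the snippet above. Assuming they are token-level hidden states from the CLIP-L and T5-base text encoders, one way to produce them with the `transformers` library (model IDs here are the standard public checkpoints, which is an assumption about what the VAE was trained against):

```python
import torch
from transformers import AutoTokenizer, CLIPTextModel, T5EncoderModel

prompt = "a watercolor painting of a lighthouse"

# CLIP-L text encoder: hidden size 768, context length 77
clip_tok = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
clip_batch = clip_tok(prompt, padding="max_length", max_length=77,
                      truncation=True, return_tensors="pt")
with torch.no_grad():
    clip_embeddings = clip_enc(**clip_batch).last_hidden_state  # [1, 77, 768]

# T5-base encoder: d_model 768; pad to the same 77-token length
t5_tok = AutoTokenizer.from_pretrained("t5-base")
t5_enc = T5EncoderModel.from_pretrained("t5-base")
t5_batch = t5_tok(prompt, padding="max_length", max_length=77,
                  truncation=True, return_tensors="pt")
with torch.no_grad():
    t5_embeddings = t5_enc(**t5_batch).last_hidden_state  # [1, 77, 768]
```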

## Training Details

- Trained on 10,000 diverse prompts
- Mix of LAION flavors (85%) and synthetic prompts (15%)
- KL Annealing: True
- Learning Rate: 0.0001
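
The card notes KL annealing but not the schedule. A minimal sketch of linear KL annealing, where the KL weight beta ramps from 0 to 1 over a warm-up window (the 2,000-step warm-up length is an assumed value, not from this repo), applied to the standard VAE objective:

```python
import torch

def kl_weight(step, warmup_steps=2000):
    """Linear KL annealing: beta ramps from 0 to 1 over the warm-up window.
    The warm-up length is an assumption for illustration."""
    return min(1.0, step / warmup_steps)

def vae_loss(recon_loss, mu, logvar, step):
    """Standard VAE objective: reconstruction plus beta-weighted KL divergence."""
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl_weight(step) * kl

# At mu = 0, logvar = 0 the KL term is exactly zero, so the loss
# equals the reconstruction term regardless of beta.
mu = torch.zeros(2, 768)
logvar = torch.zeros(2, 768)
loss = vae_loss(torch.tensor(0.2), mu, logvar, step=1000)  # beta = 0.5 here
```

Annealing the KL weight this way lets the model learn useful reconstructions before the prior is enforced, which helps avoid posterior collapse.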
|
## Citation

```bibtex
@software{vae_lyra_2025,
  author = {AbstractPhil},
  title = {VAE Lyra: Multi-Modal Variational Autoencoder},
  year = {2025},
  url = {https://huggingface.co/AbstractPhil/vae-lyra}
}
```
|