---
tags:
- vae
- multimodal
- text-embeddings
- clip
- t5
license: mit
---

# VAE Lyra 🎵

Multi-modal Variational Autoencoder for text embedding transformation using geometric fusion.

## Model Details

- **Fusion Strategy**: cantor
- **Latent Dimension**: 768
- **Training Steps**: 11,899
- **Best Loss**: 0.1997

## Architecture

- **Modalities**: CLIP-L (768d) + T5-base (768d)
- **Encoder Layers**: 3
- **Decoder Layers**: 3
- **Hidden Dimension**: 1024

## Usage

```python
from geovocab2.train.model.vae.vae_lyra import MultiModalVAE, MultiModalVAEConfig
from huggingface_hub import hf_hub_download
import torch

# Download model weights from the Hub
model_path = hf_hub_download(
    repo_id="AbstractPhil/vae-lyra",
    filename="model.pt"
)

# Load checkpoint (map to CPU so this works without a GPU)
checkpoint = torch.load(model_path, map_location="cpu")

# Create model with the configuration used for training
config = MultiModalVAEConfig(
    modality_dims={"clip": 768, "t5": 768},
    latent_dim=768,
    fusion_strategy="cantor"
)
model = MultiModalVAE(config)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Run the model on precomputed text-encoder outputs
inputs = {
    "clip": clip_embeddings,  # CLIP-L hidden states, [batch, 77, 768]
    "t5": t5_embeddings       # T5-base hidden states, [batch, 77, 768]
}
with torch.no_grad():
    reconstructions, mu, logvar = model(inputs)
```

## Training Details

- Trained on 10,000 diverse prompts
- Mix of LAION flavors (85%) and synthetic prompts (15%)
- KL Annealing: True
- Learning Rate: 0.0001

## Citation

```bibtex
@software{vae_lyra_2025,
  author = {AbstractPhil},
  title = {VAE Lyra: Multi-Modal Variational Autoencoder},
  year = {2025},
  url = {https://huggingface.co/AbstractPhil/vae-lyra}
}
```
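
## Sampling from the Latent Space

The `model(inputs)` call in the Usage section returns the latent mean `mu` and log-variance `logvar` alongside the reconstructions. To draw a latent sample from these, a VAE uses the standard reparameterization trick, `z = mu + sigma * eps` with `eps ~ N(0, I)`. A minimal sketch in plain PyTorch (`reparameterize` here is an illustrative helper, not part of the `vae_lyra` API):

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Standard VAE reparameterization: z = mu + sigma * eps, eps ~ N(0, I)."""
    std = torch.exp(0.5 * logvar)   # logvar stores log(sigma^2)
    eps = torch.randn_like(std)     # unit-Gaussian noise, same shape as std
    return mu + eps * std

# Shapes matching this card: [batch, 77 tokens, 768-d latent]
mu = torch.zeros(2, 77, 768)
logvar = torch.zeros(2, 77, 768)    # log-variance 0 -> unit variance
z = reparameterize(mu, logvar)
print(z.shape)  # torch.Size([2, 77, 768])
```

Sampling through `mu + sigma * eps` (rather than sampling `z` directly) keeps the draw differentiable with respect to `mu` and `logvar`, which is what allows the KL-annealed objective noted under Training Details to be optimized by backpropagation.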