---
tags:
- vae
- multimodal
- text-embeddings
- clip
- t5
license: mit
---

# VAE Lyra 🎵

Multi-modal Variational Autoencoder for text embedding transformation using geometric fusion.

## Model Details

- **Fusion Strategy**: cantor
- **Latent Dimension**: 768
- **Training Steps**: 11,899
- **Best Loss**: 0.1997
|
## Architecture

- **Modalities**: CLIP-L (768d) + T5-base (768d)
- **Encoder Layers**: 3
- **Decoder Layers**: 3
- **Hidden Dimension**: 1024
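
As a rough illustration of the shapes above, here is a minimal sketch of a modality-specific encoder mapping 768-d token embeddings through a 1024-d hidden stack to a 768-d latent with `mu`/`logvar` heads. The exact layer layout is an assumption for illustration only, not the repo's actual implementation:

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Illustrative encoder: 768 -> 1024 hidden -> (mu, logvar) in a 768-d latent.
    The layer layout is an assumption, not the actual vae_lyra code."""
    def __init__(self, in_dim=768, hidden=1024, latent=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
        )
        self.mu = nn.Linear(hidden, latent)
        self.logvar = nn.Linear(hidden, latent)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

enc = ModalityEncoder()
# Linear layers act on the last dim, so per-token latents keep the [batch, 77, *] layout
mu, logvar = enc(torch.randn(2, 77, 768))  # each: [2, 77, 768]
```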
|
## Usage

```python
from geovocab2.train.model.vae.vae_lyra import MultiModalVAE, MultiModalVAEConfig
from huggingface_hub import hf_hub_download
import torch

# Download the checkpoint from the Hub
model_path = hf_hub_download(
    repo_id="AbstractPhil/vae-lyra",
    filename="model.pt"
)

# Load the checkpoint (map to CPU so this works without a GPU)
checkpoint = torch.load(model_path, map_location="cpu")

# Create the model with the same configuration it was trained with
config = MultiModalVAEConfig(
    modality_dims={"clip": 768, "t5": 768},
    latent_dim=768,
    fusion_strategy="cantor"
)

model = MultiModalVAE(config)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Run the model on token-level text embeddings
inputs = {
    "clip": clip_embeddings,  # [batch, 77, 768] CLIP-L hidden states
    "t5": t5_embeddings       # [batch, 77, 768] T5-base hidden states
}

reconstructions, mu, logvar = model(inputs)
```
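
The `clip_embeddings` and `t5_embeddings` inputs are not defined in the snippet above. Assuming they are token-level hidden states from the CLIP-L and T5-base text encoders, one way to produce them with the `transformers` library (model IDs here are the standard public checkpoints, which is an assumption about what the VAE was trained against):

```python
import torch
from transformers import AutoTokenizer, CLIPTextModel, T5EncoderModel

prompt = "a watercolor painting of a lighthouse"

# CLIP-L text encoder: hidden size 768, context length 77
clip_tok = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
clip_batch = clip_tok(prompt, padding="max_length", max_length=77,
                      truncation=True, return_tensors="pt")
with torch.no_grad():
    clip_embeddings = clip_enc(**clip_batch).last_hidden_state  # [1, 77, 768]

# T5-base encoder: d_model 768; pad to the same 77-token length
t5_tok = AutoTokenizer.from_pretrained("t5-base")
t5_enc = T5EncoderModel.from_pretrained("t5-base")
t5_batch = t5_tok(prompt, padding="max_length", max_length=77,
                  truncation=True, return_tensors="pt")
with torch.no_grad():
    t5_embeddings = t5_enc(**t5_batch).last_hidden_state  # [1, 77, 768]
```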

## Training Details

- Trained on 10,000 diverse prompts
- Mix of LAION flavors (85%) and synthetic prompts (15%)
- KL Annealing: True
- Learning Rate: 0.0001
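
The card notes KL annealing but not the schedule. A minimal sketch of linear KL annealing, where the KL weight beta ramps from 0 to 1 over a warm-up window (the 2,000-step warm-up length is an assumed value, not from this repo), applied to the standard VAE objective:

```python
import torch

def kl_weight(step, warmup_steps=2000):
    """Linear KL annealing: beta ramps from 0 to 1 over the warm-up window.
    The warm-up length is an assumption for illustration."""
    return min(1.0, step / warmup_steps)

def vae_loss(recon_loss, mu, logvar, step):
    """Standard VAE objective: reconstruction plus beta-weighted KL divergence."""
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl_weight(step) * kl

# At mu = 0, logvar = 0 the KL term is exactly zero, so the loss
# equals the reconstruction term regardless of beta.
mu = torch.zeros(2, 768)
logvar = torch.zeros(2, 768)
loss = vae_loss(torch.tensor(0.2), mu, logvar, step=1000)  # beta = 0.5 here
```

Annealing the KL weight this way lets the model learn useful reconstructions before the prior is enforced, which helps avoid posterior collapse.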
|
## Citation

```bibtex
@software{vae_lyra_2025,
  author = {AbstractPhil},
  title = {VAE Lyra: Multi-Modal Variational Autoencoder},
  year = {2025},
  url = {https://huggingface.co/AbstractPhil/vae-lyra}
}
```
|