---
tags:
- vae
- multimodal
- text-embeddings
- clip
- t5
license: mit
---
# VAE Lyra 🎵
A multi-modal Variational Autoencoder that fuses text embeddings from multiple encoders into a shared latent space using geometric fusion.
## Model Details
- Fusion Strategy: cantor
- Latent Dimension: 768
- Training Steps: 11,899
- Best Loss: 0.1997
## Architecture
- Modalities: CLIP-L (768d) + T5-base (768d)
- Encoder Layers: 3
- Decoder Layers: 3
- Hidden Dimension: 1024
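The internals of the `cantor` fusion strategy live in `geovocab2` and are not documented here. As a rough illustration only (not the actual implementation), a multi-modal VAE encoder with these dimensions might project each modality to the hidden size, concatenate, and map to the latent mean and log-variance:

```python
import torch
import torch.nn as nn

class ToyMultiModalEncoder(nn.Module):
    """Illustrative stand-in: per-modality projection + concatenation.
    The real `cantor` fusion in geovocab2 may differ substantially."""

    def __init__(self, modality_dims, hidden_dim=1024, latent_dim=768):
        super().__init__()
        self.proj = nn.ModuleDict({
            name: nn.Linear(dim, hidden_dim) for name, dim in modality_dims.items()
        })
        fused = hidden_dim * len(modality_dims)
        self.to_mu = nn.Linear(fused, latent_dim)
        self.to_logvar = nn.Linear(fused, latent_dim)

    def forward(self, inputs):
        # Project each modality, then concatenate along the feature axis
        parts = [self.proj[name](x) for name, x in inputs.items()]
        h = torch.cat(parts, dim=-1)
        return self.to_mu(h), self.to_logvar(h)

enc = ToyMultiModalEncoder({"clip": 768, "t5": 768})
mu, logvar = enc({"clip": torch.randn(2, 77, 768),
                  "t5": torch.randn(2, 77, 768)})
```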
## Usage
```python
from geovocab2.train.model.vae.vae_lyra import MultiModalVAE, MultiModalVAEConfig
from huggingface_hub import hf_hub_download
import torch

# Download the checkpoint from the Hub
model_path = hf_hub_download(
    repo_id="AbstractPhil/vae-lyra",
    filename="model.pt"
)

# Load the checkpoint (CPU-safe; move the model to GPU afterwards if desired)
checkpoint = torch.load(model_path, map_location="cpu")

# Recreate the model with the same config it was trained with
config = MultiModalVAEConfig(
    modality_dims={"clip": 768, "t5": 768},
    latent_dim=768,
    fusion_strategy="cantor"
)
model = MultiModalVAE(config)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Run the model; clip_embeddings / t5_embeddings are your own input tensors
inputs = {
    "clip": clip_embeddings,  # [batch, 77, 768]
    "t5": t5_embeddings       # [batch, 77, 768]
}
reconstructions, mu, logvar = model(inputs)
```
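The forward pass returns the posterior parameters `mu` and `logvar` alongside the reconstructions. If you want to draw latent samples yourself (e.g. for interpolation), the standard VAE reparameterization trick applies; a minimal sketch, independent of the Lyra code:

```python
import torch

def sample_latent(mu, logvar):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

mu = torch.zeros(2, 77, 768)
logvar = torch.zeros(2, 77, 768)  # sigma = 1 everywhere
z = sample_latent(mu, logvar)
```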
## Training Details
- Trained on 10,000 diverse prompts
- Mix of LAION flavors (85%) and synthetic prompts (15%)
- KL Annealing: True
- Learning Rate: 0.0001
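KL annealing scales the KL-divergence term by a coefficient that ramps from 0 to 1 during training, so early optimization focuses on reconstruction. The exact schedule used for Lyra is not recorded in this card; a common linear warm-up looks like the following (the warm-up length is an assumption):

```python
def kl_beta(step, warmup_steps=2000):
    """Linear KL annealing: beta ramps 0 -> 1 over warmup_steps, then holds at 1.
    Illustrative only; Lyra's actual schedule is not documented here."""
    return min(1.0, step / warmup_steps)

# total_loss = recon_loss + kl_beta(step) * kl_loss
```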
## Citation

```bibtex
@software{vae_lyra_2025,
  author = {AbstractPhil},
  title = {VAE Lyra: Multi-Modal Variational Autoencoder},
  year = {2025},
  url = {https://huggingface.co/AbstractPhil/vae-lyra}
}
```