---
tags:
- vae
- multimodal
- text-embeddings
- clip
- t5
license: mit
---

# VAE Lyra 🎵

Multi-modal Variational Autoencoder for text embedding transformation using geometric fusion.

## Model Details

- **Fusion Strategy**: cantor
- **Latent Dimension**: 768
- **Training Steps**: 11,899
- **Best Loss**: 0.1997

## Architecture

- **Modalities**: CLIP-L (768d) + T5-base (768d)
- **Encoder Layers**: 3
- **Decoder Layers**: 3
- **Hidden Dimension**: 1024

## Usage

```python
from geovocab2.train.model.vae.vae_lyra import MultiModalVAE, MultiModalVAEConfig
from huggingface_hub import hf_hub_download
import torch

# Download model weights from the Hub
model_path = hf_hub_download(
    repo_id="AbstractPhil/vae-lyra",
    filename="model.pt"
)

# Load checkpoint (map to CPU so this works without a GPU)
checkpoint = torch.load(model_path, map_location="cpu")

# Create model with the configuration used for training
config = MultiModalVAEConfig(
    modality_dims={"clip": 768, "t5": 768},
    latent_dim=768,
    fusion_strategy="cantor"
)
model = MultiModalVAE(config)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Run the model on precomputed text-encoder outputs
inputs = {
    "clip": clip_embeddings,  # CLIP-L hidden states, [batch, 77, 768]
    "t5": t5_embeddings       # T5-base hidden states, [batch, 77, 768]
}
with torch.no_grad():
    reconstructions, mu, logvar = model(inputs)
```

## Training Details

- Trained on 10,000 diverse prompts
- Mix of LAION flavors (85%) and synthetic prompts (15%)
- KL Annealing: True
- Learning Rate: 0.0001

## Citation

```bibtex
@software{vae_lyra_2025,
  author = {AbstractPhil},
  title = {VAE Lyra: Multi-Modal Variational Autoencoder},
  year = {2025},
  url = {https://huggingface.co/AbstractPhil/vae-lyra}
}
```
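
## Sampling from the Latent Space

The `model(inputs)` call in the Usage section returns the latent mean `mu` and log-variance `logvar` alongside the reconstructions. To draw a latent sample from these, a VAE uses the standard reparameterization trick, `z = mu + sigma * eps` with `eps ~ N(0, I)`. A minimal sketch in plain PyTorch (`reparameterize` here is an illustrative helper, not part of the `vae_lyra` API):

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Standard VAE reparameterization: z = mu + sigma * eps, eps ~ N(0, I)."""
    std = torch.exp(0.5 * logvar)   # logvar stores log(sigma^2)
    eps = torch.randn_like(std)     # unit-Gaussian noise, same shape as std
    return mu + eps * std

# Shapes matching this card: [batch, 77 tokens, 768-d latent]
mu = torch.zeros(2, 77, 768)
logvar = torch.zeros(2, 77, 768)    # log-variance 0 -> unit variance
z = reparameterize(mu, logvar)
print(z.shape)  # torch.Size([2, 77, 768])
```

Sampling through `mu + sigma * eps` (rather than sampling `z` directly) keeps the draw differentiable with respect to `mu` and `logvar`, which is what allows the KL-annealed objective noted under Training Details to be optimized by backpropagation.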