---
tags:
- vae
- multimodal
- text-embeddings
- clip
- t5
license: mit
pipeline_tag: any-to-any
---
# MM-VAE Lyra 🎵
Multi-modal Variational Autoencoder for text embedding transformation using geometric fusion.
This first version is essentially CLIP-L + T5-base. It is similar in concept to the earlier shunt prototypes but entirely divergent in implementation: this variation is formatted and trained specifically as a VAE that encodes and decodes pairs of encodings together.
The current checkpoint was trained on only a handful of token sequences, so it is essentially front-loaded: expect short sequences to work well. Full-sequence pretraining will begin soon, using a uniform vocabulary that takes both token streams in and produces a representative uniform token for each position.
Lyra is trained specifically to encode and decode a PAIR of encodings, each slightly twisted and warped toward the direction of intention learned in training. This is not your usual VAE, but she is most definitely trained like one.
Example prompt:

> A lone cybernetic deer with glimmering silver antlers stands beneath a fractured aurora sky, surrounded by glowing fungal trees, floating quartz shards, and bio-luminescent fog. In the distance, ruined monoliths pulse faint glyphs of a forgotten language, while translucent jellyfish swim through the air above a reflective obsidian lake. The atmosphere is electric with tension, color-shifting through prismatic hues. Distant thunderclouds churn violently.

She will do her job when fully trained.
## Model Details
- Fusion Strategy: cantor
- Latent Dimension: 768
- Training Steps: 31,899
- Best Loss: 0.1840
## Architecture
- Modalities: CLIP-L (768d) + T5-base (768d)
- Encoder Layers: 3
- Decoder Layers: 3
- Hidden Dimension: 1024
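
The listed architecture can be sketched as per-modality encoders feeding a shared 768-d latent, with per-modality decoders reading it back out. This is a minimal illustrative sketch only, not the actual `MultiModalVAE` implementation: the real "cantor" fusion is replaced here by a simple concat-and-project, and all class and layer names are hypothetical.

```python
import torch
import torch.nn as nn

class TinyMMVAE(nn.Module):
    """Illustrative two-modality VAE with the dimensions listed above."""

    def __init__(self, dims, latent=768, hidden=1024):
        super().__init__()
        self.enc = nn.ModuleDict({k: nn.Linear(d, hidden) for k, d in dims.items()})
        self.to_mu = nn.Linear(hidden * len(dims), latent)
        self.to_logvar = nn.Linear(hidden * len(dims), latent)
        self.dec = nn.ModuleDict({k: nn.Linear(latent, d) for k, d in dims.items()})

    def forward(self, inputs):
        # Encode each modality, then fuse by concatenation
        # (placeholder for the actual "cantor" fusion strategy)
        h = torch.cat([self.enc[k](v) for k, v in inputs.items()], dim=-1)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # Decode the shared latent back into each modality
        recons = {k: self.dec[k](z) for k in inputs}
        return recons, mu, logvar

vae = TinyMMVAE({"clip": 768, "t5": 768})
x = {"clip": torch.randn(2, 77, 768), "t5": torch.randn(2, 77, 768)}
recons, mu, logvar = vae(x)
```

Both modalities are reconstructed from the same latent, which is what lets the model warp each encoding toward the other during decode.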
## Usage
```python
import torch
from huggingface_hub import hf_hub_download
from geovocab2.train.model.vae.vae_lyra import MultiModalVAE, MultiModalVAEConfig

# Download the checkpoint from the Hub
model_path = hf_hub_download(
    repo_id="AbstractPhil/vae-lyra",
    filename="model.pt"
)

# Load checkpoint
checkpoint = torch.load(model_path, map_location="cpu")

# Create the model with the same config it was trained with
config = MultiModalVAEConfig(
    modality_dims={"clip": 768, "t5": 768},
    latent_dim=768,
    fusion_strategy="cantor"
)
model = MultiModalVAE(config)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Run the model on precomputed text embeddings
inputs = {
    "clip": clip_embeddings,  # [batch, 77, 768]
    "t5": t5_embeddings       # [batch, 77, 768]
}
reconstructions, mu, logvar = model(inputs)
```
## Training Details
- Trained on 10,000 diverse prompts
- Mix of LAION flavors (85%) and synthetic prompts (15%)
- KL Annealing: True
- Learning Rate: 0.0001
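
The KL annealing noted above is commonly implemented by ramping the KL weight β from 0 to 1 over a warmup period, so early training focuses on reconstruction. A hedged sketch of that pattern follows; the actual schedule and loss weighting used for Lyra are not documented here, and `warmup_steps` is an illustrative value.

```python
import torch
import torch.nn.functional as F

def kl_annealed_vae_loss(recons, targets, mu, logvar, step, warmup_steps=10_000):
    # Reconstruction term: MSE summed over all modalities
    recon = sum(F.mse_loss(recons[k], targets[k]) for k in targets)
    # KL divergence of N(mu, sigma^2) from N(0, I), averaged over the batch
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # Linear KL annealing: beta ramps 0 -> 1 over the warmup period
    beta = min(1.0, step / warmup_steps)
    return recon + beta * kl

# Toy tensors standing in for model outputs and training targets
targets = {"clip": torch.randn(2, 77, 768), "t5": torch.randn(2, 77, 768)}
recons = {k: v + 0.1 * torch.randn_like(v) for k, v in targets.items()}
mu = torch.zeros(2, 77, 768)
logvar = torch.zeros(2, 77, 768)
loss = kl_annealed_vae_loss(recons, targets, mu, logvar, step=5_000)
```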
## Citation
```bibtex
@software{vae_lyra_2025,
  author = {AbstractPhil},
  title = {VAE Lyra: Multi-Modal Variational Autoencoder},
  year = {2025},
  url = {https://huggingface.co/AbstractPhil/vae-lyra}
}
