ATOM VQ-VAE
A lightweight (3.6 MB), plug-and-play Vector Quantized Variational Autoencoder (VQ-VAE) designed for rapid experimentation and seamless integration into autoregressive vision models.
ATOM VQ-VAE compresses standard 128x128 RGB images into a discrete 16x16 grid of visual tokens, drawing on a rich codebook of 2048 embeddings.
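For intuition, the numbers above work out as follows (a quick back-of-the-envelope check, not part of the library):

```python
# Each 128x128 RGB image holds 128 * 128 * 3 = 49,152 8-bit values.
pixels_bits = 128 * 128 * 3 * 8  # 393,216 bits

# The 16x16 token grid has 256 tokens; each token indexes a codebook
# of 2048 entries, so it needs log2(2048) = 11 bits.
tokens = 16 * 16                 # 256 tokens per image
token_bits = tokens * 11         # 2,816 bits

print(pixels_bits // token_bits) # -> 139, i.e. ~139x fewer bits
```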
Key Features
- Built for Inference: Features built-in `encode_to_tokens` and `decode_from_tokens` methods. No manual tensor reshaping required.
- Native PIL Support: Pass standard `PIL.Image` objects directly into the encoder; the model handles resizing and tensor conversion automatically.
- Modern Weights: Fully integrated with `safetensors` for fast, secure weight loading.
- Lightweight Architecture: Rapid training and inference, perfect for single-GPU setups or edge devices. ESP32 inference code coming soon!
[ ! ] Important notice
The repository contains two weight files: atom_vqvae.safetensors and atom_vqvae_2_full.safetensors.
Use the second one: the first (legacy) model suffered from codebook collapse, meaning it does not use all of the tokens available to it.
See trainFixedV2.py for how the codebook collapse issue was resolved.
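Codebook collapse is easy to spot by measuring how many codebook entries the encoder actually emits. The helper below is written for illustration here (it is not part of the repository); it works on any integer tensor of token indices, such as the [B, 16, 16] output of encode_to_tokens:

```python
import torch

def codebook_usage(tokens: torch.Tensor, codebook_size: int = 2048) -> float:
    """Fraction of codebook entries used in a batch of token grids.

    Values far below 1.0 over a large, diverse batch suggest collapse.
    """
    used = torch.unique(tokens).numel()
    return used / codebook_size

# Synthetic examples: a healthy encoder spreads indices over the whole
# codebook, while a collapsed one concentrates on a handful of entries.
healthy = torch.randint(0, 2048, (8, 16, 16))
collapsed = torch.randint(0, 16, (8, 16, 16))
print(codebook_usage(healthy), codebook_usage(collapsed))
```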
Training Data
The current pre-trained weights were trained on a highly diverse subset of 10,000 random images sampled from the liuhaotian/LLaVA-Pretrain HuggingFace dataset. This gives the visual codebook a strong baseline understanding of general real-world objects, text, and structures.
The license of the images in that dataset is:
CC-3M The dataset may be freely used for any purpose, although acknowledgement of Google LLC ("Google") as the data source would be appreciated. The dataset is provided "AS IS" without any warranty, express or implied. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Installation
Make sure you have PyTorch and the required inference libraries installed:
pip install torch torchvision safetensors Pillow tqdm
Quick Start: Inference
To generate tokens from an image or reconstruct an image from tokens, you only need a few lines of code.
import torch
from safetensors.torch import load_model
from PIL import Image
from torchvision import transforms
# Assuming atom_vqvae.py is in your directory
from AtomVQVAE import ATOMVQVAE
# 1. Initialize the model
model = ATOMVQVAE()
# 2. Load the secure Safetensors weights
load_model(model, "atom_vqvae_2_full.safetensors")
model.eval()
# 3. Load your image
image = Image.open("sample.jpg").convert("RGB")
# 4. Image -> Tokens (Shape: [1, 16, 16])
tokens = model.encode_to_tokens(image)
print(f"Extracted tokens shape: {tokens.shape}")
# 5. Tokens -> Image
reconstructed_tensor = model.decode_from_tokens(tokens)
# 6. Save the output
to_pil = transforms.ToPILImage()
final_image = to_pil(reconstructed_tensor.squeeze(0))
final_image.save("reconstructed_sample.png")
A full inference script is available in the repository as run.py.
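Since the model is aimed at autoregressive vision models, a common next step is to flatten the token grid into a sequence. This is a small sketch with synthetic tokens standing in for real encode_to_tokens output (shapes assumed from the Quick Start above):

```python
import torch

# Stand-in for real tokens; encode_to_tokens returns shape [1, 16, 16].
tokens = torch.randint(0, 2048, (1, 16, 16))

flat = tokens.view(1, -1)    # [1, 256] sequence for an autoregressive model
grid = flat.view(1, 16, 16)  # reshape predictions back for decode_from_tokens
print(flat.shape, grid.shape)
```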
Architecture Details
ATOM VQ-VAE utilizes a 3-step downsampling and upsampling approach:
- Input: 3 x 128 x 128
- Latent Grid: 16 x 16
- Embedding Dimension: 128
- Vocabulary Size (Codebook): 2048 tokens
- Compression Ratio: Images are spatially reduced by a factor of 8 ($128 \to 64 \to 32 \to 16$).
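The downsampling path implied by these numbers can be sketched as three stride-2 convolutions. Note this is a hypothetical reconstruction for illustration only; the actual layers in AtomVQVAE may differ:

```python
import torch
import torch.nn as nn

# Each Conv2d(kernel=4, stride=2, padding=1) halves the spatial size,
# so three of them take 128 -> 64 -> 32 -> 16.
encoder = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1),    # 128 -> 64
    nn.ReLU(),
    nn.Conv2d(64, 128, 4, stride=2, padding=1),  # 64 -> 32
    nn.ReLU(),
    nn.Conv2d(128, 128, 4, stride=2, padding=1), # 32 -> 16
)

x = torch.randn(1, 3, 128, 128)
z = encoder(x)
print(z.shape)  # torch.Size([1, 128, 16, 16])
```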
Training Your Own Model
Want to train ATOM on your own dataset? We provide a ready-to-use training loop. Just organize your images inside a standard folder structure (e.g., my_dataset/images/img1.jpg).
from train import train_atom_vqvae
# Starts the training loop for 10 epochs
# Automatically handles resizing and MSE + VQ-Loss calculation
train_atom_vqvae(
data_dir="./my_dataset",
epochs=10,
batch_size=32,
learning_rate=2e-4
)
You may need to modify the train.py script to make better use of your hardware.
Note: The training script saves standard .pth checkpoints. You can easily convert these to .safetensors using the safetensors.torch.save_file utility.
License
This project is licensed under the Apache License 2.0.
You are free to use, modify, and distribute this architecture and its weights commercially or privately, provided you include the original copyright notice and state any significant changes made to the software.
Thank you for using this model; I am happy to hear what you create! Feel free to contact me if you run into any issues.