VINO — Unified Visual Generator (Official Weights)

VINO: A Unified Visual Generator with Interleaved OmniModal Context

🌐 Project Page • 📑 Paper • 💻 Code • 📺 Demo Video

🔥 What is VINO?

VINO is a unified framework for image and video generation and editing, powered by a Vision-Language Model (VLM) and a Multi-Modal Diffusion Transformer (MMDiT).

A single set of weights supports:

Text-to-Image
Text-to-Video
Image-to-Video
Multi-Image-to-Video
Image Editing
Video Editing
Element Cloning

One model. All visual generation & editing tasks.

📦 Contents of This Repository

This Hugging Face repository provides the official VINO model weights, including:

MMDiT backbone
Learnable multimodal tokens

These weights are intended to be used with:

👉 https://github.com/SOTAMak1r/VINO-code

🧩 Required Base Models

VINO depends on the following public checkpoints:

Component	Source
VLM	Qwen/Qwen3-VL-4B-Instruct
Video VAE	hunyuanvideo-community/HunyuanVideo

The VINO codebase downloads them automatically.
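
If you would rather fetch the base checkpoints ahead of time (for example, before moving to a machine with limited connectivity), a minimal sketch with `huggingface_hub.snapshot_download` could look like the following. The names `BASE_MODELS` and `prefetch_base_models` are illustrative and not part of the VINO codebase. The sketch pulls each repository in full, which may be more than VINO actually needs, and it assumes the VINO code resolves these models through the standard Hugging Face cache.

```python
# Base checkpoints from the table above (component -> Hugging Face repo id).
BASE_MODELS = {
    "VLM": "Qwen/Qwen3-VL-4B-Instruct",
    "Video VAE": "hunyuanvideo-community/HunyuanVideo",
}


def prefetch_base_models(token=None, cache_dir=None):
    """Download every base checkpoint and return {component: local_path}.

    Hypothetical helper: it only warms the Hugging Face cache. Whether the
    VINO code reuses that cache location is an assumption.
    """
    # Imported lazily so the module can be inspected without the dependency.
    from huggingface_hub import snapshot_download

    paths = {}
    for component, repo_id in BASE_MODELS.items():
        # snapshot_download returns the local folder holding the snapshot.
        paths[component] = snapshot_download(
            repo_id=repo_id,
            token=token,
            cache_dir=cache_dir,
        )
    return paths
```

Calling `prefetch_base_models()` with no arguments uses the default cache directory and any token already configured on the machine.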

⬇️ Download

Option 1: Hugging Face CLI

huggingface-cli download SOTAMak1r/VINO-weight \
  --local-dir ./checkpoints/SOTAMak1r/VINO-weight \
  --local-dir-use-symlinks False
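
For scripted setups, the same download can be sketched in Python with `huggingface_hub.snapshot_download`. The helper names `default_local_dir` and `download_vino_weights` are illustrative and not part of the VINO codebase; the target directory simply mirrors the CLI command above.

```python
from pathlib import Path

# Repository id taken from the CLI command above.
REPO_ID = "SOTAMak1r/VINO-weight"


def default_local_dir(root="./checkpoints"):
    """Return the folder used by the CLI example: <root>/SOTAMak1r/VINO-weight."""
    return Path(root) / REPO_ID


def download_vino_weights(local_dir=None, token=None):
    """Download the VINO weights and return the local folder path.

    Hypothetical helper: `token` is only needed if your account setup
    requires authentication for this repository.
    """
    # Imported lazily so the path helper works without the dependency.
    from huggingface_hub import snapshot_download

    target = Path(local_dir) if local_dir is not None else default_local_dir()
    target.mkdir(parents=True, exist_ok=True)
    return snapshot_download(
        repo_id=REPO_ID,
        local_dir=str(target),
        token=token,
    )
```

Calling `download_vino_weights()` places the files under `./checkpoints/SOTAMak1r/VINO-weight`, the same location as the CLI option.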

Option 2: Inside the VINO Repo (recommended)

python download.py --ak YOUR_HF_TOKEN

🚀 Quick Start

See full instructions in:

👉 https://github.com/SOTAMak1r/VINO-code

📄 License

Model Weights: CC BY-NC 4.0 (Non-Commercial Only)
Code: Apache 2.0

📝 Citation

@article{chen2026vino,
  title={VINO: A Unified Visual Generator with Interleaved OmniModal Context},
  author={Chen, Junyi and He, Tong and Fu, Zhoujie and Wan, Pengfei and Gai, Kun and Ye, Weicai},
  journal={arXiv preprint arXiv:2601.02358},
  year={2026}
}

Paper: VINO: A Unified Visual Generator with Interleaved OmniModal Context (arXiv:2601.02358, published Jan 5)