VINO: Unified Visual Generator (Official Weights)
VINO: A Unified Visual Generator with Interleaved OmniModal Context
Project Page • Paper • Code • Demo Video
What is VINO?
VINO is a unified framework for image and video generation and editing, powered by a Vision-Language Model (VLM) and a Multi-Modal Diffusion Transformer (MMDiT).
A single set of weights supports:
- Text-to-Image
- Text-to-Video
- Image-to-Video
- Multi-Image-to-Video
- Image Editing
- Video Editing
- Element Cloning
One model. All visual generation & editing tasks.
Contents of This Repository
This Hugging Face repository provides the official VINO model weights, including:
- MMDiT backbone
- Learnable multimodal tokens
These weights are intended to be used with:
https://github.com/SOTAMak1r/VINO-code
Required Base Models
VINO depends on the following public checkpoints:
| Component | Source |
|---|---|
| VLM | Qwen/Qwen3-VL-4B-Instruct |
| Video VAE | hunyuanvideo-community/HunyuanVideo |
The VINO codebase downloads both checkpoints automatically.
Download
Option 1: Hugging Face CLI
huggingface-cli download SOTAMak1r/VINO-weight \
--local-dir ./checkpoints/SOTAMak1r/VINO-weight \
--local-dir-use-symlinks False
Option 2: Inside the VINO Repo (recommended)
python download.py --ak YOUR_HF_TOKEN
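For scripted setups, the CLI command in Option 1 can also be expressed in Python. The following is a minimal sketch, assuming the `huggingface_hub` package is installed. The helper names `default_local_dir` and `download_weights` are hypothetical and are not part of the VINO codebase. The target directory mirrors the `--local-dir` path from Option 1.

```python
from pathlib import Path

# Repository ID taken from the CLI command in Option 1.
REPO_ID = "SOTAMak1r/VINO-weight"


def default_local_dir(root="./checkpoints"):
    """Return the target folder used in Option 1: <root>/SOTAMak1r/VINO-weight."""
    return Path(root) / REPO_ID


def download_weights(root="./checkpoints", token=None):
    """Download the full weight repository and return the local folder path.

    Hypothetical helper: it wraps huggingface_hub.snapshot_download and is
    not the download.py script shipped with the VINO code.
    """
    # Imported lazily so the path helper above works without the package.
    from huggingface_hub import snapshot_download

    target = default_local_dir(root)
    return snapshot_download(
        repo_id=REPO_ID,
        local_dir=str(target),
        token=token,  # None falls back to a locally saved token, if any
    )
```

Calling `download_weights()` fetches the files into `./checkpoints/SOTAMak1r/VINO-weight`. Pass `token="YOUR_HF_TOKEN"` if access to the repository requires authentication.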
Quick Start
See full instructions in:
https://github.com/SOTAMak1r/VINO-code
License
- Model Weights: CC BY-NC 4.0 (Non-Commercial Only)
- Code: Apache 2.0
Citation
@article{chen2026vino,
title={VINO: A Unified Visual Generator with Interleaved OmniModal Context},
author={Chen, Junyi and He, Tong and Fu, Zhoujie and Wan, Pengfei and Gai, Kun and Ye, Weicai},
journal={arXiv preprint arXiv:2601.02358},
year={2026}
}