---
pipeline_tag: image-feature-extraction
library_name: open_clip
license: apache-2.0
---

# OpenVision 3

OpenVision 3 is a family of vision encoders that learn a single, unified visual representation serving both image understanding and image generation. The architecture feeds VAE-compressed image latents into a ViT encoder and trains the encoder's output for two complementary roles: reconstructing the original image through a ViT-VAE decoder to capture generative structure, and optimizing contrastive and image-captioning objectives to strengthen semantic features.
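The dual-objective training described above can be sketched as a weighted sum of a reconstruction loss and semantic losses. This is a minimal illustration only: the function names, loss weights, and toy tensor shapes are assumptions for exposition, not the released training code.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of the unified objective: generative reconstruction
# plus contrastive/captioning semantics. Names and weights are illustrative.

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss between image and text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature
    labels = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.T, labels)) / 2

def unified_loss(recon, target_latents, img_emb, txt_emb, caption_loss,
                 w_recon=1.0, w_clip=1.0, w_cap=1.0):
    """Combine the generative (reconstruction) and semantic objectives."""
    recon_loss = F.mse_loss(recon, target_latents)   # decoder reconstructs latents
    clip_loss = contrastive_loss(img_emb, txt_emb)   # image-text alignment
    return w_recon * recon_loss + w_clip * clip_loss + w_cap * caption_loss

# Toy shapes: batch of 4, a 4x32x32 latent grid, embedding width 8.
recon = torch.randn(4, 4, 32, 32)
latents = torch.randn(4, 4, 32, 32)
img_emb, txt_emb = torch.randn(4, 8), torch.randn(4, 8)
loss = unified_loss(recon, latents, img_emb, txt_emb,
                    caption_loss=torch.tensor(0.5))
print(loss.item())
```

In practice the captioning term would come from a text decoder's cross-entropy over ground-truth captions; it is passed in as a scalar here to keep the sketch self-contained.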

## Usage

This model can be used with the OpenCLIP library.

```python
import torch
import torch.nn.functional as F
from urllib.request import urlopen
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

# Replace with the specific model ID from the Hub
model_id = 'hf-hub:UCSC-VLAA/openvision-vit-large-patch14-224'

model, preprocess = create_model_from_pretrained(model_id)
tokenizer = get_tokenizer(model_id)
model.eval()  # inference mode

image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image = preprocess(image).unsqueeze(0)

text = tokenizer(["a diagram", "a dog", "a cat", "a beignet"], context_length=model.context_length)

# torch.cuda.amp.autocast is deprecated; torch.amp.autocast takes the device type.
with torch.no_grad(), torch.amp.autocast('cuda'):
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```

## Citation

```bibtex
@article{zhang2026openvision3,
  title={OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation},
  author={Zhang, Letian and Ren, Sucheng and Liu, Yanqing and Li, Xianhang and Wang, Zeyu and Zhou, Yuyin and Yao, Huaxiu and Zheng, Zeyu and Nie, Weili and Liu, Guilin and Yu, Zhiding and Xie, Cihang},
  journal={arXiv preprint arXiv:2601.15369},
  year={2026}
}
```