---
license: apache-2.0
library_name: open_clip
pipeline_tag: image-feature-extraction
---
# OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation
OpenVision 3 is a family of advanced vision encoders that learn a single, unified visual representation serving both image understanding and image generation.
Our core architecture is simple: we feed VAE-compressed image latents to a ViT encoder and train its output to support two complementary roles. First, the encoder output is passed to the ViT-VAE decoder to reconstruct the original image, encouraging the representation to capture generative structure. Second, the same representation is optimized with contrastive learning and image-captioning objectives, strengthening semantic features.
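To make the two roles concrete, below is a minimal, self-contained PyTorch sketch of this dual-objective setup. Every module here (`vae_encoder`, `vit_encoder`, `vae_decoder`) is a trivial stand-in invented for illustration, not the released architecture or training code, and the captioning objective is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins for the components described above.
vae_encoder = nn.Conv2d(3, 16, kernel_size=8, stride=8)          # image -> compressed latents
vit_encoder = nn.TransformerEncoder(                              # ViT-style encoder over latent tokens
    nn.TransformerEncoderLayer(d_model=16, nhead=4, batch_first=True), num_layers=2
)
vae_decoder = nn.ConvTranspose2d(16, 3, kernel_size=8, stride=8)  # latents -> reconstructed image

images = torch.randn(4, 3, 64, 64)   # dummy image batch
text_emb = torch.randn(4, 16)        # dummy text embeddings for the contrastive branch

latents = vae_encoder(images)                   # (B, C, H', W') compressed latents
tokens = latents.flatten(2).transpose(1, 2)     # (B, N, C) token sequence for the ViT
features = vit_encoder(tokens)                  # unified visual representation

# Role 1: generative reconstruction through the decoder.
recon = vae_decoder(features.transpose(1, 2).reshape_as(latents))
loss_recon = F.mse_loss(recon, images)

# Role 2: semantic alignment via a CLIP-style symmetric contrastive loss.
img_emb = F.normalize(features.mean(dim=1), dim=-1)
txt_emb = F.normalize(text_emb, dim=-1)
logits = 100.0 * img_emb @ txt_emb.T
labels = torch.arange(4)
loss_clip = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

# Both roles train the same representation; the loss weighting is a free choice here.
loss = loss_recon + loss_clip
```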
- **Paper:** [OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation](https://huggingface.co/papers/2601.15369)
- **Project Page:** [https://ucsc-vlaa.github.io/OpenVision3/](https://ucsc-vlaa.github.io/OpenVision3/)
- **Repository:** [https://github.com/UCSC-VLAA/OpenVision](https://github.com/UCSC-VLAA/OpenVision)
## Usage
The OpenVision 3 vision encoders are compatible with the `open_clip` library. The example below runs zero-shot classification of an image against a set of text queries.
```python
import torch
import torch.nn.functional as F
from urllib.request import urlopen
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

# Replace 'UCSC-VLAA/openvision3-model-id' with the actual repository ID
model_id = 'hf-hub:UCSC-VLAA/openvision3-model-id'
model, preprocess = create_model_from_pretrained(model_id)
tokenizer = get_tokenizer(model_id)

# Load and preprocess the image
image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image = preprocess(image).unsqueeze(0)

# Prepare text queries
text = tokenizer(["a diagram", "a dog", "a cat", "a beignet"], context_length=model.context_length)

# torch.autocast replaces the deprecated torch.cuda.amp.autocast
with torch.no_grad(), torch.autocast('cuda'):
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Cosine similarities scaled by 100, softmaxed into per-label probabilities
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```
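Because the checkpoint is tagged `image-feature-extraction`, the encoder can also be used on its own as an image embedder, with no text branch involved. A minimal sketch reusing `model` and the preprocessed `image` tensor from the example above:

```python
# Standalone image feature extraction with the same model.
with torch.no_grad():
    embedding = model.encode_image(image)      # `image` tensor from the example above
embedding = F.normalize(embedding, dim=-1)     # unit-norm embedding, e.g. for retrieval
print(embedding.shape)
```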
## Citation
```bibtex
@article{zhang2026openvision3,
title = {OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation},
author = {Zhang, Letian and Ren, Sucheng and Liu, Yanqing and Li, Xianhang and Wang, Zeyu and Zhou, Yuyin and Yao, Huaxiu and Zheng, Zeyu and Nie, Weili and Liu, Guilin and Yu, Zhiding and Xie, Cihang},
journal = {arXiv preprint arXiv:2601.15369},
year = {2026}
}
```