Add model card for OpenVision 3

#1
by nielsr HF Staff - opened
Files changed (1): README.md (+59 -0)
README.md ADDED
@@ -0,0 +1,59 @@
+ ---
+ pipeline_tag: image-feature-extraction
+ library_name: open_clip
+ license: apache-2.0
+ ---
+
+ # OpenVision 3
+
+ OpenVision 3 is a family of vision encoders that learn a single, unified visual representation serving both image understanding and image generation. The architecture feeds VAE-compressed image latents into a ViT encoder and trains its output for two complementary roles: reconstructing the original image through a ViT-VAE decoder to capture generative structure, and optimizing contrastive and image-captioning objectives to strengthen semantic features.
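+
+ As a rough illustration of how these objectives combine, here is a minimal training-step sketch; the module names (`vae`, `vit_encoder`, `vit_vae_decoder`, `text_encoder`, `captioner`) and the loss weights are hypothetical stand-ins inferred from the description above, not the released training code:
+
+ ```python
+ # Hypothetical sketch of the dual-objective training step (not the released code)
+ import torch
+ import torch.nn.functional as F
+
+ def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
+     # Symmetric InfoNCE over the batch, as in CLIP-style training
+     img_emb = F.normalize(img_emb, dim=-1)
+     txt_emb = F.normalize(txt_emb, dim=-1)
+     logits = img_emb @ txt_emb.T / temperature
+     labels = torch.arange(logits.size(0), device=logits.device)
+     return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
+
+ def training_step(images, captions, vae, vit_encoder, vit_vae_decoder,
+                   text_encoder, captioner, w_rec=1.0, w_con=1.0, w_cap=1.0):
+     latents = vae.encode(images)         # VAE-compressed image latents
+     features = vit_encoder(latents)      # unified visual representation (B, N, D)
+
+     # Generative role: reconstruct the original images via the ViT-VAE decoder
+     loss_rec = F.mse_loss(vit_vae_decoder(features), images)
+
+     # Semantic role: contrastive alignment with text plus a captioning objective
+     loss_con = clip_contrastive_loss(features.mean(dim=1), text_encoder(captions))
+     loss_cap = captioner(features, captions)  # assumed to return an autoregressive loss
+
+     return w_rec * loss_rec + w_con * loss_con + w_cap * loss_cap
+ ```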
+
+ - **Paper:** [OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation](https://huggingface.co/papers/2601.15369)
+ - **Project Page:** [https://ucsc-vlaa.github.io/OpenVision3/](https://ucsc-vlaa.github.io/OpenVision3/)
+ - **Repository:** [https://github.com/UCSC-VLAA/OpenVision](https://github.com/UCSC-VLAA/OpenVision)
+
+ ## Usage
+
+ This model can be used with the [OpenCLIP](https://github.com/mlfoundations/open_clip) library.
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+ from urllib.request import urlopen
+ from PIL import Image
+ from open_clip import create_model_from_pretrained, get_tokenizer
+
+ # Replace with the specific model ID from the Hub
+ model_id = 'hf-hub:UCSC-VLAA/openvision-vit-large-patch14-224'
+
+ model, preprocess = create_model_from_pretrained(model_id)
+ tokenizer = get_tokenizer(model_id)
+ model.eval()  # inference mode
+
+ # Load an example image and apply the model's preprocessing transform
+ image = Image.open(urlopen(
+     'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
+ ))
+ image = preprocess(image).unsqueeze(0)
+
+ # Tokenize candidate captions for zero-shot classification
+ text = tokenizer(["a diagram", "a dog", "a cat", "a beignet"], context_length=model.context_length)
+
+ with torch.no_grad():
+     image_features = model.encode_image(image)
+     text_features = model.encode_text(text)
+     image_features = F.normalize(image_features, dim=-1)
+     text_features = F.normalize(text_features, dim=-1)
+
+     # Scaled cosine similarities, softmaxed into per-label probabilities
+     text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
+
+ print("Label probs:", text_probs)
+ ```
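+
+ Since the pipeline tag is image-feature-extraction, the image embeddings can also be used on their own, e.g. for retrieval or a linear probe. A minimal sketch reusing `model` and `image` from the example above:
+
+ ```python
+ # Extract a pooled, L2-normalized image embedding for downstream use
+ with torch.no_grad():
+     embedding = F.normalize(model.encode_image(image), dim=-1)
+ print(embedding.shape)  # (1, embed_dim); dimensionality depends on the model variant
+ ```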
+
+ ## Citation
+
+ ```bibtex
+ @article{zhang2026openvision3,
+   title={OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation},
+   author={Zhang, Letian and Ren, Sucheng and Liu, Yanqing and Li, Xianhang and Wang, Zeyu and Zhou, Yuyin and Yao, Huaxiu and Zheng, Zeyu and Nie, Weili and Liu, Guilin and Yu, Zhiding and Xie, Cihang},
+   journal={arXiv preprint arXiv:2601.15369},
+   year={2026}
+ }
+ ```