nielsr HF Staff committed
Commit 5970ea6 · verified · 1 Parent(s): 7c5fbd3

Add model card for OpenVision 3


This PR adds a model card for OpenVision 3, a family of unified visual encoders designed for both image understanding and generation.

The card includes:
- Metadata for `pipeline_tag`, `library_name` (`open_clip`), and `license`.
- A brief description of the architecture and training objectives.
- Links to the paper, project page, and GitHub repository.
- A sample usage code snippet based on the official documentation.
- Citation information.

Files changed (1)
  1. README.md +59 -0
README.md ADDED
---
pipeline_tag: image-feature-extraction
library_name: open_clip
license: apache-2.0
---

# OpenVision 3

OpenVision 3 is a family of vision encoders that learn a single, unified visual representation serving both image understanding and image generation. The architecture feeds VAE-compressed image latents into a ViT encoder and trains the encoder's output for two complementary roles: reconstructing the original image through a ViT-VAE decoder, which captures generative structure, and aligning with contrastive and image-captioning objectives, which strengthens semantic features.
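
To make the two roles concrete, here is a minimal, illustrative sketch of such a dual-objective training step. All module names (`vae`, `vit_encoder`, `vitvae_decoder`, `text_encoder`, `caption_decoder`) and the equal loss weighting are assumptions for exposition, not the repository's actual API; the paper defines the exact objectives and weights.

```python
# Illustrative sketch of a unified training step: one encoder output feeds
# a generative (reconstruction) branch and a semantic (contrastive +
# captioning) branch. Module names and weights are hypothetical.
import torch
import torch.nn.functional as F

def training_step(images, token_ids, modules, w_rec=1.0, w_con=1.0, w_cap=1.0):
    # 1. Compress images into VAE latents, then encode them with the ViT.
    latents = modules["vae"].encode(images)        # (B, C, h, w) latents
    features = modules["vit_encoder"](latents)     # (B, N, D) tokens

    # 2. Generative branch: reconstruct the image from the encoder tokens.
    recon = modules["vitvae_decoder"](features)
    loss_rec = F.mse_loss(recon, images)

    # 3. Understanding branch: CLIP-style contrastive alignment with text.
    img_emb = F.normalize(features.mean(dim=1), dim=-1)
    txt_emb = F.normalize(modules["text_encoder"](token_ids), dim=-1)
    logits = 100.0 * img_emb @ txt_emb.T
    targets = torch.arange(images.size(0), device=images.device)
    loss_con = (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.T, targets)) / 2

    # 4. Captioning branch: predict caption tokens from the visual features
    #    with teacher forcing (shift inputs and targets by one position).
    logits_cap = modules["caption_decoder"](features, token_ids[:, :-1])
    loss_cap = F.cross_entropy(
        logits_cap.reshape(-1, logits_cap.size(-1)),
        token_ids[:, 1:].reshape(-1),
    )

    return w_rec * loss_rec + w_con * loss_con + w_cap * loss_cap
```

The reconstruction term pulls the representation toward pixel-level fidelity, while the contrastive and captioning terms pull it toward language-grounded semantics; training them jointly is what yields a single encoder usable for both tasks.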

- **Paper:** [OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation](https://huggingface.co/papers/2601.15369)
- **Project Page:** [https://ucsc-vlaa.github.io/OpenVision3/](https://ucsc-vlaa.github.io/OpenVision3/)
- **Repository:** [https://github.com/UCSC-VLAA/OpenVision](https://github.com/UCSC-VLAA/OpenVision)

## Usage

This model can be used with the [OpenCLIP](https://github.com/mlfoundations/open_clip) library.

```python
import torch
import torch.nn.functional as F
from urllib.request import urlopen
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

# Replace with the specific model ID from the Hub
model_id = 'hf-hub:UCSC-VLAA/openvision-vit-large-patch14-224'

model, preprocess = create_model_from_pretrained(model_id)
tokenizer = get_tokenizer(model_id)
model.eval()  # disable dropout and use eval-mode norm statistics for inference

image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image = preprocess(image).unsqueeze(0)

text = tokenizer(["a diagram", "a dog", "a cat", "a beignet"], context_length=model.context_length)

with torch.no_grad(), torch.autocast("cuda"):
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```
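
Since the card's `pipeline_tag` is `image-feature-extraction`, the encoder can also be used for embeddings alone. A minimal variant, reusing `model`, `preprocess`, and `image` from the snippet above:

```python
# Embedding-only use: no text tower involved.
with torch.no_grad():
    image_embedding = F.normalize(model.encode_image(image), dim=-1)

print(image_embedding.shape)  # e.g. torch.Size([1, 768]); the width depends on the checkpoint
```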

## Citation

```bibtex
@article{zhang2026openvision3,
  title={OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation},
  author={Zhang, Letian and Ren, Sucheng and Liu, Yanqing and Li, Xianhang and Wang, Zeyu and Zhou, Yuyin and Yao, Huaxiu and Zheng, Zeyu and Nie, Weili and Liu, Guilin and Yu, Zhiding and Xie, Cihang},
  journal={arXiv preprint arXiv:2601.15369},
  year={2026}
}
```