nielsr HF Staff committed on
Commit 7bd3a21 · verified · 1 Parent(s): f9009f7

Add model card for OpenVision 3

Hi! I'm Niels from the Hugging Face community science team.

This pull request adds a model card for OpenVision 3. It includes:
- Metadata for `pipeline_tag`, `library_name`, and `license`.
- A description of the model architecture and its unified representation for understanding and generation.
- Links to the paper, project page, and official GitHub repository.
- A sample usage snippet using the `open_clip` library.
- The appropriate citation for the research.

Files changed (1)
  1. README.md +64 -0
README.md ADDED
@@ -0,0 +1,64 @@
+ ---
+ license: apache-2.0
+ library_name: open_clip
+ pipeline_tag: image-feature-extraction
+ ---
+
+ # OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation
+
+ OpenVision 3 is a family of advanced vision encoders that learn a single, unified visual representation serving both image understanding and image generation.
+
+ Our core architecture is simple: we feed VAE-compressed image latents to a ViT encoder and train its output to support two complementary roles. First, the encoder output is passed to the ViT-VAE decoder to reconstruct the original image, encouraging the representation to capture generative structure. Second, the same representation is optimized with contrastive learning and image-captioning objectives, strengthening semantic features.
+
+ - **Paper:** [OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation](https://huggingface.co/papers/2601.15369)
+ - **Project Page:** [https://ucsc-vlaa.github.io/OpenVision3/](https://ucsc-vlaa.github.io/OpenVision3/)
+ - **Repository:** [https://github.com/UCSC-VLAA/OpenVision](https://github.com/UCSC-VLAA/OpenVision)
+
+ ## Usage
+
+ The OpenVision 3 vision encoders are compatible with the `open_clip` library.
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+ from urllib.request import urlopen
+ from PIL import Image
+ from open_clip import create_model_from_pretrained, get_tokenizer
+
+ # Replace 'UCSC-VLAA/openvision3-model-id' with the actual repository ID
+ model_id = 'hf-hub:UCSC-VLAA/openvision3-model-id'
+
+ model, preprocess = create_model_from_pretrained(model_id)
+ tokenizer = get_tokenizer(model_id)
+
+ # Run on GPU when available, otherwise fall back to CPU
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
+ model = model.to(device).eval()
+
+ # Load and preprocess image
+ image = Image.open(urlopen(
+     'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
+ ))
+ image = preprocess(image).unsqueeze(0).to(device)
+
+ # Prepare text queries
+ text = tokenizer(["a diagram", "a dog", "a cat", "a beignet"], context_length=model.context_length).to(device)
+
+ # Mixed precision only on GPU; torch.cuda.amp.autocast() is deprecated in recent PyTorch releases
+ with torch.no_grad(), torch.autocast(device_type=device, enabled=(device == 'cuda')):
+     image_features = model.encode_image(image)
+     text_features = model.encode_text(text)
+
+     # L2-normalize so the dot product below is a cosine similarity
+     image_features = F.normalize(image_features, dim=-1)
+     text_features = F.normalize(text_features, dim=-1)
+
+     text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
+
+ print("Label probs:", text_probs)
+ ```
+
+ ## Citation
+
+ ```bibtex
+ @article{zhang2026openvision3,
+   title   = {OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation},
+   author  = {Zhang, Letian and Ren, Sucheng and Liu, Yanqing and Li, Xianhang and Wang, Zeyu and Zhou, Yuyin and Yao, Huaxiu and Zheng, Zeyu and Nie, Weili and Liu, Guilin and Yu, Zhiding and Xie, Cihang},
+   journal = {arXiv preprint arXiv:2601.15369},
+   year    = {2026}
+ }
+ ```
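
A note on the scoring step in the README's usage snippet: it L2-normalizes the image and text features, scales their cosine similarities by 100, and applies a softmax over the candidate captions. The sketch below reproduces only that arithmetic in plain Python so it can be checked without downloading a checkpoint. The function names (`l2_normalize`, `label_probs`) and the three-dimensional toy vectors are made up for illustration and are not part of `open_clip` or OpenVision 3.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def label_probs(image_feature, text_features, scale=100.0):
    """Softmax over scaled cosine similarities between one image and several texts."""
    img = l2_normalize(image_feature)
    logits = [
        scale * sum(a * b for a, b in zip(img, l2_normalize(txt)))
        for txt in text_features
    ]
    # Subtract the max logit before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy features: the image points mostly along the first text direction.
image_feature = [0.9, 0.1, 0.0]
text_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

probs = label_probs(image_feature, text_features)
print("Label probs:", probs)
```

Because the similarities are multiplied by 100 before the softmax, even a modest gap in cosine similarity pushes almost all of the probability mass onto the best-matching caption, which is why the snippet's output tends to look close to one-hot.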