Add model card for OpenVision 3 (#1) by nielsr (HF Staff)

README.md (added):
---
pipeline_tag: image-feature-extraction
library_name: open_clip
license: apache-2.0
---

# OpenVision 3

OpenVision 3 is a family of advanced vision encoders that learn a single, unified visual representation capable of serving both image understanding and image generation. The architecture feeds VAE-compressed image latents to a ViT encoder and trains its output to support two complementary roles: reconstructing the original image via a ViT-VAE decoder to capture generative structure, and optimizing contrastive and image-captioning objectives to strengthen semantic features.

- **Paper:** [OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation](https://huggingface.co/papers/2601.15369)
- **Project Page:** [https://ucsc-vlaa.github.io/OpenVision3/](https://ucsc-vlaa.github.io/OpenVision3/)
- **Repository:** [https://github.com/UCSC-VLAA/OpenVision](https://github.com/UCSC-VLAA/OpenVision)

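The dual-objective design described above can be sketched in a few lines of PyTorch. Everything below (module shapes, the MSE reconstruction term, the temperature value) is an illustrative stand-in, not the released implementation; see the repository for the actual encoder and decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, C, H, W = 2, 4, 16, 16    # batch of VAE-compressed image latents (toy shapes)
D = 32                       # encoder embedding dimension

# Stand-ins for the ViT encoder and the ViT-VAE decoder.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(C * H * W, D))
decoder = nn.Sequential(nn.Linear(D, C * H * W), nn.Unflatten(1, (C, H, W)))
text_proj = nn.Linear(64, D)  # maps paired text features into the shared space

latents = torch.randn(B, C, H, W)
text_feats = torch.randn(B, 64)

z = encoder(latents)  # the unified visual representation

# Generative role: reconstruct the input latents from z.
recon_loss = F.mse_loss(decoder(z), latents)

# Semantic role: CLIP-style contrastive alignment with paired text.
img = F.normalize(z, dim=-1)
txt = F.normalize(text_proj(text_feats), dim=-1)
logits = img @ txt.T / 0.07   # temperature-scaled cosine similarities
labels = torch.arange(B)
contrastive_loss = (F.cross_entropy(logits, labels)
                    + F.cross_entropy(logits.T, labels)) / 2

loss = recon_loss + contrastive_loss  # joint objective (loss weights omitted)
```

The captioning objective from the paper is omitted here for brevity; it would add a text decoder trained to predict the caption from `z`.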
## Usage

This model can be used with the [OpenCLIP](https://github.com/mlfoundations/open_clip) library.

```python
import torch
import torch.nn.functional as F
from urllib.request import urlopen
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

# Replace with the specific model ID from the Hub
model_id = 'hf-hub:UCSC-VLAA/openvision-vit-large-patch14-224'

model, preprocess = create_model_from_pretrained(model_id)
tokenizer = get_tokenizer(model_id)

image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image = preprocess(image).unsqueeze(0)

text = tokenizer(["a diagram", "a dog", "a cat", "a beignet"], context_length=model.context_length)

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```

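The last two lines of the snippet are standard CLIP-style zero-shot scoring: cosine similarities between the normalized embeddings, scaled by 100 and softmaxed over the candidate captions. A self-contained toy example of just that arithmetic, with hand-made features in place of model outputs:

```python
import torch
import torch.nn.functional as F

# Toy normalized features: one image, three candidate captions.
image_features = F.normalize(torch.tensor([[1.0, 0.0]]), dim=-1)
text_features = F.normalize(torch.tensor([[1.0, 0.0],
                                          [0.0, 1.0],
                                          [0.7, 0.7]]), dim=-1)

# Cosine similarity scaled by 100 (as in the snippet above), then softmax.
text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(text_probs)  # the first caption, identical to the image, dominates
```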
## Citation

```bibtex
@article{zhang2026openvision3,
  title={OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation},
  author={Zhang, Letian and Ren, Sucheng and Liu, Yanqing and Li, Xianhang and Wang, Zeyu and Zhou, Yuyin and Yao, Huaxiu and Zheng, Zeyu and Nie, Weili and Liu, Guilin and Yu, Zhiding and Xie, Cihang},
  journal={arXiv preprint arXiv:2601.15369},
  year={2026}
}
```
|