nielsr HF Staff committed
Commit 5970ea6 · verified · 1 Parent(s): 7c5fbd3

Add model card for OpenVision 3


This PR adds a model card for OpenVision 3, a family of unified visual encoders designed for both image understanding and generation.

The card includes:
- Metadata for `pipeline_tag`, `library_name` (`open_clip`), and `license`.
- A brief description of the architecture and training objectives.
- Links to the paper, project page, and GitHub repository.
- A sample usage code snippet based on the official documentation.
- Citation information.

Files changed (1)
  1. README.md +59 -0
README.md ADDED
---
pipeline_tag: image-feature-extraction
library_name: open_clip
license: apache-2.0
---

# OpenVision 3

OpenVision 3 is a family of vision encoders that learn a single, unified visual representation serving both image understanding and image generation. The architecture feeds VAE-compressed image latents into a ViT encoder and trains the encoder's output for two complementary roles: reconstructing the original image through a ViT-VAE decoder, which captures generative structure, and aligning with contrastive and image-captioning objectives, which strengthens semantic features.
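
To make the two roles concrete, here is a minimal, illustrative sketch of such a dual-objective training step. All module names (`vae`, `vit_encoder`, `vitvae_decoder`, `text_encoder`, `caption_decoder`) and the equal loss weighting are assumptions for exposition, not the repository's actual API; the paper defines the exact objectives and weights.

```python
# Illustrative sketch of a unified training step: one encoder output feeds
# a generative (reconstruction) branch and a semantic (contrastive +
# captioning) branch. Module names and weights are hypothetical.
import torch
import torch.nn.functional as F

def training_step(images, token_ids, modules, w_rec=1.0, w_con=1.0, w_cap=1.0):
    # 1. Compress images into VAE latents, then encode them with the ViT.
    latents = modules["vae"].encode(images)        # (B, C, h, w) latents
    features = modules["vit_encoder"](latents)     # (B, N, D) tokens

    # 2. Generative branch: reconstruct the image from the encoder tokens.
    recon = modules["vitvae_decoder"](features)
    loss_rec = F.mse_loss(recon, images)

    # 3. Understanding branch: CLIP-style contrastive alignment with text.
    img_emb = F.normalize(features.mean(dim=1), dim=-1)
    txt_emb = F.normalize(modules["text_encoder"](token_ids), dim=-1)
    logits = 100.0 * img_emb @ txt_emb.T
    targets = torch.arange(images.size(0), device=images.device)
    loss_con = (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.T, targets)) / 2

    # 4. Captioning branch: predict caption tokens from the visual features
    #    with teacher forcing (shift inputs and targets by one position).
    logits_cap = modules["caption_decoder"](features, token_ids[:, :-1])
    loss_cap = F.cross_entropy(
        logits_cap.reshape(-1, logits_cap.size(-1)),
        token_ids[:, 1:].reshape(-1),
    )

    return w_rec * loss_rec + w_con * loss_con + w_cap * loss_cap
```

The reconstruction term pulls the representation toward pixel-level fidelity, while the contrastive and captioning terms pull it toward language-grounded semantics; training them jointly is what yields a single encoder usable for both tasks.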

- **Paper:** [OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation](https://huggingface.co/papers/2601.15369)
- **Project Page:** [https://ucsc-vlaa.github.io/OpenVision3/](https://ucsc-vlaa.github.io/OpenVision3/)
- **Repository:** [https://github.com/UCSC-VLAA/OpenVision](https://github.com/UCSC-VLAA/OpenVision)

## Usage

This model can be used with the [OpenCLIP](https://github.com/mlfoundations/open_clip) library.

```python
import torch
import torch.nn.functional as F
from urllib.request import urlopen
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

# Replace with the specific model ID from the Hub
model_id = 'hf-hub:UCSC-VLAA/openvision-vit-large-patch14-224'

model, preprocess = create_model_from_pretrained(model_id)
tokenizer = get_tokenizer(model_id)
model.eval()  # disable dropout and use eval-mode norm statistics for inference

image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image = preprocess(image).unsqueeze(0)

text = tokenizer(["a diagram", "a dog", "a cat", "a beignet"], context_length=model.context_length)

with torch.no_grad(), torch.autocast("cuda"):
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```
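
Since the card's `pipeline_tag` is `image-feature-extraction`, the encoder can also be used for embeddings alone. A minimal variant, reusing `model`, `preprocess`, and `image` from the snippet above:

```python
# Embedding-only use: no text tower involved.
with torch.no_grad():
    image_embedding = F.normalize(model.encode_image(image), dim=-1)

print(image_embedding.shape)  # e.g. torch.Size([1, 768]); the width depends on the checkpoint
```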

## Citation

```bibtex
@article{zhang2026openvision3,
  title={OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation},
  author={Zhang, Letian and Ren, Sucheng and Liu, Yanqing and Li, Xianhang and Wang, Zeyu and Zhou, Yuyin and Yao, Huaxiu and Zheng, Zeyu and Nie, Weili and Liu, Guilin and Yu, Zhiding and Xie, Cihang},
  journal={arXiv preprint arXiv:2601.15369},
  year={2026}
}
```