How to use UCSC-VLAA/openvision-vit-so400m-patch14-224 with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-feature-extraction", model="UCSC-VLAA/openvision-vit-so400m-patch14-224")

# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("UCSC-VLAA/openvision-vit-so400m-patch14-224", dtype="auto")
```
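The checkpoint name appears to follow the common ViT naming pattern. As a rough sketch, assuming the `patch14-224` suffix means 14×14-pixel patches on a 224×224 input (an assumption read from the name, not something this page states), the expected patch grid can be worked out as below. The helper `patch_grid` is hypothetical and exists only for illustration.

```python
import re

MODEL_ID = "UCSC-VLAA/openvision-vit-so400m-patch14-224"


def patch_grid(model_id: str) -> tuple[int, int]:
    """Return (patches_per_side, total_patches) parsed from a '...patch<P>-<R>' name.

    Assumes the trailing 'patch<P>-<R>' encodes patch size P and a square
    input resolution R. This is a naming convention, not a guarantee.
    """
    match = re.search(r"patch(\d+)-(\d+)$", model_id)
    if match is None:
        raise ValueError(f"no 'patch<P>-<R>' suffix in {model_id!r}")
    patch_size, resolution = int(match.group(1)), int(match.group(2))
    if resolution % patch_size != 0:
        raise ValueError("resolution is not a multiple of the patch size")
    per_side = resolution // patch_size
    return per_side, per_side * per_side


per_side, total = patch_grid(MODEL_ID)
print(f"{per_side} x {per_side} grid, {total} patch tokens")
```

Under that assumption the count covers image patches only. Whether the encoder also emits a class or pooled token is not stated here, so the actual sequence length returned by the model may differ.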
This repository contains an OpenVision model. OpenVision is a fully-open, cost-effective family of advanced vision encoders for multimodal learning, described in the paper "OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning".
Abstract:
OpenAI's CLIP, released in early 2021, has long been the go-to choice of vision encoder for building multimodal foundation models. Although recent alternatives such as SigLIP have begun to challenge this status quo, to our knowledge none are fully open: their training data remains proprietary and/or their training recipes are not released. This paper fills this gap with OpenVision, a fully-open, cost-effective family of vision encoders that match or surpass the performance of OpenAI's CLIP when integrated into multimodal frameworks like LLaVA. OpenVision builds on existing works -- e.g., CLIPS for training framework and Recap-DataComp-1B for training data -- while revealing multiple key insights in enhancing encoder quality and showcasing practical benefits in advancing multimodal models. By releasing vision encoders spanning from 5.9M to 632.1M parameters, OpenVision offers practitioners a flexible trade-off between capacity and efficiency in building multimodal models: larger models deliver enhanced multimodal performance, while smaller versions enable lightweight, edge-ready multimodal deployments.
Project Page: https://ucsc-vlaa.github.io/OpenVision