| --- |
| license: mit |
| datasets: |
| - laion/laion2B-en |
| - laion/laion-coco |
| - laion/laion2B-multi |
| - kakaobrain/coyo-700m |
| - conceptual_captions |
| - wanng/wukong100m |
| pipeline_tag: image-feature-extraction |
| --- |
| |
| # InternViT-6B-224px |
|
|
| [\[GitHub\]](https://github.com/OpenGVLab/InternVL) [\[InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[Mini-InternVL\]](https://arxiv.org/abs/2410.16261) [\[InternVL 2.5\]](https://huggingface.co/papers/2412.05271) |
|
|
| [\[Blog\]](https://internvl.github.io/blog/) [\[Chat Demo\]](https://internvl.opengvlab.com/) [\[HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[Quick Start\]](#quick-start) [\[Documents\]](https://internvl.readthedocs.io/en/latest/) |
|
|
| <div align="center"> |
| <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64006c09330a45b03605bba3/zJsd2hqd3EevgXo6fNgC-.png"> |
| </div> |
|
|
| ## Model Details |
| - **Model Type:** vision foundation model, feature backbone |
| - **Model Stats:** |
|   - Params (M): 5903 |
|   - Image size: 224 x 224 |
| - **Pretrain Dataset:** LAION-en, LAION-COCO, COYO, CC12M, CC3M, SBU, Wukong, LAION-multi |
| - **Note:** This model has 48 blocks. We found that the output of the fourth-to-last block works best as the visual input to a VLLM (vision large language model). When building a VLLM with this model, **please use the features from the fourth-to-last layer.** |
|
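
The layer choice in the note above can be expressed as an index into the per-layer outputs. The sketch below assumes the common `transformers` convention in which `hidden_states` holds the patch-embedding output first, followed by one entry per block (49 entries for 48 blocks). Whether this checkpoint's remote code follows that convention and accepts `output_hidden_states=True` should be verified before relying on it. `select_block_output` is a hypothetical helper, and the list of strings stands in for real tensors.

```python
def select_block_output(hidden_states, blocks_from_end=4):
    """Return the output of the block that is `blocks_from_end` from the last.

    Assumes hidden_states[0] is the embedding output and hidden_states[k]
    is the output of block k (1-based), so the last block is hidden_states[-1].
    """
    if not 1 <= blocks_from_end < len(hidden_states):
        raise ValueError("blocks_from_end is out of range for this model")
    return hidden_states[-blocks_from_end]


# Stand-in for a 48-block model: embeddings + 48 block outputs (assumed layout).
NUM_BLOCKS = 48
fake_hidden_states = ["embeddings"] + [f"block_{i}" for i in range(1, NUM_BLOCKS + 1)]

selected = select_block_output(fake_hidden_states)
print(selected)  # block_45
```

Under the same assumptions, the equivalent with the real model would be along the lines of `model(pixel_values, output_hidden_states=True).hidden_states[-4]`.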
|
| ## Linear Probing Performance |
|
|
| See this [document](https://github.com/OpenGVLab/InternVL/tree/main/classification#-evaluation) for more details about the linear probing evaluation. |
|
|
| | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch | |
| | :---: | :-----: | :---: | :--: | :--: | :-------: | |
| | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 | |
|
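
Linear probing trains only a linear classifier on top of frozen backbone features, so the scores above reflect feature quality rather than fine-tuning. The following is a minimal pure-Python sketch of that idea on toy data. It is not the evaluation code from the linked document: the feature vectors, learning rate, epoch count, and all function names are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scores(weights, bias, x):
    """Linear layer: one logit per class."""
    return [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
            for w, b in zip(weights, bias)]


def train_linear_probe(features, labels, num_classes, lr=0.5, epochs=200):
    """Fit a softmax classifier on fixed features with per-sample gradient descent."""
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    bias = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax(scores(weights, bias, x))
            for c in range(num_classes):
                # Gradient of cross-entropy w.r.t. logit c is (p_c - 1[c == y]).
                grad = probs[c] - (1.0 if c == y else 0.0)
                bias[c] -= lr * grad
                for j in range(dim):
                    weights[c][j] -= lr * grad * x[j]
    return weights, bias


def predict(weights, bias, x):
    """Return the index of the highest-scoring class."""
    logits = scores(weights, bias, x)
    return logits.index(max(logits))


# Toy stand-ins for frozen backbone features: two well-separated classes.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]

weights, bias = train_linear_probe(features, labels, num_classes=2)
accuracy = sum(predict(weights, bias, x) == y
               for x, y in zip(features, labels)) / len(labels)
print(accuracy)
```

In a real probe the features would come from the frozen backbone and only `weights` and `bias` would be updated; the backbone parameters never receive gradients.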
|
| ## Model Usage (Image Embeddings) |
|
|
```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# Load the backbone in bfloat16; trust_remote_code is needed because the
# model class is defined in the checkpoint repository.
model = AutoModel.from_pretrained(
    'OpenGVLab/InternViT-6B-224px',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()

image = Image.open('./examples/image1.jpg').convert('RGB')

# The processor resizes and normalizes the image for the model.
image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-224px')

pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

# Inference only, so gradient tracking is disabled.
with torch.no_grad():
    outputs = model(pixel_values)
```
|
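
A vision backbone typically returns one feature vector per image token, so a single image embedding needs a pooling step. The sketch below shows one common choice, mean pooling followed by cosine similarity, using nested Python lists in place of tensors. It assumes per-token features shaped `[num_tokens][dim]`; the helper names are hypothetical, and the exact fields of `outputs` for this model are not shown here.

```python
import math


def mean_pool(token_features):
    """Average a [num_tokens][dim] nested list over the token axis."""
    num_tokens = len(token_features)
    dim = len(token_features[0])
    return [sum(tok[j] for tok in token_features) / num_tokens for j in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity of two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


# Toy per-token features for two "images" (2 tokens, 3 dims each).
image_a = [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
image_b = [[0.0, 2.0, 0.0], [0.0, 4.0, 0.0]]

emb_a = mean_pool(image_a)  # [2.0, 0.0, 0.0]
emb_b = mean_pool(image_b)  # [0.0, 3.0, 0.0]
print(cosine_similarity(emb_a, emb_a))  # 1.0
print(cosine_similarity(emb_a, emb_b))  # 0.0
```

With real tensors the same steps map onto a mean over the token dimension followed by `torch.nn.functional.normalize` and a dot product.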
|
| ## Citation |
|
|
| If you find this project useful in your research, please consider citing: |
|
|
| ```BibTeX |
| @article{chen2024expanding, |
| title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling}, |
| author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others}, |
| journal={arXiv preprint arXiv:2412.05271}, |
| year={2024} |
| } |
| @article{gao2024mini, |
| title={Mini-internvl: A flexible-transfer pocket multimodal model with 5\% parameters and 90\% performance}, |
| author={Gao, Zhangwei and Chen, Zhe and Cui, Erfei and Ren, Yiming and Wang, Weiyun and Zhu, Jinguo and Tian, Hao and Ye, Shenglong and He, Junjun and Zhu, Xizhou and others}, |
| journal={arXiv preprint arXiv:2410.16261}, |
| year={2024} |
| } |
| @article{chen2024far, |
| title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites}, |
| author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others}, |
| journal={arXiv preprint arXiv:2404.16821}, |
| year={2024} |
| } |
| @inproceedings{chen2024internvl, |
| title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks}, |
| author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others}, |
| booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, |
| pages={24185--24198}, |
| year={2024} |
| } |
| ``` |
|
|