Update README.md
````diff
@@ -10,25 +10,12 @@ datasets:
 pipeline_tag: image-feature-extraction
 ---
 
-#
-
-<p align="center">
-  <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/2yzk5wUY-obL6H4rKiHlU.webp" alt="Image Description" width="300" height="300">
-</p>
+# InternVL-14B-224px
 
 [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[📜 InternVL 1.0 Paper\]](https://arxiv.org/abs/2312.14238) [\[📜 InternVL 1.5 Report\]](https://arxiv.org/abs/2404.16821) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/)
 
 [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#model-usage) [\[🌐 Community-hosted API\]](https://rapidapi.com/adushar1320/api/internvl-chat) [\[📖 中文解读\]](https://zhuanlan.zhihu.com/p/675877376)
 
-| Model                   | Date       | Download                                                               | Note                             |
-| ----------------------- | ---------- | ---------------------------------------------------------------------- | -------------------------------- |
-| InternViT-6B-448px-V1-5 | 2024.04.20 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) | support dynamic resolution, super strong OCR (🔥new) |
-| InternViT-6B-448px-V1-2 | 2024.02.11 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2) | 448 resolution                   |
-| InternViT-6B-448px-V1-0 | 2024.01.30 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0) | 448 resolution                   |
-| InternViT-6B-224px      | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-224px)      | vision foundation model          |
-| InternVL-14B-224px      | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-14B-224px)      | vision-language foundation model |
-
-
 ## Model Details
 - **Model Type:** vision-language foundation model
 - **Support Tasks:** zero-shot image/video classification, image-text/video retrieval, image captioning
@@ -43,10 +30,8 @@ See this [document](https://github.com/OpenGVLab/InternVL/tree/main/clip_benchmark
 
 
 
-
 
 
-
 ## Model Usage
 
 **Note: the prefix `'summarize:'` and `tokenizer.pad_token_id = 0` are necessary. Their absence will lead to abnormal results.**
@@ -141,8 +126,3 @@ If you find this project useful in your research, please consider citing:
   year={2024}
 }
 ```
-
-
-## Acknowledgement
-
-InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
````
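For context, the Model Usage note above (the `'summarize:'` prefix and `tokenizer.pad_token_id = 0` requirements) can be sketched as a captioning call. This is a minimal sketch, not the model card's verbatim snippet: the checkpoint id is the one from this repo, but the helper name `generate_caption`, the image path, and the generation parameters (`num_beams`, `max_new_tokens`) are illustrative assumptions, and the exact `generate` signature comes from the model's `trust_remote_code` implementation.

```python
# Hypothetical captioning helper for InternVL-14B-224px, following the README's
# Model Usage note. Assumes transformers with trust_remote_code and a CUDA device.
MODEL_ID = 'OpenGVLab/InternVL-14B-224px'
CAPTION_PREFIX = 'summarize:'  # required prefix per the README note
PAD_TOKEN_ID = 0               # required pad token id per the README note


def generate_caption(image_path: str) -> str:
    # Heavy imports are kept local so the module imports cheaply without a GPU.
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

    model = AutoModel.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
    ).cuda().eval()
    image_processor = CLIPImageProcessor.from_pretrained(MODEL_ID)
    tokenizer = AutoTokenizer.from_pretrained(
        MODEL_ID, use_fast=False, trust_remote_code=True
    )
    tokenizer.pad_token_id = PAD_TOKEN_ID  # absence leads to abnormal results

    image = Image.open(image_path).convert('RGB')
    pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
    pixel_values = pixel_values.to(torch.bfloat16).cuda()
    input_ids = tokenizer(CAPTION_PREFIX, return_tensors='pt').input_ids.cuda()

    generated = model.generate(
        pixel_values=pixel_values,
        input_ids=input_ids,
        num_beams=5,          # illustrative decoding settings
        max_new_tokens=64,
    )
    return tokenizer.decode(generated[0], skip_special_tokens=True)
```

Calling `generate_caption('photo.jpg')` downloads the 14B checkpoint on first use, so it needs a CUDA GPU with substantial memory; the constants at the top capture the two settings the README flags as mandatory.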