---
datasets:
- conceptual_captions
- wanng/wukong100m
pipeline_tag: image-feature-extraction
new_version: OpenGVLab/InternViT-300M-448px-V2_5
---

# InternViT-300M-448px

[\[GitHub\]](https://github.com/OpenGVLab/InternVL) [\[Blog\]](https://internvl.github.io/blog/) [\[InternVL 1.0\]](https://arxiv.org/abs/2312.14238) [\[InternVL 1.5\]](https://arxiv.org/abs/2404.16821) [\[Mini-InternVL\]](https://arxiv.org/abs/2410.16261)

[\[Chat Demo\]](https://internvl.opengvlab.com/) [\[HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[Quick Start\]](#quick-start) [\[Chinese Interpretation\]](https://zhuanlan.zhihu.com/p/706547971) [\[Documents\]](https://internvl.readthedocs.io/en/latest/)

<div align="center">
<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64006c09330a45b03605bba3/zJsd2hqd3EevgXo6fNgC-.png">
</div>

This update primarily focuses on improving the efficiency of the vision foundation model. We developed InternViT-300M-448px by distilling knowledge from the larger vision foundation model [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5). Like its predecessor, InternViT-300M-448px supports dynamic input resolution with a basic tile size of 448×448: it processes 1 to 12 tiles during training and scales up to 1 to 40 tiles at test time. It also inherits the strong robustness, OCR capability, and high-resolution processing capacity of InternViT-6B-448px-V1-5.
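As a rough illustration of the dynamic-resolution scheme, the sketch below picks a grid of 448×448 tiles whose aspect ratio best matches the input image, capped at 12 tiles as during training. The helper name and the exact ratio-selection rule are assumptions for illustration only, not the released InternVL preprocessing code.

```python
import math

def choose_tile_grid(width, height, min_tiles=1, max_tiles=12, tile_size=448):
    """Pick a (cols, rows) grid of 448x448 tiles whose aspect ratio best
    matches the input image, with cols * rows in [min_tiles, max_tiles].

    Illustrative sketch; the released InternVL preprocessing may use a
    different selection rule.
    """
    target = width / height
    best, best_diff = (1, 1), float("inf")
    for n in range(min_tiles, max_tiles + 1):
        for cols in range(1, n + 1):
            if n % cols:
                continue  # only grids with exactly n tiles
            rows = n // cols
            # log-ratio distance between image and grid aspect ratios
            diff = abs(math.log(target / (cols / rows)))
            if diff < best_diff:
                best, best_diff = (cols, rows), diff
    return best  # the image is then resized to cols*448 x rows*448 and cut into tiles

# A wide 1344x448 image maps to a 3x1 grid of 448px tiles.
print(choose_tile_grid(1344, 448))  # (3, 1)
```

At test time the same search would simply run with `max_tiles=40`, which is what allows the model to cover much larger images without changing the per-tile resolution.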

## Model Details

If you find this project useful in your research, please consider citing:

  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}
```