Update README.md
README.md
CHANGED
@@ -14,10 +14,12 @@ pipeline_tag: feature-extraction
 library_name: transformers
 ---
 
+<h1 align="center">Unified Vision Transformer with Native Resolution</h1>
+
 
 ## 🌠 Introduction
 
-We present UniViTAR, a family of homogeneous vision foundation models tailored for unified visual modality and native resolution scenario in the era of multimodal. We train our UniViTAR family across multiple model scales
+We present **UniViTAR**, a family of homogeneous vision foundation models tailored **for unified visual modality and native resolution scenario** in the era of multimodal. We train our UniViTAR family across multiple model scales from **0.3B to 1.4B** exclusively on public accessible image-caption data (14.6B), and observe a trend of performance increasing with parameter scaling. UniViTAR is a Transformer-based encoder model that inherits the original architecture of the conventional Vision Transformer but incorporates the following advanced modifications: *Unified Patchify for Native Image and Video Modality, 2D RoPE, SwiGLU, RMSNorm, and QK-Norm*.
 
 
 ## 🛠️ Environment
@@ -64,7 +66,7 @@ print(data_embeds[0].shape, data_embeds[1].shape)
 |--------|-----|----|------|------|------|------|------|------|
 | [UniViTAR-0.3B](https://huggingface.co/MM-MVR/UniViTAR-0.3B) | 310M | 14.6B | 81.5 | 87.7 | 84.0 | 95.1 | 66.0 | 54.6 |
 | [UniViTAR-0.6B](https://huggingface.co/MM-MVR/UniViTAR-0.6B) | 637M | 14.6B | 82.3 | 88.3 | 84.1 | 95.5 | 68.6 | 55.1 |
-| [UniViTAR-
+| [UniViTAR-1B](https://huggingface.co/MM-MVR/UniViTAR-1B) | 1419M | 14.6B | 82.9 | 89.2 | 83.5 | 95.1 | 69.0 | 56.2 |
 
 <font size=1>*ZS: Zero-shot Classification, LP: Linear-Probe Classification, T2I/I2T: Text-to-Image/Image-to-Text Retrieval*</font>
 
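
The Introduction added in this commit names RMSNorm and SwiGLU among the architectural changes to the standard Vision Transformer. For readers unfamiliar with those two components, here is a minimal dependency-free sketch of their textbook definitions. It is not taken from the UniViTAR code base: the function names (`rms_norm`, `swiglu_ffn`, `matvec`), the list-based vectors, and the `eps` default are all assumptions made for illustration, and the real model presumably uses tensor operations instead.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then apply a per-channel weight.

    Unlike LayerNorm, no mean is subtracted and no bias is added.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(matrix, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """Gated feed-forward block: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


if __name__ == "__main__":
    # Hypothetical toy input, chosen only to exercise the two functions.
    tokens = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(swiglu_ffn(tokens, identity, identity, [[1.0, 1.0]]))
```

QK-Norm, also listed in the Introduction, is commonly described as applying a normalization of this kind to the query and key vectors before the attention dot product. Whether UniViTAR uses RMSNorm or another norm for that step is not stated in this diff.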