Commit cde8e96 (verified), committed by MM-MVR · 1 parent: a25d74b

Update README.md

Files changed (1)
  1. README.md +4 -2
README.md CHANGED
@@ -14,10 +14,12 @@ pipeline_tag: feature-extraction
 library_name: transformers
 ---
 
+<h1 align="center">Unified Vision Transformer with Native Resolution</h1>
+
 
 ## 🌠 Introduction
 
-We present UniViTAR, a family of homogeneous vision foundation models tailored for unified visual modality and native resolution scenario in the era of multimodal. We train our UniViTAR family across multiple model scales **from 0.3B to 1B** exclusively on public accessible image-caption data (14.6B), and observe a trend of performance increasing with parameter scaling. UniViTAR is a Transformer-based encoder model that inherits the original architecture of the conventional Vision Transformer but incorporates the following advanced modifications: *Unified Patchify for Native Image and Video Modality, 2D RoPE, SwiGLU, RMSNorm, and QK-Norm*.
+We present **UniViTAR**, a family of homogeneous vision foundation models tailored **for unified visual modality and native resolution scenario** in the era of multimodal. We train our UniViTAR family across multiple model scales from **0.3B to 1.4B** exclusively on public accessible image-caption data (14.6B), and observe a trend of performance increasing with parameter scaling. UniViTAR is a Transformer-based encoder model that inherits the original architecture of the conventional Vision Transformer but incorporates the following advanced modifications: *Unified Patchify for Native Image and Video Modality, 2D RoPE, SwiGLU, RMSNorm, and QK-Norm*.
 
 
 ## 🛠️ Environment
@@ -64,7 +66,7 @@ print(data_embeds[0].shape, data_embeds[1].shape)
 |--------|-----|----|------|------|------|------|------|------|
 | [UniViTAR-0.3B](https://huggingface.co/MM-MVR/UniViTAR-0.3B) | 310M | 14.6B | 81.5 | 87.7 | 84.0 | 95.1 | 66.0 | 54.6 |
 | [UniViTAR-0.6B](https://huggingface.co/MM-MVR/UniViTAR-0.6B) | 637M | 14.6B | 82.3 | 88.3 | 84.1 | 95.5 | 68.6 | 55.1 |
-| [UniViTAR-1.4B](https://huggingface.co/MM-MVR/UniViTAR-1B) | 1419M | 14.6B | 82.9 | 89.2 | 83.5 | 95.1 | 69.0 | 56.2 |
+| [UniViTAR-1B](https://huggingface.co/MM-MVR/UniViTAR-1B) | 1419M | 14.6B | 82.9 | 89.2 | 83.5 | 95.1 | 69.0 | 56.2 |
 
 <font size=1>*ZS: Zero-shot Classification, LP: Linear-Probe Classification, T2I/I2T: Text-to-Image/Image-to-Text Retrieval*</font>
 
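Note for readers of this diff: the Introduction text names RMSNorm and SwiGLU among the changes UniViTAR makes to the conventional Vision Transformer, without defining them. The sketch below shows the commonly used formulations of both in plain Python. It is not taken from the UniViTAR code; the function names, the `eps` default, and the weight layout are assumptions made for illustration only.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then apply a per-element gain.

    Unlike LayerNorm, no mean is subtracted and no bias is added.
    `eps` is an assumed default, not a value from the model card.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def silu(v):
    """SiLU (Swish with beta = 1): v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(matrix, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Gated feed-forward block: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

if __name__ == "__main__":
    normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
    print(normed)
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(swiglu_ffn([1.0, 2.0], identity, identity, [[1.0, 1.0]]))
```

With unit gains, the output of `rms_norm` has a root mean square of approximately 1, which is the property the normalization is meant to enforce. In a real encoder these operations run on batched tensors with learned weights rather than Python lists.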