MM-MVR/UniViTAR-0.6B

MM-MVR committed on May 22, 2025

Commit cde8e96 · verified · 1 Parent(s): a25d74b

Update README.md

Files changed (1)

README.md +4 -2

@@ -14,10 +14,12 @@ pipeline_tag: feature-extraction
 library_name: transformers
 ---
+<h1 align="center">Unified Vision Transformer with Native Resolution</h1>
 ## 🌠 Introduction
-We present UniViTAR, a family of homogeneous vision foundation models tailored for unified visual modality and native resolution scenario in the era of multimodal. We train our UniViTAR family across multiple model scales **from 0.3B to 1B** exclusively on public accessible image-caption data (14.6B), and observe a trend of performance increasing with parameter scaling. UniViTAR is a Transformer-based encoder model that inherits the original architecture of the conventional Vision Transformer but incorporates the following advanced modifications: *Unified Patchify for Native Image and Video Modality, 2D RoPE, SwiGLU, RMSNorm, and QK-Norm*.
+We present **UniViTAR**, a family of homogeneous vision foundation models tailored **for unified visual modality and native resolution scenario** in the era of multimodal. We train our UniViTAR family across multiple model scales from **0.3B to 1.4B** exclusively on public accessible image-caption data (14.6B), and observe a trend of performance increasing with parameter scaling. UniViTAR is a Transformer-based encoder model that inherits the original architecture of the conventional Vision Transformer but incorporates the following advanced modifications: *Unified Patchify for Native Image and Video Modality, 2D RoPE, SwiGLU, RMSNorm, and QK-Norm*.
 ## 🛠️ Environment
@@ -64,7 +66,7 @@ print(data_embeds[0].shape, data_embeds[1].shape)
 |--------|-----|----|------|------|------|------|------|------|
 | [UniViTAR-0.3B](https://huggingface.co/MM-MVR/UniViTAR-0.3B) | 310M | 14.6B  | 81.5  | 87.7 | 84.0 | 95.1 | 66.0 | 54.6 |
 | [UniViTAR-0.6B](https://huggingface.co/MM-MVR/UniViTAR-0.6B) | 637M | 14.6B | 82.3  | 88.3 | 84.1 | 95.5 | 68.6 | 55.1 |
-| [UniViTAR-1.4B](https://huggingface.co/MM-MVR/UniViTAR-1B) | 1419M | 14.6B  | 82.9  | 89.2 | 83.5 | 95.1 | 69.0 | 56.2 |
+| [UniViTAR-1B](https://huggingface.co/MM-MVR/UniViTAR-1B) | 1419M | 14.6B  | 82.9  | 89.2 | 83.5 | 95.1 | 69.0 | 56.2 |
 <font size=1>*ZS: Zero-shot Classification, LP: Linear-Probe Classification, T2I/I2T: Text-to-Image/Image-to-Text Retrieval*</font>