|
|
--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- mlfoundations/datacomp_1b |
|
|
- kakaobrain/coyo-700m |
|
|
- laion/laion400m |
|
|
language: |
|
|
- en |
|
|
- zh |
|
|
metrics: |
|
|
- accuracy |
|
|
- recall |
|
|
pipeline_tag: feature-extraction |
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
<h1 align="center">Unified Vision Transformer with Native Resolution</h1> |
|
|
|
|
|
|
|
|
## Introduction
|
|
|
|
|
We present **UniViTAR**, a family of homogeneous vision foundation models tailored **for unified visual modalities and native-resolution scenarios** in the multimodal era. We train the UniViTAR family at multiple model scales from **0.3B to 1.4B** parameters exclusively on publicly accessible image-caption data (14.6B samples seen), and observe performance improving consistently with parameter scaling. UniViTAR is a Transformer-based encoder that inherits the architecture of the conventional Vision Transformer while incorporating the following modifications: *Unified Patchify for Native Image and Video Modalities, 2D RoPE, SwiGLU, RMSNorm, and QK-Norm*.
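To make the native-resolution idea concrete, the sketch below shows the patchify step in isolation: an image keeps its native height and width, is cut into non-overlapping patches, and becomes a variable-length token sequence together with its grid shape. The `patchify_native` helper and the patch size of 14 are illustrative assumptions for this sketch only; the released model exposes this step through `model.data_patchify` (see the usage example below).

```python
# A minimal sketch (not the official implementation) of native-resolution patchify:
# the image keeps its native (H, W), is split into non-overlapping PxP patches,
# and becomes a variable-length token sequence plus a grid shape.
# `patchify_native` and patch_size=14 are illustrative assumptions.
import torch

def patchify_native(image: torch.Tensor, patch_size: int = 14):
    """image: (3, H, W) with H and W divisible by patch_size."""
    c, h, w = image.shape
    gh, gw = h // patch_size, w // patch_size                  # grid shape (rows, cols)
    patches = image.reshape(c, gh, patch_size, gw, patch_size)
    patches = patches.permute(1, 3, 0, 2, 4)                   # (gh, gw, C, P, P)
    tokens = patches.reshape(gh * gw, c * patch_size * patch_size)
    return tokens, (gh, gw)

tokens, grid = patchify_native(torch.randn(3, 224, 308))
print(tokens.shape, grid)  # torch.Size([352, 588]) (16, 22)
```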
|
|
|
|
|
|
|
|
## Environment
|
|
```bash
conda create -n univitar python=3.11 -y
conda activate univitar
pip3 install einops==0.8.0 ninja==1.11.1.1 numpy==1.26.4 pillow==10.4.0 psutil==6.0.0 torch==2.2.2 torchvision==0.17.2 transformers==4.49.0 timm==1.0.14
pip3 install flash-attn==2.6.3
```
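As an optional sanity check (a suggestion, not part of the official setup), the snippet below confirms that the pinned packages import correctly and that a CUDA device is visible, since the usage example below runs the model on GPU in bfloat16.

```python
# Optional sanity check: verifies that the pinned packages import and that a
# CUDA device is visible for bfloat16 inference on GPU.
import torch, transformers, timm, flash_attn

print("torch:", torch.__version__, "| transformers:", transformers.__version__, "| timm:", timm.__version__)
print("flash-attn:", flash_attn.__version__)
print("CUDA available:", torch.cuda.is_available())
```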
|
|
|
|
|
|
|
|
## Model Usage
|
|
|
|
|
```python
import torch
import numpy as np
from PIL import Image
from modeling_univitar import UniViTARVisionModel

# Prepare Model
model = UniViTARVisionModel("config.json")
_ = model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
model = model.to(torch.bfloat16).cuda()

# Prepare Data: [(3, H1, W1), ..., (3, Hn, Wn)] --> (N1+...+Nn, P)
images = [Image.open("xx1.jpg"), Image.open("xx2.jpg")]
data_inputs, grid_shapes = [], []
for image in images:
    data_item = model.image_transform(image)
    input_data, grid_shape = model.data_patchify(data_item)  # per-image patch tokens and their grid shape
    data_inputs.append(input_data.to(torch.bfloat16).cuda())
    grid_shapes.append(grid_shape)
data_inputs = torch.concatenate(data_inputs, dim=0)  # pack all variable-length token sequences into one batch

# Forward: (N1+...+Nn, P) --> [(N1, D), ..., (Nn, D)]
data_embeds = model(pixel_values=data_inputs, grid_shapes=grid_shapes)
data_embeds = data_embeds.split([np.prod(grid_shape) for grid_shape in grid_shapes])  # split back per image
print(data_embeds[0].shape, data_embeds[1].shape)
```
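Building on the example above, one simple way to reduce the per-patch embeddings to a single vector per image is mean pooling followed by cosine similarity. The pooling choice here is illustrative only and is not necessarily the pooling used during UniViTAR's pretraining.

```python
# Continuing the example above: mean-pool each image's patch embeddings into a
# single vector, then compare the two images with cosine similarity.
import torch.nn.functional as F

pooled = [embeds.float().mean(dim=0) for embeds in data_embeds]   # [(D,), (D,)]
similarity = F.cosine_similarity(pooled[0], pooled[1], dim=0)
print(f"cosine similarity between the two images: {similarity.item():.4f}")
```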
|
|
|
|
|
## Evaluation
|
|
|
|
|
| Model | Size | \#Seen | IN1K<sup>ZS</sup> | IN1K<sup>LP</sup> | Flickr<sup>T2I</sup> | Flickr<sup>I2T</sup> | K400<sup>ZS</sup> | ADE20K |
|--------|-----|----|------|------|------|------|------|------|
| [UniViTAR-0.3B](https://huggingface.co/MM-MVR/UniViTAR-0.3B) | 310M | 14.6B | 81.5 | 87.7 | 84.0 | 95.1 | 66.0 | 54.6 |
| [UniViTAR-0.6B](https://huggingface.co/MM-MVR/UniViTAR-0.6B) | 637M | 14.6B | 82.3 | 88.3 | 84.1 | 95.5 | 68.6 | 55.1 |
| [UniViTAR-1B](https://huggingface.co/MM-MVR/UniViTAR-1B) | 1419M | 14.6B | 82.9 | 89.2 | 83.5 | 95.1 | 69.0 | 56.2 |
|
|
|
|
|
<font size=1>*ZS: Zero-shot Classification, LP: Linear-Probe Classification, T2I/I2T: Text-to-Image/Image-to-Text Retrieval*</font> |
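For context, linear-probe (LP) evaluation trains only a linear classifier on top of frozen encoder features. The sketch below illustrates the idea; the feature dimension, batch size, and optimizer settings are placeholders rather than the exact evaluation recipe.

```python
# A minimal sketch of the linear-probe (LP) protocol: the encoder stays frozen
# and only a linear classifier is trained on pooled features.
# feature_dim, batch size, and optimizer settings are placeholders.
import torch
import torch.nn as nn

feature_dim, num_classes = 1024, 1000                # e.g. ImageNet-1K classes
probe = nn.Linear(feature_dim, num_classes)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)

features = torch.randn(32, feature_dim)              # stand-in for frozen UniViTAR features
labels = torch.randint(0, num_classes, (32,))

logits = probe(features)                             # (32, num_classes)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
print("probe loss:", loss.item())
```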
|
|
|
|
|
|
|
|
## Reference
|
|
|
|
|
If you find UniViTAR useful in your research or applications, please consider citing it with the following BibTeX entry:
|
|
|
|
|
```bibtex
@article{qiao2025univitar,
  title={UniViTAR: Unified Vision Transformer with Native Resolution},
  author={Qiao, Limeng and Gan, Yiyang and Wang, Bairui and Qin, Jie and Xu, Shuang and Yang, Siqi and Ma, Lin},
  journal={arXiv preprint arXiv:2504.01792},
  year={2025}
}
```