---
license: unknown
---

# ViT-5

**ViT-5: Vision Transformers for the Mid-2020s**

Official checkpoint release.

📄 Paper: https://arxiv.org/abs/2602.08071

💻 Code: https://github.com/wangf3014/ViT-5

---

## Overview

ViT-5 is a modernized Vision Transformer backbone that preserves the canonical Attention–FFN block structure while systematically upgrading its internal components with best practices from recent large-scale vision modeling research. Rather than proposing a new paradigm, ViT-5 refines and consolidates improvements that have emerged over the past few years into a clean, scalable, and reproducible ViT design suited to mid-2020s workloads.

This repository provides pretrained ViT-5 checkpoints for image recognition and for use as a general-purpose vision backbone.

---

## Model Architecture

ViT-5 retains the standard Transformer encoder structure:

Patch Embedding → [Attention → FFN] × L → Classification Head

but modernizes key components, including:

- Improved normalization strategy
- Updated positional encoding
- Refined activation design
- Architectural stabilization techniques
- Training refinements

An illustrative sketch of this skeleton appears in the appendix at the end of this card; full architectural details are described in the paper.

---

## Available Checkpoints

| Model | Input Resolution | Params | Top-1 (ImageNet-1K) | Notes |
|-------|------------------|--------|---------------------|-------|
| ViT-5-Small | 224 | 22M | 82.2% | |
| ViT-5-Base | 224 | 87M | 84.2% | |
| ViT-5-Base | 384 | 87M | 85.4% | |
| ViT-5-Large | 224 | 304M | 84.9% | |
| ViT-5-Large | 384 | 304M | 86.0% | Coming soon |

Please refer to the paper for the detailed training configuration.

---

## Intended Use

ViT-5 is designed as a general-purpose vision backbone and can be used for:

- Image classification (fine-tuning or linear probing)
- Transfer learning to detection and segmentation
- Vision-language modeling
- Generative modeling backbones (e.g., diffusion transformers)
- Research on Transformer scaling and representation learning

A minimal linear-probing sketch is included in the appendix at the end of this card.

---

## Citation

If you use this model, please cite:

```bibtex
@article{wang2026vit5,
  title={ViT-5: Vision Transformers for The Mid-2020s},
  author={Wang, Feng and Ren, Sucheng and Zhang, Tiezheng and Neskovic, Predrag and Bhattad, Anand and Xie, Cihang and Yuille, Alan},
  journal={arXiv preprint arXiv:2602.08071},
  year={2026}
}
```

---

## Acknowledgements

This work builds on the foundation of Vision Transformers and recent advances in scalable Transformer design.
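
---

## Appendix: Block Structure Sketch

The architecture section above describes the canonical skeleton that ViT-5 retains: Patch Embedding → [Attention → FFN] × L → Classification Head. The PyTorch sketch below is a minimal, unofficial illustration of that skeleton only. It uses vanilla LayerNorm, GELU, and learned positional embeddings as placeholders; ViT-5's actual normalization, positional encoding, activation, and stabilization choices are documented in the paper and the official repository.

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """One pre-norm Attention -> FFN block (the canonical ViT skeleton).

    ViT-5 keeps this two-sublayer structure but upgrades the internals;
    see the paper and official code for the exact component choices.
    """

    def __init__(self, dim: int, num_heads: int, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # attention + residual
        return x + self.mlp(self.norm2(x))                 # FFN + residual


class ViTSkeleton(nn.Module):
    """Patch Embedding -> [Attention -> FFN] x L -> Classification Head."""

    def __init__(self, img_size: int = 224, patch: int = 16, dim: int = 768,
                 depth: int = 12, num_heads: int = 12, num_classes: int = 1000):
        super().__init__()
        # Non-overlapping patch embedding via a strided convolution.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        num_patches = (img_size // patch) ** 2
        # Learned positional embedding (a placeholder; ViT-5 updates this).
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.blocks = nn.ModuleList(Block(dim, num_heads) for _ in range(depth))
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        x = x + self.pos_embed
        for blk in self.blocks:
            x = blk(x)
        # Global average pooling over tokens, then the classification head.
        return self.head(self.norm(x).mean(dim=1))


if __name__ == "__main__":
    model = ViTSkeleton()
    logits = model(torch.randn(2, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 1000])
```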
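
## Appendix: Linear Probing Sketch

Since the intended-use list mentions linear probing, here is a hedged sketch of the standard recipe: freeze a pretrained backbone, replace its classification head with an identity so it emits pooled features, and train only a new linear layer on top. The `ViTSkeleton` class, the 768-dimensional feature size, and the 1000-way head are illustrative assumptions carried over from the sketch above, not part of the official release; in practice you would load a pretrained ViT-5 checkpoint instead.

```python
import torch
import torch.nn as nn

# Assumes the ViTSkeleton class from the sketch above; substitute a
# pretrained ViT-5 checkpoint here in real use.
backbone = ViTSkeleton()
backbone.head = nn.Identity()      # expose 768-dim pooled features
for p in backbone.parameters():
    p.requires_grad = False        # freeze every backbone weight
backbone.eval()

probe = nn.Linear(768, 1000)       # the only trainable parameters
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 1000, (8,))

with torch.no_grad():              # no gradients through the frozen backbone
    feats = backbone(images)
logits = probe(feats)
loss = criterion(logits, labels)

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"probe loss: {loss.item():.4f}")
```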