---
license: unknown
---

# ViT-5

**ViT-5: Vision Transformers for the Mid-2020s**

Official checkpoint release.

📄 Paper: https://arxiv.org/abs/2602.08071
💻 Code: https://github.com/wangf3014/ViT-5

---

## Overview

ViT-5 is a modernized Vision Transformer backbone that preserves the canonical Attention–FFN block structure while systematically upgrading its internal components with best practices from recent large-scale vision modeling research.

Rather than proposing a new paradigm, ViT-5 consolidates the improvements that have emerged over the past few years into a clean, scalable, and reproducible ViT design suited to mid-2020s workloads.

This repository provides pretrained ViT-5 checkpoints for image recognition and for use as a general-purpose vision backbone.

---

## Model Architecture

ViT-5 retains the standard Transformer encoder structure:

Patch Embedding → [Attention → FFN] × L → Classification Head

but modernizes key components, including:

- Improved normalization strategy
- Updated positional encoding
- Refined activation design
- Architectural stabilization techniques
- Training refinements

Full architectural details are described in the paper.
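
For orientation, the sketch below shows the canonical pre-norm [Attention → FFN] block that ViT-5 retains. The concrete component choices here (RMSNorm for the normalization and a SwiGLU-style gated FFN for the activation design) are assumptions standing in for the bullets above, not the confirmed ViT-5 design; the positional encoding and stabilization techniques are likewise left to the paper.

```python
# Minimal sketch of the retained [Attention -> FFN] block. RMSNorm and the
# SwiGLU-style FFN are ASSUMED stand-ins for the "improved normalization"
# and "refined activation design" bullets; see the paper for the actual
# ViT-5 components.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUFFN(nn.Module):
    """Gated FFN (SwiGLU), a common modern replacement for the GELU MLP."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden)
        self.w_up = nn.Linear(dim, hidden)
        self.w_down = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class EncoderBlock(nn.Module):
    """Pre-norm Transformer encoder block, as in the standard ViT."""

    def __init__(self, dim: int = 768, heads: int = 12, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.RMSNorm(dim)  # assumed choice; requires PyTorch >= 2.4
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.RMSNorm(dim)
        self.ffn = SwiGLUFFN(dim, int(dim * mlp_ratio))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # attention residual
        x = x + self.ffn(self.norm2(x))                    # FFN residual
        return x


tokens = torch.randn(2, 196, 768)    # (batch, 14×14 patch tokens, embed dim)
print(EncoderBlock()(tokens).shape)  # torch.Size([2, 196, 768])
```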

---

## Available Checkpoints

| Model | Input Resolution (px) | Params | Top-1 (ImageNet-1K) | Notes |
|-------|-----------------------|--------|---------------------|-------|
| ViT-5-Small | 224 | 22M | 82.2% | |
| ViT-5-Base | 224 | 87M | 84.2% | |
| ViT-5-Base | 384 | 87M | 85.4% | |
| ViT-5-Large | 224 | 304M | 84.9% | |
| ViT-5-Large | 384 | 304M | 86.0% | Available soon |

Please refer to the paper for the detailed training configuration.
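
Because the checkpoint file layout is not documented in this card, the loading sketch below is hedged: the filename `vit5_base_224.pth` and the `"model"` nesting key are hypothetical placeholders, not the release's confirmed format.

```python
# Hypothetical loading sketch: the filename and the "model" nesting key are
# assumptions, not this repo's documented checkpoint format.
import torch

ckpt = torch.load("vit5_base_224.pth", map_location="cpu")  # assumed filename
state_dict = ckpt.get("model", ckpt)  # some releases nest weights under "model"
for name, tensor in list(state_dict.items())[:5]:
    print(f"{name}: {tuple(tensor.shape)}")  # inspect the first few weight shapes
```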

---

## Intended Use

ViT-5 is designed as a general-purpose vision backbone and can be used for:

- Image classification (fine-tuning or linear probing; see the sketch after this list)
- Transfer learning to detection and segmentation
- Vision-language modeling
- Generative modeling backbones (e.g., diffusion transformers)
- Research on Transformer scaling and representation learning
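
As a concrete illustration of the linear-probing use case, here is a minimal sketch. The `backbone` below is a placeholder module that maps images to (batch, dim) features, since this card does not specify ViT-5's feature-extraction API; only the linear head is trained.

```python
# Linear-probing sketch under assumptions: `backbone` stands in for a frozen
# pretrained model producing (batch, dim) features; swap in the real ViT-5
# feature extractor once loaded.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, num_classes = 768, 1000
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, dim))  # placeholder
for p in backbone.parameters():
    p.requires_grad = False           # freeze the pretrained features
backbone.eval()

head = nn.Linear(dim, num_classes)    # only this probe is trained
opt = torch.optim.AdamW(head.parameters(), lr=1e-3)

images = torch.randn(8, 3, 224, 224)  # dummy batch at 224 resolution
labels = torch.randint(0, num_classes, (8,))
with torch.no_grad():
    feats = backbone(images)          # frozen (8, 768) features
loss = F.cross_entropy(head(feats), labels)
loss.backward()
opt.step()
print(f"probe loss: {loss.item():.3f}")
```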

---

## Citation

If you use this model, please cite:

```bibtex
@article{wang2026vit5,
  title={ViT-5: Vision Transformers for The Mid-2020s},
  author={Wang, Feng and Ren, Sucheng and Zhang, Tiezheng and Neskovic, Predrag and Bhattad, Anand and Xie, Cihang and Yuille, Alan},
  journal={arXiv preprint arXiv:2602.08071},
  year={2026}
}
```

---

## Acknowledgements

This work builds on the foundation of Vision Transformers and recent advances in scalable Transformer design.