---
license: unknown
---

# ViT-5

**ViT-5: Vision Transformers for the Mid-2020s**
Official checkpoint release.

📄 Paper: https://arxiv.org/abs/2602.08071
💻 Code: https://github.com/wangf3014/ViT-5

---

## Overview

ViT-5 is a modernized Vision Transformer backbone that preserves the canonical Attention–FFN block structure while systematically upgrading its internal components using best practices from recent large-scale vision modeling research.

Rather than proposing a new paradigm, ViT-5 focuses on refining and consolidating improvements that have emerged over the past few years into a clean, scalable, and reproducible ViT design suitable for mid-2020s workloads.

This repository provides pretrained ViT-5 checkpoints for image recognition and for use as a general-purpose vision backbone.

---

## Model Architecture

ViT-5 retains the standard Transformer encoder structure:

Patch Embedding → [Attention → FFN] × L → Classification Head

but modernizes key components, including:

- Improved normalization strategy
- Updated positional encoding
- Refined activation design
- Architectural stabilization techniques
- Training refinements

Full architectural details are described in the paper.
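For reference, the canonical Attention–FFN block named above can be sketched in a few lines of NumPy. This is a generic pre-norm, single-head sketch with illustrative dimensions; it is not the actual ViT-5 configuration (normalization, positional encoding, and activation differ per the paper).

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the feature dimension (learned scale/shift omitted for brevity).
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(x, w_qkv, w_out):
    # Single-head self-attention over the token sequence.
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (scores @ v) @ w_out

def ffn(x, w1, w2):
    # Two-layer MLP with a tanh-approximated GELU activation.
    h = x @ w1
    h = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return h @ w2

def block(x, p):
    # Pre-norm residual block: x + Attn(LN(x)), then x + FFN(LN(x)).
    x = x + attention(layer_norm(x), p["w_qkv"], p["w_out"])
    x = x + ffn(layer_norm(x), p["w1"], p["w2"])
    return x

rng = np.random.default_rng(0)
d, tokens = 64, 197  # e.g. 14x14 patches + a class token at 224px / patch size 16
params = {
    "w_qkv": rng.normal(0, 0.02, (d, 3 * d)),
    "w_out": rng.normal(0, 0.02, (d, d)),
    "w1": rng.normal(0, 0.02, (d, 4 * d)),
    "w2": rng.normal(0, 0.02, (4 * d, d)),
}
x = rng.normal(size=(tokens, d))
y = block(x, params)
print(y.shape)  # (197, 64): the block preserves the (tokens, features) shape
```

Because each sub-layer is residual and shape-preserving, stacking L such blocks (the `× L` in the diagram) is just repeated application of `block`.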

---

## Available Checkpoints

| Model       | Input Resolution | Params | Top-1 (ImageNet-1K) | Notes          |
|-------------|------------------|--------|---------------------|----------------|
| ViT-5-Small | 224              | 22M    | 82.2%               |                |
| ViT-5-Base  | 224              | 87M    | 84.2%               |                |
| ViT-5-Base  | 384              | 87M    | 85.4%               |                |
| ViT-5-Large | 224              | 304M   | 84.9%               |                |
| ViT-5-Large | 384              | 304M   | 86.0%               | Available soon |

Please refer to the paper for detailed training configurations.

---

## Intended Use

ViT-5 is designed as a general-purpose vision backbone and can be used for:

- Image classification (fine-tuning or linear probing)
- Transfer learning to detection and segmentation
- Vision-language modeling
- Generative modeling backbones (e.g., diffusion transformers)
- Research on Transformer scaling and representation learning

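Linear probing, mentioned in the list above, amounts to fitting a linear classifier on frozen backbone features. A minimal NumPy sketch with a closed-form ridge-regression probe; the synthetic clustered features stand in for frozen backbone outputs, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, classes = 512, 64, 10  # illustrative sizes, not ViT-5's actual feature dim

# Stand-in for frozen backbone features: one noisy cluster per class.
labels = rng.integers(0, classes, size=n)
centers = rng.normal(size=(classes, d))
feats = centers[labels] + 0.1 * rng.normal(size=(n, d))

# Closed-form ridge-regression probe onto one-hot targets:
# w = (X^T X + lam*I)^-1 X^T Y, solved without an explicit inverse.
onehot = np.eye(classes)[labels]
lam = 1e-3
w = np.linalg.solve(feats.T @ feats + lam * np.eye(d), feats.T @ onehot)

pred = (feats @ w).argmax(-1)
acc = (pred == labels).mean()
print(f"probe accuracy: {acc:.2f}")
```

In practice one would replace `feats` with pooled ViT-5 features and typically train the probe with SGD or logistic regression, but the frozen-backbone-plus-linear-head structure is the same.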
---

## Citation

If you use this model, please cite:

```bibtex
@article{wang2026vit5,
  title={ViT-5: Vision Transformers for the Mid-2020s},
  author={Wang, Feng and Ren, Sucheng and Zhang, Tiezheng and Neskovic, Predrag and Bhattad, Anand and Xie, Cihang and Yuille, Alan},
  journal={arXiv preprint arXiv:2602.08071},
  year={2026}
}
```

---

## Acknowledgements

This work builds on the foundation of Vision Transformers and recent advances in scalable Transformer design.