---
license: unknown
---

# ViT-5

**ViT-5: Vision Transformers for the Mid-2020s**
Official checkpoint release.

📄 Paper: https://arxiv.org/abs/2602.08071
💻 Code: https://github.com/wangf3014/ViT-5

---

## Overview

ViT-5 is a modernized Vision Transformer backbone that preserves the canonical Attention–FFN block structure while systematically upgrading its internal components with best practices from recent large-scale vision modeling research.

Rather than proposing a new paradigm, ViT-5 focuses on refining and consolidating improvements that have emerged over the past few years into a clean, scalable, and reproducible ViT design suitable for mid-2020s workloads.

This repository provides pretrained ViT-5 checkpoints for image recognition and for use as a general-purpose vision backbone.

---

## Model Architecture

ViT-5 retains the standard Transformer encoder structure:

Patch Embedding → [Attention → FFN] × L → Classification Head

but modernizes key components, including:

- Improved normalization strategy
- Updated positional encoding
- Refined activation design
- Architectural stabilization techniques
- Training refinements

Full architectural details are described in the paper.

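The Patch Embedding → [Attention → FFN] × L → Head pipeline above can be sketched end to end in a few lines of numpy. This is a minimal single-head, pre-norm illustration with toy dimensions chosen for readability, not ViT-5's actual configuration, modernized components, or multi-head attention (those are in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions, NOT the ViT-5 configuration (see the paper for that):
img, patch, d, depth, n_cls = 32, 8, 64, 2, 10
n_tok = (img // patch) ** 2  # 16 patch tokens

def layernorm(x):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + 1e-6)

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def block(x, p):
    # Pre-norm Attention, then pre-norm FFN, each with a residual connection.
    h = layernorm(x)
    q, k, v = h @ p["Wq"], h @ p["Wk"], h @ p["Wv"]
    x = x + softmax(q @ k.T / np.sqrt(d)) @ v @ p["Wo"]
    h = layernorm(x)
    x = x + np.maximum(h @ p["W1"], 0) @ p["W2"]  # ReLU here; GELU in practice
    return x

# Patch embedding: flatten non-overlapping patches, project to d dims.
image = rng.normal(size=(img, img, 3))
patches = image.reshape(img // patch, patch, img // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(n_tok, -1)
Wp = rng.normal(0, 0.02, (patch * patch * 3, d))
tokens = patches @ Wp + rng.normal(0, 0.02, (n_tok, d))  # + positional term

params = [{k: rng.normal(0, 0.02, s) for k, s in
           [("Wq", (d, d)), ("Wk", (d, d)), ("Wv", (d, d)), ("Wo", (d, d)),
            ("W1", (d, 4 * d)), ("W2", (4 * d, d))]} for _ in range(depth)]
for p in params:
    tokens = block(tokens, p)

# Classification head on mean-pooled tokens (ViTs often use a class token
# instead; mean pooling keeps the sketch short).
logits = layernorm(tokens).mean(0) @ rng.normal(0, 0.02, (d, n_cls))
print(logits.shape)  # one logit per class
```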
---

## Available Checkpoints

| Model | Input Resolution | Params | Top-1 (ImageNet-1K) | Notes |
|-------|------------------|--------|---------------------|-------|
| ViT-5-Small | 224 | 22M | 82.2% | |
| ViT-5-Base | 224 | 87M | 84.2% | |
| ViT-5-Base | 384 | 87M | 85.4% | |
| ViT-5-Large | 224 | 304M | 84.9% | |
| ViT-5-Large | 384 | 304M | 86.0% | Available soon |

Please refer to the paper for detailed training configuration.

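As a sanity check on the table, the parameter counts can be estimated from standard ViT dimensions. The values below (width 768, depth 12, MLP ratio 4, patch size 16) are the conventional ViT-Base settings, assumed here for illustration; ViT-5's exact configuration is specified in the paper and may differ:

```python
# Back-of-the-envelope parameter count at "Base" scale (assumed standard
# ViT-Base dimensions; ViT-5's exact configuration is in the paper).
width, depth, mlp_ratio = 768, 12, 4
patch, channels, classes = 16, 3, 1000
tokens = (224 // patch) ** 2 + 1              # 14x14 grid + class token

patch_embed = patch * patch * channels * width + width
pos_embed = tokens * width + width            # positional + class embeddings
per_block = (
    4 * width * width + 4 * width             # attention: QKV + output proj
    + 2 * mlp_ratio * width * width + (mlp_ratio + 1) * width  # FFN
    + 4 * width                               # two LayerNorms (scale + bias)
)
head = width * classes + classes + 2 * width  # classifier + final LayerNorm
total = patch_embed + pos_embed + depth * per_block + head
print(f"{total / 1e6:.1f}M")                  # ~86.6M, consistent with 87M
```

Note that changing the input resolution from 224 to 384 leaves the parameter count essentially unchanged (only the positional embeddings grow), which is why both ViT-5-Base rows report 87M.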
---

## Intended Use

ViT-5 is designed as a general-purpose vision backbone and can be used for:

- Image classification (fine-tuning or linear probing)
- Transfer learning to detection and segmentation
- Vision-language modeling
- Generative modeling backbones (e.g., diffusion transformers)
- Research on Transformer scaling and representation learning

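Of the uses above, linear probing is the simplest: freeze the backbone, extract features, and fit a linear classifier on top. The toy sketch below uses random, well-separated synthetic features standing in for frozen ViT-5 features (which would come from the real checkpoint), and fits a ridge-regularized least-squares probe:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for frozen backbone features: 2 classes, 64-dim,
# with class means pushed far apart so the probe is easy to verify.
n, d = 200, 64
y = rng.integers(0, 2, n)
means = rng.normal(0, 5.0, (2, d))
X = means[y] + rng.normal(0, 1.0, (n, d))

# Linear probe: one-hot least squares with a small ridge term.
Y = np.eye(2)[y]
W = np.linalg.solve(X.T @ X + 1e-3 * np.eye(d), X.T @ Y)
acc = (np.argmax(X @ W, axis=1) == y).mean()
print(f"probe accuracy: {acc:.2f}")
```

In practice one would replace `X` with features from the pretrained checkpoint and use a held-out split to report accuracy; the closed-form probe is just the cheapest way to compare frozen representations.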
---

## Citation

If you use this model, please cite:

```bibtex
@article{wang2026vit5,
  title={ViT-5: Vision Transformers for the Mid-2020s},
  author={Wang, Feng and Ren, Sucheng and Zhang, Tiezheng and Neskovic, Predrag and Bhattad, Anand and Xie, Cihang and Yuille, Alan},
  journal={arXiv preprint arXiv:2602.08071},
  year={2026}
}
```

---

## Acknowledgements

This work builds on the foundation of Vision Transformers and recent advances in scalable Transformer design.