---
license: unknown
---

# ViT-5

**ViT-5: Vision Transformers for the Mid-2020s**
Official checkpoint release.

📄 Paper: https://arxiv.org/abs/2602.08071
💻 Code: https://github.com/wangf3014/ViT-5

---

## Overview

ViT-5 is a modernized Vision Transformer backbone that preserves the canonical Attention–FFN block structure while systematically upgrading its internal components using best practices from recent large-scale vision modeling research.

Rather than proposing a new paradigm, ViT-5 focuses on refining and consolidating improvements that have emerged over the past few years into a clean, scalable, and reproducible ViT design suitable for mid-2020s workloads.

This repository provides pretrained ViT-5 checkpoints for image recognition and for use as a general-purpose vision backbone.

---

## Model Architecture

ViT-5 retains the standard Transformer encoder structure:

Patch Embedding → [Attention → FFN] × L → Classification Head

but modernizes key components, including:

- Improved normalization strategy
- Updated positional encoding
- Refined activation design
- Architectural stabilization techniques
- Training refinements

Full architectural details are described in the paper.
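For reference, the canonical Attention–FFN block named above can be sketched in a few lines of NumPy. This is a generic pre-norm, single-head sketch with illustrative dimensions; it is not the actual ViT-5 configuration (normalization, positional encoding, and activation differ per the paper).

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the feature dimension (learned scale/shift omitted for brevity).
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(x, w_qkv, w_out):
    # Single-head self-attention over the token sequence.
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (scores @ v) @ w_out

def ffn(x, w1, w2):
    # Two-layer MLP with a tanh-approximated GELU activation.
    h = x @ w1
    h = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return h @ w2

def block(x, p):
    # Pre-norm residual block: x + Attn(LN(x)), then x + FFN(LN(x)).
    x = x + attention(layer_norm(x), p["w_qkv"], p["w_out"])
    x = x + ffn(layer_norm(x), p["w1"], p["w2"])
    return x

rng = np.random.default_rng(0)
d, tokens = 64, 197  # e.g. 14x14 patches + a class token at 224px / patch size 16
params = {
    "w_qkv": rng.normal(0, 0.02, (d, 3 * d)),
    "w_out": rng.normal(0, 0.02, (d, d)),
    "w1": rng.normal(0, 0.02, (d, 4 * d)),
    "w2": rng.normal(0, 0.02, (4 * d, d)),
}
x = rng.normal(size=(tokens, d))
y = block(x, params)
print(y.shape)  # (197, 64): the block preserves the (tokens, features) shape
```

Because each sub-layer is residual and shape-preserving, stacking L such blocks (the `× L` in the diagram) is just repeated application of `block`.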

---

## Available Checkpoints

| Model       | Input Resolution | Params | Top-1 (ImageNet-1K) | Notes          |
|-------------|------------------|--------|---------------------|----------------|
| ViT-5-Small | 224              | 22M    | 82.2%               |                |
| ViT-5-Base  | 224              | 87M    | 84.2%               |                |
| ViT-5-Base  | 384              | 87M    | 85.4%               |                |
| ViT-5-Large | 224              | 304M   | 84.9%               |                |
| ViT-5-Large | 384              | 304M   | 86.0%               | Available soon |

Please refer to the paper for detailed training configurations.

---

## Intended Use

ViT-5 is designed as a general-purpose vision backbone and can be used for:

- Image classification (fine-tuning or linear probing)
- Transfer learning to detection and segmentation
- Vision-language modeling
- Generative modeling backbones (e.g., diffusion transformers)
- Research on Transformer scaling and representation learning

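Linear probing, mentioned in the list above, amounts to fitting a linear classifier on frozen backbone features. A minimal NumPy sketch with a closed-form ridge-regression probe; the synthetic clustered features stand in for frozen backbone outputs, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, classes = 512, 64, 10  # illustrative sizes, not ViT-5's actual feature dim

# Stand-in for frozen backbone features: one noisy cluster per class.
labels = rng.integers(0, classes, size=n)
centers = rng.normal(size=(classes, d))
feats = centers[labels] + 0.1 * rng.normal(size=(n, d))

# Closed-form ridge-regression probe onto one-hot targets:
# w = (X^T X + lam*I)^-1 X^T Y, solved without an explicit inverse.
onehot = np.eye(classes)[labels]
lam = 1e-3
w = np.linalg.solve(feats.T @ feats + lam * np.eye(d), feats.T @ onehot)

pred = (feats @ w).argmax(-1)
acc = (pred == labels).mean()
print(f"probe accuracy: {acc:.2f}")
```

In practice one would replace `feats` with pooled ViT-5 features and typically train the probe with SGD or logistic regression, but the frozen-backbone-plus-linear-head structure is the same.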
---

## Citation

If you use this model, please cite:

```bibtex
@article{wang2026vit5,
  title={ViT-5: Vision Transformers for the Mid-2020s},
  author={Wang, Feng and Ren, Sucheng and Zhang, Tiezheng and Neskovic, Predrag and Bhattad, Anand and Xie, Cihang and Yuille, Alan},
  journal={arXiv preprint arXiv:2602.08071},
  year={2026}
}
```

---

## Acknowledgements

This work builds on the foundation of Vision Transformers and recent advances in scalable Transformer design.