---
license: unknown
---

# ViT-5

**ViT-5: Vision Transformers for the Mid-2020s**

Official checkpoint release.

📄 Paper: https://arxiv.org/abs/2602.08071
💻 Code: https://github.com/wangf3014/ViT-5

---

## Overview

ViT-5 is a modernized Vision Transformer backbone that preserves the canonical Attention–FFN block structure while systematically upgrading its internal components with best practices from recent large-scale vision modeling research.

Rather than proposing a new paradigm, ViT-5 consolidates the improvements that have emerged over the past few years into a clean, scalable, and reproducible ViT design suited to mid-2020s workloads.

This repository provides pretrained ViT-5 checkpoints for image recognition and for use as a general-purpose vision backbone.

---

## Model Architecture

ViT-5 retains the standard Transformer encoder structure:

Patch Embedding → [Attention → FFN] × L → Classification Head

but modernizes key components, including:

- Improved normalization strategy
- Updated positional encoding
- Refined activation design
- Architectural stabilization techniques
- Training refinements

Full architectural details are described in the paper.
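
For orientation, the sketch below shows the canonical pre-norm [Attention → FFN] block that ViT-5 retains. The concrete component choices here (RMSNorm for the normalization and a SwiGLU-style gated FFN for the activation design) are assumptions standing in for the bullets above, not the confirmed ViT-5 design; the positional encoding and stabilization techniques are likewise left to the paper.

```python
# Minimal sketch of the retained [Attention -> FFN] block. RMSNorm and the
# SwiGLU-style FFN are ASSUMED stand-ins for the "improved normalization"
# and "refined activation design" bullets; see the paper for the actual
# ViT-5 components.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUFFN(nn.Module):
    """Gated FFN (SwiGLU), a common modern replacement for the GELU MLP."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden)
        self.w_up = nn.Linear(dim, hidden)
        self.w_down = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class EncoderBlock(nn.Module):
    """Pre-norm Transformer encoder block, as in the standard ViT."""

    def __init__(self, dim: int = 768, heads: int = 12, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.RMSNorm(dim)  # assumed choice; requires PyTorch >= 2.4
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.RMSNorm(dim)
        self.ffn = SwiGLUFFN(dim, int(dim * mlp_ratio))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # attention residual
        x = x + self.ffn(self.norm2(x))                    # FFN residual
        return x


tokens = torch.randn(2, 196, 768)    # (batch, 14×14 patch tokens, embed dim)
print(EncoderBlock()(tokens).shape)  # torch.Size([2, 196, 768])
```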

---

## Available Checkpoints

| Model | Input Resolution (px) | Params | Top-1 (ImageNet-1K) | Notes |
|-------|-----------------------|--------|---------------------|-------|
| ViT-5-Small | 224 | 22M | 82.2% | |
| ViT-5-Base | 224 | 87M | 84.2% | |
| ViT-5-Base | 384 | 87M | 85.4% | |
| ViT-5-Large | 224 | 304M | 84.9% | |
| ViT-5-Large | 384 | 304M | 86.0% | Available soon |

Please refer to the paper for the detailed training configuration.
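
Because the checkpoint file layout is not documented in this card, the loading sketch below is hedged: the filename `vit5_base_224.pth` and the `"model"` nesting key are hypothetical placeholders, not the release's confirmed format.

```python
# Hypothetical loading sketch: the filename and the "model" nesting key are
# assumptions, not this repo's documented checkpoint format.
import torch

ckpt = torch.load("vit5_base_224.pth", map_location="cpu")  # assumed filename
state_dict = ckpt.get("model", ckpt)  # some releases nest weights under "model"
for name, tensor in list(state_dict.items())[:5]:
    print(f"{name}: {tuple(tensor.shape)}")  # inspect the first few weight shapes
```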

---

## Intended Use

ViT-5 is designed as a general-purpose vision backbone and can be used for:

- Image classification (fine-tuning or linear probing; see the sketch after this list)
- Transfer learning to detection and segmentation
- Vision-language modeling
- Generative modeling backbones (e.g., diffusion transformers)
- Research on Transformer scaling and representation learning
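
As a concrete illustration of the linear-probing use case, here is a minimal sketch. The `backbone` below is a placeholder module that maps images to (batch, dim) features, since this card does not specify ViT-5's feature-extraction API; only the linear head is trained.

```python
# Linear-probing sketch under assumptions: `backbone` stands in for a frozen
# pretrained model producing (batch, dim) features; swap in the real ViT-5
# feature extractor once loaded.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, num_classes = 768, 1000
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, dim))  # placeholder
for p in backbone.parameters():
    p.requires_grad = False           # freeze the pretrained features
backbone.eval()

head = nn.Linear(dim, num_classes)    # only this probe is trained
opt = torch.optim.AdamW(head.parameters(), lr=1e-3)

images = torch.randn(8, 3, 224, 224)  # dummy batch at 224 resolution
labels = torch.randint(0, num_classes, (8,))
with torch.no_grad():
    feats = backbone(images)          # frozen (8, 768) features
loss = F.cross_entropy(head(feats), labels)
loss.backward()
opt.step()
print(f"probe loss: {loss.item():.3f}")
```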

---

## Citation

If you use this model, please cite:

```bibtex
@article{wang2026vit5,
  title={ViT-5: Vision Transformers for The Mid-2020s},
  author={Wang, Feng and Ren, Sucheng and Zhang, Tiezheng and Neskovic, Predrag and Bhattad, Anand and Xie, Cihang and Yuille, Alan},
  journal={arXiv preprint arXiv:2602.08071},
  year={2026}
}
```

---

## Acknowledgements

This work builds on the foundation of Vision Transformers and recent advances in scalable Transformer design.