Update README.md
README.md
CHANGED
@@ -4,8 +4,16 @@ This model is a Multi-Axis Vision Transformer (MaxViT) trained for video generation.

### Motivation

-Vision Transformers (ViTs), as introduced in the [original paper](https://arxiv.org/pdf/2010.11929), showed a new way to process image-based tasks by applying the Transformer mechanism to vision.
-
+Vision Transformers (ViTs), as introduced in the [original paper](https://arxiv.org/pdf/2010.11929), showed a new way to process image-based tasks by applying the Transformer mechanism to vision.
+However, ViTs are data-hungry, lack sufficient inductive bias, and often underperform without extensive pretraining.
+
+MaxViT addresses these limitations by combining local and global attention mechanisms in a hierarchical architecture, improving scalability and generalizability.
+
+The Swin Transformer attempted to address the data-hungry nature of Transformers by introducing shifted, non-overlapping windows for self-attention, enabling hierarchical feature extraction.
+This approach allowed Swin to outperform ConvNets on the ImageNet benchmark, a significant milestone for vision Transformers.
+However, its reliance on window-based attention limited the model's capacity due to a loss of non-locality, making it less effective on larger datasets such as ImageNet-21K.
+
+In contrast, MaxViT leverages Multi-Axis Self-Attention (Max-SA) to combine local and global interactions within a single module, providing a global receptive field with linear computational complexity. This strikes a balance between capacity, generalizability, and efficiency.

---
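For readers unfamiliar with the multi-axis idea, below is a minimal sketch of how block (local) and grid (global) attention decompose a feature map, in the spirit of the Max-SA module described above. This is an illustration, not this repository's code: the window size `p`, grid size `g`, the `(B, H, W, C)` tensor layout, and the shared `nn.MultiheadAttention` module are all assumptions (the actual MaxViT block uses separate weights per axis and an MBConv layer before attention).

```python
import torch
import torch.nn as nn


def block_attention(x: torch.Tensor, attn: nn.MultiheadAttention, p: int) -> torch.Tensor:
    """Local attention: tokens attend within non-overlapping p x p windows."""
    b, h, w, c = x.shape
    # Partition (B, H, W, C) into windows: (B * num_windows, p*p, C).
    x = x.reshape(b, h // p, p, w // p, p, c)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, p * p, c)
    out, _ = attn(x, x, x, need_weights=False)
    # Reverse the partition back to (B, H, W, C).
    out = out.reshape(b, h // p, w // p, p, p, c)
    return out.permute(0, 1, 3, 2, 4, 5).reshape(b, h, w, c)


def grid_attention(x: torch.Tensor, attn: nn.MultiheadAttention, g: int) -> torch.Tensor:
    """Global attention: a g x g grid of tokens strided across the whole map."""
    b, h, w, c = x.shape
    # Partition into a fixed g x g grid; each attention group gathers tokens
    # from across the entire image, giving a global (dilated) receptive field.
    x = x.reshape(b, g, h // g, g, w // g, c)
    x = x.permute(0, 2, 4, 1, 3, 5).reshape(-1, g * g, c)
    out, _ = attn(x, x, x, need_weights=False)
    out = out.reshape(b, h // g, w // g, g, g, c)
    return out.permute(0, 3, 1, 4, 2, 5).reshape(b, h, w, c)


# Usage: one multi-axis pass = local block attention, then global grid attention.
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(2, 16, 16, 64)  # (B, H, W, C); H and W divisible by p and g
y = grid_attention(block_attention(x, attn, p=4), attn, g=4)
```

Because each attention group contains a fixed number of tokens (p² or g²), the cost of both axes grows linearly with the number of pixels H·W, whereas full self-attention is quadratic; this is the linear-complexity, global-receptive-field property the motivation section refers to.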