hash2004 committed
Commit 647dcdc · verified · 1 Parent(s): 7f770ff

Update README.md

Update README.md

Files changed (1): README.md +10 -2
README.md CHANGED
@@ -4,8 +4,16 @@ This model is a Multi-Axis Vision Transformer (MaxViT) trained for video generat
 
 ### Motivation
 
-Vision Transformers (ViTs), as introduced in the [original paper](https://arxiv.org/pdf/2010.11929), showed a new way to process image-based tasks by applying the Transformer mechanism to vision. However, ViTs are data-hungry, lack sufficient inductive bias, and often underperform without extensive pretraining. MaxViT addresses these limitations by combining local and global attention mechanisms with hierarchical architectures, resulting in improved scalability and generalizability.
-The Swin Transformer attempted to address the data-hungry nature of Transformers by introducing shifted non-overlapping windows for self-attention, enabling hierarchical feature extraction. This innovative approach allowed Swin to outperform ConvNets on the ImageNet benchmark, marking a significant milestone for vision Transformers. However, its reliance on window-based attention limited the model's capacity due to a loss of non-locality, making it less effective for larger datasets like ImageNet-21K. In contrast, MaxViT leverages Multi-Axis Self-Attention (Max-SA) to seamlessly combine local and global interactions within a single module, overcoming these limitations by providing a global receptive field with linear computational complexity. This approach strikes a balance between capacity, generalizability, and efficiency.
+Vision Transformers (ViTs), as introduced in the [original paper](https://arxiv.org/pdf/2010.11929), showed a new way to process image-based tasks by applying the Transformer mechanism to vision.
+However, ViTs are data-hungry, lack sufficient inductive bias, and often underperform without extensive pretraining.
+
+MaxViT addresses these limitations by combining local and global attention mechanisms with hierarchical architectures, resulting in improved scalability and generalizability.
+
+The Swin Transformer attempted to address the data-hungry nature of Transformers by introducing shifted non-overlapping windows for self-attention, enabling hierarchical feature extraction.
+This approach allowed Swin to outperform ConvNets on the ImageNet benchmark, marking a significant milestone for vision Transformers.
+However, its reliance on window-based attention limited the model's capacity due to a loss of non-locality, making it less effective for larger datasets like ImageNet-21K.
+
+In contrast, MaxViT leverages Multi-Axis Self-Attention (Max-SA) to seamlessly combine local and global interactions within a single module, overcoming these limitations by providing a global receptive field with linear computational complexity. This approach strikes a balance between capacity, generalizability, and efficiency.
 
 ---
 
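The Max-SA decomposition referenced in the added text (local block attention plus global grid attention) can be sketched by its two partitioning steps. This is an illustrative NumPy sketch, not the model's implementation; the 8×8 feature map and the window/grid size of 4 are hypothetical, and the attention computation itself is omitted.

```python
import numpy as np

def block_partition(x, w):
    """Local branch: split an (H, W, C) map into non-overlapping w x w windows.
    Self-attention is then computed inside each window (local interactions)."""
    H, W, C = x.shape
    return (x.reshape(H // w, w, W // w, w, C)
             .transpose(0, 2, 1, 3, 4)      # (H//w, W//w, w, w, C)
             .reshape(-1, w * w, C))        # one row per window

def grid_partition(x, g):
    """Global branch: group tokens into a g x g grid strided across the map.
    Self-attention inside each group mixes distant positions (global interactions)."""
    H, W, C = x.shape
    return (x.reshape(g, H // g, g, W // g, C)
             .transpose(1, 3, 0, 2, 4)      # (H//g, W//g, g, g, C)
             .reshape(-1, g * g, C))        # one row per dilated grid group

# Toy 8x8 feature map with 4 channels (hypothetical sizes).
x = np.random.rand(8, 8, 4)
blocks = block_partition(x, 4)   # shape (4, 16, 4): four local 4x4 windows
grids = grid_partition(x, 4)     # shape (4, 16, 4): four dilated 4x4 grid groups
```

Because every group has a fixed number of tokens (w² or g²), attention cost grows linearly with the total token count rather than quadratically, which is the linear-complexity property the paragraph refers to.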