Update README.md
README.md
CHANGED
@@ -4,8 +4,16 @@ This model is a Multi-Axis Vision Transformer (MaxViT) trained for video generation.

### Motivation

-Vision Transformers (ViTs), as introduced in the [original paper](https://arxiv.org/pdf/2010.11929), showed a new way to process image-based tasks by applying the Transformer mechanism to vision.
-
+Vision Transformers (ViTs), as introduced in the [original paper](https://arxiv.org/pdf/2010.11929), showed a new way to process image-based tasks by applying the Transformer mechanism to vision.
+However, ViTs are data-hungry, lack sufficient inductive bias, and often underperform without extensive pretraining.
+
+MaxViT addresses these limitations by combining local and global attention mechanisms in a hierarchical architecture, improving scalability and generalizability.
+
+The Swin Transformer attempted to address the data-hungry nature of Transformers by introducing shifted, non-overlapping windows for self-attention, enabling hierarchical feature extraction.
+This approach allowed Swin to outperform ConvNets on the ImageNet benchmark, a significant milestone for vision Transformers.
+However, its reliance on window-based attention limited the model's capacity due to a loss of non-locality, making it less effective on larger datasets such as ImageNet-21K.
+
+In contrast, MaxViT leverages Multi-Axis Self-Attention (Max-SA) to combine local and global interactions within a single module, providing a global receptive field with linear computational complexity. This strikes a balance between capacity, generalizability, and efficiency.

---
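For readers unfamiliar with the multi-axis idea, below is a minimal sketch of how block (local) and grid (global) attention decompose a feature map, in the spirit of the Max-SA module described above. This is an illustration, not this repository's code: the window size `p`, grid size `g`, the `(B, H, W, C)` tensor layout, and the shared `nn.MultiheadAttention` module are all assumptions (the actual MaxViT block uses separate weights per axis and an MBConv layer before attention).

```python
import torch
import torch.nn as nn


def block_attention(x: torch.Tensor, attn: nn.MultiheadAttention, p: int) -> torch.Tensor:
    """Local attention: tokens attend within non-overlapping p x p windows."""
    b, h, w, c = x.shape
    # Partition (B, H, W, C) into windows: (B * num_windows, p*p, C).
    x = x.reshape(b, h // p, p, w // p, p, c)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, p * p, c)
    out, _ = attn(x, x, x, need_weights=False)
    # Reverse the partition back to (B, H, W, C).
    out = out.reshape(b, h // p, w // p, p, p, c)
    return out.permute(0, 1, 3, 2, 4, 5).reshape(b, h, w, c)


def grid_attention(x: torch.Tensor, attn: nn.MultiheadAttention, g: int) -> torch.Tensor:
    """Global attention: a g x g grid of tokens strided across the whole map."""
    b, h, w, c = x.shape
    # Partition into a fixed g x g grid; each attention group gathers tokens
    # from across the entire image, giving a global (dilated) receptive field.
    x = x.reshape(b, g, h // g, g, w // g, c)
    x = x.permute(0, 2, 4, 1, 3, 5).reshape(-1, g * g, c)
    out, _ = attn(x, x, x, need_weights=False)
    out = out.reshape(b, h // g, w // g, g, g, c)
    return out.permute(0, 3, 1, 4, 2, 5).reshape(b, h, w, c)


# Usage: one multi-axis pass = local block attention, then global grid attention.
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(2, 16, 16, 64)  # (B, H, W, C); H and W divisible by p and g
y = grid_attention(block_attention(x, attn, p=4), attn, g=4)
```

Because each attention group contains a fixed number of tokens (p² or g²), the cost of both axes grows linearly with the number of pixels H·W, whereas full self-attention is quadratic; this is the linear-complexity, global-receptive-field property the motivation section refers to.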