# Compact Convolutional Transformers
Based on the _Compact Convolutional Transformers_ example on [keras.io](https://keras.io/examples/vision/cct/) created by [Sayak Paul](https://twitter.com/RisingSayak).
## Model description
As discussed in the [Vision Transformers (ViT)](https://arxiv.org/abs/2010.11929) paper, a Transformer-based architecture for vision typically requires a larger dataset than usual, as well as a longer pre-training schedule. ImageNet-1k (which has about a million images) is considered to fall under the medium-sized data regime with respect to ViTs. This is primarily because, unlike CNNs, ViTs (or a typical Transformer-based architecture) do not have well-informed inductive biases (such as convolutions for processing images). This raises the question: can't we combine the benefits of convolutions and the benefits of Transformers in a single network architecture? These benefits include parameter efficiency and self-attention for processing long-range, global dependencies (interactions between different regions in an image).

In [Escaping the Big Data Paradigm with Compact Transformers](https://arxiv.org/abs/2104.05704), Hassani et al. present an approach for doing exactly this: the Compact Convolutional Transformer (CCT), which replaces ViT's patch embedding with a small convolutional tokenizer and its class token with sequence pooling.
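
A rough sketch of this idea in Keras is shown below. It is a simplification of the keras.io implementation, and the layer sizes and depths are illustrative placeholders, not this checkpoint's actual configuration:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative sizes only; the actual checkpoint's configuration may differ.
num_classes = 10  # CIFAR-10
projection_dim = 128
num_heads = 2
num_transformer_blocks = 2

inputs = keras.Input(shape=(32, 32, 3))

# Convolutional tokenizer: overlapping conv + pooling instead of ViT's
# non-overlapping patch embedding, keeping a convolutional inductive bias
# while producing a token sequence.
x = layers.Conv2D(projection_dim, kernel_size=3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D(pool_size=3, strides=2, padding="same")(x)
x = layers.Reshape((-1, projection_dim))(x)  # (batch, tokens, projection_dim)

# Standard pre-norm Transformer encoder blocks over the token sequence.
for _ in range(num_transformer_blocks):
    x1 = layers.LayerNormalization()(x)
    attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=projection_dim)(x1, x1)
    x2 = layers.Add()([attention, x])
    x3 = layers.LayerNormalization()(x2)
    x3 = layers.Dense(projection_dim * 2, activation="gelu")(x3)
    x3 = layers.Dense(projection_dim)(x3)
    x = layers.Add()([x3, x2])

# Sequence pooling: an attention-weighted average of the tokens replaces
# ViT's learnable class token.
x = layers.LayerNormalization()(x)
attention_weights = layers.Softmax(axis=1)(layers.Dense(1)(x))  # (batch, tokens, 1)
x = layers.Dot(axes=1)([attention_weights, x])                  # (batch, 1, projection_dim)
x = layers.Flatten()(x)

outputs = layers.Dense(num_classes)(x)  # logits
model = keras.Model(inputs, outputs)
model.summary()
```
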
## Training and evaluation data
The model is trained using the [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html).
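
CIFAR-10 ships with Keras, so loading it needs no extra tooling. The normalization below is a typical choice, not necessarily the exact preprocessing used for this reproduction:

```python
from tensorflow import keras

# CIFAR-10: 50,000 training and 10,000 test images, 32x32 RGB, 10 classes.
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()

# Scale pixel values to [0, 1] (a common default).
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

print(x_train.shape, y_train.shape)  # (50000, 32, 32, 3) (50000, 1)
```
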
## Training procedure

The following hyperparameters were used during training:
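
For illustration only, here is how such a setup is typically wired together in Keras, reusing `model` and the data from the sketches above. Every value below (optimizer, learning rate, batch size, epochs) is a placeholder rather than one of this checkpoint's recorded hyperparameters:

```python
from tensorflow import keras

# Placeholder hyperparameters for illustration; not the values used to
# train this checkpoint.
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

history = model.fit(
    x_train,
    y_train,
    batch_size=128,
    epochs=30,
    validation_split=0.1,
)
```
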

<center>
Model reproduced by <a href="https://github.com/EdAbati" target="_blank">Edoardo Abati</a>
</center>