--- library_name: keras-hub --- ### Model Overview # Swin Transformer Instantiates the Swin Transformer architecture. ## Model Details The Swin Transformer (Shifted Window Transformer) is a hierarchical vision transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities make Swin Transformer compatible with a broad range of vision tasks, including image classification, object detection, and semantic segmentation. ### Reference * [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) Unlike traditional Vision Transformers (ViT), which compute attention globally across all patches (resulting in quadratic complexity relative to image size), Swin Transformer computes self-attention within local non-overlapping windows. By shifting the window partition between consecutive layers, the model achieves cross-window connections, maintaining linear computational complexity while enabling robust global context modeling. ### Links * [Swin Transformer Quickstart Notebook](https://www.kaggle.com/code/prasadsachin/swin-transformer-quickstart-keras-hub) * [Swin Transformer API Documentation](https://keras.io/keras_hub/api/models/swin_transformer/) * [KerasHub Beginner Guide](https://keras.io/guides/keras_hub/getting_started/) * [KerasHub Model Publishing Guide](https://keras.io/guides/keras_hub/upload/) ## Installation Keras and KerasHub can be installed with: ```bash pip install -U -q keras-hub pip install -U -q keras ``` JAX, TensorFlow, and PyTorch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment, see the [Keras Getting Started](https://keras.io/getting_started/) page. ## Presets The following model checkpoints are provided by the Keras team. Weights have been ported from [Hugging Face Hub](https://huggingface.co/microsoft). | Preset name | Parameters | Description | | :--- | :--- | :--- | | **swin_tiny_patch4_window7_224** | 28.29M | Tiny Swin Transformer model pre-trained on ImageNet-1k at a 224x224 resolution | | **swin_small_patch4_window7_224** | 49.61M | Small Swin Transformer model pre-trained on ImageNet-1k at a 224x224 resolution | | **swin_base_patch4_window7_224** | 87.77M | Base Swin Transformer model pre-trained on ImageNet-1k at a 224x224 resolution | | **swin_base_patch4_window12_384** | 87.90M | Base Swin Transformer model pre-trained on ImageNet-1k at a 384x384 resolution | | **swin_large_patch4_window7_224** | 196.53M | Large Swin Transformer model pre-trained ImageNet-1k at a 224x224 resolution | | **swin_large_patch4_window12_384** | 196.74M | Large Swin Transformer model pre-trained on ImageNet-1k at a 384x384 resolution | ## Example Use ```python import numpy as np import keras_hub # Pretrained Swin Transformer backbone model = keras_hub.models.SwinTransformerBackbone.from_preset("swin_tiny_patch4_window7_224") input_data = np.random.uniform(0, 1, size=(2, 224, 224, 3)) model(input_data) # Randomly initialized Swin Transformer backbone with custom config model = keras_hub.models.SwinTransformerBackbone( image_shape=(224, 224, 3), embed_dim=96, depths=(2, 2, 6, 2), num_heads=(3, 6, 12, 24), window_size=7, ) model(input_data) # Use Swin Transformer for image classification task classifier = keras_hub.models.SwinTransformerImageClassifier.from_preset( "swin_tiny_patch4_window7_224", num_classes=1000, ) # Use Hugging Face presets directly for on-the-fly conversion classifier = keras_hub.models.SwinTransformerImageClassifier.from_preset( "hf://microsoft/swin-tiny-patch4-window7-224" ) ``` ## Example Usage ``` import numpy as np import keras_hub # Top-5 ImageNet class decoding. model = keras_hub.models.SwinTransformerImageClassifier.from_preset( "swin_tiny_patch4_window7_224" ) images = np.random.randint(0, 256, size=(1, 384, 384, 3), dtype="uint8") logits = model.predict(images, verbose=0) print(keras_hub.utils.decode_imagenet_predictions(logits, top=5)[0]) ``` ## Example Usage with Hugging Face URI ``` import numpy as np import keras_hub # Top-5 ImageNet class decoding. model = keras_hub.models.SwinTransformerImageClassifier.from_preset( "hf://keras/swin_tiny_patch4_window7_224" ) images = np.random.randint(0, 256, size=(1, 384, 384, 3), dtype="uint8") logits = model.predict(images, verbose=0) print(keras_hub.utils.decode_imagenet_predictions(logits, top=5)[0]) ```