---
license: mit
library_name: transformers
tags:
- vision-transformer
- image-classification
- efficient-transformer
- selective-attention
- knowledge-distillation
- computer-vision
pipeline_tag: image-classification
---

# Soft-Masked Selective Vision Transformer

## Model Description

Soft-Masked Selective Vision Transformer is an efficient **Vision Transformer (ViT)** designed to reduce the computational overhead of self-attention while maintaining competitive accuracy.
The model introduces a **patch-selective attention mechanism** that lets the transformer focus on the most salient image regions and dynamically down-weight less informative patches. This selective strategy substantially reduces the quadratic cost of full self-attention, making the model particularly suitable for **high-resolution vision tasks** and **resource-constrained environments**.
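
The paper's exact formulation is given in the citation below; as a rough illustration of the idea, a soft mask can be folded into standard multi-head attention by adding per-patch log-scores to the attention logits, so uninformative patches receive vanishing attention weight. The module and score-head names below are invented for this sketch, not the model's actual layer names:

```python
import torch
import torch.nn as nn

class SoftMaskedSelectiveAttention(nn.Module):
    """Illustrative sketch: multi-head attention where each patch gets a
    learned soft mask in [0, 1] that down-weights uninformative patches."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1)  # per-patch importance score

    def forward(self, x):
        B, N, C = x.shape
        # Soft mask in [0, 1] per patch
        mask = torch.sigmoid(self.score(x))                    # (B, N, 1)
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                   # each (B, H, N, hd)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim**0.5  # (B, H, N, N)
        # Adding log-mask to the logits multiplies each key patch's softmax
        # weight by its mask, so low-score patches are softly ignored
        attn = attn + torch.log(mask + 1e-6).transpose(1, 2).unsqueeze(1)
        attn = attn.softmax(dim=-1)
        return self.proj((attn @ v).transpose(1, 2).reshape(B, N, C))
```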

To further improve performance, the model leverages **knowledge distillation**, transferring representational knowledge from a stronger teacher network to enhance the accuracy of the lightweight transformer variants.

---

## Intended Use

This model is intended for:

- Image classification tasks
- Deployment in **compute- or memory-constrained environments**
- High-resolution image processing where standard ViTs are prohibitively expensive
- Research on efficient attention mechanisms and transformer compression

### Example Use Cases

- Edge or embedded vision systems
- Large-scale image analysis with reduced inference cost
- Efficient backbones for downstream vision tasks

---

## Training Details

- **Training Objective:** Cross-entropy loss with an optional distillation loss
- **Distillation:** Teacher–student framework
- **Optimizer:** AdamW
- **Training Dataset:** ImageNet-1k (ILSVRC 2012)
- **Evaluation Metrics:** Top-1 accuracy, FLOPs, parameter count
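
A minimal sketch of such a combined objective: cross-entropy on the ground-truth labels plus a temperature-scaled KL term distilling the teacher's soft predictions. The mixing weight `alpha` and `temperature` below are illustrative placeholders, not the values used in training:

```python
import torch
import torch.nn.functional as F

def distillation_objective(student_logits, teacher_logits, labels,
                           alpha=0.5, temperature=2.0):
    """Weighted sum of hard-label cross-entropy and soft-label KL
    distillation (generic sketch, not the paper's exact recipe)."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.log_softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
        log_target=True,
    ) * temperature**2  # rescale gradients to be temperature-independent
    return (1 - alpha) * ce + alpha * kd
```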

---

## Usage

### Image Classification Example

```python
import torch
import requests
from PIL import Image
from transformers import AutoModelForImageClassification, AutoImageProcessor

# Load an example image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load processor and model (trust_remote_code is required for the custom architecture)
processor = AutoImageProcessor.from_pretrained(
    "XAFT/SM-Selective-ViT-Base-224",
    trust_remote_code=True,
)
model = AutoModelForImageClassification.from_pretrained(
    "XAFT/SM-Selective-ViT-Base-224",
    trust_remote_code=True,
)
model = model.half()  # cast to FP16 to enable FlashAttention
model.eval()

# Preprocess
inputs = processor(images=image, return_tensors="pt")
inputs = inputs.to(torch.float16)  # match the model's FP16 weights

# Forward pass
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
predicted_class = logits.argmax(-1).item()

print("Predicted class:", model.config.id2label[predicted_class])
```
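
To inspect more than the single top prediction, the logits can be mapped to the k most likely labels with their probabilities. The helper below is a generic utility, not part of the model's API; `model.config.id2label` is the standard `transformers` label mapping:

```python
import torch

def top_k_predictions(logits, id2label, k=5):
    """Map classifier logits of shape (1, num_classes) to the k most
    likely (label, probability) pairs."""
    probs = logits.softmax(dim=-1)
    top = probs.topk(k, dim=-1)
    return [(id2label[i.item()], p.item())
            for i, p in zip(top.indices[0], top.values[0])]
```

With the snippet above, this would be called as `top_k_predictions(outputs.logits, model.config.id2label)`.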

---

## Evaluation Results

| Model                  | Top-1 Acc. | Top-5 Acc. | # Params | Avg. GFLOPs |
|------------------------|------------|------------|----------|-------------|
| Base                   | 80.350%    | 94.980%    | 86.60M   | 9.61        |
| Base (distilled)       | 80.990%    | 95.386%    | 87.37M   | 9.21        |
| Small                  | 78.662%    | 94.454%    | 22.06M   | 3.12        |
| Small (distilled)      | 79.000%    | 94.494%    | 22.45M   | 3.05        |
| Tiny tall              | 74.802%    | 92.794%    | 11.07M   | 1.64        |
| Tiny tall (distilled)  | 75.676%    | 92.988%    | 11.26M   | 1.64        |
| Tiny                   | 71.056%    | 90.192%    | 5.72M    | 0.95        |
| Tiny (distilled)       | 72.618%    | 91.338%    | 5.92M    | 0.93        |

---

## Acknowledgments

We thank the TPU Research Cloud program for providing the Cloud TPUs used to build and train the models in our extensive experiments.

## Citation

If you find our work helpful, please consider citing:

```bibtex
@article{TOULAOUI2026115151,
  title    = {Efficient vision transformers via patch selective soft-masked attention and knowledge distillation},
  author   = {Abdelfattah Toulaoui and Hamza Khalfi and Imad Hafidi},
  journal  = {Applied Soft Computing},
  pages    = {115151},
  year     = {2026},
  issn     = {1568-4946},
  doi      = {10.1016/j.asoc.2026.115151},
  url      = {https://www.sciencedirect.com/science/article/pii/S1568494626005995},
  keywords = {Vision transformer, Patch selection, Soft masking, Efficient inference}
}
```
|
|