Soft-Masked Selective Vision Transformer

Model Description

Soft-Masked Selective Vision Transformer is an efficient Vision Transformer (ViT) model designed to reduce the computational overhead of self-attention while maintaining competitive accuracy.
The model introduces a patch-selective attention mechanism that lets the transformer focus on the most salient image regions while dynamically down-weighting less informative patches. By restricting attention to a reduced set of patches, this selective strategy substantially cuts the quadratic cost of full self-attention, making the model well suited to high-resolution vision tasks and resource-constrained environments.
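At a high level, the mechanism assigns each patch a saliency score and turns it into a multiplicative soft mask on the attention weights. The paper's exact formulation is not reproduced here; the following is a minimal single-head NumPy sketch of the idea, with all weight names as illustrative placeholders:

```python
import numpy as np

def soft_masked_attention(x, w_q, w_k, w_v, w_score, tau=1.0):
    """Toy single-head attention with a soft patch mask.

    Each patch gets a saliency score; a sigmoid maps scores to a soft
    mask in (0, 1) that down-weights attention toward less informative
    patches. Illustrative sketch only -- not the paper's formulation.
    """
    n, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = x @ w_score                          # (n,) saliency per patch
    mask = 1.0 / (1.0 + np.exp(-scores / tau))    # soft mask in (0, 1)
    logits = (q @ k.T) / np.sqrt(d)
    # Adding log(mask) to the logits multiplies the attention weights
    # by the mask before renormalization.
    logits = logits + np.log(mask + 1e-9)[None, :]
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v, mask
```

Because the mask is soft rather than binary, the selection remains differentiable and can be trained end to end; hard pruning would instead require discrete decisions.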

To further improve performance, the model leverages knowledge distillation, transferring representational knowledge from a stronger teacher network to enhance the accuracy of lightweight transformer variants.


Intended Use

This model is intended for:

  • Image classification tasks
  • Deployment in compute- or memory-constrained environments
  • High-resolution image processing where standard ViTs are prohibitively expensive
  • Research on efficient attention mechanisms and transformer compression

Example Use Cases

  • Edge or embedded vision systems
  • Large-scale image analysis with reduced inference cost
  • Efficient backbones for downstream vision tasks

Training Details

  • Training Objective: Cross-entropy loss with optional distillation loss
  • Distillation: Teacher–student framework
  • Optimization: AdamW
  • Training Dataset: ILSVRC 2012
  • Evaluation Metrics: Top-1 accuracy, FLOPs, parameter count
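The training objective above (cross-entropy plus an optional distillation term) can be sketched as follows; `alpha` and the temperature `T` are illustrative placeholders, not the paper's hyperparameter values:

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax with temperature T."""
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    """Cross-entropy blended with a soft distillation term (illustrative)."""
    n = student_logits.shape[0]
    # Hard-label cross-entropy against the ground truth
    p_student = softmax(student_logits)
    ce = -np.log(p_student[np.arange(n), labels] + 1e-12).mean()
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable
    pt = softmax(teacher_logits, T)
    ps = softmax(student_logits, T)
    kd = (pt * (np.log(pt + 1e-12) - np.log(ps + 1e-12))).sum(-1).mean() * T * T
    return (1 - alpha) * ce + alpha * kd
```

With `alpha=0` this reduces to plain cross-entropy, matching the "optional" distillation loss noted above.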

Usage

Image Classification Example

import torch
from transformers import AutoModelForImageClassification, AutoImageProcessor
from PIL import Image
import requests

# Load image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load processor and model
processor = AutoImageProcessor.from_pretrained(
    "XAFT/SM-Selective-ViT-Tiny-224",
    trust_remote_code=True,
)

model = AutoModelForImageClassification.from_pretrained(
    "XAFT/SM-Selective-ViT-Tiny-224",
    trust_remote_code=True,
)
model = model.half() # Cast to FP16 to enable FlashAttention

# Preprocess
inputs = processor(
    images=image,
    return_tensors="pt",
)
inputs = inputs.to(torch.half) # Cast to FP16

# Forward pass (no gradients needed for inference)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
predicted_class = logits.argmax(-1).item()

print("Predicted class index:", predicted_class)
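To rank several likely classes with their probabilities rather than taking a single argmax, a small helper can softmax the logits and take the top-k entries. The sketch below is self-contained with dummy logits; in practice you would pass the model output, e.g. `outputs.logits[0].float().numpy()`:

```python
import numpy as np

def top_k_predictions(logits, k=5):
    """Return (class_index, probability) pairs for the k most likely classes."""
    z = logits - logits.max()                 # stabilize the softmax
    probs = np.exp(z) / np.exp(z).sum()
    top = np.argsort(probs)[::-1][:k]
    return [(int(i), float(probs[i])) for i in top]

# Dummy logits standing in for a model output
preds = top_k_predictions(np.array([0.1, 2.0, 0.3, 1.5]), k=2)
```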

Evaluation Results

| Model | Top-1 Acc. | Top-5 Acc. | # Params | Avg. GFLOPs |
|---|---|---|---|---|
| Base | 80.350% | 94.980% | 86.60M | 9.61 |
| Base (distilled) | 80.990% | 95.386% | 87.37M | 9.21 |
| Small | 78.662% | 94.454% | 22.06M | 3.12 |
| Small (distilled) | 79.000% | 94.494% | 22.45M | 3.05 |
| Tiny tall | 74.802% | 92.794% | 11.07M | 1.64 |
| Tiny tall (distilled) | 75.676% | 92.988% | 11.26M | 1.64 |
| Tiny | 71.056% | 90.192% | 5.72M | 0.95 |
| Tiny (distilled) | 72.618% | 91.338% | 5.92M | 0.93 |

Acknowledgments

We thank the TPU Research Cloud program for providing cloud TPUs that were used to build and train the models for our extensive experiments.

Citation

If you find our work helpful, please consider citing it:

@article{TOULAOUI2026115151,
    title = {Efficient vision transformers via patch selective soft-masked attention and knowledge distillation},
    journal = {Applied Soft Computing},
    pages = {115151},
    year = {2026},
    issn = {1568-4946},
    doi = {10.1016/j.asoc.2026.115151},
    url = {https://www.sciencedirect.com/science/article/pii/S1568494626005995},
    author = {Abdelfattah Toulaoui and Hamza Khalfi and Imad Hafidi},
    keywords = {Vision transformer, Patch selection, Soft masking, Efficient inference}
}