Soft-Masked Selective Vision Transformer is an efficient Vision Transformer (ViT) designed to reduce the computational overhead of self-attention while maintaining competitive accuracy. It introduces a patch-selective attention mechanism that lets the transformer focus on the most salient image regions and dynamically disregard less informative patches. Because the cost of full self-attention grows quadratically with the number of patches, attending to fewer patches lowers that cost, which makes the model particularly suitable for high-resolution vision tasks and resource-constrained environments.
To further improve accuracy, the model uses knowledge distillation: a stronger teacher network transfers representational knowledge to the lightweight transformer variants.
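The patch-selective idea described above can be illustrated with a minimal NumPy sketch. This is not the model's actual implementation, which this card does not specify. It assumes a single attention head, a hypothetical linear scoring vector `w_score` that produces a per-patch saliency mask through a sigmoid, and a hypothetical `keep_ratio` for optionally dropping low-scoring patches. All function and parameter names are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_masked_attention(x, w_q, w_k, w_v, w_score):
    """Single-head attention with a soft per-patch mask (illustrative only).

    x: (num_patches, dim) patch embeddings.
    w_score: (dim,) hypothetical saliency scorer, an assumption of this sketch.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Soft mask in (0, 1): sigmoid of a per-patch saliency score.
    mask = 1.0 / (1.0 + np.exp(-(x @ w_score)))
    logits = q @ k.T / np.sqrt(q.shape[-1])
    # Adding log(mask) to the logits scales each key's unnormalized
    # attention weight by its mask value before the softmax normalizes.
    attn = softmax(logits + np.log(mask + 1e-9), axis=-1)
    return attn @ v, mask, attn

def keep_top_patches(x, mask, keep_ratio=0.5):
    """Keep the highest-scoring patches, preserving their original order."""
    k = max(1, int(round(keep_ratio * x.shape[0])))
    idx = np.sort(np.argsort(mask)[-k:])
    return x[idx], idx

rng = np.random.default_rng(0)
num_patches, dim = 196, 64  # e.g. a 224x224 image split into 16x16 patches
x = rng.normal(size=(num_patches, dim))
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3))
w_score = rng.normal(scale=0.1, size=dim)

out, mask, attn = soft_masked_attention(x, w_q, w_k, w_v, w_score)
x_kept, kept_idx = keep_top_patches(x, mask, keep_ratio=0.5)
print(out.shape, x_kept.shape)
```

In this sketch the soft mask alone does not reduce compute, since every patch still takes part in the attention matrix. Any saving would come from the optional selection step: keeping 98 of 196 patches shrinks the pairwise score matrix from 196² = 38,416 entries to 98² = 9,604, a 4x reduction.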
Intended Use
This model is intended for:
Image classification tasks
Deployment in compute- or memory-constrained environments
High-resolution image processing where standard ViTs are prohibitively expensive
Research on efficient attention mechanisms and transformer compression
Example Use Cases
Edge or embedded vision systems
Large-scale image analysis with reduced inference cost
Efficient backbones for downstream vision tasks
Training Details
Training Objective: Cross-entropy loss with optional distillation loss
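As a rough illustration of this objective, the sketch below combines cross-entropy with a soft-label distillation term. The card does not state which distillation formulation is used, so the temperature-scaled KL divergence here is an assumption (the common soft-target form), and `alpha` and `temperature` are hypothetical hyperparameters with illustrative defaults.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def training_loss(student_logits, labels, teacher_logits=None,
                  alpha=0.5, temperature=2.0):
    """Cross-entropy, optionally mixed with a distillation term.

    alpha and temperature are assumptions of this sketch, not documented values.
    """
    log_p = log_softmax(student_logits)
    ce = -log_p[np.arange(len(labels)), labels].mean()
    if teacher_logits is None:
        return ce  # distillation is optional
    log_s = log_softmax(student_logits / temperature)
    log_t = log_softmax(teacher_logits / temperature)
    # KL(teacher || student) on temperature-softened distributions.
    kl = (np.exp(log_t) * (log_t - log_s)).sum(axis=-1).mean()
    # The temperature**2 factor keeps gradient magnitudes comparable across temperatures.
    return (1.0 - alpha) * ce + alpha * temperature ** 2 * kl

rng = np.random.default_rng(0)
student = rng.normal(size=(8, 10))   # batch of 8 examples, 10 classes
teacher = rng.normal(size=(8, 10))
labels = rng.integers(0, 10, size=8)

ce_only = training_loss(student, labels)
with_kd = training_loss(student, labels, teacher_logits=teacher)
print(float(ce_only), float(with_kd))
```

When `teacher_logits` is omitted the function returns plain cross-entropy, which mirrors the "optional" wording of the objective.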