🧠
Model

Sm Selective Vit Small 224

by XAFT hf-model--xaft--sm-selective-vit-small-224
Nexus Index
38.4 Top 10%
S: Semantic 50
A: Authority 0
P: Popularity 0
R: Recency 95
Q: Quality 65
Tech Context
0.02B Params
4.096K Ctx
Vital Performance
2 DL / 30D
0.0%
Audited 38.4 FNI Score
Tiny 0.02B Params
4k Context
2 Downloads
8G GPU ~2GB Est. VRAM
Dense SMSelectiveViTModelForClassification Architecture
Commercial MIT License
Model Information Summary
Entity Passport
Registry ID hf-model--xaft--sm-selective-vit-small-224
License MIT
Provider huggingface
Compute Threshold

~1.3GB VRAM

* Static estimation for 4-Bit Quantization.

Cite this model

Academic & Research Attribution

BibTeX
@misc{hf_model__xaft__sm_selective_vit_small_224,
  author = {XAFT},
  title = {Sm Selective Vit Small 224 Model},
  year = {2026},
  howpublished = {\url{https://huggingface.co/XAFT/SM-Selective-ViT-Small-224}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}
APA Style
XAFT. (2026). Sm Selective Vit Small 224 [Model]. Free2AITools. https://huggingface.co/XAFT/SM-Selective-ViT-Small-224

Technical Deep Dive

Quick Commands

Ollama Run
ollama run sm-selective-vit-small-224
HF Download
huggingface-cli download XAFT/SM-Selective-ViT-Small-224
Install Lib
pip install -U transformers

---

Technical Deep Dive

Soft-Masked Selective Vision Transformer

Model Description

Soft-Masked Selective Vision Transformer is an efficient Vision Transformer (ViT) model designed to reduce the computational overhead of self-attention while maintaining competitive accuracy.
The model introduces a patch-selective attention mechanism that enables the transformer to focus on the most salient image regions and dynamically disregard less informative patches. This selective strategy significantly reduces the quadratic complexity typically associated with full self-attention, making the model particularly suitable for high-resolution vision tasks and resource-constrained environments.
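One way to picture soft masking is as a per-patch keep score that biases attention away from low-score patches instead of deleting them outright. The sketch below is an illustrative assumption, not the authors' implementation: the function name `soft_masked_attention`, the `keep_scores` input, and the choice to add `log(score)` to the attention logits are all hypothetical. Nothing is skipped here, so it shows only the masking arithmetic, not the compute savings.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def soft_masked_attention(queries, keys, values, keep_scores, eps=1e-6):
    """Single-head attention where key j is down-weighted by keep_scores[j].

    keep_scores are assumed to lie in [0, 1]; a score near 0 pushes the
    attention weight of that patch toward 0 without removing the patch.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Scaled dot-product logit plus a log-score bias (the soft mask).
        logits = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) + math.log(s + eps)
            for k, s in zip(keys, keep_scores)
        ]
        weights = softmax(logits)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
        )
    return outputs


# Three patches; the third has keep score 0 and is almost ignored.
queries = [[0.0, 0.0]]
keys = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
values = [[1.0], [3.0], [100.0]]
out = soft_masked_attention(queries, keys, values, keep_scores=[1.0, 1.0, 0.0])
print(out)
```

With identical keys, the output is close to the mean of the first two values, because the third patch receives a weight on the order of `eps`.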

To further improve performance, the model leverages knowledge distillation, transferring representational knowledge from a stronger teacher network to enhance the accuracy of lightweight transformer variants.
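The card does not say which distillation loss is used. As a minimal sketch, assuming the common soft-target formulation (hard-label cross-entropy blended with a temperature-scaled KL term between teacher and student), the objective could look like the following. The names `distillation_loss`, `alpha`, and `temperature` are illustrative and do not come from the source.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Log-probabilities of logits / temperature, computed stably."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_norm for x in scaled]


def distillation_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    """Blend of cross-entropy on the true label and KL(teacher || student).

    alpha weights the distillation term; temperature softens both
    distributions. Both defaults are assumptions for illustration only.
    """
    ce = -log_softmax(student_logits)[label]
    s_logp = log_softmax(student_logits, temperature)
    t_logp = log_softmax(teacher_logits, temperature)
    kl = sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
    # The T^2 factor keeps the soft-target term on the same scale as the CE term.
    return (1.0 - alpha) * ce + alpha * temperature**2 * kl


print(distillation_loss([0.0, 0.0], [0.0, 0.0], label=0))
print(distillation_loss([0.0, 0.0], [2.0, 0.0], label=0))
```

When student and teacher agree, the KL term vanishes and only the weighted cross-entropy remains; any disagreement adds a positive penalty.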


Intended Use

This model is intended for:

  • Image classification tasks
  • Deployment in compute- or memory-constrained environments
  • High-resolution image processing where standard ViTs are prohibitively expensive
  • Research on efficient attention mechanisms and transformer compression
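To see why high resolution is costly for a standard ViT, the following back-of-the-envelope sketch counts tokens and query-key pairs. It assumes a 16×16 patch size, a common ViT default that this card does not state, and ignores the class token; the helper names are hypothetical.

```python
def num_patches(image_size, patch_size=16):
    """Number of non-overlapping square patches (class token ignored)."""
    return (image_size // patch_size) ** 2


def attention_pairs(tokens):
    """Query-key pairs scored by full self-attention in one head of one layer."""
    return tokens * tokens


def kept_fraction_cost(keep_ratio):
    """Relative pairwise cost if only keep_ratio of the patches attend to each other."""
    return keep_ratio**2


for size in (224, 448):
    n = num_patches(size)
    print(size, n, attention_pairs(n))
```

Doubling the side length from 224 to 448 quadruples the token count (196 to 784) and multiplies the pairwise term by 16. Conversely, if a selective scheme kept half the patches (a hypothetical ratio, not a figure from the card), the pairwise term would drop to a quarter.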

Example Use Cases

  • Edge or embedded vision systems
  • Large-scale image analysis with reduced inference cost
  • Efficient backbones for downstream vision tasks

Training Details

  • Training Objective: Cross-entropy loss with optional distillation loss
  • Distillation: Teacher–student framework
  • Optimization: AdamW
  • Training Dataset: ILSVRC 2012
  • Evaluation Metrics: Top-1 accuracy, FLOPs, parameter count
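For reference, top-k accuracy can be computed from raw logits as sketched below. This is a generic pure-Python illustration on toy data, not the evaluation script used for this model; `top_k_accuracy` is a hypothetical helper.

```python
def top_k_accuracy(logits_batch, labels, k=1):
    """Fraction of samples whose true label is among the k highest logits."""
    correct = 0
    for logits, label in zip(logits_batch, labels):
        # Class indices sorted from highest to lowest logit.
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        if label in ranked[:k]:
            correct += 1
    return correct / len(labels)


# Toy batch: three samples, three classes.
toy_logits = [
    [0.1, 0.9, 0.0],
    [0.8, 0.15, 0.05],
    [0.2, 0.3, 0.5],
]
toy_labels = [1, 1, 2]

print(top_k_accuracy(toy_logits, toy_labels, k=1))
print(top_k_accuracy(toy_logits, toy_labels, k=2))
```

The second sample is wrong at k=1 but counted as correct at k=2, which is why top-5 figures are always at least as high as top-1.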

Usage

Image Classification Example

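```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Load a sample image and force three channels
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Load processor and model; the repo ships custom code, hence trust_remote_code
processor = AutoImageProcessor.from_pretrained(
    "XAFT/SM-Selective-ViT-Small-224",
    trust_remote_code=True,
)
model = AutoModelForImageClassification.from_pretrained(
    "XAFT/SM-Selective-ViT-Small-224",
    trust_remote_code=True,
)

# FP16 enables FlashAttention, which needs a CUDA GPU.
# Falling back to FP32 on CPU is assumed to be supported by the custom code.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dtype = torch.half if device.type == "cuda" else torch.float32
model = model.to(device=device, dtype=dtype).eval()

# Preprocess, then match the model's device and dtype
inputs = processor(images=image, return_tensors="pt")
inputs = inputs.to(device=device, dtype=dtype)

# Forward pass without gradient tracking
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
predicted_class = logits.argmax(-1).item()

print("Predicted class index:", predicted_class)
```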

Evaluation Results

| Model | Top-1 Acc. | Top-5 Acc. | # Params | Avg. GFLOPs |
|---|---|---|---|---|
| Base | 80.350% | 94.980% | 86.60M | 9.61 |
| Base (distilled) | 80.990% | 95.386% | 87.37M | 9.21 |
| Small | 78.662% | 94.454% | 22.06M | 3.12 |
| Small (distilled) | 79.000% | 94.494% | 22.45M | 3.05 |
| Tiny tall | 74.802% | 92.794% | 11.07M | 1.64 |
| Tiny tall (distilled) | 75.676% | 92.988% | 11.26M | 1.64 |
| Tiny | 71.056% | 90.192% | 5.72M | 0.95 |
| Tiny (distilled) | 72.618% | 91.338% | 5.92M | 0.93 |
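The effect of distillation can be read off the table with a little arithmetic. The snippet below copies the top-1 and GFLOPs columns from the table above and computes the gain per variant; the dictionary layout and the helper `distillation_gain` are just one convenient way to organize the numbers.

```python
# (top-1 %, avg GFLOPs) for the plain and distilled version of each variant,
# copied from the evaluation table.
results = {
    "Base": ((80.350, 9.61), (80.990, 9.21)),
    "Small": ((78.662, 3.12), (79.000, 3.05)),
    "Tiny tall": ((74.802, 1.64), (75.676, 1.64)),
    "Tiny": ((71.056, 0.95), (72.618, 0.93)),
}


def distillation_gain(name):
    """Top-1 improvement (percentage points) of the distilled variant."""
    (plain_acc, _), (distilled_acc, _) = results[name]
    return round(distilled_acc - plain_acc, 3)


for name in results:
    print(f"{name}: +{distillation_gain(name):.3f} points top-1")
```

In this table the gain ranges from 0.338 points (Small) to 1.562 points (Tiny), with the smallest variant benefiting most.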

Acknowledgments

We thank the TPU Research Cloud program for providing cloud TPUs that were used to build and train the models for our extensive experiments.

Citation

If you find our work helpful, please cite it:

```bibtex
@article{TOULAOUI2026115151,
    title = {Efficient vision transformers via patch selective soft-masked attention and knowledge distillation},
    journal = {Applied Soft Computing},
    pages = {115151},
    year = {2026},
    issn = {1568-4946},
    doi = {https://doi.org/10.1016/j.asoc.2026.115151},
    url = {https://www.sciencedirect.com/science/article/pii/S1568494626005995},
    author = {Abdelfattah Toulaoui and Hamza Khalfi and Imad Hafidi},
    keywords = {Vision transformer, Patch selection, Soft masking, Efficient inference}
}
```

⚠️ Incomplete Data

Some information about this model is not available. Use with Caution - Verify details from the original source before relying on this data.

Limitations & Considerations

  • Benchmark scores may vary based on evaluation methodology and hardware configuration.
  • VRAM requirements are estimates; actual usage depends on quantization and batch size.
  • FNI scores are relative rankings and may change as new models are added.
  • License: listed as MIT in the upstream metadata. Verify licensing terms before commercial use.

AI Summary: Based on Hugging Face metadata. Not a recommendation.

Model Transparency Report

Technical metadata sourced from upstream repositories.

Open Metadata

Identity & Source

id: hf-model--xaft--sm-selective-vit-small-224
slug: xaft--sm-selective-vit-small-224
source: huggingface
author: XAFT
license: MIT
tags: transformers, safetensors, softmasked_selective_vit, image-classification, vision-transformer, efficient-transformer, selective-attention, knowledge-distillation, computer-vision, custom_code, license:mit, region:us

Technical Specs

architecture: SMSelectiveViTModelForClassification
params (billions): 0.02
context length: 4,096
pipeline tag: image-classification
vram (GB): 1.3
vram is estimated: true
vram formula: VRAM ≈ (params * 0.75) + 0.8GB (KV) + 0.5GB (OS)
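Plugging this model's parameter count into the listed formula reproduces the 1.3 GB estimate. The sketch below is only a worked instance of that formula; the function name `estimate_vram_gb` and its keyword arguments are hypothetical, and the constants are taken directly from the formula line.

```python
def estimate_vram_gb(params_billions, kv_gb=0.8, os_gb=0.5):
    """Static VRAM estimate: weights term plus fixed KV and OS overheads."""
    weights_gb = params_billions * 0.75  # multiplier from the listed formula
    return weights_gb + kv_gb + os_gb


estimate = estimate_vram_gb(0.02)
print(round(estimate, 3))  # 1.315
print(round(estimate, 1))  # 1.3
```

For a model this small the weights contribute only 0.015 GB, so the estimate is almost entirely the two fixed overhead terms.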

Engagement & Metrics

downloads: 2
stars: 0
forks: 0

Data indexed from public sources. Updated daily.