🧠 Model

vit-base-patch16-224-in21k

by google

License: apache-2.0 | Tags: vision | Datasets: imagenet-21k | Inference widget: disabled

πŸ• Updated 12/19/2025

🧠 Architecture Explorer

Neural network architecture (see the configuration sketch below):

  1. Input Layer
  2. Hidden Layers
  3. Attention
  4. Output Layer
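As a rough illustration of what these stages correspond to for this checkpoint, the sketch below builds the standard ViT-Base configuration with the Hugging Face transformers library. The hyperparameter values (16x16 patches, 224x224 input, 12 encoder layers, 12 attention heads, 768-dimensional hidden states) are the published ViT-Base settings, assumed here rather than taken from this page.

```python
from transformers import ViTConfig, ViTModel

# ViT-Base hyperparameters (assumed from the original paper / standard HF config,
# not read from this page).
config = ViTConfig(
    image_size=224,          # input layer: 224x224 RGB image
    patch_size=16,           # split into 14x14 = 196 patches of 16x16 pixels
    hidden_size=768,         # embedding dimension of each patch token
    num_hidden_layers=12,    # "hidden layers": 12 Transformer encoder blocks
    num_attention_heads=12,  # "attention": 12 self-attention heads per block
    intermediate_size=3072,  # MLP expansion inside each encoder block
)

# Randomly initialised model; the output layer is the encoder's last hidden
# state, one 768-dimensional vector per token ([CLS] plus 196 patch tokens).
model = ViTModel(config)
print(model.config)
```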

About

Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224. It was introduced in the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" by Dosovitskiy et al. and first released in this repository. The weights were converted from the timm repository by Ross Wightman, who had already converted them from JAX to PyTorch.
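For reference, a minimal feature-extraction sketch using the transformers library; the class names follow the standard ViT integration, and the example image URL is an illustrative placeholder rather than anything specified by this page.

```python
from transformers import ViTImageProcessor, ViTModel
from PIL import Image
import requests

# Any RGB image works; this URL is just an illustrative placeholder.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The processor resizes/normalises the image to 224x224 and returns pixel_values.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)

# Shape (batch, 197, 768): one [CLS] token plus 196 patch tokens of dimension 768.
last_hidden_states = outputs.last_hidden_state
print(last_hidden_states.shape)
```

Because this checkpoint is pre-trained only and ships without a fine-tuned classification head, it is typically used as a feature extractor or fine-tuned on a downstream labelled dataset.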

πŸ“ Limitations & Considerations

  • Benchmark scores may vary based on evaluation methodology and hardware configuration.
  • VRAM requirements are estimates; actual usage depends on quantization and batch size.
  • FNI scores are relative rankings and may change as new models are added.
  • Data source: Hugging Face (https://huggingface.co/google/vit-base-patch16-224-in21k), fetched 2025-12-19, adapter version 3.2.0.

📚 Related Resources

📄 Related Papers

No related papers linked yet. Check the model's official documentation for research papers.

📊 Training Datasets

Pre-training dataset: ImageNet-21k (14 million images, 21,843 classes). Refer to the original model card for further details.

🔗 Related Models

No related models linked yet.

🚀 What's Next?