🧠 Model

vit-base-patch16-224-in21k

by google

License: apache-2.0 | Tags: vision | Datasets: imagenet-21k | Inference widget: disabled

πŸ• Updated 12/19/2025

🧠 Architecture Explorer

Neural network architecture (see the configuration sketch below):

  1. Input Layer
  2. Hidden Layers
  3. Attention
  4. Output Layer
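As a rough illustration of what these stages correspond to for this checkpoint, the sketch below builds the standard ViT-Base configuration with the Hugging Face transformers library. The hyperparameter values (16x16 patches, 224x224 input, 12 encoder layers, 12 attention heads, 768-dimensional hidden states) are the published ViT-Base settings, assumed here rather than taken from this page.

```python
from transformers import ViTConfig, ViTModel

# ViT-Base hyperparameters (assumed from the original paper / standard HF config,
# not read from this page).
config = ViTConfig(
    image_size=224,          # input layer: 224x224 RGB image
    patch_size=16,           # split into 14x14 = 196 patches of 16x16 pixels
    hidden_size=768,         # embedding dimension of each patch token
    num_hidden_layers=12,    # "hidden layers": 12 Transformer encoder blocks
    num_attention_heads=12,  # "attention": 12 self-attention heads per block
    intermediate_size=3072,  # MLP expansion inside each encoder block
)

# Randomly initialised model; the output layer is the encoder's last hidden
# state, one 768-dimensional vector per token ([CLS] plus 196 patch tokens).
model = ViTModel(config)
print(model.config)
```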

About

Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224. It was introduced in the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" by Dosovitskiy et al. and first released in this repository. The weights were converted from the timm repository by Ross Wightman, who had already converted them from JAX to PyTorch.
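For reference, a minimal feature-extraction sketch using the transformers library; the class names follow the standard ViT integration, and the example image URL is an illustrative placeholder rather than anything specified by this page.

```python
from transformers import ViTImageProcessor, ViTModel
from PIL import Image
import requests

# Any RGB image works; this URL is just an illustrative placeholder.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The processor resizes/normalises the image to 224x224 and returns pixel_values.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)

# Shape (batch, 197, 768): one [CLS] token plus 196 patch tokens of dimension 768.
last_hidden_states = outputs.last_hidden_state
print(last_hidden_states.shape)
```

Because this checkpoint is pre-trained only and ships without a fine-tuned classification head, it is typically used as a feature extractor or fine-tuned on a downstream labelled dataset.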

πŸ“ Limitations & Considerations

  • Benchmark scores may vary based on evaluation methodology and hardware configuration.
  • VRAM requirements are estimates; actual usage depends on quantization and batch size.
  • FNI scores are relative rankings and may change as new models are added.
  • Data source: Hugging Face (https://huggingface.co/google/vit-base-patch16-224-in21k), fetched 2025-12-19, adapter version 3.2.0.

📚 Related Resources

📄 Related Papers

No related papers linked yet. Check the model's official documentation for research papers.

📊 Training Datasets

Pre-training dataset: ImageNet-21k (14 million images, 21,843 classes). Refer to the original model card for further details.

🔗 Related Models

No related models linked yet.

🚀 What's Next?