Aria Model Card
[Dec 1, 2024] We have released the base models (with native multimodal pre-training) for Aria (Aria-Base-8K and Aria-Base-64K) for research purposes and continued training.
Key features
- SoTA Multimodal Native Performance: Aria achieves strong performance on a wide range of multimodal, language, and coding tasks. It is particularly strong at video and document understanding.
- Lightweight and Fast: Aria is a mixture-of-experts model with 3.9B activated parameters per token. It efficiently encodes visual input of variable sizes and aspect ratios.
- Long Multimodal Context Window: Aria supports multimodal input of up to 64K tokens. It can caption a 256-frame video in 10 seconds.
🔗 Try Aria! · 📖 Blog · 📌 Paper · ⭐ GitHub · 🟣 Discord
Benchmark
| Category | Benchmark | Aria | Pixtral 12B | Llama3.2 11B | GPT-4o mini | Gemini-1.5 Flash |
|---|---|---|---|---|---|---|
| Knowledge (Multimodal) | MMMU | 54.9 | 52.5 | 50.7 | 59.4 | 56.1 |
| Math (Multimodal) | MathVista | 66.1 | 58.0 | 51.5 | - | 58.4 |
| Document | DocQA | 92.6 | 90.7 | 84.4 | - | 89.9 |
| Chart | ChartQA | 86.4 | 81.8 | 83.4 | - | 85.4 |
| Scene Text | TextVQA | 81.1 | - | - | - | 78.7 |
| General Visual QA | MMBench-1.1 | 80.3 | - | - | 76.0 | - |
| Video Understanding | LongVideoBench | 65.3 | 47.4 | 45.7 | 58.8 | 62.4 |
| Knowledge (Language) | MMLU (5-shot) | 73.3 | 69.2 | 69.4 | - | 78.9 |
| Math (Language) | MATH | 50.8 | 48.1 | 51.9 | 70.2 | - |
| Reasoning (Language) | ARC Challenge | 91.0 | - | 83.4 | 96.4 | - |
| Coding | HumanEval | 73.2 | 72.0 | 72.6 | 87.2 | 74.3 |
Quick Start
Installation
pip install "transformers>=4.48.0" accelerate sentencepiece torchvision requests torch Pillow
pip install flash-attn --no-build-isolation
# Optional: install grouped-gemm for better inference performance (building it may take 3-5 minutes)
pip install grouped_gemm==0.1.6
Inference
Aria has 25.3B total parameters and can be loaded on a single A100 (80GB) GPU with bfloat16 precision.
Here is a code snippet to show you how to use Aria.
import requests
import torch
from PIL import Image
from transformers import AriaProcessor, AriaForConditionalGeneration

model_id_or_path = "rhymes-ai/Aria"

model = AriaForConditionalGeneration.from_pretrained(
    model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16
)
processor = AriaProcessor.from_pretrained(model_id_or_path)

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "what is the image?"},
        ],
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
inputs = inputs.to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=15,
    stop_strings=["<|im_end|>"],
    tokenizer=processor.tokenizer,
    do_sample=True,
    temperature=0.9,
)
output_ids = output[0][inputs["input_ids"].shape[1]:]
response = processor.decode(output_ids, skip_special_tokens=True)
print(response)
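If a single 80GB GPU is not available, on-the-fly 4-bit quantization via bitsandbytes may shrink the weight footprint enough for smaller cards. The following is a minimal sketch, not an officially supported configuration: it assumes bitsandbytes is installed (pip install bitsandbytes) and that the checkpoint quantizes cleanly, and quantization can reduce output quality.

import torch
from transformers import AriaForConditionalGeneration, AriaProcessor, BitsAndBytesConfig

model_id_or_path = "rhymes-ai/Aria"

# Assumption: NF4 4-bit weights with bfloat16 compute; not validated by the Aria authors.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AriaForConditionalGeneration.from_pretrained(
    model_id_or_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
)
processor = AriaProcessor.from_pretrained(model_id_or_path)
# The rest of the pipeline (processor call and model.generate) is unchanged.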
Starting from transformers v4.48, you can also pass an image URL or local path directly in the conversation history and let the chat template handle the rest. The chat template will load the image for you and return the inputs as torch.Tensor, which you can pass directly to model.generate(). Here is how to rewrite the above example:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "what is the image?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
)
inputs = inputs.to(model.device, torch.bfloat16)

output = model.generate(
    **inputs,
    max_new_tokens=15,
    stop_strings=["<|im_end|>"],
    tokenizer=processor.tokenizer,
    do_sample=True,
    temperature=0.9,
)
output_ids = output[0][inputs["input_ids"].shape[1]:]
response = processor.decode(output_ids, skip_special_tokens=True)
print(response)
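The long multimodal context also allows many images in one prompt, for example sampled video frames (the model card above cites captioning a 256-frame video). The snippet below is a rough sketch rather than an official recipe: it reuses the model and processor loaded earlier, the frame_*.jpg file names are placeholders for frames you have already extracted, and it assumes one image is consumed per {"type": "image"} entry in the conversation.

import torch
from PIL import Image

# Placeholder frame files; in practice, sample frames from a video with a tool of your choice.
frames = [Image.open(f"frame_{i:03d}.jpg") for i in range(8)]

messages = [
    {
        "role": "user",
        "content": [{"type": "image"} for _ in frames]
        + [{"type": "text", "text": "Describe what happens in this video."}],
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=frames, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
inputs = inputs.to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=128,
    stop_strings=["<|im_end|>"],
    tokenizer=processor.tokenizer,
)
response = processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)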
Advanced Inference and Fine-tuning
We provide a codebase for more advanced usage of Aria, including vLLM inference, cookbooks, and fine-tuning on custom datasets.
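For higher-throughput serving, vLLM lists Aria among its supported multimodal models. The sketch below uses vLLM's generic multimodal API rather than the official Aria cookbook, so treat the engine arguments (for example dtype and limit_mm_per_prompt) and the example.jpg path as illustrative assumptions to verify against the linked codebase and your vLLM version.

from PIL import Image
from transformers import AriaProcessor
from vllm import LLM, SamplingParams

model_id = "rhymes-ai/Aria"

# Render the prompt with the model's own chat template so the image placeholder tokens are correct.
processor = AriaProcessor.from_pretrained(model_id)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "what is the image?"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Engine arguments here are illustrative defaults, not an official serving recipe.
llm = LLM(
    model=model_id,
    dtype="bfloat16",
    limit_mm_per_prompt={"image": 1},
)

image = Image.open("example.jpg")  # hypothetical local image
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=128, temperature=0.7, stop=["<|im_end|>"]),
)
print(outputs[0].outputs[0].text)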
Citation
If you find our work helpful, please consider citing.
@article{aria,
  title={Aria: An Open Multimodal Native Mixture-of-Experts Model},
  author={Dongxu Li and Yudong Liu and Haoning Wu and Yue Wang and Zhiqi Shen and Bowen Qu and Xinyao Niu and Guoyin Wang and Bei Chen and Junnan Li},
  year={2024},
  journal={arXiv preprint arXiv:2410.05993},
}