aria

by rhymes-ai
Model ID: hf-model--rhymes-ai--aria
FNI 7.7
Top 92%

Audited 7.7 FNI Score
25.31B Params
4k Context
42.4K Downloads
24G GPU ~21GB Est. VRAM

Quick Commands

🦙 Ollama Run
ollama run aria
🤗 HF Download
huggingface-cli download rhymes-ai/aria
📦 Install Lib
pip install -U transformers
📊

Engineering Specs

Hardware

Parameters
25.31B
Architecture
AriaForConditionalGeneration
Context Length
4K
Model Size
98.9GB

🧠 Lifecycle

Library
-
Precision
float16
Tokenizer
-

🌐 Identity

Source
HuggingFace
License
Open Access
💾

Est. VRAM Benchmark

~20.3GB

* Technical estimation for FP16/Q4 weights. Does not include OS overhead or long-context batching. For Technical Reference Only.

🖥️

Hardware Compatibility

Multi-Tier Validation Matrix

Tier | GPU | VRAM | Status
Entry | RTX 3060 / 4060 Ti | 8GB | Compatible
Mid | RTX 4070 Super | 12GB | Compatible
High | RTX 4080 / Mac M3 | 16GB | Compatible
Pro | RTX 3090 / 4090 | 24GB | Compatible
Workstation | RTX 6000 Ada | 48GB | Compatible
Datacenter | A100 / H100 | 80GB | Compatible

Pro Tip: Compatibility is estimated for 4-bit quantization (Q4). High-precision (FP16) or ultra-long context windows will significantly increase VRAM requirements.

README

Aria Model Card

[Dec 1, 2024] We have released the base models (with native multimodal pre-training) for Aria (Aria-Base-8K and Aria-Base-64K) for research purposes and continue training.

Key features

  • SoTA Multimodal Native Performance: Aria achieves strong performance on a wide range of multimodal, language, and coding tasks. It is superior in video and document understanding.
  • Lightweight and Fast: Aria is a mixture-of-experts model with 3.9B activated parameters per token. It efficiently encodes visual input of variable sizes and aspect ratios.
  • Long Multimodal Context Window: Aria supports multimodal input of up to 64K tokens. It can caption a 256-frame video in 10 seconds.
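
As a rough illustration of the figures in the list above, the following sketch works out what share of the model is active per token and the captioning throughput implied by the 256-frame example. The total parameter count (25.3B) is taken from the Inference section of this card; the variable names are illustrative only, and the throughput is simple division, not a measured benchmark.

```python
# Figures quoted in this model card.
TOTAL_PARAMS_B = 25.3    # total parameters, in billions
ACTIVE_PARAMS_B = 3.9    # activated parameters per token, in billions
VIDEO_FRAMES = 256       # frames in the captioning example
CAPTION_SECONDS = 10     # quoted captioning time, in seconds

# Share of the weights that participate in each token's forward pass.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

# Frames handled per second if the quoted captioning time holds.
frames_per_second = VIDEO_FRAMES / CAPTION_SECONDS

print(f"Active share per token: {active_fraction:.1%}")
print(f"Implied throughput: {frames_per_second:.1f} frames/s")
```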

Benchmark

Category | Benchmark | Aria | Pixtral 12B | Llama3.2 11B | GPT-4o mini | Gemini-1.5 Flash
Knowledge (Multimodal) | MMMU | 54.9 | 52.5 | 50.7 | 59.4 | 56.1
Math (Multimodal) | MathVista | 66.1 | 58.0 | 51.5 | - | 58.4
Document | DocQA | 92.6 | 90.7 | 84.4 | - | 89.9
Chart | ChartQA | 86.4 | 81.8 | 83.4 | - | 85.4
Scene Text | TextVQA | 81.1 | - | - | - | 78.7
General Visual QA | MMBench-1.1 | 80.3 | - | - | 76.0 | -
Video Understanding | LongVideoBench | 65.3 | 47.4 | 45.7 | 58.8 | 62.4
Knowledge (Language) | MMLU (5-shot) | 73.3 | 69.2 | 69.4 | - | 78.9
Math (Language) | MATH | 50.8 | 48.1 | 51.9 | 70.2 | -
Reasoning (Language) | ARC Challenge | 91.0 | - | 83.4 | 96.4 | -
Coding | HumanEval | 73.2 | 72.0 | 72.6 | 87.2 | 74.3

Quick Start

Installation

pip install "transformers>=4.48.0" accelerate sentencepiece torchvision requests torch Pillow
pip install flash-attn --no-build-isolation

# For better inference performance, you can install grouped-gemm, which may take 3-5 minutes to install
pip install grouped_gemm==0.1.6
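
After installing, it can be useful to confirm that the environment meets the transformers>=4.48.0 requirement stated above. The following is a minimal sketch using only the standard library; the helper names (version_tuple, meets_minimum, check_installed) are hypothetical and not part of transformers.

```python
from importlib import metadata

MIN_TRANSFORMERS = (4, 48)  # minimum (major, minor) from the install command


def version_tuple(version: str) -> tuple:
    """Turn '4.48.0' or '4.49.0.dev0' into a tuple of leading integers."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(version: str, minimum: tuple = MIN_TRANSFORMERS) -> bool:
    """Compare only as many components as the minimum specifies."""
    return version_tuple(version)[: len(minimum)] >= minimum


def check_installed(package: str = "transformers") -> str:
    """Report whether the installed package satisfies the minimum version."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package} is not installed"
    status = "OK" if meets_minimum(installed) else "too old"
    return f"{package} {installed}: {status}"


print(check_installed())
```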

Inference

Aria has 25.3B total parameters and can be loaded on a single A100 (80GB) GPU in bfloat16 precision.
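
To see why a single 80GB GPU is enough, here is a back-of-the-envelope sketch of the memory needed for the weights alone. It assumes 2 bytes per parameter for bfloat16 and 1 GB = 1e9 bytes, and it ignores activations, the KV cache, and framework overhead, so real usage will be higher. The function name weight_memory_gb is hypothetical.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gb(params_billions: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    # billions of params * bytes per param = gigabytes
    return params_billions * BYTES_PER_PARAM[dtype]


weights_gb = weight_memory_gb(25.3, "bfloat16")
headroom_gb = 80 - weights_gb  # what is left on an 80GB card

print(f"Weights in bfloat16: ~{weights_gb:.1f} GB")
print(f"Headroom on an 80GB GPU: ~{headroom_gb:.1f} GB")
```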

Here is a code snippet to show you how to use Aria.

import requests
import torch
from PIL import Image

from transformers import AriaProcessor, AriaForConditionalGeneration


model_id_or_path = "rhymes-ai/Aria"
model = AriaForConditionalGeneration.from_pretrained(
    model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16
)

processor = AriaProcessor.from_pretrained(model_id_or_path)

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"text": "what is the image?", "type": "text"},
        ],
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].to(torch.bfloat16)
inputs = inputs.to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=15,
    stop_strings=["<|im_end|>"],
    tokenizer=processor.tokenizer,
    do_sample=True,
    temperature=0.9,
)
output_ids = output[0][inputs["input_ids"].shape[1]:]
response = processor.decode(output_ids, skip_special_tokens=True)
print(response)
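
The slice output[0][inputs["input_ids"].shape[1]:] in the snippet above keeps only the newly generated tokens, because the generated sequence starts with the prompt tokens. The following toy sketch shows the same idea with plain Python lists; the integer IDs are made up for illustration and are not real Aria token IDs, and strip_prompt is a hypothetical helper.

```python
def strip_prompt(sequence, prompt_length):
    """Drop the first prompt_length items, keeping only what was generated."""
    return sequence[prompt_length:]


# Toy stand-ins: the prompt and the full sequence returned by generation.
prompt_ids = [101, 7, 8, 9]
generated = prompt_ids + [42, 43, 2]  # prompt followed by new tokens

new_ids = strip_prompt(generated, len(prompt_ids))
print(new_ids)
```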

From transformers>=4.48, you can also pass an image URL or local path in the conversation history and let the chat template handle the rest. The chat template loads the image for you and returns the inputs as torch.Tensor objects, which you can pass directly to model.generate().

Here is how to rewrite the above example:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "what is the image?"},
        ],
    },
]

inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")
inputs = inputs.to(model.device, torch.bfloat16)

output = model.generate(
    **inputs,
    max_new_tokens=15,
    stop_strings=["<|im_end|>"],
    tokenizer=processor.tokenizer,
    do_sample=True,
    temperature=0.9,
)
output_ids = output[0][inputs["input_ids"].shape[1]:]
response = processor.decode(output_ids, skip_special_tokens=True)
print(response)
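
When the same prompt is applied to many images, it can help to build the messages structure with a small function. The sketch below is a hypothetical helper, not part of transformers. It uses the "url" key shown above for remote images and assumes that local files are passed under a "path" key; check the transformers chat-template documentation for your version before relying on that key.

```python
def build_image_message(image_ref: str, prompt: str) -> list:
    """Build a single-turn conversation with one image and one text prompt."""
    is_remote = image_ref.startswith(("http://", "https://"))
    # Assumption: local files use the "path" key, remote images use "url".
    image_key = "url" if is_remote else "path"
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", image_key: image_ref},
                {"type": "text", "text": prompt},
            ],
        }
    ]


messages = build_image_message(
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    "what is the image?",
)
print(messages[0]["content"][0])
```

The resulting list has the same shape as the messages variable in the example above, so it can be passed to processor.apply_chat_template in the same way.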

Advanced Inference and Fine-tuning

We provide a codebase for more advanced usage of Aria, including vLLM inference, cookbooks, and fine-tuning on custom datasets.

Citation

If you find our work helpful, please consider citing.

@article{aria,
  title={Aria: An Open Multimodal Native Mixture-of-Experts Model}, 
  author={Dongxu Li and Yudong Liu and Haoning Wu and Yue Wang and Zhiqi Shen and Bowen Qu and Xinyao Niu and Guoyin Wang and Bei Chen and Junnan Li},
  year={2024},
  journal={arXiv preprint arXiv:2410.05993},
}

📝 Limitations & Considerations

  • Benchmark scores may vary based on evaluation methodology and hardware configuration.
  • VRAM requirements are estimates; actual usage depends on quantization and batch size.
  • FNI scores are relative rankings and may change as new models are added.
  • License: apache-2.0 according to the Hugging Face tags; verify licensing terms before commercial use.
  • Source: Hugging Face (rhymes-ai/aria)
📜

Cite this model

Academic & Research Attribution

BibTeX
@misc{hf_model__rhymes_ai__aria,
  author = {rhymes-ai},
  title = {Aria Model},
  year = {2026},
  howpublished = {\url{https://huggingface.co/rhymes-ai/aria}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}
APA Style
rhymes-ai. (2026). Aria [Model]. Free2AITools. https://huggingface.co/rhymes-ai/aria

🛡️ Model Transparency Report

Verified data manifest for traceability and transparency.

100% Data Disclosure Active

🆔 Identity & Source

id
hf-model--rhymes-ai--aria
author
rhymes-ai
tags
transformers, safetensors, aria, any-to-any, multimodal, image-text-to-text, conversational, en, arxiv:2410.05993, base_model:rhymes-ai/aria-base-64k, base_model:finetune:rhymes-ai/aria-base-64k, license:apache-2.0, endpoints_compatible, region:us

⚙️ Technical Specs

architecture
AriaForConditionalGeneration
params billions
25.31
context length
4,096
vram gb
20.3
vram is estimated
true
vram formula
VRAM ≈ (params * 0.75) + 0.8GB (KV) + 0.5GB (OS)
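
The formula above can be checked against the listed 20.3 GB figure with a short sketch. The function and parameter names are illustrative; the constants (0.75 GB per billion parameters, 0.8 GB for the KV cache, 0.5 GB for OS overhead) are taken directly from the formula, which is this page's own estimate and not a measurement.

```python
def estimate_vram_gb(
    params_billions: float,
    gb_per_billion: float = 0.75,  # weight cost per billion parameters
    kv_gb: float = 0.8,            # fixed KV cache allowance
    os_gb: float = 0.5,            # fixed OS overhead allowance
) -> float:
    """Apply VRAM ~= (params * 0.75) + 0.8GB (KV) + 0.5GB (OS)."""
    return params_billions * gb_per_billion + kv_gb + os_gb


estimate = estimate_vram_gb(25.31)
print(f"Estimated VRAM: {estimate:.1f} GB")
```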

📊 Engagement & Metrics

likes
637
downloads
42,369

Free2AITools Constitutional Data Pipeline: Curated disclosure mode active. (V15.x Standard)