Aria Model Card
[Dec 1, 2024] We have released the base models (with native multimodal pre-training) for Aria (Aria-Base-8K and Aria-Base-64K) for research purposes and continued training.
Key features
- SoTA Multimodal Native Performance: Aria achieves strong performance on a wide range of multimodal, language, and coding tasks. It is particularly strong at video and document understanding.
- Lightweight and Fast: Aria is a mixture-of-experts model with 3.9B activated parameters per token. It efficiently encodes visual input of variable sizes and aspect ratios.
- Long Multimodal Context Window: Aria supports multimodal input of up to 64K tokens. It can caption a 256-frame video in 10 seconds.
🔗 Try Aria! · 📖 Blog · 📌 Paper · ⭐ GitHub · 🟣 Discord
Benchmark
| Category | Benchmark | Aria | Pixtral 12B | Llama3.2 11B | GPT-4o mini | Gemini-1.5 Flash |
|---|---|---|---|---|---|---|
| Knowledge (Multimodal) | MMMU | 54.9 | 52.5 | 50.7 | 59.4 | 56.1 |
| Math (Multimodal) | MathVista | 66.1 | 58.0 | 51.5 | - | 58.4 |
| Document | DocQA | 92.6 | 90.7 | 84.4 | - | 89.9 |
| Chart | ChartQA | 86.4 | 81.8 | 83.4 | - | 85.4 |
| Scene Text | TextVQA | 81.1 | - | - | - | 78.7 |
| General Visual QA | MMBench-1.1 | 80.3 | - | - | 76.0 | - |
| Video Understanding | LongVideoBench | 65.3 | 47.4 | 45.7 | 58.8 | 62.4 |
| Knowledge (Language) | MMLU (5-shot) | 73.3 | 69.2 | 69.4 | - | 78.9 |
| Math (Language) | MATH | 50.8 | 48.1 | 51.9 | 70.2 | - |
| Reasoning (Language) | ARC Challenge | 91.0 | - | 83.4 | 96.4 | - |
| Coding | HumanEval | 73.2 | 72.0 | 72.6 | 87.2 | 74.3 |
Quick Start
Installation
pip install "transformers>=4.48.0" accelerate sentencepiece torchvision requests torch Pillow
pip install flash-attn --no-build-isolation
# Optional: install grouped-gemm for better inference performance (building it may take 3-5 minutes)
pip install grouped_gemm==0.1.6
Inference
Aria has 25.3B total parameters and can be loaded on a single A100 (80GB) GPU with bfloat16 precision.
Here is a code snippet to show you how to use Aria.
import requests
import torch
from PIL import Image
from transformers import AriaProcessor, AriaForConditionalGeneration

model_id_or_path = "rhymes-ai/Aria"

model = AriaForConditionalGeneration.from_pretrained(
    model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16
)
processor = AriaProcessor.from_pretrained(model_id_or_path)

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "what is the image?"},
        ],
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
inputs = inputs.to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=15,
    stop_strings=["<|im_end|>"],
    tokenizer=processor.tokenizer,
    do_sample=True,
    temperature=0.9,
)
output_ids = output[0][inputs["input_ids"].shape[1]:]
response = processor.decode(output_ids, skip_special_tokens=True)
print(response)
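If a single 80GB GPU is not available, on-the-fly 4-bit quantization via bitsandbytes may shrink the weight footprint enough for smaller cards. The following is a minimal sketch, not an officially supported configuration: it assumes bitsandbytes is installed (pip install bitsandbytes) and that the checkpoint quantizes cleanly, and quantization can reduce output quality.

import torch
from transformers import AriaForConditionalGeneration, AriaProcessor, BitsAndBytesConfig

model_id_or_path = "rhymes-ai/Aria"

# Assumption: NF4 4-bit weights with bfloat16 compute; not validated by the Aria authors.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AriaForConditionalGeneration.from_pretrained(
    model_id_or_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
)
processor = AriaProcessor.from_pretrained(model_id_or_path)
# The rest of the pipeline (processor call and model.generate) is unchanged.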
Starting from transformers v4.48, you can also pass an image URL or local path directly in the conversation history and let the chat template handle the rest. The chat template will load the image for you and return the inputs as torch.Tensor, which you can pass directly to model.generate(). Here is how to rewrite the above example:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "what is the image?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
)
inputs = inputs.to(model.device, torch.bfloat16)

output = model.generate(
    **inputs,
    max_new_tokens=15,
    stop_strings=["<|im_end|>"],
    tokenizer=processor.tokenizer,
    do_sample=True,
    temperature=0.9,
)
output_ids = output[0][inputs["input_ids"].shape[1]:]
response = processor.decode(output_ids, skip_special_tokens=True)
print(response)
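The long multimodal context also allows many images in one prompt, for example sampled video frames (the model card above cites captioning a 256-frame video). The snippet below is a rough sketch rather than an official recipe: it reuses the model and processor loaded earlier, the frame_*.jpg file names are placeholders for frames you have already extracted, and it assumes one image is consumed per {"type": "image"} entry in the conversation.

import torch
from PIL import Image

# Placeholder frame files; in practice, sample frames from a video with a tool of your choice.
frames = [Image.open(f"frame_{i:03d}.jpg") for i in range(8)]

messages = [
    {
        "role": "user",
        "content": [{"type": "image"} for _ in frames]
        + [{"type": "text", "text": "Describe what happens in this video."}],
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=frames, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
inputs = inputs.to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=128,
    stop_strings=["<|im_end|>"],
    tokenizer=processor.tokenizer,
)
response = processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)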
Advanced Inference and Fine-tuning
We provide a codebase for more advanced usage of Aria, including vLLM inference, cookbooks, and fine-tuning on custom datasets.
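For higher-throughput serving, vLLM lists Aria among its supported multimodal models. The sketch below uses vLLM's generic multimodal API rather than the official Aria cookbook, so treat the engine arguments (for example dtype and limit_mm_per_prompt) and the example.jpg path as illustrative assumptions to verify against the linked codebase and your vLLM version.

from PIL import Image
from transformers import AriaProcessor
from vllm import LLM, SamplingParams

model_id = "rhymes-ai/Aria"

# Render the prompt with the model's own chat template so the image placeholder tokens are correct.
processor = AriaProcessor.from_pretrained(model_id)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "what is the image?"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Engine arguments here are illustrative defaults, not an official serving recipe.
llm = LLM(
    model=model_id,
    dtype="bfloat16",
    limit_mm_per_prompt={"image": 1},
)

image = Image.open("example.jpg")  # hypothetical local image
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=128, temperature=0.7, stop=["<|im_end|>"]),
)
print(outputs[0].outputs[0].text)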
Citation
If you find our work helpful, please consider citing.
@article{aria,
  title={Aria: An Open Multimodal Native Mixture-of-Experts Model},
  author={Dongxu Li and Yudong Liu and Haoning Wu and Yue Wang and Zhiqi Shen and Bowen Qu and Xinyao Niu and Guoyin Wang and Bei Chen and Junnan Li},
  year={2024},
  journal={arXiv preprint arXiv:2410.05993},
}