Dataset

Llava Onevision 1.5 Instruct Data

Name: Llava Onevision 1.5 Instruct Data
Creator: Mvp Lab
License: Apache-2.0

Nexus Index

36.8 Top 100%

S: Semantic 50

A: Authority 0

P: Popularity 62

R: Recency 46

Q: Quality 30

Format: Parquet

Dataset Information Summary
Entity Passport
Registry ID	hf-dataset--mvp-lab--llava-onevision-1.5-instruct-data
License	Apache-2.0
Provider	huggingface

📜

Cite this dataset

Academic & Research Attribution

BibTeX

@misc{hf_dataset__mvp_lab__llava_onevision_1.5_instruct_data,
  author = {Mvp Lab},
  title = {Llava Onevision 1.5 Instruct Data Dataset},
  year = {2026},
  howpublished = {\url{https://huggingface.co/datasets/mvp-lab/llava-onevision-1.5-instruct-data}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}

APA Style

Mvp Lab. (2026). Llava Onevision 1.5 Instruct Data [Dataset]. Free2AITools. https://huggingface.co/datasets/mvp-lab/llava-onevision-1.5-instruct-data

⬇️ Downloads: 220,698

Dataset Specification

LLaVA-OneVision-1.5 Instruction Data

Paper | Code

📌 Introduction

LLaVA-OneVision-1.5-Instruct is a curated instruction dataset of 22M samples, collected and integrated during the development of LLaVA-OneVision-1.5. LLaVA-OneVision-1.5 is a family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance at significantly reduced computational and financial cost. The dataset is part of a comprehensive, fully open framework for building high-quality vision-language models from scratch.

The dataset has significantly improved the performance of Vision-Language Models (VLMs) on structured information processing and knowledge-based question answering tasks. As part of the LLaVA-OneVision-1.5 open-source initiative, we are releasing it to the community in the hope of advancing VLM research.

⚙️ Usage Notes

Although the dataset itself is of high quality, we recommend deduplicating and combining it with the FineVision dataset to achieve better training results.
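
One way to read the recommendation above is as a merge with exact-match deduplication. The sketch below is only an illustration under stated assumptions: the `conversation_key` helper, the `from`/`value` turn layout, and the toy rows are hypothetical and not taken from either dataset, and the authors' actual deduplication procedure is not described here. It hashes the serialized conversation text and keeps the first row seen for each hash.

```python
import hashlib
import json
from typing import Callable, Iterable, Iterator


def conversation_key(row: dict) -> str:
    """Hypothetical dedup key: SHA-256 of the serialized conversation turns."""
    payload = json.dumps(row["conversations"], sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def merge_deduplicated(
    *sources: Iterable[dict],
    key_fn: Callable[[dict], str] = conversation_key,
) -> Iterator[dict]:
    """Yield rows from all sources in order, skipping rows whose key was already seen."""
    seen: set[str] = set()
    for source in sources:
        for row in source:
            key = key_fn(row)
            if key in seen:
                continue
            seen.add(key)
            yield row


# Toy rows standing in for the two datasets (assumed layout, not real samples).
onevision_rows = [
    {"id": "ov-1", "conversations": [{"from": "human", "value": "What is shown?"}]},
    {"id": "ov-2", "conversations": [{"from": "human", "value": "Read the chart."}]},
]
finevision_rows = [
    {"id": "fv-1", "conversations": [{"from": "human", "value": "Read the chart."}]},
    {"id": "fv-2", "conversations": [{"from": "human", "value": "Count the cars."}]},
]

merged = list(merge_deduplicated(onevision_rows, finevision_rows))
print([row["id"] for row in merged])  # ['ov-1', 'ov-2', 'fv-2']
```

This text-only key ignores the image, so two different images with identical prompts and answers would collapse into one row. A real pipeline would likely need an image-aware key or near-duplicate detection.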

🚀 Sample Usage

Below is a quick start guide demonstrating how to use the LLaVA-OneVision-1.5 models with Hugging Face transformers for inference. This snippet is directly from the project's GitHub repository.

python

from transformers import AutoTokenizer, AutoProcessor, AutoModelForCausalLM
from qwen_vl_utils import process_vision_info
model_path = "lmms-lab/LLaVA-One-Vision-1.5-8B-Instruct"

# default: Load the model on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

# default processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)  # follow the device chosen by device_map="auto"

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
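
The trimming step near the end of the snippet is easy to misread: `generate` returns the prompt tokens followed by the new tokens, so each output sequence is sliced at the length of its input. A minimal sketch of that slicing, with plain Python lists standing in for the tensors and made-up token ids:

```python
# Plain lists stand in for token-id tensors (toy values, not real token ids).
input_ids = [[101, 7, 8], [101, 9, 10]]                    # two prompts, 3 tokens each
generated = [[101, 7, 8, 42, 43], [101, 9, 10, 44, 45]]    # prompt + newly generated tokens

# Drop the echoed prompt from each sequence, keeping only the new tokens.
trimmed = [out[len(inp):] for inp, out in zip(input_ids, generated)]
print(trimmed)  # [[42, 43], [44, 45]]
```

Without this step, decoding would repeat the full prompt text before the model's answer.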

📊 Data Analysis

Distribution of Data Categories

[Figure: sft_dataset_pie_chart]

Comparison and Scaling with FineVision

Performance of three datasets (Merge46M, FineVision, and LLaVA-OneVision-1.5-Inst-Data) across 16 benchmarks during the SFT phase. Merge46M outperforms the other two on most benchmarks.

[Figure: ablation_instruct]

🙏 Acknowledgement

We would like to acknowledge the contributions of FineVision, whose open dataset served as an important foundation and benchmark for building this SFT dataset.

📜 Cite

If you find LLaVA-OneVision-1.5 useful in your research, please consider citing the following related papers:

bibtex

@inproceedings{LLaVA-OneVision-1.5,
  title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training},
  author={An, Xiang and Xie, Yin and Yang, Kaicheng and Zhang, Wenkang and Zhao, Xiuwei and Cheng, Zheng and Wang, Yirui and Xu, Songcen and Chen, Changrui and Wu, Chunsheng and Tan, Huajie and Li, Chunyuan and Yang, Jing and Yu, Jie and Wang, Xiyao and Qin, Bin and Wang, Yumeng and Yan, Zizhen and Feng, Ziyong and Liu, Ziwei and Li, Bo and Deng, Jiankang},
  booktitle={arxiv},
  year={2025},
  url={https://arxiv.org/abs/2509.23661}
}

@inproceedings{xie2025region,
  title={Region-based Cluster Discrimination for Visual Representation Learning},
  author={Xie, Yin and Yang, Kaicheng and An, Xiang and Wu, Kun and Zhao, Yongle and Deng, Weimo and Ran, Zimin and Wang, Yumeng and Feng, Ziyong and Miles, Roy and Elezi, Ismail and Deng, Jiankang},
  booktitle={ICCV},
  year={2025}
}

@article{lillava,
  title={LLaVA-OneVision: Easy Visual Task Transfer},
  author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Zhang, Peiyuan and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
  journal={Transactions on Machine Learning Research},
  year={2024}
}

📊 Structured Schema (Zero-Fabrication)

Feature Key	Data Type
`id`	`string`
`image`	`Image`
`conversations`	`unknown`
`data_source`	`string`

Estimated Rows: 513,923
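
The schema table above can be turned into a lightweight row check before training. This is a minimal sketch, not part of the dataset's tooling: the field names come from the table, while the `check_row` helper, the type checks, and the toy rows are assumptions for illustration. Because the table lists `conversations` as `unknown`, the sketch only checks that the field is present.

```python
from typing import Any

# Field names taken from the schema table. The type checks are assumptions:
# only `id` and `data_source` are listed as strings.
EXPECTED_FIELDS = ("id", "image", "conversations", "data_source")
STRING_FIELDS = ("id", "data_source")


def check_row(row: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means the row passed."""
    problems = [f"missing field: {name}" for name in EXPECTED_FIELDS if name not in row]
    for name in STRING_FIELDS:
        if name in row and not isinstance(row[name], str):
            problems.append(f"{name} should be a string, got {type(row[name]).__name__}")
    return problems


# Toy rows (hypothetical values, not real samples from the dataset).
good_row = {"id": "sample-0", "image": b"", "conversations": [], "data_source": "toy"}
bad_row = {"id": 7, "image": b""}

print(check_row(good_row))  # []
print(check_row(bad_row))
```

Returning a list of problems instead of raising makes it easy to count or log bad rows across a large shard without stopping the scan.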

🛡️ Dataset Transparency Report

Technical metadata sourced from upstream repositories.

Open Metadata

🆔 Identity & Source

id: hf-dataset--mvp-lab--llava-onevision-1.5-instruct-data
slug: mvp-lab--llava-onevision-1.5-instruct-data
source: huggingface
author: Mvp Lab
license: Apache-2.0
tags: task_categories:image-text-to-text, language:en, license:apache-2.0, size_categories:10m<n<100m, modality:image, modality:text, arxiv:2509.23661, region:us, multimodal, vision-language-model, lmm, instruction-tuning, pretraining, dataset-collection, vqa, image-captioning, large-language-model

📊 Engagement & Metrics

downloads: 220,698
stars: 71
forks: 0

Data indexed from public sources. Updated daily.
