Honey-Data-15M
Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs
[Homepage] [ArXiv Paper] [Models & Datasets] [Code]
Introduction
We introduce Bee-8B, a new state-of-the-art, fully open 8B Multimodal Large Language Model (MLLM) designed to close the performance gap with proprietary models by focusing on data quality.
Bee-8B is trained on our new Honey-Data-15M corpus, a high-quality supervised fine-tuning (SFT) dataset of approximately 15 million samples. This dataset was meticulously created with our transparent, adaptable, and open-source data curation pipeline, HoneyPipe, which systematically cleans noisy data and enriches it with a novel dual-level (short and long) Chain-of-Thought (CoT) strategy.
This dataset enables Bee-8B to achieve exceptional performance, particularly in complex reasoning, establishing a new standard for fully open MLLMs.
Key Features
- High-Quality, Large-Scale Dataset: We release Honey-Data-15M, a new 15M-sample SFT corpus. It has undergone extensive cleaning to remove widespread noise and has been enriched with dual-level CoT reasoning to enhance advanced problem-solving capabilities.
- Fully Open-Source Data Curation Suite: We provide not just the data, but the entire methodology. HoneyPipe and its underlying framework DataStudio offer the community a transparent and reproducible pipeline, moving beyond static dataset releases.
- State-of-the-Art Open Model: Our model, Bee-8B, achieves state-of-the-art performance among fully open MLLMs and is highly competitive with recent semi-open models like InternVL3.5-8B, demonstrating the power of high-quality data.
Honey-Data-15M
[!NOTE] The dataset's responses adhere to two specific tag structures: short CoT responses are formatted as `<think>\n\n</think>\n\n{short CoT response}`, while long CoT responses follow the format `<think>\n{long CoT reasoning}\n</think>\n\n`. More details about the dataset can be found in the Paper.
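The two tag structures above can be told apart by whether the `<think>` block is empty. The helper below is an illustrative sketch (the function name and regex are not part of the dataset's tooling), assuming responses follow the formats quoted in the note:

```python
import re

def split_cot(response: str):
    """Split a response into (reasoning, answer).

    Short CoT samples carry an empty <think> block; long CoT samples
    carry the reasoning inside it. Illustrative helper only.
    """
    m = re.match(r"<think>\n?(.*?)\n?</think>\n*(.*)", response, re.DOTALL)
    if m is None:
        return "", response  # no tags present
    return m.group(1).strip(), m.group(2).strip()

# Short CoT: empty reasoning block, answer after the closing tag
print(split_cot("<think>\n\n</think>\n\nThe answer is 4."))
# Long CoT: reasoning inside the block
print(split_cot("<think>\n2 + 2 = 4\n</think>\n\n"))
```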
[!NOTE] The complete dataset is 4.71 TB and has been fully uploaded. Due to a bug in the data viewer, the size and number of items displayed by Hugging Face are inaccurate.
Honey-Data-15M is a large-scale, high-quality supervised fine-tuning (SFT) dataset containing approximately 15 million meticulously curated samples. We built this dataset with the core objective of addressing the quality bottleneck in current open-source data by systematically cleaning widespread data noise and enriching the data with an innovative "Dual-Level Chain-of-Thought (CoT)" strategy.
The dataset's composition is as follows:
- Approximately 12.2 million short CoT samples: Designed to instill foundational, step-by-step logical inference in the model.
- Approximately 2.7 million long CoT samples: Focused on more intricate, multi-step reasoning problems that challenge and enhance the model's advanced cognitive abilities.
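Each sample carries an `img_phash` field (a perceptual hash of its image; see the Usage section below), which is the kind of signal a data-cleaning pass can use for near-duplicate removal. The sketch below is illustrative only, not HoneyPipe's actual deduplication logic:

```python
def dedup_by_phash(samples):
    """Drop samples whose image perceptual hash was already seen.

    Illustrative sketch: exact-match on `img_phash`; a real pipeline
    would typically compare hashes within a Hamming-distance threshold.
    """
    seen, kept = set(), []
    for s in samples:
        h = s.get("img_phash")
        if h is None or h not in seen:
            kept.append(s)
            if h is not None:
                seen.add(h)
    return kept

samples = [
    {"id": "a", "img_phash": "ff00"},
    {"id": "b", "img_phash": "ff00"},  # near-duplicate image, dropped
    {"id": "c", "img_phash": "0a1b"},
]
print([s["id"] for s in dedup_by_phash(samples)])  # → ['a', 'c']
```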
Usage
To load the dataset, you can refer to the following code:
```python
from PIL import Image
from datasets import load_dataset

# Load dataset (using the CoSyn_Math subset as an example)
item = load_dataset("Open-Bee/Honey-Data-15M",
                    split="train",
                    name="CoSyn_Math")[0]

# Extract data fields
item_id = item['id']
conversations = item['conversations']
images_data = item.get('images', [])
source = item.get('source', None)
img_phash = item.get('img_phash', None)
img_size = item.get('img_size', None)

# Save images and record their paths
image_paths = []
for img_idx, image_data in enumerate(images_data):
    image_filename = f"{item_id}_{img_idx}.jpg"
    image_path = image_filename

    # `datasets` automatically decodes images into PIL Image objects
    if isinstance(image_data, Image.Image):
        # JPEG requires RGB mode
        if image_data.mode in ('RGBA', 'LA', 'P'):
            image_data = image_data.convert('RGB')
        image_data.save(image_path, format='JPEG')
        image_paths.append(image_path)

# Build the sample
sample = {
    'id': item_id,
    'conversations': conversations,
    'image': image_paths[0] if len(image_paths) == 1 else image_paths,
    'source': source,
    'img_phash': img_phash,
    'img_size': img_size,
}

print(sample)
```
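When assembling SFT text from a loaded sample, the `conversations` field can be flattened into a transcript. The card does not document the turn schema, so the sketch below assumes the ShareGPT-style `{'from': ..., 'value': ...}` layout common to the source datasets; treat the key names as assumptions:

```python
def render_conversation(conversations):
    """Render a `conversations` list as plain text.

    Assumes ShareGPT-style turns ({'from': 'human'|'gpt', 'value': str});
    the dataset card does not spell this schema out, so verify against
    a real sample before relying on it.
    """
    lines = []
    for turn in conversations:
        role = {"human": "User", "gpt": "Assistant"}.get(turn["from"], turn["from"])
        lines.append(f"{role}: {turn['value']}")
    return "\n".join(lines)

example = [
    {"from": "human", "value": "<image>\nWhat is 2 + 2?"},
    {"from": "gpt", "value": "<think>\n\n</think>\n\n2 + 2 = 4."},
]
print(render_conversation(example))
```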
Licensing Information
The Honey-Data-15M dataset is a collection composed of multiple publicly available sub-datasets. Each of these sub-datasets is governed by its own original license.
- Sub-dataset Licenses: Users of Honey-Data-15M must strictly adhere to the specific licensing terms and conditions of each original sub-dataset included in this collection. We recommend you carefully review the original license for each sub-dataset before use.
- Prompts and Responses: To the extent that we hold any intellectual property rights in the modified prompts and newly generated responses created for this project, these contributions are made available under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.
- Copyright Concerns: This dataset is compiled for academic research purposes. If you believe any content within Honey-Data-15M infringes upon your copyright, please contact us immediately at yi.zhang.4096[at]gmail.com. We will promptly review and address the matter, including removal of the content in question upon verification.
Acknowledgements
[!NOTE] If you believe we have missed acknowledging any important data source that should be explicitly mentioned here, please contact us.
Honey-Data-15M is built upon a large collection of publicly available datasets. We extend our deepest gratitude to the creators and maintainers of the following major datasets.
- LLaVA-OneVision-Data: A comprehensive multimodal instruction tuning dataset
- MAmmoTH-VL-Instruct-12M: A large-scale vision-language instruction dataset for mathematical reasoning
- VisualWebInstruct: A dataset for web-based visual instruction following
- ArXiv-OCR-v0.2: OCR data from ArXiv papers for document understanding
- CoSyn-400K: Synthetic data for visual reasoning across multiple domains
- PixMo Collection: A collection of high-quality vision-language datasets
- And many other datasets including Cauldron, Cambrian, and numerous individual datasets across VQA, OCR, Charts, STEM, and other domains.
Citation
If you use our dataset in your research, please cite our paper:
@misc{zhang2025beehighqualitycorpusfullstack,
title={Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs},
author={Yi Zhang and Bolin Ni and Xin-Sheng Chen and Heng-Rui Zhang and Yongming Rao and Houwen Peng and Qinglin Lu and Han Hu and Meng-Hao Guo and Shi-Min Hu},
year={2025},
eprint={2510.13795},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.13795},
}
Structured Schema

| Feature Key | Data Type |
|---|---|
| images | unknown |
| conversations | unknown |
| id | string |
| img_phash | unknown |
| img_size | unknown |
| source | string |

Estimated rows: 27,515 (as reported by the Hugging Face data viewer, whose counts are noted above to be inaccurate)