S1-MMAlign

by ScienceOne-AI (hf-dataset--scienceone-ai--s1-mmalign)

Registry ID hf-dataset--scienceone-ai--s1-mmalign
Provider huggingface
Cite this dataset

Academic & Research Attribution

BibTeX
@misc{hf_dataset__scienceone_ai__s1_mmalign,
  author = {ScienceOne-AI},
  title = {S1-MMAlign Dataset},
  year = {2026},
  howpublished = {\url{https://huggingface.co/datasets/ScienceOne-AI/S1-MMAlign}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}
APA Style
ScienceOne-AI. (2026). S1-MMAlign [Dataset]. Free2AITools. https://huggingface.co/datasets/ScienceOne-AI/S1-MMAlign

⬇️
Downloads
16,344
❀️
Likes
98

Dataset Specification


license: cc-by-nc-4.0
task_categories:
  • image-to-text
  • visual-question-answering
  • feature-extraction
language:
  • en
tags:
  • science
  • multimodal
  • physics
  • biology
  • chemistry
  • engineering
  • large-scale
size_categories:
  • 10M<n<100M
S1-MMAlign

A Large-Scale Multi-Disciplinary Scientific Multimodal Dataset

S1-MMAlign is a large-scale, multi-disciplinary multimodal dataset comprising over 15.5 million high-quality image-text pairs derived from 2.5 million open-access scientific papers.

Multimodal learning has revolutionized general-domain tasks, yet its application to scientific discovery is hindered by the profound semantic gap between complex scientific imagery and sparse textual descriptions. S1-MMAlign aims to bridge this gap. Unlike simple "image-reading," scientific understanding requires traversing multiple semantic layers involving variables, structures, hypotheses, and inferences. This dataset is built to address that weak point (the "short board") in current data resources.

The dataset captures diverse visual modalities, including experimental setups, heatmaps, and microscopic imagery, and spans major disciplines such as Mathematics, Physics, Chemistry, Biology, Astronomy, Earth Science, Medicine, Engineering, and Computer Science.

We anticipate that researchers and enthusiasts will utilize this dataset for training foundational AI for Science models, advancing scientific reasoning, and improving cross-modal understanding in specialized domains.

Dataset Information

Total Image-Text Pairs: > 15,500,000

Source Papers: ~ 2,500,000

Disciplines Covered: 9 Major STEM Fields

Alignment Improvement: +18.21% (CLIP Score vs. Raw Data)

License: CC BY-NC 4.0

How was the data processed?

To address the pervasive issue of weak alignment in raw scientific captions, we introduced an AI-ready semantic enhancement pipeline. We used the Qwen-VL series of large multimodal models to recaption each image, drawing on the paper's abstract and the contexts in which the figure is cited.

Technical validation demonstrates significant quality improvements: SciBERT-based pseudo-perplexity metrics show reduced semantic ambiguity, while CLIP scores indicate an 18.21% improvement in image-text alignment.
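As a rough illustration of the alignment metric, CLIP-style scores are commonly computed as the cosine similarity between an image embedding and a text embedding. The sketch below uses toy 3-dimensional vectors in place of real CLIP embeddings and assumes the reported figure is a relative gain over the raw-caption score; the source does not state the exact formula, so `relative_improvement` and all values here are illustrative only.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def relative_improvement(baseline, improved):
    """Percentage gain of `improved` over `baseline` (assumed definition)."""
    return 100.0 * (improved - baseline) / baseline

# Toy embeddings (hypothetical); real ones would come from a CLIP encoder.
image_emb = [1.0, 0.0, 0.0]
raw_caption_emb = [0.6, 0.8, 0.0]
recaption_emb = [0.8, 0.6, 0.0]

raw_score = cosine_similarity(image_emb, raw_caption_emb)  # about 0.6
new_score = cosine_similarity(image_emb, recaption_emb)    # about 0.8
gain = relative_improvement(raw_score, new_score)          # about 33.3 percent
```

In practice the two scores would be averaged over many image-text pairs before the gain is computed.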

Recommendation: Please use the recaption field for model training.

  • image_path: The relative path to the image file.
  • recaption (Recommended): The AI-enhanced caption generated by our pipeline (Qwen-VL). It synthesizes context from the paper abstract and citations to provide a semantically rich description, significantly outperforming the raw caption in alignment and quality.
  • caption: The original, raw caption extracted from the paper figures (often noisy or sparse).
  • metadata: Additional information including source paper arxiv_id and title.
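A minimal sketch of reading these fields from a JSONL line and preferring `recaption` for training, assuming each line is one JSON object and that `metadata` is a nested object. The sample record, its values, and the helper names `training_caption` and `iter_pairs` are hypothetical; real records may carry additional keys.

```python
import json

# Hypothetical sample line; values are placeholders, not real dataset content.
sample_line = json.dumps({
    "image_path": "images/0001/fig1.png",
    "recaption": "A heatmap of ...",
    "caption": "Fig. 1.",
    "metadata": {"arxiv_id": "0000.00000", "title": "Example paper"},
})

def training_caption(record):
    """Prefer the recommended recaption; fall back to the raw caption."""
    text = (record.get("recaption") or "").strip()
    return text if text else (record.get("caption") or "").strip()

def iter_pairs(lines):
    """Yield (image_path, caption) tuples from JSONL lines, skipping blanks."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield record["image_path"], training_caption(record)

pairs = list(iter_pairs([sample_line, ""]))
```

The fallback to `caption` is a design choice of this sketch, not something the dataset card prescribes; dropping records without a recaption is an equally reasonable option.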

Note on File Structure

The relative image paths in the jsonl file resolve correctly only if the file structure we provide is preserved. After downloading and decompressing the dataset, keep the directory hierarchy intact. Do not flatten the folder structure, as the metadata relies on specific relative paths.
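One way to check that the hierarchy survived extraction is to join each record's `image_path` onto the extraction root and look for missing files. The sketch below builds a throwaway directory to demonstrate the idea; the folder names and the helpers `resolve_image` and `missing_images` are hypothetical and do not reflect the dataset's actual layout.

```python
import tempfile
from pathlib import Path

def resolve_image(dataset_root, image_path):
    """Join a record's relative image_path onto the extraction root."""
    return Path(dataset_root) / image_path

def missing_images(dataset_root, image_paths):
    """Return the relative paths that do not point at an existing file."""
    return [p for p in image_paths
            if not resolve_image(dataset_root, p).is_file()]

# Demo with a temporary directory standing in for the extracted dataset.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "images" / "physics").mkdir(parents=True)
    (root / "images" / "physics" / "fig1.png").write_bytes(b"")
    # The second path mimics a lookup against a flattened layout.
    missing = missing_images(root, ["images/physics/fig1.png", "fig1.png"])
```

A non-empty `missing` list after extraction usually means the archive was unpacked into the wrong directory or the folders were flattened.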


Citation

If you find this dataset useful, please cite our work:

@article{s1mmalign2026,
  title={S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure–Text Understanding},
  author={He Wang and Longteng Guo and Pengkang Huo and Xuanxu Lin and Yichen Yuan and Jie Jiang and Jing Liu},
  journal={ArXiv preprint},
  url={https://arxiv.org/abs/2601.00264}, 
  year={2026}
}

License and Copyright

This dataset is released under the CC BY-NC 4.0 license for research and non-commercial use only.

  • Non-Commercial: Commercial use of the dataset or any images is strictly prohibited.
  • Copyrights: The images contained in this dataset are extracted from publicly accessible scientific publications. All copyrights of the original figures remain with their original authors or publishers.
  • Compliance: Users must ensure their use complies with the copyrights of the original publications.
Hub tags: task_categories:image-to-text, task_categories:visual-question-answering, task_categories:feature-extraction, language:en, license:cc-by-nc-4.0, size_categories:10M<n<100M, format:webdataset, modality:image, modality:text, library:datasets, library:webdataset, library:mlcroissant, arxiv:2601.00264, region:us, science, multimodal, physics, biology, chemistry, engineering, large-scale
