---
license: mit
task_categories:
- fill-mask
tags:
- pretraining
- encoder
- multilingual
---
# mmBERT Mid-training Data

> **Phase 2 of 3**: High-quality mid-training data mixture (600B tokens) with context extension to 8192 tokens.

This dataset contains the mid-training phase data used to train all mmBERT encoder models. This phase focuses on higher-quality data sources and extends the context length from 1024 to 8192 tokens. The data is provided in **MDS format**, ready for use with Composer and the ModernBERT training repository.
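The card does not ship a loader snippet here, so the following is a minimal sketch, assuming the MDS shards are already on local disk (see Direct Access below) and using MosaicML's `streaming` package, the layer that Composer's data loading builds on; the local path is a placeholder:

```python
from torch.utils.data import DataLoader
from streaming import StreamingDataset  # pip install mosaicml-streaming

# Point `local` at a directory of downloaded MDS shards.
dataset = StreamingDataset(local="/path/to/mmBERT-midtraining-data", shuffle=False)

# Samples are dicts keyed by whatever columns the shards were written with;
# inspect the keys rather than assuming a field name.
print(dataset[0].keys())

# StreamingDataset plugs directly into a standard PyTorch DataLoader.
loader = DataLoader(dataset, batch_size=8)
```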
## 📊 Data Composition
| Data Source | Tokens (B) | Percentage | Description |
|---|---|---|---|
| FineWeb2 | 506.7 | 84.3% | High-quality multilingual web crawl data |
| DCLM (Dolmino) | 40.0 | 6.7% | Filtered high-quality English web data |
| Starcoder | 17.2 | 2.9% | Code repositories and files |
| arXiv | 5.4 | 0.9% | Academic preprints |
| Dolmino Math | 4.3 | 0.7% | Mathematical content |
| Books | 3.9 | 0.7% | Literature and reference books |
| PeS2o | 3.2 | 0.5% | Scientific papers |
| Tulu Flan | 3.1 | 0.5% | Instruction-following data |
| StackExchange | 3.0 | 0.5% | Q&A forums |
| StackExchange (Dolmino) | 2.8 | 0.5% | Curated Q&A content |
| Wikipedia (MegaWika) | 1.2 | 0.2% | Encyclopedia articles |
| **Total** | **600.8** | **100.0%** | High-quality data for context extension |
## 🌍 Language Coverage

This phase covers 110 languages plus code, sampled with inverse temperature sampling at τ = 0.5 (see the sketch after this list). It expands from the initial 60 languages to include:
- **Additional mid-resource languages**: Uzbek, Bosnian, Catalan, Albanian, and 46 others
- **Enhanced quality**: Uses filtered FineWeb2-HQ and higher-quality DCLM
- **Longer contexts**: Optimized for 8192-token sequences
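For intuition, inverse temperature sampling raises each language's natural data share to the power τ and renormalizes, so τ = 0.5 pulls probability mass toward lower-resource languages. A minimal sketch with made-up token counts (not the actual mmBERT proportions):

```python
import numpy as np

def temperature_probs(token_counts, tau=0.5):
    """Sampling probabilities q_i proportional to p_i**tau, where p_i is a
    language's natural share of the corpus. tau=1 keeps natural
    proportions; tau -> 0 approaches uniform sampling."""
    p = np.asarray(token_counts, dtype=np.float64)
    p /= p.sum()          # natural proportions
    q = p ** tau          # flatten with temperature
    return q / q.sum()

# Illustrative counts for a high-, mid-, and low-resource language:
print(temperature_probs([900, 90, 10]))
# [0.703... 0.222... 0.074...] -- the 90% language drops to ~70%
```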
## ⚙️ Key Features

- **Context Extension**: RoPE base frequency adjusted to 160k for 8192-token support
- **Quality Upgrade**: Switches to filtered, higher-quality versions of datasets
- **Reduced Masking**: Mask rate lowered to 15% (from 30% in pre-training); see the sketch below
- **Language Expansion**: Adds 50 new languages while maintaining data quality
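A minimal sketch of both knobs, assuming PyTorch and `transformers`; the tokenizer checkpoint below is a placeholder, not necessarily the one used in training:

```python
import torch
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# RoPE inverse frequencies: theta_i = base^(-2i/d). Raising the base
# from the common 10k to 160k slows each dimension's rotation, keeping
# positions distinguishable out to 8192 tokens.
def rope_inv_freq(head_dim: int, base: float) -> torch.Tensor:
    exponents = torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim
    return base ** -exponents

inv_freq_short = rope_inv_freq(64, base=10_000.0)
inv_freq_long = rope_inv_freq(64, base=160_000.0)

# Reduced masking: MLM mask rate of 15% for this phase (vs 30% earlier).
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base")  # placeholder
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```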
## 🚀 Usage

For mid-training, see the ModernBERT repo: https://github.com/AnswerDotAI/ModernBERT
### Direct Access

Use the script at this link to load any section of the dataset on the fly. Note that requesting too many samples this way will fail due to Hugging Face rate limiting. To download the full dataset, use HF Hub's `snapshot_download`, as shown below.
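A download sketch using `huggingface_hub` (the repo id comes from this dataset's URL on the Hub):

```python
from huggingface_hub import snapshot_download

# Fetch every MDS shard in the dataset repo; returns the local path.
path = snapshot_download(
    repo_id="jhu-clsp/mmBERT-midtraining-data",
    repo_type="dataset",
)
print(path)
```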
## 🔗 Related Resources

- Models: mmBERT Model Suite
- Phase 1: Pre-training Data (2.3T tokens)
- Phase 3: Decay Phase Data (100B tokens)
- Checkpoints: Training Checkpoints
- Paper: https://arxiv.org/abs/2509.06888
- Code: GitHub Repository
## Citation

```bibtex
@misc{marone2025mmbertmodernmultilingualencoder,
      title={mmBERT: A Modern Multilingual Encoder with Annealed Language Learning},
      author={Marc Marone and Orion Weller and William Fleshman and Eugene Yang and Dawn Lawrie and Benjamin Van Durme},
      year={2025},
      eprint={2509.06888},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.06888},
}
```