---
license: mit
task_categories:
- fill-mask
tags:
- pretraining
- encoder
- multilingual
---
# mmBERT Mid-training Data

> **Phase 2 of 3**: High-quality mid-training data mixture (600B tokens) with context extension to 8192 tokens.

This dataset contains the mid-training phase data used to train all mmBERT encoder models. This phase focuses on higher-quality data sources and extends the context length from 1024 to 8192 tokens. The data is provided in **MDS format**, ready for use with Composer and the ModernBERT training repository.
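The card does not ship a loader snippet here, so the following is a minimal sketch, assuming the MDS shards are already on local disk (see Direct Access below) and using MosaicML's `streaming` package, the layer that Composer's data loading builds on; the local path is a placeholder:

```python
from torch.utils.data import DataLoader
from streaming import StreamingDataset  # pip install mosaicml-streaming

# Point `local` at a directory of downloaded MDS shards.
dataset = StreamingDataset(local="/path/to/mmBERT-midtraining-data", shuffle=False)

# Samples are dicts keyed by whatever columns the shards were written with;
# inspect the keys rather than assuming a field name.
print(dataset[0].keys())

# StreamingDataset plugs directly into a standard PyTorch DataLoader.
loader = DataLoader(dataset, batch_size=8)
```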
## 📊 Data Composition
| Data Source | Tokens (B) | Percentage | Description |
|---|---|---|---|
| FineWeb2 | 506.7 | 84.3% | High-quality multilingual web crawl data |
| DCLM (Dolmino) | 40.0 | 6.7% | Filtered high-quality English web data |
| Starcoder | 17.2 | 2.9% | Code repositories and files |
| arXiv | 5.4 | 0.9% | Academic preprints |
| Dolmino Math | 4.3 | 0.7% | Mathematical content |
| Books | 3.9 | 0.7% | Literature and reference books |
| PeS2o | 3.2 | 0.5% | Scientific papers |
| Tulu Flan | 3.1 | 0.5% | Instruction-following data |
| StackExchange | 3.0 | 0.5% | Q&A forums |
| StackExchange (Dolmino) | 2.8 | 0.5% | Curated Q&A content |
| Wikipedia (MegaWika) | 1.2 | 0.2% | Encyclopedia articles |
| **Total** | **600.8** | **100.0%** | High-quality data for context extension |
## 🌍 Language Coverage

This phase covers 110 languages plus code, sampled with inverse temperature sampling at τ = 0.5 (see the sketch after this list). It expands from the initial 60 languages to include:
- **Additional mid-resource languages**: Uzbek, Bosnian, Catalan, Albanian, and 46 others
- **Enhanced quality**: Uses filtered FineWeb2-HQ and higher-quality DCLM
- **Longer contexts**: Optimized for 8192-token sequences
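For intuition, inverse temperature sampling raises each language's natural data share to the power τ and renormalizes, so τ = 0.5 pulls probability mass toward lower-resource languages. A minimal sketch with made-up token counts (not the actual mmBERT proportions):

```python
import numpy as np

def temperature_probs(token_counts, tau=0.5):
    """Sampling probabilities q_i proportional to p_i**tau, where p_i is a
    language's natural share of the corpus. tau=1 keeps natural
    proportions; tau -> 0 approaches uniform sampling."""
    p = np.asarray(token_counts, dtype=np.float64)
    p /= p.sum()          # natural proportions
    q = p ** tau          # flatten with temperature
    return q / q.sum()

# Illustrative counts for a high-, mid-, and low-resource language:
print(temperature_probs([900, 90, 10]))
# [0.703... 0.222... 0.074...] -- the 90% language drops to ~70%
```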
## ⚙️ Key Features

- **Context Extension**: RoPE base frequency adjusted to 160k for 8192-token support
- **Quality Upgrade**: Switches to filtered, higher-quality versions of datasets
- **Reduced Masking**: Mask rate lowered to 15% (from 30% in pre-training); see the sketch below
- **Language Expansion**: Adds 50 new languages while maintaining data quality
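A minimal sketch of both knobs, assuming PyTorch and `transformers`; the tokenizer checkpoint below is a placeholder, not necessarily the one used in training:

```python
import torch
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# RoPE inverse frequencies: theta_i = base^(-2i/d). Raising the base
# from the common 10k to 160k slows each dimension's rotation, keeping
# positions distinguishable out to 8192 tokens.
def rope_inv_freq(head_dim: int, base: float) -> torch.Tensor:
    exponents = torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim
    return base ** -exponents

inv_freq_short = rope_inv_freq(64, base=10_000.0)
inv_freq_long = rope_inv_freq(64, base=160_000.0)

# Reduced masking: MLM mask rate of 15% for this phase (vs 30% earlier).
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base")  # placeholder
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```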
## 🚀 Usage

For mid-training, see the ModernBERT repo: https://github.com/AnswerDotAI/ModernBERT
### Direct Access

Use the script at this link to load any section of the dataset on the fly. Note that requesting too many samples this way will fail due to Hugging Face rate limiting. To download the full dataset, use HF Hub's `snapshot_download`, as shown below.
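A download sketch using `huggingface_hub` (the repo id comes from this dataset's URL on the Hub):

```python
from huggingface_hub import snapshot_download

# Fetch every MDS shard in the dataset repo; returns the local path.
path = snapshot_download(
    repo_id="jhu-clsp/mmBERT-midtraining-data",
    repo_type="dataset",
)
print(path)
```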
## 🔗 Related Resources

- Models: mmBERT Model Suite
- Phase 1: Pre-training Data (2.3T tokens)
- Phase 3: Decay Phase Data (100B tokens)
- Checkpoints: Training Checkpoints
- Paper: https://arxiv.org/abs/2509.06888
- Code: GitHub Repository
## Citation

```bibtex
@misc{marone2025mmbertmodernmultilingualencoder,
      title={mmBERT: A Modern Multilingual Encoder with Annealed Language Learning},
      author={Marc Marone and Orion Weller and William Fleshman and Eugene Yang and Dawn Lawrie and Benjamin Van Durme},
      year={2025},
      eprint={2509.06888},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.06888},
}
```