📊
Dataset

mmBERT Midtraining Data

by Community (hf-dataset--jhu-clsp--mmbert-midtraining-data)
Nexus Index
0.0 Top 18%
S: Semantic 50
A: Authority 0
P: Popularity 0
R: Recency 0
Q: Quality 0
Tech Context
Vital Performance
0 downloads / 30 days
0.0%

> **Phase 2 of 3**: High-quality mid-training data mixture (600B tokens) with context extension to 8192 tokens. This dataset contains the mid-training phase data used to train all mmBERT encoder models. This phase focuses on higher quality data sources and extends the context length from 1024 to 8192 tokens. The data is provided in **MDS format** ready for use with Composer and the ModernBERT traini...
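
The figures in the description can be turned into a rough sizing check, followed by a sketch of how an MDS directory might be opened. This is a minimal illustration, not the project's own tooling. It treats "600B" as exactly 600,000,000,000 tokens, which is a rounded figure. The helper names `max_full_sequences` and `open_mds_directory` are hypothetical. The reader function assumes the `mosaicml-streaming` package and a local directory that already holds an MDS `index.json` with its shard files; the repository's actual folder layout is not stated here.

```python
"""Sketch: sizing and reading the mid-training data (assumptions labeled)."""

# Figures quoted in the dataset description ("600B" is a rounded value).
TOTAL_TOKENS = 600_000_000_000
OLD_CONTEXT = 1024   # context length before this phase
NEW_CONTEXT = 8192   # context length after extension


def max_full_sequences(total_tokens: int, context_length: int) -> int:
    """Upper bound on fully packed sequences of `context_length` tokens.

    Real counts depend on how documents are packed and padded.
    """
    return total_tokens // context_length


def open_mds_directory(local_dir: str):
    """Open one locally downloaded MDS directory (hypothetical helper).

    Assumption: `local_dir` contains an MDS `index.json` plus shard files.
    Needs `pip install mosaicml-streaming`; the import is deferred so the
    arithmetic above runs without that package installed.
    """
    from streaming import StreamingDataset

    return StreamingDataset(local=local_dir, shuffle=False, batch_size=1)


if __name__ == "__main__":
    # Growth factor of the context window in this phase.
    print(NEW_CONTEXT // OLD_CONTEXT)
    # Upper bound on full-length sequences at the extended context.
    print(max_full_sequences(TOTAL_TOKENS, NEW_CONTEXT))
```

The two helpers are kept separate so that the sizing estimate stays usable on a machine where no data has been downloaded.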

Dataset Information Summary
Entity Passport
Registry ID hf-dataset--jhu-clsp--mmbert-midtraining-data
Provider huggingface
📜

Cite this dataset

Academic & Research Attribution

BibTeX
@misc{hf_dataset__jhu_clsp__mmbert_midtraining_data,
  author = {Community},
  title = {mmBERT Midtraining Data},
  year = {2026},
  howpublished = {\url{https://huggingface.co/datasets/jhu-clsp/mmBERT-midtraining-data}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}
APA Style
Community. (2026). mmBERT Midtraining Data [Dataset]. Free2AITools. https://huggingface.co/datasets/jhu-clsp/mmBERT-midtraining-data

🔬 Technical Deep Dive

⚖️ Nexus Index V2.0

💬 Index Insight

FNI V2.0 for mmBERT Midtraining Data: Semantic (S:50), Authority (A:0), Popularity (P:0), Recency (R:0), Quality (Q:0).

👁️ Data Preview

Row-level preview not available for this dataset.

Schema structure is shown in the Field Logic panel when available.

🧬 Field Logic

Schema not yet indexed for this dataset.

Dataset Specification

🔄 Daily sync (03:00 UTC)

AI Summary: Based on Hugging Face metadata. Not a recommendation.

🛡️ Dataset Transparency Report

Verified data manifest for traceability and transparency.

100% Data Disclosure Active

🆔 Identity & Source

id
hf-dataset--jhu-clsp--mmbert-midtraining-data
source
huggingface
author
Community
tags

⚙️ Technical Specs

architecture
null
params (billions)
null
context length
null

📊 Engagement & Metrics

likes
0
downloads
0

Free2AITools Constitutional Data Pipeline: Curated disclosure mode active. (V15.x Standard)