Dataset

Common Corpus

by PleIAs hf-dataset--pleias--common_corpus
Nexus Index
23.0 Top 2%
S / A / P / R / Q Breakdown Calibration Pending

Pillar scores are computed during the next indexing cycle.

Tech Context
Vital Performance
0 DL / 30D
0.0%

Common Corpus is the largest open and permissively licensed text dataset, comprising about 2 trillion tokens (1,998,647,168,282 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more. Common Corpus was created by Pleias in association with several partners and contri...

Data Integrity 23 FNI Score
Size: -
Rows: -
Format: Parquet
Tokens: -
Dataset Information Summary
Entity Passport
Registry ID hf-dataset--pleias--common_corpus
Provider huggingface
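
As a minimal sketch of how a dataset hosted on this provider can be inspected without downloading it in full, the snippet below streams a few rows with the Hugging Face `datasets` library. The repository path `PleIAs/common_corpus` is taken from the citation URL on this page. The helper name `peek_rows` and the `train` split are assumptions made for illustration; column names are not listed on this page, so the sketch does not rely on any.

```python
from itertools import islice

# Repository path as it appears in the citation URL on this page.
REPO_ID = "PleIAs/common_corpus"


def peek_rows(n=3, split="train"):
    """Return the first n rows as dicts, streamed from the Hub.

    `peek_rows` is a hypothetical helper and `split="train"` is an
    assumption; check the dataset page for the splits that exist.
    """
    # Imported lazily so this file can be loaded even where the
    # `datasets` package is not installed.
    from datasets import load_dataset

    # streaming=True iterates over remote files instead of downloading
    # the whole dataset up front.
    ds = load_dataset(REPO_ID, split=split, streaming=True)
    return list(islice(ds, n))
```

Calling `peek_rows(3)` requires network access and the `datasets` package. Printing `sorted(row.keys())` for each returned row is one way to discover the schema, which this page does not yet index.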

Cite this dataset

Academic & Research Attribution

BibTeX
@misc{hf_dataset__pleias__common_corpus,
  author = {PleIAs},
  title = {Common Corpus Dataset},
  year = {2026},
  howpublished = {\url{https://huggingface.co/datasets/PleIAs/common_corpus}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}
APA Style
PleIAs. (2026). Common Corpus [Dataset]. Free2AITools. https://huggingface.co/datasets/PleIAs/common_corpus

🔬 Technical Deep Dive

⚖️ Nexus Index V2.0

23.0
ESTIMATED IMPACT TIER
Semantic (S) 50
Authority (A) 0
Popularity (P) 0
Recency (R) 0
Quality (Q) 0

💬 Index Insight

FNI V2.0 for Common Corpus: Semantic (S:50), Authority (A:0), Popularity (P:0), Recency (R:0), Quality (Q:0).

Free2AITools Nexus Index

Verification Authority

Unbiased Data Node Refresh: VFS Live
⬇️
Downloads
36,432
❀️
Likes
337

👁️ Data Preview

Row-level preview not available for this dataset.

Schema structure is shown in the Field Logic panel when available.

🧬 Field Logic

🧬

Schema not yet indexed for this dataset.

Dataset Specification

Top Tier

Social Proof

HuggingFace Hub
337 Likes
36.4K Downloads
🔄 Daily sync (03:00 UTC)

AI Summary: Based on Hugging Face metadata. Not a recommendation.

🛡️ Dataset Transparency Report

Verified data manifest for traceability and transparency.

100% Data Disclosure Active

🆔 Identity & Source

id
hf-dataset--pleias--common_corpus
source
huggingface
author
PleIAs
tags
language:en, language:fr, language:de, language:it, language:es, language:la, language:nl, language:pl, size_categories:100m, format:parquet, modality:tabular, modality:text, library:datasets, library:dask, library:polars, library:mlcroissant, arxiv:2506.01732, arxiv:2410.22587, region:us
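
The tags above follow a `namespace:value` pattern, so they can be grouped by namespace to answer questions such as "which languages are covered?" or "which libraries are listed?". The following is a small sketch using only the standard library; the function name `group_tags` is hypothetical, and the tag list is copied from this page as shown (including the `size_categories` value exactly as it appears here).

```python
from collections import defaultdict

# Tags copied from this page, one string per tag.
TAGS = [
    "language:en", "language:fr", "language:de", "language:it",
    "language:es", "language:la", "language:nl", "language:pl",
    "size_categories:100m", "format:parquet",
    "modality:tabular", "modality:text",
    "library:datasets", "library:dask", "library:polars",
    "library:mlcroissant",
    "arxiv:2506.01732", "arxiv:2410.22587",
    "region:us",
]


def group_tags(tags):
    """Group 'namespace:value' strings into {namespace: [values]}."""
    grouped = defaultdict(list)
    for tag in tags:
        # Split on the first colon only, so a value that itself
        # contains ':' would stay intact.
        namespace, _, value = tag.partition(":")
        grouped[namespace].append(value)
    return dict(grouped)


grouped = group_tags(TAGS)
print(grouped["language"])  # the eight language codes, in page order
print(grouped["library"])   # libraries listed for this dataset
```

Grouping keeps the original order within each namespace, which makes it easy to compare against the page when the daily sync changes the tag list.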

⚙️ Technical Specs

architecture
null
params billions
null
context length
null

📊 Engagement & Metrics

likes
337
downloads
36,432
