Hplt2.0 Cleaned
| Entity Passport | |
| Registry ID | hf-dataset--hplt--hplt2.0_cleaned |
| License | CC0-1.0 |
| Provider | huggingface |
Cite this dataset
Academic & Research Attribution
@misc{hf_dataset__hplt__hplt2.0_cleaned,
author = {HPLT},
title = {Hplt2.0 Cleaned Dataset},
year = {2026},
howpublished = {\url{https://huggingface.co/datasets/hplt/hplt2.0_cleaned}},
note = {Accessed via Free2AITools Knowledge Fortress}
}
Dataset Specification
NB: HPLT2.0 is now superseded by a newer release, v3.0.
We recommend switching to v3.0 unless you have a compelling reason to stay on 2.0.
This is a large-scale collection of web-crawled documents in 191 world languages, produced by the HPLT project. The source of the data is mostly Internet Archive with some additions from Common Crawl.
For a detailed description of the dataset, please refer to our website and our pre-print.
The Cleaned variant of HPLT Datasets v2.0
This is the cleaned variant of the HPLT Datasets v2.0 converted to the Parquet format semi-automatically when being uploaded here.
The original JSONL files (which take ~4x less disk space than this HF version) and the larger non-cleaned version can be found at https://hplt-project.org/datasets/v2.0.
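As a minimal sketch of reading one language subset of the Parquet version, the snippet below uses the Hugging Face datasets library in streaming mode. The repository id HPLT/HPLT2.0_cleaned, the "train" split, and the use of a language code such as nob_Latn as the configuration name are assumptions based on the dataset URL and the language table on this page; load_kwargs and stream_subset are hypothetical helper names.

```python
def load_kwargs(lang_code: str) -> dict:
    """Build arguments for datasets.load_dataset (repo id and config naming are assumptions)."""
    return {
        "path": "HPLT/HPLT2.0_cleaned",  # assumed repository id
        "name": lang_code,               # assumed: one configuration per language code
        "split": "train",                # assumed split name
        "streaming": True,               # iterate lazily instead of downloading the subset
    }


def stream_subset(lang_code: str, limit: int = 3):
    """Yield the first `limit` documents of one language subset."""
    # Imported lazily so the helper above works without the library installed.
    from datasets import load_dataset

    dataset = load_dataset(**load_kwargs(lang_code))
    for index, row in enumerate(dataset):
        if index >= limit:
            break
        yield row
```

Iterating over stream_subset("nob_Latn") would then yield row dictionaries whose "text" field holds the document text, without fetching the full subset first.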
Dataset Performance
Internal Evaluation
We conducted the FineWeb-style ablation studies within the HPLT project with the focus on one high-resource and one low-resource language: English and Norwegian.
We train 1.7B decoder-only LMs using 100B/30B tokens sampled from the English/Norwegian parts of our HPLT v2 dataset, respectively. We replicate the FineWeb corpora comparison design and train the models with a fixed pretraining setup except for the pretraining corpus (English: four corpora; Norwegian: five corpora). A general description of the training and evaluation setups is given below; more details can be found in Section 6.2 and Appendix I of our paper.
[Figure: plots of English results and Norwegian results]
English
- Corpora: HPLT v1.2, FineWeb and HPLT v2 (ours; deduplicated and cleaned versions).
- Pretraining framework and infrastructure: We trained our English models using Megatron-LM on LUMI with 16 nodes, each with 4 AMD MI250x GPUs with dual-GCD (graphics compute die) design, amounting to 8 logical devices per node. In total, we used 128 devices and a single 64-core CPU for approximately 84 hours, totalling 10,752 GPU hours per model.
- Evaluation tasks: ARC (Easy and Challenge), Hellaswag, PIQA, and OpenbookQA. We consider only the 0-shot evaluation regime.
- Evaluation framework: LightEval.
- Results: See the plot above. Our models trained on the HPLT v2 datasets reach similar performance to the models trained on FineWeb data and considerably outperform the models trained on HPLT v1.2.
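The device count in the infrastructure bullet follows directly from the node layout. The sketch below spells out that arithmetic; the function names are hypothetical, and the factor of 2 GCDs per GPU is taken from the dual-GCD description above.

```python
def logical_devices(nodes: int, gpus_per_node: int, gcds_per_gpu: int = 1) -> int:
    """Number of logical devices across all nodes."""
    return nodes * gpus_per_node * gcds_per_gpu


def device_hours(devices: int, wall_clock_hours: float) -> float:
    """Compute budget as devices multiplied by wall-clock hours."""
    return devices * wall_clock_hours


# English setup: 16 nodes, 4 MI250x GPUs per node, 2 GCDs per GPU.
english_devices = logical_devices(nodes=16, gpus_per_node=4, gcds_per_gpu=2)
print(english_devices)  # 128
```

Multiplying the device count by the wall-clock duration of a run with device_hours gives the per-model GPU-hour budget.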
Norwegian
- Corpora: HPLT v1.2, FineWeb-2, mC4, CulturaX, and HPLT v2 (ours).
- Pretraining framework and infrastructure: We trained our Norwegian models using Megatron-DeepSpeed on LUMI with 32 nodes, each with 4 AMD MI250x GPUs. The full pretraining run of each model took approximately 15 hours (wall-clock time), or 1,920 GPU-hours.
- Evaluation tasks: NorCommonsenseQA, NorOpenBookQA, NRK-Quiz-QA, NCB, NorIdiom, and NorQuAD. We discarded tasks that provided a low signal based on the monotonicity and non-random performance criteria defined in the FineWeb-2 evaluation design. The resulting tasks were NCB, NRK-Quiz-QA, NorCommonsenseQA, and NorQuAD. We aggregated the performance using the average normalized score. We consider only the 0-shot evaluation regime.
- Evaluation framework: NorEval, a Norwegian language understanding and generation evaluation benchmark based upon LM Evaluation Harness.
- Results: See the plot above. The Norwegian models trained on FineWeb, CulturaX, and mC4 perform on par with HPLT v2 and outperform those trained on HPLT v1.2. Performance gains start to level off after 16B tokens, with the FineWeb and HPLT v2 scores being more stable during pretraining. This suggests that CulturaX, FineWeb, and HPLT v2 are more effective corpora for Norwegian, and their mixtures potentially provide further benefits.
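To illustrate the "average normalized score" aggregation mentioned in the evaluation bullet, here is a minimal sketch. It assumes one common definition of normalization, rescaling each task score so that the random baseline maps to 0 and the maximum score maps to 1; the exact formula used in the paper is not stated here. The task names and numbers are hypothetical placeholders, not reported results.

```python
from statistics import mean


def normalize(score: float, random_baseline: float, max_score: float = 1.0) -> float:
    """Rescale a score so the random baseline maps to 0 and max_score maps to 1 (assumed definition)."""
    return (score - random_baseline) / (max_score - random_baseline)


def average_normalized_score(results: dict) -> float:
    """Average the normalized scores over tasks; `results` maps task -> (score, random_baseline)."""
    return mean(normalize(score, baseline) for score, baseline in results.values())


# Hypothetical scores for two placeholder tasks (not taken from the paper).
example_results = {
    "task_a": (0.625, 0.25),  # four-way multiple choice, random baseline 0.25
    "task_b": (0.5, 0.5),     # binary task scoring exactly at chance
}
print(average_normalized_score(example_results))  # 0.25
```

Normalizing before averaging keeps tasks with different chance levels comparable, so a task scored at chance contributes 0 rather than its raw accuracy.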
External Evaluation
The HuggingFace team has compared the utility of various multilingual corpora for training large language models in their FineWeb2 initiative.
They found that the HPLT v2 datasets are next to their FineWeb-2, on par with the CulturaX dataset as shown in this figure produced by HuggingFace:
This is a massive improvement compared to the HPLT v1 datasets, as can be seen on the plot above. In fact, it's even better: if one looks at the language-specific results, it becomes clear that on Arabic, Hindi, Russian, Thai and Turkish (5 out of 9 languages HuggingFace evaluated on), HPLT v2 is on par with or better than FineWeb 2. The average score is lower mostly because of Chinese, which we expect to improve a lot in HPLT v3. Note that the source of the FineWeb 2 (and CulturaX) data is exclusively CommonCrawl, while the HPLT datasets are to a large extent composed of Internet Archive crawls. Thus, FineWeb-2 and HPLT v2 are complementary to each other and should be used together.
Languages
The cleaned version of HPLT Datasets v2.0 consists of subsets corresponding to 191 language codes.
Below we provide a list of language codes. For each language code the amount of text is shown as measured in:
- segments: the number of sequences of characters (possibly empty) separated by the newline symbol,
- wcwords: the number of words as defined by the Unix wc utility, i.e. the number of non-whitespace sequences preceded by whitespace or by the beginning of the document,
- chars: the number of characters,
- docs: the number of documents, each document corresponds to an individual web page from the sourcing web crawls.
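The four measures above can be sketched in a few lines. This is an approximation under stated assumptions: Python's str.split() is used as a stand-in for the wc word definition (its notion of whitespace may differ slightly from wc in some locales), and the function names are hypothetical.

```python
def text_statistics(text: str) -> dict:
    """Per-document counts following the definitions in the list above."""
    return {
        # Sequences of characters (possibly empty) separated by the newline symbol.
        "segments": len(text.split("\n")),
        # Whitespace-delimited tokens, approximating the Unix wc word count.
        "wcwords": len(text.split()),
        # Number of characters (Unicode code points in Python).
        "chars": len(text),
    }


def corpus_statistics(documents: list) -> dict:
    """Sum the per-document counts and add the number of documents."""
    totals = {"segments": 0, "wcwords": 0, "chars": 0, "docs": len(documents)}
    for text in documents:
        for key, value in text_statistics(text).items():
            totals[key] += value
    return totals


print(text_statistics("hello world\n\nfoo"))
# {'segments': 3, 'wcwords': 3, 'chars': 16}
```

Note that the empty line between the two newlines counts as a segment, which matches the "possibly empty" wording in the definition.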
| | lang | segments | wcwords | chars | docs | Language Name | ISO639-3 code | ISO639-3 code macro | ISO639-1 direct code | ISO639-1 through macro |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TOTAL | 3.00e+11 | 5.56e+12 | 3.74e+13 | 1.06e+10 | |||||
| 1 | ace_Arab | 1.17e+02 | 8.36e+03 | 4.97e+04 | 1.60e+01 | Achinese | ace | |||
| 2 | ace_Latn | 2.06e+05 | 8.20e+06 | 5.08e+07 | 1.29e+04 | Achinese | ace | |||
| 3 | afr_Latn | 3.77e+07 | 1.00e+09 | 5.95e+09 | 1.46e+06 | Afrikaans | afr | af | af | |
| 4 | als_Latn | 9.51e+07 | 2.71e+09 | 1.61e+10 | 5.38e+06 | Tosk Albanian | als | sqi | sq | |
| 5 | amh_Ethi | 7.01e+06 | 1.96e+08 | 1.03e+09 | 2.96e+05 | Amharic | amh | am | am | |
| 6 | ara_Arab | 2.20e+09 | 4.81e+10 | 2.80e+11 | 8.27e+07 | Arabic | ara | ar | ar | |
| 7 | asm_Beng | 2.68e+06 | 7.34e+07 | 4.76e+08 | 1.76e+05 | Assamese | asm | as | as | |
| 8 | ast_Latn | 7.43e+06 | 1.95e+08 | 1.24e+09 | 2.73e+05 | Asturian | ast | |||
| 9 | awa_Deva | 1.32e+05 | 6.05e+06 | 2.88e+07 | 7.28e+03 | Awadhi | awa | |||
| 10 | ayr_Latn | 1.88e+05 | 3.07e+06 | 2.51e+07 | 9.22e+03 | Central Aymara | ayr | aym | ay | |
| 11 | azb_Arab | 2.39e+06 | 3.96e+07 | 2.60e+08 | 6.61e+04 | South Azerbaijani | azb | aze | az | |
| 12 | azj_Latn | 1.27e+08 | 2.57e+09 | 1.96e+10 | 6.48e+06 | North Azerbaijani | azj | aze | az | |
| 13 | bak_Cyrl | 3.14e+06 | 7.53e+07 | 5.58e+08 | 1.71e+05 | Bashkir | bak | ba | ba | |
| 14 | bam_Latn | 9.17e+04 | 3.98e+06 | 2.07e+07 | 5.72e+03 | Bambara | bam | bm | bm | |
| 15 | ban_Latn | 6.01e+05 | 1.13e+07 | 7.72e+07 | 1.07e+04 | Balinese | ban | |||
| 16 | bel_Cyrl | 4.88e+07 | 1.21e+09 | 8.54e+09 | 2.32e+06 | Belarusian | bel | be | be | |
| 17 | bem_Latn | 1.34e+05 | 4.52e+06 | 3.23e+07 | 6.14e+03 | Bemba (Zambia) | bem | |||
| 18 | ben_Beng | 1.76e+08 | 4.64e+09 | 3.02e+10 | 1.10e+07 | Bengali | ben | bn | bn | |
| 19 | bho_Deva | 4.58e+05 | 1.35e+07 | 6.86e+07 | 2.86e+04 | Bhojpuri | bho | |||
| 20 | bjn_Arab | 1.95e+04 | 5.48e+05 | 3.32e+06 | 1.11e+03 | Banjar | bjn | msa | ms | |
| 21 | bjn_Latn | 3.66e+05 | 8.05e+06 | 5.60e+07 | 1.88e+04 | Banjar | bjn | msa | ms | |
| 22 | bod_Tibt | 4.65e+05 | 5.78e+06 | 2.68e+08 | 2.74e+04 | Tibetan | bod | bo | bo | |
| 23 | bos_Latn | 2.68e+08 | 7.26e+09 | 4.61e+10 | 1.46e+07 | Bosnian | bos | hbs | bs | bs |
| 24 | bug_Latn | 3.86e+04 | 2.70e+06 | 1.93e+07 | 2.02e+03 | Buginese | bug | |||
| 25 | bul_Cyrl | 6.81e+08 | 1.53e+10 | 9.69e+10 | 2.81e+07 | Bulgarian | bul | bg | bg | |
| 26 | cat_Latn | 3.83e+08 | 1.00e+10 | 6.02e+10 | 1.86e+07 | Catalan | cat | ca | ca | |
| 27 | ceb_Latn | 2.86e+06 | 8.59e+07 | 5.16e+08 | 1.39e+05 | Cebuano | ceb | |||
| 28 | ces_Latn | 1.93e+09 | 4.21e+10 | 2.74e+11 | 7.53e+07 | Czech | ces | cs | cs | |
| 29 | cjk_Latn | 3.67e+04 | 9.65e+05 | 7.43e+06 | 1.20e+03 | Chokwe | cjk | |||
| 30 | ckb_Arab | 5.23e+06 | 1.43e+08 | 9.13e+08 | 2.74e+05 | Central Kurdish | ckb | kur | ku | |
| 31 | crh_Latn | 1.38e+06 | 3.68e+07 | 2.81e+08 | 1.23e+05 | Crimean Tatar | crh | |||
| 32 | cym_Latn | 1.56e+07 | 4.09e+08 | 2.40e+09 | 7.58e+05 | Welsh | cym | cy | cy | |
| 33 | dan_Latn | 8.73e+08 | 2.12e+10 | 1.33e+11 | 3.38e+07 | Danish | dan | da | da | |
| 34 | deu_Latn | 1.11e+10 | 2.52e+11 | 1.78e+12 | 4.82e+08 | German | deu | de | de | |
| 35 | dik_Latn | 3.46e+04 | 2.30e+06 | 1.15e+07 | 2.32e+03 | Southwestern Dinka | dik | din | ||
| 36 | dyu_Latn | 2.46e+04 | 1.19e+06 | 5.55e+06 | 1.39e+03 | Dyula | dyu | |||
| 37 | dzo_Tibt | 4.00e+04 | 4.22e+05 | 7.38e+06 | 1.63e+03 | Dzongkha | dzo | dz | dz | |
| 38 | ell_Grek | 1.85e+09 | 4.27e+10 | 2.84e+11 | 7.03e+07 | Modern Greek (1453-) | ell | el | el | |
| 39 | eng_Latn | 1.16e+11 | 2.86e+12 | 1.71e+13 | 4.39e+09 | English | eng | en | en | |
| 40 | epo_Latn | 2.04e+07 | 4.72e+08 | 2.98e+09 | 8.19e+05 | Esperanto | epo | eo | eo | |
| 41 | est_Latn | 2.64e+08 | 4.74e+09 | 3.60e+10 | 8.45e+06 | Estonian | est | et | et | |
| 42 | eus_Latn | 3.76e+07 | 7.77e+08 | 6.05e+09 | 1.97e+06 | Basque | eus | eu | eu | |
| 43 | ewe_Latn | 1.43e+05 | 4.31e+06 | 2.13e+07 | 3.77e+03 | Ewe | ewe | ee | ee | |
| 44 | fao_Latn | 4.53e+06 | 9.34e+07 | 5.82e+08 | 2.40e+05 | Faroese | fao | fo | fo | |
| 45 | fij_Latn | 1.79e+05 | 7.26e+06 | 3.77e+07 | 8.91e+03 | Fijian | fij | fj | fj | |
| 46 | fin_Latn | 9.77e+08 | 1.84e+10 | 1.56e+11 | 3.48e+07 | Finnish | fin | fi | fi | |
| 47 | fon_Latn | 1.48e+04 | 1.23e+06 | 5.34e+06 | 1.23e+03 | Fon | fon | |||
| 48 | fra_Latn | 1.06e+10 | 2.37e+11 | 1.46e+12 | 4.02e+08 | French | fra | fr | fr | |
| 49 | fur_Latn | 7.30e+05 | 2.08e+07 | 1.15e+08 | 3.67e+04 | Friulian | fur | |||
| 50 | fuv_Latn | 1.34e+05 | 5.14e+06 | 2.99e+07 | 7.76e+03 | Nigerian Fulfulde | fuv | ful | ff | |
| 51 | gaz_Latn | 9.74e+05 | 2.89e+07 | 2.19e+08 | 4.91e+04 | West Central Oromo | gaz | orm | om | |
| 52 | gla_Latn | 3.31e+06 | 8.07e+07 | 4.84e+08 | 1.37e+05 | Scottish Gaelic | gla | gd | gd | |
| 53 | gle_Latn | 1.10e+07 | 2.96e+08 | 1.75e+09 | 4.91e+05 | Irish | gle | ga | ga | |
| 54 | glg_Latn | 6.12e+07 | 1.64e+09 | 1.01e+10 | 3.02e+06 | Galician | glg | gl | gl | |
| 55 | grn_Latn | 1.71e+06 | 3.07e+07 | 2.19e+08 | 7.34e+04 | Guarani | grn | gn | gn | |
| 56 | guj_Gujr | 2.06e+07 | 5.77e+08 | 3.39e+09 | 1.13e+06 | Gujarati | guj | gu | gu | |
| 57 | hat_Latn | 4.64e+06 | 1.22e+08 | 6.39e+08 | 2.13e+05 | Haitian | hat | ht | ht | |
| 58 | hau_Latn | 5.69e+06 | 1.53e+08 | 8.54e+08 | 3.16e+05 | Hausa | hau | ha | ha | |
| 59 | heb_Hebr | 4.67e+08 | 9.97e+09 | 5.68e+10 | 1.71e+07 | Hebrew | heb | he | he | |
| 60 | hin_Deva | 2.67e+08 | 8.64e+09 | 4.40e+10 | 1.36e+07 | Hindi | hin | hi | hi | |
| 61 | hne_Deva | 5.50e+04 | 2.20e+06 | 1.06e+07 | 2.81e+03 | Chhattisgarhi | hne | |||
| 62 | hrv_Latn | 2.97e+08 | 7.31e+09 | 4.80e+10 | 1.23e+07 | Croatian | hrv | hbs | hr | hr |
| 63 | hun_Latn | 1.42e+09 | 3.05e+10 | 2.25e+11 | 5.19e+07 | Hungarian | hun | hu | hu | |
| 64 | hye_Armn | 6.52e+07 | 1.40e+09 | 1.07e+10 | 3.60e+06 | Armenian | hye | hy | hy | |
| 65 | ibo_Latn | 1.41e+06 | 3.83e+07 | 2.05e+08 | 5.63e+04 | Igbo | ibo | ig | ig | |
| 66 | ilo_Latn | 1.12e+06 | 2.48e+07 | 1.57e+08 | 4.88e+04 | Iloko | ilo | |||
| 67 | ind_Latn | 2.39e+09 | 5.46e+10 | 3.84e+11 | 9.81e+07 | Indonesian | ind | msa | id | id |
| 68 | isl_Latn | 6.96e+07 | 1.54e+09 | 9.59e+09 | 2.84e+06 | Icelandic | isl | is | is | |
| 69 | ita_Latn | 5.13e+09 | 1.27e+11 | 8.21e+11 | 2.22e+08 | Italian | ita | it | it | |
| 70 | jav_Latn | 6.43e+06 | 1.38e+08 | 9.38e+08 | 1.96e+05 | Javanese | jav | jv | jv | |
| 71 | jpn_Jpan | 2.33e+10 | 4.24e+10 | 9.01e+11 | 4.18e+08 | Japanese | jpn | ja | ja | |
| 72 | kab_Latn | 3.45e+05 | 9.22e+06 | 5.42e+07 | 1.51e+04 | Kabyle | kab | |||
| 73 | kac_Latn | 1.59e+05 | 5.96e+06 | 2.84e+07 | 7.59e+03 | Kachin | kac | |||
| 74 | kam_Latn | 1.43e+04 | 6.74e+05 | 4.64e+06 | 1.18e+03 | Kamba (Kenya) | kam | |||
| 75 | kan_Knda | 2.49e+07 | 5.33e+08 | 4.30e+09 | 1.34e+06 | Kannada | kan | kn | kn | |
| 76 | kas_Arab | 2.71e+04 | 6.78e+05 | 3.47e+06 | 9.49e+02 | Kashmiri | kas | ks | ks | |
| 77 | kas_Deva | 1.36e+03 | 3.19e+04 | 1.85e+05 | 1.06e+02 | Kashmiri | kas | ks | ks | |
| 78 | kat_Geor | 6.37e+07 | 1.24e+09 | 1.02e+10 | 3.34e+06 | Georgian | kat | ka | ka | |
| 79 | kaz_Cyrl | 8.10e+07 | 1.41e+09 | 1.11e+10 | 2.64e+06 | Kazakh | kaz | kk | kk | |
| 80 | kbp_Latn | 4.68e+04 | 4.26e+06 | 2.09e+07 | 7.08e+03 | Kabiyè | kbp | |||
| 81 | kea_Latn | 4.39e+04 | 1.14e+06 | 6.14e+06 | 1.96e+03 | Kabuverdianu | kea | |||
| 82 | khk_Cyrl | 5.35e+07 | 1.34e+09 | 9.33e+09 | 2.12e+06 | Halh Mongolian | khk | mon | mn | |
| 83 | khm_Khmr | 9.86e+06 | 1.14e+08 | 2.12e+09 | 7.01e+05 | Khmer | khm | km | km | |
| 84 | kik_Latn | 5.19e+04 | 1.43e+06 | 9.29e+06 | 4.00e+03 | Kikuyu | kik | ki | ki | |
| 85 | kin_Latn | 1.92e+06 | 5.07e+07 | 3.67e+08 | 9.27e+04 | Kinyarwanda | kin | rw | rw | |
| 86 | kir_Cyrl | 1.00e+07 | 2.47e+08 | 1.92e+09 | 6.76e+05 | Kirghiz | kir | ky | ky | |
| 87 | kmb_Latn | 1.18e+04 | 3.83e+05 | 2.07e+06 | 5.31e+02 | Kimbundu | kmb | |||
| 88 | kmr_Latn | 7.15e+06 | 1.96e+08 | 1.12e+09 | 3.64e+05 | Northern Kurdish | kmr | kur | ku | |
| 89 | knc_Arab | 1.08e+04 | 2.62e+05 | 1.30e+06 | 2.45e+02 | Central Kanuri | knc | kau | kr | |
| 90 | knc_Latn | 1.05e+04 | 2.41e+06 | 1.20e+07 | 2.47e+03 | Central Kanuri | knc | kau | kr | |
| 91 | kon_Latn | 4.75e+04 | 1.94e+06 | 1.13e+07 | 2.54e+03 | Kongo | kon | kg | kg | |
| 92 | kor_Hang | 1.36e+09 | 1.97e+10 | 8.92e+10 | 3.89e+07 | Korean | kor | ko | ko | |
| 93 | lao_Laoo | 3.20e+05 | 5.18e+06 | 8.47e+07 | 2.95e+04 | Lao | lao | lo | lo | |
| 94 | lij_Latn | 1.58e+05 | 5.59e+06 | 3.15e+07 | 8.37e+03 | Ligurian | lij | |||
| 95 | lim_Latn | 7.14e+06 | 1.81e+08 | 1.12e+09 | 3.68e+05 | Limburgan | lim | li | li | |
| 96 | lin_Latn | 2.00e+05 | 5.56e+06 | 3.29e+07 | 7.59e+03 | Lingala | lin | ln | ln | |
| 97 | lit_Latn | 3.22e+08 | 6.68e+09 | 5.04e+10 | 1.33e+07 | Lithuanian | lit | lt | lt | |
| 98 | lmo_Latn | 2.12e+06 | 5.96e+07 | 3.45e+08 | 1.46e+05 | Lombard | lmo | |||
| 99 | ltg_Latn | 1.51e+05 | 3.79e+06 | 2.69e+07 | 9.21e+03 | Latgalian | ltg | lav | lv | |
| 100 | ltz_Latn | 5.06e+06 | 1.07e+08 | 7.10e+08 | 2.47e+05 | Luxembourgish | ltz | lb | lb | |
| 101 | lua_Latn | 3.87e+04 | 1.37e+06 | 9.00e+06 | 1.08e+03 | Luba-Lulua | lua | |||
| 102 | lug_Latn | 4.08e+05 | 9.18e+06 | 6.80e+07 | 2.13e+04 | Ganda | lug | lg | lg | |
| 103 | luo_Latn | 8.41e+04 | 3.73e+06 | 2.03e+07 | 4.15e+03 | Luo (Kenya and Tanzania) | luo | |||
| 104 | lus_Latn | 3.43e+06 | 1.25e+08 | 6.52e+08 | 1.60e+05 | Lushai | lus | |||
| 105 | lvs_Latn | 1.74e+08 | 3.46e+09 | 2.52e+10 | 6.77e+06 | Standard Latvian | lvs | lav | lv | |
| 106 | mag_Deva | 1.93e+04 | 8.91e+05 | 4.28e+06 | 3.28e+02 | Magahi | mag | |||
| 107 | mai_Deva | 6.46e+05 | 1.78e+07 | 9.67e+07 | 2.50e+04 | Maithili | mai | |||
| 108 | mal_Mlym | 4.80e+07 | 9.74e+08 | 9.49e+09 | 3.10e+06 | Malayalam | mal | ml | ml | |
| 109 | mar_Deva | 3.63e+07 | 9.81e+08 | 6.62e+09 | 2.08e+06 | Marathi | mar | mr | mr | |
| 110 | min_Latn | 6.01e+05 | 1.10e+07 | 7.48e+07 | 2.50e+04 | Minangkabau | min | msa | ms | |
| 111 | mkd_Cyrl | 5.70e+07 | 1.48e+09 | 9.44e+09 | 3.57e+06 | Macedonian | mkd | mk | mk | |
| 112 | mlt_Latn | 8.68e+06 | 1.96e+08 | 1.44e+09 | 3.67e+05 | Maltese | mlt | mt | mt | |
| 113 | mni_Beng | 6.58e+04 | 1.63e+06 | 1.18e+07 | 2.93e+03 | Manipuri | mni | |||
| 114 | mos_Latn | 1.91e+04 | 8.08e+05 | 3.86e+06 | 9.31e+02 | Mossi | mos | |||
| 115 | mri_Latn | 2.80e+06 | 8.68e+07 | 4.24e+08 | 1.08e+05 | Maori | mri | mi | mi | |
| 116 | mya_Mymr | 3.05e+07 | 4.53e+08 | 5.82e+09 | 1.37e+06 | Burmese | mya | my | my | |
| 117 | nld_Latn | 3.08e+09 | 7.14e+10 | 4.51e+11 | 1.39e+08 | Dutch | nld | nl | nl | |
| 118 | nno_Latn | 3.46e+07 | 8.60e+08 | 5.40e+09 | 1.42e+06 | Norwegian Nynorsk | nno | nor | nn | nn |
| 119 | nob_Latn | 6.76e+08 | 2.15e+10 | 1.33e+11 | 2.70e+07 | Norwegian Bokmål | nob | nor | nb | nb |
| 120 | npi_Deva | 3.71e+07 | 1.13e+09 | 7.26e+09 | 2.78e+06 | Nepali (individual language) | npi | nep | ne | |
| 121 | nso_Latn | 1.43e+05 | 5.32e+06 | 2.75e+07 | 6.07e+03 | Pedi | nso | |||
| 122 | nus_Latn | 8.51e+03 | 3.93e+05 | 1.88e+06 | 2.72e+02 | Nuer | nus | |||
| 123 | nya_Latn | 1.34e+06 | 2.71e+07 | 2.03e+08 | 5.31e+04 | Nyanja | nya | ny | ny | |
| 124 | oci_Latn | 4.20e+06 | 1.03e+08 | 6.35e+08 | 1.90e+05 | Occitan (post 1500) | oci | oc | oc | |
| 125 | ory_Orya | 3.60e+06 | 1.20e+08 | 7.82e+08 | 4.13e+05 | Odia | ory | ori | or | |
| 126 | pag_Latn | 8.58e+04 | 5.66e+06 | 3.35e+07 | 6.90e+03 | Pangasinan | pag | |||
| 127 | pan_Guru | 1.17e+07 | 3.72e+08 | 1.90e+09 | 5.85e+05 | Panjabi | pan | pa | pa | |
| 128 | pap_Latn | 1.39e+06 | 4.67e+07 | 2.54e+08 | 8.98e+04 | Papiamento | pap | |||
| 129 | pbt_Arab | 8.46e+06 | 2.79e+08 | 1.30e+09 | 4.66e+05 | Southern Pashto | pbt | pus | ps | |
| 130 | pes_Arab | 3.96e+09 | 8.86e+10 | 4.55e+11 | 9.05e+07 | Iranian Persian | pes | fas | fa | |
| 131 | plt_Latn | 4.74e+06 | 1.17e+08 | 8.10e+08 | 2.08e+05 | Plateau Malagasy | plt | mlg | mg | |
| 132 | pol_Latn | 4.46e+09 | 8.95e+10 | 6.32e+11 | 1.75e+08 | Polish | pol | pl | pl | |
| 133 | por_Latn | 6.12e+09 | 1.46e+11 | 8.96e+11 | 2.38e+08 | Portuguese | por | pt | pt | |
| 134 | prs_Arab | 6.90e+07 | 1.84e+09 | 9.57e+09 | 2.84e+06 | Dari | prs | fas | fa | |
| 135 | quy_Latn | 4.94e+05 | 1.73e+07 | 1.43e+08 | 3.69e+04 | Ayacucho Quechua | quy | que | qu | |
| 136 | ron_Latn | 1.70e+09 | 4.00e+10 | 2.51e+11 | 6.59e+07 | Romanian | ron | ro | ro | |
| 137 | run_Latn | 1.75e+06 | 4.44e+07 | 3.16e+08 | 1.37e+05 | Rundi | run | rn | rn | |
| 138 | rus_Cyrl | 2.63e+10 | 5.41e+11 | 3.91e+12 | 8.85e+08 | Russian | rus | ru | ru | |
| 139 | sag_Latn | 5.19e+04 | 3.61e+06 | 1.67e+07 | 3.16e+03 | Sango | sag | sg | sg | |
| 140 | san_Deva | 3.28e+06 | 4.38e+07 | 3.59e+08 | 5.49e+04 | Sanskrit | san | sa | sa | |
| 141 | sat_Olck | 4.58e+04 | 1.08e+06 | 6.27e+06 | 2.57e+03 | Santali | sat | |||
| 142 | scn_Latn | 1.65e+06 | 4.24e+07 | 2.52e+08 | 8.20e+04 | Sicilian | scn | |||
| 143 | shn_Mymr | 9.21e+04 | 1.65e+06 | 2.12e+07 | 6.00e+03 | Shan | shn | |||
| 144 | sin_Sinh | 3.37e+07 | 7.96e+08 | 4.98e+09 | 1.15e+06 | Sinhala | sin | si | si | |
| 145 | slk_Latn | 4.94e+08 | 1.06e+10 | 7.04e+10 | 2.18e+07 | Slovak | slk | sk | sk | |
| 146 | slv_Latn | 2.39e+08 | 5.44e+09 | 3.53e+10 | 1.03e+07 | Slovenian | slv | sl | sl | |
| 147 | smo_Latn | 1.01e+06 | 3.71e+07 | 1.86e+08 | 4.59e+04 | Samoan | smo | sm | sm | |
| 148 | sna_Latn | 1.20e+06 | 2.39e+07 | 1.93e+08 | 6.11e+04 | Shona | sna | sn | sn | |
| 149 | snd_Arab | 2.83e+06 | 8.95e+07 | 4.29e+08 | 1.00e+05 | Sindhi | snd | sd | sd | |
| 150 | som_Latn | 1.64e+07 | 3.89e+08 | 2.56e+09 | 9.66e+05 | Somali | som | so | so | |
| 151 | sot_Latn | 1.08e+06 | 3.10e+07 | 1.72e+08 | 4.39e+04 | Southern Sotho | sot | st | st | |
| 152 | spa_Latn | 1.21e+10 | 3.22e+11 | 1.95e+12 | 5.03e+08 | Spanish | spa | es | es | |
| 153 | srd_Latn | 9.17e+05 | 2.39e+07 | 1.49e+08 | 5.38e+04 | Sardinian | srd | sc | sc | |
| 154 | srp_Cyrl | 9.38e+07 | 2.52e+09 | 1.62e+10 | 4.12e+06 | Serbian | srp | hbs | sr | sr |
| 155 | ssw_Latn | 6.21e+04 | 9.94e+05 | 8.82e+06 | 2.04e+03 | Swati | ssw | ss | ss | |
| 156 | sun_Latn | 3.24e+06 | 6.96e+07 | 4.75e+08 | 1.15e+05 | Sundanese | sun | su | su | |
| 157 | swe_Latn | 1.76e+09 | 4.01e+10 | 2.51e+11 | 6.68e+07 | Swedish | swe | sv | sv | |
| 158 | swh_Latn | 3.43e+07 | 7.18e+08 | 4.66e+09 | 1.37e+06 | Swahili (individual language) | swh | swa | sw | |
| 159 | szl_Latn | 6.37e+05 | 1.47e+07 | 1.04e+08 | 4.09e+04 | Silesian | szl | |||
| 160 | tam_Taml | 1.69e+08 | 2.98e+09 | 2.62e+10 | 6.11e+06 | Tamil | tam | ta | ta | |
| 161 | taq_Latn | 1.39e+04 | 1.54e+06 | 8.84e+06 | 1.75e+03 | Tamasheq | taq | tmh | ||
| 162 | tat_Cyrl | 1.34e+07 | 2.97e+08 | 2.16e+09 | 6.31e+05 | Tatar | tat | tt | tt | |
| 163 | tel_Telu | 3.92e+07 | 8.35e+08 | 6.50e+09 | 2.06e+06 | Telugu | tel | te | te | |
| 164 | tgk_Cyrl | 2.48e+07 | 6.25e+08 | 4.59e+09 | 1.26e+06 | Tajik | tgk | tg | tg | |
| 165 | tgl_Latn | 5.29e+07 | 1.35e+09 | 8.13e+09 | 1.87e+06 | Tagalog | tgl | tl | tl | |
| 166 | tha_Thai | 3.39e+08 | 3.51e+09 | 6.00e+10 | 1.77e+07 | Thai | tha | th | th | |
| 167 | tir_Ethi | 1.13e+06 | 3.67e+07 | 1.82e+08 | 6.47e+04 | Tigrinya | tir | ti | ti | |
| 168 | tpi_Latn | 2.82e+05 | 1.25e+07 | 6.45e+07 | 1.40e+04 | Tok Pisin | tpi | |||
| 169 | tsn_Latn | 1.32e+05 | 5.27e+06 | 2.77e+07 | 6.05e+03 | Tswana | tsn | tn | tn | |
| 170 | tso_Latn | 2.21e+05 | 8.67e+06 | 4.93e+07 | 1.10e+04 | Tsonga | tso | ts | ts | |
| 171 | tuk_Latn | 3.36e+06 | 7.07e+07 | 5.70e+08 | 1.71e+05 | Turkmen | tuk | tk | tk | |
| 172 | tum_Latn | 9.90e+04 | 2.88e+06 | 2.11e+07 | 4.38e+03 | Tumbuka | tum | |||
| 173 | tur_Latn | 2.58e+09 | 5.17e+10 | 3.90e+11 | 1.17e+08 | Turkish | tur | tr | tr | |
| 174 | twi_Latn | 1.26e+05 | 4.70e+06 | 2.42e+07 | 5.86e+03 | Twi | twi | aka | tw | tw |
| 175 | uig_Arab | 8.98e+06 | 2.24e+08 | 1.75e+09 | 4.42e+05 | Uighur | uig | ug | ug | |
| 176 | ukr_Cyrl | 1.17e+09 | 2.52e+10 | 1.83e+11 | 4.74e+07 | Ukrainian | ukr | uk | uk | |
| 177 | umb_Latn | 5.99e+04 | 2.43e+06 | 1.54e+07 | 2.47e+03 | Umbundu | umb | |||
| 178 | urd_Arab | 5.06e+07 | 2.13e+09 | 1.00e+10 | 3.19e+06 | Urdu | urd | ur | ur | |
| 179 | uzn_Latn | 1.48e+07 | 3.51e+08 | 2.85e+09 | 7.07e+05 | Northern Uzbek | uzn | uzb | uz | |
| 180 | vec_Latn | 1.58e+06 | 3.53e+07 | 2.18e+08 | 8.48e+04 | Venetian | vec | |||
| 181 | vie_Latn | 3.02e+09 | 8.32e+10 | 3.80e+11 | 1.01e+08 | Vietnamese | vie | vi | vi | |
| 182 | war_Latn | 2.01e+05 | 5.89e+06 | 3.56e+07 | 1.39e+04 | Waray (Philippines) | war | |||
| 183 | wol_Latn | 1.62e+05 | 5.46e+06 | 2.75e+07 | 5.68e+03 | Wolof | wol | wo | wo | |
| 184 | xho_Latn | 1.82e+06 | 3.03e+07 | 2.59e+08 | 6.31e+04 | Xhosa | xho | xh | xh | |
| 185 | ydd_Hebr | 2.94e+06 | 7.75e+07 | 4.58e+08 | 1.28e+05 | Eastern Yiddish | ydd | yid | yi | |
| 186 | yor_Latn | 1.47e+06 | 4.28e+07 | 2.18e+08 | 6.61e+04 | Yoruba | yor | yo | yo | |
| 187 | yue_Hant | 1.24e+06 | 3.27e+06 | 7.43e+07 | 6.13e+04 | Yue Chinese | yue | zho | zh | |
| 188 | zho_Hans | 4.24e+10 | 7.40e+10 | 2.35e+12 | 1.25e+09 | Chinese | zho | zh | zh | |
| 189 | zho_Hant | 4.48e+09 | 9.51e+09 | 2.87e+11 | 1.57e+08 | Chinese | zho | zh | zh | |
| 190 | zsm_Latn | 5.80e+08 | 1.15e+10 | 7.84e+10 | 1.84e+07 | Standard Malay | zsm | msa | ms | |
| 191 | zul_Latn | 2.71e+06 | 4.44e+07 | 3.81e+08 | 1.14e+05 | Zulu | zul | zu | zu |
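As a small sketch of how the table can be used, the snippet below splits a subset name into its language and script parts and derives the average document length from the chars and docs columns. The two rows are copied from the table above; the function names are hypothetical, and reading the four-letter suffix as an ISO 15924 script code is an interpretation not stated on this page.

```python
# Two rows copied from the table above (chars and docs columns only).
TABLE_ROWS = {
    "eng_Latn": {"chars": 1.71e13, "docs": 4.39e9},
    "nob_Latn": {"chars": 1.33e11, "docs": 2.70e7},
}


def split_subset_name(subset: str) -> tuple:
    """Split 'nob_Latn' into the language code and the script code."""
    language, script = subset.split("_")
    return language, script


def chars_per_doc(subset: str) -> float:
    """Average number of characters per document for one subset."""
    row = TABLE_ROWS[subset]
    return row["chars"] / row["docs"]


for name in TABLE_ROWS:
    language, script = split_subset_name(name)
    print(language, script, round(chars_per_doc(name)))
```

For these two rows the averages come out at roughly 3,900 characters per document for English and roughly 4,900 for Norwegian Bokmål.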
Cite us
@inproceedings{burchell-etal-2025-expanded,
title = "An Expanded Massive Multilingual Dataset for High-Performance Language Technologies ({HPLT})",
author = {Burchell, Laurie and
de Gibert, Ona and
Arefyev, Nikolay and
Aulamo, Mikko and
Ba{\~n}{\'o}n, Marta and
Chen, Pinzhen and
Fedorova, Mariia and
Guillou, Liane and
Haddow, Barry and
Haji{\v{c}}, Jan and
Helcl, Jind{\v{r}}ich and
Henriksson, Erik and
Klimaszewski, Mateusz and
Komulainen, Ville and
Kutuzov, Andrey and
Kyt{\"o}niemi, Joona and
Laippala, Veronika and
M{\ae}hlum, Petter and
Malik, Bhavitvya and
Mehryary, Farrokh and
Mikhailov, Vladislav and
Moghe, Nikita and
Myntti, Amanda and
O{'}Brien, Dayy{\'a}n and
Oepen, Stephan and
Pal, Proyag and
Piha, Jousia and
Pyysalo, Sampo and
Ram{\'i}rez-S{\'a}nchez, Gema and
Samuel, David and
Stepachev, Pavel and
Tiedemann, J{\"o}rg and
Vari{\v{s}}, Du{\v{s}}an and
Vojt{\v{e}}chov{\'a}, Tereza and
Zaragoza-Bernabeu, Jaume},
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-long.854/",
doi = "10.18653/v1/2025.acl-long.854",
pages = "17452--17485",
ISBN = "979-8-89176-251-0",
abstract = "Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora, extending prior work of the HPLT project. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value."
}
Structured Schema (Zero-Fabrication)
| Feature Key | Data Type |
|---|---|
| f | string |
| o | int64 |
| s | int64 |
| rs | int64 |
| u | string |
| c | string |
| ts | timestamp[ms] |
| collection | string |
| lang | Sequence[string] |
| prob | Sequence[float64] |
| text | string |
| seg_langs | Sequence[string] |
| robotstxt | string |
| id | string |
| filter | string |
| pii | Sequence[Sequence] |
| doc_scores | Sequence[float64] |
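A record can be checked against the schema above with a short sketch like the following. The mapping from the listed data types to Python types (for example timestamp[ms] to datetime and Sequence[...] to list) is an assumption about how rows are decoded, schema_problems is a hypothetical helper, and element types inside sequences are not checked.

```python
from datetime import datetime

# Field names from the schema above; the Python types are an assumed decoding.
EXPECTED_TYPES = {
    "f": str, "o": int, "s": int, "rs": int, "u": str, "c": str,
    "ts": datetime, "collection": str, "lang": list, "prob": list,
    "text": str, "seg_langs": list, "robotstxt": str, "id": str,
    "filter": str, "pii": list, "doc_scores": list,
}


def schema_problems(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means the record fits."""
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in record:
            problems.append(f"missing field: {key}")
        elif not isinstance(record[key], expected):
            actual = type(record[key]).__name__
            problems.append(f"{key}: expected {expected.__name__}, got {actual}")
    return problems
```

Calling schema_problems on a row dictionary returns an empty list when every field is present with the expected top-level type, which makes it usable as a quick sanity check before further processing.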
Estimated Rows: 16
Dataset Transparency Report
Verified data manifest for traceability and transparency.
Identity & Source
- id: hf-dataset--hplt--hplt2.0_cleaned
- slug: hplt--hplt2.0_cleaned
- source: huggingface
- author: HPLT
- license: CC0-1.0
- tags
- task_categories:fill-mask, task_categories:text-generation, task_ids:language-modeling, multilinguality:multilingual, language:ace, language:af, language:als, language:am, language:ar, language:as, language:ast, language:awa, language:ayr, language:azb, language:azj, language:ba, language:bm, language:ban, language:be, language:bem, language:bn, language:bho, language:bjn, language:bo, language:bs, language:bug, language:bg, language:ca, language:ceb, language:cs, language:cjk, language:ckb, language:crh, language:cy, language:da, language:de, language:dik, language:dyu, language:dz, language:el, language:en, language:eo, language:et, language:eu, language:ee, language:fo, language:fj, language:fi, language:fon, language:fr, language:fur, language:fuv, language:gaz, language:gd, language:ga, language:gl, language:gn, language:gu, language:ht, language:ha, language:he, language:hi, language:hne, language:hr, language:hu, language:hy, language:ig, language:ilo, language:id, language:is, language:it, language:jv, language:ja, language:kab, language:kac, language:kam, language:kn, language:ks, language:ka, language:kk, language:kbp, language:kea, language:khk, language:km, language:ki, language:rw, language:ky, language:kmb, language:kmr, language:knc, language:kg, language:ko, language:lo, language:lij, language:li, language:ln, language:lt, language:lmo, language:ltg, language:lb, language:lua, language:lg, language:luo, language:lus, language:lvs, language:mag, language:mai, language:ml, language:mr, language:min, language:mk, language:mt, language:mni, language:mos, language:mi, language:my, language:nl, language:nn, language:nb, language:npi, language:nso, language:nus, language:ny, language:oc, language:ory, language:pag, language:pa, language:pap, language:pbt, language:pes, language:plt, language:pl, language:pt, language:prs, language:quy, language:ro, language:rn, language:ru, language:sg, language:sa, language:sat, language:scn, language:shn, language:si, 
language:sk, language:sl, language:sm, language:sn, language:sd, language:so, language:st, language:es, language:sc, language:sr, language:ss, language:su, language:sv, language:swh, language:szl, language:ta, language:taq, language:tt, language:te, language:tg, language:tl, language:th, language:ti, language:tpi, language:tn, language:ts, language:tk, language:tum, language:tr, language:tw, language:ug, language:uk, language:umb, language:ur, language:uzn, language:vec, language:vi, language:war, language:wo, language:xh, language:ydd, language:yo, language:yue, language:zh, language:zsm, language:zu, license:cc0-1.0, size_categories:1b
Engagement & Metrics
- downloads: 36,557
- stars: 42
- forks: 0

