# C4
## Dataset Description
- Paper: https://arxiv.org/abs/1910.10683
### Dataset Summary
A colossal, cleaned version of Common Crawl's web crawl corpus, based on the Common Crawl dataset (https://commoncrawl.org).
This is the processed version of Google's C4 dataset.
We prepared five variants of the data: en, en.noclean, en.noblocklist, realnewslike, and multilingual (mC4).
For reference, these are the sizes of the variants:
- `en`: 305GB
- `en.noclean`: 2.3TB
- `en.noblocklist`: 380GB
- `realnewslike`: 15GB
- `multilingual` (mC4): 9.7TB (108 subsets, one per language)
The `en.noblocklist` variant is exactly the same as the `en` variant, except we turned off the so-called "badwords filter", which removes all documents that contain words from the lists at https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words.

#### How do I download this?
##### Using 🤗 Datasets
```python
from datasets import load_dataset

# English only
en = load_dataset("allenai/c4", "en")

# Other variants in english
en_noclean = load_dataset("allenai/c4", "en.noclean")
en_noblocklist = load_dataset("allenai/c4", "en.noblocklist")
realnewslike = load_dataset("allenai/c4", "realnewslike")

# Multilingual (108 languages)
multilingual = load_dataset("allenai/c4", "multilingual")

# One specific language
es = load_dataset("allenai/c4", "es")
```
Since this dataset is big, we recommend loading it in streaming mode using `streaming=True`, for example:

```python
en = load_dataset("allenai/c4", "en", streaming=True)
```
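In streaming mode `load_dataset` returns an iterable dataset, so you read examples lazily instead of indexing into them. A minimal sketch (the `split="train"` argument and the three-example limit are illustrative choices, not requirements):

```python
from itertools import islice

from datasets import load_dataset

# Stream the English training split instead of downloading all 305GB.
en_stream = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Peek at the first few documents; each example is a dict with
# "url", "text" and "timestamp" keys.
for example in islice(en_stream, 3):
    print(example["url"], example["text"][:80])
```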
You can also load and mix multiple languages:
```python
from datasets import concatenate_datasets, interleave_datasets, load_dataset

# Load a single split from each language so the two streams can be combined.
es = load_dataset("allenai/c4", "es", split="train", streaming=True)
fr = load_dataset("allenai/c4", "fr", split="train", streaming=True)

# Concatenate both datasets
concatenated = concatenate_datasets([es, fr])

# Or interleave them (alternates between one and the other)
interleaved = interleave_datasets([es, fr])
```
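If you want the mixed stream to favor one language over the other rather than strictly alternating, `interleave_datasets` also accepts sampling probabilities. A short sketch using the same `es` and `fr` streams as above (the 70/30 mix and the seed are arbitrary illustrative values):

```python
# Draw from the Spanish stream ~70% of the time and the French stream ~30%,
# instead of alternating one-for-one.
mixed = interleave_datasets([es, fr], probabilities=[0.7, 0.3], seed=42)
```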
##### Using Dask
```python
import dask.dataframe as dd

df = dd.read_json("hf://datasets/allenai/c4/en/c4-train.*.json.gz")

# English only
en_df = dd.read_json("hf://datasets/allenai/c4/en/c4-*.json.gz")

# Other variants in english
en_noclean_df = dd.read_json("hf://datasets/allenai/c4/en.noclean/c4-*.json.gz")
en_noblocklist_df = dd.read_json("hf://datasets/allenai/c4/en.noblocklist/c4-*.json.gz")
realnewslike_df = dd.read_json("hf://datasets/allenai/c4/realnewslike/c4-*.json.gz")

# Multilingual (108 languages)
multilingual_df = dd.read_json("hf://datasets/allenai/c4/multilingual/c4-*.json.gz")

# One specific language
es_train_df = dd.read_json("hf://datasets/allenai/c4/multilingual/c4-es.*.json.gz")
es_valid_df = dd.read_json("hf://datasets/allenai/c4/multilingual/c4-es-validation.*.json.gz")
```
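Note that `dd.read_json` only builds a lazy task graph; no files are downloaded until you materialize something. A small sketch of inspecting one of the frames above (assumes `huggingface_hub` is installed so Dask can resolve the `hf://` paths; the five-row peek is just an example):

```python
# How many file partitions the lazy frame covers (no data transfer yet).
print(realnewslike_df.npartitions)

# Materialize a handful of rows to check the columns (text, timestamp, url);
# this downloads only what is needed for the first partition.
print(realnewslike_df.head(5))
```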
##### Using Git
```bash
git clone https://huggingface.co/datasets/allenai/c4
```
This will download 13TB to your local drive. If you want to be more precise with what you are downloading, follow these commands instead:
```bash
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
git lfs pull --include "en/*"
```
With `GIT_LFS_SKIP_SMUDGE=1`, the `git clone` command downloads only the small stub files that Git LFS uses, so you can see every filename in the repository without fetching the data. You can then convert the stubs into their real files with `git lfs pull --include "..."`. For example, if you wanted all the Dutch documents from the multilingual set, you would run:
```bash
git lfs pull --include "multilingual/c4-nl.*.json.gz"
```
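Once the real files are on disk, you can load them with the generic JSON loader in 🤗 Datasets. A sketch, assuming you pulled the Dutch shards as above and are running from inside the cloned `c4` directory (the glob pattern mirrors the one used in the `git lfs pull` command):

```python
from datasets import load_dataset

# Build a dataset from the locally downloaded Dutch training shards.
nl = load_dataset(
    "json",
    data_files={"train": "multilingual/c4-nl.*.json.gz"},
    split="train",
)
print(nl[0]["url"])
```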
### Supported Tasks and Leaderboards
C4 and mC4 are mainly intended to pretrain language models and word representations.
### Languages
The en, en.noclean, en.noblocklist and realnewslike variants are in English.
The other 108 languages are available through the `multilingual` (mC4) variant and are reported in the table below.
Note that the languages that end with "-Latn" are simply romanized variants, i.e. written using the Latin script.
| language code | language name |
|:--------------|:--------------|
| af | Afrikaans |
| am | Amharic |
| ar | Arabic |
| az | Azerbaijani |
| be | Belarusian |
| bg | Bulgarian |
| bg-Latn | Bulgarian (Latin) |
| bn | Bangla |
| ca | Catalan |
| ceb | Cebuano |
| co | Corsican |
| cs | Czech |
| cy | Welsh |
| da | Danish |
| de | German |
| el | Greek |
| el-Latn | Greek (Latin) |
| en | English |
| eo | Esperanto |
| es | Spanish |
| et | Estonian |
| eu | Basque |
| fa | Persian |
| fi | Finnish |
| fil | Filipino |
| fr | French |
| fy | Western Frisian |
| ga | Irish |
| gd | Scottish Gaelic |
| gl | Galician |
| gu | Gujarati |
| ha | Hausa |
| haw | Hawaiian |
| hi | Hindi |
| hi-Latn | Hindi (Latin script) |
| hmn | Hmong, Mong |
| ht | Haitian |
| hu | Hungarian |
| hy | Armenian |
| id | Indonesian |
| ig | Igbo |
| is | Icelandic |
| it | Italian |
| iw | former Hebrew |
| ja | Japanese |
| ja-Latn | Japanese (Latin) |
| jv | Javanese |
| ka | Georgian |
| kk | Kazakh |
| km | Khmer |
| kn | Kannada |
| ko | Korean |
| ku | Kurdish |
| ky | Kyrgyz |
| la | Latin |
| lb | Luxembourgish |
| lo | Lao |
| lt | Lithuanian |
| lv | Latvian |
| mg | Malagasy |
| mi | Maori |
| mk | Macedonian |
| ml | Malayalam |
| mn | Mongolian |
| mr | Marathi |
| ms | Malay |
| mt | Maltese |
| my | Burmese |
| ne | Nepali |
| nl | Dutch |
| no | Norwegian |
| ny | Nyanja |
| pa | Punjabi |
| pl | Polish |
| ps | Pashto |
| pt | Portuguese |
| ro | Romanian |
| ru | Russian |
| ru-Latn | Russian (Latin) |
| sd | Sindhi |
| si | Sinhala |
| sk | Slovak |
| sl | Slovenian |
| sm | Samoan |
| sn | Shona |
| so | Somali |
| sq | Albanian |
| sr | Serbian |
| st | Southern Sotho |
| su | Sundanese |
| sv | Swedish |
| sw | Swahili |
| ta | Tamil |
| te | Telugu |
| tg | Tajik |
| th | Thai |
| tr | Turkish |
| uk | Ukrainian |
| und | Unknown language |
| ur | Urdu |
| uz | Uzbek |
| vi | Vietnamese |
| xh | Xhosa |
| yi | Yiddish |
| yo | Yoruba |
| zh | Chinese |
| zh-Latn | Chinese (Latin) |
| zu | Zulu |
## Dataset Structure
### Data Instances
An example from the `en` config is:

```
{
  'url': 'https://klyq.com/beginners-bbq-class-taking-place-in-missoula/',
  'text': 'Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.\nHe will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.\nThe cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.',
  'timestamp': '2019-04-25T12:57:54Z'
}
```
### Data Fields
The data have several fields:

- `url`: URL of the source as a string
- `text`: text content as a string
- `timestamp`: timestamp as a string
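If you want to check this schema programmatically without downloading any data, `datasets.load_dataset_builder` reads the Hub metadata. A small sketch, assuming the metadata for the chosen config declares the features; if it does not, `info.features` may be `None`:

```python
from datasets import load_dataset_builder

# Read only the dataset metadata for the "en" config; no data files are fetched.
builder = load_dataset_builder("allenai/c4", "en")

# Should show string-typed "text", "timestamp" and "url" columns
# if the Hub metadata declares them.
print(builder.info.features)
```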
### Data Splits
Sizes for the variants in English:
| name | train | validation |
|----------------|----------:|-----------:|
| en | 364868892 | 364608 |
| en.noblocklist | 393391519 | 393226 |
| en.noclean | ? | ? |
| realnewslike | 13799838 | 13863 |
Train and validation splits are also provided for the other languages, but their sizes are still to be added.
### Source Data
#### Initial Data Collection and Normalization
The C4 and mC4 datasets are collections of text sourced from the public Common Crawl web scrape, built with heuristics that extract only natural language (as opposed to boilerplate and other gibberish) in addition to extensive deduplication. You can find the code that has been used to build this dataset in c4.py in TensorFlow Datasets.
The C4 dataset was explicitly designed to be English only: any page that was not given a probability of at least 99% of being English by langdetect was discarded.
To build mC4, the authors used CLD3 to identify over 100 languages.
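To make the filtering criterion concrete, here is an illustrative sketch of a langdetect-style 99% English threshold as described above; this is not the actual C4 build code, just a toy restatement of the stated rule:

```python
from langdetect import detect_langs  # pip install langdetect

def is_english_page(text: str, threshold: float = 0.99) -> bool:
    """Keep a page only if langdetect assigns English a probability >= threshold."""
    try:
        candidates = detect_langs(text)
    except Exception:
        # langdetect raises on empty or undetectable input; drop such pages.
        return False
    return any(lang.lang == "en" and lang.prob >= threshold for lang in candidates)

print(is_english_page("Beginners BBQ Class Taking Place in Missoula!"))
```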
### Licensing Information
We are releasing this dataset under the terms of ODC-BY. By using this, you are also bound by the Common Crawl terms of use in respect of the content contained in the dataset.
### Acknowledgements
Big ups to the good folks at Common Crawl whose data made this possible (consider donating!), to Google for creating the code that curates and filters the data, and to Huggingface, who had no issue with hosting these 3TB of data for public download!