c4

by allenai · Dataset

"--- pretty_name: C4 annotations_creators: - no-annotation language_creators: - found language: - af - am - ar - az - be - bg - bn - ca - ceb - co - cs - cy - da - de - el - en - eo - es - et - eu - fa - fi - fil - fr - fy - ga - gd - gl - gu - ha - haw - he - hi - hmn - ht - hu - hy - id - ig - is -..."


🛠️ Technical Profile

⚡ Hardware & Scale

  • Size: -
  • Total Rows: -
  • Files: 69221

🧠 Training & Env

  • Format: Parquet
  • Cleaning: Raw

🌐 Cloud & Rights

  • Source: huggingface
  • License: odc-by



Dataset Card

# C4

## Dataset Description

  • Paper: https://arxiv.org/abs/1910.10683

### Dataset Summary

A colossal, cleaned version of Common Crawl's web crawl corpus, based on the Common Crawl dataset: https://commoncrawl.org.

This is the processed version of Google's C4 dataset.

We prepared five variants of the data: en, en.noclean, en.noblocklist, realnewslike, and multilingual (mC4).

For reference, these are the sizes of the variants:

  • en: 305GB
  • en.noclean: 2.3TB
  • en.noblocklist: 380GB
  • realnewslike: 15GB
  • multilingual (mC4): 9.7TB (108 subsets, one per language)

The en.noblocklist variant is exactly the same as the en variant, except we turned off the so-called "badwords filter", which removes all documents that contain words from the lists at https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words.
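
The variant names double as 🤗 Datasets configuration names, so they can also be listed programmatically (a small sketch, not part of the original card):

```python
from datasets import get_dataset_config_names

# Lists every configuration exposed by the repository: the English variants,
# realnewslike, "multilingual", and the per-language mC4 subsets
configs = get_dataset_config_names("allenai/c4")
print(len(configs), configs[:5])
```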

#### How do I download this?

##### Using 🤗 Datasets

```python
from datasets import load_dataset

# English only
en = load_dataset("allenai/c4", "en")

# Other variants in english
en_noclean = load_dataset("allenai/c4", "en.noclean")
en_noblocklist = load_dataset("allenai/c4", "en.noblocklist")
realnewslike = load_dataset("allenai/c4", "realnewslike")

# Multilingual (108 languages)
multilingual = load_dataset("allenai/c4", "multilingual")

# One specific language
es = load_dataset("allenai/c4", "es")
```

Since this dataset is big, we encourage you to load it in streaming mode using `streaming=True`, for example:

```python
en = load_dataset("allenai/c4", "en", streaming=True)
```
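
As a quick sanity check, here is a minimal sketch for pulling a single record from the streamed split (assuming the standard `train` split and the C4 record fields `text`, `timestamp`, and `url`):

```python
from datasets import load_dataset

# Stream the English variant and inspect one record without downloading the corpus
en = load_dataset("allenai/c4", "en", streaming=True)
first = next(iter(en["train"]))
print(first["url"])
print(first["text"][:200])
```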

You can also load and mix multiple languages:

```python
from datasets import concatenate_datasets, interleave_datasets, load_dataset

# Load the "train" split of each language as a streaming IterableDataset
es = load_dataset("allenai/c4", "es", split="train", streaming=True)
fr = load_dataset("allenai/c4", "fr", split="train", streaming=True)

# Concatenate both datasets
concatenated = concatenate_datasets([es, fr])

# Or interleave them (alternates between one and the other)
interleaved = interleave_datasets([es, fr])
```
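
Continuing from the snippet above, a short sketch for sampling a few records from the interleaved stream (the `url` field is assumed from the C4 record schema):

```python
from itertools import islice

# "interleaved" comes from the snippet above; only a handful of records is streamed
for example in islice(interleaved, 4):
    print(example["url"])
```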

##### Using Dask

```python
import dask.dataframe as dd

df = dd.read_json("hf://datasets/allenai/c4/en/c4-train.*.json.gz")

# English only
en_df = dd.read_json("hf://datasets/allenai/c4/en/c4-*.json.gz")

# Other variants in english
en_noclean_df = dd.read_json("hf://datasets/allenai/c4/en/noclean/c4-*.json.gz")
en_noblocklist_df = dd.read_json("hf://datasets/allenai/c4/en.noblocklist/c4-*.json.gz")
realnewslike_df = dd.read_json("hf://datasets/allenai/c4/realnewslike/c4-*.json.gz")

# Multilingual (108 languages)
multilingual_df = dd.read_json("hf://datasets/allenai/c4/multilingual/c4-*.json.gz")

# One specific language
es_train_df = dd.read_json("hf://datasets/allenai/c4/multilingual/c4-es.*.json.gz")
es_valid_df = dd.read_json("hf://datasets/allenai/c4/multilingual/c4-es-validation.*.json.gz")
```
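
Dask builds its task graph lazily, so a small sketch like the following (an illustration, not from the original card; the column names are an assumption based on the C4 JSON records) lets you inspect the data without pulling every shard:

```python
import dask.dataframe as dd

# English variant as above; dask reads only a sample to infer the schema
en_df = dd.read_json("hf://datasets/allenai/c4/en/c4-*.json.gz")

print(en_df.columns)  # assumed columns: "text", "timestamp", "url"
print(en_df.head(2))  # materializes just the first partition
```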

##### Using Git

```bash
git clone https://huggingface.co/datasets/allenai/c4
```

This will download 13TB to your local drive. If you want to be more selective about what you download, you can fetch only the variants or files you need.
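
One way to do that from Python is `huggingface_hub.snapshot_download` with a file pattern (a sketch, not from the original card; the pattern shown is illustrative):

```python
from huggingface_hub import snapshot_download

# Fetch only files matching a pattern, e.g. the realnewslike variant (illustrative pattern)
local_dir = snapshot_download(
    repo_id="allenai/c4",
    repo_type="dataset",
    allow_patterns=["realnewslike/*"],
)
print(local_dir)
```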
