⚠️

This is a Dataset, not a Model

The following metrics do not apply: FNI Score, Deployment Options, Model Architecture

πŸ“Š

fleurs

FNI 22.7
by google Dataset

"--- annotations_creators: - expert-generated - crowdsourced - machine-generated language_creators: - crowdsourced - expert-generated language: - afr - amh - ara - asm - ast - azj - bel - ben - bos - cat - ceb - cmn - ces - cym - dan - deu - ell - eng - spa - est - fas - ful - fin - tgl - fra - gle -..."

Best Scenarios

✨ Data Science

Technical Constraints

Generic Use
- Size
- Rows
Parquet Format
364 Likes

Capabilities

  • βœ… Data Science

πŸ”¬Deep Dive

Expand Details [+]

πŸ› οΈ Technical Profile

⚑ Hardware & Scale

Size
-
Total Rows
-
Files
616

🧠 Training & Env

Format
Parquet
Cleaning
Raw

🌐 Cloud & Rights

Source
huggingface
License
["cc-by-4.0"]

πŸ‘οΈ Data Preview

feature label split
example_text_1 0 train
example_text_2 1 train
example_text_3 0 test
example_text_4 1 validation
example_text_5 0 train
Showing 5 sample rows. Real-time preview requires login.

🧬 Schema & Configs

Fields

feature: string
label: int64
split: string

Dataset Card

FLEURS

Dataset Description

Universal Representations of Speech
  • Total amount of disk used: ca. 350 GB
Fleurs is the speech version of the FLoRes machine translation benchmark. We use 2009 n-way parallel sentences from the FLoRes dev and devtest publicly available sets, in 102 languages.

Training sets have around 10 hours of supervision. Speakers of the train sets are different than speakers from the dev/test sets. Multilingual fine-tuning is used and ”unit error rate” (characters, signs) of all languages is averaged. Languages and results are also grouped into seven geographical areas:

  • Western Europe: Asturian, Bosnian, Catalan, Croatian, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hungarian, Icelandic, Irish, Italian, Kabuverdianu, Luxembourgish, Maltese, Norwegian, Occitan, Portuguese, Spanish, Swedish, Welsh
  • Eastern Europe: Armenian, Belarusian, Bulgarian, Czech, Estonian, Georgian, Latvian, Lithuanian, Macedonian, Polish, Romanian, Russian, Serbian, Slovak, Slovenian, Ukrainian
  • Central-Asia/Middle-East/North-Africa: Arabic, Azerbaijani, Hebrew, Kazakh, Kyrgyz, Mongolian, Pashto, Persian, Sorani-Kurdish, Tajik, Turkish, Uzbek
  • Sub-Saharan Africa: Afrikaans, Amharic, Fula, Ganda, Hausa, Igbo, Kamba, Lingala, Luo, Northern-Sotho, Nyanja, Oromo, Shona, Somali, Swahili, Umbundu, Wolof, Xhosa, Yoruba, Zulu
  • South-Asia: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Oriya, Punjabi, Sindhi, Tamil, Telugu, Urdu
  • South-East Asia: Burmese, Cebuano, Filipino, Indonesian, Javanese, Khmer, Lao, Malay, Maori, Thai, Vietnamese
  • CJK languages: Cantonese and Mandarin Chinese, Japanese, Korean

How to use & Supported Tasks

How to use

The datasets library allows you to load and pre-process your dataset in pure Python, at scale. The dataset can be downloaded and prepared in one call to your local drive by using the load_dataset function.

For example, to download the Hindi config, simply specify the corresponding language config name (i.e., "hi_in" for Hindi):

python
from datasets import load_dataset
fleurs = load_dataset("google/fleurs", "hi_in", split="train")

Using the datasets library, you can also stream the dataset on-the-fly by adding a streaming=True argument to the load_dataset function call. Loading a dataset in streaming mode loads individual samples of the dataset at a time, rather than downloading the entire dataset to disk.

python
from datasets import load_dataset
fleurs = load_dataset("google/fleurs", "hi_in", split="train", streaming=True)
print(next(iter(fleurs)))

Bonus: create a [PyTorch dataloader](https://huggingface.co/docs

12,021 characters total