FLEURS
Dataset Description
- Fine-Tuning script: pytorch/speech-recognition
- Paper: FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech
- Total amount of disk used: ca. 350 GB
Training sets have around 10 hours of supervision. Speakers of the train sets are different from speakers of the dev/test sets. Multilingual fine-tuning is used and the "unit error rate" (over characters and signs) of all languages is averaged; a sketch of this averaging follows the list below. Languages and results are also grouped into seven geographical areas:
- Western Europe: Asturian, Bosnian, Catalan, Croatian, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hungarian, Icelandic, Irish, Italian, Kabuverdianu, Luxembourgish, Maltese, Norwegian, Occitan, Portuguese, Spanish, Swedish, Welsh
- Eastern Europe: Armenian, Belarusian, Bulgarian, Czech, Estonian, Georgian, Latvian, Lithuanian, Macedonian, Polish, Romanian, Russian, Serbian, Slovak, Slovenian, Ukrainian
- Central-Asia/Middle-East/North-Africa: Arabic, Azerbaijani, Hebrew, Kazakh, Kyrgyz, Mongolian, Pashto, Persian, Sorani-Kurdish, Tajik, Turkish, Uzbek
- Sub-Saharan Africa: Afrikaans, Amharic, Fula, Ganda, Hausa, Igbo, Kamba, Lingala, Luo, Northern-Sotho, Nyanja, Oromo, Shona, Somali, Swahili, Umbundu, Wolof, Xhosa, Yoruba, Zulu
- South-Asia: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Oriya, Punjabi, Sindhi, Tamil, Telugu, Urdu
- South-East Asia: Burmese, Cebuano, Filipino, Indonesian, Javanese, Khmer, Lao, Malay, Maori, Thai, Vietnamese
- CJK languages: Cantonese and Mandarin Chinese, Japanese, Korean
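As an illustration of the averaged-metric protocol, here is a minimal sketch of a macro-averaged character error rate across languages. It assumes the jiwer package and a hypothetical `results` mapping of per-language references and hypotheses; it is not the official FLEURS evaluation code.

```python
# Minimal sketch: macro-averaged character error rate (CER) across languages.
# `results` is a hypothetical mapping from language config to
# (references, hypotheses) lists -- not the official FLEURS evaluation script.
from jiwer import cer

results = {
    "af_za": (["dit is nog nie bekend"], ["dit is nog nie beken"]),
    "hi_in": (["namaste duniya"], ["namaste dunya"]),
}

per_language_cer = {
    lang: cer(refs, hyps) for lang, (refs, hyps) in results.items()
}
average_cer = sum(per_language_cer.values()) / len(per_language_cer)
print(per_language_cer, average_cer)
```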
How to use & Supported Tasks
How to use
The datasets library allows you to load and pre-process your dataset in pure Python, at scale. The dataset can be downloaded and prepared in one call to your local drive by using the load_dataset function.
For example, to download the Hindi config, simply specify the corresponding language config name (i.e., "hi_in" for Hindi):
from datasets import load_dataset
fleurs = load_dataset("google/fleurs", "hi_in", split="train")
Using the datasets library, you can also stream the dataset on-the-fly by adding a streaming=True argument to the load_dataset function call. Loading a dataset in streaming mode loads individual samples of the dataset at a time, rather than downloading the entire dataset to disk.
from datasets import load_dataset
fleurs = load_dataset("google/fleurs", "hi_in", split="train", streaming=True)
print(next(iter(fleurs)))
Bonus: create a PyTorch dataloader directly with your own datasets (local/streamed).
Local:
from datasets import load_dataset
from torch.utils.data import DataLoader
from torch.utils.data.sampler import BatchSampler, RandomSampler
fleurs = load_dataset("google/fleurs", "hi_in", split="train")
batch_sampler = BatchSampler(RandomSampler(fleurs), batch_size=32, drop_last=False)
dataloader = DataLoader(fleurs, batch_sampler=batch_sampler)
Streaming:
from datasets import load_dataset
from torch.utils.data import DataLoader
fleurs = load_dataset("google/fleurs", "hi_in", split="train", streaming=True)
dataloader = DataLoader(fleurs, batch_size=32)
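Note that FLEURS audio arrays vary in length, so batching raw examples usually requires a custom collate function; here is a minimal sketch that zero-pads to the longest waveform in the batch (`pad_collate` is a hypothetical helper, not part of the datasets library):

```python
# Minimal sketch: pad variable-length audio to the longest sample in a batch.
# `pad_collate` is a hypothetical helper, not part of the datasets library.
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader
from datasets import load_dataset

def pad_collate(batch):
    waveforms = [torch.tensor(ex["audio"]["array"]) for ex in batch]
    transcriptions = [ex["transcription"] for ex in batch]
    # zero-pad every waveform to the longest one in the batch
    return pad_sequence(waveforms, batch_first=True), transcriptions

fleurs = load_dataset("google/fleurs", "hi_in", split="train", streaming=True)
dataloader = DataLoader(fleurs, batch_size=32, collate_fn=pad_collate)
```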
To find out more about loading and preparing audio datasets, head over to hf.co/blog/audio-datasets.
Example scripts
Train your own CTC or Seq2Seq Automatic Speech Recognition models on FLEURS with transformers - here.
Fine-tune your own Language Identification models on FLEURS with transformers - here.
1. Speech Recognition (ASR)
from datasets import load_dataset

fleurs_asr = load_dataset("google/fleurs", "af_za")  # for Afrikaans
# to download all data for multi-lingual fine-tuning uncomment following line
# fleurs_asr = load_dataset("google/fleurs", "all")

# see structure
print(fleurs_asr)

# load audio sample on the fly
audio_input = fleurs_asr["train"][0]["audio"]  # first decoded audio sample
transcription = fleurs_asr["train"][0]["transcription"]  # first transcription

# use audio_input and transcription to fine-tune your model for ASR

# for analyses see language groups
all_language_groups = fleurs_asr["train"].features["lang_group_id"].names
lang_group_id = fleurs_asr["train"][0]["lang_group_id"]
all_language_groups[lang_group_id]
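As a hedged illustration of the fine-tuning step, the decoded audio and transcription can be turned into model inputs with a processor. This is a minimal sketch reusing the names from the snippet above; the checkpoint is an example choice, not mandated by FLEURS:

```python
# Minimal sketch: prepare one FLEURS example as ASR model inputs.
# The checkpoint name is an example; in practice you would build a
# tokenizer/vocabulary for the target language before fine-tuning.
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

inputs = processor(
    audio_input["array"],
    sampling_rate=audio_input["sampling_rate"],  # FLEURS audio is 16 kHz
    return_tensors="pt",
)
# the example checkpoint's vocabulary is uppercase English characters
labels = processor.tokenizer(transcription.upper(), return_tensors="pt").input_ids
```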
2. Language Identification
LangID can often be a domain classification, but in the case of FLEURS-LangID, recordings are done in a similar setting across languages and the utterances correspond to n-way parallel sentences in the exact same domain, which makes this task particularly relevant for evaluating LangID. The setting is simple: FLEURS-LangID is split into train/valid/test for each language, and we create a single train/valid/test for LangID by merging them all.
from datasets import load_dataset

fleurs_langID = load_dataset("google/fleurs", "all")  # to download all data

# see structure
print(fleurs_langID)

# load audio sample on the fly
audio_input = fleurs_langID["train"][0]["audio"]  # first decoded audio sample
language_class = fleurs_langID["train"][0]["lang_id"]  # first id class
language = fleurs_langID["train"].features["lang_id"].names[language_class]

# use audio_input and language_class to fine-tune your model for audio classification
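To sanity-check the merged LangID split, one might inspect the label distribution; a minimal sketch reusing the `fleurs_langID` object from the snippet above:

```python
# Minimal sketch: count examples per language in the merged LangID train split.
from collections import Counter

label_names = fleurs_langID["train"].features["lang_id"].names
counts = Counter(fleurs_langID["train"]["lang_id"])
for lang_id, n in counts.most_common(5):
    print(label_names[lang_id], n)
```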
3. Retrieval
Retrieval provides n-way parallel speech and text data. Similar to how XTREME for text leverages Tatoeba to evaluate bitext mining, a.k.a. sentence translation retrieval, we use Retrieval to evaluate the quality of fixed-size representations of speech utterances. Our goal is to incentivize the creation of fixed-size speech encoders for speech retrieval. The system has to retrieve the English "key" utterance corresponding to the speech translation of "queries" in 15 languages. Results have to be reported on the test sets of Retrieval, whose utterances are used as queries (and as keys for English). We augment the English keys with a large number of utterances to make the task more difficult.
from datasets import load_dataset

fleurs_retrieval = load_dataset("google/fleurs", "af_za")  # for Afrikaans
# to download all data for multi-lingual fine-tuning uncomment following line
# fleurs_retrieval = load_dataset("google/fleurs", "all")

# see structure
print(fleurs_retrieval)

# load audio sample on the fly
audio_input = fleurs_retrieval["train"][0]["audio"]  # decoded audio sample
text_sample_pos = fleurs_retrieval["train"][0]["transcription"]  # positive text sample
text_sample_neg = fleurs_retrieval["train"][1:20]["transcription"]  # negative text samples

# use audio_input, text_sample_pos, and text_sample_neg to fine-tune your model for retrieval
Users can leverage the training (and dev) sets of FLEURS-Retrieval with a ranking loss to build better cross-lingual fixed-size representations of speech.
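Here is a minimal sketch of such a ranking loss over fixed-size embeddings. The embeddings are random stand-ins; in practice the anchor would come from a speech encoder applied to `audio_input` and the positive/negative from a text encoder applied to `text_sample_pos` / `text_sample_neg`:

```python
# Minimal sketch: triplet-style ranking loss over fixed-size embeddings.
# The random tensors are stand-ins for hypothetical speech/text encoder
# outputs; they are not produced by the dataset itself.
import torch

embed_dim = 256
anchor = torch.randn(1, embed_dim, requires_grad=True)    # speech embedding
positive = torch.randn(1, embed_dim)                      # matching text
negative = torch.randn(1, embed_dim)                      # non-matching text

triplet_loss = torch.nn.TripletMarginLoss(margin=1.0)
loss = triplet_loss(anchor, positive, negative)
loss.backward()  # gradients flow back to the (hypothetical) speech encoder
```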
Dataset Structure
We show detailed information for the example configuration af_za of the dataset.
All other configurations have the same structure.
Data Instances
af_za
- Size of downloaded dataset files: 1.47 GB
- Size of the generated dataset: 1 MB
- Total amount of disk used: 1.47 GB
An example of a data instance from af_za looks as follows:
{'id': 91,
'num_samples': 385920,
'path': '/home/patrick/.cache/huggingface/datasets/downloads/extracted/310a663d52322700b3d3473cbc5af429bd92a23f9bc683594e70bc31232db39e/home/vaxelrod/FLEURS/oss2_obfuscated/af_za/audio/train/17797742076841560615.wav',
'audio': {'path': '/home/patrick/.cache/huggingface/datasets/downloads/extracted/310a663d52322700b3d3473cbc5af429bd92a23f9bc683594e70bc31232db39e/home/vaxelrod/FLEURS/oss2_obfuscated/af_za/audio/train/17797742076841560615.wav',
'array': array([ 0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ...,
-1.1205673e-04, -8.4638596e-05, -1.2731552e-04], dtype=float32),
'sampling_rate': 16000},
'raw_transcription': 'Dit is nog nie huidiglik bekend watter aantygings gemaak sal word of wat owerhede na die seun gelei het nie maar jeugmisdaad-verrigtinge het in die federale hof begin',
'transcription': 'dit is nog nie huidiglik bekend watter aantygings gemaak sal word of wat owerhede na die seun gelei het nie maar jeugmisdaad-verrigtinge het in die federale hof begin',
'gender': 0,
'lang_id': 0,
'language': 'Afrikaans',
'lang_group_id': 3}
Data Fields
The data fields are the same among all splits.
- id (int): ID of audio sample
- num_samples (int): Number of float values
- path (str): Path to the audio file
- audio (dict): Audio object including the decoded audio array, sampling rate and path to the audio file (see the resampling sketch after this list)
- raw_transcription (str): The non-normalized transcription of the audio file
- transcription (str): Transcription of the audio file
- gender (int): Class id of gender
- lang_id (int): Class id of language
- lang_group_id (int): Class id of language group
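For illustration, the audio column can be resampled on the fly by re-casting it with the datasets Audio feature; a minimal sketch (the 8 kHz target is only an example, FLEURS ships at 16 kHz):

```python
# Minimal sketch: re-cast the audio column to a different sampling rate.
# FLEURS ships at 16 kHz; the 8 kHz target below is only an example.
from datasets import load_dataset, Audio

fleurs = load_dataset("google/fleurs", "af_za", split="train")
fleurs = fleurs.cast_column("audio", Audio(sampling_rate=8_000))
sample = fleurs[0]["audio"]
print(sample["sampling_rate"], sample["array"].shape)
```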
Data Splits
Every config only has the "train" split containing ca. 1000 examples, and "validation" and "test" splits each containing ca. 400 examples.
Dataset Creation
We collect between one and three recordings for each sentence (2.3 on average), and build new train-dev-test splits with 1509, 150 and 350 sentences for train, dev and test respectively.
Considerations for Using the Data
Social Impact of Dataset
This dataset is meant to encourage the development of speech technology in many more languages of the world. One of the goals is to give everyone equal access to technologies like speech recognition or speech translation, meaning better dubbing and better access to content from the internet (like podcasts, streaming or videos).
Discussion of Biases
Like most speech datasets, the newly introduced FLEURS dataset has a fair distribution of gender across utterances. While many languages from various regions of the world are covered, the benchmark misses many languages that are all equally important. We believe technology built through FLEURS should generalize to all languages.
Other Known Limitations
The dataset has a particular focus on read speech because common evaluation benchmarks like CoVoST-2 or LibriSpeech evaluate on this type of speech. There is sometimes a known mismatch between performance obtained in a read-speech setting and in a noisier setting (in production, for instance). Given the large progress that remains to be made on many languages, we believe better performance on FLEURS should still correlate well with actual progress made on speech understanding.
Additional Information
All datasets are licensed under the Creative Commons license (CC-BY).
Citation Information
You can access the FLEURS paper at https://arxiv.org/abs/2205.12446. Please cite the paper when referencing the FLEURS corpus as:
@article{fleurs2022arxiv,
  title = {FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech},
  author = {Conneau, Alexis and Ma, Min and Khanuja, Simran and Zhang, Yu and Axelrod, Vera and Dalmia, Siddharth and Riesa, Jason and Rivera, Clara and Bapna, Ankur},
  journal = {arXiv preprint arXiv:2205.12446},
  url = {https://arxiv.org/abs/2205.12446},
  year = {2022},
}
Contributions
Thanks to @patrickvonplaten and @aconneau for adding this dataset.