wikitext
"--- annotations_creators: - no-annotation language_creators: - crowdsourced language: - en license: - cc-by-sa-3.0 - gfdl multilinguality: - monolingual size_categories: - 1M"
Dataset Card for "wikitext"
Table of Contents
- Dataset Summary
- Supported Tasks and Leaderboards
- Languages
- Data Instances
- Data Fields
- Data Splits
- Curation Rationale
- Source Data
- Annotations
- Personal and Sensitive Information
- Social Impact of Dataset
- Discussion of Biases
- Other Known Limitations
- Dataset Curators
- Licensing Information
- Citation Information
- Contributions
Dataset Description
- Homepage: https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/
- Repository: More Information Needed
- Paper: Pointer Sentinel Mixture Models
- Point of Contact: Stephen Merity
- Size of downloaded dataset files: 391.41 MB
- Size of the generated dataset: 1.12 GB
- Total amount of disk used: 1.52 GB
Dataset Summary
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.
Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation, and numbers, all of which are removed in PTB. Because it is composed of full articles, the dataset is well suited for models that can take advantage of long-term dependencies.
Each subset comes in two different variants:
- Raw (for character-level work) contains the raw tokens, before the addition of the <unk> (unknown) tokens.
- Non-raw (for word-level work) contains only the tokens in their vocabulary (wiki.train.tokens, wiki.valid.tokens, and wiki.test.tokens).
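The relationship between the two variants can be illustrated with a minimal sketch: build a vocabulary from the training lines, then map every token outside it to `<unk>`. The function names (`build_vocab`, `apply_vocab`) and the `min_count=3` default are assumptions made for illustration; this card does not state the actual threshold or the script used to produce the released files.

```python
from collections import Counter
from typing import Iterable, List, Set

UNK = "<unk>"


def build_vocab(train_lines: Iterable[str], min_count: int = 3) -> Set[str]:
    """Collect whitespace-delimited tokens seen at least `min_count` times.

    The threshold is an assumption for this sketch, not a value from the card.
    """
    counts = Counter(tok for line in train_lines for tok in line.split())
    return {tok for tok, n in counts.items() if n >= min_count}


def apply_vocab(lines: Iterable[str], vocab: Set[str]) -> List[str]:
    """Replace every out-of-vocabulary token with the <unk> marker."""
    return [
        " ".join(tok if tok in vocab else UNK for tok in line.split())
        for line in lines
    ]


# Toy "raw" lines standing in for a training split (hypothetical data).
raw_train = ["the cat sat", "the cat ran", "the dog sat"]
vocab = build_vocab(raw_train, min_count=2)
non_raw_train = apply_vocab(raw_train, vocab)
print(non_raw_train)
```

Building the vocabulary from the training lines only and then applying it to every split keeps validation and test tokens from leaking into the vocabulary.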
Supported Tasks and Leaderboards
Languages
More Information Needed
Dataset Structure
Data Instances
#### wikitext-103-raw-v1
- Size of downloaded dataset files: 191.98 MB
- Size of the generated dataset: 549.42 MB
- Total amount of disk used: 741.41 MB
This example was too long and was cropped:
{
"text": "\" The gold dollar or gold one @-@ dollar piece was a coin struck as a regular issue by the United States Bureau of the Mint from..."
}
#### wikitext-103-v1
- Size of downloaded dataset files: 190.23 MB
- Size of the generated dataset: 548.05 MB
- Total amount of disk used: 738.27 MB
This example was too long and was cropped:
{
"text": "\" Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to..."
}
#### wikitext-2-raw-v1
- Size of downloaded dataset files: 4.72 MB
- Size of the generated dataset: 13.54 MB
- Total amount of disk used: 18.26 MB
This example was too long and was cropped:
{
"text": "\" The Sinclair Scientific Programmable was introduced in 1975 , with the same case as the Sinclair Oxford . It was larger than t..."
}
#### wikitext-2-v1
- Size of downloaded dataset files: 4.48 MB
- Size of the generated dataset: 13.34 MB
- Total amount of disk used: 17.82 MB
This example was too long and was cropped:
{
"text": "\" Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to..."
}
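The instances above are space-tokenized, and the wikitext-103-raw-v1 example contains the escape token `@-@` ("one @-@ dollar"). The sketch below shows one way to undo such escapes for display. Only `@-@` appears in this card; the `@,@` and `@.@` entries and the helper name `undo_wikitext_escapes` are assumptions added for illustration.

```python
# Escape tokens mapped back to the character they stand for.
# Only " @-@ " is attested in the examples above; the other two are assumed.
ESCAPES = {
    " @-@ ": "-",
    " @,@ ": ",",
    " @.@ ": ".",
}


def undo_wikitext_escapes(text: str) -> str:
    """Collapse escaped intra-word punctuation such as 'one @-@ dollar'."""
    for token, char in ESCAPES.items():
        text = text.replace(token, char)
    return text


sample = "The gold dollar or gold one @-@ dollar piece was a coin"
print(undo_wikitext_escapes(sample))
```

This only reverses the escape tokens; the spacing around ordinary punctuation (for example " , " or " . ") is left untouched.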
Data Fields
The data fields are the same among all splits.
#### wikitext-103-raw-v1
- `text`: a `string` feature.
#### wikitext-103-v1
- `text`: a `string` feature.
#### wikitext-2-raw-v1
- `text`: a `string` feature.
#### wikitext-2-v1
- `text`: a `string` feature.
Data Splits
| name | train | validation | test |
|---|---:|---:|---:|
| wikitext-103-raw-v1 | 1801350 | 3760 | 4358 |
| wikitext-103-v1 | 1801350 | 3760 | 4358 |
| wikitext-2-raw-v1 | 36718 | 3760 | 4358 |
| wikitext-2-v1 | 36718 | 3760 | 4358 |
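As a hedged sketch, the split table can be turned into a small lookup that validates a configuration and split name before loading, then checks the loaded row count against the table. The helper names (`expected_rows`, `load_wikitext`) are hypothetical, and the Hub identifier `"wikitext"` is an assumption taken from the card title; the `datasets` import is deferred so the lookup works without the library installed.

```python
# Row counts copied from the Data Splits table.
SPLIT_ROWS = {
    "wikitext-103-raw-v1": {"train": 1801350, "validation": 3760, "test": 4358},
    "wikitext-103-v1": {"train": 1801350, "validation": 3760, "test": 4358},
    "wikitext-2-raw-v1": {"train": 36718, "validation": 3760, "test": 4358},
    "wikitext-2-v1": {"train": 36718, "validation": 3760, "test": 4358},
}


def expected_rows(config: str, split: str) -> int:
    """Return the documented row count, or raise ValueError for unknown names."""
    if config not in SPLIT_ROWS:
        raise ValueError(f"unknown config {config!r}; choose from {sorted(SPLIT_ROWS)}")
    if split not in SPLIT_ROWS[config]:
        raise ValueError(f"unknown split {split!r}; choose from {sorted(SPLIT_ROWS[config])}")
    return SPLIT_ROWS[config][split]


def load_wikitext(config: str, split: str):
    """Load one split and verify its size (needs `datasets` and network access)."""
    rows = expected_rows(config, split)
    from datasets import load_dataset  # deferred: optional dependency

    # "wikitext" as the dataset id is an assumption based on the card title.
    ds = load_dataset("wikitext", config, split=split)
    if len(ds) != rows:
        raise RuntimeError(f"expected {rows} rows, got {len(ds)}")
    return ds


print(expected_rows("wikitext-2-raw-v1", "train"))
```

A call such as `load_wikitext("wikitext-2-raw-v1", "validation")` would then return records with a single `text` field, as described under Data Fields.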
Dataset Creation
Curation Rationale
Source Data
#### Initial Data Collection and Normalization
#### Who are the source language producers?
Annotations
#### Annotation process
#### Who are the annotators?
Personal and Sensitive Information
Considerations for Using the Data
Social Impact of Dataset
Discussion of Biases
Other Known Limitations
Additional Information
Dataset Curators
Licensing Information
The dataset is available under the Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0).
Citation Information
@misc{merity2016pointer,
title={Pointer Sentinel Mixture Models},
author={Stephen Merity and Caiming Xiong and James Bradbury and Richard Socher},
year={2016},
eprint={1609.07843},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Contributions
Thanks to @thomwolf, @lewtun, @patrickvonplaten, @mariamabarham for adding this dataset.