This is a Dataset, not a Model

The following metrics do not apply: FNI Score, Deployment Options, Model Architecture

wikitext

FNI 20.8
by Salesforce Dataset

"--- annotations_creators: - no-annotation language_creators: - crowdsourced language: - en license: - cc-by-sa-3.0 - gfdl multilinguality: - monolingual size_categories: - 1M"

Best Scenarios

Data Science

Technical Constraints

Generic Use
- Size
- Rows
Parquet Format
611 Likes

Capabilities

  • Data Science

Deep Dive

Technical Profile

Hardware & Scale

Size
-
Total Rows
-
Files
16

Training & Env

Format
Parquet
Cleaning
Raw

Cloud & Rights

Source
huggingface
License
["cc-by-sa-3.0","gfdl"]

Data Preview

feature label split
example_text_1 0 train
example_text_2 1 train
example_text_3 0 test
example_text_4 1 validation
example_text_5 0 train
Showing 5 sample rows.

Schema & Configs

Fields

feature: string
label: int64
split: string

Dataset Card

Dataset Card for "wikitext"

Table of Contents

- Dataset Summary
- Supported Tasks and Leaderboards
- Languages
- Data Instances
- Data Fields
- Data Splits
- Curation Rationale
- Source Data
- Annotations
- Personal and Sensitive Information
- Social Impact of Dataset
- Discussion of Biases
- Other Known Limitations
- Dataset Curators
- Licensing Information
- Citation Information
- Contributions

Dataset Description

Dataset Summary

The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.

Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation, and numbers, all of which are removed in PTB. Because it is composed of full articles, the dataset is well suited to models that can take advantage of long-term dependencies.

Each subset comes in two different variants:

  • Raw (for character-level work): contains the raw tokens, before the addition of the <unk> (unknown) tokens.
  • Non-raw (for word-level work): contains only the tokens in the vocabulary (wiki.train.tokens, wiki.valid.tokens, and wiki.test.tokens). The out-of-vocabulary tokens have been replaced with the <unk> token.
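
The difference between the two variants can be illustrated with a minimal sketch. It builds a vocabulary from token counts and replaces every out-of-vocabulary token with an unknown marker. The helper names (build_vocab, replace_oov), the min_count threshold, and the toy sentence are assumptions made for illustration only; this is not the curators' preprocessing pipeline.

```python
from collections import Counter

UNK = "<unk>"  # unknown-token marker used in the non-raw variant


def build_vocab(tokens, min_count):
    """Return the set of tokens that occur at least min_count times.

    min_count is an illustrative parameter, not a value taken from the
    dataset card.
    """
    counts = Counter(tokens)
    return {tok for tok, n in counts.items() if n >= min_count}


def replace_oov(tokens, vocab, unk=UNK):
    """Replace every token that is not in vocab with the unknown marker."""
    return [tok if tok in vocab else unk for tok in tokens]


# Toy "raw" token stream (hypothetical text, whitespace-tokenized).
raw_tokens = "the cat sat on the mat while the cat saw the dog".split()

# Keep only tokens seen at least twice, then map the rest to <unk>.
vocab = build_vocab(raw_tokens, min_count=2)
non_raw_tokens = replace_oov(raw_tokens, vocab)

print(" ".join(non_raw_tokens))
```

The sequence length is unchanged by the replacement; only rare tokens lose their identity, which is why the raw variant is the one suited to character-level work.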

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed
