This version of the data is meant for data analysis. If you need PGN files, you can find those here. That said, once you have a subset of interest, it is trivial to convert it back to PGN, as shown in the Dataset Usage section.
This dataset is hive-partitioned into multiple Parquet files on two keys, year and month:
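As a rough illustration of why the conversion back to PGN is simple, the sketch below renders a single row as PGN text: one tag pair per column, a blank line, then the movetext. It assumes each row is available as a plain dict whose keys other than movetext are PGN tag names. The function name `row_to_pgn` and the sample row are hypothetical and not part of the dataset tooling.

```python
def row_to_pgn(row: dict) -> str:
    """Render one row as PGN: tag pairs, a blank line, then the movetext.

    Assumption: every key except "movetext" is a PGN tag name.
    Missing (None) values are skipped; quotes in values are not escaped.
    """
    tags = [
        f'[{name} "{value}"]'
        for name, value in row.items()
        if name != "movetext" and value is not None
    ]
    return "\n".join(tags) + "\n\n" + row["movetext"] + "\n"


# Hypothetical row for illustration; real rows carry more columns.
row = {
    "White": "alice",
    "Black": "bob",
    "Result": "1-0",
    "UTCDate": "2015.01.01",
    "WhiteElo": 1500,
    "BlackElo": 1480,
    "movetext": "1. e4 e5 2. Qh5 Ke7 3. Qxe5# 1-0",
}

print(row_to_pgn(row))
```

If the rows you load also carry the partition keys (year, month) as columns, you would probably want to drop them before rendering, since they are not PGN tags.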
.
├── data
│   └── year=2015
│       ├── month=01
│       │   ├── train-00000-of-00003.parquet
│       │   ├── train-00001-of-00003.parquet
│       │   └── train-00002-of-00003.parquet
│       ├── month=02
│       │   ├── train-00000-of-00003.parquet
│       │   ├── train-00001-of-00003.parquet
│       │   └── train-00002-of-00003.parquet
│       ├── ...
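In a hive-partitioned layout like the one above, the partition keys live in the directory names rather than in the files themselves. Parquet readers that understand hive partitioning typically expose these keys as columns and can skip directories that do not match a filter. The standard-library sketch below only shows how the keys map to paths. The helper names `parse_hive_partition` and `partition_glob` are hypothetical, and the `data/` prefix is taken from the tree above.

```python
from pathlib import PurePosixPath
from typing import Dict, Optional


def parse_hive_partition(path: str) -> Dict[str, str]:
    """Extract the key=value segments from a hive-style path."""
    keys = {}
    for part in PurePosixPath(path).parts:
        if "=" in part:
            key, _, value = part.partition("=")
            keys[key] = value
    return keys


def partition_glob(year: int, month: Optional[int] = None) -> str:
    """Build a glob pattern for a whole year, or for one month of that year."""
    month_part = f"month={month:02d}" if month is not None else "month=*"
    return f"data/year={year}/{month_part}/*.parquet"


print(parse_hive_partition("data/year=2015/month=01/train-00000-of-00003.parquet"))
print(partition_glob(2015, 2))
print(partition_glob(2015))
```

Selecting files by pattern this way means only the months of interest are read, which matters when the full dataset is much larger than the subset you need.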
Dataset Usage
Dataset Details
Dataset Sample
Dataset Fields
Notes
About 6% of the games include Stockfish analysis evaluations: [%eval 2.35] (235 centipawn advantage), [%eval #-4] (getting mated in 4), always from White's point of view.
The WhiteElo and BlackElo tags contain Glicko2 ratings.
For games played since April 2017, the movetext column contains clock information as PGN %clk comments.
The schema doesn't include the Date header, typically part of the Seven Tag Roster, as we deemed the UTCDate field to be enough.
A future version of the data will add a UCI column containing the corresponding moves in UCI format.
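The %eval and %clk annotations described in the notes above can be pulled out of the movetext with regular expressions. The following is a minimal sketch, assuming the annotations appear as `[%eval ...]` and `[%clk h:mm:ss]` inside PGN comments. The sample movetext is made up for illustration, and the function names are hypothetical.

```python
import re

# A score in pawns (e.g. 2.35) or a mate distance (e.g. #-4).
EVAL_RE = re.compile(r"\[%eval\s+(#?-?\d+(?:\.\d+)?)\]")
# Remaining clock time as h:mm:ss.
CLK_RE = re.compile(r"\[%clk\s+(\d+):(\d{2}):(\d{2})\]")


def parse_evals(movetext):
    """Return evaluations from White's point of view.

    Pawn scores become ("cp", centipawns); mates become ("mate", n),
    where a negative n means White is the side getting mated.
    """
    evals = []
    for raw in EVAL_RE.findall(movetext):
        if raw.startswith("#"):
            evals.append(("mate", int(raw[1:])))
        else:
            evals.append(("cp", round(float(raw) * 100)))
    return evals


def parse_clocks(movetext):
    """Return the remaining clock time after each move, in seconds."""
    return [
        int(h) * 3600 + int(m) * 60 + int(s)
        for h, m, s in CLK_RE.findall(movetext)
    ]


# Hypothetical movetext for illustration only.
sample = "1. e4 { [%eval 2.35] [%clk 0:05:00] } 1... e5 { [%eval #-4] [%clk 0:04:58] }"

print(parse_evals(sample))   # [('cp', 235), ('mate', -4)]
print(parse_clocks(sample))  # [300, 298]
```

Since only about 6% of games carry evaluations and clock comments start in April 2017, both functions return an empty list when the annotations are absent, so callers should not assume one entry per move.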