PDB
| Entity Passport | |
|---|---|
| Registry ID | hf-dataset--litefold--pdb |
| License | CC0-1.0 |
| Provider | huggingface |
Cite this dataset
Academic & Research Attribution
@misc{hf_dataset__litefold__pdb,
author = {LiteFold},
title = {PDB Dataset},
year = {2026},
howpublished = {\url{https://huggingface.co/datasets/LiteFold/PDB}},
note = {Accessed via Free2AITools Knowledge Fortress}
}
Free2AITools Nexus Index V2.0
FNI V2.0 for PDB: Semantic (S:50), Authority (A:61), Popularity (P:50), Recency (R:97), Quality (Q:50).
Dataset Specification
PDB mmCIF Entry Index
The Protein Data Bank is the single global archive of experimentally determined 3D structures of biological macromolecules, established in 1971 and now holding well over 230,000 entries. It stores atomic coordinates for proteins, nucleic acids, and their complexes determined by X-ray crystallography, cryo-EM, NMR, micro-electron diffraction, and integrative methods, along with the underlying experimental data (structure factors, EM maps, NMR restraints) and rich metadata covering sequence, ligands, modifications, oligomeric state, and validation reports. Every entry has a four-character PDB ID (e.g. 7PZB) and is distributed primarily in the mmCIF format, with legacy PDB-format files retained for compatibility.
Operationally, the archive is jointly managed by the wwPDB consortium: RCSB PDB at Rutgers and UCSD handles deposits from the Americas and Oceania and serves as the wwPDB Archive Keeper, PDBe at EMBL-EBI handles Europe and Africa, PDBj at Osaka University handles Asia, and BMRB hosts NMR-specific data. All wwPDB sites receive synchronized weekly updates and serve the archive free of charge under CC0.
Within structural biology and protein ML, the PDB is the canonical training and validation source for structure prediction (AlphaFold2/3, RoseTTAFold, Protenix, OpenFold), inverse folding (ProteinMPNN, ESM-IF), docking, MD setup, and template-based modelling. Time-cutoff splits on PDB release dates are the standard way to control for data leakage when benchmarking these models.
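The time-cutoff idea mentioned above can be sketched in a few lines. This is a minimal illustration, not the protocol of any particular benchmark: the function name `time_cutoff_split`, the toy rows, and the cutoff date are all hypothetical. It relies on this dataset's `accession_date_iso` column, which is an accession date rather than a release date, so it only approximates a release-date split, and it assumes the values look like `YYYY-MM-DD`.

```python
from datetime import date


def time_cutoff_split(rows, cutoff):
    """Partition rows into (before, on_or_after, undated) by accession date."""
    before, on_or_after, undated = [], [], []
    for row in rows:
        raw = row.get("accession_date_iso")
        if not raw:
            # The column is only filled "when available", so keep these apart.
            undated.append(row)
            continue
        # Assumption: values start with "YYYY-MM-DD"; keep only the date part.
        when = date.fromisoformat(raw[:10])
        (before if when < cutoff else on_or_after).append(row)
    return before, on_or_after, undated


# Toy rows with made-up IDs and dates, for illustration only.
rows = [
    {"pdb_id": "aaaa", "accession_date_iso": "2019-03-14"},
    {"pdb_id": "bbbb", "accession_date_iso": "2022-07-01"},
    {"pdb_id": "cccc", "accession_date_iso": None},
]
train_rows, eval_rows, undated_rows = time_cutoff_split(rows, date(2021, 1, 1))
print([r["pdb_id"] for r in train_rows], [r["pdb_id"] for r in eval_rows])
```

Rows without a parsed date are returned separately so the caller can decide whether to drop them or assign them by hand.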
Splits
| Split | Rows |
|---|---|
| train | 88,873 |
| test | 9,951 |
| total | 98,824 |
The split is deterministic: sha256(pdb_id) % 10 == 0 goes to test; buckets 1 through 9 go to train.
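A hash-bucket split of this kind might look like the sketch below. The card only states `sha256(pdb_id) % 10`; how the digest is turned into an integer is not specified, so the UTF-8 encoding, the lowercasing, and reading the full hex digest as one integer are assumptions, and the helper names `split_bucket` and `split_name` are hypothetical. Buckets computed this way are not guaranteed to match the `split_bucket` column.

```python
import hashlib


def split_bucket(pdb_id: str) -> int:
    """Map a PDB ID to a bucket in 0..9 using SHA-256."""
    # Assumption: hash the UTF-8 bytes of the lowercase ID and interpret
    # the whole hex digest as a single integer before taking modulo 10.
    digest = hashlib.sha256(pdb_id.lower().encode("utf-8")).hexdigest()
    return int(digest, 16) % 10


def split_name(pdb_id: str) -> str:
    """Bucket 0 goes to test; buckets 1 through 9 go to train."""
    return "test" if split_bucket(pdb_id) == 0 else "train"


print(split_bucket("7pzb"), split_name("7pzb"))
```

Because the assignment depends only on the ID, an entry keeps its split when the dataset is rebuilt or extended, and roughly one tenth of IDs land in test, which is consistent with 9,951 of 98,824 rows.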
Dataset Statistics
| Metric | Value |
|---|---|
| mmCIF files in this repo | 98,824 |
| Rows joined to entries.idx metadata | 98,824 |
| Full entries.idx rows | 252,816 |
| Total mirrored mmCIF compressed size | 31.08 GB |
| Known-resolution rows | 93,997 |
| Unknown-resolution rows | 4,827 |
| Median known resolution | 2.10 Å |
| Mean known resolution | 2.33 Å |
Top experimental methods:
| Method | Rows |
|---|---|
| X-RAY DIFFRACTION | 82,380 |
| ELECTRON MICROSCOPY | 11,433 |
| SOLUTION NMR | 4,707 |
| ELECTRON CRYSTALLOGRAPHY | 101 |
| X-RAY DIFFRACTION, NEUTRON DIFFRACTION | 50 |
Top classifications:
| Classification | Rows |
|---|---|
| HYDROLASE | 14,117 |
| TRANSFERASE | 9,970 |
| OXIDOREDUCTASE | 7,743 |
| VIRAL PROTEIN | 4,333 |
| MEMBRANE PROTEIN | 3,206 |
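The figures in the tables above could be recomputed from the rows with something like the following sketch. The helper name `summarize` and the toy rows are hypothetical; the column names come from the Columns table of this card. It is written against plain dictionaries and does not claim to reproduce how the published numbers were derived.

```python
from collections import Counter
from statistics import mean, median


def summarize(rows):
    """Compute resolution statistics and category counts for a list of rows."""
    known = [r["resolution_angstrom"] for r in rows if not r["resolution_is_unknown"]]
    return {
        "known_resolution_rows": len(known),
        "unknown_resolution_rows": len(rows) - len(known),
        "median_resolution": median(known) if known else None,
        "mean_resolution": mean(known) if known else None,
        "methods": Counter(r["experimental_method"] for r in rows),
        "classifications": Counter(r["classification"] for r in rows),
    }


# Toy rows with made-up values, for illustration only.
rows = [
    {"resolution_angstrom": 1.5, "resolution_is_unknown": False,
     "experimental_method": "X-RAY DIFFRACTION", "classification": "HYDROLASE"},
    {"resolution_angstrom": 2.0, "resolution_is_unknown": False,
     "experimental_method": "X-RAY DIFFRACTION", "classification": "TRANSFERASE"},
    {"resolution_angstrom": 2.5, "resolution_is_unknown": False,
     "experimental_method": "ELECTRON MICROSCOPY", "classification": "HYDROLASE"},
    {"resolution_angstrom": None, "resolution_is_unknown": True,
     "experimental_method": "SOLUTION NMR", "classification": "VIRAL PROTEIN"},
]
stats = summarize(rows)
print(stats["median_resolution"], stats["methods"].most_common(2))
```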
Load With `datasets`
from datasets import load_dataset
ds = load_dataset("LiteFold/PDB")
print(ds)
row = ds["train"][0]
print(row)
Load one split directly:
from datasets import load_dataset
train = load_dataset("LiteFold/PDB", split="train")
test = load_dataset("LiteFold/PDB", split="test")
Stream rows without materializing the full table locally:
from datasets import load_dataset
streamed = load_dataset("LiteFold/PDB", split="train", streaming=True)
first_row = next(iter(streamed))
Use the mmcif_path column with hf_hub_download to fetch a structure file:
from datasets import load_dataset
from huggingface_hub import hf_hub_download
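# Split slices need a range such as "train[:1]"; a bare index like "train[0]" is not valid slice syntax.
row = load_dataset("LiteFold/PDB", split="train[:1]")[0]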
local_path = hf_hub_download(
repo_id="LiteFold/PDB",
repo_type="dataset",
filename=row["mmcif_path"],
)
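Once a file has been fetched, it can be opened with the standard library, since the card describes the mirrored files as gzipped mmCIF. The sketch below only reads the name of the first `data_` block as a quick sanity check; the helper name `read_data_block_name` is hypothetical, and a dedicated mmCIF parser would be needed for coordinates or metadata.

```python
import gzip


def read_data_block_name(path):
    """Return the name on the first `data_` line of a gzipped mmCIF file."""
    # mmCIF is plain text; undecodable bytes are replaced rather than raising.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.startswith("data_"):
                return line[len("data_"):].strip()
    # No data block found, for example an empty or truncated file.
    return None


# Usage, assuming `local_path` points at a downloaded .cif.gz file:
# print(read_data_block_name(local_path))
```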
Filter to X-ray structures with known resolution:
from datasets import load_dataset
train = load_dataset("LiteFold/PDB", split="train")
xray = train.filter(
lambda row: row["experimental_method"] == "X-RAY DIFFRACTION"
and not row["resolution_is_unknown"]
)
Columns
| Column | Description |
|---|---|
| pdb_id | Four-character PDB identifier in lowercase. |
| mmcif_path | Path to the mirrored gzipped mmCIF file in this repository. |
| mmcif_file_size_bytes | Compressed mmCIF file size from Hugging Face Hub file metadata. |
| mmcif_blob_id | Hub blob identifier for the mmCIF object. |
| pdb_url | RCSB PDB structure page URL. |
| rcsb_download_url | Direct RCSB mmCIF download URL. |
| classification | PDB header classification. |
| accession_date | Original entries.idx accession date string. |
| accession_date_iso | Parsed ISO date when available. |
| title | Structure title from entries.idx. |
| source_organism | Source organism field from entries.idx. |
| authors | Author list from entries.idx. |
| raw_resolution | Original resolution field from entries.idx. |
| resolution_angstrom | Numeric resolution in Angstroms, nullable for non-numeric values such as NOT. |
| resolution_is_unknown | Whether resolution_angstrom is null. |
| experimental_method | Experimental method field from entries.idx. |
| has_entries_idx_metadata | Whether the mmCIF file matched a row in entries.idx. |
| split_bucket | Deterministic hash bucket; bucket 0 is test. |
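The relationship between `raw_resolution`, `resolution_angstrom`, and `resolution_is_unknown` can be illustrated with a small parser. This is a sketch under assumptions, not the logic of scripts/prepare_pdb_dataset.py: the function name `parse_resolution` is hypothetical, and treating non-positive, non-finite, or multi-valued strings as unknown is a choice made here for illustration.

```python
import math


def parse_resolution(raw):
    """Return (resolution_angstrom, resolution_is_unknown) for a raw field value."""
    try:
        value = float(str(raw).strip())
    except ValueError:
        # Non-numeric markers such as "NOT" end up here.
        return None, True
    # Assumption: only finite, positive numbers count as a known resolution.
    if not math.isfinite(value) or value <= 0:
        return None, True
    return value, False


for raw in ["2.10", "NOT", "", None]:
    print(repr(raw), parse_resolution(raw))
```

Keeping the raw string next to the parsed value, as the dataset does, lets downstream users apply a different parsing rule without going back to entries.idx.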
Source Files Used
- entries.idx
- Hub file metadata for paths under mmcif/**/*.cif.gz
The full parsed entries.idx table is also included as metadata/entries_idx.parquet. The preparation script is included at scripts/prepare_pdb_dataset.py.
Citation
@article{vallat2026rcsbpdb,
title = {{RCSB Protein Data Bank}: Delivering integrative structures alongside experimental structures and computed structure models},
author = {Vallat, Brinda and Rose, Yana and Piehl, Dennis W. and Duarte, Jose M. and Bittrich, Sebastian and Bi, Chunxiao and Segura, Joan and Zalevsky, Arthur and Sekharan, Monica R. and Webb, Benjamin M. and others},
journal = {Nucleic Acids Research},
year = {2026},
publisher = {Oxford University Press},
doi = {10.1093/nar/gkaf1187}
}
Structured Schema (Zero-Fabrication)
| Feature Key | Data Type |
|---|---|
| pdb_id | string |
| mmcif_path | string |
| mmcif_file_size_bytes | int64 |
| mmcif_blob_id | string |
| pdb_url | string |
| rcsb_download_url | string |
| classification | string |
| accession_date | string |
| accession_date_iso | string |
| title | string |
| source_organism | string |
| authors | string |
| raw_resolution | string |
| resolution_angstrom | float64 |
| resolution_is_unknown | bool |
| experimental_method | string |
| has_entries_idx_metadata | bool |
| split_bucket | int64 |
Estimated Rows: 98,824
Dataset Transparency Report
Technical metadata sourced from upstream repositories.
Identity & Source
- id: hf-dataset--litefold--pdb
- slug: litefold--pdb
- source: huggingface
- author: LiteFold
- license: CC0-1.0
- tags: license:cc0-1.0, size_categories:10k<n<100k, format:parquet, modality:tabular, modality:text, library:datasets, library:pandas, library:polars, library:mlcroissant, region:us, biology, proteins, protein-structure, pdb, rcsb, mmcif, parquet
Engagement & Metrics
- downloads: 26,308
- stars: 0
- forks: null
Data indexed from public sources. Updated daily.