hpa10m
| Entity Passport | |
|---|---|
| Registry ID | hf-dataset--nirschl-lab--hpa10m |
| License | CC-BY-SA-4.0 |
| Provider | huggingface |
Cite this dataset
Academic & Research Attribution
@misc{hf_dataset__nirschl_lab__hpa10m,
  author = {Nirschl Lab},
  title = {hpa10m Dataset},
  year = {2026},
  howpublished = {\url{https://huggingface.co/datasets/nirschl-lab/hpa10m}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}
Dataset Specification
HPA10M Dataset
A large-scale immunohistochemistry (IHC) image dataset derived from the Human Protein Atlas (HPA, https://www.proteinatlas.org/), containing approximately 10.5 million pathology and tissue images with detailed annotations.
Dataset Overview
| Statistic | Value |
|---|---|
| Total Images | 10,495,672 |
| Training Set | 10,493,672 images (10,497 tar files) |
| Validation Set | 2,000 images (1 tar file) |
| Image Types | Pathology (7,970,595) / Tissue (2,525,077) |
| Format | JPEG images + JSON metadata |
Directory Structure
hpa10m/
├── README.md                      # This file
├── example_images/                # Sample images for preview
├── hpa10m_train/                  # Training data (WebDataset tar files)
│   ├── hpa10m_train_0000.tar      # Training shards (10,497 files)
│   ├── hpa10m_train_0001.tar
│   └── ...
├── hpa10m_validation/             # Validation data
│   └── hpa10m_validation.tar      # All validation samples (2,000 images)
└── hpa10m_tar_summary/            # Metadata index files
    └── all.feather                # Complete index of all images
Data Format
Tar Archives (WebDataset Format)
Each tar file contains paired .jpg and .json files organized by:
- Image category: pathology/ or tissue/
- Gene prefix: Two-letter gene name prefix (e.g., AB/, CD/)
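The pairing described above can be read with Python's standard tarfile module. The following is a minimal sketch, not the dataset authors' loader: it assumes that an image and its metadata share the same path apart from the extension, and the member names and placeholder bytes in the demo archive are made up for illustration.

```python
import io
import json
import tarfile
from pathlib import PurePosixPath


def iter_pairs(tar: tarfile.TarFile):
    """Yield (key, jpg_bytes, metadata_dict) for each .jpg/.json pair in a tar."""
    pending = {}  # key (path without extension) -> {".jpg": bytes, ".json": bytes}
    for member in tar:
        if not member.isfile():
            continue
        path = PurePosixPath(member.name)
        if path.suffix not in (".jpg", ".json"):
            continue
        key = str(path.with_suffix(""))
        entry = pending.setdefault(key, {})
        entry[path.suffix] = tar.extractfile(member).read()
        if len(entry) == 2:  # both halves of the pair have been seen
            del pending[key]
            yield key, entry[".jpg"], json.loads(entry[".json"])


def build_demo_tar() -> bytes:
    """Build a tiny in-memory tar with one made-up sample (placeholder bytes, not a real JPEG)."""
    files = {
        # Hypothetical member names following the category/prefix layout.
        "pathology/TE/demo_0001.jpg": b"not-a-real-jpeg",
        "pathology/TE/demo_0001.json": json.dumps(
            {"custom_metadata": {"gene": "TEKT3"}}
        ).encode(),
    }
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


with tarfile.open(fileobj=io.BytesIO(build_demo_tar()), mode="r") as demo:
    pairs = list(iter_pairs(demo))
```

For a real shard, replace the in-memory archive with `tarfile.open("hpa10m_train/hpa10m_train_0000.tar")` and decode the image bytes with an image library of your choice.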
JSON Metadata Structure
Each image has a corresponding JSON file with rich annotations:
{
  "metadata": {
    "height": 3000,
    "width": 3000,
    "name": "image_filename.jpg",
    "format": ".jpg"
  },
  "custom_metadata": {
    "gene": "TEKT3",
    "ensembl_id": "ENSG00000125409",
    "uniprot_id": "Q9BXF9",
    "tissue": "skin cancer",
    "cell_type": "Tumor cells",
    "patient_id": 3354,
    "patient_age": 92,
    "patient_sex": "male",
    "snomed_code": "M-80703;T-01000",
    "snomed_text": "Squamous cell carcinoma, NOS;Skin",
    "staining_intensity": "negative",
    "staining_location": "none",
    "staining_quantity": "none",
    "generic_caption": "Immunohistochemical staining of human skin cancer...",
    "caption_1": "Detailed caption describing the image...",
    "caption_2": "Alternative caption...",
    "url": "http://images.proteinatlas.org/...",
    "bboxes": [[x, y, w, h], ...],
    "rle_mask": "encoded_segmentation_mask",
    "area_px": 3883806,
    "area_fraction": 0.431534
  }
}
Index Files (Feather Format)
The hpa10m_tar_summary/all.feather file contains an index of all images with columns:
| Column | Description |
|---|---|
| tar_filename | Source tar archive name |
| split | Dataset split (train/validation) |
| name | Full path within tar archive |
| type | Image type (pathology/tissue) |
| img_offset | Byte offset of image in tar |
| img_size | Image file size in bytes |
| json_offset | Byte offset of JSON in tar |
| json_size | JSON file size in bytes |
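The offset and size columns suggest that individual files can be read without scanning a whole shard. Below is a minimal sketch, assuming that img_offset points at the first byte of the image data (not at the 512-byte tar header that precedes it) and that the shards are uncompressed. Both assumptions should be checked against a real shard before relying on them. The demo builds a throwaway tar and a hypothetical index row from tarfile's own offset_data value.

```python
import io
import os
import tarfile
import tempfile


def read_span(tar_path: str, offset: int, size: int) -> bytes:
    """Read `size` bytes starting at byte `offset` of an uncompressed tar file."""
    with open(tar_path, "rb") as f:
        f.seek(offset)
        return f.read(size)


payload = b"placeholder image bytes"  # stands in for real JPEG data

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.tar")

    # Write a one-member archive with a made-up member name.
    with tarfile.open(demo_path, "w") as tar:
        info = tarfile.TarInfo("tissue/AB/demo.jpg")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

    # Build a hypothetical index row; offset_data is where the member's data starts.
    with tarfile.open(demo_path, "r") as tar:
        member = tar.getmember("tissue/AB/demo.jpg")
        row = {"img_offset": member.offset_data, "img_size": member.size}

    recovered = read_span(demo_path, row["img_offset"], row["img_size"])
```

If the bytes read from a real shard do not start with the JPEG magic bytes `FF D8`, the offsets probably refer to the tar header instead, and the read should start 512 bytes later.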
Key Annotations
Clinical Information
- gene: Gene name (e.g., "TEKT3")
- ensembl_id: Ensembl gene ID (e.g., "ENSG00000125409")
- uniprot_id: UniProt protein ID (e.g., "Q9BXF9")
- tissue: Tissue or cancer type (e.g., "skin cancer")
- uberon_id: UBERON ontology ID
- cell_type: Cell type (e.g., "Tumor cells")
- patient_id: Patient identifier
- patient_age: Patient age
- patient_sex: Patient sex ("male" / "female")
- snomed_code: SNOMED-CT code (e.g., "M-80703;T-01000")
- snomed_text: SNOMED-CT description (e.g., "Squamous cell carcinoma, NOS;Skin")
Staining Characteristics
- staining_intensity: "negative", "weak", "moderate", "strong"
- staining_location: "nuclear", "cytoplasmic/membranous", "cytoplasmic/membranous,nuclear", "none"
- staining_quantity: "none", "<25%", "25-75%", ">75%"
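For modeling, the categorical staining fields often need a numeric encoding. The sketch below treats the intensity and quantity values as ordered in the sequence listed above, which is an assumption rather than something the dataset card states, and splits comma-separated locations into a list. The helper name encode_staining is hypothetical.

```python
# Assumed ordering: the sequence in which the dataset card lists the values.
INTENSITY_ORDER = ["negative", "weak", "moderate", "strong"]
QUANTITY_ORDER = ["none", "<25%", "25-75%", ">75%"]


def encode_staining(custom_metadata: dict) -> dict:
    """Map staining fields to ordinal ranks and a list of locations."""
    location = custom_metadata["staining_location"]
    return {
        "intensity_rank": INTENSITY_ORDER.index(custom_metadata["staining_intensity"]),
        "quantity_rank": QUANTITY_ORDER.index(custom_metadata["staining_quantity"]),
        # "none" means no stained compartment, so it becomes an empty list.
        "locations": [] if location == "none" else location.split(","),
    }


negative_sample = encode_staining(
    {"staining_intensity": "negative", "staining_location": "none", "staining_quantity": "none"}
)
mixed_sample = encode_staining(
    {
        "staining_intensity": "strong",
        "staining_location": "cytoplasmic/membranous,nuclear",
        "staining_quantity": "25-75%",
    }
)
```

A value outside the listed categories raises `ValueError` from `list.index`, which surfaces unexpected labels early instead of silently encoding them.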
Segmentation Data
- bboxes: Bounding boxes in [[x, y, width, height], ...] format
- rle_mask: Segmentation mask
- area_px: Segmented area in pixels
- area_fraction: Fraction of image covered by segmentation
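Two small calculations follow from these fields. The sketch below converts an [x, y, width, height] box to corner coordinates, assuming (x, y) is the top-left corner in pixel coordinates. It also recomputes area_fraction as area_px divided by the image's pixel count, which agrees with the sample record shown earlier (3883806 / (3000 × 3000) ≈ 0.431534). Both helper names are hypothetical, and the box values in the demo are made up.

```python
def xywh_to_xyxy(box):
    """Convert [x, y, width, height] to (x_min, y_min, x_max, y_max).

    Assumes (x, y) is the top-left corner in pixel coordinates.
    """
    x, y, w, h = box
    return (x, y, x + w, y + h)


def area_fraction(area_px: int, height: int, width: int) -> float:
    """Fraction of the image covered by the segmented area."""
    return area_px / (height * width)


# Values taken from the sample JSON record in this document.
sample_fraction = area_fraction(3883806, 3000, 3000)

# Made-up box, for illustration only.
corners = xywh_to_xyxy([100, 200, 50, 25])
```

The rle_mask encoding is not specified in this document, so decoding it is left out of the sketch.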
Natural Language Captions
- generic_caption: Standardized description
- caption_1: Detailed scientific description
- caption_2: Alternative description
Other Metadata
- url: Original image URL from Human Protein Atlas
- image_md5: MD5 hash of original image
- file_size_kb: Image file size in KB
Usage
Loading Index with Pandas
import pandas as pd
# Load complete index
df = pd.read_feather("hpa10m_tar_summary/all.feather")
# Filter by split
train_df = df[df["split"] == "train"]
val_df = df[df["split"] == "validation"]
# Filter by image type
pathology_df = df[df["type"] == "pathology"]
tissue_df = df[df["type"] == "tissue"]
Data Source
This dataset is derived from the Human Protein Atlas (https://www.proteinatlas.org/), a comprehensive resource for protein expression in human tissues and cancers.
License
Please refer to the Human Protein Atlas data usage terms at https://www.proteinatlas.org/about/licence for licensing information.
Contact
For questions or suggestions, please contact: [email protected] or [email protected]
Structured Schema (Zero-Fabrication)
| Feature Key | Data Type |
|---|---|
| __key__ | string |
| __url__ | string |
| jpg | Image |
| json | unknown |
Estimated Rows: 7,600
Dataset Transparency Report
Verified data manifest for traceability and transparency.
Identity & Source
- id: hf-dataset--nirschl-lab--hpa10m
- slug: nirschl-lab--hpa10m
- source: huggingface
- author: Nirschl Lab
- license: CC-BY-SA-4.0
- tags: license:cc-by-sa-4.0, region:us, size_categories:1m<n<10m, format:webdataset, modality:image, modality:text, library:datasets, library:webdataset, library:mlcroissant
Engagement & Metrics
- downloads: 31,817
- stars: 5
- forks: 0