Dataset

hpa10m

Name: hpa10m
Creator: Nirschl Lab
License: CC-BY-SA-4.0

Dataset Information Summary
Entity Passport
Registry ID	hf-dataset--nirschl-lab--hpa10m
License	CC-BY-SA-4.0
Provider	huggingface

Cite this dataset

Academic & Research Attribution

BibTeX

@misc{hf_dataset__nirschl_lab__hpa10m,
  author = {Nirschl Lab},
  title = {hpa10m Dataset},
  year = {2026},
  howpublished = {\url{https://huggingface.co/datasets/nirschl-lab/hpa10m}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}

APA Style

Nirschl Lab. (2026). hpa10m [Dataset]. Free2AITools. https://huggingface.co/datasets/nirschl-lab/hpa10m

Dataset Specification

HPA10M Dataset

A large-scale immunohistochemistry (IHC) image dataset derived from the Human Protein Atlas (HPA, https://www.proteinatlas.org/), containing approximately 10.5 million pathology and tissue images with detailed annotations.

Dataset Overview

Statistic	Value
Total Images	10,495,672
Training Set	10,493,672 images (10,497 tar files)
Validation Set	2,000 images (1 tar file)
Image Types	Pathology (7,970,595) / Tissue (2,525,077)
Format	JPEG images + JSON metadata

Directory Structure

text

hpa10m/
├── README.md                              # This file
├── example_images/                        # Sample images for preview
├── hpa10m_train/                          # Training data (WebDataset tar files)
│   ├── hpa10m_train_0000.tar             # Training shards (10,497 files)
│   ├── hpa10m_train_0001.tar
│   ├── ...
├── hpa10m_validation/                     # Validation data
│   └── hpa10m_validation.tar              # All validation samples (2,000 images)
└── hpa10m_tar_summary/                    # Metadata index files
    └── all.feather                        # Complete index of all images

Data Format

Tar Archives (WebDataset Format)

Each tar file contains paired .jpg and .json files, organized into directories by:

Image category: pathology/ or tissue/
Gene prefix: the first two letters of the gene name (e.g., AB/, CD/)
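
As a minimal sketch of how the paired files could be read back with only the Python standard library, the snippet below groups tar members by their path without the extension. It assumes (the dataset card does not state this explicitly) that each .jpg and its .json share the same base path, as is usual for WebDataset shards. `iter_pairs` is a hypothetical helper name, not part of the dataset.

```python
import json
import tarfile
from collections import defaultdict
from pathlib import PurePosixPath


def iter_pairs(tar_path):
    """Yield (key, jpg_bytes, metadata_dict) for each .jpg/.json pair in a shard.

    ASSUMPTION: both files of a pair share the same path minus the extension.
    """
    pending = defaultdict(dict)  # key -> {".jpg": bytes, ".json": bytes}
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            path = PurePosixPath(member.name)
            ext = path.suffix.lower()
            if ext not in (".jpg", ".json"):
                continue
            key = str(path.with_suffix(""))
            pending[key][ext] = tar.extractfile(member).read()
            if len(pending[key]) == 2:
                pair = pending.pop(key)
                yield key, pair[".jpg"], json.loads(pair[".json"])
```

The key returned for a member such as `pathology/TE/example.jpg` would be `pathology/TE/example`, so the image category and gene prefix remain recoverable from it.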

JSON Metadata Structure

Each image has a corresponding JSON file with rich annotations:

json

{
  "metadata": {
    "height": 3000,
    "width": 3000,
    "name": "image_filename.jpg",
    "format": ".jpg"
  },
  "custom_metadata": {
    "gene": "TEKT3",
    "ensembl_id": "ENSG00000125409",
    "uniprot_id": "Q9BXF9",
    "tissue": "skin cancer",
    "cell_type": "Tumor cells",
    "patient_id": 3354,
    "patient_age": 92,
    "patient_sex": "male",
    "snomed_code": "M-80703;T-01000",
    "snomed_text": "Squamous cell carcinoma, NOS;Skin",
    "staining_intensity": "negative",
    "staining_location": "none",
    "staining_quantity": "none",
    "generic_caption": "Immunohistochemical staining of human skin cancer...",
    "caption_1": "Detailed caption describing the image...",
    "caption_2": "Alternative caption...",
    "url": "http://images.proteinatlas.org/...",
    "bboxes": [[x, y, w, h], ...],
    "rle_mask": "encoded_segmentation_mask",
    "area_px": 3883806,
    "area_fraction": 0.431534
  }
}
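
For tabular analysis it can be convenient to collapse the two nested objects into one flat record. The following is a small sketch under the assumption that every JSON file follows the two-level layout shown above; `flatten_record` and the `image_` prefix are illustrative choices, not part of the dataset.

```python
import json


def flatten_record(raw_json):
    """Merge "metadata" and "custom_metadata" into one flat dict.

    Keys from "metadata" get an "image_" prefix (an arbitrary choice made
    here) so they cannot collide with keys from "custom_metadata".
    """
    record = json.loads(raw_json)
    flat = {f"image_{key}": value for key, value in record.get("metadata", {}).items()}
    flat.update(record.get("custom_metadata", {}))
    return flat
```

A list of such dicts can be passed directly to `pandas.DataFrame` to obtain one row per image.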

Index Files (Feather Format)

The hpa10m_tar_summary/all.feather file contains an index of all images with columns:

Column	Description
`tar_filename`	Source tar archive name
`split`	Dataset split (train/validation)
`name`	Full path within tar archive
`type`	Image type (pathology/tissue)
`img_offset`	Byte offset of image in tar
`img_size`	Image file size in bytes
`json_offset`	Byte offset of JSON in tar
`json_size`	JSON file size in bytes
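
The offset and size columns suggest that a single sample can be read without scanning the whole archive. The sketch below shows one way this might work. It assumes, without confirmation from the dataset card, that `img_offset` and `json_offset` point at the first byte of the file contents rather than at the tar header that precedes them. `read_span` and `read_sample` are hypothetical helper names.

```python
import json


def read_span(tar_path, offset, size):
    """Read `size` bytes starting at byte `offset` of the tar file."""
    with open(tar_path, "rb") as f:
        f.seek(offset)
        data = f.read(size)
    if len(data) != size:
        raise ValueError(f"expected {size} bytes at offset {offset}, got {len(data)}")
    return data


def read_sample(tar_path, row):
    """Return (jpg_bytes, metadata_dict) for one row of the index.

    ASSUMPTION: the offsets address the file contents directly, not the
    512-byte tar header block in front of each member.
    """
    jpg_bytes = read_span(tar_path, int(row["img_offset"]), int(row["img_size"]))
    json_bytes = read_span(tar_path, int(row["json_offset"]), int(row["json_size"]))
    return jpg_bytes, json.loads(json_bytes)
```

A quick sanity check on the assumption: JPEG data starts with the bytes `FF D8`, so if `jpg_bytes[:2]` differs, the offsets most likely refer to the header instead.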

Key Annotations

Clinical Information

gene: Gene name (e.g., "TEKT3")
ensembl_id: Ensembl gene ID (e.g., "ENSG00000125409")
uniprot_id: UniProt protein ID (e.g., "Q9BXF9")
tissue: Tissue or cancer type (e.g., "skin cancer")
uberon_id: UBERON ontology ID
cell_type: Cell type (e.g., "Tumor cells")
patient_id: Patient identifier
patient_age: Patient age
patient_sex: Patient sex ("male" / "female")
snomed_code: SNOMED-CT code (e.g., "M-80703;T-01000")
snomed_text: SNOMED-CT description (e.g., "Squamous cell carcinoma, NOS;Skin")
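
In the example values, `snomed_code` and `snomed_text` each hold two semicolon-separated entries. The sketch below pairs them up by position. That positional alignment is an assumption drawn from the single example, and `snomed_pairs` is a hypothetical helper.

```python
def snomed_pairs(custom_metadata):
    """Return [(code, text), ...] from the semicolon-separated SNOMED fields.

    ASSUMPTION: the i-th code corresponds to the i-th text entry.
    """
    codes = [c.strip() for c in (custom_metadata.get("snomed_code") or "").split(";") if c.strip()]
    texts = [t.strip() for t in (custom_metadata.get("snomed_text") or "").split(";") if t.strip()]
    if len(codes) != len(texts):
        raise ValueError(f"{len(codes)} codes but {len(texts)} text entries")
    return list(zip(codes, texts))
```

Splitting on the semicolon only keeps the comma inside "Squamous cell carcinoma, NOS" intact.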

Staining Characteristics

staining_intensity: "negative", "weak", "moderate", "strong"
staining_location: "nuclear", "cytoplasmic/membranous", "cytoplasmic/membranous,nuclear", "none"
staining_quantity: "none", "<25%", "25-75%", ">75%"
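
Because the intensity and quantity values form a small fixed vocabulary, they can be mapped to integer ranks for modeling. This is a sketch only: treating the values as ordered in the sequence listed above is an assumption, and `encode_staining` is a hypothetical helper.

```python
INTENSITY_LEVELS = ("negative", "weak", "moderate", "strong")
QUANTITY_LEVELS = ("none", "<25%", "25-75%", ">75%")


def encode_staining(custom_metadata):
    """Map staining fields to integer ranks and a list of locations.

    Raises ValueError for values outside the vocabularies listed above.
    """
    location = custom_metadata["staining_location"]
    return {
        "intensity_rank": INTENSITY_LEVELS.index(custom_metadata["staining_intensity"]),
        "quantity_rank": QUANTITY_LEVELS.index(custom_metadata["staining_quantity"]),
        # "none" becomes an empty list; combined values are split on the comma.
        "locations": [] if location == "none" else location.split(","),
    }
```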

Segmentation Data

bboxes: Bounding boxes in [[x, y, width, height], ...] format
rle_mask: Run-length-encoded (RLE) segmentation mask
area_px: Segmented area in pixels
area_fraction: Fraction of image covered by segmentation
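
Two small helpers illustrate how these fields might be used. The first converts boxes to corner coordinates, assuming (not confirmed by the dataset card) that x and y denote the top-left corner in pixels. The second recomputes the covered fraction as `area_px / (height * width)`, a relationship inferred from the example record, where 3883806 / (3000 × 3000) equals the listed 0.431534. Both function names are illustrative.

```python
def xywh_to_xyxy(bboxes):
    """Convert [[x, y, width, height], ...] to [[x1, y1, x2, y2], ...].

    ASSUMPTION: (x, y) is the top-left corner of each box.
    """
    return [[x, y, x + w, y + h] for x, y, w, h in bboxes]


def area_fraction(area_px, height, width):
    """Fraction of the image covered by the segmented area."""
    return area_px / (height * width)
```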

Natural Language Captions

generic_caption: Standardized description
caption_1: Detailed scientific description
caption_2: Alternative description
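
For image-text training it may be useful to gather whichever captions are present on a record. A minimal sketch follows, assuming that some records could have missing or empty caption fields; `collect_captions` is a hypothetical helper.

```python
CAPTION_FIELDS = ("generic_caption", "caption_1", "caption_2")


def collect_captions(custom_metadata):
    """Return the non-empty captions of a record, in the field order above."""
    captions = []
    for field in CAPTION_FIELDS:
        text = custom_metadata.get(field)
        if isinstance(text, str) and text.strip():
            captions.append(text.strip())
    return captions
```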

Other Metadata

url: Original image URL from Human Protein Atlas
image_md5: MD5 hash of original image
file_size_kb: Image file size in KB
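
The `image_md5` field could in principle be used as an integrity check. The sketch below assumes the value is a hexadecimal MD5 digest of the raw image bytes. Since the field describes the original image, a mismatch against the JPEG stored in the tar would not necessarily indicate corruption if the stored file was re-encoded. `md5_matches` is a hypothetical helper.

```python
import hashlib


def md5_matches(image_bytes, custom_metadata):
    """Compare the MD5 of `image_bytes` with the record's image_md5 field.

    Returns None when the record has no image_md5 value.
    ASSUMPTION: image_md5 is a hex digest of the raw image file bytes.
    """
    expected = custom_metadata.get("image_md5")
    if not expected:
        return None
    return hashlib.md5(image_bytes).hexdigest() == expected.strip().lower()
```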

Usage

Loading Index with Pandas

python

import pandas as pd

# Load complete index
df = pd.read_feather("hpa10m_tar_summary/all.feather")

# Filter by split
train_df = df[df["split"] == "train"]
val_df = df[df["split"] == "validation"]

# Filter by image type
pathology_df = df[df["type"] == "pathology"]
tissue_df = df[df["type"] == "tissue"]
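
Building on the filters above, the index can also be summarized per split and image type, for example to compare against the totals in the overview table. This is a sketch using only the documented columns; `summarize_index` is a hypothetical helper.

```python
import pandas as pd


def summarize_index(df):
    """Count images and total JPEG bytes for each (split, type) pair."""
    return (
        df.groupby(["split", "type"])
        .agg(n_images=("name", "size"), jpg_bytes=("img_size", "sum"))
        .reset_index()
    )
```

Calling `summarize_index(pd.read_feather("hpa10m_tar_summary/all.feather"))` would yield one row per combination of split and image type.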

Data Source

This dataset is derived from the Human Protein Atlas (https://www.proteinatlas.org/), a comprehensive resource for protein expression in human tissues and cancers.

License

This dataset is listed under the CC-BY-SA-4.0 license. For the terms governing the underlying images, please refer to the Human Protein Atlas data usage terms at https://www.proteinatlas.org/about/licence.

📧 Contact

For questions or suggestions, please contact: [email protected] or [email protected]

📊 Structured Schema (Zero-Fabrication)

Feature Key	Data Type
`__key__`	`string`
`__url__`	`string`
`jpg`	`Image`
`json`	`unknown`

Estimated Rows: 7,600

🆔 Identity & Source

id: hf-dataset--nirschl-lab--hpa10m
slug: nirschl-lab--hpa10m
source: huggingface
author: Nirschl Lab
license: CC-BY-SA-4.0
tags: license:cc-by-sa-4.0, region:us, size_categories:1m<n<10m, format:webdataset, modality:image, modality:text, library:datasets, library:webdataset, library:mlcroissant

📊 Engagement & Metrics

downloads: 31,817
stars: 5
forks: 0
