Dayhoff

A dataset by Microsoft

```yaml
---
configs:
- config_name: dayhoffref
  data_files: dayhoffref/arrow/data*.arrow
- config_name: backboneref
  data_files:
  - split: BRn
    path: backboneref/arrow/BRn/data*.arrow
  - split: BRq
    path: backboneref/arrow/BRq/data*.arrow
  - split: BRu
    path: backboneref/arrow/BRu/data*.arrow
- config_name: uniref5...
```
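As a rough sketch of what the front matter above encodes (the `uniref5...` entry is truncated in the source, so only the complete configs are shown), the mapping from config name and split to Arrow file globs can be expressed as a plain dict. The helper name `files_for` is hypothetical, not part of any library API; with the Hugging Face `datasets` library, a `data_files` entry with no named split is exposed as `train`.

```python
from fnmatch import fnmatch

# Config -> split -> file glob, as implied by the YAML front matter above.
# "dayhoffref" declares data_files with no split, which `datasets` treats
# as a single "train" split.
CONFIG_FILES = {
    "dayhoffref": {"train": "dayhoffref/arrow/data*.arrow"},
    "backboneref": {
        "BRn": "backboneref/arrow/BRn/data*.arrow",
        "BRq": "backboneref/arrow/BRq/data*.arrow",
        "BRu": "backboneref/arrow/BRu/data*.arrow",
    },
}

def files_for(config: str, split: str, paths: list[str]) -> list[str]:
    """Return the repo paths that belong to one config/split glob."""
    pattern = CONFIG_FILES[config][split]
    return [p for p in paths if fnmatch(p, pattern)]

paths = [
    "backboneref/arrow/BRq/data-00000-of-00002.arrow",
    "backboneref/arrow/BRn/data-00000-of-00001.arrow",
]
print(files_for("backboneref", "BRq", paths))
```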


πŸ› οΈ Technical Profile

⚑ Hardware & Scale

Size
-
Total Rows
-
Files
1303

🧠 Training & Env

Format
Parquet
Cleaning
Raw

🌐 Cloud & Rights

Source
huggingface
License
Open Access

πŸ‘οΈ Data Preview

feature label split
example_text_1 0 train
example_text_2 1 train
example_text_3 0 test
example_text_4 1 validation
example_text_5 0 train
Showing 5 sample rows. Real-time preview requires login.


Dataset Card for Dayhoff

Dayhoff is an Atlas of both protein sequence data and generative language models: a centralized resource that brings together 3.34 billion protein sequences across 1.7 billion clusters of metagenomic and natural protein sequences (GigaRef), 46 million structure-derived synthetic sequences (BackboneRef), and 16 million multiple sequence alignments (OpenProteinSet). These models can natively predict zero-shot mutation effects on fitness, scaffold structural motifs by conditioning on evolutionary or structural context, and perform guided generation of novel proteins within specified families. Learning from metagenomic and structure-based synthetic data from the Dayhoff Atlas increased the cellular expression rates of generated proteins, highlighting the real-world value of expanding the scale, diversity, and novelty of protein sequence data.

The Dayhoff model architecture combines state-space Mamba layers with Transformer self-attention, interleaved with Mixture-of-Experts modules to maximize capacity while preserving efficiency. It natively handles long contexts, allowing both single sequences and unrolled MSAs to be modeled. Trained with an autoregressive objective in both N→C and C→N directions, Dayhoff supports order-agnostic infilling and scales to billions of parameters.
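The bidirectional objective described above can be illustrated with a minimal sketch (this is not the actual Dayhoff training code, and the function name is hypothetical): each sequence is presented in both the native N→C residue order and the reversed C→N order, which is what enables order-agnostic infilling at generation time.

```python
def directional_views(seq: str) -> dict[str, str]:
    """Return both autoregressive views of a protein sequence:
    the native N->C order and the reversed C->N order."""
    return {"n_to_c": seq, "c_to_n": seq[::-1]}

views = directional_views("MKTAYIAKQR")
print(views["c_to_n"])  # same residues, C-terminus first
```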

Dataset Structure

The Dayhoff models were trained on the Dayhoff Atlas with varying data mixes, which include:

* UniRef50 (UR50) - dataset from UniProt, clustered at 50% sequence identity; contains only cluster representatives.
  * _Splits: train (25 GB), test (26 MB), valid (26 MB)_

* UniRef90 (UR90) - dataset from UniProt, clustered at 90% sequence identity; contains cluster representatives and members.
  * _Splits: train (83 GB), test (90 MB), valid (87 MB)_

* GigaRef (GR) - 3.34B protein sequences across 1.7B clusters of metagenomic and natural protein sequences. There are two subsets of GigaRef:
  * GigaRef-clusters (GR) - includes only cluster representatives and members, no singletons.
    * _Splits: train (433 GB), test (22 MB)_
  * GigaRef-singletons (GR-s) - includes only singletons.
    * _Splits: train (282 GB)_
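The clusters-versus-singletons distinction above amounts to a partition over clusters keyed by representative. A minimal sketch, with hypothetical names and toy data rather than the real GigaRef pipeline:

```python
def partition_clusters(clusters: dict[str, list[str]]):
    """Split clusters into the GigaRef-clusters subset (representative
    plus at least one member) and the GigaRef-singletons subset
    (representative with no members)."""
    clustered = {rep: members for rep, members in clusters.items() if members}
    singletons = [rep for rep, members in clusters.items() if not members]
    return clustered, singletons

clustered, singletons = partition_clusters({
    "seqA": ["seqB", "seqC"],  # cluster with members -> GigaRef-clusters
    "seqD": [],                # lone representative  -> GigaRef-singletons
})
```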

* BackboneRef (BR) - 46M structure-derived synthetic sequences from ca. 240,000 de novo backbones, with three subsets containing 10M sequences each:
  * BackboneRef unfiltered (BRu) - 10M sequences randomly sampled from all 46M designs.
    * _Splits: train (3 GB)_
  * BackboneRef quality (BRq) - 10M sequences sampled from 127,633 backbones whose average self-consistency RMSD ≤ 2 Å.
    * _Splits: train (3 GB)_
  * BackboneRef novelty (BRn) - 10M sequences from 138,044 backbones with a max TM-score < 0.5 to any natural structure.
    * _Splits: train (3 GB)_
  * BackboneRef structures (backboneref-structures) - We also make available the stru
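The subset eligibility rules stated above can be sketched as simple threshold checks. This is an illustrative sketch only: the field names are assumptions, the real subsets are sampled rather than exhaustively assigned, and the structure metrics are computed elsewhere.

```python
def backbone_subsets(backbone: dict) -> set[str]:
    """Return the BackboneRef subsets a backbone is eligible for."""
    subsets = {"BRu"}  # every design is eligible for the unfiltered pool
    if backbone["mean_sc_rmsd"] <= 2.0:       # self-consistency RMSD in Å
        subsets.add("BRq")                    # quality subset
    if backbone["max_tm_to_natural"] < 0.5:   # max TM-score vs natural structures
        subsets.add("BRn")                    # novelty subset
    return subsets

# Contains "BRu" and "BRq" but not "BRn" for this toy backbone.
print(backbone_subsets({"mean_sc_rmsd": 1.4, "max_tm_to_natural": 0.62}))
```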
