CRA5 Dataset
Dataset Specification
license: cdla-sharing-1.0
task_categories:
- time-series-forecasting
- compression
tags:
- climate
- weather
- era5
- cra5
pretty_name: CRA5 ERA5 Dataset
size_categories:
- 1T<n<10T
Introduction and getting started
The CRA5 dataset is now available on OneDrive.
Paper Summary
We introduce VAEformer, a variational autoencoder transformer designed for the extreme compression of climate data. Addressing the storage challenges of massive datasets like ERA5, VAEformer utilizes a low-complexity transformer with variance inference to achieve high compression efficiency.
We successfully compressed the 226 TB ERA5 dataset into the 0.7 TB CRA5 dataset, achieving a >300x compression ratio. Despite this extreme reduction, CRA5 retains high scientific utility; global weather forecasting models trained on CRA5 achieve accuracy comparable to those trained on the original data, significantly lowering the barrier for AI-based meteorological research.
CRA5 is an extremely compressed version of the popular ERA5 reanalysis dataset. The repository also includes compression and forecasting models so that researchers can conduct portable weather and climate research.
CRA5 currently provides:
- A customized variational transformer (VAEformer) for climate data compression
- The CRA5 dataset: less than 1 TiB, yet carrying the same information as the 400+ TiB ERA5 dataset, covering hourly ERA5 from 1979 to 2023
- A pre-trained auto-encoder on climate/weather data to support further weather research
Note: Multi-GPU support is now experimental.
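To put the hourly coverage in perspective, here is a rough back-of-envelope estimate. It assumes one file per hour and the 0.7 TB total reported in the paper summary; the per-frame figure is an average, not a measured value:

```python
from datetime import date

# Hourly frames from 1979-01-01 through 2023-12-31 (leap days included).
days = (date(2024, 1, 1) - date(1979, 1, 1)).days
frames = days * 24
print(frames)                                   # 394464 hourly frames
print(f"{0.7e12 / frames / 1e6:.1f} MB/frame")  # implied average compressed frame size
```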
Installation
CRA5 supports Python 3.8+ and PyTorch 1.7+.
```shell
conda create --name cra5 python=3.10 -y
conda activate cra5
```
Please install cra5 from source. A C++17 compiler, a recent version of pip (19.0+), and common Python packages are also required (see setup.py for the full list).
To install the development version of CRA5 locally, run the following commands in a virtual environment:
```shell
git clone https://github.com/taohan10200/CRA5
cd CRA5
pip install -U pip && pip install -e .
```
Test
```shell
python test.py
```
Usage
Using the API:
Supported functions include compression, decompression, latent representation, feature visualization, and reconstruction visualization.
Step 1: Download from Hugging Face
We provide a simple way to download the data using huggingface_hub.
```python
from huggingface_hub import hf_hub_download

# Download the CRA5 binary file for a specific timestamp (e.g., 2022-01-01T00:00:00).
local_bin_path = hf_hub_download(
    repo_id="taohan10200/CRA5-Dataset",
    repo_type="dataset",
    filename="2022/2022-01-01T00:00:00.bin",
    local_dir="./data/CRA5",
)
print(f"Downloaded to: {local_bin_path}")
```
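To fetch more than one timestamp, the same call can be wrapped in a loop. The sketch below only builds the file names, assuming every hourly file follows the `YYYY/YYYY-MM-DDTHH:00:00.bin` pattern seen in the example above:

```python
from datetime import datetime, timedelta

def cra5_filename(ts: datetime) -> str:
    # Assumed layout (matching the example above): one file per hour,
    # named YYYY/YYYY-MM-DDTHH:00:00.bin
    return f"{ts.year}/{ts:%Y-%m-%dT%H:%M:%S}.bin"

start = datetime(2022, 1, 1, 0)
filenames = [cra5_filename(start + timedelta(hours=h)) for h in range(6)]
print(filenames[0])   # 2022/2022-01-01T00:00:00.bin
print(filenames[-1])  # 2022/2022-01-01T05:00:00.bin
```

Pass each generated name as the `filename` argument of `hf_hub_download`, exactly as in Step 1.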
Step 2: Use the CRA5 API for Decompression and Visualization

```python
# We provide a downloader to fetch the original ERA5 NetCDF files for testing, e.g.
# data/ERA5/2024/2024-06-01T00:00:00_pressure.nc (513 MiB) and
# data/ERA5/2024/2024-06-01T00:00:00_single.nc (18 MiB).
from cra5.api.era5_downloader import era5_downloader

ERA5_data = era5_downloader('./cra5/api/era5_config.py')  # dataset config specifying what to download
data = ERA5_data.get_form_timestamp(time_stamp="2024-06-01T00:00:00",
                                    local_root='./data/ERA5')
```
After getting the ERA5 data ready, you can explore the compression.
```python
from cra5.api import cra5_api

cra5_API = cra5_api()

# ======================= compression functions =======================
# Return a continuous latent y for the ERA5 data at 2024-06-01T00:00:00.
y = cra5_API.encode_to_latent(time_stamp="2024-06-01T00:00:00")

# Return the arithmetic-coded binary stream of y.
bin_stream = cra5_API.latent_to_bin(y=y)

# Or directly compress and save the binary stream to a folder.
cra5_API.encode_era5_as_bin(time_stamp="2024-06-01T00:00:00", save_root='./data/cra5')

# ======================= decompression functions =======================
# Decode the binary file; note that decoding from binary recovers the quantized latent.
y_hat = cra5_API.bin_to_latent(bin_path="./data/CRA5/2024/2024-06-01T00:00:00.bin")

# Return the normalized CRA5 data.
normalized_x_hat = cra5_API.latent_to_reconstruction(y_hat=y_hat)

# If you have saved or downloaded the binary file, you can restore it directly.
normalized_x_hat = cra5_API.decode_from_bin("2024-06-01T00:00:00", return_format='normalized')  # normalized CRA5 data
x_hat = cra5_API.decode_from_bin("2024-06-01T00:00:00", return_format='de_normalized')          # de-normalized CRA5 data

# Show some channels of the latent.
cra5_API.show_latent(
    latent=y_hat.squeeze(0).cpu().numpy(),
    time_stamp="2024-06-01T00:00:00",
    show_channels=[0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150],
    save_path='./data/vis')

# Show some variables of the reconstructed data.
cra5_API.show_image(
    reconstruct_data=x_hat.cpu().numpy(),
    time_stamp="2024-06-01T00:00:00",
    show_variables=['z_500', 'q_500', 'u_500', 'v_500', 't_500', 'w_500'],
    save_path='./data/vis')
```
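As a quick sanity check on compression performance, you can compare the saved `.bin` file against the raw field it encodes. This is a minimal sketch under the assumption that a frame stores 268 float32 variables on the 721 x 1440 grid (the tensor shape used elsewhere in this README); the path is the example timestamp from above:

```python
import os

# Assumed raw frame: 268 float32 variables on the 721 x 1440 ERA5 grid.
raw_bytes = 268 * 721 * 1440 * 4

bin_path = "./data/CRA5/2024/2024-06-01T00:00:00.bin"
if os.path.exists(bin_path):
    compressed_bytes = os.path.getsize(bin_path)
    print(f"raw:        {raw_bytes / 2**20:.1f} MiB")
    print(f"compressed: {compressed_bytes / 2**20:.1f} MiB")
    print(f"ratio:      {raw_bytes / compressed_bytes:.0f}x")
else:
    print("Run the compression step first to produce the .bin file.")
```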
Or use the pre-trained model directly:

```python
import torch

from cra5.models.compressai.zoo import vaeformer_pretrained

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

net = vaeformer_pretrained(quality=268, pretrained=True).eval().to(device)

# Proxy weather data; in practice this should be a normalized ERA5 frame of shape (1, 268, 721, 1440).
x = torch.rand(1, 268, 721, 1440).to(device)
print(x.shape)

with torch.no_grad():
    out_net = net.compress(x)
print(out_net)
```
Features
1. The CRA5 dataset is a product of applying VAEformer to atmospheric science. We release it to facilitate research in weather and climate.
- Train large data-driven numerical weather forecasting models with CRA5.
Note: For researchers who lack the disk space to store the 300+ TiB ERA5 dataset but want to train a large weather forecasting model such as FengWu-GHR, this work shrinks it to less than 1 TiB of disk.
Our preliminary experiments show that models trained on CRA5 perform very similarly to NWP models trained on the original ERA5 dataset. With this dataset, you can also easily train a Nature-published forecasting model such as Pangu-Weather.
2. VAEformer is a powerful compression model; we hope it can be extended to other domains, such as image and video compression.
3. VAEformer is built on an auto-encoder-decoder, and we provide a pre-trained VAE for weather research.
- Use it as an auto-encoder-decoder.
Note: For people interested in diffusion-based or other generation-based forecasting methods, you can use our VAEformer to obtain latents for downstream research.



