Arxiv Papers By Subject

Name: Arxiv Papers By Subject
Creator: permutans
License: MIT

Nexus Index

37.8 Top 100%

S: Semantic 50

A: Authority 0

P: Popularity 62

R: Recency 53

Q: Quality 30

Dataset Information Summary
Entity Passport
Registry ID	hf-dataset--permutans--arxiv-papers-by-subject
License	MIT
Provider	huggingface

📜

Cite this dataset

Academic & Research Attribution

BibTeX

@misc{hf_dataset__permutans__arxiv_papers_by_subject,
  author = {permutans},
  title = {Arxiv Papers By Subject Dataset},
  year = {2026},
  howpublished = {\url{https://huggingface.co/datasets/permutans/arxiv-papers-by-subject}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}

APA Style

permutans. (2026). Arxiv Papers By Subject [Dataset]. Free2AITools. https://huggingface.co/datasets/permutans/arxiv-papers-by-subject

Downloads: 231,833

Dataset Specification

arXiv Papers by Subject

A reorganised version of the nick007x/arxiv-papers dataset, partitioned by subject code, year, and month for efficient selective access.

Dataset Description

This dataset contains metadata for over 2.5 million arXiv papers, organised into a hierarchical directory structure that allows users to download only the specific subjects and time periods they need, rather than the entire dataset.

Motivation

The original nick007x/arxiv-papers dataset is an excellent resource containing comprehensive arXiv paper metadata. However, its monolithic structure requires downloading the entire dataset even when only a subset of papers is needed.

This derived dataset addresses that limitation by partitioning the data into small, focused parquet files organised by:

Subject code (e.g., cs.AI, astro-ph.CO, math.NA)
Year (1989–2025)
Month (01–12)

This structure enables:

Downloading only specific research domains
Fetching data for particular time ranges
Incremental updates as new papers are published
Efficient caching and lazy loading

Dataset Structure

text

data/
├── astro-ph.CO/
│   ├── 2009/
│   │   ├── 01/
│   │   │   └── 00000000.parquet
│   │   ├── 02/
│   │   │   └── 00000000.parquet
│   │   └── ...
│   └── ...
├── cs.AI/
│   ├── 1993/
│   │   └── ...
│   └── 2025/
│       └── ...
├── cs.LG/
│   └── ...
└── ...
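The layout above maps each subject, year, and month to a predictable path, so paths can be built and parsed programmatically. The following is a minimal sketch under assumptions: the helper names `partition_path` and `parse_partition_path` are hypothetical, and it assumes every partition contains a single file named `00000000.parquet`, as shown in the tree. Larger partitions may hold additional files.

```python
from typing import NamedTuple


class Partition(NamedTuple):
    subject: str
    year: int
    month: int


def partition_path(subject: str, year: int, month: int) -> str:
    """Build the repo-relative path of one subject/year/month partition."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    # Assumption: one file per partition, named 00000000.parquet.
    return f"data/{subject}/{year:04d}/{month:02d}/00000000.parquet"


def parse_partition_path(path: str) -> Partition:
    """Recover (subject, year, month) from a path built by partition_path."""
    root, subject, year, month, _filename = path.split("/")
    if root != "data":
        raise ValueError(f"unexpected path root: {root!r}")
    return Partition(subject, int(year), int(month))


print(partition_path("cs.LG", 2024, 6))
```

The resulting string can be passed as the `filename` argument when downloading a single file from the Hub.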

Subject Categories

The dataset includes 148 arXiv subject categories spanning:

Domain	Categories (pattern x count)
Astrophysics	`astro-ph.*` x 6
Condensed Matter	`cond-mat.*` x 9
Computer Science	`cs.*` x 60
Economics	`econ.*` x 3
Electrical Engineering	`eess.*` x 4
Mathematics	`math.*` x 30
Physics	`gr-qc`, `hep-` x 4, `nucl-` x 2, `quant-ph`, `physics.*` x 22
Quantitative Biology	`q-bio.*` x 10
Quantitative Finance	`q-fin.*` x 8
Statistics	`stat.*` x 5
Nonlinear Sciences	`nlin.*` x 5

Data Fields

Each parquet file contains the following fields (inherited from the source dataset):

Field	Type	Description
`arxiv_id`	string	Unique arXiv identifier (e.g., `2301.00001`)
`title`	string	Paper title
`authors`	list[string]	List of author names
`submission_date`	string	Date of submission (e.g., `18 Feb 2009`)
`comments`	string	Author comments (page count, figures, etc.)
`primary_subject`	string	Primary arXiv category with description
`subjects`	string	All arXiv categories the paper belongs to
`doi`	string	DOI link if available
`abstract`	string	Paper abstract
`file_path`	string	Path to PDF in the source dataset

Note that the ZIP files referenced in `file_path` are located in nick007x/arxiv-papers, not in this dataset.
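Because `submission_date` is stored as a string, sorting or filtering by date requires parsing it first. The sketch below assumes every value follows the `18 Feb 2009` pattern shown in the table; the helper name `parse_submission_date` is hypothetical, and rows in other formats would raise a `ValueError`.

```python
from datetime import date, datetime


def parse_submission_date(raw: str) -> date:
    """Parse a date such as '18 Feb 2009' into a datetime.date."""
    # %b matches abbreviated English month names under the default C locale.
    return datetime.strptime(raw.strip(), "%d %b %Y").date()


parsed = parse_submission_date("18 Feb 2009")
print(parsed.isoformat())  # 2009-02-18
```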

Usage

Loading Specific Subjects and Time Periods

python

import polars as pl
from huggingface_hub import hf_hub_download

# Download the single parquet file for one subject/year/month partition
local_path = hf_hub_download(
    repo_id="permutans/arxiv-papers-by-subject",
    repo_type="dataset",
    filename="data/cs.LG/2024/06/00000000.parquet",
)

# Read the downloaded file into a Polars DataFrame
df = pl.read_parquet(local_path)

Loading Multiple Files with Glob Patterns

python

from huggingface_hub import snapshot_download

# Download all cs.LG papers from 2024
snapshot_download(
    repo_id="permutans/arxiv-papers-by-subject",
    repo_type="dataset",
    allow_patterns="data/cs.LG/2024/*/*.parquet",
    local_dir="./arxiv_data"
)
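To fetch several subjects or years in one call, `allow_patterns` also accepts a list of patterns. The following sketch builds such a list; the helper name `build_allow_patterns` is hypothetical, and the local `fnmatchcase` check only approximates how the Hub client filters files.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def build_allow_patterns(subjects: Iterable[str], years: Iterable[int]) -> List[str]:
    """Return one glob pattern per (subject, year) pair, covering all months."""
    years = list(years)  # materialise so the years can be reused per subject
    return [f"data/{s}/{y}/*/*.parquet" for s in subjects for y in years]


patterns = build_allow_patterns(["cs.AI", "cs.LG"], range(2023, 2025))

# Local sanity check: a known path layout should match one of the patterns.
sample = "data/cs.LG/2024/06/00000000.parquet"
print(any(fnmatchcase(sample, p) for p in patterns))
```

The resulting `patterns` list can then be passed as `allow_patterns=patterns` to `snapshot_download`.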

Using with Polars LazyFrames

python

import polars as pl

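# Lazily scan every cs.* file for 2024 found under ./arxiv_data
# (only files that were downloaded beforehand are matched)
lf = pl.scan_parquet("arxiv_data/data/cs.*/2024/*/*.parquet")

# Filter on the primary subject, then collect only the matching rows
recent_ml = lf.filter(
    pl.col("primary_subject").str.contains("Machine Learning")
).collect()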

Dataset Statistics

Total papers: ~2.55 million
Subject categories: 167
Year range: 1989–2025
File format: Parquet (compressed)

Source Attribution

This dataset is derived from nick007x/arxiv-papers, which provides the complete arXiv scientific papers archive. The original dataset contains both metadata and PDFs; this derived dataset includes only the metadata, reorganised for efficient partial access.

The underlying paper content originates from arXiv.org, operated by Cornell University.

License

This dataset follows the licensing structure of the source:

Dataset packaging and organisation: MIT License, as for nick007x/arxiv-papers
Individual paper content: Subject to each paper's license as specified by arXiv and the respective authors

Citation

If you use this dataset, please cite both this reorganised version and the original source:

bibtex

@dataset{arxiv_papers_by_subject_2025,
  title = {arXiv Papers by Subject},
  author = {permutans},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/permutans/arxiv-papers-by-subject}
}

@dataset{arxiv_papers_2025,
  title = {arXiv Papers Dataset},
  author = {nick007x},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/nick007x/arxiv-papers}
}


🛡️ Dataset Transparency Report

Technical metadata sourced from upstream repositories.

Open Metadata

🆔 Identity & Source

id: hf-dataset--permutans--arxiv-papers-by-subject
slug: permutans--arxiv-papers-by-subject
source: huggingface
author: permutans
license: MIT
tags: task_categories:text-generation, task_categories:feature-extraction, source_datasets:nick007x/arxiv-papers, language:en, license:mit, size_categories:1m<n<10m, region:us, arxiv, academic-papers, scientific-literature, research, metadata

📊 Engagement & Metrics

downloads: 231,833
stars: 9
forks: 0

Data indexed from public sources. Updated daily.
