Arxiv Papers By Subject
| Entity Passport | |
|---|---|
| Registry ID | hf-dataset--permutans--arxiv-papers-by-subject |
| License | MIT |
| Provider | huggingface |
Cite this dataset
Academic & Research Attribution
@misc{hf_dataset__permutans__arxiv_papers_by_subject,
author = {permutans},
title = {Arxiv Papers By Subject Dataset},
year = {2026},
howpublished = {\url{https://huggingface.co/datasets/permutans/arxiv-papers-by-subject}},
note = {Accessed via Free2AITools Knowledge Fortress}
}
Dataset Specification
arXiv Papers by Subject
A reorganised version of the nick007x/arxiv-papers dataset, partitioned by subject code, year, and month for efficient selective access.
Dataset Description
This dataset contains metadata for over 2.5 million arXiv papers, organised into a hierarchical directory structure that allows users to download only the specific subjects and time periods they need, rather than the entire dataset.
Motivation
The original nick007x/arxiv-papers dataset is an excellent resource containing comprehensive arXiv paper metadata. However, its monolithic structure requires downloading the entire dataset even when only a subset of papers is needed.
This derived dataset addresses that limitation by partitioning the data into small, focused parquet files organised by:
- Subject code (e.g., cs.AI, astro-ph.CO, math.NA)
- Year (1989–2025)
- Month (01–12)
This structure enables:
- Downloading only specific research domains
- Fetching data for particular time ranges
- Incremental updates as new papers are published
- Efficient caching and lazy loading
Dataset Structure
data/
├── astro-ph.CO/
│   ├── 2009/
│   │   ├── 01/
│   │   │   └── 00000000.parquet
│   │   ├── 02/
│   │   │   └── 00000000.parquet
│   │   └── ...
│   └── ...
├── cs.AI/
│   ├── 1993/
│   │   └── ...
│   └── 2025/
│       └── ...
├── cs.LG/
│   └── ...
└── ...
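The layout above maps a subject code, year, and month directly onto a file path. A minimal sketch of that mapping follows; the helper names `partition_path` and `parse_partition_path` are hypothetical, and the `shard` parameter is an assumption (the tree only shows `00000000.parquet`, so partitions with more than one file may or may not exist).

```python
from pathlib import PurePosixPath


def partition_path(subject: str, year: int, month: int, shard: int = 0) -> str:
    """Build the repo-relative path of one parquet file (hypothetical helper)."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    # Month is zero-padded to two digits and the file index to eight,
    # matching the names shown in the directory tree.
    return f"data/{subject}/{year:04d}/{month:02d}/{shard:08d}.parquet"


def parse_partition_path(path: str) -> tuple[str, int, int]:
    """Recover (subject, year, month) from a path ending in
    <subject>/<year>/<month>/<file>.parquet."""
    parts = PurePosixPath(path).parts
    subject, year, month = parts[-4], parts[-3], parts[-2]
    return subject, int(year), int(month)
```

The string returned by `partition_path` can be passed as the `filename` argument of `hf_hub_download`, and `parse_partition_path` works equally on paths inside a local download directory.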
Subject Categories
The dataset includes 148 arXiv subject categories spanning:
| Domain | Example Categories |
|---|---|
| Astrophysics | astro-ph.* x 6 |
| Condensed Matter | cond-mat.* x 9 |
| Computer Science | cs.* x 60 |
| Economics | econ.* x 3 |
| Electrical Engineering | eess.* x 4 |
| Mathematics | math.* x 30 |
| Physics | gr-qc, hep-* x 4, nucl-* x 2, quant-ph, physics.* x 22 |
| Quantitative Biology | q-bio.* x 10 |
| Quantitative Finance | q-fin.* x 8 |
| Statistics | stat.* x 5 |
| Nonlinear Sciences | nlin.* x 5 |
Data Fields
Each parquet file contains the following fields (inherited from the source dataset):
| Field | Type | Description |
|---|---|---|
| `arxiv_id` | string | Unique arXiv identifier (e.g., 2301.00001) |
| `title` | string | Paper title |
| `authors` | list[string] | List of author names |
| `submission_date` | string | Date of submission (e.g., 18 Feb 2009) |
| `comments` | string | Author comments (page count, figures, etc.) |
| `primary_subject` | string | Primary arXiv category with description |
| `subjects` | string | All arXiv categories the paper belongs to |
| `doi` | string | DOI link if available |
| `abstract` | string | Paper abstract |
| `file_path` | string | Path to PDF in the source dataset |
- Note that the ZIP files referenced in `file_path` are located in nick007x/arxiv-papers, not in this dataset!
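Because `submission_date` is stored as a string, it has to be parsed before rows can be sorted or filtered by date. A minimal sketch, assuming every value follows the "18 Feb 2009" pattern given in the table (this has not been checked against the full dataset); `parse_submission_date` is a hypothetical helper name.

```python
from datetime import date, datetime


def parse_submission_date(raw: str) -> date:
    """Parse a 'day abbreviated-month year' string such as '18 Feb 2009'.

    Assumes English month abbreviations; %b is locale-dependent, so this
    relies on the default C/English locale being active.
    """
    return datetime.strptime(raw.strip(), "%d %b %Y").date()


# Strings sort lexicographically, so convert before ordering by date.
samples = ["18 Feb 2009", "03 Jan 2009"]
ordered = sorted(samples, key=parse_submission_date)
```

Values that do not match the assumed pattern raise `ValueError`, which makes unexpected formats visible rather than silently mis-sorted.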
Usage
Loading Specific Subjects and Time Periods
import polars as pl
from huggingface_hub import hf_hub_download
# Download a specific subject/year/month
local_path = hf_hub_download(
repo_id="permutans/arxiv-papers-by-subject",
repo_type="dataset",
filename="data/cs.LG/2024/06/00000000.parquet"
)
# Read the downloaded file into a DataFrame
df = pl.read_parquet(local_path)
Loading Multiple Files with Glob Patterns
from huggingface_hub import snapshot_download
# Download all cs.LG papers from 2024
snapshot_download(
repo_id="permutans/arxiv-papers-by-subject",
repo_type="dataset",
allow_patterns="data/cs.LG/2024/*/*.parquet",
local_dir="./arxiv_data"
)
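`allow_patterns` also accepts a list of patterns, so several subjects and years can be fetched in one call. The sketch below only builds that list and uses the standard library alone; `build_patterns` is a hypothetical helper, and the subjects and years shown are illustrative choices.

```python
from typing import Iterable


def build_patterns(subjects: Iterable[str], years: Iterable[int]) -> list[str]:
    """Return one glob pattern per (subject, year) pair, covering all months."""
    years = list(years)  # allow one-shot iterables to be reused per subject
    return [f"data/{s}/{y}/*/*.parquet" for s in subjects for y in years]


patterns = build_patterns(["cs.LG", "cs.AI"], range(2023, 2025))

# Usage (not executed here, requires huggingface_hub and network access):
# from huggingface_hub import snapshot_download
# snapshot_download(
#     repo_id="permutans/arxiv-papers-by-subject",
#     repo_type="dataset",
#     allow_patterns=patterns,
#     local_dir="./arxiv_data",
# )
```

Re-running the download with an extended year range is one way to pick up newly added partitions without fetching the rest of the dataset.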
Using with Polars LazyFrames
import polars as pl
# Scan multiple files lazily
lf = pl.scan_parquet("arxiv_data/data/cs.*/2024/*/*.parquet")
# Filter and collect only what you need
recent_ml = lf.filter(
pl.col("primary_subject").str.contains("Machine Learning")
).collect()
Dataset Statistics
- Total papers: ~2.55 million
- Subject categories: 167
- Year range: 1989–2025
- File format: Parquet (compressed)
Source Attribution
This dataset is derived from nick007x/arxiv-papers, which provides the complete arXiv scientific papers archive. The original dataset contains both metadata and PDFs; this derived dataset includes only the metadata, reorganised for efficient partial access.
The underlying paper content originates from arXiv.org, operated by Cornell University.
License
This dataset follows the licensing structure of the source:
- Dataset packaging and organisation: MIT License, as for nick007x/arxiv-papers
- Individual paper content: Subject to each paper's license as specified by arXiv and the respective authors
Citation
If you use this dataset, please cite both this reorganised version and the original source:
@dataset{arxiv_papers_by_subject_2025,
title = {arXiv Papers by Subject},
author = {permutans},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/permutans/arxiv-papers-by-subject}
}
@dataset{arxiv_papers_2025,
title = {arXiv Papers Dataset},
author = {nick007x},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/nick007x/arxiv-papers}
}
Dataset Transparency Report
Verified data manifest for traceability and transparency.
Identity & Source
- id: hf-dataset--permutans--arxiv-papers-by-subject
- slug: permutans--arxiv-papers-by-subject
- source: huggingface
- author: permutans
- license: MIT
- tags: task_categories:text-generation, task_categories:feature-extraction, source_datasets:nick007x/arxiv-papers, language:en, license:mit, size_categories:1m
Engagement & Metrics
- downloads: 149,894
- stars: 9
- forks: 0