
Dataset

Olympiads Ref

by Ai Mo hf-dataset--ai-mo--olympiads-ref



Dataset Information Summary
Entity Passport
Registry ID	hf-dataset--ai-mo--olympiads-ref
Provider	huggingface


Cite this dataset

Academic & Research Attribution

BibTeX

@misc{hf_dataset__ai_mo__olympiads_ref,
  author = {Ai Mo},
  title = {Olympiads Ref Dataset},
  year = {2026},
  howpublished = {\url{https://huggingface.co/datasets/AI-MO/olympiads-ref}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}

APA Style

Ai Mo. (2026). Olympiads Ref [Dataset]. Free2AITools. https://huggingface.co/datasets/AI-MO/olympiads-ref


⬇️

Downloads

16,783


Dataset Specification

pretty_name: Olympiads Reference Dataset
dataset_info:
  features:
  - name: year
    dtype: string
  - name: tier
    dtype: string
  - name: problem_label
    dtype: string
  - name: problem_type
    dtype: string
  - name: exam
    dtype: string
  - name: problem
    dtype: string
  - name: solution
    dtype: string
  - name: metadata
    struct:
    - name: resource_path
      dtype: string
    - name: problem_match
      dtype: string
    - name: solution_match
      dtype: string
configs:
- config_name: default
  data_files:
  - split: train
    path: '**/segmented/**/*.jsonl'

AI-MO Olympiad Reference Dataset

This dataset contains a structured collection of Olympiad problems and their solutions,
organized by competition. It aims for high-quality data and prioritizes "official" solutions to problems.

Structure

<competition name>/    # Problems and solutions from one competition (e.g. the International Mathematical Olympiad)
├── raw/               # Raw problem/solution statements (.pdf)
│   ├── file1.pdf
│   ├── file2.pdf
├── download_script/   # the scripts used to download raw data
│   ├── download.py    
├── md/                # .md files generated from raw/ files
│   ├── file1.md
│   ├── file2.md
├── segment_script/    # the scripts used to segment the data
│   ├── segment.py     
└── segmented/         # .jsonl segmented data for easier processing
    ├── file1.jsonl
    ├── file2.jsonl
    └── file3.jsonl
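
The layout above can be walked programmatically. The following is a minimal sketch, assuming a local clone whose top-level directories are competitions and whose segmented files sit directly under `segmented/`; the function name `iter_segmented_records` is hypothetical and not part of the repository.

```python
import json
from pathlib import Path
from typing import Iterator, Tuple


def iter_segmented_records(repo_root: Path) -> Iterator[Tuple[str, dict]]:
    """Yield (competition, record) pairs from <competition>/segmented/*.jsonl."""
    # Assumption: one directory level per competition, as in the tree above.
    for jsonl_path in sorted(repo_root.glob("*/segmented/*.jsonl")):
        competition = jsonl_path.parent.parent.name
        with jsonl_path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # tolerate blank lines between records
                    yield competition, json.loads(line)
```

Usage would look like `for competition, record in iter_segmented_records(Path("olympiads-ref")): ...`, which keeps the competition name available alongside each record.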

Each line in a .jsonl file is a JSON object that follows this structure:

{
 "problem": "string",        // Mandatory: The problem statement in LaTeX or Markdown
 "solution": "string",       // Mandatory: The solution to the problem
 "year": "int",              // Optional: Year when the problem was presented
 "problem_type": "string",   // Optional: The mathematical domain of the problem. Supported types:
                             // ['Algebra', 'Geometry', 'Number Theory', 'Combinatorics', 'Calculus',
                             // 'Inequalities', 'Logic and Puzzles', 'Other']
 "question_type": "string",  // Optional: The form or style of the mathematical problem.
                             // Supported classes: ['MCQ', 'proof', 'math-word-problem'].
                             // 'math-word-problem' is a problem with an output.
 "answer": "string",         // Optional: The final answer if the question_type is "math-word-problem".
 "source": "string",         // Optional: TODO:describe
 "exam": "string",           // Optional: TODO:describe
 "difficulty": "int",        // Optional: TODO:describe
 "other": "..."              // Optional: You can add other fields with metadata
}
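
As a sanity check on segmented output, records can be validated against the structure above. This is a minimal sketch, not the project's own tooling; `validate_record` is a hypothetical helper, and it only checks the mandatory fields and the two enumerated fields listed in the template.

```python
MANDATORY_FIELDS = ("problem", "solution")
PROBLEM_TYPES = {
    "Algebra", "Geometry", "Number Theory", "Combinatorics",
    "Calculus", "Inequalities", "Logic and Puzzles", "Other",
}
QUESTION_TYPES = {"MCQ", "proof", "math-word-problem"}


def validate_record(record: dict) -> list:
    """Return a list of human-readable problems found in one record."""
    errors = []
    for field in MANDATORY_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"missing or empty mandatory field: {field}")
    # Optional fields are only checked when present.
    problem_type = record.get("problem_type")
    if problem_type is not None and problem_type not in PROBLEM_TYPES:
        errors.append(f"unsupported problem_type: {problem_type}")
    question_type = record.get("question_type")
    if question_type is not None and question_type not in QUESTION_TYPES:
        errors.append(f"unsupported question_type: {question_type}")
    return errors
```

An empty list means the record passed these checks; anything else is worth inspecting before upload.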

Steps to collect data for formalization

1. Assign yourself a task

Check the tracker and assign yourself one line by updating columns:

status: IN PROGRESS
assignee: your name

2. Setup

Download data locally.

git lfs install
git clone git@hf.co:datasets/AI-MO/olympiads-ref

3. Find `.pdf` resources.

First check whether .pdf files are already available in https://huggingface.co/AI-MO/olympiads-0.1

If yes, upload them to AI-MO/olympiads-ref/<competition>/raw/ and continue to step 4.
If no, find sources on the internet (preferably with official solutions), download them, and upload them to AI-MO/olympiads-ref/<competition>/raw/

4. Find `.md` resources.

First check whether .md files are already available in https://huggingface.co/AI-MO/olympiads-0.1

If yes, upload them to AI-MO/olympiads-ref/<competition>/md/ and continue to step 6.
If no, find sources on the internet (preferably with official solutions), download them, and upload them to AI-MO/olympiads-ref/<competition>/md/

5. Convert `.pdf` to `.md` using Mathpix

Use data_pipeline.
Example:

python -m data_pipeline convert_to_md --method=pdf_to_md --input_dir="/home/marvin/workspace/olympiads-ref/IMO/raw" --output_dir="/home/marvin/workspace/olympiads-ref/IMO/md"

6. Find `.jsonl` resources.

First check whether segmented .jsonl files are already available in https://huggingface.co/datasets/AI-MO/olympiads-0.3. You can check whether the segmentation has been done in the old tracker.

If yes, check the quality, upload the files to AI-MO/olympiads-ref/<competition>/segmented/, and continue to step 8.
If no, continue to step 7.

7. Segment the `.md` files into `.jsonl`

Write a segment.py that can be applied to your data (please do sanity checks!). Examples are this or that. Once you are satisfied with your segmentation, upload the .jsonl files to AI-MO/olympiads-ref/<competition>/segmented/ and segment.py to AI-MO/olympiads-ref/<competition>/segment_script/.

Ask for a review.
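
What a segment.py does depends entirely on how each competition's .md files are laid out. The sketch below assumes a hypothetical layout in which every problem starts with a heading such as `## Problem 3` and every solution with a heading starting with `Solution`; real files will usually need different patterns. It emits one record per solution, so a problem with two solutions yields two records.

```python
import json
import re

# Assumed heading formats; adjust to the actual .md files.
PROBLEM_HEADING = re.compile(r"^#+[ \t]*Problem[ \t]+(\d+)[ \t]*$", re.MULTILINE)
SOLUTION_HEADING = re.compile(r"^#+[ \t]*Solution\b[^\n]*$", re.MULTILINE)


def segment_markdown(text: str, exam: str, year: str) -> list:
    """Split one Markdown document into problem/solution records."""
    records = []
    problems = list(PROBLEM_HEADING.finditer(text))
    for i, match in enumerate(problems):
        end = problems[i + 1].start() if i + 1 < len(problems) else len(text)
        body = text[match.end():end]
        heads = list(SOLUTION_HEADING.finditer(body))
        if not heads:
            continue  # sanity check: skip problems without a solution
        problem = body[:heads[0].start()].strip()
        for j, head in enumerate(heads):
            stop = heads[j + 1].start() if j + 1 < len(heads) else len(body)
            solution = body[head.end():stop].strip()
            if problem and solution:
                records.append({
                    "problem": problem,
                    "solution": solution,
                    "year": year,
                    "exam": exam,
                    "problem_label": match.group(1),
                })
    return records


def to_jsonl(records: list) -> str:
    """Serialize records as one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)
```

Skipping problems with no solution heading is one possible sanity check; logging them instead would make gaps easier to spot during review.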

8. Update the status in the trackers

Update the tracker with columns:

status: DONE + a link to your generated data in hf
problem_count: count of problems in data
solution_count: count of solutions in data (different than problem_count since a problem can have several solutions)
years: range of competition years covered in your data (so we can easily track if many years are missing)
assignee: your name
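
The tracker counts can be derived from the segmented records. This is a minimal sketch under two assumptions that the source does not state: each record holds exactly one solution, and identical problem text means the same problem. `tracker_summary` is a hypothetical helper name.

```python
def tracker_summary(records: list) -> dict:
    """Compute problem_count, solution_count and year range for the tracker."""
    # Assumption: one record per solution; distinct problem text = distinct problem.
    problems = {record["problem"] for record in records}
    years = sorted({
        int(record["year"])
        for record in records
        if str(record.get("year", "")).isdigit()  # ignore missing or malformed years
    })
    return {
        "problem_count": len(problems),
        "solution_count": len(records),
        "years": f"{years[0]}-{years[-1]}" if years else "",
    }
```

Records without a usable year are still counted as problems and solutions; they are only left out of the year range.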

Update the old tracker with this column:

ref: color in green for the competition you segmented

9. Integrate the data in a base dataset

Create a ticket in git

Notes

Image placeholders in the dataset (like: ![md5:f571b12c2c566ce1beedd8190c986910](f571b12c2c566ce1beedd8190c986910.jpeg)) correspond to actual images stored in the images.parquet file.
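
To find which images a problem or solution refers to, the placeholders can be extracted with a regular expression. The sketch below assumes every placeholder follows the pattern of the example above (a 32-character lowercase hex digest after `md5:`). How the digest maps to rows of images.parquet is not described here, so the lookup itself is left out.

```python
import re

# Matches ![md5:<32 hex chars>](<file name>) as in the example placeholder.
IMAGE_PLACEHOLDER = re.compile(r"!\[md5:([0-9a-f]{32})\]\(([^)\s]+)\)")


def find_image_placeholders(text: str) -> list:
    """Return (md5, file_name) pairs for every image placeholder in text."""
    return [(m.group(1), m.group(2)) for m in IMAGE_PLACEHOLDER.finditer(text)]
```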


🆔 Identity & Source

id: hf-dataset--ai-mo--olympiads-ref
source: huggingface
author: Ai Mo
tags: size_categories:10k, format:json, modality:document, modality:text, library:datasets, library:dask, library:mlcroissant, region:us


📊 Engagement & Metrics

likes: 4
downloads: 16,783
