Olympiads Ref
Cite this dataset
Academic & Research Attribution
@misc{hf_dataset__ai_mo__olympiads_ref,
author = {Ai Mo},
title = {Olympiads Ref Dataset},
year = {2026},
howpublished = {\url{https://huggingface.co/datasets/AI-MO/olympiads-ref}},
note = {Accessed via Free2AITools Knowledge Fortress}
}
Dataset Specification
pretty_name: Olympiads Reference Dataset
dataset_info:
  features:
  - name: year
    dtype: string
  - name: tier
    dtype: string
  - name: problem_label
    dtype: string
  - name: problem_type
    dtype: string
  - name: exam
    dtype: string
  - name: problem
    dtype: string
  - name: solution
    dtype: string
  - name: metadata
    struct:
    - name: resource_path
      dtype: string
    - name: problem_match
      dtype: string
    - name: solution_match
      dtype: string
configs:
- config_name: default
  data_files:
  - split: train
    path: '**/segmented/**/*.jsonl'
  - split: train
AI-MO Olympiad Reference Dataset
This dataset contains a structured collection of Olympiad problems and their solutions,
organized by competition. It aims for high-quality data and prioritizes "official" solutions to problems.
Structure
<competition name>/        # Problems and solutions from the International Mathematical Olympiad
├── raw/                   # Raw problem/solution statements (.pdf)
│   ├── file1.pdf
│   └── file2.pdf
├── download_script/       # the scripts used to download raw data
│   └── download.py
├── md/                    # .md files generated from raw/ files
│   ├── file1.md
│   └── file2.md
├── segment_script/        # the scripts used to segment the data
│   └── segment.py
└── segmented/             # .jsonl segmented data for easier processing
    ├── file1.jsonl
    ├── file2.jsonl
    └── file3.jsonl
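As a rough illustration of how this layout can be consumed, the sketch below walks a local clone and yields every record found under `<competition>/segmented/`. The function name `iter_segmented_records` and the `_competition` and `_source_line` keys are hypothetical helpers for this example, not part of the dataset, and the sketch assumes the flat one-directory-per-competition layout shown in the tree.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_segmented_records(repo_root: Path) -> Iterator[dict]:
    """Yield every JSON record found under <competition>/segmented/*.jsonl."""
    # Assumption: one directory per competition, each holding a flat
    # segmented/ folder, as in the tree above.
    for jsonl_path in sorted(repo_root.glob("*/segmented/*.jsonl")):
        competition = jsonl_path.parent.parent.name
        with jsonl_path.open(encoding="utf-8") as handle:
            for line_number, line in enumerate(handle, start=1):
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines between records
                record = json.loads(line)
                # Hypothetical provenance keys, added only for debugging.
                record["_competition"] = competition
                record["_source_line"] = f"{jsonl_path.name}:{line_number}"
                yield record
```

Keeping the file name and line number next to each record makes it easier to trace a bad record back to the file that produced it.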
Each JSON object in a .jsonl file follows this structure:
{
  "problem": "string", // Mandatory: The problem statement in LaTeX or Markdown
  "solution": "string", // Mandatory: The solution to the problem
  "year": "int", // Optional: Year when the problem was presented
  "problem_type": "string", // Optional: The mathematical domain of the problem. Supported types:
    // ['Algebra', 'Geometry', 'Number Theory', 'Combinatorics', 'Calculus',
    // 'Inequalities', 'Logic and Puzzles', 'Other']
  "question_type": "string", // Optional: The form or style of the mathematical problem.
    // Supported classes: ['MCQ', 'proof', 'math-word-problem'].
    // A 'math-word-problem' is a problem with an output.
  "answer": "string", // Optional: Final answer if the question_type is "math-word-problem".
  "source": "string", // Optional: TODO:describe
  "exam": "string", // Optional: TODO:describe
  "difficulty": "int", // Optional: TODO:describe
  "other": "..." // Optional: You can add other fields with metadata
}
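The structure above can be checked mechanically before uploading. The following is a minimal sketch, not the project's own validation code: `validate_record` is a hypothetical helper that only enforces what the structure states (two mandatory string fields, and the listed values for `problem_type` and `question_type` when present).

```python
from typing import Any

# Supported values copied from the record structure above.
PROBLEM_TYPES = {
    "Algebra", "Geometry", "Number Theory", "Combinatorics",
    "Calculus", "Inequalities", "Logic and Puzzles", "Other",
}
QUESTION_TYPES = {"MCQ", "proof", "math-word-problem"}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; empty means the record passes."""
    errors = []
    # "problem" and "solution" are the only mandatory fields.
    for field in ("problem", "solution"):
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"missing or empty mandatory field: {field}")
    # Optional fields are only checked when they are present.
    problem_type = record.get("problem_type")
    if problem_type is not None and problem_type not in PROBLEM_TYPES:
        errors.append(f"unsupported problem_type: {problem_type!r}")
    question_type = record.get("question_type")
    if question_type is not None and question_type not in QUESTION_TYPES:
        errors.append(f"unsupported question_type: {question_type!r}")
    return errors
```

Returning a list of messages rather than raising on the first failure lets a reviewer see every issue in a file in one pass.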
Steps to collect data for formalization
1. Assign yourself a task
Check the tracker and assign yourself one row by updating these columns:
- status: IN PROGRESS
- assignee: your name
2. Setup
Download data locally.
git lfs install
git clone git@hf.co:datasets/AI-MO/olympiads-ref
3. Find .pdf resources.
First check whether .pdf files are already available in https://huggingface.co/AI-MO/olympiads-0.1
- If yes, upload them to AI-MO/olympiads-ref/<competition>/raw/ and continue to step 4.
- If no, find sources on the internet (preferably with official solutions), then download and upload them to AI-MO/olympiads-ref/<competition>/raw/
4. Find .md resources.
First check whether .md files are already available in https://huggingface.co/AI-MO/olympiads-0.1
- If yes, upload them to AI-MO/olympiads-ref/<competition>/md/ and continue to step 6.
- If no, find sources on the internet (preferably with official solutions), then download and upload them to AI-MO/olympiads-ref/<competition>/md/
5. Convert .pdf to .md using Mathpix
Use data_pipeline.
Example:
python -m data_pipeline convert_to_md --method=pdf_to_md --input_dir="/home/marvin/workspace/olympiads-ref/IMO/raw" --output_dir="/home/marvin/workspace/olympiads-ref/IMO/md"
6. Find .jsonl resources.
First check whether segmented .jsonl files are already available in https://huggingface.co/datasets/AI-MO/olympiads-0.3. You can check whether the segmentation has been done in this old tracker.
- If yes, check the quality, upload them to AI-MO/olympiads-ref/<competition>/segmented/, and continue to step 8.
- If no, continue to step 7.
7. Segment the .md files into .jsonl
Write a segment.py that can be applied to your data (please do sanity checks!). Examples are this or that. Once you are satisfied with your segmentation, upload the .jsonl files to AI-MO/olympiads-ref/<competition>/segmented/ and the segment.py to AI-MO/olympiads-ref/<competition>/segment_script/.
Ask for a review.
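For orientation only, here is a minimal sketch of what a segment.py might look like. It is not one of the project's scripts. It assumes a purely hypothetical markdown convention in which each problem starts with a heading such as `## Problem 1` and its solution follows a `## Solution` heading; real converted files vary by competition and will need their own patterns. The names `segment_markdown`, `sanity_check`, and `to_jsonl` are illustrative. Everything after the first solution heading is kept as a single solution, so problems with several solutions would need extra splitting.

```python
import json
import re

# Hypothetical heading convention; adapt these patterns to each competition.
PROBLEM_HEADING = re.compile(r"^#+[ \t]*Problem[ \t]+(\S+?)\.?[ \t]*$", re.MULTILINE)
SOLUTION_HEADING = re.compile(r"^#+[ \t]*Solution\b.*$", re.MULTILINE)


def segment_markdown(markdown: str) -> list[dict]:
    """Split a markdown document into problem/solution records."""
    records = []
    headings = list(PROBLEM_HEADING.finditer(markdown))
    for index, heading in enumerate(headings):
        # A problem's text runs until the next problem heading or the end.
        end = headings[index + 1].start() if index + 1 < len(headings) else len(markdown)
        body = markdown[heading.end():end]
        solution_match = SOLUTION_HEADING.search(body)
        if solution_match is None:
            continue  # no solution found: skip rather than guess
        records.append({
            "problem_label": heading.group(1),
            "problem": body[:solution_match.start()].strip(),
            "solution": body[solution_match.end():].strip(),
        })
    return records


def sanity_check(records: list[dict]) -> None:
    """Basic checks to run before uploading the segmented data."""
    assert records, "no problems were segmented"
    labels = [record["problem_label"] for record in records]
    assert len(labels) == len(set(labels)), "duplicate problem labels"
    for record in records:
        assert record["problem"] and record["solution"], "empty problem or solution"


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)
```

Skipping problems without a detected solution keeps the mandatory `solution` field non-empty; comparing the number of skipped problems against the source file is a useful additional check.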
8. Update the status in the trackers
Update the tracker with columns:
- status: DONE + a link to your generated data in hf
- problem_count: count of problems in data
- solution_count: count of solutions in data (this may differ from problem_count, since a problem can have several solutions)
- years: range of competition years covered in your data (so we can easily track if many years are missing)
- assignee: your name
Update the old tracker with this column:
- ref: color in green for the competition you segmented
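The tracker columns above can be derived from the segmented records rather than counted by hand. The sketch below is one possible way to do it, under two assumptions that are not stated in the source: each record holds one problem/solution pair, and a problem with several solutions appears as several records with the same `problem` text. `tracker_summary` is a hypothetical helper name.

```python
def tracker_summary(records: list[dict]) -> dict:
    """Compute problem_count, solution_count, and the year range for the tracker."""
    # Assumption: identical problem text means the same problem.
    problems = {record["problem"] for record in records}
    # Count only records that actually carry a non-empty solution.
    solution_count = sum(1 for record in records if record.get("solution"))
    # Years may be stored as int or string; ignore missing or malformed values.
    years = sorted({
        int(record["year"])
        for record in records
        if str(record.get("year", "")).isdigit()
    })
    return {
        "problem_count": len(problems),
        "solution_count": solution_count,
        "years": f"{years[0]}-{years[-1]}" if years else "",
    }
```

If a competition distinguishes problems by label rather than by text, deduplicating on `problem_label` together with the year would be the natural replacement for the first assumption.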
9. Integrate the data in a base dataset
Create a ticket in git
Notes
- Image placeholders in the dataset correspond to actual images stored in the images.parquet file.