OJ4OCRMT
| Entity Passport | |
| Registry ID | hf-dataset--hltcoe--oj4ocrmt |
| License | CC-BY-4.0 |
| Provider | huggingface |
Cite this dataset
Academic & Research Attribution
@misc{hf_dataset__hltcoe__oj4ocrmt,
author = {hltcoe},
title = {OJ4OCRMT Dataset},
year = {2026},
howpublished = {\url{https://huggingface.co/datasets/hltcoe/oj4ocrmt}},
note = {Accessed via Free2AITools Knowledge Fortress}
}
Nexus Index V2.0
FNI V2.0 for OJ4OCRMT: Semantic (S:50), Authority (A:0), Popularity (P:60), Recency (R:35), Quality (Q:30).
Dataset Specification
OJ4OCRMT: A Large Multilingual Dataset for OCR-MT Evaluation
The OJ4OCRMT dataset contains source PDF files, page images rendered at three resolutions, and text files (both raw extractions and sentence-split versions). There are two partitions, 'dev' and 'test'. Each contains over 1,000 pages of content, with the PDFs, PNGs, and text files available in 23 EU languages.
The dataset is designed to support the evaluation of systems that translate document images between any pair of the 23 European languages.
The dev partition contains 1,656 pages from 2022. Of these, 1,412 (85%) are deemed regular, 193 (12%) contain a table, and 51 (3%) contain a 'figure'. The test partition contains 1,119 pages, all from 2023: 979 (87%) are regular, 98 (9%) contain a table, and 42 (4%) contain a 'figure'. All 2,775 pages have translations in all 23 languages. The languages are: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish, and Swedish. See the published paper for details about the dataset.
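As a rough illustration of the evaluation space this implies, the sketch below enumerates translation directions over the 23 languages listed above. It assumes each pair is counted once per direction (source to target), which the description does not state explicitly.

```python
from itertools import permutations

# The 23 languages named in the dataset description.
LANGUAGES = [
    "Bulgarian", "Croatian", "Czech", "Danish", "Dutch", "English",
    "Estonian", "Finnish", "French", "German", "Greek", "Hungarian",
    "Italian", "Latvian", "Lithuanian", "Maltese", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]

# Ordered (source, target) pairs; a language is never paired with itself.
directions = list(permutations(LANGUAGES, 2))

print(len(LANGUAGES))   # 23
print(len(directions))  # 506, i.e. 23 * 22
```

If direction is ignored, the same list gives 253 unordered pairs.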
Each page is identified by a document identifier and a page number. For example, OJ:C:2022:240 is a document identifier, and page 22 of that document is available in the dev set. At the time of publication, this URL gives access to that page online: https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=OJ:C:2022:240:FULL#page=22
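The URL pattern in this example can be sketched as a small helper. The function name `eurlex_pdf_url` is hypothetical, and the `lang` parameter is an assumption: only the `EN` form appears above, and the link may stop working if EUR-Lex changes its URL scheme.

```python
def eurlex_pdf_url(docid: str, page: int, lang: str = "EN") -> str:
    """Build a EUR-Lex PDF link that opens at the given page.

    Mirrors the single example URL in the dataset description; values of
    `lang` other than "EN" are an untested assumption.
    """
    return (
        "https://eur-lex.europa.eu/legal-content/"
        f"{lang}/TXT/PDF/?uri={docid}:FULL#page={page}"
    )


print(eurlex_pdf_url("OJ:C:2022:240", 22))
```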
For the page identified above, the released dataset contains:
- (a) raw: text obtained from the 'pdftotext' command
  dev/OJ:C:2022:240/raw/OJ:C:2022:240:FULL.en.p-22.txt
- (b) sbd: normalized text obtained by cleaning the raw text and performing sentence splitting
  dev/OJ:C:2022:240/sbd/OJ:C:2022:240:FULL.en.p-22.txt
- (c) png72: a PNG image file at 72 dpi resolution
  dev/OJ:C:2022:240/png72/OJ:C:2022:240:FULL.en.p-22.png
- (d) png150: a PNG image file at 150 dpi resolution
  dev/OJ:C:2022:240/png150/OJ:C:2022:240:FULL.en.p-22.png
- (e) png300: a PNG image file at 300 dpi resolution
  dev/OJ:C:2022:240/png300/OJ:C:2022:240:FULL.en.p-22.png
- (f) pdf: a PDF file obtained from the EUR-Lex online portal and using 'pdfseparate'
  dev/OJ:C:2022:240/pdf/OJ:C:2022:240:FULL.en.p-22.pdf
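The six paths above follow one naming pattern, which the sketch below reconstructs. The helper name `page_path` is hypothetical, and the pattern is inferred from this single English example; the language codes used for the other 22 languages are not given here and are left to the caller.

```python
# Subdirectory name -> file extension, taken from the six example paths.
KINDS = {
    "raw": "txt",
    "sbd": "txt",
    "png72": "png",
    "png150": "png",
    "png300": "png",
    "pdf": "pdf",
}


def page_path(partition: str, docid: str, kind: str, lang: str, page: int) -> str:
    """Return the relative path of one page file in the released layout."""
    if kind not in KINDS:
        raise ValueError(f"unknown kind: {kind!r}")
    filename = f"{docid}:FULL.{lang}.p-{page}.{KINDS[kind]}"
    return f"{partition}/{docid}/{kind}/{filename}"


print(page_path("dev", "OJ:C:2022:240", "png150", "en", 22))
```

Note that the file names contain colons, which some file systems (for example on Windows) do not allow in path components.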
Splits for the partitions are in TSV format with two columns: docid [tab] page. The eight split files are named {dev,test}.{all,regular,table,figure}.txt
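A minimal sketch of reading these split files follows. It assumes the four subsets are all, regular, table, and figure (which yields the eight files mentioned), that the files have no header row, and that the page column is an integer; the function names are hypothetical.

```python
import csv


def split_filenames() -> list[str]:
    """List the eight split file names: two partitions times four subsets."""
    partitions = ("dev", "test")
    subsets = ("all", "regular", "table", "figure")
    return [f"{p}.{s}.txt" for p in partitions for s in subsets]


def read_split(lines) -> list[tuple[str, int]]:
    """Parse tab-separated 'docid<TAB>page' rows into (docid, page) tuples.

    `lines` can be an open file object or any iterable of strings.
    Blank lines are skipped.
    """
    pages = []
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue
        docid, page = row
        pages.append((docid, int(page)))
    return pages


print(split_filenames())
print(read_split(["OJ:C:2022:240\t22\n"]))
```

With a downloaded copy, a split could be loaded with `with open("dev.all.txt", newline="") as f: pages = read_split(f)`.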
If you use this work, please cite:
@inproceedings{mcnamee-etal-2025-oj4ocrmt,
title = "{OJ}4{OCRMT}: A Large Multilingual Dataset for {OCR}-{MT} Evaluation",
author = "McNamee, Paul and
Duh, Kevin and
Carpenter, Cameron and
Colaianni, Ron and
King, Nolan and
Murray, Kenton",
booktitle = "Proceedings of Machine Translation Summit XX: Volume 1",
month = jun,
year = "2025",
address = "Geneva, Switzerland",
publisher = "European Association for Machine Translation",
url = "https://aclanthology.org/2025.mtsummit-1.9/",
pages = "113--125",
ISBN = "978-2-9701897-0-1"
}
Initial release: 6/16/2025
Dataset Transparency Report
Technical metadata sourced from upstream repositories.
Identity & Source
- id: hf-dataset--hltcoe--oj4ocrmt
- slug: hltcoe--oj4ocrmt
- source: huggingface
- author: hltcoe
- license: CC-BY-4.0
- tags: license:cc-by-4.0, region:us
Engagement & Metrics
- downloads: 141,042
- stars: 0
- forks: 0