Dataset

OJ4OCRMT

by hltcoe hf-dataset--hltcoe--oj4ocrmt
Registry ID hf-dataset--hltcoe--oj4ocrmt
License CC-BY-4.0
Provider huggingface
Cite this dataset

Academic & Research Attribution

BibTeX
@misc{hf_dataset__hltcoe__oj4ocrmt,
  author = {hltcoe},
  title = {OJ4OCRMT Dataset},
  year = {2026},
  howpublished = {\url{https://huggingface.co/datasets/hltcoe/oj4ocrmt}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}
APA Style
hltcoe. (2026). OJ4OCRMT [Dataset]. Free2AITools. https://huggingface.co/datasets/hltcoe/oj4ocrmt

Downloads
141,042

Dataset Specification

OJ4OCRMT: A Large Multilingual Dataset for OCR-MT Evaluation

Paper: "OJ4OCRMT: A Large Multilingual Dataset for OCR-MT Evaluation" by Paul McNamee, Kevin Duh, Cameron Carpenter, Ron Colaianni, Nolan King, and Kenton Murray. In Proceedings of Machine Translation Summit XX, Vol. 1: Research Track, June 23-27, 2025, Geneva, Switzerland.

The OJ4OCRMT dataset contains source PDF files, rendered images at three resolutions, and text files (both raw extractions and sentence-split versions). There are two partitions, 'dev' and 'test'. Each contains over 1,000 pages of content, with the PDFs, PNGs, and text files available in 23 EU languages.

The dataset is designed to support evaluating systems for translation of document images between any pair of 23 European languages.

The dev partition contains 1,656 pages, all from 2022. Of these, 1,412 (85%) are deemed regular, 193 (12%) contain a table, and 51 (3%) contain a 'figure'. The test partition contains 1,119 pages, all from 2023. Of these, 979 (87%) are regular, 98 (9%) contain a table, and 42 (4%) contain a 'figure'. All 2,775 pages (1,656 dev plus 1,119 test) have translations in all 23 languages.

The languages are: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish, and Swedish. See the published paper for details about the dataset.
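As a rough illustration of what "any pair of 23 languages" implies, the sketch below counts the ordered translation directions and re-adds the per-category page counts quoted above. The variable and function names are illustrative only and are not part of the dataset release.

```python
from itertools import permutations

# The 23 languages listed in the dataset description.
LANGUAGES = [
    "Bulgarian", "Croatian", "Czech", "Danish", "Dutch", "English",
    "Estonian", "Finnish", "French", "German", "Greek", "Hungarian",
    "Italian", "Latvian", "Lithuanian", "Maltese", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]

# Ordered (source, target) pairs: translating A->B is a different
# evaluation direction from B->A, so permutations are used.
DIRECTIONS = list(permutations(LANGUAGES, 2))

# Page counts per category, copied from the description above.
PAGE_COUNTS = {
    "dev": {"regular": 1412, "table": 193, "figure": 51},
    "test": {"regular": 979, "table": 98, "figure": 42},
}


def partition_total(partition: str) -> int:
    """Sum the category counts for one partition."""
    return sum(PAGE_COUNTS[partition].values())


if __name__ == "__main__":
    print(len(LANGUAGES), "languages,", len(DIRECTIONS), "directions")
    for name in PAGE_COUNTS:
        print(name, partition_total(name), "pages")
```

Since every page is available in every language, each page can in principle serve as a test item for any of these directions.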

Each page is identified by a document identifier and a page number. For example, OJ:C:2022:240 is a document identifier, and page 22 of that document is in the dev set. At the time of publication, this URL gives access to that page online: https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=OJ:C:2022:240:FULL#page=22
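A minimal sketch of how such a URL could be assembled from a document identifier and page number. The helper name is hypothetical, and the assumption that other languages are reached by swapping the "EN" path segment is a guess based on the single example above, not something the dataset card states.

```python
def eurlex_page_url(docid: str, page: int, lang: str = "EN") -> str:
    """Build a EUR-Lex PDF URL for one page (hypothetical helper).

    Follows the pattern of the one example URL given in the text;
    support for other language codes is an assumption.
    """
    return (
        "https://eur-lex.europa.eu/legal-content/"
        f"{lang.upper()}/TXT/PDF/?uri={docid}:FULL#page={page}"
    )


if __name__ == "__main__":
    print(eurlex_page_url("OJ:C:2022:240", 22))
```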

For the page identified above, the released dataset contains:

  • (a) raw: text obtained from the 'pdftotext' command: dev/OJ:C:2022:240/raw/OJ:C:2022:240:FULL.en.p-22.txt
  • (b) sbd: normalized text obtained by cleaning the raw text and performing sentence splitting: dev/OJ:C:2022:240/sbd/OJ:C:2022:240:FULL.en.p-22.txt
  • (c) png72: a PNG image file at 72 dpi resolution: dev/OJ:C:2022:240/png72/OJ:C:2022:240:FULL.en.p-22.png
  • (d) png150: a PNG image file at 150 dpi resolution: dev/OJ:C:2022:240/png150/OJ:C:2022:240:FULL.en.p-22.png
  • (e) png300: a PNG image file at 300 dpi resolution: dev/OJ:C:2022:240/png300/OJ:C:2022:240:FULL.en.p-22.png
  • (f) pdf: a PDF file obtained from the EUR-Lex online portal and split out using 'pdfseparate': dev/OJ:C:2022:240/pdf/OJ:C:2022:240:FULL.en.p-22.pdf
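The six paths above appear to follow one naming pattern. The sketch below reproduces it, assuming the pattern generalizes from this single example to other documents, pages, and lowercase two-letter language codes; the function and constant names are illustrative, not part of the release.

```python
from pathlib import PurePosixPath

# Subdirectory name -> file extension, as in items (a) to (f) above.
KINDS = {
    "raw": "txt",
    "sbd": "txt",
    "png72": "png",
    "png150": "png",
    "png300": "png",
    "pdf": "pdf",
}


def page_path(partition: str, docid: str, kind: str, lang: str, page: int) -> PurePosixPath:
    """Return the relative path of one file for one page (assumed layout)."""
    ext = KINDS[kind]  # raises KeyError for an unknown kind
    filename = f"{docid}:FULL.{lang}.p-{page}.{ext}"
    return PurePosixPath(partition, docid, kind, filename)


if __name__ == "__main__":
    for kind in KINDS:
        print(page_path("dev", "OJ:C:2022:240", kind, "en", 22))
```

Note that the identifiers contain colons, which some filesystems (for example on Windows) do not accept in file names, so the paths may need remapping when the data is unpacked there.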

Splits for the partitions are in TSV format with two columns: docid [tab] page. The eight split files are named {dev,test}.{all,regular,table,figure}.txt
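A split file in this two-column layout could be read along the following lines. This is a sketch under assumptions: it supposes the files have no header row and that the page column holds a plain integer, neither of which is stated above, and the function name is illustrative.

```python
from typing import Iterator, TextIO, Tuple


def read_split(handle: TextIO) -> Iterator[Tuple[str, int]]:
    """Yield (docid, page) pairs from a tab-separated split file.

    Assumes no header row and an integer page column.
    """
    for line in handle:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        docid, page = line.split("\t")
        yield docid, int(page)


if __name__ == "__main__":
    import io

    # In-memory stand-in for a split file such as dev.all.txt.
    sample = io.StringIO("OJ:C:2022:240\t22\n")
    print(list(read_split(sample)))
```

Each pair can then be combined with a language code to locate the corresponding text, image, or PDF file for that page.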

If you use this work, please cite:

@inproceedings{mcnamee-etal-2025-oj4ocrmt,
    title = "{OJ}4{OCRMT}: A Large Multilingual Dataset for {OCR}-{MT} Evaluation",
    author = "McNamee, Paul  and
      Duh, Kevin  and
      Carpenter, Cameron  and
      Colaianni, Ron  and
      King, Nolan  and
      Murray, Kenton",
    booktitle = "Proceedings of Machine Translation Summit XX: Volume 1",
    month = jun,
    year = "2025",
    address = "Geneva, Switzerland",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2025.mtsummit-1.9/",
    pages = "113--125",
    ISBN = "978-2-9701897-0-1"    
}

Initial release: 6/16/2025
