OJ4OCRMT

Name: OJ4OCRMT
Creator: hltcoe
License: CC-BY-4.0

Nexus Index

30.4 Top 100%

S: Semantic 50

A: Authority 0

P: Popularity 60

R: Recency 35

Q: Quality 30

Dataset Information Summary
Entity Passport
Registry ID	hf-dataset--hltcoe--oj4ocrmt
License	CC-BY-4.0
Provider	huggingface

📜

Cite this dataset

Academic & Research Attribution

BibTeX

@misc{hf_dataset__hltcoe__oj4ocrmt,
  author = {hltcoe},
  title = {OJ4OCRMT Dataset},
  year = {2026},
  howpublished = {\url{https://huggingface.co/datasets/hltcoe/oj4ocrmt}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}

APA Style

hltcoe. (2026). OJ4OCRMT [Dataset]. Free2AITools. https://huggingface.co/datasets/hltcoe/oj4ocrmt

⬇️

Downloads

141,042

Dataset Specification

OJ4OCRMT: A Large Multilingual Dataset for OCR-MT Evaluation

Check out the paper: "OJ4OCRMT: A Large Multilingual Dataset for OCR-MT Evaluation" by Paul McNamee, Kevin Duh, Cameron Carpenter, Ron Colaianni, Nolan King, and Kenton Murray. In Proceedings of Machine Translation Summit XX, Vol. 1: Research Track, June 23-27, 2025, Geneva, Switzerland.

The OJ4OCRMT dataset contains source PDF files, rendered images at three resolutions, and text files (both raw extractions and sentence-split files). There are two partitions, 'dev' and 'test'. Each contains over 1,000 pages of content, with the PDFs, PNGs, and text files available in 23 EU languages.

The dataset is designed to support evaluating systems for translation of document images between any pair of 23 European languages.

The dev partition contains 1,656 pages from 2022. Of these, 1,412 (85%) are deemed regular, 193 (12%) contain a table, and 51 (3%) contain a 'figure'. The test partition contains 1,119 pages, all from 2023. Of these, 979 (87%) are regular, 98 (9%) contain a table, and 42 (4%) contain a 'figure'. Note that all 2,775 pages have translations in all 23 languages.

The languages are: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish, and Swedish. See the published paper for details about the dataset.
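Because every page exists in all 23 languages, any language can serve as source or target. As a rough illustration of the size of the evaluation space, the sketch below enumerates the translation directions from the language list above. The variable names are illustrative and not part of the dataset.

```python
from itertools import combinations, permutations

# The 23 languages listed in the dataset description.
LANGUAGES = [
    "Bulgarian", "Croatian", "Czech", "Danish", "Dutch", "English",
    "Estonian", "Finnish", "French", "German", "Greek", "Hungarian",
    "Italian", "Latvian", "Lithuanian", "Maltese", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]

# Ordered (source, target) pairs: translation direction matters.
directed_pairs = list(permutations(LANGUAGES, 2))

# Unordered pairs: each language combination counted once.
unordered_pairs = list(combinations(LANGUAGES, 2))

print(len(LANGUAGES))        # 23
print(len(directed_pairs))   # 506 = 23 * 22
print(len(unordered_pairs))  # 253 = 23 * 22 / 2
```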

Each page is identified by a document identifier and a page number. For example, OJ:C:2022:240 is a document identifier, and page 22 of that document is available in the dev set. At the time of publication, this URL gives access to that page online: https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=OJ:C:2022:240:FULL#page=22
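The example URL suggests a simple pattern built from the document identifier and the page number. A minimal sketch follows, assuming the pattern generalizes to other documents and that the `EN` path segment selects the language; both are inferences from the single example above, and `eurlex_page_url` is a hypothetical helper, not part of the dataset.

```python
def eurlex_page_url(docid: str, page: int, lang: str = "EN") -> str:
    """Build a EUR-Lex PDF link for one page of a document.

    Assumption: the URL layout is copied from the one example in the
    dataset description and may not hold for every document.
    """
    return (
        f"https://eur-lex.europa.eu/legal-content/{lang}/TXT/PDF/"
        f"?uri={docid}:FULL#page={page}"
    )


print(eurlex_page_url("OJ:C:2022:240", 22))
```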

For the page identified above, the released dataset contains:

(a) raw: text obtained from the 'pdftotext' command
    dev/OJ:C:2022:240/raw/OJ:C:2022:240:FULL.en.p-22.txt
(b) sbd: normalized text obtained by cleaning the raw text and performing sentence splitting
    dev/OJ:C:2022:240/sbd/OJ:C:2022:240:FULL.en.p-22.txt
(c) png72: a PNG image file at 72 dpi resolution
    dev/OJ:C:2022:240/png72/OJ:C:2022:240:FULL.en.p-22.png
(d) png150: a PNG image file at 150 dpi resolution
    dev/OJ:C:2022:240/png150/OJ:C:2022:240:FULL.en.p-22.png
(e) png300: a PNG image file at 300 dpi resolution
    dev/OJ:C:2022:240/png300/OJ:C:2022:240:FULL.en.p-22.png
(f) pdf: a PDF file obtained from the EUR-Lex online portal and using 'pdfseparate'
    dev/OJ:C:2022:240/pdf/OJ:C:2022:240:FULL.en.p-22.pdf
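The six paths above share one layout: partition, document identifier, file kind, then a file name made of the identifier, a language code, and the page number. The sketch below reproduces that layout. The helper `page_path` is hypothetical, and the assumption that other languages use two-letter codes like `en` is inferred from the single example, not confirmed by the description.

```python
# File extension for each of the six kinds listed above.
EXTENSIONS = {
    "raw": "txt",
    "sbd": "txt",
    "png72": "png",
    "png150": "png",
    "png300": "png",
    "pdf": "pdf",
}


def page_path(partition: str, docid: str, kind: str, lang: str, page: int) -> str:
    """Return the relative path of one page file within the dataset."""
    if kind not in EXTENSIONS:
        raise ValueError(f"unknown kind: {kind!r}")
    filename = f"{docid}:FULL.{lang}.p-{page}.{EXTENSIONS[kind]}"
    return f"{partition}/{docid}/{kind}/{filename}"


# Print all six paths for the example page.
for kind in EXTENSIONS:
    print(page_path("dev", "OJ:C:2022:240", kind, "en", 22))
```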

Splits for the partitions are in TSV format with two columns: docid [tab] page. The eight split files are named {dev,test}.{all,regular,table,figure}.txt
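As a sketch of how the split files might be consumed, the code below expands the naming pattern and parses two-column TSV content. It assumes the eight names come from two partitions crossed with four subsets (all, regular, table, figure) and that the files have no header row; neither detail is stated explicitly in the description. The second sample row is made up for illustration.

```python
import csv
import io
from itertools import product

PARTITIONS = ("dev", "test")
SUBSETS = ("all", "regular", "table", "figure")  # assumed expansion

# Expand the brace pattern into concrete file names.
SPLIT_FILES = [f"{p}.{s}.txt" for p, s in product(PARTITIONS, SUBSETS)]


def read_split(handle):
    """Parse 'docid<TAB>page' rows into (docid, page) tuples."""
    reader = csv.reader(handle, delimiter="\t")
    return [(row[0], int(row[1])) for row in reader if row]


# Illustrative content; only the first row is taken from the description.
sample = io.StringIO("OJ:C:2022:240\t22\nOJ:C:2022:240\t23\n")
pages = read_split(sample)

print(SPLIT_FILES)
print(pages)
```

With a real file, `read_split` can be passed an open file object, for example `open("dev.all.txt", newline="", encoding="utf-8")`.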

If you use this work, please cite:

@inproceedings{mcnamee-etal-2025-oj4ocrmt,
    title = "{OJ}4{OCRMT}: A Large Multilingual Dataset for {OCR}-{MT} Evaluation",
    author = "McNamee, Paul  and
      Duh, Kevin  and
      Carpenter, Cameron  and
      Colaianni, Ron  and
      King, Nolan  and
      Murray, Kenton",
    booktitle = "Proceedings of Machine Translation Summit XX: Volume 1",
    month = jun,
    year = "2025",
    address = "Geneva, Switzerland",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2025.mtsummit-1.9/",
    pages = "113--125",
    ISBN = "978-2-9701897-0-1"    
}

Initial release: June 16, 2025


🆔 Identity & Source

id: hf-dataset--hltcoe--oj4ocrmt
slug: hltcoe--oj4ocrmt
source: huggingface
author: hltcoe
license: CC-BY-4.0
tags: license:cc-by-4.0, region:us

📊 Engagement & Metrics

downloads: 141,042
stars: 0
forks: 0
