📊
Dataset

Findoc Robust

by Arcolab Dev hf-dataset--arcolab-dev--findoc-robust
Free2AITools Nexus Index
60.1 Top 100%
S: Semantic 50
A: Authority 61
P: Popularity 50
R: Recency 98
Q: Quality 50
Entity Passport
Registry ID hf-dataset--arcolab-dev--findoc-robust
License Apache-2.0
Provider huggingface
📜

Cite this dataset

Academic & Research Attribution

BibTeX
@misc{hf_dataset__arcolab_dev__findoc_robust,
  author = {Arcolab Dev},
  title = {Findoc Robust Dataset},
  year = {2026},
  howpublished = {\url{https://huggingface.co/datasets/arcolab-dev/FinDoc-Robust}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}
APA Style
Arcolab Dev. (2026). Findoc Robust [Dataset]. Free2AITools. https://huggingface.co/datasets/arcolab-dev/FinDoc-Robust

Downloads
30,761

Dataset Specification

Financial Document Extraction & Robustness Dataset (FinDoc-Robust)

Dataset Description

FinDoc-Robust is a multimodal, benchmark-grade dataset designed for Document Layout Analysis (DLA), Visual Information Extraction (VIE), and evaluating model robustness against real-world degradation.

The dataset contains financial reports across 5 distinct document categories: cash flow statements, balance sheets, trial balances, shareholders' equity statements, and corporate income statements. For every document, it provides a clean vector PDF, a tabular ground truth, pixel-level bounding boxes, and 5 degraded ("dirty") variants simulating camera captures, scans, and physical artifacts.

Key Applications

  • Robust Document AI: Training models to resist geometric distortions, noise, and blur.
  • Table Reconstruction: Benchmarking end-to-end Image-to-Excel/HTML/Markdown pipelines.
  • Multimodal Alignment: Fine-tuning models like LayoutLMv3, Donut, or proprietary Vision-LLMs on complex financial structures.

Dataset Structure

The repository is organized hierarchically by document type and numerical index. Each sample folder contains the complete set of modalities for one document:

```text
dataset_root/
├── new_type_cash_flow_statement/
├── new_type_shareholders_equity/
├── new_type_trial_balance/
├── pro_doc_corporate_income_statement/
└── pro_doc_full_balance_sheet/
    ├── 001/
    │   ├── 001.pdf             # Original clean vector PDF
    │   ├── 001.png             # Rendered high-res image (clean)
    │   ├── 001.xlsx            # Target ground-truth table structure
    │   ├── 001.json            # Word/Phrase Bounding Boxes (Pixel space)
    │   ├── 001_pdf.json        # Word/Phrase Bounding Boxes (DTP Point space)
    │   ├── 001_dirty_1.png     # Degraded scan/photo simulation variant 1
    │   ├── 001_dirty_1.json    # Adjusted Bounding Boxes for variant 1
    │   ...
    │   ├── 001_dirty_5.png     # Degraded variant 5
    │   └── 001_dirty_5.json    # Adjusted Bounding Boxes for variant 5
    └── 1001/                   # Scale-tested deep indices (up to 4 digits)
        ...
```
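A minimal sketch of how a local copy of this layout could be traversed, assuming the file naming shown in the tree above holds for every sample folder. The helper names `collect_sample` and `iter_samples` are hypothetical and not part of the dataset.

```python
from pathlib import Path


def collect_sample(sample_dir: Path) -> dict:
    """Group the files of one sample folder (e.g. .../001/) by role.

    Assumes the naming convention from the directory tree: <id>.pdf,
    <id>.png, <id>.xlsx, <id>.json, <id>_pdf.json, <id>_dirty_<k>.png/.json.
    """
    sid = sample_dir.name
    sample = {
        "pdf": sample_dir / f"{sid}.pdf",
        "image": sample_dir / f"{sid}.png",
        "xlsx": sample_dir / f"{sid}.xlsx",
        "bbox_px": sample_dir / f"{sid}.json",
        "bbox_pt": sample_dir / f"{sid}_pdf.json",
        "dirty": [],
    }
    # Keep only the degraded variants whose image AND annotation both exist.
    for k in range(1, 6):
        image = sample_dir / f"{sid}_dirty_{k}.png"
        boxes = sample_dir / f"{sid}_dirty_{k}.json"
        if image.exists() and boxes.exists():
            sample["dirty"].append((image, boxes))
    return sample


def iter_samples(root: Path):
    """Yield (document_type, sample) pairs for every sample folder under root."""
    for type_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for sample_dir in sorted(p for p in type_dir.iterdir() if p.is_dir()):
            yield type_dir.name, collect_sample(sample_dir)
```

Filtering incomplete dirty pairs rather than raising keeps the walk usable on a partially downloaded copy.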

Modality Specifications

1. Ground Truth Structures

  • .pdf: Original vector file preserving strict semantic layout.
  • .xlsx: The ideal downstream target layout. Contains finalized cell alignments, structures, and text groups.

2. Multi-Coordinate Bounding Boxes (`.json`)

The dataset includes two coordinate topologies to match different ingestion pipelines:

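  • 001_pdf.json (Vector Scale): Stored in DTP Points ($1 \text{ inch} = 72 \text{ points}$), native to engines like PyMuPDF or pdfplumber. The origin depends on the parser: PyMuPDF and pdfplumber measure from the top-left corner, whereas raw PDF user space places the origin at the bottom-left, so check the convention before mixing sources.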
  • 001.json (Raster Scale): Mapped directly to high-resolution pixel coordinates matching the native 001.png dimensions (e.g., A4 at 200 DPI: $1654 \times 2339 \text{ px}$).
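The relationship between the two coordinate spaces can be sketched as a plain scale by `dpi / 72`. This assumes the raster was rendered at a uniform 200 DPI with no cropping or offset, which the A4 example suggests but the dataset card does not state outright. The function names are hypothetical.

```python
def pt_to_px(bbox_pt, dpi=200.0):
    """Scale a [xmin, ymin, xmax, ymax] box from DTP points to pixels."""
    return [v * dpi / 72.0 for v in bbox_pt]


def flip_y(bbox_pt, page_height_pt):
    """Convert a box from a bottom-left origin to a top-left origin.

    Only needed when the point-space boxes follow raw PDF user space;
    the dataset card does not say which origin 001_pdf.json uses.
    """
    xmin, ymin, xmax, ymax = bbox_pt
    return [xmin, page_height_pt - ymax, xmax, page_height_pt - ymin]


# A4 is 595.28 x 841.89 pt, which rounds to the 1654 x 2339 px quoted above.
a4_px = [round(v) for v in pt_to_px([595.28, 841.89])]
```

If scaled point boxes do not line up with the pixel boxes, the likeliest causes under these assumptions are a different render DPI or a flipped vertical axis.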

3. Robustness & Degradation Layers (`_dirty_X`)

Each baseline sheet is supplemented with 5 alternative states mimicking typical pipeline damage:

  • Sensor noise, blur, and lighting gradients.
  • Rotation, skewing, and affine or perspective warps.
  • Contrast loss and compression artifacts.

Every dirty image has a corresponding .json containing transformed bounding box parameters adjusted to the physical distortion.
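The dataset card does not document how the adjusted boxes were produced. One common approach, shown here purely as an illustrative sketch, is to push the four corners of each clean box through the same 3x3 transform applied to the image and take the axis-aligned envelope of the result. `warp_point` and `warp_bbox` are hypothetical helpers.

```python
def warp_point(H, x, y):
    """Apply a 3x3 homography H (nested lists) to the point (x, y)."""
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (
        (H[0][0] * x + H[0][1] * y + H[0][2]) / w,
        (H[1][0] * x + H[1][1] * y + H[1][2]) / w,
    )


def warp_bbox(H, bbox):
    """Warp an [xmin, ymin, xmax, ymax] box and return its axis-aligned envelope."""
    xmin, ymin, xmax, ymax = bbox
    corners = [(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax)]
    warped = [warp_point(H, x, y) for x, y in corners]
    xs = [p[0] for p in warped]
    ys = [p[1] for p in warped]
    return [min(xs), min(ys), max(xs), max(ys)]


# Example: a pure translation by (10, 5) pixels.
shift = [[1, 0, 10], [0, 1, 5], [0, 0, 1]]
moved = warp_bbox(shift, [0, 0, 2, 2])
```

The envelope grows under rotation, so boxes derived this way are looser than the original; whether the dataset's dirty annotations behave the same way should be checked against the files.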


JSON Schema Example

```json
{
  "img_file": "001.png",
  "img_width": 1654,
  "img_height": 2339,
  "labels": [
    {
      "text": "CASH FLOWS FROM CORE OPERATIONS",
      "bbox_px": [173.1, 415.24, 781.99, 446.65]
    }
  ]
}
```

Note: bbox_px format is [xmin, ymin, xmax, ymax].
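A short sketch of reading an annotation in this shape and sanity-checking each box against the image size. It assumes the keys shown in the example (`img_width`, `img_height`, `labels`, `text`, `bbox_px`); whether the dirty-variant files use the same keys is not stated in the card. `parse_annotation` is a hypothetical helper.

```python
import json


def parse_annotation(raw: str) -> list:
    """Parse one annotation JSON string into a list of label records.

    Each record carries the text, the [xmin, ymin, xmax, ymax] box, and a
    flag telling whether the box is well-formed and inside the image.
    """
    doc = json.loads(raw)
    width, height = doc["img_width"], doc["img_height"]
    records = []
    for label in doc["labels"]:
        xmin, ymin, xmax, ymax = label["bbox_px"]
        in_bounds = 0 <= xmin < xmax <= width and 0 <= ymin < ymax <= height
        records.append(
            {"text": label["text"], "bbox": [xmin, ymin, xmax, ymax], "in_bounds": in_bounds}
        )
    return records


example = """{
  "img_file": "001.png", "img_width": 1654, "img_height": 2339,
  "labels": [{"text": "CASH FLOWS FROM CORE OPERATIONS",
              "bbox_px": [173.1, 415.24, 781.99, 446.65]}]
}"""
records = parse_annotation(example)
```

Flagging out-of-bounds boxes instead of raising is a deliberate choice here, since warped variants may legitimately push boxes past the image edge.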


Usage & Evaluation

```python
from datasets import load_dataset

# Configuration for loading the structured layout hierarchy
# Dataset script coming soon
dataset = load_dataset("arcolab-dev/FinDoc-Robust")
```

Suggested evaluation metrics:

  • TEDS (Tree-Edit-Distance-based Similarity): For structural table matching via the .xlsx layout.
  • ANLS (Average Normalized Levenshtein Similarity): For text extraction robustness under dirty variants.
  • mAP (mean Average Precision): For word/cell layout extraction bounding boxes.
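Two of these metrics are simple enough to sketch in plain Python. The version below is a simplified ANLS with one reference string per prediction and the customary 0.5 threshold, plus the box IoU that mAP matching is built on. It is an illustration under those assumptions, not the dataset's official scorer, and TEDS is omitted because it needs a tree-edit-distance library.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def anls(predictions, references, tau=0.5):
    """Average of (1 - normalized distance), zeroed when the distance is >= tau."""
    scores = []
    for pred, ref in zip(predictions, references):
        pred, ref = pred.strip().lower(), ref.strip().lower()
        longest = max(len(pred), len(ref))
        nl = levenshtein(pred, ref) / longest if longest else 0.0
        scores.append(1.0 - nl if nl < tau else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def iou(a, b):
    """Intersection over union of two [xmin, ymin, xmax, ymax] boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0
```

Running the same predictions through `anls` on the clean image and on each dirty variant gives a direct per-degradation robustness comparison.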

## 📊 Structured Schema (Zero-Fabrication)
| Feature Key | Data Type |
| :--- | :--- |
| `file_name` | `string` |
| `document_type` | `string` |
| `document_id` | `int64` |
| `clean_pdf` | `string` |
| `clean_xlsx` | `string` |
| `clean_bbox_px` | `string` |
| `clean_bbox_pdf_pt` | `string` |
| `dirty_1_image` | `string` |
| `dirty_1_bbox` | `string` |
| `dirty_2_image` | `string` |
| `dirty_2_bbox` | `string` |
| `dirty_3_image` | `string` |
| `dirty_3_bbox` | `string` |
| `dirty_4_image` | `string` |
| `dirty_4_bbox` | `string` |
| `dirty_5_image` | `string` |
| `dirty_5_bbox` | `string` |

**Estimated Rows:** `3,000`


🛡️ Dataset Transparency Report

Technical metadata sourced from upstream repositories.

Open Metadata

🆔 Identity & Source

id: hf-dataset--arcolab-dev--findoc-robust
slug: arcolab-dev--findoc-robust
source: huggingface
author: Arcolab Dev
license: Apache-2.0
tags: task_categories:object-detection, language:en, license:apache-2.0, size_categories:1k<n<10k, format:csv, modality:image, modality:text, library:datasets, library:pandas, library:polars, library:mlcroissant, region:us, financial, document-ai, multimodal
