Findoc Robust
| Entity Passport | |
| :--- | :--- |
| Registry ID | hf-dataset--arcolab-dev--findoc-robust |
| License | Apache-2.0 |
| Provider | huggingface |
Cite this dataset
Academic & Research Attribution
@misc{hf_dataset__arcolab_dev__findoc_robust,
author = {Arcolab Dev},
title = {Findoc Robust Dataset},
year = {2026},
howpublished = {\url{https://huggingface.co/datasets/arcolab-dev/FinDoc-Robust}},
note = {Accessed via Free2AITools Knowledge Fortress}
}
Technical Deep Dive
Free2AITools Nexus Index V2.0
Index Insight
FNI V2.0 for Findoc Robust: Semantic (S:50), Authority (A:61), Popularity (P:50), Recency (R:98), Quality (Q:50).
Dataset Specification
Financial Document Extraction & Robustness Dataset (FinDoc-Robust)
Dataset Description
FinDoc-Robust is a multimodal, benchmark-grade dataset designed for Document Layout Analysis (DLA), Visual Information Extraction (VIE), and evaluating model robustness against real-world degradation.
The dataset contains financial reports across 5 distinct document categories (cash flow statements, balance sheets, trial balances, shareholders' equity, corporate income statements). For every document, it provides a clean vector PDF, a tabular ground truth, pixel-level bounding boxes, and 5 degraded ("dirty") image variants simulating camera captures, scans, and physical artifacts.
Key Applications
- Robust Document AI: Training models to resist geometric distortions, noise, and blur.
- Table Reconstruction: Benchmarking end-to-end Image-to-Excel/HTML/Markdown pipelines.
- Multimodal Alignment: Fine-tuning models like LayoutLMv3, Donut, or proprietary Vision-LLMs on complex financial structures.
Dataset Structure
The repository is organized hierarchically by document type and numerical index. Each sample folder contains the complete set of modalities for one document:
dataset_root/
├── new_type_cash_flow_statement/
├── new_type_shareholders_equity/
├── new_type_trial_balance/
├── pro_doc_corporate_income_statement/
└── pro_doc_full_balance_sheet/
    ├── 001/
    │   ├── 001.pdf            # Original clean vector PDF
    │   ├── 001.png            # Rendered high-res image (clean)
    │   ├── 001.xlsx           # Target ground-truth table structure
    │   ├── 001.json           # Word/Phrase Bounding Boxes (Pixel space)
    │   ├── 001_pdf.json       # Word/Phrase Bounding Boxes (DTP Point space)
    │   ├── 001_dirty_1.png    # Degraded scan/photo simulation variant 1
    │   ├── 001_dirty_1.json   # Adjusted Bounding Boxes for variant 1
    │   ...
    │   ├── 001_dirty_5.png    # Degraded variant 5
    │   └── 001_dirty_5.json   # Adjusted Bounding Boxes for variant 5
    └── 1001/                  # Scale-tested deep indices (up to 4 digits)
        ...
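The layout above can be turned into a quick completeness check. The following is a minimal sketch, assuming every document type uses the same per-sample naming pattern shown for `001/`; the helper name `expected_files` is hypothetical and not part of the dataset.

```python
from pathlib import PurePosixPath

# The card describes five degraded variants per document.
NUM_DIRTY_VARIANTS = 5


def expected_files(doc_type: str, sample_id: str) -> list[str]:
    """Return the relative paths a complete sample folder should contain.

    Assumption: file names are derived from the folder index exactly as in
    the tree above (e.g. folder 001 holds 001.pdf, 001_dirty_1.png, ...).
    """
    base = PurePosixPath(doc_type) / sample_id
    names = [
        f"{sample_id}.pdf",       # clean vector PDF
        f"{sample_id}.png",       # clean rendered image
        f"{sample_id}.xlsx",      # ground-truth table structure
        f"{sample_id}.json",      # bounding boxes in pixel space
        f"{sample_id}_pdf.json",  # bounding boxes in DTP point space
    ]
    for k in range(1, NUM_DIRTY_VARIANTS + 1):
        names.append(f"{sample_id}_dirty_{k}.png")
        names.append(f"{sample_id}_dirty_{k}.json")
    return [str(base / name) for name in names]


files = expected_files("new_type_trial_balance", "001")
```

Comparing this list against the files actually present in a downloaded folder is one way to spot incomplete samples before training.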
Modality Specifications
1. Ground Truth Structures
- .pdf: Original vector file preserving strict semantic layout.
- .xlsx: The ideal downstream target layout. Contains finalized cell alignments, structures, and text groups.
2. Multi-Coordinate Bounding Boxes (`.json`)
The dataset includes two coordinate topologies to match different ingestion pipelines:
- 001_pdf.json (Vector Scale): Stored in DTP Points ($1 \text{ inch} = 72 \text{ points}$), native to engines like PyMuPDF or pdfplumber. Origin is typically evaluated from Top-Left or Bottom-Left depending on the parser.
- 001.json (Raster Scale): Mapped directly to high-resolution pixel coordinates matching the native 001.png dimensions (e.g., A4 at 200 DPI: $1654 \times 2339 \text{ px}$).
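The relationship between the two coordinate spaces is a scale by DPI / 72. The sketch below illustrates it, assuming a 200 DPI render (the card gives this only as an example) and an optional vertical flip for parsers that report a bottom-left origin. The function names are illustrative, not part of the dataset.

```python
POINTS_PER_INCH = 72.0


def pt_to_px(value_pt: float, dpi: float = 200.0) -> float:
    """Scale a length from DTP points to pixels at the given DPI."""
    return value_pt * dpi / POINTS_PER_INCH


def bbox_pt_to_px(bbox_pt, dpi: float = 200.0, page_height_pt=None):
    """Convert [xmin, ymin, xmax, ymax] from points to pixels.

    If page_height_pt is given, the input is assumed to use a bottom-left
    origin and is flipped to the top-left origin used by image coordinates.
    """
    x0, y0, x1, y1 = bbox_pt
    if page_height_pt is not None:
        y0, y1 = page_height_pt - y1, page_height_pt - y0
    return [pt_to_px(v, dpi) for v in (x0, y0, x1, y1)]


# A4 is roughly 595.276 x 841.89 pt; at 200 DPI this rounds to the
# 1654 x 2339 px size quoted in the card.
a4_px = (round(pt_to_px(595.276)), round(pt_to_px(841.89)))
```

Checking which origin a given `_pdf.json` file uses against the matching pixel-space file is advisable before relying on the flip.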
3. Robustness & Degradation Layers (`_dirty_X`)
Each baseline sheet is supplemented with 5 alternative states mimicking typical pipeline damage:
- Sensor noise, blur, and lighting gradients.
- Rotation, skewing, and affine perspective warps.
- Contrast loss and compression artifacts.
Every dirty image has a corresponding .json containing transformed bounding box parameters adjusted to the physical distortion.
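The card does not say how the adjusted boxes were computed. One common approach, shown here only as a sketch under that assumption, is to push the four corners of each clean box through the same 3x3 transform applied to the image and take the axis-aligned envelope of the result. The function name `warp_bbox` and the matrix `H` are hypothetical.

```python
def warp_bbox(bbox, H):
    """Map [xmin, ymin, xmax, ymax] through a 3x3 homography H.

    H is a row-major nested list. Affine warps are the special case where
    the last row is [0, 0, 1]. The result is the axis-aligned box enclosing
    the four transformed corners.
    """
    xmin, ymin, xmax, ymax = bbox
    corners = [(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax)]
    xs, ys = [], []
    for x, y in corners:
        w = H[2][0] * x + H[2][1] * y + H[2][2]  # projective divisor
        xs.append((H[0][0] * x + H[0][1] * y + H[0][2]) / w)
        ys.append((H[1][0] * x + H[1][1] * y + H[1][2]) / w)
    return [min(xs), min(ys), max(xs), max(ys)]


# Example: a pure translation by (10, 5) pixels.
translate = [[1, 0, 10], [0, 1, 5], [0, 0, 1]]
moved = warp_bbox([0, 0, 10, 10], translate)
```

Under rotation or perspective the enclosing box grows relative to the text it covers, which is worth keeping in mind when comparing IoU across clean and dirty variants.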
JSON Schema Example
{
  "img_file": "001.png",
  "img_width": 1654,
  "img_height": 2339,
  "labels": [
    {
      "text": "CASH FLOWS FROM CORE OPERATIONS",
      "bbox_px": [173.1, 415.24, 781.99, 446.65]
    }
  ]
}
Note: bbox_px format is [xmin, ymin, xmax, ymax].
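A record in this shape can be sanity-checked before use. The sketch below parses the example above and verifies that each box is well ordered and lies inside the image. The helper `validate_labels` is hypothetical, and the check assumes boxes are never meant to extend past the image border, which may not hold for heavily warped variants.

```python
import json

SAMPLE = """
{
  "img_file": "001.png",
  "img_width": 1654,
  "img_height": 2339,
  "labels": [
    {"text": "CASH FLOWS FROM CORE OPERATIONS",
     "bbox_px": [173.1, 415.24, 781.99, 446.65]}
  ]
}
"""


def validate_labels(record: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the record passed."""
    width, height = record["img_width"], record["img_height"]
    problems = []
    for i, label in enumerate(record["labels"]):
        xmin, ymin, xmax, ymax = label["bbox_px"]
        if not (xmin < xmax and ymin < ymax):
            problems.append(f"label {i}: box is empty or inverted")
        elif xmin < 0 or ymin < 0 or xmax > width or ymax > height:
            problems.append(f"label {i}: box extends outside the image")
    return problems


record = json.loads(SAMPLE)
issues = validate_labels(record)
```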
Usage & Evaluation
from datasets import load_dataset
# Configuration for loading the structured layout hierarchy
# Dataset script coming soon
dataset = load_dataset("arcolab-dev/FinDoc-Robust")
Recommended Evaluation Metrics
- Tree-edit distance / TEDS: For structural table matching via the .xlsx layout.
- ANLS (Average Normalized Levenshtein Similarity): For text extraction robustness under dirty variants.
- mAP (mean Average Precision): For word/cell layout extraction bounding boxes.
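As an illustration of the ANLS metric, here is a minimal pure-Python sketch. It assumes one reference string per prediction, case-insensitive comparison, and the commonly used threshold of 0.5; the card does not specify an evaluation protocol, so treat these choices as assumptions rather than the dataset's official scorer.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def anls(predictions, references, threshold: float = 0.5) -> float:
    """Average of (1 - normalized distance), zeroed at or above the threshold."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    total = 0.0
    for pred, ref in zip(predictions, references):
        p, r = pred.strip().lower(), ref.strip().lower()
        longest = max(len(p), len(r))
        nl = levenshtein(p, r) / longest if longest else 0.0
        total += (1.0 - nl) if nl < threshold else 0.0
    return total / len(references)


score = anls(["Total Assets", "abcd"], ["TOTAL ASSETS", "abce"])
```

Scoring the clean image and each dirty variant against the same references gives a per-variant robustness gap.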
## Structured Schema (Zero-Fabrication)
| Feature Key | Data Type |
| :--- | :--- |
| `file_name` | `string` |
| `document_type` | `string` |
| `document_id` | `int64` |
| `clean_pdf` | `string` |
| `clean_xlsx` | `string` |
| `clean_bbox_px` | `string` |
| `clean_bbox_pdf_pt` | `string` |
| `dirty_1_image` | `string` |
| `dirty_1_bbox` | `string` |
| `dirty_2_image` | `string` |
| `dirty_2_bbox` | `string` |
| `dirty_3_image` | `string` |
| `dirty_3_bbox` | `string` |
| `dirty_4_image` | `string` |
| `dirty_4_bbox` | `string` |
| `dirty_5_image` | `string` |
| `dirty_5_bbox` | `string` |
**Estimated Rows:** `3,000`
Dataset Transparency Report
Technical metadata sourced from upstream repositories.
Identity & Source
- id: hf-dataset--arcolab-dev--findoc-robust
- slug: arcolab-dev--findoc-robust
- source: huggingface
- author: Arcolab Dev
- license: Apache-2.0
- tags: task_categories:object-detection, language:en, license:apache-2.0, size_categories:1k<n<10k, format:csv, modality:image, modality:text, library:datasets, library:pandas, library:polars, library:mlcroissant, region:us, financial, document-ai, multimodal
Engagement & Metrics
- downloads: 30,761
- stars: 0
- forks: null
Data indexed from public sources. Updated daily.