# ViDoRe Pipeline Evaluation Framework

> [!IMPORTANT]
> **Repository Focus Change**
>
> This repository is now focused on **pipeline evaluation** for visual document retrieval tasks.
> All other functionality (vision retriever evaluation, legacy benchmarks) is kept for reproducibility purposes but is deprecated and no longer actively maintained.
## What is Pipeline Evaluation?

Pipeline evaluation allows you to evaluate **complete end-to-end retrieval systems** on the ViDoRe v3 benchmark datasets. Unlike traditional retriever evaluation, which focuses on individual model components, pipeline evaluation lets you test:

- **Multi-stage retrieval systems** (e.g., retrieve + rerank)
- **Hybrid approaches** (e.g., dense + sparse retrieval fusion)
- **Custom preprocessing pipelines** (e.g., OCR → chunking → embedding)
- **Arbitrary retrieval logic** that goes beyond standard dense/sparse retrievers
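For intuition, a two-stage retrieve-then-rerank pipeline can be sketched as below. This is a toy illustration only: `BasePipeline` here is a minimal stand-in for the framework's actual base class, and the token-overlap scoring is a placeholder for real retrieval logic (see the Python API section for the real interface).

```python
# Minimal stand-in for the framework's BasePipeline, for illustration only.
class BasePipeline:
    def index(self, corpus_ids, corpus_images, corpus_texts):
        raise NotImplementedError

    def search(self, query_ids, queries):
        raise NotImplementedError


class RetrieveThenRerank(BasePipeline):
    """Toy two-stage pipeline: cheap first-stage scoring, then keep the top-k candidates."""

    def __init__(self, top_k=2):
        self.top_k = top_k

    def index(self, corpus_ids, corpus_images, corpus_texts):
        # Store the corpus; a real pipeline would embed images/texts here.
        self.corpus = dict(zip(corpus_ids, corpus_texts))

    def _first_stage(self, query, doc):
        # Placeholder relevance: token overlap between query and document text.
        return len(set(query.lower().split()) & set(doc.lower().split()))

    def search(self, query_ids, queries):
        results = {}
        for qid, query in zip(query_ids, queries):
            scores = {cid: self._first_stage(query, doc) for cid, doc in self.corpus.items()}
            # Keep only the top-k candidates; a second stage would rescore them here.
            top = sorted(scores, key=scores.get, reverse=True)[: self.top_k]
            results[qid] = {cid: float(scores[cid]) for cid in top}
        return results


pipeline = RetrieveThenRerank(top_k=2)
pipeline.index(
    ["d1", "d2", "d3"],
    [None] * 3,
    ["payroll policy", "energy report", "hr payroll handbook"],
)
out = pipeline.search(["q1"], ["payroll handbook"])
print(out)  # → {'q1': {'d3': 2.0, 'd1': 1.0}}
```

The point is only the shape of the contract: `index()` stores whatever state the pipeline needs, and `search()` returns a nested `{query_id: {corpus_id: score}}` dict regardless of how many internal stages produced it.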
## Results Repository & Submission Guidelines

This repository serves as the primary community results repository for visual document retrieval benchmarks using complex pipelines. We encourage researchers and practitioners to submit their pipeline evaluation results so that the community has a centralized place to compare approaches and track progress on the ViDoRe v3 datasets.

### How to Submit Your Results

To contribute your pipeline results to the leaderboard:

1. **Run evaluations** using this framework on the English splits of the ViDoRe v3 datasets (`--language english` in the CLI). The framework tracks raw scores as well as indexing and search computing times.
2. **Open a Pull Request** with the following:
   - **Results files**: Add your JSON result files to the `results/metrics` folder, organized as:

     ```text
     results/metrics/your_pipeline_name/
     ├── vidore_v3_hr.json
     ├── vidore_v3_finance_en.json
     ├── vidore_v3_industrial.json
     └── ... (other datasets)
     ```

   - **Pipeline description**: Include a `description.json` file in the same PR that describes the architecture used. A pipeline is represented as a graph of modules (OCR, retriever, reranker, MCP server, ...) linked together via edges. Example pipeline description files are available in `results/pipeline_descriptions`.

We encourage including as much hardware information as possible in the description so the community can get a feel for the latency of each pipeline.
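To make the "graph of modules linked via edges" idea concrete, here is a sketch of what such a description might contain, built as a Python dict and serialized to JSON. The field names (`modules`, `edges`, `hardware`, etc.) are illustrative assumptions; the authoritative schema is whatever the example files in `results/pipeline_descriptions` use.

```python
import json

# Hypothetical description.json contents; field names are illustrative
# assumptions, not the framework's actual schema.
description = {
    "name": "ocr_dense_rerank",
    "modules": [
        {"id": "ocr", "type": "ocr"},
        {"id": "retriever", "type": "retriever", "model": "my-dense-model"},
        {"id": "reranker", "type": "reranker", "model": "my-reranker"},
    ],
    # Edges link modules into a graph, as the guidelines describe.
    "edges": [["ocr", "retriever"], ["retriever", "reranker"]],
    # Hardware details help the community gauge latency.
    "hardware": {"gpu": "1x A100 80GB", "cpu_cores": 16},
}

print(json.dumps(description, indent=2))
```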
## Installation

```bash
pip install vidore-benchmark
```
## List Available Datasets

List all ViDoRe v3 datasets:

```bash
vidore-benchmark pipeline list-datasets
```

Available datasets:

- `vidore/vidore_v3_hr` - Human Resources documents
- `vidore/vidore_v3_finance_en` - Financial documents (English)
- `vidore/vidore_v3_industrial` - Industrial documents
- `vidore/vidore_v3_pharmaceuticals` - Pharmaceutical documents
- `vidore/vidore_v3_computer_science` - Computer Science documents
- `vidore/vidore_v3_energy` - Energy sector documents
- `vidore/vidore_v3_physics` - Physics documents
- `vidore/vidore_v3_finance_fr` - Financial documents (French)
## Evaluate a Pipeline

You can evaluate any pipeline that inherits from `BasePipeline`. Some pipelines are already implemented in the `pipeline_implementations` folder.

### Custom Pipeline

Evaluate your own pipeline implementation:

```bash
vidore-benchmark pipeline evaluate \
    --dataset-name vidore/vidore_v3_hr \
    --module-path path/to/my_pipeline.py \
    --class-name MyCustomPipeline \
    --language english \
    --pipeline-args '{"model_name": "my-model"}'
```
Your pipeline file (`my_pipeline.py`):

```python
from vidore_benchmark.pipeline_evaluation import BasePipeline

class MyCustomPipeline(BasePipeline):
    def __init__(self, model_name):
        self.model_name = model_name
        # Initialize your model here

    def index(self, corpus_ids, corpus_images, corpus_texts):
        # Process the corpus; store anything relevant as class attributes
        self.corpus_ids = corpus_ids
        ...

    def search(self, query_ids, queries):
        # Your search logic; returns a nested scores dict of the form
        # {query_id: {corpus_id: score}} (see the BasePipeline file for details)
        ...
```
## Language Filtering

Some datasets contain multilingual queries. You can filter by language:

```bash
vidore-benchmark pipeline evaluate \
    --dataset-name vidore/vidore_v3_hr \
    --pipeline-type random \
    --language english
```
## Evaluate on All Datasets

Evaluate your pipeline on all ViDoRe v3 datasets.

With a built-in pipeline:

```bash
vidore-benchmark pipeline evaluate-all \
    --pipeline-type random \
    --pipeline-args '{"seed": 42}' \
    --output-dir results/
```

With a custom pipeline:

```bash
vidore-benchmark pipeline evaluate-all \
    --module-path my_pipeline.py \
    --class-name MyCustomPipeline \
    --output-dir results/
```
## Python API

### Implementing Your Own Pipeline

To evaluate a custom pipeline, inherit from `BasePipeline` and implement the `index()` and `search()` methods.
### Running Evaluation

```python
from path_to_pipeline import MyCustomPipeline
from vidore_benchmark.pipeline_evaluation import (
    load_vidore_dataset,
    evaluate_retrieval,
    aggregate_results,
)

# Load dataset
query_ids, queries, corpus_ids, corpus_images, corpus_texts, qrels = load_vidore_dataset(
    dataset_name="vidore/vidore_v3_hr",
    split="test",
)

# Initialize your pipeline
pipeline = MyCustomPipeline(retriever=my_retriever, reranker=my_reranker)

# Run evaluation
results = evaluate_retrieval(
    pipeline=pipeline,
    query_ids=query_ids,
    queries=queries,
    corpus_ids=corpus_ids,
    corpus_images=corpus_images,
    corpus_texts=corpus_texts,
    qrels=qrels,
    metrics=["ndcg_cut_10", "recall_10"],
)

# Get aggregate scores
aggregated = aggregate_results(results)
print(f"NDCG@10: {aggregated['ndcg_cut_10']:.4f}")
```

Some examples of pipeline implementations can be found in the `pipeline_implementations` folder.
## Advanced Usage

### Tracking Additional Metrics (Optional)

Pipelines can optionally return additional tracking information alongside retrieval results. This is useful for monitoring costs, timing, resource usage, or other custom metrics:
```python
from typing import Any, Dict, List, Optional, Tuple

from vidore_benchmark.pipeline_evaluation import BasePipeline

class PipelineWithMetrics(BasePipeline):
    def index(
        self,
        corpus_ids: List[str],
        corpus_images: List[Any],
        corpus_texts: List[str],
    ) -> None:
        # Indexing logic
        ...

    def search(
        self,
        query_ids: List[str],
        queries: List[str],
    ) -> Tuple[Dict[str, Dict[str, float]], Optional[Dict[str, Any]]]:
        """
        Return both retrieval results and optional tracking metrics.

        Returns:
            Tuple of (results, infos), where infos can contain:
            - Cost tracking (e.g., API costs, GPU hours)
            - Granular timing information
            - Resource usage (num_gpus, memory, etc.)
            - Model-specific metadata
        """
        # Your retrieval logic here
        results = {...}

        # Optional: track additional metrics
        infos = {
            "estimated_cost_usd": 0.05,
            "num_gpus": 1,
            "total_time_ms": 1234.5,
            "model_name": "my-model-v1",
        }
        return results, infos
```
The `infos` dictionary will be stored in the evaluation results under the `_infos` key. This is completely optional: pipelines can still return just the results dictionary for backward compatibility:

```python
class SimplePipeline(BasePipeline):
    def search(self, query_ids, queries):
        # Just return results, no tracking needed
        return results
```

See `example_pipelines/pipeline_with_metrics.py` for a complete example.
## Dataset Information

```python
from vidore_benchmark.pipeline_evaluation import (
    load_vidore_dataset,
    print_dataset_info,
    get_available_datasets,
)

# List available datasets
datasets = get_available_datasets()
print(datasets)

# Load and inspect a dataset
query_ids, queries, corpus_ids, corpus, qrels = load_vidore_dataset(
    "vidore/vidore_v3_industrial"
)

print_dataset_info(
    dataset_name="vidore/vidore_v3_industrial",
    query_ids=query_ids,
    queries=queries,
    corpus_ids=corpus_ids,
    corpus=corpus,
    qrels=qrels,
)
```
## Custom Metrics

You can specify which metrics to compute:

```python
results = evaluate_retrieval(
    pipeline=pipeline,
    query_ids=query_ids,
    queries=queries,
    corpus_ids=corpus_ids,
    corpus=corpus,
    qrels=qrels,
    metrics=[
        "ndcg_cut_5",
        "ndcg_cut_10",
        "recall_5",
        "recall_10",
        "map",
    ],
)
```
All metrics supported by `pytrec_eval` are available.
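For intuition about what a metric like `ndcg_cut_10` measures, here is a simplified plain-Python nDCG@k for a single query. This is a sketch for understanding, not the `pytrec_eval` implementation; its exact gain and discount conventions may differ in detail.

```python
import math

def ndcg_at_k(run_scores, qrel, k=10):
    """Simplified nDCG@k for one query.

    run_scores maps doc id -> retrieval score; qrel maps doc id -> relevance grade.
    """
    # Rank documents by retrieval score, keep the top k.
    ranked = sorted(run_scores, key=run_scores.get, reverse=True)[:k]
    # Discounted cumulative gain: relevance discounted by log2 of the rank.
    dcg = sum(qrel.get(doc, 0) / math.log2(rank + 2) for rank, doc in enumerate(ranked))
    # Ideal DCG: the best achievable ordering of the relevance grades.
    ideal = sorted(qrel.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Ranking the only relevant document first yields a perfect score.
print(ndcg_at_k({"d1": 0.9, "d2": 0.1}, {"d1": 1}))  # → 1.0
```

Ranking the relevant document lower pushes the score below 1.0, which is why nDCG rewards pipelines (e.g. with rerankers) that surface relevant pages early.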
## Architecture

The pipeline evaluation framework consists of:

- **BasePipeline**: Abstract base class for implementing custom pipelines
- **Dataset Loaders**: Functions to load ViDoRe v3 datasets from HuggingFace
- **Evaluator**: Uses `pytrec_eval` to compute retrieval metrics
- **CLI**: Commands for evaluating any custom pipeline

```text
vidore_benchmark/
├── pipeline_evaluation/
│   ├── base_pipeline.py        # BasePipeline abstract class
│   ├── dataset_loader.py       # ViDoRe v3 dataset loading
│   ├── evaluator.py            # Evaluation orchestration
│   └── utils.py                # Helper utilities
└── cli/
    └── pipeline_evaluation.py  # CLI for pipeline evaluation
```
## Reproducibility & Legacy Features

This repository previously focused on evaluating vision retrievers on the ViDoRe v1 and v2 benchmarks. All code related to these functionalities is still available but deprecated:

- **Vision Retriever Evaluation**: See `README_OLD.md`
- **ViDoRe Benchmarks v1/v2**: Now maintained in MTEB
- **Model Implementations**: Available in `src/vidore_benchmark/retrievers/` (for reference only)

⚠️ For new projects, we recommend:

- Using MTEB for vision retriever evaluation on ViDoRe v1/v2
- Using this framework for pipeline evaluation on ViDoRe v3

For reproducibility of published results, see `REPRODUCIBILITY.md`.
## Contributing

We welcome contributions for:

- New example pipelines
- Additional evaluation results
- Dataset utilities
- Documentation improvements

Please open an issue or PR on GitHub.
## Citation

If you use this framework or the ViDoRe benchmark in your research, please cite:

**ColPali: Efficient Document Retrieval with Vision Language Models**

```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
      title={ColPali: Efficient Document Retrieval with Vision Language Models},
      author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
      year={2024},
      eprint={2407.01449},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2407.01449},
}
```
**ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval**

```bibtex
@misc{macé2025vidorebenchmarkv2raising,
      title={ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval},
      author={Quentin Macé and António Loison and Manuel Faysse},
      year={2025},
      eprint={2505.17166},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2505.17166},
}
```
**ViDoRe V3: A Comprehensive Evaluation of Retrieval Augmented Generation in Complex Real-World Scenarios**

```bibtex
@misc{loison2026vidore,
      title={ViDoRe V3: A Comprehensive Evaluation of Retrieval Augmented Generation in Complex Real-World Scenarios},
      author={Loison, Ant{\'o}nio and Mac{\'e}, Quentin and Edy, Antoine and Xing, Victor and Balough, Tom and Moreira, Gabriel and Liu, Bo and Faysse, Manuel and Hudelot, C{\'e}line and Viaud, Gautier},
      journal={arXiv preprint arXiv:2601.08620},
      year={2026}
}
```
## License

This project is licensed under the MIT License - see the `LICENSE` file for details.