
ViDoRe Benchmark Evaluation

This directory contains scripts for evaluating visual document retrieval on the ViDoRe benchmark.

Quick Start

1. Install Dependencies

# Install visual-rag-toolkit with all dependencies
pip install -e ".[all]"

# Install benchmark-specific dependencies
pip install datasets mteb

2. Run Evaluation

# Run on single dataset
python benchmarks/run_vidore.py --dataset vidore/docvqa_test_subsampled

# Run on all ViDoRe datasets
python benchmarks/run_vidore.py --all

# With two-stage retrieval (our contribution)
python benchmarks/run_vidore.py --dataset vidore/docvqa_test_subsampled --two-stage

3. Submit to Leaderboard

# Generate submission file
python benchmarks/prepare_submission.py --results results/

# Submit to HuggingFace
huggingface-cli login
huggingface-cli upload vidore/results ./submission.json

ViDoRe Datasets

The benchmark includes these datasets (from the leaderboard):

Dataset                   Type       # Queries   # Documents
docvqa_test_subsampled    DocVQA     ~500        ~5,000
infovqa_test_subsampled   InfoVQA    ~500        ~5,000
tabfquad_test_subsampled  TabFQuAD   ~500        ~5,000
tatdqa_test               TAT-DQA    ~1,500      ~2,500
arxivqa_test_subsampled   ArXivQA    ~500        ~5,000
shiftproject_test         SHIFT      ~500        ~5,000

Evaluation Metrics

  • NDCG@5: Normalized Discounted Cumulative Gain at 5
  • NDCG@10: Normalized Discounted Cumulative Gain at 10
  • MRR@10: Mean Reciprocal Rank at 10
  • Recall@5: Recall at 5
  • Recall@10: Recall at 10
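When each query has a single relevant page (the common case in these datasets), the metrics above reduce to simple closed forms. A minimal sketch with binary relevance, using hypothetical helper names rather than the evaluation script's actual internals:

```python
import math

def ndcg_at_k(ranked_ids, relevant_id, k):
    # Binary relevance with one relevant document: the ideal DCG is
    # 1/log2(2) = 1, so NDCG@k is just the discounted gain at the hit rank.
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0

def mrr_at_k(ranked_ids, relevant_id, k):
    # Reciprocal rank of the first (only) relevant document, 0 if not in top-k.
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_id, k):
    # With a single relevant document, recall@k is a 0/1 hit indicator.
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0
```

Benchmark scores are these per-query values averaged over all queries in a dataset.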

Two-Stage Retrieval (Our Contribution)

Our key contribution is efficient two-stage retrieval:

Stage 1: Fast prefetch with tile-level pooled vectors
         Uses HNSW index for O(log N) retrieval
         
Stage 2: Exact MaxSim reranking on top-K candidates
         Full multi-vector scoring for precision
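The two stages can be sketched as follows. This is an illustrative NumPy version with hypothetical names (`maxsim`, `two_stage_retrieve`); it brute-forces the stage-1 prefetch where the toolkit would instead query an HNSW index over tile-level pooled vectors:

```python
import numpy as np

def maxsim(query_vecs, doc_vecs):
    # ColBERT-style late interaction: for each query token, take the best
    # matching document token similarity, then sum over query tokens.
    sims = query_vecs @ doc_vecs.T          # (n_query_tokens, n_doc_tokens)
    return sims.max(axis=1).sum()

def two_stage_retrieve(query_vecs, doc_pooled, doc_multi, prefetch_k=200, top_k=10):
    # Stage 1: cheap prefetch with one pooled vector per document.
    # (Exhaustive dot product here as a stand-in for the HNSW lookup.)
    q_pooled = query_vecs.mean(axis=0)
    coarse_scores = doc_pooled @ q_pooled
    candidates = np.argsort(-coarse_scores)[:prefetch_k]

    # Stage 2: exact MaxSim reranking restricted to the prefetched candidates.
    reranked = sorted(candidates, key=lambda i: -maxsim(query_vecs, doc_multi[i]))
    return reranked[:top_k]
```

Only the `prefetch_k` candidates ever receive the full multi-vector MaxSim pass, which is where the speedup and memory savings come from.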

This provides:

  • 5-10x speedup over full MaxSim at scale
  • 95%+ accuracy compared to exhaustive search
  • Memory efficient (full multi-vector embeddings need not be loaded upfront)

To evaluate with two-stage:

python benchmarks/run_vidore.py \
    --dataset vidore/docvqa_test_subsampled \
    --two-stage \
    --prefetch-k 200 \
    --top-k 10

Files

  • run_vidore.py - Main evaluation script
  • prepare_submission.py - Generate leaderboard submission
  • analyze_results.py - Analyze and compare results