# ViDoRe Benchmark Evaluation
This directory contains scripts for evaluating visual document retrieval on the ViDoRe benchmark.
## Quick Start

### 1. Install Dependencies

```bash
# Install visual-rag-toolkit with all dependencies
pip install -e ".[all]"

# Install benchmark-specific dependencies
pip install datasets mteb
```
### 2. Run Evaluation

```bash
# Run on a single dataset
python benchmarks/run_vidore.py --dataset vidore/docvqa_test_subsampled

# Run on all ViDoRe datasets
python benchmarks/run_vidore.py --all

# With two-stage retrieval (our contribution)
python benchmarks/run_vidore.py --dataset vidore/docvqa_test_subsampled --two-stage
```
### 3. Submit to Leaderboard

```bash
# Generate the submission file
python benchmarks/prepare_submission.py --results results/

# Submit to the Hugging Face Hub
huggingface-cli login
huggingface-cli upload vidore/results ./submission.json
```
## ViDoRe Datasets
The benchmark includes these datasets (from the leaderboard):
| Dataset | Type | # Queries | # Documents |
|---|---|---|---|
| docvqa_test_subsampled | DocVQA | ~500 | ~5,000 |
| infovqa_test_subsampled | InfoVQA | ~500 | ~5,000 |
| tabfquad_test_subsampled | TabFQuAD | ~500 | ~5,000 |
| tatdqa_test | TAT-DQA | ~1,500 | ~2,500 |
| arxivqa_test_subsampled | ArXivQA | ~500 | ~5,000 |
| shiftproject_test | SHIFT | ~500 | ~5,000 |
## Evaluation Metrics
- NDCG@5: Normalized Discounted Cumulative Gain at 5
- NDCG@10: Normalized Discounted Cumulative Gain at 10
- MRR@10: Mean Reciprocal Rank at 10
- Recall@5: Recall at 5
- Recall@10: Recall at 10
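With binary relevance (each query has a set of gold documents), the metrics above reduce to a few lines each. A minimal, illustrative implementation (function names are ours, not the toolkit's API):

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """NDCG@k with binary relevance: DCG of the ranking over the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(ranked_ids[:k]) if d in relevant_ids)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal > 0 else 0.0

def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant document within the top k."""
    for i, d in enumerate(ranked_ids[:k]):
        if d in relevant_ids:
            return 1.0 / (i + 1)
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents retrieved within the top k."""
    hits = sum(1 for d in ranked_ids[:k] if d in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0
```

For example, if the single relevant document is ranked second, MRR@10 is 0.5 and Recall@5 is 1.0.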
## Two-Stage Retrieval (Our Contribution)

Our key contribution is efficient two-stage retrieval:

- Stage 1: Fast prefetch with tile-level pooled vectors, using an HNSW index for O(log N) retrieval.
- Stage 2: Exact MaxSim reranking on the top-K candidates, with full multi-vector scoring for precision.
This provides:
- 5-10x speedup over full MaxSim at scale
- 95%+ accuracy compared to exhaustive search
- Memory efficient (embeddings are not all loaded upfront)
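The two stages can be sketched in NumPy. This is an illustration of the idea, not the toolkit's actual API: names and shapes are assumptions, and the brute-force pooled-vector scan in Stage 1 stands in for the HNSW index lookup used in practice.

```python
import numpy as np

def two_stage_search(query_vecs, doc_multivecs, doc_pooled, prefetch_k, top_k):
    """Two-stage retrieval sketch.

    query_vecs:    (Q, D) array of query token embeddings.
    doc_multivecs: list of (T_i, D) arrays, one per document.
    doc_pooled:    (N, D) array of pooled document vectors.
    """
    # Stage 1: prefetch candidates by pooled-vector similarity
    # (in production this would be an HNSW index query, ~O(log N)).
    q_pooled = query_vecs.mean(axis=0)
    coarse = doc_pooled @ q_pooled                  # (N,) coarse scores
    candidates = np.argsort(-coarse)[:prefetch_k]

    # Stage 2: exact MaxSim rerank on the candidates only: for each query
    # token, take its max dot product over the document's token vectors,
    # then sum over query tokens.
    def maxsim(doc_vecs):
        return (query_vecs @ doc_vecs.T).max(axis=1).sum()

    scored = [(int(i), float(maxsim(doc_multivecs[i]))) for i in candidates]
    scored.sort(key=lambda t: -t[1])
    return scored[:top_k]
```

Only `prefetch_k` documents ever go through the expensive MaxSim step, which is where the speedup and memory savings come from.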
To evaluate with two-stage retrieval:

```bash
python benchmarks/run_vidore.py \
    --dataset vidore/docvqa_test_subsampled \
    --two-stage \
    --prefetch-k 200 \
    --top-k 10
```
## Files

- `run_vidore.py` - Main evaluation script
- `prepare_submission.py` - Generate leaderboard submission
- `analyze_results.py` - Analyze and compare results