# ViDoRe Benchmark Evaluation
This directory contains scripts for evaluating visual document retrieval on the [ViDoRe benchmark](https://huggingface.co/spaces/vidore/vidore-leaderboard).
## Quick Start
### 1. Install Dependencies
```bash
# Install visual-rag-toolkit with all dependencies
pip install -e ".[all]"
# Install benchmark-specific dependencies
pip install datasets mteb
```
### 2. Run Evaluation
```bash
# Run on single dataset
python benchmarks/run_vidore.py --dataset vidore/docvqa_test_subsampled
# Run on all ViDoRe datasets
python benchmarks/run_vidore.py --all
# With two-stage retrieval (our contribution)
python benchmarks/run_vidore.py --dataset vidore/docvqa_test_subsampled --two-stage
```
### 3. Submit to Leaderboard
```bash
# Generate submission file
python benchmarks/prepare_submission.py --results results/
# Submit to HuggingFace
huggingface-cli login
huggingface-cli upload vidore/results ./submission.json
```
## ViDoRe Datasets
The benchmark includes these datasets (from the leaderboard):
| Dataset | Type | # Queries | # Documents |
|---------|------|-----------|-------------|
| docvqa_test_subsampled | DocVQA | ~500 | ~5,000 |
| infovqa_test_subsampled | InfoVQA | ~500 | ~5,000 |
| tabfquad_test_subsampled | TabFQuAD | ~500 | ~5,000 |
| tatdqa_test | TAT-DQA | ~1,500 | ~2,500 |
| arxivqa_test_subsampled | ArXivQA | ~500 | ~5,000 |
| shiftproject_test | SHIFT | ~500 | ~5,000 |
## Evaluation Metrics
- **NDCG@5**: Normalized Discounted Cumulative Gain at 5
- **NDCG@10**: Normalized Discounted Cumulative Gain at 10
- **MRR@10**: Mean Reciprocal Rank at 10
- **Recall@5**: Recall at 5
- **Recall@10**: Recall at 10
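For reference, these metrics can be computed as follows. This is a minimal sketch with binary relevance and hypothetical helper names, not the evaluation code used by `run_vidore.py` or mteb:

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """NDCG@k with binary relevance: DCG of the ranking over the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k]) if doc in relevant_ids)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(k, len(relevant_ids))))
    return dcg / ideal if ideal > 0 else 0.0

def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant hit in the top k (0 if none)."""
    for i, doc in enumerate(ranked_ids[:k]):
        if doc in relevant_ids:
            return 1.0 / (i + 1)
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents found in the top k."""
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0
```

Per-query values are averaged over all queries in a dataset to get the reported numbers.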
## Two-Stage Retrieval (Our Contribution)
Our key contribution is efficient two-stage retrieval:
```
Stage 1: Fast prefetch with tile-level pooled vectors
         Uses an HNSW index for O(log N) retrieval
Stage 2: Exact MaxSim reranking on the top-K candidates
         Full multi-vector scoring for precision
```
This provides:
- **5-10x speedup** over full MaxSim at scale
- **95%+ accuracy** compared to exhaustive search
- **Memory efficient** (avoids loading all embeddings upfront)
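The two stages above can be sketched in NumPy. This is an illustration, not the toolkit's implementation: brute-force dot products stand in for the HNSW index, one mean-pooled vector per document stands in for the tile-level pooled vectors, and all shapes and names are assumptions:

```python
import numpy as np

def maxsim(query_vecs, doc_vecs):
    """ColBERT-style MaxSim: for each query vector, take its best-matching
    document vector similarity, then sum over the query vectors."""
    return np.matmul(query_vecs, doc_vecs.T).max(axis=1).sum()

def two_stage_search(query_vecs, pooled_index, doc_multivecs,
                     prefetch_k=200, top_k=10):
    # Stage 1: cheap prefetch against one pooled vector per document
    # (a real system would query an HNSW index here instead).
    pooled_query = query_vecs.mean(axis=0)
    scores = pooled_index @ pooled_query
    candidates = np.argsort(-scores)[:prefetch_k]
    # Stage 2: exact multi-vector MaxSim scoring, but only on the
    # prefetched candidates, so the expensive step touches K docs, not N.
    reranked = sorted(candidates,
                      key=lambda i: maxsim(query_vecs, doc_multivecs[i]),
                      reverse=True)
    return reranked[:top_k]
```

The speedup comes from Stage 2 loading and scoring only `prefetch_k` candidate embeddings instead of the full corpus; raising `prefetch-k` trades latency for recall of the exhaustive ranking.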
To evaluate with two-stage:
```bash
python benchmarks/run_vidore.py \
--dataset vidore/docvqa_test_subsampled \
--two-stage \
--prefetch-k 200 \
--top-k 10
```
## Files
- `run_vidore.py` - Main evaluation script
- `prepare_submission.py` - Generate leaderboard submission
- `analyze_results.py` - Analyze and compare results