# ViDoRe Benchmark Evaluation

This directory contains scripts for evaluating visual document retrieval on the [ViDoRe benchmark](https://huggingface.co/spaces/vidore/vidore-leaderboard).

## Quick Start

### 1. Install Dependencies

```bash
# Install visual-rag-toolkit with all dependencies
pip install -e ".[all]"

# Install benchmark-specific dependencies
pip install datasets mteb
```
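An optional sanity check that the benchmark dependencies import cleanly (both packages expose `__version__`):

```python
# Optional: confirm the benchmark dependencies are importable
import datasets
import mteb

print(datasets.__version__, mteb.__version__)
```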
### 2. Run Evaluation

```bash
# Run on a single dataset
python benchmarks/run_vidore.py --dataset vidore/docvqa_test_subsampled

# Run on all ViDoRe datasets
python benchmarks/run_vidore.py --all

# With two-stage retrieval (our contribution)
python benchmarks/run_vidore.py --dataset vidore/docvqa_test_subsampled --two-stage
```
### 3. Submit to Leaderboard

```bash
# Generate submission file
python benchmarks/prepare_submission.py --results results/

# Submit to HuggingFace
huggingface-cli login
huggingface-cli upload vidore/results ./submission.json
```
## ViDoRe Datasets

The benchmark includes these datasets (from the leaderboard):

| Dataset | Type | # Queries | # Documents |
|---------|------|-----------|-------------|
| docvqa_test_subsampled | DocVQA | ~500 | ~5,000 |
| infovqa_test_subsampled | InfoVQA | ~500 | ~5,000 |
| tabfquad_test_subsampled | TabFQuAD | ~500 | ~5,000 |
| tatdqa_test | TAT-DQA | ~1,500 | ~2,500 |
| arxivqa_test_subsampled | ArXivQA | ~500 | ~5,000 |
| shiftproject_test | SHIFT | ~500 | ~5,000 |
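Each dataset can be pulled straight from the Hugging Face Hub with the `datasets` library; a minimal example (using the `test` split these evaluation sets provide):

```python
from datasets import load_dataset

# Download one ViDoRe dataset; each row pairs a query with a document page image
ds = load_dataset("vidore/docvqa_test_subsampled", split="test")
print(ds)  # inspect columns and row count
```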
## Evaluation Metrics

- **NDCG@5**: Normalized Discounted Cumulative Gain at 5
- **NDCG@10**: Normalized Discounted Cumulative Gain at 10
- **MRR@10**: Mean Reciprocal Rank at 10
- **Recall@5**: Recall at 5
- **Recall@10**: Recall at 10
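To make the metric definitions concrete, here is a minimal pure-Python sketch for a single query with binary relevance (the evaluation scripts may use a library implementation such as `mteb` instead):

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """NDCG@k with binary relevance: DCG of the ranking / DCG of the ideal ranking."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_ids[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal > 0 else 0.0

def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant hit within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Example: the single relevant page is retrieved at rank 2
print(ndcg_at_k(["p9", "p3", "p7"], {"p3"}, k=5))   # ~0.631
print(mrr_at_k(["p9", "p3", "p7"], {"p3"}, k=10))   # 0.5
```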
## Two-Stage Retrieval (Our Contribution)

Our key contribution is an efficient two-stage retrieval pipeline:

```
Stage 1: Fast prefetch with tile-level pooled vectors
         Uses an HNSW index for O(log N) retrieval
Stage 2: Exact MaxSim reranking on the top-K candidates
         Full multi-vector scoring for precision
```

This provides:

- **5-10x speedup** over full MaxSim at scale
- **95%+ accuracy** compared to exhaustive search
- **Memory efficiency**: document embeddings are loaded per candidate rather than all upfront
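In code, the pipeline looks roughly like this. This is a hedged sketch, not the toolkit's actual API: the input names (`query_tokens`, `pooled_vecs`, `doc_token_embs`) are illustrative placeholders, and HNSW is shown via the `hnswlib` package.

```python
import hnswlib
import numpy as np

def two_stage_search(query_tokens, pooled_vecs, doc_token_embs,
                     prefetch_k=200, top_k=10):
    """Prefetch-then-rerank sketch. All inputs are illustrative placeholders:
      query_tokens:   (n_q, d) array of query token embeddings
      pooled_vecs:    (n_docs, d) array, one pooled vector per document
      doc_token_embs: list of (n_tokens_i, d) arrays, one per document
    """
    n_docs, dim = pooled_vecs.shape

    # Stage 1: approximate nearest-neighbour prefetch over pooled vectors.
    # (In real use, build this index once at ingest time, not per query.)
    index = hnswlib.Index(space="ip", dim=dim)  # inner-product similarity
    index.init_index(max_elements=n_docs, ef_construction=200, M=16)
    index.add_items(pooled_vecs, np.arange(n_docs))
    index.set_ef(max(prefetch_k, 50))
    pooled_query = query_tokens.mean(axis=0)    # one coarse query vector (illustrative)
    candidates, _ = index.knn_query(pooled_query, k=min(prefetch_k, n_docs))

    # Stage 2: exact MaxSim rerank on the prefetched candidates only.
    scores = []
    for doc_id in candidates[0]:
        sim = query_tokens @ doc_token_embs[doc_id].T  # (n_q, n_tokens_i)
        maxsim = float(sim.max(axis=1).sum())          # sum of per-query-token maxima
        scores.append((maxsim, int(doc_id)))
    scores.sort(reverse=True)
    return scores[:top_k]
```

Raising `prefetch_k` improves Stage 1 recall at the cost of more Stage 2 rerank work; that trade-off is presumably what the `--prefetch-k` flag below controls.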
To evaluate with two-stage retrieval:

```bash
python benchmarks/run_vidore.py \
    --dataset vidore/docvqa_test_subsampled \
    --two-stage \
    --prefetch-k 200 \
    --top-k 10
```
## Files

- `run_vidore.py` - Main evaluation script
- `prepare_submission.py` - Generate leaderboard submission
- `analyze_results.py` - Analyze and compare results