# ViDoRe Benchmark Evaluation

This directory contains scripts for evaluating visual document retrieval on the [ViDoRe benchmark](https://huggingface.co/spaces/vidore/vidore-leaderboard).

## Quick Start

### 1. Install Dependencies

```bash
# Install visual-rag-toolkit with all dependencies
pip install -e ".[all]"

# Install benchmark-specific dependencies
pip install datasets mteb
```

### 2. Run Evaluation

```bash
# Run on a single dataset
python benchmarks/run_vidore.py --dataset vidore/docvqa_test_subsampled

# Run on all ViDoRe datasets
python benchmarks/run_vidore.py --all

# With two-stage retrieval (our contribution)
python benchmarks/run_vidore.py --dataset vidore/docvqa_test_subsampled --two-stage
```

### 3. Submit to Leaderboard

```bash
# Generate submission file
python benchmarks/prepare_submission.py --results results/

# Submit to HuggingFace
huggingface-cli login
huggingface-cli upload vidore/results ./submission.json
```

## ViDoRe Datasets

The benchmark includes the following datasets, as listed on the leaderboard (query and document counts are approximate):

| Dataset | Type | # Queries | # Documents |
|---------|------|-----------|-------------|
| docvqa_test_subsampled | DocVQA | ~500 | ~5,000 |
| infovqa_test_subsampled | InfoVQA | ~500 | ~5,000 |
| tabfquad_test_subsampled | TabFQuAD | ~500 | ~5,000 |
| tatdqa_test | TAT-DQA | ~1,500 | ~2,500 |
| arxivqa_test_subsampled | ArXivQA | ~500 | ~5,000 |
| shiftproject_test | SHIFT | ~500 | ~5,000 |

## Evaluation Metrics

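- **NDCG@5**: Normalized Discounted Cumulative Gain over the top 5 results
- **NDCG@10**: Normalized Discounted Cumulative Gain over the top 10 results
- **MRR@10**: Mean Reciprocal Rank of the first relevant result within the top 10
- **Recall@5**: Fraction of relevant documents retrieved in the top 5
- **Recall@10**: Fraction of relevant documents retrieved in the top 10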
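
As a rough illustration of how these metrics are computed for a single query, here is a minimal sketch that assumes binary relevance (a document is either relevant or not). The function names are hypothetical and are not taken from `run_vidore.py`; the benchmark scripts may compute the metrics differently, for example through `mteb`.

```python
import math
from typing import Sequence, Set


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG@k for one query."""
    # DCG: each relevant hit at 0-based rank i contributes 1 / log2(i + 2).
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc_id in enumerate(ranked[:k])
        if doc_id in relevant
    )
    # Ideal DCG: all relevant documents placed at the very top of the ranking.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def mrr_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Reciprocal rank of the first relevant document within the top k."""
    for i, doc_id in enumerate(ranked[:k]):
        if doc_id in relevant:
            return 1.0 / (i + 1)
    return 0.0


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)
```

Averaging each function over all queries in a dataset gives the dataset-level score.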

## Two-Stage Retrieval (Our Contribution)

Our key contribution is efficient two-stage retrieval:

```
Stage 1: Fast prefetch with tile-level pooled vectors
         Uses an HNSW index for approximately O(log N) retrieval
         
Stage 2: Exact MaxSim reranking on top-K candidates
         Full multi-vector scoring for precision
```

This provides:
- **5-10x speedup** over full MaxSim at scale
- **95%+ of the accuracy** of exhaustive MaxSim search
- **Memory efficiency**: not all embeddings need to be loaded upfront
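
The two stages can be sketched in a few lines of dependency-free Python. This is an illustration under stated assumptions, not the toolkit's implementation: Stage 1 here is a brute-force scan over a single mean-pooled vector per document (standing in for the HNSW index over tile-level pooled vectors), embeddings are plain lists of floats, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
MultiVector = Sequence[Vector]  # one embedding per token or image patch


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean_pool(vectors: MultiVector) -> List[float]:
    """Collapse a multi-vector embedding into one vector (assumed pooling)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def maxsim(query: MultiVector, doc: MultiVector) -> float:
    """MaxSim: for each query vector, take its best-matching doc vector; sum."""
    return sum(max(dot(q, d) for d in doc) for q in query)


def two_stage_search(
    query: MultiVector,
    docs: Sequence[MultiVector],
    prefetch_k: int = 200,
    top_k: int = 10,
) -> List[Tuple[int, float]]:
    # Stage 1: cheap prefetch on pooled vectors.
    # A brute-force scan is used here in place of an HNSW index.
    pooled_query = mean_pool(query)
    coarse = sorted(
        ((dot(pooled_query, mean_pool(doc)), i) for i, doc in enumerate(docs)),
        reverse=True,
    )
    candidates = [i for _, i in coarse[:prefetch_k]]

    # Stage 2: exact MaxSim reranking, restricted to the prefetched candidates.
    reranked = sorted(
        ((maxsim(query, docs[i]), i) for i in candidates), reverse=True
    )
    return [(i, score) for score, i in reranked[:top_k]]


def exhaustive_search(
    query: MultiVector, docs: Sequence[MultiVector], top_k: int = 10
) -> List[Tuple[int, float]]:
    """Baseline: MaxSim against every document."""
    scored = sorted(
        ((maxsim(query, doc), i) for i, doc in enumerate(docs)), reverse=True
    )
    return [(i, score) for score, i in scored[:top_k]]
```

One way to check how much accuracy the prefetch costs is to compare the document ids returned by `two_stage_search` with those from `exhaustive_search` for the same queries. Setting `prefetch_k` to the corpus size makes the two rankings identical, so the gap comes entirely from candidates dropped in Stage 1.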

To evaluate with two-stage retrieval:

```bash
python benchmarks/run_vidore.py \
    --dataset vidore/docvqa_test_subsampled \
    --two-stage \
    --prefetch-k 200 \
    --top-k 10
```

## Files

- `run_vidore.py` - Main evaluation script
- `prepare_submission.py` - Generate leaderboard submission
- `analyze_results.py` - Analyze and compare results