| # qrel-analysis | |
| A pipeline for evaluating how well LLM-judge-derived relevance judgments (qrels) align with human judgments, across multiple thresholding strategies and retrieval systems. | |
| --- | |
| ## Overview | |
| This pipeline: | |
| 1. Takes LLM judge scores (e.g., reranking scores) and converts them into binary qrels using various thresholding strategies | |
| 2. Evaluates a retrieval run against those derived qrels using nDCG@10 | |
| 3. Aggregates results across datasets and judge systems into pivot CSVs | |
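The metric in step 2 can be sketched in plain Python. This is an illustration only: the pipeline itself lists `ir_measures` as its metrics dependency, and the function name `ndcg_at_k` and the toy data are hypothetical. It assumes linear gain and a log2 rank discount, and computes nDCG@10 for a single query against binary derived qrels.

```python
import math


def ndcg_at_k(ranked_docids, qrels, k=10):
    """nDCG@k for one query; qrels maps docid -> relevance grade (0 or 1 here)."""
    # Discounted gain of the run's top-k documents (unjudged docs count as 0).
    dcg = sum(
        qrels.get(docid, 0) / math.log2(rank + 1)
        for rank, docid in enumerate(ranked_docids[:k], start=1)
    )
    # Ideal DCG: the best possible ordering of the judged documents.
    ideal_gains = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal_gains, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical derived qrels and a run that ranks a non-relevant doc first.
derived_qrels = {"d1": 1, "d2": 0, "d3": 1}
run = ["d2", "d1", "d3"]
score = ndcg_at_k(run, derived_qrels)
```

A dataset-level score would then be the mean of the per-query values.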
| --- | |
| ## Pipeline | |
| ``` | |
| Human QRels + LLM Judge Runs + Retrieval Runs | |
| ↓ | |
| eval_autoqrels.py ← convert judge scores to qrels, evaluate | |
| ↓ | |
| Per-strategy nDCG@10 results (JSONL) | |
| ↓ | |
| collect_results.py ← aggregate across judges/datasets | |
| ↓ | |
| Pivot CSVs (one per retriever) | |
| ``` | |
| --- | |
| ## Scripts | |
| ### 1. `eval_autoqrels.py` — Core evaluation | |
| Converts LLM judge scores to binary qrels using a thresholding strategy, then evaluates a retrieval run against those qrels. | |
| **Arguments:** | |
| | Argument | Required | Default | Description | | |
| |---|---|---|---| | |
| | `--dataset_name` | Yes | — | Dataset name (e.g., `beir/nfcorpus/test`) | | |
| | `--loader_type` | No | `irds` | Loader module type | | |
| | `--judge_run` | Yes | — | Path to LLM judge JSONL file (used as qrel source) | | |
| `--evaluate_run` | No | same as `--judge_run` | Path to retrieval run JSONL to evaluate | |
| | `--strategies` | No | — | One or more strategies (see below); repeatable | | |
| | `--threshold` | No | `0.5` | Score cutoff for `thresholding` strategy | | |
| | `--rank_cutoff` | No | `10` | Top-k docs treated as relevant for `rank` strategy | | |
| | `--gap_k` | No | `1` | k-th largest score gap for `largest_gap` strategy | | |
| | `--quantile_cutoff` | No | `0.75` | Quantile threshold for `quantile` strategy | | |
| | `--min_relevance` | No | `1` | Min human relevance grade for oracle strategies | | |
| | `--exp` | No | — | Optional experiment tag added to output records | | |
| **Thresholding strategies:** | |
| | Strategy | Oracle? | Description | | |
| |---|---|---| | |
| | `direct` | No | Round LLM scores to nearest integer | | |
| | `thresholding` | No | Binary threshold at `--threshold` (default 0.5) | | |
| | `rank` | No | Top-k documents are relevant (k = `--rank_cutoff`) | | |
| | `largest_gap` | No | Threshold at the k-th largest score gap | | |
| `quantile` | No | Documents scoring at or above the `--quantile_cutoff` quantile (default 0.75) are relevant | |
| | `optimal_per_topic` | Yes | Per-topic threshold maximizing F1 vs human qrels | | |
| `optimal_global` | Yes | Single global threshold maximizing macro-averaged F1 vs human qrels | |
| | `all` | — | Apply all strategies at once | | |
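The strategies in the table can be sketched per query as below. This is a minimal sketch, not the code in `eval_autoqrels.py`: the function names are hypothetical, and several details are assumptions (ties at the cutoff, linear interpolation for the quantile, and reading `largest_gap` as "everything ranked above the k-th largest gap between consecutive sorted scores is relevant"). Each function takes `scores`, a dict mapping docid to judge score for one query.

```python
def by_threshold(scores, threshold=0.5):
    """`thresholding`: relevant when score >= threshold."""
    return {d: int(s >= threshold) for d, s in scores.items()}


def by_rank(scores, rank_cutoff=10):
    """`rank`: the top-k documents by score are relevant."""
    top = set(sorted(scores, key=scores.get, reverse=True)[:rank_cutoff])
    return {d: int(d in top) for d in scores}


def by_largest_gap(scores, gap_k=1):
    """`largest_gap` (assumed reading): cut at the k-th largest score gap."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    if len(ranked) < 2:
        return {d: 1 for d in ranked}
    # Gap between each document and the next one in the ranking.
    gaps = [(scores[ranked[i]] - scores[ranked[i + 1]], i) for i in range(len(ranked) - 1)]
    gaps.sort(key=lambda g: (-g[0], g[1]))  # largest gap first, earlier rank wins ties
    _, cut = gaps[max(1, min(gap_k, len(gaps))) - 1]
    relevant = set(ranked[: cut + 1])
    return {d: int(d in relevant) for d in scores}


def by_quantile(scores, quantile_cutoff=0.75):
    """`quantile`: relevant when score >= the q-th quantile (linear interpolation assumed)."""
    if not scores:
        return {}
    values = sorted(scores.values())
    pos = quantile_cutoff * (len(values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(values) - 1)
    cut = values[lo] + (values[hi] - values[lo]) * (pos - lo)
    return {d: int(s >= cut) for d, s in scores.items()}


def by_optimal_f1(scores, human, min_relevance=1):
    """`optimal_per_topic` (oracle): threshold maximizing F1 against human qrels."""
    gold = {d for d, g in human.items() if g >= min_relevance}
    best_f1, best_t = -1.0, None
    for t in sorted(set(scores.values())):  # candidate thresholds: observed scores
        pred = {d for d, s in scores.items() if s >= t}
        f1 = 2 * len(pred & gold) / (len(pred) + len(gold))
        if f1 > best_f1:
            best_f1, best_t = f1, t
    return {d: int(s >= best_t) for d, s in scores.items()}


# Hypothetical judge scores and human grades for a single query.
scores = {"d1": 0.9, "d2": 0.8, "d3": 0.3, "d4": 0.1}
human = {"d1": 1, "d3": 1}
```

Under these assumptions, `optimal_global` would follow the same idea as `by_optimal_f1`, but with one threshold chosen to maximize the F1 averaged over all queries.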
| **Input format** (both `--judge_run` and `--evaluate_run`): JSONL, one record per line: | |
| ```json | |
| {"qid": "query_id", "docid": "doc_id", "score": 3.14} | |
| ``` | |
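Records in this format can be read into a nested `qid -> docid -> score` mapping. The loader below is a hypothetical helper for illustration, not part of the scripts; it assumes every record carries the three fields shown above and skips blank lines.

```python
import json
from collections import defaultdict


def load_run(lines):
    """Parse JSONL run records into {qid: {docid: score}}."""
    run = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        rec = json.loads(line)
        run[str(rec["qid"])][str(rec["docid"])] = float(rec["score"])
    return dict(run)


# Works on any iterable of lines, e.g. an open file handle or a list of strings.
sample = [
    '{"qid": "q1", "docid": "d1", "score": 3.14}',
    '{"qid": "q1", "docid": "d2", "score": 0.5}',
    '',
    '{"qid": "q2", "docid": "d1", "score": 1.0}',
]
run = load_run(sample)
```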
| **Output format:** JSONL to stdout, one record per strategy: | |
| ```json | |
| {"dataset": "beir/nfcorpus/test", "exp": "judge:eval", "strategy": "direct", "nDCG@10": 0.5559} | |
| ``` | |
| **Example usage:** | |
| ```bash | |
| # Evaluate using a single strategy | |
| python eval_autoqrels.py \ | |
| --dataset_name beir/nfcorpus/test \ | |
| --judge_run /path/to/judge_run.jsonl \ | |
| --evaluate_run /path/to/bm25_run.jsonl \ | |
| --strategies rank | |
| # Multiple strategies | |
| python eval_autoqrels.py \ | |
| --dataset_name beir/nfcorpus/test \ | |
| --judge_run /path/to/judge_run.jsonl \ | |
| --evaluate_run /path/to/bm25_run.jsonl \ | |
| --strategies direct --strategies rank --strategies largest_gap | |
| # All strategies with experiment tag, save to file | |
| python eval_autoqrels.py \ | |
| --dataset_name beir/nfcorpus/test \ | |
| --judge_run /path/to/judge_run.jsonl \ | |
| --evaluate_run /path/to/bm25_run.jsonl \ | |
| --strategies all \ | |
| --exp "my_experiment" \ | |
| > results/my_judge/raaj-nfcorpus.jsonl | |
| ``` | |
| --- | |
| ### 2. `collect_results.py` — Aggregate results | |
| Scans result directories for JSONL files (`*/raaj*.jsonl`), parses them, and produces pivot-table CSVs grouped by retrieval system. | |
| **Arguments:** | |
| | Argument | Required | Description | | |
| |---|---|---| | |
| | `--results_dir` | Yes | Base directory containing per-judge subdirectories | | |
| | `--output_dir` | Yes | Directory to write one CSV per retrieval system | | |
| **Expected input directory structure:** | |
| ``` | |
| results_dir/ | |
| ├── bm25-rerank-judge/ | |
| │ ├── raaj-nfcorpus.jsonl | |
| │ ├── raaj-trec-covid.jsonl | |
| │ └── ... | |
| ├── colbert-small-rerank-judge/ | |
| │ └── ... | |
| └── splade-v3-rerank-judge/ | |
| └── ... | |
| ``` | |
| **Output:** One CSV per retrieval system (e.g., `bm25.csv`, `colbert-small.csv`). Each file is a pipe-delimited pivot table comparing nDCG@10 across judge systems and strategies. | |
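The aggregation step can be sketched with the standard library alone. The real script lists `pandas` as its dependency, so this is only an approximation under stated assumptions: the helper names are hypothetical, the directory name is assumed to split on `-rerank-` into retriever and judge, and the pivot layout (one row per dataset and strategy, one column per judge) is a guess at the shape rather than the script's actual output.

```python
import csv
import io
from collections import defaultdict


def split_dir_name(name):
    """Split '{retriever}-rerank-{judge}' into (retriever, judge)."""
    retriever, _, judge = name.partition("-rerank-")
    return retriever, judge


def pivot(records):
    """Build a pipe-delimited table from (judge, dataset, strategy, ndcg) tuples."""
    table = defaultdict(dict)
    judges = []
    for judge, dataset, strategy, ndcg in records:
        table[(dataset, strategy)][judge] = ndcg
        if judge not in judges:
            judges.append(judge)  # keep first-seen column order
    out = io.StringIO()
    writer = csv.writer(out, delimiter="|", lineterminator="\n")
    writer.writerow(["dataset", "strategy", *judges])
    for (dataset, strategy), row in sorted(table.items()):
        writer.writerow([dataset, strategy, *(row.get(j, "") for j in judges)])
    return out.getvalue()


# Hypothetical records for one retriever, as parsed from raaj-*.jsonl files.
retriever, judge = split_dir_name("colbert-small-rerank-judge")
text = pivot([(judge, "beir/nfcorpus/test", "direct", 0.5559)])
```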
| **Example usage:** | |
| ```bash | |
| python collect_results.py \ | |
| --results_dir ./results \ | |
| --output_dir ./new | |
| ``` | |
| --- | |
| ### 3. `get_beir_stats.py` — Dataset statistics | |
| Prints a tab-separated statistics table for BEIR and MS MARCO datasets. No arguments needed. | |
| **Datasets covered:** msmarco-passage/trec-dl-2019, trec-dl-2020, and 13 BEIR datasets (arguana, climate-fever, dbpedia-entity, fever, fiqa, hotpotqa, nfcorpus, nq, quora, scidocs, scifact, trec-covid, webis-touche2020). | |
| **Statistics reported:** number of queries, corpus size, total judgments, average query and document length, average positives and negatives per query, judgment level range. | |
| ```bash | |
| python get_beir_stats.py | |
| ``` | |
| --- | |
| ## Results Directory | |
| Current results are in `results/`, organized by `{retriever}-rerank-{judge}/`, e.g.: | |
| - `bm25-rerank-judge/` | |
| - `colbert-small-rerank-judge/` | |
| - `splade-v3-rerank-judge/` | |
| - `nomicai-modernbert-embed-rerank-judge/` | |
| - `qwen3-embed-600m-rerank-judge/` | |
| Each subdirectory contains `raaj-{dataset}.jsonl` files. | |
| --- | |
| ## Dependencies | |
| - `autollmrerank` — internal module for dataset loading (`loader_dev.irds`) | |
| - `ir_measures` — IR evaluation metrics | |
| - `pandas` — result aggregation | |