Rom89823974978 committed on
Commit
12409b1
·
1 Parent(s): 549e0c8

Updated codebase
README.md CHANGED
@@ -22,19 +22,16 @@ Hugging Face spaces setup
 ## 1 Quick start

 ```bash
-# ❶ Clone and set up the dev env
-git clone https://github.com/<your-org>/rag-eval-framework.git
-cd rag-eval-framework
+git clone https://github.com/Romainkul/rag_evaluation.git
+cd rag_evaluation
 python -m venv .venv && source .venv/bin/activate
 pip install -r requirements.txt
 pre-commit install

-# ❷ Fetch a toy corpus (≈200 docs)
 bash scripts/download_data.sh

-# ❸ First single-config run (indexes auto-build)
-python scripts/run_experiments.py \
-    --config configs/pipeline_hybrid_ce.yaml \
+python scripts/analysis.py \
+    --config configs/kilt_hybrid_ce.yaml \
     --queries data/sample_queries.jsonl
 ```
@@ -54,10 +51,10 @@ evaluation/ ← Core library
 ├─ metrics/                 • Retrieval, generation, composite RAG score
 └─ stats/                   • Correlation, significance, robustness utilities
 scripts/                    ← CLI tools
-├─ run_experiments.py       • Single-config runner (logs, metrics, plots)
-├─ run_grid_experiments.py  • **Grid runner** – all configs × datasets, RQ1-RQ4 analysis
+├─ prep_annotations.py      • Runs RAG and logs all outputs for expert annotation
+├─ analysis.py              • **Grid runner** – all configs × datasets, RQ1-RQ4 analysis
 ├─ dashboard.py             • **Streamlit dashboard** for interactive exploration
-tests/                      ← PyTest smoke tests
+tests/                      ← PyTest tests
 configs/                    ← YAML templates for pipelines & stats
 .github/workflows/          ← Lint + tests CI
 Dockerfile                  ← Slim reproducible image
@@ -69,10 +66,10 @@ Dockerfile ← Slim reproducible image

 | Research-proposal element | Code artefact | Purpose |
 | ------------------------- | ------------- | ------- |
-| **RQ1** Classical retrieval ↔ factual correctness | `evaluation/retrievers/`, `run_grid_experiments.py` | Computes Spearman / Kendall ρ with CIs for MRR, MAP, P@k vs *human_correct*. |
-| **RQ2** Faithfulness metrics vs expert judgements | `evaluation/metrics/`, `evaluation/stats/`, grid script | Correlates QAGS, FactScore, RAGAS-F etc. with *human_faithful*; Wilcoxon + Holm. |
-| **RQ3** Error propagation → hallucination | `evaluation/stats.robustness`, grid script | χ² test, conditional failure rates across corpora / document styles. |
-| **RQ4** Robustness to adversarial evidence | Perturbed datasets (`*_pert.jsonl`) + grid script | Δ-metrics & Cohen's *d* between clean and perturbed runs. |
+| **RQ1** Classical retrieval ↔ factual correctness | `evaluation/retrievers/`, `analysis.py` | Computes Spearman / Kendall ρ with CIs for MRR, MAP, P@k vs *human_correct*. |
+| **RQ2** Faithfulness metrics vs expert judgements | `evaluation/metrics/`, `evaluation/stats/`, `analysis.py` | Correlates QAGS, FactScore, RAGAS-F etc. with *human_faithful*; Wilcoxon + Holm. |
+| **RQ3** Error propagation → hallucination | `evaluation/stats.robustness`, `analysis.py` | χ² test, conditional failure rates across corpora / document styles. |
+| **RQ4** Robustness to adversarial evidence | Perturbed datasets (`*_pert.jsonl`) + `analysis.py` | Δ-metrics & Cohen's *d* between clean and perturbed runs. |
 | Interactive analysis / decision-making | `scripts/dashboard.py` | Select dataset + configs, explore tables & plots instantly. |
 | EU AI-Act traceability (Art. 14-15) | Rotating file logging (`evaluation/utils/logger.py`), Docker, CI | Full run provenance (config + log + results + stats) stored under `outputs/`. |
@@ -82,7 +79,7 @@ Dockerfile ← Slim reproducible image

 ```bash
 # Evaluate three configs on two datasets, save everything under outputs/grid
-python scripts/run_grid_experiments.py \
+python scripts/analysis.py \
     --configs configs/*.yaml \
     --datasets data/legal.jsonl data/finance.jsonl \
     --plots
@@ -104,7 +101,7 @@ outputs/grid/<dataset>/wilcoxon_rag_holm.yaml ← pairwise p-values
 Run a *single* new config and automatically compare it to all previous ones:

 ```bash
-python scripts/run_grid_experiments.py \
+python scripts/analysis.py \
     --configs configs/my_new.yaml \
     --datasets data/legal.jsonl \
     --outdir outputs/grid \
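The `--queries` file is plain JSONL: per the CLI help in the removed `run_experiments.py`, each line must contain at least `{'question': ...}`, and any extra keys (ids, gold labels) are merged into the corresponding result row. A minimal sketch for building such a file (the questions themselves are made up):

```python
import json
from pathlib import Path

# Hypothetical queries -- only the "question" key is required;
# extra keys like "id" are carried through to the results.
queries = [
    {"id": 0, "question": "What is a hybrid retriever?"},
    {"id": 1, "question": "Which court decided the case?"},
]

path = Path("data/sample_queries.jsonl")
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("w") as f:
    for q in queries:
        f.write(json.dumps(q) + "\n")
```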
configs/kilt_hybrid_ce.yaml ADDED
@@ -0,0 +1,42 @@
+# This configuration file sets up a hybrid pipeline using a retriever, generator, and reranker.
+# It is designed to work with the KILT dataset and uses FAISS for retrieval.
+
+logging:
+  log_dir: logs
+  level: INFO
+  max_mb: 5
+  backups: 5
+
+retriever:
+  # using FAISS (dense) retrieval over KILT's Wikipedia passages
+  name: dense
+  faiss_index: /path/to/kilt_wiki_faiss.index
+  top_k: 5
+  model_name: sentence-transformers/all-MiniLM-L6-v2
+  device: cpu
+
+generator:
+  model_name: facebook/bart-large
+  device: cpu
+  max_new_tokens: 256
+  temperature: 0.0
+
+reranker:
+  enable: true
+  model_name: cross-encoder/ms-marco-MiniLM-L-6-v2
+  device: cpu
+  max_length: 512
+  first_stage_k: 5
+  final_k: 5
+
+stats:
+  correlation_method: spearman
+  n_boot: 1000
+  ci: 0.95
+  wilcoxon_alternative: two-sided
+  multiple_correction: holm-bonferroni
+  alpha: 0.05
+  compute_effect_size: true
+  n_permutations: 1000
+  failure_threshold: 0.0
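For reference, this YAML is parsed into nested dataclasses by the `load_pipeline_config` helper defined in `scripts/prep_annotations.py`. A minimal usage sketch, assuming the repository root is on `sys.path`, `scripts/` is importable as a package, and `faiss_index` points at a real index file (the placeholder path above will not resolve):

```python
from pathlib import Path

from scripts.prep_annotations import load_pipeline_config
from evaluation import RAGPipeline

cfg = load_pipeline_config(Path("configs/kilt_hybrid_ce.yaml"))
print(cfg.retriever.top_k, cfg.reranker.enable)  # 5 True

# Builds retriever -> (reranker) -> generator from the config.
pipe = RAGPipeline(cfg)
```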
data/load_datasets.py ADDED
@@ -0,0 +1,29 @@
+from datasets import load_dataset
+
+# Load datasets for evaluation.
+# This script loads various datasets for evaluation purposes, including
+# finance, legal, KILT, and Natural Questions (NQ).
+
+# Finance dataset
+ds_finance = load_dataset("PatronusAI/financebench")
+
+# Legal dataset
+ds_legal = load_dataset("nguha/legalbench", "canada_tax_court_outcomes")
+# Possible datasets in LegalBench:
+# ['abercrombie', 'canada_tax_court_outcomes', 'citation_prediction_classification', 'citation_prediction_open', 'consumer_contracts_qa', 'contract_nli_confidentiality_of_agreement', 'contract_nli_explicit_identification', 'contract_nli_inclusion_of_verbally_conveyed_information', 'contract_nli_limited_use', 'contract_nli_no_licensing', 'contract_nli_notice_on_compelled_disclosure', 'contract_nli_permissible_acquirement_of_similar_information', 'contract_nli_permissible_copy', 'contract_nli_permissible_development_of_similar_information', 'contract_nli_permissible_post-agreement_possession', 'contract_nli_return_of_confidential_information', 'contract_nli_sharing_with_employees', 'contract_nli_sharing_with_third-parties', 'contract_nli_survival_of_obligations', 'contract_qa', 'corporate_lobbying', 'cuad_affiliate_license-licensee', 'cuad_affiliate_license-licensor', 'cuad_anti-assignment', 'cuad_audit_rights', 'cuad_cap_on_liability', 'cuad_change_of_control', 'cuad_competitive_restriction_exception', 'cuad_covenant_not_to_sue', 'cuad_effective_date', 'cuad_exclusivity', 'cuad_expiration_date', 'cuad_governing_law', 'cuad_insurance', 'cuad_ip_ownership_assignment', 'cuad_irrevocable_or_perpetual_license', 'cuad_joint_ip_ownership', 'cuad_license_grant', 'cuad_liquidated_damages', 'cuad_minimum_commitment', 'cuad_most_favored_nation', 'cuad_no-solicit_of_customers', 'cuad_no-solicit_of_employees', 'cuad_non-compete', 'cuad_non-disparagement', 'cuad_non-transferable_license', 'cuad_notice_period_to_terminate_renewal', 'cuad_post-termination_services', 'cuad_price_restrictions', 'cuad_renewal_term', 'cuad_revenue-profit_sharing', 'cuad_rofr-rofo-rofn', 'cuad_source_code_escrow', 'cuad_termination_for_convenience', 'cuad_third_party_beneficiary', 'cuad_uncapped_liability', 'cuad_unlimited-all-you-can-eat-license', 'cuad_volume_restriction', 'cuad_warranty_duration', 'definition_classification', 'definition_extraction', 'diversity_1', 'diversity_2', 'diversity_3', 'diversity_4', 'diversity_5', 'diversity_6', 'function_of_decision_section', 'hearsay', 'insurance_policy_interpretation', 'international_citizenship_questions', 'jcrew_blocker', 'learned_hands_benefits', 'learned_hands_business', 'learned_hands_consumer', 'learned_hands_courts', 'learned_hands_crime', 'learned_hands_divorce', 'learned_hands_domestic_violence', 'learned_hands_education', 'learned_hands_employment', 'learned_hands_estates', 'learned_hands_family', 'learned_hands_health', 'learned_hands_housing', 'learned_hands_immigration', 'learned_hands_torts', 'learned_hands_traffic', 'legal_reasoning_causality', 'maud_ability_to_consummate_concept_is_subject_to_mae_carveouts', 'maud_accuracy_of_fundamental_target_rws_bringdown_standard', 'maud_accuracy_of_target_capitalization_rw_(outstanding_shares)_bringdown_standard_answer', 'maud_accuracy_of_target_general_rw_bringdown_timing_answer', 'maud_additional_matching_rights_period_for_modifications_(cor)', 'maud_application_of_buyer_consent_requirement_(negative_interim_covenant)', 'maud_buyer_consent_requirement_(ordinary_course)', 'maud_change_in_law__subject_to_disproportionate_impact_modifier', 'maud_changes_in_gaap_or_other_accounting_principles__subject_to_disproportionate_impact_modifier', 'maud_cor_permitted_in_response_to_intervening_event', 'maud_cor_permitted_with_board_fiduciary_determination_only', 'maud_cor_standard_(intervening_event)', 'maud_cor_standard_(superior_offer)', 'maud_definition_contains_knowledge_requirement_-_answer', 'maud_definition_includes_asset_deals', 'maud_definition_includes_stock_deals', 'maud_fiduciary_exception__board_determination_standard', 'maud_fiduciary_exception_board_determination_trigger_(no_shop)', 'maud_financial_point_of_view_is_the_sole_consideration', 'maud_fls_(mae)_standard', 'maud_general_economic_and_financial_conditions_subject_to_disproportionate_impact_modifier', 'maud_includes_consistent_with_past_practice', 'maud_initial_matching_rights_period_(cor)', 'maud_initial_matching_rights_period_(ftr)', 'maud_intervening_event_-_required_to_occur_after_signing_-_answer', 'maud_knowledge_definition', 'maud_liability_standard_for_no-shop_breach_by_target_non-do_representatives', 'maud_ordinary_course_efforts_standard', 'maud_pandemic_or_other_public_health_event__subject_to_disproportionate_impact_modifier', 'maud_pandemic_or_other_public_health_event_specific_reference_to_pandemic-related_governmental_responses_or_measures', 'maud_relational_language_(mae)_applies_to', 'maud_specific_performance', 'maud_tail_period_length', 'maud_type_of_consideration', 'nys_judicial_ethics', 'opp115_data_retention', 'opp115_data_security', 'opp115_do_not_track', 'opp115_first_party_collection_use', 'opp115_international_and_specific_audiences', 'opp115_policy_change', 'opp115_third_party_sharing_collection', 'opp115_user_access,_edit_and_deletion', 'opp115_user_choice_control', 'oral_argument_question_purpose', 'overruling', 'personal_jurisdiction', 'privacy_policy_entailment', 'privacy_policy_qa', 'proa', 'rule_qa', 'sara_entailment', 'sara_numeric', 'scalr', 'ssla_company_defendants', 'ssla_individual_defendants', 'ssla_plaintiff', 'successor_liability', 'supply_chain_disclosure_best_practice_accountability', 'supply_chain_disclosure_best_practice_audits', 'supply_chain_disclosure_best_practice_certification', 'supply_chain_disclosure_best_practice_training', 'supply_chain_disclosure_best_practice_verification', 'supply_chain_disclosure_disclosed_accountability', 'supply_chain_disclosure_disclosed_audits', 'supply_chain_disclosure_disclosed_certification', 'supply_chain_disclosure_disclosed_training', 'supply_chain_disclosure_disclosed_verification', 'telemarketing_sales_rule', 'textualism_tool_dictionaries', 'textualism_tool_plain', 'ucc_v_common_law', 'unfair_tos']
+
+# KILT dataset
+ds_kilt = load_dataset("facebook/kilt_tasks", "nq")
+
+# Natural Questions dataset
+ds_nq = load_dataset("sentence-transformers/natural-questions")
+
+
+def load_datasets():
+    """Load and return the datasets."""
+    return {
+        "finance": ds_finance,
+        "legal": ds_legal,
+        "kilt": ds_kilt,
+        "nq": ds_nq,
+    }
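Note that the `load_dataset` calls run at import time, so importing this module downloads all four datasets. A quick sanity check of what `load_datasets()` returns (each value is a Hugging Face `DatasetDict`; split and column names differ per dataset, so inspect them before converting anything into the JSONL query format the scripts expect):

```python
from data.load_datasets import load_datasets  # assumes repo root on sys.path

for name, ds in load_datasets().items():
    first_split = next(iter(ds))  # e.g. "train" or "test"
    print(name, list(ds.keys()), ds[first_split].column_names)
```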
evaluation/pipeline.py CHANGED
@@ -35,14 +35,49 @@ class RAGPipeline:
     # Public API
     # ---------------------------------------------------------------------
     def run(self, question: str) -> Dict[str, Any]:
-        """Retrieve context and generate answer."""
         logger.info("Question: %s", question)
-        contexts = self._retrieve(question)
-        answer = self._generate(question, contexts)
+
+        # 1. raw retrieval
+        k_first = self.cfg.reranker.first_stage_k if self.reranker else self.cfg.retriever.top_k
+        initial: List[Context] = self.retriever.retrieve(question, top_k=k_first)
+
+        raw_hits = [
+            {"text": c.text, "id": c.id, "score": getattr(c, "retrieval_score", None)}
+            for c in initial
+        ]
+
+        # 2. reranking (if enabled)
+        if self.reranker:
+            final_k = self.cfg.reranker.final_k or self.cfg.retriever.top_k
+            reranked: List[Context] = self.reranker.rerank(question, initial, k=final_k)
+
+            reranked_hits = [
+                {
+                    "text": c.text,
+                    "id": c.id,
+                    "score": getattr(c, "cross_encoder_score", None),
+                }
+                for c in reranked
+            ]
+            contexts_for_gen = reranked
+        else:
+            reranked_hits = []
+            contexts_for_gen = initial
+
+        # 3. generation
+        answer = self.generator.generate(
+            question,
+            [c.text for c in contexts_for_gen],
+            max_new_tokens=self.cfg.generator.max_new_tokens,
+            temperature=self.cfg.generator.temperature,
+        )
+
         return {
             "question": question,
+            "raw_retrieval": raw_hits,
+            "reranked": reranked_hits,
+            "contexts": [c.text for c in contexts_for_gen],
             "answer": answer,
-            "contexts": [c.text for c in contexts],
         }

     __call__ = run  # alias
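After this change, `run()` exposes both retrieval stages instead of only the final contexts. A sketch of the returned dict (values are illustrative; `score` is `None` when a retriever or reranker does not attach one):

```python
result = {
    "question": "…",
    "raw_retrieval": [  # first-stage hits (first_stage_k of them when reranking)
        {"text": "…passage…", "id": "doc42", "score": 0.71},
    ],
    "reranked": [       # empty list when the reranker is disabled
        {"text": "…passage…", "id": "doc42", "score": 4.2},
    ],
    "contexts": ["…passage…"],  # the texts actually fed to the generator
    "answer": "…generated answer…",
}
```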
pyserini/search.py CHANGED
@@ -2,7 +2,6 @@

 class SimpleSearcher:
     def __init__(self, index_path):
-        # no-op
         pass
     def set_bm25(self):
         pass
requirements.txt CHANGED
@@ -9,6 +9,7 @@ langchain>=0.1.0
 ragas>=0.1.0
 trulens-eval>=0.21.0
 evaluate
+datasets

 # Data & science
 pandas>=2.2
scripts/analysis.py ADDED
@@ -0,0 +1,160 @@
+"""
+Runs evaluation (RQ1–RQ4, statistical tests, plots) on previously annotated
+pipeline outputs that include `human_correct` and `human_faithful`.
+
+Assumes outputs were generated using `prep_annotations.py` and
+subsequently annotated.
+"""
+
+import argparse
+import json
+import logging
+import itertools
+from pathlib import Path
+
+import numpy as np
+import yaml
+import matplotlib.pyplot as plt
+
+from evaluation.stats import (
+    corr_ci,
+    wilcoxon_signed_rank,
+    holm_bonferroni,
+    conditional_failure_rate,
+    chi2_error_propagation,
+    delta_metric,
+)
+from evaluation.utils.logger import init_logging
+
+
+def read_jsonl(path: Path):
+    with path.open() as f:
+        return [json.loads(line) for line in f]
+
+
+def save_yaml(path: Path, obj: dict):
+    path.parent.mkdir(parents=True, exist_ok=True)
+    path.write_text(yaml.safe_dump(obj, sort_keys=False))
+
+
+def agg_mean(rows: list[dict]) -> dict:
+    keys = rows[0]["metrics"].keys()
+    return {k: float(np.mean([r["metrics"][k] for r in rows])) for k in keys}
+
+
+def rq1_correlation(rows):
+    if "human_correct" not in rows[0] or rows[0]["human_correct"] is None:
+        return {}
+    retrieval_keys = [k for k in rows[0]["metrics"] if k in {"mrr", "map", "precision@10"}]
+    gold = [1.0 if r["human_correct"] else 0.0 for r in rows]
+    out = {}
+    for k in retrieval_keys:
+        vec = [r["metrics"][k] for r in rows]
+        r, (lo, hi), p = corr_ci(vec, gold, method="pearson", n_boot=1000, ci=0.95)
+        out[k] = dict(r=r, ci=[lo, hi], p=p)
+    return out
+
+
+def rq2_faithfulness(rows):
+    if "human_faithful" not in rows[0] or rows[0]["human_faithful"] is None:
+        return {}
+    faith_keys = [k for k in rows[0]["metrics"] if k.lower().startswith(("faith", "qags", "fact", "ragas"))]
+    gold = [r["human_faithful"] for r in rows]
+    out = {}
+    for k in faith_keys:
+        vec = [r["metrics"][k] for r in rows]
+        r, (lo, hi), p = corr_ci(vec, gold, method="pearson", n_boot=1000, ci=0.95)
+        out[k] = dict(r=r, ci=[lo, hi], p=p)
+    return out
+
+
+def rq3_error_propagation(rows):
+    if "retrieval_error" not in rows[0] or "hallucination" not in rows[0]:
+        return {}
+    ret_err = [r["retrieval_error"] for r in rows]
+    halluc = [r["hallucination"] for r in rows]
+    return {
+        "conditional": conditional_failure_rate(ret_err, halluc),
+        "chi2": chi2_error_propagation(ret_err, halluc),
+    }
+
+
+def rq4_robustness(orig_rows, pert_rows):
+    if pert_rows is None:
+        return {}
+    metrics = orig_rows[0]["metrics"].keys()
+    out = {}
+    for m in metrics:
+        d, eff = delta_metric(
+            [r["metrics"][m] for r in orig_rows],
+            [r["metrics"][m] for r in pert_rows],
+        )
+        out[m] = dict(delta=d, cohen_d=eff)
+    return out
+
+
+def scatter_mrr_vs_correct(rows, path: Path):
+    x = [r["metrics"].get("mrr", np.nan) for r in rows]
+    y = [1 if r.get("human_correct") else 0 for r in rows]
+    plt.figure()
+    plt.scatter(x, y, alpha=0.5)
+    plt.xlabel("MRR"); plt.ylabel("Correct (1)")
+    plt.title("MRR vs. Human Correctness")
+    plt.tight_layout(); plt.savefig(path); plt.close()
+
+
+def main(argv=None):
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--results", nargs="+", type=Path, required=True,
+                    help="One or more annotated results.jsonl files.")
+    ap.add_argument("--outdir", type=Path, default=Path("outputs/grid"))
+    ap.add_argument("--perturbed-suffix", default="_pert.jsonl",
+                    help="Looks for this perturbed variant for RQ4.")
+    ap.add_argument("--plots", action="store_true")
+    args = ap.parse_args(argv)
+
+    init_logging(log_dir=args.outdir / "logs", level="INFO")
+    log = logging.getLogger("resume")
+
+    historical = {}
+
+    for res_path in args.results:
+        cfg_name = res_path.parent.name
+        dataset_name = res_path.parent.parent.name
+        log.info("Processing %s on %s", cfg_name, dataset_name)
+
+        rows = read_jsonl(res_path)
+        pert_path = res_path.with_name(res_path.stem.replace("unlabeled", "pert") + args.perturbed_suffix)
+        pert_rows = read_jsonl(pert_path) if pert_path.exists() else None
+
+        run_dir = args.outdir / dataset_name / cfg_name
+        run_dir.mkdir(parents=True, exist_ok=True)
+
+        save_yaml(run_dir / "aggregates.yaml", agg_mean(rows))
+        save_yaml(run_dir / "rq1.yaml", rq1_correlation(rows))
+        save_yaml(run_dir / "rq2.yaml", rq2_faithfulness(rows))
+        save_yaml(run_dir / "rq3.yaml", rq3_error_propagation(rows))
+        if pert_rows:
+            save_yaml(run_dir / "rq4.yaml", rq4_robustness(rows, pert_rows))
+        if args.plots:
+            scatter_mrr_vs_correct(rows, run_dir / "mrr_vs_correct.png")
+
+        historical[cfg_name] = rows
+
+    # Pairwise Wilcoxon + Holm correction
+    if len(historical) > 1:
+        names = list(historical)
+        pairs = {}
+        for a, b in itertools.combinations(names, 2):
+            x = [r["metrics"]["rag_score"] for r in historical[a]]
+            y = [r["metrics"]["rag_score"] for r in historical[b]]
+            _, p = wilcoxon_signed_rank(x, y)
+            pairs[f"{a}~{b}"] = p
+        dataset_name = args.results[0].parent.parent.name
+        save_yaml(args.outdir / dataset_name / "wilcoxon_rag_raw.yaml", pairs)
+        save_yaml(args.outdir / dataset_name / "wilcoxon_rag_holm.yaml", holm_bonferroni(pairs))
+        log.info("Pairwise significance testing complete (rag_score).")
+
+
+if __name__ == "__main__":
+    main()
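Based on the fields the script reads, an annotated line in a `--results` file needs a `metrics` dict plus the human labels, and optionally the RQ3 error flags; the metric names below are illustrative (RQ1 looks for `mrr`/`map`/`precision@10`, RQ2 for keys starting with `faith`/`qags`/`fact`/`ragas`, and the pairwise Wilcoxon needs `rag_score`):

```python
import json

# One hypothetical annotated row, as analysis.py expects it.
row = {
    "question": "…",
    "metrics": {
        "mrr": 1.0, "map": 0.8, "precision@10": 0.3,
        "ragas_faithfulness": 0.9, "rag_score": 0.85,
    },
    "human_correct": True,      # RQ1
    "human_faithful": 0.9,      # RQ2
    "retrieval_error": False,   # RQ3
    "hallucination": False,     # RQ3
}
print(json.dumps(row))  # one such line per query in results.jsonl
```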
scripts/dashboard.py CHANGED
@@ -1,12 +1,8 @@
-#!/usr/bin/env python
 """
-dashboard.py
-============
-
 Launch with:
     streamlit run scripts/dashboard.py

-Relies on the directory structure produced by run_grid_experiments.py:
+Relies on the directory structure produced by analysis.py:
     outputs/grid/<dataset>/<config>/{aggregates.yaml, rq1.yaml, ...}
 """
 from __future__ import annotations
@@ -19,8 +15,8 @@ import pandas as pd
 import streamlit as st
 import matplotlib.pyplot as plt

-BASE_DIR = Path("outputs/grid")   # change if you store runs elsewhere
-METRIC_KEY = "rag_score"          # bar/box plots focus on this
+BASE_DIR = Path("outputs/grid")
+METRIC_KEY = "rag_score"

 # --------------------------------------------------------------------- Sidebar
 st.sidebar.title("RAG-Eval Dashboard")
scripts/prep_annotations.py ADDED
@@ -0,0 +1,86 @@
+"""
+Runs the RAG pipeline over dataset(s) and saves partial results
+for manual annotation.
+"""
+
+import argparse
+import json
+from pathlib import Path
+from typing import Any, Dict
+
+from evaluation import (
+    PipelineConfig,
+    RetrieverConfig,
+    GeneratorConfig,
+    CrossEncoderConfig,
+    StatsConfig,
+    LoggingConfig,
+    RAGPipeline,
+)
+from evaluation.utils.logger import init_logging
+
+import yaml
+
+
+def merge_dataclass(dc_cls, override: Dict[str, Any]):
+    from dataclasses import asdict
+    base = asdict(dc_cls())
+    base.update({k: v for k, v in override.items() if v is not None})
+    return dc_cls(**base)
+
+
+def load_pipeline_config(yaml_path: Path) -> PipelineConfig:
+    data = yaml.safe_load(yaml_path.read_text())
+    return PipelineConfig(
+        retriever=merge_dataclass(RetrieverConfig, data.get("retriever", {})),
+        generator=merge_dataclass(GeneratorConfig, data.get("generator", {})),
+        reranker=merge_dataclass(CrossEncoderConfig, data.get("reranker", {})),
+        stats=merge_dataclass(StatsConfig, data.get("stats", {})),
+        logging=merge_dataclass(LoggingConfig, data.get("logging", {})),
+    )
+
+
+def read_jsonl(path: Path) -> list[dict]:
+    with path.open() as f:
+        return [json.loads(line) for line in f]
+
+
+def write_jsonl(path: Path, rows: list[dict]) -> None:
+    path.parent.mkdir(parents=True, exist_ok=True)
+    with path.open("w") as f:
+        for row in rows:
+            f.write(json.dumps(row) + "\n")
+
+
+def main(argv=None):
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--config", type=Path, required=True)
+    ap.add_argument("--datasets", nargs="+", type=Path, required=True)
+    ap.add_argument("--outdir", type=Path, default=Path("outputs/for_annotation"))
+    args = ap.parse_args(argv)
+
+    init_logging(log_dir=args.outdir / "logs")
+    cfg = load_pipeline_config(args.config)
+    pipe = RAGPipeline(cfg)
+
+    for dataset in args.datasets:
+        queries = read_jsonl(dataset)
+        output_dir = args.outdir / dataset.stem / args.config.stem
+        output_path = output_dir / "unlabeled_results.jsonl"
+
+        if output_path.exists():
+            print(f"Skipping {dataset.name} – already exists.")
+            continue
+
+        rows = []
+        for q in queries:
+            result = pipe.run(q["question"])
+            entry = {
+                "question": q["question"],
+                # pipeline.run() returns "raw_retrieval" and "answer"
+                "retrieved_docs": result.get("raw_retrieval", []),
+                "generated_answer": result.get("answer", ""),
+                "metrics": result.get("metrics", {}),
+                # Human annotators will add these
+                "human_correct": None,
+                "human_faithful": None,
+            }
+            rows.append(entry)
+
+        write_jsonl(output_path, rows)
+        print(f"Wrote {len(rows)} results to {output_path}")
+
+
+if __name__ == "__main__":
+    main()
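The resulting annotation round-trip, sketched in Python (`main()` takes an argv list, so it can be driven without a shell; the config and dataset names are examples, and paths follow the defaults in the code):

```python
from scripts.prep_annotations import main  # assumes repo root on sys.path

main([
    "--config", "configs/kilt_hybrid_ce.yaml",
    "--datasets", "data/legal.jsonl",
])
# -> outputs/for_annotation/legal/kilt_hybrid_ce/unlabeled_results.jsonl
# Annotators fill in "human_correct" / "human_faithful" on each line;
# the labeled file is then passed to scripts/analysis.py via --results.
```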
scripts/run_experiments.py DELETED
@@ -1,251 +0,0 @@
-#!/usr/bin/env python
-"""
-run_experiments.py
-==================
-
-High-level driver that wires together:
-
-1. YAML / CLI → `PipelineConfig` + `LoggingConfig`
-2. Initialises dual-sink logging (console + rotating file)
-3. Builds a `RAGPipeline`
-4. Streams a list of questions through the pipeline
-5. Logs progress, writes per-query JSONL results, and
-   (optionally) prints aggregate statistics.
-
-You can keep it minimal – or expand the marked TODO sections to:
-    * compute metrics immediately
-    * push results to a tracker (W&B, MLflow, etc.)
-    * spawn multiple configs in parallel.
-"""
-from __future__ import annotations
-
-import argparse
-import json
-import sys
-from pathlib import Path
-from typing import Any, Dict, Iterable, List, Mapping
-
-import yaml
-
-from evaluation import (
-    PipelineConfig,
-    RetrieverConfig,
-    GeneratorConfig,
-    CrossEncoderConfig,
-    StatsConfig,
-    LoggingConfig,
-    RAGPipeline,
-)
-from evaluation.utils.logger import init_logging
-
-from evaluation.stats import (
-    corr_ci,
-    wilcoxon_signed_rank,
-    holm_bonferroni,
-)
-
-import matplotlib.pyplot as plt
-
-# ──────────────────────────────────────────────────────────────────────────────
-# Helpers
-# ──────────────────────────────────────────────────────────────────────────────
-
-
-def _merge_dataclass(dc_cls, default, override: Mapping[str, Any]):
-    """Return a new *dc_cls* where fields from *override* overwrite *default*."""
-    from dataclasses import asdict
-
-    merged = asdict(default)
-    merged.update({k: v for k, v in override.items() if v is not None})
-    return dc_cls(**merged)
-
-
-def _load_pipeline_config(yaml_path: Path | None) -> PipelineConfig:
-    """Parse YAML into nested dataclasses; fall back to defaults."""
-    if yaml_path is None:
-        return PipelineConfig()  # all defaults
-
-    data = yaml.safe_load(yaml_path.read_text())
-
-    retr_cfg = _merge_dataclass(
-        RetrieverConfig(), RetrieverConfig(), data.get("retriever", {})
-    )
-    gen_cfg = _merge_dataclass(
-        GeneratorConfig(), GeneratorConfig(), data.get("generator", {})
-    )
-    rr_cfg = _merge_dataclass(
-        CrossEncoderConfig(), CrossEncoderConfig(), data.get("reranker", {})
-    )
-    stats_cfg = _merge_dataclass(StatsConfig(), StatsConfig(), data.get("stats", {}))
-    log_cfg = _merge_dataclass(LoggingConfig(), LoggingConfig(), data.get("logging", {}))
-
-    return PipelineConfig(
-        retriever=retr_cfg,
-        generator=gen_cfg,
-        reranker=rr_cfg,
-        stats=stats_cfg,
-        logging=log_cfg,
-    )
-
-
-def _read_jsonl(path: Path) -> List[Dict[str, Any]]:
-    with path.open() as f:
-        return [json.loads(line) for line in f]
-
-
-def _write_jsonl(path: Path, rows: Iterable[Mapping[str, Any]]):
-    path.parent.mkdir(parents=True, exist_ok=True)
-    with path.open("w") as f:
-        for row in rows:
-            f.write(json.dumps(row) + "\n")
-
-# Stats Helper
-def aggregate_metrics(rows: list[dict[str, Any]]) -> dict[str, float]:
-    """Return mean of every numeric metric found under row['metrics']."""
-    import numpy as np
-    keys = rows[0]["metrics"].keys()
-    return {k: float(np.mean([r["metrics"][k] for r in rows])) for k in keys}
-
-
-def correlation_with_gold(rows: list[dict[str, Any]], cfg: StatsConfig):
-    """Spearman/Kendall correlation between retrieval scores and correctness flag."""
-    if "human_correct" not in rows[0]:
-        return None  # nothing to correlate
-    mrr = [r["metrics"].get("mrr", float("nan")) for r in rows]
-    gold = [1.0 if r["human_correct"] else 0.0 for r in rows]
-    r, (lo, hi), p = corr_ci(
-        mrr, gold, method=cfg.correlation_method, n_boot=cfg.n_boot, ci=cfg.ci
-    )
-    return dict(r=r, ci_low=lo, ci_high=hi, p=p)
-
-
-def wilcoxon_against_baseline(
-    cur: list[dict[str, Any]],
-    base: list[dict[str, Any]],
-    cfg: StatsConfig,
-):
-    """Paired Wilcoxon + Holm-Bonferroni across all metric keys."""
-    from evaluation.stats import wilcoxon_signed_rank, holm_bonferroni
-
-    assert len(cur) == len(base), "Runs must have same #queries"
-    metrics = cur[0]["metrics"].keys()
-    p_raw = {}
-    for m in metrics:
-        cur_m = [r["metrics"][m] for r in cur]
-        base_m = [r["metrics"][m] for r in base]
-        _, p = wilcoxon_signed_rank(cur_m, base_m, alternative=cfg.wilcoxon_alternative)
-        p_raw[m] = p
-    return holm_bonferroni(p_raw)
-
-# Plot helper
-def save_scatter(rows, out_dir: Path):
-    out_dir.mkdir(parents=True, exist_ok=True)
-    x = [r["metrics"]["mrr"] for r in rows if "mrr" in r["metrics"]]
-    y = [1.0 if r.get("human_correct") else 0.0 for r in rows]
-    plt.figure()
-    plt.scatter(x, y, alpha=0.6)
-    plt.xlabel("MRR")
-    plt.ylabel("Correct (1=yes)")
-    plt.title("MRR vs. Human Correctness")
-    path = out_dir / "mrr_vs_correct.png"
-    plt.savefig(path, bbox_inches="tight")
-    plt.close()
-    return path
-
-# ──────────────────────────────────────────────────────────────────────────────
-# Main
-# ──────────────────────────────────────────────────────────────────────────────
-def main(argv: list[str] | None = None) -> None:
-    ap = argparse.ArgumentParser(description="Run RAG evaluation experiments.")
-    ap.add_argument("--config", type=Path, help="YAML config with pipeline settings")
-    ap.add_argument(
-        "--queries",
-        type=Path,
-        required=True,
-        help="JSONL file – each line must contain at least {'question': ...}",
-    )
-    ap.add_argument(
-        "--output",
-        type=Path,
-        default=Path("outputs/results.jsonl"),
-        help="Where to write JSONL results",
-    )
-    ap.add_argument("--dry-run", action="store_true", help="Do not execute pipeline")
-    ap.add_argument(
-        "--baseline",
-        type=Path,
-        help="Optional: JSONL with baseline run for significance tests",
-    )
-    ap.add_argument(
-        "--plots",
-        action="store_true",
-        help="Save diagnostic plots (PNG) alongside results",
-    )
-    args = ap.parse_args(argv)
-
-    # 1. Parse configuration
-    cfg = _load_pipeline_config(args.config)
-
-    # 2. Initialise logging (file + stderr)
-    init_logging(
-        log_dir=cfg.logging.log_dir,
-        level=cfg.logging.level,
-        max_mb=cfg.logging.max_mb,
-        backups=cfg.logging.backups,
-    )
-
-    import logging
-
-    logger = logging.getLogger(__name__)
-    logger.info("Loaded PipelineConfig:\n%s", cfg)
-
-    # 3. Build pipeline (retrieval → (rerank) → generation)
-    pipeline = RAGPipeline(cfg)
-
-    # 4. Load queries
-    rows = _read_jsonl(args.queries)
-    logger.info("Loaded %d queries from %s", len(rows), args.queries)
-
-    if args.dry_run:
-        logger.warning("Dry-run flag active – exiting before execution.")
-        sys.exit(0)
-
-    # 5. Execute pipeline
-    results: List[Dict[str, Any]] = []
-    for i, row in enumerate(rows, 1):
-        q = row["question"]
-        logger.info("[%d/%d] Q: %s", i, len(rows), q)
-        out = pipeline.run(q)
-        merged = {**row, **out}  # keep any gold labels or metadata
-        results.append(merged)
-
-    # 6. Persist results
-    _write_jsonl(args.output, results)
-    logger.info("Wrote %d results to %s", len(results), args.output)
-
-    # 7. Aggregate statistics, significance tests, plots
-    agg = aggregate_metrics(results)
-    logger.info("Mean metrics: %s", json.dumps(agg, indent=2))
-
-    corr = correlation_with_gold(results, cfg.stats)
-    if corr:
-        logger.info(
-            "Correlation MRR↔gold %s=%.3f 95%%CI=[%.3f, %.3f] p=%.3g",
-            cfg.stats.correlation_method,
-            corr["r"],
-            corr["ci_low"],
-            corr["ci_high"],
-            corr["p"],
-        )
-
-    if args.baseline:
-        baseline_rows = _read_jsonl(args.baseline)
-        p_adj = wilcoxon_against_baseline(results, baseline_rows, cfg.stats)
-        logger.info("Wilcoxon vs baseline (Holm-Bonferroni α=%s): %s", cfg.stats.alpha, p_adj)
-
-    if args.plots:
-        plot_path = save_scatter(results, args.output.parent)
-        logger.info("Saved plot → %s", plot_path)
-
-if __name__ == "__main__":
-    main()
scripts/run_grid_experiments.py DELETED
@@ -1,239 +0,0 @@
-#!/usr/bin/env python
-"""
-run_grid_experiments.py
-=======================
-Batch driver for *config × dataset* evaluation, including:
-
-* RQ1 – Correlation of classical retrieval metrics with factual-correctness
-* RQ2 – Correlation of faithfulness metrics with expert judgements
-* RQ3 – Retrieval-error ➜ hallucination propagation (χ² + conditional rates)
-* RQ4 – Robustness under adversarial perturbations (Δ-metrics, Cohen d)
-
-Features
---------
-* Incremental mode – pass **one** new --config, it is compared to all
-  previous runs already found under --outdir/<dataset>/.
-* Saves:
-    - `results.jsonl`
-    - `aggregates.yaml`
-    - `rq1.yaml`, `rq2.yaml`, `rq3.yaml`, `rq4.yaml`
-    - pairwise Wilcoxon/ Holm tables
-    - bar-, box-, scatter-plots (if --plots flag)
-"""
-
-from __future__ import annotations
-
-import argparse
-import itertools
-import json
-import logging
-import os
-from pathlib import Path
-from typing import Any, Dict, Iterable, List, Mapping
-
-import matplotlib.pyplot as plt
-import numpy as np
-import yaml
-
-from evaluation import (
-    PipelineConfig,
-    RetrieverConfig,
-    GeneratorConfig,
-    CrossEncoderConfig,
-    StatsConfig,
-    LoggingConfig,
-    RAGPipeline,
-)
-from evaluation.stats import (
-    corr_ci,
-    wilcoxon_signed_rank,
-    holm_bonferroni,
-    conditional_failure_rate,
-    chi2_error_propagation,
-    delta_metric,
-)
-from evaluation.utils.logger import init_logging
-
-# ─────────────────────────────── I/O helpers ────────────────────────────────
-
-
-def read_jsonl(path: Path) -> List[Dict[str, Any]]:
-    with path.open() as f:
-        return [json.loads(line) for line in f]
-
-
-def write_jsonl(path: Path, rows: Iterable[Mapping[str, Any]]) -> None:
-    path.parent.mkdir(parents=True, exist_ok=True)
-    with path.open("w") as f:
-        for row in rows:
-            f.write(json.dumps(row) + "\n")
-
-
-def save_yaml(path: Path, obj: Mapping[str, Any]) -> None:
-    path.parent.mkdir(parents=True, exist_ok=True)
-    path.write_text(yaml.safe_dump(obj, sort_keys=False))
-
-
-# ─────────────────────── config merge (same as earlier) ─────────────────────
-
-
-def merge_dataclass(dc_cls, override: Mapping[str, Any]):
-    from dataclasses import asdict
-
-    base = asdict(dc_cls())
-    base.update({k: v for k, v in override.items() if v is not None})
-    return dc_cls(**base)
-
-
-def load_pipeline_config(yaml_path: Path) -> PipelineConfig:
-    data = yaml.safe_load(yaml_path.read_text())
-    return PipelineConfig(
-        retriever=merge_dataclass(RetrieverConfig, data.get("retriever", {})),
-        generator=merge_dataclass(GeneratorConfig, data.get("generator", {})),
-        reranker=merge_dataclass(CrossEncoderConfig, data.get("reranker", {})),
-        stats=merge_dataclass(StatsConfig, data.get("stats", {})),
-        logging=merge_dataclass(LoggingConfig, data.get("logging", {})),
-    )
-
-
-# ───────────────────────────── stats helpers ────────────────────────────────
-def agg_mean(rows: List[dict[str, Any]]) -> dict[str, float]:
-    keys = rows[0]["metrics"].keys()
-    return {k: float(np.mean([r["metrics"][k] for r in rows])) for k in keys}
-
-
-def rq1_correlation(rows, cfg: StatsConfig):
-    if "human_correct" not in rows[0]:
-        return {}
-    retrieval_keys = [k for k in rows[0]["metrics"] if k in {"mrr", "map", "precision@10"}]
-    gold = [1.0 if r["human_correct"] else 0.0 for r in rows]
-    out = {}
-    for k in retrieval_keys:
-        vec = [r["metrics"][k] for r in rows]
-        r, (lo, hi), p = corr_ci(vec, gold, method=cfg.correlation_method,
-                                 n_boot=cfg.n_boot, ci=cfg.ci)
-        out[k] = dict(r=r, ci=[lo, hi], p=p)
-    return out
-
-
-def rq2_faithfulness(rows, cfg: StatsConfig):
-    if "human_faithful" not in rows[0]:
-        return {}
-    faith_keys = [k for k in rows[0]["metrics"] if k.lower().startswith(("faith", "qags", "fact", "ragas"))]
-    gold = [r["human_faithful"] for r in rows]
-    out = {}
-    for k in faith_keys:
-        vec = [r["metrics"][k] for r in rows]
-        r, (lo, hi), p = corr_ci(vec, gold, method=cfg.correlation_method,
-                                 n_boot=cfg.n_boot, ci=cfg.ci)
-        out[k] = dict(r=r, ci=[lo, hi], p=p)
-    return out
-
-
-def rq3_error_propagation(rows):
-    if "retrieval_error" not in rows[0] or "hallucination" not in rows[0]:
-        return {}
-    ret_err = [r["retrieval_error"] for r in rows]
-    halluc = [r["hallucination"] for r in rows]
-    cond = conditional_failure_rate(ret_err, halluc)
-    chi2 = chi2_error_propagation(ret_err, halluc)
-    return {"conditional": cond, "chi2": chi2}
-
-
-def rq4_robustness(orig_rows, pert_rows):
-    if pert_rows is None:
-        return {}
-    metrics = orig_rows[0]["metrics"].keys()
-    out = {}
-    for m in metrics:
-        d, eff = delta_metric(
-            [r["metrics"][m] for r in orig_rows],
-            [r["metrics"][m] for r in pert_rows],
-        )
-        out[m] = dict(delta=d, cohen_d=eff)
-    return out
-
-
-# ─────────────────────────── plotting helpers ───────────────────────────────
-def scatter_mrr_vs_correct(rows, path: Path):
-    x = [r["metrics"].get("mrr", np.nan) for r in rows]
-    y = [1 if r.get("human_correct") else 0 for r in rows]
-    plt.figure()
-    plt.scatter(x, y, alpha=0.5)
-    plt.xlabel("MRR"); plt.ylabel("Correct (1)")
-    plt.title("MRR vs. Human Correctness")
-    plt.tight_layout(); plt.savefig(path); plt.close()
-
-
-# ────────────────────────────────── main ────────────────────────────────────
-def main(argv: list[str] | None = None) -> None:
-    ap = argparse.ArgumentParser()
-    ap.add_argument("--configs", nargs="+", type=Path, required=True,
-                    help="One or more YAML configs; if one, compared against prior runs.")
-    ap.add_argument("--datasets", nargs="+", type=Path, required=True)
-    ap.add_argument("--outdir", type=Path, default=Path("outputs/grid"))
-    ap.add_argument("--plots", action="store_true")
-    ap.add_argument("--perturbed-suffix", default="_pert",
-                    help="If dataset perturbed version exists (name+suffix.jsonl) it's used for RQ4.")
-    args = ap.parse_args(argv)
-
-    init_logging(log_dir=args.outdir / "logs", level="INFO")
-    log = logging.getLogger("grid")
-
-    for dataset in args.datasets:
-        log.info("Dataset: %s", dataset.name)
-        queries = read_jsonl(dataset)
-        pert_path = dataset.with_stem(dataset.stem + args.perturbed_suffix)
-        pert_rows = read_jsonl(pert_path) if pert_path.exists() else None
-
-        # discover historical configs to compare against if incremental mode
-        hist_dirs = (args.outdir / dataset.stem).glob("*") if len(args.configs) == 1 else []
-        historical = {d.name: read_jsonl(d / "results.jsonl") for d in hist_dirs if d.is_dir()}
-
-        for cfg_yaml in args.configs:
-            cfg_name = cfg_yaml.stem
-            log.info("  Config: %s", cfg_name)
-            cfg = load_pipeline_config(cfg_yaml)
-            pipe = RAGPipeline(cfg)
-
-            # skip if results already exist
-            run_dir = args.outdir / dataset.stem / cfg_name
-            if (run_dir / "results.jsonl").exists():
-                log.info("    results already present – loading.")
-                rows = read_jsonl(run_dir / "results.jsonl")
-            else:
-                rows = [pipe.run(q["question"]) | q for q in queries]
-                write_jsonl(run_dir / "results.jsonl", rows)
-
-            # aggregates & RQ1–4
-            save_yaml(run_dir / "aggregates.yaml", agg_mean(rows))
-            save_yaml(run_dir / "rq1.yaml", rq1_correlation(rows, cfg.stats))
-            save_yaml(run_dir / "rq2.yaml", rq2_faithfulness(rows, cfg.stats))
-            save_yaml(run_dir / "rq3.yaml", rq3_error_propagation(rows))
-
-            if pert_rows:
-                save_yaml(run_dir / "rq4.yaml", rq4_robustness(rows, pert_rows))
-
-            if args.plots:
-                scatter_mrr_vs_correct(rows, run_dir / "mrr_vs_correct.png")
-
-            historical[cfg_name] = rows  # include current for pairwise tests
-
-        # pairwise Wilcoxon on rag_score
-        if len(historical) > 1:
-            pairs = {}
-            names = list(historical)
-            for a, b in itertools.combinations(names, 2):
-                x = [r["metrics"]["rag_score"] for r in historical[a]]
-                y = [r["metrics"]["rag_score"] for r in historical[b]]
-                _, p = wilcoxon_signed_rank(x, y)
-                pairs[f"{a}~{b}"] = p
-            save_yaml(args.outdir / dataset.stem / "wilcoxon_rag_raw.yaml", pairs)
-            save_yaml(args.outdir / dataset.stem / "wilcoxon_rag_holm.yaml",
-                      holm_bonferroni(pairs))
-
-            log.info("  Pairwise rag_score significance stored (Holm adjusted).")
-
-
-if __name__ == "__main__":
-    main()
tests/test_pipeline_end_to_end.py CHANGED
@@ -34,7 +34,6 @@ def tmp_doc_store(tmp_path_factory):


 def test_pipeline_with_dense(tmp_doc_store, monkeypatch, tmp_path):
-    # Monkey-patch HFGenerator so no actual HF download happens
     import evaluation.generators.hf_generator as hf_module

     monkeypatch.setattr(hf_module, "HFGenerator", _DummyGenerator)
@@ -46,13 +45,12 @@ def test_pipeline_with_dense(tmp_doc_store, monkeypatch, tmp_path):
         faiss_index=tmp_path / "dense.idx",
         doc_store=tmp_doc_store,
         device="cpu",
-        model_name="dummy/ignored",  # the DummyGenerator bypasses HF
+        model_name="dummy/ignored",
     ),
         generator=GeneratorConfig(model_name="dummy"),
     )
     pipeline = RAGPipeline(cfg)

-    # Should not raise, and produce no errors
     results = pipeline.run_queries([{"question": "Q?", "id": 0}])
     assert isinstance(results, list)
    assert all("answer" in r for r in results)