Spaces:
Running
Running
| """RAGAS-based evaluation for retrieval and generation quality. | |
| Uses the legacy ``ragas.metrics`` classes (``LLMContextPrecisionWithReference``, | |
| ``LLMContextRecall``, ``Faithfulness``, ``AnswerRelevancy``, | |
| ``AnswerCorrectness``, ``FactualCorrectness``) rather than the newer | |
| ``ragas.metrics.collections`` API. The collections API only accepts | |
| ``InstructorLLM`` instances and would force us to import provider-specific | |
| clients (openai / anthropic / ...) directly, which violates the project's | |
| ``provider.py``-only rule. | |
| The legacy classes are wired with a ``LangchainLLMWrapper`` and a | |
| ``LangchainEmbeddingsWrapper`` at ``evaluate()`` time, so we keep using the | |
| LangChain abstractions returned by ``src/provider.py`` everywhere. | |
| Two metric families, two evaluation passes | |
| ------------------------------------------ | |
| For a multilingual test set (English questions querying Danish documents, | |
| English-language answers, English+Danish reference fields), different metrics | |
| want different reference languages: | |
| - **Grounding metrics** (``ContextPrecision``, ``ContextRecall``) compare the | |
| reference against retrieved Danish chunks. They work best with a Danish | |
| reference (mono-lingual matching), so we feed them ``source_quote_da``. | |
| ``Faithfulness`` and ``AnswerRelevancy`` ignore the reference field but ride | |
| along in this pass to share the dataset. | |
| - **Correctness metrics** (``AnswerCorrectness``, ``FactualCorrectness``) | |
| compare the generated English answer against the reference directly. They | |
| work best with an English reference, so we feed them ``reference_en``. | |
| ``evaluate()`` builds two ``EvaluationDataset`` instances (same questions, | |
| contexts and answers, different ``reference`` field per pass) and runs RAGAS | |
| twice. The aggregate scores are unioned and per-sample rows are merged on | |
| ``user_input``. | |
| Why both families | |
| ----------------- | |
| ``Faithfulness`` measures whether each claim in the answer is supported by the | |
| retrieved chunks. It does **not** check whether those chunks (and therefore | |
| the answer) are actually correct. An answer that confidently quotes the wrong | |
| chunks scores 1.0; an answer that hedges with prior-knowledge inference but | |
| matches the ground truth scores low. Adding ``AnswerCorrectness`` / | |
| ``FactualCorrectness`` (which compare directly to ``reference_en``) closes | |
| this gap by measuring **whether the answer is right**, not just whether it | |
| is grounded in whatever was retrieved. | |
| Note: ``AnswerRelevancy`` is constructed with ``strictness=1``. The default of | |
| 3 issues an OpenAI-style ``n=3`` request to generate three hypothetical | |
| questions in one API call, which Groq's API rejects with HTTP 400 | |
| ``'n' : number must be at most 1``. With ``strictness=1`` the metric makes a | |
| single call per sample, which all providers support. | |
| """ | |
| import logging | |
| from typing import Any | |
| from langchain_core.embeddings import Embeddings | |
| from langchain_core.language_models.chat_models import BaseChatModel | |
| from ragas import EvaluationDataset, SingleTurnSample, evaluate | |
| from ragas.embeddings import LangchainEmbeddingsWrapper | |
| from ragas.llms import LangchainLLMWrapper | |
| from ragas.metrics._answer_correctness import AnswerCorrectness | |
| from ragas.metrics._answer_relevance import AnswerRelevancy | |
| from ragas.metrics._context_precision import LLMContextPrecisionWithReference | |
| from ragas.metrics._context_recall import LLMContextRecall | |
| from ragas.metrics._factual_correctness import FactualCorrectness | |
| from ragas.metrics._faithfulness import Faithfulness | |
| logger = logging.getLogger(__name__) | |
| # Columns produced by RAGAS dataframes that are NOT metric scores. Used to | |
| # separate metric columns from sample fields when computing aggregates. | |
| _NON_METRIC_COLS: frozenset[str] = frozenset( | |
| { | |
| "user_input", | |
| "retrieved_contexts", | |
| "reference", | |
| "response", | |
| "ground_truth", | |
| "question", | |
| "answer", | |
| "contexts", | |
| } | |
| ) | |
| # Reference-key fallback chains used by ``_resolve_reference``. Each chain | |
| # starts with the preferred key for a particular metric family, then falls | |
| # back to whichever other key is available so legacy plain-string ground | |
| # truths still work. | |
| _GROUNDING_REF_CHAIN: tuple[str, ...] = ("source_quote_da", "reference_en", "reference") | |
| _CORRECTNESS_REF_CHAIN: tuple[str, ...] = ("reference_en", "source_quote_da", "reference") | |
| class RAGEvaluator: | |
| """Evaluates RAG pipeline quality using two complementary RAGAS metric families. | |
| The judge LLM is independent from the generation LLM. This is critical | |
| when generation runs on a small local model: a stronger judge gives | |
| substantially less noisy scores. | |
| Each ground-truth entry may be either a plain string or a dict. The dict | |
| form is used by the multilingual test set:: | |
| { | |
| "reference_en": "English reference answer (informational)", | |
| "source_quote_da": "Verbatim Danish quote from the source document" | |
| } | |
| See the module docstring for why two metric families and two evaluation | |
| passes are used. | |
| """ | |
| def __init__(self, llm: BaseChatModel, embeddings: Embeddings) -> None: | |
| """Initialize the evaluator. | |
| Args: | |
| llm: A LangChain BaseChatModel instance to use as the RAGAS judge. | |
| Should be a strong model (>= ~30B params) for reliable scoring. | |
| embeddings: A LangChain Embeddings instance. Required because | |
| ``AnswerRelevancy`` and ``AnswerCorrectness`` compute cosine | |
| similarity between text pairs. | |
| """ | |
| self._llm = LangchainLLMWrapper(llm) | |
| self._embeddings = LangchainEmbeddingsWrapper(embeddings) | |
| logger.info("RAGEvaluator initialized") | |
| def _resolve_reference( | |
| ground_truth: str | dict[str, Any], | |
| ref_chain: tuple[str, ...], | |
| ) -> str: | |
| """Pick the best reference string for a given metric family. | |
| Args: | |
| ground_truth: Either a plain reference string or a dict with | |
| ``source_quote_da`` / ``reference_en`` / ``reference`` keys. | |
| ref_chain: Ordered tuple of dict keys to try, most preferred first. | |
| Returns: | |
| The reference string to feed into RAGAS. | |
| """ | |
| if isinstance(ground_truth, str): | |
| return ground_truth | |
| if isinstance(ground_truth, dict): | |
| for key in ref_chain: | |
| value = ground_truth.get(key) | |
| if isinstance(value, str) and value.strip(): | |
| return value | |
| return str(ground_truth) | |
| def _build_dataset( | |
| self, | |
| questions: list[str], | |
| contexts: list[list[str]], | |
| ground_truths: list[str | dict[str, Any]], | |
| answers: list[str] | None = None, | |
| *, | |
| ref_chain: tuple[str, ...] = _GROUNDING_REF_CHAIN, | |
| ) -> EvaluationDataset: | |
| """Build a RAGAS EvaluationDataset from raw inputs. | |
| Args: | |
| questions: List of input questions. | |
| contexts: Retrieved context lists per question. | |
| ground_truths: Reference answers (str or dict). | |
| answers: Optional list of generated answers. | |
| ref_chain: Reference-key fallback chain to use when resolving | |
| each ground-truth entry. | |
| Returns: | |
| EvaluationDataset ready for evaluation. | |
| """ | |
| samples: list[SingleTurnSample] = [] | |
| for i, question in enumerate(questions): | |
| sample_kwargs: dict[str, Any] = { | |
| "user_input": question, | |
| "retrieved_contexts": contexts[i], | |
| "reference": self._resolve_reference(ground_truths[i], ref_chain), | |
| } | |
| if answers is not None: | |
| sample_kwargs["response"] = answers[i] | |
| samples.append(SingleTurnSample(**sample_kwargs)) | |
| return EvaluationDataset(samples=samples) | |
| def _result_to_dicts( | |
| result: Any, | |
| ) -> tuple[dict[str, float], list[dict[str, Any]]]: | |
| """Convert a RAGAS EvaluationResult into aggregate scores + per-sample rows. | |
| Uses the public ``to_pandas()`` API instead of the private | |
| ``_repr_dict`` so the code does not break across RAGAS minor versions. | |
| Args: | |
| result: RAGAS EvaluationResult instance. | |
| Returns: | |
| Tuple of (aggregate metric → mean score, list of per-sample row dicts). | |
| """ | |
| df = result.to_pandas() | |
| metric_cols = [c for c in df.columns if c not in _NON_METRIC_COLS] | |
| aggregate: dict[str, float] = {} | |
| for col in metric_cols: | |
| if df[col].dtype.kind in "fi": | |
| aggregate[col] = float(df[col].mean()) | |
| per_sample: list[dict[str, Any]] = df.to_dict(orient="records") | |
| return aggregate, per_sample | |
| def _merge_per_sample( | |
| rows_a: list[dict[str, Any]], | |
| rows_b: list[dict[str, Any]], | |
| ) -> list[dict[str, Any]]: | |
| """Merge two per-sample row lists by ``user_input``. | |
| Both lists are produced by ``_result_to_dicts`` and contain the same | |
| questions but different metric columns. The reference field will | |
| differ between the two passes (Danish vs English), so we keep the | |
| Danish one (from rows_a) as the canonical ``reference`` and add an | |
| ``reference_en`` column from rows_b for transparency. | |
| Args: | |
| rows_a: Rows from the grounding pass (reference = Danish quote). | |
| rows_b: Rows from the correctness pass (reference = English answer). | |
| Returns: | |
| Merged list of row dicts with all metric columns from both passes. | |
| """ | |
| rows_b_by_q: dict[str, dict[str, Any]] = {row["user_input"]: row for row in rows_b} | |
| merged: list[dict[str, Any]] = [] | |
| for row_a in rows_a: | |
| row_b = rows_b_by_q.get(row_a["user_input"], {}) | |
| merged_row: dict[str, Any] = dict(row_a) | |
| # Preserve the English reference for transparency. | |
| if "reference" in row_b: | |
| merged_row["reference_en"] = row_b["reference"] | |
| # Add metric columns from pass B that are not already present. | |
| for key, value in row_b.items(): | |
| if key in _NON_METRIC_COLS: | |
| continue | |
| if key not in merged_row: | |
| merged_row[key] = value | |
| merged.append(merged_row) | |
| return merged | |
| def evaluate( | |
| self, | |
| questions: list[str], | |
| answers: list[str], | |
| contexts: list[list[str]], | |
| ground_truths: list[str | dict[str, Any]], | |
| ) -> dict[str, Any]: | |
| """Run full RAGAS evaluation across grounding + correctness metric families. | |
| Performs two passes against the same questions / answers / contexts | |
| but with different reference languages (see module docstring). | |
| Args: | |
| questions: Input questions. | |
| answers: Generated answers from the RAG pipeline. | |
| contexts: Retrieved context lists per question. | |
| ground_truths: Reference answers (str or dict per | |
| ``_resolve_reference``). | |
| Returns: | |
| Dict with two keys: | |
| ``aggregate``: metric name → mean score across all samples | |
| ``per_sample``: list of per-sample dicts (one row per question) | |
| """ | |
| n = len(questions) | |
| logger.info("Running full evaluation on %d samples (two passes)", n) | |
| # ---- Pass 1: grounding metrics with Danish reference ---- | |
| dataset_da = self._build_dataset( | |
| questions, | |
| contexts, | |
| ground_truths, | |
| answers, | |
| ref_chain=_GROUNDING_REF_CHAIN, | |
| ) | |
| grounding_metrics = [ | |
| Faithfulness(), | |
| AnswerRelevancy(strictness=1), | |
| LLMContextPrecisionWithReference(), | |
| LLMContextRecall(), | |
| ] | |
| logger.info("Pass 1/2: grounding metrics (reference = Danish quote)") | |
| result_a = evaluate( | |
| dataset=dataset_da, | |
| metrics=grounding_metrics, | |
| llm=self._llm, | |
| embeddings=self._embeddings, | |
| show_progress=False, | |
| ) | |
| agg_a, samples_a = self._result_to_dicts(result_a) | |
| logger.info("Pass 1/2 aggregate: %s", agg_a) | |
| # ---- Pass 2: correctness metrics with English reference ---- | |
| dataset_en = self._build_dataset( | |
| questions, | |
| contexts, | |
| ground_truths, | |
| answers, | |
| ref_chain=_CORRECTNESS_REF_CHAIN, | |
| ) | |
| correctness_metrics = [ | |
| AnswerCorrectness(), | |
| FactualCorrectness(), | |
| ] | |
| logger.info("Pass 2/2: correctness metrics (reference = English answer)") | |
| result_b = evaluate( | |
| dataset=dataset_en, | |
| metrics=correctness_metrics, | |
| llm=self._llm, | |
| embeddings=self._embeddings, | |
| show_progress=False, | |
| ) | |
| agg_b, samples_b = self._result_to_dicts(result_b) | |
| logger.info("Pass 2/2 aggregate: %s", agg_b) | |
| # ---- Merge ---- | |
| aggregate: dict[str, float] = {**agg_a, **agg_b} | |
| per_sample = self._merge_per_sample(samples_a, samples_b) | |
| logger.info("Combined aggregate: %s", aggregate) | |
| return {"aggregate": aggregate, "per_sample": per_sample} | |
| def evaluate_retrieval( | |
| self, | |
| questions: list[str], | |
| contexts: list[list[str]], | |
| ground_truths: list[str | dict[str, Any]], | |
| ) -> dict[str, Any]: | |
| """Evaluate retrieval quality only (ContextPrecision + ContextRecall). | |
| Single-pass against the Danish reference, no correctness metrics. | |
| Args: | |
| questions: Input questions. | |
| contexts: Retrieved context lists per question. | |
| ground_truths: Reference answers (str or dict). | |
| Returns: | |
| Dict with ``aggregate`` and ``per_sample`` keys, same shape as | |
| ``evaluate()`` but only retrieval metrics. | |
| """ | |
| logger.info("Running retrieval evaluation on %d samples", len(questions)) | |
| dataset = self._build_dataset( | |
| questions, | |
| contexts, | |
| ground_truths, | |
| ref_chain=_GROUNDING_REF_CHAIN, | |
| ) | |
| metrics = [ | |
| LLMContextPrecisionWithReference(), | |
| LLMContextRecall(), | |
| ] | |
| result = evaluate( | |
| dataset=dataset, | |
| metrics=metrics, | |
| llm=self._llm, | |
| embeddings=self._embeddings, | |
| show_progress=False, | |
| ) | |
| aggregate, per_sample = self._result_to_dicts(result) | |
| logger.info("Retrieval evaluation aggregate: %s", aggregate) | |
| return {"aggregate": aggregate, "per_sample": per_sample} | |