"""RAGAS-based evaluation for retrieval and generation quality.

Uses the legacy ``ragas.metrics`` classes (``LLMContextPrecisionWithReference``,
``LLMContextRecall``, ``Faithfulness``, ``AnswerRelevancy``,
``AnswerCorrectness``, ``FactualCorrectness``) rather than the newer
``ragas.metrics.collections`` API. The collections API only accepts
``InstructorLLM`` instances and would force us to import provider-specific
clients (openai / anthropic / ...) directly, which violates the project's
``provider.py``-only rule.

The legacy classes are wired with a ``LangchainLLMWrapper`` and a
``LangchainEmbeddingsWrapper`` at ``evaluate()`` time, so we keep using the
LangChain abstractions returned by ``src/provider.py`` everywhere.

Two metric families, two evaluation passes
------------------------------------------

For a multilingual test set (English questions querying Danish documents,
English-language answers, English+Danish reference fields), different metrics
want different reference languages:

- **Grounding metrics** (``LLMContextPrecisionWithReference``,
  ``LLMContextRecall``) compare the reference against the retrieved Danish
  chunks. They work best with a Danish reference (monolingual matching), so
  we feed them ``source_quote_da``. ``Faithfulness`` and ``AnswerRelevancy``
  ignore the reference field; they run in this pass only so that all four
  metrics share one dataset.

- **Correctness metrics** (``AnswerCorrectness``, ``FactualCorrectness``)
  compare the generated English answer against the reference directly. They
  work best with an English reference, so we feed them ``reference_en``.

``evaluate()`` builds two ``EvaluationDataset`` instances (same questions,
contexts and answers, different ``reference`` field per pass) and runs RAGAS
twice. The aggregate scores are unioned and per-sample rows are merged on
``user_input``.

Why both families
-----------------

``Faithfulness`` measures whether each claim in the answer is supported by the
retrieved chunks. It does **not** check whether those chunks (and therefore
the answer) are actually correct. An answer that confidently quotes the wrong
chunks scores 1.0; an answer that hedges with prior-knowledge inference but
matches the ground truth scores low. Adding ``AnswerCorrectness`` /
``FactualCorrectness`` (which compare directly to ``reference_en``) closes
this gap by measuring **whether the answer is right**, not just whether it
is grounded in whatever was retrieved.

Note: ``AnswerRelevancy`` is constructed with ``strictness=1``. The default of
3 issues an OpenAI-style ``n=3`` request to generate three hypothetical
questions in one API call, which Groq's API rejects with HTTP 400
``'n' : number must be at most 1``. With ``strictness=1`` the metric makes a
single call per sample, which all providers support.
"""

import logging
from typing import Any

from langchain_core.embeddings import Embeddings
from langchain_core.language_models.chat_models import BaseChatModel
from ragas import EvaluationDataset, SingleTurnSample, evaluate
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.metrics._answer_correctness import AnswerCorrectness
from ragas.metrics._answer_relevance import AnswerRelevancy
from ragas.metrics._context_precision import LLMContextPrecisionWithReference
from ragas.metrics._context_recall import LLMContextRecall
from ragas.metrics._factual_correctness import FactualCorrectness
from ragas.metrics._faithfulness import Faithfulness

logger = logging.getLogger(__name__)

# Columns produced by RAGAS dataframes that are NOT metric scores. Used to
# separate metric columns from sample fields when computing aggregates.
_NON_METRIC_COLS: frozenset[str] = frozenset(
    {
        "user_input",
        "retrieved_contexts",
        "reference",
        "response",
        "ground_truth",
        "question",
        "answer",
        "contexts",
    }
)

# Reference-key fallback chains used by ``_resolve_reference``. Each chain
# starts with the preferred key for a particular metric family, then falls
# back to whichever other key is available so legacy plain-string ground
# truths still work.
_GROUNDING_REF_CHAIN: tuple[str, ...] = ("source_quote_da", "reference_en", "reference")
_CORRECTNESS_REF_CHAIN: tuple[str, ...] = ("reference_en", "source_quote_da", "reference")


class RAGEvaluator:
    """Evaluates RAG pipeline quality using two complementary RAGAS metric families.

    The judge LLM is independent from the generation LLM. This is critical
    when generation runs on a small local model: a stronger judge gives
    substantially less noisy scores.

    Each ground-truth entry may be either a plain string or a dict. The dict
    form is used by the multilingual test set::

        {
            "reference_en":    "English reference answer (informational)",
            "source_quote_da": "Verbatim Danish quote from the source document"
        }

    See the module docstring for why two metric families and two evaluation
    passes are used.
    """

    def __init__(self, llm: BaseChatModel, embeddings: Embeddings) -> None:
        """Initialize the evaluator.

        Args:
            llm: A LangChain BaseChatModel instance to use as the RAGAS judge.
                Should be a strong model (>= ~30B params) for reliable scoring.
            embeddings: A LangChain Embeddings instance. Required because
                ``AnswerRelevancy`` and ``AnswerCorrectness`` compute cosine
                similarity between text pairs.
        """
        self._llm = LangchainLLMWrapper(llm)
        self._embeddings = LangchainEmbeddingsWrapper(embeddings)
        logger.info("RAGEvaluator initialized")

    @staticmethod
    def _resolve_reference(
        ground_truth: str | dict[str, Any],
        ref_chain: tuple[str, ...],
    ) -> str:
        """Pick the best reference string for a given metric family.

        Args:
            ground_truth: Either a plain reference string or a dict with
                ``source_quote_da`` / ``reference_en`` / ``reference`` keys.
            ref_chain: Ordered tuple of dict keys to try, most preferred first.

        Returns:
            The reference string to feed into RAGAS.
        """
        if isinstance(ground_truth, str):
            return ground_truth
        if isinstance(ground_truth, dict):
            for key in ref_chain:
                value = ground_truth.get(key)
                if isinstance(value, str) and value.strip():
                    return value
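        # No usable key (or an unexpected type): fall back to ``str()`` but log
        # it, since a dict repr makes a poor reference for the judge.
        logger.warning(
            "No usable reference for keys %s; falling back to str(): %r",
            ref_chain,
            ground_truth,
        )
        return str(ground_truth)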

    def _build_dataset(
        self,
        questions: list[str],
        contexts: list[list[str]],
        ground_truths: list[str | dict[str, Any]],
        answers: list[str] | None = None,
        *,
        ref_chain: tuple[str, ...] = _GROUNDING_REF_CHAIN,
    ) -> EvaluationDataset:
        """Build a RAGAS EvaluationDataset from raw inputs.

        Args:
            questions: List of input questions.
            contexts: Retrieved context lists per question.
            ground_truths: Reference answers (str or dict).
            answers: Optional list of generated answers.
            ref_chain: Reference-key fallback chain to use when resolving
                each ground-truth entry.

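        Returns:
            EvaluationDataset ready for evaluation.

        Raises:
            ValueError: If the input lists differ in length.
        """
        if answers is not None and len(answers) != len(questions):
            raise ValueError(f"Got {len(answers)} answers for {len(questions)} questions")
        samples: list[SingleTurnSample] = []
        # strict=True raises ValueError on a length mismatch instead of silently
        # dropping trailing entries or failing later with an IndexError.
        rows = zip(questions, contexts, ground_truths, strict=True)
        for i, (question, retrieved, ground_truth) in enumerate(rows):
            sample_kwargs: dict[str, Any] = {
                "user_input": question,
                "retrieved_contexts": retrieved,
                "reference": self._resolve_reference(ground_truth, ref_chain),
            }
            if answers is not None:
                sample_kwargs["response"] = answers[i]
            samples.append(SingleTurnSample(**sample_kwargs))
        return EvaluationDataset(samples=samples)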

    @staticmethod
    def _result_to_dicts(
        result: Any,
    ) -> tuple[dict[str, float], list[dict[str, Any]]]:
        """Convert a RAGAS EvaluationResult into aggregate scores + per-sample rows.

        Uses the public ``to_pandas()`` API instead of the private
        ``_repr_dict`` so the code does not break across RAGAS minor versions.

        Args:
            result: RAGAS EvaluationResult instance.

        Returns:
            Tuple of (aggregate metric → mean score, list of per-sample row dicts).
        """
        df = result.to_pandas()
        metric_cols = [c for c in df.columns if c not in _NON_METRIC_COLS]
        aggregate: dict[str, float] = {}
        for col in metric_cols:
            if df[col].dtype.kind in "fi":
                aggregate[col] = float(df[col].mean())
        per_sample: list[dict[str, Any]] = df.to_dict(orient="records")
        return aggregate, per_sample

    @staticmethod
    def _merge_per_sample(
        rows_a: list[dict[str, Any]],
        rows_b: list[dict[str, Any]],
    ) -> list[dict[str, Any]]:
        """Merge two per-sample row lists by ``user_input``.

        Both lists are produced by ``_result_to_dicts`` and contain the same
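        questions but different metric columns. The reference field differs
        between the two passes (Danish vs English), so we keep the Danish one
        (from ``rows_a``) as the canonical ``reference`` and add a
        ``reference_en`` column from ``rows_b`` for transparency.

        Rows are keyed on the question text, so questions are assumed to be
        unique within a test set. If a question appears more than once, the
        last ``rows_b`` row with that text is used for every match.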

        Args:
            rows_a: Rows from the grounding pass (reference = Danish quote).
            rows_b: Rows from the correctness pass (reference = English answer).

        Returns:
            Merged list of row dicts with all metric columns from both passes.
        """
        rows_b_by_q: dict[str, dict[str, Any]] = {row["user_input"]: row for row in rows_b}
        merged: list[dict[str, Any]] = []
        for row_a in rows_a:
            row_b = rows_b_by_q.get(row_a["user_input"], {})
            merged_row: dict[str, Any] = dict(row_a)
            # Preserve the English reference for transparency.
            if "reference" in row_b:
                merged_row["reference_en"] = row_b["reference"]
            # Add metric columns from pass B that are not already present.
            for key, value in row_b.items():
                if key in _NON_METRIC_COLS:
                    continue
                if key not in merged_row:
                    merged_row[key] = value
            merged.append(merged_row)
        return merged

    def evaluate(
        self,
        questions: list[str],
        answers: list[str],
        contexts: list[list[str]],
        ground_truths: list[str | dict[str, Any]],
    ) -> dict[str, Any]:
        """Run full RAGAS evaluation across grounding + correctness metric families.

        Performs two passes against the same questions / answers / contexts
        but with different reference languages (see module docstring).

        Args:
            questions: Input questions.
            answers: Generated answers from the RAG pipeline.
            contexts: Retrieved context lists per question.
            ground_truths: Reference answers (str or dict per
                ``_resolve_reference``).

        Returns:
            Dict with two keys:
                ``aggregate``:  metric name → mean score across all samples
                ``per_sample``: list of per-sample dicts (one row per question)
        """
        n = len(questions)
        logger.info("Running full evaluation on %d samples (two passes)", n)

        # ---- Pass 1: grounding metrics with Danish reference ----
        dataset_da = self._build_dataset(
            questions,
            contexts,
            ground_truths,
            answers,
            ref_chain=_GROUNDING_REF_CHAIN,
        )
        grounding_metrics = [
            Faithfulness(),
            AnswerRelevancy(strictness=1),
            LLMContextPrecisionWithReference(),
            LLMContextRecall(),
        ]
        logger.info("Pass 1/2: grounding metrics (reference = Danish quote)")
        result_a = evaluate(
            dataset=dataset_da,
            metrics=grounding_metrics,
            llm=self._llm,
            embeddings=self._embeddings,
            show_progress=False,
        )
        agg_a, samples_a = self._result_to_dicts(result_a)
        logger.info("Pass 1/2 aggregate: %s", agg_a)

        # ---- Pass 2: correctness metrics with English reference ----
        dataset_en = self._build_dataset(
            questions,
            contexts,
            ground_truths,
            answers,
            ref_chain=_CORRECTNESS_REF_CHAIN,
        )
        correctness_metrics = [
            AnswerCorrectness(),
            FactualCorrectness(),
        ]
        logger.info("Pass 2/2: correctness metrics (reference = English answer)")
        result_b = evaluate(
            dataset=dataset_en,
            metrics=correctness_metrics,
            llm=self._llm,
            embeddings=self._embeddings,
            show_progress=False,
        )
        agg_b, samples_b = self._result_to_dicts(result_b)
        logger.info("Pass 2/2 aggregate: %s", agg_b)

        # ---- Merge ----
        aggregate: dict[str, float] = {**agg_a, **agg_b}
        per_sample = self._merge_per_sample(samples_a, samples_b)
        logger.info("Combined aggregate: %s", aggregate)
        return {"aggregate": aggregate, "per_sample": per_sample}

    def evaluate_retrieval(
        self,
        questions: list[str],
        contexts: list[list[str]],
        ground_truths: list[str | dict[str, Any]],
    ) -> dict[str, Any]:
        """Evaluate retrieval quality only (ContextPrecision + ContextRecall).

        Single-pass against the Danish reference, no correctness metrics.

        Args:
            questions: Input questions.
            contexts: Retrieved context lists per question.
            ground_truths: Reference answers (str or dict).

        Returns:
            Dict with ``aggregate`` and ``per_sample`` keys, same shape as
            ``evaluate()`` but only retrieval metrics.
        """
        logger.info("Running retrieval evaluation on %d samples", len(questions))
        dataset = self._build_dataset(
            questions,
            contexts,
            ground_truths,
            ref_chain=_GROUNDING_REF_CHAIN,
        )
        metrics = [
            LLMContextPrecisionWithReference(),
            LLMContextRecall(),
        ]
        result = evaluate(
            dataset=dataset,
            metrics=metrics,
            llm=self._llm,
            embeddings=self._embeddings,
            show_progress=False,
        )
        aggregate, per_sample = self._result_to_dicts(result)
        logger.info("Retrieval evaluation aggregate: %s", aggregate)
        return {"aggregate": aggregate, "per_sample": per_sample}
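

# ---------------------------------------------------------------------------
# Illustrative sketch (hypothetical, not called by ``RAGEvaluator``)
# ---------------------------------------------------------------------------
# The two-pass flow above may be easier to follow on toy data. The helpers
# below are a minimal, dependency-free sketch of the same two steps: picking a
# reference through a fallback chain, then joining the rows of both passes on
# ``user_input``. Every ``_sketch_*`` name, the sample texts and the scores are
# assumptions made up for illustration; no RAGAS call is involved.


def _sketch_pick_reference(ground_truth, chain):
    """Return the first non-blank string stored under a key in ``chain``."""
    if isinstance(ground_truth, str):
        return ground_truth
    for key in chain:
        value = ground_truth.get(key)
        if isinstance(value, str) and value.strip():
            return value
    return str(ground_truth)


def _sketch_merge(rows_a, rows_b, sample_fields=("user_input", "reference")):
    """Copy pass-B metric columns onto pass-A rows with the same question."""
    by_question = {row["user_input"]: row for row in rows_b}
    merged = []
    for row_a in rows_a:
        row_b = by_question.get(row_a["user_input"], {})
        out = dict(row_a)
        if "reference" in row_b:
            # Keep the pass-B (English) reference under its own column.
            out["reference_en"] = row_b["reference"]
        for key, value in row_b.items():
            if key not in sample_fields and key not in out:
                out[key] = value
        merged.append(out)
    return merged


def _sketch_two_pass_rows():
    """Build one toy row per pass and return the merged result."""
    question = "What is the notice period?"
    ground_truth = {
        "reference_en": "The notice period is three months.",
        "source_quote_da": "Opsigelsesvarslet er tre måneder.",
    }
    grounding_chain = ("source_quote_da", "reference_en", "reference")
    correctness_chain = ("reference_en", "source_quote_da", "reference")
    pass_a = [
        {
            "user_input": question,
            "reference": _sketch_pick_reference(ground_truth, grounding_chain),
            "faithfulness": 1.0,  # made-up score
        }
    ]
    pass_b = [
        {
            "user_input": question,
            "reference": _sketch_pick_reference(ground_truth, correctness_chain),
            "answer_correctness": 0.5,  # made-up score
        }
    ]
    return _sketch_merge(pass_a, pass_b)


# ``_sketch_two_pass_rows()`` yields a single row whose ``reference`` is the
# Danish quote, whose ``reference_en`` is the English answer, and which carries
# both the ``faithfulness`` and the ``answer_correctness`` column.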