# GPT Labeling Approach - Main Metrics Calculation
## Overview
The RAG evaluation system uses **GPT-based labeling** to calculate four TRACE metrics. The GPT model analyzes responses sentence-by-sentence and identifies which parts are supported by retrieved documents, enabling precise metric calculation.
---
## The Four TRACE Metrics
### 1. **Context Relevance (R)** - What's Actually Relevant?
**Definition:** Fraction of retrieved context that is relevant to answering the user's question.
**Calculation:**
```
Context Relevance = Number of relevant sentences / Total retrieved sentences
Normalized: min(1.0, count / 20)
```
**Formula:**
```
R = |Relevant Sentences| / |Total Sentences|
```
**Data Source from GPT:**
```python
gpt_labels.all_relevant_sentence_keys
# List of document sentence keys identified as relevant
# Example: ["0a", "0b", "1c", "2a"]
```
**Example:**
```
Retrieved 30 sentences total
GPT identifies 12 as relevant to question "What is machine learning?"
Context Relevance = 12/30 = 0.40 (40%)
(The implementation's normalized form would give min(1.0, 12/20) = 0.60)
```
**What It Tells You:**
- ✅ How good was the retrieval?
- ✅ Did we pull documents about the right topic?
- ✅ Are there irrelevant documents in the results?
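The two variants above (the raw fraction from the formula and the fixed-baseline normalization used by the implementation) differ in their denominator, which is worth seeing side by side. A minimal sketch using the example's numbers:

```python
relevant_count = 12   # sentences GPT marked relevant
total_retrieved = 30  # sentences in the retrieved documents

raw_relevance = relevant_count / total_retrieved  # formula above: 12/30 = 0.40
normalized = min(1.0, relevant_count / 20.0)      # implementation: 12/20 = 0.60

print(raw_relevance, normalized)
```

The two values diverge whenever the retrieved total differs from the fixed baseline of 20.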
---
### 2. **Context Utilization (T)** - How Much Was Used?
**Definition:** Fraction of relevant context that the response actually used to generate its answer.
**Calculation:**
```
Context Utilization = Number of utilized sentences / Number of relevant sentences
```
**Formula:**
```
U = |Utilized Sentences| / |Relevant Sentences|
```
**Data Source from GPT:**
```python
gpt_labels.all_utilized_sentence_keys
# List of document sentence keys actually used in response
# Example: ["0a", "0b", "1c"]
```
**Example:**
```
Context Relevance found 12 relevant sentences: ["0a", "0b", "1a", "1c", "2a", ...]
GPT identifies 8 actually used in response: ["0a", "0b", "1c", "2a", ...]
Context Utilization = 8/12 = 0.67 (67%)
```
**What It Tells You:**
- ✅ Did the LLM actually use the available information?
- ✅ Is the response limited by context availability?
- ✅ Is context being ignored/wasted?
**Problem Pattern:**
- High Relevance (0.9) + Low Utilization (0.3)
  → Retrieval is good, but the LLM isn't using it
  → Fix: Improve prompt instructions
---
### 3. **Completeness (C)** - Was It Comprehensive?
**Definition:** Fraction of relevant information that is covered by the response.
**Calculation:**
```
Completeness = (Relevant ∩ Utilized) / Relevant
             = Relevant sentences that were used / All relevant sentences
```
**Formula:**
```
C = |Relevant ∩ Utilized| / |Relevant|
```
**Data Source from GPT:**
```python
# Set intersection:
relevant_set = set(gpt_labels.all_relevant_sentence_keys)
utilized_set = set(gpt_labels.all_utilized_sentence_keys)
intersection = relevant_set & utilized_set
completeness = len(intersection) / len(relevant_set)
```
**Example:**
```
Relevant sentences: {"0a", "0b", "1a", "1c", "2a", "2b", "3a"} (7 total)
Utilized sentences: {"0a", "0b", "1c", "2a"} (4 used)
Overlap (Relevant AND Used): {"0a", "0b", "1c", "2a"} (4 in both)
Completeness = 4/7 = 0.57 (57%)
Missing: "1a", "2b", "3a" were relevant but not mentioned
```
**What It Tells You:**
- ✅ Did the response cover all important information?
- ✅ What relevant details were omitted?
- ✅ Is the response comprehensive?
**Problem Pattern:**
- Low Completeness (0.4) with High Adherence (1.0)
  → Response is accurate but limited
  → Missing important information
  → Fix: Improve retrieval coverage or summarization
---
### 4. **Adherence (A)** - Was It Grounded?
**Definition:** Whether the response is fully grounded in the retrieved context (no hallucinations).
**Calculation:**
```
Adherence = 1.0 if ALL sentences are fully supported, 0.0 otherwise
(Boolean: fully grounded or contains hallucination)
```
**Formula:**
```
A = 1.0 if all(s.fully_supported for s in sentences) else 0.0
```
**Data Source from GPT:**
```python
gpt_labels.sentence_support_information
# For each response sentence:
# {
#   "response_sentence_key": "a",
#   "fully_supported": true/false,  # ← This determines adherence
#   "supporting_sentence_keys": ["0a", "0b"],
#   "explanation": "..."
# }
fully_supported_count = sum(
    1 for s in gpt_labels.sentence_support_information
    if s.get("fully_supported", False)
)
adherence = 1.0 if fully_supported_count == total_sentences else 0.0
```
**Example 1 - Perfect Adherence:**
```
Response sentences:
a. "Machine learning is a subset of AI."
   └─ Fully supported by document 0a ✓
b. "It uses algorithms to learn from data."
   └─ Fully supported by document 1b ✓
c. "Common applications include image recognition."
   └─ Fully supported by documents 2a, 2b ✓
ALL sentences fully supported → Adherence = 1.0 (100%)
```
**Example 2 - Contains Hallucination:**
```
Response sentences:
a. "Machine learning is a subset of AI."
   └─ Fully supported by document 0a ✓
b. "It requires quantum computers."
   └─ NOT supported by any document ✗ (Hallucination!)
c. "Common applications include image recognition."
   └─ Fully supported by documents 2a, 2b ✓
ONE sentence NOT fully supported → Adherence = 0.0 (0%)
```
**What It Tells You:**
- ✅ Is the response truthful/grounded?
- ✅ Does it contain hallucinations?
- ✅ Can we trust the answer?
---
## How GPT Labeling Works
### Step 1: Sentencization
**Documents:**
```
Document 0: "Machine learning is AI. It learns from data."
Document 1: "Neural networks are models. They mimic brains."
→ Splits into sentences with keys:
0a: "Machine learning is AI."
0b: "It learns from data."
1a: "Neural networks are models."
1b: "They mimic brains."
```
**Response:**
```
"Machine learning uses neural networks. They learn patterns."
→ Splits into sentences:
a: "Machine learning uses neural networks."
b: "They learn patterns."
```
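The keying scheme above (document index + letter) can be sketched with a naive sentence splitter. This is an illustration only: `sentencize` and `key_document_sentences` are hypothetical names, and real pipelines typically use a proper sentence tokenizer rather than a regex.

```python
import re

def sentencize(text: str) -> list[str]:
    # Naive split on whitespace following sentence-ending punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def key_document_sentences(docs: list[str]) -> dict[str, str]:
    # Key scheme from the text: document index + letter, e.g. "0a", "1b".
    keyed = {}
    for doc_idx, doc in enumerate(docs):
        for sent_idx, sent in enumerate(sentencize(doc)):
            keyed[f"{doc_idx}{chr(ord('a') + sent_idx)}"] = sent
    return keyed

docs = [
    "Machine learning is AI. It learns from data.",
    "Neural networks are models. They mimic brains.",
]
keyed = key_document_sentences(docs)
print(keyed["0a"])  # Machine learning is AI.
```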
### Step 2: GPT Labeling
GPT analyzes and identifies:
1. **Relevance:** Which document sentences are relevant to the question?
2. **Utilization:** Which document sentences were actually used in the response?
3. **Support:** Is each response sentence fully/partially/not supported?
**GPT Output (JSON):**
```json
{
  "relevance_explanation": "Document discusses ML basics...",
  "all_relevant_sentence_keys": ["0a", "0b", "1a"],
  "overall_supported_explanation": "Response is grounded...",
  "overall_supported": true,
  "sentence_support_information": [
    {
      "response_sentence_key": "a",
      "explanation": "Matches document sentences...",
      "supporting_sentence_keys": ["0a", "1a"],
      "fully_supported": true
    },
    {
      "response_sentence_key": "b",
      "explanation": "Partially supported...",
      "supporting_sentence_keys": ["1b"],
      "fully_supported": false
    }
  ],
  "all_utilized_sentence_keys": ["0a", "1a", "1b"]
}
```
### Step 3: Metric Calculation
```
Relevant: ["0a", "0b", "1a"] (3 sentences)
Utilized: ["0a", "1a", "1b"] (3 sentences)
Context Relevance = min(1.0, |Relevant| / 20)
                  = 3 / 20 = 0.15   (fixed baseline of 20, per the implementation)
Context Utilization = |Utilized| / |Relevant|
                    = 3 / 3 = 1.0
Completeness = |Relevant ∩ Utilized| / |Relevant|
             = |{"0a", "1a"}| / |{"0a", "0b", "1a"}|
             = 2 / 3 = 0.67
Adherence = all sentences fully supported? → check each
          = sentence_a (true) AND sentence_b (false)
          = false → 0.0
```
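The Step 3 arithmetic can be reproduced directly from the Step 2 JSON. A minimal sketch (the JSON here is an abridged version of the example output above):

```python
import json

# Abridged GPT labeling output from Step 2, as a raw JSON string.
gpt_output = """{
  "all_relevant_sentence_keys": ["0a", "0b", "1a"],
  "all_utilized_sentence_keys": ["0a", "1a", "1b"],
  "sentence_support_information": [
    {"response_sentence_key": "a", "fully_supported": true},
    {"response_sentence_key": "b", "fully_supported": false}
  ]
}"""
labels = json.loads(gpt_output)

relevant = set(labels["all_relevant_sentence_keys"])
utilized = set(labels["all_utilized_sentence_keys"])

utilization = min(1.0, len(utilized) / len(relevant))    # 3/3 = 1.0
completeness = len(relevant & utilized) / len(relevant)  # |{0a,1a}|/3 ≈ 0.67
adherence = 1.0 if all(s["fully_supported"]
                       for s in labels["sentence_support_information"]) else 0.0  # 0.0
```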
---
## Code Implementation
### Metric Calculation Methods
```python
from typing import Optional

def _compute_context_relevance(self, gpt_labels: GPTLabelingOutput) -> float:
    """Count relevant sentences, normalize to 0-1."""
    if not gpt_labels.all_relevant_sentence_keys:
        return 0.0
    return min(1.0, len(gpt_labels.all_relevant_sentence_keys) / 20.0)

def _compute_context_utilization(self, gpt_labels: GPTLabelingOutput) -> float:
    """Utilized / Relevant."""
    relevant_count = len(gpt_labels.all_relevant_sentence_keys)
    utilized_count = len(gpt_labels.all_utilized_sentence_keys)
    if relevant_count == 0:
        return 0.0
    return min(1.0, utilized_count / relevant_count)

def _compute_completeness(self, gpt_labels: GPTLabelingOutput,
                          ground_truth: Optional[str] = None) -> float:
    """(Relevant AND Utilized) / Relevant."""
    relevant_set = set(gpt_labels.all_relevant_sentence_keys)
    utilized_set = set(gpt_labels.all_utilized_sentence_keys)
    intersection = len(relevant_set & utilized_set)
    if len(relevant_set) == 0:
        return 1.0 if len(utilized_set) == 0 else 0.0
    return intersection / len(relevant_set)

def _compute_adherence(self, gpt_labels: GPTLabelingOutput) -> float:
    """All sentences fully supported? Boolean: 1.0 or 0.0."""
    total_sentences = len(gpt_labels.sentence_support_information)
    if total_sentences == 0:
        return 1.0
    fully_supported_count = sum(
        1 for s in gpt_labels.sentence_support_information
        if s.get("fully_supported", False)
    )
    return 1.0 if fully_supported_count == total_sentences else 0.0
```
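As a standalone usage sketch (assuming `GPTLabelingOutput` is a plain container; its real definition isn't shown here), the same four computations can be run on the labels from the complete example below:

```python
from dataclasses import dataclass, field

@dataclass
class GPTLabelingOutput:
    # Minimal stand-in for the real labeling output type (assumed shape).
    all_relevant_sentence_keys: list = field(default_factory=list)
    all_utilized_sentence_keys: list = field(default_factory=list)
    sentence_support_information: list = field(default_factory=list)

labels = GPTLabelingOutput(
    all_relevant_sentence_keys=["0a", "0b", "1a", "1b"],
    all_utilized_sentence_keys=["0a", "0b", "1a", "1b"],
    sentence_support_information=[
        {"response_sentence_key": "a", "fully_supported": True},
        {"response_sentence_key": "b", "fully_supported": True},
        {"response_sentence_key": "c", "fully_supported": False},
    ],
)

relevant = set(labels.all_relevant_sentence_keys)
utilized = set(labels.all_utilized_sentence_keys)

# Inline versions of the four method bodies:
relevance = min(1.0, len(relevant) / 20.0)               # 4/20 = 0.20
utilization = min(1.0, len(utilized) / len(relevant))    # 4/4 = 1.0
completeness = len(relevant & utilized) / len(relevant)  # 4/4 = 1.0
adherence = 1.0 if all(s.get("fully_supported", False)
                       for s in labels.sentence_support_information) else 0.0  # 0.0
print((relevance + utilization + completeness + adherence) / 4)  # 0.55
```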
---
## Complete Example: Full Calculation
### Input
**Question:** "What is machine learning?"
**Retrieved Documents:**
```
Doc 0: "Machine learning is a subset of AI. It learns patterns from data.
        Algorithms improve through experience."
Doc 1: "Deep learning uses neural networks. It's popular in computer vision."
Doc 2: "Supervised learning needs labeled data. Unsupervised learning finds patterns."
```
**LLM Response:**
```
"Machine learning is a field of AI that learns from data. Deep learning
uses neural networks. It's powerful for image recognition."
```
### GPT Labeling Process
**Sentencized Documents:**
```
0a: "Machine learning is a subset of AI."
0b: "It learns patterns from data."
0c: "Algorithms improve through experience."
1a: "Deep learning uses neural networks."
1b: "It's popular in computer vision."
2a: "Supervised learning needs labeled data."
2b: "Unsupervised learning finds patterns."
```
**Sentencized Response:**
```
a: "Machine learning is a field of AI that learns from data."
b: "Deep learning uses neural networks."
c: "It's powerful for image recognition."
```
**GPT Analysis:**
```json
{
  "all_relevant_sentence_keys": ["0a", "0b", "1a", "1b"],
  "all_utilized_sentence_keys": ["0a", "0b", "1a", "1b"],
  "sentence_support_information": [
    {
      "response_sentence_key": "a",
      "supporting_sentence_keys": ["0a", "0b"],
      "fully_supported": true
    },
    {
      "response_sentence_key": "b",
      "supporting_sentence_keys": ["1a"],
      "fully_supported": true
    },
    {
      "response_sentence_key": "c",
      "supporting_sentence_keys": ["1b"],
      "fully_supported": false
    }
  ]
}
```
Note: sentence "c" is marked not fully supported because "powerful for image recognition" is not explicitly stated in the documents.
### Metric Calculation
```
Relevant: ["0a", "0b", "1a", "1b"] (4 sentences)
Utilized: ["0a", "0b", "1a", "1b"] (4 sentences)
Total sentences retrieved: 7
Context Relevance = min(1.0, 4 / 20) = 0.20 (20%)
  └─ Normalized against the fixed baseline of 20, not the 7 actually
     retrieved (the raw fraction would be 4/7 ≈ 0.57)
Context Utilization = 4 / 4 = 1.0 (100%)
  └─ All relevant information was used
Completeness = |{"0a","0b","1a","1b"} ∩ {"0a","0b","1a","1b"}| / 4
             = 4 / 4 = 1.0 (100%)
  └─ All relevant info was covered
Adherence = all fully supported?
          = sentence_a (true) AND sentence_b (true) AND sentence_c (false)
          = false → 0.0 (0%)
  └─ Contains an unsupported claim: "powerful for image recognition"
Average Score = (0.20 + 1.0 + 1.0 + 0.0) / 4 = 0.55
```
---
## Key Insights
### 1. Complementary Metrics
| Metric | Measures | Ideal Value |
|--------|----------|-------------|
| **Relevance** | Quality of retrieval | High (0.7+) |
| **Utilization** | LLM uses available info | High (0.7+) |
| **Completeness** | Coverage of information | High (0.7+) |
| **Adherence** | Grounding (no hallucination) | Perfect (1.0) |
### 2. Common Patterns
**Pattern 1: Good Retrieval, Bad Generation**
```
Relevance: 0.85 (good retrieval)
Utilization: 0.40 (not using it)
→ Problem: LLM not leveraging context
→ Fix: Improve prompt instructions
```
**Pattern 2: Conservative but Accurate**
```
Completeness: 0.50 (missing info)
Adherence: 1.0 (all correct)
→ Problem: Limited but grounded response
→ Fix: Improve retrieval coverage
```
**Pattern 3: Comprehensive and Grounded**
```
Relevance: 0.75, Utilization: 0.80, Completeness: 0.85, Adherence: 1.0
→ Excellent RAG system
→ Action: Monitor and maintain
```
### 3. Mathematical Relationships
```
Completeness ≤ Utilization
(Because |Relevant ∩ Utilized| ≤ |Utilized|, completeness can never exceed utilization.)
Note: Relevance is not directly comparable to the other two, since it is
normalized against the total (or baseline) sentence count rather than the
relevant count.
Also:
If Relevance = 0 → Utilization = 0 and Completeness = 0 (no relevant sentences)
If Utilization = 0 → Completeness = 0 (but Relevance can be > 0)
```
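The Completeness ≤ Utilization relationship can be spot-checked with random label sets. This is an illustrative sketch, not part of the implementation:

```python
import random

random.seed(0)
universe = [f"{d}{c}" for d in range(3) for c in "abc"]  # keys "0a".."2c"
for _ in range(1000):
    relevant = set(random.sample(universe, random.randint(1, len(universe))))
    utilized = set(random.sample(universe, random.randint(0, len(universe))))
    utilization = min(1.0, len(utilized) / len(relevant))
    completeness = len(relevant & utilized) / len(relevant)
    # |relevant ∩ utilized| <= |utilized|, so completeness <= utilization,
    # even when utilization is capped at 1.0.
    assert completeness <= utilization
```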
---
## Advantages of GPT Labeling
✅ **Semantic Understanding**
- Not just keyword matching
- Understands meaning and context
- Detects subtle hallucinations
✅ **Fine-Grained Analysis**
- Sentence-level support mapping
- Identifies exactly which info is supported
- Pinpoints problematic claims
✅ **Comprehensive**
- Evaluates all four TRACE metrics
- Single pass through documents
- Complete audit trail
✅ **Transparent**
- Full explanation for each metric
- Shows supporting evidence
- Human-verifiable results
---
## Limitations
❌ **Cost**
- API calls per evaluation (~2.5s per eval with rate limiting)
- At 30 RPM: 50 evals = 3-5 minutes
❌ **Semantic Brittleness**
- Depends on GPT's understanding
- May miss implicit knowledge
- Sensitive to phrasing
❌ **Normalization**
- Context Relevance normalized by 20 (arbitrary baseline)
- Different domain sizes affect scaling
❌ **Binary Adherence**
- One hallucination = 0.0 adherence
- No partial credit for mostly correct
---
## Summary
The GPT labeling approach calculates TRACE metrics by:
1. **Splitting** documents and response into sentences
2. **Analyzing** with GPT to identify relevant/utilized/supported information
3. **Computing** metrics from the labeled sentence keys:
   - **Relevance**: What's relevant in the retrieved docs?
   - **Utilization**: How much of that was actually used?
   - **Completeness**: How much of the relevant info was covered?
   - **Adherence**: Is all information grounded?
This enables precise, interpretable evaluation of RAG system quality.