# TRACE Metrics Calculation - Visual Guide
## Step-by-Step Visualization
### STEP 1: Sentencization
```
DOCUMENTS                               RESPONSE
═══════════════════════════════        ══════════════════════════════
Doc 0:                                 "Machine learning is AI that
"ML is AI. It learns from data.        learns from data. Deep learning
Algorithms improve through time."      uses neural networks. It's
                                       powerful for images."

   ↓ Split by sentence ends               ↓ Split by sentence ends

0a: "ML is AI."                        a: "Machine learning is AI that
0b: "It learns from data."                learns from data."
0c: "Algorithms improve                b: "Deep learning uses neural
    through time."                        networks."
                                       c: "It's powerful for images."
```
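This splitting step can be approximated with a naive punctuation-based splitter. A minimal sketch (the names `sentencize` and `key_documents` are illustrative, and a production sentencizer would likely use a proper NLP library to handle abbreviations and decimals):

```python
import re
import string

def sentencize(text: str) -> list[str]:
    # Naive splitter: break after ., !, or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def key_documents(docs: list[str]) -> dict[str, str]:
    # Assign keys like "0a", "0b": document index + sentence letter.
    keyed = {}
    for i, doc in enumerate(docs):
        for j, sent in enumerate(sentencize(doc)):
            keyed[f"{i}{string.ascii_lowercase[j]}"] = sent
    return keyed

docs = ["ML is AI. It learns from data. Algorithms improve through time."]
print(key_documents(docs))
# {'0a': 'ML is AI.', '0b': 'It learns from data.', '0c': 'Algorithms improve through time.'}
```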
### STEP 2: GPT Analysis
```
GPT MODEL PROCESSES:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ INPUT: Sentencized docs + response + question              β”‚
β”‚                                                            β”‚
β”‚ ANALYSIS:                                                  β”‚
β”‚   βœ“ Which doc sentences are relevant to the question?      β”‚
β”‚   βœ“ Which doc sentences does the response use?             β”‚
β”‚   βœ“ Is each response sentence fully/partially supported?   β”‚
β”‚                                                            β”‚
β”‚ OUTPUT: JSON with sentence keys and support mappings       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
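In code, this step amounts to building a labeling prompt and parsing the model's JSON reply. A sketch under assumptions: the prompt layout below is hypothetical (the project's real template may differ), and `build_label_prompt` / `parse_labels` are illustrative names:

```python
import json

def build_label_prompt(question: str, keyed_docs: dict, keyed_response: dict) -> str:
    # Hypothetical prompt layout: one "[key] sentence" line per sentence.
    doc_lines = "\n".join(f"[{k}] {s}" for k, s in keyed_docs.items())
    resp_lines = "\n".join(f"[{k}] {s}" for k, s in keyed_response.items())
    return (
        f"Question: {question}\n\n"
        f"Document sentences:\n{doc_lines}\n\n"
        f"Response sentences:\n{resp_lines}\n\n"
        "Return JSON with keys: all_relevant_sentence_keys, "
        "all_utilized_sentence_keys, sentence_support_information."
    )

def parse_labels(raw: str) -> dict:
    # The model is instructed to emit strict JSON, so json.loads is enough;
    # missing keys default to empty lists to keep the metric code safe.
    labels = json.loads(raw)
    for key in ("all_relevant_sentence_keys",
                "all_utilized_sentence_keys",
                "sentence_support_information"):
        labels.setdefault(key, [])
    return labels
```

Defaulting absent keys to empty lists means a malformed-but-parseable reply degrades to zero scores instead of crashing the metric calculation.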
### STEP 3: Metric Calculation
```
GPT OUTPUT (SIMPLIFIED):
{
"all_relevant_sentence_keys": ["0a", "0b"],
"all_utilized_sentence_keys": ["0a", "0b"],
"sentence_support_information": [
{"response_sentence_key": "a", "fully_supported": true},
{"response_sentence_key": "b", "fully_supported": true},
{"response_sentence_key": "c", "fully_supported": false}
]
}
↓
METRIC CALCULATION:
β”œβ”€ Context Relevance = |relevant| / 20 = 2/20 = 0.10
β”œβ”€ Context Utilization = |utilized| / |relevant| = 2/2 = 1.0
β”œβ”€ Completeness = |relevant ∩ utilized| / |relevant| = 2/2 = 1.0
└─ Adherence = all_fully_supported? = false β†’ 0.0
```
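The arithmetic above can be checked directly. A minimal sketch, assuming the fixed denominator of 20 retrieved sentences used throughout this example:

```python
# Simplified GPT output from Step 3 above.
gpt_output = {
    "all_relevant_sentence_keys": ["0a", "0b"],
    "all_utilized_sentence_keys": ["0a", "0b"],
    "sentence_support_information": [
        {"response_sentence_key": "a", "fully_supported": True},
        {"response_sentence_key": "b", "fully_supported": True},
        {"response_sentence_key": "c", "fully_supported": False},
    ],
}

relevant = set(gpt_output["all_relevant_sentence_keys"])
utilized = set(gpt_output["all_utilized_sentence_keys"])
support = gpt_output["sentence_support_information"]

context_relevance = len(relevant) / 20                   # 2/20 = 0.10
context_utilization = len(utilized) / len(relevant)      # 2/2  = 1.0
completeness = len(relevant & utilized) / len(relevant)  # 2/2  = 1.0
# Adherence is binary: one unsupported sentence drops it to 0.0.
adherence = 1.0 if all(s["fully_supported"] for s in support) else 0.0
```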
---
## Metric Formulas with Venn Diagrams
### Context Relevance (R)
```
ALL RETRIEVED SENTENCES
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Total: ~20 sentences       β”‚
β”‚                            β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚   β”‚ RELEVANT:        β”‚    β”‚
β”‚   β”‚ ["0a", "0b"]     β”‚    β”‚
β”‚   β”‚ Count: 2         β”‚    β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                            β”‚
β”‚ Irrelevant: 18             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Formula: R = 2 / 20 = 0.10 (10%)
Interpretation: 10% of retrieved content is relevant to the question
```
### Context Utilization (U)
```
RELEVANT SENTENCES
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ RELEVANT: ["0a", "0b"]     β”‚
β”‚                            β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚   β”‚ UTILIZED:        β”‚    β”‚
β”‚   β”‚ ["0a", "0b"]     β”‚    β”‚
β”‚   β”‚ Count: 2         β”‚    β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                            β”‚
β”‚ NOT USED: 0                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Formula: U = 2 / 2 = 1.0 (100%)
Interpretation: All relevant information was used
```
### Completeness (C)
```
    RELEVANT            UTILIZED
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ ["0a", "0b"] β”‚    β”‚ ["0a", "0b"] β”‚
β”‚              β”‚    β”‚              β”‚
β”‚ COUNT: 2     β”‚    β”‚ COUNT: 2     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚                   β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                 β”‚
        OVERLAP: ["0a", "0b"]
        COUNT: 2

Formula: C = 2 / 2 = 1.0 (100%)
Interpretation: All relevant info appears in the response
```
### Adherence (A)
```
RESPONSE SENTENCES:           SUPPORT STATUS:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ a: "ML is AI..."  β”‚ ──────→ β”‚ βœ“ Fully        β”‚
β”‚                   β”‚         β”‚   supported    β”‚
β”‚ b: "Deep..."      β”‚ ──────→ β”‚ βœ“ Fully        β”‚
β”‚                   β”‚         β”‚   supported    β”‚
β”‚ c: "Powerful..."  β”‚ ──────→ β”‚ βœ— Not          β”‚
β”‚                   β”‚         β”‚   supported    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Formula: A = all_fully_supported ? 1.0 : 0.0
           = (true AND true AND false) ? 1.0 : 0.0
           = 0.0  (one unsupported sentence zeroes the score)
Interpretation: Response contains a hallucination, so adherence fails
```
---
## Complete Example Walkthrough
### Input
```
QUESTION:
"What makes machine learning different from traditional programming?"
RETRIEVED DOCUMENTS:
0: "Machine learning is a subset of AI. It learns patterns from data.
Traditional programming requires explicit instructions."
1: "ML algorithms improve through experience. They adapt to new data.
Rule-based systems are rigid and hard to maintain."
LLM RESPONSE:
"Machine learning differs because it learns from data rather than
requiring explicit instructions. ML algorithms improve over time.
It's the future of all computing."
```
### Step 1: Sentencization
```
DOCUMENTS:
0a: "Machine learning is a subset of AI."
0b: "It learns patterns from data."
0c: "Traditional programming requires explicit instructions."
1a: "ML algorithms improve through experience."
1b: "They adapt to new data."
1c: "Rule-based systems are rigid and hard to maintain."
RESPONSE:
a: "Machine learning differs because it learns from data rather than
requiring explicit instructions."
b: "ML algorithms improve over time."
c: "It's the future of all computing."
```
### Step 2: GPT Labeling
```
ANALYSIS BY GPT:
Question focus: Differences between ML and traditional programming
β”œβ”€ "learns from data" vs "explicit instructions"
β”œβ”€ "improves through experience"
└─ Adaptability
RELEVANT SENTENCES (to question):
β”œβ”€ 0a: "subset of AI" β†’ Partially relevant
β”œβ”€ 0b: "learns patterns from data" β†’ RELEVANT βœ“
β”œβ”€ 0c: "requires explicit instructions" β†’ RELEVANT βœ“
β”œβ”€ 1a: "improve through experience" β†’ RELEVANT βœ“
β”œβ”€ 1b: "adapt to new data" β†’ RELEVANT βœ“
└─ 1c: "rule-based systems rigid" β†’ Partially relevant
UTILIZED SENTENCES (used in response):
β”œβ”€ response_a uses: 0b, 0c β†’ Document references: [0b, 0c]
β”œβ”€ response_b uses: 1a β†’ Document references: [1a]
└─ response_c uses: NONE β†’ No support β†’ [hallucination]
FULLY SUPPORTED CHECK:
β”œβ”€ response_a "learns from data, not explicit" β†’ Supported by 0b, 0c βœ“
β”œβ”€ response_b "algorithms improve" β†’ Supported by 1a βœ“
└─ response_c "future of all computing" β†’ NOT in documents βœ—
```
### Step 3: Metric Calculation
```
EXTRACTED DATA:
all_relevant_sentence_keys = ["0b", "0c", "1a", "1b"] (4 sentences)
all_utilized_sentence_keys = ["0b", "0c", "1a"] (3 sentences)
sentence_support_information = [
{key: "a", fully_supported: true},
{key: "b", fully_supported: true},
{key: "c", fully_supported: false}
]
CALCULATIONS:
1. Context Relevance
= |relevant| / 20
= 4 / 20
= 0.20 (20%)
2. Context Utilization
= |utilized| / |relevant|
= 3 / 4
= 0.75 (75%)
3. Completeness
= |relevant ∩ utilized| / |relevant|
= |{0b, 0c, 1a}| / |{0b, 0c, 1a, 1b}|
= 3 / 4
= 0.75 (75%)
4. Adherence
= all fully_supported?
= true AND true AND false
= FALSE β†’ 0.0 (0%)
```
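These calculations can be reproduced in a few lines, again assuming the fixed denominator of 20 retrieved sentences used in this example:

```python
# Extracted data from the GPT labeling step.
relevant = {"0b", "0c", "1a", "1b"}
utilized = {"0b", "0c", "1a"}
fully_supported = [True, True, False]   # response sentences a, b, c

r = len(relevant) / 20                        # 4/20 = 0.20
u = len(utilized) / len(relevant)             # 3/4  = 0.75
c = len(relevant & utilized) / len(relevant)  # 3/4  = 0.75
a = 1.0 if all(fully_supported) else 0.0      # one failure -> 0.0
average = (r + u + c + a) / 4                 # 0.425
```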
### Results
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ TRACE METRICS RESULTS                    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Context Relevance:    0.20 (20%)         β”‚
β”‚ Context Utilization:  0.75 (75%)         β”‚
β”‚ Completeness:         0.75 (75%)         β”‚
β”‚ Adherence:            0.0  (0%)          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Average:              0.425 (42.5%)      β”‚
β”‚ RMSE Aggregation:     0.437              β”‚
β”‚ Consistency Score:    0.563              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
INTERPRETATION:
βœ— Low context relevance (20%): retrieval includes much off-topic content
βœ“ Decent information usage (75%)
βœ“ Good coverage of relevant info (75%)
βœ— Contains hallucination (0% adherence)
ACTION: Address the hallucination about "future of all computing"
```
---
## Calculation Pseudocode
```python
# INPUT: GPT-labeled output (example values from the walkthrough above)
gpt_labels = {
    "all_relevant_sentence_keys": ["0b", "0c", "1a", "1b"],
    "all_utilized_sentence_keys": ["0b", "0c", "1a"],
    "sentence_support_information": [
        {"response_sentence_key": "a", "fully_supported": True},
        {"response_sentence_key": "b", "fully_supported": True},
        {"response_sentence_key": "c", "fully_supported": False},
    ],
}

# METRIC 1: Context Relevance (20 = assumed retrieval size, capped at 1.0)
relevant_count = len(gpt_labels["all_relevant_sentence_keys"])
context_relevance = min(1.0, relevant_count / 20.0)

# METRIC 2: Context Utilization (guard against division by zero)
utilized_count = len(gpt_labels["all_utilized_sentence_keys"])
if relevant_count == 0:
    context_utilization = 0.0
else:
    context_utilization = min(1.0, utilized_count / relevant_count)

# METRIC 3: Completeness (overlap of relevant and utilized sentence keys)
relevant_set = set(gpt_labels["all_relevant_sentence_keys"])
utilized_set = set(gpt_labels["all_utilized_sentence_keys"])
overlap_count = len(relevant_set & utilized_set)
if len(relevant_set) == 0:
    completeness = 1.0 if len(utilized_set) == 0 else 0.0
else:
    completeness = overlap_count / len(relevant_set)

# METRIC 4: Adherence (binary: every response sentence must be supported)
fully_supported_count = sum(
    1 for sentence in gpt_labels["sentence_support_information"]
    if sentence["fully_supported"]
)
total_sentences = len(gpt_labels["sentence_support_information"])
if total_sentences == 0:
    adherence = 1.0
else:
    adherence = 1.0 if fully_supported_count == total_sentences else 0.0

# OUTPUT
scores = {
    "context_relevance": context_relevance,      # 0.20
    "context_utilization": context_utilization,  # 0.75
    "completeness": completeness,                # 0.75
    "adherence": adherence,                      # 0.0
    "average": (context_relevance + context_utilization +
                completeness + adherence) / 4,   # 0.425
}
```
---
## Key Takeaways
### 1. Each Metric Answers a Different Question
| Metric | Question | Data Source |
|--------|----------|-------------|
| **R** | Is retrieval good? | Relevant sentences |
| **U** | Does LLM use it? | Utilized sentences |
| **C** | Is response comprehensive? | Overlap |
| **A** | Is response truthful? | Support flags |
### 2. Metrics Are Independent
- Low R, high U is possible (the response ignores the irrelevant context)
- High R, low U is possible (retrieval is good, but generation underuses it)
- Low C, high A is possible (limited coverage, but everything stated is supported)
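This independence is easy to demonstrate numerically. A minimal sketch (the helper `trace_metrics` mirrors the pseudocode below; `total=20` is the assumed retrieval size) comparing a low-R/high-U case with a high-R/low-U case:

```python
def trace_metrics(relevant: set, utilized: set, supported: list, total: int = 20):
    # Same formulas and edge cases as the calculation pseudocode.
    r = min(1.0, len(relevant) / total)
    u = min(1.0, len(utilized) / len(relevant)) if relevant else 0.0
    if relevant:
        c = len(relevant & utilized) / len(relevant)
    else:
        c = 1.0 if not utilized else 0.0
    a = 1.0 if all(supported) else 0.0  # all([]) is True, matching the pseudocode
    return r, u, c, a

# Low R, high U: noisy retrieval, but the response uses the one relevant sentence.
print(trace_metrics({"0a"}, {"0a"}, [True]))
# (0.05, 1.0, 1.0, 1.0)

# High R, low U: on-target retrieval, but generation uses only 1 of 4 sentences.
print(trace_metrics({"0a", "0b", "0c", "0d"}, {"0a"}, [True]))
# (0.2, 0.25, 0.25, 1.0)
```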
### 3. GPT Labeling is Sentence-Level
- Fine-grained sentence keys (0a, 0b, 1c, etc.)
- Exact mapping of support
- Transparent and verifiable
### 4. All Four Metrics Required for Full Picture
```
Relevance:    ← "Did we retrieve the right docs?"
Utilization:  ← "Did the LLM use them?"
Completeness: ← "Did it cover the information?"
Adherence:    ← "Is it accurate?"
```
All four needed to understand RAG quality.