GPT Labeling Prompt → TRACE Metrics: Complete Explanation
The Big Picture
Your RAG Capstone Project uses GPT (LLM) to evaluate RAG responses instead of simple keyword matching. Here's how it works:
┌──────────────┐
│ Query        │
│ + Response   │
│ + Documents  │
└──────┬───────┘
       │
       ▼
┌───────────────────────────────┐
│ Sentencize (Add keys:         │
│ doc_0_s0, resp_s0, etc.)      │
└──────┬────────────────────────┘
       │
       ▼
┌───────────────────────────────┐
│ Generate Structured GPT       │
│ Labeling Prompt               │
└──────┬────────────────────────┘
       │
       ▼
┌───────────────────────────────┐
│ Call Groq LLM API             │
│ (llm_client.generate)         │
└──────┬────────────────────────┘
       │
       ▼
┌───────────────────────────────┐
│ LLM Returns JSON with:        │
│ - relevant_sentence_keys      │
│ - utilized_sentence_keys      │
│ - support_info                │
└──────┬────────────────────────┘
       │
       ▼
┌───────────────────────────────┐
│ Extract and Calculate:        │
│ R (Relevance)    = 0.15       │
│ T (Utilization)  = 0.67       │
│ C (Completeness) = 0.67       │
│ A (Adherence)    = 1.0        │
└──────┬────────────────────────┘
       │
       ▼
┌───────────────────────────────┐
│ Return AdvancedTRACEScores    │
│ with all metrics + metadata   │
└───────────────────────────────┘
What the GPT Prompt Asks
The GPT labeling prompt (in advanced_rag_evaluator.py, line 305) instructs the LLM to:
"You are a Fact-Checking and Citation Specialist"
- Identify Relevant Information: Which document sentences are relevant to the question?
- Verify Support: Which document sentences support each response sentence?
- Check Completeness: Is all important information covered?
- Detect Hallucinations: Are there any unsupported claims?
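As a rough illustration, such a structured labeling prompt might be assembled as below. This is a hypothetical reconstruction: the real template lives in advanced_rag_evaluator.py around line 305, and the function name, wording, and formatting here are assumptions.

```python
def build_labeling_prompt(question, keyed_doc_sentences, keyed_resp_sentences):
    """Hypothetical sketch of the structured labeling prompt.

    keyed_doc_sentences / keyed_resp_sentences map sentence keys
    (doc_0_s0, resp_s0, ...) to sentence text.
    """
    # Render each sentence with its traceable key so the LLM can cite keys.
    doc_block = "\n".join(f"[{k}] {s}" for k, s in keyed_doc_sentences.items())
    resp_block = "\n".join(f"[{k}] {s}" for k, s in keyed_resp_sentences.items())
    return (
        "You are a Fact-Checking and Citation Specialist.\n\n"
        f"Question: {question}\n\n"
        f"Document sentences:\n{doc_block}\n\n"
        f"Response sentences:\n{resp_block}\n\n"
        "Return JSON with keys: all_relevant_sentence_keys, "
        "all_utilized_sentence_keys, sentence_support_information, "
        "overall_supported."
    )
```

Keying each sentence in the prompt is what lets the LLM answer with machine-readable sentence keys instead of free text.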
What the LLM Returns (JSON)
{
  "relevance_explanation": "Documents 1-2 are relevant, document 3 is not",
  "all_relevant_sentence_keys": [
    "doc_0_s0",    ← Sentence 0 from document 0
    "doc_0_s1",    ← Sentence 1 from document 0
    "doc_1_s0"     ← Sentence 0 from document 1
  ],
  "sentence_support_information": [
    {
      "response_sentence_key": "resp_s0",
      "explanation": "Matches doc_0_s0 about COVID-19",
      "supporting_sentence_keys": ["doc_0_s0"],
      "fully_supported": true    ← No hallucination
    },
    {
      "response_sentence_key": "resp_s1",
      "explanation": "Matches doc_0_s1 about droplet spread",
      "supporting_sentence_keys": ["doc_0_s1"],
      "fully_supported": true    ← No hallucination
    }
  ],
  "all_utilized_sentence_keys": [
    "doc_0_s0",
    "doc_0_s1"
  ],
  "overall_supported": true    ← Response is fully grounded
}
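Because the LLM's reply arrives as plain text, a defensive parse is typically needed before the metric calculations. A minimal sketch (the empty-dict fallback on a malformed reply is an assumption, not necessarily what the project does):

```python
import json

# Example raw LLM reply (abbreviated to the fields the metrics use).
raw = """{
  "all_relevant_sentence_keys": ["doc_0_s0", "doc_0_s1", "doc_1_s0"],
  "all_utilized_sentence_keys": ["doc_0_s0", "doc_0_s1"],
  "sentence_support_information": [
    {"response_sentence_key": "resp_s0",
     "supporting_sentence_keys": ["doc_0_s0"],
     "fully_supported": true}
  ],
  "overall_supported": true
}"""

try:
    labels = json.loads(raw)
except json.JSONDecodeError:
    labels = {}  # assumed fallback: treat an unparseable reply as empty labels

# .get() with defaults keeps downstream math safe if a field is missing.
relevant = labels.get("all_relevant_sentence_keys", [])
utilized = labels.get("all_utilized_sentence_keys", [])
```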
How Each TRACE Metric Is Calculated
Metric 1: RELEVANCE (R)
Question Being Answered: "How much of the retrieved documents are relevant to the question?"
Code Location: advanced_rag_evaluator.py, Lines 554-562
Calculation:
R = len(all_relevant_sentence_keys) / 20
From GPT Response:
- Uses: all_relevant_sentence_keys count
- Example: ["doc_0_s0", "doc_0_s1", "doc_1_s0"] → 3 keys
- Divided by 20 (normalized max)
- Result: 3/20 = 0.15 (15%)
Interpretation: Only 15% of the document context is relevant to the query; the rest is noise.
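The relevance step reduces to a one-line normalization. A sketch (the min() cap is an assumption to keep the score in the 0-1 range; the variable names mirror this description, not necessarily the source):

```python
def relevance(all_relevant_sentence_keys, max_sentences=20):
    # R = number of relevant document sentences / fixed normalization constant.
    # The cap at 1.0 is assumed: it prevents more than 20 relevant
    # sentences from pushing the score above the 0-1 range.
    return min(len(all_relevant_sentence_keys) / max_sentences, 1.0)

relevance(["doc_0_s0", "doc_0_s1", "doc_1_s0"])  # 3/20 = 0.15
```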
Metric 2: UTILIZATION (T)
Question Being Answered: "Of the relevant information, how much did the LLM actually use?"
Code Location: advanced_rag_evaluator.py, Lines 564-576
Calculation:
T = len(all_utilized_sentence_keys) / len(all_relevant_sentence_keys)
From GPT Response:
- Numerator:
all_utilized_sentence_keyscount (e.g., 2) - Denominator:
all_relevant_sentence_keyscount (e.g., 3) - Result: 2/3 = 0.67 (67%)
Interpretation: The LLM used 67% of the relevant information. It ignored one relevant sentence.
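A minimal sketch of the utilization ratio (the zero-denominator fallback is an assumption; the source may handle that case differently):

```python
def utilization(all_utilized_sentence_keys, all_relevant_sentence_keys):
    # T = utilized relevant sentences / all relevant sentences.
    if not all_relevant_sentence_keys:
        return 0.0  # assumed fallback when nothing is relevant
    return len(all_utilized_sentence_keys) / len(all_relevant_sentence_keys)

utilization(["doc_0_s0", "doc_0_s1"],
            ["doc_0_s0", "doc_0_s1", "doc_1_s0"])  # 2/3 ≈ 0.67
```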
Metric 3: COMPLETENESS (C)
Question Being Answered: "Does the response cover all the relevant information?"
Code Location: advanced_rag_evaluator.py, Lines 577-591
Calculation:
C = len(relevant_AND_utilized) / len(relevant)
From GPT Response:
- Find the intersection of:
  all_relevant_sentence_keys = {doc_0_s0, doc_0_s1, doc_1_s0}
  all_utilized_sentence_keys = {doc_0_s0, doc_0_s1}
- Intersection = {doc_0_s0, doc_0_s1} → 2 items
- Result: 2/3 ≈ 0.67 (67%)
Interpretation: The response covers 67% of the relevant information; it is missing doc_1_s0.
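The completeness intersection maps directly onto Python set operations. A sketch (the zero-denominator fallback is an assumption):

```python
def completeness(all_relevant_sentence_keys, all_utilized_sentence_keys):
    relevant = set(all_relevant_sentence_keys)
    utilized = set(all_utilized_sentence_keys)
    if not relevant:
        return 0.0  # assumed fallback when nothing is relevant
    # C = |relevant ∩ utilized| / |relevant|
    return len(relevant & utilized) / len(relevant)

completeness(["doc_0_s0", "doc_0_s1", "doc_1_s0"],
             ["doc_0_s0", "doc_0_s1"])  # 2/3 ≈ 0.67
```

Note that when every utilized sentence is also relevant (as in this example), C equals T; they diverge only if the LLM "uses" sentences the judge did not mark relevant.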
Metric 4: ADHERENCE (A) - Hallucination Detection
Question Being Answered: "Does the response contain hallucinations? (Are all claims supported by documents?)"
Code Location: advanced_rag_evaluator.py, Lines 593-609
Calculation:
if ALL response sentences have fully_supported = true:
    A = 1.0
else:
    A = 0.0  (at least one hallucination found!)
From GPT Response:
- Check each item in sentence_support_information
- Look at the fully_supported field
- Example:
  resp_s0: fully_supported = true ✓
  resp_s1: fully_supported = true ✓
  All are true → Result: 1.0 (no hallucinations!)
If any were false:
  resp_s0: fully_supported = true ✓
  resp_s1: fully_supported = false ✗ → HALLUCINATION!
  Result: 0.0 (contains a hallucination)
Interpretation: 1.0 = the response is completely grounded in the documents; 0.0 = it contains at least one unsupported claim.
Real Example: Full Walkthrough
Input:
Question: "What is COVID-19?"
Response: "COVID-19 is a respiratory disease. It spreads via droplets."
Documents:
1. "COVID-19 is a respiratory disease caused by SARS-CoV-2."
2. "The virus spreads through respiratory droplets."
3. "Vaccines help prevent infection."
Step 1: Sentencize
doc_0_s0: "COVID-19 is a respiratory disease caused by SARS-CoV-2."
doc_0_s1: "The virus spreads through respiratory droplets."
doc_1_s0: "Vaccines help prevent infection."
resp_s0: "COVID-19 is a respiratory disease."
resp_s1: "It spreads via droplets."
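Step 1 can be sketched with a naive regex splitter. This is an illustration only: the actual project presumably uses a proper sentence tokenizer, and only the key scheme (doc_{i}_s{j} for documents, resp_s{j} for the response) follows the example above.

```python
import re

def sentencize(texts, prefix):
    """Split each text into sentences and assign traceable keys
    like doc_0_s0 (document 0, sentence 0) or resp_s0."""
    keyed = {}
    for i, text in enumerate(texts):
        # Naive split on sentence-ending punctuation followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        for j, sentence in enumerate(sentences):
            # Documents carry the document index; the response does not.
            key = f"{prefix}_{i}_s{j}" if prefix == "doc" else f"{prefix}_s{j}"
            keyed[key] = sentence
    return keyed

docs = [
    "COVID-19 is a respiratory disease caused by SARS-CoV-2. "
    "The virus spreads through respiratory droplets.",
    "Vaccines help prevent infection.",
]
doc_keys = sentencize(docs, "doc")
resp_keys = sentencize(["COVID-19 is a respiratory disease. "
                        "It spreads via droplets."], "resp")
```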
Step 2: Send to GPT Labeling Prompt
GPT analyzes and returns:
{
  "all_relevant_sentence_keys": ["doc_0_s0", "doc_0_s1"],
  "all_utilized_sentence_keys": ["doc_0_s0", "doc_0_s1"],
  "sentence_support_information": [
    {"response_sentence_key": "resp_s0", "fully_supported": true, "supporting_sentence_keys": ["doc_0_s0"]},
    {"response_sentence_key": "resp_s1", "fully_supported": true, "supporting_sentence_keys": ["doc_0_s1"]}
  ]
}
Step 3: Calculate TRACE Metrics
Relevance (R):
- Relevant keys: 2 (doc_0_s0, doc_0_s1)
- Formula: 2/20 = 0.10 (10%)
- Meaning: 10% of the documents are relevant
Utilization (T):
- Used: 2, Relevant: 2
- Formula: 2/2 = 1.00 (100%)
- Meaning: Used all relevant information
Completeness (C):
- Relevant β© Used = 2
- Formula: 2/2 = 1.00 (100%)
- Meaning: Response covers all relevant info
Adherence (A):
- All sentences: fully_supported=true?
- YES β 1.0 (No hallucinations!)
Average Score:
- (0.10 + 1.00 + 1.00 + 1.0) / 4 = 0.775 (77.5% overall quality)
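The whole walkthrough can be reproduced end to end from the labeled JSON above. A sketch using the formulas as described (not the project's actual code):

```python
# The labels GPT returned for the walkthrough example.
labels = {
    "all_relevant_sentence_keys": ["doc_0_s0", "doc_0_s1"],
    "all_utilized_sentence_keys": ["doc_0_s0", "doc_0_s1"],
    "sentence_support_information": [
        {"response_sentence_key": "resp_s0", "fully_supported": True},
        {"response_sentence_key": "resp_s1", "fully_supported": True},
    ],
}

relevant = set(labels["all_relevant_sentence_keys"])
utilized = set(labels["all_utilized_sentence_keys"])

R = len(relevant) / 20                        # 2/20 = 0.10
T = len(utilized) / len(relevant)             # 2/2  = 1.00
C = len(relevant & utilized) / len(relevant)  # 2/2  = 1.00
A = 1.0 if all(s["fully_supported"]
               for s in labels["sentence_support_information"]) else 0.0

average = (R + T + C + A) / 4                 # 0.775
```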
Why This Is Better Than Simple Metrics
| Aspect | Simple Keywords | GPT Labeling |
|---|---|---|
| Understanding | ❌ Keyword matching | ✅ Semantic understanding |
| Hallucination Detection | ❌ Can't detect | ✅ Flags unsupported claims |
| Paraphrasing | ❌ Misses rephrased info | ✅ Understands meaning |
| Explainability | ❌ "Just a number" | ✅ Shows exact support mapping |
| Domain Specificity | ⚠️ Needs tuning | ✅ Works across domains |
Key Files to Reference
| File | Purpose | Key Lines |
|---|---|---|
| advanced_rag_evaluator.py | Main evaluation engine | All calculations |
| advanced_rag_evaluator.py | Prompt template | Lines 305-350 |
| advanced_rag_evaluator.py | Get GPT response | Lines 470-552 |
| advanced_rag_evaluator.py | Calculate R metric | Lines 554-562 |
| advanced_rag_evaluator.py | Calculate T metric | Lines 564-576 |
| advanced_rag_evaluator.py | Calculate C metric | Lines 577-591 |
| advanced_rag_evaluator.py | Calculate A metric | Lines 593-609 |
| llm_client.py | Groq API calls | LLM integration |
Key Insights
- All metrics come from ONE GPT response: They're consistent and complementary
- Sentence keys enable traceability: Can show exactly which doc supported which claim
- Adherence is binary: Either fully supported (1.0) or not (0.0) - a single unsupported claim zeroes the score
- Relevance normalization: Divided by a fixed constant (20) so scores stay comparable regardless of document length
- LLM as Judge: Semantic understanding without any code-based rule engineering
Summary in One Sentence
GPT analyzes which document sentences support which response sentences, then metrics are calculated from this mapping to assess RAG quality.
Complete Documentation Available
- TRACE_METRICS_QUICK_REFERENCE.md - Quick lookup
- TRACE_METRICS_EXPLANATION.md - Detailed explanation
- TRACE_Metrics_Flow.png - Visual process flow
- Sentence_Mapping_Example.png - Sentence-level details
- RAG_Architecture_Diagram.png - System overview
- RAG_Data_Flow_Diagram.png - Complete pipeline
- RAG_Capstone_Project_Presentation.pptx - Full presentation
- DOCUMENTATION_INDEX.md - Navigation guide