| # GPT Labeling Evaluation (RAGBench Approach) | |
| ## Overview | |
| This implementation adds advanced RAG evaluation using sentence-level GPT labeling prompts, as described in the **RAGBench paper** (arXiv:2407.11005). Because an LLM judges the semantic relationships between documents, question, and response, this approach is typically more accurate than heuristic metrics based on word overlap. | |
| ## Key Concepts | |
| ### Sentence-Level Labeling | |
| Instead of computing metrics based on word overlap, the GPT labeling approach: | |
| 1. **Splits documents into sentences** with unique keys (e.g., `0a`, `0b`, `1a`, `1b`) | |
| 2. **Splits response into sentences** with unique keys (e.g., `a`, `b`, `c`) | |
| 3. **Calls an LLM** (GPT-4 in the RAGBench paper; a Groq-hosted Llama model also works) with a specialized prompt to label: | |
| - Which document sentences are relevant to the question | |
| - Which document sentences support each response sentence | |
| - Whether each response sentence is fully supported, partially supported, or unsupported | |
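The keying scheme in steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the project's `DocumentSentencizer`: the function names are hypothetical, the regex splitter is deliberately naive, and continuing past `z` with `aa`, `ab`, ... is an assumption.

```python
import re
from string import ascii_lowercase
from typing import List, Tuple

# Naive rule (assumption): a sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Return the non-empty sentences of `text`."""
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s.strip()]


def letter_key(index: int) -> str:
    """Map 0 -> 'a', 25 -> 'z', 26 -> 'aa', 27 -> 'ab', and so on."""
    key = ""
    index += 1
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        key = ascii_lowercase[remainder] + key
    return key


def key_documents(documents: List[str]) -> List[Tuple[str, str]]:
    """Key document sentences as <doc index><letter>: 0a, 0b, 1a, ..."""
    keyed = []
    for doc_index, doc in enumerate(documents):
        for sent_index, sentence in enumerate(split_sentences(doc)):
            keyed.append((f"{doc_index}{letter_key(sent_index)}", sentence))
    return keyed


def key_response(response: str) -> List[Tuple[str, str]]:
    """Key response sentences with letters only: a, b, c, ..."""
    return [(letter_key(i), s) for i, s in enumerate(split_sentences(response))]


docs = ["RAG retrieves documents. It then generates an answer.", "Llamas are mammals!"]
doc_sentences = key_documents(docs)
response_sentences = key_response("RAG retrieves first. Then it generates.")
```

A regex splitter like this breaks on abbreviations such as "e.g."; a dedicated sentence tokenizer would be a reasonable replacement if that matters for your documents.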
| ### Evaluation Metrics (From Labeled Data) | |
| The four TRACE metrics are computed from sentence-level labels: | |
| #### Context Relevance | |
| - **Definition**: Fraction of retrieved context relevant to the question | |
| - **Calculation**: Number of relevant document sentences / Total document sentences | |
| - **Semantic**: Does the context contain information needed to answer the question? | |
| #### Context Utilization | |
| - **Definition**: Fraction of the retrieved context that the response actually uses | |
| - **Calculation**: Number of utilized document sentences / Total document sentences | |
| - **Semantic**: How much of the retrieved context did the response draw on? | |
| #### Completeness | |
| - **Definition**: Fraction of relevant information covered in the response | |
| - **Calculation**: (Relevant ∩ Utilized) / Relevant | |
| - **Semantic**: Does the response comprehensively address the question using available context? | |
| #### Adherence | |
| - **Definition**: Fraction of response sentences that are grounded in the context (no hallucinations) | |
| - **Calculation**: Fully supported sentences / Total response sentences | |
| - **Semantic**: Is every claim in the response backed by the context documents? | |
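As a rough sketch of how the four scores could be derived from the sentence-level labels: the `SentenceLabels` container and `compute_metrics` function below are hypothetical names, not the project's API. Utilization is taken over all document sentences, following the RAGBench definition, which keeps it distinct from completeness. Empty denominators are mapped to 0.0, which is also an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class SentenceLabels:
    """Hypothetical container for the labels returned by the LLM."""
    all_document_keys: List[str]       # every document sentence key
    relevant_keys: Set[str]            # document sentences relevant to the question
    utilized_keys: Set[str]            # document sentences used by the response
    fully_supported: Dict[str, bool]   # response sentence key -> fully supported?


def safe_ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def compute_metrics(labels: SentenceLabels) -> Dict[str, float]:
    total = len(labels.all_document_keys)
    relevant = labels.relevant_keys
    utilized = labels.utilized_keys
    supported = sum(labels.fully_supported.values())
    return {
        "context_relevance": safe_ratio(len(relevant), total),
        "context_utilization": safe_ratio(len(utilized), total),
        "completeness": safe_ratio(len(relevant & utilized), len(relevant)),
        "adherence": safe_ratio(supported, len(labels.fully_supported)),
    }


labels = SentenceLabels(
    all_document_keys=["0a", "0b", "1a", "1b"],
    relevant_keys={"0a", "0b", "1a"},
    utilized_keys={"0a", "1b"},
    fully_supported={"a": True, "b": False},
)
metrics = compute_metrics(labels)
```

In this example three of four document sentences are relevant, two are utilized, only `0a` is both relevant and utilized, and one of two response sentences is fully supported.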
| ## Architecture | |
| ### Core Components | |
| ``` | |
| advanced_rag_evaluator.py | |
| ├── DocumentSentencizer | |
| │   ├── sentencize_documents() - Split docs into labeled sentences | |
| │   └── sentencize_response() - Split response into labeled sentences | |
| ├── GPTLabelingPromptGenerator | |
| │   └── generate_labeling_prompt() - Create prompt with sentence keys | |
| ├── GPTLabelingOutput | |
| │   └── Dataclass for LLM response | |
| └── AdvancedRAGEvaluator | |
|     ├── evaluate() - Single case evaluation | |
|     └── evaluate_batch() - Batch evaluation | |
| evaluation_pipeline.py | |
| └── UnifiedEvaluationPipeline | |
|     ├── evaluate() | |
|     └── evaluate_batch() | |
| ``` | |
| ### Data Flow | |
| ``` | |
| User Input | |
| ↓ | |
| Question, Response, Documents | |
| ↓ | |
| DocumentSentencizer | |
| ↓ | |
| Labeled Sentences (0a, 0b, 1a... and a, b, c...) | |
| ↓ | |
| GPTLabelingPromptGenerator | |
| ↓ | |
| Prompt with Full Sentence Text + Keys | |
| ↓ | |
| LLM (GPT-4 / Groq Llama) | |
| ↓ | |
| JSON with Labels: | |
| - relevance_explanation | |
| - all_relevant_sentence_keys: [0a, 0b, 1d, ...] | |
| - overall_supported_explanation | |
| - overall_supported: true/false | |
| - sentence_support_information: [{response_sentence_key: "a", fully_supported: true, ...}, ...] | |
| - all_utilized_sentence_keys: [0a, 1b, 1d, ...] | |
| ↓ | |
| Metric Computation | |
| ↓ | |
| Scores: Context Relevance, Utilization, Completeness, Adherence | |
| ``` | |
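The step between the LLM's JSON and metric computation usually needs some validation, because models occasionally omit fields or cite sentence keys that do not exist. The sketch below is one way to do that; `ParsedLabels` and `parse_labels` are hypothetical names rather than the project's `GPTLabelingOutput`, and treating the explanation fields as optional is an assumption.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set

# Fields the metric computation depends on; explanation fields are treated as optional.
REQUIRED_FIELDS = (
    "all_relevant_sentence_keys",
    "overall_supported",
    "sentence_support_information",
    "all_utilized_sentence_keys",
)


@dataclass
class ParsedLabels:
    relevant_keys: List[str]
    utilized_keys: List[str]
    overall_supported: bool
    sentence_support: List[Dict[str, Any]] = field(default_factory=list)


def parse_labels(raw: str, valid_doc_keys: Set[str]) -> ParsedLabels:
    """Parse the LLM's JSON reply and drop document keys that were never issued."""
    data = json.loads(raw)
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    relevant = [k for k in data["all_relevant_sentence_keys"] if k in valid_doc_keys]
    utilized = [k for k in data["all_utilized_sentence_keys"] if k in valid_doc_keys]
    return ParsedLabels(
        relevant_keys=relevant,
        utilized_keys=utilized,
        overall_supported=bool(data["overall_supported"]),
        sentence_support=list(data["sentence_support_information"]),
    )


raw_reply = json.dumps({
    "relevance_explanation": "Document 0 defines RAG.",
    "all_relevant_sentence_keys": ["0a", "9z"],  # "9z" was never issued
    "overall_supported": True,
    "sentence_support_information": [
        {"response_sentence_key": "a", "supporting_sentence_keys": ["0a"], "fully_supported": True}
    ],
    "all_utilized_sentence_keys": ["0a"],
})
parsed = parse_labels(raw_reply, valid_doc_keys={"0a", "0b"})
```

Filtering unknown keys silently is a design choice; logging or rejecting such replies may be preferable when you want to measure how often the model invents keys.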
| ## GPT Labeling Prompt Template | |
| The prompt is carefully designed to make GPT understand: | |
| 1. **Document Structure**: Documents split into sentences with keys (0a, 0b, etc.) | |
| 2. **Response Structure**: Response split into sentences with keys (a, b, c, etc.) | |
| 3. **Task**: Assess support for each response sentence | |
| 4. **Output**: Structured JSON with 6 required fields | |
| ### Prompt Fields | |
| ``` | |
| LABELING_PROMPT_TEMPLATE = """ | |
| I asked someone to answer a question based on one or more documents. | |
| Your task is to review their response and assess whether or not each sentence | |
| in that response is supported by text in the documents... | |
| [Documents with sentence keys 0a, 0b, 1a, 1b...] | |
| [Question] | |
| [Response with sentence keys a, b, c...] | |
| Return JSON with: | |
| - relevance_explanation: Which docs are relevant | |
| - all_relevant_sentence_keys: [0a, 0b, ...] - All relevant doc sentences | |
| - overall_supported_explanation: Is response fully supported | |
| - overall_supported: true/false | |
| - sentence_support_information: [{response_sentence_key, explanation, supporting_sentence_keys, fully_supported}, ...] | |
| - all_utilized_sentence_keys: [0a, 1b, ...] - Document sentences used in response | |
| """ | |
| ``` | |
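To show how keyed sentences might be slotted into the bracketed placeholders of the template, here is a minimal sketch. The `key. sentence` line format, the section headings, and the function names are assumptions for illustration; the actual `generate_labeling_prompt()` may lay the prompt out differently.

```python
from typing import List, Tuple

KeyedSentence = Tuple[str, str]  # (key, sentence text)


def format_keyed(sentences: List[KeyedSentence]) -> str:
    """Render one sentence per line as '<key>. <text>'."""
    return "\n".join(f"{key}. {text}" for key, text in sentences)


def build_labeling_prompt(
    instructions: str,
    question: str,
    doc_sentences: List[KeyedSentence],
    response_sentences: List[KeyedSentence],
) -> str:
    """Assemble instructions, documents, question, and response into one prompt."""
    sections = [
        instructions.strip(),
        "Documents:\n" + format_keyed(doc_sentences),
        "Question:\n" + question.strip(),
        "Response:\n" + format_keyed(response_sentences),
    ]
    return "\n\n".join(sections)


prompt = build_labeling_prompt(
    instructions="Assess whether each response sentence is supported by the documents.",
    question="What does RAG do?",
    doc_sentences=[("0a", "RAG retrieves documents."), ("0b", "It then generates an answer.")],
    response_sentences=[("a", "RAG retrieves first."), ("b", "Then it generates.")],
)
```

Printing the sentence text next to each key lets the model quote keys back without having to count sentences itself.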
| ## Usage Examples | |
| ### Basic Usage with TRACE (Heuristic) | |
| ```python | |
| from trace_evaluator import TRACEEvaluator | |
| evaluator = TRACEEvaluator( | |
|     llm_client=None,  # Not needed for TRACE | |
|     chunking_strategy="dense", | |
|     embedding_model="sentence-transformers/all-mpnet-base-v2", | |
|     chunk_size=512, | |
|     chunk_overlap=50 | |
| ) | |
| scores = evaluator.evaluate( | |
|     question="What is machine learning?", | |
|     response="Machine learning is a subset of AI...", | |
|     retrieved_documents=["Doc 1 text...", "Doc 2 text..."], | |
|     ground_truth="Optional ground truth" | |
| ) | |
| print(f"Utilization: {scores.utilization}") | |
| print(f"Relevance: {scores.relevance}") | |
| print(f"Adherence: {scores.adherence}") | |
| print(f"Completeness: {scores.completeness}") | |
| print(f"Average: {scores.average()}") | |
| ``` | |
| ### Advanced Usage with GPT Labeling | |
| ```python | |
| from advanced_rag_evaluator import AdvancedRAGEvaluator | |
| evaluator = AdvancedRAGEvaluator( | |
|     llm_client=groq_llm_client,  # Required for GPT labeling | |
|     chunking_strategy="dense", | |
|     embedding_model="sentence-transformers/all-mpnet-base-v2", | |
|     chunk_size=512, | |
|     chunk_overlap=50 | |
| ) | |
| scores = evaluator.evaluate( | |
|     question="What is machine learning?", | |
|     response="Machine learning is a subset of AI...", | |
|     retrieved_documents=["Doc 1 text...", "Doc 2 text..."] | |
| ) | |
| print(f"Context Relevance: {scores.context_relevance}") | |
| print(f"Context Utilization: {scores.context_utilization}") | |
| print(f"Completeness: {scores.completeness}") | |
| print(f"Adherence: {scores.adherence}") | |
| print(f"Overall Supported: {scores.overall_supported}") | |
| print(f"Fully Supported Sentences: {scores.num_fully_supported_sentences}") | |
| ``` | |
| ### Unified Pipeline (TRACE + GPT) | |
| ```python | |
| from evaluation_pipeline import UnifiedEvaluationPipeline | |
| pipeline = UnifiedEvaluationPipeline( | |
|     llm_client=groq_llm_client, | |
|     chunking_strategy="dense" | |
| ) | |
| # Single evaluation with TRACE | |
| result = pipeline.evaluate( | |
|     question="What is RAG?", | |
|     response="RAG stands for...", | |
|     retrieved_documents=["Doc text..."], | |
|     method="trace" | |
| ) | |
| # Single evaluation with GPT labeling | |
| result = pipeline.evaluate( | |
|     question="What is RAG?", | |
|     response="RAG stands for...", | |
|     retrieved_documents=["Doc text..."], | |
|     method="gpt_labeling" | |
| ) | |
| # Hybrid evaluation (both methods) | |
| result = pipeline.evaluate( | |
|     question="What is RAG?", | |
|     response="RAG stands for...", | |
|     retrieved_documents=["Doc text..."], | |
|     method="hybrid" | |
| ) | |
| # Batch evaluation | |
| results = pipeline.evaluate_batch( | |
|     test_cases=[ | |
|         { | |
|             "query": "Question 1", | |
|             "response": "Response 1", | |
|             "retrieved_documents": ["Doc 1", "Doc 2"], | |
|             "ground_truth": "Ground truth 1" | |
|         }, | |
|         # ... more test cases | |
|     ], | |
|     method="gpt_labeling" | |
| ) | |
| ``` | |
| ## Integration with Streamlit UI | |
| ### Adding Evaluation Method Selection | |
| ```python | |
| import streamlit as st | |
| from evaluation_pipeline import UnifiedEvaluationPipeline | |
| def evaluation_interface(): | |
|     # Assumes `collection_metadata` and `prepared_test_cases` are defined elsewhere in the app. | |
|     st.header("RAG Evaluation") | |
|     # Method selection | |
|     eval_methods = UnifiedEvaluationPipeline.get_evaluation_methods() | |
|     method_names = [m["name"] for m in eval_methods] | |
|     method_ids = [m["id"] for m in eval_methods] | |
|     selected_method = st.radio( | |
|         "Evaluation Method", | |
|         options=method_names, | |
|         index=0, | |
|         help="TRACE is fast (no LLM). GPT Labeling is accurate but slower." | |
|     ) | |
|     method_id = method_ids[method_names.index(selected_method)] | |
|     # Build the pipeline with the settings of the current collection | |
|     pipeline = UnifiedEvaluationPipeline( | |
|         llm_client=st.session_state.rag_pipeline.llm, | |
|         chunking_strategy=collection_metadata.get("chunking_strategy"), | |
|         embedding_model=collection_metadata.get("embedding_model"), | |
|         chunk_size=collection_metadata.get("chunk_size"), | |
|         chunk_overlap=collection_metadata.get("chunk_overlap") | |
|     ) | |
|     # Run evaluation only when the button is clicked | |
|     if st.button("Run Evaluation", key="eval_button"): | |
|         results = pipeline.evaluate_batch( | |
|             test_cases=prepared_test_cases, | |
|             method=method_id | |
|         ) | |
|         st.json(results) | |
| ``` | |
| ## Performance Considerations | |
| ### TRACE Method (Rule-Based) | |
| - **Speed**: ~100ms per evaluation (no LLM calls) | |
| - **Accuracy**: Good for obvious cases, misses semantic nuances | |
| - **Cost**: Free (no API calls) | |
| - **Scalability**: Can evaluate thousands of samples quickly | |
| ### GPT Labeling Method | |
| - **Speed**: ~2-5 seconds per evaluation (LLM call required) | |
| - **Accuracy**: Excellent, understands semantic relationships | |
| - **Cost**: $0.002-0.01 per evaluation (depends on document length) | |
| - **Rate Limit**: Limited by Groq API (30 RPM = 1 evaluation every 2 seconds) | |
| - **Scalability**: Limited by API rate limits | |
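A 30 RPM limit works out to a minimum gap of 60 / 30 = 2 seconds between calls. One simple way to respect it in a batch loop is a minimum-interval throttle, sketched below. `MinIntervalThrottle` and `throttled_map` are hypothetical helpers, not part of the project or of the Groq client; the clock and sleep functions are injectable only so the logic can be tested without waiting.

```python
import time
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class MinIntervalThrottle:
    """Enforce a minimum gap between consecutive calls (60 / RPM seconds)."""

    def __init__(
        self,
        requests_per_minute: int,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 60.0 / requests_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> float:
        """Block until the next call is allowed and return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self.min_interval - (now - self._last_call))
            if delay > 0:
                self._sleep(delay)
        self._last_call = now + delay
        return delay


def throttled_map(fn: Callable[[T], R], items: Iterable[T], throttle: MinIntervalThrottle) -> List[R]:
    """Apply `fn` to each item, waiting on the throttle before every call."""
    results = []
    for item in items:
        throttle.wait()
        results.append(fn(item))
    return results
```

A call such as `throttled_map(evaluate_one, test_cases, MinIntervalThrottle(30))`, where `evaluate_one` is your own per-case function, would then space requests at least two seconds apart. It does not handle retries after a rate-limit error, which would need separate backoff logic.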
| ### Recommendations | |
| - Use **TRACE** for quick prototyping and large-scale evaluation | |
| - Use **GPT Labeling** for accurate evaluation on smaller subsets | |
| - Use **Hybrid** to cross-check heuristic scores against LLM labels; it runs both methods, so it is as slow and costly as GPT Labeling | |
| ## JSON Output Format | |
| ### TRACE Results | |
| ```json | |
| { | |
|   "context_relevance": 0.85, | |
|   "context_utilization": 0.72, | |
|   "completeness": 0.78, | |
|   "adherence": 0.90, | |
|   "average": 0.81, | |
|   "num_samples": 10, | |
|   "detailed_results": [ | |
|     { | |
|       "query_id": 1, | |
|       "question": "What is RAG?", | |
|       "llm_response": "RAG stands for...", | |
|       "retrieved_documents": ["Doc 1", "Doc 2"], | |
|       "ground_truth": "Expected answer", | |
|       "metrics": {...} | |
|     } | |
|   ] | |
| } | |
| ``` | |
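The top-level numbers in a batch result can be thought of as a summary of the per-sample metrics. The sketch below assumes that each top-level metric is the mean over samples and that `average` is the mean of the four metrics (which matches the example above: the four values average to 0.8125, shown as 0.81). The `summarize` function is hypothetical, not the pipeline's actual aggregation code.

```python
from statistics import mean
from typing import Dict, List, Union

METRIC_NAMES = ("context_relevance", "context_utilization", "completeness", "adherence")


def summarize(per_sample: List[Dict[str, float]]) -> Dict[str, Union[float, int]]:
    """Average each metric over samples, then average the four metric means."""
    if not per_sample:
        raise ValueError("need at least one sample to summarize")
    summary: Dict[str, Union[float, int]] = {
        name: round(mean(sample[name] for sample in per_sample), 2)
        for name in METRIC_NAMES
    }
    summary["average"] = round(mean(summary[name] for name in METRIC_NAMES), 2)
    summary["num_samples"] = len(per_sample)
    return summary


samples = [
    {"context_relevance": 0.9, "context_utilization": 0.5, "completeness": 0.6, "adherence": 1.0},
    {"context_relevance": 0.7, "context_utilization": 0.7, "completeness": 0.8, "adherence": 0.8},
]
summary = summarize(samples)
```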
| ### GPT Labeling Results | |
| ```json | |
| { | |
|   "context_relevance": 0.88, | |
|   "context_utilization": 0.75, | |
|   "completeness": 0.82, | |
|   "adherence": 0.75, | |
|   "average": 0.80, | |
|   "overall_supported": false, | |
|   "fully_supported_sentences": 3, | |
|   "partially_supported_sentences": 1, | |
|   "unsupported_sentences": 0, | |
|   "detailed_results": [ | |
|     { | |
|       "query_id": 1, | |
|       "question": "What is RAG?", | |
|       "llm_response": "RAG stands for...", | |
|       "retrieved_documents": ["Doc 1", "Doc 2"], | |
|       "metrics": { | |
|         "context_relevance": 0.88, | |
|         "context_utilization": 0.75, | |
|         "completeness": 0.82, | |
|         "adherence": 0.75, | |
|         "overall_supported": false, | |
|         "fully_supported_sentences": 3, | |
|         "partially_supported_sentences": 1, | |
|         "unsupported_sentences": 0 | |
|       } | |
|     } | |
|   ] | |
| } | |
| ``` | |
| ## References | |
| - **RAGBench Paper**: "RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems" (arXiv:2407.11005) | |
| - **TRACE Metrics**: uTilization, Relevance, Adherence, and Completeness, the evaluation framework introduced in the RAGBench paper | |
| - **Sentence-Level Grounding**: LLM-based assessment of semantic support | |
| ## Common Issues and Solutions | |
| ### Issue: LLM Refuses to Output JSON | |
| **Solution**: Add `response_format={"type": "json_object"}` to Groq API calls | |
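Even with JSON mode requested, it can help to have a defensive fallback for replies that wrap the object in prose or a Markdown code fence. The sketch below is one such fallback under simple assumptions: the reply contains exactly one top-level JSON object, and everything from the first `{` to the last `}` belongs to it. `extract_json_object` is a hypothetical helper, not part of the Groq client.

```python
import json
from typing import Any, Dict


def extract_json_object(text: str) -> Dict[str, Any]:
    """Parse the span from the first '{' to the last '}' in a model reply.

    Raises ValueError if no braces are found; json.JSONDecodeError (a subclass
    of ValueError) propagates if the span is not valid JSON.
    """
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start:end + 1])


raw_reply = (
    "Sure, here are the labels:\n"
    '{"overall_supported": true, "all_utilized_sentence_keys": ["0a"]}\n'
    "Hope this helps."
)
labels = extract_json_object(raw_reply)
```

Because both failure modes surface as `ValueError`, a caller can catch a single exception type and retry the LLM call.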
| ### Issue: Long Documents Cause Token Limits | |
| **Solution**: Use a smaller `chunk_size` (256-512) or summarize documents first | |
| ### Issue: Inconsistent Sentence Keys | |
| **Solution**: Use consistent delimiters (`.!?`) for sentence splitting | |
| ### Issue: Metric Values All 0.0 | |
| **Solution**: Check that the LLM client is properly initialized; TRACE metrics should work without an LLM | |
| ## Future Enhancements | |
| 1. **Multi-LLM Labeling**: Average labels from multiple LLMs for robustness | |
| 2. **Sentence Clustering**: Group semantically similar sentences for efficiency | |
| 3. **Selective Labeling**: Only label uncertain cases after initial heuristic pass | |
| 4. **Caching**: Store labels for identical question-document pairs | |
| 5. **Custom Metrics**: User-defined evaluation criteria through prompt customization | |