Spaces:
Sleeping
Sleeping
| # GPT Labeling Integration - Implementation Guide | |
| ## Overview | |
| The RAG Capstone Project now includes **three evaluation methods**: | |
| 1. **TRACE Heuristics** - Fast, rule-based metrics (no LLM calls) | |
| 2. **GPT Labeling** - Accurate, LLM-based sentence-level grounding (RAGBench paper) | |
| 3. **Hybrid** - Combines both approaches for comprehensive analysis | |
| ## New Files Created | |
| ### Core Implementation Files | |
| 1. **`advanced_rag_evaluator.py`** (380 lines) | |
| - `DocumentSentencizer` - Splits documents and responses into labeled sentences | |
| - `GPTLabelingPromptGenerator` - Creates GPT labeling prompts | |
| - `GPTLabelingOutput` - Dataclass for structured LLM response | |
| - `AdvancedTRACEScores` - Enhanced scores with GPT labeling metrics | |
| - `AdvancedRAGEvaluator` - Main evaluator using GPT labeling approach | |
| 2. **`evaluation_pipeline.py`** (175 lines) | |
| - `UnifiedEvaluationPipeline` - Facade for TRACE + GPT Labeling | |
| - Supports single evaluation or batch processing | |
| - Provides method information and descriptions | |
| 3. **`docs/GPT_LABELING_EVALUATION.md`** (Comprehensive guide) | |
| - Conceptual overview of sentence-level labeling | |
| - Architecture and data flow diagrams | |
| - Usage examples for all three methods | |
| - Performance considerations and recommendations | |
| - JSON output formats | |
| ### Modified Files | |
| 1. **`streamlit_app.py`** | |
| - Updated `evaluation_interface()` to support method selection | |
| - Updated `run_evaluation()` to handle three methods | |
| - Added method descriptions and warnings | |
| - Enhanced logging for each method | |
| 2. **`trace_evaluator.py`** | |
| - Added documentation about GPT labeling integration | |
| - No functional changes (backward compatible) | |
| ## Key Components Explained | |
| ### 1. Sentence Sentencization | |
| **Document Sentences**: Labeled with keys like `0a`, `0b`, `1a`, `1b` | |
| ``` | |
| 0a. This is the first sentence. | |
| 0b. This is the second sentence. | |
| 1a. Another document's first sentence. | |
| 1b. And the second sentence. | |
| ``` | |
| **Response Sentences**: Labeled with keys like `a`, `b`, `c` | |
| ``` | |
| a. The response starts here. | |
| b. It contains multiple sentences. | |
| c. Each one gets a unique key. | |
| ``` | |
| ### 2. GPT Labeling Process | |
| The GPT labeling prompt asks the LLM to: | |
| 1. Identify which document sentences are relevant to the question | |
| 2. For each response sentence, identify supporting document sentences | |
| 3. Determine if each response sentence is fully/partially/unsupported | |
| 4. Return structured JSON with 5 evaluation fields | |
| ### 3. Metric Computation | |
| From GPT-labeled data: | |
| - **Context Relevance**: Fraction of relevant document sentences (0-1) | |
| - **Context Utilization**: Fraction of relevant sentences used (0-1) | |
| - **Completeness**: Overlap between relevant and utilized (0-1) | |
| - **Adherence**: Fraction of response sentences with full support (0-1) | |
| ## Usage Examples | |
| ### In Streamlit UI | |
| 1. **Select Evaluation Method** | |
| ``` | |
| [Radio button: TRACE / GPT Labeling / Hybrid] | |
| ``` | |
| 2. **Choose LLM and Samples** | |
| ``` | |
| LLM: [Dropdown: llama-3.1-8b-instant, etc.] | |
| Samples: [Slider: 5-100] | |
| Button: "Run Evaluation" | |
| ``` | |
| 3. **View Results** | |
| - Aggregate scores in metric cards | |
| - Per-query detailed results | |
| - JSON download | |
| ### Programmatically (Python) | |
| ```python | |
| from evaluation_pipeline import UnifiedEvaluationPipeline | |
| # Create pipeline | |
| pipeline = UnifiedEvaluationPipeline( | |
| llm_client=my_llm_client, | |
| chunking_strategy="dense", | |
| embedding_model="all-mpnet-base-v2" | |
| ) | |
| # Single evaluation | |
| result = pipeline.evaluate( | |
| question="What is RAG?", | |
| response="RAG stands for...", | |
| retrieved_documents=["Doc 1", "Doc 2"], | |
| method="gpt_labeling" | |
| ) | |
| # Batch evaluation | |
| results = pipeline.evaluate_batch( | |
| test_cases=[ | |
| { | |
| "query": "Question 1", | |
| "response": "Response 1", | |
| "retrieved_documents": ["Doc 1", "Doc 2"], | |
| "ground_truth": "Expected answer" | |
| }, | |
| # ... more cases | |
| ], | |
| method="hybrid" # "trace", "gpt_labeling", or "hybrid" | |
| ) | |
| print(f"Results: {results}") | |
| ``` | |
| ## Performance Characteristics | |
| ### TRACE Method | |
| - **Time per evaluation**: ~100ms | |
| - **Total time for 10 samples**: ~1 second | |
| - **Total time for 100 samples**: ~10 seconds | |
| - **Cost**: Free (no API calls) | |
| - **Accuracy**: Good for obvious cases | |
| ### GPT Labeling Method | |
| - **Time per evaluation**: 2-5 seconds (due to API + rate limiting) | |
| - **Total time for 10 samples**: 20-50 seconds | |
| - **Total time for 100 samples**: 3-8 minutes | |
| - **Cost**: ~$0.002-0.01 per evaluation ($0.02-0.10 per 10 samples) | |
| - **Accuracy**: Excellent, semantic understanding | |
| - **Limitation**: 30 RPM Groq rate limit | |
| ### Hybrid Method | |
| - **Time per evaluation**: 2-5 seconds | |
| - **Cost**: Same as GPT Labeling | |
| - **Benefit**: Get both fast and accurate metrics | |
| ## Important Considerations | |
| ### Rate Limiting | |
| The Groq API has a **30 RPM (requests per minute)** limit: | |
| - Each evaluation = 1 request | |
| - Wait 2 seconds between requests | |
| - For 10 evaluations: ~20-40 seconds | |
| - For 50 evaluations: ~100-200 seconds (2-3 minutes) | |
| - For 100 evaluations: ~200-400 seconds (3-7 minutes) | |
| ### When to Use Each Method | |
| | Scenario | Recommended Method | | |
| |----------|-------------------| | |
| | Quick prototyping | TRACE | | |
| | Small high-quality subset (< 20 samples) | GPT Labeling | | |
| | Large-scale evaluation (100+ samples) | TRACE | | |
| | Need both speed and accuracy | Hybrid on small subset | | |
| | Production evaluation | TRACE + spot-check with GPT | | |
| ### Token Cost Estimation | |
| For Groq's Llama model (~$0.05 per 1M input tokens): | |
| - Average prompt: ~2KB = ~500 tokens input + ~200 output = ~700 tokens | |
| - Cost per evaluation: 700 / 1M * $0.05 = $0.000035 | |
| - For 100 evaluations: ~$0.0035 (very cheap!) | |
| **Note**: Exact costs depend on document length and model choice. | |
| ## Troubleshooting | |
| ### Issue: "evaluation_pipeline module not found" | |
| **Solution**: Ensure `evaluation_pipeline.py` is in the project root directory | |
| ### Issue: GPT Labeling always returns 0.0 scores | |
| **Solution**: Check that LLM client is properly initialized and returning valid JSON | |
| ### Issue: Rate limit exceeded | |
| **Solution**: The code handles this with exponential backoff. Reduce number of samples. | |
| ### Issue: LLM returns non-JSON response | |
| **Solution**: Use `temperature=0.0` in LLM calls for deterministic output | |
| ## Integration Checklist | |
| - [x] Created `advanced_rag_evaluator.py` with GPT labeling implementation | |
| - [x] Created `evaluation_pipeline.py` with unified interface | |
| - [x] Updated `streamlit_app.py` to support method selection | |
| - [x] Added comprehensive documentation in `docs/GPT_LABELING_EVALUATION.md` | |
| - [x] Tested module imports and basic functionality | |
| - [x] Verified syntax in all files | |
| - [x] Backward compatible with existing TRACE evaluation | |
| - [x] Handles LLM client gracefully (fallback to TRACE if unavailable) | |
| ## Next Steps (Optional Enhancements) | |
| 1. **Caching**: Store evaluation results for identical Q-D-R triplets | |
| 2. **Batch Processing**: Evaluate multiple samples in parallel | |
| 3. **Custom Prompts**: Allow users to customize GPT labeling prompts | |
| 4. **Multi-LLM**: Average labels from multiple LLMs for robustness | |
| 5. **Sampling Strategy**: Smart sampling for large datasets | |
| 6. **Visualization**: Charts comparing TRACE vs GPT Labeling results | |
| ## API Reference | |
| ### UnifiedEvaluationPipeline | |
| ```python | |
| class UnifiedEvaluationPipeline: | |
| def __init__(llm_client, chunking_strategy, embedding_model, chunk_size, chunk_overlap) | |
| def evaluate(question, response, retrieved_documents, ground_truth=None, | |
| method="trace") -> Dict | |
| def evaluate_batch(test_cases, method="trace") -> Dict | |
| @staticmethod | |
| def get_evaluation_methods() -> List[Dict] | |
| ``` | |
| ### AdvancedRAGEvaluator | |
| ```python | |
| class AdvancedRAGEvaluator: | |
| def __init__(llm_client, chunking_strategy, embedding_model, chunk_size, chunk_overlap) | |
| def evaluate(question, response, retrieved_documents, ground_truth=None) -> AdvancedTRACEScores | |
| def evaluate_batch(test_cases) -> Dict | |
| ``` | |
| ### DocumentSentencizer | |
| ```python | |
| class DocumentSentencizer: | |
| @staticmethod | |
| def sentencize_documents(documents: List[str]) -> Tuple[List[Dict], str] | |
| @staticmethod | |
| def sentencize_response(response: str) -> Tuple[List[Dict], str] | |
| ``` | |
| ## File Summary | |
| | File | Lines | Purpose | Status | | |
| |------|-------|---------|--------| | |
| | `advanced_rag_evaluator.py` | 380 | GPT labeling evaluator | NEW | | |
| | `evaluation_pipeline.py` | 175 | Unified evaluation interface | NEW | | |
| | `streamlit_app.py` | 927 | Updated UI with method selection | MODIFIED | | |
| | `trace_evaluator.py` | 438 | Original TRACE metrics (unchanged) | UPDATED DOCS | | |
| | `docs/GPT_LABELING_EVALUATION.md` | 500+ | Comprehensive guide | NEW | | |
| ## Total Impact | |
| - **New Code**: ~550 lines (2 new modules) | |
| - **Modified Code**: ~50 lines in streamlit_app.py + documentation | |
| - **Backward Compatible**: Yes, existing TRACE evaluation still works | |
| - **Breaking Changes**: None | |
| - **New Dependencies**: None (all already installed) | |
| ## Verification Commands | |
| ```bash | |
| # Check Python syntax | |
| python -m py_compile advanced_rag_evaluator.py evaluation_pipeline.py | |
| # Run imports test | |
| python -c "from advanced_rag_evaluator import AdvancedRAGEvaluator; from evaluation_pipeline import UnifiedEvaluationPipeline; print('OK')" | |
| # Start Streamlit with new features | |
| streamlit run streamlit_app.py | |
| ``` | |
| ## Support | |
| For issues with GPT labeling: | |
| 1. Check that LLM client is initialized (`st.session_state.rag_pipeline.llm`) | |
| 2. Verify Groq API key is valid | |
| 3. Ensure rate limiting (30 RPM) is respected | |
| 4. Check LLM response is valid JSON | |
| 5. Review `docs/GPT_LABELING_EVALUATION.md` for detailed guidance | |