# GPT Labeling Integration - Implementation Guide
## Overview
The RAG Capstone Project now includes **three evaluation methods**:
1. **TRACE Heuristics** - Fast, rule-based metrics (no LLM calls)
2. **GPT Labeling** - Accurate, LLM-based sentence-level grounding (RAGBench paper)
3. **Hybrid** - Combines both approaches for comprehensive analysis
## New Files Created
### Core Implementation Files
1. **`advanced_rag_evaluator.py`** (380 lines)
- `DocumentSentencizer` - Splits documents and responses into labeled sentences
- `GPTLabelingPromptGenerator` - Creates GPT labeling prompts
- `GPTLabelingOutput` - Dataclass for structured LLM response
- `AdvancedTRACEScores` - Enhanced scores with GPT labeling metrics
- `AdvancedRAGEvaluator` - Main evaluator using GPT labeling approach
2. **`evaluation_pipeline.py`** (175 lines)
- `UnifiedEvaluationPipeline` - Facade for TRACE + GPT Labeling
- Supports single evaluation or batch processing
- Provides method information and descriptions
3. **`docs/GPT_LABELING_EVALUATION.md`** (Comprehensive guide)
- Conceptual overview of sentence-level labeling
- Architecture and data flow diagrams
- Usage examples for all three methods
- Performance considerations and recommendations
- JSON output formats
### Modified Files
1. **`streamlit_app.py`**
- Updated `evaluation_interface()` to support method selection
- Updated `run_evaluation()` to handle three methods
- Added method descriptions and warnings
- Enhanced logging for each method
2. **`trace_evaluator.py`**
- Added documentation about GPT labeling integration
- No functional changes (backward compatible)
## Key Components Explained
### 1. Sentencization
**Document Sentences**: Labeled with keys like `0a`, `0b`, `1a`, `1b`
```
0a. This is the first sentence.
0b. This is the second sentence.
1a. Another document's first sentence.
1b. And the second sentence.
```
**Response Sentences**: Labeled with keys like `a`, `b`, `c`
```
a. The response starts here.
b. It contains multiple sentences.
c. Each one gets a unique key.
```
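The labeling scheme above can be sketched as follows. This is a minimal illustration of the key format, not the shipped `DocumentSentencizer` (which may use a proper sentence splitter instead of a regex); the function names here are hypothetical.

```python
import re
from string import ascii_lowercase

def label_sentences(text: str, prefix: str = "") -> dict:
    """Split text on naive sentence boundaries and assign letter keys,
    optionally prefixed with a document index (e.g. '0a', '0b')."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return {f"{prefix}{ascii_lowercase[i]}": s for i, s in enumerate(sentences)}

def label_documents(documents: list) -> dict:
    """Label every sentence of every document: 0a, 0b, 1a, 1b, ..."""
    labeled = {}
    for doc_idx, doc in enumerate(documents):
        labeled.update(label_sentences(doc, prefix=str(doc_idx)))
    return labeled
```

Response sentences use the same helper with no prefix, yielding keys `a`, `b`, `c`.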
### 2. GPT Labeling Process
The GPT labeling prompt asks the LLM to:
1. Identify which document sentences are relevant to the question
2. For each response sentence, identify supporting document sentences
3. Determine if each response sentence is fully/partially/unsupported
4. Return structured JSON with 5 evaluation fields
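To make the five-field output concrete, here is an illustrative shape for the labeled JSON, written as a Python dict. The exact field names in `GPTLabelingOutput` may differ; this follows the RAGBench-style scheme and should be treated as an assumption.

```python
# Illustrative labeled output for a 2-sentence response; field names assumed.
example_labels = {
    "relevance_explanation": "0a and 1a directly address the question.",
    "all_relevant_sentence_keys": ["0a", "0b", "1a"],   # relevant document sentences
    "all_utilized_sentence_keys": ["0a", "1a"],          # sentences actually used
    "sentence_support_information": [                    # one entry per response sentence
        {"response_sentence_key": "a",
         "supporting_sentence_keys": ["0a"],
         "fully_supported": True},
        {"response_sentence_key": "b",
         "supporting_sentence_keys": [],
         "fully_supported": False},
    ],
    "overall_supported": False,
}
```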
### 3. Metric Computation
From GPT-labeled data:
- **Context Relevance**: Fraction of relevant document sentences (0-1)
- **Context Utilization**: Fraction of relevant sentences used (0-1)
- **Completeness**: Overlap between relevant and utilized (0-1)
- **Adherence**: Fraction of response sentences with full support (0-1)
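The four scores above can be derived from the labels with simple set arithmetic. This sketch follows the definitions as stated in this guide (interpreting "overlap" as a Jaccard ratio); the shipped `AdvancedRAGEvaluator` may use slightly different formulas.

```python
def compute_metrics(n_doc_sentences, relevant_keys, utilized_keys, support_info):
    """Derive the four GPT-labeling scores from sentence-level labels.
    Returns 0.0 for any metric whose denominator is empty."""
    relevant, utilized = set(relevant_keys), set(utilized_keys)
    used_relevant = relevant & utilized
    return {
        # fraction of document sentences marked relevant
        "context_relevance": len(relevant) / n_doc_sentences if n_doc_sentences else 0.0,
        # fraction of relevant sentences actually used
        "context_utilization": len(used_relevant) / len(relevant) if relevant else 0.0,
        # overlap (Jaccard) between relevant and utilized sets
        "completeness": (len(used_relevant) / len(relevant | utilized)
                         if relevant | utilized else 0.0),
        # fraction of response sentences with full support
        "adherence": (sum(1 for s in support_info if s["fully_supported"]) / len(support_info)
                      if support_info else 0.0),
    }
```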
## Usage Examples
### In Streamlit UI
1. **Select Evaluation Method**
```
[Radio button: TRACE / GPT Labeling / Hybrid]
```
2. **Choose LLM and Samples**
```
LLM: [Dropdown: llama-3.1-8b-instant, etc.]
Samples: [Slider: 5-100]
Button: "Run Evaluation"
```
3. **View Results**
- Aggregate scores in metric cards
- Per-query detailed results
- JSON download
### Programmatically (Python)
```python
from evaluation_pipeline import UnifiedEvaluationPipeline
# Create pipeline
pipeline = UnifiedEvaluationPipeline(
llm_client=my_llm_client,
chunking_strategy="dense",
embedding_model="all-mpnet-base-v2"
)
# Single evaluation
result = pipeline.evaluate(
question="What is RAG?",
response="RAG stands for...",
retrieved_documents=["Doc 1", "Doc 2"],
method="gpt_labeling"
)
# Batch evaluation
results = pipeline.evaluate_batch(
test_cases=[
{
"query": "Question 1",
"response": "Response 1",
"retrieved_documents": ["Doc 1", "Doc 2"],
"ground_truth": "Expected answer"
},
# ... more cases
],
method="hybrid" # "trace", "gpt_labeling", or "hybrid"
)
print(f"Results: {results}")
```
## Performance Characteristics
### TRACE Method
- **Time per evaluation**: ~100ms
- **Total time for 10 samples**: ~1 second
- **Total time for 100 samples**: ~10 seconds
- **Cost**: Free (no API calls)
- **Accuracy**: Good for obvious cases
### GPT Labeling Method
- **Time per evaluation**: 2-5 seconds (due to API + rate limiting)
- **Total time for 10 samples**: 20-50 seconds
- **Total time for 100 samples**: 3-8 minutes
- **Cost**: ~$0.002-0.01 per evaluation depending on model pricing and prompt length ($0.02-0.10 per 10 samples; see Token Cost Estimation below for a Groq-specific figure)
- **Accuracy**: Excellent, semantic understanding
- **Limitation**: 30 RPM Groq rate limit
### Hybrid Method
- **Time per evaluation**: 2-5 seconds
- **Cost**: Same as GPT Labeling
- **Benefit**: Get both fast and accurate metrics
## Important Considerations
### Rate Limiting
The Groq API has a **30 RPM (requests per minute)** limit:
- Each evaluation = 1 request
- Wait 2 seconds between requests
- For 10 evaluations: ~20-40 seconds
- For 50 evaluations: ~100-200 seconds (2-3 minutes)
- For 100 evaluations: ~200-400 seconds (3-7 minutes)
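The pacing above can be sketched as a throttled loop with exponential backoff on rate-limit errors. This is illustrative only; `throttled_evaluate` is a hypothetical helper, and the real code would catch the API client's specific rate-limit exception rather than a bare `Exception`.

```python
import time

def throttled_evaluate(evaluate_fn, test_cases, min_interval=2.0, max_retries=3):
    """Call evaluate_fn on each case, spacing requests to stay under
    30 RPM and retrying with exponential backoff on failure."""
    results = []
    for case in test_cases:
        for attempt in range(max_retries):
            try:
                results.append(evaluate_fn(case))
                break
            except Exception:  # real code: catch the API's rate-limit error
                time.sleep(min_interval * (2 ** attempt))  # back off: 2s, 4s, 8s, ...
        time.sleep(min_interval)  # ~2s between requests keeps well under 30 RPM
    return results
```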
### When to Use Each Method
| Scenario | Recommended Method |
|----------|-------------------|
| Quick prototyping | TRACE |
| Small high-quality subset (< 20 samples) | GPT Labeling |
| Large-scale evaluation (100+ samples) | TRACE |
| Need both speed and accuracy | Hybrid on small subset |
| Production evaluation | TRACE + spot-check with GPT |
### Token Cost Estimation
For Groq's Llama model (~$0.05 per 1M input tokens):
- Average prompt: ~2KB = ~500 tokens input + ~200 output = ~700 tokens
- Cost per evaluation: 700 / 1M * $0.05 = $0.000035
- For 100 evaluations: ~$0.0035 (very cheap!)
**Note**: Exact costs depend on document length and model choice.
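The arithmetic above can be wrapped in a small helper for budgeting runs. The defaults mirror the guide's assumptions (~700 combined tokens per evaluation at $0.05 per 1M tokens); adjust them for your actual model and prompt sizes.

```python
def estimate_cost(n_evals, tokens_per_eval=700, price_per_million=0.05):
    """Rough evaluation cost in USD under the guide's token assumptions."""
    return n_evals * tokens_per_eval / 1_000_000 * price_per_million

# 100 evaluations at the default assumptions -> ~$0.0035
print(round(estimate_cost(100), 4))
```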
## Troubleshooting
### Issue: "evaluation_pipeline module not found"
**Solution**: Ensure `evaluation_pipeline.py` is in the project root directory
### Issue: GPT Labeling always returns 0.0 scores
**Solution**: Check that LLM client is properly initialized and returning valid JSON
### Issue: Rate limit exceeded
**Solution**: The code handles this with exponential backoff; if the error persists, reduce the number of samples.
### Issue: LLM returns non-JSON response
**Solution**: Use `temperature=0.0` in LLM calls for deterministic output
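Even at `temperature=0.0`, models sometimes wrap JSON in markdown fences or prose. A defensive parser like the sketch below (a hypothetical helper, not necessarily what the project ships) lets the caller fall back gracefully, e.g. to TRACE scores:

```python
import json
import re

def parse_llm_json(raw: str):
    """Extract and parse a JSON object from an LLM reply, tolerating
    markdown code fences and surrounding prose. Returns None on failure."""
    # Strip ```json ... ``` fences if present.
    fenced = re.search(r"```(?:json)?\s*(.*?)\s*```", raw, re.DOTALL)
    candidate = fenced.group(1) if fenced else raw
    # Fall back to the outermost {...} span.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None
```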
## Integration Checklist
- [x] Created `advanced_rag_evaluator.py` with GPT labeling implementation
- [x] Created `evaluation_pipeline.py` with unified interface
- [x] Updated `streamlit_app.py` to support method selection
- [x] Added comprehensive documentation in `docs/GPT_LABELING_EVALUATION.md`
- [x] Tested module imports and basic functionality
- [x] Verified syntax in all files
- [x] Backward compatible with existing TRACE evaluation
- [x] Handles LLM client gracefully (fallback to TRACE if unavailable)
## Next Steps (Optional Enhancements)
1. **Caching**: Store evaluation results for identical Q-D-R triplets
2. **Batch Processing**: Evaluate multiple samples in parallel
3. **Custom Prompts**: Allow users to customize GPT labeling prompts
4. **Multi-LLM**: Average labels from multiple LLMs for robustness
5. **Sampling Strategy**: Smart sampling for large datasets
6. **Visualization**: Charts comparing TRACE vs GPT Labeling results
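As a starting point for the caching enhancement (item 1), a stable key over the question-documents-response triplet could look like this. Names and storage (an in-memory dict) are illustrative; a real implementation might persist to disk.

```python
import hashlib
import json

def triplet_key(question, retrieved_documents, response):
    """Stable cache key for a question-documents-response triplet."""
    payload = json.dumps([question, retrieved_documents, response], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

_cache = {}

def cached_evaluate(evaluate_fn, question, docs, response):
    """Skip the LLM call when an identical triplet was scored before."""
    key = triplet_key(question, docs, response)
    if key not in _cache:
        _cache[key] = evaluate_fn(question, docs, response)
    return _cache[key]
```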
## API Reference
### UnifiedEvaluationPipeline
```python
class UnifiedEvaluationPipeline:
    def __init__(self, llm_client, chunking_strategy, embedding_model, chunk_size, chunk_overlap)
    def evaluate(self, question, response, retrieved_documents, ground_truth=None,
                 method="trace") -> Dict
    def evaluate_batch(self, test_cases, method="trace") -> Dict
    @staticmethod
    def get_evaluation_methods() -> List[Dict]
### AdvancedRAGEvaluator
```python
class AdvancedRAGEvaluator:
    def __init__(self, llm_client, chunking_strategy, embedding_model, chunk_size, chunk_overlap)
    def evaluate(self, question, response, retrieved_documents, ground_truth=None) -> AdvancedTRACEScores
    def evaluate_batch(self, test_cases) -> Dict
### DocumentSentencizer
```python
class DocumentSentencizer:
@staticmethod
def sentencize_documents(documents: List[str]) -> Tuple[List[Dict], str]
@staticmethod
def sentencize_response(response: str) -> Tuple[List[Dict], str]
```
## File Summary
| File | Lines | Purpose | Status |
|------|-------|---------|--------|
| `advanced_rag_evaluator.py` | 380 | GPT labeling evaluator | NEW |
| `evaluation_pipeline.py` | 175 | Unified evaluation interface | NEW |
| `streamlit_app.py` | 927 | Updated UI with method selection | MODIFIED |
| `trace_evaluator.py` | 438 | Original TRACE metrics (unchanged) | UPDATED DOCS |
| `docs/GPT_LABELING_EVALUATION.md` | 500+ | Comprehensive guide | NEW |
## Total Impact
- **New Code**: ~550 lines (2 new modules)
- **Modified Code**: ~50 lines in streamlit_app.py + documentation
- **Backward Compatible**: Yes, existing TRACE evaluation still works
- **Breaking Changes**: None
- **New Dependencies**: None (all already installed)
## Verification Commands
```bash
# Check Python syntax
python -m py_compile advanced_rag_evaluator.py evaluation_pipeline.py
# Run imports test
python -c "from advanced_rag_evaluator import AdvancedRAGEvaluator; from evaluation_pipeline import UnifiedEvaluationPipeline; print('OK')"
# Start Streamlit with new features
streamlit run streamlit_app.py
```
## Support
For issues with GPT labeling:
1. Check that LLM client is initialized (`st.session_state.rag_pipeline.llm`)
2. Verify Groq API key is valid
3. Ensure rate limiting (30 RPM) is respected
4. Check LLM response is valid JSON
5. Review `docs/GPT_LABELING_EVALUATION.md` for detailed guidance