GPT Labeling Integration - Implementation Guide
Overview
The RAG Capstone Project now includes three evaluation methods:
- TRACE Heuristics - Fast, rule-based metrics (no LLM calls)
- GPT Labeling - Accurate, LLM-based sentence-level grounding (RAGBench paper)
- Hybrid - Combines both approaches for comprehensive analysis
New Files Created
Core Implementation Files
advanced_rag_evaluator.py (380 lines)
- DocumentSentencizer - Splits documents and responses into labeled sentences
- GPTLabelingPromptGenerator - Creates GPT labeling prompts
- GPTLabelingOutput - Dataclass for structured LLM responses
- AdvancedTRACEScores - Enhanced scores with GPT labeling metrics
- AdvancedRAGEvaluator - Main evaluator using the GPT labeling approach

evaluation_pipeline.py (175 lines)
- UnifiedEvaluationPipeline - Facade for TRACE + GPT Labeling
- Supports single evaluation or batch processing
- Provides method information and descriptions
docs/GPT_LABELING_EVALUATION.md (comprehensive guide)
- Conceptual overview of sentence-level labeling
- Architecture and data flow diagrams
- Usage examples for all three methods
- Performance considerations and recommendations
- JSON output formats
Modified Files
streamlit_app.py
- Updated evaluation_interface() to support method selection
- Updated run_evaluation() to handle three methods
- Added method descriptions and warnings
- Enhanced logging for each method

trace_evaluator.py
- Added documentation about GPT labeling integration
- No functional changes (backward compatible)
Key Components Explained
1. Sentencization (Sentence Splitting)
Document Sentences: Labeled with keys like 0a, 0b, 1a, 1b
0a. This is the first sentence.
0b. This is the second sentence.
1a. Another document's first sentence.
1b. And the second sentence.
Response Sentences: Labeled with keys like a, b, c
a. The response starts here.
b. It contains multiple sentences.
c. Each one gets a unique key.
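The keying scheme above can be sketched as follows. The regex split here is a simplified stand-in for whatever splitting logic DocumentSentencizer actually uses, and the return shape (a flat list of dicts) is also an assumption for illustration:

```python
import re
import string

def sentencize_documents(documents):
    """Split each document into sentences keyed as 0a, 0b, 1a, ...
    (document index + letter). Naive regex split; a simplified
    stand-in for the real DocumentSentencizer logic."""
    keyed = []
    for doc_idx, doc in enumerate(documents):
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", doc.strip()) if s]
        for sent_idx, sent in enumerate(sentences):
            keyed.append({"key": f"{doc_idx}{string.ascii_lowercase[sent_idx]}",
                          "text": sent})
    return keyed

def sentencize_response(response):
    """Key response sentences as a, b, c, ..."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    return [{"key": string.ascii_lowercase[i], "text": s}
            for i, s in enumerate(sentences)]
```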
2. GPT Labeling Process
The GPT labeling prompt asks the LLM to:
- Identify which document sentences are relevant to the question
- For each response sentence, identify supporting document sentences
- Determine if each response sentence is fully/partially/unsupported
- Return structured JSON with 5 evaluation fields
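An illustrative shape for that returned JSON is shown below. The field names are assumptions loosely modeled on the RAGBench schema; the actual fields of GPTLabelingOutput may differ:

```python
import json

# Hypothetical example of the labeled JSON an LLM might return.
# Field names are illustrative, not the confirmed schema.
labeled = json.loads("""
{
  "all_relevant_sentence_keys": ["0a", "0b", "1a"],
  "all_utilized_sentence_keys": ["0a", "1a"],
  "sentence_support_information": [
    {"response_sentence_key": "a",
     "supporting_sentence_keys": ["0a"],
     "fully_supported": true},
    {"response_sentence_key": "b",
     "supporting_sentence_keys": [],
     "fully_supported": false}
  ],
  "overall_supported": false
}
""")
```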
3. Metric Computation
From GPT-labeled data:
- Context Relevance: Fraction of relevant document sentences (0-1)
- Context Utilization: Fraction of relevant sentences used (0-1)
- Completeness: Overlap between relevant and utilized (0-1)
- Adherence: Fraction of response sentences with full support (0-1)
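Assuming the labeled data arrives as sets of sentence keys plus a fully-supported flag per response sentence, the four fractions above can be sketched as follows (the exact denominators in the real implementation may differ; completeness is taken here as a Jaccard overlap):

```python
def compute_grounding_metrics(num_doc_sentences, relevant_keys,
                              utilized_keys, support_flags):
    """Derive the four metrics from GPT-labeled data, following the
    fraction definitions above. Inputs are illustrative shapes, not
    the confirmed internal representation."""
    relevant = set(relevant_keys)
    utilized = set(utilized_keys)
    overlap = relevant & utilized
    union = relevant | utilized
    return {
        # fraction of document sentences marked relevant
        "context_relevance": len(relevant) / num_doc_sentences if num_doc_sentences else 0.0,
        # fraction of relevant sentences actually used in the response
        "context_utilization": len(overlap) / len(relevant) if relevant else 0.0,
        # overlap between relevant and utilized (Jaccard, an assumption)
        "completeness": len(overlap) / len(union) if union else 0.0,
        # fraction of response sentences with full support
        "adherence": sum(support_flags) / len(support_flags) if support_flags else 0.0,
    }
```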
Usage Examples
In Streamlit UI
1. Select Evaluation Method
   [Radio button: TRACE / GPT Labeling / Hybrid]
2. Choose LLM and Samples
   LLM: [Dropdown: llama-3.1-8b-instant, etc.]
   Samples: [Slider: 5-100]
   Button: "Run Evaluation"
3. View Results
- Aggregate scores in metric cards
- Per-query detailed results
- JSON download
Programmatically (Python)
```python
from evaluation_pipeline import UnifiedEvaluationPipeline

# Create pipeline
pipeline = UnifiedEvaluationPipeline(
    llm_client=my_llm_client,
    chunking_strategy="dense",
    embedding_model="all-mpnet-base-v2"
)

# Single evaluation
result = pipeline.evaluate(
    question="What is RAG?",
    response="RAG stands for...",
    retrieved_documents=["Doc 1", "Doc 2"],
    method="gpt_labeling"
)

# Batch evaluation
results = pipeline.evaluate_batch(
    test_cases=[
        {
            "query": "Question 1",
            "response": "Response 1",
            "retrieved_documents": ["Doc 1", "Doc 2"],
            "ground_truth": "Expected answer"
        },
        # ... more cases
    ],
    method="hybrid"  # "trace", "gpt_labeling", or "hybrid"
)
print(f"Results: {results}")
```
Performance Characteristics
TRACE Method
- Time per evaluation: ~100ms
- Total time for 10 samples: ~1 second
- Total time for 100 samples: ~10 seconds
- Cost: Free (no API calls)
- Accuracy: Good for obvious cases
GPT Labeling Method
- Time per evaluation: 2-5 seconds (due to API + rate limiting)
- Total time for 10 samples: 20-50 seconds
- Total time for 100 samples: 3-8 minutes
- Cost: model- and prompt-dependent, from a small fraction of a cent on Groq's 8B models up to roughly $0.01 per evaluation on larger models (see Token Cost Estimation)
- Accuracy: Excellent, semantic understanding
- Limitation: 30 RPM Groq rate limit
Hybrid Method
- Time per evaluation: 2-5 seconds
- Cost: Same as GPT Labeling
- Benefit: Get both fast and accurate metrics
Important Considerations
Rate Limiting
The Groq API has a 30 RPM (requests per minute) limit:
- Each evaluation = 1 request
- Wait 2 seconds between requests
- For 10 evaluations: ~20-40 seconds
- For 50 evaluations: ~100-200 seconds (2-3 minutes)
- For 100 evaluations: ~200-400 seconds (3-7 minutes)
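The pacing above amounts to a simple fixed-delay loop. A minimal sketch, where evaluate_one is a placeholder for whatever function issues the actual API request:

```python
import time

def evaluate_with_pacing(samples, evaluate_one, delay_s=2.0):
    """Run evaluations sequentially, sleeping between requests so a
    30 RPM limit is respected (2 s/request = at most 30 requests/min).
    evaluate_one is a hypothetical callable that makes one API call."""
    results = []
    for i, sample in enumerate(samples):
        results.append(evaluate_one(sample))
        if i < len(samples) - 1:  # no need to sleep after the last request
            time.sleep(delay_s)
    return results
```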
When to Use Each Method
| Scenario | Recommended Method |
|---|---|
| Quick prototyping | TRACE |
| Small high-quality subset (< 20 samples) | GPT Labeling |
| Large-scale evaluation (100+ samples) | TRACE |
| Need both speed and accuracy | Hybrid on small subset |
| Production evaluation | TRACE + spot-check with GPT |
Token Cost Estimation
For Groq's Llama model (~$0.05 per 1M input tokens):
- Average prompt: ~2KB = ~500 tokens input + ~200 output = ~700 tokens
- Cost per evaluation: 700 / 1M * $0.05 = $0.000035
- For 100 evaluations: ~$0.0035 (very cheap!)
Note: Exact costs depend on document length and model choice.
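The arithmetic above as a tiny helper; the defaults encode the same assumptions (~700 tokens per evaluation at ~$0.05 per 1M tokens), so real costs will vary with document length and model:

```python
def estimate_cost(num_evals, tokens_per_eval=700, usd_per_million_tokens=0.05):
    """Back-of-envelope token cost estimate. Defaults mirror the
    assumptions in the text; adjust for your model and prompt size."""
    return num_evals * tokens_per_eval / 1_000_000 * usd_per_million_tokens
```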
Troubleshooting
Issue: "evaluation_pipeline module not found"
Solution: Ensure evaluation_pipeline.py is in the project root directory
Issue: GPT Labeling always returns 0.0 scores
Solution: Check that LLM client is properly initialized and returning valid JSON
Issue: Rate limit exceeded
Solution: The code handles this with exponential backoff. Reduce number of samples.
Issue: LLM returns non-JSON response
Solution: Set temperature=0.0 and explicitly instruct the model to return only JSON; even then, parse defensively, since output is not guaranteed to be well-formed
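One defensive parsing approach, sketched under the assumption that the model may wrap its JSON in prose or markdown code fences:

```python
import json
import re

def extract_json(llm_text):
    """Pull the first JSON object out of an LLM reply, tolerating
    surrounding prose or code fences. Returns None if nothing
    parseable is found."""
    match = re.search(r"\{.*\}", llm_text, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
```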
Integration Checklist
- Created advanced_rag_evaluator.py with the GPT labeling implementation
- Created evaluation_pipeline.py with the unified interface
- Updated streamlit_app.py to support method selection
- Added comprehensive documentation in docs/GPT_LABELING_EVALUATION.md
- Tested module imports and basic functionality
- Verified syntax in all files
- Backward compatible with existing TRACE evaluation
- Handles a missing LLM client gracefully (falls back to TRACE if unavailable)
Next Steps (Optional Enhancements)
- Caching: Store evaluation results for identical Q-D-R triplets
- Batch Processing: Evaluate multiple samples in parallel
- Custom Prompts: Allow users to customize GPT labeling prompts
- Multi-LLM: Average labels from multiple LLMs for robustness
- Sampling Strategy: Smart sampling for large datasets
- Visualization: Charts comparing TRACE vs GPT Labeling results
API Reference
UnifiedEvaluationPipeline
```python
class UnifiedEvaluationPipeline:
    def __init__(self, llm_client, chunking_strategy, embedding_model,
                 chunk_size, chunk_overlap)
    def evaluate(self, question, response, retrieved_documents,
                 ground_truth=None, method="trace") -> Dict
    def evaluate_batch(self, test_cases, method="trace") -> Dict
    @staticmethod
    def get_evaluation_methods() -> List[Dict]
```
AdvancedRAGEvaluator
```python
class AdvancedRAGEvaluator:
    def __init__(self, llm_client, chunking_strategy, embedding_model,
                 chunk_size, chunk_overlap)
    def evaluate(self, question, response, retrieved_documents,
                 ground_truth=None) -> AdvancedTRACEScores
    def evaluate_batch(self, test_cases) -> Dict
```
DocumentSentencizer
```python
class DocumentSentencizer:
    @staticmethod
    def sentencize_documents(documents: List[str]) -> Tuple[List[Dict], str]
    @staticmethod
    def sentencize_response(response: str) -> Tuple[List[Dict], str]
```
File Summary
| File | Lines | Purpose | Status |
|---|---|---|---|
| advanced_rag_evaluator.py | 380 | GPT labeling evaluator | NEW |
| evaluation_pipeline.py | 175 | Unified evaluation interface | NEW |
| streamlit_app.py | 927 | Updated UI with method selection | MODIFIED |
| trace_evaluator.py | 438 | Original TRACE metrics (code unchanged) | UPDATED DOCS |
| docs/GPT_LABELING_EVALUATION.md | 500+ | Comprehensive guide | NEW |
Total Impact
- New Code: ~550 lines (2 new modules)
- Modified Code: ~50 lines in streamlit_app.py + documentation
- Backward Compatible: Yes, existing TRACE evaluation still works
- Breaking Changes: None
- New Dependencies: None (all already installed)
Verification Commands
```shell
# Check Python syntax
python -m py_compile advanced_rag_evaluator.py evaluation_pipeline.py

# Run imports test
python -c "from advanced_rag_evaluator import AdvancedRAGEvaluator; from evaluation_pipeline import UnifiedEvaluationPipeline; print('OK')"

# Start Streamlit with new features
streamlit run streamlit_app.py
```
Support
For issues with GPT labeling:
- Check that the LLM client is initialized (st.session_state.rag_pipeline.llm)
- Verify the Groq API key is valid
- Ensure rate limiting (30 RPM) is respected
- Check that the LLM response is valid JSON
- Review docs/GPT_LABELING_EVALUATION.md for detailed guidance