Spaces:

gopikrishnait
/

CapStoneRAG10

Sleeping

App Files Files Community

CapStoneRAG10 / docs /IMPLEMENTATION_GUIDE_GPT_LABELING.md

Developer

Initial commit for HuggingFace Spaces - RAG Capstone Project with Qdrant Cloud

1d10b0a about 1 month ago

preview code

raw

history blame contribute delete

9.62 kB

	# GPT Labeling Integration - Implementation Guide

	## Overview

	The RAG Capstone Project now includes three evaluation methods:

	1. TRACE Heuristics - Fast, rule-based metrics (no LLM calls)
	2. GPT Labeling - Accurate, LLM-based sentence-level grounding (RAGBench paper)
	3. Hybrid - Combines both approaches for comprehensive analysis

	## New Files Created

	### Core Implementation Files

	1. `advanced_rag_evaluator.py` (380 lines)
	- `DocumentSentencizer` - Splits documents and responses into labeled sentences
	- `GPTLabelingPromptGenerator` - Creates GPT labeling prompts
	- `GPTLabelingOutput` - Dataclass for structured LLM response
	- `AdvancedTRACEScores` - Enhanced scores with GPT labeling metrics
	- `AdvancedRAGEvaluator` - Main evaluator using GPT labeling approach

	2. `evaluation_pipeline.py` (175 lines)
	- `UnifiedEvaluationPipeline` - Facade for TRACE + GPT Labeling
	- Supports single evaluation or batch processing
	- Provides method information and descriptions

	3. `docs/GPT_LABELING_EVALUATION.md` (Comprehensive guide)
	- Conceptual overview of sentence-level labeling
	- Architecture and data flow diagrams
	- Usage examples for all three methods
	- Performance considerations and recommendations
	- JSON output formats

	### Modified Files

	1. `streamlit_app.py`
	- Updated `evaluation_interface()` to support method selection
	- Updated `run_evaluation()` to handle three methods
	- Added method descriptions and warnings
	- Enhanced logging for each method

	2. `trace_evaluator.py`
	- Added documentation about GPT labeling integration
	- No functional changes (backward compatible)

	## Key Components Explained

	### 1. Sentence Sentencization

	Document Sentences: Labeled with keys like `0a`, `0b`, `1a`, `1b`
	```
	0a. This is the first sentence.
	0b. This is the second sentence.
	1a. Another document's first sentence.
	1b. And the second sentence.
	```

	Response Sentences: Labeled with keys like `a`, `b`, `c`
	```
	a. The response starts here.
	b. It contains multiple sentences.
	c. Each one gets a unique key.
	```

	### 2. GPT Labeling Process

	The GPT labeling prompt asks the LLM to:
	1. Identify which document sentences are relevant to the question
	2. For each response sentence, identify supporting document sentences
	3. Determine if each response sentence is fully/partially/unsupported
	4. Return structured JSON with 5 evaluation fields

	### 3. Metric Computation

	From GPT-labeled data:
	- Context Relevance: Fraction of relevant document sentences (0-1)
	- Context Utilization: Fraction of relevant sentences used (0-1)
	- Completeness: Overlap between relevant and utilized (0-1)
	- Adherence: Fraction of response sentences with full support (0-1)

	## Usage Examples

	### In Streamlit UI

	1. Select Evaluation Method
	```
	[Radio button: TRACE / GPT Labeling / Hybrid]
	```

	2. Choose LLM and Samples
	```
	LLM: [Dropdown: llama-3.1-8b-instant, etc.]
	Samples: [Slider: 5-100]
	Button: "Run Evaluation"
	```

	3. View Results
	- Aggregate scores in metric cards
	- Per-query detailed results
	- JSON download

	### Programmatically (Python)

	```python
	from evaluation_pipeline import UnifiedEvaluationPipeline

	# Create pipeline
	pipeline = UnifiedEvaluationPipeline(
	llm_client=my_llm_client,
	chunking_strategy="dense",
	embedding_model="all-mpnet-base-v2"
	)

	# Single evaluation
	result = pipeline.evaluate(
	question="What is RAG?",
	response="RAG stands for...",
	retrieved_documents=["Doc 1", "Doc 2"],
	method="gpt_labeling"
	)

	# Batch evaluation
	results = pipeline.evaluate_batch(
	test_cases=[
	{
	"query": "Question 1",
	"response": "Response 1",
	"retrieved_documents": ["Doc 1", "Doc 2"],
	"ground_truth": "Expected answer"
	},
	# ... more cases
	],
	method="hybrid" # "trace", "gpt_labeling", or "hybrid"
	)

	print(f"Results: {results}")
	```

	## Performance Characteristics

	### TRACE Method
	- Time per evaluation: ~100ms
	- Total time for 10 samples: ~1 second
	- Total time for 100 samples: ~10 seconds
	- Cost: Free (no API calls)
	- Accuracy: Good for obvious cases

	### GPT Labeling Method
	- Time per evaluation: 2-5 seconds (due to API + rate limiting)
	- Total time for 10 samples: 20-50 seconds
	- Total time for 100 samples: 3-8 minutes
	- Cost: ~$0.002-0.01 per evaluation ($0.02-0.10 per 10 samples)
	- Accuracy: Excellent, semantic understanding
	- Limitation: 30 RPM Groq rate limit

	### Hybrid Method
	- Time per evaluation: 2-5 seconds
	- Cost: Same as GPT Labeling
	- Benefit: Get both fast and accurate metrics

	## Important Considerations

	### Rate Limiting
	The Groq API has a 30 RPM (requests per minute) limit:
	- Each evaluation = 1 request
	- Wait 2 seconds between requests
	- For 10 evaluations: ~20-40 seconds
	- For 50 evaluations: ~100-200 seconds (2-3 minutes)
	- For 100 evaluations: ~200-400 seconds (3-7 minutes)

	### When to Use Each Method

	\| Scenario \| Recommended Method \|
	\|----------\|-------------------\|
	\| Quick prototyping \| TRACE \|
	\| Small high-quality subset (< 20 samples) \| GPT Labeling \|
	\| Large-scale evaluation (100+ samples) \| TRACE \|
	\| Need both speed and accuracy \| Hybrid on small subset \|
	\| Production evaluation \| TRACE + spot-check with GPT \|

	### Token Cost Estimation

	For Groq's Llama model (~$0.05 per 1M input tokens):
	- Average prompt: ~2KB = ~500 tokens input + ~200 output = ~700 tokens
	- Cost per evaluation: 700 / 1M * $0.05 = $0.000035
	- For 100 evaluations: ~$0.0035 (very cheap!)

	Note: Exact costs depend on document length and model choice.

	## Troubleshooting

	### Issue: "evaluation_pipeline module not found"
	Solution: Ensure `evaluation_pipeline.py` is in the project root directory

	### Issue: GPT Labeling always returns 0.0 scores
	Solution: Check that LLM client is properly initialized and returning valid JSON

	### Issue: Rate limit exceeded
	Solution: The code handles this with exponential backoff. Reduce number of samples.

	### Issue: LLM returns non-JSON response
	Solution: Use `temperature=0.0` in LLM calls for deterministic output

	## Integration Checklist

	- [x] Created `advanced_rag_evaluator.py` with GPT labeling implementation
	- [x] Created `evaluation_pipeline.py` with unified interface
	- [x] Updated `streamlit_app.py` to support method selection
	- [x] Added comprehensive documentation in `docs/GPT_LABELING_EVALUATION.md`
	- [x] Tested module imports and basic functionality
	- [x] Verified syntax in all files
	- [x] Backward compatible with existing TRACE evaluation
	- [x] Handles LLM client gracefully (fallback to TRACE if unavailable)

	## Next Steps (Optional Enhancements)

	1. Caching: Store evaluation results for identical Q-D-R triplets
	2. Batch Processing: Evaluate multiple samples in parallel
	3. Custom Prompts: Allow users to customize GPT labeling prompts
	4. Multi-LLM: Average labels from multiple LLMs for robustness
	5. Sampling Strategy: Smart sampling for large datasets
	6. Visualization: Charts comparing TRACE vs GPT Labeling results

	## API Reference

	### UnifiedEvaluationPipeline

	```python
	class UnifiedEvaluationPipeline:
	def __init__(llm_client, chunking_strategy, embedding_model, chunk_size, chunk_overlap)

	def evaluate(question, response, retrieved_documents, ground_truth=None,
	method="trace") -> Dict

	def evaluate_batch(test_cases, method="trace") -> Dict

	@staticmethod
	def get_evaluation_methods() -> List[Dict]
	```

	### AdvancedRAGEvaluator

	```python
	class AdvancedRAGEvaluator:
	def __init__(llm_client, chunking_strategy, embedding_model, chunk_size, chunk_overlap)

	def evaluate(question, response, retrieved_documents, ground_truth=None) -> AdvancedTRACEScores

	def evaluate_batch(test_cases) -> Dict
	```

	### DocumentSentencizer

	```python
	class DocumentSentencizer:
	@staticmethod
	def sentencize_documents(documents: List[str]) -> Tuple[List[Dict], str]

	@staticmethod
	def sentencize_response(response: str) -> Tuple[List[Dict], str]
	```

	## File Summary

	\| File \| Lines \| Purpose \| Status \|
	\|------\|-------\|---------\|--------\|
	\| `advanced_rag_evaluator.py` \| 380 \| GPT labeling evaluator \| NEW \|
	\| `evaluation_pipeline.py` \| 175 \| Unified evaluation interface \| NEW \|
	\| `streamlit_app.py` \| 927 \| Updated UI with method selection \| MODIFIED \|
	\| `trace_evaluator.py` \| 438 \| Original TRACE metrics (unchanged) \| UPDATED DOCS \|
	\| `docs/GPT_LABELING_EVALUATION.md` \| 500+ \| Comprehensive guide \| NEW \|

	## Total Impact

	- New Code: ~550 lines (2 new modules)
	- Modified Code: ~50 lines in streamlit_app.py + documentation
	- Backward Compatible: Yes, existing TRACE evaluation still works
	- Breaking Changes: None
	- New Dependencies: None (all already installed)

	## Verification Commands

	```bash
	# Check Python syntax
	python -m py_compile advanced_rag_evaluator.py evaluation_pipeline.py

	# Run imports test
	python -c "from advanced_rag_evaluator import AdvancedRAGEvaluator; from evaluation_pipeline import UnifiedEvaluationPipeline; print('OK')"

	# Start Streamlit with new features
	streamlit run streamlit_app.py
	```

	## Support

	For issues with GPT labeling:
	1. Check that LLM client is initialized (`st.session_state.rag_pipeline.llm`)
	2. Verify Groq API key is valid
	3. Ensure rate limiting (30 RPM) is respected
	4. Check LLM response is valid JSON
	5. Review `docs/GPT_LABELING_EVALUATION.md` for detailed guidance