# GPT Labeling Evaluation - Quick Start Guide
## 🎯 In 30 Seconds
The RAG project now has **three evaluation methods** accessible from Streamlit:
1. **TRACE** - Fast, rule-based (100ms per evaluation, free)
2. **GPT Labeling** - Accurate, LLM-based (2-5s per evaluation, ~$0.01 each)
3. **Hybrid** - Both methods combined
## 🚀 Using in Streamlit
### Step 1: Start the App
```bash
streamlit run streamlit_app.py
```
### Step 2: Load Data
- Select a RAGBench dataset
- Load it into the vector store
### Step 3: Run Evaluation
1. Go to the "Evaluation" tab
2. Choose method:
```
[Radio button] TRACE / GPT Labeling / Hybrid
```
3. Set parameters:
- LLM: Select from dropdown
- Samples: Slider 5-500
4. Click "Run Evaluation"
### Step 4: View Results
- Aggregate metrics in cards
- Per-query details in expanders
- Download JSON results
## 💻 Using in Code
```python
from evaluation_pipeline import UnifiedEvaluationPipeline

# Initialize
pipeline = UnifiedEvaluationPipeline(
    llm_client=my_llm,
    chunking_strategy="dense",
)

# Single evaluation
result = pipeline.evaluate(
    question="What is RAG?",
    response="RAG is a technique...",
    retrieved_documents=["Doc 1", "Doc 2"],
    method="gpt_labeling",  # "trace", "gpt_labeling", or "hybrid"
)

# Batch evaluation
results = pipeline.evaluate_batch(
    test_cases=[{...}, {...}],
    method="trace",  # fast for 100+ samples
)
```
## ⚑ Performance Guide
| Method | Speed | Cost | Best For |
|--------|-------|------|----------|
| **TRACE** | 100ms | Free | Large-scale (100+ samples) |
| **GPT Labeling** | 2-5s | $0.01 | Small high-quality (< 20) |
| **Hybrid** | 2-5s | $0.01 | Need both metrics |
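As a rough rule of thumb, the table above can be encoded in a small helper. This is an illustrative sketch, not part of the project's API; the thresholds and the `budget_usd` parameter are assumptions:

```python
def choose_method(n_samples: int, budget_usd: float = 1.0) -> str:
    """Heuristic method picker based on the performance table (illustrative)."""
    est_cost = n_samples * 0.01  # GPT Labeling: ~$0.01 per evaluation
    if n_samples >= 100 or est_cost > budget_usd:
        return "trace"         # fast and free at scale
    if n_samples <= 20:
        return "gpt_labeling"  # small, high-quality subset
    return "hybrid"            # both metric families
```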
## 🎛️ What Each Method Shows
### TRACE Metrics
- Utilization: How much context was used
- Relevance: How relevant was the context
- Adherence: No hallucinations in response
- Completeness: Covered all necessary info
### GPT Labeling Metrics
- Context Relevance: Fraction of relevant context
- Context Utilization: How much relevant was used
- Completeness: Coverage of relevant info
- Adherence: Response fully supported
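Conceptually, these metrics reduce to fractions over labeled sentences. The function below is a sketch under assumed inputs (the argument names and the half-weight for partial support are illustrative, not the project's actual schema):

```python
def gpt_label_metrics(doc_sentences, relevant_ids, used_ids, support):
    """Sketch of GPT Labeling metric formulas from sentence-level labels.

    doc_sentences: all labeled document sentence ids, e.g. ["0a", "0b", "1a"]
    relevant_ids:  document sentences labeled relevant to the question
    used_ids:      relevant sentences actually drawn on by the response
    support:       per response sentence, one of "full" / "partial" / "none"
    """
    context_relevance = len(relevant_ids) / len(doc_sentences)
    context_utilization = len(used_ids) / len(relevant_ids) if relevant_ids else 0.0
    # Adherence: share of response sentences supported by the context,
    # counting partial support at half weight (an assumption of this sketch).
    weights = {"full": 1.0, "partial": 0.5, "none": 0.0}
    adherence = sum(weights[s] for s in support) / len(support) if support else 0.0
    return {
        "context_relevance": context_relevance,
        "context_utilization": context_utilization,
        "adherence": adherence,
    }
```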
## ⚠️ Important Notes
### Rate Limiting
- Groq API: 30 RPM (1 request every 2 seconds)
- 10 samples: ~20-50 seconds
- 50 samples: ~2-3 minutes
- 100 samples: ~3-7 minutes
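If you call the API yourself, a minimal client-side throttle for a 30 RPM limit might look like this (a sketch; the evaluation pipeline may already handle pacing internally):

```python
import time


class RateLimiter:
    """Simple client-side throttle: at most `rpm` calls per minute."""

    def __init__(self, rpm: int = 30):
        self.min_interval = 60.0 / rpm  # 2.0 s between calls at 30 RPM
        self._last = 0.0

    def wait(self):
        """Sleep just long enough to respect the minimum interval."""
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

Call `limiter.wait()` before each evaluation request to stay under the limit.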
### When to Use GPT Labeling
✅ Small high-quality subset (5-20 samples)
✅ Want semantic understanding (not just keywords)
✅ Evaluating new dataset
❌ Large-scale evaluation (100+ samples) → Use TRACE
❌ Budget-conscious → Use TRACE
## 📊 Example Results
### TRACE Output
```
Utilization: 0.75
Relevance: 0.82
Adherence: 0.88
Completeness: 0.79
Average: 0.81
```
### GPT Labeling Output
```
Context Relevance: 0.88
Context Utilization: 0.75
Completeness: 0.82
Adherence: 0.95
Overall Supported: true
Fully Supported Sentences: 3
Partially Supported: 1
Unsupported: 0
```
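Downloaded JSON results can be aggregated in a few lines. The field names below mirror the example output above, but the exact schema of the downloaded file is an assumption of this sketch:

```python
import json

# Assumed shape of the downloaded results: one record per query with the
# metric fields shown above (field names are an assumption of this sketch).
raw = """[
  {"context_relevance": 0.88, "context_utilization": 0.75,
   "completeness": 0.82, "adherence": 0.95},
  {"context_relevance": 0.80, "context_utilization": 0.70,
   "completeness": 0.78, "adherence": 0.90}
]"""

records = json.loads(raw)
metrics = ["context_relevance", "context_utilization", "completeness", "adherence"]
# Average each metric across all per-query records
averages = {m: sum(r[m] for r in records) / len(records) for m in metrics}
print(averages)
```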
## 🔧 Troubleshooting
**Q: "Method not found" error?**
A: Ensure `evaluation_pipeline.py` exists in the project root
**Q: GPT Labeling returns all 0.0?**
A: Check LLM client is initialized: `st.session_state.rag_pipeline.llm`
**Q: Too slow for many samples?**
A: Use TRACE instead (20-50x faster per the table above, still good accuracy)
**Q: Budget concerns?**
A: Hybrid/GPT Labeling cost ~$0.01 per evaluation, so 1,000 evals is roughly $10; at the 30 RPM limit that batch also takes about 35 minutes
## 📚 Documentation
For detailed information:
- **Conceptual**: See `docs/GPT_LABELING_EVALUATION.md`
- **Technical**: See `docs/IMPLEMENTATION_GUIDE_GPT_LABELING.md`
- **Summary**: See `GPT_LABELING_IMPLEMENTATION_SUMMARY.md`
## 🎓 How GPT Labeling Works (Simple Version)
1. Split documents into labeled sentences: `0a`, `0b`, `1a`, etc.
2. Split response into labeled sentences: `a`, `b`, `c`, etc.
3. Ask GPT-4 (via Groq): "Which document sentences support each response sentence?"
4. GPT returns JSON with labeled support information
5. Compute metrics from labeled data (more accurate than word overlap)
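The labeling steps above can be sketched as follows. This uses a naive split on `". "`; the real implementation likely uses a proper sentence tokenizer, so treat these helpers as illustrative only:

```python
import string


def label_documents(docs):
    """Label each document sentence as '0a', '0b', '1a', ... (naive split)."""
    labeled = {}
    for d, doc in enumerate(docs):
        sentences = [s.strip() for s in doc.split(". ") if s.strip()]
        for i, sent in enumerate(sentences):
            labeled[f"{d}{string.ascii_lowercase[i]}"] = sent
    return labeled


def label_response(response):
    """Label response sentences as 'a', 'b', 'c', ..."""
    sentences = [s.strip() for s in response.split(". ") if s.strip()]
    return {string.ascii_lowercase[i]: s for i, s in enumerate(sentences)}
```

The prompt then presents both labeled sets to the LLM and asks which document sentence ids support each response sentence id, with the answer returned as JSON.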
## 🔐 API Configuration
Your existing LLM client is used automatically:
- Already configured in `st.session_state.rag_pipeline.llm`
- No additional API keys needed
- Same rate limiting (30 RPM) applies
## ✅ Verification
To verify installation works:
```bash
python -c "
from advanced_rag_evaluator import AdvancedRAGEvaluator
from evaluation_pipeline import UnifiedEvaluationPipeline
print('Success: GPT Labeling modules installed')
"
```
Expected output: `Success: GPT Labeling modules installed`
## 📞 Support
If GPT Labeling doesn't work:
1. Check Groq API key is valid
2. Verify LLM client is initialized
3. Test with TRACE method first
4. Check available rate limit (30 RPM)
5. Review detailed guides in `docs/`
## 🎉 You're Ready!
Start Streamlit and try the new evaluation methods now:
```bash
streamlit run streamlit_app.py
```
Then go to **Evaluation tab → Select method → Run**
That's it! Enjoy accurate LLM-based RAG evaluation! 🚀