# GPT Labeling Evaluation - Quick Start Guide

## 🎯 In 30 Seconds

The RAG project now has **three evaluation methods** accessible from Streamlit:

1. **TRACE** - Fast, rule-based (~100 ms per evaluation, free)
2. **GPT Labeling** - Accurate, LLM-based (2-5 s per evaluation, ~$0.01 each)
3. **Hybrid** - Both methods combined

## 🚀 Using in Streamlit

### Step 1: Start the App

```bash
streamlit run streamlit_app.py
```

### Step 2: Load Data

- Select a RAGBench dataset
- Load it into the vector store

### Step 3: Run Evaluation

1. Go to the "Evaluation" tab
2. Choose a method:
   ```
   [Radio button] TRACE / GPT Labeling / Hybrid
   ```
3. Set parameters:
   - LLM: select from the dropdown
   - Samples: slider, 5-500
4. Click "Run Evaluation"

### Step 4: View Results

- Aggregate metrics in cards
- Per-query details in expanders
- Download results as JSON

## 💻 Using in Code

```python
from evaluation_pipeline import UnifiedEvaluationPipeline

# Initialize
pipeline = UnifiedEvaluationPipeline(
    llm_client=my_llm,
    chunking_strategy="dense"
)

# Single evaluation
result = pipeline.evaluate(
    question="What is RAG?",
    response="RAG is a technique...",
    retrieved_documents=["Doc 1", "Doc 2"],
    method="gpt_labeling"  # "trace", "gpt_labeling", or "hybrid"
)

# Batch evaluation
results = pipeline.evaluate_batch(
    test_cases=[{...}, {...}],
    method="trace"  # fast for 100+ samples
)
```

## ⚡ Performance Guide

| Method | Speed | Cost | Best For |
|--------|-------|------|----------|
| **TRACE** | ~100 ms | Free | Large-scale runs (100+ samples) |
| **GPT Labeling** | 2-5 s | ~$0.01 | Small, high-quality subsets (< 20) |
| **Hybrid** | 2-5 s | ~$0.01 | When you need both metric sets |

## 🎛️ What Each Method Shows

### TRACE Metrics

- Utilization: how much of the context was used
- Relevance: how relevant the context was
- Adherence: no hallucinations in the response
- Completeness: all necessary info covered

### GPT Labeling Metrics

- Context Relevance: fraction of the context that is relevant
- Context Utilization: how much of the relevant context was used
- Completeness: coverage of the relevant info
- Adherence: response fully supported by the context

## ⚠️ Important Notes

### Rate Limiting

- Groq API: 30 RPM (1 request every 2 seconds)
- 10 samples: ~20-50 seconds
- 50 samples: ~2-3 minutes
- 100 samples: ~3-7 minutes

### When to Use GPT Labeling

✅ Small, high-quality subset (5-20 samples)
✅ You want semantic understanding (not just keyword overlap)
✅ Evaluating a new dataset
❌ Large-scale evaluation (100+ samples) → use TRACE
❌ Budget-conscious → use TRACE

## 📊 Example Results

### TRACE Output

```
Utilization: 0.75
Relevance: 0.82
Adherence: 0.88
Completeness: 0.79
Average: 0.81
```

### GPT Labeling Output

```
Context Relevance: 0.88
Context Utilization: 0.75
Completeness: 0.82
Adherence: 0.95
Overall Supported: true
Fully Supported Sentences: 3
Partially Supported: 1
Unsupported: 0
```

## 🔧 Troubleshooting

**Q: "Method not found" error?**
A: Ensure `evaluation_pipeline.py` exists in the project root.

**Q: GPT Labeling returns all 0.0?**
A: Check that the LLM client is initialized: `st.session_state.rag_pipeline.llm`

**Q: Too slow for many samples?**
A: Use TRACE instead (roughly 100x faster, still good accuracy).

**Q: Budget concerns?**
A: Hybrid/GPT Labeling cost ~$0.01 per evaluation, so roughly $10 for 1,000 evals (which take about 35 minutes at the 30 RPM limit).

## 📚 Documentation

For detailed information:

- **Conceptual**: see `docs/GPT_LABELING_EVALUATION.md`
- **Technical**: see `docs/IMPLEMENTATION_GUIDE_GPT_LABELING.md`
- **Summary**: see `GPT_LABELING_IMPLEMENTATION_SUMMARY.md`

## 🎓 How GPT Labeling Works (Simple Version)

1. Split the documents into labeled sentences: `0a`, `0b`, `1a`, etc.
2. Split the response into labeled sentences: `a`, `b`, `c`, etc.
3. Ask the LLM (via Groq): "Which document sentences support each response sentence?"
4. The LLM returns JSON with labeled support information
5. Compute metrics from the labeled data (more accurate than word overlap)

## 🔐 API Configuration

Your existing LLM client is used automatically:

- Already configured in `st.session_state.rag_pipeline.llm`
- No additional API keys needed
- The same rate limiting (30 RPM) applies

## ✅ Verification

To verify the installation works:

```bash
python -c "
from advanced_rag_evaluator import AdvancedRAGEvaluator
from evaluation_pipeline import UnifiedEvaluationPipeline
print('Success: GPT Labeling modules installed')
"
```

Expected output: `Success: GPT Labeling modules installed`

## 📞 Support

If GPT Labeling doesn't work:

1. Check that the Groq API key is valid
2. Verify the LLM client is initialized
3. Test with the TRACE method first
4. Check the available rate limit (30 RPM)
5. Review the detailed guides in `docs/`

## 🎉 You're Ready!

Start Streamlit and try the new evaluation methods:

```bash
streamlit run streamlit_app.py
```

Then go to **Evaluation tab → Select method → Run**

That's it! Enjoy accurate LLM-based RAG evaluation! 🚀
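
## 🧮 Appendix: Metric Computation Sketch

To make step 5 of "How GPT Labeling Works" concrete, here is a minimal sketch of how the four GPT Labeling metrics could be derived from labeled support data. This is an illustration under assumed conventions, not the project's actual `advanced_rag_evaluator` implementation: the function name, argument names, and support-label values (`"full"` / `"partial"` / `"none"`) are all hypothetical.

```python
# Hypothetical sketch of turning labeled support data into metrics.
# Label conventions assumed here, not taken from the real evaluator.
SUPPORT_SCORE = {"full": 1.0, "partial": 0.5, "none": 0.0}

def compute_metrics(doc_labels, relevant, used, response_support):
    """
    doc_labels:        all document sentence labels, e.g. ["0a", "0b", "1a"]
    relevant:          labels the LLM judged relevant to the question
    used:              labels the LLM cited as supporting the response
    response_support:  per response sentence, "full" / "partial" / "none"
    """
    relevant_set, used_set = set(relevant), set(used)
    n = len(doc_labels)

    # Fraction of retrieved context that is relevant / actually used
    context_relevance = len(relevant_set) / n if n else 0.0
    context_utilization = len(used_set) / n if n else 0.0

    # How much of the relevant context made it into the response
    completeness = (len(used_set & relevant_set) / len(relevant_set)
                    if relevant_set else 0.0)

    # Average support level across response sentences
    scores = [SUPPORT_SCORE[s] for s in response_support]
    adherence = sum(scores) / len(scores) if scores else 0.0

    return {
        "context_relevance": context_relevance,
        "context_utilization": context_utilization,
        "completeness": completeness,
        "adherence": adherence,
    }

metrics = compute_metrics(
    doc_labels=["0a", "0b", "1a", "1b"],
    relevant=["0a", "0b", "1a"],
    used=["0a", "1a"],
    response_support=["full", "full", "partial"],
)
# context_relevance = 3/4, context_utilization = 2/4,
# completeness = 2/3, adherence = (1 + 1 + 0.5) / 3
```

The point of the sentence labels is that each metric reduces to simple set arithmetic over LLM judgments, which is why this approach is more accurate than word-overlap heuristics.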