# GPT Labeling Evaluation - Quick Start Guide

## 🎯 In 30 Seconds

The RAG project now has **three evaluation methods** accessible from Streamlit:

1. **TRACE** - Fast, rule-based (~100 ms per evaluation, free)
2. **GPT Labeling** - Accurate, LLM-based (2-5 s per evaluation, ~$0.01 each)
3. **Hybrid** - Both methods combined
## 🚀 Using in Streamlit

### Step 1: Start the App

```bash
streamlit run streamlit_app.py
```

### Step 2: Load Data

- Select a RAGBench dataset
- Load it into the vector store
### Step 3: Run Evaluation

1. Go to the "Evaluation" tab
2. Choose the method:
   ```
   [Radio button] TRACE / GPT Labeling / Hybrid
   ```
3. Set the parameters:
   - LLM: select from the dropdown
   - Samples: slider, 5-500
4. Click "Run Evaluation"
### Step 4: View Results

- Aggregate metrics in cards
- Per-query details in expanders
- Download JSON results (an illustrative shape of the file is sketched below)
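The exact schema of the downloaded file depends on the evaluation method. The snippet below is only a hypothetical illustration of what a results file might contain; the field names are illustrative, not taken from the app, and the numbers simply echo the example results later in this guide.

```python
import json

# Hypothetical results structure - field names are illustrative only,
# not the actual schema produced by the app.
example_results = {
    "method": "gpt_labeling",
    "aggregate": {"context_relevance": 0.88, "adherence": 0.95},
    "per_query": [
        {"question": "What is RAG?", "context_relevance": 0.88, "adherence": 0.95},
    ],
}
print(json.dumps(example_results, indent=2))
```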
## 💻 Using in Code

```python
from evaluation_pipeline import UnifiedEvaluationPipeline

# Initialize the pipeline
pipeline = UnifiedEvaluationPipeline(
    llm_client=my_llm,
    chunking_strategy="dense"
)

# Single evaluation
result = pipeline.evaluate(
    question="What is RAG?",
    response="RAG is a technique...",
    retrieved_documents=["Doc 1", "Doc 2"],
    method="gpt_labeling"  # "trace", "gpt_labeling", or "hybrid"
)

# Batch evaluation
results = pipeline.evaluate_batch(
    test_cases=[{...}, {...}],
    method="trace"  # fast enough for 100+ samples
)
```
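Once a batch run finishes, you will usually want a quick summary. This is a minimal post-processing sketch that assumes `evaluate_batch()` returns a list of per-sample metric dicts keyed by metric name; check `evaluation_pipeline.py` for the actual return type.

```python
# Minimal sketch - assumes each entry in `results` is a dict of
# metric name -> float, which is an assumption about the pipeline's return type.
for i, metrics in enumerate(results):
    average = sum(metrics.values()) / len(metrics)
    print(f"Sample {i}: average {average:.2f} | {metrics}")
```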
## ⚡ Performance Guide

| Method | Speed | Cost / sample | Best for |
|--------|-------|---------------|----------|
| **TRACE** | ~100 ms | Free | Large-scale runs (100+ samples) |
| **GPT Labeling** | 2-5 s | ~$0.01 | Small, high-quality runs (< 20 samples) |
| **Hybrid** | 2-5 s | ~$0.01 | When you need both metric sets |
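For budgeting a run before launching it, the quoted per-sample figures can be turned into a rough estimate. This is a back-of-the-envelope sketch using only the numbers from the table above; for the LLM-based methods the Groq rate limit (see below) can dominate the actual wall-clock time.

```python
def estimate_run(n_samples: int, method: str) -> str:
    """Rough time/cost estimate from the per-sample figures quoted above."""
    per_sample = {
        "trace":        {"seconds": 0.1, "dollars": 0.00},
        "gpt_labeling": {"seconds": 3.5, "dollars": 0.01},  # midpoint of 2-5 s
        "hybrid":       {"seconds": 3.5, "dollars": 0.01},
    }[method]
    minutes = n_samples * per_sample["seconds"] / 60
    dollars = n_samples * per_sample["dollars"]
    return f"{method}: ~{minutes:.1f} min, ~${dollars:.2f}"

print(estimate_run(100, "trace"))        # trace: ~0.2 min, ~$0.00
print(estimate_run(20, "gpt_labeling"))  # gpt_labeling: ~1.2 min, ~$0.20
```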
## 📋 What Each Method Shows

### TRACE Metrics

- Utilization: how much of the retrieved context was used
- Relevance: how relevant the retrieved context was to the question
- Adherence: whether the response avoids hallucinations
- Completeness: whether all necessary information was covered

### GPT Labeling Metrics

- Context Relevance: fraction of the retrieved context that is relevant
- Context Utilization: how much of the relevant context was used
- Completeness: coverage of the relevant information in the response
- Adherence: whether the response is fully supported by the context
## ⚠️ Important Notes

### Rate Limiting

- Groq API: 30 RPM (one request every 2 seconds)
- 10 samples: ~20-50 seconds
- 50 samples: ~2-3 minutes
- 100 samples: ~3-7 minutes
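If you drive the LLM-based evaluator yourself in a loop, a simple client-side throttle keeps you under the 30 RPM limit. This is a minimal sketch assuming one API request per evaluation; the pipeline may already throttle internally, in which case this is unnecessary.

```python
import time

MIN_INTERVAL = 2.0  # seconds between requests, i.e. 30 requests per minute
_last_call = 0.0

def throttled(fn, *args, **kwargs):
    """Call fn, sleeping first if the previous call was under 2 seconds ago."""
    global _last_call
    wait = MIN_INTERVAL - (time.monotonic() - _last_call)
    if wait > 0:
        time.sleep(wait)
    _last_call = time.monotonic()
    return fn(*args, **kwargs)

# Hypothetical usage with the pipeline from the "Using in Code" section:
# result = throttled(pipeline.evaluate, question=q, response=r,
#                    retrieved_documents=docs, method="gpt_labeling")
```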
### When to Use GPT Labeling

✅ Small, high-quality subset (5-20 samples)
✅ Want semantic understanding (not just keywords)
✅ Evaluating a new dataset
❌ Large-scale evaluation (100+ samples) → use TRACE
❌ Budget-conscious → use TRACE
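The same rules of thumb as a tiny helper. The thresholds and the `pick_method` name are purely illustrative, not part of the project API.

```python
def pick_method(n_samples: int, budget_sensitive: bool = False,
                need_both_metric_sets: bool = False) -> str:
    """Suggest an evaluation method using the rules of thumb listed above."""
    if budget_sensitive or n_samples >= 100:
        return "trace"          # large-scale or cost-conscious runs
    if need_both_metric_sets:
        return "hybrid"         # TRACE + GPT Labeling on the same samples
    return "gpt_labeling"       # small, high-quality subset

print(pick_method(500))                              # trace
print(pick_method(10))                               # gpt_labeling
print(pick_method(10, need_both_metric_sets=True))   # hybrid
```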
## 📊 Example Results

### TRACE Output

```
Utilization: 0.75
Relevance: 0.82
Adherence: 0.88
Completeness: 0.79
Average: 0.81
```

### GPT Labeling Output

```
Context Relevance: 0.88
Context Utilization: 0.75
Completeness: 0.82
Adherence: 0.95
Overall Supported: true
Fully Supported Sentences: 3
Partially Supported: 1
Unsupported: 0
```
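The TRACE `Average` in this example is the arithmetic mean of the four metrics; reproducing it from the numbers above:

```python
trace_scores = {"utilization": 0.75, "relevance": 0.82,
                "adherence": 0.88, "completeness": 0.79}
average = sum(trace_scores.values()) / len(trace_scores)
print(f"Average: {average:.2f}")  # Average: 0.81
```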
## 🔧 Troubleshooting

**Q: "Method not found" error?**
A: Ensure `evaluation_pipeline.py` exists in the project root.

**Q: GPT Labeling returns all 0.0?**
A: Check that the LLM client is initialized: `st.session_state.rag_pipeline.llm` (a quick check is sketched below).
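A quick way to verify this from inside the app, assuming the pipeline is stored in session state under the `rag_pipeline` key as shown above:

```python
import streamlit as st

# Sanity check: the GPT Labeling evaluator needs a non-None LLM client.
llm = getattr(st.session_state.get("rag_pipeline"), "llm", None)
st.write("LLM client ready" if llm is not None
         else "LLM client missing - reinitialize the pipeline")
```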
**Q: Too slow for many samples?**
A: Use TRACE instead (roughly 20-50× faster per the timings above, still good accuracy).

**Q: Budget concerns?**
A: Hybrid/GPT Labeling costs ~$0.01 per evaluation, so 1,000 evaluations stay well under $30; at the 30 RPM limit such a run also takes at least ~33 minutes of wall-clock time.
## 📚 Documentation

For detailed information:

- **Conceptual**: see `docs/GPT_LABELING_EVALUATION.md`
- **Technical**: see `docs/IMPLEMENTATION_GUIDE_GPT_LABELING.md`
- **Summary**: see `GPT_LABELING_IMPLEMENTATION_SUMMARY.md`
## 🔍 How GPT Labeling Works (Simple Version)

1. Split each retrieved document into labeled sentences: `0a`, `0b`, `1a`, etc.
2. Split the response into labeled sentences: `a`, `b`, `c`, etc.
3. Ask the configured LLM (via Groq): "Which document sentences support each response sentence?"
4. The model returns JSON with the labeled support information
5. Compute the metrics from the labeled data (more accurate than word overlap)
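A stripped-down sketch of the labeling step only (steps 1-2). The prompt construction and the LLM call live in `advanced_rag_evaluator.py`; the naive regex-based sentence splitting here is an assumption for illustration, not the project's actual splitter.

```python
import re
from string import ascii_lowercase

def split_sentences(text):
    """Very naive sentence splitter (assumption for illustration only)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def label_sentences(documents, response):
    """Label document sentences '0a', '0b', '1a', ... and response sentences 'a', 'b', ..."""
    doc_labels = {
        f"{d}{ascii_lowercase[s]}": sent
        for d, doc in enumerate(documents)
        for s, sent in enumerate(split_sentences(doc))
    }
    resp_labels = {ascii_lowercase[s]: sent
                   for s, sent in enumerate(split_sentences(response))}
    return doc_labels, resp_labels

docs = ["RAG retrieves documents. It then generates an answer from them."]
resp = "RAG stands for retrieval-augmented generation. It grounds answers in retrieved text."
doc_labels, resp_labels = label_sentences(docs, resp)
print(doc_labels)   # {'0a': 'RAG retrieves documents.', '0b': 'It then generates an answer from them.'}
print(resp_labels)  # {'a': 'RAG stands for retrieval-augmented generation.', 'b': 'It grounds answers in retrieved text.'}
```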
## 🔑 API Configuration

Your existing LLM client is used automatically:

- Already configured in `st.session_state.rag_pipeline.llm`
- No additional API keys needed
- The same rate limiting (30 RPM) applies
## ✅ Verification

To verify the installation works:

```bash
python -c "
from advanced_rag_evaluator import AdvancedRAGEvaluator
from evaluation_pipeline import UnifiedEvaluationPipeline
print('Success: GPT Labeling modules installed')
"
```

Expected output: `Success: GPT Labeling modules installed`
## 📞 Support

If GPT Labeling doesn't work:

1. Check that the Groq API key is valid
2. Verify the LLM client is initialized
3. Test with the TRACE method first
4. Check the available rate limit (30 RPM)
5. Review the detailed guides in `docs/`
## 🎉 You're Ready!

Start Streamlit and try the new evaluation methods now:

```bash
streamlit run streamlit_app.py
```

Then go to **Evaluation tab → Select method → Run**.

That's it! Enjoy accurate LLM-based RAG evaluation! 🎉