# GPT Labeling Evaluation - Quick Start Guide
## In 30 Seconds
The RAG project now has **three evaluation methods** accessible from Streamlit:
1. **TRACE** - Fast, rule-based (100ms per evaluation, free)
2. **GPT Labeling** - Accurate, LLM-based (2-5s per evaluation, ~$0.01 each)
3. **Hybrid** - Both methods combined
## Using in Streamlit
### Step 1: Start the App
```bash
streamlit run streamlit_app.py
```
### Step 2: Load Data
- Select a RAGBench dataset
- Load it into the vector store
### Step 3: Run Evaluation
1. Go to the "Evaluation" tab
2. Choose method:
```
[Radio button] TRACE / GPT Labeling / Hybrid
```
3. Set parameters:
- LLM: Select from dropdown
- Samples: Slider 5-500
4. Click "Run Evaluation"
### Step 4: View Results
- Aggregate metrics in cards
- Per-query details in expanders
- Download JSON results
## Using in Code
```python
from evaluation_pipeline import UnifiedEvaluationPipeline

# Initialize
pipeline = UnifiedEvaluationPipeline(
    llm_client=my_llm,
    chunking_strategy="dense"
)

# Single evaluation
result = pipeline.evaluate(
    question="What is RAG?",
    response="RAG is a technique...",
    retrieved_documents=["Doc 1", "Doc 2"],
    method="gpt_labeling"  # "trace", "gpt_labeling", or "hybrid"
)

# Batch evaluation
results = pipeline.evaluate_batch(
    test_cases=[{...}, {...}],
    method="trace"  # fast for 100+ samples
)
```
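Batch results can then be summarized with a few lines of plain Python. This is an illustrative sketch that assumes each result exposes its metric scores as a plain dict; the actual field names returned by `evaluation_pipeline` may differ.

```python
def average_metrics(results):
    """Average each metric across a batch of evaluation results.

    Assumes each result is a dict of metric name -> score in [0, 1];
    the metric names below are illustrative, not the pipeline's API.
    """
    totals = {}
    for r in results:
        for name, score in r.items():
            totals[name] = totals.get(name, 0.0) + score
    return {name: total / len(results) for name, total in totals.items()}

batch = [
    {"utilization": 0.75, "relevance": 0.82},
    {"utilization": 0.65, "relevance": 0.90},
]
print(average_metrics(batch))
```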
## Performance Guide
| Method | Speed | Cost | Best For |
|--------|-------|------|----------|
| **TRACE** | 100ms | Free | Large-scale (100+ samples) |
| **GPT Labeling** | 2-5s | $0.01 | Small high-quality (< 20) |
| **Hybrid** | 2-5s | $0.01 | Need both metrics |
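Using the figures in the table above, a quick back-of-envelope planner can estimate wall-clock time and cost before kicking off a run. The per-call numbers are rough midpoints taken from the table, not guarantees:

```python
# Per-evaluation figures from the performance table above:
# TRACE ~0.1 s and free; GPT Labeling / Hybrid ~2-5 s at ~$0.01 each.
METHODS = {
    "trace":        {"seconds": 0.1, "dollars": 0.0},
    "gpt_labeling": {"seconds": 3.5, "dollars": 0.01},  # midpoint of 2-5 s
    "hybrid":       {"seconds": 3.5, "dollars": 0.01},
}

def estimate(method, n_samples):
    """Rough total time and cost for an evaluation run."""
    m = METHODS[method]
    return {"seconds": m["seconds"] * n_samples,
            "dollars": m["dollars"] * n_samples}

print(estimate("gpt_labeling", 20))
print(estimate("trace", 500))
```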
## What Each Method Shows
### TRACE Metrics
- Utilization: How much context was used
- Relevance: How relevant was the context
- Adherence: No hallucinations in response
- Completeness: Covered all necessary info
### GPT Labeling Metrics
- Context Relevance: Fraction of relevant context
- Context Utilization: How much relevant was used
- Completeness: Coverage of relevant info
- Adherence: Response fully supported
## Important Notes
### Rate Limiting
- Groq API: 30 RPM (1 request every 2 seconds)
- 10 samples: ~20-50 seconds
- 50 samples: ~2-3 minutes
- 100 samples: ~3-7 minutes
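If you drive the API directly rather than through the pipeline, a minimal client-side throttle keeps you under the 30 RPM limit. This is a sketch, not the project's built-in rate limiting:

```python
import time

class RateLimiter:
    """Client-side throttle: at most `requests_per_minute` calls.

    For Groq's 30 RPM limit this enforces one request every 2 seconds.
    """
    def __init__(self, requests_per_minute=30):
        self.min_interval = 60.0 / requests_per_minute
        self._last_call = float("-inf")

    def wait(self):
        """Sleep just long enough to respect the rate limit."""
        elapsed = time.monotonic() - self._last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_call = time.monotonic()

limiter = RateLimiter(30)
# for case in test_cases:
#     limiter.wait()
#     pipeline.evaluate(**case, method="gpt_labeling")
```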
### When to Use GPT Labeling
- ✅ Small high-quality subset (5-20 samples)
- ✅ Want semantic understanding (not just keywords)
- ✅ Evaluating a new dataset
- ❌ Large-scale evaluation (100+ samples) → use TRACE
- ❌ Budget-conscious → use TRACE
## Example Results
### TRACE Output
```
Utilization: 0.75
Relevance: 0.82
Adherence: 0.88
Completeness: 0.79
Average: 0.81
```
### GPT Labeling Output
```
Context Relevance: 0.88
Context Utilization: 0.75
Completeness: 0.82
Adherence: 0.95
Overall Supported: true
Fully Supported Sentences: 3
Partially Supported: 1
Unsupported: 0
```
## Troubleshooting
**Q: "Method not found" error?**
A: Ensure `evaluation_pipeline.py` exists in project root
**Q: GPT Labeling returns all 0.0?**
A: Check LLM client is initialized: `st.session_state.rag_pipeline.llm`
**Q: Too slow for many samples?**
A: Use TRACE instead (roughly 20-50x faster, still good accuracy)
**Q: Budget concerns?**
A: Hybrid/GPT Labeling costs ~$0.01 per evaluation, so 1,000 evals run about $10 and take roughly 35 minutes at the 30 RPM limit
## Documentation
For detailed information:
- **Conceptual**: See `docs/GPT_LABELING_EVALUATION.md`
- **Technical**: See `docs/IMPLEMENTATION_GUIDE_GPT_LABELING.md`
- **Summary**: See `GPT_LABELING_IMPLEMENTATION_SUMMARY.md`
## How GPT Labeling Works (Simple Version)
1. Split documents into labeled sentences: `0a`, `0b`, `1a`, etc.
2. Split response into labeled sentences: `a`, `b`, `c`, etc.
3. Ask the judge LLM (via Groq): "Which document sentences support each response sentence?"
4. The LLM returns JSON with labeled support information
5. Compute metrics from labeled data (more accurate than word overlap)
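The labeling scheme in steps 1-2 can be sketched as follows. Sentence splitting here is naive (period-based) for illustration only; the real implementation in `advanced_rag_evaluator` likely handles this more robustly:

```python
import string

def label_documents(docs):
    """Map document sentences to labels like '0a', '0b', '1a', ..."""
    labels = {}
    for d, doc in enumerate(docs):
        sentences = [s.strip() for s in doc.split(".") if s.strip()]
        for s, sent in enumerate(sentences):
            labels[f"{d}{string.ascii_lowercase[s]}"] = sent
    return labels

def label_response(response):
    """Map response sentences to labels 'a', 'b', 'c', ..."""
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    return dict(zip(string.ascii_lowercase, sentences))

doc_labels = label_documents(["RAG retrieves documents. It grounds answers."])
resp_labels = label_response("RAG is a retrieval technique.")
# The labeled sentences go into the LLM prompt; the model returns JSON
# such as {"a": ["0a"]}, mapping each response sentence to the document
# sentences that support it, and the metrics are computed from that map.
```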
## API Configuration
Your existing LLM client is used automatically:
- Already configured in `st.session_state.rag_pipeline.llm`
- No additional API keys needed
- Same rate limiting (30 RPM) applies
## Verification
To verify installation works:
```bash
python -c "
from advanced_rag_evaluator import AdvancedRAGEvaluator
from evaluation_pipeline import UnifiedEvaluationPipeline
print('Success: GPT Labeling modules installed')
"
```
Expected output: `Success: GPT Labeling modules installed`
## Support
If GPT Labeling doesn't work:
1. Check Groq API key is valid
2. Verify LLM client is initialized
3. Test with TRACE method first
4. Check available rate limit (30 RPM)
5. Review detailed guides in `docs/`
## You're Ready!
Start Streamlit and try the new evaluation methods now:
```bash
streamlit run streamlit_app.py
```
Then go to **Evaluation tab β Select method β Run**
That's it! Enjoy accurate LLM-based RAG evaluation!