# GPT Labeling Evaluation - Quick Start Guide

## 🎯 In 30 Seconds

The RAG project now has **three evaluation methods** accessible from Streamlit:

1. **TRACE** - Fast, rule-based (100ms per evaluation, free)
2. **GPT Labeling** - Accurate, LLM-based (2-5s per evaluation, ~$0.01 each)
3. **Hybrid** - Both methods combined

## 🚀 Using in Streamlit

### Step 1: Start the App
```bash
streamlit run streamlit_app.py
```

### Step 2: Load Data
- Select a RAGBench dataset
- Load it into the vector store

### Step 3: Run Evaluation
1. Go to the "Evaluation" tab
2. Choose method:
   ```
   [Radio button] TRACE / GPT Labeling / Hybrid
   ```
3. Set parameters:
   - LLM: Select from dropdown
   - Samples: Slider 5-500
4. Click "Run Evaluation"

### Step 4: View Results
- Aggregate metrics in cards
- Per-query details in expanders
- Download JSON results
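The downloaded JSON can be inspected offline with the standard library. A minimal sketch; the filename and field names below are illustrative assumptions, not the app's exact schema:

```python
import json

# Illustrative sample of what a downloaded results file could contain;
# the field names here are assumptions, not the app's exact schema.
sample = [
    {"question": "What is RAG?", "method": "trace", "average": 0.81},
    {"question": "What is TRACE?", "method": "gpt_labeling", "average": 0.85},
]
with open("results.json", "w") as f:
    json.dump(sample, f)

# Reload the file the same way you would a download from the app.
with open("results.json") as f:
    results = json.load(f)

for r in results:
    print(f"{r['question']}: {r['method']} average={r['average']}")
```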

## 💻 Using in Code

```python
from evaluation_pipeline import UnifiedEvaluationPipeline

# Initialize
pipeline = UnifiedEvaluationPipeline(
    llm_client=my_llm,
    chunking_strategy="dense"
)

# Single evaluation
result = pipeline.evaluate(
    question="What is RAG?",
    response="RAG is a technique...",
    retrieved_documents=["Doc 1", "Doc 2"],
    method="gpt_labeling"  # "trace", "gpt_labeling", or "hybrid"
)

# Batch evaluation
results = pipeline.evaluate_batch(
    test_cases=[{...}, {...}],
    method="trace"  # Fast for 100+ samples
)
```

## ⚡ Performance Guide

| Method | Speed | Cost | Best For |
|--------|-------|------|----------|
| **TRACE** | 100ms | Free | Large-scale (100+ samples) |
| **GPT Labeling** | 2-5s | ~$0.01 | Small, high-quality sets (<20 samples) |
| **Hybrid** | 2-5s | ~$0.01 | When both metric sets are needed |
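As a planning aid, the per-sample figures in the table can be folded into a rough estimator. A minimal sketch; the per-sample times and costs are approximations taken from the table above, not measured guarantees:

```python
# Rough planning helper; per-sample figures approximate the table above.
METHODS = {
    "trace":        {"seconds": 0.1, "dollars": 0.0},
    "gpt_labeling": {"seconds": 3.5, "dollars": 0.01},  # midpoint of 2-5 s
    "hybrid":       {"seconds": 3.5, "dollars": 0.01},
}

def estimate(method: str, n_samples: int) -> dict:
    """Return approximate total wall time (s) and cost ($) for a run."""
    m = METHODS[method]
    return {"seconds": m["seconds"] * n_samples,
            "dollars": m["dollars"] * n_samples}

print(estimate("trace", 100))        # fast and free at scale
print(estimate("gpt_labeling", 20))  # small, high-quality subset
```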

## 🎛️ What Each Method Shows

### TRACE Metrics
- Utilization: How much context was used
- Relevance: How relevant was the context
- Adherence: Absence of hallucinations in the response
- Completeness: Coverage of all necessary information
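For intuition, a rule-based score in this style can be approximated with simple word overlap. This is an illustrative toy, not TRACE's actual implementation:

```python
def toy_utilization(response: str, context: str) -> float:
    """Toy word-overlap score: fraction of context words that also appear
    in the response. Illustrative only; TRACE's real scoring differs."""
    ctx_words = set(context.lower().split())
    resp_words = set(response.lower().split())
    if not ctx_words:
        return 0.0
    return len(ctx_words & resp_words) / len(ctx_words)

score = toy_utilization(
    "RAG retrieves documents before generating an answer",
    "RAG retrieves relevant documents from a vector store",
)
print(score)
```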

### GPT Labeling Metrics
- Context Relevance: Fraction of relevant context
- Context Utilization: How much of the relevant context was used
- Completeness: Coverage of relevant info
- Adherence: Response fully supported

## ⚠️ Important Notes

### Rate Limiting
- Groq API: 30 RPM (1 request every 2 seconds)
- 10 samples: ~20-50 seconds
- 50 samples: ~2-3 minutes
- 100 samples: ~3-7 minutes
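The 30 RPM limit works out to at most one request every two seconds. If you drive the API directly (the project's pipeline may already handle this), a minimal client-side throttle sketch:

```python
import time

class Throttle:
    """Enforce a minimum interval between requests (30 RPM -> 2 s).

    clock and sleep are injectable for testing; defaults use the
    monotonic clock so wall-clock adjustments don't break pacing.
    """
    def __init__(self, rpm: int = 30, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / rpm
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until at least `interval` seconds since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()

# Usage: call throttle.wait() before each GPT Labeling request.
throttle = Throttle(rpm=30)
print(throttle.interval)  # 2.0 seconds between requests
```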

### When to Use GPT Labeling
✅ Small high-quality subset (5-20 samples)
✅ Want semantic understanding (not just keywords)
✅ Evaluating a new dataset
❌ Large-scale evaluation (100+ samples) → Use TRACE
❌ Budget-conscious → Use TRACE

## 📊 Example Results

### TRACE Output
```
Utilization: 0.75
Relevance: 0.82
Adherence: 0.88
Completeness: 0.79
Average: 0.81
```

### GPT Labeling Output
```
Context Relevance: 0.88
Context Utilization: 0.75
Completeness: 0.82
Adherence: 0.95
Overall Supported: true
Fully Supported Sentences: 3
Partially Supported: 1
Unsupported: 0
```
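The TRACE "Average" above is simply the arithmetic mean of the four metrics; a quick check using the example numbers:

```python
# The four TRACE metrics from the example output above.
trace = {
    "utilization": 0.75,
    "relevance": 0.82,
    "adherence": 0.88,
    "completeness": 0.79,
}

# Average is the plain arithmetic mean of the four scores.
average = sum(trace.values()) / len(trace)
print(round(average, 2))  # 0.81, matching the example output
```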

## 🔧 Troubleshooting

**Q: "Method not found" error?**
A: Ensure `evaluation_pipeline.py` exists in the project root

**Q: GPT Labeling returns all 0.0?**
A: Check LLM client is initialized: `st.session_state.rag_pipeline.llm`

**Q: Too slow for many samples?**
A: Use TRACE instead (roughly 20-50x faster per the timings above, with good accuracy)

**Q: Budget concerns?**
A: Hybrid/GPT Labeling = ~$0.01 per evaluation, so 1,000 evals cost about $10 and take roughly 35-85 minutes under the 30 RPM limit

## 📚 Documentation

For detailed information:
- **Conceptual**: See `docs/GPT_LABELING_EVALUATION.md`
- **Technical**: See `docs/IMPLEMENTATION_GUIDE_GPT_LABELING.md`
- **Summary**: See `GPT_LABELING_IMPLEMENTATION_SUMMARY.md`

## 🎓 How GPT Labeling Works (Simple Version)

1. Split documents into labeled sentences: `0a`, `0b`, `1a`, etc.
2. Split response into labeled sentences: `a`, `b`, `c`, etc.
3. Ask GPT-4 (via Groq): "Which document sentences support each response sentence?"
4. GPT returns JSON with labeled support information
5. Compute metrics from labeled data (more accurate than word overlap)
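Steps 1-2 above can be sketched as follows; the naive sentence splitter and label scheme are simplifications of whatever the real evaluator uses:

```python
import re
import string

def split_sentences(text: str) -> list[str]:
    """Naive sentence split on ., !, ? (the real evaluator may differ)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def label_documents(docs: list[str]) -> dict[str, str]:
    """Label document sentences 0a, 0b, 1a, ... (doc index + letter).
    Assumes fewer than 27 sentences per document for simplicity."""
    labels = {}
    for i, doc in enumerate(docs):
        for j, sent in enumerate(split_sentences(doc)):
            labels[f"{i}{string.ascii_lowercase[j]}"] = sent
    return labels

def label_response(response: str) -> dict[str, str]:
    """Label response sentences a, b, c, ..."""
    return {string.ascii_lowercase[j]: s
            for j, s in enumerate(split_sentences(response))}

docs = ["RAG retrieves documents. It then generates.", "TRACE is rule-based."]
print(label_documents(docs))  # keys: 0a, 0b, 1a
print(label_response("RAG is a technique. It uses retrieval."))  # keys: a, b
```

These labeled dicts are what a prompt to the LLM would enumerate, so its JSON answer can refer to sentences by label.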

## 🔐 API Configuration

Your existing LLM client is used automatically:
- Already configured in `st.session_state.rag_pipeline.llm`
- No additional API keys needed
- Same rate limiting (30 RPM) applies

## ✅ Verification

To verify installation works:

```bash
python -c "
from advanced_rag_evaluator import AdvancedRAGEvaluator
from evaluation_pipeline import UnifiedEvaluationPipeline
print('Success: GPT Labeling modules installed')
"
```

Expected output: `Success: GPT Labeling modules installed`

## 📞 Support

If GPT Labeling doesn't work:
1. Check Groq API key is valid
2. Verify LLM client is initialized
3. Test with TRACE method first
4. Check available rate limit (30 RPM)
5. Review detailed guides in `docs/`

## 🎉 You're Ready!

Start Streamlit and try the new evaluation methods now:
```bash
streamlit run streamlit_app.py
```

Then go to **Evaluation tab → Select method → Run**

That's it! Enjoy accurate LLM-based RAG evaluation! 🚀