# GPT Labeling Integration - Implementation Guide

## Overview

The RAG Capstone Project now includes **three evaluation methods**:

1. **TRACE Heuristics** - Fast, rule-based metrics (no LLM calls)
2. **GPT Labeling** - Accurate, LLM-based sentence-level grounding (RAGBench paper)
3. **Hybrid** - Combines both approaches for comprehensive analysis

## New Files Created

### Core Implementation Files

1. **`advanced_rag_evaluator.py`** (380 lines)
   - `DocumentSentencizer` - Splits documents and responses into labeled sentences
   - `GPTLabelingPromptGenerator` - Creates GPT labeling prompts
   - `GPTLabelingOutput` - Dataclass for structured LLM response
   - `AdvancedTRACEScores` - Enhanced scores with GPT labeling metrics
   - `AdvancedRAGEvaluator` - Main evaluator using GPT labeling approach

2. **`evaluation_pipeline.py`** (175 lines)
   - `UnifiedEvaluationPipeline` - Facade for TRACE + GPT Labeling
   - Supports single evaluation or batch processing
   - Provides method information and descriptions

3. **`docs/GPT_LABELING_EVALUATION.md`** (Comprehensive guide)
   - Conceptual overview of sentence-level labeling
   - Architecture and data flow diagrams
   - Usage examples for all three methods
   - Performance considerations and recommendations
   - JSON output formats

### Modified Files

1. **`streamlit_app.py`**
   - Updated `evaluation_interface()` to support method selection
   - Updated `run_evaluation()` to handle three methods
   - Added method descriptions and warnings
   - Enhanced logging for each method

2. **`trace_evaluator.py`**
   - Added documentation about GPT labeling integration
   - No functional changes (backward compatible)

## Key Components Explained

### 1. Sentencization

**Document Sentences**: Labeled with keys like `0a`, `0b`, `1a`, `1b`
```
0a. This is the first sentence.
0b. This is the second sentence.
1a. Another document's first sentence.
1b. And the second sentence.
```

**Response Sentences**: Labeled with keys like `a`, `b`, `c`
```
a. The response starts here.
b. It contains multiple sentences.
c. Each one gets a unique key.
```
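The keying scheme above can be sketched as follows. This is a minimal illustration, not the actual `DocumentSentencizer` implementation — the real class may use a proper NLP sentencizer rather than the naive regex splitter shown here:

```python
import re
from string import ascii_lowercase

def _split_sentences(text):
    # Naive split on sentence-ending punctuation; the real
    # DocumentSentencizer may use a smarter sentencizer.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def sentencize_documents(documents):
    """Key each sentence as '<doc_index><letter>', e.g. '0a', '0b', '1a'."""
    keyed = []
    for doc_idx, doc in enumerate(documents):
        for sent_idx, sent in enumerate(_split_sentences(doc)):
            keyed.append({"key": f"{doc_idx}{ascii_lowercase[sent_idx]}",
                          "text": sent})
    return keyed

def sentencize_response(response):
    """Key each response sentence with a bare letter: 'a', 'b', 'c'."""
    return [{"key": ascii_lowercase[i], "text": s}
            for i, s in enumerate(_split_sentences(response))]
```

Document sentences carry their document index so the labeler can cite them unambiguously; response sentences need only a letter since there is a single response.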

### 2. GPT Labeling Process

The GPT labeling prompt asks the LLM to:
1. Identify which document sentences are relevant to the question
2. For each response sentence, identify supporting document sentences
3. Determine if each response sentence is fully/partially/unsupported
4. Return structured JSON with 5 evaluation fields
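A plausible shape for that structured JSON is shown below. The field names are illustrative assumptions — check `GPTLabelingOutput` in `advanced_rag_evaluator.py` for the authoritative schema:

```python
# Hypothetical shape of the structured JSON the prompt requests; the real
# GPTLabelingOutput field names may differ.
example_labeling_output = {
    "relevance_explanation": "0a and 1b address the question directly.",
    "all_relevant_sentence_keys": ["0a", "1b"],   # doc sentences relevant to the question
    "all_utilized_sentence_keys": ["0a"],         # doc sentences the response drew on
    "sentence_support_information": [             # one record per response sentence
        {"response_sentence_key": "a",
         "supporting_sentence_keys": ["0a"],
         "fully_supported": True},
        {"response_sentence_key": "b",
         "supporting_sentence_keys": [],
         "fully_supported": False},
    ],
    "overall_supported": False,                   # every response sentence fully supported?
}
```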

### 3. Metric Computation

From GPT-labeled data:
- **Context Relevance**: Fraction of relevant document sentences (0-1)
- **Context Utilization**: Fraction of relevant sentences used (0-1)
- **Completeness**: Overlap between relevant and utilized (0-1)
- **Adherence**: Fraction of response sentences with full support (0-1)

## Usage Examples

### In Streamlit UI

1. **Select Evaluation Method**
   ```
   [Radio button: TRACE / GPT Labeling / Hybrid]
   ```

2. **Choose LLM and Samples**
   ```
   LLM: [Dropdown: llama-3.1-8b-instant, etc.]
   Samples: [Slider: 5-100]
   Button: "Run Evaluation"
   ```

3. **View Results**
   - Aggregate scores in metric cards
   - Per-query detailed results
   - JSON download

### Programmatically (Python)

```python
from evaluation_pipeline import UnifiedEvaluationPipeline

# Create pipeline
pipeline = UnifiedEvaluationPipeline(
    llm_client=my_llm_client,
    chunking_strategy="dense",
    embedding_model="all-mpnet-base-v2"
)

# Single evaluation
result = pipeline.evaluate(
    question="What is RAG?",
    response="RAG stands for...",
    retrieved_documents=["Doc 1", "Doc 2"],
    method="gpt_labeling"
)

# Batch evaluation
results = pipeline.evaluate_batch(
    test_cases=[
        {
            "query": "Question 1",
            "response": "Response 1",
            "retrieved_documents": ["Doc 1", "Doc 2"],
            "ground_truth": "Expected answer"
        },
        # ... more cases
    ],
    method="hybrid"  # "trace", "gpt_labeling", or "hybrid"
)

print(f"Results: {results}")
```

## Performance Characteristics

### TRACE Method
- **Time per evaluation**: ~100ms
- **Total time for 10 samples**: ~1 second
- **Total time for 100 samples**: ~10 seconds
- **Cost**: Free (no API calls)
- **Accuracy**: Good for obvious cases

### GPT Labeling Method
- **Time per evaluation**: 2-5 seconds (due to API + rate limiting)
- **Total time for 10 samples**: 20-50 seconds
- **Total time for 100 samples**: 3-8 minutes
- **Cost**: roughly $0.002-0.01 per evaluation on typical hosted models; Groq's Llama pricing can be far lower (see Token Cost Estimation below)
- **Accuracy**: Excellent, semantic understanding
- **Limitation**: 30 RPM Groq rate limit

### Hybrid Method
- **Time per evaluation**: 2-5 seconds
- **Cost**: Same as GPT Labeling
- **Benefit**: Get both fast and accurate metrics

## Important Considerations

### Rate Limiting
The Groq API has a **30 RPM (requests per minute)** limit:
- Each evaluation = 1 request
- Wait 2 seconds between requests
- For 10 evaluations: ~20-40 seconds
- For 50 evaluations: ~100-200 seconds (2-3 minutes)
- For 100 evaluations: ~200-400 seconds (3-7 minutes)
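The 2-second spacing above can be enforced with a small client-side throttle. A minimal sketch — the generator name is hypothetical, and the project's actual rate handling (exponential backoff, per the Troubleshooting section) may be more elaborate:

```python
import time

def throttled(calls, min_interval=2.0):
    """Run zero-argument callables in order, sleeping so consecutive
    requests are at least `min_interval` seconds apart -- 2 s keeps
    throughput at or under Groq's 30 requests/minute."""
    last = 0.0
    for call in calls:
        wait = min_interval - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)
        last = time.monotonic()
        yield call()
```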

### When to Use Each Method

| Scenario | Recommended Method |
|----------|-------------------|
| Quick prototyping | TRACE |
| Small high-quality subset (< 20 samples) | GPT Labeling |
| Large-scale evaluation (100+ samples) | TRACE |
| Need both speed and accuracy | Hybrid on small subset |
| Production evaluation | TRACE + spot-check with GPT |

### Token Cost Estimation

For Groq's Llama model (~$0.05 per 1M input tokens):
- Average prompt: ~2KB = ~500 tokens input + ~200 output = ~700 tokens
- Cost per evaluation: 700 / 1M * $0.05 = $0.000035
- For 100 evaluations: ~$0.0035 (very cheap!)

**Note**: Exact costs depend on document length and model choice.
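The back-of-envelope arithmetic above, as a tiny helper whose defaults mirror the estimate (~700 tokens per evaluation at ~$0.05 per 1M tokens — both assumptions that should be adjusted for your actual model and documents):

```python
def estimate_cost(n_evals, tokens_per_eval=700, usd_per_million_tokens=0.05):
    """Rough API cost in USD; defaults follow the estimate above."""
    return n_evals * tokens_per_eval / 1_000_000 * usd_per_million_tokens
```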

## Troubleshooting

### Issue: "evaluation_pipeline module not found"
**Solution**: Ensure `evaluation_pipeline.py` is in the project root directory

### Issue: GPT Labeling always returns 0.0 scores
**Solution**: Check that LLM client is properly initialized and returning valid JSON

### Issue: Rate limit exceeded
**Solution**: The code handles this with exponential backoff. Reduce number of samples.

### Issue: LLM returns non-JSON response
**Solution**: Use `temperature=0.0` in LLM calls for deterministic output
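When `temperature=0.0` is not enough, a defensive parser can still salvage replies that wrap the JSON in prose or markdown fences. A sketch — the helper name is hypothetical, not the project's API:

```python
import json
import re

def parse_labeling_json(llm_text):
    """Best-effort extraction of labeling JSON from an LLM reply.

    Tries the raw text first, then the first {...} span, so replies
    wrapped in prose or code fences still parse.  Returns None on
    failure so the caller can fall back to TRACE heuristics.
    """
    try:
        return json.loads(llm_text)
    except json.JSONDecodeError:
        pass
    match = re.search(r'\{.*\}', llm_text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            return None
    return None
```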

## Integration Checklist

- [x] Created `advanced_rag_evaluator.py` with GPT labeling implementation
- [x] Created `evaluation_pipeline.py` with unified interface
- [x] Updated `streamlit_app.py` to support method selection
- [x] Added comprehensive documentation in `docs/GPT_LABELING_EVALUATION.md`
- [x] Tested module imports and basic functionality
- [x] Verified syntax in all files
- [x] Backward compatible with existing TRACE evaluation
- [x] Handles LLM client gracefully (fallback to TRACE if unavailable)

## Next Steps (Optional Enhancements)

1. **Caching**: Store evaluation results for identical Q-D-R triplets
2. **Batch Processing**: Evaluate multiple samples in parallel
3. **Custom Prompts**: Allow users to customize GPT labeling prompts
4. **Multi-LLM**: Average labels from multiple LLMs for robustness
5. **Sampling Strategy**: Smart sampling for large datasets
6. **Visualization**: Charts comparing TRACE vs GPT Labeling results

## API Reference

### UnifiedEvaluationPipeline

```python
class UnifiedEvaluationPipeline:
    def __init__(self, llm_client, chunking_strategy, embedding_model,
                 chunk_size, chunk_overlap): ...

    def evaluate(self, question, response, retrieved_documents,
                 ground_truth=None, method="trace") -> Dict: ...

    def evaluate_batch(self, test_cases, method="trace") -> Dict: ...

    @staticmethod
    def get_evaluation_methods() -> List[Dict]: ...
```

### AdvancedRAGEvaluator

```python
class AdvancedRAGEvaluator:
    def __init__(self, llm_client, chunking_strategy, embedding_model,
                 chunk_size, chunk_overlap): ...

    def evaluate(self, question, response, retrieved_documents,
                 ground_truth=None) -> AdvancedTRACEScores: ...

    def evaluate_batch(self, test_cases) -> Dict: ...
```

### DocumentSentencizer

```python
class DocumentSentencizer:
    @staticmethod
    def sentencize_documents(documents: List[str]) -> Tuple[List[Dict], str]
    
    @staticmethod
    def sentencize_response(response: str) -> Tuple[List[Dict], str]
```

## File Summary

| File | Lines | Purpose | Status |
|------|-------|---------|--------|
| `advanced_rag_evaluator.py` | 380 | GPT labeling evaluator | NEW |
| `evaluation_pipeline.py` | 175 | Unified evaluation interface | NEW |
| `streamlit_app.py` | 927 | Updated UI with method selection | MODIFIED |
| `trace_evaluator.py` | 438 | Original TRACE metrics (unchanged) | UPDATED DOCS |
| `docs/GPT_LABELING_EVALUATION.md` | 500+ | Comprehensive guide | NEW |

## Total Impact

- **New Code**: ~550 lines (2 new modules)
- **Modified Code**: ~50 lines in streamlit_app.py + documentation
- **Backward Compatible**: Yes, existing TRACE evaluation still works
- **Breaking Changes**: None
- **New Dependencies**: None (all already installed)

## Verification Commands

```bash
# Check Python syntax
python -m py_compile advanced_rag_evaluator.py evaluation_pipeline.py

# Run imports test
python -c "from advanced_rag_evaluator import AdvancedRAGEvaluator; from evaluation_pipeline import UnifiedEvaluationPipeline; print('OK')"

# Start Streamlit with new features
streamlit run streamlit_app.py
```

## Support

For issues with GPT labeling:
1. Check that LLM client is initialized (`st.session_state.rag_pipeline.llm`)
2. Verify Groq API key is valid
3. Ensure rate limiting (30 RPM) is respected
4. Check LLM response is valid JSON
5. Review `docs/GPT_LABELING_EVALUATION.md` for detailed guidance