# GPT Labeling Integration - Implementation Guide

## Overview

The RAG Capstone Project now includes **three evaluation methods**:

1. **TRACE Heuristics** - Fast, rule-based metrics (no LLM calls)
2. **GPT Labeling** - Accurate, LLM-based sentence-level grounding (RAGBench paper)
3. **Hybrid** - Combines both approaches for comprehensive analysis

## New Files Created

### Core Implementation Files

1. **`advanced_rag_evaluator.py`** (380 lines)
   - `DocumentSentencizer` - Splits documents and responses into labeled sentences
   - `GPTLabelingPromptGenerator` - Creates GPT labeling prompts
   - `GPTLabelingOutput` - Dataclass for the structured LLM response
   - `AdvancedTRACEScores` - Enhanced scores with GPT labeling metrics
   - `AdvancedRAGEvaluator` - Main evaluator using the GPT labeling approach

2. **`evaluation_pipeline.py`** (175 lines)
   - `UnifiedEvaluationPipeline` - Facade for TRACE + GPT Labeling
   - Supports single evaluation or batch processing
   - Provides method information and descriptions

3. **`docs/GPT_LABELING_EVALUATION.md`** (comprehensive guide)
   - Conceptual overview of sentence-level labeling
   - Architecture and data-flow diagrams
   - Usage examples for all three methods
   - Performance considerations and recommendations
   - JSON output formats

### Modified Files

1. **`streamlit_app.py`**
   - Updated `evaluation_interface()` to support method selection
   - Updated `run_evaluation()` to handle all three methods
   - Added method descriptions and warnings
   - Enhanced logging for each method

2. **`trace_evaluator.py`**
   - Added documentation about the GPT labeling integration
   - No functional changes (backward compatible)

## Key Components Explained

### 1. Sentencization

**Document sentences** are labeled with keys like `0a`, `0b`, `1a`, `1b` (document index plus sentence letter):

```
0a. This is the first sentence.
0b. This is the second sentence.
1a. Another document's first sentence.
1b. And the second sentence.
```

**Response sentences** are labeled with keys like `a`, `b`, `c`:

```
a. The response starts here.
b. It contains multiple sentences.
c. Each one gets a unique key.
```

### 2. GPT Labeling Process

The GPT labeling prompt asks the LLM to:

1. Identify which document sentences are relevant to the question
2. For each response sentence, identify the supporting document sentences
3. Determine whether each response sentence is fully, partially, or not supported
4. Return structured JSON with 5 evaluation fields

### 3. Metric Computation

From the GPT-labeled data:

- **Context Relevance**: Fraction of document sentences that are relevant (0-1)
- **Context Utilization**: Fraction of relevant sentences actually used (0-1)
- **Completeness**: Overlap between relevant and utilized sentences (0-1)
- **Adherence**: Fraction of response sentences with full support (0-1)

## Usage Examples

### In Streamlit UI

1. **Select evaluation method**

   ```
   [Radio button: TRACE / GPT Labeling / Hybrid]
   ```

2. **Choose LLM and samples**

   ```
   LLM: [Dropdown: llama-3.1-8b-instant, etc.]
   Samples: [Slider: 5-100]
   Button: "Run Evaluation"
   ```

3. **View results**
   - Aggregate scores in metric cards
   - Per-query detailed results
   - JSON download

### Programmatically (Python)

```python
from evaluation_pipeline import UnifiedEvaluationPipeline

# Create pipeline
pipeline = UnifiedEvaluationPipeline(
    llm_client=my_llm_client,
    chunking_strategy="dense",
    embedding_model="all-mpnet-base-v2",
)

# Single evaluation
result = pipeline.evaluate(
    question="What is RAG?",
    response="RAG stands for...",
    retrieved_documents=["Doc 1", "Doc 2"],
    method="gpt_labeling",
)

# Batch evaluation
results = pipeline.evaluate_batch(
    test_cases=[
        {
            "query": "Question 1",
            "response": "Response 1",
            "retrieved_documents": ["Doc 1", "Doc 2"],
            "ground_truth": "Expected answer",
        },
        # ... more cases
    ],
    method="hybrid",  # "trace", "gpt_labeling", or "hybrid"
)
print(f"Results: {results}")
```

## Performance Characteristics

### TRACE Method

- **Time per evaluation**: ~100 ms
- **Total time for 10 samples**: ~1 second
- **Total time for 100 samples**: ~10 seconds
- **Cost**: Free (no API calls)
- **Accuracy**: Good for obvious cases

### GPT Labeling Method

- **Time per evaluation**: 2-5 seconds (API latency plus rate limiting)
- **Total time for 10 samples**: 20-50 seconds
- **Total time for 100 samples**: 3-8 minutes
- **Cost**: ~$0.002-0.01 per evaluation (~$0.02-0.10 per 10 samples)
- **Accuracy**: Excellent (semantic understanding)
- **Limitation**: 30 RPM Groq rate limit

### Hybrid Method

- **Time per evaluation**: 2-5 seconds
- **Cost**: Same as GPT Labeling
- **Benefit**: Both the fast and the accurate metrics in one pass

## Important Considerations

### Rate Limiting

The Groq API has a **30 RPM (requests per minute)** limit:

- Each evaluation = 1 request
- Wait 2 seconds between requests
- For 10 evaluations: ~20-40 seconds
- For 50 evaluations: ~100-200 seconds (2-3 minutes)
- For 100 evaluations: ~200-400 seconds (3-7 minutes)

### When to Use Each Method

| Scenario | Recommended Method |
|----------|--------------------|
| Quick prototyping | TRACE |
| Small high-quality subset (< 20 samples) | GPT Labeling |
| Large-scale evaluation (100+ samples) | TRACE |
| Need both speed and accuracy | Hybrid on a small subset |
| Production evaluation | TRACE + spot-check with GPT |

### Token Cost Estimation

For Groq's Llama model (~$0.05 per 1M input tokens):

- Average prompt: ~2 KB ≈ 500 input tokens + ~200 output tokens ≈ 700 tokens
- Cost per evaluation: 700 / 1,000,000 × $0.05 ≈ $0.000035
- For 100 evaluations: ~$0.0035 (very cheap!)

**Note**: Exact costs depend on document length and model choice.
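The "wait 2 seconds between requests" pacing above can be sketched as a tiny throttle. This is an illustrative helper under stated assumptions, not the project's implementation (the evaluator is described elsewhere in this guide as handling rate limits with exponential backoff); the `RequestPacer` name and its interface are invented for the example.

```python
import time


class RequestPacer:
    """Illustrative sketch: space API calls at least `min_interval_s` apart.

    With the default 2-second spacing, throughput stays at or below the
    30 RPM Groq limit (60 s / 2 s = 30 requests per minute).
    """

    def __init__(self, min_interval_s: float = 2.0):
        self.min_interval_s = min_interval_s
        self._last_call = 0.0

    def wait(self) -> float:
        """Sleep just long enough to keep the minimum spacing; return the delay used."""
        now = time.monotonic()
        delay = max(0.0, self._last_call + self.min_interval_s - now)
        if delay:
            time.sleep(delay)
        self._last_call = time.monotonic()
        return delay
```

Usage would be to call `pacer.wait()` immediately before each `pipeline.evaluate(...)` call in a loop over test cases, so a batch of 10 GPT-labeling evaluations naturally lands in the ~20-40 second range quoted above.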
## Troubleshooting

### Issue: "evaluation_pipeline module not found"

**Solution**: Ensure `evaluation_pipeline.py` is in the project root directory.

### Issue: GPT Labeling always returns 0.0 scores

**Solution**: Check that the LLM client is properly initialized and returning valid JSON.

### Issue: Rate limit exceeded

**Solution**: The code handles this with exponential backoff. Reduce the number of samples.

### Issue: LLM returns a non-JSON response

**Solution**: Use `temperature=0.0` in LLM calls for deterministic output.

## Integration Checklist

- [x] Created `advanced_rag_evaluator.py` with the GPT labeling implementation
- [x] Created `evaluation_pipeline.py` with the unified interface
- [x] Updated `streamlit_app.py` to support method selection
- [x] Added comprehensive documentation in `docs/GPT_LABELING_EVALUATION.md`
- [x] Tested module imports and basic functionality
- [x] Verified syntax in all files
- [x] Backward compatible with the existing TRACE evaluation
- [x] Handles a missing LLM client gracefully (falls back to TRACE)

## Next Steps (Optional Enhancements)

1. **Caching**: Store evaluation results for identical Q-D-R triplets
2. **Batch Processing**: Evaluate multiple samples in parallel
3. **Custom Prompts**: Allow users to customize GPT labeling prompts
4. **Multi-LLM**: Average labels from multiple LLMs for robustness
5. **Sampling Strategy**: Smart sampling for large datasets
6. **Visualization**: Charts comparing TRACE vs. GPT Labeling results

## API Reference

### UnifiedEvaluationPipeline

```python
class UnifiedEvaluationPipeline:
    def __init__(self, llm_client, chunking_strategy, embedding_model, chunk_size, chunk_overlap)
    def evaluate(self, question, response, retrieved_documents, ground_truth=None, method="trace") -> Dict
    def evaluate_batch(self, test_cases, method="trace") -> Dict

    @staticmethod
    def get_evaluation_methods() -> List[Dict]
```

### AdvancedRAGEvaluator

```python
class AdvancedRAGEvaluator:
    def __init__(self, llm_client, chunking_strategy, embedding_model, chunk_size, chunk_overlap)
    def evaluate(self, question, response, retrieved_documents, ground_truth=None) -> AdvancedTRACEScores
    def evaluate_batch(self, test_cases) -> Dict
```

### DocumentSentencizer

```python
class DocumentSentencizer:
    @staticmethod
    def sentencize_documents(documents: List[str]) -> Tuple[List[Dict], str]

    @staticmethod
    def sentencize_response(response: str) -> Tuple[List[Dict], str]
```

## File Summary

| File | Lines | Purpose | Status |
|------|-------|---------|--------|
| `advanced_rag_evaluator.py` | 380 | GPT labeling evaluator | NEW |
| `evaluation_pipeline.py` | 175 | Unified evaluation interface | NEW |
| `streamlit_app.py` | 927 | Updated UI with method selection | MODIFIED |
| `trace_evaluator.py` | 438 | Original TRACE metrics (unchanged) | UPDATED DOCS |
| `docs/GPT_LABELING_EVALUATION.md` | 500+ | Comprehensive guide | NEW |

## Total Impact

- **New code**: ~550 lines (2 new modules)
- **Modified code**: ~50 lines in `streamlit_app.py`, plus documentation
- **Backward compatible**: Yes; the existing TRACE evaluation still works
- **Breaking changes**: None
- **New dependencies**: None (all already installed)

## Verification Commands

```bash
# Check Python syntax
python -m py_compile advanced_rag_evaluator.py evaluation_pipeline.py

# Run imports test
python -c "from advanced_rag_evaluator import AdvancedRAGEvaluator; from evaluation_pipeline import UnifiedEvaluationPipeline; print('OK')"

# Start Streamlit with the new features
streamlit run streamlit_app.py
```

## Support

For issues with GPT labeling:

1. Check that the LLM client is initialized (`st.session_state.rag_pipeline.llm`)
2. Verify the Groq API key is valid
3. Ensure the rate limit (30 RPM) is respected
4. Check that the LLM response is valid JSON
5. Review `docs/GPT_LABELING_EVALUATION.md` for detailed guidance
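The "valid JSON" checks in the troubleshooting and support sections above can be automated with a small tolerant parser. A minimal sketch under stated assumptions: the regex extraction and the three `REQUIRED_FIELDS` names are illustrative only; the actual schema is defined by `GPTLabelingOutput` in `advanced_rag_evaluator.py`.

```python
import json
import re

# Hypothetical field names, for illustration only; the project's real
# schema lives in GPTLabelingOutput (advanced_rag_evaluator.py).
REQUIRED_FIELDS = {
    "relevant_sentence_keys",
    "all_utilized_sentence_keys",
    "sentence_support_information",
}


def parse_labeling_response(raw: str) -> dict:
    """Extract and validate the JSON object from an LLM reply.

    LLMs sometimes wrap JSON in markdown fences or surrounding prose,
    so pull out the outermost {...} span before parsing.
    """
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in LLM response")
    data = json.loads(match.group(0))
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"Missing fields: {sorted(missing)}")
    return data
```

A helper like this makes the "GPT Labeling always returns 0.0 scores" symptom easier to diagnose: a raised `ValueError` with the offending reply points to a prompt or temperature problem rather than a metric bug.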