# GPT Labeling Evaluation - Quick Start Guide

## 🎯 In 30 Seconds

The RAG project now has **three evaluation methods** accessible from Streamlit:

1. **TRACE** - Fast, rule-based (~100 ms per evaluation, free)
2. **GPT Labeling** - Accurate, LLM-based (2-5 s per evaluation, ~$0.01 each)
3. **Hybrid** - Both methods combined

## 🚀 Using in Streamlit

### Step 1: Start the App

```bash
streamlit run streamlit_app.py
```

### Step 2: Load Data

- Select a RAGBench dataset
- Load it into the vector store

### Step 3: Run Evaluation

1. Go to the "Evaluation" tab
2. Choose a method:
   ```
   [Radio button] TRACE / GPT Labeling / Hybrid
   ```
3. Set parameters:
   - LLM: select from the dropdown
   - Samples: slider, 5-500
4. Click "Run Evaluation"

### Step 4: View Results

- Aggregate metrics in cards
- Per-query details in expanders
- Download results as JSON

## 💻 Using in Code

```python
from evaluation_pipeline import UnifiedEvaluationPipeline

# Initialize
pipeline = UnifiedEvaluationPipeline(
    llm_client=my_llm,
    chunking_strategy="dense"
)

# Single evaluation
result = pipeline.evaluate(
    question="What is RAG?",
    response="RAG is a technique...",
    retrieved_documents=["Doc 1", "Doc 2"],
    method="gpt_labeling"  # "trace", "gpt_labeling", or "hybrid"
)

# Batch evaluation
results = pipeline.evaluate_batch(
    test_cases=[{...}, {...}],
    method="trace"  # fast for 100+ samples
)
```

## ⚡ Performance Guide

| Method | Speed | Cost | Best For |
|--------|-------|------|----------|
| **TRACE** | ~100 ms | Free | Large-scale runs (100+ samples) |
| **GPT Labeling** | 2-5 s | ~$0.01 | Small, high-quality subsets (< 20) |
| **Hybrid** | 2-5 s | ~$0.01 | When you need both metric sets |

## 🎛️ What Each Method Shows

### TRACE Metrics

- Utilization: how much of the context was used
- Relevance: how relevant the context was
- Adherence: no hallucinations in the response
- Completeness: all necessary info covered

### GPT Labeling Metrics

- Context Relevance: fraction of the context that is relevant
- Context Utilization: how much of the relevant context was used
- Completeness: coverage of the relevant info
- Adherence: response fully supported by the context

## ⚠️ Important Notes

### Rate Limiting

- Groq API: 30 RPM (1 request every 2 seconds)
- 10 samples: ~20-50 seconds
- 50 samples: ~2-3 minutes
- 100 samples: ~3-7 minutes

### When to Use GPT Labeling

✅ Small, high-quality subset (5-20 samples)
✅ You want semantic understanding (not just keyword overlap)
✅ Evaluating a new dataset
❌ Large-scale evaluation (100+ samples) → use TRACE
❌ Budget-conscious → use TRACE

## 📊 Example Results

### TRACE Output

```
Utilization: 0.75
Relevance: 0.82
Adherence: 0.88
Completeness: 0.79
Average: 0.81
```

### GPT Labeling Output

```
Context Relevance: 0.88
Context Utilization: 0.75
Completeness: 0.82
Adherence: 0.95
Overall Supported: true
Fully Supported Sentences: 3
Partially Supported: 1
Unsupported: 0
```

## 🔧 Troubleshooting

**Q: "Method not found" error?**
A: Ensure `evaluation_pipeline.py` exists in the project root.

**Q: GPT Labeling returns all 0.0?**
A: Check that the LLM client is initialized: `st.session_state.rag_pipeline.llm`

**Q: Too slow for many samples?**
A: Use TRACE instead (roughly 100x faster, still good accuracy).

**Q: Budget concerns?**
A: Hybrid/GPT Labeling cost ~$0.01 per evaluation, so roughly $10 for 1,000 evals (which take about 35 minutes at the 30 RPM limit).

## 📚 Documentation

For detailed information:

- **Conceptual**: see `docs/GPT_LABELING_EVALUATION.md`
- **Technical**: see `docs/IMPLEMENTATION_GUIDE_GPT_LABELING.md`
- **Summary**: see `GPT_LABELING_IMPLEMENTATION_SUMMARY.md`

## 🎓 How GPT Labeling Works (Simple Version)

1. Split the documents into labeled sentences: `0a`, `0b`, `1a`, etc.
2. Split the response into labeled sentences: `a`, `b`, `c`, etc.
3. Ask the LLM (via Groq): "Which document sentences support each response sentence?"
4. The LLM returns JSON with labeled support information
5. Compute metrics from the labeled data (more accurate than word overlap)

## 🔐 API Configuration

Your existing LLM client is used automatically:

- Already configured in `st.session_state.rag_pipeline.llm`
- No additional API keys needed
- The same rate limiting (30 RPM) applies

## ✅ Verification

To verify the installation works:

```bash
python -c "
from advanced_rag_evaluator import AdvancedRAGEvaluator
from evaluation_pipeline import UnifiedEvaluationPipeline
print('Success: GPT Labeling modules installed')
"
```

Expected output: `Success: GPT Labeling modules installed`

## 📞 Support

If GPT Labeling doesn't work:

1. Check that the Groq API key is valid
2. Verify the LLM client is initialized
3. Test with the TRACE method first
4. Check the available rate limit (30 RPM)
5. Review the detailed guides in `docs/`

## 🎉 You're Ready!

Start Streamlit and try the new evaluation methods:

```bash
streamlit run streamlit_app.py
```

Then go to **Evaluation tab → Select method → Run**

That's it! Enjoy accurate LLM-based RAG evaluation! 🚀
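
## 🧮 Appendix: Metric Computation Sketch

To make step 5 of "How GPT Labeling Works" concrete, here is a minimal sketch of how the four GPT Labeling metrics could be derived from labeled support data. This is an illustration under assumed conventions, not the project's actual `advanced_rag_evaluator` implementation: the function name, argument names, and support-label values (`"full"` / `"partial"` / `"none"`) are all hypothetical.

```python
# Hypothetical sketch of turning labeled support data into metrics.
# Label conventions assumed here, not taken from the real evaluator.
SUPPORT_SCORE = {"full": 1.0, "partial": 0.5, "none": 0.0}

def compute_metrics(doc_labels, relevant, used, response_support):
    """
    doc_labels:        all document sentence labels, e.g. ["0a", "0b", "1a"]
    relevant:          labels the LLM judged relevant to the question
    used:              labels the LLM cited as supporting the response
    response_support:  per response sentence, "full" / "partial" / "none"
    """
    relevant_set, used_set = set(relevant), set(used)
    n = len(doc_labels)

    # Fraction of retrieved context that is relevant / actually used
    context_relevance = len(relevant_set) / n if n else 0.0
    context_utilization = len(used_set) / n if n else 0.0

    # How much of the relevant context made it into the response
    completeness = (len(used_set & relevant_set) / len(relevant_set)
                    if relevant_set else 0.0)

    # Average support level across response sentences
    scores = [SUPPORT_SCORE[s] for s in response_support]
    adherence = sum(scores) / len(scores) if scores else 0.0

    return {
        "context_relevance": context_relevance,
        "context_utilization": context_utilization,
        "completeness": completeness,
        "adherence": adherence,
    }

metrics = compute_metrics(
    doc_labels=["0a", "0b", "1a", "1b"],
    relevant=["0a", "0b", "1a"],
    used=["0a", "1a"],
    response_support=["full", "full", "partial"],
)
# context_relevance = 3/4, context_utilization = 2/4,
# completeness = 2/3, adherence = (1 + 1 + 0.5) / 3
```

The point of the sentence labels is that each metric reduces to simple set arithmetic over LLM judgments, which is why this approach is more accurate than word-overlap heuristics.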