
# GPT Labeling Evaluation - Quick Start Guide

## 🎯 In 30 Seconds

The RAG project now has three evaluation methods accessible from Streamlit:

1. **TRACE** - fast, rule-based (~100 ms per evaluation, free)
2. **GPT Labeling** - accurate, LLM-based (2-5 s per evaluation, ~$0.01 each)
3. **Hybrid** - both methods combined

## 🚀 Using in Streamlit

### Step 1: Start the App

```bash
streamlit run streamlit_app.py
```

### Step 2: Load Data

- Select a RAGBench dataset
- Load it into the vector store

### Step 3: Run Evaluation

1. Go to the **Evaluation** tab
2. Choose a method (radio button): TRACE / GPT Labeling / Hybrid
3. Set parameters:
   - **LLM**: select from the dropdown
   - **Samples**: slider, 5-500
4. Click **Run Evaluation**

### Step 4: View Results

- Aggregate metrics in cards
- Per-query details in expanders
- Download JSON results

## 💻 Using in Code

```python
from evaluation_pipeline import UnifiedEvaluationPipeline

# Initialize
pipeline = UnifiedEvaluationPipeline(
    llm_client=my_llm,
    chunking_strategy="dense"
)

# Single evaluation
result = pipeline.evaluate(
    question="What is RAG?",
    response="RAG is a technique...",
    retrieved_documents=["Doc 1", "Doc 2"],
    method="gpt_labeling"  # "trace", "gpt_labeling", or "hybrid"
)

# Batch evaluation
results = pipeline.evaluate_batch(
    test_cases=[{...}, {...}],
    method="trace"  # Fast for 100+ samples
)
```

## ⚡ Performance Guide

| Method | Speed | Cost | Best for |
|--------|-------|------|----------|
| TRACE | ~100 ms | Free | Large-scale runs (100+ samples) |
| GPT Labeling | 2-5 s | ~$0.01/eval | Small, high-quality sets (< 20) |
| Hybrid | 2-5 s | ~$0.01/eval | When you need both metric sets |
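As a planning aid, those estimates can be turned into a rough time/cost calculator. The per-call figures below are this guide's estimates (a 3.5 s midpoint for the 2-5 s range), and `estimate` is a hypothetical helper, not part of the pipeline:

```python
# Rough planning helper based on the table above. All figures are the
# guide's estimates, not measured values.

def estimate(method, n_samples, rpm_limit=30):
    per_call_s = {"trace": 0.1, "gpt_labeling": 3.5, "hybrid": 3.5}[method]
    per_call_usd = {"trace": 0.0, "gpt_labeling": 0.01, "hybrid": 0.01}[method]
    # LLM-backed methods are also bounded by the API rate limit
    if method != "trace":
        per_call_s = max(per_call_s, 60.0 / rpm_limit)
    return {"seconds": n_samples * per_call_s, "usd": n_samples * per_call_usd}

estimate("gpt_labeling", 50)  # about 175 s and $0.50
```

For example, 50 GPT Labeling samples come out to roughly 3 minutes and $0.50, which lines up with the rate-limiting numbers later in this guide.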

πŸŽ›οΈ What Each Method Shows

### TRACE Metrics

- **Utilization**: how much of the retrieved context the response used
- **Relevance**: how relevant the retrieved context was to the question
- **Adherence**: whether the response is free of hallucinations
- **Completeness**: whether all necessary information was covered
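To make "rule-based" concrete, here is a minimal token-overlap sketch in the spirit of the Utilization metric. This is illustrative only, not the actual TRACE implementation:

```python
# Naive rule-based check: what fraction of the retrieved context's tokens
# also appear in the response. Real TRACE scoring is more sophisticated.

def utilization(response, retrieved_documents):
    resp_tokens = set(response.lower().split())
    ctx_tokens = set(" ".join(retrieved_documents).lower().split())
    if not ctx_tokens:
        return 0.0
    return len(resp_tokens & ctx_tokens) / len(ctx_tokens)
```

Because it is pure set arithmetic, a check like this runs in well under a millisecond, which is why rule-based scoring scales to hundreds of samples for free.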

### GPT Labeling Metrics

- **Context Relevance**: fraction of the retrieved context that is relevant
- **Context Utilization**: how much of the relevant context was used
- **Completeness**: coverage of the relevant information
- **Adherence**: whether the response is fully supported by the context
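These metrics fall out of sentence-level labels. A minimal sketch, assuming a label format like the one described later in this guide; the function and its inputs are illustrative, not the evaluator's real API:

```python
# Illustrative only: compute metrics from sentence-level support labels.
# Keys like "0a" name document sentences; "a", "b", ... name response sentences.

def metrics_from_labels(all_doc_keys, relevant_keys, used_keys, support):
    """
    all_doc_keys:  every labeled document-sentence key, e.g. ["0a", "0b", "1a"]
    relevant_keys: doc sentences judged relevant to the question
    used_keys:     relevant doc sentences actually reflected in the response
    support:       {"a": "full" | "partial" | "none", ...} per response sentence
    """
    context_relevance = len(relevant_keys) / len(all_doc_keys) if all_doc_keys else 0.0
    context_utilization = len(used_keys) / len(relevant_keys) if relevant_keys else 0.0
    full = sum(1 for v in support.values() if v == "full")
    adherence = full / len(support) if support else 0.0
    # Completeness is computed analogously: coverage of the relevant information.
    return {
        "context_relevance": context_relevance,
        "context_utilization": context_utilization,
        "adherence": adherence,
    }
```

The key difference from TRACE: the numerators and denominators count LLM-judged sentences rather than overlapping words, which is what makes the scores semantic.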

## ⚠️ Important Notes

### Rate Limiting

- Groq API: 30 RPM (one request every 2 seconds)
- 10 samples: ~20-50 seconds
- 50 samples: ~2-3 minutes
- 100 samples: ~3-7 minutes
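Client-side, the 30 RPM budget can be respected with a simple throttle like this sketch (`evaluate_one` is a stand-in for a call to the evaluator, not a real API):

```python
import time

# Sketch of client-side throttling for a 30 requests/minute limit:
# wait until at least MIN_INTERVAL seconds have passed between calls.

MIN_INTERVAL = 60.0 / 30  # 2 seconds between requests at 30 RPM

def throttled_batch(test_cases, evaluate_one):
    results = []
    last_call = 0.0
    for case in test_cases:
        wait = MIN_INTERVAL - (time.monotonic() - last_call)
        if wait > 0:
            time.sleep(wait)
        last_call = time.monotonic()
        results.append(evaluate_one(case))
    return results
```

A loop like this is why 10 samples take ~20+ seconds even when each individual evaluation is fast.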

### When to Use GPT Labeling

- ✅ Small, high-quality subset (5-20 samples)
- ✅ You want semantic understanding (not just keyword overlap)
- ✅ Evaluating a new dataset
- ❌ Large-scale evaluation (100+ samples) → use TRACE
- ❌ Budget-conscious → use TRACE

## 📊 Example Results

### TRACE Output

```text
Utilization:  0.75
Relevance:    0.82
Adherence:    0.88
Completeness: 0.79
Average:      0.81
```

### GPT Labeling Output

```text
Context Relevance:         0.88
Context Utilization:       0.75
Completeness:              0.82
Adherence:                 0.95
Overall Supported:         true
Fully Supported Sentences: 3
Partially Supported:       1
Unsupported:               0
```

## 🔧 Troubleshooting

**Q: "Method not found" error?**
A: Ensure `evaluation_pipeline.py` exists in the project root.

**Q: GPT Labeling returns all 0.0?**
A: Check that the LLM client is initialized: `st.session_state.rag_pipeline.llm`

**Q: Too slow for many samples?**
A: Use TRACE instead (roughly 100x faster, still good accuracy).

**Q: Budget concerns?**
A: Hybrid/GPT Labeling costs ~$0.01 per evaluation, so 1,000 evals run about $10 (and take ~35 minutes at the 30 RPM limit).

## 📚 Documentation

For detailed information:

- Conceptual: see `docs/GPT_LABELING_EVALUATION.md`
- Technical: see `docs/IMPLEMENTATION_GUIDE_GPT_LABELING.md`
- Summary: see `GPT_LABELING_IMPLEMENTATION_SUMMARY.md`

## 🎓 How GPT Labeling Works (Simple Version)

  1. Split documents into labeled sentences: 0a, 0b, 1a, etc.
  2. Split response into labeled sentences: a, b, c, etc.
  3. Ask the labeling LLM (via Groq): "Which document sentences support each response sentence?"
  4. The model returns JSON with labeled support information
  5. Compute metrics from labeled data (more accurate than word overlap)
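Steps 1-2 above can be sketched with a naive sentence splitter (the real evaluator's splitting and label format may differ):

```python
import re

def label_sentences(text, prefix=""):
    """Naively split text on sentence-ending punctuation and label the pieces."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [(f"{prefix}{chr(ord('a') + i)}", s) for i, s in enumerate(sentences)]

# Document sentences get keys "0a", "0b", ...; response sentences "a", "b", ...
docs = ["RAG retrieves documents. It then generates answers."]
doc_labeled = [pair for i, d in enumerate(docs) for pair in label_sentences(d, str(i))]
resp_labeled = label_sentences("RAG combines retrieval with generation.")
```

The labeled pairs are then interpolated into the prompt so the model can answer in terms of stable keys ("0a supports a") instead of quoting whole sentences back.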

πŸ” API Configuration

Your existing LLM client is used automatically:

- Already configured in `st.session_state.rag_pipeline.llm`
- No additional API keys are needed
- The same rate limit (30 RPM) applies

## ✅ Verification

To verify installation works:

```bash
python -c "
from advanced_rag_evaluator import AdvancedRAGEvaluator
from evaluation_pipeline import UnifiedEvaluationPipeline
print('Success: GPT Labeling modules installed')
"
```

Expected output: `Success: GPT Labeling modules installed`

## 📞 Support

If GPT Labeling doesn't work:

  1. Check that your Groq API key is valid
  2. Verify the LLM client is initialized
  3. Test with TRACE method first
  4. Check available rate limit (30 RPM)
  5. Review detailed guides in docs/

## 🎉 You're Ready!

Start Streamlit and try the new evaluation methods:

```bash
streamlit run streamlit_app.py
```

Then go to the **Evaluation** tab → select a method → run.

That's it! Enjoy accurate LLM-based RAG evaluation! 🚀