
# GPT Labeling Evaluation - Quick Start Guide

## 🎯 In 30 Seconds

The RAG project now has three evaluation methods accessible from Streamlit:

1. **TRACE** - fast, rule-based (~100 ms per evaluation, free)
2. **GPT Labeling** - accurate, LLM-based (2-5 s per evaluation, ~$0.01 each)
3. **Hybrid** - both methods combined

## 🚀 Using in Streamlit

### Step 1: Start the App

```bash
streamlit run streamlit_app.py
```

### Step 2: Load Data

- Select a RAGBench dataset
- Load it into the vector store

### Step 3: Run Evaluation

1. Go to the **Evaluation** tab
2. Choose a method (radio button): TRACE / GPT Labeling / Hybrid
3. Set parameters:
   - **LLM**: select from the dropdown
   - **Samples**: slider, 5-500
4. Click **Run Evaluation**

### Step 4: View Results

- Aggregate metrics in cards
- Per-query details in expanders
- Download JSON results

## 💻 Using in Code

```python
from evaluation_pipeline import UnifiedEvaluationPipeline

# Initialize
pipeline = UnifiedEvaluationPipeline(
    llm_client=my_llm,
    chunking_strategy="dense"
)

# Single evaluation
result = pipeline.evaluate(
    question="What is RAG?",
    response="RAG is a technique...",
    retrieved_documents=["Doc 1", "Doc 2"],
    method="gpt_labeling"  # "trace", "gpt_labeling", or "hybrid"
)

# Batch evaluation
results = pipeline.evaluate_batch(
    test_cases=[{...}, {...}],
    method="trace"  # Fast for 100+ samples
)
```

## ⚡ Performance Guide

| Method | Speed | Cost | Best for |
|--------|-------|------|----------|
| TRACE | ~100 ms | Free | Large-scale runs (100+ samples) |
| GPT Labeling | 2-5 s | ~$0.01/eval | Small, high-quality sets (< 20) |
| Hybrid | 2-5 s | ~$0.01/eval | When you need both metric sets |
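As a planning aid, those estimates can be turned into a rough time/cost calculator. The per-call figures below are this guide's estimates (a 3.5 s midpoint for the 2-5 s range), and `estimate` is a hypothetical helper, not part of the pipeline:

```python
# Rough planning helper based on the table above. All figures are the
# guide's estimates, not measured values.

def estimate(method, n_samples, rpm_limit=30):
    per_call_s = {"trace": 0.1, "gpt_labeling": 3.5, "hybrid": 3.5}[method]
    per_call_usd = {"trace": 0.0, "gpt_labeling": 0.01, "hybrid": 0.01}[method]
    # LLM-backed methods are also bounded by the API rate limit
    if method != "trace":
        per_call_s = max(per_call_s, 60.0 / rpm_limit)
    return {"seconds": n_samples * per_call_s, "usd": n_samples * per_call_usd}

estimate("gpt_labeling", 50)  # about 175 s and $0.50
```

For example, 50 GPT Labeling samples come out to roughly 3 minutes and $0.50, which lines up with the rate-limiting numbers later in this guide.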

πŸŽ›οΈ What Each Method Shows

### TRACE Metrics

- **Utilization**: how much of the retrieved context the response used
- **Relevance**: how relevant the retrieved context was to the question
- **Adherence**: whether the response is free of hallucinations
- **Completeness**: whether all necessary information was covered
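To make "rule-based" concrete, here is a minimal token-overlap sketch in the spirit of the Utilization metric. This is illustrative only, not the actual TRACE implementation:

```python
# Naive rule-based check: what fraction of the retrieved context's tokens
# also appear in the response. Real TRACE scoring is more sophisticated.

def utilization(response, retrieved_documents):
    resp_tokens = set(response.lower().split())
    ctx_tokens = set(" ".join(retrieved_documents).lower().split())
    if not ctx_tokens:
        return 0.0
    return len(resp_tokens & ctx_tokens) / len(ctx_tokens)
```

Because it is pure set arithmetic, a check like this runs in well under a millisecond, which is why rule-based scoring scales to hundreds of samples for free.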

### GPT Labeling Metrics

- **Context Relevance**: fraction of the retrieved context that is relevant
- **Context Utilization**: how much of the relevant context was used
- **Completeness**: coverage of the relevant information
- **Adherence**: whether the response is fully supported by the context
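These metrics fall out of sentence-level labels. A minimal sketch, assuming a label format like the one described later in this guide; the function and its inputs are illustrative, not the evaluator's real API:

```python
# Illustrative only: compute metrics from sentence-level support labels.
# Keys like "0a" name document sentences; "a", "b", ... name response sentences.

def metrics_from_labels(all_doc_keys, relevant_keys, used_keys, support):
    """
    all_doc_keys:  every labeled document-sentence key, e.g. ["0a", "0b", "1a"]
    relevant_keys: doc sentences judged relevant to the question
    used_keys:     relevant doc sentences actually reflected in the response
    support:       {"a": "full" | "partial" | "none", ...} per response sentence
    """
    context_relevance = len(relevant_keys) / len(all_doc_keys) if all_doc_keys else 0.0
    context_utilization = len(used_keys) / len(relevant_keys) if relevant_keys else 0.0
    full = sum(1 for v in support.values() if v == "full")
    adherence = full / len(support) if support else 0.0
    # Completeness is computed analogously: coverage of the relevant information.
    return {
        "context_relevance": context_relevance,
        "context_utilization": context_utilization,
        "adherence": adherence,
    }
```

The key difference from TRACE: the numerators and denominators count LLM-judged sentences rather than overlapping words, which is what makes the scores semantic.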

## ⚠️ Important Notes

### Rate Limiting

- Groq API: 30 RPM (one request every 2 seconds)
- 10 samples: ~20-50 seconds
- 50 samples: ~2-3 minutes
- 100 samples: ~3-7 minutes
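Client-side, the 30 RPM budget can be respected with a simple throttle like this sketch (`evaluate_one` is a stand-in for a call to the evaluator, not a real API):

```python
import time

# Sketch of client-side throttling for a 30 requests/minute limit:
# wait until at least MIN_INTERVAL seconds have passed between calls.

MIN_INTERVAL = 60.0 / 30  # 2 seconds between requests at 30 RPM

def throttled_batch(test_cases, evaluate_one):
    results = []
    last_call = 0.0
    for case in test_cases:
        wait = MIN_INTERVAL - (time.monotonic() - last_call)
        if wait > 0:
            time.sleep(wait)
        last_call = time.monotonic()
        results.append(evaluate_one(case))
    return results
```

A loop like this is why 10 samples take ~20+ seconds even when each individual evaluation is fast.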

### When to Use GPT Labeling

- ✅ Small, high-quality subset (5-20 samples)
- ✅ You want semantic understanding (not just keyword overlap)
- ✅ Evaluating a new dataset
- ❌ Large-scale evaluation (100+ samples) → use TRACE
- ❌ Budget-conscious → use TRACE

## 📊 Example Results

### TRACE Output

```text
Utilization:  0.75
Relevance:    0.82
Adherence:    0.88
Completeness: 0.79
Average:      0.81
```

### GPT Labeling Output

```text
Context Relevance:         0.88
Context Utilization:       0.75
Completeness:              0.82
Adherence:                 0.95
Overall Supported:         true
Fully Supported Sentences: 3
Partially Supported:       1
Unsupported:               0
```

## 🔧 Troubleshooting

**Q: "Method not found" error?**
A: Ensure `evaluation_pipeline.py` exists in the project root.

**Q: GPT Labeling returns all 0.0?**
A: Check that the LLM client is initialized: `st.session_state.rag_pipeline.llm`

**Q: Too slow for many samples?**
A: Use TRACE instead (roughly 100x faster, still good accuracy).

**Q: Budget concerns?**
A: Hybrid/GPT Labeling costs ~$0.01 per evaluation, so 1,000 evals run about $10 (and take ~35 minutes at the 30 RPM limit).

## 📚 Documentation

For detailed information:

- Conceptual: see `docs/GPT_LABELING_EVALUATION.md`
- Technical: see `docs/IMPLEMENTATION_GUIDE_GPT_LABELING.md`
- Summary: see `GPT_LABELING_IMPLEMENTATION_SUMMARY.md`

## 🎓 How GPT Labeling Works (Simple Version)

  1. Split documents into labeled sentences: 0a, 0b, 1a, etc.
  2. Split response into labeled sentences: a, b, c, etc.
  3. Ask the labeling LLM (via Groq): "Which document sentences support each response sentence?"
  4. The model returns JSON with labeled support information
  5. Compute metrics from labeled data (more accurate than word overlap)
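Steps 1-2 above can be sketched with a naive sentence splitter (the real evaluator's splitting and label format may differ):

```python
import re

def label_sentences(text, prefix=""):
    """Naively split text on sentence-ending punctuation and label the pieces."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [(f"{prefix}{chr(ord('a') + i)}", s) for i, s in enumerate(sentences)]

# Document sentences get keys "0a", "0b", ...; response sentences "a", "b", ...
docs = ["RAG retrieves documents. It then generates answers."]
doc_labeled = [pair for i, d in enumerate(docs) for pair in label_sentences(d, str(i))]
resp_labeled = label_sentences("RAG combines retrieval with generation.")
```

The labeled pairs are then interpolated into the prompt so the model can answer in terms of stable keys ("0a supports a") instead of quoting whole sentences back.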

πŸ” API Configuration

Your existing LLM client is used automatically:

- Already configured in `st.session_state.rag_pipeline.llm`
- No additional API keys are needed
- The same rate limit (30 RPM) applies

## ✅ Verification

To verify installation works:

```bash
python -c "
from advanced_rag_evaluator import AdvancedRAGEvaluator
from evaluation_pipeline import UnifiedEvaluationPipeline
print('Success: GPT Labeling modules installed')
"
```

Expected output: `Success: GPT Labeling modules installed`

## 📞 Support

If GPT Labeling doesn't work:

  1. Check that your Groq API key is valid
  2. Verify the LLM client is initialized
  3. Test with TRACE method first
  4. Check available rate limit (30 RPM)
  5. Review detailed guides in docs/

## 🎉 You're Ready!

Start Streamlit and try the new evaluation methods:

```bash
streamlit run streamlit_app.py
```

Then go to the **Evaluation** tab → select a method → run.

That's it! Enjoy accurate LLM-based RAG evaluation! 🚀