# GPT Labeling Evaluation - Quick Start Guide

## 🎯 In 30 Seconds

The RAG project now has **three evaluation methods** accessible from Streamlit:

1. **TRACE** - Fast, rule-based (~100 ms per evaluation, free)
2. **GPT Labeling** - Accurate, LLM-based (2-5 s per evaluation, ~$0.01 each)
3. **Hybrid** - Both methods combined
## 🚀 Using in Streamlit

### Step 1: Start the App

```bash
streamlit run streamlit_app.py
```

### Step 2: Load Data

- Select a RAGBench dataset
- Load it into the vector store
### Step 3: Run Evaluation

1. Go to the "Evaluation" tab
2. Choose the method:
   ```
   [Radio button] TRACE / GPT Labeling / Hybrid
   ```
3. Set the parameters:
   - LLM: select from the dropdown
   - Samples: slider, 5-500
4. Click "Run Evaluation"
### Step 4: View Results

- Aggregate metrics in cards
- Per-query details in expanders
- Download JSON results (an illustrative shape of the file is sketched below)
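The exact schema of the downloaded file depends on the evaluation method. The snippet below is only a hypothetical illustration of what a results file might contain; the field names are illustrative, not taken from the app, and the numbers simply echo the example results later in this guide.

```python
import json

# Hypothetical results structure - field names are illustrative only,
# not the actual schema produced by the app.
example_results = {
    "method": "gpt_labeling",
    "aggregate": {"context_relevance": 0.88, "adherence": 0.95},
    "per_query": [
        {"question": "What is RAG?", "context_relevance": 0.88, "adherence": 0.95},
    ],
}
print(json.dumps(example_results, indent=2))
```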
## 💻 Using in Code

```python
from evaluation_pipeline import UnifiedEvaluationPipeline

# Initialize the pipeline
pipeline = UnifiedEvaluationPipeline(
    llm_client=my_llm,
    chunking_strategy="dense"
)

# Single evaluation
result = pipeline.evaluate(
    question="What is RAG?",
    response="RAG is a technique...",
    retrieved_documents=["Doc 1", "Doc 2"],
    method="gpt_labeling"  # "trace", "gpt_labeling", or "hybrid"
)

# Batch evaluation
results = pipeline.evaluate_batch(
    test_cases=[{...}, {...}],
    method="trace"  # fast enough for 100+ samples
)
```
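Once a batch run finishes, you will usually want a quick summary. This is a minimal post-processing sketch that assumes `evaluate_batch()` returns a list of per-sample metric dicts keyed by metric name; check `evaluation_pipeline.py` for the actual return type.

```python
# Minimal sketch - assumes each entry in `results` is a dict of
# metric name -> float, which is an assumption about the pipeline's return type.
for i, metrics in enumerate(results):
    average = sum(metrics.values()) / len(metrics)
    print(f"Sample {i}: average {average:.2f} | {metrics}")
```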
## ⚡ Performance Guide

| Method | Speed | Cost / sample | Best for |
|--------|-------|---------------|----------|
| **TRACE** | ~100 ms | Free | Large-scale runs (100+ samples) |
| **GPT Labeling** | 2-5 s | ~$0.01 | Small, high-quality runs (< 20 samples) |
| **Hybrid** | 2-5 s | ~$0.01 | When you need both metric sets |
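For budgeting a run before launching it, the quoted per-sample figures can be turned into a rough estimate. This is a back-of-the-envelope sketch using only the numbers from the table above; for the LLM-based methods the Groq rate limit (see below) can dominate the actual wall-clock time.

```python
def estimate_run(n_samples: int, method: str) -> str:
    """Rough time/cost estimate from the per-sample figures quoted above."""
    per_sample = {
        "trace":        {"seconds": 0.1, "dollars": 0.00},
        "gpt_labeling": {"seconds": 3.5, "dollars": 0.01},  # midpoint of 2-5 s
        "hybrid":       {"seconds": 3.5, "dollars": 0.01},
    }[method]
    minutes = n_samples * per_sample["seconds"] / 60
    dollars = n_samples * per_sample["dollars"]
    return f"{method}: ~{minutes:.1f} min, ~${dollars:.2f}"

print(estimate_run(100, "trace"))        # trace: ~0.2 min, ~$0.00
print(estimate_run(20, "gpt_labeling"))  # gpt_labeling: ~1.2 min, ~$0.20
```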
## 📋 What Each Method Shows

### TRACE Metrics

- Utilization: how much of the retrieved context was used
- Relevance: how relevant the retrieved context was to the question
- Adherence: whether the response avoids hallucinations
- Completeness: whether all necessary information was covered

### GPT Labeling Metrics

- Context Relevance: fraction of the retrieved context that is relevant
- Context Utilization: how much of the relevant context was used
- Completeness: coverage of the relevant information in the response
- Adherence: whether the response is fully supported by the context
## ⚠️ Important Notes

### Rate Limiting

- Groq API: 30 RPM (one request every 2 seconds)
- 10 samples: ~20-50 seconds
- 50 samples: ~2-3 minutes
- 100 samples: ~3-7 minutes
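If you drive the LLM-based evaluator yourself in a loop, a simple client-side throttle keeps you under the 30 RPM limit. This is a minimal sketch assuming one API request per evaluation; the pipeline may already throttle internally, in which case this is unnecessary.

```python
import time

MIN_INTERVAL = 2.0  # seconds between requests, i.e. 30 requests per minute
_last_call = 0.0

def throttled(fn, *args, **kwargs):
    """Call fn, sleeping first if the previous call was under 2 seconds ago."""
    global _last_call
    wait = MIN_INTERVAL - (time.monotonic() - _last_call)
    if wait > 0:
        time.sleep(wait)
    _last_call = time.monotonic()
    return fn(*args, **kwargs)

# Hypothetical usage with the pipeline from the "Using in Code" section:
# result = throttled(pipeline.evaluate, question=q, response=r,
#                    retrieved_documents=docs, method="gpt_labeling")
```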
### When to Use GPT Labeling

✅ Small, high-quality subset (5-20 samples)
✅ Want semantic understanding (not just keywords)
✅ Evaluating a new dataset
❌ Large-scale evaluation (100+ samples) → use TRACE
❌ Budget-conscious → use TRACE
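The same rules of thumb as a tiny helper. The thresholds and the `pick_method` name are purely illustrative, not part of the project API.

```python
def pick_method(n_samples: int, budget_sensitive: bool = False,
                need_both_metric_sets: bool = False) -> str:
    """Suggest an evaluation method using the rules of thumb listed above."""
    if budget_sensitive or n_samples >= 100:
        return "trace"          # large-scale or cost-conscious runs
    if need_both_metric_sets:
        return "hybrid"         # TRACE + GPT Labeling on the same samples
    return "gpt_labeling"       # small, high-quality subset

print(pick_method(500))                              # trace
print(pick_method(10))                               # gpt_labeling
print(pick_method(10, need_both_metric_sets=True))   # hybrid
```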
## 📊 Example Results

### TRACE Output

```
Utilization: 0.75
Relevance: 0.82
Adherence: 0.88
Completeness: 0.79
Average: 0.81
```

### GPT Labeling Output

```
Context Relevance: 0.88
Context Utilization: 0.75
Completeness: 0.82
Adherence: 0.95
Overall Supported: true
Fully Supported Sentences: 3
Partially Supported: 1
Unsupported: 0
```
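The TRACE `Average` in this example is the arithmetic mean of the four metrics; reproducing it from the numbers above:

```python
trace_scores = {"utilization": 0.75, "relevance": 0.82,
                "adherence": 0.88, "completeness": 0.79}
average = sum(trace_scores.values()) / len(trace_scores)
print(f"Average: {average:.2f}")  # Average: 0.81
```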
## 🔧 Troubleshooting

**Q: "Method not found" error?**
A: Ensure `evaluation_pipeline.py` exists in the project root.

**Q: GPT Labeling returns all 0.0?**
A: Check that the LLM client is initialized: `st.session_state.rag_pipeline.llm` (a quick check is sketched below).
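A quick way to verify this from inside the app, assuming the pipeline is stored in session state under the `rag_pipeline` key as shown above:

```python
import streamlit as st

# Sanity check: the GPT Labeling evaluator needs a non-None LLM client.
llm = getattr(st.session_state.get("rag_pipeline"), "llm", None)
st.write("LLM client ready" if llm is not None
         else "LLM client missing - reinitialize the pipeline")
```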
**Q: Too slow for many samples?**
A: Use TRACE instead (roughly 20-50× faster per the timings above, still good accuracy).

**Q: Budget concerns?**
A: Hybrid/GPT Labeling costs ~$0.01 per evaluation, so 1,000 evaluations stay well under $30; at the 30 RPM limit such a run also takes at least ~33 minutes of wall-clock time.
## 📚 Documentation

For detailed information:

- **Conceptual**: see `docs/GPT_LABELING_EVALUATION.md`
- **Technical**: see `docs/IMPLEMENTATION_GUIDE_GPT_LABELING.md`
- **Summary**: see `GPT_LABELING_IMPLEMENTATION_SUMMARY.md`
## 🔍 How GPT Labeling Works (Simple Version)

1. Split each retrieved document into labeled sentences: `0a`, `0b`, `1a`, etc.
2. Split the response into labeled sentences: `a`, `b`, `c`, etc.
3. Ask the configured LLM (via Groq): "Which document sentences support each response sentence?"
4. The model returns JSON with the labeled support information
5. Compute the metrics from the labeled data (more accurate than word overlap)
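A stripped-down sketch of the labeling step only (steps 1-2). The prompt construction and the LLM call live in `advanced_rag_evaluator.py`; the naive regex-based sentence splitting here is an assumption for illustration, not the project's actual splitter.

```python
import re
from string import ascii_lowercase

def split_sentences(text):
    """Very naive sentence splitter (assumption for illustration only)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def label_sentences(documents, response):
    """Label document sentences '0a', '0b', '1a', ... and response sentences 'a', 'b', ..."""
    doc_labels = {
        f"{d}{ascii_lowercase[s]}": sent
        for d, doc in enumerate(documents)
        for s, sent in enumerate(split_sentences(doc))
    }
    resp_labels = {ascii_lowercase[s]: sent
                   for s, sent in enumerate(split_sentences(response))}
    return doc_labels, resp_labels

docs = ["RAG retrieves documents. It then generates an answer from them."]
resp = "RAG stands for retrieval-augmented generation. It grounds answers in retrieved text."
doc_labels, resp_labels = label_sentences(docs, resp)
print(doc_labels)   # {'0a': 'RAG retrieves documents.', '0b': 'It then generates an answer from them.'}
print(resp_labels)  # {'a': 'RAG stands for retrieval-augmented generation.', 'b': 'It grounds answers in retrieved text.'}
```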
## 🔑 API Configuration

Your existing LLM client is used automatically:

- Already configured in `st.session_state.rag_pipeline.llm`
- No additional API keys needed
- The same rate limiting (30 RPM) applies
## ✅ Verification

To verify the installation works:

```bash
python -c "
from advanced_rag_evaluator import AdvancedRAGEvaluator
from evaluation_pipeline import UnifiedEvaluationPipeline
print('Success: GPT Labeling modules installed')
"
```

Expected output: `Success: GPT Labeling modules installed`
## 📞 Support

If GPT Labeling doesn't work:

1. Check that the Groq API key is valid
2. Verify the LLM client is initialized
3. Test with the TRACE method first
4. Check the available rate limit (30 RPM)
5. Review the detailed guides in `docs/`
## 🎉 You're Ready!

Start Streamlit and try the new evaluation methods now:

```bash
streamlit run streamlit_app.py
```

Then go to **Evaluation tab → Select method → Run**.

That's it! Enjoy accurate LLM-based RAG evaluation! 🎉