GPT Labeling Integration - Implementation Guide
Overview
The RAG Capstone Project now includes three evaluation methods:
- TRACE Heuristics - Fast, rule-based metrics (no LLM calls)
- GPT Labeling - Accurate, LLM-based sentence-level grounding (RAGBench paper)
- Hybrid - Combines both approaches for comprehensive analysis
New Files Created
Core Implementation Files
advanced_rag_evaluator.py (380 lines)
- DocumentSentencizer - Splits documents and responses into labeled sentences
- GPTLabelingPromptGenerator - Creates GPT labeling prompts
- GPTLabelingOutput - Dataclass for structured LLM responses
- AdvancedTRACEScores - Enhanced scores with GPT labeling metrics
- AdvancedRAGEvaluator - Main evaluator using the GPT labeling approach

evaluation_pipeline.py (175 lines)
- UnifiedEvaluationPipeline - Facade for TRACE + GPT Labeling
- Supports single evaluation or batch processing
- Provides method information and descriptions
docs/GPT_LABELING_EVALUATION.md (comprehensive guide)
- Conceptual overview of sentence-level labeling
- Architecture and data flow diagrams
- Usage examples for all three methods
- Performance considerations and recommendations
- JSON output formats
Modified Files
streamlit_app.py
- Updated evaluation_interface() to support method selection
- Updated run_evaluation() to handle three methods
- Added method descriptions and warnings
- Enhanced logging for each method

trace_evaluator.py
- Added documentation about GPT labeling integration
- No functional changes (backward compatible)
Key Components Explained
1. Sentencization (Sentence Splitting)
Document Sentences: Labeled with keys like 0a, 0b, 1a, 1b
0a. This is the first sentence.
0b. This is the second sentence.
1a. Another document's first sentence.
1b. And the second sentence.
Response Sentences: Labeled with keys like a, b, c
a. The response starts here.
b. It contains multiple sentences.
c. Each one gets a unique key.
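The keying scheme above can be sketched as follows. The regex split here is a simplified stand-in for whatever splitting logic DocumentSentencizer actually uses, and the return shape (a flat list of dicts) is also an assumption for illustration:

```python
import re
import string

def sentencize_documents(documents):
    """Split each document into sentences keyed as 0a, 0b, 1a, ...
    (document index + letter). Naive regex split; a simplified
    stand-in for the real DocumentSentencizer logic."""
    keyed = []
    for doc_idx, doc in enumerate(documents):
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", doc.strip()) if s]
        for sent_idx, sent in enumerate(sentences):
            keyed.append({"key": f"{doc_idx}{string.ascii_lowercase[sent_idx]}",
                          "text": sent})
    return keyed

def sentencize_response(response):
    """Key response sentences as a, b, c, ..."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    return [{"key": string.ascii_lowercase[i], "text": s}
            for i, s in enumerate(sentences)]
```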
2. GPT Labeling Process
The GPT labeling prompt asks the LLM to:
- Identify which document sentences are relevant to the question
- For each response sentence, identify supporting document sentences
- Determine if each response sentence is fully/partially/unsupported
- Return structured JSON with 5 evaluation fields
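An illustrative shape for that returned JSON is shown below. The field names are assumptions loosely modeled on the RAGBench schema; the actual fields of GPTLabelingOutput may differ:

```python
import json

# Hypothetical example of the labeled JSON an LLM might return.
# Field names are illustrative, not the confirmed schema.
labeled = json.loads("""
{
  "all_relevant_sentence_keys": ["0a", "0b", "1a"],
  "all_utilized_sentence_keys": ["0a", "1a"],
  "sentence_support_information": [
    {"response_sentence_key": "a",
     "supporting_sentence_keys": ["0a"],
     "fully_supported": true},
    {"response_sentence_key": "b",
     "supporting_sentence_keys": [],
     "fully_supported": false}
  ],
  "overall_supported": false
}
""")
```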
3. Metric Computation
From GPT-labeled data:
- Context Relevance: Fraction of relevant document sentences (0-1)
- Context Utilization: Fraction of relevant sentences used (0-1)
- Completeness: Overlap between relevant and utilized (0-1)
- Adherence: Fraction of response sentences with full support (0-1)
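Assuming the labeled data arrives as sets of sentence keys plus a fully-supported flag per response sentence, the four fractions above can be sketched as follows (the exact denominators in the real implementation may differ; completeness is taken here as a Jaccard overlap):

```python
def compute_grounding_metrics(num_doc_sentences, relevant_keys,
                              utilized_keys, support_flags):
    """Derive the four metrics from GPT-labeled data, following the
    fraction definitions above. Inputs are illustrative shapes, not
    the confirmed internal representation."""
    relevant = set(relevant_keys)
    utilized = set(utilized_keys)
    overlap = relevant & utilized
    union = relevant | utilized
    return {
        # fraction of document sentences marked relevant
        "context_relevance": len(relevant) / num_doc_sentences if num_doc_sentences else 0.0,
        # fraction of relevant sentences actually used in the response
        "context_utilization": len(overlap) / len(relevant) if relevant else 0.0,
        # overlap between relevant and utilized (Jaccard, an assumption)
        "completeness": len(overlap) / len(union) if union else 0.0,
        # fraction of response sentences with full support
        "adherence": sum(support_flags) / len(support_flags) if support_flags else 0.0,
    }
```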
Usage Examples
In Streamlit UI
1. Select Evaluation Method
   [Radio button: TRACE / GPT Labeling / Hybrid]
2. Choose LLM and Samples
   LLM: [Dropdown: llama-3.1-8b-instant, etc.]
   Samples: [Slider: 5-100]
   Button: "Run Evaluation"
3. View Results
- Aggregate scores in metric cards
- Per-query detailed results
- JSON download
Programmatically (Python)
```python
from evaluation_pipeline import UnifiedEvaluationPipeline

# Create pipeline
pipeline = UnifiedEvaluationPipeline(
    llm_client=my_llm_client,
    chunking_strategy="dense",
    embedding_model="all-mpnet-base-v2"
)

# Single evaluation
result = pipeline.evaluate(
    question="What is RAG?",
    response="RAG stands for...",
    retrieved_documents=["Doc 1", "Doc 2"],
    method="gpt_labeling"
)

# Batch evaluation
results = pipeline.evaluate_batch(
    test_cases=[
        {
            "query": "Question 1",
            "response": "Response 1",
            "retrieved_documents": ["Doc 1", "Doc 2"],
            "ground_truth": "Expected answer"
        },
        # ... more cases
    ],
    method="hybrid"  # "trace", "gpt_labeling", or "hybrid"
)
print(f"Results: {results}")
```
Performance Characteristics
TRACE Method
- Time per evaluation: ~100ms
- Total time for 10 samples: ~1 second
- Total time for 100 samples: ~10 seconds
- Cost: Free (no API calls)
- Accuracy: Good for obvious cases
GPT Labeling Method
- Time per evaluation: 2-5 seconds (due to API + rate limiting)
- Total time for 10 samples: 20-50 seconds
- Total time for 100 samples: 3-8 minutes
- Cost: model- and prompt-dependent, from a small fraction of a cent on Groq's 8B models up to roughly $0.01 per evaluation on larger models (see Token Cost Estimation)
- Accuracy: Excellent, semantic understanding
- Limitation: 30 RPM Groq rate limit
Hybrid Method
- Time per evaluation: 2-5 seconds
- Cost: Same as GPT Labeling
- Benefit: Get both fast and accurate metrics
Important Considerations
Rate Limiting
The Groq API has a 30 RPM (requests per minute) limit:
- Each evaluation = 1 request
- Wait 2 seconds between requests
- For 10 evaluations: ~20-40 seconds
- For 50 evaluations: ~100-200 seconds (2-3 minutes)
- For 100 evaluations: ~200-400 seconds (3-7 minutes)
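The pacing above amounts to a simple fixed-delay loop. A minimal sketch, where evaluate_one is a placeholder for whatever function issues the actual API request:

```python
import time

def evaluate_with_pacing(samples, evaluate_one, delay_s=2.0):
    """Run evaluations sequentially, sleeping between requests so a
    30 RPM limit is respected (2 s/request = at most 30 requests/min).
    evaluate_one is a hypothetical callable that makes one API call."""
    results = []
    for i, sample in enumerate(samples):
        results.append(evaluate_one(sample))
        if i < len(samples) - 1:  # no need to sleep after the last request
            time.sleep(delay_s)
    return results
```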
When to Use Each Method
| Scenario | Recommended Method |
|---|---|
| Quick prototyping | TRACE |
| Small high-quality subset (< 20 samples) | GPT Labeling |
| Large-scale evaluation (100+ samples) | TRACE |
| Need both speed and accuracy | Hybrid on small subset |
| Production evaluation | TRACE + spot-check with GPT |
Token Cost Estimation
For Groq's Llama model (~$0.05 per 1M input tokens):
- Average prompt: ~2KB = ~500 tokens input + ~200 output = ~700 tokens
- Cost per evaluation: 700 / 1M * $0.05 = $0.000035
- For 100 evaluations: ~$0.0035 (very cheap!)
Note: Exact costs depend on document length and model choice.
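The arithmetic above as a tiny helper; the defaults encode the same assumptions (~700 tokens per evaluation at ~$0.05 per 1M tokens), so real costs will vary with document length and model:

```python
def estimate_cost(num_evals, tokens_per_eval=700, usd_per_million_tokens=0.05):
    """Back-of-envelope token cost estimate. Defaults mirror the
    assumptions in the text; adjust for your model and prompt size."""
    return num_evals * tokens_per_eval / 1_000_000 * usd_per_million_tokens
```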
Troubleshooting
Issue: "evaluation_pipeline module not found"
Solution: Ensure evaluation_pipeline.py is in the project root directory
Issue: GPT Labeling always returns 0.0 scores
Solution: Check that LLM client is properly initialized and returning valid JSON
Issue: Rate limit exceeded
Solution: The code handles this with exponential backoff. Reduce number of samples.
Issue: LLM returns non-JSON response
Solution: Set temperature=0.0 and explicitly instruct the model to return only JSON; even then, parse defensively, since output is not guaranteed to be well-formed
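One defensive parsing approach, sketched under the assumption that the model may wrap its JSON in prose or markdown code fences:

```python
import json
import re

def extract_json(llm_text):
    """Pull the first JSON object out of an LLM reply, tolerating
    surrounding prose or code fences. Returns None if nothing
    parseable is found."""
    match = re.search(r"\{.*\}", llm_text, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
```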
Integration Checklist
- Created advanced_rag_evaluator.py with the GPT labeling implementation
- Created evaluation_pipeline.py with the unified interface
- Updated streamlit_app.py to support method selection
- Added comprehensive documentation in docs/GPT_LABELING_EVALUATION.md
- Tested module imports and basic functionality
- Verified syntax in all files
- Backward compatible with existing TRACE evaluation
- Handles a missing LLM client gracefully (falls back to TRACE if unavailable)
Next Steps (Optional Enhancements)
- Caching: Store evaluation results for identical Q-D-R triplets
- Batch Processing: Evaluate multiple samples in parallel
- Custom Prompts: Allow users to customize GPT labeling prompts
- Multi-LLM: Average labels from multiple LLMs for robustness
- Sampling Strategy: Smart sampling for large datasets
- Visualization: Charts comparing TRACE vs GPT Labeling results
API Reference
UnifiedEvaluationPipeline
```python
class UnifiedEvaluationPipeline:
    def __init__(self, llm_client, chunking_strategy, embedding_model,
                 chunk_size, chunk_overlap)
    def evaluate(self, question, response, retrieved_documents,
                 ground_truth=None, method="trace") -> Dict
    def evaluate_batch(self, test_cases, method="trace") -> Dict
    @staticmethod
    def get_evaluation_methods() -> List[Dict]
```
AdvancedRAGEvaluator
```python
class AdvancedRAGEvaluator:
    def __init__(self, llm_client, chunking_strategy, embedding_model,
                 chunk_size, chunk_overlap)
    def evaluate(self, question, response, retrieved_documents,
                 ground_truth=None) -> AdvancedTRACEScores
    def evaluate_batch(self, test_cases) -> Dict
```
DocumentSentencizer
```python
class DocumentSentencizer:
    @staticmethod
    def sentencize_documents(documents: List[str]) -> Tuple[List[Dict], str]
    @staticmethod
    def sentencize_response(response: str) -> Tuple[List[Dict], str]
```
File Summary
| File | Lines | Purpose | Status |
|---|---|---|---|
| advanced_rag_evaluator.py | 380 | GPT labeling evaluator | NEW |
| evaluation_pipeline.py | 175 | Unified evaluation interface | NEW |
| streamlit_app.py | 927 | Updated UI with method selection | MODIFIED |
| trace_evaluator.py | 438 | Original TRACE metrics (code unchanged) | UPDATED DOCS |
| docs/GPT_LABELING_EVALUATION.md | 500+ | Comprehensive guide | NEW |
Total Impact
- New Code: ~550 lines (2 new modules)
- Modified Code: ~50 lines in streamlit_app.py + documentation
- Backward Compatible: Yes, existing TRACE evaluation still works
- Breaking Changes: None
- New Dependencies: None (all already installed)
Verification Commands
```shell
# Check Python syntax
python -m py_compile advanced_rag_evaluator.py evaluation_pipeline.py

# Run imports test
python -c "from advanced_rag_evaluator import AdvancedRAGEvaluator; from evaluation_pipeline import UnifiedEvaluationPipeline; print('OK')"

# Start Streamlit with new features
streamlit run streamlit_app.py
```
Support
For issues with GPT labeling:
- Check that the LLM client is initialized (st.session_state.rag_pipeline.llm)
- Verify the Groq API key is valid
- Ensure rate limiting (30 RPM) is respected
- Check that the LLM response is valid JSON
- Review docs/GPT_LABELING_EVALUATION.md for detailed guidance