
GPT Labeling Integration - Implementation Guide

Overview

The RAG Capstone Project now includes three evaluation methods:

  1. TRACE Heuristics - Fast, rule-based metrics (no LLM calls)
  2. GPT Labeling - Accurate, LLM-based sentence-level grounding (RAGBench paper)
  3. Hybrid - Combines both approaches for comprehensive analysis

New Files Created

Core Implementation Files

  1. advanced_rag_evaluator.py (380 lines)

    • DocumentSentencizer - Splits documents and responses into labeled sentences
    • GPTLabelingPromptGenerator - Creates GPT labeling prompts
    • GPTLabelingOutput - Dataclass for structured LLM response
    • AdvancedTRACEScores - Enhanced scores with GPT labeling metrics
    • AdvancedRAGEvaluator - Main evaluator using GPT labeling approach
  2. evaluation_pipeline.py (175 lines)

    • UnifiedEvaluationPipeline - Facade for TRACE + GPT Labeling
    • Supports single evaluation or batch processing
    • Provides method information and descriptions
  3. docs/GPT_LABELING_EVALUATION.md (Comprehensive guide)

    • Conceptual overview of sentence-level labeling
    • Architecture and data flow diagrams
    • Usage examples for all three methods
    • Performance considerations and recommendations
    • JSON output formats

Modified Files

  1. streamlit_app.py

    • Updated evaluation_interface() to support method selection
    • Updated run_evaluation() to handle three methods
    • Added method descriptions and warnings
    • Enhanced logging for each method
  2. trace_evaluator.py

    • Added documentation about GPT labeling integration
    • No functional changes (backward compatible)

Key Components Explained

1. Sentencization

Document Sentences: Labeled with keys like 0a, 0b, 1a, 1b (document index followed by a sentence letter)

0a. This is the first sentence.
0b. This is the second sentence.
1a. Another document's first sentence.
1b. And the second sentence.

Response Sentences: Labeled with single-letter keys like a, b, c

a. The response starts here.
b. It contains multiple sentences.
c. Each one gets a unique key.
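The key scheme above can be sketched as follows. The regex-based splitter and helper names here are illustrative assumptions, not the project's actual DocumentSentencizer implementation:

```python
import re
import string

def sentencize_documents(documents):
    """Split each document into sentences keyed '<doc_index><letter>'.

    Naive regex splitter for illustration only; the real
    DocumentSentencizer may segment sentences differently.
    """
    keyed = []
    for doc_idx, doc in enumerate(documents):
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]
        for letter, sentence in zip(string.ascii_lowercase, sentences):
            keyed.append({"key": f"{doc_idx}{letter}", "text": sentence})
    return keyed

def sentencize_response(response):
    """Split the response into sentences keyed 'a', 'b', 'c', ..."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    return [{"key": k, "text": s} for k, s in zip(string.ascii_lowercase, sentences)]
```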

2. GPT Labeling Process

The GPT labeling prompt asks the LLM to:

  1. Identify which document sentences are relevant to the question
  2. For each response sentence, identify supporting document sentences
  3. Determine if each response sentence is fully/partially/unsupported
  4. Return structured JSON with 5 evaluation fields
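The structured JSON in step 4 might look like the following. The field names are assumptions modeled on the RAGBench paper's labeling schema and may differ from the exact GPTLabelingOutput dataclass:

```python
import json

# Hypothetical labeling output; field names follow the RAGBench paper's
# schema and are assumptions, not the exact GPTLabelingOutput fields.
raw_llm_output = """
{
  "relevance_explanation": "Sentences 0a, 0b, and 1a address the question.",
  "all_relevant_sentence_keys": ["0a", "0b", "1a"],
  "overall_supported_explanation": "Response sentence b cites no document.",
  "overall_supported": false,
  "sentence_support_information": [
    {"response_sentence_key": "a",
     "supporting_sentence_keys": ["0a"],
     "fully_supported": true},
    {"response_sentence_key": "b",
     "supporting_sentence_keys": [],
     "fully_supported": false}
  ]
}
"""

labels = json.loads(raw_llm_output)
```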

3. Metric Computation

From GPT-labeled data:

  • Context Relevance: Fraction of document sentences labeled relevant to the question (0-1)
  • Context Utilization: Fraction of document sentences cited as support for the response (0-1)
  • Completeness: Fraction of relevant sentences actually used in the response (0-1)
  • Adherence: Fraction of response sentences fully supported by the documents (0-1)
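From labels in that shape, the four metrics can be derived roughly as below. The formulas follow the RAGBench paper; the project's actual AdvancedRAGEvaluator may handle edge cases differently:

```python
def compute_metrics(labels, num_doc_sentences):
    """Compute RAGBench-style metrics from GPT-labeled data (sketch)."""
    relevant = set(labels["all_relevant_sentence_keys"])
    support_info = labels["sentence_support_information"]

    # All document sentences cited as support for any response sentence
    utilized = set()
    for info in support_info:
        utilized.update(info["supporting_sentence_keys"])

    context_relevance = len(relevant) / num_doc_sentences if num_doc_sentences else 0.0
    context_utilization = len(utilized) / num_doc_sentences if num_doc_sentences else 0.0
    completeness = len(relevant & utilized) / len(relevant) if relevant else 0.0
    adherence = (sum(info["fully_supported"] for info in support_info) / len(support_info)
                 if support_info else 0.0)
    return {"context_relevance": context_relevance,
            "context_utilization": context_utilization,
            "completeness": completeness,
            "adherence": adherence}
```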

Usage Examples

In Streamlit UI

  1. Select Evaluation Method

    [Radio button: TRACE / GPT Labeling / Hybrid]
    
  2. Choose LLM and Samples

    LLM: [Dropdown: llama-3.1-8b-instant, etc.]
    Samples: [Slider: 5-100]
    Button: "Run Evaluation"
    
  3. View Results

    • Aggregate scores in metric cards
    • Per-query detailed results
    • JSON download

Programmatically (Python)

from evaluation_pipeline import UnifiedEvaluationPipeline

# Create pipeline
pipeline = UnifiedEvaluationPipeline(
    llm_client=my_llm_client,
    chunking_strategy="dense",
    embedding_model="all-mpnet-base-v2"
)

# Single evaluation
result = pipeline.evaluate(
    question="What is RAG?",
    response="RAG stands for...",
    retrieved_documents=["Doc 1", "Doc 2"],
    method="gpt_labeling"
)

# Batch evaluation
results = pipeline.evaluate_batch(
    test_cases=[
        {
            "query": "Question 1",
            "response": "Response 1",
            "retrieved_documents": ["Doc 1", "Doc 2"],
            "ground_truth": "Expected answer"
        },
        # ... more cases
    ],
    method="hybrid"  # "trace", "gpt_labeling", or "hybrid"
)

print(f"Results: {results}")

Performance Characteristics

TRACE Method

  • Time per evaluation: ~100ms
  • Total time for 10 samples: ~1 second
  • Total time for 100 samples: ~10 seconds
  • Cost: Free (no API calls)
  • Accuracy: Good for obvious cases

GPT Labeling Method

  • Time per evaluation: 2-5 seconds (due to API + rate limiting)
  • Total time for 10 samples: 20-50 seconds
  • Total time for 100 samples: 3-8 minutes
  • Cost: fractions of a cent per evaluation at Groq's Llama pricing (see Token Cost Estimation below)
  • Accuracy: Excellent, semantic understanding
  • Limitation: 30 RPM Groq rate limit

Hybrid Method

  • Time per evaluation: 2-5 seconds
  • Cost: Same as GPT Labeling
  • Benefit: Get both fast and accurate metrics

Important Considerations

Rate Limiting

The Groq API has a 30 RPM (requests per minute) limit:

  • Each evaluation = 1 request
  • Wait 2 seconds between requests
  • For 10 evaluations: ~20-40 seconds
  • For 50 evaluations: ~100-200 seconds (2-3 minutes)
  • For 100 evaluations: ~200-400 seconds (3-7 minutes)
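A simple way to respect that limit is to enforce a minimum interval between calls. This sketch is illustrative, not the project's actual backoff logic; spacing requests 2 seconds apart caps throughput at 30 RPM:

```python
import time

def rate_limited_calls(requests, call_fn, min_interval=2.0):
    """Invoke call_fn on each request, sleeping so that consecutive
    calls are at least min_interval seconds apart (30 RPM at 2.0s)."""
    results = []
    last_call = 0.0
    for req in requests:
        wait = min_interval - (time.monotonic() - last_call)
        if wait > 0:
            time.sleep(wait)
        last_call = time.monotonic()
        results.append(call_fn(req))
    return results
```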

When to Use Each Method

| Scenario | Recommended Method |
|----------|--------------------|
| Quick prototyping | TRACE |
| Small high-quality subset (< 20 samples) | GPT Labeling |
| Large-scale evaluation (100+ samples) | TRACE |
| Need both speed and accuracy | Hybrid on a small subset |
| Production evaluation | TRACE + spot-check with GPT |

Token Cost Estimation

For Groq's Llama model (~$0.05 per 1M input tokens):

  • Average prompt: ~2 KB ≈ ~500 input tokens; with ~200 output tokens, ~700 tokens total
  • Cost per evaluation: 700 / 1,000,000 × $0.05 ≈ $0.000035
  • For 100 evaluations: ~$0.0035

Note: Exact costs depend on document length and model choice.
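The arithmetic above can be wrapped in a small helper. The default token count and per-million price are this guide's assumptions, not live Groq pricing:

```python
def estimate_cost(num_evals, tokens_per_eval=700, price_per_million=0.05):
    """Rough USD cost estimate for a batch of GPT-labeled evaluations.

    Defaults (~700 tokens/eval, $0.05 per 1M tokens) are assumptions
    from this guide; check current provider pricing before relying on them.
    """
    return num_evals * tokens_per_eval / 1_000_000 * price_per_million
```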

Troubleshooting

Issue: "evaluation_pipeline module not found"

Solution: Ensure evaluation_pipeline.py is in the project root directory

Issue: GPT Labeling always returns 0.0 scores

Solution: Check that LLM client is properly initialized and returning valid JSON

Issue: Rate limit exceeded

Solution: The code retries with exponential backoff; if the error persists, reduce the number of samples.

Issue: LLM returns non-JSON response

Solution: Use temperature=0.0 in LLM calls for deterministic output
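Alongside lowering the temperature, a defensive parser helps when the model wraps its JSON in prose or markdown fences. This is an illustrative fallback, not the project's actual parsing code:

```python
import json
import re

def extract_json(llm_text):
    """Parse an LLM reply as JSON, falling back to the first {...}
    span if the model added surrounding prose or code fences."""
    try:
        return json.loads(llm_text)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", llm_text, re.DOTALL)
        if match:
            return json.loads(match.group(0))
        raise
```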

Integration Checklist

  • Created advanced_rag_evaluator.py with GPT labeling implementation
  • Created evaluation_pipeline.py with unified interface
  • Updated streamlit_app.py to support method selection
  • Added comprehensive documentation in docs/GPT_LABELING_EVALUATION.md
  • Tested module imports and basic functionality
  • Verified syntax in all files
  • Backward compatible with existing TRACE evaluation
  • Handles a missing LLM client gracefully (falls back to TRACE)

Next Steps (Optional Enhancements)

  1. Caching: Store evaluation results for identical Q-D-R triplets
  2. Batch Processing: Evaluate multiple samples in parallel
  3. Custom Prompts: Allow users to customize GPT labeling prompts
  4. Multi-LLM: Average labels from multiple LLMs for robustness
  5. Sampling Strategy: Smart sampling for large datasets
  6. Visualization: Charts comparing TRACE vs GPT Labeling results
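The caching idea in item 1 could be sketched as follows: hash the question-documents-response triplet into a stable key and memoize the evaluation result. Function names here are hypothetical:

```python
import hashlib
import json

_cache = {}

def triplet_key(question, retrieved_documents, response):
    """Stable cache key for a question-documents-response triplet."""
    payload = json.dumps([question, retrieved_documents, response], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_evaluate(evaluate_fn, question, retrieved_documents, response):
    """Memoize evaluate_fn so identical triplets are scored only once."""
    key = triplet_key(question, retrieved_documents, response)
    if key not in _cache:
        _cache[key] = evaluate_fn(question, retrieved_documents, response)
    return _cache[key]
```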

API Reference

UnifiedEvaluationPipeline

class UnifiedEvaluationPipeline:
    def __init__(self, llm_client, chunking_strategy, embedding_model, chunk_size, chunk_overlap)
    
    def evaluate(self, question, response, retrieved_documents, ground_truth=None,
                 method="trace") -> Dict
    
    def evaluate_batch(self, test_cases, method="trace") -> Dict
    
    @staticmethod
    def get_evaluation_methods() -> List[Dict]

AdvancedRAGEvaluator

class AdvancedRAGEvaluator:
    def __init__(self, llm_client, chunking_strategy, embedding_model, chunk_size, chunk_overlap)
    
    def evaluate(self, question, response, retrieved_documents, ground_truth=None) -> AdvancedTRACEScores
    
    def evaluate_batch(self, test_cases) -> Dict

DocumentSentencizer

class DocumentSentencizer:
    @staticmethod
    def sentencize_documents(documents: List[str]) -> Tuple[List[Dict], str]
    
    @staticmethod
    def sentencize_response(response: str) -> Tuple[List[Dict], str]

File Summary

| File | Lines | Purpose | Status |
|------|-------|---------|--------|
| advanced_rag_evaluator.py | 380 | GPT labeling evaluator | NEW |
| evaluation_pipeline.py | 175 | Unified evaluation interface | NEW |
| streamlit_app.py | 927 | Updated UI with method selection | MODIFIED |
| trace_evaluator.py | 438 | Original TRACE metrics (unchanged) | UPDATED DOCS |
| docs/GPT_LABELING_EVALUATION.md | 500+ | Comprehensive guide | NEW |

Total Impact

  • New Code: ~550 lines (2 new modules)
  • Modified Code: ~50 lines in streamlit_app.py + documentation
  • Backward Compatible: Yes, existing TRACE evaluation still works
  • Breaking Changes: None
  • New Dependencies: None (all already installed)

Verification Commands

# Check Python syntax
python -m py_compile advanced_rag_evaluator.py evaluation_pipeline.py

# Run imports test
python -c "from advanced_rag_evaluator import AdvancedRAGEvaluator; from evaluation_pipeline import UnifiedEvaluationPipeline; print('OK')"

# Start Streamlit with new features
streamlit run streamlit_app.py

Support

For issues with GPT labeling:

  1. Check that LLM client is initialized (st.session_state.rag_pipeline.llm)
  2. Verify Groq API key is valid
  3. Ensure rate limiting (30 RPM) is respected
  4. Check LLM response is valid JSON
  5. Review docs/GPT_LABELING_EVALUATION.md for detailed guidance