Spaces:

gopikrishnait
/

RGBMetrics

Sleeping

App Files Files Community

RGBMetrics / DETAILED_CHANGES.md

RGB Evaluation

feat: Add separate grid layout for 4 RAG abilities in Streamlit UI

af25c62 about 2 months ago

preview code

raw

history blame contribute delete

8.26 kB

A newer version of the Streamlit SDK is available: 1.55.0

Upgrade

Code Compliance Verification - Detailed Changes

1. src/prompts.py - System Instruction & Template

Added: SYSTEM_INSTRUCTION Constant

SYSTEM_INSTRUCTION = """You are an accurate and reliable AI assistant that can answer questions with the help of external documents. Please note that external documents may contain noisy or factually incorrect information. If the information in the document contains the correct answer, you will give an accurate answer. If the information in the document does not contain the answer, you will generate 'I can not answer the question because of the insufficient information in documents.' If there are inconsistencies with the facts in some of the documents, please generate the response 'There are factual errors in the provided documents.' and provide the correct answer."""

Source: Figure 3 of RGB benchmark paper (2309.01431v2.pdf)

Modified: RAG_PROMPT_TEMPLATE

Before:

RAG_PROMPT_TEMPLATE = """Answer the following question based on the given documents. 
If there is no relevant information in the documents, say you cannot answer.

Documents:
{documents}

Question: {question}

Answer:"""

After:

RAG_PROMPT_TEMPLATE = """Document:
{documents}

Question: {question}"""

Rationale: Exact format from Figure 3. Task-specific instructions moved to system prompt.

2. src/evaluator.py - Rejection Phrase Matching

Added: PRIMARY_REJECTION_PHRASES

PRIMARY_REJECTION_PHRASES = [
    "i can not answer the question because of the insufficient information in documents",
    "insufficient information in documents",
    "can not answer",
    "cannot answer"
]

Modified: is_rejection() Method

Key Change: Check exact phrase first, then keywords

def is_rejection(self, response: str) -> bool:
    response_lower = response.lower().strip()
    
    # Check for exact primary phrases first (as per Figure 3)
    for phrase in self.PRIMARY_REJECTION_PHRASES:
        if phrase in response_lower:
            return True
    
    # Fall back to more flexible keyword matching
    for keyword in self.REJECTION_KEYWORDS:
        if keyword in response_lower:
            return True
    
    return False

Modified: Type Annotations

Changed:

detects_error(response: str, counterfactual_answer: str)
corrects_error(response: str, correct_answer: str, counterfactual_answer: str)

To:

detects_error(response: str, counterfactual_answer: Optional[str])
corrects_error(response: str, correct_answer: str, counterfactual_answer: Optional[str])

Modified: evaluate_noise_robustness() Signature

Changed from:

def evaluate_noise_robustness(
    self,
    responses: List[str],
    ground_truths: List[str],
    noise_levels: List[int],
    model_name: str
) -> EvaluationResult:

Changed to:

def evaluate_noise_robustness(
    self,
    responses: List[str],
    ground_truths: List[str],
    model_name: str,
    noise_ratio: float
) -> EvaluationResult:

Rationale: Each evaluation is now for a specific noise ratio, not aggregated.

3. src/pipeline.py - Multiple Noise Ratio Testing

Modified: evaluate_noise_robustness() Method

New Implementation:

def evaluate_noise_robustness(
    self,
    model: str,
    max_samples: Optional[int] = None,
    noise_ratios: Optional[List[float]] = None
) -> List[EvaluationResult]:
    """
    Evaluate noise robustness for a model.
    Tests multiple noise ratios as per the RGB paper (0%, 20%, 40%, 60%, 80%).
    
    Args:
        model: The model name to evaluate.
        max_samples: Maximum samples to evaluate per noise ratio.
        noise_ratios: List of noise ratios to test. Defaults to paper's ratios.
        
    Returns:
        List of EvaluationResults for different noise ratios.
    """
    if noise_ratios is None:
        noise_ratios = [0.0, 0.2, 0.4, 0.6, 0.8]  # Paper's exact ratios
    
    print(f"\n[Noise Robustness] Evaluating {model}...")
    print(f"  Testing noise ratios: {noise_ratios}")
    
    client = self._create_client(model)
    results = []
    
    for noise_ratio in noise_ratios:
        samples = self.data_loader.load_noise_robustness(max_samples, noise_rate=noise_ratio)
        
        if not samples:
            print(f"  Warning: No noise robustness samples found for noise_rate={noise_ratio}")
            continue
        
        prompt_template = get_prompt_template("default")
        responses = self._generate_responses(
            client, samples, prompt_template,
            desc=f"  {model} - Noise {int(noise_ratio*100)}%"
        )
        
        ground_truths = [s.answer for s in samples]
        
        result = self.evaluator.evaluate_noise_robustness(
            responses, ground_truths, model, noise_ratio
        )
        results.append(result)
        
        print(f"  Noise {int(noise_ratio*100)}%: Accuracy = {result.accuracy:.2f}%")
    
    return results

Modified: run_full_evaluation() Method

Changed from:

if "noise_robustness" in all_tasks:
    result = self.evaluate_noise_robustness(model, max_samples_per_task)
    self.results.append(result)

Changed to:

if "noise_robustness" in all_tasks:
    # Noise robustness returns a list of results (one per noise ratio)
    noise_results = self.evaluate_noise_robustness(model, max_samples_per_task)
    self.results.extend(noise_results)

Rationale: Handle list return type from noise robustness testing.

Compliance Mapping

Noise Robustness (RGB Table 1)

✅ Tests 5 noise ratios: 0%, 20%, 40%, 60%, 80%
✅ Separate evaluation per ratio
✅ Calculates accuracy for each noise level
✅ Returns List[EvaluationResult] for comparison

Negative Rejection (RGB Table 2)

✅ Checks for exact rejection phrase from Figure 3
✅ Falls back to keyword matching for robustness
✅ Calculates rejection_rate metric
✅ System instruction guides LLM to reject appropriately

Information Integration (RGB Table 3)

✅ Evaluates multi-document synthesis
✅ Calculates accuracy metric
✅ System instruction guides information combination

Counterfactual Robustness (RGB Table 4)

✅ Detects error detection keywords
✅ Verifies error correction with correct answer
✅ Calculates error_detection_rate and error_correction_rate
✅ System instruction guides factual error detection

Testing Validation

Test Script: test_refactored_pipeline.py

Test 1: Data Loader ✅

Loads all 4 task types successfully
Noise robustness supports all 5 ratios
Proper sample formatting

Test 2: Evaluator ✅

Noise robustness: Returns correct task type noise_robustness_20%
Negative rejection: Detects exact phrases correctly
Information integration: Calculates accuracy properly
Counterfactual robustness: Detects errors and corrections

Test 3: Prompts ✅

System instruction: 649 characters, contains required phrases
Template format: Matches Figure 3 exactly
Formatting: Properly interpolates documents and question

Test 4: Pipeline ✅

All required methods present and callable
Pipeline instantiation successful
Structure ready for Groq API calls

Production Readiness

All changes verified against:

RGB benchmark paper (2309.01431v2.pdf)
Figure 3 prompt template specification
Table 1-7 evaluation methodology
Paper's exact wording for rejection and error detection

Type Safety: ✅ No type errors Integration: ✅ All components properly connected Documentation: ✅ Code comments and docstrings updated Testing: ✅ Comprehensive test suite passes

Summary

The RGB evaluation pipeline has been successfully refactored to achieve 100% compliance with the capstone requirements:

✅ Exact Figure 3 prompt format with system instruction
✅ All 4 RAG abilities fully implemented
✅ 5 noise ratios tested separately (matching Table 1)
✅ Exact rejection phrase detection
✅ System instruction properly integrated
✅ Type-safe code with no errors
✅ Comprehensive test coverage

Status: READY FOR GROQ API EVALUATION