Metrics Calculation Comparison: Original vs Current Implementation

Overview

This document compares how each of the four RGB benchmark metrics is calculated in the original evalue_original.py versus the refactored application in src/evaluator.py and src/pipeline.py.


1. NOISE ROBUSTNESS

Original Implementation (evalue_original.py)

Data Processing

def processdata(instance, noise_rate, passage_num, filename, correct_rate=0):
    # For default datasets (not _int or _fact):
    neg_num = math.ceil(passage_num * noise_rate)
    pos_num = passage_num - neg_num

    query = instance['query']
    ans = instance['answer']
    positive = instance['positive'][:pos_num]  # Select positive passages
    negative = instance['negative'][:neg_num]  # Select negative (noise) passages
    docs = positive + negative
    random.shuffle(docs)
    return query, ans, docs

Evaluation Calculation

# In main script:
tt = 0
for i in results:
    label = i['label']
    if noise_rate == 1 and label[0] == -1:
        tt += 1
    elif 0 not in label and 1 in label:
        tt += 1

accuracy = tt / len(results)

Logic:

  • Accepts answer if: (noise_rate == 1 AND model rejected) OR (no 0 in labels AND 1 in labels)
  • Label 1 = ground-truth answer (or answer component) found in the prediction
  • Label 0 = ground-truth answer (or answer component) not found
  • Label -1 = model rejected ("insufficient information")
  • Issue: one acceptance rule conflates two behaviors (answering correctly under partial noise, rejecting under 100% noise), which makes the single accuracy number harder to interpret
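Restated as a standalone predicate (function name hypothetical, label semantics as in the code above):

```python
def accepts(labels, noise_rate):
    """Original acceptance rule: correct if the model rejected under
    100% noise, or if the labels contain a 1 and no 0."""
    if noise_rate == 1 and labels[0] == -1:
        return True
    return 0 not in labels and 1 in labels

# Accuracy is the fraction of accepted samples:
label_lists = [[1], [1, 1], [0, 1], [-1]]
accuracy = sum(accepts(l, 0.4) for l in label_lists) / len(label_lists)  # 0.5
```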

Noise Levels Tested

  • Single noise_rate parameter (0.0, 0.2, 0.4, 0.6, 0.8 based on CLI usage)
  • No aggregation across multiple noise levels
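For the paper's default of passage_num=5 (the value is an assumption here), the split between relevant and noise passages works out as:

```python
import math

passage_num = 5  # assumed default
splits = {}
for noise_rate in [0.0, 0.2, 0.4, 0.6, 0.8]:
    neg_num = math.ceil(passage_num * noise_rate)  # noise documents
    pos_num = passage_num - neg_num                # relevant documents
    splits[noise_rate] = (pos_num, neg_num)
# 0.0 -> (5, 0), 0.2 -> (4, 1), 0.4 -> (3, 2), 0.6 -> (2, 3), 0.8 -> (1, 4)
```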

Current Implementation (src/evaluator.py)

Evaluation Method

def evaluate_noise_robustness(
    self,
    responses: List[str],
    ground_truths: List[str],
    model_name: str,
    noise_ratio: float
) -> EvaluationResult:
    """Evaluate noise robustness for a specific noise ratio."""
    
    result = EvaluationResult(
        task_type=f"noise_robustness_{int(noise_ratio*100)}%",
        model_name=model_name,
        total_samples=len(responses)
    )
    
    # Calculate accuracy for this noise level
    for response, truth in zip(responses, ground_truths):
        if self.is_correct(response, truth):
            result.correct += 1
        else:
            result.incorrect += 1
    
    return result

Multi-Noise Testing (src/pipeline.py)

def evaluate_noise_robustness(self, model: str, noise_ratios=None):
    if noise_ratios is None:
        noise_ratios = [0.0, 0.2, 0.4, 0.6, 0.8]  # Paper's ratios
    
    results = []
    for noise_ratio in noise_ratios:
        samples = self.data_loader.load_noise_robustness(
            max_samples, 
            noise_rate=noise_ratio
        )
        responses = self._generate_responses(client, samples, prompt_template)
        ground_truths = [s.answer for s in samples]
        
        result = self.evaluator.evaluate_noise_robustness(
            responses, ground_truths, model, noise_ratio
        )
        results.append(result)

Accuracy Property

@property
def accuracy(self) -> float:
    """Calculate accuracy percentage."""
    if self.total_samples == 0:
        return 0.0
    return (self.correct / self.total_samples) * 100

Logic:

  • Simple: Percentage of responses that match ground truth
  • Uses is_correct() which handles:
    • Substring matching
    • Token overlap (80%+)
    • Normalized comparison
  • Advantage: Clear, explainable, aggregates results per noise level
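A minimal sketch of such a multi-strategy check; the real src/evaluator.py may differ in details, and the 80% threshold and normalization rules are taken from the bullets above:

```python
import re

def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def is_correct(response: str, truth: str, overlap_threshold: float = 0.8) -> bool:
    norm_resp = normalize_answer(response)
    norm_truth = normalize_answer(truth)
    # Strategy 1: normalized substring match
    if norm_truth and norm_truth in norm_resp:
        return True
    # Strategy 2: token overlap (>= 80% of truth tokens appear in the response)
    truth_tokens = set(norm_truth.split())
    resp_tokens = set(norm_resp.split())
    if truth_tokens:
        overlap = len(truth_tokens & resp_tokens) / len(truth_tokens)
        return overlap >= overlap_threshold
    return False
```

For instance, `is_correct("Obama, Barack H.", "Barack Obama")` passes via token overlap even though the exact substring is absent.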

Noise Levels Tested

  • Multiple noise levels: 0%, 20%, 40%, 60%, 80% (matching paper)
  • Separate result per noise ratio
  • Can create visualizations (e.g., accuracy vs noise level graph)

Key Differences

| Aspect | Original | Current |
|---|---|---|
| Definition | Complex label-based logic | Simple accuracy calculation |
| Metric | Depends on noise_rate value | Direct: (correct/total) × 100 |
| Aggregation | Single noise ratio at a time | All noise ratios tested together |
| Answer Checking | Exact substring match | Flexible (substring, token overlap, normalized) |
| Output | One accuracy value per run | List of results, one per noise level |
| Visualization | N/A | Noise vs Accuracy graph possible |

2. NEGATIVE REJECTION

Original Implementation (evalue_original.py)

Dataset Mapping

# Uses 'en' or 'zh' dataset (default branch in processdata)
# No explicit handling for negative rejection task
# Original code treats all datasets the same way
if '_int' in filename:
    # Handle information integration
elif '_fact' in filename:
    # Handle counterfactual robustness
else:
    # Default handling (noise robustness + negative rejection mixed)

Rejection Detection

def predict(query, ground_truth, docs, model, system, instruction, temperature, dataset):
    # prediction is the model's generated answer; factlabel is set by the
    # factual-error check (see section 4) and is 0 otherwise
    if '信息不足' in prediction or 'insufficient information' in prediction:
        labels = [-1]  # Mark as rejection
    else:
        labels = checkanswer(prediction, ground_truth)
    return labels, prediction, factlabel

Logic:

  • Only 2 phrases trigger rejection detection
  • Chinese: "信息不足" (lit. "insufficient information")
  • English: "insufficient information"
  • Very limited scope: paraphrased refusals such as "I'm not sure" or "the documents do not mention this" are not counted

Metric Calculation

# No explicit negative rejection metric in original code
# Rejection counted implicitly when label == -1

Issue: No dedicated metric for negative rejection; mixed with other tasks


Current Implementation (src/evaluator.py)

Rejection Detection

PRIMARY_REJECTION_PHRASES = [
    "i can not answer the question because of the insufficient information in documents",
    "insufficient information in documents",
    "can not answer",
    "cannot answer",
]

REJECTION_KEYWORDS = [
    "i don't know", "i cannot", "i can't", "unable to", "not able to",
    "insufficient information", "no information", "cannot determine",
    "cannot answer", "not enough information", "don't have enough",
    "unable to determine", "cannot find", "no relevant", "not mentioned",
    "not provided", "not specified", "unclear", "unknown", "i'm not sure",
    "i am not sure", "cannot be determined", "information is not available",
    "does not provide",
]

def is_rejection(self, response: str) -> bool:
    """Check if response is a rejection."""
    response_lower = response.lower().strip()
    
    # Primary phrases (from Figure 3 of paper)
    for phrase in self.PRIMARY_REJECTION_PHRASES:
        if phrase in response_lower:
            return True
    
    # Secondary keywords (flexible matching)
    for keyword in self.REJECTION_KEYWORDS:
        if keyword in response_lower:
            return True
    
    return False

Evaluation Method

def evaluate_negative_rejection(
    self,
    responses: List[str],
    model_name: str
) -> EvaluationResult:
    """Evaluate negative rejection ability."""
    
    result = EvaluationResult(
        task_type="negative_rejection",
        model_name=model_name,
        total_samples=len(responses)
    )
    
    for response in responses:
        if self.is_rejection(response):
            result.rejected += 1
        else:
            result.incorrect += 1
    
    return result

Metrics

@property
def rejection_rate(self) -> float:
    """Calculate rejection rate percentage."""
    if self.total_samples == 0:
        return 0.0
    return (self.rejected / self.total_samples) * 100

Logic:

  • Dedicated task: Evaluate ONLY on questions with no valid answer
  • Model should reject/refuse
  • Metric: Percentage of correct rejections
  • Assumption: All samples in negative_rejection task should be rejected
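A worked example of the metric with a trimmed keyword list (the EvaluationResult bookkeeping is omitted):

```python
REJECTION_KEYWORDS = ["insufficient information", "cannot answer", "i don't know"]

def is_rejection(response: str) -> bool:
    low = response.lower().strip()
    return any(k in low for k in REJECTION_KEYWORDS)

responses = [
    "I cannot answer the question because of the insufficient information in documents.",
    "The answer is 42.",
    "I don't know.",
    "Paris.",
]
rejected = sum(is_rejection(r) for r in responses)
rejection_rate = rejected / len(responses) * 100  # 50.0
```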

Key Differences

| Aspect | Original | Current |
|---|---|---|
| Phrases | 2 phrases (Chinese + English) | 28 phrases (4 primary + 24 keywords, tiered) |
| Matching | Exact substring | Primary exact + secondary flexible |
| Metric | Implicit (label == -1) | Explicit rejection_rate (%) |
| Task Separation | Mixed with other datasets | Dedicated task type |
| Paper Alignment | Basic | Figure 3 aligned (primary phrases) |
| Output | Count only | Percentage with granular tracking |

3. INFORMATION INTEGRATION

Original Implementation (evalue_original.py)

Dataset Processing

def processdata(instance, noise_rate, passage_num, filename, correct_rate=0):
    if '_int' in filename:
        # Special handling for information integration
        for i in instance['positive']:
            random.shuffle(i)
        
        # Select first element from each positive passage group
        docs = [i[0] for i in instance['positive']]
        
        if len(docs) < pos_num:  # pos_num = passage_num - neg_num, computed before this branch
            # Fill with additional elements from each group
            maxnum = max([len(i) for i in instance['positive']])
            for i in range(1, maxnum):
                for j in instance['positive']:
                    if len(j) > i:
                        docs.append(j[i])
                        if len(docs) == pos_num:
                            break
                if len(docs) == pos_num:
                    break
        
        # Add negative documents if needed
        neg_num = passage_num - len(docs)
        if neg_num > 0:
            negative = instance['negative'][:neg_num]
            docs += negative

Data Structure:

  • Each positive contains MULTIPLE related documents/passages
  • Each passage in a group contains partial information
  • Model must combine information across passages to answer correctly
  • Answer requires synthesizing knowledge from multiple sources
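A toy _int instance illustrating that structure (all contents invented for illustration):

```python
# Each inner list of 'positive' holds passages carrying one piece of the
# answer; answering requires combining one passage from each group.
instance = {
    "query": "Who directed Film A, and in what year was it released?",
    "answer": [["Director X"], ["2001"]],
    "positive": [
        ["Passage: Film A was directed by Director X.",
         "Passage: Director X's credits include Film A."],
        ["Passage: Film A premiered in 2001."],
    ],
    "negative": ["Passage: Film B won an award in 1999."],
}

# Taking the first element of each group yields one passage per sub-answer:
docs = [group[0] for group in instance["positive"]]
```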

Evaluation Calculation

# Same as noise robustness:
# correct if 0 not in labels and 1 in labels (every answer part found)
label = checkanswer(prediction, ground_truth)
# checkanswer marks each answer component 1 if found in the prediction, 0 if not

Logic:

  • Uses same answer checking as noise robustness
  • Metric: Accuracy (correct answers / total)
  • Tests model's ability to integrate information

Current Implementation (src/evaluator.py)

Evaluation Method

def evaluate_information_integration(
    self,
    responses: List[str],
    ground_truths: List[str],
    model_name: str
) -> EvaluationResult:
    """
    Evaluate information integration 
    (ability to combine info from multiple docs).
    """
    
    result = EvaluationResult(
        task_type="information_integration",
        model_name=model_name,
        total_samples=len(responses)
    )
    
    for response, truth in zip(responses, ground_truths):
        if self.is_correct(response, truth):
            result.correct += 1
        else:
            result.incorrect += 1
    
    return result

Metric

@property
def accuracy(self) -> float:
    if self.total_samples == 0:
        return 0.0
    return (self.correct / self.total_samples) * 100

Data Handling (src/data_loader.py)

def load_information_integration(self, max_samples: Optional[int] = None):
    """Load information integration dataset (_int)."""
    # Returns RGBSample objects with:
    # - question
    # - documents (multiple related documents)
    # - answer (requires synthesizing across docs)

Logic:

  • Dedicated task evaluation
  • Metric: Accuracy on questions requiring multi-document synthesis
  • Cleaner: No mixing with noise robustness

Key Differences

| Aspect | Original | Current |
|---|---|---|
| Dataset | Uses '_int' suffix in filename | Dedicated loader method |
| Data Structure | Passages grouped by related info | Clear document lists |
| Metric | Reused from noise robustness | Dedicated method with clear semantics |
| Task Clarity | Implicit (filename-based) | Explicit task type |
| Answer Checking | Substring match | Flexible matching (substring, token overlap) |
| Output | Count of correct | Percentage accuracy |

4. COUNTERFACTUAL ROBUSTNESS

Original Implementation (evalue_original.py)

Dataset Processing

elif '_fact' in filename:
    correct_num = math.ceil(passage_num * correct_rate)
    pos_num = passage_num - neg_num - correct_num
    
    # Sample wrong answers
    indexs = list(range(len(instance['positive'])))
    selected = random.sample(indexs, min(len(indexs), pos_num))
    docs = [instance['positive_wrong'][i] for i in selected]
    
    # Add correct answers
    remain = [i for i in indexs if i not in selected]
    if correct_num > 0 and len(remain) > 0:
        docs += [instance['positive'][i] for i in random.sample(remain, ...)]
    
    # Add negative documents
    if neg_num > 0:
        docs += instance['negative'][:neg_num]

Data Mix:

  • positive_wrong = Incorrect/counterfactual answers in documents
  • positive = Correct answers
  • negative = Irrelevant documents
  • Ratio: (pos_num : correct_num : neg_num) = (wrong : correct : irrelevant)
  • Example: with correct_rate=0.5, about half the passages carry the correct answer; the rest are split between wrong-answer and irrelevant docs
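Plugging in illustrative values (passage_num=5, noise_rate=0.2, correct_rate=0.5, chosen here for the example) gives the concrete document counts:

```python
import math

passage_num, noise_rate, correct_rate = 5, 0.2, 0.5

neg_num = math.ceil(passage_num * noise_rate)        # 1 irrelevant doc
correct_num = math.ceil(passage_num * correct_rate)  # 3 docs with the correct answer
pos_num = passage_num - neg_num - correct_num        # 1 doc with the wrong answer
```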

Evaluation Calculation

if '_fact' in args.dataset:
    fact_tt = 0
    correct_tt = 0
    for i in results:
        if i['factlabel'] == 1:  # Factual error detected
            fact_tt += 1
            if 0 not in i['label']:  # Every part of the correct answer found
                correct_tt += 1
    
    fact_check_rate = fact_tt / len(results)
    if fact_tt > 0:
        correct_rate = correct_tt / fact_tt  # Correction rate
    else:
        correct_rate = 0

Metrics:

  • fact_check_rate: percentage of samples where the model flags a factual error
  • correct_rate: of the flagged samples, the share also counted as corrected (0 not in label)

Detection Logic:

if '事实性错误' in prediction or 'factual errors' in prediction:
    factlabel = 1

Correction Logic:

  • If factlabel == 1 AND 0 not in label
  • Means: the model flagged an error AND its response contained every part of the correct answer
  • Issue: correction is only counted when keyword detection fires first, so corrections made without an explicit "factual errors" phrase are missed

Current Implementation (src/evaluator.py)

Error Detection

ERROR_DETECTION_KEYWORDS = [
    "incorrect", "wrong", "false", "error", "mistake", "inaccurate",
    "not true", "not correct", "factually incorrect", "contradicts",
    "actually", "in fact", "however", "but actually", "the correct answer",
    "should be",
]

def detects_error(self, response: str, counterfactual_answer: Optional[str]) -> bool:
    """Check if model detects an error in counterfactual information."""
    response_lower = response.lower()
    
    # Check for error detection keywords
    for keyword in self.ERROR_DETECTION_KEYWORDS:
        if keyword in response_lower:
            return True
    
    # Check if model explicitly rejects the counterfactual answer
    if counterfactual_answer:
        cf_lower = counterfactual_answer.lower()
        if f"not {cf_lower}" in response_lower or \
           f"{cf_lower} is wrong" in response_lower:
            return True
    
    return False

Error Correction

def corrects_error(self, response: str, correct_answer: str, 
                   counterfactual_answer: Optional[str]) -> bool:
    """Check if model corrects the error with the right answer."""
    
    # First check if provides correct answer
    if not self.is_correct(response, correct_answer):
        return False
    
    # Ensure not just repeating counterfactual
    if counterfactual_answer:
        norm_response = self.normalize_answer(response)
        norm_cf = self.normalize_answer(counterfactual_answer)
        
        # Can contain both (detected and corrected)
        # But must include correct answer
        if norm_cf in norm_response and \
           self.normalize_answer(correct_answer) not in norm_response:
            return False
    
    return True

Evaluation Method

def evaluate_counterfactual_robustness(
    self,
    responses: List[str],
    ground_truths: List[str],
    counterfactual_answers: List[str],
    model_name: str
) -> EvaluationResult:
    """Evaluate counterfactual robustness."""
    
    result = EvaluationResult(
        task_type="counterfactual_robustness",
        model_name=model_name,
        total_samples=len(responses)
    )
    
    for response, truth, cf_answer in zip(responses, ground_truths, counterfactual_answers):
        if self.detects_error(response, cf_answer):
            result.errors_detected += 1
        
        if self.corrects_error(response, truth, cf_answer):
            result.errors_corrected += 1
            result.correct += 1
        else:
            result.incorrect += 1
    
    return result

Metrics

@property
def error_detection_rate(self) -> float:
    """Percentage of errors detected."""
    if self.total_samples == 0:
        return 0.0
    return (self.errors_detected / self.total_samples) * 100

@property
def error_correction_rate(self) -> float:
    """Percentage of errors corrected."""
    if self.total_samples == 0:
        return 0.0
    return (self.errors_corrected / self.total_samples) * 100

Logic:

  1. Detection: Keywords in response OR explicit rejection of counterfactual
  2. Correction: Must:
    • Provide correct answer (is_correct check)
    • Not just repeat counterfactual answer
    • Can mention counterfactual while providing correct answer
  3. Metrics: Both error_detected and error_corrected tracked separately
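A worked example of the two rates, with invented responses and the checks reduced to the keyword and substring ideas above:

```python
responses = [
    "The document says 1998, but that is incorrect; the correct answer is 2003.",
    "The answer is 1998.",
    "Actually, the right year is 2003.",
    "The answer is 2003.",
]
truth, counterfactual = "2003", "1998"

# Trimmed detection keyword list (subset of ERROR_DETECTION_KEYWORDS)
ERROR_KEYWORDS = ["incorrect", "actually", "the correct answer"]

detected = sum(any(k in r.lower() for k in ERROR_KEYWORDS) for r in responses)
corrected = sum(truth in r for r in responses)

error_detection_rate = detected / len(responses) * 100    # 50.0
error_correction_rate = corrected / len(responses) * 100  # 75.0
```

Note that detection and correction are independent counts: the last response corrects without detecting, and a response could detect without correcting.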

Key Differences

| Aspect | Original | Current |
|---|---|---|
| Error Detection | 2 phrases: "factual errors" (EN), "事实性错误" (ZH) | 16 keywords + explicit rejection patterns |
| Correction Check | Indirect: !(0 in label) | Direct: is_correct(response, ground_truth) |
| Metrics | fact_check_rate, correct_rate | error_detection_rate, error_correction_rate |
| Data Input | Mixed documents with wrong/correct answers | Clean counterfactual_answer field |
| Output | Two aggregated percentages | Both rates tracked individually |
| Robustness | Only detects obvious keyword phrases | Keywords + pattern matching + answer verification |

SUMMARY TABLE

| Metric | Original Logic | Current Logic | Key Improvement |
|---|---|---|---|
| Noise Robustness | Complex label-based | Simple accuracy (correct/total × 100) | Clear, comparable, aggregates multiple noise levels |
| Negative Rejection | 2 phrases, mixed with other tasks | 28 phrases, dedicated task | Comprehensive, explicit rejection detection |
| Information Integration | Reused noise robustness logic | Dedicated evaluation method | Clear task separation, better semantic meaning |
| Counterfactual Robustness | Keyword check + indirect correction | Keyword check + direct answer verification | More accurate correction detection |

Functional Differences

Accuracy Calculation

  • Original: Label-based (0/1/-1) with complex logic
  • Current: Direct percentage calculation

Answer Checking

  • Original: Simple substring matching
  • Current: Multi-strategy (substring, token overlap 80%+, normalized)

Task Separation

  • Original: Implicit (filename-based)
  • Current: Explicit (dedicated methods, clear data structures)

Metrics Clarity

  • Original: Manual counting, implicit logic
  • Current: Properties on dataclass, explicit tracking

Error Handling

  • Original: Basic try-except, silent failures
  • Current: Validation, specific logging, graceful degradation

Conclusion

The current implementation provides:

  1. Clarity: Each metric has clear definition and calculation
  2. Robustness: More comprehensive keyword matching, better answer verification
  3. Granularity: Tracks multiple aspects (e.g., both detection AND correction)
  4. Scalability: Easy to add new metrics or datasets
  5. Testability: Each evaluation method is independent and testable