Metrics Calculation Comparison: Original vs Current Implementation

Overview

This document compares how each of the four RGB benchmark metrics is calculated in the original evalue_original.py versus the refactored application in src/evaluator.py and src/pipeline.py.


1. NOISE ROBUSTNESS

Original Implementation (evalue_original.py)

Data Processing

def processdata(instance, noise_rate, passage_num, filename, correct_rate=0):
    # For default datasets (not _int or _fact):
    neg_num = math.ceil(passage_num * noise_rate)
    pos_num = passage_num - neg_num

    query = instance['query']
    ans = instance['answer']
    positive = instance['positive'][:pos_num]  # Select positive passages
    negative = instance['negative'][:neg_num]  # Select negative (noise) passages
    docs = positive + negative
    random.shuffle(docs)
    return query, ans, docs

Evaluation Calculation

# In main script:
tt = 0
for i in results:
    label = i['label']
    if noise_rate == 1 and label[0] == -1:
        tt += 1
    elif 0 not in label and 1 in label:
        tt += 1

accuracy = tt / len(results)

Logic:

  • Accepts answer if: (noise_rate == 1 AND model rejected) OR (no 0 in labels AND 1 in labels)
  • Label 1 = ground-truth answer (or answer component) found in the prediction
  • Label 0 = ground-truth answer (or answer component) not found
  • Label -1 = model rejected ("insufficient information")
  • Issue: one acceptance rule conflates two behaviors (answering correctly under partial noise, rejecting under 100% noise), which makes the single accuracy number harder to interpret
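Restated as a standalone predicate (function name hypothetical, label semantics as in the code above):

```python
def accepts(labels, noise_rate):
    """Original acceptance rule: correct if the model rejected under
    100% noise, or if the labels contain a 1 and no 0."""
    if noise_rate == 1 and labels[0] == -1:
        return True
    return 0 not in labels and 1 in labels

# Accuracy is the fraction of accepted samples:
label_lists = [[1], [1, 1], [0, 1], [-1]]
accuracy = sum(accepts(l, 0.4) for l in label_lists) / len(label_lists)  # 0.5
```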

Noise Levels Tested

  • Single noise_rate parameter (0.0, 0.2, 0.4, 0.6, 0.8 based on CLI usage)
  • No aggregation across multiple noise levels
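For the paper's default of passage_num=5 (the value is an assumption here), the split between relevant and noise passages works out as:

```python
import math

passage_num = 5  # assumed default
splits = {}
for noise_rate in [0.0, 0.2, 0.4, 0.6, 0.8]:
    neg_num = math.ceil(passage_num * noise_rate)  # noise documents
    pos_num = passage_num - neg_num                # relevant documents
    splits[noise_rate] = (pos_num, neg_num)
# 0.0 -> (5, 0), 0.2 -> (4, 1), 0.4 -> (3, 2), 0.6 -> (2, 3), 0.8 -> (1, 4)
```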

Current Implementation (src/evaluator.py)

Evaluation Method

def evaluate_noise_robustness(
    self,
    responses: List[str],
    ground_truths: List[str],
    model_name: str,
    noise_ratio: float
) -> EvaluationResult:
    """Evaluate noise robustness for a specific noise ratio."""
    
    result = EvaluationResult(
        task_type=f"noise_robustness_{int(noise_ratio*100)}%",
        model_name=model_name,
        total_samples=len(responses)
    )
    
    # Calculate accuracy for this noise level
    for response, truth in zip(responses, ground_truths):
        if self.is_correct(response, truth):
            result.correct += 1
        else:
            result.incorrect += 1
    
    return result

Multi-Noise Testing (src/pipeline.py)

def evaluate_noise_robustness(self, model: str, noise_ratios=None):
    if noise_ratios is None:
        noise_ratios = [0.0, 0.2, 0.4, 0.6, 0.8]  # Paper's ratios
    
    results = []
    for noise_ratio in noise_ratios:
        samples = self.data_loader.load_noise_robustness(
            max_samples, 
            noise_rate=noise_ratio
        )
        responses = self._generate_responses(client, samples, prompt_template)
        ground_truths = [s.answer for s in samples]
        
        result = self.evaluator.evaluate_noise_robustness(
            responses, ground_truths, model, noise_ratio
        )
        results.append(result)

Accuracy Property

@property
def accuracy(self) -> float:
    """Calculate accuracy percentage."""
    if self.total_samples == 0:
        return 0.0
    return (self.correct / self.total_samples) * 100

Logic:

  • Simple: Percentage of responses that match ground truth
  • Uses is_correct() which handles:
    • Substring matching
    • Token overlap (80%+)
    • Normalized comparison
  • Advantage: Clear, explainable, aggregates results per noise level
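A minimal sketch of such a multi-strategy check; the real src/evaluator.py may differ in details, and the 80% threshold and normalization rules are taken from the bullets above:

```python
import re

def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def is_correct(response: str, truth: str, overlap_threshold: float = 0.8) -> bool:
    norm_resp = normalize_answer(response)
    norm_truth = normalize_answer(truth)
    # Strategy 1: normalized substring match
    if norm_truth and norm_truth in norm_resp:
        return True
    # Strategy 2: token overlap (>= 80% of truth tokens appear in the response)
    truth_tokens = set(norm_truth.split())
    resp_tokens = set(norm_resp.split())
    if truth_tokens:
        overlap = len(truth_tokens & resp_tokens) / len(truth_tokens)
        return overlap >= overlap_threshold
    return False
```

For instance, `is_correct("Obama, Barack H.", "Barack Obama")` passes via token overlap even though the exact substring is absent.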

Noise Levels Tested

  • Multiple noise levels: 0%, 20%, 40%, 60%, 80% (matching paper)
  • Separate result per noise ratio
  • Can create visualizations (e.g., accuracy vs noise level graph)

Key Differences

| Aspect | Original | Current |
|---|---|---|
| Definition | Complex label-based logic | Simple accuracy calculation |
| Metric | Depends on noise_rate value | Direct: (correct/total) × 100 |
| Aggregation | Single noise ratio at a time | All noise ratios tested together |
| Answer Checking | Exact substring match | Flexible (substring, token overlap, normalized) |
| Output | One accuracy value per run | List of results, one per noise level |
| Visualization | N/A | Noise vs Accuracy graph possible |

2. NEGATIVE REJECTION

Original Implementation (evalue_original.py)

Dataset Mapping

# Uses 'en' or 'zh' dataset (default branch in processdata)
# No explicit handling for negative rejection task
# Original code treats all datasets the same way
if '_int' in filename:
    # Handle information integration
elif '_fact' in filename:
    # Handle counterfactual robustness
else:
    # Default handling (noise robustness + negative rejection mixed)

Rejection Detection

def predict(query, ground_truth, docs, model, system, instruction, temperature, dataset):
    # prediction is the model's generated answer; factlabel is set by the
    # factual-error check (see section 4) and is 0 otherwise
    if '信息不足' in prediction or 'insufficient information' in prediction:
        labels = [-1]  # Mark as rejection
    else:
        labels = checkanswer(prediction, ground_truth)
    return labels, prediction, factlabel

Logic:

  • Only 2 phrases trigger rejection detection
  • Chinese: "信息不足" (lit. "insufficient information")
  • English: "insufficient information"
  • Very limited scope: paraphrased refusals such as "I'm not sure" or "the documents do not mention this" are not counted

Metric Calculation

# No explicit negative rejection metric in original code
# Rejection counted implicitly when label == -1

Issue: No dedicated metric for negative rejection; mixed with other tasks


Current Implementation (src/evaluator.py)

Rejection Detection

PRIMARY_REJECTION_PHRASES = [
    "i can not answer the question because of the insufficient information in documents",
    "insufficient information in documents",
    "can not answer",
    "cannot answer",
]

REJECTION_KEYWORDS = [
    "i don't know", "i cannot", "i can't", "unable to", "not able to",
    "insufficient information", "no information", "cannot determine",
    "cannot answer", "not enough information", "don't have enough",
    "unable to determine", "cannot find", "no relevant", "not mentioned",
    "not provided", "not specified", "unclear", "unknown", "i'm not sure",
    "i am not sure", "cannot be determined", "information is not available",
    "does not provide",
]

def is_rejection(self, response: str) -> bool:
    """Check if response is a rejection."""
    response_lower = response.lower().strip()
    
    # Primary phrases (from Figure 3 of paper)
    for phrase in self.PRIMARY_REJECTION_PHRASES:
        if phrase in response_lower:
            return True
    
    # Secondary keywords (flexible matching)
    for keyword in self.REJECTION_KEYWORDS:
        if keyword in response_lower:
            return True
    
    return False

Evaluation Method

def evaluate_negative_rejection(
    self,
    responses: List[str],
    model_name: str
) -> EvaluationResult:
    """Evaluate negative rejection ability."""
    
    result = EvaluationResult(
        task_type="negative_rejection",
        model_name=model_name,
        total_samples=len(responses)
    )
    
    for response in responses:
        if self.is_rejection(response):
            result.rejected += 1
        else:
            result.incorrect += 1
    
    return result

Metrics

@property
def rejection_rate(self) -> float:
    """Calculate rejection rate percentage."""
    if self.total_samples == 0:
        return 0.0
    return (self.rejected / self.total_samples) * 100

Logic:

  • Dedicated task: Evaluate ONLY on questions with no valid answer
  • Model should reject/refuse
  • Metric: Percentage of correct rejections
  • Assumption: All samples in negative_rejection task should be rejected
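A worked example of the metric with a trimmed keyword list (the EvaluationResult bookkeeping is omitted):

```python
REJECTION_KEYWORDS = ["insufficient information", "cannot answer", "i don't know"]

def is_rejection(response: str) -> bool:
    low = response.lower().strip()
    return any(k in low for k in REJECTION_KEYWORDS)

responses = [
    "I cannot answer the question because of the insufficient information in documents.",
    "The answer is 42.",
    "I don't know.",
    "Paris.",
]
rejected = sum(is_rejection(r) for r in responses)
rejection_rate = rejected / len(responses) * 100  # 50.0
```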

Key Differences

| Aspect | Original | Current |
|---|---|---|
| Phrases | 2 phrases (Chinese + English) | 28 phrases (4 primary + 24 keywords, tiered) |
| Matching | Exact substring | Primary exact + secondary flexible |
| Metric | Implicit (label == -1) | Explicit rejection_rate (%) |
| Task Separation | Mixed with other datasets | Dedicated task type |
| Paper Alignment | Basic | Figure 3 aligned (primary phrases) |
| Output | Count only | Percentage with granular tracking |

3. INFORMATION INTEGRATION

Original Implementation (evalue_original.py)

Dataset Processing

def processdata(instance, noise_rate, passage_num, filename, correct_rate=0):
    if '_int' in filename:
        # Special handling for information integration
        for i in instance['positive']:
            random.shuffle(i)
        
        # Select first element from each positive passage group
        docs = [i[0] for i in instance['positive']]
        
        if len(docs) < pos_num:  # pos_num = passage_num - neg_num, computed before this branch
            # Fill with additional elements from each group
            maxnum = max([len(i) for i in instance['positive']])
            for i in range(1, maxnum):
                for j in instance['positive']:
                    if len(j) > i:
                        docs.append(j[i])
                        if len(docs) == pos_num:
                            break
                if len(docs) == pos_num:
                    break
        
        # Add negative documents if needed
        neg_num = passage_num - len(docs)
        if neg_num > 0:
            negative = instance['negative'][:neg_num]
            docs += negative

Data Structure:

  • Each positive contains MULTIPLE related documents/passages
  • Each passage in a group contains partial information
  • Model must combine information across passages to answer correctly
  • Answer requires synthesizing knowledge from multiple sources
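A toy _int instance illustrating that structure (all contents invented for illustration):

```python
# Each inner list of 'positive' holds passages carrying one piece of the
# answer; answering requires combining one passage from each group.
instance = {
    "query": "Who directed Film A, and in what year was it released?",
    "answer": [["Director X"], ["2001"]],
    "positive": [
        ["Passage: Film A was directed by Director X.",
         "Passage: Director X's credits include Film A."],
        ["Passage: Film A premiered in 2001."],
    ],
    "negative": ["Passage: Film B won an award in 1999."],
}

# Taking the first element of each group yields one passage per sub-answer:
docs = [group[0] for group in instance["positive"]]
```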

Evaluation Calculation

# Same as noise robustness:
# correct if 0 not in labels and 1 in labels (every answer part found)
label = checkanswer(prediction, ground_truth)
# checkanswer marks each answer component 1 if found in the prediction, 0 if not

Logic:

  • Uses same answer checking as noise robustness
  • Metric: Accuracy (correct answers / total)
  • Tests model's ability to integrate information

Current Implementation (src/evaluator.py)

Evaluation Method

def evaluate_information_integration(
    self,
    responses: List[str],
    ground_truths: List[str],
    model_name: str
) -> EvaluationResult:
    """
    Evaluate information integration 
    (ability to combine info from multiple docs).
    """
    
    result = EvaluationResult(
        task_type="information_integration",
        model_name=model_name,
        total_samples=len(responses)
    )
    
    for response, truth in zip(responses, ground_truths):
        if self.is_correct(response, truth):
            result.correct += 1
        else:
            result.incorrect += 1
    
    return result

Metric

@property
def accuracy(self) -> float:
    if self.total_samples == 0:
        return 0.0
    return (self.correct / self.total_samples) * 100

Data Handling (src/data_loader.py)

def load_information_integration(self, max_samples: Optional[int] = None):
    """Load information integration dataset (_int)."""
    # Returns RGBSample objects with:
    # - question
    # - documents (multiple related documents)
    # - answer (requires synthesizing across docs)

Logic:

  • Dedicated task evaluation
  • Metric: Accuracy on questions requiring multi-document synthesis
  • Cleaner: No mixing with noise robustness

Key Differences

| Aspect | Original | Current |
|---|---|---|
| Dataset | Uses '_int' suffix in filename | Dedicated loader method |
| Data Structure | Passages grouped by related info | Clear document lists |
| Metric | Reused from noise robustness | Dedicated method with clear semantics |
| Task Clarity | Implicit (filename-based) | Explicit task type |
| Answer Checking | Substring match | Flexible matching (substring, token overlap) |
| Output | Count of correct | Percentage accuracy |

4. COUNTERFACTUAL ROBUSTNESS

Original Implementation (evalue_original.py)

Dataset Processing

elif '_fact' in filename:
    correct_num = math.ceil(passage_num * correct_rate)
    pos_num = passage_num - neg_num - correct_num
    
    # Sample wrong answers
    indexs = list(range(len(instance['positive'])))
    selected = random.sample(indexs, min(len(indexs), pos_num))
    docs = [instance['positive_wrong'][i] for i in selected]
    
    # Add correct answers
    remain = [i for i in indexs if i not in selected]
    if correct_num > 0 and len(remain) > 0:
        docs += [instance['positive'][i] for i in random.sample(remain, ...)]
    
    # Add negative documents
    if neg_num > 0:
        docs += instance['negative'][:neg_num]

Data Mix:

  • positive_wrong = Incorrect/counterfactual answers in documents
  • positive = Correct answers
  • negative = Irrelevant documents
  • Ratio: (pos_num : correct_num : neg_num) = (wrong : correct : irrelevant)
  • Example: with correct_rate=0.5, about half the passages carry the correct answer; the rest are split between wrong-answer and irrelevant docs
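Plugging in illustrative values (passage_num=5, noise_rate=0.2, correct_rate=0.5, chosen here for the example) gives the concrete document counts:

```python
import math

passage_num, noise_rate, correct_rate = 5, 0.2, 0.5

neg_num = math.ceil(passage_num * noise_rate)        # 1 irrelevant doc
correct_num = math.ceil(passage_num * correct_rate)  # 3 docs with the correct answer
pos_num = passage_num - neg_num - correct_num        # 1 doc with the wrong answer
```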

Evaluation Calculation

if '_fact' in args.dataset:
    fact_tt = 0
    correct_tt = 0
    for i in results:
        if i['factlabel'] == 1:  # Factual error detected
            fact_tt += 1
            if 0 not in i['label']:  # Every part of the correct answer found
                correct_tt += 1
    
    fact_check_rate = fact_tt / len(results)
    if fact_tt > 0:
        correct_rate = correct_tt / fact_tt  # Correction rate
    else:
        correct_rate = 0

Metrics:

  • fact_check_rate: percentage of samples where the model flags a factual error
  • correct_rate: of the flagged samples, the share also counted as corrected (0 not in label)

Detection Logic:

if '事实性错误' in prediction or 'factual errors' in prediction:
    factlabel = 1

Correction Logic:

  • If factlabel == 1 AND 0 not in label
  • Means: the model flagged an error AND its response contained every part of the correct answer
  • Issue: correction is only counted when keyword detection fires first, so corrections made without an explicit "factual errors" phrase are missed

Current Implementation (src/evaluator.py)

Error Detection

ERROR_DETECTION_KEYWORDS = [
    "incorrect", "wrong", "false", "error", "mistake", "inaccurate",
    "not true", "not correct", "factually incorrect", "contradicts",
    "actually", "in fact", "however", "but actually", "the correct answer",
    "should be",
]

def detects_error(self, response: str, counterfactual_answer: Optional[str]) -> bool:
    """Check if model detects an error in counterfactual information."""
    response_lower = response.lower()
    
    # Check for error detection keywords
    for keyword in self.ERROR_DETECTION_KEYWORDS:
        if keyword in response_lower:
            return True
    
    # Check if model explicitly rejects the counterfactual answer
    if counterfactual_answer:
        cf_lower = counterfactual_answer.lower()
        if f"not {cf_lower}" in response_lower or \
           f"{cf_lower} is wrong" in response_lower:
            return True
    
    return False

Error Correction

def corrects_error(self, response: str, correct_answer: str, 
                   counterfactual_answer: Optional[str]) -> bool:
    """Check if model corrects the error with the right answer."""
    
    # First check if provides correct answer
    if not self.is_correct(response, correct_answer):
        return False
    
    # Ensure not just repeating counterfactual
    if counterfactual_answer:
        norm_response = self.normalize_answer(response)
        norm_cf = self.normalize_answer(counterfactual_answer)
        
        # Can contain both (detected and corrected)
        # But must include correct answer
        if norm_cf in norm_response and \
           self.normalize_answer(correct_answer) not in norm_response:
            return False
    
    return True

Evaluation Method

def evaluate_counterfactual_robustness(
    self,
    responses: List[str],
    ground_truths: List[str],
    counterfactual_answers: List[str],
    model_name: str
) -> EvaluationResult:
    """Evaluate counterfactual robustness."""
    
    result = EvaluationResult(
        task_type="counterfactual_robustness",
        model_name=model_name,
        total_samples=len(responses)
    )
    
    for response, truth, cf_answer in zip(responses, ground_truths, counterfactual_answers):
        if self.detects_error(response, cf_answer):
            result.errors_detected += 1
        
        if self.corrects_error(response, truth, cf_answer):
            result.errors_corrected += 1
            result.correct += 1
        else:
            result.incorrect += 1
    
    return result

Metrics

@property
def error_detection_rate(self) -> float:
    """Percentage of errors detected."""
    if self.total_samples == 0:
        return 0.0
    return (self.errors_detected / self.total_samples) * 100

@property
def error_correction_rate(self) -> float:
    """Percentage of errors corrected."""
    if self.total_samples == 0:
        return 0.0
    return (self.errors_corrected / self.total_samples) * 100

Logic:

  1. Detection: Keywords in response OR explicit rejection of counterfactual
  2. Correction: Must:
    • Provide correct answer (is_correct check)
    • Not just repeat counterfactual answer
    • Can mention counterfactual while providing correct answer
  3. Metrics: Both error_detected and error_corrected tracked separately
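A worked example of the two rates, with invented responses and the checks reduced to the keyword and substring ideas above:

```python
responses = [
    "The document says 1998, but that is incorrect; the correct answer is 2003.",
    "The answer is 1998.",
    "Actually, the right year is 2003.",
    "The answer is 2003.",
]
truth, counterfactual = "2003", "1998"

# Trimmed detection keyword list (subset of ERROR_DETECTION_KEYWORDS)
ERROR_KEYWORDS = ["incorrect", "actually", "the correct answer"]

detected = sum(any(k in r.lower() for k in ERROR_KEYWORDS) for r in responses)
corrected = sum(truth in r for r in responses)

error_detection_rate = detected / len(responses) * 100    # 50.0
error_correction_rate = corrected / len(responses) * 100  # 75.0
```

Note that detection and correction are independent counts: the last response corrects without detecting, and a response could detect without correcting.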

Key Differences

| Aspect | Original | Current |
|---|---|---|
| Error Detection | 2 phrases: "factual errors" (EN), "事实性错误" (ZH) | 16 keywords + explicit rejection patterns |
| Correction Check | Indirect: !(0 in label) | Direct: is_correct(response, ground_truth) |
| Metrics | fact_check_rate, correct_rate | error_detection_rate, error_correction_rate |
| Data Input | Mixed documents with wrong/correct answers | Clean counterfactual_answer field |
| Output | Two aggregated percentages | Both rates tracked individually |
| Robustness | Only detects obvious keyword phrases | Keywords + pattern matching + answer verification |

SUMMARY TABLE

| Metric | Original Logic | Current Logic | Key Improvement |
|---|---|---|---|
| Noise Robustness | Complex label-based | Simple accuracy (correct/total × 100) | Clear, comparable, aggregates multiple noise levels |
| Negative Rejection | 2 phrases, mixed with other tasks | 28 phrases, dedicated task | Comprehensive, explicit rejection detection |
| Information Integration | Reused noise robustness logic | Dedicated evaluation method | Clear task separation, better semantic meaning |
| Counterfactual Robustness | Keyword check + indirect correction | Keyword check + direct answer verification | More accurate correction detection |

Functional Differences

Accuracy Calculation

  • Original: Label-based (0/1/-1) with complex logic
  • Current: Direct percentage calculation

Answer Checking

  • Original: Simple substring matching
  • Current: Multi-strategy (substring, token overlap 80%+, normalized)

Task Separation

  • Original: Implicit (filename-based)
  • Current: Explicit (dedicated methods, clear data structures)

Metrics Clarity

  • Original: Manual counting, implicit logic
  • Current: Properties on dataclass, explicit tracking

Error Handling

  • Original: Basic try-except, silent failures
  • Current: Validation, specific logging, graceful degradation

Conclusion

The current implementation provides:

  1. Clarity: Each metric has clear definition and calculation
  2. Robustness: More comprehensive keyword matching, better answer verification
  3. Granularity: Tracks multiple aspects (e.g., both detection AND correction)
  4. Scalability: Easy to add new metrics or datasets
  5. Testability: Each evaluation method is independent and testable