RGB Evaluation Calculation Explanation

This document explains how each of the four RGB evaluations is calculated in the application, step-by-step through the code.


Overview: EvaluationResult Class

All evaluations return an EvaluationResult object that stores:

from dataclasses import dataclass, field
from typing import Dict

@dataclass
class EvaluationResult:
    task_type: str                           # Type of evaluation task
    model_name: str                          # Name of model being evaluated
    total_samples: int = 0                   # Total number of samples tested
    correct: int = 0                         # Count of correct responses
    incorrect: int = 0                       # Count of incorrect responses
    rejected: int = 0                        # Count of rejections (for negative rejection)
    errors_detected: int = 0                 # Count of errors detected
    errors_corrected: int = 0                # Count of errors corrected
    # Breakdown by noise level; a plain {} default would raise ValueError
    # (mutable default in a dataclass), so use field(default_factory=dict)
    accuracy_by_noise: Dict[int, float] = field(default_factory=dict)

Then it calculates metrics via properties:

@property
def accuracy(self) -> float:
    if self.total_samples == 0:
        return 0.0
    return (self.correct / self.total_samples) * 100

@property
def rejection_rate(self) -> float:
    if self.total_samples == 0:
        return 0.0
    return (self.rejected / self.total_samples) * 100

@property
def error_detection_rate(self) -> float:
    if self.total_samples == 0:
        return 0.0
    return (self.errors_detected / self.total_samples) * 100

@property
def error_correction_rate(self) -> float:
    if self.total_samples == 0:
        return 0.0
    return (self.errors_corrected / self.total_samples) * 100
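As a sanity check on these properties, here is a minimal runnable restatement. `MiniResult` is a stand-in name for this demo (only the fields used here), not a class from the codebase:

```python
from dataclasses import dataclass

@dataclass
class MiniResult:
    # Minimal stand-in for EvaluationResult, just enough for the demo
    total_samples: int = 0
    correct: int = 0
    rejected: int = 0

    @property
    def accuracy(self) -> float:
        if self.total_samples == 0:
            return 0.0
        return (self.correct / self.total_samples) * 100

    @property
    def rejection_rate(self) -> float:
        if self.total_samples == 0:
            return 0.0
        return (self.rejected / self.total_samples) * 100

r = MiniResult(total_samples=50, correct=42, rejected=3)
print(r.accuracy)        # 84.0
print(r.rejection_rate)  # 6.0
print(MiniResult().accuracy)  # 0.0 (guard against division by zero)
```

Because the metrics are properties rather than stored values, updating the counters and re-reading `accuracy` always reflects the latest counts.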

1. NOISE ROBUSTNESS EVALUATION

Method

def evaluate_noise_robustness(
    self,
    responses: List[str],        # Model's responses
    ground_truths: List[str],    # Correct answers
    model_name: str,
    noise_ratio: float           # Noise level (0.0 to 0.8)
) -> EvaluationResult:

Step-by-Step Calculation

Step 1: Create Result Object

result = EvaluationResult(
    task_type=f"noise_robustness_{int(noise_ratio*100)}%",  # e.g., "noise_robustness_40%"
    model_name=model_name,
    total_samples=len(responses)  # Total number of test samples
)
  • Creates a result object with task type like "noise_robustness_40%" to track noise level
  • Records total number of samples tested

Step 2: Loop Through Each Response

for response, truth in zip(responses, ground_truths):
    if self.is_correct(response, truth):
        result.correct += 1
    else:
        result.incorrect += 1
  • Compares each model response against the ground truth answer
  • Increments correct counter if answer matches
  • Increments incorrect counter if answer doesn't match

Step 3: Return Result

  • Final metric calculated via property: accuracy = (correct / total_samples) × 100

How is_correct() Works

def is_correct(self, response: str, ground_truth: str, strict: bool = False) -> bool:
    # Step 1: Normalize both answers
    norm_response = self.normalize_answer(response)    # Lowercase, remove punctuation
    norm_truth = self.normalize_answer(ground_truth)
    
    # Step 2: Basic checks
    if not norm_response or not norm_truth:
        return False
    
    # Step 3: Exact match (if strict mode)
    if strict:
        return norm_response == norm_truth
    
    # Step 4: Substring matching (ground truth in response)
    if norm_truth in norm_response:
        return True
    
    # Step 5: Short answer in long answer
    if len(norm_response) < len(norm_truth) and norm_response in norm_truth:
        return True
    
    # Step 6: Token overlap (80%+)
    truth_tokens = set(norm_truth.split())
    response_tokens = set(norm_response.split())
    
    if len(truth_tokens) > 0:
        overlap = len(truth_tokens & response_tokens) / len(truth_tokens)
        if overlap >= 0.8:  # 80% of truth tokens present in response
            return True
    
    return False

Example:

Ground Truth: "Paris"
Response:     "The capital of France is Paris."

Step 1: Normalize
  - truth:    "paris"
  - response: "the capital of france is paris"

Step 2: Substring check
  - Is "paris" in "the capital of france is paris"? YES ✓
  
Result: CORRECT
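The full cascade can be condensed into a standalone sketch. This is a simplified restatement of the methods quoted in this document (strict mode omitted); `normalize` mirrors the `normalize_answer` helper described later:

```python
import re

def normalize(text: str) -> str:
    # Lowercase, strip trailing punctuation, collapse whitespace
    text = text.lower().strip()
    text = re.sub(r'[.!?,;:]+$', '', text)
    return ' '.join(text.split())

def is_correct(response: str, truth: str) -> bool:
    r, t = normalize(response), normalize(truth)
    if not r or not t:
        return False
    if t in r:                       # ground truth contained in response
        return True
    if len(r) < len(t) and r in t:   # short response contained in longer truth
        return True
    t_tokens, r_tokens = set(t.split()), set(r.split())
    # 80%+ of ground-truth tokens present in the response
    return bool(t_tokens) and len(t_tokens & r_tokens) / len(t_tokens) >= 0.8

print(is_correct("The capital of France is Paris.", "Paris"))  # True
print(is_correct("I think it is Lyon.", "Paris"))              # False
```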

Example: Multiple Noise Levels

Noise Robustness Evaluation (GPT-4):
  0% noise (5 correct docs):   85% accuracy
  20% noise (1 noise doc):     82% accuracy
  40% noise (2 noise docs):    78% accuracy
  60% noise (3 noise docs):    72% accuracy
  80% noise (4 noise docs):    65% accuracy

2. NEGATIVE REJECTION EVALUATION

Method

def evaluate_negative_rejection(
    self,
    responses: List[str],   # Model's responses to questions without answers
    model_name: str
) -> EvaluationResult:

Step-by-Step Calculation

Step 1: Create Result Object

result = EvaluationResult(
    task_type="negative_rejection",
    model_name=model_name,
    total_samples=len(responses)  # Total samples where documents have no answer
)
  • Creates result object for rejection task
  • All samples in this test have NO relevant information in documents

Step 2: Check Each Response for Rejection

for response in responses:
    if self.is_rejection(response):
        result.rejected += 1      # Model correctly rejected
    else:
        result.incorrect += 1     # Model should have rejected but didn't
  • For each model response, checks if it's a rejection
  • Counts proper rejections vs inappropriate answers

Step 3: Return Result

  • Final metric: rejection_rate = (rejected / total_samples) × 100

How is_rejection() Works

def is_rejection(self, response: str) -> bool:
    response_lower = response.lower().strip()
    
    # Primary phrases from Figure 3 of the paper (exact match required)
    PRIMARY_REJECTION_PHRASES = [
        "i can not answer the question because of the insufficient information in documents",
        "insufficient information in documents",
        "can not answer",
        "cannot answer",
    ]
    
    # Step 1: Check primary phrases (paper standard)
    for phrase in PRIMARY_REJECTION_PHRASES:
        if phrase in response_lower:
            return True
    
    # Secondary keywords (flexible matching)
    REJECTION_KEYWORDS = [
        "i don't know", "i cannot", "i can't", "unable to", "not able to",
        "insufficient information", "no information", "cannot determine",
        "cannot answer", "not enough information", "don't have enough",
        "unable to determine", "cannot find", "no relevant", "not mentioned",
        "not provided", "not specified", "unclear", "unknown", "i'm not sure",
        "i am not sure", "cannot be determined", "information is not available",
        "does not provide",
    ]
    
    # Step 2: Check secondary keywords
    for keyword in REJECTION_KEYWORDS:
        if keyword in response_lower:
            return True
    
    return False

Examples:

Response 1: "I cannot answer this question because the documents don't contain relevant information."
- Contains "cannot answer"? YES ✓
- Result: REJECTION (counted as rejected)

Response 2: "Based on the provided documents, I cannot determine the answer."
- Contains "cannot determine"? YES ✓
- Result: REJECTION (counted as rejected)

Response 3: "The documents do not mention this topic, so I cannot provide an answer."
- Contains "i cannot"? YES ✓ (note: "not mentioned" would NOT match "do not mention")
- Result: REJECTION (counted as rejected)

Response 4: "The answer is probably 42 but I'm not sure."
- Contains "i'm not sure"? YES ✓
- Result: REJECTION (counted as rejected)

Response 5: "Based on the information, the answer is London."
- Contains any rejection phrase/keyword? NO ✗
- Result: INCORRECT (model should have rejected but answered instead)
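A stripped-down version of this check, using only a subset of the keyword lists above, reproduces these verdicts:

```python
# Abbreviated subset of the rejection keyword lists shown above
REJECTION_KEYWORDS = [
    "cannot answer", "can not answer", "i cannot", "i can't",
    "cannot determine", "i'm not sure", "not mentioned",
    "insufficient information",
]

def is_rejection(response: str) -> bool:
    # Substring match against lowercased response, as in the full method
    text = response.lower().strip()
    return any(kw in text for kw in REJECTION_KEYWORDS)

samples = [
    "I cannot answer this question because the documents don't contain relevant information.",
    "The answer is probably 42 but I'm not sure.",
    "Based on the information, the answer is London.",
]
print([is_rejection(s) for s in samples])  # [True, True, False]
```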

Example Output

Negative Rejection Evaluation (GPT-4):
  Total samples: 100
  Correctly rejected: 92
  Incorrectly answered: 8
  
  Rejection Rate: 92%

3. INFORMATION INTEGRATION EVALUATION

Method

def evaluate_information_integration(
    self,
    responses: List[str],      # Model's responses
    ground_truths: List[str],  # Correct answers (require synthesis)
    model_name: str
) -> EvaluationResult:

Step-by-Step Calculation

Step 1: Create Result Object

result = EvaluationResult(
    task_type="information_integration",
    model_name=model_name,
    total_samples=len(responses)
)
  • Creates result for information integration task
  • Samples contain multiple documents; answer requires combining them

Step 2: Check Each Response for Correctness

for response, truth in zip(responses, ground_truths):
    if self.is_correct(response, truth):
        result.correct += 1
    else:
        result.incorrect += 1
  • Uses same is_correct() method as noise robustness
  • Checks if model successfully synthesized answer from multiple documents

Step 3: Return Result

  • Final metric: accuracy = (correct / total_samples) × 100

Key Difference from Noise Robustness

| Aspect | Noise Robustness | Information Integration |
|---|---|---|
| Documents | Mix of relevant + noise | Multiple relevant documents with partial info |
| Challenge | Filter out noise | Synthesize from multiple sources |
| Evaluation | Simple accuracy | Simple accuracy (but harder task) |
| Example | 5 docs: 3 relevant + 2 noise | 5 docs: each has partial answer |

Example Scenario

Question: "What are the main causes of climate change?"

Information Integration Dataset:
  Doc 1: "Greenhouse gases from burning fossil fuels..."
  Doc 2: "Deforestation reduces CO2 absorption..."
  Doc 3: "Industrial emissions contribute significantly..."
  Doc 4: "Methane from agriculture is another factor..."
  Doc 5: "Human activities have increased CO2 levels..."

Correct Answer: "Greenhouse gases from fossil fuels, deforestation, 
industrial emissions, and agricultural methane"

Model Response: "The main causes include greenhouse gas emissions from 
burning fossil fuels, loss of forests that absorb CO2, industrial pollution, 
and methane from agriculture."

Evaluation:
  - Checks if response contains all key concepts
  - Token overlap: contains the required terms
  - Result: CORRECT (token overlap of truth tokens meets the ≥ 80% threshold)
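The token-overlap arithmetic behind a verdict like this can be verified in a few lines. The strings here are hypothetical, shortened stand-ins for the real answers:

```python
def token_overlap(truth: str, response: str) -> float:
    # Fraction of ground-truth tokens that also appear in the response
    t = set(truth.lower().split())
    r = set(response.lower().split())
    return len(t & r) / len(t) if t else 0.0

truth = "fossil fuels deforestation industrial emissions methane"
response = "causes include fossil fuels deforestation industrial emissions and livestock"

overlap = token_overlap(truth, response)
print(overlap)          # 5 of 6 truth tokens present, ~0.833
print(overlap >= 0.8)   # True -> counts as correct
```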

Example Output

Information Integration Evaluation (GPT-4):
  Total samples: 100
  Correct responses: 78
  Incorrect responses: 22
  
  Accuracy: 78%

4. COUNTERFACTUAL ROBUSTNESS EVALUATION

Method

def evaluate_counterfactual_robustness(
    self,
    responses: List[str],              # Model's responses
    ground_truths: List[str],          # Correct answers
    counterfactual_answers: List[str], # Incorrect answers in documents
    model_name: str
) -> EvaluationResult:

Step-by-Step Calculation

Step 1: Create Result Object

result = EvaluationResult(
    task_type="counterfactual_robustness",
    model_name=model_name,
    total_samples=len(responses)
)
  • Creates result object for counterfactual task
  • Tracks both error detection AND error correction

Step 2: Process Each Response

for response, truth, cf_answer in zip(responses, ground_truths, counterfactual_answers):
    # Check 1: Did model detect the error?
    if self.detects_error(response, cf_answer):
        result.errors_detected += 1
    
    # Check 2: Did model correct the error?
    if self.corrects_error(response, truth, cf_answer):
        result.errors_corrected += 1
        result.correct += 1
    else:
        result.incorrect += 1

Step 3: Return Result

  • Final metrics:
    • error_detection_rate = (errors_detected / total_samples) × 100
    • error_correction_rate = (errors_corrected / total_samples) × 100
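Given per-sample boolean flags for detection and correction, the two rates fall out directly. The flags below are hypothetical:

```python
# Hypothetical per-sample outcomes: (error_detected, error_corrected)
flags = [(True, True), (True, True), (True, False), (False, False)]

total = len(flags)
detected = sum(d for d, _ in flags)   # booleans sum as 0/1
corrected = sum(c for _, c in flags)

print(detected / total * 100)   # 75.0
print(corrected / total * 100)  # 50.0
```

Note that correction implies the model gave the right answer, so the correction rate can never exceed the detection rate by much in practice, but the two are counted independently.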

How detects_error() Works

def detects_error(self, response: str, counterfactual_answer: Optional[str]) -> bool:
    response_lower = response.lower()
    
    # Keywords that indicate error detection
    ERROR_DETECTION_KEYWORDS = [
        "incorrect", "wrong", "false", "error", "mistake", "inaccurate",
        "not true", "not correct", "factually incorrect", "contradicts",
        "actually", "in fact", "however", "but actually", 
        "the correct answer", "should be",
    ]
    
    # Step 1: Check for error detection keywords
    for keyword in ERROR_DETECTION_KEYWORDS:
        if keyword in response_lower:
            return True
    
    # Step 2: Check if model explicitly rejects the counterfactual
    if counterfactual_answer:
        cf_lower = counterfactual_answer.lower()
        # Look for patterns like "X is incorrect" or "not X"
        if f"not {cf_lower}" in response_lower or \
           f"{cf_lower} is wrong" in response_lower:
            return True
    
    return False

How corrects_error() Works

def corrects_error(self, response: str, correct_answer: str, 
                   counterfactual_answer: Optional[str]) -> bool:
    # Step 1: Must provide the correct answer first
    if not self.is_correct(response, correct_answer):
        return False  # Didn't provide correct answer
    
    # Step 2: Ensure not just repeating the counterfactual
    if counterfactual_answer:
        norm_response = self.normalize_answer(response)
        norm_cf = self.normalize_answer(counterfactual_answer)
        
        # If response only contains counterfactual (not correct answer too)
        if norm_cf in norm_response and \
           self.normalize_answer(correct_answer) not in norm_response:
            return False  # Only mentioned counterfactual, not correction
    
    return True  # Successfully detected and corrected

Example Scenario

Question: "What is the capital of France?"

Documents contain: "The capital of France is London."
Correct answer: "Paris"
Counterfactual: "London"

=== Response 1 (Error Detection Only) ===
"The documents state London, but that is incorrect. 
The actual capital is Paris."

Evaluation:
  - detects_error(): "incorrect" keyword found ✓ → errors_detected += 1
  - is_correct(): Response contains "Paris" ✓
  - Counterfactual check: Response mentions "London" but also "Paris" ✓
  - Result: errors_corrected += 1 ✓

=== Response 2 (No Detection) ===
"According to the documents, the capital is London."

Evaluation:
  - detects_error(): No error keywords found ✗ → errors_detected += 0
  - is_correct(): Response says "London" but truth is "Paris" ✗
  - Counterfactual check: Not reached
  - Result: errors_corrected += 0

=== Response 3 (Detection but Wrong Correction) ===
"The documents are wrong - the capital is Tokyo."

Evaluation:
  - detects_error(): "wrong" keyword found ✓ → errors_detected += 1
  - is_correct(): Response says "Tokyo" but truth is "Paris" ✗
  - Result: errors_corrected += 0 (detected but wrong correction)
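The three verdicts above can be reproduced with a stripped-down sketch: a reduced keyword list stands in for detects_error(), and a verbatim-substring check stands in for the full is_correct() cascade:

```python
# Reduced subset of the error-detection keywords shown earlier
ERROR_KEYWORDS = ["incorrect", "wrong", "false", "error", "mistake"]

def detects_error(response: str) -> bool:
    text = response.lower()
    return any(kw in text for kw in ERROR_KEYWORDS)

def corrects_error(response: str, correct: str) -> bool:
    # Simplified: correction requires the correct answer verbatim
    return correct.lower() in response.lower()

responses = [
    "The documents state London, but that is incorrect. The actual capital is Paris.",
    "According to the documents, the capital is London.",
    "The documents are wrong - the capital is Tokyo.",
]
for r in responses:
    print(detects_error(r), corrects_error(r, "Paris"))
# True True    (detected and corrected)
# False False  (neither)
# True False   (detected but wrong correction)
```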

Example Output

Counterfactual Robustness Evaluation (GPT-4):
  Total samples: 100
  Errors detected: 89
  Errors corrected: 85
  
  Error Detection Rate: 89%
  Error Correction Rate: 85%

Summary Comparison Table

| Evaluation | Input | Logic | Metric | Formula |
|---|---|---|---|---|
| Noise Robustness | Responses + Answers | Check correctness with flexible matching | Accuracy | (correct / total) × 100 |
| Negative Rejection | Responses only | Check if rejection keywords present | Rejection Rate | (rejected / total) × 100 |
| Information Integration | Responses + Answers | Check correctness (multi-document synthesis) | Accuracy | (correct / total) × 100 |
| Counterfactual Robustness | Responses + Answers + Counterfactual | Check error detection AND correction | Detection/Correction Rate | (detected / total) × 100 |

Key Helper Methods

normalize_answer()

def normalize_answer(self, answer: str) -> str:
    # Step 1: Lowercase
    answer = answer.lower().strip()
    
    # Step 2: Remove punctuation at end
    answer = re.sub(r'[.!?,;:]+$', '', answer)
    
    # Step 3: Remove extra spaces
    answer = ' '.join(answer.split())
    
    return answer

Purpose: Makes answer comparison fair by ignoring formatting differences

Example:

Input:  "Paris."
Output: "paris"

Input:  "The capital is Paris!!!"
Output: "the capital is paris"
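Running a self-contained copy of the function on these examples confirms the behavior (note that only trailing punctuation is stripped, not punctuation mid-string):

```python
import re

def normalize_answer(answer: str) -> str:
    # Lowercase, strip trailing punctuation, collapse whitespace
    answer = answer.lower().strip()
    answer = re.sub(r'[.!?,;:]+$', '', answer)
    return ' '.join(answer.split())

print(normalize_answer("Paris."))                    # paris
print(normalize_answer("The capital is  Paris!!!"))  # the capital is paris
```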

Metric Calculation Properties

All metrics are calculated as properties on the result object:

# Accuracy metrics
accuracy = (result.correct / result.total_samples) × 100

# Rejection metric
rejection_rate = (result.rejected / result.total_samples) × 100

# Error metrics
error_detection_rate = (result.errors_detected / result.total_samples) × 100
error_correction_rate = (result.errors_corrected / result.total_samples) × 100

These are calculated dynamically when accessed, so they're always in sync with the counts.