# RGB Evaluation Calculation Explanation

This document explains how each of the four RGB evaluations is calculated in the application, step by step through the code.

---

## Overview: EvaluationResult Class

All evaluations return an `EvaluationResult` object that stores:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class EvaluationResult:
    task_type: str                # Type of evaluation task
    model_name: str               # Name of model being evaluated
    total_samples: int = 0        # Total number of samples tested
    correct: int = 0              # Count of correct responses
    incorrect: int = 0            # Count of incorrect responses
    rejected: int = 0             # Count of rejections (for negative rejection)
    errors_detected: int = 0      # Count of errors detected
    errors_corrected: int = 0     # Count of errors corrected
    # Mutable defaults need default_factory; a bare {} raises ValueError in a dataclass
    accuracy_by_noise: Dict[int, float] = field(default_factory=dict)  # Breakdown by noise level
```

It then calculates metrics via properties, each guarded against division by zero:

```python
    @property
    def accuracy(self) -> float:
        if self.total_samples == 0:
            return 0.0
        return (self.correct / self.total_samples) * 100

    @property
    def rejection_rate(self) -> float:
        if self.total_samples == 0:
            return 0.0
        return (self.rejected / self.total_samples) * 100

    @property
    def error_detection_rate(self) -> float:
        if self.total_samples == 0:
            return 0.0
        return (self.errors_detected / self.total_samples) * 100

    @property
    def error_correction_rate(self) -> float:
        if self.total_samples == 0:
            return 0.0
        return (self.errors_corrected / self.total_samples) * 100
```

---

## 1. NOISE ROBUSTNESS EVALUATION

### Method

```python
def evaluate_noise_robustness(
    self,
    responses: List[str],      # Model's responses
    ground_truths: List[str],  # Correct answers
    model_name: str,
    noise_ratio: float         # Noise level (0.0 to 0.8)
) -> EvaluationResult:
```

### Step-by-Step Calculation

**Step 1: Create Result Object**

```python
result = EvaluationResult(
    task_type=f"noise_robustness_{int(noise_ratio*100)}%",  # e.g., "noise_robustness_40%"
    model_name=model_name,
    total_samples=len(responses)  # Total number of test samples
)
```

- Creates a result object with a task type like "noise_robustness_40%" to track the noise level
- Records the total number of samples tested

**Step 2: Loop Through Each Response**

```python
for response, truth in zip(responses, ground_truths):
    if self.is_correct(response, truth):
        result.correct += 1
    else:
        result.incorrect += 1
```

- Compares each model response against the ground-truth answer
- Increments the `correct` counter if the answer matches
- Increments the `incorrect` counter if it doesn't

**Step 3: Return Result**

- Final metric calculated via property: `accuracy = (correct / total_samples) × 100`

### How `is_correct()` Works

```python
def is_correct(self, response: str, ground_truth: str, strict: bool = False) -> bool:
    # Step 1: Normalize both answers
    norm_response = self.normalize_answer(response)  # Lowercase, remove trailing punctuation
    norm_truth = self.normalize_answer(ground_truth)

    # Step 2: Basic checks
    if not norm_response or not norm_truth:
        return False

    # Step 3: Exact match (if strict mode)
    if strict:
        return norm_response == norm_truth

    # Step 4: Substring matching (ground truth in response)
    if norm_truth in norm_response:
        return True

    # Step 5: Short answer in long answer
    if len(norm_response) < len(norm_truth) and norm_response in norm_truth:
        return True

    # Step 6: Token overlap (80%+)
    truth_tokens = set(norm_truth.split())
    response_tokens = set(norm_response.split())
    if len(truth_tokens) > 0:
        overlap = len(truth_tokens & response_tokens) / len(truth_tokens)
        if overlap >= 0.8:  # 80% of truth tokens present in response
            return True

    return False
```

**Example:**

```
Ground Truth: "Paris"
Response: "The capital of France is Paris."

Step 1: Normalize
  - truth: "paris"
  - response: "the capital of france is paris"

Step 2: Substring check
  - Is "paris" in "the capital of france is paris"? YES ✓

Result: CORRECT
```

### Example: Multiple Noise Levels

```
Noise Robustness Evaluation (GPT-4):
  0% noise  (5 correct docs): 85% accuracy
  20% noise (1 noise doc):    82% accuracy
  40% noise (2 noise docs):   78% accuracy
  60% noise (3 noise docs):   72% accuracy
  80% noise (4 noise docs):   65% accuracy
```

---

## 2. NEGATIVE REJECTION EVALUATION

### Method

```python
def evaluate_negative_rejection(
    self,
    responses: List[str],  # Model's responses to questions without answers
    model_name: str
) -> EvaluationResult:
```

### Step-by-Step Calculation

**Step 1: Create Result Object**

```python
result = EvaluationResult(
    task_type="negative_rejection",
    model_name=model_name,
    total_samples=len(responses)  # Total samples where documents have no answer
)
```

- Creates a result object for the rejection task
- All samples in this test have NO relevant information in the documents

**Step 2: Check Each Response for Rejection**

```python
for response in responses:
    if self.is_rejection(response):
        result.rejected += 1    # Model correctly rejected
    else:
        result.incorrect += 1   # Model should have rejected but didn't
```

- For each model response, checks whether it is a rejection
- Counts proper rejections vs. inappropriate answers

**Step 3: Return Result**

- Final metric: `rejection_rate = (rejected / total_samples) × 100`

### How `is_rejection()` Works

```python
def is_rejection(self, response: str) -> bool:
    response_lower = response.lower().strip()

    # Primary phrases from Figure 3 of the paper (matched as substrings)
    PRIMARY_REJECTION_PHRASES = [
        "i can not answer the question because of the insufficient information in documents",
        "insufficient information in documents",
        "can not answer",
        "cannot answer",
    ]

    # Step 1: Check primary phrases (paper standard)
    for phrase in PRIMARY_REJECTION_PHRASES:
        if phrase in response_lower:
            return True

    # Secondary keywords (flexible matching)
    REJECTION_KEYWORDS = [
        "i don't know", "i cannot", "i can't", "unable to", "not able to",
        "insufficient information", "no information", "cannot determine",
        "cannot answer", "not enough information", "don't have enough",
        "unable to determine", "cannot find", "no relevant", "not mentioned",
        "not provided", "not specified", "unclear", "unknown",
        "i'm not sure", "i am not sure", "cannot be determined",
        "information is not available", "does not provide",
    ]

    # Step 2: Check secondary keywords
    for keyword in REJECTION_KEYWORDS:
        if keyword in response_lower:
            return True

    return False
```

**Examples:**

```
Response 1: "I cannot answer this question because the documents don't contain relevant information."
  - Contains "cannot answer"? YES ✓
  - Result: REJECTION (count as rejected)

Response 2: "Based on the provided documents, I cannot determine the answer."
  - Contains "cannot determine"? YES ✓
  - Result: REJECTION (count as rejected)

Response 3: "The documents do not mention this topic, so I cannot provide an answer."
  - Contains "i cannot"? YES ✓
  - Result: REJECTION (count as rejected)

Response 4: "The answer is probably 42 but I'm not sure."
  - Contains "i'm not sure"? YES ✓
  - Result: REJECTION (count as rejected)

Response 5: "Based on the information, the answer is London."
  - Contains any rejection phrase/keyword? NO ✗
  - Result: INCORRECT (model should have rejected but answered instead)
```

### Example Output

```
Negative Rejection Evaluation (GPT-4):
  Total samples: 100
  Correctly rejected: 92
  Incorrectly answered: 8
  Rejection Rate: 92%
```

---

## 3. INFORMATION INTEGRATION EVALUATION

### Method

```python
def evaluate_information_integration(
    self,
    responses: List[str],      # Model's responses
    ground_truths: List[str],  # Correct answers (require synthesis)
    model_name: str
) -> EvaluationResult:
```

### Step-by-Step Calculation

**Step 1: Create Result Object**

```python
result = EvaluationResult(
    task_type="information_integration",
    model_name=model_name,
    total_samples=len(responses)
)
```

- Creates a result object for the information integration task
- Samples contain multiple documents; the answer requires combining them

**Step 2: Check Each Response for Correctness**

```python
for response, truth in zip(responses, ground_truths):
    if self.is_correct(response, truth):
        result.correct += 1
    else:
        result.incorrect += 1
```

- Uses the same `is_correct()` method as noise robustness
- Checks whether the model successfully synthesized an answer from multiple documents

**Step 3: Return Result**

- Final metric: `accuracy = (correct / total_samples) × 100`

### Key Difference from Noise Robustness

| Aspect | Noise Robustness | Information Integration |
|--------|------------------|------------------------|
| **Documents** | Mix of relevant + noise | Multiple relevant documents with partial info |
| **Challenge** | Filter out noise | Synthesize from multiple sources |
| **Evaluation** | Simple accuracy | Simple accuracy (but harder task) |
| **Example** | 5 docs: 3 relevant + 2 noise | 5 docs: each has partial answer |

### Example Scenario

```
Question: "What are the main causes of climate change?"

Information Integration Dataset:
  Doc 1: "Greenhouse gases from burning fossil fuels..."
  Doc 2: "Deforestation reduces CO2 absorption..."
  Doc 3: "Industrial emissions contribute significantly..."
  Doc 4: "Methane from agriculture is another factor..."
  Doc 5: "Human activities have increased CO2 levels..."
Correct Answer: "Greenhouse gases from fossil fuels, deforestation,
industrial emissions, and agricultural methane"

Model Response: "The main causes are greenhouse gases from burning fossil
fuels, deforestation, industrial emissions, and agricultural methane."

Evaluation:
  - The response paraphrases rather than quoting, so the substring check fails
  - Token overlap: all ground-truth tokens appear in the response
  - Result: CORRECT (100% token overlap ≥ 80%)
```

### Example Output

```
Information Integration Evaluation (GPT-4):
  Total samples: 100
  Correct responses: 78
  Incorrect responses: 22
  Accuracy: 78%
```

---

## 4. COUNTERFACTUAL ROBUSTNESS EVALUATION

### Method

```python
def evaluate_counterfactual_robustness(
    self,
    responses: List[str],               # Model's responses
    ground_truths: List[str],           # Correct answers
    counterfactual_answers: List[str],  # Incorrect answers in documents
    model_name: str
) -> EvaluationResult:
```

### Step-by-Step Calculation

**Step 1: Create Result Object**

```python
result = EvaluationResult(
    task_type="counterfactual_robustness",
    model_name=model_name,
    total_samples=len(responses)
)
```

- Creates a result object for the counterfactual task
- Tracks both error detection AND error correction

**Step 2: Process Each Response**

```python
for response, truth, cf_answer in zip(responses, ground_truths, counterfactual_answers):
    # Check 1: Did model detect the error?
    if self.detects_error(response, cf_answer):
        result.errors_detected += 1

    # Check 2: Did model correct the error?
    if self.corrects_error(response, truth, cf_answer):
        result.errors_corrected += 1
        result.correct += 1
    else:
        result.incorrect += 1
```

**Step 3: Return Result**

- Final metrics:
  - `error_detection_rate = (errors_detected / total_samples) × 100`
  - `error_correction_rate = (errors_corrected / total_samples) × 100`

### How `detects_error()` Works

```python
def detects_error(self, response: str, counterfactual_answer: Optional[str]) -> bool:
    response_lower = response.lower()

    # Keywords that indicate error detection
    ERROR_DETECTION_KEYWORDS = [
        "incorrect", "wrong", "false", "error", "mistake", "inaccurate",
        "not true", "not correct", "factually incorrect", "contradicts",
        "actually", "in fact", "however", "but actually",
        "the correct answer", "should be",
    ]

    # Step 1: Check for error detection keywords
    for keyword in ERROR_DETECTION_KEYWORDS:
        if keyword in response_lower:
            return True

    # Step 2: Check if model explicitly rejects the counterfactual
    if counterfactual_answer:
        cf_lower = counterfactual_answer.lower()
        # Look for patterns like "not X" or "X is wrong"
        if f"not {cf_lower}" in response_lower or \
           f"{cf_lower} is wrong" in response_lower:
            return True

    return False
```

### How `corrects_error()` Works

```python
def corrects_error(self, response: str, correct_answer: str,
                   counterfactual_answer: Optional[str]) -> bool:
    # Step 1: Must provide the correct answer
    if not self.is_correct(response, correct_answer):
        return False  # Didn't provide correct answer

    # Step 2: Ensure the model isn't just repeating the counterfactual
    if counterfactual_answer:
        norm_response = self.normalize_answer(response)
        norm_cf = self.normalize_answer(counterfactual_answer)

        # If the response contains the counterfactual but not the correct answer
        if norm_cf in norm_response and \
           self.normalize_answer(correct_answer) not in norm_response:
            return False  # Only mentioned counterfactual, not correction

    return True  # Successfully corrected
```

### Example Scenario

```
Question: "What is the capital of France?"
Documents contain: "The capital of France is London."
Correct answer: "Paris"
Counterfactual: "London"

=== Response 1 (Detection and Correction) ===
"The documents state London, but that is incorrect. The actual capital is Paris."

Evaluation:
  - detects_error(): "incorrect" keyword found ✓ → errors_detected += 1
  - is_correct(): Response contains "Paris" ✓
  - Counterfactual check: Response mentions "London" but also "Paris" ✓
  - Result: errors_corrected += 1 ✓

=== Response 2 (No Detection) ===
"According to the documents, the capital is London."

Evaluation:
  - detects_error(): No error keywords found ✗ → errors_detected unchanged
  - is_correct(): Response says "London" but truth is "Paris" ✗
  - Counterfactual check: Not reached
  - Result: errors_corrected unchanged

=== Response 3 (Detection but Wrong Correction) ===
"The documents are wrong - the capital is Tokyo."

Evaluation:
  - detects_error(): "wrong" keyword found ✓ → errors_detected += 1
  - is_correct(): Response says "Tokyo" but truth is "Paris" ✗
  - Result: errors_corrected unchanged (detected but wrong correction)
```

### Example Output

```
Counterfactual Robustness Evaluation (GPT-4):
  Total samples: 100
  Errors detected: 89
  Errors corrected: 85
  Error Detection Rate: 89%
  Error Correction Rate: 85%
```

---

## Summary Comparison Table

| Evaluation | Input | Logic | Metric | Formula |
|------------|-------|-------|--------|---------|
| **Noise Robustness** | Responses + Answers | Check correctness with flexible matching | Accuracy | (correct / total) × 100 |
| **Negative Rejection** | Responses only | Check if rejection keywords present | Rejection Rate | (rejected / total) × 100 |
| **Information Integration** | Responses + Answers | Check correctness (multi-document synthesis) | Accuracy | (correct / total) × 100 |
| **Counterfactual Robustness** | Responses + Answers + Counterfactual | Check error detection AND correction | Detection/Correction Rate | (detected / total) × 100 |

---

## Key Helper Methods

### `normalize_answer()`

```python
def normalize_answer(self, answer: str) -> str:
    # Step 1: Lowercase
    answer = answer.lower().strip()

    # Step 2: Remove punctuation at end
    answer = re.sub(r'[.!?,;:]+$', '', answer)

    # Step 3: Remove extra spaces
    answer = ' '.join(answer.split())

    return answer
```

**Purpose**: Makes answer comparison fair by ignoring formatting differences.

**Example:**

```
Input:  "Paris."
Output: "paris"

Input:  "The capital is Paris!!!"
Output: "the capital is paris"
```

---

## Metric Calculation Properties

All metrics are calculated as properties on the result object:

```python
# Accuracy metric
accuracy = (result.correct / result.total_samples) * 100

# Rejection metric
rejection_rate = (result.rejected / result.total_samples) * 100

# Error metrics
error_detection_rate = (result.errors_detected / result.total_samples) * 100
error_correction_rate = (result.errors_corrected / result.total_samples) * 100
```

These are computed dynamically when accessed, so they are always in sync with the counts.
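To tie the pieces together, the noise-robustness flow can be exercised as a single runnable script. This is a minimal sketch, not the application's actual module: it inlines simplified versions of `normalize_answer()` and `is_correct()`, trims `EvaluationResult` to the fields this flow uses, and the sample responses and model name are invented for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class EvaluationResult:
    task_type: str
    model_name: str
    total_samples: int = 0
    correct: int = 0
    incorrect: int = 0

    @property
    def accuracy(self) -> float:
        # Guarded against division by zero, as in the full class
        if self.total_samples == 0:
            return 0.0
        return (self.correct / self.total_samples) * 100


def normalize_answer(answer: str) -> str:
    # Lowercase, strip trailing punctuation, collapse whitespace
    answer = answer.lower().strip()
    answer = re.sub(r'[.!?,;:]+$', '', answer)
    return ' '.join(answer.split())


def is_correct(response: str, ground_truth: str) -> bool:
    r, t = normalize_answer(response), normalize_answer(ground_truth)
    if not r or not t:
        return False
    # Substring match in either direction
    if t in r or (len(r) < len(t) and r in t):
        return True
    # Fall back to 80%+ token overlap
    t_tokens, r_tokens = set(t.split()), set(r.split())
    return bool(t_tokens) and len(t_tokens & r_tokens) / len(t_tokens) >= 0.8


# Invented sample data: three responses against the same ground truth
responses = ["The capital of France is Paris.", "It is Berlin.", "Paris"]
truths = ["Paris", "Paris", "Paris"]

result = EvaluationResult(task_type="noise_robustness_40%",
                          model_name="demo-model",
                          total_samples=len(responses))
for resp, truth in zip(responses, truths):
    if is_correct(resp, truth):
        result.correct += 1
    else:
        result.incorrect += 1

print(round(result.accuracy, 1))  # → 66.7  (2 of 3 correct)
```

The accuracy property stays in sync with the counters, so no separate recomputation step is needed after the loop.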
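The negative-rejection bookkeeping can be exercised the same way. This sketch uses a deliberately shortened keyword list and invented responses; the evaluator's real list is the fuller one shown earlier.

```python
# Shortened keyword list for illustration only
REJECTION_KEYWORDS = [
    "cannot answer", "cannot determine", "i don't know", "insufficient information",
]


def is_rejection(response: str) -> bool:
    # Substring match against the lowercased response, as in the evaluator
    low = response.lower().strip()
    return any(keyword in low for keyword in REJECTION_KEYWORDS)


# Invented sample responses to unanswerable questions
responses = [
    "I cannot answer this question based on the documents.",
    "The answer is London.",
    "There is insufficient information in the documents.",
]

rejected = sum(is_rejection(r) for r in responses)
rejection_rate = rejected / len(responses) * 100
print(f"{rejected} rejected, rate {rejection_rate:.1f}%")  # → 2 rejected, rate 66.7%
```

The second response answers instead of rejecting, so it counts against the rejection rate, mirroring the `incorrect` counter in the full evaluator.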