# RGB Evaluation Calculation Explanation
This document explains, step by step through the code, how each of the four RGB evaluations is calculated in the application.

---

## Overview: EvaluationResult Class

All evaluations return an `EvaluationResult` object that stores:
```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class EvaluationResult:
    task_type: str              # Type of evaluation task
    model_name: str             # Name of the model being evaluated
    total_samples: int = 0      # Total number of samples tested
    correct: int = 0            # Count of correct responses
    incorrect: int = 0          # Count of incorrect responses
    rejected: int = 0           # Count of rejections (for negative rejection)
    errors_detected: int = 0    # Count of errors detected
    errors_corrected: int = 0   # Count of errors corrected
    # Breakdown by noise level; a mutable default needs default_factory
    accuracy_by_noise: Dict[int, float] = field(default_factory=dict)
```
Metrics are then derived via properties:

```python
    @property
    def accuracy(self) -> float:
        if self.total_samples == 0:
            return 0.0
        return (self.correct / self.total_samples) * 100

    @property
    def rejection_rate(self) -> float:
        if self.total_samples == 0:
            return 0.0
        return (self.rejected / self.total_samples) * 100

    @property
    def error_detection_rate(self) -> float:
        if self.total_samples == 0:
            return 0.0
        return (self.errors_detected / self.total_samples) * 100

    @property
    def error_correction_rate(self) -> float:
        if self.total_samples == 0:
            return 0.0
        return (self.errors_corrected / self.total_samples) * 100
```
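A minimal standalone sketch (trimmed to a single counter, with the class and model names chosen for illustration) shows the counter-plus-property pattern in action:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class EvaluationResult:
    task_type: str
    model_name: str
    total_samples: int = 0
    correct: int = 0
    accuracy_by_noise: Dict[int, float] = field(default_factory=dict)

    @property
    def accuracy(self) -> float:
        # Guard against division by zero when no samples were run
        if self.total_samples == 0:
            return 0.0
        return (self.correct / self.total_samples) * 100

result = EvaluationResult(task_type="noise_robustness_40%",
                          model_name="demo-model", total_samples=4)
result.correct = 3
print(result.accuracy)  # 75.0
```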
---

## 1. NOISE ROBUSTNESS EVALUATION

### Method

```python
def evaluate_noise_robustness(
    self,
    responses: List[str],       # Model's responses
    ground_truths: List[str],   # Correct answers
    model_name: str,
    noise_ratio: float          # Noise level (0.0 to 0.8)
) -> EvaluationResult:
```
### Step-by-Step Calculation

**Step 1: Create Result Object**

```python
result = EvaluationResult(
    task_type=f"noise_robustness_{int(noise_ratio*100)}%",  # e.g., "noise_robustness_40%"
    model_name=model_name,
    total_samples=len(responses)  # Total number of test samples
)
```

- Creates a result object with a task type like "noise_robustness_40%" to track the noise level
- Records the total number of samples tested
**Step 2: Loop Through Each Response**

```python
for response, truth in zip(responses, ground_truths):
    if self.is_correct(response, truth):
        result.correct += 1
    else:
        result.incorrect += 1
```

- Compares each model response against the ground-truth answer
- Increments the `correct` counter if the answer matches
- Increments the `incorrect` counter if it doesn't

**Step 3: Return Result**

- Final metric calculated via property: `accuracy = (correct / total_samples) × 100`
### How `is_correct()` Works

```python
def is_correct(self, response: str, ground_truth: str, strict: bool = False) -> bool:
    # Step 1: Normalize both answers (lowercase, strip punctuation)
    norm_response = self.normalize_answer(response)
    norm_truth = self.normalize_answer(ground_truth)

    # Step 2: Basic checks
    if not norm_response or not norm_truth:
        return False

    # Step 3: Exact match (if strict mode)
    if strict:
        return norm_response == norm_truth

    # Step 4: Substring matching (ground truth in response)
    if norm_truth in norm_response:
        return True

    # Step 5: Short answer contained in long answer
    if len(norm_response) < len(norm_truth) and norm_response in norm_truth:
        return True

    # Step 6: Token overlap (80%+ of truth tokens present in response)
    truth_tokens = set(norm_truth.split())
    response_tokens = set(norm_response.split())
    if len(truth_tokens) > 0:
        overlap = len(truth_tokens & response_tokens) / len(truth_tokens)
        if overlap >= 0.8:
            return True

    return False
```
**Example:**

```
Ground Truth: "Paris"
Response: "The capital of France is Paris."

Step 1: Normalize
  - truth: "paris"
  - response: "the capital of france is paris"

Step 2: Substring check
  - Is "paris" in "the capital of france is paris"? YES ✓

Result: CORRECT
```
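The matching logic can be reproduced as a standalone sketch (self-contained versions of `normalize_answer` and `is_correct`, with strict mode omitted):

```python
import re

def normalize_answer(answer: str) -> str:
    # Lowercase, strip trailing punctuation, collapse whitespace
    answer = answer.lower().strip()
    answer = re.sub(r'[.!?,;:]+$', '', answer)
    return ' '.join(answer.split())

def is_correct(response: str, ground_truth: str) -> bool:
    norm_response = normalize_answer(response)
    norm_truth = normalize_answer(ground_truth)
    if not norm_response or not norm_truth:
        return False
    # Substring match in either direction
    if norm_truth in norm_response:
        return True
    if len(norm_response) < len(norm_truth) and norm_response in norm_truth:
        return True
    # 80%+ of ground-truth tokens present in the response
    truth_tokens = set(norm_truth.split())
    response_tokens = set(norm_response.split())
    if truth_tokens:
        return len(truth_tokens & response_tokens) / len(truth_tokens) >= 0.8
    return False

print(is_correct("The capital of France is Paris.", "Paris"))  # True
print(is_correct("The capital is London.", "Paris"))           # False
```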
### Example: Multiple Noise Levels

```
Noise Robustness Evaluation (GPT-4):
  0% noise  (5 correct docs): 85% accuracy
  20% noise (1 noise doc):    82% accuracy
  40% noise (2 noise docs):   78% accuracy
  60% noise (3 noise docs):   72% accuracy
  80% noise (4 noise docs):   65% accuracy
```
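Running the evaluation once per noise ratio naturally fills the `accuracy_by_noise` breakdown. A sketch of that bookkeeping, using made-up `(noise_ratio, correct, total)` counts for illustration:

```python
# Hypothetical results of five evaluation runs: (noise_ratio, correct, total)
runs = [(0.0, 85, 100), (0.2, 82, 100), (0.4, 78, 100),
        (0.6, 72, 100), (0.8, 65, 100)]

# Keyed by the integer percentage, matching the "noise_robustness_40%" naming
accuracy_by_noise = {int(ratio * 100): correct / total * 100
                     for ratio, correct, total in runs}
print(accuracy_by_noise)  # {0: 85.0, 20: 82.0, 40: 78.0, 60: 72.0, 80: 65.0}
```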
---

## 2. NEGATIVE REJECTION EVALUATION

### Method

```python
def evaluate_negative_rejection(
    self,
    responses: List[str],  # Model's responses to questions without answers
    model_name: str
) -> EvaluationResult:
```
### Step-by-Step Calculation

**Step 1: Create Result Object**

```python
result = EvaluationResult(
    task_type="negative_rejection",
    model_name=model_name,
    total_samples=len(responses)  # Total samples where documents have no answer
)
```

- Creates the result object for the rejection task
- Every sample in this test has NO relevant information in its documents
**Step 2: Check Each Response for Rejection**

```python
for response in responses:
    if self.is_rejection(response):
        result.rejected += 1    # Model correctly rejected
    else:
        result.incorrect += 1   # Model should have rejected but didn't
```

- Checks whether each model response is a rejection
- Counts proper rejections vs. inappropriate answers

**Step 3: Return Result**

- Final metric: `rejection_rate = (rejected / total_samples) × 100`
### How `is_rejection()` Works

```python
def is_rejection(self, response: str) -> bool:
    response_lower = response.lower().strip()

    # Primary phrases from Figure 3 of the paper (exact match required)
    PRIMARY_REJECTION_PHRASES = [
        "i can not answer the question because of the insufficient information in documents",
        "insufficient information in documents",
        "can not answer",
        "cannot answer",
    ]

    # Step 1: Check primary phrases (paper standard)
    for phrase in PRIMARY_REJECTION_PHRASES:
        if phrase in response_lower:
            return True

    # Secondary keywords (flexible matching)
    REJECTION_KEYWORDS = [
        "i don't know", "i cannot", "i can't", "unable to", "not able to",
        "insufficient information", "no information", "cannot determine",
        "cannot answer", "not enough information", "don't have enough",
        "unable to determine", "cannot find", "no relevant", "not mentioned",
        "not provided", "not specified", "unclear", "unknown", "i'm not sure",
        "i am not sure", "cannot be determined", "information is not available",
        "does not provide",
    ]

    # Step 2: Check secondary keywords
    for keyword in REJECTION_KEYWORDS:
        if keyword in response_lower:
            return True

    return False
```
**Examples:**

```
Response 1: "I cannot answer this question because the documents don't contain relevant information."
  - Contains "cannot answer"? YES ✓
  - Result: REJECTION (counted as rejected)

Response 2: "Based on the provided documents, I cannot determine the answer."
  - Contains "cannot determine"? YES ✓
  - Result: REJECTION (counted as rejected)

Response 3: "The documents do not mention this topic, so I cannot provide an answer."
  - Contains "i cannot"? YES ✓
  - Result: REJECTION (counted as rejected)

Response 4: "The answer is probably 42 but I'm not sure."
  - Contains "i'm not sure"? YES ✓
  - Result: REJECTION (counted as rejected)

Response 5: "Based on the information, the answer is London."
  - Contains any rejection phrase/keyword? NO ✗
  - Result: INCORRECT (model should have rejected but answered instead)
```
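A standalone sketch of the keyword pass (with an abbreviated keyword list; the full method checks the primary phrases first, then the secondary keywords) reproduces the rejection-rate arithmetic:

```python
# Abbreviated keyword list for illustration
REJECTION_KEYWORDS = [
    "cannot answer", "can not answer", "i cannot", "i don't know",
    "insufficient information", "cannot determine", "i'm not sure",
]

def is_rejection(response: str) -> bool:
    response_lower = response.lower().strip()
    return any(kw in response_lower for kw in REJECTION_KEYWORDS)

responses = [
    "I cannot answer this question from the documents.",
    "Based on the provided documents, I cannot determine the answer.",
    "Based on the information, the answer is London.",
]
rejected = sum(is_rejection(r) for r in responses)
print(f"Rejection rate: {rejected / len(responses) * 100:.1f}%")  # 66.7%
```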
### Example Output

```
Negative Rejection Evaluation (GPT-4):
  Total samples: 100
  Correctly rejected: 92
  Incorrectly answered: 8
  Rejection Rate: 92%
```
---

## 3. INFORMATION INTEGRATION EVALUATION

### Method

```python
def evaluate_information_integration(
    self,
    responses: List[str],       # Model's responses
    ground_truths: List[str],   # Correct answers (require synthesis)
    model_name: str
) -> EvaluationResult:
```
### Step-by-Step Calculation

**Step 1: Create Result Object**

```python
result = EvaluationResult(
    task_type="information_integration",
    model_name=model_name,
    total_samples=len(responses)
)
```

- Creates the result object for the information-integration task
- Samples contain multiple documents; the answer requires combining them
**Step 2: Check Each Response for Correctness**

```python
for response, truth in zip(responses, ground_truths):
    if self.is_correct(response, truth):
        result.correct += 1
    else:
        result.incorrect += 1
```

- Uses the same `is_correct()` method as noise robustness
- Checks whether the model successfully synthesized the answer from multiple documents

**Step 3: Return Result**

- Final metric: `accuracy = (correct / total_samples) × 100`
### Key Difference from Noise Robustness

| Aspect | Noise Robustness | Information Integration |
|--------|------------------|-------------------------|
| **Documents** | Mix of relevant + noise | Multiple relevant documents with partial info |
| **Challenge** | Filter out noise | Synthesize from multiple sources |
| **Evaluation** | Simple accuracy | Simple accuracy (but a harder task) |
| **Example** | 5 docs: 3 relevant + 2 noise | 5 docs: each has part of the answer |
### Example Scenario

```
Question: "What are the main causes of climate change?"

Information Integration Dataset:
  Doc 1: "Greenhouse gases from burning fossil fuels..."
  Doc 2: "Deforestation reduces CO2 absorption..."
  Doc 3: "Industrial emissions contribute significantly..."
  Doc 4: "Methane from agriculture is another factor..."
  Doc 5: "Human activities have increased CO2 levels..."

Correct Answer: "Greenhouse gases from fossil fuels, deforestation,
industrial emissions, and agricultural methane"

Model Response: "The main causes include greenhouse gas emissions from
burning fossil fuels, loss of forests that absorb CO2, industrial pollution,
and methane from agriculture."

Evaluation:
  - Checks whether the response contains the key concepts
  - Token overlap: contains the required terms
  - Result: CORRECT (85% token overlap ≥ 80%)
```
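The token-overlap check used in the scenario above can be isolated into a small helper (a standalone sketch; the real path goes through `is_correct()` after normalization, which also strips trailing punctuation):

```python
def token_overlap(response: str, truth: str) -> float:
    # Fraction of ground-truth tokens that also appear in the response
    truth_tokens = set(truth.lower().split())
    response_tokens = set(response.lower().split())
    if not truth_tokens:
        return 0.0
    return len(truth_tokens & response_tokens) / len(truth_tokens)

truth = "greenhouse gases deforestation industrial emissions methane"
response = ("the main causes are greenhouse gases from fossil fuels "
            "plus deforestation industrial emissions and methane")
print(token_overlap(response, truth))  # 1.0 -> above the 0.8 threshold
```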
### Example Output

```
Information Integration Evaluation (GPT-4):
  Total samples: 100
  Correct responses: 78
  Incorrect responses: 22
  Accuracy: 78%
```
---

## 4. COUNTERFACTUAL ROBUSTNESS EVALUATION

### Method

```python
def evaluate_counterfactual_robustness(
    self,
    responses: List[str],               # Model's responses
    ground_truths: List[str],           # Correct answers
    counterfactual_answers: List[str],  # Incorrect answers planted in documents
    model_name: str
) -> EvaluationResult:
```
### Step-by-Step Calculation

**Step 1: Create Result Object**

```python
result = EvaluationResult(
    task_type="counterfactual_robustness",
    model_name=model_name,
    total_samples=len(responses)
)
```

- Creates the result object for the counterfactual task
- Tracks both error detection AND error correction
**Step 2: Process Each Response**

```python
for response, truth, cf_answer in zip(responses, ground_truths, counterfactual_answers):
    # Check 1: Did the model detect the error?
    if self.detects_error(response, cf_answer):
        result.errors_detected += 1

    # Check 2: Did the model correct the error?
    if self.corrects_error(response, truth, cf_answer):
        result.errors_corrected += 1
        result.correct += 1
    else:
        result.incorrect += 1
```

**Step 3: Return Result**

- Final metrics:
  - `error_detection_rate = (errors_detected / total_samples) × 100`
  - `error_correction_rate = (errors_corrected / total_samples) × 100`
### How `detects_error()` Works

```python
def detects_error(self, response: str, counterfactual_answer: Optional[str]) -> bool:
    response_lower = response.lower()

    # Keywords that indicate error detection
    ERROR_DETECTION_KEYWORDS = [
        "incorrect", "wrong", "false", "error", "mistake", "inaccurate",
        "not true", "not correct", "factually incorrect", "contradicts",
        "actually", "in fact", "however", "but actually",
        "the correct answer", "should be",
    ]

    # Step 1: Check for error-detection keywords
    for keyword in ERROR_DETECTION_KEYWORDS:
        if keyword in response_lower:
            return True

    # Step 2: Check whether the model explicitly rejects the counterfactual,
    # via patterns like "not X" or "X is wrong"
    if counterfactual_answer:
        cf_lower = counterfactual_answer.lower()
        if f"not {cf_lower}" in response_lower or \
           f"{cf_lower} is wrong" in response_lower:
            return True

    return False
```
### How `corrects_error()` Works

```python
def corrects_error(self, response: str, correct_answer: str,
                   counterfactual_answer: Optional[str]) -> bool:
    # Step 1: The response must provide the correct answer
    if not self.is_correct(response, correct_answer):
        return False  # Didn't provide the correct answer

    # Step 2: Ensure the response isn't just repeating the counterfactual
    if counterfactual_answer:
        norm_response = self.normalize_answer(response)
        norm_cf = self.normalize_answer(counterfactual_answer)
        # Response contains only the counterfactual, not the correct answer
        if norm_cf in norm_response and \
           self.normalize_answer(correct_answer) not in norm_response:
            return False  # Only mentioned the counterfactual, no correction

    return True  # Successfully detected and corrected
```
### Example Scenario

```
Question: "What is the capital of France?"
Documents contain: "The capital of France is London."
Correct answer: "Paris"
Counterfactual: "London"

=== Response 1 (Detected and Corrected) ===
"The documents state London, but that is incorrect.
The actual capital is Paris."

Evaluation:
  - detects_error(): "incorrect" keyword found ✓ → errors_detected += 1
  - is_correct(): response contains "Paris" ✓
  - Counterfactual check: response mentions "London" but also "Paris" ✓
  - Result: errors_corrected += 1 ✓

=== Response 2 (No Detection) ===
"According to the documents, the capital is London."

Evaluation:
  - detects_error(): no error keywords found ✗ → errors_detected unchanged
  - is_correct(): response says "London" but truth is "Paris" ✗
  - Counterfactual check: not reached
  - Result: errors_corrected unchanged

=== Response 3 (Detection but Wrong Correction) ===
"The documents are wrong - the capital is Tokyo."

Evaluation:
  - detects_error(): "wrong" keyword found ✓ → errors_detected += 1
  - is_correct(): response says "Tokyo" but truth is "Paris" ✗
  - Result: errors_corrected unchanged (detected but not corrected)
```
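Simplified standalone versions of both checks (keyword list abbreviated, and the `is_correct()` call reduced to a plain substring test) reproduce the three outcomes above:

```python
# Abbreviated keyword list for illustration
ERROR_DETECTION_KEYWORDS = ["incorrect", "wrong", "false", "actually", "in fact"]

def detects_error(response: str) -> bool:
    response_lower = response.lower()
    return any(kw in response_lower for kw in ERROR_DETECTION_KEYWORDS)

def corrects_error(response: str, correct_answer: str) -> bool:
    # Simplified correctness check: the correct answer must appear verbatim
    return correct_answer.lower() in response.lower()

r1 = "The documents state London, but that is incorrect. The actual capital is Paris."
r2 = "According to the documents, the capital is London."
r3 = "The documents are wrong - the capital is Tokyo."

print(detects_error(r1), corrects_error(r1, "Paris"))  # True True
print(detects_error(r2), corrects_error(r2, "Paris"))  # False False
print(detects_error(r3), corrects_error(r3, "Paris"))  # True False
```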
### Example Output

```
Counterfactual Robustness Evaluation (GPT-4):
  Total samples: 100
  Errors detected: 89
  Errors corrected: 85
  Error Detection Rate: 89%
  Error Correction Rate: 85%
```
---

## Summary Comparison Table

| Evaluation | Input | Logic | Metric | Formula |
|------------|-------|-------|--------|---------|
| **Noise Robustness** | Responses + answers | Correctness via flexible matching | Accuracy | (correct / total) × 100 |
| **Negative Rejection** | Responses only | Rejection keywords present? | Rejection Rate | (rejected / total) × 100 |
| **Information Integration** | Responses + answers | Correctness (multi-document synthesis) | Accuracy | (correct / total) × 100 |
| **Counterfactual Robustness** | Responses + answers + counterfactuals | Error detection AND correction | Detection / Correction Rate | (detected / total) × 100, (corrected / total) × 100 |
---

## Key Helper Methods

### `normalize_answer()`

```python
def normalize_answer(self, answer: str) -> str:
    # Step 1: Lowercase and strip surrounding whitespace
    answer = answer.lower().strip()
    # Step 2: Remove trailing punctuation
    answer = re.sub(r'[.!?,;:]+$', '', answer)
    # Step 3: Collapse internal whitespace
    answer = ' '.join(answer.split())
    return answer
```
**Purpose**: makes answer comparison fair by ignoring formatting differences.

**Example:**

```
Input:  "Paris."
Output: "paris"

Input:  "The capital is Paris!!!"
Output: "the capital is paris"
```
---

## Metric Calculation Properties

All metrics are computed as properties on the result object:

```python
# Accuracy metric
accuracy = (result.correct / result.total_samples) * 100

# Rejection metric
rejection_rate = (result.rejected / result.total_samples) * 100

# Error metrics
error_detection_rate = (result.errors_detected / result.total_samples) * 100
error_correction_rate = (result.errors_corrected / result.total_samples) * 100
```

Because these are properties, they are computed on access and are therefore always in sync with the underlying counts.