# Metrics Calculation Comparison: Original vs Current Implementation

## Overview

This document compares how each of the four RGB benchmark metrics is calculated in the original `evalue_original.py` versus the refactored application in `src/evaluator.py` and `src/pipeline.py`.
## 1. NOISE ROBUSTNESS

### Original Implementation (`evalue_original.py`)

#### Data Processing

```python
def processdata(instance, noise_rate, passage_num, filename, correct_rate=0):
    # For default datasets (not _int or _fact):
    neg_num = math.ceil(passage_num * noise_rate)
    pos_num = passage_num - neg_num
    positive = instance['positive'][:pos_num]  # Select positive passages
    negative = instance['negative'][:neg_num]  # Select negative (noise) passages
    docs = positive + negative
    random.shuffle(docs)
    return query, ans, docs
```
#### Evaluation Calculation

```python
# In main script:
tt = 0
for i in results:
    label = i['label']
    if noise_rate == 1 and label[0] == -1:
        tt += 1
    elif 0 not in label and 1 in label:
        tt += 1
accuracy = tt / len(results)
```
**Logic:**
- Accepts an answer if: (`noise_rate == 1` AND the model rejected) OR (no `0` in labels AND `1` in labels)
- Label `1` = an expected answer was found in the prediction
- Label `0` = an expected answer was missing from the prediction
- Label `-1` = the model rejected (insufficient information)
- Issue: convoluted logic; rejections are counted as correct only at 100% noise
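The acceptance rule above can be replayed on hypothetical label lists. This is a standalone sketch, not the original script; the labels and counts are invented for illustration:

```python
# Standalone sketch of the original acceptance rule, using hypothetical labels:
# 1 = expected answer found, 0 = missing, -1 = model rejected.
def accepted(label, noise_rate):
    if noise_rate == 1 and label[0] == -1:
        return True  # at 100% noise, a rejection counts as correct
    return 0 not in label and 1 in label

results = [
    {'label': [1, 1]},  # all expected answers found -> accepted
    {'label': [1, 0]},  # one answer missing -> not accepted
    {'label': [-1]},    # rejection -> accepted only when noise_rate == 1
]
accuracy = sum(accepted(r['label'], noise_rate=0.4) for r in results) / len(results)
print(round(accuracy, 3))  # 0.333
```

Note how the same `[-1]` rejection would flip to accepted if `noise_rate` were 1, which is exactly why the metric's meaning depends on the noise setting.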
#### Noise Levels Tested

- Single `noise_rate` parameter per run (0.0, 0.2, 0.4, 0.6, 0.8 based on CLI usage)
- No aggregation across multiple noise levels
### Current Implementation (`src/evaluator.py`)

#### Evaluation Method

```python
def evaluate_noise_robustness(
    self,
    responses: List[str],
    ground_truths: List[str],
    model_name: str,
    noise_ratio: float
) -> EvaluationResult:
    """Evaluate noise robustness for a specific noise ratio."""
    result = EvaluationResult(
        task_type=f"noise_robustness_{int(noise_ratio*100)}%",
        model_name=model_name,
        total_samples=len(responses)
    )
    # Calculate accuracy for this noise level
    for response, truth in zip(responses, ground_truths):
        if self.is_correct(response, truth):
            result.correct += 1
        else:
            result.incorrect += 1
    return result
```
#### Multi-Noise Testing (`src/pipeline.py`)

```python
def evaluate_noise_robustness(self, model: str, noise_ratios=None):
    if noise_ratios is None:
        noise_ratios = [0.0, 0.2, 0.4, 0.6, 0.8]  # Paper's ratios
    results = []
    for noise_ratio in noise_ratios:
        samples = self.data_loader.load_noise_robustness(
            max_samples,
            noise_rate=noise_ratio
        )
        responses = self._generate_responses(client, samples, prompt_template)
        ground_truths = [s.answer for s in samples]
        result = self.evaluator.evaluate_noise_robustness(
            responses, ground_truths, model, noise_ratio
        )
        results.append(result)
```
#### Accuracy Property

```python
@property
def accuracy(self) -> float:
    """Calculate accuracy percentage."""
    if self.total_samples == 0:
        return 0.0
    return (self.correct / self.total_samples) * 100
```
**Logic:**
- Simple: percentage of responses that match the ground truth
- Uses `is_correct()`, which handles:
  - Substring matching
  - Token overlap (80%+)
  - Normalized comparison
- Advantage: clear, explainable, aggregates results per noise level
#### Noise Levels Tested

- Multiple noise levels: 0%, 20%, 40%, 60%, 80% (matching the paper)
- Separate result per noise ratio
- Can drive visualizations (e.g., an accuracy-vs-noise-level graph)
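Because the pipeline returns one result per noise ratio, the per-level accuracies collect naturally into a curve. A standalone sketch — the `NoiseResult` class and the counts below are illustrative stand-ins, not the project's `EvaluationResult`:

```python
from dataclasses import dataclass

@dataclass
class NoiseResult:  # illustrative stand-in for EvaluationResult
    noise_ratio: float
    correct: int
    total: int

    @property
    def accuracy(self) -> float:
        return 0.0 if self.total == 0 else (self.correct / self.total) * 100

# one result per tested noise level (made-up counts)
results = [NoiseResult(0.0, 95, 100), NoiseResult(0.4, 81, 100), NoiseResult(0.8, 62, 100)]
curve = {r.noise_ratio: r.accuracy for r in results}
print(curve)  # {0.0: 95.0, 0.4: 81.0, 0.8: 62.0} -- ready to plot vs noise
```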
### Key Differences

| Aspect | Original | Current |
|---|---|---|
| Definition | Complex label-based logic | Simple accuracy calculation |
| Metric | Depends on noise_rate value | Direct: (correct/total) × 100 |
| Aggregation | Single noise ratio at a time | All noise ratios tested together |
| Answer Checking | Exact substring match | Flexible (substring, token overlap, normalized) |
| Output | One accuracy value per run | List of results, one per noise level |
| Visualization | N/A | Noise vs Accuracy graph possible |
## 2. NEGATIVE REJECTION

### Original Implementation (`evalue_original.py`)

#### Dataset Mapping

```python
# Uses 'en' or 'zh' dataset (default branch in processdata)
# No explicit handling for negative rejection task
# Original code treats all datasets the same way
if '_int' in filename:
    ...  # Handle information integration
elif '_fact' in filename:
    ...  # Handle counterfactual robustness
else:
    ...  # Default handling (noise robustness + negative rejection mixed)
```
#### Rejection Detection

```python
def predict(query, ground_truth, docs, model, system, instruction, temperature, dataset):
    ...  # model generates `prediction` (elided)
    if '信息不足' in prediction or 'insufficient information' in prediction:
        labels = [-1]  # Mark as rejection
    else:
        labels = checkanswer(prediction, ground_truth)
    return labels, prediction, factlabel
```
**Logic:**
- Only two phrases trigger rejection detection:
  - Chinese: "信息不足" ("insufficient information")
  - English: "insufficient information"
- Very limited scope
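To see how narrow this is, here is the two-phrase check run against hypothetical responses — paraphrased refusals slip straight through:

```python
def original_is_rejection(prediction: str) -> bool:
    # the original check: exactly two trigger phrases, nothing else
    return '信息不足' in prediction or 'insufficient information' in prediction

print(original_is_rejection("There is insufficient information to answer."))  # True
print(original_is_rejection("I cannot determine this from the documents."))   # False (missed refusal)
```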
#### Metric Calculation

```python
# No explicit negative rejection metric in the original code;
# rejection is counted implicitly when label == -1
```

Issue: no dedicated metric for negative rejection; mixed with other tasks.
### Current Implementation (`src/evaluator.py`)

#### Rejection Detection

```python
PRIMARY_REJECTION_PHRASES = [
    "i can not answer the question because of the insufficient information in documents",
    "insufficient information in documents",
    "can not answer",
    "cannot answer",
]

REJECTION_KEYWORDS = [
    "i don't know", "i cannot", "i can't", "unable to", "not able to",
    "insufficient information", "no information", "cannot determine",
    "cannot answer", "not enough information", "don't have enough",
    "unable to determine", "cannot find", "no relevant", "not mentioned",
    "not provided", "not specified", "unclear", "unknown", "i'm not sure",
    "i am not sure", "cannot be determined", "information is not available",
    "does not provide",
]

def is_rejection(self, response: str) -> bool:
    """Check if response is a rejection."""
    response_lower = response.lower().strip()
    # Primary phrases (from Figure 3 of the paper)
    for phrase in self.PRIMARY_REJECTION_PHRASES:
        if phrase in response_lower:
            return True
    # Secondary keywords (flexible matching)
    for keyword in self.REJECTION_KEYWORDS:
        if keyword in response_lower:
            return True
    return False
```
#### Evaluation Method

```python
def evaluate_negative_rejection(
    self,
    responses: List[str],
    model_name: str
) -> EvaluationResult:
    """Evaluate negative rejection ability."""
    result = EvaluationResult(
        task_type="negative_rejection",
        model_name=model_name,
        total_samples=len(responses)
    )
    for response in responses:
        if self.is_rejection(response):
            result.rejected += 1
        else:
            result.incorrect += 1
    return result
```
#### Metrics

```python
@property
def rejection_rate(self) -> float:
    """Calculate rejection rate percentage."""
    if self.total_samples == 0:
        return 0.0
    return (self.rejected / self.total_samples) * 100
```
**Logic:**
- Dedicated task: evaluated ONLY on questions with no valid answer
- The model should reject/refuse
- Metric: percentage of correct rejections
- Assumption: every sample in the negative_rejection task should be rejected
### Key Differences

| Aspect | Original | Current |
|---|---|---|
| Phrases | 2 phrases (Chinese + English) | 28 phrases (4 primary + 24 keywords, tiered) |
| Matching | Exact substring | Primary exact + secondary flexible |
| Metric | Implicit (label == -1) | Explicit rejection_rate (%) |
| Task Separation | Mixed with other datasets | Dedicated task type |
| Paper Alignment | Basic | Figure 3 aligned (primary phrases) |
| Output | Count only | Percentage with granular tracking |
## 3. INFORMATION INTEGRATION

### Original Implementation (`evalue_original.py`)

#### Dataset Processing

```python
def processdata(instance, noise_rate, passage_num, filename, correct_rate=0):
    if '_int' in filename:
        # Special handling for information integration
        for i in instance['positive']:
            random.shuffle(i)
        # Select first element from each positive passage group
        docs = [i[0] for i in instance['positive']]
        if len(docs) < pos_num:
            # Fill with additional elements from each group
            maxnum = max([len(i) for i in instance['positive']])
            for i in range(1, maxnum):
                for j in instance['positive']:
                    if len(j) > i:
                        docs.append(j[i])
                        if len(docs) == pos_num:
                            break
                if len(docs) == pos_num:
                    break
        # Add negative documents if needed
        neg_num = passage_num - len(docs)
        if neg_num > 0:
            negative = instance['negative'][:neg_num]
            docs += negative
```
**Data Structure:**
- Each `positive` entry contains MULTIPLE related documents/passages
- Each passage in a group contains partial information
- The model must combine information across passages to answer correctly
- The answer requires synthesizing knowledge from multiple sources
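A hypothetical `_int` instance illustrating that structure (the question, passages, and answer are invented for the example): each entry of `positive` is a group of related passages, and drawing one passage per group forces cross-document synthesis.

```python
instance = {
    'query': 'Who directed the film, and in what year was it released?',
    'answer': ['Jane Doe', '1999'],  # answering needs both facts
    'positive': [
        ['The film was directed by Jane Doe.'],  # group 1: director only
        ['The film premiered in 1999.'],         # group 2: year only
    ],
    'negative': ['An unrelated film grossed $10M that year.'],
}

# processdata's first step: take the first passage from each positive group
docs = [group[0] for group in instance['positive']]
print(docs)
```

No single document in `docs` answers the query alone, which is the point of the task.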
#### Evaluation Calculation

```python
# Same as noise robustness:
label = checkanswer(prediction, ground_truth)
# label contains 1 for each expected answer found in the prediction;
# the instance counts as correct when no 0 appears in label
```

**Logic:**
- Uses the same answer checking as noise robustness
- Metric: accuracy (correct answers / total)
- Tests the model's ability to integrate information
### Current Implementation (`src/evaluator.py`)

#### Evaluation Method

```python
def evaluate_information_integration(
    self,
    responses: List[str],
    ground_truths: List[str],
    model_name: str
) -> EvaluationResult:
    """
    Evaluate information integration
    (ability to combine info from multiple docs).
    """
    result = EvaluationResult(
        task_type="information_integration",
        model_name=model_name,
        total_samples=len(responses)
    )
    for response, truth in zip(responses, ground_truths):
        if self.is_correct(response, truth):
            result.correct += 1
        else:
            result.incorrect += 1
    return result
```
#### Metric

```python
@property
def accuracy(self) -> float:
    return (self.correct / self.total_samples) * 100
```
#### Data Handling (`src/data_loader.py`)

```python
def load_information_integration(self, max_samples: Optional[int] = None):
    """Load information integration dataset (_int)."""
    # Returns RGBSample objects with:
    # - question
    # - documents (multiple related documents)
    # - answer (requires synthesizing across docs)
```
**Logic:**
- Dedicated task evaluation
- Metric: accuracy on questions requiring multi-document synthesis
- Cleaner: no mixing with noise robustness
### Key Differences

| Aspect | Original | Current |
|---|---|---|
| Dataset | Uses '_int' suffix in filename | Dedicated loader method |
| Data Structure | Passages grouped by related info | Clear document lists |
| Metric | Reused from noise robustness | Dedicated method with clear semantics |
| Task Clarity | Implicit (filename-based) | Explicit task type |
| Answer Checking | Substring match | Flexible matching (substring, token overlap) |
| Output | Count of correct | Percentage accuracy |
## 4. COUNTERFACTUAL ROBUSTNESS

### Original Implementation (`evalue_original.py`)

#### Dataset Processing

```python
elif '_fact' in filename:
    correct_num = math.ceil(passage_num * correct_rate)
    pos_num = passage_num - neg_num - correct_num
    # Sample wrong answers
    indexs = list(range(len(instance['positive'])))
    selected = random.sample(indexs, min(len(indexs), pos_num))
    docs = [instance['positive_wrong'][i] for i in selected]
    # Add correct answers
    remain = [i for i in indexs if i not in selected]
    if correct_num > 0 and len(remain) > 0:
        docs += [instance['positive'][i] for i in random.sample(remain, ...)]
    # Add negative documents
    if neg_num > 0:
        docs += instance['negative'][:neg_num]
```
**Data Mix:**
- `positive_wrong` = incorrect/counterfactual answers in documents
- `positive` = correct answers
- `negative` = irrelevant documents
- Ratio (wrong : correct : irrelevant) = (pos_num : correct_num : neg_num)
- Example: with `correct_rate=0.5`, half of the answer-bearing passages are wrong and half correct, plus noise docs
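The mix arithmetic can be checked with concrete numbers (assuming `neg_num` is computed from `noise_rate` as in the default branch shown earlier):

```python
import math

passage_num, noise_rate, correct_rate = 10, 0.2, 0.5

neg_num = math.ceil(passage_num * noise_rate)        # irrelevant (noise) docs: 2
correct_num = math.ceil(passage_num * correct_rate)  # docs with the correct answer: 5
pos_num = passage_num - neg_num - correct_num        # docs with the wrong answer: 3

print(pos_num, correct_num, neg_num)  # 3 5 2
```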
#### Evaluation Calculation

```python
if '_fact' in args.dataset:
    fact_tt = 0
    correct_tt = 0
    for i in results:
        if i['factlabel'] == 1:  # Factual error detected
            fact_tt += 1
            if 0 not in i['label']:  # Didn't find the wrong answer
                correct_tt += 1
    fact_check_rate = fact_tt / len(results)
    if fact_tt > 0:
        correct_rate = correct_tt / fact_tt  # Correction rate
    else:
        correct_rate = 0
```
**Metrics:**
- `fact_check_rate`: percentage of responses where "factual errors" were detected
- `correct_rate`: of the detected errors, percentage corrected

**Detection Logic:**

```python
if '事实性错误' in prediction or 'factual errors' in prediction:
    factlabel = 1
```

**Correction Logic:**
- If `factlabel == 1` AND `0 not in label`
- Means: detected the error AND did not reproduce the wrong answer
- Issue: indirect inference; does not verify that the correct answer was provided
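A worked example of the two original metrics on hypothetical results (the `factlabel`/`label` values are invented to exercise each branch):

```python
results = [
    {'factlabel': 1, 'label': [1]},  # error flagged, no 0 in label -> counted as corrected
    {'factlabel': 1, 'label': [0]},  # error flagged, label contains 0 -> not corrected
    {'factlabel': 0, 'label': [0]},  # error never flagged
]

fact_tt = sum(r['factlabel'] == 1 for r in results)
correct_tt = sum(r['factlabel'] == 1 and 0 not in r['label'] for r in results)

fact_check_rate = fact_tt / len(results)               # 2/3 detected
correct_rate = correct_tt / fact_tt if fact_tt else 0  # 1/2 of those corrected
print(round(fact_check_rate, 3), correct_rate)  # 0.667 0.5
```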
### Current Implementation (`src/evaluator.py`)

#### Error Detection

```python
ERROR_DETECTION_KEYWORDS = [
    "incorrect", "wrong", "false", "error", "mistake", "inaccurate",
    "not true", "not correct", "factually incorrect", "contradicts",
    "actually", "in fact", "however", "but actually", "the correct answer",
    "should be",
]

def detects_error(self, response: str, counterfactual_answer: Optional[str]) -> bool:
    """Check if model detects an error in counterfactual information."""
    response_lower = response.lower()
    # Check for error detection keywords
    for keyword in self.ERROR_DETECTION_KEYWORDS:
        if keyword in response_lower:
            return True
    # Check if model explicitly rejects the counterfactual answer
    if counterfactual_answer:
        cf_lower = counterfactual_answer.lower()
        if f"not {cf_lower}" in response_lower or \
           f"{cf_lower} is wrong" in response_lower:
            return True
    return False
```
#### Error Correction

```python
def corrects_error(self, response: str, correct_answer: str,
                   counterfactual_answer: Optional[str]) -> bool:
    """Check if model corrects the error with the right answer."""
    # First check if it provides the correct answer
    if not self.is_correct(response, correct_answer):
        return False
    # Ensure it is not just repeating the counterfactual
    if counterfactual_answer:
        norm_response = self.normalize_answer(response)
        norm_cf = self.normalize_answer(counterfactual_answer)
        # May contain both (detected and corrected),
        # but must include the correct answer
        if norm_cf in norm_response and \
           self.normalize_answer(correct_answer) not in norm_response:
            return False
    return True
```
#### Evaluation Method

```python
def evaluate_counterfactual_robustness(
    self,
    responses: List[str],
    ground_truths: List[str],
    counterfactual_answers: List[str],
    model_name: str
) -> EvaluationResult:
    """Evaluate counterfactual robustness."""
    result = EvaluationResult(
        task_type="counterfactual_robustness",
        model_name=model_name,
        total_samples=len(responses)
    )
    for response, truth, cf_answer in zip(responses, ground_truths, counterfactual_answers):
        if self.detects_error(response, cf_answer):
            result.errors_detected += 1
            if self.corrects_error(response, truth, cf_answer):
                result.errors_corrected += 1
                result.correct += 1
        else:
            result.incorrect += 1
    return result
```
#### Metrics

```python
@property
def error_detection_rate(self) -> float:
    """Percentage of errors detected."""
    if self.total_samples == 0:
        return 0.0
    return (self.errors_detected / self.total_samples) * 100

@property
def error_correction_rate(self) -> float:
    """Percentage of errors corrected."""
    if self.total_samples == 0:
        return 0.0
    return (self.errors_corrected / self.total_samples) * 100
```
**Logic:**
- Detection: keywords in the response OR explicit rejection of the counterfactual answer
- Correction must:
  - Provide the correct answer (`is_correct` check)
  - Not merely repeat the counterfactual answer
  - It may mention the counterfactual while providing the correct answer
- Metrics: `errors_detected` and `errors_corrected` are tracked separately
### Key Differences

| Aspect | Original | Current |
|---|---|---|
| Error Detection | 2 phrases: "factual errors" (EN), "事实性错误" (ZH) | 16+ keywords + explicit rejection patterns |
| Correction Check | Indirect: !(0 in label) | Direct: is_correct(response, ground_truth) |
| Metrics | fact_check_rate, correct_rate | error_detection_rate, error_correction_rate |
| Data Input | Mixed documents with wrong/correct answers | Clean counterfactual_answer field |
| Output | Two aggregated percentages | Both rates tracked individually |
| Robustness | Only detects obvious keyword phrases | Keywords + pattern matching + answer verification |
## SUMMARY TABLE

| Metric | Original Logic | Current Logic | Key Improvement |
|---|---|---|---|
| Noise Robustness | Complex label-based | Simple accuracy (correct/total × 100) | Clear, comparable, aggregates multiple noise levels |
| Negative Rejection | 2 phrases mixed with other tasks | 28 rejection phrases, dedicated task | Comprehensive, explicit rejection detection |
| Information Integration | Reused noise robustness logic | Dedicated evaluation method | Clear task separation, better semantic meaning |
| Counterfactual Robustness | Keyword check + indirect correction | Keyword check + direct answer verification | More accurate correction detection |
## Functional Differences

### Accuracy Calculation

- Original: label-based (0/1/-1) with complex logic
- Current: direct percentage calculation

### Answer Checking

- Original: simple substring matching
- Current: multi-strategy (substring, token overlap 80%+, normalized comparison)
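A minimal sketch of such a multi-strategy check (not the project's exact `is_correct`; the 80% threshold and the normalization rules are assumptions taken from the description above):

```python
import re

def normalize(text: str) -> str:
    # lowercase, drop punctuation, collapse whitespace (normalized comparison)
    return re.sub(r'\s+', ' ', re.sub(r'[^\w\s]', ' ', text.lower())).strip()

def is_correct_sketch(response: str, truth: str, threshold: float = 0.8) -> bool:
    resp, gold = normalize(response), normalize(truth)
    if gold and gold in resp:          # strategy 1: substring match
        return True
    gold_tokens = set(gold.split())    # strategy 2: token overlap >= 80%
    if not gold_tokens:
        return False
    overlap = len(gold_tokens & set(resp.split())) / len(gold_tokens)
    return overlap >= threshold

print(is_correct_sketch("The answer is Paris.", "Paris"))             # True
print(is_correct_sketch("No mention of the capital here.", "Paris"))  # False
```

Normalizing both sides first means the equality-style "normalized comparison" comes for free via the substring strategy.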
### Task Separation

- Original: implicit (filename-based)
- Current: explicit (dedicated methods, clear data structures)

### Metrics Clarity

- Original: manual counting, implicit logic
- Current: properties on a dataclass, explicit tracking

### Error Handling

- Original: basic try-except, silent failures
- Current: validation, specific logging, graceful degradation
## Conclusion

The current implementation provides:
- **Clarity**: each metric has a clear definition and calculation
- **Robustness**: more comprehensive keyword matching, better answer verification
- **Granularity**: tracks multiple aspects (e.g., both detection AND correction)
- **Scalability**: easy to add new metrics or datasets
- **Testability**: each evaluation method is independent and testable