evalue_original.py - Calculation Methods Explained
This document explains how each evaluation is calculated in the original evalue_original.py file, showing the actual code flow and logic.
Overview: Label System
The original implementation uses a label system to categorize answers:
def predict(query, ground_truth, docs, model, system, instruction, temperature, dataset):
    '''
    label:  1 = positive (correct answer found)
            0 = negative (incorrect answer found)
           -1 = not enough information (rejection)
    '''
    # ... model generates prediction ...
    if '信息不足' in prediction or 'insufficient information' in prediction:
        labels = [-1]  # Model rejected (insufficient info)
    else:
        labels = checkanswer(prediction, ground_truth)  # 1 if correct, 0 if not
    factlabel = 0  # For counterfactual robustness
    if '事实性错误' in prediction or 'factual errors' in prediction:
        factlabel = 1  # Factual error detected
    return labels, prediction, factlabel
Labels:
- 1 = Correct answer found
- 0 = Incorrect answer found
- -1 = Insufficient information (rejection)
- factlabel = 1 if a factual error was flagged, else 0 (counterfactual robustness only)
1. NOISE ROBUSTNESS & NEGATIVE REJECTION
Data Preparation
def processdata(instance, noise_rate, passage_num, filename, correct_rate=0):
    # For default datasets (en, zh - NOT _int or _fact):
    neg_num = math.ceil(passage_num * noise_rate)  # Number of noise docs
    pos_num = passage_num - neg_num                # Number of correct docs
    # Example: passage_num=5, noise_rate=0.4
    #   neg_num = ceil(5 × 0.4) = 2
    #   pos_num = 5 - 2 = 3

    # Select documents
    positive = instance['positive'][:pos_num]  # Take first pos_num from positive
    negative = instance['negative'][:neg_num]  # Take first neg_num from negative
    docs = positive + negative                 # Combine
    random.shuffle(docs)                       # Shuffle to hide which are positive/negative
    return query, ans, docs
Example:
Dataset: en_refine.json
positive: ["Doc A about answer", "Doc B about answer", "Doc C about answer", ...]
negative: ["Noise doc 1", "Noise doc 2", ...]
With noise_rate=0.4, passage_num=5:
pos_num = 3 (correct docs)
neg_num = 2 (noise docs)
Selected docs: ["Doc A", "Doc B", "Doc C", "Noise 1", "Noise 2"]
After shuffle: ["Noise 1", "Doc A", "Noise 2", "Doc B", "Doc C"]
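The mixing logic above can be run as a self-contained sketch. The helper name `mix_documents` and the fixed seed are illustrative (the original shuffles with the global random state inside `processdata`); the toy documents are made up:

```python
import math
import random

def mix_documents(positive, negative, passage_num, noise_rate, seed=0):
    """Mix pos_num correct docs with neg_num noise docs, then shuffle.

    Illustrative helper; the original does this inline in processdata
    and does not seed the shuffle.
    """
    neg_num = math.ceil(passage_num * noise_rate)   # noise docs
    pos_num = passage_num - neg_num                 # correct docs
    docs = positive[:pos_num] + negative[:neg_num]
    random.Random(seed).shuffle(docs)               # hide which is which
    return docs

docs = mix_documents(
    positive=["Doc A", "Doc B", "Doc C", "Doc D"],
    negative=["Noise 1", "Noise 2", "Noise 3"],
    passage_num=5,
    noise_rate=0.4,
)
print(len(docs))     # 5
print(sorted(docs))  # 3 positive docs + 2 noise docs, order hidden by the shuffle
```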
Answer Checking
def checkanswer(prediction, ground_truth):
    prediction = prediction.lower()  # Convert to lowercase
    if type(ground_truth) is not list:
        ground_truth = [ground_truth]  # Normalize to a list
    labels = []
    for instance in ground_truth:
        flag = True
        if type(instance) == list:
            # Ground truth has multiple acceptable variants
            flag = False  # Start as False; set True if any variant matches
            instance = [i.lower() for i in instance]
            for i in instance:
                if i in prediction:  # Simple substring check
                    flag = True
                    break
        else:
            # Single ground-truth string
            instance = instance.lower()
            if instance not in prediction:  # Simple substring check
                flag = False
        labels.append(int(flag))  # Convert True/False to 1/0
    return labels  # Returns a list like [0] or [1]
How it works:
- Simple substring matching: Just checks if ground truth is in response
- Lowercase comparison: Makes matching case-insensitive
- Returns list: Can have multiple labels (0 or 1 each)
Example:
Ground Truth: "Paris"
Response: "The capital is Paris"
Step 1: Lowercase both
- truth: "paris"
- response: "the capital is paris"
Step 2: Check substring
- Is "paris" in "the capital is paris"? YES
Result: labels = [1] (meaning correct)
---
Ground Truth: "Rome"
Response: "The capital is Paris"
Step 2: Check substring
- Is "rome" in "the capital is paris"? NO
Result: labels = [0] (meaning incorrect)
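The two examples above can be reproduced with a condensed, behavior-equivalent copy of `checkanswer` (same lowercase substring logic, written with `isinstance`/`any` for brevity):

```python
def checkanswer(prediction, ground_truth):
    """Condensed but behavior-equivalent version of the substring check."""
    prediction = prediction.lower()
    if not isinstance(ground_truth, list):
        ground_truth = [ground_truth]
    labels = []
    for instance in ground_truth:
        if isinstance(instance, list):
            # A nested list means several acceptable variants; any match counts
            flag = any(variant.lower() in prediction for variant in instance)
        else:
            flag = instance.lower() in prediction
        labels.append(int(flag))
    return labels

print(checkanswer("The capital is Paris", "Paris"))  # [1]
print(checkanswer("The capital is Paris", "Rome"))   # [0]
# Variant list: either spelling counts as correct
print(checkanswer("It is the United States", [["USA", "United States"]]))  # [1]
```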
Metric Calculation
# Main evaluation loop (lines 269-297)
tt = 0  # Counter for "good" results
for i in results:
    label = i['label']  # Get label from result
    if noise_rate == 1 and label[0] == -1:
        # Special case: at 100% noise, rejection (-1) is correct
        tt += 1
    elif 0 not in label and 1 in label:
        # General case: label contains 1 (correct) and no 0 (incorrect)
        tt += 1

# Final calculation
all_rate = tt / len(results)  # Fraction of "good" results
Logic Explained:
Condition 1: if noise_rate == 1 and label[0] == -1
- At 100% noise, there IS NO correct answer
- The model SHOULD reject (return -1)
- If label == -1, the model did the right thing ✓
Example:
noise_rate = 1.0 (100% noise)
label = [-1] (model rejected)
Result: tt += 1 ✓ CORRECT BEHAVIOR
Condition 2: elif 0 not in label and 1 in label
- At other noise levels, there ARE correct answers
- The model SHOULD provide an answer
- If label contains 1 (correct) and no 0 (incorrect), it's good
Example:
noise_rate = 0.4 (40% noise)
label = [1] (model provided the correct answer)
Result: tt += 1 ✓ CORRECT ANSWER
Example:
noise_rate = 0.4 (40% noise)
label = [0] (model didn't find the answer)
Result: tt += 0 ✗ FAILED TO ANSWER
Output:
{
  "all_rate": 0.82,   // 82% success rate
  "noise_rate": 0.4,
  "tt": 82,           // 82 correct out of 100
  "nums": 100         // Total samples
}
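Both conditions fit in a small runnable sketch. The helper name `compute_all_rate` is illustrative (the original runs this loop inline):

```python
def compute_all_rate(results, noise_rate):
    """Sketch of the success-rate loop described above."""
    tt = 0
    for r in results:
        label = r['label']
        if noise_rate == 1 and label[0] == -1:
            tt += 1  # At 100% noise, rejection is the right behavior
        elif 0 not in label and 1 in label:
            tt += 1  # Otherwise, a fully correct answer is required
    return tt / len(results)

# At 40% noise: correct answers count; rejections and misses do not
print(compute_all_rate([{'label': [1]}, {'label': [0]},
                        {'label': [-1]}, {'label': [1]}], 0.4))  # 0.5
# At 100% noise: rejections count as success
print(compute_all_rate([{'label': [-1]}, {'label': [0]}], 1))    # 0.5
```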
2. INFORMATION INTEGRATION
Data Preparation
def processdata(instance, noise_rate, passage_num, filename, correct_rate=0):
    if '_int' in filename:  # Information-integration dataset
        for i in instance['positive']:
            random.shuffle(i)  # Shuffle within each group

        # Take one doc from each positive group (multi-document synthesis)
        docs = [i[0] for i in instance['positive']]

        # If not enough docs, take more from each group round-robin
        if len(docs) < pos_num:
            maxnum = max([len(i) for i in instance['positive']])
            for i in range(1, maxnum):
                for j in instance['positive']:
                    if len(j) > i:
                        docs.append(j[i])
                    if len(docs) == pos_num:
                        break
                if len(docs) == pos_num:
                    break

        # Add negative documents if needed
        neg_num = passage_num - len(docs)
        if neg_num > 0:
            negative = instance['negative'][:neg_num]
            docs += negative
Data Structure (en_int.json):
{
  "positive": [
    ["Doc about first fact", "Alt doc about first fact", ...],
    ["Doc about second fact", "Alt doc about second fact", ...],
    ["Doc about third fact", "Alt doc about third fact", ...]
  ],
  "negative": ["Noise doc 1", "Noise doc 2", ...],
  "answer": "Requires combining facts from all three groups"
}
How Selection Works:
Step 1: Extract the first doc from each group

docs = [
  "Doc about first fact",   # from positive[0][0]
  "Doc about second fact",  # from positive[1][0]
  "Doc about third fact"    # from positive[2][0]
]

Step 2: If more docs are needed, take additional ones from the groups

docs = [
  "Doc about first fact",
  "Doc about second fact",
  "Doc about third fact",
  "Alt doc about first fact",  # from positive[0][1]
  "Noise doc 1"                # from negative[0]
]
Result: Model has partial info from multiple sources
Must synthesize to answer correctly
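The selection steps above can be sketched as one deterministic function. The name `select_integration_docs` and the toy group labels (F1a, F2a, ...) are illustrative, and the per-group shuffle is omitted so the result is reproducible:

```python
import math

def select_integration_docs(positive_groups, negative, passage_num, noise_rate):
    """Sketch of the _int branch: one doc per group, round-robin extras,
    then pad with noise docs (per-group shuffle omitted for determinism)."""
    neg_num = math.ceil(passage_num * noise_rate)
    pos_num = passage_num - neg_num
    # Step 1: first document from every positive group
    docs = [group[0] for group in positive_groups]
    # Step 2: round-robin additional docs until pos_num is reached
    if len(docs) < pos_num:
        maxnum = max(len(g) for g in positive_groups)
        for i in range(1, maxnum):
            for g in positive_groups:
                if len(g) > i:
                    docs.append(g[i])
                if len(docs) == pos_num:
                    break
            if len(docs) == pos_num:
                break
    # Step 3: pad with noise documents up to passage_num
    docs += negative[:passage_num - len(docs)]
    return docs

groups = [["F1a", "F1b"], ["F2a", "F2b"], ["F3a"]]
docs = select_integration_docs(groups, ["N1", "N2"], passage_num=5, noise_rate=0.2)
print(docs)  # ['F1a', 'F2a', 'F3a', 'F1b', 'N1']
```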
Answer Checking
Uses same checkanswer() function as noise robustness
Metric Calculation
Uses same metric calculation as noise robustness:
tt = 0
for i in results:
    label = i['label']
    if 0 not in label and 1 in label:  # Correct answer found
        tt += 1
all_rate = tt / len(results)
Output:
{
"all_rate": 0.78, // 78% success rate
"nums": 100
}
3. COUNTERFACTUAL ROBUSTNESS
Data Preparation
elif '_fact' in filename:  # Counterfactual dataset
    correct_num = math.ceil(passage_num * correct_rate)
    pos_num = passage_num - neg_num - correct_num
    # Example: passage_num=5, noise_rate=0.2, correct_rate=0.5
    #   neg_num = ceil(5 × 0.2) = 1      (noise docs)
    #   correct_num = ceil(5 × 0.5) = 3  (correct-answer docs)
    #   pos_num = 5 - 1 - 3 = 1          (wrong-answer docs)

    # Select wrong/counterfactual docs
    indexs = list(range(len(instance['positive'])))
    selected = random.sample(indexs, min(len(indexs), pos_num))
    docs = [instance['positive_wrong'][i] for i in selected]  # Docs with the fake answer

    # Add correct docs
    remain = [i for i in indexs if i not in selected]
    if correct_num > 0 and len(remain) > 0:
        docs += [instance['positive'][i]
                 for i in random.sample(remain, min(len(remain), correct_num))]

    # Add noise docs
    if neg_num > 0:
        docs += instance['negative'][:neg_num]
Data Structure (en_fact.json):
{
  "query": "What is the capital of France?",
  "answer": "Paris",
  "fakeanswer": "London",
  "positive": ["Doc about Paris", ...],
  "positive_wrong": ["Doc saying London is capital", ...],
  "negative": ["Noise doc", ...]
}
Example Mix:
Documents given to model:
- "The capital of France is London." (positive_wrong - FAKE)
- "Paris is known for the Eiffel Tower." (positive - CORRECT)
- "France is in Europe." (negative - NOISE)
Model must:
1. Detect that "London" is wrong
2. Provide correct answer "Paris"
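The _fact mixing can be sketched end to end. The function name `mix_counterfactual_docs`, the fixed seed, and the toy documents are illustrative (the original uses the global random state inside `processdata`):

```python
import math
import random

def mix_counterfactual_docs(instance, passage_num, noise_rate, correct_rate, seed=0):
    """Sketch of the _fact branch: wrong-answer docs + correct docs + noise."""
    rng = random.Random(seed)
    neg_num = math.ceil(passage_num * noise_rate)
    correct_num = math.ceil(passage_num * correct_rate)
    pos_num = passage_num - neg_num - correct_num
    indexs = list(range(len(instance['positive'])))
    selected = rng.sample(indexs, min(len(indexs), pos_num))
    docs = [instance['positive_wrong'][i] for i in selected]  # fake-answer docs
    remain = [i for i in indexs if i not in selected]
    if correct_num > 0 and remain:
        docs += [instance['positive'][i]
                 for i in rng.sample(remain, min(len(remain), correct_num))]
    if neg_num > 0:
        docs += instance['negative'][:neg_num]  # noise docs go last
    return docs

instance = {
    'positive': ["Paris doc 1", "Paris doc 2", "Paris doc 3", "Paris doc 4"],
    'positive_wrong': ["London doc 1", "London doc 2", "London doc 3", "London doc 4"],
    'negative': ["Noise 1", "Noise 2"],
}
docs = mix_counterfactual_docs(instance, passage_num=5, noise_rate=0.2, correct_rate=0.5)
print(len(docs))  # 5: 1 wrong-answer doc + 3 correct docs + 1 noise doc
```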
Answer Checking
Same checkanswer() function is used
Metric Calculation
if '_fact' in args.dataset:
    fact_tt = 0     # Times a factual error was detected
    correct_tt = 0  # Times the correct answer was also given
    for i in results:
        if i['factlabel'] == 1:      # Factual error detected?
            fact_tt += 1
            if 0 not in i['label']:  # Correct answer present (no 0 labels)?
                correct_tt += 1

    # Calculate metrics
    fact_check_rate = fact_tt / len(results)  # Detection rate
    if fact_tt > 0:
        correct_rate = correct_tt / fact_tt   # Correction rate (among detections)
    else:
        correct_rate = 0
Logic Explained:
For each result:
factlabel == 1?
This checks: Does the response contain a "factual errors" keyword?
If YES → fact_tt += 1 (the model detected the error)
0 not in label?
This checks: Does the response contain the correct answer (label has no 0)?
If YES → correct_tt += 1 (the model also corrected the error)
Example Scenarios:
Scenario 1: FULL SUCCESS
Response: "The documents contain factual errors: they claim London,
but the correct answer is Paris."
factlabel == 1? YES (contains "factual errors") ✓
fact_tt += 1
0 not in label? YES (contains the correct answer "Paris") ✓
correct_tt += 1
Result: Both counters incremented
Scenario 2: DETECTION ONLY
Response: "The documents contain factual errors about the capital."
factlabel == 1? YES (contains "factual errors") ✓
fact_tt += 1
0 not in label? NO (the correct answer is missing)
correct_tt += 0
Result: Only detection counted
Scenario 3: NO DETECTION
Response: "According to the documents, the capital is London."
factlabel == 1? NO (no error keywords)
fact_tt += 0
Result: Neither counter incremented
Output:
{
  "fact_check_rate": 0.89,  // 89 of 100 samples flagged a factual error
  "correct_rate": 0.955,    // Of the 89 detections, 85 also gave the correct answer
  "fact_tt": 89,            // Errors detected
  "correct_tt": 85,         // Errors corrected
  "nums": 100
}
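The two metrics fit in a small runnable sketch; the helper name `counterfactual_metrics` is illustrative (the original runs this loop inline), and the sample results are toy data:

```python
def counterfactual_metrics(results):
    """Sketch of the _fact metric loop described above."""
    fact_tt = 0     # responses that flagged a factual error
    correct_tt = 0  # of those, responses that also gave the correct answer
    for r in results:
        if r['factlabel'] == 1:
            fact_tt += 1
            if 0 not in r['label']:
                correct_tt += 1
    fact_check_rate = fact_tt / len(results)         # detection rate over all samples
    correct_rate = correct_tt / fact_tt if fact_tt else 0  # correction rate among detections
    return fact_check_rate, correct_rate

results = [
    {'factlabel': 1, 'label': [1]},  # detected and corrected
    {'factlabel': 1, 'label': [0]},  # detected only
    {'factlabel': 0, 'label': [0]},  # missed entirely
    {'factlabel': 1, 'label': [1]},  # detected and corrected
]
fact_check_rate, correct_rate = counterfactual_metrics(results)
print(fact_check_rate)         # 0.75
print(round(correct_rate, 3))  # 0.667
```

Note that correct_rate is conditional on detection: a model that never flags an error gets correct_rate = 0 regardless of its answers.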
Label Interpretation Summary
| Label | Meaning | When Used |
|---|---|---|
| 0 | Incorrect answer found | When ground truth NOT in response |
| 1 | Correct answer found | When ground truth IS in response |
| -1 | Insufficient information (rejection) | When rejection keywords detected |
| factlabel=1 | Factual error detected | When "factual errors" keyword found |
Calculation Flow Comparison
Noise Robustness (noise_rate ≠ 1)
For each sample:
1. Mix pos_num correct docs + neg_num noise docs
2. Get model response
3. Check if it contains the correct answer → label = 0 or 1
4. If label has 1 and no 0 → count as success
Final: success_count / total_samples = accuracy
Negative Rejection (noise_rate = 1)
For each sample:
1. Use ONLY negative docs (no correct answer)
2. Get model response
3. Check if it contains a rejection keyword → label = -1
4. If label == -1 → count as success
Final: rejection_count / total_samples = rejection_rate
Information Integration (_int dataset)
For each sample:
1. Mix docs from multiple groups (requires synthesis)
2. Get model response
3. Check if it contains the correct answer → label = 0 or 1
4. If label has 1 and no 0 → count as success
Final: success_count / total_samples = accuracy
Counterfactual Robustness (_fact dataset)
For each sample:
1. Mix wrong docs + correct docs + noise docs
2. Get model response
3. Check for the "factual errors" keyword → factlabel = 0 or 1
4. Check if it contains the correct answer → label = 0 or 1
5. If factlabel == 1 → detection_count += 1
6. If factlabel == 1 AND 0 not in label → correction_count += 1
Final:
detection_rate = detection_count / total
correction_rate = correction_count / detection_count
Key Differences from Current Implementation
| Aspect | Original | Current |
|---|---|---|
| Answer Checking | Simple substring match only | Multi-strategy (substring, token overlap, normalized) |
| Rejection Detection | Only 2 keywords ("factual errors", "insufficient information") | 30+ keywords with tiered matching |
| Label System | Uses 0/1/-1 labels | Uses explicit counter fields |
| Metric Calculation | Complex conditional logic | Simple property calculations |
| Multiple Noise Levels | One at a time | All together (0%, 20%, 40%, 60%, 80%) |
| Output Structure | One JSON file per run | Structured EvaluationResult objects |