evalue_original.py - Calculation Methods Explained
This document explains how each evaluation is calculated in the original evalue_original.py file, showing the actual code flow and logic.
Overview: Label System
The original implementation uses a label system to categorize answers:
def predict(query, ground_truth, docs, model, system, instruction, temperature, dataset):
    '''
    label:  1 = positive (correct answer found)
            0 = negative (incorrect answer found)
           -1 = not enough information (rejection)
    '''
    # ... model generates prediction ...
    if '信息不足' in prediction or 'insufficient information' in prediction:
        labels = [-1]  # Model rejected (insufficient info)
    else:
        labels = checkanswer(prediction, ground_truth)  # 1 if correct, 0 if not
    factlabel = 0  # For counterfactual robustness
    if '事实性错误' in prediction or 'factual errors' in prediction:
        factlabel = 1  # Factual error detected
    return labels, prediction, factlabel
Labels:
- 1 = Correct answer found
- 0 = Incorrect answer found
- -1 = Insufficient information (rejection)
- factlabel = 1 if a factual error was flagged, else 0 (counterfactual robustness only)
1. NOISE ROBUSTNESS & NEGATIVE REJECTION
Data Preparation
def processdata(instance, noise_rate, passage_num, filename, correct_rate=0):
    # For default datasets (en, zh - NOT _int or _fact):
    neg_num = math.ceil(passage_num * noise_rate)  # Number of noise docs
    pos_num = passage_num - neg_num                # Number of correct docs
    # Example: passage_num=5, noise_rate=0.4
    #   neg_num = ceil(5 × 0.4) = 2
    #   pos_num = 5 - 2 = 3

    # Select documents
    positive = instance['positive'][:pos_num]  # Take first pos_num from positive
    negative = instance['negative'][:neg_num]  # Take first neg_num from negative
    docs = positive + negative                 # Combine
    random.shuffle(docs)                       # Shuffle to hide which are positive/negative
    return query, ans, docs
Example:
Dataset: en_refine.json
positive: ["Doc A about answer", "Doc B about answer", "Doc C about answer", ...]
negative: ["Noise doc 1", "Noise doc 2", ...]
With noise_rate=0.4, passage_num=5:
pos_num = 3 (correct docs)
neg_num = 2 (noise docs)
Selected docs: ["Doc A", "Doc B", "Doc C", "Noise 1", "Noise 2"]
After shuffle: ["Noise 1", "Doc A", "Noise 2", "Doc B", "Doc C"]
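The mixing logic above can be run as a self-contained sketch. The helper name `mix_documents` and the fixed seed are illustrative (the original shuffles with the global random state inside `processdata`); the toy documents are made up:

```python
import math
import random

def mix_documents(positive, negative, passage_num, noise_rate, seed=0):
    """Mix pos_num correct docs with neg_num noise docs, then shuffle.

    Illustrative helper; the original does this inline in processdata
    and does not seed the shuffle.
    """
    neg_num = math.ceil(passage_num * noise_rate)   # noise docs
    pos_num = passage_num - neg_num                 # correct docs
    docs = positive[:pos_num] + negative[:neg_num]
    random.Random(seed).shuffle(docs)               # hide which is which
    return docs

docs = mix_documents(
    positive=["Doc A", "Doc B", "Doc C", "Doc D"],
    negative=["Noise 1", "Noise 2", "Noise 3"],
    passage_num=5,
    noise_rate=0.4,
)
print(len(docs))     # 5
print(sorted(docs))  # 3 positive docs + 2 noise docs, order hidden by the shuffle
```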
Answer Checking
def checkanswer(prediction, ground_truth):
    prediction = prediction.lower()  # Convert to lowercase
    if type(ground_truth) is not list:
        ground_truth = [ground_truth]  # Normalize to a list
    labels = []
    for instance in ground_truth:
        flag = True
        if type(instance) == list:
            # Ground truth has multiple acceptable variants
            flag = False  # Start as False; set True if any variant matches
            instance = [i.lower() for i in instance]
            for i in instance:
                if i in prediction:  # Simple substring check
                    flag = True
                    break
        else:
            # Single ground-truth string
            instance = instance.lower()
            if instance not in prediction:  # Simple substring check
                flag = False
        labels.append(int(flag))  # Convert True/False to 1/0
    return labels  # Returns a list like [0] or [1]
How it works:
- Simple substring matching: Just checks if ground truth is in response
- Lowercase comparison: Makes matching case-insensitive
- Returns list: Can have multiple labels (0 or 1 each)
Example:
Ground Truth: "Paris"
Response: "The capital is Paris"
Step 1: Lowercase both
- truth: "paris"
- response: "the capital is paris"
Step 2: Check substring
- Is "paris" in "the capital is paris"? YES
Result: labels = [1] (meaning correct)
---
Ground Truth: "Rome"
Response: "The capital is Paris"
Step 2: Check substring
- Is "rome" in "the capital is paris"? NO
Result: labels = [0] (meaning incorrect)
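The two examples above can be reproduced with a condensed, behavior-equivalent copy of `checkanswer` (same lowercase substring logic, written with `isinstance`/`any` for brevity):

```python
def checkanswer(prediction, ground_truth):
    """Condensed but behavior-equivalent version of the substring check."""
    prediction = prediction.lower()
    if not isinstance(ground_truth, list):
        ground_truth = [ground_truth]
    labels = []
    for instance in ground_truth:
        if isinstance(instance, list):
            # A nested list means several acceptable variants; any match counts
            flag = any(variant.lower() in prediction for variant in instance)
        else:
            flag = instance.lower() in prediction
        labels.append(int(flag))
    return labels

print(checkanswer("The capital is Paris", "Paris"))  # [1]
print(checkanswer("The capital is Paris", "Rome"))   # [0]
# Variant list: either spelling counts as correct
print(checkanswer("It is the United States", [["USA", "United States"]]))  # [1]
```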
Metric Calculation
# Main evaluation loop (lines 269-297)
tt = 0  # Counter for "good" results
for i in results:
    label = i['label']  # Get label from result
    if noise_rate == 1 and label[0] == -1:
        # Special case: at 100% noise, rejection (-1) is correct
        tt += 1
    elif 0 not in label and 1 in label:
        # General case: label contains 1 (correct) and no 0 (incorrect)
        tt += 1

# Final calculation
all_rate = tt / len(results)  # Fraction of "good" results
Logic Explained:
Condition 1: if noise_rate == 1 and label[0] == -1
- At 100% noise, there IS NO correct answer
- The model SHOULD reject (return -1)
- If label == -1, the model did the right thing ✓
Example:
noise_rate = 1.0 (100% noise)
label = [-1] (model rejected)
Result: tt += 1 ✓ CORRECT BEHAVIOR
Condition 2: elif 0 not in label and 1 in label
- At other noise levels, there ARE correct answers
- The model SHOULD provide an answer
- If label contains 1 (correct) and no 0 (incorrect), it's good
Example:
noise_rate = 0.4 (40% noise)
label = [1] (model provided the correct answer)
Result: tt += 1 ✓ CORRECT ANSWER
Example:
noise_rate = 0.4 (40% noise)
label = [0] (model didn't find the answer)
Result: tt += 0 ✗ FAILED TO ANSWER
Output:
{
  "all_rate": 0.82,   // 82% success rate
  "noise_rate": 0.4,
  "tt": 82,           // 82 correct out of 100
  "nums": 100         // Total samples
}
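Both conditions fit in a small runnable sketch. The helper name `compute_all_rate` is illustrative (the original runs this loop inline):

```python
def compute_all_rate(results, noise_rate):
    """Sketch of the success-rate loop described above."""
    tt = 0
    for r in results:
        label = r['label']
        if noise_rate == 1 and label[0] == -1:
            tt += 1  # At 100% noise, rejection is the right behavior
        elif 0 not in label and 1 in label:
            tt += 1  # Otherwise, a fully correct answer is required
    return tt / len(results)

# At 40% noise: correct answers count; rejections and misses do not
print(compute_all_rate([{'label': [1]}, {'label': [0]},
                        {'label': [-1]}, {'label': [1]}], 0.4))  # 0.5
# At 100% noise: rejections count as success
print(compute_all_rate([{'label': [-1]}, {'label': [0]}], 1))    # 0.5
```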
2. INFORMATION INTEGRATION
Data Preparation
def processdata(instance, noise_rate, passage_num, filename, correct_rate=0):
    if '_int' in filename:  # Information-integration dataset
        for i in instance['positive']:
            random.shuffle(i)  # Shuffle within each group

        # Take one doc from each positive group (multi-document synthesis)
        docs = [i[0] for i in instance['positive']]

        # If not enough docs, take more from each group round-robin
        if len(docs) < pos_num:
            maxnum = max([len(i) for i in instance['positive']])
            for i in range(1, maxnum):
                for j in instance['positive']:
                    if len(j) > i:
                        docs.append(j[i])
                    if len(docs) == pos_num:
                        break
                if len(docs) == pos_num:
                    break

        # Add negative documents if needed
        neg_num = passage_num - len(docs)
        if neg_num > 0:
            negative = instance['negative'][:neg_num]
            docs += negative
Data Structure (en_int.json):
{
  "positive": [
    ["Doc about first fact", "Alt doc about first fact", ...],
    ["Doc about second fact", "Alt doc about second fact", ...],
    ["Doc about third fact", "Alt doc about third fact", ...]
  ],
  "negative": ["Noise doc 1", "Noise doc 2", ...],
  "answer": "Requires combining facts from all three groups"
}
How Selection Works:
Step 1: Extract the first doc from each group

docs = [
  "Doc about first fact",   # from positive[0][0]
  "Doc about second fact",  # from positive[1][0]
  "Doc about third fact"    # from positive[2][0]
]

Step 2: If more docs are needed, take additional ones from the groups

docs = [
  "Doc about first fact",
  "Doc about second fact",
  "Doc about third fact",
  "Alt doc about first fact",  # from positive[0][1]
  "Noise doc 1"                # from negative[0]
]
Result: Model has partial info from multiple sources
Must synthesize to answer correctly
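The selection steps above can be sketched as one deterministic function. The name `select_integration_docs` and the toy group labels (F1a, F2a, ...) are illustrative, and the per-group shuffle is omitted so the result is reproducible:

```python
import math

def select_integration_docs(positive_groups, negative, passage_num, noise_rate):
    """Sketch of the _int branch: one doc per group, round-robin extras,
    then pad with noise docs (per-group shuffle omitted for determinism)."""
    neg_num = math.ceil(passage_num * noise_rate)
    pos_num = passage_num - neg_num
    # Step 1: first document from every positive group
    docs = [group[0] for group in positive_groups]
    # Step 2: round-robin additional docs until pos_num is reached
    if len(docs) < pos_num:
        maxnum = max(len(g) for g in positive_groups)
        for i in range(1, maxnum):
            for g in positive_groups:
                if len(g) > i:
                    docs.append(g[i])
                if len(docs) == pos_num:
                    break
            if len(docs) == pos_num:
                break
    # Step 3: pad with noise documents up to passage_num
    docs += negative[:passage_num - len(docs)]
    return docs

groups = [["F1a", "F1b"], ["F2a", "F2b"], ["F3a"]]
docs = select_integration_docs(groups, ["N1", "N2"], passage_num=5, noise_rate=0.2)
print(docs)  # ['F1a', 'F2a', 'F3a', 'F1b', 'N1']
```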
Answer Checking
Uses same checkanswer() function as noise robustness
Metric Calculation
Uses same metric calculation as noise robustness:
tt = 0
for i in results:
    label = i['label']
    if 0 not in label and 1 in label:  # Correct answer found
        tt += 1
all_rate = tt / len(results)
Output:
{
"all_rate": 0.78, // 78% success rate
"nums": 100
}
3. COUNTERFACTUAL ROBUSTNESS
Data Preparation
elif '_fact' in filename:  # Counterfactual dataset
    correct_num = math.ceil(passage_num * correct_rate)
    pos_num = passage_num - neg_num - correct_num
    # Example: passage_num=5, noise_rate=0.2, correct_rate=0.5
    #   neg_num = ceil(5 × 0.2) = 1      (noise docs)
    #   correct_num = ceil(5 × 0.5) = 3  (correct-answer docs)
    #   pos_num = 5 - 1 - 3 = 1          (wrong-answer docs)

    # Select wrong/counterfactual docs
    indexs = list(range(len(instance['positive'])))
    selected = random.sample(indexs, min(len(indexs), pos_num))
    docs = [instance['positive_wrong'][i] for i in selected]  # Docs with the fake answer

    # Add correct docs
    remain = [i for i in indexs if i not in selected]
    if correct_num > 0 and len(remain) > 0:
        docs += [instance['positive'][i]
                 for i in random.sample(remain, min(len(remain), correct_num))]

    # Add noise docs
    if neg_num > 0:
        docs += instance['negative'][:neg_num]
Data Structure (en_fact.json):
{
  "query": "What is the capital of France?",
  "answer": "Paris",
  "fakeanswer": "London",
  "positive": ["Doc about Paris", ...],
  "positive_wrong": ["Doc saying London is capital", ...],
  "negative": ["Noise doc", ...]
}
Example Mix:
Documents given to model:
- "The capital of France is London." (positive_wrong - FAKE)
- "Paris is known for the Eiffel Tower." (positive - CORRECT)
- "France is in Europe." (negative - NOISE)
Model must:
1. Detect that "London" is wrong
2. Provide correct answer "Paris"
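The _fact mixing can be sketched end to end. The function name `mix_counterfactual_docs`, the fixed seed, and the toy documents are illustrative (the original uses the global random state inside `processdata`):

```python
import math
import random

def mix_counterfactual_docs(instance, passage_num, noise_rate, correct_rate, seed=0):
    """Sketch of the _fact branch: wrong-answer docs + correct docs + noise."""
    rng = random.Random(seed)
    neg_num = math.ceil(passage_num * noise_rate)
    correct_num = math.ceil(passage_num * correct_rate)
    pos_num = passage_num - neg_num - correct_num
    indexs = list(range(len(instance['positive'])))
    selected = rng.sample(indexs, min(len(indexs), pos_num))
    docs = [instance['positive_wrong'][i] for i in selected]  # fake-answer docs
    remain = [i for i in indexs if i not in selected]
    if correct_num > 0 and remain:
        docs += [instance['positive'][i]
                 for i in rng.sample(remain, min(len(remain), correct_num))]
    if neg_num > 0:
        docs += instance['negative'][:neg_num]  # noise docs go last
    return docs

instance = {
    'positive': ["Paris doc 1", "Paris doc 2", "Paris doc 3", "Paris doc 4"],
    'positive_wrong': ["London doc 1", "London doc 2", "London doc 3", "London doc 4"],
    'negative': ["Noise 1", "Noise 2"],
}
docs = mix_counterfactual_docs(instance, passage_num=5, noise_rate=0.2, correct_rate=0.5)
print(len(docs))  # 5: 1 wrong-answer doc + 3 correct docs + 1 noise doc
```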
Answer Checking
Same checkanswer() function is used
Metric Calculation
if '_fact' in args.dataset:
    fact_tt = 0     # Times a factual error was detected
    correct_tt = 0  # Times the correct answer was also given
    for i in results:
        if i['factlabel'] == 1:      # Factual error detected?
            fact_tt += 1
            if 0 not in i['label']:  # Correct answer present (no 0 labels)?
                correct_tt += 1

    # Calculate metrics
    fact_check_rate = fact_tt / len(results)  # Detection rate
    if fact_tt > 0:
        correct_rate = correct_tt / fact_tt   # Correction rate (among detections)
    else:
        correct_rate = 0
Logic Explained:
For each result:
factlabel == 1?
This checks: Does the response contain a "factual errors" keyword?
If YES → fact_tt += 1 (the model detected the error)
0 not in label?
This checks: Does the response contain the correct answer (label has no 0)?
If YES → correct_tt += 1 (the model also corrected the error)
Example Scenarios:
Scenario 1: FULL SUCCESS
Response: "The documents contain factual errors: they claim London,
but the correct answer is Paris."
factlabel == 1? YES (contains "factual errors") ✓
fact_tt += 1
0 not in label? YES (contains the correct answer "Paris") ✓
correct_tt += 1
Result: Both counters incremented
Scenario 2: DETECTION ONLY
Response: "The documents contain factual errors about the capital."
factlabel == 1? YES (contains "factual errors") ✓
fact_tt += 1
0 not in label? NO (the correct answer is missing)
correct_tt += 0
Result: Only detection counted
Scenario 3: NO DETECTION
Response: "According to the documents, the capital is London."
factlabel == 1? NO (no error keywords)
fact_tt += 0
Result: Neither counter incremented
Output:
{
  "fact_check_rate": 0.89,  // 89 of 100 samples flagged a factual error
  "correct_rate": 0.955,    // Of the 89 detections, 85 also gave the correct answer
  "fact_tt": 89,            // Errors detected
  "correct_tt": 85,         // Errors corrected
  "nums": 100
}
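The two metrics fit in a small runnable sketch; the helper name `counterfactual_metrics` is illustrative (the original runs this loop inline), and the sample results are toy data:

```python
def counterfactual_metrics(results):
    """Sketch of the _fact metric loop described above."""
    fact_tt = 0     # responses that flagged a factual error
    correct_tt = 0  # of those, responses that also gave the correct answer
    for r in results:
        if r['factlabel'] == 1:
            fact_tt += 1
            if 0 not in r['label']:
                correct_tt += 1
    fact_check_rate = fact_tt / len(results)         # detection rate over all samples
    correct_rate = correct_tt / fact_tt if fact_tt else 0  # correction rate among detections
    return fact_check_rate, correct_rate

results = [
    {'factlabel': 1, 'label': [1]},  # detected and corrected
    {'factlabel': 1, 'label': [0]},  # detected only
    {'factlabel': 0, 'label': [0]},  # missed entirely
    {'factlabel': 1, 'label': [1]},  # detected and corrected
]
fact_check_rate, correct_rate = counterfactual_metrics(results)
print(fact_check_rate)         # 0.75
print(round(correct_rate, 3))  # 0.667
```

Note that correct_rate is conditional on detection: a model that never flags an error gets correct_rate = 0 regardless of its answers.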
Label Interpretation Summary
| Label | Meaning | When Used |
|---|---|---|
| 0 | Incorrect answer found | When ground truth NOT in response |
| 1 | Correct answer found | When ground truth IS in response |
| -1 | Insufficient information (rejection) | When rejection keywords detected |
| factlabel=1 | Factual error detected | When "factual errors" keyword found |
Calculation Flow Comparison
Noise Robustness (noise_rate ≠ 1)
For each sample:
1. Mix pos_num correct docs + neg_num noise docs
2. Get model response
3. Check if it contains the correct answer → label = 0 or 1
4. If label has 1 and no 0 → count as success
Final: success_count / total_samples = accuracy
Negative Rejection (noise_rate = 1)
For each sample:
1. Use ONLY negative docs (no correct answer)
2. Get model response
3. Check if it contains a rejection keyword → label = -1
4. If label == -1 → count as success
Final: rejection_count / total_samples = rejection_rate
Information Integration (_int dataset)
For each sample:
1. Mix docs from multiple groups (requires synthesis)
2. Get model response
3. Check if it contains the correct answer → label = 0 or 1
4. If label has 1 and no 0 → count as success
Final: success_count / total_samples = accuracy
Counterfactual Robustness (_fact dataset)
For each sample:
1. Mix wrong docs + correct docs + noise docs
2. Get model response
3. Check for the "factual errors" keyword → factlabel = 0 or 1
4. Check if it contains the correct answer → label = 0 or 1
5. If factlabel == 1 → detection_count += 1
6. If factlabel == 1 AND 0 not in label → correction_count += 1
Final:
detection_rate = detection_count / total
correction_rate = correction_count / detection_count
Key Differences from Current Implementation
| Aspect | Original | Current |
|---|---|---|
| Answer Checking | Simple substring match only | Multi-strategy (substring, token overlap, normalized) |
| Rejection Detection | Only 2 keywords ("factual errors", "insufficient information") | 30+ keywords with tiered matching |
| Label System | Uses 0/1/-1 labels | Uses explicit counter fields |
| Metric Calculation | Complex conditional logic | Simple property calculations |
| Multiple Noise Levels | One at a time | All together (0%, 20%, 40%, 60%, 80%) |
| Output Structure | One JSON file per run | Structured EvaluationResult objects |