RGBMetrics / EVALUATION_COMPARISON.md
Evaluation Comparison: evalue_original.py vs Application Implementation

Overview

The original evaluation code (evalue_original.py) was a standalone script for running the RGB benchmark evaluation. The application implementation refactors it into a modular architecture, with core logic in src/evaluator.py and an interactive Streamlit UI in app.py.


Key Differences

1. Architecture & Organization

Original (evalue_original.py)

  • Monolithic design: All evaluation logic in a single script
  • Procedural approach: Functions for data processing, answer checking, and evaluation
  • CLI-based: Command-line arguments for configuration
  • Direct file I/O: Reads/writes directly to JSON files in result directories

Current Application

  • Modular design: Separated into:
    • src/evaluator.py: Core evaluation logic and metrics
    • src/pipeline.py: Orchestration and batch processing
    • src/config.py: Configuration management
    • app.py: Streamlit UI for interactive use
  • Object-oriented approach: RGBEvaluator and EvaluationResult classes
  • Web-based UI: Interactive Streamlit interface with visualizations
  • Flexible I/O: Supports multiple data sources and output formats

2. Answer Checking Logic

Original (checkanswer() function)

def checkanswer(prediction, ground_truth):
    prediction = prediction.lower()
    if type(ground_truth) is not list:
        ground_truth = [ground_truth]
    labels = []
    for instance in ground_truth:
        flag = True
        if type(instance) == list:
            flag = False 
            instance = [i.lower() for i in instance]
            for i in instance:
                if i in prediction:
                    flag = True
                    break
        else:
            instance = instance.lower()
            if instance not in prediction:
                flag = False
        labels.append(int(flag))
    return labels
  • Simple substring matching (case-insensitive)
  • Handles both single answers and lists of answers
  • Returns integer labels (0 or 1), one per ground-truth item
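To make the list-of-lists semantics concrete, here is a condensed, runnable restatement of checkanswer() (behavior-preserving, with `isinstance` replacing the `type()` comparisons) applied to a hypothetical input:

```python
def checkanswer(prediction, ground_truth):
    """Condensed copy of checkanswer() above, for a runnable example."""
    prediction = prediction.lower()
    if not isinstance(ground_truth, list):
        ground_truth = [ground_truth]
    labels = []
    for instance in ground_truth:
        if isinstance(instance, list):
            # A nested list is an alias set: any one alias counts as correct.
            flag = any(i.lower() in prediction for i in instance)
        else:
            flag = instance.lower() in prediction
        labels.append(int(flag))
    return labels

print(checkanswer("The capital is Paris.", [["paris", "ville de paris"], "france"]))
# → [1, 0]: the first item matches via the "paris" alias, the second does not.
```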

Current (is_correct() method)

def is_correct(self, response: str, ground_truth: str, strict: bool = False) -> bool
  • Normalized comparison: Removes punctuation, extra whitespace
  • Multiple matching strategies:
    • Strict mode: Exact match
    • Flexible mode: Substring match
    • Token overlap: 80% token similarity
  • Better error handling: Handles None/empty values
  • Type safety: Uses proper type hints and dataclass structures
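The bullets above can be sketched as follows. This is a minimal illustration of the described strategy, not the app's actual implementation: the normalization details and the exact handling of the 80% token-overlap threshold are assumptions.

```python
import re

def _normalize(text: str) -> str:
    # Lowercase, strip punctuation, collapse whitespace.
    text = re.sub(r"[^\w\s]", " ", (text or "").lower())
    return " ".join(text.split())

def is_correct(response: str, ground_truth: str, strict: bool = False) -> bool:
    resp, truth = _normalize(response), _normalize(ground_truth)
    if not resp or not truth:        # handle None/empty values
        return False
    if strict:
        return resp == truth         # strict mode: exact match
    if truth in resp:                # flexible mode: substring match
        return True
    # Fallback: at least 80% of ground-truth tokens appear in the response.
    truth_tokens = set(truth.split())
    overlap = len(truth_tokens & set(resp.split())) / len(truth_tokens)
    return overlap >= 0.8

print(is_correct("The answer is New York City.", "new york city"))  # → True
```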

3. Rejection Detection

Original

  • Simple keyword check: Only matches the Chinese phrase "信息不足" ("insufficient information") or the English "insufficient information"
  • Limited scope: Only 2 rejection phrases

Current

PRIMARY_REJECTION_PHRASES = [
    "i can not answer the question because of the insufficient information in documents",
    "insufficient information in documents",
    "can not answer",
    "cannot answer",
]

REJECTION_KEYWORDS = [
    "i don't know", "i cannot", "i can't", "unable to", "not able to",
    "insufficient information", "no information", "cannot determine", ...
]
  • Comprehensive rejection detection: 30+ rejection phrases/keywords
  • Tiered approach: Primary phrases (exact match from paper) + secondary keywords (flexible matching)
  • Better alignment with the research: Figure 3 of the paper specifies the exact rejection phrase
  • Multi-language support: Can be extended for other languages
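A minimal sketch of the tiered check described above. The phrase lists are abbreviated here, and the helper name is_rejection follows the Functional Equivalence section; the app's actual matching logic may differ.

```python
PRIMARY_REJECTION_PHRASES = [
    "i can not answer the question because of the insufficient information in documents",
    "insufficient information in documents",
    "can not answer",
    "cannot answer",
]

REJECTION_KEYWORDS = [
    "i don't know", "i cannot", "i can't", "unable to",
    "insufficient information", "no information", "cannot determine",
]

def is_rejection(response: str) -> bool:
    text = (response or "").lower().strip()
    # Tier 1: exact phrases from the paper, matched as substrings.
    if any(phrase in text for phrase in PRIMARY_REJECTION_PHRASES):
        return True
    # Tier 2: looser keywords for paraphrased refusals.
    return any(keyword in text for keyword in REJECTION_KEYWORDS)

print(is_rejection("Sorry, I don't know."))  # → True, via the keyword tier
```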

4. Metrics & Evaluation Results

Original

  • Manual aggregation: Counts tallied manually in main script
  • Limited metrics:
    • Overall accuracy
    • Noise rate
    • Fact-checking rate (for counterfactual dataset)
  • Basic output: JSON file with counts and percentages

Current (EvaluationResult dataclass)

@dataclass
class EvaluationResult:
    task_type: str
    model_name: str
    total_samples: int = 0
    correct: int = 0
    incorrect: int = 0
    rejected: int = 0
    errors_detected: int = 0
    errors_corrected: int = 0
    accuracy_by_noise: Dict[int, float] = field(default_factory=dict)
    
    @property
    def accuracy(self) -> float: ...
    @property
    def rejection_rate(self) -> float: ...
    @property
    def error_detection_rate(self) -> float: ...
    @property
    def error_correction_rate(self) -> float: ...
  • Structured results: Dataclass with computed properties
  • Comprehensive metrics:
    • Accuracy by noise level
    • Rejection rate (negative rejection task)
    • Error detection & correction rates (counterfactual task)
  • Serialization: Easy to convert to dict/JSON with to_dict() method
  • Scalability: Can track multiple metrics simultaneously
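The computed properties elided above might be implemented as below. This is a sketch with a reduced field set; the exact denominators (for example, whether rejected samples are excluded from accuracy) are assumptions.

```python
from dataclasses import dataclass, field, asdict
from typing import Dict

@dataclass
class EvaluationResult:
    task_type: str
    model_name: str
    total_samples: int = 0
    correct: int = 0
    rejected: int = 0
    accuracy_by_noise: Dict[int, float] = field(default_factory=dict)

    @property
    def accuracy(self) -> float:
        # Guard against division by zero on empty runs.
        return 100.0 * self.correct / self.total_samples if self.total_samples else 0.0

    @property
    def rejection_rate(self) -> float:
        return 100.0 * self.rejected / self.total_samples if self.total_samples else 0.0

    def to_dict(self) -> dict:
        # asdict() captures the stored fields; computed properties are added explicitly.
        return {**asdict(self), "accuracy": self.accuracy, "rejection_rate": self.rejection_rate}

result = EvaluationResult("noise_robustness", "example-model", total_samples=200, correct=150, rejected=10)
print(result.accuracy)  # → 75.0
```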

5. Data Processing

Original (processdata() function)

  • Complex data handling: Different logic for different dataset types (_int, _fact, default)
  • Shuffle and selection: Random sampling of positive/negative documents
  • Noise injection: Dynamic calculation of positive/negative document ratios
  • Config via function parameters: Noise rate, passage number, correct rate
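The noise-injection arithmetic can be sketched as follows. select_documents is a hypothetical helper for illustration; the original processdata() additionally special-cases the _int and _fact datasets.

```python
import random

def select_documents(positive, negative, passage_num=5, noise_rate=0.4, seed=0):
    """Mix relevant and noise documents at a target noise rate."""
    rng = random.Random(seed)
    neg_num = int(passage_num * noise_rate)   # noise documents to inject
    pos_num = passage_num - neg_num           # supporting documents to keep
    docs = rng.sample(positive, pos_num) + rng.sample(negative, neg_num)
    rng.shuffle(docs)                         # shuffle so position carries no signal
    return docs

docs = select_documents([f"pos{i}" for i in range(5)], [f"neg{i}" for i in range(5)])
print(len(docs))  # → 5, of which 2 are noise at noise_rate=0.4
```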

Current (Pipeline approach)

  • Data loading decoupled: src/data_loader.py handles file I/O
  • Dataset-specific processors: Separate methods for each task type
  • Cleaner configuration: Centralized in src/config.py
  • Better error handling: Validation and type checking
  • Flexible document selection: Can be adjusted without modifying core logic

6. Model Integration

Original

  • Multiple model classes: Direct imports from models.models
  • String-based routing: Long if-elif chain to instantiate models
  • Manual model setup: Requires knowing which model class for each model type
  • Limited extensibility: Adding new models requires code changes

Current (llm_client.py)

  • Abstraction layer: LLMClient interface
  • Configuration-driven: Model selection via config
  • Provider-based: Support for OpenAI, HuggingFace, custom endpoints
  • Error handling: Retry logic, timeout handling
  • Extensible: Easy to add new providers
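The abstraction and retry behavior might look like the sketch below. EchoClient, FlakyClient, and generate_with_retry are hypothetical stand-ins to make the pattern concrete, not the app's actual API.

```python
from abc import ABC, abstractmethod
import time

class LLMClient(ABC):
    """Provider-agnostic interface: each provider implements generate()."""
    @abstractmethod
    def generate(self, prompt: str) -> str: ...

class EchoClient(LLMClient):
    # Hypothetical provider used only to make the sketch runnable.
    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"

class FlakyClient(LLMClient):
    # Fails once, then succeeds, to demonstrate the retry path.
    def __init__(self):
        self.calls = 0
    def generate(self, prompt: str) -> str:
        self.calls += 1
        if self.calls < 2:
            raise TimeoutError("transient failure")
        return "ok"

def generate_with_retry(client: LLMClient, prompt: str, retries: int = 3, backoff: float = 1.0) -> str:
    for attempt in range(retries):
        try:
            return client.generate(prompt)
        except Exception:
            if attempt == retries - 1:
                raise                              # out of retries: surface the error
            time.sleep(backoff * 2 ** attempt)     # exponential backoff between attempts
```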

7. Error Handling & Robustness

Original

  • Basic try-except: Catches all exceptions with generic error message
  • No validation: Assumes valid data and responses
  • Fails silently: Continues despite errors

Current

  • Comprehensive validation: Checks data types, ranges, formats
  • Specific error messages: Detailed logging for debugging
  • Graceful degradation: Continues processing with partial results
  • Logging infrastructure: Track errors and warnings throughout pipeline
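The graceful-degradation pattern can be sketched as follows; evaluate_all is a hypothetical helper illustrating the idea, not the pipeline's actual code.

```python
import logging

logger = logging.getLogger("rgb.pipeline")

def evaluate_all(samples, evaluate_one):
    """Evaluate every sample; log failures and keep going instead of aborting."""
    results, errors = [], 0
    for i, sample in enumerate(samples):
        try:
            results.append(evaluate_one(sample))
        except Exception as exc:
            errors += 1
            logger.warning("sample %d failed: %s", i, exc)
    return results, errors

# One bad sample is logged and skipped; the other two still produce results.
results, errors = evaluate_all([1, "two", 3], lambda s: s + 1)
print(results, errors)  # → [2, 4] 1
```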

8. User Interface & Visualization

Original

  • CLI only: Command-line arguments and console output
  • File-based results: JSON files in result directories
  • No visualization: User must parse JSON manually

Current

  • Interactive Streamlit UI:
    • Real-time evaluation status
    • Interactive charts and visualizations
    • Metric comparisons across models and tasks
    • Results export functionality
  • Visual metrics:
    • Accuracy vs noise level line charts
    • Rejection rate comparisons
    • Error detection/correction metrics
    • Task-specific breakdowns

9. Specific Metric Differences

| Metric | Original | Current | Status |
|---|---|---|---|
| Accuracy | Sum of correct answers | correct/total × 100 | Enhanced with per-noise tracking |
| Noise Robustness | All-or-nothing (`0 in labels` check) | Accuracy calculated per noise level | Improved granularity |
| Rejection Rate | Manual fact-checking | Dedicated rejection detection | More comprehensive |
| Error Detection | Fact label check | Keyword-based detection | More robust |
| Error Correction | Manual label validation | Verifies both error detection and a correct answer | Stricter validation |
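Per-noise-level accuracy reduces to a grouped mean over (noise rate, correctness) pairs; accuracy_by_noise is a hypothetical helper sketching the computation.

```python
from collections import defaultdict

def accuracy_by_noise(records):
    """records: iterable of (noise_rate, is_correct) pairs → {noise_rate: accuracy %}."""
    totals, hits = defaultdict(int), defaultdict(int)
    for noise, ok in records:
        totals[noise] += 1
        hits[noise] += int(ok)
    return {n: 100.0 * hits[n] / totals[n] for n in totals}

print(accuracy_by_noise([(0.0, True), (0.0, True), (0.4, True), (0.4, False)]))
# → {0.0: 100.0, 0.4: 50.0}
```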

10. Code Quality Improvements

| Aspect | Original | Current |
|---|---|---|
| Type hints | None | Full type hints with mypy compatibility |
| Documentation | Minimal comments | Comprehensive docstrings |
| Error messages | Generic | Specific and actionable |
| Testing | None visible | Testable class methods |
| Code reuse | Duplicated logic | DRY principle followed |
| Configuration | Hard-coded/CLI args | Centralized config |
| Logging | Print statements | Structured logging |

Functional Equivalence

Despite the architectural differences, the core evaluation workflow is preserved: each step of the original script maps directly onto the current implementation.

  1. Data loading: Original JSON parsing → Current data_loader.py
  2. Answer checking: Original substring check → Current multi-strategy is_correct()
  3. Rejection detection: Original keyword check → Current comprehensive is_rejection()
  4. Metrics calculation: Original manual tallying → Current class properties
  5. Results storage: Original JSON files → Current structured dataclass + JSON export

Migration Path

If moving from original to current implementation:

  1. CLI → UI: Replace command-line args with Streamlit dropdowns/inputs
  2. File I/O: Use current data_loader instead of direct file access
  3. Model setup: Use llm_client instead of model class selection
  4. Evaluation: Call RGBEvaluator methods instead of predict/processdata functions
  5. Results: Use EvaluationResult.to_dict() for JSON export

Conclusion

The refactored implementation maintains all core evaluation functionality while providing:

  • Better code organization and maintainability
  • Enhanced user experience with interactive UI
  • More robust answer checking and rejection detection
  • Comprehensive metric tracking
  • Extensible architecture for future improvements