RGBMetrics / EVALUATION_COMPARISON.md
Evaluation Comparison: evalue_original.py vs Application Implementation

Overview

The original evaluation code (evalue_original.py) was a standalone script for running the RGB benchmark evaluation. The application implementation refactors it into a modular architecture, with core logic in src/evaluator.py and an interactive Streamlit UI in app.py.


Key Differences

1. Architecture & Organization

Original (evalue_original.py)

  • Monolithic design: All evaluation logic in a single script
  • Procedural approach: Functions for data processing, answer checking, and evaluation
  • CLI-based: Command-line arguments for configuration
  • Direct file I/O: Reads/writes directly to JSON files in result directories

Current Application

  • Modular design: Separated into:
    • src/evaluator.py: Core evaluation logic and metrics
    • src/pipeline.py: Orchestration and batch processing
    • src/config.py: Configuration management
    • app.py: Streamlit UI for interactive use
  • Object-oriented approach: RGBEvaluator and EvaluationResult classes
  • Web-based UI: Interactive Streamlit interface with visualizations
  • Flexible I/O: Supports multiple data sources and output formats

2. Answer Checking Logic

Original (checkanswer() function)

def checkanswer(prediction, ground_truth):
    prediction = prediction.lower()
    if type(ground_truth) is not list:
        ground_truth = [ground_truth]
    labels = []
    for instance in ground_truth:
        flag = True
        if type(instance) == list:
            flag = False 
            instance = [i.lower() for i in instance]
            for i in instance:
                if i in prediction:
                    flag = True
                    break
        else:
            instance = instance.lower()
            if instance not in prediction:
                flag = False
        labels.append(int(flag))
    return labels
  • Simple substring matching (case-insensitive)
  • Handles both single answers and lists of answers
  • Returns integer labels (0 or 1), one per ground-truth item
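To make the list-of-lists semantics concrete, here is a condensed, runnable restatement of checkanswer() (behavior-preserving, with `isinstance` replacing the `type()` comparisons) applied to a hypothetical input:

```python
def checkanswer(prediction, ground_truth):
    """Condensed copy of checkanswer() above, for a runnable example."""
    prediction = prediction.lower()
    if not isinstance(ground_truth, list):
        ground_truth = [ground_truth]
    labels = []
    for instance in ground_truth:
        if isinstance(instance, list):
            # A nested list is an alias set: any one alias counts as correct.
            flag = any(i.lower() in prediction for i in instance)
        else:
            flag = instance.lower() in prediction
        labels.append(int(flag))
    return labels

print(checkanswer("The capital is Paris.", [["paris", "ville de paris"], "france"]))
# → [1, 0]: the first item matches via the "paris" alias, the second does not.
```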

Current (is_correct() method)

def is_correct(self, response: str, ground_truth: str, strict: bool = False) -> bool
  • Normalized comparison: Removes punctuation, extra whitespace
  • Multiple matching strategies:
    • Strict mode: Exact match
    • Flexible mode: Substring match
    • Token overlap: 80% token similarity
  • Better error handling: Handles None/empty values
  • Type safety: Uses proper type hints and dataclass structures
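The bullets above can be sketched as follows. This is a minimal illustration of the described strategy, not the app's actual implementation: the normalization details and the exact handling of the 80% token-overlap threshold are assumptions.

```python
import re

def _normalize(text: str) -> str:
    # Lowercase, strip punctuation, collapse whitespace.
    text = re.sub(r"[^\w\s]", " ", (text or "").lower())
    return " ".join(text.split())

def is_correct(response: str, ground_truth: str, strict: bool = False) -> bool:
    resp, truth = _normalize(response), _normalize(ground_truth)
    if not resp or not truth:        # handle None/empty values
        return False
    if strict:
        return resp == truth         # strict mode: exact match
    if truth in resp:                # flexible mode: substring match
        return True
    # Fallback: at least 80% of ground-truth tokens appear in the response.
    truth_tokens = set(truth.split())
    overlap = len(truth_tokens & set(resp.split())) / len(truth_tokens)
    return overlap >= 0.8

print(is_correct("The answer is New York City.", "new york city"))  # → True
```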

3. Rejection Detection

Original

  • Simple keyword check: Only matches the Chinese phrase "信息不足" ("insufficient information") or the English "insufficient information"
  • Limited scope: Only 2 rejection phrases

Current

PRIMARY_REJECTION_PHRASES = [
    "i can not answer the question because of the insufficient information in documents",
    "insufficient information in documents",
    "can not answer",
    "cannot answer",
]

REJECTION_KEYWORDS = [
    "i don't know", "i cannot", "i can't", "unable to", "not able to",
    "insufficient information", "no information", "cannot determine", ...
]
  • Comprehensive rejection detection: 30+ rejection phrases/keywords
  • Tiered approach: Primary phrases (exact match from paper) + secondary keywords (flexible matching)
  • Better alignment with the research: Figure 3 of the paper specifies the exact rejection phrase
  • Multi-language support: Can be extended for other languages
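A minimal sketch of the tiered check described above. The phrase lists are abbreviated here, and the helper name is_rejection follows the Functional Equivalence section; the app's actual matching logic may differ.

```python
PRIMARY_REJECTION_PHRASES = [
    "i can not answer the question because of the insufficient information in documents",
    "insufficient information in documents",
    "can not answer",
    "cannot answer",
]

REJECTION_KEYWORDS = [
    "i don't know", "i cannot", "i can't", "unable to",
    "insufficient information", "no information", "cannot determine",
]

def is_rejection(response: str) -> bool:
    text = (response or "").lower().strip()
    # Tier 1: exact phrases from the paper, matched as substrings.
    if any(phrase in text for phrase in PRIMARY_REJECTION_PHRASES):
        return True
    # Tier 2: looser keywords for paraphrased refusals.
    return any(keyword in text for keyword in REJECTION_KEYWORDS)

print(is_rejection("Sorry, I don't know."))  # → True, via the keyword tier
```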

4. Metrics & Evaluation Results

Original

  • Manual aggregation: Counts tallied manually in main script
  • Limited metrics:
    • Overall accuracy
    • Noise rate
    • Fact-checking rate (for counterfactual dataset)
  • Basic output: JSON file with counts and percentages

Current (EvaluationResult dataclass)

@dataclass
class EvaluationResult:
    task_type: str
    model_name: str
    total_samples: int = 0
    correct: int = 0
    incorrect: int = 0
    rejected: int = 0
    errors_detected: int = 0
    errors_corrected: int = 0
    accuracy_by_noise: Dict[int, float] = field(default_factory=dict)
    
    @property
    def accuracy(self) -> float: ...
    @property
    def rejection_rate(self) -> float: ...
    @property
    def error_detection_rate(self) -> float: ...
    @property
    def error_correction_rate(self) -> float: ...
  • Structured results: Dataclass with computed properties
  • Comprehensive metrics:
    • Accuracy by noise level
    • Rejection rate (negative rejection task)
    • Error detection & correction rates (counterfactual task)
  • Serialization: Easy to convert to dict/JSON with to_dict() method
  • Scalability: Can track multiple metrics simultaneously
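The computed properties elided above might be implemented as below. This is a sketch with a reduced field set; the exact denominators (for example, whether rejected samples are excluded from accuracy) are assumptions.

```python
from dataclasses import dataclass, field, asdict
from typing import Dict

@dataclass
class EvaluationResult:
    task_type: str
    model_name: str
    total_samples: int = 0
    correct: int = 0
    rejected: int = 0
    accuracy_by_noise: Dict[int, float] = field(default_factory=dict)

    @property
    def accuracy(self) -> float:
        # Guard against division by zero on empty runs.
        return 100.0 * self.correct / self.total_samples if self.total_samples else 0.0

    @property
    def rejection_rate(self) -> float:
        return 100.0 * self.rejected / self.total_samples if self.total_samples else 0.0

    def to_dict(self) -> dict:
        # asdict() captures the stored fields; computed properties are added explicitly.
        return {**asdict(self), "accuracy": self.accuracy, "rejection_rate": self.rejection_rate}

result = EvaluationResult("noise_robustness", "example-model", total_samples=200, correct=150, rejected=10)
print(result.accuracy)  # → 75.0
```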

5. Data Processing

Original (processdata() function)

  • Complex data handling: Different logic for different dataset types (_int, _fact, default)
  • Shuffle and selection: Random sampling of positive/negative documents
  • Noise injection: Dynamic calculation of positive/negative document ratios
  • Config via function parameters: Noise rate, passage number, correct rate
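The noise-injection arithmetic can be sketched as follows. select_documents is a hypothetical helper for illustration; the original processdata() additionally special-cases the _int and _fact datasets.

```python
import random

def select_documents(positive, negative, passage_num=5, noise_rate=0.4, seed=0):
    """Mix relevant and noise documents at a target noise rate."""
    rng = random.Random(seed)
    neg_num = int(passage_num * noise_rate)   # noise documents to inject
    pos_num = passage_num - neg_num           # supporting documents to keep
    docs = rng.sample(positive, pos_num) + rng.sample(negative, neg_num)
    rng.shuffle(docs)                         # shuffle so position carries no signal
    return docs

docs = select_documents([f"pos{i}" for i in range(5)], [f"neg{i}" for i in range(5)])
print(len(docs))  # → 5, of which 2 are noise at noise_rate=0.4
```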

Current (Pipeline approach)

  • Data loading decoupled: src/data_loader.py handles file I/O
  • Dataset-specific processors: Separate methods for each task type
  • Cleaner configuration: Centralized in src/config.py
  • Better error handling: Validation and type checking
  • Flexible document selection: Can be adjusted without modifying core logic

6. Model Integration

Original

  • Multiple model classes: Direct imports from models.models
  • String-based routing: Long if-elif chain to instantiate models
  • Manual model setup: Requires knowing which model class for each model type
  • Limited extensibility: Adding new models requires code changes

Current (llm_client.py)

  • Abstraction layer: LLMClient interface
  • Configuration-driven: Model selection via config
  • Provider-based: Support for OpenAI, HuggingFace, custom endpoints
  • Error handling: Retry logic, timeout handling
  • Extensible: Easy to add new providers
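The abstraction and retry behavior might look like the sketch below. EchoClient, FlakyClient, and generate_with_retry are hypothetical stand-ins to make the pattern concrete, not the app's actual API.

```python
from abc import ABC, abstractmethod
import time

class LLMClient(ABC):
    """Provider-agnostic interface: each provider implements generate()."""
    @abstractmethod
    def generate(self, prompt: str) -> str: ...

class EchoClient(LLMClient):
    # Hypothetical provider used only to make the sketch runnable.
    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"

class FlakyClient(LLMClient):
    # Fails once, then succeeds, to demonstrate the retry path.
    def __init__(self):
        self.calls = 0
    def generate(self, prompt: str) -> str:
        self.calls += 1
        if self.calls < 2:
            raise TimeoutError("transient failure")
        return "ok"

def generate_with_retry(client: LLMClient, prompt: str, retries: int = 3, backoff: float = 1.0) -> str:
    for attempt in range(retries):
        try:
            return client.generate(prompt)
        except Exception:
            if attempt == retries - 1:
                raise                              # out of retries: surface the error
            time.sleep(backoff * 2 ** attempt)     # exponential backoff between attempts
```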

7. Error Handling & Robustness

Original

  • Basic try-except: Catches all exceptions with generic error message
  • No validation: Assumes valid data and responses
  • Fails silently: Continues despite errors

Current

  • Comprehensive validation: Checks data types, ranges, formats
  • Specific error messages: Detailed logging for debugging
  • Graceful degradation: Continues processing with partial results
  • Logging infrastructure: Track errors and warnings throughout pipeline
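The graceful-degradation pattern can be sketched as follows; evaluate_all is a hypothetical helper illustrating the idea, not the pipeline's actual code.

```python
import logging

logger = logging.getLogger("rgb.pipeline")

def evaluate_all(samples, evaluate_one):
    """Evaluate every sample; log failures and keep going instead of aborting."""
    results, errors = [], 0
    for i, sample in enumerate(samples):
        try:
            results.append(evaluate_one(sample))
        except Exception as exc:
            errors += 1
            logger.warning("sample %d failed: %s", i, exc)
    return results, errors

# One bad sample is logged and skipped; the other two still produce results.
results, errors = evaluate_all([1, "two", 3], lambda s: s + 1)
print(results, errors)  # → [2, 4] 1
```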

8. User Interface & Visualization

Original

  • CLI only: Command-line arguments and console output
  • File-based results: JSON files in result directories
  • No visualization: User must parse JSON manually

Current

  • Interactive Streamlit UI:
    • Real-time evaluation status
    • Interactive charts and visualizations
    • Metric comparisons across models and tasks
    • Results export functionality
  • Visual metrics:
    • Accuracy vs noise level line charts
    • Rejection rate comparisons
    • Error detection/correction metrics
    • Task-specific breakdowns

9. Specific Metric Differences

| Metric | Original | Current | Status |
|---|---|---|---|
| Accuracy | Sum of correct answers | correct/total × 100 | Enhanced with per-noise tracking |
| Noise Robustness | All-or-nothing (`0 in labels` check) | Accuracy calculated per noise level | Improved granularity |
| Rejection Rate | Manual fact-checking | Dedicated rejection detection | More comprehensive |
| Error Detection | Fact label check | Keyword-based detection | More robust |
| Error Correction | Manual label validation | Verifies both error detection and a correct answer | Stricter validation |
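Per-noise-level accuracy reduces to a grouped mean over (noise rate, correctness) pairs; accuracy_by_noise is a hypothetical helper sketching the computation.

```python
from collections import defaultdict

def accuracy_by_noise(records):
    """records: iterable of (noise_rate, is_correct) pairs → {noise_rate: accuracy %}."""
    totals, hits = defaultdict(int), defaultdict(int)
    for noise, ok in records:
        totals[noise] += 1
        hits[noise] += int(ok)
    return {n: 100.0 * hits[n] / totals[n] for n in totals}

print(accuracy_by_noise([(0.0, True), (0.0, True), (0.4, True), (0.4, False)]))
# → {0.0: 100.0, 0.4: 50.0}
```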

10. Code Quality Improvements

| Aspect | Original | Current |
|---|---|---|
| Type hints | None | Full type hints with mypy compatibility |
| Documentation | Minimal comments | Comprehensive docstrings |
| Error messages | Generic | Specific and actionable |
| Testing | None visible | Testable class methods |
| Code reuse | Duplicated logic | DRY principle followed |
| Configuration | Hard-coded/CLI args | Centralized config |
| Logging | Print statements | Structured logging |

Functional Equivalence

Despite the architectural differences, the core evaluation workflow is preserved: each step of the original script maps directly onto the current implementation.

  1. Data loading: Original JSON parsing → Current data_loader.py
  2. Answer checking: Original substring check → Current multi-strategy is_correct()
  3. Rejection detection: Original keyword check → Current comprehensive is_rejection()
  4. Metrics calculation: Original manual tallying → Current class properties
  5. Results storage: Original JSON files → Current structured dataclass + JSON export

Migration Path

If moving from original to current implementation:

  1. CLI → UI: Replace command-line args with Streamlit dropdowns/inputs
  2. File I/O: Use current data_loader instead of direct file access
  3. Model setup: Use llm_client instead of model class selection
  4. Evaluation: Call RGBEvaluator methods instead of predict/processdata functions
  5. Results: Use EvaluationResult.to_dict() for JSON export

Conclusion

The refactored implementation maintains all core evaluation functionality while providing:

  • Better code organization and maintainability
  • Enhanced user experience with interactive UI
  • More robust answer checking and rejection detection
  • Comprehensive metric tracking
  • Extensible architecture for future improvements