# Evaluation Comparison: evalue_original.py vs Application Implementation

## Overview

The original evaluation code (`evalue_original.py`) was a standalone script for evaluating the RGB benchmark, while the application implementation has been refactored into a modular architecture with `src/evaluator.py` and integrated into the Streamlit UI (`app.py`).

---

## Key Differences

### 1. **Architecture & Organization**

#### Original (`evalue_original.py`)

- **Monolithic design**: All evaluation logic in a single script
- **Procedural approach**: Functions for data processing, answer checking, and evaluation
- **CLI-based**: Command-line arguments for configuration
- **Direct file I/O**: Reads/writes directly to JSON files in result directories

#### Current Application

- **Modular design**: Separated into:
  - `src/evaluator.py`: Core evaluation logic and metrics
  - `src/pipeline.py`: Orchestration and batch processing
  - `src/config.py`: Configuration management
  - `app.py`: Streamlit UI for interactive use
- **Object-oriented approach**: `RGBEvaluator` and `EvaluationResult` classes
- **Web-based UI**: Interactive Streamlit interface with visualizations
- **Flexible I/O**: Supports multiple data sources and output formats

---

### 2. **Answer Checking Logic**

#### Original (`checkanswer()` function)

```python
def checkanswer(prediction, ground_truth):
    prediction = prediction.lower()
    if type(ground_truth) is not list:
        ground_truth = [ground_truth]
    labels = []
    for instance in ground_truth:
        flag = True
        if type(instance) == list:
            flag = False
            instance = [i.lower() for i in instance]
            for i in instance:
                if i in prediction:
                    flag = True
                    break
        else:
            instance = instance.lower()
            if instance not in prediction:
                flag = False
        labels.append(int(flag))
    return labels
```

- Simple case-insensitive substring matching
- Handles both single answers and lists of answers
- Returns integer labels (0 or 1)

#### Current (`is_correct()` method)

```python
def is_correct(self, response: str, ground_truth: str, strict: bool = False) -> bool
```

- **Normalized comparison**: Removes punctuation and extra whitespace
- **Multiple matching strategies**:
  - Strict mode: Exact match
  - Flexible mode: Substring match
  - Token overlap: 80% token similarity
- **Better error handling**: Handles None/empty values
- **Type safety**: Uses proper type hints and dataclass structures

---

### 3. **Rejection Detection**

#### Original

- **Simple keyword check**: Only checks for the Chinese phrase "信息不足" or the English phrase "insufficient information"
- **Limited scope**: Only 2 rejection phrases

#### Current

```python
PRIMARY_REJECTION_PHRASES = [
    "i can not answer the question because of the insufficient information in documents",
    "insufficient information in documents",
    "can not answer",
    "cannot answer",
]

REJECTION_KEYWORDS = [
    "i don't know",
    "i cannot",
    "i can't",
    "unable to",
    "not able to",
    "insufficient information",
    "no information",
    "cannot determine",
    ...
]
```

- **Comprehensive rejection detection**: 30+ rejection phrases/keywords
- **Tiered approach**: Primary phrases (exact match from the paper) + secondary keywords (flexible matching)
- **Better alignment with research**: Figure 3 of the paper specifies the exact rejection phrase
- **Multi-language support**: Can be extended to other languages

---

### 4. **Metrics & Evaluation Results**

#### Original

- **Manual aggregation**: Counts tallied manually in the main script
- **Limited metrics**:
  - Overall accuracy
  - Noise rate
  - Fact-checking rate (for the counterfactual dataset)
- **Basic output**: JSON file with counts and percentages

#### Current (`EvaluationResult` dataclass)

```python
@dataclass
class EvaluationResult:
    task_type: str
    model_name: str
    total_samples: int = 0
    correct: int = 0
    incorrect: int = 0
    rejected: int = 0
    errors_detected: int = 0
    errors_corrected: int = 0
    accuracy_by_noise: Dict[int, float] = field(default_factory=dict)

    @property
    def accuracy(self) -> float: ...

    @property
    def rejection_rate(self) -> float: ...

    @property
    def error_detection_rate(self) -> float: ...

    @property
    def error_correction_rate(self) -> float: ...
```

- **Structured results**: Dataclass with computed properties
- **Comprehensive metrics**:
  - Accuracy by noise level
  - Rejection rate (negative rejection task)
  - Error detection & correction rates (counterfactual task)
- **Serialization**: Easy to convert to dict/JSON with a `to_dict()` method
- **Scalability**: Can track multiple metrics simultaneously

---
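The dataclass-with-computed-properties pattern above can be sketched in a few lines. The property formulas below (percentages of total samples, with a zero-division guard) are assumptions for illustration, and the class is deliberately reduced; the real `EvaluationResult` in `src/evaluator.py` may compute its metrics differently:

```python
from dataclasses import dataclass, field, asdict
from typing import Dict


@dataclass
class EvaluationResultSketch:
    """Reduced stand-in for the EvaluationResult pattern described above."""

    task_type: str
    model_name: str
    total_samples: int = 0
    correct: int = 0
    rejected: int = 0
    accuracy_by_noise: Dict[int, float] = field(default_factory=dict)

    @property
    def accuracy(self) -> float:
        # Guard against division by zero when nothing has been evaluated yet.
        if self.total_samples == 0:
            return 0.0
        return self.correct / self.total_samples * 100

    @property
    def rejection_rate(self) -> float:
        if self.total_samples == 0:
            return 0.0
        return self.rejected / self.total_samples * 100

    def to_dict(self) -> dict:
        # Include computed properties alongside the raw counters so a
        # JSON export carries the derived metrics too.
        data = asdict(self)
        data["accuracy"] = self.accuracy
        data["rejection_rate"] = self.rejection_rate
        return data


result = EvaluationResultSketch(
    "noise_robustness", "gpt-4", total_samples=50, correct=40, rejected=5
)
print(result.accuracy)        # 80.0
print(result.rejection_rate)  # 10.0
```

Keeping the counters as plain fields and the rates as properties means callers can never cache a stale percentage; the rates are always recomputed from the current tallies.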
### 5. **Data Processing**

#### Original (`processdata()` function)

- **Complex data handling**: Different logic for different dataset types (`_int`, `_fact`, default)
- **Shuffle and selection**: Random sampling of positive/negative documents
- **Noise injection**: Dynamic calculation of positive/negative document ratios
- **Config via function parameters**: Noise rate, passage number, correct rate

#### Current (Pipeline approach)

- **Data loading decoupled**: `src/data_loader.py` handles file I/O
- **Dataset-specific processors**: Separate methods for each task type
- **Cleaner configuration**: Centralized in `src/config.py`
- **Better error handling**: Validation and type checking
- **Flexible document selection**: Can be adjusted without modifying core logic

---

### 6. **Model Integration**

#### Original

- **Multiple model classes**: Direct imports from `models.models`
- **String-based routing**: Long if-elif chain to instantiate models
- **Manual model setup**: Requires knowing which model class corresponds to each model type
- **Limited extensibility**: Adding new models requires code changes

#### Current (`llm_client.py`)

- **Abstraction layer**: `LLMClient` interface
- **Configuration-driven**: Model selection via config
- **Provider-based**: Support for OpenAI, HuggingFace, and custom endpoints
- **Error handling**: Retry logic and timeout handling
- **Extensible**: Easy to add new providers

---

### 7. **Error Handling & Robustness**

#### Original

- **Basic try-except**: Catches all exceptions with a generic error message
- **No validation**: Assumes valid data and responses
- **Fails silently**: Continues despite errors

#### Current

- **Comprehensive validation**: Checks data types, ranges, and formats
- **Specific error messages**: Detailed logging for debugging
- **Graceful degradation**: Continues processing with partial results
- **Logging infrastructure**: Tracks errors and warnings throughout the pipeline

---
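The retry and logging behavior described for the current implementation usually amounts to exponential backoff around the provider call, with each failure logged before the next attempt. The helper below is a hypothetical sketch, not the actual `llm_client.py` code; `call_with_retry` and its parameters are illustrative names:

```python
import logging
import time

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("llm_client")


def call_with_retry(request_fn, max_retries: int = 3, base_delay: float = 1.0):
    """Call request_fn, retrying on failure with exponential backoff.

    Hypothetical helper illustrating the retry/timeout handling the
    current llm_client.py is described as providing.
    """
    for attempt in range(max_retries):
        try:
            return request_fn()
        except Exception as exc:  # the real client would catch provider-specific errors
            logger.warning("attempt %d/%d failed: %s", attempt + 1, max_retries, exc)
            if attempt == max_retries - 1:
                raise  # exhausted: surface the last error to the caller
            time.sleep(base_delay * 2 ** attempt)  # back off: 1s, 2s, 4s, ...


# Simulated flaky endpoint that succeeds on the third attempt.
attempts = {"n": 0}

def flaky_request():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "response text"

result = call_with_retry(flaky_request, base_delay=0.01)
print(result)  # response text
```

Logging the failure before sleeping (rather than swallowing it) is what separates this from the original's silent `try-except`: every retry leaves a trace for debugging.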
### 8. **User Interface & Visualization**

#### Original

- **CLI only**: Command-line arguments and console output
- **File-based results**: JSON files in result directories
- **No visualization**: User must parse JSON manually

#### Current

- **Interactive Streamlit UI**:
  - Real-time evaluation status
  - Interactive charts and visualizations
  - Metric comparisons across models and tasks
  - Results export functionality
- **Visual metrics**:
  - Accuracy vs. noise level line charts
  - Rejection rate comparisons
  - Error detection/correction metrics
  - Task-specific breakdowns

---

### 9. **Specific Metric Differences**

| Metric | Original | Current | Status |
|--------|----------|---------|--------|
| Accuracy | Sum of correct answers | correct / total × 100 | Enhanced with per-noise tracking |
| Noise robustness | All-or-nothing (checks `0 in labels`) | Accuracy calculated per noise level | Improved granularity |
| Rejection rate | Manual fact-checking | Dedicated rejection detection | More comprehensive |
| Error detection | Fact label check | Keyword-based detection | More robust |
| Error correction | Manual label validation | Verifies both error detection and a correct answer | Stricter validation |

---

### 10. **Code Quality Improvements**

| Aspect | Original | Current |
|--------|----------|---------|
| Type hints | None | Full type hints with mypy compatibility |
| Documentation | Minimal comments | Comprehensive docstrings |
| Error messages | Generic | Specific and actionable |
| Testing | None visible | Testable class methods |
| Code reuse | Duplicated logic | DRY principle followed |
| Configuration | Hard-coded/CLI args | Centralized config |
| Logging | Print statements | Structured logging |

---

## Functional Equivalence

Despite the architectural differences, the core evaluation logic is functionally equivalent:

1. **Data loading**: Original JSON parsing → current `data_loader.py`
2. **Answer checking**: Original substring check → current multi-strategy `is_correct()`
3. **Rejection detection**: Original keyword check → current comprehensive `is_rejection()`
4. **Metrics calculation**: Original manual tallying → current class properties
5. **Results storage**: Original JSON files → current structured dataclass + JSON export

---

## Migration Path

If moving from the original to the current implementation:

1. **CLI → UI**: Replace command-line args with Streamlit dropdowns/inputs
2. **File I/O**: Use the current `data_loader` instead of direct file access
3. **Model setup**: Use `llm_client` instead of model class selection
4. **Evaluation**: Call `RGBEvaluator` methods instead of the `predict`/`processdata` functions
5. **Results**: Use `EvaluationResult.to_dict()` for JSON export

---

## Conclusion

The refactored implementation maintains all core evaluation functionality while providing:

- Better code organization and maintainability
- An enhanced user experience with the interactive UI
- More robust answer checking and rejection detection
- Comprehensive metric tracking
- An extensible architecture for future improvements
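As a closing illustration, step 4 of the migration path (calling evaluator methods instead of the old `predict`/`processdata` functions) can be sketched end to end. The `RGBEvaluator` below is a simplified stand-in: the method names `is_correct` and `is_rejection` come from this document, but their bodies are reduced to plain substring checks; the real class in `src/evaluator.py` adds normalization, token overlap, and the tiered rejection phrases.

```python
class RGBEvaluator:
    """Simplified stand-in: substring answer check + keyword rejection check."""

    REJECTION_KEYWORDS = ["insufficient information", "cannot answer"]

    def is_rejection(self, response: str) -> bool:
        response = response.lower()
        return any(k in response for k in self.REJECTION_KEYWORDS)

    def is_correct(self, response: str, ground_truth: str) -> bool:
        return ground_truth.lower() in response.lower()


evaluator = RGBEvaluator()
samples = [
    {"response": "The capital is Paris.", "answer": "Paris"},
    {"response": "Insufficient information in documents.", "answer": "Lyon"},
]

# Tally correctness and rejections the way the evaluator-driven loop would.
correct = sum(evaluator.is_correct(s["response"], s["answer"]) for s in samples)
rejected = sum(evaluator.is_rejection(s["response"]) for s in samples)
print(correct, rejected)  # 1 1
```

From here the tallies would feed an `EvaluationResult` and its `to_dict()` export rather than the original script's manual JSON bookkeeping.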