# Evaluation Comparison: evalue_original.py vs Application Implementation
## Overview

The original evaluation code (`evalue_original.py`) was a standalone script for evaluating the RGB benchmark, while the application implementation has been refactored into a modular architecture with `src/evaluator.py` and integrated into the Streamlit UI (`app.py`).

---

## Key Differences

### 1. **Architecture & Organization**

#### Original (`evalue_original.py`)

- **Monolithic design**: all evaluation logic lives in a single script
- **Procedural approach**: standalone functions for data processing, answer checking, and evaluation
- **CLI-based**: configured via command-line arguments
- **Direct file I/O**: reads and writes JSON files in result directories

#### Current Application

- **Modular design**, separated into:
  - `src/evaluator.py`: core evaluation logic and metrics
  - `src/pipeline.py`: orchestration and batch processing
  - `src/config.py`: configuration management
  - `app.py`: Streamlit UI for interactive use
- **Object-oriented approach**: `RGBEvaluator` and `EvaluationResult` classes
- **Web-based UI**: interactive Streamlit interface with visualizations
- **Flexible I/O**: supports multiple data sources and output formats

---
### 2. **Answer Checking Logic**

#### Original (`checkanswer()` function)

```python
def checkanswer(prediction, ground_truth):
    prediction = prediction.lower()
    if type(ground_truth) is not list:
        ground_truth = [ground_truth]
    labels = []
    for instance in ground_truth:
        flag = True
        if type(instance) == list:
            flag = False
            instance = [i.lower() for i in instance]
            for i in instance:
                if i in prediction:
                    flag = True
                    break
        else:
            instance = instance.lower()
            if instance not in prediction:
                flag = False
        labels.append(int(flag))
    return labels
```

- Simple substring matching (case-insensitive)
- Handles both single answers and lists of answers
- Returns integer labels (0 or 1), one per ground-truth entry
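For concreteness, running `checkanswer()` on a few inputs shows the label scheme; the function is reproduced verbatim so the snippet is self-contained (a nested list means "any of these aliases counts"):

```python
def checkanswer(prediction, ground_truth):
    prediction = prediction.lower()
    if type(ground_truth) is not list:
        ground_truth = [ground_truth]
    labels = []
    for instance in ground_truth:
        flag = True
        if type(instance) == list:
            # A nested list is a set of acceptable aliases: any match counts.
            flag = False
            instance = [i.lower() for i in instance]
            for i in instance:
                if i in prediction:
                    flag = True
                    break
        else:
            instance = instance.lower()
            if instance not in prediction:
                flag = False
        labels.append(int(flag))
    return labels

print(checkanswer("The capital is Paris.", "paris"))                   # [1]
print(checkanswer("Paris, in 1889", ["Paris", ["1889", "the 1880s"]])) # [1, 1]
print(checkanswer("Paris", ["Paris", "1889"]))                         # [1, 0]
```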
#### Current (`is_correct()` method)

```python
def is_correct(self, response: str, ground_truth: str, strict: bool = False) -> bool
```

- **Normalized comparison**: removes punctuation and extra whitespace
- **Multiple matching strategies**:
  - Strict mode: exact match
  - Flexible mode: substring match
  - Token overlap: 80% token similarity as a fallback
- **Better error handling**: handles None/empty values
- **Type safety**: uses proper type hints and dataclass structures
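A minimal sketch of this normalize-then-match cascade, shown as a free function; the `normalize` helper and the exact wiring of the 0.8 threshold are illustrative assumptions, not the app's actual code:

```python
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def is_correct(response: str, ground_truth: str, strict: bool = False) -> bool:
    # Guard against None/empty values before normalizing.
    if not response or not ground_truth:
        return False
    resp, truth = normalize(response), normalize(ground_truth)
    if strict:
        return resp == truth            # strict mode: exact match
    if truth in resp:
        return True                     # flexible mode: substring match
    truth_tokens = set(truth.split())   # fallback: 80% token overlap
    if not truth_tokens:
        return False
    overlap = len(truth_tokens & set(resp.split())) / len(truth_tokens)
    return overlap >= 0.8

print(is_correct("The answer is Mount Everest!", "mount everest"))  # True
print(is_correct("Everest", "mount everest", strict=True))          # False
```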
---

### 3. **Rejection Detection**

#### Original

- **Simple keyword check**: only looks for the Chinese phrase 信息不足 ("insufficient information") or the English phrase "insufficient information"
- **Limited scope**: only 2 rejection phrases

#### Current

```python
PRIMARY_REJECTION_PHRASES = [
    "i can not answer the question because of the insufficient information in documents",
    "insufficient information in documents",
    "can not answer",
    "cannot answer",
]
REJECTION_KEYWORDS = [
    "i don't know", "i cannot", "i can't", "unable to", "not able to",
    "insufficient information", "no information", "cannot determine", ...
]
```

- **Comprehensive rejection detection**: 30+ rejection phrases/keywords
- **Tiered approach**: primary phrases (exact match from the paper) plus secondary keywords (flexible matching)
- **Better alignment with research**: Figure 3 of the paper specifies the exact rejection phrase
- **Multi-language support**: can be extended to other languages
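The tiered check can be sketched as follows, using only the phrases listed above (the app's full keyword list is longer; this `is_rejection` is an illustration, not the app's exact implementation):

```python
PRIMARY_REJECTION_PHRASES = [
    "i can not answer the question because of the insufficient information in documents",
    "insufficient information in documents",
    "can not answer",
    "cannot answer",
]
REJECTION_KEYWORDS = [
    "i don't know", "i cannot", "i can't", "unable to", "not able to",
    "insufficient information", "no information", "cannot determine",
]

def is_rejection(response: str) -> bool:
    """Tier 1: exact phrases from the paper; tier 2: flexible keywords."""
    text = response.lower().strip()
    if any(phrase in text for phrase in PRIMARY_REJECTION_PHRASES):
        return True
    return any(keyword in text for keyword in REJECTION_KEYWORDS)

print(is_rejection("Insufficient information in documents."))  # True
print(is_rejection("The answer is 42."))                       # False
```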
---

### 4. **Metrics & Evaluation Results**

#### Original

- **Manual aggregation**: counts tallied by hand in the main script
- **Limited metrics**:
  - Overall accuracy
  - Noise rate
  - Fact-checking rate (for the counterfactual dataset)
- **Basic output**: JSON file with counts and percentages

#### Current (`EvaluationResult` dataclass)

```python
@dataclass
class EvaluationResult:
    task_type: str
    model_name: str
    total_samples: int = 0
    correct: int = 0
    incorrect: int = 0
    rejected: int = 0
    errors_detected: int = 0
    errors_corrected: int = 0
    accuracy_by_noise: Dict[int, float] = field(default_factory=dict)

    @property
    def accuracy(self) -> float: ...

    @property
    def rejection_rate(self) -> float: ...

    @property
    def error_detection_rate(self) -> float: ...

    @property
    def error_correction_rate(self) -> float: ...
```

- **Structured results**: dataclass with computed properties
- **Comprehensive metrics**:
  - Accuracy by noise level
  - Rejection rate (negative rejection task)
  - Error detection & correction rates (counterfactual task)
- **Serialization**: easy to convert to dict/JSON with a `to_dict()` method
- **Scalability**: can track multiple metrics simultaneously
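The elided property bodies could plausibly be filled in like this. This is a sketch, not the app's actual code: the denominators are assumptions (in particular, measuring correction against *detected* errors rather than total samples):

```python
from dataclasses import dataclass, field, asdict
from typing import Dict

@dataclass
class EvaluationResult:
    task_type: str
    model_name: str
    total_samples: int = 0
    correct: int = 0
    incorrect: int = 0
    rejected: int = 0
    errors_detected: int = 0
    errors_corrected: int = 0
    accuracy_by_noise: Dict[int, float] = field(default_factory=dict)

    @property
    def accuracy(self) -> float:
        # Guard against division by zero on empty runs.
        return self.correct / self.total_samples * 100 if self.total_samples else 0.0

    @property
    def rejection_rate(self) -> float:
        return self.rejected / self.total_samples * 100 if self.total_samples else 0.0

    @property
    def error_detection_rate(self) -> float:
        return self.errors_detected / self.total_samples * 100 if self.total_samples else 0.0

    @property
    def error_correction_rate(self) -> float:
        # Assumed denominator: only errors that were actually detected.
        return self.errors_corrected / self.errors_detected * 100 if self.errors_detected else 0.0

    def to_dict(self) -> dict:
        d = asdict(self)
        d.update(accuracy=self.accuracy, rejection_rate=self.rejection_rate,
                 error_detection_rate=self.error_detection_rate,
                 error_correction_rate=self.error_correction_rate)
        return d

result = EvaluationResult("noise_robustness", "demo-model",
                          total_samples=200, correct=170, incorrect=30)
print(result.accuracy)  # 85.0
```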
---

### 5. **Data Processing**

#### Original (`processdata()` function)

- **Complex data handling**: different logic for different dataset types (`_int`, `_fact`, default)
- **Shuffle and selection**: random sampling of positive/negative documents
- **Noise injection**: dynamic calculation of positive/negative document ratios
- **Config via function parameters**: noise rate, passage number, correct rate

#### Current (pipeline approach)

- **Decoupled data loading**: `src/data_loader.py` handles file I/O
- **Dataset-specific processors**: separate methods for each task type
- **Cleaner configuration**: centralized in `src/config.py`
- **Better error handling**: validation and type checking
- **Flexible document selection**: can be adjusted without modifying core logic
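The noise-injection arithmetic described above can be sketched as follows; `select_documents` and its defaults are illustrative, not the original's actual signature:

```python
import random

def select_documents(positive, negative, passage_num=5, noise_rate=0.4, seed=0):
    """Mix supporting and distractor passages at a given noise rate."""
    rng = random.Random(seed)
    neg_num = int(passage_num * noise_rate)  # e.g. 5 * 0.4 -> 2 distractors
    pos_num = passage_num - neg_num          # remaining slots are positives
    docs = rng.sample(positive, pos_num) + rng.sample(negative, neg_num)
    rng.shuffle(docs)                        # hide where the distractors sit
    return docs

docs = select_documents([f"pos{i}" for i in range(10)],
                        [f"neg{i}" for i in range(10)])
print(len(docs))  # 5
```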
---

### 6. **Model Integration**

#### Original

- **Multiple model classes**: direct imports from `models.models`
- **String-based routing**: long if-elif chain to instantiate models
- **Manual model setup**: requires knowing which model class corresponds to each model type
- **Limited extensibility**: adding a new model requires code changes

#### Current (`llm_client.py`)

- **Abstraction layer**: an `LLMClient` interface
- **Configuration-driven**: model selection via config
- **Provider-based**: support for OpenAI, HuggingFace, and custom endpoints
- **Error handling**: retry logic and timeout handling
- **Extensible**: easy to add new providers
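One way such an abstraction layer with retry logic might look; this is a sketch, and the real interface in `llm_client.py` may differ:

```python
import time
from abc import ABC, abstractmethod

class LLMClient(ABC):
    """Provider-agnostic interface; concrete clients wrap one backend."""

    def __init__(self, max_retries: int = 3, backoff: float = 1.0):
        self.max_retries = max_retries
        self.backoff = backoff

    @abstractmethod
    def _call(self, prompt: str) -> str:
        """Provider-specific request; raises on transport errors."""

    def generate(self, prompt: str) -> str:
        # Retry with linear backoff so a transient API failure doesn't kill a run.
        for attempt in range(self.max_retries):
            try:
                return self._call(prompt)
            except Exception:
                if attempt == self.max_retries - 1:
                    raise
                time.sleep(self.backoff * (attempt + 1))

class EchoClient(LLMClient):
    """Trivial concrete client for exercising the interface."""
    def _call(self, prompt: str) -> str:
        return f"echo: {prompt}"

print(EchoClient().generate("hello"))  # echo: hello
```

Adding a new provider then means writing one subclass that implements `_call`, rather than extending an if-elif chain.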
---

### 7. **Error Handling & Robustness**

#### Original

- **Basic try-except**: catches all exceptions with a generic error message
- **No validation**: assumes valid data and responses
- **Fails silently**: continues despite errors

#### Current

- **Comprehensive validation**: checks data types, ranges, and formats
- **Specific error messages**: detailed logging for debugging
- **Graceful degradation**: continues processing with partial results
- **Logging infrastructure**: tracks errors and warnings throughout the pipeline

---

### 8. **User Interface & Visualization**

#### Original

- **CLI only**: command-line arguments and console output
- **File-based results**: JSON files in result directories
- **No visualization**: users must parse the JSON manually

#### Current

- **Interactive Streamlit UI**:
  - Real-time evaluation status
  - Interactive charts and visualizations
  - Metric comparisons across models and tasks
  - Results export functionality
- **Visual metrics**:
  - Accuracy vs. noise level line charts
  - Rejection rate comparisons
  - Error detection/correction metrics
  - Task-specific breakdowns

---

### 9. **Specific Metric Differences**

| Metric | Original | Current | Status |
|--------|----------|---------|--------|
| Accuracy | Sum of correct answers | correct/total × 100 | Enhanced with per-noise tracking |
| Noise robustness | All-or-nothing (`0 in labels` check) | Accuracy calculated per noise level | Improved granularity |
| Rejection rate | Manual fact-checking | Dedicated rejection detection | More comprehensive |
| Error detection | Fact label check | Keyword-based detection | More robust |
| Error correction | Manual label validation | Verifies both error detection and the correct answer | Stricter validation |
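Per-noise-level accuracy amounts to bucketing outcomes by noise rate and averaging within each bucket (a sketch of the idea; the app's bookkeeping may differ):

```python
from collections import defaultdict

def accuracy_by_noise(records):
    """records: (noise_rate, is_correct) pairs -> {noise_rate: accuracy %}."""
    buckets = defaultdict(lambda: [0, 0])  # noise_rate -> [correct, total]
    for noise_rate, ok in records:
        buckets[noise_rate][0] += int(ok)
        buckets[noise_rate][1] += 1
    return {nr: c / t * 100 for nr, (c, t) in buckets.items()}

records = [(0.0, True), (0.0, True), (0.4, True), (0.4, False)]
print(accuracy_by_noise(records))  # {0.0: 100.0, 0.4: 50.0}
```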
---

### 10. **Code Quality Improvements**

| Aspect | Original | Current |
|--------|----------|---------|
| Type hints | None | Full type hints with mypy compatibility |
| Documentation | Minimal comments | Comprehensive docstrings |
| Error messages | Generic | Specific and actionable |
| Testing | None visible | Testable class methods |
| Code reuse | Duplicated logic | DRY principle followed |
| Configuration | Hard-coded/CLI args | Centralized config |
| Logging | Print statements | Structured logging |

---

## Functional Equivalence

Despite the architectural differences, the core evaluation logic is functionally equivalent:

1. **Data loading**: original JSON parsing → current `data_loader.py`
2. **Answer checking**: original substring check → current multi-strategy `is_correct()`
3. **Rejection detection**: original keyword check → current comprehensive `is_rejection()`
4. **Metrics calculation**: original manual tallying → current class properties
5. **Results storage**: original JSON files → current structured dataclass + JSON export

---

## Migration Path

When moving from the original to the current implementation:

1. **CLI → UI**: replace command-line args with Streamlit dropdowns/inputs
2. **File I/O**: use the current `data_loader` instead of direct file access
3. **Model setup**: use `llm_client` instead of manual model class selection
4. **Evaluation**: call `RGBEvaluator` methods instead of the `predict`/`processdata` functions
5. **Results**: use `EvaluationResult.to_dict()` for JSON export

---

## Conclusion

The refactored implementation maintains all core evaluation functionality while providing:

- Better code organization and maintainability
- An enhanced user experience with an interactive UI
- More robust answer checking and rejection detection
- Comprehensive metric tracking
- An extensible architecture for future improvements