# Evaluation Comparison: evalue_original.py vs Application Implementation
## Overview
The original evaluation code (`evalue_original.py`) was a standalone script for evaluating the RGB benchmark, while the application implementation has been refactored into a modular architecture with `src/evaluator.py` and integrated into the Streamlit UI (`app.py`).
---
## Key Differences
### 1. **Architecture & Organization**
#### Original (`evalue_original.py`)
- **Monolithic design**: All evaluation logic in a single script
- **Procedural approach**: Functions for data processing, answer checking, and evaluation
- **CLI-based**: Command-line arguments for configuration
- **Direct file I/O**: Reads/writes directly to JSON files in result directories
#### Current Application
- **Modular design**: Separated into:
  - `src/evaluator.py`: Core evaluation logic and metrics
  - `src/pipeline.py`: Orchestration and batch processing
  - `src/config.py`: Configuration management
  - `app.py`: Streamlit UI for interactive use
- **Object-oriented approach**: `RGBEvaluator` and `EvaluationResult` classes
- **Web-based UI**: Interactive Streamlit interface with visualizations
- **Flexible I/O**: Supports multiple data sources and output formats
---
### 2. **Answer Checking Logic**
#### Original (`checkanswer()` function)
```python
def checkanswer(prediction, ground_truth):
    prediction = prediction.lower()
    if type(ground_truth) is not list:
        ground_truth = [ground_truth]
    labels = []
    for instance in ground_truth:
        flag = True
        if type(instance) == list:
            flag = False
            instance = [i.lower() for i in instance]
            for i in instance:
                if i in prediction:
                    flag = True
                    break
        else:
            instance = instance.lower()
            if instance not in prediction:
                flag = False
        labels.append(int(flag))
    return labels
```
- Simple substring matching (case-insensitive)
- Handles both single answers and lists of answers
- Returns boolean labels (0 or 1)
#### Current (`is_correct()` method)
```python
def is_correct(self, response: str, ground_truth: str, strict: bool = False) -> bool
```
- **Normalized comparison**: Removes punctuation, extra whitespace
- **Multiple matching strategies**:
  - Strict mode: exact match
  - Flexible mode: substring match
  - Token overlap: 80% token similarity
- **Better error handling**: Handles None/empty values
- **Type safety**: Uses proper type hints and dataclass structures
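The matching tiers above can be sketched as follows. This is a minimal illustration, not the application's verbatim code: the helper name `_normalize` and the exact treatment of the 80% threshold are assumptions.

```python
import re
import string


def _normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def is_correct(response: str, ground_truth: str, strict: bool = False) -> bool:
    """Tiered matching: exact, then substring, then 80% token overlap."""
    if not response or not ground_truth:          # handle None/empty values
        return False
    resp, truth = _normalize(response), _normalize(ground_truth)
    if strict:
        return resp == truth                      # strict mode: exact match
    if truth in resp:                             # flexible mode: substring
        return True
    truth_tokens = set(truth.split())
    if not truth_tokens:
        return False
    overlap = len(truth_tokens & set(resp.split())) / len(truth_tokens)
    return overlap >= 0.8                         # token-overlap fallback
```

Compared with `checkanswer()`, the normalization step means `"Paris,"` and `"paris"` compare equal, and the overlap tier tolerates minor wording differences that plain substring matching would reject.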
---
### 3. **Rejection Detection**
#### Original
- **Simple keyword check**: Only checks for the Chinese phrase "俑息不足" or the English phrase "insufficient information"
- **Limited scope**: Only 2 rejection phrases
#### Current
```python
PRIMARY_REJECTION_PHRASES = [
"i can not answer the question because of the insufficient information in documents",
"insufficient information in documents",
"can not answer",
"cannot answer",
]
REJECTION_KEYWORDS = [
"i don't know", "i cannot", "i can't", "unable to", "not able to",
"insufficient information", "no information", "cannot determine", ...
]
```
- **Comprehensive rejection detection**: 30+ rejection phrases/keywords
- **Tiered approach**: Primary phrases (exact match from paper) + secondary keywords (flexible matching)
- **Better alignment with research**: Matches the exact rejection phrase specified in Figure 3 of the paper
- **Multi-language support**: Can be extended for other languages
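The tiered approach reduces to checking the primary phrases first, then falling back to the broader keyword list. A minimal sketch, using the lists shown above (for brevity, both tiers use substring matching here, whereas the application may match primary phrases more strictly):

```python
PRIMARY_REJECTION_PHRASES = [
    "i can not answer the question because of the insufficient information in documents",
    "insufficient information in documents",
    "can not answer",
    "cannot answer",
]
REJECTION_KEYWORDS = [
    "i don't know", "i cannot", "i can't", "unable to", "not able to",
    "insufficient information", "no information", "cannot determine",
]


def is_rejection(response: str) -> bool:
    """Tier 1: primary phrases from the paper; tier 2: flexible keywords."""
    text = response.lower().strip()
    if any(phrase in text for phrase in PRIMARY_REJECTION_PHRASES):
        return True
    return any(keyword in text for keyword in REJECTION_KEYWORDS)
```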
---
### 4. **Metrics & Evaluation Results**
#### Original
- **Manual aggregation**: Counts tallied manually in main script
- **Limited metrics**:
  - Overall accuracy
  - Noise rate
  - Fact-checking rate (for counterfactual dataset)
- **Basic output**: JSON file with counts and percentages
#### Current (`EvaluationResult` dataclass)
```python
@dataclass
class EvaluationResult:
    task_type: str
    model_name: str
    total_samples: int = 0
    correct: int = 0
    incorrect: int = 0
    rejected: int = 0
    errors_detected: int = 0
    errors_corrected: int = 0
    accuracy_by_noise: Dict[int, float] = field(default_factory=dict)

    @property
    def accuracy(self) -> float: ...

    @property
    def rejection_rate(self) -> float: ...

    @property
    def error_detection_rate(self) -> float: ...

    @property
    def error_correction_rate(self) -> float: ...
```
- **Structured results**: Dataclass with computed properties
- **Comprehensive metrics**:
  - Accuracy by noise level
  - Rejection rate (negative rejection task)
  - Error detection & correction rates (counterfactual task)
- **Serialization**: Easy to convert to dict/JSON with `to_dict()` method
- **Scalability**: Can track multiple metrics simultaneously
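The property bodies are elided above; a plausible completion (an assumption consistent with the counts shown, not the application's exact code) and a usage example:

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class EvaluationResult:
    task_type: str
    model_name: str
    total_samples: int = 0
    correct: int = 0
    incorrect: int = 0
    rejected: int = 0
    accuracy_by_noise: Dict[int, float] = field(default_factory=dict)

    @property
    def accuracy(self) -> float:
        # Guard against division by zero on an empty run.
        return self.correct / self.total_samples if self.total_samples else 0.0

    @property
    def rejection_rate(self) -> float:
        return self.rejected / self.total_samples if self.total_samples else 0.0


# Hypothetical run: 100 samples, 87 correct, 5 rejected.
result = EvaluationResult("noise_robustness", "example-model",
                          total_samples=100, correct=87, rejected=5)
print(result.accuracy)        # 0.87
print(result.rejection_rate)  # 0.05
```

Because the rates are computed properties rather than stored fields, they can never fall out of sync with the underlying counts.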
---
### 5. **Data Processing**
#### Original (`processdata()` function)
- **Complex data handling**: Different logic for different dataset types (_int, _fact, default)
- **Shuffle and selection**: Random sampling of positive/negative documents
- **Noise injection**: Dynamic calculation of positive/negative document ratios
- **Config via function parameters**: Noise rate, passage number, correct rate
#### Current (Pipeline approach)
- **Data loading decoupled**: `src/data_loader.py` handles file I/O
- **Dataset-specific processors**: Separate methods for each task type
- **Cleaner configuration**: Centralized in `src/config.py`
- **Better error handling**: Validation and type checking
- **Flexible document selection**: Can be adjusted without modifying core logic
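The noise injection described above reduces to splitting the passage budget between positive and negative documents according to the noise rate. A simplified sketch (the function and parameter names are illustrative, not taken from either codebase):

```python
import random


def select_documents(positive, negative, noise_rate, passage_num, seed=None):
    """Mix positive and negative passages at the requested noise rate."""
    rng = random.Random(seed)
    neg_num = int(passage_num * noise_rate)   # noisy (negative) passages
    pos_num = passage_num - neg_num           # supporting (positive) passages
    docs = (rng.sample(positive, min(pos_num, len(positive)))
            + rng.sample(negative, min(neg_num, len(negative))))
    rng.shuffle(docs)                         # avoid positional bias
    return docs
```

For example, `noise_rate=0.4` with `passage_num=5` yields two negative and three positive passages, shuffled so the model cannot rely on document order.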
---
### 6. **Model Integration**
#### Original
- **Multiple model classes**: Direct imports from `models.models`
- **String-based routing**: Long if-elif chain to instantiate models
- **Manual model setup**: Requires knowing which model class to use for each model type
- **Limited extensibility**: Adding new models requires code changes
#### Current (`llm_client.py`)
- **Abstraction layer**: LLMClient interface
- **Configuration-driven**: Model selection via config
- **Provider-based**: Support for OpenAI, HuggingFace, custom endpoints
- **Error handling**: Retry logic, timeout handling
- **Extensible**: Easy to add new providers
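The abstraction-layer idea can be sketched as an interface plus a provider registry; the class and registry names here are assumptions for illustration, not the actual `llm_client.py` API:

```python
from abc import ABC, abstractmethod


class LLMClient(ABC):
    """Provider-agnostic interface; concrete clients wrap one backend."""

    @abstractmethod
    def generate(self, prompt: str) -> str: ...


class EchoClient(LLMClient):
    """Trivial stand-in provider, useful for wiring tests."""

    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"


def make_client(provider: str) -> LLMClient:
    # Real code would map "openai", "huggingface", etc. to their clients.
    registry = {"echo": EchoClient}
    return registry[provider]()
```

Adding a new provider then means registering one new class, rather than extending an if-elif chain as in the original script.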
---
### 7. **Error Handling & Robustness**
#### Original
- **Basic try-except**: Catches all exceptions with generic error message
- **No validation**: Assumes valid data and responses
- **Fails silently**: Continues despite errors
#### Current
- **Comprehensive validation**: Checks data types, ranges, formats
- **Specific error messages**: Detailed logging for debugging
- **Graceful degradation**: Continues processing with partial results
- **Logging infrastructure**: Track errors and warnings throughout pipeline
---
### 8. **User Interface & Visualization**
#### Original
- **CLI only**: Command-line arguments and console output
- **File-based results**: JSON files in result directories
- **No visualization**: User must parse JSON manually
#### Current
- **Interactive Streamlit UI**:
  - Real-time evaluation status
  - Interactive charts and visualizations
  - Metric comparisons across models and tasks
  - Results export functionality
- **Visual metrics**:
  - Accuracy vs. noise level line charts
  - Rejection rate comparisons
  - Error detection/correction metrics
  - Task-specific breakdowns
---
### 9. **Specific Metric Differences**
| Metric | Original | Current | Status |
|--------|----------|---------|--------|
| Accuracy | Sum of correct answers | correct / total × 100 | Enhanced with per-noise tracking |
| Noise Robustness | All-or-nothing (`0 in labels` check) | Accuracy calculated per noise level | Improved granularity |
| Rejection Rate | Manual fact-checking | Dedicated rejection detection | More comprehensive |
| Error Detection | Fact label check | Keyword-based detection | More robust |
| Error Correction | Manual label validation | Verify both error detection + correct answer | Stricter validation |
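The per-noise-level accuracy in the table above amounts to grouping samples by their noise rate before averaging; a minimal sketch (the pair-based input format is an assumption):

```python
from collections import defaultdict


def accuracy_by_noise(samples):
    """samples: iterable of (noise_rate, is_correct) pairs."""
    totals, hits = defaultdict(int), defaultdict(int)
    for noise, correct in samples:
        totals[noise] += 1
        hits[noise] += int(correct)
    return {noise: hits[noise] / totals[noise] for noise in totals}
```

This is the granularity gain over the original's all-or-nothing check: one accuracy figure per noise level instead of a single pass/fail label per sample set.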
---
### 10. **Code Quality Improvements**
| Aspect | Original | Current |
|--------|----------|---------|
| Type hints | None | Full type hints with mypy compatibility |
| Documentation | Minimal comments | Comprehensive docstrings |
| Error messages | Generic | Specific and actionable |
| Testing | None visible | Testable class methods |
| Code reuse | Duplicated logic | DRY principle followed |
| Configuration | Hard-coded/CLI args | Centralized config |
| Logging | Print statements | Structured logging |
---
## Functional Equivalence
Despite architectural differences, the core evaluation logic is functionally equivalent:
1. **Data loading**: Original JSON parsing β†’ Current data_loader.py
2. **Answer checking**: Original substring check β†’ Current multi-strategy is_correct()
3. **Rejection detection**: Original keyword check β†’ Current comprehensive is_rejection()
4. **Metrics calculation**: Original manual tallying β†’ Current class properties
5. **Results storage**: Original JSON files β†’ Current structured dataclass + JSON export
---
## Migration Path
If moving from original to current implementation:
1. **CLI β†’ UI**: Replace command-line args with Streamlit dropdowns/inputs
2. **File I/O**: Use current data_loader instead of direct file access
3. **Model setup**: Use llm_client instead of model class selection
4. **Evaluation**: Call RGBEvaluator methods instead of predict/processdata functions
5. **Results**: Use EvaluationResult.to_dict() for JSON export
---
## Conclusion
The refactored implementation maintains all core evaluation functionality while providing:
- Better code organization and maintainability
- Enhanced user experience with interactive UI
- More robust answer checking and rejection detection
- Comprehensive metric tracking
- Extensible architecture for future improvements