# Evaluation Comparison: evalue_original.py vs Application Implementation
## Overview
The original evaluation code (`evalue_original.py`) was a standalone script for evaluating the RGB benchmark, while the application implementation has been refactored into a modular architecture with `src/evaluator.py` and integrated into the Streamlit UI (`app.py`).
---
## Key Differences
### 1. **Architecture & Organization**
#### Original (`evalue_original.py`)
- **Monolithic design**: All evaluation logic in a single script
- **Procedural approach**: Functions for data processing, answer checking, and evaluation
- **CLI-based**: Command-line arguments for configuration
- **Direct file I/O**: Reads/writes directly to JSON files in result directories
#### Current Application
- **Modular design**: Separated into:
  - `src/evaluator.py`: Core evaluation logic and metrics
  - `src/pipeline.py`: Orchestration and batch processing
  - `src/config.py`: Configuration management
  - `app.py`: Streamlit UI for interactive use
- **Object-oriented approach**: `RGBEvaluator` and `EvaluationResult` classes
- **Web-based UI**: Interactive Streamlit interface with visualizations
- **Flexible I/O**: Supports multiple data sources and output formats
---
### 2. **Answer Checking Logic**
#### Original (`checkanswer()` function)
```python
def checkanswer(prediction, ground_truth):
    prediction = prediction.lower()
    if type(ground_truth) is not list:
        ground_truth = [ground_truth]
    labels = []
    for instance in ground_truth:
        flag = True
        if type(instance) == list:
            flag = False
            instance = [i.lower() for i in instance]
            for i in instance:
                if i in prediction:
                    flag = True
                    break
        else:
            instance = instance.lower()
            if instance not in prediction:
                flag = False
        labels.append(int(flag))
    return labels
```
- Simple substring matching (case-insensitive)
- Handles both single answers and lists of answers
- Returns boolean labels (0 or 1)
#### Current (`is_correct()` method)
```python
def is_correct(self, response: str, ground_truth: str, strict: bool = False) -> bool
```
- **Normalized comparison**: Removes punctuation, extra whitespace
- **Multiple matching strategies**:
  - Strict mode: exact match
  - Flexible mode: substring match
  - Token overlap: 80% token similarity
- **Better error handling**: Handles None/empty values
- **Type safety**: Uses proper type hints and dataclass structures
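The matching tiers above can be sketched as follows. This is a minimal illustration, not the application's verbatim code: the helper name `_normalize` and the exact treatment of the 80% threshold are assumptions.

```python
import re
import string


def _normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def is_correct(response: str, ground_truth: str, strict: bool = False) -> bool:
    """Tiered matching: exact, then substring, then 80% token overlap."""
    if not response or not ground_truth:          # handle None/empty values
        return False
    resp, truth = _normalize(response), _normalize(ground_truth)
    if strict:
        return resp == truth                      # strict mode: exact match
    if truth in resp:                             # flexible mode: substring
        return True
    truth_tokens = set(truth.split())
    if not truth_tokens:
        return False
    overlap = len(truth_tokens & set(resp.split())) / len(truth_tokens)
    return overlap >= 0.8                         # token-overlap fallback
```

Compared with `checkanswer()`, the normalization step means `"Paris,"` and `"paris"` compare equal, and the overlap tier tolerates minor wording differences that plain substring matching would reject.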
---
### 3. **Rejection Detection**
#### Original
- **Simple keyword check**: Only checks for the Chinese phrase "俑息不足" or the English phrase "insufficient information"
- **Limited scope**: Only 2 rejection phrases
#### Current
```python
PRIMARY_REJECTION_PHRASES = [
"i can not answer the question because of the insufficient information in documents",
"insufficient information in documents",
"can not answer",
"cannot answer",
]
REJECTION_KEYWORDS = [
"i don't know", "i cannot", "i can't", "unable to", "not able to",
"insufficient information", "no information", "cannot determine", ...
]
```
- **Comprehensive rejection detection**: 30+ rejection phrases/keywords
- **Tiered approach**: Primary phrases (exact match from paper) + secondary keywords (flexible matching)
- **Better alignment with research**: Matches the exact rejection phrase specified in Figure 3 of the paper
- **Multi-language support**: Can be extended for other languages
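The tiered approach reduces to checking the primary phrases first, then falling back to the broader keyword list. A minimal sketch, using the lists shown above (for brevity, both tiers use substring matching here, whereas the application may match primary phrases more strictly):

```python
PRIMARY_REJECTION_PHRASES = [
    "i can not answer the question because of the insufficient information in documents",
    "insufficient information in documents",
    "can not answer",
    "cannot answer",
]
REJECTION_KEYWORDS = [
    "i don't know", "i cannot", "i can't", "unable to", "not able to",
    "insufficient information", "no information", "cannot determine",
]


def is_rejection(response: str) -> bool:
    """Tier 1: primary phrases from the paper; tier 2: flexible keywords."""
    text = response.lower().strip()
    if any(phrase in text for phrase in PRIMARY_REJECTION_PHRASES):
        return True
    return any(keyword in text for keyword in REJECTION_KEYWORDS)
```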
---
### 4. **Metrics & Evaluation Results**
#### Original
- **Manual aggregation**: Counts tallied manually in main script
- **Limited metrics**:
  - Overall accuracy
  - Noise rate
  - Fact-checking rate (for counterfactual dataset)
- **Basic output**: JSON file with counts and percentages
#### Current (`EvaluationResult` dataclass)
```python
@dataclass
class EvaluationResult:
    task_type: str
    model_name: str
    total_samples: int = 0
    correct: int = 0
    incorrect: int = 0
    rejected: int = 0
    errors_detected: int = 0
    errors_corrected: int = 0
    accuracy_by_noise: Dict[int, float] = field(default_factory=dict)

    @property
    def accuracy(self) -> float: ...

    @property
    def rejection_rate(self) -> float: ...

    @property
    def error_detection_rate(self) -> float: ...

    @property
    def error_correction_rate(self) -> float: ...
```
- **Structured results**: Dataclass with computed properties
- **Comprehensive metrics**:
  - Accuracy by noise level
  - Rejection rate (negative rejection task)
  - Error detection & correction rates (counterfactual task)
- **Serialization**: Easy to convert to dict/JSON with `to_dict()` method
- **Scalability**: Can track multiple metrics simultaneously
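The property bodies are elided above; a plausible completion (an assumption consistent with the counts shown, not the application's exact code) and a usage example:

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class EvaluationResult:
    task_type: str
    model_name: str
    total_samples: int = 0
    correct: int = 0
    incorrect: int = 0
    rejected: int = 0
    accuracy_by_noise: Dict[int, float] = field(default_factory=dict)

    @property
    def accuracy(self) -> float:
        # Guard against division by zero on an empty run.
        return self.correct / self.total_samples if self.total_samples else 0.0

    @property
    def rejection_rate(self) -> float:
        return self.rejected / self.total_samples if self.total_samples else 0.0


# Hypothetical run: 100 samples, 87 correct, 5 rejected.
result = EvaluationResult("noise_robustness", "example-model",
                          total_samples=100, correct=87, rejected=5)
print(result.accuracy)        # 0.87
print(result.rejection_rate)  # 0.05
```

Because the rates are computed properties rather than stored fields, they can never fall out of sync with the underlying counts.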
---
### 5. **Data Processing**
#### Original (`processdata()` function)
- **Complex data handling**: Different logic for different dataset types (_int, _fact, default)
- **Shuffle and selection**: Random sampling of positive/negative documents
- **Noise injection**: Dynamic calculation of positive/negative document ratios
- **Config via function parameters**: Noise rate, passage number, correct rate
#### Current (Pipeline approach)
- **Data loading decoupled**: `src/data_loader.py` handles file I/O
- **Dataset-specific processors**: Separate methods for each task type
- **Cleaner configuration**: Centralized in `src/config.py`
- **Better error handling**: Validation and type checking
- **Flexible document selection**: Can be adjusted without modifying core logic
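The noise injection described above reduces to splitting the passage budget between positive and negative documents according to the noise rate. A simplified sketch (the function and parameter names are illustrative, not taken from either codebase):

```python
import random


def select_documents(positive, negative, noise_rate, passage_num, seed=None):
    """Mix positive and negative passages at the requested noise rate."""
    rng = random.Random(seed)
    neg_num = int(passage_num * noise_rate)   # noisy (negative) passages
    pos_num = passage_num - neg_num           # supporting (positive) passages
    docs = (rng.sample(positive, min(pos_num, len(positive)))
            + rng.sample(negative, min(neg_num, len(negative))))
    rng.shuffle(docs)                         # avoid positional bias
    return docs
```

For example, `noise_rate=0.4` with `passage_num=5` yields two negative and three positive passages, shuffled so the model cannot rely on document order.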
---
### 6. **Model Integration**
#### Original
- **Multiple model classes**: Direct imports from `models.models`
- **String-based routing**: Long if-elif chain to instantiate models
- **Manual model setup**: Requires knowing which model class to use for each model type
- **Limited extensibility**: Adding new models requires code changes
#### Current (`llm_client.py`)
- **Abstraction layer**: LLMClient interface
- **Configuration-driven**: Model selection via config
- **Provider-based**: Support for OpenAI, HuggingFace, custom endpoints
- **Error handling**: Retry logic, timeout handling
- **Extensible**: Easy to add new providers
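The abstraction-layer idea can be sketched as an interface plus a provider registry; the class and registry names here are assumptions for illustration, not the actual `llm_client.py` API:

```python
from abc import ABC, abstractmethod


class LLMClient(ABC):
    """Provider-agnostic interface; concrete clients wrap one backend."""

    @abstractmethod
    def generate(self, prompt: str) -> str: ...


class EchoClient(LLMClient):
    """Trivial stand-in provider, useful for wiring tests."""

    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"


def make_client(provider: str) -> LLMClient:
    # Real code would map "openai", "huggingface", etc. to their clients.
    registry = {"echo": EchoClient}
    return registry[provider]()
```

Adding a new provider then means registering one new class, rather than extending an if-elif chain as in the original script.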
---
### 7. **Error Handling & Robustness**
#### Original
- **Basic try-except**: Catches all exceptions with generic error message
- **No validation**: Assumes valid data and responses
- **Fails silently**: Continues despite errors
#### Current
- **Comprehensive validation**: Checks data types, ranges, formats
- **Specific error messages**: Detailed logging for debugging
- **Graceful degradation**: Continues processing with partial results
- **Logging infrastructure**: Track errors and warnings throughout pipeline
---
### 8. **User Interface & Visualization**
#### Original
- **CLI only**: Command-line arguments and console output
- **File-based results**: JSON files in result directories
- **No visualization**: User must parse JSON manually
#### Current
- **Interactive Streamlit UI**:
  - Real-time evaluation status
  - Interactive charts and visualizations
  - Metric comparisons across models and tasks
  - Results export functionality
- **Visual metrics**:
  - Accuracy vs. noise level line charts
  - Rejection rate comparisons
  - Error detection/correction metrics
  - Task-specific breakdowns
---
### 9. **Specific Metric Differences**
| Metric | Original | Current | Status |
|--------|----------|---------|--------|
| Accuracy | Sum of correct answers | correct / total × 100 | Enhanced with per-noise tracking |
| Noise Robustness | All-or-nothing (`0 in labels` check) | Accuracy calculated per noise level | Improved granularity |
| Rejection Rate | Manual fact-checking | Dedicated rejection detection | More comprehensive |
| Error Detection | Fact label check | Keyword-based detection | More robust |
| Error Correction | Manual label validation | Verify both error detection + correct answer | Stricter validation |
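The per-noise-level accuracy in the table above amounts to grouping samples by their noise rate before averaging; a minimal sketch (the pair-based input format is an assumption):

```python
from collections import defaultdict


def accuracy_by_noise(samples):
    """samples: iterable of (noise_rate, is_correct) pairs."""
    totals, hits = defaultdict(int), defaultdict(int)
    for noise, correct in samples:
        totals[noise] += 1
        hits[noise] += int(correct)
    return {noise: hits[noise] / totals[noise] for noise in totals}
```

This is the granularity gain over the original's all-or-nothing check: one accuracy figure per noise level instead of a single pass/fail label per sample set.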
---
### 10. **Code Quality Improvements**
| Aspect | Original | Current |
|--------|----------|---------|
| Type hints | None | Full type hints with mypy compatibility |
| Documentation | Minimal comments | Comprehensive docstrings |
| Error messages | Generic | Specific and actionable |
| Testing | None visible | Testable class methods |
| Code reuse | Duplicated logic | DRY principle followed |
| Configuration | Hard-coded/CLI args | Centralized config |
| Logging | Print statements | Structured logging |
---
## Functional Equivalence
Despite architectural differences, the core evaluation logic is functionally equivalent:
1. **Data loading**: Original JSON parsing β†’ Current data_loader.py
2. **Answer checking**: Original substring check β†’ Current multi-strategy is_correct()
3. **Rejection detection**: Original keyword check β†’ Current comprehensive is_rejection()
4. **Metrics calculation**: Original manual tallying β†’ Current class properties
5. **Results storage**: Original JSON files β†’ Current structured dataclass + JSON export
---
## Migration Path
If moving from original to current implementation:
1. **CLI β†’ UI**: Replace command-line args with Streamlit dropdowns/inputs
2. **File I/O**: Use current data_loader instead of direct file access
3. **Model setup**: Use llm_client instead of model class selection
4. **Evaluation**: Call RGBEvaluator methods instead of predict/processdata functions
5. **Results**: Use EvaluationResult.to_dict() for JSON export
---
## Conclusion
The refactored implementation maintains all core evaluation functionality while providing:
- Better code organization and maintainability
- Enhanced user experience with interactive UI
- More robust answer checking and rejection detection
- Comprehensive metric tracking
- Extensible architecture for future improvements