# Evaluation Comparison: `evalue_original.py` vs Application Implementation

## Overview

The original evaluation code (`evalue_original.py`) was a standalone script for evaluating the RGB benchmark, while the application implementation has been refactored into a modular architecture with `src/evaluator.py` and integrated into the Streamlit UI (`app.py`).

## Key Differences

### 1. Architecture & Organization

#### Original (`evalue_original.py`)
- Monolithic design: All evaluation logic in a single script
- Procedural approach: Functions for data processing, answer checking, and evaluation
- CLI-based: Command-line arguments for configuration
- Direct file I/O: Reads/writes directly to JSON files in result directories
#### Current Application

- Modular design: Separated into:
  - `src/evaluator.py`: Core evaluation logic and metrics
  - `src/pipeline.py`: Orchestration and batch processing
  - `src/config.py`: Configuration management
  - `app.py`: Streamlit UI for interactive use
- Object-oriented approach: `RGBEvaluator` and `EvaluationResult` classes
- Web-based UI: Interactive Streamlit interface with visualizations
- Flexible I/O: Supports multiple data sources and output formats
### 2. Answer Checking Logic

#### Original (`checkanswer()` function)

```python
def checkanswer(prediction, ground_truth):
    prediction = prediction.lower()
    if type(ground_truth) is not list:
        ground_truth = [ground_truth]
    labels = []
    for instance in ground_truth:
        flag = True
        if type(instance) == list:
            flag = False
            instance = [i.lower() for i in instance]
            for i in instance:
                if i in prediction:
                    flag = True
                    break
        else:
            instance = instance.lower()
            if instance not in prediction:
                flag = False
        labels.append(int(flag))
    return labels
```
- Simple substring matching (case-insensitive)
- Handles both single answers and lists of answers
- Returns integer labels (0 or 1), one per gold answer
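For concreteness, here is the substring logic run on a few hypothetical predictions (the function from above is repeated verbatim so the snippet runs standalone):

```python
def checkanswer(prediction, ground_truth):
    prediction = prediction.lower()
    if type(ground_truth) is not list:
        ground_truth = [ground_truth]
    labels = []
    for instance in ground_truth:
        flag = True
        if type(instance) == list:
            # a nested list means any alias counts as a hit
            flag = False
            instance = [i.lower() for i in instance]
            for i in instance:
                if i in prediction:
                    flag = True
                    break
        else:
            instance = instance.lower()
            if instance not in prediction:
                flag = False
        labels.append(int(flag))
    return labels

# single answer: plain case-insensitive substring check
print(checkanswer("The capital of France is Paris.", "Paris"))  # [1]
# alias list: either spelling counts
print(checkanswer("It premiered in Sept 2021.", [["September 2021", "Sept 2021"]]))  # [1]
# miss: answer string absent from the prediction
print(checkanswer("I am not sure.", "Paris"))  # [0]
```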
#### Current (`is_correct()` method)

```python
def is_correct(self, response: str, ground_truth: str, strict: bool = False) -> bool
```

- Normalized comparison: Removes punctuation and extra whitespace
- Multiple matching strategies:
  - Strict mode: Exact match
  - Flexible mode: Substring match
  - Token overlap: 80% token similarity
- Better error handling: Handles None/empty values
- Type safety: Uses proper type hints and dataclass structures
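The method body is not shown above; a minimal sketch of the three strategies, written as a free function (`self` dropped) and assuming a `_normalize` helper and an 0.8 overlap threshold, might look like:

```python
import re

def _normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace (assumed helper)."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def is_correct(response: str, ground_truth: str, strict: bool = False) -> bool:
    # None/empty guard
    if not response or not ground_truth:
        return False
    resp, truth = _normalize(response), _normalize(ground_truth)
    if strict:
        return resp == truth            # strict mode: exact match
    if truth in resp:
        return True                     # flexible mode: substring match
    truth_tokens = set(truth.split())
    if not truth_tokens:
        return False
    overlap = len(truth_tokens & set(resp.split())) / len(truth_tokens)
    return overlap >= 0.8               # token-overlap fallback (80% rule)
```

The token-overlap fallback catches reordered answers (e.g. "washington george" vs "george washington") that a pure substring check would miss.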
### 3. Rejection Detection

#### Original

- Simple keyword check: Only checks for the Chinese phrase "信息不足" or the English "insufficient information"
- Limited scope: Only 2 rejection phrases

#### Current

```python
PRIMARY_REJECTION_PHRASES = [
    "i can not answer the question because of the insufficient information in documents",
    "insufficient information in documents",
    "can not answer",
    "cannot answer",
]

REJECTION_KEYWORDS = [
    "i don't know", "i cannot", "i can't", "unable to", "not able to",
    "insufficient information", "no information", "cannot determine", ...
]
```
- Comprehensive rejection detection: 30+ rejection phrases/keywords
- Tiered approach: Primary phrases (exact match from the paper) + secondary keywords (flexible matching)
- Better alignment with the research: Figure 3 of the paper specifies the exact rejection phrase
- Multi-language support: Can be extended to other languages
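The tiered check can be sketched as follows (only the subset of phrases shown above is used here; the full keyword list is elided in this document):

```python
# Subset of the lists shown above; the full REJECTION_KEYWORDS list has 30+ entries.
PRIMARY_REJECTION_PHRASES = [
    "i can not answer the question because of the insufficient information in documents",
    "insufficient information in documents",
    "can not answer",
    "cannot answer",
]
REJECTION_KEYWORDS = [
    "i don't know", "i cannot", "i can't", "unable to", "not able to",
    "insufficient information", "no information", "cannot determine",
]

def is_rejection(response: str) -> bool:
    """Tier 1: exact phrases from the paper; tier 2: looser keyword match."""
    text = (response or "").lower()
    if any(p in text for p in PRIMARY_REJECTION_PHRASES):
        return True
    return any(k in text for k in REJECTION_KEYWORDS)
```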
### 4. Metrics & Evaluation Results

#### Original

- Manual aggregation: Counts tallied manually in the main script
- Limited metrics:
  - Overall accuracy
  - Noise rate
  - Fact-checking rate (for the counterfactual dataset)
- Basic output: JSON file with counts and percentages
#### Current (`EvaluationResult` dataclass)

```python
@dataclass
class EvaluationResult:
    task_type: str
    model_name: str
    total_samples: int = 0
    correct: int = 0
    incorrect: int = 0
    rejected: int = 0
    errors_detected: int = 0
    errors_corrected: int = 0
    accuracy_by_noise: Dict[int, float] = field(default_factory=dict)

    @property
    def accuracy(self) -> float: ...

    @property
    def rejection_rate(self) -> float: ...

    @property
    def error_detection_rate(self) -> float: ...

    @property
    def error_correction_rate(self) -> float: ...
```
- Structured results: Dataclass with computed properties
- Comprehensive metrics:
  - Accuracy by noise level
  - Rejection rate (negative rejection task)
  - Error detection & correction rates (counterfactual task)
- Serialization: Easy to convert to dict/JSON with `to_dict()`
- Scalability: Can track multiple metrics simultaneously
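The property bodies are elided above; one plausible implementation, assuming rates are reported on a 0–100 scale and the remaining properties follow the same pattern:

```python
from dataclasses import dataclass, field, asdict
from typing import Dict

@dataclass
class EvaluationResult:
    task_type: str
    model_name: str
    total_samples: int = 0
    correct: int = 0
    incorrect: int = 0
    rejected: int = 0
    errors_detected: int = 0
    errors_corrected: int = 0
    accuracy_by_noise: Dict[int, float] = field(default_factory=dict)

    @property
    def accuracy(self) -> float:
        # correct / total × 100, guarded against empty runs
        return 100.0 * self.correct / self.total_samples if self.total_samples else 0.0

    @property
    def rejection_rate(self) -> float:
        return 100.0 * self.rejected / self.total_samples if self.total_samples else 0.0

    def to_dict(self) -> dict:
        # raw counts plus the computed properties, ready for JSON export
        return {**asdict(self), "accuracy": self.accuracy,
                "rejection_rate": self.rejection_rate}

result = EvaluationResult("negative_rejection", "model-x",
                          total_samples=200, correct=150, rejected=30)
print(result.accuracy)                      # 75.0
print(result.to_dict()["rejection_rate"])   # 15.0
```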
### 5. Data Processing

#### Original (`processdata()` function)

- Complex data handling: Different logic for different dataset types (`_int`, `_fact`, default)
- Shuffle and selection: Random sampling of positive/negative documents
- Noise injection: Dynamic calculation of positive/negative document ratios
- Config via function parameters: Noise rate, passage number, correct rate
#### Current (Pipeline approach)

- Data loading decoupled: `src/data_loader.py` handles file I/O
- Dataset-specific processors: Separate methods for each task type
- Cleaner configuration: Centralized in `src/config.py`
- Better error handling: Validation and type checking
- Flexible document selection: Can be adjusted without modifying core logic
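The noise-injection step described above can be sketched like this (the function name, parameters, and defaults are illustrative, not the actual `processdata()` API):

```python
import random

def mix_documents(positive, negative, passage_num=5, noise_rate=0.4, seed=0):
    """Mix positive and negative (noisy) passages at the requested ratio.

    noise_rate controls how many of the passage_num retrieved documents
    are negatives; the rest are positives. Shuffled so position is random.
    """
    rng = random.Random(seed)
    neg_num = int(passage_num * noise_rate)
    pos_num = passage_num - neg_num
    docs = (rng.sample(positive, min(pos_num, len(positive)))
            + rng.sample(negative, min(neg_num, len(negative))))
    rng.shuffle(docs)
    return docs
```

With `passage_num=5` and `noise_rate=0.4`, two of the five passages handed to the model are noise, which is the kind of per-noise-level setup the accuracy-by-noise metric measures.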
### 6. Model Integration

#### Original

- Multiple model classes: Direct imports from `models.models`
- String-based routing: Long if-elif chain to instantiate models
- Manual model setup: Requires knowing which model class to use for each model type
- Limited extensibility: Adding new models requires code changes
#### Current (`llm_client.py`)

- Abstraction layer: `LLMClient` interface
- Configuration-driven: Model selection via config
- Provider-based: Support for OpenAI, HuggingFace, custom endpoints
- Error handling: Retry logic, timeout handling
- Extensible: Easy to add new providers
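A minimal sketch of the provider abstraction (the `EchoClient` provider and `make_client` factory are illustrative stand-ins, not the actual `llm_client.py` API; the real module adds retries and timeout handling):

```python
from abc import ABC, abstractmethod

class LLMClient(ABC):
    """Common interface every provider implements."""
    @abstractmethod
    def generate(self, prompt: str) -> str: ...

class EchoClient(LLMClient):
    """Dummy provider used here purely for illustration/testing."""
    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"

def make_client(config: dict) -> LLMClient:
    """Configuration-driven selection: a registry lookup replaces
    the original script's long if-elif chain over model names."""
    providers = {"echo": EchoClient}
    return providers[config["provider"]]()
```

Adding a new provider then means registering one class, rather than editing routing logic.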
### 7. Error Handling & Robustness

#### Original

- Basic try-except: Catches all exceptions with a generic error message
- No validation: Assumes valid data and responses
- Fails silently: Continues despite errors

#### Current
- Comprehensive validation: Checks data types, ranges, formats
- Specific error messages: Detailed logging for debugging
- Graceful degradation: Continues processing with partial results
- Logging infrastructure: Track errors and warnings throughout pipeline
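The validate-log-degrade pattern described above can be illustrated like this (`evaluate_sample` and the sample schema are hypothetical, not the application's actual functions):

```python
import logging
from typing import Optional

logger = logging.getLogger("rgb_eval")

def evaluate_sample(sample: dict) -> Optional[dict]:
    """Validate one sample; log a specific message and skip it on failure
    instead of crashing the whole run (graceful degradation)."""
    try:
        if not isinstance(sample.get("answer"), (str, list)):
            raise ValueError(f"bad 'answer' field: {sample.get('answer')!r}")
        return {"id": sample["id"], "ok": True}
    except (KeyError, ValueError) as exc:
        logger.warning("skipping sample %s: %s", sample.get("id"), exc)
        return None  # caller aggregates partial results
```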
### 8. User Interface & Visualization

#### Original

- CLI only: Command-line arguments and console output
- File-based results: JSON files in result directories
- No visualization: User must parse JSON manually

#### Current

- Interactive Streamlit UI:
  - Real-time evaluation status
  - Interactive charts and visualizations
  - Metric comparisons across models and tasks
  - Results export functionality
- Visual metrics:
  - Accuracy vs. noise level line charts
  - Rejection rate comparisons
  - Error detection/correction metrics
  - Task-specific breakdowns
### 9. Specific Metric Differences

| Metric | Original | Current | Status |
|---|---|---|---|
| Accuracy | Sum of correct answers | `correct / total × 100` | Enhanced with per-noise tracking |
| Noise Robustness | All-or-nothing (`0 in labels` check) | Accuracy calculated per noise level | Improved granularity |
| Rejection Rate | Manual fact-checking | Dedicated rejection detection | More comprehensive |
| Error Detection | Fact label check | Keyword-based detection | More robust |
| Error Correction | Manual label validation | Verifies both error detection and the correct answer | Stricter validation |
### 10. Code Quality Improvements
| Aspect | Original | Current |
|---|---|---|
| Type hints | None | Full type hints with mypy compatibility |
| Documentation | Minimal comments | Comprehensive docstrings |
| Error messages | Generic | Specific and actionable |
| Testing | None visible | Testable class methods |
| Code reuse | Duplicated logic | DRY principle followed |
| Configuration | Hard-coded/CLI args | Centralized config |
| Logging | Print statements | Structured logging |
## Functional Equivalence

Despite the architectural differences, the core evaluation logic is functionally equivalent:

- Data loading: Original JSON parsing → Current `data_loader.py`
- Answer checking: Original substring check → Current multi-strategy `is_correct()`
- Rejection detection: Original keyword check → Current comprehensive `is_rejection()`
- Metrics calculation: Original manual tallying → Current class properties
- Results storage: Original JSON files → Current structured dataclass + JSON export
## Migration Path

If moving from the original to the current implementation:

- CLI → UI: Replace command-line args with Streamlit dropdowns/inputs
- File I/O: Use the current `data_loader` instead of direct file access
- Model setup: Use `llm_client` instead of model class selection
- Evaluation: Call `RGBEvaluator` methods instead of the `predict`/`processdata` functions
- Results: Use `EvaluationResult.to_dict()` for JSON export
## Conclusion
The refactored implementation maintains all core evaluation functionality while providing:
- Better code organization and maintainability
- Enhanced user experience with interactive UI
- More robust answer checking and rejection detection
- Comprehensive metric tracking
- Extensible architecture for future improvements