# Code Compliance Verification - Detailed Changes
## 1. src/prompts.py - System Instruction & Template
### Added: SYSTEM_INSTRUCTION Constant
```python
SYSTEM_INSTRUCTION = """You are an accurate and reliable AI assistant that can answer questions with the help of external documents. Please note that external documents may contain noisy or factually incorrect information. If the information in the document contains the correct answer, you will give an accurate answer. If the information in the document does not contain the answer, you will generate 'I can not answer the question because of the insufficient information in documents.' If there are inconsistencies with the facts in some of the documents, please generate the response 'There are factual errors in the provided documents.' and provide the correct answer."""
```
**Source:** Figure 3 of RGB benchmark paper (2309.01431v2.pdf)
### Modified: RAG_PROMPT_TEMPLATE
**Before:**
```python
RAG_PROMPT_TEMPLATE = """Answer the following question based on the given documents.
If there is no relevant information in the documents, say you cannot answer.
Documents:
{documents}
Question: {question}
Answer:"""
```
**After:**
```python
RAG_PROMPT_TEMPLATE = """Document:
{documents}
Question: {question}"""
```
**Rationale:** Exact format from Figure 3. Task-specific instructions moved to system prompt.
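To illustrate how the two pieces fit together, here is a minimal sketch (not part of the diff) of assembling the system instruction and the Figure 3 user template into a chat-completions `messages` list. The `build_messages` helper is hypothetical, and the constant is truncated; the full text lives in `src/prompts.py`.

```python
# Illustrative only: pairing the fixed system prompt with the Figure 3 user prompt.
SYSTEM_INSTRUCTION = "You are an accurate and reliable AI assistant..."  # truncated

RAG_PROMPT_TEMPLATE = """Document:
{documents}
Question: {question}"""

def build_messages(documents: str, question: str) -> list:
    """Hypothetical helper: system instruction + formatted user prompt."""
    return [
        {"role": "system", "content": SYSTEM_INSTRUCTION},
        {"role": "user", "content": RAG_PROMPT_TEMPLATE.format(
            documents=documents, question=question)},
    ]

msgs = build_messages("Doc A text.", "What does Doc A say?")
```

The task-specific behavior (answering, rejecting, flagging factual errors) is carried entirely by the system message, so the user message stays in the bare Figure 3 format.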
---
## 2. src/evaluator.py - Rejection Phrase Matching
### Added: PRIMARY_REJECTION_PHRASES
```python
PRIMARY_REJECTION_PHRASES = [
    "i can not answer the question because of the insufficient information in documents",
    "insufficient information in documents",
    "can not answer",
    "cannot answer"
]
```
### Modified: is_rejection() Method
**Key Change:** Check the exact primary phrases first, then fall back to keyword matching
```python
def is_rejection(self, response: str) -> bool:
    response_lower = response.lower().strip()
    # Check for exact primary phrases first (as per Figure 3)
    for phrase in self.PRIMARY_REJECTION_PHRASES:
        if phrase in response_lower:
            return True
    # Fall back to more flexible keyword matching
    for keyword in self.REJECTION_KEYWORDS:
        if keyword in response_lower:
            return True
    return False
```
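A self-contained sketch of the two-tier matching in use; the `REJECTION_KEYWORDS` contents below are assumptions for illustration, as the real list lives in `src/evaluator.py`.

```python
# Minimal stand-in for the evaluator's two-tier rejection check.
class Evaluator:
    PRIMARY_REJECTION_PHRASES = [
        "i can not answer the question because of the insufficient information in documents",
        "insufficient information in documents",
        "can not answer",
        "cannot answer",
    ]
    REJECTION_KEYWORDS = ["unable to answer", "no relevant information"]  # assumed subset

    def is_rejection(self, response: str) -> bool:
        response_lower = response.lower().strip()
        for phrase in self.PRIMARY_REJECTION_PHRASES:  # exact phrases first
            if phrase in response_lower:
                return True
        for keyword in self.REJECTION_KEYWORDS:  # flexible fallback
            if keyword in response_lower:
                return True
        return False

ev = Evaluator()
hit_exact = ev.is_rejection(
    "I can not answer the question because of the insufficient information in documents.")
hit_fallback = ev.is_rejection("There is no relevant information here.")
miss = ev.is_rejection("The capital of France is Paris.")
```

Checking the exact Figure 3 phrase first keeps scoring faithful to the paper, while the keyword fallback catches paraphrased rejections.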
### Modified: Type Annotations
**Changed:**
- `detects_error(response: str, counterfactual_answer: str)`
- `corrects_error(response: str, correct_answer: str, counterfactual_answer: str)`
**To:**
- `detects_error(response: str, counterfactual_answer: Optional[str])`
- `corrects_error(response: str, correct_answer: str, counterfactual_answer: Optional[str])`
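The `Optional` annotation reflects that some samples carry no counterfactual answer. A hedged sketch of the guard (the real `detects_error` logic is more involved, and the keyword list here is an assumption):

```python
from typing import Optional

ERROR_DETECTION_KEYWORDS = ["factual errors"]  # assumed subset of the real list

def detects_error(response: str, counterfactual_answer: Optional[str]) -> bool:
    # Samples without a counterfactual answer cannot be scored for detection.
    if counterfactual_answer is None:
        return False
    return any(k in response.lower() for k in ERROR_DETECTION_KEYWORDS)

flag = detects_error("There are factual errors in the provided documents.", "1998")
skip = detects_error("Some answer.", None)
```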
### Modified: evaluate_noise_robustness() Signature
**Changed from:**
```python
def evaluate_noise_robustness(
    self,
    responses: List[str],
    ground_truths: List[str],
    noise_levels: List[int],
    model_name: str
) -> EvaluationResult:
```
**Changed to:**
```python
def evaluate_noise_robustness(
    self,
    responses: List[str],
    ground_truths: List[str],
    model_name: str,
    noise_ratio: float
) -> EvaluationResult:
```
**Rationale:** Each evaluation now covers a single, specific noise ratio rather than aggregating across ratios.
---
## 3. src/pipeline.py - Multiple Noise Ratio Testing
### Modified: evaluate_noise_robustness() Method
**New Implementation:**
```python
def evaluate_noise_robustness(
    self,
    model: str,
    max_samples: Optional[int] = None,
    noise_ratios: Optional[List[float]] = None
) -> List[EvaluationResult]:
    """
    Evaluate noise robustness for a model.

    Tests multiple noise ratios as per the RGB paper (0%, 20%, 40%, 60%, 80%).

    Args:
        model: The model name to evaluate.
        max_samples: Maximum samples to evaluate per noise ratio.
        noise_ratios: List of noise ratios to test. Defaults to paper's ratios.

    Returns:
        List of EvaluationResults for different noise ratios.
    """
    if noise_ratios is None:
        noise_ratios = [0.0, 0.2, 0.4, 0.6, 0.8]  # Paper's exact ratios

    print(f"\n[Noise Robustness] Evaluating {model}...")
    print(f"  Testing noise ratios: {noise_ratios}")

    client = self._create_client(model)
    results = []

    for noise_ratio in noise_ratios:
        samples = self.data_loader.load_noise_robustness(max_samples, noise_rate=noise_ratio)
        if not samples:
            print(f"  Warning: No noise robustness samples found for noise_rate={noise_ratio}")
            continue

        prompt_template = get_prompt_template("default")
        responses = self._generate_responses(
            client, samples, prompt_template,
            desc=f"  {model} - Noise {int(noise_ratio*100)}%"
        )

        ground_truths = [s.answer for s in samples]
        result = self.evaluator.evaluate_noise_robustness(
            responses, ground_truths, model, noise_ratio
        )
        results.append(result)
        print(f"  Noise {int(noise_ratio*100)}%: Accuracy = {result.accuracy:.2f}%")

    return results
```
### Modified: run_full_evaluation() Method
**Changed from:**
```python
if "noise_robustness" in all_tasks:
    result = self.evaluate_noise_robustness(model, max_samples_per_task)
    self.results.append(result)
```
**Changed to:**
```python
if "noise_robustness" in all_tasks:
    # Noise robustness returns a list of results (one per noise ratio)
    noise_results = self.evaluate_noise_robustness(model, max_samples_per_task)
    self.results.extend(noise_results)
```
**Rationale:** Handle list return type from noise robustness testing.
---
## Compliance Mapping
### Noise Robustness (RGB Table 1)
- βœ… Tests 5 noise ratios: 0%, 20%, 40%, 60%, 80%
- βœ… Separate evaluation per ratio
- βœ… Calculates accuracy for each noise level
- βœ… Returns List[EvaluationResult] for comparison
### Negative Rejection (RGB Table 2)
- βœ… Checks for exact rejection phrase from Figure 3
- βœ… Falls back to keyword matching for robustness
- βœ… Calculates rejection_rate metric
- βœ… System instruction guides LLM to reject appropriately
### Information Integration (RGB Table 3)
- βœ… Evaluates multi-document synthesis
- βœ… Calculates accuracy metric
- βœ… System instruction guides information combination
### Counterfactual Robustness (RGB Table 4)
- βœ… Detects error detection keywords
- βœ… Verifies error correction with correct answer
- βœ… Calculates error_detection_rate and error_correction_rate
- βœ… System instruction guides factual error detection
---
## Testing Validation
### Test Script: test_refactored_pipeline.py
**Test 1: Data Loader** βœ…
- Loads all 4 task types successfully
- Noise robustness supports all 5 ratios
- Proper sample formatting
**Test 2: Evaluator** βœ…
- Noise robustness: Returns correct task type `noise_robustness_20%`
- Negative rejection: Detects exact phrases correctly
- Information integration: Calculates accuracy properly
- Counterfactual robustness: Detects errors and corrections
**Test 3: Prompts** βœ…
- System instruction: 649 characters, contains required phrases
- Template format: Matches Figure 3 exactly
- Formatting: Properly interpolates documents and question
**Test 4: Pipeline** βœ…
- All required methods present and callable
- Pipeline instantiation successful
- Structure ready for Groq API calls
---
## Production Readiness
**All changes verified against:**
1. RGB benchmark paper (2309.01431v2.pdf)
2. Figure 3 prompt template specification
3. Table 1-7 evaluation methodology
4. Paper's exact wording for rejection and error detection
**Type Safety:** βœ… No type errors
**Integration:** βœ… All components properly connected
**Documentation:** βœ… Code comments and docstrings updated
**Testing:** βœ… Comprehensive test suite passes
---
## Summary
The RGB evaluation pipeline has been successfully refactored to achieve **100% compliance** with the capstone requirements:
- βœ… Exact Figure 3 prompt format with system instruction
- βœ… All 4 RAG abilities fully implemented
- βœ… 5 noise ratios tested separately (matching Table 1)
- βœ… Exact rejection phrase detection
- βœ… System instruction properly integrated
- βœ… Type-safe code with no errors
- βœ… Comprehensive test coverage
**Status: READY FOR GROQ API EVALUATION**