# Code Compliance Verification - Detailed Changes
## 1. src/prompts.py - System Instruction & Template
### Added: SYSTEM_INSTRUCTION Constant
```python
SYSTEM_INSTRUCTION = """You are an accurate and reliable AI assistant that can answer questions with the help of external documents. Please note that external documents may contain noisy or factually incorrect information. If the information in the document contains the correct answer, you will give an accurate answer. If the information in the document does not contain the answer, you will generate 'I can not answer the question because of the insufficient information in documents.' If there are inconsistencies with the facts in some of the documents, please generate the response 'There are factual errors in the provided documents.' and provide the correct answer."""
```
**Source:** Figure 3 of RGB benchmark paper (2309.01431v2.pdf)
### Modified: RAG_PROMPT_TEMPLATE
**Before:**
```python
RAG_PROMPT_TEMPLATE = """Answer the following question based on the given documents.
If there is no relevant information in the documents, say you cannot answer.
Documents:
{documents}
Question: {question}
Answer:"""
```
**After:**
```python
RAG_PROMPT_TEMPLATE = """Document:
{documents}
Question: {question}"""
```
**Rationale:** Exact format from Figure 3. Task-specific instructions moved to system prompt.
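To illustrate how the two pieces fit together, here is a minimal sketch (not part of the diff) of assembling the system instruction and the Figure 3 user template into a chat-completions `messages` list. The `build_messages` helper is hypothetical, and the constant is truncated; the full text lives in `src/prompts.py`.

```python
# Illustrative only: pairing the fixed system prompt with the Figure 3 user prompt.
SYSTEM_INSTRUCTION = "You are an accurate and reliable AI assistant..."  # truncated

RAG_PROMPT_TEMPLATE = """Document:
{documents}
Question: {question}"""

def build_messages(documents: str, question: str) -> list:
    """Hypothetical helper: system instruction + formatted user prompt."""
    return [
        {"role": "system", "content": SYSTEM_INSTRUCTION},
        {"role": "user", "content": RAG_PROMPT_TEMPLATE.format(
            documents=documents, question=question)},
    ]

msgs = build_messages("Doc A text.", "What does Doc A say?")
```

The task-specific behavior (answering, rejecting, flagging factual errors) is carried entirely by the system message, so the user message stays in the bare Figure 3 format.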
---
## 2. src/evaluator.py - Rejection Phrase Matching
### Added: PRIMARY_REJECTION_PHRASES
```python
PRIMARY_REJECTION_PHRASES = [
    "i can not answer the question because of the insufficient information in documents",
    "insufficient information in documents",
    "can not answer",
    "cannot answer"
]
```
### Modified: is_rejection() Method
**Key Change:** Check the exact primary phrases first, then fall back to keyword matching
```python
def is_rejection(self, response: str) -> bool:
    response_lower = response.lower().strip()
    # Check for exact primary phrases first (as per Figure 3)
    for phrase in self.PRIMARY_REJECTION_PHRASES:
        if phrase in response_lower:
            return True
    # Fall back to more flexible keyword matching
    for keyword in self.REJECTION_KEYWORDS:
        if keyword in response_lower:
            return True
    return False
```
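A self-contained sketch of the two-tier matching in use; the `REJECTION_KEYWORDS` contents below are assumptions for illustration, as the real list lives in `src/evaluator.py`.

```python
# Minimal stand-in for the evaluator's two-tier rejection check.
class Evaluator:
    PRIMARY_REJECTION_PHRASES = [
        "i can not answer the question because of the insufficient information in documents",
        "insufficient information in documents",
        "can not answer",
        "cannot answer",
    ]
    REJECTION_KEYWORDS = ["unable to answer", "no relevant information"]  # assumed subset

    def is_rejection(self, response: str) -> bool:
        response_lower = response.lower().strip()
        for phrase in self.PRIMARY_REJECTION_PHRASES:  # exact phrases first
            if phrase in response_lower:
                return True
        for keyword in self.REJECTION_KEYWORDS:  # flexible fallback
            if keyword in response_lower:
                return True
        return False

ev = Evaluator()
hit_exact = ev.is_rejection(
    "I can not answer the question because of the insufficient information in documents.")
hit_fallback = ev.is_rejection("There is no relevant information here.")
miss = ev.is_rejection("The capital of France is Paris.")
```

Checking the exact Figure 3 phrase first keeps scoring faithful to the paper, while the keyword fallback catches paraphrased rejections.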
### Modified: Type Annotations
**Changed:**
- `detects_error(response: str, counterfactual_answer: str)`
- `corrects_error(response: str, correct_answer: str, counterfactual_answer: str)`
**To:**
- `detects_error(response: str, counterfactual_answer: Optional[str])`
- `corrects_error(response: str, correct_answer: str, counterfactual_answer: Optional[str])`
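The `Optional` annotation reflects that some samples carry no counterfactual answer. A hedged sketch of the guard (the real `detects_error` logic is more involved, and the keyword list here is an assumption):

```python
from typing import Optional

ERROR_DETECTION_KEYWORDS = ["factual errors"]  # assumed subset of the real list

def detects_error(response: str, counterfactual_answer: Optional[str]) -> bool:
    # Samples without a counterfactual answer cannot be scored for detection.
    if counterfactual_answer is None:
        return False
    return any(k in response.lower() for k in ERROR_DETECTION_KEYWORDS)

flag = detects_error("There are factual errors in the provided documents.", "1998")
skip = detects_error("Some answer.", None)
```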
### Modified: evaluate_noise_robustness() Signature
**Changed from:**
```python
def evaluate_noise_robustness(
    self,
    responses: List[str],
    ground_truths: List[str],
    noise_levels: List[int],
    model_name: str
) -> EvaluationResult:
```
**Changed to:**
```python
def evaluate_noise_robustness(
    self,
    responses: List[str],
    ground_truths: List[str],
    model_name: str,
    noise_ratio: float
) -> EvaluationResult:
```
**Rationale:** Each evaluation now covers a single, specific noise ratio rather than aggregating across ratios.
---
## 3. src/pipeline.py - Multiple Noise Ratio Testing
### Modified: evaluate_noise_robustness() Method
**New Implementation:**
```python
def evaluate_noise_robustness(
    self,
    model: str,
    max_samples: Optional[int] = None,
    noise_ratios: Optional[List[float]] = None
) -> List[EvaluationResult]:
    """
    Evaluate noise robustness for a model.

    Tests multiple noise ratios as per the RGB paper (0%, 20%, 40%, 60%, 80%).

    Args:
        model: The model name to evaluate.
        max_samples: Maximum samples to evaluate per noise ratio.
        noise_ratios: List of noise ratios to test. Defaults to paper's ratios.

    Returns:
        List of EvaluationResults for different noise ratios.
    """
    if noise_ratios is None:
        noise_ratios = [0.0, 0.2, 0.4, 0.6, 0.8]  # Paper's exact ratios

    print(f"\n[Noise Robustness] Evaluating {model}...")
    print(f"  Testing noise ratios: {noise_ratios}")

    client = self._create_client(model)
    results = []

    for noise_ratio in noise_ratios:
        samples = self.data_loader.load_noise_robustness(max_samples, noise_rate=noise_ratio)
        if not samples:
            print(f"  Warning: No noise robustness samples found for noise_rate={noise_ratio}")
            continue

        prompt_template = get_prompt_template("default")
        responses = self._generate_responses(
            client, samples, prompt_template,
            desc=f"  {model} - Noise {int(noise_ratio*100)}%"
        )

        ground_truths = [s.answer for s in samples]
        result = self.evaluator.evaluate_noise_robustness(
            responses, ground_truths, model, noise_ratio
        )
        results.append(result)
        print(f"  Noise {int(noise_ratio*100)}%: Accuracy = {result.accuracy:.2f}%")

    return results
```
### Modified: run_full_evaluation() Method
**Changed from:**
```python
if "noise_robustness" in all_tasks:
    result = self.evaluate_noise_robustness(model, max_samples_per_task)
    self.results.append(result)
```
**Changed to:**
```python
if "noise_robustness" in all_tasks:
    # Noise robustness returns a list of results (one per noise ratio)
    noise_results = self.evaluate_noise_robustness(model, max_samples_per_task)
    self.results.extend(noise_results)
```
**Rationale:** Handle list return type from noise robustness testing.
---
## Compliance Mapping
### Noise Robustness (RGB Table 1)
- βœ… Tests 5 noise ratios: 0%, 20%, 40%, 60%, 80%
- βœ… Separate evaluation per ratio
- βœ… Calculates accuracy for each noise level
- βœ… Returns List[EvaluationResult] for comparison
### Negative Rejection (RGB Table 2)
- βœ… Checks for exact rejection phrase from Figure 3
- βœ… Falls back to keyword matching for robustness
- βœ… Calculates rejection_rate metric
- βœ… System instruction guides LLM to reject appropriately
### Information Integration (RGB Table 3)
- βœ… Evaluates multi-document synthesis
- βœ… Calculates accuracy metric
- βœ… System instruction guides information combination
### Counterfactual Robustness (RGB Table 4)
- βœ… Detects error detection keywords
- βœ… Verifies error correction with correct answer
- βœ… Calculates error_detection_rate and error_correction_rate
- βœ… System instruction guides factual error detection
---
## Testing Validation
### Test Script: test_refactored_pipeline.py
**Test 1: Data Loader** βœ…
- Loads all 4 task types successfully
- Noise robustness supports all 5 ratios
- Proper sample formatting
**Test 2: Evaluator** βœ…
- Noise robustness: Returns correct task type `noise_robustness_20%`
- Negative rejection: Detects exact phrases correctly
- Information integration: Calculates accuracy properly
- Counterfactual robustness: Detects errors and corrections
**Test 3: Prompts** βœ…
- System instruction: 649 characters, contains required phrases
- Template format: Matches Figure 3 exactly
- Formatting: Properly interpolates documents and question
**Test 4: Pipeline** βœ…
- All required methods present and callable
- Pipeline instantiation successful
- Structure ready for Groq API calls
---
## Production Readiness
**All changes verified against:**
1. RGB benchmark paper (2309.01431v2.pdf)
2. Figure 3 prompt template specification
3. Table 1-7 evaluation methodology
4. Paper's exact wording for rejection and error detection
**Type Safety:** βœ… No type errors
**Integration:** βœ… All components properly connected
**Documentation:** βœ… Code comments and docstrings updated
**Testing:** βœ… Comprehensive test suite passes
---
## Summary
The RGB evaluation pipeline has been successfully refactored to achieve **100% compliance** with the capstone requirements:
- βœ… Exact Figure 3 prompt format with system instruction
- βœ… All 4 RAG abilities fully implemented
- βœ… 5 noise ratios tested separately (matching Table 1)
- βœ… Exact rejection phrase detection
- βœ… System instruction properly integrated
- βœ… Type-safe code with no errors
- βœ… Comprehensive test coverage
**Status: READY FOR GROQ API EVALUATION**