
Dataset Usage Verification Report

Requirement vs Implementation Verification

Requirement

Use the following datasets for experimentation:

  • Noise Robustness and Negative Rejection: en_refine.json dataset
  • Information Integration: en_int.json dataset
  • Counterfactual Robustness: en_fact.json dataset

✅ VERIFIED IMPLEMENTATION

1. Noise Robustness → en_refine.json

Location: src/data_loader.py

def load_noise_robustness(
    self, 
    max_samples: Optional[int] = None,
    noise_rate: float = 0.4
) -> List[RGBSample]:
    """
    Load data for Noise Robustness evaluation.
    Uses en_refine.json - tests LLM's ability to handle noisy documents.
    """
    filepath = self._get_file_path("en_refine.json")
    if not os.path.exists(filepath):
        raise FileNotFoundError(
            f"Dataset file not found: {filepath}\n"
            "Please run: python download_datasets.py"
        )
    # ... loads and processes en_refine.json

Data Source: en_refine.json

Dataset Structure:

{
  "id": int,
  "query": str,
  "answer": str,
  "positive": [str],     // Relevant documents containing answer
  "negative": [str]      // Irrelevant documents (noise)
}
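Each record is parsed into an RGBSample object. The report does not show the RGBSample definition itself; the sketch below is a plausible minimal version, and every field name beyond the JSON schema (documents, has_answer, has_counterfactual, counterfactual_answer, num_docs_needed) is an assumption inferred from the processing notes in this report.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RGBSample:
    """One evaluation sample. Fields beyond the JSON schema are assumptions."""
    id: int
    query: str
    answer: str
    documents: List[str] = field(default_factory=list)
    has_answer: bool = True                   # False for Negative Rejection
    has_counterfactual: bool = False          # True for Counterfactual Robustness
    counterfactual_answer: Optional[str] = None
    num_docs_needed: int = 1                  # used by Information Integration

sample = RGBSample(id=0, query="Who wrote Hamlet?", answer="Shakespeare")
```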

Processing:

  • Calculates neg_num = passage_num × noise_rate
  • Selects positive documents and negative documents proportionally
  • Shuffles them together
  • Tests model's ability to extract answer from noisy passages

Noise Levels Tested: 0%, 20%, 40%, 60%, 80% (configured in pipeline)
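The mixing steps above can be sketched as follows. The helper name, the flooring of neg_num, and the fixed seed are assumptions; the report only states neg_num = passage_num × noise_rate and that the documents are shuffled together.

```python
import math
import random

def mix_noise(positive, negative, passage_num=5, noise_rate=0.4, seed=0):
    """Mix relevant and noise documents in the stated proportion, then shuffle.
    Hypothetical helper; mirrors the processing steps described above."""
    neg_num = math.floor(passage_num * noise_rate)
    pos_num = passage_num - neg_num
    docs = positive[:pos_num] + negative[:neg_num]
    random.Random(seed).shuffle(docs)
    return docs

docs = mix_noise([f"pos{i}" for i in range(5)],
                 [f"neg{i}" for i in range(5)],
                 passage_num=5, noise_rate=0.4)
# 5 passages total: 3 positive + 2 negative, in shuffled order
```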


2. Negative Rejection → en_refine.json

Location: src/data_loader.py

def load_negative_rejection(
    self, 
    max_samples: Optional[int] = None
) -> List[RGBSample]:
    """
    Load data for Negative Rejection evaluation.
    Uses en_refine.json with noise_rate=1.0 (all negative documents).
    Tests LLM's ability to reject when documents don't contain the answer.
    """
    filepath = self._get_file_path("en_refine.json")
    if not os.path.exists(filepath):
        raise FileNotFoundError(...)
    # ... loads and processes en_refine.json

Data Source: en_refine.json (same file as Noise Robustness)

Processing:

  • Uses ONLY negative documents (noise_rate = 1.0)
  • Sets has_answer=False to indicate the documents do not contain the answer
  • Tests whether the model appropriately declines to answer when the documents provide insufficient information

Key Difference from Noise Robustness:

  • Noise Robustness: Mix of positive + negative documents
  • Negative Rejection: ONLY negative documents (100% noise)
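The Negative Rejection case reduces to selecting only noise documents. A minimal sketch, assuming a hypothetical helper name and a passage count of 5:

```python
def build_rejection_docs(record, passage_num=5):
    """With noise_rate=1.0 every passage comes from `negative`, so the sample
    is marked has_answer=False and the model should decline to answer."""
    docs = record["negative"][:passage_num]
    return docs, False  # (documents, has_answer)

docs, has_answer = build_rejection_docs(
    {"negative": ["n1", "n2", "n3", "n4", "n5", "n6"]}
)
# docs -> ["n1", "n2", "n3", "n4", "n5"], has_answer -> False
```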

3. Information Integration → en_int.json

Location: src/data_loader.py

def load_information_integration(
    self, 
    max_samples: Optional[int] = None
) -> List[RGBSample]:
    """
    Load data for Information Integration evaluation.
    Uses en_int.json - tests LLM's ability to integrate info from multiple docs.
    """
    filepath = self._get_file_path("en_int.json")
    if not os.path.exists(filepath):
        raise FileNotFoundError(...)
    # ... loads and processes en_int.json

Data Source: en_int.json

Dataset Structure:

{
  "id": int,
  "query": str,
  "answer": str,
  "answer1": str,           // First component of answer
  "answer2": str,           // Second component of answer
  "positive": [[str], ...], // Multiple groups of related documents
  "negative": [str]         // Irrelevant documents
}

Processing:

  • Extracts one document from each group in the positive field
  • Adds negative documents if needed to reach passage_num (5)
  • Sets num_docs_needed to track how many document groups are needed
  • Tests model's ability to synthesize answer from multiple sources

Test Scenario: The model must combine information from multiple documents to answer the query
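The grouped-document selection can be sketched as below. The helper name is an assumption, and taking the first document of each group is one plausible reading of "extracts one document from each group":

```python
def pick_integration_docs(record, passage_num=5):
    """Take one document from each positive group, then pad with negatives
    up to passage_num. num_needed tracks how many groups must be combined."""
    docs = [group[0] for group in record["positive"]]
    num_needed = len(docs)
    for neg in record["negative"]:
        if len(docs) >= passage_num:
            break
        docs.append(neg)
    return docs, num_needed

record = {"positive": [["a1", "a2"], ["b1"]],
          "negative": ["n1", "n2", "n3", "n4"]}
docs, num_needed = pick_integration_docs(record)
# docs -> ["a1", "b1", "n1", "n2", "n3"], num_needed -> 2
```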


4. Counterfactual Robustness → en_fact.json

Location: src/data_loader.py

def load_counterfactual_robustness(
    self, 
    max_samples: Optional[int] = None
) -> List[RGBSample]:
    """
    Load data for Counterfactual Robustness evaluation.
    Uses en_fact.json - tests LLM's ability to detect/correct factual errors.
    """
    filepath = self._get_file_path("en_fact.json")
    if not os.path.exists(filepath):
        raise FileNotFoundError(...)
    # ... loads and processes en_fact.json

Data Source: en_fact.json

Dataset Structure:

{
  "id": int,
  "query": str,
  "answer": str,           // Correct answer
  "fakeanswer": str,       // Incorrect/counterfactual answer
  "positive": [str],       // Documents with correct answer
  "positive_wrong": [str], // Documents with fake/incorrect answer
  "negative": [str]        // Irrelevant documents
}

Processing:

  • Uses three positive_wrong documents plus two negative documents
  • Sets has_counterfactual=True
  • Stores counterfactual_answer from "fakeanswer" field
  • Tests model's ability to detect factual inconsistencies and correct them

Test Scenario:

  • Documents contain incorrect information
  • Model should detect error and provide correct answer
  • Evaluates both error detection and error correction
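Assembling a counterfactual sample can be sketched as below; the helper name and the dict return shape are assumptions, while the 3 + 2 document split and the fields set come from the processing notes above.

```python
def build_counterfactual_docs(record, wrong_num=3, neg_num=2):
    """Three positive_wrong passages plus two negative passages; the model
    must notice that the retrieved passages contradict the true answer."""
    docs = record["positive_wrong"][:wrong_num] + record["negative"][:neg_num]
    return {
        "documents": docs,
        "has_counterfactual": True,
        "counterfactual_answer": record["fakeanswer"],
    }

cf_sample = build_counterfactual_docs({
    "positive_wrong": ["w1", "w2", "w3", "w4"],
    "negative": ["n1", "n2", "n3"],
    "fakeanswer": "1998",
})
# cf_sample["documents"] -> ["w1", "w2", "w3", "n1", "n2"]
```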

Summary Table

| Task | Dataset File | Data Fields Used | Test Scenario |
|------|--------------|------------------|---------------|
| Noise Robustness | en_refine.json (d:/CapStoneProject/RGB/data/en_refine.json) | query, answer, positive, negative | Mix of correct + noise documents |
| Negative Rejection | en_refine.json (d:/CapStoneProject/RGB/data/en_refine.json) | query, answer, negative | Only noise documents (no answer) |
| Information Integration | en_int.json (d:/CapStoneProject/RGB/data/en_int.json) | query, answer, answer1, answer2, positive (grouped), negative | Multiple documents needed for synthesis |
| Counterfactual Robustness | en_fact.json (d:/CapStoneProject/RGB/data/en_fact.json) | query, answer, fakeanswer, positive, positive_wrong, negative | Incorrect information in documents |

File Locations Verified

✅ Primary Dataset Directory: d:\CapStoneProject\RGB\data\

  • en_refine.json ✓ Present
  • en_int.json ✓ Present
  • en_fact.json ✓ Present

✅ Mirror Dataset Directory: d:\CapStoneProject\RGB\RGBMetrics\data\

  • en_refine.json ✓ Present
  • en_int.json ✓ Present
  • en_fact.json ✓ Present

Data Loader Integration

Pipeline Integration

Location: src/pipeline.py

def evaluate_noise_robustness(self, model: str, ...):
    samples = self.data_loader.load_noise_robustness(max_samples, noise_rate=noise_ratio)
    # ... evaluate

def evaluate_negative_rejection(self, model: str, ...):
    samples = self.data_loader.load_negative_rejection(max_samples)
    # ... evaluate

def evaluate_information_integration(self, model: str, ...):
    samples = self.data_loader.load_information_integration(max_samples)
    # ... evaluate

def evaluate_counterfactual_robustness(self, model: str, ...):
    samples = self.data_loader.load_counterfactual_robustness(max_samples)
    # ... evaluate
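Taken together, a driver might sweep the noise levels and run all four tasks in one pass. This is a hypothetical sketch: only the four evaluate_* method names appear in the report, and the keyword arguments (max_samples, noise_ratio) are inferred from the loader calls above.

```python
def run_all_tasks(pipeline, model, max_samples=10,
                  noise_rates=(0.0, 0.2, 0.4, 0.6, 0.8)):
    """Run every RGB task once per model. Hypothetical driver; the pipeline
    object is assumed to expose the four evaluate_* methods shown above."""
    return {
        "noise_robustness": {
            rate: pipeline.evaluate_noise_robustness(
                model, max_samples=max_samples, noise_ratio=rate)
            for rate in noise_rates
        },
        "negative_rejection": pipeline.evaluate_negative_rejection(
            model, max_samples=max_samples),
        "information_integration": pipeline.evaluate_information_integration(
            model, max_samples=max_samples),
        "counterfactual_robustness": pipeline.evaluate_counterfactual_robustness(
            model, max_samples=max_samples),
    }
```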

Streamlit App Integration

Location: app.py

The Streamlit UI also uses the same data loader through the pipeline:

  • Shows available datasets
  • Allows task selection
  • Loads samples based on selection
  • Evaluates using appropriate dataset

Verification Result

✅ CONFIRMED

All requirements are met:

  1. ✅ Noise Robustness uses en_refine.json
  2. ✅ Negative Rejection uses en_refine.json
  3. ✅ Information Integration uses en_int.json
  4. ✅ Counterfactual Robustness uses en_fact.json

Implementation is correct and complete:

  • All three datasets are present in the data directory
  • Data loader has dedicated methods for each task
  • Pipeline properly orchestrates data loading and evaluation
  • Datasets are properly structured with required fields
  • Data processing respects the structure and purpose of each dataset

Related Files