
Dataset Usage Verification Report

Requirement vs Implementation Verification

Requirement

Use the following datasets for experimentation:

  • Noise Robustness and Negative Rejection: en_refine.json dataset
  • Information Integration: en_int.json dataset
  • Counterfactual Robustness: en_fact.json dataset

✅ VERIFIED IMPLEMENTATION

1. Noise Robustness → en_refine.json

Location: src/data_loader.py

def load_noise_robustness(
    self, 
    max_samples: Optional[int] = None,
    noise_rate: float = 0.4
) -> List[RGBSample]:
    """
    Load data for Noise Robustness evaluation.
    Uses en_refine.json - tests LLM's ability to handle noisy documents.
    """
    filepath = self._get_file_path("en_refine.json")
    if not os.path.exists(filepath):
        raise FileNotFoundError(
            f"Dataset file not found: {filepath}\n"
            "Please run: python download_datasets.py"
        )
    # ... loads and processes en_refine.json

Data Source: en_refine.json

Dataset Structure:

{
  "id": int,
  "query": str,
  "answer": str,
  "positive": [str],     // Relevant documents containing answer
  "negative": [str]      // Irrelevant documents (noise)
}
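Each record is parsed into an RGBSample object. The report does not show the RGBSample definition itself; the sketch below is a plausible minimal version, and every field name beyond the JSON schema (documents, has_answer, has_counterfactual, counterfactual_answer, num_docs_needed) is an assumption inferred from the processing notes in this report.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RGBSample:
    """One evaluation sample. Fields beyond the JSON schema are assumptions."""
    id: int
    query: str
    answer: str
    documents: List[str] = field(default_factory=list)
    has_answer: bool = True                   # False for Negative Rejection
    has_counterfactual: bool = False          # True for Counterfactual Robustness
    counterfactual_answer: Optional[str] = None
    num_docs_needed: int = 1                  # used by Information Integration

sample = RGBSample(id=0, query="Who wrote Hamlet?", answer="Shakespeare")
```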

Processing:

  • Calculates neg_num = passage_num × noise_rate
  • Selects positive documents and negative documents proportionally
  • Shuffles them together
  • Tests model's ability to extract answer from noisy passages

Noise Levels Tested: 0%, 20%, 40%, 60%, 80% (configured in pipeline)
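The mixing steps above can be sketched as follows. The helper name, the flooring of neg_num, and the fixed seed are assumptions; the report only states neg_num = passage_num × noise_rate and that the documents are shuffled together.

```python
import math
import random

def mix_noise(positive, negative, passage_num=5, noise_rate=0.4, seed=0):
    """Mix relevant and noise documents in the stated proportion, then shuffle.
    Hypothetical helper; mirrors the processing steps described above."""
    neg_num = math.floor(passage_num * noise_rate)
    pos_num = passage_num - neg_num
    docs = positive[:pos_num] + negative[:neg_num]
    random.Random(seed).shuffle(docs)
    return docs

docs = mix_noise([f"pos{i}" for i in range(5)],
                 [f"neg{i}" for i in range(5)],
                 passage_num=5, noise_rate=0.4)
# 5 passages total: 3 positive + 2 negative, in shuffled order
```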


2. Negative Rejection → en_refine.json

Location: src/data_loader.py

def load_negative_rejection(
    self, 
    max_samples: Optional[int] = None
) -> List[RGBSample]:
    """
    Load data for Negative Rejection evaluation.
    Uses en_refine.json with noise_rate=1.0 (all negative documents).
    Tests LLM's ability to reject when documents don't contain the answer.
    """
    filepath = self._get_file_path("en_refine.json")
    if not os.path.exists(filepath):
        raise FileNotFoundError(...)
    # ... loads and processes en_refine.json

Data Source: en_refine.json (same file as Noise Robustness)

Processing:

  • Uses ONLY negative documents (noise_rate = 1.0)
  • Sets has_answer=False to indicate the documents do not contain the answer
  • Tests whether the model appropriately declines to answer when the documents provide insufficient information

Key Difference from Noise Robustness:

  • Noise Robustness: Mix of positive + negative documents
  • Negative Rejection: ONLY negative documents (100% noise)
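The Negative Rejection case reduces to selecting only noise documents. A minimal sketch, assuming a hypothetical helper name and a passage count of 5:

```python
def build_rejection_docs(record, passage_num=5):
    """With noise_rate=1.0 every passage comes from `negative`, so the sample
    is marked has_answer=False and the model should decline to answer."""
    docs = record["negative"][:passage_num]
    return docs, False  # (documents, has_answer)

docs, has_answer = build_rejection_docs(
    {"negative": ["n1", "n2", "n3", "n4", "n5", "n6"]}
)
# docs -> ["n1", "n2", "n3", "n4", "n5"], has_answer -> False
```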

3. Information Integration → en_int.json

Location: src/data_loader.py

def load_information_integration(
    self, 
    max_samples: Optional[int] = None
) -> List[RGBSample]:
    """
    Load data for Information Integration evaluation.
    Uses en_int.json - tests LLM's ability to integrate info from multiple docs.
    """
    filepath = self._get_file_path("en_int.json")
    if not os.path.exists(filepath):
        raise FileNotFoundError(...)
    # ... loads and processes en_int.json

Data Source: en_int.json

Dataset Structure:

{
  "id": int,
  "query": str,
  "answer": str,
  "answer1": str,           // First component of answer
  "answer2": str,           // Second component of answer
  "positive": [[str], ...], // Multiple groups of related documents
  "negative": [str]         // Irrelevant documents
}

Processing:

  • Extracts one document from each group in the positive field
  • Adds negative documents if needed to reach passage_num (5)
  • Sets num_docs_needed to track how many document groups are needed
  • Tests model's ability to synthesize answer from multiple sources

Test Scenario: The model must combine information from multiple documents to answer the query
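The grouped-document selection can be sketched as below. The helper name is an assumption, and taking the first document of each group is one plausible reading of "extracts one document from each group":

```python
def pick_integration_docs(record, passage_num=5):
    """Take one document from each positive group, then pad with negatives
    up to passage_num. num_needed tracks how many groups must be combined."""
    docs = [group[0] for group in record["positive"]]
    num_needed = len(docs)
    for neg in record["negative"]:
        if len(docs) >= passage_num:
            break
        docs.append(neg)
    return docs, num_needed

record = {"positive": [["a1", "a2"], ["b1"]],
          "negative": ["n1", "n2", "n3", "n4"]}
docs, num_needed = pick_integration_docs(record)
# docs -> ["a1", "b1", "n1", "n2", "n3"], num_needed -> 2
```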


4. Counterfactual Robustness → en_fact.json

Location: src/data_loader.py

def load_counterfactual_robustness(
    self, 
    max_samples: Optional[int] = None
) -> List[RGBSample]:
    """
    Load data for Counterfactual Robustness evaluation.
    Uses en_fact.json - tests LLM's ability to detect/correct factual errors.
    """
    filepath = self._get_file_path("en_fact.json")
    if not os.path.exists(filepath):
        raise FileNotFoundError(...)
    # ... loads and processes en_fact.json

Data Source: en_fact.json

Dataset Structure:

{
  "id": int,
  "query": str,
  "answer": str,           // Correct answer
  "fakeanswer": str,       // Incorrect/counterfactual answer
  "positive": [str],       // Documents with correct answer
  "positive_wrong": [str], // Documents with fake/incorrect answer
  "negative": [str]        // Irrelevant documents
}

Processing:

  • Uses three positive_wrong documents plus two negative documents
  • Sets has_counterfactual=True
  • Stores counterfactual_answer from "fakeanswer" field
  • Tests model's ability to detect factual inconsistencies and correct them

Test Scenario:

  • Documents contain incorrect information
  • Model should detect error and provide correct answer
  • Evaluates both error detection and error correction
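Assembling a counterfactual sample can be sketched as below; the helper name and the dict return shape are assumptions, while the 3 + 2 document split and the fields set come from the processing notes above.

```python
def build_counterfactual_docs(record, wrong_num=3, neg_num=2):
    """Three positive_wrong passages plus two negative passages; the model
    must notice that the retrieved passages contradict the true answer."""
    docs = record["positive_wrong"][:wrong_num] + record["negative"][:neg_num]
    return {
        "documents": docs,
        "has_counterfactual": True,
        "counterfactual_answer": record["fakeanswer"],
    }

cf_sample = build_counterfactual_docs({
    "positive_wrong": ["w1", "w2", "w3", "w4"],
    "negative": ["n1", "n2", "n3"],
    "fakeanswer": "1998",
})
# cf_sample["documents"] -> ["w1", "w2", "w3", "n1", "n2"]
```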

Summary Table

| Task | Dataset File | Data Fields Used | Test Scenario |
|------|--------------|------------------|---------------|
| Noise Robustness | en_refine.json (d:/CapStoneProject/RGB/data/en_refine.json) | query, answer, positive, negative | Mix of correct + noise documents |
| Negative Rejection | en_refine.json (d:/CapStoneProject/RGB/data/en_refine.json) | query, answer, negative | Only noise documents (no answer) |
| Information Integration | en_int.json (d:/CapStoneProject/RGB/data/en_int.json) | query, answer, answer1, answer2, positive (grouped), negative | Multiple documents needed for synthesis |
| Counterfactual Robustness | en_fact.json (d:/CapStoneProject/RGB/data/en_fact.json) | query, answer, fakeanswer, positive, positive_wrong, negative | Incorrect information in documents |

File Locations Verified

✅ Primary Dataset Directory: d:\CapStoneProject\RGB\data\

  • en_refine.json ✓ Present
  • en_int.json ✓ Present
  • en_fact.json ✓ Present

✅ Mirror Dataset Directory: d:\CapStoneProject\RGB\RGBMetrics\data\

  • en_refine.json ✓ Present
  • en_int.json ✓ Present
  • en_fact.json ✓ Present

Data Loader Integration

Pipeline Integration

Location: src/pipeline.py

def evaluate_noise_robustness(self, model: str, ...):
    samples = self.data_loader.load_noise_robustness(max_samples, noise_rate=noise_ratio)
    # ... evaluate

def evaluate_negative_rejection(self, model: str, ...):
    samples = self.data_loader.load_negative_rejection(max_samples)
    # ... evaluate

def evaluate_information_integration(self, model: str, ...):
    samples = self.data_loader.load_information_integration(max_samples)
    # ... evaluate

def evaluate_counterfactual_robustness(self, model: str, ...):
    samples = self.data_loader.load_counterfactual_robustness(max_samples)
    # ... evaluate
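Taken together, a driver might sweep the noise levels and run all four tasks in one pass. This is a hypothetical sketch: only the four evaluate_* method names appear in the report, and the keyword arguments (max_samples, noise_ratio) are inferred from the loader calls above.

```python
def run_all_tasks(pipeline, model, max_samples=10,
                  noise_rates=(0.0, 0.2, 0.4, 0.6, 0.8)):
    """Run every RGB task once per model. Hypothetical driver; the pipeline
    object is assumed to expose the four evaluate_* methods shown above."""
    return {
        "noise_robustness": {
            rate: pipeline.evaluate_noise_robustness(
                model, max_samples=max_samples, noise_ratio=rate)
            for rate in noise_rates
        },
        "negative_rejection": pipeline.evaluate_negative_rejection(
            model, max_samples=max_samples),
        "information_integration": pipeline.evaluate_information_integration(
            model, max_samples=max_samples),
        "counterfactual_robustness": pipeline.evaluate_counterfactual_robustness(
            model, max_samples=max_samples),
    }
```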

Streamlit App Integration

Location: app.py

The Streamlit UI also uses the same data loader through the pipeline:

  • Shows available datasets
  • Allows task selection
  • Loads samples based on selection
  • Evaluates using appropriate dataset

Verification Result

✅ CONFIRMED

All requirements are met:

  1. ✅ Noise Robustness uses en_refine.json
  2. ✅ Negative Rejection uses en_refine.json
  3. ✅ Information Integration uses en_int.json
  4. ✅ Counterfactual Robustness uses en_fact.json

Implementation is correct and complete:

  • All three datasets are present in the data directory
  • Data loader has dedicated methods for each task
  • Pipeline properly orchestrates data loading and evaluation
  • Datasets are properly structured with required fields
  • Data processing respects the structure and purpose of each dataset

Related Files