Dataset Usage Verification Report
Requirement vs Implementation Verification
Requirement
Use the following datasets for experimentation:
- Noise Robustness and Negative Rejection: `en_refine.json` dataset
- Information Integration: `en_int.json` dataset
- Counterfactual Robustness: `en_fact.json` dataset
✅ VERIFIED IMPLEMENTATION
1. Noise Robustness → en_refine.json
Location: src/data_loader.py
```python
def load_noise_robustness(
    self,
    max_samples: Optional[int] = None,
    noise_rate: float = 0.4,
) -> List[RGBSample]:
    """
    Load data for Noise Robustness evaluation.

    Uses en_refine.json - tests the LLM's ability to handle noisy documents.
    """
    filepath = self._get_file_path("en_refine.json")
    if not os.path.exists(filepath):
        raise FileNotFoundError(
            f"Dataset file not found: {filepath}\n"
            "Please run: python download_datasets.py"
        )
    # ... loads and processes en_refine.json
```
Data Source: en_refine.json
Dataset Structure:
```json
{
  "id": int,
  "query": str,
  "answer": str,
  "positive": [str],   // Relevant documents containing the answer
  "negative": [str]    // Irrelevant documents (noise)
}
```
Processing:
- Calculates `neg_num = passage_num × noise_rate`
- Selects positive and negative documents proportionally
- Shuffles them together
- Tests the model's ability to extract the answer from noisy passages
Noise Levels Tested: 0%, 20%, 40%, 60%, 80% (configured in pipeline)
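The document only describes this selection step in prose; a minimal sketch of how such noise mixing could work is below. The function name `mix_passages`, the floor rounding, and the fixed seed are illustrative assumptions, not the repository's actual code.

```python
import math
import random

def mix_passages(positive, negative, passage_num=5, noise_rate=0.4, seed=0):
    """Sketch of the noise-mixing step (hypothetical implementation):
    pick a proportional number of negative (noise) documents, fill the
    remaining slots with positives, then shuffle so the answer's position
    is not predictable."""
    neg_num = math.floor(passage_num * noise_rate)  # e.g. 5 * 0.4 -> 2 noise docs
    pos_num = passage_num - neg_num                 # remaining 3 slots for positives
    docs = positive[:pos_num] + negative[:neg_num]
    random.Random(seed).shuffle(docs)
    return docs

docs = mix_passages(["p1", "p2", "p3", "p4"], ["n1", "n2", "n3"])
```

At `noise_rate=0.0` the set is all positives, and at `noise_rate=0.8` only one positive remains, matching the range of noise levels listed above.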
2. Negative Rejection → en_refine.json
Location: src/data_loader.py
```python
def load_negative_rejection(
    self,
    max_samples: Optional[int] = None,
) -> List[RGBSample]:
    """
    Load data for Negative Rejection evaluation.

    Uses en_refine.json with noise_rate=1.0 (all negative documents).
    Tests the LLM's ability to reject when documents don't contain the answer.
    """
    filepath = self._get_file_path("en_refine.json")
    if not os.path.exists(filepath):
        raise FileNotFoundError(...)
    # ... loads and processes en_refine.json
```
Data Source: en_refine.json (same file as noise robustness)
Processing:
- Uses ONLY negative documents (`noise_rate = 1.0`)
- Sets `has_answer=False` to indicate the documents don't contain the answer
- Tests whether the model can appropriately reject when there is insufficient information
Key Difference from Noise Robustness:
- Noise Robustness: Mix of positive + negative documents
- Negative Rejection: ONLY negative documents (100% noise)
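In other words, the rejection case is the degenerate form of the mixing step with a 100% noise rate. A minimal sketch, assuming a helper named `build_rejection_docs` (not from the repository):

```python
def build_rejection_docs(negative, passage_num=5):
    """Hypothetical sketch of the negative-rejection loader: with
    noise_rate=1.0 every slot is filled by a negative document, so the
    sample carries has_answer=False and the model should refuse to answer."""
    docs = negative[:passage_num]
    return {"docs": docs, "has_answer": False}
```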
3. Information Integration → en_int.json
Location: src/data_loader.py
```python
def load_information_integration(
    self,
    max_samples: Optional[int] = None,
) -> List[RGBSample]:
    """
    Load data for Information Integration evaluation.

    Uses en_int.json - tests the LLM's ability to integrate info from multiple docs.
    """
    filepath = self._get_file_path("en_int.json")
    if not os.path.exists(filepath):
        raise FileNotFoundError(...)
    # ... loads and processes en_int.json
```
Data Source: en_int.json
Dataset Structure:
```json
{
  "id": int,
  "query": str,
  "answer": str,
  "answer1": str,            // First component of the answer
  "answer2": str,            // Second component of the answer
  "positive": [[str], ...],  // Multiple groups of related documents
  "negative": [str]          // Irrelevant documents
}
```
Processing:
- Extracts one document from each group in the `positive` field
- Adds negative documents if needed to reach `passage_num` (5)
- Sets `num_docs_needed` to track how many document groups are required
- Tests the model's ability to synthesize an answer from multiple sources
Test Scenario: Model must combine information from multiple documents to answer
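The steps above can be sketched as follows. The function name `build_integration_docs` and the random per-group choice are assumptions for illustration; the repository's loader may select documents differently.

```python
import random

def build_integration_docs(positive_groups, negative, passage_num=5, seed=0):
    """Hypothetical sketch of the integration loader: draw one document
    from each group of related positives (each group supports one part of
    the answer), then pad with negatives up to passage_num and shuffle."""
    rng = random.Random(seed)
    docs = [rng.choice(group) for group in positive_groups]
    num_docs_needed = len(positive_groups)       # groups required to answer fully
    docs += negative[: passage_num - len(docs)]  # pad remaining slots with noise
    rng.shuffle(docs)
    return docs, num_docs_needed

docs, needed = build_integration_docs([["a1", "a2"], ["b1"]], ["n1", "n2", "n3", "n4"])
```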
4. Counterfactual Robustness → en_fact.json
Location: src/data_loader.py
```python
def load_counterfactual_robustness(
    self,
    max_samples: Optional[int] = None,
) -> List[RGBSample]:
    """
    Load data for Counterfactual Robustness evaluation.

    Uses en_fact.json - tests the LLM's ability to detect/correct factual errors.
    """
    filepath = self._get_file_path("en_fact.json")
    if not os.path.exists(filepath):
        raise FileNotFoundError(...)
    # ... loads and processes en_fact.json
```
Data Source: en_fact.json
Dataset Structure:
```json
{
  "id": int,
  "query": str,
  "answer": str,             // Correct answer
  "fakeanswer": str,         // Incorrect/counterfactual answer
  "positive": [str],         // Documents with the correct answer
  "positive_wrong": [str],   // Documents with the fake/incorrect answer
  "negative": [str]          // Irrelevant documents
}
```
Processing:
- Uses mainly `positive_wrong` documents (3) plus `negative` documents (2)
- Sets `has_counterfactual=True`
- Stores `counterfactual_answer` from the `fakeanswer` field
- Tests the model's ability to detect and correct factual inconsistencies
Test Scenario:
- Documents contain incorrect information
- Model should detect error and provide correct answer
- Evaluates both error detection and error correction
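Putting the 3 + 2 document split and the stored fields together, the sample construction described above can be sketched like this. `build_counterfactual_sample` is a hypothetical name, not the repository's function:

```python
import random

def build_counterfactual_sample(item, seed=0):
    """Hypothetical sketch of the counterfactual loader: three documents
    containing the fake answer plus two irrelevant ones, with the fake
    answer kept alongside the true one so the evaluator can score both
    error detection and error correction."""
    docs = item["positive_wrong"][:3] + item["negative"][:2]
    random.Random(seed).shuffle(docs)
    return {
        "docs": docs,
        "has_counterfactual": True,
        "counterfactual_answer": item["fakeanswer"],
        "answer": item["answer"],
    }
```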
Summary Table
| Task | Dataset | File | Data Fields Used | Test Scenario |
|---|---|---|---|---|
| Noise Robustness | `en_refine.json` | `d:/CapStoneProject/RGB/data/en_refine.json` | `query`, `answer`, `positive`, `negative` | Mix of correct + noise documents |
| Negative Rejection | `en_refine.json` | `d:/CapStoneProject/RGB/data/en_refine.json` | `query`, `answer`, `negative` | Only noise documents (no answer) |
| Information Integration | `en_int.json` | `d:/CapStoneProject/RGB/data/en_int.json` | `query`, `answer`, `answer1`, `answer2`, `positive` (grouped), `negative` | Multiple documents needed for synthesis |
| Counterfactual Robustness | `en_fact.json` | `d:/CapStoneProject/RGB/data/en_fact.json` | `query`, `answer`, `fakeanswer`, `positive`, `positive_wrong`, `negative` | Incorrect information in documents |
File Locations Verified
✅ Primary Dataset Directory: `d:\CapStoneProject\RGB\data\`
- `en_refine.json` ✅ Present
- `en_int.json` ✅ Present
- `en_fact.json` ✅ Present

✅ Mirror Dataset Directory: `d:\CapStoneProject\RGB\RGBMetrics\data\`
- `en_refine.json` ✅ Present
- `en_int.json` ✅ Present
- `en_fact.json` ✅ Present
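The presence checks above can be automated with a small helper that mirrors the loader's `FileNotFoundError` guard. The helper name `check_datasets` is an assumption for illustration:

```python
import os

def check_datasets(data_dir, files=("en_refine.json", "en_int.json", "en_fact.json")):
    """Hypothetical helper: return the list of required dataset files
    that are missing from data_dir (empty list means all present)."""
    return [f for f in files if not os.path.exists(os.path.join(data_dir, f))]
```

Running it against both the primary and mirror directories should return an empty list for each if the verification above holds.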
Data Loader Integration
Pipeline Integration
Location: src/pipeline.py
```python
def evaluate_noise_robustness(self, model: str, ...):
    samples = self.data_loader.load_noise_robustness(max_samples, noise_rate=noise_ratio)
    # ... evaluate

def evaluate_negative_rejection(self, model: str, ...):
    samples = self.data_loader.load_negative_rejection(max_samples)
    # ... evaluate

def evaluate_information_integration(self, model: str, ...):
    samples = self.data_loader.load_information_integration(max_samples)
    # ... evaluate

def evaluate_counterfactual_robustness(self, model: str, ...):
    samples = self.data_loader.load_counterfactual_robustness(max_samples)
    # ... evaluate
```
Streamlit App Integration
Location: app.py
The Streamlit UI also uses the same data loader through the pipeline:
- Shows available datasets
- Allows task selection
- Loads samples based on selection
- Evaluates using appropriate dataset
Verification Result
✅ CONFIRMED

All requirements are met:
- ✅ Noise Robustness uses `en_refine.json`
- ✅ Negative Rejection uses `en_refine.json`
- ✅ Information Integration uses `en_int.json`
- ✅ Counterfactual Robustness uses `en_fact.json`
Implementation is correct and complete:
- All three datasets are present in the data directory
- Data loader has dedicated methods for each task
- Pipeline properly orchestrates data loading and evaluation
- Datasets are properly structured with required fields
- Data processing respects the structure and purpose of each dataset
Related Files
- Data Loader: src/data_loader.py
- Evaluation Pipeline: src/pipeline.py
- Download Script: download_datasets.py
- Streamlit App: app.py
- Evaluator: src/evaluator.py