Groundedness and Evaluation Improvements Summary

Overview

This document summarizes the comprehensive improvements made to the RAG system's groundedness evaluation and overall evaluation framework. These improvements focus on deterministic, reproducible, and more accurate assessment of generated responses.

Key Improvements Implemented

1. Deterministic Evaluation Framework

New Components:

  • src/evaluation/deterministic.py - Core deterministic evaluation utilities
  • src/evaluation/enhanced_runner.py - Enhanced evaluation runner with deterministic controls
  • test_deterministic_evaluation.py - Comprehensive test suite

Features:

  • Fixed Random Seeds: Configurable evaluation seed (default: 42) for reproducible results
  • Consistent Ordering: Deterministic processing order for queries, sources, and results
  • Normalized Precision: Fixed floating-point precision (6 decimal places) for consistent metrics
  • Environment Controls: Sets PYTHONHASHSEED=0 and other reproducibility environment variables
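
The sketch below illustrates what these controls amount to in practice. The function names are illustrative, not the actual src/evaluation/deterministic.py API, and note that PYTHONHASHSEED only affects hash randomization if set before the interpreter starts:

```python
import os
import random

def apply_deterministic_controls(seed: int = 42) -> None:
    """Illustrative helper mirroring the controls described above."""
    # Fixed seed for any stochastic components of the evaluation.
    random.seed(seed)
    # Exported so child processes inherit stable hashing; the current
    # interpreter's hash randomization is unaffected at this point.
    os.environ["PYTHONHASHSEED"] = "0"
    # Deterministic CUDA GEMM workspaces, when a GPU is in play.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

def normalize_metric(value: float, precision: int = 6) -> float:
    # Fixed floating-point precision so metrics compare equal across runs.
    return round(value, precision)
```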

2. Enhanced Groundedness Evaluation

Improvements over Previous System:

  • Multi-Source Analysis: Evaluates groundedness at both the passage level and the aggregate level
  • Token Overlap Scoring: Calculates precise token overlap between generated text and source passages
  • Exact Phrase Matching: Detects 2-7 word exact phrase matches for factual consistency
  • Passage Coverage: Measures how well the response covers information from all source passages
  • Deterministic Processing: Sources are processed in consistent order for reproducible results

Metrics Provided:

```jsonc
{
  "groundedness_score": 0.8542,     // Overall groundedness (0-1)
  "passage_coverage": 0.7834,       // Coverage across all passages (0-1)
  "token_overlap": 0.6745,          // Token overlap with sources (0-1)
  "exact_matches": 0.4500           // Rate of exact phrase matches (0-1)
}
```
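
A minimal sketch of how these metrics can be computed from token overlap and 2-7 word n-gram matches. The function below is illustrative; in particular, the weighting used for the aggregate score is an assumption, not the module's actual formula:

```python
import re

def _tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def _ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def groundedness_metrics(generated: str, passages: list[str]) -> dict[str, float]:
    gen_tokens = _tokens(generated)
    gen_set = set(gen_tokens)
    # Sources are processed in sorted order for deterministic results.
    passages = sorted(passages)

    src_set: set[str] = set()
    covered = 0
    for p in passages:
        p_tokens = set(_tokens(p))
        src_set |= p_tokens
        # A passage counts as "covered" if the response shares tokens with it.
        if p_tokens & gen_set:
            covered += 1

    token_overlap = len(gen_set & src_set) / len(gen_set) if gen_set else 0.0
    passage_coverage = covered / len(passages) if passages else 0.0

    # Exact phrase matching: fraction of generated n-grams (2-7 words)
    # that appear verbatim in at least one source passage.
    matched = total = 0
    src_token_lists = [_tokens(p) for p in passages]
    for n in range(2, 8):
        gen_grams = _ngrams(gen_tokens, n)
        src_grams: set[tuple[str, ...]] = set()
        for toks in src_token_lists:
            src_grams |= _ngrams(toks, n)
        matched += len(gen_grams & src_grams)
        total += len(gen_grams)
    exact_matches = matched / total if total else 0.0

    # Aggregate score: an illustrative weighted blend.
    score = 0.5 * token_overlap + 0.3 * passage_coverage + 0.2 * exact_matches
    return {
        "groundedness_score": round(score, 6),
        "passage_coverage": round(passage_coverage, 6),
        "token_overlap": round(token_overlap, 6),
        "exact_matches": round(exact_matches, 6),
    }
```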

3. Enhanced Citation Accuracy Validation

Deterministic Citation Matching:

  • Filename Normalization: Consistent handling of different file path formats
  • Extension Handling: Removes common extensions (.md, .txt, .pdf, etc.) for matching
  • Fuzzy Matching: Supports substring and similarity-based matching with configurable thresholds
  • Multi-Source Format Support: Handles various source metadata formats

Comprehensive Metrics:

```jsonc
{
  "citation_accuracy": 0.9167,       // F1-like overall accuracy (0-1)
  "source_precision": 0.8571,        // Precision of returned sources (0-1)
  "source_recall": 1.0000,           // Recall of expected sources (0-1)
  "exact_filename_matches": 1.0000   // Rate of exact filename matches (0-1)
}
```
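
A sketch of the matching logic under these rules, using the 0.72 fuzzy threshold from the configuration section below. Names and the exact precision/recall accounting are illustrative:

```python
import os
from difflib import SequenceMatcher

_EXTENSIONS = {".md", ".txt", ".pdf", ".html", ".rst"}

def normalize_filename(path: str) -> str:
    # Comparing basenames makes different path formats compare equal.
    name = os.path.basename(path.strip().lower())
    root, ext = os.path.splitext(name)
    return root if ext in _EXTENSIONS else name

def citations_match(expected: str, returned: str, threshold: float = 0.72) -> bool:
    a, b = normalize_filename(expected), normalize_filename(returned)
    if a == b or a in b or b in a:  # exact or substring match
        return True
    # Similarity-based fuzzy match with a configurable threshold.
    return SequenceMatcher(None, a, b).ratio() >= threshold

def citation_metrics(expected: list[str], returned: list[str]) -> dict[str, float]:
    hit_expected = sum(any(citations_match(e, r) for r in returned) for e in expected)
    hit_returned = sum(any(citations_match(e, r) for e in expected) for r in returned)
    precision = hit_returned / len(returned) if returned else 0.0
    recall = hit_expected / len(expected) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "citation_accuracy": round(f1, 6),
        "source_precision": round(precision, 6),
        "source_recall": round(recall, 6),
    }
```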

4. Fallback Mechanisms

API Failure Handling:

  • Graceful Degradation: Falls back to token overlap when ML libraries are unavailable
  • Error Recovery: Continues evaluation even with individual query failures
  • Timeout Handling: Configurable timeouts with proper error reporting

Missing Dependencies:

  • Optional Dependencies: Works without NumPy, PyTorch, or advanced NLP libraries
  • Token-Based Fallbacks: Uses string processing when advanced metrics are unavailable
  • Consistent Interface: Same API regardless of available dependencies
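
The pattern behind this is the standard optional-import guard; a hypothetical example:

```python
# Advanced metrics are used when available, with a pure-Python
# token-overlap-style fallback otherwise.
try:
    import numpy as np
    HAS_NUMPY = True
except ImportError:  # NumPy absent: fall back to built-in arithmetic
    HAS_NUMPY = False

def mean_score(scores: list[float]) -> float:
    # Same interface either way; only the backend differs.
    if not scores:
        return 0.0
    if HAS_NUMPY:
        return float(np.mean(scores))
    return sum(scores) / len(scores)
```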

5. Evaluation Runner Enhancements

Enhanced Evaluation Runner Features:

  • Progress Tracking: Visual progress bars using tqdm
  • Comprehensive Reporting: Detailed summary with latency percentiles
  • Configurable Targets: Support for different API endpoints
  • Batch Processing: Efficient processing of question sets
  • Result Persistence: Saves detailed results with metadata
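
For the latency portion of the summary, percentile reporting can be as simple as the sketch below (illustrative; assumes at least two recorded latencies):

```python
import statistics

def latency_summary(latencies_ms: list[float]) -> dict[str, float]:
    # quantiles(n=100) returns the 1st-99th percentile cut points.
    pcts = statistics.quantiles(latencies_ms, n=100)
    return {
        "p50_ms": round(pcts[49], 2),
        "p95_ms": round(pcts[94], 2),
        "p99_ms": round(pcts[98], 2),
        "mean_ms": round(statistics.fmean(latencies_ms), 2),
    }
```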

Command Line Interface:

```bash
python -m src.evaluation.enhanced_runner \
  --questions evaluation/questions.json \
  --gold evaluation/gold_answers.json \
  --output enhanced_results.json \
  --target https://api.example.com \
  --seed 42
```

Testing and Validation

Comprehensive Test Suite

Test Coverage:

  • βœ… Reproducibility: Same seed produces identical results
  • βœ… Groundedness Scoring: Validates scoring algorithms
  • βœ… Citation Accuracy: Tests filename normalization and matching
  • βœ… Edge Cases: Handles empty inputs, special characters, Unicode
  • βœ… Float Precision: Ensures consistent floating-point handling
  • βœ… Ordering Consistency: Same results regardless of input order

Test Results:

```text
Ran 10 tests in 1.442s - All tests passed ✅
```

Integration Testing

Real-World Validation:

  • Tested with existing evaluation files (questions.json, gold_answers.json)
  • Verified deterministic behavior across multiple runs
  • Confirmed fallback mechanisms work correctly
  • Validated API integration and error handling

Performance Improvements

Evaluation Speed

  • Efficient Processing: Optimized token overlap calculations
  • Batch Operations: Process multiple queries efficiently
  • Smart Caching: Avoid redundant calculations
  • Progress Feedback: Real-time progress indication
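
As an example of the caching point: source passages recur across queries, so memoizing tokenization avoids redundant work. A hypothetical sketch:

```python
import re
from functools import lru_cache

@lru_cache(maxsize=4096)
def cached_tokens(text: str) -> tuple[str, ...]:
    # Tokenization is pure, so each distinct passage is tokenized once.
    # A tuple is returned because cached values should be immutable.
    return tuple(re.findall(r"[a-z0-9]+", text.lower()))
```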

Memory Usage

  • Streaming Processing: Handle large evaluation sets without memory issues
  • Cleanup: Proper resource management and garbage collection
  • Optimal Data Structures: Use appropriate data structures for performance

Backward Compatibility

Preserved Functionality

  • Original API: Existing evaluation scripts continue to work
  • Same Metrics: Traditional overlap scores still available for comparison
  • File Formats: Compatible with existing question and gold answer formats
  • Configuration: Environment variables and command-line options preserved

Migration Path

  • Gradual Adoption: Can be used alongside existing evaluation system
  • Drop-in Replacement: Enhanced runner can replace original runner
  • Configuration Migration: Easy migration of existing configurations

Configuration Options

Environment Variables

```bash
# Evaluation configuration
export EVALUATION_SEED=42
export EVAL_TARGET_URL=https://api.example.com
export EVAL_TIMEOUT=30

# Deterministic behavior
export PYTHONHASHSEED=0
export CUBLAS_WORKSPACE_CONFIG=":4096:8"

# Citation matching
export EVAL_CITATION_FUZZY_THRESHOLD=0.72
```

Programmatic Configuration

```python
from src.evaluation.deterministic import DeterministicConfig, DeterministicEvaluator

config = DeterministicConfig(
    random_seed=42,
    sort_results=True,
    float_precision=6,
    consistent_order=True,
    deterministic_mode=True
)

evaluator = DeterministicEvaluator(config)
```

Impact on Evaluation Quality

Reproducibility

  • Consistent Results: Same evaluation produces identical results across runs
  • Fixed Seeds: Deterministic random number generation
  • Environment Control: Controlled evaluation environment

Accuracy

  • Multi-Dimensional Scoring: More comprehensive groundedness assessment
  • Passage-Level Analysis: Better understanding of source utilization
  • Enhanced Citation Validation: More accurate citation accuracy measurement

Reliability

  • Fallback Mechanisms: Continues working even with missing dependencies
  • Error Handling: Graceful handling of API failures and edge cases
  • Validation: Comprehensive testing ensures reliability

Future Enhancements

Potential Improvements

  1. LLM-Based Groundedness: Integration with existing OpenRouter LLM evaluation
  2. Semantic Similarity: Use of sentence embeddings for semantic groundedness
  3. Custom Metrics: Support for domain-specific evaluation metrics
  4. Real-Time Monitoring: Live evaluation monitoring and alerting
  5. A/B Testing: Support for comparative evaluation of different models

Extension Points

  • Metric Plugins: Pluggable architecture for custom metrics
  • Source Types: Support for different source document types
  • Evaluation Protocols: Different evaluation strategies for different use cases
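
One shape such a plugin interface could take (purely hypothetical; no such interface exists in the codebase yet):

```python
from typing import Protocol

class MetricPlugin(Protocol):
    name: str

    def score(self, generated: str, sources: list[str]) -> float:
        """Return a score in [0, 1] for a single query."""
        ...

def run_metrics(plugins: list[MetricPlugin],
                generated: str, sources: list[str]) -> dict[str, float]:
    # Plugins run in sorted order so the result dict is deterministic,
    # matching the framework's ordering guarantees.
    return {p.name: round(p.score(generated, sources), 6)
            for p in sorted(plugins, key=lambda p: p.name)}
```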

Summary

The groundedness and evaluation improvements provide a robust, deterministic, and comprehensive evaluation framework for the RAG system. Key achievements include:

  1. βœ… Deterministic Behavior: Fixed seeds and consistent ordering ensure reproducible results
  2. βœ… Enhanced Groundedness: Multi-dimensional scoring with passage-level analysis
  3. βœ… Improved Citations: Comprehensive citation accuracy validation with fuzzy matching
  4. βœ… Fallback Mechanisms: Graceful degradation when dependencies are unavailable
  5. βœ… Comprehensive Testing: Full test suite validates all functionality
  6. βœ… Backward Compatibility: Works alongside existing evaluation system

These improvements significantly enhance the quality and reliability of RAG system evaluation, providing more accurate and consistent assessment of generated responses while maintaining compatibility with existing workflows.

Usage Examples

Basic Usage

```python
from src.evaluation.enhanced_runner import run_enhanced_evaluation

results = run_enhanced_evaluation(
    questions_file="evaluation/questions.json",
    gold_file="evaluation/gold_answers.json",
    evaluation_seed=42
)
```

Advanced Configuration

```python
from src.evaluation.enhanced_runner import EnhancedEvaluationRunner

runner = EnhancedEvaluationRunner(
    target_url="https://api.example.com",
    evaluation_seed=42,
    timeout=30
)

results = runner.run_evaluation(
    "questions.json",
    "gold_answers.json",
    "results.json"
)

runner.print_summary()
```

Direct Groundedness Evaluation

```python
from src.evaluation.deterministic import evaluate_groundedness_deterministic

score = evaluate_groundedness_deterministic(
    generated_text="Response text here",
    source_passages=["Source 1", "Source 2"],
    evaluator=None  # Uses default configuration
)
```

This completes the groundedness and evaluation improvements, providing a solid foundation for reliable and reproducible RAG system evaluation.