
Model Evaluation Guide

This guide explains how to run a comprehensive evaluation of the summarization models on the CNN/DailyMail dataset.

Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Run Evaluation

python run_evaluation.py

This will:

  • Download the CNN/DailyMail dataset
  • Evaluate all three models (TextRank, BART, PEGASUS)
  • Generate comparison reports and visualizations
  • Save results to evaluation_results/ directory

What Gets Evaluated

Models

  • TextRank: Extractive summarization using graph-based ranking
  • BART: Abstractive summarization using transformer encoder-decoder
  • PEGASUS: Abstractive summarization specialized for summarization tasks

Metrics

  • ROUGE-1: Overlap of unigrams between generated and reference summaries
  • ROUGE-2: Overlap of bigrams between generated and reference summaries
  • ROUGE-L: Longest common subsequence between generated and reference summaries
  • Processing Time: Average time to generate each summary
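The n-gram ROUGE variants reduce to n-gram overlap F1. The repo's actual implementation is not shown here; this is a minimal self-contained sketch using only the standard library:

```python
from collections import Counter

def rouge_n_f1(reference: str, generated: str, n: int = 1) -> float:
    """F1 of n-gram overlap between a generated and a reference summary.

    n=1 gives ROUGE-1 (unigrams), n=2 gives ROUGE-2 (bigrams).
    """
    def ngrams(text: str, n: int) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, gen = ngrams(reference, n), ngrams(generated, n)
    overlap = sum((ref & gen).values())  # clipped n-gram matches
    if not ref or not gen or overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge_n_f1("the cat sat on the mat", "the cat is on the mat", n=1))  # 0.833...
print(rouge_n_f1("the cat sat on the mat", "the cat is on the mat", n=2))  # 0.6
```

In practice the evaluation would use a tested library (e.g. Google's rouge-score package), which also handles stemming and sentence splitting.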

Topic Categories

Articles are automatically categorized into:

  • Politics
  • Business
  • Technology
  • Sports
  • Health
  • Entertainment
  • Other
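The guide does not show how categorization works; one common approach is keyword matching, sketched below with illustrative (not the repo's) keyword lists:

```python
# Hypothetical keyword-based topic categorizer; keyword lists are
# illustrative assumptions, not taken from the repository.
TOPIC_KEYWORDS = {
    "politics": ["election", "senate", "president", "parliament"],
    "business": ["market", "stocks", "profit", "economy"],
    "technology": ["software", "startup", "internet", "smartphone"],
    "sports": ["match", "league", "tournament", "coach"],
    "health": ["hospital", "vaccine", "disease", "doctor"],
    "entertainment": ["film", "album", "celebrity", "festival"],
}

def categorize(article: str) -> str:
    """Assign the topic with the most keyword hits, or 'other' if none hit."""
    text = article.lower()
    counts = {topic: sum(text.count(kw) for kw in kws)
              for topic, kws in TOPIC_KEYWORDS.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "other"

print(categorize("The president addressed the senate before the election."))
# -> politics
```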

Advanced Usage

Custom Evaluation

# Evaluate specific number of samples
python evaluation/run_evaluation.py --samples 200

# Evaluate by topic categories
python evaluation/run_evaluation.py --by-topic --samples 100

# Evaluate specific models only
python evaluation/run_evaluation.py --models textrank bart --samples 50

Using Individual Components

Load Dataset

from evaluation.dataset_loader import CNNDailyMailLoader

loader = CNNDailyMailLoader()
dataset = loader.load_dataset()
eval_data = loader.create_evaluation_subset(size=100)

Evaluate Single Model

from evaluation.model_evaluator import ModelEvaluator

evaluator = ModelEvaluator()
evaluator.initialize_models()
results = evaluator.evaluate_single_model('bart', eval_data, max_samples=50)

Analyze Results

from evaluation.results_analyzer import ResultsAnalyzer

analyzer = ResultsAnalyzer()
analyzer.create_performance_charts(results, 'output_dir')
analyzer.create_detailed_report(results, 'output_dir')

Output Files

After running evaluation, you'll find these files in evaluation_results/:

Data Files

  • eval_data.json - Evaluation dataset
  • data_[topic].json - Topic-specific datasets

Results Files

  • results_overall.json - Detailed evaluation results
  • comparison_overall.csv - Summary comparison table
  • results_[topic].json - Topic-specific results
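As an illustration of the comparison table's shape, a sketch that writes per-model scores to CSV with the standard library. The column names and the results structure here are assumptions, not the repo's actual schema (the score values reuse the typical figures quoted later in this guide):

```python
import csv
import io

# Hypothetical aggregated results, keyed by model name.
results = {
    "textrank": {"rouge1": 0.35, "rouge2": 0.15, "time_s": 0.1},
    "bart":     {"rouge1": 0.42, "rouge2": 0.20, "time_s": 12.0},
}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["model", "rouge1", "rouge2", "time_s"])
writer.writeheader()
for model, row in results.items():
    writer.writerow({"model": model, **row})

print(buf.getvalue())  # same shape as comparison_overall.csv
```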

Visualizations

  • performance_comparison.png - Model performance charts
  • topic_performance_heatmap.png - Topic analysis heatmap

Reports

  • evaluation_report.md - Detailed evaluation report
  • topic_summary.csv - Topic performance breakdown

Understanding Results

ROUGE Scores

  • Higher is better (range: 0.0 to 1.0)
  • ROUGE-1: Measures content overlap
  • ROUGE-2: Bigram overlap; a rough proxy for fluency and local coherence
  • ROUGE-L: Measures structural similarity
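ROUGE-L rewards in-order matches rather than fixed n-grams, and reduces to a longest-common-subsequence computation. A minimal sketch (not the repo's implementation):

```python
def lcs_length(a: list, b: list) -> int:
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(reference: str, generated: str) -> float:
    """ROUGE-L F1: LCS length normalized by both summary lengths."""
    ref, gen = reference.lower().split(), generated.lower().split()
    lcs = lcs_length(ref, gen)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(gen), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))  # 0.833...
```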

Processing Time

  • Lower is better
  • Measured in seconds per summary
  • Important for real-time applications
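Per-summary timing amounts to wrapping each summarization call in a monotonic timer; a sketch with a stand-in summarizer (the real evaluator's timing code may differ):

```python
import time

def timed_summary(summarize, article: str):
    """Return (summary, seconds) for one call. Averaging the seconds
    over many articles gives the per-summary time reported above."""
    start = time.perf_counter()
    summary = summarize(article)
    return summary, time.perf_counter() - start

# Stand-in "summarizer" (first sentence only), purely for illustration.
summary, seconds = timed_summary(lambda text: text.split(". ")[0] + ".",
                                 "First sentence. Second sentence.")
print(summary, f"({seconds:.4f}s)")
```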

Model Characteristics

  • TextRank: Fast, extractive, good for quick summaries
  • BART: Balanced performance, good fluency
  • PEGASUS: Typically the highest ROUGE scores; much slower than TextRank

Troubleshooting

Memory Issues

If you encounter memory issues:

# Reduce sample size
python run_evaluation.py --samples 20

# Evaluate models individually
python evaluation/run_evaluation.py --models textrank --samples 50

Dataset Download Issues

The CNN/DailyMail dataset is large (~1.3GB). Ensure you have:

  • A stable internet connection
  • Sufficient disk space
  • A writable HuggingFace datasets cache directory

Model Loading Issues

If models fail to load:

  • Check PyTorch installation
  • Verify transformers library version
  • Ensure sufficient RAM (8GB+ recommended)

Configuration

Sample Sizes

  • Development: 20-50 samples
  • Testing: 100-200 samples
  • Full evaluation: 500+ samples

Topic Evaluation

Minimum 5 articles per topic for meaningful results. Topics with fewer articles are skipped.
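The skipping rule above can be sketched as a grouping-and-filtering step; the function and field names here are illustrative, not the repo's:

```python
from collections import defaultdict

MIN_ARTICLES = 5  # topics below this threshold are skipped

def group_by_topic(articles):
    """Group (topic, article) pairs; drop topics with too few articles."""
    groups = defaultdict(list)
    for topic, article in articles:
        groups[topic].append(article)
    return {t: arts for t, arts in groups.items() if len(arts) >= MIN_ARTICLES}

sample = [("sports", f"article {i}") for i in range(6)] + [("health", "only one")]
print(sorted(group_by_topic(sample)))  # -> ['sports']  (health is skipped)
```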

Performance Expectations

Processing Times (CPU)

  • TextRank: ~0.1 seconds per summary
  • BART: ~10-15 seconds per summary
  • PEGASUS: ~8-12 seconds per summary

Typical ROUGE Scores

  • TextRank: ROUGE-1 ~0.35, ROUGE-2 ~0.15
  • BART: ROUGE-1 ~0.42, ROUGE-2 ~0.20
  • PEGASUS: ROUGE-1 ~0.44, ROUGE-2 ~0.21

Results may vary based on dataset and configuration.