Model Evaluation Guide
This guide explains how to run a comprehensive evaluation of the summarization models on the CNN/DailyMail dataset.
Quick Start
1. Install Dependencies
pip install -r requirements.txt
2. Run Evaluation
python run_evaluation.py
This will:
- Download CNN/DailyMail dataset
- Evaluate all three models (TextRank, BART, PEGASUS)
- Generate comparison reports and visualizations
- Save results to the evaluation_results/ directory
What Gets Evaluated
Models
- TextRank: Extractive summarization using graph-based ranking
- BART: Abstractive summarization using transformer encoder-decoder
- PEGASUS: Abstractive summarization specialized for summarization tasks
Metrics
- ROUGE-1: Overlap of unigrams between generated and reference summaries
- ROUGE-2: Overlap of bigrams between generated and reference summaries
- ROUGE-L: Longest common subsequence between generated and reference summaries
- Processing Time: Average time to generate each summary
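As a rough illustration of what these scores measure, ROUGE-1 F1 can be sketched as a clipped unigram-overlap computation. The actual evaluation presumably uses a proper ROUGE library; this toy version skips stemming and real tokenization:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a generated summary."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Overlap = count of shared unigrams, clipped to the reference count
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE-2 and ROUGE-L follow the same precision/recall/F1 pattern over bigrams and the longest common subsequence, respectively.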
Topic Categories
Articles are automatically categorized into:
- Politics
- Business
- Technology
- Sports
- Health
- Entertainment
- Other
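Categorization like this can be approximated with simple keyword matching. The keyword lists and function below are invented for illustration; the project's actual categorizer may work differently:

```python
# Hypothetical keyword lists per topic; the real categorizer may differ.
TOPIC_KEYWORDS = {
    "politics": ["election", "senate", "minister", "parliament"],
    "business": ["market", "stocks", "earnings", "economy"],
    "technology": ["software", "startup", "smartphone", "chip"],
    "sports": ["match", "season", "league", "championship"],
    "health": ["hospital", "vaccine", "disease", "patients"],
    "entertainment": ["film", "album", "celebrity", "premiere"],
}

def categorize(article: str) -> str:
    """Assign the topic whose keywords occur most often; 'other' if none match."""
    words = article.lower().split()
    scores = {
        topic: sum(words.count(kw) for kw in kws)
        for topic, kws in TOPIC_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"
```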
Advanced Usage
Custom Evaluation
# Evaluate specific number of samples
python evaluation/run_evaluation.py --samples 200
# Evaluate by topic categories
python evaluation/run_evaluation.py --by-topic --samples 100
# Evaluate specific models only
python evaluation/run_evaluation.py --models textrank bart --samples 50
Using Individual Components
Load Dataset
from evaluation.dataset_loader import CNNDailyMailLoader
loader = CNNDailyMailLoader()
dataset = loader.load_dataset()
eval_data = loader.create_evaluation_subset(size=100)
Evaluate Single Model
from evaluation.model_evaluator import ModelEvaluator
evaluator = ModelEvaluator()
evaluator.initialize_models()
results = evaluator.evaluate_single_model('bart', eval_data, max_samples=50)
Analyze Results
from evaluation.results_analyzer import ResultsAnalyzer
analyzer = ResultsAnalyzer()
analyzer.create_performance_charts(results, 'output_dir')
analyzer.create_detailed_report(results, 'output_dir')
Output Files
After running evaluation, you'll find these files in evaluation_results/:
Data Files
- eval_data.json - Evaluation dataset
- data_[topic].json - Topic-specific datasets
Results Files
- results_overall.json - Detailed evaluation results
- comparison_overall.csv - Summary comparison table
- results_[topic].json - Topic-specific results
Visualizations
- performance_comparison.png - Model performance charts
- topic_performance_heatmap.png - Topic analysis heatmap
Reports
- evaluation_report.md - Detailed evaluation report
- topic_summary.csv - Topic performance breakdown
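For post-hoc analysis, the machine-readable outputs can be loaded with the standard library. The exact JSON/CSV structure is assumed here, so adjust the keys to whatever your run actually produces:

```python
import csv
import json
from pathlib import Path

def load_results(results_dir: str = "evaluation_results"):
    """Load the detailed JSON results and the CSV comparison table.

    The file names come from the output list above; the inner structure
    of each file is an assumption and may differ in your run.
    """
    base = Path(results_dir)
    with open(base / "results_overall.json") as f:
        results = json.load(f)
    with open(base / "comparison_overall.csv") as f:
        comparison = list(csv.DictReader(f))
    return results, comparison
```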
Understanding Results
ROUGE Scores
- Higher is better (range: 0.0 to 1.0)
- ROUGE-1: Measures content overlap
- ROUGE-2: Measures fluency and coherence
- ROUGE-L: Measures structural similarity
Processing Time
- Lower is better
- Measured in seconds per summary
- Important for real-time applications
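Per-summary processing time can be measured with a simple wall-clock wrapper. The `timed_summary` helper and the `summarize` callable are illustrative names, not part of the project API:

```python
import time

def timed_summary(summarize, article: str):
    """Return (summary, seconds) for a single summarization call.

    `summarize` stands in for any model's summarize function.
    """
    start = time.perf_counter()
    summary = summarize(article)
    elapsed = time.perf_counter() - start
    return summary, elapsed
```

Averaging `elapsed` over all evaluation samples gives the per-model figures reported in the comparison table.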
Model Characteristics
- TextRank: Fast, extractive, good for quick summaries
- BART: Balanced performance, good fluency
- PEGASUS: Best quality, slower processing
Troubleshooting
Memory Issues
If you encounter memory issues:
# Reduce sample size
python run_evaluation.py --samples 20
# Evaluate models individually
python evaluation/run_evaluation.py --models textrank --samples 50
Dataset Download Issues
The CNN/DailyMail dataset is large (~1.3GB). Ensure you have:
- Stable internet connection
- Sufficient disk space
- Proper HuggingFace datasets cache directory
Model Loading Issues
If models fail to load:
- Check PyTorch installation
- Verify transformers library version
- Ensure sufficient RAM (8GB+ recommended)
Configuration
Sample Sizes
- Development: 20-50 samples
- Testing: 100-200 samples
- Full evaluation: 500+ samples
Topic Evaluation
A minimum of 5 articles per topic is required for meaningful results; topics with fewer articles are skipped.
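The skip rule can be expressed as a small filter. The data shape here (a list of dicts with a 'topic' key) is an assumption for illustration:

```python
from collections import Counter

MIN_ARTICLES_PER_TOPIC = 5  # topics below this threshold are skipped

def topics_to_evaluate(articles):
    """Keep only topics with enough articles for meaningful averaged scores."""
    counts = Counter(a["topic"] for a in articles)
    return {t for t, n in counts.items() if n >= MIN_ARTICLES_PER_TOPIC}
```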
Performance Expectations
Processing Times (CPU)
- TextRank: ~0.1 seconds per summary
- BART: ~10-15 seconds per summary
- PEGASUS: ~8-12 seconds per summary
Typical ROUGE Scores
- TextRank: ROUGE-1 ~0.35, ROUGE-2 ~0.15
- BART: ROUGE-1 ~0.42, ROUGE-2 ~0.20
- PEGASUS: ROUGE-1 ~0.44, ROUGE-2 ~0.21
Results may vary based on dataset and configuration.