# Model Evaluation Guide

This guide explains how to run a comprehensive evaluation of the summarization models using the CNN/DailyMail dataset.

## Quick Start

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Run Evaluation

```bash
python run_evaluation.py
```

This will:

- Download the CNN/DailyMail dataset
- Evaluate all three models (TextRank, BART, PEGASUS)
- Generate comparison reports and visualizations
- Save results to the `evaluation_results/` directory

## What Gets Evaluated

### Models

- **TextRank**: Extractive summarization using graph-based ranking
- **BART**: Abstractive summarization using a transformer encoder-decoder
- **PEGASUS**: Abstractive summarization pre-trained specifically for summarization tasks

### Metrics

- **ROUGE-1**: Overlap of unigrams between generated and reference summaries
- **ROUGE-2**: Overlap of bigrams between generated and reference summaries
- **ROUGE-L**: Longest common subsequence between generated and reference summaries
- **Processing Time**: Average time to generate each summary

### Topic Categories

Articles are automatically categorized into:

- Politics
- Business
- Technology
- Sports
- Health
- Entertainment
- Other

## Advanced Usage

### Custom Evaluation

```bash
# Evaluate a specific number of samples
python evaluation/run_evaluation.py --samples 200

# Evaluate by topic category
python evaluation/run_evaluation.py --by-topic --samples 100

# Evaluate specific models only
python evaluation/run_evaluation.py --models textrank bart --samples 50
```

### Using Individual Components

#### Load Dataset

```python
from evaluation.dataset_loader import CNNDailyMailLoader

loader = CNNDailyMailLoader()
dataset = loader.load_dataset()
eval_data = loader.create_evaluation_subset(size=100)
```

#### Evaluate Single Model

```python
from evaluation.model_evaluator import ModelEvaluator

evaluator = ModelEvaluator()
evaluator.initialize_models()
results = evaluator.evaluate_single_model('bart', eval_data, max_samples=50)
```

#### Analyze Results

```python
from evaluation.results_analyzer import ResultsAnalyzer

analyzer = ResultsAnalyzer()
analyzer.create_performance_charts(results, 'output_dir')
analyzer.create_detailed_report(results, 'output_dir')
```

## Output Files

After running the evaluation, you'll find these files in `evaluation_results/`:

### Data Files

- `eval_data.json` - Evaluation dataset
- `data_[topic].json` - Topic-specific datasets

### Results Files

- `results_overall.json` - Detailed evaluation results
- `comparison_overall.csv` - Summary comparison table
- `results_[topic].json` - Topic-specific results

### Visualizations

- `performance_comparison.png` - Model performance charts
- `topic_performance_heatmap.png` - Topic analysis heatmap

### Reports

- `evaluation_report.md` - Detailed evaluation report
- `topic_summary.csv` - Topic performance breakdown

## Understanding Results

### ROUGE Scores

- **Higher is better** (range: 0.0 to 1.0)
- ROUGE-1: Measures content overlap
- ROUGE-2: Measures bigram overlap, a rough proxy for fluency and coherence
- ROUGE-L: Measures structural similarity

### Processing Time

- **Lower is better**
- Measured in seconds per summary
- Important for real-time applications

### Model Characteristics

- **TextRank**: Fast, extractive, good for quick summaries
- **BART**: Balanced performance, good fluency
- **PEGASUS**: Best quality, slower processing

## Troubleshooting

### Memory Issues

If you encounter memory issues:

```bash
# Reduce the sample size
python run_evaluation.py --samples 20

# Evaluate models individually
python evaluation/run_evaluation.py --models textrank --samples 50
```

### Dataset Download Issues

The CNN/DailyMail dataset is large (~1.3GB).
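Before re-running a failed download, it can help to confirm where the datasets cache lives and that there is room for it. A minimal stdlib sketch (`HF_DATASETS_CACHE` is the standard HuggingFace override for the cache location; the exact default path can vary by setup):

```python
import os
import shutil
from pathlib import Path

# Default HuggingFace datasets cache; override with the HF_DATASETS_CACHE env var.
cache_dir = Path(
    os.environ.get("HF_DATASETS_CACHE", Path.home() / ".cache" / "huggingface" / "datasets")
)
print(f"Cache directory: {cache_dir} (exists: {cache_dir.exists()})")

# The dataset needs roughly 1.3 GB plus extraction overhead.
free_gb = shutil.disk_usage(Path.home()).free / 1e9
print(f"Free space: {free_gb:.1f} GB")
```

If the reported free space is below a few gigabytes, clear the cache directory or point `HF_DATASETS_CACHE` at a larger volume before retrying.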
Ensure you have:

- A stable internet connection
- Sufficient disk space
- A properly configured HuggingFace datasets cache directory

### Model Loading Issues

If models fail to load:

- Check your PyTorch installation
- Verify the transformers library version
- Ensure sufficient RAM (8GB+ recommended)

## Configuration

### Sample Sizes

- **Development**: 20-50 samples
- **Testing**: 100-200 samples
- **Full evaluation**: 500+ samples

### Topic Evaluation

A minimum of 5 articles per topic is required for meaningful results; topics with fewer articles are skipped.

## Performance Expectations

### Processing Times (CPU)

- TextRank: ~0.1 seconds per summary
- BART: ~10-15 seconds per summary
- PEGASUS: ~8-12 seconds per summary

### Typical ROUGE Scores

- TextRank: ROUGE-1 ~0.35, ROUGE-2 ~0.15
- BART: ROUGE-1 ~0.42, ROUGE-2 ~0.20
- PEGASUS: ROUGE-1 ~0.44, ROUGE-2 ~0.21

Results may vary based on dataset and configuration.
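To make the ROUGE numbers above concrete, here is a minimal, illustrative ROUGE-1 F1 computation in plain Python. This is a sketch only: the actual evaluation uses a full ROUGE implementation with stemming and proper tokenization, which this omits.

```python
from collections import Counter

def rouge1_f(reference: str, candidate: str) -> float:
    """Minimal ROUGE-1 F1: unigram overlap between reference and candidate."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f("the cat sat on the mat", "the cat lay on the mat")
print(f"ROUGE-1 F1: {score:.2f}")  # → ROUGE-1 F1: 0.83
```

A near-verbatim candidate scores close to 1.0, while a summary sharing almost no vocabulary with the reference scores near 0.0, which is why the ~0.35-0.44 range above represents solid performance on this task.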