# Model Evaluation Guide

This guide explains how to run a comprehensive evaluation of the summarization models using the CNN/DailyMail dataset.
## Quick Start

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Run Evaluation

```bash
python run_evaluation.py
```

This will:

- Download the CNN/DailyMail dataset
- Evaluate all three models (TextRank, BART, PEGASUS)
- Generate comparison reports and visualizations
- Save results to the `evaluation_results/` directory
## What Gets Evaluated

### Models

- **TextRank**: Extractive summarization using graph-based sentence ranking
- **BART**: Abstractive summarization using a transformer encoder-decoder
- **PEGASUS**: Abstractive summarization with a pre-training objective designed specifically for summarization
### Metrics

- **ROUGE-1**: Unigram overlap between generated and reference summaries
- **ROUGE-2**: Bigram overlap between generated and reference summaries
- **ROUGE-L**: Longest common subsequence between generated and reference summaries
- **Processing Time**: Average time to generate each summary
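The overlap metrics above can be illustrated with a minimal pure-Python sketch. The actual evaluation likely relies on a ROUGE library; the helper below is a simplified illustration that ignores stemming and proper tokenization:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset (Counter) of n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(candidate, reference, n=1):
    """Simplified ROUGE-N F1: n-gram overlap between candidate and reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    if not cand or not ref:
        return 0.0
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Example: generated vs. reference summary
print(round(rouge_n_f1("the cat sat on the mat", "the cat lay on the mat", n=1), 2))  # → 0.83
```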
### Topic Categories

Articles are automatically categorized into:

- Politics
- Business
- Technology
- Sports
- Health
- Entertainment
- Other
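The exact categorization rules live in the evaluation code; a keyword-matching sketch conveys the idea. The keyword lists here are illustrative assumptions, not the ones the project actually uses:

```python
import re

# Illustrative keyword lists (assumptions, not the project's real rules)
TOPIC_KEYWORDS = {
    "politics": {"election", "senate", "parliament", "president"},
    "business": {"market", "stocks", "earnings", "economy"},
    "technology": {"software", "ai", "startup", "smartphone"},
    "sports": {"match", "league", "tournament", "goal"},
    "health": {"hospital", "vaccine", "disease", "patients"},
    "entertainment": {"film", "album", "celebrity", "premiere"},
}

def categorize(article: str) -> str:
    """Assign the topic whose keywords appear most often; fall back to 'other'."""
    words = set(re.findall(r"[a-z]+", article.lower()))
    scores = {topic: len(words & kws) for topic, kws in TOPIC_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"

print(categorize("The president addressed the senate before the election."))  # → politics
```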
## Advanced Usage

### Custom Evaluation

```bash
# Evaluate a specific number of samples
python evaluation/run_evaluation.py --samples 200

# Evaluate by topic category
python evaluation/run_evaluation.py --by-topic --samples 100

# Evaluate specific models only
python evaluation/run_evaluation.py --models textrank bart --samples 50
```
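A minimal `argparse` setup for flags like those shown above might look like the sketch below. This is an assumption about the interface, not the script's actual implementation:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI flags mirroring the examples above (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Run summarization evaluation")
    parser.add_argument("--samples", type=int, default=100,
                        help="number of articles to evaluate")
    parser.add_argument("--by-topic", action="store_true",
                        help="break results down by topic category")
    parser.add_argument("--models", nargs="+",
                        default=["textrank", "bart", "pegasus"],
                        choices=["textrank", "bart", "pegasus"],
                        help="subset of models to evaluate")
    return parser

args = build_parser().parse_args(["--models", "textrank", "bart", "--samples", "50"])
print(args.samples, args.by_topic, args.models)  # → 50 False ['textrank', 'bart']
```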
### Using Individual Components

#### Load Dataset

```python
from evaluation.dataset_loader import CNNDailyMailLoader

loader = CNNDailyMailLoader()
dataset = loader.load_dataset()
eval_data = loader.create_evaluation_subset(size=100)
```

#### Evaluate a Single Model

```python
from evaluation.model_evaluator import ModelEvaluator

evaluator = ModelEvaluator()
evaluator.initialize_models()
results = evaluator.evaluate_single_model('bart', eval_data, max_samples=50)
```

#### Analyze Results

```python
from evaluation.results_analyzer import ResultsAnalyzer

analyzer = ResultsAnalyzer()
analyzer.create_performance_charts(results, 'output_dir')
analyzer.create_detailed_report(results, 'output_dir')
```
## Output Files

After running the evaluation, you'll find these files in `evaluation_results/`:

### Data Files

- `eval_data.json` - Evaluation dataset
- `data_[topic].json` - Topic-specific datasets

### Results Files

- `results_overall.json` - Detailed evaluation results
- `comparison_overall.csv` - Summary comparison table
- `results_[topic].json` - Topic-specific results

### Visualizations

- `performance_comparison.png` - Model performance charts
- `topic_performance_heatmap.png` - Topic analysis heatmap

### Reports

- `evaluation_report.md` - Detailed evaluation report
- `topic_summary.csv` - Topic performance breakdown
## Understanding Results

### ROUGE Scores

- **Higher is better** (range: 0.0 to 1.0)
- ROUGE-1: Measures content overlap at the word level
- ROUGE-2: Measures bigram overlap, a rough proxy for fluency and coherence
- ROUGE-L: Measures structural similarity via the longest common subsequence
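ROUGE-L differs from the fixed n-gram metrics in that it rewards content that appears in the same order even when it is not contiguous. A minimal sketch of the longest-common-subsequence computation behind it:

```python
def lcs_length(a, b):
    """Classic dynamic-programming longest common subsequence over token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

cand = "the cat sat on the mat".split()
ref = "the cat was on the mat".split()
lcs = lcs_length(cand, ref)  # "the cat ... on the mat" → 5 tokens in order

# ROUGE-L F1 from the LCS length and the two summary lengths
precision, recall = lcs / len(cand), lcs / len(ref)
print(round(2 * precision * recall / (precision + recall), 2))  # → 0.83
```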
### Processing Time

- **Lower is better**
- Measured in seconds per summary
- Important for real-time applications

### Model Characteristics

- **TextRank**: Fast and extractive; good for quick summaries
- **BART**: Balanced performance with good fluency
- **PEGASUS**: Highest quality but slower processing
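Per-summary time of this kind can be measured with `time.perf_counter`. The sketch below uses a trivial stand-in summarizer; any model's summarize call could be timed the same way:

```python
import time

def average_seconds_per_summary(summarize, articles):
    """Time each call and return the mean wall-clock seconds per summary."""
    start = time.perf_counter()
    for article in articles:
        summarize(article)
    return (time.perf_counter() - start) / len(articles)

# Stand-in summarizer: take the first sentence (a crude extractive baseline)
avg = average_seconds_per_summary(lambda text: text.split(".")[0],
                                  ["First sentence. Second sentence."] * 10)
print(f"{avg:.6f} s/summary")
```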
## Troubleshooting

### Memory Issues

If you run out of memory:

```bash
# Reduce the sample size
python run_evaluation.py --samples 20

# Evaluate models individually
python evaluation/run_evaluation.py --models textrank --samples 50
```
### Dataset Download Issues

The CNN/DailyMail dataset is large (~1.3 GB). Ensure you have:

- A stable internet connection
- Sufficient disk space
- A writable HuggingFace `datasets` cache directory

### Model Loading Issues

If models fail to load:

- Check your PyTorch installation
- Verify the transformers library version
- Ensure sufficient RAM (8 GB+ recommended)
## Configuration

### Sample Sizes

- **Development**: 20-50 samples
- **Testing**: 100-200 samples
- **Full evaluation**: 500+ samples

### Topic Evaluation

A minimum of 5 articles per topic is required for meaningful results; topics with fewer articles are skipped.
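The minimum-count filter described above can be expressed as a short sketch. The `"topic"` field name is an assumption for illustration:

```python
from collections import defaultdict

MIN_ARTICLES_PER_TOPIC = 5

def topics_to_evaluate(articles):
    """Group articles by topic and keep only topics with at least 5 articles."""
    by_topic = defaultdict(list)
    for article in articles:
        by_topic[article["topic"]].append(article)
    return {topic: items for topic, items in by_topic.items()
            if len(items) >= MIN_ARTICLES_PER_TOPIC}

sample = [{"topic": "sports"}] * 6 + [{"topic": "health"}] * 3
print(sorted(topics_to_evaluate(sample)))  # → ['sports']  (health has only 3 articles)
```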
## Performance Expectations

### Processing Times (CPU)

- TextRank: ~0.1 seconds per summary
- BART: ~10-15 seconds per summary
- PEGASUS: ~8-12 seconds per summary

### Typical ROUGE Scores

- TextRank: ROUGE-1 ~0.35, ROUGE-2 ~0.15
- BART: ROUGE-1 ~0.42, ROUGE-2 ~0.20
- PEGASUS: ROUGE-1 ~0.44, ROUGE-2 ~0.21

Results may vary with the dataset and configuration.