# Model Evaluation Guide
This guide explains how to run a comprehensive evaluation of the summarization models using the CNN/DailyMail dataset.
## Quick Start
### 1. Install Dependencies
```bash
pip install -r requirements.txt
```
### 2. Run Evaluation
```bash
python run_evaluation.py
```
This will:
- Download CNN/DailyMail dataset
- Evaluate all three models (TextRank, BART, PEGASUS)
- Generate comparison reports and visualizations
- Save results to `evaluation_results/` directory
## What Gets Evaluated
### Models
- **TextRank**: Extractive summarization using graph-based ranking
- **BART**: Abstractive summarization using transformer encoder-decoder
- **PEGASUS**: Abstractive summarization specialized for summarization tasks
### Metrics
- **ROUGE-1**: Overlap of unigrams between generated and reference summaries
- **ROUGE-2**: Overlap of bigrams between generated and reference summaries
- **ROUGE-L**: Longest common subsequence between generated and reference summaries
- **Processing Time**: Average time to generate each summary
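As an illustrative sketch of what these metrics measure (the evaluator itself typically relies on a ROUGE library rather than this hand-rolled version), ROUGE-N F1 is the harmonic mean of n-gram precision and recall:

```python
from collections import Counter

def rouge_n_f1(candidate: str, reference: str, n: int = 1) -> float:
    """F1 of n-gram overlap between a candidate and a reference summary."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    if not cand or not ref:
        return 0.0
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With `n=1` this is ROUGE-1, with `n=2` ROUGE-2; ROUGE-L replaces n-gram counts with the longest common subsequence length but follows the same precision/recall/F1 shape.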
### Topic Categories
Articles are automatically categorized into:
- Politics
- Business
- Technology
- Sports
- Health
- Entertainment
- Other
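The categorization logic lives in the evaluation code; a minimal keyword-matching sketch (the keyword lists below are illustrative, not the actual ones used) might look like:

```python
# Illustrative keyword lists; the real categorizer may use different terms.
TOPIC_KEYWORDS = {
    "politics": ["election", "senate", "parliament", "minister"],
    "business": ["market", "stocks", "earnings", "economy"],
    "technology": ["software", "startup", "internet", "ai"],
    "sports": ["match", "season", "coach", "tournament"],
    "health": ["hospital", "disease", "vaccine", "patients"],
    "entertainment": ["film", "album", "celebrity", "concert"],
}

def categorize(article: str) -> str:
    """Assign the topic whose keywords appear most often; 'other' if none match."""
    words = article.lower().split()
    counts = {
        topic: sum(words.count(kw) for kw in kws)
        for topic, kws in TOPIC_KEYWORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "other"
```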
## Advanced Usage
### Custom Evaluation
```bash
# Evaluate a specific number of samples
python evaluation/run_evaluation.py --samples 200

# Evaluate by topic categories
python evaluation/run_evaluation.py --by-topic --samples 100

# Evaluate specific models only
python evaluation/run_evaluation.py --models textrank bart --samples 50
```
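A plausible sketch of how `run_evaluation.py` might parse these flags with `argparse` (the real script's defaults and help text may differ):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run summarization evaluation")
    parser.add_argument("--samples", type=int, default=100,
                        help="number of articles to evaluate")
    parser.add_argument("--by-topic", action="store_true",
                        help="break results down by topic category")
    parser.add_argument("--models", nargs="+",
                        choices=["textrank", "bart", "pegasus"],
                        default=["textrank", "bart", "pegasus"],
                        help="subset of models to evaluate")
    return parser
```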
### Using Individual Components
#### Load Dataset
```python
from evaluation.dataset_loader import CNNDailyMailLoader
loader = CNNDailyMailLoader()
dataset = loader.load_dataset()
eval_data = loader.create_evaluation_subset(size=100)
```
#### Evaluate Single Model
```python
from evaluation.model_evaluator import ModelEvaluator
evaluator = ModelEvaluator()
evaluator.initialize_models()
results = evaluator.evaluate_single_model('bart', eval_data, max_samples=50)
```
#### Analyze Results
```python
from evaluation.results_analyzer import ResultsAnalyzer
analyzer = ResultsAnalyzer()
analyzer.create_performance_charts(results, 'output_dir')
analyzer.create_detailed_report(results, 'output_dir')
```
## Output Files
After running evaluation, you'll find these files in `evaluation_results/`:
### Data Files
- `eval_data.json` - Evaluation dataset
- `data_[topic].json` - Topic-specific datasets
### Results Files
- `results_overall.json` - Detailed evaluation results
- `comparison_overall.csv` - Summary comparison table
- `results_[topic].json` - Topic-specific results
### Visualizations
- `performance_comparison.png` - Model performance charts
- `topic_performance_heatmap.png` - Topic analysis heatmap
### Reports
- `evaluation_report.md` - Detailed evaluation report
- `topic_summary.csv` - Topic performance breakdown
## Understanding Results
### ROUGE Scores
- **Higher is better** (range: 0.0 to 1.0)
- ROUGE-1: Measures content overlap (shared unigrams)
- ROUGE-2: Measures bigram overlap, often read as a proxy for fluency
- ROUGE-L: Measures structural similarity via the longest common subsequence
### Processing Time
- **Lower is better**
- Measured in seconds per summary
- Important for real-time applications
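Per-summary timing of this kind can be measured with `time.perf_counter`; a minimal sketch, assuming a `summarize` callable (not part of this repo's API):

```python
import time

def average_seconds_per_summary(summarize, articles) -> float:
    """Time each summarize() call and return mean wall-clock seconds."""
    durations = []
    for article in articles:
        start = time.perf_counter()
        summarize(article)
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)
```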
### Model Characteristics
- **TextRank**: Fast, extractive, good for quick summaries
- **BART**: Balanced performance, good fluency
- **PEGASUS**: Best quality, slower processing
## Troubleshooting
### Memory Issues
If you encounter memory issues:
```bash
# Reduce sample size
python run_evaluation.py --samples 20

# Evaluate models individually
python evaluation/run_evaluation.py --models textrank --samples 50
```
### Dataset Download Issues
The CNN/DailyMail dataset is large (~1.3GB). Ensure you have:
- A stable internet connection
- Sufficient disk space
- A writable HuggingFace `datasets` cache directory
### Model Loading Issues
If models fail to load:
- Check PyTorch installation
- Verify transformers library version
- Ensure sufficient RAM (8GB+ recommended)
## Configuration
### Sample Sizes
- **Development**: 20-50 samples
- **Testing**: 100-200 samples
- **Full evaluation**: 500+ samples
### Topic Evaluation
Minimum 5 articles per topic for meaningful results. Topics with fewer articles are skipped.
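A sketch of that minimum-size filter, assuming each article record carries a `topic` field (the actual field name in the evaluation data may differ):

```python
from collections import defaultdict

MIN_ARTICLES_PER_TOPIC = 5

def group_topics(articles, min_size=MIN_ARTICLES_PER_TOPIC):
    """Group articles by topic, dropping topics below the minimum size."""
    groups = defaultdict(list)
    for article in articles:
        groups[article["topic"]].append(article)
    return {t: arts for t, arts in groups.items() if len(arts) >= min_size}
```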
## Performance Expectations
### Processing Times (CPU)
- TextRank: ~0.1 seconds per summary
- BART: ~10-15 seconds per summary
- PEGASUS: ~8-12 seconds per summary
### Typical ROUGE Scores
- TextRank: ROUGE-1 ~0.35, ROUGE-2 ~0.15
- BART: ROUGE-1 ~0.42, ROUGE-2 ~0.20
- PEGASUS: ROUGE-1 ~0.44, ROUGE-2 ~0.21
Results may vary based on dataset and configuration.