# Model Evaluation Guide
This guide explains how to run a comprehensive evaluation of the summarization models using the CNN/DailyMail dataset.
## Quick Start
### 1. Install Dependencies
```bash
pip install -r requirements.txt
```
### 2. Run Evaluation
```bash
python run_evaluation.py
```
This will:
- Download CNN/DailyMail dataset
- Evaluate all three models (TextRank, BART, PEGASUS)
- Generate comparison reports and visualizations
- Save results to `evaluation_results/` directory
## What Gets Evaluated
### Models
- **TextRank**: Extractive summarization using graph-based ranking
- **BART**: Abstractive summarization using transformer encoder-decoder
- **PEGASUS**: Abstractive summarization specialized for summarization tasks
### Metrics
- **ROUGE-1**: Overlap of unigrams between generated and reference summaries
- **ROUGE-2**: Overlap of bigrams between generated and reference summaries
- **ROUGE-L**: Longest common subsequence between generated and reference summaries
- **Processing Time**: Average time to generate each summary
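To make concrete what these metrics measure, here is a minimal pure-Python sketch of ROUGE-1 F1 (unigram overlap). The `rouge_1_f1` helper is illustrative only; the actual evaluation pipeline likely relies on a dedicated ROUGE library.

```python
from collections import Counter

def rouge_1_f1(reference: str, candidate: str) -> float:
    """Illustrative ROUGE-1 F1: unigram overlap between two texts.
    Not the project's implementation -- a sketch of the metric's idea."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Overlapping unigram count, clipped by per-word reference counts
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(rouge_1_f1("the cat sat on the mat", "the cat lay on the mat"))  # → 0.8333…
```

ROUGE-2 follows the same pattern over bigrams, and ROUGE-L replaces the overlap count with the longest common subsequence length.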
### Topic Categories
Articles are automatically categorized into:
- Politics
- Business
- Technology
- Sports
- Health
- Entertainment
- Other
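The guide does not show how the automatic categorization works; one plausible approach is keyword matching, sketched below. The `TOPIC_KEYWORDS` lists and the `categorize` function are assumptions for illustration, not the project's actual logic.

```python
# Hypothetical keyword-based categorizer; the project's real
# classification logic may differ.
TOPIC_KEYWORDS = {
    "politics": ["election", "senate", "parliament", "minister"],
    "business": ["market", "stocks", "economy", "profit"],
    "technology": ["software", "startup", "internet", "ai"],
    "sports": ["match", "tournament", "league", "coach"],
    "health": ["hospital", "vaccine", "disease", "patients"],
    "entertainment": ["film", "album", "celebrity", "premiere"],
}

def categorize(article: str) -> str:
    """Assign the topic whose keywords occur most often; else 'other'."""
    text = article.lower()
    scores = {
        topic: sum(text.count(word) for word in words)
        for topic, words in TOPIC_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"

print(categorize("The senate passed the bill after the election."))  # → politics
```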
## Advanced Usage
### Custom Evaluation
```bash
# Evaluate specific number of samples
python evaluation/run_evaluation.py --samples 200
# Evaluate by topic categories
python evaluation/run_evaluation.py --by-topic --samples 100
# Evaluate specific models only
python evaluation/run_evaluation.py --models textrank bart --samples 50
```
### Using Individual Components
#### Load Dataset
```python
from evaluation.dataset_loader import CNNDailyMailLoader
loader = CNNDailyMailLoader()
dataset = loader.load_dataset()
eval_data = loader.create_evaluation_subset(size=100)
```
#### Evaluate Single Model
```python
from evaluation.model_evaluator import ModelEvaluator
evaluator = ModelEvaluator()
evaluator.initialize_models()
results = evaluator.evaluate_single_model('bart', eval_data, max_samples=50)
```
#### Analyze Results
```python
from evaluation.results_analyzer import ResultsAnalyzer
analyzer = ResultsAnalyzer()
analyzer.create_performance_charts(results, 'output_dir')
analyzer.create_detailed_report(results, 'output_dir')
```
## Output Files
After running evaluation, you'll find these files in `evaluation_results/`:
### Data Files
- `eval_data.json` - Evaluation dataset
- `data_[topic].json` - Topic-specific datasets
### Results Files
- `results_overall.json` - Detailed evaluation results
- `comparison_overall.csv` - Summary comparison table
- `results_[topic].json` - Topic-specific results
### Visualizations
- `performance_comparison.png` - Model performance charts
- `topic_performance_heatmap.png` - Topic analysis heatmap
### Reports
- `evaluation_report.md` - Detailed evaluation report
- `topic_summary.csv` - Topic performance breakdown
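Once the results files exist, they can be post-processed with a few lines of Python. The JSON schema below is an assumption (per-model metric dictionaries, using the typical scores quoted later in this guide); inspect `results_overall.json` before relying on it.

```python
import json

# Hypothetical schema for results_overall.json -- the real file's
# layout may differ. Scores mirror this guide's "Typical ROUGE Scores".
sample = json.loads("""
{
  "textrank": {"rouge1": 0.35, "rouge2": 0.15},
  "bart":     {"rouge1": 0.42, "rouge2": 0.20},
  "pegasus":  {"rouge1": 0.44, "rouge2": 0.21}
}
""")

def best_model(results: dict, metric: str = "rouge1") -> str:
    """Return the model name with the highest score on the given metric."""
    return max(results, key=lambda name: results[name][metric])

print(best_model(sample))  # → pegasus
```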
## Understanding Results
### ROUGE Scores
- **Higher is better** (range: 0.0 to 1.0)
- ROUGE-1: Measures unigram content overlap
- ROUGE-2: Measures bigram overlap, a rough proxy for fluency
- ROUGE-L: Measures longest-common-subsequence similarity, reflecting sentence-level structure
### Processing Time
- **Lower is better**
- Measured in seconds per summary
- Important for real-time applications
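Per-summary timing can be measured with simple wall-clock bookkeeping, as sketched below. The `summarize` stub stands in for any of the three real model calls.

```python
import time

def summarize(text: str) -> str:
    # Stand-in for a real model call (TextRank / BART / PEGASUS)
    return text.split(".")[0]

def average_seconds_per_summary(articles, summarizer):
    """Average wall-clock seconds per generated summary."""
    start = time.perf_counter()
    for article in articles:
        summarizer(article)
    return (time.perf_counter() - start) / len(articles)

articles = ["First sentence. Second sentence."] * 10
print(f"{average_seconds_per_summary(articles, summarize):.6f} s/summary")
```

`time.perf_counter()` is preferable to `time.time()` here because it is monotonic and has higher resolution.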
### Model Characteristics
- **TextRank**: Fast, extractive, good for quick summaries
- **BART**: Balanced performance, good fluency
- **PEGASUS**: Typically the highest ROUGE scores; far slower than TextRank
## Troubleshooting
### Memory Issues
If you encounter memory issues:
```bash
# Reduce sample size
python run_evaluation.py --samples 20
# Evaluate models individually
python evaluation/run_evaluation.py --models textrank --samples 50
```
### Dataset Download Issues
The CNN/DailyMail dataset is large (~1.3GB). Ensure you have:
- Stable internet connection
- Sufficient disk space
- Proper HuggingFace datasets cache directory
### Model Loading Issues
If models fail to load:
- Check PyTorch installation
- Verify transformers library version
- Ensure sufficient RAM (8GB+ recommended)
## Configuration
### Sample Sizes
- **Development**: 20-50 samples
- **Testing**: 100-200 samples
- **Full evaluation**: 500+ samples
### Topic Evaluation
A minimum of 5 articles per topic is required for meaningful results. Topics with fewer articles are skipped.
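The skip rule above can be sketched as follows; `MIN_ARTICLES` reflects the threshold stated in this guide, while the grouping helper itself is illustrative.

```python
from collections import defaultdict

MIN_ARTICLES = 5  # threshold stated in this guide

def topics_to_evaluate(articles):
    """Group (topic, article) pairs; keep only topics with enough data."""
    groups = defaultdict(list)
    for topic, article in articles:
        groups[topic].append(article)
    return {t: arts for t, arts in groups.items() if len(arts) >= MIN_ARTICLES}

data = [("sports", f"article {i}") for i in range(6)] + [("health", "only one")]
print(sorted(topics_to_evaluate(data)))  # → ['sports']  (health skipped)
```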
## Performance Expectations
### Processing Times (CPU)
- TextRank: ~0.1 seconds per summary
- BART: ~10-15 seconds per summary
- PEGASUS: ~8-12 seconds per summary
### Typical ROUGE Scores
- TextRank: ROUGE-1 ~0.35, ROUGE-2 ~0.15
- BART: ROUGE-1 ~0.42, ROUGE-2 ~0.20
- PEGASUS: ROUGE-1 ~0.44, ROUGE-2 ~0.21
Results may vary based on dataset and configuration. |