# Model Evaluation Guide

This guide explains how to run a comprehensive evaluation of the summarization models on the CNN/DailyMail dataset.

## Quick Start

### 1. Install Dependencies
```bash
pip install -r requirements.txt
```

### 2. Run Evaluation
```bash
python run_evaluation.py
```

This will:
- Download CNN/DailyMail dataset
- Evaluate all three models (TextRank, BART, PEGASUS)
- Generate comparison reports and visualizations
- Save results to `evaluation_results/` directory

## What Gets Evaluated

### Models
- **TextRank**: Extractive summarization using graph-based ranking
- **BART**: Abstractive summarization using transformer encoder-decoder
- **PEGASUS**: Abstractive summarization specialized for summarization tasks

### Metrics
- **ROUGE-1**: Overlap of unigrams between generated and reference summaries
- **ROUGE-2**: Overlap of bigrams between generated and reference summaries  
- **ROUGE-L**: Longest common subsequence between generated and reference summaries
- **Processing Time**: Average time to generate each summary
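
At its core, ROUGE-1 is a clipped unigram-overlap F-measure. The sketch below illustrates the idea in pure Python; it is a simplification, not the implementation used by evaluation libraries such as the `rouge-score` package, which also apply stemming and proper tokenization:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F-measure based on clipped unigram overlap.

    Real implementations also handle stemming, tokenization, and
    sentence splitting; this sketch just lowercases and splits on
    whitespace.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection clips each unigram to the minimum of its
    # counts in the reference and the candidate.
    overlap = sum((ref_counts & cand_counts).values())
    if not ref_counts or not cand_counts:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("the cat sat on the mat", "the cat is on the mat"), 3))  # 0.833
```

ROUGE-2 and ROUGE-L follow the same precision/recall/F-measure pattern, but count bigram matches and longest-common-subsequence length respectively.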

### Topic Categories
Articles are automatically categorized into:
- Politics
- Business
- Technology
- Sports
- Health
- Entertainment
- Other
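
The categorizer in this repo is not shown here; one common approach is keyword matching, sketched below. The topics match the list above, but the keyword sets and the `categorize` helper are purely illustrative:

```python
# Hypothetical keyword sets; the actual categorizer may use different
# keywords or a trained classifier.
TOPIC_KEYWORDS = {
    "Politics": {"election", "senate", "congress", "minister", "policy"},
    "Business": {"market", "stocks", "earnings", "economy", "merger"},
    "Technology": {"software", "internet", "smartphone", "ai", "startup"},
    "Sports": {"game", "season", "championship", "coach", "league"},
    "Health": {"hospital", "disease", "vaccine", "patients", "doctors"},
    "Entertainment": {"film", "album", "celebrity", "premiere", "concert"},
}

def categorize(article: str) -> str:
    """Assign the topic with the most keyword hits, else 'Other'."""
    words = set(article.lower().split())
    best_topic, best_hits = "Other", 0
    for topic, keywords in TOPIC_KEYWORDS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best_topic, best_hits = topic, hits
    return best_topic

print(categorize("The senate passed the election policy"))  # Politics
```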

## Advanced Usage

### Custom Evaluation
```bash
# Evaluate specific number of samples
python evaluation/run_evaluation.py --samples 200

# Evaluate by topic categories
python evaluation/run_evaluation.py --by-topic --samples 100

# Evaluate specific models only
python evaluation/run_evaluation.py --models textrank bart --samples 50
```
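
A minimal sketch of how the flags above could be parsed with `argparse`; the actual `run_evaluation.py` may define defaults and help text differently:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run summarization evaluation")
    parser.add_argument("--samples", type=int, default=100,
                        help="Number of articles to evaluate")
    parser.add_argument("--by-topic", action="store_true",
                        help="Break results down by topic category")
    parser.add_argument("--models", nargs="+",
                        default=["textrank", "bart", "pegasus"],
                        help="Subset of models to evaluate")
    return parser

args = build_parser().parse_args(["--models", "textrank", "bart", "--samples", "50"])
print(args.samples, args.models)  # 50 ['textrank', 'bart']
```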

### Using Individual Components

#### Load Dataset
```python
from evaluation.dataset_loader import CNNDailyMailLoader

loader = CNNDailyMailLoader()
dataset = loader.load_dataset()
eval_data = loader.create_evaluation_subset(size=100)
```

#### Evaluate Single Model
```python
from evaluation.model_evaluator import ModelEvaluator

evaluator = ModelEvaluator()
evaluator.initialize_models()
results = evaluator.evaluate_single_model('bart', eval_data, max_samples=50)
```

#### Analyze Results
```python
from evaluation.results_analyzer import ResultsAnalyzer

analyzer = ResultsAnalyzer()
analyzer.create_performance_charts(results, 'output_dir')
analyzer.create_detailed_report(results, 'output_dir')
```

## Output Files

After running evaluation, you'll find these files in `evaluation_results/`:

### Data Files
- `eval_data.json` - Evaluation dataset
- `data_[topic].json` - Topic-specific datasets

### Results Files
- `results_overall.json` - Detailed evaluation results
- `comparison_overall.csv` - Summary comparison table
- `results_[topic].json` - Topic-specific results

### Visualizations
- `performance_comparison.png` - Model performance charts
- `topic_performance_heatmap.png` - Topic analysis heatmap

### Reports
- `evaluation_report.md` - Detailed evaluation report
- `topic_summary.csv` - Topic performance breakdown
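
The comparison CSV can also be rebuilt from the results with the standard library alone. This sketch assumes the results map model names to metric dicts; the real schema of `results_overall.json` may differ:

```python
import csv
import io

# Hypothetical shape of the results data; the real file's schema may differ.
results = {
    "textrank": {"rouge1": 0.35, "rouge2": 0.15, "rougeL": 0.30, "time_s": 0.1},
    "bart": {"rouge1": 0.42, "rouge2": 0.20, "rougeL": 0.38, "time_s": 12.0},
}

def to_comparison_csv(results: dict) -> str:
    """Render per-model metrics as a CSV comparison table."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["model", "rouge1", "rouge2", "rougeL", "time_s"])
    for model, metrics in results.items():
        writer.writerow(
            [model] + [metrics[k] for k in ("rouge1", "rouge2", "rougeL", "time_s")]
        )
    return buf.getvalue()

print(to_comparison_csv(results))
```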

## Understanding Results

### ROUGE Scores
- **Higher is better** (range: 0.0 to 1.0)
- ROUGE-1: Measures unigram (content word) overlap
- ROUGE-2: Measures bigram overlap, often used as a rough proxy for fluency
- ROUGE-L: Measures longest-common-subsequence overlap, reflecting sentence-level structure

### Processing Time
- **Lower is better**
- Measured in seconds per summary
- Important for real-time applications
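
Per-summary time can be measured with `time.perf_counter` and averaged over a batch. A sketch; the lambda below is an illustrative stand-in for a real summarizer:

```python
import time

def average_seconds_per_call(func, inputs):
    """Average wall-clock seconds per call over a batch of inputs."""
    start = time.perf_counter()
    for item in inputs:
        func(item)
    return (time.perf_counter() - start) / len(inputs)

# Stand-in "summarizer": truncate to the first 50 characters.
avg = average_seconds_per_call(lambda text: text[:50], ["some article text"] * 100)
print(f"{avg:.6f} s per summary")
```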

### Model Characteristics
- **TextRank**: Fast, extractive, good for quick summaries
- **BART**: Balanced performance, good fluency
- **PEGASUS**: Best quality, slower processing

## Troubleshooting

### Memory Issues
If you encounter memory issues:
```bash
# Reduce sample size
python run_evaluation.py --samples 20

# Evaluate models individually
python evaluation/run_evaluation.py --models textrank --samples 50
```

### Dataset Download Issues
The CNN/DailyMail dataset is large (~1.3GB). Ensure you have:
- Stable internet connection
- Sufficient disk space
- A writable Hugging Face `datasets` cache directory

### Model Loading Issues
If models fail to load:
- Check PyTorch installation
- Verify transformers library version
- Ensure sufficient RAM (8GB+ recommended)

## Configuration

### Sample Sizes
- **Development**: 20-50 samples
- **Testing**: 100-200 samples  
- **Full evaluation**: 500+ samples

### Topic Evaluation
A minimum of 5 articles per topic is required for meaningful results; topics with fewer articles are skipped.
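
That cutoff amounts to a simple filter over the per-topic article buckets. A sketch (function and variable names are illustrative; only the minimum of 5 comes from the rule above):

```python
MIN_ARTICLES_PER_TOPIC = 5

def filter_topics(buckets: dict) -> dict:
    """Drop topics with too few articles for meaningful ROUGE averages."""
    return {
        topic: articles
        for topic, articles in buckets.items()
        if len(articles) >= MIN_ARTICLES_PER_TOPIC
    }

buckets = {"Politics": ["a1", "a2", "a3", "a4", "a5"], "Health": ["a6", "a7"]}
print(sorted(filter_topics(buckets)))  # ['Politics']
```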

## Performance Expectations

### Processing Times (CPU)
- TextRank: ~0.1 seconds per summary
- BART: ~10-15 seconds per summary
- PEGASUS: ~8-12 seconds per summary

### Typical ROUGE Scores
- TextRank: ROUGE-1 ~0.35, ROUGE-2 ~0.15
- BART: ROUGE-1 ~0.42, ROUGE-2 ~0.20
- PEGASUS: ROUGE-1 ~0.44, ROUGE-2 ~0.21

Results may vary based on dataset and configuration.