# Model Evaluation Guide

This guide explains how to run a comprehensive evaluation of the summarization models using the CNN/DailyMail dataset.

## Quick Start

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Run Evaluation

```bash
python run_evaluation.py
```

This will:

- Download the CNN/DailyMail dataset
- Evaluate all three models (TextRank, BART, PEGASUS)
- Generate comparison reports and visualizations
- Save results to the `evaluation_results/` directory

## What Gets Evaluated

### Models

- **TextRank**: Extractive summarization using graph-based ranking
- **BART**: Abstractive summarization using a transformer encoder-decoder
- **PEGASUS**: Abstractive summarization pre-trained specifically for summarization tasks

### Metrics

- **ROUGE-1**: Overlap of unigrams between generated and reference summaries
- **ROUGE-2**: Overlap of bigrams between generated and reference summaries
- **ROUGE-L**: Longest common subsequence between generated and reference summaries
- **Processing Time**: Average time to generate each summary

### Topic Categories

Articles are automatically categorized into:

- Politics
- Business
- Technology
- Sports
- Health
- Entertainment
- Other

## Advanced Usage

### Custom Evaluation

```bash
# Evaluate a specific number of samples
python evaluation/run_evaluation.py --samples 200

# Evaluate by topic category
python evaluation/run_evaluation.py --by-topic --samples 100

# Evaluate specific models only
python evaluation/run_evaluation.py --models textrank bart --samples 50
```

### Using Individual Components

#### Load Dataset

```python
from evaluation.dataset_loader import CNNDailyMailLoader

loader = CNNDailyMailLoader()
dataset = loader.load_dataset()
eval_data = loader.create_evaluation_subset(size=100)
```

#### Evaluate Single Model

```python
from evaluation.model_evaluator import ModelEvaluator

evaluator = ModelEvaluator()
evaluator.initialize_models()
results = evaluator.evaluate_single_model('bart', eval_data, max_samples=50)
```

#### Analyze Results

```python
from evaluation.results_analyzer import ResultsAnalyzer

analyzer = ResultsAnalyzer()
analyzer.create_performance_charts(results, 'output_dir')
analyzer.create_detailed_report(results, 'output_dir')
```

## Output Files

After running the evaluation, you'll find these files in `evaluation_results/`:

### Data Files

- `eval_data.json` - Evaluation dataset
- `data_[topic].json` - Topic-specific datasets

### Results Files

- `results_overall.json` - Detailed evaluation results
- `comparison_overall.csv` - Summary comparison table
- `results_[topic].json` - Topic-specific results

### Visualizations

- `performance_comparison.png` - Model performance charts
- `topic_performance_heatmap.png` - Topic analysis heatmap

### Reports

- `evaluation_report.md` - Detailed evaluation report
- `topic_summary.csv` - Topic performance breakdown

## Understanding Results

### ROUGE Scores

- **Higher is better** (range: 0.0 to 1.0)
- ROUGE-1: Measures content overlap
- ROUGE-2: Measures bigram overlap, a rough proxy for fluency and coherence
- ROUGE-L: Measures structural similarity

### Processing Time

- **Lower is better**
- Measured in seconds per summary
- Important for real-time applications

### Model Characteristics

- **TextRank**: Fast, extractive, good for quick summaries
- **BART**: Balanced performance, good fluency
- **PEGASUS**: Best quality, slower processing

## Troubleshooting

### Memory Issues

If you encounter memory issues:

```bash
# Reduce the sample size
python run_evaluation.py --samples 20

# Evaluate models individually
python evaluation/run_evaluation.py --models textrank --samples 50
```

### Dataset Download Issues

The CNN/DailyMail dataset is large (~1.3GB).
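Before re-running a failed download, it can help to confirm where the datasets cache lives and that there is room for it. A minimal stdlib sketch (`HF_DATASETS_CACHE` is the standard HuggingFace override for the cache location; the exact default path can vary by setup):

```python
import os
import shutil
from pathlib import Path

# Default HuggingFace datasets cache; override with the HF_DATASETS_CACHE env var.
cache_dir = Path(
    os.environ.get("HF_DATASETS_CACHE", Path.home() / ".cache" / "huggingface" / "datasets")
)
print(f"Cache directory: {cache_dir} (exists: {cache_dir.exists()})")

# The dataset needs roughly 1.3 GB plus extraction overhead.
free_gb = shutil.disk_usage(Path.home()).free / 1e9
print(f"Free space: {free_gb:.1f} GB")
```

If the reported free space is below a few gigabytes, clear the cache directory or point `HF_DATASETS_CACHE` at a larger volume before retrying.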
Ensure you have:

- A stable internet connection
- Sufficient disk space
- A properly configured HuggingFace datasets cache directory

### Model Loading Issues

If models fail to load:

- Check your PyTorch installation
- Verify the transformers library version
- Ensure sufficient RAM (8GB+ recommended)

## Configuration

### Sample Sizes

- **Development**: 20-50 samples
- **Testing**: 100-200 samples
- **Full evaluation**: 500+ samples

### Topic Evaluation

A minimum of 5 articles per topic is required for meaningful results; topics with fewer articles are skipped.

## Performance Expectations

### Processing Times (CPU)

- TextRank: ~0.1 seconds per summary
- BART: ~10-15 seconds per summary
- PEGASUS: ~8-12 seconds per summary

### Typical ROUGE Scores

- TextRank: ROUGE-1 ~0.35, ROUGE-2 ~0.15
- BART: ROUGE-1 ~0.42, ROUGE-2 ~0.20
- PEGASUS: ROUGE-1 ~0.44, ROUGE-2 ~0.21

Results may vary based on dataset and configuration.
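To make the ROUGE numbers above concrete, here is a minimal, illustrative ROUGE-1 F1 computation in plain Python. This is a sketch only: the actual evaluation uses a full ROUGE implementation with stemming and proper tokenization, which this omits.

```python
from collections import Counter

def rouge1_f(reference: str, candidate: str) -> float:
    """Minimal ROUGE-1 F1: unigram overlap between reference and candidate."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f("the cat sat on the mat", "the cat lay on the mat")
print(f"ROUGE-1 F1: {score:.2f}")  # → ROUGE-1 F1: 0.83
```

A near-verbatim candidate scores close to 1.0, while a summary sharing almost no vocabulary with the reference scores near 0.0, which is why the ~0.35-0.44 range above represents solid performance on this task.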