# Model Evaluation Guide
This guide explains how to run a comprehensive evaluation of the summarization models using the CNN/DailyMail dataset.
## Quick Start
### 1. Install Dependencies
```bash
pip install -r requirements.txt
```
### 2. Run Evaluation
```bash
python run_evaluation.py
```
This will:
- Download CNN/DailyMail dataset
- Evaluate all three models (TextRank, BART, PEGASUS)
- Generate comparison reports and visualizations
- Save results to `evaluation_results/` directory
## What Gets Evaluated
### Models
- **TextRank**: Extractive summarization using graph-based ranking
- **BART**: Abstractive summarization using transformer encoder-decoder
- **PEGASUS**: Abstractive summarization specialized for summarization tasks
### Metrics
- **ROUGE-1**: Overlap of unigrams between generated and reference summaries
- **ROUGE-2**: Overlap of bigrams between generated and reference summaries
- **ROUGE-L**: Longest common subsequence between generated and reference summaries
- **Processing Time**: Average time to generate each summary
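As an illustrative sketch of what these metrics measure (the evaluator itself typically relies on a ROUGE library rather than this hand-rolled version), ROUGE-N F1 is the harmonic mean of n-gram precision and recall:

```python
from collections import Counter

def rouge_n_f1(candidate: str, reference: str, n: int = 1) -> float:
    """F1 of n-gram overlap between a candidate and a reference summary."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    if not cand or not ref:
        return 0.0
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With `n=1` this is ROUGE-1, with `n=2` ROUGE-2; ROUGE-L replaces n-gram counts with the longest common subsequence length but follows the same precision/recall/F1 shape.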
### Topic Categories
Articles are automatically categorized into:
- Politics
- Business
- Technology
- Sports
- Health
- Entertainment
- Other
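The categorization logic lives in the evaluation code; a minimal keyword-matching sketch (the keyword lists below are illustrative, not the actual ones used) might look like:

```python
# Illustrative keyword lists; the real categorizer may use different terms.
TOPIC_KEYWORDS = {
    "politics": ["election", "senate", "parliament", "minister"],
    "business": ["market", "stocks", "earnings", "economy"],
    "technology": ["software", "startup", "internet", "ai"],
    "sports": ["match", "season", "coach", "tournament"],
    "health": ["hospital", "disease", "vaccine", "patients"],
    "entertainment": ["film", "album", "celebrity", "concert"],
}

def categorize(article: str) -> str:
    """Assign the topic whose keywords appear most often; 'other' if none match."""
    words = article.lower().split()
    counts = {
        topic: sum(words.count(kw) for kw in kws)
        for topic, kws in TOPIC_KEYWORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "other"
```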
## Advanced Usage
### Custom Evaluation
```bash
# Evaluate a specific number of samples
python evaluation/run_evaluation.py --samples 200

# Evaluate by topic categories
python evaluation/run_evaluation.py --by-topic --samples 100

# Evaluate specific models only
python evaluation/run_evaluation.py --models textrank bart --samples 50
```
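A plausible sketch of how `run_evaluation.py` might parse these flags with `argparse` (the real script's defaults and help text may differ):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run summarization evaluation")
    parser.add_argument("--samples", type=int, default=100,
                        help="number of articles to evaluate")
    parser.add_argument("--by-topic", action="store_true",
                        help="break results down by topic category")
    parser.add_argument("--models", nargs="+",
                        choices=["textrank", "bart", "pegasus"],
                        default=["textrank", "bart", "pegasus"],
                        help="subset of models to evaluate")
    return parser
```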
### Using Individual Components
#### Load Dataset
```python
from evaluation.dataset_loader import CNNDailyMailLoader
loader = CNNDailyMailLoader()
dataset = loader.load_dataset()
eval_data = loader.create_evaluation_subset(size=100)
```
#### Evaluate Single Model
```python
from evaluation.model_evaluator import ModelEvaluator
evaluator = ModelEvaluator()
evaluator.initialize_models()
results = evaluator.evaluate_single_model('bart', eval_data, max_samples=50)
```
#### Analyze Results
```python
from evaluation.results_analyzer import ResultsAnalyzer
analyzer = ResultsAnalyzer()
analyzer.create_performance_charts(results, 'output_dir')
analyzer.create_detailed_report(results, 'output_dir')
```
## Output Files
After running evaluation, you'll find these files in `evaluation_results/`:
### Data Files
- `eval_data.json` - Evaluation dataset
- `data_[topic].json` - Topic-specific datasets
### Results Files
- `results_overall.json` - Detailed evaluation results
- `comparison_overall.csv` - Summary comparison table
- `results_[topic].json` - Topic-specific results
### Visualizations
- `performance_comparison.png` - Model performance charts
- `topic_performance_heatmap.png` - Topic analysis heatmap
### Reports
- `evaluation_report.md` - Detailed evaluation report
- `topic_summary.csv` - Topic performance breakdown
## Understanding Results
### ROUGE Scores
- **Higher is better** (range: 0.0 to 1.0)
- ROUGE-1: Measures content overlap (shared unigrams)
- ROUGE-2: Measures bigram overlap, often read as a proxy for fluency
- ROUGE-L: Measures structural similarity via the longest common subsequence
### Processing Time
- **Lower is better**
- Measured in seconds per summary
- Important for real-time applications
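Per-summary timing of this kind can be measured with `time.perf_counter`; a minimal sketch, assuming a `summarize` callable (not part of this repo's API):

```python
import time

def average_seconds_per_summary(summarize, articles) -> float:
    """Time each summarize() call and return mean wall-clock seconds."""
    durations = []
    for article in articles:
        start = time.perf_counter()
        summarize(article)
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)
```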
### Model Characteristics
- **TextRank**: Fast, extractive, good for quick summaries
- **BART**: Balanced performance, good fluency
- **PEGASUS**: Best quality, slower processing
## Troubleshooting
### Memory Issues
If you encounter memory issues:
```bash
# Reduce sample size
python run_evaluation.py --samples 20

# Evaluate models individually
python evaluation/run_evaluation.py --models textrank --samples 50
```
### Dataset Download Issues
The CNN/DailyMail dataset is large (~1.3GB). Ensure you have:
- A stable internet connection
- Sufficient disk space
- A writable HuggingFace `datasets` cache directory
### Model Loading Issues
If models fail to load:
- Check PyTorch installation
- Verify transformers library version
- Ensure sufficient RAM (8GB+ recommended)
## Configuration
### Sample Sizes
- **Development**: 20-50 samples
- **Testing**: 100-200 samples
- **Full evaluation**: 500+ samples
### Topic Evaluation
Minimum 5 articles per topic for meaningful results. Topics with fewer articles are skipped.
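A sketch of that minimum-size filter, assuming each article record carries a `topic` field (the actual field name in the evaluation data may differ):

```python
from collections import defaultdict

MIN_ARTICLES_PER_TOPIC = 5

def group_topics(articles, min_size=MIN_ARTICLES_PER_TOPIC):
    """Group articles by topic, dropping topics below the minimum size."""
    groups = defaultdict(list)
    for article in articles:
        groups[article["topic"]].append(article)
    return {t: arts for t, arts in groups.items() if len(arts) >= min_size}
```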
## Performance Expectations
### Processing Times (CPU)
- TextRank: ~0.1 seconds per summary
- BART: ~10-15 seconds per summary
- PEGASUS: ~8-12 seconds per summary
### Typical ROUGE Scores
- TextRank: ROUGE-1 ~0.35, ROUGE-2 ~0.15
- BART: ROUGE-1 ~0.42, ROUGE-2 ~0.20
- PEGASUS: ROUGE-1 ~0.44, ROUGE-2 ~0.21
Results may vary based on dataset and configuration.