Spaces:
Running
Running
Add comprehensive CNN/DailyMail evaluation system - dataset loading, model evaluation, topic analysis, and comparison
Browse files- .ipynb_checkpoints/railway-checkpoint.json +10 -0
- EVALUATION_GUIDE.md +174 -0
- evaluation/__init__.py +1 -0
- evaluation/dataset_loader.py +140 -0
- evaluation/model_evaluator.py +226 -0
- evaluation/results_analyzer.py +202 -0
- evaluation/run_evaluation.py +123 -0
- notebooks/.ipynb_checkpoints/01_data_exploration-checkpoint.ipynb +0 -0
- notebooks/.ipynb_checkpoints/02_model_testing-checkpoint.ipynb +0 -0
- notebooks/.ipynb_checkpoints/03_evaluation_analysis-checkpoint.ipynb +478 -0
- notebooks/.ipynb_checkpoints/03_evaluation_analysis_cnn_dailymail-checkpoint.ipynb +0 -0
- notebooks/.ipynb_checkpoints/Smart-Summarizer-checkpoint.ipynb +6 -0
- notebooks/01_data_exploration.ipynb +1 -1
- notebooks/02_model_testing.ipynb +0 -0
- notebooks/03_evaluation_analysis.ipynb +1 -1
- notebooks/03_evaluation_analysis_cnn_dailymail.ipynb +0 -0
- notebooks/Smart-Summarizer.ipynb +0 -0
- results/cnn_dailymail_evaluation_export.json +109 -0
- results/cnn_dailymail_evaluation_results.csv +4 -0
- results/cnn_dailymail_report_summary.md +29 -0
- run_evaluation.py +130 -0
.ipynb_checkpoints/railway-checkpoint.json
ADDED
|
@@ -0,0 +1,10 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"build": {
|
| 3 |
+
"builder": "NIXPACKS"
|
| 4 |
+
},
|
| 5 |
+
"deploy": {
|
| 6 |
+
"startCommand": "cd webapp && gunicorn app:app --bind 0.0.0.0:$PORT --timeout 120 --workers 2",
|
| 7 |
+
"restartPolicyType": "ON_FAILURE",
|
| 8 |
+
"restartPolicyMaxRetries": 10
|
| 9 |
+
}
|
| 10 |
+
}
|
EVALUATION_GUIDE.md
ADDED
|
@@ -0,0 +1,174 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Model Evaluation Guide
|
| 2 |
+
|
| 3 |
+
This guide explains how to run comprehensive evaluation of the summarization models using the CNN/DailyMail dataset.
|
| 4 |
+
|
| 5 |
+
## Quick Start
|
| 6 |
+
|
| 7 |
+
### 1. Install Dependencies
|
| 8 |
+
```bash
|
| 9 |
+
pip install -r requirements.txt
|
| 10 |
+
```
|
| 11 |
+
|
| 12 |
+
### 2. Run Evaluation
|
| 13 |
+
```bash
|
| 14 |
+
python run_evaluation.py
|
| 15 |
+
```
|
| 16 |
+
|
| 17 |
+
This will:
|
| 18 |
+
- Download CNN/DailyMail dataset
|
| 19 |
+
- Evaluate all three models (TextRank, BART, PEGASUS)
|
| 20 |
+
- Generate comparison reports and visualizations
|
| 21 |
+
- Save results to `evaluation_results/` directory
|
| 22 |
+
|
| 23 |
+
## What Gets Evaluated
|
| 24 |
+
|
| 25 |
+
### Models
|
| 26 |
+
- **TextRank**: Extractive summarization using graph-based ranking
|
| 27 |
+
- **BART**: Abstractive summarization using transformer encoder-decoder
|
| 28 |
+
- **PEGASUS**: Abstractive summarization specialized for summarization tasks
|
| 29 |
+
|
| 30 |
+
### Metrics
|
| 31 |
+
- **ROUGE-1**: Overlap of unigrams between generated and reference summaries
|
| 32 |
+
- **ROUGE-2**: Overlap of bigrams between generated and reference summaries
|
| 33 |
+
- **ROUGE-L**: Longest common subsequence between generated and reference summaries
|
| 34 |
+
- **Processing Time**: Average time to generate each summary
|
| 35 |
+
|
| 36 |
+
### Topic Categories
|
| 37 |
+
Articles are automatically categorized into:
|
| 38 |
+
- Politics
|
| 39 |
+
- Business
|
| 40 |
+
- Technology
|
| 41 |
+
- Sports
|
| 42 |
+
- Health
|
| 43 |
+
- Entertainment
|
| 44 |
+
- Other
|
| 45 |
+
|
| 46 |
+
## Advanced Usage
|
| 47 |
+
|
| 48 |
+
### Custom Evaluation
|
| 49 |
+
```bash
|
| 50 |
+
# Evaluate specific number of samples
|
| 51 |
+
python evaluation/run_evaluation.py --samples 200
|
| 52 |
+
|
| 53 |
+
# Evaluate by topic categories
|
| 54 |
+
python evaluation/run_evaluation.py --by-topic --samples 100
|
| 55 |
+
|
| 56 |
+
# Evaluate specific models only
|
| 57 |
+
python evaluation/run_evaluation.py --models textrank bart --samples 50
|
| 58 |
+
```
|
| 59 |
+
|
| 60 |
+
### Using Individual Components
|
| 61 |
+
|
| 62 |
+
#### Load Dataset
|
| 63 |
+
```python
|
| 64 |
+
from evaluation.dataset_loader import CNNDailyMailLoader
|
| 65 |
+
|
| 66 |
+
loader = CNNDailyMailLoader()
|
| 67 |
+
dataset = loader.load_dataset()
|
| 68 |
+
eval_data = loader.create_evaluation_subset(size=100)
|
| 69 |
+
```
|
| 70 |
+
|
| 71 |
+
#### Evaluate Single Model
|
| 72 |
+
```python
|
| 73 |
+
from evaluation.model_evaluator import ModelEvaluator
|
| 74 |
+
|
| 75 |
+
evaluator = ModelEvaluator()
|
| 76 |
+
evaluator.initialize_models()
|
| 77 |
+
results = evaluator.evaluate_single_model('bart', eval_data, max_samples=50)
|
| 78 |
+
```
|
| 79 |
+
|
| 80 |
+
#### Analyze Results
|
| 81 |
+
```python
|
| 82 |
+
from evaluation.results_analyzer import ResultsAnalyzer
|
| 83 |
+
|
| 84 |
+
analyzer = ResultsAnalyzer()
|
| 85 |
+
analyzer.create_performance_charts(results, 'output_dir')
|
| 86 |
+
analyzer.create_detailed_report(results, 'output_dir')
|
| 87 |
+
```
|
| 88 |
+
|
| 89 |
+
## Output Files
|
| 90 |
+
|
| 91 |
+
After running evaluation, you'll find these files in `evaluation_results/`:
|
| 92 |
+
|
| 93 |
+
### Data Files
|
| 94 |
+
- `eval_data.json` - Evaluation dataset
|
| 95 |
+
- `data_[topic].json` - Topic-specific datasets
|
| 96 |
+
|
| 97 |
+
### Results Files
|
| 98 |
+
- `results_overall.json` - Detailed evaluation results
|
| 99 |
+
- `comparison_overall.csv` - Summary comparison table
|
| 100 |
+
- `results_[topic].json` - Topic-specific results
|
| 101 |
+
|
| 102 |
+
### Visualizations
|
| 103 |
+
- `performance_comparison.png` - Model performance charts
|
| 104 |
+
- `topic_performance_heatmap.png` - Topic analysis heatmap
|
| 105 |
+
|
| 106 |
+
### Reports
|
| 107 |
+
- `evaluation_report.md` - Detailed evaluation report
|
| 108 |
+
- `topic_summary.csv` - Topic performance breakdown
|
| 109 |
+
|
| 110 |
+
## Understanding Results
|
| 111 |
+
|
| 112 |
+
### ROUGE Scores
|
| 113 |
+
- **Higher is better** (range: 0.0 to 1.0)
|
| 114 |
+
- ROUGE-1: Measures content overlap
|
| 115 |
+
- ROUGE-2: Measures fluency and coherence
|
| 116 |
+
- ROUGE-L: Measures structural similarity
|
| 117 |
+
|
| 118 |
+
### Processing Time
|
| 119 |
+
- **Lower is better**
|
| 120 |
+
- Measured in seconds per summary
|
| 121 |
+
- Important for real-time applications
|
| 122 |
+
|
| 123 |
+
### Model Characteristics
|
| 124 |
+
- **TextRank**: Fast, extractive, good for quick summaries
|
| 125 |
+
- **BART**: Balanced performance, good fluency
|
| 126 |
+
- **PEGASUS**: Best quality, slower processing
|
| 127 |
+
|
| 128 |
+
## Troubleshooting
|
| 129 |
+
|
| 130 |
+
### Memory Issues
|
| 131 |
+
If you encounter memory issues:
|
| 132 |
+
```bash
|
| 133 |
+
# Reduce sample size
|
| 134 |
+
python run_evaluation.py --samples 20
|
| 135 |
+
|
| 136 |
+
# Evaluate models individually
|
| 137 |
+
python evaluation/run_evaluation.py --models textrank --samples 50
|
| 138 |
+
```
|
| 139 |
+
|
| 140 |
+
### Dataset Download Issues
|
| 141 |
+
The CNN/DailyMail dataset is large (~1.3GB). Ensure you have:
|
| 142 |
+
- Stable internet connection
|
| 143 |
+
- Sufficient disk space
|
| 144 |
+
- Proper HuggingFace datasets cache directory
|
| 145 |
+
|
| 146 |
+
### Model Loading Issues
|
| 147 |
+
If models fail to load:
|
| 148 |
+
- Check PyTorch installation
|
| 149 |
+
- Verify transformers library version
|
| 150 |
+
- Ensure sufficient RAM (8GB+ recommended)
|
| 151 |
+
|
| 152 |
+
## Configuration
|
| 153 |
+
|
| 154 |
+
### Sample Sizes
|
| 155 |
+
- **Development**: 20-50 samples
|
| 156 |
+
- **Testing**: 100-200 samples
|
| 157 |
+
- **Full evaluation**: 500+ samples
|
| 158 |
+
|
| 159 |
+
### Topic Evaluation
|
| 160 |
+
Minimum 5 articles per topic for meaningful results. Topics with fewer articles are skipped.
|
| 161 |
+
|
| 162 |
+
## Performance Expectations
|
| 163 |
+
|
| 164 |
+
### Processing Times (CPU)
|
| 165 |
+
- TextRank: ~0.1 seconds per summary
|
| 166 |
+
- BART: ~10-15 seconds per summary
|
| 167 |
+
- PEGASUS: ~8-12 seconds per summary
|
| 168 |
+
|
| 169 |
+
### Typical ROUGE Scores
|
| 170 |
+
- TextRank: ROUGE-1 ~0.35, ROUGE-2 ~0.15
|
| 171 |
+
- BART: ROUGE-1 ~0.42, ROUGE-2 ~0.20
|
| 172 |
+
- PEGASUS: ROUGE-1 ~0.44, ROUGE-2 ~0.21
|
| 173 |
+
|
| 174 |
+
Results may vary based on dataset and configuration.
|
evaluation/__init__.py
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
# Evaluation package for Smart Summarizer
|
evaluation/dataset_loader.py
ADDED
|
@@ -0,0 +1,140 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Dataset Loader for CNN/DailyMail Dataset
|
| 3 |
+
Handles loading, splitting, and preprocessing of evaluation data
|
| 4 |
+
"""
|
| 5 |
+
|
| 6 |
+
from datasets import load_dataset
|
| 7 |
+
import pandas as pd
|
| 8 |
+
import json
|
| 9 |
+
import os
|
| 10 |
+
from typing import Dict, List, Tuple
|
| 11 |
+
import logging
|
| 12 |
+
|
| 13 |
+
logger = logging.getLogger(__name__)
|
| 14 |
+
|
| 15 |
+
class CNNDailyMailLoader:
    """Load and manage the CNN/DailyMail dataset for summarization evaluation.

    Downloads the dataset from HuggingFace on demand, caches evaluation
    subsets as JSON files under ``cache_dir``, and provides keyword-based
    topic categorization of articles.
    """

    def __init__(self, cache_dir: str = "data/cache"):
        # Directory for cached evaluation subsets; created eagerly so
        # save_evaluation_data never fails on a missing directory.
        self.cache_dir = cache_dir
        self.dataset = None  # lazily populated by load_dataset()
        os.makedirs(cache_dir, exist_ok=True)

    def load_dataset(self, version: str = "3.0.0") -> Dict:
        """Load the CNN/DailyMail dataset from HuggingFace.

        Args:
            version: Dataset config version string (e.g. "3.0.0").

        Returns:
            The loaded dataset (with train/validation/test splits).

        Raises:
            Exception: Re-raises any download/load failure after logging it.
        """
        logger.info(f"Loading CNN/DailyMail dataset version {version}")

        try:
            self.dataset = load_dataset("abisee/cnn_dailymail", version)
            logger.info("Dataset loaded successfully")
            return self.dataset
        except Exception as e:
            logger.error(f"Failed to load dataset: {e}")
            raise

    def get_splits(self) -> Tuple[List[Dict], List[Dict], List[Dict]]:
        """Return the (train, validation, test) splits as lists of dicts."""
        # `is None` rather than truthiness: an already-loaded (even empty)
        # dataset must not trigger a second download.
        if self.dataset is None:
            self.load_dataset()

        train_data = list(self.dataset['train'])
        val_data = list(self.dataset['validation'])
        test_data = list(self.dataset['test'])

        logger.info(f"Train: {len(train_data)}, Val: {len(val_data)}, Test: {len(test_data)}")
        return train_data, val_data, test_data

    def create_evaluation_subset(self, split: str = "test", size: int = 100) -> List[Dict]:
        """Create a smaller, cleaned subset of one split for evaluation.

        Args:
            split: Split name ('train', 'validation', or 'test').
            size: Maximum number of items to include (takes the first
                *size* items in split order).

        Returns:
            List of dicts with 'id', 'article', 'highlights', 'url' keys.
        """
        if self.dataset is None:
            self.load_dataset()

        subset = list(self.dataset[split])[:size]

        # Normalize each record to a fixed schema; optional fields
        # default to empty strings.
        return [
            {
                'id': item.get('id', ''),
                'article': item['article'],
                'highlights': item['highlights'],
                'url': item.get('url', ''),
            }
            for item in subset
        ]

    def save_evaluation_data(self, data: List[Dict], filename: str):
        """Save evaluation data as pretty-printed JSON in the cache dir."""
        filepath = os.path.join(self.cache_dir, filename)
        with open(filepath, 'w', encoding='utf-8') as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
        logger.info(f"Saved {len(data)} items to {filepath}")

    def load_evaluation_data(self, filename: str) -> List[Dict]:
        """Load evaluation data from the cache dir; [] if the file is missing."""
        filepath = os.path.join(self.cache_dir, filename)
        if not os.path.exists(filepath):
            logger.warning(f"File not found: {filepath}")
            return []

        with open(filepath, 'r', encoding='utf-8') as f:
            data = json.load(f)
        logger.info(f"Loaded {len(data)} items from {filepath}")
        return data

    def get_topic_categories(self) -> Dict[str, List[str]]:
        """Return the keyword lists that define each topic category."""
        # NOTE(review): the 'AI' keyword can never match because articles
        # are lowercased before matching in categorize_by_topic; lowering
        # it to 'ai' would over-match ("said", "against") — needs a
        # word-boundary match to fix properly.
        return {
            'politics': ['election', 'government', 'president', 'congress', 'senate', 'political'],
            'business': ['company', 'market', 'stock', 'economy', 'financial', 'business'],
            'technology': ['tech', 'computer', 'software', 'internet', 'digital', 'AI'],
            'sports': ['game', 'team', 'player', 'sport', 'match', 'championship'],
            'health': ['medical', 'health', 'doctor', 'hospital', 'disease', 'treatment'],
            'entertainment': ['movie', 'actor', 'celebrity', 'film', 'music', 'entertainment'],
        }

    def categorize_by_topic(self, data: List[Dict]) -> Dict[str, List[Dict]]:
        """Assign each article to the first topic whose keywords match.

        Matching is a case-insensitive substring test against the full
        article text; articles matching no topic fall into 'other'.
        Dict insertion order of get_topic_categories() defines priority.
        """
        categories = self.get_topic_categories()
        categorized = {topic: [] for topic in categories}
        categorized['other'] = []

        for item in data:
            article_text = item['article'].lower()

            # First matching topic wins; fall through to 'other'.
            for topic, keywords in categories.items():
                if any(keyword in article_text for keyword in keywords):
                    categorized[topic].append(item)
                    break
            else:
                categorized['other'].append(item)

        # Log the resulting distribution for sanity-checking.
        for topic, items in categorized.items():
            logger.info(f"{topic}: {len(items)} articles")

        return categorized
if __name__ == "__main__":
    # Demo: build a 200-article evaluation subset, group it by topic,
    # and cache everything as JSON under the loader's cache directory.
    loader = CNNDailyMailLoader()
    loader.load_dataset()

    eval_data = loader.create_evaluation_subset(size=200)
    categorized = loader.categorize_by_topic(eval_data)

    loader.save_evaluation_data(eval_data, "cnn_dailymail_eval_200.json")

    # Persist only the non-empty topic buckets.
    for topic, articles in categorized.items():
        if articles:
            loader.save_evaluation_data(articles, f"cnn_dailymail_{topic}.json")
evaluation/model_evaluator.py
ADDED
|
@@ -0,0 +1,226 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Model Evaluator for Summarization Models
|
| 3 |
+
Evaluates individual models and compares their performance
|
| 4 |
+
"""
|
| 5 |
+
|
| 6 |
+
import sys
|
| 7 |
+
from pathlib import Path
|
| 8 |
+
project_root = Path(__file__).parent.parent
|
| 9 |
+
if str(project_root) not in sys.path:
|
| 10 |
+
sys.path.insert(0, str(project_root))
|
| 11 |
+
|
| 12 |
+
import time
|
| 13 |
+
import json
|
| 14 |
+
import pandas as pd
|
| 15 |
+
from typing import Dict, List, Any
|
| 16 |
+
import logging
|
| 17 |
+
from rouge_score import rouge_scorer
|
| 18 |
+
from models.textrank import TextRankSummarizer
|
| 19 |
+
from models.bart import BARTSummarizer
|
| 20 |
+
from models.pegasus import PEGASUSSummarizer
|
| 21 |
+
|
| 22 |
+
logger = logging.getLogger(__name__)
|
| 23 |
+
|
| 24 |
+
class ModelEvaluator:
    """Evaluate summarization models on the CNN/DailyMail dataset.

    Models are initialized independently so that one model failing to
    load (e.g. missing weights) does not prevent evaluating the others.
    ROUGE F-measures and wall-clock processing time are collected per
    sample and aggregated per model.
    """

    def __init__(self):
        self.models = {}  # model name -> summarizer instance
        self.rouge_scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
        self.results = {}

    @staticmethod
    def _mean(values: List[float]) -> float:
        """Arithmetic mean of *values*, or 0 for an empty list."""
        return sum(values) / len(values) if values else 0

    def initialize_models(self):
        """Initialize all summarization models.

        Each model is wrapped in its own try/except so a single failure
        is logged but does not block the remaining models.
        """
        logger.info("Initializing models...")

        try:
            self.models['textrank'] = TextRankSummarizer()
            logger.info("TextRank model initialized")
        except Exception as e:
            logger.error(f"Failed to initialize TextRank: {e}")

        try:
            self.models['bart'] = BARTSummarizer(device='cpu')
            logger.info("BART model initialized")
        except Exception as e:
            logger.error(f"Failed to initialize BART: {e}")

        try:
            self.models['pegasus'] = PEGASUSSummarizer(device='cpu')
            logger.info("PEGASUS model initialized")
        except Exception as e:
            logger.error(f"Failed to initialize PEGASUS: {e}")

    def evaluate_single_model(self, model_name: str, data: List[Dict], max_samples: int = None) -> Dict:
        """Evaluate a single model on the dataset.

        Args:
            model_name: Key into ``self.models`` ('textrank', 'bart', 'pegasus').
            data: List of dicts with 'article' and 'highlights' keys.
            max_samples: Optional cap on the number of samples evaluated.

        Returns:
            Dict with per-sample ROUGE scores/timings/summaries and the
            aggregate 'avg_rouge1/2/L' and 'avg_processing_time' fields.

        Raises:
            ValueError: If *model_name* has not been initialized.
        """
        if model_name not in self.models:
            raise ValueError(f"Model {model_name} not initialized")

        model = self.models[model_name]
        results = {
            'model': model_name,
            'total_samples': len(data),
            'processed_samples': 0,
            'rouge_scores': {'rouge1': [], 'rouge2': [], 'rougeL': []},
            'processing_times': [],
            'summaries': [],
            'errors': 0
        }

        if max_samples:
            data = data[:max_samples]

        logger.info(f"Evaluating {model_name} on {len(data)} samples")

        for i, item in enumerate(data):
            try:
                start_time = time.time()

                if model_name == 'textrank':
                    # Extractive: keep ~30% of the article's sentences,
                    # with sentence count approximated by terminal
                    # punctuation marks.
                    sentences = item['article'].count('.') + item['article'].count('!') + item['article'].count('?')
                    num_sentences = max(2, int(sentences * 0.3))
                    summary = model.summarize(item['article'], num_sentences=num_sentences)
                else:
                    # Abstractive (BART/PEGASUS): target ~22% of the
                    # input word count, clamped to [30, 150].
                    input_words = len(item['article'].split())
                    target_length = max(30, min(150, int(input_words * 0.22)))
                    summary = model.summarize(
                        item['article'],
                        max_length=target_length,
                        min_length=max(20, int(target_length * 0.5))
                    )

                processing_time = time.time() - start_time

                # Score the generated summary against the human-written
                # highlights (reference first, candidate second).
                rouge_scores = self.rouge_scorer.score(item['highlights'], summary)

                results['rouge_scores']['rouge1'].append(rouge_scores['rouge1'].fmeasure)
                results['rouge_scores']['rouge2'].append(rouge_scores['rouge2'].fmeasure)
                results['rouge_scores']['rougeL'].append(rouge_scores['rougeL'].fmeasure)
                results['processing_times'].append(processing_time)
                results['summaries'].append({
                    'id': item.get('id', i),
                    'original': item['article'][:200] + '...',
                    'reference': item['highlights'],
                    'generated': summary,
                    'rouge1': rouge_scores['rouge1'].fmeasure,
                    'rouge2': rouge_scores['rouge2'].fmeasure,
                    'rougeL': rouge_scores['rougeL'].fmeasure,
                    'processing_time': processing_time
                })

                results['processed_samples'] += 1

                if (i + 1) % 10 == 0:
                    logger.info(f"{model_name}: Processed {i + 1}/{len(data)} samples")

            except Exception as e:
                # Per-sample failures are tallied, not fatal, so one bad
                # article cannot sink a long evaluation run.
                logger.error(f"Error processing sample {i} with {model_name}: {e}")
                results['errors'] += 1

        # Aggregate averages (0 when nothing was processed).
        results['avg_rouge1'] = self._mean(results['rouge_scores']['rouge1'])
        results['avg_rouge2'] = self._mean(results['rouge_scores']['rouge2'])
        results['avg_rougeL'] = self._mean(results['rouge_scores']['rougeL'])
        results['avg_processing_time'] = self._mean(results['processing_times'])

        logger.info(f"{model_name} evaluation complete:")
        logger.info(f"  ROUGE-1: {results['avg_rouge1']:.4f}")
        logger.info(f"  ROUGE-2: {results['avg_rouge2']:.4f}")
        logger.info(f"  ROUGE-L: {results['avg_rougeL']:.4f}")
        logger.info(f"  Avg Time: {results['avg_processing_time']:.4f}s")

        return results

    def evaluate_all_models(self, data: List[Dict], max_samples: int = None) -> Dict:
        """Evaluate every initialized model on the same dataset."""
        if not self.models:
            self.initialize_models()

        all_results = {}
        for model_name in self.models.keys():
            logger.info(f"Starting evaluation for {model_name}")
            all_results[model_name] = self.evaluate_single_model(model_name, data, max_samples)

        return all_results

    def compare_models(self, results: Dict) -> pd.DataFrame:
        """Create a comparison table of model performance.

        ROUGE/time columns are formatted as 4-decimal strings for display.
        """
        comparison_data = [
            {
                'Model': model_name.upper(),
                'ROUGE-1': f"{result['avg_rouge1']:.4f}",
                'ROUGE-2': f"{result['avg_rouge2']:.4f}",
                'ROUGE-L': f"{result['avg_rougeL']:.4f}",
                'Avg Time (s)': f"{result['avg_processing_time']:.4f}",
                'Samples': result['processed_samples'],
                'Errors': result['errors']
            }
            for model_name, result in results.items()
        ]

        return pd.DataFrame(comparison_data)

    def save_results(self, results: Dict, filename: str):
        """Save evaluation results to a JSON file.

        Aggregate metrics are cast to float for JSON serialization, and
        only the first 10 per-sample summaries are kept per model to
        bound the file size.
        """
        serializable_results = {}
        for model_name, result in results.items():
            serializable_results[model_name] = {
                'model': result['model'],
                'total_samples': result['total_samples'],
                'processed_samples': result['processed_samples'],
                'errors': result['errors'],
                'avg_rouge1': float(result['avg_rouge1']),
                'avg_rouge2': float(result['avg_rouge2']),
                'avg_rougeL': float(result['avg_rougeL']),
                'avg_processing_time': float(result['avg_processing_time']),
                'summaries': result['summaries'][:10]  # cap for file size
            }

        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(serializable_results, f, indent=2, ensure_ascii=False)

        # Bug fix: previously logged the literal "(unknown)" instead of
        # the destination path.
        logger.info(f"Results saved to {filename}")

    def evaluate_by_topic(self, categorized_data: Dict[str, List[Dict]], max_samples_per_topic: int = 20) -> Dict:
        """Evaluate all models on each non-empty topic category."""
        topic_results = {}

        for topic, data in categorized_data.items():
            if not data:
                continue  # skip topics with no articles

            logger.info(f"Evaluating topic: {topic} ({len(data)} samples)")
            topic_results[topic] = self.evaluate_all_models(data, max_samples_per_topic)

        return topic_results
| 204 |
+
|
| 205 |
+
if __name__ == "__main__":
    from evaluation.dataset_loader import CNNDailyMailLoader

    # Demo: score all available models on a small CNN/DailyMail subset,
    # print a comparison table, and persist the full results.
    loader = CNNDailyMailLoader()
    eval_data = loader.create_evaluation_subset(size=50)

    evaluator = ModelEvaluator()
    evaluator.initialize_models()
    results = evaluator.evaluate_all_models(eval_data, max_samples=20)

    print("\nModel Comparison:")
    print(evaluator.compare_models(results).to_string(index=False))

    evaluator.save_results(results, "evaluation_results.json")
|
evaluation/results_analyzer.py
ADDED
|
@@ -0,0 +1,202 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Results Analyzer
|
| 3 |
+
Analyzes and visualizes evaluation results
|
| 4 |
+
"""
|
| 5 |
+
|
| 6 |
+
import json
|
| 7 |
+
import pandas as pd
|
| 8 |
+
import matplotlib.pyplot as plt
|
| 9 |
+
import seaborn as sns
|
| 10 |
+
from typing import Dict, List
|
| 11 |
+
import os
|
| 12 |
+
import logging
|
| 13 |
+
|
| 14 |
+
logger = logging.getLogger(__name__)
|
| 15 |
+
|
| 16 |
+
class ResultsAnalyzer:
|
| 17 |
+
"""Analyze and visualize evaluation results"""
|
| 18 |
+
|
| 19 |
+
    def __init__(self):
        # Reset to matplotlib's default style as a clean baseline, then
        # apply seaborn's "husl" palette so all charts produced by this
        # analyzer share one color scheme.
        plt.style.use('default')
        sns.set_palette("husl")
| 22 |
+
|
| 23 |
+
def load_results(self, filepath: str) -> Dict:
|
| 24 |
+
"""Load results from JSON file"""
|
| 25 |
+
with open(filepath, 'r', encoding='utf-8') as f:
|
| 26 |
+
return json.load(f)
|
| 27 |
+
|
| 28 |
+
def create_performance_charts(self, results: Dict, output_dir: str):
|
| 29 |
+
"""Create performance comparison charts"""
|
| 30 |
+
# Prepare data for plotting
|
| 31 |
+
models = list(results.keys())
|
| 32 |
+
rouge1_scores = [results[model]['avg_rouge1'] for model in models]
|
| 33 |
+
rouge2_scores = [results[model]['avg_rouge2'] for model in models]
|
| 34 |
+
rougeL_scores = [results[model]['avg_rougeL'] for model in models]
|
| 35 |
+
processing_times = [results[model]['avg_processing_time'] for model in models]
|
| 36 |
+
|
| 37 |
+
# Create subplots
|
| 38 |
+
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))
|
| 39 |
+
|
| 40 |
+
# ROUGE scores comparison
|
| 41 |
+
x_pos = range(len(models))
|
| 42 |
+
width = 0.25
|
| 43 |
+
|
| 44 |
+
ax1.bar([x - width for x in x_pos], rouge1_scores, width, label='ROUGE-1', alpha=0.8)
|
| 45 |
+
ax1.bar(x_pos, rouge2_scores, width, label='ROUGE-2', alpha=0.8)
|
| 46 |
+
ax1.bar([x + width for x in x_pos], rougeL_scores, width, label='ROUGE-L', alpha=0.8)
|
| 47 |
+
ax1.set_xlabel('Models')
|
| 48 |
+
ax1.set_ylabel('ROUGE Score')
|
| 49 |
+
ax1.set_title('ROUGE Scores Comparison')
|
| 50 |
+
ax1.set_xticks(x_pos)
|
| 51 |
+
ax1.set_xticklabels([m.upper() for m in models])
|
| 52 |
+
ax1.legend()
|
| 53 |
+
ax1.grid(True, alpha=0.3)
|
| 54 |
+
|
| 55 |
+
# Processing time comparison
|
| 56 |
+
ax2.bar(models, processing_times, alpha=0.8, color='orange')
|
| 57 |
+
ax2.set_xlabel('Models')
|
| 58 |
+
ax2.set_ylabel('Processing Time (seconds)')
|
| 59 |
+
ax2.set_title('Average Processing Time')
|
| 60 |
+
ax2.set_xticklabels([m.upper() for m in models])
|
| 61 |
+
ax2.grid(True, alpha=0.3)
|
| 62 |
+
|
| 63 |
+
# ROUGE-1 vs ROUGE-2 scatter
|
| 64 |
+
ax3.scatter(rouge1_scores, rouge2_scores, s=100, alpha=0.7)
|
| 65 |
+
for i, model in enumerate(models):
|
| 66 |
+
ax3.annotate(model.upper(), (rouge1_scores[i], rouge2_scores[i]),
|
| 67 |
+
xytext=(5, 5), textcoords='offset points')
|
| 68 |
+
ax3.set_xlabel('ROUGE-1')
|
| 69 |
+
ax3.set_ylabel('ROUGE-2')
|
| 70 |
+
ax3.set_title('ROUGE-1 vs ROUGE-2')
|
| 71 |
+
ax3.grid(True, alpha=0.3)
|
| 72 |
+
|
| 73 |
+
# Performance vs Speed
|
| 74 |
+
ax4.scatter(processing_times, rouge1_scores, s=100, alpha=0.7, color='green')
|
| 75 |
+
for i, model in enumerate(models):
|
| 76 |
+
ax4.annotate(model.upper(), (processing_times[i], rouge1_scores[i]),
|
| 77 |
+
xytext=(5, 5), textcoords='offset points')
|
| 78 |
+
ax4.set_xlabel('Processing Time (seconds)')
|
| 79 |
+
ax4.set_ylabel('ROUGE-1 Score')
|
| 80 |
+
ax4.set_title('Performance vs Speed Trade-off')
|
| 81 |
+
ax4.grid(True, alpha=0.3)
|
| 82 |
+
|
| 83 |
+
plt.tight_layout()
|
| 84 |
+
plt.savefig(f"{output_dir}/performance_comparison.png", dpi=300, bbox_inches='tight')
|
| 85 |
+
plt.close()
|
| 86 |
+
|
| 87 |
+
logger.info(f"Performance charts saved to {output_dir}/performance_comparison.png")
|
| 88 |
+
|
| 89 |
+
def analyze_topic_performance(self, topic_results: Dict, output_dir: str):
|
| 90 |
+
"""Analyze performance across different topics"""
|
| 91 |
+
# Prepare data
|
| 92 |
+
topics = list(topic_results.keys())
|
| 93 |
+
models = list(topic_results[topics[0]].keys()) if topics else []
|
| 94 |
+
|
| 95 |
+
# Create topic performance matrix
|
| 96 |
+
rouge1_matrix = []
|
| 97 |
+
rouge2_matrix = []
|
| 98 |
+
rougeL_matrix = []
|
| 99 |
+
|
| 100 |
+
for topic in topics:
|
| 101 |
+
rouge1_row = [topic_results[topic][model]['avg_rouge1'] for model in models]
|
| 102 |
+
rouge2_row = [topic_results[topic][model]['avg_rouge2'] for model in models]
|
| 103 |
+
rougeL_row = [topic_results[topic][model]['avg_rougeL'] for model in models]
|
| 104 |
+
|
| 105 |
+
rouge1_matrix.append(rouge1_row)
|
| 106 |
+
rouge2_matrix.append(rouge2_row)
|
| 107 |
+
rougeL_matrix.append(rougeL_row)
|
| 108 |
+
|
| 109 |
+
# Create heatmaps
|
| 110 |
+
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(18, 6))
|
| 111 |
+
|
| 112 |
+
# ROUGE-1 heatmap
|
| 113 |
+
sns.heatmap(rouge1_matrix, annot=True, fmt='.3f',
|
| 114 |
+
xticklabels=[m.upper() for m in models],
|
| 115 |
+
yticklabels=[t.upper() for t in topics],
|
| 116 |
+
ax=ax1, cmap='YlOrRd')
|
| 117 |
+
ax1.set_title('ROUGE-1 Scores by Topic')
|
| 118 |
+
|
| 119 |
+
# ROUGE-2 heatmap
|
| 120 |
+
sns.heatmap(rouge2_matrix, annot=True, fmt='.3f',
|
| 121 |
+
xticklabels=[m.upper() for m in models],
|
| 122 |
+
yticklabels=[t.upper() for t in topics],
|
| 123 |
+
ax=ax2, cmap='YlOrRd')
|
| 124 |
+
ax2.set_title('ROUGE-2 Scores by Topic')
|
| 125 |
+
|
| 126 |
+
# ROUGE-L heatmap
|
| 127 |
+
sns.heatmap(rougeL_matrix, annot=True, fmt='.3f',
|
| 128 |
+
xticklabels=[m.upper() for m in models],
|
| 129 |
+
yticklabels=[t.upper() for t in topics],
|
| 130 |
+
ax=ax3, cmap='YlOrRd')
|
| 131 |
+
ax3.set_title('ROUGE-L Scores by Topic')
|
| 132 |
+
|
| 133 |
+
plt.tight_layout()
|
| 134 |
+
plt.savefig(f"{output_dir}/topic_performance_heatmap.png", dpi=300, bbox_inches='tight')
|
| 135 |
+
plt.close()
|
| 136 |
+
|
| 137 |
+
# Create topic summary table
|
| 138 |
+
topic_summary = []
|
| 139 |
+
for topic in topics:
|
| 140 |
+
for model in models:
|
| 141 |
+
topic_summary.append({
|
| 142 |
+
'Topic': topic.upper(),
|
| 143 |
+
'Model': model.upper(),
|
| 144 |
+
'ROUGE-1': f"{topic_results[topic][model]['avg_rouge1']:.4f}",
|
| 145 |
+
'ROUGE-2': f"{topic_results[topic][model]['avg_rouge2']:.4f}",
|
| 146 |
+
'ROUGE-L': f"{topic_results[topic][model]['avg_rougeL']:.4f}",
|
| 147 |
+
'Samples': topic_results[topic][model]['processed_samples']
|
| 148 |
+
})
|
| 149 |
+
|
| 150 |
+
df = pd.DataFrame(topic_summary)
|
| 151 |
+
df.to_csv(f"{output_dir}/topic_summary.csv", index=False)
|
| 152 |
+
|
| 153 |
+
logger.info(f"Topic analysis saved to {output_dir}")
|
| 154 |
+
logger.info("\nTopic Performance Summary:")
|
| 155 |
+
logger.info(df.to_string(index=False))
|
| 156 |
+
|
| 157 |
+
def create_detailed_report(self, results: Dict, output_dir: str):
|
| 158 |
+
"""Create detailed evaluation report"""
|
| 159 |
+
report_lines = []
|
| 160 |
+
report_lines.append("# Summarization Model Evaluation Report")
|
| 161 |
+
report_lines.append("")
|
| 162 |
+
|
| 163 |
+
# Overall statistics
|
| 164 |
+
report_lines.append("## Overall Performance")
|
| 165 |
+
report_lines.append("")
|
| 166 |
+
|
| 167 |
+
for model_name, result in results.items():
|
| 168 |
+
report_lines.append(f"### {model_name.upper()}")
|
| 169 |
+
report_lines.append(f"- Samples Processed: {result['processed_samples']}")
|
| 170 |
+
report_lines.append(f"- ROUGE-1: {result['avg_rouge1']:.4f}")
|
| 171 |
+
report_lines.append(f"- ROUGE-2: {result['avg_rouge2']:.4f}")
|
| 172 |
+
report_lines.append(f"- ROUGE-L: {result['avg_rougeL']:.4f}")
|
| 173 |
+
report_lines.append(f"- Average Processing Time: {result['avg_processing_time']:.4f}s")
|
| 174 |
+
report_lines.append(f"- Errors: {result['errors']}")
|
| 175 |
+
report_lines.append("")
|
| 176 |
+
|
| 177 |
+
# Best performing model
|
| 178 |
+
best_rouge1 = max(results.items(), key=lambda x: x[1]['avg_rouge1'])
|
| 179 |
+
best_rouge2 = max(results.items(), key=lambda x: x[1]['avg_rouge2'])
|
| 180 |
+
fastest = min(results.items(), key=lambda x: x[1]['avg_processing_time'])
|
| 181 |
+
|
| 182 |
+
report_lines.append("## Summary")
|
| 183 |
+
report_lines.append(f"- Best ROUGE-1: {best_rouge1[0].upper()} ({best_rouge1[1]['avg_rouge1']:.4f})")
|
| 184 |
+
report_lines.append(f"- Best ROUGE-2: {best_rouge2[0].upper()} ({best_rouge2[1]['avg_rouge2']:.4f})")
|
| 185 |
+
report_lines.append(f"- Fastest: {fastest[0].upper()} ({fastest[1]['avg_processing_time']:.4f}s)")
|
| 186 |
+
report_lines.append("")
|
| 187 |
+
|
| 188 |
+
# Save report
|
| 189 |
+
with open(f"{output_dir}/evaluation_report.md", 'w', encoding='utf-8') as f:
|
| 190 |
+
f.write('\n'.join(report_lines))
|
| 191 |
+
|
| 192 |
+
logger.info(f"Detailed report saved to {output_dir}/evaluation_report.md")
|
| 193 |
+
|
| 194 |
+
if __name__ == "__main__":
|
| 195 |
+
# Example usage
|
| 196 |
+
analyzer = ResultsAnalyzer()
|
| 197 |
+
|
| 198 |
+
# Load and analyze results
|
| 199 |
+
if os.path.exists("evaluation_results.json"):
|
| 200 |
+
results = analyzer.load_results("evaluation_results.json")
|
| 201 |
+
analyzer.create_performance_charts(results, ".")
|
| 202 |
+
analyzer.create_detailed_report(results, ".")
|
evaluation/run_evaluation.py
ADDED
|
@@ -0,0 +1,123 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Main Evaluation Script
|
| 3 |
+
Runs comprehensive evaluation of all models on CNN/DailyMail dataset
|
| 4 |
+
"""
|
| 5 |
+
|
| 6 |
+
import os
|
| 7 |
+
import sys
|
| 8 |
+
import logging
|
| 9 |
+
import argparse
|
| 10 |
+
from pathlib import Path
|
| 11 |
+
|
| 12 |
+
# Add project root to path
|
| 13 |
+
project_root = Path(__file__).parent.parent
|
| 14 |
+
sys.path.insert(0, str(project_root))
|
| 15 |
+
|
| 16 |
+
from evaluation.dataset_loader import CNNDailyMailLoader
|
| 17 |
+
from evaluation.model_evaluator import ModelEvaluator
|
| 18 |
+
from evaluation.results_analyzer import ResultsAnalyzer
|
| 19 |
+
|
| 20 |
+
# Setup logging
|
| 21 |
+
logging.basicConfig(
|
| 22 |
+
level=logging.INFO,
|
| 23 |
+
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
|
| 24 |
+
handlers=[
|
| 25 |
+
logging.FileHandler('evaluation.log'),
|
| 26 |
+
logging.StreamHandler()
|
| 27 |
+
]
|
| 28 |
+
)
|
| 29 |
+
logger = logging.getLogger(__name__)
|
| 30 |
+
|
| 31 |
+
def main():
|
| 32 |
+
parser = argparse.ArgumentParser(description='Evaluate summarization models')
|
| 33 |
+
parser.add_argument('--samples', type=int, default=100, help='Number of samples to evaluate')
|
| 34 |
+
parser.add_argument('--by-topic', action='store_true', help='Evaluate by topic categories')
|
| 35 |
+
parser.add_argument('--output-dir', type=str, default='evaluation/results', help='Output directory')
|
| 36 |
+
parser.add_argument('--models', nargs='+', default=['textrank', 'bart', 'pegasus'],
|
| 37 |
+
help='Models to evaluate')
|
| 38 |
+
|
| 39 |
+
args = parser.parse_args()
|
| 40 |
+
|
| 41 |
+
# Create output directory
|
| 42 |
+
os.makedirs(args.output_dir, exist_ok=True)
|
| 43 |
+
|
| 44 |
+
logger.info("Starting comprehensive model evaluation")
|
| 45 |
+
logger.info(f"Samples: {args.samples}")
|
| 46 |
+
logger.info(f"Models: {args.models}")
|
| 47 |
+
logger.info(f"By topic: {args.by_topic}")
|
| 48 |
+
|
| 49 |
+
# Initialize components
|
| 50 |
+
loader = CNNDailyMailLoader()
|
| 51 |
+
evaluator = ModelEvaluator()
|
| 52 |
+
analyzer = ResultsAnalyzer()
|
| 53 |
+
|
| 54 |
+
try:
|
| 55 |
+
# Load dataset
|
| 56 |
+
logger.info("Loading CNN/DailyMail dataset...")
|
| 57 |
+
dataset = loader.load_dataset()
|
| 58 |
+
|
| 59 |
+
# Create evaluation subset
|
| 60 |
+
logger.info(f"Creating evaluation subset of {args.samples} samples...")
|
| 61 |
+
eval_data = loader.create_evaluation_subset(size=args.samples)
|
| 62 |
+
|
| 63 |
+
# Save evaluation data
|
| 64 |
+
loader.save_evaluation_data(eval_data, f"eval_data_{args.samples}.json")
|
| 65 |
+
|
| 66 |
+
# Initialize models
|
| 67 |
+
logger.info("Initializing models...")
|
| 68 |
+
evaluator.initialize_models()
|
| 69 |
+
|
| 70 |
+
if args.by_topic:
|
| 71 |
+
# Evaluate by topic
|
| 72 |
+
logger.info("Categorizing data by topics...")
|
| 73 |
+
categorized_data = loader.categorize_by_topic(eval_data)
|
| 74 |
+
|
| 75 |
+
# Save categorized data
|
| 76 |
+
for topic, data in categorized_data.items():
|
| 77 |
+
if data:
|
| 78 |
+
loader.save_evaluation_data(data, f"eval_data_{topic}.json")
|
| 79 |
+
|
| 80 |
+
# Run topic-based evaluation
|
| 81 |
+
logger.info("Running topic-based evaluation...")
|
| 82 |
+
topic_results = evaluator.evaluate_by_topic(categorized_data, max_samples_per_topic=20)
|
| 83 |
+
|
| 84 |
+
# Save topic results
|
| 85 |
+
for topic, results in topic_results.items():
|
| 86 |
+
evaluator.save_results(results, f"{args.output_dir}/results_{topic}.json")
|
| 87 |
+
|
| 88 |
+
# Create topic comparison
|
| 89 |
+
comparison_df = evaluator.compare_models(results)
|
| 90 |
+
comparison_df.to_csv(f"{args.output_dir}/comparison_{topic}.csv", index=False)
|
| 91 |
+
|
| 92 |
+
logger.info(f"\n{topic.upper()} Topic Results:")
|
| 93 |
+
logger.info(comparison_df.to_string(index=False))
|
| 94 |
+
|
| 95 |
+
# Analyze topic results
|
| 96 |
+
analyzer.analyze_topic_performance(topic_results, args.output_dir)
|
| 97 |
+
|
| 98 |
+
else:
|
| 99 |
+
# Standard evaluation
|
| 100 |
+
logger.info("Running standard evaluation...")
|
| 101 |
+
results = evaluator.evaluate_all_models(eval_data, max_samples=args.samples)
|
| 102 |
+
|
| 103 |
+
# Save results
|
| 104 |
+
evaluator.save_results(results, f"{args.output_dir}/results_overall.json")
|
| 105 |
+
|
| 106 |
+
# Create comparison
|
| 107 |
+
comparison_df = evaluator.compare_models(results)
|
| 108 |
+
comparison_df.to_csv(f"{args.output_dir}/comparison_overall.csv", index=False)
|
| 109 |
+
|
| 110 |
+
logger.info("\nOverall Results:")
|
| 111 |
+
logger.info(comparison_df.to_string(index=False))
|
| 112 |
+
|
| 113 |
+
# Analyze results
|
| 114 |
+
analyzer.create_performance_charts(results, args.output_dir)
|
| 115 |
+
|
| 116 |
+
logger.info(f"Evaluation complete. Results saved to {args.output_dir}")
|
| 117 |
+
|
| 118 |
+
except Exception as e:
|
| 119 |
+
logger.error(f"Evaluation failed: {e}")
|
| 120 |
+
raise
|
| 121 |
+
|
| 122 |
+
if __name__ == "__main__":
|
| 123 |
+
main()
|
notebooks/.ipynb_checkpoints/01_data_exploration-checkpoint.ipynb
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
notebooks/.ipynb_checkpoints/02_model_testing-checkpoint.ipynb
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
notebooks/.ipynb_checkpoints/03_evaluation_analysis-checkpoint.ipynb
ADDED
|
@@ -0,0 +1,478 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "code",
|
| 5 |
+
"execution_count": 3,
|
| 6 |
+
"id": "0c688166",
|
| 7 |
+
"metadata": {},
|
| 8 |
+
"outputs": [
|
| 9 |
+
{
|
| 10 |
+
"name": "stdout",
|
| 11 |
+
"output_type": "stream",
|
| 12 |
+
"text": [
|
| 13 |
+
"⚠ rouge library not found. Installing rouge-score...\n",
|
| 14 |
+
"✓ Successfully installed rouge-score\n",
|
| 15 |
+
"✗ Installation succeeded but import still fails.\n",
|
| 16 |
+
" Please restart the kernel and run this cell again.\n"
|
| 17 |
+
]
|
| 18 |
+
}
|
| 19 |
+
],
|
| 20 |
+
"source": [
|
| 21 |
+
"# FIX: Install and verify rouge-score package\n",
|
| 22 |
+
"# Run this cell FIRST if you get \"ModuleNotFoundError: No module named 'rouge'\"\n",
|
| 23 |
+
"\n",
|
| 24 |
+
"import sys\n",
|
| 25 |
+
"import subprocess\n",
|
| 26 |
+
"\n",
|
| 27 |
+
"def install_package(package_name):\n",
|
| 28 |
+
" \"\"\"Install package using pip\"\"\"\n",
|
| 29 |
+
" try:\n",
|
| 30 |
+
" subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", package_name, \"--quiet\"])\n",
|
| 31 |
+
" return True\n",
|
| 32 |
+
" except subprocess.CalledProcessError:\n",
|
| 33 |
+
" return False\n",
|
| 34 |
+
"\n",
|
| 35 |
+
"# Check if rouge is available\n",
|
| 36 |
+
"try:\n",
|
| 37 |
+
" from rouge import Rouge\n",
|
| 38 |
+
" print(\"✓ rouge library is already installed\")\n",
|
| 39 |
+
"except ImportError:\n",
|
| 40 |
+
" print(\"⚠ rouge library not found. Installing rouge-score...\")\n",
|
| 41 |
+
" if install_package(\"rouge-score\"):\n",
|
| 42 |
+
" print(\"✓ Successfully installed rouge-score\")\n",
|
| 43 |
+
" # Try importing again\n",
|
| 44 |
+
" try:\n",
|
| 45 |
+
" from rouge import Rouge\n",
|
| 46 |
+
" print(\"✓ rouge library now available\")\n",
|
| 47 |
+
" except ImportError:\n",
|
| 48 |
+
" print(\"✗ Installation succeeded but import still fails.\")\n",
|
| 49 |
+
" print(\" Please restart the kernel and run this cell again.\")\n",
|
| 50 |
+
" else:\n",
|
| 51 |
+
" print(\"✗ Failed to install rouge-score\")\n",
|
| 52 |
+
" print(\" Please run manually: pip install rouge-score\")\n",
|
| 53 |
+
" print(\" Then restart the kernel.\")\n"
|
| 54 |
+
]
|
| 55 |
+
},
|
| 56 |
+
{
|
| 57 |
+
"cell_type": "code",
|
| 58 |
+
"execution_count": 1,
|
| 59 |
+
"id": "1aa43993",
|
| 60 |
+
"metadata": {},
|
| 61 |
+
"outputs": [
|
| 62 |
+
{
|
| 63 |
+
"name": "stdout",
|
| 64 |
+
"output_type": "stream",
|
| 65 |
+
"text": [
|
| 66 |
+
"✗ Import error: No module named 'rouge'\n",
|
| 67 |
+
" Make sure you've run the previous cell to install dependencies\n"
|
| 68 |
+
]
|
| 69 |
+
},
|
| 70 |
+
{
|
| 71 |
+
"ename": "ModuleNotFoundError",
|
| 72 |
+
"evalue": "No module named 'rouge'",
|
| 73 |
+
"output_type": "error",
|
| 74 |
+
"traceback": [
|
| 75 |
+
"\u001b[31m---------------------------------------------------------------------------\u001b[39m",
|
| 76 |
+
"\u001b[31mModuleNotFoundError\u001b[39m Traceback (most recent call last)",
|
| 77 |
+
"\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[1]\u001b[39m\u001b[32m, line 10\u001b[39m\n\u001b[32m 8\u001b[39m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01mmodels\u001b[39;00m\u001b[34;01m.\u001b[39;00m\u001b[34;01mbart\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m BARTSummarizer\n\u001b[32m 9\u001b[39m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01mmodels\u001b[39;00m\u001b[34;01m.\u001b[39;00m\u001b[34;01mpegasus\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m PEGASUSSummarizer\n\u001b[32m---> \u001b[39m\u001b[32m10\u001b[39m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01mutils\u001b[39;00m\u001b[34;01m.\u001b[39;00m\u001b[34;01mevaluator\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m SummarizerEvaluator\n\u001b[32m 11\u001b[39m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01mutils\u001b[39;00m\u001b[34;01m.\u001b[39;00m\u001b[34;01mdata_loader\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m DataLoader\n\u001b[32m 12\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[33m\"\u001b[39m\u001b[33m✓ All imports successful\u001b[39m\u001b[33m\"\u001b[39m)\n",
|
| 78 |
+
"\u001b[36mFile \u001b[39m\u001b[32m~/Downloads/smart-summarizer/notebooks/../utils/evaluator.py:6\u001b[39m\n\u001b[32m 1\u001b[39m \u001b[33;03m\"\"\"\u001b[39;00m\n\u001b[32m 2\u001b[39m \u001b[33;03mComprehensive Evaluation System for Summarization Models\u001b[39;00m\n\u001b[32m 3\u001b[39m \u001b[33;03mImplements ROUGE metrics, comparison analysis, and statistical testing\u001b[39;00m\n\u001b[32m 4\u001b[39m \u001b[33;03m\"\"\"\u001b[39;00m\n\u001b[32m----> \u001b[39m\u001b[32m6\u001b[39m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01mrouge\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m Rouge\n\u001b[32m 7\u001b[39m \u001b[38;5;28;01mimport\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01mnumpy\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mas\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01mnp\u001b[39;00m\n\u001b[32m 8\u001b[39m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01mtyping\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m Dict, List, Tuple, Optional\n",
|
| 79 |
+
"\u001b[31mModuleNotFoundError\u001b[39m: No module named 'rouge'"
|
| 80 |
+
]
|
| 81 |
+
}
|
| 82 |
+
],
|
| 83 |
+
"source": [
|
| 84 |
+
"# Add project root to path\n",
|
| 85 |
+
"import sys\n",
|
| 86 |
+
"sys.path.append('..')\n",
|
| 87 |
+
"\n",
|
| 88 |
+
"# Import models and utilities\n",
|
| 89 |
+
"try:\n",
|
| 90 |
+
" from models.textrank import TextRankSummarizer\n",
|
| 91 |
+
" from models.bart import BARTSummarizer\n",
|
| 92 |
+
" from models.pegasus import PEGASUSSummarizer\n",
|
| 93 |
+
" from utils.evaluator import SummarizerEvaluator\n",
|
| 94 |
+
" from utils.data_loader import DataLoader\n",
|
| 95 |
+
" print(\"✓ All imports successful\")\n",
|
| 96 |
+
"except ImportError as e:\n",
|
| 97 |
+
" print(f\"✗ Import error: {e}\")\n",
|
| 98 |
+
" print(\" Make sure you've run the previous cell to install dependencies\")\n",
|
| 99 |
+
" raise\n",
|
| 100 |
+
"\n",
|
| 101 |
+
"# Import standard libraries\n",
|
| 102 |
+
"import pandas as pd\n",
|
| 103 |
+
"import numpy as np\n",
|
| 104 |
+
"import matplotlib.pyplot as plt\n",
|
| 105 |
+
"import seaborn as sns\n",
|
| 106 |
+
"from scipy import stats\n",
|
| 107 |
+
"import json\n",
|
| 108 |
+
"\n",
|
| 109 |
+
"plt.style.use('seaborn-v0_8')"
|
| 110 |
+
]
|
| 111 |
+
},
|
| 112 |
+
{
|
| 113 |
+
"cell_type": "code",
|
| 114 |
+
"execution_count": null,
|
| 115 |
+
"id": "e28695c0",
|
| 116 |
+
"metadata": {},
|
| 117 |
+
"outputs": [],
|
| 118 |
+
"source": [
|
| 119 |
+
"print(\"Loading test dataset...\")\n",
|
| 120 |
+
"loader = DataLoader()\n",
|
| 121 |
+
"\n",
|
| 122 |
+
"# Load your saved samples (or load fresh)\n",
|
| 123 |
+
"try:\n",
|
| 124 |
+
" test_data = loader.load_samples('../data/samples/test_50.json')\n",
|
| 125 |
+
" print(f\"✓ Loaded {len(test_data)} test samples\")\n",
|
| 126 |
+
"except:\n",
|
| 127 |
+
" print(\"Downloading test data...\")\n",
|
| 128 |
+
" test_data = loader.load_cnn_dailymail(split='test', num_samples=50)\n",
|
| 129 |
+
" loader.save_samples(test_data, '../data/samples/test_50.json')\n",
|
| 130 |
+
" print(f\"✓ Downloaded and saved {len(test_data)} samples\")\n",
|
| 131 |
+
"\n",
|
| 132 |
+
"# Extract texts and references\n",
|
| 133 |
+
"texts = [item['article'] for item in test_data]\n",
|
| 134 |
+
"references = [item['reference_summary'] for item in test_data]\n",
|
| 135 |
+
"\n",
|
| 136 |
+
"print(f\"\\nDataset Statistics:\")\n",
|
| 137 |
+
"print(f\" - Number of samples: {len(texts)}\")\n",
|
| 138 |
+
"print(f\" - Avg article length: {np.mean([len(t.split()) for t in texts]):.0f} words\")\n",
|
| 139 |
+
"print(f\" - Avg reference length: {np.mean([len(r.split()) for r in references]):.0f}words\")"
|
| 140 |
+
]
|
| 141 |
+
},
|
| 142 |
+
{
|
| 143 |
+
"cell_type": "code",
|
| 144 |
+
"execution_count": null,
|
| 145 |
+
"id": "3b7dc004",
|
| 146 |
+
"metadata": {},
|
| 147 |
+
"outputs": [],
|
| 148 |
+
"source": [
|
| 149 |
+
"print(\"\\nInitializing models...\")\n",
|
| 150 |
+
"\n",
|
| 151 |
+
"models = {\n",
|
| 152 |
+
" 'TextRank': TextRankSummarizer(),\n",
|
| 153 |
+
" 'BART': BARTSummarizer(device='cpu'),\n",
|
| 154 |
+
" 'PEGASUS': PEGASUSSummarizer(device='cpu')\n",
|
| 155 |
+
"}\n",
|
| 156 |
+
"\n",
|
| 157 |
+
"print(\"✓ All models ready\")\n",
|
| 158 |
+
"\n",
|
| 159 |
+
"# Cell 4: Generate Summaries (Takes ~10-20 minutes for 50 samples)\n",
|
| 160 |
+
"print(\"\\nGenerating summaries for all models...\")\n",
|
| 161 |
+
"print(\"This will take 10-20 minutes. Grab a coffee! ☕\")\n",
|
| 162 |
+
"\n",
|
| 163 |
+
"all_summaries = {}\n",
|
| 164 |
+
"all_times = {}\n",
|
| 165 |
+
"\n",
|
| 166 |
+
"for model_name, model in models.items():\n",
|
| 167 |
+
" print(f\"\\n{model_name}:\")\n",
|
| 168 |
+
" summaries = []\n",
|
| 169 |
+
" times = []\n",
|
| 170 |
+
" \n",
|
| 171 |
+
" for i, text in enumerate(texts[:10], 1): # Start with 10 samples\n",
|
| 172 |
+
" print(f\" Processing {i}/10...\", end='\\r')\n",
|
| 173 |
+
" \n",
|
| 174 |
+
" if model_name == 'TextRank':\n",
|
| 175 |
+
" result = model.summarize_with_metrics(text)\n",
|
| 176 |
+
" else:\n",
|
| 177 |
+
" result = model.summarize_with_metrics(text, max_length=100, min_length=30)\n",
|
| 178 |
+
" \n",
|
| 179 |
+
" summaries.append(result['summary'])\n",
|
| 180 |
+
" times.append(result['metadata']['processing_time'])\n",
|
| 181 |
+
" \n",
|
| 182 |
+
" all_summaries[model_name] = summaries\n",
|
| 183 |
+
" all_times[model_name] = times\n",
|
| 184 |
+
" print(f\" ✓ Completed {model_name} \")\n",
|
| 185 |
+
"\n",
|
| 186 |
+
"print(\"\\n✓ All summaries generated!\")"
|
| 187 |
+
]
|
| 188 |
+
},
|
| 189 |
+
{
|
| 190 |
+
"cell_type": "code",
|
| 191 |
+
"execution_count": null,
|
| 192 |
+
"id": "bf78630d",
|
| 193 |
+
"metadata": {},
|
| 194 |
+
"outputs": [],
|
| 195 |
+
"source": [
|
| 196 |
+
"print(\"\\nEvaluating models...\")\n",
|
| 197 |
+
"\n",
|
| 198 |
+
"evaluator = SummarizerEvaluator()\n",
|
| 199 |
+
"evaluation_results = {}\n",
|
| 200 |
+
"\n",
|
| 201 |
+
"for model_name in models.keys():\n",
|
| 202 |
+
" print(f\"\\nEvaluating {model_name}...\")\n",
|
| 203 |
+
" results = evaluator.evaluate_batch(\n",
|
| 204 |
+
" all_summaries[model_name],\n",
|
| 205 |
+
" references[:len(all_summaries[model_name])],\n",
|
| 206 |
+
" model_name\n",
|
| 207 |
+
" )\n",
|
| 208 |
+
" results['avg_time'] = np.mean(all_times[model_name])\n",
|
| 209 |
+
" results['std_time'] = np.std(all_times[model_name])\n",
|
| 210 |
+
" evaluation_results[model_name] = results\n",
|
| 211 |
+
"\n",
|
| 212 |
+
"print(\"✓ Evaluation complete\")"
|
| 213 |
+
]
|
| 214 |
+
},
|
| 215 |
+
{
|
| 216 |
+
"cell_type": "code",
|
| 217 |
+
"execution_count": null,
|
| 218 |
+
"id": "c7ebcf59",
|
| 219 |
+
"metadata": {},
|
| 220 |
+
"outputs": [],
|
| 221 |
+
"source": [
|
| 222 |
+
"print(\"\\n\" + \"=\"*70)\n",
|
| 223 |
+
"print(\"EVALUATION RESULTS\")\n",
|
| 224 |
+
"print(\"=\"*70)\n",
|
| 225 |
+
"\n",
|
| 226 |
+
"results_table = []\n",
|
| 227 |
+
"\n",
|
| 228 |
+
"for model_name, results in evaluation_results.items():\n",
|
| 229 |
+
" results_table.append({\n",
|
| 230 |
+
" 'Model': model_name,\n",
|
| 231 |
+
" 'Type': 'Extractive' if model_name == 'TextRank' else 'Abstractive',\n",
|
| 232 |
+
" 'ROUGE-1': f\"{results['rouge_1_f1_mean']:.4f} ± {results['rouge_1_f1_std']:.4f}\",\n",
|
| 233 |
+
" 'ROUGE-2': f\"{results['rouge_2_f1_mean']:.4f} ± {results['rouge_2_f1_std']:.4f}\",\n",
|
| 234 |
+
" 'ROUGE-L': f\"{results['rouge_l_f1_mean']:.4f} ± {results['rouge_l_f1_std']:.4f}\",\n",
|
| 235 |
+
" 'Avg Time (s)': f\"{results['avg_time']:.3f} ± {results['std_time']:.3f}\",\n",
|
| 236 |
+
" 'Samples': results['num_samples']\n",
|
| 237 |
+
" })\n",
|
| 238 |
+
"\n",
|
| 239 |
+
"results_df = pd.DataFrame(results_table)\n",
|
| 240 |
+
"print(results_df.to_string(index=False))\n",
|
| 241 |
+
"\n",
|
| 242 |
+
"# Save to CSV for report\n",
|
| 243 |
+
"results_df.to_csv('../results/evaluation_results.csv', index=False)\n",
|
| 244 |
+
"print(\"\\n✓ Results saved to results/evaluation_results.csv\")\n"
|
| 245 |
+
]
|
| 246 |
+
},
|
| 247 |
+
{
|
| 248 |
+
"cell_type": "code",
|
| 249 |
+
"execution_count": null,
|
| 250 |
+
"id": "a65fac0c",
|
| 251 |
+
"metadata": {},
|
| 252 |
+
"outputs": [],
|
| 253 |
+
"source": [
|
| 254 |
+
"print(\"\\n\" + \"=\"*70)\n",
|
| 255 |
+
"print(\"STATISTICAL SIGNIFICANCE TESTS\")\n",
|
| 256 |
+
"print(\"=\"*70)\n",
|
| 257 |
+
"\n",
|
| 258 |
+
"# Compare BART vs PEGASUS (both abstractive)\n",
|
| 259 |
+
"bart_rouge1 = [s['rouge_1_f1'] for s in evaluation_results['BART']['individual_scores']]\n",
|
| 260 |
+
"peg_rouge1 = [s['rouge_1_f1'] for s in evaluation_results['PEGASUS']['individual_scores']]\n",
|
| 261 |
+
"\n",
|
| 262 |
+
"sig_test = evaluator.statistical_significance_test(\n",
|
| 263 |
+
" bart_rouge1,\n",
|
| 264 |
+
" peg_rouge1,\n",
|
| 265 |
+
" test_name='paired t-test'\n",
|
| 266 |
+
")\n",
|
| 267 |
+
"\n",
|
| 268 |
+
"print(f\"\\nBART vs PEGASUS (ROUGE-1):\")\n",
|
| 269 |
+
"print(f\" Test: {sig_test['test_name']}\")\n",
|
| 270 |
+
"print(f\" p-value: {sig_test['p_value']:.6f}\")\n",
|
| 271 |
+
"print(f\" {sig_test['interpretation']}\")"
|
| 272 |
+
]
|
| 273 |
+
},
|
| 274 |
+
{
|
| 275 |
+
"cell_type": "code",
|
| 276 |
+
"execution_count": null,
|
| 277 |
+
"id": "ae272f7a",
|
| 278 |
+
"metadata": {},
|
| 279 |
+
"outputs": [],
|
| 280 |
+
"source": [
|
| 281 |
+
"fig = plt.figure(figsize=(16, 12))\n",
|
| 282 |
+
"\n",
|
| 283 |
+
"# Create grid\n",
|
| 284 |
+
"gs = fig.add_gridspec(3, 3, hspace=0.3, wspace=0.3)\n",
|
| 285 |
+
"\n",
|
| 286 |
+
"# 1. ROUGE Scores Comparison\n",
|
| 287 |
+
"ax1 = fig.add_subplot(gs[0, :2])\n",
|
| 288 |
+
"rouge_data = pd.DataFrame({\n",
|
| 289 |
+
" 'Model': list(evaluation_results.keys()) * 3,\n",
|
| 290 |
+
" 'Metric': ['ROUGE-1']*3 + ['ROUGE-2']*3 + ['ROUGE-L']*3,\n",
|
| 291 |
+
" 'Score': [\n",
|
| 292 |
+
" evaluation_results['TextRank']['rouge_1_f1_mean'],\n",
|
| 293 |
+
" evaluation_results['BART']['rouge_1_f1_mean'],\n",
|
| 294 |
+
" evaluation_results['PEGASUS']['rouge_1_f1_mean'],\n",
|
| 295 |
+
" evaluation_results['TextRank']['rouge_2_f1_mean'],\n",
|
| 296 |
+
" evaluation_results['BART']['rouge_2_f1_mean'],\n",
|
| 297 |
+
" evaluation_results['PEGASUS']['rouge_2_f1_mean'],\n",
|
| 298 |
+
" evaluation_results['TextRank']['rouge_l_f1_mean'],\n",
|
| 299 |
+
" evaluation_results['BART']['rouge_l_f1_mean'],\n",
|
| 300 |
+
" evaluation_results['PEGASUS']['rouge_l_f1_mean']\n",
|
| 301 |
+
" ]\n",
|
| 302 |
+
"})\n",
|
| 303 |
+
"\n",
|
| 304 |
+
"sns.barplot(data=rouge_data, x='Metric', y='Score', hue='Model', ax=ax1)\n",
|
| 305 |
+
"ax1.set_title('ROUGE Score Comparison', fontsize=14, fontweight='bold')\n",
|
| 306 |
+
"ax1.set_ylabel('F1 Score')\n",
|
| 307 |
+
"ax1.set_ylim([0, 0.5])\n",
|
| 308 |
+
"ax1.legend(title='Model')\n",
|
| 309 |
+
"ax1.grid(axis='y', alpha=0.3)\n",
|
| 310 |
+
"\n",
|
| 311 |
+
"# 2. Processing Time\n",
|
| 312 |
+
"ax2 = fig.add_subplot(gs[0, 2])\n",
|
| 313 |
+
"times = [evaluation_results[m]['avg_time'] for m in models.keys()]\n",
|
| 314 |
+
"colors = ['#FF6B6B', '#4ECDC4', '#45B7D1']\n",
|
| 315 |
+
"ax2.bar(models.keys(), times, color=colors)\n",
|
| 316 |
+
"ax2.set_title('Processing Time', fontsize=12, fontweight='bold')\n",
|
| 317 |
+
"ax2.set_ylabel('Time (seconds)')\n",
|
| 318 |
+
"ax2.grid(axis='y', alpha=0.3)\n",
|
| 319 |
+
"\n",
|
| 320 |
+
"# 3. ROUGE-1 Distribution\n",
|
| 321 |
+
"ax3 = fig.add_subplot(gs[1, 0])\n",
|
| 322 |
+
"for model_name, color in zip(models.keys(), colors):\n",
|
| 323 |
+
" rouge1_scores = [s['rouge_1_f1'] for s in evaluation_results[model_name]['individual_scores']]\n",
|
| 324 |
+
" ax3.hist(rouge1_scores, alpha=0.6, label=model_name, bins=10, color=color)\n",
|
| 325 |
+
"ax3.set_title('ROUGE-1 Score Distribution', fontsize=12, fontweight='bold')\n",
|
| 326 |
+
"ax3.set_xlabel('ROUGE-1 F1 Score')\n",
|
| 327 |
+
"ax3.set_ylabel('Frequency')\n",
|
| 328 |
+
"ax3.legend()\n",
|
| 329 |
+
"ax3.grid(axis='y', alpha=0.3)\n",
|
| 330 |
+
"\n",
|
| 331 |
+
"# 4. ROUGE-2 Distribution\n",
|
| 332 |
+
"ax4 = fig.add_subplot(gs[1, 1])\n",
|
| 333 |
+
"for model_name, color in zip(models.keys(), colors):\n",
|
| 334 |
+
" rouge2_scores = [s['rouge_2_f1'] for s in evaluation_results[model_name]['individual_scores']]\n",
|
| 335 |
+
" ax4.hist(rouge2_scores, alpha=0.6, label=model_name, bins=10, color=color)\n",
|
| 336 |
+
"ax4.set_title('ROUGE-2 Score Distribution', fontsize=12, fontweight='bold')\n",
|
| 337 |
+
"ax4.set_xlabel('ROUGE-2 F1 Score')\n",
|
| 338 |
+
"ax4.set_ylabel('Frequency')\n",
|
| 339 |
+
"ax4.legend()\n",
|
| 340 |
+
"ax4.grid(axis='y', alpha=0.3)\n",
|
| 341 |
+
"\n",
|
| 342 |
+
"# 5. ROUGE-L Distribution\n",
|
| 343 |
+
"ax5 = fig.add_subplot(gs[1, 2])\n",
|
| 344 |
+
"for model_name, color in zip(models.keys(), colors):\n",
|
| 345 |
+
" rougel_scores = [s['rouge_l_f1'] for s in evaluation_results[model_name]['individual_scores']]\n",
|
| 346 |
+
" ax5.hist(rougel_scores, alpha=0.6, label=model_name, bins=10, color=color)\n",
|
| 347 |
+
"ax5.set_title('ROUGE-L Score Distribution', fontsize=12, fontweight='bold')\n",
|
| 348 |
+
"ax5.set_xlabel('ROUGE-L F1 Score')\n",
|
| 349 |
+
"ax5.set_ylabel('Frequency')\n",
|
| 350 |
+
"ax5.legend()\n",
|
| 351 |
+
"ax5.grid(axis='y', alpha=0.3)\n",
|
| 352 |
+
"\n",
|
| 353 |
+
"# 6. Box Plot Comparison\n",
|
| 354 |
+
"ax6 = fig.add_subplot(gs[2, :])\n",
|
| 355 |
+
"box_data = []\n",
|
| 356 |
+
"for model_name in models.keys():\n",
|
| 357 |
+
" rouge1_scores = [s['rouge_1_f1'] for s in evaluation_results[model_name]['individual_scores']]\n",
|
| 358 |
+
" for score in rouge1_scores:\n",
|
| 359 |
+
" box_data.append({'Model': model_name, 'ROUGE-1': score})\n",
|
| 360 |
+
"\n",
|
| 361 |
+
"box_df = pd.DataFrame(box_data)\n",
|
| 362 |
+
"sns.boxplot(data=box_df, x='Model', y='ROUGE-1', ax=ax6, palette=colors)\n",
|
| 363 |
+
"ax6.set_title('ROUGE-1 Score Distribution (Box Plot)', fontsize=14, fontweight='bold')\n",
|
| 364 |
+
"ax6.grid(axis='y', alpha=0.3)\n",
|
| 365 |
+
"\n",
|
| 366 |
+
"plt.savefig('../results/comprehensive_evaluation.png', dpi=300, bbox_inches='tight')\n",
|
| 367 |
+
"print(\"\\n✓ Comprehensive visualization saved!\")\n",
|
| 368 |
+
"plt.show()"
|
| 369 |
+
]
|
| 370 |
+
},
|
| 371 |
+
{
|
| 372 |
+
"cell_type": "code",
|
| 373 |
+
"execution_count": null,
|
| 374 |
+
"id": "3e24f94c",
|
| 375 |
+
"metadata": {},
|
| 376 |
+
"outputs": [],
|
| 377 |
+
"source": [
|
| 378 |
+
"print(\"\\n\" + \"=\"*70)\n",
|
| 379 |
+
"print(\"EXPORTING RESULTS FOR REPORT\")\n",
|
| 380 |
+
"print(\"=\"*70)\n",
|
| 381 |
+
"\n",
|
| 382 |
+
"# Create comprehensive export\n",
|
| 383 |
+
"export_data = {\n",
|
| 384 |
+
" 'evaluation_date': pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S'),\n",
|
| 385 |
+
" 'dataset': {\n",
|
| 386 |
+
" 'name': 'CNN/DailyMail',\n",
|
| 387 |
+
" 'samples_evaluated': len(all_summaries['TextRank']),\n",
|
| 388 |
+
" 'split': 'test'\n",
|
| 389 |
+
" },\n",
|
| 390 |
+
" 'models': {\n",
|
| 391 |
+
" model_name: {\n",
|
| 392 |
+
" 'type': results_table[i]['Type'],\n",
|
| 393 |
+
" 'rouge_1': {\n",
|
| 394 |
+
" 'mean': evaluation_results[model_name]['rouge_1_f1_mean'],\n",
|
| 395 |
+
" 'std': evaluation_results[model_name]['rouge_1_f1_std']\n",
|
| 396 |
+
" },\n",
|
| 397 |
+
" 'rouge_2': {\n",
|
| 398 |
+
" 'mean': evaluation_results[model_name]['rouge_2_f1_mean'],\n",
|
| 399 |
+
" 'std': evaluation_results[model_name]['rouge_2_f1_std']\n",
|
| 400 |
+
" },\n",
|
| 401 |
+
" 'rouge_l': {\n",
|
| 402 |
+
" 'mean': evaluation_results[model_name]['rouge_l_f1_mean'],\n",
|
| 403 |
+
" 'std': evaluation_results[model_name]['rouge_l_f1_std']\n",
|
| 404 |
+
" },\n",
|
| 405 |
+
" 'processing_time': {\n",
|
| 406 |
+
" 'mean': evaluation_results[model_name]['avg_time'],\n",
|
| 407 |
+
" 'std': evaluation_results[model_name]['std_time']\n",
|
| 408 |
+
" }\n",
|
| 409 |
+
" }\n",
|
| 410 |
+
" for i, model_name in enumerate(models.keys())\n",
|
| 411 |
+
" },\n",
|
| 412 |
+
" 'statistical_tests': {\n",
|
| 413 |
+
" 'bart_vs_pegasus': sig_test\n",
|
| 414 |
+
" }\n",
|
| 415 |
+
"}\n",
|
| 416 |
+
"\n",
|
| 417 |
+
"with open('../results/final_evaluation.json', 'w') as f:\n",
|
| 418 |
+
" json.dump(export_data, f, indent=2)\n",
|
| 419 |
+
"\n",
|
| 420 |
+
"print(\"✓ Exported to results/final_evaluation.json\")\n",
|
| 421 |
+
"print(\"\\nFiles created for your report:\")\n",
|
| 422 |
+
"print(\" 1. results/evaluation_results.csv - Table for report\")\n",
|
| 423 |
+
"print(\" 2. results/comprehensive_evaluation.png - Main figure\")\n",
|
| 424 |
+
"print(\" 3. results/final_evaluation.json - All data\")\n",
|
| 425 |
+
"\n",
|
| 426 |
+
"# Cell 10: Summary for Report\n",
|
| 427 |
+
"print(\"\\n\" + \"=\"*70)\n",
|
| 428 |
+
"print(\"KEY FINDINGS FOR YOUR REPORT\")\n",
|
| 429 |
+
"print(\"=\"*70)\n",
|
| 430 |
+
"\n",
|
| 431 |
+
"best_model = max(evaluation_results.keys(), \n",
|
| 432 |
+
" key=lambda x: evaluation_results[x]['rouge_1_f1_mean'])\n",
|
| 433 |
+
"fastest_model = min(evaluation_results.keys(),\n",
|
| 434 |
+
" key=lambda x: evaluation_results[x]['avg_time'])\n",
|
| 435 |
+
"\n",
|
| 436 |
+
"print(f\"\\n1. Best Overall Performance: {best_model}\")\n",
|
| 437 |
+
"print(f\" - ROUGE-1: {evaluation_results[best_model]['rouge_1_f1_mean']:.4f}\")\n",
|
| 438 |
+
"print(f\" - ROUGE-2: {evaluation_results[best_model]['rouge_2_f1_mean']:.4f}\")\n",
|
| 439 |
+
"print(f\" - ROUGE-L: {evaluation_results[best_model]['rouge_l_f1_mean']:.4f}\")\n",
|
| 440 |
+
"\n",
|
| 441 |
+
"print(f\"\\n2. Fastest Processing: {fastest_model}\")\n",
|
| 442 |
+
"print(f\" - Avg time: {evaluation_results[fastest_model]['avg_time']:.3f}s\")\n",
|
| 443 |
+
"print(f\" - {evaluation_results[max(evaluation_results.keys(), key=lambda x: evaluation_results[x]['avg_time'])]['avg_time'] / evaluation_results[fastest_model]['avg_time']:.1f}x faster than slowest\")\n",
|
| 444 |
+
"\n",
|
| 445 |
+
"print(f\"\\n3. Extractive vs Abstractive:\")\n",
|
| 446 |
+
"print(f\" - TextRank (Extractive): ROUGE-1 = {evaluation_results['TextRank']['rouge_1_f1_mean']:.4f}\")\n",
|
| 447 |
+
"print(f\" - BART (Abstractive): ROUGE-1 = {evaluation_results['BART']['rouge_1_f1_mean']:.4f}\")\n",
|
| 448 |
+
"print(f\" - PEGASUS (Abstractive): ROUGE-1 = {evaluation_results['PEGASUS']['rouge_1_f1_mean']:.4f}\")\n",
|
| 449 |
+
"print(f\" - Abstractive models outperform extractive by {(evaluation_results[best_model]['rouge_1_f1_mean'] / evaluation_results['TextRank']['rouge_1_f1_mean'] - 1) * 100:.1f}%\")\n",
|
| 450 |
+
"\n",
|
| 451 |
+
"print(\"\\n\" + \"=\"*70)\n",
|
| 452 |
+
"print(\"✓ Evaluation complete! Use these results in your report.\")\n",
|
| 453 |
+
"print(\"=\"*70)"
|
| 454 |
+
]
|
| 455 |
+
}
|
| 456 |
+
],
|
| 457 |
+
"metadata": {
|
| 458 |
+
"kernelspec": {
|
| 459 |
+
"display_name": "Workshop2",
|
| 460 |
+
"language": "python",
|
| 461 |
+
"name": "python3"
|
| 462 |
+
},
|
| 463 |
+
"language_info": {
|
| 464 |
+
"codemirror_mode": {
|
| 465 |
+
"name": "ipython",
|
| 466 |
+
"version": 3
|
| 467 |
+
},
|
| 468 |
+
"file_extension": ".py",
|
| 469 |
+
"mimetype": "text/x-python",
|
| 470 |
+
"name": "python",
|
| 471 |
+
"nbconvert_exporter": "python",
|
| 472 |
+
"pygments_lexer": "ipython3",
|
| 473 |
+
"version": "3.13.9"
|
| 474 |
+
}
|
| 475 |
+
},
|
| 476 |
+
"nbformat": 4,
|
| 477 |
+
"nbformat_minor": 5
|
| 478 |
+
}
|
notebooks/.ipynb_checkpoints/03_evaluation_analysis_cnn_dailymail-checkpoint.ipynb
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
notebooks/.ipynb_checkpoints/Smart-Summarizer-checkpoint.ipynb
ADDED
|
@@ -0,0 +1,6 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"cells": [],
|
| 3 |
+
"metadata": {},
|
| 4 |
+
"nbformat": 4,
|
| 5 |
+
"nbformat_minor": 5
|
| 6 |
+
}
|
notebooks/01_data_exploration.ipynb
CHANGED
|
@@ -240,7 +240,7 @@
|
|
| 240 |
],
|
| 241 |
"metadata": {
|
| 242 |
"kernelspec": {
|
| 243 |
-
"display_name": "
|
| 244 |
"language": "python",
|
| 245 |
"name": "python3"
|
| 246 |
},
|
|
|
|
| 240 |
],
|
| 241 |
"metadata": {
|
| 242 |
"kernelspec": {
|
| 243 |
+
"display_name": "Python 3 (ipykernel)",
|
| 244 |
"language": "python",
|
| 245 |
"name": "python3"
|
| 246 |
},
|
notebooks/02_model_testing.ipynb
CHANGED
|
The diff for this file is too large to render.
See raw diff
|
|
|
notebooks/03_evaluation_analysis.ipynb
CHANGED
|
@@ -456,7 +456,7 @@
|
|
| 456 |
],
|
| 457 |
"metadata": {
|
| 458 |
"kernelspec": {
|
| 459 |
-
"display_name": "
|
| 460 |
"language": "python",
|
| 461 |
"name": "python3"
|
| 462 |
},
|
|
|
|
| 456 |
],
|
| 457 |
"metadata": {
|
| 458 |
"kernelspec": {
|
| 459 |
+
"display_name": "Python 3 (ipykernel)",
|
| 460 |
"language": "python",
|
| 461 |
"name": "python3"
|
| 462 |
},
|
notebooks/03_evaluation_analysis_cnn_dailymail.ipynb
CHANGED
|
The diff for this file is too large to render.
See raw diff
|
|
|
notebooks/Smart-Summarizer.ipynb
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
results/cnn_dailymail_evaluation_export.json
ADDED
|
@@ -0,0 +1,109 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"evaluation_metadata": {
|
| 3 |
+
"date": "2026-01-05 11:07:16",
|
| 4 |
+
"dataset": "CNN/DailyMail",
|
| 5 |
+
"dataset_version": "3.0.0",
|
| 6 |
+
"dataset_source": "abisee/cnn_dailymail",
|
| 7 |
+
"split": "test",
|
| 8 |
+
"samples_evaluated": 20,
|
| 9 |
+
"student_id": "23049149",
|
| 10 |
+
"module": "CU6051NI Artificial Intelligence"
|
| 11 |
+
},
|
| 12 |
+
"models_evaluated": {
|
| 13 |
+
"TextRank": {
|
| 14 |
+
"model_type": "Extractive",
|
| 15 |
+
"samples_processed": 20,
|
| 16 |
+
"rouge_scores": {
|
| 17 |
+
"rouge_1": {
|
| 18 |
+
"mean": 0.2506712046180422,
|
| 19 |
+
"std": 0.12486457066759726,
|
| 20 |
+
"interpretation": "Unigram overlap with reference"
|
| 21 |
+
},
|
| 22 |
+
"rouge_2": {
|
| 23 |
+
"mean": 0.10349963104031257,
|
| 24 |
+
"std": 0.0726814983362185,
|
| 25 |
+
"interpretation": "Bigram overlap with reference"
|
| 26 |
+
},
|
| 27 |
+
"rouge_l": {
|
| 28 |
+
"mean": 0.16371538220717308,
|
| 29 |
+
"std": 0.07898183592353582,
|
| 30 |
+
"interpretation": "Longest common subsequence"
|
| 31 |
+
}
|
| 32 |
+
},
|
| 33 |
+
"performance_metrics": {
|
| 34 |
+
"avg_processing_time": 0.005225980281829834,
|
| 35 |
+
"std_processing_time": 0.013506362618507201,
|
| 36 |
+
"total_processing_time": 0.10451960563659668,
|
| 37 |
+
"compression_ratio_mean": 4.669225541097807,
|
| 38 |
+
"compression_ratio_std": 2.4839200547893845
|
| 39 |
+
}
|
| 40 |
+
},
|
| 41 |
+
"BART": {
|
| 42 |
+
"model_type": "Abstractive",
|
| 43 |
+
"samples_processed": 20,
|
| 44 |
+
"rouge_scores": {
|
| 45 |
+
"rouge_1": {
|
| 46 |
+
"mean": 0.35022025793945055,
|
| 47 |
+
"std": 0.09190543055324636,
|
| 48 |
+
"interpretation": "Unigram overlap with reference"
|
| 49 |
+
},
|
| 50 |
+
"rouge_2": {
|
| 51 |
+
"mean": 0.1478972899837078,
|
| 52 |
+
"std": 0.08392194073728265,
|
| 53 |
+
"interpretation": "Bigram overlap with reference"
|
| 54 |
+
},
|
| 55 |
+
"rouge_l": {
|
| 56 |
+
"mean": 0.2604310393319945,
|
| 57 |
+
"std": 0.10189025331501939,
|
| 58 |
+
"interpretation": "Longest common subsequence"
|
| 59 |
+
}
|
| 60 |
+
},
|
| 61 |
+
"performance_metrics": {
|
| 62 |
+
"avg_processing_time": 6.735281562805175,
|
| 63 |
+
"std_processing_time": 1.252485361304747,
|
| 64 |
+
"total_processing_time": 134.70563125610352,
|
| 65 |
+
"compression_ratio_mean": 1.4679364788911673,
|
| 66 |
+
"compression_ratio_std": 0.3564447507091954
|
| 67 |
+
}
|
| 68 |
+
},
|
| 69 |
+
"PEGASUS": {
|
| 70 |
+
"model_type": "Abstractive",
|
| 71 |
+
"samples_processed": 20,
|
| 72 |
+
"rouge_scores": {
|
| 73 |
+
"rouge_1": {
|
| 74 |
+
"mean": 0.3530379619461269,
|
| 75 |
+
"std": 0.10720945707466437,
|
| 76 |
+
"interpretation": "Unigram overlap with reference"
|
| 77 |
+
},
|
| 78 |
+
"rouge_2": {
|
| 79 |
+
"mean": 0.1531830157168635,
|
| 80 |
+
"std": 0.08764155739126663,
|
| 81 |
+
"interpretation": "Bigram overlap with reference"
|
| 82 |
+
},
|
| 83 |
+
"rouge_l": {
|
| 84 |
+
"mean": 0.25491739595110097,
|
| 85 |
+
"std": 0.09604101774475897,
|
| 86 |
+
"interpretation": "Longest common subsequence"
|
| 87 |
+
}
|
| 88 |
+
},
|
| 89 |
+
"performance_metrics": {
|
| 90 |
+
"avg_processing_time": 8.351530861854553,
|
| 91 |
+
"std_processing_time": 0.8606459954310681,
|
| 92 |
+
"total_processing_time": 167.03061723709106,
|
| 93 |
+
"compression_ratio_mean": 1.268746653225481,
|
| 94 |
+
"compression_ratio_std": 0.34500943090569686
|
| 95 |
+
}
|
| 96 |
+
}
|
| 97 |
+
},
|
| 98 |
+
"summary_statistics": {
|
| 99 |
+
"total_models": 3,
|
| 100 |
+
"successful_evaluations": 3,
|
| 101 |
+
"best_rouge1_model": "PEGASUS",
|
| 102 |
+
"fastest_model": "TextRank"
|
| 103 |
+
},
|
| 104 |
+
"dataset_characteristics": {
|
| 105 |
+
"avg_article_length": 530.35,
|
| 106 |
+
"avg_reference_length": 36.7,
|
| 107 |
+
"avg_compression_ratio": 0.09504919499509967
|
| 108 |
+
}
|
| 109 |
+
}
|
results/cnn_dailymail_evaluation_results.csv
ADDED
|
@@ -0,0 +1,4 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
Model,Type,ROUGE-1,ROUGE-2,ROUGE-L,Avg Time (s),Total Time (s),Samples
|
| 2 |
+
TextRank,Extractive,0.2507 ± 0.1249,0.1035 ± 0.0727,0.1637 ± 0.0790,0.005 ± 0.014,0.1,20
|
| 3 |
+
BART,Abstractive,0.3502 ± 0.0919,0.1479 ± 0.0839,0.2604 ± 0.1019,6.735 ± 1.252,134.7,20
|
| 4 |
+
PEGASUS,Abstractive,0.3530 ± 0.1072,0.1532 ± 0.0876,0.2549 ± 0.0960,8.352 ± 0.861,167.0,20
|
results/cnn_dailymail_report_summary.md
ADDED
|
@@ -0,0 +1,29 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
|
| 2 |
+
# CNN/DailyMail Evaluation Report Summary
|
| 3 |
+
|
| 4 |
+
## Dataset Information
|
| 5 |
+
- **Dataset**: CNN/DailyMail v3.0.0 (abisee/cnn_dailymail)
|
| 6 |
+
- **Split**: Test set
|
| 7 |
+
- **Samples**: 20 articles evaluated
|
| 8 |
+
- **Average Article Length**: 530 words
|
| 9 |
+
- **Average Reference Length**: 37 words
|
| 10 |
+
|
| 11 |
+
## Model Performance (ROUGE-1 F1 Scores)
|
| 12 |
+
|
| 13 |
+
### PEGASUS (Abstractive)
|
| 14 |
+
- **ROUGE-1**: 0.3530 ± 0.1072
|
| 15 |
+
- **ROUGE-2**: 0.1532 ± 0.0876
|
| 16 |
+
- **ROUGE-L**: 0.2549 ± 0.0960
|
| 17 |
+
- **Avg Processing Time**: 8.352s per sample
|
| 18 |
+
|
| 19 |
+
### BART (Abstractive)
|
| 20 |
+
- **ROUGE-1**: 0.3502 ± 0.0919
|
| 21 |
+
- **ROUGE-2**: 0.1479 ± 0.0839
|
| 22 |
+
- **ROUGE-L**: 0.2604 ± 0.1019
|
| 23 |
+
- **Avg Processing Time**: 6.735s per sample
|
| 24 |
+
|
| 25 |
+
### TextRank (Extractive)
|
| 26 |
+
- **ROUGE-1**: 0.2507 ± 0.1249
|
| 27 |
+
- **ROUGE-2**: 0.1035 ± 0.0727
|
| 28 |
+
- **ROUGE-L**: 0.1637 ± 0.0790
|
| 29 |
+
- **Avg Processing Time**: 0.005s per sample
|
run_evaluation.py
ADDED
|
@@ -0,0 +1,130 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
Simple script to run model evaluation on CNN/DailyMail dataset
|
| 4 |
+
"""
|
| 5 |
+
|
| 6 |
+
import os
|
| 7 |
+
import sys
|
| 8 |
+
import logging
|
| 9 |
+
from pathlib import Path
|
| 10 |
+
|
| 11 |
+
# Add project root to path
|
| 12 |
+
project_root = Path(__file__).parent
|
| 13 |
+
sys.path.insert(0, str(project_root))
|
| 14 |
+
|
| 15 |
+
from evaluation.dataset_loader import CNNDailyMailLoader
|
| 16 |
+
from evaluation.model_evaluator import ModelEvaluator
|
| 17 |
+
from evaluation.results_analyzer import ResultsAnalyzer
|
| 18 |
+
|
| 19 |
+
# Setup logging
|
| 20 |
+
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
|
| 21 |
+
logger = logging.getLogger(__name__)
|
| 22 |
+
|
| 23 |
+
def main():
    """Run the full CNN/DailyMail evaluation pipeline.

    Loads the dataset, draws an evaluation subset, categorizes articles by
    topic, evaluates every registered model both overall and per topic, and
    writes results (JSON + CSV), charts, and a markdown report to OUTPUT_DIR.

    Raises:
        Exception: re-raised after logging if any pipeline step fails.
    """
    # Configuration
    SAMPLE_SIZE = 50  # number of articles to evaluate overall
    OUTPUT_DIR = "evaluation_results"

    # Create output directory (idempotent)
    os.makedirs(OUTPUT_DIR, exist_ok=True)

    logger.info("Starting Smart Summarizer Evaluation")
    logger.info("Sample size: %d", SAMPLE_SIZE)  # lazy %-formatting for logging

    try:
        # Step 1: Load dataset (side effect: loader caches the dataset internally)
        logger.info("Step 1: Loading CNN/DailyMail dataset...")
        loader = CNNDailyMailLoader()
        loader.load_dataset()  # return value unused; subset is drawn via the loader below

        # Step 2: Create evaluation subset and persist it for reproducibility
        logger.info("Step 2: Creating evaluation subset...")
        eval_data = loader.create_evaluation_subset(size=SAMPLE_SIZE)
        loader.save_evaluation_data(eval_data, f"{OUTPUT_DIR}/eval_data.json")

        # Step 3: Categorize articles by topic
        logger.info("Step 3: Categorizing by topics...")
        categorized_data = loader.categorize_by_topic(eval_data)

        # Save each non-empty topic bucket to its own file
        for topic, data in categorized_data.items():
            if data:
                loader.save_evaluation_data(data, f"{OUTPUT_DIR}/data_{topic}.json")
                logger.info("  %s: %d articles", topic, len(data))

        # Step 4: Initialize summarization models
        logger.info("Step 4: Initializing models...")
        evaluator = ModelEvaluator()
        evaluator.initialize_models()

        # Step 5: Run overall evaluation across all models
        logger.info("Step 5: Running overall evaluation...")
        overall_results = evaluator.evaluate_all_models(eval_data, max_samples=SAMPLE_SIZE)

        # Save overall results (detailed JSON + summary CSV)
        evaluator.save_results(overall_results, f"{OUTPUT_DIR}/results_overall.json")
        comparison_df = evaluator.compare_models(overall_results)
        comparison_df.to_csv(f"{OUTPUT_DIR}/comparison_overall.csv", index=False)

        print("\n" + "=" * 60)
        print("OVERALL EVALUATION RESULTS")
        print("=" * 60)
        print(comparison_df.to_string(index=False))

        # Step 6: Run topic-based evaluation (only topics with enough data)
        logger.info("Step 6: Running topic-based evaluation...")
        topic_results = {}

        for topic, data in categorized_data.items():
            if len(data) >= 5:  # skip topics too small for meaningful metrics
                logger.info("  Evaluating topic: %s", topic)
                topic_results[topic] = evaluator.evaluate_all_models(data, max_samples=20)

                # Save per-topic results and comparison table
                evaluator.save_results(topic_results[topic], f"{OUTPUT_DIR}/results_{topic}.json")
                topic_comparison = evaluator.compare_models(topic_results[topic])
                topic_comparison.to_csv(f"{OUTPUT_DIR}/comparison_{topic}.csv", index=False)

                print(f"\n{topic.upper()} TOPIC RESULTS:")
                print("-" * 40)
                print(topic_comparison.to_string(index=False))

        # Step 7: Create visualizations and analysis artifacts
        logger.info("Step 7: Creating analysis and visualizations...")
        analyzer = ResultsAnalyzer()

        # Overall performance charts
        analyzer.create_performance_charts(overall_results, OUTPUT_DIR)

        # Topic analysis only if at least one topic was evaluated
        if topic_results:
            analyzer.analyze_topic_performance(topic_results, OUTPUT_DIR)

        # Detailed markdown report
        analyzer.create_detailed_report(overall_results, OUTPUT_DIR)

        print("\n" + "=" * 60)
        print("EVALUATION COMPLETE")
        print("=" * 60)
        print(f"Results saved to: {OUTPUT_DIR}/")
        print("Files created:")
        print("  - results_overall.json (detailed results)")
        print("  - comparison_overall.csv (summary table)")
        print("  - performance_comparison.png (charts)")
        print("  - evaluation_report.md (detailed report)")
        if topic_results:
            print("  - topic_performance_heatmap.png (topic analysis)")
            print("  - topic_summary.csv (topic breakdown)")

    except Exception as e:
        # logger.exception records the full traceback, unlike logger.error
        logger.exception("Evaluation failed: %s", e)
        raise

if __name__ == "__main__":
    main()
|