Rajak13 committed on
Commit 69df729 · verified · 1 Parent(s): 9c0ee8e

Upload folder using huggingface_hub

Dockerfile ADDED
@@ -0,0 +1,36 @@
+ FROM python:3.9-slim
+
+ # Set working directory
+ WORKDIR /app
+
+ # Install system dependencies
+ RUN apt-get update && apt-get install -y \
+     build-essential \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Copy requirements first for better caching
+ COPY requirements.txt .
+
+ # Install Python dependencies
+ RUN pip install --no-cache-dir -r requirements.txt && \
+     pip install --no-cache-dir gunicorn==21.2.0
+
+ # Copy application code
+ COPY . .
+
+ # Create necessary directories
+ RUN mkdir -p uploads logs
+
+ # Download NLTK data
+ RUN python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"
+
+ # Expose port for Hugging Face Spaces
+ EXPOSE 7860
+
+ # Set environment variables for Hugging Face Spaces
+ ENV FLASK_ENV=production
+ ENV PYTHONUNBUFFERED=1
+ ENV PORT=7860
+
+ # Run the application on port 7860 for Hugging Face Spaces
+ CMD ["gunicorn", "--chdir", "webapp", "app:app", "--bind", "0.0.0.0:7860", "--timeout", "120", "--workers", "2"]
README.md CHANGED
@@ -1,13 +1,65 @@
  ---
  title: Smart Summarizer
- emoji: 🏆
- colorFrom: yellow
- colorTo: gray
+ emoji: 🤖
+ colorFrom: blue
+ colorTo: purple
  sdk: docker
- sdk_version: 6.2.0
- app_file: app.py
  pinned: false
  license: mit
  ---
  
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Smart Summarizer
+
+ Professional text summarization using three state-of-the-art models:
+
+ - **TextRank**: Fast extractive summarization (graph-based)
+ - **BART**: High-quality abstractive summarization
+ - **PEGASUS**: Specialized abstractive model for summarization
+
+ ## Features
+
+ - 📄 **Single Summary**: Generate summaries with individual models
+ - ⚖️ **Comparison**: Compare all three models side-by-side
+ - 📚 **Batch Processing**: Process multiple documents simultaneously
+ - 📊 **Evaluation**: ROUGE metrics and performance insights
+ - 📁 **File Support**: Upload .txt, .md, .pdf, .docx files
+
+ ## Models
+
+ ### TextRank (Extractive)
+ - **Speed**: Very fast (~0.03s)
+ - **Type**: Graph-based PageRank algorithm
+ - **Best for**: Quick summaries, keyword extraction
+
+ ### BART (Abstractive)
+ - **Speed**: Moderate (~9s on CPU)
+ - **Type**: Transformer encoder-decoder
+ - **Best for**: Fluent, human-like summaries
+
+ ### PEGASUS (Abstractive)
+ - **Speed**: Moderate (~6s on CPU)
+ - **Type**: Gap Sentence Generation pre-training
+ - **Best for**: High-quality abstractive summaries
+
+ ## Usage
+
+ 1. Navigate to the web interface
+ 2. Choose between single summary or model comparison
+ 3. Input text directly or upload a supported file
+ 4. Select your preferred model(s)
+ 5. Generate and compare summaries
+
+ ## Supported File Types
+
+ - Plain text (`.txt`, `.md`)
+ - PDF documents (`.pdf`)
+ - Word documents (`.docx`, `.doc`)
+
+ ## Author
+
+ **Abdul Razzaq Ansari**
+
+ ## Links
+
+ - [GitHub Repository](https://github.com/Rajak13/Smart-Summarizer)
+ - [Documentation](https://github.com/Rajak13/Smart-Summarizer/blob/main/QUICK_START.md)
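Review note: the README advertises ROUGE metrics under its Evaluation feature, but the scoring code is not part of the files shown in this diff. As a rough illustration of what such a metric measures, the sketch below computes a simplified ROUGE-1 F1 (unigram overlap). The function name `rouge1_f1` is hypothetical, and the whitespace tokenization with lowercasing is an assumption; real ROUGE implementations typically add stemming and more careful tokenization.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat on the mat", "the cat sat")
```

Here the candidate is fully contained in the reference (precision 1.0) but covers only half of it (recall 0.5), which is why F1 is used rather than either figure alone.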
models/__init__.py ADDED
@@ -0,0 +1,29 @@
+ """
+ Models package for text summarization
+ Contains implementations of various summarization algorithms
+ """
+
+ # Optional imports - import only what you need to avoid loading heavy dependencies
+ __all__ = [
+     'BaseSummarizer',
+     'TextRankSummarizer',
+     'BARTSummarizer',
+     'PEGASUSSummarizer'
+ ]
+
+ # Lazy imports - import classes when accessed via package
+ def __getattr__(name):
+     if name == 'BaseSummarizer':
+         from .base_summarizer import BaseSummarizer
+         return BaseSummarizer
+     elif name == 'TextRankSummarizer':
+         from .textrank import TextRankSummarizer
+         return TextRankSummarizer
+     elif name == 'BARTSummarizer':
+         from .bart import BARTSummarizer
+         return BARTSummarizer
+     elif name == 'PEGASUSSummarizer':
+         from .pegasus import PEGASUSSummarizer
+         return PEGASUSSummarizer
+     raise AttributeError(f"module '{__name__}' has no attribute '{name}'")
+
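Review note: `models/__init__.py` relies on a module-level `__getattr__` (PEP 562) so that importing the package does not pull in `torch` or `transformers` until a class is actually requested. The sketch below shows the same mechanism in a self-contained form. Everything in it is illustrative: `make_lazy_module` and `_LAZY_EXPORTS` are hypothetical names, standard-library classes stand in for the heavy model modules, and the caching step is an optional addition that the committed file does not have.

```python
import importlib
import types

# Hypothetical mapping for illustration: exported name -> (module, attribute).
# Standard-library targets stand in for the heavy summarizer modules.
_LAZY_EXPORTS = {
    "OrderedDict": ("collections", "OrderedDict"),
    "Fraction": ("fractions", "Fraction"),
}


def make_lazy_module(name, exports):
    """Build a module object whose attributes are imported on first access."""
    module = types.ModuleType(name)

    def __getattr__(attr):
        # Only called when normal attribute lookup on the module fails (PEP 562)
        try:
            target_module, target_attr = exports[attr]
        except KeyError:
            raise AttributeError(
                f"module '{name}' has no attribute '{attr}'"
            ) from None
        value = getattr(importlib.import_module(target_module), target_attr)
        setattr(module, attr, value)  # cache so later lookups skip __getattr__
        return value

    module.__getattr__ = __getattr__
    module.__all__ = list(exports)
    return module


lazy = make_lazy_module("lazy_demo", _LAZY_EXPORTS)
loaded_before = "Fraction" in vars(lazy)   # nothing has been imported yet
half = lazy.Fraction(1, 2)                 # triggers the import
loaded_after = "Fraction" in vars(lazy)    # now cached on the module
```

A table-driven lookup like this keeps the export list and the import targets in one place, which may be easier to maintain than an `if`/`elif` chain as more models are added.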
models/bart.py ADDED
@@ -0,0 +1,348 @@
+ """
+ BART (Bidirectional and Auto-Regressive Transformers) Abstractive Summarization
+ State-of-the-art sequence-to-sequence model for text generation
+ Professional implementation with comprehensive features
+ """
+
+ # Handle imports when running directly (python models/bart.py)
+ # For proper package usage, run as: python -m models.bart
+ import sys
+ from pathlib import Path
+ project_root = Path(__file__).parent.parent
+ if str(project_root) not in sys.path:
+     sys.path.insert(0, str(project_root))
+
+ from transformers import BartForConditionalGeneration, BartTokenizer
+ import torch
+ import logging
+ from typing import Dict, List, Optional, Union
+ from models.base_summarizer import BaseSummarizer
+
+ logger = logging.getLogger(__name__)
+
+
+ class BARTSummarizer(BaseSummarizer):
+     """
+     BART implementation for abstractive text summarization.
+
+     Model Architecture:
+     - Encoder: Bidirectional transformer (like BERT)
+     - Decoder: Auto-regressive transformer (like GPT)
+     - Pre-trained on denoising tasks
+
+     Key Features:
+     - Generates human-like, fluent summaries
+     - Can paraphrase and compress information
+     - Accepts up to 1024 input tokens (longer documents are truncated)
+     - State-of-the-art on CNN/DailyMail at publication (2019)
+
+     Training Objective:
+     Trained to reconstruct original text from corrupted versions:
+     - Token masking
+     - Token deletion
+     - Text infilling
+     - Sentence permutation
+     - Document rotation
+
+     Mathematical Foundation:
+     Self-Attention: Attention(Q,K,V) = softmax(QK^T/√d_k)V
+     Where Q=Query, K=Key, V=Value, d_k=dimension of keys
+     """
+
+     def __init__(self,
+                  model_name: str = "facebook/bart-large-cnn",
+                  device: Optional[str] = None,
+                  use_fp16: bool = False):
+         """
+         Initialize BART Summarizer
+
+         Args:
+             model_name: HuggingFace model identifier
+             device: Computing device ('cuda', 'cpu', or None for auto-detect)
+             use_fp16: Use 16-bit floating point for faster inference (requires GPU)
+         """
+         super().__init__(model_name="BART", model_type="Abstractive")
+
+         logger.info(f"Loading BART model: {model_name}")
+         logger.info("Initial model loading may take 2-3 minutes...")
+
+         # Determine device
+         if device is None:
+             self.device = "cuda" if torch.cuda.is_available() else "cpu"
+         else:
+             self.device = device
+
+         logger.info(f"Using device: {self.device}")
+         if self.device == "cuda":
+             logger.info(f"GPU: {torch.cuda.get_device_name(0)}")
+             logger.info(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
+
+         # Load tokenizer and model
+         try:
+             self.tokenizer = BartTokenizer.from_pretrained(model_name)
+             self.model = BartForConditionalGeneration.from_pretrained(model_name)
+
+             # Move model to device
+             self.model.to(self.device)
+
+             # Enable FP16 if requested and GPU available
+             if use_fp16 and self.device == "cuda":
+                 self.model.half()
+                 logger.info("Using FP16 precision for faster inference")
+
+             # Set to evaluation mode
+             self.model.eval()
+
+             self.model_name_full = model_name
+             self.is_initialized = True
+
+             logger.info("BART model loaded successfully!")
+
+         except Exception as e:
+             logger.error(f"Failed to load BART model: {e}")
+             raise
+
+     def summarize(self,
+                   text: str,
+                   max_length: int = 150,
+                   min_length: int = 50,
+                   num_beams: int = 4,
+                   length_penalty: float = 2.0,
+                   no_repeat_ngram_size: int = 3,
+                   early_stopping: bool = True,
+                   do_sample: bool = False,
+                   temperature: float = 1.0,
+                   top_k: int = 50,
+                   top_p: float = 0.95) -> str:
+         """
+         Generate abstractive summary using BART
+
+         Beam Search: Maintains the num_beams best hypotheses at each step
+         Length Penalty: Exponent applied to sequence length when normalizing beam scores
+
+         Args:
+             text: Input text to summarize
+             max_length: Maximum summary length in tokens
+             min_length: Minimum summary length in tokens
+             num_beams: Number of beams for beam search (higher = better quality, slower)
+             length_penalty: >0.0 favors longer sequences, <0.0 favors shorter (beam search only)
+             no_repeat_ngram_size: Prevent repetition of n-grams
+             early_stopping: Stop when num_beams hypotheses are complete
+             do_sample: Use sampling instead of beam search
+             temperature: Sampling temperature (higher = more random)
+             top_k: Keep only top k tokens for sampling
+             top_p: Nucleus sampling threshold
+
+         Returns:
+             Generated summary string
+         """
+         # Validate input
+         self.validate_input(text)
+
+         # Tokenize input
+         inputs = self.tokenizer(
+             text,
+             max_length=1024,  # BART max input length
+             truncation=True,
+             padding=False,  # Single input: padding to 1024 tokens only adds compute
+             return_tensors="pt"
+         )
+
+         # Move to device
+         input_ids = inputs["input_ids"].to(self.device)
+         attention_mask = inputs["attention_mask"].to(self.device)
+
+         # Generate summary
+         with torch.no_grad():
+             if do_sample:
+                 # Sampling-based generation (more diverse)
+                 summary_ids = self.model.generate(
+                     input_ids,
+                     attention_mask=attention_mask,
+                     max_length=max_length,
+                     min_length=min_length,
+                     do_sample=True,
+                     temperature=temperature,
+                     top_k=top_k,
+                     top_p=top_p,
+                     no_repeat_ngram_size=no_repeat_ngram_size,
+                     early_stopping=early_stopping
+                 )
+             else:
+                 # Beam search generation (more deterministic, higher quality)
+                 summary_ids = self.model.generate(
+                     input_ids,
+                     attention_mask=attention_mask,
+                     max_length=max_length,
+                     min_length=min_length,
+                     num_beams=num_beams,
+                     length_penalty=length_penalty,
+                     no_repeat_ngram_size=no_repeat_ngram_size,
+                     early_stopping=early_stopping
+                 )
+
+         # Decode summary
+         summary = self.tokenizer.decode(
+             summary_ids[0],
+             skip_special_tokens=True,
+             clean_up_tokenization_spaces=True
+         )
+
+         return summary
+
+     def batch_summarize(self,
+                         texts: List[str],
+                         batch_size: int = 4,
+                         max_length: int = 150,
+                         min_length: int = 50,
+                         **kwargs) -> List[str]:
+         """
+         Efficiently summarize multiple texts in batches
+
+         Args:
+             texts: List of texts to summarize
+             batch_size: Number of texts to process simultaneously
+             max_length: Maximum summary length
+             min_length: Minimum summary length
+             **kwargs: Additional generation parameters (only num_beams is read)
+
+         Returns:
+             List of generated summaries
+         """
+         logger.info(f"Batch summarizing {len(texts)} texts (batch_size={batch_size})")
+
+         summaries = []
+
+         # Process in batches
+         for i in range(0, len(texts), batch_size):
+             batch = texts[i:i + batch_size]
+
+             # Tokenize batch
+             inputs = self.tokenizer(
+                 batch,
+                 max_length=1024,
+                 truncation=True,
+                 padding=True,
+                 return_tensors="pt"
+             )
+
+             input_ids = inputs["input_ids"].to(self.device)
+             attention_mask = inputs["attention_mask"].to(self.device)
+
+             # Generate summaries for batch
+             with torch.no_grad():
+                 summary_ids = self.model.generate(
+                     input_ids,
+                     attention_mask=attention_mask,
+                     max_length=max_length,
+                     min_length=min_length,
+                     num_beams=kwargs.get('num_beams', 4),
+                     early_stopping=True
+                 )
+
+             # Decode summaries
+             batch_summaries = [
+                 self.tokenizer.decode(ids, skip_special_tokens=True)
+                 for ids in summary_ids
+             ]
+
+             summaries.extend(batch_summaries)
+
+             logger.info(f"Processed batch {i//batch_size + 1}/{(len(texts)-1)//batch_size + 1}")
+
+         return summaries
+
+     def get_model_info(self) -> Dict:
+         """Return comprehensive model information"""
+         info = super().get_model_info()
+         info.update({
+             'algorithm': 'Transformer Encoder-Decoder',
+             'architecture': {
+                 'encoder': 'Bidirectional (BERT-like)',
+                 'decoder': 'Auto-regressive (GPT-like)',
+                 'layers': '12 encoder + 12 decoder',
+                 'attention_heads': 16,
+                 'hidden_size': 1024,
+                 'parameters': '406M'
+             },
+             'training': {
+                 'objective': 'Denoising autoencoder',
+                 'noise_functions': [
+                     'Token masking',
+                     'Token deletion',
+                     'Text infilling',
+                     'Sentence permutation',
+                     'Document rotation'
+                 ],
+                 'dataset': 'Large-scale web text + CNN/DailyMail fine-tuning'
+             },
+             'performance': {
+                 'rouge_1': '44.16',
+                 'rouge_2': '21.28',
+                 'rouge_l': '40.90',
+                 'benchmark': 'CNN/DailyMail test set'
+             },
+             'advantages': [
+                 'Generates fluent, human-like summaries',
+                 'Can paraphrase and compress effectively',
+                 'State-of-the-art on CNN/DailyMail at publication (2019)'
+             ],
+             'limitations': [
+                 'May introduce factual errors',
+                 'Computationally intensive',
+                 'Requires GPU for fast inference',
+                 'Inputs longer than 1024 tokens are truncated',
+                 'Black-box nature (less interpretable)'
+             ]
+         })
+         return info
+
+     def __del__(self):
+         """Cleanup GPU memory when object is destroyed"""
+         if hasattr(self, 'device') and self.device == 'cuda':
+             torch.cuda.empty_cache()
+
+
+ # Test the implementation
+ if __name__ == "__main__":
+     sample_text = """
+     Machine learning has revolutionized artificial intelligence in recent years.
+     Deep learning neural networks can now perform tasks that were impossible just
+     a decade ago. Computer vision systems can recognize objects in images with
+     superhuman accuracy. Natural language processing models can generate human-like
+     text and translate between languages. Reinforcement learning has enabled AI
+     to master complex games like Go and StarCraft. These advances have been driven
+     by increases in computing power, availability of large datasets, and algorithmic
+     innovations. However, challenges remain in areas like explainability, fairness,
+     and robustness. The field continues to evolve rapidly with new breakthroughs
+     occurring regularly.
+     """
+
+     print("=" * 70)
+     print("BART SUMMARIZER - PROFESSIONAL TEST")
+     print("=" * 70)
+
+     # Initialize summarizer
+     summarizer = BARTSummarizer()
+
+     # Generate summary with metrics
+     result = summarizer.summarize_with_metrics(
+         sample_text,
+         max_length=100,
+         min_length=30,
+         num_beams=4
+     )
+
+     print(f"\nModel: {result['metadata']['model_name']}")
+     print(f"Type: {result['metadata']['model_type']}")
+     print(f"Device: {summarizer.device}")
+     print(f"Input Length: {result['metadata']['input_length']} words")
+     print(f"Summary Length: {result['metadata']['summary_length']} words")
+     print(f"Compression Ratio: {result['metadata']['compression_ratio']:.2%}")
+     print(f"Processing Time: {result['metadata']['processing_time']:.4f} seconds")
+
+     print(f"\n{'Generated Summary:':-^70}")
+     print(result['summary'])
+
+     print("\n" + "=" * 70)
+     model_info = summarizer.get_model_info()
+     print(f"Architecture: {model_info['architecture']}")
+     print(f"Performance: {model_info['performance']}")
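Review note: `summarize()` passes `length_penalty=2.0` to beam search. The Transformers generation documentation describes this value as an exponent on the sequence length, which then divides the hypothesis score (a sum of log-probabilities, so a negative number). The sketch below works through that arithmetic on two made-up hypotheses. The function names and the log-probability values are illustrative assumptions, not output from BART.

```python
def beam_score(sum_logprobs: float, length: int, length_penalty: float) -> float:
    """Length-normalized beam score: sum of log-probs divided by length ** penalty."""
    return sum_logprobs / (length ** length_penalty)


# Two made-up finished hypotheses: (sum of token log-probabilities, token count)
HYPOTHESES = {
    "short": (-4.0, 5),
    "long": (-6.0, 10),
}


def best_hypothesis(length_penalty: float) -> str:
    """Return the name of the hypothesis with the highest normalized score."""
    return max(
        HYPOTHESES,
        key=lambda name: beam_score(*HYPOTHESES[name], length_penalty),
    )


winner_no_penalty = best_hypothesis(0.0)   # raw log-probs compared directly
winner_default = best_hypothesis(2.0)      # the value used in summarize()
winner_negative = best_hypothesis(-1.0)    # negative exponent punishes length
```

With no normalization the shorter hypothesis wins because it has accumulated fewer negative terms. Dividing by a growing power of the length shrinks the longer hypothesis's score toward zero faster, so positive values favor longer summaries and negative values favor shorter ones.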
models/base_summarizer.py ADDED
@@ -0,0 +1,221 @@
+ """
+ Base Summarizer Class
+ Defines the interface for all summarization models
+ Implements Strategy Design Pattern for interchangeable algorithms
+ """
+
+ from abc import ABC, abstractmethod
+ from typing import Dict, Any, Optional, List
+ import time
+ import logging
+
+ # Setup logging
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger(__name__)
+
+
+ class BaseSummarizer(ABC):
+     """
+     Abstract base class for all summarization models.
+     Implements common functionality and defines interface.
+
+     Design Pattern: Strategy Pattern
+     - Allows switching between different summarization algorithms
+     - Ensures consistent interface across models
+     """
+
+     def __init__(self, model_name: str, model_type: str):
+         """
+         Initialize base summarizer
+
+         Args:
+             model_name: Name of the model (e.g., "TextRank", "BART")
+             model_type: Type of summarization ("Extractive" or "Abstractive")
+         """
+         self.model_name = model_name
+         self.model_type = model_type
+         self.is_initialized = False
+         self.stats = {
+             'total_summarizations': 0,
+             'total_processing_time': 0.0,
+             'average_processing_time': 0.0
+         }
+         logger.info(f"Initializing {model_name} ({model_type}) summarizer")
+
+     @abstractmethod
+     def summarize(self, text: str, **kwargs) -> str:
+         """
+         Generate summary from input text.
+         Must be implemented by all subclasses.
+
+         Args:
+             text: Input text to summarize
+             **kwargs: Additional parameters specific to each model
+
+         Returns:
+             Generated summary string
+         """
+         pass
+
+     def summarize_with_metrics(self, text: str, **kwargs) -> Dict[str, Any]:
+         """
+         Summarize text and return detailed metrics
+
+         Args:
+             text: Input text to summarize
+             **kwargs: Model-specific parameters
+
+         Returns:
+             Dictionary containing summary and metadata
+         """
+         # perf_counter is monotonic, so the elapsed time is unaffected by clock changes
+         start_time = time.perf_counter()
+
+         # Generate summary
+         summary = self.summarize(text, **kwargs)
+
+         # Calculate metrics
+         processing_time = time.perf_counter() - start_time
+         self._update_stats(processing_time)
+
+         # Count words once and reuse the totals below
+         input_words = len(text.split())
+         summary_words = len(summary.split())
+
+         return {
+             'summary': summary,
+             'metadata': {
+                 'model_name': self.model_name,
+                 'model_type': self.model_type,
+                 'processing_time': processing_time,
+                 'input_length': input_words,
+                 'summary_length': summary_words,
+                 'compression_ratio': summary_words / input_words if input_words > 0 else 0,
+                 'timestamp': time.strftime('%Y-%m-%d %H:%M:%S')
+             }
+         }
+
+     def batch_summarize(self, texts: List[str], **kwargs) -> List[Dict[str, Any]]:
+         """
+         Summarize multiple texts
+
+         Args:
+             texts: List of texts to summarize
+             **kwargs: Model-specific parameters
+
+         Returns:
+             List of dictionaries with summaries and metadata
+         """
+         logger.info(f"Batch summarizing {len(texts)} texts with {self.model_name}")
+         results = []
+
+         for idx, text in enumerate(texts):
+             logger.info(f"Processing text {idx + 1}/{len(texts)}")
+             result = self.summarize_with_metrics(text, **kwargs)
+             result['metadata']['batch_index'] = idx
+             results.append(result)
+
+         return results
+
+     def _update_stats(self, processing_time: float):
+         """Update internal statistics"""
+         self.stats['total_summarizations'] += 1
+         self.stats['total_processing_time'] += processing_time
+         self.stats['average_processing_time'] = (
+             self.stats['total_processing_time'] / self.stats['total_summarizations']
+         )
+
+     def get_model_info(self) -> Dict[str, Any]:
+         """
+         Get detailed model information
+
+         Returns:
+             Dictionary with model specifications
+         """
+         return {
+             'name': self.model_name,
+             'type': self.model_type,
+             'statistics': self.stats.copy(),
+             'is_initialized': self.is_initialized
+         }
+
+     def reset_stats(self):
+         """Reset usage statistics"""
+         self.stats = {
+             'total_summarizations': 0,
+             'total_processing_time': 0.0,
+             'average_processing_time': 0.0
+         }
+         logger.info(f"Statistics reset for {self.model_name}")
+
+     def validate_input(self, text: str, min_length: int = 10) -> bool:
+         """
+         Validate input text
+
+         Args:
+             text: Input text
+             min_length: Minimum number of words required
+
+         Returns:
+             Boolean indicating if input is valid
+
+         Raises:
+             ValueError: If input is invalid
+         """
+         if not text or not isinstance(text, str):
+             raise ValueError("Input text must be a non-empty string")
+
+         word_count = len(text.split())
+         if word_count < min_length:
+             raise ValueError(
+                 f"Input text too short. Minimum {min_length} words required, got {word_count}"
+             )
+
+         return True
+
+     def __repr__(self) -> str:
+         """String representation of the summarizer"""
+         return (f"{self.__class__.__name__}(model_name='{self.model_name}', "
+                 f"model_type='{self.model_type}', "
+                 f"total_summarizations={self.stats['total_summarizations']})")
+
+
+ class SummarizerFactory:
+     """
+     Factory Pattern for creating summarizer instances
+     Centralizes model instantiation logic
+     """
+
+     _models = {}
+
+     @classmethod
+     def register_model(cls, model_class, name: str):
+         """Register a new summarizer model"""
+         cls._models[name.lower()] = model_class
+         logger.info(f"Registered model: {name}")
+
+     @classmethod
+     def create_summarizer(cls, model_name: str, **kwargs):
+         """
+         Create a summarizer instance
+
+         Args:
+             model_name: Name of the model to create
+             **kwargs: Model-specific initialization parameters
+
+         Returns:
+             Instance of requested summarizer
+
+         Raises:
+             ValueError: If model not found
+         """
+         model_name_lower = model_name.lower()
+
+         if model_name_lower not in cls._models:
+             available = ', '.join(cls._models.keys())
+             raise ValueError(
+                 f"Model '{model_name}' not found. Available models: {available}"
+             )
+
+         model_class = cls._models[model_name_lower]
+         return model_class(**kwargs)
+
+     @classmethod
+     def list_available_models(cls) -> List[str]:
+         """Get list of available models"""
+         return list(cls._models.keys())
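Review note: `SummarizerFactory` starts with an empty registry, and none of the files visible in this part of the commit call `register_model`, so `create_summarizer` would raise `ValueError` until something registers a class. The sketch below shows the intended register-then-create flow with small stand-ins so it runs without the repository. `Summarizer`, `LeadSentenceSummarizer`, and `Registry` are hypothetical names that mirror `BaseSummarizer` and `SummarizerFactory` only in outline.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type


class Summarizer(ABC):
    """Stand-in for BaseSummarizer: only the abstract interface is mirrored."""

    @abstractmethod
    def summarize(self, text: str, **kwargs) -> str:
        ...


class LeadSentenceSummarizer(Summarizer):
    """Toy strategy for illustration: keep the first N sentences."""

    def summarize(self, text: str, num_sentences: int = 1, **kwargs) -> str:
        sentences = [s.strip() for s in text.split(".") if s.strip()]
        return ". ".join(sentences[:num_sentences]) + "."


class Registry:
    """Stand-in for SummarizerFactory with the same case-insensitive lookup."""

    _models: Dict[str, Type[Summarizer]] = {}

    @classmethod
    def register(cls, model_class: Type[Summarizer], name: str) -> None:
        cls._models[name.lower()] = model_class

    @classmethod
    def create(cls, name: str, **kwargs) -> Summarizer:
        key = name.lower()
        if key not in cls._models:
            available = ", ".join(cls._models)
            raise ValueError(f"Model '{name}' not found. Available models: {available}")
        return cls._models[key](**kwargs)

    @classmethod
    def available(cls) -> List[str]:
        return list(cls._models)


# Register once (for example at import time), then create by name anywhere
Registry.register(LeadSentenceSummarizer, "Lead")
summarizer = Registry.create("LEAD")
summary = summarizer.summarize(
    "First point. Second point. Third point.", num_sentences=2
)
```

Because callers depend only on the `summarize` interface, swapping the toy strategy for a heavier model would be a matter of registering a different class under another name.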
models/pegasus.py ADDED
@@ -0,0 +1,384 @@
+ """
+ PEGASUS (Pre-training with Extracted Gap-sentences for Abstractive SUmmarization)
+ State-of-the-art model specifically designed for summarization tasks
+ Professional implementation with Gap Sentence Generation pre-training
+ """
+
+ # Handle imports when running directly (python models/pegasus.py)
+ # For proper package usage, run as: python -m models.pegasus
+ import sys
+ from pathlib import Path
+ project_root = Path(__file__).parent.parent
+ if str(project_root) not in sys.path:
+     sys.path.insert(0, str(project_root))
+
+ from transformers import PegasusForConditionalGeneration, PegasusTokenizer
+ import torch
+ import logging
+ from typing import Dict, List, Optional
+ from models.base_summarizer import BaseSummarizer
+
+ logger = logging.getLogger(__name__)
+
+
+ class PEGASUSSummarizer(BaseSummarizer):
+     r"""
+     PEGASUS implementation for abstractive text summarization.
+
+     Innovation: Gap Sentence Generation (GSG)
+     - Pre-training task: Predict important missing sentences
+     - Directly aligned with summarization objective
+     - Superior transfer learning for summarization
+
+     Model Architecture:
+     - Transformer encoder-decoder (16 layers each)
+     - Pre-trained on C4 and HugeNews datasets
+     - Fine-tuned on domain-specific summarization data
+
+     Key Advantages:
+     - State-of-the-art ROUGE scores on multiple benchmarks at publication (2020)
+     - Excellent zero-shot and few-shot capabilities
+     - Generates highly coherent summaries
+     - Accepts up to 1024 input tokens here (longer documents are truncated)
+
+     Performance Highlights (CNN/DailyMail):
+     - ROUGE-1: 44.17
+     - ROUGE-2: 21.47
+     - ROUGE-L: 41.11
+
+     Mathematical Foundation:
+     Sentence Importance: ROUGE1-F1(S_i, D \ S_i)
+     Where S_i = sentence i, D \ S_i = document without sentence i
+     """
+
+     def __init__(self,
+                  model_name: str = "google/pegasus-cnn_dailymail",
+                  device: Optional[str] = None,
+                  use_fp16: bool = False):
+         """
+         Initialize PEGASUS Summarizer
+
+         Args:
+             model_name: HuggingFace model identifier
+                 Options: 'google/pegasus-cnn_dailymail' (recommended)
+                          'google/pegasus-xsum' (for extreme summarization)
+                          'google/pegasus-large' (base model)
+             device: Computing device ('cuda', 'cpu', or None for auto-detect)
+             use_fp16: Use 16-bit floating point for faster inference
+         """
+         super().__init__(model_name="PEGASUS", model_type="Abstractive")
+
+         logger.info(f"Loading PEGASUS model: {model_name}")
+         logger.info("PEGASUS is a large model. Initial loading may take 3-5 minutes...")
+
+         # Determine device
+         if device is None:
+             self.device = "cuda" if torch.cuda.is_available() else "cpu"
+         else:
+             self.device = device
+
+         logger.info(f"Using device: {self.device}")
+
+         # Load tokenizer and model
+         try:
+             logger.info("Loading tokenizer...")
+             self.tokenizer = PegasusTokenizer.from_pretrained(model_name)
+
+             logger.info("Loading model weights...")
+             self.model = PegasusForConditionalGeneration.from_pretrained(model_name)
+
+             # Move to device
+             self.model.to(self.device)
+
+             # Enable FP16 if requested
+             if use_fp16 and self.device == "cuda":
+                 self.model.half()
+                 logger.info("Using FP16 precision")
+
+             # Set to evaluation mode
+             self.model.eval()
+
+             self.model_name_full = model_name
+             self.is_initialized = True
+
+             # Get model configuration
+             self.config = self.model.config
+
+             logger.info("PEGASUS model loaded successfully!")
+             logger.info(f"Model size: {self._count_parameters() / 1e6:.1f}M parameters")
+
+         except Exception as e:
+             logger.error(f"Failed to load PEGASUS model: {e}")
+             raise
+
+     def _count_parameters(self) -> int:
+         """Count total number of trainable parameters"""
+         return sum(p.numel() for p in self.model.parameters() if p.requires_grad)
+
+     def summarize(self,
+                   text: str,
+                   max_length: int = 128,
+                   min_length: int = 32,
+                   num_beams: int = 4,
+                   length_penalty: float = 2.0,
+                   no_repeat_ngram_size: int = 3,
+                   early_stopping: bool = True,
+                   do_sample: bool = False,
+                   temperature: float = 1.0) -> str:
+         """
+         Generate abstractive summary using PEGASUS
+
+         PEGASUS uses special tokens:
+         - <pad>: Padding token (also used as decoder start token)
+         - </s>: End of sequence token
+         - <unk>: Unknown token
+         - <mask_1>: Sentence-level mask used for Gap Sentence Generation
+         - <mask_2>: Token-level mask used for masked language modeling
+
+         Args:
+             text: Input text to summarize
+             max_length: Maximum summary length in tokens (PEGASUS optimal: 128)
+             min_length: Minimum summary length in tokens
+             num_beams: Beam search width (4-8 recommended)
+             length_penalty: Exponent on length in beam scoring (>0.0 = longer, <0.0 = shorter)
+             no_repeat_ngram_size: Prevent n-gram repetition
+             early_stopping: Stop when beams complete
+             do_sample: Use sampling instead of beam search
+             temperature: Sampling randomness (lower = more deterministic)
+
+         Returns:
+             Generated summary string
+         """
+         # Validate input
+         self.validate_input(text)
+
+         # Tokenize input
+         inputs = self.tokenizer(
+             text,
+             max_length=1024,  # PEGASUS max input
+             truncation=True,
+             padding=False,  # Single input: padding to 1024 tokens only adds compute
+             return_tensors="pt"
+         )
+
+         # Move to device
+         input_ids = inputs["input_ids"].to(self.device)
+         attention_mask = inputs["attention_mask"].to(self.device)
+
+         # Generate summary
+         with torch.no_grad():
+             if do_sample:
+                 # Sampling-based generation
+                 summary_ids = self.model.generate(
+                     input_ids,
+                     attention_mask=attention_mask,
+                     max_length=max_length,
+                     min_length=min_length,
+                     do_sample=True,
+                     temperature=temperature,
+                     top_k=50,
+                     top_p=0.95,
+                     no_repeat_ngram_size=no_repeat_ngram_size
+                 )
+             else:
+                 # Beam search generation (recommended for PEGASUS)
+                 summary_ids = self.model.generate(
+                     input_ids,
+                     attention_mask=attention_mask,
+                     max_length=max_length,
+                     min_length=min_length,
+                     num_beams=num_beams,
+                     length_penalty=length_penalty,
+                     no_repeat_ngram_size=no_repeat_ngram_size,
+                     early_stopping=early_stopping
+                 )
+
+         # Decode summary
+         summary = self.tokenizer.decode(
+             summary_ids[0],
+             skip_special_tokens=True,
+             clean_up_tokenization_spaces=True
+         )
+
+         return summary
+
+     def batch_summarize(self,
+                         texts: List[str],
+                         batch_size: int = 2,
+                         max_length: int = 128,
+                         **kwargs) -> List[str]:
+         """
+         Batch summarization (PEGASUS is large, use smaller batches)
+
+         Args:
+             texts: List of texts to summarize
+             batch_size: Texts per batch (2-4 recommended for PEGASUS)
+             max_length: Maximum summary length
+             **kwargs: Additional generation parameters (only num_beams and length_penalty are read)
+
+         Returns:
+             List of generated summaries
+         """
+         logger.info(f"Batch summarizing {len(texts)} texts (batch_size={batch_size})")
+
+         summaries = []
+
+         for i in range(0, len(texts), batch_size):
+             batch = texts[i:i + batch_size]
+
+             # Tokenize
+             inputs = self.tokenizer(
+                 batch,
+                 max_length=1024,
+                 truncation=True,
+                 padding=True,
+                 return_tensors="pt"
+             )
+
+             input_ids = inputs["input_ids"].to(self.device)
+             attention_mask = inputs["attention_mask"].to(self.device)
+
+             # Generate
+             with torch.no_grad():
+                 summary_ids = self.model.generate(
+                     input_ids,
+                     attention_mask=attention_mask,
+                     max_length=max_length,
+                     num_beams=kwargs.get('num_beams', 4),
+                     length_penalty=kwargs.get('length_penalty', 2.0),
+                     early_stopping=True
+                 )
+
+             # Decode
+             batch_summaries = [
+                 self.tokenizer.decode(ids, skip_special_tokens=True)
+                 for ids in summary_ids
+             ]
+
+             summaries.extend(batch_summaries)
+
+             logger.info(f"Completed batch {i//batch_size + 1}/{(len(texts)-1)//batch_size + 1}")
+
+         return summaries
+
263
+ def get_model_info(self) -> Dict:
264
+ """Return comprehensive model information"""
265
+ info = super().get_model_info()
266
+ info.update({
267
+ 'algorithm': 'Gap Sentence Generation (GSG) + Transformer',
268
+ 'innovation': 'Pre-training specifically designed for summarization',
269
+ 'architecture': {
270
+ 'encoder_layers': 16,
271
+ 'decoder_layers': 16,
272
+ 'attention_heads': 16,
273
+ 'hidden_size': 1024,
274
+ 'parameters': f'{self._count_parameters() / 1e6:.1f}M',
275
+ 'vocabulary_size': self.tokenizer.vocab_size
276
+ },
277
+ 'pre_training': {
278
+ 'objective': 'Gap Sentence Generation (GSG)',
279
+ 'method': 'Mask and predict important sentences',
280
+ 'datasets': ['C4 corpus', 'HugeNews dataset'],
281
+ 'sentence_selection': 'ROUGE-based importance scoring'
282
+ },
283
+ 'fine_tuning': {
284
+ 'dataset': 'CNN/DailyMail',
285
+ 'task': 'Abstractive summarization'
286
+ },
287
+ 'performance': {
288
+ 'rouge_1': '44.17',
289
+ 'rouge_2': '21.47',
290
+ 'rouge_l': '41.11',
291
+ 'benchmark': 'CNN/DailyMail test set',
292
+ 'ranking': 'State-of-the-art (as of 2020)'
293
+ },
294
+ 'advantages': [
295
+ 'State-of-the-art ROUGE scores on CNN/DailyMail at publication (2020)',
296
+ 'Excellent zero-shot performance',
297
+ 'Generates highly coherent summaries',
298
+ 'Pre-training aligned with summarization',
299
+ 'Strong transfer learning capabilities'
300
+ ],
301
+ 'limitations': [
302
+ 'Very large model (high memory requirements)',
303
+ 'Slower inference than smaller models',
304
+ 'May hallucinate facts',
305
+ 'Less interpretable (black-box)',
306
+ 'Requires powerful GPU for real-time use'
307
+ ],
308
+ 'optimal_use_cases': [
309
+ 'High-quality abstractive summaries needed',
310
+ 'News article summarization',
311
+ 'Long document summarization',
312
+ 'Multi-document summarization',
313
+ 'Research paper abstracts'
314
+ ]
315
+ })
316
+ return info
317
+
318
+ def get_special_tokens(self) -> Dict:
319
+ """Get information about special tokens"""
320
+ return {
321
+ 'pad_token': self.tokenizer.pad_token,
322
+ 'eos_token': self.tokenizer.eos_token,
323
+ 'unk_token': self.tokenizer.unk_token,
324
+ 'mask_token': self.tokenizer.mask_token,
325
+ 'vocab_size': self.tokenizer.vocab_size
326
+ }
327
+
328
+ def __del__(self):
329
+ """Cleanup GPU memory"""
330
+ if hasattr(self, 'device') and str(self.device).startswith('cuda'):
331
+ torch.cuda.empty_cache()
332
+ logger.info("Cleared GPU cache")
333
+
334
+
335
+ # Test the implementation
336
+ if __name__ == "__main__":
337
+ sample_text = """
338
+ Climate change poses one of the greatest challenges to humanity in the 21st century.
339
+ Rising global temperatures are causing ice caps to melt and sea levels to rise.
340
+ Extreme weather events like hurricanes, droughts, and floods are becoming more frequent.
341
+ Scientists warn that without immediate action, the consequences could be catastrophic.
342
+ Renewable energy sources like solar and wind power offer sustainable alternatives to
343
+ fossil fuels. Many countries have committed to reducing carbon emissions through the
344
+ Paris Agreement. However, implementing these changes requires unprecedented international
345
+ cooperation and technological innovation. The transition to a green economy will create
346
+ new jobs while protecting the environment for future generations.
347
+ """
348
+
349
+ print("=" * 70)
350
+ print("PEGASUS SUMMARIZER - PROFESSIONAL TEST")
351
+ print("=" * 70)
352
+
353
+ # Initialize summarizer
354
+ summarizer = PEGASUSSummarizer()
355
+
356
+ # Generate summary with metrics
357
+ result = summarizer.summarize_with_metrics(
358
+ sample_text,
359
+ max_length=100,
360
+ min_length=30,
361
+ num_beams=4,
362
+ length_penalty=2.0
363
+ )
364
+
365
+ print(f"\nModel: {result['metadata']['model_name']}")
366
+ print(f"Type: {result['metadata']['model_type']}")
367
+ print(f"Device: {summarizer.device}")
368
+ print(f"Input Length: {result['metadata']['input_length']} words")
369
+ print(f"Summary Length: {result['metadata']['summary_length']} words")
370
+ print(f"Compression Ratio: {result['metadata']['compression_ratio']:.2%}")
371
+ print(f"Processing Time: {result['metadata']['processing_time']:.4f} seconds")
372
+
373
+ print(f"\n{'Generated Summary:':-^70}")
374
+ print(result['summary'])
375
+
376
+ print(f"\n{'Model Architecture:':-^70}")
377
+ model_info = summarizer.get_model_info()
378
+ print(f"Parameters: {model_info['architecture']['parameters']}")
379
+ print(f"Pre-training: {model_info['pre_training']['objective']}")
380
+ print(f"Performance (CNN/DM): ROUGE-1={model_info['performance']['rouge_1']}, "
381
+ f"ROUGE-2={model_info['performance']['rouge_2']}, "
382
+ f"ROUGE-L={model_info['performance']['rouge_l']}")
383
+
384
+ print("\n" + "=" * 70)
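The batching loop in `batch_summarize` above slices the input in steps of `batch_size` and logs progress with `(len(texts) - 1) // batch_size + 1` as the batch count, which is ceiling division. The sketch below isolates just that arithmetic so it can be checked without loading a model. The helper name `iter_batches` is hypothetical and is not part of the committed code.

```python
from typing import Iterator, List, Tuple


def iter_batches(items: List[str], batch_size: int) -> Iterator[Tuple[int, int, List[str]]]:
    """Yield (batch_number, total_batches, batch) using the same slicing as batch_summarize."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if not items:
        return
    # Ceiling division: number of batches needed to cover every item.
    total = (len(items) - 1) // batch_size + 1
    for start in range(0, len(items), batch_size):
        yield start // batch_size + 1, total, items[start:start + batch_size]


texts = ["a", "b", "c", "d", "e"]
batches = list(iter_batches(texts, batch_size=2))
for number, total, batch in batches:
    print(f"batch {number}/{total}: {batch}")
```

The last batch may be shorter than `batch_size`; the tokenizer call with `padding=True` in the original loop handles that case because padding is computed per batch.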
models/textrank.py ADDED
@@ -0,0 +1,366 @@
1
+ """
2
+ TextRank Extractive Summarization
3
+ Graph-based ranking algorithm inspired by PageRank
4
+ Professional implementation with extensive documentation
5
+ """
6
+
7
+ # Handle imports when running directly (python models/textrank.py)
8
+ # For proper package usage, run as: python -m models.textrank
9
+ import sys
10
+ from pathlib import Path
11
+ project_root = Path(__file__).parent.parent
12
+ if str(project_root) not in sys.path:
13
+ sys.path.insert(0, str(project_root))
14
+
15
+ import numpy as np
16
+ import networkx as nx
17
+ from nltk.tokenize import sent_tokenize, word_tokenize
18
+ from nltk.corpus import stopwords
19
+ from sklearn.feature_extraction.text import TfidfVectorizer
20
+ from sklearn.metrics.pairwise import cosine_similarity
21
+ import logging
22
+ from typing import Dict, List, Tuple, Optional
23
+ from models.base_summarizer import BaseSummarizer
24
+
25
+ # Setup logging
26
+ logger = logging.getLogger(__name__)
27
+
28
+
29
+ class TextRankSummarizer(BaseSummarizer):
30
+ """
31
+ TextRank implementation for extractive text summarization.
32
+
33
+ Algorithm Overview:
34
+ 1. Split text into sentences
35
+ 2. Create TF-IDF vectors for each sentence
36
+ 3. Calculate cosine similarity between all sentence pairs
37
+ 4. Build weighted graph (sentences as nodes, similarities as edges)
38
+ 5. Apply PageRank algorithm to rank sentences
39
+ 6. Select top-ranked sentences for summary
40
+
41
+ Advantages:
42
+ - Fast and efficient (no neural networks)
43
+ - Largely language-independent algorithm (this implementation assumes English stopwords and NLTK tokenizers)
44
+ - Interpretable results
45
+ - No training required
46
+
47
+ Limitations:
48
+ - Cannot generate new sentences
49
+ - May select redundant information
50
+ - Limited semantic understanding
51
+ """
52
+
53
+ def __init__(self,
54
+ damping: float = 0.85,
55
+ max_iter: int = 100,
56
+ tol: float = 1e-4,
57
+ summary_ratio: float = 0.3,
58
+ min_sentence_length: int = 5):
59
+ """
60
+ Initialize TextRank Summarizer
61
+
62
+ Args:
63
+ damping: PageRank damping factor (0-1). Higher = more weight to neighbors
64
+ max_iter: Maximum iterations for PageRank convergence
65
+ tol: Convergence tolerance for PageRank
66
+ summary_ratio: Proportion of sentences to include (0-1)
67
+ min_sentence_length: Minimum words per sentence to consider
68
+ """
69
+ super().__init__(model_name="TextRank", model_type="Extractive")
70
+
71
+ self.damping = damping
72
+ self.max_iter = max_iter
73
+ self.tol = tol
74
+ self.summary_ratio = summary_ratio
75
+ self.min_sentence_length = min_sentence_length
76
+
77
+ # Initialize stopwords
78
+ try:
79
+ self.stop_words = set(stopwords.words('english'))
80
+ except LookupError:
81
+ logger.warning("NLTK stopwords not found. Downloading...")
82
+ import nltk
83
+ nltk.download('stopwords')
84
+ self.stop_words = set(stopwords.words('english'))
85
+
86
+ self.is_initialized = True
87
+ logger.info("TextRank summarizer initialized successfully")
88
+
89
+ def preprocess(self, text: str) -> Tuple[List[str], List[str]]:
90
+ """
91
+ Preprocess text into sentences
92
+
93
+ Args:
94
+ text: Input text string
95
+
96
+ Returns:
97
+ Tuple of (original_sentences, cleaned_sentences)
98
+ """
99
+ # Split into sentences
100
+ sentences = sent_tokenize(text)
101
+
102
+ # Filter out very short sentences
103
+ filtered_sentences = [
104
+ s for s in sentences
105
+ if len(s.split()) >= self.min_sentence_length
106
+ ]
107
+
108
+ if not filtered_sentences:
109
+ filtered_sentences = sentences # Keep all if filtering removes everything
110
+
111
+ # Clean sentences for similarity calculation
112
+ cleaned_sentences = []
113
+ for sent in filtered_sentences:
114
+ # Tokenize and lowercase
115
+ words = word_tokenize(sent.lower())
116
+ # Remove stopwords and non-alphanumeric tokens
117
+ words = [w for w in words if w.isalnum() and w not in self.stop_words]
118
+ cleaned_sentences.append(' '.join(words))
119
+
120
+ return filtered_sentences, cleaned_sentences
121
+
+     def build_similarity_matrix(self, sentences: List[str]) -> np.ndarray:
+         """
+         Build sentence similarity matrix using TF-IDF and cosine similarity
+
+         Mathematical Foundation:
+         - TF-IDF: Term Frequency-Inverse Document Frequency
+         - Cosine Similarity: cos(θ) = (A·B) / (||A|| × ||B||)
+
+         Args:
+             sentences: List of cleaned sentences
+
+         Returns:
+             Similarity matrix (numpy array) of shape [n_sentences, n_sentences]
+         """
+         n_sentences = len(sentences)
+         similarity_matrix = np.zeros((n_sentences, n_sentences))
+
+         # Track non-empty sentences by index so rows and columns stay aligned
+         # with the original sentence order (empty ones keep zero similarity)
+         valid_indices = [i for i, s in enumerate(sentences) if s.strip()]
+         if len(valid_indices) < 2:
+             return similarity_matrix
+
+         try:
+             # Create TF-IDF vectors
+             vectorizer = TfidfVectorizer(
+                 max_features=1000,  # Limit features for efficiency
+                 ngram_range=(1, 2)  # Use unigrams and bigrams
+             )
+             tfidf_matrix = vectorizer.fit_transform(
+                 [sentences[i] for i in valid_indices]
+             )
+
+             # Calculate cosine similarity and write it back at the original indices
+             similarity_matrix[np.ix_(valid_indices, valid_indices)] = cosine_similarity(tfidf_matrix)
+
+             # Set diagonal to 0 (ignore self-similarity)
+             np.fill_diagonal(similarity_matrix, 0)
+
+             return similarity_matrix
+
+         except ValueError as e:
+             logger.error(f"Error building similarity matrix: {e}")
+             return np.zeros((n_sentences, n_sentences))
165
+
166
+ def calculate_pagerank(self, similarity_matrix: np.ndarray) -> Dict[int, float]:
167
+ """
168
+ Apply PageRank algorithm to rank sentences
169
+
170
+ PageRank Formula:
171
+ WS(Vi) = (1-d) + d × Σ(wji / Σwjk) × WS(Vj)
172
+
173
+ Where:
174
+ - WS(Vi) = Score of sentence i
175
+ - d = damping factor
176
+ - wji = weight of edge from sentence j to i
177
+
178
+ Args:
179
+ similarity_matrix: Sentence similarity matrix
180
+
181
+ Returns:
182
+ Dictionary mapping sentence index to score
183
+ """
184
+ # Create graph from similarity matrix
185
+ nx_graph = nx.from_numpy_array(similarity_matrix)
186
+
187
+ try:
188
+ # Calculate PageRank scores
189
+ scores = nx.pagerank(
190
+ nx_graph,
191
+ alpha=self.damping, # damping factor
192
+ max_iter=self.max_iter,
193
+ tol=self.tol
194
+ )
195
+ return scores
196
+
197
+ except Exception as e:
198
+ logger.error(f"PageRank calculation failed: {e}")
199
+ # Return uniform scores as fallback
200
+ n_nodes = similarity_matrix.shape[0]
201
+ return {i: 1.0/n_nodes for i in range(n_nodes)}
202
+
203
+ def summarize(self,
204
+ text: str,
205
+ num_sentences: Optional[int] = None,
206
+ return_scores: bool = False) -> str:
207
+ """
208
+ Generate extractive summary using TextRank
209
+
210
+ Args:
211
+ text: Input text to summarize
212
+ num_sentences: Number of sentences in summary (overrides ratio)
213
+ return_scores: If True, return tuple of (summary, scores)
214
+
215
+ Returns:
216
+ Summary string, or tuple of (summary, scores) if return_scores=True
217
+ """
218
+ # Validate input
219
+ self.validate_input(text)
220
+
221
+ # Preprocess
222
+ original_sentences, cleaned_sentences = self.preprocess(text)
223
+
224
+ # Edge cases
225
+ if len(original_sentences) == 0:
226
+ return "" if not return_scores else ("", {})
227
+ if len(original_sentences) == 1:
228
+ summary = original_sentences[0]
229
+ return summary if not return_scores else (summary, {0: 1.0})
230
+
231
+ # Build similarity matrix
232
+ similarity_matrix = self.build_similarity_matrix(cleaned_sentences)
233
+
234
+ # Calculate sentence scores using PageRank
235
+ scores = self.calculate_pagerank(similarity_matrix)
236
+
237
+ # Determine number of sentences for summary
238
+ if num_sentences is None:
239
+ num_sentences = max(1, int(len(original_sentences) * self.summary_ratio))
240
+ num_sentences = max(1, min(num_sentences, len(original_sentences)))
241
+
242
+ # Rank sentences by score
243
+ ranked_sentences = sorted(
244
+ ((scores[i], i, s) for i, s in enumerate(original_sentences)),
245
+ reverse=True
246
+ )
247
+
248
+ # Select top sentences and maintain original order
249
+ top_sentences = sorted(
250
+ ranked_sentences[:num_sentences],
251
+ key=lambda x: x[1] # Sort by original position
252
+ )
253
+
254
+ # Build summary
255
+ summary = ' '.join([sent for _, _, sent in top_sentences])
256
+
257
+ if return_scores:
258
+ return summary, {
259
+ 'sentence_scores': scores,
260
+ 'selected_indices': [idx for _, idx, _ in top_sentences],
261
+ 'num_sentences_original': len(original_sentences),
262
+ 'num_sentences_summary': num_sentences
263
+ }
264
+
265
+ return summary
266
+
267
+ def get_sentence_importance(self, text: str) -> List[Tuple[str, float]]:
268
+ """
269
+ Get all sentences with their importance scores
270
+
271
+ Args:
272
+ text: Input text
273
+
274
+ Returns:
275
+ List of (sentence, score) tuples sorted by importance
276
+ """
277
+ original_sentences, cleaned_sentences = self.preprocess(text)
278
+
279
+ if len(original_sentences) < 2:
280
+ return [(s, 1.0) for s in original_sentences]
281
+
282
+ similarity_matrix = self.build_similarity_matrix(cleaned_sentences)
283
+ scores = self.calculate_pagerank(similarity_matrix)
284
+
285
+ # Combine sentences with scores
286
+ sentence_importance = [
287
+ (original_sentences[i], scores[i])
288
+ for i in range(len(original_sentences))
289
+ ]
290
+
291
+ # Sort by importance
292
+ sentence_importance.sort(key=lambda x: x[1], reverse=True)
293
+
294
+ return sentence_importance
295
+
296
+ def get_model_info(self) -> Dict:
297
+ """Return detailed model information"""
298
+ info = super().get_model_info()
299
+ info.update({
300
+ 'algorithm': 'Graph-based PageRank',
301
+ 'parameters': {
302
+ 'damping_factor': self.damping,
303
+ 'max_iterations': self.max_iter,
304
+ 'tolerance': self.tol,
305
+ 'summary_ratio': self.summary_ratio,
306
+ 'min_sentence_length': self.min_sentence_length
307
+ },
308
+ 'complexity': 'O(V²) where V = number of sentences',
309
+ 'advantages': [
310
+ 'Fast and efficient',
311
+ 'No training required',
312
+ 'Language-agnostic',
313
+ 'Interpretable results'
314
+ ],
315
+ 'limitations': [
316
+ 'Cannot generate new sentences',
317
+ 'Limited semantic understanding',
318
+ 'May miss context'
319
+ ]
320
+ })
321
+ return info
322
+
323
+
324
+ # Test the implementation
325
+ if __name__ == "__main__":
326
+ # Sample academic text
327
+ sample_text = """
328
+ Artificial intelligence has become one of the most transformative technologies
329
+ of the 21st century. Machine learning, a subset of AI, enables computers to
330
+ learn from data without explicit programming. Deep learning uses neural networks
331
+ with multiple layers to process complex patterns. Natural language processing
332
+ allows machines to understand and generate human language. Computer vision enables
333
+ machines to interpret visual information from the world. AI applications span
334
+ healthcare, finance, education, transportation, and entertainment. Ethical
335
+ considerations around AI include privacy, bias, and job displacement. The future
336
+ of AI promises both unprecedented opportunities and significant challenges that
337
+ society must navigate carefully.
338
+ """
339
+
340
+ # Initialize summarizer
341
+ summarizer = TextRankSummarizer(summary_ratio=0.3)
342
+
343
+ print("=" * 70)
344
+ print("TEXTRANK SUMMARIZER - PROFESSIONAL TEST")
345
+ print("=" * 70)
346
+
347
+ # Generate summary with metrics
348
+ result = summarizer.summarize_with_metrics(sample_text)
349
+
350
+ print(f"\nModel: {result['metadata']['model_name']}")
351
+ print(f"Type: {result['metadata']['model_type']}")
352
+ print(f"Input Length: {result['metadata']['input_length']} words")
353
+ print(f"Summary Length: {result['metadata']['summary_length']} words")
354
+ print(f"Compression Ratio: {result['metadata']['compression_ratio']:.2%}")
355
+ print(f"Processing Time: {result['metadata']['processing_time']:.4f} seconds")
356
+
357
+ print(f"\n{'Summary:':-^70}")
358
+ print(result['summary'])
359
+
360
+ print(f"\n{'Sentence Importance Ranking:':-^70}")
361
+ importance = summarizer.get_sentence_importance(sample_text)
362
+ for i, (sent, score) in enumerate(importance[:5], 1):
363
+ print(f"{i}. [Score: {score:.4f}] {sent[:80]}...")
364
+
365
+ print("\n" + "=" * 70)
366
+ print(summarizer.get_model_info())
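The `calculate_pagerank` docstring above quotes the TextRank update rule `WS(Vi) = (1-d) + d × Σ(wji / Σwjk) × WS(Vj)` but delegates the actual computation to `nx.pagerank`. As a rough illustration of what that rule does, here is a dependency-free power-iteration sketch on a toy similarity matrix. The function name `weighted_pagerank` and the toy weights are assumptions made for this example; note also that NetworkX normalizes scores to sum to 1, whereas this unnormalized form sums to the number of nodes, so absolute values differ while the ranking idea is the same.

```python
from typing import List


def weighted_pagerank(weights: List[List[float]],
                      damping: float = 0.85,
                      max_iter: int = 100,
                      tol: float = 1e-6) -> List[float]:
    """Iterate WS(Vi) = (1 - d) + d * sum_j (w_ji / sum_k w_jk) * WS(Vj)."""
    n = len(weights)
    # Total outgoing weight of each node; zero for isolated sentences.
    out_sums = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional
            # to the weight of the edge j -> i.
            incoming = sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n)
                if out_sums[j] > 0
            )
            new_scores.append((1.0 - damping) + damping * incoming)
        converged = max(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores


# Toy symmetric similarity matrix: sentence 1 overlaps with both others.
similarity = [
    [0.0, 0.6, 0.1],
    [0.6, 0.0, 0.5],
    [0.1, 0.5, 0.0],
]
scores = weighted_pagerank(similarity)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Sentences that are strongly connected to many others accumulate the most score, which is why `summarize` picks the top-ranked indices and then restores their original order.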
requirements.txt ADDED
@@ -0,0 +1,38 @@
1
+ # Core Dependencies
2
+ torch>=2.0.1
3
+ transformers>=4.30.2
4
+ datasets>=2.14.0
5
+
6
+ # NLP Libraries
7
+ nltk>=3.8.1
8
+ rouge-score>=0.1.2
9
+ sentencepiece>=0.1.99
10
+
11
+ # Scientific Computing
12
+ numpy>=1.24.3
13
+ pandas>=2.0.3
14
+ scipy>=1.11.1
15
+ scikit-learn>=1.3.0
16
+
17
+ # Web Framework
18
+ flask>=2.3.0
19
+ gunicorn>=21.2.0
20
+
21
+ # File Processing
22
+ PyPDF2>=3.0.0
23
+ python-docx>=0.8.11
24
+
25
+ # Visualization
26
+ networkx>=3.1
27
+ matplotlib>=3.7.2
28
+ seaborn>=0.12.2
29
+ plotly>=5.15.0
30
+
31
+ # Utilities
32
+ tqdm>=4.65.0
33
+ python-dotenv>=1.0.0
34
+
35
+ # Development & Testing
36
+ pytest>=7.4.0
37
+ pytest-cov>=4.1.0
38
+ sphinx>=7.0.1
utils/__init__.py ADDED
@@ -0,0 +1,8 @@
1
+ """
2
+ Utils package for text summarization utilities
3
+ Contains data loading, evaluation, preprocessing, and visualization tools
4
+ """
5
+
6
+ # Package-level imports can be added here if needed
7
+ __all__ = []
8
+
utils/data_loader.py ADDED
@@ -0,0 +1,384 @@
1
+ """
2
+ Data Loading and Management System
3
+ Handles CNN/DailyMail dataset loading, preprocessing, and sample management
4
+ """
5
+
6
+ import json
7
+ import os
8
+ from typing import Dict, List, Optional, Union
9
+ import logging
10
+ from pathlib import Path
11
+ import pandas as pd
12
+
13
+ try:
14
+ from datasets import load_dataset
15
+ DATASETS_AVAILABLE = True
16
+ except ImportError:
17
+ DATASETS_AVAILABLE = False
18
+ print("Warning: datasets library not available. Install with: pip install datasets")
19
+
20
+ logger = logging.getLogger(__name__)
21
+
22
+
23
+ class DataLoader:
24
+ """
25
+ Professional data loading system for summarization datasets.
26
+
27
+ Features:
28
+ - CNN/DailyMail dataset loading
29
+ - Sample management and caching
30
+ - Data preprocessing and validation
31
+ - Export/import functionality
32
+ """
33
+
34
+ def __init__(self, cache_dir: Optional[str] = None):
35
+ """
36
+ Initialize DataLoader
37
+
38
+ Args:
39
+ cache_dir: Directory for caching datasets
40
+ """
41
+ self.cache_dir = cache_dir or "./data/cache"
42
+ os.makedirs(self.cache_dir, exist_ok=True)
43
+ logger.info(f"DataLoader initialized with cache dir: {self.cache_dir}")
44
+
45
+ def load_cnn_dailymail(self,
46
+ split: str = "test",
47
+ num_samples: Optional[int] = None,
48
+ version: str = "3.0.0") -> List[Dict]:
49
+ """
50
+ Load CNN/DailyMail dataset
51
+
52
+ Args:
53
+ split: Dataset split ('train', 'validation', 'test')
54
+ num_samples: Number of samples to load (None for all)
55
+ version: Dataset version
56
+
57
+ Returns:
58
+ List of dictionaries with 'article' and 'reference_summary' keys
59
+ """
60
+ if not DATASETS_AVAILABLE:
61
+ logger.error("datasets library not available")
62
+ return self._load_sample_data()
63
+
64
+ logger.info(f"Loading CNN/DailyMail {split} split (version {version})")
65
+
66
+ try:
67
+ # Load dataset
68
+ dataset = load_dataset('abisee/cnn_dailymail', version, split=split)
69
+
70
+ # Limit samples if requested
71
+ if num_samples:
72
+ dataset = dataset.select(range(min(num_samples, len(dataset))))
73
+
74
+ # Convert to our format
75
+ data = []
76
+ for item in dataset:
77
+ data.append({
78
+ 'article': item['article'],
79
+ 'reference_summary': item['highlights'],
80
+ 'id': item.get('id', len(data))
81
+ })
82
+
83
+ logger.info(f"Loaded {len(data)} samples from CNN/DailyMail")
84
+ return data
85
+
86
+ except Exception as e:
87
+ logger.error(f"Failed to load CNN/DailyMail: {e}")
88
+ return self._load_sample_data()
89
+
90
+ def _load_sample_data(self) -> List[Dict]:
91
+ """Load built-in sample data when the datasets library is unavailable or loading fails"""
92
+ logger.info("Loading built-in sample data")
93
+
94
+ return [
95
+ {
96
+ 'article': """
97
+ Artificial intelligence has revolutionized modern technology in unprecedented ways.
98
+ Machine learning algorithms enable computers to learn from vast amounts of data without
99
+ explicit programming. Deep learning neural networks, inspired by the human brain, can
100
+ now recognize patterns in images, understand natural language, and even generate creative
101
+ content. Natural language processing has advanced to the point where AI systems can
102
+ engage in human-like conversations, translate between languages in real-time, and
103
+ summarize lengthy documents automatically. Computer vision technology allows machines
104
+ to interpret and understand visual information from the world, powering applications
105
+ from autonomous vehicles to medical diagnosis systems. The integration of AI across
106
+ industries has improved efficiency, accuracy, and decision-making capabilities.
107
+ Healthcare providers use AI to detect diseases earlier and recommend personalized
108
+ treatments. Financial institutions employ machine learning for fraud detection and
109
+ algorithmic trading. Manufacturing companies utilize AI-powered robots for precision
110
+ tasks and quality control. Despite these advances, challenges remain in areas such as
111
+ algorithmic bias, data privacy, interpretability of AI decisions, and the ethical
112
+ implications of autonomous systems.
113
+ """,
114
+ 'reference_summary': "AI has transformed technology through machine learning, deep learning, and NLP. Applications span healthcare, finance, and manufacturing, though challenges like bias and privacy remain.",
115
+ 'id': 1
116
+ },
117
+ {
118
+ 'article': """
119
+ Climate change represents one of the most pressing challenges facing humanity in the
120
+ 21st century. Global temperatures have risen significantly over the past century,
121
+ primarily due to increased greenhouse gas emissions from human activities. The burning
122
+ of fossil fuels for energy, deforestation, and industrial processes have released
123
+ enormous amounts of carbon dioxide and methane into the atmosphere. These greenhouse
124
+ gases trap heat, leading to a warming effect known as the greenhouse effect. The
125
+ consequences of climate change are already visible worldwide. Polar ice caps and
126
+ glaciers are melting at alarming rates, contributing to rising sea levels that threaten
127
+ coastal communities. Extreme weather events, including hurricanes, droughts, floods,
128
+ and heat waves, have become more frequent and intense. Changes in precipitation patterns
129
+ affect agriculture and water supplies, potentially leading to food insecurity. Ocean
130
+ acidification, caused by increased absorption of carbon dioxide, threatens marine
131
+ ecosystems and the communities that depend on them. Many species face extinction as
132
+ their habitats change faster than they can adapt.
133
+ """,
134
+ 'reference_summary': "Climate change, driven by greenhouse gas emissions, causes rising temperatures, melting ice caps, extreme weather, and threatens ecosystems and human communities worldwide.",
135
+ 'id': 2
136
+ },
137
+ {
138
+ 'article': """
139
+ Space exploration has captured human imagination for decades and continues to push the
140
+ boundaries of what's possible. Since the first satellite launch in 1957 and the moon
141
+ landing in 1969, humanity has made remarkable progress in understanding our universe.
142
+ Modern space agencies like NASA, ESA, and private companies like SpaceX have developed
143
+ advanced technologies for space travel. The International Space Station serves as a
144
+ permanent laboratory orbiting Earth, enabling research in microgravity conditions.
145
+ Robotic missions have explored nearly every planet in our solar system, sending back
146
+ invaluable data about planetary geology, atmospheres, and potential for life. Mars has
147
+ been particularly exciting, with rovers like Curiosity and Perseverance analyzing soil
148
+ samples and searching for signs of ancient microbial life. Space telescopes such as
149
+ Hubble and James Webb have revolutionized astronomy, capturing images of distant
150
+ galaxies and helping scientists understand the universe's origins. Commercial space
151
+ flight is becoming reality, with companies developing reusable rockets and planning
152
+ tourist trips to orbit.
153
+ """,
154
+ 'reference_summary': "Space exploration has advanced from early satellites to modern missions exploring planets, operating space stations, and developing commercial spaceflight capabilities.",
155
+ 'id': 3
156
+ }
157
+ ]
158
+
159
+ def save_samples(self, data: List[Dict], filename: str) -> bool:
160
+ """
161
+ Save samples to JSON file
162
+
163
+ Args:
164
+ data: List of sample dictionaries
165
+ filename: Output filename
166
+
167
+ Returns:
168
+ Success status
169
+ """
170
+ try:
171
+ # Ensure directory exists
172
+ filepath = Path(filename)
173
+ filepath.parent.mkdir(parents=True, exist_ok=True)
174
+
175
+ with open(filename, 'w', encoding='utf-8') as f:
176
+ json.dump(data, f, indent=2, ensure_ascii=False)
177
+
178
+ logger.info(f"Saved {len(data)} samples to {filename}")
179
+ return True
180
+
181
+ except Exception as e:
182
+ logger.error(f"Failed to save samples: {e}")
183
+ return False
184
+
185
+ def load_samples(self, filename: str) -> List[Dict]:
186
+ """
187
+ Load samples from JSON file
188
+
189
+ Args:
190
+ filename: Input filename
191
+
192
+ Returns:
193
+ List of sample dictionaries
194
+ """
195
+ try:
196
+ with open(filename, 'r', encoding='utf-8') as f:
197
+ data = json.load(f)
198
+
199
+ logger.info(f"Loaded {len(data)} samples from {filename}")
200
+ return data
201
+
202
+ except FileNotFoundError:
203
+ logger.warning(f"File not found: {filename}")
204
+ return []
205
+ except Exception as e:
206
+ logger.error(f"Failed to load samples: {e}")
207
+ return []
208
+
209
+ def validate_data(self, data: List[Dict]) -> Dict:
210
+ """
211
+ Validate dataset structure and content
212
+
213
+ Args:
214
+ data: List of sample dictionaries
215
+
216
+ Returns:
217
+ Validation report
218
+ """
219
+ report = {
220
+ 'total_samples': len(data),
221
+ 'valid_samples': 0,
222
+ 'issues': []
223
+ }
224
+
225
+ required_keys = ['article', 'reference_summary']
226
+
227
+ for i, sample in enumerate(data):
228
+ # Check required keys
229
+ missing_keys = [key for key in required_keys if key not in sample]
230
+ if missing_keys:
231
+ report['issues'].append(f"Sample {i}: Missing keys {missing_keys}")
232
+ continue
233
+
234
+ # Check content
235
+ if not sample['article'] or not sample['reference_summary']:
236
+ report['issues'].append(f"Sample {i}: Empty content")
237
+ continue
238
+
239
+ # Check lengths
240
+ article_words = len(sample['article'].split())
241
+ summary_words = len(sample['reference_summary'].split())
242
+
243
+ if article_words < 10:
244
+ report['issues'].append(f"Sample {i}: Article too short ({article_words} words)")
245
+ continue
246
+
247
+ if summary_words < 3:
248
+ report['issues'].append(f"Sample {i}: Summary too short ({summary_words} words)")
249
+ continue
250
+
251
+ report['valid_samples'] += 1
252
+
253
+ report['validity_rate'] = report['valid_samples'] / report['total_samples'] if report['total_samples'] > 0 else 0
254
+
255
+ logger.info(f"Validation: {report['valid_samples']}/{report['total_samples']} valid samples")
256
+ return report
257
+
+     def get_statistics(self, data: List[Dict]) -> Dict:
+         """
+         Get dataset statistics
+
+         Args:
+             data: List of sample dictionaries
+
+         Returns:
+             Statistics dictionary
+         """
+         if not data:
+             return {}
+
+         # Local import keeps this method self-contained; median() averages
+         # the two middle values for even-sized inputs
+         from statistics import median
+
+         article_lengths = [len(sample['article'].split()) for sample in data]
+         summary_lengths = [len(sample['reference_summary'].split()) for sample in data]
+         # Skip empty articles to avoid division by zero
+         compression_ratios = [s / a for a, s in zip(article_lengths, summary_lengths) if a > 0]
+
+         stats = {
+             'total_samples': len(data),
+             'article_stats': {
+                 'mean_length': sum(article_lengths) / len(article_lengths),
+                 'min_length': min(article_lengths),
+                 'max_length': max(article_lengths),
+                 'median_length': median(article_lengths)
+             },
+             'summary_stats': {
+                 'mean_length': sum(summary_lengths) / len(summary_lengths),
+                 'min_length': min(summary_lengths),
+                 'max_length': max(summary_lengths),
+                 'median_length': median(summary_lengths)
+             },
+             'compression_stats': {
+                 # Fall back to 0.0 when every article is empty
+                 'mean_ratio': sum(compression_ratios) / len(compression_ratios) if compression_ratios else 0.0,
+                 'min_ratio': min(compression_ratios, default=0.0),
+                 'max_ratio': max(compression_ratios, default=0.0)
+             }
+         }
+
+         return stats
297
+
298
+ def export_to_csv(self, data: List[Dict], filename: str) -> bool:
299
+ """
300
+ Export data to CSV format
301
+
302
+ Args:
303
+ data: List of sample dictionaries
304
+ filename: Output CSV filename
305
+
306
+ Returns:
307
+ Success status
308
+ """
309
+ try:
310
+ df = pd.DataFrame(data)
311
+ df.to_csv(filename, index=False, encoding='utf-8')
312
+ logger.info(f"Exported {len(data)} samples to {filename}")
313
+ return True
314
+ except Exception as e:
315
+ logger.error(f"Failed to export CSV: {e}")
316
+ return False
317
+
318
+ def create_sample_dataset(self,
319
+ full_data: List[Dict],
320
+ sample_size: int,
321
+ strategy: str = "random") -> List[Dict]:
322
+ """
323
+ Create a sample dataset from full data
324
+
325
+ Args:
326
+ full_data: Complete dataset
327
+ sample_size: Number of samples to select
328
+ strategy: Sampling strategy ('random', 'first', 'balanced')
329
+
330
+ Returns:
331
+ Sampled dataset
332
+ """
333
+ if sample_size >= len(full_data):
334
+ return full_data
335
+
336
+ if strategy == "random":
337
+ import random
338
+ return random.sample(full_data, sample_size)
339
+ elif strategy == "first":
340
+ return full_data[:sample_size]
341
+ elif strategy == "balanced":
342
+ # Try to balance by length
343
+ sorted_data = sorted(full_data, key=lambda x: len(x['article'].split()))
344
+ step = len(sorted_data) // sample_size
345
+ return [sorted_data[i * step] for i in range(sample_size)]
346
+ else:
347
+ return full_data[:sample_size]
348
+
349
+
350
+ # Test the DataLoader
351
+ if __name__ == "__main__":
352
+ print("=" * 60)
353
+ print("DATA LOADER - PROFESSIONAL TEST")
354
+ print("=" * 60)
355
+
356
+ # Initialize loader
357
+ loader = DataLoader()
358
+
359
+ # Load sample data
360
+ data = loader.load_cnn_dailymail(split='test', num_samples=5)
361
+
362
+ print(f"\nLoaded {len(data)} samples")
363
+
364
+ # Validate data
365
+ validation = loader.validate_data(data)
366
+ print(f"Validation: {validation['valid_samples']}/{validation['total_samples']} valid")
367
+
368
+ # Get statistics
369
+ stats = loader.get_statistics(data)
370
+ print(f"\nStatistics:")
371
+ print(f" Article length: {stats['article_stats']['mean_length']:.1f} words (avg)")
372
+ print(f" Summary length: {stats['summary_stats']['mean_length']:.1f} words (avg)")
373
+ print(f" Compression ratio: {stats['compression_stats']['mean_ratio']:.2%}")
374
+
375
+ # Test save/load
376
+ test_file = "test_samples.json"
377
+ if loader.save_samples(data, test_file):
378
+ loaded_data = loader.load_samples(test_file)
379
+ print(f"\nSave/Load test: {len(loaded_data)} samples loaded")
380
+
381
+ # Cleanup
382
+ os.remove(test_file)
383
+
384
+ print("\n" + "=" * 60)
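The `get_statistics` method above uses `sorted(x)[len(x)//2]` as the median, which picks the upper of the two middle values for even-sized inputs, and it divides by `len(compression_ratios)` even if every article turned out to be empty. A minimal sketch of a more defensive variant, assuming plain lists of word counts; the names `length_summary` and `compression_summary` are hypothetical and not part of the original module:

```python
from statistics import mean, median
from typing import Dict, List


def length_summary(lengths: List[int]) -> Dict[str, float]:
    """Summarize a list of word counts; returns {} for empty input."""
    if not lengths:
        return {}
    return {
        "mean_length": mean(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        # statistics.median averages the two middle values for even-sized lists
        "median_length": median(lengths),
    }


def compression_summary(article_lengths: List[int],
                        summary_lengths: List[int]) -> Dict[str, float]:
    """Summary/article length ratios, skipping zero-length articles."""
    ratios = [s / a for a, s in zip(article_lengths, summary_lengths) if a > 0]
    if not ratios:
        # Nothing to aggregate; avoid dividing by zero
        return {}
    return {
        "mean_ratio": mean(ratios),
        "min_ratio": min(ratios),
        "max_ratio": max(ratios),
    }
```

Returning an empty dictionary mirrors what `get_statistics` already does for empty input, so callers would need only one kind of "no data" check.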
utils/evaluator.py ADDED
@@ -0,0 +1,394 @@
1
+ """
2
+ Comprehensive Evaluation System for Summarization Models
3
+ Implements ROUGE metrics, comparison analysis, and statistical testing
4
+ """
5
+
6
+ # Handle different rouge library installations
7
+ try:
8
+ from rouge import Rouge
9
+ ROUGE_AVAILABLE = True
10
+ ROUGE_TYPE = "rouge"
11
+ except ImportError:
12
+ try:
13
+ from rouge_score import rouge_scorer
14
+ ROUGE_AVAILABLE = True
15
+ ROUGE_TYPE = "rouge_score"
16
+ except ImportError:
17
+ ROUGE_AVAILABLE = False
18
+ ROUGE_TYPE = None
19
+ print("Warning: No ROUGE library found. Install with: pip install rouge-score")
20
+
21
+ import numpy as np
22
+ from typing import Dict, List, Tuple, Optional
23
+ import pandas as pd
24
+ import logging
25
+ from scipy import stats
26
+ import time
27
+
28
+ logger = logging.getLogger(__name__)
29
+
30
+
31
+ class SummarizerEvaluator:
32
+ """
33
+ Professional evaluation system for summarization models.
34
+
35
+ Metrics Implemented:
36
+ - ROUGE-1: Unigram overlap
37
+ - ROUGE-2: Bigram overlap
38
+ - ROUGE-L: Longest common subsequence
39
+ - ROUGE-W: Weighted longest common subsequence (listed for reference; not computed by this class)
40
+
41
+ Additional Analysis:
42
+ - Compression ratio
43
+ - Processing time
44
+ - Statistical significance testing
45
+ - Model comparison
46
+ """
47
+
48
+ def __init__(self):
49
+ """Initialize evaluator with ROUGE scorer"""
50
+ if ROUGE_AVAILABLE:
51
+ if ROUGE_TYPE == "rouge":
52
+ self.rouge = Rouge()
53
+ self.rouge_scorer = None
54
+ else: # rouge_score
55
+ self.rouge = None
56
+ self.rouge_scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
57
+ logger.info(f"Evaluator initialized with {ROUGE_TYPE} library")
58
+ else:
59
+ self.rouge = None
60
+ self.rouge_scorer = None
61
+ logger.warning("ROUGE library not available - only basic metrics will be computed")
62
+
63
+ self.evaluation_history = []
64
+
65
+ def _calculate_rouge_scores(self, generated: str, reference: str) -> Dict:
66
+ """Calculate ROUGE scores using available library"""
67
+ if not ROUGE_AVAILABLE:
68
+ return {
69
+ 'rouge-1': {'f': 0.0, 'p': 0.0, 'r': 0.0},
70
+ 'rouge-2': {'f': 0.0, 'p': 0.0, 'r': 0.0},
71
+ 'rouge-l': {'f': 0.0, 'p': 0.0, 'r': 0.0}
72
+ }
73
+
74
+ if ROUGE_TYPE == "rouge":
75
+ # Original rouge library
76
+ scores = self.rouge.get_scores(generated, reference)[0]
77
+ return scores
78
+ else:
79
+ # rouge_score library
80
+ scores = self.rouge_scorer.score(reference, generated)
81
+ return {
82
+ 'rouge-1': {
83
+ 'f': scores['rouge1'].fmeasure,
84
+ 'p': scores['rouge1'].precision,
85
+ 'r': scores['rouge1'].recall
86
+ },
87
+ 'rouge-2': {
88
+ 'f': scores['rouge2'].fmeasure,
89
+ 'p': scores['rouge2'].precision,
90
+ 'r': scores['rouge2'].recall
91
+ },
92
+ 'rouge-l': {
93
+ 'f': scores['rougeL'].fmeasure,
94
+ 'p': scores['rougeL'].precision,
95
+ 'r': scores['rougeL'].recall
96
+ }
97
+ }
98
+
99
+ def evaluate_single(self,
100
+ generated: str,
101
+ reference: str,
102
+ model_name: str = "Unknown") -> Dict:
103
+ """
104
+ Evaluate a single summary against reference
105
+
106
+ ROUGE Metrics Explained:
107
+ - Precision: What % of generated words are in reference
108
+ - Recall: What % of reference words are in generated
109
+ - F1-Score: Harmonic mean of precision and recall
110
+
111
+ Args:
112
+ generated: Generated summary
113
+ reference: Human reference summary
114
+ model_name: Name of the model
115
+
116
+ Returns:
117
+ Dictionary containing all metrics
118
+ """
119
+ if not generated or not reference:
120
+ logger.warning("Empty summary or reference provided")
121
+ return self._empty_scores()
122
+
123
+ try:
124
+ # Calculate ROUGE scores
125
+ scores = self._calculate_rouge_scores(generated, reference)
126
+
127
+ # Length of the generated summary relative to the reference (not compression of the source text)
128
+ compression_ratio = len(generated.split()) / len(reference.split()) if len(reference.split()) > 0 else 0
129
+
130
+ result = {
131
+ 'model_name': model_name,
132
+ 'rouge_1_f1': scores['rouge-1']['f'],
133
+ 'rouge_1_precision': scores['rouge-1']['p'],
134
+ 'rouge_1_recall': scores['rouge-1']['r'],
135
+ 'rouge_2_f1': scores['rouge-2']['f'],
136
+ 'rouge_2_precision': scores['rouge-2']['p'],
137
+ 'rouge_2_recall': scores['rouge-2']['r'],
138
+ 'rouge_l_f1': scores['rouge-l']['f'],
139
+ 'rouge_l_precision': scores['rouge-l']['p'],
140
+ 'rouge_l_recall': scores['rouge-l']['r'],
141
+ 'compression_ratio': compression_ratio,
142
+ 'generated_length': len(generated.split()),
143
+ 'reference_length': len(reference.split())
144
+ }
145
+
146
+ return result
147
+
148
+ except Exception as e:
149
+ logger.error(f"Error evaluating summary: {e}")
150
+ return self._empty_scores()
151
+
152
+ def _empty_scores(self) -> Dict:
153
+ """Return empty scores for error cases"""
154
+ return {
155
+ 'rouge_1_f1': 0.0,
156
+ 'rouge_1_precision': 0.0,
157
+ 'rouge_1_recall': 0.0,
158
+ 'rouge_2_f1': 0.0,
159
+ 'rouge_2_precision': 0.0,
160
+ 'rouge_2_recall': 0.0,
161
+ 'rouge_l_f1': 0.0,
162
+ 'rouge_l_precision': 0.0,
163
+ 'rouge_l_recall': 0.0,
164
+ 'compression_ratio': 0.0,
165
+ 'generated_length': 0,
166
+ 'reference_length': 0
167
+ }
168
+
169
+ def evaluate_batch(self,
170
+ generated_summaries: List[str],
171
+ reference_summaries: List[str],
172
+ model_name: str = "Unknown") -> Dict:
173
+ """
174
+ Evaluate multiple summaries and aggregate results
175
+
176
+ Args:
177
+ generated_summaries: List of generated summaries
178
+ reference_summaries: List of reference summaries
179
+ model_name: Name of the model
180
+
181
+ Returns:
182
+ Dictionary with aggregated statistics
183
+ """
184
+ if len(generated_summaries) != len(reference_summaries):
185
+ raise ValueError("Generated and reference lists must have same length")
186
+
187
+ logger.info(f"Evaluating {len(generated_summaries)} summaries for {model_name}")
188
+
189
+ results = []
190
+ for gen, ref in zip(generated_summaries, reference_summaries):
191
+ scores = self.evaluate_single(gen, ref, model_name)
192
+ results.append(scores)
193
+
194
+ # Aggregate statistics
195
+ df = pd.DataFrame(results)
196
+
197
+ aggregated = {
198
+ 'model_name': model_name,
199
+ 'num_samples': len(results),
200
+ 'rouge_1_f1_mean': df['rouge_1_f1'].mean(),
201
+ 'rouge_1_f1_std': df['rouge_1_f1'].std(),
202
+ 'rouge_2_f1_mean': df['rouge_2_f1'].mean(),
203
+ 'rouge_2_f1_std': df['rouge_2_f1'].std(),
204
+ 'rouge_l_f1_mean': df['rouge_l_f1'].mean(),
205
+ 'rouge_l_f1_std': df['rouge_l_f1'].std(),
206
+ 'compression_ratio_mean': df['compression_ratio'].mean(),
207
+ 'compression_ratio_std': df['compression_ratio'].std(),
208
+ 'individual_scores': results
209
+ }
210
+
211
+ # Store in history
212
+ self.evaluation_history.append(aggregated)
213
+
214
+ return aggregated
215
+
216
+ def compare_models(self,
217
+ models_dict: Dict,
218
+ test_texts: List[str],
219
+ reference_summaries: List[str],
220
+ **summarize_kwargs) -> pd.DataFrame:
221
+ """
222
+ Compare multiple models on the same dataset
223
+
224
+ Args:
225
+ models_dict: Dictionary {model_name: model_instance}
226
+ test_texts: List of texts to summarize
227
+ reference_summaries: List of reference summaries
228
+ **summarize_kwargs: Additional parameters for summarization
229
+
230
+ Returns:
231
+ DataFrame with comparison results
232
+ """
233
+ logger.info(f"Comparing {len(models_dict)} models on {len(test_texts)} texts")
234
+
235
+ comparison_results = []
236
+
237
+ for model_name, model in models_dict.items():
238
+ logger.info(f"Evaluating {model_name}...")
239
+
240
+ start_time = time.time()
241
+
242
+ # Generate summaries
243
+ generated_summaries = []
244
+ for text in test_texts:
245
+ try:
246
+ summary = model.summarize(text, **summarize_kwargs)
247
+ generated_summaries.append(summary)
248
+ except Exception as e:
249
+ logger.error(f"Error with {model_name}: {e}")
250
+ generated_summaries.append("")
251
+
252
+ total_time = time.time() - start_time
253
+
254
+ # Evaluate
255
+ eval_results = self.evaluate_batch(
256
+ generated_summaries,
257
+ reference_summaries,
258
+ model_name
259
+ )
260
+
261
+ # Add timing information
262
+ eval_results['total_time'] = total_time
263
+ eval_results['avg_time_per_summary'] = total_time / len(test_texts)
264
+
265
+ comparison_results.append(eval_results)
266
+
267
+ # Create comparison DataFrame
268
+ df = pd.DataFrame([
269
+ {
270
+ 'Model': r['model_name'],
271
+ 'ROUGE-1': f"{r['rouge_1_f1_mean']:.4f} ± {r['rouge_1_f1_std']:.4f}",
272
+ 'ROUGE-2': f"{r['rouge_2_f1_mean']:.4f} ± {r['rouge_2_f1_std']:.4f}",
273
+ 'ROUGE-L': f"{r['rouge_l_f1_mean']:.4f} ± {r['rouge_l_f1_std']:.4f}",
274
+ 'Compression': f"{r['compression_ratio_mean']:.2f}x",
275
+ 'Avg Time (s)': f"{r['avg_time_per_summary']:.3f}"
276
+ }
277
+ for r in comparison_results
278
+ ])
279
+
280
+ logger.info("Model comparison completed")
281
+ return df
282
+
283
+ def statistical_significance_test(self,
284
+ model1_scores: List[float],
285
+ model2_scores: List[float],
286
+ test_name: str = "paired t-test") -> Dict:
287
+ """
288
+ Test if difference between models is statistically significant
289
+
290
+ Args:
291
+ model1_scores: Scores from first model
292
+ model2_scores: Scores from second model
293
+ test_name: Type of statistical test
294
+
295
+ Returns:
296
+ Dictionary with test results
297
+ """
298
+ if test_name == "paired t-test":
299
+ statistic, p_value = stats.ttest_rel(model1_scores, model2_scores)
300
+ elif test_name == "wilcoxon":
301
+ statistic, p_value = stats.wilcoxon(model1_scores, model2_scores)
302
+ else:
303
+ raise ValueError(f"Unknown test: {test_name}")
304
+
305
+ is_significant = p_value < 0.05
306
+
307
+ return {
308
+ 'test_name': test_name,
309
+ 'statistic': statistic,
310
+ 'p_value': p_value,
311
+ 'is_significant': is_significant,
312
+ 'significance_level': 0.05,
313
+ 'interpretation': (
314
+ f"The difference is {'statistically significant' if is_significant else 'not statistically significant'} "
315
+ f"(p={p_value:.4f})"
316
+ )
317
+ }
318
+
319
+ def get_detailed_report(self,
320
+ evaluation_result: Dict) -> str:
321
+ """
322
+ Generate a detailed text report
323
+
324
+ Args:
325
+ evaluation_result: Results from evaluate_batch
326
+
327
+ Returns:
328
+ Formatted report string
329
+ """
330
+ report = []
331
+ report.append("=" * 70)
332
+ report.append(f"EVALUATION REPORT: {evaluation_result['model_name']}")
333
+ report.append("=" * 70)
334
+ report.append(f"\nDataset Size: {evaluation_result['num_samples']} samples\n")
335
+
336
+ report.append("ROUGE Scores (F1):")
337
+ report.append(f" ROUGE-1: {evaluation_result['rouge_1_f1_mean']:.4f} (±{evaluation_result['rouge_1_f1_std']:.4f})")
338
+ report.append(f" ROUGE-2: {evaluation_result['rouge_2_f1_mean']:.4f} (±{evaluation_result['rouge_2_f1_std']:.4f})")
339
+ report.append(f" ROUGE-L: {evaluation_result['rouge_l_f1_mean']:.4f} (±{evaluation_result['rouge_l_f1_std']:.4f})")
340
+
341
+ report.append(f"\nCompression Ratio: {evaluation_result['compression_ratio_mean']:.2f}x")
342
+ report.append(f" (Standard Deviation: {evaluation_result['compression_ratio_std']:.2f})")
343
+
344
+ report.append("\n" + "=" * 70)
345
+
346
+ return "\n".join(report)
347
+
348
+ def export_results(self,
349
+ evaluation_result: Dict,
350
+ filename: str = "evaluation_results.json"):
351
+ """
352
+ Export evaluation results to file
353
+
354
+ Args:
355
+ evaluation_result: Results to export
356
+ filename: Output filename
357
+ """
358
+ import json
359
+
360
+ with open(filename, 'w') as f:
361
+ json.dump(evaluation_result, f, indent=2)
362
+
363
+ logger.info(f"Results exported to {filename}")
364
+
365
+
366
+ # Test the evaluator
367
+ if __name__ == "__main__":
368
+ print("=" * 70)
369
+ print("EVALUATOR SYSTEM TEST")
370
+ print("=" * 70)
371
+
372
+ # Sample data
373
+ generated = "Machine learning revolutionizes AI. Neural networks perform complex tasks."
374
+ reference = "Machine learning has transformed artificial intelligence. Deep neural networks can now handle complicated tasks with high accuracy."
375
+
376
+ # Initialize evaluator
377
+ evaluator = SummarizerEvaluator()
378
+
379
+ # Evaluate single summary
380
+ scores = evaluator.evaluate_single(generated, reference, "TestModel")
381
+
382
+ print("\nSingle Summary Evaluation:")
383
+ print(f"ROUGE-1 F1: {scores['rouge_1_f1']:.4f}")
384
+ print(f"ROUGE-2 F1: {scores['rouge_2_f1']:.4f}")
385
+ print(f"ROUGE-L F1: {scores['rouge_l_f1']:.4f}")
386
+ print(f"Compression Ratio: {scores['compression_ratio']:.2f}x")
387
+
388
+ # Test batch evaluation
389
+ generated_list = [generated] * 5
390
+ reference_list = [reference] * 5
391
+
392
+ batch_scores = evaluator.evaluate_batch(generated_list, reference_list, "TestModel")
393
+
394
+ print("\n" + evaluator.get_detailed_report(batch_scores))
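To make the precision, recall, and F1 definitions in the `evaluate_single` docstring concrete, here is a minimal sketch of ROUGE-1 computed by hand from clipped unigram counts. It assumes lowercase whitespace tokenization with no stemming, so its numbers will generally differ from those of the `rouge` or `rouge_score` packages; `rouge1_by_hand` is a hypothetical helper, not part of the evaluator.

```python
from collections import Counter
from typing import Dict


def rouge1_by_hand(generated: str, reference: str) -> Dict[str, float]:
    """Unigram-overlap precision, recall and F1 with naive tokenization."""
    gen_counts = Counter(generated.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: a word counts at most as often as it occurs in both texts
    overlap = sum((gen_counts & ref_counts).values())
    gen_total = sum(gen_counts.values())
    ref_total = sum(ref_counts.values())

    precision = overlap / gen_total if gen_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f1}


scores = rouge1_by_hand("the cat sat", "the cat sat on the mat")
print(scores)
```

In this example every generated word appears in the reference (precision 1.0), but only three of the six reference words are covered (recall 0.5), which is why a short summary can score high precision and still get a modest F1.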
utils/preprocessor.py ADDED
File without changes
utils/visualizer.py ADDED
File without changes
webapp/README.md ADDED
@@ -0,0 +1,158 @@
1
+ # Smart Summarizer Web Application
2
+
3
+ Professional web interface for comparing TextRank, BART, and PEGASUS summarization models.
4
+
5
+ ## Features
6
+
7
+ - **Home**: Overview of the three summarization models
8
+ - **Single Summary**: Generate summaries with individual models
9
+ - **Comparison**: Compare all three models side-by-side
10
+ - **Batch Processing**: Process multiple documents simultaneously
11
+ - **Evaluation**: View ROUGE metric benchmarks and model performance
12
+
13
+ ## Design
14
+
15
+ The UI follows the "Ink Wash" color palette:
16
+ - Charcoal (#4A4A4A)
17
+ - Cool Gray (#CBCBCB)
18
+ - Soft Ivory (#FFFFE3)
19
+ - Slate Blue (#6D8196)
20
+
21
+ ## Running the Application
22
+
23
+ ### 1. Install Dependencies
24
+
25
+ ```bash
26
+ pip install -r requirements.txt
27
+ ```
28
+
29
+ ### 2. Start the Server
30
+
31
+ ```bash
32
+ cd webapp
33
+ python app.py
34
+ ```
35
+
36
+ The application will be available at `http://localhost:7860` by default (`app.py` reads the port from the `PORT` environment variable).
37
+
38
+ ### 3. Test the Routes
39
+
40
+ ```bash
41
+ python test_webapp.py
42
+ ```
43
+
44
+ ## File Structure
45
+
46
+ ```
47
+ webapp/
48
+ ├── app.py # Flask application
49
+ ├── templates/
50
+ │ ├── home.html # Home page
51
+ │ ├── single_summary.html # Single summary page
52
+ │ ├── comparison.html # Model comparison page
53
+ │ ├── batch.html # Batch processing page
54
+ │ └── evaluation.html # Evaluation metrics page
55
+ ├── static/
56
+ │ ├── css/
57
+ │ │ └── style.css # Main stylesheet
58
+ │ └── js/
59
+ │ ├── evaluation.js # Evaluation page logic
60
+ │ └── batch.js # Batch processing logic
61
+ └── uploads/ # Temporary file uploads
62
+ ```
63
+
64
+ ## API Endpoints
65
+
66
+ ### POST /api/summarize
67
+ Generate a summary with a single model. The `model` field accepts `"bart"` (the default), `"textrank"`, or `"pegasus"`.
68
+
69
+ **Request:**
70
+ ```json
71
+ {
72
+ "text": "Your text here...",
73
+ "model": "bart"
74
+ }
75
+ ```
76
+
77
+ **Response:**
78
+ ```json
79
+ {
80
+ "success": true,
81
+ "summary": "Generated summary...",
82
+ "metadata": {
83
+ "model_name": "BART",
84
+ "processing_time": 2.34,
85
+ "compression_ratio": 0.22
86
+ }
87
+ }
88
+ ```
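A minimal client sketch for this endpoint, using only the standard library. It assumes the server runs locally on port 7860 (the default in `app.py`); `BASE_URL`, `build_summarize_request`, and `summarize` are hypothetical names chosen for illustration.

```python
import json
import urllib.request

# Assumption: local server on the default port from app.py
BASE_URL = "http://localhost:7860"


def build_summarize_request(text: str, model: str = "bart") -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/summarize."""
    payload = json.dumps({"text": text, "model": model}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/summarize",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(text: str, model: str = "bart", timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_summarize_request(text, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Note that `urllib.request.urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx responses, so the `"success": false` payloads would have to be read from the exception object rather than from the return value.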
89
+
90
+ ### POST /api/compare
91
+ Compare all three models on the same text.
92
+
93
+ **Request:**
94
+ ```json
95
+ {
96
+ "text": "Your text here..."
97
+ }
98
+ ```
99
+
100
+ **Response:**
101
+ ```json
102
+ {
103
+ "success": true,
104
+ "results": {
105
+ "textrank": { "summary": "...", "metadata": {...} },
106
+ "bart": { "summary": "...", "metadata": {...} },
107
+ "pegasus": { "summary": "...", "metadata": {...} }
108
+ }
109
+ }
110
+ ```
111
+
112
+ ### POST /api/upload
113
+ Upload a file (.txt, .md, .pdf, .docx) and extract text.
114
+
115
+ **Request:** multipart/form-data with file
116
+
117
+ **Response:**
118
+ ```json
119
+ {
120
+ "success": true,
121
+ "text": "Extracted text...",
122
+ "filename": "document.pdf",
123
+ "word_count": 1234
124
+ }
125
+ ```
126
+
127
+ ## Supported File Types
128
+
129
+ - Plain text (.txt, .md)
130
+ - PDF documents (.pdf)
131
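+ - Word documents (.docx; `.doc` uploads pass the extension filter, but python-docx parses only the `.docx` format, so legacy `.doc` files will likely be rejected with a read error)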
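One way to make the supported types explicit is to map each extension to the parser that handles it, so an extension without a working parser is rejected up front. This is a sketch under that assumption, not the logic used in `app.py`; `PARSERS` and `classify_upload` are hypothetical names.

```python
from typing import Optional

# Extension -> parser family. Legacy ".doc" is deliberately absent because
# python-docx reads only the .docx format.
PARSERS = {
    "txt": "text",
    "md": "text",
    "text": "text",
    "pdf": "pdf",
    "docx": "docx",
}


def classify_upload(filename: str) -> Optional[str]:
    """Return the parser family for a filename, or None if unsupported."""
    if "." not in filename:
        return None
    extension = filename.rsplit(".", 1)[1].lower()
    return PARSERS.get(extension)
```

A caller could then return a 400 response whenever `classify_upload` yields `None`, instead of discovering the problem while parsing.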
132
+
133
+ ## Model Information
134
+
135
+ ### TextRank
136
+ - Type: Extractive
137
+ - Algorithm: Graph-based PageRank
138
+ - Speed: Very fast (~0.03s)
139
+ - Best for: Quick summaries, keyword extraction
140
+
141
+ ### BART
142
+ - Type: Abstractive
143
+ - Algorithm: Transformer encoder-decoder
144
+ - Speed: Moderate (~9s on CPU)
145
+ - Best for: Fluent, human-like summaries
146
+
147
+ ### PEGASUS
148
+ - Type: Abstractive
149
+ - Algorithm: Gap Sentence Generation
150
+ - Speed: Moderate (~6s on CPU)
151
+ - Best for: High-quality abstractive summaries
152
+
153
+ ## Notes
154
+
155
+ - Models are loaded lazily (on first use) to reduce startup time
156
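+ - The web app currently loads BART and PEGASUS on CPU (`device='cpu'` in `app.py`); using a GPU requires changing that argument and a CUDA-enabled environment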
157
+ - All models target a similar compression ratio (~22% of the input length) so that comparisons are fair
158
+ - File uploads are limited to 16MB
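The lazy-loading note can be illustrated with a small cache that builds each model the first time it is requested and reuses it afterwards. This is a generic sketch, not the code in `app.py`; `LazyModelCache` and the factory callables are hypothetical stand-ins for the real summarizer constructors.

```python
from typing import Callable, Dict


class LazyModelCache:
    """Create each model on first use and reuse it on later requests."""

    def __init__(self, factories: Dict[str, Callable[[], object]]):
        self._factories = factories
        self._loaded: Dict[str, object] = {}

    def get(self, name: str) -> object:
        if name not in self._factories:
            # Fail with a clear message instead of a bare KeyError
            raise ValueError(f"Unknown model: {name!r}")
        if name not in self._loaded:
            # The expensive constructor runs only once per model name
            self._loaded[name] = self._factories[name]()
        return self._loaded[name]
```

Raising `ValueError` for unknown names lets a request handler turn a bad `model` value into a 400 response rather than a generic 500.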
webapp/app.py ADDED
@@ -0,0 +1,267 @@
1
+ """
2
+ Smart Summarizer - Flask Web Application
3
+ Professional UI matching Figma design
4
+ """
5
+
6
+ from flask import Flask, render_template, request, jsonify
7
+ import sys
8
+ from pathlib import Path
9
+ import os
10
+ from werkzeug.utils import secure_filename
11
+ import PyPDF2
12
+ from docx import Document as DocxDocument
13
+
14
+ # Add project root to path
15
+ sys.path.append(str(Path(__file__).parent.parent))
16
+
17
+ from models.textrank import TextRankSummarizer
18
+ from models.bart import BARTSummarizer
19
+ from models.pegasus import PEGASUSSummarizer
20
+
21
+ app = Flask(__name__)
22
+ app.config['SECRET_KEY'] = os.environ.get('SECRET_KEY', 'your-secret-key-here') # set SECRET_KEY in the environment for production
23
+ app.config['MAX_CONTENT_LENGTH'] = 16 * 1024 * 1024 # 16MB max file size
24
+ app.config['UPLOAD_FOLDER'] = 'uploads'
25
+
26
+ # Create uploads folder if it doesn't exist
27
+ os.makedirs(app.config['UPLOAD_FOLDER'], exist_ok=True)
28
+
29
+ # Allowed file extensions
30
+ ALLOWED_EXTENSIONS = {'txt', 'md', 'text', 'pdf', 'docx', 'doc'}
31
+
32
+ def allowed_file(filename):
33
+ return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS
34
+
35
+ # Initialize models (lazy loading)
36
+ models = {}
37
+
38
+ def get_model(model_name):
39
+ """Load and cache models"""
40
+ if model_name not in models:
41
+ if model_name == "textrank":
42
+ models[model_name] = TextRankSummarizer()
43
+ elif model_name == "bart":
44
+ models[model_name] = BARTSummarizer(device='cpu')
45
+ elif model_name == "pegasus":
46
+ models[model_name] = PEGASUSSummarizer(device='cpu')
47
+ return models[model_name]
48
+
49
+ @app.route('/')
50
+ def home():
51
+ """Home page"""
52
+ return render_template('home.html')
53
+
54
+ @app.route('/single-summary')
55
+ def single_summary():
56
+ """Single summary page"""
57
+ return render_template('single_summary.html')
58
+
59
+ @app.route('/comparison')
60
+ def comparison():
61
+ """Model comparison page"""
62
+ return render_template('comparison.html')
63
+
64
+ @app.route('/batch')
65
+ def batch():
66
+ """Batch processing page"""
67
+ return render_template('batch.html')
68
+
69
+ @app.route('/evaluation')
70
+ def evaluation():
71
+ """Evaluation page"""
72
+ return render_template('evaluation.html')
73
+
74
+ @app.route('/api/summarize', methods=['POST'])
75
+ def summarize():
76
+ """API endpoint for summarization"""
77
+ try:
78
+ data = request.get_json(silent=True) or {} # missing or invalid JSON falls through to the 400 check below
79
+ text = data.get('text', '')
80
+ model_name = data.get('model', 'bart').lower()
81
+
82
+ if not text or len(text.split()) < 10:
83
+ return jsonify({
84
+ 'success': False,
85
+ 'error': 'Please provide at least 10 words of text'
86
+ }), 400
87
+
88
+ # Get model
89
+ model = get_model(model_name)
90
+
91
+ # Calculate target summary length (approximately 20-25% of original)
92
+ input_words = len(text.split())
93
+ target_length = max(30, min(150, int(input_words * 0.22))) # 22% compression
94
+
95
+ # Generate summary based on model type
96
+ if model_name == 'textrank':
97
+ # For TextRank, calculate number of sentences to achieve similar compression
98
+ sentences = text.count('.') + text.count('!') + text.count('?')
99
+ num_sentences = max(2, int(sentences * 0.3)) # ~30% of sentences
100
+ result = model.summarize_with_metrics(text, num_sentences=num_sentences)
101
+ else:
102
+ # For BART and PEGASUS, use word-based limits
103
+ result = model.summarize_with_metrics(
104
+ text,
105
+ max_length=target_length,
106
+ min_length=max(20, int(target_length * 0.5))
107
+ )
108
+
109
+ return jsonify({
110
+ 'success': True,
111
+ 'summary': result['summary'],
112
+ 'metadata': result['metadata']
113
+ })
114
+
115
+ except Exception as e:
116
+ return jsonify({
117
+ 'success': False,
118
+ 'error': str(e)
119
+ }), 500
120
+
121
+ @app.route('/api/compare', methods=['POST'])
122
+ def compare():
123
+ """API endpoint for comparing all three models"""
124
+ try:
125
+ data = request.get_json(silent=True) or {} # missing or invalid JSON falls through to the 400 check below
126
+ text = data.get('text', '')
127
+
128
+ if not text or len(text.split()) < 10:
129
+ return jsonify({
130
+ 'success': False,
131
+ 'error': 'Please provide at least 10 words of text'
132
+ }), 400
133
+
134
+ results = {}
135
+
136
+ # Calculate consistent target length for all models
137
+ input_words = len(text.split())
138
+ target_length = max(30, min(150, int(input_words * 0.22)))
139
+ sentences = text.count('.') + text.count('!') + text.count('?')
140
+ num_sentences = max(2, int(sentences * 0.3))
141
+
142
+ # Run all three models
143
+ for model_name in ['textrank', 'bart', 'pegasus']:
144
+ try:
145
+ model = get_model(model_name)
146
+
147
+ if model_name == 'textrank':
148
+ result = model.summarize_with_metrics(text, num_sentences=num_sentences)
149
+ else:
150
+ result = model.summarize_with_metrics(
151
+ text,
152
+ max_length=target_length,
153
+ min_length=max(20, int(target_length * 0.5))
154
+ )
155
+
156
+ results[model_name] = {
157
+ 'summary': result['summary'],
158
+ 'metadata': result['metadata']
159
+ }
160
+ except Exception as e:
161
+ results[model_name] = {
162
+ 'error': str(e)
163
+ }
164
+
165
+ return jsonify({
166
+ 'success': True,
167
+ 'results': results
168
+ })
169
+
170
+ except Exception as e:
171
+ return jsonify({
172
+ 'success': False,
173
+ 'error': str(e)
174
+ }), 500
175
+
176
+ @app.route('/api/upload', methods=['POST'])
177
+ def upload_file():
178
+ """API endpoint for file upload"""
179
+ try:
180
+ if 'file' not in request.files:
181
+ return jsonify({
182
+ 'success': False,
183
+ 'error': 'No file provided'
184
+ }), 400
185
+
186
+ file = request.files['file']
187
+
188
+ if file.filename == '':
189
+ return jsonify({
190
+ 'success': False,
191
+ 'error': 'No file selected'
192
+ }), 400
193
+
194
+ if not allowed_file(file.filename):
195
+ return jsonify({
196
+ 'success': False,
197
+ 'error': 'Invalid file type. Please upload .txt, .md, .pdf, .docx, or .doc files'
198
+ }), 400
199
+
200
+ # Extract text based on file type
201
+ filename = secure_filename(file.filename)
202
+ file_ext = filename.rsplit('.', 1)[1].lower()
203
+
204
+ try:
205
+ if file_ext in ['txt', 'md', 'text']:
206
+ # Plain text files
207
+ text = file.read().decode('utf-8')
208
+
209
+ elif file_ext == 'pdf':
210
+ # PDF files
211
+ pdf_reader = PyPDF2.PdfReader(file)
212
+ text = ''
213
+ for page in pdf_reader.pages:
214
+ text += page.extract_text() + '\n'
215
+
216
+ elif file_ext in ['docx', 'doc']:
217
+ # Word documents
218
+ doc = DocxDocument(file)
219
+ text = '\n'.join([paragraph.text for paragraph in doc.paragraphs])
220
+
221
+ else:
222
+ return jsonify({
223
+ 'success': False,
224
+ 'error': 'Unsupported file format'
225
+ }), 400
226
+
227
+ except UnicodeDecodeError:
228
+ return jsonify({
229
+ 'success': False,
230
+ 'error': 'File encoding not supported. Please use UTF-8 encoded files'
231
+ }), 400
232
+ except Exception as e:
233
+ return jsonify({
234
+ 'success': False,
235
+ 'error': f'Error reading file: {str(e)}'
236
+ }), 400
237
+
238
+ if not text or len(text.split()) < 10:
239
+ return jsonify({
240
+ 'success': False,
241
+ 'error': 'File content is too short. Please provide at least 10 words'
242
+ }), 400
243
+
244
+ return jsonify({
245
+ 'success': True,
246
+ 'text': text,
247
+ 'filename': filename,
248
+ 'word_count': len(text.split())
249
+ })
250
+
251
+ except Exception as e:
252
+ return jsonify({
253
+ 'success': False,
254
+ 'error': str(e)
255
+ }), 500
256
+
257
+ if __name__ == '__main__':
258
+ import os
259
+
260
+ # Get port from environment variable (Hugging Face Spaces uses 7860)
261
+ port = int(os.environ.get('PORT', 7860))
262
+
263
+ # Check if running in production
264
+ debug = os.environ.get('FLASK_ENV') != 'production'
265
+
266
+ # Bind to 0.0.0.0 for cloud deployment
267
+ app.run(host='0.0.0.0', port=port, debug=debug)
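The length budget that `/api/summarize` and `/api/compare` both compute inline can be read as one pure function: take about 22% of the input word count, clamp it to the range 30 to 150, and use half of that (at least 20) as the minimum. A sketch under those assumptions; `length_budget` is a hypothetical helper, and whether the model wrappers interpret these limits as words or tokens is not shown in this commit.

```python
from typing import Tuple


def length_budget(input_words: int,
                  ratio: float = 0.22,
                  floor: int = 30,
                  ceiling: int = 150) -> Tuple[int, int]:
    """Return (min_length, max_length) for an abstractive summary."""
    # Target roughly `ratio` of the input, clamped to [floor, ceiling]
    max_length = max(floor, min(ceiling, int(input_words * ratio)))
    # Minimum is half the target, but never below 20
    min_length = max(20, int(max_length * 0.5))
    return min_length, max_length


for words in (50, 500, 2000):
    print(words, length_budget(words))
```

Factoring the calculation out this way would keep the two endpoints from drifting apart if the ratio or the clamping bounds change.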
webapp/requirements.txt ADDED
@@ -0,0 +1,3 @@
1
+ Flask==3.0.0
2
+ PyPDF2==3.0.1
3
+ python-docx==1.1.0
webapp/static/css/style.css ADDED
@@ -0,0 +1,880 @@
1
+ /* Smart Summarizer - Main Stylesheet */
2
+ @import url('https://fonts.googleapis.com/css2?family=Playfair+Display:wght@400;600;700&display=swap');
3
+
4
+ /* Ink Wash Color Palette */
5
+ :root {
6
+ --charcoal: #4A4A4A;
7
+ --cool-gray: #CBCBCB;
8
+ --soft-ivory: #FFFFE3;
9
+ --slate-blue: #6D8196;
10
+ --card-bg: #F5F0F6;
11
+ --white: #ffffff;
12
+ }
13
+
14
+ * {
15
+ margin: 0;
16
+ padding: 0;
17
+ box-sizing: border-box;
18
+ }
19
+
20
+ body {
21
+ font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif;
22
+ background: var(--soft-ivory);
23
+ color: var(--charcoal);
24
+ line-height: 1.6;
25
+ }
26
+
27
+ /* Top Navigation Bar */
28
+ .top-navbar {
29
+ background: var(--slate-blue);
30
+ padding: 1rem 3rem;
31
+ display: flex;
32
+ justify-content: space-between;
33
+ align-items: center;
34
+ position: sticky;
35
+ top: 0;
36
+ z-index: 1000;
37
+ box-shadow: 0 2px 10px rgba(74, 74, 74, 0.1);
38
+ }
39
+
40
+ .navbar-logo {
41
+ display: flex;
42
+ align-items: center;
43
+ gap: 0.75rem;
44
+ color: white;
45
+ font-size: 1.1rem;
46
+ font-weight: 600;
47
+ text-decoration: none;
48
+ transition: opacity 0.3s ease;
49
+ }
50
+
51
+ .navbar-logo:hover {
52
+ opacity: 0.9;
53
+ }
54
+
55
+ .logo-circle {
56
+ width: 36px;
57
+ height: 36px;
58
+ background: white;
59
+ color: var(--slate-blue);
60
+ border-radius: 50%;
61
+ display: flex;
62
+ align-items: center;
63
+ justify-content: center;
64
+ font-weight: bold;
65
+ font-size: 1.2rem;
66
+ }
67
+
68
+ .navbar-links {
69
+ display: flex;
70
+ gap: 2.5rem;
71
+ align-items: center;
72
+ }
73
+
74
+ .nav-item {
75
+ color: rgba(255, 255, 255, 0.8);
76
+ font-size: 0.95rem;
77
+ font-weight: 500;
78
+ text-decoration: none;
79
+ transition: color 0.3s ease;
80
+ cursor: pointer;
81
+ display: flex;
82
+ align-items: center;
83
+ gap: 0.5rem;
84
+ }
85
+
86
+ .nav-item i {
87
+ font-size: 0.9rem;
88
+ }
89
+
90
+ .nav-item:hover {
91
+ color: white;
92
+ }
93
+
94
+ .nav-item.active {
95
+ color: white;
96
+ }
97
+
98
+ /* Hero Section */
99
+ .hero-container {
100
+ text-align: center;
101
+ padding: 5rem 2rem 3rem 2rem;
102
+ max-width: 900px;
103
+ margin: 0 auto;
104
+ }
105
+
106
+ .hero-title {
107
+ font-family: 'Playfair Display', serif;
108
+ font-size: 4.5rem;
109
+ font-weight: 400;
110
+ color: var(--charcoal);
111
+ line-height: 1.1;
112
+ margin-bottom: 0.5rem;
113
+ letter-spacing: -0.02em;
114
+ }
115
+
116
+ .hero-subtitle {
117
+ font-family: 'Playfair Display', serif;
118
+ font-size: 4.5rem;
119
+ font-weight: 400;
120
+ color: var(--slate-blue);
121
+ line-height: 1.1;
122
+ margin-bottom: 2rem;
123
+ letter-spacing: -0.02em;
124
+ }
125
+
126
+ .hero-description {
127
+ font-size: 1.1rem;
128
+ color: var(--slate-blue);
129
+ line-height: 1.6;
130
+ margin-bottom: 0.5rem;
131
+ }
132
+
133
+ /* CTA Buttons */
134
+ .cta-container {
135
+ display: flex;
136
+ gap: 1rem;
137
+ justify-content: center;
138
+ margin: 3rem 0 4rem 0;
139
+ }
140
+
141
+ .btn-primary {
142
+ background: var(--charcoal);
143
+ color: white;
144
+ padding: 1rem 2.5rem;
145
+ border: none;
146
+ border-radius: 8px;
147
+ font-size: 1rem;
148
+ font-weight: 500;
149
+ cursor: pointer;
150
+ transition: all 0.3s ease;
151
+ text-decoration: none;
152
+ display: inline-block;
153
+ }
154
+
155
+ .btn-primary:hover {
156
+ background: #3a3a3a;
157
+ transform: translateY(-2px);
158
+ box-shadow: 0 4px 12px rgba(74, 74, 74, 0.3);
159
+ }
160
+
161
+ .btn-secondary {
162
+ background: transparent;
163
+ color: var(--charcoal);
164
+ padding: 1rem 2.5rem;
165
+ border: 1px solid var(--cool-gray);
166
+ border-radius: 8px;
167
+ font-size: 1rem;
168
+ font-weight: 500;
169
+ cursor: pointer;
170
+ transition: all 0.3s ease;
171
+ text-decoration: none;
172
+ display: inline-block;
173
+ }
174
+
175
+ .btn-secondary:hover {
176
+ border-color: var(--slate-blue);
177
+ color: var(--slate-blue);
178
+ }
179
+
180
+ /* Model Cards */
181
+ .models-container {
182
+ max-width: 1100px;
183
+ margin: 0 auto;
184
+ padding: 0 2rem 4rem 2rem;
185
+ }
186
+
187
+ .cards-grid {
188
+ display: grid;
189
+ grid-template-columns: repeat(3, 1fr);
190
+ gap: 2rem;
191
+ }
192
+
193
+ .model-card {
194
+ background: var(--card-bg);
195
+ border-radius: 16px;
196
+ padding: 2.5rem 2rem;
197
+ text-align: left;
198
+ transition: all 0.3s ease;
199
+ border: 1px solid rgba(203, 203, 203, 0.3);
200
+ }
201
+
202
+ .model-card:hover {
203
+ transform: translateY(-4px);
204
+ box-shadow: 0 8px 24px rgba(74, 74, 74, 0.12);
205
+ }
206
+
207
+ .model-emoji {
208
+ font-size: 2.5rem;
209
+ margin-bottom: 1.5rem;
210
+ display: block;
211
+ }
212
+
213
+ .model-name {
214
+ font-size: 1.6rem;
215
+ font-weight: 600;
216
+ color: var(--charcoal);
217
+ margin-bottom: 1rem;
218
+ }
219
+
220
+ .model-desc {
221
+ font-size: 0.95rem;
222
+ color: var(--slate-blue);
223
+ line-height: 1.6;
224
+ }
225
+
226
+ /* Page Container */
227
+ .page-container {
228
+ max-width: 1200px;
229
+ margin: 0 auto;
230
+ padding: 3rem 2rem;
231
+ }
232
+
233
+ .page-title {
234
+ font-family: 'Playfair Display', serif;
235
+ font-size: 2.5rem;
236
+ font-weight: 600;
237
+ color: var(--charcoal);
238
+ margin-bottom: 0.5rem;
239
+ }
240
+
241
+ .page-subtitle {
242
+ font-size: 1.1rem;
243
+ color: var(--slate-blue);
244
+ margin-bottom: 3rem;
245
+ }
246
+
247
+ /* Content Grid */
248
+ .content-grid {
249
+ display: grid;
250
+ grid-template-columns: 1fr 1fr;
251
+ gap: 2rem;
252
+ margin-bottom: 2rem;
253
+ }
254
+
255
+ .input-section, .output-section {
256
+ background: white;
257
+ border-radius: 12px;
258
+ padding: 2rem;
259
+ border: 1px solid rgba(203, 203, 203, 0.3);
260
+ }
261
+
262
+ .section-label {
263
+ font-size: 0.85rem;
264
+ font-weight: 600;
265
+ color: var(--slate-blue);
266
+ text-transform: uppercase;
267
+ letter-spacing: 1px;
268
+ margin-bottom: 1rem;
269
+ }
270
+
271
+ .text-input {
272
+ width: 100%;
273
+ min-height: 300px;
274
+ padding: 1rem;
275
+ border: 1px solid var(--cool-gray);
276
+ border-radius: 8px;
277
+ font-size: 0.95rem;
278
+ font-family: inherit;
279
+ resize: vertical;
280
+ background: #FAFAFA;
281
+ }
282
+
283
+ .text-input:focus {
284
+ outline: none;
285
+ border-color: var(--slate-blue);
286
+ }
287
+
288
+ .char-count {
289
+ display: flex;
290
+ justify-content: space-between;
291
+ margin-top: 0.5rem;
292
+ font-size: 0.85rem;
293
+ color: var(--slate-blue);
294
+ }
295
+
296
+ .output-preview {
297
+ min-height: 300px;
298
+ padding: 2rem;
299
+ border: 2px dashed var(--cool-gray);
300
+ border-radius: 8px;
301
+ display: flex;
302
+ flex-direction: column;
303
+ align-items: center;
304
+ justify-content: center;
305
+ text-align: center;
306
+ color: var(--slate-blue);
307
+ background: var(--soft-ivory);
308
+ }
309
+
310
+ .output-preview .icon {
311
+ font-size: 3rem;
312
+ margin-bottom: 1rem;
313
+ }
314
+
315
+ .output-text {
316
+ width: 100%;
317
+ min-height: 300px;
318
+ padding: 1rem;
319
+ border: 1px solid var(--cool-gray);
320
+ border-radius: 8px;
321
+ font-size: 0.95rem;
322
+ line-height: 1.8;
323
+ background: white;
324
+ }
325
+
326
+ /* Controls */
327
+ .controls-section {
328
+ display: flex;
329
+ gap: 1rem;
330
+ align-items: center;
331
+ margin-top: 2rem;
332
+ }
333
+
334
+ .model-select {
335
+ padding: 0.75rem 1.5rem;
336
+ border: 1px solid var(--cool-gray);
337
+ border-radius: 8px;
338
+ font-size: 0.95rem;
339
+ background: white;
340
+ color: var(--charcoal);
341
+ cursor: pointer;
342
+ }
343
+
344
+ .btn-generate {
345
+ background: var(--charcoal);
346
+ color: white;
347
+ padding: 0.75rem 2rem;
348
+ border: none;
349
+ border-radius: 8px;
350
+ font-size: 0.95rem;
351
+ font-weight: 500;
352
+ cursor: pointer;
353
+ transition: all 0.3s ease;
354
+ }
355
+
356
+ .btn-generate:hover {
357
+ background: #3a3a3a;
358
+ transform: translateY(-2px);
359
+ box-shadow: 0 4px 12px rgba(74, 74, 74, 0.3);
360
+ }
361
+
362
+ .btn-generate:disabled {
363
+ background: var(--cool-gray);
364
+ cursor: not-allowed;
365
+ transform: none;
366
+ }
367
+
368
+ /* Footer */
369
+ .footer {
370
+ background: var(--charcoal);
371
+ color: var(--cool-gray);
372
+ padding: 2rem 3rem;
373
+ margin-top: 4rem;
374
+ display: flex;
375
+ justify-content: space-between;
376
+ align-items: center;
377
+ }
378
+
379
+ .footer-left {
380
+ display: flex;
381
+ align-items: center;
382
+ gap: 0.5rem;
383
+ }
384
+
385
+ .footer-right {
386
+ display: flex;
387
+ gap: 2rem;
388
+ align-items: center;
389
+ }
390
+
391
+ .footer-link {
392
+ color: var(--cool-gray);
393
+ text-decoration: none;
394
+ font-size: 0.9rem;
395
+ transition: color 0.3s ease;
396
+ display: flex;
397
+ align-items: center;
398
+ gap: 0.5rem;
399
+ }
400
+
401
+ .footer-link i {
402
+ font-size: 1rem;
403
+ }
404
+
405
+ .footer-link:hover {
406
+ color: white;
407
+ }
408
+
409
+ /* Loading Spinner */
410
+ .spinner {
411
+ border: 3px solid rgba(109, 129, 150, 0.3);
412
+ border-top: 3px solid var(--slate-blue);
413
+ border-radius: 50%;
414
+ width: 40px;
415
+ height: 40px;
416
+ animation: spin 1s linear infinite;
417
+ margin: 2rem auto;
418
+ }
419
+
420
+ @keyframes spin {
421
+ 0% { transform: rotate(0deg); }
422
+ 100% { transform: rotate(360deg); }
423
+ }
424
+
425
+ /* Responsive */
426
+ @media (max-width: 1024px) {
427
+ .cards-grid {
428
+ grid-template-columns: 1fr;
429
+ }
430
+
431
+ .content-grid {
432
+ grid-template-columns: 1fr;
433
+ }
434
+
435
+ .hero-title, .hero-subtitle {
436
+ font-size: 3rem;
437
+ }
438
+
439
+ .navbar-links {
440
+ gap: 1rem;
441
+ }
442
+ }
443
+
444
+ /* Comparison Page Styles */
445
+ .comparison-input-section {
446
+ background: white;
447
+ border-radius: 12px;
448
+ padding: 2rem;
449
+ border: 1px solid rgba(203, 203, 203, 0.3);
450
+ margin-bottom: 2rem;
451
+ }
452
+
453
+ .comparison-grid {
454
+ display: grid;
455
+ grid-template-columns: repeat(3, 1fr);
456
+ gap: 2rem;
457
+ margin-top: 2rem;
458
+ }
459
+
460
+ .comparison-card {
461
+ background: white;
462
+ border-radius: 12px;
463
+ border: 1px solid rgba(203, 203, 203, 0.3);
464
+ overflow: hidden;
465
+ }
466
+
467
+ .comparison-header {
468
+ display: flex;
469
+ align-items: center;
470
+ gap: 0.75rem;
471
+ padding: 1.5rem;
472
+ background: var(--card-bg);
473
+ border-bottom: 1px solid rgba(203, 203, 203, 0.3);
474
+ }
475
+
476
+ .model-indicator {
477
+ width: 12px;
478
+ height: 12px;
479
+ border-radius: 50%;
480
+ display: inline-block;
481
+ }
482
+
483
+ .comparison-header h3 {
484
+ margin: 0;
485
+ font-size: 1.3rem;
486
+ font-weight: 600;
487
+ color: var(--charcoal);
488
+ }
489
+
490
+ .comparison-result {
491
+ padding: 2rem;
492
+ min-height: 250px;
493
+ display: flex;
494
+ flex-direction: column;
495
+ align-items: center;
496
+ justify-content: center;
497
+ }
498
+
499
+ .awaiting-text {
500
+ color: var(--cool-gray);
501
+ font-size: 0.95rem;
502
+ text-align: center;
503
+ }
504
+
505
+ .summary-content {
506
+ line-height: 1.8;
507
+ color: var(--charcoal);
508
+ margin-bottom: 1.5rem;
509
+ text-align: left;
510
+ width: 100%;
511
+ }
512
+
513
+ .summary-metrics {
514
+ display: flex;
515
+ gap: 1.5rem;
516
+ padding-top: 1rem;
517
+ border-top: 1px solid rgba(203, 203, 203, 0.3);
518
+ width: 100%;
519
+ }
520
+
521
+ .metric-item {
522
+ display: flex;
523
+ flex-direction: column;
524
+ gap: 0.25rem;
525
+ }
526
+
527
+ .metric-label {
528
+ font-size: 0.75rem;
529
+ color: var(--slate-blue);
530
+ text-transform: uppercase;
531
+ letter-spacing: 0.5px;
532
+ font-weight: 600;
533
+ }
534
+
535
+ .metric-value {
536
+ font-size: 1.1rem;
537
+ color: var(--charcoal);
538
+ font-weight: 600;
539
+ }
540
+
541
+ @media (max-width: 1024px) {
542
+ .comparison-grid {
543
+ grid-template-columns: 1fr;
544
+ }
545
+ }
546
+
547
+ /* Input Tabs */
548
+ .input-tabs {
549
+ display: flex;
550
+ gap: 0.5rem;
551
+ margin-bottom: 1rem;
552
+ }
553
+
554
+ .tab-btn {
555
+ padding: 0.75rem 1.5rem;
556
+ border: 1px solid var(--cool-gray);
557
+ background: white;
558
+ color: var(--charcoal);
559
+ border-radius: 8px 8px 0 0;
560
+ cursor: pointer;
561
+ font-size: 0.9rem;
562
+ font-weight: 500;
563
+ transition: all 0.3s ease;
564
+ }
565
+
566
+ .tab-btn:hover {
567
+ background: var(--card-bg);
568
+ }
569
+
570
+ .tab-btn.active {
571
+ background: var(--slate-blue);
572
+ color: white;
573
+ border-color: var(--slate-blue);
574
+ }
575
+
576
+ .tab-content {
577
+ display: none;
578
+ }
579
+
580
+ .tab-content.active {
581
+ display: block;
582
+ }
583
+
584
+ /* File Upload Area */
585
+ .upload-area {
586
+ border: 2px dashed var(--cool-gray);
587
+ border-radius: 8px;
588
+ padding: 3rem 2rem;
589
+ text-align: center;
590
+ cursor: pointer;
591
+ transition: all 0.3s ease;
592
+ background: transparent;
593
+ }
594
+
595
+ .upload-area:hover {
596
+ border-color: var(--slate-blue);
597
+ background: rgba(109, 129, 150, 0.05);
598
+ }
599
+
600
+ .upload-icon {
601
+ font-size: 3rem;
602
+ margin-bottom: 1rem;
603
+ }
604
+
605
+ .upload-hint {
606
+ font-size: 0.85rem;
607
+ color: var(--slate-blue);
608
+ margin-top: 0.5rem;
609
+ }
610
+
611
+ .file-info {
612
+ display: flex;
613
+ justify-content: space-between;
614
+ align-items: center;
615
+ padding: 1rem;
616
+ background: var(--card-bg);
617
+ border-radius: 8px;
618
+ margin-top: 1rem;
619
+ }
620
+
621
+ .btn-remove {
622
+ background: #ef4444;
623
+ color: white;
624
+ border: none;
625
+ padding: 0.5rem 1rem;
626
+ border-radius: 6px;
627
+ cursor: pointer;
628
+ font-size: 0.85rem;
629
+ transition: all 0.3s ease;
630
+ }
631
+
632
+ .btn-remove:hover {
633
+ background: #dc2626;
634
+ }
635
+
636
+ /* Evaluation Page Styles */
637
+ .evaluation-grid {
638
+ display: grid;
639
+ grid-template-columns: 1fr 1fr;
640
+ gap: 2rem;
641
+ margin-top: 2rem;
642
+ }
643
+
644
+ .chart-section {
645
+ background: white;
646
+ border-radius: 12px;
647
+ padding: 2rem;
648
+ border: 1px solid rgba(203, 203, 203, 0.3);
649
+ }
650
+
651
+ .chart-container {
652
+ width: 100%;
653
+ }
654
+
655
+ .chart-title {
656
+ font-size: 1.3rem;
657
+ font-weight: 600;
658
+ color: var(--charcoal);
659
+ margin-bottom: 2rem;
660
+ text-align: center;
661
+ }
662
+
663
+ .metrics-explanation {
664
+ display: flex;
665
+ flex-direction: column;
666
+ gap: 1.5rem;
667
+ }
668
+
669
+ .section-title {
670
+ font-size: 1.3rem;
671
+ font-weight: 600;
672
+ color: var(--charcoal);
673
+ margin-bottom: 0.5rem;
674
+ }
675
+
676
+ .metric-card {
677
+ background: white;
678
+ border-radius: 12px;
679
+ padding: 1.5rem;
680
+ border: 1px solid rgba(203, 203, 203, 0.3);
681
+ }
682
+
683
+ .metric-header {
684
+ display: flex;
685
+ align-items: center;
686
+ gap: 0.75rem;
687
+ margin-bottom: 0.75rem;
688
+ }
689
+
690
+ .metric-indicator {
691
+ width: 12px;
692
+ height: 12px;
693
+ border-radius: 50%;
694
+ }
695
+
696
+ .metric-card h4 {
697
+ font-size: 1.1rem;
698
+ font-weight: 600;
699
+ color: var(--charcoal);
700
+ margin: 0;
701
+ }
702
+
703
+ .metric-card p {
704
+ font-size: 0.9rem;
705
+ color: var(--slate-blue);
706
+ line-height: 1.6;
707
+ margin: 0;
708
+ }
709
+
710
+ .insight-box {
711
+ background: var(--charcoal);
712
+ color: white;
713
+ border-radius: 12px;
714
+ padding: 1.5rem;
715
+ margin-top: 1rem;
716
+ }
717
+
718
+ .insight-box h4 {
719
+ font-size: 0.85rem;
720
+ font-weight: 600;
721
+ letter-spacing: 1px;
722
+ margin-bottom: 0.75rem;
723
+ color: var(--cool-gray);
724
+ }
725
+
726
+ .insight-box p {
727
+ font-size: 0.95rem;
728
+ line-height: 1.6;
729
+ margin: 0;
730
+ font-style: italic;
731
+ }
732
+
733
+ /* Batch Page Styles */
734
+ .batch-controls {
735
+ display: flex;
736
+ gap: 1rem;
737
+ justify-content: flex-end;
738
+ margin-bottom: 2rem;
739
+ }
740
+
741
+ .batch-table-container {
742
+ background: white;
743
+ border-radius: 12px;
744
+ border: 1px solid rgba(203, 203, 203, 0.3);
745
+ overflow: hidden;
746
+ margin-bottom: 2rem;
747
+ }
748
+
749
+ .batch-table {
750
+ width: 100%;
751
+ border-collapse: collapse;
752
+ }
753
+
754
+ .batch-table thead {
755
+ background: var(--card-bg);
756
+ }
757
+
758
+ .batch-table th {
759
+ padding: 1rem 1.5rem;
760
+ text-align: left;
761
+ font-size: 0.75rem;
762
+ font-weight: 600;
763
+ color: var(--slate-blue);
764
+ text-transform: uppercase;
765
+ letter-spacing: 1px;
766
+ border-bottom: 1px solid rgba(203, 203, 203, 0.3);
767
+ }
768
+
769
+ .batch-table td {
770
+ padding: 1.5rem;
771
+ border-bottom: 1px solid rgba(203, 203, 203, 0.1);
772
+ color: var(--charcoal);
773
+ }
774
+
775
+ .batch-table tbody tr:hover {
776
+ background: rgba(109, 129, 150, 0.03);
777
+ }
778
+
779
+ .empty-state {
780
+ text-align: center;
781
+ }
782
+
783
+ .empty-message {
784
+ padding: 4rem 2rem;
785
+ color: var(--cool-gray);
786
+ font-size: 1rem;
787
+ }
788
+
789
+ .source-preview {
790
+ max-width: 400px;
791
+ overflow: hidden;
792
+ text-overflow: ellipsis;
793
+ white-space: nowrap;
794
+ color: var(--charcoal);
795
+ }
796
+
797
+ .model-badges {
798
+ display: flex;
799
+ gap: 0.5rem;
800
+ flex-wrap: wrap;
801
+ }
802
+
803
+ .model-badge {
804
+ padding: 0.25rem 0.75rem;
805
+ border-radius: 6px;
806
+ font-size: 0.8rem;
807
+ font-weight: 500;
808
+ background: var(--card-bg);
809
+ color: var(--charcoal);
810
+ border: 1px solid rgba(203, 203, 203, 0.3);
811
+ }
812
+
813
+ .status-badge {
814
+ padding: 0.5rem 1rem;
815
+ border-radius: 6px;
816
+ font-size: 0.85rem;
817
+ font-weight: 500;
818
+ display: inline-block;
819
+ }
820
+
821
+ .status-pending {
822
+ background: rgba(203, 203, 203, 0.2);
823
+ color: var(--slate-blue);
824
+ }
825
+
826
+ .status-processing {
827
+ background: rgba(109, 129, 150, 0.2);
828
+ color: var(--slate-blue);
829
+ }
830
+
831
+ .status-complete {
832
+ background: rgba(34, 197, 94, 0.2);
833
+ color: #16a34a;
834
+ }
835
+
836
+ .status-error {
837
+ background: rgba(239, 68, 68, 0.2);
838
+ color: #dc2626;
839
+ }
840
+
841
+ .action-buttons {
842
+ display: flex;
843
+ gap: 0.5rem;
844
+ }
845
+
846
+ .btn-icon {
847
+ background: transparent;
848
+ border: 1px solid var(--cool-gray);
849
+ color: var(--charcoal);
850
+ padding: 0.5rem 0.75rem;
851
+ border-radius: 6px;
852
+ cursor: pointer;
853
+ font-size: 0.85rem;
854
+ transition: all 0.3s ease;
855
+ }
856
+
857
+ .btn-icon:hover {
858
+ background: var(--card-bg);
859
+ border-color: var(--slate-blue);
860
+ }
861
+
862
+ .export-section {
863
+ display: flex;
864
+ justify-content: flex-end;
865
+ }
866
+
867
+ @media (max-width: 1024px) {
868
+ .evaluation-grid {
869
+ grid-template-columns: 1fr;
870
+ }
871
+
872
+ .batch-table {
873
+ font-size: 0.85rem;
874
+ }
875
+
876
+ .batch-table th,
877
+ .batch-table td {
878
+ padding: 0.75rem;
879
+ }
880
+ }
webapp/static/js/batch.js ADDED
@@ -0,0 +1,217 @@
1
+ // Batch Processing Page
2
+
3
+ let batchQueue = [];
4
+ let batchResults = [];
5
+
6
+ // DOM Elements
7
+ const loadSamplesBtn = document.getElementById('loadSamplesBtn');
8
+ const runBatchBtn = document.getElementById('runBatchBtn');
9
+ const exportBtn = document.getElementById('exportBtn');
10
+ const tableBody = document.getElementById('batchTableBody');
11
+
12
+ // Sample texts for demo
13
+ const sampleTexts = [
14
+ {
15
+ text: "Artificial intelligence has revolutionized the way we interact with technology. Machine learning algorithms can now process vast amounts of data and identify patterns that humans might miss. Deep learning neural networks have enabled breakthroughs in computer vision, natural language processing, and speech recognition. These advances are transforming industries from healthcare to finance.",
16
+ models: ['textrank', 'bart', 'pegasus']
17
+ },
18
+ {
19
+ text: "Climate change poses one of the greatest challenges to humanity. Rising global temperatures are causing ice caps to melt and sea levels to rise. Extreme weather events are becoming more frequent and severe. Scientists warn that without immediate action, the consequences could be catastrophic for future generations.",
20
+ models: ['textrank', 'bart']
21
+ },
22
+ {
23
+ text: "The human brain is the most complex organ in the body, containing approximately 86 billion neurons. These neurons communicate through electrical and chemical signals, forming intricate networks that enable thought, memory, and consciousness. Neuroscientists continue to uncover the mysteries of how the brain processes information and generates our subjective experiences.",
24
+ models: ['bart', 'pegasus']
25
+ }
26
+ ];
27
+
28
+ // Load sample documents
29
+ loadSamplesBtn.addEventListener('click', function() {
30
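+ batchQueue = sampleTexts.map(sample => ({ ...sample, status: 'pending' })); // copy each sample so reloading resets status and results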
31
+ renderTable();
32
+ });
33
+
34
+ // Run batch processing
35
+ runBatchBtn.addEventListener('click', async function() {
36
+ if (batchQueue.length === 0) {
37
+ alert('No items in queue. Please load samples first.');
38
+ return;
39
+ }
40
+
41
+ runBatchBtn.disabled = true;
42
+ runBatchBtn.textContent = 'Processing...';
43
+
44
+ for (let i = 0; i < batchQueue.length; i++) {
45
+ const item = batchQueue[i];
46
+ item.status = 'processing';
47
+ renderTable();
48
+
49
+ try {
50
+ const results = {};
51
+
52
+ for (const model of item.models) {
53
+ const response = await fetch('/api/summarize', {
54
+ method: 'POST',
55
+ headers: {
56
+ 'Content-Type': 'application/json'
57
+ },
58
+ body: JSON.stringify({
59
+ text: item.text,
60
+ model: model
61
+ })
62
+ });
63
+
64
+ const data = await response.json();
65
+
66
+ if (data.success) {
67
+ results[model] = {
68
+ summary: data.summary,
69
+ metadata: data.metadata
70
+ };
71
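+ } else { throw new Error('Summarization failed for model: ' + model); } // surface API failures instead of marking the item complete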
72
+ }
73
+
74
+ item.results = results;
75
+ item.status = 'complete';
76
+ if (!batchResults.includes(item)) batchResults.push(item); // avoid duplicate export rows when the batch is re-run
77
+
78
+ } catch (error) {
79
+ item.status = 'error';
80
+ item.error = error.message;
81
+ }
82
+
83
+ renderTable();
84
+ }
85
+
86
+ runBatchBtn.disabled = false;
87
+ runBatchBtn.textContent = 'Run Batch';
88
+ });
89
+
90
+ // Export results to CSV
91
+ exportBtn.addEventListener('click', function() {
92
+ if (batchResults.length === 0) {
93
+ alert('No results to export. Please run batch processing first.');
94
+ return;
95
+ }
96
+
97
+ let csv = 'Source Text,Model,Summary,Processing Time (s),Compression Ratio\n';
98
+
99
+ batchResults.forEach(item => {
100
+ if (item.results) {
101
+ Object.keys(item.results).forEach(model => {
102
+ const result = item.results[model];
103
+ const sourceText = (item.text.substring(0, 100) + '...').replace(/"/g, '""'); // truncate first so an escaped "" pair is never split
104
+ const summary = result.summary.replace(/"/g, '""');
105
+ const time = result.metadata.processing_time.toFixed(2);
106
+ const compression = (result.metadata.compression_ratio * 100).toFixed(1) + '%';
107
+
108
+ csv += `"${sourceText}","${model}","${summary}",${time},${compression}\n`;
109
+ });
110
+ }
111
+ });
112
+
113
+ // Download CSV
114
+ const blob = new Blob([csv], { type: 'text/csv' });
115
+ const url = window.URL.createObjectURL(blob);
116
+ const a = document.createElement('a');
117
+ a.href = url;
118
+ a.download = 'batch_results_' + new Date().toISOString().split('T')[0] + '.csv';
119
+ document.body.appendChild(a);
120
+ a.click();
121
+ document.body.removeChild(a);
122
+ window.URL.revokeObjectURL(url);
123
+ });
124
+
125
+ // Render table
126
+ function renderTable() {
127
+ if (batchQueue.length === 0) {
128
+ tableBody.innerHTML = `
129
+ <tr class="empty-state">
130
+ <td colspan="4">
131
+ <div class="empty-message">
132
+ No items in the queue. Load samples or upload a CSV to begin.
133
+ </div>
134
+ </td>
135
+ </tr>
136
+ `;
137
+ return;
138
+ }
139
+
140
+ tableBody.innerHTML = batchQueue.map((item, index) => {
141
+ const preview = item.text.substring(0, 80) + '...';
142
+ const modelBadges = item.models.map(m =>
143
+ `<span class="model-badge">${m.toUpperCase()}</span>`
144
+ ).join('');
145
+
146
+ let statusBadge = '';
147
+ if (!item.status || item.status === 'pending') {
148
+ statusBadge = '<span class="status-badge status-pending">Pending</span>';
149
+ } else if (item.status === 'processing') {
150
+ statusBadge = '<span class="status-badge status-processing">Processing...</span>';
151
+ } else if (item.status === 'complete') {
152
+ statusBadge = '<span class="status-badge status-complete">Complete</span>';
153
+ } else if (item.status === 'error') {
154
+ statusBadge = '<span class="status-badge status-error">Error</span>';
155
+ }
156
+
157
+ return `
158
+ <tr>
159
+ <td><div class="source-preview">${preview}</div></td>
160
+ <td><div class="model-badges">${modelBadges}</div></td>
161
+ <td>${statusBadge}</td>
162
+ <td>
163
+ <div class="action-buttons">
164
+ <button class="btn-icon" onclick="viewItem(${index})" ${item.status !== 'complete' ? 'disabled' : ''}>View</button>
165
+ <button class="btn-icon" onclick="removeItem(${index})">Remove</button>
166
+ </div>
167
+ </td>
168
+ </tr>
169
+ `;
170
+ }).join('');
171
+ }
172
+
173
+ // View item results
174
+ function viewItem(index) {
175
+ const item = batchQueue[index];
176
+ if (!item.results) return;
177
+
178
+ let resultsHtml = '<div style="max-width: 800px; margin: 0 auto;">';
179
+ resultsHtml += '<h3 style="margin-bottom: 1rem;">Batch Results</h3>';
180
+ resultsHtml += `<p style="color: #6D8196; margin-bottom: 2rem;"><strong>Source:</strong> ${item.text.substring(0, 200)}...</p>`;
181
+
182
+ Object.keys(item.results).forEach(model => {
183
+ const result = item.results[model];
184
+ resultsHtml += `
185
+ <div style="margin-bottom: 2rem; padding: 1.5rem; background: #F5F0F6; border-radius: 8px;">
186
+ <h4 style="margin-bottom: 0.5rem; color: #4A4A4A;">${model.toUpperCase()}</h4>
187
+ <p style="line-height: 1.8; margin-bottom: 1rem;">${result.summary}</p>
188
+ <div style="display: flex; gap: 2rem; font-size: 0.9rem; color: #6D8196;">
189
+ <span><strong>Time:</strong> ${result.metadata.processing_time.toFixed(2)}s</span>
190
+ <span><strong>Compression:</strong> ${(result.metadata.compression_ratio * 100).toFixed(1)}%</span>
191
+ </div>
192
+ </div>
193
+ `;
194
+ });
195
+
196
+ resultsHtml += '</div>';
197
+
198
+ // Create modal
199
+ const modal = document.createElement('div');
200
+ modal.style.cssText = 'position: fixed; top: 0; left: 0; right: 0; bottom: 0; background: rgba(0,0,0,0.5); display: flex; align-items: center; justify-content: center; z-index: 9999; padding: 2rem;';
201
+ modal.innerHTML = `
202
+ <div style="background: white; border-radius: 12px; padding: 2rem; max-height: 90vh; overflow-y: auto; position: relative;">
203
+ <button onclick="this.parentElement.parentElement.remove()" style="position: absolute; top: 1rem; right: 1rem; background: none; border: none; font-size: 1.5rem; cursor: pointer; color: #4A4A4A;">×</button>
204
+ ${resultsHtml}
205
+ </div>
206
+ `;
207
+ document.body.appendChild(modal);
208
+ }
209
+
210
+ // Remove item from queue
211
+ function removeItem(index) {
212
+ batchQueue.splice(index, 1);
213
+ renderTable();
214
+ }
215
+
216
+ // Initial render
217
+ renderTable();
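
The table rows and the results modal in `batch.js` interpolate `item.text` and `result.summary` straight into `innerHTML`, and the CSV export quotes fields by hand. A minimal sketch of helpers that could centralise both kinds of escaping follows. The names `escapeHtml`, `toCsvField`, and `previewText` are hypothetical and do not appear in this commit; the sketch assumes the summaries are plain text rather than trusted HTML.

```javascript
// Hypothetical helpers; not part of the committed batch.js.

// Escape the characters that are significant in HTML text and attribute values.
function escapeHtml(value) {
    return String(value)
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;')
        .replace(/'/g, '&#39;');
}

// Quote a CSV field: wrap it in double quotes and double any embedded quote.
function toCsvField(value) {
    return '"' + String(value).replace(/"/g, '""') + '"';
}

// Truncate first, then escape, so an escape sequence is never cut in half.
function previewText(text, maxLength) {
    const clipped = text.length > maxLength
        ? text.substring(0, maxLength) + '...'
        : text;
    return escapeHtml(clipped);
}
```

With helpers like these, `renderTable` could call `previewText(item.text, 80)` and the export loop could build each row from `toCsvField(...)` values, keeping the escaping rules in one place.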
webapp/static/js/evaluation.js ADDED
@@ -0,0 +1,126 @@
1
+ // Evaluation Page - ROUGE Metrics Chart
2
+
3
+ // Sample benchmark data (from CNN/DailyMail evaluation)
4
+ const benchmarkData = {
5
+ textrank: {
6
+ rouge1: 0.43,
7
+ rouge2: 0.18,
8
+ rougeL: 0.35
9
+ },
10
+ bart: {
11
+ rouge1: 0.51,
12
+ rouge2: 0.34,
13
+ rougeL: 0.48
14
+ },
15
+ pegasus: {
16
+ rouge1: 0.55,
17
+ rouge2: 0.30,
18
+ rougeL: 0.52
19
+ }
20
+ };
21
+
22
+ // Initialize chart
23
+ document.addEventListener('DOMContentLoaded', function() {
24
+ const ctx = document.getElementById('rougeChart').getContext('2d');
25
+
26
+ const chart = new Chart(ctx, {
27
+ type: 'bar',
28
+ data: {
29
+ labels: ['TextRank', 'BART', 'PEGASUS'],
30
+ datasets: [
31
+ {
32
+ label: 'ROUGE-1',
33
+ data: [
34
+ benchmarkData.textrank.rouge1,
35
+ benchmarkData.bart.rouge1,
36
+ benchmarkData.pegasus.rouge1
37
+ ],
38
+ backgroundColor: '#6D8196',
39
+ borderRadius: 6
40
+ },
41
+ {
42
+ label: 'ROUGE-2',
43
+ data: [
44
+ benchmarkData.textrank.rouge2,
45
+ benchmarkData.bart.rouge2,
46
+ benchmarkData.pegasus.rouge2
47
+ ],
48
+ backgroundColor: '#CBCBCB',
49
+ borderRadius: 6
50
+ },
51
+ {
52
+ label: 'ROUGE-L',
53
+ data: [
54
+ benchmarkData.textrank.rougeL,
55
+ benchmarkData.bart.rougeL,
56
+ benchmarkData.pegasus.rougeL
57
+ ],
58
+ backgroundColor: '#4A4A4A',
59
+ borderRadius: 6
60
+ }
61
+ ]
62
+ },
63
+ options: {
64
+ responsive: true,
65
+ maintainAspectRatio: true,
66
+ plugins: {
67
+ legend: {
68
+ display: true,
69
+ position: 'bottom',
70
+ labels: {
71
+ padding: 20,
72
+ font: {
73
+ size: 12,
74
+ family: '-apple-system, BlinkMacSystemFont, "Segoe UI", Roboto'
75
+ },
76
+ usePointStyle: true,
77
+ pointStyle: 'circle'
78
+ }
79
+ },
80
+ tooltip: {
81
+ backgroundColor: '#4A4A4A',
82
+ padding: 12,
83
+ titleFont: {
84
+ size: 13
85
+ },
86
+ bodyFont: {
87
+ size: 12
88
+ },
89
+ callbacks: {
90
+ label: function(context) {
91
+ return context.dataset.label + ': ' + context.parsed.y.toFixed(2);
92
+ }
93
+ }
94
+ }
95
+ },
96
+ scales: {
97
+ y: {
98
+ beginAtZero: true,
99
+ max: 0.6,
100
+ ticks: {
101
+ font: {
102
+ size: 11
103
+ },
104
+ color: '#6D8196'
105
+ },
106
+ grid: {
107
+ color: 'rgba(203, 203, 203, 0.3)',
108
+ drawBorder: false
109
+ }
110
+ },
111
+ x: {
112
+ ticks: {
113
+ font: {
114
+ size: 12,
115
+ weight: '500'
116
+ },
117
+ color: '#4A4A4A'
118
+ },
119
+ grid: {
120
+ display: false
121
+ }
122
+ }
123
+ }
124
+ }
125
+ });
126
+ });
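
The three datasets in `evaluation.js` repeat the same model lookups by hand. As a sketch only, the same arrays could be derived from the benchmark object. `buildRougeDatasets` and the `exampleBenchmark` constant are hypothetical names introduced for illustration; the values mirror the `benchmarkData` shown above.

```javascript
// Hypothetical helper; the committed evaluation.js lists each data point explicitly.
// Builds one Chart.js bar dataset per metric, with one value per model.
function buildRougeDatasets(benchmark, modelKeys, metrics) {
    return metrics.map(metric => ({
        label: metric.label,
        data: modelKeys.map(key => benchmark[key][metric.key]),
        backgroundColor: metric.color,
        borderRadius: 6
    }));
}

// Same numbers as benchmarkData in the diff, repeated so the sketch is self-contained.
const exampleBenchmark = {
    textrank: { rouge1: 0.43, rouge2: 0.18, rougeL: 0.35 },
    bart: { rouge1: 0.51, rouge2: 0.34, rougeL: 0.48 },
    pegasus: { rouge1: 0.55, rouge2: 0.30, rougeL: 0.52 }
};

const exampleDatasets = buildRougeDatasets(
    exampleBenchmark,
    ['textrank', 'bart', 'pegasus'],
    [
        { key: 'rouge1', label: 'ROUGE-1', color: '#6D8196' },
        { key: 'rouge2', label: 'ROUGE-2', color: '#CBCBCB' },
        { key: 'rougeL', label: 'ROUGE-L', color: '#4A4A4A' }
    ]
);
```

The resulting array has the same shape as the `datasets` literal passed to `new Chart(...)`, so adding a fourth model would only require a new key in the benchmark object and the model list.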
webapp/templates/batch.html ADDED
@@ -0,0 +1,94 @@
1
+ <!DOCTYPE html>
2
+ <html lang="en">
3
+ <head>
4
+ <meta charset="UTF-8">
5
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
6
+ <title>Batch Processing - Smart Summarizer</title>
7
+ <link rel="stylesheet" href="{{ url_for('static', filename='css/style.css') }}">
8
+ <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.4.0/css/all.min.css">
9
+ </head>
10
+ <body>
11
+ <!-- Top Navigation Bar -->
12
+ <nav class="top-navbar">
13
+ <a href="/" class="navbar-logo">
14
+ <div class="logo-circle">S</div>
15
+ <span>Smart Summarizer</span>
16
+ </a>
17
+ <div class="navbar-links">
18
+ <a href="/" class="nav-item">
19
+ <i class="fas fa-home"></i> Home
20
+ </a>
21
+ <a href="/single-summary" class="nav-item">
22
+ <i class="fas fa-file-alt"></i> Single Summary
23
+ </a>
24
+ <a href="/comparison" class="nav-item">
25
+ <i class="fas fa-balance-scale"></i> Comparison
26
+ </a>
27
+ <a href="/batch" class="nav-item active">
28
+ <i class="fas fa-layer-group"></i> Batch
29
+ </a>
30
+ <a href="/evaluation" class="nav-item">
31
+ <i class="fas fa-chart-bar"></i> Evaluation
32
+ </a>
33
+ </div>
34
+ </nav>
35
+
36
+ <!-- Page Content -->
37
+ <div class="page-container">
38
+ <h1 class="page-title">Batch Processing</h1>
39
+ <p class="page-subtitle">Queue multiple documents and summarize them in a single batch run.</p>
40
+
41
+ <!-- Controls -->
42
+ <div class="batch-controls">
43
+ <button class="btn-secondary" id="loadSamplesBtn">Load Samples</button>
44
+ <button class="btn-primary" id="runBatchBtn">Run Batch</button>
45
+ </div>
46
+
47
+ <!-- Batch Table -->
48
+ <div class="batch-table-container">
49
+ <table class="batch-table">
50
+ <thead>
51
+ <tr>
52
+ <th>SOURCE PREVIEW</th>
53
+ <th>MODELS</th>
54
+ <th>STATUS</th>
55
+ <th>ACTIONS</th>
56
+ </tr>
57
+ </thead>
58
+ <tbody id="batchTableBody">
59
+ <tr class="empty-state">
60
+ <td colspan="4">
61
+ <div class="empty-message">
62
+ No items in the queue. Load samples or upload a CSV to begin.
63
+ </div>
64
+ </td>
65
+ </tr>
66
+ </tbody>
67
+ </table>
68
+ </div>
69
+
70
+ <!-- Export Button -->
71
+ <div class="export-section">
72
+ <button class="btn-secondary" id="exportBtn">
73
+ <span>📥</span> Export All Results (CSV)
74
+ </button>
75
+ </div>
76
+ </div>
77
+
78
+ <!-- Footer -->
79
+ <footer class="footer">
80
+ <div class="footer-left">
81
+ <div class="logo-circle" style="width: 24px; height: 24px; font-size: 0.9rem;">S</div>
82
+ <span>Smart Summarizer</span>
83
+ </div>
84
+ <div class="footer-right">
85
+ <span>© 2025 Smart Summarizer. Abdul Razzaq Ansari</span>
86
+ <a href="https://github.com/Rajak13/Smart-Summarizer" target="_blank" class="footer-link">
87
+ <i class="fab fa-github"></i> GitHub Repository
88
+ </a>
89
+ </div>
90
+ </footer>
91
+
92
+ <script src="{{ url_for('static', filename='js/batch.js') }}"></script>
93
+ </body>
94
+ </html>
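The batch page exposes an "Export All Results (CSV)" button, but `js/batch.js` is not part of this section, so how the export is built is not visible here. The following is a minimal sketch of one way rows could be serialized, assuming a hypothetical row shape and hypothetical helper names (`csvField`, `toCsv`); it is not the repository's implementation. Quoting follows the usual CSV convention of wrapping fields that contain commas, quotes, or line breaks and doubling embedded quotes.

```javascript
// Hypothetical helper: serialize one value as a CSV field.
// Fields containing a comma, double quote, CR, or LF are wrapped in quotes,
// and embedded double quotes are doubled.
function csvField(value) {
  const s = String(value ?? '');
  return /[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

// Hypothetical helper: build a CSV document from an array of row objects.
// `columns` selects which keys are exported and in which order.
function toCsv(rows, columns) {
  const header = columns.map(csvField).join(',');
  const lines = rows.map(row => columns.map(col => csvField(row[col])).join(','));
  return [header, ...lines].join('\r\n');
}

// Assumed row shape, for illustration only.
const rows = [
  { source: 'First document, with a comma', model: 'bart', summary: 'He said "done".' },
];
console.log(toCsv(rows, ['source', 'model', 'summary']));
```

In a browser, the resulting string could be wrapped in a `Blob` and offered as a download; that step is omitted because it depends on how `batch.js` stores its results.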
webapp/templates/comparison.html ADDED
@@ -0,0 +1,191 @@
+ <!DOCTYPE html>
+ <html lang="en">
+ <head>
+     <meta charset="UTF-8">
+     <meta name="viewport" content="width=device-width, initial-scale=1.0">
+     <title>Model Comparison - Smart Summarizer</title>
+     <link rel="stylesheet" href="{{ url_for('static', filename='css/style.css') }}">
+     <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.4.0/css/all.min.css">
+ </head>
+ <body>
+     <!-- Top Navigation Bar -->
+     <nav class="top-navbar">
+         <a href="/" class="navbar-logo">
+             <div class="logo-circle">S</div>
+             <span>Smart Summarizer</span>
+         </a>
+         <div class="navbar-links">
+             <a href="/" class="nav-item">
+                 <i class="fas fa-home"></i> Home
+             </a>
+             <a href="/single-summary" class="nav-item">
+                 <i class="fas fa-file-alt"></i> Single Summary
+             </a>
+             <a href="/comparison" class="nav-item active">
+                 <i class="fas fa-balance-scale"></i> Comparison
+             </a>
+             <a href="/batch" class="nav-item">
+                 <i class="fas fa-layer-group"></i> Batch
+             </a>
+             <a href="/evaluation" class="nav-item">
+                 <i class="fas fa-chart-bar"></i> Evaluation
+             </a>
+         </div>
+     </nav>
+
+     <!-- Page Content -->
+     <div class="page-container">
+         <h1 class="page-title">Model Comparison Matrix</h1>
+         <p class="page-subtitle">Compare extractive and abstractive strategies in real-time. Witness how graph-based ranking compares to transformer-based generation.</p>
+
+         <!-- Input Section -->
+         <div class="comparison-input-section">
+             <textarea
+                 class="text-input"
+                 id="inputText"
+                 placeholder="Input source text for cross-model analysis..."
+                 style="min-height: 200px;"
+             ></textarea>
+         </div>
+
+         <!-- Run Analysis Button -->
+         <div style="text-align: center; margin: 2rem 0;">
+             <button class="btn-generate" id="runAnalysisBtn" style="padding: 1rem 3rem;">
+                 Run Analysis
+             </button>
+         </div>
+
+         <!-- Results Grid -->
+         <div class="comparison-grid" id="resultsGrid">
+             <!-- TextRank Card -->
+             <div class="comparison-card">
+                 <div class="comparison-header">
+                     <span class="model-indicator" style="background: #FFA500;"></span>
+                     <h3>TextRank</h3>
+                 </div>
+                 <div class="comparison-result" id="textrank-result">
+                     <div class="awaiting-text">Awaiting Analysis</div>
+                 </div>
+             </div>
+
+             <!-- BART Card -->
+             <div class="comparison-card">
+                 <div class="comparison-header">
+                     <span class="model-indicator" style="background: #4A90E2;"></span>
+                     <h3>BART</h3>
+                 </div>
+                 <div class="comparison-result" id="bart-result">
+                     <div class="awaiting-text">Awaiting Analysis</div>
+                 </div>
+             </div>
+
+             <!-- PEGASUS Card -->
+             <div class="comparison-card">
+                 <div class="comparison-header">
+                     <span class="model-indicator" style="background: #50C878;"></span>
+                     <h3>PEGASUS</h3>
+                 </div>
+                 <div class="comparison-result" id="pegasus-result">
+                     <div class="awaiting-text">Awaiting Analysis</div>
+                 </div>
+             </div>
+         </div>
+     </div>
+
+     <!-- Footer -->
+     <footer class="footer">
+         <div class="footer-left">
+             <div class="logo-circle" style="width: 24px; height: 24px; font-size: 0.9rem;">S</div>
+             <span>Smart Summarizer</span>
+         </div>
+         <div class="footer-right">
+             <span>© 2025 Smart Summarizer. Abdul Razzaq Ansari</span>
+             <a href="https://github.com/Rajak13/Smart-Summarizer" target="_blank" class="footer-link">
+                 <i class="fab fa-github"></i> GitHub Repository
+             </a>
+         </div>
+     </footer>
+
+     <script>
+         const inputText = document.getElementById('inputText');
+         const runAnalysisBtn = document.getElementById('runAnalysisBtn');
+         const MODELS = ['textrank', 'bart', 'pegasus'];
+
+         // Escape server-provided text before interpolating it into innerHTML
+         function escapeHtml(value) {
+             const div = document.createElement('div');
+             div.textContent = String(value);
+             return div.innerHTML;
+         }
+
+         // Restore every card to its idle placeholder so no spinner is left behind
+         function resetCards() {
+             MODELS.forEach(model => {
+                 document.getElementById(`${model}-result`).innerHTML =
+                     '<div class="awaiting-text">Awaiting Analysis</div>';
+             });
+         }
+
+         runAnalysisBtn.addEventListener('click', async () => {
+             const text = inputText.value.trim();
+
+             if (!text || text.split(/\s+/).length < 10) {
+                 alert('Please enter at least 10 words of text');
+                 return;
+             }
+
+             // Show loading state
+             runAnalysisBtn.disabled = true;
+             runAnalysisBtn.textContent = 'Analyzing...';
+
+             // Show loading in all cards
+             MODELS.forEach(model => {
+                 document.getElementById(`${model}-result`).innerHTML = `
+                     <div class="spinner"></div>
+                     <div style="margin-top: 1rem; color: var(--slate-blue);">Processing...</div>
+                 `;
+             });
+
+             try {
+                 const response = await fetch('/api/compare', {
+                     method: 'POST',
+                     headers: {
+                         'Content-Type': 'application/json',
+                     },
+                     body: JSON.stringify({ text: text })
+                 });
+
+                 const data = await response.json();
+
+                 if (data.success) {
+                     // Clear spinners first so a model missing from the response does not stay in the loading state
+                     resetCards();
+
+                     // Display results for each model
+                     Object.keys(data.results).forEach(model => {
+                         const result = data.results[model];
+                         const resultDiv = document.getElementById(`${model}-result`);
+                         if (!resultDiv) return; // Skip models that have no card on this page
+
+                         if (result.error) {
+                             resultDiv.innerHTML = `
+                                 <div style="color: #ef4444; padding: 1rem;">
+                                     <strong>Error:</strong> ${escapeHtml(result.error)}
+                                 </div>
+                             `;
+                         } else {
+                             resultDiv.innerHTML = `
+                                 <div class="summary-content">
+                                     ${escapeHtml(result.summary)}
+                                 </div>
+                                 <div class="summary-metrics">
+                                     <div class="metric-item">
+                                         <span class="metric-label">Time:</span>
+                                         <span class="metric-value">${result.metadata.processing_time.toFixed(2)}s</span>
+                                     </div>
+                                     <div class="metric-item">
+                                         <span class="metric-label">Compression:</span>
+                                         <span class="metric-value">${(result.metadata.compression_ratio * 100).toFixed(1)}%</span>
+                                     </div>
+                                     <div class="metric-item">
+                                         <span class="metric-label">Words:</span>
+                                         <span class="metric-value">${result.metadata.summary_length}</span>
+                                     </div>
+                                 </div>
+                             `;
+                         }
+                     });
+                 } else {
+                     resetCards();
+                     alert('Error: ' + data.error);
+                 }
+             } catch (error) {
+                 resetCards();
+                 alert('Failed to run analysis. Please try again.');
+                 console.error(error);
+             } finally {
+                 runAnalysisBtn.disabled = false;
+                 runAnalysisBtn.textContent = 'Run Analysis';
+             }
+         });
+     </script>
+ </body>
+ </html>
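The comparison script reads a specific set of fields from the `/api/compare` response: `success`, `error`, and a `results` object keyed by model, where each entry carries either `error` or `summary` plus `metadata.processing_time`, `metadata.compression_ratio`, and `metadata.summary_length`. The sketch below restates that contract as plain functions so it can be checked outside the browser. The payload values are made up for illustration, the helper names (`formatMetrics`, `describeResults`) are hypothetical, and the server side is not shown in this section, so the exact meaning of `compression_ratio` is an assumption.

```javascript
// Assumed response shape, inferred only from the fields the template reads.
const samplePayload = {
  success: true,
  results: {
    textrank: {
      summary: 'First key sentence.',
      metadata: { processing_time: 0.1234, compression_ratio: 0.25, summary_length: 3 },
    },
    bart: { error: 'model not loaded' },
  },
};

// Format the three metrics the same way the template does.
function formatMetrics(metadata) {
  return {
    time: `${metadata.processing_time.toFixed(2)}s`,
    compression: `${(metadata.compression_ratio * 100).toFixed(1)}%`,
    words: String(metadata.summary_length),
  };
}

// Produce one text line per model, mirroring the success/error branching in the page script.
function describeResults(payload) {
  if (!payload.success) return [`error: ${payload.error}`];
  return Object.entries(payload.results).map(([model, result]) => {
    if (result.error) return `${model}: error: ${result.error}`;
    const m = formatMetrics(result.metadata);
    return `${model}: ${m.words} words, ${m.time}, ${m.compression}`;
  });
}

console.log(describeResults(samplePayload).join('\n'));
```

Keeping the formatting in a pure function like this makes it easy to reuse between the comparison and single-summary pages, which currently repeat the same `toFixed` calls.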
webapp/templates/evaluation.html ADDED
@@ -0,0 +1,104 @@
+ <!DOCTYPE html>
+ <html lang="en">
+ <head>
+     <meta charset="UTF-8">
+     <meta name="viewport" content="width=device-width, initial-scale=1.0">
+     <title>Evaluation - Smart Summarizer</title>
+     <link rel="stylesheet" href="{{ url_for('static', filename='css/style.css') }}">
+     <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.4.0/css/all.min.css">
+     <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
+ </head>
+ <body>
+     <!-- Top Navigation Bar -->
+     <nav class="top-navbar">
+         <a href="/" class="navbar-logo">
+             <div class="logo-circle">S</div>
+             <span>Smart Summarizer</span>
+         </a>
+         <div class="navbar-links">
+             <a href="/" class="nav-item">
+                 <i class="fas fa-home"></i> Home
+             </a>
+             <a href="/single-summary" class="nav-item">
+                 <i class="fas fa-file-alt"></i> Single Summary
+             </a>
+             <a href="/comparison" class="nav-item">
+                 <i class="fas fa-balance-scale"></i> Comparison
+             </a>
+             <a href="/batch" class="nav-item">
+                 <i class="fas fa-layer-group"></i> Batch
+             </a>
+             <a href="/evaluation" class="nav-item active">
+                 <i class="fas fa-chart-bar"></i> Evaluation
+             </a>
+         </div>
+     </nav>
+
+     <!-- Page Content -->
+     <div class="page-container">
+         <h1 class="page-title">Metric Benchmarks</h1>
+         <p class="page-subtitle">Aggregate performance data based on the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scoring system.</p>
+
+         <!-- Content Grid -->
+         <div class="evaluation-grid">
+             <!-- Chart Section -->
+             <div class="chart-section">
+                 <div class="chart-container">
+                     <h3 class="chart-title">ROUGE Metric Comparison</h3>
+                     <canvas id="rougeChart"></canvas>
+                 </div>
+             </div>
+
+             <!-- Metrics Explanation -->
+             <div class="metrics-explanation">
+                 <h3 class="section-title">Understanding the Metrics</h3>
+
+                 <div class="metric-card">
+                     <div class="metric-header">
+                         <div class="metric-indicator" style="background: #6D8196;"></div>
+                         <h4>ROUGE-1</h4>
+                     </div>
+                     <p>Measures the overlap of unigrams (single words) between the generated summary and the reference summary. High scores indicate good content coverage.</p>
+                 </div>
+
+                 <div class="metric-card">
+                     <div class="metric-header">
+                         <div class="metric-indicator" style="background: #CBCBCB;"></div>
+                         <h4>ROUGE-2</h4>
+                     </div>
+                     <p>Measures the overlap of bigrams (pairs of consecutive words). This is a strong indicator of fluency and phrasing quality.</p>
+                 </div>
+
+                 <div class="metric-card">
+                     <div class="metric-header">
+                         <div class="metric-indicator" style="background: #4A4A4A;"></div>
+                         <h4>ROUGE-L</h4>
+                     </div>
+                     <p>Based on the Longest Common Subsequence. It captures sentence structure and sequential flow more effectively than simple n-gram overlap.</p>
+                 </div>
+
+                 <div class="insight-box">
+                     <h4>MODEL INSIGHT</h4>
+                     <p>"BART and PEGASUS typically outperform TextRank in ROUGE-2 and ROUGE-L as they generate fluent, abstractive prose rather than just extracting source fragments."</p>
+                 </div>
+             </div>
+         </div>
+     </div>
+
+     <!-- Footer -->
+     <footer class="footer">
+         <div class="footer-left">
+             <div class="logo-circle" style="width: 24px; height: 24px; font-size: 0.9rem;">S</div>
+             <span>Smart Summarizer</span>
+         </div>
+         <div class="footer-right">
+             <span>© 2025 Smart Summarizer. Abdul Razzaq Ansari</span>
+             <a href="https://github.com/Rajak13/Smart-Summarizer" target="_blank" class="footer-link">
+                 <i class="fab fa-github"></i> GitHub Repository
+             </a>
+         </div>
+     </footer>
+
+     <script src="{{ url_for('static', filename='js/evaluation.js') }}"></script>
+ </body>
+ </html>
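The evaluation page describes ROUGE-1, ROUGE-2, and ROUGE-L in words only. As an illustration of what those definitions compute, here is a simplified sketch: n-gram overlap with clipped counts for ROUGE-N and longest common subsequence for ROUGE-L, each reported as precision, recall, and F1. It assumes a naive lowercase alphanumeric tokenizer and no stemming, so its numbers will not match established ROUGE packages, and it is not how this project's scores are produced (`js/evaluation.js` and the scoring backend are not shown here). All function names are hypothetical.

```javascript
// Naive tokenizer (assumption): lowercase runs of letters and digits.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Count each n-gram in a token list.
function ngramCounts(tokens, n) {
  const counts = new Map();
  for (let i = 0; i + n <= tokens.length; i++) {
    const gram = tokens.slice(i, i + n).join(' ');
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}

// Turn an overlap count into precision, recall, and F1.
function score(overlap, candidateTotal, referenceTotal) {
  const precision = candidateTotal ? overlap / candidateTotal : 0;
  const recall = referenceTotal ? overlap / referenceTotal : 0;
  const f1 = precision + recall ? (2 * precision * recall) / (precision + recall) : 0;
  return { precision, recall, f1 };
}

// ROUGE-N: clipped n-gram overlap between candidate and reference.
function rougeN(candidate, reference, n) {
  const cand = ngramCounts(tokenize(candidate), n);
  const ref = ngramCounts(tokenize(reference), n);
  let overlap = 0, candTotal = 0, refTotal = 0;
  for (const count of cand.values()) candTotal += count;
  for (const [gram, count] of ref) {
    refTotal += count;
    overlap += Math.min(count, cand.get(gram) || 0);
  }
  return score(overlap, candTotal, refTotal);
}

// Length of the longest common subsequence of two token lists (row-by-row DP).
function lcsLength(a, b) {
  let prev = new Array(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    const curr = [0];
    for (let j = 1; j <= b.length; j++) {
      curr[j] = a[i - 1] === b[j - 1] ? prev[j - 1] + 1 : Math.max(prev[j], curr[j - 1]);
    }
    prev = curr;
  }
  return prev[b.length];
}

// ROUGE-L: LCS length relative to candidate and reference lengths.
function rougeL(candidate, reference) {
  const c = tokenize(candidate);
  const r = tokenize(reference);
  return score(lcsLength(c, r), c.length, r.length);
}

const candidate = 'The cat sat on the mat';
const reference = 'The cat lay on the mat';
console.log(rougeN(candidate, reference, 1), rougeN(candidate, reference, 2), rougeL(candidate, reference));
```

For this pair, five of six unigrams and three of five bigrams match, and the longest common subsequence has five tokens, which shows why ROUGE-2 drops faster than ROUGE-1 when a single word changes.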
webapp/templates/home.html ADDED
@@ -0,0 +1,97 @@
+ <!DOCTYPE html>
+ <html lang="en">
+ <head>
+     <meta charset="UTF-8">
+     <meta name="viewport" content="width=device-width, initial-scale=1.0">
+     <title>Smart Summarizer - Home</title>
+     <link rel="stylesheet" href="{{ url_for('static', filename='css/style.css') }}">
+     <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.4.0/css/all.min.css">
+ </head>
+ <body>
+     <!-- Top Navigation Bar -->
+     <nav class="top-navbar">
+         <a href="/" class="navbar-logo">
+             <div class="logo-circle">S</div>
+             <span>Smart Summarizer</span>
+         </a>
+         <div class="navbar-links">
+             <a href="/" class="nav-item active">
+                 <i class="fas fa-home"></i> Home
+             </a>
+             <a href="/single-summary" class="nav-item">
+                 <i class="fas fa-file-alt"></i> Single Summary
+             </a>
+             <a href="/comparison" class="nav-item">
+                 <i class="fas fa-balance-scale"></i> Comparison
+             </a>
+             <a href="/batch" class="nav-item">
+                 <i class="fas fa-layer-group"></i> Batch
+             </a>
+             <a href="/evaluation" class="nav-item">
+                 <i class="fas fa-chart-bar"></i> Evaluation
+             </a>
+         </div>
+     </nav>
+
+     <!-- Hero Section -->
+     <div class="hero-container">
+         <h1 class="hero-title">Refined Intelligence.</h1>
+         <h1 class="hero-subtitle">Elegant Summaries.</h1>
+
+         <p class="hero-description">A high-fidelity comparison platform for state-of-the-art NLP models.</p>
+         <p class="hero-description">Compare TextRank, BART, and PEGASUS with precision metrics.</p>
+
+         <!-- CTA Buttons -->
+         <div class="cta-container">
+             <a href="/single-summary" class="btn-primary">Start Summarizing</a>
+             <a href="/evaluation" class="btn-secondary">View Evaluation</a>
+         </div>
+     </div>
+
+     <!-- Model Cards Section -->
+     <div class="models-container">
+         <div class="cards-grid">
+             <div class="model-card">
+                 <span class="model-emoji">🎯</span>
+                 <h3 class="model-name">TextRank</h3>
+                 <p class="model-desc">
+                     Extractive graph-based model that identifies the most
+                     salient sentences directly from the source.
+                 </p>
+             </div>
+
+             <div class="model-card">
+                 <span class="model-emoji">💝</span>
+                 <h3 class="model-name">BART</h3>
+                 <p class="model-desc">
+                     Abstractive transformer-based model optimized for standard,
+                     fluent summaries of varying length.
+                 </p>
+             </div>
+
+             <div class="model-card">
+                 <span class="model-emoji">🚀</span>
+                 <h3 class="model-name">PEGASUS</h3>
+                 <p class="model-desc">
+                     Advanced abstractive model pre-trained specifically for
+                     summarization tasks and gap prediction.
+                 </p>
+             </div>
+         </div>
+     </div>
+
+     <!-- Footer -->
+     <footer class="footer">
+         <div class="footer-left">
+             <div class="logo-circle" style="width: 24px; height: 24px; font-size: 0.9rem;">S</div>
+             <span>Smart Summarizer</span>
+         </div>
+         <div class="footer-right">
+             <span>© 2025 Smart Summarizer. Abdul Razzaq Ansari</span>
+             <a href="https://github.com/Rajak13/Smart-Summarizer" target="_blank" class="footer-link">
+                 <i class="fab fa-github"></i> GitHub Repository
+             </a>
+         </div>
+     </footer>
+ </body>
+ </html>
webapp/templates/single_summary.html ADDED
@@ -0,0 +1,287 @@
+ <!DOCTYPE html>
+ <html lang="en">
+ <head>
+     <meta charset="UTF-8">
+     <meta name="viewport" content="width=device-width, initial-scale=1.0">
+     <title>Single Summary - Smart Summarizer</title>
+     <link rel="stylesheet" href="{{ url_for('static', filename='css/style.css') }}">
+     <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.4.0/css/all.min.css">
+ </head>
+ <body>
+     <!-- Top Navigation Bar -->
+     <nav class="top-navbar">
+         <a href="/" class="navbar-logo">
+             <div class="logo-circle">S</div>
+             <span>Smart Summarizer</span>
+         </a>
+         <div class="navbar-links">
+             <a href="/" class="nav-item">
+                 <i class="fas fa-home"></i> Home
+             </a>
+             <a href="/single-summary" class="nav-item active">
+                 <i class="fas fa-file-alt"></i> Single Summary
+             </a>
+             <a href="/comparison" class="nav-item">
+                 <i class="fas fa-balance-scale"></i> Comparison
+             </a>
+             <a href="/batch" class="nav-item">
+                 <i class="fas fa-layer-group"></i> Batch
+             </a>
+             <a href="/evaluation" class="nav-item">
+                 <i class="fas fa-chart-bar"></i> Evaluation
+             </a>
+         </div>
+     </nav>
+
+     <!-- Page Content -->
+     <div class="page-container">
+         <h1 class="page-title">Single Model Summary</h1>
+         <p class="page-subtitle">Input your text and select a specialized model to begin.</p>
+
+         <div class="content-grid">
+             <!-- Input Section -->
+             <div class="input-section">
+                 <div class="section-label">Input Text</div>
+
+                 <!-- Input Method Tabs -->
+                 <div class="input-tabs">
+                     <button class="tab-btn active" onclick="switchTab('paste')">Paste Text</button>
+                     <button class="tab-btn" onclick="switchTab('upload')">Upload File</button>
+                 </div>
+
+                 <!-- Paste Text Tab -->
+                 <div id="paste-tab" class="tab-content active">
+                     <textarea
+                         class="text-input"
+                         id="inputText"
+                         placeholder="Paste your source text here..."
+                     ></textarea>
+                     <div class="char-count">
+                         <span id="charCount">0 characters</span>
+                         <span id="wordCount">0 words</span>
+                     </div>
+                 </div>
+
+                 <!-- Upload File Tab -->
+                 <div id="upload-tab" class="tab-content">
+                     <div class="upload-area" id="uploadArea">
+                         <div class="upload-icon">📄</div>
+                         <p>Drag and drop a file here or click to browse</p>
+                         <p class="upload-hint">Supported formats: .txt, .md, .pdf, .docx, .doc (Max 16MB)</p>
+                         <input type="file" id="fileInput" accept=".txt,.md,.pdf,.docx,.doc" style="display: none;">
+                     </div>
+                     <div id="fileInfo" class="file-info" style="display: none;">
+                         <span id="fileName"></span>
+                         <button class="btn-remove" onclick="removeFile()">Remove</button>
+                     </div>
+                 </div>
+             </div>
+
+             <!-- Output Section -->
+             <div class="output-section">
+                 <div class="section-label">Output Preview</div>
+                 <div class="output-preview" id="outputPreview">
+                     <div class="icon">✨</div>
+                     <div>Summary will appear here</div>
+                 </div>
+             </div>
+         </div>
+
+         <!-- Controls -->
+         <div class="controls-section">
+             <select class="model-select" id="modelSelect">
+                 <option value="bart">BART</option>
+                 <option value="textrank">TextRank</option>
+                 <option value="pegasus">PEGASUS</option>
+             </select>
+
+             <button class="btn-generate" id="generateBtn">
+                 Generate Summary
+             </button>
+         </div>
+     </div>
+
+     <!-- Footer -->
+     <footer class="footer">
+         <div class="footer-left">
+             <div class="logo-circle" style="width: 24px; height: 24px; font-size: 0.9rem;">S</div>
+             <span>Smart Summarizer</span>
+         </div>
+         <div class="footer-right">
+             <span>© 2025 Smart Summarizer. Abdul Razzaq Ansari</span>
+             <a href="https://github.com/Rajak13/Smart-Summarizer" target="_blank" class="footer-link">
+                 <i class="fab fa-github"></i> GitHub Repository
+             </a>
+         </div>
+     </footer>
+
+     <script>
+         const inputText = document.getElementById('inputText');
+         const charCount = document.getElementById('charCount');
+         const wordCount = document.getElementById('wordCount');
+         const generateBtn = document.getElementById('generateBtn');
+         const modelSelect = document.getElementById('modelSelect');
+         const outputPreview = document.getElementById('outputPreview');
+         const fileInput = document.getElementById('fileInput');
+         const uploadArea = document.getElementById('uploadArea');
+         const fileInfo = document.getElementById('fileInfo');
+         const fileName = document.getElementById('fileName');
+
+         // Escape server-provided text before interpolating it into innerHTML
+         function escapeHtml(value) {
+             const div = document.createElement('div');
+             div.textContent = String(value);
+             return div.innerHTML;
+         }
+
+         // Tab switching
+         function switchTab(tab) {
+             document.querySelectorAll('.tab-btn').forEach(btn => btn.classList.remove('active'));
+             document.querySelectorAll('.tab-content').forEach(content => content.classList.remove('active'));
+
+             if (tab === 'paste') {
+                 document.querySelector('.tab-btn:first-child').classList.add('active');
+                 document.getElementById('paste-tab').classList.add('active');
+             } else {
+                 document.querySelector('.tab-btn:last-child').classList.add('active');
+                 document.getElementById('upload-tab').classList.add('active');
+             }
+         }
+
+         // Update character and word count
+         inputText.addEventListener('input', () => {
+             const text = inputText.value;
+             const chars = text.length;
+             const words = text.trim().split(/\s+/).filter(word => word.length > 0).length;
+
+             charCount.textContent = `${chars} characters`;
+             wordCount.textContent = `${words} words`;
+         });
+
+         // File upload handling
+         uploadArea.addEventListener('click', () => fileInput.click());
+
+         uploadArea.addEventListener('dragover', (e) => {
+             e.preventDefault();
+             uploadArea.style.borderColor = 'var(--slate-blue)';
+             uploadArea.style.background = 'rgba(109, 129, 150, 0.05)';
+         });
+
+         uploadArea.addEventListener('dragleave', () => {
+             uploadArea.style.borderColor = 'var(--cool-gray)';
+             uploadArea.style.background = 'transparent';
+         });
+
+         uploadArea.addEventListener('drop', async (e) => {
+             e.preventDefault();
+             uploadArea.style.borderColor = 'var(--cool-gray)';
+             uploadArea.style.background = 'transparent';
+
+             const file = e.dataTransfer.files[0];
+             if (file) {
+                 await handleFileUpload(file);
+             }
+         });
+
+         fileInput.addEventListener('change', async (e) => {
+             const file = e.target.files[0];
+             if (file) {
+                 await handleFileUpload(file);
+             }
+         });
+
+         // Send the file to the server and load the extracted text into the textarea
+         async function handleFileUpload(file) {
+             const formData = new FormData();
+             formData.append('file', file);
+
+             try {
+                 const response = await fetch('/api/upload', {
+                     method: 'POST',
+                     body: formData
+                 });
+
+                 const data = await response.json();
+
+                 if (data.success) {
+                     inputText.value = data.text;
+                     inputText.dispatchEvent(new Event('input'));
+
+                     fileName.textContent = `${data.filename} (${data.word_count} words)`;
+                     fileInfo.style.display = 'flex';
+                     uploadArea.style.display = 'none';
+
+                     // Switch to paste tab to show the text
+                     switchTab('paste');
+                 } else {
+                     alert('Error: ' + data.error);
+                 }
+             } catch (error) {
+                 alert('Failed to upload file. Please try again.');
+                 console.error(error);
+             }
+         }
+
+         // Clear the uploaded file and the text it produced
+         function removeFile() {
+             fileInput.value = '';
+             fileInfo.style.display = 'none';
+             uploadArea.style.display = 'flex';
+             inputText.value = '';
+             inputText.dispatchEvent(new Event('input'));
+         }
+
+         // Generate summary
+         generateBtn.addEventListener('click', async () => {
+             const text = inputText.value.trim();
+             const model = modelSelect.value;
+
+             if (!text || text.split(/\s+/).length < 10) {
+                 alert('Please enter at least 10 words of text');
+                 return;
+             }
+
+             // Show loading state
+             generateBtn.disabled = true;
+             generateBtn.textContent = 'Generating...';
+             outputPreview.innerHTML = '<div class="spinner"></div><div>Processing your text...</div>';
+
+             try {
+                 const response = await fetch('/api/summarize', {
+                     method: 'POST',
+                     headers: {
+                         'Content-Type': 'application/json',
+                     },
+                     body: JSON.stringify({
+                         text: text,
+                         model: model
+                     })
+                 });
+
+                 const data = await response.json();
+
+                 if (data.success) {
+                     // Display summary
+                     outputPreview.innerHTML = `
+                         <div class="output-text">
+                             <strong>Summary (${model.toUpperCase()}):</strong><br><br>
+                             ${escapeHtml(data.summary)}
+                             <br><br>
+                             <small style="color: var(--slate-blue);">
+                                 Processing time: ${data.metadata.processing_time.toFixed(2)}s |
+                                 Compression: ${(data.metadata.compression_ratio * 100).toFixed(1)}%
+                             </small>
+                         </div>
+                     `;
+                 } else {
+                     outputPreview.innerHTML = `
+                         <div style="color: #ef4444;">
+                             <strong>Error:</strong> ${escapeHtml(data.error)}
+                         </div>
+                     `;
+                 }
+             } catch (error) {
+                 console.error(error);
+                 outputPreview.innerHTML = `
+                     <div style="color: #ef4444;">
+                         <strong>Error:</strong> Failed to generate summary. Please try again.
+                     </div>
+                 `;
+             } finally {
+                 generateBtn.disabled = false;
+                 generateBtn.textContent = 'Generate Summary';
+             }
+         });
+     </script>
+ </body>
+ </html>
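The single-summary script counts words in two places (the live counter and the 10-word check), and the upload hint states the accepted extensions and a 16MB cap without checking them on the client. The sketch below shows one way those rules could be collected into pure functions. The limits are taken from the template text; the function names and return shapes are hypothetical, and the server's own validation in `/api/upload` is not shown in this section, so this would complement it, not replace it.

```javascript
// Limits copied from the template text; names are hypothetical.
const MIN_WORDS = 10;
const MAX_BYTES = 16 * 1024 * 1024;
const ALLOWED_EXTENSIONS = ['.txt', '.md', '.pdf', '.docx', '.doc'];

// Count whitespace-separated words, ignoring leading and trailing whitespace.
function countWords(text) {
  return text.trim().split(/\s+/).filter(word => word.length > 0).length;
}

// Check the minimum-length rule used before calling the summarization endpoint.
function validateText(text) {
  const words = countWords(text);
  return words >= MIN_WORDS
    ? { ok: true, words }
    : { ok: false, words, message: `Please enter at least ${MIN_WORDS} words of text` };
}

// Check a file's extension and size before uploading it.
function validateFile(name, sizeBytes) {
  const dot = name.lastIndexOf('.');
  const ext = dot >= 0 ? name.slice(dot).toLowerCase() : '';
  if (!ALLOWED_EXTENSIONS.includes(ext)) {
    return { ok: false, message: `Unsupported file type: ${ext || '(none)'}` };
  }
  if (sizeBytes > MAX_BYTES) {
    return { ok: false, message: 'File exceeds the 16MB limit' };
  }
  return { ok: true };
}

console.log(validateText('too short'), validateFile('notes.TXT', 1024));
```

In the page, `validateFile(file.name, file.size)` could run at the top of `handleFileUpload` so that oversized or unsupported files are rejected before any network request is made.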