xTHExBEASTx committed on
Commit 4815095 · verified · 1 Parent(s): 7698190

Upload 3 files

Files changed (3)
  1. README.md +178 -11
  2. app.py +380 -0
  3. requirements.txt +8 -0
README.md CHANGED
@@ -1,13 +1,180 @@
  ---
- title: Pdf Summarizer
- emoji: πŸƒ
- colorFrom: gray
- colorTo: purple
- sdk: gradio
- sdk_version: 6.2.0
- app_file: app.py
- pinned: false
- short_description: pdf-summarizer
- ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # 📚 AI-Powered PDF Summarizer
+
+ An intelligent PDF summarization tool powered by state-of-the-art Hugging Face transformer models. Upload any PDF document and get a comprehensive, well-structured summary perfect for studying, research, or quick document review.
+
+ ## 🌟 Features
+
+ ### 🤖 Multiple AI Models
+ - **BART (facebook/bart-large-cnn)**: Fast, high-quality summarization for general documents
+ - **Long-T5 (google/long-t5-tglobal-base)**: Optimized for very long documents and academic papers
+
+ ### ⚡ Smart Processing
+ - Intelligent text chunking with overlap for context preservation
+ - Progress tracking during summarization
+ - Handles documents of any length
+ - GPU acceleration support (when available)
+
+ ### 📝 Flexible Output
+ - Choose between bullet points or paragraph format
+ - Downloadable markdown files
+ - Statistics about your document
+ - Clean, readable formatting
+
+ ### 🎨 User-Friendly Interface
+ - Simple drag-and-drop file upload
+ - Real-time progress updates
+ - Advanced settings for fine-tuned control
+ - Beautiful, responsive design
+
+ ## 🚀 Quick Start
+
+ ### Local Installation
+
+ 1. Clone or download this repository
+
+ 2. Install dependencies:
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ 3. Run the application:
+ ```bash
+ python app.py
+ ```
+
+ 4. Open your browser to `http://localhost:7860`
+
+ ### Hugging Face Spaces Deployment
+
+ See the detailed deployment guide below for step-by-step instructions.
+
+ ## 📖 How to Use
+
+ 1. **Upload PDF**: Click or drag your PDF file to the upload area
+ 2. **Select Model**: Choose between BART (faster) or Long-T5 (better for long docs)
+ 3. **Choose Style**: Pick bullet points or paragraph format
+ 4. **Adjust Settings** (optional): Fine-tune chunk size and summary length
+ 5. **Generate**: Click the "Generate Summary" button
+ 6. **Download**: Get your summary as a markdown file
+
+ ## ⚙️ Advanced Settings
+
+ ### Chunk Size (1000-8000 characters)
+ - **Default**: 3000 characters
+ - **Smaller chunks**: Faster processing, may lose some context
+ - **Larger chunks**: Better context, slower processing
+
+ ### Chunk Overlap (0-1000 characters)
+ - **Default**: 200 characters
+ - **Purpose**: Maintains context between chunks
+ - **Higher overlap**: Better continuity, slightly slower
+
+ ### Summary Length
+ - **Max Length**: 50-500 tokens per section (default: 150)
+ - **Min Length**: 10-100 tokens per section (default: 30)
+ - Adjust based on how detailed you want the summary
+
+ ## 🎯 Best Practices
+
+ ### For Best Results:
+ - Use clear, text-based PDFs (not scanned images)
+ - For technical documents: Use Long-T5 model
+ - For general documents: BART works great
+ - Large files (100+ pages): Increase chunk size to 4000-5000
+
+ ### Processing Times:
+ - Short documents (1-10 pages): 10-30 seconds
+ - Medium documents (10-50 pages): 30-120 seconds
+ - Large documents (50+ pages): 2-5 minutes
+
+ ## 🛠️ Technical Details
+
+ ### Models Used
+
+ **BART (facebook/bart-large-cnn)**
+ - 406M parameters
+ - Trained on CNN/DailyMail dataset
+ - Excellent for news, articles, general documents
+ - Fast inference time
+
+ **Long-T5 (google/long-t5-tglobal-base)**
+ - 250M parameters
+ - Handles inputs up to 16,384 tokens
+ - Better for academic papers and long-form content
+ - Slightly slower but more comprehensive
+
+ ### Technologies
+ - **Gradio**: Web interface
+ - **Transformers**: Hugging Face models
+ - **PyMuPDF (fitz)**: PDF text extraction
+ - **LangChain**: Text splitting and chunking
+ - **PyTorch**: Deep learning backend
+
+ ## 📊 Example Use Cases
+
+ - **Students**: Summarize textbooks and research papers
+ - **Researchers**: Quick overview of academic literature
+ - **Professionals**: Digest reports and documentation
+ - **Anyone**: Understand long documents quickly
+
+ ## 🔒 Privacy & Security
+
+ - Documents are processed in real-time
+ - No permanent storage of uploaded files
+ - Processing happens on your selected infrastructure
+ - Temporary files are automatically cleaned up
+
+ ## 🐛 Troubleshooting
+
+ ### PDF Upload Failed
+ - Ensure PDF is not password-protected
+ - Check file is not corrupted
+ - Try re-saving the PDF
+
+ ### Summary Quality Issues
+ - Try the Long-T5 model for better quality
+ - Adjust chunk size based on document type
+ - Increase max summary length for more detail
+
+ ### Out of Memory Errors
+ - Reduce chunk size
+ - Use CPU instead of GPU (slower but stable)
+ - Process smaller sections at a time
+
+ ## 📝 Requirements
+
+ - Python 3.9 or higher
+ - 4GB+ RAM (8GB+ recommended)
+ - GPU optional (speeds up processing significantly)
+
+ ## 🤝 Contributing
+
+ Contributions are welcome! Feel free to:
+ - Report bugs
+ - Suggest new features
+ - Improve documentation
+ - Submit pull requests
+
+ ## 📄 License
+
+ This project is open source and available under the MIT License.
+
+ ## 🙏 Acknowledgments
+
+ - Hugging Face for the amazing transformer models
+ - Facebook AI for BART
+ - Google Research for Long-T5
+ - Gradio team for the excellent UI framework
+
+ ## 📧 Support
+
+ For issues or questions:
+ - Open an issue on GitHub
+ - Check existing documentation
+ - Review the troubleshooting section
+
  ---

+ **Made with ❤️ for efficient document summarization**
+
+ Happy summarizing! 📚✨
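
Note on the README's chunk settings: app.py passes `length_function=len` to its splitter, so chunk size and overlap are measured in characters. The sketch below illustrates what "overlap" means with a fixed-offset sliding window. It is not the `RecursiveCharacterTextSplitter` the app uses (that one also prefers to cut on paragraph, line, and word boundaries); `split_with_overlap` is a hypothetical helper written only for this illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into character windows; neighbours share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end, so no redundant tail chunk is emitted.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
    # ['abcd', 'defg', 'ghij']
```

Each window repeats the last character of the previous one, which is the continuity the overlap setting buys at the cost of some repeated work.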
app.py ADDED
@@ -0,0 +1,380 @@
+ import os
+ import gradio as gr
+ import fitz  # PyMuPDF
+ from transformers import pipeline
+ from langchain_text_splitters import RecursiveCharacterTextSplitter
+ import torch
+
+ # Check if CUDA is available
+ device = 0 if torch.cuda.is_available() else -1
+
+ # Initialize summarization pipelines at startup
+ print("Loading AI models... This may take a few minutes on first run.")
+
+ try:
+     bart_summarizer = pipeline(
+         "summarization",
+         model="facebook/bart-large-cnn",
+         device=device
+     )
+     print("✓ BART model loaded successfully")
+ except Exception as e:
+     print(f"✗ Error loading BART model: {e}")
+     bart_summarizer = None
+
+ try:
+     longt5_summarizer = pipeline(
+         "summarization",
+         model="google/long-t5-tglobal-base",
+         device=device
+     )
+     print("✓ Long-T5 model loaded successfully")
+ except Exception as e:
+     print(f"✗ Error loading Long-T5 model: {e}")
+     longt5_summarizer = None
+
+ print("Models ready!")
+
+ def extract_text_from_pdf(pdf_file) -> tuple[str, str]:
+     """
+     Extracts text from the uploaded PDF file.
+
+     Args:
+         pdf_file: Uploaded file as passed by Gradio (a path string or an
+             object exposing the path as .name)
+
+     Returns:
+         tuple: (extracted_text, error_message); error_message is None on success
+     """
+     # Accept both a plain path string and a file-like object with .name
+     pdf_path = getattr(pdf_file, "name", pdf_file)
+     text = ""
+     try:
+         with fitz.open(pdf_path) as doc:
+             for page in doc:
+                 text += page.get_text()
+
+         if not text.strip():
+             return "", "PDF appears to be empty or contains only images."
+
+         return text, None
+     except Exception as e:
+         return "", f"Error reading PDF: {str(e)}"
+
+ def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
+     """
+     Split text into manageable chunks.
+
+     Args:
+         text: The text to split
+         chunk_size: Maximum size of each chunk, in characters
+         chunk_overlap: Overlap between consecutive chunks, in characters
+
+     Returns:
+         list: List of text chunks
+     """
+     text_splitter = RecursiveCharacterTextSplitter(
+         chunk_size=chunk_size,
+         chunk_overlap=chunk_overlap,
+         length_function=len,  # lengths are counted in characters
+         separators=["\n\n", "\n", " ", ""]
+     )
+     return text_splitter.split_text(text)
+
+ def summarize_chunk(chunk: str, model_name: str, max_length: int, min_length: int) -> str:
+     """
+     Summarize a single chunk of text.
+
+     Args:
+         chunk: Text to summarize
+         model_name: Dropdown label of the model; "BART (Fast, High Quality)"
+             selects BART, any other value selects Long-T5
+         max_length: Maximum summary length (tokens)
+         min_length: Minimum summary length (tokens)
+
+     Returns:
+         str: Summarized text, or an error message if summarization fails
+     """
+     try:
+         summarizer = bart_summarizer if model_name == "BART (Fast, High Quality)" else longt5_summarizer
+
+         if summarizer is None:
+             return "Error: Model not loaded properly."
+
+         # Adjust lengths based on chunk size; the floors keep very short
+         # chunks from producing a zero or negative limit
+         actual_max = max(1, min(int(max_length), len(chunk.split()) // 2))
+         actual_min = max(0, min(int(min_length), actual_max - 10))
+
+         result = summarizer(
+             chunk,
+             max_length=actual_max,
+             min_length=actual_min,
+             do_sample=False,
+             truncation=True
+         )
+
+         return result[0]['summary_text']
+     except Exception as e:
+         return f"Error summarizing chunk: {str(e)}"
+
+ def process_pdf(pdf_file, model_name, chunk_size, chunk_overlap, max_length, min_length, summary_style):
+     """
+     Main processing function: Extract → Chunk → Summarize → Synthesize.
+
+     Args:
+         pdf_file: Uploaded PDF file
+         model_name: Selected model
+         chunk_size: Size of text chunks
+         chunk_overlap: Overlap between chunks
+         max_length: Maximum summary length
+         min_length: Minimum summary length
+         summary_style: Style of summary (Bullet Points or Paragraph)
+
+     Yields:
+         tuple: (status_message, output_file_path)
+     """
+     if pdf_file is None:
+         yield "⚠️ Please upload a PDF file first.", None
+         return
+
+     # Extract text from PDF
+     yield "📄 Reading PDF and extracting text...", None
+     full_text, error = extract_text_from_pdf(pdf_file)
+
+     if error:
+         yield f"❌ {error}", None
+         return
+
+     # Get basic stats
+     word_count = len(full_text.split())
+     char_count = len(full_text)
+
+     yield f"✅ Extracted {word_count:,} words ({char_count:,} characters)\n\n📊 Splitting text into sections...", None
+
+     # Split into chunks
+     chunks = chunk_text(full_text, int(chunk_size), int(chunk_overlap))
+     total_chunks = len(chunks)
+
+     if total_chunks == 0:
+         yield "❌ No text could be extracted from the PDF.", None
+         return
+
+     yield f"✅ Created {total_chunks} sections\n\n🤖 Starting summarization...", None
+
+     # Summarize each chunk
+     intermediate_summaries = []
+     for i, chunk in enumerate(chunks, 1):
+         yield f"🔄 Processing section {i}/{total_chunks}...", None
+
+         summary = summarize_chunk(chunk, model_name, max_length, min_length)
+         intermediate_summaries.append(summary)
+
+     yield f"✅ Completed all sections\n\n🎯 Creating final structured summary...", None
+
+     # Create final summary
+     if len(intermediate_summaries) > 1:
+         combined = "\n\n".join(intermediate_summaries)
+
+         # Style instruction (currently unused: it is never passed to the
+         # summarization pipeline, so the style only affects the output template)
+         if summary_style == "Bullet Points":
+             style_instruction = "Create a well-organized summary with clear bullet points and headings."
+         else:
+             style_instruction = "Create a comprehensive, flowing paragraph summary."
+
+         final_summary = summarize_chunk(
+             combined,
+             model_name,
+             max_length * 2,  # Allow longer final summary
+             min_length
+         )
+     else:
+         final_summary = intermediate_summaries[0]
+
+     # Format the output based on style
+     if summary_style == "Bullet Points":
+         formatted_summary = f"""# 📚 PDF Summary
+
+ **Original Document:** {os.path.basename(pdf_file.name)}
+ **Word Count:** {word_count:,}
+ **Sections Processed:** {total_chunks}
+ **Model Used:** {model_name}
+
+ ---
+
+ ## Summary
+
+ {final_summary}
+
+ ---
+
+ *Generated with Hugging Face Transformers*
+ """
+     else:
+         formatted_summary = f"""# 📚 PDF Summary
+
+ **Original Document:** {os.path.basename(pdf_file.name)}
+ **Word Count:** {word_count:,}
+ **Sections Processed:** {total_chunks}
+ **Model Used:** {model_name}
+
+ ---
+
+ {final_summary}
+
+ ---
+
+ *Generated with Hugging Face Transformers*
+ """
+
+     # Save to file (written to the current working directory)
+     base_name = os.path.splitext(os.path.basename(pdf_file.name))[0]
+     output_path = f"{base_name}_Summary.md"
+
+     try:
+         with open(output_path, "w", encoding="utf-8") as f:
+             f.write(formatted_summary)
+     except Exception as e:
+         yield f"❌ Error saving file: {str(e)}\n\n{formatted_summary}", None
+         return
+
+     yield formatted_summary, output_path
+
+ # --- GRADIO UI DESIGN ---
+ with gr.Blocks(theme=gr.themes.Soft(), title="PDF Summarizer") as demo:
+     gr.Markdown("""
+     # 📚 AI-Powered PDF Summarizer
+
+     Upload any PDF document and get an intelligent, comprehensive summary using state-of-the-art AI models.
+     Perfect for research papers, textbooks, reports, and study materials!
+     """)
+
+     with gr.Row():
+         with gr.Column(scale=1):
+             gr.Markdown("### 📤 Upload & Configure")
+
+             file_input = gr.File(
+                 label="Upload PDF Document",
+                 file_types=[".pdf"],
+                 type="filepath"
+             )
+
+             model_dropdown = gr.Dropdown(
+                 choices=[
+                     "BART (Fast, High Quality)",
+                     "Long-T5 (Better for Very Long Documents)"
+                 ],
+                 value="BART (Fast, High Quality)",
+                 label="🤖 Select AI Model",
+                 info="BART is faster and works great for most documents"
+             )
+
+             summary_style = gr.Radio(
+                 choices=["Bullet Points", "Paragraph"],
+                 value="Bullet Points",
+                 label="📝 Summary Style",
+                 info="Choose how you want the summary formatted"
+             )
+
+             with gr.Accordion("⚙️ Advanced Settings", open=False):
+                 gr.Markdown("*Adjust these settings for fine-tuned control*")
+
+                 chunk_size = gr.Slider(
+                     minimum=1000,
+                     maximum=8000,
+                     value=3000,
+                     step=500,
+                     label="Chunk Size",
+                     info="Larger chunks = more context but slower processing"
+                 )
+
+                 chunk_overlap = gr.Slider(
+                     minimum=0,
+                     maximum=1000,
+                     value=200,
+                     step=50,
+                     label="Chunk Overlap",
+                     info="Overlap helps maintain context between chunks"
+                 )
+
+                 # max_length / min_length are forwarded to the pipeline,
+                 # which counts tokens rather than words
+                 max_length = gr.Slider(
+                     minimum=50,
+                     maximum=500,
+                     value=150,
+                     step=10,
+                     label="Max Summary Length (tokens)",
+                     info="Maximum length for each section summary"
+                 )
+
+                 min_length = gr.Slider(
+                     minimum=10,
+                     maximum=100,
+                     value=30,
+                     step=5,
+                     label="Min Summary Length (tokens)",
+                     info="Minimum length for each section summary"
+                 )
+
+             run_btn = gr.Button("🚀 Generate Summary", variant="primary", size="lg")
+
+             gr.Markdown("""
+             ---
+             ### 💡 Tips:
+             - **Best results**: Use clear, text-based PDFs
+             - **Large files**: May take a few minutes to process
+             - **Very long docs**: Try Long-T5 model for better results
+             """)
+
+         with gr.Column(scale=2):
+             gr.Markdown("### 📊 Results")
+
+             output_text = gr.Markdown(
+                 label="Generated Summary",
+                 value="*Your summary will appear here...*"
+             )
+
+             file_output = gr.File(
+                 label="📥 Download Summary (.md)",
+                 interactive=False
+             )
+
+             gr.Markdown("""
+             ---
+             ### ℹ️ About the Models:
+
+             **BART (facebook/bart-large-cnn)**
+             - Fast and efficient
+             - Excellent for general documents
+             - Great summary quality
+
+             **Long-T5 (google/long-t5-tglobal-base)**
+             - Handles very long documents
+             - Better for academic papers
+             - Slightly slower but more comprehensive
+             """)
+
+     # Connect the button to the processing function
+     run_btn.click(
+         fn=process_pdf,
+         inputs=[
+             file_input,
+             model_dropdown,
+             chunk_size,
+             chunk_overlap,
+             max_length,
+             min_length,
+             summary_style
+         ],
+         outputs=[output_text, file_output]
+     )
+
+     gr.Markdown("""
+     ---
+     ### 🔒 Privacy Notice
+     Your documents are processed securely and are not stored permanently.
+
+     Made with ❤️ using Hugging Face Transformers
+     """)
+
+ if __name__ == "__main__":
+     demo.queue(max_size=10).launch(
+         server_name="0.0.0.0",
+         server_port=7860,
+         share=False
+     )
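
Note on `process_pdf`: it summarizes every chunk, then runs the joined partial summaries through the same summarizer once more. A minimal sketch of that two-pass shape follows, with the model replaced by an injectable callable so it runs without downloading anything. `map_reduce_summary` and `first_sentence` are hypothetical names for this illustration and do not appear in app.py.

```python
from typing import Callable


def map_reduce_summary(chunks: list[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk, then summarize the joined partial summaries."""
    if not chunks:
        raise ValueError("no chunks to summarize")

    # Map step: one partial summary per chunk.
    partials = [summarize(chunk) for chunk in chunks]

    # A single chunk needs no second pass.
    if len(partials) == 1:
        return partials[0]

    # Reduce step: condense the partial summaries into one final summary.
    return summarize("\n\n".join(partials))


def first_sentence(text: str) -> str:
    """Stand-in 'summarizer' (an assumption for the demo): keep the first sentence."""
    return text.split(".")[0].strip() + "."


if __name__ == "__main__":
    print(map_reduce_summary(["A b. C d.", "E f. G h."], first_sentence))
    # A b.
```

Passing the summarizer in as a parameter keeps the control flow testable on its own; in the app the callable would wrap `summarize_chunk` with the selected model and length settings.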
requirements.txt ADDED
@@ -0,0 +1,8 @@
+ gradio==4.44.0
+ transformers==4.36.2
+ torch==2.1.2
+ PyMuPDF==1.23.8
+ langchain-text-splitters==0.0.1
+ sentencepiece==0.1.99
+ protobuf==4.25.1
+ accelerate==0.25.0
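
Note on the pins: every dependency is locked with `==`, so a local environment that drifts from these versions may behave differently from the Space. One way to check is sketched below using only the standard library. `parse_pins` and `find_mismatches` are hypothetical helpers, and the parser only understands plain `name==version` lines, not the full requirements syntax.

```python
from importlib import metadata


def parse_pins(requirements_text: str) -> dict:
    """Collect name -> version for every 'name==version' line."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins: dict) -> dict:
    """Return name -> (pinned, installed) where they differ; installed is None if absent."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    with open("requirements.txt", encoding="utf-8") as f:
        for name, (wanted, installed) in find_mismatches(parse_pins(f.read())).items():
            print(f"{name}: pinned {wanted}, installed {installed}")
```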