LovnishVerma committed on
Commit 0950f3f · verified · 1 Parent(s): c30eefc

Update README.md

Files changed (1)
  1. README.md +181 -137
README.md CHANGED
@@ -4,7 +4,7 @@ emoji: 📄
4
  colorFrom: blue
5
  colorTo: purple
6
  sdk: gradio
7
- sdk_version: 5.31.0
8
  app_file: app.py
9
  pinned: false
10
  license: mit
@@ -25,192 +25,236 @@ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-
25
 
26
  An intelligent PDF document summarizer powered by state-of-the-art transformer models. Upload any PDF and get comprehensive, accurate summaries in seconds with advanced text processing.
27
 
28
- ## 🌟 Key Features
29
 
30
- ### 🚀 **Fast Processing**
31
- - **Fast Model**: DistilBART for quick summaries (⚡ ~5-10 seconds)
32
 
 
33
 
34
- ### 📊 **Intelligent Text Analysis**
35
- - **Smart Chunking**: Semantic boundary detection for better context preservation
36
- - **Hierarchical Summarization**: Multi-stage processing for long documents
37
- - **Quality Metrics**: Automatic readability and coverage assessment
38
- - **Extractive Fallback**: Backup summarization when abstractive fails
 
 
 
39
 
40
- ### 🎯 **Flexible Summary Options**
41
- - **Brief (Quick)**: Concise overviews (60-80 words per section)
42
- - **Detailed**: Balanced summaries (100-130 words per section)
43
- - **Comprehensive**: In-depth analysis (150-200 words per section)
44
 
45
- ### 💡 **Advanced Processing**
46
- - **Enhanced PDF Parsing**: Handles complex layouts and formatting
47
- - **Text Cleaning**: Removes artifacts and normalizes content
48
- - **Error Recovery**: Robust fallback systems for problematic documents
49
- - **Real-time Progress**: Live processing status and metrics
50
 
51
- ## 🎮 Try It Now
 
52
 
53
- **[🚀 Launch the App](https://huggingface.co/spaces/your-username/pdf-summarizer)**
54
 
55
- Simply upload a PDF and watch the AI generate intelligent summaries instantly!
 
 
 
 
56
 
57
- ## 📖 How to Use
 
 
 
58
 
59
- 1. **Upload PDF**: Click "Upload PDF Document" and select your file
60
- 2. **Choose Settings**:
61
- - Select summary detail level (Brief/Detailed/Comprehensive)
62
- 3. **Generate**: Click "Generate Smart Summary" or wait for auto-processing
63
- 4. **Review**: Get your summary with detailed statistics and metrics
64
 
65
- ## 🛠️ Technical Details
66
 
67
- ### **Models**
68
- - **DistilBART** (`sshleifer/distilbart-cnn-12-6`): Fast, lightweight summarization
69
 
70
- ### **Processing Pipeline**
71
- 1. **PDF Text Extraction**: PyPDF2 with error handling
72
- 2. **Text Preprocessing**: Cleaning, normalization, artifact removal
73
- 3. **Intelligent Chunking**: Sentence-aware segmentation with overlap prevention
74
- 4. **Multi-stage Summarization**: Hierarchical processing for optimal results
75
- 5. **Quality Assessment**: Automatic metrics and readability analysis
76
 
77
- ### **Performance Optimization**
78
- - **GPU Acceleration**: CUDA support when available
79
- - **Memory Management**: Efficient processing for large documents
80
- - **Batch Processing**: Optimized chunk handling
81
- - **Early Stopping**: Smart termination for faster results
82
 
83
- ## 📋 Requirements
84
 
85
  ```
86
- gradio>=4.0.0
87
- transformers>=4.30.0
88
- torch>=2.0.0
89
- PyPDF2>=3.0.0
90
- accelerate>=0.20.0
91
- sentencepiece>=0.1.99
92
- protobuf>=3.20.0
93
- tokenizers>=0.13.0
94
- ```
95
 
96
- ## 🎯 Best Results Tips
 
 
97
 
98
- ### **Document Quality**
99
- - ✅ **Text-based PDFs**: Selectable text (not scanned images)
100
- - ✅ **Optimal Length**: 500-50,000 words
101
- - ✅ **Language**: English content (optimized)
102
- - ✅ **Format**: Well-structured documents
103
 
104
- ### **Summary Type Guide**
105
- - **Brief**: Perfect for quick scanning and overview
106
- - **Detailed**: Ideal for most use cases, good balance
107
- - **Comprehensive**: Best for thorough analysis and research
 
108
 
109
- ## 📊 Example Results
 
 
 
110
 
111
- ### **Input**: 50-page research paper (12,000 words)
112
- - **Processing Time**: 45 seconds
113
- - **Output**: 800-word comprehensive summary
114
- - **Compression**: 15:1 ratio
115
- - **Coverage**: 95% of key topics
116
 
117
- ### **Input**: 10-page report (3,000 words)
118
- - **Processing Time**: 8 seconds
119
- - **Output**: 200-word detailed summary
120
- - **Compression**: 15:1 ratio
121
- - **Coverage**: 90% of main points
122
 
123
- ## 🔧 Advanced Features
 
 
 
 
124
 
125
- ### **Quality Metrics**
126
- - **Readability Score**: Based on summary complexity
127
- - **Coverage Analysis**: Percentage of document topics covered
128
- - **Compression Ratio**: Original:Summary word ratio
129
- - **Processing Efficiency**: Time and resource usage stats
130
 
131
- ### **Error Handling**
132
- - **Graceful Degradation**: Falls back to simpler methods if needed
133
- - **Content Validation**: Checks for sufficient extractable text
134
- - **Format Support**: Handles various PDF structures and layouts
135
- - **Recovery Systems**: Multiple fallback summarization strategies
136
 
137
- ## 🎨 Interface Features
 
 
138
 
139
- - **Modern UI**: Clean, intuitive design with responsive layout
140
- - **Real-time Feedback**: Live processing status and progress
141
- - **Detailed Statistics**: Comprehensive document analysis
142
- - **Copy-friendly Output**: Easy text selection and copying
143
- - **Mobile Responsive**: Works on all device sizes
144
 
145
- ## 🚀 Performance Benchmarks
 
 
146
 
147
- | Document Size | Fast Mode | Balanced Mode | Quality Mode |
148
- |---------------|-----------|---------------|--------------|
149
- | 1-5 pages | 3-8s | 8-15s | 15-30s |
150
- | 6-20 pages | 8-20s | 20-45s | 45-90s |
151
- | 21-50 pages | 20-60s | 60-120s | 120-240s |
152
 
153
- *Benchmarks on CPU. GPU acceleration provides 2-4x speedup.*
 
 
154
 
155
- ## 🔬 Use Cases
 
156
 
157
- ### **Academic & Research**
158
- - Research paper analysis
159
- - Literature review summaries
160
- - Thesis chapter overviews
161
- - Conference paper digests
162
 
163
- ### **Business & Professional**
164
- - Report summarization
165
- - Legal document analysis
166
- - Technical documentation
167
- - Meeting minutes processing
168
 
169
- ### **Personal & Educational**
170
- - Book chapter summaries
171
- - Article condensation
172
- - Study material preparation
173
- - Content curation
174
 
175
- ## 🛡️ Privacy & Security
 
 
 
176
 
177
- - **No Data Storage**: Files processed in memory only
178
- - **Secure Processing**: No permanent file retention
179
- - **Privacy First**: Documents not logged or cached
180
- - **Local Processing**: All computation on Hugging Face infrastructure
181
 
182
  ## 🤝 Contributing
183
 
184
- This is an open-source project! Contributions welcome:
185
 
186
- - **Bug Reports**: Issues and edge cases
187
- - **Feature Requests**: New capabilities and improvements
188
- - **Model Integration**: Additional transformer models
189
- - **UI Enhancements**: Better user experience
 
190
 
191
  ## 📄 License
192
 
193
- This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
194
 
195
- ## 🙏 Acknowledgments
196
 
197
- - **Hugging Face**: For the amazing Transformers library and hosting
198
- - **Facebook AI**: For the BART model architecture
199
- - **Gradio Team**: For the excellent web interface framework
200
- - **PyPDF2**: For reliable PDF text extraction
201
 
202
- ## 📞 Support
 
 
 
203
 
204
- - **Issues**: [GitHub Issues](https://github.com/your-username/pdf-summarizer/issues)
205
- - **Discussions**: [Hugging Face Community](https://huggingface.co/spaces/your-username/pdf-summarizer/discussions)
206
- - **Documentation**: [Wiki](https://github.com/your-username/pdf-summarizer/wiki)
207
 
208
- ---
209
-
210
- <div align="center">
 
 
 
211
 
212
- **Made with ❤️ using 🤗 Transformers**
 
 
 
 
213
 
214
- [Try the App](https://huggingface.co/spaces/your-username/pdf-summarizer) • [GitHub Repo](https://github.com/your-username/pdf-summarizer) • [Report Bug](https://github.com/your-username/pdf-summarizer/issues)
215
 
 
216
  </div>
 
4
  colorFrom: blue
5
  colorTo: purple
6
  sdk: gradio
7
+ sdk_version: 5.32.0
8
  app_file: app.py
9
  pinned: false
10
  license: mit
 
25
 
26
  An intelligent PDF document summarizer powered by state-of-the-art transformer models. Upload any PDF and get comprehensive, accurate summaries in seconds with advanced text processing.
27
 
28
+ # ⚡ Ultra-Fast AI PDF Summarizer
29
 
30
+ A lightning-fast PDF summarization tool powered by AI that can process documents and generate intelligent summaries in seconds. Built with Gradio for an intuitive web interface and optimized for maximum speed without sacrificing quality.
 
31
 
32
+ ## 🚀 Features
33
 
34
+ - **⚡ Ultra-Fast Processing**: Optimized for speed with lazy loading and smart chunking
35
+ - **🤖 AI-Powered**: Uses state-of-the-art BART models for intelligent summarization
36
+ - **📄 PDF Support**: Extracts and processes text from PDF documents automatically
37
+ - **🎯 Multiple Summary Types**: Brief, Detailed, and Comprehensive options
38
+ - **🔄 Smart Fallbacks**: Automatically switches to extractive summarization for large documents
39
+ - **📊 Document Statistics**: Provides detailed analytics about your documents
40
+ - **🖥️ Web Interface**: Easy-to-use Gradio interface accessible via browser
41
+ - **⚙️ GPU Acceleration**: Automatic GPU detection and utilization when available
42
 
43
+ ## 🛠️ Installation
 
 
 
44
 
45
+ ### Prerequisites
 
 
 
 
46
 
47
+ - Python 3.8 or higher
48
+ - pip package manager
49
 
50
+ ### Quick Setup
51
 
52
+ 1. **Clone or download the repository**
53
+ ```bash
54
+ git clone <repository-url>
55
+ cd ultra-fast-pdf-summarizer
56
+ ```
57
 
58
+ 2. **Install dependencies**
59
+ ```bash
60
+ pip install -r requirements.txt
61
+ ```
62
 
63
+ 3. **Run the application**
64
+ ```bash
65
+ python app.py
66
+ ```
 
67
 
68
+ 4. **Open your browser** and navigate to the URL shown in the terminal (usually `http://127.0.0.1:7860`)
69
 
70
+ ## 📋 Requirements
 
71
 
72
+ See `requirements.txt` for the complete list of dependencies. Key packages include:
 
 
 
 
 
73
 
74
+ - **gradio**: Web interface framework
75
+ - **transformers**: Hugging Face transformers for AI models
76
+ - **torch**: PyTorch for deep learning
77
+ - **PyPDF2**: PDF text extraction
78
+ - **nltk**: Natural language processing toolkit
79
 
80
+ ## 🚀 Usage
81
 
82
+ ### Basic Usage
83
+
84
+ 1. **Upload a PDF**: Click "Upload PDF" and select your document
85
+ 2. **Choose Summary Type**:
86
+ - **Brief (Quick)**: Fast, concise summary
87
+ - **Detailed**: Balanced detail and speed
88
+ - **Comprehensive**: Most detailed summary
89
+ 3. **Generate**: Click "⚡ Generate Summary", or let the upload start processing automatically
90
+ 4. **View Results**: See your summary and document statistics
91
+
92
+ ### Programmatic Usage
93
+
94
+ ```python
95
+ from app import FastPDFSummarizer  # assumes the class is defined in app.py, the Space's app_file
96
+
97
+ # Initialize summarizer
98
+ summarizer = FastPDFSummarizer()
99
+
100
+ # Process a PDF file
101
+ summary, stats, status = summarizer.process_pdf_fast("document.pdf", "Brief (Quick)")
102
+ print(summary)
103
  ```
 
 
 
 
 
 
 
 
 
104
 
105
+ ## ⚡ Speed Optimizations
106
+
107
+ This tool is specifically optimized for speed:
108
 
109
+ ### Model Optimizations
110
+ - **Lazy Loading**: Models load only when needed
111
+ - **Lightweight Model**: Uses `distilbart-cnn-6-6` for optimal speed/quality balance
112
+ - **Single-Beam Decoding**: Generates with a single beam (greedy decoding), the fastest setting
113
+ - **GPU Acceleration**: Automatic CUDA utilization
114
 
115
+ ### Processing Optimizations
116
+ - **Page Limiting**: Processes at most 20 pages per document
117
+ - **Smart Chunking**: Splits text into at most 3 chunks to reduce processing time
118
+ - **Extractive Fallback**: Ultra-fast summarization for large documents
119
+ - **Efficient Text Cleaning**: Optimized regex operations
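The extractive fallback listed above can be pictured with a small frequency-based sketch. This is only an illustration under assumptions: the README does not say how the app scores sentences, and the function name `extractive_summary`, the regex sentence splitter, and the "words longer than three letters" filter are all hypothetical choices made for this example.

```python
import re
from collections import Counter


def extractive_summary(text: str, max_sentences: int = 3) -> str:
    """Return the highest-scoring sentences, kept in their original order.

    Hypothetical helper: scores each sentence by the average document-wide
    frequency of its words. Not taken from the app's source code.
    """
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) <= max_sentences:
        return " ".join(sentences)

    # Count only words longer than three letters as a crude stop-word filter.
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore document order
    return " ".join(sentences[i] for i in keep)
```

Because no model is loaded, a routine like this runs in milliseconds, which is why an extractive path is a plausible fallback for large documents.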
120
 
121
+ ### Memory Optimizations
122
+ - **Low Memory Usage**: Configured for minimal RAM consumption
123
+ - **Cache Optimization**: Efficient model caching
124
+ - **16-bit Precision**: Uses float16 on GPU for speed
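A minimal sketch of how automatic GPU detection and the float16-on-GPU rule might be combined. The helper name `pick_device_and_dtype` and its `force_cpu` flag are assumptions for illustration; the only library call used is `torch.cuda.is_available()`, and the function degrades to CPU if PyTorch is not installed.

```python
def pick_device_and_dtype(force_cpu: bool = False):
    """Return a (device, dtype_name) pair such as ("cuda", "float16").

    Hypothetical helper, not the app's code: float16 is chosen only when a
    CUDA device is available, otherwise CPU with float32.
    """
    if force_cpu:
        return "cpu", "float32"
    try:
        import torch  # imported lazily so the helper works without PyTorch
    except ImportError:
        return "cpu", "float32"
    if torch.cuda.is_available():
        return "cuda", "float16"
    return "cpu", "float32"
```

Calling `pick_device_and_dtype(force_cpu=True)` mirrors the "Force CPU usage" workaround for CUDA out-of-memory errors.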
125
 
126
+ ## 📊 Performance
 
 
 
 
127
 
128
+ ### Typical Processing Times
129
+ - **Small PDFs** (1-5 pages): 2-5 seconds
130
+ - **Medium PDFs** (5-15 pages): 5-15 seconds
131
+ - **Large PDFs** (15-20 pages): 10-30 seconds
 
132
 
133
+ ### Hardware Recommendations
134
+ - **CPU**: Modern multi-core processor
135
+ - **RAM**: 4GB minimum, 8GB+ recommended
136
+ - **GPU**: NVIDIA GPU with CUDA support (optional, for acceleration)
137
+ - **Storage**: 2GB free space for models
138
 
139
+ ## 🔧 Configuration
 
 
 
 
140
 
141
+ ### Model Selection
142
+ You can change the model in the code for different speed/quality trade-offs:
 
 
 
143
 
144
+ ```python
145
+ # Ultra-fast (lower quality)
146
+ self.model_name = "sshleifer/distilbart-cnn-6-6"
147
 
148
+ # Balanced (default)
149
+ self.model_name = "sshleifer/distilbart-cnn-12-6"
 
 
 
150
 
151
+ # High quality (slower)
152
+ self.model_name = "facebook/bart-large-cnn"
153
+ ```
154
 
155
+ ### Processing Limits
156
+ Adjust these parameters in the code:
 
 
 
157
 
158
+ ```python
159
+ # Maximum pages to process
160
+ max_pages = min(20, len(pdf_reader.pages))
161
 
162
+ # Maximum chunks for processing
163
+ return chunks[:3]
164
 
165
+ # Maximum words per chunk
166
+ max_length: int = 1000
167
+ ```
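To show how the three limits above could fit together, here is a hedged sketch. The function names `extract_text_limited` and `chunk_text` are hypothetical; the first accepts any sequence of page objects exposing `extract_text()` (as PyPDF2's `PdfReader.pages` does), and the regex sentence splitter is an assumption, since the app may use NLTK instead.

```python
import re


def extract_text_limited(pages, max_pages: int = 20) -> str:
    """Join the text of at most max_pages pages (hypothetical helper)."""
    limit = min(max_pages, len(pages))
    # extract_text() can return None or "" for image-only pages.
    return "\n".join((pages[i].extract_text() or "") for i in range(limit))


def chunk_text(text: str, max_length: int = 1000, max_chunks: int = 3):
    """Group whole sentences into chunks of up to max_length words each."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if n == 0:
            continue
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and count + n > max_length:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks[:max_chunks]  # anything beyond max_chunks is dropped
```

In this sketch a single sentence longer than `max_length` still becomes its own oversized chunk, and text past the third chunk is silently discarded, which is the speed-for-coverage trade-off the limits imply.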
 
 
168
 
169
+ ## 🐛 Troubleshooting
 
 
 
 
170
 
171
+ ### Common Issues
 
 
 
 
172
 
173
+ **1. "No module named 'transformers'"**
174
+ ```bash
175
+ pip install transformers torch
176
+ ```
177
 
178
+ **2. NLTK data not found**
179
+ The app automatically downloads required NLTK data, but if issues persist:
180
+ ```python
181
+ import nltk
182
+ nltk.download('punkt')
183
+ ```
184
+
185
+ **3. CUDA out of memory**
186
+ - Reduce batch size or disable GPU:
187
+ ```python
188
+ device = "cpu" # Force CPU usage
189
+ ```
190
+
191
+ **4. PDF text extraction fails**
192
+ - Ensure PDF has extractable text (not just images)
193
+ - Try OCR preprocessing for scanned PDFs
194
+
195
+ ### Performance Issues
196
+
197
+ **Slow processing:**
198
+ - Check if GPU is being utilized
199
+ - Reduce page limit or chunk size
200
+ - Use "Brief (Quick)" mode for fastest results
201
+
202
+ **Memory errors:**
203
+ - Close other applications
204
+ - Use CPU mode instead of GPU
205
+ - Process smaller documents
206
+
207
+ ## 📁 File Format Support
208
+
209
+ ### Supported Formats
210
+ - **PDF**: Primary format with full text extraction
211
+ - **Text Content**: Must be selectable/extractable text
212
+
213
+ ### Limitations
214
+ - **Scanned PDFs**: Requires OCR preprocessing
215
+ - **Image-only PDFs**: No text extraction possible
216
+ - **Password-protected PDFs**: Not supported
217
+ - **Very large files**: >100MB may cause memory issues
218
 
219
  ## 🤝 Contributing
220
 
221
+ We welcome contributions! Areas for improvement:
222
 
223
+ - **OCR Integration**: Support for scanned PDFs
224
+ - **Additional Formats**: Word documents, web pages, etc.
225
+ - **Model Options**: More model choices in the interface
226
+ - **Language Support**: Multi-language summarization
227
+ - **Export Options**: PDF, Word, markdown export
228
 
229
  ## 📄 License
230
 
231
+ This project is open source. Please check the license file for details.
232
 
233
+ ## 🆘 Support
234
 
235
+ If you encounter issues:
 
 
 
236
 
237
+ 1. **Check the troubleshooting section** above
238
+ 2. **Verify requirements** are properly installed
239
+ 3. **Check system resources** (RAM, storage)
240
+ 4. **Try with different PDF files** to isolate issues
241
 
242
+ ## 🔮 Future Enhancements
 
 
243
 
244
+ ### Planned Features
245
+ - **Batch Processing**: Multiple PDFs at once
246
+ - **Custom Models**: Upload your own trained models
247
+ - **API Endpoint**: REST API for integration
248
+ - **Cloud Deployment**: One-click cloud deployment
249
+ - **Mobile App**: Dedicated mobile application
250
 
251
+ ### Performance Improvements
252
+ - **Model Quantization**: Even faster inference
253
+ - **Streaming Processing**: Real-time summarization
254
+ - **Distributed Processing**: Multi-GPU support
255
+ - **Edge Optimization**: Optimized for edge devices
256
 
257
+ ---
258
 
259
+ **Built with ❤️ for fast, intelligent document processing**
260
  </div>