Update README.md
README.md
CHANGED
@@ -4,7 +4,7 @@ emoji:
 colorFrom: blue
 colorTo: purple
 sdk: gradio
-sdk_version: 5.
 app_file: app.py
 pinned: false
 license: mit
@@ -25,192 +25,236 @@ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-

 An intelligent PDF document summarizer powered by state-of-the-art transformer models. Upload any PDF and get comprehensive, accurate summaries in seconds with advanced text processing.

-#

-
-- **Fast Model**: DistilBART for quick summaries (⚡ ~5-10 seconds)


-
-- **
-- **
-- **
-- **

-##
-- **Brief (Quick)**: Concise overviews (60-80 words per section)
-- **Detailed**: Balanced summaries (100-130 words per section)
-- **Comprehensive**: In-depth analysis (150-200 words per section)

-###
-- **Enhanced PDF Parsing**: Handles complex layouts and formatting
-- **Text Cleaning**: Removes artifacts and normalizes content
-- **Error Recovery**: Robust fallback systems for problematic documents
-- **Real-time Progress**: Live processing status and metrics

-

-

-

-

-
-
-
-
-4. **Review**: Get your summary with detailed statistics and metrics

-

-##
-- **DistilBART** (`sshleifer/distilbart-cnn-12-6`): Fast, lightweight summarization

-
-1. **PDF Text Extraction**: PyPDF2 with error handling
-2. **Text Preprocessing**: Cleaning, normalization, artifact removal
-3. **Intelligent Chunking**: Sentence-aware segmentation with overlap prevention
-4. **Multi-stage Summarization**: Hierarchical processing for optimal results
-5. **Quality Assessment**: Automatic metrics and readability analysis

-
-- **
-- **
-- **
-- **

-##

 ```
-gradio>=4.0.0
-transformers>=4.30.0
-torch>=2.0.0
-PyPDF2>=3.0.0
-accelerate>=0.20.0
-sentencepiece>=0.1.99
-protobuf>=3.20.0
-tokenizers>=0.13.0
-```

-##

-###
--
--
--
--

-###
-- **
-- **
-- **

-##

-##
-- **Processing Time**: 45 seconds
-- **Output**: 800-word comprehensive summary
-- **Compression**: 15:1 ratio
-- **Coverage**: 95% of key topics

-###
-- **
-- **
-- **
-- **Coverage**: 90% of main points

-##

-##
-- **Readability Score**: Based on summary complexity
-- **Coverage Analysis**: Percentage of document topics covered
-- **Compression Ratio**: Original:Summary word ratio
-- **Processing Efficiency**: Time and resource usage stats

-###
-
-- **Content Validation**: Checks for sufficient extractable text
-- **Format Support**: Handles various PDF structures and layouts
-- **Recovery Systems**: Multiple fallback summarization strategies

-

-
-
-- **Detailed Statistics**: Comprehensive document analysis
-- **Copy-friendly Output**: Easy text selection and copying
-- **Mobile Responsive**: Works on all device sizes

-#

-
-
-| 1-5 pages | 3-8s | 8-15s | 15-30s |
-| 6-20 pages | 8-20s | 20-45s | 45-90s |
-| 21-50 pages | 20-60s | 60-120s | 120-240s |

-

-#

-#
-
-
-- Thesis chapter overviews
-- Conference paper digests

-##
-- Report summarization
-- Legal document analysis
-- Technical documentation
-- Meeting minutes processing

-###
-- Book chapter summaries
-- Article condensation
-- Study material preparation
-- Content curation

-

-
-
-
-

 ## 🤝 Contributing

-

-- **
-- **
-- **Model
-- **

 ## License

-This project is

-##

-
-- **Facebook AI**: For the BART model architecture
-- **Gradio Team**: For the excellent web interface framework
-- **PyPDF2**: For reliable PDF text extraction

-

-
-- **Discussions**: [Hugging Face Community](https://huggingface.co/spaces/your-username/pdf-summarizer/discussions)
-- **Documentation**: [Wiki](https://github.com/your-username/pdf-summarizer/wiki)

-
-
-

-

-

 </div>
 colorFrom: blue
 colorTo: purple
 sdk: gradio
+sdk_version: 5.32.0
 app_file: app.py
 pinned: false
 license: mit

 An intelligent PDF document summarizer powered by state-of-the-art transformer models. Upload any PDF and get comprehensive, accurate summaries in seconds with advanced text processing.

+# ⚡ Ultra-Fast AI PDF Summarizer

+A lightning-fast PDF summarization tool powered by AI that can process documents and generate intelligent summaries in seconds. Built with Gradio for an intuitive web interface and optimized for maximum speed without sacrificing quality.

+## Features

+- **⚡ Ultra-Fast Processing**: Optimized for speed with lazy loading and smart chunking
+- **🤖 AI-Powered**: Uses state-of-the-art BART models for intelligent summarization
+- **PDF Support**: Extracts and processes text from PDF documents automatically
+- **🎯 Multiple Summary Types**: Brief, Detailed, and Comprehensive options
+- **Smart Fallbacks**: Automatically switches to extractive summarization for large documents
+- **Document Statistics**: Provides detailed analytics about your documents
+- **🖥️ Web Interface**: Easy-to-use Gradio interface accessible via browser
+- **GPU Acceleration**: Automatic GPU detection and utilization when available

+## 🛠️ Installation

+### Prerequisites

+- Python 3.8 or higher
+- pip package manager

+### Quick Setup

+1. **Clone or download the repository**
+   ```bash
+   git clone <repository-url>
+   cd ultra-fast-pdf-summarizer
+   ```

+2. **Install dependencies**
+   ```bash
+   pip install -r requirements.txt
+   ```

+3. **Run the application**
+   ```bash
+   python app.py
+   ```

+4. **Open your browser** and navigate to the URL shown in the terminal (usually `http://127.0.0.1:7860`)

+## Requirements

+See `requirements.txt` for the complete list of dependencies. Key packages include:

+- **gradio**: Web interface framework
+- **transformers**: Hugging Face transformers for AI models
+- **torch**: PyTorch for deep learning
+- **PyPDF2**: PDF text extraction
+- **nltk**: Natural language processing toolkit

+## Usage

+### Basic Usage
+
+1. **Upload a PDF**: Click "Upload PDF" and select your document
+2. **Choose Summary Type**:
+   - **Brief (Quick)**: Fast, concise summary
+   - **Detailed**: Balanced detail and speed
+   - **Comprehensive**: Most detailed summary
+3. **Generate**: Click "⚡ Generate Summary", or let the upload start processing automatically
+4. **View Results**: See your summary and document statistics
+
+### Command Line Usage
+
+```python
+from your_app import FastPDFSummarizer  # replace your_app with the module that defines the class
+
+# Initialize summarizer
+summarizer = FastPDFSummarizer()
+
+# Process a PDF file
+summary, stats, status = summarizer.process_pdf_fast("document.pdf", "Brief (Quick)")
+print(summary)
 ```

+## ⚡ Speed Optimizations
+
+This tool is specifically optimized for speed:

+### Model Optimizations
+- **Lazy Loading**: Models load only when needed
+- **Lightweight Model**: Uses `distilbart-cnn-6-6` for optimal speed/quality balance
+- **Single Beam Search**: Decodes with a single beam, the fastest generation setting
+- **GPU Acceleration**: Automatic CUDA utilization
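The lazy-loading item above can be illustrated with a small wrapper. This is a minimal sketch, not the app's actual code: `LazyModel` and its `loader` argument are hypothetical names, and the loader stands in for whatever builds the summarization model.

```python
from typing import Any, Callable


class LazyModel:
    """Build an expensive object on first use and reuse it afterwards."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader  # hypothetical factory supplied by the caller
        self._model: Any = None
        self._loaded = False

    def get(self) -> Any:
        # Pay the load cost only on the first call; later calls reuse the cache.
        if not self._loaded:
            self._model = self._loader()
            self._loaded = True
        return self._model
```

With `transformers` installed, the loader could be something like `lambda: pipeline("summarization", model="sshleifer/distilbart-cnn-6-6")`. Whether the app wires it up this way is not shown in the README.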

+### Processing Optimizations
+- **Page Limiting**: Processes at most 20 pages per document
+- **Smart Chunking**: Uses at most 3 chunks to reduce processing time
+- **Extractive Fallback**: Ultra-fast summarization for large documents
+- **Efficient Text Cleaning**: Optimized regex operations
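The README does not say how the extractive fallback works. One common approach, shown here only as an assumption, scores sentences by word frequency and keeps the top few in their original order. `extractive_summary` and its parameters are illustrative names.

```python
import re
from collections import Counter


def extractive_summary(text: str, max_sentences: int = 3) -> str:
    """Return the highest-scoring sentences, kept in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) <= max_sentences:
        return " ".join(sentences)

    # Count longer words only, as a crude stand-in for stop-word removal.
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore original order
    return " ".join(sentences[i] for i in keep)
```

No model is involved, so the cost grows roughly linearly with document length, which is why a fallback of this kind suits large inputs.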

+### Memory Optimizations
+- **Low Memory Usage**: Configured for minimal RAM consumption
+- **Cache Optimization**: Efficient model caching
+- **16-bit Precision**: Uses float16 on GPU for speed

+## Performance

+### Typical Processing Times
+- **Small PDFs** (1-5 pages): 2-5 seconds
+- **Medium PDFs** (5-15 pages): 5-15 seconds
+- **Large PDFs** (15-20 pages): 10-30 seconds

+### Hardware Recommendations
+- **CPU**: Modern multi-core processor
+- **RAM**: 4GB minimum, 8GB+ recommended
+- **GPU**: NVIDIA GPU with CUDA support (optional, for acceleration)
+- **Storage**: 2GB free space for models

+## 🔧 Configuration

+### Model Selection
+You can change the model in the code for different speed/quality trade-offs:

+```python
+# Ultra-fast (lower quality)
+self.model_name = "sshleifer/distilbart-cnn-6-6"

+# Balanced (default)
+self.model_name = "sshleifer/distilbart-cnn-12-6"

+# High quality (slower)
+self.model_name = "facebook/bart-large-cnn"
+```

+### Processing Limits
+Adjust these parameters in the code:

+```python
+# Maximum pages to process
+max_pages = min(20, len(pdf_reader.pages))

+# Maximum chunks for processing
+return chunks[:3]

+# Maximum words per chunk
+max_length: int = 1000
+```
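The chunk limits above are fragments from the app. To show how the two limits interact, here is a self-contained sketch of sentence-aware chunking with a word budget and a chunk cap. `chunk_text` and its parameter names are assumptions, and only the defaults (1000 words, 3 chunks) come from the README.

```python
import re
from typing import List


def chunk_text(text: str, max_words: int = 1000, max_chunks: int = 3) -> List[str]:
    """Group whole sentences into chunks of at most max_words words."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Close the current chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            if len(chunks) == max_chunks:
                return chunks  # cap reached; remaining text is dropped
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks[:max_chunks]
```

Raising either limit sends more text to the model at the cost of speed. In this sketch, a single sentence longer than `max_words` still becomes its own oversized chunk.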

+## Troubleshooting

+### Common Issues

+**1. "No module named 'transformers'"**
+```bash
+pip install transformers torch
+```

+**2. NLTK data not found**
+The app automatically downloads required NLTK data, but if issues persist:
+```python
+import nltk
+nltk.download('punkt')
+```
+
+**3. CUDA out of memory**
+- Reduce batch size or disable GPU:
+```python
+device = "cpu"  # Force CPU usage
+```
+
+**4. PDF text extraction fails**
+- Ensure PDF has extractable text (not just images)
+- Try OCR preprocessing for scanned PDFs
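Issue 4 comes down to whether the PDF carries a text layer. Below is a minimal sketch of such a check, assuming the per-page strings have already been extracted (for example with PyPDF2's `page.extract_text()` over `PdfReader(path).pages`). The function name and the 200-character threshold are illustrative, not taken from the app.

```python
from typing import Iterable, Optional


def has_extractable_text(page_texts: Iterable[Optional[str]], min_chars: int = 200) -> bool:
    """Return True once the pages hold at least min_chars non-whitespace characters."""
    total = 0
    for text in page_texts:
        # Extraction may yield None or whitespace-only strings for image-only pages.
        total += len("".join((text or "").split()))
        if total >= min_chars:
            return True
    return False
```

A `False` result suggests a scanned or image-only document, where OCR preprocessing is the next step.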
+
+### Performance Issues
+
+**Slow processing:**
+- Check if GPU is being utilized
+- Reduce page limit or chunk size
+- Use "Brief (Quick)" mode for fastest results
+
+**Memory errors:**
+- Close other applications
+- Use CPU mode instead of GPU
+- Process smaller documents
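The "use CPU mode" advice can also be applied automatically. This sketch assumes the work is wrapped in a hypothetical callable that takes a device name. It relies on PyTorch raising its CUDA out-of-memory error as a `RuntimeError` subclass whose message contains "out of memory". The app itself may handle this differently.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_cpu_fallback(task: Callable[[str], T], preferred: str = "cuda") -> T:
    """Run task(device); retry once on CPU if the GPU runs out of memory."""
    try:
        return task(preferred)
    except RuntimeError as err:
        # Re-raise anything that is not an out-of-memory failure on the GPU.
        if preferred == "cpu" or "out of memory" not in str(err).lower():
            raise
        return task("cpu")
```

The retry is slower than the GPU path, but it lets a large document finish instead of failing outright.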
+
+## File Format Support
+
+### Supported Formats
+- **PDF**: Primary format with full text extraction
+- **Text Content**: Must be selectable/extractable text
+
+### Limitations
+- **Scanned PDFs**: Requires OCR preprocessing
+- **Image-only PDFs**: No text extraction possible
+- **Password-protected PDFs**: Not supported
+- **Very large files**: >100MB may cause memory issues

 ## 🤝 Contributing

+We welcome contributions! Areas for improvement:

+- **OCR Integration**: Support for scanned PDFs
+- **Additional Formats**: Word documents, web pages, etc.
+- **Model Options**: More model choices in the interface
+- **Language Support**: Multi-language summarization
+- **Export Options**: PDF, Word, markdown export

 ## License

+This project is open source. Please check the license file for details.

+## Support

+If you encounter issues:

+1. **Check the troubleshooting section** above
+2. **Verify requirements** are properly installed
+3. **Check system resources** (RAM, storage)
+4. **Try with different PDF files** to isolate issues

+## 🔮 Future Enhancements

+### Planned Features
+- **Batch Processing**: Multiple PDFs at once
+- **Custom Models**: Upload your own trained models
+- **API Endpoint**: REST API for integration
+- **Cloud Deployment**: One-click cloud deployment
+- **Mobile App**: Dedicated mobile application

+### Performance Improvements
+- **Model Quantization**: Even faster inference
+- **Streaming Processing**: Real-time summarization
+- **Distributed Processing**: Multi-GPU support
+- **Edge Optimization**: Optimized for edge devices

+---

+**Built with ❤️ for fast, intelligent document processing**
 </div>