| title: AI-Powered PDF Summarizer | |
| emoji: 📄 | |
| colorFrom: blue | |
| colorTo: purple | |
| sdk: docker | |
| app_port: 7860 | |
| pinned: false | |
| # AI-Powered PDF Summarizer | |
| An intelligent PDF summarization tool powered by state-of-the-art Hugging Face transformer models. Upload any PDF document and get a comprehensive, well-structured summary perfect for studying, research, or quick document review. | |
| ## Features | |
| ### Multiple AI Models | |
| - **BART (facebook/bart-large-cnn)**: Fast, high-quality summarization for general documents | |
| - **Long-T5 (google/long-t5-tglobal-base)**: Optimized for very long documents and academic papers | |
| ### Smart Processing | |
| - Intelligent text chunking with overlap for context preservation | |
| - Progress tracking during summarization | |
| - Handles long documents by splitting them into chunks and summarizing each chunk | |
| - GPU acceleration support (when available) | |
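The chunking-with-overlap idea above can be sketched in plain Python. This is a minimal illustration, not the app's actual splitter (the README lists LangChain for that job); the function name `chunk_words` and the word-based counting are assumptions made for the example.

```python
def chunk_words(text, chunk_size=3000, overlap=200):
    """Split text into word-based chunks; consecutive chunks share `overlap` words.

    Hypothetical helper for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


if __name__ == "__main__":
    sample = " ".join(str(i) for i in range(10))
    for chunk in chunk_words(sample, chunk_size=4, overlap=1):
        print(chunk)
```

Each chunk repeats the last `overlap` words of the previous one, so a sentence cut at a chunk boundary still appears whole in one of the two chunks.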
| ### Flexible Output | |
| - Choose between bullet points or paragraph format | |
| - Downloadable markdown files | |
| - Statistics about your document | |
| - Clean, readable formatting | |
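As a rough sketch of the two output styles, the per-section summaries could be joined into markdown as shown here. The function name `format_summary`, its parameters, and the heading layout are hypothetical and not taken from the app's code.

```python
def format_summary(section_summaries, style="bullets", title="Summary"):
    """Join per-section summaries into a markdown document.

    Hypothetical helper: `style` is either "bullets" or "paragraph".
    """
    # Drop empty entries and trim stray whitespace.
    cleaned = [s.strip() for s in section_summaries if s and s.strip()]
    if style == "bullets":
        body = "\n".join(f"- {s}" for s in cleaned)
    elif style == "paragraph":
        body = " ".join(cleaned)
    else:
        raise ValueError(f"unknown style: {style!r}")
    return f"# {title}\n\n{body}\n"


if __name__ == "__main__":
    print(format_summary(["First point.", "Second point."], style="bullets"))
```

The returned string can be written to a `.md` file to produce the downloadable summary.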
| ### User-Friendly Interface | |
| - Simple drag-and-drop file upload | |
| - Real-time progress updates | |
| - Advanced settings for fine-tuned control | |
| - Beautiful, responsive design | |
| ## Quick Start | |
| ### Local Installation | |
| 1. Clone or download this repository | |
| 2. Install dependencies: | |
| ```bash | |
| pip install -r requirements.txt | |
| ``` | |
| 3. Run the application: | |
| ```bash | |
| python app.py | |
| ``` | |
| 4. Open your browser to `http://localhost:7860` | |
| ### Hugging Face Spaces Deployment | |
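| This project deploys as a Docker Space: the metadata at the top of this file sets `sdk: docker` and `app_port: 7860`. See the deployment guide for step-by-step instructions. | |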
| ## How to Use | |
| 1. **Upload PDF**: Click or drag your PDF file to the upload area | |
| 2. **Select Model**: Choose between BART (faster) or Long-T5 (better for long docs) | |
| 3. **Choose Style**: Pick bullet points or paragraph format | |
| 4. **Adjust Settings** (optional): Fine-tune chunk size and summary length | |
| 5. **Generate**: Click the "Generate Summary" button | |
| 6. **Download**: Get your summary as a markdown file | |
| ## Advanced Settings | |
| ### Chunk Size (1000-8000 words) | |
| - **Default**: 3000 words | |
| - **Smaller chunks**: Faster processing, may lose some context | |
| - **Larger chunks**: Better context, slower processing | |
| ### Chunk Overlap (0-1000 words) | |
| - **Default**: 200 words | |
| - **Purpose**: Maintains context between chunks | |
| - **Higher overlap**: Better continuity, slightly slower | |
| ### Summary Length | |
| - **Max Length**: 50-500 words per section (default: 150) | |
| - **Min Length**: 10-100 words per section (default: 30) | |
| - Adjust based on how detailed you want the summary | |
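The ranges and defaults listed above can be captured in a small settings object that rejects out-of-range values. This is a sketch only: the class name `SummarySettings` and the extra consistency checks are assumptions, and the app may validate its inputs differently (for example through slider limits in the UI).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SummarySettings:
    """Hypothetical container for the advanced settings and their documented ranges."""
    chunk_size: int = 3000     # 1000-8000 words
    chunk_overlap: int = 200   # 0-1000 words
    max_length: int = 150      # 50-500 per section
    min_length: int = 30       # 10-100 per section

    def __post_init__(self):
        checks = [
            ("chunk_size", self.chunk_size, 1000, 8000),
            ("chunk_overlap", self.chunk_overlap, 0, 1000),
            ("max_length", self.max_length, 50, 500),
            ("min_length", self.min_length, 10, 100),
        ]
        for name, value, low, high in checks:
            if not low <= value <= high:
                raise ValueError(f"{name} must be between {low} and {high}, got {value}")
        # Assumed extra checks: the ranges overlap, so these combinations are possible.
        if self.min_length >= self.max_length:
            raise ValueError("min_length must be smaller than max_length")
        if self.chunk_overlap >= self.chunk_size:
            raise ValueError("chunk_overlap must be smaller than chunk_size")


if __name__ == "__main__":
    print(SummarySettings())
    print(SummarySettings(chunk_size=4500, max_length=250))
```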
| ## Best Practices | |
| ### For Best Results: | |
| - Use clear, text-based PDFs (not scanned images) | |
| - For technical documents: use the Long-T5 model | |
| - For general documents: use the BART model | |
| - Large files (100+ pages): increase the chunk size to 4000-5000 words | |
| ### Processing Times: | |
| - Short documents (1-10 pages): 10-30 seconds | |
| - Medium documents (10-50 pages): 30-120 seconds | |
| - Large documents (50+ pages): 2-5 minutes | |
| ## Technical Details | |
| ### Models Used | |
| **BART (facebook/bart-large-cnn)** | |
| - 406M parameters | |
| - Trained on CNN/DailyMail dataset | |
| - Excellent for news, articles, general documents | |
| - Fast inference time | |
| **Long-T5 (google/long-t5-tglobal-base)** | |
| - 250M parameters | |
| - Handles inputs up to 16,384 tokens | |
| - Better for academic papers and long-form content | |
| - Slightly slower but more comprehensive | |
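For orientation, either checkpoint can be loaded through the Transformers `pipeline` API roughly as follows. This is a sketch under assumptions: the names `MODEL_IDS`, `build_summarizer`, and `summarize_chunk` are hypothetical, and the app's own loading code may differ. Note that `max_length` and `min_length` in Transformers generation are counted in tokens, so the word-based settings described earlier would need converting if the app passes them through directly.

```python
# Model identifiers as listed in this README.
MODEL_IDS = {
    "BART": "facebook/bart-large-cnn",
    "Long-T5": "google/long-t5-tglobal-base",
}


def build_summarizer(model_key, use_gpu=False):
    """Create a summarization pipeline for one of the two models (hypothetical helper)."""
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import pipeline

    device = 0 if use_gpu else -1  # 0 = first CUDA device, -1 = CPU
    return pipeline("summarization", model=MODEL_IDS[model_key], device=device)


def summarize_chunk(summarizer, text, max_length=150, min_length=30):
    """Summarize one chunk; input beyond the model's limit is truncated."""
    result = summarizer(
        text,
        max_length=max_length,
        min_length=min_length,
        truncation=True,
    )
    return result[0]["summary_text"]


if __name__ == "__main__":
    # Downloads the model weights on first use.
    bart = build_summarizer("BART")
    print(summarize_chunk(bart, "Replace this string with a chunk of document text."))
```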
| ### Technologies | |
| - **Gradio**: Web interface | |
| - **Transformers**: Hugging Face models | |
| - **PyMuPDF (fitz)**: PDF text extraction | |
| - **LangChain**: Text splitting and chunking | |
| - **PyTorch**: Deep learning backend | |
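As an illustration of the extraction step, PyMuPDF can read the text of each page and a small helper can tidy the whitespace left by PDF layout. The function names `clean_text` and `extract_pdf_text` are hypothetical, and the app may post-process the text differently.

```python
import re


def clean_text(raw):
    """Collapse runs of whitespace (line breaks, tabs) into single spaces."""
    return re.sub(r"\s+", " ", raw).strip()


def extract_pdf_text(path):
    """Return the text of every page in a PDF as one cleaned string."""
    # Imported lazily; PyMuPDF is installed as the `pymupdf` package
    # and imported under the name `fitz`.
    import fitz

    with fitz.open(path) as doc:
        pages = [page.get_text() for page in doc]
    return clean_text("\n".join(pages))


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        print(extract_pdf_text(sys.argv[1])[:500])
```

Scanned PDFs contain images rather than text, so this kind of extraction returns little or nothing for them.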
| ## Example Use Cases | |
| - **Students**: Summarize textbooks and research papers | |
| - **Researchers**: Quick overview of academic literature | |
| - **Professionals**: Digest reports and documentation | |
| - **Anyone**: Understand long documents quickly | |
| ## Privacy & Security | |
| - Documents are processed in real-time | |
| - No permanent storage of uploaded files | |
| - Processing happens on your selected infrastructure | |
| - Temporary files are automatically cleaned up | |
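One common way to get automatic cleanup of temporary files is a context manager that removes its working directory on exit, even if processing fails. The sketch below uses only the standard library; `temporary_copy` is a hypothetical name and not necessarily how the app handles uploads.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_copy(upload_path):
    """Copy an uploaded file into a private temp directory and delete it afterwards."""
    tmp_dir = tempfile.mkdtemp(prefix="pdf_summarizer_")
    try:
        tmp_path = os.path.join(tmp_dir, os.path.basename(upload_path))
        shutil.copyfile(upload_path, tmp_path)
        yield tmp_path
    finally:
        # Runs on success and on error, so no copy outlives the request.
        shutil.rmtree(tmp_dir, ignore_errors=True)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as src:
        src.write(b"%PDF-1.4 example")
    with temporary_copy(src.name) as working_copy:
        print("processing", working_copy)
    print("still exists after exit:", os.path.exists(working_copy))
    os.remove(src.name)
```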
| ## Troubleshooting | |
| ### PDF Upload Failed | |
| - Ensure PDF is not password-protected | |
| - Check file is not corrupted | |
| - Try re-saving the PDF | |
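The checks above can be partly automated before uploading. The sketch below assumes a well-formed PDF starts with the `%PDF-` header, which is a strict test that a few unusual but readable files may fail, and it uses PyMuPDF's `needs_pass` property to detect password protection. Both function names are hypothetical.

```python
def looks_like_pdf(path):
    """Return True if the file begins with the standard PDF header bytes."""
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"


def is_password_protected(path):
    """Return True if PyMuPDF reports that the document needs a password."""
    # Imported lazily so the header check works without PyMuPDF installed.
    import fitz

    with fitz.open(path) as doc:
        return bool(doc.needs_pass)


if __name__ == "__main__":
    import sys

    for candidate in sys.argv[1:]:
        print(candidate, "header ok:", looks_like_pdf(candidate))
```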
| ### Summary Quality Issues | |
| - Try the Long-T5 model for better quality | |
| - Adjust chunk size based on document type | |
| - Increase max summary length for more detail | |
| ### Out of Memory Errors | |
| - Reduce chunk size | |
| - Use CPU instead of GPU (slower but stable) | |
| - Process smaller sections at a time | |
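The "reduce chunk size" advice can also be applied automatically by retrying with smaller chunks when a memory error occurs. This is a sketch under assumptions: `summarize_with_backoff` and its parameters are hypothetical, `summarize_fn` stands for whatever callable summarizes one chunk, and the check relies on PyTorch reporting CUDA memory exhaustion as a `RuntimeError` whose message contains "out of memory".

```python
def summarize_with_backoff(text, summarize_fn, chunk_size=3000, min_chunk_size=500):
    """Summarize `text` chunk by chunk, halving the chunk size on memory errors."""
    words = text.split()
    size = chunk_size
    while True:
        try:
            pieces = [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
            return [summarize_fn(piece) for piece in pieces]
        except (MemoryError, RuntimeError) as exc:
            # Only retry on memory exhaustion; re-raise any other runtime error.
            if isinstance(exc, RuntimeError) and "out of memory" not in str(exc).lower():
                raise
            if size // 2 < min_chunk_size:
                raise  # chunks are already as small as allowed
            size //= 2


if __name__ == "__main__":
    def fake_summarize(piece):
        # Stand-in for a real model call: fails on chunks above 4 words.
        if len(piece.split()) > 4:
            raise RuntimeError("CUDA out of memory (simulated)")
        return piece.split()[0]

    text = " ".join(f"w{i}" for i in range(16))
    print(summarize_with_backoff(text, fake_summarize, chunk_size=16, min_chunk_size=2))
```

For brevity the sketch leaves out chunk overlap and restarts from the first chunk after each failure.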
| ## Requirements | |
| - Python 3.8 or higher | |
| - 4GB+ RAM (8GB+ recommended) | |
| - GPU optional (speeds up processing significantly) | |
| ## Contributing | |
| Contributions are welcome! Feel free to: | |
| - Report bugs | |
| - Suggest new features | |
| - Improve documentation | |
| - Submit pull requests | |
| ## License | |
| This project is open source and available under the MIT License. | |
| ## Acknowledgments | |
| - Hugging Face for the amazing transformer models | |
| - Facebook AI for BART | |
| - Google Research for Long-T5 | |
| - Gradio team for the excellent UI framework | |
| ## Support | |
| For issues or questions: | |
| - Open an issue on GitHub | |
| - Check existing documentation | |
| - Review the troubleshooting section | |
| --- | |
| **Made with ❤️ for efficient document summarization** | |
| Happy summarizing! | |