---
title: AI-Powered PDF Summarizer
emoji: 📚
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---

# 📚 AI-Powered PDF Summarizer

An intelligent PDF summarization tool powered by state-of-the-art Hugging Face transformer models. Upload any PDF document and get a comprehensive, well-structured summary, perfect for studying, research, or quick document review.

## 🌟 Features

### 🤖 Multiple AI Models

- **BART (facebook/bart-large-cnn)**: Fast, high-quality summarization for general documents
- **Long-T5 (google/long-t5-tglobal-base)**: Optimized for very long documents and academic papers

### ⚡ Smart Processing

- Intelligent text chunking with overlap for context preservation
- Progress tracking during summarization
- Handles documents of any length
- GPU acceleration support (when available)

### 📝 Flexible Output

- Choose between bullet-point or paragraph format
- Downloadable Markdown files
- Statistics about your document
- Clean, readable formatting

### 🎨 User-Friendly Interface

- Simple drag-and-drop file upload
- Real-time progress updates
- Advanced settings for fine-tuned control
- Beautiful, responsive design

## 🚀 Quick Start

### Local Installation

1. Clone or download this repository
2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
3. Run the application:
   ```bash
   python app.py
   ```
4. Open your browser to `http://localhost:7860`

### Hugging Face Spaces Deployment

See the detailed deployment guide below for step-by-step instructions.

## 📖 How to Use

1. **Upload PDF**: Click or drag your PDF file to the upload area
2. **Select Model**: Choose between BART (faster) or Long-T5 (better for long documents)
3. **Choose Style**: Pick bullet points or paragraph format
4. **Adjust Settings** (optional): Fine-tune chunk size and summary length
5. **Generate**: Click the "Generate Summary" button
6. **Download**: Get your summary as a Markdown file
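The chunk-size and overlap settings in step 4 correspond to a simple overlapping word-window split. A minimal sketch (the function name is illustrative, not the app's actual API):

```python
def chunk_words(text: str, chunk_size: int = 3000, overlap: int = 200):
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    words = text.split()
    step = chunk_size - overlap  # advance by chunk_size minus the shared tail
    # Stop once the remaining words are already covered by the previous chunk.
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]
```

With the defaults (3000-word chunks, 200-word overlap), a 10,000-word document yields four chunks, each repeating the last 200 words of the previous one so the model keeps some surrounding context.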
## ⚙️ Advanced Settings

### Chunk Size (1000-8000 words)

- **Default**: 3000 words
- **Smaller chunks**: Faster processing, but may lose some context
- **Larger chunks**: Better context, slower processing

### Chunk Overlap (0-1000 words)

- **Default**: 200 words
- **Purpose**: Maintains context between chunks
- **Higher overlap**: Better continuity, slightly slower

### Summary Length

- **Max Length**: 50-500 words per section (default: 150)
- **Min Length**: 10-100 words per section (default: 30)
- Adjust based on how detailed you want the summary

## 🎯 Best Practices

### For Best Results

- Use clear, text-based PDFs (not scanned images)
- For technical documents, use the Long-T5 model
- For general documents, BART works great
- For large files (100+ pages), increase the chunk size to 4000-5000 words

### Processing Times

- Short documents (1-10 pages): 10-30 seconds
- Medium documents (10-50 pages): 30-120 seconds
- Large documents (50+ pages): 2-5 minutes

## 🛠️ Technical Details

### Models Used

**BART (facebook/bart-large-cnn)**
- 406M parameters
- Trained on the CNN/DailyMail dataset
- Excellent for news, articles, and general documents
- Fast inference time

**Long-T5 (google/long-t5-tglobal-base)**
- 250M parameters
- Handles inputs up to 16,384 tokens
- Better for academic papers and long-form content
- Slightly slower but more comprehensive

### Technologies

- **Gradio**: Web interface
- **Transformers**: Hugging Face models
- **PyMuPDF (fitz)**: PDF text extraction
- **LangChain**: Text splitting and chunking
- **PyTorch**: Deep learning backend

## 📊 Example Use Cases

- **Students**: Summarize textbooks and research papers
- **Researchers**: Get a quick overview of academic literature
- **Professionals**: Digest reports and documentation
- **Anyone**: Understand long documents quickly

## 🔒 Privacy & Security

- Documents are processed in real time
- No permanent storage of uploaded files
- Processing happens on your selected infrastructure
- Temporary files are automatically cleaned up
## 🐛 Troubleshooting

### PDF Upload Failed

- Ensure the PDF is not password-protected
- Check that the file is not corrupted
- Try re-saving the PDF

### Summary Quality Issues

- Try the Long-T5 model for better quality
- Adjust the chunk size based on document type
- Increase the max summary length for more detail

### Out of Memory Errors

- Reduce the chunk size
- Use CPU instead of GPU (slower but stable)
- Process smaller sections at a time

## 📝 Requirements

- Python 3.8 or higher
- 4GB+ RAM (8GB+ recommended)
- GPU optional (speeds up processing significantly)

## 🤝 Contributing

Contributions are welcome! Feel free to:

- Report bugs
- Suggest new features
- Improve documentation
- Submit pull requests

## 📄 License

This project is open source and available under the MIT License.

## 🙏 Acknowledgments

- Hugging Face for the amazing transformer models
- Facebook AI for BART
- Google Research for Long-T5
- The Gradio team for the excellent UI framework

## 📧 Support

For issues or questions:

- Open an issue on GitHub
- Check the existing documentation
- Review the troubleshooting section

---

**Made with ❤️ for efficient document summarization**

Happy summarizing! 📚✨