---
title: AI-Powered PDF Summarizer
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---
# AI-Powered PDF Summarizer
An intelligent PDF summarization tool powered by state-of-the-art Hugging Face transformer models. Upload any PDF document and get a comprehensive, well-structured summary perfect for studying, research, or quick document review.
## Features

### Multiple AI Models
- **BART** (`facebook/bart-large-cnn`): Fast, high-quality summarization for general documents
- **Long-T5** (`google/long-t5-tglobal-base`): Optimized for very long documents and academic papers
### Smart Processing
- Intelligent text chunking with overlap for context preservation
- Progress tracking during summarization
- Handles documents of any length
- GPU acceleration support (when available)
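The chunking-with-overlap step can be sketched as a simple word window; the function and parameter names below are hypothetical and only illustrate the idea, not the app's actual code:

```python
def chunk_text(text, chunk_words=3000, overlap=200):
    """Split text into word-based chunks, repeating `overlap` words
    between consecutive chunks so context carries across boundaries."""
    words = text.split()
    if len(words) <= chunk_words:
        return [text]
    chunks, start = [], 0
    step = chunk_words - overlap  # advance by less than a full chunk
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_words]))
        start += step
    return chunks
```

Each chunk is then summarized independently, so the overlap is what preserves continuity between adjacent summaries.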
### Flexible Output
- Choose between bullet points or paragraph format
- Downloadable markdown files
- Statistics about your document
- Clean, readable formatting
### User-Friendly Interface
- Simple drag-and-drop file upload
- Real-time progress updates
- Advanced settings for fine-tuned control
- Beautiful, responsive design
## Quick Start

### Local Installation
1. Clone or download this repository.
2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Run the application:

   ```bash
   python app.py
   ```

4. Open your browser to `http://localhost:7860`.
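Based on the libraries listed under Technical Details, the `requirements.txt` looks roughly like the following (version pins omitted; treat this as a sketch rather than the project's exact file):

```text
gradio
transformers
torch
pymupdf
langchain
```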
### Hugging Face Spaces Deployment

Deploy this repository as a Docker Space; see the Hugging Face Spaces documentation for step-by-step instructions.
## How to Use
1. **Upload PDF**: Click or drag your PDF file to the upload area
2. **Select Model**: Choose between BART (faster) or Long-T5 (better for long docs)
3. **Choose Style**: Pick bullet points or paragraph format
4. **Adjust Settings** (optional): Fine-tune chunk size and summary length
5. **Generate**: Click the "Generate Summary" button
6. **Download**: Get your summary as a markdown file
## Advanced Settings

### Chunk Size (1000-8000 words)
- Default: 3000 words
- Smaller chunks: Faster processing, may lose some context
- Larger chunks: Better context, slower processing
### Chunk Overlap (0-1000 words)
- Default: 200 words
- Purpose: Maintains context between chunks
- Higher overlap: Better continuity, slightly slower
### Summary Length
- Max Length: 50-500 words per section (default: 150)
- Min Length: 10-100 words per section (default: 30)
- Adjust based on how detailed you want the summary
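These settings correspond to the `max_length`/`min_length` generation arguments that Hugging Face summarization pipelines accept. A small validation helper mirroring the UI ranges above might look like this (the helper itself is hypothetical):

```python
def summary_length_kwargs(max_length=150, min_length=30):
    """Validate UI settings against the documented ranges and return
    keyword arguments for a transformers summarization pipeline."""
    if not 50 <= max_length <= 500:
        raise ValueError("max_length must be between 50 and 500")
    if not 10 <= min_length <= 100:
        raise ValueError("min_length must be between 10 and 100")
    if min_length >= max_length:
        raise ValueError("min_length must be smaller than max_length")
    return {"max_length": max_length, "min_length": min_length}
```

The returned dict would then be applied per chunk, e.g. `summarizer(chunk, **summary_length_kwargs())`.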
## Best Practices

### For Best Results
- Use clear, text-based PDFs (not scanned images)
- For technical documents: Use Long-T5 model
- For general documents: BART works great
- Large files (100+ pages): Increase chunk size to 4000-5000
### Processing Times
- Short documents (1-10 pages): 10-30 seconds
- Medium documents (10-50 pages): 30-120 seconds
- Large documents (50+ pages): 2-5 minutes
## Technical Details

### Models Used

#### BART (`facebook/bart-large-cnn`)
- 406M parameters
- Trained on CNN/DailyMail dataset
- Excellent for news, articles, general documents
- Fast inference time
#### Long-T5 (`google/long-t5-tglobal-base`)
- 250M parameters
- Handles inputs up to 16,384 tokens
- Better for academic papers and long-form content
- Slightly slower but more comprehensive
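In code, the model choice typically reduces to mapping the UI label onto a checkpoint name. The checkpoints are the ones described above; the mapping and loader call are an illustrative sketch, not the app's exact code:

```python
# Map UI labels to the Hugging Face checkpoints described above.
MODEL_CHECKPOINTS = {
    "BART": "facebook/bart-large-cnn",
    "Long-T5": "google/long-t5-tglobal-base",
}

def resolve_checkpoint(choice: str) -> str:
    """Return the checkpoint name for a UI model choice."""
    return MODEL_CHECKPOINTS[choice]

# Typical loading (commented out to avoid downloading model weights here):
# from transformers import pipeline
# summarizer = pipeline("summarization", model=resolve_checkpoint("BART"))
```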
### Technologies
- **Gradio**: Web interface
- **Transformers**: Hugging Face models
- **PyMuPDF** (`fitz`): PDF text extraction
- **LangChain**: Text splitting and chunking
- **PyTorch**: Deep learning backend
## Example Use Cases
- **Students**: Summarize textbooks and research papers
- **Researchers**: Quick overview of academic literature
- **Professionals**: Digest reports and documentation
- **Anyone**: Understand long documents quickly
## Privacy & Security
- Documents are processed in real-time
- No permanent storage of uploaded files
- Processing happens on your selected infrastructure
- Temporary files are automatically cleaned up
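The cleanup guarantee can be implemented with Python's `tempfile` module; a minimal sketch, where the processing step is a stand-in for the real extraction and summarization:

```python
import os
import tempfile

def process_upload(pdf_bytes: bytes) -> int:
    """Write an upload to a temporary file, process it, and remove the
    file afterwards, even if processing raises an exception."""
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(pdf_bytes)
        path = tmp.name
    try:
        # Stand-in for real PDF text extraction and summarization.
        return os.path.getsize(path)
    finally:
        os.remove(path)  # no permanent storage of uploaded files
```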
## Troubleshooting

### PDF Upload Failed
- Ensure PDF is not password-protected
- Check file is not corrupted
- Try re-saving the PDF
### Summary Quality Issues
- Try the Long-T5 model for better quality
- Adjust chunk size based on document type
- Increase max summary length for more detail
### Out of Memory Errors
- Reduce chunk size
- Use CPU instead of GPU (slower but stable)
- Process smaller sections at a time
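The "reduce chunk size" advice can also be automated by retrying with smaller chunks when memory runs out. A sketch, where the `summarize` callable and the word-based splitting are illustrative rather than the app's actual implementation:

```python
def summarize_with_fallback(text, summarize, chunk_words=3000, min_chunk=500):
    """Retry summarization with halved chunk sizes on MemoryError."""
    words = text.split()
    while chunk_words >= min_chunk:
        try:
            chunks = [" ".join(words[i:i + chunk_words])
                      for i in range(0, len(words), chunk_words)]
            return " ".join(summarize(c) for c in chunks)
        except MemoryError:
            chunk_words //= 2  # halve the chunk size and try again
    raise MemoryError("could not summarize even at the smallest chunk size")
```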
## Requirements
- Python 3.8 or higher
- 4GB+ RAM (8GB+ recommended)
- GPU optional (speeds up processing significantly)
## Contributing
Contributions are welcome! Feel free to:
- Report bugs
- Suggest new features
- Improve documentation
- Submit pull requests
## License
This project is open source and available under the MIT License.
## Acknowledgments
- Hugging Face for the amazing transformer models
- Facebook AI for BART
- Google Research for Long-T5
- Gradio team for the excellent UI framework
## Support
For issues or questions:
- Open an issue on GitHub
- Check existing documentation
- Review the troubleshooting section
Made with ❤️ for efficient document summarization.

Happy summarizing! ✨