---
title: AI-Powered PDF Summarizer
emoji: 📚
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---

# 📚 AI-Powered PDF Summarizer

An intelligent PDF summarization tool powered by state-of-the-art Hugging Face transformer models. Upload any PDF document and get a comprehensive, well-structured summary perfect for studying, research, or quick document review.

## 🌟 Features

### 🤖 Multiple AI Models

- **BART (facebook/bart-large-cnn)**: Fast, high-quality summarization for general documents
- **Long-T5 (google/long-t5-tglobal-base)**: Optimized for very long documents and academic papers

### ⚡ Smart Processing

- Intelligent text chunking with overlap for context preservation
- Progress tracking during summarization
- Handles documents of any length
- GPU acceleration support (when available)
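The chunking-with-overlap step can be sketched in plain Python. This is an illustrative stand-in (the app itself delegates splitting to LangChain and this sketch counts words rather than tokens); the 3000/200 defaults mirror the Advanced Settings section below:

```python
from typing import List

def chunk_text(text: str, chunk_size: int = 3000, overlap: int = 200) -> List[str]:
    """Split text into word-based chunks, repeating `overlap` words
    between consecutive chunks so context carries across boundaries."""
    words = text.split()
    if len(words) <= chunk_size:
        return [" ".join(words)] if words else []
    chunks = []
    step = chunk_size - overlap  # advance by chunk_size minus the shared tail
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # final chunk reached the end of the document
    return chunks
```

Each chunk is summarized independently, and the overlapping words keep sentences near a boundary from losing their surrounding context.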

πŸ“ Flexible Output

  • Choose between bullet points or paragraph format
  • Downloadable markdown files
  • Statistics about your document
  • Clean, readable formatting

### 🎨 User-Friendly Interface

- Simple drag-and-drop file upload
- Real-time progress updates
- Advanced settings for fine-tuned control
- Beautiful, responsive design

## 🚀 Quick Start

### Local Installation

1. Clone or download this repository.

2. Install dependencies:

   `pip install -r requirements.txt`

3. Run the application:

   `python app.py`

4. Open your browser to http://localhost:7860.

### Hugging Face Spaces Deployment

See the detailed deployment guide below for step-by-step instructions.

## 📖 How to Use

1. **Upload PDF**: Click or drag your PDF file to the upload area
2. **Select Model**: Choose between BART (faster) or Long-T5 (better for long documents)
3. **Choose Style**: Pick bullet points or paragraph format
4. **Adjust Settings** (optional): Fine-tune chunk size and summary length
5. **Generate**: Click the "Generate Summary" button
6. **Download**: Get your summary as a markdown file

βš™οΈ Advanced Settings

Chunk Size (1000-8000 words)

  • Default: 3000 words
  • Smaller chunks: Faster processing, may lose some context
  • Larger chunks: Better context, slower processing

### Chunk Overlap (0-1000 words)

- **Default**: 200 words
- **Purpose**: Maintains context between chunks
- **Higher overlap**: Better continuity, slightly slower

### Summary Length

- **Max Length**: 50-500 words per section (default: 150)
- **Min Length**: 10-100 words per section (default: 30)
- Adjust both based on how detailed you want the summary to be
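The documented ranges and defaults above can be made concrete with a small helper. This is an illustrative sketch only (the function name and dictionary shape are assumptions, not the app's actual API):

```python
from typing import Optional

# Ranges and defaults from the Advanced Settings section: (min, max, default).
SETTING_RANGES = {
    "chunk_size": (1000, 8000, 3000),        # words per chunk
    "chunk_overlap": (0, 1000, 200),          # words shared between chunks
    "max_summary_length": (50, 500, 150),     # words per summarized section
    "min_summary_length": (10, 100, 30),
}

def resolve_setting(name: str, value: Optional[int] = None) -> int:
    """Return the default when no value is given; otherwise clamp
    the user's value into the documented range."""
    lo, hi, default = SETTING_RANGES[name]
    if value is None:
        return default
    return max(lo, min(hi, value))
```

Clamping out-of-range values (rather than rejecting them) keeps the interface forgiving when settings are adjusted by hand.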

## 🎯 Best Practices

### For Best Results

- Use clear, text-based PDFs (not scanned images)
- For technical documents, use the Long-T5 model
- For general documents, BART works well
- For large files (100+ pages), increase the chunk size to 4000-5000 words

### Processing Times

- Short documents (1-10 pages): 10-30 seconds
- Medium documents (10-50 pages): 30-120 seconds
- Large documents (50+ pages): 2-5 minutes

πŸ› οΈ Technical Details

Models Used

BART (facebook/bart-large-cnn)

  • 406M parameters
  • Trained on CNN/DailyMail dataset
  • Excellent for news, articles, general documents
  • Fast inference time

#### Long-T5 (google/long-t5-tglobal-base)

- 250M parameters
- Handles inputs up to 16,384 tokens
- Better for academic papers and long-form content
- Slightly slower but more comprehensive

### Technologies

- **Gradio**: Web interface
- **Transformers**: Hugging Face model inference
- **PyMuPDF (fitz)**: PDF text extraction
- **LangChain**: Text splitting and chunking
- **PyTorch**: Deep learning backend
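The stack above maps to a small dependency set; a minimal `requirements.txt` might look like the following (package names only — the project's actual version pins are not shown here):

```text
gradio
transformers
torch
pymupdf
langchain
```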

## 📊 Example Use Cases

- **Students**: Summarize textbooks and research papers
- **Researchers**: Get a quick overview of academic literature
- **Professionals**: Digest reports and documentation
- **Anyone**: Understand long documents quickly

## 🔒 Privacy & Security

- Documents are processed in real time
- No permanent storage of uploaded files
- Processing happens on your selected infrastructure
- Temporary files are automatically cleaned up
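The temporary-file cleanup above can be sketched with Python's standard `tempfile` module. The function name and the placeholder "processing" step are illustrative, not the app's actual code:

```python
import os
import tempfile

def process_upload(pdf_bytes: bytes) -> int:
    """Write an upload to a temp file, process it, and always clean up."""
    # delete=False lets the file be reopened by path on all platforms
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        tmp.write(pdf_bytes)
        tmp.close()
        # Placeholder processing step: here, just report the file size.
        return os.path.getsize(tmp.name)
    finally:
        os.unlink(tmp.name)  # temp file is removed even if processing fails
```

The `try`/`finally` pattern guarantees the uploaded document never lingers on disk, matching the no-permanent-storage claim above.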

πŸ› Troubleshooting

PDF Upload Failed

  • Ensure PDF is not password-protected
  • Check file is not corrupted
  • Try re-saving the PDF

### Summary Quality Issues

- Try the Long-T5 model for better quality
- Adjust the chunk size to match the document type
- Increase the max summary length for more detail

### Out of Memory Errors

- Reduce the chunk size
- Use CPU instead of GPU (slower but more stable)
- Process smaller sections at a time

πŸ“ Requirements

  • Python 3.8 or higher
  • 4GB+ RAM (8GB+ recommended)
  • GPU optional (speeds up processing significantly)

## 🤝 Contributing

Contributions are welcome! Feel free to:

- Report bugs
- Suggest new features
- Improve documentation
- Submit pull requests

## 📄 License

This project is open source and available under the MIT License.

πŸ™ Acknowledgments

  • Hugging Face for the amazing transformer models
  • Facebook AI for BART
  • Google Research for Long-T5
  • Gradio team for the excellent UI framework

## 📧 Support

For issues or questions:

- Open an issue on GitHub
- Check the existing documentation
- Review the troubleshooting section above

Made with ❤️ for efficient document summarization

Happy summarizing! 📚✨