---
title: AI-Powered PDF Summarizer
emoji: 📚
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---

# 📚 AI-Powered PDF Summarizer

An intelligent PDF summarization tool powered by state-of-the-art Hugging Face transformer models. Upload any PDF document and get a comprehensive, well-structured summary perfect for studying, research, or quick document review.

## 🌟 Features

### 🤖 Multiple AI Models

- **BART (facebook/bart-large-cnn)**: Fast, high-quality summarization for general documents
- **Long-T5 (google/long-t5-tglobal-base)**: Optimized for very long documents and academic papers

### ⚡ Smart Processing

- Intelligent text chunking with overlap for context preservation
- Progress tracking during summarization
- Handles documents of any length
- GPU acceleration support (when available)
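The chunking-with-overlap step can be sketched in plain Python. This is an illustrative stand-in (the app itself delegates splitting to LangChain and this sketch counts words rather than tokens); the 3000/200 defaults mirror the Advanced Settings section below:

```python
from typing import List

def chunk_text(text: str, chunk_size: int = 3000, overlap: int = 200) -> List[str]:
    """Split text into word-based chunks, repeating `overlap` words
    between consecutive chunks so context carries across boundaries."""
    words = text.split()
    if len(words) <= chunk_size:
        return [" ".join(words)] if words else []
    chunks = []
    step = chunk_size - overlap  # advance by chunk_size minus the shared tail
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # final chunk reached the end of the document
    return chunks
```

Each chunk is summarized independently, and the overlapping words keep sentences near a boundary from losing their surrounding context.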

πŸ“ Flexible Output

  • Choose between bullet points or paragraph format
  • Downloadable markdown files
  • Statistics about your document
  • Clean, readable formatting

### 🎨 User-Friendly Interface

- Simple drag-and-drop file upload
- Real-time progress updates
- Advanced settings for fine-tuned control
- Beautiful, responsive design

## 🚀 Quick Start

### Local Installation

1. Clone or download this repository.

2. Install dependencies:

   `pip install -r requirements.txt`

3. Run the application:

   `python app.py`

4. Open your browser to http://localhost:7860.

### Hugging Face Spaces Deployment

See the detailed deployment guide below for step-by-step instructions.

## 📖 How to Use

1. **Upload PDF**: Click or drag your PDF file to the upload area
2. **Select Model**: Choose between BART (faster) or Long-T5 (better for long documents)
3. **Choose Style**: Pick bullet points or paragraph format
4. **Adjust Settings** (optional): Fine-tune chunk size and summary length
5. **Generate**: Click the "Generate Summary" button
6. **Download**: Get your summary as a markdown file

βš™οΈ Advanced Settings

Chunk Size (1000-8000 words)

  • Default: 3000 words
  • Smaller chunks: Faster processing, may lose some context
  • Larger chunks: Better context, slower processing

### Chunk Overlap (0-1000 words)

- **Default**: 200 words
- **Purpose**: Maintains context between chunks
- **Higher overlap**: Better continuity, slightly slower

### Summary Length

- **Max Length**: 50-500 words per section (default: 150)
- **Min Length**: 10-100 words per section (default: 30)
- Adjust both based on how detailed you want the summary to be
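The documented ranges and defaults above can be made concrete with a small helper. This is an illustrative sketch only (the function name and dictionary shape are assumptions, not the app's actual API):

```python
from typing import Optional

# Ranges and defaults from the Advanced Settings section: (min, max, default).
SETTING_RANGES = {
    "chunk_size": (1000, 8000, 3000),        # words per chunk
    "chunk_overlap": (0, 1000, 200),          # words shared between chunks
    "max_summary_length": (50, 500, 150),     # words per summarized section
    "min_summary_length": (10, 100, 30),
}

def resolve_setting(name: str, value: Optional[int] = None) -> int:
    """Return the default when no value is given; otherwise clamp
    the user's value into the documented range."""
    lo, hi, default = SETTING_RANGES[name]
    if value is None:
        return default
    return max(lo, min(hi, value))
```

Clamping out-of-range values (rather than rejecting them) keeps the interface forgiving when settings are adjusted by hand.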

## 🎯 Best Practices

### For Best Results

- Use clear, text-based PDFs (not scanned images)
- For technical documents, use the Long-T5 model
- For general documents, BART works well
- For large files (100+ pages), increase the chunk size to 4000-5000 words

### Processing Times

- Short documents (1-10 pages): 10-30 seconds
- Medium documents (10-50 pages): 30-120 seconds
- Large documents (50+ pages): 2-5 minutes

πŸ› οΈ Technical Details

Models Used

BART (facebook/bart-large-cnn)

  • 406M parameters
  • Trained on CNN/DailyMail dataset
  • Excellent for news, articles, general documents
  • Fast inference time

#### Long-T5 (google/long-t5-tglobal-base)

- 250M parameters
- Handles inputs up to 16,384 tokens
- Better for academic papers and long-form content
- Slightly slower but more comprehensive

### Technologies

- **Gradio**: Web interface
- **Transformers**: Hugging Face model inference
- **PyMuPDF (fitz)**: PDF text extraction
- **LangChain**: Text splitting and chunking
- **PyTorch**: Deep learning backend
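The stack above maps to a small dependency set; a minimal `requirements.txt` might look like the following (package names only — the project's actual version pins are not shown here):

```text
gradio
transformers
torch
pymupdf
langchain
```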

## 📊 Example Use Cases

- **Students**: Summarize textbooks and research papers
- **Researchers**: Get a quick overview of academic literature
- **Professionals**: Digest reports and documentation
- **Anyone**: Understand long documents quickly

## 🔒 Privacy & Security

- Documents are processed in real time
- No permanent storage of uploaded files
- Processing happens on your selected infrastructure
- Temporary files are automatically cleaned up
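The temporary-file cleanup above can be sketched with Python's standard `tempfile` module. The function name and the placeholder "processing" step are illustrative, not the app's actual code:

```python
import os
import tempfile

def process_upload(pdf_bytes: bytes) -> int:
    """Write an upload to a temp file, process it, and always clean up."""
    # delete=False lets the file be reopened by path on all platforms
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        tmp.write(pdf_bytes)
        tmp.close()
        # Placeholder processing step: here, just report the file size.
        return os.path.getsize(tmp.name)
    finally:
        os.unlink(tmp.name)  # temp file is removed even if processing fails
```

The `try`/`finally` pattern guarantees the uploaded document never lingers on disk, matching the no-permanent-storage claim above.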

πŸ› Troubleshooting

PDF Upload Failed

  • Ensure PDF is not password-protected
  • Check file is not corrupted
  • Try re-saving the PDF

### Summary Quality Issues

- Try the Long-T5 model for better quality
- Adjust the chunk size to match the document type
- Increase the max summary length for more detail

### Out of Memory Errors

- Reduce the chunk size
- Use CPU instead of GPU (slower but more stable)
- Process smaller sections at a time

πŸ“ Requirements

  • Python 3.8 or higher
  • 4GB+ RAM (8GB+ recommended)
  • GPU optional (speeds up processing significantly)

## 🤝 Contributing

Contributions are welcome! Feel free to:

- Report bugs
- Suggest new features
- Improve documentation
- Submit pull requests

## 📄 License

This project is open source and available under the MIT License.

πŸ™ Acknowledgments

  • Hugging Face for the amazing transformer models
  • Facebook AI for BART
  • Google Research for Long-T5
  • Gradio team for the excellent UI framework

## 📧 Support

For issues or questions:

- Open an issue on GitHub
- Check the existing documentation
- Review the troubleshooting section above

Made with ❤️ for efficient document summarization

Happy summarizing! 📚✨