---
title: AI-Powered PDF Summarizer
emoji: 📚
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---

# 📚 AI-Powered PDF Summarizer

A PDF summarization tool built on Hugging Face transformer models. Upload a text-based PDF and get a structured summary suited to studying, research, or quick document review.

## 🌟 Features

### 🤖 Multiple AI Models
- **BART (facebook/bart-large-cnn)**: Fast, high-quality summarization for general documents
- **Long-T5 (google/long-t5-tglobal-base)**: Optimized for very long documents and academic papers

### ⚡ Smart Processing
- Intelligent text chunking with overlap for context preservation
- Progress tracking during summarization
- Handles long documents by summarizing them chunk by chunk
- GPU acceleration support (when available)
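
As a rough illustration of chunk-by-chunk processing with progress tracking, here is a minimal sketch. It is not taken from `app.py`: the names `summarize_chunks`, `on_progress`, and `first_sentence` are hypothetical, and the stand-in summarizer only keeps the first sentence of each chunk so the example runs without any model.

```python
from typing import Callable, List


def summarize_chunks(
    chunks: List[str],
    summarize_fn: Callable[[str], str],
    on_progress: Callable[[int, int], None] = lambda done, total: None,
) -> List[str]:
    """Summarize each chunk in order and report progress after each one."""
    total = len(chunks)
    summaries = []
    for index, chunk in enumerate(chunks, start=1):
        summaries.append(summarize_fn(chunk))
        on_progress(index, total)  # e.g. update a progress bar in the UI
    return summaries


def first_sentence(text: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return text.split(".")[0].strip() + "."


progress_log = []
section_summaries = summarize_chunks(
    ["Cats purr. They also nap.", "Dogs bark. They fetch."],
    first_sentence,
    on_progress=lambda done, total: progress_log.append(f"{done}/{total}"),
)
print(section_summaries)
print(progress_log)
```

Passing the progress callback in as a parameter keeps the summarization loop independent of whichever UI framework displays the updates.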

### 📝 Flexible Output
- Choose between bullet points or paragraph format
- Downloadable markdown files
- Statistics about your document
- Clean, readable formatting
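
The two output styles could be produced along these lines. This is a sketch under assumptions, not the app's actual formatter: `format_summary` and `to_markdown` are hypothetical names, and the single word-count statistic is only an example of what the statistics block might contain.

```python
from typing import List


def format_summary(section_summaries: List[str], style: str = "bullets") -> str:
    """Render section summaries as a markdown bullet list or as one paragraph."""
    cleaned = [s.strip() for s in section_summaries if s.strip()]
    if style == "bullets":
        return "\n".join(f"- {s}" for s in cleaned)
    if style == "paragraph":
        return " ".join(cleaned)
    raise ValueError(f"unknown style: {style!r}")


def to_markdown(title: str, body: str, word_count: int) -> str:
    """Wrap the formatted summary in a small markdown document."""
    return f"# Summary: {title}\n\n{body}\n\n---\nSource length: {word_count} words\n"


bullets = format_summary(["First point.", " Second point. "])
paragraph = format_summary(["First point.", "Second point."], style="paragraph")
print(to_markdown("example.pdf", bullets, word_count=1234))
```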

### 🎨 User-Friendly Interface
- Simple drag-and-drop file upload
- Real-time progress updates
- Advanced settings for fine-tuned control
- Beautiful, responsive design

## 🚀 Quick Start

### Local Installation

1. Clone or download this repository

2. Install dependencies:
```bash
pip install -r requirements.txt
```

3. Run the application:
```bash
python app.py
```

4. Open your browser to `http://localhost:7860`

### Hugging Face Spaces Deployment

This Space runs on the Docker SDK and serves the app on port 7860, as set by `sdk: docker` and `app_port: 7860` in the front matter at the top of this README.

## 📖 How to Use

1. **Upload PDF**: Click or drag your PDF file to the upload area
2. **Select Model**: Choose between BART (faster) or Long-T5 (better for long docs)
3. **Choose Style**: Pick bullet points or paragraph format
4. **Adjust Settings** (optional): Fine-tune chunk size and summary length
5. **Generate**: Click the "Generate Summary" button
6. **Download**: Get your summary as a markdown file

## ⚙️ Advanced Settings

### Chunk Size (1000-8000 words)
- **Default**: 3000 words
- **Smaller chunks**: Faster processing, may lose some context
- **Larger chunks**: Better context, slower processing

### Chunk Overlap (0-1000 words)
- **Default**: 200 words
- **Purpose**: Maintains context between chunks
- **Higher overlap**: Better continuity, slightly slower
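
To make the interaction between chunk size and overlap concrete, here is a minimal word-based splitter. The app itself uses LangChain for splitting, so treat this as a plain-Python stand-in: `chunk_words` is a hypothetical helper, and the real splitter may measure length differently.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 3000, overlap: int = 200) -> List[str]:
    """Split text into chunks of `chunk_size` words that share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


demo_chunks = chunk_words("0 1 2 3 4 5 6 7 8 9", chunk_size=4, overlap=1)
print(demo_chunks)  # ['0 1 2 3', '3 4 5 6', '6 7 8 9']
```

Each chunk starts with the last `overlap` words of the previous one, which is what lets a sentence that straddles a boundary appear whole in at least one chunk.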

### Summary Length
- **Max Length**: 50-500 words per section (default: 150)
- **Min Length**: 10-100 words per section (default: 30)
- Adjust based on how detailed you want the summary
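
The ranges and defaults in this section can be captured in a small validated settings object. This is an illustrative sketch: `SummarySettings` is a hypothetical name, and the check that the minimum length stays below the maximum is an extra sanity check that the settings above do not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SummarySettings:
    chunk_size: int = 3000      # words, allowed range 1000-8000
    chunk_overlap: int = 200    # words, allowed range 0-1000
    max_length: int = 150       # words per section, allowed range 50-500
    min_length: int = 30        # words per section, allowed range 10-100

    def __post_init__(self) -> None:
        if not 1000 <= self.chunk_size <= 8000:
            raise ValueError("chunk_size must be between 1000 and 8000")
        if not 0 <= self.chunk_overlap <= 1000:
            raise ValueError("chunk_overlap must be between 0 and 1000")
        if not 50 <= self.max_length <= 500:
            raise ValueError("max_length must be between 50 and 500")
        if not 10 <= self.min_length <= 100:
            raise ValueError("min_length must be between 10 and 100")
        # Assumption: a minimum at or above the maximum is never useful.
        if self.min_length >= self.max_length:
            raise ValueError("min_length must be smaller than max_length")


defaults = SummarySettings()
detailed = SummarySettings(chunk_size=5000, max_length=300, min_length=60)
print(defaults)
print(detailed)
```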

## 🎯 Best Practices

### For Best Results:
- Use clear, text-based PDFs (not scanned images)
- For technical documents, use the Long-T5 model
- For general documents, use BART
- For large files (100+ pages), increase the chunk size to 4000-5000 words
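
The reason a larger chunk size helps with big files is that it reduces the number of chunks to summarize. The arithmetic below is a sketch under two assumptions that do not come from the app: chunks are measured in words, and a page holds roughly 500 words. `count_chunks` is a hypothetical helper.

```python
import math


def count_chunks(total_words: int, chunk_size: int, overlap: int) -> int:
    """Number of overlapping chunks needed to cover `total_words` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if total_words <= 0:
        return 0
    if total_words <= chunk_size:
        return 1
    step = chunk_size - overlap  # new words covered by each additional chunk
    return math.ceil((total_words - overlap) / step)


# Assumption: about 500 words per page, so 100 pages is about 50,000 words.
default_count = count_chunks(50_000, chunk_size=3000, overlap=200)
larger_count = count_chunks(50_000, chunk_size=5000, overlap=200)
print(default_count)  # 18
print(larger_count)   # 11
```

Under these assumptions, moving from 3000 to 5000 words per chunk cuts the number of model calls from 18 to 11 for the same document.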

### Processing Times:
- Short documents (1-10 pages): 10-30 seconds
- Medium documents (11-50 pages): 30-120 seconds
- Large documents (more than 50 pages): 2-5 minutes

## 🛠️ Technical Details

### Models Used

**BART (facebook/bart-large-cnn)**
- 406M parameters
- Trained on CNN/DailyMail dataset
- Excellent for news, articles, general documents
- Fast inference time

**Long-T5 (google/long-t5-tglobal-base)**
- 250M parameters
- Handles inputs up to 16,384 tokens
- Better for academic papers and long-form content
- Slightly slower but more comprehensive
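
For orientation, this is roughly how either checkpoint can be loaded and run with the Transformers library. It is a sketch, not the code in `app.py`: `MODEL_IDS` and `summarize_text` are hypothetical names, and `max_length` and `min_length` are counted in tokens by `generate`, so they only approximate the word-based limits described earlier. The import is deferred so the snippet can be loaded without the library installed; calling the function downloads the model weights.

```python
MODEL_IDS = {
    "BART": "facebook/bart-large-cnn",
    "Long-T5": "google/long-t5-tglobal-base",
}


def summarize_text(
    text: str,
    model_key: str = "BART",
    max_length: int = 150,
    min_length: int = 30,
) -> str:
    """Summarize one chunk of text with the selected checkpoint."""
    # Imported lazily; requires `transformers` and `torch` to be installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    model_id = MODEL_IDS[model_key]
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    # Truncate to the tokenizer's configured maximum input length.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    output_ids = model.generate(
        **inputs,
        max_length=max_length,  # upper bound on generated tokens
        min_length=min_length,  # lower bound on generated tokens
        num_beams=4,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A real application would load the tokenizer and model once and reuse them across chunks rather than reloading them on every call.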

### Technologies
- **Gradio**: Web interface
- **Transformers**: Hugging Face models
- **PyMuPDF (fitz)**: PDF text extraction
- **LangChain**: Text splitting and chunking
- **PyTorch**: Deep learning backend
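
As an example of the extraction step, the sketch below reads page text with PyMuPDF and tidies the whitespace. `extract_pdf_text` and `normalize_whitespace` are hypothetical helper names, and the whitespace cleanup is an assumption about what a summarizer would want, not a description of the app's behavior.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace left over from PDF layout into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def extract_pdf_text(path: str) -> str:
    """Return the text of every page in the PDF as one cleaned string."""
    import fitz  # PyMuPDF, imported lazily so the helper above works without it

    with fitz.open(path) as doc:
        pages = [page.get_text() for page in doc]
    return normalize_whitespace(" ".join(pages))


print(normalize_whitespace("Line one\n\nLine   two\t end "))
```

Scanned PDFs contain images rather than text, so this kind of extraction returns little or nothing for them.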

## 📊 Example Use Cases

- **Students**: Summarize textbooks and research papers
- **Researchers**: Quick overview of academic literature
- **Professionals**: Digest reports and documentation
- **Anyone**: Understand long documents quickly

## 🔒 Privacy & Security

- Documents are processed in real-time
- No permanent storage of uploaded files
- Processing happens on your selected infrastructure
- Temporary files are automatically cleaned up
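
One common way to get automatic cleanup of temporary files is a scoped temporary directory, sketched below with the standard library. This illustrates the idea only; `process_upload` and `measure` are hypothetical names, and the app may manage its temporary files differently.

```python
import tempfile
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def process_upload(data: bytes, handler: Callable[[Path], T]) -> T:
    """Write the upload to a temporary directory that is deleted afterwards."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        pdf_path = Path(tmp_dir) / "upload.pdf"
        pdf_path.write_bytes(data)
        # The directory and its contents are removed when this block exits,
        # even if the handler raises.
        return handler(pdf_path)


seen_paths = []


def measure(path: Path) -> int:
    """Stand-in handler that records the path and returns the file size."""
    seen_paths.append(path)
    return path.stat().st_size


size = process_upload(b"%PDF-1.7 demo", measure)
print(size, seen_paths[0].exists())
```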

## 🐛 Troubleshooting

### PDF Upload Failed
- Ensure the PDF is not password-protected
- Check that the file is not corrupted
- Try re-saving the PDF
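
The first two checks can be approximated in code before a file is handed to the summarizer. This is a sketch with hypothetical helper names: `looks_like_pdf` only inspects the first bytes, so treat a failure as a hint rather than proof of corruption, and `needs_password` relies on PyMuPDF being installed.

```python
import os
import tempfile


def looks_like_pdf(path: str) -> bool:
    """Return True if the file starts with the PDF header bytes."""
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"


def needs_password(path: str) -> bool:
    """Return True if PyMuPDF reports that the document needs a password."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(path) as doc:
        return bool(doc.needs_pass)


# Demo of the header check with two throwaway files.
with tempfile.TemporaryDirectory() as tmp_dir:
    good = os.path.join(tmp_dir, "good.pdf")
    bad = os.path.join(tmp_dir, "bad.pdf")
    with open(good, "wb") as handle:
        handle.write(b"%PDF-1.7\n")
    with open(bad, "wb") as handle:
        handle.write(b"not a pdf")
    header_checks = (looks_like_pdf(good), looks_like_pdf(bad))

print(header_checks)  # (True, False)
```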

### Summary Quality Issues
- Try the Long-T5 model, especially for long or academic documents
- Adjust chunk size based on document type
- Increase max summary length for more detail

### Out of Memory Errors
- Reduce chunk size
- Run on CPU instead of GPU (slower, but not limited by GPU memory)
- Process smaller sections at a time
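
Reducing the chunk size can also be automated. The sketch below retries with half the chunk size whenever a call fails with an out-of-memory error. It is an assumption about how one might handle this, not the app's behavior: `summarize_with_backoff` and `fake_summarize` are hypothetical, and out-of-memory errors are detected by `MemoryError` or by the text "out of memory" in a `RuntimeError` message.

```python
from typing import Callable


def summarize_with_backoff(
    text: str,
    summarize_fn: Callable[[str, int], str],
    chunk_size: int = 3000,
    min_chunk_size: int = 1000,
) -> str:
    """Call summarize_fn, halving chunk_size after each out-of-memory failure."""
    while True:
        try:
            return summarize_fn(text, chunk_size)
        except (MemoryError, RuntimeError) as error:
            is_oom = isinstance(error, MemoryError) or "out of memory" in str(error).lower()
            if not is_oom or chunk_size // 2 < min_chunk_size:
                raise  # unrelated error, or nothing smaller left to try
            chunk_size //= 2


attempts = []


def fake_summarize(text: str, chunk_size: int) -> str:
    """Stand-in that fails for large chunks, imitating a memory limit."""
    attempts.append(chunk_size)
    if chunk_size > 1500:
        raise RuntimeError("CUDA out of memory")
    return f"summarized with chunk_size={chunk_size}"


result = summarize_with_backoff("some long text", fake_summarize, chunk_size=4000)
print(attempts)  # [4000, 2000, 1000]
print(result)
```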

## 📝 Requirements

- Python 3.8 or higher
- 4GB+ RAM (8GB+ recommended)
- GPU optional (speeds up processing significantly)
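
Because the GPU is optional, device selection usually falls back to the CPU. A minimal sketch, assuming PyTorch as the backend; `pick_device` is a hypothetical helper, not a function from this repository.

```python
def pick_device(prefer_gpu: bool = True) -> str:
    """Return "cuda" when a CUDA GPU is usable, otherwise "cpu"."""
    if not prefer_gpu:
        return "cpu"
    try:
        import torch  # imported lazily so the check degrades gracefully
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())
print(pick_device(prefer_gpu=False))
```

Passing `prefer_gpu=False` gives a simple way to force CPU execution when GPU memory is the problem.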

## 🤝 Contributing

Contributions are welcome! Feel free to:
- Report bugs
- Suggest new features
- Improve documentation
- Submit pull requests

## 📄 License

This project is open source and available under the MIT License.

## 🙏 Acknowledgments

- Hugging Face for the amazing transformer models
- Facebook AI for BART
- Google Research for Long-T5
- Gradio team for the excellent UI framework

## 📧 Support

For issues or questions:
- Open an issue on GitHub
- Check existing documentation
- Review the troubleshooting section

---

**Made with ❤️ for efficient document summarization**

Happy summarizing! 📚✨