---
title: AI-Powered PDF Summarizer
emoji: π
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---
# AI-Powered PDF Summarizer
An intelligent PDF summarization tool powered by state-of-the-art Hugging Face transformer models. Upload any PDF document and get a comprehensive, well-structured summary perfect for studying, research, or quick document review.
## Features
### Multiple AI Models
- **BART (facebook/bart-large-cnn)**: Fast, high-quality summarization for general documents
- **Long-T5 (google/long-t5-tglobal-base)**: Optimized for very long documents and academic papers
### Smart Processing
- Intelligent text chunking with overlap for context preservation
- Progress tracking during summarization
- Handles long documents by splitting them into chunks and summarizing each chunk
- GPU acceleration support (when available)
### Flexible Output
- Choose between bullet points or paragraph format
- Downloadable markdown files
- Statistics about your document
- Clean, readable formatting
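As a rough illustration of the two output styles, the sketch below joins per-section summaries into a markdown document. The function name `format_summary`, its parameters, and the `"bullets"`/`"paragraph"` style labels are assumptions made for this example; the app's own formatting code may differ.

```python
def format_summary(section_summaries, style="bullets", title="Summary"):
    """Join per-section summaries into one markdown document.

    Hypothetical helper: names and style labels are illustrative only.
    """
    # Drop empty sections and trim surrounding whitespace.
    parts = [s.strip() for s in section_summaries if s and s.strip()]
    if style == "bullets":
        body = "\n".join(f"- {p}" for p in parts)
    elif style == "paragraph":
        body = "\n\n".join(parts)
    else:
        raise ValueError(f"unknown style: {style!r}")
    return f"# {title}\n\n{body}\n"


def word_stats(original_text, summary_text):
    """Return simple word-count statistics for the document and its summary."""
    original_words = len(original_text.split())
    summary_words = len(summary_text.split())
    ratio = summary_words / original_words if original_words else 0.0
    return {"original_words": original_words,
            "summary_words": summary_words,
            "ratio": ratio}
```

The returned string can be written to a `.md` file to produce the downloadable output.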
### User-Friendly Interface
- Simple drag-and-drop file upload
- Real-time progress updates
- Advanced settings for fine-tuned control
- Beautiful, responsive design
## Quick Start
### Local Installation
1. Clone or download this repository
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Run the application:
```bash
python app.py
```
4. Open your browser to `http://localhost:7860`
### Hugging Face Spaces Deployment
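This repository is set up as a Docker Space: the front matter at the top of this README sets `sdk: docker` and `app_port: 7860`, the same port used for local runs. See the detailed deployment guide for step-by-step instructions.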
## How to Use
1. **Upload PDF**: Click or drag your PDF file to the upload area
2. **Select Model**: Choose between BART (faster) or Long-T5 (better for long documents)
3. **Choose Style**: Pick bullet points or paragraph format
4. **Adjust Settings** (optional): Fine-tune chunk size and summary length
5. **Generate**: Click the "Generate Summary" button
6. **Download**: Get your summary as a markdown file
## Advanced Settings
### Chunk Size (1000-8000 words)
- **Default**: 3000 words
- **Smaller chunks**: Faster processing, may lose some context
- **Larger chunks**: Better context, slower processing
### Chunk Overlap (0-1000 words)
- **Default**: 200 words
- **Purpose**: Maintains context between chunks
- **Higher overlap**: Better continuity, slightly slower
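To make the interaction between chunk size and overlap concrete, here is a minimal plain-Python sketch of word-based chunking. It is not the app's implementation (the app relies on LangChain for splitting); the function name `chunk_words` and its parameters are assumptions for illustration.

```python
def chunk_words(text, chunk_size=3000, overlap=200):
    """Split text into word-based chunks; consecutive chunks share `overlap` words.

    Illustrative sketch only, not the splitter the app actually uses.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, each chunk starts 2,800 words after the previous one, so a higher overlap means more chunks for the same document and therefore more model calls.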
### Summary Length
- **Max Length**: 50-500 words per section (default: 150)
- **Min Length**: 10-100 words per section (default: 30)
- Adjust based on how detailed you want the summary
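The sketch below shows one way these two settings could be passed to a Hugging Face summarization pipeline. The helper names (`clamp_lengths`, `build_summarizer`, `summarize_chunk`) are hypothetical, and it assumes a Transformers version that provides the `"summarization"` pipeline task. Note that the pipeline interprets `max_length` and `min_length` as token counts, so the word figures above are approximate.

```python
def clamp_lengths(max_length=150, min_length=30):
    """Clamp both values to the documented ranges and keep min below max."""
    max_length = max(50, min(500, max_length))
    min_length = max(10, min(100, min_length))
    if min_length >= max_length:
        min_length = max_length - 1
    return max_length, min_length


def build_summarizer(model_name="facebook/bart-large-cnn"):
    """Create a summarization pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily: heavy dependency
    return pipeline("summarization", model=model_name)


def summarize_chunk(summarizer, text, max_length=150, min_length=30):
    """Summarize one chunk with any callable that behaves like the pipeline."""
    max_length, min_length = clamp_lengths(max_length, min_length)
    result = summarizer(
        text,
        max_length=max_length,  # counted in tokens by the pipeline
        min_length=min_length,
        do_sample=False,        # deterministic output
        truncation=True,        # cut inputs that exceed the model's limit
    )
    return result[0]["summary_text"]
```

Building the pipeline once and reusing it for every chunk avoids reloading the model on each call.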
## Best Practices
### For Best Results:
- Use clear, text-based PDFs (not scanned images)
- For technical documents, use the Long-T5 model
- For general documents, BART works well
- For large files (100+ pages), increase the chunk size to 4000-5000 words
### Processing Times:
- Short documents (1-10 pages): 10-30 seconds
- Medium documents (10-50 pages): 30-120 seconds
- Large documents (50+ pages): 2-5 minutes
## Technical Details
### Models Used
**BART (facebook/bart-large-cnn)**
- 406M parameters
- Fine-tuned on the CNN/DailyMail dataset
- Excellent for news, articles, general documents
- Fast inference time
**Long-T5 (google/long-t5-tglobal-base)**
- 250M parameters
- Handles inputs up to 16,384 tokens
- Better for academic papers and long-form content
- Slightly slower but more comprehensive
### Technologies
- **Gradio**: Web interface
- **Transformers**: Hugging Face models
- **PyMuPDF (fitz)**: PDF text extraction
- **LangChain**: Text splitting and chunking
- **PyTorch**: Deep learning backend
## Example Use Cases
- **Students**: Summarize textbooks and research papers
- **Researchers**: Quick overview of academic literature
- **Professionals**: Digest reports and documentation
- **Anyone**: Understand long documents quickly
## Privacy & Security
- Documents are processed in real time
- No permanent storage of uploaded files
- Processing happens on your selected infrastructure
- Temporary files are automatically cleaned up
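One common way to get automatic cleanup of temporary files is Python's `tempfile.TemporaryDirectory`. The sketch below is an assumption about how such cleanup could work, not a description of the app's actual mechanism; `process_upload` and `handler` are hypothetical names.

```python
import os
import tempfile


def process_upload(pdf_bytes, handler):
    """Write an upload to a temporary directory, process it, then clean up.

    `handler` is any callable that takes a file path and returns a result.
    The directory and its contents are removed when the `with` block exits,
    even if `handler` raises an exception.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        pdf_path = os.path.join(tmp_dir, "upload.pdf")
        with open(pdf_path, "wb") as f:
            f.write(pdf_bytes)
        return handler(pdf_path)
```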
## Troubleshooting
### PDF Upload Failed
- Ensure the PDF is not password-protected
- Check that the file is not corrupted
- Try re-saving the PDF
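The checks above can be sketched with PyMuPDF as follows. The helper names `extract_pages` and `looks_scanned`, and the 20-character threshold, are assumptions for illustration; a PDF with almost no extractable text is likely a scanned image, which this tool cannot summarize without OCR.

```python
def extract_pages(pdf_path):
    """Return the text of each page, or raise if the PDF needs a password.

    Opening a corrupted file raises an exception from PyMuPDF itself.
    """
    import fitz  # PyMuPDF; imported lazily so the helper below has no dependency
    with fitz.open(pdf_path) as doc:
        if doc.needs_pass:
            raise ValueError("PDF is password-protected")
        return [page.get_text() for page in doc]


def looks_scanned(page_texts, min_chars=20):
    """Guess whether a PDF is image-only.

    Returns True when fewer than half of the pages contain at least
    `min_chars` characters of extractable text (threshold is an assumption).
    """
    if not page_texts:
        return True
    text_pages = sum(1 for t in page_texts if len(t.strip()) >= min_chars)
    return text_pages < len(page_texts) / 2
```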
### Summary Quality Issues
- Try the Long-T5 model for better quality
- Adjust chunk size based on document type
- Increase max summary length for more detail
### Out of Memory Errors
- Reduce chunk size
- Use CPU instead of GPU (slower but stable)
- Process smaller sections at a time
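The last two suggestions can be sketched as below. `pick_device` and `batched` are hypothetical helpers; the returned integer follows the convention accepted by the `device` argument of `transformers.pipeline`, where `-1` means CPU and `0` means the first GPU.

```python
def pick_device(force_cpu=False):
    """Return a pipeline device index: -1 for CPU, 0 for the first CUDA GPU."""
    if force_cpu:
        return -1
    try:
        import torch  # imported lazily; fall back to CPU if it is missing
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


def batched(items, batch_size):
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Summarizing a few chunks per batch keeps peak memory lower than sending every chunk to the model at once, at the cost of more calls.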
## Requirements
- Python 3.8 or higher
- 4GB+ RAM (8GB+ recommended)
- GPU optional (speeds up processing significantly)
## Contributing
Contributions are welcome! Feel free to:
- Report bugs
- Suggest new features
- Improve documentation
- Submit pull requests
## License
This project is open source and available under the MIT License.
## Acknowledgments
- Hugging Face for the amazing transformer models
- Facebook AI for BART
- Google Research for Long-T5
- Gradio team for the excellent UI framework
## Support
For issues or questions:
- Open an issue on GitHub
- Check existing documentation
- Review the troubleshooting section
---
**Made with ❤️ for efficient document summarization**
Happy summarizing!