---
title: AI-Powered PDF Summarizer
emoji: 📚
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---

# 📚 AI-Powered PDF Summarizer

An intelligent PDF summarization tool powered by state-of-the-art Hugging Face transformer models. Upload any PDF document and get a comprehensive, well-structured summary perfect for studying, research, or quick document review.

## 🌟 Features

### 🤖 Multiple AI Models
- **BART (facebook/bart-large-cnn)**: Fast, high-quality summarization for general documents
- **Long-T5 (google/long-t5-tglobal-base)**: Optimized for very long documents and academic papers

### ⚡ Smart Processing
- Intelligent text chunking with overlap for context preservation
- Progress tracking during summarization
- Handles long documents by summarizing them chunk by chunk
- GPU acceleration support (when available)

### πŸ“ Flexible Output
- Choose between bullet points or paragraph format
- Downloadable markdown files
- Statistics about your document
- Clean, readable formatting
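
As a rough illustration of the two output styles, the sketch below renders a list of per-section summaries either as bullet points or as one paragraph, wrapped in a small Markdown document. The helper name `format_summary` and the default title are assumptions made for this example, not the app's actual code.

```python
def format_summary(sections, style="bullets", title="Summary"):
    """Render per-section summaries as Markdown (hypothetical helper)."""
    # Drop empty sections and collapse runs of whitespace.
    cleaned = [" ".join(s.split()) for s in sections if s and s.strip()]
    if style == "bullets":
        body = "\n".join(f"- {s}" for s in cleaned)
    elif style == "paragraph":
        body = " ".join(cleaned)
    else:
        raise ValueError(f"unknown style: {style!r}")
    return f"# {title}\n\n{body}\n"
```

The returned string can be written to a `.md` file as-is to produce the downloadable summary.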

### 🎨 User-Friendly Interface
- Simple drag-and-drop file upload
- Real-time progress updates
- Advanced settings for fine-tuned control
- Beautiful, responsive design

## 🚀 Quick Start

### Local Installation

1. Clone or download this repository

2. Install dependencies:
```bash
pip install -r requirements.txt
```

3. Run the application:
```bash
python app.py
```

4. Open your browser to `http://localhost:7860`

### Hugging Face Spaces Deployment

This repository is configured as a Docker Space (`sdk: docker`, `app_port: 7860` in the README front matter). See the deployment guide for step-by-step instructions.

## 📖 How to Use

1. **Upload PDF**: Click or drag your PDF file to the upload area
2. **Select Model**: Choose between BART (faster) or Long-T5 (better for long docs)
3. **Choose Style**: Pick bullet points or paragraph format
4. **Adjust Settings** (optional): Fine-tune chunk size and summary length
5. **Generate**: Click the "Generate Summary" button
6. **Download**: Get your summary as a markdown file

## βš™οΈ Advanced Settings

### Chunk Size (1000-8000 words)
- **Default**: 3000 words
- **Smaller chunks**: Faster processing, may lose some context
- **Larger chunks**: Better context, slower processing

### Chunk Overlap (0-1000 words)
- **Default**: 200 words
- **Purpose**: Maintains context between chunks
- **Higher overlap**: Better continuity, slightly slower
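
To make the interaction between chunk size and overlap concrete, here is a minimal plain-Python sketch of word-based chunking. It is a stand-in for illustration only: the app lists LangChain for text splitting, and `chunk_words` is a hypothetical name that does not reproduce LangChain's splitter.

```python
def chunk_words(text, chunk_size=3000, overlap=200):
    """Split text into word chunks where neighbours share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Each chunk starts `chunk_size - overlap` words after the previous one, so a higher overlap means more chunks and slightly more work for the same document.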

### Summary Length
- **Max Length**: 50-500 words per section (default: 150)
- **Min Length**: 10-100 words per section (default: 30)
- Adjust based on how detailed you want the summary
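
The documented ranges can still be combined inconsistently (for example, a minimum length of 100 with a maximum of 50, or an overlap of 1000 with a chunk size of 1000). The sketch below shows one way such settings could be validated; `SummarySettings` is a hypothetical class written for this README, not part of the app.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SummarySettings:
    """Hypothetical container for the settings described above."""
    chunk_size: int = 3000     # words, allowed 1000-8000
    chunk_overlap: int = 200   # words, allowed 0-1000
    max_length: int = 150      # words per section, allowed 50-500
    min_length: int = 30       # words per section, allowed 10-100

    def __post_init__(self):
        ranges = {
            "chunk_size": (1000, 8000),
            "chunk_overlap": (0, 1000),
            "max_length": (50, 500),
            "min_length": (10, 100),
        }
        for name, (low, high) in ranges.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside {low}-{high}")
        # The individual ranges still allow inconsistent combinations.
        if self.min_length >= self.max_length:
            raise ValueError("min_length must be smaller than max_length")
        if self.chunk_overlap >= self.chunk_size:
            raise ValueError("chunk_overlap must be smaller than chunk_size")
```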

## 🎯 Best Practices

### For Best Results:
- Use clear, text-based PDFs (not scanned images)
- For technical documents: Use Long-T5 model
- For general documents: BART works great
- Large files (100+ pages): Increase chunk size to 4000-5000 words
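
To see why a larger chunk size helps with big files, the sketch below estimates how many chunks (and therefore model calls) a document needs. The figure of roughly 400 words per page is an assumption for illustration, and `estimate_chunks` is a hypothetical helper.

```python
import math


def estimate_chunks(n_words, chunk_size=3000, overlap=200):
    """Estimate the number of overlapping word chunks for a document."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    if n_words <= 0:
        return 0
    if n_words <= chunk_size:
        return 1
    # After the first chunk, each further chunk adds (chunk_size - overlap) new words.
    return 1 + math.ceil((n_words - chunk_size) / (chunk_size - overlap))


# Assumption: 100 pages at about 400 words per page.
n_words = 100 * 400
print(estimate_chunks(n_words, chunk_size=3000))  # 15
print(estimate_chunks(n_words, chunk_size=5000))  # 9
```

Under these assumptions, raising the chunk size from 3000 to 5000 words cuts the number of chunks from 15 to 9.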

### Processing Times:
- Short documents (1-10 pages): roughly 10-30 seconds
- Medium documents (11-50 pages): roughly 30-120 seconds
- Large documents (over 50 pages): roughly 2-5 minutes

## πŸ› οΈ Technical Details

### Models Used

**BART (facebook/bart-large-cnn)**
- 406M parameters
- Trained on CNN/DailyMail dataset
- Excellent for news, articles, general documents
- Fast inference time

**Long-T5 (google/long-t5-tglobal-base)**
- 250M parameters
- Handles inputs up to 16,384 tokens
- Better for academic papers and long-form content
- Slightly slower but more comprehensive
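
The model choice can be expressed as a small lookup. In the sketch below the checkpoint IDs come from this README, while `suggest_model` and its `long_doc_words` threshold are assumptions for illustration; the app itself lets you pick the model by hand. The returned checkpoint ID is the kind of value you would pass as `model=` to a Hugging Face `pipeline("summarization", ...)` call.

```python
MODEL_IDS = {
    "BART": "facebook/bart-large-cnn",
    "Long-T5": "google/long-t5-tglobal-base",
}


def suggest_model(n_words, technical=False, long_doc_words=10_000):
    """Return (label, checkpoint id) following the guidance above.

    `long_doc_words` is a hypothetical cut-off, not a value used by the app.
    """
    label = "Long-T5" if technical or n_words >= long_doc_words else "BART"
    return label, MODEL_IDS[label]
```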

### Technologies
- **Gradio**: Web interface
- **Transformers**: Hugging Face models
- **PyMuPDF (fitz)**: PDF text extraction
- **LangChain**: Text splitting and chunking
- **PyTorch**: Deep learning backend

## 📊 Example Use Cases

- **Students**: Summarize textbooks and research papers
- **Researchers**: Quick overview of academic literature
- **Professionals**: Digest reports and documentation
- **Anyone**: Understand long documents quickly

## 🔒 Privacy & Security

- Documents are processed in real-time
- No permanent storage of uploaded files
- Processing happens on your selected infrastructure
- Temporary files are automatically cleaned up
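
One common way to get automatic cleanup of temporary files is Python's `tempfile.TemporaryDirectory`. The sketch below is an assumption about how this could be done, not a description of the app's code; `process_upload` and the `upload.pdf` file name are hypothetical.

```python
import os
import tempfile


def process_upload(data: bytes, handler):
    """Write uploaded bytes to a temp dir, run handler(path), then clean up."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "upload.pdf")
        with open(path, "wb") as f:
            f.write(data)
        # The directory and the file are deleted when the `with` block exits,
        # even if the handler raises an exception.
        return handler(path)
```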

## πŸ› Troubleshooting

### PDF Upload Failed
- Ensure PDF is not password-protected
- Check file is not corrupted
- Try re-saving the PDF
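
Before uploading, a few cheap checks on the raw bytes can hint at which of these problems applies. The sketch below is a heuristic written for this README (the function name `precheck_pdf` is hypothetical); it is not a PDF parser and can give false positives or negatives.

```python
def precheck_pdf(data: bytes):
    """Return a list of likely problems with the file contents (heuristic)."""
    if not data:
        return ["file is empty"]
    problems = []
    # PDFs normally begin with a "%PDF-" header; allow some leading bytes.
    if b"%PDF-" not in data[:1024]:
        problems.append("missing %PDF- header (not a PDF or corrupted)")
    # A complete file normally ends with an "%%EOF" marker.
    if b"%%EOF" not in data[-1024:]:
        problems.append("missing %%EOF marker (file may be truncated)")
    # Encrypted PDFs reference an /Encrypt dictionary.
    if b"/Encrypt" in data:
        problems.append("contains /Encrypt entry (may be password-protected)")
    return problems
```

For example, `precheck_pdf(open("paper.pdf", "rb").read())` returns an empty list when none of the checks trigger (`paper.pdf` is a placeholder file name).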

### Summary Quality Issues
- Try the Long-T5 model for better quality
- Adjust chunk size based on document type
- Increase max summary length for more detail

### Out of Memory Errors
- Reduce chunk size
- Use CPU instead of GPU (slower but stable)
- Process smaller sections at a time
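
The "reduce chunk size" advice can also be automated. The sketch below retries a summarization callable with progressively smaller chunks when a memory error occurs. `run_with_fallback` and the `summarize(text, chunk_size)` signature are assumptions for illustration. Note that it catches every `RuntimeError`, which is broader than out-of-memory failures alone.

```python
def run_with_fallback(summarize, text, chunk_size=3000, min_chunk_size=1000):
    """Call summarize(text, chunk_size), halving chunk_size on memory errors."""
    while True:
        try:
            return summarize(text, chunk_size)
        except (MemoryError, RuntimeError):
            # PyTorch reports CUDA out-of-memory as a RuntimeError (subclass).
            if chunk_size <= min_chunk_size:
                raise  # nothing smaller left to try
            chunk_size = max(min_chunk_size, chunk_size // 2)
```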

## πŸ“ Requirements

- Python 3.8 or higher
- 4GB+ RAM (8GB+ recommended)
- GPU optional (speeds up processing significantly)

## 🤝 Contributing

Contributions are welcome! Feel free to:
- Report bugs
- Suggest new features
- Improve documentation
- Submit pull requests

## 📄 License

This project is open source and available under the MIT License.

## πŸ™ Acknowledgments

- Hugging Face for the amazing transformer models
- Facebook AI for BART
- Google Research for Long-T5
- Gradio team for the excellent UI framework

## 📧 Support

For issues or questions:
- Open an issue on GitHub
- Check existing documentation
- Review the troubleshooting section

---

**Made with ❤️ for efficient document summarization**

Happy summarizing! 📚✨