Spaces: xTHExBEASTx/pdf-summarizer


xTHExBEASTx committed on Dec 26, 2025

Commit 4815095 (verified) · 1 Parent(s): 7698190

Upload 3 files

Files changed (3)

README.md +178 -11
app.py +380 -0
requirements.txt +8 -0

README.md CHANGED

@@ -1,13 +1,180 @@
 ---
-title: Pdf Summarizer
-emoji: 🏃
-colorFrom: gray
-colorTo: purple
-sdk: gradio
-sdk_version: 6.2.0
-app_file: app.py
-pinned: false
-short_description: pdf-summarizer
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# 📚 AI-Powered PDF Summarizer
+An intelligent PDF summarization tool powered by state-of-the-art Hugging Face transformer models. Upload any PDF document and get a comprehensive, well-structured summary perfect for studying, research, or quick document review.
+## 🌟 Features
+### 🤖 Multiple AI Models
+- **BART (facebook/bart-large-cnn)**: Fast, high-quality summarization for general documents
+- **Long-T5 (google/long-t5-tglobal-base)**: Optimized for very long documents and academic papers
+### ⚡ Smart Processing
+- Intelligent text chunking with overlap for context preservation
+- Progress tracking during summarization
+- Handles long documents by summarizing them chunk by chunk
+- GPU acceleration support (when available)
+### 📝 Flexible Output
+- Choose between bullet points or paragraph format
+- Downloadable markdown files
+- Statistics about your document
+- Clean, readable formatting
+### 🎨 User-Friendly Interface
+- Simple drag-and-drop file upload
+- Real-time progress updates
+- Advanced settings for fine-tuned control
+- Beautiful, responsive design
+## 🚀 Quick Start
+### Local Installation
+1. Clone or download this repository
+2. Install dependencies:
+```bash
+pip install -r requirements.txt
+```
+3. Run the application:
+```bash
+python app.py
+```
+4. Open your browser to `http://localhost:7860`
+### Hugging Face Spaces Deployment
+Create a Gradio Space and upload `README.md`, `app.py`, and `requirements.txt`; the Space installs the dependencies and starts `app.py`.
+## 📖 How to Use
+1. **Upload PDF**: Click or drag your PDF file to the upload area
+2. **Select Model**: Choose between BART (faster) or Long-T5 (better for long docs)
+3. **Choose Style**: Pick bullet points or paragraph format
+4. **Adjust Settings** (optional): Fine-tune chunk size and summary length
+5. **Generate**: Click the "Generate Summary" button
+6. **Download**: Get your summary as a markdown file
+## ⚙️ Advanced Settings
+### Chunk Size (1000-8000 characters)
+- **Default**: 3000 characters
+- **Smaller chunks**: Faster per chunk, may lose some context
+- **Larger chunks**: More context per chunk, slower processing; BART truncates any input beyond 1,024 tokens
+### Chunk Overlap (0-1000 characters)
+- **Default**: 200 characters
+- **Purpose**: Maintains context between chunks
+- **Higher overlap**: Better continuity, slightly slower
+### Summary Length
+- **Max Length**: 50-500 tokens per section (default: 150)
+- **Min Length**: 10-100 tokens per section (default: 30)
+- Adjust based on how detailed you want the summary
+## 🎯 Best Practices
+### For Best Results:
+- Use clear, text-based PDFs (not scanned images)
+- For technical documents: Use Long-T5 model
+- For general documents: BART works great
+- Large files (100+ pages): Increase chunk size to 4000-5000 characters
+### Processing Times:
+- Short documents (1-10 pages): 10-30 seconds
+- Medium documents (10-50 pages): 30-120 seconds
+- Large documents (50+ pages): 2-5 minutes
+## 🛠️ Technical Details
+### Models Used
+**BART (facebook/bart-large-cnn)**
+- 406M parameters
+- Trained on CNN/DailyMail dataset
+- Excellent for news, articles, general documents
+- Fast inference time
+**Long-T5 (google/long-t5-tglobal-base)**
+- 250M parameters
+- Handles inputs up to 16,384 tokens
+- Better for academic papers and long-form content
+- Slightly slower but more comprehensive
+### Technologies
+- **Gradio**: Web interface
+- **Transformers**: Hugging Face models
+- **PyMuPDF (fitz)**: PDF text extraction
+- **LangChain**: Text splitting and chunking
+- **PyTorch**: Deep learning backend
+## 📊 Example Use Cases
+- **Students**: Summarize textbooks and research papers
+- **Researchers**: Quick overview of academic literature
+- **Professionals**: Digest reports and documentation
+- **Anyone**: Understand long documents quickly
+## 🔒 Privacy & Security
+- Documents are processed in real-time
+- No permanent storage of uploaded files
+- Processing happens on your selected infrastructure
+- Generated summaries are written to the app's working directory as `<name>_Summary.md` and are not deleted automatically
+## 🐛 Troubleshooting
+### PDF Upload Failed
+- Ensure PDF is not password-protected
+- Check file is not corrupted
+- Try re-saving the PDF
+### Summary Quality Issues
+- Try the Long-T5 model for better quality
+- Adjust chunk size based on document type
+- Increase max summary length for more detail
+### Out of Memory Errors
+- Reduce chunk size
+- Use CPU instead of GPU (slower but stable)
+- Process smaller sections at a time
+## 📝 Requirements
+- Python 3.9 or higher (`app.py` uses built-in generic annotations such as `tuple[str, str]`)
+- 4GB+ RAM (8GB+ recommended)
+- GPU optional (speeds up processing significantly)
+## 🤝 Contributing
+Contributions are welcome! Feel free to:
+- Report bugs
+- Suggest new features
+- Improve documentation
+- Submit pull requests
+## 📄 License
+This project is open source and available under the MIT License.
+## 🙏 Acknowledgments
+- Hugging Face for the amazing transformer models
+- Facebook AI for BART
+- Google Research for Long-T5
+- Gradio team for the excellent UI framework
+## 📧 Support
+For issues or questions:
+- Open an issue on GitHub
+- Check existing documentation
+- Review the troubleshooting section
 ---
+**Made with ❤️ for efficient document summarization**
+Happy summarizing! 📚✨
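
The README's Advanced Settings describe chunk size and chunk overlap. Because `app.py` passes `length_function=len` to `RecursiveCharacterTextSplitter`, both values are measured in characters. The sketch below is not the splitter the app uses; it is a simplified fixed-window version (the name `chunk_with_overlap` is hypothetical) that only illustrates how consecutive chunks share `chunk_overlap` characters.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into character windows; neighbours share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks
```

With the defaults from the README (3000 and 200), each window would advance by 2800 characters. The real splitter additionally prefers to break on paragraph, line, and word boundaries, so its chunks are usually shorter than the maximum.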

app.py ADDED

@@ -0,0 +1,380 @@

+import os
+import gradio as gr
+import fitz  # PyMuPDF
+from transformers import pipeline
+from langchain_text_splitters import RecursiveCharacterTextSplitter
+import torch
+# Check if CUDA is available
+device = 0 if torch.cuda.is_available() else -1
+# Initialize summarization pipelines at startup
+print("Loading AI models... This may take a few minutes on first run.")
+try:
+    bart_summarizer = pipeline(
+        "summarization",
+        model="facebook/bart-large-cnn",
+        device=device
+    )
+    print("✓ BART model loaded successfully")
+except Exception as e:
+    print(f"✗ Error loading BART model: {e}")
+    bart_summarizer = None
+try:
+    longt5_summarizer = pipeline(
+        "summarization",
+        model="google/long-t5-tglobal-base",
+        device=device
+    )
+    print("✓ Long-T5 model loaded successfully")
+except Exception as e:
+    print(f"✗ Error loading Long-T5 model: {e}")
+    longt5_summarizer = None
+print("Models ready!")
+def extract_text_from_pdf(pdf_file) -> tuple[str, str]:
+    """
+    Extracts text from the uploaded PDF file.
+    Args:
+        pdf_file: Gradio file object
+    Returns:
+        tuple: (extracted_text, error_message)
+    """
+    text = ""
+    try:
+        with fitz.open(pdf_file.name) as doc:
+            total_pages = len(doc)
+            for page_num, page in enumerate(doc, 1):
+                text += page.get_text()
+        if not text.strip():
+            return "", "PDF appears to be empty or contains only images."
+        return text, None
+    except Exception as e:
+        return "", f"Error reading PDF: {str(e)}"
+def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
+    """
+    Split text into manageable chunks.
+    Args:
+        text: The text to split
+        chunk_size: Maximum size of each chunk
+        chunk_overlap: Overlap between chunks
+    Returns:
+        list: List of text chunks
+    """
+    text_splitter = RecursiveCharacterTextSplitter(
+        chunk_size=chunk_size,
+        chunk_overlap=chunk_overlap,
+        length_function=len,
+        separators=["\n\n", "\n", " ", ""]
+    )
+    return text_splitter.split_text(text)
+def summarize_chunk(chunk: str, model_name: str, max_length: int, min_length: int) -> str:
+    """
+    Summarize a single chunk of text.
+    Args:
+        chunk: Text to summarize
+        model_name: Dropdown label; "BART (Fast, High Quality)" selects BART, any other value selects Long-T5
+        max_length: Maximum summary length
+        min_length: Minimum summary length
+    Returns:
+        str: Summarized text
+    """
+    try:
+        summarizer = bart_summarizer if model_name == "BART (Fast, High Quality)" else longt5_summarizer
+        if summarizer is None:
+            return "Error: Model not loaded properly."
+        # Scale lengths (in tokens) to the chunk; floors keep short chunks from yielding zero or negative limits
+        actual_max = max(2, min(int(max_length), len(chunk.split()) // 2))
+        actual_min = max(1, min(int(min_length), actual_max - 10))
+        result = summarizer(
+            chunk,
+            max_length=actual_max,
+            min_length=actual_min,
+            do_sample=False,
+            truncation=True
+        )
+        return result[0]['summary_text']
+    except Exception as e:
+        return f"Error summarizing chunk: {str(e)}"
+def process_pdf(pdf_file, model_name, chunk_size, chunk_overlap, max_length, min_length, summary_style):
+    """
+    Main processing function: Extract → Chunk → Summarize → Synthesize.
+    Args:
+        pdf_file: Uploaded PDF file
+        model_name: Selected model
+        chunk_size: Size of text chunks
+        chunk_overlap: Overlap between chunks
+        max_length: Maximum summary length
+        min_length: Minimum summary length
+        summary_style: Style of summary (Bullet Points or Paragraph)
+    Yields:
+        tuple: (status_message, output_file_path)
+    """
+    if pdf_file is None:
+        yield "⚠️ Please upload a PDF file first.", None
+        return
+    # Extract text from PDF
+    yield "📄 Reading PDF and extracting text...", None
+    full_text, error = extract_text_from_pdf(pdf_file)
+    if error:
+        yield f"❌ {error}", None
+        return
+    # Get basic stats
+    word_count = len(full_text.split())
+    char_count = len(full_text)
+    yield f"✅ Extracted {word_count:,} words ({char_count:,} characters)\n\n📊 Splitting text into sections...", None
+    # Split into chunks
+    chunks = chunk_text(full_text, int(chunk_size), int(chunk_overlap))
+    total_chunks = len(chunks)
+    if total_chunks == 0:
+        yield "❌ No text could be extracted from the PDF.", None
+        return
+    yield f"✅ Created {total_chunks} sections\n\n🤖 Starting summarization...", None
+    # Summarize each chunk
+    intermediate_summaries = []
+    for i, chunk in enumerate(chunks, 1):
+        yield f"🔄 Processing section {i}/{total_chunks}...", None
+        summary = summarize_chunk(chunk, model_name, max_length, min_length)
+        intermediate_summaries.append(summary)
+    yield f"✅ Completed all sections\n\n🎯 Creating final structured summary...", None
+    # Create final summary
+    if len(intermediate_summaries) > 1:
+        combined = "\n\n".join(intermediate_summaries)
+        # NOTE: style_instruction is informational only; summarization pipelines take no prompt, so it is never sent to the model
+        if summary_style == "Bullet Points":
+            style_instruction = "Create a well-organized summary with clear bullet points and headings."
+        else:
+            style_instruction = "Create a comprehensive, flowing paragraph summary."
+        final_summary = summarize_chunk(
+            combined,
+            model_name,
+            max_length * 2,  # Allow longer final summary
+            min_length
+        )
+    else:
+        final_summary = intermediate_summaries[0]
+    # Format the output based on style
+    if summary_style == "Bullet Points":
+        formatted_summary = f"""# 📚 PDF Summary
+**Original Document:** {os.path.basename(pdf_file.name)}
+**Word Count:** {word_count:,}
+**Sections Processed:** {total_chunks}
+**Model Used:** {model_name}
+---
+## Summary
+{final_summary}
+---
+*Generated with Hugging Face Transformers*
+"""
+    else:
+        formatted_summary = f"""# 📚 PDF Summary
+**Original Document:** {os.path.basename(pdf_file.name)}
+**Word Count:** {word_count:,}
+**Sections Processed:** {total_chunks}
+**Model Used:** {model_name}
+---
+{final_summary}
+---
+*Generated with Hugging Face Transformers*
+"""
+    # Save to file
+    base_name = os.path.splitext(os.path.basename(pdf_file.name))[0]
+    output_path = f"{base_name}_Summary.md"
+    try:
+        with open(output_path, "w", encoding="utf-8") as f:
+            f.write(formatted_summary)
+    except Exception as e:
+        yield f"❌ Error saving file: {str(e)}\n\n{formatted_summary}", None
+        return
+    yield formatted_summary, output_path
+# --- GRADIO UI DESIGN ---
+with gr.Blocks(theme=gr.themes.Soft(), title="PDF Summarizer") as demo:
+    gr.Markdown("""
+    # 📚 AI-Powered PDF Summarizer
+    Upload any PDF document and get an intelligent, comprehensive summary using state-of-the-art AI models.
+    Perfect for research papers, textbooks, reports, and study materials!
+    """)
+    with gr.Row():
+        with gr.Column(scale=1):
+            gr.Markdown("### 📤 Upload & Configure")
+            file_input = gr.File(
+                label="Upload PDF Document",
+                file_types=[".pdf"],
+                type="filepath"
+            )
+            model_dropdown = gr.Dropdown(
+                choices=[
+                    "BART (Fast, High Quality)",
+                    "Long-T5 (Better for Very Long Documents)"
+                ],
+                value="BART (Fast, High Quality)",
+                label="🤖 Select AI Model",
+                info="BART is faster and works great for most documents"
+            )
+            summary_style = gr.Radio(
+                choices=["Bullet Points", "Paragraph"],
+                value="Bullet Points",
+                label="📝 Summary Style",
+                info="Choose how you want the summary formatted"
+            )
+            with gr.Accordion("⚙️ Advanced Settings", open=False):
+                gr.Markdown("*Adjust these settings for fine-tuned control*")
+                chunk_size = gr.Slider(
+                    minimum=1000,
+                    maximum=8000,
+                    value=3000,
+                    step=500,
+                    label="Chunk Size (characters)",
+                    info="Larger chunks = more context but slower processing"
+                )
+                chunk_overlap = gr.Slider(
+                    minimum=0,
+                    maximum=1000,
+                    value=200,
+                    step=50,
+                    label="Chunk Overlap (characters)",
+                    info="Overlap helps maintain context between chunks"
+                )
+                max_length = gr.Slider(
+                    minimum=50,
+                    maximum=500,
+                    value=150,
+                    step=10,
+                    label="Max Summary Length (tokens)",
+                    info="Maximum length for each section summary"
+                )
+                min_length = gr.Slider(
+                    minimum=10,
+                    maximum=100,
+                    value=30,
+                    step=5,
+                    label="Min Summary Length (tokens)",
+                    info="Minimum length for each section summary"
+                )
+            run_btn = gr.Button("🚀 Generate Summary", variant="primary", size="lg")
+            gr.Markdown("""
+            ---
+            ### 💡 Tips:
+            - **Best results**: Use clear, text-based PDFs
+            - **Large files**: May take a few minutes to process
+            - **Very long docs**: Try Long-T5 model for better results
+            """)
+        with gr.Column(scale=2):
+            gr.Markdown("### 📊 Results")
+            output_text = gr.Markdown(
+                label="Generated Summary",
+                value="*Your summary will appear here...*"
+            )
+            file_output = gr.File(
+                label="📥 Download Summary (.md)",
+                interactive=False
+            )
+            gr.Markdown("""
+            ---
+            ### ℹ️ About the Models:
+            **BART (facebook/bart-large-cnn)**
+            - Fast and efficient
+            - Excellent for general documents
+            - Great summary quality
+            **Long-T5 (google/long-t5-tglobal-base)**
+            - Handles very long documents
+            - Better for academic papers
+            - Slightly slower but more comprehensive
+            """)
+    # Connect the button to the processing function
+    run_btn.click(
+        fn=process_pdf,
+        inputs=[
+            file_input,
+            model_dropdown,
+            chunk_size,
+            chunk_overlap,
+            max_length,
+            min_length,
+            summary_style
+        ],
+        outputs=[output_text, file_output]
+    )
+    gr.Markdown("""
+    ---
+    ### 🔒 Privacy Notice
+    Your documents are processed securely and are not stored permanently.
+    Made with ❤️ using Hugging Face Transformers
+    """)
+if __name__ == "__main__":
+    demo.queue(max_size=10).launch(
+        server_name="0.0.0.0",
+        server_port=7860,
+        share=False
+    )
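
`summarize_chunk` scales `max_length` and `min_length` to the chunk's word count before calling the pipeline. A standalone sketch of that scaling with explicit lower bounds follows; the helper name `clamp_summary_lengths` is hypothetical and is not part of `app.py`. The point is that a very short chunk (for example the combined summaries of a small document) should still produce limits of at least 1.

```python
def clamp_summary_lengths(max_length, min_length, chunk_words):
    """Return (max_len, min_len) scaled to the chunk, with min_len >= 1 and max_len >= 2."""
    # Never ask for a summary longer than about half the chunk, but keep a floor of 2.
    max_len = max(2, min(int(max_length), chunk_words // 2))
    # Keep the minimum well below the maximum, but never below 1.
    min_len = max(1, min(int(min_length), max_len - 10))
    return max_len, min_len
```

For a 600-word chunk with the default sliders this returns `(150, 30)`; for a 12-word chunk it returns `(6, 1)` rather than a negative minimum. The `int(...)` casts are a precaution in case slider values arrive as floats.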

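The `process_pdf` docstring describes the flow as Extract → Chunk → Summarize → Synthesize. The last two steps can be sketched independently of any model, as below. This is an illustration under assumptions: `map_reduce_summarize` and `first_sentence` are hypothetical names, and `first_sentence` is only a stand-in for a real summarization pipeline.

```python
from typing import Callable, List


def map_reduce_summarize(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Summarize every chunk, then summarize the joined partial summaries."""
    if not chunks:
        return ""
    partials = [summarize(chunk) for chunk in chunks]
    if len(partials) == 1:
        return partials[0]  # nothing to synthesize for a single chunk
    return summarize("\n\n".join(partials))


def first_sentence(text: str) -> str:
    """Stand-in summarizer: keep only the text up to the first '. ' separator."""
    head = text.strip().split(". ")[0]
    return head if head.endswith(".") else head + "."
```

Calling `map_reduce_summarize(chunks, first_sentence)` exercises the control flow without loading a model. One consequence of this design is that the final pass is subject to the same input limit as every other call, so a long list of partial summaries may be truncated by the model.
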
requirements.txt ADDED

@@ -0,0 +1,8 @@

+gradio==4.44.0
+transformers==4.36.2
+torch==2.1.2
+PyMuPDF==1.23.8
+langchain-text-splitters==0.0.1
+sentencepiece==0.1.99
+protobuf==4.25.1
+accelerate==0.25.0
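
Every dependency above is pinned with `==`, so a local environment can be checked against the file before running `python app.py`. The following is a minimal sketch using only the standard library; `parse_pins` and `find_mismatches` are hypothetical helper names, and the parser handles only simple `name==version` lines like the ones in this file.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_pins(requirements_text):
    """Return {package: pinned_version} for each 'name==version' line."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blanks and unpinned requirements
        name, pinned = line.split("==", 1)
        pins[name.strip()] = pinned.strip()
    return pins


def find_mismatches(pins):
    """Return {package: (pinned, installed_or_None)} for pins the environment does not satisfy."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

A typical use would be `find_mismatches(parse_pins(open("requirements.txt").read()))`, where an empty result means the environment matches the pins.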