metadata
title: Medical Document Summarizer
emoji: 🔥
colorFrom: yellow
colorTo: green
sdk: gradio
sdk_version: 5.23.2
app_file: app.py
pinned: false
license: mit
short_description: Upload your files and get a brief summary!
Medical Document Summarizer
This project automatically extracts and summarizes key information from clinical trial documents (e.g., PDF files of research articles). The pipeline uses the BigBird-Pegasus model for long-form summarization and adds content filtering, text cleaning, and post-processing to produce concise bullet-point and paragraph summaries.
Features
Note: Upload the medical document to the file directory before running the model.
- PDF Extraction: Reads and filters PDF files to capture only pages with core content (e.g., Abstract, Methods, Results, Conclusions).
- Text Cleaning: Removes noisy metadata, citations, and excess whitespace.
- Core Section Extraction: Attempts to identify and extract important sections using regex; falls back to header removal when sections are not detected.
- Chunking & Summarization: Splits the text into manageable chunks and uses the BigBird-Pegasus summarization model for each chunk.
- Post-Processing: Formats the final summary as bullet points and as a neatly wrapped paragraph.
- Modular and Extensible: Each step is modular, making it easy to adjust, extend, or integrate with other systems.
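The steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: every function name, the keyword list, and the citation regex are assumptions made for the example. A regex sentence splitter stands in for the NLTK punkt tokenizer, and `summarize_fn` is a placeholder for whatever callable wraps the BigBird-Pegasus model. PDF text extraction (e.g., with PyMuPDF) is assumed to have already produced one string per page.

```python
import re
import textwrap
from typing import Callable, Dict, List

# Hypothetical keyword list; the real project may filter on a different set.
CORE_KEYWORDS = ("abstract", "methods", "results", "conclusions")


def filter_core_pages(pages: List[str]) -> List[str]:
    """Keep only pages that mention at least one core section keyword."""
    pattern = re.compile(r"\b(" + "|".join(CORE_KEYWORDS) + r")\b", re.IGNORECASE)
    return [page for page in pages if pattern.search(page)]


def clean_text(text: str) -> str:
    """Remove bracketed numeric citations such as [12] or [3, 4] and excess whitespace."""
    text = re.sub(r"\[\d+(?:\s*[,-]\s*\d+)*\]", "", text)
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"\s+([.,;:])", r"\1", text)  # no space left before punctuation
    return text.strip()


def chunk_text(text: str, max_words: int = 400) -> List[str]:
    """Group whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words still becomes its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text)  # stand-in for NLTK punkt
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        if not sentence:
            continue
        n_words = len(sentence.split())
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


def format_summary(chunk_summaries: List[str], width: int = 80) -> Dict[str, object]:
    """Return the summary both as bullet points and as one wrapped paragraph."""
    parts = [s.strip() for s in chunk_summaries if s.strip()]
    return {
        "bullets": [f"- {part}" for part in parts],
        "paragraph": textwrap.fill(" ".join(parts), width=width),
    }


def summarize_document(
    pages: List[str],
    summarize_fn: Callable[[str], str],
    max_words: int = 400,
) -> Dict[str, object]:
    """Run filter -> clean -> chunk -> summarize -> format on extracted page text."""
    text = clean_text(" ".join(filter_core_pages(pages)))
    chunks = chunk_text(text, max_words=max_words)
    return format_summary([summarize_fn(chunk) for chunk in chunks])


def first_sentence(chunk: str) -> str:
    """Toy summarizer used only for demonstration: returns the first sentence."""
    return re.split(r"(?<=[.!?])\s+", chunk)[0]


demo = summarize_document(
    ["Results: Drug A reduced symptoms [1]. Side effects were mild.", "Acknowledgements"],
    first_sentence,
)
print("\n".join(demo["bullets"]))
```

Passing the summarizer in as a callable keeps each stage independent, so the toy `first_sentence` function can be swapped for a model-backed one without touching the filtering, cleaning, or chunking code.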
Requirements
- Python 3.7+
- spaCy with the en_core_web_sm model
- NLTK (with the punkt tokenizer)
- Transformers
- PyMuPDF
- BeautifulSoup4
Installation
Clone the repository:
git clone https://github.com/yourusername/Medical_Doc_Summarization.git
cd Medical_Doc_Summarization