metadata

title: Medical Document Summarizer
emoji: 🔥
colorFrom: yellow
colorTo: green
sdk: gradio
sdk_version: 5.23.2
app_file: app.py
pinned: false
license: mit
short_description: Upload your files and get a brief summary!

Medical Document Summarizer

This project automatically extracts and summarizes key information from clinical trial documents (e.g., PDF files of research articles). The pipeline uses the BigBird-Pegasus model for long-document summarization and adds content filtering, text cleaning, and post-processing to produce concise bullet-point and paragraph summaries.

Features

Note: Upload the medical document to the file directory before running the model.

PDF Extraction: Reads and filters PDF files to capture only pages with core content (e.g., Abstract, Methods, Results, Conclusions).
Text Cleaning: Removes noisy metadata, citations, and excess whitespace.
Core Section Extraction: Attempts to identify and extract important sections using regex; falls back to header removal when sections are not detected.
Chunking & Summarization: Splits the text into manageable chunks and uses the BigBird-Pegasus summarization model for each chunk.
Post-Processing: Formats the final summary as bullet points and as a neatly wrapped paragraph.
Modular and Extensible: Each step is modular, making it easy to adjust, extend, or integrate with other systems.
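
The cleaning, chunking, and post-processing steps listed above can be sketched in plain Python. This is a minimal illustration and not the code in app.py: the function names, the citation regex, the 400-word chunk limit, and the 80-column wrap width are all assumptions. The `summarizer` argument stands in for the BigBird-Pegasus model; any callable that maps a string to a summary string will do, for example a thin wrapper around a Transformers summarization pipeline.

```python
import re
import textwrap


def clean_text(text):
    """Remove bracketed numeric citations and collapse excess whitespace."""
    # Assumed citation style: [3], [1, 2], [4-6]
    text = re.sub(r"\[\d+(?:\s*[,-]\s*\d+)*\]", "", text)
    text = re.sub(r"\s+", " ", text)
    # Drop the space left behind before punctuation once a citation is removed
    text = re.sub(r"\s+([.,;:])", r"\1", text)
    return text.strip()


def chunk_text(text, max_words=400):
    """Group whole sentences into chunks of roughly max_words words.

    A single sentence longer than max_words becomes its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def summarize_chunks(chunks, summarizer):
    """Apply a summarizer callable (hypothetical stand-in for the model) to each chunk."""
    return [summarizer(chunk) for chunk in chunks]


def format_summary(summaries, width=80):
    """Return the summaries as a bullet list and as one wrapped paragraph."""
    parts = [s.strip() for s in summaries if s.strip()]
    bullets = "\n".join(f"- {p}" for p in parts)
    paragraph = textwrap.fill(" ".join(parts), width=width)
    return bullets, paragraph
```

Splitting on sentence boundaries rather than at a fixed character offset keeps each chunk readable on its own, which tends to matter when chunk summaries are later joined into a single paragraph.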

Requirements

Python 3.7+
spaCy with the en_core_web_sm model
NLTK (with the punkt tokenizer)
Transformers
PyMuPDF
BeautifulSoup4
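
A quick way to confirm the packages above are installed is to check that each one can be imported. The sketch below is illustrative only: the helper name `missing_packages` is hypothetical, and the import names are the usual ones for these packages (PyMuPDF is imported as `fitz`, BeautifulSoup4 as `bs4`).

```python
import importlib.util

# Display name -> import name (assumed mapping for the requirements listed above)
REQUIRED_MODULES = {
    "spaCy": "spacy",
    "NLTK": "nltk",
    "Transformers": "transformers",
    "PyMuPDF": "fitz",
    "BeautifulSoup4": "bs4",
}


def missing_packages(modules=None):
    """Return the display names of packages whose module cannot be found."""
    modules = REQUIRED_MODULES if modules is None else modules
    return [
        name
        for name, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

This check covers the packages only. The en_core_web_sm model and the punkt tokenizer data are downloaded separately, typically with `python -m spacy download en_core_web_sm` and `nltk.download("punkt")`.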

Installation

Clone the repository:

git clone https://github.com/yourusername/Medical_Doc_Summarization.git
cd Medical_Doc_Summarization