Kaiyeee committed on
Commit a628757 · verified · 1 Parent(s): 31ac0f5

Update README.md

Files changed (1)
  1. README.md +35 -4
README.md CHANGED
@@ -1,8 +1,8 @@
  ---
  title: Medical Document Summarizer
- emoji: 🦀
- colorFrom: indigo
- colorTo: red
  sdk: gradio
  sdk_version: 5.23.2
  app_file: app.py
@@ -11,4 +11,35 @@ license: mit
  short_description: Upload your files and get a brief summary!
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
  title: Medical Document Summarizer
+ emoji: 🔥
+ colorFrom: yellow
+ colorTo: green
  sdk: gradio
  sdk_version: 5.23.2
  app_file: app.py
 
  short_description: Upload your files and get a brief summary!
  ---
 
+ # Medical Document Summarizer
+
+ This project automatically extracts and summarizes key information from clinical trial documents (e.g., PDF files of research articles). The pipeline uses the BigBird-Pegasus model for long-form summarization and applies content filtering, text cleaning, and post-processing to produce concise bullet-point and paragraph summaries.
+
+ ## Features
+
+ *Note*: Upload the medical document to the file directory before running the model.
+
+ - **PDF Extraction:** Reads and filters PDF files to capture only pages with core content (e.g., Abstract, Methods, Results, Conclusions).
+ - **Text Cleaning:** Removes noisy metadata, citations, and excess whitespace.
+ - **Core Section Extraction:** Uses regular expressions to identify and extract important sections; falls back to header removal when no sections are detected.
+ - **Chunking & Summarization:** Splits the text into manageable chunks and runs the BigBird-Pegasus summarization model on each chunk.
+ - **Post-Processing:** Formats the final summary as bullet points and as a wrapped paragraph.
+ - **Modular and Extensible:** Each step is modular, making it easy to adjust, extend, or integrate with other systems.
+
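The cleaning, chunking, summarization, and post-processing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in `app.py`: the function names (`clean_text`, `chunk_text`, `summarize_chunks`, `format_summary`), the citation regex, and the 400-word chunk size are all assumptions made for the example. The summarizer is passed in as a plain callable so the sketch runs without downloading a model.

```python
import re
import textwrap
from typing import Callable, List, Tuple


def clean_text(text: str) -> str:
    """Remove bracketed numeric citations and collapse excess whitespace."""
    text = re.sub(r"\[\d+(?:\s*[,\-]\s*\d+)*\]", "", text)  # [3], [1, 2], [4-6]
    text = re.sub(r"\s+", " ", text)                        # newlines, double spaces
    text = re.sub(r"\s+([.,;:])", r"\1", text)              # gap left before punctuation
    return text.strip()


def chunk_text(text: str, max_words: int = 400) -> List[str]:
    """Greedily pack whole sentences into chunks of roughly max_words words.

    A single sentence longer than max_words becomes its own oversized chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if not sentence:
            continue
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def summarize_chunks(chunks: List[str], summarizer: Callable[[str], str]) -> List[str]:
    """Apply a summarizer callable to each chunk independently."""
    return [summarizer(chunk).strip() for chunk in chunks]


def format_summary(parts: List[str], width: int = 80) -> Tuple[str, str]:
    """Return the summary as a bullet list and as one wrapped paragraph."""
    bullets = "\n".join(f"- {part}" for part in parts)
    paragraph = textwrap.fill(" ".join(parts), width=width)
    return bullets, paragraph


# Hypothetical wiring to a real model (not run here; the checkpoint is an assumption):
#   from transformers import pipeline
#   hf = pipeline("summarization", model="google/bigbird-pegasus-large-pubmed")
#   summarizer = lambda chunk: hf(chunk, truncation=True)[0]["summary_text"]

if __name__ == "__main__":
    raw = "Patients  improved [1, 2].\n\nNo adverse events were reported [3]."
    first_sentence = lambda chunk: chunk.split(".")[0] + "."  # stand-in summarizer
    parts = summarize_chunks(chunk_text(clean_text(raw), max_words=5), first_sentence)
    bullets, paragraph = format_summary(parts)
    print(bullets)
    print(paragraph)
```

Keeping the summarizer as an injected callable means the chunking and formatting logic can be exercised with a stand-in function, and the model can be swapped later without touching the other steps.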
+ ## Requirements
+
+ - Python 3.7+
+ - [spaCy](https://spacy.io/) with the `en_core_web_sm` model
+ - [NLTK](https://www.nltk.org/) (with the `punkt` tokenizer)
+ - [Transformers](https://huggingface.co/transformers/)
+ - [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/)
+ - [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
+
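One way to confirm that the packages listed above are importable before running the app is a small check script. This is a sketch and not part of the repository: the helper name `missing_modules` is hypothetical. The module names are the usual import names for these packages (PyMuPDF is imported as `fitz`, BeautifulSoup4 as `bs4`).

```python
from importlib.util import find_spec
from typing import Iterable, List

# Import names for the requirements listed in the README.
REQUIRED_MODULES = ["spacy", "nltk", "transformers", "fitz", "bs4"]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found on this interpreter."""
    return [name for name in names if find_spec(name) is None]


# The model data is installed separately from the packages themselves:
#   python -m spacy download en_core_web_sm
#   python -c "import nltk; nltk.download('punkt')"

if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

The check only looks up each module without importing it, so it stays fast even for heavy packages such as Transformers.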
+ ## Installation
+
+ 1. **Clone the repository:**
+
+ ```bash
+ git clone https://github.com/yourusername/Medical_Doc_Summarization.git
+ cd Medical_Doc_Summarization
+ ```