---
license: mit
---
# SkimLit: NLP Model for Medical Abstracts

SkimLit is a natural language processing (NLP) project aimed at making medical abstracts easier to read. It replicates the methodology of the paper "PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts" using TensorFlow and a range of deep learning techniques.

# Project Overview

# **`Section 1`**

## Data Collection

- The PubMed RCT dataset (the 20k subset with numbers replaced by `@` signs) is obtained from the paper author's GitHub repository using the following commands:

```shell
git clone https://github.com/Franck-Dernoncourt/pubmed-rct
cd pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign
```

## Data Preprocessing

- Sentences are extracted from the dataset, and numeric labels are assigned for the machine learning models.
- Three baseline datasets (train, validation, and test) are prepared to set the foundation for the models that follow.
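The extraction step above can be sketched in plain Python. This assumes the dataset's plain-text layout (`###<abstract_id>` header lines followed by `LABEL<TAB>sentence` lines, with a blank line closing each abstract); the function and field names are illustrative, not the project's exact ones:

```python
def preprocess_text_with_line_numbers(raw_text: str):
    """Parse PubMed-RCT-style text into one dict per sentence."""
    samples = []
    abstract_lines = []
    for line in raw_text.splitlines():
        if line.startswith("###"):        # a new abstract begins
            abstract_lines = []
        elif line.strip() == "":          # blank line: abstract finished
            for i, (label, sentence) in enumerate(abstract_lines):
                samples.append({
                    "target": label,
                    "text": sentence,
                    "line_number": i,                       # position within the abstract
                    "total_lines": len(abstract_lines) - 1, # index of the last line
                })
            abstract_lines = []
        else:
            label, sentence = line.split("\t", 1)
            abstract_lines.append((label, sentence))
    return samples


def encode_labels(samples):
    """Map each class name to an integer index (sorted for determinism)."""
    classes = sorted({s["target"] for s in samples})
    class_to_id = {c: i for i, c in enumerate(classes)}
    return [class_to_id[s["target"]] for s in samples], classes
```

Keeping `line_number` and `total_lines` alongside each sentence is what later makes the positional-embedding models (Model 5) possible without re-reading the raw files.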

## Baseline Model (Model 0)

- A TF-IDF Multinomial Naive Bayes classifier is implemented.
- Classification metrics such as accuracy, precision, recall, and F1-score are used for evaluation.
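One common way to wire up such a baseline is a scikit-learn pipeline; a minimal sketch (the toy sentences and labels below are made up for illustration and stand in for the real training split):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# TF-IDF features fed into a Multinomial Naive Bayes classifier.
model_0 = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# Toy stand-ins for the real (sentence, label) training pairs.
train_sentences = [
    "to investigate the effect of the drug",
    "patients were randomly assigned to two groups",
    "the treatment group showed significant improvement",
    "the drug appears safe and effective",
]
train_labels = ["OBJECTIVE", "METHODS", "RESULTS", "CONCLUSIONS"]

model_0.fit(train_sentences, train_labels)
preds = model_0.predict(["participants were assigned to groups"])
```

The pipeline form keeps vectorization and classification as one object, so the same `fit`/`predict`/`score` calls cover both steps.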

## Deep Sequence Models

### Model 1: Conv1D with Token Embeddings

- A custom `TextVectorization` layer and token embedding layer are created.
- Input pipelines are optimized with the TensorFlow `tf.data` API (batching and prefetching).
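A compressed sketch of that setup follows; the vocabulary size, sequence length, filter counts, and the five-class output are placeholder choices, not the project's exact values:

```python
import tensorflow as tf
from tensorflow.keras import layers

train_sentences = ["to investigate the effect", "patients were randomly assigned"]
train_labels = [0, 1]
num_classes = 5  # illustrative; matches the five PubMed RCT section labels

# Learn a vocabulary from the training sentences.
text_vectorizer = layers.TextVectorization(max_tokens=1000, output_sequence_length=15)
text_vectorizer.adapt(train_sentences)

# Token-embedding Conv1D model: string in, class probabilities out.
inputs = layers.Input(shape=(1,), dtype=tf.string)
x = text_vectorizer(inputs)
x = layers.Embedding(input_dim=1000, output_dim=32)(x)
x = layers.Conv1D(filters=32, kernel_size=5, padding="same", activation="relu")(x)
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(num_classes, activation="softmax")(x)
model_1 = tf.keras.Model(inputs, outputs)

# Efficient input pipeline with tf.data: batch, then prefetch.
train_ds = (tf.data.Dataset.from_tensor_slices((train_sentences, train_labels))
            .batch(2)
            .prefetch(tf.data.AUTOTUNE))
```

`prefetch(tf.data.AUTOTUNE)` lets the pipeline prepare the next batch while the current one trains, which is the main efficiency win the bullet refers to.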

### Model 2: Pretrained Token Embeddings

- The Universal Sentence Encoder (USE) from TensorFlow Hub is used as a feature extractor.

### Model 3: Conv1D with Character Embeddings

- A character-level tokenizer and embedding layer are implemented.
- A Conv1D model is constructed on top of the character embeddings.
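Character-level tokenization is often done by turning each sentence into space-separated characters, so the same `TextVectorization` machinery from Model 1 can be reused unchanged. A tiny helper in that spirit (the name is illustrative):

```python
def split_chars(text: str) -> str:
    """Turn a sentence into space-separated characters,
    e.g. "hello" -> "h e l l o"."""
    return " ".join(list(text))
```

Running every sentence through this helper before vectorization makes each "token" a single character, which is all the character-embedding branch needs.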

### Model 4: Hybrid Embedding Layer

- Token-level and character-level embeddings are combined with `layers.Concatenate`.
- A model is built to process both embedding types and output label probabilities.
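The combination step itself reduces to a `layers.Concatenate` over the two embedding branches. A stripped-down sketch, with the branch outputs stubbed as plain inputs and all dimensions as placeholders:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Stand-ins for the outputs of the token and character branches.
token_inputs = layers.Input(shape=(128,), name="token_embedding")
char_inputs = layers.Input(shape=(25,), name="char_embedding")

# Concatenate the two representations, then classify.
combined = layers.Concatenate(name="token_char_hybrid")([token_inputs, char_inputs])
x = layers.Dense(64, activation="relu")(combined)
outputs = layers.Dense(5, activation="softmax")(x)

model_4 = tf.keras.Model(inputs=[token_inputs, char_inputs], outputs=outputs)
```

In the full model each `Input` is replaced by a whole sub-network (vectorizer plus embedding plus encoder); the concatenation point is the only part that changes between the hybrid variants.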

### Model 5: Transfer Learning with Positional Embeddings

- Positional embeddings are introduced to tell the model where a sentence sits within its abstract.
- A tribrid embedding model is created, combining token, character, `line_number`, and `total_lines` features.
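The positional features are typically one-hot encoded with a fixed depth so every abstract yields same-sized vectors. A NumPy sketch of that idea; the depth cut-offs below (15 for `line_number`, clipping longer abstracts into the last bucket) are illustrative choices, not the project's exact values:

```python
import numpy as np

def one_hot_position(values, depth):
    """One-hot encode positional features, clipping anything beyond
    `depth` into the last bucket so very long abstracts still
    produce a fixed-size vector."""
    values = np.minimum(np.asarray(values), depth - 1)
    return np.eye(depth, dtype=np.float32)[values]

# line_number for four sentences; the 30 gets clipped into bucket 14.
line_number_one_hot = one_hot_position([0, 2, 14, 30], depth=15)
```

The same function applied with a different depth covers `total_lines`; both one-hot blocks are then concatenated with the token and character branches to form the tribrid input.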

## Model Evaluation and Comparison

- All models are evaluated on the same validation data so their performance can be compared directly.

## Saving and Loading Models

- Trained models are saved to disk, then loaded back and re-evaluated on the validation dataset to confirm the restored weights behave identically.

## Test Dataset Processing and Prediction

- A test dataset is created, preprocessed, and used for making predictions with the loaded model.

## Enriching the Test Dataframe with Predictions

- Predicted labels and supporting columns are added to the test dataframe for analysis.

## Finding the Top Wrong Predictions

- The 100 most confidently wrong predictions are identified.
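"Most confidently wrong" usually means filtering to the misclassified rows and sorting by the model's prediction probability. A pandas sketch with a toy stand-in for the enriched test dataframe (column names here are illustrative):

```python
import pandas as pd

# Toy stand-in for the enriched test dataframe.
test_df = pd.DataFrame({
    "text": ["s1", "s2", "s3", "s4"],
    "target": ["METHODS", "RESULTS", "METHODS", "OBJECTIVE"],
    "pred": ["METHODS", "METHODS", "RESULTS", "OBJECTIVE"],
    "pred_prob": [0.91, 0.88, 0.55, 0.97],
})

# Wrong predictions, sorted so the most *confident* mistakes come first.
top_wrong = (test_df[test_df["target"] != test_df["pred"]]
             .sort_values("pred_prob", ascending=False)
             .head(100))
```

Inspecting these rows is often more informative than aggregate metrics: many "errors" turn out to be plausibly ambiguous sentences or label noise in the dataset.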

## Investigating the Top Wrong Predictions

- Detailed information for the top 10 wrong predictions is displayed.

# **`Section 2`**

## Example Abstracts

- Example abstracts are downloaded from a GitHub repository.

## Processing Example Abstracts with spaCy

- spaCy is used to split the example abstracts into individual sentences.
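For sentence splitting alone, a blank spaCy pipeline with the rule-based `sentencizer` component is enough; no trained model download is required. A minimal sketch (the example abstract text is made up):

```python
import spacy

# Blank English pipeline + rule-based sentencizer: splits on
# punctuation without needing a full statistical model.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

abstract = "We studied 40 patients. Half received the drug. Outcomes improved."
doc = nlp(abstract)
sentences = [sent.text for sent in doc.sents]
```

Each resulting sentence, together with its line number and the abstract's total line count, is then fed to the trained model exactly like a dataset sample.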

## One-Hot Encoding and Prediction on Example Abstracts

- Line numbers and total line counts are one-hot encoded, and predictions are made with the loaded model.

## Visualizing Predictions on Example Abstracts

- The predicted sequence label for each line of the abstract is displayed.

# Conclusion

- SkimLit provides a comprehensive exploration of NLP techniques for medical abstracts, from baseline models to more sophisticated deep learning architectures. The models are evaluated, compared, and applied to real-world examples, offering insight into their strengths and limitations.

- Feel free to explore the code, experiment with different models, and contribute to the advancement of SkimLit.