---
license: mit
---
# SkimLit: NLP Model for Medical Abstracts

SkimLit is a natural language processing (NLP) project aimed at making medical abstracts easier to read. It replicates the methodology of the paper "PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts" using TensorFlow and a range of deep learning techniques.

# Project Overview

# **`Section 1`**

## Data Collection

- The PubMed RCT dataset (the 20k subset with numbers replaced by `@` signs) is obtained from the paper author's GitHub repository using the following commands:

```shell
git clone https://github.com/Franck-Dernoncourt/pubmed-rct
cd pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign
```

## Data Preprocessing

- Sentences are extracted from the dataset, and numeric labels are assigned for the machine learning models.
- Three baseline datasets (train, validation, and test) are prepared to set the foundation for the models that follow.
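The extraction step above can be sketched in plain Python. This assumes the dataset's plain-text layout (`###<abstract_id>` header lines followed by `LABEL<TAB>sentence` lines, with a blank line closing each abstract); the function and field names are illustrative, not the project's exact ones:

```python
def preprocess_text_with_line_numbers(raw_text: str):
    """Parse PubMed-RCT-style text into one dict per sentence."""
    samples = []
    abstract_lines = []
    for line in raw_text.splitlines():
        if line.startswith("###"):        # a new abstract begins
            abstract_lines = []
        elif line.strip() == "":          # blank line: abstract finished
            for i, (label, sentence) in enumerate(abstract_lines):
                samples.append({
                    "target": label,
                    "text": sentence,
                    "line_number": i,                       # position within the abstract
                    "total_lines": len(abstract_lines) - 1, # index of the last line
                })
            abstract_lines = []
        else:
            label, sentence = line.split("\t", 1)
            abstract_lines.append((label, sentence))
    return samples


def encode_labels(samples):
    """Map each class name to an integer index (sorted for determinism)."""
    classes = sorted({s["target"] for s in samples})
    class_to_id = {c: i for i, c in enumerate(classes)}
    return [class_to_id[s["target"]] for s in samples], classes
```

Keeping `line_number` and `total_lines` alongside each sentence is what later makes the positional-embedding models (Model 5) possible without re-reading the raw files.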

## Baseline Model (Model 0)

- A TF-IDF Multinomial Naive Bayes classifier is implemented.
- Classification metrics such as accuracy, precision, recall, and F1-score are used for evaluation.
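One common way to wire up such a baseline is a scikit-learn pipeline; a minimal sketch (the toy sentences and labels below are made up for illustration and stand in for the real training split):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# TF-IDF features fed into a Multinomial Naive Bayes classifier.
model_0 = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# Toy stand-ins for the real (sentence, label) training pairs.
train_sentences = [
    "to investigate the effect of the drug",
    "patients were randomly assigned to two groups",
    "the treatment group showed significant improvement",
    "the drug appears safe and effective",
]
train_labels = ["OBJECTIVE", "METHODS", "RESULTS", "CONCLUSIONS"]

model_0.fit(train_sentences, train_labels)
preds = model_0.predict(["participants were assigned to groups"])
```

The pipeline form keeps vectorization and classification as one object, so the same `fit`/`predict`/`score` calls cover both steps.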

## Deep Sequence Models

### Model 1: Conv1D with Token Embeddings

- A custom `TextVectorization` layer and token embedding layer are created.
- Input pipelines are optimized with the TensorFlow `tf.data` API (batching and prefetching).
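A compressed sketch of that setup follows; the vocabulary size, sequence length, filter counts, and the five-class output are placeholder choices, not the project's exact values:

```python
import tensorflow as tf
from tensorflow.keras import layers

train_sentences = ["to investigate the effect", "patients were randomly assigned"]
train_labels = [0, 1]
num_classes = 5  # illustrative; matches the five PubMed RCT section labels

# Learn a vocabulary from the training sentences.
text_vectorizer = layers.TextVectorization(max_tokens=1000, output_sequence_length=15)
text_vectorizer.adapt(train_sentences)

# Token-embedding Conv1D model: string in, class probabilities out.
inputs = layers.Input(shape=(1,), dtype=tf.string)
x = text_vectorizer(inputs)
x = layers.Embedding(input_dim=1000, output_dim=32)(x)
x = layers.Conv1D(filters=32, kernel_size=5, padding="same", activation="relu")(x)
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(num_classes, activation="softmax")(x)
model_1 = tf.keras.Model(inputs, outputs)

# Efficient input pipeline with tf.data: batch, then prefetch.
train_ds = (tf.data.Dataset.from_tensor_slices((train_sentences, train_labels))
            .batch(2)
            .prefetch(tf.data.AUTOTUNE))
```

`prefetch(tf.data.AUTOTUNE)` lets the pipeline prepare the next batch while the current one trains, which is the main efficiency win the bullet refers to.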

### Model 2: Pretrained Token Embeddings

- The Universal Sentence Encoder (USE) from TensorFlow Hub is used as a feature extractor.

### Model 3: Conv1D with Character Embeddings

- A character-level tokenizer and embedding layer are implemented.
- A Conv1D model is constructed on top of the character embeddings.
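Character-level tokenization is often done by turning each sentence into space-separated characters, so the same `TextVectorization` machinery from Model 1 can be reused unchanged. A tiny helper in that spirit (the name is illustrative):

```python
def split_chars(text: str) -> str:
    """Turn a sentence into space-separated characters,
    e.g. "hello" -> "h e l l o"."""
    return " ".join(list(text))
```

Running every sentence through this helper before vectorization makes each "token" a single character, which is all the character-embedding branch needs.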

### Model 4: Hybrid Embedding Layer

- Token-level and character-level embeddings are combined with `layers.Concatenate`.
- A model is built to process both embedding types and output label probabilities.
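The combination step itself reduces to a `layers.Concatenate` over the two embedding branches. A stripped-down sketch, with the branch outputs stubbed as plain inputs and all dimensions as placeholders:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Stand-ins for the outputs of the token and character branches.
token_inputs = layers.Input(shape=(128,), name="token_embedding")
char_inputs = layers.Input(shape=(25,), name="char_embedding")

# Concatenate the two representations, then classify.
combined = layers.Concatenate(name="token_char_hybrid")([token_inputs, char_inputs])
x = layers.Dense(64, activation="relu")(combined)
outputs = layers.Dense(5, activation="softmax")(x)

model_4 = tf.keras.Model(inputs=[token_inputs, char_inputs], outputs=outputs)
```

In the full model each `Input` is replaced by a whole sub-network (vectorizer plus embedding plus encoder); the concatenation point is the only part that changes between the hybrid variants.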

### Model 5: Transfer Learning with Positional Embeddings

- Positional embeddings are introduced to tell the model where a sentence sits within its abstract.
- A tribrid embedding model is created, combining token, character, `line_number`, and `total_lines` features.
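The positional features are typically one-hot encoded with a fixed depth so every abstract yields same-sized vectors. A NumPy sketch of that idea; the depth cut-offs below (15 for `line_number`, clipping longer abstracts into the last bucket) are illustrative choices, not the project's exact values:

```python
import numpy as np

def one_hot_position(values, depth):
    """One-hot encode positional features, clipping anything beyond
    `depth` into the last bucket so very long abstracts still
    produce a fixed-size vector."""
    values = np.minimum(np.asarray(values), depth - 1)
    return np.eye(depth, dtype=np.float32)[values]

# line_number for four sentences; the 30 gets clipped into bucket 14.
line_number_one_hot = one_hot_position([0, 2, 14, 30], depth=15)
```

The same function applied with a different depth covers `total_lines`; both one-hot blocks are then concatenated with the token and character branches to form the tribrid input.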

## Model Evaluation and Comparison

- All models are evaluated on the same validation data so their performance can be compared directly.

## Saving and Loading Models

- Trained models are saved to disk, then loaded back and re-evaluated on the validation dataset to confirm the restored weights behave identically.

## Test Dataset Processing and Prediction

- A test dataset is created, preprocessed, and used for making predictions with the loaded model.

## Enriching the Test Dataframe with Predictions

- Predicted labels and supporting columns are added to the test dataframe for analysis.

## Finding the Top Wrong Predictions

- The 100 most confidently wrong predictions are identified.
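"Most confidently wrong" usually means filtering to the misclassified rows and sorting by the model's prediction probability. A pandas sketch with a toy stand-in for the enriched test dataframe (column names here are illustrative):

```python
import pandas as pd

# Toy stand-in for the enriched test dataframe.
test_df = pd.DataFrame({
    "text": ["s1", "s2", "s3", "s4"],
    "target": ["METHODS", "RESULTS", "METHODS", "OBJECTIVE"],
    "pred": ["METHODS", "METHODS", "RESULTS", "OBJECTIVE"],
    "pred_prob": [0.91, 0.88, 0.55, 0.97],
})

# Wrong predictions, sorted so the most *confident* mistakes come first.
top_wrong = (test_df[test_df["target"] != test_df["pred"]]
             .sort_values("pred_prob", ascending=False)
             .head(100))
```

Inspecting these rows is often more informative than aggregate metrics: many "errors" turn out to be plausibly ambiguous sentences or label noise in the dataset.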

## Investigating the Top Wrong Predictions

- Detailed information for the top 10 wrong predictions is displayed.

# **`Section 2`**

## Example Abstracts

- Example abstracts are downloaded from a GitHub repository.

## Processing Example Abstracts with spaCy

- spaCy is used to split the example abstracts into individual sentences.
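For sentence splitting alone, a blank spaCy pipeline with the rule-based `sentencizer` component is enough; no trained model download is required. A minimal sketch (the example abstract text is made up):

```python
import spacy

# Blank English pipeline + rule-based sentencizer: splits on
# punctuation without needing a full statistical model.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

abstract = "We studied 40 patients. Half received the drug. Outcomes improved."
doc = nlp(abstract)
sentences = [sent.text for sent in doc.sents]
```

Each resulting sentence, together with its line number and the abstract's total line count, is then fed to the trained model exactly like a dataset sample.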

## One-Hot Encoding and Prediction on Example Abstracts

- Line numbers and total line counts are one-hot encoded, and predictions are made with the loaded model.

## Visualizing Predictions on Example Abstracts

- The predicted sequence label for each line of the abstract is displayed.

# Conclusion

- SkimLit provides a comprehensive exploration of NLP techniques for medical abstracts, from baseline models to more sophisticated deep learning architectures. The models are evaluated, compared, and applied to real-world examples, offering insight into their strengths and limitations.

- Feel free to explore the code, experiment with different models, and contribute to the advancement of SkimLit.