SkimLit: NLP Model for Medical Abstracts
SkimLit is a natural language processing (NLP) project that aims to make medical abstracts easier to read. It replicates the methodology of the paper "PubMed 200K RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts" using TensorFlow and several deep learning techniques.
Project Overview
Section 1
Data Collection
- The PubMed 200K RCT dataset is cloned from the authors' GitHub repository. The commands below then change into the 20k subset directory, in which numbers are replaced with the @ sign:
git clone https://github.com/Franck-Dernoncourt/pubmed-rct
cd pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign
Data Preprocessing
- Sentences are extracted from the dataset, and numeric labels are assigned for machine learning models.
- Three baseline models are established to set the foundation for more complex models.
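The extraction and label-encoding steps above can be sketched in plain Python. This is a minimal sketch, not the project's own code: it assumes the file layout used in the pubmed-rct repository (an abstract starts with a `###<id>` line, each sentence line is `LABEL<TAB>text`, and a blank line ends the abstract). The function names `parse_abstracts` and `encode_labels`, the toy input, and the choice to store `total_lines` as the sentence count are illustrative assumptions.

```python
def parse_abstracts(lines):
    """Turn raw dataset lines into one dict per sentence (assumed file layout)."""
    samples = []
    current = []  # (label, text) pairs of the abstract currently being read

    def flush():
        # Emit every sentence of the finished abstract with its position.
        for i, (label, text) in enumerate(current):
            samples.append({
                "target": label,
                "text": text,
                "line_number": i,               # zero-based position in the abstract
                "total_lines": len(current),    # assumption: number of sentences
            })
        current.clear()

    for raw in lines:
        line = raw.strip()
        if line.startswith("###") or not line:
            flush()  # an ID line or a blank line closes the previous abstract
        else:
            label, text = line.split("\t", 1)
            current.append((label, text))
    flush()  # handle a file that does not end with a blank line
    return samples


def encode_labels(samples):
    """Map each string label to an integer id, using sorted class names."""
    classes = sorted({s["target"] for s in samples})
    index = {name: i for i, name in enumerate(classes)}
    return [index[s["target"]] for s in samples], classes


# Toy input written for this example, not taken from the dataset.
example_lines = [
    "###12345\n",
    "OBJECTIVE\tTo test the parser .\n",
    "METHODS\tWe used @ toy sentences .\n",
    "RESULTS\tIt worked .\n",
    "\n",
]
samples = parse_abstracts(example_lines)
label_ids, class_names = encode_labels(samples)
```

Keeping `line_number` and `total_lines` next to each sentence at parse time means the positional features used by the later models do not have to be recomputed.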
Baseline Model (Model 0)
- TF-IDF Multinomial Naive Bayes Classifier is implemented.
- Classification evaluation metrics such as accuracy, precision, recall, and F1-score are employed.
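To make the evaluation metrics concrete, here is a small dependency-free sketch that computes accuracy together with precision, recall, and F1. Averaging the per-class scores weighted by class support is an assumption made for this example; the project may average differently, and the function name `classification_scores` is hypothetical.

```python
from collections import Counter


def classification_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1 (a sketch)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

    support = Counter(y_true)      # how often each class really occurs
    predicted = Counter(y_pred)    # how often each class was predicted
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, count in support.items():
        p = true_pos[label] / predicted[label] if predicted[label] else 0.0
        r = true_pos[label] / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = count / n         # weight each class by its share of the data
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


scores = classification_scores([0, 0, 1, 1], [0, 1, 1, 1])
```

Returning all four values in one dictionary makes it straightforward to collect the results of several models side by side.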
Deep Sequence Models
Model 1: Conv1D with Token Embeddings
- Custom TextVectorizer and text embedding layers are created.
- Input pipelines are built with the TensorFlow tf.data API so that data loading is efficient.
Model 2: Pretrained Token Embeddings
- Universal Sentence Encoder (USE) from TensorFlow Hub is used for feature extraction.
Model 3: Conv1D with Character Embeddings
- Character-level tokenizer and embedding are implemented.
- Conv1D model is constructed using character embeddings.
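Character-level tokenization can be illustrated without any deep learning library. The sketch below is an assumption about the general idea, not the project's implementation: `split_chars`, `build_char_vocab`, and `encode_chars` are hypothetical helpers, id 0 is reserved for padding and id 1 for unknown characters, and spaces are dropped.

```python
def split_chars(text):
    """Insert a space between every character so each becomes its own token."""
    return " ".join(list(text))


def build_char_vocab(sentences):
    """Assign an integer id to every non-space character seen in the data."""
    chars = sorted({c for s in sentences for c in s if c != " "})
    # ids 0 and 1 are reserved: 0 = padding, 1 = unknown character
    return {c: i + 2 for i, c in enumerate(chars)}


def encode_chars(text, vocab, max_len):
    """Convert text to a fixed-length list of character ids."""
    ids = [vocab.get(c, 1) for c in text if c != " "][:max_len]
    return ids + [0] * (max_len - len(ids))  # right-pad with zeros


vocab = build_char_vocab(["ab", "ca"])
encoded = encode_chars("abz", vocab, max_len=5)
spaced = split_chars("abc")
```

The fixed-length integer sequences produced this way are the kind of input an embedding layer followed by a Conv1D layer would consume.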
Model 4: Hybrid Embedding Layer
- Token and character-level embeddings are combined using layers.Concatenate.
- A model is developed to process both types of embeddings and output label probabilities.
Model 5: Transfer Learning with Positional Embeddings
- Positional embeddings are introduced to enhance the model's understanding of the sequence.
- A tribrid embedding model is created, combining three kinds of input: token embeddings, character embeddings, and positional features (line_number and total_lines).
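One simple way to turn the positional features into model inputs is one-hot encoding, as in the sketch below. The depths 15 and 20 are hypothetical values chosen for illustration, and `one_hot` and `positional_features` are illustrative helpers. Positions beyond the depth map to an all-zero vector, which is a design choice of this sketch.

```python
LINE_NUMBER_DEPTH = 15   # hypothetical cap on encoded line positions
TOTAL_LINES_DEPTH = 20   # hypothetical cap on encoded abstract lengths


def one_hot(index, depth):
    """Return `depth` floats with 1.0 at `index`; all zeros if out of range."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


def positional_features(sample):
    """Encode a sample's line_number and total_lines as two one-hot vectors."""
    return (
        one_hot(sample["line_number"], LINE_NUMBER_DEPTH),
        one_hot(sample["total_lines"], TOTAL_LINES_DEPTH),
    )


line_vec, total_vec = positional_features({"line_number": 3, "total_lines": 11})
```

Capping the depth keeps the vectors short; the cost is that very long abstracts lose positional detail for their last lines.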
Model Evaluation and Comparison
- Models are evaluated on various datasets to compare their performance.
Save and Load Models
- Models are saved and loaded for future use.
Model Loading and Evaluation
- Pre-trained models are loaded and evaluated on validation datasets.
Test Dataset Processing and Prediction
- A test dataset is created, preprocessed, and used for making predictions with the loaded model.
Enriching Test Dataframe with Predictions
- Predictions and additional columns are added to the test dataframe for analysis.
Finding Top Wrong Predictions
- The top 100 most inaccurately predicted samples are identified.
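The selection of the most wrong predictions can be sketched as a filter followed by a sort. This sketch assumes that "most inaccurately predicted" means incorrect predictions made with the highest predicted probability, and the field names `target`, `prediction`, and `pred_prob` are hypothetical. It works on a list of dictionaries; the same idea applies to a dataframe.

```python
def top_wrong(rows, k=100):
    """Return the k incorrect predictions the model was most confident about."""
    wrong = [r for r in rows if r["target"] != r["prediction"]]
    return sorted(wrong, key=lambda r: r["pred_prob"], reverse=True)[:k]


# Toy rows written for this example.
rows = [
    {"text": "a", "target": "METHODS", "prediction": "RESULTS", "pred_prob": 0.91},
    {"text": "b", "target": "RESULTS", "prediction": "RESULTS", "pred_prob": 0.99},
    {"text": "c", "target": "OBJECTIVE", "prediction": "BACKGROUND", "pred_prob": 0.97},
    {"text": "d", "target": "METHODS", "prediction": "CONCLUSIONS", "pred_prob": 0.40},
]
worst = top_wrong(rows, k=2)
```

Confident mistakes are worth reading first because they often point to mislabeled samples or to sentences whose role is genuinely ambiguous.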
Investigating Top Wrong Predictions
- Detailed information on the top 10 wrong predictions is displayed.
Section 2
Example Abstracts
- Example abstracts are downloaded from a GitHub repository.
Processing Example Abstracts with spaCy
- spaCy is used to parse sentences from example abstracts.
One-Hot Encoding and Prediction on Example Abstracts
- Line numbers and total lines are one-hot encoded, and predictions are made using the loaded model.
Visualizing Predictions on Example Abstracts
- Predicted sequence labels for each line in the abstract are displayed.
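Displaying the predictions amounts to picking the most probable class for each sentence and printing it next to the text. The sketch below assumes the five PubMed RCT labels in alphabetical order and uses made-up probabilities; `label_lines` is a hypothetical helper, and the real class order depends on how the labels were encoded during training.

```python
# Assumed class order (alphabetical); check it against the label encoder used.
CLASS_NAMES = ["BACKGROUND", "CONCLUSIONS", "METHODS", "OBJECTIVE", "RESULTS"]


def label_lines(sentences, probs, class_names=CLASS_NAMES):
    """Prefix each sentence with the name of its highest-probability class."""
    labelled = []
    for sentence, row in zip(sentences, probs):
        best = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        labelled.append(f"{class_names[best]}: {sentence}")
    return labelled


# Toy sentences and probabilities written for this example.
sentences = ["We study x .", "Patients were randomized ."]
probs = [
    [0.10, 0.10, 0.10, 0.60, 0.10],
    [0.05, 0.05, 0.80, 0.05, 0.05],
]
for line in label_lines(sentences, probs):
    print(line)
```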
Conclusion
SkimLit provides a comprehensive exploration of NLP techniques for medical abstracts, from baseline models to sophisticated deep learning architectures. The models are evaluated, compared, and applied to real-world examples, offering insights into their strengths and limitations.
Feel free to explore the code, experiment with different models, and contribute to SkimLit.