Added details to README
README.md
CHANGED
|
@@ -1,2 +1,61 @@
|
|
| 1 |
# CoronaCentral BERT Model for Topic / Article Type Classification
|
| 2 |
|
| 3 |
+
This is the topic / article type classification model for the [CoronaCentral website](https://coronacentral.ai). It forms part of the pipeline for downloading and processing coronavirus literature described in the [corona-ml repo](https://github.com/jakelever/corona-ml), which has [step-by-step descriptions](https://github.com/jakelever/corona-ml/blob/master/stepByStep.md). The method is described in the [preprint](https://doi.org/10.1101/2020.12.21.423860), and detailed performance results can be found in the [machine learning details](https://github.com/jakelever/corona-ml/blob/master/machineLearningDetails.md) document.
|
| 4 |
+
|
| 5 |
+
The model is derived from the [microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract) model and fine-tuned for this sequence classification task.
|
| 6 |
+
|
| 7 |
+
## Usage
|
| 8 |
+
|
| 9 |
+
Below are two Google Colab notebooks with example usage of this sequence classification model using HuggingFace transformers and KTrain.
|
| 10 |
+
|
| 11 |
+
- [HuggingFace example on Google Colab](https://colab.research.google.com/drive/1cBNgKd4o6FNWwjKXXQQsC_SaX1kOXDa4?usp=sharing)
|
| 12 |
+
- [KTrain example on Google Colab](https://colab.research.google.com/drive/1h7oJa2NDjnBEoox0D5vwXrxiCHj3B1kU?usp=sharing)
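
For a quick look without opening the notebooks, here is a minimal sketch of loading the model with HuggingFace transformers. It rests on several assumptions that this README does not state: that the model is published under the hub ID `jakelever/coronabert`, that each category is scored independently (a sigmoid per category with a 0.5 cut-off), and that the model config carries readable names in `id2label`. The helper names `build_input` and `predict_categories` are illustrative only; the notebooks above remain the authoritative usage.

```python
MODEL_ID = "jakelever/coronabert"  # assumed hub ID, not stated in this README


def build_input(title, abstract):
    """Join title and abstract into one string separated by a newline."""
    return f"{title}\n{abstract}"


def predict_categories(title, abstract, model_id=MODEL_ID, threshold=0.5):
    """Return {category: score} for categories scoring above the threshold."""
    # Imported lazily so build_input works without these packages installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    inputs = tokenizer(
        build_input(title, abstract),
        truncation=True,
        max_length=512,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits[0]

    # Assumption: multi-label head, so each category gets its own sigmoid.
    scores = torch.sigmoid(logits).tolist()
    return {
        model.config.id2label[i]: score
        for i, score in enumerate(scores)
        if score > threshold
    }
```

Calling `predict_categories(title, abstract)` downloads the weights on first use, so it needs network access plus the `torch` and `transformers` packages.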
|
| 13 |
+
|
| 14 |
+
## Training Data
|
| 15 |
+
|
| 16 |
+
The model is trained on ~3200 manually curated articles sampled at various stages during the coronavirus pandemic. The training code is available in the [category_prediction](https://github.com/jakelever/corona-ml/tree/master/category_prediction) directory of the main GitHub repo. The data is available in the [annotated_documents.json.gz](https://github.com/jakelever/corona-ml/blob/master/category_prediction/annotated_documents.json.gz) file.
|
| 17 |
+
|
| 18 |
+
## Inputs and Outputs
|
| 19 |
+
|
| 20 |
+
The model takes a tokenized title and abstract as input (combined into a single string, separated by a newline). The outputs are topics and article types, collectively called categories in the pipeline code. The categories the model predicts are listed below. Some other categories are assigned by hand-coded rules described in the [step-by-step descriptions](https://github.com/jakelever/corona-ml/blob/master/stepByStep.md).
|
| 21 |
+
|
| 22 |
+
### List of Article Types
|
| 23 |
+
|
| 24 |
+
- Comment/Editorial
|
| 25 |
+
- Meta-analysis
|
| 26 |
+
- News
|
| 27 |
+
- Review
|
| 28 |
+
|
| 29 |
+
### List of Topics
|
| 30 |
+
|
| 31 |
+
- Clinical Reports
|
| 32 |
+
- Communication
|
| 33 |
+
- Contact Tracing
|
| 34 |
+
- Diagnostics
|
| 35 |
+
- Drug Targets
|
| 36 |
+
- Education
|
| 37 |
+
- Effect on Medical Specialties
|
| 38 |
+
- Forecasting & Modelling
|
| 39 |
+
- Health Policy
|
| 40 |
+
- Healthcare Workers
|
| 41 |
+
- Imaging
|
| 42 |
+
- Immunology
|
| 43 |
+
- Inequality
|
| 44 |
+
- Infection Reports
|
| 45 |
+
- Long Haul
|
| 46 |
+
- Medical Devices
|
| 47 |
+
- Misinformation
|
| 48 |
+
- Model Systems & Tools
|
| 49 |
+
- Molecular Biology
|
| 50 |
+
- Non-human
|
| 51 |
+
- Non-medical
|
| 52 |
+
- Pediatrics
|
| 53 |
+
- Prevalence
|
| 54 |
+
- Prevention
|
| 55 |
+
- Psychology
|
| 56 |
+
- Recommendations
|
| 57 |
+
- Risk Factors
|
| 58 |
+
- Surveillance
|
| 59 |
+
- Therapeutics
|
| 60 |
+
- Transmission
|
| 61 |
+
- Vaccines
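
Since topics and article types are both described as categories, downstream code may want to separate them again. The sketch below does this with the four article types listed above and treats every other category name as a topic. The function name `split_categories` is hypothetical and not part of the pipeline code.

```python
# Article types copied from the list above; any other category counts as a topic.
ARTICLE_TYPES = {"Comment/Editorial", "Meta-analysis", "News", "Review"}


def split_categories(categories):
    """Split category names into (article_types, topics), preserving order."""
    article_types = [c for c in categories if c in ARTICLE_TYPES]
    topics = [c for c in categories if c not in ARTICLE_TYPES]
    return article_types, topics


# Example with made-up predictions for a single article.
article_types, topics = split_categories(["Review", "Vaccines", "Immunology"])
print(article_types)  # ['Review']
print(topics)         # ['Vaccines', 'Immunology']
```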
|