# Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning

<details>
<summary>
<b>Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning</b>, Nat. Biomed. Eng (2022).
<a href="https://www.nature.com/articles/s41551-022-00936-9" target="_blank">[Paper]</a>
<br><em><a href="https://www.linkedin.com/in/ekin-tiu-0aa467200/">Ekin Tiu</a>, <a href="https://www.linkedin.com/in/ellie-talius/">Ellie Talius</a>, <a href="https://www.linkedin.com/in/pujanpatel24/">Pujan Patel</a>, <a href="https://med.stanford.edu/profiles/curtis-langlotz">Curtis P. Langlotz</a>, <a href="https://www.andrewng.org/">Andrew Y. Ng</a>, <a href="https://pranavrajpurkar.squarespace.com/">Pranav Rajpurkar</a></em><br>
</summary>

```bash
Tiu, E., Talius, E., Patel, P. et al. Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nat. Biomed. Eng (2022). https://doi.org/10.1038/s41551-022-00936-9
```
</details>

<img width="848" alt="Screen Shot 2022-09-15 at 10 57 16 AM" src="https://user-images.githubusercontent.com/12751529/190451160-a919b363-6005-4cd4-9633-b194392bd728.png">

This repository contains code to train a self-supervised learning model on chest X-ray images that lack explicit annotations and to evaluate the model's performance on pathology-classification tasks.

<details>
<summary>
<b>Main Findings</b>
</summary>

1. **Automatically detecting pathologies in chest x-rays without explicit annotations:** Our method learns directly from the combination of images and unstructured radiology reports, thereby avoiding time-consuming labeling efforts. The resulting model can predict multiple pathologies and differential diagnoses that it had not explicitly seen during training.
2. **Matching radiologist performance on an external test set:** Our method performed on par with radiologists when evaluated on an external validation set (CheXpert) of chest x-ray images labeled for the presence of 14 different conditions by multiple radiologists.
3. **Outperforming approaches trained on explicitly labeled data:** Using no labels, we outperformed a fully supervised approach (trained on 100% of labels) on 3 of 8 selected pathologies on a dataset (PadChest) collected in a different country. We further demonstrated high performance (AUC > 0.9) on 14 findings and an AUC of at least 0.700 on 53 of the 107 radiographic findings the method had not seen during training.
</details>

## Dependencies
To clone all files:

```bash
git clone https://github.com/rajpurkarlab/CheXzero.git
```

To install Python dependencies:

```bash
pip install -r requirements.txt
```

## Data
### Training Dataset
1. Download images from [MIMIC-CXR JPG](https://physionet.org/content/mimic-cxr-jpg/2.0.0/) and reports from the [MIMIC-CXR Database](https://physionet.org/content/mimic-cxr/2.0.0/). Note: to gain access to the data, you must be a credentialed user as defined on [PhysioNet](https://physionet.org/settings/credentialing/).
2. Copy the dataset into the `data/` directory.
3. Run `python run_preprocess.py`.
4. This preprocesses the chest x-ray images into a Hierarchical Data Format (HDF5) file used for training, stored at `data/cxr.h5`, and extracts the impressions section of each corresponding radiology report, stored at `data/mimic_impressions.csv` (a quick sanity check follows below).
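As a quick sanity check on the preprocessing outputs, the sketch below opens both files. The `cxr` dataset key is an assumption about how `run_preprocess.py` names the HDF5 dataset; inspect the printed keys if yours differs.

```python
# Minimal sanity check of the preprocessing outputs (the "cxr" dataset
# key is assumed; check f.keys() if your version names it differently).
import h5py
import pandas as pd

with h5py.File("data/cxr.h5", "r") as f:
    print(list(f.keys()))      # available dataset keys
    print(f["cxr"].shape)      # e.g. (num_images, height, width)

impressions = pd.read_csv("data/mimic_impressions.csv")
print(impressions.head())      # one impression per chest x-ray report
```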

### Evaluation Dataset

#### CheXpert Dataset
The CheXpert dataset consists of chest radiographic examinations from Stanford Hospital, performed between October 2002 and July 2017 in both inpatient and outpatient centers. Population-level characteristics are unavailable for the CheXpert test dataset, as they are used for official evaluation on the CheXpert leaderboard.

The main CheXpert data supporting the results of this study are available at https://aimi.stanford.edu/chexpert-chest-x-rays.

The CheXpert **test** dataset has recently been made public and can be obtained by following the steps in the [cheXpert-test-set-labels](https://github.com/rajpurkarlab/cheXpert-test-set-labels) repository.

#### PadChest Dataset
The PadChest dataset contains chest X-rays that were interpreted by 18 radiologists at the Hospital Universitario de San Juan, Alicante, Spain, from January 2009 to December 2017. The dataset contains 109,931 image studies, 168,861 images, and 206,222 study reports.

The [PadChest](https://arxiv.org/abs/1901.07441) dataset is publicly available at https://bimcv.cipf.es/bimcv-projects/padchest; those who would like to use it for experimentation should request access at that link.

### Model Checkpoints
Model checkpoints of CheXzero pre-trained on MIMIC-CXR are publicly available at the following [link](https://drive.google.com/drive/folders/1makFLiEMbSleYltaRxw81aBhEDMpVwno?usp=sharing). Download the files and save them in the `./checkpoints/chexzero_weights` directory.

## Running Training
Run the following command to perform CheXzero pretraining:
```bash
python run_train.py --cxr_filepath "./data/cxr.h5" --txt_filepath "data/mimic_impressions.csv"
```

### Arguments
* `--cxr_filepath` Path to the `.h5` file containing the chest x-ray image data.
* `--txt_filepath` Path to the `.csv` file containing the radiology report impressions text.

Use the `-h` flag to see all optional arguments.

## Zero-Shot Inference
See the following [notebook](https://github.com/rajpurkarlab/CheXzero/blob/main/notebooks/zero_shot.ipynb) for an example of how to use CheXzero to perform zero-shot inference on a chest x-ray dataset. The example shows how to output predictions from the model ensemble and how to evaluate the model's performance when ground truth labels are available.

```python
import zero_shot

# Compute predictions for a set of images, returned as a numpy array of
# probabilities for each pathology.
predictions, y_pred_avg = zero_shot.ensemble_models(
    model_paths=model_paths,
    cxr_filepath=cxr_filepath,
    cxr_labels=cxr_labels,
    cxr_pair_template=cxr_pair_template,
    cache_dir=cache_dir,
)
```
### Arguments
* `model_paths: List[str]`: List of paths to all checkpoints to be used in the ensemble. To run a single model, pass a list containing a single path.
* `cxr_filepath: str`: Path to the `.h5` file containing the images.
* `cxr_labels: List[str]`: List of pathologies to query in each image.
* `cxr_pair_template: Tuple[str, str]`: Contrasting templates used to query the model (see Figure 1 in the article for a visual explanation).
* `cache_dir: str`: Directory in which to cache the predictions of each checkpoint; use this to avoid recomputing predictions.
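
The arguments above are defined in the notebook; the values below are an illustrative sketch only (the checkpoint glob, file paths, label list, and prompt template are assumptions to adapt to your setup):

```python
import glob

# Illustrative argument values; adjust paths and labels to your setup.
model_paths = sorted(glob.glob("checkpoints/chexzero_weights/*.pt"))  # downloaded checkpoints
cxr_filepath = "data/chexpert_test.h5"
cxr_labels = ["Atelectasis", "Cardiomegaly", "Consolidation", "Edema", "Pleural Effusion"]
# Contrasting positive/negative templates; "{}" is filled with each pathology.
cxr_pair_template = ("{}", "no {}")
cache_dir = "cache/"
```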

In order to use CheXzero for zero-shot inference, ensure the following requirements are met:
* All input *`images`* must be stored in a single `.h5` (Hierarchical Data Format) file. See the [`img_to_h5`](https://github.com/rajpurkarlab/CheXzero/blob/main/preprocess_padchest.py#L156) function in [preprocess_padchest.py](https://github.com/rajpurkarlab/CheXzero/blob/main/preprocess_padchest.py) for an example of how to convert a list of paths to `.png` files into a valid `.h5` file; a minimal sketch follows this list.
* The *ground truth `labels`* must be in a `.csv` dataframe where each row represents an image sample and each column holds the binary label for a particular pathology.
* Ensure all [model checkpoints](https://drive.google.com/drive/folders/1makFLiEMbSleYltaRxw81aBhEDMpVwno?usp=sharing) are stored in `checkpoints/chexzero_weights/`, or in the `model_dir` specified in the notebook.
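
A minimal sketch of such a conversion, assuming grayscale images, a fixed square resize, and a `cxr` dataset key (the repository's `img_to_h5` may differ in these details):

```python
import h5py
import numpy as np
from PIL import Image

def images_to_h5(png_paths, out_path, size=320):
    """Pack a list of .png chest x-rays into a single HDF5 file."""
    with h5py.File(out_path, "w") as f:
        dset = f.create_dataset("cxr", shape=(len(png_paths), size, size), dtype="float32")
        for i, path in enumerate(png_paths):
            img = Image.open(path).convert("L").resize((size, size))  # grayscale, square
            dset[i] = np.asarray(img, dtype="float32")
```

For example, `images_to_h5(sorted(glob.glob("pngs/*.png")), "data/custom.h5")`.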

## Evaluation
Given a numpy array of predictions (obtained from zero-shot inference) and a numpy array of ground truth labels, one can evaluate the performance of the model using the following code:
```python
from typing import Tuple

import pandas as pd

import zero_shot
import eval

# Load the ground truth labels into memory.
test_pred = y_pred_avg
test_true = zero_shot.make_true_labels(cxr_true_labels_path=cxr_true_labels_path, cxr_labels=cxr_labels)

# Evaluate the model, without bootstrapping.
cxr_results: pd.DataFrame = eval.evaluate(test_pred, test_true, cxr_labels)  # eval on full test dataset

# Bootstrap evaluations for 95% confidence intervals.
bootstrap_results: Tuple[pd.DataFrame, pd.DataFrame] = eval.bootstrap(test_pred, test_true, cxr_labels)  # (df of results for each bootstrap, df of CIs)

# Print the results with confidence intervals.
print(bootstrap_results[1])
```
The results are returned as `pd.DataFrame`s, which can be saved as `.csv` files.
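For example (the output filename is arbitrary):

```python
# Persist the 95% confidence-interval summary for later inspection.
bootstrap_results[1].to_csv("bootstrap_ci.csv")
```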

### CheXpert Test Dataset
In order to replicate the results in the paper, zero-shot inference and evaluation can be performed on the now publicly available CheXpert test dataset.
1) Download the labels at [cheXpert-test-set-labels](https://github.com/rajpurkarlab/cheXpert-test-set-labels/blob/main/groundtruth.csv) and the image files from [Stanford AIMI](https://stanfordaimi.azurewebsites.net/datasets/23c56a0d-15de-405b-87c8-99c30138950c), and save them in the `./data` directory in `CheXzero/`. The test dataset images should have the following directory structure:
```
data/
├── CheXpert/
│   ├── test/
│   │   ├── patient64741/
│   │   │   ├── study1/
│   │   │   │   ├── view1_frontal.jpg
│   │   ├── .../
```

2) Run the `run_preprocess.py` script with the following arguments:
```bash
python run_preprocess.py --dataset_type "chexpert-test" --cxr_out_path "./data/chexpert_test.h5" --chest_x_ray_path "./data/CheXpert/test/"
```
This should save an `.h5` version of the test dataset images, which can be used for evaluation.

3) Open the sample zero-shot [notebook](https://github.com/rajpurkarlab/CheXzero/blob/main/notebooks/zero_shot.ipynb) and run all cells. If the directory structure is set up correctly, all cells should run without errors.

## Issues
Please open a new issue thread specifying the issue with the codebase, or report issues directly to ekintiu@stanford.edu.

## Citation
```bash
Tiu, E., Talius, E., Patel, P. et al. Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nat. Biomed. Eng (2022). https://doi.org/10.1038/s41551-022-00936-9
```

## License
The source code for this repository is licensed under the MIT license, which you can find in the `LICENSE` file. Also see `NOTICE.md` for attributions to third-party sources.