# Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning

<details>
<summary>
<b>Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning</b>, Nat. Biomed. Eng. (2022).
<a href="https://www.nature.com/articles/s41551-022-00936-9" target="_blank">[Paper]</a>
<br><em><a href="https://www.linkedin.com/in/ekin-tiu-0aa467200/">Ekin Tiu</a>, <a href="https://www.linkedin.com/in/ellie-talius/">Ellie Talius</a>, <a href="https://www.linkedin.com/in/pujanpatel24/">Pujan Patel</a>, <a href="https://med.stanford.edu/profiles/curtis-langlotz">Curtis P. Langlotz</a>, <a href="https://www.andrewng.org/">Andrew Y. Ng</a>, <a href="https://pranavrajpurkar.squarespace.com/">Pranav Rajpurkar</a></em>
</summary>

```bash
Tiu, E., Talius, E., Patel, P. et al. Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nat. Biomed. Eng (2022). https://doi.org/10.1038/s41551-022-00936-9
```
</details>

<img width="848" alt="Overview of CheXzero: self-supervised training on chest X-ray images paired with radiology reports, followed by zero-shot pathology classification" src="https://user-images.githubusercontent.com/12751529/190451160-a919b363-6005-4cd4-9633-b194392bd728.png">

This repository contains code to train a self-supervised learning model on chest X-ray images that lack explicit annotations, and to evaluate the model's performance on pathology-classification tasks.

<details>
<summary>
<b>Main Findings</b>
</summary>

1. **Automatically detecting pathologies in chest X-rays without explicit annotations:** Our method learns directly from the combination of images and unstructured radiology reports, thereby avoiding time-consuming labeling efforts. It can predict multiple pathologies and differential diagnoses that it had not explicitly seen during training.
2. **Matching radiologist performance on different tasks on an external test set:** Our method performed on par with radiologists when evaluated on an external validation set (CheXpert) of chest X-ray images labeled for the presence of 14 different conditions by multiple radiologists.
3. **Outperforming approaches that train on explicitly labeled data on an external test set:** Using no labels, we outperformed a fully supervised approach (trained on 100% of the labels) on 3 of the 8 selected pathologies on a dataset (PadChest) collected in a different country. We further demonstrated high performance (AUC > 0.9) on 14 findings and an AUC of at least 0.700 on 53 of 107 radiographic findings that the method had not seen during training.
</details>

## Dependencies
To clone all files:

```bash
git clone https://github.com/rajpurkarlab/CheXzero.git
```

To install Python dependencies:

```bash
pip install -r requirements.txt
```

## Data
### Training Dataset
1. Download images from [MIMIC-CXR-JPG](https://physionet.org/content/mimic-cxr-jpg/2.0.0/) and reports from the [MIMIC-CXR Database](https://physionet.org/content/mimic-cxr/2.0.0/). Note: in order to gain access to the data, you must be a credentialed user as defined on [PhysioNet](https://physionet.org/settings/credentialing/).
2. Copy the dataset into the `data/` directory.
3. Run `python run_preprocess.py`.
4. This preprocesses the chest X-ray images into a Hierarchical Data Format (HDF5) file used for training, stored at `data/cxr.h5`, and extracts the impressions section of each corresponding radiology report, stored at `data/mimic_impressions.csv`. A quick sanity check of both outputs is sketched after this list.

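To verify the preprocessing outputs, both files can be inspected with `h5py` and `pandas`. This is a minimal sketch assuming only the file locations named above; the dataset keys inside `cxr.h5` and the CSV columns are not documented in this README, so the sketch simply prints whatever it finds:

```python
import h5py
import pandas as pd

# list every dataset that run_preprocess.py wrote into the HDF5 file
with h5py.File('./data/cxr.h5', 'r') as f:
    for key in f.keys():
        print(key, f[key].shape, f[key].dtype)

# peek at the extracted impression sections
impressions = pd.read_csv('./data/mimic_impressions.csv')
print(impressions.head())
```
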
### Evaluation Dataset

#### CheXpert Dataset
The CheXpert dataset consists of chest radiographic examinations from Stanford Hospital, performed between October 2002 and July 2017 in both inpatient and outpatient centers. Population-level characteristics are unavailable for the CheXpert test dataset, as they are used for official evaluation on the CheXpert leaderboard.

The main data (CheXpert data) supporting the results of this study are available at https://aimi.stanford.edu/chexpert-chest-x-rays.

The CheXpert **test** dataset has recently been made public, and can be found by following the steps in the [cheXpert-test-set-labels](https://github.com/rajpurkarlab/cheXpert-test-set-labels) repository.

#### PadChest Dataset
The PadChest dataset contains chest X-rays that were interpreted by 18 radiologists at the Hospital Universitario de San Juan, Alicante, Spain, from January 2009 to December 2017. The dataset contains 109,931 image studies, 168,861 images, and 206,222 study reports.

The [PadChest dataset](https://arxiv.org/abs/1901.07441) is publicly available at https://bimcv.cipf.es/bimcv-projects/padchest; those who would like to use it for experimentation should request access at that link.

### Model Checkpoints
Model checkpoints of CheXzero pre-trained on MIMIC-CXR are publicly available at the following [link](https://drive.google.com/drive/folders/1makFLiEMbSleYltaRxw81aBhEDMpVwno?usp=sharing). Download the files and save them in the `./checkpoints/chexzero_weights` directory.

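The downloaded checkpoints can then be collected into the `model_paths` list that the zero-shot ensemble below expects. A minimal sketch, assuming the checkpoints are `.pt` files saved in the directory above:

```python
from pathlib import Path

# gather every downloaded checkpoint for use as `model_paths`
# in zero_shot.ensemble_models (see Zero-Shot Inference below)
model_paths = sorted(str(p) for p in Path('./checkpoints/chexzero_weights').glob('*.pt'))
print(f'found {len(model_paths)} checkpoints')
```
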
## Running Training
Run the following command to perform CheXzero pretraining:
```bash
python run_train.py --cxr_filepath "./data/cxr.h5" --txt_filepath "data/mimic_impressions.csv"
```

### Arguments
* `--cxr_filepath` Path to the chest X-ray image data (`.h5` file).
* `--txt_filepath` Path to the radiology report impressions text (`.csv` file).

Use the `-h` flag to see all optional arguments.

## Zero-Shot Inference
See the following [notebook](https://github.com/rajpurkarlab/CheXzero/blob/main/notebooks/zero_shot.ipynb) for an example of how to use CheXzero to perform zero-shot inference on a chest X-ray dataset. The example shows how to output predictions from the model ensemble and how to evaluate the model's performance when ground-truth labels are available.

```python
import zero_shot

# compute predictions for a set of images, stored as numpy arrays of
# probabilities for each pathology; y_pred_avg averages over the ensemble
predictions, y_pred_avg = zero_shot.ensemble_models(
    model_paths=model_paths,              # checkpoint paths for the ensemble
    cxr_filepath=cxr_filepath,            # path to the images' .h5 file
    cxr_labels=cxr_labels,                # pathologies to query in each image
    cxr_pair_template=cxr_pair_template,  # contrasting prompt pair
    cache_dir=cache_dir,                  # cache for per-checkpoint predictions
)
```
### Arguments
* `model_paths: List[str]`: List of paths to all checkpoints to be used in the ensemble. To run a single model, pass a list containing a single path.
* `cxr_filepath: str`: Path to the `.h5` file of images.
* `cxr_labels: List[str]`: List of pathologies to query in each image.
* `cxr_pair_template: Tuple[str, str]`: Contrasting templates used to query the model (see Figure 1 in the article for a visual explanation).
* `cache_dir: str`: Directory in which to cache each checkpoint's predictions; use it to avoid recomputing them.

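For concreteness, here is a minimal configuration sketch; the label list and prompt pair below are illustrative values rather than ones prescribed by this README:

```python
# illustrative values only; substitute your own labels and paths
cxr_filepath = './data/chexpert_test.h5'   # images packed by run_preprocess.py
cxr_labels = ['Atelectasis', 'Cardiomegaly', 'Consolidation', 'Edema', 'Pleural Effusion']
cxr_pair_template = ('{}', 'no {}')        # positive vs. negated prompt per pathology
cache_dir = './cache'                      # per-checkpoint prediction cache
```
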
In order to use CheXzero for zero-shot inference, ensure the following requirements are met:
* All input *`images`* must be stored in a single `.h5` file (Hierarchical Data Format). See the [`img_to_h5`](https://github.com/rajpurkarlab/CheXzero/blob/main/preprocess_padchest.py#L156) function in [preprocess_padchest.py](https://github.com/rajpurkarlab/CheXzero/blob/main/preprocess_padchest.py) for an example of how to convert a list of paths to `.png` files into a valid `.h5` file, and the sketch after this list for the general idea.
* The *ground-truth `labels`* must be in a `.csv` dataframe where each row represents an image sample and each column holds the binary label for a particular pathology.
* Ensure all [model checkpoints](https://drive.google.com/drive/folders/1makFLiEMbSleYltaRxw81aBhEDMpVwno?usp=sharing) are stored in `checkpoints/chexzero_weights/`, or in the `model_dir` specified in the notebook.

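A minimal sketch of such a conversion, assuming grayscale `.png` inputs resized to a fixed resolution; the dataset key name `cxr` and the 320-pixel size are illustrative assumptions, and the repository's `img_to_h5` remains the authoritative version:

```python
import h5py
import numpy as np
from PIL import Image

def paths_to_h5(png_paths, out_path, size=320):
    """Pack a list of .png chest X-rays into a single .h5 file.

    The dataset name 'cxr' and the image size are assumptions for
    illustration; see img_to_h5 in preprocess_padchest.py for the
    repository's actual conversion.
    """
    with h5py.File(out_path, 'w') as f:
        dset = f.create_dataset('cxr', shape=(len(png_paths), size, size), dtype=np.float32)
        for i, path in enumerate(png_paths):
            img = Image.open(path).convert('L').resize((size, size))
            dset[i] = np.asarray(img, dtype=np.float32)
```
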
## Evaluation
Given a numpy array of predictions (obtained from zero-shot inference) and a numpy array of ground-truth labels, the model's performance can be evaluated with the following code:
```python
from typing import Tuple

import pandas as pd

import zero_shot
import eval

# load the ground truth labels into memory
test_pred = y_pred_avg
test_true = zero_shot.make_true_labels(cxr_true_labels_path=cxr_true_labels_path, cxr_labels=cxr_labels)

# evaluate the model without bootstrapping
cxr_results: pd.DataFrame = eval.evaluate(test_pred, test_true, cxr_labels)  # eval on the full test dataset

# bootstrap evaluations for 95% confidence intervals
bootstrap_results: Tuple[pd.DataFrame, pd.DataFrame] = eval.bootstrap(test_pred, test_true, cxr_labels)  # (df of results for each bootstrap, df of CIs)

# print results with confidence intervals
print(bootstrap_results[1])
```
The results are returned as a `pd.DataFrame`, which can be saved as a `.csv`.

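For example, to persist the confidence intervals (the filename is arbitrary):

```python
# save the 95% confidence intervals to disk
bootstrap_results[1].to_csv('bootstrap_cis.csv')
```
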
### CheXpert Test Dataset
To replicate the results in the paper, zero-shot inference and evaluation can be performed on the now publicly available CheXpert test dataset.
1) Download the labels at [cheXpert-test-set-labels](https://github.com/rajpurkarlab/cheXpert-test-set-labels/blob/main/groundtruth.csv) and the image files from [Stanford AIMI](https://stanfordaimi.azurewebsites.net/datasets/23c56a0d-15de-405b-87c8-99c30138950c), and save them in the `./data` directory in `CheXzero/`. The test dataset images should have the following directory structure:
```
data/
├─ CheXpert/
│  ├─ test/
│  │  ├─ patient64741/
│  │  │  ├─ study1/
│  │  │  │  ├─ view1_frontal.jpg
│  │  ├─ .../
```
2) Run the `run_preprocess.py` script with the following arguments:
```bash
python run_preprocess.py --dataset_type "chexpert-test" --cxr_out_path "./data/chexpert_test.h5" --chest_x_ray_path "./data/CheXpert/test/"
```
This saves a `.h5` version of the test dataset images, which can be used for evaluation.

3) Open the sample zero-shot [notebook](https://github.com/rajpurkarlab/CheXzero/blob/main/notebooks/zero_shot.ipynb) and run all cells. If the directory structure is set up correctly, all cells should run without errors.

## Issues
Please open a new issue thread specifying the problem with the codebase, or report issues directly to ekintiu@stanford.edu.

## Citation
```bash
Tiu, E., Talius, E., Patel, P. et al. Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nat. Biomed. Eng (2022). https://doi.org/10.1038/s41551-022-00936-9
```

## License
The source code in this repository is licensed under the MIT license, which you can find in the `LICENSE` file. Also see `NOTICE.md` for attributions to third-party sources.