monajm36 committed on
Commit
ed96473
·
unverified ·
1 Parent(s): 644617d

Update README.md

  1. README.md +183 -115
README.md CHANGED
@@ -1,38 +1,63 @@
1
  # ohca-classifier-3.0
2
  BERT-based classifier for detecting Out-of-Hospital Cardiac Arrest (OHCA) cases in medical text
3
 
4
- NLP OHCA Classifier
5
  A BERT-based classifier for detecting Out-of-Hospital Cardiac Arrest (OHCA) cases in medical discharge notes using natural language processing.
6
 
7
- Overview
8
  This package provides two main modules:
9
 
10
- Training Pipeline (ohca_training_pipeline.py) - Complete workflow from data annotation to model training
11
- Inference Module (ohca_inference.py) - Apply pre-trained models to new datasets
12
- Features
13
- Training Pipeline
14
- Intelligent Sampling: Two-stage sampling strategy (keyword-enriched + random)
15
- Annotation Interface: Generates Excel files for manual annotation with guidelines
16
- BERT-based Training: Uses PubMedBERT optimized for medical text
17
- Class Balancing: Handles imbalanced datasets with oversampling
18
- Comprehensive Evaluation: Clinical metrics including sensitivity, specificity, PPV, NPV
19
- Inference Module
20
- Pre-trained Model Loading: Easy loading of trained OHCA models
21
- Batch Processing: Efficient inference on large datasets
22
- Clinical Decision Support: Probability thresholds and confidence categories
23
- Quality Analysis: Built-in tools for analyzing prediction patterns
24
- Installation
25
- Prerequisites
26
- Python 3.8+
27
- PyTorch
28
- CUDA (optional, for GPU acceleration)
29
- Install from source
 
 
 
 
 
 
 
 
 
30
  git clone https://github.com/monajm36/nlp-ohca-classifier.git
31
  cd nlp-ohca-classifier
 
 
 
 
 
 
 
 
 
 
32
  pip install -r requirements.txt
33
  pip install -e .
34
- Quick Start
35
- Training a New Model
 
 
 
 
 
 
36
  from src.ohca_training_pipeline import create_training_sample, complete_annotation_and_train
37
  import pandas as pd
38
 
@@ -49,7 +74,10 @@ results = complete_annotation_and_train(
49
  model_save_path="./my_ohca_model",
50
  num_epochs=3
51
  )
52
- Using a Pre-trained Model
 
 
 
53
  from src.ohca_inference import quick_inference
54
  import pandas as pd
55
 
@@ -64,30 +92,39 @@ results = quick_inference(
64
  # View high-confidence predictions
65
  high_confidence = results[results['ohca_probability'] >= 0.8]
66
  print(f"Found {len(high_confidence)} high-confidence OHCA cases")
67
- Data Format
68
- Input Requirements
 
 
 
69
  Your CSV file must contain:
 
 
70
 
71
- hadm_id: Unique identifier for each hospital admission
72
- clean_text: Preprocessed discharge note text
73
- Example:
74
  hadm_id,clean_text
75
  12345,"Chief complaint: Cardiac arrest at home. Patient found down by family..."
76
  12346,"Chief complaint: Chest pain. Patient presents with acute onset chest pain..."
77
- Annotation Labels
78
- 1: OHCA case (cardiac arrest outside hospital)
79
- 0: Non-OHCA case (everything else, including all transfer cases)
80
- Module Documentation
81
- Training Pipeline (ohca_training_pipeline.py)
82
- Main Functions:
83
-
84
- create_training_sample() - Create balanced annotation sample
85
- prepare_training_data() - Process annotations for training
86
- train_ohca_model() - Train BERT-based classifier
87
- evaluate_model() - Comprehensive performance evaluation
88
- complete_training_pipeline() - End-to-end training workflow
89
- Example Usage:
90
 
 
 
 
 
 
 
 
 
 
91
  from src.ohca_training_pipeline import complete_training_pipeline
92
 
93
  # Complete training pipeline
@@ -96,81 +133,106 @@ result = complete_training_pipeline(
96
  annotation_dir="./annotation",
97
  model_save_path="./trained_model"
98
  )
99
- Inference Module (ohca_inference.py)
100
- Main Functions:
 
101
 
102
- load_ohca_model() - Load pre-trained model
103
- run_inference() - Full inference with analysis
104
- quick_inference() - Simple inference function
105
- process_large_dataset() - Handle large datasets in chunks
106
- test_model_on_sample() - Test on specific text samples
107
- Example Usage:
108
 
 
 
109
  from src.ohca_inference import run_inference, load_ohca_model
110
 
111
  # Load model and run inference
112
  model, tokenizer = load_ohca_model("./trained_model")
113
  results = run_inference(model, tokenizer, new_data_df)
114
- Model Architecture
115
- Base Model: PubMedBERT (microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract)
116
- Task: Binary classification (OHCA vs Non-OHCA)
117
- Max Sequence Length: 512 tokens
118
- Optimization: AdamW with linear learning rate scheduling
119
- Class Balancing: Weighted loss + minority class oversampling
120
- Performance Metrics
 
 
 
121
  The model reports comprehensive clinical metrics:
122
 
123
- Sensitivity (Recall): Percentage of OHCA cases correctly identified
124
- Specificity: Percentage of non-OHCA cases correctly identified
125
- Precision (PPV): When model predicts OHCA, percentage that are correct
126
- NPV: When model predicts non-OHCA, percentage that are correct
127
- F1-Score: Harmonic mean of precision and recall
128
- AUC-ROC: Area under the receiver operating characteristic curve
129
- Clinical Usage
130
- Probability Thresholds
131
- ≥0.9: Very high confidence - Priority manual review
132
- 0.7-0.9: High confidence - Clinical review recommended
133
- 0.3-0.7: Uncertain - Manual review suggested
134
- <0.3: Low probability - Likely non-OHCA
135
- Workflow Integration
136
- Run inference on new discharge notes
137
- Prioritize high-confidence predictions for review
138
- Use medium-confidence cases for quality improvement
139
- Monitor low-confidence cases for false negatives
140
- Repository Structure
 
 
 
 
 
141
  nlp-ohca-classifier/
142
  ├── src/
143
  │ ├── __init__.py
144
  │ ├── ohca_training_pipeline.py # Training workflow
145
- │ └── ohca_inference.py # Inference on new data
146
  ├── examples/
147
- │ ├── training_example.py # Complete training examples
148
- │ └── inference_example.py # Inference usage examples
149
  ├── docs/
150
- │ └── annotation_guidelines.md # Detailed annotation guidelines
151
  ├── requirements.txt
152
  ├── setup.py
153
  ├── README.md
154
  └── LICENSE
155
- Examples
156
- Complete Training Example
 
 
 
 
157
  cd examples
158
  python training_example.py
159
- Inference Examples
160
- cd examples
 
 
 
161
  python inference_example.py
162
- Advanced Usage
163
- Large Dataset Processing
 
 
 
 
164
  from src.ohca_inference import process_large_dataset
165
 
166
  # Process 100K+ records in chunks
167
  process_large_dataset(
168
  model_path="./trained_model",
169
- data_path="large_dataset.csv",
170
  output_path="results.csv",
171
  chunk_size=5000
172
  )
173
- Model Testing
 
 
 
174
  from src.ohca_inference import test_model_on_sample
175
 
176
  # Test on specific cases
@@ -180,41 +242,47 @@ test_cases = {
180
  }
181
 
182
  results = test_model_on_sample("./trained_model", test_cases)
183
- Performance Benchmarks
 
 
184
  Typical performance on validation data:
 
 
 
 
185
 
186
- AUC-ROC: 0.85-0.95
187
- Sensitivity: 85-95%
188
- Specificity: 85-95%
189
- F1-Score: 0.7-0.9
190
- Performance varies based on data quality and annotation consistency
191
 
192
- Citation
193
  If you use this code in your research, please cite:
194
 
 
195
  @software{nlp_ohca_classifier,
196
- title={NLP OHCA Classifier: BERT-based Detection of Out-of-Hospital Cardiac Arrest in Medical Text},
197
- author={Mona Moukaddem},
198
- year={2025},
199
- url={https://github.com/monajm36/nlp-ohca-classifier}
200
  }
201
- License
 
 
202
  This project is licensed under the MIT License - see the LICENSE file for details.
203
 
204
- Contributing
205
- Fork the repository
206
- Create a feature branch (git checkout -b feature/AmazingFeature)
207
- Commit your changes (git commit -m 'Add some AmazingFeature')
208
- Push to the branch (git push origin feature/AmazingFeature)
209
- Open a Pull Request
210
- Support
 
211
  For questions or issues:
 
 
 
212
 
213
- Check the Issues page
214
- Create a new issue if needed
215
- Review examples in the examples/ folder
216
- Acknowledgments
217
- PubMedBERT model from Microsoft Research
218
- MIMIC-III dataset for model development
219
- Transformers library by Hugging Face
220
- PyTorch for deep learning framework
 
1
  # ohca-classifier-3.0
2
  BERT-based classifier for detecting Out-of-Hospital Cardiac Arrest (OHCA) cases in medical text
3
 
4
+ ## NLP OHCA Classifier
5
  A BERT-based classifier for detecting Out-of-Hospital Cardiac Arrest (OHCA) cases in medical discharge notes using natural language processing.
6
 
7
+ ## Overview
8
  This package provides two main modules:
9
 
10
+ - **Training Pipeline** (`ohca_training_pipeline.py`) - Complete workflow from data annotation to model training
11
+ - **Inference Module** (`ohca_inference.py`) - Apply pre-trained models to new datasets
12
+
13
+ ## Features
14
+
15
+ ### Training Pipeline
16
+ - **Intelligent Sampling**: Two-stage sampling strategy (keyword-enriched + random)
17
+ - **Annotation Interface**: Generates Excel files for manual annotation with guidelines
18
+ - **BERT-based Training**: Uses PubMedBERT optimized for medical text
19
+ - **Class Balancing**: Handles imbalanced datasets with oversampling
20
+ - **Comprehensive Evaluation**: Clinical metrics including sensitivity, specificity, PPV, NPV
21
+
22
+ ### Inference Module
23
+ - **Pre-trained Model Loading**: Easy loading of trained OHCA models
24
+ - **Batch Processing**: Efficient inference on large datasets
25
+ - **Clinical Decision Support**: Probability thresholds and confidence categories
26
+ - **Quality Analysis**: Built-in tools for analyzing prediction patterns
27
+
28
+ ## Installation
29
+
30
+ ### Prerequisites
31
+ - Python 3.8+
32
+ - PyTorch
33
+ - CUDA (optional, for GPU acceleration)
34
+
35
+ ### Install from source
36
+
37
+ 1. Clone the repository:
38
+ ```bash
39
  git clone https://github.com/monajm36/nlp-ohca-classifier.git
40
  cd nlp-ohca-classifier
41
+ ```
42
+
43
+ 2. Set up virtual environment:
44
+ ```bash
45
+ python3 -m venv .venv/
46
+ source .venv/bin/activate
47
+ ```
48
+
49
+ 3. Install dependencies:
50
+ ```bash
51
  pip install -r requirements.txt
52
  pip install -e .
53
+ ```
54
+
55
+ **Note for Windows users**: Replace `source .venv/bin/activate` with `.venv\Scripts\activate`
56
+
57
+ ## Quick Start
58
+
59
+ ### Training a New Model
60
+ ```python
61
  from src.ohca_training_pipeline import create_training_sample, complete_annotation_and_train
62
  import pandas as pd
63
 
 
74
  model_save_path="./my_ohca_model",
75
  num_epochs=3
76
  )
77
+ ```
78
+
79
+ ### Using a Pre-trained Model
80
+ ```python
81
  from src.ohca_inference import quick_inference
82
  import pandas as pd
83
 
 
92
  # View high-confidence predictions
93
  high_confidence = results[results['ohca_probability'] >= 0.8]
94
  print(f"Found {len(high_confidence)} high-confidence OHCA cases")
95
+ ```
96
+
97
+ ## Data Format
98
+
99
+ ### Input Requirements
100
  Your CSV file must contain:
101
+ - `hadm_id`: Unique identifier for each hospital admission
102
+ - `clean_text`: Preprocessed discharge note text
103
 
104
+ **Example:**
105
+ ```
 
106
  hadm_id,clean_text
107
  12345,"Chief complaint: Cardiac arrest at home. Patient found down by family..."
108
  12346,"Chief complaint: Chest pain. Patient presents with acute onset chest pain..."
109
+ ```
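The input requirements above can be checked before running the pipeline. The following is a minimal sketch using only the standard library; the helper name `missing_columns` and the in-memory sample are hypothetical and not part of the package, which may validate its input differently.

```python
import csv
import io

# Column names taken from the "Input Requirements" list above.
REQUIRED_COLUMNS = {"hadm_id", "clean_text"}


def missing_columns(csv_file):
    """Return the required columns absent from the CSV header (hypothetical helper)."""
    reader = csv.DictReader(csv_file)
    # fieldnames is None for an empty file, so fall back to an empty list.
    return REQUIRED_COLUMNS - set(reader.fieldnames or [])


# Small in-memory sample mirroring the example rows above.
sample = io.StringIO(
    "hadm_id,clean_text\n"
    '12345,"Chief complaint: Cardiac arrest at home."\n'
)
missing = missing_columns(sample)

# A file lacking the text column is reported as incomplete.
incomplete = missing_columns(io.StringIO("hadm_id,note\n1,abc\n"))
```

An empty result means both required columns are present; anything else names the columns to add before training or inference.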
110
+
111
+ ### Annotation Labels
112
+ - `1`: OHCA case (cardiac arrest outside hospital)
113
+ - `0`: Non-OHCA case (everything else, including all transfer cases)
114
+
115
+ ## Module Documentation
116
+
117
+ ### Training Pipeline (`ohca_training_pipeline.py`)
 
 
 
 
118
 
119
+ **Main Functions:**
120
+ - `create_training_sample()` - Create balanced annotation sample
121
+ - `prepare_training_data()` - Process annotations for training
122
+ - `train_ohca_model()` - Train BERT-based classifier
123
+ - `evaluate_model()` - Comprehensive performance evaluation
124
+ - `complete_training_pipeline()` - End-to-end training workflow
125
+
126
+ **Example Usage:**
127
+ ```python
128
  from src.ohca_training_pipeline import complete_training_pipeline
129
 
130
  # Complete training pipeline
 
133
  annotation_dir="./annotation",
134
  model_save_path="./trained_model"
135
  )
136
+ ```
137
+
138
+ ### Inference Module (`ohca_inference.py`)
139
 
140
+ **Main Functions:**
141
+ - `load_ohca_model()` - Load pre-trained model
142
+ - `run_inference()` - Full inference with analysis
143
+ - `quick_inference()` - Simple inference function
144
+ - `process_large_dataset()` - Handle large datasets in chunks
145
+ - `test_model_on_sample()` - Test on specific text samples
146
 
147
+ **Example Usage:**
148
+ ```python
149
  from src.ohca_inference import run_inference, load_ohca_model
150
 
151
  # Load model and run inference
152
  model, tokenizer = load_ohca_model("./trained_model")
153
  results = run_inference(model, tokenizer, new_data_df)
154
+ ```
155
+
156
+ ## Model Architecture
157
+ - **Base Model**: PubMedBERT (microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract)
158
+ - **Task**: Binary classification (OHCA vs Non-OHCA)
159
+ - **Max Sequence Length**: 512 tokens
160
+ - **Optimization**: AdamW with linear learning rate scheduling
161
+ - **Class Balancing**: Weighted loss + minority class oversampling
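To illustrate the minority class oversampling mentioned above, here is a minimal sketch in plain Python. It is not the package's implementation: the function name `oversample_minority`, the fixed seed, and the choice to balance the classes exactly are assumptions made for illustration.

```python
import random


def oversample_minority(texts, labels, seed=42):
    """Duplicate minority-class examples until both classes are the same size.

    Hypothetical helper: random sampling with replacement is one common way
    to oversample, not necessarily what the training pipeline does.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    # The shorter index list is the minority class.
    minority, majority = sorted((positives, negatives), key=len)
    if not minority:
        # Nothing to duplicate when one class is absent.
        return list(texts), list(labels)
    # Draw extra minority indices (with replacement) to close the gap.
    extra = rng.choices(minority, k=len(majority) - len(minority))
    indices = positives + negatives + extra
    rng.shuffle(indices)
    return [texts[i] for i in indices], [labels[i] for i in indices]


texts = ["arrest at home", "chest pain", "fall", "fever", "cough"]
labels = [1, 0, 0, 0, 0]
balanced_texts, balanced_labels = oversample_minority(texts, labels)
```

Oversampling should be applied to the training split only, so that duplicated notes do not leak into validation data.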
162
+
163
+ ## Performance Metrics
164
  The model reports comprehensive clinical metrics:
165
 
166
+ - **Sensitivity (Recall)**: Percentage of OHCA cases correctly identified
167
+ - **Specificity**: Percentage of non-OHCA cases correctly identified
168
+ - **Precision (PPV)**: When model predicts OHCA, percentage that are correct
169
+ - **NPV**: When model predicts non-OHCA, percentage that are correct
170
+ - **F1-Score**: Harmonic mean of precision and recall
171
+ - **AUC-ROC**: Area under the receiver operating characteristic curve
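As a rough illustration of how the count-based metrics above relate to a confusion matrix, consider the sketch below. The function name `clinical_metrics` and the example counts are hypothetical; `evaluate_model()` may compute these values differently. AUC-ROC is left out because it needs predicted probabilities rather than counts.

```python
def clinical_metrics(tp, fp, tn, fn):
    """Derive clinical metrics from confusion-matrix counts (illustrative helper)."""

    def ratio(numerator, denominator):
        # Avoid ZeroDivisionError when a class or prediction is absent.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),    # OHCA cases correctly identified
        "specificity": ratio(tn, tn + fp),    # non-OHCA cases correctly identified
        "ppv": ratio(tp, tp + fp),            # predicted OHCA that are correct
        "npv": ratio(tn, tn + fn),            # predicted non-OHCA that are correct
        "f1": ratio(2 * tp, 2 * tp + fp + fn),  # harmonic mean of PPV and sensitivity
    }


# Made-up counts for illustration only, not results from this model.
metrics = clinical_metrics(tp=45, fp=10, tn=930, fn=5)
```

With a rare outcome such as OHCA, specificity and NPV tend to look high even for weak models, so sensitivity and PPV usually deserve the closest attention.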
172
+
173
+ ## Clinical Usage
174
+
175
+ ### Probability Thresholds
176
+ - **≥0.9**: Very high confidence - Priority manual review
177
+ - **0.7-0.9**: High confidence - Clinical review recommended
178
+ - **0.3-0.7**: Uncertain - Manual review suggested
179
+ - **<0.3**: Low probability - Likely non-OHCA
180
+
181
+ ### Workflow Integration
182
+ 1. Run inference on new discharge notes
183
+ 2. Prioritize high-confidence predictions for review
184
+ 3. Use medium-confidence cases for quality improvement
185
+ 4. Monitor low-confidence cases for false negatives
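The probability thresholds listed under Clinical Usage can be expressed as a small lookup. This is a sketch only: the function name `confidence_category` is hypothetical, and treating each lower bound as inclusive (so 0.7 counts as high and 0.3 as uncertain) is an assumption, since the list does not say which side the boundaries fall on.

```python
def confidence_category(probability):
    """Map an OHCA probability to a review category (hypothetical helper)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= 0.9:
        return "very high"   # priority manual review
    if probability >= 0.7:
        return "high"        # clinical review recommended
    if probability >= 0.3:
        return "uncertain"   # manual review suggested
    return "low"             # likely non-OHCA


categories = [confidence_category(p) for p in (0.95, 0.8, 0.5, 0.1)]
```

Applied to the `ohca_probability` column of the inference output, a mapping like this supports the prioritization described in the workflow steps.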
186
+
187
+ ## Repository Structure
188
+ ```
189
  nlp-ohca-classifier/
190
  ├── src/
191
  │ ├── __init__.py
192
  │ ├── ohca_training_pipeline.py # Training workflow
193
+ │ └── ohca_inference.py # Inference on new data
194
  ├── examples/
195
+ │ ├── training_example.py # Complete training examples
196
+ │ └── inference_example.py # Inference usage examples
197
  ├── docs/
198
+ │ └── annotation_guidelines.md # Detailed annotation guidelines
199
  ├── requirements.txt
200
  ├── setup.py
201
  ├── README.md
202
  └── LICENSE
203
+ ```
204
+
205
+ ## Examples
206
+
207
+ ### Complete Training Example
208
+ ```bash
209
  cd examples
210
  python training_example.py
211
+ ```
212
+
213
+ ### Inference Examples
214
+ ```bash
215
+ cd examples
216
  python inference_example.py
217
+ ```
218
+
219
+ ## Advanced Usage
220
+
221
+ ### Large Dataset Processing
222
+ ```python
223
  from src.ohca_inference import process_large_dataset
224
 
225
  # Process 100K+ records in chunks
226
  process_large_dataset(
227
  model_path="./trained_model",
228
+ data_path="large_dataset.csv",
229
  output_path="results.csv",
230
  chunk_size=5000
231
  )
232
+ ```
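The idea behind `chunk_size` can be illustrated with a generic chunking generator. This sketch does not show how `process_large_dataset()` is implemented; the helper `iter_chunks` is hypothetical and only demonstrates splitting a stream of records into fixed-size batches so that the full dataset never has to be held in memory at once.

```python
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield successive lists of at most chunk_size items (hypothetical helper)."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(rows)
    while True:
        # islice consumes up to chunk_size items without materializing the rest.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Twelve dummy records split into chunks of five.
chunk_lengths = [len(chunk) for chunk in iter_chunks(range(12), 5)]
```

The final chunk is simply shorter when the record count is not a multiple of the chunk size.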
233
+
234
+ ### Model Testing
235
+ ```python
236
  from src.ohca_inference import test_model_on_sample
237
 
238
  # Test on specific cases
 
242
  }
243
 
244
  results = test_model_on_sample("./trained_model", test_cases)
245
+ ```
246
+
247
+ ## Performance Benchmarks
248
  Typical performance on validation data:
249
+ - **AUC-ROC**: 0.85-0.95
250
+ - **Sensitivity**: 85-95%
251
+ - **Specificity**: 85-95%
252
+ - **F1-Score**: 0.7-0.9
253
 
254
+ *Performance varies based on data quality and annotation consistency*
 
 
 
 
255
 
256
+ ## Citation
257
  If you use this code in your research, please cite:
258
 
259
+ ```bibtex
260
  @software{nlp_ohca_classifier,
261
+ title={NLP OHCA Classifier: BERT-based Detection of Out-of-Hospital Cardiac Arrest in Medical Text},
262
+ author={Mona Moukaddem},
263
+ year={2025},
264
+ url={https://github.com/monajm36/nlp-ohca-classifier}
265
  }
266
+ ```
267
+
268
+ ## License
269
  This project is licensed under the MIT License - see the LICENSE file for details.
270
 
271
+ ## Contributing
272
+ 1. Fork the repository
273
+ 2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
274
+ 3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
275
+ 4. Push to the branch (`git push origin feature/AmazingFeature`)
276
+ 5. Open a Pull Request
277
+
278
+ ## Support
279
  For questions or issues:
280
+ - Check the [Issues](https://github.com/monajm36/nlp-ohca-classifier/issues) page
281
+ - Create a new issue if needed
282
+ - Review examples in the `examples/` folder
283
 
284
+ ## Acknowledgments
285
+ - PubMedBERT model from Microsoft Research
286
+ - MIMIC-III dataset for model development
287
+ - Transformers library by Hugging Face
288
+ - PyTorch for deep learning framework