# OHCA Classifier v3.0 - Improved Methodology

A BERT-based classifier for detecting Out-of-Hospital Cardiac Arrest (OHCA) cases in medical discharge notes, built on an improved natural language processing methodology that addresses key methodological concerns in medical AI.

## Key Improvements in v3.0

This version implements significant methodological improvements based on data science best practices:

**Patient-Level Data Splits** - Prevents data leakage by ensuring all notes from the same patient stay in one split  
**Proper Train/Validation/Test** - Uses independent test set for unbiased evaluation  
**Optimal Threshold Finding** - Finds and saves optimal decision threshold during training  
**Larger Training Samples** - 800+ training samples instead of 264  
**Enhanced Clinical Decision Support** - Improved confidence categories and workflow integration  
**Unbiased Evaluation** - Eliminates threshold tuning on test data  

## Overview
This package provides two main modules with v3.0 enhancements:

- **Training Pipeline** (`ohca_training_pipeline.py`) - Complete workflow with improved methodology
- **Inference Module** (`ohca_inference.py`) - Apply models with optimal threshold support

## Features

### Training Pipeline (Enhanced v3.0)
- **Patient-Level Splits**: Prevents data leakage between training and test sets
- **Dual Annotation Strategy**: Separate training and validation annotation files
- **Intelligent Sampling**: Two-stage sampling strategy (keyword-enriched + random)  
- **Larger Sample Sizes**: 800 training + 200 validation samples
- **BERT-based Training**: Uses PubMedBERT optimized for medical text
- **Optimal Threshold Finding**: Automatically finds best decision threshold
- **Unbiased Evaluation**: Independent test set for reliable performance estimates
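The two-stage sampling strategy (keyword-enriched + random) can be sketched roughly as follows. This is an illustrative sketch, not the package's implementation: the keyword list, sample sizes, and the `two_stage_sample` helper are assumptions.

```python
import pandas as pd

# Hypothetical OHCA-related keywords; the real pipeline's list may differ.
OHCA_KEYWORDS = ["cardiac arrest", "found down", "cpr", "rosc", "defibrillat"]

def two_stage_sample(df, n_keyword=600, n_random=200, seed=42):
    """Stage 1: sample keyword-enriched notes; stage 2: top up with random notes."""
    text = df["clean_text"].str.lower()
    mask = text.str.contains("|".join(OHCA_KEYWORDS), regex=True)
    enriched = df[mask].sample(n=min(n_keyword, int(mask.sum())), random_state=seed)
    rest = df.drop(enriched.index)  # avoid sampling the same note twice
    random_part = rest.sample(n=min(n_random, len(rest)), random_state=seed)
    return pd.concat([enriched, random_part]).reset_index(drop=True)
```

Enriching on keywords raises the prevalence of true OHCA cases in the annotation set, while the random stage keeps a slice of the unselected population so the annotated sample is not purely keyword-biased.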

### Inference Module (Enhanced v3.0)
- **Optimal Threshold Usage**: Automatically uses threshold found during training
- **Enhanced Clinical Priorities**: Improved confidence categories for clinical workflow
- **Batch Processing**: Efficient inference on large datasets
- **Clinical Decision Support**: Evidence-based probability thresholds
- **Backward Compatibility**: Works with both v3.0 and legacy models

## Installation

### Prerequisites
- Python 3.8+
- PyTorch
- CUDA (optional, for GPU acceleration)

### Install from source

1. Clone the repository:
```bash
git clone https://github.com/monajm36/ohca-classifier-3.0.git
cd ohca-classifier-3.0
```

2. Set up virtual environment:
```bash
python3 -m venv .venv
source .venv/bin/activate
```

3. Install dependencies:
```bash
pip install -r requirements.txt
pip install -e .
```

**Note for Windows users**: Replace `source .venv/bin/activate` with `.venv\Scripts\activate`

## Quick Start

### Training a New Model (v3.0 Methodology - RECOMMENDED)

```python
from src.ohca_training_pipeline import complete_improved_training_pipeline
import pandas as pd

# Step 1: Create patient-level splits and annotation samples
results = complete_improved_training_pipeline(
    data_path="your_discharge_notes.csv",  # Must have: hadm_id, subject_id, clean_text
    annotation_dir="./annotation_v3",
    train_sample_size=800,    # Much larger than legacy
    val_sample_size=200       # Separate validation sample
)

# Step 2: Manually annotate BOTH Excel files:
# - annotation_v3/train_annotation.xlsx (800 cases)
# - annotation_v3/validation_annotation.xlsx (200 cases)
# Label each case: 1=OHCA, 0=Non-OHCA

# Step 3: Complete training (after annotation)
from src.ohca_training_pipeline import complete_annotation_and_train_v3

model_results = complete_annotation_and_train_v3(
    train_annotation_file="./annotation_v3/train_annotation.xlsx",
    val_annotation_file="./annotation_v3/validation_annotation.xlsx",
    test_file="./annotation_v3/test_set_DO_NOT_ANNOTATE.csv",
    model_save_path="./my_ohca_model_v3",
    num_epochs=3
)

print(f"Optimal threshold: {model_results['optimal_threshold']:.3f}")
print("The model automatically uses this threshold during inference")
```

### Using a Pre-trained v3.0 Model

```python
from src.ohca_inference import quick_inference_with_optimal_threshold
import pandas as pd

# Apply v3.0 model to new data (uses optimal threshold automatically)
new_data = pd.read_csv("new_discharge_notes.csv")  # Must have: hadm_id, clean_text
results = quick_inference_with_optimal_threshold(
    model_path="./my_ohca_model_v3",  # v3.0 model with metadata
    data_path=new_data,
    output_path="ohca_predictions.csv"
)

# Enhanced v3.0 results with clinical priorities
immediate_review = results[results['clinical_priority'] == 'Immediate Review']
priority_review = results[results['clinical_priority'] == 'Priority Review']

print(f"Immediate review needed: {len(immediate_review)} cases")
print(f"Priority review needed: {len(priority_review)} cases")
print(f"Optimal threshold used: {results['optimal_threshold_used'].iloc[0]:.3f}")
```

### Backward Compatibility (Legacy Models)

```python
from src.ohca_inference import quick_inference

# Works with both v3.0 and legacy models
results = quick_inference(
    model_path="./any_model",  # Auto-detects model version
    data_path="new_data.csv"
)
```

## Data Format

### Input Requirements (Enhanced for v3.0)
Your CSV file must contain:
- `hadm_id`: Unique identifier for each hospital admission
- `subject_id`: Patient identifier (for patient-level splits to prevent data leakage)
- `clean_text`: Preprocessed discharge note text

**Example:**
```csv
hadm_id,subject_id,clean_text
12345,101,"Chief complaint: Cardiac arrest at home. Patient found down by family..."
12346,102,"Chief complaint: Chest pain. Patient presents with acute onset chest pain..."
12347,101,"Follow-up visit. Patient doing well after recent arrest..."
```

**If you don't have patient IDs**: Add this line to your preprocessing:
```python
df['subject_id'] = df['hadm_id']  # Use admission ID as patient ID
```
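The patient-level split itself can be sketched with scikit-learn's `GroupShuffleSplit`, which keeps every admission for a given `subject_id` on the same side of the split. This is a minimal sketch for illustration; the repository's `create_patient_level_splits()` may differ in details.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def patient_level_split(df, test_size=0.2, seed=42):
    """Split so all admissions for one subject_id land in the same partition."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=df["subject_id"]))
    return df.iloc[train_idx], df.iloc[test_idx]
```

In the example table above, patient 101 has two admissions (12345 and 12347); a grouped split guarantees both land in the same partition, so the model is never evaluated on a patient it has already seen.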

### Annotation Labels
- `1`: OHCA case (cardiac arrest outside hospital, primary reason for admission)
- `0`: Non-OHCA case (everything else, including transfers and historical arrests)

## Module Documentation

### Training Pipeline (Enhanced v3.0)

**Main v3.0 Functions (RECOMMENDED):**
- `complete_improved_training_pipeline()` - Create patient-level splits and annotation samples
- `complete_annotation_and_train_v3()` - Train with optimal threshold finding
- `create_patient_level_splits()` - Create proper data splits
- `find_optimal_threshold()` - Find optimal decision threshold
- `evaluate_on_test_set()` - Unbiased final evaluation

**Legacy Functions (Backward Compatible):**
- `create_training_sample()` - Legacy single-file annotation
- `complete_annotation_and_train()` - Legacy training workflow

**Example Usage (v3.0):**
```python
from src.ohca_training_pipeline import complete_improved_training_pipeline

# Enhanced training with proper methodology
result = complete_improved_training_pipeline(
    data_path="discharge_notes.csv",
    annotation_dir="./annotation_v3",
    train_sample_size=800,
    val_sample_size=200
)
```

### Inference Module (Enhanced v3.0)

**Main v3.0 Functions (RECOMMENDED):**
- `quick_inference_with_optimal_threshold()` - Uses optimal threshold automatically
- `load_ohca_model_with_metadata()` - Load model with optimal threshold
- `run_inference_with_optimal_threshold()` - Enhanced inference
- `analyze_predictions_enhanced()` - Improved prediction analysis

**Legacy Functions (Backward Compatible):**
- `quick_inference()` - Auto-detects model version
- `load_ohca_model()` - Basic model loading
- `run_inference()` - Basic inference

**Example Usage (v3.0):**
```python
from src.ohca_inference import load_ohca_model_with_metadata, run_inference_with_optimal_threshold

# Load v3.0 model with optimal threshold
model, tokenizer, optimal_threshold, metadata = load_ohca_model_with_metadata("./trained_model")

# Run inference with optimal threshold
results = run_inference_with_optimal_threshold(model, tokenizer, new_data_df, optimal_threshold)
```

## Model Architecture
- **Base Model**: PubMedBERT (microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract)
- **Task**: Binary classification (OHCA vs Non-OHCA)
- **Max Sequence Length**: 512 tokens
- **Optimization**: AdamW with linear learning rate scheduling
- **Class Balancing**: Weighted loss + minority class oversampling
- **Threshold Selection**: Optimal threshold found via validation set (v3.0)
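Threshold selection on the validation set can be illustrated with a simple grid search that maximizes F1. The `find_best_threshold` helper below is a sketch under that assumption, not the package's exact procedure.

```python
import numpy as np
from sklearn.metrics import f1_score

def find_best_threshold(y_true, y_prob, grid=None):
    """Scan candidate thresholds on validation data and keep the best F1."""
    grid = np.linspace(0.05, 0.95, 19) if grid is None else grid
    scores = [f1_score(y_true, (y_prob >= t).astype(int)) for t in grid]
    return float(grid[int(np.argmax(scores))])
```

Because the scan runs on the validation set only, the held-out test set still yields an unbiased estimate of performance at the chosen threshold.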

## Performance Metrics

### v3.0 Enhanced Evaluation
The model provides unbiased performance estimates using:
- **Independent test set** for final evaluation
- **Optimal threshold** found on validation set only
- **Patient-level splits** preventing data leakage

**Clinical Metrics:**
- **Sensitivity (Recall)**: Percentage of OHCA cases correctly identified
- **Specificity**: Percentage of non-OHCA cases correctly identified
- **Precision (PPV)**: When model predicts OHCA, percentage that are correct
- **NPV**: When model predicts non-OHCA, percentage that are correct
- **F1-Score**: Harmonic mean of precision and recall
- **AUC-ROC**: Area under the receiver operating characteristic curve
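The four count-based metrics above fall directly out of a 2x2 confusion matrix. The helper below is a hypothetical sketch of that arithmetic using scikit-learn.

```python
from sklearn.metrics import confusion_matrix

def clinical_metrics(y_true, y_pred):
    """Derive sensitivity, specificity, PPV, and NPV from a 2x2 confusion matrix."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "sensitivity": tp / (tp + fn),  # recall on OHCA cases
        "specificity": tn / (tn + fp),  # recall on non-OHCA cases
        "ppv": tp / (tp + fp),          # precision of positive calls
        "npv": tn / (tn + fn),          # precision of negative calls
    }
```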

## Clinical Usage

### Enhanced v3.0 Clinical Decision Support

**Clinical Priorities (v3.0):**
- **Immediate Review**: Very high probability cases requiring urgent attention
- **Priority Review**: High probability cases for clinical team review
- **Clinical Review**: Medium-high probability cases above optimal threshold
- **Consider Review**: Medium probability cases for potential review
- **Routine Processing**: Low probability cases
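A priority tier can be derived from the predicted probability together with the saved optimal threshold, along these lines. The cutoffs below are illustrative assumptions, not the package's exact boundaries.

```python
def clinical_priority(prob, optimal_threshold):
    """Map a predicted OHCA probability to a priority tier.
    Cutoff values here are assumptions for illustration only."""
    if prob >= 0.95:
        return "Immediate Review"
    if prob >= 0.80:
        return "Priority Review"
    if prob >= optimal_threshold:
        return "Clinical Review"
    if prob >= 0.30:
        return "Consider Review"
    return "Routine Processing"
```

Note that only the "Clinical Review" boundary moves with the learned threshold; the outer tiers flag extreme probabilities regardless of where the operating point lands.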

**Optimal Threshold Usage:**
- Model automatically uses threshold found during validation
- Consistent decision-making across all datasets
- Better performance than static thresholds

**Workflow Integration:**
1. Run inference on new discharge notes (uses optimal threshold)
2. Prioritize "Immediate Review" cases for urgent manual review
3. Schedule "Priority Review" cases for clinical team evaluation
4. Use "Clinical Review" cases for quality improvement
5. Monitor routine cases for false negatives

## Repository Structure
```
ohca-classifier-3.0/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ ohca_training_pipeline.py    # Enhanced v3.0 training workflow
β”‚   └── ohca_inference.py            # Enhanced v3.0 inference
β”œβ”€β”€ examples/
β”‚   β”œβ”€β”€ training_example.py          # v3.0 training examples
β”‚   β”œβ”€β”€ inference_example.py         # v3.0 inference examples
β”‚   └── clif_dataset_example.py      # Cross-institutional deployment
β”œβ”€β”€ docs/
β”‚   └── annotation_guidelines.md     # Enhanced annotation guidelines
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ setup.py
β”œβ”€β”€ README.md
└── LICENSE
```

## Examples

### Complete v3.0 Training Example
```bash
cd examples
python training_example.py
# Choose option 1: v3.0 Training with Improved Methodology
```

### Enhanced v3.0 Inference Examples
```bash
cd examples
python inference_example.py
# Choose option 1: v3.0 Inference with Optimal Threshold
```

### Cross-Institutional Deployment
```bash
cd examples
python clif_dataset_example.py
# Apply v3.0 model to external datasets
```

## Advanced Usage

### Large Dataset Processing (v3.0)
```python
from src.ohca_inference import process_large_dataset_with_optimal_threshold

# Process with optimal threshold automatically
process_large_dataset_with_optimal_threshold(
    model_path="./trained_model_v3",
    data_path="large_dataset.csv",
    output_path="results.csv",
    chunk_size=5000
)
```
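Under the hood, chunked processing typically follows the standard pandas streaming pattern sketched below. `score_chunk` stands in for the model's batch inference call and is not part of the package's API.

```python
import pandas as pd

def process_in_chunks(data_path, output_path, score_chunk, chunk_size=5000):
    """Stream a large CSV through a scoring function chunk by chunk,
    appending results so the full dataset never sits in memory at once."""
    first = True
    for chunk in pd.read_csv(data_path, chunksize=chunk_size):
        scored = score_chunk(chunk)  # stand-in for batch model inference
        scored.to_csv(output_path, mode="w" if first else "a",
                      header=first, index=False)
        first = False
```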

### Model Testing with v3.0 Features
```python
from src.ohca_inference import test_model_on_sample

# Test with optimal threshold support
test_cases = {
    'case1': "Chief complaint: Cardiac arrest at home...",
    'case2': "Chief complaint: Chest pain, no arrest..."
}

results = test_model_on_sample("./trained_model_v3", test_cases)
# Results include optimal threshold predictions and clinical priorities
```

## Performance Benchmarks

### v3.0 Methodology Performance
Typical performance with improved methodology:
- **AUC-ROC**: 0.85-0.95 (unbiased estimates)
- **Sensitivity**: 85-95% (at optimal threshold)
- **Specificity**: 85-95% (at optimal threshold)
- **F1-Score**: 0.7-0.9 (optimized via validation)

**Key Improvements over Legacy:**
- **Unbiased evaluation** using independent test set
- **Optimal threshold** provides better sensitivity/specificity balance
- **Larger training sets** (800 vs 264) improve generalization
- **Patient-level splits** prevent overoptimistic performance estimates

*Performance varies based on data quality and annotation consistency*

## Migration from Legacy Versions

### Upgrading from Legacy to v3.0

**Benefits of Upgrading:**
- More reliable performance estimates
- Better clinical decision support
- Optimal threshold usage
- Enhanced workflow integration

**Migration Steps:**
1. **Add patient IDs** to your data (`subject_id` column)
2. **Retrain with v3.0 methodology** using `complete_improved_training_pipeline()`
3. **Use v3.0 inference functions** for new predictions
4. **Update workflows** to use clinical priorities

**Backward Compatibility:**
- Legacy models continue to work
- Legacy functions automatically detect model version
- Gradual migration supported

## Citation
If you use this code in your research, please cite:

```bibtex
@software{nlp_ohca_classifier_v3,
    title={NLP OHCA Classifier v3.0: BERT-based Detection of Out-of-Hospital Cardiac Arrest with Enhanced Methodology},
    author={Mona Moukaddem},
    year={2025},
    url={https://github.com/monajm36/ohca-classifier-3.0},
    note={Enhanced methodology addressing data leakage, threshold optimization, and evaluation bias}
}
```

## License
This project is licensed under the MIT License - see the LICENSE file for details.

## Contributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## Support
For questions or issues:
- Check the [Issues](https://github.com/monajm36/ohca-classifier-3.0/issues) page
- Create a new issue if needed
- Review examples in the `examples/` folder

## Methodology References
The v3.0 improvements are based on established machine learning best practices:
- Patient-level data splits prevent data leakage in healthcare AI
- Proper train/validation/test methodology ensures unbiased evaluation
- Optimal threshold finding improves clinical performance
- Larger sample sizes enhance model generalization

## Acknowledgments
- PubMedBERT model from Microsoft Research
- MIMIC-III dataset for model development
- Transformers library by Hugging Face
- PyTorch for deep learning framework
- Data science community for methodological guidance