monajm36 committed on
Commit e9b57e9 · unverified · 1 Parent(s): fbde8a5

Update README.md

Files changed (1)
  1. README.md +218 -0
README.md CHANGED
@@ -1,2 +1,220 @@
  # ohca-classifier-3.0
  BERT-based classifier for detecting Out-of-Hospital Cardiac Arrest (OHCA) cases in medical text
+
+ NLP OHCA Classifier
+ A BERT-based classifier for detecting Out-of-Hospital Cardiac Arrest (OHCA) cases in medical discharge notes using natural language processing.
+
+ Overview
+ This package provides two main modules:
+
+ Training Pipeline (ohca_training_pipeline.py) - Complete workflow from data annotation to model training
+ Inference Module (ohca_inference.py) - Apply pre-trained models to new datasets
+ Features
+ Training Pipeline
+ Intelligent Sampling: Two-stage sampling strategy (keyword-enriched + random)
+ Annotation Interface: Generates Excel files for manual annotation with guidelines
+ BERT-based Training: Uses PubMedBERT optimized for medical text
+ Class Balancing: Handles imbalanced datasets with oversampling
+ Comprehensive Evaluation: Clinical metrics including sensitivity, specificity, PPV, NPV
+ Inference Module
+ Pre-trained Model Loading: Easy loading of trained OHCA models
+ Batch Processing: Efficient inference on large datasets
+ Clinical Decision Support: Probability thresholds and confidence categories
+ Quality Analysis: Built-in tools for analyzing prediction patterns
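The two-stage sampling strategy listed under the Training Pipeline features (keyword-enriched + random) can be sketched roughly as follows. This is an illustration only, not the package's implementation: the keyword list, the function name `two_stage_sample`, and its parameters are hypothetical, and the README does not say which keywords or sample sizes `create_training_sample()` actually uses.

```python
import random

# Hypothetical keyword list; the package's real keywords are not shown in the README.
OHCA_KEYWORDS = ("cardiac arrest", "cpr", "found down", "rosc")


def two_stage_sample(notes, n_keyword, n_random, seed=42):
    """Stage 1: sample notes that mention a keyword. Stage 2: sample randomly from the rest.

    `notes` is a list of dicts with `hadm_id` and `clean_text` keys.
    """
    rng = random.Random(seed)
    matches = [
        n for n in notes
        if any(k in n["clean_text"].lower() for k in OHCA_KEYWORDS)
    ]
    enriched = rng.sample(matches, min(n_keyword, len(matches)))

    # Exclude already-chosen admissions so no note is annotated twice.
    chosen_ids = {n["hadm_id"] for n in enriched}
    remainder = [n for n in notes if n["hadm_id"] not in chosen_ids]
    background = rng.sample(remainder, min(n_random, len(remainder)))
    return enriched + background


notes = [
    {"hadm_id": 1, "clean_text": "Cardiac arrest at home, CPR by family."},
    {"hadm_id": 2, "clean_text": "Chest pain, no arrest."},
    {"hadm_id": 3, "clean_text": "Found down by neighbor, ROSC after 10 minutes."},
    {"hadm_id": 4, "clean_text": "Elective knee replacement."},
]
sample = two_stage_sample(notes, n_keyword=1, n_random=2)
```

The keyword stage raises the share of likely positives in the annotation set, while the random stage keeps some notes that were not selected by keyword, so annotators also see cases the keywords would miss.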
+ Installation
+ Prerequisites
+ Python 3.8+
+ PyTorch
+ CUDA (optional, for GPU acceleration)
+ Install from source
+ git clone https://github.com/monajm36/nlp-ohca-classifier.git
+ cd nlp-ohca-classifier
+ pip install -r requirements.txt
+ pip install -e .
+ Quick Start
+ Training a New Model
+ from src.ohca_training_pipeline import create_training_sample, complete_annotation_and_train
+ import pandas as pd
+
+ # 1. Create annotation sample
+ df = pd.read_csv("your_discharge_notes.csv")  # Must have: hadm_id, clean_text
+ annotation_df = create_training_sample(df, output_dir="./annotation_interface")
+
+ # 2. Manually annotate the Excel file (ohca_annotation.xlsx)
+ # Label each case: 1=OHCA, 0=Non-OHCA
+
+ # 3. Train model after annotation
+ results = complete_annotation_and_train(
+     annotation_file="./annotation_interface/ohca_annotation.xlsx",
+     model_save_path="./my_ohca_model",
+     num_epochs=3
+ )
+ Using a Pre-trained Model
+ from src.ohca_inference import quick_inference
+ import pandas as pd
+
+ # Apply model to new data
+ new_data = pd.read_csv("new_discharge_notes.csv")  # Must have: hadm_id, clean_text
+ results = quick_inference(
+     model_path="./my_ohca_model",
+     data_path=new_data,
+     output_path="ohca_predictions.csv"
+ )
+
+ # View high-confidence predictions
+ high_confidence = results[results['ohca_probability'] >= 0.8]
+ print(f"Found {len(high_confidence)} high-confidence OHCA cases")
+ Data Format
+ Input Requirements
+ Your CSV file must contain:
+
+ hadm_id: Unique identifier for each hospital admission
+ clean_text: Preprocessed discharge note text
+ Example:
+ hadm_id,clean_text
+ 12345,"Chief complaint: Cardiac arrest at home. Patient found down by family..."
+ 12346,"Chief complaint: Chest pain. Patient presents with acute onset chest pain..."
+ Annotation Labels
+ 1: OHCA case (cardiac arrest outside hospital)
+ 0: Non-OHCA case (everything else, including all transfer cases)
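Before running either module, it may help to check that an input file meets the requirements above. The following is a minimal sketch using only the standard library; `validate_notes_csv` is a hypothetical helper, not part of the package, and the duplicate and empty-text checks are assumptions about what "unique identifier" and "preprocessed text" imply.

```python
import csv
import io

REQUIRED_COLUMNS = {"hadm_id", "clean_text"}


def validate_notes_csv(handle):
    """Parse a CSV file object and return its rows.

    Raises ValueError if a required column is missing, an hadm_id repeats,
    or a clean_text field is empty.
    """
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")

    rows, seen = [], set()
    for row_no, row in enumerate(reader, start=1):
        if row["hadm_id"] in seen:
            raise ValueError(f"Duplicate hadm_id in data row {row_no}: {row['hadm_id']}")
        if not (row["clean_text"] or "").strip():
            raise ValueError(f"Empty clean_text in data row {row_no}")
        seen.add(row["hadm_id"])
        rows.append(row)
    return rows


# The example rows from the README, read from memory instead of a file.
example = io.StringIO(
    "hadm_id,clean_text\n"
    '12345,"Chief complaint: Cardiac arrest at home. Patient found down by family..."\n'
    '12346,"Chief complaint: Chest pain. Patient presents with acute onset chest pain..."\n'
)
rows = validate_notes_csv(example)
```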
+ Module Documentation
+ Training Pipeline (ohca_training_pipeline.py)
+ Main Functions:
+
+ create_training_sample() - Create balanced annotation sample
+ prepare_training_data() - Process annotations for training
+ train_ohca_model() - Train BERT-based classifier
+ evaluate_model() - Comprehensive performance evaluation
+ complete_training_pipeline() - End-to-end training workflow
+ Example Usage:
+
+ from src.ohca_training_pipeline import complete_training_pipeline
+
+ # Complete training pipeline
+ result = complete_training_pipeline(
+     data_path="discharge_notes.csv",
+     annotation_dir="./annotation",
+     model_save_path="./trained_model"
+ )
+ Inference Module (ohca_inference.py)
+ Main Functions:
+
+ load_ohca_model() - Load pre-trained model
+ run_inference() - Full inference with analysis
+ quick_inference() - Simple inference function
+ process_large_dataset() - Handle large datasets in chunks
+ test_model_on_sample() - Test on specific text samples
+ Example Usage:
+
+ from src.ohca_inference import run_inference, load_ohca_model
+
+ # Load model and run inference
+ model, tokenizer = load_ohca_model("./trained_model")
+ results = run_inference(model, tokenizer, new_data_df)
+ Model Architecture
+ Base Model: PubMedBERT (microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract)
+ Task: Binary classification (OHCA vs Non-OHCA)
+ Max Sequence Length: 512 tokens
+ Optimization: AdamW with linear learning rate scheduling
+ Class Balancing: Weighted loss + minority class oversampling
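The class-balancing entry above names two techniques, a weighted loss and minority-class oversampling. The sketch below shows one common form of each in plain Python. It is an assumption about how such balancing is typically done, not the package's code: the inverse-frequency weighting formula and both function names are hypothetical.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-class loss weights: total / (n_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}


def oversample_minority(examples, labels, seed=0):
    """Duplicate randomly chosen minority examples until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(examples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(examples, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_x.extend(extra)
        out_y.extend([cls] * len(extra))
    return out_x, out_y


# Toy data: 8 non-OHCA notes (0) and 2 OHCA notes (1).
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
texts = [f"note {i}" for i in range(len(labels))]

weights = inverse_frequency_weights(labels)
balanced_texts, balanced_labels = oversample_minority(texts, labels)
```

In this toy example the minority class gets a weight of 2.5 against 0.625 for the majority class. Oversampling is normally applied to the training split only, so that duplicated notes cannot appear in both training and validation data.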
+ Performance Metrics
+ The model reports comprehensive clinical metrics:
+
+ Sensitivity (Recall): Percentage of OHCA cases correctly identified
+ Specificity: Percentage of non-OHCA cases correctly identified
+ Precision (PPV): When model predicts OHCA, percentage that are correct
+ NPV: When model predicts non-OHCA, percentage that are correct
+ F1-Score: Harmonic mean of precision and recall
+ AUC-ROC: Area under the receiver operating characteristic curve
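The threshold-based metrics above all follow from the four cells of a confusion matrix. As a worked illustration (not the package's `evaluate_model()` implementation, whose internals are not shown here), the hypothetical function `clinical_metrics` below computes them from binary labels and predictions:

```python
def clinical_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, PPV, NPV and F1 from binary labels (1 = OHCA)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined when the denominator is zero (e.g. no positive predictions).
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # recall
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),          # precision
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Toy example: 3 OHCA cases and 7 non-OHCA cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = clinical_metrics(y_true, y_pred)
```

Here TP = 2, FN = 1, FP = 1 and TN = 6, so sensitivity and PPV are both 2/3 and specificity and NPV are both 6/7. AUC-ROC is omitted because it needs predicted probabilities, not thresholded labels.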
+ Clinical Usage
+ Probability Thresholds
+ ≥0.9: Very high confidence - Priority manual review
+ 0.7-0.9: High confidence - Clinical review recommended
+ 0.3-0.7: Uncertain - Manual review suggested
+ <0.3: Low probability - Likely non-OHCA
+ Workflow Integration
+ Run inference on new discharge notes
+ Prioritize high-confidence predictions for review
+ Use medium-confidence cases for quality improvement
+ Monitor low-confidence cases for false negatives
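The probability thresholds listed above can be turned into a simple triage function. The sketch below is illustrative only: the category labels and the name `confidence_category` are hypothetical, and because the README does not say which band a value of exactly 0.7 or 0.3 belongs to, lower bounds are assumed to be inclusive.

```python
def confidence_category(probability):
    """Map an OHCA probability to a review category using the listed thresholds.

    Assumption: each band includes its lower bound (0.7 is "high", 0.3 is "uncertain").
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError(f"Probability must be in [0, 1], got {probability}")
    if probability >= 0.9:
        return "very_high"   # priority manual review
    if probability >= 0.7:
        return "high"        # clinical review recommended
    if probability >= 0.3:
        return "uncertain"   # manual review suggested
    return "low"             # likely non-OHCA


categories = [confidence_category(p) for p in (0.95, 0.9, 0.75, 0.5, 0.3, 0.1)]
```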
+ Repository Structure
+ nlp-ohca-classifier/
+ ├── src/
+ │   ├── __init__.py
+ │   ├── ohca_training_pipeline.py   # Training workflow
+ │   └── ohca_inference.py           # Inference on new data
+ ├── examples/
+ │   ├── training_example.py         # Complete training examples
+ │   └── inference_example.py        # Inference usage examples
+ ├── docs/
+ │   └── annotation_guidelines.md    # Detailed annotation guidelines
+ ├── requirements.txt
+ ├── setup.py
+ ├── README.md
+ └── LICENSE
+ Examples
+ Complete Training Example
+ cd examples
+ python training_example.py
+ Inference Examples
+ cd examples
+ python inference_example.py
+ Advanced Usage
+ Large Dataset Processing
+ from src.ohca_inference import process_large_dataset
+
+ # Process 100K+ records in chunks
+ process_large_dataset(
+     model_path="./trained_model",
+     data_path="large_dataset.csv",
+     output_path="results.csv",
+     chunk_size=5000
+ )
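To illustrate the idea behind chunked processing, the sketch below streams a CSV in fixed-size chunks so that only one chunk is held in memory at a time. It is a standard-library illustration under stated assumptions, not the internals of `process_large_dataset()`: the names `iter_chunks`, `process_in_chunks` and `dummy_score` are hypothetical, and `dummy_score` is a stand-in for real model inference.

```python
import csv
import io
from itertools import islice


def iter_chunks(reader, chunk_size):
    """Yield lists of at most `chunk_size` rows from a csv reader."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def process_in_chunks(in_handle, out_handle, score_fn, chunk_size=5000):
    """Score `clean_text` chunk by chunk and write hadm_id/probability rows."""
    reader = csv.DictReader(in_handle)
    writer = csv.DictWriter(out_handle, fieldnames=["hadm_id", "ohca_probability"])
    writer.writeheader()
    total = 0
    for chunk in iter_chunks(reader, chunk_size):
        probabilities = score_fn([row["clean_text"] for row in chunk])
        for row, prob in zip(chunk, probabilities):
            writer.writerow({"hadm_id": row["hadm_id"], "ohca_probability": prob})
        total += len(chunk)
    return total


def dummy_score(texts):
    # Placeholder scorer (assumption): a real pipeline would call the trained model here.
    return [1.0 if "arrest" in t.lower() else 0.0 for t in texts]


source = io.StringIO(
    "hadm_id,clean_text\n1,Cardiac arrest at home\n2,Chest pain\n3,Found down\n"
)
output = io.StringIO()
n_processed = process_in_chunks(source, output, dummy_score, chunk_size=2)
```

Writing each chunk's results as soon as they are scored means a failure late in a long run does not lose the predictions already written.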
+ Model Testing
+ from src.ohca_inference import test_model_on_sample
+
+ # Test on specific cases
+ test_cases = {
+     'case1': "Chief complaint: Cardiac arrest at home...",
+     'case2': "Chief complaint: Chest pain, no arrest..."
+ }
+
+ results = test_model_on_sample("./trained_model", test_cases)
+ Performance Benchmarks
+ Typical performance on validation data:
+
+ AUC-ROC: 0.85-0.95
+ Sensitivity: 85-95%
+ Specificity: 85-95%
+ F1-Score: 0.7-0.9
+ Performance varies based on data quality and annotation consistency
+
+ Citation
+ If you use this code in your research, please cite:
+
+ @software{nlp_ohca_classifier,
+   title={NLP OHCA Classifier: BERT-based Detection of Out-of-Hospital Cardiac Arrest in Medical Text},
+   author={Mona Moukaddem},
+   year={2025},
+   url={https://github.com/monajm36/nlp-ohca-classifier}
+ }
+ License
+ This project is licensed under the MIT License - see the LICENSE file for details.
+
+ Contributing
+ Fork the repository
+ Create a feature branch (git checkout -b feature/AmazingFeature)
+ Commit your changes (git commit -m 'Add some AmazingFeature')
+ Push to the branch (git push origin feature/AmazingFeature)
+ Open a Pull Request
+ Support
+ For questions or issues:
+
+ Check the Issues page
+ Create a new issue if needed
+ Review examples in the examples/ folder
+ Acknowledgments
+ PubMedBERT model from Microsoft Research
+ MIMIC-III dataset for model development
+ Transformers library by Hugging Face
+ PyTorch for deep learning framework