# OHCA Classifier v3.0 - Improved Methodology

A BERT-based classifier for detecting Out-of-Hospital Cardiac Arrest (OHCA) cases in medical discharge notes, built on an improved natural language processing methodology that addresses key methodological concerns in medical AI.

## Key Improvements in v3.0

This version implements significant methodological improvements based on data science best practices:

**Patient-Level Data Splits** - Prevents data leakage by ensuring all notes from the same patient stay in one split  
**Proper Train/Validation/Test** - Uses independent test set for unbiased evaluation  
**Optimal Threshold Finding** - Finds and saves optimal decision threshold during training  
**Larger Training Samples** - 800+ training samples instead of 264  
**Enhanced Clinical Decision Support** - Improved confidence categories and workflow integration  
**Unbiased Evaluation** - Eliminates threshold tuning on test data  

## Overview
This package provides two main modules with v3.0 enhancements:

- **Training Pipeline** (`ohca_training_pipeline.py`) - Complete workflow with improved methodology
- **Inference Module** (`ohca_inference.py`) - Apply models with optimal threshold support

## Features

### Training Pipeline (Enhanced v3.0)
- **Patient-Level Splits**: Prevents data leakage between training and test sets
- **Dual Annotation Strategy**: Separate training and validation annotation files
- **Intelligent Sampling**: Two-stage sampling strategy (keyword-enriched + random)  
- **Larger Sample Sizes**: 800 training + 200 validation samples
- **BERT-based Training**: Uses PubMedBERT optimized for medical text
- **Optimal Threshold Finding**: Automatically finds best decision threshold
- **Unbiased Evaluation**: Independent test set for reliable performance estimates
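The two-stage sampling strategy (keyword-enriched + random) can be sketched roughly as follows. This is an illustrative sketch, not the package's implementation: the keyword list, sample sizes, and the `two_stage_sample` helper are assumptions.

```python
import pandas as pd

# Hypothetical OHCA-related keywords; the real pipeline's list may differ.
OHCA_KEYWORDS = ["cardiac arrest", "found down", "cpr", "rosc", "defibrillat"]

def two_stage_sample(df, n_keyword=600, n_random=200, seed=42):
    """Stage 1: sample keyword-enriched notes; stage 2: top up with random notes."""
    text = df["clean_text"].str.lower()
    mask = text.str.contains("|".join(OHCA_KEYWORDS), regex=True)
    enriched = df[mask].sample(n=min(n_keyword, int(mask.sum())), random_state=seed)
    rest = df.drop(enriched.index)  # avoid sampling the same note twice
    random_part = rest.sample(n=min(n_random, len(rest)), random_state=seed)
    return pd.concat([enriched, random_part]).reset_index(drop=True)
```

Enriching on keywords raises the prevalence of true OHCA cases in the annotation set, while the random stage keeps a slice of the unselected population so the annotated sample is not purely keyword-biased.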

### Inference Module (Enhanced v3.0)
- **Optimal Threshold Usage**: Automatically uses threshold found during training
- **Enhanced Clinical Priorities**: Improved confidence categories for clinical workflow
- **Batch Processing**: Efficient inference on large datasets
- **Clinical Decision Support**: Evidence-based probability thresholds
- **Backward Compatibility**: Works with both v3.0 and legacy models

## Installation

### Prerequisites
- Python 3.8+
- PyTorch
- CUDA (optional, for GPU acceleration)

### Install from source

1. Clone the repository:
```bash
git clone https://github.com/monajm36/ohca-classifier-3.0.git
cd ohca-classifier-3.0
```

2. Set up virtual environment:
```bash
python3 -m venv .venv
source .venv/bin/activate
```

3. Install dependencies:
```bash
pip install -r requirements.txt
pip install -e .
```

**Note for Windows users**: Replace `source .venv/bin/activate` with `.venv\Scripts\activate`

## Quick Start

### Training a New Model (v3.0 Methodology - RECOMMENDED)

```python
from src.ohca_training_pipeline import complete_improved_training_pipeline
import pandas as pd

# Step 1: Create patient-level splits and annotation samples
results = complete_improved_training_pipeline(
    data_path="your_discharge_notes.csv",  # Must have: hadm_id, subject_id, clean_text
    annotation_dir="./annotation_v3",
    train_sample_size=800,    # Much larger than legacy
    val_sample_size=200       # Separate validation sample
)

# Step 2: Manually annotate BOTH Excel files:
# - annotation_v3/train_annotation.xlsx (800 cases)
# - annotation_v3/validation_annotation.xlsx (200 cases)
# Label each case: 1=OHCA, 0=Non-OHCA

# Step 3: Complete training (after annotation)
from src.ohca_training_pipeline import complete_annotation_and_train_v3

model_results = complete_annotation_and_train_v3(
    train_annotation_file="./annotation_v3/train_annotation.xlsx",
    val_annotation_file="./annotation_v3/validation_annotation.xlsx",
    test_file="./annotation_v3/test_set_DO_NOT_ANNOTATE.csv",
    model_save_path="./my_ohca_model_v3",
    num_epochs=3
)

print(f"Optimal threshold: {model_results['optimal_threshold']:.3f}")
print("The model automatically uses this threshold during inference")
```

### Using a Pre-trained v3.0 Model

```python
from src.ohca_inference import quick_inference_with_optimal_threshold
import pandas as pd

# Apply v3.0 model to new data (uses optimal threshold automatically)
new_data = pd.read_csv("new_discharge_notes.csv")  # Must have: hadm_id, clean_text
results = quick_inference_with_optimal_threshold(
    model_path="./my_ohca_model_v3",  # v3.0 model with metadata
    data_path=new_data,
    output_path="ohca_predictions.csv"
)

# Enhanced v3.0 results with clinical priorities
immediate_review = results[results['clinical_priority'] == 'Immediate Review']
priority_review = results[results['clinical_priority'] == 'Priority Review']

print(f"Immediate review needed: {len(immediate_review)} cases")
print(f"Priority review needed: {len(priority_review)} cases")
print(f"Optimal threshold used: {results['optimal_threshold_used'].iloc[0]:.3f}")
```

### Backward Compatibility (Legacy Models)

```python
from src.ohca_inference import quick_inference

# Works with both v3.0 and legacy models
results = quick_inference(
    model_path="./any_model",  # Auto-detects model version
    data_path="new_data.csv"
)
```

## Data Format

### Input Requirements (Enhanced for v3.0)
Your CSV file must contain:
- `hadm_id`: Unique identifier for each hospital admission
- `subject_id`: Patient identifier (for patient-level splits to prevent data leakage)
- `clean_text`: Preprocessed discharge note text

**Example:**
```csv
hadm_id,subject_id,clean_text
12345,101,"Chief complaint: Cardiac arrest at home. Patient found down by family..."
12346,102,"Chief complaint: Chest pain. Patient presents with acute onset chest pain..."
12347,101,"Follow-up visit. Patient doing well after recent arrest..."
```

**If you don't have patient IDs**: Add this line to your preprocessing:
```python
df['subject_id'] = df['hadm_id']  # Use admission ID as patient ID
```
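The patient-level split itself can be sketched with scikit-learn's `GroupShuffleSplit`, which keeps every admission for a given `subject_id` on the same side of the split. This is a minimal sketch for illustration; the repository's `create_patient_level_splits()` may differ in details.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def patient_level_split(df, test_size=0.2, seed=42):
    """Split so all admissions for one subject_id land in the same partition."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=df["subject_id"]))
    return df.iloc[train_idx], df.iloc[test_idx]
```

In the example table above, patient 101 has two admissions (12345 and 12347); a grouped split guarantees both land in the same partition, so the model is never evaluated on a patient it has already seen.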

### Annotation Labels
- `1`: OHCA case (cardiac arrest outside hospital, primary reason for admission)
- `0`: Non-OHCA case (everything else, including transfers and historical arrests)

## Module Documentation

### Training Pipeline (Enhanced v3.0)

**Main v3.0 Functions (RECOMMENDED):**
- `complete_improved_training_pipeline()` - Create patient-level splits and annotation samples
- `complete_annotation_and_train_v3()` - Train with optimal threshold finding
- `create_patient_level_splits()` - Create proper data splits
- `find_optimal_threshold()` - Find optimal decision threshold
- `evaluate_on_test_set()` - Unbiased final evaluation

**Legacy Functions (Backward Compatible):**
- `create_training_sample()` - Legacy single-file annotation
- `complete_annotation_and_train()` - Legacy training workflow

**Example Usage (v3.0):**
```python
from src.ohca_training_pipeline import complete_improved_training_pipeline

# Enhanced training with proper methodology
result = complete_improved_training_pipeline(
    data_path="discharge_notes.csv",
    annotation_dir="./annotation_v3",
    train_sample_size=800,
    val_sample_size=200
)
```

### Inference Module (Enhanced v3.0)

**Main v3.0 Functions (RECOMMENDED):**
- `quick_inference_with_optimal_threshold()` - Uses optimal threshold automatically
- `load_ohca_model_with_metadata()` - Load model with optimal threshold
- `run_inference_with_optimal_threshold()` - Enhanced inference
- `analyze_predictions_enhanced()` - Improved prediction analysis

**Legacy Functions (Backward Compatible):**
- `quick_inference()` - Auto-detects model version
- `load_ohca_model()` - Basic model loading
- `run_inference()` - Basic inference

**Example Usage (v3.0):**
```python
from src.ohca_inference import load_ohca_model_with_metadata, run_inference_with_optimal_threshold

# Load v3.0 model with optimal threshold
model, tokenizer, optimal_threshold, metadata = load_ohca_model_with_metadata("./trained_model")

# Run inference with optimal threshold
results = run_inference_with_optimal_threshold(model, tokenizer, new_data_df, optimal_threshold)
```

## Model Architecture
- **Base Model**: PubMedBERT (microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract)
- **Task**: Binary classification (OHCA vs Non-OHCA)
- **Max Sequence Length**: 512 tokens
- **Optimization**: AdamW with linear learning rate scheduling
- **Class Balancing**: Weighted loss + minority class oversampling
- **Threshold Selection**: Optimal threshold found via validation set (v3.0)
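Threshold selection on the validation set can be illustrated with a simple grid search that maximizes F1. The `find_best_threshold` helper below is a sketch under that assumption, not the package's exact procedure.

```python
import numpy as np
from sklearn.metrics import f1_score

def find_best_threshold(y_true, y_prob, grid=None):
    """Scan candidate thresholds on validation data and keep the best F1."""
    grid = np.linspace(0.05, 0.95, 19) if grid is None else grid
    scores = [f1_score(y_true, (y_prob >= t).astype(int)) for t in grid]
    return float(grid[int(np.argmax(scores))])
```

Because the scan runs on the validation set only, the held-out test set still yields an unbiased estimate of performance at the chosen threshold.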

## Performance Metrics

### v3.0 Enhanced Evaluation
The model provides unbiased performance estimates using:
- **Independent test set** for final evaluation
- **Optimal threshold** found on validation set only
- **Patient-level splits** preventing data leakage

**Clinical Metrics:**
- **Sensitivity (Recall)**: Percentage of OHCA cases correctly identified
- **Specificity**: Percentage of non-OHCA cases correctly identified
- **Precision (PPV)**: When model predicts OHCA, percentage that are correct
- **NPV**: When model predicts non-OHCA, percentage that are correct
- **F1-Score**: Harmonic mean of precision and recall
- **AUC-ROC**: Area under the receiver operating characteristic curve
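The four count-based metrics above fall directly out of a 2x2 confusion matrix. The helper below is a hypothetical sketch of that arithmetic using scikit-learn.

```python
from sklearn.metrics import confusion_matrix

def clinical_metrics(y_true, y_pred):
    """Derive sensitivity, specificity, PPV, and NPV from a 2x2 confusion matrix."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "sensitivity": tp / (tp + fn),  # recall on OHCA cases
        "specificity": tn / (tn + fp),  # recall on non-OHCA cases
        "ppv": tp / (tp + fp),          # precision of positive calls
        "npv": tn / (tn + fn),          # precision of negative calls
    }
```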

## Clinical Usage

### Enhanced v3.0 Clinical Decision Support

**Clinical Priorities (v3.0):**
- **Immediate Review**: Very high probability cases requiring urgent attention
- **Priority Review**: High probability cases for clinical team review
- **Clinical Review**: Medium-high probability cases above optimal threshold
- **Consider Review**: Medium probability cases for potential review
- **Routine Processing**: Low probability cases
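A priority tier can be derived from the predicted probability together with the saved optimal threshold, along these lines. The cutoffs below are illustrative assumptions, not the package's exact boundaries.

```python
def clinical_priority(prob, optimal_threshold):
    """Map a predicted OHCA probability to a priority tier.
    Cutoff values here are assumptions for illustration only."""
    if prob >= 0.95:
        return "Immediate Review"
    if prob >= 0.80:
        return "Priority Review"
    if prob >= optimal_threshold:
        return "Clinical Review"
    if prob >= 0.30:
        return "Consider Review"
    return "Routine Processing"
```

Note that only the "Clinical Review" boundary moves with the learned threshold; the outer tiers flag extreme probabilities regardless of where the operating point lands.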

**Optimal Threshold Usage:**
- Model automatically uses threshold found during validation
- Consistent decision-making across all datasets
- Better performance than static thresholds

**Workflow Integration:**
1. Run inference on new discharge notes (uses optimal threshold)
2. Prioritize "Immediate Review" cases for urgent manual review
3. Schedule "Priority Review" cases for clinical team evaluation
4. Use "Clinical Review" cases for quality improvement
5. Monitor routine cases for false negatives

## Repository Structure
```
ohca-classifier-3.0/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ ohca_training_pipeline.py    # Enhanced v3.0 training workflow
β”‚   └── ohca_inference.py            # Enhanced v3.0 inference
β”œβ”€β”€ examples/
β”‚   β”œβ”€β”€ training_example.py          # v3.0 training examples
β”‚   β”œβ”€β”€ inference_example.py         # v3.0 inference examples
β”‚   └── clif_dataset_example.py      # Cross-institutional deployment
β”œβ”€β”€ docs/
β”‚   └── annotation_guidelines.md     # Enhanced annotation guidelines
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ setup.py
β”œβ”€β”€ README.md
└── LICENSE
```

## Examples

### Complete v3.0 Training Example
```bash
cd examples
python training_example.py
# Choose option 1: v3.0 Training with Improved Methodology
```

### Enhanced v3.0 Inference Examples
```bash
cd examples
python inference_example.py
# Choose option 1: v3.0 Inference with Optimal Threshold
```

### Cross-Institutional Deployment
```bash
cd examples
python clif_dataset_example.py
# Apply v3.0 model to external datasets
```

## Advanced Usage

### Large Dataset Processing (v3.0)
```python
from src.ohca_inference import process_large_dataset_with_optimal_threshold

# Process with optimal threshold automatically
process_large_dataset_with_optimal_threshold(
    model_path="./trained_model_v3",
    data_path="large_dataset.csv",
    output_path="results.csv",
    chunk_size=5000
)
```
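Under the hood, chunked processing typically follows the standard pandas streaming pattern sketched below. `score_chunk` stands in for the model's batch inference call and is not part of the package's API.

```python
import pandas as pd

def process_in_chunks(data_path, output_path, score_chunk, chunk_size=5000):
    """Stream a large CSV through a scoring function chunk by chunk,
    appending results so the full dataset never sits in memory at once."""
    first = True
    for chunk in pd.read_csv(data_path, chunksize=chunk_size):
        scored = score_chunk(chunk)  # stand-in for batch model inference
        scored.to_csv(output_path, mode="w" if first else "a",
                      header=first, index=False)
        first = False
```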

### Model Testing with v3.0 Features
```python
from src.ohca_inference import test_model_on_sample

# Test with optimal threshold support
test_cases = {
    'case1': "Chief complaint: Cardiac arrest at home...",
    'case2': "Chief complaint: Chest pain, no arrest..."
}

results = test_model_on_sample("./trained_model_v3", test_cases)
# Results include optimal threshold predictions and clinical priorities
```

## Performance Benchmarks

### v3.0 Methodology Performance
Typical performance with improved methodology:
- **AUC-ROC**: 0.85-0.95 (unbiased estimates)
- **Sensitivity**: 85-95% (at optimal threshold)
- **Specificity**: 85-95% (at optimal threshold)
- **F1-Score**: 0.7-0.9 (optimized via validation)

**Key Improvements over Legacy:**
- **Unbiased evaluation** using independent test set
- **Optimal threshold** provides better sensitivity/specificity balance
- **Larger training sets** (800 vs 264) improve generalization
- **Patient-level splits** prevent overoptimistic performance estimates

*Performance varies based on data quality and annotation consistency*

## Migration from Legacy Versions

### Upgrading from Legacy to v3.0

**Benefits of Upgrading:**
- More reliable performance estimates
- Better clinical decision support
- Optimal threshold usage
- Enhanced workflow integration

**Migration Steps:**
1. **Add patient IDs** to your data (`subject_id` column)
2. **Retrain with v3.0 methodology** using `complete_improved_training_pipeline()`
3. **Use v3.0 inference functions** for new predictions
4. **Update workflows** to use clinical priorities

**Backward Compatibility:**
- Legacy models continue to work
- Legacy functions automatically detect model version
- Gradual migration supported

## Citation
If you use this code in your research, please cite:

```bibtex
@software{nlp_ohca_classifier_v3,
    title={NLP OHCA Classifier v3.0: BERT-based Detection of Out-of-Hospital Cardiac Arrest with Enhanced Methodology},
    author={Mona Moukaddem},
    year={2025},
    url={https://github.com/monajm36/ohca-classifier-3.0},
    note={Enhanced methodology addressing data leakage, threshold optimization, and evaluation bias}
}
```

## License
This project is licensed under the MIT License - see the LICENSE file for details.

## Contributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## Support
For questions or issues:
- Check the [Issues](https://github.com/monajm36/ohca-classifier-3.0/issues) page
- Create a new issue if needed
- Review examples in the `examples/` folder

## Methodology References
The v3.0 improvements are based on established machine learning best practices:
- Patient-level data splits prevent data leakage in healthcare AI
- Proper train/validation/test methodology ensures unbiased evaluation
- Optimal threshold finding improves clinical performance
- Larger sample sizes enhance model generalization

## Acknowledgments
- PubMedBERT model from Microsoft Research
- MIMIC-III dataset for model development
- Transformers library by Hugging Face
- PyTorch for deep learning framework
- Data science community for methodological guidance