	# OHCA Classifier v3.0 - Improved Methodology
	A BERT-based classifier for detecting Out-of-Hospital Cardiac Arrest (OHCA) cases in medical discharge notes. Version 3.0 revises the training and evaluation methodology to address data leakage, threshold selection, and evaluation bias.

	## Key Improvements in v3.0

	This version makes the following methodological changes:

	- Patient-Level Data Splits: Prevents data leakage by keeping all notes from the same patient in a single split
	- Train/Validation/Test Separation: Uses an independent test set for unbiased evaluation
	- Optimal Threshold Finding: Finds the decision threshold during training and saves it with the model
	- Larger Training Samples: 800+ annotated training samples instead of 264
	- Enhanced Clinical Decision Support: Improved confidence categories and workflow integration
	- Unbiased Evaluation: The threshold is never tuned on test data
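
A minimal sketch of a patient-level split, assuming each note is a plain dictionary with a `subject_id` key. The function name `patient_level_split` and the 70/15/15 fractions are illustrative assumptions; this is not the package's `create_patient_level_splits()` implementation.

```python
import random
from collections import defaultdict


def patient_level_split(rows, train_frac=0.7, val_frac=0.15, seed=42):
    """Split notes so that every subject_id lands in exactly one split."""
    by_patient = defaultdict(list)
    for row in rows:
        by_patient[row["subject_id"]].append(row)

    # Shuffle patients (not notes) so a patient's notes always move together.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    n_train = int(len(patients) * train_frac)
    n_val = int(len(patients) * val_frac)
    groups = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],
    }
    return {
        name: [row for pid in ids for row in by_patient[pid]]
        for name, ids in groups.items()
    }


# Toy data: 60 admissions spread over 20 patients (3 notes each).
rows = [{"hadm_id": 1000 + i, "subject_id": i % 20} for i in range(60)]
splits = patient_level_split(rows)
```

Splitting on notes instead of patients would let follow-up notes from a training patient appear in the test set, which inflates test metrics.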

	## Overview
	This package provides two main modules with v3.0 enhancements:

	- Training Pipeline (`ohca_training_pipeline.py`) - Complete workflow with improved methodology
	- Inference Module (`ohca_inference.py`) - Apply models with optimal threshold support

	## Features

	### Training Pipeline (Enhanced v3.0)
	- Patient-Level Splits: Prevents data leakage between training and test sets
	- Dual Annotation Strategy: Separate training and validation annotation files
	- Intelligent Sampling: Two-stage sampling strategy (keyword-enriched + random)
	- Larger Sample Sizes: 800 training + 200 validation samples
	- BERT-based Training: Uses PubMedBERT optimized for medical text
	- Optimal Threshold Finding: Automatically finds best decision threshold
	- Unbiased Evaluation: Independent test set for reliable performance estimates
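
The two-stage sampling idea can be sketched as follows. This is an illustration under stated assumptions: the keyword list, the `enriched_frac` parameter, and the function name `two_stage_sample` are hypothetical and are not taken from the package.

```python
import random

# Hypothetical keyword list; the package's actual keywords are not shown here.
KEYWORDS = ("cardiac arrest", "cpr", "rosc", "found down")


def two_stage_sample(notes, n_total, enriched_frac=0.5, seed=0):
    """Stage 1: sample keyword-matching notes. Stage 2: fill up at random."""
    rng = random.Random(seed)

    hits = [n for n in notes
            if any(k in n["clean_text"].lower() for k in KEYWORDS)]
    n_enriched = min(len(hits), int(n_total * enriched_frac))
    enriched = rng.sample(hits, n_enriched)

    # Draw the random stage from notes not already selected in stage 1.
    chosen = {n["hadm_id"] for n in enriched}
    remaining = [n for n in notes if n["hadm_id"] not in chosen]
    n_random = min(len(remaining), n_total - n_enriched)
    return enriched + rng.sample(remaining, n_random)


# Toy corpus: 10 notes mention an arrest, 90 do not.
notes = [
    {"hadm_id": i,
     "clean_text": "Cardiac arrest at home" if i < 10 else "Chest pain"}
    for i in range(100)
]
sample = two_stage_sample(notes, n_total=20)
```

Keyword enrichment raises the share of likely OHCA notes that annotators see, while the random stage keeps ordinary notes in the sample.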

	### Inference Module (Enhanced v3.0)
	- Optimal Threshold Usage: Automatically uses threshold found during training
	- Enhanced Clinical Priorities: Improved confidence categories for clinical workflow
	- Batch Processing: Efficient inference on large datasets
	- Clinical Decision Support: Evidence-based probability thresholds
	- Backward Compatibility: Works with both v3.0 and legacy models

	## Installation

	### Prerequisites
	- Python 3.8+
	- PyTorch
	- CUDA (optional, for GPU acceleration)

	### Install from source

	1. Clone the repository:
	```bash
	git clone https://github.com/monajm36/ohca-classifier-3.0.git
	cd ohca-classifier-3.0
	```

	2. Set up virtual environment:
	```bash
	python3 -m venv .venv/
	source .venv/bin/activate
	```

	3. Install dependencies:
	```bash
	pip install -r requirements.txt
	pip install -e .
	```

	Note for Windows users: Replace `source .venv/bin/activate` with `.venv\Scripts\activate`

	## Quick Start

	### Training a New Model (v3.0 Methodology - RECOMMENDED)

	```python
	from src.ohca_training_pipeline import complete_improved_training_pipeline

	# Step 1: Create patient-level splits and annotation samples
	results = complete_improved_training_pipeline(
	    data_path="your_discharge_notes.csv",  # Must have: hadm_id, subject_id, clean_text
	    annotation_dir="./annotation_v3",
	    train_sample_size=800,  # Much larger than legacy
	    val_sample_size=200,  # Separate validation sample
	)

	# Step 2: Manually annotate BOTH Excel files:
	# - annotation_v3/train_annotation.xlsx (800 cases)
	# - annotation_v3/validation_annotation.xlsx (200 cases)
	# Label each case: 1=OHCA, 0=Non-OHCA

	# Step 3: Complete training (run this only after annotation is finished)
	from src.ohca_training_pipeline import complete_annotation_and_train_v3

	model_results = complete_annotation_and_train_v3(
	    train_annotation_file="./annotation_v3/train_annotation.xlsx",
	    val_annotation_file="./annotation_v3/validation_annotation.xlsx",
	    test_file="./annotation_v3/test_set_DO_NOT_ANNOTATE.csv",
	    model_save_path="./my_ohca_model_v3",
	    num_epochs=3,
	)

	print(f"Optimal threshold: {model_results['optimal_threshold']:.3f}")
	print("Model automatically uses this threshold during inference")
	```

	### Using a Pre-trained v3.0 Model

	```python
	from src.ohca_inference import quick_inference_with_optimal_threshold
	import pandas as pd

	# Apply v3.0 model to new data (uses optimal threshold automatically)
	new_data = pd.read_csv("new_discharge_notes.csv")  # Must have: hadm_id, clean_text
	results = quick_inference_with_optimal_threshold(
	    model_path="./my_ohca_model_v3",  # v3.0 model with metadata
	    data_path=new_data,
	    output_path="ohca_predictions.csv",
	)

	# Enhanced v3.0 results with clinical priorities
	immediate_review = results[results['clinical_priority'] == 'Immediate Review']
	priority_review = results[results['clinical_priority'] == 'Priority Review']

	print(f"Immediate review needed: {len(immediate_review)} cases")
	print(f"Priority review needed: {len(priority_review)} cases")
	print(f"Optimal threshold used: {results['optimal_threshold_used'].iloc[0]:.3f}")
	```

	### Backward Compatibility (Legacy Models)

	```python
	from src.ohca_inference import quick_inference

	# Works with both v3.0 and legacy models
	results = quick_inference(
	    model_path="./any_model",  # Auto-detects model version
	    data_path="new_data.csv",
	)
	```
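
One way version auto-detection could work is to look for a metadata file saved next to the model and fall back to a default threshold when it is absent. The sketch below is purely illustrative: the file name `model_metadata.json`, the `optimal_threshold` key, and the 0.5 fallback are assumptions, not the package's documented behavior.

```python
import json
import tempfile
from pathlib import Path

DEFAULT_THRESHOLD = 0.5                 # assumed fallback for legacy models
METADATA_FILE = "model_metadata.json"   # hypothetical file name


def resolve_threshold(model_dir):
    """Return (threshold, version) depending on whether metadata exists."""
    meta_path = Path(model_dir) / METADATA_FILE
    if meta_path.exists():
        meta = json.loads(meta_path.read_text())
        return float(meta.get("optimal_threshold", DEFAULT_THRESHOLD)), "v3.0"
    return DEFAULT_THRESHOLD, "legacy"


# Demonstrate both cases with temporary directories standing in for models.
with tempfile.TemporaryDirectory() as legacy_dir, \
        tempfile.TemporaryDirectory() as v3_dir:
    (Path(v3_dir) / METADATA_FILE).write_text(
        json.dumps({"optimal_threshold": 0.37})
    )
    legacy_result = resolve_threshold(legacy_dir)
    v3_result = resolve_threshold(v3_dir)
```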

	## Data Format

	### Input Requirements (Enhanced for v3.0)
	Your CSV file must contain:
	- `hadm_id`: Unique identifier for each hospital admission
	- `subject_id`: Patient identifier (for patient-level splits to prevent data leakage)
	- `clean_text`: Preprocessed discharge note text

	Example:
	```csv
	hadm_id,subject_id,clean_text
	12345,101,"Chief complaint: Cardiac arrest at home. Patient found down by family..."
	12346,102,"Chief complaint: Chest pain. Patient presents with acute onset chest pain..."
	12347,101,"Follow-up visit. Patient doing well after recent arrest..."
	```

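	If you don't have patient IDs, you can fall back to the admission ID. Note that this treats every admission as a separate patient, so notes from the same patient may still land in different splits and the leakage protection is lost: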
	```python
	df['subject_id'] = df['hadm_id'] # Use admission ID as patient ID
	```
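
Before running the pipeline it can help to check the input file against the requirements above. The following is a small sketch using only the standard library; the helper name `check_notes` and the shape of its report are assumptions made for illustration.

```python
import csv
import io
from collections import Counter

REQUIRED_COLUMNS = {"hadm_id", "subject_id", "clean_text"}


def check_notes(csv_file):
    """Validate required columns and summarise notes per patient."""
    reader = csv.DictReader(csv_file)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")

    rows = list(reader)
    per_admission = Counter(r["hadm_id"] for r in rows)
    per_patient = Counter(r["subject_id"] for r in rows)
    return {
        "rows": len(rows),
        "duplicate_hadm_ids": [h for h, c in per_admission.items() if c > 1],
        "patients": len(per_patient),
        "patients_with_multiple_notes":
            sum(1 for c in per_patient.values() if c > 1),
    }


# Same three rows as the CSV example above (text shortened).
sample = io.StringIO(
    "hadm_id,subject_id,clean_text\n"
    '12345,101,"Chief complaint: Cardiac arrest at home."\n'
    '12346,102,"Chief complaint: Chest pain."\n'
    '12347,101,"Follow-up visit. Patient doing well after recent arrest."\n'
)
report = check_notes(sample)
```

A non-zero `patients_with_multiple_notes` count is exactly the situation in which patient-level splitting matters.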

	### Annotation Labels
	- `1`: OHCA case (cardiac arrest outside hospital, primary reason for admission)
	- `0`: Non-OHCA case (everything else, including transfers and historical arrests)

	## Module Documentation

	### Training Pipeline (Enhanced v3.0)

	Main v3.0 Functions (RECOMMENDED):
	- `complete_improved_training_pipeline()` - Create patient-level splits and annotation samples
	- `complete_annotation_and_train_v3()` - Train with optimal threshold finding
	- `create_patient_level_splits()` - Create proper data splits
	- `find_optimal_threshold()` - Find optimal decision threshold
	- `evaluate_on_test_set()` - Unbiased final evaluation

	Legacy Functions (Backward Compatible):
	- `create_training_sample()` - Legacy single-file annotation
	- `complete_annotation_and_train()` - Legacy training workflow

	Example Usage (v3.0):
	```python
	from src.ohca_training_pipeline import complete_improved_training_pipeline

	# Enhanced training with proper methodology
	result = complete_improved_training_pipeline(
	    data_path="discharge_notes.csv",
	    annotation_dir="./annotation_v3",
	    train_sample_size=800,
	    val_sample_size=200,
	)
	```

	### Inference Module (Enhanced v3.0)

	Main v3.0 Functions (RECOMMENDED):
	- `quick_inference_with_optimal_threshold()` - Uses optimal threshold automatically
	- `load_ohca_model_with_metadata()` - Load model with optimal threshold
	- `run_inference_with_optimal_threshold()` - Enhanced inference
	- `analyze_predictions_enhanced()` - Improved prediction analysis

	Legacy Functions (Backward Compatible):
	- `quick_inference()` - Auto-detects model version
	- `load_ohca_model()` - Basic model loading
	- `run_inference()` - Basic inference

	Example Usage (v3.0):
	```python
	from src.ohca_inference import load_ohca_model_with_metadata, run_inference_with_optimal_threshold

	# Load v3.0 model with optimal threshold
	model, tokenizer, optimal_threshold, metadata = load_ohca_model_with_metadata("./trained_model")

	# Run inference with optimal threshold
	results = run_inference_with_optimal_threshold(model, tokenizer, new_data_df, optimal_threshold)
	```

	## Model Architecture
	- Base Model: PubMedBERT (microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract)
	- Task: Binary classification (OHCA vs Non-OHCA)
	- Max Sequence Length: 512 tokens
	- Optimization: AdamW with linear learning rate scheduling
	- Class Balancing: Weighted loss + minority class oversampling
	- Threshold Selection: Optimal threshold found via validation set (v3.0)
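
To make the class-balancing item concrete, here is a sketch of two common techniques: inverse-frequency class weights for a weighted loss, and random oversampling of the minority class. The exact weighting formula and oversampling ratio used by the package are not documented in this README, so both functions below are assumptions for illustration.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def oversample_minority(examples, labels, seed=0):
    """Duplicate random minority examples until both classes are equal."""
    rng = random.Random(seed)
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    deficit = max(counts.values()) - counts[minority]
    pool = [(x, y) for x, y in zip(examples, labels) if y == minority]
    extra = [rng.choice(pool) for _ in range(deficit)]
    return list(zip(examples, labels)) + extra


# Toy label set: 10 OHCA cases among 100 notes.
labels = [1] * 10 + [0] * 90
weights = inverse_frequency_weights(labels)
balanced = oversample_minority(list(range(100)), labels)
```

Oversampling should be applied to the training split only, so that validation and test sets keep the natural class distribution.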

	## Performance Metrics

	### v3.0 Enhanced Evaluation
	The evaluation protocol is designed to give unbiased performance estimates by using:
	- Independent test set for final evaluation
	- Optimal threshold found on validation set only
	- Patient-level splits preventing data leakage

	Clinical Metrics:
	- Sensitivity (Recall): Percentage of OHCA cases correctly identified
	- Specificity: Percentage of non-OHCA cases correctly identified
	- Precision (PPV): When model predicts OHCA, percentage that are correct
	- NPV: When model predicts non-OHCA, percentage that are correct
	- F1-Score: Harmonic mean of precision and recall
	- AUC-ROC: Area under the receiver operating characteristic curve
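
The metrics above follow directly from the confusion matrix. The sketch below computes them in plain Python on toy data; the function names are illustrative, and the AUC is computed with the pairwise-ranking definition (the probability that a random positive scores higher than a random negative).

```python
def clinical_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV, NPV and F1 from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


def auc_roc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 4 OHCA cases, 6 non-OHCA cases, decisions taken at 0.5.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2, 0.1, 0.1, 0.05, 0.4]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = clinical_metrics(y_true, y_pred)
auc = auc_roc(y_true, scores)
```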

	## Clinical Usage

	### Enhanced v3.0 Clinical Decision Support

	Clinical Priorities (v3.0):
	- Immediate Review: Very high probability cases requiring urgent attention
	- Priority Review: High probability cases for clinical team review
	- Clinical Review: Medium-high probability cases above optimal threshold
	- Consider Review: Medium probability cases for potential review
	- Routine Processing: Low probability cases
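
The priority tiers can be pictured as a mapping from predicted probability to a label, anchored on the optimal threshold. The cutoffs in this sketch (`very_high=0.9`, `high=0.7`, and a "consider" band starting at half the threshold) are hypothetical values chosen for illustration; the package's actual cutoffs are not stated in this README.

```python
def clinical_priority(prob, optimal_threshold,
                      very_high=0.9, high=0.7, consider_factor=0.5):
    """Map a predicted OHCA probability to a review priority label.

    All numeric cutoffs are illustrative assumptions.
    """
    if prob >= max(very_high, optimal_threshold):
        return "Immediate Review"
    if prob >= max(high, optimal_threshold):
        return "Priority Review"
    if prob >= optimal_threshold:
        return "Clinical Review"
    if prob >= optimal_threshold * consider_factor:
        return "Consider Review"
    return "Routine Processing"


# Example with an optimal threshold of 0.4 found on the validation set.
priorities = [clinical_priority(p, 0.4) for p in (0.95, 0.75, 0.5, 0.25, 0.1)]
```

Taking `max(cutoff, optimal_threshold)` keeps the tiers ordered even when the optimal threshold is higher than a fixed cutoff.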

	Optimal Threshold Usage:
	- Model automatically uses threshold found during validation
	- Consistent decision-making across all datasets
	- Typically a better sensitivity/specificity balance than a fixed default threshold
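
A threshold search on the validation set can be sketched as below. This example assumes the selection criterion is maximum F1; the criterion actually used by `find_optimal_threshold()` is not specified in this README, and the function name `find_threshold_max_f1` is illustrative.

```python
def find_threshold_max_f1(y_true, probs):
    """Try each observed probability as a cutoff and keep the best F1."""
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(probs)):
        preds = [1 if p >= t else 0 for p in probs]
        tp = sum(1 for y, d in zip(y_true, preds) if y == 1 and d == 1)
        fp = sum(1 for y, d in zip(y_true, preds) if y == 0 and d == 1)
        fn = sum(1 for y, d in zip(y_true, preds) if y == 1 and d == 0)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:  # strict ">" keeps the lowest threshold on ties
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Validation labels and predicted probabilities (toy data).
y_val = [0, 0, 0, 0, 1, 0, 1, 1]
p_val = [0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.9]
best_t, best_f1 = find_threshold_max_f1(y_val, p_val)
```

The search uses validation data only; the test set is scored once with the chosen threshold and never used to adjust it.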

	Workflow Integration:
	1. Run inference on new discharge notes (uses optimal threshold)
	2. Prioritize "Immediate Review" cases for urgent manual review
	3. Schedule "Priority Review" cases for clinical team evaluation
	4. Use "Clinical Review" cases for quality improvement
	5. Monitor routine cases for false negatives

	## Repository Structure
	```
	ohca-classifier-3.0/
	├── src/
	│   ├── __init__.py
	│   ├── ohca_training_pipeline.py   # Enhanced v3.0 training workflow
	│   └── ohca_inference.py           # Enhanced v3.0 inference
	├── examples/
	│   ├── training_example.py         # v3.0 training examples
	│   ├── inference_example.py        # v3.0 inference examples
	│   └── clif_dataset_example.py     # Cross-institutional deployment
	├── docs/
	│   └── annotation_guidelines.md    # Enhanced annotation guidelines
	├── requirements.txt
	├── setup.py
	├── README.md
	└── LICENSE
	```

	## Examples

	### Complete v3.0 Training Example
	```bash
	cd examples
	python training_example.py
	# Choose option 1: v3.0 Training with Improved Methodology
	```

	### Enhanced v3.0 Inference Examples
	```bash
	cd examples
	python inference_example.py
	# Choose option 1: v3.0 Inference with Optimal Threshold
	```

	### Cross-Institutional Deployment
	```bash
	cd examples
	python clif_dataset_example.py
	# Apply v3.0 model to external datasets
	```

	## Advanced Usage

	### Large Dataset Processing (v3.0)
	```python
	from src.ohca_inference import process_large_dataset_with_optimal_threshold

	# Process with optimal threshold automatically
	process_large_dataset_with_optimal_threshold(
	    model_path="./trained_model_v3",
	    data_path="large_dataset.csv",
	    output_path="results.csv",
	    chunk_size=5000,
	)
	```
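
The idea behind chunked processing is to read, score, and write a fixed number of rows at a time so memory use stays bounded. The sketch below shows the pattern with the standard library; `process_in_chunks`, `iter_chunks`, and the keyword-based `toy_score` stand-in for the model are all hypothetical and do not reflect the package's internals.

```python
import csv
import io
from itertools import islice


def iter_chunks(reader, chunk_size):
    """Yield lists of at most chunk_size rows until the reader is exhausted."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def process_in_chunks(src, dst, score_fn, threshold, chunk_size=5000):
    """Score notes chunk by chunk and stream predictions to dst."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(
        dst, fieldnames=["hadm_id", "probability", "prediction"]
    )
    writer.writeheader()
    total = 0
    for chunk in iter_chunks(reader, chunk_size):
        probs = score_fn([row["clean_text"] for row in chunk])
        for row, prob in zip(chunk, probs):
            writer.writerow({
                "hadm_id": row["hadm_id"],
                "probability": f"{prob:.4f}",
                "prediction": int(prob >= threshold),
            })
        total += len(chunk)
    return total


def toy_score(texts):
    """Stand-in for the model: high score if the note mentions an arrest."""
    return [0.9 if "arrest" in t.lower() else 0.1 for t in texts]


src = io.StringIO(
    "hadm_id,clean_text\n"
    "1,Cardiac arrest at home\n"
    "2,Chest pain\n"
    "3,Found down with arrest in the field\n"
    "4,Elective surgery\n"
    "5,Pneumonia\n"
)
dst = io.StringIO()
n_rows = process_in_chunks(src, dst, toy_score, threshold=0.5, chunk_size=2)
out_rows = list(csv.DictReader(io.StringIO(dst.getvalue())))
```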

	### Model Testing with v3.0 Features
	```python
	from src.ohca_inference import test_model_on_sample

	# Test with optimal threshold support
	test_cases = {
	    'case1': "Chief complaint: Cardiac arrest at home...",
	    'case2': "Chief complaint: Chest pain, no arrest...",
	}

	results = test_model_on_sample("./trained_model_v3", test_cases)
	# Results include optimal threshold predictions and clinical priorities
	```

	## Performance Benchmarks

	### v3.0 Methodology Performance
	Typical performance with improved methodology:
	- AUC-ROC: 0.85-0.95 (unbiased estimates)
	- Sensitivity: 85-95% (at optimal threshold)
	- Specificity: 85-95% (at optimal threshold)
	- F1-Score: 0.7-0.9 (optimized via validation)

	Key Improvements over Legacy:
	- Unbiased evaluation using independent test set
	- Optimal threshold provides better sensitivity/specificity balance
	- Larger training sets (800 vs 264) improve generalization
	- Patient-level splits prevent overoptimistic performance estimates

	Actual performance varies with data quality and annotation consistency.

	## Migration from Legacy Versions

	### Upgrading from Legacy to v3.0

	Benefits of Upgrading:
	- More reliable performance estimates
	- Better clinical decision support
	- Optimal threshold usage
	- Enhanced workflow integration

	Migration Steps:
	1. Retrain with v3.0 methodology using `complete_improved_training_pipeline()`
	2. Add patient IDs to your data (`subject_id` column)
	3. Use v3.0 inference functions for new predictions
	4. Update workflows to use clinical priorities

	Backward Compatibility:
	- Legacy models continue to work
	- Legacy functions automatically detect model version
	- Gradual migration supported

	## Citation
	If you use this code in your research, please cite:

	```bibtex
	@software{nlp_ohca_classifier_v3,
	  title={NLP OHCA Classifier v3.0: BERT-based Detection of Out-of-Hospital Cardiac Arrest with Enhanced Methodology},
	  author={Mona Moukaddem},
	  year={2025},
	  url={https://github.com/monajm36/ohca-classifier-3.0},
	  note={Enhanced methodology addressing data leakage, threshold optimization, and evaluation bias}
	}
	```

	## License
	This project is licensed under the MIT License - see the LICENSE file for details.

	## Contributing
	1. Fork the repository
	2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
	3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
	4. Push to the branch (`git push origin feature/AmazingFeature`)
	5. Open a Pull Request

	## Support
	For questions or issues:
	- Check the [Issues](https://github.com/monajm36/ohca-classifier-3.0/issues) page
	- Create a new issue if needed
	- Review examples in the `examples/` folder

	## Methodology References
	The v3.0 improvements are based on established machine learning best practices:
	- Patient-level data splits prevent data leakage in healthcare AI
	- Proper train/validation/test methodology ensures unbiased evaluation
	- Optimal threshold finding improves clinical performance
	- Larger sample sizes enhance model generalization

	## Acknowledgments
	- PubMedBERT model from Microsoft Research
	- MIMIC-III dataset for model development
	- Transformers library by Hugging Face
	- PyTorch for deep learning framework
	- Data science community for methodological guidance