	# Healthcare Reason Classification System

This module classifies patient queries into healthcare visit-reason categories. The classifier is trained on real clinic appointment data.

	## Overview

The reason classifier routes medical queries to the appropriate specialized department by assigning each query to a reason category derived from actual healthcare visit data.

	## Architecture

	### Classification Categories

| Category | Description | Examples |
|----------|-------------|----------|
| `ROUTINE_CARE` | Routine healthcare, maintenance visits, general care | "I need routine foot care", "Regular nail care appointment" |
| `PAIN_CONDITIONS` | Various pain-related conditions and discomfort | "I have heel pain when I walk", "My ankle is sore" |
| `INJURIES` | Sprains, wounds, trauma-related conditions | "I sprained my ankle playing sports", "I have a wound that won't heal" |
| `SKIN_CONDITIONS` | Skin-related issues and conditions | "My toenail is ingrown and infected", "I have calluses on my feet" |
| `STRUCTURAL_ISSUES` | Structural problems and related conditions | "I have flat feet", "I need evaluation for plantar fasciitis" |
| `PROCEDURES` | Injections, surgical consultations, post-operative care | "I need a cortisone injection", "Post-surgical follow-up" |

	### Technical Implementation

	- Base Model: `sentence-transformers/embeddinggemma-300m-medical`
	- Architecture: SetFit with frozen embeddings + trainable classification head
	- Training: Real healthcare data from clinic appointment records
	- Integration: Works as part of the complete healthcare routing system
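
To make the frozen-embedding design concrete, the sketch below trains only a small softmax head on top of a fixed embedding function. It is a toy illustration under stated assumptions, not the project's SetFit code: `toy_embed` is a hypothetical stand-in for the real embedding model, and `LinearHead` is an illustrative name. Label ids 0 and 1 follow the ROUTINE_CARE and PAIN_CONDITIONS ids used in the category mapping section of this README.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for the frozen embedding model.

    Returns two keyword indicators plus a constant bias feature.
    """
    lowered = text.lower()
    return [float("routine" in lowered), float("pain" in lowered), 1.0]


def softmax(logits: Sequence[float]) -> Vector:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class LinearHead:
    """Trainable softmax head; the embedding function is never updated."""

    def __init__(self, embed: Callable[[str], Vector], dim: int, num_classes: int):
        self.embed = embed
        self.weights = [[0.0] * dim for _ in range(num_classes)]

    def _proba(self, features: Vector) -> Vector:
        logits = [sum(w * x for w, x in zip(row, features)) for row in self.weights]
        return softmax(logits)

    def fit(self, texts: Sequence[str], labels: Sequence[int],
            epochs: int = 100, lr: float = 0.5) -> None:
        # Embeddings are computed once because the encoder is frozen.
        embedded = [self.embed(text) for text in texts]
        for _ in range(epochs):
            for features, label in zip(embedded, labels):
                probs = self._proba(features)
                for cls, row in enumerate(self.weights):
                    # Gradient of softmax cross-entropy with respect to the logit.
                    error = probs[cls] - (1.0 if cls == label else 0.0)
                    for i, x in enumerate(features):
                        row[i] -= lr * error * x

    def predict_proba(self, text: str) -> Vector:
        return self._proba(self.embed(text))

    def predict(self, text: str) -> int:
        probs = self.predict_proba(text)
        return max(range(len(probs)), key=probs.__getitem__)
```

Because only `weights` changes during `fit`, swapping in a stronger embedding function would not require touching the training loop.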

	## Quick Start

	### 1. Train the Classifier

	```bash
	# Train with real healthcare data
	python classifier/reason/train_reason.py

	# The training script will:
	# - Load real healthcare data from data/reason_for_visit_data.xlsx
	# - Map reasons to categories using keyword matching
	# - Train the classifier with frozen embeddings
	# - Save the trained model to classifier/reason_checkpoints/
	```

	### 2. Use the CLI

	```bash
	# Classify a single reason query
	python cli/reason_classifier_cli_new.py "I have heel pain when I walk"

	# Interactive mode
	python cli/reason_classifier_cli_new.py --interactive

	# Batch processing
	python cli/reason_classifier_cli_new.py --batch queries.txt --output results.json

	# Use complete healthcare routing system
	python cli/healthcare_classifier_cli.py "I need routine foot care"
	```

	### 3. Programmatic Usage

	```python
	from classifier.reason import ReasonClassifier, predict_single_reason

	# Using the main classifier class
	classifier = ReasonClassifier()
	predictions = classifier.predict(["I have heel pain when I walk"])
	print(predictions[0]['category']) # Output: PAIN_CONDITIONS

	# Using convenience function
	result = predict_single_reason("I need routine foot care")
	print(result['category']) # Output: ROUTINE_CARE
	print(result['confidence']) # Confidence score
	print(result['probabilities']) # All category probabilities
	```
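
Callers will often want to treat low-confidence predictions differently. The helper below is a minimal sketch of that idea. It assumes `confidence` is a float between 0 and 1, and both the name `route_or_escalate` and the 0.6 threshold are hypothetical choices rather than part of this module.

```python
from typing import Dict, Optional


def route_or_escalate(result: Dict, threshold: float = 0.6) -> Optional[str]:
    """Return the predicted category, or None when confidence is too low.

    `result` is expected to carry the 'category' and 'confidence' keys
    shown in the usage example above. The threshold is an assumption.
    """
    if result["confidence"] < threshold:
        return None  # the caller decides the fallback, e.g. manual triage
    return result["category"]
```

Returning `None` instead of raising keeps the decision about the fallback path with the caller.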

	## System Integration

	### Complete Healthcare Routing Workflow

```
User Query
    ↓
Medical vs Insurance Classification
    ↓
┌─────────────────┬───────────────────────┐
│ Insurance       │ Medical               │
│ Queries         │ Queries               │
│       ↓         │       ↓               │
│ Insurance       │ Reason                │
│ Department      │ Classification        │
│                 │       ↓               │
│                 │ • ROUTINE_CARE        │
│                 │ • PAIN_CONDITIONS     │
│                 │ • INJURIES            │
│                 │ • SKIN_CONDITIONS     │
│                 │ • STRUCTURAL_ISSUES   │
│                 │ • PROCEDURES          │
└─────────────────┴───────────────────────┘
```

	### Integration with Healthcare System

	The reason classifier integrates as part of the complete healthcare routing system:

	1. Primary Classification: Medical vs Insurance queries
	2. Reason Classification: Medical queries → Specific reason categories
	3. Department Routing: Route to appropriate specialized departments
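
The three steps above can be sketched as a small function that chains two classifiers. This is an illustration only: `route_query`, the `"INSURANCE"` and `"MEDICAL"` label strings, and the shape of the returned dictionary are assumptions. The actual routing code lives elsewhere in the project and may differ.

```python
from typing import Callable, Dict, Optional


def route_query(
    query: str,
    classify_primary: Callable[[str], str],
    classify_reason: Callable[[str], Dict],
) -> Dict[str, Optional[str]]:
    """Two-stage routing: primary classification, then reason classification.

    `classify_primary` is assumed to return "INSURANCE" or "MEDICAL".
    `classify_reason` is assumed to return a dict with a 'category' key.
    """
    if classify_primary(query) == "INSURANCE":
        # Insurance queries skip the reason classifier entirely.
        return {"department": "INSURANCE", "reason_category": None}
    reason = classify_reason(query)
    return {"department": "MEDICAL", "reason_category": reason["category"]}
```

Passing the classifiers in as callables keeps the sketch independent of how either model is loaded.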

	## Training Data Strategy

	### Real Healthcare Data

	The system uses actual healthcare clinic data:

	```python
	# Data source: data/reason_for_visit_data.xlsx
	# Contains real patient visit reasons and appointment types
	# Examples from actual data:
	# - "Heel pain"
	# - "Routine foot care"
	# - "Ingrown toenail"
	# - "Ankle sprain"
	# - "Plantar fasciitis"
	```

	### Category Mapping Strategy

The system derives training labels by keyword matching on the raw visit reasons. Rules are checked in order, so the first matching category wins:

```python
def map_reason_to_category(reason: str) -> int:
    reason_lower = reason.lower()

    # ROUTINE_CARE (routine care, maintenance visits)
    if any(word in reason_lower for word in ['routine', 'nail care', 'calluses']):
        return 0

    # PAIN_CONDITIONS (various pain-related conditions)
    elif any(word in reason_lower for word in ['pain', 'ache', 'sore']):
        return 1

    # ... other categories
```
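
Two properties of this approach are easy to miss: matching is by substring, and rule order decides ties. The sketch below restates the two rules shown above as an ordered table so both effects can be checked directly. The names `RULES` and `first_matching_category` are hypothetical, only the two categories shown above are included, and returning `None` for unmatched reasons is an assumption about how gaps might be handled.

```python
from typing import List, Optional, Sequence, Tuple

# Ordered rules copied from the snippet above; the remaining categories are omitted.
RULES: List[Tuple[int, Tuple[str, ...]]] = [
    (0, ("routine", "nail care", "calluses")),  # ROUTINE_CARE
    (1, ("pain", "ache", "sore")),              # PAIN_CONDITIONS
]


def first_matching_category(
    reason: str,
    rules: Sequence[Tuple[int, Tuple[str, ...]]] = RULES,
) -> Optional[int]:
    """Return the id of the first rule with a keyword contained in `reason`."""
    lowered = reason.lower()
    for category_id, keywords in rules:
        # Substring test, so "ache" also matches inside "headache".
        if any(keyword in lowered for keyword in keywords):
            return category_id
    return None  # no rule matched
```

With these rules, "Painful calluses" is labeled ROUTINE_CARE because the routine rule is checked before the pain rule, which is worth keeping in mind when reordering or adding keywords.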

	## Performance Metrics

	### Expected Performance
- Accuracy: Depends on how well the keyword-derived training labels match real healthcare data patterns
- Categories: 6 specialized healthcare reason categories
- Confidence: Varies with training data quality and coverage

	### Evaluation Framework

	```bash
	# Train and evaluate the model
	python classifier/reason/train_reason.py

	# Test the trained model
	python classifier/reason/infer_reason.py

	# Results include:
	# - Training metrics
	# - Category distribution
	# - Example predictions with confidence scores
	```
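
If you want numbers beyond what the scripts print, accuracy and per-category recall can be computed from any list of true and predicted labels. The functions below are a standalone sketch with hypothetical names. They are not taken from `train_reason.py` or `infer_reason.py`.

```python
from collections import Counter
from typing import Dict, Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the true label."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_category_recall(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """For each true category, the share of its examples predicted correctly."""
    totals = Counter(y_true)  # also serves as the category distribution
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {category: hits[category] / totals[category] for category in totals}
```

Per-category recall is useful here because the category distribution in clinic data is unlikely to be balanced, so overall accuracy alone can hide weak categories.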

	## File Structure

```
classifier/reason/
├── __init__.py           # Package initialization and exports
├── README.md             # This documentation
├── reason_classifier.py  # Main ReasonClassifier class
├── infer_reason.py       # Inference functions and utilities
└── train_reason.py       # Training script and functions
```

	## API Reference

	### ReasonClassifier

```python
class ReasonClassifier:
    def __init__(self, data_file: str = "data/reason_for_visit_data.xlsx")
    def predict(self, queries: List[str]) -> List[Dict]
    def train(self, train_data: pd.DataFrame = None, eval_data: Optional[pd.DataFrame] = None)
    def save_model(self, path: str)
    def load_model(self, path: str)
    def create_real_dataset(self) -> pd.DataFrame
    def analyze_real_data(self)
```

	### Inference Functions

	```python
	def predict_single_reason(query: str) -> dict
	def predict_reason_query(text: list[str], embedding_model, classifier_head) -> dict
	def get_reason_models() -> tuple
	def test_reason_classifier()
	```

	### Training Functions

	```python
	def get_reason_model(num_classes: int)
	def get_reason_dataset() -> pd.DataFrame
	def map_reason_to_category(reason: str) -> int
	def preprocess_reason_data(df: pd.DataFrame) -> pd.DataFrame
	```

	## Data Requirements

	### Healthcare Data Format

	The system expects healthcare data in Excel format with these columns:

```
Columns:
- "Reason For Visit": The primary reason for the healthcare visit (required)
- "Appointment Type": Type of appointment (optional, used for context)

Example data:
| Reason For Visit  | Appointment Type |
|-------------------|------------------|
| Heel pain         | Follow-up        |
| Routine foot care | Maintenance      |
| Ingrown toenail   | New Patient      |
```
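
A quick check of the column layout before training can catch most data loading problems. The sketch below works on plain dictionaries, for example the output of pandas `DataFrame.to_dict("records")`, and assumes missing cells are represented as `None` or empty strings. The function name `clean_rows` and the choice to skip rows with an empty reason are assumptions, not the behavior of `preprocess_reason_data`.

```python
from typing import Dict, Iterable, List

REQUIRED_COLUMN = "Reason For Visit"
OPTIONAL_COLUMN = "Appointment Type"


def clean_rows(rows: Iterable[Dict[str, object]]) -> List[Dict[str, str]]:
    """Validate the required column and normalize whitespace.

    Rows with an empty reason are skipped. A missing appointment type
    becomes an empty string.
    """
    cleaned = []
    for index, row in enumerate(rows):
        if REQUIRED_COLUMN not in row:
            raise KeyError(f"row {index} is missing the {REQUIRED_COLUMN!r} column")
        reason = str(row[REQUIRED_COLUMN] or "").strip()
        if not reason:
            continue  # nothing to classify
        appointment = str(row.get(OPTIONAL_COLUMN) or "").strip()
        cleaned.append({REQUIRED_COLUMN: reason, OPTIONAL_COLUMN: appointment})
    return cleaned
```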

	## Deployment Considerations

	### Production Readiness

	1. Model Persistence: Trained models saved with timestamps in `classifier/reason_checkpoints/`
	2. Error Handling: Graceful fallbacks for prediction failures
	3. Real Data Integration: Uses actual healthcare clinic data
	4. Device Support: CPU/GPU/MPS compatibility
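
As an illustration of timestamped model persistence, the sketch below builds a checkpoint directory name from the current time and finds the most recent one. The `reason_YYYYMMDD_HHMMSS` naming scheme and both function names are hypothetical. The training script may name its checkpoints differently.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional, Union


def checkpoint_dir(root: Union[str, Path], now: Optional[datetime] = None) -> Path:
    """Build a timestamped checkpoint path under `root` (naming is an assumption)."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(root) / f"reason_{stamp}"


def latest_checkpoint(root: Union[str, Path]) -> Optional[Path]:
    """Return the newest checkpoint directory, or None if there is none.

    The timestamp format sorts lexicographically in chronological order,
    so a plain sort is enough.
    """
    candidates = sorted(p for p in Path(root).glob("reason_*") if p.is_dir())
    return candidates[-1] if candidates else None
```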

	### Scalability

	- Batch Processing: Efficient handling of multiple queries
	- Integration: Works with existing healthcare routing system
	- Checkpoints: Automatic model saving with timestamps
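
For large query files, splitting the input into fixed-size chunks keeps memory use predictable. The helper below is a sketch with hypothetical names and an arbitrary default batch size of 32. The `predict` argument stands for any callable that maps a list of queries to a list of results, such as `ReasonClassifier.predict`.

```python
from typing import Callable, Iterator, List, Sequence


def batched(queries: Sequence[str], batch_size: int = 32) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `batch_size` queries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(queries), batch_size):
        yield list(queries[start:start + batch_size])


def predict_in_batches(
    predict: Callable[[List[str]], List],
    queries: Sequence[str],
    batch_size: int = 32,
) -> List:
    """Run `predict` chunk by chunk and concatenate results in input order."""
    results: List = []
    for batch in batched(queries, batch_size):
        results.extend(predict(batch))
    return results
```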

	## Future Enhancements

	### Data Improvements

	1. Expanded Dataset: Include more healthcare specialties
	2. Active Learning: Improve model with real-world feedback
	3. Multi-language Support: Support for non-English healthcare queries

	### Advanced Features

	1. Confidence Calibration: Improve confidence score reliability
	2. Hierarchical Classification: Sub-categories within reason types
	3. Context Awareness: Consider patient history and appointment context

	## Troubleshooting

	### Common Issues

	1. Data Loading Errors: Ensure `data/reason_for_visit_data.xlsx` exists
	2. Low Confidence: May indicate need for more training data or model retraining
	3. Import Errors: Ensure all dependencies are installed and paths are correct

	### Debug Mode

	```python
	# Test the classifier with sample queries
	from classifier.reason.infer_reason import test_reason_classifier
	test_reason_classifier()

	# Check model predictions with probabilities
	from classifier.reason import predict_single_reason
	result = predict_single_reason("ambiguous query")
	print(result['probabilities'])
	```
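
When a query looks ambiguous, the gap between the two most likely categories is often more telling than the top score alone. The sketch below assumes `result['probabilities']` is a mapping from category name to probability. That shape is not confirmed by this README, so adapt the function if the real structure is a list or array. The name `top_two_margin` is hypothetical.

```python
from typing import Mapping, Tuple


def top_two_margin(probabilities: Mapping[str, float]) -> Tuple[str, str, float]:
    """Return the best category, the runner-up, and the probability gap between them."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two categories to compute a margin")
    (best, best_p), (second, second_p) = ranked[0], ranked[1]
    return best, second, best_p - second_p
```

A small margin suggests the query sits between two categories and may deserve a manual look.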

	### Model Training Issues

	```bash
	# Check if healthcare data is available
	ls -la data/reason_for_visit_data.xlsx

	# Verify model training
	python classifier/reason/train_reason.py

	# Test inference after training
	python classifier/reason/infer_reason.py
	```

	## Contributing

	### Adding New Categories

	1. Update `REASON_CATEGORIES` in `reason_classifier.py`, `infer_reason.py`, and `train_reason.py`
	2. Update category mapping logic in `map_reason_to_category()`
	3. Retrain the model with new categories
	4. Update documentation and examples
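
Because step 1 touches three files, it is easy for the category lists to drift apart. The check below is a sketch of one way to catch that. It assumes `REASON_CATEGORIES` is an ordered sequence of category names in each module, which this README does not confirm, and the function name is hypothetical.

```python
from typing import List, Mapping, Sequence


def find_category_drift(definitions: Mapping[str, Sequence[str]]) -> List[str]:
    """Return the modules whose category list differs from the first module's.

    `definitions` maps a module name to its REASON_CATEGORIES value.
    Order matters, since label ids depend on position.
    """
    names = list(definitions)
    if not names:
        return []
    reference = list(definitions[names[0]])
    return [name for name in names[1:] if list(definitions[name]) != reference]
```

Running such a check in a test before retraining would flag a module that was missed in step 1.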

	### Improving Training Data

	1. Add more real healthcare examples to the dataset
	2. Improve keyword mapping for better categorization
	3. Implement more sophisticated NLP techniques for category assignment

	## License

	This module is part of the health-query-classifier project and follows the same licensing terms.