# Healthcare Reason Classification System
This module implements a classifier for healthcare visit reasons. It is trained on real clinic data and assigns each patient query to one of six reason categories.
## Overview
The reason classifier supports routing medical queries to the appropriate specialized department. It labels each query with a reason category derived from actual healthcare visit data.
## Architecture
### Classification Categories
| Category | Description | Examples |
|----------|-------------|----------|
| `ROUTINE_CARE` | Routine healthcare, maintenance visits, general care | "I need routine foot care", "Regular nail care appointment" |
| `PAIN_CONDITIONS` | Various pain-related conditions and discomfort | "I have heel pain when I walk", "My ankle is sore" |
| `INJURIES` | Sprains, wounds, trauma-related conditions | "I sprained my ankle playing sports", "I have a wound that won't heal" |
| `SKIN_CONDITIONS` | Skin-related issues and conditions | "My toenail is ingrown and infected", "I have calluses on my feet" |
| `STRUCTURAL_ISSUES` | Structural problems and related conditions | "I have flat feet", "I need evaluation for plantar fasciitis" |
| `PROCEDURES` | Injections, surgical consultations, post-operative care | "I need a cortisone injection", "Post-surgical follow-up" |
### Technical Implementation
- **Base Model**: `sentence-transformers/embeddinggemma-300m-medical`
- **Architecture**: SetFit with frozen embeddings + trainable classification head
- **Training**: Real healthcare data from clinic appointment records
- **Integration**: Works as part of the complete healthcare routing system
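
The split between a frozen encoder and a trainable head can be illustrated without the real model. The sketch below is a toy stand-in, not the project's training code: `embed` is a hypothetical fixed hashing embedding that takes the place of the frozen sentence encoder, and `train_head` fits a small softmax layer on top of it with plain gradient descent. All names and hyperparameters here are assumptions made for illustration.

```python
import math
import zlib

DIM = 256  # toy embedding size; an assumption, unrelated to the real model


def embed(text):
    """Frozen stand-in encoder: hash each word into a fixed-size unit vector."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def class_probabilities(x, weights, bias):
    """Softmax over the linear head's logits."""
    logits = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, bias)]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_head(texts, labels, num_classes, epochs=200, lr=0.5):
    """Fit only the head by SGD on cross-entropy; embed() is never updated."""
    features = [embed(t) for t in texts]  # computed once: the encoder is frozen
    weights = [[0.0] * DIM for _ in range(num_classes)]
    bias = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = class_probabilities(x, weights, bias)
            for c in range(num_classes):
                grad = probs[c] - (1.0 if c == y else 0.0)  # dLoss/dlogit_c
                bias[c] -= lr * grad
                for i, xi in enumerate(x):
                    if xi:
                        weights[c][i] -= lr * grad * xi
    return weights, bias


def predict(text, weights, bias):
    """Return (predicted class index, class probabilities) for one query."""
    probs = class_probabilities(embed(text), weights, bias)
    return max(range(len(probs)), key=probs.__getitem__), probs
```

Because the embeddings never change, they can be computed once and cached, which is what keeps training of the head cheap.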
## Quick Start
### 1. Train the Classifier
```bash
# Train with real healthcare data
python classifier/reason/train_reason.py
# The training script will:
# - Load real healthcare data from data/reason_for_visit_data.xlsx
# - Map reasons to categories using keyword matching
# - Train the classifier with frozen embeddings
# - Save the trained model to classifier/reason_checkpoints/
```
### 2. Use the CLI
```bash
# Classify a single reason query
python cli/reason_classifier_cli_new.py "I have heel pain when I walk"
# Interactive mode
python cli/reason_classifier_cli_new.py --interactive
# Batch processing
python cli/reason_classifier_cli_new.py --batch queries.txt --output results.json
# Use complete healthcare routing system
python cli/healthcare_classifier_cli.py "I need routine foot care"
```
### 3. Programmatic Usage
```python
from classifier.reason import ReasonClassifier, predict_single_reason
# Using the main classifier class
classifier = ReasonClassifier()
predictions = classifier.predict(["I have heel pain when I walk"])
print(predictions[0]['category']) # Output: PAIN_CONDITIONS
# Using convenience function
result = predict_single_reason("I need routine foot care")
print(result['category']) # Output: ROUTINE_CARE
print(result['confidence']) # Confidence score
print(result['probabilities']) # All category probabilities
```
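
A caller will often want to act on the confidence score rather than only print it. The following is a minimal sketch that assumes the result dictionary has the `category` and `confidence` keys shown above; the function name `route_or_escalate`, the `0.6` threshold, and the `NEEDS_REVIEW` label are hypothetical and not part of the module.

```python
def route_or_escalate(result, min_confidence=0.6):
    """Return the predicted category, or a manual-review marker when unsure.

    `result` is assumed to look like the output of predict_single_reason():
    a dict with at least 'category' and 'confidence'.
    """
    if result["confidence"] >= min_confidence:
        return result["category"]
    return "NEEDS_REVIEW"  # hypothetical fallback label


# Hand-written example results, standing in for real classifier output.
confident = {"category": "PAIN_CONDITIONS", "confidence": 0.91}
uncertain = {"category": "INJURIES", "confidence": 0.34}

print(route_or_escalate(confident))
print(route_or_escalate(uncertain))
```

The right threshold depends on the deployment and should be chosen from held-out data rather than copied from this sketch.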
## System Integration
### Complete Healthcare Routing Workflow
```
User Query
↓
Medical vs Insurance Classification
↓
┌─────────────────┬───────────────────────┐
│ Insurance       │ Medical               │
│ Queries         │ Queries               │
│     ↓           │     ↓                 │
│ Insurance       │ Reason                │
│ Department      │ Classification        │
│                 │     ↓                 │
│                 │ • ROUTINE_CARE        │
│                 │ • PAIN_CONDITIONS     │
│                 │ • INJURIES            │
│                 │ • SKIN_CONDITIONS     │
│                 │ • STRUCTURAL_ISSUES   │
│                 │ • PROCEDURES          │
└─────────────────┴───────────────────────┘
```
### Integration with Healthcare System
The reason classifier integrates as part of the complete healthcare routing system:
1. **Primary Classification**: Medical vs Insurance queries
2. **Reason Classification**: Medical queries → Specific reason categories
3. **Department Routing**: Route to appropriate specialized departments
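
The three steps above can be sketched as a single function that chains two classifiers. This is an illustration only: `route_query`, the two callables, and the `INSURANCE_DEPARTMENT` label are hypothetical names, and the stub classifiers stand in for the real primary and reason models.

```python
from typing import Callable


def route_query(
    query: str,
    is_medical: Callable[[str], bool],
    classify_reason: Callable[[str], str],
) -> str:
    """Two-stage routing: primary split first, reason category second."""
    if not is_medical(query):
        return "INSURANCE_DEPARTMENT"  # hypothetical destination label
    return classify_reason(query)      # only medical queries reach stage two


# Stub classifiers for demonstration; the real system would call its models.
def stub_is_medical(query: str) -> bool:
    return "insurance" not in query.lower()


def stub_classify_reason(query: str) -> str:
    return "PAIN_CONDITIONS" if "pain" in query.lower() else "ROUTINE_CARE"


print(route_query("Does my insurance cover orthotics?", stub_is_medical, stub_classify_reason))
print(route_query("I have heel pain when I walk", stub_is_medical, stub_classify_reason))
```

Passing the classifiers in as callables keeps the routing logic testable without loading either model.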
## Training Data Strategy
### Real Healthcare Data
The system uses actual healthcare clinic data:
```python
# Data source: data/reason_for_visit_data.xlsx
# Contains real patient visit reasons and appointment types
# Examples from actual data:
# - "Heel pain"
# - "Routine foot care"
# - "Ingrown toenail"
# - "Ankle sprain"
# - "Plantar fasciitis"
```
### Category Mapping Strategy
The system uses keyword-based mapping to categorize real healthcare reasons:
```python
def map_reason_to_category(reason: str) -> int:
    reason_lower = reason.lower()
    # ROUTINE_CARE (routine care, maintenance visits)
    if any(word in reason_lower for word in ['routine', 'nail care', 'calluses']):
        return 0
    # PAIN_CONDITIONS (various pain-related conditions)
    elif any(word in reason_lower for word in ['pain', 'ache', 'sore']):
        return 1
    # ... other categories
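
Two properties of this kind of mapping are worth seeing in action: rules are checked in order, so the first match wins, and keywords match as substrings. The sketch below is a table-driven variant that contains only the two rules shown above; the remaining categories are deliberately left out, and `RULES` and `first_matching_category` are hypothetical names.

```python
from typing import Optional

# (category id, keywords) pairs, checked in order.
# Only the two rules from the excerpt above are included.
RULES = [
    (0, ["routine", "nail care", "calluses"]),  # ROUTINE_CARE
    (1, ["pain", "ache", "sore"]),              # PAIN_CONDITIONS
]


def first_matching_category(reason: str) -> Optional[int]:
    """Return the id of the first rule with a keyword in `reason`, else None."""
    reason_lower = reason.lower()
    for category_id, keywords in RULES:
        if any(keyword in reason_lower for keyword in keywords):
            return category_id
    return None  # no rule in this reduced table matched


print(first_matching_category("Heel pain"))                   # matches 'pain'
print(first_matching_category("Routine visit, heel pain"))    # 'routine' is checked first
print(first_matching_category("Painless lump"))               # substring 'pain' still matches
print(first_matching_category("Ankle sprain"))                # no keyword from these two rules
```

Cases like "Painless lump" are a reason to spot-check the mapped labels before training on them.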
## Performance Metrics
### Expected Performance
- **Accuracy**: No fixed figure is reported; it depends on the patterns in the real healthcare data used for training
- **Categories**: 6 specialized healthcare reason categories
- **Confidence**: Varies with the quality of the training data
### Evaluation Framework
```bash
# Train and evaluate the model
python classifier/reason/train_reason.py
# Test the trained model
python classifier/reason/infer_reason.py
# Results include:
# - Training metrics
# - Category distribution
# - Example predictions with confidence scores
```
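
If labeled queries are held out from training, accuracy and per-category recall can be computed from the predictions with a few lines of standard-library code. This is a sketch under that assumption; `summarize_predictions` is a hypothetical helper, not a function of this module, and the labels in the example are made up.

```python
from collections import Counter


def summarize_predictions(true_labels, predicted_labels):
    """Compute accuracy, label distribution, and per-category recall."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("true_labels and predicted_labels must have equal length")
    pairs = list(zip(true_labels, predicted_labels))
    distribution = Counter(true_labels)
    correct_by_category = Counter(t for t, p in pairs if t == p)
    total = len(pairs)
    return {
        "accuracy": sum(correct_by_category.values()) / total if total else 0.0,
        "distribution": dict(distribution),
        "recall": {c: correct_by_category[c] / n for c, n in distribution.items()},
    }


# Made-up labels for illustration only.
truth = ["PAIN_CONDITIONS", "PAIN_CONDITIONS", "INJURIES", "ROUTINE_CARE"]
predicted = ["PAIN_CONDITIONS", "INJURIES", "INJURIES", "ROUTINE_CARE"]
print(summarize_predictions(truth, predicted))
```

Per-category recall matters here because visit reasons are rarely balanced, and overall accuracy can hide a category that is almost always missed.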
## File Structure
```
classifier/reason/
├── __init__.py           # Package initialization and exports
├── README.md             # This documentation
├── reason_classifier.py  # Main ReasonClassifier class
├── infer_reason.py       # Inference functions and utilities
└── train_reason.py       # Training script and functions
```
## API Reference
### ReasonClassifier
```python
class ReasonClassifier:
    def __init__(self, data_file: str = "data/reason_for_visit_data.xlsx")
    def predict(self, queries: List[str]) -> List[Dict]
    def train(self, train_data: pd.DataFrame = None, eval_data: Optional[pd.DataFrame] = None)
    def save_model(self, path: str)
    def load_model(self, path: str)
    def create_real_dataset(self) -> pd.DataFrame
    def analyze_real_data(self)
### Inference Functions
```python
def predict_single_reason(query: str) -> dict
def predict_reason_query(text: list[str], embedding_model, classifier_head) -> dict
def get_reason_models() -> tuple
def test_reason_classifier()
```
### Training Functions
```python
def get_reason_model(num_classes: int)
def get_reason_dataset() -> pd.DataFrame
def map_reason_to_category(reason: str) -> int
def preprocess_reason_data(df: pd.DataFrame) -> pd.DataFrame
```
## Data Requirements
### Healthcare Data Format
The system expects healthcare data in Excel format with these columns:
```
Columns:
- "Reason For Visit" (required): The primary reason for the healthcare visit
- "Appointment Type" (optional): Type of appointment, used for context
Example data:
| Reason For Visit | Appointment Type |
|------------------|------------------|
| Heel pain | Follow-up |
| Routine foot care| Maintenance |
| Ingrown toenail | New Patient |
```
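
Before training, it can help to check that the rows match this format. The sketch below works on plain dictionaries, one per spreadsheet row, however they were loaded, and assumes cell values are strings or `None`. `clean_rows` is a hypothetical helper and not the module's `preprocess_reason_data`.

```python
REQUIRED_COLUMN = "Reason For Visit"
OPTIONAL_COLUMN = "Appointment Type"


def clean_rows(rows):
    """Validate and normalize rows given as dicts keyed by column name.

    Raises KeyError if the required column is missing from a row,
    drops rows whose reason is empty, and fills a missing or empty
    appointment type with None.
    """
    cleaned = []
    for index, row in enumerate(rows):
        if REQUIRED_COLUMN not in row:
            raise KeyError(f"row {index} has no '{REQUIRED_COLUMN}' column")
        reason = str(row[REQUIRED_COLUMN] or "").strip()
        if not reason:
            continue  # nothing to classify in this row
        appointment = str(row.get(OPTIONAL_COLUMN) or "").strip() or None
        cleaned.append({REQUIRED_COLUMN: reason, OPTIONAL_COLUMN: appointment})
    return cleaned


sample = [
    {"Reason For Visit": " Heel pain ", "Appointment Type": "Follow-up"},
    {"Reason For Visit": "", "Appointment Type": "New Patient"},
    {"Reason For Visit": "Ingrown toenail"},
]
print(clean_rows(sample))
```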
## Deployment Considerations
### Production Readiness
1. **Model Persistence**: Trained models saved with timestamps in `classifier/reason_checkpoints/`
2. **Error Handling**: Graceful fallbacks for prediction failures
3. **Real Data Integration**: Uses actual healthcare clinic data
4. **Device Support**: CPU/GPU/MPS compatibility
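
One way to read "graceful fallbacks" is a wrapper that turns a prediction failure into a well-formed, clearly marked result instead of an exception. The sketch below is an assumption about how that could look, not a description of the module's actual error handling; `safe_predict` and the `error` key are hypothetical.

```python
def safe_predict(predict_fn, query):
    """Call predict_fn(query); on failure return a marked fallback result."""
    try:
        return predict_fn(query)
    except Exception as exc:  # broad on purpose: any failure becomes a fallback
        return {
            "category": None,
            "confidence": 0.0,
            "probabilities": {},
            "error": f"{type(exc).__name__}: {exc}",  # hypothetical extra key
        }


def broken_predictor(query):
    """Stand-in for a predictor whose model failed to load."""
    raise RuntimeError("model not loaded")


print(safe_predict(broken_predictor, "I have heel pain when I walk"))
```

Callers can then treat a `None` category as "send to manual triage" without wrapping every call in their own `try` block.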
### Scalability
- **Batch Processing**: Efficient handling of multiple queries
- **Integration**: Works with existing healthcare routing system
- **Checkpoints**: Automatic model saving with timestamps
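
Timestamped checkpoints are easy to manage when the timestamp sorts lexicographically in chronological order. The sketch below assumes a `YYYYMMDD_HHMMSS` directory name under `classifier/reason_checkpoints/`; the actual naming scheme used by the training script may differ, and both function names are hypothetical.

```python
from datetime import datetime
from pathlib import Path


def checkpoint_dir(base="classifier/reason_checkpoints", now=None):
    """Build a checkpoint path whose last component is a sortable timestamp."""
    now = now or datetime.now()
    return Path(base) / now.strftime("%Y%m%d_%H%M%S")  # assumed format


def latest_checkpoint(names):
    """Pick the newest checkpoint name; valid because the format sorts by time."""
    return max(names) if names else None


print(checkpoint_dir(now=datetime(2025, 1, 2, 3, 4, 5)).as_posix())
print(latest_checkpoint(["20241231_235959", "20250102_030405", "20250101_120000"]))
```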
## Future Enhancements
### Data Improvements
1. **Expanded Dataset**: Include more healthcare specialties
2. **Active Learning**: Improve model with real-world feedback
3. **Multi-language Support**: Support for non-English healthcare queries
### Advanced Features
1. **Confidence Calibration**: Improve confidence score reliability
2. **Hierarchical Classification**: Sub-categories within reason types
3. **Context Awareness**: Consider patient history and appointment context
## Troubleshooting
### Common Issues
1. **Data Loading Errors**: Ensure `data/reason_for_visit_data.xlsx` exists
2. **Low Confidence**: May indicate need for more training data or model retraining
3. **Import Errors**: Ensure all dependencies are installed and paths are correct
### Debug Mode
```python
# Test the classifier with sample queries
from classifier.reason.infer_reason import test_reason_classifier
test_reason_classifier()
# Check model predictions with probabilities
from classifier.reason import predict_single_reason
result = predict_single_reason("ambiguous query")
print(result['probabilities'])
```
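
When a query is ambiguous, the gap between the two most likely categories is often more informative than the top score alone. The sketch below assumes `result['probabilities']` is a dictionary mapping category names to probabilities, which is an assumption about the return format; `top_categories` and `top_margin` are hypothetical helpers, and the numbers are made up.

```python
def top_categories(probabilities, k=2):
    """Return the k most likely (category, probability) pairs, best first."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


def top_margin(probabilities):
    """Difference between the best and second-best probability."""
    top = top_categories(probabilities, k=2)
    if len(top) < 2:
        return top[0][1] if top else 0.0
    return top[0][1] - top[1][1]


# Made-up probabilities for illustration only.
example = {"PAIN_CONDITIONS": 0.45, "INJURIES": 0.40, "ROUTINE_CARE": 0.15}
print(top_categories(example))
print(top_margin(example))
```

A small margin suggests the query sits between two categories, which is a useful signal when deciding what training examples to add.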
### Model Training Issues
```bash
# Check if healthcare data is available
ls -la data/reason_for_visit_data.xlsx
# Verify model training
python classifier/reason/train_reason.py
# Test inference after training
python classifier/reason/infer_reason.py
```
## Contributing
### Adding New Categories
1. Update `REASON_CATEGORIES` in `reason_classifier.py`, `infer_reason.py`, and `train_reason.py`
2. Update category mapping logic in `map_reason_to_category()`
3. Retrain the model with new categories
4. Update documentation and examples
### Improving Training Data
1. Add more real healthcare examples to the dataset
2. Improve keyword mapping for better categorization
3. Implement more sophisticated NLP techniques for category assignment
## License
This module is part of the health-query-classifier project and follows the same licensing terms.