File size: 10,345 Bytes
b7f3196 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 |
# Healthcare Reason Classification System
This module implements a specialized classifier for healthcare visit reasons using real clinic data to classify patient queries into specific healthcare reason categories.
## Overview
The reason classifier addresses the challenge of routing medical healthcare queries to appropriate specialized departments. It classifies medical queries into specific reason categories based on actual healthcare visit data.
## Architecture
### Classification Categories
| Category | Description | Examples |
|----------|-------------|----------|
| `ROUTINE_CARE` | Routine healthcare, maintenance visits, general care | "I need routine foot care", "Regular nail care appointment" |
| `PAIN_CONDITIONS` | Various pain-related conditions and discomfort | "I have heel pain when I walk", "My ankle is sore" |
| `INJURIES` | Sprains, wounds, trauma-related conditions | "I sprained my ankle playing sports", "I have a wound that won't heal" |
| `SKIN_CONDITIONS` | Skin-related issues and conditions | "My toenail is ingrown and infected", "I have calluses on my feet" |
| `STRUCTURAL_ISSUES` | Structural problems and related conditions | "I have flat feet", "I need evaluation for plantar fasciitis" |
| `PROCEDURES` | Injections, surgical consultations, post-operative care | "I need a cortisone injection", "Post-surgical follow-up" |
### Technical Implementation
- **Base Model**: `sentence-transformers/embeddinggemma-300m-medical`
- **Architecture**: SetFit with frozen embeddings + trainable classification head
- **Training**: Real healthcare data from clinic appointment records
- **Integration**: Works as part of the complete healthcare routing system
## Quick Start
### 1. Train the Classifier
```bash
# Train with real healthcare data
python classifier/reason/train_reason.py
# The training script will:
# - Load real healthcare data from data/reason_for_visit_data.xlsx
# - Map reasons to categories using keyword matching
# - Train the classifier with frozen embeddings
# - Save the trained model to classifier/reason_checkpoints/
```
### 2. Use the CLI
```bash
# Classify a single reason query
python cli/reason_classifier_cli_new.py "I have heel pain when I walk"
# Interactive mode
python cli/reason_classifier_cli_new.py --interactive
# Batch processing
python cli/reason_classifier_cli_new.py --batch queries.txt --output results.json
# Use complete healthcare routing system
python cli/healthcare_classifier_cli.py "I need routine foot care"
```
### 3. Programmatic Usage
```python
from classifier.reason import ReasonClassifier, predict_single_reason
# Using the main classifier class
classifier = ReasonClassifier()
predictions = classifier.predict(["I have heel pain when I walk"])
print(predictions[0]['category']) # Output: PAIN_CONDITIONS
# Using convenience function
result = predict_single_reason("I need routine foot care")
print(result['category']) # Output: ROUTINE_CARE
print(result['confidence']) # Confidence score
print(result['probabilities']) # All category probabilities
```
## System Integration
### Complete Healthcare Routing Workflow
```
User Query
β
Medical vs Insurance Classification
β
βββββββββββββββββββ¬ββββββββββββββββββ
β Insurance β Medical β
β Queries β Queries β
β β β β β
β Insurance β Reason β
β Department β Classification β
β β β β
β β β’ ROUTINE_CARE β
β β β’ PAIN_CONDITIONS β
β β β’ INJURIES β
β β β’ SKIN_CONDITIONS β
β β β’ STRUCTURAL_ISSUES β
β β β’ PROCEDURES β
βββββββββββββββββββ΄ββββββββββββββββββ
```
### Integration with Healthcare System
The reason classifier integrates as part of the complete healthcare routing system:
1. **Primary Classification**: Medical vs Insurance queries
2. **Reason Classification**: Medical queries β Specific reason categories
3. **Department Routing**: Route to appropriate specialized departments
## Training Data Strategy
### Real Healthcare Data
The system uses actual healthcare clinic data:
```python
# Data source: data/reason_for_visit_data.xlsx
# Contains real patient visit reasons and appointment types
# Examples from actual data:
# - "Heel pain"
# - "Routine foot care"
# - "Ingrown toenail"
# - "Ankle sprain"
# - "Plantar fasciitis"
```
### Category Mapping Strategy
The system uses keyword-based mapping to categorize real healthcare reasons:
```python
def map_reason_to_category(reason: str) -> int:
reason_lower = reason.lower()
# ROUTINE_CARE (routine care, maintenance visits)
if any(word in reason_lower for word in ['routine', 'nail care', 'calluses']):
return 0
# PAIN_CONDITIONS (various pain-related conditions)
elif any(word in reason_lower for word in ['pain', 'ache', 'sore']):
return 1
# ... other categories
```
## Performance Metrics
### Expected Performance
- **Accuracy**: Based on real healthcare data patterns
- **Categories**: 6 specialized healthcare reason categories
- **Confidence**: Variable based on training data quality
### Evaluation Framework
```bash
# Train and evaluate the model
python classifier/reason/train_reason.py
# Test the trained model
python classifier/reason/infer_reason.py
# Results include:
# - Training metrics
# - Category distribution
# - Example predictions with confidence scores
```
## File Structure
```
classifier/reason/
βββ __init__.py # Package initialization and exports
βββ README.md # This documentation
βββ reason_classifier.py # Main ReasonClassifier class
βββ infer_reason.py # Inference functions and utilities
βββ train_reason.py # Training script and functions
```
## API Reference
### ReasonClassifier
```python
class ReasonClassifier:
def __init__(self, data_file: str = "data/reason_for_visit_data.xlsx")
def predict(self, queries: List[str]) -> List[Dict]
def train(self, train_data: pd.DataFrame = None, eval_data: Optional[pd.DataFrame] = None)
def save_model(self, path: str)
def load_model(self, path: str)
def create_real_dataset(self) -> pd.DataFrame
def analyze_real_data(self)
```
### Inference Functions
```python
def predict_single_reason(query: str) -> dict
def predict_reason_query(text: list[str], embedding_model, classifier_head) -> dict
def get_reason_models() -> tuple
def test_reason_classifier()
```
### Training Functions
```python
def get_reason_model(num_classes: int)
def get_reason_dataset() -> pd.DataFrame
def map_reason_to_category(reason: str) -> int
def preprocess_reason_data(df: pd.DataFrame) -> pd.DataFrame
```
## Data Requirements
### Healthcare Data Format
The system expects healthcare data in Excel format with these columns:
```
Required columns:
- "Reason For Visit": The primary reason for the healthcare visit
- "Appointment Type": Type of appointment (optional, used for context)
Example data:
| Reason For Visit | Appointment Type |
|------------------|------------------|
| Heel pain | Follow-up |
| Routine foot care| Maintenance |
| Ingrown toenail | New Patient |
```
## Deployment Considerations
### Production Readiness
1. **Model Persistence**: Trained models saved with timestamps in `classifier/reason_checkpoints/`
2. **Error Handling**: Graceful fallbacks for prediction failures
3. **Real Data Integration**: Uses actual healthcare clinic data
4. **Device Support**: CPU/GPU/MPS compatibility
### Scalability
- **Batch Processing**: Efficient handling of multiple queries
- **Integration**: Works with existing healthcare routing system
- **Checkpoints**: Automatic model saving with timestamps
## Future Enhancements
### Data Improvements
1. **Expanded Dataset**: Include more healthcare specialties
2. **Active Learning**: Improve model with real-world feedback
3. **Multi-language Support**: Support for non-English healthcare queries
### Advanced Features
1. **Confidence Calibration**: Improve confidence score reliability
2. **Hierarchical Classification**: Sub-categories within reason types
3. **Context Awareness**: Consider patient history and appointment context
## Troubleshooting
### Common Issues
1. **Data Loading Errors**: Ensure `data/reason_for_visit_data.xlsx` exists
2. **Low Confidence**: May indicate need for more training data or model retraining
3. **Import Errors**: Ensure all dependencies are installed and paths are correct
### Debug Mode
```python
# Test the classifier with sample queries
from classifier.reason.infer_reason import test_reason_classifier
test_reason_classifier()
# Check model predictions with probabilities
from classifier.reason import predict_single_reason
result = predict_single_reason("ambiguous query")
print(result['probabilities'])
```
### Model Training Issues
```bash
# Check if healthcare data is available
ls -la data/reason_for_visit_data.xlsx
# Verify model training
python classifier/reason/train_reason.py
# Test inference after training
python classifier/reason/infer_reason.py
```
## Contributing
### Adding New Categories
1. Update `REASON_CATEGORIES` in `reason_classifier.py`, `infer_reason.py`, and `train_reason.py`
2. Update category mapping logic in `map_reason_to_category()`
3. Retrain the model with new categories
4. Update documentation and examples
### Improving Training Data
1. Add more real healthcare examples to the dataset
2. Improve keyword mapping for better categorization
3. Implement more sophisticated NLP techniques for category assignment
## License
This module is part of the health-query-classifier project and follows the same licensing terms. |