# Healthcare Reason Classification System

This module implements a classifier for healthcare visit reasons. It is trained on real clinic data and assigns each patient query to one of six reason categories.

## Overview

The reason classifier addresses the challenge of routing medical queries to the appropriate specialized department. It assigns each query to a reason category derived from actual healthcare visit data.

## Architecture

### Classification Categories

| Category | Description | Examples |
|----------|-------------|----------|
| `ROUTINE_CARE` | Routine healthcare, maintenance visits, general care | "I need routine foot care", "Regular nail care appointment" |
| `PAIN_CONDITIONS` | Various pain-related conditions and discomfort | "I have heel pain when I walk", "My ankle is sore" |
| `INJURIES` | Sprains, wounds, trauma-related conditions | "I sprained my ankle playing sports", "I have a wound that won't heal" |
| `SKIN_CONDITIONS` | Skin-related issues and conditions | "My toenail is ingrown and infected", "I have calluses on my feet" |
| `STRUCTURAL_ISSUES` | Structural problems and related conditions | "I have flat feet", "I need evaluation for plantar fasciitis" |
| `PROCEDURES` | Injections, surgical consultations, post-operative care | "I need a cortisone injection", "Post-surgical follow-up" |

### Technical Implementation

- **Base Model**: `sentence-transformers/embeddinggemma-300m-medical`
- **Architecture**: SetFit with frozen embeddings + trainable classification head
- **Training**: Real healthcare data from clinic appointment records
- **Integration**: Works as part of the complete healthcare routing system
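
The "frozen embeddings + trainable classification head" pattern listed above can be illustrated with a minimal, dependency-free sketch. This is not the SetFit API and not this module's training code: `embed` is a hypothetical stand-in (a hashed bag-of-words vector) for the real encoder, and `LinearHead` is an illustrative softmax-regression head. The point is only that embeddings are computed once and never updated, while the head's weights are the sole trainable parameters.

```python
import math
import zlib
from typing import List

DIM = 64  # size of the stand-in embedding; the real encoder's dimension differs


def embed(text: str) -> List[float]:
    """Stand-in for the frozen encoder: a hashed bag-of-words vector (NOT the real model)."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class LinearHead:
    """The only trainable part: softmax regression on top of fixed embeddings."""

    def __init__(self, num_classes: int) -> None:
        self.weights = [[0.0] * DIM for _ in range(num_classes)]
        self.bias = [0.0] * num_classes

    def predict_proba(self, x: List[float]) -> List[float]:
        logits = [sum(w * xi for w, xi in zip(row, x)) + b
                  for row, b in zip(self.weights, self.bias)]
        return softmax(logits)

    def fit(self, xs: List[List[float]], ys: List[int],
            epochs: int = 200, lr: float = 0.5) -> None:
        """Plain SGD on cross-entropy loss; the embeddings in `xs` are never modified."""
        for _ in range(epochs):
            for x, y in zip(xs, ys):
                probs = self.predict_proba(x)
                for k, p in enumerate(probs):
                    grad = p - (1.0 if k == y else 0.0)  # d(loss)/d(logit_k)
                    self.bias[k] -= lr * grad
                    for j, xj in enumerate(x):
                        self.weights[k][j] -= lr * grad * xj


# Toy data for illustration only: 0 = ROUTINE_CARE, 1 = PAIN_CONDITIONS
texts = ["routine foot care", "regular nail care", "heel pain", "sore ankle"]
labels = [0, 0, 1, 1]
features = [embed(t) for t in texts]  # computed once; the encoder stays frozen
head = LinearHead(num_classes=2)
head.fit(features, labels)
```

Because the encoder is frozen, the embeddings can be cached and the head retrained cheaply when new labeled examples arrive.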

## Quick Start

### 1. Train the Classifier

```bash
# Train with real healthcare data
python classifier/reason/train_reason.py

# The training script will:
# - Load real healthcare data from data/reason_for_visit_data.xlsx
# - Map reasons to categories using keyword matching
# - Train the classifier with frozen embeddings
# - Save the trained model to classifier/reason_checkpoints/
```

### 2. Use the CLI

```bash
# Classify a single reason query
python cli/reason_classifier_cli_new.py "I have heel pain when I walk"

# Interactive mode
python cli/reason_classifier_cli_new.py --interactive

# Batch processing
python cli/reason_classifier_cli_new.py --batch queries.txt --output results.json

# Use complete healthcare routing system
python cli/healthcare_classifier_cli.py "I need routine foot care"
```

### 3. Programmatic Usage

```python
from classifier.reason import ReasonClassifier, predict_single_reason

# Using the main classifier class
classifier = ReasonClassifier()
predictions = classifier.predict(["I have heel pain when I walk"])
print(predictions[0]['category'])  # Expected: PAIN_CONDITIONS

# Using convenience function
result = predict_single_reason("I need routine foot care")
print(result['category'])  # Expected: ROUTINE_CARE
print(result['confidence'])  # Confidence score
print(result['probabilities'])  # All category probabilities
```

## System Integration

### Complete Healthcare Routing Workflow

```
User Query
    ↓
Medical vs Insurance Classification
    ↓
┌─────────────────┬───────────────────────┐
│   Insurance     │        Medical        │
│   Queries       │        Queries        │
│       ↓         │           ↓           │
│  Insurance      │        Reason         │
│  Department     │    Classification     │
│                 │           ↓           │
│                 │  • ROUTINE_CARE       │
│                 │  • PAIN_CONDITIONS    │
│                 │  • INJURIES           │
│                 │  • SKIN_CONDITIONS    │
│                 │  • STRUCTURAL_ISSUES  │
│                 │  • PROCEDURES         │
└─────────────────┴───────────────────────┘
```

### Integration with Healthcare System

The reason classifier runs as the second stage of the complete healthcare routing system:

1. **Primary Classification**: Medical vs Insurance queries
2. **Reason Classification**: Medical queries → Specific reason categories
3. **Department Routing**: Route to appropriate specialized departments
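
The three steps above can be sketched as a small two-stage router. Everything named here is an assumption for illustration: `route_query`, the `"medical"`/`"insurance"` label strings, and the two fake classifiers are hypothetical stand-ins, not the actual routing code. The only detail borrowed from this README is that a reason prediction is a dict with a `category` key.

```python
from typing import Callable, Dict


def route_query(
    query: str,
    primary: Callable[[str], str],
    reason: Callable[[str], Dict],
) -> Dict:
    """Stage 1 decides medical vs insurance; stage 2 runs only for medical queries."""
    kind = primary(query)
    if kind == "insurance":
        # Insurance queries skip reason classification entirely.
        return {"department": "INSURANCE", "reason": None}
    result = reason(query)
    return {"department": result["category"], "reason": result}


# Hypothetical stand-ins for the two trained classifiers.
def fake_primary(query: str) -> str:
    return "insurance" if "claim" in query.lower() else "medical"


def fake_reason(query: str) -> Dict:
    return {"category": "PAIN_CONDITIONS", "confidence": 0.9}


medical = route_query("I have heel pain when I walk", fake_primary, fake_reason)
insurance = route_query("Question about my claim", fake_primary, fake_reason)
```

Passing the classifiers in as callables keeps the routing logic testable without loading any model.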

## Training Data Strategy

### Real Healthcare Data

The system uses actual healthcare clinic data:

```python
# Data source: data/reason_for_visit_data.xlsx
# Contains real patient visit reasons and appointment types
# Examples from actual data:
# - "Heel pain"
# - "Routine foot care"
# - "Ingrown toenail"
# - "Ankle sprain"
# - "Plantar fasciitis"
```

### Category Mapping Strategy

The system uses keyword-based mapping to categorize real healthcare reasons:

```python
def map_reason_to_category(reason: str) -> int:
    reason_lower = reason.lower()
    
    # ROUTINE_CARE (routine care, maintenance visits)
    if any(word in reason_lower for word in ['routine', 'nail care', 'calluses']):
        return 0
    
    # PAIN_CONDITIONS (various pain-related conditions)
    elif any(word in reason_lower for word in ['pain', 'ache', 'sore']):
        return 1
    
    # ... other categories
```
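
One way to make the rule order explicit is a table-driven variant. The sketch below is hypothetical (`KEYWORD_RULES` and `map_with_rules` are illustrative names) and reproduces only the two rules shown above, since the remaining categories are elided. It makes two behaviors of this style of mapping visible: the first matching rule wins, and matching is by substring.

```python
from typing import List, Optional, Tuple

# Only the two rules shown above; the other categories are elided in the source.
KEYWORD_RULES: List[Tuple[int, Tuple[str, ...]]] = [
    (0, ("routine", "nail care", "calluses")),  # ROUTINE_CARE
    (1, ("pain", "ache", "sore")),              # PAIN_CONDITIONS
]


def map_with_rules(reason: str) -> Optional[int]:
    """Return the label of the first rule with a matching keyword, or None if none match."""
    reason_lower = reason.lower()
    for label, keywords in KEYWORD_RULES:
        if any(keyword in reason_lower for keyword in keywords):
            return label
    return None
```

Here `map_with_rules("Routine check, some heel pain")` returns `0` because the `ROUTINE_CARE` rule is checked first, and `map_with_rules("Headache")` returns `1` because `ache` matches as a substring. Short keywords can also match inside unrelated words, so reviewing the rows that each rule captures is worthwhile.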

## Performance Metrics

### Expected Performance
- **Accuracy**: No fixed figure is reported here; results depend on the clinic data used for training
- **Categories**: 6 specialized healthcare reason categories
- **Confidence**: Varies with the quality and quantity of the training data

### Evaluation Framework

```bash
# Train and evaluate the model
python classifier/reason/train_reason.py

# Test the trained model
python classifier/reason/infer_reason.py

# Results include:
# - Training metrics
# - Category distribution
# - Example predictions with confidence scores
```
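
For a quick numeric summary beyond example predictions, a helper along these lines could be used. It is a sketch, not part of `train_reason.py` or `infer_reason.py`: `summarize` is a hypothetical name, and the label lists are made-up illustrations rather than real results.

```python
from collections import Counter
from typing import Dict, List


def summarize(y_true: List[str], y_pred: List[str]) -> Dict:
    """Compute overall accuracy and per-category recall from parallel label lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    support = Counter(y_true)  # how many examples each category has
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {
        "accuracy": sum(correct.values()) / len(y_true),
        "recall": {cat: correct[cat] / n for cat, n in support.items()},
        "support": dict(support),
    }


# Illustrative labels only, not measured output of the classifier.
gold = ["PAIN_CONDITIONS", "PAIN_CONDITIONS", "ROUTINE_CARE", "INJURIES"]
pred = ["PAIN_CONDITIONS", "ROUTINE_CARE", "ROUTINE_CARE", "INJURIES"]
report = summarize(gold, pred)
```

Since training labels come from keyword matching, a score computed against those labels measures agreement with the keyword rules. A small hand-labeled sample gives a more independent check.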

## File Structure

```
classifier/reason/
├── __init__.py              # Package initialization and exports
├── README.md               # This documentation
├── reason_classifier.py    # Main ReasonClassifier class
├── infer_reason.py        # Inference functions and utilities
└── train_reason.py        # Training script and functions
```

## API Reference

### ReasonClassifier

```python
class ReasonClassifier:
    def __init__(self, data_file: str = "data/reason_for_visit_data.xlsx")
    def predict(self, queries: List[str]) -> List[Dict]
    def train(self, train_data: pd.DataFrame = None, eval_data: Optional[pd.DataFrame] = None)
    def save_model(self, path: str)
    def load_model(self, path: str)
    def create_real_dataset(self) -> pd.DataFrame
    def analyze_real_data(self)
```

### Inference Functions

```python
def predict_single_reason(query: str) -> dict
def predict_reason_query(text: list[str], embedding_model, classifier_head) -> dict
def get_reason_models() -> tuple
def test_reason_classifier()
```

### Training Functions

```python
def get_reason_model(num_classes: int)
def get_reason_dataset() -> pd.DataFrame
def map_reason_to_category(reason: str) -> int
def preprocess_reason_data(df: pd.DataFrame) -> pd.DataFrame
```

## Data Requirements

### Healthcare Data Format

The system expects healthcare data in Excel format (`.xlsx`) with these columns:

- `Reason For Visit` (required): The primary reason for the healthcare visit
- `Appointment Type` (optional): Type of appointment, used for context

Example data:

| Reason For Visit | Appointment Type |
|------------------|------------------|
| Heel pain        | Follow-up        |
| Routine foot care| Maintenance      |
| Ingrown toenail  | New Patient      |
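
A column check before training can catch a malformed spreadsheet early. The helper below is a hypothetical sketch (`missing_columns` is not part of this module); it works on any iterable of column names, so it can be tested without reading a file.

```python
from typing import Iterable, List

REQUIRED_COLUMNS = ("Reason For Visit",)
OPTIONAL_COLUMNS = ("Appointment Type",)  # listed for reference; not enforced


def missing_columns(columns: Iterable[str]) -> List[str]:
    """Return required column names absent from `columns` (whitespace is stripped first)."""
    present = {str(column).strip() for column in columns}
    return [name for name in REQUIRED_COLUMNS if name not in present]
```

With pandas, this would typically be called as `missing_columns(df.columns)` on the frame returned by `pd.read_excel("data/reason_for_visit_data.xlsx")`.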

## Deployment Considerations

### Production Readiness

1. **Model Persistence**: Trained models saved with timestamps in `classifier/reason_checkpoints/`
2. **Error Handling**: Graceful fallbacks for prediction failures
3. **Real Data Integration**: Uses actual healthcare clinic data
4. **Device Support**: CPU/GPU/MPS compatibility
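
The "graceful fallbacks" item can be sketched as a thin wrapper around any predict function. This is an assumption about how a fallback might look, not the module's actual error handling: `safe_predict` and the `UNKNOWN` category are hypothetical, and the result keys mirror those used in the Programmatic Usage example.

```python
from typing import Callable, Dict, List


def safe_predict(
    predict: Callable[[List[str]], List[Dict]],
    queries: List[str],
) -> List[Dict]:
    """Call `predict`; on any failure return one fallback result per query instead of raising."""
    try:
        results = predict(queries)
        if len(results) != len(queries):
            raise ValueError("prediction count does not match query count")
        return results
    except Exception as exc:  # broad on purpose: the caller needs a result either way
        return [
            {"category": "UNKNOWN", "confidence": 0.0, "probabilities": {}, "error": str(exc)}
            for _ in queries
        ]


def broken_predict(queries: List[str]) -> List[Dict]:
    """Simulates a failing model for demonstration."""
    raise RuntimeError("model not loaded")


results = safe_predict(broken_predict, ["I have heel pain", "I need routine foot care"])
```

Keeping the error message in the fallback result lets downstream code log the failure while still routing the query, for example to manual triage.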

### Scalability

- **Batch Processing**: Efficient handling of multiple queries
- **Integration**: Works with existing healthcare routing system
- **Checkpoints**: Automatic model saving with timestamps
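
Batching and timestamped checkpoint paths can be sketched with the standard library alone. Both helpers are hypothetical: `batched` is an illustrative chunking function, and the `reason_<timestamp>` naming pattern in `checkpoint_dir` is an assumption, since this README does not state the actual directory naming.

```python
from datetime import datetime
from pathlib import Path
from typing import Iterator, List, Optional


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def checkpoint_dir(
    root: str = "classifier/reason_checkpoints",
    now: Optional[datetime] = None,
) -> Path:
    """Build a timestamped directory path under `root` (naming pattern is an assumption)."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(root) / f"reason_{stamp}"
```

Each chunk from `batched(queries, 32)` could then be passed to `ReasonClassifier.predict`, which already accepts a list of queries.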

## Future Enhancements

### Data Improvements

1. **Expanded Dataset**: Include more healthcare specialties
2. **Active Learning**: Improve model with real-world feedback
3. **Multi-language Support**: Support for non-English healthcare queries

### Advanced Features

1. **Confidence Calibration**: Improve confidence score reliability
2. **Hierarchical Classification**: Sub-categories within reason types
3. **Context Awareness**: Consider patient history and appointment context

## Troubleshooting

### Common Issues

1. **Data Loading Errors**: Ensure `data/reason_for_visit_data.xlsx` exists
2. **Low Confidence**: May indicate a need for more training data or retraining
3. **Import Errors**: Ensure all dependencies are installed and paths are correct
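
For the low-confidence case, one option is to flag uncertain predictions for manual review. The sketch below assumes that `probabilities` is a dict mapping category names to scores, which this README does not confirm. The function names and threshold values are hypothetical and would need tuning on real data.

```python
from typing import Dict, List, Tuple


def top_k(probabilities: Dict[str, float], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (category, probability) pairs, best first."""
    return sorted(probabilities.items(), key=lambda item: item[1], reverse=True)[:k]


def needs_review(
    result: Dict,
    min_confidence: float = 0.5,
    min_margin: float = 0.1,
) -> bool:
    """Flag a prediction when confidence is low or the top two categories are close."""
    ranked = top_k(result["probabilities"], k=2)
    margin = ranked[0][1] - ranked[1][1] if len(ranked) > 1 else 1.0
    return result["confidence"] < min_confidence or margin < min_margin
```

A small margin between the top two categories often signals a query that straddles categories, such as a painful injury, even when the top score looks acceptable.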

### Debug Mode

```python
# Test the classifier with sample queries
from classifier.reason.infer_reason import test_reason_classifier
test_reason_classifier()

# Check model predictions with probabilities
from classifier.reason import predict_single_reason
result = predict_single_reason("ambiguous query")
print(result['probabilities'])
```

### Model Training Issues

```bash
# Check if healthcare data is available
ls -la data/reason_for_visit_data.xlsx

# Verify model training
python classifier/reason/train_reason.py

# Test inference after training
python classifier/reason/infer_reason.py
```

## Contributing

### Adding New Categories

1. Update `REASON_CATEGORIES` in `reason_classifier.py`, `infer_reason.py`, and `train_reason.py`
2. Update category mapping logic in `map_reason_to_category()`
3. Retrain the model with new categories
4. Update documentation and examples

### Improving Training Data

1. Add more real healthcare examples to the dataset
2. Improve keyword mapping for better categorization
3. Implement more sophisticated NLP techniques for category assignment

## License

This module is part of the health-query-classifier project and follows the same licensing terms.