# Healthcare Reason Classification System

This module implements a classifier for healthcare visit reasons. It is trained on real clinic data and assigns each patient query to one of six reason categories.

## Overview

The reason classifier addresses the challenge of routing medical queries to the appropriate specialized department. It assigns each query to a reason category derived from actual healthcare visit data.

## Architecture

### Classification Categories

| Category | Description | Examples |
|----------|-------------|----------|
| `ROUTINE_CARE` | Routine healthcare, maintenance visits, general care | "I need routine foot care", "Regular nail care appointment" |
| `PAIN_CONDITIONS` | Various pain-related conditions and discomfort | "I have heel pain when I walk", "My ankle is sore" |
| `INJURIES` | Sprains, wounds, trauma-related conditions | "I sprained my ankle playing sports", "I have a wound that won't heal" |
| `SKIN_CONDITIONS` | Skin-related issues and conditions | "My toenail is ingrown and infected", "I have calluses on my feet" |
| `STRUCTURAL_ISSUES` | Structural problems and related conditions | "I have flat feet", "I need evaluation for plantar fasciitis" |
| `PROCEDURES` | Injections, surgical consultations, post-operative care | "I need a cortisone injection", "Post-surgical follow-up" |

### Technical Implementation

- **Base Model**: `sentence-transformers/embeddinggemma-300m-medical`
- **Architecture**: SetFit with frozen embeddings + trainable classification head
- **Training**: Real healthcare data from clinic appointment records
- **Integration**: Works as part of the complete healthcare routing system
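
The "frozen embeddings + trainable classification head" split can be illustrated with a minimal sketch. This is not the module's training code and does not use the SetFit API: the 2-d vectors stand in for sentence embeddings that a frozen encoder would produce, and `train_head` and `predict_head` are hypothetical names. Only the head's weights are updated; the embeddings are treated as constants.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Head = Tuple[List[List[float]], List[float]]  # (weights per class, biases)


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def head_probs(head: Head, x: Vector) -> List[float]:
    """Class probabilities of a linear softmax head for one embedding."""
    weights, biases = head
    logits = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)


def train_head(embeddings: Sequence[Vector], labels: Sequence[int],
               num_classes: int, lr: float = 0.5, steps: int = 200) -> Head:
    """Fit the head by full-batch gradient descent on cross-entropy.

    The embeddings are inputs only; nothing here modifies them, which is
    what "frozen" means in this sketch.
    """
    dim = len(embeddings[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    n = len(embeddings)
    for _ in range(steps):
        grad_w = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in zip(embeddings, labels):
            probs = head_probs((weights, biases), x)
            for k in range(num_classes):
                err = probs[k] - (1.0 if k == y else 0.0)
                grad_b[k] += err / n
                for j in range(dim):
                    grad_w[k][j] += err * x[j] / n
        for k in range(num_classes):
            biases[k] -= lr * grad_b[k]
            for j in range(dim):
                weights[k][j] -= lr * grad_w[k][j]
    return weights, biases


def predict_head(head: Head, x: Vector) -> int:
    """Index of the most probable class."""
    probs = head_probs(head, x)
    return probs.index(max(probs))


# Toy stand-ins for precomputed sentence embeddings (two classes).
train_x = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]]
train_y = [0, 1, 0, 1]
head = train_head(train_x, train_y, num_classes=2)
print(predict_head(head, [0.9, 0.2]), predict_head(head, [0.2, 0.9]))
```

Because the encoder is never updated, embeddings can be computed once and cached, which keeps training cheap on small clinic datasets.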

## Quick Start

### 1. Train the Classifier

```bash
# Train with real healthcare data
python classifier/reason/train_reason.py

# The training script will:
# - Load real healthcare data from data/reason_for_visit_data.xlsx
# - Map reasons to categories using keyword matching
# - Train the classifier with frozen embeddings
# - Save the trained model to classifier/reason_checkpoints/
```

### 2. Use the CLI

```bash
# Classify a single reason query
python cli/reason_classifier_cli_new.py "I have heel pain when I walk"

# Interactive mode
python cli/reason_classifier_cli_new.py --interactive

# Batch processing
python cli/reason_classifier_cli_new.py --batch queries.txt --output results.json

# Use complete healthcare routing system
python cli/healthcare_classifier_cli.py "I need routine foot care"
```
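
The CLI's internals are not shown here, so the following is only a sketch of what a batch run could look like. It assumes `queries.txt` holds one query per line and that the output is a JSON list; `run_batch` and `fake_predict` are hypothetical names, and `UNKNOWN` is a placeholder label, not one of the module's categories.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def run_batch(input_path: Path, output_path: Path,
              predict_fn: Callable[[List[str]], List[Dict]]) -> int:
    """Classify every non-empty line of input_path and write JSON results."""
    lines = input_path.read_text(encoding="utf-8").splitlines()
    queries = [line.strip() for line in lines if line.strip()]
    predictions = predict_fn(queries) if queries else []
    records = [{"query": q, **p} for q, p in zip(queries, predictions)]
    output_path.write_text(json.dumps(records, indent=2), encoding="utf-8")
    return len(records)


def fake_predict(queries: List[str]) -> List[Dict]:
    """Placeholder predictor so the sketch runs without a trained model."""
    return [{"category": "PAIN_CONDITIONS" if "pain" in q.lower() else "UNKNOWN"}
            for q in queries]


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "queries.txt"
    dst = Path(tmp) / "results.json"
    src.write_text("I have heel pain when I walk\n\nI need routine foot care\n",
                   encoding="utf-8")
    count = run_batch(src, dst, fake_predict)
    results = json.loads(dst.read_text(encoding="utf-8"))
print(count, results[0]["category"])
```

Passing all queries to the predictor in one call, rather than one at a time, lets the underlying model batch its embedding computation.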

### 3. Programmatic Usage

```python
from classifier.reason import ReasonClassifier, predict_single_reason

# Using the main classifier class
classifier = ReasonClassifier()
predictions = classifier.predict(["I have heel pain when I walk"])
print(predictions[0]['category'])  # Output: PAIN_CONDITIONS

# Using convenience function
result = predict_single_reason("I need routine foot care")
print(result['category'])  # Output: ROUTINE_CARE
print(result['confidence'])  # Confidence score
print(result['probabilities'])  # All category probabilities
```

## System Integration

### Complete Healthcare Routing Workflow

```
User Query
    ↓
Medical vs Insurance Classification
    ↓
┌─────────────────┬───────────────────────┐
│   Insurance     │        Medical        │
│   Queries       │        Queries        │
│       ↓         │           ↓           │
│  Insurance      │        Reason         │
│  Department     │    Classification     │
│                 │           ↓           │
│                 │  • ROUTINE_CARE       │
│                 │  • PAIN_CONDITIONS    │
│                 │  • INJURIES           │
│                 │  • SKIN_CONDITIONS    │
│                 │  • STRUCTURAL_ISSUES  │
│                 │  • PROCEDURES         │
└─────────────────┴───────────────────────┘
```

### Integration with Healthcare System

The reason classifier integrates as part of the complete healthcare routing system:

1. **Primary Classification**: Medical vs Insurance queries
2. **Reason Classification**: Medical queries → Specific reason categories
3. **Department Routing**: Route to appropriate specialized departments
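
The three steps above can be sketched as a two-stage function. The stage classifiers are passed in as callables because the real ones are not shown in this document; `route_query` and both stub functions are hypothetical and exist only to make the sketch runnable.

```python
from typing import Callable, Dict


def route_query(query: str,
                is_medical: Callable[[str], bool],
                classify_reason: Callable[[str], Dict]) -> Dict:
    """Two-stage routing: medical vs insurance first, then reason category."""
    if not is_medical(query):
        return {"query": query, "route": "INSURANCE", "category": None}
    reason = classify_reason(query)
    return {"query": query, "route": "MEDICAL", "category": reason["category"]}


# Stub stages so the sketch runs without trained models.
def stub_is_medical(query: str) -> bool:
    return "insurance" not in query.lower()


def stub_classify_reason(query: str) -> Dict:
    label = "PAIN_CONDITIONS" if "pain" in query.lower() else "ROUTINE_CARE"
    return {"category": label}


insurance = route_query("Does my insurance cover orthotics?",
                        stub_is_medical, stub_classify_reason)
medical = route_query("I have heel pain when I walk",
                      stub_is_medical, stub_classify_reason)
print(insurance["route"], medical["route"], medical["category"])
```

The reason classifier only runs on queries the first stage marks as medical, so insurance queries never receive a reason category.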

## Training Data Strategy

### Real Healthcare Data

The system uses actual healthcare clinic data:

```python
# Data source: data/reason_for_visit_data.xlsx
# Contains real patient visit reasons and appointment types
# Examples from actual data:
# - "Heel pain"
# - "Routine foot care"
# - "Ingrown toenail"
# - "Ankle sprain"
# - "Plantar fasciitis"
```

### Category Mapping Strategy

The system uses keyword-based mapping to categorize real healthcare reasons:

```python
def map_reason_to_category(reason: str) -> int:
    reason_lower = reason.lower()

    # ROUTINE_CARE (routine care, maintenance visits)
    if any(word in reason_lower for word in ['routine', 'nail care', 'calluses']):
        return 0

    # PAIN_CONDITIONS (various pain-related conditions)
    elif any(word in reason_lower for word in ['pain', 'ache', 'sore']):
        return 1

    # ... other categories
```
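
The `if`/`elif` chain is first-match-wins, so rule order decides the label when a reason contains keywords from several categories. A table-driven variant makes that order explicit. This is a sketch, not the module's code: `KEYWORD_RULES` and `first_matching_category` are hypothetical names, only the two rules visible above are included, and returning `None` for unmatched text is an assumption.

```python
from typing import List, Optional, Tuple

# Only the two rules visible in the snippet above; the remaining
# categories are left out because the source elides them.
KEYWORD_RULES: List[Tuple[int, List[str]]] = [
    (0, ["routine", "nail care", "calluses"]),  # ROUTINE_CARE
    (1, ["pain", "ache", "sore"]),              # PAIN_CONDITIONS
]


def first_matching_category(reason: str) -> Optional[int]:
    """Return the id of the first rule with a keyword in the text, else None."""
    reason_lower = reason.lower()
    for category_id, keywords in KEYWORD_RULES:
        if any(keyword in reason_lower for keyword in keywords):
            return category_id
    return None


print(first_matching_category("Heel pain"))
print(first_matching_category("Routine visit, some heel pain"))
print(first_matching_category("Ankle sprain"))
```

The second call matches both rules and resolves to the earlier one. Matching is by substring, so short keywords can also fire inside longer words.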

## Performance Metrics

### Expected Performance
- **Accuracy**: Depends on how well the keyword-derived labels reflect the patterns in the real healthcare data
- **Categories**: 6 specialized healthcare reason categories
- **Confidence**: Variable based on training data quality

### Evaluation Framework

```bash
# Train and evaluate the model
python classifier/reason/train_reason.py

# Test the trained model
python classifier/reason/infer_reason.py

# Results include:
# - Training metrics
# - Category distribution
# - Example predictions with confidence scores
```
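
For a quick check outside the training script, accuracy and the category distribution can be computed from paired label lists. This is a sketch under assumptions: `summarize_predictions` is a hypothetical helper, and the labels in the usage example are invented for illustration, not results from the model.

```python
from collections import Counter
from typing import Dict, Sequence


def summarize_predictions(true_labels: Sequence[str],
                          predicted: Sequence[str]) -> Dict:
    """Overall accuracy, per-category recall, and predicted distribution."""
    if len(true_labels) != len(predicted):
        raise ValueError("label lists must have the same length")
    totals = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted) if t == p)
    accuracy = sum(hits.values()) / len(true_labels) if true_labels else 0.0
    return {
        "accuracy": accuracy,
        "recall": {label: hits[label] / count for label, count in totals.items()},
        "predicted_distribution": dict(Counter(predicted)),
    }


# Invented labels, for illustration only.
truth = ["PAIN_CONDITIONS", "PAIN_CONDITIONS", "ROUTINE_CARE", "INJURIES"]
preds = ["PAIN_CONDITIONS", "ROUTINE_CARE", "ROUTINE_CARE", "INJURIES"]
summary = summarize_predictions(truth, preds)
print(summary["accuracy"], summary["recall"]["PAIN_CONDITIONS"])
```

Per-category recall is worth reporting next to accuracy because visit-reason data is often imbalanced, and a dominant category can hide weak performance on rare ones.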

## File Structure

```
classifier/reason/
├── __init__.py             # Package initialization and exports
├── README.md               # This documentation
├── reason_classifier.py    # Main ReasonClassifier class
├── infer_reason.py         # Inference functions and utilities
└── train_reason.py         # Training script and functions
```

## API Reference

### ReasonClassifier

```python
from typing import Dict, List, Optional

import pandas as pd


# Signatures only; bodies omitted
class ReasonClassifier:
    def __init__(self, data_file: str = "data/reason_for_visit_data.xlsx"): ...
    def predict(self, queries: List[str]) -> List[Dict]: ...
    def train(self, train_data: Optional[pd.DataFrame] = None, eval_data: Optional[pd.DataFrame] = None): ...
    def save_model(self, path: str): ...
    def load_model(self, path: str): ...
    def create_real_dataset(self) -> pd.DataFrame: ...
    def analyze_real_data(self): ...
```

### Inference Functions

```python
# Signatures only; bodies omitted
def predict_single_reason(query: str) -> dict: ...
def predict_reason_query(text: list[str], embedding_model, classifier_head) -> dict: ...
def get_reason_models() -> tuple: ...
def test_reason_classifier(): ...
```

### Training Functions

```python
import pandas as pd


# Signatures only; bodies omitted
def get_reason_model(num_classes: int): ...
def get_reason_dataset() -> pd.DataFrame: ...
def map_reason_to_category(reason: str) -> int: ...
def preprocess_reason_data(df: pd.DataFrame) -> pd.DataFrame: ...
```

## Data Requirements

### Healthcare Data Format

The system expects healthcare data in Excel format with these columns:

- `Reason For Visit` (required): The primary reason for the healthcare visit
- `Appointment Type` (optional): Type of appointment, used for context

Example data:

| Reason For Visit | Appointment Type |
|------------------|------------------|
| Heel pain        | Follow-up        |
| Routine foot care| Maintenance      |
| Ingrown toenail  | New Patient      |
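
A column check before training can catch a malformed spreadsheet early. The sketch below works on any iterable of column names so it runs without pandas; `missing_required_columns` is a hypothetical helper, and treating only `Reason For Visit` as required follows the description above.

```python
from typing import Iterable, List

# "Appointment Type" is optional, so it is not checked here.
REQUIRED_COLUMNS = ["Reason For Visit"]


def missing_required_columns(columns: Iterable[str]) -> List[str]:
    """Return required column names absent from `columns` (whitespace-trimmed)."""
    present = {str(name).strip() for name in columns}
    return [name for name in REQUIRED_COLUMNS if name not in present]


print(missing_required_columns(["Reason For Visit ", "Appointment Type"]))
print(missing_required_columns(["Visit Reason", "Appointment Type"]))
```

With pandas, the same function can be called as `missing_required_columns(pd.read_excel(path).columns)`.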

## Deployment Considerations

### Production Readiness

1. **Model Persistence**: Trained models saved with timestamps in `classifier/reason_checkpoints/`
2. **Error Handling**: Graceful fallbacks for prediction failures
3. **Real Data Integration**: Uses actual healthcare clinic data
4. **Device Support**: CPU/GPU/MPS compatibility
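
One way to get a graceful fallback is to wrap the prediction call. This is a sketch, not the module's error handling: `safe_predict`, the `needs_review` flag, and the 0.5 threshold are assumptions; only the `category` and `confidence` keys come from the usage examples in this document.

```python
from typing import Callable, Dict


def safe_predict(query: str, predict_fn: Callable[[str], Dict],
                 min_confidence: float = 0.5) -> Dict:
    """Return predict_fn's result, flagged for review on error or low confidence."""
    try:
        result = predict_fn(query)
    except Exception as exc:  # broad on purpose: any failure triggers the fallback
        return {"category": None, "confidence": 0.0,
                "needs_review": True, "error": str(exc)}
    low = result.get("confidence", 0.0) < min_confidence
    return {**result, "needs_review": low}


# Stub predictors so the sketch runs without a trained model.
def confident(query: str) -> Dict:
    return {"category": "PAIN_CONDITIONS", "confidence": 0.9}


def unsure(query: str) -> Dict:
    return {"category": "INJURIES", "confidence": 0.3}


def broken(query: str) -> Dict:
    raise RuntimeError("model not loaded")


for fn in (confident, unsure, broken):
    print(safe_predict("I have heel pain when I walk", fn))
```

Low-confidence results keep their predicted category, so a downstream reviewer sees the model's best guess alongside the flag.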

### Scalability

- **Batch Processing**: Efficient handling of multiple queries
- **Integration**: Works with existing healthcare routing system
- **Checkpoints**: Automatic model saving with timestamps
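
Timestamped checkpoints are easy to manage when the timestamp sorts lexicographically. The sketch below assumes a `reason_YYYYMMDD_HHMMSS` directory naming scheme, which this document does not confirm; `new_checkpoint_dir` and `latest_checkpoint` are hypothetical helpers.

```python
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional


def new_checkpoint_dir(root: Path, now: Optional[datetime] = None) -> Path:
    """Build a checkpoint path whose name embeds a sortable timestamp."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return root / f"reason_{stamp}"


def latest_checkpoint(root: Path) -> Optional[Path]:
    """Newest checkpoint by name; valid because the stamp sorts as text."""
    candidates = sorted(p for p in root.glob("reason_*") if p.is_dir())
    return candidates[-1] if candidates else None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    older = new_checkpoint_dir(root, datetime(2024, 1, 5, 9, 30, 0))
    newer = new_checkpoint_dir(root, datetime(2024, 3, 1, 14, 0, 0))
    older.mkdir()
    newer.mkdir()
    latest_name = latest_checkpoint(root).name
print(latest_name)
```

In the project layout, `root` would be `classifier/reason_checkpoints/`.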

## Future Enhancements

### Data Improvements

1. **Expanded Dataset**: Include more healthcare specialties
2. **Active Learning**: Improve model with real-world feedback
3. **Multi-language Support**: Support for non-English healthcare queries

### Advanced Features

1. **Confidence Calibration**: Improve confidence score reliability
2. **Hierarchical Classification**: Sub-categories within reason types
3. **Context Awareness**: Consider patient history and appointment context

## Troubleshooting

### Common Issues

1. **Data Loading Errors**: Ensure `data/reason_for_visit_data.xlsx` exists
2. **Low Confidence**: May indicate need for more training data or model retraining
3. **Import Errors**: Ensure all dependencies are installed and paths are correct
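
The first and third issues can be checked up front with a small preflight function. This is a sketch: `preflight` is a hypothetical helper, and the module list passed to it is an assumption, since this document does not list the project's dependencies.

```python
import importlib.util
from pathlib import Path
from typing import List, Sequence


def preflight(data_file: str, modules: Sequence[str]) -> List[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    if not Path(data_file).is_file():
        problems.append(f"missing data file: {data_file}")
    for name in modules:
        # Top-level module names only; find_spec returns None when not installed.
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing dependency: {name}")
    return problems


for problem in preflight("data/reason_for_visit_data.xlsx", ["pandas"]):
    print(problem)
```

Run it from the project root so the relative data path resolves the same way it does for the training script.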

### Debug Mode

```python
# Test the classifier with sample queries
from classifier.reason.infer_reason import test_reason_classifier
test_reason_classifier()

# Check model predictions with probabilities
from classifier.reason import predict_single_reason
result = predict_single_reason("ambiguous query")
print(result['probabilities'])
```

### Model Training Issues

```bash
# Check if healthcare data is available
ls -la data/reason_for_visit_data.xlsx

# Verify model training
python classifier/reason/train_reason.py

# Test inference after training
python classifier/reason/infer_reason.py
```

## Contributing

### Adding New Categories

1. Update `REASON_CATEGORIES` in `reason_classifier.py`, `infer_reason.py`, and `train_reason.py`
2. Update category mapping logic in `map_reason_to_category()`
3. Retrain the model with new categories
4. Update documentation and examples
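
Because `REASON_CATEGORIES` is defined in three files, it is easy for the copies to drift apart when a category is added. The sketch below compares copies against each other. It assumes each copy is a dict from integer id to category name, which this document does not confirm; `find_inconsistencies` is a hypothetical helper and `EXAMPLE_NEW` is an invented label.

```python
from typing import Dict, List

CategoryTable = Dict[int, str]


def find_inconsistencies(tables: Dict[str, CategoryTable]) -> List[str]:
    """Compare each file's category table with the first one given."""
    problems: List[str] = []
    names = list(tables)
    if not names:
        return problems
    reference_name = names[0]
    reference = tables[reference_name]
    if sorted(reference) != list(range(len(reference))):
        problems.append(f"{reference_name}: ids are not contiguous from 0")
    for name in names[1:]:
        if tables[name] != reference:
            problems.append(f"{name}: differs from {reference_name}")
    return problems


# Illustrative copies: one file was updated with a new category, one was not.
updated = {0: "ROUTINE_CARE", 1: "PAIN_CONDITIONS", 2: "EXAMPLE_NEW"}
stale = {0: "ROUTINE_CARE", 1: "PAIN_CONDITIONS"}
report = find_inconsistencies({
    "reason_classifier.py": updated,
    "infer_reason.py": dict(updated),
    "train_reason.py": stale,
})
print(report)
```

Moving the table into a single shared module would remove the need for this check altogether.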

### Improving Training Data

1. Add more real healthcare examples to the dataset
2. Improve keyword mapping for better categorization
3. Implement more sophisticated NLP techniques for category assignment

## License

This module is part of the health-query-classifier project and follows the same licensing terms.