| # Healthcare Reason Classification System | |
| This module implements a classifier, trained on real clinic data, that assigns patient queries to specific healthcare visit-reason categories. | |
| ## Overview | |
| The reason classifier routes medical queries to the appropriate specialized department. It assigns each query to one of six reason categories derived from actual healthcare visit data. | |
| ## Architecture | |
| ### Classification Categories | |
| | Category | Description | Examples | | |
| |----------|-------------|----------| | |
| | `ROUTINE_CARE` | Routine healthcare, maintenance visits, general care | "I need routine foot care", "Regular nail care appointment" | | |
| | `PAIN_CONDITIONS` | Various pain-related conditions and discomfort | "I have heel pain when I walk", "My ankle is sore" | | |
| | `INJURIES` | Sprains, wounds, trauma-related conditions | "I sprained my ankle playing sports", "I have a wound that won't heal" | | |
| | `SKIN_CONDITIONS` | Skin-related issues and conditions | "My toenail is ingrown and infected", "I have calluses on my feet" | | |
| | `STRUCTURAL_ISSUES` | Structural problems and related conditions | "I have flat feet", "I need evaluation for plantar fasciitis" | | |
| | `PROCEDURES` | Injections, surgical consultations, post-operative care | "I need a cortisone injection", "Post-surgical follow-up" | | |
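The six categories above can be held in a single index-to-label table. The sketch below is an assumption about how `REASON_CATEGORIES` might be laid out: only indices 0 and 1 appear in the keyword-mapping excerpt in this README, the remaining indices simply follow the order of the table, and `category_name` is a hypothetical helper.

```python
# Hypothetical layout of the category table. Indices 0 and 1 match the
# mapping excerpt in this README; indices 2-5 are assumed to follow table order.
REASON_CATEGORIES = {
    0: "ROUTINE_CARE",
    1: "PAIN_CONDITIONS",
    2: "INJURIES",
    3: "SKIN_CONDITIONS",
    4: "STRUCTURAL_ISSUES",
    5: "PROCEDURES",
}


def category_name(index: int) -> str:
    """Translate a predicted class index into its category label."""
    try:
        return REASON_CATEGORIES[index]
    except KeyError:
        raise ValueError(f"unknown category index: {index}") from None


print(category_name(1))
```

Keeping one table like this in a shared module would avoid repeating the same definition in several files, if the project layout allows it.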
| ### Technical Implementation | |
| - **Base Model**: `sentence-transformers/embeddinggemma-300m-medical` | |
| - **Architecture**: SetFit with frozen embeddings + trainable classification head | |
| - **Training**: Real healthcare data from clinic appointment records | |
| - **Integration**: Works as part of the complete healthcare routing system | |
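To illustrate the "frozen embeddings + trainable classification head" idea without downloading a model, the sketch below swaps in a toy encoder (a hashed bag-of-words, which is not the sentence-transformers model named above) and trains only a small softmax head on top of it. Every name here (`embed`, `SoftmaxHead`, `DIM`) is hypothetical, and the four training texts are taken from the example column of the category table.

```python
import math
import zlib

DIM = 256  # size of the toy embedding space (assumption)


def embed(text: str) -> list[float]:
    """Frozen stand-in encoder: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class SoftmaxHead:
    """Trainable linear head; the only part that is updated during training."""

    def __init__(self, num_classes: int, dim: int = DIM):
        self.weights = [[0.0] * dim for _ in range(num_classes)]
        self.bias = [0.0] * num_classes

    def predict_proba(self, x: list[float]) -> list[float]:
        logits = [sum(w * xi for w, xi in zip(row, x)) + b
                  for row, b in zip(self.weights, self.bias)]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(z - top) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def fit(self, xs, ys, epochs: int = 200, lr: float = 0.5) -> None:
        """Plain SGD on the cross-entropy loss."""
        for _ in range(epochs):
            for x, y in zip(xs, ys):
                probs = self.predict_proba(x)
                for k, p in enumerate(probs):
                    grad = p - (1.0 if k == y else 0.0)  # dLoss/dlogit_k
                    self.bias[k] -= lr * grad
                    for i, xi in enumerate(x):
                        if xi:
                            self.weights[k][i] -= lr * grad * xi

    def predict(self, x: list[float]) -> int:
        probs = self.predict_proba(x)
        return max(range(len(probs)), key=probs.__getitem__)


texts = ["routine foot care", "regular nail care", "heel pain", "sore ankle"]
labels = [0, 0, 1, 1]  # 0 = ROUTINE_CARE, 1 = PAIN_CONDITIONS

features = [embed(t) for t in texts]  # computed once; the encoder never changes
head = SoftmaxHead(num_classes=2)
head.fit(features, labels)
print([head.predict(f) for f in features])
```

Because the encoder is frozen, embeddings can be computed once and cached, and only the small head needs retraining when the labels change.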
| ## Quick Start | |
| ### 1. Train the Classifier | |
| ```bash | |
| # Train with real healthcare data | |
| python classifier/reason/train_reason.py | |
| # The training script will: | |
| # - Load real healthcare data from data/reason_for_visit_data.xlsx | |
| # - Map reasons to categories using keyword matching | |
| # - Train the classifier with frozen embeddings | |
| # - Save the trained model to classifier/reason_checkpoints/ | |
| ``` | |
| ### 2. Use the CLI | |
| ```bash | |
| # Classify a single reason query | |
| python cli/reason_classifier_cli_new.py "I have heel pain when I walk" | |
| # Interactive mode | |
| python cli/reason_classifier_cli_new.py --interactive | |
| # Batch processing | |
| python cli/reason_classifier_cli_new.py --batch queries.txt --output results.json | |
| # Use complete healthcare routing system | |
| python cli/healthcare_classifier_cli.py "I need routine foot care" | |
| ``` | |
| ### 3. Programmatic Usage | |
| ```python | |
| from classifier.reason import ReasonClassifier, predict_single_reason | |
| # Using the main classifier class | |
| classifier = ReasonClassifier() | |
| predictions = classifier.predict(["I have heel pain when I walk"]) | |
| print(predictions[0]['category']) # Output: PAIN_CONDITIONS | |
| # Using convenience function | |
| result = predict_single_reason("I need routine foot care") | |
| print(result['category']) # Output: ROUTINE_CARE | |
| print(result['confidence']) # Confidence score | |
| print(result['probabilities']) # All category probabilities | |
| ``` | |
| ## System Integration | |
| ### Complete Healthcare Routing Workflow | |
| ``` | |
| User Query | |
| ↓ | |
| Medical vs Insurance Classification | |
| ↓ | |
| ┌──────────────┬──────────────────────┐ | |
| │ Insurance    │ Medical              │ | |
| │ Queries      │ Queries              │ | |
| │ ↓            │ ↓                    │ | |
| │ Insurance    │ Reason               │ | |
| │ Department   │ Classification       │ | |
| │              │ ↓                    │ | |
| │              │ • ROUTINE_CARE       │ | |
| │              │ • PAIN_CONDITIONS    │ | |
| │              │ • INJURIES           │ | |
| │              │ • SKIN_CONDITIONS    │ | |
| │              │ • STRUCTURAL_ISSUES  │ | |
| │              │ • PROCEDURES         │ | |
| └──────────────┴──────────────────────┘ | |
| ``` | |
| ### Integration with Healthcare System | |
| The reason classifier integrates as part of the complete healthcare routing system: | |
| 1. **Primary Classification**: Medical vs Insurance queries | |
| 2. **Reason Classification**: Medical queries → Specific reason categories | |
| 3. **Department Routing**: Route to appropriate specialized departments | |
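The three steps above can be sketched as a small two-stage function. This is only an illustration: `route_query`, the stub classifiers, and the routing-key format are hypothetical stand-ins for the trained models and the real department targets, which this README does not specify.

```python
from typing import Callable


def route_query(
    query: str,
    is_medical: Callable[[str], bool],
    classify_reason: Callable[[str], str],
) -> str:
    """Apply the classification stages in order and return a routing key."""
    # Stage 1: primary classification (medical vs insurance).
    if not is_medical(query):
        return "insurance"
    # Stage 2: only medical queries reach the reason classifier.
    category = classify_reason(query)
    # Stage 3: the category selects the destination department.
    return f"medical/{category}"


# Stub classifiers standing in for the trained models (assumptions).
def stub_is_medical(query: str) -> bool:
    return "insurance" not in query.lower()


def stub_classify_reason(query: str) -> str:
    return "PAIN_CONDITIONS" if "pain" in query.lower() else "ROUTINE_CARE"


print(route_query("I have heel pain when I walk", stub_is_medical, stub_classify_reason))
# medical/PAIN_CONDITIONS
print(route_query("Does my insurance cover orthotics?", stub_is_medical, stub_classify_reason))
# insurance
```

Passing the classifiers in as callables keeps the routing logic testable without loading any model.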
| ## Training Data Strategy | |
| ### Real Healthcare Data | |
| The system uses actual healthcare clinic data: | |
| ```python | |
| # Data source: data/reason_for_visit_data.xlsx | |
| # Contains real patient visit reasons and appointment types | |
| # Examples from actual data: | |
| # - "Heel pain" | |
| # - "Routine foot care" | |
| # - "Ingrown toenail" | |
| # - "Ankle sprain" | |
| # - "Plantar fasciitis" | |
| ``` | |
| ### Category Mapping Strategy | |
| The system uses keyword-based mapping to categorize real healthcare reasons: | |
| ```python | |
| def map_reason_to_category(reason: str) -> int: | |
|     reason_lower = reason.lower() | |
|     # ROUTINE_CARE (routine care, maintenance visits) | |
|     if any(word in reason_lower for word in ['routine', 'nail care', 'calluses']): | |
|         return 0 | |
|     # PAIN_CONDITIONS (various pain-related conditions) | |
|     elif any(word in reason_lower for word in ['pain', 'ache', 'sore']): | |
|         return 1 | |
|     # ... other categories | |
| ``` | |
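The same first-match-wins logic can also be written as a rule table, which makes the priority order explicit. In the sketch below, rows 0 and 1 copy the keywords from the excerpt above, while rows 2 to 5, the `None` fallback, and the names `KEYWORD_RULES` and `map_reason_sketch` are illustrative assumptions rather than the project's actual rules.

```python
from typing import Optional

# (category index, keywords) in priority order: the first matching row wins.
# Rows 0 and 1 come from the excerpt above; rows 2-5 are illustrative guesses.
KEYWORD_RULES = [
    (0, ("routine", "nail care", "calluses")),
    (1, ("pain", "ache", "sore")),
    (2, ("sprain", "wound", "trauma")),
    (3, ("ingrown", "skin")),
    (4, ("flat feet", "plantar fasciitis")),
    (5, ("injection", "surgical", "post-op")),
]


def map_reason_sketch(reason: str) -> Optional[int]:
    """Return the index of the first rule whose keyword occurs in the reason."""
    reason_lower = reason.lower()
    for index, keywords in KEYWORD_RULES:
        if any(keyword in reason_lower for keyword in keywords):
            return index
    return None  # unmatched; how the real script handles this is not shown


for reason in ["Heel pain", "Routine foot care", "Ingrown toenail",
               "Ankle sprain", "Plantar fasciitis"]:
    print(reason, "->", map_reason_sketch(reason))
```

Because matching is by substring and in priority order, a reason such as "Post-surgical pain" lands in category 1 rather than 5, so rule order is part of the labeling policy.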
| ## Performance Metrics | |
| ### Expected Performance | |
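| - **Accuracy**: No fixed figure is quoted here; accuracy depends on the clinic data used for training and should be read from the training metrics | |
| - **Categories**: 6 healthcare reason categories | |
| - **Confidence**: Scores vary with the quality and coverage of the training data | |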
| ### Evaluation Framework | |
| ```bash | |
| # Train and evaluate the model | |
| python classifier/reason/train_reason.py | |
| # Test the trained model | |
| python classifier/reason/infer_reason.py | |
| # Results include: | |
| # - Training metrics | |
| # - Category distribution | |
| # - Example predictions with confidence scores | |
| ``` | |
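For readers who want to compute their own numbers from a labeled hold-out set, a minimal sketch of accuracy, category distribution, and per-category recall follows. The `evaluate` function is hypothetical, and the labels in the example are toy values chosen for illustration, not results from the trained model.

```python
from collections import Counter


def evaluate(true_labels: list[str], predicted: list[str]) -> dict:
    """Compute accuracy, label distribution, and per-category recall."""
    if len(true_labels) != len(predicted):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("need at least one example")
    pairs = list(zip(true_labels, predicted))
    support = Counter(true_labels)                # examples per true category
    hits = Counter(t for t, p in pairs if t == p)  # correct predictions per category
    return {
        "accuracy": sum(hits.values()) / len(pairs),
        "distribution": dict(support),
        "per_category_recall": {c: hits[c] / n for c, n in support.items()},
    }


# Toy labels for illustration only.
y_true = ["PAIN_CONDITIONS", "PAIN_CONDITIONS", "ROUTINE_CARE", "INJURIES"]
y_pred = ["PAIN_CONDITIONS", "ROUTINE_CARE", "ROUTINE_CARE", "INJURIES"]
report = evaluate(y_true, y_pred)
print(report["accuracy"])  # 0.75
print(report["per_category_recall"])
```

Per-category recall is worth reporting alongside accuracy because visit-reason data is often dominated by a few frequent categories.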
| ## File Structure | |
| ``` | |
| classifier/reason/ | |
| ├── __init__.py          # Package initialization and exports | |
| ├── README.md            # This documentation | |
| ├── reason_classifier.py # Main ReasonClassifier class | |
| ├── infer_reason.py      # Inference functions and utilities | |
| └── train_reason.py      # Training script and functions | |
| ``` | |
| ## API Reference | |
| ### ReasonClassifier | |
| ```python | |
| class ReasonClassifier: | |
|     def __init__(self, data_file: str = "data/reason_for_visit_data.xlsx") | |
|     def predict(self, queries: List[str]) -> List[Dict] | |
|     def train(self, train_data: pd.DataFrame = None, eval_data: Optional[pd.DataFrame] = None) | |
|     def save_model(self, path: str) | |
|     def load_model(self, path: str) | |
|     def create_real_dataset(self) -> pd.DataFrame | |
|     def analyze_real_data(self) | |
| ``` | |
| ### Inference Functions | |
| ```python | |
| def predict_single_reason(query: str) -> dict | |
| def predict_reason_query(text: list[str], embedding_model, classifier_head) -> dict | |
| def get_reason_models() -> tuple | |
| def test_reason_classifier() | |
| ``` | |
| ### Training Functions | |
| ```python | |
| def get_reason_model(num_classes: int) | |
| def get_reason_dataset() -> pd.DataFrame | |
| def map_reason_to_category(reason: str) -> int | |
| def preprocess_reason_data(df: pd.DataFrame) -> pd.DataFrame | |
| ``` | |
| ## Data Requirements | |
| ### Healthcare Data Format | |
| The system expects healthcare data in Excel format with these columns: | |
| ``` | |
| Columns: | |
| - "Reason For Visit" (required): The primary reason for the healthcare visit | |
| - "Appointment Type" (optional): Type of appointment, used for context | |
| Example data: | |
| | Reason For Visit | Appointment Type | | |
| |------------------|------------------| | |
| | Heel pain | Follow-up | | |
| | Routine foot care| Maintenance | | |
| | Ingrown toenail | New Patient | | |
| ``` | |
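A quick column check before training can catch a malformed spreadsheet early. The sketch below works on any iterable of column names (for example the `.columns` of a DataFrame returned by `pandas.read_excel`); `check_columns` and the two constants are hypothetical names, not part of the module's API.

```python
REQUIRED_COLUMNS = ("Reason For Visit",)
OPTIONAL_COLUMNS = ("Appointment Type",)


def check_columns(columns) -> list[str]:
    """Raise if a required column is missing; return absent optional columns."""
    present = {str(c).strip() for c in columns}
    missing = [c for c in REQUIRED_COLUMNS if c not in present]
    if missing:
        raise ValueError(f"missing required column(s): {', '.join(missing)}")
    return [c for c in OPTIONAL_COLUMNS if c not in present]


print(check_columns(["Reason For Visit", "Appointment Type"]))  # []
print(check_columns(["Reason For Visit"]))  # ['Appointment Type']
```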
| ## Deployment Considerations | |
| ### Production Readiness | |
| 1. **Model Persistence**: Trained models saved with timestamps in `classifier/reason_checkpoints/` | |
| 2. **Error Handling**: Graceful fallbacks for prediction failures | |
| 3. **Real Data Integration**: Uses actual healthcare clinic data | |
| 4. **Device Support**: CPU/GPU/MPS compatibility | |
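One way to read "graceful fallbacks" is a thin wrapper that never lets a prediction error reach the caller. The sketch below assumes the result shape shown in the usage examples (`category`, `confidence`, `probabilities`); `safe_predict` and the `UNKNOWN` fallback payload are hypothetical, since the module's real fallback behaviour is not documented here.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def _fallback() -> dict:
    # Hypothetical fallback payload; built fresh so callers cannot share state.
    return {"category": "UNKNOWN", "confidence": 0.0, "probabilities": {}}


def safe_predict(predict: Callable[[str], dict], query: str) -> dict:
    """Call `predict` and return a fallback result instead of raising."""
    try:
        result = predict(query)
    except Exception:
        # The query text is deliberately not logged: it may contain patient data.
        logger.exception("reason prediction failed")
        return _fallback()
    # Guard against malformed results as well as raised errors.
    if not isinstance(result, dict) or "category" not in result:
        logger.warning("reason prediction returned an unexpected result")
        return _fallback()
    return result


def broken_predictor(query: str) -> dict:
    raise RuntimeError("model not loaded")


print(safe_predict(broken_predictor, "I have heel pain")["category"])  # UNKNOWN
```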
| ### Scalability | |
| - **Batch Processing**: Efficient handling of multiple queries | |
| - **Integration**: Works with existing healthcare routing system | |
| - **Checkpoints**: Automatic model saving with timestamps | |
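The batching and timestamped-checkpoint points above can be sketched with the standard library alone. Both helpers are hypothetical: the batch size of 32 and the `reason_<timestamp>` directory naming are assumptions, and only the `classifier/reason_checkpoints/` root comes from this README.

```python
from datetime import datetime
from pathlib import Path
from typing import Iterator, Optional


def iter_batches(queries: list[str], batch_size: int = 32) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `batch_size` queries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(queries), batch_size):
        yield queries[start:start + batch_size]


def checkpoint_dir(root: str = "classifier/reason_checkpoints",
                   now: Optional[datetime] = None) -> Path:
    """Build a timestamped checkpoint path (naming scheme is an assumption)."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(root) / f"reason_{stamp}"


for batch in iter_batches(["q1", "q2", "q3", "q4", "q5"], batch_size=2):
    print(batch)
print(checkpoint_dir(now=datetime(2024, 1, 2, 3, 4, 5)).name)  # reason_20240102_030405
```

Each batch could then be passed to `ReasonClassifier.predict`, which already accepts a list of queries.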
| ## Future Enhancements | |
| ### Data Improvements | |
| 1. **Expanded Dataset**: Include more healthcare specialties | |
| 2. **Active Learning**: Improve model with real-world feedback | |
| 3. **Multi-language Support**: Support for non-English healthcare queries | |
| ### Advanced Features | |
| 1. **Confidence Calibration**: Improve confidence score reliability | |
| 2. **Hierarchical Classification**: Sub-categories within reason types | |
| 3. **Context Awareness**: Consider patient history and appointment context | |
| ## Troubleshooting | |
| ### Common Issues | |
| 1. **Data Loading Errors**: Ensure `data/reason_for_visit_data.xlsx` exists | |
| 2. **Low Confidence**: May indicate need for more training data or model retraining | |
| 3. **Import Errors**: Ensure all dependencies are installed and paths are correct | |
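When investigating low-confidence predictions, it helps to look at the gap between the top two categories as well as the top score. The sketch below assumes that `result['probabilities']` is a mapping from category label to probability, which this README does not state explicitly; `confidence_report` and the 0.5 threshold are hypothetical.

```python
def confidence_report(result: dict, threshold: float = 0.5) -> dict:
    """Summarise how decisive a prediction is (assumes label -> probability)."""
    ranked = sorted(result["probabilities"].items(),
                    key=lambda item: item[1], reverse=True)
    if not ranked:
        raise ValueError("result has no probabilities")
    top_label, top_prob = ranked[0]
    runner_up_prob = ranked[1][1] if len(ranked) > 1 else 0.0
    return {
        "top": top_label,
        "margin": top_prob - runner_up_prob,  # small margin = ambiguous query
        "low_confidence": top_prob < threshold,
    }


# Toy result for illustration only.
example = {
    "category": "PAIN_CONDITIONS",
    "confidence": 0.40,
    "probabilities": {"PAIN_CONDITIONS": 0.40, "INJURIES": 0.35, "ROUTINE_CARE": 0.25},
}
summary = confidence_report(example)
print(summary["top"], round(summary["margin"], 2), summary["low_confidence"])
```

A prediction with a small margin is a natural candidate for manual review or for adding similar examples to the training data.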
| ### Debug Mode | |
| ```python | |
| # Test the classifier with sample queries | |
| from classifier.reason.infer_reason import test_reason_classifier | |
| test_reason_classifier() | |
| # Check model predictions with probabilities | |
| from classifier.reason import predict_single_reason | |
| result = predict_single_reason("ambiguous query") | |
| print(result['probabilities']) | |
| ``` | |
| ### Model Training Issues | |
| ```bash | |
| # Check if healthcare data is available | |
| ls -la data/reason_for_visit_data.xlsx | |
| # Verify model training | |
| python classifier/reason/train_reason.py | |
| # Test inference after training | |
| python classifier/reason/infer_reason.py | |
| ``` | |
| ## Contributing | |
| ### Adding New Categories | |
| 1. Update `REASON_CATEGORIES` in `reason_classifier.py`, `infer_reason.py`, and `train_reason.py` | |
| 2. Update category mapping logic in `map_reason_to_category()` | |
| 3. Retrain the model with new categories | |
| 4. Update documentation and examples | |
| ### Improving Training Data | |
| 1. Add more real healthcare examples to the dataset | |
| 2. Improve keyword mapping for better categorization | |
| 3. Implement more sophisticated NLP techniques for category assignment | |
| ## License | |
| This module is part of the health-query-classifier project and follows the same licensing terms. |