Healthcare Reason Classification System

This module implements a classifier for healthcare visit reasons. It is trained on real clinic data and assigns each patient query to one of six reason categories.

Overview

The reason classifier addresses the challenge of routing medical queries to the appropriate specialized department. It assigns each query to a reason category derived from actual healthcare visit data.

Architecture

Classification Categories

| Category | Description | Examples |
|----------|-------------|----------|
| `ROUTINE_CARE` | Routine healthcare, maintenance visits, general care | "I need routine foot care", "Regular nail care appointment" |
| `PAIN_CONDITIONS` | Various pain-related conditions and discomfort | "I have heel pain when I walk", "My ankle is sore" |
| `INJURIES` | Sprains, wounds, trauma-related conditions | "I sprained my ankle playing sports", "I have a wound that won't heal" |
| `SKIN_CONDITIONS` | Skin-related issues and conditions | "My toenail is ingrown and infected", "I have calluses on my feet" |
| `STRUCTURAL_ISSUES` | Structural problems and related conditions | "I have flat feet", "I need evaluation for plantar fasciitis" |
| `PROCEDURES` | Injections, surgical consultations, post-operative care | "I need a cortisone injection", "Post-surgical follow-up" |

Technical Implementation

Base Model: sentence-transformers/embeddinggemma-300m-medical
Architecture: SetFit with frozen embeddings + trainable classification head
Training: Real healthcare data from clinic appointment records
Integration: Works as part of the complete healthcare routing system

Quick Start

1. Train the Classifier

# Train with real healthcare data
python classifier/reason/train_reason.py

# The training script will:
# - Load real healthcare data from data/reason_for_visit_data.xlsx
# - Map reasons to categories using keyword matching
# - Train the classifier with frozen embeddings
# - Save the trained model to classifier/reason_checkpoints/

2. Use the CLI

# Classify a single reason query
python cli/reason_classifier_cli_new.py "I have heel pain when I walk"

# Interactive mode
python cli/reason_classifier_cli_new.py --interactive

# Batch processing
python cli/reason_classifier_cli_new.py --batch queries.txt --output results.json

# Use complete healthcare routing system
python cli/healthcare_classifier_cli.py "I need routine foot care"
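
The batch mode can also be approximated from Python. The following is a minimal sketch, not the CLI's actual implementation. It assumes the input file holds one query per line and that results are written as a JSON list; this README confirms neither format. `classify_file` and the injected `predict` callable are hypothetical names; in real use `predict` would typically be `ReasonClassifier().predict`.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


def classify_file(
    input_path: str,
    output_path: str,
    predict: Callable[[List[str]], List[Dict]],
) -> int:
    """Classify every non-empty line of input_path and write JSON results.

    Assumption: one query per line. `predict` takes a list of strings and
    returns one dict per query, in the same order.
    """
    lines = Path(input_path).read_text(encoding="utf-8").splitlines()
    queries = [line.strip() for line in lines if line.strip()]

    # Skip the model call entirely when the file has no usable queries.
    predictions = predict(queries) if queries else []

    # Keep the original query next to its prediction for easier review.
    records = [
        {"query": query, **prediction}
        for query, prediction in zip(queries, predictions)
    ]
    Path(output_path).write_text(json.dumps(records, indent=2), encoding="utf-8")
    return len(records)
```

Passing the predictor in as an argument keeps the file handling testable without loading a model.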

3. Programmatic Usage

from classifier.reason import ReasonClassifier, predict_single_reason

# Using the main classifier class
classifier = ReasonClassifier()
predictions = classifier.predict(["I have heel pain when I walk"])
print(predictions[0]['category'])  # Output: PAIN_CONDITIONS

# Using convenience function
result = predict_single_reason("I need routine foot care")
print(result['category'])  # Output: ROUTINE_CARE
print(result['confidence'])  # Confidence score
print(result['probabilities'])  # All category probabilities
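
Callers often want to act on the confidence score rather than just print it. The sketch below shows one way to do that. It assumes the result dict has the `category`, `confidence`, and `probabilities` keys used above, and that `probabilities` maps category names to floats. `pick_category`, `top_k`, and the 0.5 threshold are illustrative and not part of this package.

```python
from typing import Dict, List, Optional, Tuple


def pick_category(result: Dict, min_confidence: float = 0.5) -> Optional[str]:
    """Return the predicted category, or None when confidence is too low.

    Assumption: `result` carries 'category' and 'confidence' keys.
    The default threshold is arbitrary and should be tuned on real data.
    """
    if result.get("confidence", 0.0) < min_confidence:
        return None
    return result.get("category")


def top_k(probabilities: Dict[str, float], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k most probable (category, probability) pairs.

    Assumption: `probabilities` maps category names to floats.
    """
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]
```

A `None` return can be used to send the query to manual triage instead of a department.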

System Integration

Complete Healthcare Routing Workflow

User Query
    ↓
Medical vs Insurance Classification
    ↓
┌─────────────────┬─────────────────┐
│   Insurance     │     Medical     │
│   Queries       │     Queries     │
│       ↓         │        ↓        │
│  Insurance      │   Reason        │
│  Department     │ Classification  │
│                 │        ↓        │
│                 │  • ROUTINE_CARE │
│                 │  • PAIN_CONDITIONS │
│                 │  • INJURIES     │
│                 │  • SKIN_CONDITIONS │
│                 │  • STRUCTURAL_ISSUES │
│                 │  • PROCEDURES   │
└─────────────────┴─────────────────┘

Integration with Healthcare System

The reason classifier integrates as part of the complete healthcare routing system:

1. Primary Classification: Medical vs Insurance queries
2. Reason Classification: Medical queries → Specific reason categories
3. Department Routing: Route to appropriate specialized departments
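
The three stages above can be sketched as a single function. This is an illustration under stated assumptions, not the routing system's real code: `route_query`, the `is_medical` and `classify_reason` callables, and the `"INSURANCE"` department label are all hypothetical. In practice `classify_reason` could be `predict_single_reason`.

```python
from typing import Callable, Dict


def route_query(
    query: str,
    is_medical: Callable[[str], bool],
    classify_reason: Callable[[str], Dict],
) -> Dict[str, str]:
    """Route a query in two stages: primary classification, then reason."""
    # Stage 1: insurance queries skip reason classification entirely.
    if not is_medical(query):
        return {"department": "INSURANCE", "query": query}

    # Stage 2: medical queries are routed by their reason category.
    result = classify_reason(query)
    return {"department": result["category"], "query": query}
```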

Training Data Strategy

Real Healthcare Data

The system uses actual healthcare clinic data:

# Data source: data/reason_for_visit_data.xlsx
# Contains real patient visit reasons and appointment types
# Examples from actual data:
# - "Heel pain"
# - "Routine foot care"
# - "Ingrown toenail"
# - "Ankle sprain"
# - "Plantar fasciitis"

Category Mapping Strategy

The system uses keyword-based mapping to categorize real healthcare reasons:

def map_reason_to_category(reason: str) -> int:
    reason_lower = reason.lower()
    
    # ROUTINE_CARE (routine care, maintenance visits)
    if any(word in reason_lower for word in ['routine', 'nail care', 'calluses']):
        return 0
    
    # PAIN_CONDITIONS (various pain-related conditions)
    elif any(word in reason_lower for word in ['pain', 'ache', 'sore']):
        return 1
    
    # ... other categories
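
The same first-match-wins logic can be written as a rule table, which makes the ordering explicit. This is a sketch only. The keywords for categories 0 and 1 come from the snippet above. The keywords and numeric IDs for categories 2 to 5 are guesses based on the category table, and returning `None` for unmatched reasons is an assumption, since the project's actual fallback is not shown here.

```python
from typing import List, Optional, Tuple

# (category_id, keywords), checked in order; the first match wins.
# Rules 2-5 are illustrative assumptions, not the project's keyword lists.
KEYWORD_RULES: List[Tuple[int, Tuple[str, ...]]] = [
    (0, ("routine", "nail care", "calluses")),         # ROUTINE_CARE
    (1, ("pain", "ache", "sore")),                     # PAIN_CONDITIONS
    (2, ("sprain", "wound")),                          # INJURIES (assumed)
    (3, ("ingrown", "infected")),                      # SKIN_CONDITIONS (assumed)
    (4, ("flat feet", "plantar fasciitis")),           # STRUCTURAL_ISSUES (assumed)
    (5, ("injection", "surgical", "post-operative")),  # PROCEDURES (assumed)
]


def map_reason_to_category_sketch(reason: str) -> Optional[int]:
    """Return the first category whose keyword occurs in the reason text."""
    reason_lower = reason.lower()
    for category_id, keywords in KEYWORD_RULES:
        # Substring matching: "sprain" also matches "sprained".
        if any(keyword in reason_lower for keyword in keywords):
            return category_id
    return None  # Assumed fallback for reasons that match no rule.
```

Because the first match wins, rule order matters: with the keywords shown, "I have calluses on my feet" maps to `ROUTINE_CARE` even though the category table lists it as a `SKIN_CONDITIONS` example.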

Performance Metrics

Expected Performance

Accuracy: No benchmark figure is reported here; it depends on the patterns present in the real healthcare data
Categories: 6 specialized healthcare reason categories
Confidence: Varies with the quality of the training data

Evaluation Framework

# Train and evaluate the model
python classifier/reason/train_reason.py

# Test the trained model
python classifier/reason/infer_reason.py

# Results include:
# - Training metrics
# - Category distribution
# - Example predictions with confidence scores
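
For a quick check outside the training script, accuracy and the predicted category distribution can be computed from two label lists. This is a minimal sketch; `summarize_predictions` is a hypothetical helper and does not reproduce the metrics that `train_reason.py` prints.

```python
from collections import Counter
from typing import Dict, List


def summarize_predictions(true_labels: List[str], predicted: List[str]) -> Dict:
    """Compute overall accuracy and how often each category was predicted."""
    if len(true_labels) != len(predicted):
        raise ValueError("true_labels and predicted must have the same length")

    correct = sum(truth == guess for truth, guess in zip(true_labels, predicted))
    return {
        # Guard against division by zero on an empty evaluation set.
        "accuracy": correct / len(true_labels) if true_labels else 0.0,
        "distribution": dict(Counter(predicted)),
    }
```

A heavily skewed distribution is often an early sign that the keyword mapping is sending most reasons to a single category.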

File Structure

classifier/reason/
├── __init__.py              # Package initialization and exports
├── README.md               # This documentation
├── reason_classifier.py    # Main ReasonClassifier class
├── infer_reason.py        # Inference functions and utilities
└── train_reason.py        # Training script and functions

API Reference

ReasonClassifier

class ReasonClassifier:
    def __init__(self, data_file: str = "data/reason_for_visit_data.xlsx")
    def predict(self, queries: List[str]) -> List[Dict]
    def train(self, train_data: pd.DataFrame = None, eval_data: Optional[pd.DataFrame] = None)
    def save_model(self, path: str)
    def load_model(self, path: str)
    def create_real_dataset(self) -> pd.DataFrame
    def analyze_real_data(self)

Inference Functions

def predict_single_reason(query: str) -> dict
def predict_reason_query(text: list[str], embedding_model, classifier_head) -> dict
def get_reason_models() -> tuple
def test_reason_classifier()

Training Functions

def get_reason_model(num_classes: int)
def get_reason_dataset() -> pd.DataFrame
def map_reason_to_category(reason: str) -> int
def preprocess_reason_data(df: pd.DataFrame) -> pd.DataFrame

Data Requirements

Healthcare Data Format

The system expects healthcare data in Excel format with these columns:

Columns:
- "Reason For Visit" (required): The primary reason for the healthcare visit
- "Appointment Type" (optional): Type of appointment, used for context

Example data:
| Reason For Visit | Appointment Type |
|------------------|------------------|
| Heel pain        | Follow-up        |
| Routine foot care| Maintenance      |
| Ingrown toenail  | New Patient      |

Deployment Considerations

Production Readiness

Model Persistence: Trained models saved with timestamps in classifier/reason_checkpoints/
Error Handling: Graceful fallbacks for prediction failures
Real Data Integration: Uses actual healthcare clinic data
Device Support: CPU/GPU/MPS compatibility
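
Timestamped checkpoint directories can be built and resolved as follows. This is a sketch under assumptions: the `reason_YYYYMMDD_HHMMSS` naming pattern and both helper names are hypothetical, and the actual layout written by `train_reason.py` may differ.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def checkpoint_dir(
    base: str = "classifier/reason_checkpoints",
    now: Optional[datetime] = None,
) -> Path:
    """Build a timestamped checkpoint path (the naming pattern is assumed)."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(base) / f"reason_{stamp}"


def latest_checkpoint(base: str = "classifier/reason_checkpoints") -> Optional[Path]:
    """Return the newest checkpoint directory, or None if there is none."""
    root = Path(base)
    if not root.is_dir():
        return None
    # Zero-padded timestamps sort chronologically as plain strings.
    candidates = sorted(path for path in root.iterdir() if path.is_dir())
    return candidates[-1] if candidates else None
```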

Scalability

Batch Processing: Efficient handling of multiple queries
Integration: Works with existing healthcare routing system
Checkpoints: Automatic model saving with timestamps

Future Enhancements

Data Improvements

Expanded Dataset: Include more healthcare specialties
Active Learning: Improve model with real-world feedback
Multi-language Support: Support for non-English healthcare queries

Advanced Features

Confidence Calibration: Improve confidence score reliability
Hierarchical Classification: Sub-categories within reason types
Context Awareness: Consider patient history and appointment context

Troubleshooting

Common Issues

Data Loading Errors: Ensure data/reason_for_visit_data.xlsx exists
Low Confidence: May indicate need for more training data or model retraining
Import Errors: Ensure all dependencies are installed and paths are correct

Debug Mode

# Test the classifier with sample queries
from classifier.reason.infer_reason import test_reason_classifier
test_reason_classifier()

# Check model predictions with probabilities
from classifier.reason import predict_single_reason
result = predict_single_reason("ambiguous query")
print(result['probabilities'])

Model Training Issues

# Check if healthcare data is available
ls -la data/reason_for_visit_data.xlsx

# Verify model training
python classifier/reason/train_reason.py

# Test inference after training
python classifier/reason/infer_reason.py

Contributing

Adding New Categories

Update REASON_CATEGORIES in reason_classifier.py, infer_reason.py, and train_reason.py
Update category mapping logic in map_reason_to_category()
Retrain the model with new categories
Update documentation and examples

Improving Training Data

Add more real healthcare examples to the dataset
Improve keyword mapping for better categorization
Implement more sophisticated NLP techniques for category assignment

License

This module is part of the health-query-classifier project and follows the same licensing terms.