Hospital Length of Stay Predictor - XGBoost Pipeline
Model Description
This XGBoost regression pipeline predicts hospital Length of Stay (LOS) in days for inpatient admissions across New York State hospitals. The model was trained on 2.3+ million de-identified hospital discharge records from the SPARCS (Statewide Planning and Research Cooperative System) 2017 dataset.
Intended Use: Support discharge planning, resource allocation, and patient expectation management by providing evidence-based LOS predictions with 95% confidence intervals.
Model Details
- Developed by: [Ajiboye Toluwalase]
- Model type: XGBoost Regressor (Gradient Boosted Decision Trees)
- Language: English (US Healthcare)
- License: MIT
- Model version: 1.0.0
- Framework: XGBoost + Scikit-learn preprocessing pipeline
- Model size: ~15 MB (compressed)
- Input features: 13 categorical + numerical features
- Output: Continuous (days), with 95% confidence intervals
Intended Use
Primary Use Cases
✅ Clinical Decision Support
- Hospital discharge planning
- Bed capacity forecasting
- Post-acute care coordination
- Patient/family expectation setting
✅ Healthcare Operations
- Resource allocation and staffing
- Length of stay benchmarking
- Quality improvement initiatives
- Cost prediction modeling
✅ Research & Analytics
- Health services research
- Social determinants of health analysis
- Healthcare disparities investigation
- Policy impact evaluation
Out-of-Scope Use Cases
❌ NOT for:
- Real-time clinical diagnosis
- Individual patient medical decision-making without clinician review
- Determining insurance coverage or payment
- Predictive policing or surveillance
- Any use that could harm patients or violate HIPAA
Model Architecture
Pipeline Components
Input (13 features)
        │
        ▼
┌───────────────────────────────────────────┐
│           HospitalDataCleaner             │
│  - MDC description → code mapping         │
│  - Target encoding (LOS_per_MDC)          │
│  - Target encoding (LOS_per_severity)     │
│  - One-hot encoding (categorical vars)    │
│  - Feature alignment (312 columns)        │
└─────────────────────┬─────────────────────┘
                      │
                      ▼
          Encoded Features (312)
                      │
                      ▼
┌───────────────────────────────────────────┐
│            XGBoost Regressor              │
│  - n_estimators: 100                      │
│  - max_depth: 6                           │
│  - learning_rate: 0.1                     │
│  - objective: reg:squarederror            │
└─────────────────────┬─────────────────────┘
                      │
                      ▼
            Predicted LOS (days)
Feature Engineering
Target Encoding:
- LOS_per_MDC: Median LOS grouped by Major Diagnostic Category
- LOS_per_severity: Median LOS grouped by severity level
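Both target-encoded features can be computed with a pandas `groupby`/`transform`. A minimal sketch on toy data (column names follow the SPARCS-style schema used elsewhere in this card; values are illustrative only):

```python
import pandas as pd

# Toy discharge records (illustrative values only)
df = pd.DataFrame({
    'APR MDC Description': ['Circulatory', 'Circulatory', 'Respiratory', 'Respiratory'],
    'APR Severity of Illness Code': [1, 3, 1, 3],
    'Length of Stay': [2, 8, 3, 9],
})

# Target encoding: replace each category with the median LOS of its group
df['LOS_per_MDC'] = df.groupby('APR MDC Description')['Length of Stay'].transform('median')
df['LOS_per_severity'] = df.groupby('APR Severity of Illness Code')['Length of Stay'].transform('median')
```

In production these medians must be computed on the training split only and looked up at inference time, to avoid leaking the target into the features.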
One-Hot Encoding applied to:
- Hospital County (62 counties)
- Facility Name (200+ hospitals)
- Age Group (5 categories)
- Gender (2 categories)
- Race (4+ categories)
- Ethnicity (4 categories)
- Type of Admission (6 types)
- Patient Disposition (20+ categories)
- APR MDC Description (26 diagnosis groups)
- APR Medical/Surgical (2 categories)
- Payment Type (10+ insurance types)
- Emergency Department Indicator (2 categories)
Total Features After Encoding: 312
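Feature alignment matters because a single admission record never contains every category seen in training. A minimal sketch of one-hot encoding followed by reindexing to a fixed column layout (the real pipeline aligns to the 312 saved names; the toy columns below are illustrative):

```python
import pandas as pd

# Fit-time: learn the full dummy-column layout from training data
train = pd.DataFrame({'Gender': ['M', 'F'], 'Type of Admission': ['Emergency', 'Elective']})
expected_columns = pd.get_dummies(train).columns  # real pipeline: the 312 saved names

# Inference-time: a new record misses some categories; reindex fills them with 0
new = pd.DataFrame({'Gender': ['M'], 'Type of Admission': ['Emergency']})
new_enc = pd.get_dummies(new).reindex(columns=expected_columns, fill_value=0)
```

Reindexing also silently drops unseen categories, which is why performance degrades for rare diagnosis/hospital combinations (see Limitations below in the card).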
Training Data
Dataset Information
Source: Hospital Inpatient Discharges (SPARCS De-Identified) 2017
- Provider: New York State Department of Health
- Records: 2,346,894 inpatient discharges
- Year: 2017
- Geography: New York State (62 counties, 200+ hospitals)
- Privacy: De-identified (HIPAA compliant)
Data Preprocessing
Cleaning Steps:
- Removed records with unknown gender (`U`)
- Converted LOS `120+` to numeric value `120`
- Dropped 20 irrelevant columns (facility IDs, billing codes, etc.)
- Handled missing values in categorical features
- Applied target encoding for high-cardinality categoricals
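The first two cleaning steps can be sketched in pandas (toy data; the exact SPARCS label for censored stays is assumed here to be the string `120+`):

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['M', 'F', 'U'],
    'Length of Stay': ['3', '120+', '5'],  # '120+' marks censored stays (assumed label)
})

# Drop records with unknown gender ('U')
df = df[df['Gender'] != 'U'].copy()

# Convert the censored '120+' label to the numeric cap, then cast to int
df['Length of Stay'] = df['Length of Stay'].replace('120+', '120').astype(int)
```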
Data Split:
- Training: 70% (~1.64M records)
- Validation: 15% (~352K records)
- Test: 15% (~352K records)
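A 70/15/15 split like the one above can be reproduced with two calls to scikit-learn's `train_test_split` (hold out 30%, then split it half-and-half); the `random_state` value is an illustrative choice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)            # stand-in for the encoded feature matrix
y = np.random.default_rng(0).random(1000)     # stand-in for LOS targets

# 70% train, then split the remaining 30% into 15% validation / 15% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)
```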
Target Variable Distribution
Length of Stay Statistics (days):
- Mean: 5.2
- Median: 3.0
- Std Dev: 6.8
- Min: 1
- Max: 120
- 25th percentile: 2
- 75th percentile: 6
Evaluation
Metrics
| Metric | Training | Validation | Test |
|---|---|---|---|
| RMSE | X.XX days | X.XX days | X.XX days |
| MAE | X.XX days | X.XX days | X.XX days |
| RΒ² | 0.XX | 0.XX | 0.XX |
| MAPE | X.X% | X.X% | X.X% |
Note: Update with your actual evaluation results
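Once predictions are available, the four metrics in the table can be computed as below (toy arrays shown; MAPE is well-defined here because LOS is always at least 1 day, so there is no division by zero):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([2.0, 3.0, 5.0, 10.0])   # actual LOS (days)
y_pred = np.array([2.5, 3.5, 4.0, 9.0])    # model predictions

rmse = mean_squared_error(y_true, y_pred) ** 0.5
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
```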
Performance by Subgroup
By Severity Level:
| Severity | MAE | Sample Size |
|---|---|---|
| 1 (Minor) | X.X days | ~800K |
| 2 (Moderate) | X.X days | ~900K |
| 3 (Major) | X.X days | ~500K |
| 4 (Extreme) | X.X days | ~150K |
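Per-subgroup MAE like the values in these tables reduces to a single groupby over the absolute errors (illustrative data; the real analysis would group on the APR severity or MDC columns):

```python
import pandas as pd

results = pd.DataFrame({
    'severity': [1, 1, 2, 2],
    'y_true':   [2.0, 3.0, 5.0, 7.0],
    'y_pred':   [2.5, 2.5, 6.0, 6.0],
})

# Mean absolute error within each severity group
abs_err = (results['y_true'] - results['y_pred']).abs()
mae_by_severity = abs_err.groupby(results['severity']).mean()
```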
By Diagnosis Group (Top 5):
| MDC Description | MAE | Sample Size |
|---|---|---|
| Circulatory System | X.X | ~300K |
| Respiratory System | X.X | ~250K |
| Digestive System | X.X | ~220K |
| Nervous System | X.X | ~180K |
| Pregnancy/Childbirth | X.X | ~200K |
Clinical Validation
Concordance with Expert Judgment:
- Predictions within Β±1 day for XX% of routine admissions
- Identifies high-risk extended stays (>10 days) with XX% sensitivity
- False positive rate for long stays: XX%
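The three concordance figures above can be computed from paired predictions and actuals as follows (toy arrays; the 10-day threshold matches the definition of extended stays above):

```python
import numpy as np

y_true = np.array([3.0, 4.0, 12.0, 15.0, 2.0, 11.0])
y_pred = np.array([3.5, 6.0, 11.0, 9.0, 2.2, 12.0])

# Share of admissions predicted within ±1 day
within_one_day = np.mean(np.abs(y_true - y_pred) <= 1.0)

# Sensitivity and false-positive rate for extended stays (>10 days)
long_true = y_true > 10
long_pred = y_pred > 10
sensitivity = (long_pred & long_true).sum() / long_true.sum()
false_positive_rate = (long_pred & ~long_true).sum() / (~long_true).sum()
```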
How to Use
Installation
pip install xgboost scikit-learn pandas numpy joblib
Loading the Model
import joblib
import pandas as pd
# Load the full pipeline
pipeline = joblib.load('xgb_hospital_full_pipeline.pkl')
# Or load model + preprocessor separately
model = joblib.load('xgb_modelv1.pkl')
preprocessor = joblib.load('hospital_data_cleanerv1.pkl')
Making Predictions
Option 1: Using the Full Pipeline
import pandas as pd
# Prepare a single admission record with the model's input features
patient_data = pd.DataFrame([{
'Hospital County': 'Kings',
'Facility Name': 'Mount Sinai Hospital',
'Age Group': '50 to 69',
'Gender': 'M',
'Race': 'White',
'Ethnicity': 'Not Span/Hispanic',
'Type of Admission': 'Emergency',
'Patient Disposition': 'Home or Self Care',
'APR MDC Code': 5, # Circulatory system
'APR MDC Description': 'Diseases and Disorders of the Circulatory System',
'APR Severity of Illness Code': 3,
'APR Medical Surgical Description': 'Medical',
'Payment Typology 1': 'Medicare',
'Emergency Department Indicator': 'Y'
}])
# Predict
predicted_los = pipeline.predict(patient_data)
print(f"Predicted LOS: {predicted_los[0]:.2f} days")
# Output: Predicted LOS: 4.47 days
Option 2: Step-by-Step
# 1. Preprocess
X_processed = preprocessor.transform(patient_data)
# 2. Predict
predicted_los = model.predict(X_processed)
# 3. Approximate a 95% confidence interval
# (heuristic: assumes ~15% relative standard error; not a calibrated interval)
std_error = predicted_los[0] * 0.15
confidence_low = max(1.0, predicted_los[0] - 1.96 * std_error)
confidence_high = predicted_los[0] + 1.96 * std_error
print(f"Prediction: {predicted_los[0]:.1f} days")
print(f"95% CI: [{confidence_low:.1f}, {confidence_high:.1f}] days")
Batch Predictions
# Load multiple patients
patients_df = pd.read_csv('patient_admissions.csv')
# Predict for all
predictions = pipeline.predict(patients_df)
# Add to dataframe
patients_df['predicted_los'] = predictions
patients_df.to_csv('predictions_output.csv', index=False)
Feature Importance
import joblib
import matplotlib.pyplot as plt
# Load the saved feature names (shipped in this repo as feature_names.pkl)
feature_names = joblib.load('feature_names.pkl')
# Get importance scores from the fitted model
importance = model.feature_importances_
# Sort and plot top 20
indices = importance.argsort()[-20:][::-1]
plt.figure(figsize=(10, 6))
plt.barh(range(20), importance[indices])
plt.yticks(range(20), [feature_names[i] for i in indices])
plt.xlabel('Feature Importance')
plt.title('Top 20 Most Important Features for LOS Prediction')
plt.tight_layout()
plt.show()
Limitations and Biases
Known Limitations
⚠️ Data Limitations:
- Single year snapshot (2017) - may not reflect current practice patterns
- Geography-specific: Trained only on New York State hospitals
- Missing features: No data on comorbidities, lab values, or vital signs
- Administrative data: Based on billing records, not clinical EMR
- Censoring: LOS capped at 120 days (affects ~0.5% of cases)
⚠️ Model Limitations:
- Point estimates: Predictions are averages; individual variance is high
- New categories: Performance degrades for rare diagnosis/hospital combinations
- Temporal drift: Healthcare practices change; model requires periodic retraining
- External validity: Not validated outside New York State
Potential Biases
🔴 Demographic Biases:
- Race/ethnicity: Model may perpetuate historical disparities in healthcare access
- Example: Underserved communities may have systematically different LOS due to social determinants
- Insurance type: Self-pay patients may have different discharge patterns
- Age: Older adults (70+) may have higher prediction variance
🔴 Geographic Biases:
- Rural vs. urban: Smaller rural hospitals may be underrepresented
- Hospital resources: Predictions reflect hospital capacity, not just patient needs
- County-level effects: High-crime or low-income areas may show systemic differences
🔴 Clinical Biases:
- Diagnosis coding: APR-DRG groupings may oversimplify complex conditions
- Severity scoring: APR severity is administrative, not clinical ground truth
- Disposition planning: Social factors (housing, family support) affect LOS but aren't captured
Bias Mitigation Strategies
✅ Implemented:
- De-identified data reduces individual privacy risks
- Included race/ethnicity as features (with caution) to allow disparity analysis
- Confidence intervals communicate prediction uncertainty
⚠️ Recommended for Production:
- Regular audits for fairness across demographic groups
- Clinician oversight - never use predictions in isolation
- Transparent communication with patients about prediction limitations
- Retraining cadence (annually or when performance degrades)
Ethical Considerations
Responsible Use Guidelines
Clinical Context Required
- Predictions are decision support tools, NOT diagnoses
- Always review with qualified healthcare professionals
- Consider patient-specific factors not in the model
Transparency with Patients
- Explain predictions are estimates, not guarantees
- Discuss confidence intervals and uncertainty
- Empower patients to ask questions
Avoid Discriminatory Use
- Do NOT use predictions to deny care or insurance
- Monitor for disparate impact across racial/ethnic groups
- Provide same quality of care regardless of predicted LOS
Data Privacy
- Model trained on de-identified data
- Do NOT re-identify patients from predictions
- Comply with HIPAA and local privacy regulations
Model Governance
- Document all predictions for audit trails
- Establish human oversight processes
- Monitor real-world outcomes vs. predictions
Fairness Analysis
Demographic Parity (should be analyzed):
- Prediction distributions should be similar across race/ethnicity groups for similar clinical profiles
- Differences may reflect genuine clinical needs OR systemic biases
Example Analysis:
# Check prediction distributions by race
results_by_race = df.groupby('Race')['predicted_los'].describe()
print(results_by_race)
# Flag if mean predictions differ by >20% across groups
# (May indicate bias OR clinical differences - requires clinical review)
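The >20% flag mentioned in the comment above can be made explicit. A minimal sketch (column names follow the example above; the data is illustrative):

```python
import pandas as pd

preds = pd.DataFrame({
    'Race': ['Group A', 'Group A', 'Group B', 'Group B'],
    'predicted_los': [4.0, 6.0, 5.0, 8.0],
})

# Relative spread of group means; >20% triggers clinical/fairness review
group_means = preds.groupby('Race')['predicted_los'].mean()
relative_spread = (group_means.max() - group_means.min()) / group_means.min()
needs_review = relative_spread > 0.20
```

A flag here is a prompt for clinical review, not proof of bias: the gap may reflect genuine differences in case mix.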
Model Card Authors
- Primary Author: [Ajiboye Toluwalase]
- Contributors: [List contributors]
- Contact: ajiboyetolu1@gmail.com
- Organization: [Metro's Tech]
Citation
If you use this model in your research or application, please cite:
@misc{hospital_los_xgboost_2026,
author = {Ajiboye Toluwalase},
title = {Hospital Length of Stay Predictor - XGBoost Pipeline},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/Ajiboye/hospital_predict_model}},
note = {Trained on SPARCS NY 2017 dataset}
}
Data Source Citation:
New York State Department of Health. (2017). Hospital Inpatient Discharges
(SPARCS De-Identified): 2017. https://health.data.ny.gov/
Model Files
This repository contains:
hospital-los-xgboost/
├── xgb_hospital_full_pipeline.pkl   # Complete pipeline (recommended)
├── xgb_modelv1.pkl                  # XGBoost model only
├── hospital_data_cleanerv1.pkl      # Preprocessor only
├── feature_names.pkl                # Expected 312 feature names
├── README.md                        # This model card
└── requirements.txt                 # Python dependencies
Total size: ~15 MB (compressed)
Changelog
Version 1.0.0 (February 2026)
- Initial release
- Trained on SPARCS 2017 dataset (2.3M records)
- 13 input features → 312 encoded features
- XGBoost regressor with target-encoded features
- Confidence interval estimation
- Risk factor analysis
Planned Updates
- Retrain on 2022-2024 data
- Add SHAP explanations
- Incorporate CMS quality metrics
- Multi-output prediction (LOS + readmission risk)
- Fairness-aware training
Acknowledgments
- New York State Department of Health for SPARCS data access
- Kaggle community for data hosting and discussions
- XGBoost development team for the excellent ML framework
- Hugging Face for model hosting infrastructure
License
This model is released under the MIT License.
MIT License
Copyright (c) 2025 [Ajiboye Toluwalase]
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Additional Resources
- Live Demo
- GitHub Repository
- Technical Documentation
- Model Training Notebook
- Contact for Collaboration
Remember: This model is a tool to support healthcare professionals, not replace them. Always involve clinical expertise in patient care decisions.
Last updated: February 2026