Hospital Length of Stay Predictor - XGBoost Pipeline

Model Description

This XGBoost regression pipeline predicts hospital Length of Stay (LOS) in days for inpatient admissions across New York State hospitals. The model was trained on 2.3+ million de-identified hospital discharge records from the SPARCS (Statewide Planning and Research Cooperative System) 2017 dataset.

Intended Use: Support discharge planning, resource allocation, and patient expectation management by providing evidence-based LOS predictions with 95% confidence intervals.

Model Details

  • Developed by: [Ajiboye Toluwalase]
  • Model type: XGBoost Regressor (Gradient Boosted Decision Trees)
  • Language: English (US Healthcare)
  • License: MIT
  • Model version: 1.0.0
  • Framework: XGBoost + Scikit-learn preprocessing pipeline
  • Model size: ~15 MB (compressed)
  • Input features: 13 raw features (categorical and numerical)
  • Output: Continuous (days), with 95% confidence intervals

Intended Use

Primary Use Cases

✅ Clinical Decision Support

  • Hospital discharge planning
  • Bed capacity forecasting
  • Post-acute care coordination
  • Patient/family expectation setting

✅ Healthcare Operations

  • Resource allocation and staffing
  • Length of stay benchmarking
  • Quality improvement initiatives
  • Cost prediction modeling

✅ Research & Analytics

  • Health services research
  • Social determinants of health analysis
  • Healthcare disparities investigation
  • Policy impact evaluation

Out-of-Scope Use Cases

❌ NOT for:

  • Real-time clinical diagnosis
  • Individual patient medical decision-making without clinician review
  • Determining insurance coverage or payment
  • Predictive policing or surveillance
  • Any use that could harm patients or violate HIPAA

Model Architecture

Pipeline Components

Input (13 features)
    ↓
┌─────────────────────────────────────────┐
│  HospitalDataCleaner                    │
│  - MDC description → code mapping       │
│  - Target encoding (LOS_per_MDC)        │
│  - Target encoding (LOS_per_severity)   │
│  - One-hot encoding (categorical vars)  │
│  - Feature alignment (312 columns)      │
└─────────────────┬───────────────────────┘
                  ↓
          Encoded Features (312)
                  ↓
┌─────────────────────────────────────────┐
│  XGBoost Regressor                      │
│  - n_estimators: 100                    │
│  - max_depth: 6                         │
│  - learning_rate: 0.1                   │
│  - objective: reg:squarederror          │
└─────────────────┬───────────────────────┘
                  ↓
        Predicted LOS (days)

Feature Engineering

Target Encoding:

  • LOS_per_MDC: Median LOS grouped by Major Diagnostic Category
  • LOS_per_severity: Median LOS grouped by severity level
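
As a concrete illustration of these two target-encoded features, the sketch below computes group-wise median LOS on a toy DataFrame. Column names mirror the SPARCS schema, but the values are made up and the real HospitalDataCleaner internals may differ:

```python
import pandas as pd

# Toy discharge records; values are illustrative, not taken from SPARCS.
df = pd.DataFrame({
    "APR MDC Code": [5, 5, 4, 4, 4],
    "APR Severity of Illness Code": [1, 3, 1, 2, 2],
    "Length of Stay": [3, 9, 2, 4, 6],
})

# LOS_per_MDC: median LOS within each Major Diagnostic Category
df["LOS_per_MDC"] = df.groupby("APR MDC Code")["Length of Stay"].transform("median")

# LOS_per_severity: median LOS within each severity level
df["LOS_per_severity"] = (
    df.groupby("APR Severity of Illness Code")["Length of Stay"].transform("median")
)

print(df[["LOS_per_MDC", "LOS_per_severity"]])
```

In production, these medians should be learned on the training split only and reapplied at inference time, to avoid target leakage.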

One-Hot Encoding applied to:

  • Hospital County (62 counties)
  • Facility Name (200+ hospitals)
  • Age Group (5 categories)
  • Gender (2 categories)
  • Race (4+ categories)
  • Ethnicity (4 categories)
  • Type of Admission (6 types)
  • Patient Disposition (20+ categories)
  • APR MDC Description (26 diagnosis groups)
  • APR Medical/Surgical (2 categories)
  • Payment Type (10+ insurance types)
  • Emergency Department Indicator (2 categories)

Total Features After Encoding: 312
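
The final "feature alignment" step can be approximated with `pandas.get_dummies` plus `reindex`: a new record is one-hot encoded, then forced onto the training-time column order so every input reaches the model with the same 312 columns. The repo ships the real names in feature_names.pkl; three hypothetical names stand in for them here:

```python
import pandas as pd

# Hypothetical training-time column list (the real pipeline has 312 names,
# stored in feature_names.pkl); three names used here for illustration.
train_columns = ["Gender_F", "Gender_M", "Age Group_50 to 69"]

# One-hot encode a new record, then align to the training columns:
# columns unseen at training time are dropped, missing ones are filled with 0.
new = pd.get_dummies(pd.DataFrame({"Gender": ["M"], "Age Group": ["0 to 17"]}))
aligned = new.reindex(columns=train_columns, fill_value=0)
print(aligned)
```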


Training Data

Dataset Information

Source: Hospital Inpatient Discharges (SPARCS De-Identified) 2017

  • Provider: New York State Department of Health
  • Records: 2,346,894 inpatient discharges
  • Year: 2017
  • Geography: New York State (62 counties, 200+ hospitals)
  • Privacy: De-identified (HIPAA compliant)

Data Preprocessing

Cleaning Steps:

  1. Removed records with unknown gender (U)
  2. Converted LOS 120+ to numeric value 120
  3. Dropped 20 irrelevant columns (facility IDs, billing codes, etc.)
  4. Handled missing values in categorical features
  5. Applied target encoding for high-cardinality categoricals
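
A minimal sketch of steps 1-2, assuming the raw file encodes censored stays as the string "120 +" (the exact token in the source CSV may differ):

```python
import pandas as pd

# Illustrative raw rows; real SPARCS data has many more columns.
raw = pd.DataFrame({
    "Gender": ["M", "F", "U", "F"],
    "Length of Stay": ["3", "120 +", "5", "7"],
})

# 1. Drop records with unknown gender
clean = raw[raw["Gender"] != "U"].copy()

# 2. Map the censored "120 +" category to the numeric value 120
clean["Length of Stay"] = (
    clean["Length of Stay"].replace("120 +", "120").astype(int)
)
print(clean)
```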

Data Split:

  • Training: 70% (~1.64M records)
  • Validation: 15% (~352K records)
  • Test: 15% (~352K records)

Target Variable Distribution

Length of Stay Statistics (days):
- Mean: 5.2
- Median: 3.0
- Std Dev: 6.8
- Min: 1
- Max: 120
- 25th percentile: 2
- 75th percentile: 6

Evaluation

Metrics

| Metric | Training | Validation | Test |
|--------|----------|------------|------|
| RMSE | X.XX days | X.XX days | X.XX days |
| MAE | X.XX days | X.XX days | X.XX days |
| R² | 0.XX | 0.XX | 0.XX |
| MAPE | X.X% | X.X% | X.X% |

Note: Update with your actual evaluation results
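
The four metrics above can be computed from held-out predictions as follows. Toy arrays stand in for the real test split; `sklearn.metrics` provides equivalent functions if preferred:

```python
import numpy as np

# Toy true/predicted LOS arrays; substitute the held-out test split.
y_true = np.array([2.0, 3.0, 5.0, 10.0])
y_pred = np.array([2.5, 3.0, 4.0, 12.0])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))      # root mean squared error
mae = np.mean(np.abs(y_true - y_pred))               # mean absolute error
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                             # coefficient of determination
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # mean abs. pct. error

print(f"RMSE={rmse:.2f}  MAE={mae:.2f}  R2={r2:.3f}  MAPE={mape:.1f}%")
```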

Performance by Subgroup

By Severity Level:

| Severity | MAE | Sample Size |
|----------|-----|-------------|
| 1 (Minor) | X.X days | ~800K |
| 2 (Moderate) | X.X days | ~900K |
| 3 (Major) | X.X days | ~500K |
| 4 (Extreme) | X.X days | ~150K |

By Diagnosis Group (Top 5):

| MDC Description | MAE (days) | Sample Size |
|-----------------|------------|-------------|
| Circulatory System | X.X | ~300K |
| Respiratory System | X.X | ~250K |
| Digestive System | X.X | ~220K |
| Nervous System | X.X | ~180K |
| Pregnancy/Childbirth | X.X | ~200K |

Clinical Validation

Concordance with Expert Judgment:

  • Predictions within ±1 day for XX% of routine admissions
  • Identifies high-risk extended stays (>10 days) with XX% sensitivity
  • False positive rate for long stays: XX%
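
The extended-stay sensitivity and false positive rate quoted above can be reproduced by thresholding both actual and predicted LOS at 10 days. Toy arrays are shown; in practice use the held-out test split:

```python
import numpy as np

# Toy held-out data: stays longer than 10 days count as "extended"
y_true = np.array([3, 12, 15, 4, 11, 2, 8])
y_pred = np.array([4, 11, 9, 5, 13, 3, 12])

actual_long = y_true > 10
pred_long = y_pred > 10

tp = np.sum(pred_long & actual_long)    # correctly flagged long stays
fn = np.sum(~pred_long & actual_long)   # missed long stays
fp = np.sum(pred_long & ~actual_long)   # short stays wrongly flagged
tn = np.sum(~pred_long & ~actual_long)  # correctly unflagged short stays

sensitivity = tp / (tp + fn)
false_positive_rate = fp / (fp + tn)
print(f"sensitivity={sensitivity:.2f}, FPR={false_positive_rate:.2f}")
```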

How to Use

Installation

pip install xgboost scikit-learn pandas numpy joblib

Loading the Model

import joblib
import pandas as pd

# Load the full pipeline
pipeline = joblib.load('xgb_hospital_full_pipeline.pkl')

# Or load model + preprocessor separately
model = joblib.load('xgb_modelv1.pkl')
preprocessor = joblib.load('hospital_data_cleanerv1.pkl')

Making Predictions

Option 1: Using the Full Pipeline

import pandas as pd

# Prepare input data for one admission (raw fields; the pipeline handles encoding)
patient_data = pd.DataFrame([{
    'Hospital County': 'Kings',
    'Facility Name': 'Mount Sinai Hospital',
    'Age Group': '50 to 69',
    'Gender': 'M',
    'Race': 'White',
    'Ethnicity': 'Not Span/Hispanic',
    'Type of Admission': 'Emergency',
    'Patient Disposition': 'Home or Self Care',
    'APR MDC Code': 5,  # Circulatory system
    'APR MDC Description': 'Diseases and Disorders of the Circulatory System',
    'APR Severity of Illness Code': 3,
    'APR Medical Surgical Description': 'Medical',
    'Payment Typology 1': 'Medicare',
    'Emergency Department Indicator': 'Y'
}])

# Predict
predicted_los = pipeline.predict(patient_data)
print(f"Predicted LOS: {predicted_los[0]:.2f} days")
# Output: Predicted LOS: 4.47 days

Option 2: Step-by-Step

# 1. Preprocess
X_processed = preprocessor.transform(patient_data)

# 2. Predict
predicted_los = model.predict(X_processed)

# 3. Approximate a 95% confidence interval
#    (heuristic: assumes the standard error is ~15% of the prediction;
#    calibrate against held-out residuals for production use)
std_error = predicted_los[0] * 0.15
confidence_low = max(1.0, predicted_los[0] - 1.96 * std_error)
confidence_high = predicted_los[0] + 1.96 * std_error

print(f"Prediction: {predicted_los[0]:.1f} days")
print(f"95% CI: [{confidence_low:.1f}, {confidence_high:.1f}] days")

Batch Predictions

# Load multiple patients
patients_df = pd.read_csv('patient_admissions.csv')

# Predict for all
predictions = pipeline.predict(patients_df)

# Add to dataframe
patients_df['predicted_los'] = predictions
patients_df.to_csv('predictions_output.csv', index=False)

Feature Importance

import joblib
import matplotlib.pyplot as plt

# Load the 312 encoded feature names shipped with this repo
feature_names = joblib.load('feature_names.pkl')

# Get importance scores
importance = model.feature_importances_

# Sort and plot top 20
indices = importance.argsort()[-20:][::-1]
plt.figure(figsize=(10, 6))
plt.barh(range(20), importance[indices])
plt.yticks(range(20), [feature_names[i] for i in indices])
plt.xlabel('Feature Importance')
plt.title('Top 20 Most Important Features for LOS Prediction')
plt.tight_layout()
plt.show()

Limitations and Biases

Known Limitations

⚠️ Data Limitations:

  • Single year snapshot (2017) - may not reflect current practice patterns
  • Geography-specific: Trained only on New York State hospitals
  • Missing features: No data on comorbidities, lab values, or vital signs
  • Administrative data: Based on billing records, not clinical EMR
  • Censoring: LOS capped at 120 days (affects ~0.5% of cases)

⚠️ Model Limitations:

  • Point estimates: Predictions are averages; individual variance is high
  • New categories: Performance degrades for rare diagnosis/hospital combinations
  • Temporal drift: Healthcare practices change; model requires periodic retraining
  • External validity: Not validated outside New York State

Potential Biases

🔴 Demographic Biases:

  • Race/ethnicity: Model may perpetuate historical disparities in healthcare access
    • Example: Underserved communities may have systematically different LOS due to social determinants
  • Insurance type: Self-pay patients may have different discharge patterns
  • Age: Older adults (70+) may have higher prediction variance

🔴 Geographic Biases:

  • Rural vs. urban: Smaller rural hospitals may be underrepresented
  • Hospital resources: Predictions reflect hospital capacity, not just patient needs
  • County-level effects: High-crime or low-income areas may show systemic differences

🔴 Clinical Biases:

  • Diagnosis coding: APR-DRG groupings may oversimplify complex conditions
  • Severity scoring: APR severity is administrative, not clinical ground truth
  • Disposition planning: Social factors (housing, family support) affect LOS but aren't captured

Bias Mitigation Strategies

✅ Implemented:

  • De-identified data reduces individual privacy risks
  • Included race/ethnicity as features (with caution) to allow disparity analysis
  • Confidence intervals communicate prediction uncertainty

⚠️ Recommended for Production:

  • Regular audits for fairness across demographic groups
  • Clinician oversight - never use predictions in isolation
  • Transparent communication with patients about prediction limitations
  • Retraining cadence (annually or when performance degrades)
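
A minimal version of the recommended fairness audit compares per-group MAE against the overall MAE. A toy frame is used below; real audits should run on the full held-out set and feed into clinical review:

```python
import pandas as pd

# Toy audit frame; in production, use real predictions on held-out data.
audit = pd.DataFrame({
    "Race": ["White", "White",
             "Black/African American", "Black/African American"],
    "actual_los": [3, 5, 4, 6],
    "predicted_los": [4, 5, 2, 9],
})

# Per-group mean absolute error vs. the overall MAE
audit["abs_err"] = (audit["actual_los"] - audit["predicted_los"]).abs()
mae_by_group = audit.groupby("Race")["abs_err"].mean()
overall_mae = audit["abs_err"].mean()
print(mae_by_group)

# Flag groups whose MAE deviates by more than 20% from the overall MAE
flagged = mae_by_group[(mae_by_group - overall_mae).abs() / overall_mae > 0.2]
print("Flagged for clinical review:", list(flagged.index))
```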

Ethical Considerations

Responsible Use Guidelines

  1. Clinical Context Required

    • Predictions are decision support tools, NOT diagnoses
    • Always review with qualified healthcare professionals
    • Consider patient-specific factors not in the model
  2. Transparency with Patients

    • Explain predictions are estimates, not guarantees
    • Discuss confidence intervals and uncertainty
    • Empower patients to ask questions
  3. Avoid Discriminatory Use

    • Do NOT use predictions to deny care or insurance
    • Monitor for disparate impact across racial/ethnic groups
    • Provide same quality of care regardless of predicted LOS
  4. Data Privacy

    • Model trained on de-identified data
    • Do NOT re-identify patients from predictions
    • Comply with HIPAA and local privacy regulations
  5. Model Governance

    • Document all predictions for audit trails
    • Establish human oversight processes
    • Monitor real-world outcomes vs. predictions

Fairness Analysis

Demographic Parity (should be analyzed):

  • Prediction distributions should be similar across race/ethnicity groups for similar clinical profiles
  • Differences may reflect genuine clinical needs OR systemic biases

Example Analysis:

# Check prediction distributions by race
# (df: a DataFrame of admissions with a 'Race' and a 'predicted_los' column)
results_by_race = df.groupby('Race')['predicted_los'].describe()
print(results_by_race)

# Flag groups whose mean prediction differs by >20% from the overall mean
# (may indicate bias OR genuine clinical differences - requires clinical review)
overall_mean = df['predicted_los'].mean()
group_means = df.groupby('Race')['predicted_los'].mean()
flagged = group_means[(group_means - overall_mean).abs() / overall_mean > 0.20]
print("Groups flagged for review:", list(flagged.index))

Model Card Authors

  • Primary Author: [Ajiboye Toluwalase]
  • Contributors: [List contributors]
  • Contact: ajiboyetolu1@gmail.com
  • Organization: [Metro's Tech]

Citation

If you use this model in your research or application, please cite:

@misc{hospital_los_xgboost_2026,
  author = {Ajiboye Toluwalase},
  title = {Hospital Length of Stay Predictor - XGBoost Pipeline},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Ajiboye/hospital_predict_model}},
  note = {Trained on SPARCS NY 2017 dataset}
}

Data Source Citation:

New York State Department of Health. (2017). Hospital Inpatient Discharges 
(SPARCS De-Identified): 2017. https://health.data.ny.gov/

Model Files

This repository contains:

hospital-los-xgboost/
├── xgb_hospital_full_pipeline.pkl       # Complete pipeline (recommended)
├── xgb_modelv1.pkl                      # XGBoost model only
├── hospital_data_cleanerv1.pkl          # Preprocessor only
├── feature_names.pkl                    # Expected 312 feature names
├── README.md                            # This model card
└── requirements.txt                     # Python dependencies
Total size: ~15 MB (compressed)


Changelog

Version 1.0.0 (February 2026)

  • Initial release
  • Trained on SPARCS 2017 dataset (2.3M records)
  • 13 input features → 312 encoded features
  • XGBoost regressor with target-encoded features
  • Confidence interval estimation
  • Risk factor analysis

Planned Updates

  • Retrain on 2022-2024 data
  • Add SHAP explanations
  • Incorporate CMS quality metrics
  • Multi-output prediction (LOS + readmission risk)
  • Fairness-aware training

Acknowledgments

  • New York State Department of Health for SPARCS data access
  • Kaggle community for data hosting and discussions
  • XGBoost development team for the excellent ML framework
  • Hugging Face for model hosting infrastructure

License

This model is released under the MIT License.

MIT License

Copyright (c) 2025 [Ajiboye Toluwalase]

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


βš•οΈ Remember: This model is a tool to support healthcare professionals, not replace them. Always involve clinical expertise in patient care decisions.


Last updated: February 2026
