| ---
|
| license: mit
|
| tags:
|
| - cbc-reference-model
|
| - mlops-100-day
|
| - healthcare
|
| - tabular-classification
|
| ---
|
|
|
| # CBC Reference Model: Healthcare 30-Day Readmission
|
|
|
| > Pre-trained reference model for the **CBC [MLOps 100-Day Track](https://github.com/careerbytecode/cbc-learning-hub/tree/main/100-days/mlops)**. MLOps students pull this on Day 1 as their uniform starting artifact and learn to operate it (containerize, deploy, monitor, retrain). It is the published twin of ML Development Capstone 1 — a student who completed ML Dev can substitute their own model.
|
|
|
| ## Model details
|
| - **Type:** sigmoid-calibrated `RandomForestClassifier` in a 3-branch sklearn `Pipeline`: median-impute + scale for numerics; constant-impute + one-hot encoding for low-cardinality categoricals; constant-impute + `TargetEncoder` for the high-cardinality ICD-9 diagnosis codes. Selected over `LogisticRegression` by stratified-CV PR-AUC, then calibrated on a disjoint validation split.
|
| - **Framework:** scikit-learn 1.8.0 · **Serialization:** joblib · **Seed:** 42
|
| - **Returns a calibrated probability.** The operating threshold is a SERVING-time decision (the MLOps track teaches this); it is NOT baked into the model.
|
|
|
| ## Intended use
|
| Decision support to flag diabetic inpatients at elevated 30-day readmission risk. **Teaching/reference artifact — NOT for real clinical decisions.**
|
|
|
| ## Training data
|
| UCI Diabetes 130-US Hospitals 1999-2008 (id 296), CC BY 4.0. Binary target: readmitted within 30 days (prevalence 0.1116). Fully de-identified.
|
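As a minimal sketch of how the binary target can be derived, assuming the raw UCI `readmitted` column uses the labels `<30`, `>30`, and `NO`. The label strings and the helper names `to_binary_target` and `prevalence` are assumptions for illustration, not taken from this model's training code.

```python
# Illustrative only: the label strings are assumed to match the raw UCI
# `readmitted` column; both helper names are hypothetical.
POSITIVE_LABEL = "<30"  # readmitted within 30 days


def to_binary_target(readmitted_values):
    """Map raw readmission labels to 1 (within 30 days) or 0 (otherwise)."""
    return [1 if value == POSITIVE_LABEL else 0 for value in readmitted_values]


def prevalence(targets):
    """Fraction of positive rows; this is also the no-skill PR-AUC floor."""
    return sum(targets) / len(targets)


# Toy rows, not real data.
raw = ["NO", ">30", "<30", "NO", ">30", "NO", "NO", "<30", ">30", "NO"]
y = to_binary_target(raw)
print(y, prevalence(y))
```

Under this definition, readmissions after 30 days (`>30`) count as negatives alongside patients who were never readmitted.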
|
|
| ## Metrics (sealed test split, n=20354, evaluated once)
|
| | Metric | Value | Note |
|
| |---|---|---|
|
| | ROC-AUC | 0.6562 | no-skill 0.5 |
|
| | PR-AUC | 0.2098 | no-skill = prevalence 0.1116 (~1.9x floor) |
|
| | log loss | 0.3339 | |
|
| | Brier | 0.0954 | |
|
|
|
| 30-day readmission is a weak-signal problem. A ROC-AUC in the mid-0.60s reflects a real but modest lift over chance; this is not a high-accuracy classifier.
|
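To put the log loss and Brier score in context, the sketch below computes what a no-skill model that predicts the prevalence for every row would score. It assumes the test-split prevalence equals the stated 0.1116; the function names are illustrative.

```python
import math

PREVALENCE = 0.1116      # from the model card; assumed to hold on the test split
MODEL_LOG_LOSS = 0.3339
MODEL_BRIER = 0.0954
MODEL_PR_AUC = 0.2098


def constant_log_loss(p):
    """Log loss of always predicting probability p when the base rate is p."""
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


def constant_brier(p):
    """Brier score of always predicting probability p when the base rate is p."""
    return p * (1 - p)


baseline_log_loss = constant_log_loss(PREVALENCE)
baseline_brier = constant_brier(PREVALENCE)
brier_skill = 1 - MODEL_BRIER / baseline_brier   # > 0 means better than no-skill
pr_auc_lift = MODEL_PR_AUC / PREVALENCE          # multiple of the no-skill floor

print(f"log loss: model {MODEL_LOG_LOSS:.4f} vs no-skill {baseline_log_loss:.4f}")
print(f"Brier:    model {MODEL_BRIER:.4f} vs no-skill {baseline_brier:.4f}")
print(f"Brier skill score {brier_skill:.3f}, PR-AUC lift {pr_auc_lift:.2f}x")
```

Under that assumption the constant baseline scores about 0.350 log loss and about 0.099 Brier, so the model improves on both only slightly, consistent with the modest lift described above.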
|
|
| ## Operating points (choose the threshold from cost — do NOT use 0.5)
|
| | Threshold | Flagged | Recall | Precision |
|
| |---|---|---|---|
|
| | 0.50 | 92 | 0.022 | 0.533 |
|
| | 0.15 | 2935 | 0.289 | 0.224 |
|
| | 0.1116 (prevalence) | 6505 | 0.511 | 0.178 |
|
|
|
| At 0.5 the model flags almost nobody (92 of 20,354 test rows): most calibrated probabilities sit close to the prevalence and very few reach 0.5. `metrics.json` carries `recommended_threshold = 0.1116`.
|
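One way to choose the threshold from cost, sketched under stated assumptions: for a calibrated probability, and with zero cost for correct decisions, flagging is worthwhile when the probability exceeds `cost_fp / (cost_fp + cost_fn)`. The cost values and function names below are hypothetical. The operating points are the three rows of the table above, with the number of positives approximated as prevalence × n.

```python
N_TEST = 20354
PREVALENCE = 0.1116
# (threshold, flagged, precision) copied from the operating-points table.
OPERATING_POINTS = [(0.50, 92, 0.533), (0.15, 2935, 0.224), (0.1116, 6505, 0.178)]


def cost_optimal_threshold(cost_fp, cost_fn):
    """Break-even probability for flagging, assuming calibrated outputs."""
    return cost_fp / (cost_fp + cost_fn)


def expected_cost(flagged, precision, cost_fp, cost_fn):
    """Approximate total cost on the test split for one operating point."""
    positives = PREVALENCE * N_TEST   # approximation from the stated prevalence
    true_pos = precision * flagged
    false_pos = flagged - true_pos
    false_neg = positives - true_pos
    return cost_fp * false_pos + cost_fn * false_neg


def cheapest_operating_point(cost_fp, cost_fn):
    """Return the tabulated threshold with the lowest approximate cost."""
    best = min(
        OPERATING_POINTS,
        key=lambda row: expected_cost(row[1], row[2], cost_fp, cost_fn),
    )
    return best[0]


# Hypothetical costs: a missed readmission costs 8x (then 4x) a needless flag.
print(cost_optimal_threshold(1, 8), cheapest_operating_point(1, 8))
print(cost_optimal_threshold(1, 4), cheapest_operating_point(1, 4))
```

With a hypothetical 8:1 cost ratio the break-even probability is 1/9 (about 0.111), close to the prevalence threshold, and the 0.1116 row is the cheapest of the three. At 4:1 the break-even rises to 0.2 and the 0.15 row becomes cheapest, which shows how sensitive the choice is to the cost assumption.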
|
|
| ## How to load and predict
|
| ```python
|
| import json
|
|
|
| import joblib
|
| import pandas as pd
|
| from huggingface_hub import hf_hub_download
|
|
|
| REPO_ID = "careerbytecode/mlops-ref-healthcare-readmission"
|
|
|
| # Download the serialized pipeline (cached after the first call) and load it.
|
| model = joblib.load(hf_hub_download(REPO_ID, "model/pipeline.joblib"))
|
| with open(hf_hub_download(REPO_ID, "sample_input.json")) as f:
|
|     sample = json.load(f)  # a single example record
|
|
|
| # Column 1 of predict_proba is the positive class (readmitted within 30 days).
|
| proba = float(model.predict_proba(pd.DataFrame([sample]))[0, 1])
|
| print(proba)  # calibrated readmission probability
|
| ```
|
|
|
| Input schema (19 columns):
|
| - **Numeric (8):** `time_in_hospital`, `num_lab_procedures`, `num_procedures`, `num_medications`, `number_diagnoses`, `number_inpatient`, `number_outpatient`, `number_emergency`
|
| - **Categorical, passed as strings (11):** `race`, `gender`, `age`, `A1Cresult`, `max_glu_serum`, `insulin`, `change`, `diabetesMed`, `diag_1`, `diag_2`, `diag_3`
|
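A request can be checked against this schema before it reaches the model. The following is a minimal sketch, not part of the published artifact: `validate_record` is a hypothetical helper, the example values are illustrative rather than taken from `sample_input.json`, and the check is deliberately strict (it rejects `None`).

```python
NUMERIC_COLUMNS = [
    "time_in_hospital", "num_lab_procedures", "num_procedures", "num_medications",
    "number_diagnoses", "number_inpatient", "number_outpatient", "number_emergency",
]
CATEGORICAL_COLUMNS = [
    "race", "gender", "age", "A1Cresult", "max_glu_serum", "insulin",
    "change", "diabetesMed", "diag_1", "diag_2", "diag_3",
]


def validate_record(record):
    """Return a list of schema problems; an empty list means the record passes."""
    problems = []
    expected = set(NUMERIC_COLUMNS) | set(CATEGORICAL_COLUMNS)
    for name in sorted(expected - set(record)):
        problems.append(f"missing column: {name}")
    for name in sorted(set(record) - expected):
        problems.append(f"unexpected column: {name}")
    for name in NUMERIC_COLUMNS:
        if name not in record:
            continue
        value = record[name]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{name} should be numeric, got {type(value).__name__}")
    for name in CATEGORICAL_COLUMNS:
        if name in record and not isinstance(record[name], str):
            problems.append(f"{name} should be a string, got {type(record[name]).__name__}")
    return problems


# Illustrative record; every value here is made up for the example.
record = {
    "time_in_hospital": 3, "num_lab_procedures": 41, "num_procedures": 0,
    "num_medications": 12, "number_diagnoses": 9, "number_inpatient": 0,
    "number_outpatient": 0, "number_emergency": 0,
    "race": "Caucasian", "gender": "Female", "age": "[70-80)",
    "A1Cresult": "None", "max_glu_serum": "None", "insulin": "Steady",
    "change": "No", "diabetesMed": "Yes",
    "diag_1": "428", "diag_2": "250.01", "diag_3": "401",
}
print(validate_record(record))
```

Note that the diagnosis codes are passed as strings, as the schema requires, even when they look numeric.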
|
|
| ## Limitations
|
| - 1999-2008 US data; transfer to other populations/eras is unvalidated.
|
| - No fairness audit here (covered in ML Dev Phase 4 Day 88), even though protected attributes (`race`, `gender`, `age`) are input features.
|
| - Reference/teaching artifact only.
|
|
|
| ---
|
| © 2015-2026 CareerByteCode. All rights reserved. | CC BY-NC-SA 4.0 (docs), MIT (code) | Authored by Raghavendra R, Platform Owner CareerByteCode, Solution Architect
|