license: mit
tags:
  - cbc-reference-model
  - mlops-100-day
  - healthcare
  - tabular-classification

CBC Reference Model: Healthcare 30-Day Readmission

Pre-trained reference model for the CBC MLOps 100-Day Track. MLOps students pull this artifact on Day 1 as a uniform starting point and learn to operate it: containerize, deploy, monitor, and retrain. It is the published twin of ML Development Capstone 1, so a student who completed ML Dev can substitute their own model.

Model details

Type: sigmoid-calibrated RandomForest inside a 3-branch scikit-learn Pipeline. Numeric features get median imputation and scaling; low-cardinality categoricals get constant imputation and one-hot encoding; the high-cardinality ICD-9 diagnosis codes get constant imputation and a TargetEncoder. RandomForest was selected over LogisticRegression by stratified-CV PR-AUC, then calibrated on a disjoint validation split.
Framework: scikit-learn 1.8.0 · Serialization: joblib · Seed: 42
Returns a calibrated probability. The operating threshold is a SERVING-time decision (the MLOps track teaches this); it is NOT baked into the model.
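Because the artifact returns a probability rather than a label, the flag/no-flag decision belongs to the serving layer. The following is a minimal sketch of that separation, not code from this repository: the environment variable `READMIT_THRESHOLD` and the helpers `load_threshold` and `decide` are hypothetical names, and the default simply reuses the `recommended_threshold` value of 0.1116 reported in metrics.json.

```python
import os

# Hypothetical default: the recommended_threshold value reported in metrics.json
DEFAULT_THRESHOLD = 0.1116


def load_threshold(env=os.environ):
    """Read the operating threshold from the environment, falling back to the default."""
    # READMIT_THRESHOLD is an assumed variable name, not something this repo defines
    value = float(env.get("READMIT_THRESHOLD", DEFAULT_THRESHOLD))
    if not 0.0 < value < 1.0:
        raise ValueError(f"threshold must be in (0, 1), got {value}")
    return value


def decide(proba, threshold):
    """Turn a calibrated probability into a flag; the model itself never does this."""
    return {"probability": proba, "threshold": threshold, "flagged": proba >= threshold}


# The same probability flips between flagged and not flagged as the threshold moves
print(decide(0.14, load_threshold({})))
print(decide(0.14, load_threshold({"READMIT_THRESHOLD": "0.15"})))
```

Keeping the threshold in configuration means it can be retuned as costs or capacity change without retraining or republishing the model.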

Intended use

Decision support to flag diabetic inpatients at elevated 30-day readmission risk. Teaching/reference artifact — NOT for real clinical decisions.

Training data

UCI Diabetes 130-US Hospitals 1999-2008 (id 296), CC BY 4.0. Binary target: readmitted within 30 days (prevalence 0.1116). Fully de-identified.

Metrics (sealed test split, n=20354, evaluated once)

| Metric | Value | Note |
| --- | --- | --- |
| ROC-AUC | 0.6562 | no-skill 0.5 |
| PR-AUC | 0.2098 | no-skill = prevalence 0.1116 (~1.9x floor) |
| log loss | 0.3339 | |
| Brier | 0.0954 | |

30-day readmission is a weak-signal problem: a ROC-AUC in the mid-0.60s reflects a real but modest lift over chance, so this should not be treated as a high-accuracy classifier.
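The table gives no-skill floors for ROC-AUC and PR-AUC but not for log loss or Brier. One way to put those two numbers in context is to score a constant predictor that outputs the prevalence for every row. The sketch below does that arithmetic; it assumes the test split has the same 0.1116 prevalence as the full dataset, and the helper name `no_skill_baselines` is illustrative.

```python
import math

PREVALENCE = 0.1116  # positive rate reported in this card (assumed to hold on the test split)

# Reported test-split metrics, copied from the table above
MODEL = {"pr_auc": 0.2098, "log_loss": 0.3339, "brier": 0.0954}


def no_skill_baselines(p):
    """Scores of a constant predictor that outputs the prevalence p for every row."""
    return {
        "pr_auc": p,  # a random ranking has precision equal to the prevalence
        "log_loss": -(p * math.log(p) + (1 - p) * math.log(1 - p)),
        "brier": p * (1 - p),
    }


baseline = no_skill_baselines(PREVALENCE)
for name, value in MODEL.items():
    print(f"{name}: model={value:.4f}  no-skill={baseline[name]:.4f}")
```

Under that assumption the constant predictor scores about 0.350 log loss and 0.099 Brier, so the reported 0.3339 and 0.0954 are improvements of a few percent, consistent with the modest ROC-AUC.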

Operating points (choose the threshold from cost — do NOT use 0.5)

| Threshold | Flagged | Recall | Precision |
| --- | --- | --- | --- |
| 0.50 | 92 | 0.022 | 0.533 |
| 0.15 | 2935 | 0.289 | 0.224 |
| 0.1116 (prevalence) | 6505 | 0.511 | 0.178 |

At 0.5 the model flags almost nobody: 92 of 20354 test rows, because the calibrated probabilities concentrate near the prevalence. metrics.json carries recommended_threshold = 0.1116.
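To make "choose the threshold from cost" concrete, the sketch below reconstructs approximate confusion counts from the rounded values in the operating-points table and compares total cost across the three tabulated thresholds. The cost figures are hypothetical placeholders, not values from this project, and the function names are illustrative.

```python
N_TEST = 20354
PREVALENCE = 0.1116
POSITIVES = round(N_TEST * PREVALENCE)  # about 2272 readmissions in the test split

# (threshold, flagged, precision) copied from the operating-points table
OPERATING_POINTS = [(0.50, 92, 0.533), (0.15, 2935, 0.224), (0.1116, 6505, 0.178)]


def expected_cost(flagged, precision, cost_fp, cost_fn):
    """Total cost of one operating point, reconstructed from rounded table values."""
    tp = round(flagged * precision)  # flagged patients who were readmitted
    fp = flagged - tp                # flagged patients who were not
    fn = POSITIVES - tp              # readmissions the model missed
    return fp * cost_fp + fn * cost_fn


def cheapest_threshold(cost_fp, cost_fn):
    """Return the tabulated threshold with the lowest total cost."""
    best = min(OPERATING_POINTS, key=lambda row: expected_cost(row[1], row[2], cost_fp, cost_fn))
    return best[0]


# Hypothetical costs: a missed readmission is 10x as costly as a needless follow-up
print(cheapest_threshold(cost_fp=1, cost_fn=10))
# With equal costs the ranking flips toward the strictest threshold
print(cheapest_threshold(cost_fp=1, cost_fn=1))
```

For a well-calibrated model the cost-minimizing cutoff is approximately cost_fp / (cost_fp + cost_fn), which is about 0.091 at a 10:1 ratio. That is why the lowest tabulated threshold wins in the first case, and why 0.5 is only appropriate when both error types cost the same.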

How to load and predict

import json
import joblib
import pandas as pd
from huggingface_hub import hf_hub_download

REPO_ID = "careerbytecode/mlops-ref-healthcare-readmission"
# Download (or reuse the cached copy of) the pipeline and the bundled sample record
model = joblib.load(hf_hub_download(REPO_ID, "model/pipeline.joblib"))
with open(hf_hub_download(REPO_ID, "sample_input.json")) as f:
    sample = json.load(f)

# predict_proba returns one row per record; column 1 is the readmitted class
proba = float(model.predict_proba(pd.DataFrame([sample]))[0, 1])
print(proba)  # calibrated readmission probability

Input schema: 19 columns — numeric ['time_in_hospital', 'num_lab_procedures', 'num_procedures', 'num_medications', 'number_diagnoses', 'number_inpatient', 'number_outpatient', 'number_emergency']; categorical (strings) ['race', 'gender', 'age', 'A1Cresult', 'max_glu_serum', 'insulin', 'change', 'diabetesMed', 'diag_1', 'diag_2', 'diag_3'].
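The schema above can be checked before a record reaches `predict_proba`. The sketch below is one possible pre-flight check, with `validate_record` as a hypothetical helper name. It is deliberately strict and rejects missing values, although the pipeline's imputers may well accept them; that behavior is not verified here.

```python
NUMERIC = ["time_in_hospital", "num_lab_procedures", "num_procedures", "num_medications",
           "number_diagnoses", "number_inpatient", "number_outpatient", "number_emergency"]
CATEGORICAL = ["race", "gender", "age", "A1Cresult", "max_glu_serum", "insulin",
               "change", "diabetesMed", "diag_1", "diag_2", "diag_3"]


def validate_record(record):
    """Return a list of schema problems; an empty list means the record looks usable."""
    problems = []
    expected = set(NUMERIC) | set(CATEGORICAL)
    for name in sorted(expected - set(record)):
        problems.append(f"missing column: {name}")
    for name in sorted(set(record) - expected):
        problems.append(f"unexpected column: {name}")
    for name in NUMERIC:
        if name in record:
            value = record[name]
            # bool is a subclass of int in Python, so reject it explicitly
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"{name} should be numeric, got {type(value).__name__}")
    for name in CATEGORICAL:
        if name in record and not isinstance(record[name], str):
            problems.append(f"{name} should be a string, got {type(record[name]).__name__}")
    return problems


# Placeholder record (illustrative values only, not real patient data)
example = {**{name: 0 for name in NUMERIC}, **{name: "?" for name in CATEGORICAL}}
print(validate_record(example))  # no problems expected for a well-formed record

example["diag_1"] = 250.83  # ICD-9 codes must stay strings (some start with V or E)
print(validate_record(example))
```

Passing diagnosis codes as strings matters because a numeric cast would change or break codes such as those with a V or E prefix before the encoder ever sees them.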

Limitations

1999-2008 US data; transfer to other populations/eras is unvalidated.
No fairness audit is included here (that is covered in ML Dev Phase 4 Day 88), even though protected attributes (race, gender, age) are among the model inputs.
Reference/teaching artifact only.