---
license: mit
tags:
- cbc-reference-model
- mlops-100-day
- healthcare
- tabular-classification
---

# CBC Reference Model: Healthcare 30-Day Readmission

> Pre-trained reference model for the **CBC [MLOps 100-Day Track](https://github.com/careerbytecode/cbc-learning-hub/tree/main/100-days/mlops)**. MLOps students pull this on Day 1 as their uniform starting artifact and learn to operate it (containerize, deploy, monitor, retrain). It is the published twin of ML Development Capstone 1; a student who completed ML Dev can substitute their own model.

## Model details

- **Type:** sigmoid-calibrated RandomForest in a 3-branch sklearn `Pipeline` (median-impute + scale for numerics; constant-impute + OneHot for low-cardinality categoricals; constant-impute + `TargetEncoder` for high-cardinality ICD-9 diagnoses), selected over LogisticRegression by stratified-CV PR-AUC and calibrated on a disjoint validation split.
- **Framework:** scikit-learn 1.8.0 · **Serialization:** joblib · **Seed:** 42
- **Returns a calibrated probability.** The operating threshold is a SERVING-time decision (the MLOps track teaches this); it is NOT baked into the model.

## Intended use

Decision support to flag diabetic inpatients at elevated 30-day readmission risk. **Teaching/reference artifact: NOT for real clinical decisions.**

## Training data

UCI Diabetes 130-US Hospitals 1999-2008 (id 296), CC BY 4.0. Binary target: readmitted within 30 days (prevalence 0.1116). Fully de-identified.

## Metrics (sealed test split, n=20354, evaluated once)

| Metric | Value | Note |
|---|---|---|
| ROC-AUC | 0.6562 | no-skill 0.5 |
| PR-AUC | 0.2098 | no-skill = prevalence 0.1116 (~1.9x floor) |
| log loss | 0.3339 | |
| Brier | 0.0954 | |

30-day readmission is a weak-signal problem; a ROC-AUC in the mid-0.60s is a real but modest lift, not a high-accuracy classifier.

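The no-skill reference points behind these metrics can be checked with a little arithmetic. The sketch below is illustrative and not part of the published artifact: it assumes the prevalence on the test split equals the stated 0.1116, and the helper name `constant_baselines` is hypothetical. It scores a predictor that always outputs the base rate and sets those scores next to the reported model values.

```python
import math

# Figures copied from the metrics table in this card.
PREVALENCE = 0.1116  # assumption: the test-split prevalence matches this value
MODEL = {"pr_auc": 0.2098, "log_loss": 0.3339, "brier": 0.0954}


def constant_baselines(p: float) -> dict:
    """Scores of a predictor that always outputs the base rate p (hypothetical helper)."""
    return {
        "pr_auc": p,  # no-skill precision equals the prevalence
        "log_loss": -(p * math.log(p) + (1 - p) * math.log(1 - p)),
        "brier": p * (1 - p),
    }


baseline = constant_baselines(PREVALENCE)
for name, value in MODEL.items():
    print(f"{name}: model={value:.4f} baseline={baseline[name]:.4f}")

# Ratio of the model's PR-AUC to the no-skill floor.
pr_auc_lift = MODEL["pr_auc"] / baseline["pr_auc"]
print(f"PR-AUC lift over floor: {pr_auc_lift:.2f}x")
```

Under that assumption the constant predictor scores roughly 0.350 on log loss and 0.099 on Brier, so the model's 0.3339 and 0.0954 are improvements of the same modest size that the ROC-AUC suggests.
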
## Operating points (choose the threshold from cost; do NOT use 0.5)

| Threshold | Flagged | Recall | Precision |
|---|---|---|---|
| 0.50 | 92 | 0.022 | 0.533 |
| 0.15 | 2935 | 0.289 | 0.224 |
| 0.1116 (prevalence) | 6505 | 0.511 | 0.178 |

At 0.5 the model flags almost nobody (92 of 20354 test cases), because the calibrated probabilities cluster near the prevalence and rarely exceed 0.5. `metrics.json` carries `recommended_threshold = 0.1116`.

## How to load and predict

```python
import json

import joblib
import pandas as pd
from huggingface_hub import hf_hub_download

REPO_ID = "careerbytecode/mlops-ref-healthcare-readmission"

model = joblib.load(hf_hub_download(REPO_ID, "model/pipeline.joblib"))
with open(hf_hub_download(REPO_ID, "sample_input.json")) as f:
    sample = json.load(f)

# One-row DataFrame in, probability of the positive class (column 1) out.
proba = float(model.predict_proba(pd.DataFrame([sample]))[0, 1])
print(proba)  # calibrated readmission probability
```

Input schema: 19 columns.

- Numeric (8): `time_in_hospital`, `num_lab_procedures`, `num_procedures`, `num_medications`, `number_diagnoses`, `number_inpatient`, `number_outpatient`, `number_emergency`
- Categorical, as strings (11): `race`, `gender`, `age`, `A1Cresult`, `max_glu_serum`, `insulin`, `change`, `diabetesMed`, `diag_1`, `diag_2`, `diag_3`

## Limitations

- 1999-2008 US data; transfer to other populations/eras is unvalidated.
- No fairness audit here (covered in ML Dev Phase 4 Day 88); protected attributes are present.
- Reference/teaching artifact only.

---

© 2015-2026 CareerByteCode. All rights reserved. | CC BY-NC-SA 4.0 (docs), MIT (code) | Authored by Raghavendra R, Platform Owner CareerByteCode, Solution Architect