---
license: mit
tags:
  - cbc-reference-model
  - mlops-100-day
  - healthcare
  - tabular-classification
---

# CBC Reference Model: Healthcare 30-Day Readmission

> Pre-trained reference model for the **CBC [MLOps 100-Day Track](https://github.com/careerbytecode/cbc-learning-hub/tree/main/100-days/mlops)**. MLOps students pull this on Day 1 as their uniform starting artifact and learn to operate it (containerize, deploy, monitor, retrain). It is the published twin of ML Development Capstone 1 — a student who completed ML Dev can substitute their own model.

## Model details
- **Type:** sigmoid-calibrated RandomForest wrapped in a 3-branch sklearn `Pipeline`: median imputation + scaling for numerics; constant imputation + one-hot encoding for low-cardinality categoricals; constant imputation + `TargetEncoder` for the high-cardinality ICD-9 diagnosis codes. Selected over LogisticRegression by stratified-CV PR-AUC and calibrated on a disjoint validation split.
- **Framework:** scikit-learn 1.8.0 · **Serialization:** joblib · **Seed:** 42
- **Returns a calibrated probability.** The operating threshold is a SERVING-time decision (the MLOps track teaches this); it is NOT baked into the model.

## Intended use
Decision support to flag diabetic inpatients at elevated 30-day readmission risk. **Teaching/reference artifact — NOT for real clinical decisions.**

## Training data
UCI Diabetes 130-US Hospitals 1999-2008 (id 296), CC BY 4.0. Binary target: readmitted within 30 days (prevalence 0.1116). Fully de-identified.

## Metrics (sealed test split, n=20354, evaluated once)
| Metric | Value | Note |
|---|---|---|
| ROC-AUC | 0.6562 | no-skill 0.5 |
| PR-AUC | 0.2098 | no-skill = prevalence 0.1116 (~1.9x floor) |
| log loss | 0.3339 | |
| Brier | 0.0954 | |

30-day readmission is a weak-signal problem: a ROC-AUC in the mid-0.60s reflects a real but modest lift over chance, and this model should not be read as a high-accuracy classifier.
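To put the log loss and Brier rows on the same footing as the two AUC rows, the sketch below computes what a constant predictor that always outputs the prevalence would score. It assumes the sealed test split has the same prevalence as the stated 0.1116; `no_skill_baselines` is a hypothetical helper name, not part of the published artifact.

```python
import math


def no_skill_baselines(prevalence: float) -> dict:
    """Scores of a constant predictor that always outputs `prevalence`."""
    p = prevalence
    return {
        "roc_auc": 0.5,                # ranking is no better than chance
        "pr_auc": p,                   # precision equals prevalence at every recall
        "log_loss": -(p * math.log(p) + (1 - p) * math.log(1 - p)),
        "brier": p * (1 - p),          # p*(1-p)^2 + (1-p)*p^2 simplifies to p*(1-p)
    }


# Values copied from the metrics table above.
reported = {"roc_auc": 0.6562, "pr_auc": 0.2098, "log_loss": 0.3339, "brier": 0.0954}

# Assumption: test-split prevalence equals the stated 0.1116.
baseline = no_skill_baselines(0.1116)

for name, value in reported.items():
    print(f"{name:9s} model={value:.4f}  no-skill={baseline[name]:.4f}")
```

Under that assumption the no-skill log loss is roughly 0.350 and the no-skill Brier score roughly 0.099, so the reported 0.3339 and 0.0954 are small improvements, consistent with the modest ROC-AUC.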

## Operating points (choose the threshold from cost — do NOT use 0.5)
| Threshold | Flagged | Recall | Precision |
|---|---|---|---|
| 0.50 | 92 | 0.022 | 0.533 |
| 0.15 | 2935 | 0.289 | 0.224 |
| 0.1116 (prevalence) | 6505 | 0.511 | 0.178 |

At 0.5 the model flags almost nobody (92 of 20,354 test cases) because most calibrated probabilities sit close to the prevalence; `metrics.json` carries `recommended_threshold = 0.1116`.
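One common way to "choose the threshold from cost" with a calibrated probability is to flag a patient whenever the expected cost of a miss exceeds the expected cost of a false alarm. The sketch below shows that rule plus a small helper that recomputes the Flagged / Recall / Precision columns for any threshold. The cost values, the toy labels and scores, the function names, and the `>=` comparison are all illustrative assumptions; the card does not state how its table was computed.

```python
def cost_threshold(cost_fp: float, cost_fn: float) -> float:
    """Threshold t such that flagging is cheaper in expectation when p >= t.

    Flag when p * cost_fn >= (1 - p) * cost_fp, i.e. p >= cost_fp / (cost_fp + cost_fn).
    Assumes calibrated probabilities and zero cost for correct decisions.
    """
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


def operating_point(y_true, proba, threshold: float) -> dict:
    """Count flagged cases and compute recall and precision at one threshold."""
    flagged = [p >= threshold for p in proba]
    n_flagged = sum(flagged)
    n_pos = sum(1 for y in y_true if y == 1)
    tp = sum(1 for y, f in zip(y_true, flagged) if f and y == 1)
    return {
        "threshold": threshold,
        "flagged": n_flagged,
        "recall": tp / n_pos if n_pos else 0.0,
        "precision": tp / n_flagged if n_flagged else 0.0,
    }


# Hypothetical costs: a missed readmission is taken to be 8x as costly as a false alarm.
threshold = cost_threshold(cost_fp=1.0, cost_fn=8.0)

# Toy labels and scores for illustration only, not model output.
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
proba = [0.05, 0.09, 0.12, 0.13, 0.30, 0.08, 0.20, 0.10]

print(operating_point(y_true, proba, threshold))
```

With these hypothetical costs the threshold comes out at 1/9, close to the prevalence-based operating point in the table; a different cost ratio moves it accordingly.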

## How to load and predict
```python
import json

import joblib
import pandas as pd
from huggingface_hub import hf_hub_download

REPO_ID = "careerbytecode/mlops-ref-healthcare-readmission"

model = joblib.load(hf_hub_download(REPO_ID, "model/pipeline.joblib"))
with open(hf_hub_download(REPO_ID, "sample_input.json")) as f:  # bundled example record
    sample = json.load(f)
# One-row DataFrame in; column 1 of predict_proba is the positive (readmitted) class.
proba = float(model.predict_proba(pd.DataFrame([sample]))[0, 1])
print(proba)  # calibrated readmission probability
```

Input schema (19 columns):
- **Numeric (8):** `time_in_hospital`, `num_lab_procedures`, `num_procedures`, `num_medications`, `number_diagnoses`, `number_inpatient`, `number_outpatient`, `number_emergency`
- **Categorical, as strings (11):** `race`, `gender`, `age`, `A1Cresult`, `max_glu_serum`, `insulin`, `change`, `diabetesMed`, `diag_1`, `diag_2`, `diag_3`
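Before a record reaches `predict_proba`, a serving layer may want to check it against the 19-column schema above. The following is a minimal, standard-library-only sketch; `schema_problems` is a hypothetical helper, and the example values are illustrative strings in the style of the UCI dataset, not the contents of `sample_input.json`. The card does not say how missing values should be encoded, so this sketch simply reports a `None` as a type mismatch.

```python
NUMERIC = [
    "time_in_hospital", "num_lab_procedures", "num_procedures", "num_medications",
    "number_diagnoses", "number_inpatient", "number_outpatient", "number_emergency",
]
CATEGORICAL = [
    "race", "gender", "age", "A1Cresult", "max_glu_serum", "insulin",
    "change", "diabetesMed", "diag_1", "diag_2", "diag_3",
]


def schema_problems(record: dict) -> list:
    """Return human-readable schema violations; an empty list means the record passes."""
    problems = []
    for col in NUMERIC + CATEGORICAL:
        if col not in record:
            problems.append(f"missing column: {col}")
    for col in NUMERIC:
        value = record.get(col)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if col in record and (isinstance(value, bool) or not isinstance(value, (int, float))):
            problems.append(f"{col} should be numeric, got {type(value).__name__}")
    for col in CATEGORICAL:
        if col in record and not isinstance(record[col], str):
            problems.append(f"{col} should be a string, got {type(record[col]).__name__}")
    return problems


# Illustrative record (hypothetical values).
example = {
    "time_in_hospital": 4, "num_lab_procedures": 45, "num_procedures": 1,
    "num_medications": 16, "number_diagnoses": 9, "number_inpatient": 1,
    "number_outpatient": 0, "number_emergency": 0,
    "race": "Caucasian", "gender": "Female", "age": "[70-80)",
    "A1Cresult": "None", "max_glu_serum": "None", "insulin": "Steady",
    "change": "No", "diabetesMed": "Yes",
    "diag_1": "428", "diag_2": "250", "diag_3": "401",
}

# A deliberately broken copy: one column dropped, one diagnosis code sent as an int.
bad = {k: v for k, v in example.items() if k != "insulin"}
bad["diag_1"] = 428

print(schema_problems(example))
print(schema_problems(bad))
```

The broken copy illustrates a likely failure mode: ICD-9 codes look numeric, but the schema lists `diag_1` to `diag_3` as strings, so a client that sends them as integers would not match what the encoder was fitted on.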

## Limitations
- 1999-2008 US data; transfer to other populations/eras is unvalidated.
- No fairness audit here (covered in ML Dev Phase 4 Day 88); protected attributes (`race`, `gender`, `age`) are present among the inputs.
- Reference/teaching artifact only.

---
© 2015-2026 CareerByteCode. All rights reserved. | CC BY-NC-SA 4.0 (docs), MIT (code) | Authored by Raghavendra R, Platform Owner CareerByteCode, Solution Architect