	---
	license: mit
	tags:
	- cbc-reference-model
	- mlops-100-day
	- healthcare
	- tabular-classification
	---

	# CBC Reference Model: Healthcare 30-Day Readmission

	> Pre-trained reference model for the CBC [MLOps 100-Day Track](https://github.com/careerbytecode/cbc-learning-hub/tree/main/100-days/mlops). MLOps students pull this on Day 1 as their uniform starting artifact and learn to operate it (containerize, deploy, monitor, retrain). It is the published twin of ML Development Capstone 1, so a student who completed ML Dev can substitute their own model.

	## Model details
	- Type: sigmoid-calibrated RandomForest inside a scikit-learn `Pipeline` with three preprocessing branches: median imputation + scaling for numerics; constant imputation + one-hot encoding for low-cardinality categoricals; constant imputation + `TargetEncoder` for the high-cardinality ICD-9 diagnosis codes. RandomForest was selected over LogisticRegression by stratified-CV PR-AUC, then calibrated on a disjoint validation split.
	- Framework: scikit-learn 1.8.0 · Serialization: joblib · Seed: 42
	- Returns a calibrated probability. The operating threshold is a SERVING-time decision (the MLOps track teaches this); it is NOT baked into the model.

	## Intended use
	Decision support to flag diabetic inpatients at elevated 30-day readmission risk. Teaching/reference artifact — NOT for real clinical decisions.

	## Training data
	UCI Diabetes 130-US Hospitals 1999-2008 (id 296), CC BY 4.0. Binary target: readmitted within 30 days (prevalence 0.1116). Fully de-identified.

	## Metrics (sealed test split, n=20354, evaluated once)
	| Metric | Value | Note |
	|---|---|---|
	| ROC-AUC | 0.6562 | no-skill 0.5 |
	| PR-AUC | 0.2098 | no-skill = prevalence 0.1116 (~1.9x floor) |
	| log loss | 0.3339 | |
	| Brier | 0.0954 | |

	30-day readmission is a weak-signal problem; a ROC-AUC in the mid-0.60s is a real but modest lift over chance. This is not a high-accuracy classifier.
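To put the log loss and Brier rows in context, their no-skill baselines can be derived from the prevalence alone, since a no-skill model predicts the prevalence for every patient. The sketch below assumes the sealed test split has the same 0.1116 prevalence quoted for the dataset; the variable names are illustrative and not part of the repository.

```python
import math

# Values copied from the metrics table above.
PREVALENCE = 0.1116
MODEL = {"pr_auc": 0.2098, "log_loss": 0.3339, "brier": 0.0954}

# A no-skill model outputs the prevalence p for every row.
# Assumption: the test-split prevalence equals the dataset prevalence.
p = PREVALENCE
baseline_log_loss = -(p * math.log(p) + (1 - p) * math.log(1 - p))
baseline_brier = p * (1 - p)

# Relative improvements over the no-skill baselines.
pr_auc_lift = MODEL["pr_auc"] / PREVALENCE
brier_skill = 1 - MODEL["brier"] / baseline_brier

print(f"no-skill log loss: {baseline_log_loss:.4f} (model {MODEL['log_loss']})")
print(f"no-skill Brier:    {baseline_brier:.4f} (model {MODEL['brier']})")
print(f"PR-AUC lift:       {pr_auc_lift:.2f}x")
print(f"Brier skill score: {brier_skill:.3f}")
```

Under that assumption the baselines come out at roughly 0.350 (log loss) and 0.099 (Brier), so the model beats the constant-prevalence predictor on both, but only by a small margin.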

	## Operating points (choose the threshold from cost — do NOT use 0.5)
	| Threshold | Flagged | Recall | Precision |
	|---|---|---|---|
	| 0.50 | 92 | 0.022 | 0.533 |
	| 0.15 | 2935 | 0.289 | 0.224 |
	| 0.1116 (prevalence) | 6505 | 0.511 | 0.178 |

	At 0.5 the model flags almost nobody (92 of 20354 test rows) because most calibrated probabilities sit near the prevalence; `metrics.json` carries `recommended_threshold = 0.1116`.
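As a rough illustration of choosing the threshold from cost, the sketch below converts the tabulated operating points into approximate error counts and picks the cheapest one. The per-error costs are hypothetical placeholders, the function names are not part of the repository, and the number of positives is estimated by assuming the test split shares the 0.1116 prevalence.

```python
# Operating points copied from the table above:
# threshold -> (flagged, recall, precision).
OPERATING_POINTS = {
    0.50: (92, 0.022, 0.533),
    0.15: (2935, 0.289, 0.224),
    0.1116: (6505, 0.511, 0.178),
}
N_TEST = 20354
PREVALENCE = 0.1116
# Assumption: the test split has the same prevalence as the full dataset.
N_POSITIVE = N_TEST * PREVALENCE


def expected_cost(threshold, cost_fn, cost_fp):
    """Approximate total cost on the test split for hypothetical per-error costs."""
    flagged, _recall, precision = OPERATING_POINTS[threshold]
    true_pos = flagged * precision       # flagged patients who were readmitted
    false_pos = flagged - true_pos       # flagged patients who were not
    false_neg = N_POSITIVE - true_pos    # readmitted patients who were missed
    return cost_fn * false_neg + cost_fp * false_pos


def best_threshold(cost_fn, cost_fp):
    """Return the tabulated threshold with the lowest expected cost."""
    return min(OPERATING_POINTS, key=lambda t: expected_cost(t, cost_fn, cost_fp))


# Hypothetical: a missed readmission costs 10x an unnecessary follow-up.
print(best_threshold(cost_fn=10, cost_fp=1))  # 0.1116
# Hypothetical: both errors cost the same.
print(best_threshold(cost_fn=1, cost_fp=1))   # 0.5
```

With these placeholder costs the choice flips between 0.1116 and 0.5, which is why the threshold is left as a serving-time decision. A real deployment would sweep thresholds over the full probability distribution, not just three tabulated points.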

	## How to load and predict
	```python
	import json
	import joblib
	import pandas as pd
	from huggingface_hub import hf_hub_download

	REPO_ID = "careerbytecode/mlops-ref-healthcare-readmission"

	# Download (and cache) the pipeline; load it with scikit-learn 1.8.0, the version listed under Model details.
	model = joblib.load(hf_hub_download(REPO_ID, "model/pipeline.joblib"))
	# sample_input.json is read as a single record and wrapped in a one-row DataFrame.
	with open(hf_hub_download(REPO_ID, "sample_input.json")) as f:
	    sample = json.load(f)
	proba = float(model.predict_proba(pd.DataFrame([sample]))[0, 1])
	print(proba)  # calibrated readmission probability
	```

	Input schema (19 columns):
	- Numeric (8): `time_in_hospital`, `num_lab_procedures`, `num_procedures`, `num_medications`, `number_diagnoses`, `number_inpatient`, `number_outpatient`, `number_emergency`
	- Categorical, passed as strings (11): `race`, `gender`, `age`, `A1Cresult`, `max_glu_serum`, `insulin`, `change`, `diabetesMed`, `diag_1`, `diag_2`, `diag_3`
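Before calling `predict_proba`, a serving layer may want to check incoming records against this schema. The following is a minimal sketch of such a check, not code shipped with the model: `validate_record` and the example record values are hypothetical, and the check is deliberately stricter than the pipeline, treating missing values as errors even though the pipeline includes imputers.

```python
from numbers import Real

# Column names copied from the input schema above.
NUMERIC_COLUMNS = [
    "time_in_hospital", "num_lab_procedures", "num_procedures",
    "num_medications", "number_diagnoses", "number_inpatient",
    "number_outpatient", "number_emergency",
]
CATEGORICAL_COLUMNS = [
    "race", "gender", "age", "A1Cresult", "max_glu_serum", "insulin",
    "change", "diabetesMed", "diag_1", "diag_2", "diag_3",
]


def validate_record(record):
    """Return a list of problems; an empty list means the record matches the schema."""
    problems = []
    expected = set(NUMERIC_COLUMNS) | set(CATEGORICAL_COLUMNS)
    for col in sorted(expected - set(record)):
        problems.append(f"missing column: {col}")
    for col in sorted(set(record) - expected):
        problems.append(f"unexpected column: {col}")
    for col in NUMERIC_COLUMNS:
        if col in record:
            value = record[col]
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, Real):
                problems.append(f"{col} must be numeric, got {type(value).__name__}")
    for col in CATEGORICAL_COLUMNS:
        if col in record and not isinstance(record[col], str):
            problems.append(f"{col} must be a string, got {type(record[col]).__name__}")
    return problems


# Hypothetical record for illustration only; not the contents of sample_input.json.
example = {
    "time_in_hospital": 4, "num_lab_procedures": 41, "num_procedures": 1,
    "num_medications": 12, "number_diagnoses": 9, "number_inpatient": 0,
    "number_outpatient": 0, "number_emergency": 0,
    "race": "Caucasian", "gender": "Female", "age": "[70-80)",
    "A1Cresult": "None", "max_glu_serum": "None", "insulin": "Steady",
    "change": "No", "diabetesMed": "Yes",
    "diag_1": "428", "diag_2": "250.01", "diag_3": "401",
}
print(validate_record(example))  # []
```

Note that the diagnosis codes are passed as strings even when they look numeric, in line with the categorical handling described under Model details.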

	## Limitations
	- 1999-2008 US data; transfer to other populations/eras is unvalidated.
	- No fairness audit here (covered in ML Dev Phase 4 Day 88); protected attributes are present.
	- Reference/teaching artifact only.

	---
	© 2015-2026 CareerByteCode. All rights reserved. | CC BY-NC-SA 4.0 (docs), MIT (code) | Authored by Raghavendra R, Platform Owner CareerByteCode, Solution Architect