---
# 🧾 Model Card — PhailomXgboost_dm_model
```yaml
license: unknown  # TODO: choose a license (e.g., mit, apache-2.0, cc-by-4.0)
library_name: xgboost
tags:
  - xgboost
  - classification
  - tabular-data
  - healthcare
  - NCD
  - diabetes-risk
language:
  - en
  - th
model-index:
- name: PhailomXgboost_dm_model
  results:
  - task:
      type: tabular-classification
    dataset:
      name: TODO-dataset-name
      type: private
      split: test
    metrics:
    - type: accuracy
      value: TODO
    - type: f1
      value: TODO
    - type: roc_auc
      value: TODO
```
---
## 📌 Model Summary
**PhailomXgboost_dm_model** is an **XGBoost classifier** developed for early-stage screening of **non-communicable diseases (NCDs)**, with a focus on diabetes risk prediction using community health screening data.
The model outputs **three classes**: *Normal*, *At-Risk*, and *Diabetic*, making it suitable for cost-effective and rapid community-level health assessments.
---
## 🧠 Intended Use & Limitations
**Intended use**
* Community-level health screening for diabetes/NCD risk.
* Educational and research purposes (health data mining, public health informatics).
* Integration into dashboards or lightweight apps (e.g., Streamlit, Hugging Face Spaces).
**Not for**
* Direct clinical diagnosis.
* Replacement for laboratory tests or medical professionals.
**Limitations**
* Performance depends heavily on data quality (missing values, outliers).
* Potential bias if the dataset is imbalanced across classes.
* Threshold tuning is required to balance sensitivity and specificity for different contexts.
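
The threshold tuning mentioned above can be sketched as follows. This is an illustrative example, not the project's actual decision rule: it lowers the bar for flagging the *At-Risk* class (trading specificity for sensitivity, which is often preferred in screening), falling back to the usual argmax otherwise. The threshold value `0.3` is a hypothetical placeholder.

```python
import numpy as np

def predict_with_threshold(proba, classes, risk_class="At-Risk", threshold=0.3):
    """Flag `risk_class` whenever its predicted probability exceeds
    `threshold`; otherwise use the standard argmax decision."""
    risk_idx = list(classes).index(risk_class)
    if proba[risk_idx] >= threshold:
        return risk_class
    return classes[int(np.argmax(proba))]

classes = ["Normal", "At-Risk", "Diabetic"]
# Argmax alone would say "Normal", but the lowered threshold flags "At-Risk".
print(predict_with_threshold(np.array([0.55, 0.35, 0.10]), classes))
# A clearly normal record is unaffected.
print(predict_with_threshold(np.array([0.80, 0.10, 0.10]), classes))
```

The appropriate threshold should be chosen by validating sensitivity/specificity on held-out data for the target population.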
---
## 🧯 Ethical Considerations
* Respect data privacy (PDPA/GDPR compliance).
* Communicate clearly that this model is a **screening tool, not a diagnostic system**.
* Regularly validate fairness across subgroups (gender, age, region).
---
## 🗂️ Data
* Source: community health screening dataset (**private, internal project**).
* Dataset size: ~**3,418 records** (balanced across *Normal*, *At-Risk*, *Diabetic*).
* Example features:
* **Demographics:** Age, Age group, Village, Screening date
* **Vitals:** Systolic/diastolic blood pressure, Weight, Height, BMI
* **Contextual variables:** Household or screening group identifiers
> **TODO:** Fill in exact feature schema, units (e.g., mmHg, kg, cm), and preprocessing methods.
---
## 🏗️ Training Procedure
* **Model:** XGBoost (tree-based gradient boosting), multi-class classification.
* **Objective:** `multi:softprob` (multi-class probability prediction).
* **Preprocessing:**
* Missing values handled by imputation.
* One-hot or ordinal encoding for categorical features.
* Stratified split into training/validation/test.
* **Hyperparameters tuned:** `max_depth`, `learning_rate (eta)`, `subsample`, `colsample_bytree`, `min_child_weight`, `n_estimators`.
* **Evaluation Metrics:** Accuracy, Macro-F1, ROC-AUC (One-vs-Rest).
> **TODO:** Insert actual hyperparameters and results.
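
The preprocessing steps listed above (imputation, categorical encoding, stratified split) can be sketched with scikit-learn. The data and column names below are synthetic stand-ins, not the project's real schema:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data; real feature names and units are project-specific.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "age": rng.integers(30, 80, 200).astype(float),
    "bp_systolic": rng.normal(130, 15, 200),
    "bmi": rng.normal(25, 4, 200),
    "age_group": rng.choice(["30-44", "45-59", "60+"], 200),
})
df.loc[df.sample(frac=0.05, random_state=0).index, "bmi"] = np.nan  # simulate missingness
y = rng.choice([0, 1, 2], 200)  # 0=Normal, 1=At-Risk, 2=Diabetic

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "bp_systolic", "bmi"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["age_group"]),
])

# Stratified split keeps the class balance in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42)
Xt = preprocess.fit_transform(X_train)
print(Xt.shape)  # 160 rows; 3 numeric + 3 one-hot columns
```

The fitted `preprocess` transformer must be saved alongside the model so inference applies identical transformations.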
---
## 📈 Evaluation
| Metric | Test Set |
| ------------- | -------- |
| Accuracy | TODO |
| Macro F1 | TODO |
| ROC-AUC (OVR) | TODO |
**Confusion Matrix (example format)**
```
               Pred:Normal  Pred:At-Risk  Pred:Diabetic
True:Normal        TODO         TODO          TODO
True:At-Risk       TODO         TODO          TODO
True:Diabetic      TODO         TODO          TODO
```
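
The metrics in the table can be computed with scikit-learn once test-set predictions exist. The labels and probabilities below are hypothetical, purely to show the calls:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_auc_score

# Hypothetical test labels (0=Normal, 1=At-Risk, 2=Diabetic) and
# predicted class probabilities; real values come from the held-out set.
y_true = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2, 0])
proba = np.array([
    [0.8, 0.1, 0.1], [0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.5, 0.4, 0.1],
    [0.1, 0.2, 0.7], [0.1, 0.3, 0.6], [0.5, 0.4, 0.1], [0.3, 0.6, 0.1],
    [0.2, 0.2, 0.6], [0.7, 0.2, 0.1],
])
y_pred = proba.argmax(axis=1)

print("accuracy :", accuracy_score(y_true, y_pred))
print("macro F1 :", f1_score(y_true, y_pred, average="macro"))
print("ROC-AUC  :", roc_auc_score(y_true, proba, multi_class="ovr"))
print(confusion_matrix(y_true, y_pred, labels=[0, 1, 2]))
```

Note that `roc_auc_score` with `multi_class="ovr"` takes the full probability matrix, not the hard labels.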
---
## 🧩 Input Schema
Expected columns must match the training pipeline order. Example schema from project context:
```python
expected_columns = [
    "age_group", "record_id", "age", "village_no", "village_name", "screening_date",
    "bp_systolic", "bp_diastolic", "weight", "height",
    # ... add remaining features
]
```
> **TODO:** Fill with the exact column list and datatypes.
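
A small guard before inference can fail fast on schema mismatches and enforce the training-time column order. This is a generic sketch; the `expected` list below is an illustrative subset, not the real schema:

```python
import pandas as pd

# Illustrative subset of the expected columns; the real list comes from training.
expected = ["age", "bp_systolic", "bp_diastolic"]

def validate_input(df: pd.DataFrame, expected_cols: list) -> pd.DataFrame:
    """Raise on missing columns, then reorder to the training-time order."""
    missing = [c for c in expected_cols if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df[expected_cols]

row = pd.DataFrame([{"bp_diastolic": 90, "age": 64, "bp_systolic": 146}])
print(list(validate_input(row, expected).columns))
```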
---
## 🚀 Inference
### 1) Load from pickle file
```python
import pickle

import pandas as pd

# Load the trained classifier (only unpickle files from a trusted source).
with open("PhailomXgboost_dm_model.pkl", "rb") as f:
    model = pickle.load(f)

# One screening record, in the training-time column order (see Input Schema).
X = pd.DataFrame([{
    "age_group": "60-69",
    "record_id": 1,
    "age": 64,
    "village_no": 5,
    "village_name": "SampleVillage",
    "screening_date": "2025-07-01",
    "bp_systolic": 146,
    "bp_diastolic": 90,
    "weight": 68.0,
    "height": 160.0,
    # ... include all expected features
}], columns=expected_columns)

proba = model.predict_proba(X)[0]      # class probabilities for this record
pred = model.classes_[proba.argmax()]  # label with the highest probability
print(pred, proba)
```
### 2) Use XGBoost native format (recommended for HF)
```python
model.get_booster().save_model("model.json")
```
---
## ⚙️ Environment & Reproducibility
* **Python**: TODO
* **xgboost**: TODO
* **scikit-learn**: TODO
* **pandas/numpy**: TODO
* Random seed: `42`
Attach:
* `requirements.txt`
* training script/preprocessing code
* evaluation reports and figures
---
## 🧪 Validation & Monitoring
* Adjust classification thresholds for public health contexts.
* Monitor drift when applied to new populations.
* Revalidate if data collection tools change.
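
One simple way to monitor the drift mentioned above is the Population Stability Index (PSI) on a key feature. A minimal sketch, using synthetic data as a stand-in for real screening values:

```python
import numpy as np

def psi(reference, current, bins=10):
    """Population Stability Index: ~0 means stable; values above ~0.2
    are commonly treated as a sign of meaningful distribution shift."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    current = np.clip(current, edges[0], edges[-1])  # keep all values in range
    ref_frac = np.histogram(reference, edges)[0] / len(reference)
    cur_frac = np.histogram(current, edges)[0] / len(current)
    ref_frac = np.clip(ref_frac, 1e-6, None)  # avoid log(0)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(42)
baseline = rng.normal(130, 15, 2000)   # e.g. systolic BP at training time
new_pop = rng.normal(140, 15, 2000)    # new deployment site, higher mean
print(psi(baseline, baseline[:1000]))  # small: same distribution
print(psi(baseline, new_pop))          # larger: investigate before reuse
```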
---
## 📣 Citation
> **TODO:** Add references or project details for citation.