# 🧾 Model Card — PhailomXgboost_dm_model

```yaml
license: unknown           # TODO: choose a license (e.g., mit, apache-2.0, cc-by-4.0)
library_name: xgboost
tags:
  - xgboost
  - classification
  - tabular-data
  - healthcare
  - NCD
  - diabetes-risk
language:
  - en
  - th
model-index:
  - name: PhailomXgboost_dm_model
    results:
      - task:
          type: tabular-classification
        dataset:
          name: TODO-dataset-name
          type: private
          split: test
        metrics:
          - type: accuracy
            value: TODO
          - type: f1
            value: TODO
          - type: roc_auc
            value: TODO
```

---

## 📌 Model Summary

**PhailomXgboost_dm_model** is an **XGBoost classifier** developed for early-stage screening of **non-communicable diseases (NCDs)**, with a focus on diabetes risk prediction using community health screening data.
The model outputs **three classes**: *Normal*, *At-Risk*, and *Diabetic*, making it suitable for cost-effective and rapid community-level health assessments.

---

## 🧠 Intended Use & Limitations

**Intended use**

* Community-level health screening for diabetes/NCD risk.
* Educational and research purposes (health data mining, public health informatics).
* Integration into dashboards or lightweight apps (e.g., Streamlit, Hugging Face Spaces).

**Not for**

* Direct clinical diagnosis.
* Replacement for laboratory tests or medical professionals.

**Limitations**

* Performance depends heavily on data quality (missing values, outliers).
* Potential bias if the dataset is imbalanced across classes.
* Threshold tuning is required to balance sensitivity and specificity for different contexts.

---

## 🧯 Ethical Considerations

* Respect data privacy (PDPA/GDPR compliance).
* Communicate clearly that this model is a **screening tool, not a diagnostic system**.
* Regularly validate fairness across subgroups (gender, age, region).

---

## 🗂️ Data

* Source: community health screening dataset (**private, internal project**).
* Dataset size: ~**3,418 records** (balanced across *Normal*, *At-Risk*, *Diabetic*).
* Example features:

  * **Demographics:** Age, Age group, Village, Screening date
  * **Vitals:** Systolic/diastolic blood pressure, Weight, Height, BMI
  * **Contextual variables:** Household or screening group identifiers

> **TODO:** Fill in exact feature schema, units (e.g., mmHg, kg, cm), and preprocessing methods.

---

## 🏗️ Training Procedure

* **Model:** XGBoost (tree-based gradient boosting), multi-class classification.
* **Objective:** `multi:softprob` (multi-class probability prediction).
* **Preprocessing:**

  * Missing values handled by imputation.
  * One-hot or ordinal encoding for categorical features.
  * Stratified split into training/validation/test.
* **Hyperparameters tuned:** `max_depth`, `learning_rate (eta)`, `subsample`, `colsample_bytree`, `min_child_weight`, `n_estimators`.
* **Evaluation Metrics:** Accuracy, Macro-F1, ROC-AUC (One-vs-Rest).

> **TODO:** Insert actual hyperparameters and results.

---

## 📈 Evaluation

| Metric        | Test Set |
| ------------- | -------- |
| Accuracy      | TODO     |
| Macro F1      | TODO     |
| ROC-AUC (OVR) | TODO     |

**Confusion Matrix (example format)**

```
            Pred:Normal  Pred:At-Risk  Pred:Diabetic
True:Normal      TODO        TODO          TODO
True:At-Risk     TODO        TODO          TODO
True:Diabetic    TODO        TODO          TODO
```
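Once the test-set predictions exist, the table and matrix above can be computed directly with scikit-learn. The labels below are tiny hypothetical placeholders, used only to show the calls; substitute the model's real test-set outputs.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

labels = ["Normal", "At-Risk", "Diabetic"]
# Placeholder ground truth and predictions
y_true = ["Normal", "At-Risk", "Diabetic", "Normal", "Diabetic"]
y_pred = ["Normal", "At-Risk", "Normal",   "Normal", "Diabetic"]

acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")
cm = confusion_matrix(y_true, y_pred, labels=labels)  # rows=true, cols=pred
```

`confusion_matrix(..., labels=labels)` fixes the row/column order to match the table format shown above.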

---

## 🧩 Input Schema

Expected columns must match the training pipeline order. Example schema from project context:

```python
expected_columns = [
  "age_group", "record_id", "age", "village_no", "village_name", "screening_date",
  "bp_systolic", "bp_diastolic", "weight", "height",
  # ... add remaining features
]
```

> **TODO:** Fill with the exact column list and datatypes.
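Since column order must match the training pipeline, a defensive check before inference is cheap insurance. A minimal sketch, assuming `expected_columns` holds the real schema (the three-column list here is only an illustrative subset):

```python
import pandas as pd

expected_columns = ["age", "bp_systolic", "bp_diastolic"]  # illustrative subset

def align(df: pd.DataFrame) -> pd.DataFrame:
    """Fail fast on missing columns, then enforce the training column order."""
    missing = set(expected_columns) - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return df[expected_columns]
```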

---

## 🚀 Inference

### 1) Load from pickle file

```python
import pickle

import pandas as pd

with open("PhailomXgboost_dm_model.pkl", "rb") as f:
    model = pickle.load(f)

X = pd.DataFrame([{
    "age_group": "60-69",
    "record_id": 1,
    "age": 64,
    "village_no": 5,
    "village_name": "SampleVillage",
    "screening_date": "2025-07-01",
    "bp_systolic": 146,
    "bp_diastolic": 90,
    "weight": 68.0,
    "height": 160.0,
    # ... include all expected features
}], columns=expected_columns)  # expected_columns from the Input Schema section

proba = model.predict_proba(X)[0]
pred  = model.classes_[proba.argmax()]
print(pred, proba)
```

### 2) Use XGBoost native format (recommended for HF)

```python
model.get_booster().save_model("model.json")
```

---

## ⚙️ Environment & Reproducibility

* **Python**: TODO
* **xgboost**: TODO
* **scikit-learn**: TODO
* **pandas/numpy**: TODO
* Random seed: `42`

Attach:

* `requirements.txt`
* training script/preprocessing code
* evaluation reports and figures

---

## 🧪 Validation & Monitoring

* Adjust classification thresholds for public health contexts.
* Monitor drift when applied to new populations.
* Revalidate if data collection tools change.
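Threshold adjustment can be as simple as flagging anyone whose combined At-Risk plus Diabetic probability exceeds a screening cutoff, rather than taking the argmax. The probabilities and the 0.5 cutoff below are hypothetical, purely to illustrate the mechanics:

```python
import numpy as np

# Hypothetical per-class probabilities for three screened individuals
proba = np.array([
    [0.70, 0.20, 0.10],  # columns: Normal, At-Risk, Diabetic
    [0.45, 0.40, 0.15],
    [0.30, 0.30, 0.40],
])
classes = np.array(["Normal", "At-Risk", "Diabetic"])

argmax_labels = classes[proba.argmax(axis=1)]

# Screening rule: flag if P(At-Risk) + P(Diabetic) >= cutoff
threshold = 0.5
flagged = (proba[:, 1] + proba[:, 2]) >= threshold
```

Note the second individual: argmax alone would call them *Normal*, but the screening rule flags them for follow-up, which is the sensitivity/specificity trade-off the Limitations section refers to.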

---

## 📣 Citation

> TODO: Add references or project details for citation.