---
# 🧾 Model Card — PhailomXgboost\_dm\_model
```yaml
license: unknown  # TODO: choose a license (e.g., mit, apache-2.0, cc-by-4.0)
library_name: xgboost
tags:
  - xgboost
  - classification
  - tabular-data
  - healthcare
  - NCD
  - diabetes-risk
language:
  - en
  - th
model-index:
  - name: PhailomXgboost_dm_model
    results:
      - task:
          type: tabular-classification
        dataset:
          name: TODO-dataset-name
          type: private
          split: test
        metrics:
          - type: accuracy
            value: TODO
          - type: f1
            value: TODO
          - type: roc_auc
            value: TODO
```
---
## 📌 Model Summary
**PhailomXgboost\_dm\_model** is an **XGBoost classifier** developed for early-stage screening of **non-communicable diseases (NCDs)**, with a focus on diabetes risk prediction using community health screening data.
The model outputs **three classes**: *Normal*, *At-Risk*, and *Diabetic*, making it suitable for cost-effective and rapid community-level health assessments.
---
## 🧠 Intended Use & Limitations
**Intended use**
* Community-level health screening for diabetes/NCD risk.
* Educational and research purposes (health data mining, public health informatics).
* Integration into dashboards or lightweight apps (e.g., Streamlit, Hugging Face Spaces).
**Not for**
* Direct clinical diagnosis.
* Replacement for laboratory tests or medical professionals.
**Limitations**
* Performance depends heavily on data quality (missing values, outliers).
* Potential bias if the dataset is imbalanced across classes.
* Threshold tuning is required to balance sensitivity and specificity for different contexts.
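
One way to picture the threshold tuning mentioned above is a screening rule that flags *Diabetic* at a lower probability than a plain argmax would. This is a minimal sketch, not the project's actual decision rule: the class order, the function name `screen`, and the default threshold of 0.30 are all assumptions for illustration. Lowering the threshold raises sensitivity for the *Diabetic* class at the cost of specificity.

```python
# Assumed order of the predict_proba columns; check the trained model's classes_.
CLASSES = ["Normal", "At-Risk", "Diabetic"]


def screen(proba, diabetic_threshold=0.30):
    """Return a screening label from three class probabilities.

    'Diabetic' is flagged whenever its probability reaches the threshold;
    otherwise the more probable of the two remaining classes is returned.
    """
    p_normal, p_at_risk, p_diabetic = proba
    if p_diabetic >= diabetic_threshold:
        return "Diabetic"
    return "Normal" if p_normal >= p_at_risk else "At-Risk"


# Hypothetical probabilities, for illustration only.
label = screen([0.50, 0.15, 0.35])
```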
---
## 🧯 Ethical Considerations
* Respect data privacy (PDPA/GDPR compliance).
* Communicate clearly that this model is a **screening tool, not a diagnostic system**.
* Regularly validate fairness across subgroups (gender, age, region).
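
A subgroup fairness check can be as simple as comparing recall for the *Diabetic* class across groups. The sketch below assumes string labels and a single grouping column; the function name `recall_by_group` and the toy records are hypothetical and are not taken from the project data.

```python
from collections import defaultdict


def recall_by_group(y_true, y_pred, groups, positive="Diabetic"):
    """Recall of the `positive` class, computed separately for each subgroup."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == positive:
            totals[group] += 1
            hits[group] += int(pred == positive)
    # Groups without any true positives are left out rather than reported as 0.
    return {group: hits[group] / totals[group] for group in totals}


# Made-up toy records, for illustration only.
y_true = ["Diabetic", "Diabetic", "Normal", "Diabetic", "Diabetic", "At-Risk"]
y_pred = ["Diabetic", "At-Risk", "Normal", "Diabetic", "Diabetic", "At-Risk"]
gender = ["F", "F", "F", "M", "M", "M"]
per_group = recall_by_group(y_true, y_pred, gender)
```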
---
## 🗂️ Data
* Source: community health screening dataset (**private, internal project**).
* Dataset size: \~**3,418 records** (balanced across *Normal*, *At-Risk*, *Diabetic*).
* Example features:
  * **Demographics:** Age, Age group, Village, Screening date
  * **Vitals:** Systolic/diastolic blood pressure, Weight, Height, BMI
  * **Contextual variables:** Household or screening group identifiers
> **TODO:** Fill in exact feature schema, units (e.g., mmHg, kg, cm), and preprocessing methods.
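
BMI is listed among the vitals, so it may help to spell out how it relates to weight and height. The sketch below assumes weight in kilograms and height in centimetres; the card does not yet confirm the units, so treat both as assumptions.

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight in kg divided by the square of height in metres."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


# Hypothetical person: 68 kg, 160 cm.
value = bmi(68.0, 160.0)
```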
---
## 🏗️ Training Procedure
* **Model:** XGBoost (tree-based gradient boosting), multi-class classification.
* **Objective:** `multi:softprob` (multi-class probability prediction).
* **Preprocessing:**
  * Missing values handled by imputation.
  * One-hot or ordinal encoding for categorical features.
  * Stratified split into training/validation/test.
* **Hyperparameters tuned:** `max_depth`, `learning_rate` (`eta`), `subsample`, `colsample_bytree`, `min_child_weight`, `n_estimators`.
* **Evaluation Metrics:** Accuracy, Macro-F1, ROC-AUC (One-vs-Rest).
> **TODO:** Insert actual hyperparameters and results.
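
The stratified split listed above keeps the class proportions the same in every partition. The following is a minimal pure-Python sketch of that idea, not the project's training code: the 70/15/15 fractions and the function name `stratified_split` are assumptions (the card only fixes the random seed at 42). In practice, scikit-learn's `train_test_split` with its `stratify` argument does the same job.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class before slicing
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Placeholder labels: 100 records per class, standing in for the real dataset.
labels = ["Normal"] * 100 + ["At-Risk"] * 100 + ["Diabetic"] * 100
train_idx, val_idx, test_idx = stratified_split(labels)
```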
---
## 📈 Evaluation
| Metric | Test Set |
| ------------- | -------- |
| Accuracy | TODO |
| Macro F1 | TODO |
| ROC-AUC (OVR) | TODO |
**Confusion Matrix (example format)**
```
                Pred:Normal  Pred:At-Risk  Pred:Diabetic
True:Normal     TODO         TODO          TODO
True:At-Risk    TODO         TODO          TODO
True:Diabetic   TODO         TODO          TODO
```
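
Once the confusion matrix is filled in, accuracy and macro-F1 follow directly from its cells. The sketch below shows the arithmetic, assuming rows are true classes and columns are predicted classes as in the layout above. The numbers in `example_cm` are made-up placeholders, not results for this model, and the function name is hypothetical.

```python
def metrics_from_confusion(cm):
    """Return (accuracy, macro_f1) for a square confusion matrix.

    cm[i][j] is the number of samples with true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total

    f1_scores = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, truly another class
        fn = sum(cm[k]) - tp                       # truly k, predicted another class
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    # Macro-F1 is the unweighted mean of the per-class F1 scores.
    return accuracy, sum(f1_scores) / n


# Placeholder counts for illustration only (order: Normal, At-Risk, Diabetic).
example_cm = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]
accuracy, macro_f1 = metrics_from_confusion(example_cm)
```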
---
## 🧩 Input Schema
Input columns must match the names and order used by the training pipeline. An example schema from the project context:
```python
expected_columns = [
    "age_group", "record_id", "age", "village_no", "village_name", "screening_date",
    "bp_systolic", "bp_diastolic", "weight", "height",
    # ... add remaining features
]
```
> **TODO:** Fill with the exact column list and datatypes.
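
Because column names and order matter, it can be useful to validate each record before prediction. This is a minimal sketch assuming records arrive as plain dictionaries; `EXPECTED_COLUMNS` repeats only the partial column list shown above, and the helper name `to_row` is hypothetical.

```python
# Partial list copied from this card; the full schema is still a TODO.
EXPECTED_COLUMNS = [
    "age_group", "record_id", "age", "village_no", "village_name", "screening_date",
    "bp_systolic", "bp_diastolic", "weight", "height",
]


def to_row(record, expected=EXPECTED_COLUMNS):
    """Return the record's values in the expected column order.

    Raises ValueError if any expected column is missing or an unknown one is present.
    """
    missing = [col for col in expected if col not in record]
    unexpected = [col for col in record if col not in expected]
    if missing or unexpected:
        raise ValueError(f"missing={missing}, unexpected={unexpected}")
    return [record[col] for col in expected]


# Hypothetical record, with keys deliberately out of order.
record = {
    "height": 160.0, "weight": 68.0, "bp_diastolic": 90, "bp_systolic": 146,
    "screening_date": "2025-07-01", "village_name": "SampleVillage",
    "village_no": 5, "age": 64, "record_id": 1, "age_group": "60-69",
}
row = to_row(record)
```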
---
## 🚀 Inference
### 1) Load from pickle file
```python
import pickle

import pandas as pd

# Load the pickled classifier.
with open("PhailomXgboost_dm_model.pkl", "rb") as f:
    model = pickle.load(f)

# `expected_columns` is the list defined in the "Input Schema" section.
X = pd.DataFrame([{
    "age_group": "60-69",
    "record_id": 1,
    "age": 64,
    "village_no": 5,
    "village_name": "SampleVillage",
    "screening_date": "2025-07-01",
    "bp_systolic": 146,
    "bp_diastolic": 90,
    "weight": 68.0,
    "height": 160.0,
    # ... include all expected features
}], columns=expected_columns)

proba = model.predict_proba(X)[0]      # class probabilities for the single row
pred = model.classes_[proba.argmax()]  # label with the highest probability
print(pred, proba)
```
### 2) Use XGBoost native format (recommended for HF)
```python
model.get_booster().save_model("model.json")
```
---
## ⚙️ Environment & Reproducibility
* **Python**: TODO
* **xgboost**: TODO
* **scikit-learn**: TODO
* **pandas/numpy**: TODO
* Random seed: `42`
Attach:
* `requirements.txt`
* training script/preprocessing code
* evaluation reports and figures
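
To fill in the version TODOs above, the installed versions can be read from the training environment. This is a small standard-library sketch; the helper name `installed_versions` is hypothetical, and the package names are the ones listed in this section.

```python
import platform
from importlib import metadata


def installed_versions(packages=("xgboost", "scikit-learn", "pandas", "numpy")):
    """Return a dict mapping 'python' and each package name to its installed version."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


versions = installed_versions()
for name, version in versions.items():
    print(f"{name}=={version}")
```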
---
## 🧪 Validation & Monitoring
* Adjust classification thresholds for public health contexts.
* Monitor drift when applied to new populations.
* Revalidate if data collection tools change.
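
Drift monitoring needs some numeric signal. One common heuristic, which this card does not itself prescribe, is the population stability index (PSI) between a feature's training distribution and its distribution in a new population. The sketch below is an illustration under assumptions: the bin edges, the sample ages, and the function name `psi` are all hypothetical.

```python
import bisect
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population stability index of `actual` against `expected`.

    `edges` are sorted bin boundaries; empty bins are floored at `eps`
    so that the logarithm stays defined.
    """
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for value in values:
            counts[bisect.bisect_right(edges, value)] += 1
        return [max(count / len(values), eps) for count in counts]

    e_shares, a_shares = shares(expected), shares(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_shares, a_shares))


# Made-up ages, for illustration only.
train_ages = [35, 45, 55, 65, 75]
new_ages = [65, 66, 70, 75, 80]
age_edges = [40, 50, 60, 70]

same = psi(train_ages, train_ages, age_edges)   # identical distributions
shifted = psi(train_ages, new_ages, age_edges)  # older population
```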
---
## 📣 Citation
> TODO: Add references or project details for citation.