---
# 🧾 Model Card — PhailomXgboost_dm_model
```yaml
license: unknown  # TODO: choose a license (e.g., mit, apache-2.0, cc-by-4.0)
library_name: xgboost
tags:
  - xgboost
  - classification
  - tabular-data
  - healthcare
  - NCD
  - diabetes-risk
language:
  - en
  - th
model-index:
- name: PhailomXgboost_dm_model
  results:
  - task:
      type: tabular-classification
    dataset:
      name: TODO-dataset-name
      type: private
      split: test
    metrics:
    - type: accuracy
      value: TODO
    - type: f1
      value: TODO
    - type: roc_auc
      value: TODO
```
---
## 📌 Model Summary
**PhailomXgboost_dm_model** is an **XGBoost classifier** developed for early-stage screening of **non-communicable diseases (NCDs)**, with a focus on diabetes risk prediction using community health screening data.
The model outputs **three classes**: *Normal*, *At-Risk*, and *Diabetic*, making it suitable for cost-effective and rapid community-level health assessments.
---
## 🧠 Intended Use & Limitations
**Intended use**
* Community-level health screening for diabetes/NCD risk.
* Educational and research purposes (health data mining, public health informatics).
* Integration into dashboards or lightweight apps (e.g., Streamlit, Hugging Face Spaces).
**Not for**
* Direct clinical diagnosis.
* Replacement for laboratory tests or medical professionals.
**Limitations**
* Performance depends heavily on data quality (missing values, outliers).
* Potential bias if the dataset is imbalanced across classes.
* Threshold tuning is required to balance sensitivity and specificity for different contexts.
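
The threshold tuning mentioned above can be sketched as follows. This is an illustrative example, not the project's actual decision rule: it lowers the bar for flagging the *At-Risk* class (trading specificity for sensitivity, which is often preferred in screening), falling back to the usual argmax otherwise. The threshold value `0.3` is a hypothetical placeholder.

```python
import numpy as np

def predict_with_threshold(proba, classes, risk_class="At-Risk", threshold=0.3):
    """Flag `risk_class` whenever its predicted probability exceeds
    `threshold`; otherwise use the standard argmax decision."""
    risk_idx = list(classes).index(risk_class)
    if proba[risk_idx] >= threshold:
        return risk_class
    return classes[int(np.argmax(proba))]

classes = ["Normal", "At-Risk", "Diabetic"]
# Argmax alone would say "Normal", but the lowered threshold flags "At-Risk".
print(predict_with_threshold(np.array([0.55, 0.35, 0.10]), classes))
# A clearly normal record is unaffected.
print(predict_with_threshold(np.array([0.80, 0.10, 0.10]), classes))
```

The appropriate threshold should be chosen by validating sensitivity/specificity on held-out data for the target population.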
---
## 🧯 Ethical Considerations
* Respect data privacy (PDPA/GDPR compliance).
* Communicate clearly that this model is a **screening tool, not a diagnostic system**.
* Regularly validate fairness across subgroups (gender, age, region).
---
## 🗂️ Data
* Source: community health screening dataset (**private, internal project**).
* Dataset size: ~**3,418 records** (balanced across *Normal*, *At-Risk*, *Diabetic*).
* Example features:
* **Demographics:** Age, Age group, Village, Screening date
* **Vitals:** Systolic/diastolic blood pressure, Weight, Height, BMI
* **Contextual variables:** Household or screening group identifiers
> **TODO:** Fill in exact feature schema, units (e.g., mmHg, kg, cm), and preprocessing methods.
---
## 🏗️ Training Procedure
* **Model:** XGBoost (tree-based gradient boosting), multi-class classification.
* **Objective:** `multi:softprob` (multi-class probability prediction).
* **Preprocessing:**
* Missing values handled by imputation.
* One-hot or ordinal encoding for categorical features.
* Stratified split into training/validation/test.
* **Hyperparameters tuned:** `max_depth`, `learning_rate (eta)`, `subsample`, `colsample_bytree`, `min_child_weight`, `n_estimators`.
* **Evaluation Metrics:** Accuracy, Macro-F1, ROC-AUC (One-vs-Rest).
> **TODO:** Insert actual hyperparameters and results.
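
The preprocessing steps listed above (imputation, categorical encoding, stratified split) can be sketched with scikit-learn. The data and column names below are synthetic stand-ins, not the project's real schema:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data; real feature names and units are project-specific.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "age": rng.integers(30, 80, 200).astype(float),
    "bp_systolic": rng.normal(130, 15, 200),
    "bmi": rng.normal(25, 4, 200),
    "age_group": rng.choice(["30-44", "45-59", "60+"], 200),
})
df.loc[df.sample(frac=0.05, random_state=0).index, "bmi"] = np.nan  # simulate missingness
y = rng.choice([0, 1, 2], 200)  # 0=Normal, 1=At-Risk, 2=Diabetic

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "bp_systolic", "bmi"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["age_group"]),
])

# Stratified split keeps the class balance in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42)
Xt = preprocess.fit_transform(X_train)
print(Xt.shape)  # 160 rows; 3 numeric + 3 one-hot columns
```

The fitted `preprocess` transformer must be saved alongside the model so inference applies identical transformations.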
---
## 📈 Evaluation
| Metric | Test Set |
| ------------- | -------- |
| Accuracy | TODO |
| Macro F1 | TODO |
| ROC-AUC (OVR) | TODO |
**Confusion Matrix (example format)**
```
               Pred:Normal  Pred:At-Risk  Pred:Diabetic
True:Normal        TODO         TODO          TODO
True:At-Risk       TODO         TODO          TODO
True:Diabetic      TODO         TODO          TODO
```
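
The metrics in the table can be computed with scikit-learn once test-set predictions exist. The labels and probabilities below are hypothetical, purely to show the calls:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_auc_score

# Hypothetical test labels (0=Normal, 1=At-Risk, 2=Diabetic) and
# predicted class probabilities; real values come from the held-out set.
y_true = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2, 0])
proba = np.array([
    [0.8, 0.1, 0.1], [0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.5, 0.4, 0.1],
    [0.1, 0.2, 0.7], [0.1, 0.3, 0.6], [0.5, 0.4, 0.1], [0.3, 0.6, 0.1],
    [0.2, 0.2, 0.6], [0.7, 0.2, 0.1],
])
y_pred = proba.argmax(axis=1)

print("accuracy :", accuracy_score(y_true, y_pred))
print("macro F1 :", f1_score(y_true, y_pred, average="macro"))
print("ROC-AUC  :", roc_auc_score(y_true, proba, multi_class="ovr"))
print(confusion_matrix(y_true, y_pred, labels=[0, 1, 2]))
```

Note that `roc_auc_score` with `multi_class="ovr"` takes the full probability matrix, not the hard labels.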
---
## 🧩 Input Schema
Expected columns must match the training pipeline order. Example schema from project context:
```python
expected_columns = [
    "age_group", "record_id", "age", "village_no", "village_name", "screening_date",
    "bp_systolic", "bp_diastolic", "weight", "height",
    # ... add remaining features
]
```
> **TODO:** Fill with the exact column list and datatypes.
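
A small guard before inference can fail fast on schema mismatches and enforce the training-time column order. This is a generic sketch; the `expected` list below is an illustrative subset, not the real schema:

```python
import pandas as pd

# Illustrative subset of the expected columns; the real list comes from training.
expected = ["age", "bp_systolic", "bp_diastolic"]

def validate_input(df: pd.DataFrame, expected_cols: list) -> pd.DataFrame:
    """Raise on missing columns, then reorder to the training-time order."""
    missing = [c for c in expected_cols if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df[expected_cols]

row = pd.DataFrame([{"bp_diastolic": 90, "age": 64, "bp_systolic": 146}])
print(list(validate_input(row, expected).columns))
```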
---
## 🚀 Inference
### 1) Load from pickle file
```python
import pickle

import pandas as pd

# Load the trained classifier (only unpickle files from a trusted source).
with open("PhailomXgboost_dm_model.pkl", "rb") as f:
    model = pickle.load(f)

# One screening record, in the training-time column order (see Input Schema).
X = pd.DataFrame([{
    "age_group": "60-69",
    "record_id": 1,
    "age": 64,
    "village_no": 5,
    "village_name": "SampleVillage",
    "screening_date": "2025-07-01",
    "bp_systolic": 146,
    "bp_diastolic": 90,
    "weight": 68.0,
    "height": 160.0,
    # ... include all expected features
}], columns=expected_columns)

proba = model.predict_proba(X)[0]      # class probabilities for this record
pred = model.classes_[proba.argmax()]  # label with the highest probability
print(pred, proba)
```
### 2) Use XGBoost native format (recommended for HF)
```python
model.get_booster().save_model("model.json")
```
---
## ⚙️ Environment & Reproducibility
* **Python**: TODO
* **xgboost**: TODO
* **scikit-learn**: TODO
* **pandas/numpy**: TODO
* Random seed: `42`
Attach:
* `requirements.txt`
* training script/preprocessing code
* evaluation reports and figures
---
## 🧪 Validation & Monitoring
* Adjust classification thresholds for public health contexts.
* Monitor drift when applied to new populations.
* Revalidate if data collection tools change.
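
One simple way to monitor the drift mentioned above is the Population Stability Index (PSI) on a key feature. A minimal sketch, using synthetic data as a stand-in for real screening values:

```python
import numpy as np

def psi(reference, current, bins=10):
    """Population Stability Index: ~0 means stable; values above ~0.2
    are commonly treated as a sign of meaningful distribution shift."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    current = np.clip(current, edges[0], edges[-1])  # keep all values in range
    ref_frac = np.histogram(reference, edges)[0] / len(reference)
    cur_frac = np.histogram(current, edges)[0] / len(current)
    ref_frac = np.clip(ref_frac, 1e-6, None)  # avoid log(0)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(42)
baseline = rng.normal(130, 15, 2000)   # e.g. systolic BP at training time
new_pop = rng.normal(140, 15, 2000)    # new deployment site, higher mean
print(psi(baseline, baseline[:1000]))  # small: same distribution
print(psi(baseline, new_pop))          # larger: investigate before reuse
```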
---
## 📣 Citation
> **TODO:** Add references or project details for citation.