# 🧾 Model Card — PhailomXgboost_dm_model
```yaml
license: unknown  # TODO: choose a license (e.g., mit, apache-2.0, cc-by-4.0)
library_name: xgboost
tags:
  - xgboost
  - classification
  - tabular-data
  - healthcare
  - NCD
  - diabetes-risk
language:
  - en
  - th
model-index:
  - name: PhailomXgboost_dm_model
    results:
      - task:
          type: tabular-classification
        dataset:
          name: TODO-dataset-name
          type: private
          split: test
        metrics:
          - type: accuracy
            value: TODO
          - type: f1
            value: TODO
          - type: roc_auc
            value: TODO
```
---
## 📌 Model Summary

**PhailomXgboost_dm_model** is an **XGBoost classifier** developed for early-stage screening of **non-communicable diseases (NCDs)**, with a focus on diabetes risk prediction using community health screening data.

The model outputs **three classes**: *Normal*, *At-Risk*, and *Diabetic*, making it suitable for cost-effective and rapid community-level health assessments.
---
## 🧠 Intended Use & Limitations

**Intended use**

* Community-level health screening for diabetes/NCD risk.
* Educational and research purposes (health data mining, public health informatics).
* Integration into dashboards or lightweight apps (e.g., Streamlit, Hugging Face Spaces).

**Not for**

* Direct clinical diagnosis.
* Replacement for laboratory tests or medical professionals.

**Limitations**

* Performance depends heavily on data quality (missing values, outliers).
* Potential bias if the dataset is imbalanced across classes.
* Threshold tuning is required to balance sensitivity and specificity for different contexts.
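The threshold-tuning point can be sketched as follows. This is an illustrative example only: the probability values, class order, and the `0.25` cutoff are hypothetical, not taken from the real model.

```python
import numpy as np

def predict_with_threshold(proba, classes, positive_idx, threshold):
    """Flag the positive class whenever its probability exceeds `threshold`,
    otherwise fall back to the argmax prediction. Lowering the threshold
    trades specificity for sensitivity, which can suit screening settings."""
    proba = np.asarray(proba)
    preds = np.asarray(classes)[proba.argmax(axis=1)]
    flagged = proba[:, positive_idx] >= threshold
    preds[flagged] = classes[positive_idx]
    return preds

# Hypothetical predict_proba output for three people over
# the classes [Normal, At-Risk, Diabetic]:
classes = np.array(["Normal", "At-Risk", "Diabetic"])
proba = np.array([
    [0.70, 0.20, 0.10],  # clearly Normal
    [0.45, 0.25, 0.30],  # argmax says Normal, but Diabetic prob is notable
    [0.10, 0.15, 0.75],  # clearly Diabetic
])

# With a screening threshold of 0.25 on the Diabetic class,
# the second person is flagged for follow-up.
preds = predict_with_threshold(proba, classes, positive_idx=2, threshold=0.25)
print(preds)  # ['Normal' 'Diabetic' 'Diabetic']
```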
---
## 🧯 Ethical Considerations

* Respect data privacy (PDPA/GDPR compliance).
* Communicate clearly that this model is a **screening tool, not a diagnostic system**.
* Regularly validate fairness across subgroups (gender, age, region).
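A subgroup fairness check can be as simple as per-group metrics on the held-out set. The sketch below uses a toy evaluation frame with made-up labels; the column names (`gender`, `y_true`, `y_pred`) are assumptions, and the same pattern applies to age band or region.

```python
import pandas as pd

# Hypothetical evaluation frame: true label, model prediction, and a subgroup.
df = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "M", "F"],
    "y_true": ["Normal", "At-Risk", "Normal", "Diabetic", "Normal", "Diabetic"],
    "y_pred": ["Normal", "Normal",  "Normal", "Diabetic", "Normal", "Diabetic"],
})

# Per-subgroup accuracy; a large gap between groups signals a fairness issue
# that warrants threshold re-tuning or a data-collection review.
per_group = (df.assign(correct=df.y_true == df.y_pred)
               .groupby("gender")["correct"].mean())
print(per_group)
```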
---
## 🗂️ Data

* Source: community health screening dataset (**private, internal project**).
* Dataset size: ~**3,418 records** (balanced across *Normal*, *At-Risk*, *Diabetic*).
* Example features:
  * **Demographics:** Age, age group, village, screening date
  * **Vitals:** Systolic/diastolic blood pressure, weight, height, BMI
  * **Contextual variables:** Household or screening group identifiers

> **TODO:** Fill in the exact feature schema, units (e.g., mmHg, kg, cm), and preprocessing methods.
---
## 🏗️ Training Procedure

* **Model:** XGBoost (tree-based gradient boosting), multi-class classification.
* **Objective:** `multi:softprob` (multi-class probability prediction).
* **Preprocessing:**
  * Missing values handled by imputation.
  * One-hot or ordinal encoding for categorical features.
  * Stratified split into training/validation/test sets.
* **Hyperparameters tuned:** `max_depth`, `learning_rate` (`eta`), `subsample`, `colsample_bytree`, `min_child_weight`, `n_estimators`.
* **Evaluation metrics:** Accuracy, Macro-F1, ROC-AUC (one-vs-rest).

> **TODO:** Insert actual hyperparameters and results.
---
## 📈 Evaluation

| Metric        | Test Set |
| ------------- | -------- |
| Accuracy      | TODO     |
| Macro F1      | TODO     |
| ROC-AUC (OVR) | TODO     |
**Confusion Matrix (example format)**

```
               Pred:Normal  Pred:At-Risk  Pred:Diabetic
True:Normal       TODO         TODO          TODO
True:At-Risk      TODO         TODO          TODO
True:Diabetic     TODO         TODO          TODO
```
---
## 🧩 Input Schema

Expected columns must match the training pipeline order. Example schema from the project context:

```python
expected_columns = [
    "age_group", "record_id", "age", "village_no", "village_name", "screening_date",
    "bp_systolic", "bp_diastolic", "weight", "height",
    # ... add remaining features
]
```

> **TODO:** Fill in the exact column list and datatypes.
---
## 🚀 Inference
### 1) Load from pickle file

```python
import pickle

import pandas as pd

# Load the trained classifier; only unpickle files from a trusted source.
with open("PhailomXgboost_dm_model.pkl", "rb") as f:
    model = pickle.load(f)

# One screening record; column names and order must match `expected_columns`.
X = pd.DataFrame([{
    "age_group": "60-69",
    "record_id": 1,
    "age": 64,
    "village_no": 5,
    "village_name": "SampleVillage",
    "screening_date": "2025-07-01",
    "bp_systolic": 146,
    "bp_diastolic": 90,
    "weight": 68.0,
    "height": 160.0,
    # ... include all expected features
}], columns=expected_columns)

# Class probabilities for the record, then the most likely class label.
proba = model.predict_proba(X)[0]
pred = model.classes_[proba.argmax()]
print(pred, proba)
```
### 2) Use the XGBoost native format (recommended for the Hub)

```python
# The native JSON format is portable across xgboost versions and avoids
# pickle's security caveats when sharing the model.
model.get_booster().save_model("model.json")
```
---
## ⚙️ Environment & Reproducibility

* **Python:** TODO
* **xgboost:** TODO
* **scikit-learn:** TODO
* **pandas/numpy:** TODO
* **Random seed:** `42`

Attach:

* `requirements.txt`
* training script / preprocessing code
* evaluation reports and figures
---
## 🧪 Validation & Monitoring

* Adjust classification thresholds for public health contexts.
* Monitor drift when the model is applied to new populations.
* Revalidate if data collection tools change.
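One lightweight way to monitor drift on a single feature is the Population Stability Index (PSI). The sketch below is an illustration with synthetic blood-pressure-like values; the thresholds cited in the docstring are the common rule of thumb, not project-specific cutoffs.

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample (e.g. training
    data) and new field data for one feature. Rule of thumb: < 0.1 stable,
    0.1-0.25 moderate shift, > 0.25 major shift."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf                  # cover the full range
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(42)
baseline = rng.normal(120, 15, size=5000)  # e.g. systolic BP at training time
same_pop = rng.normal(120, 15, size=5000)  # no drift
shifted = rng.normal(135, 15, size=5000)   # new population with higher BP

print(psi(baseline, same_pop))  # small, well below 0.1
print(psi(baseline, shifted))   # large, above 0.25: revalidate the model
```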
---
## 📣 Citation

> TODO: Add references or project details for citation.