Got it! Here's the **English version** of the Hugging Face-ready **Model Card** draft for your file `PhailomXgboost_dm_model.pkl`. I've preserved all the technical details but translated and refined for an international scientific audience.

---

# 🧾 Model Card: PhailomXgboost\_dm\_model

```yaml
license: unknown  # TODO: choose a license (e.g., mit, apache-2.0, cc-by-4.0)
library_name: xgboost
tags:
  - xgboost
  - classification
  - tabular-data
  - healthcare
  - NCD
  - diabetes-risk
language:
  - en
  - th
model-index:
  - name: PhailomXgboost_dm_model
    results:
      - task:
          type: tabular-classification
        dataset:
          name: TODO-dataset-name
          type: private
          split: test
        metrics:
          - type: accuracy
            value: TODO
          - type: f1
            value: TODO
          - type: roc_auc
            value: TODO
```

---

## 📌 Model Summary

**PhailomXgboost\_dm\_model** is an **XGBoost classifier** developed for early-stage screening of **non-communicable diseases (NCDs)**, with a focus on diabetes risk prediction using community health screening data.
The model outputs **three classes**: *Normal*, *At-Risk*, and *Diabetic*, making it suitable for cost-effective and rapid community-level health assessments.

---

## 🧠 Intended Use & Limitations

**Intended use**

* Community-level health screening for diabetes/NCD risk.
* Educational and research purposes (health data mining, public health informatics).
* Integration into dashboards or lightweight apps (e.g., Streamlit, Hugging Face Spaces).

**Not for**

* Direct clinical diagnosis.
* Replacement for laboratory tests or medical professionals.

**Limitations**

* Performance depends heavily on data quality (missing values, outliers).
* Potential bias if the dataset is imbalanced across classes.
* Threshold tuning is required to balance sensitivity and specificity for different contexts.

---

## 🧯 Ethical Considerations

* Respect data privacy (PDPA/GDPR compliance).
* Communicate clearly that this model is a **screening tool, not a diagnostic system**.
* Regularly validate fairness across subgroups (gender, age, region).

---

## 🗂️ Data

* Source: community health screening dataset (**private, internal project**).
* Dataset size: approximately **3,418 records** (balanced across *Normal*, *At-Risk*, *Diabetic*).
* Example features:
  * **Demographics:** Age, Age group, Village, Screening date
  * **Vitals:** Systolic/diastolic blood pressure, Weight, Height, BMI
  * **Contextual variables:** Household or screening group identifiers

> **TODO:** Fill in exact feature schema, units (e.g., mmHg, kg, cm), and preprocessing methods.

---

## 🏗️ Training Procedure

* **Model:** XGBoost (tree-based gradient boosting), multi-class classification.
* **Objective:** `multi:softprob` (multi-class probability prediction).
* **Preprocessing:**
  * Missing values handled by imputation.
  * One-hot or ordinal encoding for categorical features.
  * Stratified split into training/validation/test sets.
* **Hyperparameters tuned:** `max_depth`, `learning_rate` (`eta`), `subsample`, `colsample_bytree`, `min_child_weight`, `n_estimators`.
* **Evaluation Metrics:** Accuracy, Macro-F1, ROC-AUC (One-vs-Rest).

> **TODO:** Insert actual hyperparameters and results.

---

## 📈 Evaluation

| Metric        | Test Set |
| ------------- | -------- |
| Accuracy      | TODO     |
| Macro F1      | TODO     |
| ROC-AUC (OVR) | TODO     |

**Confusion Matrix (example format)**

```
               Pred:Normal  Pred:At-Risk  Pred:Diabetic
True:Normal    TODO         TODO          TODO
True:At-Risk   TODO         TODO          TODO
True:Diabetic  TODO         TODO          TODO
```

---

## 🧩 Input Schema

Input columns must be supplied in the same order as in the training pipeline.
Example schema from project context:

```python
expected_columns = [
    "age_group",
    "record_id",
    "age",
    "village_no",
    "village_name",
    "screening_date",
    "bp_systolic",
    "bp_diastolic",
    "weight",
    "height",
    # ... add remaining features
]
```

> **TODO:** Fill with the exact column list and datatypes.
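
Because column order matters, a small check before inference can catch missing or misordered columns early. The sketch below shows one way to do this with pandas. `align_to_schema` is a hypothetical helper name, and the shortened `expected_columns` list is only a stand-in for the full schema, which is still a TODO above.

```python
import pandas as pd

# Shortened, illustrative schema (assumption); replace with the full training column list.
expected_columns = ["age", "bp_systolic", "bp_diastolic", "weight", "height"]


def align_to_schema(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return df restricted to `columns`, in that order (hypothetical helper)."""
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    # Selecting by list both reorders the columns and drops any extras.
    return df[list(columns)]


# One made-up record with shuffled columns and one extra field.
raw = pd.DataFrame([{
    "height": 160.0, "weight": 68.0, "age": 64,
    "bp_diastolic": 90, "bp_systolic": 146, "notes": "extra column",
}])
X = align_to_schema(raw, expected_columns)
print(list(X.columns))
```

Raising on missing columns, rather than filling them silently, keeps schema mismatches visible instead of turning them into quietly wrong predictions.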

---

## 🚀 Inference

### 1) Load from pickle file

```python
import pickle

import pandas as pd

# Only unpickle files from a trusted source: pickle.load can execute arbitrary code.
with open("PhailomXgboost_dm_model.pkl", "rb") as f:
    model = pickle.load(f)

# `expected_columns` is the column list defined in the Input Schema section.
X = pd.DataFrame([{
    "age_group": "60-69",
    "record_id": 1,
    "age": 64,
    "village_no": 5,
    "village_name": "SampleVillage",
    "screening_date": "2025-07-01",
    "bp_systolic": 146,
    "bp_diastolic": 90,
    "weight": 68.0,
    "height": 160.0,
    # ... include all expected features
}], columns=expected_columns)

# Class probabilities for the single row, then the most probable class label.
proba = model.predict_proba(X)[0]
pred = model.classes_[proba.argmax()]
print(pred, proba)
```

### 2) Export to XGBoost native format (recommended for HF)

```python
# `model` is the classifier loaded from the pickle file above.
model.get_booster().save_model("model.json")
```

---

## ⚙️ Environment & Reproducibility

* **Python**: TODO
* **xgboost**: TODO
* **scikit-learn**: TODO
* **pandas/numpy**: TODO
* Random seed: `42`

Attach:

* `requirements.txt`
* training script/preprocessing code
* evaluation reports and figures

---

## 🧪 Validation & Monitoring

* Adjust classification thresholds for public health contexts.
* Monitor drift when applied to new populations.
* Revalidate if data collection tools change.

---

## 📣 Citation

> TODO: Add references or project details for citation.
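
---

The threshold adjustment mentioned under Validation & Monitoring could be sketched as follows. Instead of always taking the most probable class, the rule below flags *Diabetic* whenever its predicted probability reaches a chosen cutoff, trading specificity for sensitivity. The class order, the `0.30` cutoff, and the helper name `predict_with_threshold` are illustrative assumptions, not values from this project, and the probability rows are made-up inputs rather than model outputs.

```python
import numpy as np

# Assumed class order; check it against model.classes_ for the real model.
CLASSES = np.array(["Normal", "At-Risk", "Diabetic"])


def predict_with_threshold(proba, diabetic_cutoff=0.30):
    """Arg-max prediction, overridden to 'Diabetic' when that class's
    probability is at or above `diabetic_cutoff` (hypothetical screening rule)."""
    proba = np.asarray(proba, dtype=float)
    labels = CLASSES[proba.argmax(axis=1)]  # fancy indexing returns a copy
    diabetic_idx = int(np.where(CLASSES == "Diabetic")[0][0])
    labels[proba[:, diabetic_idx] >= diabetic_cutoff] = "Diabetic"
    return labels


# Made-up probability rows in the shape returned by predict_proba.
example = np.array([
    [0.70, 0.20, 0.10],  # clearly Normal
    [0.35, 0.33, 0.32],  # arg-max says Normal, but the cutoff flags Diabetic
    [0.10, 0.65, 0.25],  # At-Risk, Diabetic probability below the cutoff
])
print(predict_with_threshold(example))
```

A cutoff like this would normally be chosen on the validation split, for example by comparing sensitivity and specificity for the *Diabetic* class at several candidate values.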