Got it! Here's the English version of the Hugging Face-ready Model Card draft for your file PhailomXgboost_dm_model.pkl. I've preserved all the technical details but translated and refined for an international scientific audience.
Model Card: PhailomXgboost_dm_model
license: unknown  # TODO: choose a license (e.g., mit, apache-2.0, cc-by-4.0)
library_name: xgboost
tags:
  - xgboost
  - classification
  - tabular-data
  - healthcare
  - NCD
  - diabetes-risk
language:
  - en
  - th
model-index:
  - name: PhailomXgboost_dm_model
    results:
      - task:
          type: tabular-classification
        dataset:
          name: TODO-dataset-name
          type: private
          split: test
        metrics:
          - type: accuracy
            value: TODO
          - type: f1
            value: TODO
          - type: roc_auc
            value: TODO
Model Summary
PhailomXgboost_dm_model is an XGBoost classifier developed for early-stage screening of non-communicable diseases (NCDs), with a focus on diabetes risk prediction using community health screening data. The model outputs three classes: Normal, At-Risk, and Diabetic, making it suitable for cost-effective and rapid community-level health assessments.
Intended Use & Limitations
Intended use
- Community-level health screening for diabetes/NCD risk.
- Educational and research purposes (health data mining, public health informatics).
- Integration into dashboards or lightweight apps (e.g., Streamlit, Hugging Face Spaces).
Not for
- Direct clinical diagnosis.
- Replacement for laboratory tests or medical professionals.
Limitations
- Performance depends heavily on data quality (missing values, outliers).
- Potential bias if the dataset is imbalanced across classes.
- Threshold tuning is required to balance sensitivity and specificity for different contexts.
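The threshold-tuning limitation can be made concrete with a minimal sketch. It assumes the model's probability for the Diabetic class is used as a screening score and that a person is flagged when the score meets a chosen threshold. The helper `sensitivity_specificity` and the toy scores are hypothetical and are not taken from the project.

```python
def sensitivity_specificity(scores, is_positive, threshold):
    """Return (sensitivity, specificity) when flagging scores >= threshold."""
    tp = fn = tn = fp = 0
    for score, positive in zip(scores, is_positive):
        flagged = score >= threshold
        if positive and flagged:
            tp += 1
        elif positive:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy P(Diabetic) scores and ground truth (True = Diabetic); illustrative only.
scores = [0.10, 0.35, 0.40, 0.55, 0.70, 0.90]
truth = [False, False, True, False, True, True]

for t in (0.3, 0.5, 0.7):
    sens, spec = sensitivity_specificity(scores, truth, t)
    print(f"threshold={t:.1f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

Lowering the threshold raises sensitivity at the cost of specificity, which is usually the preferred direction for a screening tool whose positives are confirmed by a laboratory test.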
Ethical Considerations
- Respect data privacy (PDPA/GDPR compliance).
- Communicate clearly that this model is a screening tool, not a diagnostic system.
- Regularly validate fairness across subgroups (gender, age, region).
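One simple way to approach the subgroup check is to compute recall for the Diabetic class separately per subgroup and compare the values. The sketch below assumes predictions are available as `(group, true_label, predicted_label)` tuples; the helper `recall_by_group` and the toy rows are hypothetical.

```python
from collections import defaultdict


def recall_by_group(rows, positive_label="Diabetic"):
    """Recall for one class, computed separately for each subgroup.

    rows: iterable of (group, true_label, predicted_label) tuples.
    Groups with no true positives are omitted from the result.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, true_label, predicted_label in rows:
        if true_label == positive_label:
            totals[group] += 1
            if predicted_label == positive_label:
                hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


# Toy rows; groups and labels are illustrative only.
rows = [
    ("female", "Diabetic", "Diabetic"),
    ("female", "Diabetic", "At-Risk"),
    ("female", "Normal", "Normal"),
    ("male", "Diabetic", "Diabetic"),
    ("male", "Diabetic", "Diabetic"),
    ("male", "At-Risk", "Diabetic"),
]
print(recall_by_group(rows))
```

A large gap between groups is a signal to investigate further. It is not proof of unfairness on its own, especially when subgroup sample sizes are small.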
Data
Source: community health screening dataset (private, internal project).
Dataset size: ~3,418 records (balanced across Normal, At-Risk, Diabetic).
Example features:
- Demographics: Age, Age group, Village, Screening date
- Vitals: Systolic/diastolic blood pressure, Weight, Height, BMI
- Contextual variables: Household or screening group identifiers
TODO: Fill in exact feature schema, units (e.g., mmHg, kg, cm), and preprocessing methods.
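If BMI is derived from weight and height rather than recorded directly, the standard formula is weight in kilograms divided by the square of height in meters. The sketch below assumes weight in kg and height in cm, which this card has not yet confirmed; `bmi_from_kg_cm` is a hypothetical helper.

```python
def bmi_from_kg_cm(weight_kg, height_cm):
    """Body mass index: weight (kg) divided by height (m) squared."""
    if weight_kg <= 0 or height_cm <= 0:
        raise ValueError("weight and height must be positive")
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


# Values taken from the example record in the Inference section.
print(round(bmi_from_kg_cm(68.0, 160.0), 2))
```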
Training Procedure
Model: XGBoost (tree-based gradient boosting), multi-class classification.
Objective: `multi:softprob` (multi-class probability prediction).
Preprocessing:
- Missing values handled by imputation.
- One-hot or ordinal encoding for categorical features.
- Stratified split into training/validation/test.
Hyperparameters tuned: `max_depth`, `learning_rate` (eta), `subsample`, `colsample_bytree`, `min_child_weight`, `n_estimators`.
Evaluation Metrics: Accuracy, Macro-F1, ROC-AUC (One-vs-Rest).
TODO: Insert actual hyperparameters and results.
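The stratified split mentioned under Preprocessing can be sketched without dependencies as follows. This is an illustration of the idea, not the project's actual code: the 70/15/15 fractions are an assumption, and `stratified_split` is a hypothetical helper. In practice, scikit-learn's `train_test_split` with its `stratify` argument is the usual tool.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.15, 0.15), seed=42):
    """Split row indices into train/validation/test, class by class.

    Each class is shuffled and cut separately so the three parts keep
    roughly the same class proportions as the full dataset.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, valid, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_valid = round(len(indices) * fractions[1])
        train += indices[:n_train]
        valid += indices[n_train:n_train + n_valid]
        test += indices[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test


# Toy labels: 20 rows per class; illustrative only.
labels = ["Normal"] * 20 + ["At-Risk"] * 20 + ["Diabetic"] * 20
train, valid, test = stratified_split(labels)
print(len(train), len(valid), len(test))
```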
Evaluation
| Metric | Test Set |
|---|---|
| Accuracy | TODO |
| Macro F1 | TODO |
| ROC-AUC (OVR) | TODO |
Confusion Matrix (example format)
| | Pred: Normal | Pred: At-Risk | Pred: Diabetic |
|---|---|---|---|
| True: Normal | TODO | TODO | TODO |
| True: At-Risk | TODO | TODO | TODO |
| True: Diabetic | TODO | TODO | TODO |
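The confusion matrix layout and the Macro F1 metric can be computed from true and predicted labels as in the sketch below. The label order and the toy data are assumptions for illustration and are not results from this model. Classes with no true or predicted samples are given an F1 of 0.0, which is a convention rather than a requirement.

```python
LABELS = ["Normal", "At-Risk", "Diabetic"]


def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Rows are true classes, columns are predicted classes."""
    position = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[position[true_label]][position[predicted_label]] += 1
    return matrix


def macro_f1(matrix):
    """Unweighted mean of per-class F1 scores from a confusion matrix."""
    scores = []
    for k in range(len(matrix)):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp
        fp = sum(row[k] for row in matrix) - tp
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Toy labels; illustrative only.
y_true = ["Normal", "Normal", "At-Risk", "At-Risk", "Diabetic", "Diabetic"]
y_pred = ["Normal", "At-Risk", "At-Risk", "At-Risk", "Diabetic", "Normal"]

matrix = confusion_matrix(y_true, y_pred)
for label, row in zip(LABELS, matrix):
    print(f"True:{label:<9}", row)
print("Macro F1:", round(macro_f1(matrix), 3))
```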
Input Schema
Input columns must match the names and order used by the training pipeline. Example schema from project context:
expected_columns = [
    "age_group", "record_id", "age", "village_no", "village_name", "screening_date",
    "bp_systolic", "bp_diastolic", "weight", "height",
    # ... add remaining features
]
TODO: Fill with the exact column list and datatypes.
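Because a mismatch in column names or order can silently corrupt predictions, it may help to check each record before it reaches the model. The sketch below is one possible guard; `check_columns`, `order_record`, and the shortened `demo_columns` list are hypothetical and stand in for the real schema, which is still a TODO.

```python
def check_columns(record, expected_columns):
    """Return (missing, unexpected) column names for one input record."""
    missing = [name for name in expected_columns if name not in record]
    unexpected = [name for name in record if name not in expected_columns]
    return missing, unexpected


def order_record(record, expected_columns):
    """Return the record's values in training order, or raise on a mismatch."""
    missing, unexpected = check_columns(record, expected_columns)
    if missing or unexpected:
        raise ValueError(f"missing={missing}, unexpected={unexpected}")
    return [record[name] for name in expected_columns]


# Shortened column list for illustration only; not the full schema.
demo_columns = ["age", "bp_systolic", "bp_diastolic"]
record = {"bp_systolic": 146, "age": 64, "bp_diastolic": 90}
print(order_record(record, demo_columns))
```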
Inference
1) Load from pickle file
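import pickle

import pandas as pd

# Only unpickle files from a trusted source: pickle.load can execute arbitrary code.
with open("PhailomXgboost_dm_model.pkl", "rb") as f:
    model = pickle.load(f)

# One example record; the values are illustrative.
# expected_columns is the list defined in the Input Schema section.
X = pd.DataFrame([{
    "age_group": "60-69",
    "record_id": 1,
    "age": 64,
    "village_no": 5,
    "village_name": "SampleVillage",
    "screening_date": "2025-07-01",
    "bp_systolic": 146,
    "bp_diastolic": 90,
    "weight": 68.0,
    "height": 160.0,
    # ... include all expected features
}], columns=expected_columns)

proba = model.predict_proba(X)[0]      # class probabilities for the single row
pred = model.classes_[proba.argmax()]  # label with the highest probability
print(pred, proba)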
2) Export to XGBoost native format (recommended for HF)
# Assumes `model` is the fitted XGBoost scikit-learn estimator loaded in step 1.
model.get_booster().save_model("model.json")
Environment & Reproducibility
- Python: TODO
- xgboost: TODO
- scikit-learn: TODO
- pandas/numpy: TODO
- Random seed: 42
Attach:
- `requirements.txt`
- training script/preprocessing code
- evaluation reports and figures
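The version TODOs in this section can be filled in by querying the environment in which the model was trained. The sketch below uses only the standard library (`importlib.metadata`, available from Python 3.8); `collect_versions` is a hypothetical helper, and the package list simply mirrors the libraries named in this section.

```python
import platform
from importlib import metadata


def collect_versions(packages=("xgboost", "scikit-learn", "pandas", "numpy")):
    """Return the Python version plus the installed version of each package."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


for name, version in collect_versions().items():
    print(f"{name}: {version}")
```

Recording these values matters for pickled models in particular, since a pickle created with one library version may fail to load, or behave differently, under another.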
Validation & Monitoring
- Adjust classification thresholds to the sensitivity/specificity trade-off that each public health context requires.
- Monitor drift when applied to new populations.
- Revalidate if data collection tools change.
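Drift monitoring can take many forms, and this card does not prescribe one. As one common heuristic, the Population Stability Index (PSI) compares how a feature is distributed across bins in the training data versus new data. The sketch below is an assumption-laden illustration: the bin edges, the toy blood pressure values, and the helper `population_stability_index` are all hypothetical.

```python
import math


def population_stability_index(expected, actual, edges):
    """PSI between two samples of one numeric feature, using shared bin edges.

    edges: sorted inner cut points; values fall into len(edges) + 1 bins.
    """
    def bin_shares(values):
        counts = [0] * (len(edges) + 1)
        for value in values:
            counts[sum(value >= edge for edge in edges)] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(count / len(values), 1e-6) for count in counts]

    e_shares = bin_shares(expected)
    a_shares = bin_shares(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_shares, a_shares))


# Toy systolic blood pressure samples; illustrative only.
training = [110, 118, 125, 132, 140, 146, 150, 158]
new_site = [128, 135, 142, 149, 155, 160, 166, 172]
edges = [120, 140, 160]

print(round(population_stability_index(training, training, edges), 4))
print(round(population_stability_index(training, new_site, edges), 4))
```

A PSI of zero means the binned distributions are identical, and larger values indicate a larger shift. Any alert threshold should be chosen and validated for the specific deployment.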
Citation
TODO: Add references or project details for citation.