Model Card: PhailomXgboost_dm_model
license: unknown  # TODO: choose a license (e.g., mit, apache-2.0, cc-by-4.0)
library_name: xgboost
tags:
- xgboost
- classification
- tabular-data
- healthcare
- NCD
- diabetes-risk
language:
- en
- th
model-index:
- name: PhailomXgboost_dm_model
  results:
  - task:
      type: tabular-classification
    dataset:
      name: TODO-dataset-name
      type: private
      split: test
    metrics:
    - type: accuracy
      value: TODO
    - type: f1
      value: TODO
    - type: roc_auc
      value: TODO
Model Summary
PhailomXgboost_dm_model is an XGBoost classifier developed for early-stage screening of non-communicable diseases (NCDs), with a focus on diabetes risk prediction using community health screening data. The model outputs three classes: Normal, At-Risk, and Diabetic, making it suitable for cost-effective and rapid community-level health assessments.
Intended Use & Limitations
Intended use
- Community-level health screening for diabetes/NCD risk.
- Educational and research purposes (health data mining, public health informatics).
- Integration into dashboards or lightweight apps (e.g., Streamlit, Hugging Face Spaces).
Not for
- Direct clinical diagnosis.
- Replacement for laboratory tests or medical professionals.
Limitations
- Performance depends heavily on data quality (missing values, outliers).
- Potential bias if the dataset is imbalanced across classes.
- Threshold tuning is required to balance sensitivity and specificity for different contexts.
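The threshold-tuning point can be made concrete with a minimal sketch. It assumes the model returns one probability per class in the order Normal, At-Risk, Diabetic. The function name `screen_with_threshold` and the 0.30 cutoff are illustrative and are not values from this project.

```python
import numpy as np

# Assumed class order; check model.classes_ for the real mapping.
CLASS_NAMES = ("Normal", "At-Risk", "Diabetic")


def screen_with_threshold(proba, diabetic_threshold=0.30):
    """Label each row, flagging Diabetic whenever its probability reaches
    the cutoff and falling back to the most probable class otherwise.

    proba: array of shape (n_samples, 3), columns ordered as CLASS_NAMES.
    diabetic_threshold: hypothetical cutoff; lowering it raises sensitivity
    for the Diabetic class at the cost of specificity.
    """
    proba = np.asarray(proba, dtype=float)
    labels = proba.argmax(axis=1)  # default decision: most probable class
    diabetic_idx = CLASS_NAMES.index("Diabetic")
    labels[proba[:, diabetic_idx] >= diabetic_threshold] = diabetic_idx
    return [CLASS_NAMES[i] for i in labels]


# Synthetic probabilities, for illustration only.
example = np.array([
    [0.50, 0.15, 0.35],  # argmax says Normal, the cutoff says Diabetic
    [0.70, 0.20, 0.10],  # stays Normal
])
print(screen_with_threshold(example))
```

The cutoff would normally be chosen on the validation split, by comparing sensitivity and specificity across several candidate values.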
Ethical Considerations
- Respect data privacy (PDPA/GDPR compliance).
- Communicate clearly that this model is a screening tool, not a diagnostic system.
- Regularly validate fairness across subgroups (gender, age, region).
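One way to check subgroup fairness is to compare a single metric across groups. The sketch below computes recall for the Diabetic class per subgroup. The column names (`gender`, `y_true`, `y_pred`) and the toy rows are assumptions for illustration, not project data.

```python
import pandas as pd


def recall_by_subgroup(df, group_col, true_col="y_true", pred_col="y_pred",
                       positive="Diabetic"):
    """Return recall for `positive` within each subgroup of `group_col`.

    Column names are hypothetical; adapt them to the real evaluation table.
    Subgroups without any true `positive` rows are omitted from the result.
    """
    positives = df[df[true_col] == positive]
    hits = positives[pred_col] == positive
    return hits.groupby(positives[group_col]).mean()


# Synthetic evaluation rows, for illustration only.
results = pd.DataFrame({
    "gender": ["F", "F", "M", "M"],
    "y_true": ["Diabetic", "Diabetic", "Diabetic", "Diabetic"],
    "y_pred": ["Diabetic", "Diabetic", "Diabetic", "Normal"],
})
print(recall_by_subgroup(results, "gender"))
```

A large gap between subgroups would be a reason to inspect the data for that group before deployment.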
Data
Source: community health screening dataset (private, internal project).
Dataset size: ~3,418 records (balanced across Normal, At-Risk, Diabetic).
Example features:
- Demographics: Age, Age group, Village, Screening date
- Vitals: Systolic/diastolic blood pressure, Weight, Height, BMI
- Contextual variables: Household or screening group identifiers
TODO: Fill in exact feature schema, units (e.g., mmHg, kg, cm), and preprocessing methods.
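Because the units are still to be confirmed, a derived feature such as BMI is a useful reminder of why they matter. The sketch below assumes weight in kilograms and height in centimetres, which is a guess based on the example values (68.0 and 160.0) in the inference section. The function name `bmi` is illustrative.

```python
def bmi(weight_kg, height_cm):
    """Body mass index in kg/m^2, assuming weight in kg and height in cm."""
    if height_cm <= 0:
        raise ValueError("height_cm must be positive")
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


print(round(bmi(68.0, 160.0), 1))
```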
Training Procedure
Model: XGBoost (tree-based gradient boosting), multi-class classification.
Objective: `multi:softprob` (multi-class probability prediction).
Preprocessing:
- Missing values handled by imputation.
- One-hot or ordinal encoding for categorical features.
- Stratified split into training/validation/test.
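The stratified split can be sketched with scikit-learn's `train_test_split`. The 70/15/15 proportions and the helper name `stratified_three_way_split` are assumptions, since the actual split sizes are not stated here. The seed of 42 follows the reproducibility section.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def stratified_three_way_split(X, y, val_size=0.15, test_size=0.15, seed=42):
    """Split into train/validation/test while keeping class proportions.

    val_size and test_size are hypothetical fractions of the full dataset.
    """
    # First carve off the combined validation + test portion.
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=val_size + test_size, stratify=y, random_state=seed)
    # Then divide that portion between validation and test.
    rel_test = test_size / (val_size + test_size)
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=rel_test, stratify=y_tmp, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


# Synthetic data with three equally sized classes, for illustration only.
X = np.arange(120).reshape(60, 2)
y = np.repeat([0, 1, 2], 20)
train, val, test = stratified_three_way_split(X, y)
print(len(train[1]), len(val[1]), len(test[1]))
```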
Hyperparameters tuned: `max_depth`, `learning_rate` (`eta`), `subsample`, `colsample_bytree`, `min_child_weight`, `n_estimators`.
Evaluation metrics: Accuracy, Macro-F1, ROC-AUC (One-vs-Rest).
TODO: Insert actual hyperparameters and results.
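Once results are available, the three metrics named in this section could be computed along these lines. This is a sketch that assumes integer-encoded labels (0, 1, 2) and a probability matrix such as the output of `predict_proba`. The helper name `evaluate_multiclass` and the toy arrays are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score


def evaluate_multiclass(y_true, proba):
    """Compute accuracy, macro-F1, and one-vs-rest ROC-AUC.

    y_true: integer labels in 0..n_classes-1.
    proba: (n_samples, n_classes) probabilities whose rows sum to 1.
    """
    y_true = np.asarray(y_true)
    proba = np.asarray(proba, dtype=float)
    y_pred = proba.argmax(axis=1)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "roc_auc_ovr": roc_auc_score(y_true, proba, multi_class="ovr"),
    }


# Synthetic labels and probabilities, for illustration only.
y_demo = [0, 1, 2, 0, 1, 2]
proba_demo = [
    [0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7],
    [0.6, 0.3, 0.1], [0.3, 0.5, 0.2], [0.2, 0.2, 0.6],
]
print(evaluate_multiclass(y_demo, proba_demo))
```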
Evaluation
| Metric | Test Set |
|---|---|
| Accuracy | TODO |
| Macro F1 | TODO |
| ROC-AUC (OVR) | TODO |
Confusion matrix (example format)

| | Pred: Normal | Pred: At-Risk | Pred: Diabetic |
|---|---|---|---|
| True: Normal | TODO | TODO | TODO |
| True: At-Risk | TODO | TODO | TODO |
| True: Diabetic | TODO | TODO | TODO |
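A table in this layout can be produced from true and predicted labels with scikit-learn's `confusion_matrix`. The sketch below uses made-up labels purely to show the shape of the output. The helper name `confusion_table` is illustrative.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

LABELS = ["Normal", "At-Risk", "Diabetic"]


def confusion_table(y_true, y_pred, labels=LABELS):
    """Return a labelled confusion matrix: rows are true classes, columns
    are predicted classes, in the fixed order given by `labels`."""
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    return pd.DataFrame(
        cm,
        index=[f"True:{name}" for name in labels],
        columns=[f"Pred:{name}" for name in labels],
    )


# Synthetic labels, for illustration only.
y_true_demo = ["Normal", "Normal", "At-Risk", "Diabetic", "Diabetic"]
y_pred_demo = ["Normal", "At-Risk", "At-Risk", "Diabetic", "At-Risk"]
print(confusion_table(y_true_demo, y_pred_demo))
```

Passing `labels` explicitly keeps the row and column order stable even when a class is absent from a batch.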
Input Schema
Input columns must be supplied in the same order the training pipeline used. Example schema from project context:
expected_columns = [
    "age_group", "record_id", "age", "village_no", "village_name", "screening_date",
    "bp_systolic", "bp_diastolic", "weight", "height",
    # ... add remaining features
]
TODO: Fill with the exact column list and datatypes.
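Until the exact schema is filled in, a small guard can catch column mismatches before they reach the model. The sketch below is one possible approach. The helper name `align_to_schema` and the shortened three-column schema are hypothetical.

```python
import pandas as pd


def align_to_schema(df, expected_columns):
    """Return df with exactly expected_columns, in that order.

    Raises ValueError listing any missing or unexpected columns instead of
    silently filling or dropping them.
    """
    missing = [c for c in expected_columns if c not in df.columns]
    extra = [c for c in df.columns if c not in expected_columns]
    if missing or extra:
        raise ValueError(
            f"missing columns: {missing}; unexpected columns: {extra}")
    return df[list(expected_columns)]


# Shortened, hypothetical schema and one out-of-order record.
schema = ["age", "bp_systolic", "bp_diastolic"]
raw = pd.DataFrame([{"bp_diastolic": 90, "age": 64, "bp_systolic": 146}])
print(align_to_schema(raw, schema).columns.tolist())
```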
Inference
1) Load from pickle file
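import pickle

import pandas as pd

# Only unpickle files from a trusted source: pickle.load can execute arbitrary code.
with open("PhailomXgboost_dm_model.pkl", "rb") as f:
    model = pickle.load(f)

# One screening record; the keys must cover every entry in expected_columns.
X = pd.DataFrame([{
    "age_group": "60-69",
    "record_id": 1,
    "age": 64,
    "village_no": 5,
    "village_name": "SampleVillage",
    "screening_date": "2025-07-01",
    "bp_systolic": 146,
    "bp_diastolic": 90,
    "weight": 68.0,
    "height": 160.0,
    # ... include all expected features
}], columns=expected_columns)

proba = model.predict_proba(X)[0]      # class probabilities for the single row
pred = model.classes_[proba.argmax()]  # label with the highest probability
print(pred, proba)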
2) Use XGBoost native format (recommended for HF)
model.get_booster().save_model("model.json")
Environment & Reproducibility
- Python: TODO
- xgboost: TODO
- scikit-learn: TODO
- pandas/numpy: TODO
- Random seed: 42
Attach:
- requirements.txt
- training script/preprocessing code
- evaluation reports and figures
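The version entries marked TODO can be read from the training environment with the standard library. The sketch below is one way to do so. The helper name `collect_versions` is illustrative, and the package list simply mirrors the bullets in this section.

```python
import platform
from importlib import metadata


def collect_versions(packages=("xgboost", "scikit-learn", "pandas", "numpy")):
    """Return {name: version} for Python and the given distributions."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


for name, version in collect_versions().items():
    print(f"{name}: {version}")
```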
Validation & Monitoring
- Adjust classification thresholds to the sensitivity/specificity needs of each public health context.
- Monitor drift when applied to new populations.
- Revalidate if data collection tools change.
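Drift monitoring can start with a per-feature comparison between the training data and a new population. The sketch below uses the population stability index (PSI). This is one common choice and not a method stated for this project, and the synthetic blood-pressure values are for illustration only.

```python
import numpy as np


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between two 1-D numeric samples, binned on `reference` quantiles.

    0 means identical binned distributions; larger values mean more drift.
    """
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior edges from reference quantiles: roughly equal mass per bin.
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1)[1:-1])
    ref_idx = np.searchsorted(edges, reference, side="right")
    cur_idx = np.searchsorted(edges, current, side="right")
    ref_frac = np.bincount(ref_idx, minlength=n_bins) / reference.size
    cur_frac = np.bincount(cur_idx, minlength=n_bins) / current.size
    # Clip so that empty bins do not produce log(0) or division by zero.
    ref_frac = np.clip(ref_frac, eps, None)
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


# Synthetic systolic blood pressure samples, for illustration only.
rng = np.random.default_rng(42)
train_bp = rng.normal(130, 15, size=2000)
new_bp = train_bp + 20  # simulated upward shift in a new population
print(population_stability_index(train_bp, new_bp))
```

The PSI level that should trigger revalidation is a project decision and would need to be agreed with the public health team.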
Citation
TODO: Add references or project details for citation.