


🧾 Model Card: PhailomXgboost_dm_model

license: unknown           # TODO: choose a license (e.g., mit, apache-2.0, cc-by-4.0)
library_name: xgboost
tags:
  - xgboost
  - classification
  - tabular-data
  - healthcare
  - NCD
  - diabetes-risk
language:
  - en
  - th
model-index:
  - name: PhailomXgboost_dm_model
    results:
      - task:
          type: tabular-classification
        dataset:
          name: TODO-dataset-name
          type: private
          split: test
        metrics:
          - type: accuracy
            value: TODO
          - type: f1
            value: TODO
          - type: roc_auc
            value: TODO

📌 Model Summary

PhailomXgboost_dm_model is an XGBoost classifier developed for early-stage screening of non-communicable diseases (NCDs), with a focus on diabetes risk prediction using community health screening data. The model outputs three classes: Normal, At-Risk, and Diabetic, making it suitable for cost-effective and rapid community-level health assessments.


🧠 Intended Use & Limitations

Intended use

  • Community-level health screening for diabetes/NCD risk.
  • Educational and research purposes (health data mining, public health informatics).
  • Integration into dashboards or lightweight apps (e.g., Streamlit, Hugging Face Spaces).

Not for

  • Direct clinical diagnosis.
  • Replacement for laboratory tests or medical professionals.

Limitations

  • Performance depends heavily on data quality (missing values, outliers).
  • Potential bias if the dataset is imbalanced across classes.
  • Threshold tuning is required to balance sensitivity and specificity for different contexts.

🧯 Ethical Considerations

  • Respect data privacy (PDPA/GDPR compliance).
  • Communicate clearly that this model is a screening tool, not a diagnostic system.
  • Regularly validate fairness across subgroups (gender, age, region).
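
The subgroup check in the last bullet could be sketched as follows. This is a minimal illustration with pandas, assuming a hypothetical evaluation table with columns `gender`, `y_true`, and `y_pred`; the helper name `accuracy_by_group` and the rows are made-up placeholders, not project code or data.

```python
import pandas as pd

def accuracy_by_group(df: pd.DataFrame, group_col: str,
                      true_col: str = "y_true", pred_col: str = "y_pred") -> pd.Series:
    """Return the fraction of correct predictions within each subgroup."""
    correct = df[true_col] == df[pred_col]
    return correct.groupby(df[group_col]).mean()

# Made-up placeholder rows for illustration only (not project data).
eval_df = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "M", "F"],
    "y_true": ["Normal", "At-Risk", "Diabetic", "Normal", "At-Risk", "Diabetic"],
    "y_pred": ["Normal", "Normal", "Diabetic", "Normal", "Diabetic", "Diabetic"],
})
per_group = accuracy_by_group(eval_df, "gender")
print(per_group)
```

The same helper can be reused with an age-band or region column. A large gap between groups would be a prompt to look closer, not a verdict on its own.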

πŸ—‚οΈ Data

  • Source: community health screening dataset (private, internal project).

  • Dataset size: ~3,418 records (balanced across Normal, At-Risk, Diabetic).

  • Example features:

    • Demographics: Age, Age group, Village, Screening date
    • Vitals: Systolic/diastolic blood pressure, Weight, Height, BMI
    • Contextual variables: Household or screening group identifiers
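
Because BMI is listed next to weight and height, it is presumably derived from them. The sketch below shows the standard formula, assuming weight in kilograms and height in centimetres; the units are not yet confirmed by this card, and the function name `bmi` is hypothetical.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_cm <= 0:
        raise ValueError("height_cm must be positive")
    height_m = height_cm / 100.0  # assumed unit conversion: cm to m
    return weight_kg / (height_m ** 2)

# Values taken from the sample record in the Inference section.
value = bmi(68.0, 160.0)
print(round(value, 2))
```

If the stored BMI column was computed differently during training (for example, with rounding), inference inputs should follow that same convention.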

TODO: Fill in exact feature schema, units (e.g., mmHg, kg, cm), and preprocessing methods.


πŸ—οΈ Training Procedure

  • Model: XGBoost (tree-based gradient boosting), multi-class classification.

  • Objective: multi:softprob (multi-class probability prediction).

  • Preprocessing:

    • Missing values handled by imputation.
    • One-hot or ordinal encoding for categorical features.
    • Stratified split into training/validation/test.
  • Hyperparameters tuned: max_depth, learning_rate (eta), subsample, colsample_bytree, min_child_weight, n_estimators.

  • Evaluation Metrics: Accuracy, Macro-F1, ROC-AUC (One-vs-Rest).

TODO: Insert actual hyperparameters and results.


📈 Evaluation

Metric           Test Set
Accuracy         TODO
Macro F1         TODO
ROC-AUC (OvR)    TODO

Confusion Matrix (example format)

            Pred:Normal  Pred:At-Risk  Pred:Diabetic
True:Normal      TODO        TODO          TODO
True:At-Risk     TODO        TODO          TODO
True:Diabetic    TODO        TODO          TODO
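
Once the matrix is filled in, accuracy and macro-F1 can be derived from it directly. The following is a minimal NumPy sketch, assuming rows are true classes and columns are predicted classes as in the layout above; the function name and the counts are placeholders for illustration and are NOT results for this model.

```python
import numpy as np

def metrics_from_confusion(cm) -> dict:
    """Accuracy and macro-F1 from a square confusion matrix (rows = true, columns = predicted)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    predicted = cm.sum(axis=0)  # column totals: how often each class was predicted
    actual = cm.sum(axis=1)     # row totals: how often each class truly occurred
    # Guard against division by zero for classes never predicted or never present.
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return {"accuracy": float(tp.sum() / cm.sum()), "macro_f1": float(f1.mean())}

# Placeholder counts for illustration only; these are NOT results for this model.
example_cm = np.array([
    [8, 1, 1],
    [2, 6, 2],
    [0, 2, 8],
])
scores = metrics_from_confusion(example_cm)
print(scores)
```

Macro-F1 averages the per-class F1 scores with equal weight, so a weak class such as At-Risk lowers the score even when overall accuracy looks acceptable.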

🧩 Input Schema

Input columns must appear in the same order as in the training pipeline. An example schema drawn from the project context:

expected_columns = [
  "age_group", "record_id", "age", "village_no", "village_name", "screening_date",
  "bp_systolic", "bp_diastolic", "weight", "height",
  # ... add remaining features
]

TODO: Fill with the exact column list and datatypes.
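
Until the exact schema is documented, a small guard can catch missing or misordered columns before prediction. This is a sketch with pandas; the helper name `align_to_schema` and the shortened column list are hypothetical and stand in for the real schema.

```python
import pandas as pd

def align_to_schema(df: pd.DataFrame, expected_columns: list) -> pd.DataFrame:
    """Check that df has exactly the expected columns and return them in training order."""
    missing = [c for c in expected_columns if c not in df.columns]
    extra = [c for c in df.columns if c not in expected_columns]
    if missing or extra:
        raise ValueError(f"missing columns: {missing}; unexpected columns: {extra}")
    return df[expected_columns]

# Shortened, hypothetical schema for illustration; the real list is still a TODO.
demo_columns = ["age", "bp_systolic", "bp_diastolic"]
row = pd.DataFrame([{"bp_diastolic": 90, "age": 64, "bp_systolic": 146}])
aligned = align_to_schema(row, demo_columns)
print(list(aligned.columns))
```

Failing loudly on a mismatch is a deliberate choice here: silently reordering or dropping columns could feed the model values in the wrong positions.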


🚀 Inference

1) Load from pickle file

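import pickle

import pandas as pd

# Only unpickle files from a trusted source: pickle.load can execute arbitrary code.
with open("PhailomXgboost_dm_model.pkl", "rb") as f:
    model = pickle.load(f)

# expected_columns is the list from the Input Schema section; every feature must be present.
X = pd.DataFrame([{
    "age_group": "60-69",
    "record_id": 1,
    "age": 64,
    "village_no": 5,
    "village_name": "SampleVillage",
    "screening_date": "2025-07-01",
    "bp_systolic": 146,
    "bp_diastolic": 90,
    "weight": 68.0,
    "height": 160.0,
    # ... include all expected features
}], columns=expected_columns)

# Class probabilities for the single row, then the label with the highest probability.
proba = model.predict_proba(X)[0]
pred = model.classes_[proba.argmax()]
print(pred, proba)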

2) Export to XGBoost native format (recommended for Hugging Face)

# Save the fitted booster as JSON, a format that does not depend on pickle.
model.get_booster().save_model("model.json")

βš™οΈ Environment & Reproducibility

  • Python: TODO
  • xgboost: TODO
  • scikit-learn: TODO
  • pandas/numpy: TODO
  • Random seed: 42

Attach:

  • requirements.txt
  • training script/preprocessing code
  • evaluation reports and figures

🧪 Validation & Monitoring

  • Adjust classification thresholds to the target public health context (for example, favoring sensitivity in mass screening).
  • Monitor drift when applied to new populations.
  • Revalidate if data collection tools change.
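
For a three-class model, one way to read the threshold bullet above is to flag a person as Diabetic whenever that class probability reaches a chosen cutoff, instead of always taking the most probable class. The sketch below illustrates the idea; the class order, the 0.30 cutoff, and the function name `screen` are assumptions, and the probability rows are hypothetical and not model output.

```python
import numpy as np

CLASSES = ["Normal", "At-Risk", "Diabetic"]  # assumed order; check model.classes_

def screen(proba, diabetic_cutoff: float = 0.30) -> list:
    """Label each row, choosing 'Diabetic' whenever its probability reaches the cutoff."""
    proba = np.asarray(proba, dtype=float)
    labels = []
    for row in proba:
        if row[2] >= diabetic_cutoff:
            labels.append("Diabetic")
        else:
            # Otherwise fall back to the most probable class.
            labels.append(CLASSES[int(row.argmax())])
    return labels

# Hypothetical probability rows, not model output.
demo = np.array([
    [0.70, 0.20, 0.10],
    [0.40, 0.25, 0.35],
    [0.20, 0.60, 0.20],
])
print(screen(demo))
```

Lowering the cutoff flags more people for follow-up (higher sensitivity, lower specificity), so the value should be chosen on a validation set with input from the screening program.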

📣 Citation

TODO: Add references or project details for citation.