PhailomNCDs / README.md

🧾 Model Card — PhailomXgboost_dm_model

license: unknown           # TODO: choose a license (e.g., mit, apache-2.0, cc-by-4.0)
library_name: xgboost
tags:
  - xgboost
  - classification
  - tabular-data
  - healthcare
  - NCD
  - diabetes-risk
language:
  - en
  - th
model-index:
  - name: PhailomXgboost_dm_model
    results:
      - task:
          type: tabular-classification
        dataset:
          name: TODO-dataset-name
          type: private
          split: test
        metrics:
          - type: accuracy
            value: TODO
          - type: f1
            value: TODO
          - type: roc_auc
            value: TODO

📌 Model Summary

PhailomXgboost_dm_model is an XGBoost classifier for early-stage screening of non-communicable diseases (NCDs), focused on predicting diabetes risk from community health screening data. It assigns each record to one of three classes (Normal, At-Risk, or Diabetic) and is intended for rapid, low-cost health assessments at the community level.

🧠 Intended Use & Limitations

Intended use

Community-level health screening for diabetes/NCD risk.
Educational and research purposes (health data mining, public health informatics).
Integration into dashboards or lightweight apps (e.g., Streamlit, Hugging Face Spaces).

Not for

Direct clinical diagnosis.
Replacement for laboratory tests or medical professionals.

Limitations

Performance depends heavily on data quality (missing values, outliers).
Potential bias if the dataset is imbalanced across classes.
Threshold tuning is required to balance sensitivity and specificity for different contexts.
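
As a minimal sketch of what threshold tuning could look like for a three-class output, the function below applies a screening-oriented decision rule to a probability vector instead of taking the arg-max. The class order, the function name `screen`, and both threshold values are illustrative assumptions, not values from this project; real thresholds would have to be chosen on a validation set.

```python
# Hypothetical decision rule: the class order (Normal, At-Risk, Diabetic)
# and the threshold values are assumptions for illustration.

def screen(proba, diabetic_threshold=0.30, risk_threshold=0.40):
    """Map class probabilities to a label using screening-oriented thresholds.

    Lower thresholds flag more people for follow-up (higher sensitivity)
    at the cost of more false alarms (lower specificity).
    """
    _p_normal, p_at_risk, p_diabetic = proba
    if p_diabetic >= diabetic_threshold:
        return "Diabetic"
    # Treat At-Risk and Diabetic together as "needs follow-up".
    if p_at_risk + p_diabetic >= risk_threshold:
        return "At-Risk"
    return "Normal"


print(screen([0.55, 0.20, 0.25]))  # At-Risk, although arg-max would pick Normal
```

Lowering either threshold trades specificity for sensitivity, which may be the preferred direction for mass screening where a missed case is costlier than an extra follow-up visit.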

🧯 Ethical Considerations

Respect data privacy (PDPA/GDPR compliance).
Communicate clearly that this model is a screening tool, not a diagnostic system.
Regularly validate fairness across subgroups (gender, age, region).

🗂️ Data

Source: community health screening dataset (private, internal project).
Dataset size: approximately 3,418 records, balanced across the Normal, At-Risk, and Diabetic classes.
Example features:
- Demographics: Age, Age group, Village, Screening date
- Vitals: Systolic/diastolic blood pressure, Weight, Height, BMI
- Contextual variables: Household or screening group identifiers

TODO: Fill in exact feature schema, units (e.g., mmHg, kg, cm), and preprocessing methods.
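
Because BMI is derived from weight and height, a small helper can make the unit assumptions explicit. This sketch assumes weight in kilograms and height in centimetres, which is consistent with the sample values used in the inference example of this card but is not confirmed by the feature schema; the function name `bmi` is hypothetical.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index in kg/m^2, assuming weight in kg and height in cm."""
    if weight_kg <= 0 or height_cm <= 0:
        raise ValueError("weight and height must be positive")
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


print(round(bmi(68.0, 160.0), 1))  # 26.6
```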

🏗️ Training Procedure

Model: XGBoost (tree-based gradient boosting), multi-class classification.
Objective: multi:softprob (multi-class probability prediction).
Preprocessing:
- Missing values handled by imputation.
- One-hot or ordinal encoding for categorical features.
- Stratified split into training/validation/test.
Hyperparameters tuned: max_depth, learning_rate (eta), subsample, colsample_bytree, min_child_weight, n_estimators.
Evaluation Metrics: Accuracy, Macro-F1, ROC-AUC (One-vs-Rest).
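
The stratified split in the preprocessing list can be sketched without any ML library. The version below is only an illustration: the 70/15/15 fractions and the helper name `stratified_split` are assumptions, and the seed of 42 simply mirrors the random seed stated in this card. The labels are toy values, not project data.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy labels for illustration only (not project data).
labels = ["Normal"] * 20 + ["At-Risk"] * 20 + ["Diabetic"] * 20
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

When a split like this is used, imputers and encoders are normally fitted on the training indices only and then applied to the validation and test rows, so that no test-set statistics leak into training.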

TODO: Insert actual hyperparameters and results.

📈 Evaluation

Metric           Test Set
Accuracy         TODO
Macro F1         TODO
ROC-AUC (OvR)    TODO

Confusion Matrix (example format)

            Pred:Normal  Pred:At-Risk  Pred:Diabetic
True:Normal      TODO        TODO          TODO
True:At-Risk     TODO        TODO          TODO
True:Diabetic    TODO        TODO          TODO
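
To show how the confusion matrix, accuracy, and macro-F1 relate to each other, here is a dependency-free sketch. The labels are toy values for illustration and are not results from this model; the function names are hypothetical. ROC-AUC is left out because it needs predicted probabilities; in a real evaluation, scikit-learn's `accuracy_score`, `f1_score(average="macro")`, and `roc_auc_score(multi_class="ovr")` would be the more usual choice.

```python
CLASSES = ["Normal", "At-Risk", "Diabetic"]


def confusion_matrix(y_true, y_pred, classes=CLASSES):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Share of records on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def macro_f1(matrix):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for k in range(len(matrix)):
        tp = matrix[k][k]
        fp = sum(row[k] for row in matrix) - tp  # predicted k, true class differs
        fn = sum(matrix[k]) - tp                 # true k, predicted as another class
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for illustration only (not project results).
y_true = ["Normal", "Normal", "At-Risk", "At-Risk", "Diabetic", "Diabetic"]
y_pred = ["Normal", "At-Risk", "At-Risk", "At-Risk", "Diabetic", "Normal"]
cm = confusion_matrix(y_true, y_pred)
print(cm)
print(accuracy(cm), macro_f1(cm))
```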

🧩 Input Schema

Input columns must match the names and order used in the training pipeline. Example schema from the project context:

expected_columns = [
  "age_group", "record_id", "age", "village_no", "village_name", "screening_date",
  "bp_systolic", "bp_diastolic", "weight", "height",
  # ... add remaining features
]

TODO: Fill with the exact column list and datatypes.
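
One way to enforce the column contract before inference is a small check that rejects records with missing or unexpected fields and returns the values in training order. This is a sketch under assumptions: the helper `order_record` is hypothetical, and `EXPECTED_COLUMNS` contains only the ten columns shown in the example schema, since the remaining features are not listed in this card.

```python
# Partial list copied from the example schema; the remaining features are unknown here.
EXPECTED_COLUMNS = [
    "age_group", "record_id", "age", "village_no", "village_name",
    "screening_date", "bp_systolic", "bp_diastolic", "weight", "height",
]


def order_record(record, expected=EXPECTED_COLUMNS):
    """Return the record's values in training order, or raise on a schema mismatch."""
    missing = [c for c in expected if c not in record]
    extra = [c for c in record if c not in expected]
    if missing or extra:
        raise ValueError(f"missing columns: {missing}; unexpected columns: {extra}")
    return [record[c] for c in expected]


# Keys are deliberately out of order to show the reordering.
record = {
    "height": 160.0, "weight": 68.0, "age": 64, "age_group": "60-69",
    "record_id": 1, "village_no": 5, "village_name": "SampleVillage",
    "screening_date": "2025-07-01", "bp_systolic": 146, "bp_diastolic": 90,
}
print(order_record(record))
```

Failing loudly on a mismatch is a deliberate choice here: silently filling a missing column would produce a prediction that looks valid but rests on data the model never saw.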

🚀 Inference

1) Load from pickle file

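import pickle

import pandas as pd

# Only unpickle files from a source you trust: pickle can run arbitrary code on load.
with open("PhailomXgboost_dm_model.pkl", "rb") as f:
    model = pickle.load(f)

# One screening record; `expected_columns` is the list defined under Input Schema.
X = pd.DataFrame([{
    "age_group": "60-69",
    "record_id": 1,
    "age": 64,
    "village_no": 5,
    "village_name": "SampleVillage",
    "screening_date": "2025-07-01",
    "bp_systolic": 146,
    "bp_diastolic": 90,
    "weight": 68.0,
    "height": 160.0,
    # ... include all expected features
}], columns=expected_columns)

# Class probabilities for the single row, then the most probable class label.
proba = model.predict_proba(X)[0]
pred = model.classes_[proba.argmax()]
print(pred, proba)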

2) Use XGBoost native format (recommended for HF)

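# Export the underlying Booster as JSON; scikit-learn wrapper attributes such as classes_ are not saved.
model.get_booster().save_model("model.json")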

⚙️ Environment & Reproducibility

Python: TODO
xgboost: TODO
scikit-learn: TODO
pandas/numpy: TODO
Random seed: 42

Attach:

requirements.txt
training script/preprocessing code
evaluation reports and figures

🧪 Validation & Monitoring

Adjust classification thresholds to the target public health context (for example, favoring sensitivity in mass screening).
Monitor drift when applied to new populations.
Revalidate if data collection tools change.
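
Drift monitoring can take many forms. As one possible sketch, the Population Stability Index (PSI) compares how a feature is distributed in a reference sample and in a new sample. PSI is a technique swapped in here for illustration, not a method documented for this project; the function name, the bin edges, and the toy systolic blood pressure values (assumed to be in mmHg) are all assumptions.

```python
import math


def psi(reference, current, bin_edges, eps=1e-6):
    """Population Stability Index between a reference and a current sample."""
    def proportions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # The number of edges that v has reached is its bin index.
            counts[sum(v >= edge for edge in bin_edges)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    p_ref = proportions(reference)
    p_cur = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(p_ref, p_cur))


# Toy systolic blood pressure readings for illustration only (not project data).
reference = [110, 118, 125, 131, 138, 142, 150, 155]
shifted = [128, 135, 141, 149, 152, 158, 163, 170]
edges = [120, 140, 160]
print(psi(reference, reference, edges), psi(reference, shifted, edges))
```

A PSI of zero means the two binned distributions are identical; larger values indicate a larger shift and could be used as a trigger for revalidation.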

📣 Citation

TODO: Add references or project details for citation.