	# 🧾 Model Card — PhailomXgboost\_dm\_model

```yaml
license: unknown  # TODO: choose a license (e.g., mit, apache-2.0, cc-by-4.0)
library_name: xgboost
tags:
  - xgboost
  - classification
  - tabular-data
  - healthcare
  - NCD
  - diabetes-risk
language:
  - en
  - th
model-index:
  - name: PhailomXgboost_dm_model
    results:
      - task:
          type: tabular-classification
        dataset:
          name: TODO-dataset-name
          type: private
          split: test
        metrics:
          - type: accuracy
            value: TODO
          - type: f1
            value: TODO
          - type: roc_auc
            value: TODO
```

	---

	## 📌 Model Summary

	PhailomXgboost\_dm\_model is an XGBoost classifier developed for early-stage screening of non-communicable diseases (NCDs), with a focus on diabetes risk prediction using community health screening data.
The model assigns each record to one of three classes (Normal, At-Risk, or Diabetic) and is intended for rapid, low-cost community-level health assessments.

	---

	## 🧠 Intended Use & Limitations

	Intended use

	* Community-level health screening for diabetes/NCD risk.
	* Educational and research purposes (health data mining, public health informatics).
	* Integration into dashboards or lightweight apps (e.g., Streamlit, Hugging Face Spaces).

	Not for

	* Direct clinical diagnosis.
	* Replacement for laboratory tests or medical professionals.

	Limitations

	* Performance depends heavily on data quality (missing values, outliers).
	* Potential bias if the dataset is imbalanced across classes.
	* Threshold tuning is required to balance sensitivity and specificity for different contexts.
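
One way to approach the threshold tuning mentioned above is to flag the Diabetic class whenever its predicted probability reaches a chosen cutoff, instead of always taking the most probable class. The sketch below is illustrative only: the class order, the 0.30 cutoff, and the helper name `predict_with_threshold` are assumptions, not values from this project.

```python
import numpy as np

# Assumed class order; check model.classes_ on the real model.
CLASSES = np.array(["Normal", "At-Risk", "Diabetic"])


def predict_with_threshold(proba, diabetic_threshold=0.30):
    """Return labels, flagging 'Diabetic' whenever its probability
    reaches the threshold; otherwise use the most probable class."""
    proba = np.asarray(proba, dtype=float)
    labels = CLASSES[proba.argmax(axis=1)]
    diabetic_idx = int(np.where(CLASSES == "Diabetic")[0][0])
    labels[proba[:, diabetic_idx] >= diabetic_threshold] = "Diabetic"
    return labels


# Hypothetical probabilities for three people (columns follow CLASSES).
example = [
    [0.70, 0.20, 0.10],
    [0.45, 0.20, 0.35],
    [0.15, 0.60, 0.25],
]
print(predict_with_threshold(example))
```

Lowering the cutoff raises sensitivity for the Diabetic class at the cost of more false alarms, so the value should be chosen on a validation set for each deployment context.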

	---

	## 🧯 Ethical Considerations

	* Respect data privacy (PDPA/GDPR compliance).
	* Communicate clearly that this model is a screening tool, not a diagnostic system.
	* Regularly validate fairness across subgroups (gender, age, region).
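
A subgroup check of this kind can start by comparing recall for the Diabetic class across groups. The sketch below uses a small made-up table; the column names (`gender`, `y_true`, `y_pred`), the helper `recall_by_group`, and the data are all hypothetical.

```python
import pandas as pd

# Hypothetical evaluation results; replace with real held-out predictions.
results = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M", "M"],
    "y_true": ["Diabetic", "Diabetic", "Normal", "Diabetic", "Diabetic", "At-Risk"],
    "y_pred": ["Diabetic", "Diabetic", "Normal", "Diabetic", "Normal", "At-Risk"],
})


def recall_by_group(df, group_col, positive="Diabetic"):
    """Recall for the positive class within each subgroup."""
    positives = df[df["y_true"] == positive]
    hits = positives["y_pred"] == positive
    return hits.groupby(positives[group_col]).mean()


recall = recall_by_group(results, "gender")
print(recall)
```

A large gap between groups is a prompt to investigate, not proof of bias on its own, since small subgroups produce noisy estimates.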

	---

	## 🗂️ Data

	* Source: community health screening dataset (private, internal project).
* Dataset size: approximately 3,418 records, balanced across the three classes (Normal, At-Risk, Diabetic).
* Example features:
  * Demographics: Age, Age group, Village, Screening date
  * Vitals: Systolic/diastolic blood pressure, Weight, Height, BMI
  * Contextual variables: Household or screening group identifiers

	> TODO: Fill in exact feature schema, units (e.g., mmHg, kg, cm), and preprocessing methods.
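
Since BMI is listed alongside weight and height, it can be derived rather than entered by hand. A minimal sketch, assuming weight in kilograms and height in centimetres (the units are not yet confirmed for this dataset) and a hypothetical helper name `bmi`:

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight (kg) divided by height (m) squared."""
    if height_cm <= 0:
        raise ValueError("height must be positive")
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


# Weight and height taken from the sample record in the Inference section.
print(round(bmi(68.0, 160.0), 1))
```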

	---

	## 🏗️ Training Procedure

	* Model: XGBoost (tree-based gradient boosting), multi-class classification.
	* Objective: `multi:softprob` (multi-class probability prediction).
* Preprocessing:
  * Missing values handled by imputation.
  * One-hot or ordinal encoding for categorical features.
  * Stratified split into training/validation/test sets.
* Hyperparameters tuned: `max_depth`, `learning_rate` (`eta`), `subsample`, `colsample_bytree`, `min_child_weight`, `n_estimators`.
* Evaluation metrics: Accuracy, Macro-F1, ROC-AUC (One-vs-Rest).

	> TODO: Insert actual hyperparameters and results.

	---

	## 📈 Evaluation

| Metric        | Test Set |
| ------------- | -------- |
| Accuracy      | TODO     |
| Macro F1      | TODO     |
| ROC-AUC (OVR) | TODO     |

	Confusion Matrix (example format)

```
                Pred:Normal  Pred:At-Risk  Pred:Diabetic
True:Normal     TODO         TODO          TODO
True:At-Risk    TODO         TODO          TODO
True:Diabetic   TODO         TODO          TODO
```
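
A matrix in this layout, along with per-class recall (sensitivity), could be produced with scikit-learn along these lines. The labels and predictions are toy values for illustration, not project results.

```python
from sklearn.metrics import confusion_matrix

LABELS = ["Normal", "At-Risk", "Diabetic"]

# Toy ground truth and predictions (hypothetical, six records).
y_true = ["Normal", "Normal", "At-Risk", "At-Risk", "Diabetic", "Diabetic"]
y_pred = ["Normal", "At-Risk", "At-Risk", "At-Risk", "Diabetic", "Normal"]

# Rows are true classes, columns are predicted classes, in LABELS order.
cm = confusion_matrix(y_true, y_pred, labels=LABELS)

# Recall per class: correct predictions divided by the row total.
recall_per_class = cm.diagonal() / cm.sum(axis=1)

print(cm)
for label, recall in zip(LABELS, recall_per_class):
    print(f"recall[{label}] = {recall:.2f}")
```

Passing `labels=LABELS` fixes the row and column order, which otherwise defaults to sorted label order and would not match the table layout above.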

	---

	## 🧩 Input Schema

Input columns must have the same names and appear in the same order as in the training pipeline. Example schema from project context:

```python
expected_columns = [
    "age_group", "record_id", "age", "village_no", "village_name", "screening_date",
    "bp_systolic", "bp_diastolic", "weight", "height",
    # ... add remaining features
]
```

	> TODO: Fill with the exact column list and datatypes.
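
Until the full schema is documented, a small guard can catch missing or misordered columns before prediction. The helper `align_to_schema` below is hypothetical, and the column list is only the partial example shown above.

```python
import pandas as pd

# Partial list copied from the example schema; the full list is still a TODO.
expected_columns = [
    "age_group", "record_id", "age", "village_no", "village_name",
    "screening_date", "bp_systolic", "bp_diastolic", "weight", "height",
]


def align_to_schema(df, columns):
    """Fail on missing columns, then return df in the expected order."""
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df[columns]


# Columns arrive in a different order and with one extra field.
raw = pd.DataFrame([{
    "height": 160.0, "weight": 68.0, "age": 64, "age_group": "60-69",
    "record_id": 1, "village_no": 5, "village_name": "SampleVillage",
    "screening_date": "2025-07-01", "bp_systolic": 146, "bp_diastolic": 90,
    "notes": "extra column, dropped by the selection",
}])
X = align_to_schema(raw, expected_columns)
print(list(X.columns))
```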

	---

	## 🚀 Inference

	### 1) Load from pickle file

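```python
import pickle

import pandas as pd

# Assumes expected_columns (see Input Schema) is already defined.
# Only unpickle files from a source you trust.
with open("PhailomXgboost_dm_model.pkl", "rb") as f:
    model = pickle.load(f)

# One example record. The string-valued fields assume the pickled object
# handles its own encoding (for example, a pipeline with preprocessing).
X = pd.DataFrame([{
    "age_group": "60-69",
    "record_id": 1,
    "age": 64,
    "village_no": 5,
    "village_name": "SampleVillage",
    "screening_date": "2025-07-01",
    "bp_systolic": 146,
    "bp_diastolic": 90,
    "weight": 68.0,
    "height": 160.0,
    # ... include all expected features
}], columns=expected_columns)

proba = model.predict_proba(X)[0]      # class probabilities for the first row
pred = model.classes_[proba.argmax()]  # label with the highest probability
print(pred, proba)
```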

	### 2) Use XGBoost native format (recommended for HF)

```python
# Assumes `model` is an XGBoost scikit-learn estimator (it exposes get_booster()).
model.get_booster().save_model("model.json")
```

	---

	## ⚙️ Environment & Reproducibility

	* Python: TODO
	* xgboost: TODO
	* scikit-learn: TODO
	* pandas/numpy: TODO
	* Random seed: `42`
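
The version fields above can be filled in from the training environment with a short standard-library script. The helper name `collect_versions` is hypothetical.

```python
import platform
from importlib import metadata

# Packages named in the list above.
PACKAGES = ["xgboost", "scikit-learn", "pandas", "numpy"]


def collect_versions(packages):
    """Map each package to its installed version, or None if absent."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in collect_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```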

	Attach:

	* `requirements.txt`
	* training script/preprocessing code
	* evaluation reports and figures

	---

	## 🧪 Validation & Monitoring

* Adjust classification thresholds to the screening context, for example favoring sensitivity where a missed case is costlier than a false alarm.
	* Monitor drift when applied to new populations.
	* Revalidate if data collection tools change.
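
Drift on a single numeric feature can be tracked with the Population Stability Index (PSI). PSI is one common choice, not something this project specifies; the function name, the bin count, and the synthetic blood pressure data below are all assumptions.

```python
import numpy as np


def population_stability_index(reference, current, bins=10):
    """PSI between two samples of one numeric feature.

    Bin boundaries come from quantiles of the reference sample.
    """
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    inner_edges = np.quantile(reference, np.linspace(0.0, 1.0, bins + 1))[1:-1]
    # digitize assigns every value, including ones outside the
    # reference range, to a bin index in 0..bins-1.
    ref_counts = np.bincount(np.digitize(reference, inner_edges), minlength=bins)
    cur_counts = np.bincount(np.digitize(current, inner_edges), minlength=bins)
    # A small floor avoids log(0) for empty bins.
    ref_share = np.clip(ref_counts / reference.size, 1e-6, None)
    cur_share = np.clip(cur_counts / current.size, 1e-6, None)
    return float(np.sum((cur_share - ref_share) * np.log(cur_share / ref_share)))


# Synthetic systolic blood pressure readings (mmHg), for illustration only.
rng = np.random.default_rng(42)
train_bp = rng.normal(130, 15, size=2000)
same_bp = rng.normal(130, 15, size=2000)
shifted_bp = rng.normal(145, 15, size=2000)

print("same population:", round(population_stability_index(train_bp, same_bp), 3))
print("shifted population:", round(population_stability_index(train_bp, shifted_bp), 3))
```

A PSI near zero suggests the new sample resembles the training data; a clearly larger value is a signal to revalidate the model on the new population.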

	---

	## 📣 Citation

	> TODO: Add references or project details for citation.