Update README.md

README.md CHANGED

@@ -1,60 +1,204 @@
- * **Configuration file** (`config.json`) – defines architecture details like hidden layers, vocab size, dropout, etc.
- * **Tokenizer files** (`tokenizer.json`, `vocab.txt`, `merges.txt`) – for NLP models
- * **Preprocessor/feature extractor** (`preprocessor_config.json`, `feature_extractor.json`) – for vision/audio models
- * **README.md** – model card with description, usage, license, citations
- * **Training arguments** (`training_args.bin`) – optional, stores hyperparameters used during training
- 2. **Add required metadata files** – e.g., `config.json`, `README.md` (model card).
- 3. **Push to Hugging Face Hub** using either:
- * A **`.pkl` file** = only serialized weights/structure, not directly usable on Hugging Face without conversion.

Hugging Face–ready model card draft for `PhailomXgboost_dm_model.pkl`, translated into English and refined for an international scientific audience.

---

# 🧾 Model Card — PhailomXgboost\_dm\_model

```yaml
license: unknown  # TODO: choose a license (e.g., mit, apache-2.0, cc-by-4.0)
library_name: xgboost
tags:
  - xgboost
  - classification
  - tabular-data
  - healthcare
  - NCD
  - diabetes-risk
language:
  - en
  - th
model-index:
  - name: PhailomXgboost_dm_model
    results:
      - task:
          type: tabular-classification
        dataset:
          name: TODO-dataset-name
          type: private
          split: test
        metrics:
          - type: accuracy
            value: TODO
          - type: f1
            value: TODO
          - type: roc_auc
            value: TODO
```

---

## 📌 Model Summary

**PhailomXgboost\_dm\_model** is an **XGBoost classifier** developed for early-stage screening of **non-communicable diseases (NCDs)**, with a focus on diabetes risk prediction using community health screening data.

The model outputs **three classes**: *Normal*, *At-Risk*, and *Diabetic*, making it suitable for cost-effective and rapid community-level health assessments.

---

## 🧠 Intended Use & Limitations

**Intended use**

* Community-level health screening for diabetes/NCD risk.
* Educational and research purposes (health data mining, public health informatics).
* Integration into dashboards or lightweight apps (e.g., Streamlit, Hugging Face Spaces).

**Not for**

* Direct clinical diagnosis.
* Replacement for laboratory tests or medical professionals.

**Limitations**

* Performance depends heavily on data quality (missing values, outliers).
* Potential bias if the dataset is imbalanced across classes.
* Threshold tuning is required to balance sensitivity and specificity for different contexts.
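
As an illustration of the threshold-tuning point, here is a minimal sketch that favors sensitivity by lowering the *At-Risk* decision threshold. The class order, the `0.30` threshold, and the helper name are assumptions for illustration, not values from the trained model.

```python
import numpy as np

classes = np.array(["Normal", "At-Risk", "Diabetic"])  # assumed class order

def predict_with_screening_threshold(proba, at_risk_threshold=0.30):
    """Favor sensitivity: flag At-Risk whenever its probability clears a
    lower-than-argmax threshold; otherwise fall back to the argmax class."""
    labels = []
    for p in proba:
        if p[1] >= at_risk_threshold and p.argmax() != 2:  # never override a Diabetic call
            labels.append("At-Risk")
        else:
            labels.append(classes[p.argmax()])
    return labels

# Synthetic probabilities standing in for model.predict_proba(X)
proba = np.array([
    [0.60, 0.35, 0.05],  # argmax says Normal, but At-Risk clears 0.30
    [0.10, 0.20, 0.70],  # confident Diabetic
])
print(predict_with_screening_threshold(proba))  # ['At-Risk', 'Diabetic']
```

The right threshold depends on the screening context; a lower value trades specificity for sensitivity.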

---

## 🧯 Ethical Considerations

* Respect data privacy (PDPA/GDPR compliance).
* Communicate clearly that this model is a **screening tool, not a diagnostic system**.
* Regularly validate fairness across subgroups (gender, age, region).

---

## 🗂️ Data

* Source: community health screening dataset (**private, internal project**).
* Dataset size: ~**3,418 records** (balanced across *Normal*, *At-Risk*, *Diabetic*).
* Example features:
  * **Demographics:** Age, age group, village, screening date
  * **Vitals:** Systolic/diastolic blood pressure, weight, height, BMI
  * **Contextual variables:** Household or screening group identifiers

> **TODO:** Fill in exact feature schema, units (e.g., mmHg, kg, cm), and preprocessing methods.
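
BMI is the one derived vital listed; assuming the kg/cm units mentioned in the TODO, it is computed as weight divided by height squared:

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body-mass index = kg / m^2 (assumes kg and cm inputs)."""
    height_m = height_cm / 100
    return round(weight_kg / height_m ** 2, 1)

print(bmi(68.0, 160.0))  # 26.6
```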

---

## 🏗️ Training Procedure

* **Model:** XGBoost (tree-based gradient boosting), multi-class classification.
* **Objective:** `multi:softprob` (multi-class probability prediction).
* **Preprocessing:**
  * Missing values handled by imputation.
  * One-hot or ordinal encoding for categorical features.
  * Stratified split into training/validation/test sets.
* **Hyperparameters tuned:** `max_depth`, `learning_rate` (`eta`), `subsample`, `colsample_bytree`, `min_child_weight`, `n_estimators`.
* **Evaluation metrics:** Accuracy, Macro-F1, ROC-AUC (one-vs-rest).

> **TODO:** Insert actual hyperparameters and results.

---

## 📈 Evaluation

| Metric        | Test Set |
| ------------- | -------- |
| Accuracy      | TODO     |
| Macro F1      | TODO     |
| ROC-AUC (OVR) | TODO     |

**Confusion Matrix (example format)**

```
                 Pred:Normal   Pred:At-Risk   Pred:Diabetic
True:Normal         TODO           TODO            TODO
True:At-Risk        TODO           TODO            TODO
True:Diabetic       TODO           TODO            TODO
```
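
The table and matrix can be filled in with scikit-learn once `y_test` and `model.predict_proba(X_test)` are available; a sketch with stand-in arrays:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             roc_auc_score)

# Stand-in labels/probabilities; in practice use y_test and model.predict_proba(X_test)
y_true = np.array([0, 1, 2, 2, 1, 0])  # 0=Normal, 1=At-Risk, 2=Diabetic
y_proba = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.3, 0.4, 0.3],  # a Diabetic case misread as At-Risk
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
])
y_pred = y_proba.argmax(axis=1)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Macro F1 :", f1_score(y_true, y_pred, average="macro"))
print("ROC-AUC  :", roc_auc_score(y_true, y_proba, multi_class="ovr"))
print(confusion_matrix(y_true, y_pred))  # rows = true, cols = predicted
```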

---

## 🧩 Input Schema

Expected columns must match the training pipeline order. Example schema from the project context:

```python
expected_columns = [
    "age_group", "record_id", "age", "village_no", "village_name", "screening_date",
    "bp_systolic", "bp_diastolic", "weight", "height",
    # ... add remaining features
]
```

> **TODO:** Fill with the exact column list and datatypes.
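
A small guard can enforce that incoming frames carry the schema in training order before prediction. The shortened column list and the `align` helper below are illustrative only:

```python
import pandas as pd

expected_columns = ["age_group", "record_id", "age", "bp_systolic"]  # shortened for the sketch

def align(df: pd.DataFrame) -> pd.DataFrame:
    """Fail fast on missing columns, then enforce training-time column order."""
    missing = set(expected_columns) - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return df[expected_columns]

df = pd.DataFrame([{"record_id": 1, "age": 64, "age_group": "60-69", "bp_systolic": 146}])
print(list(align(df).columns))  # ['age_group', 'record_id', 'age', 'bp_systolic']
```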

---

## 🚀 Inference

### 1) Load from pickle file

```python
import pickle

import pandas as pd

with open("PhailomXgboost_dm_model.pkl", "rb") as f:
    model = pickle.load(f)

X = pd.DataFrame([{
    "age_group": "60-69",
    "record_id": 1,
    "age": 64,
    "village_no": 5,
    "village_name": "SampleVillage",
    "screening_date": "2025-07-01",
    "bp_systolic": 146,
    "bp_diastolic": 90,
    "weight": 68.0,
    "height": 160.0,
    # ... include all expected features
}], columns=expected_columns)  # expected_columns from the Input Schema section

proba = model.predict_proba(X)[0]
pred = model.classes_[proba.argmax()]
print(pred, proba)
```

### 2) Use the XGBoost native format (recommended for HF)

```python
model.get_booster().save_model("model.json")
```

---

## ⚙️ Environment & Reproducibility

* **Python**: TODO
* **xgboost**: TODO
* **scikit-learn**: TODO
* **pandas/numpy**: TODO
* Random seed: `42`

Attach:

* `requirements.txt`
* training script/preprocessing code
* evaluation reports and figures

---

## 🧪 Validation & Monitoring

* Adjust classification thresholds for public health contexts.
* Monitor drift when applied to new populations.
* Revalidate if data collection tools change.
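
Drift monitoring can start with something as simple as the population stability index (PSI) on key features. The bin count and the common 0.2 alert threshold below are rules of thumb, not values from this project:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population stability index between a reference sample and a new one."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid log(0) on empty bins
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(42)
baseline = rng.normal(130, 20, 1000)  # e.g. bp_systolic at training time
shifted = rng.normal(145, 20, 1000)   # new population with a higher mean
print(psi(baseline, baseline))        # ~0: no drift
print(psi(baseline, shifted) > 0.2)   # True: flag for revalidation
```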

---

## 📣 Citation

> TODO: Add references or project details for citation.