MABONGALABS committed · Commit 25eaacd · verified · 1 Parent(s): 943c056

Update README.md

Files changed (1)
  1. README.md +76 -51
README.md CHANGED
@@ -1,69 +1,94 @@
  ---
  license: mit
- title: Hemaclass AI Diagnostics
- emoji: 🩸
- colorFrom: red
- colorTo: gray
- sdk: gradio
- sdk_version: 4.44.1
- app_file: app.py
- pinned: false
  tags:
  - medical
- - biology
- - xai
- - malaria
- - sickle-cell
  ---

- # Hemaclass AI Diagnostics: Clinical Decision Support System

- An explainable ensemble-based diagnostic tool optimized for **Malaria**, **Sickle Cell Anemia (SCA)**, and **Co-infection** classification in Western Kenya.

- ## 🔬 Model Details
- - **Architecture:** Stacking Ensemble (RF, SVM, XGBoost) with a Logistic Regression Meta-Learner.
- - **Explainability:** Integrated SHAP waterfall plots for local feature importance.
- - **Preprocessing:** MICE Imputation, SMOTE balancing, and Z-Score Normalization.

- ## 🚀 Quick Usage (Inference Code)
- To use this model programmatically, ensure you have your `.pkl` artifacts in the local directory.

  ```python
  import joblib
  import pandas as pd
- import numpy as np

  # 1. Load Artifacts
  model = joblib.load('ensemble_model.pkl')
  scaler = joblib.load('scaler.pkl')
  imputer = joblib.load('imputer.pkl')
- FEATURES = joblib.load('feature_names.pkl')
- target_names = ['Negative', 'Malaria', 'SCA', 'Co-infection']
-
- def predict_patient(data_dict):
-     """
-     Input: Dictionary of patient vitals/labs
-     Output: Predicted Diagnosis and Confidence
-     """
-     # Create DataFrame and align features (missing columns become NaN for the imputer)
-     df = pd.DataFrame([data_dict])
-     for col in set(FEATURES) - set(df.columns):
-         df[col] = np.nan
-     df = df[FEATURES]
-
-     # Preprocess
-     X_imp = imputer.transform(df)
-     X_scaled = scaler.transform(X_imp)
-
-     # Inference
-     pred = model.predict(X_scaled)
-     prob = np.max(model.predict_proba(X_scaled))
-
-     return {
-         # predict() returns an array; take the single prediction as an integer class index
-         "Diagnosis": target_names[int(pred[0])],
-         "Confidence": f"{prob*100:.2f}%"
-     }
-
- # Example Usage:
- patient_data = {'age': 25, 'hb': 10.5, 'temp': 38.5, 'malaria_rdt': 1}
- print(predict_patient(patient_data))
  ---
+ language:
+ - en
  license: mit
  tags:
  - medical
+ - clinical-decision-support
+ - tabular-classification
+ - scikit-learn
+ - xgboost
+ - shap
+ - ensemble-learning
+ - global-health
+ datasets:
+ - private-western-kenya-clinical-cohort
+ metrics:
+ - accuracy
+ - f1
+ - sensitivity
+ - specificity
  ---
 
+ # Hemaclass XAI: Deep Stacking Ensemble for Malaria and Sickle Cell Anemia

+ ## Model Description
+ The **Hemaclass XAI** model is a phase-4 prototype Clinical Decision Support System (CDSS) that classifies patient states as **Malaria, Sickle Cell Anemia (SCA), Co-infection, or Negative (Healthy/Other)**. Developed for deployment in resource-constrained settings in Western Kenya, it uses a Deep Stacking Ensemble architecture with SHAP (SHapley Additive exPlanations) to give clinicians transparent, interpretable predictions.

+ - **Developer:** Mabonga Labs / Hemaclass Project
+ - **Model Type:** Multi-Class Tabular Classification (Deep Stacking Ensemble)
+ - **Primary Architecture:** Random Forest, Support Vector Machine (RBF kernel), and XGBoost as base learners, feeding a Logistic Regression meta-learner.
+ - **Hyperparameter Tuning:** Nested Cross-Validation with Bayesian Search (`skopt`).
 
+ ## Intended Use & Target Audience
+ - **Primary Use Case:** Triage and secondary diagnostic validation for clinicians operating in malaria-endemic regions with high SCA prevalence.
+ - **Target Audience:** Medical doctors, clinical officers, and healthcare technicians.
+ - **Out-of-Scope Uses:** This model is **not** a standalone diagnostic device. It is designed for *decision support* and should not override clinical judgment or replace definitive laboratory protocols (e.g., blood smears, Hb electrophoresis).

+ ## Clinical Protocol & Hardcoded Overrides
+ To prioritize patient safety, the inference pipeline applies hardcoded clinical rules that override the model's probability outputs in critical, life-threatening scenarios:
+ 1. **Severe Hyperhemolytic Crisis:** Hemoglobin (Hb) < 5.0 g/dL.
+ 2. **Acute Hemolytic Malarial Crisis:** Reticulocyte Count > 8.0% + Positive Malaria RDT.
+ 3. **Rapidly Progressing Vaso-occlusive Malarial Crisis:** Rapid Hb decline (> 1.5 g/dL in 48 h) + Positive Malaria RDT + Presence of HbS genotype.
+ *When any rule is triggered, the system flags the diagnosis as "Co-infection" with 100% confidence and alerts the clinician to admit the patient to a high-dependency unit.*
+
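
The three override rules listed above could be expressed as a small pre-check that runs before the ensemble is consulted. This is a minimal sketch, not the project's actual code: the function name `apply_clinical_overrides` and the field names (`hb`, `reticulocyte_pct`, `malaria_rdt`, `rapid_hb_decline`, `hbs_pct`) are assumptions, and "presence of HbS genotype" is approximated here as an HbS fraction above zero.

```python
from typing import Optional


def apply_clinical_overrides(patient: dict) -> Optional[dict]:
    """Return an override result if a hard rule fires, otherwise None."""
    hb = patient.get("hb")
    retic = patient.get("reticulocyte_pct")
    rdt_positive = patient.get("malaria_rdt") == 1
    rapid_decline = patient.get("rapid_hb_decline") == 1
    # Assumption: any measurable HbS fraction counts as "HbS genotype present".
    has_hbs = (patient.get("hbs_pct") or 0) > 0

    rule = None
    if hb is not None and hb < 5.0:
        rule = "Severe Hyperhemolytic Crisis"
    elif retic is not None and retic > 8.0 and rdt_positive:
        rule = "Acute Hemolytic Malarial Crisis"
    elif rapid_decline and rdt_positive and has_hbs:
        rule = "Rapidly Progressing Vaso-occlusive Malarial Crisis"

    if rule is None:
        return None  # no rule fired; fall through to the model
    return {
        "Diagnosis": "Co-infection",
        "Confidence": "100.00%",
        "Rule": rule,
        "Alert": "Admit to high-dependency unit",
    }


print(apply_clinical_overrides({"hb": 4.2, "malaria_rdt": 0}))
print(apply_clinical_overrides({"hb": 10.5, "malaria_rdt": 1}))
```

A caller would run this check first and only invoke the ensemble when it returns `None`.
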
+ ## Model Input Features
+ The model ingests 24 clinical biomarkers and automatically engineers 3 derived features:
+ * **Demographics:** Age, Sex.
+ * **Vitals & Symptoms:** Body Temperature, Fever, Chills, Headache, Muscle Aches, Fatigue, Loss of Appetite, Jaundice, Abdominal Pain, Joint Pain, Splenomegaly, Severe Pallor, Lymphadenopathy.
+ * **Laboratory Markers:** Malaria RDT (Binary), Hemoglobin (Hb), WBC Count, Platelet Count, Reticulocyte Count, Rapid Hb Decline Alert.
+ * **Hemoglobin Fractions:** HbA, HbS, HbF.
+ * **Engineered Features:** Symptom Severity Score, Age Group (Categorical), Infection-to-Anemia Ratio (WBC / Hb).
+
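
As an illustration of the three engineered features, the sketch below derives them from a raw patient record. Only the Infection-to-Anemia Ratio (WBC / Hb) is defined in this README; the symptom-count definition of the severity score, the age bands, and all key names are assumptions made for the example.

```python
# Symptom flags are assumed to be encoded as 0/1; the key names are hypothetical.
SYMPTOMS = [
    "fever", "chills", "headache", "muscle_aches", "fatigue",
    "loss_of_appetite", "jaundice", "abdominal_pain", "joint_pain",
    "splenomegaly", "severe_pallor", "lymphadenopathy",
]


def age_group(age: float) -> str:
    """Assumed age bands; the real cut points are not documented here."""
    if age < 5:
        return "under_5"
    if age < 18:
        return "child"
    return "adult"


def engineer_features(patient: dict) -> dict:
    """Return a copy of the record with the three derived features added."""
    out = dict(patient)
    # Assumption: severity is the number of symptoms present.
    out["symptom_severity"] = sum(int(patient.get(s, 0)) for s in SYMPTOMS)
    out["age_group"] = age_group(patient["age"])
    # Ratio as stated in the feature list: WBC / Hb.
    out["infection_anemia_ratio"] = patient["wbc"] / patient["hb"]
    return out


row = engineer_features({"age": 25, "hb": 10.0, "wbc": 12.0, "fever": 1, "chills": 1})
print(row["symptom_severity"], row["age_group"], row["infection_anemia_ratio"])
```
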
+ ## Data Preprocessing & Augmentation
+ * **Missing Data:** Handled with Multiple Imputation by Chained Equations (MICE), run for up to 30 iterations, to preserve covariances between clinical variables.
+ * **Class Imbalance:** The foundational dataset (~350 retrospective patient records from Western Kenya) was highly imbalanced. **Extreme SMOTE** (Synthetic Minority Over-sampling Technique) was applied to synthesize a balanced training matrix of 6,000 clinical profiles (1,500 per target class).
+ * **Encoding:** Deterministic Ordinal Encoding for categorical values; Z-Score Normalization for numerical features.
+
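
SMOTE builds each synthetic record by interpolating between a minority-class sample and one of its nearest neighbours in the same class. The pure-Python sketch below shows that core step only. It is not the project's pipeline, which this README does not show; the function name `smote_oversample` and the toy data are invented for illustration.

```python
import math
import random


def smote_oversample(samples, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating toward nearest neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # k nearest neighbours of `base` within the same (minority) class.
        neighbours = sorted(
            (s for s in samples if s is not base),
            key=lambda s: math.dist(base, s),
        )[:k]
        target = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(b + lam * (t - b) for b, t in zip(base, target)))
    return synthetic


# Toy minority class: (Hb in g/dL, HbS %) pairs, invented for illustration.
minority = [(6.1, 80.0), (6.8, 75.0), (7.2, 85.0), (5.9, 90.0)]
new_points = smote_oversample(minority, n_new=5)
print(len(new_points))
```

Every synthetic point lies on a line segment between two real records, which is why heavy oversampling from a small cohort cannot add information beyond what the original records contain.
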
+ ## Explainability (XAI)
+ The system applies **SHAP (TreeExplainer)** to the XGBoost component of the stacking ensemble. A local explanation (waterfall plot) is generated for every inference, quantifying how each biomarker (e.g., *HbS%* or *Symptom Severity*) contributed to the predicted risk score.
+
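
A waterfall plot starts from a baseline prediction and adds one contribution per feature until it reaches the model output for the patient. The sketch below computes exact Shapley values by brute force for a toy scoring function to show that additive property. It does not use TreeExplainer or the real model: `risk_score`, the feature names, and the baseline values are all invented.

```python
from itertools import permutations


def shapley_values(model, instance, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    features = list(instance)
    contrib = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)      # start from the reference patient
        prev = model(current)
        for f in order:
            current[f] = instance[f]  # reveal one feature at a time
            now = model(current)
            contrib[f] += now - prev
            prev = now
    return {f: total / len(orderings) for f, total in contrib.items()}


def risk_score(x):
    # Invented toy model with an interaction between the RDT result and HbS fraction.
    return 0.02 * x["hbs_pct"] + 0.5 * x["malaria_rdt"] + 0.01 * x["hbs_pct"] * x["malaria_rdt"]


baseline = {"hbs_pct": 0.0, "malaria_rdt": 0, "temp": 37.0}
patient = {"hbs_pct": 40.0, "malaria_rdt": 1, "temp": 38.5}
phi = shapley_values(risk_score, patient, baseline)
print(phi)
```

The contributions sum to the difference between the patient's score and the baseline score, and a feature the toy model ignores (`temp`) receives zero.
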
+ ## Evaluation & Metrics
+ The model was evaluated on an isolated, unseen clinical test set that was held out before SMOTE augmentation.
+ * **Metrics Tracked:** Macro-F1 Score, Macro-Sensitivity (Recall), Macro-Specificity, and overall Accuracy.
+ * **Statistical Significance:** Friedman testing with Nemenyi post-hoc comparisons indicated that the Stacking Ensemble significantly outperforms the individual baseline models (p < 0.05).
+ *(Note: Refer to the specific model logs or the connected Gradio dashboard for live metrics).*
+
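
For a four-class problem, macro-averaged sensitivity and specificity are computed one-vs-rest per class and then averaged with equal weight. The sketch below shows one way to do this in plain Python; the helper name `macro_metrics` is hypothetical and the labels in the example are made up, not results from this model.

```python
def macro_metrics(y_true, y_pred, labels):
    """Macro-averaged sensitivity, specificity and F1 from one-vs-rest counts."""
    sens, spec, f1s = [], [], []
    pairs = list(zip(y_true, y_pred))
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        tn = sum(t != c and p != c for t, p in pairs)
        sens.append(tp / (tp + fn) if tp + fn else 0.0)
        spec.append(tn / (tn + fp) if tn + fp else 0.0)
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "macro_sensitivity": sum(sens) / n,
        "macro_specificity": sum(spec) / n,
        "macro_f1": sum(f1s) / n,
    }


# Made-up labels for illustration only.
labels = ["Negative", "Malaria", "SCA", "Co-infection"]
y_true = ["Negative", "Malaria", "SCA", "Co-infection", "Malaria", "SCA"]
y_pred = ["Negative", "Malaria", "SCA", "Malaria", "Malaria", "Negative"]
print(macro_metrics(y_true, y_pred, labels))
```

Because every class carries equal weight, a class the model never predicts correctly (here "Co-infection") pulls macro-sensitivity and macro-F1 down even when overall accuracy looks acceptable.
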
+ ## Ethical Considerations & Limitations
+ * **Geographic Bias:** The base dataset reflects the epidemiology of Western Kenya. Prevalence-dependent patterns (such as splenomegaly occurring in both SCA and malaria) may not generalize to populations outside Sub-Saharan Africa or to regions with different *Plasmodium falciparum* endemicity.
+ * **Synthetic Data Artifacts:** Because SMOTE was used extensively to augment the training data, decision boundaries for extreme edge cases may reflect synthetic bias. Prospective validation in a large-scale, real-world clinical trial is required before Phase 5 medical device certification.
+ * **Explainability Proxy:** The SHAP explanations are derived from the XGBoost sub-estimator rather than the full Stacking Classifier, so they serve as a proxy for the ensemble's decision, not an exact mathematical reflection of the meta-learner.
+
+ ## How to Get Started
+ To interact with the model via the UI, please visit the connected [Hugging Face Space](#).
+
+ For programmatic inference using Python:
  ```python
  import joblib
  import pandas as pd

  # 1. Load Artifacts
  model = joblib.load('ensemble_model.pkl')
  scaler = joblib.load('scaler.pkl')
  imputer = joblib.load('imputer.pkl')
+
+ # 2. Prepare patient data (columns must match the feature names and order used in training)
+ # patient_df = pd.DataFrame({...})
+
+ # 3. Preprocess
+ # X_imp = imputer.transform(patient_df)
+ # X_scaled = scaler.transform(X_imp)
+
+ # 4. Predict
+ # predictions = model.predict(X_scaled)
+ # probabilities = model.predict_proba(X_scaled)