FORUM-TB: Trained Random Forest Models
Trained Random Forest classifiers for M. tuberculosis drug resistance prediction from whole-genome sequencing (WGS) data.
Part of Project FORUM, an open-science interpretable ML pipeline for TB AMR prediction, developed in collaboration with Noah The Microbialist.
Related Resources
- Dataset (Kaggle): https://www.kaggle.com/datasets/nanzhen/forum-tb?resource=download
- Pipeline (GitHub): https://github.com/TheMicrobialist/SHAP-mTB-AMR
- Project blog: https://themicrobialist.substack.com/?utm_campaign=profile_chips
Models
| File | Drug | Test AUC-ROC | CV AUC-ROC | Size |
|---|---|---|---|---|
| rf_RIFAMPICIN_v2.joblib | Rifampicin | 0.975 | 0.969 ± 0.004 | 61 MB |
| rf_ISONIAZID_v2.joblib | Isoniazid | 0.948 | 0.946 ± 0.008 | 48 MB |
| rf_ETHAMBUTOL_v2.joblib | Ethambutol | 0.894 | 0.900 ± 0.007 | 77 MB |
| rf_PYRAZINAMIDE_v2.joblib | Pyrazinamide | 0.886 | 0.883 ± 0.007 | 58 MB |
Usage
from huggingface_hub import hf_hub_download
import joblib
# Download and load a model
path = hf_hub_download(
repo_id="nanzhen102/FORUM-TB-models",
filename="rf_RIFAMPICIN_v2.joblib"
)
rf = joblib.load(path)
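Once a model is loaded, scoring samples might look like the sketch below. It assumes the loaded object behaves like a scikit-learn `RandomForestClassifier` (exposing `predict_proba`, `classes_`, and `n_features_in_`) and that the resistant class is labelled `1`. Neither assumption is stated in this card, and `resistance_probabilities` is a hypothetical helper, not part of the FORUM-TB pipeline.

```python
def resistance_probabilities(model, rows, resistant_label=1):
    """Return P(resistant) for each row of integer-encoded features.

    `resistant_label=1` is an assumption; check `model.classes_` to confirm
    how resistant and susceptible isolates are labelled.
    """
    # Fail early if a row does not match the model's expected feature count.
    expected = getattr(model, "n_features_in_", None)
    for i, row in enumerate(rows):
        if expected is not None and len(row) != expected:
            raise ValueError(
                f"row {i} has {len(row)} features, model expects {expected}"
            )

    # Find the probability column that corresponds to the resistant class.
    column = list(model.classes_).index(resistant_label)
    probabilities = model.predict_proba(rows)
    return [float(p[column]) for p in probabilities]


# Usage with the model loaded above (X is a list of encoded feature vectors):
# scores = resistance_probabilities(rf, X)
```

Checking the feature count before calling the model gives a clearer error than the one raised inside the estimator when the vector length is wrong.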
Input Format
Models expect a feature vector of 2,693 AMR gene positions encoded as integers (0=REF, 1=A, 2=T, 3=C, 4=G).
Download the ML-ready dataset directly from Kaggle: https://www.kaggle.com/datasets/nanzhen/forum-tb?resource=download
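For readers preparing their own samples, the encoding above can be sketched as follows. This is an illustration only: it assumes a base equal to the reference is stored as 0, that positions with no call fall back to the reference, and that feature order follows a caller-supplied position list. The actual column order and the handling of missing or ambiguous calls should be taken from the Kaggle dataset. The helper names and toy coordinates are hypothetical.

```python
# Integer codes as stated in this card: 0=REF, 1=A, 2=T, 3=C, 4=G.
BASE_CODES = {"A": 1, "T": 2, "C": 3, "G": 4}


def encode_position(sample_base, ref_base):
    """Encode one position; a call equal to the reference maps to 0 (assumption)."""
    sample_base = sample_base.upper()
    if sample_base == ref_base.upper():
        return 0
    # Raises KeyError for N, gaps, or other symbols, which this sketch does not handle.
    return BASE_CODES[sample_base]


def encode_sample(calls, reference, positions):
    """Build a feature vector in the order given by `positions`.

    calls:     dict of position -> observed base for one sample
    reference: dict of position -> reference base
    Positions absent from `calls` are treated as reference (assumption).
    """
    return [
        encode_position(calls.get(pos, reference[pos]), reference[pos])
        for pos in positions
    ]


# Toy example with made-up coordinates and bases (not real H37Rv data).
positions = [10, 20, 30]
reference = {10: "C", 20: "G", 30: "A"}
calls = {10: "T", 30: "A"}
print(encode_sample(calls, reference, positions))  # [2, 0, 0]
```

A real input vector would need all 2,693 positions in the same order used during training.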
Biological Validation
Top SHAP features confirmed against known resistance mutations:
- pos_761155 → rpoB codon 450 → S450L (Rifampicin)
- pos_2155168 → katG codon 315 → S315T (Isoniazid)
- pos_4247429 → embB codon 306 → M306I/V (Ethambutol)
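A check of this kind could be sketched as below. The sketch assumes features are ranked by mean absolute SHAP value, a common convention that this card does not confirm, and that SHAP values are already available as one row per sample. The marker table holds only the three positions listed above; `rank_by_mean_abs` and `annotate` are hypothetical helpers.

```python
# Markers listed in this card, keyed by genome coordinate.
KNOWN_MARKERS = {
    761155: ("rpoB", "S450L", "Rifampicin"),
    2155168: ("katG", "S315T", "Isoniazid"),
    4247429: ("embB", "M306I/V", "Ethambutol"),
}


def rank_by_mean_abs(shap_rows, feature_names):
    """Rank features by mean |SHAP| across samples, largest first."""
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        for j, value in enumerate(row):
            totals[j] += abs(value)
    means = [total / len(shap_rows) for total in totals]
    return sorted(zip(feature_names, means), key=lambda item: item[1], reverse=True)


def annotate(feature_name):
    """Map a 'pos_<coordinate>' feature name to a known marker, or None."""
    coordinate = int(feature_name.split("_", 1)[1])
    return KNOWN_MARKERS.get(coordinate)


# Toy SHAP matrix (made-up numbers): two samples, two features.
names = ["pos_1", "pos_761155"]
shap_rows = [[0.1, -0.5], [0.3, 0.5]]
for name, score in rank_by_mean_abs(shap_rows, names):
    print(name, round(score, 3), annotate(name))
```

Taking absolute values before averaging keeps positive and negative contributions from cancelling out.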
Authors
- Noah LeGall, Ph.D., The Microbialist
- Nanzhen (Aspen) Qiao, Queen's University, Kingston, Canada
License
CC BY-NC 4.0: free for non-commercial use with attribution.