---
license: cc-by-4.0
tags:
- biology
- single-cell
- immunophenotyping
- protein
- adt
- cite-seq
- missionbio
- tapestri
- scikit-learn
library_name: scikit-learn
pipeline_tag: tabular-classification
---

# EspressoPro ADT Cell Type Models

## Model Summary

This repository provides pre-trained EspressoPro models for **cell type annotation from single-cell surface protein (ADT) data**, designed for **blood and bone marrow mononuclear cells** in protein-only settings (such as Mission Bio Tapestri DNA+ADT workflows).

The pipeline is available at: https://github.com/uom-eoh-lab-published/2026__EspressoPro

The release contains **one-vs-rest (OvR) binary classifiers per cell type** plus a **multiclass calibration layer**, provided at **three annotation resolutions of increasing biological detail**.

## Model Details

- **Developed by:** Kristian Gurashi
- **Model type:** Stacked ensemble of OvR classifiers with Platt calibration (logistic regression over XGB, NB, KNN, and MLP prediction probabilities)
- **Input:** Per-cell ADT feature vectors (CLR-normalised surface protein expression)
- **Output:** Per-cell class probabilities and predicted cell type labels

### Included Files

The repository is organised by **reference atlas** (`Hao`, `Triana`, `Zhang`, `Luecken`) and by **label resolution** (`Broad`, `Simplified`, `Detailed`). Each atlas/resolution folder contains (i) the trained models, (ii) evaluation reports, and (iii) figures.

#### Models (`Release//Models//`)

- `Multiclass_models.joblib` — main file for inference.
  Loads everything needed to run predictions for that atlas/resolution:
  - all per-class Platt-calibrated OvR "heads"
  - `class_names` (probability column order)
  - excluded class list (if applicable)
  - multiclass temperature-scaling calibrator

#### Reports (`Release//Reports//`)

- `metrics/` — CSV exports of evaluation outputs, including:
  - multiclass accuracy metrics (precision/recall/F1/AUC) on the held-out test split
  - multiclass confusion matrix on the held-out test split
  - per-class accuracy metrics (precision/recall/F1/AUC) and confusion matrices on the held-out test split
  - per-class error rates pre- and post-calibration on the held-out test split
- `probabilities/` — CSV exports of:
  - multiclass label prediction probabilities on the test set

#### Figures (`Release//Figures//`)

- `multiclass_confusion_matrix_on_test.png` — multiclass confusion matrix for the held-out test split.
- `multiclass_confusion_matrix_on_test_with_percentage_agreement.png` — the same confusion matrix with % agreement between true and predicted labels.
- `per_class/` — per-class plots, including:
  - binary confusion matrix pre-calibration
  - ROC curve (AUC) pre-calibration
  - binary confusion matrix post-calibration
  - ROC curve (AUC) post-calibration
  - UMAP of the held-out train split
  - UMAP legend
  - calibration evaluation on the held-out test split
  - SHAP beeswarm on the held-out train split

## Uses

### Direct Use

Used by **EspressoPro** to annotate cell types from **ADT-only** single-cell data (blood/bone marrow mononuclear cells), including Mission Bio Tapestri DNA+ADT datasets.

## Bias, Risks, and Limitations

- **Reference bias:** trained on references derived from healthy human donor PBMCs/BMMCs; performance may differ in disease or heavily perturbed samples, and the models are not expected to transfer to other tissues.
- **Panel dependence:** requires feature alignment to the expected ADT columns; missing or mismatched antibodies can reduce accuracy.
- **Class coverage:** only cell types that at least one of the four atlases could predict effectively were trained for prediction.
- **Interpretation:** probabilities are model-derived and should be validated against marker expression and expected biology.

## Testing Data, Factors & Metrics

### Testing Data

- **TRAIN:** used to train the one-vs-rest (OvR) classifiers.
- **CAL:** used only for probability calibration (per-class Platt scaling plus multiclass temperature scaling).
- **TEST:** used only for evaluation.

**Note:** CAL and TEST include only the classes learned from TRAIN; excluded or unknown labels are removed.

### Factors

- **RAW:** OvR probabilities before calibration.
- **PLATT:** OvR probabilities after Platt calibration on CAL (skipped if CAL is single-class).
- **CAL:** final multiclass probabilities after temperature scaling (fit on CAL, applied to TEST).

### Metrics

**Multiclass (TEST, using CAL probabilities):**
- Accuracy
- Precision / Recall / F1
- Confusion matrix

**Per-class (TEST, RAW vs CAL):**
- Confusion matrix (TP, FP, TN, FN)
- Precision, recall, F1
- ROC curve and AUC

**Calibration (per class, TEST):**
- Log loss and Brier score before vs after Platt calibration
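## Example: Preparing CLR Input

The models expect CLR-normalised ADT input. As an illustrative sketch (the function name and the cells × antibodies layout are assumptions; EspressoPro's own preprocessing may differ in detail), the per-cell centred log-ratio transform can be written as:

```python
import numpy as np

def clr_normalise(counts: np.ndarray) -> np.ndarray:
    """Per-cell centred log-ratio (CLR) transform of an ADT count matrix.

    `counts` is cells x antibodies. Each log1p count is centred by the
    mean log1p across that cell's antibodies, i.e. each (count + 1) is
    divided by the cell's geometric mean of (count + 1) before the log.
    """
    logged = np.log1p(counts)
    return logged - logged.mean(axis=1, keepdims=True)

# Toy example: 2 cells x 3 antibodies.
adt = np.array([[10.0, 0.0, 5.0],
                [100.0, 20.0, 1.0]])
clr = clr_normalise(adt)
# By construction, each row of the CLR matrix sums to (numerically) zero.
```

The resulting matrix, with columns aligned to the antibody panel each model expects, would then be passed to the bundle loaded from `Multiclass_models.joblib` (e.g. via `joblib.load`) to obtain per-class probabilities.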