LiverDCP — Disease-Cell-type Protein embeddings for the human liver
LiverDCP is a graph neural network that learns context-specific protein embeddings across 286 disease × cell-type combinations of the human liver (11 diagnoses × 26 cell types). It is trained on a multi-modal graph built from PrePPI-AF protein–protein interactions, CellPhoneDB cell–cell communications, and the LiverHomo single-cell atlas (1.16 M cells, 320 patients), with ESM2 protein language model features injected through a zero-init side adapter.
- Code: https://github.com/zhaolabutmb/LiverDCP
- Training data (Zenodo): 10.5281/zenodo.20431227
- Model version: v1.0
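The 286 contexts are the full cross product of 11 diagnoses and 26 cell types. A minimal sketch of that enumeration follows; the label strings and the index order are placeholders chosen for illustration, not the LiverHomo names or the ctx_id numbering used in the released files.

```python
from itertools import product

# Placeholder labels: the real diagnosis and cell-type names follow the
# LiverHomo schema and are not reproduced here.
diagnoses = [f"diagnosis_{i}" for i in range(11)]
cell_types = [f"cell_type_{j}" for j in range(26)]

# One context per (diagnosis, cell type) pair. The enumeration order is an
# assumption; the actual numbering is defined by the released files.
contexts = {idx: pair for idx, pair in enumerate(product(diagnoses, cell_types))}

print(len(contexts))  # 286
```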
Files in this repo
| File | Purpose |
|---|---|
| best_model.pt | Full checkpoint: model state_dict, center-loss state, hparams, best val ROC. ~2.5 GB. |
| protein_embed.pth | Pre-computed effective per-context protein embeddings. dict[ctx_id -> Tensor(N_proteins, 256)]. |
| mg_embed.pth | Metagraph (per-disease) embeddings. Tensor(11, 256). |
| protein_labels_dict.txt | JSON mapping protein names ↔ context indices. |
| config.json | Architecture, training hyperparameters, Zenodo DOI, training command. |
protein_embed.pth is the file most downstream users want. It is produced by
adding the trained ppi_pred_head and esm_decoder_adapter heads on top of
the base GAT embeddings, exactly as in DCP_model/train.py.
Quickstart — download embeddings only
# pip install -U "huggingface_hub>=0.24" torch
from huggingface_hub import hf_hub_download
import torch, json
repo_id = "zhaolabutmb/LiverDCP"
protein_embed = torch.load(
    hf_hub_download(repo_id, "LiverDCP__protein_embed.pth"), map_location="cpu", weights_only=False
)
mg_embed = torch.load(
    hf_hub_download(repo_id, "LiverDCP__mg_embed.pth"), map_location="cpu", weights_only=False
)
with open(hf_hub_download(repo_id, "LiverDCP__protein_labels_dict.txt")) as f:
    labels = json.load(f)
print(len(protein_embed), "contexts") # 286
print(mg_embed.shape) # torch.Size([11, 256])
Or, if you have cloned the GitHub repo, the same thing with our helper:
from huggingface.load_pretrained import load_embeddings
out = load_embeddings("zhaolabutmb/LiverDCP")
protein_embed, mg_embed = out["protein_embed"], out["mg_embed"]
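Once a context's embedding matrix is loaded, a common first step is a nearest-neighbor lookup by cosine similarity. The sketch below is a hedged illustration only: it uses a random stand-in tensor instead of the real protein_embed entries, and the helper name top_k_neighbors is hypothetical. Mapping row indices back to protein names depends on the layout of the labels file, which is not described here.

```python
import torch
import torch.nn.functional as F


def top_k_neighbors(embed, query_idx, k=5):
    """Return (row indices, cosine similarities) of the k rows closest to query_idx."""
    normed = F.normalize(embed, dim=1)   # unit-length rows
    sims = normed @ normed[query_idx]    # cosine similarity to the query row
    sims[query_idx] = float("-inf")      # exclude the query itself
    scores, idx = torch.topk(sims, k)
    return idx.tolist(), scores.tolist()


# Synthetic stand-in for one context: 100 proteins x 256 dimensions.
torch.manual_seed(0)
fake_ctx = torch.randn(100, 256)
# Make row 7 a near-copy of row 3 so the expected top hit is known.
fake_ctx[7] = fake_ctx[3] + 0.01 * torch.randn(256)

idx, scores = top_k_neighbors(fake_ctx, query_idx=3, k=5)
print(idx[0])  # 7
```

With the real data, the same function would be called on a single entry of protein_embed, for example protein_embed[ctx_id] for a context of interest.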
Reconstructing the full model (advanced)
Re-instantiating the LiverDCP nn.Module requires graph metadata
(ppi_data, num_ppi_relations, num_mg_relations) that is not stored
in the checkpoint. You must build it from DCP_graph_data (download from
Zenodo, then run DCP_model/read_ppi_mg.py::load_all_data). After that:
from huggingface.load_pretrained import load_model
model, ckpt = load_model(
    repo_id="zhaolabutmb/LiverDCP",
    ppi_data=ppi_data,
    num_ppi_relations=num_ppi_relations,
    num_mg_relations=num_mg_relations,
    device="cuda",
)
Running the ESM bypass at inference additionally requires
esm/esm2_embeddings_480d_fp32_truncated.pt from the Zenodo deposit.
Training configuration
This checkpoint was produced with:
python -u DCP_model/train.py \
--shared_ppi_gat --context_projection \
--feat_mat 1024 --hidden 256 --output 64 --n_heads 4 --pc_att_channels 8 \
--batch_size 156 --loader neighbor --num_neighbors 10,5 --neg_sample_ratio 1 \
--ctx_sample_size 50 --epochs 100 --theta 0.9 --lr 0.0005 --dropout 0.05 \
--center_loss_lambda 0.05 --inter_class_weight 0.1 --inter_class_margin 1.5 \
--center_loss_warmup 3 --center_loss_ramp_epochs 8 --warmup_epochs 5 \
--use_amp --early_stopping_metric val_roc --early_stopping_patience 40 \
--seed 120 \
--esm_side_adapter --esm_side_adapter_rank 48 --only_low_rank_u_v \
--freeze_encoder_quality_threshold 0.7 --quality_checkpoint
See config.json for the parsed architecture / training hyperparameters.
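The --esm_side_adapter and --esm_side_adapter_rank 48 flags refer to the zero-init side adapter that injects ESM2 features. The following is a minimal sketch of the general technique, not the LiverDCP implementation: the class name, the point where the adapter attaches, and the choice of which projection is zeroed are all assumptions. The dimensions are taken from this card (480-d ESM2 features, 256-d embeddings, rank 48).

```python
import torch
from torch import nn


class ZeroInitSideAdapter(nn.Module):
    """Low-rank side path h + up(down(esm)), with `up` initialized to zero.

    Hypothetical illustration; not the module defined in DCP_model.
    """

    def __init__(self, esm_dim=480, hidden_dim=256, rank=48):
        super().__init__()
        self.down = nn.Linear(esm_dim, rank, bias=False)   # 480 -> 48
        self.up = nn.Linear(rank, hidden_dim, bias=False)  # 48 -> 256
        nn.init.zeros_(self.up.weight)                     # side path starts as a no-op

    def forward(self, h, esm):
        return h + self.up(self.down(esm))


torch.manual_seed(0)
adapter = ZeroInitSideAdapter()
h = torch.randn(4, 256)    # stand-in base embeddings
esm = torch.randn(4, 480)  # stand-in ESM2 features
out = adapter(h, esm)
print(torch.equal(out, h))  # True
```

Because only the up projection is zeroed, the adapter leaves the base embeddings unchanged at initialization while still receiving a nonzero gradient on the first step, so the ESM signal is blended in gradually during training.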
Intended use & limitations
- Intended use: generating disease- and cell-type-resolved protein embeddings for downstream tasks such as GWAS gene prioritization and drug target discovery in human liver diseases.
- Out of scope: clinical decision making; non-liver tissues; species other than human.
- Known limitations: PPI coverage is bounded by PrePPI-AF and the per-context DEG universe. Cell-type and diagnosis labels follow the LiverHomo schema.
Citation
Citation block to be added after publication.
License
CC BY 4.0 for model weights and embeddings. Source code is released under the license specified in the GitHub repository.