LiverDCP — Disease-Cell-type Protein embeddings for the human liver

LiverDCP is a graph neural network that learns context-specific protein embeddings across 286 disease × cell-type combinations of the human liver (11 diagnoses × 26 cell types). It is trained on a multi-modal graph built from PrePPI-AF protein–protein interactions, CellPhoneDB cell–cell communications, and the LiverHomo single-cell atlas (1.16 M cells, 320 patients), with ESM2 protein language model features injected through a zero-init side adapter.
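
To make the context count concrete, here is a minimal sketch of how a (diagnosis, cell type) pair could be flattened into a single context index. The placeholder label names and the row-major numbering are assumptions for illustration only; the authoritative assignment is the one recorded in protein_labels_dict.txt.

```python
from itertools import product

N_DIAGNOSES = 11   # from the model description
N_CELL_TYPES = 26  # from the model description

# Placeholder labels; the real names follow the LiverHomo schema.
diagnoses = [f"diagnosis_{i}" for i in range(N_DIAGNOSES)]
cell_types = [f"cell_type_{j}" for j in range(N_CELL_TYPES)]

# Hypothetical row-major numbering: all cell types of diagnosis 0 first,
# then all cell types of diagnosis 1, and so on.
context_index = {
    pair: idx for idx, pair in enumerate(product(diagnoses, cell_types))
}

print(len(context_index))                              # 286
print(context_index[("diagnosis_2", "cell_type_5")])   # 2 * 26 + 5 = 57
```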

Files in this repo

File                      Purpose
best_model.pt             Full checkpoint: model state_dict, center-loss state, hparams, best val ROC. ~2.5 GB.
protein_embed.pth         Pre-computed effective per-context protein embeddings. dict[ctx_id -> Tensor(N_proteins, 256)].
mg_embed.pth              Metagraph (per-disease) embeddings. Tensor(11, 256).
protein_labels_dict.txt   JSON mapping protein names ↔ context indices.
config.json               Architecture, training hyperparameters, Zenodo DOI, training command.

protein_embed.pth is the file most downstream users want. It is produced by adding the trained ppi_pred_head and esm_decoder_adapter heads on top of the base GAT embeddings, exactly as in DCP_model/train.py.

Quickstart — download embeddings only

# pip install -U "huggingface_hub>=0.24" torch
from huggingface_hub import hf_hub_download
import torch, json

repo_id = "zhaolabutmb/LiverDCP"

# weights_only=False unpickles arbitrary Python objects; only load files from a source you trust.
protein_embed = torch.load(
    hf_hub_download(repo_id, "LiverDCP__protein_embed.pth"), map_location="cpu", weights_only=False
)
mg_embed = torch.load(
    hf_hub_download(repo_id, "LiverDCP__mg_embed.pth"), map_location="cpu", weights_only=False
)
with open(hf_hub_download(repo_id, "LiverDCP__protein_labels_dict.txt")) as f:
    labels = json.load(f)

print(len(protein_embed), "contexts")              # 286
print(mg_embed.shape)                              # torch.Size([11, 256])

Or, if you have cloned the GitHub repo, the same thing with our helper:

from huggingface.load_pretrained import load_embeddings
out = load_embeddings("zhaolabutmb/LiverDCP")
protein_embed, mg_embed = out["protein_embed"], out["mg_embed"]
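
As a sketch of a typical first step with these objects, the snippet below computes disease-by-disease cosine similarity from the metagraph embeddings and finds the nearest neighbours of one protein within a single context. It uses random stand-in tensors with the documented shapes so that it runs without downloading anything. The integer context key, the protein count of 1000, and the query row are assumptions; the real keys and row order should be taken from the labels file.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins with the documented shapes (random values, not real embeddings).
n_proteins = 1000                                   # hypothetical size
protein_embed = {0: torch.randn(n_proteins, 256)}   # one context, assumed key 0
mg_embed = torch.randn(11, 256)

# Disease-by-disease cosine similarity from the metagraph embeddings.
mg_unit = F.normalize(mg_embed, dim=1)
disease_sim = mg_unit @ mg_unit.T                   # shape (11, 11)

# Nearest neighbours of one protein, restricted to a single context.
ctx = protein_embed[0]
query_row = 42                                      # hypothetical row index
sims = F.cosine_similarity(ctx[query_row].unsqueeze(0), ctx, dim=1)
sims[query_row] = float("-inf")                     # exclude the query itself
top_scores, top_rows = sims.topk(5)

print(disease_sim.shape)    # torch.Size([11, 11])
print(top_rows.tolist())    # row indices of the 5 most similar proteins
```

Rows are compared only within one context here because the number of proteins is given per context, so a row index is not assumed to refer to the same protein in two different contexts.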

Reconstructing the full model (advanced)

Re-instantiating the LiverDCP nn.Module requires graph metadata (ppi_data, num_ppi_relations, num_mg_relations) that is not stored in the checkpoint. You must build it from DCP_graph_data (download from Zenodo, then run DCP_model/read_ppi_mg.py::load_all_data). After that:

from huggingface.load_pretrained import load_model
model, ckpt = load_model(
    repo_id="zhaolabutmb/LiverDCP",
    ppi_data=ppi_data,
    num_ppi_relations=num_ppi_relations,
    num_mg_relations=num_mg_relations,
    device="cuda",
)

Using the ESM bypass at inference additionally requires esm/esm2_embeddings_480d_fp32_truncated.pt from the Zenodo deposit.

Training configuration

This checkpoint was produced with:

python -u DCP_model/train.py \
  --shared_ppi_gat --context_projection \
  --feat_mat 1024 --hidden 256 --output 64 --n_heads 4 --pc_att_channels 8 \
  --batch_size 156 --loader neighbor --num_neighbors 10,5 --neg_sample_ratio 1 \
  --ctx_sample_size 50 --epochs 100 --theta 0.9 --lr 0.0005 --dropout 0.05 \
  --center_loss_lambda 0.05 --inter_class_weight 0.1 --inter_class_margin 1.5 \
  --center_loss_warmup 3 --center_loss_ramp_epochs 8 --warmup_epochs 5 \
  --use_amp --early_stopping_metric val_roc --early_stopping_patience 40 \
  --seed 120 \
  --esm_side_adapter --esm_side_adapter_rank 48 --only_low_rank_u_v \
  --freeze_encoder_quality_threshold 0.7 --quality_checkpoint

See config.json for the parsed architecture / training hyperparameters.
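
The flags --center_loss_lambda 0.05, --center_loss_warmup 3 and --center_loss_ramp_epochs 8 suggest that the center-loss weight is phased in over the first epochs. How train.py actually combines them is not stated here, so the function below is only one plausible reading, with a hypothetical name: the weight stays at zero during warmup and then ramps linearly to its final value.

```python
def center_loss_weight(epoch, lam=0.05, warmup=3, ramp_epochs=8):
    """Hypothetical schedule: zero during warmup, then a linear ramp up to lam.

    This is an illustrative assumption, not the schedule used in train.py.
    """
    if epoch < warmup:
        return 0.0
    if ramp_epochs <= 0:
        return lam
    # Fraction of the ramp completed, capped at 1.0 once the ramp is over.
    progress = min(1.0, (epoch - warmup + 1) / ramp_epochs)
    return lam * progress


# Print the weight at a few zero-based epochs.
for epoch in (0, 2, 3, 6, 10, 50):
    print(epoch, round(center_loss_weight(epoch), 5))
```

Under this reading the center-loss term is inactive for the first three epochs and reaches its full weight of 0.05 eight epochs later.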

Intended use & limitations

  • Intended use: generating disease- and cell-type-resolved protein embeddings for downstream tasks such as GWAS gene prioritization and drug target discovery in human liver diseases.
  • Out of scope: clinical decision making; non-liver tissues; species other than human.
  • Known limitations: PPI coverage is bounded by PrePPI-AF and the per-context DEG universe. Cell-type and diagnosis labels follow the LiverHomo schema.

Citation

Citation block to be added after publication.

License

CC BY 4.0 for model weights and embeddings. Source code is released under the license specified in the GitHub repository.
