PLM-Only SwissProt-Evidence Baselines
This bundle contains the external protein language model (PLM) baselines run during the rebuttal period.
Scope:
- sequence-only baselines
- no projector
- no Q-former
- no LLM decoder
- no text generation in the model loop
Primary models:
- facebook/esm2_t33_650M_UR50D
- Rostlab/prot_bert
- ElnaggarLab/ankh-base
Optional models:
- Rostlab/ProstT5
- hugohrban/progen2-small
Tasks:
- ec
- go_mf
- sites
- regions
Data root
Expected dataset root:
/scratch/user/yining_yang/TAMU/PhD/Pannot/data/SwissProt/splits/splits_condensed_samples_pt
Expected split directories:
- train_condensed_samples_pt
- val_id_samples_pt
- val_ood_family_samples_pt
- test_id_samples_pt
- test_ood_family_samples_pt
Task CSVs:
- ec/ec.csv
- gene_ontology/go_mf.csv
- sites/sites.csv
- regions/regions.csv
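Before launching a run, it can help to confirm that the dataset root has the expected layout. The following is a minimal sketch, not part of the baseline code: the function name `missing_paths` is hypothetical, and it assumes that each split directory holds its own copy of the task CSVs at the relative paths listed above, which this README does not confirm.

```python
from pathlib import Path

# Names copied from the lists above.
SPLIT_DIRS = [
    "train_condensed_samples_pt",
    "val_id_samples_pt",
    "val_ood_family_samples_pt",
    "test_id_samples_pt",
    "test_ood_family_samples_pt",
]
TASK_CSVS = {
    "ec": "ec/ec.csv",
    "go_mf": "gene_ontology/go_mf.csv",
    "sites": "sites/sites.csv",
    "regions": "regions/regions.csv",
}


def missing_paths(data_root, tasks=tuple(TASK_CSVS)):
    """Return the expected paths that do not exist under data_root.

    Assumption (not confirmed by this README): each split directory
    contains the task CSVs at the relative paths in TASK_CSVS.
    """
    root = Path(data_root)
    missing = []
    for split in SPLIT_DIRS:
        split_dir = root / split
        if not split_dir.is_dir():
            missing.append(split_dir)
            continue  # skip CSV checks inside a missing split
        for task in tasks:
            csv_path = split_dir / TASK_CSVS[task]
            if not csv_path.is_file():
                missing.append(csv_path)
    return missing
```

Calling `missing_paths("/path/to/splits_condensed_samples_pt", tasks=("ec",))` returns an empty list when every split directory and its ec/ec.csv are present. If the CSVs actually live elsewhere, only the split-directory check applies.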
Environment
Conda:
conda env create -f prott3-env.yml
conda activate prott3-env
If you prefer to reuse an existing environment, the important runtime dependencies are:
- torch
- transformers
- peft
- huggingface_hub
- pandas
- scikit-learn
- tqdm
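When reusing an existing environment, a quick check for the packages above can save a failed job. This is a small sketch using only the standard library; the helper name `missing_dependencies` is hypothetical, and it checks for presence only, not for versions.

```python
from importlib.util import find_spec

# Distribution name -> import name. Only scikit-learn differs: it is
# installed as "scikit-learn" but imported as "sklearn".
RUNTIME_DEPS = {
    "torch": "torch",
    "transformers": "transformers",
    "peft": "peft",
    "huggingface_hub": "huggingface_hub",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "tqdm": "tqdm",
}


def missing_dependencies(deps=RUNTIME_DEPS):
    """Return the distribution names whose import module cannot be found."""
    return [dist for dist, module in deps.items() if find_spec(module) is None]


print("missing:", missing_dependencies() or "none")
```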
Prefetch model checkpoints
Download the public model checkpoints into a local directory before offline runs:
export PLM_BASELINE_MODEL_ROOT=$PWD/local_pretrained_encoders
python -m rebuttal.plm_baselines.prefetch --models esm2 protbert ankh
To include the optional models:
python -m rebuttal.plm_baselines.prefetch --models esm2 protbert ankh prostt5 progen
The shell wrapper does the same on Grace:
bash rebuttal/prefetch_plm_models.sh
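For readers without access to the prefetch module, the same idea can be sketched with `huggingface_hub.snapshot_download`. This is an illustration under assumptions, not the module's actual implementation: the mapping from model keys to Hub repositories is inferred from the model lists in this README, and the one-directory-per-key layout under PLM_BASELINE_MODEL_ROOT is hypothetical.

```python
import os
from pathlib import Path

# Assumed mapping from the CLI model keys to the Hub repositories listed
# under "Primary models" and "Optional models"; the real prefetch module
# may use a different mapping.
MODEL_REPOS = {
    "esm2": "facebook/esm2_t33_650M_UR50D",
    "protbert": "Rostlab/prot_bert",
    "ankh": "ElnaggarLab/ankh-base",
    "prostt5": "Rostlab/ProstT5",
    "progen": "hugohrban/progen2-small",
}


def local_dir_for(model_key, model_root=None):
    """Hypothetical layout: one sub-directory per model key under the root."""
    root = model_root or os.environ.get(
        "PLM_BASELINE_MODEL_ROOT", "local_pretrained_encoders"
    )
    return Path(root) / model_key


def prefetch(model_keys, model_root=None):
    """Download each checkpoint once so later runs can work offline."""
    # Imported lazily so the path helpers work without huggingface_hub.
    from huggingface_hub import snapshot_download

    for key in model_keys:
        target = local_dir_for(key, model_root)
        snapshot_download(repo_id=MODEL_REPOS[key], local_dir=str(target))
        print(f"{key}: {MODEL_REPOS[key]} -> {target}")
```

Calling `prefetch(["esm2", "protbert", "ankh"])` on a node with network access would populate the local directory. On compute nodes without network access, setting `HF_HUB_OFFLINE=1` tells huggingface_hub not to attempt downloads; whether the baseline code relies on that variable is not stated here.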
Local smoke run
python -m rebuttal.plm_baselines.train_eval \
--model-key esm2 \
--task ec \
--data-root /path/to/splits_condensed_samples_pt \
--output-dir ./results/rebuttal_plm_baselines_local/esm2/ec \
--epochs 1 \
--batch-size 2 \
--accumulate-grad-batches 4 \
--max-train-samples 128 \
--max-eval-samples 128 \
--smoke-only
Local full run
python -m rebuttal.plm_baselines.train_eval \
--model-key protbert \
--task go_mf \
--data-root /path/to/splits_condensed_samples_pt \
--output-dir ./results/rebuttal_plm_baselines_local/protbert/go_mf \
--epochs 3 \
--batch-size 2 \
--accumulate-grad-batches 8
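The smoke and full runs above use the same per-step batch size but different accumulation settings. Assuming --accumulate-grad-batches has its usual gradient-accumulation meaning (one optimizer step per N forward/backward passes), the effective batch size works out as follows. The helper and its `num_devices` parameter are illustrative only.

```python
def effective_batch_size(batch_size, accumulate_grad_batches, num_devices=1):
    """Samples contributing to one optimizer step under gradient accumulation."""
    return batch_size * accumulate_grad_batches * num_devices


smoke = effective_batch_size(batch_size=2, accumulate_grad_batches=4)
full = effective_batch_size(batch_size=2, accumulate_grad_batches=8)
print(smoke, full)  # 8 16
```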
Aggregate summaries
python -m rebuttal.plm_baselines.aggregate \
--input-root ./results/rebuttal_plm_baselines_local \
--output ./results/rebuttal_plm_baselines_local/aggregate_summary.json
Important label/evaluation notes
- ec is parsed from text_label and evaluated with exact match plus hierarchical similarity.
- go_mf is parsed into multi-label GO IDs from text_label.
- sites and regions are residue-level flat classifiers.
- The residue datasets repeat accessions many times, so the baseline code aggregates rows by accession and uses accession-level span handling to avoid false negatives from overlapping annotations.
- The benchmark-facing residue metrics are reported as:
  - MC_Acc
  - MC_MRR
  - MC_Rec@5
  - necessity_delta
  - sufficiency_delta
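Two of the notes above can be illustrated with a short sketch. The exact definitions used by the baseline code are not given in this README, so both functions are assumptions: hierarchical EC similarity is taken here as the fraction of the four EC levels that agree from the left, and the residue rows are assumed to carry (accession, start, end) with 1-based inclusive spans. Merging by accession matters because a residue annotated in one row would otherwise be scored as a negative when another row for the same protein is evaluated in isolation.

```python
from collections import defaultdict


def ec_hierarchical_similarity(pred, gold):
    """Fraction of the four EC levels that match, counted from the left.

    Assumed definition for illustration; the baseline code may score
    partial matches differently.
    """
    matched = 0
    for p, g in zip(pred.strip().split("."), gold.strip().split(".")):
        if p != g or p == "-":  # "-" marks an unspecified level
            break
        matched += 1
    return matched / 4


def merge_spans_by_accession(rows):
    """Collapse repeated rows into one set of labeled residues per accession.

    rows: iterable of (accession, start, end) with 1-based inclusive spans
    (hypothetical column layout).
    """
    positives = defaultdict(set)
    for accession, start, end in rows:
        positives[accession].update(range(start, end + 1))
    return dict(positives)


print(ec_hierarchical_similarity("3.4.21.4", "3.4.21.5"))  # 0.75
rows = [("P12345", 10, 12), ("P12345", 12, 14), ("Q99999", 5, 5)]
print(sorted(merge_spans_by_accession(rows)["P12345"]))  # [10, 11, 12, 13, 14]
```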
Recommended run order
- esm2, protbert, ankh on ec
- esm2, protbert, ankh on go_mf
- sites
- regions
- optional prostt5 / progen
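The run order can be turned into a list of commands with a small script. This is a sketch only: it reuses the flags shown in the local runs above, follows their `<output root>/<model>/<task>` directory pattern, omits the epoch and batch-size flags, and assumes that sites and regions are run with the same three primary models, which the list above does not state explicitly.

```python
import shlex

PRIMARY_MODELS = ["esm2", "protbert", "ankh"]
TASK_ORDER = ["ec", "go_mf", "sites", "regions"]


def build_commands(data_root, output_root, models=PRIMARY_MODELS, tasks=TASK_ORDER):
    """Return one train_eval command per (task, model), tasks outermost."""
    commands = []
    for task in tasks:
        for model in models:
            argv = [
                "python", "-m", "rebuttal.plm_baselines.train_eval",
                "--model-key", model,
                "--task", task,
                "--data-root", data_root,
                "--output-dir", f"{output_root}/{model}/{task}",
            ]
            commands.append(shlex.join(argv))
    return commands


for command in build_commands(
    "/path/to/splits_condensed_samples_pt",
    "./results/rebuttal_plm_baselines_local",
):
    print(command)
```

Passing `models=["prostt5", "progen"]` produces the optional runs in the same task order.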