PLM-Only SwissProt-Evidence Baselines

This bundle contains the external protein language model (PLM) baselines run for the rebuttal.

Scope:

  • sequence-only baselines
  • no projector
  • no Q-former
  • no LLM decoder
  • no text generation in the model loop

Primary models:

  • facebook/esm2_t33_650M_UR50D
  • Rostlab/prot_bert
  • ElnaggarLab/ankh-base

Optional models:

  • Rostlab/ProstT5
  • hugohrban/progen2-small

Tasks:

  • ec
  • go_mf
  • sites
  • regions

Data root

Expected dataset root:

/scratch/user/yining_yang/TAMU/PhD/Pannot/data/SwissProt/splits/splits_condensed_samples_pt

Expected split directories:

  • train_condensed_samples_pt
  • val_id_samples_pt
  • val_ood_family_samples_pt
  • test_id_samples_pt
  • test_ood_family_samples_pt

Task CSVs:

  • ec/ec.csv
  • gene_ontology/go_mf.csv
  • sites/sites.csv
  • regions/regions.csv
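
Before launching a run, it can help to confirm that the data root matches the layout listed above. The following is a minimal sketch, not part of the bundle: the helper name missing_paths is hypothetical, and it assumes each task CSV sits inside every split directory at the relative path shown, which this README does not state explicitly.

```python
from pathlib import Path

# Split directories and task CSV paths as listed in this README.
SPLIT_DIRS = [
    "train_condensed_samples_pt",
    "val_id_samples_pt",
    "val_ood_family_samples_pt",
    "test_id_samples_pt",
    "test_ood_family_samples_pt",
]
TASK_CSVS = {
    "ec": "ec/ec.csv",
    "go_mf": "gene_ontology/go_mf.csv",
    "sites": "sites/sites.csv",
    "regions": "regions/regions.csv",
}


def missing_paths(data_root, task):
    """Return the expected paths that do not exist under data_root (hypothetical helper)."""
    root = Path(data_root)
    missing = []
    for split in SPLIT_DIRS:
        split_dir = root / split
        if not split_dir.is_dir():
            missing.append(split_dir)
            continue
        # Assumption: the task CSV lives inside each split directory.
        csv_path = split_dir / TASK_CSVS[task]
        if not csv_path.is_file():
            missing.append(csv_path)
    return missing
```

An empty list from missing_paths("/path/to/splits_condensed_samples_pt", "ec") means every expected directory and CSV was found under the assumed layout.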

Environment

Conda:

conda env create -f prott3-env.yml
conda activate prott3-env

If you prefer to reuse an existing environment, the important runtime dependencies are:

  • torch
  • transformers
  • peft
  • huggingface_hub
  • pandas
  • scikit-learn
  • tqdm
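
When reusing an existing environment, a quick import check can catch a missing package before a long job starts. This is a minimal sketch using only the standard library; the helper name missing_dependencies is hypothetical, and it checks only that each package is importable, not which version is installed.

```python
import importlib.util

# Package name -> importable module name (they differ for scikit-learn).
RUNTIME_DEPS = {
    "torch": "torch",
    "transformers": "transformers",
    "peft": "peft",
    "huggingface_hub": "huggingface_hub",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "tqdm": "tqdm",
}


def missing_dependencies(deps=RUNTIME_DEPS):
    """Return the package names whose module cannot be found in this environment."""
    return [
        name
        for name, module in deps.items()
        if importlib.util.find_spec(module) is None
    ]
```

Calling missing_dependencies() inside the activated environment returns an empty list when all seven packages are importable.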

Prefetch model checkpoints

Download the public model checkpoints into a local directory before offline runs:

export PLM_BASELINE_MODEL_ROOT=$PWD/local_pretrained_encoders
python -m rebuttal.plm_baselines.prefetch --models esm2 protbert ankh

To include the optional models:

python -m rebuttal.plm_baselines.prefetch --models esm2 protbert ankh prostt5 progen

The shell wrapper does the same on the Grace cluster:

bash rebuttal/prefetch_plm_models.sh

Local smoke run

python -m rebuttal.plm_baselines.train_eval \
  --model-key esm2 \
  --task ec \
  --data-root /path/to/splits_condensed_samples_pt \
  --output-dir ./results/rebuttal_plm_baselines_local/esm2/ec \
  --epochs 1 \
  --batch-size 2 \
  --accumulate-grad-batches 4 \
  --max-train-samples 128 \
  --max-eval-samples 128 \
  --smoke-only

Local full run

python -m rebuttal.plm_baselines.train_eval \
  --model-key protbert \
  --task go_mf \
  --data-root /path/to/splits_condensed_samples_pt \
  --output-dir ./results/rebuttal_plm_baselines_local/protbert/go_mf \
  --epochs 3 \
  --batch-size 2 \
  --accumulate-grad-batches 8

Aggregate summaries

python -m rebuttal.plm_baselines.aggregate \
  --input-root ./results/rebuttal_plm_baselines_local \
  --output ./results/rebuttal_plm_baselines_local/aggregate_summary.json

Important label/evaluation notes

  • ec is parsed from text_label and evaluated with exact match plus hierarchical similarity.
  • go_mf is parsed into multi-label GO IDs from text_label.
  • sites and regions are residue-level flat classifiers.
  • The residue datasets repeat each accession many times, so the baseline code aggregates rows by accession and handles spans at the accession level. This avoids false negatives from overlapping annotations.
  • The benchmark-facing residue metrics are reported as:
    • MC_Acc
    • MC_MRR
    • MC_Rec@5
    • necessity_delta
    • sufficiency_delta
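
The notes above can be illustrated with a short sketch. It is not the bundle's implementation: the regular expressions, the function names, the prefix-based definition of hierarchical similarity, and the (accession, start, end, label) row shape with 1-based inclusive spans are all assumptions made for illustration.

```python
import re
from collections import defaultdict

GO_ID = re.compile(r"GO:\d{7}")
# Four dot-separated EC levels; a trailing "-" marks an unspecified level.
EC_NUMBER = re.compile(r"\b\d+(?:\.(?:\d+|-)){3}")


def parse_go_ids(text_label):
    """Return the distinct GO IDs mentioned in a label, sorted."""
    return sorted(set(GO_ID.findall(text_label)))


def parse_ec(text_label):
    """Return the first EC number as a list of four level strings, or None."""
    match = EC_NUMBER.search(text_label)
    return match.group(0).split(".") if match else None


def ec_hierarchical_similarity(pred, gold):
    """Fraction of the four EC levels that agree, counted from the left.

    Assumed definition: stop at the first mismatch or unspecified level.
    """
    shared = 0
    for p, g in zip(pred, gold):
        if p != g or p == "-":
            break
        shared += 1
    return shared / 4


def residue_label_sets(rows):
    """Group (accession, start, end, label) rows into per-residue label sets.

    Overlapping spans on one accession contribute to the same position, so a
    residue covered by two annotations keeps both labels instead of one of
    them being counted as a false negative.
    """
    by_accession = defaultdict(lambda: defaultdict(set))
    for accession, start, end, label in rows:
        for pos in range(start, end + 1):  # assumed 1-based, inclusive
            by_accession[accession][pos].add(label)
    return by_accession
```

Under this assumed definition, a prediction of 3.4.21.4 against a gold label of 3.4.11.4 shares two leading levels and scores 0.5, while an exact match scores 1.0.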

Recommended run order

  1. esm2, protbert, ankh on ec
  2. esm2, protbert, ankh on go_mf
  3. sites
  4. regions
  5. optional prostt5 / progen
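
The run order above can be turned into a list of commands with a small helper. This is a sketch under stated assumptions: planned_runs and train_eval_command are hypothetical names, the sites and regions steps are assumed to cover the three primary models, and the keys ankh, prostt5, and progen are assumed to be accepted by --model-key as they are by the prefetch script. Only flags shown in the examples above are used.

```python
PRIMARY_MODELS = ["esm2", "protbert", "ankh"]
OPTIONAL_MODELS = ["prostt5", "progen"]
TASKS = ["ec", "go_mf", "sites", "regions"]


def planned_runs(include_optional=False):
    """Return (model_key, task) pairs in the recommended order."""
    # Primary models first, task by task: ec, go_mf, sites, regions.
    runs = [(model, task) for task in TASKS for model in PRIMARY_MODELS]
    if include_optional:
        # Optional models go last, after all primary runs.
        runs += [(model, task) for model in OPTIONAL_MODELS for task in TASKS]
    return runs


def train_eval_command(model_key, task, data_root, output_root, extra_args=()):
    """Build the argument list for one train_eval invocation."""
    return [
        "python", "-m", "rebuttal.plm_baselines.train_eval",
        "--model-key", model_key,
        "--task", task,
        "--data-root", data_root,
        "--output-dir", f"{output_root}/{model_key}/{task}",
        *extra_args,
    ]
```

Each argument list can be passed to subprocess.run, with training flags such as ("--epochs", "3", "--batch-size", "2") supplied through extra_args.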