PLM-Only SwissProt-Evidence Baselines

This bundle contains the external protein language model (PLM) baselines run for the rebuttal.

Scope:

- sequence-only baselines
- no projector
- no Q-former
- no LLM decoder
- no text generation in the model loop

Primary models:

- facebook/esm2_t33_650M_UR50D
- Rostlab/prot_bert
- ElnaggarLab/ankh-base

Optional models:

- Rostlab/ProstT5
- hugohrban/progen2-small

Tasks:

- ec
- go_mf
- sites
- regions

Data root

Expected dataset root:

/scratch/user/yining_yang/TAMU/PhD/Pannot/data/SwissProt/splits/splits_condensed_samples_pt

Expected split directories:

- train_condensed_samples_pt
- val_id_samples_pt
- val_ood_family_samples_pt
- test_id_samples_pt
- test_ood_family_samples_pt

Task CSVs:

- ec/ec.csv
- gene_ontology/go_mf.csv
- sites/sites.csv
- regions/regions.csv
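
Before launching a run, it can help to confirm that the dataset root has the layout listed above. The following is a minimal sketch and not part of the bundle: `missing_entries` and `check_layout` are hypothetical helper names, and the sketch assumes each split directory holds its own copy of the task CSV, which the lists above do not state explicitly. Adjust the path join if the CSVs live elsewhere.

```python
from pathlib import Path

# Split directory names as listed above.
EXPECTED_SPLITS = [
    "train_condensed_samples_pt",
    "val_id_samples_pt",
    "val_ood_family_samples_pt",
    "test_id_samples_pt",
    "test_ood_family_samples_pt",
]

# Task CSV paths as listed above.
TASK_CSVS = {
    "ec": "ec/ec.csv",
    "go_mf": "gene_ontology/go_mf.csv",
    "sites": "sites/sites.csv",
    "regions": "regions/regions.csv",
}


def missing_entries(base, relative_paths):
    """Return the relative paths that do not exist under `base`."""
    base = Path(base)
    return [rel for rel in relative_paths if not (base / rel).exists()]


def check_layout(data_root, task):
    """List missing split directories and, per existing split, a missing task CSV.

    ASSUMPTION: the task CSV is stored inside each split directory.
    """
    data_root = Path(data_root)
    problems = list(missing_entries(data_root, EXPECTED_SPLITS))
    for split in EXPECTED_SPLITS:
        split_dir = data_root / split
        if split_dir.is_dir():
            for rel in missing_entries(split_dir, [TASK_CSVS[task]]):
                problems.append(f"{split}/{rel}")
    return problems
```

Calling `check_layout("/path/to/splits_condensed_samples_pt", "ec")` returns an empty list when nothing is missing under these assumptions.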

Environment

Conda:

conda env create -f prott3-env.yml
conda activate prott3-env

If you prefer to reuse an existing environment, the important runtime dependencies are:

- torch
- transformers
- peft
- huggingface_hub
- pandas
- scikit-learn
- tqdm
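
When reusing an existing environment, a quick import check can catch a missing package before a job is submitted. This is a minimal sketch: `missing_packages` is a hypothetical helper, it only tests that each module can be located, and it does not check versions. Note that scikit-learn is imported as `sklearn`.

```python
import importlib.util

# Package name (as installed) -> importable module name.
RUNTIME_DEPENDENCIES = {
    "torch": "torch",
    "transformers": "transformers",
    "peft": "peft",
    "huggingface_hub": "huggingface_hub",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "tqdm": "tqdm",
}


def missing_packages(dependencies=RUNTIME_DEPENDENCIES):
    """Return package names whose module cannot be found in this environment."""
    return [
        package
        for package, module in dependencies.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    print("missing:", ", ".join(missing) if missing else "none")
```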

Prefetch model checkpoints

Download the public model checkpoints into a local directory before offline runs:

export PLM_BASELINE_MODEL_ROOT=$PWD/local_pretrained_encoders
python -m rebuttal.plm_baselines.prefetch --models esm2 protbert ankh

To include the optional models:

python -m rebuttal.plm_baselines.prefetch --models esm2 protbert ankh prostt5 progen

The shell wrapper does the same on the Grace cluster:

bash rebuttal/prefetch_plm_models.sh

Local smoke run

python -m rebuttal.plm_baselines.train_eval \
  --model-key esm2 \
  --task ec \
  --data-root /path/to/splits_condensed_samples_pt \
  --output-dir ./results/rebuttal_plm_baselines_local/esm2/ec \
  --epochs 1 \
  --batch-size 2 \
  --accumulate-grad-batches 4 \
  --max-train-samples 128 \
  --max-eval-samples 128 \
  --smoke-only

Local full run

python -m rebuttal.plm_baselines.train_eval \
  --model-key protbert \
  --task go_mf \
  --data-root /path/to/splits_condensed_samples_pt \
  --output-dir ./results/rebuttal_plm_baselines_local/protbert/go_mf \
  --epochs 3 \
  --batch-size 2 \
  --accumulate-grad-batches 8
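
The batch settings in the two commands above imply different effective batch sizes. The arithmetic below is a small sketch that assumes `--accumulate-grad-batches` has the usual gradient-accumulation meaning (one optimizer step every N micro-batches) and a single device; `num_devices` is a hypothetical parameter added for illustration.

```python
def effective_batch_size(batch_size, accumulate_grad_batches, num_devices=1):
    """Samples contributing to one optimizer step under gradient accumulation."""
    return batch_size * accumulate_grad_batches * num_devices


# Smoke run: --batch-size 2 --accumulate-grad-batches 4
print(effective_batch_size(2, 4))  # 8
# Full run: --batch-size 2 --accumulate-grad-batches 8
print(effective_batch_size(2, 8))  # 16
```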

Aggregate summaries

python -m rebuttal.plm_baselines.aggregate \
  --input-root ./results/rebuttal_plm_baselines_local \
  --output ./results/rebuttal_plm_baselines_local/aggregate_summary.json

Important label/evaluation notes

ec is parsed from text_label and evaluated with exact match plus hierarchical similarity.
go_mf is parsed into multi-label GO IDs from text_label.
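
The two notes above can be illustrated with a short sketch. It is not the bundle's code: it assumes that GO IDs appear in text_label in the standard `GO:` plus seven digits form, and it uses one plausible definition of hierarchical similarity (the fraction of the four EC levels that agree from the left). The metric actually used by the baselines may be defined differently.

```python
import re

# Standard GO identifier form: "GO:" followed by seven digits.
GO_ID_PATTERN = re.compile(r"GO:\d{7}")


def parse_go_ids(text_label):
    """Extract the set of GO IDs mentioned in a label string."""
    return set(GO_ID_PATTERN.findall(text_label))


def ec_levels(ec_number):
    """Split an EC number such as '1.1.1.1' into its dot-separated levels."""
    return ec_number.strip().split(".")


def ec_hierarchical_similarity(predicted, reference):
    """Fraction of the four EC levels that agree, counted from the left.

    Counting stops at the first disagreement or at a '-' placeholder in the
    reference. ASSUMPTION: illustrative definition, not taken from the source.
    """
    matched = 0
    for pred_level, ref_level in zip(ec_levels(predicted), ec_levels(reference)):
        if ref_level == "-" or pred_level != ref_level:
            break
        matched += 1
    return matched / 4


def ec_exact_match(predicted, reference):
    """True when the two EC numbers are identical after trimming whitespace."""
    return predicted.strip() == reference.strip()
```

Under this definition a prediction of 1.1.2.1 against a reference of 1.1.1.1 fails exact match but still scores 0.5, since the first two levels agree.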
sites and regions are residue-level flat classifiers.
The residue datasets repeat accessions many times, so the baseline code aggregates rows by accession and uses accession-level span handling to avoid false negatives from overlapping annotations.
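
The accession-level aggregation described above can be sketched as follows. This is an illustration under stated assumptions, not the baseline code: the row layout `(accession, start, end, label)`, the 1-based inclusive span convention, and both function names are hypothetical. Keeping a set of labels per residue means a residue covered by two overlapping annotations is not counted as a negative for either label.

```python
from collections import defaultdict


def aggregate_spans(rows):
    """Group (accession, start, end, label) rows into one span list per accession."""
    spans_by_accession = defaultdict(list)
    for accession, start, end, label in rows:
        spans_by_accession[accession].append((start, end, label))
    return dict(spans_by_accession)


def residue_label_sets(spans, sequence_length):
    """Map each residue position to the set of labels whose spans cover it.

    ASSUMPTION: spans are 1-based and inclusive; positions outside the
    sequence are clipped.
    """
    labels = [set() for _ in range(sequence_length)]
    for start, end, label in spans:
        for position in range(max(start, 1), min(end, sequence_length) + 1):
            labels[position - 1].add(label)
    return labels


rows = [
    ("P1", 1, 3, "binding"),
    ("P1", 3, 4, "active"),  # overlaps the first span at residue 3
    ("P2", 2, 2, "binding"),
]
grouped = aggregate_spans(rows)
per_residue = residue_label_sets(grouped["P1"], sequence_length=5)
```
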
The benchmark-facing residue metrics are reported as:
- MC_Acc
- MC_MRR
- MC_Rec@5
- necessity_delta
- sufficiency_delta
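
For orientation, the first three metric names suggest standard ranking metrics over a candidate set. The sketch below assumes that MC_Acc, MC_MRR, and MC_Rec@5 follow the usual definitions of top-1 accuracy, mean reciprocal rank, and recall at 5; the source does not define them, and it does not define necessity_delta or sufficiency_delta, so those are left out. The function names and the `(scores, correct_index)` input layout are hypothetical.

```python
def rank_of_correct(scores, correct_index):
    """1-based rank of the correct candidate; ties count against it."""
    correct_score = scores[correct_index]
    better_or_tied = sum(
        1
        for index, score in enumerate(scores)
        if index != correct_index and score >= correct_score
    )
    return better_or_tied + 1


def ranking_metrics(examples, k=5):
    """Compute top-1 accuracy, MRR, and recall@k.

    `examples` is a non-empty list of (scores, correct_index) pairs, where
    `scores` holds one score per candidate and higher means more likely.
    """
    ranks = [rank_of_correct(scores, index) for scores, index in examples]
    total = len(ranks)
    return {
        "accuracy": sum(rank == 1 for rank in ranks) / total,
        "mrr": sum(1.0 / rank for rank in ranks) / total,
        "recall_at_k": sum(rank <= k for rank in ranks) / total,
    }
```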

Recommended run order

1. esm2, protbert, ankh on ec
2. esm2, protbert, ankh on go_mf
3. sites
4. regions
5. optional prostt5 / progen
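
The run order above can be turned into a list of commands with a short script. This is a sketch under several assumptions that the source does not confirm: that sites and regions are run with the same three primary models, that the optional models are run on all four tasks, and that `--model-key` accepts the same keys as the prefetch step (`ankh`, `prostt5`, `progen`). Only flags shown in the commands above are used; `run_plan` and `build_command` are hypothetical names.

```python
import shlex

PRIMARY_MODELS = ["esm2", "protbert", "ankh"]
OPTIONAL_MODELS = ["prostt5", "progen"]  # ASSUMPTION: valid --model-key values
TASKS = ["ec", "go_mf", "sites", "regions"]


def run_plan(include_optional=False):
    """Return (model_key, task) pairs in the recommended order."""
    plan = [(model, task) for task in TASKS for model in PRIMARY_MODELS]
    if include_optional:
        plan += [(model, task) for model in OPTIONAL_MODELS for task in TASKS]
    return plan


def build_command(model_key, task, data_root, output_root):
    """Build one train_eval command line as a shell-quoted string."""
    return shlex.join([
        "python", "-m", "rebuttal.plm_baselines.train_eval",
        "--model-key", model_key,
        "--task", task,
        "--data-root", data_root,
        "--output-dir", f"{output_root}/{model_key}/{task}",
    ])


if __name__ == "__main__":
    for model_key, task in run_plan():
        print(build_command(
            model_key,
            task,
            "/path/to/splits_condensed_samples_pt",
            "./results/rebuttal_plm_baselines_local",
        ))
```

The script only prints the commands, so the list can be reviewed or edited before anything is executed.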
