---
library_name: transformers
tags:
  - AT
  - masked-language-modeling
  - protein-annotations
license: mit
---

# AT (Annotation Transformer, MLM-pretrained, preset `at_base`)

BERT-style masked annotation modeling over the 88k Annotation Vocabulary (Hallee et al. 2024). Trained on lhallee/AV_large.
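The objective is BERT-style masked modeling over annotation tokens. A minimal sketch of the per-position corruption, assuming the standard BERT 80/10/10 split (the card states only the mask probability; `MASK_ID` and the `-100` ignore index are illustrative placeholders, not values read from the training code):

```python
import random

MASK_ID = 3    # assumed [MASK] token id (placeholder)
IGNORE = -100  # label value ignored by the loss

def mask_tokens(ids, vocab_size, rng, mask_prob=0.15):
    """Standard BERT recipe: each position is selected with mask_prob;
    a selected token becomes [MASK] 80% of the time, a random token 10%,
    and is left unchanged 10%. Returns (corrupted inputs, labels)."""
    inputs, labels = list(ids), [IGNORE] * len(ids)
    for i, tok in enumerate(ids):
        if rng.random() < mask_prob:
            labels[i] = tok  # predict the original token here
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID
            elif r < 0.9:
                inputs[i] = rng.randrange(vocab_size)
            # else: keep the original token
    return inputs, labels
```

Only positions with a label other than `-100` contribute to the cross-entropy loss.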

## Training

| Setting | Value |
|---|---|
| Preset | `at_base` |
| Dataset | `lhallee/AV_large` |
| Sequence length | 192 (fixed; static shapes for `torch.compile`) |
| Mask probability | 0.15 |
| Batch size | 1024 |
| Steps | 100000 |
| Optimizer | AdamW (lr=3e-4, betas=(0.9, 0.98), weight decay=0.01) |
| Schedule | linear warmup over 2000 steps, then cosine decay to 0.1 × lr |
| Precision | bf16 |
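The learning-rate schedule can be written as a pure function of the step. This is a sketch under two assumptions not stated in the card: warmup is linear from 0, and the decay is a standard cosine ending at 0.1 × lr at the final step:

```python
import math

def lr_at_step(step, base_lr=3e-4, warmup_steps=2000,
               total_steps=100_000, final_ratio=0.1):
    """Linear warmup to base_lr, then cosine decay to final_ratio * base_lr."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 -> 0
    return base_lr * (final_ratio + (1.0 - final_ratio) * cosine)
```

The same shape can be handed to `torch.optim.lr_scheduler.LambdaLR` by dividing out `base_lr`.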

## Final validation metrics

- loss: 0.6910
- perplexity: 1.9958
- top1_acc: 0.9106
- top5_acc: 0.9536
- top10_acc: 0.9576
- top25_acc: 0.9612
- macro_f1: 0.7557
- macro_precision: 0.7541
- macro_recall: 0.7669
- mcc: 0.9103

## How to use

```python
from models.annotation_transformer import AnnotationTransformer

AT = AnnotationTransformer.from_pretrained("Synthyra/AT-Base")
pooled = AT(input_ids, attention_mask)  # (batch, hidden_size)
```

The downstream consumer in this repo is the vec2vec translator (sets 7/8): annotations are pre-embedded with this AT (frozen), then mapped to/from a protein language model (PLM) embedding space via the same paired-batch contrastive recipe used in sets 1-3.

## References

- Hallee et al., 2024. Annotation Vocabulary. bioRxiv 2024.07.30.605924.
- Jha et al., 2025. vec2vec. arXiv:2505.12540.