---
library_name: transformers
tags:
  - AT
  - masked-language-modeling
  - protein-annotations
license: mit
---

# AT (Annotation Transformer, MLM-pretrained, preset `at_base`)

BERT-style masked annotation modeling over the 88k Annotation Vocabulary (Hallee et al. 2024). Trained on lhallee/AV_large.
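The objective is BERT-style masked modeling over annotation tokens. A minimal sketch of the per-position corruption, assuming the standard BERT 80/10/10 split (the card states only the mask probability; `MASK_ID` and the `-100` ignore index are illustrative placeholders, not values read from the training code):

```python
import random

MASK_ID = 3    # assumed [MASK] token id (placeholder)
IGNORE = -100  # label value ignored by the loss

def mask_tokens(ids, vocab_size, rng, mask_prob=0.15):
    """Standard BERT recipe: each position is selected with mask_prob;
    a selected token becomes [MASK] 80% of the time, a random token 10%,
    and is left unchanged 10%. Returns (corrupted inputs, labels)."""
    inputs, labels = list(ids), [IGNORE] * len(ids)
    for i, tok in enumerate(ids):
        if rng.random() < mask_prob:
            labels[i] = tok  # predict the original token here
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID
            elif r < 0.9:
                inputs[i] = rng.randrange(vocab_size)
            # else: keep the original token
    return inputs, labels
```

Only positions with a label other than `-100` contribute to the cross-entropy loss.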

## Training

| Setting | Value |
|---|---|
| Preset | `at_base` |
| Dataset | `lhallee/AV_large` |
| Sequence length | 192 (fixed; static shapes for `torch.compile`) |
| Mask probability | 0.15 |
| Batch size | 1024 |
| Steps | 100000 |
| Optimizer | AdamW (lr=3e-4, betas=(0.9, 0.98), weight decay=0.01) |
| Schedule | linear warmup over 2000 steps, then cosine decay to 0.1 × lr |
| Precision | bf16 |
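The learning-rate schedule can be written as a pure function of the step. This is a sketch under two assumptions not stated in the card: warmup is linear from 0, and the decay is a standard cosine ending at 0.1 × lr at the final step:

```python
import math

def lr_at_step(step, base_lr=3e-4, warmup_steps=2000,
               total_steps=100_000, final_ratio=0.1):
    """Linear warmup to base_lr, then cosine decay to final_ratio * base_lr."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 -> 0
    return base_lr * (final_ratio + (1.0 - final_ratio) * cosine)
```

The same shape can be handed to `torch.optim.lr_scheduler.LambdaLR` by dividing out `base_lr`.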

## Final validation metrics

- loss: 0.6910
- perplexity: 1.9958
- top1_acc: 0.9106
- top5_acc: 0.9536
- top10_acc: 0.9576
- top25_acc: 0.9612
- macro_f1: 0.7557
- macro_precision: 0.7541
- macro_recall: 0.7669
- mcc: 0.9103

## How to use

```python
from models.annotation_transformer import AnnotationTransformer

AT = AnnotationTransformer.from_pretrained("Synthyra/AT-Base")
pooled = AT(input_ids, attention_mask)  # (batch, hidden_size)
```

The downstream consumer in this repo is the vec2vec translator (sets 7/8): annotations are pre-embedded with this AT (frozen), then mapped to/from a protein language model (PLM) embedding space via the same paired-batch contrastive recipe used in sets 1-3.

## References

- Hallee et al., 2024. Annotation Vocabulary. bioRxiv 2024.07.30.605924.
- Jha et al., 2025. vec2vec. arXiv:2505.12540.