---
library_name: transformers
tags:
- AT
- masked-language-modeling
- protein-annotations
license: mit
---

# AT (Annotation Transformer, MLM-pretrained, preset `at_base`)

BERT-style masked annotation modeling over the 88k Annotation Vocabulary (Hallee et al. 2024). Trained on `lhallee/AV_large`.

## Training

| Setting | Value |
| --- | --- |
| Preset | `at_base` |
| Dataset | `lhallee/AV_large` |
| Sequence length | 192 (fixed; static shapes for `torch.compile`) |
| Mask probability | 0.15 |
| Batch size | 1024 |
| Steps | 100000 |
| Optimizer | AdamW(lr=0.0003, betas=(0.9, 0.98), weight_decay=0.01) |
| Schedule | linear warmup over 2000 steps, then cosine decay to 0.1 * lr |
| Precision | bf16 |

A sketch of the masking step appears in the appendix below.

## Final validation metrics

- **loss**: 0.6910
- **macro_f1**: 0.7557
- **macro_precision**: 0.7541
- **macro_recall**: 0.7669
- **mcc**: 0.9103
- **perplexity**: 1.9958
- **top1_acc**: 0.9106
- **top5_acc**: 0.9536
- **top10_acc**: 0.9576
- **top25_acc**: 0.9612

## How to use

```python
import torch
from models.annotation_transformer import AnnotationTransformer

AT = AnnotationTransformer.from_pretrained("Synthyra/AT-Base")
# Toy ids for illustration; real inputs are annotation token ids padded to length 192
input_ids = torch.randint(0, 88000, (2, 192))
attention_mask = torch.ones_like(input_ids)
pooled = AT(input_ids, attention_mask)  # (batch, hidden_size)
```

The downstream consumer in this repo is the vec2vec translator (sets 7/8): annotations are pre-embedded with this AT (kept frozen) and mapped to and from a protein language model (PLM) embedding space via the same paired-batch contrastive recipe used in sets 1-3. A sketch of the pre-embedding step is also given in the appendix.

## References

- Hallee et al. 2024, bioRxiv 2024.07.30.605924 -- Annotation Vocabulary.
- Jha et al. 2025, arXiv:2505.12540 -- vec2vec.
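
## Appendix: illustrative sketches

The Training table fixes only the 0.15 mask probability. Below is a minimal sketch of a standard BERT-style masking step, assuming the usual 80/10/10 replacement split (that split is an assumption, not stated in the table); `mask_token_id` and `vocab_size` are placeholders for the real tokenizer values.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mask_prob=0.15):
    """BERT-style masking: pick mask_prob of positions as prediction targets,
    then replace 80% with [MASK], 10% with a random token, keep 10% as-is."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()
    masked = torch.bernoulli(torch.full(input_ids.shape, mask_prob)).bool()
    labels[~masked] = -100  # only masked positions contribute to the loss

    replaced = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    input_ids[replaced] = mask_token_id

    randomized = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~replaced
    input_ids[randomized] = torch.randint(vocab_size, input_ids.shape)[randomized]
    return input_ids, labels
```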
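
For the vec2vec step described under "How to use", the sketch below shows how annotation embeddings might be produced with the AT kept frozen; the function name and batching are illustrative assumptions, not the repo's actual pipeline.

```python
import torch
from models.annotation_transformer import AnnotationTransformer

AT = AnnotationTransformer.from_pretrained("Synthyra/AT-Base").eval()
for p in AT.parameters():
    p.requires_grad_(False)  # frozen: gradients flow only through the translator

@torch.no_grad()
def embed_annotations(input_ids, attention_mask):
    # Pooled annotation embeddings, later mapped to/from the PLM space
    return AT(input_ids, attention_mask)  # (batch, hidden_size)
```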