+---
+library_name: transformers
+tags:
+- AT
+- masked-language-modeling
+- protein-annotations
+license: mit
+---
+# AT (Annotation Transformer, MLM-pretrained, preset `at_base`)
+AT is a BERT-style masked annotation model over the 88k Annotation Vocabulary
+(Hallee et al. 2024), trained on `lhallee/AV_large`.
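As a rough illustration of what masked annotation modeling involves, the sketch below masks annotation ids at the 0.15 probability listed under Training. It is a minimal sketch under stated assumptions, not the training code behind this card: `MASK_ID`, `PAD_ID`, and `mask_annotations` are hypothetical names, and the real pipeline may use a different scheme (for example BERT's 80/10/10 replacement split).

```python
import random

MASK_ID = 1     # hypothetical id of the mask token (assumption)
PAD_ID = 0      # hypothetical id of the padding token (assumption)
IGNORE = -100   # label value that cross-entropy losses commonly skip


def mask_annotations(ids, mask_prob=0.15, seed=None):
    """Replace each non-pad id with MASK_ID with probability mask_prob.

    Returns (inputs, labels); labels hold the original id at masked
    positions and IGNORE everywhere else.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in ids:
        if tok != PAD_ID and rng.random() < mask_prob:
            inputs.append(MASK_ID)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(IGNORE)
    return inputs, labels


# Example: a short annotation sequence padded with PAD_ID.
example_ids = [17, 42, 305, 88, 9, 0, 0, 0]
masked_inputs, masked_labels = mask_annotations(example_ids, seed=0)
```

Only the masked positions contribute to the loss, so the model is scored on recovering the hidden annotations from the visible ones.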
+## Training
+| Setting | Value |
+| --- | --- |
+| Preset | `at_base` |
+| Dataset | `lhallee/AV_large` |
+| Sequence length | 192 (fixed; static shapes for `torch.compile`) |
+| Mask probability | 0.15 |
+| Batch size | 1024 |
+| Steps | 100000 (of 100000 configured) |
+| Optimizer | AdamW (lr=3e-4, betas=(0.9, 0.98), weight decay=0.01) |
+| Schedule | linear warmup over 2000 steps, then cosine decay to 0.1 * lr |
+| Precision | bf16 |
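The schedule row in the table above can be sketched as a plain function of the step count. This is a hedged sketch, not the training script: `lr_at` is a hypothetical helper, and it assumes the warmup starts from zero, is counted in optimizer steps, and that the cosine decay spans all remaining steps.

```python
import math


def lr_at(step, base_lr=3e-4, warmup=2000, total=100_000, floor=0.1):
    """Learning rate at `step`: linear warmup, then cosine decay to floor * base_lr."""
    if step < warmup:
        # Linear ramp from 0 up to base_lr (assumed starting point).
        return base_lr * step / warmup
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup) / (total - warmup))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # Interpolate between base_lr (cosine = 1) and floor * base_lr (cosine = 0).
    return base_lr * (floor + (1.0 - floor) * cosine)


for s in (0, 1000, 2000, 51_000, 100_000):
    print(s, lr_at(s))
```

Under these assumptions the rate peaks at 3e-4 at step 2000 and ends at 3e-5 at step 100000.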
+## Final validation metrics
+- **loss**: 0.6910
+- **macro_f1**: 0.7557
+- **macro_precision**: 0.7541
+- **macro_recall**: 0.7669
+- **mcc**: 0.9103
+- **perplexity**: 1.9958
+- **top1_acc**: 0.9106
+- **top5_acc**: 0.9536
+- **top10_acc**: 0.9576
+- **top25_acc**: 0.9612
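The reported loss and perplexity are consistent with each other if the loss is a mean cross-entropy in nats (an assumption; the card does not say how the loss is reduced). In that case perplexity is simply the exponential of the loss, which the short check below illustrates.

```python
import math

loss = 0.6910          # validation loss reported above
reported_ppl = 1.9958  # validation perplexity reported above

# Perplexity is exp(cross-entropy) when the loss is measured in nats.
perplexity = math.exp(loss)
print(f"exp(loss) = {perplexity:.4f}, reported = {reported_ppl:.4f}")
```

The small difference in the last digit is what one would expect from the loss being rounded to four decimals before exponentiation.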
+## How to use
+```python
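+# Repo-local module: run from within this repository.
+from models.annotation_transformer import AnnotationTransformer
+AT = AnnotationTransformer.from_pretrained("Synthyra/AT-Base")
+# input_ids and attention_mask are annotation-token tensors prepared by the caller,
+# padded to the fixed sequence length of 192.
+pooled = AT(input_ids, attention_mask)  # (batch, hidden_size)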
+```
+In this repo, the downstream consumer is the vec2vec translator
+(sets 7/8): annotations are pre-embedded with this frozen AT and
+mapped to and from a protein language model (PLM) embedding space
+using the same paired-batch contrastive recipe as sets 1-3.
+## References
+- Hallee et al. 2024, bioRxiv 2024.07.30.605924 -- Annotation Vocabulary.
+- Jha et al. 2025, arXiv:2505.12540 -- vec2vec.