---
library_name: transformers
tags:
- AT
- masked-language-modeling
- protein-annotations
license: mit
---

# AT (Annotation Transformer, MLM-pretrained, preset `at_base`)

BERT-style masked annotation modeling over the 88k-term Annotation Vocabulary
(Hallee et al. 2024). Trained on `lhallee/AV_large`.
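
The objective is standard masked-token prediction, just over annotation ids rather than amino acids or words. Below is a minimal sketch of the corruption step, assuming a dedicated `[MASK]` id and omitting BERT's 80/10/10 replacement split (neither detail is specified on this card):

```python
import torch

def mask_annotations(input_ids: torch.Tensor, mask_token_id: int, mask_prob: float = 0.15):
    """Hide `mask_prob` of the annotation tokens; the model must recover them."""
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape) < mask_prob  # Bernoulli(0.15) per position
    labels[~masked] = -100                            # ignored by cross-entropy
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id                 # replace chosen positions with [MASK]
    return corrupted, labels
```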
## Training

| Setting | Value |
| --- | --- |
| Preset | `at_base` |
| Dataset | `lhallee/AV_large` |
| Sequence length | 192 (fixed; static shapes for `torch.compile`) |
| Mask probability | 0.15 |
| Batch size | 1024 |
| Steps | 100,000 |
| Optimizer | AdamW(lr=3e-4, betas=(0.9, 0.98), weight_decay=0.01) |
| Schedule | linear warmup over 2,000 steps -> cosine decay to 0.1 * lr |
| Precision | bf16 |
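
A minimal sketch of that schedule as a per-step `LambdaLR` multiplier, reconstructed from the table above (not taken from the actual training script):

```python
import math
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the AT parameters
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.98), weight_decay=0.01)

WARMUP, TOTAL, FLOOR = 2_000, 100_000, 0.1  # FLOOR: decay ends at 0.1 * lr

def lr_mult(step: int) -> float:
    if step < WARMUP:
        return step / WARMUP                    # linear warmup 0 -> 1
    t = (step - WARMUP) / (TOTAL - WARMUP)      # decay progress in [0, 1]
    return FLOOR + (1 - FLOOR) * 0.5 * (1 + math.cos(math.pi * t))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_mult)  # call sched.step() once per step
```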
## Final validation metrics
- **loss**: 0.6910
- **perplexity**: 1.9958
- **top1_acc**: 0.9106
- **top5_acc**: 0.9536
- **top10_acc**: 0.9576
- **top25_acc**: 0.9612
- **macro_precision**: 0.7541
- **macro_recall**: 0.7669
- **macro_f1**: 0.7557
- **mcc**: 0.9103
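
As a sanity check, exp(0.6910) ≈ 1.996, so the reported perplexity is consistent with being computed as `exp(loss)`; the two numbers are not independent signals.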
## How to use
```python
import torch
from models.annotation_transformer import AnnotationTransformer

AT = AnnotationTransformer.from_pretrained("Synthyra/AT-Base")

# Placeholder inputs; real ids come from the Annotation Vocabulary tokenizer.
input_ids = torch.randint(0, 88000, (2, 192))  # (batch, seq_len) annotation token ids
attention_mask = torch.ones_like(input_ids)    # 1 = real token, 0 = padding

pooled = AT(input_ids, attention_mask)  # (batch, hidden_size)
```
The downstream consumer in this repo is the vec2vec translator
(sets 7/8): annotations are pre-embedded with this AT (frozen) and
mapped to/from a PLM embedding space via the same paired-batch
contrastive recipe used in sets 1-3.
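
A minimal sketch of that pre-embedding step; `translator` is a hypothetical stand-in for the repo's vec2vec module, not its actual API:

```python
import torch
from models.annotation_transformer import AnnotationTransformer

AT = AnnotationTransformer.from_pretrained("Synthyra/AT-Base")
AT.eval()
for p in AT.parameters():
    p.requires_grad_(False)  # frozen: AT only supplies fixed annotation embeddings

input_ids = torch.randint(0, 88000, (2, 192))  # placeholder ids, as in the usage example
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    ann_emb = AT(input_ids, attention_mask)  # (batch, hidden_size)

# plm_emb = translator(ann_emb)  # hypothetical mapping into the PLM embedding space
```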
## References
- Hallee et al. 2024, bioRxiv 2024.07.30.605924 -- Annotation Vocabulary.
- Jha et al. 2025, arXiv:2505.12540 -- vec2vec.