---
library_name: transformers
tags:
- AT
- masked-language-modeling
- protein-annotations
license: mit
---
# AT (Annotation Transformer, MLM-pretrained, preset `at_base`)
BERT-style masked annotation modeling over the 88k Annotation Vocabulary
(Hallee et al. 2024). Trained on `lhallee/AV_large`.
## Training
| Setting | Value |
| --- | --- |
| Preset | `at_base` |
| Dataset | `lhallee/AV_large` |
| Sequence length | 192 (fixed; static shapes for `torch.compile`) |
| Mask probability | 0.15 |
| Batch size | 1024 |
| Steps | 100000 |
| Optimizer | AdamW(lr=0.0003, betas=(0.9, 0.98), wd=0.01) |
| Schedule | linear warmup over 2000 steps, then cosine decay to 0.1 × lr |
| Precision | bf16 |
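The masking objective above follows the standard BERT recipe at 0.15 mask probability. A minimal sketch of that corruption step over annotation token ids (the special/mask token ids and vocabulary size here are illustrative placeholders, not the real AV tokenizer's values):

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mask_prob=0.15, pad_token_id=0):
    """BERT-style masking sketch: mask_prob of non-pad positions become targets;
    of those, 80% -> [MASK], 10% -> a random token, 10% left unchanged.
    Non-target positions get label -100 (ignored by cross-entropy)."""
    labels = input_ids.clone()
    prob = torch.full(input_ids.shape, mask_prob)
    prob.masked_fill_(input_ids == pad_token_id, 0.0)  # never mask padding
    targets = torch.bernoulli(prob).bool()
    labels[~targets] = -100

    corrupted = input_ids.clone()
    replace = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & targets
    corrupted[replace] = mask_token_id
    random_tok = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & targets & ~replace
    corrupted[random_tok] = torch.randint(vocab_size, input_ids.shape)[random_tok]
    return corrupted, labels
```

At sequence length 192 with static shapes, this per-batch corruption is the only stochastic part of the input pipeline.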
## Final validation metrics
- **loss**: 0.6910
- **perplexity**: 1.9958
- **top1_acc**: 0.9106
- **top5_acc**: 0.9536
- **top10_acc**: 0.9576
- **top25_acc**: 0.9612
- **macro_precision**: 0.7541
- **macro_recall**: 0.7669
- **macro_f1**: 0.7557
- **mcc**: 0.9103
## How to use
```python
from models.annotation_transformer import AnnotationTransformer

AT = AnnotationTransformer.from_pretrained("Synthyra/AT-Base")
# input_ids / attention_mask: (batch, 192) LongTensors of annotation
# token ids and the corresponding padding mask
pooled = AT(input_ids, attention_mask)  # (batch, hidden_size)
```
The downstream consumer in this repo is the vec2vec translator
(sets 7/8): annotations are pre-embedded with this AT (frozen) and
mapped to / from a PLM embedding space via the same paired-batch
contrastive recipe used in sets 1-3.
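To make the downstream setup concrete, here is a minimal sketch of a paired-batch contrastive translator in the spirit described above. The module shape, hidden width, and temperature are illustrative assumptions, not the repo's actual architecture or hyperparameters:

```python
import torch
import torch.nn.functional as F

class Translator(torch.nn.Module):
    """Hypothetical MLP mapping frozen-AT pooled vectors into a PLM embedding space."""
    def __init__(self, at_dim, plm_dim, hidden=1024):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(at_dim, hidden),
            torch.nn.GELU(),
            torch.nn.Linear(hidden, plm_dim),
        )

    def forward(self, x):
        return self.net(x)

def paired_batch_contrastive_loss(pred, target, temperature=0.07):
    """InfoNCE over a paired batch: row i of pred should match row i of target,
    with the other rows of the same batch serving as in-batch negatives."""
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target, dim=-1)
    logits = pred @ target.T / temperature
    labels = torch.arange(pred.size(0))
    return F.cross_entropy(logits, labels)
```

Because the AT is frozen, annotation embeddings can be precomputed once and the translator trained cheaply on (annotation vector, PLM vector) pairs.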
## References
- Hallee et al. 2024, bioRxiv 2024.07.30.605924 -- Annotation Vocabulary.
- Jha et al. 2025, arXiv:2505.12540 -- vec2vec.