---
library_name: transformers
tags:
- AT
- masked-language-modeling
- protein-annotations
license: mit
---

# AT (Annotation Transformer, MLM-pretrained, preset `at_base`)

BERT-style masked annotation modeling over the 88k Annotation Vocabulary
(Hallee et al. 2024). Trained on `lhallee/AV_large`.
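
The objective mirrors BERT's masked language modeling, applied to annotation token IDs rather than words or residues. Below is a minimal sketch of the masking step; the 0.15 mask probability comes from the training setup, while the 80/10/10 replacement split, the `mask_token_id` / `vocab_size` names, and the lack of special-token handling are illustrative assumptions.

```python
import torch

def mask_annotations(input_ids, mask_token_id, vocab_size, mask_prob=0.15):
    """BERT-style masking sketch over annotation token IDs (special tokens ignored)."""
    labels = input_ids.clone()
    masked = torch.bernoulli(torch.full(input_ids.shape, mask_prob)).bool()
    labels[~masked] = -100  # only masked positions contribute to the MLM loss

    corrupted = input_ids.clone()
    # Of the masked positions: 80% -> [MASK], 10% -> random annotation token, 10% -> unchanged.
    to_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    corrupted[to_mask] = mask_token_id
    to_random = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~to_mask
    corrupted[to_random] = torch.randint(vocab_size, input_ids.shape)[to_random]
    return corrupted, labels
```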

## Training

| Setting | Value |
| --- | --- |
| Preset | `at_base` |
| Dataset | `lhallee/AV_large` |
| Sequence length | 192 (fixed; static shapes for `torch.compile`) |
| Mask probability | 0.15 |
| Batch size | 1024 |
| Steps | 100,000 |
| Optimizer | AdamW(lr=0.0003, betas=(0.9, 0.98), wd=0.01) |
| Schedule | linear warmup over 2,000 steps, then cosine decay to 0.1 * lr |
| Precision | bf16 |
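
For reference, a rough PyTorch reconstruction of the optimizer and schedule rows, assuming plain `AdamW` plus a per-step `LambdaLR`; only the hyperparameters come from the table, while the stand-in `model` and the exact lambda form are assumptions.

```python
import math
import torch

base_lr, total_steps, warmup_steps, min_ratio = 3e-4, 100_000, 2_000, 0.1
model = torch.nn.Linear(8, 8)  # stand-in; in practice the Annotation Transformer

optimizer = torch.optim.AdamW(
    model.parameters(), lr=base_lr, betas=(0.9, 0.98), weight_decay=0.01
)

def lr_lambda(step):
    # Linear warmup over 2,000 steps, then cosine decay down to 0.1 * base_lr.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_ratio + (1.0 - min_ratio) * 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```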

## Final validation metrics

- **loss**: 0.6910
- **macro_f1**: 0.7557
- **macro_precision**: 0.7541
- **macro_recall**: 0.7669
- **mcc**: 0.9103
- **perplexity**: 1.9958
- **top1_acc**: 0.9106
- **top5_acc**: 0.9536
- **top10_acc**: 0.9576
- **top25_acc**: 0.9612

## How to use

```python
from models.annotation_transformer import AnnotationTransformer

AT = AnnotationTransformer.from_pretrained("Synthyra/AT-Base")

# input_ids: (batch, seq_len) token IDs over the 88k Annotation Vocabulary
# attention_mask: (batch, seq_len), 1 for annotation tokens, 0 for padding
pooled = AT(input_ids, attention_mask)  # (batch, hidden_size)
```
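
Continuing from the snippet above, the pooled vectors can be compared directly. The random `ids_a` / `ids_b` below are placeholder token IDs (real inputs come from the Annotation Vocabulary tokenizer), and plain cosine similarity is just one reasonable choice, not something prescribed by this card.

```python
import torch
import torch.nn.functional as F

# Placeholder annotation-set pair at the fixed sequence length of 192.
ids_a = torch.randint(5, 100, (1, 192)); mask_a = torch.ones_like(ids_a)
ids_b = torch.randint(5, 100, (1, 192)); mask_b = torch.ones_like(ids_b)

emb_a = AT(ids_a, mask_a)  # (1, hidden_size) pooled annotation embedding
emb_b = AT(ids_b, mask_b)
similarity = F.cosine_similarity(emb_a, emb_b, dim=-1)  # (1,)
```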

The downstream consumer in this repo is the vec2vec translator
(sets 7/8): annotations are pre-embedded with this AT (frozen) and
mapped to/from a PLM embedding space via the same paired-batch
contrastive recipe used in sets 1-3.
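
The translator itself lives elsewhere in this repo; purely as an illustration of a paired-batch contrastive objective, the sketch below assumes a symmetric InfoNCE loss over matched (AT embedding, PLM embedding) rows. The actual recipe used in sets 1-3 may differ in its details.

```python
import torch
import torch.nn.functional as F

def paired_batch_contrastive(at_emb, plm_emb, temperature=0.07):
    """Symmetric InfoNCE: row i of at_emb is the positive for row i of plm_emb."""
    at_emb = F.normalize(at_emb, dim=-1)
    plm_emb = F.normalize(plm_emb, dim=-1)
    logits = at_emb @ plm_emb.t() / temperature        # (batch, batch) similarities
    targets = torch.arange(at_emb.size(0), device=at_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Hypothetical usage with a vec2vec-style mapper (names are stand-ins):
# loss = paired_batch_contrastive(translator(at_pooled), plm_emb)
```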

## References

- Hallee et al. 2024, bioRxiv 2024.07.30.605924 -- Annotation Vocabulary.
- Jha et al. 2025, arXiv:2505.12540 -- vec2vec.