lhallee committed on
Commit a9273e5 · verified · 1 parent: 30b5a1c

Upload README.md with huggingface_hub

  1. README.md +59 -0
README.md ADDED
@@ -0,0 +1,59 @@
+ ---
+ library_name: transformers
+ tags:
+ - AT
+ - masked-language-modeling
+ - protein-annotations
+ license: mit
+ ---
+
+ # AT (Annotation Transformer, MLM-pretrained, preset `at_base`)
+
+ BERT-style masked annotation modeling over the 88k Annotation Vocabulary
+ (Hallee et al. 2024). Trained on `lhallee/AV_large`.
+
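To make "BERT-style masked annotation modeling" concrete, here is a minimal sketch of the masking step in plain Python. It assumes the 80/10/10 replacement split from the original BERT recipe and the `-100` ignore label used by PyTorch's cross-entropy loss; the README confirms neither. `mask_id`, `vocab_size`, and `special_ids` are hypothetical parameters, not values taken from the model.

```python
import random

MASK_PROB = 0.15  # matches "Mask probability" in the Training table


def mask_tokens(ids, mask_id, vocab_size, special_ids=(), mask_prob=MASK_PROB, rng=None):
    """Return (inputs, labels) for one sequence of annotation token IDs.

    labels[i] is the original token at selected positions and -100 elsewhere
    (assumption: -100 is the ignore index of the loss).
    """
    if rng is None:
        rng = random.Random()
    inputs = list(ids)
    labels = [-100] * len(inputs)
    for i, tok in enumerate(ids):
        # Never select special tokens (padding, CLS, ...) for prediction.
        if tok in special_ids or rng.random() >= mask_prob:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            # 10%: replace with a random ID. This may land on a special ID;
            # a real implementation might exclude those.
            inputs[i] = rng.randrange(vocab_size)
        # Remaining 10%: keep the original token unchanged.
    return inputs, labels
```

Passing a seeded `random.Random` as `rng` makes the masking reproducible, which helps when comparing validation runs.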
+ ## Training
+
+ | Setting | Value |
+ | --- | --- |
+ | Preset | `at_base` |
+ | Dataset | `lhallee/AV_large` |
+ | Sequence length | 192 (fixed; static shapes for `torch.compile`) |
+ | Mask probability | 0.15 |
+ | Batch size | 1024 |
+ | Steps | 100000 of 100000 configured |
+ | Optimizer | AdamW(lr=0.0003, betas=(0.9, 0.98), wd=0.01) |
+ | Schedule | linear warmup over 2000 steps, then cosine decay to 0.1 * lr |
+ | Precision | bf16 |
+
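The schedule row can be read as a single function of the step index. The sketch below is one plausible interpretation, not the training script's code. It assumes the warmup length is measured in optimizer steps, that warmup starts from zero, and that the cosine phase spans the remaining steps up to the configured total.

```python
import math


def lr_at(step, base_lr=3e-4, warmup=2000, total=100_000, floor=0.1):
    """Learning rate at a given step: linear warmup, then cosine decay.

    The rate decays from base_lr to floor * base_lr (0.1 * lr in the table).
    """
    if step < warmup:
        # Linear ramp from 0 to base_lr (assumption: ramp starts at 0).
        return base_lr * step / warmup
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return base_lr * (floor + (1.0 - floor) * cosine)
```

Under these assumptions the rate peaks at 3e-4 at step 2000 and ends at 3e-5 at step 100000.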
+ ## Final validation metrics
+
+ - **loss**: 0.6910
+ - **macro_f1**: 0.7557
+ - **macro_precision**: 0.7541
+ - **macro_recall**: 0.7669
+ - **mcc**: 0.9103
+ - **perplexity**: 1.9958
+ - **top1_acc**: 0.9106
+ - **top5_acc**: 0.9536
+ - **top10_acc**: 0.9576
+ - **top25_acc**: 0.9612
+
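The reported loss and perplexity are consistent with the usual definition of perplexity as the exponential of the mean cross-entropy. The check below assumes the loss is a mean cross-entropy in nats over masked positions, which the README does not state explicitly.

```python
import math


def perplexity_from_loss(mean_cross_entropy):
    """Perplexity as exp of the mean cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


reported_loss = 0.6910        # from the metrics list
reported_perplexity = 1.9958  # from the metrics list

computed = perplexity_from_loss(reported_loss)
# The small gap to the reported value comes from the loss being rounded
# to four decimals before exponentiation.
print(f"computed={computed:.4f} reported={reported_perplexity:.4f}")
```

A perplexity near 2 means the model is, on average, about as uncertain as a uniform choice between two annotations at each masked position.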
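+ ## How to use
+
+ ```python
+ # The import path is relative to the source repo (assumption: run from its root).
+ from models.annotation_transformer import AnnotationTransformer
+
+ AT = AnnotationTransformer.from_pretrained("Synthyra/AT-Base")
+ # input_ids, attention_mask: token-ID and mask batches, shape (batch, seq_len)
+ pooled = AT(input_ids, attention_mask)  # (batch, hidden_size)
+ ```
+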
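The snippet above expects `input_ids` and `attention_mask` to exist already. As a hedged illustration of how such a batch might be shaped to the fixed training length of 192, the helper below pads or truncates lists of annotation IDs. `pad_batch` and `PAD_ID = 0` are hypothetical; the actual tokenizer and padding ID for this model are not given in the README.

```python
SEQ_LEN = 192  # fixed sequence length from the Training table
PAD_ID = 0     # assumption: the real padding ID may differ


def pad_batch(sequences, seq_len=SEQ_LEN, pad_id=PAD_ID):
    """Pad or truncate each ID sequence to seq_len and build a 0/1 mask."""
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = list(seq)[:seq_len]  # truncate overly long sequences
        n_pad = seq_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Hypothetical annotation IDs for two proteins.
ids, mask = pad_batch([[17, 4021, 88], [5, 9]])
```

The nested lists would then be converted to integer tensors (for example with `torch.tensor`, assuming the model is PyTorch-based) before calling the model.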
+ In this repo, the model is consumed by the vec2vec translator
+ (sets 7/8): annotations are pre-embedded with this frozen AT and
+ mapped to and from a protein language model (PLM) embedding space
+ using the same paired-batch contrastive recipe as sets 1-3.
+
+ ## References
+
+ - Hallee et al. 2024, bioRxiv 2024.07.30.605924 -- Annotation Vocabulary.
+ - Jha et al. 2025, arXiv:2505.12540 -- vec2vec.