---
library_name: transformers
tags:
- AT
- masked-language-modeling
- protein-annotations
license: mit
---

# AT (Annotation Transformer, MLM-pretrained, preset `at_base`)

BERT-style masked annotation modeling over the 88k-term Annotation Vocabulary
(Hallee et al. 2024). Trained on `lhallee/AV_large`.
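
The objective is standard masked-token prediction, just over annotation ids rather than amino acids or words. Below is a minimal sketch of the corruption step, assuming a dedicated `[MASK]` id and omitting BERT's 80/10/10 replacement split (neither detail is specified on this card):

```python
import torch

def mask_annotations(input_ids: torch.Tensor, mask_token_id: int, mask_prob: float = 0.15):
    """Hide `mask_prob` of the annotation tokens; the model must recover them."""
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape) < mask_prob  # Bernoulli(0.15) per position
    labels[~masked] = -100                            # ignored by cross-entropy
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id                 # replace chosen positions with [MASK]
    return corrupted, labels
```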
## Training

| Setting | Value |
| --- | --- |
| Preset | `at_base` |
| Dataset | `lhallee/AV_large` |
| Sequence length | 192 (fixed; static shapes for `torch.compile`) |
| Mask probability | 0.15 |
| Batch size | 1024 |
| Steps | 100,000 |
| Optimizer | AdamW(lr=3e-4, betas=(0.9, 0.98), weight_decay=0.01) |
| Schedule | linear warmup over 2,000 steps -> cosine decay to 0.1 * lr |
| Precision | bf16 |
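
A minimal sketch of that schedule as a per-step `LambdaLR` multiplier, reconstructed from the table above (not taken from the actual training script):

```python
import math
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the AT parameters
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.98), weight_decay=0.01)

WARMUP, TOTAL, FLOOR = 2_000, 100_000, 0.1  # FLOOR: decay ends at 0.1 * lr

def lr_mult(step: int) -> float:
    if step < WARMUP:
        return step / WARMUP                    # linear warmup 0 -> 1
    t = (step - WARMUP) / (TOTAL - WARMUP)      # decay progress in [0, 1]
    return FLOOR + (1 - FLOOR) * 0.5 * (1 + math.cos(math.pi * t))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_mult)  # call sched.step() once per step
```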
## Final validation metrics
- **loss**: 0.6910
- **perplexity**: 1.9958
- **top1_acc**: 0.9106
- **top5_acc**: 0.9536
- **top10_acc**: 0.9576
- **top25_acc**: 0.9612
- **macro_precision**: 0.7541
- **macro_recall**: 0.7669
- **macro_f1**: 0.7557
- **mcc**: 0.9103
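
As a sanity check, exp(0.6910) ≈ 1.996, so the reported perplexity is consistent with being computed as `exp(loss)`; the two numbers are not independent signals.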
## How to use
```python
import torch
from models.annotation_transformer import AnnotationTransformer

AT = AnnotationTransformer.from_pretrained("Synthyra/AT-Base")

# Placeholder inputs; real ids come from the Annotation Vocabulary tokenizer.
input_ids = torch.randint(0, 88000, (2, 192))  # (batch, seq_len) annotation token ids
attention_mask = torch.ones_like(input_ids)    # 1 = real token, 0 = padding

pooled = AT(input_ids, attention_mask)  # (batch, hidden_size)
```
The downstream consumer in this repo is the vec2vec translator
(sets 7/8): annotations are pre-embedded with this AT (frozen) and
mapped to/from a PLM embedding space via the same paired-batch
contrastive recipe used in sets 1-3.
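
A minimal sketch of that pre-embedding step; `translator` is a hypothetical stand-in for the repo's vec2vec module, not its actual API:

```python
import torch
from models.annotation_transformer import AnnotationTransformer

AT = AnnotationTransformer.from_pretrained("Synthyra/AT-Base")
AT.eval()
for p in AT.parameters():
    p.requires_grad_(False)  # frozen: AT only supplies fixed annotation embeddings

input_ids = torch.randint(0, 88000, (2, 192))  # placeholder ids, as in the usage example
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    ann_emb = AT(input_ids, attention_mask)  # (batch, hidden_size)

# plm_emb = translator(ann_emb)  # hypothetical mapping into the PLM embedding space
```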
## References
- Hallee et al. 2024, bioRxiv 2024.07.30.605924 -- Annotation Vocabulary.
- Jha et al. 2025, arXiv:2505.12540 -- vec2vec.