---
tags:
- Protein-Language-Models
- PLM
- Ancestral-Sequence-Reconstruction
- ASR
- Natural-Language-Processing
- NLP
- Generative-AI
- GenAI
- Biology
- Bioinformatics
---

# Ancestral sequence reconstruction using generative models

Ancestral sequence reconstruction (ASR) is a foundational task in evolutionary biology, providing insights into the molecular past and guiding studies of protein function and adaptation. Conventional ASR methods rely on a multiple sequence alignment (MSA), a phylogenetic tree, and an evolutionary model. However, the underlying alignments and trees are often uncertain, and existing models typically focus on substitutions and do not explicitly account for insertion-deletion (indel) processes. Here, we introduce BetaReconstruct, a novel generative approach to ASR that harnesses recent advances in natural language processing (NLP) and hybrid transformer architectures. Our model was first trained on large-scale simulated datasets with gold-standard ancestral sequences and then on real-world protein sequences. Reconstruction requires neither MSAs nor phylogenetic trees. We demonstrate that BetaReconstruct generalizes robustly across diverse evolutionary scenarios and reconstructs ancestral sequences more accurately than maximum-likelihood-based pipelines. We also provide evidence that the generative approach is more accurate on empirical datasets. This work provides a scalable, alignment-free strategy for ASR and highlights the ability of data-driven models to capture evolutionary signals beyond the reach of traditional methods.


![outline_image](https://cdn-uploads.huggingface.co/production/uploads/63047e2d412a1b9d381b045d/7ecb09ZApSmPCr2LhtlIH.png)
Illustration of ASR using BetaReconstruct. (a) The “true” evolutionary dynamics, in which the ancestral sequence “AAMM” evolved along a phylogenetic tree. The leaf sequences are the proteins: “AAM”, “AYM”, and “ATMMM”; (b) The BetaReconstruct pipeline: (Ⅰ) the unaligned protein sequences are provided as input; (Ⅱ) the protein sequences are concatenated with special characters between them; (Ⅲ) the model processes the input; (Ⅳ) the model generates the root ancestral sequence. 
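
Steps (Ⅰ) and (Ⅱ) of the pipeline can be sketched in a few lines of Python. This is only an illustration of the input-construction idea, not the repository's actual preprocessing: the separator character `|` and the helper name `build_model_input` are assumptions made for this example, and the real special characters and tokenization are defined in the GitHub repository.

```python
from typing import Sequence

# Hypothetical separator: the special character BetaReconstruct really uses may differ.
SEPARATOR = "|"

# The 20 standard amino acids, used here only for a basic sanity check.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def build_model_input(leaves: Sequence[str], separator: str = SEPARATOR) -> str:
    """Join unaligned leaf sequences into one string, with a separator between them."""
    if not leaves:
        raise ValueError("at least one leaf sequence is required")
    for seq in leaves:
        # Reject empty strings and anything outside the standard amino-acid alphabet.
        if not seq or not set(seq) <= AMINO_ACIDS:
            raise ValueError(f"not a plain amino-acid sequence: {seq!r}")
    return separator.join(leaves)


# Leaf sequences from the figure; no alignment or tree is supplied.
leaves = ["AAM", "AYM", "ATMMM"]
model_input = build_model_input(leaves)
print(model_input)  # AAM|AYM|ATMMM
```

The sequences keep their different lengths (3, 3, and 5 residues), since the input is unaligned; in the figure's example the model would then be expected to generate the root sequence “AAMM” from this single concatenated string.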


### Model Sources

- **Repository:** [GitHub](https://github.com/technion-cs-nlp/BetaReconstruct)
- **Paper:** https://www.biorxiv.org/content/10.64898/2026.01.18.700141v1

## Uses

For installation and usage instructions, see the GitHub repository: https://github.com/technion-cs-nlp/BetaReconstruct