# TReconLM

TReconLM is a decoder-only transformer model for trace reconstruction of noisy DNA sequences. It is trained to reconstruct a ground-truth sequence from multiple noisy copies (traces), each independently corrupted by insertions, deletions, and substitutions.

## Model Variants

### Pretrained Models (Fixed Length)

- `model_seq_len_60.pt` (60nt)
- `model_seq_len_110.pt` (110nt)
- `model_seq_len_180.pt` (180nt)

### Pretrained Models (Variable Length)

- `model_var_len_50_120.pt` (50-120nt)

### Fine-tuned Models

- `finetuned_noisy_dna_len60.pt` (60nt, [Noisy-DNA dataset](https://www.nature.com/articles/s41467-020-19148-3))
- `finetuned_microsoft_dna_len110.pt` (110nt, [Microsoft DNA dataset](https://ieeexplore.ieee.org/abstract/document/9517821))
- `finetuned_chandak_len117.pt` (117nt, [Chandak dataset](https://doi.org/10.1109/ICASSP40776.2020.9053441))

All models support reconstruction from cluster sizes between 2 and 10.

## How to Use

Tutorial notebooks are available in our [GitHub repository](https://github.com/MLI-lab/TReconLM) under `tutorial/`:

- `quick_start.ipynb`: Run inference on synthetic datasets from Hugging Face
- `custom_data.ipynb`: Run inference on your own data or on real-world datasets (Microsoft DNA, Noisy-DNA, Chandak)

The test datasets used in the notebooks can be downloaded from [Hugging Face](https://huggingface.co/datasets/mli-lab/TReconLM_datasets).

## Training Details

- Models are pretrained on synthetic data generated by sampling ground-truth sequences uniformly at random over the quaternary alphabet, and independently introducing insertions, deletions, and substitutions at each position.
- Error probabilities for insertions, deletions, and substitutions are drawn uniformly from the interval [0.01, 0.1], and cluster sizes are sampled uniformly from [2, 10].
- Models are fine-tuned on real-world sequencing data (Noisy-DNA, Microsoft DNA, and Chandak datasets).

For full experimental details, see [our paper](http://arxiv.org/abs/2507.12927).
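The synthetic training distribution described above can be sketched as follows. This is an illustrative reimplementation, not the repository's actual data pipeline; the function names (`corrupt`, `make_cluster`) are hypothetical, and only the sampling scheme (uniform quaternary sequences, per-position IID errors with rates drawn from [0.01, 0.1], cluster sizes from [2, 10]) follows the description above.

```python
import random

ALPHABET = "ACGT"


def corrupt(seq, p_ins, p_del, p_sub, rng):
    """Apply IID insertions, deletions, and substitutions at each position."""
    out = []
    for base in seq:
        if rng.random() < p_ins:  # insert a uniformly random base before this position
            out.append(rng.choice(ALPHABET))
        if rng.random() < p_del:  # delete this base
            continue
        if rng.random() < p_sub:  # substitute with a different base
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)


def make_cluster(seq_len=60, rng=None):
    """Sample a ground-truth sequence and a cluster of noisy traces of it."""
    rng = rng or random.Random()
    truth = "".join(rng.choice(ALPHABET) for _ in range(seq_len))
    # Error rates drawn uniformly from [0.01, 0.1], cluster size from [2, 10].
    p_ins, p_del, p_sub = (rng.uniform(0.01, 0.1) for _ in range(3))
    cluster_size = rng.randint(2, 10)
    traces = [corrupt(truth, p_ins, p_del, p_sub, rng) for _ in range(cluster_size)]
    return truth, traces
```

The model's task is then to recover `truth` given only `traces`; because insertions and deletions shift positions, the traces generally differ in length from the ground truth.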
## Limitations

Models trained for a fixed sequence length may perform worse on other lengths, or when the test data distribution differs significantly from the training data. The variable-length model (`model_var_len_50_120.pt`) is trained with the same compute budget as our fixed-length models, so it sees less data per sequence length and may perform slightly worse at any specific fixed length.