# TReconLM

TReconLM is a decoder-only transformer model for trace reconstruction of noisy DNA sequences. It is trained to reconstruct a ground-truth sequence from multiple noisy copies (traces), each independently corrupted by insertions, deletions, and substitutions.

## Model Variants

### Pretrained Models (Fixed Length)

- `model_seq_len_60.pt` (60nt)
- `model_seq_len_110.pt` (110nt)
- `model_seq_len_180.pt` (180nt)

### Pretrained Models (Variable Length)

- `model_var_len_50_120.pt` (50-120nt)

### Fine-tuned Models

- `finetuned_noisy_dna_len60.pt` (60nt, [Noisy-DNA dataset](https://www.nature.com/articles/s41467-020-19148-3))
- `finetuned_microsoft_dna_len110.pt` (110nt, [Microsoft DNA dataset](https://ieeexplore.ieee.org/abstract/document/9517821))
- `finetuned_chandak_len117.pt` (117nt, [Chandak dataset](https://doi.org/10.1109/ICASSP40776.2020.9053441))

All models support reconstruction from cluster sizes between 2 and 10.

## How to Use

Tutorial notebooks are available in our [GitHub repository](https://github.com/MLI-lab/TReconLM) under `tutorial/`:

- `quick_start.ipynb`: Run inference on synthetic datasets from Hugging Face
- `custom_data.ipynb`: Run inference on your own data or on real-world datasets (Microsoft DNA, Noisy-DNA, Chandak)

The test datasets used in the notebooks can be downloaded from [Hugging Face](https://huggingface.co/datasets/mli-lab/TReconLM_datasets).

## Training Details

- Models are pretrained on synthetic data generated by sampling ground-truth sequences uniformly at random over the quaternary alphabet, and independently introducing insertions, deletions, and substitutions at each position.
- Error probabilities for insertions, deletions, and substitutions are drawn uniformly from the interval [0.01, 0.1], and cluster sizes are sampled uniformly from [2, 10].
- Models are fine-tuned on real-world sequencing data (Noisy-DNA, Microsoft DNA, and Chandak datasets).

For full experimental details, see [our paper](http://arxiv.org/abs/2507.12927).
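The synthetic training distribution described above can be sketched as follows. This is an illustrative reimplementation, not the repository's actual data pipeline; the function names (`corrupt`, `make_cluster`) are hypothetical, and only the sampling scheme (uniform quaternary sequences, per-position IID errors with rates drawn from [0.01, 0.1], cluster sizes from [2, 10]) follows the description above.

```python
import random

ALPHABET = "ACGT"


def corrupt(seq, p_ins, p_del, p_sub, rng):
    """Apply IID insertions, deletions, and substitutions at each position."""
    out = []
    for base in seq:
        if rng.random() < p_ins:  # insert a uniformly random base before this position
            out.append(rng.choice(ALPHABET))
        if rng.random() < p_del:  # delete this base
            continue
        if rng.random() < p_sub:  # substitute with a different base
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)


def make_cluster(seq_len=60, rng=None):
    """Sample a ground-truth sequence and a cluster of noisy traces of it."""
    rng = rng or random.Random()
    truth = "".join(rng.choice(ALPHABET) for _ in range(seq_len))
    # Error rates drawn uniformly from [0.01, 0.1], cluster size from [2, 10].
    p_ins, p_del, p_sub = (rng.uniform(0.01, 0.1) for _ in range(3))
    cluster_size = rng.randint(2, 10)
    traces = [corrupt(truth, p_ins, p_del, p_sub, rng) for _ in range(cluster_size)]
    return truth, traces
```

The model's task is then to recover `truth` given only `traces`; because insertions and deletions shift positions, the traces generally differ in length from the ground truth.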
## Limitations

Models trained for a fixed sequence length may perform worse on other lengths, or when the test data distribution differs significantly from the training data. The variable-length model (`model_var_len_50_120.pt`) is trained with the same compute budget as our fixed-length models, so it sees less data per sequence length and may perform slightly worse at any specific fixed length.