# TReconLM

TReconLM is a decoder-only transformer model for trace reconstruction of noisy DNA sequences. It is trained to reconstruct a ground-truth sequence from multiple noisy copies (traces), each independently corrupted by insertions, deletions, and substitutions.
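
For illustration, here is a toy example (made up for this README, not drawn from our datasets): given a few noisy traces, the model outputs an estimate of the unknown ground truth.

```
ground truth: ACGTACGT
trace 1:      ACTACGT      (deletion of the G at position 3)
trace 2:      ACGTTACGT    (insertion of a T after position 4)
trace 3:      ACGAACGT     (substitution T -> A at position 4)
```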
## Model Variants

We provide pretrained and fine-tuned model checkpoints for the following ground-truth sequence lengths:

- L = 60
- L = 110
- L = 180

Each model supports reconstruction from cluster sizes between 2 and 10.
## How to Use

A Colab notebook, `trace_reconstruction.ipynb`, is available in our [GitHub repository](https://github.com/MLI-lab/TReconLM); it demonstrates how to load a model and run inference on our benchmark datasets. The test datasets used in the notebook can be downloaded from [Hugging Face](https://huggingface.co/datasets/mli-lab/TReconLM_datasets).
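
As a rough sketch (not the official API; the repo id and checkpoint filename below are hypothetical placeholders), fetching a checkpoint from the Hub with `huggingface_hub` and PyTorch might look like this. The Colab notebook contains the actual loading and inference code.

```python
# Sketch only: repo_id and filename are hypothetical, not the published
# checkpoint names; see trace_reconstruction.ipynb for the real code.
import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="mli-lab/TReconLM",    # hypothetical model repo id
    filename="treconlm_L110.pt",   # hypothetical checkpoint for L = 110
)
state = torch.load(ckpt_path, map_location="cpu")  # assumed: a PyTorch state dict
```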
## Training Details

- Models are pretrained on synthetic data generated by sampling ground-truth sequences of length L uniformly at random over the quaternary DNA alphabet {A, C, G, T} and independently introducing insertions, deletions, and substitutions at each position (see the sketch after this list).
- Error probabilities for insertions, deletions, and substitutions are drawn uniformly from the interval [0.01, 0.1], and cluster sizes are sampled uniformly from {2, …, 10}.
- Models are fine-tuned on real-world sequencing data (Noisy-DNA and Microsoft datasets).
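
A minimal sketch of this generation process (our own illustrative reading, not the actual training code; in particular, the per-position ordering of insertion, deletion, and substitution events is an assumption):

```python
import random

ALPHABET = "ACGT"

def ids_channel(seq: str, p_ins: float, p_del: float, p_sub: float) -> str:
    """Corrupt seq with independent insertions, deletions, and substitutions.
    The per-position event ordering here is an assumption, not the paper's spec."""
    out = []
    for base in seq:
        if random.random() < p_ins:      # insert a uniform random base
            out.append(random.choice(ALPHABET))
        if random.random() < p_del:      # delete the current base
            continue
        if random.random() < p_sub:      # substitute with a different base
            out.append(random.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)

def sample_cluster(length: int = 110) -> tuple[str, list[str]]:
    """One synthetic example: a ground truth and its cluster of noisy traces."""
    truth = "".join(random.choices(ALPHABET, k=length))
    p_ins, p_del, p_sub = (random.uniform(0.01, 0.1) for _ in range(3))
    cluster_size = random.randint(2, 10)
    traces = [ids_channel(truth, p_ins, p_del, p_sub) for _ in range(cluster_size)]
    return truth, traces
```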
For full experimental details, see [our paper](https://arxiv.org/abs/XXXX.XXXXX).
## Limitations
Models are trained for a fixed ground-truth sequence length, and performance may degrade on other lengths or when the test data distribution differs significantly from the training distribution.