# TReconLM

TReconLM is a decoder-only transformer model for trace reconstruction of noisy DNA sequences. It is trained to reconstruct a ground-truth sequence from multiple noisy copies (traces), each independently corrupted by insertions, deletions, and substitutions.
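To illustrate the task setup, here is a minimal sketch (not the authors' code; the function names, error rates, and seed are assumptions for illustration) of an insertion/deletion/substitution channel that turns one ground-truth sequence into a cluster of noisy traces:

```python
import random

ALPHABET = "ACGT"

def ids_channel(seq, p_ins, p_del, p_sub, rng):
    """Corrupt `seq` with independent insertions, deletions, and substitutions."""
    out = []
    for base in seq:
        if rng.random() < p_ins:   # insert a random base before this position
            out.append(rng.choice(ALPHABET))
        if rng.random() < p_del:   # delete this base
            continue
        if rng.random() < p_sub:   # substitute with a different base
            base = rng.choice(ALPHABET.replace(base, ""))
        out.append(base)
    return "".join(out)

def make_cluster(seq, cluster_size, p_ins, p_del, p_sub, seed=0):
    """Generate `cluster_size` independent noisy traces of `seq`."""
    rng = random.Random(seed)
    return [ids_channel(seq, p_ins, p_del, p_sub, rng) for _ in range(cluster_size)]

# Example: one length-60 ground truth, five noisy traces.
ground_truth = "".join(random.Random(42).choice(ALPHABET) for _ in range(60))
traces = make_cluster(ground_truth, cluster_size=5, p_ins=0.05, p_del=0.05, p_sub=0.05)
```

Given such a cluster of traces, the model's job is to recover `ground_truth`.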
## Model Variants

We provide pretrained and fine-tuned model checkpoints for the following ground-truth sequence lengths:

- L = 60
- L = 110
- L = 180

Each model supports reconstruction from cluster sizes between 2 and 10.
## How to Use

A Colab notebook, `trace_reconstruction.ipynb`, is available in our [GitHub repository](https://github.com/MLI-lab/TReconLM); it demonstrates how to load the model and run inference on our benchmark datasets. The test datasets used in the notebook can be downloaded from [Hugging Face](https://huggingface.co/datasets/mli-lab/TReconLM_datasets).
## Training Details

- Models are pretrained on synthetic data generated by sampling ground-truth sequences of length L uniformly at random over the quaternary alphabet and independently introducing insertions, deletions, and substitutions at each position.
- Error probabilities for insertions, deletions, and substitutions are drawn uniformly from the interval [0.01, 0.1], and cluster sizes are sampled uniformly from [2, 10].
- Models are fine-tuned on real-world sequencing data (Noisy-DNA and Microsoft datasets).
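The sampling scheme above can be sketched as follows; this is a paraphrase of the described procedure, not the authors' data pipeline, and the helper name and seed are made up:

```python
import random

ALPHABET = "ACGT"

def sample_training_example(L, rng):
    """Draw the parameters of one synthetic pretraining example."""
    # Ground truth: length-L sequence, uniform over the quaternary alphabet.
    ground_truth = "".join(rng.choice(ALPHABET) for _ in range(L))
    # Per-example error probabilities, each uniform on [0.01, 0.1].
    p_ins, p_del, p_sub = (rng.uniform(0.01, 0.1) for _ in range(3))
    # Cluster size: uniform over the integers {2, ..., 10}.
    cluster_size = rng.randint(2, 10)
    return ground_truth, (p_ins, p_del, p_sub), cluster_size

rng = random.Random(0)
gt, probs, k = sample_training_example(L=110, rng=rng)
```

Each training example then consists of `k` traces of `gt`, generated with the drawn error probabilities.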
For full experimental details, see [our paper](https://arxiv.org/abs/XXXX.XXXXX).
## Limitations

Each model is trained for a fixed ground-truth sequence length and may perform worse on other lengths, or when the test data distribution differs significantly from the training distribution.