Add pipeline tag, paper/code links, and usage instructions
#1
by nielsr HF Staff - opened
README.md
CHANGED
|
@@ -1,6 +1,13 @@
| 1 |
# TReconLM
|
| 2 |
|
| 3 |
-
TReconLM is a decoder-only transformer model for trace reconstruction of noisy DNA sequences. It is trained to reconstruct a ground-truth sequence from multiple noisy copies (traces), each independently corrupted by insertions, deletions, and substitutions.
| 4 |
|
| 5 |
## Model Variants
|
| 6 |
|
|
@@ -21,11 +28,24 @@ All models support reconstruction from cluster sizes between 2 and 10.
|
|
| 21 |
|
| 22 |
## How to Use
|
| 23 |
|
|
|
|
| 24 |
Tutorial notebooks are available in our [GitHub repository](https://github.com/MLI-lab/TReconLM) under `tutorial/`:
|
| 25 |
|
| 26 |
- `quick_start.ipynb`: Run inference on synthetic datasets from Hugging Face
|
| 27 |
- `custom_data.ipynb`: Run inference on your own data or real-world datasets (Microsoft DNA, Noisy-DNA, Chandak)
|
| 28 |
| 29 |
The test datasets used in the notebooks can be downloaded from [Hugging Face](https://huggingface.co/datasets/mli-lab/TReconLM_datasets).
|
| 30 |
|
| 31 |
## Training Details
|
|
@@ -38,4 +58,4 @@ For full experimental details, see [our paper](http://arxiv.org/abs/2507.12927).
|
|
| 38 |
|
| 39 |
## Limitations
|
| 40 |
|
| 41 |
-
Models trained for fixed sequence lengths may perform worse on other lengths or if the test data distribution differs significantly from the training data. The variable-length model (`model_var_len_50_120.pt`) is trained with the same compute budget as our fixed-length models, so it sees less data per sequence length and may perform slightly worse for a specific fixed length.
|
|
|
|
| 1 |
+
---
|
| 2 |
+
pipeline_tag: text-generation
|
| 3 |
+
---
|
| 4 |
+
|
| 5 |
# TReconLM
|
| 6 |
|
| 7 |
+
TReconLM is a decoder-only transformer model for trace reconstruction of noisy DNA sequences, as presented in the paper [Trace Reconstruction with Language Models](https://huggingface.co/papers/2507.12927). It is trained to reconstruct a ground-truth sequence from multiple noisy copies (traces), each independently corrupted by insertions, deletions, and substitutions.
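
To make the corruption model concrete, the following is a minimal sketch of how a cluster of traces could be simulated from a ground-truth sequence. The function names (`corrupt`, `make_cluster`), the default error rates, and the per-base ordering of insertion, deletion, and substitution events are illustrative assumptions, not the data-generation procedure used in the paper or the repository.

```python
import random

ALPHABET = "ACGT"


def corrupt(sequence, p_ins, p_del, p_sub, rng):
    """Return one noisy trace of `sequence`.

    Hypothetical channel: before each base, a random base may be inserted;
    the base itself is then deleted, substituted, or copied unchanged.
    """
    trace = []
    for base in sequence:
        # Insertion: emit an extra random base before the current one.
        if rng.random() < p_ins:
            trace.append(rng.choice(ALPHABET))
        r = rng.random()
        if r < p_del:
            # Deletion: drop the current base.
            continue
        if r < p_del + p_sub:
            # Substitution: replace the base with a different one.
            trace.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            trace.append(base)
    return "".join(trace)


def make_cluster(sequence, num_traces, p_ins=0.05, p_del=0.05, p_sub=0.05, seed=0):
    """Draw `num_traces` traces, each corrupted independently of the others."""
    rng = random.Random(seed)
    return [corrupt(sequence, p_ins, p_del, p_sub, rng) for _ in range(num_traces)]


if __name__ == "__main__":
    truth = "ACGTTGCAAGCTTAGC"
    for trace in make_cluster(truth, num_traces=5):
        print(trace)
```

The reconstruction task is the inverse of this process: given only the list returned by `make_cluster`, recover `truth`.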
|
| 8 |
+
|
| 9 |
+
- **Code:** [GitHub Repository](https://github.com/MLI-lab/TReconLM)
|
| 10 |
+
- **Paper:** [arXiv:2507.12927](http://arxiv.org/abs/2507.12927)
|
| 11 |
|
| 12 |
## Model Variants
|
| 13 |
|
|
|
|
| 28 |
|
| 29 |
## How to Use
|
| 30 |
|
| 31 |
+
### Tutorial Notebooks
|
| 32 |
Tutorial notebooks are available in our [GitHub repository](https://github.com/MLI-lab/TReconLM) under `tutorial/`:
|
| 33 |
|
| 34 |
- `quick_start.ipynb`: Run inference on synthetic datasets from Hugging Face
|
| 35 |
- `custom_data.ipynb`: Run inference on your own data or real-world datasets (Microsoft DNA, Noisy-DNA, Chandak)
|
| 36 |
|
| 37 |
+
### Command-Line Inference
|
| 38 |
+
You can also run inference via the command line using the scripts provided in the repository:
|
| 39 |
+
|
| 40 |
+
```bash
# Download a model from Hugging Face
mkdir -p models
python -c "from huggingface_hub import hf_hub_download; hf_hub_download('tracereconstruction2026/TReconLM', 'model_seq_len_110.pt', local_dir='models')"

# Run inference
python src/inference.py exps=test/inference_example
```
|
| 48 |
+
|
| 49 |
The test datasets used in the notebooks can be downloaded from [Hugging Face](https://huggingface.co/datasets/mli-lab/TReconLM_datasets).
|
| 50 |
|
| 51 |
## Training Details
|
|
|
|
| 58 |
|
| 59 |
## Limitations
|
| 60 |
|
| 61 |
+
Models trained for fixed sequence lengths may perform worse on other lengths or if the test data distribution differs significantly from the training data. The variable-length model (`model_var_len_50_120.pt`) is trained with the same compute budget as our fixed-length models, so it sees less data per sequence length and may perform slightly worse for a specific fixed length.
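
The trade-off above suggests preferring a fixed-length checkpoint when one matches the data and falling back to the variable-length model otherwise. A minimal sketch of that choice follows. The helper `choose_checkpoint` is hypothetical and not part of the repository, the 50 to 120 length range is an assumption read off the file name `model_var_len_50_120.pt`, and only the fixed-length checkpoint named in this README is listed.

```python
# Fixed-length checkpoints keyed by sequence length. Only the file named in
# this README is listed; extend with other variants as needed.
FIXED_LENGTH_CHECKPOINTS = {110: "model_seq_len_110.pt"}

VAR_LEN_CHECKPOINT = "model_var_len_50_120.pt"
VAR_LEN_RANGE = (50, 120)  # assumption: inferred from the checkpoint file name

# The model card states that all models support cluster sizes from 2 to 10.
MIN_CLUSTER_SIZE, MAX_CLUSTER_SIZE = 2, 10


def choose_checkpoint(seq_len, cluster_size):
    """Return a checkpoint file name for the given sequence length.

    Hypothetical helper: prefers an exact fixed-length match, then the
    variable-length model, and raises ValueError outside the supported ranges.
    """
    if not MIN_CLUSTER_SIZE <= cluster_size <= MAX_CLUSTER_SIZE:
        raise ValueError(
            f"cluster size {cluster_size} is outside "
            f"{MIN_CLUSTER_SIZE}..{MAX_CLUSTER_SIZE}"
        )
    if seq_len in FIXED_LENGTH_CHECKPOINTS:
        return FIXED_LENGTH_CHECKPOINTS[seq_len]
    low, high = VAR_LEN_RANGE
    if low <= seq_len <= high:
        return VAR_LEN_CHECKPOINT
    raise ValueError(f"no checkpoint listed for sequence length {seq_len}")


if __name__ == "__main__":
    print(choose_checkpoint(seq_len=110, cluster_size=5))
    print(choose_checkpoint(seq_len=80, cluster_size=5))
```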
|