mli-lab/TReconLM


Add pipeline tag, paper/code links, and usage instructions

by nielsr (HF Staff), opened 22 days ago

base: refs/heads/main ← from: refs/pr/1


README.md +22 -2


@@ -1,6 +1,13 @@
 # TReconLM
-TReconLM is a decoder-only transformer model for trace reconstruction of noisy DNA sequences. It is trained to reconstruct a ground-truth sequence from multiple noisy copies (traces), each independently corrupted by insertions, deletions, and substitutions.
 ## Model Variants
@@ -21,11 +28,24 @@ All models support reconstruction from cluster sizes between 2 and 10.
 ## How to Use
 Tutorial notebooks are available in our [GitHub repository](https://github.com/MLI-lab/TReconLM) under `tutorial/`:
 - `quick_start.ipynb`: Run inference on synthetic datasets from HuggingFace
 - `custom_data.ipynb`: Run inference on your own data or real-world datasets (Microsoft DNA, Noisy-DNA, Chandak)
 The test datasets used in the notebooks can be downloaded from [Hugging Face](https://huggingface.co/datasets/mli-lab/TReconLM_datasets).
 ## Training Details
@@ -38,4 +58,4 @@ For full experimental details, see [our paper](http://arxiv.org/abs/2507.12927).
 ## Limitations
-Models trained for fixed sequence lengths may perform worse on other lengths or if the test data distribution differs significantly from the training data. The variable-length model (`model_var_len_50_120.pt`) is trained with the same compute budget as our fixed-length models, so it sees less data per sequence length and may perform slightly worse for a specific fixed length.

+---
+pipeline_tag: text-generation
+---
 # TReconLM
+TReconLM is a decoder-only transformer model for trace reconstruction of noisy DNA sequences, as presented in the paper [Trace Reconstruction with Language Models](https://huggingface.co/papers/2507.12927). It is trained to reconstruct a ground-truth sequence from multiple noisy copies (traces), each independently corrupted by insertions, deletions, and substitutions.
+- **Code:** [GitHub Repository](https://github.com/MLI-lab/TReconLM)
+- **Paper:** [arXiv:2507.12927](http://arxiv.org/abs/2507.12927)
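
To make the problem setup above concrete, here is a minimal sketch of how a cluster of traces could be generated from one ground-truth sequence. It is an illustration only: the names `corrupt` and `make_cluster`, the per-base error probabilities, and the rule that each position suffers at most one error are assumptions made for this example, not the noise model or code used by TReconLM.

```python
import random

ALPHABET = "ACGT"


def corrupt(seq, p_ins=0.05, p_del=0.05, p_sub=0.05, rng=None):
    """Return one noisy copy (trace) of seq.

    Hypothetical channel: at each position exactly one of
    insertion, deletion, substitution, or faithful copy happens.
    """
    if rng is None:
        rng = random.Random()
    out = []
    for base in seq:
        r = rng.random()  # uniform in [0, 1)
        if r < p_ins:
            # Insertion: emit a random base, then keep the original base.
            out.append(rng.choice(ALPHABET))
            out.append(base)
        elif r < p_ins + p_del:
            # Deletion: drop the original base.
            continue
        elif r < p_ins + p_del + p_sub:
            # Substitution: replace the base with a different one.
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)


def make_cluster(seq, n_traces, seed=None, **rates):
    """Draw n_traces independently corrupted copies of seq."""
    rng = random.Random(seed)
    return [corrupt(seq, rng=rng, **rates) for _ in range(n_traces)]


# Example: a cluster of 5 traces (the model card states that cluster
# sizes between 2 and 10 are supported).
truth = "ACGTTGCAAGCT"
for trace in make_cluster(truth, n_traces=5, seed=0):
    print(trace)
```

Trace reconstruction is the inverse task: given only the printed traces, recover `truth`.
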
 ## Model Variants
 ## How to Use
+### Tutorial Notebooks
 Tutorial notebooks are available in our [GitHub repository](https://github.com/MLI-lab/TReconLM) under `tutorial/`:
 - `quick_start.ipynb`: Run inference on synthetic datasets from HuggingFace
 - `custom_data.ipynb`: Run inference on your own data or real-world datasets (Microsoft DNA, Noisy-DNA, Chandak)
+### Command-Line Inference
+You can also run inference via the command line using the scripts provided in the repository:
+```bash
+# Download a model from HuggingFace
+mkdir -p models
+python -c "from huggingface_hub import hf_hub_download; hf_hub_download('tracereconstruction2026/TReconLM', 'model_seq_len_110.pt', local_dir='models')"
+# Run inference
+python src/inference.py exps=test/inference_example
+```
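
The download step in the command above can also be written as a small Python helper, which may be easier to reuse from a notebook or script. This is a sketch, not part of the repository: `download_checkpoint` is a hypothetical wrapper, and the repo id and filename in the usage comment are copied verbatim from the command above. Note that this page is hosted under `mli-lab/TReconLM`; which repository actually serves the checkpoint is not confirmed here.

```python
from pathlib import Path


def download_checkpoint(repo_id, filename, local_dir="models"):
    """Fetch one file from the Hugging Face Hub and return its local path.

    Hypothetical helper wrapping huggingface_hub.hf_hub_download.
    """
    # Imported lazily so this module loads even if huggingface_hub
    # is not installed until the function is actually called.
    from huggingface_hub import hf_hub_download

    # Equivalent of `mkdir -p models` in the shell version.
    Path(local_dir).mkdir(parents=True, exist_ok=True)
    return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local_dir)


# Usage (requires network access; values taken from the command above):
# path = download_checkpoint("tracereconstruction2026/TReconLM", "model_seq_len_110.pt")
# print(path)
```
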
 The test datasets used in the notebooks can be downloaded from [Hugging Face](https://huggingface.co/datasets/mli-lab/TReconLM_datasets).
 ## Training Details
 ## Limitations
+Models trained for fixed sequence lengths may perform worse on other lengths or if the test data distribution differs significantly from the training data. The variable-length model (`model_var_len_50_120.pt`) is trained with the same compute budget as our fixed-length models, so it sees less data per sequence length and may perform slightly worse for a specific fixed length.