Add pipeline tag, paper/code links, and usage instructions

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +22 -2
README.md CHANGED
@@ -1,6 +1,13 @@
  # TReconLM

- TReconLM is a decoder-only transformer model for trace reconstruction of noisy DNA sequences. It is trained to reconstruct a ground-truth sequence from multiple noisy copies (traces), each independently corrupted by insertions, deletions, and substitutions.

  ## Model Variants

@@ -21,11 +28,24 @@ All models support reconstruction from cluster sizes between 2 and 10.

  ## How to Use

  Tutorial notebooks are available in our [GitHub repository](https://github.com/MLI-lab/TReconLM) under `tutorial/`:

  - `quick_start.ipynb`: Run inference on synthetic datasets from HuggingFace
  - `custom_data.ipynb`: Run inference on your own data or real-world datasets (Microsoft DNA, Noisy-DNA, Chandak)

  The test datasets used in the notebooks can be downloaded from [Hugging Face](https://huggingface.co/datasets/mli-lab/TReconLM_datasets).

  ## Training Details
@@ -38,4 +58,4 @@ For full experimental details, see [our paper](http://arxiv.org/abs/2507.12927).

  ## Limitations

- Models trained for fixed sequence lengths may perform worse on other lengths or if the test data distribution differs significantly from the training data. The variable-length model (`model_var_len_50_120.pt`) is trained with the same compute budget as our fixed-length models, so it sees less data per sequence length and may perform slightly worse for a specific fixed length.

+ ---
+ pipeline_tag: text-generation
+ ---
+
  # TReconLM

+ TReconLM is a decoder-only transformer model for trace reconstruction of noisy DNA sequences, as presented in the paper [Trace Reconstruction with Language Models](https://huggingface.co/papers/2507.12927). It is trained to reconstruct a ground-truth sequence from multiple noisy copies (traces), each independently corrupted by insertions, deletions, and substitutions.
+
+ - **Code:** [GitHub Repository](https://github.com/MLI-lab/TReconLM)
+ - **Paper:** [arXiv:2507.12927](http://arxiv.org/abs/2507.12927)

  ## Model Variants
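
The problem setup in the README description above (several traces, each independently corrupted by insertions, deletions, and substitutions) can be illustrated with a small simulation. This is only a sketch: the function names `corrupt` and `make_cluster`, the per-base error rates, and the order in which errors are applied are assumptions made for illustration and are not taken from the TReconLM code or paper.

```python
import random

ALPHABET = "ACGT"


def corrupt(seq, p_ins=0.05, p_del=0.05, p_sub=0.05, rng=None):
    """Return one noisy copy of `seq`.

    Hypothetical channel: the rates and the error ordering are assumptions.
    """
    if rng is None:
        rng = random.Random()
    out = []
    for base in seq:
        # Insertion: with probability p_ins, emit a random base first.
        if rng.random() < p_ins:
            out.append(rng.choice(ALPHABET))
        r = rng.random()
        if r < p_del:
            # Deletion: drop the current base.
            continue
        if r < p_del + p_sub:
            # Substitution: replace the base with a different one.
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)


def make_cluster(seq, n_traces, seed=0, **rates):
    """Generate `n_traces` independently corrupted copies of `seq`."""
    rng = random.Random(seed)
    return [corrupt(seq, rng=rng, **rates) for _ in range(n_traces)]


# Example: a cluster of 5 traces (the README states that cluster sizes
# between 2 and 10 are supported).
cluster = make_cluster("ACGTTGCAACGT", n_traces=5)
for trace in cluster:
    print(trace)
```

A reconstruction model would receive a cluster like `cluster` as input and be asked to predict the original sequence.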
 
 
  ## How to Use

+ ### Tutorial Notebooks
  Tutorial notebooks are available in our [GitHub repository](https://github.com/MLI-lab/TReconLM) under `tutorial/`:

  - `quick_start.ipynb`: Run inference on synthetic datasets from HuggingFace
  - `custom_data.ipynb`: Run inference on your own data or real-world datasets (Microsoft DNA, Noisy-DNA, Chandak)

+ ### Command-Line Inference
+ You can also run inference via the command line using the scripts provided in the repository:
+
+ ```bash
+ # Download a model from HuggingFace
+ mkdir -p models
+ python -c "from huggingface_hub import hf_hub_download; hf_hub_download('tracereconstruction2026/TReconLM', 'model_seq_len_110.pt', local_dir='models')"
+
+ # Run inference
+ python src/inference.py exps=test/inference_example
+ ```
+
  The test datasets used in the notebooks can be downloaded from [Hugging Face](https://huggingface.co/datasets/mli-lab/TReconLM_datasets).

  ## Training Details
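
The download step in the added command-line snippet can also be written as a small Python helper, which may be easier to reuse from a script or notebook. This is a sketch, not part of the repository: the helper name `download_checkpoint` and its `downloader` parameter are hypothetical, and the repo id and filename in the usage comment are simply copied from the snippet in this diff.

```python
from pathlib import Path


def download_checkpoint(repo_id, filename, local_dir="models", downloader=None):
    """Download one checkpoint file and return its local path.

    `downloader` is a hypothetical hook for substituting the download call
    (for example in offline tests). By default it is
    huggingface_hub.hf_hub_download.
    """
    if downloader is None:
        # Imported lazily so the helper can be imported without the package.
        from huggingface_hub import hf_hub_download
        downloader = hf_hub_download

    target_dir = Path(local_dir)
    # Equivalent of `mkdir -p models` in the shell snippet.
    target_dir.mkdir(parents=True, exist_ok=True)

    local_path = downloader(
        repo_id=repo_id, filename=filename, local_dir=str(target_dir)
    )
    return Path(local_path)


# Usage (requires network access), with the values from the diff above:
# path = download_checkpoint("tracereconstruction2026/TReconLM", "model_seq_len_110.pt")
```

Inference itself would still be run with `python src/inference.py exps=test/inference_example`, as shown in the diff.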
 
  ## Limitations

+ Models trained for fixed sequence lengths may perform worse on other lengths or if the test data distribution differs significantly from the training data. The variable-length model (`model_var_len_50_120.pt`) is trained with the same compute budget as our fixed-length models, so it sees less data per sequence length and may perform slightly worse for a specific fixed length.