mli-lab/TReconLM

FWeindel committed on Nov 26, 2025

Commit 75d481a (verified) · 1 parent: 025163e

Upload README.md with huggingface_hub

Files changed (1)

README.md +10 -19

@@ -5,35 +5,26 @@ TReconLM is a decoder-only transformer model for trace reconstruction of noisy D
 ## Model Variants
 ### Pretrained Models (Fixed Length)
-| Model | Sequence Length | Description |
-|-------|-----------------|-------------|
-| `model_seq_len_60.pt` | 60nt | Pretrained on synthetic IDS data |
-| `model_seq_len_110.pt` | 110nt | Pretrained on synthetic IDS data |
-| `model_seq_len_180.pt` | 180nt | Pretrained on synthetic IDS data |
 ### Pretrained Models (Variable Length)
-| Model | Sequence Length | Description |
-|-------|-----------------|-------------|
-| `model_var_len_50_120.pt` | 50-120nt | Pretrained on synthetic IDS data with variable sequence lengths |
 ### Fine-tuned Models
-| Model | Sequence Length | Description |
-|-------|-----------------|-------------|
-| `finetuned_noisy_dna_len60.pt` | 60nt | Fine-tuned on Noisy-DNA dataset |
-| `finetuned_microsoft_dna_len110.pt` | 110nt | Fine-tuned on Microsoft DNA dataset |
-| `finetuned_chandak_len117.pt` | 117nt | Fine-tuned on Chandak dataset |
-Each model supports reconstruction from cluster sizes between 2 and 10.
 ## How to Use
 Tutorial notebooks are available in our [GitHub repository](https://github.com/MLI-lab/TReconLM) under `tutorial/`:
 - `quick_start.ipynb`: Run inference on synthetic datasets from HuggingFace
-- `custom_data.ipynb`: Run inference on your own data or real-world datasets (Microsoft DNA, Noisy-DNA)
 The test datasets used in the notebooks can be downloaded from [Hugging Face](https://huggingface.co/datasets/mli-lab/TReconLM_datasets).
@@ -47,4 +38,4 @@ For full experimental details, see [our paper](http://arxiv.org/abs/2507.12927).
 ## Limitations
-Models trained for fixed sequence lengths may perform worse on other lengths or if the test data distribution differs significantly from the training data. The variable-length model (`model_var_len_50_120.pt`) is trained with the same compute budget as the fixed-length models, so it sees less data per sequence length and performs slightly worse for a specific fixed length.

 ## Model Variants
 ### Pretrained Models (Fixed Length)
+- `model_seq_len_60.pt` (60nt)
+- `model_seq_len_110.pt` (110nt)
+- `model_seq_len_180.pt` (180nt)
 ### Pretrained Models (Variable Length)
+- `model_var_len_50_120.pt` (50-120nt)
 ### Fine-tuned Models
+- `finetuned_noisy_dna_len60.pt` (60nt, [Noisy-DNA dataset](https://doi.org/10.1038/s41467-020-14319-8))
+- `finetuned_microsoft_dna_len110.pt` (110nt, [Microsoft DNA dataset](https://doi.org/10.1109/ISIT45174.2021.9518012))
+- `finetuned_chandak_len117.pt` (117nt, [Chandak dataset](https://doi.org/10.1109/ICASSP40776.2020.9053441))
+All models support reconstruction from cluster sizes between 2 and 10.
 ## How to Use
 Tutorial notebooks are available in our [GitHub repository](https://github.com/MLI-lab/TReconLM) under `tutorial/`:
 - `quick_start.ipynb`: Run inference on synthetic datasets from HuggingFace
+- `custom_data.ipynb`: Run inference on your own data or real-world datasets (Microsoft DNA, Noisy-DNA, Chandak)
 The test datasets used in the notebooks can be downloaded from [Hugging Face](https://huggingface.co/datasets/mli-lab/TReconLM_datasets).
 ## Limitations
+Models trained for fixed sequence lengths may perform worse on other lengths or if the test data distribution differs significantly from the training data. The variable-length model (`model_var_len_50_120.pt`) is trained with the same compute budget as our fixed-length models, so it sees less data per sequence length and may perform slightly worse for a specific fixed length.
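
Read together, the variant list and the limitations note suggest a simple rule for choosing a checkpoint: prefer a fixed-length model when the sequence length matches exactly, and fall back to the variable-length model otherwise. The following is a minimal sketch of that rule, not code from the TReconLM repository. The helpers `pick_checkpoint` and `download_checkpoint` are hypothetical, the repository id `mli-lab/TReconLM` is taken from the repository name shown at the top of this page, and the sketch assumes the `.pt` files sit at the repository root. The fine-tuned checkpoints are left out because they are tied to specific datasets.

```python
# Checkpoint filenames are copied from the "Model Variants" list in the README.
# The selection rule itself is an illustrative assumption, not the authors' method.
FIXED_LENGTH_CHECKPOINTS = {
    60: "model_seq_len_60.pt",
    110: "model_seq_len_110.pt",
    180: "model_seq_len_180.pt",
}
VARIABLE_LENGTH_CHECKPOINT = "model_var_len_50_120.pt"
VARIABLE_LENGTH_RANGE = (50, 120)  # inclusive bounds, in nucleotides


def pick_checkpoint(seq_len: int) -> str:
    """Return a pretrained checkpoint filename for a given sequence length."""
    # An exact fixed-length match is preferred, following the limitations note.
    if seq_len in FIXED_LENGTH_CHECKPOINTS:
        return FIXED_LENGTH_CHECKPOINTS[seq_len]
    low, high = VARIABLE_LENGTH_RANGE
    if low <= seq_len <= high:
        return VARIABLE_LENGTH_CHECKPOINT
    raise ValueError(f"No pretrained checkpoint listed for length {seq_len}nt")


def download_checkpoint(seq_len: int, repo_id: str = "mli-lab/TReconLM") -> str:
    """Download the selected checkpoint and return its local cache path."""
    # Imported lazily so that pick_checkpoint works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    # Assumes the checkpoint files are stored at the repository root.
    return hf_hub_download(repo_id=repo_id, filename=pick_checkpoint(seq_len))
```

Calling `download_checkpoint(110)` would fetch `model_seq_len_110.pt` into the local Hugging Face cache, assuming network access and that the file layout matches the assumption above. Loading the weights is covered by the tutorial notebooks.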