Upload README.md with huggingface_hub
README.md
CHANGED
@@ -5,35 +5,26 @@ TReconLM is a decoder-only transformer model for trace reconstruction of noisy D
 ## Model Variants
 
 ### Pretrained Models (Fixed Length)
-
-| Model | Sequence Length | Description |
-|-------|-----------------|-------------|
-| `model_seq_len_60.pt` | 60nt | Pretrained on synthetic IDS data |
-| `model_seq_len_110.pt` | 110nt | Pretrained on synthetic IDS data |
-| `model_seq_len_180.pt` | 180nt | Pretrained on synthetic IDS data |
+- `model_seq_len_60.pt` (60nt)
+- `model_seq_len_110.pt` (110nt)
+- `model_seq_len_180.pt` (180nt)
 
 ### Pretrained Models (Variable Length)
-
-| Model | Sequence Length | Description |
-|-------|-----------------|-------------|
-| `model_var_len_50_120.pt` | 50-120nt | Pretrained on synthetic IDS data with variable sequence lengths |
+- `model_var_len_50_120.pt` (50-120nt)
 
 ### Fine-tuned Models
+- `finetuned_noisy_dna_len60.pt` (60nt, [Noisy-DNA dataset](https://doi.org/10.1038/s41467-020-14319-8))
+- `finetuned_microsoft_dna_len110.pt` (110nt, [Microsoft DNA dataset](https://doi.org/10.1109/ISIT45174.2021.9518012))
+- `finetuned_chandak_len117.pt` (117nt, [Chandak dataset](https://doi.org/10.1109/ICASSP40776.2020.9053441))
 
-| Model | Sequence Length | Description |
-|-------|-----------------|-------------|
-| `finetuned_noisy_dna_len60.pt` | 60nt | Fine-tuned on Noisy-DNA dataset |
-| `finetuned_microsoft_dna_len110.pt` | 110nt | Fine-tuned on Microsoft DNA dataset |
-| `finetuned_chandak_len117.pt` | 117nt | Fine-tuned on Chandak dataset |
-
-Each model supports reconstruction from cluster sizes between 2 and 10.
+All models support reconstruction from cluster sizes between 2 and 10.
 
 ## How to Use
 
 Tutorial notebooks are available in our [GitHub repository](https://github.com/MLI-lab/TReconLM) under `tutorial/`:
 
 - `quick_start.ipynb`: Run inference on synthetic datasets from HuggingFace
-- `custom_data.ipynb`: Run inference on your own data or real-world datasets (Microsoft DNA, Noisy-DNA)
+- `custom_data.ipynb`: Run inference on your own data or real-world datasets (Microsoft DNA, Noisy-DNA, Chandak)
 
 The test datasets used in the notebooks can be downloaded from [Hugging Face](https://huggingface.co/datasets/mli-lab/TReconLM_datasets).
 
@@ -47,4 +38,4 @@ For full experimental details, see [our paper](http://arxiv.org/abs/2507.12927).
 
 ## Limitations
 
-Models trained for fixed sequence lengths may perform worse on other lengths or if the test data distribution differs significantly from the training data. The variable-length model (`model_var_len_50_120.pt`) is trained with the same compute budget as
+Models trained for fixed sequence lengths may perform worse on other lengths or if the test data distribution differs significantly from the training data. The variable-length model (`model_var_len_50_120.pt`) is trained with the same compute budget as our fixed-length models, so it sees less data per sequence length and may perform slightly worse for a specific fixed length.
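The Limitations paragraph implies a simple selection rule: use a fixed-length checkpoint when the target length matches it exactly, and fall back to the variable-length checkpoint otherwise. The sketch below illustrates that rule. The checkpoint filenames are taken from the README; `select_checkpoint` and `download_checkpoint` are hypothetical helpers, not part of TReconLM. The README diff does not state the model repository ID, so it is left as a caller-supplied argument. `hf_hub_download` is the standard `huggingface_hub` download function.

```python
# Checkpoint filenames are copied from the README; the helpers are illustrative only.
FIXED_LENGTH_CHECKPOINTS = {
    60: "model_seq_len_60.pt",
    110: "model_seq_len_110.pt",
    180: "model_seq_len_180.pt",
}
VARIABLE_LENGTH_CHECKPOINT = "model_var_len_50_120.pt"
VARIABLE_LENGTH_RANGE = (50, 120)  # inclusive bounds, in nucleotides


def select_checkpoint(seq_len: int) -> str:
    """Return a pretrained checkpoint filename for a target sequence length.

    Hypothetical helper: prefers an exact fixed-length match, since the README
    notes the variable-length model may be slightly worse at a specific length.
    """
    if seq_len in FIXED_LENGTH_CHECKPOINTS:
        return FIXED_LENGTH_CHECKPOINTS[seq_len]
    low, high = VARIABLE_LENGTH_RANGE
    if low <= seq_len <= high:
        return VARIABLE_LENGTH_CHECKPOINT
    raise ValueError(
        f"No pretrained checkpoint covers sequence length {seq_len}nt"
    )


def download_checkpoint(repo_id: str, seq_len: int) -> str:
    """Download the selected checkpoint and return its local path.

    `repo_id` is an assumption supplied by the caller; the README diff
    does not name the model repository.
    """
    # Imported lazily so select_checkpoint works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=select_checkpoint(seq_len))
```

For example, `select_checkpoint(110)` picks the fixed-length 110nt checkpoint, while `select_checkpoint(75)` falls back to the variable-length one. The fine-tuned checkpoints are deliberately left out of the lookup, because choosing one depends on the dataset rather than on length alone.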