bernardo-de-almeida committed
Commit 3b6e7d5 · 1 Parent(s): 507be8b

feat: add annotation notebook
notebooks_tutorials/{02_fine_tuning_pretrained_model.ipynb → 02_fine_tuning_pretrained_model_biwig.ipynb} RENAMED
@@ -10,10 +10,9 @@
   "\n",
   "📊 We provide access to the NTv3-benchmark data that we released on our Hugging Face dataset: `InstaDeepAI/NTv3_benchmark_dataset`. In this repository, you will find ready-to-use genome FASTA files, BigWig tracks, and metadata, as well as the splits that were used for the benchmark.\n",
   "\n",
-  "**🔧 Main Simplifications**: Compared to the full supervised tracks pipeline, this notebook simplifies several aspects to enable faster iteration:\n",
-  "- **Random sequence sampling**: The dataset randomly samples sequences from chromosomes/regions on-the-fly, rather than using pre-computed sliding windows\n",
+  "**🔧 Main Simplifications**: Compared to the full supervised tracks pipeline used in the paper, this notebook simplifies several aspects to enable faster experimentation for users with limited resources:\n",
   "- **Constant learning rate**: Uses a fixed learning rate throughout training without learning rate scheduling\n",
-  "- **No gradient accumulation**: Implements simple step-based training without gradient accumulation, making the training loop more straightforward\n",
+  "- **No gradient accumulation**: Implements simple step-based training without gradient accumulation, making the training loop more straightforward but changing the effective batch size compared with the full pipeline\n",
   "\n",
   "**⚡ Key Advantage**: This simplified pipeline achieves performance close to that of more complex training approaches while enabling fast fine-tuning: on an H100 GPU with 16 workers for data loading, it takes ~15 min to reach acceptable performance on a 32 kb functional tracks prediction task with the **NTv3_8M_pre** model. The training speed benefits from the efficient NTv3 model architecture, but of course depends on your hardware capabilities (GPU acceleration and multi-worker data loading significantly reduce training time)."
   ]
@@ -24,7 +23,7 @@
   "source": [
   "## 💻 A note on hardware\n",
   "\n",
-  "While this pipeline is designed to run on limited resources (e.g., Google Colab with a T4 GPU and 2 CPUs), the quoted training time and displayed performances (see the **Test evaluation** section) were obtained on a more powerful setup. If you want to reach similar performance levels, be aware that you will need **significant hardware resources** (high-end GPUs with substantial memory and multiple data-loading workers). Training times will vary significantly based on your hardware configuration.\n",
+  "While this pipeline is designed to run on limited resources (e.g., Google Colab with a T4 GPU and 2 CPUs), the quoted training time and displayed performances (see the **Test evaluation** section) were obtained on a more powerful setup and are shown only as a reference. If you want to reach similar performance levels, or those reported in the paper, be aware that you will need **significant hardware resources** (high-end GPUs with substantial memory and multiple data-loading workers). Training times will vary significantly based on your hardware configuration.\n",
   "\n",
   "📝 Note for Google Colab users: This notebook is compatible with Colab and designed to work with limited resources! For faster training, make sure to enable GPU: Runtime → Change runtime type → GPU (T4 or better recommended)."
   ]
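The simplified setup described above (a constant learning rate and per-step updates with no gradient accumulation) can be sketched in a few lines. This is an illustrative toy loop on a one-parameter least-squares problem using only the standard library, not the notebook's actual pipeline; the data, learning rate, and step count are all made up for the example:

```python
import random

# Toy data: y = 3x exactly, so plain SGD should recover w ≈ 3.
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]

w = 0.0
lr = 0.05        # constant learning rate: no scheduler, never decayed
num_steps = 500  # simple step-based loop: no epochs, no gradient accumulation

random.seed(0)
for step in range(num_steps):
    x, y = random.choice(data)    # sample one example per step
    grad = 2.0 * (w * x - y) * x  # d/dw of the squared error (w*x - y)**2
    w -= lr * grad                # update immediately: effective batch size is 1

print(round(w, 3))
```

The point of the structure is that each sampled batch produces exactly one optimizer step at a fixed step size; the full pipeline would instead accumulate gradients over several batches (raising the effective batch size) and anneal the learning rate.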
notebooks_tutorials/{03_fine_tuning_posttrained_model.ipynb → 03_fine_tuning_posttrained_model_biwig.ipynb} RENAMED
@@ -10,7 +10,7 @@
   "\n",
   "**🎯 Notebook purpose:**\n",
   "This notebook is configured to train the `NTv3_650M_post` model on the `human` species from the NTv3 benchmark dataset. To run this training, you will need a large GPU (either A100 or H100).\n",
-  "For a simplified version of this notebook that uses the `NTv3_8M_pre` model and runs on a CPU, please see the [02_fine_tuning_pretrained_model.ipynb](https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/02_fine_tuning_pretrained_model.ipynb) notebook.\n",
+  "For a simplified version of this notebook that uses the `NTv3_8M_pre` model and runs on a CPU, please see the [02_fine_tuning_pretrained_model_biwig.ipynb](https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/02_fine_tuning_pretrained_model_biwig.ipynb) notebook.\n",
   "The notebook uses the same \"simplified setup\" as described there.\n",
   "\n",
   "📝 Note for Google Colab users: This notebook is compatible with Colab! It is designed to run on a high-performance GPU; the default parameters can be used with an H100 with 80 GB of HBM."
notebooks_tutorials/04_fine_tuning_pretrained_model_annotation.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
notebooks_tutorials/{04_model_interpretation.ipynb → 05_model_interpretation.ipynb} RENAMED
File without changes
tabs/home.html CHANGED
@@ -84,11 +84,12 @@
   <ul>
     <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/00_quickstart_inference.ipynb" target="_blank" rel="noopener noreferrer">🚀 00 — Quickstart inference</a></li>
     <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/01_tracks_prediction.ipynb" target="_blank" rel="noopener noreferrer">📊 01 — Tracks prediction</a></li>
-    <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/02_fine_tuning_pretrained_model.ipynb" target="_blank" rel="noopener noreferrer">🎯 02 — Fine-tune a pre-trained model on bigwig tracks</a></li>
-    <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/03_fine_tuning_posttrained_model.ipynb" target="_blank" rel="noopener noreferrer">🎯 03 — Fine-tune a post-trained model on bigwig tracks</a></li>
-    <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/04_model_interpretation.ipynb" target="_blank" rel="noopener noreferrer">🔍 04 — Model interpretation</a></li>
-    <li>🧪 05 — Training NTv3-generative <em>(coming soon)</em></li>
-    <li>🪰 06 — Generating enhancer sequences <em>(coming soon)</em></li>
+    <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/02_fine_tuning_pretrained_model_biwig.ipynb" target="_blank" rel="noopener noreferrer">🎯 02 — Fine-tune a pre-trained model on bigwig tracks</a></li>
+    <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/03_fine_tuning_posttrained_model_biwig.ipynb" target="_blank" rel="noopener noreferrer">🎯 03 — Fine-tune a post-trained model on bigwig tracks</a></li>
+    <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/04_fine_tuning_pretrained_model_annotation.ipynb" target="_blank" rel="noopener noreferrer">🏷️ 04 — Fine-tune a pre-trained model on annotations</a></li>
+    <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/05_model_interpretation.ipynb" target="_blank" rel="noopener noreferrer">🔍 05 — Model interpretation</a></li>
+    <li>🧪 06 — Training NTv3-generative <em>(coming soon)</em></li>
+    <li>🪰 07 — Generating enhancer sequences <em>(coming soon)</em></li>
   </ul>
   </div>
   <div class="card">
@@ -111,10 +112,11 @@
   </div>
   </div>

-  <div class="card">
-  <h2>🤖 Load a pre-trained model</h2>
-  <p>Here is an example of how to load and use a pre-trained NTv3 model.</p>
-  <div class="code"><pre><code class="language-python">from transformers import AutoTokenizer, AutoModelForMaskedLM
+  <div class="card-stack">
+  <div class="card">
+  <h2>🤖 Load a pre-trained model</h2>
+  <p>Here is an example of how to load and use a pre-trained NTv3 model.</p>
+  <div class="code"><pre><code class="language-python">from transformers import AutoTokenizer, AutoModelForMaskedLM

   model_name = "InstaDeepAI/NTv3_650M_pre"

@@ -131,11 +133,12 @@ out = model(**batch)
   # Print output shapes
   print(out.logits.shape)  # (B, L, V = 11)
   </code></pre></div>
-  <p>Model embeddings can be used for fine-tuning on downstream tasks.</p>
-
-  <h2 style="margin-top: 40px;">🔍 Model interpretation</h2>
-  <p>Here is an example of how to use the interpretation pipeline on the NTv3 post-trained model for multi-scale analysis of DNA sequences:</p>
-  <div class="code"><pre><code class="language-python">from transformers import pipeline
+  <p>Model embeddings can be used for fine-tuning on downstream tasks.</p>
+  </div>
+  <div class="card">
+  <h2>🔍 Model interpretation</h2>
+  <p>Here is an example of how to use the interpretation pipeline on the NTv3 post-trained model for multi-scale analysis of DNA sequences:</p>
+  <div class="code"><pre><code class="language-python">from transformers import pipeline
   import torch
   import matplotlib.pyplot as plt

@@ -168,7 +171,8 @@ plt.show()
   result.plot_saliency(window_size=128)
   plt.show()
   </code></pre></div>
-  <img src="assets/saliency_example.png" alt="Output tracks visualization" style="max-width: 100%; margin-top: 20px;" />
+  <img src="assets/saliency_example.png" alt="Output tracks visualization" style="max-width: 100%; margin-top: 20px;" />
+  </div>
   </div>

   <div class="card">