Commit 3b6e7d5 · Parent: 507be8b

feat: add annotation notebook

Files changed:
- notebooks_tutorials/{02_fine_tuning_pretrained_model.ipynb → 02_fine_tuning_pretrained_model_biwig.ipynb} (+3 -4)
- notebooks_tutorials/{03_fine_tuning_posttrained_model.ipynb → 03_fine_tuning_posttrained_model_biwig.ipynb} (+1 -1)
- notebooks_tutorials/04_fine_tuning_pretrained_model_annotation.ipynb (+0 -0)
- notebooks_tutorials/{04_model_interpretation.ipynb → 05_model_interpretation.ipynb} (+0 -0)
- tabs/home.html (+19 -15)
notebooks_tutorials/{02_fine_tuning_pretrained_model.ipynb → 02_fine_tuning_pretrained_model_biwig.ipynb}
RENAMED

@@ -10,10 +10,9 @@
 "\n",
 "📊 We provide access to the NTv3-benchmark data that we released on our Hugging Face dataset: `InstaDeepAI/NTv3_benchmark_dataset`. In this repository, you will find ready-to-use genome FASTA files, Bigwig tracks, metadata, but also the splits that were used for the benchmark.\n",
 "\n",
-"**🔧 Main Simplifications**: Compared to the full supervised tracks pipeline, this notebook simplifies several aspects to enable faster…
-"- **Random sequence sampling**: The dataset randomly samples sequences from chromosomes/regions on-the-fly, rather than using pre-computed sliding windows\n",
+"**🔧 Main Simplifications**: Compared to the full supervised tracks pipeline used in the paper, this notebook simplifies several aspects to enable faster experimentation with limited resources for users:\n",
 "- **Constant learning rate**: Uses a fixed learning rate throughout training without learning rate scheduling\n",
-"- **No gradient accumulation**: Implements simple step-based training without gradient accumulation, making the training loop more straightforward\n",
+"- **No gradient accumulation**: Implements simple step-based training without gradient accumulation, making the training loop more straightforward but changing the effective batch size compared with the full pipeline\n",
 "\n",
 "**⚡ Key Advantage**: This simplified pipeline achieves close performance to more complex training approaches while enabling fast fine-tuning: on a H100 GPU and using 16 workers for data loading, it takes ~15min to reach acceptable performances for a 32kb functional tracks prediction task on **NTv3_8M_pre** model. The training speed benefits from the efficient NTv3 model architecture, but of course depends on your hardware capabilities (GPU acceleration and multi-worker data loading significantly reduce training time)."
 ]

@@ -24,7 +23,7 @@
 "source": [
 "## 💻 A note on hardware\n",
 "\n",
-"While this pipeline is designed to run on limited resources (e.g., Google Colab with a T4 GPU and 2CPUs), the mentioned training time or displayed performances (see **Test evaluation** section) was obtained on a more powerful setup. If you want to reach similar performance levels, you should be aware that you'll need **significant hardware resources** (high-end GPUs with substantial memory and multiple data loading workers). Training times will vary significantly based on your hardware configuration.\n",
+"While this pipeline is designed to run on limited resources (e.g., Google Colab with a T4 GPU and 2CPUs), the mentioned training time or displayed performances (see **Test evaluation** section) was obtained on a more powerful setup and is shown just as a reference. If you want to reach similar performance levels or the ones reported in the paper, you should be aware that you'll need **significant hardware resources** (high-end GPUs with substantial memory and multiple data loading workers). Training times will vary significantly based on your hardware configuration.\n",
 "\n",
 "📝 Note for Google Colab users: This notebook is compatible with Colab and designed to work with limited resources! For faster training, make sure to enable GPU: Runtime → Change runtime type → GPU (T4 or better recommended)."
 ]
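The simplifications described in the updated cell above (constant learning rate, no scheduler, no gradient accumulation) reduce to a very small training loop. As a reader aid, here is a minimal, self-contained PyTorch sketch of that pattern; the model, data, and loss below are illustrative stand-ins, not the notebook's actual code (the real task regresses bigwig track values from tokenized DNA):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in model and data: the real notebook uses NTv3 and bigwig targets.
model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
dataset = TensorDataset(torch.randn(256, 128), torch.randn(256, 1))
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # fixed LR, never scheduled
model.train()
for step, (inputs, targets) in enumerate(dataloader):
    preds = model(inputs)
    loss = torch.nn.functional.mse_loss(preds, targets)  # stand-in objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # no gradient accumulation: one update per batch, so effective batch size equals batch size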
notebooks_tutorials/{03_fine_tuning_posttrained_model.ipynb → 03_fine_tuning_posttrained_model_biwig.ipynb}
RENAMED

@@ -10,7 +10,7 @@
 "\n",
 "**🎯 Notebook purpose:**\n",
 "This notebook is configured to train the `NTv3_650M_post` model on the `human` species from the NTv3 benchmark dataset. To run this training, you will need a large GPU (either A100 or H100).\n",
-"For a simplified version of this notebook that uses the `NTv3_8M_pre` model and runs on a CPU, please see the […
+"For a simplified version of this notebook that uses the `NTv3_8M_pre` model and runs on a CPU, please see the [02_fine_tuning_pretrained_model_biwig.ipynb](https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/02_fine_tuning_pretrained_model_biwig.ipynb) notebook.\n",
 "The notebook uses the same \"simplified setup\" as described there. \n",
 "\n",
 "📝 Note for Google Colab users: This notebook is compatible with Colab! This notebook is designed to be run on a high-performance GPU. The default parameters can be used with a H100 with 80GB of HBM."
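As context for the hardware note above: fitting the 650M post-trained checkpoint on a single A100/H100 typically means loading it in reduced precision. A hedged sketch follows; the repo id InstaDeepAI/NTv3_650M_post is inferred from the naming of the pre-trained checkpoint shown in tabs/home.html, and trust_remote_code=True is an assumption for a custom Hub architecture, so check the model card for the exact recipe:

import torch
from transformers import AutoModelForMaskedLM

# Assumed repo id and loading flags; verify against the model card.
model = AutoModelForMaskedLM.from_pretrained(
    "InstaDeepAI/NTv3_650M_post",  # assumption: mirrors the _pre naming
    torch_dtype=torch.bfloat16,    # bf16 halves memory vs fp32; native on A100/H100
    trust_remote_code=True,        # assumption for custom architectures
).to("cuda")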
notebooks_tutorials/04_fine_tuning_pretrained_model_annotation.ipynb
ADDED

The diff for this file is too large to render. See raw diff.

notebooks_tutorials/{04_model_interpretation.ipynb → 05_model_interpretation.ipynb}
RENAMED

File without changes.
tabs/home.html
CHANGED

@@ -84,11 +84,12 @@
 <ul>
 <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/00_quickstart_inference.ipynb" target="_blank" rel="noopener noreferrer">🚀 00 — Quickstart inference</a></li>
 <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/01_tracks_prediction.ipynb" target="_blank" rel="noopener noreferrer">📊 01 — Tracks prediction</a></li>
-<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/…
-<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/…
-<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/…
-<li…
-<li…
+<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/02_fine_tuning_pretrained_model_biwig.ipynb" target="_blank" rel="noopener noreferrer">🎯 02 — Fine-tune a pre-trained model on bigwig tracks</a></li>
+<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/03_fine_tuning_posttrained_model_biwig.ipynb" target="_blank" rel="noopener noreferrer">🎯 03 — Fine-tune a post-trained model on bigwig tracks</a></li>
+<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/04_fine_tuning_pretrained_model_annotation.ipynb" target="_blank" rel="noopener noreferrer">🏷️ 04 — Fine-tune a pre-trained model on annotations</a></li>
+<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/05_model_interpretation.ipynb" target="_blank" rel="noopener noreferrer">🔍 05 — Model interpretation</a></li>
+<li>🧪 06 — Training NTv3-generative <em>(coming soon)</em></li>
+<li>🪰 07 — Generating enhancer sequences <em>(coming soon)</em></li>
 </ul>
 </div>
 <div class="card">

@@ -111,10 +112,11 @@
 </div>
 </div>
 
-<div class="card">
-<…
-…
-…
+<div class="card-stack">
+<div class="card">
+<h2>🤖 Load a pre-trained model</h2>
+<p>Here is an example of how to load and use a pre-trained NTv3 model.</p>
+<div class="code"><pre><code class="language-python">from transformers import AutoTokenizer, AutoModelForMaskedLM
 
 model_name = "InstaDeepAI/NTv3_650M_pre"
 

@@ -131,11 +133,12 @@ out = model(**batch)
 # Print output shapes
 print(out.logits.shape) # (B, L, V = 11)
 </code></pre></div>
-…
-…
-<…
-…
-…
+<p>Model embeddings can be used for fine-tuning on downstream tasks.</p>
+</div>
+<div class="card">
+<h2>🔍 Model interpretation</h2>
+<p>Here is an example of how to use the interpretation pipeline on the NTv3 post-trained model for multi-scale analysis of DNA sequences:</p>
+<div class="code"><pre><code class="language-python">from transformers import pipeline
 import torch
 import matplotlib.pyplot as plt
 

@@ -168,7 +171,8 @@ plt.show()
 result.plot_saliency(window_size=128)
 plt.show()
 </code></pre></div>
-<img src="assets/saliency_example.png" alt="Output tracks visualization" style="max-width: 100%; margin-top: 20px;" />
+<img src="assets/saliency_example.png" alt="Output tracks visualization" style="max-width: 100%; margin-top: 20px;" />
+</div>
 </div>
 
 <div class="card">
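The home-page snippet is only partially visible in this diff (its middle lines fall outside the hunks). For reference, here is a self-contained version of the same loading pattern; the tokenizer call, the dummy sequence, and trust_remote_code=True are assumptions about the elided lines, not a verbatim reconstruction:

from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "InstaDeepAI/NTv3_650M_pre"

# Assumption: a custom Hub architecture usually needs trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)

sequence = "ATGCGTACGATCGTACGATCGATCGTACGATC"  # dummy DNA input
batch = tokenizer(sequence, return_tensors="pt")

out = model(**batch)

# Print output shapes
print(out.logits.shape)  # (B, L, V = 11), per the snippet's own comment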
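The interpretation card relies on the Space's custom pipeline task and its plot_saliency helper, neither of which appears in this diff. Purely as a stand-in, here is a generic input-gradient saliency sketch with plain transformers; it assumes the model accepts inputs_embeds like standard Hub models, and it is not the Space's actual interpretation pipeline:

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "InstaDeepAI/NTv3_650M_pre"  # assumption: reusing the checkpoint from above
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)
model.eval()

batch = tokenizer("ACGT" * 64, return_tensors="pt")  # dummy 256-nt sequence

# Detach the embeddings into a leaf tensor so .grad is populated on backward.
embeds = model.get_input_embeddings()(batch["input_ids"]).detach().requires_grad_(True)
out = model(inputs_embeds=embeds, attention_mask=batch["attention_mask"])
out.logits.sum().backward()  # gradient of a scalar summary w.r.t. the inputs

saliency = embeds.grad.norm(dim=-1).squeeze(0)  # per-token importance, shape (L,)
print(saliency.topk(5).indices)  # positions the model is most sensitive to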