Feature Extraction
Transformers
Joblib
Safetensors
BulkRNABert
bulk RNA-seq
biology
transcriptomics
custom_code
How to use InstaDeepAI/BulkRNABert with Transformers:

```
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("feature-extraction", model="InstaDeepAI/BulkRNABert", trust_remote_code=True)

# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("InstaDeepAI/BulkRNABert", trust_remote_code=True, dtype="auto")
```
Update README.md
README.md
CHANGED
@@ -1,14 +1,14 @@
 ---
 library_name: transformers
 tags:
-- bulk RNA-seq
-- biology
-- transcriptomics
+- bulk RNA-seq
+- biology
+- transcriptomics
 ---
 
 # BulkRNABert
 
-BulkRNABert is a transformer-based, encoder-only language model pre-trained on bulk RNA-seq
+BulkRNABert is a transformer-based, encoder-only language model pre-trained on bulk RNA-seq profiles from the TCGA dataset using self-supervised masked language modeling, following the original BERT framework. The model is trained to reconstruct randomly masked gene expression values from their genomic context, enabling it to learn biologically meaningful representations of transcriptomic profiles. Once pre-trained, BulkRNABert can be fine-tuned for various cancer-related downstream tasks—such as cancer type classification or survival analysis—by extracting embeddings from the model.
 
 **Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI)
 
@@ -29,23 +29,47 @@ pip install --upgrade git+https://github.com/huggingface/transformers.git
 pip install torch
 ```
 
-
+## Other notes
+We also provide the params for the BulkRNABert jax model in `jax_params`.
+
+A small snippet of code is provided below to run inference with the model using bulk RNA-seq samples from the [TCGA](https://portal.gdc.cancer.gov/) dataset.
 
 ```
-import
-
+from huggingface_hub import hf_hub_download
+import numpy as np
+import pandas as pd
+from transformers import AutoConfig, AutoModel, AutoTokenizer
+
+# Load model and tokenizer.
+config = AutoConfig.from_pretrained(
+    "InstaDeepAI/BulkRNABert",
+    trust_remote_code=True,
+)
+config.embeddings_layers_to_save = (4,)  # last transformer layer
 
+tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/BulkRNABert", trust_remote_code=True)
 model = AutoModel.from_pretrained(
     "InstaDeepAI/BulkRNABert",
+    config=config,
     trust_remote_code=True,
 )
 
-
-
-
-
+# Load bulk RNA-seq data and preprocess them.
+csv_path = hf_hub_download(
+    repo_id="InstaDeepAI/BulkRNABert",
+    filename="data/tcga_sample.csv",
+    repo_type="model",
+)
+gene_expression_array = pd.read_csv(csv_path).drop(["identifier"], axis=1).to_numpy()[:1, :]
+gene_expression_array = np.log10(1 + gene_expression_array)
+assert gene_expression_array.shape[1] == config.n_genes
 
-
+# Tokenize
+gene_expression_ids = tokenizer.batch_encode_plus(gene_expression_array, return_tensors="pt")["input_ids"]
+
+# Compute BulkRNABert's embeddings
+gene_expression_mean_embeddings = model(gene_expression_ids)["embeddings_4"].mean(axis=1)  # embeddings can be used for downstream tasks.
+```
 
 
 ### Citing our work
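The inference snippet in the updated README does two things around the model call: it applies `log10(1 + x)` to the raw expression values, and it averages the per-gene embeddings over axis 1 to get one vector per sample. The sketch below isolates those two steps in plain Python so they can be checked without downloading the model. It is an illustration only: the input values, the tiny embedding size, and the helper names `log_transform` and `mean_pool` are hypothetical, and the assumption that axis 1 of `embeddings_4` is the gene axis is inferred from the README's `.mean(axis=1)` call rather than stated there.

```python
import math


def log_transform(expression_rows):
    """Apply log10(1 + x) element-wise, mirroring np.log10(1 + array) in the README."""
    return [[math.log10(1.0 + value) for value in row] for row in expression_rows]


def mean_pool(token_embeddings):
    """Average a (n_genes, embedding_dim) nested list over the gene axis."""
    n_tokens = len(token_embeddings)
    embedding_dim = len(token_embeddings[0])
    return [
        sum(token[d] for token in token_embeddings) / n_tokens
        for d in range(embedding_dim)
    ]


# Synthetic expression values for one sample with three genes (illustrative only).
raw_expression = [[0.0, 9.0, 99.0]]
logged = log_transform(raw_expression)

# Hypothetical per-gene embeddings standing in for one sample of
# model(gene_expression_ids)["embeddings_4"]; real vectors are much larger.
fake_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(fake_embeddings)

print(logged)
print(pooled)
```

With real model outputs, the pooled vector would be the per-sample feature that a downstream classifier or survival model consumes, as the README's final comment suggests.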