Add pipeline tag, library name and license

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +7 -3
README.md CHANGED
@@ -5,16 +5,20 @@ metrics:
 tags:
 - biology
 - medical
+pipeline_tag: feature-extraction
+library_name: transformers
+license: mit
 ---
+
 This is the official pre-trained baseline model introduced in [Fast and Low-Cost Genomic Foundation Models via Outlier Removal
-](https://arxiv.org/abs/2505.00598).
+](https://huggingface.co/papers/2505.00598).
 
 We sincerely appreciate the MosaicML team for the [MosaicBERT](https://openreview.net/forum?id=5zipcfLC2Z) implementation, which serves as the base of DNABERT-2 development.
 
 DNABERT-2 is a transformer-based genome foundation model trained on multi-species genome.
 
 To load the model from huggingface:
-```
+```python
 import torch
 from transformers import AutoTokenizer, AutoModel
 
@@ -23,7 +27,7 @@ model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code
 ```
 
 To calculate the embedding of a dna sequence
-```
+```python
 dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
 inputs = tokenizer(dna, return_tensors = 'pt')["input_ids"]
 hidden_states = model(inputs)[0] # [1, sequence_length, 768]
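
The last hunk stops at `hidden_states`, which still holds one 768-dimensional vector per token. A common next step is to pool over the token axis so that each sequence gets a single fixed-size embedding. The following is a minimal sketch of that step, not code taken from this diff: `pool_embeddings` is a hypothetical helper name, and the random tensor is only a stand-in for `model(inputs)[0]` so the snippet runs without downloading the weights (the sequence length of 17 is arbitrary).

```python
from typing import Tuple

import torch


def pool_embeddings(hidden_states: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
    """Reduce [1, sequence_length, hidden_size] token states to two [hidden_size] vectors.

    Hypothetical helper for illustration; assumes a batch holding a single,
    unpadded sequence, as in the README snippet.
    """
    token_states = hidden_states[0]                  # [sequence_length, hidden_size]
    mean_embedding = token_states.mean(dim=0)        # average over the token axis
    max_embedding = token_states.max(dim=0).values   # element-wise maximum over tokens
    return mean_embedding, max_embedding


# Stand-in for `hidden_states = model(inputs)[0]` (assumption: hidden size 768).
dummy_hidden_states = torch.randn(1, 17, 768)
mean_embedding, max_embedding = pool_embeddings(dummy_hidden_states)
print(mean_embedding.shape, max_embedding.shape)
```

With the real model, passing `hidden_states` from the README snippet in place of `dummy_hidden_states` should give two vectors of size 768. For batches that contain padding, the padded positions would need to be masked out before pooling, which this sketch does not do.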