Commit 1200db8 by Taykhoom (verified)
1 Parent(s): fe65700

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +11 -14
README.md CHANGED
@@ -53,7 +53,7 @@ Verified on GPU with PyTorch 2.7 / CUDA 11.8.
 
 ## Related Models
 
-See the full [SpliceBERT collection](<COLLECTION_URL>).
+See the full [SpliceBERT collection](https://huggingface.co/collections/Taykhoom/splicebert-6a20b72e9bec05b79ce009aa).
 
 | Model | Context | Training data | Notes |
 |---|---|---|---|
@@ -65,22 +65,19 @@ See the full [SpliceBERT collection](<COLLECTION_URL>).
 
 ### Embedding generation
 
-Input sequences must use single-nucleotide tokenization (space-separated characters)
-with U converted to T. The tokenizer handles this when called on pre-formatted sequences.
+The tokenizer automatically handles U->T conversion and single-nucleotide spacing.
+Pass raw sequences directly.
 
 ```python
 import torch
-from transformers import BertTokenizer, BertModel
+from transformers import AutoTokenizer, AutoModel
 
-tokenizer = BertTokenizer.from_pretrained("Taykhoom/SpliceBERT-1024nt")
-model = BertModel.from_pretrained("Taykhoom/SpliceBERT-1024nt")
+tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
+model = AutoModel.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
 model.eval()
 
-# Prepare sequence: convert U->T and add spaces
-seq = "ACGUACGUACGUACGU".upper().replace("U", "T")
-seq_spaced = " ".join(list(seq))
-
-enc = tokenizer(seq_spaced, return_tensors="pt")
+seq = "ACGUACGUACGUACGU" # U->T handled automatically
+enc = tokenizer(seq, return_tensors="pt")
 
 with torch.no_grad():
     out = model(**enc, output_hidden_states=True)
@@ -98,10 +95,10 @@ layer3_emb = out.hidden_states[3] # (1, seq_len+2, 512)
 
 ```python
 import torch
-from transformers import BertTokenizer, BertForMaskedLM
+from transformers import AutoTokenizer, AutoModelForMaskedLM
 
-tokenizer = BertTokenizer.from_pretrained("Taykhoom/SpliceBERT-1024nt")
-model = BertForMaskedLM.from_pretrained("Taykhoom/SpliceBERT-1024nt")
+tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
+model = AutoModelForMaskedLM.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
 model.eval()
 
 seq = "A C G [MASK] A C G T"
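
Note on the Embedding generation change: the manual preprocessing that the old README asked callers to perform (upper-casing, U->T conversion, single-nucleotide spacing) can be sketched as a small plain-Python helper. The name `preformat_sequence` is hypothetical and does not appear in the model repository; the sketch only mirrors the removed README lines, and the diff does not show whether the updated tokenizer does exactly this internally.

```python
def preformat_sequence(seq: str) -> str:
    """Mirror the removed README steps: upper-case, map U to T, space-separate.

    Hypothetical helper for illustration; not part of the SpliceBERT repo.
    """
    dna = seq.upper().replace("U", "T")  # RNA alphabet -> DNA alphabet
    return " ".join(dna)                 # one token per nucleotide


if __name__ == "__main__":
    print(preformat_sequence("ACGUACGUACGUACGU"))
```

A helper like this may still be useful for comparing the token IDs produced from raw and from pre-formatted input when validating the new tokenizer behaviour.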