Taykhoom/SpliceBERT-1024nt

@@ -53,7 +53,7 @@ Verified on GPU with PyTorch 2.7 / CUDA 11.8.
 ## Related Models
-See the full [SpliceBERT collection](<COLLECTION_URL>).
 | Model | Context | Training data | Notes |
 |---|---|---|---|
@@ -65,22 +65,19 @@ See the full [SpliceBERT collection](<COLLECTION_URL>).
 ### Embedding generation
-Input sequences must use single-nucleotide tokenization (space-separated characters)
-with U converted to T. The tokenizer handles this when called on pre-formatted sequences.
 ```python
 import torch
-from transformers import BertTokenizer, BertModel
-tokenizer = BertTokenizer.from_pretrained("Taykhoom/SpliceBERT-1024nt")
-model = BertModel.from_pretrained("Taykhoom/SpliceBERT-1024nt")
 model.eval()
-# Prepare sequence: convert U->T and add spaces
-seq = "ACGUACGUACGUACGU".upper().replace("U", "T")
-seq_spaced = " ".join(list(seq))
-enc = tokenizer(seq_spaced, return_tensors="pt")
 with torch.no_grad():
     out = model(**enc, output_hidden_states=True)
@@ -98,10 +95,10 @@ layer3_emb = out.hidden_states[3]  # (1, seq_len+2, 512)
 ```python
 import torch
-from transformers import BertTokenizer, BertForMaskedLM
-tokenizer = BertTokenizer.from_pretrained("Taykhoom/SpliceBERT-1024nt")
-model = BertForMaskedLM.from_pretrained("Taykhoom/SpliceBERT-1024nt")
 model.eval()
 seq = "A C G [MASK] A C G T"

 ## Related Models
+See the full [SpliceBERT collection](https://huggingface.co/collections/Taykhoom/splicebert-6a20b72e9bec05b79ce009aa).
 | Model | Context | Training data | Notes |
 |---|---|---|---|
 ### Embedding generation
+The tokenizer automatically handles U->T conversion and single-nucleotide spacing.
+Pass raw sequences directly.
 ```python
 import torch
+from transformers import AutoTokenizer, AutoModel
+tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
+model = AutoModel.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
 model.eval()
+seq = "ACGUACGUACGUACGU"  # U->T handled automatically
+enc = tokenizer(seq, return_tensors="pt")
 with torch.no_grad():
     out = model(**enc, output_hidden_states=True)
 ```python
 import torch
+from transformers import AutoTokenizer, AutoModelForMaskedLM
+tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
+model = AutoModelForMaskedLM.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
 model.eval()
 seq = "A C G [MASK] A C G T"
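
For reference, the manual preprocessing done by the removed lines (uppercase, U->T, space-separated nucleotides) can be sketched in plain Python. The new README text says the tokenizer now does this itself. `to_spaced_dna` is a hypothetical helper name, not part of the repository, and it is an assumption that the custom tokenizer loaded with `trust_remote_code=True` produces the same result.

```python
def to_spaced_dna(seq: str) -> str:
    """Uppercase the sequence, convert U to T, and separate nucleotides with spaces."""
    # Iterating over a str yields single characters, so join() spaces them out.
    return " ".join(seq.upper().replace("U", "T"))


spaced = to_spaced_dna("ACGUACGUACGUACGU")
print(spaced)  # A C G T A C G T A C G T A C G T
```

A helper like this is only relevant on the old `BertTokenizer` path. According to the updated text, raw sequences can be passed directly once the tokenizer is loaded through `AutoTokenizer`.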