Commit 8450a2f by Taykhoom (verified) · 1 Parent(s): e1ac8f5

Upload README.md with huggingface_hub

  1. README.md +29 -0
README.md CHANGED
@@ -92,6 +92,35 @@ out_all = model(**enc, output_hidden_states=True)
  layer6_emb = out_all.hidden_states[6]
  ```
 
+ ### CDS-aware embedding (mRNA sequences)
+
+ For mRNA sequences with a CDS track, `batch_encode_with_cds` applies T→U conversion,
+ extracts only the coding region, chunks it at codon boundaries, and encodes it, all in one call.
+
+ ```python
+ import numpy as np
+ import torch
+ from transformers import AutoTokenizer, AutoModel
+
+ tokenizer = AutoTokenizer.from_pretrained("Taykhoom/mRNA-FM", trust_remote_code=True)
+ model = AutoModel.from_pretrained("Taykhoom/mRNA-FM", trust_remote_code=True)
+ model.eval()
+
+ # Binary CDS track: 1 at the first nucleotide of each codon in the CDS, 0 elsewhere
+ sequences = ["ATGCTAGCTAGCTAGCTATGCTAGCTAGCTAGCT"]
+ cds = [np.array([0]*5 + [1, 0, 0]*9 + [0]*2)]  # example: 5 nt upstream, 9 codons, 2 nt downstream
+
+ enc, chunk_counts = tokenizer.batch_encode_with_cds(
+     sequences, cds, return_tensors="pt", padding=True, add_special_tokens=True
+ )
+ with torch.no_grad():
+     out = model(**enc)
+
+ # chunk_counts[i] = number of chunks produced for sequences[i]
+ # To get one embedding per sequence, mean-pool the non-special tokens across its chunks.
+ hidden = out.last_hidden_state  # (total_chunks, seq_len, 1280)
+ ```
+
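The added snippet stops at `hidden` and leaves the mean-pooling step to the reader. A minimal sketch of that step follows; it is not part of the commit. The helper name `pool_by_sequence` and the `token_mask` argument are hypothetical, and the sketch assumes `chunk_counts` is a list of integers that sums to the number of chunks, with the chunks ordered sequence by sequence. Synthetic arrays stand in for the model output so the block runs on its own.

```python
import numpy as np

def pool_by_sequence(hidden, token_mask, chunk_counts):
    """Mean-pool token embeddings over all chunks of each input sequence.

    hidden:       (total_chunks, seq_len, dim) array of token embeddings
    token_mask:   (total_chunks, seq_len) array, 1 for tokens to average,
                  0 for padding and special tokens (hypothetical input)
    chunk_counts: chunks per sequence, assumed to sum to total_chunks
    """
    hidden = np.asarray(hidden, dtype=np.float32)
    mask = np.asarray(token_mask, dtype=np.float32)[..., None]
    pooled, start = [], 0
    for n in chunk_counts:
        h = hidden[start:start + n]          # chunks of this sequence
        m = mask[start:start + n]
        count = max(float(m.sum()), 1.0)     # avoid division by zero
        pooled.append((h * m).sum(axis=(0, 1)) / count)
        start += n
    return np.stack(pooled)                  # (num_sequences, dim)

# Synthetic stand-ins: 3 chunks, 4 tokens each, embedding size 8.
hidden_demo = np.ones((3, 4, 8), dtype=np.float32)
hidden_demo[1] = 2.0
hidden_demo[2] = 3.0
mask_demo = np.array([[0, 1, 1, 0],
                      [0, 1, 0, 0],
                      [0, 1, 1, 0]])
seq_emb = pool_by_sequence(hidden_demo, mask_demo, [2, 1])
```

With the real model, `hidden` could be passed as `out.last_hidden_state.numpy()`. The mask would have to be built from the tokenizer output, for example the attention mask with special-token positions zeroed out. Whether `batch_encode_with_cds` returns such a mask is not shown in this diff, so that part is an assumption.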
  ### MLM logits
 
  ```python