- This repository contains code to use the model and to reproduce the results of the preprint [**Advancing Codon Language Modeling with Synonymous Codon Constrained Masking**](https://doi.org/10.1101/2025.08.19.671089).
- Unlike other Codon Language Models, SynCodonLM was trained with logit-level control, masking logits for non-synonymous codons. This allowed the model to learn codon-specific patterns disentangled from protein-level semantics.
- [The pre-training dataset of 66 million CDS is available on Hugging Face.](https://huggingface.co/datasets/jheuschkel/cds-dataset)
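The logit-level control described above can be illustrated with a short sketch (hypothetical names and a toy subset of the genetic code, not the code used for training): at a masked position, any codon that does not encode the same amino acid as the original codon has its logit set to `-inf` before the softmax, so the model can only choose among synonymous codons.

```python
import math

# Toy subset of the standard genetic code (illustrative only)
CODON_TO_AA = {
    "CTG": "L", "CTC": "L", "TTA": "L",  # three of the six leucine codons
    "GCT": "A", "GCC": "A",              # two of the four alanine codons
}

def constrain_to_synonymous(logits, original_codon):
    """Set the logit of every codon that is not synonymous with the
    original codon to -inf, so the softmax assigns it zero mass."""
    aa = CODON_TO_AA[original_codon]
    return {
        codon: (score if CODON_TO_AA[codon] == aa else -math.inf)
        for codon, score in logits.items()
    }

logits = {"CTG": 2.1, "CTC": 0.4, "TTA": -1.0, "GCT": 3.0, "GCC": 0.2}
masked = constrain_to_synonymous(logits, "CTG")
# Only the leucine codons (CTG, CTC, TTA) keep finite scores
```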
---
```bash
git clone https://github.com/Boehringer-Ingelheim/SynCodonLM.git
cd SynCodonLM
pip install -r requirements.txt  # may not be necessary, depending on your environment
```
---
# Usage
#### SynCodonLM uses token-type IDs to add species-specific codon context to its predictions.
###### Before use, find the token-type ID (`species_token_type`) for your species of interest [here](https://github.com/Boehringer-Ingelheim/SynCodonLM/blob/master/SynCodonLM/species_token_type.py)!
###### Or use our list of model organisms [below](https://github.com/Boehringer-Ingelheim/SynCodonLM/tree/master#model-organisms-species-token-type-ids).
---
## Embedding a Coding DNA Sequence
```python
from SynCodonLM import CodonEmbeddings

model = CodonEmbeddings()  # loads the model & tokenizer using our built-in functions

seq = 'ATGTCCACCGGGCGGTGA'

mean_pooled_embedding = model.get_mean_embedding(seq, species_token_type=67)  # E. coli
# returns --> tensor of shape [768]

raw_output = model.get_raw_embeddings(seq, species_token_type=67)  # E. coli
raw_embedding_final_layer = raw_output.hidden_states[-1]  # treat this like a typical Hugging Face dictionary-based model output!
# returns --> tensor of shape [batch size (1), sequence length, 768]
```
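If you need a pooled embedding from the raw output, the mean-pooling step can be reproduced directly. A minimal sketch (a random tensor stands in for `raw_output.hidden_states[-1]`; in practice you may also want to exclude padding and special-token positions from the average):

```python
import torch

# stand-in for raw_output.hidden_states[-1]: shape [batch size (1), sequence length, 768]
raw_embedding_final_layer = torch.randn(1, 6, 768)

# average over the sequence-length dimension, then drop the batch dimension
mean_embedding = torch.mean(raw_embedding_final_layer, dim=1).squeeze(0)
# mean_embedding has shape [768]
```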
## Codon Optimizing a Protein Sequence
###### This has not yet been rigorously evaluated, although we can confidently say it will generate 'natural-looking' coding-DNA sequences.
```python
from SynCodonLM import CodonOptimizer

optimizer = CodonOptimizer()  # loads the model & tokenizer using our built-in functions

result = optimizer.optimize(
    protein_sequence="MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKRHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITLGMDELYK",  # GFP
    species_token_type=67,  # E. coli
    deterministic=True,  # True by default
)
codon_optimized_sequence = result.sequence
```
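One quick sanity check on any codon-optimized output is that it still translates to the input protein. A standalone sketch using the standard genetic code (the helper names are hypothetical, and it is demonstrated on the short example CDS from the embedding section rather than the optimizer's output):

```python
BASES = "TCAG"
# Standard genetic code: amino acids for the 64 codons, ordered T, C, A, G at each position
AA_STRING = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AA_STRING[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(cds):
    """Translate a coding DNA sequence, with '*' marking stop codons."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))

print(translate('ATGTCCACCGGGCGGTGA'))  # MSTGR*
```

Applied to the optimizer's output, `translate(codon_optimized_sequence).rstrip('*')` should match the input `protein_sequence`.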
## Citation
If you use this work, please cite:

```bibtex
  journal = {bioRxiv}
}
```
----

#### Model Organisms Species Token Type IDs

| Organism | Token-Type ID |
|-------------------------|----------------|
| *E. coli* | 67 |
| *S. cerevisiae* | 108 |
| *C. elegans* | 187 |
| *D. melanogaster* | 178 |
| *D. rerio* | 468 |
| *M. musculus* | 321 |
| *A. thaliana* | 266 |
| *H. sapiens* | 317 |
| *C. griseus* | 394 |
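For scripting, the table above can also be kept as a plain Python mapping (a convenience dict written out here; it is not an object exported by the package):

```python
# Token-type IDs copied from the table above
SPECIES_TOKEN_TYPE_IDS = {
    "E. coli": 67,
    "S. cerevisiae": 108,
    "C. elegans": 187,
    "D. melanogaster": 178,
    "D. rerio": 468,
    "M. musculus": 321,
    "A. thaliana": 266,
    "H. sapiens": 317,
    "C. griseus": 394,
}

species_token_type = SPECIES_TOKEN_TYPE_IDS["E. coli"]  # 67
```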