jheuschkel committed
Commit ca05d5c · verified · 1 Parent(s): 467eab4

Update README.md

Files changed (1): README.md (+39 −94)
README.md CHANGED
@@ -16,7 +16,7 @@ pipeline_tag: fill-mask
 
 
 
- - This repository contains code to utilize the model, and reproduce results of the preprint [**Advancing Codon Language Modeling with Synonymous Codon Constrained Masking**](https://www.biorxiv.org/content/10.1101/2025.08.19.671089v1), by **James Heuschkel**, **Laura Kingsley**, **Noah Pefaur**, **Andrew Nixon**, and **Steven Cramer**.
+ - This repository contains code to utilize the model and reproduce results of the preprint [**Advancing Codon Language Modeling with Synonymous Codon Constrained Masking**](https://doi.org/10.1101/2025.08.19.671089).
 - Unlike other Codon Language Models, SynCodonLM was trained with logit-level control, masking logits for non-synonymous codons. This allowed the model to learn codon-specific patterns disentangled from protein-level semantics.
 - [Pre-training dataset of 66 Million CDS is available on Hugging Face here.](https://huggingface.co/datasets/jheuschkel/cds-dataset)
 ---
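A note on the "logit-level control" mentioned in the bullet above: during masked-codon pretraining, the logits of every codon that is not synonymous with the masked codon are suppressed, so the model is only ever asked to rank synonyms against each other. Below is a minimal sketch of that idea, assuming a toy codon vocabulary and a hypothetical `synonym_ids` grouping; it is an illustration of the technique, not the authors' implementation.

```python
import torch

# Toy vocabulary: token IDs for a handful of codons (hypothetical numbering).
codon_to_id = {'GGG': 0, 'GGA': 1, 'GGC': 2, 'GGT': 3, 'ATG': 4, 'TGG': 5}

# Synonymous-codon groups: the four GGN codons all encode glycine.
synonym_ids = {'Gly': [0, 1, 2, 3], 'Met': [4], 'Trp': [5]}

# Language-head logits for one masked position (random stand-in).
logits = torch.randn(len(codon_to_id))

# Additive mask: 0 for synonymous codons, -inf for everything else.
mask = torch.full_like(logits, float('-inf'))
mask[synonym_ids['Gly']] = 0.0  # the masked codon encodes glycine

constrained = logits + mask
probs = torch.softmax(constrained, dim=-1)
# probs now places all probability mass on GGG/GGA/GGC/GGT
```

Applying such a mask before the cross-entropy loss means the model is never rewarded or penalized for distinguishing codons of other amino acids, which is how codon-level preferences can be learned disentangled from protein-level semantics.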
@@ -24,63 +24,43 @@ pipeline_tag: fill-mask
 
 ```python
 git clone https://github.com/Boehringer-Ingelheim/SynCodonLM.git
- pip install -r requirements.txt
+ cd SynCodonLM
+ pip install -r requirements.txt # maybe not necessary depending on your env :)
 ```
 ---
 # Usage
- ## Prepare Sequence
-
- ```python
- from SynCodonLM.utils import clean_split_sequence
- seq = 'ATGTCCACCGGGCGGTGA'
- seq = clean_split_sequence(seq) # Returns: 'ATG TCC ACC GGG CGG TGA'
- ```
-
- ## Load Model & Tokenizer from Hugging Face
- ```python
- from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoConfig
- import torch
-
- tokenizer = AutoTokenizer.from_pretrained("jheuschkel/SynCodonLM")
- config = AutoConfig.from_pretrained("jheuschkel/SynCodonLM")
- model = AutoModelForMaskedLM.from_pretrained("jheuschkel/SynCodonLM", config=config)
-
- device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
- model.to(device)
- ```
- ### If there are networking issues, you can manually [download the model from Hugging Face](https://huggingface.co/jheuschkel/SynCodonLM/resolve/main/model.safetensors?download=true) & place it in the /SynCodonLM directory
- ```python
- tokenizer = AutoTokenizer.from_pretrained("./SynCodonLM", trust_remote_code=True)
- config = AutoConfig.from_pretrained("./SynCodonLM", trust_remote_code=True)
- model = AutoModel.from_pretrained("./SynCodonLM", trust_remote_code=True, config=config)
-
- device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
- model.to(device)
- ```
-
- ## Tokenize Input Sequences, Set Token Type ID Based on Species ID found [here](https://github.com/Boehringer-Ingelheim/SynCodonLM/blob/master/SynCodonLM/species_token_type.py)
- ```python
- token_type_id = 67 #E. coli
- inputs = tokenizer(seq, return_tensors="pt").to(device)
- inputs['token_type_ids'] = torch.full_like(inputs['input_ids'], token_type_id) # manually set token_type_ids
- ```
-
- ## Gather Model Outputs
- ```python
- outputs = model(**inputs, output_hidden_states=True)
- ```
-
- ## Get Mean Embedding from Final Layer
- ```python
- embedding = outputs.hidden_states[-1] #this can also index any layer (0-11)
- mean_embedding = torch.mean(embedding, dim=1).squeeze(0)
- ```
-
- ## You Can Also View Language Head Output
- ```python
- logits = outputs.logits # shape: [batch_size, sequence_length, vocab_size]
- ```
+ #### SynCodonLM uses token-type IDs to add species-specific codon context to its thinking.
+ ###### Before use, find the token-type ID (species_token_type) for your species of interest [here](https://github.com/Boehringer-Ingelheim/SynCodonLM/blob/master/SynCodonLM/species_token_type.py)!
+ ###### Or use our list of model organisms [below](https://github.com/Boehringer-Ingelheim/SynCodonLM/tree/master#model-organisms-species-token-type-ids)
+ ---
+ ## Embedding a Coding DNA Sequence
+ ```python
+ from SynCodonLM import CodonEmbeddings
+
+ model = CodonEmbeddings() # loads the model & tokenizer using our built-in functions
+
+ seq = 'ATGTCCACCGGGCGGTGA'
+
+ mean_pooled_embedding = model.get_mean_embedding(seq, species_token_type=67) # E. coli
+ # returns a tensor of shape [768]
+
+ raw_output = model.get_raw_embeddings(seq, species_token_type=67) # E. coli
+ raw_embedding_final_layer = raw_output.hidden_states[-1] # treat this like a typical Hugging Face model dictionary-based output!
+ # returns a tensor of shape [batch size (1), sequence length, 768]
+ ```
+ ## Codon Optimizing a Protein Sequence
+ ###### This has not yet been rigorously evaluated, although we can confidently say it will generate 'natural-looking' coding DNA sequences.
+ ```python
+ from SynCodonLM import CodonOptimizer
+
+ optimizer = CodonOptimizer() # loads the model & tokenizer using our built-in functions
+
+ result = optimizer.optimize(
+     protein_sequence="MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKRHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITLGMDELYK", # GFP
+     species_token_type=67, # E. coli
+     deterministic=True # True by default
+ )
+ codon_optimized_sequence = result.sequence
+ ```
 
 ## Citation
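For readers comparing the two APIs: judging from the removed "Get Mean Embedding" snippet, `get_mean_embedding` presumably corresponds to mean-pooling the final hidden layer returned by `get_raw_embeddings`. A hedged sketch follows; the pooling line mirrors the old README, but details such as special-token handling may differ in the package itself.

```python
import torch
from SynCodonLM import CodonEmbeddings

model = CodonEmbeddings()
seq = 'ATGTCCACCGGGCGGTGA'

raw_output = model.get_raw_embeddings(seq, species_token_type=67)  # E. coli
final_layer = raw_output.hidden_states[-1]          # shape [1, seq_len, 768]

# Mean over the token dimension, as in the removed example above.
pooled = torch.mean(final_layer, dim=1).squeeze(0)  # shape [768]
# Expected to approximate model.get_mean_embedding(seq, species_token_type=67).
```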
@@ -99,51 +79,16 @@ If you use this work, please cite:
 journal = {bioRxiv}
 }
 ```
-
- ## Usage With Batches
- ```python
- from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoConfig
- import torch
- from SynCodonLM.utils import clean_split_sequence
-
- tokenizer = AutoTokenizer.from_pretrained("jheuschkel/SynCodonLM")
- config = AutoConfig.from_pretrained("jheuschkel/SynCodonLM")
- model = AutoModelForMaskedLM.from_pretrained("jheuschkel/SynCodonLM", config=config)
-
- device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
- model.to(device)
-
- # List of sequences
- seqs = [
-     'ATGTCCACCGGGCGGTGA',
-     'ATGCGTACCGGGTAGTGA',
-     'ATGTTTACCGGGTGGTGA'
- ]
-
- # List of token type ids (species)
- species_token_type_ids = [
-     67,  # E. coli
-     394, # C. griseus
-     317  # H. sapiens
- ]
-
- # Prepare list
- seqs = [clean_split_sequence(seq) for seq in seqs]
-
- # Tokenize batch with padding
- inputs = tokenizer(seqs, return_tensors="pt", padding=True).to(device)
-
- # Create token_type_ids tensor
- batch_size, seq_len = inputs['input_ids'].shape
- token_type_ids = torch.zeros((batch_size, seq_len), dtype=torch.long).to(device)
-
- # Fill each row with the species-specific token_type_id
- for i, species_id in enumerate(species_token_type_ids):
-     token_type_ids[i, :] = species_id # Fill entire row with the species ID
-
- # Add to inputs
- inputs['token_type_ids'] = token_type_ids
-
- # Run model
- outputs = model(**inputs)
- ```
+ ----
+ #### Model Organisms Species Token Type IDs
+ | Organism          | Token-Type ID |
+ |-------------------|---------------|
+ | *E. coli*         | 67            |
+ | *S. cerevisiae*   | 108           |
+ | *C. elegans*      | 187           |
+ | *D. melanogaster* | 178           |
+ | *D. rerio*        | 468           |
+ | *M. musculus*     | 321           |
+ | *A. thaliana*     | 266           |
+ | *H. sapiens*      | 317           |
+ | *C. griseus*      | 394           |
 
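For convenience, the table above can be dropped into code as a plain mapping. The dict below is illustrative (the IDs are copied from the table, but the dict itself is not part of the package), paired with the embedding API shown earlier.

```python
from SynCodonLM import CodonEmbeddings

# Species token-type IDs, transcribed from the table above (illustrative helper).
MODEL_ORGANISM_TOKEN_TYPES = {
    'E. coli': 67,
    'S. cerevisiae': 108,
    'C. elegans': 187,
    'D. melanogaster': 178,
    'D. rerio': 468,
    'M. musculus': 321,
    'A. thaliana': 266,
    'H. sapiens': 317,
    'C. griseus': 394,
}

model = CodonEmbeddings()
embedding = model.get_mean_embedding(
    'ATGTCCACCGGGCGGTGA',
    species_token_type=MODEL_ORGANISM_TOKEN_TYPES['H. sapiens'],
)
```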