jheuschkel committed on
Commit 2d44e8a · verified · 1 Parent(s): 758dd57

Update README.md

Files changed (1)
  1. README.md +148 -3
README.md CHANGED
---
license: apache-2.0
datasets:
- jheuschkel/cds-dataset
tags:
- codon
- language
- model
- synonymous
- CDS
- mRNA
---
# Advancing Codon Language Modeling with Synonymous Codon Constrained Masking

- This repository contains code to use the model and reproduce the results of the preprint [**Advancing Codon Language Modeling with Synonymous Codon Constrained Masking**](https://www.biorxiv.org/content/10.1101/2025.08.19.671089v1), by **James Heuschkel**, **Laura Kingsley**, **Noah Pefaur**, **Andrew Nixon**, and **Steven Cramer**.
- Unlike other codon language models, SynCodonLM was trained with logit-level control: the logits of non-synonymous codons are masked before the softmax, which lets the model learn codon-specific patterns disentangled from protein-level semantics (a minimal sketch of this idea follows the list below).
- [The pre-training dataset of 66 million CDS is available on Hugging Face here.](https://huggingface.co/datasets/jheuschkel/cds-dataset)
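
A minimal sketch of this constraint, using a toy codon vocabulary and made-up tensors (illustrative only, not the repository's actual training code):

```python
import torch
import torch.nn.functional as F

# Toy setup: logits for one masked codon position over a tiny, hypothetical codon vocabulary.
vocab = ["GCT", "GCC", "GCA", "GCG", "AAA", "AAG"]
logits = torch.randn(1, len(vocab))                  # [batch, vocab_size]

# The true amino acid is known (here Alanine), so only its synonymous codons are allowed.
synonymous_mask = torch.tensor([[True, True, True, True, False, False]])

# Mask non-synonymous codons before the softmax, then compute the loss
# only over the synonymous options.
constrained_logits = logits.masked_fill(~synonymous_mask, float("-inf"))
target = torch.tensor([1])                           # true codon, e.g. "GCC"
loss = F.cross_entropy(constrained_logits, target)
probs = F.softmax(constrained_logits, dim=-1)        # probability mass only on GC* codons
```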
---
## Installation

```bash
git clone https://github.com/Boehringer-Ingelheim/SynCodonLM.git
cd SynCodonLM
pip install -r requirements.txt
```
---
# Usage
## Prepare Sequence

```python
from SynCodonLM.utils import clean_split_sequence

seq = 'ATGTCCACCGGGCGGTGA'
seq = clean_split_sequence(seq)  # Returns: 'ATG TCC ACC GGG CGG TGA'
```
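
If you only need the formatting, a plain-Python stand-in is sketched below; it assumes `clean_split_sequence` simply uppercases the sequence and splits it into space-separated codon triplets, matching the example output above:

```python
def split_into_codons(seq: str) -> str:
    """Uppercase a CDS and split it into space-separated codon triplets."""
    seq = seq.strip().upper()
    return " ".join(seq[i:i + 3] for i in range(0, len(seq), 3))

print(split_into_codons("atgtccaccgggcggtga"))  # 'ATG TCC ACC GGG CGG TGA'
```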

## Load Model & Tokenizer from Hugging Face
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoConfig
import torch

tokenizer = AutoTokenizer.from_pretrained("jheuschkel/SynCodonLM")
config = AutoConfig.from_pretrained("jheuschkel/SynCodonLM")
model = AutoModelForMaskedLM.from_pretrained("jheuschkel/SynCodonLM", config=config)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
```
### If there are networking issues, you can manually [download the model from Hugging Face](https://huggingface.co/jheuschkel/SynCodonLM/resolve/main/model.safetensors?download=true) and place it in the `./SynCodonLM` directory
```python
tokenizer = AutoTokenizer.from_pretrained("./SynCodonLM", trust_remote_code=True)
config = AutoConfig.from_pretrained("./SynCodonLM", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("./SynCodonLM", trust_remote_code=True, config=config)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
```

## Tokenize Input Sequences and Set the Token Type ID Based on the Species ID Found [here](https://github.com/Boehringer-Ingelheim/SynCodonLM/blob/master/SynCodonLM/species_token_type.py)

```python
token_type_id = 67  # E. coli
inputs = tokenizer(seq, return_tensors="pt").to(device)
inputs['token_type_ids'] = torch.full_like(inputs['input_ids'], token_type_id)  # manually set token_type_ids
```

## Gather Model Outputs
```python
outputs = model(**inputs, output_hidden_states=True)
```

## Get Mean Embedding from Final Layer
```python
embedding = outputs.hidden_states[-1]  # this can also index any other layer (0-11)
mean_embedding = torch.mean(embedding, dim=1).squeeze(0)
```

## You Can Also View Language Head Output
```python
logits = outputs.logits  # shape: [batch_size, sequence_length, vocab_size]
```
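
As an illustrative sketch, these logits can be used to rank codons at a masked position. The snippet assumes the tokenizer exposes a mask token (`tokenizer.mask_token`) and that individual codons are single vocabulary entries; treat it as a sketch rather than documented API usage:

```python
# Hypothetical sketch: rank candidate codons at a masked position.
masked_seq = 'ATG TCC ACC GGG CGG TGA'.replace('GGG', tokenizer.mask_token)
inputs = tokenizer(masked_seq, return_tensors="pt").to(device)
inputs['token_type_ids'] = torch.full_like(inputs['input_ids'], 67)  # E. coli

with torch.no_grad():
    mask_logits = model(**inputs).logits

# Locate the masked position and rank the vocabulary by probability.
mask_pos = (inputs['input_ids'] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
probs = torch.softmax(mask_logits[0, mask_pos].squeeze(0), dim=-1)
top = torch.topk(probs, k=5)
print(tokenizer.convert_ids_to_tokens(top.indices.tolist()), top.values.tolist())
```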

## Citation
If you use this work, please cite:
```bibtex
@article{Heuschkel2025.08.19.671089,
  author = {Heuschkel, James and Kingsley, Laura and Pefaur, Noah and Nixon, Andrew and Cramer, Steven},
  title = {Advancing Codon Language Modeling with Synonymous Codon Constrained Masking},
  elocation-id = {2025.08.19.671089},
  year = {2025},
  doi = {10.1101/2025.08.19.671089},
  publisher = {Cold Spring Harbor Laboratory},
  abstract = {Codon language models offer a promising framework for modeling protein-coding DNA sequences, yet current approaches often conflate codon usage with amino acid semantics, limiting their ability to capture DNA-level biology. We introduce SynCodonLM, a codon language model that enforces a biologically grounded constraint: masked codons are only predicted from synonymous options, guided by the known protein sequence. This design disentangles codon-level from protein-level semantics, enabling the model to learn nucleotide-specific patterns. The constraint is implemented by masking non-synonymous codons from the prediction space prior to softmax. Unlike existing models, which cluster codons by amino acid identity, SynCodonLM clusters by nucleotide properties, revealing structure aligned with DNA-level biology. Furthermore, SynCodonLM outperforms existing models on 6 of 7 benchmarks sensitive to DNA-level features, including mRNA and protein expression. Our approach advances domain-specific representation learning and opens avenues for sequence design in synthetic biology, as well as deeper insights into diverse bioprocesses.},
  URL = {https://www.biorxiv.org/content/early/2025/08/24/2025.08.19.671089},
  eprint = {https://www.biorxiv.org/content/early/2025/08/24/2025.08.19.671089.full.pdf},
  journal = {bioRxiv}
}
```
---

## Usage With Batches
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoConfig
import torch
from SynCodonLM.utils import clean_split_sequence

tokenizer = AutoTokenizer.from_pretrained("jheuschkel/SynCodonLM")
config = AutoConfig.from_pretrained("jheuschkel/SynCodonLM")
model = AutoModelForMaskedLM.from_pretrained("jheuschkel/SynCodonLM", config=config)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# List of sequences
seqs = [
    'ATGTCCACCGGGCGGTGA',
    'ATGCGTACCGGGTAGTGA',
    'ATGTTTACCGGGTGGTGA'
]

# List of token type ids (species)
species_token_type_ids = [
    67,   # E. coli
    394,  # C. griseus
    317   # H. sapiens
]

# Prepare list
seqs = [clean_split_sequence(seq) for seq in seqs]

# Tokenize batch with padding
inputs = tokenizer(seqs, return_tensors="pt", padding=True).to(device)

# Create token_type_ids tensor
batch_size, seq_len = inputs['input_ids'].shape
token_type_ids = torch.zeros((batch_size, seq_len), dtype=torch.long).to(device)

# Fill each row with the species-specific token_type_id
for i, species_id in enumerate(species_token_type_ids):
    token_type_ids[i, :] = species_id  # fill the entire row with the species ID

# Add to inputs
inputs['token_type_ids'] = token_type_ids

# Run model
outputs = model(**inputs)
```
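
For padded batches, a minimal sketch of mean pooling that uses the attention mask to ignore padding positions when computing per-sequence embeddings:

```python
# Re-run with hidden states, then mean-pool the final layer per sequence, excluding padding.
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

hidden = outputs.hidden_states[-1]                      # [batch, seq_len, hidden_dim]
mask = inputs['attention_mask'].unsqueeze(-1).float()   # [batch, seq_len, 1], 0 at padding
mean_embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # [batch, hidden_dim]
```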