jheuschkel committed on
Commit feaf025 · verified · 1 Parent(s): 5dee73a

Create README.md

Files changed (1)
  1. README.md +72 -0
README.md ADDED
@@ -0,0 +1,72 @@
+ ---
+ license: apache-2.0
+ datasets:
+ - jheuschkel/clustered-cds-dataset
+ language:
+ - en
+ pipeline_tag: fill-mask
+ tags:
+ - codon
+ - Codon
+ - biology
+ - synthetic
+ - dna
+ - mrna
+ - optimization
+ - codon-optimization
+ - codon-embedding
+ - codon-representation
+ - codon-language-model
+ - codon-language
+ misc:
+ - codon
+ ---
+ # Model Card for SynCodonLM
+
+
+
+ - This model is a replicate of the model trained with species token type IDs, but it was trained without any token type IDs.
+ - This model is fully protein-agnostic, whereas the species token type model may still retain a small amount of spurious statistical focus.
+ ---
+ ## Installation
+
+ ```bash
+ git clone https://github.com/Boehringer-Ingelheim/SynCodonLM.git
+ cd SynCodonLM
+ pip install -r requirements.txt  # may not be necessary depending on your env :)
+ ```
+
+ ## Embedding a Coding DNA Sequence Using our Model Trained without Token Type ID
+ ```python
+ from SynCodonLM import CodonEmbeddings
+
+ # Load the model and tokenizer using our built-in functions
+ model = CodonEmbeddings(model_name='jheuschkel/SynCodonLM-V2-NoTokenType')
+
+ seq = 'ATGTCCACCGGGCGGTGA'
+
+ mean_pooled_embedding = model.get_mean_embedding(seq)
+ # returns --> tensor of shape [768]
+
+ raw_output = model.get_raw_embeddings(seq)
+ # Treat this like a typical Hugging Face dictionary-style model output
+ raw_embedding_final_layer = raw_output.hidden_states[-1]
+ # returns --> tensor of shape [batch size (1), sequence length, 768]
+ ```
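
As a rough illustration of how the two outputs in the example above relate, the sketch below splits the example sequence into codons and mean-pools a per-token matrix into a single vector. It assumes that `get_mean_embedding` averages the final hidden states over the sequence dimension, which the README does not state explicitly. The helpers `split_codons` and `mean_pool` are hypothetical and not part of the SynCodonLM package, whose tokenizer may also add special tokens that this sketch ignores. Plain Python lists stand in for tensors, and the toy vectors have 2 dimensions instead of 768.

```python
def split_codons(seq):
    """Split a coding DNA sequence into 3-nucleotide codons (hypothetical helper)."""
    seq = seq.upper()
    if len(seq) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def mean_pool(token_vectors):
    """Average per-token vectors over the sequence dimension (hypothetical helper)."""
    n_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n_tokens for d in range(dim)]


codons = split_codons('ATGTCCACCGGGCGGTGA')
# Toy stand-in for final-layer hidden states: one 2-dimensional vector per codon
toy_hidden = [[float(i), float(2 * i)] for i in range(len(codons))]
pooled = mean_pool(toy_hidden)
```

With the real model, the analogous operation would presumably be a mean over the sequence axis of `raw_output.hidden_states[-1]`, reducing `[1, sequence length, 768]` to `[768]`.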
+
+
+ ## Citation
+ If you use this work, please cite:
+ ```bibtex
+ @article{Heuschkel2025.08.19.671089,
+   author = {Heuschkel, James and Kingsley, Laura and Pefaur, Noah and Nixon, Andrew and Cramer, Steven},
+   title = {Advancing Codon Language Modeling with Synonymous Codon Constrained Masking},
+   elocation-id = {2025.08.19.671089},
+   year = {2025},
+   doi = {10.1101/2025.08.19.671089},
+   publisher = {Cold Spring Harbor Laboratory},
+   abstract = {Codon language models offer a promising framework for modeling protein-coding DNA sequences, yet current approaches often conflate codon usage with amino acid semantics, limiting their ability to capture DNA-level biology. We introduce SynCodonLM, a codon language model that enforces a biologically grounded constraint: masked codons are only predicted from synonymous options, guided by the known protein sequence. This design disentangles codon-level from protein-level semantics, enabling the model to learn nucleotide-specific patterns. The constraint is implemented by masking non-synonymous codons from the prediction space prior to softmax. Unlike existing models, which cluster codons by amino acid identity, SynCodonLM clusters by nucleotide properties, revealing structure aligned with DNA-level biology. Furthermore, SynCodonLM outperforms existing models on 6 of 7 benchmarks sensitive to DNA-level features, including mRNA and protein expression. Our approach advances domain-specific representation learning and opens avenues for sequence design in synthetic biology, as well as deeper insights into diverse bioprocesses. Competing Interest Statement: The authors have declared no competing interest.},
+   URL = {https://www.biorxiv.org/content/early/2025/08/24/2025.08.19.671089},
+   eprint = {https://www.biorxiv.org/content/early/2025/08/24/2025.08.19.671089.full.pdf},
+   journal = {bioRxiv}
+ }
+ ```
+
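
The abstract quoted in the citation describes masking non-synonymous codons from the prediction space prior to softmax. A minimal sketch of that idea follows, assuming the standard genetic code and a hypothetical dictionary of per-codon logits. The function and variable names are illustrative and are not taken from the SynCodonLM code, which may implement the constraint differently.

```python
import math
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks stop codons
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def synonymous_softmax(logits, amino_acid):
    """Softmax restricted to codons that encode `amino_acid` (hypothetical helper).

    `logits` maps each of the 64 codons to a raw score. Non-synonymous codons
    get probability 0, as if their logits had been set to -inf before softmax.
    """
    allowed = [c for c in logits if CODON_TABLE[c] == amino_acid]
    if not allowed:
        raise ValueError(f"no codon encodes {amino_acid!r}")
    peak = max(logits[c] for c in allowed)  # subtract the max for numerical stability
    exp_scores = {c: math.exp(logits[c] - peak) for c in allowed}
    total = sum(exp_scores.values())
    return {c: exp_scores.get(c, 0.0) / total for c in logits}


# Uniform toy logits: the probability mass is split evenly over the six serine codons
toy_logits = {codon: 0.0 for codon in CODON_TABLE}
probs = synonymous_softmax(toy_logits, "S")
```

Restricting the softmax this way means the model is only ever scored on its choice among codons for the known amino acid, which is the property the abstract credits with separating codon-level from protein-level signal.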