Pichia-CLM
Pichia–Codon language model (Pichia-CLM) is a deep learning–based language model for codon optimization, designed to enhance recombinant protein production in the industrially relevant host Komagataella phaffii. Conventional approaches rely on codon usage bias (CUB) metrics, which often provide a single global score and ignore sequence context. Pichia-CLM instead learns the amino acid-to-codon mapping directly from the host genome in an unbiased way. Prior deep learning models have attempted codon optimization, but they were typically evaluated using CUB metrics, with limited experimental validation. In contrast, we experimentally validated Pichia-CLM across six diverse protein classes of varying complexity and consistently observed higher expression titers than with four commercial codon optimization tools.
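To make the contrast concrete, the following is a minimal sketch of the conventional, context-free approach that Pichia-CLM is compared against. It is not Pichia-CLM itself. It counts codon usage in a set of host coding sequences and then picks the most frequent codon for each amino acid. The function names and the two toy sequences are hypothetical and are not real K. phaffii data; only the standard genetic code table is factual.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def split_codons(cds):
    """Split a coding sequence into a list of three-letter codons."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def codon_usage(cds_list):
    """Count, per amino acid, how often each synonymous codon occurs."""
    usage = defaultdict(Counter)
    for cds in cds_list:
        for codon in split_codons(cds):
            usage[CODON_TABLE[codon]][codon] += 1
    return usage


def most_frequent_codon_design(protein, usage):
    """Context-free baseline: always emit the most frequent codon.

    Assumes every amino acid in `protein` was seen in the usage counts.
    """
    return "".join(usage[aa].most_common(1)[0][0] for aa in protein)


# Toy host sequences (made up for illustration only).
toy_host_cds = ["ATGGCTGCTAAGTAA", "ATGGCCGCTAAAAAGTAA"]
usage = codon_usage(toy_host_cds)
design = most_frequent_codon_design("MAKA", usage)
print(design)
```

In this baseline both alanines in `MAKA` receive the same codon whatever their neighbors are. That is the sequence-context limitation described above.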
If you find this model useful, please cite our original PNAS publication:
- Paper: Pichia-CLM: A language model–based codon optimization pipeline for Komagataella phaffii
- Developed by: Harini Narayanan and J. Christopher Love
- Repository: GitHub
- Funded by: MIT AltHost Research Consortium, Daniel I.C. Wang (1959) Faculty Research Innovation Fund at MIT, Mazumdar-Shaw International Oncology Fellowship, Koch Institute at MIT
- Model type: Unsupervised GRU-based encoder-decoder
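An encoder-decoder of the type listed above consumes a source sequence and emits a target sequence. For codon optimization that means amino acids in and codons out. The sketch below shows one way a host coding sequence could be turned into such a source/target pair. It does not implement the GRU network, and it is not taken from the Pichia-CLM code. The special tokens, the vocabulary layout, and the `make_training_pair` helper are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Hypothetical special tokens; the published model may use a different scheme.
SPECIALS = ["<pad>", "<sos>", "<eos>"]
AA_VOCAB = {tok: i for i, tok in enumerate(SPECIALS + sorted(set(AMINO_ACIDS)))}
CODON_VOCAB = {tok: i for i, tok in enumerate(SPECIALS + sorted(CODON_TABLE))}


def make_training_pair(cds):
    """Turn one coding sequence into (amino acids, codons, source ids, target ids)."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    amino_acids = [CODON_TABLE[c] for c in codons]
    # Encoder input: amino acid tokens followed by an end marker.
    source_ids = [AA_VOCAB[aa] for aa in amino_acids] + [AA_VOCAB["<eos>"]]
    # Decoder target: start marker, codon tokens, end marker.
    target_ids = (
        [CODON_VOCAB["<sos>"]]
        + [CODON_VOCAB[c] for c in codons]
        + [CODON_VOCAB["<eos>"]]
    )
    return amino_acids, codons, source_ids, target_ids


amino_acids, codons, source_ids, target_ids = make_training_pair("ATGGCTAAGTAA")
print(amino_acids, codons)
```

The labels here come from the host's own sequences and not from manual annotation, which is one plausible reading of "unsupervised" for this model type.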