GleghornLab
/

production_ss4_model

@@ -2,6 +2,7 @@
 library_name: transformers
 tags: []
 ---
 # DSM: Diffusion Models for Protein Sequence Generation
 ### Note: This readme is shared between our GitHub and Huggingface pages.
@@ -18,7 +19,7 @@ tags: []
 ## Introduction
-DSM (Diffusion Sequence Model) is a novel Protein Language Model (pLM) developed in collaboration between the Gleghorn Lab and [Synthyra](https://synthyra.com/). It was trained with masked diffusion to enable both high-quality representation learning and generative protein design. This repository contains the code for training, evaluating, and applying DSM and its variants.
 DSM is capable of generating diverse, biomimetic sequences that align with expected amino acid compositions, secondary structures, and predicted functions. Furthermore, DSM's learned representations match or exceed those of comparably sized pLMs on various downstream tasks. DSM is detailed extensively in our [preprint](https://arxiv.org/abs/2506.08293) (which is currently in review). Beyond the base and PPI variants, we are currently training versions to jointly diffuse over sequence and foldseek tokens, as well as [Annotation Vocabulary](https://www.biorxiv.org/content/10.1101/2024.07.30.605924v1) tokens. Since the preprint release, Synthyra has trained [Synthyra/DSM_ppi_full](https://huggingface.co/Synthyra/DSM_ppi_full) which neglects the LoRA PPI training in favor for full finetuning. Additionally, the sequences SeqA and SeqB are jointly masked instead of just SeqB in the original version. We plan on adding the **many** new results to the second version of the preprint and eventual journal article.

 library_name: transformers
 tags: []
 ---
 # DSM: Diffusion Models for Protein Sequence Generation
 ### Note: This readme is shared between our GitHub and Huggingface pages.
 ## Introduction
+DSM (Diffusion Sequence Model) is a novel Protein Language Model (pLM) developed in collaboration between the [Gleghorn Lab](https://www.gleghornlab.com/) and [Synthyra](https://synthyra.com/). It was trained with masked diffusion to enable both high-quality representation learning and generative protein design. This repository contains the code for training, evaluating, and applying DSM and its variants.
 DSM is capable of generating diverse, biomimetic sequences that align with expected amino acid compositions, secondary structures, and predicted functions. Furthermore, DSM's learned representations match or exceed those of comparably sized pLMs on various downstream tasks. DSM is detailed extensively in our [preprint](https://arxiv.org/abs/2506.08293) (which is currently in review). Beyond the base and PPI variants, we are currently training versions to jointly diffuse over sequence and foldseek tokens, as well as [Annotation Vocabulary](https://www.biorxiv.org/content/10.1101/2024.07.30.605924v1) tokens. Since the preprint release, Synthyra has trained [Synthyra/DSM_ppi_full](https://huggingface.co/Synthyra/DSM_ppi_full) which neglects the LoRA PPI training in favor for full finetuning. Additionally, the sequences SeqA and SeqB are jointly masked instead of just SeqB in the original version. We plan on adding the **many** new results to the second version of the preprint and eventual journal article.