---
license: apache-2.0
pipeline_tag: feature-extraction
library_name: pytorch
base_model:
- facebook/esm2_t12_35M_UR50D
- EvolutionaryScale/esm3-sm-open-v1
tags:
- biology
- bioinformatics
- protein
- protein-embeddings
- contrastive-learning
- multimodal
- structure
- sequence
- sequence-segments
- pytorch
---

# CLSS (Contrastive Learning Sequence–Structure)

CLSS is a **self-supervised, two-tower contrastive model** that **co-embeds protein sequences and protein structures into a shared latent space**, enabling unified analysis of protein space across modalities.

**Links**
- Hugging Face model repo: https://huggingface.co/guyyanai/CLSS
- Code + examples (`clss-model`): https://github.com/guyyanai/CLSS
- Paper (bioRxiv): https://doi.org/10.1101/2025.09.05.674454
- Interactive CLSS viewer: https://gabiaxel.github.io/clss-viewer/

---

## Model description

### Architecture (high level)

CLSS follows a **two-tower architecture**:

- **Sequence tower:** a trainable ESM2-like sequence encoder
- **Structure tower:** a frozen ESM3 structure encoder
- Each tower is followed by a lightweight **linear projection head** mapping into a shared embedding space, with **L2-normalized outputs**

The result is a pair of embeddings (sequence and structure) that live in the **same latent space**, making cosine similarity directly comparable across modalities. The paper's primary configuration uses **32-dimensional embeddings**, but multiple embedding sizes are provided in this repository.

### Training objective

CLSS is trained with a **CLIP-style contrastive objective**, aligning:

- **Random sequence segments**
- with their corresponding **full-domain protein structures**

**No** hierarchical labels (e.g., ECOD or CATH) are used during training; structural and evolutionary organization emerges implicitly.
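A CLIP-style objective of this kind is typically a symmetric InfoNCE loss over a batch of paired embeddings. The sketch below illustrates the general technique on L2-normalized two-tower outputs; the temperature value, function name, and loss weighting are illustrative assumptions, not CLSS internals.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(seq_emb: torch.Tensor, struct_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    seq_emb, struct_emb: (batch, dim) projection-head outputs, where
    row i of each tensor comes from the same protein domain.
    """
    # L2-normalize so the dot product below is cosine similarity.
    seq_emb = F.normalize(seq_emb, dim=-1)
    struct_emb = F.normalize(struct_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds positive pairs.
    logits = seq_emb @ struct_emb.T / temperature

    targets = torch.arange(seq_emb.size(0))
    # Average the sequence->structure and structure->sequence terms.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

# Toy batch of 4 paired 32-dimensional embeddings.
loss = clip_style_loss(torch.randn(4, 32), torch.randn(4, 32))
```

Because both towers are normalized into the same space, the same cosine-similarity machinery used in the loss also serves at inference time to compare a sequence embedding against structure embeddings directly.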
---

## Files in this repository

This Hugging Face repository contains multiple PyTorch Lightning checkpoints, differing only in **embedding dimensionality**:

- `h8_r10.lckpt` → 8-dimensional embeddings
- `h16_r10.lckpt` → 16-dimensional embeddings
- `h32_r10.lckpt` → 32-dimensional embeddings (paper default)
- `h64_r10.lckpt` → 64-dimensional embeddings
- `h128_r10.lckpt` → 128-dimensional embeddings

---

## How to use CLSS

CLSS is intended to be used via the **`clss-model` Python library**, which provides:

- Model loading from Lightning checkpoints
- End-to-end inference examples
- Scripts used for generating interactive protein space maps

---

## License

The CLSS codebase is released under the **Apache 2.0 License**. Please consult the repository for details on third-party model dependencies.

---

## Citation

If you use CLSS, please cite:

```bibtex
@article{Yanai2025CLSS,
  title   = {Contrastive learning unites sequence and structure in a global representation of protein space},
  author  = {Yanai, Guy and Axel, Gabriel and Longo, Liam M. and Ben-Tal, Nir and Kolodny, Rachel},
  journal = {bioRxiv},
  year    = {2025},
  doi     = {10.1101/2025.09.05.674454},
  url     = {https://doi.org/10.1101/2025.09.05.674454}
}
```