---
license: apache-2.0
pipeline_tag: feature-extraction
library_name: pytorch
base_model:
- facebook/esm2_t12_35M_UR50D
- EvolutionaryScale/esm3-sm-open-v1
tags:
- biology
- bioinformatics
- protein
- protein-embeddings
- contrastive-learning
- multimodal
- structure
- sequence
- sequence-segments
- pytorch
---
|
|
|
|
|
# CLSS (Contrastive Learning Sequence–Structure)

CLSS is a **self-supervised, two-tower contrastive model** that **co-embeds protein sequences and protein structures into a shared latent space**, enabling unified analysis of protein space across modalities.

**Links**

- Hugging Face model repo: https://huggingface.co/guyyanai/CLSS
- Code + examples (`clss-model`): https://github.com/guyyanai/CLSS
- Paper (bioRxiv): https://doi.org/10.1101/2025.09.05.674454
- Interactive CLSS viewer: https://gabiaxel.github.io/clss-viewer/

---
|
|
|
|
|
## Model description

### Architecture (high level)

CLSS follows a **two-tower architecture**:

- **Sequence tower:** a trainable ESM2-like sequence encoder
- **Structure tower:** a frozen ESM3 structure encoder
- Each tower is followed by a lightweight **linear projection head** mapping into a shared embedding space, with **L2-normalized outputs**

The result is a pair of embeddings (sequence and structure) that live in the **same latent space**, making cosine similarity directly comparable across modalities.
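Because both projection heads emit unit-length vectors, cosine similarity reduces to a plain dot product, whichever modalities the two embeddings came from. A minimal sketch (random tensors stand in for real CLSS outputs; the 32-dim size matches the paper's default):

```python
import torch
import torch.nn.functional as F

# Stand-ins for the two towers' projection outputs; real embeddings would
# come from the sequence and structure encoders. 32-dim matches the
# paper's default configuration.
seq_emb = F.normalize(torch.randn(8, 32), dim=-1)     # L2-normalized sequence embeddings
struct_emb = F.normalize(torch.randn(8, 32), dim=-1)  # L2-normalized structure embeddings

# On unit vectors, cosine similarity is just a dot product, so the same
# score is comparable for seq-seq, struct-struct, and cross-modal pairs.
cross_modal_sim = seq_emb @ struct_emb.T  # (8, 8) matrix of cosine similarities
print(cross_modal_sim.min().item(), cross_modal_sim.max().item())  # all values in [-1, 1]
```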
|
|
|
|
|
The paper’s primary configuration uses **32-dimensional embeddings**, but multiple embedding sizes are provided in this repository.

### Training objective

CLSS is trained with a **CLIP-style contrastive objective**, aligning:

- **Random sequence segments**
- with their corresponding **full-domain protein structures**

**No** hierarchical labels (e.g. ECOD or CATH) are used during training; structural and evolutionary organization emerges implicitly.
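A CLIP-style objective treats matched (segment, structure) pairs in a batch as positives and all other pairings as negatives. The sketch below is a generic symmetric InfoNCE formulation, not CLSS's exact training code; the temperature value is illustrative:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(seq_emb, struct_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    seq_emb, struct_emb: (batch, dim) L2-normalized embeddings where row i
    of each tensor comes from the same protein (the positive pair).
    """
    logits = seq_emb @ struct_emb.T / temperature     # (batch, batch) similarity matrix
    targets = torch.arange(seq_emb.size(0))           # diagonal entries are the positives
    loss_seq = F.cross_entropy(logits, targets)       # sequence -> structure direction
    loss_struct = F.cross_entropy(logits.T, targets)  # structure -> sequence direction
    return (loss_seq + loss_struct) / 2

batch = F.normalize(torch.randn(4, 32), dim=-1)
loss = clip_contrastive_loss(batch, batch)
print(loss.item())
```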
|
|
|
|
|
---
|
|
|
|
|
## Files in this repository

This Hugging Face repository contains multiple PyTorch Lightning checkpoints, differing only in **embedding dimensionality**:

- `h8_r10.lckpt` → 8-dimensional embeddings
- `h16_r10.lckpt` → 16-dimensional embeddings
- `h32_r10.lckpt` → 32-dimensional embeddings (paper default)
- `h64_r10.lckpt` → 64-dimensional embeddings
- `h128_r10.lckpt` → 128-dimensional embeddings
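Since these are standard PyTorch Lightning checkpoints, they can be inspected with plain `torch.load`. The sketch below builds a tiny stand-in dict mimicking the Lightning layout; the `projection.weight` key and the 480-dim ESM2 hidden size are illustrative assumptions, not verified contents of the actual checkpoints:

```python
import torch

# In practice the file would come from the Hub, e.g.:
#   from huggingface_hub import hf_hub_download
#   ckpt_path = hf_hub_download("guyyanai/CLSS", "h32_r10.lckpt")
# Here we save a tiny stand-in with the same top-level layout instead.
ckpt_path = "h32_r10_standin.lckpt"
torch.save({"state_dict": {"projection.weight": torch.zeros(32, 480)}}, ckpt_path)

# Lightning checkpoints are torch-serialized dicts; weights sit under "state_dict".
ckpt = torch.load(ckpt_path, map_location="cpu")
for name, tensor in ckpt["state_dict"].items():
    print(name, tuple(tensor.shape))
```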
|
|
|
|
|
---

## How to use CLSS

CLSS is intended to be used via the **`clss-model` Python library**, which provides:

- Model loading from Lightning checkpoints
- End-to-end inference examples
- Scripts used for generating interactive protein space maps

---
|
|
|
|
|
## License

The CLSS codebase is released under the **Apache 2.0 License**. Please consult the repository for details on third-party model dependencies.

---
|
|
|
|
|
## Citation

If you use CLSS, please cite:

```bibtex
@article{Yanai2025CLSS,
  title   = {Contrastive learning unites sequence and structure in a global representation of protein space},
  author  = {Yanai, Guy and Axel, Gabriel and Longo, Liam M. and Ben-Tal, Nir and Kolodny, Rachel},
  journal = {bioRxiv},
  year    = {2025},
  doi     = {10.1101/2025.09.05.674454},
  url     = {https://doi.org/10.1101/2025.09.05.674454}
}
```