---
license: apache-2.0
pipeline_tag: feature-extraction
library_name: pytorch
base_model:
- facebook/esm2_t12_35M_UR50D
- EvolutionaryScale/esm3-sm-open-v1
tags:
- biology
- bioinformatics
- protein
- protein-embeddings
- contrastive-learning
- multimodal
- structure
- sequence
- sequence-segments
- pytorch
---
# CLSS (Contrastive Learning Sequence–Structure)
CLSS is a **self-supervised, two-tower contrastive model** that **co-embeds protein sequences and protein structures into a shared latent space**, enabling unified analysis of protein space across modalities.
**Links**
- Hugging Face model repo: https://huggingface.co/guyyanai/CLSS
- Code + examples (`clss-model`): https://github.com/guyyanai/CLSS
- Paper (bioRxiv): https://doi.org/10.1101/2025.09.05.674454
- Interactive CLSS viewer: https://gabiaxel.github.io/clss-viewer/
---
## Model description
### Architecture (high level)
CLSS follows a **two-tower architecture**:
- **Sequence tower:** a trainable ESM2-like sequence encoder
- **Structure tower:** a frozen ESM3 structure encoder
- Each tower is followed by a lightweight **linear projection head** mapping into a shared embedding space, with **L2-normalized outputs**
The result is a pair of embeddings (sequence and structure) that live in the **same latent space**, making cosine similarity directly comparable across modalities.
The paper’s primary configuration uses **32-dimensional embeddings**, but multiple embedding sizes are provided in this repository.
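Because both towers emit L2-normalized vectors in the same space, cross-modal similarity reduces to a dot product. The sketch below illustrates this with random stand-in embeddings (the array shapes and the `l2_normalize` helper are illustrative, not part of the `clss-model` API):

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Project embeddings onto the unit hypersphere (CLSS outputs are L2-normalized)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# Stand-ins for the projection-head outputs of the two towers
# (32-dimensional, the paper's default embedding size).
seq_emb = l2_normalize(rng.normal(size=(4, 32)))     # sequence-tower embeddings
struct_emb = l2_normalize(rng.normal(size=(4, 32)))  # structure-tower embeddings

# On the unit sphere, the dot product equals cosine similarity, so
# sequence and structure embeddings are directly comparable.
sim = seq_emb @ struct_emb.T  # (4, 4) sequence-vs-structure similarity matrix
```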
### Training objective
CLSS is trained with a **CLIP-style contrastive objective**, aligning:
- **Random sequence segments**
- With their corresponding **full-domain protein structures**
**No** hierarchical labels (e.g. ECOD or CATH) are used during training; structural and evolutionary organization emerges implicitly.
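For intuition, a CLIP-style objective treats each (sequence-segment, structure) pair in a batch as the positive and all other pairings as negatives, applying a symmetric cross-entropy over the similarity matrix. Below is a minimal numpy sketch of that loss family; the temperature value and function signature are illustrative assumptions, not the configuration used to train CLSS:

```python
import numpy as np

def clip_style_loss(seq_emb: np.ndarray, struct_emb: np.ndarray,
                    temperature: float = 0.07) -> float:
    """Symmetric contrastive (InfoNCE) loss over a batch of paired,
    L2-normalized embeddings. Matching pairs share a row index;
    the temperature here is illustrative."""
    logits = seq_emb @ struct_emb.T / temperature  # (B, B) similarity logits
    labels = np.arange(len(logits))                # diagonal entries are positives

    def cross_entropy(l: np.ndarray, y: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)       # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-log_probs[np.arange(len(y)), y].mean())

    # Average the sequence->structure and structure->sequence directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

# Perfectly aligned pairs yield a near-zero loss; misaligned pairs do not.
aligned_loss = clip_style_loss(np.eye(4), np.eye(4))
misaligned_loss = clip_style_loss(np.eye(4), np.roll(np.eye(4), 1, axis=0))
```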
---
## Files in this repository
This Hugging Face repository contains multiple PyTorch Lightning checkpoints, differing only in **embedding dimensionality**:
- `h8_r10.lckpt` → 8-dimensional embeddings
- `h16_r10.lckpt` → 16-dimensional embeddings
- `h32_r10.lckpt` → 32-dimensional embeddings (paper default)
- `h64_r10.lckpt` → 64-dimensional embeddings
- `h128_r10.lckpt` → 128-dimensional embeddings
---
## How to use CLSS
CLSS is intended to be used via the **`clss-model` Python library**, which provides:
- Model loading from Lightning checkpoints
- End-to-end inference examples
- Scripts used for generating interactive protein space maps
---
## License
The CLSS codebase is released under the **Apache 2.0 License**.
Please consult the repository for details on third-party model dependencies.
---
## Citation
If you use CLSS, please cite:
```bibtex
@article{Yanai2025CLSS,
  title   = {Contrastive learning unites sequence and structure in a global representation of protein space},
  author  = {Yanai, Guy and Axel, Gabriel and Longo, Liam M. and Ben-Tal, Nir and Kolodny, Rachel},
  journal = {bioRxiv},
  year    = {2025},
  doi     = {10.1101/2025.09.05.674454},
  url     = {https://doi.org/10.1101/2025.09.05.674454}
}
```