---
license: apache-2.0
pipeline_tag: feature-extraction
library_name: pytorch
base_model:
  - facebook/esm2_t12_35M_UR50D
  - EvolutionaryScale/esm3-sm-open-v1
tags:
  - biology
  - bioinformatics
  - protein
  - protein-embeddings
  - contrastive-learning
  - multimodal
  - structure
  - sequence
  - sequence-segments
  - pytorch
---

# CLSS (Contrastive Learning Sequence–Structure)

CLSS is a **self-supervised, two-tower contrastive model** that **co-embeds protein sequences and protein structures into a shared latent space**, enabling unified analysis of protein space across modalities.

**Links**
- Hugging Face model repo: https://huggingface.co/guyyanai/CLSS
- Code + examples (`clss-model`): https://github.com/guyyanai/CLSS
- Paper (bioRxiv): https://doi.org/10.1101/2025.09.05.674454
- Interactive CLSS viewer: https://gabiaxel.github.io/clss-viewer/

---

## Model description

### Architecture (high level)

CLSS follows a **two-tower architecture**:

- **Sequence tower:** a trainable ESM2-like sequence encoder
- **Structure tower:** a frozen ESM3 structure encoder
- Each tower is followed by a lightweight **linear projection head** mapping into a shared embedding space, with **L2-normalized outputs**

The result is a pair of embeddings (sequence and structure) that live in the **same latent space**, making cosine similarity directly comparable across modalities.

The paper’s primary configuration uses **32-dimensional embeddings**, but multiple embedding sizes are provided in this repository.
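Because both towers emit L2-normalized vectors in the same space, cross-modal comparison reduces to a dot product. The snippet below is a minimal illustration with random stand-in tensors, not the CLSS API; in practice the embeddings come from the two towers:

```python
import torch
import torch.nn.functional as F

dim = 32  # the paper's primary embedding size

# Stand-in embeddings; real ones are produced by the sequence and structure towers.
seq_emb = F.normalize(torch.randn(dim), dim=-1)     # sequence-tower output
struct_emb = F.normalize(torch.randn(dim), dim=-1)  # structure-tower output

# With unit-norm vectors, the dot product *is* the cosine similarity.
cosine_sim = torch.dot(seq_emb, struct_emb)
print(f"sequence-structure cosine similarity: {cosine_sim.item():.3f}")
```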

### Training objective

CLSS is trained with a **CLIP-style contrastive objective**, aligning:
- **Random sequence segments**
- With their corresponding **full-domain protein structures**

**No** hierarchical labels (e.g. ECOD or CATH) are used during training; structural and evolutionary organization emerges implicitly.
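For reference, a generic CLIP-style symmetric contrastive loss looks like the sketch below. The temperature, batch size, and function name are illustrative rather than the paper's settings; the actual training code lives in the `clss-model` repository.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(seq_emb: torch.Tensor, struct_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired (segment, structure) embeddings."""
    seq_emb = F.normalize(seq_emb, dim=-1)
    struct_emb = F.normalize(struct_emb, dim=-1)

    # Cosine-similarity logits between every sequence segment and every structure.
    logits = seq_emb @ struct_emb.T / temperature
    targets = torch.arange(seq_emb.size(0))  # i-th segment pairs with i-th structure

    loss_seq = F.cross_entropy(logits, targets)       # segment -> structure
    loss_struct = F.cross_entropy(logits.T, targets)  # structure -> segment
    return (loss_seq + loss_struct) / 2

# Toy batch of random stand-in embeddings.
loss = clip_style_loss(torch.randn(8, 32), torch.randn(8, 32))
print(loss.item())
```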

---

## Files in this repository

This Hugging Face repository contains multiple PyTorch Lightning checkpoints, differing only in **embedding dimensionality**:

- `h8_r10.lckpt`   → 8-dimensional embeddings  
- `h16_r10.lckpt`  → 16-dimensional embeddings  
- `h32_r10.lckpt`  → 32-dimensional embeddings (paper default)  
- `h64_r10.lckpt`  → 64-dimensional embeddings  
- `h128_r10.lckpt` → 128-dimensional embeddings  

---

## How to use CLSS

CLSS is intended to be used via the **`clss-model` Python library**, which provides:

- Model loading from Lightning checkpoints
- End-to-end inference examples
- Scripts used for generating interactive protein space maps
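For actual inference, follow the examples in the `clss-model` repository. As a library-agnostic sketch, the checkpoints can be fetched and sanity-checked with `huggingface_hub` and plain PyTorch; the keys printed below are the usual PyTorch Lightning ones and may differ in practice.

```python
import torch
from huggingface_hub import hf_hub_download

# Download the paper-default 32-dimensional checkpoint from this repo.
ckpt_path = hf_hub_download(repo_id="guyyanai/CLSS", filename="h32_r10.lckpt")

# Lightning checkpoints are ordinary torch pickles; peeking at the contents
# is a quick sanity check before loading the model through clss-model.
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
print(sorted(ckpt.keys()))      # typically includes 'state_dict', 'hyper_parameters', ...
print(len(ckpt["state_dict"]))  # number of parameter tensors
```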

---

## License

The CLSS codebase is released under the **Apache 2.0 License**.  
Please consult the repository for details on third-party model dependencies.

---

## Citation

If you use CLSS, please cite:

```bibtex
@article{Yanai2025CLSS,
  title   = {Contrastive learning unites sequence and structure in a global representation of protein space},
  author  = {Yanai, Guy and Axel, Gabriel and Longo, Liam M. and Ben-Tal, Nir and Kolodny, Rachel},
  journal = {bioRxiv},
  year    = {2025},
  doi     = {10.1101/2025.09.05.674454},
  url     = {https://doi.org/10.1101/2025.09.05.674454}
}
```