---
license: apache-2.0
pipeline_tag: feature-extraction
library_name: pytorch
base_model:
- facebook/esm2_t12_35M_UR50D
- EvolutionaryScale/esm3-sm-open-v1
tags:
- biology
- bioinformatics
- protein
- protein-embeddings
- contrastive-learning
- multimodal
- structure
- sequence
- sequence-segments
- pytorch
---
# CLSS (Contrastive Learning Sequence–Structure)
CLSS is a **self-supervised, two-tower contrastive model** that **co-embeds protein sequences and protein structures into a shared latent space**, enabling unified analysis of protein space across modalities.
**Links**
- Hugging Face model repo: https://huggingface.co/guyyanai/CLSS
- Code + examples (`clss-model`): https://github.com/guyyanai/CLSS
- Paper (bioRxiv): https://doi.org/10.1101/2025.09.05.674454
- Interactive CLSS viewer: https://gabiaxel.github.io/clss-viewer/
---
## Model description
### Architecture (high level)
CLSS follows a **two-tower architecture**:
- **Sequence tower:** a trainable ESM2-like sequence encoder
- **Structure tower:** a frozen ESM3 structure encoder
- Each tower is followed by a lightweight **linear projection head** mapping into a shared embedding space, with **L2-normalized outputs**
The result is a pair of embeddings (sequence and structure) that live in the **same latent space**, making cosine similarity directly comparable across modalities.
The paper’s primary configuration uses **32-dimensional embeddings**, but multiple embedding sizes are provided in this repository.
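Because both towers emit L2-normalized vectors in the same space, cross-modal similarity reduces to a dot product. The sketch below illustrates this with random stand-in embeddings (the array shapes and the `l2_normalize` helper are illustrative, not part of the `clss-model` API):

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Project embeddings onto the unit hypersphere (CLSS outputs are L2-normalized)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# Stand-ins for the projection-head outputs of the two towers
# (32-dimensional, the paper's default embedding size).
seq_emb = l2_normalize(rng.normal(size=(4, 32)))     # sequence-tower embeddings
struct_emb = l2_normalize(rng.normal(size=(4, 32)))  # structure-tower embeddings

# On the unit sphere, the dot product equals cosine similarity, so
# sequence and structure embeddings are directly comparable.
sim = seq_emb @ struct_emb.T  # (4, 4) sequence-vs-structure similarity matrix
```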
### Training objective
CLSS is trained with a **CLIP-style contrastive objective**, aligning:
- **Random sequence segments**
- With their corresponding **full-domain protein structures**
**No** hierarchical labels (e.g. ECOD or CATH) are used during training; structural and evolutionary organization emerges implicitly.
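For intuition, a CLIP-style objective treats each (sequence-segment, structure) pair in a batch as the positive and all other pairings as negatives, applying a symmetric cross-entropy over the similarity matrix. Below is a minimal numpy sketch of that loss family; the temperature value and function signature are illustrative assumptions, not the configuration used to train CLSS:

```python
import numpy as np

def clip_style_loss(seq_emb: np.ndarray, struct_emb: np.ndarray,
                    temperature: float = 0.07) -> float:
    """Symmetric contrastive (InfoNCE) loss over a batch of paired,
    L2-normalized embeddings. Matching pairs share a row index;
    the temperature here is illustrative."""
    logits = seq_emb @ struct_emb.T / temperature  # (B, B) similarity logits
    labels = np.arange(len(logits))                # diagonal entries are positives

    def cross_entropy(l: np.ndarray, y: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)       # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-log_probs[np.arange(len(y)), y].mean())

    # Average the sequence->structure and structure->sequence directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

# Perfectly aligned pairs yield a near-zero loss; misaligned pairs do not.
aligned_loss = clip_style_loss(np.eye(4), np.eye(4))
misaligned_loss = clip_style_loss(np.eye(4), np.roll(np.eye(4), 1, axis=0))
```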
---
## Files in this repository
This Hugging Face repository contains multiple PyTorch Lightning checkpoints, differing only in **embedding dimensionality**:
- `h8_r10.lckpt` → 8-dimensional embeddings
- `h16_r10.lckpt` → 16-dimensional embeddings
- `h32_r10.lckpt` → 32-dimensional embeddings (paper default)
- `h64_r10.lckpt` → 64-dimensional embeddings
- `h128_r10.lckpt` → 128-dimensional embeddings
---
## How to use CLSS
CLSS is intended to be used via the **`clss-model` Python library**, which provides:
- Model loading from Lightning checkpoints
- End-to-end inference examples
- Scripts used for generating interactive protein space maps
---
## License
The CLSS codebase is released under the **Apache 2.0 License**.
Please consult the repository for details on third-party model dependencies.
---
## Citation
If you use CLSS, please cite:
```bibtex
@article{Yanai2025CLSS,
  title   = {Contrastive learning unites sequence and structure in a global representation of protein space},
  author  = {Yanai, Guy and Axel, Gabriel and Longo, Liam M. and Ben-Tal, Nir and Kolodny, Rachel},
  journal = {bioRxiv},
  year    = {2025},
  doi     = {10.1101/2025.09.05.674454},
  url     = {https://doi.org/10.1101/2025.09.05.674454}
}
```