tags:
- esmc
---

# ProtHash

A protein language model that outputs amino acid sequence embeddings for use in clustering, classification, locality-sensitive hashing, and more. Distilled from the [ESMC](https://www.evolutionaryscale.ai/blog/esm-cambrian) family of models, which have a deep comprehension of protein structure, ProtHash produces contextual embeddings that align in vector space according to the sequences' atomic structure. Trained on the [SwissProt](https://huggingface.co/datasets/andrewdalpino/SwissProt-Gene-Ontology) dataset to mimic the activations of its ESMC teacher model, ProtHash produces embeddings with near-perfect similarity to ESMC embeddings at a greatly reduced computational cost.
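The locality-sensitive hashing use case mentioned above can be illustrated with a small sketch. This is not part of the `prothash` API; it uses random vectors as stand-ins for ProtHash embeddings and the classic random-hyperplane scheme, in which vectors that are close in cosine similarity tend to share signature bits:

```python
import numpy as np

rng = np.random.default_rng(42)


def lsh_signature(embedding: np.ndarray, planes: np.ndarray) -> str:
    # One bit per hyperplane: which side of the plane the vector falls on.
    bits = (embedding @ planes.T) >= 0
    return "".join("1" if b else "0" for b in bits)


# 512 dimensions matches ProtHash-512-Tiny; 16 hash bits is arbitrary.
dim, n_bits = 512, 16
planes = rng.standard_normal((n_bits, dim))

a = rng.standard_normal(dim)             # stand-in embedding
b = a + 0.01 * rng.standard_normal(dim)  # near-duplicate of a
c = rng.standard_normal(dim)             # unrelated vector

sig_a, sig_b, sig_c = (lsh_signature(v, planes) for v in (a, b, c))

hamming = lambda s, t: sum(x != y for x, y in zip(s, t))
print(hamming(sig_a, sig_b), hamming(sig_a, sig_c))
```

The near-duplicate's signature differs from `a`'s in far fewer bits than the unrelated vector's, so bucketing sequences by signature groups structurally similar proteins together without pairwise comparisons.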

| Name | Context Length | Embedding Dimensionality | Attention Heads (Q/KV) | Encoder Layers | Total Params |
|---|---|---|---|---|---|
| [andrewdalpino/ProtHash-384-Tiny](https://huggingface.co/andrewdalpino/ProtHash-384-Tiny) | 2048 | 384 | 16/4 | 4 | 7M |
| [andrewdalpino/ProtHash-512-Tiny](https://huggingface.co/andrewdalpino/ProtHash-512-Tiny) | 2048 | 512 | 16/4 | 4 | 12M |

## Pretrained Example

First, you'll need the `prothash` and `esm` packages installed in your environment. We recommend using a virtual environment, such as Python's `venv` module, to prevent version conflicts with system packages.
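A minimal setup sketch, assuming both packages are published on PyPI under these names:

```shell
# Create and activate an isolated environment, then install the packages.
python3 -m venv .venv
source .venv/bin/activate
pip install prothash esm
```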

>- T. Hayes, et al. Simulating 500 million years of evolution with a language model, 2024.
>- B. Zhang, et al. Root Mean Square Layer Normalization. 33rd Conference on Neural Information Processing Systems, NeurIPS 2019.
>- J. Ainslie, et al. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, Google Research, 2023.
>- T. Kim, et al. Comparing Kullback-Leibler Divergence and Mean Squared Error Loss in Knowledge Distillation, 2021.