# PoCo
PoCo is a feature extractor for polymer structures.
It takes polymer SMILES strings as input and returns 512-dimensional vectors, which can be used as polymer representations for downstream tasks such as property prediction.
## Resources
- Paper: [Contrastive representation learning for polymer informatics](https://doi.org/10.26434/chemrxiv.15003645/v1)
- Code: [GitHub repository](https://github.com/crema-lida/PoCo)
## Prerequisites
Install `sentence-transformers` (recommended). To use the model through
`transformers` directly instead, install `transformers`. The command below
installs both, along with `torch`:
```bash
pip install -U sentence-transformers transformers torch
```
## Usage
### Sentence Transformers (Recommended)
The easiest way to use PoCo is through `SentenceTransformer`. This interface
handles tokenization, padding, batching, pooling, device placement, and
conversion to NumPy arrays.
```python
from sentence_transformers import SentenceTransformer
model_id = "CremaX/PoCo"
model = SentenceTransformer(model_id)
polymer_smiles = [
    "[*]CC[*]",
    "[*]CC(c1ccccc1)[*]",
]
embeddings = model.encode(
    polymer_smiles,
    batch_size=64,
    convert_to_numpy=True,
    show_progress_bar=True,
)
print(embeddings.shape)
# (2, 512)
```
For a single polymer SMILES string:
```python
embedding = model.encode("[*]CC[*]", convert_to_numpy=True)
print(embedding.shape)
# (512,)
```
By default, embeddings are returned as raw feature vectors. If you plan to use
cosine similarity directly, you may normalize them:
```python
embeddings = model.encode(polymer_smiles, normalize_embeddings=True)
```
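As a minimal sketch of what normalization buys you: once every row has unit L2 norm, cosine similarity reduces to a plain dot product. The example uses random vectors as a stand-in for PoCo embeddings so that it runs without downloading the model. `l2_normalize` and `cosine_similarity_matrix` are illustrative helper names, not part of `sentence-transformers`.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit L2 norm (eps guards against all-zero rows)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)


def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows, as dot products of unit vectors."""
    unit = l2_normalize(vectors)
    return unit @ unit.T


# Stand-in for model.encode(...) output: 3 hypothetical polymers, 512 dimensions each.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(3, 512))

similarity = cosine_similarity_matrix(fake_embeddings)
print(similarity.shape)
# (3, 3)
```

With real embeddings, entry `[i, j]` of the matrix is the cosine similarity between polymer `i` and polymer `j`, and the diagonal is 1.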
For downstream machine learning models, raw embeddings are often a good default:
```python
from sklearn.ensemble import RandomForestRegressor
from sentence_transformers import SentenceTransformer
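# train_smiles and test_smiles are your own lists of polymer SMILES strings;
# y_train holds the measured property values for train_smiles.
model = SentenceTransformer("CremaX/PoCo")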
X_train = model.encode(train_smiles, convert_to_numpy=True)
X_test = model.encode(test_smiles, convert_to_numpy=True)
regressor = RandomForestRegressor(random_state=0)
regressor.fit(X_train, y_train)
predictions = regressor.predict(X_test)
```
### Hugging Face Transformers
You can also use the model directly with `transformers`. This is useful when
you need full control over tokenization, tensors, devices, or pooling.
`AutoModel` returns token-level hidden states with shape
`(batch_size, sequence_length, hidden_size)`. To get one 512-dimensional vector
per polymer, apply attention-mask-aware mean pooling over the token dimension.
```python
import torch
from transformers import AutoModel, AutoTokenizer
model_id = "CremaX/PoCo"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).to(device)
model.eval()
polymer_smiles = [
    "[*]CC[*]",
    "[*]CC(c1ccccc1)[*]",
]
encoded = tokenizer(
    polymer_smiles,
    padding=True,
    truncation=True,
    return_tensors="pt",
)
encoded = {key: value.to(device) for key, value in encoded.items()}
with torch.no_grad():
    outputs = model(**encoded)
token_embeddings = outputs.last_hidden_state
attention_mask = encoded["attention_mask"].unsqueeze(-1).float()
# mean pooling
embeddings = (token_embeddings * attention_mask).sum(dim=1)
embeddings = embeddings / attention_mask.sum(dim=1).clamp(min=1e-9)
embeddings = embeddings.cpu().numpy()
print(embeddings.shape)
# (2, 512)
```
The Hugging Face pipeline returns token-level features.
For polymer-level embeddings, prefer the `SentenceTransformer` example above or
apply the mean pooling step shown in this section.
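
To illustrate why the pooling step should respect the attention mask, here is a small NumPy sketch on toy data. It mirrors the pooling arithmetic used in this section but is not the model's own code: `masked_mean_pool` is a hypothetical helper, and the numbers are made up so the effect of a padded position is easy to see.

```python
import numpy as np


def masked_mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors over the sequence axis, ignoring padded positions.

    token_embeddings: (batch_size, sequence_length, hidden_size)
    attention_mask:   (batch_size, sequence_length), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


# Toy batch: 2 sequences, 3 token positions, hidden size 2.
# The last position of the second sequence is padding filled with junk values.
tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[1.0, 1.0], [3.0, 3.0], [99.0, 99.0]],
])
mask = np.array([
    [1, 1, 1],
    [1, 1, 0],
])

print(masked_mean_pool(tokens, mask))  # rows: [3, 4] and [2, 2]
print(tokens.mean(axis=1))             # naive mean: second row is pulled toward 99
```

A plain `mean` over the token axis would let padding leak into the embedding, so the same polymer could get different vectors depending on what else is in the batch.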
## Input Notes
- Polymer SMILES **must** use `[*]` to mark repeat-unit endpoints, not bare `*`.
- The model does **not** validate whether a string is a chemically valid SMILES
string. We recommend canonicalizing polymer SMILES with the [`psmiles`](https://psmiles.readthedocs.io/) library before passing them to the model.
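
If your data marks endpoints with a bare `*`, a purely textual rewrite along the following lines may help before encoding. This is a minimal sketch: `bracket_bare_wildcards` is a hypothetical helper, not part of PoCo or `psmiles`, and it performs no chemical validation or canonicalization.

```python
def bracket_bare_wildcards(smiles: str) -> str:
    """Rewrite every bare `*` as `[*]`, leaving bracketed atoms such as `[*]` untouched."""
    out = []
    inside_bracket = False
    for char in smiles:
        if char == "[":
            inside_bracket = True
        elif char == "]":
            inside_bracket = False
        if char == "*" and not inside_bracket:
            out.append("[*]")  # bare wildcard: wrap it in brackets
        else:
            out.append(char)   # everything else is copied through unchanged
    return "".join(out)


print(bracket_bare_wildcards("*CC*"))
# [*]CC[*]
print(bracket_bare_wildcards("[*]CC(c1ccccc1)[*]"))
# [*]CC(c1ccccc1)[*]
```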
## Citation
If you use PoCo, please cite:
```bibtex
@article{wang2026poco,
title = {Contrastive representation learning for polymer informatics},
author = {Wang, Lida and Long, Donghui},
journal = {ChemRxiv},
year = {2026},
doi = {10.26434/chemrxiv.15003645/v1}
}
```