# PoCo

PoCo is a feature extractor for polymer structures.

It takes polymer SMILES strings as input and returns 512-dimensional vectors, which can be used as polymer representations for downstream tasks such as property prediction.

## Resources

- Paper: [Contrastive representation learning for polymer informatics](https://doi.org/10.26434/chemrxiv.15003645/v1)
- Code: [GitHub repository](https://github.com/crema-lida/PoCo)

## Prerequisites

Install `sentence-transformers` (recommended). To work with the model directly
through Hugging Face `transformers`, you also need `transformers` and `torch`.
The command below installs all three:

```bash
pip install -U sentence-transformers transformers torch
```

## Usage

### Sentence Transformers (Recommended)

The easiest way to use PoCo is through `SentenceTransformer`. This interface
handles tokenization, padding, batching, pooling, device placement, and
conversion to NumPy arrays.

```python
from sentence_transformers import SentenceTransformer

model_id = "CremaX/PoCo"
model = SentenceTransformer(model_id)

polymer_smiles = [
    "[*]CC[*]",
    "[*]CC(c1ccccc1)[*]",
]

embeddings = model.encode(
    polymer_smiles,
    batch_size=64,
    convert_to_numpy=True,
    show_progress_bar=True,
)

print(embeddings.shape)
# (2, 512)
```

For a single polymer SMILES string:

```python
embedding = model.encode("[*]CC[*]", convert_to_numpy=True)

print(embedding.shape)
# (512,)
```

By default, embeddings are returned as raw (unnormalized) feature vectors. If
you plan to compare them with cosine similarity, you can request unit-length
vectors instead:

```python
embeddings = model.encode(polymer_smiles, normalize_embeddings=True)
```
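With unit-length vectors, cosine similarity reduces to a plain dot product. The following is a minimal sketch of that relationship, using small stand-in arrays in place of real PoCo output so it runs without downloading the model. The helper name `cosine_similarity_matrix` is hypothetical and not part of PoCo or `sentence-transformers`.

```python
import numpy as np

# Stand-in vectors used in place of real PoCo output (illustrative values only).
embeddings = np.array([
    [3.0, 4.0, 0.0],
    [4.0, 3.0, 0.0],
    [0.0, 0.0, 2.0],
])


def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Return pairwise cosine similarities between the rows of `vectors`."""
    # Scale every row to unit length; the clip guards against zero vectors.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    # For unit-length rows, the dot product equals the cosine similarity.
    return unit @ unit.T


similarity = cosine_similarity_matrix(embeddings)
print(similarity.shape)
```

If the embeddings were produced with `normalize_embeddings=True`, the scaling step is already done and `embeddings @ embeddings.T` should give the same matrix.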

For downstream machine learning models, raw embeddings are often a good default:

```python
from sklearn.ensemble import RandomForestRegressor
from sentence_transformers import SentenceTransformer

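model = SentenceTransformer("CremaX/PoCo")

# train_smiles / test_smiles: lists of polymer SMILES strings that you supply.
# y_train: the target property values matching train_smiles.
X_train = model.encode(train_smiles, convert_to_numpy=True)
X_test = model.encode(test_smiles, convert_to_numpy=True)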

regressor = RandomForestRegressor(random_state=0)
regressor.fit(X_train, y_train)
predictions = regressor.predict(X_test)
```

### Hugging Face Transformers

You can also use the model directly with `transformers`. This is useful when
you need full control over tokenization, tensors, devices, or pooling.

`AutoModel` returns token-level hidden states with shape
`(batch_size, sequence_length, hidden_size)`. To get one 512-dimensional vector
per polymer, apply attention-mask-aware mean pooling over the token dimension.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "CremaX/PoCo"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).to(device)
model.eval()

polymer_smiles = [
    "[*]CC[*]",
    "[*]CC(c1ccccc1)[*]",
]

encoded = tokenizer(
    polymer_smiles,
    padding=True,
    truncation=True,
    return_tensors="pt",
)
encoded = {key: value.to(device) for key, value in encoded.items()}

with torch.no_grad():
    outputs = model(**encoded)

token_embeddings = outputs.last_hidden_state
attention_mask = encoded["attention_mask"].unsqueeze(-1).float()

# Mean pooling: average the token embeddings, ignoring padded positions.
embeddings = (token_embeddings * attention_mask).sum(dim=1)
embeddings = embeddings / attention_mask.sum(dim=1).clamp(min=1e-9)
embeddings = embeddings.cpu().numpy()

print(embeddings.shape)
# (2, 512)
```

Used directly, the Hugging Face model returns token-level features rather than
one vector per polymer. For polymer-level embeddings, prefer the
`SentenceTransformer` example above or apply the mean pooling step shown in
this section.
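To see why the pooling has to respect the attention mask, here is a toy sketch in NumPy with hand-written numbers instead of real model output. The helper name `masked_mean_pool` and all values are illustrative assumptions; the arithmetic mirrors the pooling described in this section.

```python
import numpy as np

# Toy token-level features: 1 sequence, 3 token positions, hidden size 2.
# The last position is padding and carries arbitrary values.
token_embeddings = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = np.array([[1, 1, 0]])


def masked_mean_pool(tokens: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average over the token axis, counting only positions where mask == 1."""
    mask = mask[..., None].astype(tokens.dtype)  # (batch, seq, 1)
    summed = (tokens * mask).sum(axis=1)  # padded positions contribute 0
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


pooled = masked_mean_pool(token_embeddings, attention_mask)
naive = token_embeddings.mean(axis=1)  # wrongly includes the padding position

print(pooled)  # mean of the two real tokens only
print(naive)   # skewed by the padding values
```

A plain mean over the token axis mixes padding into the result, so shorter polymers in a padded batch would get distorted embeddings.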

## Input Notes

- Polymer SMILES **must** use `[*]` to mark repeat-unit endpoints, not bare `*`.
- The model does **not** validate whether a string is a chemically valid SMILES
  string. We recommend canonicalizing polymer SMILES with the [`psmiles`](https://psmiles.readthedocs.io/) library before passing them to the model.
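If your data marks endpoints with bare `*`, a text-level rewrite to `[*]` can be sketched as follows. This is only an illustration using the standard library: `bracket_endpoints` is a hypothetical helper, not part of PoCo or `psmiles`, and it does no chemical validation or canonicalization.

```python
import re

# Match a '*' that is not already written as '[*' or '*]'.
_BARE_STAR = re.compile(r"(?<!\[)\*(?!\])")


def bracket_endpoints(smiles: str) -> str:
    """Rewrite bare '*' endpoint markers as '[*]'; leave bracketed ones alone."""
    return _BARE_STAR.sub("[*]", smiles)


print(bracket_endpoints("*CC*"))
print(bracket_endpoints("[*]CC(c1ccccc1)[*]"))
```

Strings that already use `[*]` pass through unchanged, so the helper can be applied to a mixed dataset before canonicalization.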

## Citation

If you use PoCo, please cite:

```bibtex
@article{wang2026poco,
  title = {Contrastive representation learning for polymer informatics},
  author = {Wang, Lida and Long, Donghui},
  journal = {ChemRxiv},
  year = {2026},
  doi = {10.26434/chemrxiv.15003645/v1}
}
```