# PoCo
PoCo is a feature extractor for polymer structures.
It takes polymer SMILES strings as input and returns 512-dimensional vectors, which can be used as polymer representations for downstream tasks such as property prediction.
## Resources
- Paper: [Contrastive representation learning for polymer informatics](https://doi.org/10.26434/chemrxiv.15003645/v1)
- Code: [GitHub repository](https://github.com/crema-lida/PoCo)
## Prerequisites
Install `sentence-transformers` (recommended), or `transformers` if you want
to work with the Hugging Face model directly. The command below installs both,
along with PyTorch:
```bash
pip install -U sentence-transformers transformers torch
```
## Usage
### Sentence Transformers (Recommended)
The easiest way to use PoCo is through `SentenceTransformer`. This interface
handles tokenization, padding, batching, pooling, device placement, and
conversion to NumPy arrays.
```python
from sentence_transformers import SentenceTransformer
model_id = "CremaX/PoCo"
model = SentenceTransformer(model_id)

polymer_smiles = [
    "[*]CC[*]",
    "[*]CC(c1ccccc1)[*]",
]

embeddings = model.encode(
    polymer_smiles,
    batch_size=64,
    convert_to_numpy=True,
    show_progress_bar=True,
)
print(embeddings.shape)
# (2, 512)
```
For a single polymer SMILES string:
```python
embedding = model.encode("[*]CC[*]", convert_to_numpy=True)
print(embedding.shape)
# (512,)
```
By default, embeddings are returned as raw (unnormalized) feature vectors. If
you plan to compare them with cosine similarity, pass
`normalize_embeddings=True` to get unit-length vectors:
```python
embeddings = model.encode(polymer_smiles, normalize_embeddings=True)
```
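The effect of normalization can be illustrated without loading the model: once each row has unit L2 norm, cosine similarity reduces to a plain dot product. The sketch below is a minimal illustration in NumPy. The random array is only a stand-in for PoCo embeddings, and the helper name `l2_normalize` is hypothetical, not part of the PoCo or Sentence Transformers API.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit L2 norm; eps guards against division by zero."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)


# Stand-in for the output of model.encode(...): 3 polymers x 512 features.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 512))

unit = l2_normalize(embeddings)

# For unit-length rows, the dot product equals the cosine similarity.
similarity = unit @ unit.T
print(similarity.shape)
```

Each diagonal entry of `similarity` is 1 (up to floating-point error), since every vector has cosine similarity 1 with itself.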
For downstream machine learning models, raw embeddings are often a good default:
```python
from sklearn.ensemble import RandomForestRegressor
from sentence_transformers import SentenceTransformer

# train_smiles / test_smiles: lists of polymer SMILES strings (user-supplied).
# y_train: target property values aligned with train_smiles (user-supplied).
model = SentenceTransformer("CremaX/PoCo")
X_train = model.encode(train_smiles, convert_to_numpy=True)
X_test = model.encode(test_smiles, convert_to_numpy=True)

regressor = RandomForestRegressor(random_state=0)
regressor.fit(X_train, y_train)
predictions = regressor.predict(X_test)
```
### Hugging Face Transformers
You can also use the model directly with `transformers`. This is useful when
you need full control over tokenization, tensors, devices, or pooling.
`AutoModel` returns token-level hidden states with shape
`(batch_size, sequence_length, hidden_size)`. To get one 512-dimensional vector
per polymer, apply attention-mask-aware mean pooling over the token dimension.
```python
import torch
from transformers import AutoModel, AutoTokenizer
model_id = "CremaX/PoCo"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).to(device)
model.eval()

polymer_smiles = [
    "[*]CC[*]",
    "[*]CC(c1ccccc1)[*]",
]

encoded = tokenizer(
    polymer_smiles,
    padding=True,
    truncation=True,
    return_tensors="pt",
)
encoded = {key: value.to(device) for key, value in encoded.items()}

# Run the forward pass without tracking gradients
with torch.no_grad():
    outputs = model(**encoded)

token_embeddings = outputs.last_hidden_state
attention_mask = encoded["attention_mask"].unsqueeze(-1).float()

# Mean pooling: sum over real (non-padding) tokens, then divide by their count
embeddings = (token_embeddings * attention_mask).sum(dim=1)
embeddings = embeddings / attention_mask.sum(dim=1).clamp(min=1e-9)
embeddings = embeddings.cpu().numpy()
print(embeddings.shape)
# (2, 512)
```
The Hugging Face pipeline returns token-level features.
For polymer-level embeddings, prefer the `SentenceTransformer` example above or
apply the mean pooling step shown in this section.
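
To see why the attention mask matters in the mean pooling step, the following sketch repeats the same arithmetic on toy NumPy arrays. All values are made up for illustration, and the hidden size is 2 rather than 512. A plain mean over the token dimension would include the padding position, while the masked mean ignores it.

```python
import numpy as np

# Toy batch: 2 sequences, 3 token positions, hidden size 2 (illustrative values).
token_embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],      # no padding
])
attention_mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

# Expand the mask to (batch, seq_len, 1) so it broadcasts over hidden_size.
mask = attention_mask[..., None].astype(float)

# Sum only the real tokens, then divide by the number of real tokens.
summed = (token_embeddings * mask).sum(axis=1)
counts = np.clip(mask.sum(axis=1), 1e-9, None)
pooled = summed / counts

# A naive mean also averages in the padding vector.
naive = token_embeddings.mean(axis=1)

print(pooled)  # rows: [2. 3.] and [2. 2.]
print(naive)   # first row is distorted by the padding position
```

The second sequence has no padding, so both approaches agree for it; only the padded sequence differs.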
## Input Notes
- Polymer SMILES **must** use `[*]` to mark repeat-unit endpoints, not bare `*`.
- The model does **not** validate whether a string is a chemically valid SMILES
string. We recommend canonicalizing polymer SMILES with the [`psmiles`](https://psmiles.readthedocs.io/) library before passing them to the model.
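
As a rough illustration of the first note, the sketch below rewrites bare `*` endpoints as `[*]` with a regular expression. The helper names `has_bare_star` and `bracket_endpoints` are hypothetical and not part of PoCo or `psmiles`. This is a string-level fix only: it does not parse or validate the SMILES, and it is not a substitute for canonicalization.

```python
import re

# Matches a "*" that is not already written inside the bracket atom "[*]".
_BARE_STAR = re.compile(r"(?<!\[)\*(?!\])")


def has_bare_star(smiles: str) -> bool:
    """Return True if the string contains a bare "*" endpoint marker."""
    return bool(_BARE_STAR.search(smiles))


def bracket_endpoints(smiles: str) -> str:
    """Rewrite bare "*" endpoint markers as "[*]"; leave "[*]" untouched."""
    return _BARE_STAR.sub("[*]", smiles)


print(bracket_endpoints("*CC*"))      # [*]CC[*]
print(bracket_endpoints("[*]CC[*]"))  # [*]CC[*]
```

Checking inputs with `has_bare_star` before encoding makes it easy to catch strings that use the bare `*` convention.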
## Citation
If you use PoCo, please cite:
```bibtex
@article{wang2026poco,
  title   = {Contrastive representation learning for polymer informatics},
  author  = {Wang, Lida and Long, Donghui},
  journal = {ChemRxiv},
  year    = {2026},
  doi     = {10.26434/chemrxiv.15003645/v1}
}
```