# PoCo

PoCo is a feature extractor for polymer structures.

It takes polymer SMILES strings as input and returns 512-dimensional vectors, which can be used as polymer representations for downstream tasks such as property prediction.

## Resources

- Paper: [Contrastive representation learning for polymer informatics](https://doi.org/10.26434/chemrxiv.15003645/v1)
- Code: [GitHub repository](https://github.com/crema-lida/PoCo)

## Prerequisites

Install `sentence-transformers` (recommended), or `transformers` alone if you
only need the lower-level Hugging Face interface. The command below installs
both, along with PyTorch:

```bash
pip install -U sentence-transformers transformers torch
```

## Usage

### Sentence Transformers (Recommended)

The easiest way to use PoCo is through `SentenceTransformer`. This interface
handles tokenization, padding, batching, pooling, device placement, and
conversion to NumPy arrays.

```python
from sentence_transformers import SentenceTransformer

model_id = "CremaX/PoCo"
model = SentenceTransformer(model_id)

polymer_smiles = [
    "[*]CC[*]",
    "[*]CC(c1ccccc1)[*]",
]

embeddings = model.encode(
    polymer_smiles,
    batch_size=64,
    convert_to_numpy=True,
    show_progress_bar=True,
)

print(embeddings.shape)
# (2, 512)
```

For a single polymer SMILES string:

```python
embedding = model.encode("[*]CC[*]", convert_to_numpy=True)

print(embedding.shape)
# (512,)
```

By default, embeddings are returned as raw, unnormalized feature vectors. If you
plan to compare polymers by cosine similarity, you can have `encode`
L2-normalize them:

```python
embeddings = model.encode(polymer_smiles, normalize_embeddings=True)
```
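
Once embeddings are unit-normalized, cosine similarity reduces to a dot product. The sketch below illustrates this with NumPy. The random array is only a stand-in for real model output, and `cosine_similarity_matrix` is a hypothetical helper name, not part of PoCo or `sentence-transformers`.

```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `embeddings`."""
    # Scale each row to unit length; the clip guards against zero vectors.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    # For unit vectors, the dot product equals the cosine similarity.
    return unit @ unit.T


# Stand-in for real PoCo embeddings: 3 random vectors of dimension 512.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(3, 512))

similarity = cosine_similarity_matrix(fake_embeddings)
print(similarity.shape)
# (3, 3)
```

If the embeddings were already produced with `normalize_embeddings=True`, the scaling step changes nothing and `embeddings @ embeddings.T` gives the same matrix.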

For downstream machine learning models, raw embeddings are often a good default:

```python
from sklearn.ensemble import RandomForestRegressor
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("CremaX/PoCo")

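# train_smiles / test_smiles: lists of polymer SMILES strings.
# y_train: the target property values for train_smiles.
# All three are placeholders for your own dataset.
X_train = model.encode(train_smiles, convert_to_numpy=True)
X_test = model.encode(test_smiles, convert_to_numpy=True)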

regressor = RandomForestRegressor(random_state=0)
regressor.fit(X_train, y_train)
predictions = regressor.predict(X_test)
```

### Hugging Face Transformers

You can also use the model directly with `transformers`. This is useful when
you need full control over tokenization, tensors, devices, or pooling.

`AutoModel` returns token-level hidden states with shape
`(batch_size, sequence_length, hidden_size)`. To get one 512-dimensional vector
per polymer, apply attention-mask-aware mean pooling over the token dimension.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "CremaX/PoCo"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).to(device)
model.eval()

polymer_smiles = [
    "[*]CC[*]",
    "[*]CC(c1ccccc1)[*]",
]

encoded = tokenizer(
    polymer_smiles,
    padding=True,
    truncation=True,
    return_tensors="pt",
)
encoded = {key: value.to(device) for key, value in encoded.items()}

with torch.no_grad():
    outputs = model(**encoded)

token_embeddings = outputs.last_hidden_state
attention_mask = encoded["attention_mask"].unsqueeze(-1).float()

# Mean pooling over real tokens only: the mask zeroes out padded positions.
embeddings = (token_embeddings * attention_mask).sum(dim=1)
embeddings = embeddings / attention_mask.sum(dim=1).clamp(min=1e-9)
embeddings = embeddings.cpu().numpy()

print(embeddings.shape)
# (2, 512)
```

Used on its own, the `transformers` model returns token-level features.
For polymer-level embeddings, prefer the `SentenceTransformer` example above or
apply the mean pooling step shown in this section.
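
To see why the pooling has to respect the attention mask, the following NumPy sketch applies the same arithmetic to a tiny made-up batch. The numbers and the hidden size of 2 are illustrative only and do not come from PoCo.

```python
import numpy as np

# Toy batch: 2 sequences, 3 token positions, hidden size 2.
token_embeddings = np.array(
    [
        [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
        [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],      # no padding
    ]
)
attention_mask = np.array([[1, 1, 0], [1, 1, 1]], dtype=float)

# Expand the mask to (batch, tokens, 1) so it broadcasts over the hidden axis.
mask = attention_mask[:, :, None]

# Sum only the real tokens, then divide by the number of real tokens.
summed = (token_embeddings * mask).sum(axis=1)
counts = np.clip(mask.sum(axis=1), 1e-9, None)
masked_mean = summed / counts  # rows: [2, 3] and [2, 2]

# A plain mean lets the padding vector leak into the first embedding.
naive_mean = token_embeddings.mean(axis=1)

print(masked_mean)
print(naive_mean)
```

The two results agree for the unpadded sequence and differ for the padded one, which is why a plain `mean` over the token dimension is not a safe substitute when a batch contains sequences of different lengths.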

## Input Notes

- Polymer SMILES **must** use `[*]` to mark repeat-unit endpoints, not bare `*`.
- The model does **not** validate whether a string is a chemically valid SMILES
  string. We recommend canonicalizing polymer SMILES with the [`psmiles`](https://psmiles.readthedocs.io/) library before passing them to the model.
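
The first note above can be enforced with a small pre-processing step. The sketch below rewrites bare `*` endpoints as `[*]` while leaving existing bracket atoms untouched. `bracket_endpoints` is a hypothetical helper, not part of PoCo or `psmiles`, and it performs no chemical validation or canonicalization.

```python
import re

# Match either a whole bracket atom (e.g. "[*]", "[NH3+]") or a bare "*".
_TOKEN = re.compile(r"\[[^\]]*\]|\*")


def bracket_endpoints(smiles: str) -> str:
    """Rewrite bare `*` as `[*]`, leaving bracket atoms unchanged."""
    def fix(match: re.Match) -> str:
        token = match.group(0)
        # Bracket atoms are returned as-is; only a bare "*" is rewritten.
        return token if token.startswith("[") else "[*]"

    return _TOKEN.sub(fix, smiles)


print(bracket_endpoints("*CC*"))
# [*]CC[*]
print(bracket_endpoints("[*]CC(c1ccccc1)[*]"))
# [*]CC(c1ccccc1)[*]
```

Because the pattern consumes each bracket atom as a single token, a `*` that already sits inside brackets is never wrapped a second time.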

## Citation

If you use PoCo, please cite:

```bibtex
@article{wang2026poco,
  title = {Contrastive representation learning for polymer informatics},
  author = {Wang, Lida and Long, Donghui},
  journal = {ChemRxiv},
  year = {2026},
  doi = {10.26434/chemrxiv.15003645/v1}
}
```