PoCo

PoCo is a feature extractor for polymer structures.

It takes polymer SMILES strings as input and returns 512-dimensional vectors, which can be used as polymer representations for downstream tasks such as property prediction.

Prerequisites

Install sentence-transformers (recommended), or transformers if you want to work with the model through Hugging Face directly. The command below installs both, along with PyTorch:

pip install -U sentence-transformers transformers torch

Usage

Sentence Transformers (Recommended)

The easiest way to use PoCo is through SentenceTransformer. This interface handles tokenization, padding, batching, pooling, device placement, and conversion to NumPy arrays.

from sentence_transformers import SentenceTransformer

model_id = "CremaX/PoCo"
model = SentenceTransformer(model_id)

polymer_smiles = [
    "[*]CC[*]",
    "[*]CC(c1ccccc1)[*]",
]

embeddings = model.encode(
    polymer_smiles,
    batch_size=64,
    convert_to_numpy=True,
    show_progress_bar=True,
)

print(embeddings.shape)
# (2, 512)

For a single polymer SMILES string:

embedding = model.encode("[*]CC[*]", convert_to_numpy=True)

print(embedding.shape)
# (512,)

By default, embeddings are returned as raw (unnormalized) feature vectors. If you plan to compare polymers by cosine similarity, you can L2-normalize them at encoding time:

embeddings = model.encode(polymer_smiles, normalize_embeddings=True)
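Once vectors have unit length, cosine similarity reduces to a plain dot product. The following is a minimal NumPy sketch of that relationship. It uses random vectors as stand-ins for PoCo embeddings so that it runs without downloading the model, and the helper names (l2_normalize, cosine_similarity_matrix) are illustrative, not part of any library.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm (eps guards against zero vectors)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of `embeddings`."""
    unit = l2_normalize(np.asarray(embeddings, dtype=np.float64))
    return unit @ unit.T


# Stand-in for the output of model.encode(...): three random 512-d vectors.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(3, 512))

sim = cosine_similarity_matrix(fake_embeddings)
print(sim.shape)
# (3, 3)
```

With real embeddings, passing normalize_embeddings=True to model.encode should make the explicit normalization step redundant, leaving only the matrix product.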

For downstream machine learning models, raw embeddings are often a good default. In the example below, train_smiles, test_smiles, and y_train are placeholders for your own data:

from sklearn.ensemble import RandomForestRegressor
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("CremaX/PoCo")

X_train = model.encode(train_smiles, convert_to_numpy=True)
X_test = model.encode(test_smiles, convert_to_numpy=True)

regressor = RandomForestRegressor(random_state=0)
regressor.fit(X_train, y_train)
predictions = regressor.predict(X_test)

Hugging Face Transformers

You can also use the model directly with transformers. This is useful when you need full control over tokenization, tensors, devices, or pooling.

AutoModel returns token-level hidden states with shape (batch_size, sequence_length, hidden_size). To get one 512-dimensional vector per polymer, apply attention-mask-aware mean pooling over the token dimension.

import torch
from transformers import AutoModel, AutoTokenizer

model_id = "CremaX/PoCo"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).to(device)
model.eval()

polymer_smiles = [
    "[*]CC[*]",
    "[*]CC(c1ccccc1)[*]",
]

encoded = tokenizer(
    polymer_smiles,
    padding=True,
    truncation=True,
    return_tensors="pt",
)
encoded = {key: value.to(device) for key, value in encoded.items()}

with torch.no_grad():
    outputs = model(**encoded)

token_embeddings = outputs.last_hidden_state
attention_mask = encoded["attention_mask"].unsqueeze(-1).float()

# Mean pooling: average the token vectors, ignoring padded positions
embeddings = (token_embeddings * attention_mask).sum(dim=1)
embeddings = embeddings / attention_mask.sum(dim=1).clamp(min=1e-9)
embeddings = embeddings.cpu().numpy()

print(embeddings.shape)
# (2, 512)

The Hugging Face feature-extraction pipeline also returns token-level features rather than one vector per polymer. For polymer-level embeddings, prefer the SentenceTransformer example above or apply the mean pooling step shown in this section.
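To see why the attention mask matters in the pooling step, it can help to run the same arithmetic on a toy array. The sketch below mirrors the PyTorch pooling code in NumPy with hand-written values. The function name masked_mean_pool and the tiny 2-dimensional "hidden states" are illustrative assumptions, not model outputs.

```python
import numpy as np


def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors over the sequence axis, skipping padding.

    token_embeddings: (batch, seq_len, hidden)
    attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],      # no padding
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(tokens, mask)
naive = tokens.mean(axis=1)  # unmasked mean, shown only for comparison

print(pooled.tolist())
# [[2.0, 3.0], [2.0, 2.0]]
```

The unmasked mean for the first row is pulled toward the padding values, which is why the mask is applied before averaging.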

Input Notes

  • Polymer SMILES must use [*] to mark repeat-unit endpoints, not bare *.
  • The model does not validate whether a string is a chemically valid SMILES string. We recommend canonicalizing polymer SMILES with the psmiles library before passing them to the model.
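Because the model accepts any string, a lightweight pre-check can catch the bare * problem before encoding. The sketch below uses only the standard library. The helper names are hypothetical, and the regular expression assumes that endpoints appear either as bare * or as [*]. It does not check chemical validity and is not a substitute for canonicalization with psmiles.

```python
import re

# Matches a "*" that is neither preceded by "[" nor followed by "]".
_BARE_STAR = re.compile(r"(?<!\[)\*(?!\])")


def has_bare_star(smiles: str) -> bool:
    """True if the string contains an endpoint written as bare *."""
    return _BARE_STAR.search(smiles) is not None


def bracket_bare_stars(smiles: str) -> str:
    """Rewrite every bare * as [*]; existing [*] markers are left alone."""
    return _BARE_STAR.sub("[*]", smiles)


def find_suspect_inputs(smiles_list):
    """Return strings that use bare * or contain no [*] marker at all."""
    return [s for s in smiles_list if has_bare_star(s) or "[*]" not in s]


candidates = ["[*]CC[*]", "*CC(c1ccccc1)*", "CC"]
print(find_suspect_inputs(candidates))
# ['*CC(c1ccccc1)*', 'CC']
print(bracket_bare_stars("*CC(c1ccccc1)*"))
# [*]CC(c1ccccc1)[*]
```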

Citation

If you use PoCo, please cite:

Wang, L.; Long, D. Contrastive representation learning for polymer informatics. ChemRxiv, 2026. https://doi.org/10.26434/chemrxiv.15003645/v1
