PoCo
PoCo is a feature extractor for polymer structures.
It takes polymer SMILES strings as input and returns 512-dimensional vectors, which can be used as polymer representations for downstream tasks such as property prediction.
Prerequisites
Install sentence-transformers (recommended) or, if you want to work with the
Hugging Face pipeline directly, transformers. The command below installs both, along with PyTorch:
pip install -U sentence-transformers transformers torch
Usage
Sentence Transformers (Recommended)
The easiest way to use PoCo is through SentenceTransformer. This interface
handles tokenization, padding, batching, pooling, device placement, and
conversion to NumPy arrays.
from sentence_transformers import SentenceTransformer
model_id = "CremaX/PoCo"
model = SentenceTransformer(model_id)
polymer_smiles = [
    "[*]CC[*]",
    "[*]CC(c1ccccc1)[*]",
]
embeddings = model.encode(
    polymer_smiles,
    batch_size=64,
    convert_to_numpy=True,
    show_progress_bar=True,
)
print(embeddings.shape)
# (2, 512)
For a single polymer SMILES string:
embedding = model.encode("[*]CC[*]", convert_to_numpy=True)
print(embedding.shape)
# (512,)
By default, embeddings are returned as raw feature vectors. If you plan to use cosine similarity directly, you may normalize them:
embeddings = model.encode(polymer_smiles, normalize_embeddings=True)
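To illustrate why normalization pairs well with cosine similarity: once every vector has unit length, a plain dot product between two vectors is their cosine similarity. The sketch below uses small hand-written vectors as stand-ins for real 512-dimensional PoCo embeddings, and the helper names `l2_normalize` and `cosine_similarity_matrix` are hypothetical, not part of any library.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit Euclidean length."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return vectors / np.clip(norms, 1e-12, None)


def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity, computed as dot products of unit-length rows."""
    unit = l2_normalize(vectors)
    return unit @ unit.T


# Toy stand-ins for model.encode(...) output (assumption: real vectors are 512-dimensional).
toy_embeddings = np.array([
    [3.0, 4.0],
    [6.0, 8.0],   # same direction as row 0, twice the length
    [-4.0, 3.0],  # orthogonal to row 0
])

similarity = cosine_similarity_matrix(toy_embeddings)
print(np.round(similarity, 3))
```

Rows 0 and 1 point in the same direction, so their similarity is 1 despite the different lengths, while row 2 is orthogonal to both and scores 0.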
For downstream machine learning models, raw embeddings are often a good default:
from sklearn.ensemble import RandomForestRegressor
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("CremaX/PoCo")
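# train_smiles / test_smiles: lists of polymer SMILES; y_train: target property values (your own data)
X_train = model.encode(train_smiles, convert_to_numpy=True)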
X_test = model.encode(test_smiles, convert_to_numpy=True)
regressor = RandomForestRegressor(random_state=0)
regressor.fit(X_train, y_train)
predictions = regressor.predict(X_test)
Hugging Face Transformers
You can also use the model directly with transformers. This is useful when
you need full control over tokenization, tensors, devices, or pooling.
AutoModel returns token-level hidden states with shape
(batch_size, sequence_length, hidden_size). To get one 512-dimensional vector
per polymer, apply attention-mask-aware mean pooling over the token dimension.
import torch
from transformers import AutoModel, AutoTokenizer
model_id = "CremaX/PoCo"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).to(device)
model.eval()
polymer_smiles = [
    "[*]CC[*]",
    "[*]CC(c1ccccc1)[*]",
]
encoded = tokenizer(
    polymer_smiles,
    padding=True,
    truncation=True,
    return_tensors="pt",
)
encoded = {key: value.to(device) for key, value in encoded.items()}
with torch.no_grad():
    outputs = model(**encoded)
token_embeddings = outputs.last_hidden_state
attention_mask = encoded["attention_mask"].unsqueeze(-1).float()
# Mean pooling: sum the vectors of real (non-padding) tokens, then divide by their count
embeddings = (token_embeddings * attention_mask).sum(dim=1)
embeddings = embeddings / attention_mask.sum(dim=1).clamp(min=1e-9)
embeddings = embeddings.cpu().numpy()
print(embeddings.shape)
# (2, 512)
The Hugging Face pipeline returns token-level features.
For polymer-level embeddings, prefer the SentenceTransformer example above or
apply the mean pooling step shown in this section.
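To see why the pooling should respect the attention mask, the following toy sketch compares masked mean pooling with a naive mean over all positions. It uses NumPy and made-up 2-dimensional token vectors rather than real model output, and `masked_mean_pool` is a hypothetical helper name that mirrors the PyTorch steps in this section.

```python
import numpy as np


def masked_mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sequence, ignoring padded positions.

    token_embeddings: (batch_size, sequence_length, hidden_size)
    attention_mask:   (batch_size, sequence_length), 1 = real token, 0 = padding
    """
    mask = attention_mask[..., np.newaxis].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Number of real tokens per sequence; clipped to avoid division by zero.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


# Toy batch: two sequences of length 3 with hidden size 2 (illustrative values only).
token_embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],      # no padding
])
attention_mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

pooled = masked_mean_pool(token_embeddings, attention_mask)
naive = token_embeddings.mean(axis=1)  # wrongly includes the padding vector

print(pooled)
print(naive)
```

For the first sequence the masked result averages only the two real tokens, whereas the naive mean is pulled far off by whatever values sit in the padded position.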
Input Notes
- Polymer SMILES must use [*] to mark repeat-unit endpoints, not bare *.
- The model does not validate whether a string is a chemically valid SMILES
string. We recommend canonicalizing polymer SMILES with the
psmiles library before passing them to the model.
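Because the model does not validate its input, a lightweight pre-check for the endpoint convention can catch one common mistake before encoding. The sketch below is only a string-level test for a bare * outside a [*] token; it does not check chemical validity or perform canonicalization, and the helper names `has_bare_wildcard` and `split_by_endpoint_style` are hypothetical.

```python
from typing import Iterable, List, Tuple


def has_bare_wildcard(polymer_smiles: str) -> bool:
    """Return True if a '*' appears outside a bracketed '[*]' token."""
    # Remove every well-formed '[*]'; any '*' left over is a bare wildcard.
    return "*" in polymer_smiles.replace("[*]", "")


def split_by_endpoint_style(smiles_list: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split inputs into strings that only use '[*]' and strings with a bare '*'."""
    accepted, rejected = [], []
    for smiles in smiles_list:
        (rejected if has_bare_wildcard(smiles) else accepted).append(smiles)
    return accepted, rejected


candidates = [
    "[*]CC[*]",            # bracketed endpoints
    "*CC*",                # bare endpoints
    "[*]CC(c1ccccc1)*",    # mixed: one bracketed, one bare
]
accepted, rejected = split_by_endpoint_style(candidates)
print("accepted:", accepted)
print("rejected:", rejected)
```

Strings in `rejected` would need their endpoints rewritten as [*] before being passed to the model.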
Citation
If you use PoCo, please cite:
Wang, L.; Long, D. Contrastive representation learning for polymer informatics. ChemRxiv, 2026. https://doi.org/10.26434/chemrxiv.15003645/v1