| # PoCo |
|
|
| PoCo is a feature extractor for polymer structures. |
|
|
| It takes polymer SMILES strings as input and returns 512-dimensional vectors, which can be used as polymer representations for downstream tasks such as property prediction. |
|
|
| ## Resources |
|
|
| - Paper: [Contrastive representation learning for polymer informatics](https://doi.org/10.26434/chemrxiv.15003645/v1) |
| - Code: [GitHub repository](https://github.com/crema-lida/PoCo) |
|
|
| ## Prerequisites |
|
|
| Install `sentence-transformers` (recommended), or `transformers` if you want to |
| work with Hugging Face Transformers directly. The command below installs both, along with PyTorch: |
|
|
| ```bash |
| pip install -U sentence-transformers transformers torch |
| ``` |
|
|
| ## Usage |
|
|
| ### Sentence Transformers (Recommended) |
|
|
| The easiest way to use PoCo is through `SentenceTransformer`. This interface |
| handles tokenization, padding, batching, pooling, device placement, and |
| conversion to NumPy arrays. |
|
|
| ```python |
| from sentence_transformers import SentenceTransformer |
| |
| model_id = "CremaX/PoCo" |
| model = SentenceTransformer(model_id) |
| |
| polymer_smiles = [ |
|     "[*]CC[*]", |
|     "[*]CC(c1ccccc1)[*]", |
| ] |
| |
| embeddings = model.encode( |
|     polymer_smiles, |
|     batch_size=64, |
|     convert_to_numpy=True, |
|     show_progress_bar=True, |
| ) |
| |
| print(embeddings.shape) |
| # (2, 512) |
| ``` |
|
|
| For a single polymer SMILES string: |
|
|
| ```python |
| embedding = model.encode("[*]CC[*]", convert_to_numpy=True) |
| |
| print(embedding.shape) |
| # (512,) |
| ``` |
|
|
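| By default, embeddings are returned unnormalized. If you plan to compare polymers |
| by cosine similarity, pass `normalize_embeddings=True` to scale each vector to unit length, so that a plain dot product equals the cosine similarity: |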
|
|
| ```python |
| embeddings = model.encode(polymer_smiles, normalize_embeddings=True) |
| ``` |
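To illustrate why unit-length vectors make cosine similarity a plain dot product, here is a minimal NumPy sketch. The array `toy` is a made-up stand-in for `model.encode(...)` output so the example runs without downloading the model, and `l2_normalize` is a hypothetical helper, not part of `sentence-transformers`.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm (hypothetical helper for illustration)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return vectors / np.clip(norms, 1e-12, None)


# Made-up 3-dimensional stand-ins for real 512-dimensional embeddings.
toy = np.array([[3.0, 4.0, 0.0], [6.0, 8.0, 0.0], [0.0, 0.0, 2.0]])

unit = l2_normalize(toy)
# For unit-length rows, the matrix of dot products is the cosine similarity matrix.
cosine = unit @ unit.T
```

With real embeddings, `model.encode(polymer_smiles, normalize_embeddings=True)` already returns unit-length rows, so `embeddings @ embeddings.T` gives the pairwise cosine similarities directly.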
|
|
| For downstream machine learning models, raw embeddings are often a good default: |
|
|
| ```python |
| from sklearn.ensemble import RandomForestRegressor |
| from sentence_transformers import SentenceTransformer |
| |
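| model = SentenceTransformer("CremaX/PoCo") |
| |
| # train_smiles / test_smiles: lists of polymer SMILES strings; |
| # y_train: the matching target property values. Supply your own data. |
| X_train = model.encode(train_smiles, convert_to_numpy=True) |
| X_test = model.encode(test_smiles, convert_to_numpy=True) |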
| |
| regressor = RandomForestRegressor(random_state=0) |
| regressor.fit(X_train, y_train) |
| predictions = regressor.predict(X_test) |
| ``` |
|
|
| ### Hugging Face Transformers |
|
|
| You can also use the model directly with `transformers`. This is useful when |
| you need full control over tokenization, tensors, devices, or pooling. |
|
|
| `AutoModel` returns token-level hidden states with shape |
| `(batch_size, sequence_length, hidden_size)`. To get one 512-dimensional vector |
| per polymer, apply attention-mask-aware mean pooling over the token dimension. |
|
|
| ```python |
| import torch |
| from transformers import AutoModel, AutoTokenizer |
| |
| model_id = "CremaX/PoCo" |
| device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
| |
| tokenizer = AutoTokenizer.from_pretrained(model_id) |
| model = AutoModel.from_pretrained(model_id).to(device) |
| model.eval() |
| |
| polymer_smiles = [ |
|     "[*]CC[*]", |
|     "[*]CC(c1ccccc1)[*]", |
| ] |
| |
| encoded = tokenizer( |
|     polymer_smiles, |
|     padding=True, |
|     truncation=True, |
|     return_tensors="pt", |
| ) |
| encoded = {key: value.to(device) for key, value in encoded.items()} |
| |
| with torch.no_grad(): |
|     outputs = model(**encoded) |
| |
| token_embeddings = outputs.last_hidden_state  # (batch, seq_len, hidden) |
| attention_mask = encoded["attention_mask"].unsqueeze(-1).float()  # (batch, seq_len, 1) |
| |
| # Mean pooling: sum the real (non-padding) token vectors, then divide by their count. |
| embeddings = (token_embeddings * attention_mask).sum(dim=1) |
| embeddings = embeddings / attention_mask.sum(dim=1).clamp(min=1e-9) |
| embeddings = embeddings.cpu().numpy() |
| |
| print(embeddings.shape) |
| # (2, 512) |
| ``` |
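The role of the attention mask in the pooling step can be checked on a toy tensor. The following is a NumPy sketch with made-up values, not PoCo output; it mirrors the arithmetic of the PyTorch code above and compares it with a plain mean that ignores the mask.

```python
import numpy as np

# Made-up token embeddings: batch of 2 sequences, 3 token positions, hidden size 2.
# The second sequence has only 2 real tokens; its last position is padding.
token_embeddings = np.array(
    [
        [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
        [[2.0, 2.0], [4.0, 4.0], [9.0, 9.0]],  # last position holds padding values
    ]
)
# Shape (2, 3, 1) so the mask broadcasts over the hidden dimension.
attention_mask = np.array([[1, 1, 1], [1, 1, 0]], dtype=float)[..., None]

# Sum only the real tokens, then divide by the number of real tokens.
summed = (token_embeddings * attention_mask).sum(axis=1)
counts = np.clip(attention_mask.sum(axis=1), 1e-9, None)
pooled = summed / counts

# A plain mean over the token axis lets padding leak into the result.
naive = token_embeddings.mean(axis=1)
```

For the padded sequence, `pooled` averages only the two real tokens, while `naive` is pulled toward the padding values. This is why the mask-aware version is used when sequences in a batch have different lengths.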
|
|
| The Hugging Face pipeline returns token-level features. |
| For polymer-level embeddings, prefer the `SentenceTransformer` example above or |
| apply the mean pooling step shown in this section. |
|
|
| ## Input Notes |
|
|
| - Polymer SMILES **must** use `[*]` to mark repeat-unit endpoints, not bare `*`. |
| - The model does **not** check whether a string is chemically valid SMILES. |
|   We recommend canonicalizing polymer SMILES with the [`psmiles`](https://psmiles.readthedocs.io/) library before passing them to the model. |
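Because the model does not validate its input, a light pre-check can catch the most common formatting slip before encoding. The sketch below uses only the Python standard library. `to_bracketed_endpoints` and `count_endpoints` are hypothetical helper names, not part of PoCo or `psmiles`, and the code does not check chemical validity.

```python
import re

# Matches a "*" that is neither preceded by "[" nor followed by "]".
_BARE_STAR = re.compile(r"(?<!\[)\*(?!\])")


def to_bracketed_endpoints(smiles: str) -> str:
    """Rewrite bare "*" endpoints as "[*]" (hypothetical helper)."""
    return _BARE_STAR.sub("[*]", smiles)


def count_endpoints(smiles: str) -> int:
    """Count "[*]" endpoint markers in a polymer SMILES string."""
    return smiles.count("[*]")


# Polystyrene repeat unit written with bare "*" endpoints.
cleaned = to_bracketed_endpoints("*CC(c1ccccc1)*")
```

Strings that already use `[*]` pass through unchanged. This only fixes endpoint notation; canonicalization and validity checks are still best left to a dedicated library such as `psmiles`.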
|
|
| ## Citation |
|
|
| If you use PoCo, please cite: |
|
|
| ```bibtex |
| @article{wang2026poco, |
| title = {Contrastive representation learning for polymer informatics}, |
| author = {Wang, Lida and Long, Donghui}, |
| journal = {ChemRxiv}, |
| year = {2026}, |
| doi = {10.26434/chemrxiv.15003645/v1} |
| } |
| ``` |
|
|