# PoCo

PoCo is a feature extractor for polymer structures. It takes polymer SMILES strings as input and returns 512-dimensional vectors, which can be used as polymer representations for downstream tasks such as property prediction.

## Resources

- Paper: [Contrastive representation learning for polymer informatics](https://doi.org/10.26434/chemrxiv.15003645/v1)
- Code: [GitHub repository](https://github.com/crema-lida/PoCo)

## Prerequisites

Install either `sentence-transformers` (recommended), or `transformers` if you want to work with the Hugging Face pipeline:

```bash
pip install -U sentence-transformers transformers torch
```

## Usage

### Sentence Transformers (Recommended)

The easiest way to use PoCo is through `SentenceTransformer`. This interface handles tokenization, padding, batching, pooling, device placement, and conversion to NumPy arrays.

```python
from sentence_transformers import SentenceTransformer

model_id = "CremaX/PoCo"
model = SentenceTransformer(model_id)

polymer_smiles = [
    "[*]CC[*]",
    "[*]CC(c1ccccc1)[*]",
]

embeddings = model.encode(
    polymer_smiles,
    batch_size=64,
    convert_to_numpy=True,
    show_progress_bar=True,
)

print(embeddings.shape)  # (2, 512)
```

For a single polymer SMILES string:

```python
embedding = model.encode("[*]CC[*]", convert_to_numpy=True)
print(embedding.shape)  # (512,)
```

By default, embeddings are returned as raw feature vectors.
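
Raw vectors are not guaranteed to have unit length, so a plain dot product between two of them mixes direction with magnitude. The sketch below illustrates the difference on toy 3-dimensional vectors (made-up numbers standing in for 512-dimensional embeddings). `l2_normalize` and `dot` are hypothetical helpers written for this illustration, not part of PoCo or `sentence-transformers`.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors (hypothetical helper)."""
    return sum(x * y for x, y in zip(a, b))


# Toy vectors, not real PoCo embeddings.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the length

raw_score = dot(a, b)  # grows with vector length
cosine_score = dot(l2_normalize(a), l2_normalize(b))  # depends on direction only

print(raw_score)               # 28.0
print(round(cosine_score, 6))  # 1.0
```

The two vectors point in the same direction, so their cosine similarity is 1 even though the raw dot product is 28.
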
If you plan to use cosine similarity directly, you may normalize them:

```python
embeddings = model.encode(polymer_smiles, normalize_embeddings=True)
```

For downstream machine learning models, raw embeddings are often a good default:

```python
from sklearn.ensemble import RandomForestRegressor
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("CremaX/PoCo")

# train_smiles, test_smiles, and y_train are placeholders for your own data.
X_train = model.encode(train_smiles, convert_to_numpy=True)
X_test = model.encode(test_smiles, convert_to_numpy=True)

regressor = RandomForestRegressor(random_state=0)
regressor.fit(X_train, y_train)

predictions = regressor.predict(X_test)
```

### Hugging Face Transformers

You can also use the model directly with `transformers`. This is useful when you need full control over tokenization, tensors, devices, or pooling.

`AutoModel` returns token-level hidden states with shape `(batch_size, sequence_length, hidden_size)`. To get one 512-dimensional vector per polymer, apply attention-mask-aware mean pooling over the token dimension.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "CremaX/PoCo"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).to(device)
model.eval()

polymer_smiles = [
    "[*]CC[*]",
    "[*]CC(c1ccccc1)[*]",
]

encoded = tokenizer(
    polymer_smiles,
    padding=True,
    truncation=True,
    return_tensors="pt",
)
encoded = {key: value.to(device) for key, value in encoded.items()}

with torch.no_grad():
    outputs = model(**encoded)

token_embeddings = outputs.last_hidden_state
attention_mask = encoded["attention_mask"].unsqueeze(-1).float()

# mean pooling: sum the unmasked token vectors, then divide by the token count
embeddings = (token_embeddings * attention_mask).sum(dim=1)
embeddings = embeddings / attention_mask.sum(dim=1).clamp(min=1e-9)
embeddings = embeddings.cpu().numpy()

print(embeddings.shape)  # (2, 512)
```

The Hugging Face pipeline returns token-level features.
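
To make the role of the attention mask concrete, here is a minimal sketch of the same pooling arithmetic in plain Python on a single toy sequence. The values are made up, and `masked_mean_pool` is a hypothetical helper for illustration only; for real inputs the tensor version shown in this section is the one to use.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, skipping padding."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                totals[i] += x
    count = max(count, 1)  # guard against division by zero for an all-padding row
    return [t / count for t in totals]


# One sequence: two real tokens followed by one padding position (toy 2-D values).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)
# Unmasked mean over every position, padding included, for comparison.
naive = [sum(column) / len(tokens) for column in zip(*tokens)]

print(pooled)  # [2.0, 3.0]
print(naive)   # skewed by the padding position
```

Only the two real tokens contribute to `pooled`, whereas the unmasked mean is pulled toward whatever value sits in the padding position.
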
For polymer-level embeddings, prefer the `SentenceTransformer` example above or apply the mean pooling step shown in this section.

## Input Notes

- Polymer SMILES **must** use `[*]` to mark repeat-unit endpoints, not bare `*`.
- The model does **not** validate whether a string is a chemically valid SMILES string. We recommend canonicalizing polymer SMILES with the [`psmiles`](https://psmiles.readthedocs.io/) library before passing them to the model.

## Citation

If you use PoCo, please cite:

```bibtex
@article{wang2026poco,
  title   = {Contrastive representation learning for polymer informatics},
  author  = {Wang, Lida and Long, Donghui},
  journal = {ChemRxiv},
  year    = {2026},
  doi     = {10.26434/chemrxiv.15003645/v1}
}
```